# play_tencent.ipynb

##### Junshen Kevin Chen

In this notebook we briefly explore the apps shared between the Tencent app store (腾讯应用宝) and the Google Play store (US region), looking for any noticeable patterns or distinctions between the two stores.

#### First load the data

In [72]:
import pandas as pd

# expand pandas display to show all columns
pd.set_option('display.max_columns', None)
# limit the width of each column so wide tables stay readable
pd.set_option('display.max_colwidth', 15)

In [2]:
tc_df = pd.read_csv('tencent/data.csv')
tc_df[:3]

Unnamed: 0,apkPublishTime,apkUrl,appAverageRating,appDownCount,appId,appName,appRating1,appRating2,appRating3,appRating4,appRating5,appRatingCount,appTags,authorId,authorName,averageRating,categoryId,categoryName,description,editorIntro,fileSize,flag,iconUrl,newFeature,numImages,numSnapshot,pkgName,versionCode,versionName
0,155126...,https:...,3.533419,771824...,10910,微信,120054,36772,62098,81567,242469,542960,,179,腾讯公司,3.533419,106,社交,1.可以发语...,,110041256,16533,https:...,本次更新: ...,5,5,com.te...,1400,7.0.3
1,155288...,https:...,4.631833,51811733,42270467,企业微信,85,17,22,23,1097,1244,,179,腾讯公司,4.631833,113,办公,企业微信，是...,,101092695,16533,https:...,2.7.6 ...,5,5,com.te...,8926,2.7.6
2,155287...,https:...,4.847615,41275691,12144239,微信读书,204,63,246,1078,15307,16898,,179,腾讯公司,4.847615,102,阅读,爱读书的人都...,好看小说、全...,26088353,16533,https:...,3.4.0 ...,5,5,com.te...,10135763,3.4.0


In [3]:
pl_df = pd.read_csv('play/data_essential.csv')
pl_df[:3]

Unnamed: 0,name,pkg,publisher,genres,content_ratings,is_editor_choice,price,inapp_details,has_video,num_screenshot,rating_overall,rating_5,rating_4,rating_3,rating_2,rating_1,rating_count,update_time,size,min_install,version,android_version,interactive_elements,in_app_products
0,WhatsA...,com.wh...,WhatsA...,Commun...,Everyone,False,0.0,,False,6,4.4,62258836,12291480,5485583,2118908,5344252,87499059,2019-0...,Varies...,100000...,Varies...,Varies...,Users ...,
1,Facebook,com.fa...,Facebook,Social,Teen,False,0.0,Contai...,False,5,4.1,52581861,12751783,7487328,3499060,10113178,86433210,2019-0...,Varies...,100000...,Varies...,Varies...,Users ...,$0.99 ...
2,Instagram,com.in...,Instagram,Social,Teen,True,0.0,Contai...,False,5,4.5,59316278,12425170,4262444,1448081,3356089,80808062,2019-0...,Varies...,100000...,Varies...,Varies...,Users ...,


#### Use the globally unique "android package name" to build a set of shared apps

In [4]:
# build the set of package names found in each app store
tc_pkg_set = set(tc_df['pkgName'])
print(f'There are {len(tc_pkg_set)} unique apps in the tencent set')
pl_pkg_set = set(pl_df['pkg'])
print(f'There are {len(pl_pkg_set)} unique apps in the play set')

print(f'The play set is {len(pl_pkg_set)/len(tc_pkg_set)} times as large as tencent')

common_pkg = tc_pkg_set.intersection(pl_pkg_set)
print(f'There are {len(common_pkg)} apps shared between the two')
print('Proportionally, it is:')
print(f'  {len(common_pkg)/len(tc_pkg_set)*100}% of the tencent set') 
print(f'  {len(common_pkg)/len(pl_pkg_set)*100}% of the google play set')

There are 76838 unique apps in the tencent set
There are 642744 unique apps in the play set
The play set is 8.364923605507691 times as large as tencent
There are 4671 apps shared between the two
Proportionally, it is:
  6.079023399880268% of the tencent set
  0.7267279041111235% of the google play set
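
As a cross-check, the same overlap could also be computed with an inner `merge` on the package name. The sketch below is only an illustration on tiny made-up frames: `tc_toy`, `pl_toy` and their rows are hypothetical, and only the column names `pkgName` and `pkg` come from the data above.

```python
import pandas as pd

# Toy stand-ins for tc_df and pl_df; the rows are invented for illustration.
tc_toy = pd.DataFrame({'pkgName': ['com.a', 'com.b', 'com.c'],
                       'appName': ['A', 'B', 'C']})
pl_toy = pd.DataFrame({'pkg': ['com.b', 'com.c', 'com.d', 'com.e'],
                       'name': ['B', 'C', 'D', 'E']})

# An inner join keeps only the packages present in both stores.
shared = tc_toy.merge(pl_toy, left_on='pkgName', right_on='pkg', how='inner')
shared_pkgs = set(shared['pkgName'])

print(sorted(shared_pkgs))
print(f'{len(shared_pkgs) / tc_toy["pkgName"].nunique():.1%} of the toy tencent set')
print(f'{len(shared_pkgs) / pl_toy["pkg"].nunique():.1%} of the toy play set')
```

A merge also brings the columns of both stores into one row per shared app, which can be handy for later side-by-side comparisons.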


#### Some helpers to use later

In [77]:
# build dicts mapping pkg name to row number, for the shared apps only
tc_pkgtorow = {
    pkg: row for row, pkg in enumerate(tc_df['pkgName'])
    if pkg in common_pkg
}
pl_pkgtorow = {
    pkg: row for row, pkg in enumerate(pl_df['pkg'])
    if pkg in common_pkg
}

def tc_attr(pkg, column):
    return tc_df[column][tc_pkgtorow[pkg]]

def pl_attr(pkg, column):
    return pl_df[column][pl_pkgtorow[pkg]]
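
An alternative to hand-built row dictionaries is to index a frame by package name and use `.at` for scalar lookups. The following is a minimal sketch on a made-up frame; `toy_df`, `toy_indexed` and `toy_attr` are hypothetical names. It drops duplicate package names and keeps the last one, which mirrors how a dict comprehension resolves repeated keys.

```python
import pandas as pd

# Toy frame with invented rows; only the column names follow the Tencent data.
toy_df = pd.DataFrame({
    'pkgName': ['com.a', 'com.b', 'com.c'],
    'appName': ['A', 'B', 'C'],
    'appDownCount': [300, 100, 200],
})

# Index by package name once, so each label maps to exactly one row.
toy_indexed = toy_df.drop_duplicates('pkgName', keep='last').set_index('pkgName')

def toy_attr(pkg, column):
    """Return one cell of the toy frame, looked up by package name."""
    return toy_indexed.at[pkg, column]

print(toy_attr('com.b', 'appName'))
print(toy_attr('com.c', 'appDownCount'))
```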

#### Genre distribution of common apps in both sets

#### Update time difference between common apps

#### Common apps sorted by Tencent DOWNLOAD COUNT, and their download count in Google Play

The Google Play data contains only a minimum install count (`min_install`), not an accurate download count, so we skip ranking the apps by Google's figure.

In [82]:
# build a list sorted by appDownCount in tencent set
common_pkg_by_tc_download_count = [
    (tc_attr(pkg, 'appDownCount'),            # app download count in tencent
     tc_attr(pkg, 'appName'),                 # name in tencent
     pl_attr(pkg, 'min_install'),             # min install count in play
     pl_attr(pkg, 'name'),                    # name in play
     pkg)                                     # android pkg
    for pkg in common_pkg
]
common_pkg_by_tc_download_count.sort(key=lambda rec: rec[0], reverse=True) # sort by tuple[0], app download count in tencent

In [83]:
pd.DataFrame.from_records(common_pkg_by_tc_download_count,
                          columns=['tencent_dl_count','tencent_name','play_min_dl_count','play_name','pkg'])

Unnamed: 0,tencent_dl_count,tencent_name,play_min_dl_count,play_name,pkg
0,8868016334,QQ,10000000,QQ,com.tencent...
1,7718249331,微信,100000000,WeChat,com.tencent.mm
2,4588279472,腾讯视频,1000000,騰訊視頻,com.tencent...
3,2757974400,腾讯新闻-事实派，真实...,1000000,腾讯新闻,com.tencent...
4,2744278138,手机淘宝,10000000,淘宝,com.taobao....
5,2669935989,百度,1000000,百度,com.baidu.s...
6,2362655772,WPS Office,100000000,WPS Office ...,cn.wps.moff...
7,2311254032,高德地图,1000000,高德地图,com.autonav...
8,2306195312,新浪微博,10000000,微博,com.sina.weibo
9,1841197623,百度地图,1000000,百度地图,com.baidu.B...
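
The same ranking could also be sketched without per-package lookups, by merging the two frames and sorting on the Tencent count. The frames below are toy data with invented values (`tc_toy`, `pl_toy` and `ranked` are hypothetical names); only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for tc_df and pl_df; all values are invented.
tc_toy = pd.DataFrame({
    'pkgName': ['com.a', 'com.b', 'com.c'],
    'appName': ['A-cn', 'B-cn', 'C-cn'],
    'appDownCount': [300, 900, 500],
})
pl_toy = pd.DataFrame({
    'pkg': ['com.b', 'com.c', 'com.d'],
    'name': ['B', 'C', 'D'],
    'min_install': [1000, 100000, 10],
})

ranked = (
    # keep only apps present in both stores
    tc_toy.merge(pl_toy, left_on='pkgName', right_on='pkg', how='inner')
    # highest Tencent download count first
    .sort_values('appDownCount', ascending=False)
    .rename(columns={
        'appDownCount': 'tencent_dl_count',
        'appName': 'tencent_name',
        'min_install': 'play_min_dl_count',
        'name': 'play_name',
    })
    [['tencent_dl_count', 'tencent_name', 'play_min_dl_count', 'play_name', 'pkg']]
    .reset_index(drop=True)
)
print(ranked)
```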
