# Data Analysis for Profitable App Profiles for the Apple Store and Google Play Markets

Our company builds Android and iOS mobile apps, and our objective is to examine the Apple App Store and Google Play markets to find out which types of apps attract the most users. Our main source of revenue is in-app ads, which is directly influenced by the number of users of an app, so market research into which types of apps attract more users is essential.

The main goal of this data analysis project is to investigate the current mobile app market and suggest a way to increase the revenue from in-app ads in our mobile apps.

### 1. Opening and Exploring Data

As of September 2018, there were approximately 2M iOS apps available on the App Store and 2.1M Android apps on Google Play. 

We will work with only a small sample of this data: roughly 7,200 App Store apps and 10,800 Google Play apps.

In [1]:
import csv

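# open each file in a context manager so it is closed after reading;
# utf8 is assumed here because the app names contain non-ASCII characters
with open('AppleStore.csv', encoding='utf8') as file:
    app_data = list(csv.reader(file))

with open('googleplaystore.csv', encoding='utf8') as file:
    google_data = list(csv.reader(file))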

In [2]:
def explore_data(dataset, start, end, printdata, rows_and_columns):
    dataset_slice = dataset[start:end]  
    if printdata:
        for row in dataset_slice:
            print(row)
            print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# explore the apple store data
explore_data(app_data, 0, 3, printdata=True, rows_and_columns=False)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




In [4]:
print("App Store Data:")
explore_data(app_data, 0, len(app_data), 
             printdata=False, rows_and_columns=True)

App Store Data:
Number of rows: 7198
Number of columns: 16


In [5]:
# explore the Google Play data
explore_data(google_data, 0, 3, printdata=True, rows_and_columns=False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




In [6]:
print("Google Data:")
explore_data(google_data, 0, len(google_data), 
             printdata=False, rows_and_columns=True)

Google Data:
Number of rows: 10842
Number of columns: 13


In [7]:
app_col = app_data[0]
google_col = google_data[0]

print(f"App Store Data features: \n {app_col}", '\n \n', 
      f"Google Play Data features: \n {google_col}")

App Store Data features: 
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 
 
 Google Play Data features: 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


**Google Play Store Apps Data (`'googleplaystore.csv'`) features:**

    *Dataset and information from Kaggle.com "Google Play Store Apps"*

'App': Application Name

'Category': Category the app belongs to

'Rating': Overall user rating of the app (at the time of data scraping)

'Reviews': Number of user reviews for the app (as when scraped)

'Size': Size of the app (as when scraped)

'Installs': Number of user downloads/installs for the app (as when scraped)

'Type': Paid/Free

'Price': Price of app (as when scraped)

'Content Rating': Age group the app is targeted at - Children / Mature 21+ / Adult

'Genres': An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.

'Last Updated': Date when the app was last updated on Play Store (as when scraped)

'Current Ver': Current version of the app available on Play Store (as when scraped)

'Android Ver': Min required Android version (as when scraped)

**App Store Apps Data (`'AppleStore.csv'`) features:**

    *Dataset and information from Kaggle.com "Mobile App Store (7200 apps)"*
    
"id" : App ID

"track_name": App Name

"size_bytes": Size (in Bytes)

"currency": Currency Type

"price": Price amount

"rating_count_tot": User Rating counts (for all version)

"rating_count_ver": User Rating counts (for current version)

"user_rating" : Average User Rating value (for all version)

"user_rating_ver": Average User Rating value (for current version)

"ver" : Latest version code

"cont_rating": Content Rating

"prime_genre": Primary Genre

"sup_devices.num": Number of supporting devices

"ipadSc_urls.num": Number of screenshots showed for display

"lang.num": Number of supported languages

"vpp_lic": Vpp Device Based Licensing Enabled

### 2. Cleaning Data

- we only want apps that are free to download
- our target audience is English-speaking

**Checking and cleaning google_data:**

In [8]:
# wrong data @i = 10473 for google_data: print rows 10471-10474 column by column
for j in range(len(google_data[0])):
    print(google_data[0][j], ":")
    for i in range(10471, 10475):
        print(f"{i}:", google_data[i][j])
    print('\n')

App :
10471: Jazz Wi-Fi
10472: Xposed Wi-Fi-Pwd
10473: Life Made WI-Fi Touchscreen Photo Frame
10474: osmino Wi-Fi: free WiFi


Category :
10471: COMMUNICATION
10472: PERSONALIZATION
10473: 1.9
10474: TOOLS


Rating :
10471: 3.4
10472: 3.5
10473: 19
10474: 4.2


Reviews :
10471: 49
10472: 1042
10473: 3.0M
10474: 134203


Size :
10471: 4.0M
10472: 404k
10473: 1,000+
10474: 4.1M


Installs :
10471: 10,000+
10472: 100,000+
10473: Free
10474: 10,000,000+


Type :
10471: Free
10472: Free
10473: 0
10474: Free


Price :
10471: 0
10472: 0
10473: Everyone
10474: 0


Content Rating :
10471: Everyone
10472: Everyone
10473: 
10474: Everyone


Genres :
10471: Communication
10472: Personalization
10473: February 11, 2018
10474: Tools


Last Updated :
10471: February 10, 2017
10472: August 5, 2014
10473: 1.0.19
10474: August 7, 2018


Current Ver :
10471: 0.1
10472: 3.0.0
10473: 4.0 and up
10474: 6.06.14


Android Ver :
10471: 2.3 and up
10472: 4.0.3 and up


IndexError: list index out of range

In [9]:
print(len(google_data[10472]))
print(len(google_data[10473]))
print(len(google_data[10474]))

13
12
13


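Here we notice that the row at index 10473 is missing one value, so its data is shifted.

The row has only 12 values instead of 13: the `Category` value is missing, and every remaining value is shifted one column to the left. This is also why the loop above raises an `IndexError` when it reaches the last column.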
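A more general check for this kind of problem is to compare every row's length with the header's. The following is a minimal sketch on a hypothetical three-column miniature of the dataset; `find_malformed_rows` and `sample` are illustrative names, not part of the notebook.

```python
# Hypothetical miniature dataset: a header row plus three data rows,
# the last of which is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

def find_malformed_rows(dataset):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

malformed = find_malformed_rows(sample)
print(malformed)  # [(3, 2)]
```

Run against the full `google_data` list, a check like this should flag any row that is too short or too long, without relying on an `IndexError` to reveal it.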

We want to delete this row for an accurate analysis:

In [10]:
del google_data[10473] # run only once; re-running would delete a different row

In [11]:
print("Google Data:")
explore_data(google_data, 0, len(google_data), 
             printdata=False, rows_and_columns=True)

Google Data:
Number of rows: 10841
Number of columns: 13


**Checking and cleaning App_data:**

In [12]:
# check for duplicates by app name first, then by App ID, which should be unique
duplicate = []
appname = []

for i in range(1, len(app_data)):
    row = app_data[i]
    if row[1] in appname:
        duplicate.append((row[1], i))
    else:
        appname.append(row[1])
print(duplicate)

duplicate = []
appID = []

for i in range(1, len(app_data)):
    row = app_data[i]
    if row[0] in appID:
        duplicate.append((row[0], i))
    else:
        appID.append(row[0])
print(duplicate)

[('Mannequin Challenge', 4464), ('VR Roller Coaster', 4832)]
[]


Checking by app name, we found two names, `"Mannequin Challenge"` and `"VR Roller Coaster"`, that appear more than once in the dataset.

However, checking by App ID shows no duplicates, so the repeated names need a closer look.

We use a dictionary to find which rows share a name and check whether they are in fact an error in the data:

In [13]:
appnamedict = {}
for i in range(1, len(app_data)):
    name = app_data[i][1]
    if name in appnamedict:
        appnamedict[name].append(i)
    else:
        appnamedict[name] = [i]

for name in appnamedict:
    if len(appnamedict[name]) > 1:
        print(name, appnamedict[name])

Mannequin Challenge [2949, 4464]
VR Roller Coaster [4443, 4832]


In [14]:
print(app_data[2949], '\n')
print(app_data[4464], '\n')
print(app_data[4443], '\n')
print(app_data[4832], '\n')

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1'] 

['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1'] 

['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1'] 

['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1'] 



Each pair of apps shares a name, but the remaining values (ID, size, rating counts, version) differ, so we can conclude that these are distinct apps and not errors in the dataset.

**Removing any non-English apps**

We can check the characters of the app name with the `ord()` function. 

There may be some English letters (or other characters in the ASCII range, 0 to 127) in the names of non-English apps, such as `"爱奇艺PPS -《欢乐颂2》电视剧热播"`. The opposite also occurs: English apps can contain emojis or other characters with a code point greater than 127, such as `"▻Sudoku"`. So we have to decide how to distinguish non-English apps.

We decided to create a function `notEnglish()` that takes a string `name` as an argument. The function iterates over the characters in the string, checks each character's `ord()` value (its code point), and calculates the proportion of non-ASCII characters in the name.

We set the threshold to 0.25, meaning that if more than a quarter of the characters in the name are non-ASCII, we consider it a non-English app.

In [15]:
def notEnglish(name):
    notenglish = False
    total = len(name)
    count = 0
    for character in name:
        if ord(character) > 127:
            count += 1
    if count/total > 0.25:
        notenglish = True
    return notenglish
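To see how the 0.25 threshold behaves, here is a small sketch that prints the non-ASCII share for three names taken from this notebook. The helper `non_ascii_share` is a hypothetical stand-in written for this illustration; it also guards against empty names, which `notEnglish()` does not.

```python
def non_ascii_share(name):
    """Fraction of characters in `name` whose code point is above 127."""
    if not name:          # guard against division by zero for empty names
        return 0.0
    return sum(ord(ch) > 127 for ch in name) / len(name)

THRESHOLD = 0.25  # same cut-off as notEnglish()

for name in ['Instagram', '▻Sudoku', '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    share = non_ascii_share(name)
    print(f"{name!r}: {share:.2f} -> non-English: {share > THRESHOLD}")
```

A single symbol in a short English name stays under the threshold, while a mostly Chinese name is well above it.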

App Data:

In [16]:
# app_data:
iosname = []
count = 0
for row in app_data[1:]:
    iosname.append(row[1])
    
nonenglishApp = []
for i in range(len(iosname)):
    name = iosname[i]
#     print(name)
    if notEnglish(name):
        nonenglishApp.append((i+1, name))

print("(i, non English App names):")
for i in range(10):
    print(nonenglishApp[i])

(i, non English App names):
(814, '爱奇艺PPS -《欢乐颂2》电视剧热播')
(1194, '聚力视频HD-人民的名义,跨界歌王全网热播')
(1428, '优酷视频')
(1519, '网易新闻 - 精选好内容，算出你的兴趣')
(1596, '淘宝 - 随时随地，想淘就淘')
(1604, '搜狐视频HD-欢乐颂2 全网首播')
(1649, '阴阳师-全区互通现世集结')
(1783, '百度贴吧-全球最大兴趣交友社区')
(1895, '百度网盘')
(1906, '爱奇艺HD -《欢乐颂2》电视剧热播')


In [17]:
nonenglishapp_i = []
for i in range(len(nonenglishApp)):
    nonenglishapp_i.append(nonenglishApp[i][0])

app_data2 = []
for i in range(len(app_data)):
    if i in nonenglishapp_i:
        continue
    app_data2.append(app_data[i])
len(app_data2)

6165

In [18]:
print(app_data[813][1])
print(app_data[814][1])
print(app_data[815][1])
print("\n")
print(app_data2[813][1])
print(app_data2[814][1])
print(app_data2[815][1])

BATTLE BEARS -1
爱奇艺PPS -《欢乐颂2》电视剧热播
Filterra – Photo Editor, Effects for Pictures


BATTLE BEARS -1
Filterra – Photo Editor, Effects for Pictures
Live.me – Live Video Chat & Make Friends Nearby


Google_Data:

In [19]:
googlename = []
count = 0
for row in google_data[1:]:
    googlename.append(row[0])
    
nonenglishG = []
for i in range(len(googlename)):
    name = googlename[i]
#     print(name)
    if notEnglish(name):
        nonenglishG.append((i+1, name))

# print("(i, non English App names):")
# for i in range(10):
#     print(nonenglishG[i])

In [20]:
nonenglishG_i = []
for i in range(len(nonenglishG)):
    nonenglishG_i.append(nonenglishG[i][0])
    
google_data2 = []
for i in range(len(google_data)):
    if i in nonenglishG_i:
        continue
    google_data2.append(google_data[i])

In [21]:
print(google_data[710][0])
print(google_data[711][0])
print(google_data[712][0])
print("\n")
print(google_data2[710][0])
print(google_data2[711][0])
print(google_data2[712][0])

English for beginners
Flame - درب عقلك يوميا
Mermaids


English for beginners
Mermaids
Learn Japanese, Korean, Chinese Offline & Free


In [22]:
print(f"Number of nonenglish apps in App_data (deleted data): {len(app_data) - len(app_data2)}")
print(f"Number of apps in App_data after deleting nonenglish apps: {len(app_data2) - 1}")
print(f"\nNumber of nonenglish apps in google_data (deleted data): {len(google_data) - len(google_data2)}")
print(f"Number of apps in google_data after deleting nonenglish apps: {len(google_data2) - 1}")


Number of nonenglish apps in App_data (deleted data): 1033
Number of apps in App_data after deleting nonenglish apps: 6164

Number of nonenglish apps in google_data (deleted data): 37
Number of apps in google_data after deleting nonenglish apps: 10803


**Removing apps that aren't free**

In [23]:
for i in range(20):
    print(app_data2[i][4])

price
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.99
0.0
0.0
0.0
0.0
0.0
0.99
6.99


In [24]:
google_data2[1]

['Photo Editor & Candy Camera & Grid & ScrapBook',
 'ART_AND_DESIGN',
 '4.1',
 '159',
 '19M',
 '10,000+',
 'Free',
 '0',
 'Everyone',
 'Art & Design',
 'January 7, 2018',
 '1.0.0',
 '4.0.3 and up']

In [25]:
# App Data remove paid apps:
appcount = 0
app_data3 = [app_data2[0]]
for row in app_data2[1:]:
    if float(row[4]) == 0:
        app_data3.append(row)
    else:
        appcount += 1
        continue

# Google Data remove paid apps:
googlecount = 0
google_data3 = [google_data2[0]]
for row in google_data2[1:]:
    if row[6] == 'Free':
        google_data3.append(row)
    else:
        googlecount += 1
        continue

print(len(app_data2), appcount, len(app_data3))
print(len(google_data2), googlecount, len(google_data3))


6165 2961 3204
10804 797 10007


**Removing Duplicate entries**

In [26]:
# we will use google_data3 
duplicate_g = []
unique_g = []

for app in google_data3[1:]:
    name = app[0]
    if name in unique_g:
        duplicate_g.append(name)
    else:
        unique_g.append(name)
        
print(f"Number of duplicate apps: {len(duplicate_g)}")
print(duplicate_g[:2])

Number of duplicate apps: 1135
['Quick PDF Scanner + OCR FREE', 'Box']


In [27]:
for i in range(len(google_data3)):
    name = google_data3[i][0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(i, google_data3[i], "\n")

223 ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'] 

230 ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'] 

284 ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'] 



In [28]:
def notHighestReviews(name, dataset):
    index = []
    for i in range(1, len(dataset)):
        app = dataset[i]
        if app[0] == name:
            index.append((int(app[3]),i))
#     print(index)
    index.remove(max(index))
    return index

In [29]:
notHighestReviews(duplicate_g[0], google_data3)

[(80805, 223), (80804, 284)]

In [30]:
google_data4 = [google_data3[0]]

# collect all indices to delete/skip
skipindex = []
for app in duplicate_g:
    indexlist = notHighestReviews(app, google_data3)
    for index in indexlist:
        skipindex.append(index[1])
t = set(skipindex)    # unique values
skipindex = list(t)

count = 0
for i in range(1, len(google_data3)):
    if i in skipindex:
        count += 1
        continue
    else:
        google_data4.append(google_data3[i])
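The same idea (keep only the entry with the most reviews for each app name) can also be sketched with a dictionary in a single pass. This is an alternative illustration, not the method used above: `keep_most_reviewed` is a hypothetical helper, the rows are shortened, and the `Box` review counts are made up. Unlike `notHighestReviews()`, which keeps the last row among ties, this sketch keeps the first.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the one with the highest review count.

    Ties are resolved in favour of the row seen first (a choice made for this sketch).
    """
    best = {}  # app name -> row with the highest review count so far
    for row in rows:
        name = row[name_col]
        if name not in best or int(row[reviews_col]) > int(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

# Hypothetical rows, shortened to the first four columns.
rows = [
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804'],
    ['Box', 'BUSINESS', '4.2', '159873'],
]
deduped = keep_most_reviewed(rows)
print(len(deduped))  # 2
```

Because dictionary lookups take constant time on average, this avoids rescanning the dataset for every duplicated name.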

In [31]:
# we will use app_data3 
duplicate_apps = []
unique_apps = []

for app in app_data3[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print(f"Number of duplicate apps: {len(duplicate_apps)}")
print(duplicate_apps[:2])

Number of duplicate apps: 0
[]


No duplicate entries were found in the App Store data, and the duplicates in the Google Play data have been removed, which concludes the data cleaning.

In [32]:
# Final App Data:
appheader = app_data3[0]
app_cleaned = app_data3[1:]

# Final Google Data:
googleheader = google_data4[0]
google_cleaned = google_data4[1:]

print(appheader, "\n", googleheader)
for i in range(5):
    print(app_cleaned[i])
print("\n")
for i in range(5):
    print(google_cleaned[i])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']
['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0


In [33]:
# google_cleaned

### 3. Exploring Data

The validation strategy is to build a minimal Android app and add it to Google Play. If it gets a good response from users, we develop it further. If it is profitable, we then develop an iOS version and add it to the App Store.

Because our goal is to find apps with the most users, we will start by analyzing the most common genres of mobile apps in the market. 

In [58]:
# appheader
# app_cleaned
# googleheader
# google_cleaned
# Google Data:
genresG = {}
for row in google_cleaned:
    if row[1] in genresG:
        genresG[row[1]] += 1
    else:
        genresG[row[1]] = 1
temp = sorted(genresG.items(), key=lambda x:x[1], reverse=True)
genres_google = dict(temp)
genres_google

{'FAMILY': 1680,
 'GAME': 861,
 'TOOLS': 750,
 'BUSINESS': 408,
 'LIFESTYLE': 347,
 'PRODUCTIVITY': 346,
 'FINANCE': 327,
 'MEDICAL': 313,
 'SPORTS': 301,
 'PERSONALIZATION': 295,
 'COMMUNICATION': 287,
 'HEALTH_AND_FITNESS': 273,
 'PHOTOGRAPHY': 261,
 'NEWS_AND_MAGAZINES': 250,
 'SOCIAL': 236,
 'TRAVEL_AND_LOCAL': 207,
 'SHOPPING': 199,
 'BOOKS_AND_REFERENCE': 192,
 'DATING': 165,
 'VIDEO_PLAYERS': 159,
 'MAPS_AND_NAVIGATION': 124,
 'FOOD_AND_DRINK': 109,
 'EDUCATION': 102,
 'ENTERTAINMENT': 84,
 'LIBRARIES_AND_DEMO': 83,
 'AUTO_AND_VEHICLES': 82,
 'HOUSE_AND_HOME': 73,
 'WEATHER': 71,
 'EVENTS': 63,
 'PARENTING': 58,
 'ART_AND_DESIGN': 57,
 'COMICS': 55,
 'BEAUTY': 53}

In [59]:
# appheader
# app_cleaned
# googleheader
# google_cleaned
# App Store Data:
genresA = {}
for row in app_cleaned:
    if row[11] in genresA:
        genresA[row[11]] += 1
    else:
        genresA[row[11]] = 1
temp = sorted(genresA.items(), key=lambda x:x[1], reverse=True)
genres_app = dict(temp)
genres_app
# genresA

{'Games': 1865,
 'Entertainment': 255,
 'Photo & Video': 161,
 'Education': 118,
 'Social Networking': 104,
 'Shopping': 82,
 'Utilities': 78,
 'Sports': 69,
 'Music': 65,
 'Health & Fitness': 65,
 'Productivity': 55,
 'Lifestyle': 50,
 'News': 43,
 'Travel': 38,
 'Finance': 35,
 'Weather': 28,
 'Food & Drink': 28,
 'Reference': 17,
 'Business': 17,
 'Book': 14,
 'Navigation': 6,
 'Medical': 6,
 'Catalogs': 4}

In [60]:
def freqPercent(freq_table, totalcount):
    newtable = {}
    for freq in freq_table:
        newtable[freq] = round(freq_table[freq]/totalcount, 4)
    return newtable
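As a quick illustration of what such a proportion table looks like, the sketch below builds one for a hypothetical list of six genre labels using `collections.Counter` from the standard library. It is an alternative way to get the same kind of table, not the notebook's own code.

```python
from collections import Counter

# Hypothetical genre column for six apps.
genres = ['Games', 'Games', 'Games', 'Music', 'Weather', 'Music']

counts = Counter(genres)                 # Games: 3, Music: 2, Weather: 1
total = sum(counts.values())
# most_common() yields (genre, count) pairs from most to least frequent
shares = {genre: round(n / total, 4) for genre, n in counts.most_common()}
print(shares)  # {'Games': 0.5, 'Music': 0.3333, 'Weather': 0.1667}
```

The values are proportions between 0 and 1, as in `freqPercent()`; multiply by 100 to read them as percentages.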

In [75]:
# google:
totalG = 0
for val in genres_google:
    totalG += genres_google[val]
genres_freq_google = freqPercent(genres_google, totalG)
# app:
totalA = 0
for val in genres_app:
    totalA += genres_app[val]
genres_freq_app = freqPercent(genres_app, totalA)

From the two ordered frequency tables above, we can see that the five most common genres in each store are:

`"Family"`
`"Game"`
`"Tools"`
`"Business"`
`"Lifestyle"`

for Google Play Store Apps and

`"Games"`
`"Entertainment"`
`"Photo & Video"`
`"Education"`
`"Social Networking"`

for App Store Apps. However, for Android apps, the biggest genre, `"Family"`, accounts for less than a fifth of all apps (about 19%), whereas `"Games"` accounts for more than half of all App Store apps (about 58%).

Since the biggest genre in the App Store is `"Games"` and the second-largest category on Google Play is `"Game"`, a first suggestion would be for the company to build a game app.

In [73]:
print("Google Genres Percent:")
genres_freq_google

Google Genres Percent:


{'FAMILY': 0.1894,
 'GAME': 0.0971,
 'TOOLS': 0.0845,
 'BUSINESS': 0.046,
 'LIFESTYLE': 0.0391,
 'PRODUCTIVITY': 0.039,
 'FINANCE': 0.0369,
 'MEDICAL': 0.0353,
 'SPORTS': 0.0339,
 'PERSONALIZATION': 0.0333,
 'COMMUNICATION': 0.0324,
 'HEALTH_AND_FITNESS': 0.0308,
 'PHOTOGRAPHY': 0.0294,
 'NEWS_AND_MAGAZINES': 0.0282,
 'SOCIAL': 0.0266,
 'TRAVEL_AND_LOCAL': 0.0233,
 'SHOPPING': 0.0224,
 'BOOKS_AND_REFERENCE': 0.0216,
 'DATING': 0.0186,
 'VIDEO_PLAYERS': 0.0179,
 'MAPS_AND_NAVIGATION': 0.014,
 'FOOD_AND_DRINK': 0.0123,
 'EDUCATION': 0.0115,
 'ENTERTAINMENT': 0.0095,
 'LIBRARIES_AND_DEMO': 0.0094,
 'AUTO_AND_VEHICLES': 0.0092,
 'HOUSE_AND_HOME': 0.0082,
 'WEATHER': 0.008,
 'EVENTS': 0.0071,
 'PARENTING': 0.0065,
 'ART_AND_DESIGN': 0.0064,
 'COMICS': 0.0062,
 'BEAUTY': 0.006}

In [74]:
print("App Genres Percent:")
genres_freq_app

App Genres Percent:


{'Games': 0.5823,
 'Entertainment': 0.0796,
 'Photo & Video': 0.0503,
 'Education': 0.0368,
 'Social Networking': 0.0325,
 'Shopping': 0.0256,
 'Utilities': 0.0244,
 'Sports': 0.0215,
 'Music': 0.0203,
 'Health & Fitness': 0.0203,
 'Productivity': 0.0172,
 'Lifestyle': 0.0156,
 'News': 0.0134,
 'Travel': 0.0119,
 'Finance': 0.0109,
 'Weather': 0.0087,
 'Food & Drink': 0.0087,
 'Reference': 0.0053,
 'Business': 0.0053,
 'Book': 0.0044,
 'Navigation': 0.0019,
 'Medical': 0.0019,
 'Catalogs': 0.0012}

The App Store dataset has a feature named `"rating_count_tot"`, which gives the total number of user ratings (reviews) an app has received. We use it as a proxy for how popular an app is.

For each genre in the App data, we will calculate the average number of reviews.

In [107]:
# genres_app
avgRatingCount = {}
for row in app_cleaned:
    if row[11] in avgRatingCount:
        avgRatingCount[row[11]] += float(row[5])
    else:
        avgRatingCount[row[11]] = float(row[5])
for genre in avgRatingCount:
    appcount = genres_app[genre]
    avgRatingCount[genre] = round(avgRatingCount[genre]/appcount, 3)
avgRatingCount = dict(sorted(avgRatingCount.items(), 
                             key=lambda x:x[1], reverse = True))
avgRatingCount

{'Navigation': 86090.333,
 'Reference': 79350.471,
 'Social Networking': 72916.548,
 'Music': 58205.031,
 'Weather': 52279.893,
 'Book': 39758.5,
 'Finance': 32367.029,
 'Food & Drink': 30953.464,
 'Travel': 29721.605,
 'Photo & Video': 28264.888,
 'Shopping': 27572.024,
 'Health & Fitness': 23298.015,
 'Sports': 23008.899,
 'Games': 22898.639,
 'Productivity': 21402.8,
 'News': 21248.023,
 'Utilities': 19408.987,
 'Lifestyle': 16815.48,
 'Entertainment': 13974.208,
 'Business': 7491.118,
 'Education': 7003.983,
 'Catalogs': 4004.0,
 'Medical': 612.0}
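The per-genre average can also be computed without relying on a previously built frequency table. The following is a minimal sketch under that assumption; `average_by_key` and the three sample rows are hypothetical.

```python
from collections import defaultdict

def average_by_key(rows, key_col, value_col):
    """Return {key: mean of value_col}, computed in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_col]] += float(row[value_col])
        counts[row[key_col]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Hypothetical rows holding only (genre, rating count).
rows = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_key(rows, key_col=0, value_col=1))
# {'Games': 200.0, 'Music': 50.0}
```

Note that an average like this can be dominated by a handful of very large apps, so a genre with few apps and one giant can rank high.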

For Google Play Store, instead of total number of reviews, we will use the number of `"Installs"` to calculate the average installs for the genre.

In [105]:
avgInstallCount = {}
for row in google_cleaned:
    n = row[5]
    n = n.replace('+','')
    n = n.replace(',','')
    if row[1] in avgInstallCount:
        avgInstallCount[row[1]] += float(n)
    else:
        avgInstallCount[row[1]] = float(n)
for genre in avgInstallCount:
    appcount = genres_google[genre]
    avgInstallCount[genre] = round(avgInstallCount[genre]/appcount,3)
avgInstallCount = dict(sorted(avgInstallCount.items(), 
                             key=lambda x:x[1], reverse = True))
print("Average Install Count:")
avgInstallCount

Average Install Count:


{'COMMUNICATION': 38456119.167,
 'VIDEO_PLAYERS': 24727872.453,
 'SOCIAL': 23253652.127,
 'PHOTOGRAPHY': 17840110.402,
 'PRODUCTIVITY': 16738957.555,
 'GAME': 15594505.749,
 'TRAVEL_AND_LOCAL': 13984077.71,
 'ENTERTAINMENT': 11719761.905,
 'TOOLS': 10801391.299,
 'NEWS_AND_MAGAZINES': 9472829.04,
 'BOOKS_AND_REFERENCE': 8676746.146,
 'SHOPPING': 7036877.312,
 'PERSONALIZATION': 5183850.807,
 'WEATHER': 5074486.197,
 'HEALTH_AND_FITNESS': 4188821.985,
 'MAPS_AND_NAVIGATION': 4057744.194,
 'FAMILY': 3696967.673,
 'SPORTS': 3638640.143,
 'ART_AND_DESIGN': 1986335.088,
 'FOOD_AND_DRINK': 1942556.431,
 'EDUCATION': 1841666.667,
 'BUSINESS': 1708215.907,
 'LIFESTYLE': 1433960.89,
 'FINANCE': 1361355.144,
 'HOUSE_AND_HOME': 1331540.562,
 'DATING': 854028.83,
 'COMICS': 817657.273,
 'AUTO_AND_VEHICLES': 647317.817,
 'LIBRARIES_AND_DEMO': 638503.735,
 'PARENTING': 542603.621,
 'BEAUTY': 513151.887,
 'EVENTS': 253542.222,
 'MEDICAL': 120550.62}
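The string clean-up used for the `Installs` column can be isolated in a small helper. This is a sketch with a hypothetical function name, `parse_installs`, applied to values that appear in the data shown earlier.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))

for raw in ['10,000+', '500,000+', '10,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Since a value such as `'10,000+'` is a bucket rather than an exact count, the parsed number is best read as a lower bound, and the averages above are approximate.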

Comparing the average rating count for the App Store and the average install count for Google Play with the earlier frequency tables, we can see that the number of apps in a genre says little about which genre is most popular with users in each market: the genres with the most apps are mostly not the genres with the highest averages.
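This mismatch can be checked directly by intersecting the top-five lists. The sketch below copies the top-five genres from the tables in this notebook; the variable names are chosen for this illustration only.

```python
# Top-five genres copied from the tables in this notebook.
google_by_count = ['FAMILY', 'GAME', 'TOOLS', 'BUSINESS', 'LIFESTYLE']
google_by_installs = ['COMMUNICATION', 'VIDEO_PLAYERS', 'SOCIAL',
                      'PHOTOGRAPHY', 'PRODUCTIVITY']
ios_by_count = ['Games', 'Entertainment', 'Photo & Video',
                'Education', 'Social Networking']
ios_by_ratings = ['Navigation', 'Reference', 'Social Networking',
                  'Music', 'Weather']

# Genres that are in the top five by app count AND by average popularity.
google_overlap = set(google_by_count) & set(google_by_installs)
ios_overlap = set(ios_by_count) & set(ios_by_ratings)

print(google_overlap)  # set()
print(ios_overlap)     # {'Social Networking'}
```

On Google Play the two top-five lists share no genre at all, and on the App Store they share only one.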

## Conclusion

**Google Play Store:**

For Google Play Store, we found that the five genres with the most apps were:

`'FAMILY'`
 `'GAME'`
 `'TOOLS'`
 `'BUSINESS'`
 `'LIFESTYLE'`.

However, after calculating the average install count for each genre, the five genres with the highest averages are:

`'COMMUNICATION'`: 38,456,119.167

 `'VIDEO_PLAYERS'`: 24,727,872.453
 
 `'SOCIAL'`: 23,253,652.127
 
 `'PHOTOGRAPHY'`: 17,840,110.402
 
 `'PRODUCTIVITY'`: 16,738,957.555.

**App Store:**

For App Store, we found that the five genres with the most apps were:

`'Games'`
 `'Entertainment'`
 `'Photo & Video'`
 `'Education'`
 `'Social Networking'`.
 
However, according to the calculated average number of reviews, the five genres with the highest averages are:

`'Navigation'`: 86090.333

 `'Reference'`: 79350.471

 `'Social Networking'`: 72916.548
 
 `'Music'`: 58205.031
 
 `'Weather'`: 52279.893.
 
After these calculations, we can conclude that the only genre that appears in the top five by average popularity in both markets is the **"Social Networking"** genre (`'SOCIAL'` on Google Play).