# Exploratory Data Analysis of Profitable Apps in the App Store and Google Play

The purpose of this project is to help developers understand what type of apps are likely to attract more users on Google Play and the App Store.

In [1]:
#Helper function for exploring the data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let us first open the datasets

In [2]:
from csv import reader

### Google Play Store dataset ###
# Use a context manager so the file is closed once it has been read
with open('./data/googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### Apple App Store dataset ###
with open('./data/AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

Let's check the Android data

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

Checking the columns of the Google Play Store data, I think the useful features that can be helpful for our goal are the following: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Android Ver']

Removing the following columns: ['Last Updated', 'Current Ver']

Now let's check the Apple data

In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [5]:
# For easier referencing later
def header_indices(header):
    i = 0
    header_dict = {}
    while i < len(header):
        header_dict[header[i]] = i
        i += 1
    return header_dict

In [6]:
an_hi = header_indices(android_header)
ios_hi = header_indices(ios_header)

In [7]:
# Function for checking data completeness
def checkFeaturesIfComplete(dataset, column_headers):
  incomplete_found = False
  column_len = len(column_headers)
  for index, row in enumerate(dataset):
    if len(row) != column_len:
        print(index)
        print(row)
        incomplete_found = True
  if not incomplete_found:
    print('All dataset entry complete')

In [8]:
# Check the row if it has all columns
# Android
checkFeaturesIfComplete(android, android_header)

10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [9]:
len(android[10472])

12

In [10]:
# For iOS
checkFeaturesIfComplete(ios, ios_header)

All dataset entry complete


One of the entries in the Android dataset has only 12 features instead of 13. Let's check which feature is missing, then decide whether it can be filled in or whether the entry should be deleted.

In [11]:
print(android_header)
print(android[0])
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


It is missing the 'Category' feature which I think can be filled.

In [12]:
# Let's try to get all existing categories
categories = {row[1] for row in android}
print(categories)

{'MEDICAL', 'ART_AND_DESIGN', 'HOUSE_AND_HOME', 'LIFESTYLE', 'COMICS', 'BUSINESS', 'PARENTING', 'BOOKS_AND_REFERENCE', 'LIBRARIES_AND_DEMO', 'TOOLS', 'VIDEO_PLAYERS', 'COMMUNICATION', 'SHOPPING', 'MAPS_AND_NAVIGATION', 'SOCIAL', 'TRAVEL_AND_LOCAL', 'ENTERTAINMENT', 'PRODUCTIVITY', 'WEATHER', 'DATING', '1.9', 'SPORTS', 'AUTO_AND_VEHICLES', 'BEAUTY', 'EVENTS', 'FINANCE', 'FAMILY', 'PHOTOGRAPHY', 'NEWS_AND_MAGAZINES', 'EDUCATION', 'PERSONALIZATION', 'GAME', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS'}


Based on the existing categories, it would fit the 'Photography' category.

In [13]:
android[10472].insert(1, 'PHOTOGRAPHY')
print(len(android[10472]))
print(android[10472])

13
['Life Made WI-Fi Touchscreen Photo Frame', 'PHOTOGRAPHY', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Let us proceed to check if there are duplicate entries in our data.

In [14]:
# We make a set of app names to be compared with the number of data entries
apps = {row[an_hi['App']] for row in android}
print(f'Total app data: {len(android)}')
print(f'Unique app names: {len(apps)}')
print(f'App names: {apps}')

Total app data: 10841
Unique app names: 9660
App names: {'Town of Princeton, BC', 'BOO! - Next Generation Messenger', 'Lottery Ticket Checker - Florida Results & Lotto', 'Codes for GTA San Andreas', 'Mobile Security & Antivirus', 'LiveMe - Video chat, new friends, and make money', 'V Made', 'Manga-FR - Anime Vostfr', 'FR: My Secret Pets! ', 'CA Mobile Authenticator', 'Flipp - Weekly Shopping', 'AMC Theatres', 'BF 3d Wallpapers', 'Goibibo - Flight Hotel Bus Car IRCTC Booking App', 'ET Telecom from Economic Times', 'EU VAT Checker', 'Ag Across America', 'Hondata Mobile', 'AC Remote Control Simulator', '¡Ay Caramba!', 'Guide (for X-MEN)', 'Tango - Live Video Broadcast', 'Free english course', 'Alopec - Online Shipping System', 'The FN "Baby" pistol explained', 'Doctor Kids', 'Cisco Webex Meetings', 'Ohio State Fair 4-H', 'Gravidez ao Vivo', 'FL Bankers', 'Project Fi by Google', 'TFO BZ Lessons', 'Flightradar24 Flight Tracker', 'File Manager', 'Safe365 – Cell Phone GPS Locator For Your Fam

In [15]:
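# For iOS: an_hi['App'] is 0, which is the 'id' column in the iOS data,
# so use ios_hi['id'] to make explicit that this set holds app IDs, not names
ios_apps = {row[ios_hi['id']] for row in ios}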
print(f'Total app data: {len(ios)}')
print(f'Unique app names: {len(ios_apps)}')
print(f'App names: {ios_apps}')

Total app data: 7197
Unique app names: 7197
App names: {'1168399577', '1146128499', '881342787', '1043640363', '1044950341', '584557117', '554602005', '539943615', '1096204046', '578665578', '1112382631', '427941017', '1156121311', '312220102', '1116880272', '481679745', '404299862', '1073002250', '916728593', '888683802', '600626116', '508558296', '953917544', '551798799', '950984120', '966038711', '299029654', '756869261', '1066612270', '1074470421', '858226685', '934510730', '411430426', '1068460848', '481033328', '1159035153', '1072425152', '522408559', '308368164', '1135442411', '1041465860', '1084860489', '557137623', '1052223765', '1052729607', '362348516', '1095336248', '988173374', '1145262669', '1123428617', '1051326718', '1011788068', '742625884', '959954514', '1043824696', '1000593025', '370899391', '643857704', '1033342465', '1079852672', '1100883805', '335047649', '1173688324', '909110675', '1130542083', '701598884', '789356890', '515651240', '1112156258', '848160327', '1

The set above was built from column 0 (`id`), so for the iOS apps there are no duplicate IDs.
For the Android apps, the results above show 1,181 redundant rows (10,841 entries minus 9,660 unique names), which means some apps occur more than once.
Let's take a look at the apps with duplicate entries.

In [16]:
app_histogram = {}
for row in android:
  name = row[0]
  if name in app_histogram:
    app_histogram[name] += 1
  else:
    app_histogram[name] = 1
print(app_histogram)

{'Photo Editor & Candy Camera & Grid & ScrapBook': 1, 'Coloring book moana': 2, 'U Launcher Lite – FREE Live Cool Themes, Hide Apps': 1, 'Sketch - Draw & Paint': 1, 'Pixel Draw - Number Art Coloring Book': 1, 'Paper flowers instructions': 1, 'Smoke Effect Photo Maker - Smoke Editor': 1, 'Infinite Painter': 1, 'Garden Coloring Book': 1, 'Kids Paint Free - Drawing Fun': 1, 'Text on Photo - Fonteee': 1, 'Name Art Photo Editor - Focus n Filters': 1, 'Tattoo Name On My Photo Editor': 1, 'Mandala Coloring Book': 1, '3D Color Pixel by Number - Sandbox Art Coloring': 1, 'Learn To Draw Kawaii Characters': 1, 'Photo Designer - Write your name with shapes': 1, '350 Diy Room Decor Ideas': 1, 'FlipaClip - Cartoon animation': 1, 'ibis Paint X': 1, 'Logo Maker - Small Business': 1, "Boys Photo Editor - Six Pack & Men's Suit": 1, 'Superheroes Wallpapers | 4K Backgrounds': 1, 'Mcqueen Coloring pages': 2, 'HD Mickey Minnie Wallpapers': 1, 'Harley Quinn wallpapers HD': 1, 'Colorfit - Drawing & Coloring':

In [17]:
for name, count in app_histogram.items():
  if count > 2:
    print(f'{name}, {count}')

Google My Business, 3
Box, 3
Quick PDF Scanner + OCR FREE, 3
Google Ads, 3
Slack, 3
QuickBooks Accounting: Invoicing & Expenses, 3
join.me - Simple Meetings, 3
Messenger – Text and Video Chat for Free, 3
WhatsApp Messenger, 3
Google Chrome: Fast & Secure, 3
Gmail, 3
Hangouts, 4
Viber Messenger, 5
Firefox Browser fast & private, 3
Yahoo Mail – Stay Organized, 3
imo free video calls and chat, 4
Opera Mini - fast web browser, 3
Opera Browser: Fast and Secure, 3
Firefox Focus: The privacy browser, 3
Google Voice, 3
WeChat, 4
UC Browser Mini -Tiny Fast Private & Secure, 3
Telegram, 3
Puffin Web Browser, 3
UC Browser - Fast Download Private & Secure, 3
free video calls and chat, 3
Skype - free IM & video calls, 3
Google Allo, 3
LINE: Free Calls & Messages, 3
KakaoTalk: Free Calls & Text, 3
OkCupid Dating, 3
Hily: Dating, Chat, Match, Meet & Hook up, 3
BBW Dating & Plus Size Chat, 3
Moco - Chat, Meet People, 3
Hot or Not - Find someone right now, 3
Just She - Top Lesbian Dating, 3
muzmatch: M

In [18]:
print(android_header)
for app in android:
  name = app[0]
  if name == 'WeChat':
    print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['WeChat', 'COMMUNICATION', '4.2', '5387333', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 31, 2018', 'Varies with device', 'Varies with device']
['WeChat', 'COMMUNICATION', '4.2', '5387446', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 31, 2018', 'Varies with device', 'Varies with device']
['WeChat', 'COMMUNICATION', '4.2', '5387446', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 31, 2018', 'Varies with device', 'Varies with device']
['WeChat', 'COMMUNICATION', '4.2', '5387631', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 31, 2018', 'Varies with device', 'Varies with device']


Inspecting the duplicate entries for WeChat, we see that they differ only in the 4th column, which holds the number of reviews. This suggests that the data was collected at different times.
We can use this as a criterion for removing duplicates: we keep only the entry with the highest review count, which is most likely the latest one.

In [19]:
import time
t1 = time.time()

# For the first part, we create a dictionary of apps with its highest review count
reviews_max = {}

for row in android:
  name = row[an_hi['App']]
  review_count = int(row[an_hi['Reviews']])
  if name not in reviews_max:
    reviews_max[name] = review_count
  else:
    reviews_max[name] = review_count if review_count > reviews_max[name] else reviews_max[name]

print(len(reviews_max))

android_clean = []
added_app = set()
for row in android:
  name = row[an_hi['App']]
  review_count = int(row[an_hi['Reviews']])
  if review_count == reviews_max[name] and name not in added_app:
    android_clean.append(row)
    added_app.add(name)

t2 = time.time()
print("Time taken: %.6f" %(t2 - t1))
print(f'Expected length: 9660, Actual: {len(android_clean)}')

9660
Time taken: 0.013507
Expected length: 9660, Actual: 9660
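
The same rule (keep the row with the highest review count for each app name) can also be expressed in a single pass. The following is a minimal sketch on made-up rows, not the notebook's own cell: `dedupe_by_max_reviews` and the sample rows are hypothetical names introduced for illustration, and it assumes the app name sits in column 0 and the review count in column 3.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count.

    Ties keep the first row seen, because the comparison is a strict '>'.
    """
    best = {}  # app name -> row with the highest review count so far
    for row in rows:
        name = row[name_idx]
        reviews = int(row[reviews_idx])
        if name not in best or reviews > int(best[name][reviews_idx]):
            best[name] = row
    # dicts preserve insertion order (Python 3.7+), so apps come out
    # in the order their names were first seen
    return list(best.values())


# Hypothetical sample rows shaped like the Android data:
# [App, Category, Rating, Reviews]
sample = [
    ['WeChat', 'COMMUNICATION', '4.2', '5387333'],
    ['Gmail', 'COMMUNICATION', '4.3', '100'],
    ['WeChat', 'COMMUNICATION', '4.2', '5387631'],
    ['WeChat', 'COMMUNICATION', '4.2', '5387446'],
]

deduped = dedupe_by_max_reviews(sample)
print(len(deduped))
print(deduped[0])
```

One difference from the two-pass cell above: this sketch orders the result by the first appearance of each name, whereas the two-pass version keeps each app at the position of its highest-review row.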


Let us now proceed with removing non-English apps

In [20]:
print(ios[813][1])
print(ios[6731][1])
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


Each character we use in a string has a corresponding number associated with it. The characters commonly used in an English text are all in the range of 0 to 127. With this information we can write a function to check if a character's numeric code lies within 0 to 127.
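
As a quick illustration of that mapping, the sketch below prints the code of a few characters, some of them taken from the app names shown above. Python's built-in `ord` returns the Unicode code point of a single character; the particular list of characters is just a sample chosen here.

```python
# ord() returns the Unicode code point of a single character
for ch in ['A', 'z', '5', '+', '™', '爱', '😜']:
    code = ord(ch)
    label = 'ASCII' if code <= 127 else 'non-ASCII'
    print(ch, code, label)
```

Letters, digits, and common punctuation fall at or below 127, while the trademark sign, Chinese characters, and emoji are all well above it.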

In [21]:
def isEnglish(name):
  #Use the ord function to get the corresponding number of a character
  for letter in name:
    if ord(letter) > 127:
      return False
  return True

Let's try it out with a few samples
- 'Instagram'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'
- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'

In [22]:
print(isEnglish('Instagram'))
print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish('Docs To Go™ Free Office Suite'))
print(isEnglish('Instachat 😜'))

True
False
False
False


The last two apps get rejected because of their special characters (the ™ sign and the emoji). Let's update the function to handle these cases.

In [23]:
def isEnglishV2(name):
  #Use the ord function to get the corresponding number of a character
  #Create a counter for non-English characters
  special_count = 0
  for letter in name:
    if ord(letter) > 127:
      special_count += 1
    # If the count of special characters exceeds 3, the name is considered non-English
    if special_count > 3:
      return False
  return True

In [24]:
print(isEnglishV2('Instagram'))
print(isEnglishV2('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglishV2('Docs To Go™ Free Office Suite'))
print(isEnglishV2('Instachat 😜'))

True
False
True
True
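
The threshold of three is a heuristic, so it is worth seeing where the boundary falls. The sketch below restates the same rule in a compact form; `count_non_ascii` and `is_probably_english` are hypothetical names used only for this illustration, and the emoji-only strings are made-up test inputs.

```python
def count_non_ascii(name):
    """Number of characters whose code point is above 127."""
    return sum(1 for ch in name if ord(ch) > 127)


def is_probably_english(name, max_non_ascii=3):
    """Same rule as isEnglishV2: allow up to `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(name) <= max_non_ascii


samples = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '😜😜😜',      # exactly three non-ASCII characters: still accepted
    '😜😜😜😜',    # four non-ASCII characters: rejected
    '爱奇艺',      # a three-character non-English name slips through
]
for name in samples:
    print(name, count_non_ascii(name), is_probably_english(name))
```

The last sample shows the limitation of the rule: very short non-English names contain too few non-ASCII characters to be caught, so the filter is approximate rather than exact.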


The function seems to work well. Let's try it on the datasets and see how many apps get filtered out.

In [25]:
# Android apps
print(f'Original Android data count: {len(android_clean)}')

android_clean_eng = []
for row in android_clean:
  name = row[an_hi['App']]
  if isEnglishV2(name):
    android_clean_eng.append(row)

# iOS apps
print(f'Original iOS data count: {len(ios)}')

ios_clean_eng = []
for row in ios:
  name = row[1]
  if isEnglishV2(name):
    ios_clean_eng.append(row)

print(f'Remaining Android app: {len(android_clean_eng)}')
print(f'Remaining iOS app: {len(ios_clean_eng)}')

Original Android data count: 9660
Original iOS data count: 7197
Remaining Android app: 9615
Remaining iOS app: 6183


The number of filtered-out apps looks acceptable, and we still have plenty of data.

We will now proceed to select only the free apps.

In [26]:
print(an_hi)
print(ios_hi)

{'App': 0, 'Category': 1, 'Rating': 2, 'Reviews': 3, 'Size': 4, 'Installs': 5, 'Type': 6, 'Price': 7, 'Content Rating': 8, 'Genres': 9, 'Last Updated': 10, 'Current Ver': 11, 'Android Ver': 12}
{'id': 0, 'track_name': 1, 'size_bytes': 2, 'currency': 3, 'price': 4, 'rating_count_tot': 5, 'rating_count_ver': 6, 'user_rating': 7, 'user_rating_ver': 8, 'ver': 9, 'cont_rating': 10, 'prime_genre': 11, 'sup_devices.num': 12, 'ipadSc_urls.num': 13, 'lang.num': 14, 'vpp_lic': 15}


In [27]:
# Lets check for apps with price listed as 'Free' or '0.0'
android_clean_eng_free = []
for row in android_clean_eng:
  price = row[an_hi['Price']]
  if price.lower() == 'free' or price == '0' or price == '0.0':
    android_clean_eng_free.append(row)

In [28]:
print(f'Android Clean Eng: {len(android_clean_eng)}')
print(f'Android Clean Eng Free: {len(android_clean_eng_free)}')
print(f'Difference: {len(android_clean_eng)-len(android_clean_eng_free)}')

Android Clean Eng: 9615
Android Clean Eng Free: 8865
Difference: 750


750 apps were removed from the Android dataset.

In [29]:
# Lets check for apps for iOS with price listed as 'Free' or '0.0'
ios_clean_eng_free = []
for row in ios_clean_eng:
  price = row[ios_hi['price']]
  if price.lower() == 'free' or price == '0' or price == '0.0':
    ios_clean_eng_free.append(row)

In [30]:
print(f'iOS Clean Eng: {len(ios_clean_eng)}')
print(f'iOS Clean Eng Free: {len(ios_clean_eng_free)}')
print(f'Difference: {len(ios_clean_eng)-len(ios_clean_eng_free)}')

iOS Clean Eng: 6183
iOS Clean Eng Free: 3222
Difference: 2961


Almost half of the data for iOS was lost. Let's check further to confirm.

In [31]:
free_count = 0
for row in ios_clean_eng:
  price = row[ios_hi['price']]
  if price.lower() == 'free' or price == '0' or price == '0.0':
    free_count+=1
print(free_count)

3222


Inspecting the dataset confirms that a large share of the iOS apps (2,961 of 6,183, nearly half) are not free, so I think we can proceed with what we have left.

## Analysis
Now that our data is clean, we can proceed with the analysis. Our goal is to identify app profiles that are successful in both markets, because the number of people using our apps affects our revenue.
<br>
To minimize risks and overhead, our validation strategy for an app idea has three steps:
1. Build a minimal Android version of the app, and add it to Google Play
2. If the app has a good response from users, we develop it further
3. If the app is profitable after 6 months, we build an iOS version of the app and add it to the App Store
<br>
> The reason for publishing on Android first is that it is cheaper to publish there.


We will first select the columns that are useful for generating a frequency table. In our data, these are `Genres` or `Category` for the Android dataset and `prime_genre` for the iOS dataset.

In [32]:
print(ios_header)
print('\n')
explore_data(ios_clean_eng_free, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 3222
Number of columns: 16


In [33]:
print(android_header)
print('\n')
explore_data(android_clean_eng_free, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Every

In [34]:
# Create a frequency table
def freq_table(dataset, column_index):
  frequency_table = {}
  total = 0
  for row in dataset:
    total += 1
    key = row[column_index]
    if key in frequency_table:
      frequency_table[key] += 1
    else:
      frequency_table[key] = 1
  
  table_percentages = {}
  for key in frequency_table:
    percentage = (frequency_table[key]/total) * 100
    table_percentages[key] = percentage

  return table_percentages

def display_table(dataset, column_index):
  table = freq_table(dataset, column_index)
  table_display = []
  for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

  table_sorted = sorted(table_display, reverse=True)
  for entry in table_sorted:
    print(f'{entry[1]}: {entry[0]}')
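
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is a minimal sketch rather than a replacement for the cell above: `freq_table_counter` and the sample rows are hypothetical names introduced here.

```python
from collections import Counter


def freq_table_counter(dataset, column_index):
    """Return {value: percentage} for one column, highest percentage first."""
    counts = Counter(row[column_index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by count, descending
    return {key: count / total * 100 for key, count in counts.most_common()}


# Hypothetical rows: [name, genre]
sample = [
    ['a', 'Games'],
    ['b', 'Games'],
    ['c', 'Games'],
    ['d', 'Music'],
]
table = freq_table_counter(sample, 1)
for genre, pct in table.items():
    print(f'{genre}: {pct:.2f}')
```

The ordering of ties can differ from `display_table`: sorting `(percentage, key)` tuples in reverse breaks ties by key, while `most_common()` keeps tied values in first-seen order.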

In [35]:
android_final = android_clean_eng_free
ios_final = ios_clean_eng_free

In [36]:
display_table(android_final, an_hi['Genres'])

Tools: 8.44895657078398
Entertainment: 6.068809926677947
Education: 5.346869712351946
Business: 4.591088550479413
Productivity: 3.8917089678511
Lifestyle: 3.8917089678511
Finance: 3.699943598420756
Medical: 3.5307388606880994
Sports: 3.4630569655950363
Personalization: 3.3164128595600673
Communication: 3.2374506486181613
Action: 3.102086858432036
Health & Fitness: 3.0795262267343486
Photography: 2.9441624365482233
News & Magazines: 2.7975183305132543
Social: 2.662154540327129
Travel & Local: 2.323745064861816
Shopping: 2.2447828539199097
Books & Reference: 2.143260011280316
Simulation: 2.0417371686407217
Dating: 1.8612521150592216
Arcade: 1.849971799210378
Video Players & Editors: 1.7710095882684715
Casual: 1.7597292724196276
Maps & Navigation: 1.3987591652566271
Food & Drink: 1.2408347433728144
Puzzle: 1.1280315848843767
Racing: 0.9926677946982515
Role Playing: 0.9362662154540328
Libraries & Demo: 0.9362662154540328
Auto & Vehicles: 0.924985899605189
Strategy: 0.9137055837563453
House

In the Android dataset, the most common genre is Tools (about 8.4%), followed by Entertainment (about 6.1%). The table is noticeably granular, and the differences between neighboring genres are small.

In [37]:
display_table(android_final, an_hi['Category'])

FAMILY: 18.905809362662154
GAME: 9.723632261703328
TOOLS: 8.460236886632826
BUSINESS: 4.591088550479413
LIFESTYLE: 3.902989283699944
PRODUCTIVITY: 3.8917089678511
FINANCE: 3.699943598420756
MEDICAL: 3.5307388606880994
SPORTS: 3.395375070501974
PERSONALIZATION: 3.3164128595600673
COMMUNICATION: 3.2374506486181613
HEALTH_AND_FITNESS: 3.0795262267343486
PHOTOGRAPHY: 2.955442752397067
NEWS_AND_MAGAZINES: 2.7975183305132543
SOCIAL: 2.662154540327129
TRAVEL_AND_LOCAL: 2.33502538071066
SHOPPING: 2.2447828539199097
BOOKS_AND_REFERENCE: 2.143260011280316
DATING: 1.8612521150592216
VIDEO_PLAYERS: 1.793570219966159
MAPS_AND_NAVIGATION: 1.3987591652566271
FOOD_AND_DRINK: 1.2408347433728144
EDUCATION: 1.161872532430908
ENTERTAINMENT: 0.9588268471517203
LIBRARIES_AND_DEMO: 0.9362662154540328
AUTO_AND_VEHICLES: 0.924985899605189
HOUSE_AND_HOME: 0.8234630569655951
WEATHER: 0.8009024252679076
EVENTS: 0.7106598984771574
PARENTING: 0.6542583192329385
ART_AND_DESIGN: 0.6429780033840948
COMICS: 0.620417371

The most common category in the Android dataset is 'FAMILY', which accounts for almost 19% of the apps. It is followed by GAME (about 9.7%), TOOLS (about 8.5%), and BUSINESS (about 4.6%).
Based on this and the genre table, we can see that practical apps have a strong presence on Google Play.

In [38]:
display_table(ios_final, ios_hi['prime_genre'])

Games: 58.16263190564867
Entertainment: 7.883302296710118
Photo & Video: 4.9658597144630665
Education: 3.662321539416512
Social Networking: 3.2898820608317814
Shopping: 2.60707635009311
Utilities: 2.5139664804469275
Sports: 2.1415270018621975
Music: 2.0484171322160147
Health & Fitness: 2.0173805090006205
Productivity: 1.7380509000620732
Lifestyle: 1.5828677839851024
News: 1.3345747982619491
Travel: 1.2414649286157666
Finance: 1.1173184357541899
Weather: 0.8690254500310366
Food & Drink: 0.8069522036002483
Reference: 0.5586592178770949
Business: 0.5276225946617008
Book: 0.4345127250155183
Navigation: 0.186219739292365
Medical: 0.186219739292365
Catalogs: 0.12414649286157665


For the iOS dataset, more than half (58.16%) of the free English apps are games. Entertainment apps account for close to 8%, followed by photo and video apps at close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our dataset.
<br>
The general impression is that the App Store (at least its free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, etc.), while practical apps are rarer.

Up to this point, we have found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of which kinds of apps have the most users.

## Most Popular Apps by Genre on the App Store

Since the iOS dataset has no install counts, we will use the total number of user ratings (`rating_count_tot`) as a proxy for popularity and compute its average for each genre.

In [42]:
unique_genres = freq_table(ios_final, ios_hi['prime_genre'])
max_avg = (0, '')
for genre in unique_genres:
  total = 0
  len_genre = 0
  for app in ios_final:
    genre_app = app[ios_hi['prime_genre']]
    if genre_app == genre:
      ratings_count = float(app[ios_hi['rating_count_tot']])
      total += ratings_count
      len_genre += 1
  avg_n_ratings = total / len_genre
  print(f'{genre}: {avg_n_ratings}')
  if max_avg[0] < avg_n_ratings:
    max_avg = (avg_n_ratings, genre)
print(f'Highest Average reviews: {max_avg}')


Social Networking: 71548.34905660378
Photo & Video: 28441.54375
Games: 22788.6696905016
Music: 57326.530303030304
Reference: 74942.11111111111
Health & Fitness: 23298.015384615384
Weather: 52279.892857142855
Utilities: 18684.456790123455
Travel: 28243.8
Shopping: 26919.690476190477
News: 21248.023255813954
Navigation: 86090.33333333333
Lifestyle: 16485.764705882353
Entertainment: 14029.830708661417
Food & Drink: 33333.92307692308
Sports: 23008.898550724636
Book: 39758.5
Finance: 31467.944444444445
Education: 7003.983050847458
Productivity: 21028.410714285714
Business: 7491.117647058823
Catalogs: 4004.0
Medical: 612.0
Highest Average reviews: (86090.33333333333, 'Navigation')
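
The cell above scans the whole dataset once per genre. The per-genre average can also be computed in a single pass by accumulating totals and counts together. The following is a minimal sketch on made-up rows: `avg_by_group` and the sample data are hypothetical and only illustrate the technique.

```python
from collections import defaultdict


def avg_by_group(rows, group_idx, value_idx):
    """Average of a numeric column per group, computed in one pass."""
    totals = defaultdict(float)  # group -> running sum of values
    counts = defaultdict(int)    # group -> number of rows seen
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows: [track_name, prime_genre, rating_count_tot]
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]
averages = avg_by_group(sample, group_idx=1, value_idx=2)
# Print genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f'{genre}: {avg}')
```

With the notebook's data, the equivalent call would presumably be `avg_by_group(ios_final, ios_hi['prime_genre'], ios_hi['rating_count_tot'])`.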


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user reviews:

In [44]:
for app in ios_final:
  if app[ios_hi['prime_genre']] == 'Navigation':
    print(f'{app[1]}:{app[ios_hi["rating_count_tot"]]}')

Waze - GPS Navigation, Maps & Real-time Traffic:345046
Google Maps - Navigation & Transit:154911
Geocaching®:12811
CoPilot GPS – Car Navigation & Offline Maps:3582
ImmobilienScout24: Real Estate Search in Germany:187
Railway Route Search:5


The same pattern applies to social networking apps, where the average is heavily influenced by a few giants like Facebook and Pinterest. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average.
<br>
Another genre with a high average is Reference.

In [45]:
for app in ios_final:
  if app[ios_hi['prime_genre']] == 'Reference':
    print(f'{app[1]}:{app[ios_hi["rating_count_tot"]]}')

Bible:985920
Dictionary.com Dictionary & Thesaurus:200047
Dictionary.com Dictionary & Thesaurus for iPad:54175
Google Translate:26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran:18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition:17588
Merriam-Webster Dictionary:16849
Night Sky:12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE):8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools:4693
GUNS MODS for Minecraft PC Edition - Mods Tools:1497
Guides for Pokémon GO - Pokemon GO News and Cheats:826
WWDC:762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free:718
VPN Express:14
Real Bike Traffic Rider Virtual Reality Glasses:8
教えて!goo:0
Jishokun-Japanese English Dictionary & Translator:0


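This may present a good opportunity, since the Reference genre is less saturated than Entertainment, where competition is greater. Note, however, that the Reference average is also lifted by a few large apps such as Bible and Dictionary.com.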

## Most Popular Apps by Genre on Google Play
For the Google Play market, we have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers are not precise.

In [46]:
display_table(android_final, an_hi['Installs'])

1,000,000+: 15.724760293288211
100,000+: 11.551043429216017
10,000,000+: 10.547095318668923
10,000+: 10.197405527354766
1,000+: 8.403835307388608
100+: 6.91483361534123
5,000,000+: 6.824591088550479
500,000+: 5.561195713479977
50,000+: 4.771573604060913
5,000+: 4.512126339537507
10+: 3.542019176536943
500+: 3.248730964467005
50,000,000+: 2.3011844331641287
100,000,000+: 2.131979695431472
50+: 1.9176536943034406
5+: 0.7896221094190639
1+: 0.5076142131979695
500,000,000+: 0.2707275803722504
1,000,000,000+: 0.2256063169768754
0+: 0.04512126339537507
0: 0.011280315848843767


The problem with this data is that it is not precise: we do not know whether an app with 100,000+ installs has 100,000 installs or 199,000. Fortunately, we do not need the precise count; we only need an idea of which app categories attract the most users.

We will leave the numbers as they are, which means that we treat an app with 100,000+ installs as having 100,000 installs.
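
To do arithmetic on these labels, the `+` and `,` characters have to be stripped before converting to a number. The following is a minimal sketch of that conversion: `installs_to_float` is a hypothetical helper name, and the labels are taken from the table above.

```python
def installs_to_float(installs):
    """Convert an Installs label such as '100,000+' to a number (100000.0).

    The result is the lower bound of the bucket, not the true install count.
    """
    cleaned = installs.replace(',', '').replace('+', '')
    return float(cleaned)


for label in ['1,000,000+', '100,000+', '0+', '0']:
    print(label, '->', installs_to_float(label))
```

Because every value is a lower bound, any averages computed from it will understate the real install numbers, but the relative ranking of categories should still be informative.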