# Profitable App Profiles for the App Store and Google Play Store

For this project, we are going to pretend that we are working as a data scientist for a company that builds Android and iOS apps. The company builds only free apps and apps that are directed toward an English-speaking audience.


Since our company builds only apps that are free to download and install, our main source of revenue depends on in-app ads. Consequently, this means that our revenue for any given app is mostly influenced by the number of users.


Our goal for this project is to analyze data to help our imagined developers understand what kinds of apps are likely to attract more users.


## Opening and Exploring the Data

For our purpose of the exercise, we are going to analyze data from two data sets that could be found on the internet.

A data set that contains data about approximately ten thousand Android apps from the Google Play, you can find on this link - https://bit.ly/2sXyzPt.

A data set that contains data about approximately seven thousand iOS apps from the App Store, you can find on this link - https://bit.ly/2LIiuDO.

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
from csv import reader

# Informations about Google Play Store
opened_googleplay_file = open('googleplaystore.csv')
read_googleplay_file = reader(opened_googleplay_file)
googleplay_dataset = list(read_googleplay_file)
android_header = googleplay_dataset[0]
android_apps = googleplay_dataset[1:]

# Informations about Apple Store
opened_applestore_file = open('AppleStore.csv')
read_applestore_file = reader(opened_applestore_file)
applestore_dataset = list(read_applestore_file)
iOS_header = applestore_dataset[0]
iOS_apps = applestore_dataset[1:]

To make it easier to explore the two data sets, we'll first write a function named explore_data() that we can use to explore rows repeatedly and in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set. 

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # this add empty row after each row
        
        if rows_and_columns:
            print('Number of rows: ', len(dataset))
            print('Number of columns: ', len(dataset[0]))
            
explore_data(android_apps, 0, 3, rows_and_columns = True)
explore_data(iOS_apps, 0, 3, rows_and_columns = True)

print('\n')
print('Name of columns for Android apps: ', android_header)
print('\n')
print('Name of columns for iOS apps: ', iOS_header)

    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows:  7197
Number of columns:  16
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & 

In the output above, we see that the number of rows and columns, which include information about Android apps is 10841 and 13. The number of rows and columns for iOS apps is 7197 and 16.

We also see names of columns. Names of columns which gives us information about Android apps are: 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'.

Names of columns which gives us information about iOS apps are: 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'.

Since some of the column names are unclear, you can see the meaning of their names on the links mentioned above.

Column names that could help us with our futher analyses are: 'App', 'Category', 'Rating', 'Size', 'Installs', 'Type', 'Content Rating', 'Genres', 'Current Ver', 'Android Ver'.

The column names of iOS apps statistic that could help us are: 'track_name', 'size_bytes', 'currency', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'vpp_lic'.

## Data Cleaning

In the previous step, we opened and briefly explored our data sets. Now, we need to make sure that data we are going to analyze is accurate - that we don't have inaccurate and duplicate data. 

The Google Play data set has a discussion section, in which we can see that a certain row has an error - https://bit.ly/2YJtRAO.

The App Store data set also has a discussion section in which we can see if there are any reports of wrong data - https://bit.ly/2YPG47a. 

Below we are going to print and delete incorrect data. 

### Deleting Wrong Data

In [3]:
#Google Play 
print(android_apps[10472])
print('\n')
print(android_header)
print('\n')
print(android_apps[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In the previous output, we see that 10472 row is incorrect because it doesn't contain category information. Due to incomplete information, we are going to delete this row. 

In the discussion section for the App Store, we don't find incorrect data.

In [4]:
#Google Play
print(len(android_apps))
del android_apps[10472] #don't run this more than once
print(len(android_apps))

10841
10840


### Removing Duplicate Entries

#### Part One

In the discussion section for the Google Play data set, we see that we have some duplicate data. This data should be deleted. For instance, the app Instagram has four entries:

In [5]:
for app in android_apps:
    name = app[0]
    if name == 'Instagram':
        print(app)
        
    

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


If we examine these rows, we'll see that the main difference happens in the fourth position of each row, which corresponds to the number of reviews. The different numbers show the data was collected at different times. The higher the number of reviews is, the more recent the data should be. We can use this information to build a dictionary in which we are going to set the highest number of reviews for each app. 
 
But before we do that, let's see how much duplicate data we have:

In [6]:
duplicate_apps = []
unique_apps = []

for app in android_apps:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
            
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Number of unique apps: ', len(unique_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])

Number of duplicate apps:  1181


Number of unique apps:  9659


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In the output below we see that numbers of duplicate and unique apps are 1181 and 9659. 

Now, let's create the dictionary. 
To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app.

#### Part Two

In [7]:
reviews_max = {}

for app in android_apps:
    name = app[0]
    n_reviews = app[3]
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(reviews_max)
len(reviews_max)

{'Timehop': '161610', 'Your Freedom VPN Client': '74497', 'Color by Number - Draw Sandbox Pixel Art': '10247', 'Whoscall - Caller ID & Block': '552635', 'Naruto Shippuden - Watch Free!': '141515', 'ClanPlay: Community and Tools for Gamers': '34443', 'Chictopia': '360', 'Smart-AC Universal Remote Free': '3270', 'FC Parking': '0', 'Who Viewed My FB Profile': '121', 'Official NBSTSA CSFA Exam Prep': '137', 'CI SA': '0', 'Trazado de tuberia El Tubero': '313', 'Best Park in the Universe': '3904', "Cook 'n Learn Smart Kitchen": '205', 'Whist - Tinnitus Relief': '12', 'My Boy! Free - GBA Emulator': '531074', 'Eh Amego!': '13388', 'Droid Zap by Motorola': '2093', 'NPR One': '13217', 'Talking Tom & Ben News': '1131937', 'BD Data Plan (3G & 4G)': '10341', 'IRIS : Customer Service - DZ Algeria': '57', 'Scout GPS Navigation & Meet Up': '120373', 'FM News': '46', 'Words With Friends – Play Free': '711719', 'EZ Cleaner - Booster Optimizer': '6801', 'Car Race by Fun Games For Free': '54221', 'Orbitz 

9659

In one of the previous code cells, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [8]:
print('Expected length:', len(android_apps) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


By using the dictionary created above, we are now going to remove duplicated rows. 
Let's start it by creating two empty lists: android_clean (which will store our new cleaned data set) and already_added (which will store app names).

In [9]:
android_clean = []
already_added = []

for app in android_apps:
    name = app[0]
    n_reviews = app[3]
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        

In the code above:

- We looped through the Google Play data set and isolated the name of apps (name) and apps reviews (n_reviews).

- We added the current row (app) to the android_clean list and the app name (name) to the already_clean list if:
 - The number of reviews of the current app coincides with the number of reviews of that app in the reviews_max dictionary; and
 - The name of the app is not already in the already_added list. **We need to add this supplementary condition for those cases where the highest number of reviews of duplicate apps is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

Let's now explore our cleaned data set. The data set should have 9659 rows.

In [10]:
explore_data(android_clean, 0, 3, rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows:  9659
Number of columns:  13
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  9659
Number of columns:  13
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13


By exploring the Google Play data set, we see that we have 9659 number of rows, just as we expected. 

### Removing Non-English Apps

#### Part One

Since we are a company that only builds apps that are directed toward an English-speaking audience, we want to detect only English apps and analyze them. One way to do this is to remove each app whose name contains a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters. 

In [11]:
def check_app_name(app_name):
    for character in app_name:
        if ord(character) > 127:
            return False
        
    return True

In [12]:
print(check_app_name('Instagram'))
print(check_app_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_app_name('Docs To Go™ Free Office Suite'))
print(check_app_name('Instachat 😜'))

True
False
False
False


In the output above, we see that our function couldn't correctly identify certain English app names like Docs To Go™ Free Office Suite and Instachat 😜.  This is because of emojis and characters like ™ fall outside the ASCII standard.

Now, we want to minimize the impact of valuable data loss, so we'll make a function that will remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [13]:
def check_app_name(app_name):
    non_ascii = 0
    
    for character in app_name:
        if ord(character) > 127:
            non_ascii += 1 
            
    if non_ascii <= 3:
        return True
    else:
        return False 

In [14]:
print(check_app_name('Docs To Go™ Free Office Suite'))
print(check_app_name('Instachat 😜'))
print(check_app_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


This function is still not perfect but at this point of analyze it seem good enough.

#### Part Two

Below we are going to filter out non-English apps from both data sets with check_app_name function. If an app name is identified as English, we'll append the whole row to a separate list.

In [15]:
english_android = []
english_iOS = []

for row in android_clean:
    name = row[0]
    if check_app_name(name):
        english_android.append(row)
        
for row in iOS_apps:
    name = row[1]
    if check_app_name(name):
        english_iOS.append(row)


In [16]:
explore_data(english_android, 0, 5, True)
explore_data(english_iOS, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows:  9614
Number of columns:  13
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  9614
Number of columns:  13
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows:  9614
Number of columns:  13
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,

We left with 9614 Android apps and 6183 iOS apps.

### Isolating the Free Apps

How we mentioned before, our company only builds apps that are free to download and install. Our data sets contain both, free and non-free apps, and we'll need to isolate only the free apps for our analysis.

This step is our last step in the data cleaning process.

In [17]:
free_android_apps = []
non_free_android_apps = []
free_iOS_apps = []
non_free_iOS_apps = []

for row in english_android:
    price = row[7]
    if price == '0':
        free_android_apps.append(row)
    else:
        non_free_android_apps.append(row)
        
for row in english_iOS:
    price = row[4]
    if price == '0.0':
        free_iOS_apps.append(row)
    else:
        non_free_iOS_apps.append(row)
        

In [18]:
print(len(free_android_apps))
print(len(free_iOS_apps))

8862
3222


We've left with 8862 apps from the Google Play data set and with 3222 apps from the App data set. This should be enough for our analysis.

## Data Analysis

### Most Common Apps by Genre

#### Part One

As we mentioned in the introduction, we aim to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for the prime_genre column of the App data set, and the Genres and Category columns of the Google Play data set.

#### Part Two

We'll build two functions which we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order

In [19]:
def freq_table(dataset, index):
    table_frequency = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table_frequency:
            table_frequency[value] += 1
        else:
            table_frequency[value] = 1
            
    table_percentage = {}
    for key in table_frequency:
        percentage = (table_frequency[key] / total) * 100
        table_percentage[key] = percentage
        
    return table_percentage

def display_table_in_order(dataset, index):
    table = freq_table(dataset, index)
    percentage_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        percentage_display.append(key_val_as_tuple)
        
    table_sorted = sorted(percentage_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
        

#### Part Three

Let's start by examining the frequency table for the prime_genre column of the App Store data set. 

In [20]:
display_table_in_order(free_iOS_apps, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As we see, the App Store has 58% game apps. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of apps are built for education, followed by apps for social networking, which has a 3.2% amount of our data set. 

The general impression is that App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, finance, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't also imply that they also have the greatest number of users — the demand might not be the same as the offer.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [21]:
display_table_in_order(free_android_apps, 1) #Category

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
ART_AND_DESIGN : 0.

The situation seems to be a little different in the Google Play data set: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). But if we continue with the research, we'll see that the family category means mostly games for kids. 

Even so, practical apps seem to have a better representation in the Google Play Store compared to the App Store.

Let's continue our exploration of the Genres column. 

In [22]:
display_table_in_order(free_android_apps, -4)

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

Analyzing Genres column in the Google Play data set, we can see a similar situation as in the Category column. The highest percentages go to practical apps. Going further with the analysis, we couldn't see the clear difference between Category and Genres columns. Since we're now looking for a bigger picture, we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have most users.


### Most Popular Apps by Genre on the App Store

In [23]:
freq_genre_iOS = freq_table(free_iOS_apps, -5)

for genre in freq_genre_iOS:
    total = 0   #this variable will store the sum of user ratings specific to each genre
    len_genre = 0 #this variable will store the number of apps in each genre
    for app in free_iOS_apps:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
            
    avg_ratings = total / len_genre
    print(genre, ':', avg_ratings)
    
            

Business : 7491.117647058823
Social Networking : 71548.34905660378
Health & Fitness : 23298.015384615384
Medical : 612.0
Weather : 52279.892857142855
Education : 7003.983050847458
Book : 39758.5
Shopping : 26919.690476190477
Utilities : 18684.456790123455
Sports : 23008.898550724636
Finance : 31467.944444444445
Photo & Video : 28441.54375
Productivity : 21028.410714285714
Lifestyle : 16485.764705882353
Music : 57326.530303030304
Food & Drink : 33333.92307692308
News : 21248.023255813954
Travel : 28243.8
Games : 22788.6696905016
Navigation : 86090.33333333333
Reference : 74942.11111111111
Catalogs : 4004.0
Entertainment : 14029.830708661417


We see that genres with the most users are: Navigation (86090.3), Reference (74942.1) and Social Networking (71548.3). 

Let's proceed with further analysis.

In [24]:
for app in free_iOS_apps:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) #print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The average number of navigation apps is heavily influenced by a few giants like Waze and Google Maps. 

In [25]:
for app in free_iOS_apps:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Reference apps have a bit more potential. Our company could take another popular book and turn it into an app. In this app, we could add an audio version of the book, quizzes about that book, some daily quotes from the book, etc.

In [26]:
for app in free_iOS_apps:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Situation is similar as in navigation genre.

In [27]:
for app in free_iOS_apps:
    if app[-5] == 'Productivity':
        print(app[1], ':', app[5])
        
print('\n')

for app in free_iOS_apps:
    if app[-5] == 'Book':
        print(app[1], ':', app[5])

Evernote - stay organized : 161065
Gmail - email by Google: secure, fast & organized : 135962
iTranslate - Language Translator & Dictionary : 123215
Yahoo Mail - Keeps You Organized! : 113709
Google Docs : 64259
Google Drive - free online storage : 59255
Dropbox : 49578
Microsoft Word : 47999
Microsoft OneNote : 39638
Microsoft Outlook - email and calendar : 32807
Hotspot Shield Free VPN Proxy & Wi-Fi Privacy : 32499
Documents 6 - File manager, PDF reader and browser : 29110
Google Sheets : 24602
Microsoft Excel : 24430
Inbox by Gmail : 21561
T-Mobile : 19977
Paper by FiftyThree - Sketch, Diagram, Take Notes : 18219
MyScript Calculator - Handwriting calculator : 16555
VPN Proxy Master - Unlimited WiFi security VPN : 13674
Microsoft OneDrive – File & photo cloud storage : 12797
Ever - Capture Your Memories : 12755
Speak & Translate － Voice and Text Translator : 12062
Tayasui Sketches : 11505
Drawing Desk - Draw, Paint, Doodle & Sketch board : 11040
Microsoft PowerPoint : 10939
Email - F

In the productivity genre, the situation is similar to social networking and navigation apps.

In book apps, we could do something. We could turn on some popular children's books or picture books in apps. In these apps, we could add music and sounds effects (with an option on and off), page bookmarks with child's favorite book or character, an option to correct children's spelling if they read aloud, etc.

### Most Popular Apps by Genre on Google Play

In the Google Play data set, we already have a column with the number of users (Installs column), so we should be able to get a clear picture of genre popularity. Still, the install numbers don't seem precise enough - most values are open-ended (100+, 1,000+, 100,000+).

In [28]:
display_table_in_order(free_android_apps, 5)

1,000,000+ : 15.741367637102236
100,000+ : 11.554953735048521
10,000,000+ : 10.516813360415256
10,000+ : 10.200857594222523
1,000+ : 8.395396073121193
100+ : 6.917174452719477
5,000,000+ : 6.838185511171294
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.2906793048973144
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


For our analysis, we don't need so precise data since we only want to find out which app genres attract the most users. So we are going to keep these install numbers and consider that an app with 100,000 installs has 100,000 users, an app with 1,000,000 installs has 1,000,000 users, and so on.

To perform computations, we'll need to convert each number of installation from string to float. This means that we need to remove the commas and the plus characters, otherwise, the conversion will fail and raise an error. In the next step, we'll do that, where we'll also compute the average number of installs for each genre.

In [29]:
freq_category_android = freq_table(free_android_apps, 1)

for category in freq_category_android:
    total = 0 #this variable will store the sum of installs specific to each genre
    len_category = 0 #this variable will store the number of apps specific to each genre
    for app in free_android_apps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            convert_n_installs = float(n_installs)
            total += convert_n_installs
            len_category += 1
            
    avg_installs = total / len_category
    print(category, ',', avg_installs)

SPORTS , 3638640.1428571427
DATING , 854028.8303030303
FAMILY , 3694276.334922527
PHOTOGRAPHY , 17805627.643678162
LIBRARIES_AND_DEMO , 638503.734939759
EDUCATION , 1820673.076923077
TOOLS , 10682301.033377837
PERSONALIZATION , 5201482.6122448975
VIDEO_PLAYERS , 24727872.452830188
PARENTING , 542603.6206896552
ART_AND_DESIGN , 1986335.0877192982
GAME , 15560965.599534342
ENTERTAINMENT , 11640705.88235294
HEALTH_AND_FITNESS , 4188821.9853479853
COMICS , 817657.2727272727
AUTO_AND_VEHICLES , 647317.8170731707
SOCIAL , 23253652.127118643
NEWS_AND_MAGAZINES , 9549178.467741935
COMMUNICATION , 38456119.167247385
BOOKS_AND_REFERENCE , 8767811.894736841
MAPS_AND_NAVIGATION , 4056941.7741935486
HOUSE_AND_HOME , 1331540.5616438356
FINANCE , 1387692.475609756
TRAVEL_AND_LOCAL , 13984077.710144928
FOOD_AND_DRINK , 1924897.7363636363
EVENTS , 253542.22222222222
BEAUTY , 513151.88679245283
SHOPPING , 7036877.311557789
LIFESTYLE , 1437816.2687861272
WEATHER , 5074486.197183099
MEDICAL , 120616.48717

On average, communication apps have the most installs. In the following order, we are going to analyze this genre.

In [30]:
for app in free_android_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                    or app[5] == '500,000,000+'
                                    or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

The communication apps are dominated by a few giants who are hard to compete against. Also, this genre seems to be more popular than it really is. If we remove apps which have more than 100.000 installs, the average number would reduce ten times:

In [31]:
under_100_mil = []

for app in free_android_apps:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_mil.append(float(n_installs))
        
sum(under_100_mil) / len(under_100_mil)

3603485.3884615386

The situation is similar in the social, productivity and video players genres.

Genre, who has a lot of installations and potential for deeper analyses, is book and references. The game genre seems pretty popular, but previously we found that this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if it's possible. 

Let's now look at the book and references genre:

In [32]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

In the books and reference genre, we see that we have a wide range of apps. We have dictionary apps, apps for online reading, apps for learning, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [33]:
for app in free_android_apps:
    if (app[1] == 'BOOKS_AND_REFERENCE') and (app[5] == '1,000,000,000+'
                                        or app[5] == '500,000,000+'
                                        or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
    

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


There are only a few very popular apps, so this market still has potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [34]:
for app in free_android_apps:
    if (app[1] == 'BOOKS_AND_REFERENCE') and (app[5] == '1,000,000+'
                                        or app[5] == '5,000,000+'
                                        or app[5] == '10,000,000+'
                                        or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

The type of apps that are in the middle of popularity are dictionary apps, various collections of libraries and apps for reading books, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the book Quran and Bible, which suggests that building an app around a popular book can be profitable. But, since there are a few apps that offer a collection of libraries, we could build an app that, besides that, add some special features. We could add some daily quotes from that book, the audio version of that book, etc.

There is also a popular app with stats for game Clash Royale so we could build an app with stats for other popular games.

## Conclusion

In this project, we analyzed data from the https://bit.ly/2LIiuDO (App Store data set) and https://bit.ly/2sXyzPt (Google Play data set) intending to recommend an app profile that can be profitable for our company in both markets.

We came to a conclusion to turn a popular book into an app, and besides that, add some special features. This could be some daily quotes from that book, an audio version of that book, quizzes, etc. If we turn some popular children's books or picture books in apps, we could add music and sounds effects, page bookmarks with the child's favourite book and character, an option to correct children's spelling if they read aloud, etc.

Since in both data sets, in book and reference genres, we have some apps connected to games, we can try with building an app with stats for some popular game (this app is very popular in Google Play Store).