*This is my first project in Python. The goal of this project was to get comfortable using the Jupyter Notebook and use fundamental techniques and functions. I aim to revisit this project once my `Python` toolkit has expanded to make new analytical cuts of the data and provide new insights. Thank you for reading!*

# Developing profitable apps for the Android and iOS app stores.

In this project, we will be analysing data about apps in the Google Android Play Store and the Apple App Store.

The goal of this project is to provide a recommendation for creating a profitable free-to-download app. This recommendation could take the shape of a list of criteria that make an app popular (likely to be downloaded) or a category of app to develop. The recommendation should support a dev team in creating apps that make the most of a business model where revenue comes from in-app engagement and in-app purchases.

In a [2020 article on Forbes](https://www.forbes.com/sites/forbestechcouncil/2020/04/28/mobile-app-monetization-part-1-revenue-generation-models/), Kontsevoi outlined six revenue models from apps:
- app purchase
- in-app purchase 
- selling third party content
- subscriptions
- advertising
- service fees

He suggests that the in-app monetization model, or 'freemium' model, is becoming "increasingly popular among app publishers" and advises publishers to consider "mixing and matching in-app purchase types" to meet the diverse needs of an audience. The article indicates that this type of revenue model has potential for app development teams, which makes this research project worthwhile.

## The Data

In 2018, there were over 2.1 million apps in the Google Play Store and over 2 million iOS apps in the App Store. For the purposes of this project we will use existing datasets to avoid the expensive and time-consuming process of gathering the data ourselves. Both datasets can be found on Kaggle.

The dataset on apps in the Apple App Store was initially collected in 2017 by Ramanathan. It can be downloaded from this [link](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps?search=wrong).

The dataset on apps in the Google Play Store was collected in 2018 by Lavanya. It can be downloaded from this [link](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

The data was collected in 2017 and 2018. The mobile app market is highly dynamic and has likely changed since then, which could affect the criteria used as proxies for app popularity in this research (e.g. downloads and user ratings). As a result, the recommendations from this project should be read in context: they may not apply to 2022 and beyond because of trends that have emerged since 2017.

---

In [1]:
#importing modules
from csv import reader

## Data Exploration

In [2]:
#opening the files

read_file_1 = open ('googleplaystore.csv')
reader_1 = reader (read_file_1)
android_apps_1 = list (reader_1)
android_apps = android_apps_1 [1:] #creating an android apps dataset without the header row

read_file_2 = open ('AppleStore.csv')
reader_2 = reader (read_file_2)
apple_apps_1 = list (reader_2)
apple_apps = apple_apps_1 [1:] #creating an Apple apps dataset without the header row
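
The cell above leaves both file handles open and relies on the platform's default text encoding. A minimal sketch of a reusable loader is shown below. It assumes the files are UTF-8 encoded, and the helper name `load_csv` is hypothetical (it is not part of the original notebook).

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    # newline='' is the setting recommended by the csv module documentation
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(reader(csv_file))
    return rows[0], rows[1:]

# Hypothetical usage, assuming the CSV files sit next to the notebook:
# android_header, android_apps = load_csv('googleplaystore.csv')
# apple_header, apple_apps = load_csv('AppleStore.csv')
```

Returning the header separately from the data rows avoids keeping two copies of each dataset (one with the header and one without).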

We will look at 2 elements for each dataset:

1. The header rows - to get an idea of the app metadata we will be working with during this project
2. Data in the first few rows and the number of rows and columns in each dataset

The header row of the Android Store Dataset.

In [3]:
print (android_apps_1 [0]) #the Android Apps containing the header row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The header row of the Apple App Store Dataset.

In [4]:
print (apple_apps_1 [0]) #the iOS Apps dataset containing the header row

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


We can display the column names and descriptions in a more readable format to determine which columns (or app metadata) can serve us in our analysis.

### Android App Store column descriptions
[Link to Android App Store Dataset Documentation](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion) 

| Column Name in Android Dataset    | Description |
| :---------     | ----------: |
| "App"           | Application Name      |
| "Category"   | Category the app belongs to        |
| "Rating"   | Overall user rating of the app (as when scraped) |
| "Reviews"     | Number of user reviews for the app (as when scraped) |
| "Size"     | Size of the app (as when scraped) |
| "Installs"     | Number of user downloads/installs for the app (as when scraped) |
| "Type"     | Paid or Free |
| "Price"     | Price of the app (as when scraped) |
| "Content Rating"     | Age group the app is targeted at - Children / Mature 21+ / Adult |
| "Genres"     | An app can belong to multiple genres (apart from its main category) |
| "Last Updated"     | Date when the app was last updated on Play Store (as when scraped) |
| "Current Ver"     | Current version of the app available on Play Store (as when scraped) |
| "Android Ver"     | Min required Android version (as when scraped) |


### Apple App Store column descriptions
[Link to Apple App Store Dataset Documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps?search=wrong) 

| Column Name in Apple Dataset    | Description |
| :---------     | ----------: |
| "id"           | App ID      |
| "track_name"   | App Name        |
| "size_bytes"   | Size (in Bytes) |
| "price"     | Price amount |
| "rating_count_tot"     | User Rating counts (for all versions) |
| "rating_count_ver"     | User Rating counts (for current version) |
| "user_rating"     | Average User Rating value (for all versions) |
| "user_rating_ver"     | Average User Rating value (for current version) |
| "ver"     | Latest version code |
| "cont_rating"     | Content Rating |
| "prime_genre"     | Primary Genre |
| "sup_devices.num"     | Number of supporting devices |
| "ipadSc_urls.num"     | Number of screenshots showed for display |
| "lang.num"     | Number of supported languages |
| "vpp_lic"     | Vpp Device Based Licensing Enabled |

We can observe some similarities in the metadata captured by the two datasets, such as:
- App Name
- App Price
- App Category (Genre in the Apple dataset)
- App User Rating
- App User Reviews (number of)
- App Size
- App content Rating (Age Restriction/ Recommendation)

There are a few columns of interest for our analysis. `Category` and `prime_genre` give us the category or genre of each app, which we can use to recommend the category in which apps should be developed. `Installs` records the number of downloads, which is crucial for understanding which apps are more likely to be downloaded. The Apple dataset has no data on downloads per app, so we will have to create a proxy for it. Another potentially interesting criterion is content rating (`Content Rating` and `cont_rating`), which could answer the question: are apps for wider audiences (children and adults) more likely to be downloaded than apps rated for adults only?

Each dataset has unique app metadata that is not collected in the other dataset. It could be explored and analysed in isolation for each dataset to test hypotheses and make recommendations custom to each store.  

Unique app metadata in the **Android Dataset**
- Second Category - an app can belong to more than one category
- Free or Paid (we could create this data for the Apple Dataset by adding a new column that indicates an app as Free or Paid based on the price)
- Last Updated

Unique app metadata in the **Apple Dataset**
- Number of supported languages
- Number of supporting devices
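
The idea of deriving a Free/Paid label for the Apple dataset from its price column can be sketched as follows. This is only an illustration under stated assumptions: the helper name `add_type_column` is hypothetical, the price is assumed to sit at index 4 as in the Apple header row, and the second sample row is made up.

```python
def add_type_column(rows, price_index=4):
    """Return copies of rows with 'Free' or 'Paid' appended, based on price."""
    labelled = []
    for row in rows:
        app_type = 'Free' if float(row[price_index]) == 0 else 'Paid'
        labelled.append(row + [app_type])  # row + [...] builds a new list
    return labelled

# First five columns only (id, track_name, size_bytes, currency, price).
# The Facebook values come from the Apple dataset; the second row is made up.
sample_rows = [
    ['284882215', 'Facebook', '389879808', 'USD', '0.0'],
    ['0', 'Hypothetical Paid App', '1000', 'USD', '1.99'],
]

for row in add_type_column(sample_rows):
    print(row[1], '->', row[-1])
```

Comparing the price as a number rather than as a string means that `'0'`, `'0.0'` and `'0.00'` would all be treated as free.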


We will create a custom function to explore the data. It will print a set of rows from the chosen dataset and, optionally, report the size of the dataset.

In [5]:
#defining a function which prints the slice of the dataset selected by the start and end parameters

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') #adds a new (empty) line after each row

    if rows_and_columns:
        print('In the entire dataset the') 
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n') #added break line for better readability

#the last snippet of code prints the number of rows and columns of the dataset if we set the rows_and_columns parameter to True.

In [6]:
#for the Android dataset
print ('The Android Apps dataset - first 5 rows excluding the header:')
print ('\n')
explore_android_data = explore_data (android_apps,0,5, True)

The Android Apps data set - first 5 rows excluding the header:


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


In the entire 

In [7]:
#for the Apple dataset
print ('The Apple apps dataset - first 5 rows excluding the header:')
print ('\n')
explore_apple_data = explore_data (apple_apps,0,5,True) 

The Apple Apps Data set - first 5 rows excluding the header:


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


In the entire dataset the
Number of rows: 7197
Number of columns: 16




---

## Cleaning the data

### The Android dataset - Removing dirty data

In [this discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) about the data, PhaniKiranSiddineni identified a wrong entry in row 10472 for the Android dataset (`android_apps` in this project).

We will print this row and the ones before and after it to illustrate the wrong entry.

In [8]:
print (android_apps_1 [0])
print ('\n')
for row in android_apps [10471:10474]:
    print (row)
    print ('\n')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




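We can indeed observe that the app `'Life Made WI-Fi Touchscreen Photo Frame'` is missing its `'Category'` and `'Genres'` entries. Because the `'Category'` value is absent, the row has only 12 values and every later value sits one column to the left of its header. We will remove this row from our data.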

In [9]:
#using the del statement (run this cell only once: each run deletes whatever row is at index 10472)
del android_apps [10472]

#printing rows with similar indexes as before to confirm wrong entry was removed
for row in android_apps [10471:10474]:
    print (row)
    print ('\n')

We can confirm the entry for `'Life Made WI-Fi Touchscreen Photo Frame'` has been removed.
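
Rather than relying on a known row number, malformed rows can also be found by comparing each row's length with the header's length. The following is a minimal sketch of that check; the helper name `find_malformed_rows` is hypothetical, and the sample rows are the two entries printed above.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
sample_rows = [
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+',
     'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0',
     '4.0.3 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

for index, row in find_malformed_rows(header, sample_rows):
    print(index, row[0], '- found', len(row), 'values, expected', len(header))
```

A check like this only catches rows with a missing or extra value. It would not detect a row that has the right number of columns but wrong contents.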

### The Android apps dataset - Removing duplicates

We can check for app duplicates and remove them from our data with the following steps:

1. Counting duplicate entries by looping over our `android_apps` dataset and checking for identical `'App'` names
2. Printing duplicate rows to see how they compare 
3. Selecting a criterion to remove duplicate entries to keep the most accurate entry about an app in the dataset

In [11]:
#creating dedicated list for duplicate app entries and unique app entries
duplicate_apps_android = [] 
unique_apps_android = [] 

for app in android_apps: 
    name = app [0]   #extracting the name of an app in column with index number 0
    if name in unique_apps_android:
        duplicate_apps_android.append (name) #adding the duplicate app name to the duplicates list
    else:
        unique_apps_android.append (name) #if the app isn't in the list of unique apps, it will be added only once
        
print ('Number of duplicate apps: ', len (duplicate_apps_android))
print ('Number of unique apps: ', len (unique_apps_android))
print ('\n')
print ('Total number of entries: ', len (duplicate_apps_android) +len (unique_apps_android)) 

Number of duplicate apps:  1181
Number of unique apps:  9659


Total number of entries:  10840


In [12]:
print ('A few duplicate app entries:')
print (duplicate_apps_android [:20])

A few duplicate app entries:
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


The list above shows us there are a significant number of duplicate entries.
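
As an aside, the same count can be obtained with `collections.Counter` from the standard library, which avoids scanning a growing list for every app. This is a sketch of an alternative, not the approach used in this notebook; the helper name `count_duplicates` and the toy rows are illustrative.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Map each app name that occurs more than once to its number of rows."""
    name_counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in name_counts.items() if n > 1}

# Toy rows containing only the app name column
toy_rows = [['Box'], ['Slack'], ['Box'], ['Zenefits'], ['Box'], ['Slack']]

repeated = count_duplicates(toy_rows)
print(repeated)
# Rows beyond the first occurrence of each repeated name
print(sum(n - 1 for n in repeated.values()))
```

The second figure corresponds to what the notebook calls the number of duplicate apps: every row beyond the first one for a given name.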

Let's print some duplicate entries of popular social media apps to see which criterion we can use to remove duplicate entries.

In [13]:
print (android_apps_1[0]) #header row
print  ('\n')

print ("'Snapchat' Duplicates")
for app in android_apps:
    name = app [0]
    if name == 'Snapchat':
        print(app)
        
print  ('\n')

print ("'Facebook' Duplicates")
for app in android_apps:
    name = app [0]
    if name == 'Facebook':
        print(app)   

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


'Snapchat' Duplicates
['Snapchat', 'SOCIAL', '4.0', '17014787', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17014705', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17015352', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17000166', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']


'Facebook' Duplicates
['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social',

We can observe that the `'Reviews'` values are the only ones that vary. Other potential criteria such as `'Last Updated'` (which could indicate the latest version) are identical across duplicate entries. If we keep the row with the highest `'Reviews'` value, we keep the entry with the most reviews, which is likely the most recently collected and therefore the most accurate. We will use `'Reviews'` as our criterion to remove duplicates.

In order to remove the duplicates we will:
- Create a dictionary (`reviews_max`), where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews for that app.
- Use the information stored in the dictionary and create a new dataset, which will store unique app names (for each app, we will select the entry with the highest number of reviews).

In [14]:
reviews_max = {}
for app in android_apps:
    name = app [0]
    n_reviews = float (app[3]) #converting reviews index to float to make calculations
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print ('Expected length of the "reviews_max" dictionary: 9,659') #as is the len() of our unique apps dataset
print ('Actual length of the "reviews_max" dictionary:', len (reviews_max)) #to ensure we have the correct amount of entries

Expected length of the "reviews_max" dictionary: 9,659 entries
Actual length of the "reviews_max" dictionary: 9659


Now we can remove the duplicate rows by:
- creating an empty list (`android_clean`) that will serve as our new dataset with no duplicates
- creating an empty list (`already_added`) that will keep track of apps we have already added to our new dataset `android_clean`

In [15]:
android_clean = []
already_added = []

for apps in android_apps:
    name = apps [0]
    n_reviews = float (apps[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added): 
        android_clean.append(apps) #appending entire row as a list in the list android_clean
        already_added.append(name) #keep track of added apps by their name
        
#confirming the two lists are the same length
print (len (android_clean))
print (len (already_added))

9659
9659


As expected, both lists have the same length as the `reviews_max` dictionary: 9,659 entries each.
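
The two passes above (build `reviews_max`, then filter) could also be folded into a single pass that stores the best row per app name. The sketch below shows one possible way to do this. The helper name `keep_most_reviewed` is hypothetical, and the sample rows are trimmed to the first four columns of entries printed earlier.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ['Snapchat', 'SOCIAL', '4.0', '17014787'],
    ['Snapchat', 'SOCIAL', '4.0', '17015352'],
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Snapchat', 'SOCIAL', '4.0', '17000166'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```

One difference from the notebook's approach: the result is ordered by the first appearance of each app name, not by the position of the retained row.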

Let's explore the first three rows of our cleaned Android dataset `android_clean`.

In [16]:
explore_data (android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


In the entire dataset the
Number of rows: 9659
Number of columns: 13




### The Android and Apple datasets - Removing non-English app entries

Let us turn our attention to app names. As this data was collected globally, some app entries are not in English. The long-term goal is to capture as much market share as possible, but the prototype will target the English-speaking market first. We also do not have geographic data about app downloads or reviews, and therefore cannot make recommendations for different markets based on geography.

We will delete the rows where app names are not in English. 

To do so we will create a function that flags app names containing non-English characters. We will then run the function on the app name column of each dataset (`App` in the Android dataset and `track_name` in the Apple dataset).

According to the ASCII (American Standard Code for Information Interchange) system, each character commonly used in English text has a corresponding number in the range 0 to 127. This covers the letters in a string such as `ex_string = 'English 101!'`, the digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, \*, /). We can get the corresponding number of each character using the built-in `ord()` function.

If the number is between 0 and 127 (inclusive), the character belongs to the set of common English characters. Our function will need to find string characters that fall outside the range `0 <= n <= 127`, where `n` is an integer.
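
To make the range check concrete, the short illustration below prints the number returned by `ord()` for a few characters, together with whether it falls within the ASCII range. The selection of characters is arbitrary.

```python
# Characters chosen for illustration: three ASCII and three non-ASCII
characters = ['a', 'A', '+', '™', '爱', '😜']

for character in characters:
    code_point = ord(character)
    print(repr(character), code_point, code_point <= 127)

# 'a' -> 97, 'A' -> 65 and '+' -> 43 are inside the range;
# '™' -> 8482, '爱' -> 29233 and '😜' -> 128540 are outside it.
```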

In [17]:
def english_only (string):
    for character in string:
        n = ord(character) #assigning n within the loop of the function
        if n > 127 :
            return False
        
    return True

Let's test our function with the following strings:
- 'Instagram'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'
- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'

In [18]:
print(english_only ('Instagram'))
print(english_only ('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_only ('Docs To Go™ Free Office Suite'))
print(english_only ('Instachat 😜'))

True
False
False
False


We can observe that our function returns `False` as intended if the app names are not in English, but also returns `False` if there are special characters beyond the usual letters, punctuation, and numbers.

We may lose valuable data entries if we use this function to filter our current dataset, as some English app names contain special characters such as '😜' or '™'.

We will refine our function so that it returns `False` only if there are more than 3 characters with corresponding numbers falling outside the `n <= 127` ASCII range. 

In [19]:
def english_only (string):
    outside_ascii = 0 #creating a variable designated for outside the ASCII range, starting at 0 
    
    for character in string:
        if ord(character) > 127:
            outside_ascii +=1 #for every ord(character) that is greater than 127, add 1 to the 'outside_ascii' variable
        
    if outside_ascii > 3: #if the 'outside_ascii' value is greater than 3, return False, signaling the app name is not likely in English
        return False 
    else:            #'else' important here as we want the other 'string' character inputs which won't have an 'outside_ascii' value above 3 to remain in the data
        return True 

Let's test our function with the same words as above:
- 'Instagram'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'
- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'

In [20]:
print(english_only ('Instagram'))
print(english_only ('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_only ('Docs To Go™ Free Office Suite'))
print(english_only ('Instachat 😜'))

True
False
True
True


Our function still detects non-English app entries for any `string` input but keeps ones with special characters. 
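
To see why a threshold of 3 separates these test strings, it can help to count the non-ASCII characters in each one. The helper `count_non_ascii` below is a hypothetical addition for illustration only.

```python
def count_non_ascii(string):
    """Count the characters whose ord() value is above 127."""
    return sum(1 for character in string if ord(character) > 127)

test_names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

for name in test_names:
    print(count_non_ascii(name), name)
```

The two English names with a single special character stay well under the threshold, while the Chinese name exceeds it. The heuristic is not perfect: an English name with four or more emoji or symbols would still be filtered out, and a short non-English name with three or fewer non-ASCII characters would be kept.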

We will filter the non-English apps from both datasets by:
- looping through the dataset
- if an app name is in English, append the whole row to a separate list (our English only data)

In [21]:
android_english = []

for row in android_clean: #looping through our cleaned dataset
    name = row [0]
    if english_only(name):
        android_english.append(row)

explore_data (android_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


In the entire dataset the
Number of rows: 9614
Number of columns: 13




We can perform the same cleaning task for our `apple_apps` dataset.

In [22]:
apple_english = []

for row in apple_apps:
    name = row [1] #app name is index [1] in this dataset
    if english_only (name) is True:
        apple_english.append (row)

explore_data (apple_english, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


In the entire dataset the
Number of rows: 6183
Number of columns: 16




### The Android and Apple datasets - Removing paid apps

As we mentioned in the introduction, we want to provide a recommendation for building apps that are free to download and install, as our revenue will come from in-app purchases and engagement. Our datasets currently contain both free and paid apps. We will need to isolate the free apps for our analysis.

To do so, we will:
- add an app to the list `android_free` if its `Type` is `Free` in the Android dataset
- add an app to the list `apple_free` if the value in the `price` column is equal to `0` in the Apple dataset

In [23]:
print ('A reminder of header row for Android dataset:')
print (android_apps_1[0])

A reminder of header row for Android dataset:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In the Android dataset, there are two ways to filter out apps that are free:
- The `'Type'` (index `[6]`) - indicates app as `Paid` or `Free`
- The `'Price'` (index `[7]`) - indicates price as a string

Let's check both.

In [24]:
android_free = []
android_paid = [] #let's create this list for paid apps, we may want to use it as an extension of this project if we deliver under budget :)

for app in android_english:
    free_or_not = app [6]
    if free_or_not == 'Free':
        android_free.append(app)
    elif free_or_not == 'Paid':
        android_paid.append(app )
        
android_free_1 = []
for app in android_english:
    price = app[7]
    if price == '0':
        android_free_1.append(app)


print('Free Apps in Android Dataset:')
print ("Using the 'Type' criterion: ", len(android_free))
print ("Using the 'Price' criterion: ", len(android_free_1))


Free Apps in Android Dataset:
Using the 'Type' criterion:  8863
Using the 'Price' criterion:  8864


The two counts differ by a single row. We will not investigate which row it concerns here, but we note it in case we revisit this project. We will work with `android_free` for our analysis.
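
Should we revisit this, the row (or rows) responsible for the discrepancy could be located by listing entries where the two criteria disagree. The following is a sketch under stated assumptions: the helper name `type_price_mismatches` is hypothetical, and the second sample row is invented purely to demonstrate a mismatch.

```python
def type_price_mismatches(rows, type_index=6, price_index=7):
    """Return rows where exactly one of the two 'free' criteria holds."""
    mismatches = []
    for row in rows:
        free_by_type = row[type_index] == 'Free'
        free_by_price = row[price_index] == '0'
        if free_by_type != free_by_price:
            mismatches.append(row)
    return mismatches

# Illustrative rows trimmed to the first eight columns. The second row is
# hypothetical: it has a price of '0' but a 'Type' value other than 'Free'.
sample_rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M',
     '50,000,000+', 'Free', '0'],
    ['Hypothetical App', 'FAMILY', '4.0', '10', '1M', '0', 'NaN', '0'],
]

for row in type_price_mismatches(sample_rows):
    print(row[0], '| Type:', row[6], '| Price:', row[7])
```

In the notebook, the equivalent call would be `type_price_mismatches(android_english)`.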

In [25]:
print ('A reminder of header row for Apple dataset:')
print (apple_apps_1[0])

A reminder of header row for Apple dataset:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [26]:
price_format_apple = []
for row in apple_english:
    price = row [4]
    price_format_apple.append(price)

print (price_format_apple[1:50])

['0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '1.99', '0.0', '0.0', '0.0', '0.0', '0.0', '0.99', '6.99', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.99', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.99', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '2.99', '0.0', '0.0', '0.0', '0.0']


In the Apple dataset, there is no column which indicates if the app is free or not, so we have to create a filter based on the value in the `price` column.

In [27]:
apple_free = []
apple_paid = [] #we may want to use this list later

for app in apple_english:
    price = app [4]
    if price == '0.0':
        apple_free.append(app)
    else:
        apple_paid.append(app )

print ("Free apps in Apple Dataset: ", len(apple_free))

Free apps in Apple Dataset:  3222


### Data Cleaning Summary

For the Android dataset:
- Removing wrong entries
- Removing duplicates

For both datasets:
- Removing apps in languages other than English
- Separating free-to-download apps from paid apps

We will be proceeding with the cleaned `android_free` and `apple_free` for our analysis.

---

## Data Analysis

**Reiterating our goal**: to provide a recommendation for the type of apps that are likely to be downloaded and engaged with by a large number of users. The more users download our app(s), the more users we expose to in-app purchases, which we assume is linked to increased revenue from the app.

While we will validate our concept by building an Android app first (the Android OS has a dominant 71.8% market share  in the operating systems market), we also want to capture the potential revenue in the Apple app store (2nd in operating systems market share with 27.6%). [Statista, 2023](https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009/#:~:text=Android%20maintained%20its%20position%20as,the%20mobile%20operating%20system%20market.)

We will start by analysing the most popular app genres in each store. Then criteria such as user downloads or user reviews will be analysed to give a more granular breakdown of the popularity of apps by genre, as an indication of 'downloadability' and therefore of potential revenue.

We will need to define two functions to analyse the data about app categories. Because this data is categorical, we will generate frequency tables to analyse it numerically.

- One function to generate frequency tables that show percentages for a column
- One function to display the column percentages in a descending order

In [28]:
#defining the first function

def freq_table (dataset, index):
    table = {} #named 'table' so that it does not shadow the function name
    total_apps = len (dataset) #calculating the total number of apps in the dataset
    for row in dataset:
        row_data_point = row[index]
        if row_data_point in table:
            table[row_data_point] +=1
        else:
            table[row_data_point] =1
    
    for key in table: #converting each count to a percentage of the total
        table [key] /= total_apps
        table [key] *= 100
    
    return table

Let's test our function to see if we get the desired outputs. 

In [29]:
print ("Frequency table for the 'Category' column in Android dataset")
freq_table (android_free,1)

Frequency table for the 'Category' column in Android dataset


{'ART_AND_DESIGN': 0.6431230960171499,
 'AUTO_AND_VEHICLES': 0.9251946293580051,
 'BEAUTY': 0.5979916506826132,
 'BOOKS_AND_REFERENCE': 2.1437436533904997,
 'BUSINESS': 4.592124562789123,
 'COMICS': 0.6205573733498815,
 'COMMUNICATION': 3.2381812027530184,
 'DATING': 1.8616721200496444,
 'EDUCATION': 1.1621347173643235,
 'ENTERTAINMENT': 0.9590432133589079,
 'EVENTS': 0.7108202640189552,
 'FINANCE': 3.7007785174320205,
 'FOOD_AND_DRINK': 1.241114746699763,
 'HEALTH_AND_FITNESS': 3.0802211440821394,
 'HOUSE_AND_HOME': 0.8236488773552973,
 'LIBRARIES_AND_DEMO': 0.9364774906916393,
 'LIFESTYLE': 3.9038700214374367,
 'GAME': 9.725826469592688,
 'FAMILY': 18.898792733837304,
 'MEDICAL': 3.5315355974275078,
 'SOCIAL': 2.6627552747376737,
 'SHOPPING': 2.245289405393208,
 'PHOTOGRAPHY': 2.944826808078529,
 'SPORTS': 3.396141261423897,
 'TRAVEL_AND_LOCAL': 2.335552296062281,
 'TOOLS': 8.462146000225657,
 'PERSONALIZATION': 3.317161232088458,
 'PRODUCTIVITY': 3.8925871601038025,
 'PARENTING': 0.

In [30]:
#defining the second function

def display_table(dataset, index):
    table = freq_table(dataset, index) #using the function we created above
    table_display = [] 
    for key in table:
        key_val_as_tuple = (table[key], key) #storing each entry as a (percentage, key) tuple so that sorting orders by percentage
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) #printing in descending order (largest to smallest value)
        

Let's refresh our minds on the columns that we want to generate frequency tables for:

- `prime_genre` in the Apple dataset, index [11]
- `Category` in the Android Dataset, index [1]
- `Genres` in the Android Dataset, index [9]
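
Hard-coded positions such as `[11]` or `[1]` are easy to mistype. As a hedged sketch, the position can instead be looked up by name from the header row with `list.index`. The header lists below are copied from the two datasets' header rows as printed in this notebook; the variable names are assumptions.

```python
# Header rows as they appear in the two datasets.
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# list.index returns the position of the first matching column name
# and raises ValueError if the name is missing.
prime_genre_index = apple_header.index('prime_genre')
category_index = android_header.index('Category')
genres_index = android_header.index('Genres')

print(prime_genre_index, category_index, genres_index)
```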

### Apple Apps Genre Frequency Analysis

In [31]:
print ("Frequency table for the `prime_genre` column in the Apple dataset")
print ('\n')
display_table (apple_free,11)

Frequency table for the `prime_genre` column in the Apple dataset


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Observations:
- The most common genre is `Games`, making up 58% of apps in this App Store Dataset.
- The second-most common genre is `Entertainment`, comprising nearly 8% of the Dataset.
    - This means over 66% of the free Apple App Store apps in the Dataset are built for entertainment purposes.
- The 3rd category is `Photo & Video`, with just under 5% of the apps in the Dataset. Media creation is an industry with which the Apple brand is generally associated, and this is reflected here.
- The 4th category is `Education`, making up 3.7% of apps in this Dataset.

In General
- Apps in the Apple App Store tend to be designed for entertainment purposes as opposed to practical purposes.
- The top 4 categories account for 74.7% of the App Store Dataset, meaning apps outside those 4 categories may be less popular or have a weaker business case for development.
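
The combined share of the top categories can be checked with a few lines. This sketch re-enters the five largest percentages from the output above by hand; the helper name `top_n_share` is an assumption.

```python
def top_n_share(table, n):
    """Sum the n largest percentage values in a {label: percentage} table."""
    largest = sorted(table.values(), reverse=True)[:n]
    return sum(largest)

# Percentages copied from the `prime_genre` frequency table above (free Apple apps).
apple_top = {
    "Games": 58.16263190564867,
    "Entertainment": 7.883302296710118,
    "Photo & Video": 4.9658597144630665,
    "Education": 3.662321539416512,
    "Social Networking": 3.2898820608317814,
}

share = top_n_share(apple_top, 4)
print(round(share, 1))
```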

Recommendation
- Based on these observations, it would be recommended to develop an entertainment-oriented app. However, this recommendation is based on genre frequency data, which may not be indicative of 'downloadability' and, ultimately, revenue for our team.


Lastly, we should also consider that this dataset was first put together in 2017, and the proportion of each app genre across all apps in the Apple App Store will likely have changed since then. There have been additions to the Apple hardware ecosystem such as the 'Apple Watch' and 'Apple Pencil' (both first released in 2015). These may have incentivised companies or development teams to create apps that are compatible with and can make use of this hardware. It took the Apple Watch 4 years to reach 50 million users, but in the next two years, adoption increased to 100 million users, indicating an accelerated adoption rate ([Statista, 2021](https://www.statista.com/statistics/1221051/apple-watch-users-worldwide/)). The adoption of new hardware may have led to an increase in `Health & Fitness` or `Medical` apps being developed due to the ability to track health-related data. Furthermore, app genres that can use short nudges and notifications displayed on the Apple Watch's screen, such as `Education`, `Finance`, and `News`, may have also gathered increased interest from development teams since 2017.

### Android Apps Genre Frequency Analysis

In [32]:
print ("Frequency table for the 'Category' column in the Android dataset")
print ('\n')
display_table (android_free,1)

Frequency table for the 'Category' column in the Android dataset


FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0

Observations:
- `FAMILY` is the category with the most apps, at 18.9%.
    - After research, we find that apps in this category are for younger audiences and are mostly games, as the category is aimed at ages 5 & under, 6-8, and 9 & up, with labels such as Action & Adventure, Brain Games, Creativity, Education, Music & Video, and Pretend Play ([Song, 2015](https://www.lowyat.net/2015/64130/googles-new-family-category-now-on-play-store-makes-it-easier-to-discover-apps-for-kids/#:~:text=Just%20like%20Play%20Store%20for,for%20children%20of%20different%20age.)).
- `GAME` is the second-most prevalent category, with 9.7% of apps.
- `TOOLS` is the 3rd most common category, with 8.5% of apps.
- `BUSINESS` is the 4th category on the list in terms of app presence, with 4.6%.
    - These last two categories show that productivity and practical apps make up a higher proportion of the Android Dataset compared to the entertainment-focused Apple App Store Dataset.


In General
- The apps in the Android Dataset tend to be spread more evenly between entertainment and practical purposes, but there is still a significant business case for entertainment-focused apps. There is room for other genres as well.
- The top 4 categories are less concentrated, making up 41.7% of the Android Dataset compared to 74.7% in the Apple Dataset. This may be an artefact of how our sample was collected. However, one can hypothesize that the Android app store offers more flexibility in the genre of app that can lead to downloads, popularity, and revenue. An app does not necessarily need to belong to those top 4 categories.
 

Recommendation
- Based on these observations there is an equal business case for developing both an entertainment-focused and productivity-focused app.

Similarly to our note on the Apple apps dataset, the landscape of the Android App Store will likely have changed since 2018. Furthermore, the Android ecosystem is more open source compared to Apple's and therefore will likely attract development teams which may feel restricted by Apple's guidelines and approval processes for app development. This will therefore affect the composition of the Android Store.

### The Android `Genres` Column

In [33]:
print ("Frequency table for the 'Genres' column in the Android dataset")
print ('\n')
display_table (android_free,9)

Frequency table for the 'Genres' column in the Android dataset


Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries

As mentioned in our introduction of the datasets and the column descriptions, `Genres` is a column that can have more than one category assigned per app. We can observe that this column offers a more granular insight into how the apps are subdivided within each category. It is not within the scope of our project to be so granular, so we will not work with this column further.

It is interesting to note that even with this added granularity, `Tools`, `Education`, `Business`, and `Productivity` are all in the top 5 genres by frequency.


### Conclusions - Apps by Genre in the Datasets

These statistics about genre are useful, but paint an incomplete picture of which app genre would be ideal to prototype, as they don't consider the number of downloads per genre or the user ratings per genre.

Without data on user downloads and user satisfaction, which may be a better indication of potential in-app purchases, recommendations made using this data alone may be misleading.

We should take into account the number of downloads per genre to get a clearer idea of which apps are most downloaded. This is of interest to us because the more downloads we have, the higher the total number of potential engagements within our app. While our aim is to get as many users as possible to use our app, we cannot predict the engagement rate in our app.

One way to find out what genres are the most downloaded/popular (have the most potential users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the `Installs` column, but this information is missing for the Apple store dataset. As a workaround, we'll use the total number of user ratings for that app as a proxy, which we can find in the `'rating_count_tot'` column.
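
The per-genre average described above can be sketched in a single pass by keeping a running total and a count for each genre. The five rows are free Apple apps from this dataset, reduced to name, `rating_count_tot`, and `prime_genre`; the function name and the reduced row layout are assumptions for illustration.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Reduced rows: [track_name, rating_count_tot, prime_genre]
sample_apps = [
    ['Facebook', '2974676', 'Social Networking'],
    ['Instagram', '2161558', 'Photo & Video'],
    ['Clash of Clans', '2130805', 'Games'],
    ['Temple Run', '1724546', 'Games'],
    ['Pandora - Music & Radio', '1126879', 'Music'],
]

averages = average_by_group(sample_apps, group_index=2, value_index=1)
for genre, average in averages.items():
    print(genre, ':', average)
```

Compared with looping over the whole dataset once per genre, this reads each row only once, which matters little at this size but scales better.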

## In the Apple App Store

In [34]:
print (apple_apps_1[0])
for row in apple_free [0:5]:
    print (row)
    print ('\n')

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']




In [35]:
def freq_table_for_genre (dataset, index): #new function that returns raw counts instead of percentages
    freq_table = {}
    for row in dataset:
        row_data_point = row[index]
        if row_data_point in freq_table:
            freq_table[row_data_point] +=1
        else:
            freq_table[row_data_point] =1
    
    return freq_table

print ('Average download per category')
print ('\n')
genres_apple = freq_table_for_genre (apple_free,11) #frequency table for the prime_genre column 

for genre in genres_apple: 
    total = 0 #variable will store the sum of rating counts (our proxy for downloads)
    len_genre = 0 #variable will store the number of apps specific to each genre
    for app in apple_free:
        genre_app = app [11]
        if genre_app == genre:
            n_ratings = float (app[5]) #rating count for all versions, "rating_count_tot"
            total += n_ratings #add this app's rating count to the total variable
            len_genre += 1 #incrementing len_genre (number of apps in this genre) by 1
    average = total/len_genre
    print (genre,': ',average)
 

Average download per category


Social Networking :  71548.34905660378
Photo & Video :  28441.54375
Games :  22788.6696905016
Music :  57326.530303030304
Reference :  74942.11111111111
Health & Fitness :  23298.015384615384
Weather :  52279.892857142855
Utilities :  18684.456790123455
Travel :  28243.8
Shopping :  26919.690476190477
News :  21248.023255813954
Navigation :  86090.33333333333
Lifestyle :  16485.764705882353
Entertainment :  14029.830708661417
Food & Drink :  33333.92307692308
Sports :  23008.898550724636
Book :  39758.5
Finance :  31467.944444444445
Education :  7003.983050847458
Productivity :  21028.410714285714
Business :  7491.117647058823
Catalogs :  4004.0
Medical :  612.0


Observations:
- Based on the data, the genre with the highest average number of user ratings per app (our proxy for downloads) is `Navigation`.
    - This is likely due to the small number of navigation apps people use: Apple Maps, Google Maps, Waze, Citymapper, etc. Some of these may come pre-installed, which would increase the average.
- 2nd - `Reference`, the category with the second-highest average per app.
- 3rd - `Social Networking`
    - This is likely due to the small number of social media apps people use: Facebook, Instagram, Snapchat, Messenger, WhatsApp. Some of these may come pre-installed, which would increase the average.
- 4th - `Music`
- 5th - `Weather`


It is worth taking a deep dive into the `Reference` category, as it is unlikely we will create an app for the other top categories due to their high level of concentration and saturation.

In [36]:
for app in apple_free:
    if app [11] == 'Reference': 
        print (app[1],':', app[5]) #printing the app name and its total rating count (our proxy for downloads)

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0
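
Because a handful of apps account for most of the ratings, the mean can overstate what a typical `Reference` app achieves. As a rough illustration, the sketch below compares the mean with the median of the rating counts listed above, using Python's standard `statistics` module. The variable names are assumptions.

```python
import statistics

# Rating counts for the free `Reference` apps, copied from the output above.
reference_rating_counts = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]

mean_count = statistics.mean(reference_rating_counts)
median_count = statistics.median(reference_rating_counts)

print('mean   :', round(mean_count, 2))
print('median :', median_count)
```

The mean matches the `Reference` average reported earlier (about 74,942), while the median is far lower, which suggests the category average is driven by its few largest apps.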


We can infer from the app names in the `Reference` category that these apps are mostly associated with reference books. Let us investigate this 'parallel' category - `Book`. 

In [37]:
for app in apple_free:
    if app [11] == 'Book':
        print (app[1],':', app[5])

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0
謎解き : 0
謎解き2016 : 0


While the `Book` category is concentrated around `Kindle` & `Audible`, which are e-reading and audiobook platforms, `Color Therapy Adult Coloring Book for Adults` has a significant number of downloads by our metric. This shows there may be potential for a children's or family-focused app which provides access to book-based content in a playful way. It could be similar to a Kindle interface with a library of books, with added gamification elements such as colouring and quizzes (for kids), and quotes, analysis, and key points (for adults). This would enable us to tap into both the children's and adult markets simultaneously, increasing our total potential number of users.

Let us turn our attention to the Android Dataset and look at the most downloaded types of apps and where there may be potential.

## In the Android App Store 

The `Installs` column which has data about downloads is categorical in the Android dataset.

In [38]:
display_table (android_free, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


In order to compute the average number of downloads for each category, we will treat each 'tranche' of downloads (e.g. `1,000,000+` or `100,000+`) as the **actual** number of downloads. This won't be accurate, but it will give us an estimate of which apps are most downloaded, the metric we are after.
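
The conversion described above, from a tranche label to a number, can be sketched as a small helper. The function name `installs_to_number` is an assumption; it strips the thousands separators and the trailing `+` and treats the result as a lower bound on the real install count.

```python
def installs_to_number(label):
    """Convert an 'Installs' tranche such as '1,000,000+' to the integer 1000000."""
    cleaned = label.replace(',', '').replace('+', '')
    return int(cleaned)

for label in ['1,000,000+', '100,000+', '0+']:
    print(label, '->', installs_to_number(label))
```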

In [39]:
android_apps_1[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [40]:
print ('Average download per category') 
categories_android = freq_table (android_free,1)

for category in categories_android: #looping over our frequency table
    total = 0 #variable will store the sum of 'Installs' tranches
    len_category = 0 #variable will store the number of apps specific to each category
    for app in android_free:
        app_category = app [1] #the app category is the 'Category' column, index [1]
        if app_category == category:
            installs = app[5] #we are looking for the 'Installs' column, index [5]
            installs = installs.replace('+','') #removing the '+' at the end of the string
            installs = installs.replace(',','') #removing the ',' separators within the string
            installs = float (installs) #converting the 'Installs' tranche to a float 
            total += installs +1 #add up installs as a float; each 'N+' tranche is counted as N + 1
            len_category += 1 #incrementing len_category (number of apps in this category) by 1
    average = int(total/len_category) #converting to int to avoid decimals
    print (category,':', average)

Average download per category
ART_AND_DESIGN : 1986336
AUTO_AND_VEHICLES : 647318
BEAUTY : 513152
BOOKS_AND_REFERENCE : 8767812
BUSINESS : 1712291
COMICS : 817658
COMMUNICATION : 38456120
DATING : 854029
EDUCATION : 1833496
ENTERTAINMENT : 11640706
EVENTS : 253543
FINANCE : 1387693
FOOD_AND_DRINK : 1924898
HEALTH_AND_FITNESS : 4188822
HOUSE_AND_HOME : 1331541
LIBRARIES_AND_DEMO : 638504
LIFESTYLE : 1437817
GAME : 15588016
FAMILY : 3697849
MEDICAL : 120551
SOCIAL : 23253653
SHOPPING : 7036878
PHOTOGRAPHY : 17840111
SPORTS : 3638641
TRAVEL_AND_LOCAL : 13984078
TOOLS : 10801392
PERSONALIZATION : 5201483
PRODUCTIVITY : 16787332
PARENTING : 542604
WEATHER : 5074487
VIDEO_PLAYERS : 24727873
NEWS_AND_MAGAZINES : 9549179
MAPS_AND_NAVIGATION : 4056942


**Observations**
- `COMMUNICATION` has the highest average, with `38,456,120` downloads per app.
- `SOCIAL` is also among the highest, with `23,253,653` per app.
    - This is likely due to a similar phenomenon to the one in the Apple Dataset, where a few of the popular social media apps are downloaded by most people.
- `GAME` is also high, with `15,588,016` per app.
- `BOOKS_AND_REFERENCE` has an intriguingly high `8,767,812` downloads per app, which makes this category worth investigating in the Android store too.
- `EDUCATION` has a moderately high `1,833,496` downloads per app.
- `FAMILY`, the most common category by number of apps, has only `3,697,849` downloads per app, suggesting that this category is likely saturated.
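
To compare these figures at a glance, the averages quoted in the observations can be ranked from highest to lowest. The dictionary below re-enters those six values by hand from the output above; the variable names are assumptions.

```python
# Average installs per app for the categories discussed above,
# copied from the 'Average download per category' output.
average_installs = {
    'SOCIAL': 23253653,
    'COMMUNICATION': 38456120,
    'GAME': 15588016,
    'BOOKS_AND_REFERENCE': 8767812,
    'EDUCATION': 1833496,
    'FAMILY': 3697849,
}

ranked = sorted(average_installs.items(), key=lambda pair: pair[1], reverse=True)
for category, average in ranked:
    # The ',' format specifier adds thousands separators for readability.
    print(f'{category} : {average:,}')
```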

Let's analyze the `BOOKS_AND_REFERENCE` category further by looking at the top downloaded apps and what they are. Based on our analysis of the iOS app store data, we may see a similar opportunity. It is wor

In [41]:
for app in android_free:
    if app [1] == 'BOOKS_AND_REFERENCE':
        print (app[0],':',app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This list is too large for us to make observations. We can isolate the apps with the most downloads by adding conditions so that only apps with a large value in the `Installs` column, such as `1,000,000,000+` and `500,000,000+`, are displayed.

In [42]:
for app in android_free:
    if app [1] == 'BOOKS_AND_REFERENCE' and (app [5] == '1,000,000,000+' 
                                             or app [5] == '500,000,000+' 
                                             or app [5] == '100,000,000+' 
                                             or app [5] == '50,000,000+'):
        print (app[0],':',app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+
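
The chain of `or` conditions above could alternatively be written as a single numeric comparison. In this sketch the helpers `installs_to_number` and `apps_at_or_above`, the threshold value, and the short sample list are assumptions for illustration; the sample pairs are taken from the outputs shown in this section.

```python
def installs_to_number(label):
    """Convert an 'Installs' tranche such as '50,000,000+' to an integer."""
    return int(label.replace(',', '').replace('+', ''))

def apps_at_or_above(apps, threshold):
    """Return (name, installs label) pairs whose tranche is >= threshold."""
    return [(name, label) for name, label in apps
            if installs_to_number(label) >= threshold]

# A few (app name, Installs) pairs from the BOOKS_AND_REFERENCE category.
sample_apps = [
    ('Wikipedia', '10,000,000+'),
    ('Google Play Books', '1,000,000,000+'),
    ('Book store', '1,000,000+'),
    ('Amazon Kindle', '100,000,000+'),
]

for name, label in apps_at_or_above(sample_apps, 50_000_000):
    print(name, ':', label)
```

A threshold avoids having to list every tranche string by hand, so a tranche that was overlooked cannot silently drop out of the results.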


We can observe that this category is also quite concentrated, but the high number of downloads could suggest people do go to the app stores to find book-related apps. `Google Play Books`, with `1,000,000,000+` downloads, appears to be an outlier.

Let's take a look at apps across categories with high downloads (`1,000,000,000+`) and see if we can find a pattern among them. 

In [43]:
#checking apps with 1,000,000,000+ in the Android dataset 
for app in android_free:
    if app [5] == '1,000,000,000+':                              
        print (app[0],':',app[5])

Google Play Books : 1,000,000,000+
WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Subway Surfers : 1,000,000,000+
Facebook : 1,000,000,000+
Google+ : 1,000,000,000+
Instagram : 1,000,000,000+
Google Photos : 1,000,000,000+
Maps - Navigate & Explore : 1,000,000,000+
Google Street View : 1,000,000,000+
Google : 1,000,000,000+
Google Drive : 1,000,000,000+
YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
Google Play Games : 1,000,000,000+
Google News : 1,000,000,000+


When looking at the apps that have `1,000,000,000+` downloads, we find that they are mostly Google apps that likely come pre-installed on Android phones. Without more granular analysis, we should not draw conclusions about what these download figures indicate, as they are likely not organic.

We will therefore take a look at the apps with fewer downloads to find more actionable insights from the data.

In [44]:
for app in android_free:
    if app [1] == 'BOOKS_AND_REFERENCE' and (app [5] == '10,000,000+' 
                                             or app [5] == '5,000,000+' 
                                             or app [5] == '1,000,000+'):                                     
        print (app[0],':',app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

**Observations**

We can observe from this list that there is a significant number of apps connected to religious texts, translation, and educational material such as dictionaries. This presents a potential opportunity to develop an app in this genre for the Android app store as well.

### The Proposed Solution

As a reminder, we found in the Apple app store dataset that `Book` was an interesting category to pursue, although the store is dominated by apps created for entertainment purposes. The `BOOKS_AND_REFERENCE` category from the Android app store is also worth exploring. Apps with content related to religious texts, translation, and educational material have quite significant download numbers.

We can combine these observations and patterns to recommend the development of an app which would offer a library of books or texts with the potential to bring the content to life in an engaging and playful way. This approach takes into account the high proportion of entertainment apps in both app stores, as well as the higher percentage of utility apps in the Android app store, which is the one that will be used for prototyping.

Texts targeted at a younger audience could have 'gamification' elements such as:
- colouring
- light quizzes
- shortened versions

and for adult and older audiences the new ways to bring content to life could be:
- quotes
- analyses
- key points 
- taught classes to use frameworks from the book.

The proposed solution to the development team would be to create an app that acts as a 'gamification engine', along with visually engaging User Interface elements, to bring to life the content from books or texts. The books' authors or individuals with knowledge about religious texts or languages could create and contribute the content. This would remove the need for in-house topic experts to create content while providing creators with a way to create gamified and interactive content through the app about a topic they know well. In the current mobile app market, with high competition and high marketing costs to get impressions, this solution could be more profitable than building standalone apps, which could have high fixed and variable costs to develop and market.

There is potentially a hardware ecosystem element to this solution. A colouring book could be immersive for families with smart pens, for example. People with smartwatches could also get daily nudges in the form of quotes or key points. This ensures we have a value proposition with which users can make the most of the ecosystem they have invested in.

The monetization element could come from affiliate links and from selling the source books and related merchandise (classes, lectures, etc.) on which each gamification element is based. This could be coupled with 'recommended books' based on user baskets and related purchases. These revenue sources will hopefully provide educational value for the app users. The intended educational value would shield our users from the undesirable effects of in-app purchases seen in the mobile gaming industry, such as spending on in-game coins and elements that may not provide lasting educational value. The diversity of in-app purchases also aligns with the recommendation of the Forbes author to provide different options for users in order to make the app more attractive for repeat purchases.

In addition to `Reference` and `Book`, this app proposal could be attributed to multiple categories such as games, education, entertainment, and family (in the Google Play Store). This would give it more exposure and make it present in categories with significant download volumes.

During the analysis of each genre in each app store, we observed that a small number of apps are downloaded disproportionately more than the rest in each category. This may indicate the concentrated, 'winner-take-all' nature of these markets. Furthermore, this suggests a certain saturation of the mobile app market. While there is potential to create a unique mobile app, a level of specialisation in content creation or app mechanics may be required to make it unique and attractive for users. We can look at Apple's 'Minimum Functionality' policy as potential evidence of the industry's response to the high volume of apps being created, as it requires a certain degree of uniqueness for an app to be listed on the App Store:
> "Your app should include features, content, and UI that elevate it beyond a repackaged website. If your app is not particularly useful, unique, or “app-like,” it doesn’t belong on the App Store. If your App doesn’t provide some sort of lasting entertainment value or adequate utility, it may not be accepted." [Apple, 2023](https://developer.apple.com/app-store/review/guidelines/#minimum-functionality:~:text=4.1%20Copycats-,4.2%20Minimum%20Functionality,-4.3%20Spam)

This guideline indicates the need to offer users a novel experience in terms of both content and functionality.

The proposed solution would leverage book authors and topic experts to act as 'content specialists' who could provide content in a novel way. This would help us combat content fatigue and provide a unique experience for users.

---

## Conclusion


In conclusion, a recommendation for a free-to-download app format was provided based on the analysis of the genres of the most downloaded apps. The recommendation is to create an app that can bring to life text content from books, religious texts, or educational material. We observed that the free-to-download app market is quite concentrated and saturated, so an original idea in terms of functionality and value-add for users is key to making the app popular and also to getting it approved on the app stores.


In future research projects using this dataset, an area of interest to explore is how price affects the number of downloads an app gets, and possible recommendations about paid apps to develop. This analysis would help us make a recommendation for a company looking to capitalise on the app purchase revenue model. In terms of methods, this dataset could benefit from visual analysis using plots and charts to illustrate the patterns observed. 