# Profitable App Profiles for the App Store and Google Play Markets

This project will simulate working as data analysts for a company that builds both Android and iOS mobile apps, with these apps being available on Google Play and the App Store, respectively. We are only building apps that are free to install and download, so the main source of revenue will be in-app ads. The more users who see and engage with the ads, the better our revenue.

The goal of this project is to use the data to help our developers understand what types of apps are likely to attract more users.

## Opening and Exploring the Data

In [None]:
from csv import reader

# Google Play dataset
opened_file = open('googleplaystore.csv', encoding='utf8')  # explicit encoding avoids platform-dependent decoding errors
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# App Store dataset
opened_file = open('AppleStore.csv', encoding='utf8')  # explicit encoding avoids platform-dependent decoding errors
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
explore_data(android, 0, 3, True)  # the function prints on its own; wrapping it in print() would also print None

In [None]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

## Deleting Wrong Data

In [None]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

In [None]:
print(len(android))
del android[10472] # don't run this more than once
print(len(android))

## Removing Duplicate Entries

In [None]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

---
We see that the Google Play dataset has duplicate entries; some of them are printed above to confirm this.

We don't want to count any app more than once when we analyze the data, so we need to remove the duplicates and keep only one entry per app. We could remove duplicate rows at random, but there is a better way. Taking Instagram as an example, the *fourth position (index 3) of each row corresponds to the number of reviews*, and the duplicate rows show different numbers there. The different numbers indicate that the data was collected at different times.

We can use the number of reviews at index 3 as the criterion for removing duplicates: the higher the number of reviews, the more recent the data should be. For any given app, we will keep only the row with the highest number of reviews, so that we keep the most recent data, and remove the other entries.
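
To see why the review count works as a tie-breaker, the duplicate rows of a single app can be printed side by side. The sketch below uses a small made-up sample: `sample_rows`, its values, and the helper `rows_for_app` are illustrative and not taken from the dataset. It assumes the app name sits at index 0 and the review count at index 3, as described above. In the notebook, the same helper could be called on `android` with `'Instagram'` as the name.

```python
# Hypothetical sample rows: name at index 0, number of reviews at index 3.
# The values are invented for illustration only.
sample_rows = [
    ['App A', 'SOCIAL', '4.5', '1000'],
    ['App B', 'TOOLS', '4.1', '250'],
    ['App A', 'SOCIAL', '4.5', '1200'],
    ['App A', 'SOCIAL', '4.5', '1100'],
]

def rows_for_app(dataset, app_name):
    """Return every row whose name (index 0) equals app_name."""
    return [row for row in dataset if row[0] == app_name]

duplicates = rows_for_app(sample_rows, 'App A')
for row in duplicates:
    print(row)

# The entry with the most reviews is assumed to be the most recent one.
most_recent = max(duplicates, key=lambda row: float(row[3]))
print('Kept:', most_recent)
```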

---
To remove the duplicates, we will do the following:
1. Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
2. Use the information stored in the dictionary and create a new dataset, which will have only one entry per app. For each app, we'll only choose the entry with the highest number of reviews.

In [None]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

---
We use the dictionary created above to remove the duplicate rows.
* Start by creating two empty lists: ```android_clean``` (which will store our new cleaned dataset) and ```already_added``` (which will store just the app names).
* Loop through the Google Play dataset (without the header row), and for each iteration, do the following:
    * Assign the app name to a variable named ```name```.
    * Convert the number of reviews to ```float```, and assign it to a variable named ```n_reviews```.
    * If ```n_reviews``` is the same as the maximum number of reviews of the app ```name``` (the number can be found in the ```reviews_max``` dictionary) **and** ```name``` is not already in the list ```already_added```:
        * Append the entire row to the ```android_clean``` list (which will eventually be a list of lists and store our cleaned dataset).
        * Append the name of the app ```name``` to the ```already_added``` list; this helps us keep track of the apps we have already added.

Explore the ```android_clean``` dataset to ensure everything went as expected. The dataset should have 9,659 rows.

In [None]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block

# we check to see if rows equals 9659
explore_data(android_clean, 0, 3, True) 

---

Only the Google Play dataset has duplicate entries. The App Store data does not have duplicates - this can be checked using the ```id``` column (not the ```track_name``` column).
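
One way to run that check is to count repeated values in a given column. The sketch below is a minimal version under stated assumptions: `count_duplicates` is a hypothetical helper, and `sample_ios` is invented data with the ID in the first column and the name in the second. In the notebook, the call would use `ios` and whichever index holds the ```id``` column in `ios_header`.

```python
def count_duplicates(dataset, index):
    """Count rows whose value at `index` already appeared in an earlier row."""
    seen = set()
    n_duplicates = 0
    for row in dataset:
        value = row[index]
        if value in seen:
            n_duplicates += 1
        else:
            seen.add(value)
    return n_duplicates

# Hypothetical sample: two different apps share a name but have distinct IDs.
sample_ios = [
    ['101', 'Chess'],
    ['102', 'Chess'],
    ['103', 'Weather'],
]

print('Duplicates by id:', count_duplicates(sample_ios, 0))
print('Duplicates by name:', count_duplicates(sample_ios, 1))
```

In this toy sample, checking by name reports a duplicate even though the IDs differ, which is presumably why the ID is the safer column to check.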

## Removing Non-English Apps

Remember that we use English for the apps we develop at our company. We are not interested in keeping apps that are not made for English speakers, so we'll remove them. We can do this by removing each app whose name contains a symbol that isn't common in English text. Common symbols are letters of the English alphabet, numbers composed of digits 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Each character we use in a string has a corresponding number associated with it.
* The number for the character ```'a'``` is 97, and for ```'A'``` it is 65. We can get this number using the [```ord()``` built-in function](https://docs.python.org/3/library/functions.html#ord).

The numbers corresponding to the characters we normally use in an English text are all in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system.
* We can build a function that checks whether every character in an app name has a number equal to or less than 127. If so, the name is likely English, and the app is probably made for English speakers.
* We can use indexing to select an individual character, and we can also iterate over the string using a for loop.
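
The points above can be tried out in a few lines. This is only a small demonstration; the string `name` is an arbitrary example chosen here, not a row from either dataset.

```python
# ord() maps a single character to its Unicode code point.
print(ord('a'))  # 97
print(ord('A'))  # 65

# Indexing selects one character; a for loop visits each character in turn.
name = 'Docs™'
print(name[0], ord(name[0]))

for character in name:
    # Code points 0 to 127 make up the ASCII range.
    print(character, ord(character), ord(character) <= 127)
```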

In [None]:
# We will first try to write a function, then we'll remove the rows
# corresponding to the non-English apps.
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

In [None]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

The function above works, but some English app names use emojis or other symbols that are outside the ASCII range. This might cause us to remove English apps by mistake and lose useful data.

In [None]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

To minimize this error, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This isn't a perfect filter function, but it should be effective enough.

We change the function as shown below: if the input string has more than three characters outside the ASCII range (0 - 127), the function returns ```False```; otherwise it returns ```True```.

In [None]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    # English if at most three characters fall outside the ASCII range
    return non_ascii <= 3

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

We use the new function to filter out non-English apps from both datasets. Loop through each dataset. If an app name is identified as English, append the whole row to a separate list.

In [None]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

---
## Isolating the Free Apps

So far, we have done the following:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

As mentioned in the beginning, we're focused on free apps, so we need to isolate them for our analysis. This is the last step in our data cleaning process; after it, we will start analyzing the data.

In [None]:
android_free = []
ios_free = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)

print(len(android_free))
print(len(ios_free))

---
## Most Common Apps by Genre

Another objective stated in the beginning was that we wanted to focus on the kinds of apps that are likely to attract users, as the number of people engaging with our apps will affect our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and then add it to the App Store.

The end goal is to add the app on both platforms, so we need to find app profiles that are successful in both markets. To begin our analysis of the most common genres for each market, we'll need to build frequency tables for a few columns in our datasets.

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Now we will analyze the App Store dataset first, then the Google Play dataset.

In [None]:
display_table(ios_free, -5)

---
For the iOS apps, the most common genres are **Games** at 58.2% and **Entertainment** at 7.9%. This dataset seems to point towards more apps being geared towards entertainment than towards practical applications.

In [None]:
display_table(android_free, 1) #for Google Play category column
print('\n')
display_table(android_free, 9) #for Google Play genre column

---
When looking at the **Category** column for the Android dataset, we see that **Family** and **Game** are the top two results, coming in at 18.9% and 9.7% respectively. In contrast to the results seen from the iOS dataset, the more popular apps on Google Play appear to cater more towards practical purposes as opposed to entertainment.

The **Genre** column here also further supports this observation, with **Tools** being the most popular genre at 8.4% and **Entertainment** following behind at 6.1%. The Google Play store appears to be more balanced as opposed to the App Store, which seems to largely be geared towards fun and entertainment.

It is interesting to note that neither the Categories nor the Genres show an overwhelming majority of apps geared towards one kind of app. This may be because the Google Play store offers a more diverse range of applications than its Apple counterpart.

---
## Most Popular Apps by Genre on the App Store

Now we would like to see the apps with the most users. In the Google Play dataset, we can see this under the ```Installs``` column. However, the App Store dataset does not have this information. As a workaround, we will take the total number of user ratings as a proxy, which can be found in the ```rating_count_tot``` column.

We start by finding the average number of user ratings per app genre on the App Store. To do so, we need to do the following:
* Isolate the apps of each genre
* Add up the user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

To do this, we will use a for loop inside another for loop, i.e. a **nested loop**.

In [None]:
genres_ios = freq_table(ios_free, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps.

In [None]:
for app in ios_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings
print('\n')        
for app in ios_free:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

The same pattern applies to social networking apps and music apps.

Our goal is to find popular genres, but navigation, social networking, or music apps might seem more popular than they really are. We could get a better picture by removing the extremely popular apps in each genre and then reworking the averages.
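
A rough sketch of that idea follows. The helper `average_below` and the cutoff value are assumptions made for illustration (the notebook does not fix a threshold), and `sample_counts` is invented data rather than real rating counts.

```python
def average_below(values, cutoff):
    """Average the values strictly below `cutoff`; return None if none qualify."""
    kept = [value for value in values if value < cutoff]
    if not kept:
        return None
    return sum(kept) / len(kept)

# Hypothetical rating counts for one genre: one giant app and three small ones.
sample_counts = [1000000.0, 300.0, 200.0, 100.0]

plain_average = sum(sample_counts) / len(sample_counts)
trimmed_average = average_below(sample_counts, 100000)

print('Plain average:', plain_average)
print('Average without the giant app:', trimmed_average)
```

In this toy sample, a single very large app pulls the plain average far above what a typical app in the genre receives, which is the effect described above.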

In the block above, reference apps average 74,942 user ratings, but it is the Bible and Dictionary.com that skew up this average.

There is potential in this niche: we could take a popular book and turn it into an app with extra features besides the book itself. The market seems saturated with apps meant for fun, so a practical app might be able to stand out more in the App Store. The other categories listed below do not seem to fit the scope of our analysis:
* Weather apps - people do not spend as much time on these and so the ad revenue can be low.
* Food and Drink - we would need to work with another company and the overhead costs for starting and running the app may be more than the revenue generated from ads alone.
* Finance apps - we would require domain knowledge and a financial consultant/expert just to build an app.

---
## Most Popular Apps by Genre on Google Play

When we look at Google Play, most install values are open-ended (for example, 100,000+), so the install numbers are not precise.


In [None]:
display_table(android_free, 5) # the Installs column

One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to ```float``` — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
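
The conversion on its own can be sketched as a small helper. `parse_installs` is a hypothetical name and not part of the original notebook, which performs the same replacements inline.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)

print(parse_installs('100,000+'))
print(parse_installs('1,000,000+'))
print(parse_installs('0'))
```

Wrapping the conversion in a function would also avoid repeating the two `replace` calls wherever install counts are compared.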

In [None]:
categories_android = freq_table(android_free, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Gmail) and some others with over 100 and 500 million installs.

In [None]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

If we removed all the communication apps with over 100 million installs, the average would drop to roughly a tenth of its current value.

In [None]:
under_100_m = []

for app in android_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

We saw in the earlier App Store analysis that Games may be heavily saturated, but books seemed to do well. We can explore this here as well, as ```BOOKS_AND_REFERENCE``` apps have an average of over 8 million installs.

In [None]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

There is a variety of apps in this category, but a number of very popular apps still skew the average.

In [None]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

This niche is dominated by apps for processing and reading ebooks, so perhaps it's best to steer clear of these. We also notice quite a few apps for the Quran, suggesting that building an app around a popular book can be profitable.

It seems that taking a popular book (perhaps something more recent) and turning it into an app could be profitable for both the Google Play and App Store markets. There already appear to be plenty of library apps, so we will need to add some special features to make our app more appealing.