<h1>Analysis of App Store Profiles to Maximize Profit</h1>

<p>
The goal of this project is to find the mobile app profile, in Google and Apple app stores, that will attract the most revenue. With capital to invest into development of a mobile app, we will use the results of this study to influence our recommendations to developers on how to best direct their efforts to build the most profitable mobile app. 
</p>

<p>
A condition of this study is that we seek to develop only mobile apps which are free to download and install. Any revenue would be generated through in-app purchases. This implies our hypothesis that the revenue generated by our app would be positively correlated with the number of users (perhaps more specifically, active ones) of our app. 
</p>

We are doing this project in pure Python with no assistance from libraries like Pandas.

<h2>Overview of Datasets</h2>

<p>The datasets we will use for this study include one each from the <a href='https://www.kaggle.com/lava18/google-play-store-apps'>Google Play</a> store and the <a href='https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps'>Apple app</a> store. With 7,000 to 10,000 apps in each of these files, we are using a sample out of the several million apps available overall.</p>

First, let's import the data for both files.

In [1]:
from csv import reader

## Apple App Store dataset
# 'with' closes the file automatically once the rows have been read
with open('AppleStore.csv', encoding="utf8") as open_file:
    apple = list(reader(open_file))
apple_header = apple[0]
apple = apple[1:]

## Google Play dataset
with open('googleplaystore.csv', encoding="utf8") as open_file:
    google = list(reader(open_file))
google_header = google[0]
google = google[1:]

Now that we have loaded both datasets into lists of rows in our notebook, we will create a function to automate the exploration of the data.

In [2]:
## Function to explore each dataset where 'start' and 'end' are used to slice the 'dataset'
def explore_data(dataset, start, end, rows_columns = False):
    ds_slice = dataset[start:end]
    for row in ds_slice:
        print(row)
        print('\n')

    if rows_columns:
        print(" Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

print(google_header)
print('\n')
explore_data(google, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


 Number of rows:  10841
Number of columns:  13


Note that in the first code cell we separated the header from the data, so we print it on its own before exploring the first few rows of the Google Play dataset. The results show the Google dataset has 10,841 rows (or apps) and 13 columns of attributes for each app.

Now let's explore the Apple dataset:

In [3]:
print(apple_header)
print("\n")
explore_data(apple, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


 Number of rows:  7197
Number of columns:  17


The Apple dataset has 7,197 rows/apps and 17 columns of attributes. 

<h2>Cleaning the Datasets</h2>

The Google dataset has a known error based on this <a href='https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015'>discussion thread</a>. So we need to remove row 10,472:

In [6]:
print(len(google))
del google[10472]
print(len(google))

10840
10839
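
One thing to watch: `del google[10472]` removes whatever row currently sits at that index, so re-running the cell deletes a second, valid row. The printed lengths above start at 10,840 rather than the 10,841 rows reported earlier, which is consistent with the cell having been run more than once. Below is a minimal sketch of a re-runnable alternative, assuming the faulty row can be recognized by having a different number of fields than the header. The helper name `drop_malformed_rows` and the toy rows are hypothetical and not part of the original notebook.

```python
def drop_malformed_rows(rows, n_columns):
    """Return only the rows that have exactly n_columns fields."""
    return [row for row in rows if len(row) == n_columns]


# Toy data (hypothetical): the second row is missing a field
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

cleaned = drop_malformed_rows(rows, len(header))
# Applying the filter a second time changes nothing
cleaned_again = drop_malformed_rows(cleaned, len(header))
print(len(rows), len(cleaned), len(cleaned_again))
```

Because the filter depends on the content of each row rather than on its position, running it repeatedly cannot remove additional rows.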


<h3>Removing Duplicate Records</h3>

The loop below counts the duplicate records in the Google dataset based on the name of the app (the 'App' column).

In [9]:
unique_apps = []
duplicate_apps = []

for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of Duplicate Apps: ', len(duplicate_apps))

Number of Duplicate Apps:  1180


So in the Google dataset there appear to be 1,180 duplicates based on the 'App' column, which holds the name of the app. Some examples are listed below.

In [10]:
print(duplicate_apps[0:5])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


In order to remove the duplicate rows, we need to consider how each of the duplicate records may differ outside of the 'App' column. Let's take a look at the set of duplicates for the app 'Quick PDF Scanner + OCR FREE' and see if there are differences.

In [15]:
for app in google:
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


We can see in the example above that column index 3, the number of reviews, differs in the third record. So let's build a dictionary that stores the highest review count for each app, on the assumption that the record with the most reviews is the most recent. The other duplicate records will be discarded.

In [19]:
reviews_max = {}
for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


Previously we found 1,180 duplicates. Subtracting that from the number of rows in the dataset (duplicates included) should give the number of unique apps, which is the number of entries in reviews_max. Below we see that these figures match.

In [20]:
print('Length of dataset originally: ', len(google) - 1180)
print('Length of new dictionary: ', len(reviews_max))

Length of dataset originally:  9659
Length of new dictionary:  9659


Now, we need to remove those duplicates using the reviews_max dictionary from above. We only want to keep the duplicate with the highest number of reviews/ratings. 

In [24]:
google_clean = []  # one row per app: the row with the highest number of reviews
already_added = set()  # names already kept, so rows tied on review count are added only once

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(app)
        already_added.add(name)

print(len(google_clean))

9659


We can see there are now 9,659 rows in the google_clean list of lists, which matches the number of entries in reviews_max. This means we have successfully removed the duplicate rows and kept, for each unique app, only the row with the highest number of reviews.
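
The two passes above (first build `reviews_max`, then filter) can also be folded into one. The following is only a sketch of that alternative, not the approach the notebook uses: it stores the best row per name directly in a dictionary. The function name `keep_most_reviewed` and the toy rows are hypothetical. Note that the result is ordered by the first appearance of each name, which may differ from the row order in `google_clean`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict '>' keeps the first row seen when review counts are tied
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy data (hypothetical values) in the same column layout as the Google dataset
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
]

deduplicated = keep_most_reviewed(toy_rows)
for row in deduplicated:
    print(row)
```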

Fortunately there were no duplicates in the Apple dataset, so we only needed to remove duplicates from the Google dataset.

<h3>Remove non-English Records</h3>

Since we will only be developing English apps, we need to remove the non-English apps from these lists. English text uses characters in the ASCII range, whose code points (as returned by ord()) are 127 or lower. However, some characters that appear in English app names, such as emojis, dashes or trademark symbols, have code points above 127. So to minimize data loss we will only remove an app if its name has more than three non-ASCII characters.

In [42]:
def eng_char(string):
    non_ascii = 0 # counter for number of non-Ascii chars in string
    
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
            if non_ascii > 3:
                return False
    
    return True

print(eng_char('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_char('hello'))
print(eng_char('LOL 😜'))

False
True
True


The above function works, but it will not catch everything. In particular, names in languages written mostly with ASCII letters, such as Spanish, will pass the filter. However, this is good enough for the purposes of this workbook.
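
To make that limitation concrete, here is a small sketch that counts the non-ASCII characters in a few names and applies the same threshold of three. The helper `count_non_ascii` and the first two sample names are made up for illustration; the third is the test string used above.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above the ASCII range."""
    return sum(1 for char in text if ord(char) > 127)


sample_names = [
    'Fútbol Español',                  # Spanish name, mostly ASCII letters
    'Docs To Go™ Free Office Suite',   # English name with a trademark symbol
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in sample_names:
    n = count_non_ascii(name)
    verdict = 'kept' if n <= 3 else 'removed'
    print(name, '->', n, 'non-ASCII characters,', verdict)
```

The Spanish example stays under the threshold, so a filter of this kind keeps it even though the app is not in English.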

Below we apply our new function to both datasets. The resulting row counts are lower, showing that several apps were removed from each dataset.

In [46]:
google_english = []
apple_english = []

for app in google_clean:
    name = app[0]
    if eng_char(name):
        google_english.append(app)

for app in apple:
    name = app[2]
    if eng_char(name):
        apple_english.append(app)

explore_data(google_english, 0, 2, True)
explore_data(apple_english, 0, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


 Number of rows:  9614
Number of columns:  13
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


 Number of rows:  6183
Number of columns:  17


<h3>Isolate the Free Apps</h3>

Next we need to isolate only the free apps since that is the app profile we are analyzing. 

In [48]:
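# Google: column 6 ('Type') is 'Free' for free apps
# Apple: column 5 ('price') is '0' for free apps

final_google = []
final_apple = []

for app in google_english:
    app_type = app[6]  # renamed from 'type' to avoid shadowing the built-in
    if app_type == 'Free':
        final_google.append(app)

for app in apple_english:
    price = app[5]
    if price == '0':
        final_apple.append(app)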
        
print(len(final_google))
print(len(final_apple))

8863
3222


So we now have 8,863 apps in the Google Play dataset and 3,222 apps in the App Store dataset after keeping only the free apps.
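
The Apple filter compares the price to the exact string '0'. That works for this file, but it would miss a free app whose price was written as '0.0' or '$0' in another export. A hedged sketch of a numeric check follows; the helper name `is_free` is hypothetical, and stripping a leading dollar sign is an assumption about how prices might be formatted.

```python
def is_free(price_text):
    """Return True when the price string represents zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as "not free"
        return False


for price in ['0', '0.0', '$0', '3.99', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```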

This is enough for data cleaning. Now we can move onto analysis of the datasets.

<h2>Data Analysis</h2>

We want our app to be profitable in both app stores so we need to focus on apps that are popular in both stores together.

First, we will analyze the most popular apps by genre, building frequency tables.

<h3>Highest Number of Available Apps by Genre</h3>

In [51]:
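def freq_table(dataset, index):
    counts = {}  # renamed from 'dict' to avoid shadowing the built-in
    total = 0
    
    # count how many rows hold each distinct value in the chosen column
    for row in dataset:
        total += 1
        value = row[index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1
        
    # convert the raw counts into percentages of all rows
    percentages = {}
    for key in counts:
        percentages[key] = (counts[key] / total) * 100
    
    return percentages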

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
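
As an aside, the tuple-swapping step in `display_table` can be avoided by sorting the dictionary items with a key function. This is only a sketch of that alternative, with a hypothetical helper name and made-up percentages. Entries with equal values may print in a different order than with the tuple approach.

```python
def sorted_by_value(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Toy frequency table (made-up percentages)
sample = {'Games': 58.2, 'Weather': 0.9, 'Entertainment': 7.9}

for key, value in sorted_by_value(sample):
    print(key, ':', value)
```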

The frequency table for the prime_genre column of the Apple dataset is shown below. Games stand out at 58% of the free English apps in our App Store sample, well ahead of the second most common genre, Entertainment (about 8%). Being the most common genre does not mean it has the most downloads or users.

In [54]:
display_table(final_apple, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Below are the frequency tables for the Google Play dataset, first by the Category column and then by the Genres column.

In [58]:
display_table(final_google, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

In [57]:
display_table(final_google, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

It appears that the Genre column in Google may just be a breakout of sub-groups of the Category column. This is evident further down the frequency table for the Google Genre results where we see what seem to be sub-categories of the Games Category.

To keep it higher level and more comparable to the Apple Store, we will stick with using the Category column for Google.

In summary, these frequency tables show that Games dominate the App Store sample by number of apps (58%), while on Google Play the Family category is the largest (about 19%), followed by Game (about 10%) and Tools (about 8%). However, the number of apps offered in a genre says nothing about how many people use them. Let's next find which types of free apps have the most users.

<h3>Most Popular Apps by Genre</h3>

The Google set includes the number of installs for each app, but the Apple set does not. So for the Apple set we use 'rating_count_tot' (the total number of ratings per app) as a proxy.

In [66]:
# grab dictionary of unique Apple store Genres with genres as keys
apple_genres = freq_table(final_apple, 12)

for genre in apple_genres:
    total = 0 # to store quantity of user ratings
    len_genre = 0 # number of apps in each genre
    for app in final_apple:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


In the App Store, the genres with the highest average number of ratings per app are Navigation, Reference and Social Networking, followed by Music and Weather.
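
The nested loops above scan the whole dataset once per genre. A minimal sketch of a single-pass alternative is shown here, accumulating a running total and a count per group. The function name `average_by_group` and the toy rows are hypothetical, and this is not the method used to produce the figures above.

```python
def average_by_group(rows, group_index, value_index):
    """Average the numeric column value_index within each group_index value."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy data (made up): genre in column 0, number of ratings in column 1
toy_apps = [
    ['Games', '100'],
    ['Games', '300'],
    ['Weather', '50'],
]

averages = average_by_group(toy_apps, 0, 1)
for genre, avg in averages.items():
    print(genre, ':', avg)
```

With the notebook's data, a call such as `average_by_group(final_apple, 12, 6)` would group by prime_genre and average rating_count_tot.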

To analyze popularity in the Google set we use the Installs column, but its values are discretized into open-ended buckets such as 100,000+. For the purpose of this study we will treat each bucket as its lower bound, so 100,000+ installs counts as exactly 100,000 installs.

More importantly, we need to remove the commas and plus signs from this column so we can treat the values as numbers rather than categories.

In [69]:
display_table(final_google, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


In [70]:
# grab dictionary of unique Google store Categories with Categories as keys
google_cats = freq_table(final_google, 1)

for cat in google_cats:
    total = 0 # to store quantity of installs
    len_cat = 0 # number of apps in each category
    for app in final_google:
        cat_app = app[1]
        if cat_app == cat:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_cat += 1
    avg_n_installs = total / len_cat
    print(cat, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The Communication category leads with an average of about 38 million installs per app. As in the App Store, however, many of these category averages are skewed by a small number of extremely popular apps. It will take deeper analysis to capture a more accurate picture of the most popular categories.
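
One possible next step, offered only as a sketch, is to compare the median with the mean, since the median is far less sensitive to a few giant apps. The example below uses `statistics.median` from the Python standard library. The helper names `parse_installs` and `median_by_group` and the toy rows are hypothetical and do not come from the datasets.

```python
from statistics import median


def parse_installs(text):
    """Turn a bucket label such as '1,000,000+' into the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))


def median_by_group(rows, group_index, value_index):
    """Median of the parsed install counts within each group."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[group_index], []).append(parse_installs(row[value_index]))
    return {group: median(values) for group, values in grouped.items()}


# Toy data (made up): one huge app dominates the first category
toy_apps = [
    ['COMMUNICATION', '1,000,000,000+'],
    ['COMMUNICATION', '1,000+'],
    ['COMMUNICATION', '5,000+'],
    ['WEATHER', '10,000+'],
    ['WEATHER', '50,000+'],
]

medians = median_by_group(toy_apps, 0, 1)
for category, value in medians.items():
    print(category, ':', value)
```

In this toy data the mean for COMMUNICATION is above 300 million installs while the median is 5,000, which is the kind of gap a single dominant app can create.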