## Mobile Apps Analysis Project
For this project, I'll pretend I'm working as a data analyst for a company that builds Android and iOS mobile apps. The company will make apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app: the more users who see and engage with the ads, the better. My goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

Import the Google Play Store and App Store data files.

In [1]:
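from csv import reader

# Read each file inside a `with` block so it is closed after loading
with open('googleplaystore.csv', encoding='utf8') as f:
    data_android_full = list(reader(f))
android_header = data_android_full[0]  # column names
android = data_android_full[1:]        # app records

with open('AppleStore.csv', encoding='utf8') as f:
    data_ios_full = list(reader(f))
ios_header = data_ios_full[0]
ios = data_ios_full[1:]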

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


Identify which attributes could help with the analysis.

In [5]:
# For Android
data_android_full[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

I have decided that the attributes that could help in the analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

In [6]:
# For IOS
data_ios_full[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

I have decided that the attributes that could help in the analysis are 'track_name', 'currency', 'price', and 'rating_count_tot'.

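According to https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015 , record 10472 of the Google Play data set is malformed. As the output below shows, it has only 12 values instead of 13: the 'Category' value is missing, so the remaining values are shifted one column to the left. Therefore, I will delete this record.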

In [7]:
print(android_header)
print('\n')
print(android[10472])




['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
del android[10472]
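
A malformed record like this can also be found without relying on a forum report. The sketch below flags every row whose length differs from the header's. `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and the sample data is made up for illustration.

```python
def find_malformed_rows(rows, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Toy data (illustrative only): the second row is missing one value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(sample_rows, sample_header))  # [1]
```

Because `del` shifts every later row up by one position, the deletion cell should be run only once.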

Find duplicate entries among the Android apps. In total, 1181 entries repeat an app that already appears earlier in the data set.

In [9]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps', duplicate_apps[:5])
        

Number of duplicate apps: 1181
Examples of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


To remove the duplicates, I will:
* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [45]:
reviews_max = {}

for row in android:
    name = row[0]
    n_reviews = int(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


In [48]:
android_clean = []
already_added = []

for row in android:
    name = row[0]
    n_reviews = int(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)

print(android_clean[0:5])
print('\n')
print('Android Records remaining:', len(android_clean))

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


Android Records remaining: 9659


Duplicate records have been successfully removed!
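
The two steps above can also be done in a single pass. The sketch below is an alternative, not the method used in this notebook: it stores the best row per app name directly in a dictionary, which avoids the repeated `name not in already_added` list lookups. `dedupe_by_max_reviews` is a hypothetical name, and the default column indexes assume the Google Play layout (name at index 0, reviews at index 3).

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        # Strict '>' keeps the earliest row when several share the maximum
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy data (illustrative only)
sample = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Zoom', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '159'],
]
for row in dedupe_by_max_reviews(sample):
    print(row)
```

The result is ordered by the first appearance of each app name, so the row order can differ from `android_clean` even though the same rows are kept.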

### Removing Non-English Apps
The function below uses the built-in ord() function to get the code point of each character in an app name. A name is treated as English if it contains at most 3 characters outside the ASCII range (code point above 127), so names with a few emojis or symbols are still kept.

In [31]:
def check_english(string):
    check = 0
    for letter in string:
        if ord(letter) >127:
            check +=1 
    if check > 3:
        return False
    else:
        return True

In [32]:
check_english('电视剧热')

False
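
To illustrate why up to three non-ASCII characters are tolerated, here is a compact, self-contained variant of the same check applied to a few names. `is_probably_english` and its `max_non_ascii` parameter are hypothetical names introduced for this sketch.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))                      # True
print(is_probably_english('Docs To Go™ Free Office Suite'))  # True: one non-ASCII character
print(is_probably_english('Instachat 😜'))                    # True: one emoji
print(is_probably_english('电视剧热'))                          # False: four non-ASCII characters
```

This is only a heuristic: a name written in another language with mostly ASCII letters would still pass the check.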

In [51]:
android_english = []
ios_english = []

for row in android_clean:
    if check_english(row[0]):
        android_english.append(row)
print('Number of Android apps:',len(android_clean))
print('Number of English Android apps:',len(android_english))

for row in ios:
    if check_english(row[1]):
        ios_english.append(row)
print('Number of IOS apps:',len(ios))
print('Number of English IOS apps:',len(ios_english))

Number of Android apps: 9659
Number of English Android apps: 9614
Number of IOS apps: 7197
Number of English IOS apps: 6183


### Only free apps are chosen for the analysis

The company only builds apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we need to isolate the free apps for our analysis.

In [61]:
android_free_apps = []
ios_free_apps = []

for row in android_english:
    if row[7] == '0':
        android_free_apps.append(row)

for row in ios_english:
    if row[4] == '0.0':
        ios_free_apps.append(row)

print('Number of free Android apps:', len(android_free_apps))
print('Number of free IOS apps:', len(ios_free_apps))

Number of free Android apps: 8864
Number of free IOS apps: 3222
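
The string comparisons above depend on the exact formatting of each file ('0' for Google Play, '0.0' for the App Store). A numeric comparison is less fragile. The sketch below is a minimal version under one assumption that is not shown in the output above: paid prices may carry a leading '$'. `is_free` is a hypothetical helper name.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0', or '$4.99' means free."""
    # Assumption: the only non-numeric character in a price is a '$' sign
    cleaned = price.strip().replace('$', '')
    return float(cleaned) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```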


### Finding the most common genres for each market
The function 'freq' returns the percentage of each genre relative to the total.

In [74]:
android_dict = {}
ios_dict = {}

def freq(data,index):
    table = {}
    total = 0
    # Create dictionary that stores genre as key and frequency as value
    for row in data:
        total += 1
        genre = row[index]
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
    
    # Convert to percent of total
    table_percent = {}
    for row in table:
        table_percent[row] = table[row]/total * 100
    
    return table_percent

In [80]:
print(freq(android_free_apps,9))

{'Shopping': 2.2450361010830324, 'Arcade;Action & Adventure': 0.12409747292418773, 'Casual': 1.7599277978339352, 'Casual;Education': 0.02256317689530686, 'Art & Design;Creativity': 0.06768953068592057, 'Role Playing;Brain Games': 0.01128158844765343, 'Parenting': 0.4963898916967509, 'Education;Creativity': 0.04512635379061372, 'Educational;Creativity': 0.033844765342960284, 'Health & Fitness': 3.0798736462093865, 'Action': 3.1024368231046933, 'Parenting;Education': 0.078971119133574, 'Books & Reference': 2.1435018050541514, 'Casual;Pretend Play': 0.236913357400722, 'Role Playing;Action & Adventure': 0.033844765342960284, 'Parenting;Music & Video': 0.06768953068592057, 'Lifestyle;Education': 0.01128158844765343, 'Entertainment;Music & Video': 0.16922382671480143, 'House & Home': 0.8235559566787004, 'Role Playing': 0.9363718411552346, 'Food & Drink': 1.2409747292418771, 'Card;Action & Adventure': 0.01128158844765343, 'Board': 0.3835740072202166, 'Health & Fitness;Education': 0.0112815884

The function 'display' prints the frequency table, sorted in descending order of frequency.

In [100]:
def display(data,index):
    dict_freq = freq(data,index)
    table_display = []
    
    for row in dict_freq:
        row_list = [row,dict_freq[row]]
        table_display.append(row_list)
    
    table_sorted = sorted(table_display,key = lambda x:x[1],reverse=True)
    
    for line in table_sorted:
        print(line[0],': ',line[1])
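
For comparison, the same sorted percentage table can be built with `collections.Counter` from the standard library. This is an alternative sketch, not the approach used in this notebook. `sorted_percentages` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def sorted_percentages(rows, index):
    """Return (value, percent) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Toy data (illustrative only)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
for value, percent in sorted_percentages(sample, 1):
    print(value, ':', percent)
```

Returning the pairs instead of printing them lets the caller reuse the table, for example to look at only the top five entries.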


With both functions defined, the frequency tables are generated by passing a data set and a column index as arguments.

In [103]:
display(android_free_apps,1)

FAMILY :  18.907942238267147
GAME :  9.724729241877256
TOOLS :  8.461191335740072
BUSINESS :  4.591606498194946
LIFESTYLE :  3.9034296028880866
PRODUCTIVITY :  3.892148014440433
FINANCE :  3.7003610108303246
MEDICAL :  3.531137184115524
SPORTS :  3.395758122743682
PERSONALIZATION :  3.3167870036101084
COMMUNICATION :  3.2378158844765346
HEALTH_AND_FITNESS :  3.0798736462093865
PHOTOGRAPHY :  2.944494584837545
NEWS_AND_MAGAZINES :  2.7978339350180503
SOCIAL :  2.6624548736462095
TRAVEL_AND_LOCAL :  2.33528880866426
SHOPPING :  2.2450361010830324
BOOKS_AND_REFERENCE :  2.1435018050541514
DATING :  1.861462093862816
VIDEO_PLAYERS :  1.7937725631768955
MAPS_AND_NAVIGATION :  1.3989169675090252
FOOD_AND_DRINK :  1.2409747292418771
EDUCATION :  1.1620036101083033
ENTERTAINMENT :  0.9589350180505415
LIBRARIES_AND_DEMO :  0.9363718411552346
AUTO_AND_VEHICLES :  0.9250902527075812
HOUSE_AND_HOME :  0.8235559566787004
WEATHER :  0.8009927797833934
EVENTS :  0.7107400722021661
PARENTING :  0.6543

In [101]:
display(android_free_apps,9)

Tools :  8.449909747292418
Entertainment :  6.069494584837545
Education :  5.347472924187725
Business :  4.591606498194946
Productivity :  3.892148014440433
Lifestyle :  3.892148014440433
Finance :  3.7003610108303246
Medical :  3.531137184115524
Sports :  3.463447653429603
Personalization :  3.3167870036101084
Communication :  3.2378158844765346
Action :  3.1024368231046933
Health & Fitness :  3.0798736462093865
Photography :  2.944494584837545
News & Magazines :  2.7978339350180503
Social :  2.6624548736462095
Travel & Local :  2.3240072202166067
Shopping :  2.2450361010830324
Books & Reference :  2.1435018050541514
Simulation :  2.0419675090252705
Dating :  1.861462093862816
Arcade :  1.8501805054151623
Video Players & Editors :  1.7712093862815883
Casual :  1.7599277978339352
Maps & Navigation :  1.3989169675090252
Food & Drink :  1.2409747292418771
Puzzle :  1.128158844765343
Racing :  0.9927797833935018
Role Playing :  0.9363718411552346
Libraries & Demo :  0.9363718411552346
Aut

By genre, many of the free Android apps serve practical purposes, such as 'Tools', 'Education', and 'Business', although 'Entertainment' also ranks near the top.

In [102]:
display(ios_free_apps,11)

Games :  58.16263190564867
Entertainment :  7.883302296710118
Photo & Video :  4.9658597144630665
Education :  3.662321539416512
Social Networking :  3.2898820608317814
Shopping :  2.60707635009311
Utilities :  2.5139664804469275
Sports :  2.1415270018621975
Music :  2.0484171322160147
Health & Fitness :  2.0173805090006205
Productivity :  1.7380509000620732
Lifestyle :  1.5828677839851024
News :  1.3345747982619491
Travel :  1.2414649286157666
Finance :  1.1173184357541899
Weather :  0.8690254500310366
Food & Drink :  0.8069522036002483
Reference :  0.5586592178770949
Business :  0.5276225946617008
Book :  0.4345127250155183
Navigation :  0.186219739292365
Medical :  0.186219739292365
Catalogs :  0.12414649286157665


'Games' is by far the most common genre among the free English iOS apps, at about 58%.

Up to this point, we have found that apps in the App Store are mostly designed for fun, while Google Play shows a more balanced mix of practical and fun apps. Next, I would like to get an idea of which kinds of apps have the most users.

### Most Popular Apps by Genre on the App Store

In [105]:
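ios_genre = freq(ios_free_apps,11)

for genre in ios_genre:
    total = 0      # sum of rating counts for this genre
    len_genre = 0  # number of apps in this genre
    
    for row in ios_free_apps:
        genre_app = row[11]
        
        # Exact match; `in` would do a substring test on strings
        if genre_app == genre:
            total += float(row[5])  # rating_count_tot
            len_genre += 1
    
    avg_n_ratings = total/len_genre
    print(genre, ':', avg_n_ratings)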
    

Travel : 28243.8
Shopping : 26919.690476190477
Book : 39758.5
Navigation : 86090.33333333333
Weather : 52279.892857142855
Utilities : 18684.456790123455
Food & Drink : 33333.92307692308
Business : 7491.117647058823
Catalogs : 4004.0
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Medical : 612.0
Music : 57326.530303030304
Reference : 74942.11111111111
Finance : 31467.944444444445
Social Networking : 71548.34905660378
Entertainment : 14029.830708661417
Photo & Video : 28441.54375
Productivity : 21028.410714285714
Education : 7003.983050847458
News : 21248.023255813954
Lifestyle : 16485.764705882353


'Navigation', 'Reference', 'Social Networking', 'Music', and 'Weather' apps have the highest average number of user ratings (rating_count_tot).
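
The per-genre averages can also be computed in one pass over the data instead of looping over all apps once per genre. The sketch below is an alternative under that idea. `average_by_group` is a hypothetical name and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = defaultdict(float)  # group -> sum of values
    counts = defaultdict(int)    # group -> number of rows
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy data (illustrative only)
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
print(average_by_group(sample, 0, 1))  # {'Games': 200.0, 'Weather': 50.0}
```

Keep in mind that a mean is sensitive to a few very large values, so a genre's average can be driven by a handful of apps.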

In [109]:
display(android_free_apps,5)

1,000,000+ :  15.726534296028879
100,000+ :  11.552346570397113
10,000,000+ :  10.548285198555957
10,000+ :  10.198555956678701
1,000+ :  8.393501805054152
100+ :  6.915613718411552
5,000,000+ :  6.825361010830325
500,000+ :  5.561823104693141
50,000+ :  4.7721119133574
5,000+ :  4.512635379061372
10+ :  3.5424187725631766
500+ :  3.2490974729241873
50,000,000+ :  2.3014440433213
100,000,000+ :  2.1322202166064983
50+ :  1.917870036101083
5+ :  0.78971119133574
1+ :  0.5076714801444043
500,000,000+ :  0.2707581227436823
1,000,000,000+ :  0.22563176895306858
0+ :  0.04512635379061372
0 :  0.01128158844765343


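More than half of the apps (about 55%) fall in the 100,000+ bucket or higher, and the most common bucket is 1,000,000+. The buckets are open-ended, so these figures are lower bounds rather than exact install counts.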
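
To compute averages, the install buckets have to be converted from strings to numbers. The sketch below shows that conversion in isolation. `parse_installs` is a hypothetical helper name, and the result is the lower bound of the bucket, not the true install count.

```python
def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to its lower bound."""
    return float(installs.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0
```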

In [120]:
category_freq = freq(android_free_apps,1)
category_table=[]
for category in category_freq:
    total = 0
    n_apps = 0  # avoid naming this `len`, which would shadow the built-in
    for row in android_free_apps:
        category_app = row[1]
        if category_app == category:  # exact match, not a substring test
            install = row[5]
            install = install.replace('+','')
            install = install.replace(',','')
            total += float(install)
            n_apps += 1
    avg_install = total/n_apps
    print(category,':',avg_install)
    category_table.append((avg_install,category))
            

HOUSE_AND_HOME : 1331540.5616438356
WEATHER : 5074486.197183099
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
COMICS : 817657.2727272727
DATING : 854028.8303030303
LIFESTYLE : 1437816.2687861272
LIBRARIES_AND_DEMO : 638503.734939759
SOCIAL : 23253652.127118643
HEALTH_AND_FITNESS : 4188821.9853479853
ART_AND_DESIGN : 1986335.0877192982
PARENTING : 542603.6206896552
FAMILY : 3695641.8198090694
PERSONALIZATION : 5201482.6122448975
MEDICAL : 120550.61980830671
FINANCE : 1387692.475609756
AUTO_AND_VEHICLES : 647317.8170731707
EDUCATION : 1833495.145631068
SPORTS : 3638640.1428571427
COMMUNICATION : 38456119.167247385
TRAVEL_AND_LOCAL : 13984077.710144928
MAPS_AND_NAVIGATION : 4056941.7741935486
PRODUCTIVITY : 16787331.344927534
BUSINESS : 1712290.1474201474
FOOD_AND_DRINK : 1924897.7363636363
NEWS_AND_MAGAZINES : 9549178.467741935
SHOPPING : 7036877.311557789
GAME : 15588015.603248259
TOOLS : 10801391.298666667
PHOTOGRAPHY : 17840110.40229885
VIDEO_PLAYERS : 24727872.4

Sort the category table by average number of installs, in descending order.

In [127]:
category_table_sorted = sorted(category_table,reverse=True)
for row in category_table_sorted:
    print(row[1],':',row[0])
                               

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

On average, communication apps have the most installs: 38,456,119. However, this category is dominated by a few large companies, and the other top categories are likewise saturated with apps developed by large companies. There is, however, one category that ranks relatively high on both iOS and Android and does not seem to face fierce competition: Books and Reference. I will explore this genre further.
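
One way to check how strongly a category is dominated by a few giants is to measure what share of its installs comes from very large apps. The sketch below is an illustration only: `share_from_large_apps` is a hypothetical helper, the 100,000,000 threshold is an arbitrary assumption, and the sample rows are made up. The default column indexes assume the Google Play layout (category at index 1, installs at index 5).

```python
def share_from_large_apps(rows, category, threshold=100_000_000,
                          category_index=1, installs_index=5):
    """Return the fraction of a category's installs that comes from apps
    with at least `threshold` installs (0.0 if the category has no installs)."""
    total = 0.0
    large = 0.0
    for row in rows:
        if row[category_index] != category:
            continue
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        total += installs
        if installs >= threshold:
            large += installs
    return large / total if total else 0.0


# Toy data (illustrative only)
sample = [
    ['Giant Chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Reader', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
]
print(share_from_large_apps(sample, 'COMMUNICATION'))
print(share_from_large_apps(sample, 'BOOKS_AND_REFERENCE'))
```

A share close to 1 would suggest that the category average mostly reflects a few giants rather than a typical app.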

In [136]:
# Install buckets of one million or more
popular_installs = {'1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+',
                    '100,000,000+', '500,000,000+', '1,000,000,000+'}

for row in android_free_apps:
    if row[1] == 'BOOKS_AND_REFERENCE' and row[5] in popular_installs:
        print(row[0],':',row[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Aldiko Book Reader : 10,000,000+
Wattpad 📖 Free Books : 100,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al ka

The category contains a wide variety of apps, and only a few of them have hundreds of millions of installs. The market shows potential, but we need to build an app with unique features, because an app with only standard reading features would likely be overshadowed by Google Play Books or Amazon Kindle.

## Conclusion
In this project, I analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets (Google Play Store and Apple Store).

I conclude that a book or reference app could be profitable on both Google Play and the App Store. Both markets already have many book-reading apps, so the app needs features beyond basic reading. Possible examples are a community for discussing popular books and a daily selection of popular lines from famous books.