## Potential Application Study for Android and iOS Mobile Markets

### Overview of the project

The main purpose of this project is to understand what draws users to mobile apps. Because revenue comes mainly from in-app ads rather than from user payments, app development focuses on attracting as many users as possible.


### Data sets description

As of 2018, about 4 million apps were available on Google Play and the App Store. For simplicity, we use only a small subset of them:

1. Approximately 10,000 Android apps from Google Play, collected in August 2018.
2. Roughly 7,000 iOS apps from the App Store, collected in July 2017.



In [52]:
from csv import reader
appfile_name='AppleStore.csv'
path=f'/Users/Ming/jupyter/p_apps/{appfile_name}'
open_file = open(path)
read_file = reader(open_file)
ios = list(read_file)

In [214]:
goofile_name='googleplaystore.csv'
path=f'/Users/Ming/jupyter/p_apps/{goofile_name}'
open_file = open(path)
read_file = reader(open_file)
adr = list(read_file)
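
The two loading cells above differ only in the file name, so the pattern could be wrapped in a small helper. The sketch below is one possible version: `load_csv` is a hypothetical name, and `encoding='utf8'` is an assumption about how the files are stored. Using a `with` block also closes the file handle, which the cells above leave open.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return its rows (header included) as a list of lists."""
    # newline='' is the setting the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        return list(reader(f))
```

With this helper, each dataset loads in one line, for example `ios = load_csv('/Users/Ming/jupyter/p_apps/AppleStore.csv')`.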

Exploring the datasets, we see that several columns overlap between the two sets, e.g. the name and genre of the apps, price, ratings, and reviews. However, we must be cautious when comparing apps across different markets.

In [330]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(ios, 0, 10, True)
explore_data(adr, 0, 2, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37'

### Data cleaning

#### Correction:
* A well-known error in the Google Play dataset: the row on line 10473 is missing its category. I kept the row and assigned it to the "TOOLS" category.
`['Life Made WI-Fi Touchscreen Photo Frame', 'TOOLS', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', 'TOOLS', 'February 11, 2018', '1.0.19', '4.0 and up']`

* Duplicate entries. The Android dataset contains duplicate apps: 1,181 duplicate rows were detected. For instance, Instagram appears 4 times, each time with a different number of reviews. I keep the entry with the highest number of reviews and delete the rest. Two different methods were applied, and both give the same result: 9,660 unique apps remain.
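
As a compact way to see how many rows are duplicates and which names repeat, the check can also be written with `collections.Counter`. This is only a sketch: `count_duplicates` is a hypothetical helper, and the sample rows are made-up values laid out in the Google Play column order (name, category, rating, reviews).

```python
from collections import Counter

def count_duplicates(rows, name_index):
    """Return (number of extra rows, {name: occurrences}) for names seen more than once."""
    counts = Counter(row[name_index] for row in rows)
    repeated = {name: n for name, n in counts.items() if n > 1}
    # every occurrence beyond the first one is an extra (duplicate) row
    extra_rows = sum(n - 1 for n in repeated.values())
    return extra_rows, repeated

# Toy rows (made-up values), not taken from the real dataset
sample = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'TOOLS', '4.0', '10'],
    ['App A', 'SOCIAL', '4.5', '180'],
]
extra, repeated = count_duplicates(sample, 0)
print(extra, repeated)  # 2 {'App A': 3}
```

Because `Counter` uses hashing, this avoids the repeated `name in unique_apps` list scans, which grow slower as the list gets longer.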

In [280]:
# looking for duplicates. print out the number of duplicates.
duplicate_apps = []
unique_apps = []
for app in ios:
    name = app[2]  # track_name; column 0 of the iOS file is the row index
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print(f'Number of duplicate apps in Apple dataset: {len(duplicate_apps)}')

duplicate_apps = []
unique_apps = []
for app in adr:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print(f'Number of duplicate apps in Android dataset: {len(duplicate_apps)}')
    

Number of duplicate apps in Apple dataset: 0
Number of duplicate apps in Android dataset: 0


In [274]:
# One example of a duplicate: Instagram has 4 rows in the dataset.
# We keep the row with the highest review count, assuming it is the most recent one.

for i, app in enumerate(adr[1:],1):
    name=app[0]
    if name == 'Instagram':
        print(i, app)    

1917 ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [242]:
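# Method 1: List
# For each duplicated name, track the row with the most reviews seen so far and
# collect the positions (in adr, header included) of every other row with that name.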
d_list=[]
for duplicate in duplicate_apps:
    app_review, app_position = -1, 0
    for i, app in enumerate(adr[1:]):
        name, review = app[0], int(app[3])       
        if name == duplicate:
            if app_review == -1:
                app_review, app_position = review, i
            elif review <= app_review:
                d_list.append(i+1)
            elif review > app_review:
                d_list.append(app_position+1)
                app_review, app_position = review, i

[]


In [236]:
# remove repeated positions, sort them, then delete the rows in ascending order;
# counter offsets each index by the number of rows already deleted
l=list(set(d_list))
l.sort()
counter = 0
for x in l:
    del adr[int(x)-counter]
    counter += 1

In [249]:
print(len(adr)) # including the header and the one line that has been modified

9661


In [251]:
# method 2: Dictionary
# first, gather the max reviews for all apps
reviews_max={}
for app in adr[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
print(len(reviews_max))

9660


In [252]:
# second: collect the cleaned data set
android_clean=[]
already_added=[]
for app in adr[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))

9660
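
The two passes of Method 2 can also be folded into a single pass that stores the best row per name. The sketch below assumes the Google Play layout used above (name in column 0, review count in column 3); `keep_most_reviewed` is a hypothetical name. Unlike Method 2, the result is ordered by the first appearance of each name rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # strict '>' keeps the earliest row when review counts tie
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows (made-up values), header already removed
sample = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'TOOLS', '4.0', '10'],
    ['App A', 'SOCIAL', '4.5', '180'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

On the real data this would be called as `keep_most_reviewed(adr[1:])` so that the header row is skipped.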


#### Selection:
Based on the main purpose of this project, we clean the dataset further to keep only the relevant apps.

1. Since this analysis targets an English-speaking audience, non-English apps are removed. An app is treated as non-English when its name contains more than three non-ASCII characters. By this criterion, 45 Android apps are classified as non-English and excluded.
2. Only free apps are included, so paid apps are removed.

In [267]:
# Remove Non-English apps
# 1. create a function that returns False if the name contains more than three non-ASCII characters, True otherwise
def english_check(a_string):
    counter=0
    for x in a_string:
        if ord(x)>127:
            counter += 1
    if counter > 3:
        return False
    else:
        return True
print(english_check('Instachat 😜'))


True


In [283]:
# 2. collect all the non-English apps for android.
non_english=[]
for app in android_clean:
    if not english_check(app[0]):
        non_english.append(app)
print(len(non_english))

# 2. collect all the non-English apps for ios.
non_english_ios=[]
for app in ios:
    if not english_check(app[2]):
        non_english_ios.append(app)
print(len(non_english_ios))

45
1014


In [284]:
# 3. collect all English apps for android.
english_app = []
for app in android_clean:
    if english_check(app[0]):
        english_app.append(app)
print(len(english_app))

# 3. collect all English apps for ios.
english_app_ios = []
for app in ios:
    if english_check(app[2]):
        english_app_ios.append(app)
print(len(english_app_ios))

9615
6184


In [287]:
# Collect all Free English apps in Google Play
free_english_adr=[]
for app in english_app:
    if app[6] == 'Free':
        free_english_adr.append(app)
print(len(free_english_adr))

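# Collect all free English apps in the App Store; [1:] skips the header row kept by the previous cell
free_english_ios=[]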
for app in english_app_ios[1:]:
    if float(app[5]) <= 0.000001:
        free_english_ios.append(app)
print(len(free_english_ios))

8864
3222


### Data Analysis
The company's current workflow for developing a new app is as follows:
1. Create an MVP (minimum viable product) of an idea for Android and publish it on Google Play.
2. Develop the product further if the initial response is positive.
3. If the app is profitable after 6 months on Google Play, build an iOS version for the App Store.

Because the ultimate goal is to capture both the Android and the iPhone markets, we have to look at trends in both the App Store and Google Play. In the next section, I look for genres that are common in both markets by building frequency tables for both cleaned datasets.
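
A frequency table here means the percentage of apps that fall under each value of a column. One way to sketch it is with `collections.Counter`, whose `most_common()` method already returns the values sorted by count. `percentage_table` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return [(value, percent)] for one column, most common value first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, 100 * n / total) for value, n in counts.most_common()]

# Toy rows (made-up values): app name and category
sample = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'GAME']]
for value, pct in percentage_table(sample, 1):
    print(f'{value} : {pct:.1f}')
# GAME : 75.0
# TOOLS : 25.0
```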



In [305]:
# Generate a frequency table (percentage of apps per value) through a freq_table function
def freq_table(dataset, index):
    freq={}
    n = 0
    for item in dataset:
        n += 1
        key = item[index]
        if key in freq:
            freq[key] += 1
        else:
            freq[key] = 1
    for x in freq:
        freq[x] = float(freq[x])*(100/n) 
    return freq

# print(freq_table(free_english_adr,1))
# print('\n')
# print(freq_table(free_english_adr,9))
# print('\n')
# print(freq_table(free_english_ios,12))


In [307]:
# print the frequency table sorted by percentage, highest first
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [308]:
# category for free english android apps
display_table(free_english_adr,1)

FAMILY : 18.896660649819495
GAME : 9.724729241877258
TOOLS : 8.472472924187727
BUSINESS : 4.591606498194946
LIFESTYLE : 3.903429602888087
PRODUCTIVITY : 3.8921480144404335
FINANCE : 3.700361010830325
MEDICAL : 3.531137184115524
SPORTS : 3.3957581227436826
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180508
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.3352888086642603
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.143501805054152
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090254
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505416
LIBRARIES_AND_DEMO : 0.9363718411552348
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833936
EVENTS : 0.7107400722021661
PARENTING : 0.654332129963899
ART_AND_DESIGN :

In [309]:
# genre for free english android apps
display_table(free_english_adr,9)

Tools : 8.44990974729242
Entertainment : 6.069494584837545
Education : 5.347472924187726
Business : 4.591606498194946
Productivity : 3.8921480144404335
Lifestyle : 3.8921480144404335
Finance : 3.700361010830325
Medical : 3.531137184115524
Sports : 3.4634476534296033
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.9444945848375452
News & Magazines : 2.7978339350180508
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.143501805054152
Simulation : 2.041967509025271
Dating : 1.861462093862816
Arcade : 1.8501805054151625
Video Players & Editors : 1.7712093862815885
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090254
Food & Drink : 1.2409747292418774
Puzzle : 1.128158844765343
Racing : 0.9927797833935019
Role Playing : 0.9363718411552348
Libraries & Demo : 0.9363718411552348
Auto & Vehicles : 0.9250902527075

In [311]:
# prime genre for free english ios apps

display_table(free_english_ios,12)

Games : 58.16263190564866
Entertainment : 7.883302296710117
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.513966480446927
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002482
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


#### Genres (Categories): percentage
1. App Store.
It is clear that "Games" is the most common prime genre in our iOS dataset (free English apps in the App Store). More than 58% of the apps belong to this genre, while the runner-up ("Entertainment") accounts for less than 8%. Moreover, 4 of the top 5 genres (Games, Entertainment, Photo & Video, and Social Networking) are leisure-oriented, so the majority of the apps in our data serve leisure purposes.

2. Google Play.
Compared to the iOS dataset, the apps in the Android dataset (free English apps on Google Play) show a more balanced mix, based on the frequency tables for the "Category" and "Genre" columns.

Nonetheless, a larger share of the market does not necessarily translate into popularity, which is better measured by the number of users. Because we are primarily interested in how many users an app has, we need an estimate of the user base. The Google Play dataset includes the number of installs, which is a good measure of the user base. For the iOS dataset, we use the average number of reviews per genre as a proxy. Here we have to assume a linear relationship between the number of users and the number of reviews.
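
Both proxies reduce to the same operation: average one numeric column per genre or category. The sketch below does this in a single pass over the rows. `average_by_group` and `parse_installs` are hypothetical helpers, and the sample rows are made up. Note that install figures such as `1,000+` are lower bounds of a bucket, so the averages are approximate.

```python
from collections import defaultdict

def parse_installs(text):
    """Turn an install bucket such as '1,000,000+' into the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))

def average_by_group(rows, group_index, value_index, convert=float):
    """Average one numeric column per group, reading the rows only once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += convert(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (made-up values): category and install bucket
sample = [
    ['SOCIAL', '1,000+'],
    ['SOCIAL', '3,000+'],
    ['TOOLS', '500+'],
]
print(average_by_group(sample, 0, 1, convert=parse_installs))
# {'SOCIAL': 2000.0, 'TOOLS': 500.0}
```

For a plain numeric column such as the iOS review count, the default `convert=float` is enough.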


In [317]:
# display the genre and the number of reviews(proxy for users) for free english iOS apps
table = {}
for genre in freq_table(free_english_ios,12):
    total = 0
    len_genre = 0 
    
    for app in free_english_ios:
        genre_app = app[12]
        if genre_app == genre:
            total += float(app[6])
            len_genre += 1
    avg_review= total/len_genre
    table[genre]= avg_review
    
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])    

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


In [332]:
# display the genre and the average user rating (current version, column 9) for free english iOS apps
table = {}
for genre in freq_table(free_english_ios,12):
    total = 0
    len_genre = 0 
    
    for app in free_english_ios:
        genre_app = app[12]
        if genre_app == genre:
            total += float(app[9])
            len_genre += 1
    avg_rating= total/len_genre
    table[genre]= avg_rating
    
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])  

Catalogs : 4.0
Productivity : 3.9375
Music : 3.9318181818181817
Games : 3.9116862326574173
Reference : 3.861111111111111
Health & Fitness : 3.623076923076923
Shopping : 3.488095238095238
Photo & Video : 3.384375
Entertainment : 3.3169291338582676
Medical : 3.25
Food & Drink : 3.25
Book : 3.142857142857143
Utilities : 3.123456790123457
Education : 3.110169491525424
Business : 3.0588235294117645
Weather : 3.017857142857143
Social Networking : 2.9858490566037736
Lifestyle : 2.9215686274509802
Finance : 2.8472222222222223
Travel : 2.7375
Sports : 2.681159420289855
News : 2.6627906976744184
Navigation : 2.25


In [321]:
# display the category and the number of installs for free english Android apps

table = {}
for category in freq_table(free_english_adr,1):
    total = 0
    len_category = 0 
    
    for app in free_english_adr:
        category_app = app[1]
        if category_app == category:
            num_install = app[5]
            num_install = num_install.replace('+','').replace(',','')
            total += float(num_install)
            len_category += 1
    avg_installs= total/len_category
    table[category]= avg_installs
    
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])  

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10787009.952063914
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

In [328]:
# display the category and the average rating for free english Android apps

table = {}
for category in freq_table(free_english_adr,1):
    total = 0
    len_category = 0 
    
    for app in free_english_adr:
        category_app = app[1]
        if category_app == category:
            rating = app[2]
            
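            # NOTE: apps without a rating ('NaN') add nothing to the total but are still
            # counted below, which can pull these averages below the mean over rated apps only.
            if rating != 'NaN':
                total += float(rating)

            len_category += 1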
    avg_rating= total/len_category
    table[category]= avg_rating
    
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])  

EDUCATION : 4.298058252427182
ART_AND_DESIGN : 4.185964912280701
ENTERTAINMENT : 4.118823529411763
GAME : 4.030742459396756
COMICS : 4.025454545454546
PHOTOGRAPHY : 3.957088122605364
WEATHER : 3.871830985915492
SHOPPING : 3.781407035175881
FAMILY : 3.6957014925373186
VIDEO_PLAYERS : 3.6874213836477985
AUTO_AND_VEHICLES : 3.674390243902439
MAPS_AND_NAVIGATION : 3.648387096774193
BOOKS_AND_REFERENCE : 3.638421052631579
FINANCE : 3.6375000000000006
SOCIAL : 3.6220338983050833
HEALTH_AND_FITNESS : 3.615384615384615
PARENTING : 3.5913793103448284
TOOLS : 3.5262316910785625
TRAVEL_AND_LOCAL : 3.517874396135265
FOOD_AND_DRINK : 3.4854545454545454
HOUSE_AND_HOME : 3.4602739726027405
PRODUCTIVITY : 3.4182608695652217
PERSONALIZATION : 3.4078231292517014
BEAUTY : 3.3905660377358484
COMMUNICATION : 3.364808362369337
SPORTS : 3.3308970099667774
LIFESTYLE : 3.291618497109824
NEWS_AND_MAGAZINES : 3.277016129032258
LIBRARIES_AND_DEMO : 3.2216867469879515
EVENTS : 3.168253968253969
DATING : 3.16181818


#### Number of Users per Genre
Inspecting the results above, we find that social apps appear in the top 3 in both markets, which indicates popularity and therefore a potentially large user base. The rating tables show that, on average, the user ratings of social apps are not high, so there is room for improvement.
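
The reasoning in this paragraph (a large user base combined with a modest rating suggests an opening) can be sketched as a simple filter. The figures are rounded from the iOS tables above. The function name, `top_n=3`, and the rating cutoff of 3.0 are assumptions chosen for illustration, not thresholds used in the analysis.

```python
def popular_but_low_rated(avg_users, avg_rating, top_n=3, rating_cutoff=3.0):
    """Return the genres among the top_n by user proxy whose average rating is below the cutoff."""
    top = sorted(avg_users, key=avg_users.get, reverse=True)[:top_n]
    return [genre for genre in top if avg_rating[genre] < rating_cutoff]

# Rounded values from the iOS output above (average reviews and average rating per genre)
ios_reviews = {'Navigation': 86090.3, 'Reference': 74942.1, 'Social Networking': 71548.3,
               'Music': 57326.5, 'Weather': 52279.9}
ios_rating = {'Navigation': 2.25, 'Reference': 3.86, 'Social Networking': 2.99,
              'Music': 3.93, 'Weather': 3.02}

print(popular_but_low_rated(ios_reviews, ios_rating))
# ['Navigation', 'Social Networking']
```

The result is sensitive to the cutoff, so it is best read as a way to shortlist genres for closer inspection rather than as a ranking.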

Another sector that draws our attention is Medical, which takes up a very small portion of both markets in terms of installs and reviews. The overall ratings for Medical apps are also near the bottom. However, the medical sector is very important in the physical world, given its significance to our lives. There appears to be a gap between need and availability.


### Conclusion

Based on the analysis above, we recommend either developing a social networking app that meets needs not yet served by the market, or entering the under-developed medical app market to capture potential interest while facing less competition.

The foreseeable challenges for the latter are cultural and religious sensitivities and possible personal data privacy issues.