# Profitable App Profiles for the App Store and Google Play Markets

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Rather than collecting data on all of them, we work with two sample data sets:

- A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. https://www.kaggle.com/lava18/google-play-store-apps

- A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

In [1]:
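from csv import reader

# Read each CSV file into a list of rows; row 0 of each list is the header
f_ios = open('AppleStore.csv', encoding='utf8')
f_and = open('googleplaystore.csv', encoding='utf8')
l_ios = list(reader(f_ios))
l_and = list(reader(f_and))
f_ios.close()
f_and.close()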

In [5]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(l_ios, 0, 5)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']




In [4]:
explore_data(l_and, 0, 5)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




In [5]:
print(l_and[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del l_and[10473]
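
Row 10473 has only 12 fields while the header has 13 (the Category value is missing), so every later column in that row is shifted. A minimal sketch of how such rows could be found by comparing each row's length with the header's; `find_malformed_rows` and the toy rows are hypothetical names used only for illustration:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data standing in for l_and: the last row is missing its 'Category' value
toy = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]
bad_rows = find_malformed_rows(toy)
for index, row in bad_rows:
    print(index, row)
```

Because `del` shifts every later row up by one, the deletion cell should be run only once; running it again would remove a valid row.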

In [7]:
for app in l_and:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


## Deleting Wrong Data

Before beginning our analysis, we need to make sure the data we analyze is accurate; otherwise, the results of our analysis will be wrong. This means that we need to:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.

In [8]:
duplicate_apps = []
unique_apps = []

for app in l_and:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)  # seen before: record as a duplicate
    else:
        unique_apps.append(name)     # first occurrence of this name

print(len(duplicate_apps))
print(duplicate_apps[:15])

1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [9]:
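duplicate2_apps = []
unique2_apps = []

for app in l_ios:
    name2 = app[1]  # track_name is at index 1 in the iOS data
    if name2 in unique2_apps:
        duplicate2_apps.append(name2)  # seen before: record as a duplicate
    else:
        unique2_apps.append(name2)     # first occurrence of this name

print(len(duplicate2_apps))
print(duplicate2_apps[:15])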

2
['Mannequin Challenge', 'VR Roller Coaster']
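
The loops above test `name in unique_apps` against a list, which rescans the list on every iteration. A sketch of the same duplicate count using `collections.Counter` from the standard library; `count_duplicates` and `toy_names` are illustrative names, not part of the notebook:

```python
from collections import Counter

def count_duplicates(names):
    """Map each name that appears more than once to its number of occurrences."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}

toy_names = ['Box', 'Slack', 'Box', 'Instagram', 'Instagram', 'Instagram']
dupes = count_duplicates(toy_names)
extra_rows = sum(n - 1 for n in dupes.values())  # rows beyond the first copy
print(dupes)
print(extra_rows)
```

Here `extra_rows` plays the role of `len(duplicate_apps)`: it counts every occurrence of a name after the first one.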


Rather than removing duplicates at random, we use the one column that differs between duplicate entries: the number of reviews. We keep the entry with the largest number of reviews, because a higher review count suggests that the record is the most recent.

The reviews_max dictionary has one key per unique app name; its value is the highest review count found among that app's duplicate entries.

In [10]:
reviews_max = {} 
for app in l_and[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


In [11]:
reviews_max2 = {} 
for app in l_ios[1:]:
    name2 = app[1]
    n_reviews2 = float(app[5])
    if name2 in reviews_max2 and reviews_max2[name2] < n_reviews2:
        reviews_max2[name2] = n_reviews2
    elif name2 not in reviews_max2:
        reviews_max2[name2] = n_reviews2


In [12]:
print(len(reviews_max))


9659


In [13]:
print(len(reviews_max2))

7195


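android_clean is a list of lists holding the Android data with duplicates removed.
already_added is a list of app names that tracks which apps have already been added. It is needed because several duplicates of an app can share the same highest review count.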

In [14]:
android_clean = []
already_added = []

for app in l_and[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app) 
        already_added.append(name)

print(len(android_clean))

9659
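
The two-pass approach above (first find the highest review count per name, then keep one matching row) can be sketched on toy data. `remove_duplicates` and the toy rows are hypothetical; the sketch also uses a set instead of a list for the already-added names, which is an optional change that makes membership checks faster:

```python
def remove_duplicates(rows, name_index, reviews_index):
    """Keep one row per app name: the one with the highest review count."""
    # Pass 1: highest review count seen for each name
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Pass 2: keep the first row that matches the maximum for its name
    clean = []
    already_added = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

# Toy rows in (name, reviews) form; 'App A' has two rows tied at the maximum
toy_rows = [
    ['App A', '100'],
    ['App A', '250'],
    ['App A', '250'],
    ['App B', '10'],
]
deduped = remove_duplicates(toy_rows, name_index=0, reviews_index=1)
print(deduped)
```

Without the `already_added` check, both rows tied at 250 reviews would be kept.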


ios_clean is a list of lists holding the iOS data with duplicates removed. already_added2 is a list of app names that tracks which apps have already been added.

In [15]:
ios_clean = []
already_added2 = []

for app in l_ios[1:]:
    name = app[1]
    n_reviews = float(app[5])
    if (n_reviews == reviews_max2[name]) and (name not in already_added2):
        ios_clean.append(app) 
        already_added2.append(name)

print(len(ios_clean))

7195


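char_check counts the characters in a string that fall outside the ASCII range (code point above 127), which we treat as non-English. It returns False if the string contains more than 3 such characters and True otherwise, so names with a few emoji or symbols such as ™ are still kept.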

In [16]:
def char_check(string): 
    i = 0
    for char in string:
        char_o = ord(char)
        if char_o > 127:
            i += 1

    if i > 3:
        return False
    else:
        return True

In [17]:
char_check('Instagram爱奇奇艺')

False

In [18]:
char_check('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [19]:
char_check('Docs To Go™ Free Office Suite')

True

In [20]:
char_check('Instachat 😜')

True

In [21]:
char_check('Kenstergramerama')

True

no_eng_list is a list of lists holding the apps whose names contain more than 3 non-English characters.
eng_list is a list of lists holding the apps we treat as English (3 or fewer non-English characters in the name).

In [22]:
no_eng_list = []
eng_list = []
def char_check2(listo): 
    
    for word in listo:
        i = 0
        for char in word[0]:
            char_o = ord(char)
            if char_o > 127:
                i += 1

        if i > 3:
            no_eng_list.append(word)
        else:
            eng_list.append(word)
       
        
            

In [23]:
no_eng_list2 = []
eng_list2 = []
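def char_check3(listo):
    for word in listo:
        i = 0
        # NOTE: in the iOS data, index 0 is the numeric 'id' column and the
        # app name ('track_name') is at index 1, so this loop never flags a
        # row. Use word[1] to filter on the name; the counts shown below
        # were produced with word[0].
        for char in word[0]:
            char_o = ord(char)
            if char_o > 127:
                i += 1

        if i > 3:
            no_eng_list2.append(word)
        else:
            eng_list2.append(word)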

In [24]:
android_clean[:2]

[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up'],
 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
  'ART_AND_DESIGN',
  '4.7',
  '87510',
  '8.7M',
  '5,000,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'August 1, 2018',
  '1.2.4',
  '4.0.3 and up']]

In [25]:
unique_apps[:5]

['App',
 'Photo Editor & Candy Camera & Grid & ScrapBook',
 'Coloring book moana',
 'U Launcher Lite – FREE Live Cool Themes, Hide Apps',
 'Sketch - Draw & Paint']

In [26]:
char_check2(android_clean)

In [27]:
no_eng_list[:2]

[['Flame - درب عقلك يوميا',
  'EDUCATION',
  '4.6',
  '56065',
  '37M',
  '1,000,000+',
  'Free',
  '0',
  'Everyone',
  'Education',
  'July 26, 2018',
  '3.3',
  '4.1 and up'],
 ['သိင်္ Astrology - Min Thein Kha BayDin',
  'LIFESTYLE',
  '4.7',
  '2225',
  '15M',
  '100,000+',
  'Free',
  '0',
  'Everyone',
  'Lifestyle',
  'July 26, 2018',
  '4.2.1',
  '4.0.3 and up']]

In [28]:
len(android_clean)

9659

In [29]:
len(no_eng_list)

45

In [30]:
len(eng_list)

9614

In [31]:
char_check3(ios_clean)

In [32]:
len(ios_clean)

7195

In [33]:
len(no_eng_list2)

0

In [34]:
len(eng_list2)

7195
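
The header rows printed earlier show that the app name is at index 0 in the Android data but at index 1 (track_name) in the iOS data, where index 0 is the id. Since char_check3 reads index 0, it inspects ids rather than names, which is consistent with the count of 0 non-English iOS apps above. A sketch of one helper that takes the name column as a parameter and could serve both data sets; `is_english` and `split_by_language` are hypothetical names, and the toy rows reuse names from the char_check examples:

```python
def is_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

def split_by_language(dataset, name_index):
    """Split rows into (english, non_english) lists based on the name column."""
    english, non_english = [], []
    for row in dataset:
        if is_english(row[name_index]):
            english.append(row)
        else:
            non_english.append(row)
    return english, non_english

# Toy rows in the iOS layout: id at index 0, track_name at index 1
toy_ios = [
    ['1', 'Instachat 😜'],
    ['2', '爱奇艺PPS -《欢乐颂2》电视剧热播'],
    ['3', 'Docs To Go™ Free Office Suite'],
]
english, non_english = split_by_language(toy_ios, name_index=1)
print(len(english), len(non_english))
```

Returning the two lists instead of appending to module-level lists also means that re-running a cell does not add the same rows twice. Applying this to ios_clean with name_index=1 would likely change the iOS counts that follow.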

Next, we isolate the free apps. In the iOS data, the price is at index 4 and free apps have the value '0.0'; in the Android data, the price is at index 7 and free apps have the value '0'.

In [35]:
free_apps = [] #android list
not_free_apps = []
for row in eng_list:
    if row[7] == '0':
        free_apps.append(row)
    else:
        not_free_apps.append(row)
        

In [36]:
free_apps2 = [] #ios list
not_free_apps2 = []
for row in eng_list2:
    if row[4] == '0.0':
        free_apps2.append(row)
    else:
        not_free_apps2.append(row)

In [37]:
len(free_apps)

8864

In [38]:
len(not_free_apps)

750

In [39]:
len(free_apps2)

4054

In [40]:
len(not_free_apps2)

3141
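
The filters above compare the price column with the exact strings '0' and '0.0', so each data set needs its own literal. A sketch of a numeric comparison that works for both formats; `is_free` is a hypothetical helper, and stripping a leading '$' is an assumption about how paid Android prices are written:

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' equals zero."""
    # Assumption: the only non-numeric decoration is an optional leading '$'
    return float(price.strip().lstrip('$')) == 0.0

for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, is_free(price))
```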

In [41]:
free_apps[:2]

[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up'],
 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
  'ART_AND_DESIGN',
  '4.7',
  '87510',
  '8.7M',
  '5,000,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'August 1, 2018',
  '1.2.4',
  '4.0.3 and up']]

We want to find what kinds of apps are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

We first build a minimal Android version of the app and add it to Google Play. If the app gets a good response from users, we then develop it further. If it is profitable after six months, we also build an iOS version and add it to the App Store.

In [42]:

def freq_table(dataset, index):
    counts = {}
    for row in dataset:
        freq = row[index]
        if freq in counts:
            counts[freq] += 1
        else:
            counts[freq] = 1
    total_sum = sum(counts.values())
    for freq in counts:
        counts[freq] /= total_sum
        counts[freq] *= 100
    return counts

In [43]:
freq_table(free_apps, 2)

{'1.0': 0.157942238267148,
 '1.2': 0.01128158844765343,
 '1.4': 0.033844765342960284,
 '1.5': 0.033844765342960284,
 '1.6': 0.04512635379061372,
 '1.7': 0.078971119133574,
 '1.8': 0.078971119133574,
 '1.9': 0.12409747292418773,
 '2.0': 0.12409747292418773,
 '2.1': 0.09025270758122744,
 '2.2': 0.157942238267148,
 '2.3': 0.2030685920577617,
 '2.4': 0.1917870036101083,
 '2.5': 0.21435018050541518,
 '2.6': 0.24819494584837545,
 '2.7': 0.236913357400722,
 '2.8': 0.41741877256317694,
 '2.9': 0.4399819494584838,
 '3.0': 0.8235559566787004,
 '3.1': 0.7333032490974729,
 '3.2': 0.6881768953068592,
 '3.3': 1.0717509025270757,
 '3.4': 1.3086642599277978,
 '3.5': 1.6471119133574008,
 '3.6': 1.7712093862815883,
 '3.7': 2.4029783393501805,
 '3.8': 3.012184115523466,
 '3.9': 3.8695848375451267,
 '4.0': 5.516696750902527,
 '4.1': 6.678700361010831,
 '4.2': 8.472472924187725,
 '4.3': 9.521660649819493,
 '4.4': 9.284747292418773,
 '4.5': 8.81092057761733,
 '4.6': 6.836642599277979,
 '4.7': 4.320848375451

In [44]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
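
The frequency-table logic can be checked on a few toy rows where the percentages are easy to verify by hand. This sketch uses `collections.Counter` for the counting step; `freq_table_pct` and `toy_apps` are illustrative names rather than part of the notebook:

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return {value: percentage of rows} for the column at index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

toy_apps = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Tools'],
    ['App D', 'Games'],
]
table = freq_table_pct(toy_apps, 1)
# Sort by percentage, largest first
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

Sorting with a `key` function avoids building the intermediate list of (percentage, value) tuples.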

In [45]:
display_table(free_apps, 2)

NaN : 14.643501805054152
4.3 : 9.521660649819493
4.4 : 9.284747292418773
4.5 : 8.81092057761733
4.2 : 8.472472924187725
4.6 : 6.836642599277979
4.1 : 6.678700361010831
4.0 : 5.516696750902527
4.7 : 4.320848375451264
3.9 : 3.8695848375451267
3.8 : 3.012184115523466
5.0 : 2.7414259927797833
3.7 : 2.4029783393501805
4.8 : 2.0645306859205776
3.6 : 1.7712093862815883
3.5 : 1.6471119133574008
3.4 : 1.3086642599277978
3.3 : 1.0717509025270757
4.9 : 0.891245487364621
3.0 : 0.8235559566787004
3.1 : 0.7333032490974729
3.2 : 0.6881768953068592
2.9 : 0.4399819494584838
2.8 : 0.41741877256317694
2.6 : 0.24819494584837545
2.7 : 0.236913357400722
2.5 : 0.21435018050541518
2.3 : 0.2030685920577617
2.4 : 0.1917870036101083
2.2 : 0.157942238267148
1.0 : 0.157942238267148
2.0 : 0.12409747292418773
1.9 : 0.12409747292418773
2.1 : 0.09025270758122744
1.8 : 0.078971119133574
1.7 : 0.078971119133574
1.6 : 0.04512635379061372
1.5 : 0.033844765342960284
1.4 : 0.033844765342960284
1.2 : 0.01128158844765343


In [46]:
display_table(free_apps, 9) # table for genres (android)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [47]:
display_table(free_apps, 1) #table for category (android)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [48]:
display_table(free_apps2, 11) # table for prime_genre (ios); display_table prints and returns None

Games : 55.6240749876665
Entertainment : 8.238776517020227
Photo & Video : 4.119388258510114
Social Networking : 3.5273803650715343
Education : 3.2560434139121854
Shopping : 2.9847064627528366
Utilities : 2.688702516033547
Lifestyle : 2.318697582634435
Finance : 2.0720276270350273
Sports : 1.9486926492353234
Health & Fitness : 1.8746916625555006
Music : 1.6526887025160337
Book : 1.6280217069560927
Productivity : 1.5293537247163296
News : 1.4306857424765662
Travel : 1.3813517513566849
Food & Drink : 1.0606808090774542
Weather : 0.7646768623581648
Reference : 0.493339911198816
Navigation : 0.493339911198816
Business : 0.493339911198816
Catalogs : 0.2220029600394672
Medical : 0.19733596447952642


### Analysis

- On iOS, the most common genre is Games, and the second most common is Entertainment.
- On iOS, the Games genre makes up more than half of all free apps in the sample (about 56%).
- On iOS, most apps are designed for fun rather than for practical purposes.
- On Android, the most common genres are Tools (about 8%) and Entertainment (about 6%), and apps are spread more evenly across genres.
- Proportionally, there are more game apps in the App Store sample and more tools in the Google Play sample.

In [50]:
# Average number of user ratings (rating_count_tot, index 5) per genre
# (prime_genre, index 11). The iOS data has no install count, so the
# rating count serves as a rough proxy for the number of users.
genres_ios = freq_table(free_apps2, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in free_apps2:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Games : 18941.39733924612
Social Networking : 53078.195804195806
Reference : 67447.9
Entertainment : 10822.961077844311
Health & Fitness : 19952.315789473683
Finance : 13522.261904761905
Productivity : 19053.887096774193
Utilities : 14010.100917431193
Shopping : 18746.677685950413
Book : 8498.333333333334
Medical : 459.75
Catalogs : 1779.5555555555557
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Travel : 20216.01785714286
News : 15892.724137931034
Business : 6367.8
Weather : 47220.93548387097
Navigation : 25972.05
Music : 56482.02985074627
Education : 6266.333333333333
Lifestyle : 8978.308510638299
Photo & Video : 27249.892215568863
