### Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets, and to help our developers make data-driven decisions about the kind of apps they build.

The analysis focuses on apps that are free to download, because our main source of revenue is in-app ads. This means that the revenue for any given app is mostly influenced by the number of people who use it. Our goal for this project is to analyze data to help the company understand what kinds of apps are likely to attract more users.


### 1. Import libraries

In [1]:
import pandas as pd
from csv import reader

### 2. Import dataset
As of September 2018, there were almost 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. The two data sets used here come from Kaggle; the CSV files can be downloaded from the links below:
- [Android apps](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- [iOS apps](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In [12]:
#Android
open_file = open('.../googleplaystore.csv')#data set-->https://dq-content.s3.amazonaws.com/350/googleplaystore.csv
read_file = reader(open_file)
android_data = list(read_file)
android_header = android_data[0]
android = android_data[1:]

#IOS
opened_file = open('.../AppleStore.csv')#data set-->https://dq-content.s3.amazonaws.com/350/AppleStore.csv
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
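
The loading cell above leaves both files open and relies on the platform's default encoding. As a minimal sketch of a more defensive loader (the helper name `load_csv` and the `utf8` encoding are assumptions, not part of the original notebook), the same header/rows split can be wrapped in a function that closes the file automatically:

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file and close the file afterwards."""
    # newline="" is what the csv module recommends when opening files
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Self-check on a throwaway file with made-up data (not the real data set).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf8", newline="") as f:
        f.write("App,Price\nDemo App,0\n")
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)  # ['App', 'Price']
print(sample_rows)    # [['Demo App', '0']]
```

With the real files, the call would look like `android_header, android = load_csv('.../googleplaystore.csv')`, using the same path as in the cell above.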

To make it easier to explore the data sets, I wrote a function named __explore_data()__ that I can use repeatedly to print rows in a more readable way. It also has an optional parameter, __rows_and_columns__, that prints the number of rows and columns of any data set.

In [13]:
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print the rows in dataset[start:end], each followed by a blank line
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play data set has 10,841 apps and 13 columns. The columns that might be useful for the purpose of our analysis are: ['App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Genres']

In [14]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The App Store data set has 7,197 apps and 16 columns. The columns that might be useful for the purpose of our analysis are: ['track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'prime_genre']
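
The rest of the notebook refers to these columns by position (for example `app[7]` or `app[-5]`). As a small sketch, a hypothetical helper `column_index` (not part of the original notebook) can look up a position from the header, which makes it easy to double-check those numbers. The two headers are copied from the output above.

```python
def column_index(header, name):
    """Return the position of `name` in `header`, with a clear error if absent."""
    try:
        return header.index(name)
    except ValueError:
        raise KeyError(f"column {name!r} not found in header") from None


android_cols = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                'Current Ver', 'Android Ver']
ios_cols = ['id', 'track_name', 'size_bytes', 'currency', 'price',
            'rating_count_tot', 'rating_count_ver', 'user_rating',
            'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
            'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

print(column_index(android_cols, 'Price'))     # 7
print(column_index(android_cols, 'Installs'))  # 5
print(column_index(ios_cols, 'prime_genre'))   # 11, the same column as index -5
```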

### 3. Data cleaning

#### 3.1 Deleting wrong data
One of the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the Kaggle page for the Google Play data set reports an error in row 10472. Let's print this row and compare it against the header and another row that is correct.

In [15]:
print(android_header)
print('\n')
print(android[10472])
print('\n')
print(android[10471])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


In [21]:
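print(len(android))
# Row 10472 has 12 values instead of 13 (the Category is missing), so remove it.
del android[10472]  # run this cell only once, or another row will be deleted
print(len(android))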

10841
10840
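
The faulty row was found through a Kaggle discussion, but the same kind of problem can also be detected programmatically. The sketch below assumes that a malformed row is one whose number of fields differs from the header's; `find_malformed_rows` is a hypothetical helper and the toy data is made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # the Category value is missing
    ['App C', 'GAME', '3.5'],
]

bad_indices = find_malformed_rows(toy_header, toy_rows)
print(bad_indices)  # [1]

# Delete from the end so that the remaining indices stay valid.
for i in reversed(bad_indices):
    del toy_rows[i]
print(len(toy_rows))  # 2
```

Unlike a hard-coded `del`, this loop is safe to re-run: a second pass finds no malformed rows and deletes nothing.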


#### 3.2 Removing duplicate entries

In [22]:
duplication = []
uniqe_app = []
for app in android:
    title = app[0]
    if title in uniqe_app:
        duplication.append(title)
    else:
        uniqe_app.append(title)
print('Number of duplicate apps:', len(duplication))
print('\n')
print('Number of unique apps:', len(uniqe_app))

Number of duplicate apps: 1181


Number of unique apps: 9659


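I don't want to count certain apps more than once, so I need to remove the duplicate entries and keep only one entry per app. Rather than removing duplicates at random, I keep the row with the highest number of reviews, since it should be the most recent snapshot of the app.

The next steps are: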

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and select the apps with the highest number of reviews)

In [23]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    # Keep the largest review count seen so far for each app name
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected length:', len(uniqe_app))
print('\n')
print('Actual length:', len(reviews_max))

Expected length: 9659


Actual length: 9659


Now, let's use the __reviews_max__ dictionary to remove the duplicates. For the duplicate cases, I'll only keep the entries with the highest number of reviews. In the code cell below:
- I start by initializing two empty lists, __android_clean__ and __already_added__.
- I loop through the __android__ data set, and for every iteration:
     - I isolate the name of the app and the number of reviews.
     - I add the current row (app) to the __android_clean__ list, and the app name (name) to the __already_added__ list if:
         - The number of reviews of the current app matches the number of reviews of that app as described in the __reviews_max__ dictionary; and
         - The name of the app is not already in the __already_added__ list. I need this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry.

In [24]:
android_clean = []
already_added = []

for app in android:
    name= app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        

#### Let's quickly explore the new data set and confirm that the number of rows is 9,659.

In [25]:
explore_data(android_clean, 0, 1, True)  # explore_data() is defined above

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13
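
The two-pass approach above can also be written as a single pass. The sketch below is an alternative, not the notebook's method: `keep_most_reviewed` is a hypothetical helper that stores the best row per app name in a dictionary, and the toy rows are made up. One difference to be aware of is that the result is ordered by the first appearance of each name, not by the position of the kept row.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row when two entries tie on reviews.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


toy_apps = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'GAME', '4.2', '5'],
    ['App A', 'TOOLS', '4.0', '30'],   # same app, more reviews
    ['App B', 'GAME', '4.2', '5'],     # exact duplicate
]

deduplicated = keep_most_reviewed(toy_apps)
for row in deduplicated:
    print(row)
# ['App A', 'TOOLS', '4.0', '30']
# ['App B', 'GAME', '4.2', '5']
```

Because membership tests on a dictionary do not scan the whole collection, this version also avoids the repeated `name not in already_added` list lookups.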


#### 3.3 Removing non-English apps

In [26]:
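def english_app(title):
    # Count the characters that fall outside the ASCII range (code points 0-127)
    non_ascii = 0

    for letter in title:
        if ord(letter) > 127:
            non_ascii += 1

    # Allow up to three non-ASCII characters so that names containing
    # a few symbols or emoji are still treated as English
    if non_ascii > 3:
        return False
    else:
        return True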
print(english_app('Instagram'))
print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app('Instachat '))

True
False
True
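
To see why the function tolerates a few non-ASCII characters, here is a sketch with the threshold exposed as a parameter. `is_english` and `max_non_ascii` are hypothetical names, and the titles are only illustrative inputs. With a threshold of 0, a title with a single trademark sign would be rejected.

```python
def is_english(title, max_non_ascii=3):
    """Treat a title as English if it has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in title if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_english('Docs To Go™ Free Office Suite'))                   # True
print(is_english('Instachat 😜'))                                    # True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))                   # False
print(is_english('Docs To Go™ Free Office Suite', max_non_ascii=0))  # False
```

The heuristic is not perfect: a short non-English title with three or fewer non-ASCII characters still passes, and an English title with many emoji is rejected.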


In [27]:
android_english = []
android_non_eng = []

for app in android_clean:
    name= app[0]
    if english_app(name):
        android_english.append(app)
    else:
        android_non_eng.append(app)

ios_english = []
ios_non_eng = []
        
for app in ios:
    name= app[1]
    if english_app(name):
        ios_english.append(app)
    else:
        ios_non_eng.append(app)
        
print('English apps: Google Play {}, iOS {}'.format(len(android_english), len(ios_english)))
print('Non-English apps: Google Play {}, iOS {}'.format(len(android_non_eng), len(ios_non_eng)))


English apps: Google Play 9614, iOS 6183
Non-English apps: Google Play 45, iOS 1014


#### 3.4 Isolating the free apps

In [28]:
android_final =[]
ios_final = []

for app in android_english:
    price = app[7]
    if price =='0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price =='0.0':
        ios_final.append(app)
print('Final data set: Google Play apps {}, iOS apps {}'.format(len(android_final), len(ios_final)))

Final data set: Google Play apps 8864, iOS apps 3222
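
The filter above compares price strings exactly (`'0'` for Google Play, `'0.0'` for the App Store), so it depends on how each file happens to write a zero price. A sketch of a numeric check follows; `is_free` is a hypothetical helper, and the assumption that a price may carry a leading `$` is mine and should be verified against the data.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' means zero."""
    cleaned = price.strip().lstrip('$')  # assumption: an optional leading '$'
    return float(cleaned) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

The same function could then be used for both data sets, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.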


To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because the end goal is to publish on both stores, we look for app profiles that do well on both markets, starting with the most common genres in each.

In [35]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total +=1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    table_percent = {}

    # Convert the raw counts into percentages of the whole data set
    for key in table:
        percentage = (table[key] / total) * 100
        table_percent[key] = percentage
    return table_percent

def display_table (dataset,index):
    table= freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val = (table[key],key)
        table_display.append(key_val)
        
    table_sorted = sorted(table_display, reverse= True)
    for entry in table_sorted:
        print(entry[1],':', round(entry[0],2), '%')
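
For comparison, the same frequency table can be built with `collections.Counter` from the standard library. This is a sketch of an equivalent approach, not a replacement for the functions above; `freq_table_counter` is a hypothetical name and the toy rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count
    return {key: count / total * 100 for key, count in counts.most_common()}


toy_dataset = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
toy_table = freq_table_counter(toy_dataset, 1)
for key, pct in toy_table.items():
    print(key, ':', round(pct, 2), '%')
# GAME : 75.0 %
# TOOLS : 25.0 %
```

Since dictionaries keep insertion order, the result is already sorted and no separate list of tuples is needed for display.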
        

In [36]:
display_table(android_final,1)

FAMILY : 19.22 %
GAME : 9.51 %
TOOLS : 8.46 %
BUSINESS : 4.58 %
LIFESTYLE : 3.9 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.54 %
SPORTS : 3.42 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.25 %
HEALTH_AND_FITNESS : 3.07 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.78 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.13 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
ENTERTAINMENT : 0.88 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %


In [37]:
display_table(ios_final,-5)

Games : 58.16 %
Entertainment : 7.88 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.51 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.33 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

In [162]:
display_table(android_final,-4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.580324909747293
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.5424187725631766
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2490974729241873
Action : 3.1024368231046933
Health & Fitness : 3.068592057761733
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.861462093862816
Video Players & Editors : 1.782490974729242
Casual : 1.7486462093862816
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.925090252707581

__The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.__

### Most Popular Apps by Genre on the App Store

In [171]:
genre_ios = freq_table(ios_final, -5)  # prime_genre column
for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])  # rating_count_tot column
            total += n_ratings
            len_genre += 1
    # Average number of user ratings per app in this genre
    avg_n_ratings = round(total / len_genre, 2)
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0
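
The cell above scans the whole data set once per genre. A sketch of a single-pass alternative follows, accumulating a running total and count per group; `mean_by_group` is a hypothetical helper and the toy rows are made up.

```python
from collections import defaultdict


def mean_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


toy_ratings = [['Games', '100'], ['Games', '300'], ['Medical', '50']]
toy_means = mean_by_group(toy_ratings, 0, 1)
print(toy_means)  # {'Games': 200.0, 'Medical': 50.0}
```

On the real data the call would be along the lines of `mean_by_group(ios_final, -5, 5)`.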


In [165]:
freq_table(ios_final,-5)

{'Social Networking': 3.2898820608317814,
 'Photo & Video': 4.9658597144630665,
 'Games': 58.16263190564867,
 'Music': 2.0484171322160147,
 'Reference': 0.5586592178770949,
 'Health & Fitness': 2.0173805090006205,
 'Weather': 0.8690254500310366,
 'Utilities': 2.5139664804469275,
 'Travel': 1.2414649286157666,
 'Shopping': 2.60707635009311,
 'News': 1.3345747982619491,
 'Navigation': 0.186219739292365,
 'Lifestyle': 1.5828677839851024,
 'Entertainment': 7.883302296710118,
 'Food & Drink': 0.8069522036002483,
 'Sports': 2.1415270018621975,
 'Book': 0.4345127250155183,
 'Finance': 1.1173184357541899,
 'Education': 3.662321539416512,
 'Productivity': 1.7380509000620732,
 'Business': 0.5276225946617008,
 'Catalogs': 0.12414649286157665,
 'Medical': 0.186219739292365}

On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [176]:
for app in ios_final:
    if app[-5]=='Navigation':
        print(app[1],':',app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
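
To quantify how much the two largest apps skew the Navigation average, the sketch below compares the mean with the median for the six rating counts printed above. Using the median as a robustness check is my suggestion, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = round(mean(navigation_ratings), 2)
nav_median = median(navigation_ratings)
top_two_share = round((345046 + 154911) / sum(navigation_ratings) * 100, 1)

print(nav_mean)       # 86090.33, matching the Navigation average computed earlier
print(nav_median)     # 8196.5
print(top_two_share)  # 96.8 (percent of all Navigation ratings)
```

The median is roughly a tenth of the mean, which confirms that the genre average says more about Waze and Google Maps than about a typical navigation app.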


### Most Popular Apps by Genre on Google Play

In [180]:
display_table(android_final, 5)  # the Installs column

1,000,000+ : 15.749097472924186
100,000+ : 11.563628158844766
10,000,000+ : 10.503158844765343
10,000+ : 10.209837545126353
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [186]:
cat_android = freq_table(android_final, 1)
for cat in cat_android:
    total =0
    len_cat = 0
    for app in android_final:
        cat_app= app[1]
        if cat_app ==cat:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_cat += 1
    avg_n_installs = round(total / len_cat,2)
    print(cat, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1704192.34
COMICS : 817657.27
COMMUNICATION : 38326063.2
DATING : 854028.83
EDUCATION : 1768500.0
ENTERTAINMENT : 9146923.08
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4167457.36
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 12914435.88
FAMILY : 5180161.79
MEDICAL : 123064.79
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.4
SPORTS : 4274688.72
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.3
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16772838.59
PARENTING : 542603.62
WEATHER : 5074486.2
VIDEO_PLAYERS : 24790074.18
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77
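
The install counts are open-ended buckets such as `100,000+`, so the averages above treat each bucket's lower bound as the actual number of installs. The conversion step can be isolated in a small helper; `parse_installs` is a hypothetical name for this sketch.

```python
def parse_installs(text):
    """Convert a bucket label such as '1,000,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))  # 1000000
print(parse_installs('500+'))        # 500
print(parse_installs('0'))           # 0
```

Because every value is a lower bound, the category averages are best read as rough, conservative estimates, not exact install counts.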


In [194]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0],':',app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
Gmail : 1,000,000,000+
imo beta free calls and text : 100,000,000+
imo free video calls and chat : 500,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
WhatsApp Messenger : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Hangouts : 1,000,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess
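
The chained `or` comparisons in the cell above can be replaced by a numeric threshold. The sketch below assumes the same column positions as the Google Play data set (name at 0, category at 1, installs at 5); `apps_with_min_installs` is a hypothetical helper and the toy rows are made up.

```python
def apps_with_min_installs(rows, category, min_installs,
                           name_idx=0, category_idx=1, installs_idx=5):
    """Return (name, installs label) pairs for one category above a threshold."""
    selected = []
    for row in rows:
        # Turn a label such as '100,000,000+' into the integer 100000000
        installs = int(row[installs_idx].replace(',', '').replace('+', ''))
        if row[category_idx] == category and installs >= min_installs:
            selected.append((row[name_idx], row[installs_idx]))
    return selected


# Made-up rows; only columns 0, 1 and 5 matter for this example.
toy_android = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '50,000+'],
    ['App C', 'FOOD_AND_DRINK', '', '', '', '500,000,000+'],
    ['App D', 'COMMUNICATION', '', '', '', '100,000,000+'],
]

big_communication = apps_with_min_installs(toy_android, 'COMMUNICATION', 100_000_000)
for name, installs in big_communication:
    print(name, ':', installs)
# App A : 1,000,000,000+
# App D : 100,000,000+
```

A threshold also covers buckets that a hand-written list of labels might miss.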

In [204]:

for app in android_final:
    if app[1] == 'FOOD_AND_DRINK' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0],':',app[5])

Cookpad - FREE recipe search makes fun cooking · musical making! : 10,000,000+
DELISH KITCHEN - FREE recipe movies make food fun and easy! : 1,000,000+
Delicious Recipes : 1,000,000+
Tastely : 10,000,000+
Pastry & Cooking (Without Net) : 1,000,000+
McDonald's - McDonald's Japan : 10,000,000+
Pyaterochka : 1,000,000+
Refreshing app Free application that can use deal coupons : 1,000,000+
Grubhub: Food Delivery : 5,000,000+
hellofood - Food Delivery : 1,000,000+
Domino's Pizza USA : 10,000,000+
Chef - Recipes & Cooking : 5,000,000+
Delivery Club-food delivery: pizza, sushi, burger, salad : 5,000,000+
HungerStation : 1,000,000+
Delivery yogi. : 10,000,000+
Delivery trough - delivery trough delivery trough : 5,000,000+
Dr. Oetker recipe ideas : 1,000,000+
GialloZafferano: Recipes : 1,000,000+
OpenRice : 1,000,000+
Eat Fast Prepare "Without Internet" : 1,000,000+
Cookbook Recipes : 5,000,000+
My CookBook (Recipe Manager) : 1,000,000+
Allrecipes Dinner Spinner : 5,000,000+
Yummly Recipes & Sh