# Profitable App Profiles for the App Store and Google Play Markets

In this guided project I analyze data sets from two markets, Google Play and the App Store, to find the app types that are potentially the most profitable.

I focus only on free apps aimed at English-speaking users, which earn their revenue from in-app ads.
In a few steps I present the data cleaning and then the analysis.

The Google Play data set comes from __[here](https://www.kaggle.com/lava18/google-play-store-apps/home)__.

The App Store data set comes from __[here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)__.

## Step 1: Opening the Google Play and App Store data sets

In [1]:
from csv import reader

# Android data set
# encoding='utf8' is an assumption; it avoids decode errors on non-ASCII app names
with open('googleplaystore.csv', encoding='utf8') as file_and:
    android = list(reader(file_and))
header_and = android[0]
data_and = android[1:]

# iOS data set
with open('AppleStore.csv', encoding='utf8') as file_ios:
    ios = list(reader(file_ios))
header_ios = ios[0]
data_ios = ios[1:]


## Using the explore_data() function to investigate both data sets

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# explore_data() prints its results and returns None, so it is called directly
explore_data(data_and, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
# explore_data() prints its results and returns None, so it is called directly
explore_data(data_ios, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


Displaying both header rows for easier data identification.

In [5]:
print(header_and)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [6]:
print(header_ios)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


# Step I: Cleaning Data

On the kaggle.com website there is a discussion forum for these data sets, where users have reported an error in the Google Play data set, which I fix next. The topic can be found and followed __[here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)__.


In [7]:
# Printing row 10472 of the Android data set: it has 12 values instead of 13
# because the category is missing, so the remaining columns are shifted left.
print(data_and[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
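
A row like this can also be located programmatically instead of relying on a known index. The sketch below assumes that a malformed row is simply one whose length differs from the header's. `find_malformed_rows` is a hypothetical helper name, and the sample rows are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny illustrative sample (not the real data set).
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],           # the category is missing, so columns shift left
    ['App C', 'GAME', '3.9'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

On the real data this would be called as `find_malformed_rows(header_and, data_and)` before deleting anything.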


Deleting the erroneous row from the Google Play data set:

In [8]:
del data_and[10472]

## Checking whether the data set contains duplicates

In [9]:
unique_apps = []
duplicate_apps = []

# iterating over the data set and collecting repeated app names in a separate list
for row in data_and:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

# displaying the number of duplicate entries
print(len(duplicate_apps))

    

1181
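
The same count can be obtained without scanning a list for every row. This is a minimal sketch using `collections.Counter` from the standard library. `count_duplicates` is a hypothetical helper, and the sample rows are invented for illustration.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Count rows whose app name has already appeared earlier in the data."""
    counts = Counter(row[name_index] for row in rows)
    # Every occurrence beyond the first one counts as a duplicate.
    return sum(n - 1 for n in counts.values())


# Illustrative sample: 'App A' appears three times, 'App B' once.
sample_rows = [
    ['App A', '100'],
    ['App A', '120'],
    ['App B', '50'],
    ['App A', '90'],
]

n_duplicates = count_duplicates(sample_rows)
print(n_duplicates)
```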


## Removing duplicate Apps from Google Play dataset

Duplicate entries are not removed at random. For each app I keep the row with the highest number of reviews, which should be the most recent one. Other criteria, such as the last updated date or the number of installs, could be used as well.

### Step 1
### Creating a dictionary where the key is the app name and the value is the highest number of reviews found in the data set

In [10]:
reviews_max = {}

for app in data_and:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    
        

Displaying the length of the dictionary to make sure it matches the expected number of unique apps in the data set:

In [37]:
print(len(reviews_max))
print(len(data_and) - len(duplicate_apps))

9659
9659


### Step 2
### Removing duplicate rows from the Google Play data set

In [12]:
#creating two empty lists

android_clean = []
already_added = []

# iterating through the Google Play data set

for app in data_and:
    name = app[0]
    n_reviews = float(app[3])
    # keeping only the row with the most reviews, once per app name
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
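
The two steps above can also be folded into a single pass. The sketch below is an alternative, not the method used in this notebook. It assumes the app name and review count sit at the given indices, and `dedupe_by_max_reviews` is a hypothetical helper name. The result is ordered by each app's first appearance, which may differ from the order of `android_clean`.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Illustrative sample rows: [name, category, rating, reviews]
sample_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.0', '120'],
    ['App B', 'GAME', '4.5', '50'],
]

deduped = dedupe_by_max_reviews(sample_rows)
print(deduped)
```

A dictionary lookup avoids the repeated `name not in already_added` list scan, which matters more as the data set grows.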
        

### Step 3
### Removing non-English apps using the is_ang() function

In [38]:
# creating function

def is_ang(string):
    for char in string:
        if ord(char) > 127:
            return False
        
    return True
        
        

Checking sample app names to see whether they are detected as English and whether the function works:

In [14]:
is_ang('Instagram')

True

In [15]:
is_ang('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [16]:
is_ang('Docs To Go™ Free Office Suite')

False

In [17]:
is_ang('Instachat 😜')

False

Amending the function so that it checks app names correctly, allowing up to 3 special (non-ASCII) characters in a name.

In [39]:
# creating updated function

def is_ang(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count = count + 1

        if count > 3:
            return False
        
    return True

Checking updated function:

In [19]:
is_ang('Docs To Go™ Free Office Suite')

True

In [20]:
is_ang('Instachat 😜')

True

In [21]:
is_ang('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
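
The threshold of 3 is a judgment call, so it can help to make it a parameter. The following sketch is equivalent in spirit to the updated is_ang(). `is_english` and `max_non_ascii` are hypothetical names introduced here for illustration.

```python
def is_english(name, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜', max_non_ascii=0))
```

This is still only a heuristic: some English names may be rejected and some non-English names written in ASCII will pass.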

Using the updated is_ang() function to analyze both data sets and remove non-English apps.

In [22]:
# creating new lists for English apps and iterating over both data sets to filter them

andr_english = []
ios_engish = []

for app in android_clean:
    name = app[0]
    if is_ang(name):
        andr_english.append(app)
        
for app in data_ios:
    name = app[1]
    if is_ang(name):
        ios_engish.append(app)       
    

Exploring number of rows left in each data set

In [23]:
print(len(andr_english))

9614


In [24]:
print(len(ios_engish))

6183


### Step 4
### Isolating apps which can be downloaded free of charge

In [40]:
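# the lists are named andr_free and ios_free because the analysis cells use these names
andr_free = []
ios_free = []

# the price is stored as text: index 7 for Google Play, index 4 for the App Store
for app in andr_english:
    if app[7] == '0':
        andr_free.append(app)
        
for app in ios_engish:
    if app[4] == '0.0':
        ios_free.append(app)

print(len(andr_free))
print(len(ios_free))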

8864
3222
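
Comparing the price as text works here because free apps are stored as '0' and '0.0'. A slightly more defensive sketch converts the text to a number first. It assumes that paid Google Play prices may carry a leading '$' (an assumption about the data, not something shown above), and `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True when a price string represents zero; False for anything else."""
    cleaned = price.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a shifted column), so not treated as free.
        return False


print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```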


# Step II: Data Analysis

I am going to try to find the app profile that is most common on both markets, using the Genres and Category columns for Android and prime_genre for iOS. This will help me understand which apps are the most common on each market and identify the app profile with the biggest potential.


### Creating a frequency table function for further analysis
The function shows how often each value of a column appears in the data set.
Here the values are expressed as percentages.

In [41]:
def freq_table(dataset,index):
    freq_dict = {}
    count = 0
    
    for row in dataset:
        count = count +  1
        value = row[index]
        if value in freq_dict:
            freq_dict[value] = freq_dict[value] + 1
        else:
            freq_dict[value] = 1
            

    # transforming counts to percentages

    freq_perc = {}
    
    for each in freq_dict :
        percentage = (freq_dict[each] / count) * 100
        freq_perc[each] = percentage
       
    return freq_perc
    

### Using the display_table() function to present sorted results

In [27]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
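
For comparison, the counting and sorting done by freq_table() and display_table() can be sketched with `collections.Counter`, whose `most_common()` method returns values ordered from most to least frequent. `sorted_percentages` is a hypothetical helper, and the sample rows are invented.

```python
from collections import Counter


def sorted_percentages(rows, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count * 100 / total) for value, count in counts.most_common()]


# Illustrative sample rows: [name, genre]
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

table = sorted_percentages(sample_rows, 1)
for value, percentage in table:
    print(value, ':', percentage)
```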

Displaying and analyzing the prime_genre column from the iOS data set and the Genres and Category columns from the Google Play data set.

In [28]:
display_table(ios_free,11) #IOS dataset

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Looking at the iOS frequency table, more than half of the apps are games (58%). Entertainment apps take second place with nearly 8%, and Photo & Video apps come third at just under 5%.
It looks like there must be strong demand for entertainment-oriented apps such as games in the iOS data set, given how many of them there are compared with the much smaller number of informational apps.


In [29]:
display_table(andr_free,1) #Google Play data set

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Looking at the Google Play frequency table for the Category column, games do not appear as often as in the App Store data set. The share of games in the iOS data set is still high, which may indicate that there is a market for Android games too, especially since first place, with nearly 19% of all apps in the data set, is taken by family apps, which are also entertaining and probably include a number of games for kids. However, for more accurate suggestions we need the number of installs per genre to see how many users each genre actually has.

In [42]:
# displaying the Genres column from the Google Play data set
display_table(andr_free,9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

As we can see, entertainment apps are still among the most common by genre. We also should not underestimate Tools apps, which take first place.

So far we could be under the impression that, since there are so many entertainment apps, there must be demand for them.
In the further analysis we will look at the number of installs per genre to see how many users each type of app has and what the market proportions really are.

### Counting the average number of user ratings per genre in the iOS data set as a proxy for app installs

In [31]:
def freq_table(dataset,index):
    freq_dict = {}
    
    for row in dataset:
        value = row[index]
        if value in freq_dict:
            freq_dict[value] = freq_dict[value] +  1
        else:
            freq_dict[value] = 1
            
    return freq_dict

In [32]:
rat_count_tot = freq_table(ios_free,-5)

In [33]:
for genre in rat_count_tot:
    total = 0 # number of user ratings
    len_genre = 0 # number of apps per genre
    for each in ios_free:
        genre_app = each[-5] # 'prime genre' column
        if genre_app == genre:
            num_ratings = float(each[5]) # 'rating count total' column
            total = total + num_ratings
            len_genre = len_genre + 1
    average = total / len_genre
    print(genre, ':', average)
        
    
    

Medical : 612.0
Book : 39758.5
News : 21248.023255813954
Lifestyle : 16485.764705882353
Utilities : 18684.456790123455
Music : 57326.530303030304
Shopping : 26919.690476190477
Navigation : 86090.33333333333
Finance : 31467.944444444445
Photo & Video : 28441.54375
Education : 7003.983050847458
Games : 22788.6696905016
Productivity : 21028.410714285714
Weather : 52279.892857142855
Health & Fitness : 23298.015384615384
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Reference : 74942.11111111111
Business : 7491.117647058823
Sports : 23008.898550724636
Travel : 28243.8
Catalogs : 4004.0
Social Networking : 71548.34905660378
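
The loop above scans the whole data set once per genre. The same averages can be computed in a single pass. This is a minimal sketch assuming the group label and the numeric value sit at the given indices. `average_by_group` is a hypothetical helper, and the sample rows are invented.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Illustrative sample rows: [genre, rating count]
sample_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

averages = average_by_group(sample_rows, 0, 1)
print(averages)
```

For the iOS data this would correspond to `average_by_group(ios_free, -5, 5)`.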


The most popular genres in the App Store by average number of ratings are Navigation, Reference and Social Networking. This might be caused by apps such as Facebook, Instagram or Google Maps, so to come up with a recommendation I would also consider the genres just below the top 3.
Games and Entertainment apps still have a large number of installs, and it looks like there is a big market there. However, judging by the number of apps per genre, there is also a lot of competition in these types of apps.
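
One way to check whether a genre's average is driven by a few very large apps is to look at how much of the genre's total ratings the top apps hold. The sketch below is only an illustration: `top_app_share` is a hypothetical helper, and the sample rows and app names are made up.

```python
def top_app_share(rows, genre, genre_index, name_index, ratings_index, top_n=3):
    """Return (app name, percent of the genre's total ratings) for the top apps."""
    in_genre = [row for row in rows if row[genre_index] == genre]
    total = sum(float(row[ratings_index]) for row in in_genre)
    if total == 0:
        return []
    ranked = sorted(in_genre, key=lambda row: float(row[ratings_index]), reverse=True)
    return [
        (row[name_index], float(row[ratings_index]) * 100 / total)
        for row in ranked[:top_n]
    ]


# Illustrative sample rows: [name, genre, rating count]
sample_rows = [
    ['Big Maps', 'Navigation', '900'],
    ['Small Nav', 'Navigation', '100'],
    ['Some Game', 'Games', '500'],
]

shares = top_app_share(sample_rows, 'Navigation', 1, 0, 2)
print(shares)
```

If one or two apps hold most of a genre's ratings, the average says little about what a new app in that genre could expect.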

### Removing extra characters from the Installs column in the Google Play data set to compute the average number of installs per category

In [43]:
category_table = freq_table(andr_free, 1)

In [45]:
for category in category_table:
    total = 0 # sum of installs per category
    len_category = 0 # num apps per category
    for each in andr_free:
        category_app = each[1]
        if category_app == category:
            num_installs = each[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',','')
            num_installs = float(num_installs)
            total = total + num_installs
            len_category = len_category + 1
            
    average_installs = total / len_category
    print(category, ':',average_installs)
    

BEAUTY : 513151.88679245283
SPORTS : 3638640.1428571427
VIDEO_PLAYERS : 24727872.452830188
SHOPPING : 7036877.311557789
TRAVEL_AND_LOCAL : 13984077.710144928
COMICS : 817657.2727272727
MAPS_AND_NAVIGATION : 4056941.7741935486
BOOKS_AND_REFERENCE : 8767811.894736841
EVENTS : 253542.22222222222
HEALTH_AND_FITNESS : 4188821.9853479853
MEDICAL : 120550.61980830671
COMMUNICATION : 38456119.167247385
TOOLS : 10801391.298666667
FAMILY : 3695641.8198090694
NEWS_AND_MAGAZINES : 9549178.467741935
ART_AND_DESIGN : 1986335.0877192982
LIBRARIES_AND_DEMO : 638503.734939759
HOUSE_AND_HOME : 1331540.5616438356
GAME : 15588015.603248259
PHOTOGRAPHY : 17840110.40229885
SOCIAL : 23253652.127118643
PARENTING : 542603.6206896552
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
AUTO_AND_VEHICLES : 647317.8170731707
DATING : 854028.8303030303
FINANCE : 1387692.475609756
WEATHER : 5074486.197183099
ENTERTAINMENT : 11640705.88235294
FOOD_AND_DRINK : 1924897.7363636363
PRODUCTIVITY : 16787331.344927
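
The cleaning of the install counts can be isolated in a small function, which makes it easy to test on its own. `parse_installs` is a hypothetical helper. It assumes the values look like those shown earlier in this notebook, for example '10,000+'. Keep in mind that such values are open-ended buckets, so the parsed number is a lower bound rather than an exact count.

```python
def parse_installs(installs):
    """Convert a string such as '10,000+' into the integer 10000."""
    return int(installs.replace(',', '').replace('+', ''))


print(parse_installs('10,000+'))
print(parse_installs('50,000,000+'))
```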

Looking at this table we can clearly see that communication apps have the most installs, which is not a surprise, as these days we all use apps such as Messenger, WhatsApp or Hangouts to communicate with each other. As in the iOS data set, there is big potential for a new communication app; however, we also need to consider how difficult it would be to draw people away from existing, well-known apps. So, to come up with a new, potentially popular app profile, we could choose between following trends and competing with the giants on the market, or trying to find a niche between the already popular apps, where competition is much easier and the chance of taking the lead is much bigger.
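
To see how much the giants influence a category's average, the average can be recomputed while leaving out apps at or above some install count. The sketch below is an assumption-laden illustration: `average_installs_below` and the `cap` value are hypothetical, and the sample rows are invented.

```python
def average_installs_below(rows, category, cap, category_index=1, installs_index=5):
    """Average installs in a category, ignoring apps with installs >= cap."""
    values = []
    for row in rows:
        if row[category_index] != category:
            continue
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if installs < cap:
            values.append(installs)
    # None signals that no app in the category is below the cap.
    return sum(values) / len(values) if values else None


# Illustrative sample rows: [name, category, installs]
sample_rows = [
    ['Giant Chat', 'COMMUNICATION', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '10,000+'],
    ['Tiny Chat', 'COMMUNICATION', '1,000+'],
]

trimmed = average_installs_below(
    sample_rows, 'COMMUNICATION', cap=100_000_000,
    category_index=1, installs_index=2,
)
print(trimmed)
```

With the default indices the call for the real data would be `average_installs_below(andr_free, 'COMMUNICATION', cap=100_000_000)`. A large gap between this value and the plain average would suggest that a few dominant apps account for most of the installs.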