# My first data analysis project
## Profitable App Profiles for the App Store and Google Play Markets


### Introduction
This project was developed during my coursework in [dataquest](http://www.dataquest.io). 
We pretended we're working as data analysts for a company that builds Android and iOS mobile apps. Those apps are free to download and install, and the company main focus is on its in-app ads, what gives us sustainment — the more users that see and engage with the ads, the better.

### Objective
Our goal for this project is to analyze data from Apple App Store and Google Play to help our developers understand what type of apps are likely to attract more users.

### Data sets
We'll analyze a sample of the data. Two data sets are used:
- [A data set](http://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018 and can be [downloaded](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017 and can be [downloaded](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Initially, we open the two dataset, that are saved in the default Jupyter folder. Then, the .csv files are converted to a list of lists.

In [1]:
from csv import reader
fileApple=open('AppleStore.csv', encoding='utf8')
fileGoogle=open('googleplaystore.csv', encoding='utf8')
readApple=reader(fileApple)
readGoogle=reader(fileGoogle)
datasetApple=list(readApple)
datasetGoogle=list(readGoogle)

Then, we create a function to explore the data and print it, as defined below

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns: # Asuming the data set has a header
        print('Number of rows:', len(dataset)-1) 
        print('Number of columns:', len(dataset[0]))

We use the function explore_data() to explore the first few rows of each data set.

In [3]:
explore_data(datasetApple,0,4,True)
print('\n')
explore_data(datasetGoogle,0,4,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 

We can independently observe the column names for each data set, in order to identify the values for each row. Remember to follow the link for [Apple](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) or [Google](https://www.kaggle.com/lava18/google-play-store-apps) to find out more in the documentation.

In [4]:
print('Google data set')
print('Number of rows:', len(datasetGoogle)-1) 
print('Number of columns:', len(datasetGoogle[0]))
print(datasetGoogle[0])
print('\n')
print('Apple data set')
print('Number of rows:', len(datasetApple)-1) 
print('Number of columns:', len(datasetApple[0]))
print(datasetApple[0])

Google data set
Number of rows: 10841
Number of columns: 13
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Apple data set
Number of rows: 7197
Number of columns: 16
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Data cleaning
We are ready to start with the data cleaning process, before analysis. We want to remove duplicate data, filter unused information and modify some data to fit the purpose of our analysis.

#### Deleting wrong data
Let's begin by detecting and deleting wrong data.

By checking the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) in Google data set, we can see that row 10473 (including header) has an error, as discussed [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). This entry has missing 'Category' _(Column 2)_ and a column shift occurred for next columns.

In [5]:
print('Column names')
print(datasetGoogle[0])
print('\n')
print('Regular row')
print(datasetGoogle[10472])
print('\n')
print('Problem row')
print(datasetGoogle[10473])

Column names
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Regular row
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


Problem row
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


So, we proceed to remove the row with the problem:

In [6]:
print('Before')
print('10472')
print(datasetGoogle[10472])
print('10473')
print(datasetGoogle[10473])
print('10474')
print(datasetGoogle[10474])
print('\n')
del datasetGoogle[10473]
print('After')
print('10472')
print(datasetGoogle[10472])
print('10473')
print(datasetGoogle[10473])

Before
10472
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10474
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


After
10472
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
10473
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


#### Deleting duplicates

Some duplicate entries have been found in Google Play Data set, as we can see below with Instagram app.

In [7]:
for app in datasetGoogle:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can evaluate the total duplicate entries:

In [8]:
duplicate_apps=[]
unique_apps=[]
for app in datasetGoogle:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Total of duplicate apps:',len(duplicate_apps))
print('\n')
print('Some duplicate apps:',duplicate_apps[:5])

Total of duplicate apps: 1181


Some duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Duplicates will not be removed randomnly. The main difference between duplicates happens on the fourth column, which corresponds to the number of reviews. The different numbers show the data was collected at different times. We can infere that the higher the number of reviews, the more recent the data should be. We'll only keep the row with the highest number of reviews and remove the other entries for any given app.

We will create a dictionary with, where each key is a unique app name and the corresponding value is the highest number of reviews of that app. Then, we use the information stored in the dictionary to create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [9]:
reviews_max={}
for app in datasetGoogle[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if (name in reviews_max and reviews_max[name]<n_reviews):
        reviews_max[name]=n_reviews
    if name not in reviews_max:
        reviews_max[name]=n_reviews
        
print('New length')
print(len(reviews_max))

New length
9659


Now we can use the dictionary created above to remove the duplicate rows.

We create two empty list to be filled: one with the result and another with the app name already added. Then we loop the data set to find out if the number of reviews of each row is the maximum (previously stored in reviews_max dictionary).

Remember the expected length is 9

In [10]:
android_clean=[]
already_added=[]
for app in datasetGoogle[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if (n_reviews==reviews_max[name] and name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print('New length')
print(len(android_clean))

New length
9659


#### Removing Non-English Apps

By checking the name of the app, we can see that our data sets have some non-english apps. Recalling the ASCII standard, there are 128 values english related, so we can check if the chacacter in the app name is not in those 128 options and know that the app is non-english. We will do it with ord() built-in function. By creating a custom function, we can check if a string has only english characters, as seen below. The function detects as english apps if its names have 3 or less non-english characters (sometimes app names include a symbol or emoji, so we have to take that in mind).

In [11]:
def english(string):
    accumulator=0
    for char in string:
        if ord(char)>127:
            accumulator+=1
            if accumulator==3:
                return False
    return True

print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

True
False
True
True


We can use the function created above to filter our data sets. We will create a separate list for the english apps for each data set.

In [12]:
android_english=[]
ios_english=[]
for app in android_clean:
    name=app[0]
    if english(name)==True:
        android_english.append(app)
for app in datasetApple[1:]:
    name=app[2]
    if english(name)==True:
        ios_english.append(app)
        
print('New Google data set length is: '+str(len(android_english)))
print('New Apple data set length is: '+str(len(ios_english)))

New Google data set length is: 9597
New Apple data set length is: 7197


#### Isolating Free Apps

We will keep the data for the free apps only from each repository, so we will check the price (index 7 for Google and 5 for Apple). If the price is $0, we will add the app to a new list of free apps.

In [13]:
android_free=[]
ios_free=[]
for app in android_english:
    price=app[7]
    if price=='0':
        android_free.append(app)
for app in ios_english:
    price=float(app[4])
    if price==0:
        ios_free.append(app)
        
print(android_free[0])
print('\n')
print(android_free[1])
print('\n')
print(ios_free[0])
print('\n')
print(ios_free[1])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


### Finding most common Apps by genre
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

Our validation strategy for an app idea is comprised of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Let's begin the analysis by getting a sense of what are the most common genres for each market.

We create a function named freq_table() that takes two inputs: data set and index. It loops the data set and creates a dictionary with frequency for the indexed column.

In [14]:
def freq_table(dataset, index):
    dictionary={}
    total=0
    for i in dataset:
        column=i[index]
        total+=1
        if column in dictionary:
            dictionary[column]+=1
        else:
            dictionary[column]=1
    table_percentages = {}
    for key in dictionary:
        percentage = (dictionary[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

Then, we create the display_table() function that combines the freq_table() function with some steps to sort the dictionary in decreasing order.

In [15]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Then, we can analyze the frequency tables, in order to find the most common genre in iOS for free english apps.

In [16]:
display_table(ios_free, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


We can see that among the free English apps, more than a half (55.65%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.53% of the apps are designed for social networking, followed by educational apps which amount for 3.25% of the apps in our data set.

The general impression is that App Store (at least for free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, etc.), while apps with practical purposes (education, shopping, utilities, productivity, etc.) are more rare. However, the fact that fun apps are the most numerous doesn't also imply that they also have the greatest number of users — the demand might not be the same as the offer.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [17]:
display_table(android_free, 1) #Category column

FAMILY : 18.942133815551536
GAME : 9.697106690777577
TOOLS : 8.453887884267631
BUSINESS : 4.599909584086799
PRODUCTIVITY : 3.899186256781193
LIFESTYLE : 3.887884267631103
FINANCE : 3.7070524412296564
MEDICAL : 3.5375226039783
SPORTS : 3.390596745027125
PERSONALIZATION : 3.322784810126582
COMMUNICATION : 3.2323688969258586
HEALTH_AND_FITNESS : 3.0854430379746836
PHOTOGRAPHY : 2.949819168173599
NEWS_AND_MAGAZINES : 2.802893309222423
SOCIAL : 2.667269439421338
TRAVEL_AND_LOCAL : 2.3395117540687163
SHOPPING : 2.2490958408679926
BOOKS_AND_REFERENCE : 2.1360759493670884
DATING : 1.8648282097649187
VIDEO_PLAYERS : 1.7970162748643763
MAPS_AND_NAVIGATION : 1.3901446654611211
FOOD_AND_DRINK : 1.2432188065099457
EDUCATION : 1.164104882459313
ENTERTAINMENT : 0.9606690777576853
LIBRARIES_AND_DEMO : 0.9380650994575045
AUTO_AND_VEHICLES : 0.9267631103074141
HOUSE_AND_HOME : 0.8024412296564195
WEATHER : 0.7911392405063291
EVENTS : 0.7120253164556962
PARENTING : 0.6555153707052441
ART_AND_DESIGN : 0.64

We can see that Google data set is different. Family apps are the most common ones with almost 19%, followed up by games (9.7 %) and then tools with 8.45%. If we investigate this further, we can see that the family category means mostly games for kids.

The Category list for Android seems to be a little longer than the one for iOS, and also we can check the Genres column for Google to provide more specifications.

In [18]:
display_table(android_free, -4) #Genres column

Tools : 8.44258589511754
Entertainment : 6.080470162748644
Education : 5.357142857142857
Business : 4.599909584086799
Productivity : 3.899186256781193
Lifestyle : 3.8765822784810124
Finance : 3.7070524412296564
Medical : 3.5375226039783
Sports : 3.4584086799276674
Personalization : 3.322784810126582
Communication : 3.2323688969258586
Action : 3.096745027124774
Health & Fitness : 3.0854430379746836
Photography : 2.949819168173599
News & Magazines : 2.802893309222423
Social : 2.667269439421338
Travel & Local : 2.328209764918626
Shopping : 2.2490958408679926
Books & Reference : 2.1360759493670884
Simulation : 2.0456600361663653
Dating : 1.8648282097649187
Arcade : 1.842224231464738
Video Players & Editors : 1.7744122965641953
Casual : 1.763110307414105
Maps & Navigation : 1.3901446654611211
Food & Drink : 1.2432188065099457
Puzzle : 1.1301989150090417
Racing : 0.9945750452079566
Role Playing : 0.9380650994575045
Libraries & Demo : 0.9380650994575045
Auto & Vehicles : 0.9267631103074141
St

There is not much further information to analyse in this case, the columns seem to be simmilar.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps.

### Most popular Apps by Genre
To find out what genres are the most popular will be different for each data set. For the Google Play data set, we can find the Installs column, that describes the total downloads of each app. For the App Store data set, that information is missing, so we'll take the total number of user ratings in the column rating_count_tot app.

#### App Store
Let's start with calculating the average number of user ratings per app genre on the App Store.

In [19]:
genres_ios = freq_table(ios_free, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Travel : 20216.01785714286
Weather : 47220.93548387097
Reference : 67447.9
Entertainment : 10822.961077844311
Education : 6266.333333333333
Lifestyle : 8978.308510638299
Health & Fitness : 19952.315789473683
Shopping : 18746.677685950413
Photo & Video : 27249.892215568863
Sports : 20128.974683544304
Social Networking : 53078.195804195806
Utilities : 14010.100917431193
Productivity : 19053.887096774193
Music : 56482.02985074627
Finance : 13522.261904761905
Medical : 459.75
Book : 8498.333333333334
Business : 6367.8
Food & Drink : 20179.093023255813
Games : 18924.68896765618
Catalogs : 1779.5555555555557
Navigation : 25972.05
News : 15892.724137931034


Social Networking, Weather and Music seem to be the most popular genres. Those are above 47K ratings, while categories such as Education or Medical are below 7K.

#### Google Play Store
We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.). We will consider the number without the '+' symbol, because it will not affect the analysis, we are looking for popularity, not exact values.


In [23]:
android_category=freq_table(android_free,1)

for category in android_category:
    total=0
    len_category=0
    for app in android_free:
        category_app=app[1]
        if category_app==category:            
            installs=app[5]
            n_installs=installs.replace('+','')
            n_installs=n_installs.replace(',','')
            n_installs=float(n_installs)
            total+=n_installs
            len_category+= 1
    avg_n_installs = total/len_category
    print(category, ':', avg_n_installs)

LIBRARIES_AND_DEMO : 638503.734939759
ENTERTAINMENT : 11640705.88235294
FAMILY : 3695641.8198090694
BOOKS_AND_REFERENCE : 8814199.78835979
MAPS_AND_NAVIGATION : 4049274.6341463416
SOCIAL : 23253652.127118643
VIDEO_PLAYERS : 24727872.452830188
LIFESTYLE : 1446158.2238372094
COMMUNICATION : 38590581.08741259
GAME : 15544014.51048951
HOUSE_AND_HOME : 1360598.042253521
NEWS_AND_MAGAZINES : 9549178.467741935
FOOD_AND_DRINK : 1924897.7363636363
EVENTS : 253542.22222222222
AUTO_AND_VEHICLES : 647317.8170731707
COMICS : 832613.8888888889
TRAVEL_AND_LOCAL : 13984077.710144928
HEALTH_AND_FITNESS : 4188821.9853479853
WEATHER : 5145550.285714285
MEDICAL : 120550.61980830671
ART_AND_DESIGN : 1986335.0877192982
SHOPPING : 7036877.311557789
TOOLS : 10830251.970588235
PHOTOGRAPHY : 17840110.40229885
BUSINESS : 1712290.1474201474
FINANCE : 1387692.475609756
PARENTING : 542603.6206896552
BEAUTY : 513151.88679245283
PERSONALIZATION : 5201482.6122448975
DATING : 854028.8303030303
SPORTS : 3650602.27666666

BOOKS_AND_REFERENCE, VIDEO_PLAYERS and COMMUNICATION seem to be the most popular categories for Android devices.

### Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that there are different type of applications prefered in each market, and a combination might be desirable to be successfull in both of them.