# Profitable App Profiles for the App Store and Google Play Markets

This is the first guided project of the Data Engineer module in Dataquest.io. 

# Opening and Exploring the Data

In [21]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

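def get_csv(file):
    from csv import reader
    # Close the file automatically; utf8 is assumed for these CSV files.
    with open(file, encoding='utf8') as opened_file:
        datalist = list(reader(opened_file))
    # Return the data rows and the header row separately.
    return datalist[1:], datalist[0]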

iOS_data, iOS_data_header = get_csv('AppleStore.csv')
android_data, android_data_header = get_csv('googleplaystore.csv')


We will print out the first four data rows for iOS.

In [22]:
print(iOS_data_header)
print('\n')
explore_data(iOS_data,0,4,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


We have 7197 iOS apps in this data set, and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Not all column names are self-explanatory, but details about each column can be found in the [data set documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

We will print out the first four data rows for Android.

In [23]:
print(android_data_header)
print('\n')
explore_data(android_data,0,4,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'. 

# Deleting Wrong Data

We need to look through the data to see if there are any errors in the rows. From [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), we learn that row 10472 contains an error and should be deleted.

In [24]:
print(len(android_data))
del android_data[10472]
print(len(android_data))

10841
10840
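
A bad row can sometimes be found without relying on a forum thread by comparing each row's length with the header's length. The following is a minimal sketch on a tiny made-up sample, not the real file; `find_malformed_rows` is a hypothetical helper name, and it only catches rows whose column count is wrong.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample: the second row is missing one field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
    ['App C', 'GAME', '3.9'],
]
malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

Note that `del android_data[10472]` removes whatever row sits at that index, so running the deletion cell a second time would delete a valid row.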


# Removing Duplicate Entries
## Part One

The Google Play dataset on Android has duplicate data for certain apps. Let's see an example of an app with duplicate data. 

In [25]:
for row in android_data:
    app = row[0]
    if app == 'Twitter':  # exact match; `in` would test for a substring of "Twitter"
        print(row)

['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']


Duplicates probably exist because new data entered for an app does not overwrite the old data. Since we are only concerned with the most recent data for each app, we need to deal with these duplicate entries. First, let's find out how many duplicate entries exist.

In [26]:
duplicate_data = []
unique_data = []
for row in android_data:
    app = row[0]
    if app in unique_data:
        duplicate_data.append(app)
    else:
        unique_data.append(app)
        
print('Number of duplicate apps in data:', len(duplicate_data))
print('\n')
print('Some examples of the duplicate apps include:', duplicate_data[20:30])
        

Number of duplicate apps in data: 1181


Some examples of the duplicate apps include: ['Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite']
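
The same count can be obtained with `collections.Counter` from the standard library, which avoids repeatedly searching a list. This is a sketch on made-up rows; `count_duplicates` is a hypothetical helper, and "extra rows" here means every occurrence of a name after the first, which is what `len(duplicate_data)` measures above.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return names that occur more than once and the number of extra rows."""
    counts = Counter(row[name_index] for row in rows)
    duplicates = {name: n for name, n in counts.items() if n > 1}
    # Each duplicated name contributes (occurrences - 1) extra rows.
    extra_rows = sum(n - 1 for n in duplicates.values())
    return duplicates, extra_rows

# Made-up sample rows: [app name, number of reviews]
sample_rows = [
    ['Twitter', '100'], ['Twitter', '100'], ['Twitter', '90'],
    ['Slack', '50'],
    ['Zoom', '7'], ['Zoom', '8'],
]
duplicates, extra_rows = count_duplicates(sample_rows)
print(duplicates)
print('Number of duplicate rows:', extra_rows)
```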


## Part Two

Let's now remove the duplicates and keep only the most recent entry for each app. We can identify the most recent entry using the fourth column, which contains the number of reviews: we will treat the row with the most reviews as the most recent one. We will create a dictionary that maps each unique app to its highest review count. Since the data set has 10840 rows and 1181 of them are duplicates, we expect to see 10840 - 1181 = 9659 unique apps.

In [27]:
reviews_max = {}
for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name]<n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


We will now create a new list containing data only from the unique apps from the dataset and no duplicates. 

In [28]:
android_clean = []
already_added = []
for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews==reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))

9659
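
The two passes above can also be combined into one by storing the best row per app directly in a dictionary. The sketch below uses made-up rows and a hypothetical helper name, `keep_most_reviewed`. It relies on dictionaries preserving insertion order (Python 3.7+), so the result is ordered by each app's first appearance, which may differ slightly from the order of `android_clean`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows: [name, category, rating, reviews]
sample_rows = [
    ['Twitter', 'NEWS', '4.3', '11657972'],
    ['Twitter', 'NEWS', '4.3', '11667403'],
    ['Twitter', 'NEWS', '4.3', '11667403'],
    ['Slack', 'BUSINESS', '4.4', '50'],
]
cleaned = keep_most_reviewed(sample_rows)
print(len(cleaned))
```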


# Removing Non-English Apps
## Part One

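For this project, we only want to analyze apps aimed at an English-speaking audience. We need a function that checks whether an app's title contains only characters common in English text, which we approximate as ASCII characters (code points 0 to 127).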

In [29]:
def isEnglish(input):
    for char in input:
        if ord(char)>127:
            return False
    return True
print(isEnglish('Instagram'))
print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish('Docs To Go™ Free Office Suite'))
print(isEnglish('Instachat 😜'))

True
False
False
False


## Part Two
In the previous step, the last two calls returned False. This should not happen because those apps are in English; the trademark symbol and the emoji are interpreted as non-English because both fall outside the ASCII range. To account for this, we will rewrite our function to reject an app only if its title has more than 3 non-ASCII characters.

In [30]:
def isEnglish(input):
    nonenglish = 0
    for char in input:
        if ord(char)>127:
            nonenglish+=1
    if nonenglish >3:
        return False
    return True

print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish('Docs To Go™ Free Office Suite'))
print(isEnglish('Instachat 😜'))

False
True
True
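
The threshold of 3 is a heuristic, so it can help to make it a parameter and experiment with other values. The variant below is a sketch only; `is_mostly_ascii` and `max_non_ascii` are hypothetical names that do not appear in the original project.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if text has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for char in text if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Instagram'))
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
# A threshold of 0 reproduces the stricter first version of the check.
print(is_mostly_ascii('Instachat 😜', max_non_ascii=0))
```

The check is still imperfect: a non-English title written with three or fewer non-ASCII characters passes, and an English title with four or more emoji is rejected.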


We now need to filter out non-English apps from both the Android and iOS data sets.

In [31]:
android_english = []
iOS_english = []

for app in android_clean:
    name = app[0]
    if isEnglish(name):
        android_english.append(app)

for app in iOS_data:
    name = app[1]
    if isEnglish(name):
        iOS_english.append(app)

explore_data(android_english,0,5,True)
print('\n')
explore_data(iOS_english,0,5,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', 

We can see that the above function removed 45 non-English apps from the Android dataset (9659 - 9614 = 45).

# Isolating the Free Apps

Our last step of the data cleaning process involves separating the free apps from the datasets.

In [32]:
free_android = []
free_iOS = []

for app in android_english:
    price = app[7]
    if price=='0':
        free_android.append(app)

for app in iOS_english:
    price = app[4]
    if price=='0.0':
        free_iOS.append(app)

print(len(free_android))
print(len(free_iOS))

8864
3222


# Most Common Apps by Genre
## Part One

Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the *prime_genre* column of the iOS data set, and the *Genres* and *Category* columns of the android data set.

## Part Two

We'll build two functions we can use to analyze the frequency tables:

* One function to generate frequency tables that show the number of apps for each value
* Another function we can use to display the table entries in descending order

In [33]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
def freq_table(dataset, index):
    # Count how many rows have each distinct value in the given column.
    table = {}

    for row in dataset:
        data = row[index]
        if data in table:
            table[data]+=1
        else:
            table[data]=1

    return table
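
The `freq_table` function above returns raw counts. If percentages are preferred, the counts can be divided by the number of rows. This is a minimal sketch on made-up data; `freq_table_percent` is a hypothetical name and is not used elsewhere in this project.

```python
def freq_table_percent(dataset, index):
    """Return a dict mapping each value in a column to its share of rows (in %)."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# Made-up sample rows: [app name, genre]
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
percentages = freq_table_percent(sample_rows, 1)
print(percentages)
```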
        

## Part Three

Now that we've created the above functions, let's analyze the frequency tables for the *prime_genre*, *Genres*, and *Category* columns. Let's first look at the *prime_genre* column.

In [34]:
display_table(free_iOS, 11)

Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


The most common genre of free app on the App Store is **Games**, which accounts for 1874 of the 3222 free English apps (about 58%). The majority of apps appear to be related to entertainment or "fun." However, we cannot assume that the most profitable app profiles are in entertainment just because the market is saturated with them: the number of apps in a genre says nothing about how many users those apps attract. We will continue analyzing the iOS data later.

Now let's take a look at frequency tables for the Google Play market. Let's look at the *Genres* and *Category* columns.

In [35]:
print("Frequency of Genres: ")
print('\n')
display_table(free_android, 9)
print('\n')
print("Frequency of Categories")
print('\n')
display_table(free_android, 1)

Frequency of Genres: 


Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 1

The most common genre of free app on the Google Play store is **Tools**, followed by **Entertainment**. The most common category is **Family**. It appears that games and entertainment apps are prominent in both the Apple and Google datasets, although Google Play apps seem to be geared more toward families and practical uses.

## Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

* Isolate the apps of each genre.
* Sum up the user ratings for the apps of that genre.
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

In [36]:
unique_genre_freq = freq_table(free_iOS,11)
for genre in unique_genre_freq:
    total = 0
    len_genre = 0
    for app in free_iOS:
        genre_app = app[11]
        if genre == genre_app:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
    avg_rating = total/len_genre
    print(genre,':',avg_rating)

Travel : 28243.8
Utilities : 18684.456790123455
Education : 7003.983050847458
Book : 39758.5
Navigation : 86090.33333333333
Sports : 23008.898550724636
Music : 57326.530303030304
Social Networking : 71548.34905660378
Catalogs : 4004.0
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Games : 22788.6696905016
Business : 7491.117647058823
Health & Fitness : 23298.015384615384
Reference : 74942.11111111111
Food & Drink : 33333.92307692308
Productivity : 21028.410714285714
Shopping : 26919.690476190477
Weather : 52279.892857142855
Medical : 612.0
Lifestyle : 16485.764705882353
Finance : 31467.944444444445
News : 21248.023255813954
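
The loop above rescans the whole data set once per genre. The same averages can be computed in a single pass by accumulating a running total and a count for each genre. The sketch below uses made-up rows and a hypothetical helper, `average_by_group`; sorting the result also makes the ranking easier to read than the unsorted output above.

```python
def average_by_group(rows, group_index, value_index):
    """Average the numeric column value_index within each group_index value."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: [app name, genre, number of ratings]
sample_rows = [
    ['Maps', 'Navigation', '100'],
    ['Waze', 'Navigation', '300'],
    ['Chess', 'Games', '50'],
]
averages = average_by_group(sample_rows, 1, 2)
# Print genres from the highest to the lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

With the real data, the call would be `average_by_group(free_iOS, 11, 5)`.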


It appears that navigation apps have the highest average number of user ratings (about 86,090 per app). That does not necessarily make navigation the most profitable genre, though: there are only 6 free navigation apps in the data, so the average could be dominated by a few very popular apps.
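
One way to check whether a few apps dominate a genre is to list its most-rated apps and compute their share of the genre's total ratings. The following is a sketch on made-up data; `top_apps_share` is a hypothetical helper whose default indices match the iOS columns used above (`track_name` at 1, `rating_count_tot` at 5, `prime_genre` at 11).

```python
def top_apps_share(rows, genre, top_n=2,
                   name_index=1, ratings_index=5, genre_index=11):
    """Return the top_n most-rated apps in a genre and their share of ratings."""
    in_genre = [(row[name_index], float(row[ratings_index]))
                for row in rows if row[genre_index] == genre]
    in_genre.sort(key=lambda pair: pair[1], reverse=True)
    total = sum(count for _, count in in_genre)
    top = in_genre[:top_n]
    # Guard against division by zero for an empty or unrated genre.
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share

# Made-up sample rows: [app name, number of ratings, genre]
sample_rows = [
    ['Nav A', '900', 'Navigation'],
    ['Nav B', '60', 'Navigation'],
    ['Nav C', '40', 'Navigation'],
    ['Game A', '500', 'Games'],
]
top, share = top_apps_share(sample_rows, 'Navigation', top_n=1,
                            name_index=0, ratings_index=1, genre_index=2)
print(top, share)
```

With the real data, the call would be `top_apps_share(free_iOS, 'Navigation')`; a share close to 1 would suggest that the genre average mostly reflects a handful of apps.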

## Most Popular Apps by Genre on Google Play

Let's do the same for apps on the Google Play store.

In [38]:
category_freq = freq_table(free_android,1)
for category in category_freq:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category == category_app:
            installs = app[5]
            installs = installs.replace(',','')
            installs = installs.replace('+','')
            total += float(installs)
            len_category += 1
    avg_installs = total/len_category
    print(category,':',avg_installs)

BOOKS_AND_REFERENCE : 8767811.894736841
FOOD_AND_DRINK : 1924897.7363636363
WEATHER : 5074486.197183099
BUSINESS : 1712290.1474201474
GAME : 15588015.603248259
NEWS_AND_MAGAZINES : 9549178.467741935
FAMILY : 3695641.8198090694
EDUCATION : 1833495.145631068
MEDICAL : 120550.61980830671
PERSONALIZATION : 5201482.6122448975
DATING : 854028.8303030303
BEAUTY : 513151.88679245283
PRODUCTIVITY : 16787331.344927534
AUTO_AND_VEHICLES : 647317.8170731707
TRAVEL_AND_LOCAL : 13984077.710144928
SPORTS : 3638640.1428571427
COMMUNICATION : 38456119.167247385
COMICS : 817657.2727272727
TOOLS : 10801391.298666667
HEALTH_AND_FITNESS : 4188821.9853479853
ENTERTAINMENT : 11640705.88235294
ART_AND_DESIGN : 1986335.0877192982
EVENTS : 253542.22222222222
VIDEO_PLAYERS : 24727872.452830188
LIBRARIES_AND_DEMO : 638503.734939759
MAPS_AND_NAVIGATION : 4056941.7741935486
HOUSE_AND_HOME : 1331540.5616438356
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
FINANCE : 1387692.475609756
LIFESTYLE : 1437816.26

It appears that communication apps have the highest average number of installs (about 38.5 million per app).
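
The install-count cleanup used in the loop above can be pulled out into a small helper so it is easy to test on its own. This is a sketch; `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))
print(parse_installs('500,000,000+'))
```

Because a value such as '10,000+' is a bucket and is treated here as exactly 10,000 installs, the per-category averages are rough estimates rather than exact install counts.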