# Profitable App Profiles

This project analyzes app data (for both Android and iOS mobile apps) to identify which apps are most popular and why. This will help us understand what types of apps are likely to attract more users.

### Import and Explore Data

Data can be downloaded directly from:

[Android apps](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)

[iOS apps](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In [1]:
from csv import reader

# The Google Play dataset 
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# The App Store dataset 
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(android,0,3, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
# iOS rows: 7197  columns: 16
# android rows: 10841  columns: 13

In [5]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [6]:

print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


### Identify similar and useful columns

**ios.prime_genre = android.Category**

Identifies main type of app


**ios.user_rating = android.Rating**

Identifies overall ratings from users

**ios.track_name = android.App**

Common name of each app
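
Hard-coded column numbers are easy to mix up, so one option is to build a name-to-index lookup from each header row. The sketch below copies the two header lists from the output above; `column_index` is a hypothetical helper name and is not part of the original analysis.

```python
# Header rows copied from the printed output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

def column_index(header):
    """Map each column name to its position in a row (hypothetical helper)."""
    return {name: position for position, name in enumerate(header)}

ios_cols = column_index(ios_header)
android_cols = column_index(android_header)

# Pairs of columns that describe the same thing in both datasets.
for ios_name, android_name in [('prime_genre', 'Category'),
                               ('user_rating', 'Rating'),
                               ('track_name', 'App')]:
    print(ios_name, ios_cols[ios_name], '<->',
          android_name, android_cols[android_name])
```

With such a lookup, `row[ios_cols['prime_genre']]` could replace `row[11]` in later cells.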

## Clean Data

The Android data has a bad entry at row 10472, according to [this Kaggle discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).
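
A row like this can also be found without relying on the discussion thread, because it has fewer fields than the header (12 instead of 13 in the row printed below). The following is a minimal sketch on toy data; `find_malformed_rows` is a hypothetical helper name.

```python
# Toy data: a 3-column header and three rows, one of which is missing a field.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # malformed: the Category field is missing
    ['App C', 'SOCIAL', '4.5'],
]

def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

bad_indices = find_malformed_rows(rows, header)
print(bad_indices)  # [1]

# Delete from the end so earlier indices stay valid while removing.
for i in reversed(bad_indices):
    del rows[i]
```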

In [7]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
# delete bad row
del android[10472]

Find and remove duplicate entries

In [9]:
for app in android:
    name = app[0]
    if name == 'Facebook':
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


In [10]:
# count the number of duplicate rows in the Android dataset

unique_app_names = []
duplicate_app_names = []

for x in android:
    app_name = x[0]
    if app_name in unique_app_names:
        duplicate_app_names.append(app_name)
    else:
        unique_app_names.append(app_name)
        
print('Number of unique apps: ', len(unique_app_names))
print('\n')
print('Number of duplicate apps: ', len(duplicate_app_names))

Number of unique apps:  9659


Number of duplicate apps:  1181


In [11]:
# count the number of duplicate rows in the iOS dataset
# note: column 0 is the app `id`; the app name (`track_name`) is column 1

ios_unique_app_names = []
ios_duplicate_app_names = []

for x in ios:
    app_name = x[0]
    if app_name in ios_unique_app_names:
        ios_duplicate_app_names.append(app_name)
    else:
        ios_unique_app_names.append(app_name)
        
print('Number of unique apps: ', len(ios_unique_app_names))
print('\n')
print('Number of duplicate apps: ', len(ios_duplicate_app_names))

Number of unique apps:  7197


Number of duplicate apps:  0


Android dataset has 1181 duplicate entries.

iOS dataset has 0 duplicate entries when checked on the `id` column.
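
The counting loops above test membership with `name in list`, which rescans the list for every row. A set-based variant gives the same counts with constant-time lookups. This is a sketch on toy data, and `count_duplicates` is a hypothetical helper name.

```python
def count_duplicates(dataset, index):
    """Count rows whose value in column `index` already appeared in an earlier row."""
    seen = set()
    duplicates = 0
    for row in dataset:
        value = row[index]
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return len(seen), duplicates

# Toy rows with the app name in column 0; only that column matters here.
toy = [['Facebook'], ['Instagram'], ['Facebook'], ['Facebook']]
n_unique, n_duplicates = count_duplicates(toy, 0)
print('Unique:', n_unique, 'Duplicates:', n_duplicates)
```

Because the column is a parameter, the same function could be pointed at `track_name` (index 1) in the iOS data instead of `id`.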

We don't want to remove the duplicate entries randomly. Instead, let's keep the entry with the largest number of reviews. This will give us the most up-to-date and insightful information.

To remove the duplicates, we will create a dictionary that stores only the highest number of reviews for each app. The dictionary key will be the app name and the value will be the highest number of reviews. In the next step, we will use this to identify which row of the Android data corresponds to that maximum.

In [12]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        

Now, we will use the *reviews_max* dictionary to check whether each row's review count matches the maximum for that app. If it does, we will save that row. Before adding a row, we'll confirm the app is not already in *android_clean*, to account for the edge case where several entries for the same app share the maximum number of reviews.

In [13]:
android_clean = [] # new, clean dataset
already_added = [] # only store app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)


In [14]:
print(android_clean[235:236])

[['Cisco Webex Meetings', 'BUSINESS', '4.4', '108741', '28M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '11.1.0', '4.3 and up']]


In [15]:
print(len(android_clean))

9659
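
As a cross-check, the two passes above can be collapsed into one by storing the whole row, rather than only the review count, in the dictionary. This is a sketch on toy data, and `keep_most_reviewed` is a hypothetical helper name.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means ties keep the earlier row, as in the loops above.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews. The Facebook values come from
# the rows printed earlier; 'Example App' is a hypothetical row.
toy = [
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Example App', 'TOOLS', '4.0', '10'],
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
]
deduped = keep_most_reviewed(toy)
print(deduped)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-pass version keeps rows in the position of the winning entry.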


## Remove Non-English Apps

We are only interested in English-language apps for this exercise, so we will use `ord()` to detect and remove apps whose names are not in English.

First, let's write a function that takes a string and returns `False` if more than three of its characters fall outside the ASCII range (code points 0 to 127), which covers the common English characters.

In [16]:
# We'll set the limit at 3 non-English characters to
# account for emoji, etc.

def check_characters(string):
    non_english = 0
    for x in string:
        if ord(x) > 127:
            non_english += 1
    if non_english > 3:
        return False
    else:
        return True

In [17]:
check_characters('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [18]:
check_characters('Instacart')

True
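
To see why the limit of three matters, it helps to print the non-ASCII count next to the result. The sketch below restates the same rule in a compact form; `count_non_ascii` and the `limit` parameter are hypothetical additions, and the first two app names are made-up examples.

```python
def count_non_ascii(string):
    """Number of characters whose code point is outside the ASCII range (0-127)."""
    return sum(1 for character in string if ord(character) > 127)

def check_characters(string, limit=3):
    """Same rule as above: allow at most `limit` non-ASCII characters."""
    return count_non_ascii(string) <= limit

# The first two names are hypothetical; the last two are the ones tested above.
for name in ['Docs To Go™ Free Office Suite',
             'Instachat 😜',
             '爱奇艺PPS -《欢乐颂2》电视剧热播',
             'Instacart']:
    print(name, count_non_ascii(name), check_characters(name))
```

A name with a single trademark sign or emoji stays under the limit, while a name written mostly in Chinese characters exceeds it.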

Use the `check_characters` function to filter out non-English apps from both datasets

In [19]:
ios_english = []
android_english = []

for row in ios:
    app_name = row[1]
    check_name = check_characters(app_name)
    if check_name:
        ios_english.append(row)
        
for row in android_clean:
    app_name = row[0]
    check_name = check_characters(app_name)
    if check_name:
        android_english.append(row)

In [20]:
print(len(ios_english))

6183


In [21]:
print(len(android_english))

9614


## Isolating the Free Apps

Now, we want to isolate only the free apps, because the business model relies on in-app ads.

In [22]:
ios_free = []
android_free = []

for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_free.append(row)
        
        
for row in android_english:
    price = row[7]
    if price == '0':
        android_free.append(row)

In [23]:
explore_data(ios_free, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


In [24]:
explore_data(android_free, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


After isolating the free apps, we are left with 8864 Android apps and 3222 iOS apps.
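
The string comparisons above depend on the exact formatting of each price column (`'0.0'` on iOS, `'0'` on Android). A slightly more defensive sketch is to parse the price as a number first. It assumes paid prices may carry a leading `$`, which is not shown in the output above, and `is_free` is a hypothetical helper name.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99' (an assumption about the
    data); any other format raises ValueError from float().
    """
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

for price in ['0', '0.0', '0.00', '$4.99', '2.99']:
    print(repr(price), is_free(price))
```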

## Most Common Apps By Genre

Our goal is to identify apps that are successful (a.k.a. attract more users) on both Android and iOS operating systems. The validation strategy for an app has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We now want to determine the most common genres for each market. 

In [25]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We first need to know the column to identify the genre of each app. These are:

```
ios[11] = prime_genre
android[9] = Genres
```

Alternatively, we could use `android[1] = Category`. These appear to carry similar information.

In [26]:
android_free[1916][1]

'SHOPPING'

In [27]:
def freq_table(dataset, index):
    new_dict = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in new_dict:
            new_dict[value] += 1
        else:
            new_dict[value] = 1
            
    dict_percentages = {}
    for key in new_dict:
        percentage = (new_dict[key] / total) * 100
        dict_percentages[key] = percentage
    return dict_percentages

In [28]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0],3))
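
The same percentage table can also be built with `collections.Counter` from the standard library, which handles both the counting and the sorting. This is a sketch of an alternative, not a replacement for the functions above; `freq_table_counter` is a hypothetical name and the rows are made-up toy data.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Toy rows with the genre in column 0 (made-up data for illustration).
toy = [['Games'], ['Games'], ['Games'], ['Music']]
for genre, percentage in freq_table_counter(toy, 0).items():
    print(genre, ':', round(percentage, 3))
```

Values with equal counts may come out in a different order than in `display_table`, which breaks ties by sorting on the key.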

In [29]:
display_table(ios_free, 11)

Games : 58.163
Entertainment : 7.883
Photo & Video : 4.966
Education : 3.662
Social Networking : 3.29
Shopping : 2.607
Utilities : 2.514
Sports : 2.142
Music : 2.048
Health & Fitness : 2.017
Productivity : 1.738
Lifestyle : 1.583
News : 1.335
Travel : 1.241
Finance : 1.117
Weather : 0.869
Food & Drink : 0.807
Reference : 0.559
Business : 0.528
Book : 0.435
Navigation : 0.186
Medical : 0.186
Catalogs : 0.124


#### Notes and comments on iOS apps

The most common genre is Games at about 58%, followed by Entertainment at about 7.9%.

Games and entertainment apps are the most common general types of apps. There is a smattering of practical apps (Education, Utilities, Health & Fitness), but each of these makes up only a small share of the total.

In [30]:
display_table(android_free, 9) # Genres

Tools : 8.45
Entertainment : 6.069
Education : 5.347
Business : 4.592
Productivity : 3.892
Lifestyle : 3.892
Finance : 3.7
Medical : 3.531
Sports : 3.463
Personalization : 3.317
Communication : 3.238
Action : 3.102
Health & Fitness : 3.08
Photography : 2.944
News & Magazines : 2.798
Social : 2.662
Travel & Local : 2.324
Shopping : 2.245
Books & Reference : 2.144
Simulation : 2.042
Dating : 1.861
Arcade : 1.85
Video Players & Editors : 1.771
Casual : 1.76
Maps & Navigation : 1.399
Food & Drink : 1.241
Puzzle : 1.128
Racing : 0.993
Role Playing : 0.936
Libraries & Demo : 0.936
Auto & Vehicles : 0.925
Strategy : 0.914
House & Home : 0.824
Weather : 0.801
Events : 0.711
Adventure : 0.677
Comics : 0.609
Beauty : 0.598
Art & Design : 0.598
Parenting : 0.496
Card : 0.451
Casino : 0.429
Trivia : 0.417
Educational;Education : 0.395
Board : 0.384
Educational : 0.372
Education;Education : 0.338
Word : 0.259
Casual;Pretend Play : 0.237
Music : 0.203
Racing;Action & Adventure : 0.169
Puzzle;Brain G

In [31]:
display_table(android_free, 1) # Category

FAMILY : 18.908
GAME : 9.725
TOOLS : 8.461
BUSINESS : 4.592
LIFESTYLE : 3.903
PRODUCTIVITY : 3.892
FINANCE : 3.7
MEDICAL : 3.531
SPORTS : 3.396
PERSONALIZATION : 3.317
COMMUNICATION : 3.238
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.944
NEWS_AND_MAGAZINES : 2.798
SOCIAL : 2.662
TRAVEL_AND_LOCAL : 2.335
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.144
DATING : 1.861
VIDEO_PLAYERS : 1.794
MAPS_AND_NAVIGATION : 1.399
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.959
LIBRARIES_AND_DEMO : 0.936
AUTO_AND_VEHICLES : 0.925
HOUSE_AND_HOME : 0.824
WEATHER : 0.801
EVENTS : 0.711
PARENTING : 0.654
ART_AND_DESIGN : 0.643
COMICS : 0.62
BEAUTY : 0.598


#### Notes and comments on Android apps

No single genre dominates the Android market the way Games does on iOS. In the `Category` column, the largest groups are FAMILY at about 19% and GAME at about 10%.

It is not immediately clear the exact difference between the `Category` and `Genres` columns, except that `Category` is a coarser grouping. 

In both categories, there is a mix of entertainment and productivity apps. 

Now, let's determine the kind of apps with the most users. For the Android dataset, we can use the `Installs` column. For iOS, we will use the `rating_count_tot` column.

In [33]:
ios_genres = freq_table(ios_free, 11)

for genre in ios_genres:
    total = 0
    len_genre = 0
    for row in ios_free:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    avg_num_ratings = total / len_genre
    print(genre, avg_num_ratings)


Social Networking 71548.34905660378
Photo & Video 28441.54375
Games 22788.6696905016
Music 57326.530303030304
Reference 74942.11111111111
Health & Fitness 23298.015384615384
Weather 52279.892857142855
Utilities 18684.456790123455
Travel 28243.8
Shopping 26919.690476190477
News 21248.023255813954
Navigation 86090.33333333333
Lifestyle 16485.764705882353
Entertainment 14029.830708661417
Food & Drink 33333.92307692308
Sports 23008.898550724636
Book 39758.5
Finance 31467.944444444445
Education 7003.983050847458
Productivity 21028.410714285714
Business 7491.117647058823
Catalogs 4004.0
Medical 612.0


Navigation apps have the highest average number of ratings. Reference, Social Networking, and Music are the next highest. These averages might be driven by a few particularly popular apps in each genre, but the top of the list still covers a variety of genres.
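
One way to check whether a few very popular apps drive a genre's average is to compare the mean with the median and to look at the share of ratings held by the largest app. The sketch below uses made-up rating counts for a single hypothetical genre, not values from the dataset.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: a single very popular app
# alongside several small ones. These numbers are made up.
rating_counts = [500000, 1200, 800, 450, 300, 50]

print('mean:  ', mean(rating_counts))
print('median:', median(rating_counts))

# Share of all ratings contributed by the single largest app.
top_share = max(rating_counts) / sum(rating_counts) * 100
print('top app share (%):', round(top_share, 1))
```

A mean far above the median, together with a high top-app share, would suggest the genre average says little about a typical app.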

Next, we can look at the average number of installs for the Google Play store. This dataset only gives a minimum number of installs based on ranges such as '100,000+', so we will use this base number to get an estimate.
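
As a worked example of that conversion, the helper below strips the `+` and the thousands separators and returns the lower bound of the range as a float. `parse_installs` is a hypothetical name; the install strings are copied from rows shown earlier.

```python
def parse_installs(text):
    """Convert an install range such as '100,000+' to its lower bound as a float."""
    return float(text.replace('+', '').replace(',', ''))

# Install strings copied from rows printed earlier.
for text in ['10,000+', '5,000,000+', '1,000,000,000+']:
    print(text, '->', parse_installs(text))
```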

In [42]:
android_genres = freq_table(android_free, 1)

for category in android_genres:
    total = 0
    len_category = 0
    for row in android_free:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, avg_n_installs)


ART_AND_DESIGN 1986335.0877192982
AUTO_AND_VEHICLES 647317.8170731707
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 8767811.894736841
BUSINESS 1712290.1474201474
COMICS 817657.2727272727
COMMUNICATION 38456119.167247385
DATING 854028.8303030303
EDUCATION 1833495.145631068
ENTERTAINMENT 11640705.88235294
EVENTS 253542.22222222222
FINANCE 1387692.475609756
FOOD_AND_DRINK 1924897.7363636363
HEALTH_AND_FITNESS 4188821.9853479853
HOUSE_AND_HOME 1331540.5616438356
LIBRARIES_AND_DEMO 638503.734939759
LIFESTYLE 1437816.2687861272
GAME 15588015.603248259
FAMILY 3695641.8198090694
MEDICAL 120550.61980830671
SOCIAL 23253652.127118643
SHOPPING 7036877.311557789
PHOTOGRAPHY 17840110.40229885
SPORTS 3638640.1428571427
TRAVEL_AND_LOCAL 13984077.710144928
TOOLS 10801391.298666667
PERSONALIZATION 5201482.6122448975
PRODUCTIVITY 16787331.344927534
PARENTING 542603.6206896552
WEATHER 5074486.197183099
VIDEO_PLAYERS 24727872.452830188
NEWS_AND_MAGAZINES 9549178.467741935
MAPS_AND_NAVIGATION 4056941.774193

Communication, Video Players, and Social are the categories with the highest approximate average installs, followed by Photography and Productivity.
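
The unsorted output is hard to rank by eye, so sorting the averages in descending order helps. The sketch below copies a subset of the averages printed above (the categories above roughly 8 million) and prints the top five.

```python
# A subset of the averages printed above, copied as-is and in the same order.
avg_installs = {
    'BOOKS_AND_REFERENCE': 8767811.894736841,
    'COMMUNICATION': 38456119.167247385,
    'ENTERTAINMENT': 11640705.88235294,
    'GAME': 15588015.603248259,
    'SOCIAL': 23253652.127118643,
    'PHOTOGRAPHY': 17840110.40229885,
    'TRAVEL_AND_LOCAL': 13984077.710144928,
    'TOOLS': 10801391.298666667,
    'PRODUCTIVITY': 16787331.344927534,
    'VIDEO_PLAYERS': 24727872.452830188,
    'NEWS_AND_MAGAZINES': 9549178.467741935,
}

# Sort (category, average) pairs by the average, largest first.
ranked = sorted(avg_installs.items(), key=lambda item: item[1], reverse=True)
for category, average in ranked[:5]:
    print(f'{category:<20} {average:>15,.0f}')
```

In the analysis itself, the averages could be collected into a dictionary inside the loop instead of being printed, then sorted the same way.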