# Guided Project: Profitable App Profiles for the App Store and Google Play Markets

In this project, our objective is to identify profitable mobile app profiles suitable for both the App Store and Google Play markets. As data analysts employed by a company specializing in Android and iOS app development, our primary responsibility involves equipping our team of developers with data-backed insights to guide their app creation decisions.

Within our company, our focus remains on crafting apps available for free download and installation. Our primary source of income originates from in-app advertisements. This underscores the fact that the revenue generated by our apps primarily hinges on the user base. The central aim of this project is to meticulously examine data to provide our developers with a comprehensive understanding of the types of apps that are more likely to attract a larger user audience.

### datasets

A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

In [3]:
print('\n Android')
explore_data(android, 0, 0, True)
print('\n iOS')
explore_data(ios, 0, 0, True)


 Android
Number of rows: 10841
Number of columns: 13

 iOS
Number of rows: 7197
Number of columns: 16


In [4]:
print('\n Android')
print(android_header)
print('\n iOS')
print(ios_header)


 Android
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

 iOS
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Deleting Wrong Data (data cleaning)
#### * Detect inaccurate data, and correct or remove it.
#### * Detect duplicate data, and remove the duplicates.
#### * Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
#### * Remove apps that aren't free.


[Wrong rating for entry 10472](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015)

In [5]:
for row in android[1:]:
    if len(row) != len(android_header):
        print(row)
        print("\n")
        print("Index postion is:", android.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index postion is: 10472


In [6]:
# category column is missing
print(android_header)
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
# deleting row
del android[10472]

### Removing Duplicate Entries: Part One
#### 1,181 cases where an app occurs more than once

In [8]:
# search for duplicated apps (Android)

duplicated_apps = []
analised_apps = []
for app in android:
    name_app = app[0]
    if name_app in analised_apps:
        duplicated_apps.append(name_app)
    else:
        analised_apps.append(name_app)
        
print(len(duplicated_apps))

1181


#### We need to remove duplicated apps, but analysing the list we will find out that there are small differences among duplicated rows. For example, the number of reviews are differents in Instagram rows, indicating they were colleced several times, in differents periods. Because of this our deletion criteria will be removing the rows with less number of reviews, the older registers

In [9]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print('name: '+app[0] +' // nº reviews: '+ app[3])

name: Instagram // nº reviews: 66577313
name: Instagram // nº reviews: 66577446
name: Instagram // nº reviews: 66577313
name: Instagram // nº reviews: 66509917


### Removing Duplicate Entries: Part Two

In [10]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected unique apps name = 9659')
len_is_correct = True if len(reviews_max) == 9659 else False
print('Len of reviews_max is 9659 ? '+str(len_is_correct))

Expected unique apps name = 9659
Len of reviews_max is 9659 ? True


#### Remove the duplicate rows

In [11]:
android_clean = [] #new cleaned dataset
already_added = [] #app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    # add only app with max reviews and not duplicate info
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

# android_clean will be the new dataset
# checking infos
len_is_ok = True if len(android_clean) == 9659 else False
print('Len of android_clean is 9659 ? '+str(len_is_ok))

Len of android_clean is 9659 ? True


### checking if ios database has duplicated infos

In [12]:
# search for duplicated apps (iOS)

duplicated_apps = []
analised_apps = []
for app in ios:
    name_app = app[0] #id column
    if name_app in analised_apps:
        duplicated_apps.append(name_app)
    else:
        analised_apps.append(name_app)
        
print('Duplicated apps infos: '+str(len(duplicated_apps)))

Duplicated apps infos: 0


#### Removing Non-English Apps: Part One


In [13]:
# both databases have non-english apps

print(ios[813][1])
print(android_clean[4412][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
中国語 AQリスニング


**ord()** buit-in function: Given a string representing one Unicode character, return an integer representing the Unicode code point of that character.

English characters: <= 127

In [14]:
def check_english_name(name):
    for char in name:
        if ord(char) > 127:
            return False
    return True

testing_apps = ('Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 
                'Docs To Go™ Free Office Suite', 'Instachat 😜')

for test in testing_apps:
    print(check_english_name(test))

True
False
False
False


#### There are problems with functions. Apps number 3 and 4 are english name, but both have simbols. The functions needs to ignore them or we need to clean the name before submit

#### Removing Non-English Apps: Part Two

In [15]:
print(ord('😜'))
print(ord('™'))

128540
8482


In [16]:
# changing fn to count number of non-english chars:
# if number higher than 3 -> remove app

def check_english_name(name):
    non_english_chars = 0
    for char in name:
        if ord(char) > 127:
            non_english_chars += 1
            if non_english_chars > 3:
                return False
    return True

testing_apps = ('Docs To Go™ Free Office Suite', 'Instachat 😜','爱奇艺PPS -《欢乐颂2》电视剧热播')
for app in testing_apps:
    print(check_english_name(app))

True
True
False


#### Filtering out non-English apps from both datasets

In [17]:
android_english_apps = []
ios_english_apps = []

for app in android_clean:
    name = app[0]
    if check_english_name(name):
        android_english_apps.append(app)

for app in ios:
    name = app[0]
    if check_english_name(name):
        ios_english_apps.append(app)

print('Android all apps = '+str(len(android_clean)))
print('Andrid english apps = '+str(len(android_english_apps)))
print('----------------')
print('iOS all apps = '+str(len(ios)))
print('iOS english apps = '+str(len(ios_english_apps)))

Android all apps = 9659
Andrid english apps = 9614
----------------
iOS all apps = 7197
iOS english apps = 7197


#### Isolating the Free Apps

In [18]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [19]:
# price column -> index 7
# type column (free or paid) (index 6)

android_english_free_apps = []
for app in android_english_apps:
    price = app[7]
    
    if price != '0.0' and price != '0':
        android_english_free_apps.append(app)      

In [20]:
print (ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [21]:
# price column -> index 4
ios_english_free_apps = []
for app in ios_english_apps:
    price = app[4]
    if price != '0.0' and price != '0':
        ios_english_free_apps.append(app)

In [22]:
print('Android all apps = '+str(len(android_clean)))
print('Andrid english apps = '+str(len(android_english_apps)))
print('Andrid english free apps = '+str(len(android_english_free_apps)))
print('----------------')
print('iOS all apps = '+str(len(ios)))
print('iOS english apps = '+str(len(ios_english_apps)))
print('iOS english free apps = '+str(len(ios_english_free_apps)))

Android all apps = 9659
Andrid english apps = 9614
Andrid english free apps = 750
----------------
iOS all apps = 7197
iOS english apps = 7197
iOS english free apps = 3141


#### Most Common Apps by Genre
#### Our validation strategy for an app idea has three steps:
 * Build a minimal Android version of the app, and add it to Google Play.
 * If the app has a good response from users, we develop it further.
 * If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets.
We need to do a analysis to determine the most common genres for each market
We'll need to build frequency tables for a few columns in our datasets.

In [23]:
print(android_header)
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Important columns
* iOS: prime_genre (index 11)
* Android: genres (index 9) // Category (index 1)

In [24]:
def freq_table(dataset, index):
    frequency_table = {}
    for row in dataset:
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1    
    
    total = sum(frequency_table.values())
    for row in frequency_table:
        frequency_table[row] = round(frequency_table[row]/total*100, 2)
    return frequency_table

In [25]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [26]:
# ordered genres from android apps
display_table(android_english_free_apps, 9)

Medical : 10.93
Personalization : 10.8
Tools : 10.4
Education : 3.87
Productivity : 3.73
Books & Reference : 3.73
Communication : 3.6
Sports : 3.2
Action : 3.2
Role Playing : 2.8
Arcade : 2.67
Puzzle : 2.53
Photography : 2.53
Entertainment : 2.53
Lifestyle : 2.4
Finance : 2.27
Health & Fitness : 2.0
Strategy : 1.73
Travel & Local : 1.6
Education;Pretend Play : 1.6
Business : 1.6
Adventure : 1.6
Simulation : 1.2
Casual : 1.2
Weather : 1.07
Board : 1.07
Card : 0.93
Board;Brain Games : 0.93
Maps & Navigation : 0.67
Educational;Pretend Play : 0.67
Education;Education : 0.67
Dating : 0.67
Video Players & Editors : 0.53
Educational : 0.53
Casual;Pretend Play : 0.53
Social : 0.4
Racing : 0.4
Educational;Education : 0.4
Art & Design : 0.4
Arcade;Action & Adventure : 0.4
Action;Action & Adventure : 0.4
Sports;Action & Adventure : 0.27
Simulation;Education : 0.27
Shopping : 0.27
Puzzle;Brain Games : 0.27
Parenting : 0.27
News & Magazines : 0.27
Food & Drink : 0.27
Educational;Creativity : 0.27
E

In [27]:
# ordered Category from android apps
display_table(android_english_free_apps, 1)

FAMILY : 24.27
MEDICAL : 10.93
GAME : 10.93
PERSONALIZATION : 10.8
TOOLS : 10.4
PRODUCTIVITY : 3.73
BOOKS_AND_REFERENCE : 3.73
COMMUNICATION : 3.6
SPORTS : 3.2
PHOTOGRAPHY : 2.53
LIFESTYLE : 2.4
FINANCE : 2.27
HEALTH_AND_FITNESS : 2.0
TRAVEL_AND_LOCAL : 1.6
BUSINESS : 1.6
WEATHER : 1.07
MAPS_AND_NAVIGATION : 0.67
DATING : 0.67
VIDEO_PLAYERS : 0.53
SOCIAL : 0.4
EDUCATION : 0.4
ART_AND_DESIGN : 0.4
SHOPPING : 0.27
PARENTING : 0.27
NEWS_AND_MAGAZINES : 0.27
FOOD_AND_DRINK : 0.27
ENTERTAINMENT : 0.27
AUTO_AND_VEHICLES : 0.27
LIBRARIES_AND_DEMO : 0.13
EVENTS : 0.13


In [28]:
# ordered genres from iOS apps
display_table(ios_english_free_apps, 11)

Games : 51.1
Education : 10.22
Entertainment : 6.4
Photo & Video : 5.79
Utilities : 4.43
Productivity : 3.69
Health & Fitness : 3.31
Music : 2.26
Lifestyle : 1.59
Book : 1.46
Reference : 1.4
Weather : 1.31
Business : 1.18
Sports : 1.11
Navigation : 0.83
Travel : 0.8
Social Networking : 0.76
Food & Drink : 0.64
Finance : 0.64
News : 0.54
Medical : 0.48
Shopping : 0.03
Catalogs : 0.03


### Most Common Apps by Genre: Analyzing frequency tables.
**only for FREE apps**


**iOS apps**
 * most common: Games (51%)
 * 2nd most common: Education (10.22 %)
 * Others above 2%: 
     * Entertainment : 6.4 
     * Photo & Video : 5.79
     * Utilities : 4.43
     * Productivity : 3.69
     * Health & Fitness : 3.31
     * Music : 2.26
 * market more focused in entertainment

**Android Apps (Genre)**
 * most common: Medical(10.93 %)
 * 2nd most common: Personalization(10.8 %)
 * 3rd most common: Tools(10.4 %)
 * Others above 2%:
    * Education : 3.87
    * Productivity : 3.73
    * Books & Reference : 3.73
    * Communication : 3.6
    * Sports : 3.2
    * Action : 3.2
    * Role Playing : 2.8
    * Arcade : 2.67
    * Puzzle : 2.53
    * Photography : 2.53
    * Entertainment : 2.53
    * Lifestyle : 2.4
    * Finance : 2.27
    * Health & Fitness : 2.0

* market more focused in medical, personalization and tools, representing more than 30%

**Android Apps (Category)**
* most common: FAMILY (24.27%)
* 2nd most common: MEDICAL (10.93%)
* 3rd most common: GAME (10.93%)
* 4th most common: PERSONALIZATION(10.8%)
* 5th most common: TOOLS(10.4%)
* Others categories above 2%:
    * PRODUCTIVITY : 3.73
    * BOOKS_AND_REFERENCE : 3.73
    * COMMUNICATION : 3.6
    * SPORTS : 3.2
    * PHOTOGRAPHY : 2.53
    * LIFESTYLE : 2.4
    * FINANCE : 2.27
    * HEALTH_AND_FITNESS : 2.0

* market more focused in family, medical, game, personalization and tools, representing more than 67%

>"The frequency tables we analyzed on the previous screen showed us that apps designed for fun dominate the App Store, while Google Play shows a more balanced landscape of both practical and fun apps. **Now, we'd like to determine the kind of apps with the most users.**"

calculate the average number of installs for each app genre. 

Android -> installs column
iOS -> missing info -> total number of user raitngs (rating_count_tot)

### iOS Calculations

In [32]:
frequence_ios_prime_genre = freq_table(ios_english_free_apps, 11)
print(frequence_ios_prime_genre)

{'Games': 51.1, 'Entertainment': 6.4, 'Music': 2.26, 'Photo & Video': 5.79, 'Health & Fitness': 3.31, 'Business': 1.18, 'Weather': 1.31, 'Utilities': 4.43, 'News': 0.54, 'Education': 10.22, 'Reference': 1.4, 'Productivity': 3.69, 'Navigation': 0.83, 'Lifestyle': 1.59, 'Book': 1.46, 'Finance': 0.64, 'Sports': 1.11, 'Medical': 0.48, 'Travel': 0.8, 'Shopping': 0.03, 'Food & Drink': 0.64, 'Social Networking': 0.76, 'Catalogs': 0.03}


In [43]:
for genre in frequence_ios_prime_genre:
    total = 0
    len_genre = 0
    for app in ios_english_free_apps:
        genre_app = app[11]
        if genre_app == genre:
            user_rating = float(app[7])
            total += user_rating
            len_genre += 1
    average_num_user_rating = round((total / len_genre), 2)
    print(genre +': '+str(average_num_user_rating))
    

Games: 3.9
Entertainment: 3.41
Music: 4.01
Photo & Video: 3.81
Health & Fitness: 3.79
Business: 3.88
Weather: 3.85
Utilities: 3.14
News: 3.32
Education: 3.33
Reference: 3.52
Productivity: 4.03
Navigation: 3.06
Lifestyle: 3.21
Book: 3.74
Finance: 3.33
Sports: 3.13
Medical: 3.63
Travel: 3.38
Shopping: 4.5
Food & Drink: 3.5
Social Networking: 2.92
Catalogs: 4.5


#### App profile recommendation for the App Store = Games (highest number of apps and high user rating)

### Android Calculations

Use install columns (index 5)
It is not precise number
Remove simbols
 * 100,000+ convert to 100000
 * 1,000,000+ convert to 1000000

In [44]:
freq_table_category = freq_table(android_english_free_apps, 1)
print(freq_table_category)

{'BUSINESS': 1.6, 'COMMUNICATION': 3.6, 'DATING': 0.67, 'EDUCATION': 0.4, 'ENTERTAINMENT': 0.27, 'FOOD_AND_DRINK': 0.27, 'HEALTH_AND_FITNESS': 2.0, 'GAME': 10.93, 'FAMILY': 24.27, 'MEDICAL': 10.93, 'PHOTOGRAPHY': 2.53, 'SPORTS': 3.2, 'PERSONALIZATION': 10.8, 'PRODUCTIVITY': 3.73, 'WEATHER': 1.07, 'TOOLS': 10.4, 'TRAVEL_AND_LOCAL': 1.6, 'LIFESTYLE': 2.4, 'AUTO_AND_VEHICLES': 0.27, 'NEWS_AND_MAGAZINES': 0.27, 'SHOPPING': 0.27, 'BOOKS_AND_REFERENCE': 3.73, 'SOCIAL': 0.4, 'ART_AND_DESIGN': 0.4, 'VIDEO_PLAYERS': 0.53, 'FINANCE': 2.27, 'MAPS_AND_NAVIGATION': 0.67, 'PARENTING': 0.27, 'LIBRARIES_AND_DEMO': 0.13, 'EVENTS': 0.13}


In [48]:
for cat in freq_table_category:
    total = 0
    len_category = 0
    for app in android_english_free_apps:
        category_app = app[1]
        if category_app == cat:
            num_install = app[5]
            num_install = float(num_install.replace('+','').replace(',',''))
            total += num_install
            len_category += 1
    average_num_installs = round(total / len_category, 2)
    print(cat +': '+str(average_num_installs))

BUSINESS: 17731.25
COMMUNICATION: 50372.22
DATING: 2070.0
EDUCATION: 34000.0
ENTERTAINMENT: 100000.0
FOOD_AND_DRINK: 30000.0
HEALTH_AND_FITNESS: 31607.33
GAME: 256097.13
FAMILY: 116201.73
MEDICAL: 6838.21
PHOTOGRAPHY: 98881.05
SPORTS: 51825.62
PERSONALIZATION: 40232.02
PRODUCTIVITY: 50430.54
WEATHER: 101500.0
TOOLS: 22146.68
TRAVEL_AND_LOCAL: 15255.0
LIFESTYLE: 65506.11
AUTO_AND_VEHICLES: 25025.0
NEWS_AND_MAGAZINES: 2750.0
SHOPPING: 5050.0
BOOKS_AND_REFERENCE: 832.71
SOCIAL: 2000.0
ART_AND_DESIGN: 5333.33
VIDEO_PLAYERS: 17750.0
FINANCE: 10917.76
MAPS_AND_NAVIGATION: 24220.0
PARENTING: 25050.0
LIBRARIES_AND_DEMO: 100.0
EVENTS: 1.0


App profile recommendation for the Android = GAME and
FAMILY

Here are a few next steps you could take:

Analyze the frequency table for the Genre column of the Google Play dataset, and see if you can find useful patterns.
Assume we could also make revenue via in-app purchases and subscriptions, and try to determine which genres seem to be liked the most by users — you could examine app ratings here.
Refine your project using our data science [project style guide](https://www.dataquest.io/blog/data-science-project-style-guide/).