# Determining the most popular mobile app genre among users

**Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.**

We will take into consideration:
- only free apps
- apps whose main source of revenue is ads
- the number of users of an app determines the revenue for that app

## Import Android and iOS app data

In [1]:
from csv import reader

In [2]:
# Use context managers so both files are closed once they have been read
with open('AppleStore.csv', encoding='utf8') as appleStoreData:
    appleData = list(reader(appleStoreData))
with open('googleplaystore.csv', encoding='utf8') as googlePlayStoreData:
    googleData = list(reader(googlePlayStoreData))

## Explore the data for analysis

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The cell below shows all column names from the Apple Store data file. More information can be found at: https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps?resource=download

In [4]:
explore_data(appleData, 0,1)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




Display the header and the first 10 rows of the Apple Store data set, followed by the total number of rows and columns

In [5]:
explore_data(appleData, 0,11, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37'

The cell below shows all column names from the Google Play Store data file. More information can be found at: https://www.kaggle.com/datasets/lava18/google-play-store-apps

In [6]:
explore_data(googleData, 0,1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




Display the header and the first 10 rows of the Google Play Store data set, followed by the total number of rows and columns

In [7]:
explore_data(googleData, 0,11, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

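## Remove the malformed row (10472) from the Google Play data set

Row 10472 of the data (index 10473 once the header row is counted) is missing its category value, so it has 12 columns instead of 13.

In [8]:
print(googleData[10473])  # index 10473 = data row 10472 plus the header row
del googleData[10473]  # run this cell only once, or a valid row is deleted as well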

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


### Function "check_row_length" checks for rows whose number of columns differs from the header

In [9]:
def check_row_length(dataSet, printCorrectData = True):
    headerLen = len(dataSet[0])
    correctedDataSet = []
    wrongRowsDataSet = []
    for row in dataSet:
        if headerLen == len(row):
            correctedDataSet.append(row)
        else:
            wrongRowsDataSet.append(row)
    if printCorrectData:
        return correctedDataSet
    else:
        return wrongRowsDataSet
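
The malformed row removed earlier was addressed by its index. As a minimal sketch, a helper like the hypothetical `find_malformed_rows` below could report the index of every row whose length differs from the header, so such rows can be located without inspecting the file by hand. The sample data is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    header_len = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != header_len]


# Tiny made-up data set: the row at index 2 is missing a column.
sample = [
    ['App', 'Category', 'Rating'],
    ['Alpha', 'TOOLS', '4.1'],
    ['Beta', '1.9'],
]
print(find_malformed_rows(sample))  # [(2, ['Beta', '1.9'])]
```

Returning the index alongside the row makes it possible to pass the index straight to `del`.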
    

Compare the length of original apple data set with the same data set after row check

In [10]:
print(len(appleData))
print(len(check_row_length(appleData)))

7198
7198


Compare the length of original google play data set with the same data set after row check

In [11]:
print(len(googleData))
print(len(check_row_length(googleData)))

10841
10841


## Check for duplicates

In [12]:
def check_duplicated_app_names(dataSet, appNameIndex, printCorrectData = True):
    uniqueApps = []
    duplicatedApps = []
    for app in dataSet:
        if app[appNameIndex] not in uniqueApps:
            uniqueApps.append(app[appNameIndex])
        else:
            duplicatedApps.append(app)
            
    if printCorrectData:
        return uniqueApps
    else:
        return duplicatedApps       
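
Because `check_duplicated_app_names` tests membership in a list, each lookup scans every name seen so far. A possible variant, sketched below under the assumption that first occurrences should be kept as full rows, tracks seen names in a set and returns both groups at once. The name `split_duplicates` and the sample rows are illustrative and not part of the original notebook.

```python
def split_duplicates(rows, name_index):
    """Split rows into first occurrences and repeats, keyed on the app name."""
    seen = set()
    unique_rows, duplicate_rows = [], []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicate_rows.append(row)
        else:
            seen.add(name)
            unique_rows.append(row)
    return unique_rows, duplicate_rows


# Made-up rows of the form [name, reviews]
sample = [['Box', '10'], ['Slack', '7'], ['Box', '12']]
unique_rows, duplicate_rows = split_duplicates(sample, 0)
print(len(unique_rows), len(duplicate_rows))  # 2 1
```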

In [13]:
print((len(appleData)))
print(len(check_duplicated_app_names(appleData, 2, False)))

7198
2


Check duplicates in apple data set

In [14]:
duplicatedAppleApps = check_duplicated_app_names(appleData, 2, False)
duplicatedApple1, duplicatedApple2 = duplicatedAppleApps
print(duplicatedApple1)
print(duplicatedApple2)

['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']
['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


Check duplicates in google data set

In [15]:
duplicatedGoogleApps = check_duplicated_app_names(googleData, 0, False)
print(len(duplicatedGoogleApps))
print(len(googleData))

for app in duplicatedGoogleApps[:10]:
    print(app[0])

1181
10841
Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings
Box
Zenefits
Google Ads
Google My Business
Slack


Collect the unique app names from the Apple data set (the function returns names only, not full rows)

In [16]:
appleDataCleared = check_duplicated_app_names(appleData, 2)

Collect the unique app names from the Google Play data set

In [17]:
googleDataCleared = check_duplicated_app_names(googleData, 0)

### For duplicated apps, keep only the entry with the highest number of reviews

We start by creating a dictionary "reviews_max" which maps each app name to the highest number of reviews recorded for that app

In [18]:
reviews_max = {}

for app in googleData[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the number of data rows (the length of our data set without the header row) minus 1,181.

In [19]:
print('Expected length:', len(googleData[1:]) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

* We start by initializing two empty lists, google_clean and already_added.
* We loop through the Google Play data set, and for every iteration:
   * We isolate the name of the app and the number of reviews.
   * We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
        * The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
        * The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [20]:
google_clean = []
already_added = []

for app in googleData[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        google_clean.append(app)
        already_added.append(name)
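
The two passes above can also be folded into one. The sketch below is an alternative, not the notebook's method: the hypothetical `keep_most_reviewed` stores the best row seen so far for each name in a dictionary. On ties it keeps the first row with the highest count, as the loop above does, but the result is ordered by the first appearance of each name rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means an equal count never replaces the row already kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows of the form [name, reviews]
sample = [['Box', '10'], ['Slack', '7'], ['Box', '12'], ['Box', '12']]
print(keep_most_reviewed(sample, 0, 1))  # [['Box', '12'], ['Slack', '7']]
```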

In [21]:
explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Remove apps with non-English names from our data sets

Function "check_non_english_characters" returns True only for app names with no more than three non-English (non-ASCII) characters

In [22]:
def check_non_english_characters(word):
    non_english_char_counter = 0
    for char in word:
        # Characters outside the ASCII range (0-127) count as non-English
        if ord(char) > 127:
            non_english_char_counter += 1
            # Stop at the fourth such character: the name is treated as non-English
            if non_english_char_counter > 3:
                return False
    return True

In [23]:
print(check_non_english_characters('Instagram'))
print(check_non_english_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_non_english_characters('Instachat 😜'))
print(check_non_english_characters('Docs To Go™ Free Office Suite'))

True
False
True
True
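
To see why a threshold of three is used rather than rejecting any character above 127, it can help to count the non-ASCII characters in a few names. The helper `count_non_ascii` below is a hypothetical illustration: names with a trademark sign or an emoji contain a single such character and so still pass the check.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for char in text if ord(char) > 127)


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', count_non_ascii(name))
```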


Separate English and non-English Google Play apps

In [24]:
googleEnglishApps = []
googleNoneEnglishApps = []

for app in google_clean:
    name = app[0]
    if (check_non_english_characters(name)):
        googleEnglishApps.append(app)
    else:
        googleNoneEnglishApps.append(app)
    
    
print(len(googleEnglishApps))    
print(len(googleNoneEnglishApps))   

9614
45


Separate English and non-English Apple apps

In [25]:
appleEnglishApps = []
appleNoneEnglishApps = []

for app in appleData[1:]:
    name = app[2]
    if (check_non_english_characters(name)):
        appleEnglishApps.append(app)
    else:
        appleNoneEnglishApps.append(app)
    
    
print(len(appleEnglishApps))    
print(len(appleNoneEnglishApps))   

6183
1014


In [26]:
print(appleEnglishApps[0])

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


## Separate free apps (which earn from ads) from paid apps

In [27]:
# print(googleData[0])
print(googleEnglishApps[0])


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Separate android paid and free apps

In [28]:
googlePaidApps = []
googleFreeApps = []

for app in googleEnglishApps:
    price = app[7]
        
    if (price != '0'):
        googlePaidApps.append(app)
    else:
        googleFreeApps.append(app)

In [29]:
print(len(googlePaidApps))
print(len(googleFreeApps))

750
8864


Separate iOS paid and free apps

In [30]:
applePaidApps = []
appleFreeApps = []

for app in appleEnglishApps:
    price = app[5]
    
    if (price == '0'):
        appleFreeApps.append(app)
    else:
        applePaidApps.append(app)

In [31]:
print(len(applePaidApps))
print(len(appleFreeApps))
print(appleEnglishApps[0:5])

2961
3222
[['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'], ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']]


## Determine which type of app attracts the most users

Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue. We need to find app profiles that are successful in both markets (Google Play and the App Store). For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Add headers to free apps data sets

In [32]:
googleFreeApps.insert(0, googleData[0])
print(googleFreeApps[0])
appleFreeApps.insert(0, appleData[0])
print(appleFreeApps[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Function "freq_table" builds a frequency table (as percentages) for a given column of a data set. It treats the first row as a header and skips it.

In [40]:
def freq_table(dataset, index):
    freq_dictionary = {}
    total = 0
    for app in dataset[1:]:
        total += 1
        column_value = app[index]
        if (column_value in freq_dictionary):
            freq_dictionary[column_value] += 1
        else:
            freq_dictionary[column_value] = 1
            
    table_percentage = {}
    for key in freq_dictionary:
        percentage = round((freq_dictionary[key] / total) * 100,2)
        table_percentage[key] = percentage
    return table_percentage   
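
The same table can be built more compactly with `collections.Counter` from the standard library. This is a sketch of an equivalent approach, assuming as above that the first row is a header. The name `freq_table_counter` and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage frequency table for one column; dataset[0] is the header."""
    counts = Counter(row[index] for row in dataset[1:])
    total = sum(counts.values())
    return {value: round(count / total * 100, 2) for value, count in counts.items()}


sample = [
    ['name', 'genre'],
    ['a', 'Games'],
    ['b', 'Games'],
    ['c', 'Tools'],
    ['d', 'Games'],
]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Tools': 25.0}
```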

In [41]:
print(freq_table(googleFreeApps, 9))

{'Art & Design': 0.6, 'Art & Design;Creativity': 0.07, 'Auto & Vehicles': 0.93, 'Beauty': 0.6, 'Books & Reference': 2.14, 'Business': 4.59, 'Comics': 0.61, 'Comics;Creativity': 0.01, 'Communication': 3.24, 'Dating': 1.86, 'Education': 5.35, 'Education;Creativity': 0.05, 'Education;Education': 0.34, 'Education;Pretend Play': 0.06, 'Education;Brain Games': 0.03, 'Entertainment': 6.07, 'Entertainment;Brain Games': 0.08, 'Entertainment;Creativity': 0.03, 'Entertainment;Music & Video': 0.17, 'Events': 0.71, 'Finance': 3.7, 'Food & Drink': 1.24, 'Health & Fitness': 3.08, 'House & Home': 0.82, 'Libraries & Demo': 0.94, 'Lifestyle': 3.89, 'Lifestyle;Pretend Play': 0.01, 'Card': 0.45, 'Arcade': 1.85, 'Puzzle': 1.13, 'Racing': 0.99, 'Sports': 3.46, 'Casual': 1.76, 'Simulation': 2.04, 'Adventure': 0.68, 'Trivia': 0.42, 'Action': 3.1, 'Word': 0.26, 'Role Playing': 0.94, 'Strategy': 0.91, 'Board': 0.38, 'Music': 0.2, 'Action;Action & Adventure': 0.1, 'Casual;Brain Games': 0.14, 'Educational;Creativ

Function "display_table" prints the frequency table in a readable form, sorted from the highest percentage to the lowest

In [42]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
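
Building `(value, key)` tuples is one way to sort a dictionary by value. Another option, sketched below with the hypothetical helper `sorted_by_share`, is to pass a `key` function to `sorted`. One difference to be aware of: entries with equal percentages keep their original order here, whereas the tuple approach above orders them by key in reverse.

```python
def sorted_by_share(table):
    """Return (key, percentage) pairs, largest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up percentages for illustration
sample_table = {'Tools': 25.0, 'Games': 60.0, 'Music': 15.0}
for key, share in sorted_by_share(sample_table):
    print(key, ':', share)
```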

In [43]:
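# Note: freq_table() already skips dataset[0] as a header row, so passing
# googleFreeApps[1:] leaves the first app out of the percentages below.
display_table(googleFreeApps[1:], 1)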

FAMILY : 18.91
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.63
COMICS : 0.62
BEAUTY : 0.6


In [44]:
display_table(googleFreeApps[1:], 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.59
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual

In [45]:
display_table(appleFreeApps[1:], 12)

Games : 58.18
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.71
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Among free English Android apps, the most common category is Family and the most common genre is Tools.

Among free English iOS apps, Games is by far the most common genre.

**Get the average number of user ratings per genre (used as a proxy for the number of users)**

In [63]:
genres_dict = freq_table(appleFreeApps, 12)
for genre in genres_dict:
    # Reset the accumulators for every genre, otherwise each average
    # would also include the totals of all genres processed before it
    total = 0
    len_genre = 0
    for app in appleFreeApps[1:]:
        app_genre = app[12]
        if (genre == app_genre):
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
    avg_num_usr = round(total / len_genre, 2)
    print(genre,':',avg_num_usr)
    

Productivity : 21028.41
Weather : 52279.89
Shopping : 26919.69
Reference : 74942.11
Finance : 31467.94
Music : 57326.53
Utilities : 18684.46
Travel : 28243.8
Social Networking : 71548.35
Sports : 23008.9
Health & Fitness : 23298.02
Games : 22788.67
Food & Drink : 33333.92
News : 21248.02
Book : 39758.5
Photo & Video : 28441.54
Entertainment : 14029.83
Business : 7491.12
Lifestyle : 16485.76
Education : 7003.98
Navigation : 86090.33
Medical : 612.0
Catalogs : 4004.0


In [65]:
display_table(googleFreeApps, 5)

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


**The average number of installs per app category for the Google Play dataset**

In [93]:
category_freq = freq_table(googleFreeApps, 1)

for category in category_freq:
    total = 0
    len_category = 0
    for app in googleFreeApps[1:]:
        app_category = app[1]
        if (category == app_category):
            # Installs are stored as strings such as '1,000,000+'
            n_installs = float(app[5].replace('+', '').replace(',',''))
            total += n_installs
            len_category += 1
    # Every category comes from these same rows, so len_category is never zero
    avg_n_installs = round(total / len_category, 2)
    print(category,':', avg_n_installs)

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.6
FAMILY : 3695641.82
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.4
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.3
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.2
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77
