## Analyzing Mobile App Data
A company builds apps that are free to download and install, so its main source of revenue is in-app ads.

The goal of this project is to analyze app-store data to help the company's developers understand which types of apps are likely to attract more users.

In [1]:
from csv import reader

# Read each CSV file into a list of rows; the with-block closes the file afterwards
with open('AppleStore.csv', encoding='utf8') as open_apple:
    apple = list(reader(open_apple))

with open('googleplaystore.csv', encoding='utf8') as open_google:
    google = list(reader(open_google))

This function prints a slice of rows in a readable way and, optionally, the number of rows and columns in the dataset.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
explore_data(google, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


Column descriptions for the AppleStore dataset are available at this [link](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

Column descriptions for the Google Play dataset are available at this [link](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

## Delete Rows with Missing Columns from Google Dataset

In [5]:
google_columns = google[0]
for row in google[1:]:
    if len(row) != len(google_columns):
        print(f'{row}\n')
        row_index = google.index(row)
        del google[row_index]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
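
The same clean-up can be sketched without deleting from `google` in place, by building a new list of well-formed rows. This is only an illustration: the helper name `drop_malformed_rows` and the three-column sample are assumptions, not part of the notebook.

```python
def drop_malformed_rows(dataset):
    """Split a dataset (header row first) into well-formed and malformed rows."""
    header = dataset[0]
    kept = [header]
    dropped = []
    for row in dataset[1:]:
        # A row is well formed when it has one value per header column
        if len(row) == len(header):
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped


# Hypothetical sample: the second data row is missing its category value
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
kept, dropped = drop_malformed_rows(sample)
print(len(kept), len(dropped))
```

Returning the dropped rows as well makes it easy to inspect what was removed before continuing.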



## Remove Duplicate Entries in Google Dataset
Create a list of duplicate and unique app names.
For each duplicated app, keep only the entry with the highest review count.

In [6]:
duplicate = []
unique = []
for row in google[1:]:
    name = row[0]
    if name in unique:
        duplicate.append(name)
    else:
        unique.append(name)

print(f'Number of duplicate apps: {len(duplicate)}')

Number of duplicate apps: 1181


In [7]:
reviews_max = {}
for row in google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [8]:
len(reviews_max.keys())

9659

In [9]:
google_clean = [] # store the clean dataset
already_added = [] # store app names

for row in google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.append(name)

In [10]:
explore_data(google_clean, 0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
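
The two-pass deduplication above (first `reviews_max`, then `google_clean`) can also be sketched as a single pass that remembers the best row per app name. The function name `keep_most_reviewed` and the sample rows are illustrative assumptions.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two entries tie on review count
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical rows laid out as [name, category, rating, reviews]
sample_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
]
deduped = keep_most_reviewed(sample_rows)
print(deduped)
```

A dictionary lookup also avoids the repeated `name not in already_added` list scans, which get slower as the list grows.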


## Remove Non-English Apps from Google Dataset
Create a function that flags an app name as non-English when it contains more than three characters outside the ASCII range (0-127)

In [11]:
# a character counts as English (ASCII) if ord(char) <= 127
def is_english(string):
    non_eng_count = 0
    for char in string:
        if ord(char) > 127:
            non_eng_count += 1
        if non_eng_count > 3:
            return False

    return True

In [26]:
is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
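
To see why the check tolerates up to three non-ASCII characters, here is a small sketch that separates the counting from the decision. The names `count_non_ascii` and `looks_english` and the sample app names are assumptions made for illustration.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for char in text if ord(char) > 127)


def looks_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


print(looks_english('Instagram'))                      # True
print(looks_english('Docs To Go™ Free Office Suite'))  # True (one symbol is tolerated)
print(looks_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))     # False
```

The tolerance keeps English names that contain a trademark sign or a couple of emoji, at the cost of letting through short non-English names.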

In [13]:
google_clean_eng = []
for row in google_clean:
    name = row[0]
    if is_english(name):
        google_clean_eng.append(row)
    

In [17]:
apple_eng = []
for row in apple: # note: apple[0] is the header row, so it also ends up in apple_eng
    name = row[1]
    if is_english(name):
        apple_eng.append(row)

In [20]:
print(len(google_clean_eng))
print(len(apple_eng))

9614
6184


## Isolate Free Apps
Loop through each dataset and keep only the free apps.

In [21]:
print(google[0])
print('\n')
print(apple[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [32]:
google_free = []
apple_free = []

for row in google_clean_eng:
    price_type = row[6]
    if price_type == 'Free':
        google_free.append(row)
        

In [36]:
for row in apple_eng:
    price = row[4]
    if price == '0.0':
        apple_free.append(row)

In [37]:
print(len(google_free))
print(len(apple_free))

8863
3222


## App Profiles
The goal is to minimize risk and develop a solid strategy for an app idea.

The strategy is to first build a minimal version of an app that is likely to attract users on Google Play, since Android has more users than iOS.

If the app gets a good response, we can then develop it further by adding features through updates.

If the app is profitable after three to six months, we can then build an iOS version and add it to the App Store.

In [47]:
print(google[0].index('Genres'))
print(google[0].index('Category'))
print(apple[0].index('prime_genre'))

9
1
11


In [49]:
def freq_table(dataset, index):
    frequency_table = {}

#   express count as integers
    for row in dataset:
        category = row[index]
        if category in frequency_table:
            frequency_table[category]+=1
        else:
            frequency_table[category] = 1
#   express count as percentage
    for key in frequency_table:
        frequency_table[key] = (frequency_table[key]/len(dataset)) * 100
    
    return frequency_table
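
As a cross-check on `freq_table`, the same percentages can be sketched with `collections.Counter` from the standard library. The name `freq_table_counter` and the sample rows are illustrative assumptions.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, built with Counter."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Hypothetical rows laid out as [name, category]
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
print(freq_table_counter(sample, 1))
```

Like `freq_table`, this expects rows without the header; otherwise the header label is counted as a category.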
    

In [50]:
# Print a frequency table sorted from the most to the least frequent value
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
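
The sorting step in `display_table` can also be sketched by sorting the dictionary items directly on their value, without first building `(value, key)` tuples. The helper name `sort_by_value` and the sample table are assumptions.

```python
def sort_by_value(table):
    """Return (key, value) pairs ordered from the largest to the smallest value."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Hypothetical percentage table
sample_table = {'TOOLS': 25.0, 'GAME': 62.5, 'FAMILY': 12.5}
for key, value in sort_by_value(sample_table):
    print(key, ':', value)
```

One difference to be aware of: when two values tie, this version keeps their original dictionary order, whereas sorting `(value, key)` tuples in reverse orders ties by key, descending.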

In [51]:
display_table(apple_free, 11) # display_table prints the table and returns None

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [68]:
display_table(google_free, 1) # display_table prints the table and returns None

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0.6431230960171499

In [66]:
display_table(google_free, 9) # display_table prints the table and returns None

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

## Most Popular Apps by Genre on the App Store

In [54]:
prime_genre = freq_table(apple_free, 11)

In [65]:
from pprint import pp
apple_avg_rating = {} # genre -> average number of user ratings per app
for genre in prime_genre:
    total = 0 # stores the sum of rating counts (rating_count_tot)
    len_genre = 0 # stores number of apps specific to genre
    for row in apple_free:
        genre_app = row[11]
        if genre_app == genre:
            len_genre += 1
            n_ratings = float(row[5])
            total += n_ratings

    avg_n_ratings = total/len_genre
    apple_avg_rating[genre] = avg_n_ratings

pp(sorted(apple_avg_rating.items(), key=lambda item: item[1], reverse=True))
    

[('Navigation', 86090.33333333333),
 ('Reference', 74942.11111111111),
 ('Social Networking', 71548.34905660378),
 ('Music', 57326.530303030304),
 ('Weather', 52279.892857142855),
 ('Book', 39758.5),
 ('Food & Drink', 33333.92307692308),
 ('Finance', 31467.944444444445),
 ('Photo & Video', 28441.54375),
 ('Travel', 28243.8),
 ('Shopping', 26919.690476190477),
 ('Health & Fitness', 23298.015384615384),
 ('Sports', 23008.898550724636),
 ('Games', 22788.6696905016),
 ('News', 21248.023255813954),
 ('Productivity', 21028.410714285714),
 ('Utilities', 18684.456790123455),
 ('Lifestyle', 16485.764705882353),
 ('Entertainment', 14029.830708661417),
 ('Business', 7491.117647058823),
 ('Education', 7003.983050847458),
 ('Catalogs', 4004.0),
 ('Medical', 612.0)]
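
The nested loop above scans the whole dataset once per genre. A minimal sketch of the same per-genre average in a single pass, using `collections.defaultdict`; the function name `average_by_group` and the sample rows are assumptions.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average a numeric column per group in one pass over the rows."""
    totals = defaultdict(float)  # group -> running sum of the value column
    counts = defaultdict(int)    # group -> number of rows in the group
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows laid out as [genre, rating_count]
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(sample, 0, 1))
```

With the notebook's data the call would be `average_by_group(apple_free, 11, 5)`, since `prime_genre` is column 11 and `rating_count_tot` is column 5.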


Per the data above, the app genres with the highest average number of user ratings per app are:
1. Navigation
2. Reference
3. Social Networking
4. Music
5. Weather

I would recommend we consider building an app that fits into one of these genres, since a high average rating count suggests that apps in these genres reach more users than apps in other genres.


## Most Popular Apps by Genre on Google Play

In [74]:
category_table = freq_table(google_free, 1)
print(category_table)

{'ART_AND_DESIGN': 0.6431230960171499, 'AUTO_AND_VEHICLES': 0.9251946293580051, 'BEAUTY': 0.5979916506826132, 'BOOKS_AND_REFERENCE': 2.1437436533904997, 'BUSINESS': 4.592124562789123, 'COMICS': 0.6205573733498815, 'COMMUNICATION': 3.2381812027530184, 'DATING': 1.8616721200496444, 'EDUCATION': 1.1621347173643235, 'ENTERTAINMENT': 0.9590432133589079, 'EVENTS': 0.7108202640189552, 'FINANCE': 3.7007785174320205, 'FOOD_AND_DRINK': 1.241114746699763, 'HEALTH_AND_FITNESS': 3.0802211440821394, 'HOUSE_AND_HOME': 0.8236488773552973, 'LIBRARIES_AND_DEMO': 0.9364774906916393, 'LIFESTYLE': 3.9038700214374367, 'GAME': 9.725826469592688, 'FAMILY': 18.898792733837304, 'MEDICAL': 3.5315355974275078, 'SOCIAL': 2.6627552747376737, 'SHOPPING': 2.245289405393208, 'PHOTOGRAPHY': 2.944826808078529, 'SPORTS': 3.396141261423897, 'TRAVEL_AND_LOCAL': 2.335552296062281, 'TOOLS': 8.462146000225657, 'PERSONALIZATION': 3.317161232088458, 'PRODUCTIVITY': 3.8925871601038025, 'PARENTING': 0.6544059573507841, 'WEATHER':

In [75]:
from pprint import pp
google_avg_installs = {}

for category in category_table:
    total = 0 # stores the sum of installs
    len_category = 0 # stores number of apps specific to category

    for row in google_free:
        category_app = row[1]
        if category_app == category:
            len_category += 1
            # 'Installs' looks like '10,000+': strip '+' and ',' before converting
            installs = row[5].replace('+','')
            installs = float(installs.replace(',',''))
            total += installs

    average_installs = total/len_category
    google_avg_installs[category] = average_installs

pp(sorted(google_avg_installs.items(), key=lambda item: item[1], reverse=True))

This cell ranks the Google Play categories by their average number of installs per app. Because the `Installs` column records open-ended buckets such as 10,000+, these averages are lower-bound estimates rather than exact figures.

I would recommend we consider building an app that fits into one of the categories at the top of this ranking, as apps in those categories are installed more often on average than apps in other categories.
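
The `Installs` values in the Google Play rows are text buckets such as `'10,000+'`, so they have to be cleaned before they can be averaged. A minimal sketch of that cleaning as a standalone helper; the name `parse_installs` is an illustrative assumption.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to a float lower bound."""
    return float(text.replace('+', '').replace(',', ''))


print(parse_installs('10,000+'))   # 10000.0
print(parse_installs('500,000+'))  # 500000.0
```

Keeping the conversion in one place makes it easier to reuse and to test against unusual values.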