# Analyzing Mobile App Data

In this *guided* project, I'll pretend I'm working as a data analyst for a company that builds Android and iOS mobile apps. The company makes its apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue is in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better.
The goal of this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

## Load Data

In [1]:
# load a CSV file into a list of rows (each row is a list of strings)
from csv import reader

def load_data(data_csv):
    # the with-block closes the file once all rows have been read
    with open(data_csv, encoding="utf8") as opened_file:
        return list(reader(opened_file))

apple_data = load_data("AppleStore.csv")
google_data = load_data("googleplaystore.csv")

## Explore Data

In [2]:
#data explore function
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [7]:
explore_data(google_data, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [13]:
explore_data(apple_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


## Clean Data: google_data

https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015
In google_data:
1. The discussion above reports that the entry below is incomplete: it is missing its 'Category' value, so I deleted it.
2. Some apps, such as Instagram, have duplicate entries; I removed them and kept only the most recent row.

In [10]:
print(google_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [11]:
del google_data[10473]  # run this cell only once; a second run would delete a valid row
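
An incomplete row like the one above can also be located without knowing its index in advance, because a missing value makes the row shorter than the header (12 values against 13 columns here). The following is only a minimal sketch under that assumption; `find_malformed_rows` and the toy rows are hypothetical names and data, not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    header_len = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != header_len]

# Toy data standing in for google_data (hypothetical rows).
toy = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],  # 'Category' is missing, so the row is one value short
]

bad_rows = find_malformed_rows(toy)
print(bad_rows)
```

Checking the length before deleting also guards against removing the wrong row if the cell is executed more than once.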

### Separating duplicates from the main data set

In [18]:
duplicate_apps = []
unique_apps = []

for row in google_data[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of duplicates: ', len(duplicate_apps))
print('\n Number of rows left: ', len(unique_apps))
print('\n example: ', duplicate_apps[0:4])



Number of duplicates:  1181

 Number of rows left:  9659

 example:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']
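
The loop above tests membership with `in` on a list, which scans the whole list for every row. A set gives the same counts with constant-time lookups. This is a sketch of that alternative, not the notebook's own code; `split_duplicates` and the toy rows are hypothetical.

```python
def split_duplicates(rows):
    """Return (set of unique names, list of repeated names) for the given rows."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[0]
        if name in seen:
            duplicates.append(name)  # every repeat after the first is a duplicate
        else:
            seen.add(name)
    return seen, duplicates

# Toy rows: only the app name column matters for this check.
toy_rows = [['Box'], ['Zoom'], ['Box'], ['Box']]
unique_names, duplicate_names = split_duplicates(toy_rows)
print(len(unique_names), duplicate_names)
```

Note that a set does not preserve the order in which names were first seen, which is fine here because only the counts are used.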


#### Checking how many duplicates of 'Box' we have in the data set

In [17]:
for row in google_data[1:]:
    app_name = row[0]
    if app_name == 'Box':
        print(row)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


From the above, the 'Box' app has three entries. Which of them should be removed?

The three 'Box' rows shown here are identical, so any one of them can be kept. Where duplicate rows do differ, the main difference is in the fourth position of each row, which corresponds to the number of reviews. Different review counts show that the data was collected at different times.
We can use this information to build a criterion for removing the duplicates: the higher the number of reviews, the more recent the data should be, so for each app we keep the row with the most reviews.

In [22]:
reviews_max ={}
for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

len(reviews_max) #9659 expected

9659

To keep only the cleaned rows, the loop below adds each app exactly once, using the row whose review count matches the maximum stored in `reviews_max`.

In [67]:
android_clean = []  # cleaned google_data
already_added = []  # names already kept; guards against duplicate rows that tie on the review count

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
android_header = google_data[0]
len(reviews_max) == len(android_clean) #True means all collected 9659 apps

True
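
The two loops above (build `reviews_max`, then collect matching rows) can also be folded into a single pass that stores the best row per app directly. The sketch below is one possible variant, shown on hypothetical toy rows; `keep_most_reviewed` is not a function from the notebook.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # A strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews (hypothetical values).
toy = [
    ['App A', 'X', '4.0', '100'],
    ['App A', 'X', '4.0', '250'],
    ['App B', 'Y', '3.5', '10'],
    ['App B', 'Y', '3.5', '10'],
]

deduped = keep_most_reviewed(toy)
print(deduped)
```

Because ties keep the first row, identical duplicates such as the 'Box' entries collapse to a single row without a separate checklist.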

In [26]:
explore_data(android_clean, 0, 2, True)
# the cleaned data doesn't include the header row

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


### Removing non-English apps
The characters commonly used in English text have ASCII codes in the range 0-127.

In [48]:
def english_only(string):
    for i in string:
        if ord(i) > 127:  # ord() returns the character's Unicode code point
            return False
    return True
        
english_only('Docs To Go™ Free Office Suite')
#english_only('我爱中国')
#english_only('Instachat 😜')

False

Emojis and symbols such as ™ have code points above 127, so this function would discard many apps whose names are in English but contain such characters. We can reduce this effect by rejecting a name only if it contains three or more non-ASCII characters. The `english_only()` function is rewritten below accordingly.

In [55]:
def english_only(string):
    count = 0
    for i in string:
        if ord(i) > 127:  # ord() returns the character's Unicode code point
            count += 1
            if count >= 3:
                return False
    return True

#english_only('Docs To Go™ Free Office Suite')
#english_only('我爱中国')
english_only('Instachat 😜')

True
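
To see how the threshold behaves on all three sample names at once, the check can be split into a counter and a comparison. This is a sketch equivalent to the `count >= 3` test above; `count_non_ascii`, `looks_english`, and the `max_non_ascii` parameter are hypothetical names introduced for illustration.

```python
def count_non_ascii(text):
    """Count characters whose Unicode code point is above the ASCII range."""
    return sum(1 for ch in text if ord(ch) > 127)

def looks_english(text, max_non_ascii=2):
    """Accept a name with at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

samples = ['Docs To Go™ Free Office Suite', '我爱中国', 'Instachat 😜']
results = {name: looks_english(name) for name in samples}
for name, ok in results.items():
    print(name, '->', ok)
```

Exposing the threshold as a parameter makes it easy to check how many apps are kept or dropped under a stricter or looser rule. The filter remains a heuristic: a short non-English name with only one or two non-ASCII characters still passes.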

*Use the new function to filter out non-English apps from both data sets. Loop through each data set. If an app name is identified as English, append the whole row to a separate list.*

In [58]:
android_clean_eng = []

for row in android_clean:
    name = row[0]
    if english_only(name):
        android_clean_eng.append(row)
print(f'There are {len(android_clean_eng)} English Android apps now')

There are 9597 English Android apps now


## Clean Data: apple_data

https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion
In apple_data:
1. No issues found so far.

In [68]:
ios_clean_eng = [] #cleaned apple_data

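for row in apple_data[1:]:
    # NOTE: index 0 is the 'id' column; the app name ('track_name') is at index 1.
    # As written, this checks the numeric id, so every row passes the filter.
    # Using row[1] would filter on the name and change the counts below.
    name = row[0]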
    if english_only(name):
        ios_clean_eng.append(row)
        
ios_header = apple_data[0]
print(f'There are {len(ios_clean_eng)} English iOS apps now')

There are 7197 English iOS apps now


### Isolate free apps
Our interest is in free apps, so the first step is to isolate them for further analysis.
* Correctly identify the price column: *the 5th column of the Apple data (index 4) and the 8th column of the Google data (index 7)*

In [71]:
free_android_clean_eng = []

for row in android_clean_eng:
    price = float(row[7].strip('$')) # 8th column
    if price == 0.0:
        free_android_clean_eng.append(row)
print(f'There are {len(android_clean_eng)} English Android apps now \n')
print(f'There are {len(free_android_clean_eng)} FREE English Android apps now')

There are 9597 English Android apps now 

There are 8848 FREE English Android apps now


In [78]:
free_ios_clean_eng = [] 

for row in ios_clean_eng:
    price = float(row[4])  # 5th column
    if price == 0.0:
        free_ios_clean_eng.append(row)
        
print(f'There are {len(ios_clean_eng)} English iOS apps now \n')
print(f'There are {len(free_ios_clean_eng)} FREE English iOS apps now')

There are 7197 English iOS apps now 

There are 4056 FREE English iOS apps now
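
The two loops differ only in the column index and in how the price string is cleaned: the Google Play cell strips a `$` before converting, while the App Store prices are already plain numbers. A small helper can cover both formats. This is a sketch under that assumption; `parse_price` and `is_free` are hypothetical names.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(raw.strip().lstrip('$'))

def is_free(row, price_idx):
    """True when the price column of the row parses to zero."""
    return parse_price(row[price_idx]) == 0.0

# Toy rows with the price in the last column (hypothetical values).
toy_rows = [['App A', '0'], ['App B', '$4.99'], ['App C', '0.0']]
free_rows = [row for row in toy_rows if is_free(row, 1)]
print(free_rows)
```

With the real data, the index would be 7 for the Google Play rows and 4 for the App Store rows, as identified above.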


## Analysis
### Preamble
Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps. Our validation strategy for an app idea is as follows:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well in both markets might be a productivity app that makes use of gamification.

In [80]:
print(f'This is header of android {android_header} \n')
print(f'This is header of ios {ios_header}')

This is header of android ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

This is header of ios ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


To find the most common kinds of apps in each market, we need to build a frequency table for the `prime_genre` column of the App Store data set, and for the `Genres` and `Category` columns of the Google Play data set.

In [124]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
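
The same percentage table can be built with `collections.Counter` from the standard library, which handles the counting and the sorting by frequency. This is an alternative sketch, not the notebook's own implementation; `freq_table_counter` and the toy rows are hypothetical.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count.
    return {key: count / total * 100 for key, count in counts.most_common()}

# Toy rows: app name and genre (hypothetical values).
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
toy_table = freq_table_counter(toy_rows, 1)
for genre, pct in toy_table.items():
    print(genre, ':', pct)
```

Since dictionaries keep insertion order in Python 3.7 and later, iterating over the result prints the genres from most to least common without a separate sorting step.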
        

In [122]:
# Android
display_table(free_android_clean_eng, 9)
print('\n')
display_table(free_android_clean_eng, 1)

Tools : 8.44258589511754
Entertainment : 6.080470162748644
Education : 5.357142857142857
Business : 4.599909584086799
Productivity : 3.899186256781193
Lifestyle : 3.8765822784810124
Finance : 3.7070524412296564
Medical : 3.5375226039783
Sports : 3.4584086799276674
Personalization : 3.322784810126582
Communication : 3.2323688969258586
Action : 3.096745027124774
Health & Fitness : 3.0854430379746836
Photography : 2.949819168173599
News & Magazines : 2.802893309222423
Social : 2.667269439421338
Travel & Local : 2.328209764918626
Shopping : 2.2490958408679926
Books & Reference : 2.1360759493670884
Simulation : 2.0456600361663653
Dating : 1.8648282097649187
Arcade : 1.842224231464738
Video Players & Editors : 1.7744122965641953
Casual : 1.763110307414105
Maps & Navigation : 1.3901446654611211
Food & Drink : 1.2432188065099457
Puzzle : 1.1301989150090417
Racing : 0.9945750452079566
Role Playing : 0.9380650994575045
Libraries & Demo : 0.9380650994575045
Auto & Vehicles : 0.9267631103074141
St

In [123]:
# iOS
display_table(free_ios_clean_eng, 11)


Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032
