# Data Analysis for Mobile Apps

We will analyze free and paid apps based on their number of users, and look at ad results to estimate our revenue.

The goal of this project is to analyze data to help developers understand which types of apps are most attractive to users.

We will analyze the most valuable apps in the App Store and Google Play. Taken together, the two platforms offer more than 4.1 million apps.

Using two data sets, _googleplaystore.csv_ and _AppleStore.csv_, we have many columns available to analyze the apps on both platforms.

The cell below defines the class that instantiates and opens the data sets, together with the helper functions used in this notebook.

To run a cell in this notebook, press CTRL+ENTER or click the Run button.

In [29]:
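from csv import reader


class Dataset():
    """Loads a CSV file and groups the helper functions used in this notebook."""

    def __init__(self, file):
        # Read every row of the CSV file into a list of lists
        with open(file) as opened_file:
            self.dataset = list(reader(opened_file))

    @staticmethod
    def explore_data(dataset, start, end, rows_and_columns=False):
        # Print the rows in dataset[start:end], separated by blank lines
        dataset_slice = dataset[start:end]
        for row in dataset_slice:
            print(row)
            print('\n')

        if rows_and_columns:
            print('Number of rows:', len(dataset))
            print('Number of columns:', len(dataset[0]))

    @staticmethod
    def freq_table(dataset, index):
        # Map each value found in column `index` to its percentage of all rows
        table = {}
        total = 0

        for row in dataset:
            total += 1
            value = row[index]
            if value in table:
                table[value] += 1
            else:
                table[value] = 1

        table_percentages = {}
        for key in table:
            percentage = (table[key] / total) * 100
            table_percentages[key] = percentage

        return table_percentages

    @staticmethod
    def display_table(dataset, index):
        # Print the frequency table sorted from highest to lowest percentage
        table = Dataset.freq_table(dataset, index)
        table_display = []
        for key in table:
            key_val_as_tuple = (table[key], key)
            table_display.append(key_val_as_tuple)

        table_sorted = sorted(table_display, reverse=True)
        for entry in table_sorted:
            print(entry[1], ':', entry[0])

    @staticmethod
    def is_english(string):
        # A name counts as English when it has at most 3 non-ASCII characters
        non_ascii = 0

        for character in string:
            if ord(character) > 127:
                non_ascii += 1

        return non_ascii <= 3


# The cells below call the helpers by their bare names
explore_data = Dataset.explore_data
freq_table = Dataset.freq_table
display_table = Dataset.display_table
is_english = Dataset.is_english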

In [2]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
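
The two loading steps above follow the same pattern, so they could be wrapped in one small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name, the `utf8` encoding is an assumption about the files, and the demonstration uses a throwaway temporary file rather than the real data sets.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) for a CSV file, closing the file when done."""
    # encoding='utf8' is an assumption about how the file was saved
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Write a tiny made-up CSV so the sketch does not depend on the real files
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nApp A,TOOLS\nApp B,GAME\n')
    tmp_path = tmp.name

header, rows = load_csv(tmp_path)
os.remove(tmp_path)

print(header)
print(len(rows))
```

Using `with` closes the file as soon as the rows are read, which the plain `open` calls above leave to the interpreter.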

We can inspect the content of the data sets by printing a few rows, as shown below.

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


To remove a row with a content problem, we can use the _del_ statement. The row is deleted and the length of the data set decreases by one.

In [5]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
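
The text does not say what was wrong with the deleted row. One way a faulty row might be spotted, assuming the problem is a missing column, is to compare each row's length with the header's. The sample data and the name `find_bad_rows` are made up for this sketch.

```python
# Hypothetical miniature data set: a header plus three rows, one of them malformed
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],              # a column is missing here
    ['App C', 'GAME', '4.5'],
]


def find_bad_rows(rows, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]


bad = find_bad_rows(rows, header)
print('Bad row indexes:', bad)

# Delete from the end so that the earlier indexes stay valid
for i in reversed(bad):
    del rows[i]

print('Rows left:', len(rows))
```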


To remove duplicate apps, we first look for names that occur more than once. Using a _for_ loop over the data set, we build a list of duplicate apps and a list of unique apps, so we can count how many entries are duplicates.

We can then print the number of duplicate apps and a few examples before removing them.

In [None]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Duplicate apps:', len(duplicate_apps))

print('Examples of duplicate apps:', duplicate_apps[:5])
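
Checking `name in unique_apps` scans a list on every iteration, which gets slow on large data sets. As an alternative sketch (not the approach used above), `collections.Counter` can count the names in one pass. The sample rows are made up.

```python
from collections import Counter

# Hypothetical sample rows; only the first column (the app name) matters here
sample = [
    ['App A', '100'],
    ['App B', '250'],
    ['App A', '120'],
    ['App C', '10'],
    ['App A', '120'],
]

name_counts = Counter(row[0] for row in sample)

# Every occurrence beyond the first counts as a duplicate
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print('Duplicate rows:', n_duplicates)
print('Names with duplicates:', duplicated_names)
```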

Now we create a dictionary and two lists to clean the data set. Looping over the rows with a _for_, we store the maximum number of reviews for each app in _reviews_max_. When a row's review count equals the stored maximum and the app has not been added yet, the row is appended to the _android_clean_ list and the app's name is appended to the _already_added_ list.

In [7]:
reviews_max = {}
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
    elif (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Let's print the first rows of the clean data set:

In [8]:
explore_data(android_clean, 0, 3, True)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


Number of rows: 396
Number of columns: 13


In [9]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659
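
Note that the check above compares the expected length with `len(reviews_max)`, while `android_clean` itself holds 396 rows. In the single loop, a row is appended only when a repeated row matches the running maximum, so apps that appear once are never added. A two-pass variant, sketched here on a small made-up sample, would keep one row per app: the first pass finds each maximum, the second keeps the first row that matches it.

```python
# Hypothetical sample: name in column 0, review count in column 3,
# mirroring the column positions of the Google Play data set
sample = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '250'],
    ['App A', 'TOOLS', '4.0', '120'],
    ['App A', 'TOOLS', '4.0', '120'],
]

# Pass 1: find the highest review count for each name
reviews_max = {}
for app in sample:
    name = app[0]
    n_reviews = float(app[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Pass 2: keep the first row that matches that maximum
clean = []
already_added = set()
for app in sample:
    name = app[0]
    n_reviews = float(app[3])
    if reviews_max[name] == n_reviews and name not in already_added:
        clean.append(app)
        already_added.add(name)

print('Rows kept:', len(clean))
print('Unique names:', len(reviews_max))
```

With two passes, the number of rows kept equals the number of unique names.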


Now we'll remove apps whose names contain foreign characters, keeping only names written in the English alphabet. To do this, the _is_english_ function checks every character of a name and returns True if the name looks English, or False if it contains more than three non-ASCII characters.

Let's test the function:

In [30]:
print(is_english('Instagram'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜'))

True
True
False
True
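
To see why names with a trademark sign or an emoji still pass, it can help to print the count behind the decision. This is a compact sketch of the same idea; `count_non_ascii` and `looks_english` are hypothetical names, and the `max_non_ascii` parameter is added only for illustration.

```python
def count_non_ascii(string):
    """Count the characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in string if ord(character) > 127)


def looks_english(string, max_non_ascii=3):
    # max_non_ascii is a hypothetical parameter; 3 matches the threshold used above
    return count_non_ascii(string) <= max_non_ascii


names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Instachat 😜',
]

for name in names:
    print(name, ':', count_non_ascii(name), looks_english(name))
```

Allowing a few non-ASCII characters keeps English names that contain symbols such as ™ or an emoji.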


We can now remove the apps with non-English names from both data sets.

In [32]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


Number of rows: 396
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38',

In each data set, let's isolate the free apps in a separate list, using the column that holds the app's price.

In [15]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

361
3222
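
The two comparisons above rely on the exact strings `'0'` and `'0.0'`. A sketch of a more tolerant check converts the price to a number first. The helper `is_free` is hypothetical, and stripping a leading `$` is an assumption about how paid prices may be written.

```python
def is_free(price_text):
    """Treat a price as free when it parses to zero."""
    # Assumption: paid prices may carry a leading '$', which float() cannot parse
    return float(price_text.lstrip('$')) == 0.0


for price in ['0', '0.0', '$4.99', '1.99']:
    print(price, ':', is_free(price))
```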


To estimate how many people use an app, we can build a frequency table that helps us understand what kind of apps users are looking for. This way we can store the percentage of apps in each genre and measure it on a given platform.

Using the genres returned by the function _freq_table_, we can compute the average number of user ratings per genre in the App Store.

In [17]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
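
The nested loop above walks through all apps once per genre. As a sketch of an alternative, the totals and counts can be accumulated in a single pass. The function name `average_by_group` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows: name, genre, number of ratings
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]

averages = average_by_group(sample, 1, 2)
for genre, avg in averages.items():
    print(genre, ':', avg)
```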


Looking at the percentage of apps per _prime_genre_ among the free iOS apps:

In [22]:
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665
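
The percentages above come from a frequency table: count how often each value occurs, then divide by the total number of rows. A compact sketch of the same calculation with `collections.Counter`, on made-up rows, might look like this (`freq_table_percent` is a hypothetical name):

```python
from collections import Counter


def freq_table_percent(rows, index):
    """Map each value found in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical rows: name in column 0, genre in column 1
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

table = freq_table_percent(sample, 1)

# Sort by percentage, highest first, similar to display_table
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```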


By filtering the rows on a specific genre and printing specific columns, we can see the number of ratings of each app in that genre.

In [23]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [19]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Examining the number of installs on the Android platform:

In [20]:
display_table(android_final, 5)

10,000,000+ : 24.653739612188367
1,000,000+ : 21.329639889196674
5,000,000+ : 14.127423822714682
100,000+ : 9.141274238227147
100,000,000+ : 8.310249307479225
500,000+ : 5.8171745152354575
50,000,000+ : 3.6011080332409975
10,000+ : 2.4930747922437675
500,000,000+ : 2.21606648199446
100+ : 1.9390581717451523
1,000+ : 1.9390581717451523
1,000,000,000+ : 1.3850415512465373
5,000+ : 1.10803324099723
50,000+ : 0.8310249307479225
500+ : 0.554016620498615
50+ : 0.2770083102493075
10+ : 0.2770083102493075


Using the categories returned by _freq_table_, we can compute the average number of installs per category for the Android apps:

In [25]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

BUSINESS : 4483333.333333333
COMMUNICATION : 280937500.0
DATING : 1604436.111111111
EDUCATION : 10342105.263157895
ENTERTAINMENT : 13456521.739130436
FINANCE : 17716666.666666668
FOOD_AND_DRINK : 5333333.333333333
HEALTH_AND_FITNESS : 5062903.225806451
HOUSE_AND_HOME : 5437500.0
LIFESTYLE : 340000.0
GAME : 140095238.0952381
FAMILY : 12607407.407407407
MEDICAL : 368588.275862069
SOCIAL : 36454545.45454545
SHOPPING : 15280000.0
PHOTOGRAPHY : 38888888.88888889
SPORTS : 6477777.777777778
TRAVEL_AND_LOCAL : 16833333.333333332
PERSONALIZATION : 75000000.0
PRODUCTIVITY : 107923076.92307693
NEWS_AND_MAGAZINES : 137873333.33333334
BOOKS_AND_REFERENCE : 5000000.0
TOOLS : 10000000.0


To get an idea of the Google Play numbers, we can take a specific category and print the apps in it that fall within a given range of installs.

In [26]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Ebook Reader : 5,000,000+
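
The chain of `or` comparisons above can also be written as a membership test against a set of buckets. This is a sketch on made-up rows; only the column positions follow the Google Play data set.

```python
# Install buckets of interest, written exactly as they appear in the Installs column
mid_range = {'1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+'}

# Hypothetical rows: only name (0), category (1) and installs (5) are filled in
sample = [
    ['Reader X', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['Reader Y', 'BOOKS_AND_REFERENCE', '', '', '', '1,000+'],
    ['Game Z', 'GAME', '', '', '', '10,000,000+'],
]

matches = [
    (app[0], app[5])
    for app in sample
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in mid_range
]

for name, installs in matches:
    print(name, ':', installs)
```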


That's it. This project is a way to practice data analysis with real data. This kind of practice is a powerful way to use Python and data analysis in any context, whether in a degree program or in real life.

This is my first project using Python for data science. That's nice!