# DataQuest - Python for Data Science: Fundamentals
Each DataQuest course is followed by a Guided Project to practice what was taught on a real project. The subjects taught in this first course were:
- The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
- Lists and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions

## Introduction
This Guided Project consists of making a data-driven decision for an app development company.
The company builds **free apps for the Google Play Store and the iOS App Store**. Its profit comes from in-app ads, so the more active users an app has, the greater the revenue. The prospective app should work on both stores (iOS and GPS, short for Google Play Store).
#### Datasets
- [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps/home): Public dataset collected by Lavanya Gupta and published on Kaggle (last updated in January 2019), containing a **sample of 10k apps**:
    - Name
    - Category
    - Rating
    - Number of reviews
    - Size (MB)
    - Installs
    - Price
    - Content Rating (18+, Teen, Everyone)
    - Genre
- [Mobile App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home): Public dataset collected by Research Engineer Ramathan and published on Kaggle, extracted in July 2017 and containing a **sample of 7k+ apps**:

| Code             | Content                                         |
|------------------|-------------------------------------------------|
| id               | App ID                                          |
| track_name       | App Name                                        |
| size_bytes       | Size (in Bytes)                                 |
| currency         | Currency Type                                   |
| price            | Price amount                                    |
| rating_count_tot | User Rating counts (for all versions)           |
| rating_count_ver | User Rating counts (for current version)        |
| user_rating      | Average User Rating value (for all versions)    |
| user_rating_ver  | Average User Rating value (for current version) |
| ver              | Latest version code                             |
| cont_rating      | Content Rating                                  |
| prime_genre      | Primary Genre                                   |
| sup_devices.num  | Number of supporting devices                    |
| ipadSc_urls.num  | Number of screenshots shown for display         |
| lang.num         | Number of supported languages                   |
| vpp_lic          | Vpp Device Based Licensing Enabled              |

        

## Data Opening
First, we follow these steps for each file:
- Open the CSV file and read it with the reader() function from the csv module
- Convert the reader object into a list of rows with list()

In [1]:
from csv import reader

opened_file_as = open(r"C:\Users\Jose\Documents\Dataquest\Apps_Data\AppleStore.csv", encoding = 'utf-8-sig')
read_file_as = reader(opened_file_as)
apps_data_as = list(read_file_as)

opened_file_gps = open(r"C:\Users\Jose\Documents\Dataquest\Apps_Data\googleplaystore.csv", encoding = 'utf-8-sig')
read_file_gps = reader(opened_file_gps)
apps_data_gps = list(read_file_gps)

def explore_data(dataset, start, end, rows_and_columns=False): # Prints the rows in dataset[start:end] and, optionally, the total number of rows and columns
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
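
The loading steps above could also be written with a context manager, so that each file is closed once it has been read. The sketch below is one possible version and not the notebook's own code: load_csv is a hypothetical helper, and an in-memory sample stands in for the real CSV files so that the example runs on its own.

```python
import csv
import io

def load_csv(file_obj):
    """Read an open CSV file object and return (header, data_rows)."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Hypothetical two-line sample standing in for AppleStore.csv.
# With a real file, the equivalent call would be:
#   with open(path, encoding='utf-8-sig', newline='') as f:
#       header, rows = load_csv(f)
sample_file = io.StringIO("id,track_name,price\n281656475,PAC-MAN Premium,3.99\n")
with sample_file as f:
    header, rows = load_csv(f)

print(header)
print(rows)
```

Returning the header separately avoids having to remember to skip row 0 in every later loop.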

## Data Cleaning
With the **explore_data() function** created above, I explore the first rows of each dataset to make sure rows and columns are laid out correctly. The first column of the **iOS dataset (apps_data_as)** is an unnamed row-index column that is already present in the file (its header is an empty string). It shifts every column index by one, which makes it harder to follow the DataQuest instructions, so the **first column is deleted**.

In [2]:
explore_data(apps_data_as, 0, 2, True)
print('\n')
explore_data(apps_data_gps, 0, 2, True)
print('\n')
for row in apps_data_as:
    row.pop(0)
print('\n')
explore_data(apps_data_as, 0, 2, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


Number of rows: 7198
Number of columns: 17


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13




['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipa

### Deleting incorrect row
In the discussion section of the GPS (Google Play Store) dataset there is an open thread about **row 10473**, which has incorrect information: its Category value is missing, so the remaining values are shifted one column to the left. **It is deleted** with the del statement.

In [3]:
header_gps = apps_data_gps[0:1]
print(header_gps)
print('\n')

print(apps_data_gps[10473]) # Wrong data reported in the discussion. Row must be deleted
wrong_data_gps = apps_data_gps[10473]

print('\n')
del apps_data_gps[10473]
print('New row number = ', len(apps_data_gps))

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


New row number =  10841
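
The faulty row has 12 values while the header has 13, so rows like it can also be found without relying on the discussion thread. The following is a minimal sketch under that assumption; find_malformed_rows and the sample data are hypothetical and only illustrate the idea.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical sample: the third row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad_indexes = find_malformed_rows(sample)
print(bad_indexes)

# Delete from the end so that the remaining indexes stay valid
for i in reversed(bad_indexes):
    del sample[i]
print(len(sample))
```

A length check only catches rows with missing or extra values; a row with the right number of columns but wrong content would still need a manual review.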


### Deleting duplicated rows
We check if there are any **duplicated apps in the GPS dataset**

In [4]:
unique_apps = []
duplicated_apps = []

for row in apps_data_gps[1:]:
    app_name_gps = row[0]
    if app_name_gps in unique_apps:
        duplicated_apps.append(app_name_gps)
    else:
        unique_apps.append(app_name_gps)

print(duplicated_apps[:3])
len(duplicated_apps)

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']


1181
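
Checking membership in a list gets slower as the list grows. As an alternative sketch (not the course's method), collections.Counter can count every name in one pass. The duplicate_counts helper and the sample rows below are hypothetical.

```python
from collections import Counter

def duplicate_counts(rows, name_index=0):
    """Return {app name: number of occurrences} for names that appear more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: count for name, count in counts.items() if count > 1}

# Hypothetical rows: [name, reviews]
sample = [
    ['Box', '10'],
    ['Instagram', '50'],
    ['Box', '12'],
    ['Box', '11'],
    ['Slack', '7'],
]

duplicates = duplicate_counts(sample)
# Rows beyond the first occurrence of each name, comparable to len(duplicated_apps)
extra_rows = sum(count - 1 for count in duplicates.values())
print(duplicates)
print(extra_rows)
```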

For each duplicated app, the entry with the **most reviews is the one that stays in the dataset**, since a higher review count suggests it holds the most recent information.

In [5]:
reviews_max = {}

for row in apps_data_gps[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))
print(reviews_max["Instagram"])

9659
66577446.0


We create a new list without duplicates. The **new total number of rows is 9659**

In [6]:
android_clean = []
already_added = []

for row in apps_data_gps[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
        
print('New lenght is ', len(android_clean))
print('New lenght is ', len(already_added))

New lenght is  9659
New lenght is  9659
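
The two cells above can be condensed into a single pass that stores the best row seen so far for each name. This is a sketch of that idea rather than the notebook's approach; keep_most_reviewed is a hypothetical helper, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie on reviews
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical rows: [name, reviews]
sample = [
    ['Box', '10'],
    ['Instagram', '50'],
    ['Box', '12'],
    ['Box', '12'],
]
clean = keep_most_reviewed(sample, name_index=0, reviews_index=1)
print(clean)
```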


### Deleting Non-English apps
We create a **function that returns a Boolean indicating whether an app name is in English**. We allow up to three characters outside the ASCII range (code points above 127) per name, since some English apps use emojis, symbols or special punctuation marks.

In [7]:
weird_apps = ['Instagram',
'爱奇艺PPS -《欢乐颂2》电视剧热播',
'Docs To Go™ Free Office Suite',
'Instachat 😜']

def e_string(string):
    w_letter_count = 0
    for letter in string:
        if ord(letter) > 127:    
            w_letter_count += 1
    if w_letter_count > 3: 
        return False
    return True

print(e_string('Docs To Go™ Free Office Suite')) 
print(e_string('Instachat 😜')) 
e_string('爱奇艺PPS -《欢乐颂2》电视剧热播')

True
True


False
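
The same check can be written more compactly, with the tolerance exposed as a parameter. This is a minimal sketch; is_english and max_non_ascii are hypothetical names, and the default of 3 mirrors the threshold used above.

```python
def is_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
results = [is_english(name) for name in names]
print(results)
```

This is a heuristic: it would still let through a non-English name written entirely in ASCII characters.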

The function **e_string(string) is used to clean both datasets again**

In [8]:
recleaned_android = []
recleaned_ios = []

for row in android_clean:
    app = row[0]
    if e_string(app):
        recleaned_android.append(row)
        
for row in apps_data_as:
    app = row[1]
    if e_string(app):
        recleaned_ios.append(row)
        

print(len(recleaned_android))
print(len(recleaned_ios))

9614
6184


### Deleting Non-Free apps
The data is filtered once more so that **only free apps** are analyzed

In [9]:
free_android = []
free_ios = []

for row in recleaned_android:
    price = row[7]
    if price == "0":
        free_android.append(row)
        
for row in recleaned_ios[1:]:
    price = row[4]
    if price == '0':
        free_ios.append(row)
        
        
print(len(free_android))
print(len(free_ios))

8864
3222
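
Comparing the price as text works as long as every free app is stored exactly as '0'. A slightly more tolerant sketch converts the price to a number first. The is_free helper is hypothetical, and the leading '$' on paid prices is an assumption about the raw data format.

```python
def is_free(price_text):
    """Return True when the price parses to zero, ignoring a leading '$'."""
    return float(price_text.strip().lstrip('$')) == 0.0

# Hypothetical price values in the formats this sketch assumes
prices = ['0', '0.0', '$4.99', '3.99']
flags = [is_free(price) for price in prices]
print(flags)
```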


## Apps Analysis
### Number of apps by category
A frequency table is created to see **which category has the most apps** in each dataset. The values are percentages of the total, ordered from highest to lowest

In [10]:
def freq_table(dataset, index):
    freq_list = {}
    total = 0
    for row in dataset: 
        column = row[index]
        total += 1
        if column in freq_list:
            freq_list[column] += 1
        else:
            freq_list[column] = 1
    
    freq_perc = {}
    for key in freq_list:
        perc = (freq_list[key] / total) * 100
        freq_perc[key] = perc
        
    return freq_perc
        
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(free_android, 9)
print('\n')
display_table(free_ios, 11)
        

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075
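
For comparison, the same sorted percentage table can be built with collections.Counter, whose most_common() method already returns entries from most to least frequent. This is an alternative sketch, not the course's method; freq_table_sorted and the sample rows are hypothetical.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return [(value, percentage)] for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Hypothetical rows: [name, genre]
sample = [
    ['App A', 'Tools'],
    ['App B', 'Tools'],
    ['App C', 'Games'],
    ['App D', 'Tools'],
]
table = freq_table_sorted(sample, 1)
for value, percentage in table:
    print(value, ':', percentage)
```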

### Rating Count per category (App Store)
The **average rating count per genre** is computed to see which genres have the most active users in the App Store (iOS) dataset. The top three genres are Navigation, Reference and Social Networking

In [11]:
unique_genre_ios = freq_table(free_ios, 11)

print('GENRES LIST')
for genre in unique_genre_ios:
    total = 0
    len_genre = 0
    for app in free_ios:
        genre_app = app[11]
        if genre_app == genre:
            user_rating_count = float(app[5])
            total += user_rating_count
            len_genre += 1
    avg_rating_count = total / len_genre
    print(genre, ':', avg_rating_count) 
print('\n')
    
print('TOP 3 GENRES')

top3ios = []
top3iosvalues = []
for genre in unique_genre_ios:
    total = 0
    len_genre = 0
    for app in free_ios:
        genre_app = app[11]
        if genre_app == genre:
            user_rating_count = float(app[5])
            total += user_rating_count
            len_genre += 1
    avg_rating_count = total / len_genre 
    if avg_rating_count >= 71000:
        top3ios.append(genre)
        top3iosvalues.append(avg_rating_count)
print(top3ios)
print(top3iosvalues)

GENRES LIST
Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


TOP 3 GENRES
['Reference', 'Social Networking', 'Navigation']
[74942.11111111111, 71548.34905660378, 86090.33333333333]
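
The cell above loops over the whole dataset once per genre and picks the top three with a hard-coded threshold (71000). A possible alternative, sketched below with hypothetical helper names and sample data, accumulates the totals in one pass and sorts the averages to get the top entries.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the value column} in one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

def top_n(averages, n=3):
    """Return the n (group, average) pairs with the highest averages."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical rows: [name, rating_count, genre]
sample = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '1000', 'Navigation'],
    ['App D', '50', 'Weather'],
    ['App E', '10', 'Medical'],
]
averages = average_by_group(sample, group_index=2, value_index=1)
top3 = top_n(averages)
print(top3)
```

Sorting means the top three stay correct if the data changes, whereas a fixed threshold has to be adjusted by hand.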


### Top 3 App Store Categories in Depth
A function is created to print the **distribution of app rating counts** within a category, for either dataset. Here it is applied to the App Store (iOS) dataset, with the critical values (rating counts) 50000, 25000, 5000 and 1000.

- The 'Navigation' average is driven by a few apps: only 2 of its 6 apps have 50000 ratings or more, and the other 4 fall well below the category average of about 86000.
- Reference and Social Networking have a more evenly distributed rating count.

In [23]:
def disprating(data, app_category, total_rating, critical_value1 = 50000, critical_value2 = 25000, critical_value3 = 5000, critical_value4 = 1000, critical_value5 = 0, ios = True):
    # Prints how many apps of app_category fall into each bucket of counts.
    # ios = True reads the App Store layout (name 1, genre 11, rating count 5);
    # ios = False reads the Google Play layout (name 0, category 1, installs 5).
    if ios:
        name_index, category_index = 1, 11
    else:
        name_index, category_index = 0, 1

    # Lower bound of each bucket, from highest to lowest
    bounds = [critical_value1, critical_value2, critical_value3, critical_value4, critical_value5]
    buckets = [[] for bound in bounds]
    total_sn = 0

    for app in data:
        if app[category_index] != app_category:
            continue
        total_sn += 1
        # Installs look like '10,000+'; plain rating counts are unaffected by the replacements.
        # The row itself is not modified.
        count = float(app[5].replace('+', '').replace(',', ''))
        # The app goes into the first bucket whose lower bound it reaches
        for i in range(len(bounds)):
            if count >= bounds[i]:
                buckets[i].append(app[name_index])
                break

    print(app_category, ': ', total_rating)
    print('Above', critical_value1, 'n=', len(buckets[0]), ' Pctg=', len(buckets[0]) / total_sn * 100)
    for i in range(1, len(bounds)):
        print(bounds[i], 'to', bounds[i - 1], 'n=', len(buckets[i]), ' Pctg=', len(buckets[i]) / total_sn * 100)


In [24]:
disprating(free_ios, top3ios[2], top3iosvalues[2])
print('\n')
disprating(free_ios, top3ios[1], top3iosvalues[1])
print('\n')
disprating(free_ios, top3ios[0], top3iosvalues[0])

Navigation :  86090.33333333333
Above 50000 n= 2  Pctg= 33.33333333333333
25000 to 50000 n= 0  Pctg= 0.0
5000 to 25000 n= 1  Pctg= 16.666666666666664
1000 to 5000 n= 1  Pctg= 16.666666666666664
0 to 1000 n= 2  Pctg= 33.33333333333333


Social Networking :  71548.34905660378
Above 50000 n= 19  Pctg= 17.92452830188679
25000 to 50000 n= 10  Pctg= 9.433962264150944
5000 to 25000 n= 23  Pctg= 21.69811320754717
1000 to 5000 n= 18  Pctg= 16.9811320754717
0 to 1000 n= 36  Pctg= 33.9622641509434


Reference :  74942.11111111111
Above 50000 n= 3  Pctg= 16.666666666666664
25000 to 50000 n= 1  Pctg= 5.555555555555555
5000 to 25000 n= 5  Pctg= 27.77777777777778
1000 to 5000 n= 2  Pctg= 11.11111111111111
0 to 1000 n= 7  Pctg= 38.88888888888889
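
One quick way to see how much a few large apps pull a category average upward is to compare the mean with the median. The numbers below are hypothetical and only illustrate the effect; they are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical rating counts for a small genre with two very popular apps
rating_counts = [150, 800, 3000, 12000, 90000, 410000]

average = mean(rating_counts)
middle = median(rating_counts)
below_average = sum(1 for count in rating_counts if count < average)

print('Mean:', average)
print('Median:', middle)
print('Apps below the mean:', below_average)
```

When the median sits far below the mean, as it does here, the average mostly describes the largest apps rather than a typical app in the genre.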


### Installs per category (Google Play Store)
The **average number of installs per category** is computed to see which Google Play Store categories have the most installs

In [14]:
unique_genre_gps = freq_table(free_android, 1)

for category in unique_genre_gps:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:
            app[5] = app[5].replace('+','')
            app[5] = app[5].replace(',','')
            installs_count = float(app[5])
            total += installs_count
            len_category += 1
    avg_intalls = total / len_category
    print(category, ': ', avg_intalls)
    
print('\n')    
print('TOP 3 CATEGORIES')

top3gps = []
top3gpsvalues = []

for category in unique_genre_gps:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:
            app[5] = app[5].replace('+','')
            installs_new = app[5].replace(',','')
            installs_count = float(installs_new)
            total += installs_count
            len_category += 1
    avg_intalls = total / len_category
    if avg_intalls >= 20727872.4:
        top3gps.append(category)
        top3gpsvalues.append(avg_intalls)
        print(category, ': ', avg_intalls)

ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS 
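
The cell above rewrites app[5] in place every time a row is visited. A small conversion helper keeps the original rows untouched. This is a minimal sketch; parse_installs is a hypothetical name and the sample row is shortened for illustration.

```python
def parse_installs(installs_text):
    """Convert an installs label such as '10,000+' into a float."""
    return float(installs_text.replace('+', '').replace(',', ''))

# Hypothetical shortened row: installs are at index 5, as in the Google Play dataset
row = ['Photo Editor', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+']
installs = parse_installs(row[5])

print(installs)
print(row[5])  # the row still holds the original label
```

Note that labels such as '10,000+' are lower bounds, so any average built from them is an approximation of the real number of installs.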

### Top 3 Google Play Store Categories in Depth
The **function created above is used again** to look in more depth at the distribution of installs within the top Google Play Store categories
- The social and communication pattern seen in the App Store data is repeated here: these categories are again among those with the most installs

In [15]:
disprating(free_android, top3gps[2], top3gpsvalues[2], ios = False)
print('\n')
disprating(free_android, top3gps[1], top3gpsvalues[1], ios = False)
print('\n')
disprating(free_android, top3gps[0], top3gpsvalues[0], ios = False)

VIDEO_PLAYERS :  24727872.452830188
Above 50000 ratings: n= 107  Pctg= 67.29559748427673
25000 to 50000 ratings: n= 0  Pctg= 0.0
5000 to 25000 ratings: n= 29  Pctg= 18.238993710691823
1000 to 5000 ratings: n= 8  Pctg= 5.031446540880504
0 to 1000 ratings: n= 15  Pctg= 9.433962264150944


SOCIAL :  23253652.127118643
Above 50000 ratings: n= 145  Pctg= 61.440677966101696
25000 to 50000 ratings: n= 0  Pctg= 0.0
5000 to 25000 ratings: n= 29  Pctg= 12.288135593220339
1000 to 5000 ratings: n= 22  Pctg= 9.322033898305085
0 to 1000 ratings: n= 40  Pctg= 16.94915254237288


COMMUNICATION :  38456119.167247385
Above 50000 ratings: n= 174  Pctg= 60.62717770034843
25000 to 50000 ratings: n= 0  Pctg= 0.0
5000 to 25000 ratings: n= 36  Pctg= 12.543554006968641
1000 to 5000 ratings: n= 19  Pctg= 6.620209059233449
0 to 1000 ratings: n= 58  Pctg= 20.209059233449477


## Conclusion
I recommend creating a **social media app**.
- Both datasets have a social media category in their top 3, by average rating count (App Store) and by average installs (Google Play Store)
- Its rating counts and installs are not driven by a few extreme values, unlike the Navigation category