# Analyzing Mobile App Data
## DataQuest guided project - Introduction to Python

This project is a guided exercise in which I test my knowledge of Python basics. It is the last step of the course "Introduction to Python - Python Functions and Jupyter Notebooks".

---
**Goals**

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

## 01. Open and Explore the Data

---
First, let's open the data and create two datasets.

In [2]:
from csv import reader

# Google Play dataset
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
ggl_apps_data = list(read_file)
opened_file.close()

# App Store dataset
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
apl_apps_data = list(read_file)
opened_file.close()
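
A context manager closes the file automatically, even if reading fails part-way. The sketch below wraps the loading step in a small helper; `load_csv` and the sample file are hypothetical names introduced only for illustration, and the helper returns the header separately from the data rows, which is a design choice rather than what the rest of this notebook does.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # newline='' is what the csv module documentation recommends for csv.reader.
    with open(path, encoding=encoding, newline='') as f:
        data = list(csv.reader(f))
    return data[0], data[1:]


# Tiny stand-in file so the sketch runs without the Kaggle downloads.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.csv')
    with open(sample_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Category'], ['Demo App', 'TOOLS']])
    header, rows = load_csv(sample_path)

print(header)
print(rows)
```

Keeping the header apart from the rows avoids having to remember which lists still need a `[1:]` slice later on.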

In the following cell we create a function that slices the dataset and prints the selected rows with blank lines between them.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(apl_apps_data, 0, 2, True)
print('\n')
explore_data(ggl_apps_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


---
The details of each dataset follow.

[**Google Play Store Dataset**](https://www.kaggle.com/datasets/lava18/google-play-store-apps)

| Columns        | Description                                     |
| -------        | -----------                                     |
|'App'           | Application name                                |
|'Category'      | Category the app belongs to                     |
|'Rating'        | Overall user rating of the app                  |
|'Reviews'       | Number of user reviews for the app              |
|'Size'          | Size of the app                                 |
|'Installs'      | Number of user downloads/installs for the app   |
|'Type'          | Paid or Free                                    |
|'Price'         | Price of the app                                |
|'Content Rating'| Target age group: Children / Mature 21+ / Adult |
|'Genres'        | An app can belong to multiple genres            |
|'Last Updated'  | Date of last update                             |
|'Current Ver'   | Version                                         |
|'Android Ver'   | Android Version                                 |

[**Apple App Store Dataset**](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)

| Columns          | Description                                     |
| -------          | -----------                                     |
|'id'              | Application id                                  |
|'track_name'      | Application name                                |
|'size_bytes'      | Size of the app                                 |
|'currency'        | Currency Type                                   |
|'price'           | Price of the app                                |
|'rating_count_tot'| User Rating counts (for all versions)           |
|'rating_count_ver'| User Rating counts (for current version)        |
|'user_rating'     | Average User Rating value (for all versions)    |
|'user_rating_ver' | Average User Rating value (for current version) |
|'ver'             | Last Version Code                               |
|'cont_rating'     | Target age group: Children / Mature 21+ / Adult |
|'prime_genre'     | Primary Genre                                   |
|'sup_devices.num' | Number of supporting devices                    |
|'ipadSc_urls.num' | Number of screenshots shown for display         |
|'lang.num'        | Number of supported languages                   |
|'vpp_lic'         | VPP Device Based Licensing Enabled              |

## 02. Clean and Filter the Data

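Now let's remove a row that contains an error, as mentioned in this [discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015). First, let's check whether the error occurs: the row at index 10473 should be missing its 'Category' value, which shifts every later column one position to the left.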

In [5]:
print(ggl_apps_data[10472])
print('\n')
print(ggl_apps_data[10473])
print('\n')
print(ggl_apps_data[10474])


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Now let's **delete row index 10473**

In [6]:
del ggl_apps_data[10473]

And let's check if it worked

In [7]:
print(ggl_apps_data[10472])
print('\n')
print(ggl_apps_data[10473])
print('\n')
print(ggl_apps_data[10474])

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


### Removing Duplicate Apps

- From the discussion on Kaggle, we found out that the Play Store dataset has duplicate rows, as the Instagram entries below show.
- Let's count the number of duplicates with a for loop.

We will remove the duplicates by keeping, for each app, the row with the highest number of reviews, since a higher review count suggests that the row was collected more recently.

In [8]:
for row in ggl_apps_data:
    app_name = row[0]
    if app_name == 'Instagram':
        print(row)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




In [9]:
ggl_unique_apps = []
ggl_duplicate_apps = []
for row in ggl_apps_data:
    app_name = row[0]
    
    if app_name in ggl_unique_apps:
        ggl_duplicate_apps.append(app_name)
    else:
        ggl_unique_apps.append(app_name)
        
print('Number of unique apps: ', len(ggl_unique_apps))
print('Number of duplicate apps: ', len(ggl_duplicate_apps))

Number of unique apps:  9660
Number of duplicate apps:  1181
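
Checking membership with `in` on a list scans the whole list each time, so the loop above slows down as the list of names grows. As a minimal sketch of an alternative, `collections.Counter` tallies the names in one pass using hashing. The `sample_rows` list is made-up illustration data, not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows; only the app name (index 0) matters here.
sample_rows = [
    ['Instagram', 'SOCIAL'],
    ['Instagram', 'SOCIAL'],
    ['Box', 'BUSINESS'],
    ['Instagram', 'SOCIAL'],
]

name_counts = Counter(row[0] for row in sample_rows)
n_unique = len(name_counts)
# Every occurrence beyond the first counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())

print('Number of unique apps:', n_unique)
print('Number of duplicate apps:', n_duplicates)
```

With the sample rows this reports 2 unique apps and 2 duplicates.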


In [10]:
print('Expected length: ', len(ggl_apps_data[1:]) - 1181)

Expected length:  9659


Create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [11]:
ggl_reviews_max = {}

for row in ggl_apps_data[1:]:
    app_name = row[0]
    n_reviews = float(row[3])
    
    if (app_name in ggl_reviews_max) and (ggl_reviews_max[app_name] < n_reviews):
        ggl_reviews_max[app_name] = n_reviews
    elif app_name not in ggl_reviews_max:
        ggl_reviews_max[app_name] = n_reviews
        

Inspect the dictionary to make sure everything went as expected. Measure the length of the dictionary; remember that the expected length is 9,659 entries.

In [12]:
print(len(ggl_reviews_max))

9659


Use the dictionary created above to remove the duplicate rows.

- We start by initializing two empty lists, android_clean and already_added.
- We loop through the Google Play dataset (skipping the header row), and for every iteration:
    - We isolate the name of the app and the number of reviews.
    - We add the current row to the android_clean list, and the app name (app_name) to the already_added list, if:
        - The number of reviews of the current app matches the number of reviews of that app as recorded in the ggl_reviews_max dictionary; and
        - The name of the app is not already in the already_added list. We need this supplementary condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we only checked ggl_reviews_max[app_name] == n_reviews, we would still end up with duplicate entries for some apps.

In [13]:
android_clean = []
already_added = []

for row in ggl_apps_data[1:]:
    app_name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == ggl_reviews_max[app_name]) and (app_name not in already_added):
        android_clean.append(row)
        already_added.append(app_name)

        

In [14]:
print(len(android_clean))

9659
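
The two passes above (build the maximum, then filter) can also be folded into a single pass by storing the best row itself for each name. This is only a sketch under assumptions: `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened made-up data, and the output order follows each app's first appearance, which may differ from the order produced by the cells above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # Dictionaries preserve insertion order (Python 3.7+).
    return list(best.values())


sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]
deduped = keep_most_reviewed(sample_rows)
for row in deduped:
    print(row)
```

Because ties keep the first row seen, no separate already_added list is needed in this variant.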


Check whether there are duplicate apps in the App Store dataset.

In [15]:
apl_unique_apps = []
apl_duplicate_apps = []
for row in apl_apps_data:
    app_id = row[0]
    
    if app_id in apl_unique_apps:
        apl_duplicate_apps.append(app_id)
    else:
        apl_unique_apps.append(app_id)
        
print('Number of unique apps: ', len(apl_unique_apps))
print('Number of duplicate apps: ', len(apl_duplicate_apps))

Number of unique apps:  7198
Number of duplicate apps:  0


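### Removing Non-English Apps


First we'll create a function that detects whether an app's name contains any non-English character, using the ASCII range (code points 0 to 127) as a proxy for English. Then we'll create another function that tolerates a few such characters, so that names containing an emoji or a symbol such as ™ are not discarded.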

In [16]:
def char_eng(a_string):
    for char in a_string:
        if ord(char) > 127:
            return False
    return True

In [17]:
print(char_eng('Instagram'))
print(char_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(char_eng('Docs To Go™ Free Office Suite'))
print(char_eng('Instachat 😜'))


True
False
False
False


In [18]:
def char_eng_2(a_string):
    non_english = {}
    for char in a_string:
        if ord(char) > 127 and a_string not in non_english:
            non_english[a_string] = 1
        elif ord(char) > 127 and a_string in non_english:
            non_english[a_string] += 1
        else:
            non_english[a_string] = 0
    if non_english[a_string] > 3:
        return False
    else:
        return True

In [19]:
print(char_eng_2('Instagram'))
print(char_eng_2('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(char_eng_2('Docs To Go™ Free Office Suite'))
print(char_eng_2('Instachat 😜'))

True
False
True
True
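
As written, `char_eng_2` sets its count back to 0 each time it meets an ASCII character, so what it measures is the run of non-ASCII characters at the end of the name, and it raises a `KeyError` on an empty string. A minimal sketch that counts non-ASCII characters across the whole name is shown below; `is_probably_english` and its `max_non_ascii` parameter are hypothetical names, and applying it to the full datasets may give counts that differ from the ones printed in this notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
```

On these four names the sketch gives the same answers as the cell above: True, False, True, True.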


---

Now let's filter both datasets and keep only the English apps. For Android, we use the deduplicated dataset.

In [20]:
android_clean_en = []
ios_clean_en = []


for row in android_clean[1:]:
    if char_eng_2(row[0]):
        android_clean_en.append(row)

for row in apl_apps_data[1:]:
    if char_eng_2(row[1]):
        ios_clean_en.append(row)

print(len(android_clean_en))
print(len(ios_clean_en))

9638
6408


---
### Selecting only free apps

Now let's select only the free apps. On Google Play these have 'Free' in the Type column; in the App Store their price is '0.0'.

In [21]:
android_clean_en_free = []
ios_clean_en_free = []


for row in android_clean_en[1:]:
    if row[6] == 'Free' :
        android_clean_en_free.append(row)

for row in ios_clean_en[1:]:
    if row[4] == '0.0':
        ios_clean_en_free.append(row)

print(len(android_clean_en_free))
print(len(ios_clean_en_free))

8884
3393
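
The two stores mark free apps differently, so the checks above compare against two different strings. A hedged alternative is to compare prices numerically. The sketch below assumes that paid Google Play prices may carry a leading `$` (for example `'$4.99'`); `is_free` is a hypothetical helper, not part of the notebook.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    # Assumption: an optional leading '$' is the only non-numeric decoration.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, is_free(price))
```

A numeric comparison treats `'0'` and `'0.0'` the same way, so one helper can serve both datasets.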


Clean datasets:

- **android_clean_en_free**
- **ios_clean_en_free**

## 04. Data Analysis

**Motive**

Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

---

Let's now create a function that generates frequency tables (as percentages), and then a display_table() function that prints them in descending order.

In [22]:
def freq_table(dataset, index):
    # Count how many times each value appears in the chosen column
    temp_dict = {}
    for row in dataset:
        value = row[index]
        if value not in temp_dict:
            temp_dict[value] = 1
        else:
            temp_dict[value] += 1

    # Convert the counts to percentages of the total number of rows
    for key in temp_dict:
        temp_dict[key] = temp_dict[key] / len(dataset) * 100

    return temp_dict

In [23]:
freq_table(android_clean_en_free, 1)

{'ART_AND_DESIGN': 0.6303466906798739,
 'AUTO_AND_VEHICLES': 0.9230076542098155,
 'BEAUTY': 0.5965781179648807,
 'BOOKS_AND_REFERENCE': 2.1837010355695634,
 'BUSINESS': 4.592525889239082,
 'COMICS': 0.6303466906798739,
 'COMMUNICATION': 3.2417829806393517,
 'DATING': 1.8572714993246284,
 'EDUCATION': 1.1593876632147682,
 'ENTERTAINMENT': 0.9567762269248086,
 'EVENTS': 0.7091400270148582,
 'FINANCE': 3.692030616839262,
 'FOOD_AND_DRINK': 1.2381809995497524,
 'HEALTH_AND_FITNESS': 3.0729401170643853,
 'HOUSE_AND_HOME': 0.8217019360648358,
 'LIBRARIES_AND_DEMO': 0.9342638451148131,
 'LIFESTYLE': 3.928410625844214,
 'GAME': 9.702836560108059,
 'FAMILY': 18.932913102206214,
 'MEDICAL': 3.5231877532642955,
 'SOCIAL': 2.656461053579469,
 'SHOPPING': 2.239981990094552,
 'PHOTOGRAPHY': 2.9378658262044124,
 'SPORTS': 3.3881134624043225,
 'TRAVEL_AND_LOCAL': 2.330031517334534,
 'TOOLS': 8.442143178748312,
 'PERSONALIZATION': 3.3205763169743356,
 'PRODUCTIVITY': 3.8833858622242237,
 'PARENTING': 0

In [24]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (round(table[key],1), key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
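
For comparison, the counting and sorting steps can be sketched with `collections.Counter`, whose `most_common()` method returns the values from most to least frequent. `freq_table_sorted` and the sample rows are hypothetical. Note one difference from display_table(): values with equal counts stay in first-seen order here, whereas sorting the (percentage, name) tuples in reverse breaks ties by name.

```python
from collections import Counter


def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100) for value, count in counts.most_common()]


sample_rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
]
table = freq_table_sorted(sample_rows, 1)
for value, pct in table:
    print(value, ':', round(pct, 1))
```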

**Android English Apps by Category and Genre**

In [25]:
display_table(android_clean_en_free[1:], 1)

FAMILY : 18.9
GAME : 9.7
TOOLS : 8.4
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.9
FINANCE : 3.7
MEDICAL : 3.5
SPORTS : 3.4
PERSONALIZATION : 3.3
COMMUNICATION : 3.2
HEALTH_AND_FITNESS : 3.1
PHOTOGRAPHY : 2.9
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.7
TRAVEL_AND_LOCAL : 2.3
SHOPPING : 2.2
BOOKS_AND_REFERENCE : 2.2
DATING : 1.9
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.2
EDUCATION : 1.2
ENTERTAINMENT : 1.0
LIBRARIES_AND_DEMO : 0.9
AUTO_AND_VEHICLES : 0.9
WEATHER : 0.8
HOUSE_AND_HOME : 0.8
PARENTING : 0.7
EVENTS : 0.7
COMICS : 0.6
BEAUTY : 0.6
ART_AND_DESIGN : 0.6


In [26]:
display_table(android_clean_en_free[1:], 9)

Tools : 8.4
Entertainment : 6.1
Education : 5.4
Business : 4.6
Productivity : 3.9
Lifestyle : 3.9
Finance : 3.7
Sports : 3.5
Medical : 3.5
Personalization : 3.3
Communication : 3.2
Health & Fitness : 3.1
Action : 3.1
Photography : 2.9
News & Magazines : 2.8
Social : 2.7
Travel & Local : 2.3
Shopping : 2.2
Books & Reference : 2.2
Simulation : 2.0
Dating : 1.9
Video Players & Editors : 1.8
Casual : 1.8
Arcade : 1.8
Maps & Navigation : 1.4
Food & Drink : 1.2
Puzzle : 1.1
Racing : 1.0
Strategy : 0.9
Role Playing : 0.9
Libraries & Demo : 0.9
Auto & Vehicles : 0.9
Weather : 0.8
House & Home : 0.8
Events : 0.7
Adventure : 0.7
Comics : 0.6
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.5
Trivia : 0.4
Educational;Education : 0.4
Educational : 0.4
Casino : 0.4
Board : 0.4
Word : 0.3
Education;Education : 0.3
Racing;Action & Adventure : 0.2
Puzzle;Brain Games : 0.2
Music : 0.2
Entertainment;Music & Video : 0.2
Casual;Pretend Play : 0.2
Simulation;Action & Adventure : 0.1
Parenting;Music

**iOS English Apps by Genre**

In [27]:
display_table(ios_clean_en_free[1:], 11)

Games : 58.2
Entertainment : 7.9
Photo & Video : 4.8
Education : 3.5
Social Networking : 3.3
Shopping : 2.7
Utilities : 2.5
Sports : 2.1
Music : 1.9
Health & Fitness : 1.9
Productivity : 1.7
Lifestyle : 1.7
News : 1.3
Travel : 1.2
Finance : 1.2
Weather : 0.9
Food & Drink : 0.9
Reference : 0.6
Business : 0.5
Book : 0.5
Navigation : 0.3
Medical : 0.2
Catalogs : 0.1


### Main genres
---

**Android Category %**

- FAMILY : 18.9
- GAME : 9.7
- TOOLS : 8.4
- BUSINESS : 4.6
- PRODUCTIVITY : 3.9
- LIFESTYLE : 3.9
- FINANCE : 3.7
- MEDICAL : 3.5
- SPORTS : 3.4

**Android Genres**

- Tools : 8.4
- Entertainment : 6.1
- Education : 5.4
- Business : 4.6
- Productivity : 3.9
- Lifestyle : 3.9
- Finance : 3.7
- Sports : 3.5
- Medical : 3.5
- Personalization : 3.3
- Communication : 3.2
- Health & Fitness : 3.1
- Action : 3.1

---

**iOS genres %**

- Games : 58.2
- Entertainment : 7.9
- Photo & Video : 4.8
- Education : 3.5
- Social Networking : 3.3
- Shopping : 2.7
- Utilities : 2.5
- Sports : 2.1

In [28]:
category_test = []

for row in android_clean_en_free:
    if row[1] == 'FAMILY':
        category_test.append(row)
print(category_test)

[['Jewels Crush- Match 3 Puzzle', 'FAMILY', '4.4', '14774', '19M', '1,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 23, 2018', '1.9.3901', '4.0.3 and up'], ['Coloring & Learn', 'FAMILY', '4.4', '12753', '51M', '5,000,000+', 'Free', '0', 'Everyone', 'Educational;Creativity', 'July 17, 2018', '1.49', '4.0.3 and up'], ['Mahjong', 'FAMILY', '4.5', '33983', '22M', '5,000,000+', 'Free', '0', 'Everyone', 'Puzzle;Brain Games', 'August 2, 2018', '1.24.3181', '4.0.3 and up'], ['Super ABC! Learning games for kids! Preschool apps', 'FAMILY', '4.6', '20267', '46M', '1,000,000+', 'Free', '0', 'Everyone', 'Educational;Education', 'July 16, 2018', '1.1.6.7', '4.1 and up'], ['Toy Pop Cubes', 'FAMILY', '4.5', '5761', '21M', '1,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 4, 2018', '1.8.3181', '4.0.3 and up'], ['Educational Games 4 Kids', 'FAMILY', '4.3', '11618', '39M', '5,000,000+', 'Free', '0', 'Everyone', 'Educational;Education', 'April 3, 2018', '2.4', '4.1 and u

The data suggests that games dominate the App Store: 58.2% of the free English apps there are games. On Google Play, the "Family" and "Game" categories together account for 28.6%. Google Play is more balanced overall, since Tools, Business and Productivity apps make up roughly 17% of its free English apps.

---
Now we'll calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but it is missing from the App Store dataset. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

In [29]:
android_unique_genres = []

for row in android_clean_en_free:
    a_genre = row[1]
    if a_genre not in android_unique_genres:
        android_unique_genres.append(a_genre)
    
print(android_unique_genres)

['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION']


In [30]:
ios_unique_genres = []

for row in ios_clean_en_free:
    a_genre = row[11]
    if a_genre not in ios_unique_genres:
        ios_unique_genres.append(a_genre)
    
print(ios_unique_genres)

['Photo & Video', 'Games', 'Music', 'Social Networking', 'Reference', 'Health & Fitness', 'Weather', 'Utilities', 'Travel', 'Shopping', 'News', 'Navigation', 'Lifestyle', 'Entertainment', 'Food & Drink', 'Sports', 'Book', 'Finance', 'Education', 'Productivity', 'Business', 'Catalogs', 'Medical']


In [39]:
for genre in ios_unique_genres:
    total = 0
    len_genre = 0
    for row in ios_clean_en_free:
        genre_app = row[11]
        if genre_app == genre:
            user_ratings = float(row[5])
            total += user_ratings
            len_genre += 1
    print(genre, round(total / len_genre,1))

Photo & Video 27747.8
Games 21634.9
Music 57326.5
Social Networking 41531.5
Reference 70997.8
Health & Fitness 22945.0
Weather 50477.6
Utilities 17819.4
Travel 27558.0
Shopping 25131.6
News 20303.7
Navigation 57482.1
Lifestyle 14750.5
Entertainment 13300.0
Food & Drink 27085.2
Sports 22680.2
Book 30926.7
Finance 26973.0
Education 6945.1
Productivity 20318.3
Business 7074.9
Catalogs 3203.2
Medical 525.4
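
The nested loop above re-reads the whole dataset once per genre. As a sketch of a single-pass alternative, running totals and counts can be kept per genre in two dictionaries. `average_by_group` is a hypothetical helper and the sample rows are made-up data with a simplified column layout.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows: name, rating count, genre.
sample_rows = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Weather'],
]
averages = average_by_group(sample_rows, group_index=2, value_index=1)
# Print the genres from highest to lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, round(avg, 1))
```

Sorting the result by value also makes the top genres easier to read off than the unsorted printout.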


---

**iOS genres with the most ratings per app, on average:**
- Reference 70997.8
- Navigation 57482.1
- Music 57326.5
- Weather 50477.6
- Social Networking 41531.5

In [52]:
display_table(android_clean_en_free, 5)

1,000,000+ : 15.7
100,000+ : 11.6
10,000,000+ : 10.5
10,000+ : 10.2
1,000+ : 8.4
100+ : 6.9
5,000,000+ : 6.8
500,000+ : 5.5
50,000+ : 4.8
5,000+ : 4.5
10+ : 3.5
500+ : 3.2
50,000,000+ : 2.3
100,000,000+ : 2.1
50+ : 1.9
5+ : 0.8
1+ : 0.5
500,000,000+ : 0.3
1,000,000,000+ : 0.2
0+ : 0.0


**Problem: the number of installs is not a number, but a category**

To perform computations, we'll need to convert each install number from a string to a float. This means we need to remove the commas and the plus characters, or the conversion will fail and cause an error.
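
A minimal sketch of that conversion follows; `parse_installs` is a hypothetical helper name. Keep in mind that a value such as '10,000+' is a lower bound on the real install count, so averages built from these numbers are approximate.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float lower bound."""
    return float(text.replace(',', '').replace('+', ''))


for bucket in ['0+', '10,000+', '1,000,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```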

In [55]:
android_unique_cat = []

for row in android_clean_en_free:
    a_cat = row[1]
    if a_cat not in android_unique_cat:
        android_unique_cat.append(a_cat)
print(android_unique_cat)

['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION']


In [57]:
for category in android_unique_cat:
    total = 0
    len_category = 0
    for row in android_clean_en_free:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            total += float(installs)
            len_category += 1
    print(category, round(total / len_category,1))

ART_AND_DESIGN 1932358.9
AUTO_AND_VEHICLES 647317.8
BEAUTY 513151.9
BOOKS_AND_REFERENCE 8587351.9
BUSINESS 1708215.9
COMICS 803234.8
COMMUNICATION 38322625.7
DATING 854028.8
EDUCATION 1833495.1
ENTERTAINMENT 11640705.9
EVENTS 253542.2
FINANCE 1387692.5
FOOD_AND_DRINK 1924897.7
HEALTH_AND_FITNESS 4188822.0
HOUSE_AND_HOME 1331540.6
LIBRARIES_AND_DEMO 638503.7
LIFESTYLE 1440098.7
GAME 15588015.6
FAMILY 3686097.9
MEDICAL 120550.6
SOCIAL 23253652.1
SHOPPING 7036877.3
PHOTOGRAPHY 17840110.4
SPORTS 3638640.1
TRAVEL_AND_LOCAL 13984077.7
TOOLS 10801391.3
PERSONALIZATION 5183850.8
PRODUCTIVITY 16787331.3
PARENTING 542603.6
WEATHER 5074486.2
VIDEO_PLAYERS 24573948.2
NEWS_AND_MAGAZINES 9472829.0
MAPS_AND_NAVIGATION 4025286.2


---

**Android categories with the most installs per app, on average:**
- COMMUNICATION 38,322,625.7
- VIDEO_PLAYERS 24,573,948.2
- SOCIAL 23,253,652.1
- PHOTOGRAPHY 17,840,110.4
- PRODUCTIVITY 16,787,331.3
- GAME 15,588,015.6
- TRAVEL_AND_LOCAL 13,984,077.7
- ENTERTAINMENT 11,640,705.9





