# Analysis goal 
- The goal for this project is to analyze Google Play's data to help understand what kinds of apps are likely to attract more users.
- I'll focus on free apps for this analysis.

## About the Data
As of September 2019, there were [approximately 2.8 million Android apps](https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/) on Google Play.

Collecting data for this many apps is not an easy task, so I decided to look for an existing data set. After some searching I found two promising candidates:
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) with 10k apps, collected in February 2019
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) with 267k apps, collected in April 2019


After some thought I decided to use the second one, because it is larger and was collected more recently.


## Opening the Data

In [1]:
from csv import reader

### The Google Play data set ###
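# The with block closes the file once it has been read;
# newline='' is what the csv module recommends for files passed to reader
with open('..\\data\\raw\\Google-Playstore-Full.csv', newline='') as opened_file:
    google_play = list(reader(opened_file))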
google_play_header = google_play[0]
google_play_data = google_play[1:]

## Exploring the Data
To make it easier to explore the data set, I created four helper functions that I will reuse throughout the project.

In [2]:
def print_header(header):
    print(header)
    print('\n')

In [3]:
def print_data(data, start = 0, end = 5):
    data_slice = data[start:end]
    for row in data_slice:
        print(row)
        print('\n')

In [5]:
def print_data_info(data):
    print('Number of rows:', len(data))
    print('Number of columns:', len(data[0]))

In [6]:
def print_data_overview():
    print_header(google_play_header)
    print_data(google_play_data)
    print_data_info(google_play_data)   

In [7]:
print_data_overview()

['App Name', 'Category', 'Rating', 'Reviews', 'Installs', 'Size', 'Price', 'Content Rating', 'Last Updated', 'Minimum Version', 'Latest Version', '', '', '', '']


['DoorDash - Food Delivery', 'FOOD_AND_DRINK', '4.548561573', '305034', '5,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device', '', '', '', '']


['TripAdvisor Hotels Flights Restaurants Attractions', 'TRAVEL_AND_LOCAL', '4.400671482', '1207922', '100,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device', '', '', '', '']


['Peapod', 'SHOPPING', '3.656329393', '1967', '100,000+', '1.4M', '0', 'Everyone', 'September 20, 2018', '5.0 and up', '2.2.0', '', '', '', '']


['foodpanda - Local Food Delivery', 'FOOD_AND_DRINK', '4.107232571', '389154', '10,000,000+', '16M', '0', 'Everyone', 'March 22, 2019', '4.2 and up', '4.18.2', '', '', '', '']


['My CookBook Pro (Ad Free)', 'FOOD_AND_DRINK', '4.647752285', '2291', 

## Data Wrangling

A quick glance already tells us which columns are likely to matter for this analysis: 'Category', 'Rating', 'Reviews', 'Installs', 'Price' and 'Content Rating'.

Looking a little closer, we can see that the last 4 values of the header and of each displayed row are empty strings. Since these columns have no header name, any information in them is meaningless, so we can start by removing them.
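Before deleting anything, it may be worth checking that these trailing cells really are empty in every row, not only in the first five. The following is a minimal sketch of such a check; the function name `count_rows_with_trailing_data` and the toy rows are hypothetical and not part of the original notebook.

```python
def count_rows_with_trailing_data(rows, n_trailing=4):
    """Count rows whose last n_trailing cells are not all empty strings."""
    return sum(
        1 for row in rows
        if any(cell != '' for cell in row[-n_trailing:])
    )


# Toy rows (hypothetical): six cells each, the last four expected to be empty.
toy_rows = [
    ['Peapod', 'SHOPPING', '', '', '', ''],
    ['Some App', 'SOCIAL', 'stray value', '', '', ''],
]

print(count_rows_with_trailing_data(toy_rows))  # prints 1
```

On the real data the call would be `count_rows_with_trailing_data(google_play_data)`. A non-zero result would mean that some rows carry values in the unnamed columns and deserve a closer look.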

### Drop the last 4 empty values

In [8]:
number_of_items_to_delete = 4

del google_play_header[-number_of_items_to_delete:]

for row in google_play_data:
    del row[-number_of_items_to_delete:]

In [9]:
print_data_overview()

['App Name', 'Category', 'Rating', 'Reviews', 'Installs', 'Size', 'Price', 'Content Rating', 'Last Updated', 'Minimum Version', 'Latest Version']


['DoorDash - Food Delivery', 'FOOD_AND_DRINK', '4.548561573', '305034', '5,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device']


['TripAdvisor Hotels Flights Restaurants Attractions', 'TRAVEL_AND_LOCAL', '4.400671482', '1207922', '100,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device']


['Peapod', 'SHOPPING', '3.656329393', '1967', '100,000+', '1.4M', '0', 'Everyone', 'September 20, 2018', '5.0 and up', '2.2.0']


['foodpanda - Local Food Delivery', 'FOOD_AND_DRINK', '4.107232571', '389154', '10,000,000+', '16M', '0', 'Everyone', 'March 22, 2019', '4.2 and up', '4.18.2']


['My CookBook Pro (Ad Free)', 'FOOD_AND_DRINK', '4.647752285', '2291', '10,000+', 'Varies with device', '$5.99', 'Everyone', 'April 1, 2019', 'Varies w

### Find duplicates

This is where I hit a wall using pure Python. To check for duplicates, I'd have to do something like:
```python
duplicate_apps = []
unique_apps = []

for app in google_play_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
```
This would work fine for a small data set. But `name in unique_apps` scans the whole list on every iteration, so on a data set of this size it takes too long. For now I'll switch to pandas.
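For reference, the slowdown in the loop above comes from testing membership in a list. A rough sketch of the same idea with a `set`, where membership tests take roughly constant time, is shown here. The function name `find_duplicates` and the sample rows are made up for illustration. pandas is still the more convenient tool for the steps that follow.

```python
def find_duplicates(rows, key_index=0):
    """Return the names that were already seen earlier in rows."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[key_index]
        if name in seen:       # set lookup instead of a list scan
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Toy rows (hypothetical), with one repeated app name.
sample_rows = [
    ['Peapod', 'SHOPPING'],
    ['DoorDash - Food Delivery', 'FOOD_AND_DRINK'],
    ['Peapod', 'SHOPPING'],
]

print(find_duplicates(sample_rows))  # prints ['Peapod']
```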

In [10]:
import pandas as pd

In [11]:
df = pd.read_csv('..\\data\\raw\\Google-Playstore-Full.csv', low_memory=False)

In [12]:
df.head()

| | App Name | Category | Rating | Reviews | Installs | Size | Price | Content Rating | Last Updated | Minimum Version | Latest Version | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DoorDash - Food Delivery | FOOD_AND_DRINK | 4.548561573 | 305034 | 5,000,000+ | Varies with device | 0 | Everyone | March 29, 2019 | Varies with device | Varies with device | | | | |
| 1 | TripAdvisor Hotels Flights Restaurants Attract... | TRAVEL_AND_LOCAL | 4.400671482 | 1207922 | 100,000,000+ | Varies with device | 0 | Everyone | March 29, 2019 | Varies with device | Varies with device | | | | |
| 2 | Peapod | SHOPPING | 3.656329393 | 1967 | 100,000+ | 1.4M | 0 | Everyone | September 20, 2018 | 5.0 and up | 2.2.0 | | | | |
| 3 | foodpanda - Local Food Delivery | FOOD_AND_DRINK | 4.107232571 | 389154 | 10,000,000+ | 16M | 0 | Everyone | March 22, 2019 | 4.2 and up | 4.18.2 | | | | |
| 4 | My CookBook Pro (Ad Free) | FOOD_AND_DRINK | 4.647752285 | 2291 | 10,000+ | Varies with device | \$5.99 | Everyone | April 1, 2019 | Varies with device | Varies with device | | | | |


In [13]:
df.drop(columns=['Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], inplace=True)

### Filtering only free apps

For this we'll focus on the 'Price' column. Looking closer, we can see a price like '$5.99', which indicates that the column is stored as strings. Knowing this, we can create a function to convert a price string into a number.


In [14]:
def price_to_number(price):
    price = price.replace('$', '')
    return float(price)

Let's apply this function to the price column

In [15]:
df['Price'].apply(price_to_number)

ValueError: could not convert string to float: '2.4M'

Let's check what's going on

In [16]:
df[df['Price'].str.contains(pat = 'M')]

| | App Name | Category | Rating | Reviews | Installs | Size | Price | Content Rating | Last Updated | Minimum Version | Latest Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 13504 | Never have I ever 18+ | ) | GAME_STRATEGY | 4.0 | 6 | 100+ | 2.4M | \$0.99 | Mature 17+ | December 30, 2018 | 4.0.3 and up |
| 32229 | Old-time Radio presents | | ENTERTAINMENT | 4.0 | 20 | 10,000+ | 3.1M | 0 | Everyone | October 16, 2018 | 4.1 and up |
| 48438 | Mojo Times: Bihar Hindi Video News | Breaking News | NEWS_AND_MAGAZINES | 4.775640965 | 156 | 10,000+ | 6.9M | 0 | Teen | March 30, 2019 | 4.1 and up |
| 113151 | Steins | Gate ALARM | ENTERTAINMENT | 4.716867447 | 166 | 500+ | 67M | \$0.99 | Teen | November 12, 2018 | 4.4 and up |
| 125479 | 2-6 Ya? E?itici �ocuk Zeka Oyunlar? | Alfabe �?ren | EDUCATION | 5.0 | 1 | 10+ | 57M | \$2.49 | Everyone | October 31, 2017 | 2.3 and up |
| 125480 | 2-6 Ya? E?itici �ocuk Zeka Oyunlar? | T�rk Alfabesi | EDUCATION | 4.333333492 | 54 | 50,000+ | 43M | 0 | Everyone | October 31, 2017 | 2.3 and up |
| 165230 | Shytter -Twitter client | not notified you follow - | SOCIAL | 4.098591328 | 71 | 5,000+ | 7.7M | 0 | Everyone | March 30, 2019 | 4.1 and up |
| 168914 | CorreosTrack 2.0 (Correos de M�xico | Mexpost) | PRODUCTIVITY | 4.389830589 | 59 | 10,000+ | 16M | 0 | Everyone | December 21, 2018 | 4.1 and up |
| 180371 | eShagird - Online academy | ETEA & MDCAT | EDUCATION | 4.504273415 | 117 | 10,000+ | 6.9M | 0 | Everyone | March 2, 2019 | 4.0 and up |
| 190759 | Friend in Iceland | Tour Guide | TRAVEL_AND_LOCAL | 5.0 | 6 | 1,000+ | 27M | 0 | Everyone | October 16, 2017 | 4.0 and up |


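The values in these rows are shifted one column to the right: part of the app name sits in 'Category', the category sits in 'Rating', and so on, so the app size ends up in 'Price'. This looks like the result of an unquoted comma in the app name. Since there are only a few such rows, we'll simply delete them.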
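To illustrate how such a shift can arise, here is a small sketch using the standard `csv` module on an in-memory CSV. The three-column layout and the rows are toy data invented for the example. In the real file the spare unnamed columns would absorb the extra field, so the row length would not give the problem away as it does here.

```python
from csv import reader
from io import StringIO

# Toy CSV (hypothetical): the second data row has an unquoted comma in the
# app name, the third one has the same name properly quoted.
raw = (
    'App Name,Category,Price\n'
    'Peapod,SHOPPING,0\n'
    'Steins,Gate ALARM,ENTERTAINMENT,$0.99\n'
    '"Steins,Gate ALARM",ENTERTAINMENT,$0.99\n'
)

rows = list(reader(StringIO(raw)))
header, data = rows[0], rows[1:]

for row in data:
    # A row with more fields than the header has been split one time too many.
    status = 'ok' if len(row) == len(header) else 'shifted'
    print(status, row)
```

The unquoted row is parsed into four fields, which pushes every value after the name one position to the right. The quoted row keeps the comma inside the name.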

In [17]:
# Get the index labels of the shifted rows
indexes = df[df['Price'].str.contains(pat='M')].index

# Delete these rows from the DataFrame
df.drop(indexes, inplace=True)

Let's try to apply our price_to_number function again

In [18]:
df['Price'] = df['Price'].apply(price_to_number)

ValueError: could not convert string to float: 'Varies with device'

This second failure suggests a simpler approach: since we only need free apps, we can select the rows whose price is exactly the string '0'. This also leaves out any remaining rows with a non-numeric price, such as 'Varies with device'.

In [20]:
df_free = df[df['Price'] == '0']
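If numeric prices are needed later on, for example to look at paid apps, one option is a more tolerant variant of `price_to_number` that returns `None` for text that is not a price. This is only a sketch; the name `parse_price` is an assumption and does not appear in the original notebook.

```python
def parse_price(price):
    """Return the price as a float, or None if the text is not a price."""
    try:
        return float(price.replace('$', ''))
    except ValueError:
        # Covers values such as 'Varies with device' or a shifted size like '2.4M'
        return None


for text in ['0', '$5.99', 'Varies with device', '2.4M']:
    print(repr(text), '->', parse_price(text))
```

With this variant, `df['Price'].apply(parse_price)` should run without raising, and the unparseable entries can then be inspected or filtered separately.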