# Milestone 2
# Kumaran Singaram

## Best Practices for Assignments & Milestones

- <b>Break the assignment into sections - one section per numbered requirement.</b> Each assignment has numbered requirements/instructions e.g. "1. Read the CIFAR-10 dataset". Each requirement should have at least one markdown cell and at least one code cell. Feel free to combine sections or make other sensible changes if that makes sense for your code and is still clear. The intent is to give you a useful structure and to make sure you get full credit for your work.

- <b>Break the milestone into sections - one section for each item in the rubric.</b> Each milestone has rubric items e.g. "5. Handle class imbalance problem". Each rubric item should have at least one markdown cell and at least one code cell. Feel free to combine sections or make other sensible changes if that makes sense for your code and is still clear. The intent is to give you a useful structure and to make sure you get full credit for your work.

- <b>Include comments, with block comments preferred over in-line comments.</b> A good habit is to start each code cell with comments.

The above put into a useful pattern:

<b>Markdown cell:</b> Requirement #1: Read the CIFAR-10 dataset<br>
<b>Code cell:</b> Comments followed by code<br>
<b>Markdown cell:</b> Requirement #2: Explore the data<br>
<b>Code cell:</b> Comments followed by code<br>
<b>Markdown cell:</b> Requirement #3: Preprocess the data and prepare for classification<br>
<b>Code cell:</b> Comments followed by code<br>

For more information:
- A good notebook example: [DataFrame Basics](https://github.com/Tanu-N-Prabhu/Python/blob/master/Pandas/Pandas_DataFrame.ipynb) 
- More example notebooks: [A gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#pandas-for-data-analysis)
- [PEP 8 on commenting](https://www.python.org/dev/peps/pep-0008/)
- [PEP 257 - docstrings](https://www.python.org/dev/peps/pep-0257/)

Occasionally an assignment or milestone will ask you to do something other than write Python code, e.g. turn in a .docx file. In that case, please still structure your work logically, though the specific notes above may not apply.

**1. Read in data and replace missing numerical data**

In [192]:
#import necessary packages and read in dataset
import pandas as pd
import numpy as np

app = pd.read_csv("data/googleplaystore.csv")

app.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Facebook,SOCIAL,4.1,78158306,Varies with device,"1,000,000,000+",Free,0.0,Teen,Social,3-Aug-18,Varies with device,Varies with device
1,Facebook,SOCIAL,4.1,78128208,Varies with device,"1,000,000,000+",Free,0.0,Teen,Social,3-Aug-18,Varies with device,Varies with device
2,WhatsApp Messenger,COMMUNICATION,4.4,69119316,Varies with device,"1,000,000,000+",Free,0.0,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
3,WhatsApp Messenger,COMMUNICATION,4.4,69119316,Varies with device,"1,000,000,000+",Free,0.0,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
4,WhatsApp Messenger,COMMUNICATION,4.4,69109672,Varies with device,"1,000,000,000+",Free,0.0,Everyone,Communication,3-Aug-18,Varies with device,Varies with device


In [193]:
#impute missing values of the Rating column with the mean

def fill_mean(x):
    #convert to numeric; values that cannot be parsed become NaN
    x = pd.to_numeric(x, errors = 'coerce')
    #replace every NaN with the mean of the remaining values
    return x.fillna(x.mean())

app['Rating'] = fill_mean(app['Rating'])
    
app.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Facebook,SOCIAL,4.1,78158306,Varies with device,"1,000,000,000+",Free,0.0,Teen,Social,3-Aug-18,Varies with device,Varies with device
1,Facebook,SOCIAL,4.1,78128208,Varies with device,"1,000,000,000+",Free,0.0,Teen,Social,3-Aug-18,Varies with device,Varies with device
2,WhatsApp Messenger,COMMUNICATION,4.4,69119316,Varies with device,"1,000,000,000+",Free,0.0,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
3,WhatsApp Messenger,COMMUNICATION,4.4,69119316,Varies with device,"1,000,000,000+",Free,0.0,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
4,WhatsApp Messenger,COMMUNICATION,4.4,69109672,Varies with device,"1,000,000,000+",Free,0.0,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
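
The first five rows have no missing ratings, so the output above does not show the imputation at work. A minimal sketch on a toy column follows; the `ratings` values are hypothetical and stand in for `app['Rating']`.

```python
import pandas as pd

# Hypothetical toy column standing in for app['Rating']; not the real data.
ratings = pd.Series(["4.0", "5.0", None, "bad", "3.0"])

# Coerce non-numeric entries to NaN, then fill every NaN with the column mean.
numeric = pd.to_numeric(ratings, errors="coerce")
filled = numeric.fillna(numeric.mean())

print(numeric.isna().sum())  # how many values were imputed
print(filled.tolist())       # imputed column
```

Counting the NaN values before filling is a quick way to confirm how many entries the imputation actually touched.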


**2. Account for outliers in the numeric columns**

In [199]:
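#define replace outlier function
#values more than m mean absolute deviations from the mean are
#replaced with the mean of the remaining values
def replace_outlier(x, m = 3):
    #work on a copy so the input column is not modified in place
    x = x.copy()
    d = np.abs(x - np.mean(x))
    mdev = np.mean(d)
    #if every value is identical there are no outliers to replace
    if mdev == 0:
        return x
    s = d/mdev
    x[s>m] = np.mean(x[s<=m])
    return x

In [200]:
#call outlier function for Reviews column and store the result
app['Reviews'] = replace_outlier(app['Reviews'])
app['Reviews']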

0        45509.216628
1        45509.216628
2        45509.216628
3        45509.216628
4        45509.216628
             ...     
10836        0.000000
10837        0.000000
10838        0.000000
10839        0.000000
10840        0.000000
Name: Reviews, Length: 10841, dtype: float64
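
To make the outlier rule easier to follow, here is a small sketch of the same idea on made-up numbers: each value's distance from the mean is divided by the mean absolute deviation, and anything scoring above the threshold of 3 is replaced by the mean of the other values. The `reviews` series is hypothetical.

```python
import pandas as pd

# Hypothetical review counts; the last value is an obvious outlier.
reviews = pd.Series([10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0, 1000.0])

# Distance of each value from the mean, scaled by the mean absolute deviation.
deviation = (reviews - reviews.mean()).abs()
score = deviation / deviation.mean()

# Flag values whose scaled distance exceeds the threshold m = 3.
is_outlier = score > 3

# Keep inliers as they are; replace outliers with the mean of the inliers.
cleaned = reviews.where(~is_outlier, reviews[~is_outlier].mean())

print(is_outlier.sum())
print(cleaned.tolist())
```

Because the mean itself is pulled toward extreme values, this rule can miss outliers in very small samples; a median-based score is a common alternative, though it is not what this notebook uses.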

**3. Normalize numeric values**

In [201]:
#subset numeric columns
#z-normalize numeric columns 

num_cols = ['Rating', 'Reviews', 'Price']

def normalize(x):
    x_norm = ((x - np.mean(x))/np.std(x))
    return x_norm

for col in num_cols:
    app[col] = normalize(app[col])
    
app.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Facebook,SOCIAL,-0.186851,0.606796,Varies with device,"1,000,000,000+",Free,-0.064413,Teen,Social,3-Aug-18,Varies with device,Varies with device
1,Facebook,SOCIAL,-0.186851,0.606796,Varies with device,"1,000,000,000+",Free,-0.064413,Teen,Social,3-Aug-18,Varies with device,Varies with device
2,WhatsApp Messenger,COMMUNICATION,0.413709,0.606796,Varies with device,"1,000,000,000+",Free,-0.064413,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
3,WhatsApp Messenger,COMMUNICATION,0.413709,0.606796,Varies with device,"1,000,000,000+",Free,-0.064413,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
4,WhatsApp Messenger,COMMUNICATION,0.413709,0.606796,Varies with device,"1,000,000,000+",Free,-0.064413,Everyone,Communication,3-Aug-18,Varies with device,Varies with device
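
A quick sanity check for z-normalization is that the result has mean 0 and standard deviation 1. The sketch below uses hypothetical `values` and also shows one detail worth knowing: `np.std` uses the population formula (`ddof=0`), while the pandas `Series.std` default is the sample formula (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for one numeric column.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Same formula as normalize(): subtract the mean, divide by np.std (ddof=0).
z = (values - np.mean(values)) / np.std(values)

print(z.mean())            # close to 0
print(np.std(z))           # close to 1
print(values.std())        # pandas default (ddof=1), slightly larger than np.std(values)
```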


**4. Bin numeric variables**

In [202]:
#bin review column into 5 groups (6 bin edges define 5 intervals)
app['review_range'] = pd.cut(app['Reviews'], bins = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5])
app[['Reviews', 'review_range']].head()

Unnamed: 0,Reviews,review_range
0,0.606796,"(0.5, 1.5]"
1,0.606796,"(0.5, 1.5]"
2,0.606796,"(0.5, 1.5]"
3,0.606796,"(0.5, 1.5]"
4,0.606796,"(0.5, 1.5]"
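
One thing to watch with fixed bin edges: `pd.cut` assigns NaN to any value outside the outermost edges. The sketch below uses hypothetical `scores` with the same edges as the cell above to show how to count the intervals and the unbinned values.

```python
import pandas as pd

# Hypothetical z-scores, including two that fall outside the bin edges.
scores = pd.Series([-1.2, -0.3, 0.6, 2.0, 4.4, 5.1])

# Six edges define five right-closed intervals, e.g. (-0.5, 0.5].
edges = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5]
binned = pd.cut(scores, bins=edges)

print(len(binned.cat.categories))  # number of intervals
print(binned.isna().sum())         # values that matched no interval
```

If unbinned values are not acceptable, widening the outer edges (for example with `-np.inf` and `np.inf`) is one option.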


**5. Consolidate categorical data**

In [186]:
#count categories
app['Type'].value_counts()

Free    10039
Paid      800
0           1
Name: Type, dtype: int64

In [187]:
app['Content Rating'].value_counts()

Everyone           8714
Teen               1208
Mature 17+          499
Everyone 10+        414
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64

In [188]:
#replace the invalid entry '0' with 'Free' so that Type has only two categories
app['Type'] = app['Type'].replace({'0': 'Free'})
app['Type'].value_counts()

Free    10040
Paid      800
Name: Type, dtype: int64

In [189]:
#consolidate categories
app['Content Rating'] = app['Content Rating'].replace({'Adults only 18+': 'Adults/Unrated',
                                                      'Unrated': 'Adults/Unrated'})
app['Content Rating'].value_counts()

Everyone          8714
Teen              1208
Mature 17+         499
Everyone 10+       414
Adults/Unrated       5
Name: Content Rating, dtype: int64
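
The mapping above names the rare categories by hand. As an alternative sketch (not the approach used in this notebook), rare categories can be found from their counts; the `ratings` data and the `min_count` cutoff below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical labels; counts are illustrative, not taken from the dataset.
ratings = pd.Series(["Everyone"] * 6 + ["Teen"] * 3 + ["Unrated", "Adults only 18+"])

# Find categories with fewer than min_count rows (min_count is an assumed cutoff).
min_count = 2
counts = ratings.value_counts()
rare = counts[counts < min_count].index

# Map every rare label to one combined bucket; leave the rest unchanged.
consolidated = ratings.where(~ratings.isin(rare), "Adults/Unrated")
print(consolidated.value_counts())
```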

**6. One hot encode categorical data with at least 3 categories**

In [190]:
#use get dummies for content rating category
#pass the prefix as a string so the column names read "Content Rating_<category>"
OneHot = pd.get_dummies(app['Content Rating'], prefix = 'Content Rating')

OneHot.head()

Unnamed: 0,Content Rating_Adults/Unrated,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen
0,0,0,0,0,1
1,0,0,0,0,1
2,0,1,0,0,0
3,0,1,0,0,0
4,0,1,0,0,0
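
The encoded columns live in a separate frame, so they still need to be attached to the main data before modeling. A minimal sketch on a hypothetical frame `df` follows; `dtype=int` is passed so the indicators are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical frame with a single categorical column.
df = pd.DataFrame({"Content Rating": ["Teen", "Everyone", "Everyone", "Mature 17+"]})

# A string prefix yields names such as "Content Rating_Teen".
one_hot = pd.get_dummies(df["Content Rating"], prefix="Content Rating", dtype=int)

# Attach the indicator columns to the original rows by index.
encoded = pd.concat([df, one_hot], axis=1)
print(encoded.columns.tolist())
```

Each row should have exactly one indicator set to 1, which is an easy check after encoding.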


**7. Remove obsolete columns**

In [191]:
#remove columns that will not be useful for further data analysis
app = app.drop(columns=['Last Updated', 'Current Ver', 'Android Ver'])

app.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,review_range
0,Facebook,SOCIAL,-0.186851,4.374357,Varies with device,"1,000,000,000+",Free,-0.064413,Teen,Social,"(3.5, 4.5]"
1,Facebook,SOCIAL,-0.186851,4.374357,Varies with device,"1,000,000,000+",Free,-0.064413,Teen,Social,"(3.5, 4.5]"
2,WhatsApp Messenger,COMMUNICATION,0.413709,4.374357,Varies with device,"1,000,000,000+",Free,-0.064413,Everyone,Communication,"(3.5, 4.5]"
3,WhatsApp Messenger,COMMUNICATION,0.413709,4.374357,Varies with device,"1,000,000,000+",Free,-0.064413,Everyone,Communication,"(3.5, 4.5]"
4,WhatsApp Messenger,COMMUNICATION,0.413709,4.374357,Varies with device,"1,000,000,000+",Free,-0.064413,Everyone,Communication,"(3.5, 4.5]"
