## Understanding Data Processing Code

You are going to complete two tasks that involve finding bugs. The notebook is divided into two parts, and each part contains one data processing bug that degrades model performance in later stages. **Can you locate the bugs in the notebook?**

### Instructions
+ You have **12 minutes** to complete each task. Complete the tasks one at a time. Each task contains only **one bug**.
+ Feel free to **use the Internet** to look up any problems you encounter.
+ Feel free to **run the notebook or add new statements** as needed.
+ Focus mostly on **data processing**. Try to understand the intention and the effect of the code. Bugs are likely to result from a **mismatch between the user's intention and the code's actual effect**.
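
One way to compare intention with effect is to run a transformation on a few hand-picked values and inspect the result. The sketch below is only an illustration of that habit: the helper name `count_changed` and the toy values are assumptions and are not part of the notebook.

```python
import pandas as pd

def count_changed(before, after):
    # Element-wise comparison of two Series that share the same index
    return int((before != after).sum())

# Hypothetical toy values in the same style as the Installs column
toy = pd.Series(["1,000+", "500+", "10,000+"])
cleaned = toy.str.replace(",", "", regex=False).str.replace("+", "", regex=False)

print(count_changed(toy, cleaned))  # how many entries the cleaning step touched
print(cleaned.tolist())
```

If the number of changed entries or the resulting values differ from what the comment above a cell promises, that cell deserves a closer look.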


## Import libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # used for plotting graphs
import seaborn as sns # statistical data visualization

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

import warnings
warnings.filterwarnings('ignore')
from pylab import rcParams
# figure size in inches

%matplotlib inline

In [2]:
## Read file
data = pd.read_csv("./input/googleplaystore.csv")
data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


# --------------- Task3 BEGINs ---------------

In [3]:
data.dropna(how ='any', inplace = True)

In [4]:
# convert to int
data['Reviews'] = data['Reviews'].astype(int)

## Installs

In [5]:
# Remove + and ,
data.Installs = data.Installs.apply(lambda x: x.replace(',',''))
data.Installs = data.Installs.apply(lambda x: x.replace('+',''))
data.Installs = data.Installs.apply(lambda x: int(x))

In [6]:
# encode by order of size
Sorted_value = sorted(list(data['Installs'].unique()))
data['Installs'].replace(Sorted_value,range(0,len(Sorted_value),1), inplace = True )
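
To make the effect of this encoding concrete, here is a minimal sketch on toy data. It uses a dictionary with `Series.map` instead of the notebook's `replace` call, and the column name `InstallsCode` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Hypothetical toy install counts (already converted to integers)
toy = pd.DataFrame({"Installs": [1000, 10, 100000, 10, 1000]})

# Sorted distinct values; the position of each value becomes its code
sorted_values = sorted(toy["Installs"].unique())
codes = {value: rank for rank, value in enumerate(sorted_values)}

toy["InstallsCode"] = toy["Installs"].map(codes)
print(toy)
```

The smallest install count receives code 0 and the largest receives the highest code, so the order of the buckets is kept while the gaps between them are discarded.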

## Size

In [7]:
# change it to NA first
data['Size'].replace('Varies with device', np.nan, inplace = True )

# Convert Size (e.g., 18k to 18000.0)
data.Size = (data.Size.replace(r'[kM]+$', '', regex=True).astype(float) * \
             data.Size.str.extract(r'[\d\.]+([KM]+)', expand=False)
            .fillna(1)
            .replace(['k','M'], [10**3, 10**6]).astype(int))

# Fill NA (i.e., Varies with device) with mean
data['Size'].fillna(data['Size'].mean(),inplace = True)


**<font color='forestgreen'> Note </font>**

    I decided to fill "Varies with device" with the mean size.
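
As a small illustration of mean imputation on its own, the sketch below fills missing values in a toy Series. The values are made up and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes; NaN stands in for "Varies with device"
sizes = pd.Series([2.0, np.nan, 4.0, np.nan])

# Series.mean() skips NaN by default, so the mean here is (2 + 4) / 2
mean_size = sizes.mean()
filled = sizes.fillna(mean_size)
print(filled.tolist())
```

Because the mean is computed from the non-missing entries only, any error in those entries carries over into every filled value.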

## Type

In [8]:
data['Free'] = data['Type'].map(lambda s :1  if s =='Free' else 0)
data.drop(['Type'], axis=1, inplace=True)

**<font color='forestgreen'> Note </font>**

    Because the model can't take strings as input, I need to convert this column to a numeric format.

# --------------- Task3 ENDs ---------------

# --------------- Task4 BEGINs ---------------


## Price

**<font color='tomato'> Finding</font>**

    The data is of object type, with values prefixed by a dollar sign.

In [9]:
data.Price = data.Price.apply(lambda x: x.replace('$',''))
data['Price'] = data['Price'].apply(lambda x: float(x))

In [10]:
data['PriceBand'] = ""
data['PriceBand'].loc[ data['Price'] == 0] = '0 Free'
data['PriceBand'].loc[(data['Price'] > 0) & (data['Price'] <= 0.99)] = '1 cheap'
data['PriceBand'].loc[(data['Price'] > 0.99) & (data['Price'] <= 2.99)]   = '2 not cheap'
data['PriceBand'].loc[(data['Price'] > 2.99) & (data['Price'] <= 4.99)]   = '3 normal'
data['PriceBand'].loc[(data['Price'] > 4.99) & (data['Price'] <= 14.99)]   = '4 expensive'
data['PriceBand'].loc[(data['Price'] > 14.99) & (data['Price'] <= 29.99)]   = '5 too expensive'
data['PriceBand'].loc[(data['Price'] > 29.99)]  = '6 FXXXing expensive'
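
The same banding can also be expressed with `pd.cut`, which may be easier to audit because all bin edges sit in one list. This is a sketch under the assumption that prices are never negative; the toy prices and the text of the last label are illustrative, not the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical prices covering every band
prices = pd.Series([0.0, 0.99, 1.49, 4.99, 9.99, 19.99, 399.99])

# Right-closed bins: (-inf, 0], (0, 0.99], (0.99, 2.99], ...
edges = [-np.inf, 0, 0.99, 2.99, 4.99, 14.99, 29.99, np.inf]
labels = ["0 Free", "1 cheap", "2 not cheap", "3 normal",
          "4 expensive", "5 too expensive", "6 most expensive"]

bands = pd.cut(prices, bins=edges, labels=labels, right=True).astype(str)
print(bands.tolist())
```

With `right=True`, each boundary value such as 0.99 or 4.99 falls into the lower band, which matches the `<=` comparisons in the cell above.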

## Content Rating

In [11]:
# remove the single 'Unrated' record
data = data[data['Content Rating'] != 'Unrated']

In [12]:
data = pd.get_dummies(data, columns= ["Content Rating"])
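
For readers unfamiliar with `pd.get_dummies`, this minimal sketch shows what the call does to a toy frame. The `App` values are placeholders.

```python
import pandas as pd

# Hypothetical toy frame with two distinct content ratings
toy = pd.DataFrame({"App": ["a", "b", "c"],
                    "Content Rating": ["Everyone", "Teen", "Everyone"]})

# The listed column is replaced by one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["Content Rating"])
print(encoded.columns.tolist())
```

The original `Content Rating` column disappears and each new column is named `Content Rating_<value>`, as in the output table at the end of this section.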

## Genres

In [13]:
# Many genres contain only a few records, which may introduce bias.
# Group them into broader genres by ignoring the sub-genre (the part after the ";" sign)
data['Genres'] = data['Genres'].str.split(';').str[0]
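
A quick sketch of this split on three genre strings that appear in the sample rows above:

```python
import pandas as pd

genres = pd.Series(["Art & Design;Pretend Play", "Education", "Art & Design;Creativity"])

# Keep only the part before the first ';' (values without ';' are unchanged)
main_genre = genres.str.split(";").str[0]
print(main_genre.tolist())
```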

In [14]:
# There is one item that is 'Music & Audio'; let's change it to Music
is_music = data['Genres'] == 'Music & Audio'
data.loc[is_music, 'Genres'].loc[is_music] = 'Music'

In [15]:
data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Price,Genres,Last Updated,Current Ver,Android Ver,Free,PriceBand,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,1.900000e+07,8,0.0,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,1,0 Free,0,1,0,0,0
1,Coloring book moana,ART_AND_DESIGN,3.9,967,1.400000e+07,11,0.0,Art & Design,"January 15, 2018",2.0.0,4.0.3 and up,1,0 Free,0,1,0,0,0
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.700000e+06,13,0.0,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,1,0 Free,0,1,0,0,0
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,2.500000e+07,15,0.0,Art & Design,"June 8, 2018",Varies with device,4.2 and up,1,0 Free,0,0,0,0,1
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.800000e+06,10,0.0,Art & Design,"June 20, 2018",1.1,4.4 and up,1,0 Free,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10834,FR Calculator,FAMILY,4.0,7,2.600000e+06,5,0.0,Education,"June 18, 2017",1.0.0,4.1 and up,1,0 Free,0,1,0,0,0
10836,Sya9a Maroc - FR,FAMILY,4.5,38,5.300000e+07,7,0.0,Education,"July 25, 2017",1.48,4.1 and up,1,0 Free,0,1,0,0,0
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.600000e+06,4,0.0,Education,"July 6, 2018",1.0,4.1 and up,1,0 Free,0,1,0,0,0
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,2.295612e+07,6,0.0,Books & Reference,"January 19, 2015",Varies with device,Varies with device,1,0 Free,0,0,0,1,0


# --------------- Task4 ENDs ---------------