# Team 2 - Google Play Store

![](https://www.brandnol.com/wp-content/uploads/2019/04/Google-Play-Store-Search.jpg)

_For more information about the dataset, read [here](https://www.kaggle.com/lava18/google-play-store-apps)._

## Your tasks
- Name your team!
- Read the source and do some quick research to understand more about the dataset and its topic
- Clean the data
- Perform Exploratory Data Analysis on the dataset
- Analyze the data more deeply and extract insights
- Visualize your analysis on Google Data Studio
- Present your work in front of the class and guests next Monday

## Submission Guide
- Create a GitHub repository for your project
- Upload the dataset (.csv file) and the Jupyter Notebook to your GitHub repository. In the Jupyter Notebook, **include the link to your Google Data Studio report**.
- Submit your work through this [Google Form](https://forms.gle/oxtXpGfS8JapVj3V8).

## Tips for Data Cleaning, Manipulation & Visualization
- See our tips for Data Cleaning, Manipulation & Visualization [here](https://hackmd.io/cBNV7E6TT2WMliQC-GTw1A).

_____________________________

## Some Hints for This Dataset:
- There are lots of null values. How should we handle them?
- The `Installs` and `Size` columns have some strange values. Can you identify them?
- Values in the `Size` column use different units: `M` and `k`. And what about the value `Varies with device`?
- The `Price` column does not have the right data type
- And more...
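
The `Size`, `Installs`, and `Price` hints above can be sketched as small parsing helpers. This is only a sketch under assumptions: the raw formats (`19M`, `201k`, `10,000+`, `$4.99`) and the helper names are illustrative and not part of the assignment, and 1 MB is assumed to equal 1024 kB. Anything that does not parse becomes `NaN`, so it can be handled together with the other null values.

```python
import math


def parse_size_mb(raw):
    """Convert a raw Size string to megabytes; return NaN if it cannot be parsed."""
    text = str(raw).strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    except ValueError:
        pass
    return math.nan  # e.g. "Varies with device"


def parse_installs(raw):
    """Strip ',' and '+' from an Installs string; return NaN for non-numeric values."""
    digits = str(raw).replace(",", "").replace("+", "").strip()
    return int(digits) if digits.isdigit() else math.nan


def parse_price(raw):
    """Strip a leading '$' from a Price string; return NaN for non-numeric values."""
    try:
        return float(str(raw).replace("$", "").strip())
    except ValueError:
        return math.nan
```

With pandas, helpers like these could be applied to a whole column, for example `df["Size"].map(parse_size_mb)`.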


In [1]:
# Start your code here!
import pandas as pd
import numpy as np  # conventional alias is "np" (was "py")
import matplotlib.pyplot as plt

In [2]:
from google.colab import files

In [3]:
data_to_load = files.upload()

Saving clean_googleplaydata.csv to clean_googleplaydata.csv


In [6]:

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error # 0.3 error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from google.colab import drive
#drive.mount('/content/drive')

In [10]:
# Load the previously cleaned dataset ("Size" is already numeric in this file)
df = pd.read_csv('clean_googleplaydata.csv')

In [12]:
df.head()

Unnamed: 0.1,Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Content Rating NUM,Category NUM
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,10000,Free,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,1,0
1,35,How to draw Ladybug and Cat Noir,ART_AND_DESIGN,3.8,564,9.2,100000,Free,0.0,Everyone,Art & Design,"July 11, 2018",2.1,4.1 and up,1,0
2,36,UNICORN - Color By Number & Pixel Art Coloring,ART_AND_DESIGN,4.7,8145,24.0,500000,Free,0.0,Everyone,Art & Design;Creativity,"August 2, 2018",1.0.9,4.4 and up,1,0
3,38,PIP Camera - PIP Collage Maker,ART_AND_DESIGN,4.7,158,11.0,10000,Free,0.0,Everyone,Art & Design,"November 29, 2017",1.3,4.0.3 and up,1,0
4,39,How To Color Disney Princess - Coloring Pages,ART_AND_DESIGN,4.0,591,9.4,500000,Free,0.0,Everyone,Art & Design,"March 31, 2018",1,4.0 and up,1,0


In [13]:
df.sort_values("Category", inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9133 entries, 0 to 9132
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          9133 non-null   int64  
 1   App                 9133 non-null   object 
 2   Category            9133 non-null   object 
 3   Rating              9133 non-null   float64
 4   Reviews             9133 non-null   int64  
 5   Size                9133 non-null   float64
 6   Installs            9133 non-null   int64  
 7   Type                9133 non-null   object 
 8   Price               9133 non-null   float64
 9   Content Rating      9133 non-null   object 
 10  Genres              9133 non-null   object 
 11  Last Updated        9133 non-null   object 
 12  Current Ver         9133 non-null   object 
 13  Android Ver         9133 non-null   object 
 14  Content Rating NUM  9133 non-null   int64  
 15  Category NUM        9133 non-null   int64  
dtypes: flo

In [14]:
lb_make = LabelEncoder()
# Create a numeric column for Content Rating
df["Content Rating NUM"] = lb_make.fit_transform(df["Content Rating"])
# Dictionary mapping each Content Rating label to its numeric code
# (LabelEncoder assigns codes in sorted label order)
dict_content_rating = {"Adults only 18+": 0, "Everyone": 1, "Everyone 10+": 2, "Mature 17+": 3, "Teen": 4}
# Create a numeric column for Category (fit_transform refits the encoder)
df["Category NUM"] = lb_make.fit_transform(df["Category"])
# Dictionary mapping each Category label to its numeric code, read from the
# fitted encoder so it always matches the values in "Category NUM"
dict_category = {label: code for code, label in enumerate(lb_make.classes_)}
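
The numeric codes in the cell above come from `LabelEncoder`, which keeps the labels it has seen, in sorted order, in its `classes_` attribute. The sketch below shows how a label-to-code dictionary can be read from a fitted encoder instead of being typed by hand. The `sample` list is made up for illustration and only stands in for a column such as `df["Content Rating"]`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample standing in for a column such as df["Content Rating"]
sample = ["Teen", "Everyone", "Mature 17+", "Everyone", "Everyone 10+"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sample)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Because the dictionary is built from the encoder itself, it stays correct even if the data contains a label that was not anticipated.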

In [15]:
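# Fill missing ratings with the column mean (SimpleImputer's default strategy)
imputer = SimpleImputer()
df['Rating'] = imputer.fit_transform(df[['Rating']]).ravel()
# Round to 1 decimal place; assign the result, because round() is not in-place
df['Rating'] = df['Rating'].round(1)
# Drop any rows that still contain missing values
df.dropna(axis=0, inplace=True)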
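
To make the imputation step above easier to follow, here is a minimal sketch on a made-up `toy` frame (not the real dataset). It assumes the goal is mean imputation followed by rounding, and it shows that the rounded values have to be assigned back to the column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ratings with one missing value
toy = pd.DataFrame({"Rating": [4.0, np.nan, 4.5, 3.9]})

imputer = SimpleImputer(strategy="mean")
# fit_transform returns a 2-D array; ravel() flattens it to a single column
toy["Rating"] = imputer.fit_transform(toy[["Rating"]]).ravel()
# round() returns a new Series, so the result is assigned back
toy["Rating"] = toy["Rating"].round(1)
print(toy)
```

The missing value is replaced by the mean of the three known ratings, rounded to one decimal place.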

In [16]:

# Change datatype
df['Reviews'] = pd.to_numeric(df['Reviews'])
df['Installs'] = pd.to_numeric(df['Installs'])
df['Price'] = pd.to_numeric(df['Price'])

In [17]:
# index=False avoids writing the index as an extra "Unnamed: 0" column
df.to_csv('new_googleplaydata.csv', index=False)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9133 entries, 0 to 9132
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          9133 non-null   int64  
 1   App                 9133 non-null   object 
 2   Category            9133 non-null   object 
 3   Rating              9133 non-null   float64
 4   Reviews             9133 non-null   int64  
 5   Size                9133 non-null   float64
 6   Installs            9133 non-null   int64  
 7   Type                9133 non-null   object 
 8   Price               9133 non-null   float64
 9   Content Rating      9133 non-null   object 
 10  Genres              9133 non-null   object 
 11  Last Updated        9133 non-null   object 
 12  Current Ver         9133 non-null   object 
 13  Android Ver         9133 non-null   object 
 14  Content Rating NUM  9133 non-null   int64  
 15  Category NUM        9133 non-null   int64  
dtypes: flo

In [None]:
# include the link to your Google Data Studio report.
#https://datastudio.google.com/u/0/reporting/3852503c-9715-4187-90b4-0a2d22eef997/page/6lyZB?fbclid=IwAR0zFKkOwLbJmozV6hRxs8ZLBzVoE2sdJP1cRv6luDThbe6MsSmqCVpRZlo
# name your team: Reshub
# Members of the group: Phạm Ngọc Tài, Lại Ngọc Tân, Ngô Thị Huệ