# Team 2 - Google Play Store

![](https://www.brandnol.com/wp-content/uploads/2019/04/Google-Play-Store-Search.jpg)

_For more information about the dataset, read [here](https://www.kaggle.com/lava18/google-play-store-apps)._

## Your tasks
- Name your team!
- Read the source and do some quick research to understand more about the dataset and its topic
- Clean the data
- Perform Exploratory Data Analysis on the dataset
- Analyze the data more deeply and extract insights
- Visualize your analysis on Google Data Studio
- Present your work to the class and guests next Monday

## Submission Guide
- Create a GitHub repository for your project
- Upload the dataset (.csv file) and the Jupyter Notebook to your GitHub repository. In the Jupyter Notebook, **include the link to your Google Data Studio report**.
- Submit your work through this [Google Form](https://forms.gle/oxtXpGfS8JapVj3V8).

## Tips for Data Cleaning, Manipulation & Visualization
- Our tips for data cleaning, manipulation, and visualization are collected [here](https://hackmd.io/cBNV7E6TT2WMliQC-GTw1A).

_____________________________

## Some Hints for This Dataset:
- There are lots of null values. How should we handle them?
- The `Installs` and `Size` columns contain some strange values. Can you identify them?
- Values in the `Size` column use different unit suffixes: `M` and `k`. And what about the value `Varies with device`?
- The `Price` column does not have the right data type.
- And more...
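
One possible way to start on these hints is to count missing values and flag entries that do not follow the expected pattern. The sketch below uses a hypothetical toy frame (the rows are made up to imitate the kinds of values the hints describe, not taken from the dataset), and the regular expressions are assumptions about what a "normal" value looks like.

```python
import pandas as pd

# Toy rows imitating the problems described in the hints (hypothetical, not real data).
toy = pd.DataFrame({
    "Rating": [4.1, None, 4.7, 19.0],
    "Size": ["19M", "201k", "Varies with device", "1,000+"],
    "Installs": ["10,000+", "500+", "1,000,000+", "Free"],
})

# Count missing values per column.
null_counts = toy.isna().sum()

# Flag entries that do not match the assumed pattern for each column.
odd_size = toy.loc[~toy["Size"].str.fullmatch(r"\d+(\.\d+)?[kM]"), "Size"].tolist()
odd_installs = toy.loc[~toy["Installs"].str.fullmatch(r"[\d,]+\+?"), "Installs"].tolist()

print(null_counts)
print(odd_size)
print(odd_installs)
```

Running the same checks on the real `data` frame should surface the values that need special handling before any type conversion.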


In [169]:
# Start your code here!
import pandas as pd

data = pd.read_csv('sample_data/google-play-store.csv')

In [170]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [171]:
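# Fill missing ratings with the column mean (assign back instead of using inplace on a column selection)
data['Rating'] = data['Rating'].fillna(data['Rating'].mean())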

In [172]:
data['Rating']

0        4.100000
1        3.900000
2        4.700000
3        4.500000
4        4.300000
           ...   
10836    4.500000
10837    5.000000
10838    4.193338
10839    4.500000
10840    4.500000
Name: Rating, Length: 10841, dtype: float64
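
Filling every missing `Rating` with the global mean (4.193338 in the output above) gives all unrated apps the same value regardless of category. As a hedged alternative, and not necessarily what this project should use, the sketch below compares a global mean fill with a per-category median fill on a hypothetical toy frame.

```python
import pandas as pd

# Toy data (hypothetical): two categories with one missing rating each.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Rating": [4.5, None, 3.0, 4.0, None],
})

# Option 1: fill with the mean of all known ratings.
global_fill = toy["Rating"].fillna(toy["Rating"].mean())

# Option 2: fill with the median rating of the app's own category.
category_median = toy.groupby("Category")["Rating"].transform("median")
by_category = toy["Rating"].fillna(category_median)

print(pd.DataFrame({"global": global_fill, "by_category": by_category}))
```

Either choice changes the distribution of `Rating`, so it is worth stating which one was used when presenting the analysis.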

In [173]:
# Rewrite e.g. '19M' as '19*1e6' and evaluate it; other values (such as 'Varies with device') are left as-is
repl_dict = {'[kK]': '*1e3', '[mM]': '*1e6'}
size_mask = data['Size'].str.match(r'^\d+(\.\d+)?[MmKk]*$')
data.loc[size_mask, 'Size'] = data.loc[size_mask, 'Size'].replace(repl_dict, regex=True).apply(pd.eval)

In [174]:
# Anything still non-numeric becomes NaN, then NaN is filled with the column mean
data['Size'] = pd.to_numeric(data['Size'], errors='coerce')
data['Size'] = data['Size'].fillna(data['Size'].mean())

In [175]:
data['Size']

0        1.900000e+07
1        1.400000e+07
2        8.700000e+06
3        2.500000e+07
4        2.800000e+06
             ...     
10836    5.300000e+07
10837    3.600000e+06
10838    9.500000e+06
10839    2.151653e+07
10840    1.900000e+07
Name: Size, Length: 10841, dtype: float64
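
The cells above build arithmetic strings and evaluate them with `pd.eval`. A more explicit alternative is a small parser that maps the suffix to a multiplier. The sketch below is one possible version; `parse_size` is a hypothetical helper name, and it keeps the same convention as the cells above (`k` as 1e3, `M` as 1e6, anything else such as `Varies with device` treated as missing).

```python
import math
import re

# Multipliers for the unit suffixes, matching the 1e3 / 1e6 convention used above.
_UNITS = {"k": 1e3, "m": 1e6}
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)([kKmM]?)")


def parse_size(text):
    """Return the size as a float, or NaN if the text is not a number with an optional k/M suffix."""
    match = _SIZE_RE.fullmatch(str(text).strip())
    if match is None:
        return math.nan
    number, unit = match.groups()
    # An empty suffix falls back to a multiplier of 1.
    return float(number) * _UNITS.get(unit.lower(), 1.0)


print(parse_size("19M"))
print(parse_size("201k"))
print(parse_size("Varies with device"))
```

On the real data this could be applied with `data['Size'].map(parse_size)`, leaving the decision of how to fill the resulting NaN values as a separate step.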

In [176]:
# Strip the trailing '+' and the thousands separators, then convert to numbers (non-numeric entries become NaN)
data['Installs'] = data['Installs'].str.replace(r'\+$', '', regex=True).str.replace(',', '', regex=False)
data['Installs'] = pd.to_numeric(data['Installs'], errors='coerce')

In [177]:
# Strip the literal '$' sign (regex=False, since '$' is a regex anchor)
data['Price'] = data['Price'].str.replace('$', '', regex=False)
# Convert only the Paid rows and fill their gaps with the Paid mean; assign back, because fillna(inplace=True) on a .loc selection only changes a temporary copy
paid = data['Type'] == 'Paid'
paid_price = pd.to_numeric(data.loc[paid, 'Price'], errors='coerce')
data.loc[paid, 'Price'] = paid_price.fillna(paid_price.mean())

In [178]:
data.loc[data['Type'] == 'Paid', 'Price']

234       4.99
235       4.99
290       4.99
291       4.99
427       3.99
         ...  
10735     0.99
10760     7.99
10782    16.99
10785      1.2
10798     1.04
Name: Price, Length: 800, dtype: object
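
The output above still reports `dtype: object`, because only the Paid rows were converted and the rest of the column holds strings. If a fully numeric column is wanted, one option is to convert the whole column at once. The sketch below shows this on hypothetical toy values; on the real data the same two calls would be applied to `data['Price']`.

```python
import pandas as pd

# Toy values (hypothetical): free apps, paid apps, and one malformed entry.
price = pd.Series(["0", "$4.99", "$0.99", "Everyone"])

# Remove the literal dollar sign, then coerce anything non-numeric to NaN.
numeric_price = pd.to_numeric(price.str.replace("$", "", regex=False), errors="coerce")

print(numeric_price)
print(numeric_price.dtype)
```

With this approach free apps keep a price of 0.0 and malformed entries become NaN, which can then be dropped or filled explicitly.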

In [180]:
# index=False keeps the row index from being written as an extra unnamed column
data.to_csv('google-play-store-clean.csv', index=False)
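
Whether `to_csv` writes the row index affects what comes back when the file is read again, for example by a reporting tool. The sketch below illustrates the difference on a hypothetical two-row frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Toy frame (hypothetical) standing in for the cleaned data.
toy = pd.DataFrame({"App": ["A", "B"], "Rating": [4.1, 3.9]})

# Round-trip through CSV text, once with the index and once without it.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Reading the exported file back and checking its columns and dtypes is a quick way to confirm the cleaned CSV looks as intended before uploading it.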