# Google Playstore Apps Rating Prediction

In [43]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')


In [44]:

data = pd.read_csv('googleplaystore.csv')



In [45]:
data.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


## Data Overview...

In [46]:

data.columns = data.columns.str.strip()

pd.set_option('display.max_columns', None)


In [47]:
data.sample(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2395,EMT Review Plus,MEDICAL,4.5,199,1.8M,"10,000+",Paid,$11.99,Everyone,Medical,"June 27, 2018",3.0.5,4.4W and up
2449,Jaylex,MEDICAL,,0,13M,10+,Free,0,Everyone,Medical,"July 30, 2018",1.2.3-DEBUG,4.2 and up
2781,Wanelo Shopping,SHOPPING,4.6,94205,6.5M,"1,000,000+",Free,0,Everyone,Shopping,"June 13, 2018",5.6.8,4.4 and up


In [48]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


**Initial Insights**

- ***App*** and ***Current Ver*** need to be dropped because they contain mostly unique values, much like the _Name_ and _ID_ columns in other datasets.

- ***Reviews*** needs to be converted from object to int.

- ***Size*** is stored as object. It can be converted to a numeric type by stripping the unit suffixes from the entries.

- ***Installs*** is also stored as object. It can be converted to a numeric type by removing the commas and the trailing "+" from the entries.

- ***Rating*** is our output column and contains null values -> ( Drop Missing Values )

- ***Price*** needs to be converted from object to float, since prices such as $11.99 are not whole numbers.

- ***Last Updated*** needs to be converted from _object_ to _datetime_.

- ***Android Ver*** may have many categories, which can be reduced.

In [49]:
print(f"\nThe Dataset Contains {data.shape[0]} rows and {data.shape[1]} columns\n")


The Dataset Contains 10841 rows and 13 columns



In [50]:
print("\nDataset has Null Values: ",data.isnull().sum().any(),"\n")


Dataset has Null Values:  True 



In [51]:
print("\nDataset has Duplicated Values: ",data.duplicated().sum().any(),"\n")


Dataset has Duplicated Values:  True 
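The check above only reports whether duplicated rows exist. As a minimal sketch on a made-up toy frame (not the real data), this is one way the duplicates could be counted and removed; whether they should be dropped is a judgment call that the notebook does not make.

```python
import pandas as pd

# Toy frame with one fully repeated row (illustrative values only).
toy = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Rating": [4.1, 4.1, 3.9],
})

# Count rows that repeat an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```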



In [52]:
data.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


The Rating column has a maximum of 19.0, which is invalid because Play Store ratings cannot exceed 5.0.

In [53]:
print("\nNull Value Percentage: ")
( data.isnull().sum() / data.shape[0] ) * 100




Null Value Percentage: 


App                0.000000
Category           0.000000
Rating            13.596532
Reviews            0.000000
Size               0.000000
Installs           0.000000
Type               0.009224
Price              0.000000
Content Rating     0.009224
Genres             0.000000
Last Updated       0.000000
Current Ver        0.073794
Android Ver        0.027673
dtype: float64

**As we can see, the Rating column has 1474 null values (around 13.6% of the data).**


**There are multiple ways we can handle this:**

- Dropping the entire column: not an option, because Rating is the target we want to predict ❌

- Dropping the rows: this is the best option for model building, because imputing the output column would train the model on made-up targets. We lose some information, but it is the only valid option for model building.

- For the analysis part, we will impute this column with the constant "Unrated". This is reasonable because these apps may not have been rated at all. Mean, mode or KNN imputation could also be used to fill the NaN values.
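As a rough illustration of the mean-imputation alternative mentioned in the last bullet, the sketch below fills missing ratings with the mean rating of the app's category. The toy frame and the `Rating_imputed` column name are assumptions made for this example; the notebook itself uses the "Unrated" constant instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Category": ["MEDICAL", "MEDICAL", "MEDICAL", "TOOLS", "TOOLS"],
    "Rating": [4.0, 5.0, np.nan, 3.0, np.nan],
})

# Mean rating within each category, aligned back to the original rows.
# NaN values are ignored when the mean is computed.
category_mean = toy.groupby("Category")["Rating"].transform("mean")

# Fill only the missing ratings; observed ratings stay untouched.
toy["Rating_imputed"] = toy["Rating"].fillna(category_mean)

print(toy)
```

This is only suitable for exploratory analysis; for model building the target column should still not be imputed.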

## Data Cleaning

In [54]:
data.sample(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2297,Hospitalist Handbook,MEDICAL,4.8,12,18M,"1,000+",Paid,$19.99,Everyone,Medical,"June 5, 2017",6.0.4,4.1 and up
6787,Mediatek SmartDevice,TOOLS,3.6,11187,7.3M,"1,000,000+",Free,0,Everyone,Tools,"March 17, 2017",V1.7.6,4.0 and up
7947,Resume PDF Maker / CV Builder,BUSINESS,4.4,2147,2.2M,"500,000+",Free,0,Everyone,Business,"December 14, 2017",1.8,4.4 and up


Dropping unnecessary columns from the dataset.

In [55]:


data = data.drop(columns=['App','Current Ver'])


- These two columns are unnecessary because their values are almost all unique, so they would only add high cardinality to the dataset.
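One way to back up a cardinality claim like this is to compare the number of distinct values in each column with the number of rows. The snippet is a sketch on a made-up toy frame; `unique_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame standing in for the Play Store data (illustrative values only).
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Current Ver": ["1.0", "2.1", "1.0.3", "4.2"],
    "Category": ["TOOLS", "TOOLS", "GAME", "GAME"],
})

# Share of distinct values per column; values near 1.0 behave like identifiers.
unique_ratio = toy.nunique() / len(toy)

print(unique_ratio.sort_values(ascending=False))
```

Columns whose ratio is close to 1.0 carry little signal that a model could generalize from, which is the reason for dropping them.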

Checking for Rating greater than 5.0

In [56]:


data[data['Rating'] > 5.0]


Unnamed: 0,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Android Ver
10472,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,


In [57]:

data.drop(10472, inplace = True)



In [58]:
data.describe()

Unnamed: 0,Rating
count,9366.0
mean,4.191757
std,0.515219
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,5.0


The Rating column now lies within the valid range (1.0 to 5.0) and no longer contains extreme or unusual values.

In [59]:

data["Reviews"] = data["Reviews"].astype('int')


In [60]:
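# regex=False treats "," and "+" as literal characters ("+" is a regex quantifier)
data['Installs'] = data['Installs'].str.replace(",", "", regex=False).str.replace("+", "", regex=False)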

In [61]:

data['Installs'] = data['Installs'].astype('int')




In [62]:

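# regex=False strips the literal "$" (as a regex, "$" would match the end of the string)
data['Price in Dollars'] = data['Price'].str.replace("$", "", regex=False)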


data['Price in Dollars'] = data['Price in Dollars'].astype('float')


del data['Price']



In [63]:

data['Last Updated'] = pd.to_datetime(data['Last Updated'])


In [64]:
import numpy as np

# Replace "Varies with device" with NaN
cols_to_clean = ["Size", "Android Ver"]
data[cols_to_clean] = data[cols_to_clean].replace("Varies with device", np.nan)

# ----- Size -----
data["Size_raw"] = data["Size"]  # keep raw copy


# Convert KB to MB (optional)
def size_to_mb(size):
    if pd.isna(size) or size == "Varies with device":
        return np.nan
    size = size.strip()
    if size.endswith("M"):
        return round(float(size.replace("M", "")), 2)
    if size.endswith("k"):
        return round(float(size.replace("k", "")) / 1024, 2)
    return np.nan


data["Size(in MBs)"] = data["Size"].apply(size_to_mb)




# ----- Android Ver -----
data["Android_Ver_raw"] = data["Android Ver"]  # keep raw copy
data["Android Ver"] = (
    data["Android Ver"].astype(str)
                       .str.extract(r"(\d+(\.\d+)*)")[0]
)
data["Android Ver"] = pd.to_numeric(data["Android Ver"], errors="coerce")

Size and Android Ver both contained the value "Varies with device". Keeping it would have created a lot of noise for model building and for analysis, so I replaced it with NaN in order to drop those rows later.

The Size column also mixed values in MB and in KB. I converted all of them to MB and stored the result as a float column.
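To make the unit conversion concrete, here is a small self-contained sketch that applies the same logic to a handful of inputs. The values `"19M"` and `"8.7M"` appear in the data preview; `"512k"` is a made-up example.

```python
import numpy as np
import pandas as pd

def size_to_mb(size):
    """Convert strings such as '19M' or '512k' to megabytes; anything else becomes NaN."""
    if pd.isna(size):
        return np.nan
    size = size.strip()
    if size.endswith("M"):
        return round(float(size[:-1]), 2)
    if size.endswith("k"):
        # 1 MB is treated as 1024 KB, as in the cell above.
        return round(float(size[:-1]) / 1024, 2)
    return np.nan

sizes = pd.Series(["19M", "8.7M", "512k", "Varies with device", np.nan])
converted = sizes.apply(size_to_mb)

print(converted)
```

Anything that does not end in `M` or `k`, including "Varies with device", falls through to NaN rather than raising an error.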


In [65]:
data.sample(3)

Unnamed: 0,Category,Rating,Reviews,Size,Installs,Type,Content Rating,Genres,Last Updated,Android Ver,Price in Dollars,Size_raw,Size(in MBs),Android_Ver_raw
2422,MEDICAL,,0,3.1M,1,Paid,Everyone,Medical,2018-08-01,4.1,2.99,3.1M,3.1,4.1 and up
6661,SHOPPING,3.4,25515,,10000000,Free,Everyone,Shopping,2018-07-16,,0.0,,,
2177,FAMILY,4.7,2195,37M,10000,Paid,Everyone,Board;Brain Games,2016-08-02,2.3,2.99,37M,37.0,2.3 and up


In [66]:
data.dtypes

Category                    object
Rating                     float64
Reviews                      int64
Size                        object
Installs                     int64
Type                        object
Content Rating              object
Genres                      object
Last Updated        datetime64[ns]
Android Ver                float64
Price in Dollars           float64
Size_raw                    object
Size(in MBs)               float64
Android_Ver_raw             object
dtype: object

**Initial Insights**

- ***App*** and ***Current Ver*** need to be dropped because they contain mostly unique values, much like the _Name_ and _ID_ columns in other datasets. ✔️

- ***Reviews*** needs to be converted from object to int. ✔️

- ***Installs*** is also stored as object. It can be converted to a numeric type by removing the commas and the trailing "+" from the entries. ✔️

- ***Price*** needs to be converted from object to float. ✔️

- ***Last Updated*** needs to be converted from _object_ to _datetime_. ✔️


***These two columns contained the "Varies with device" category:***

- ***Size*** is stored as object. It can be converted to a numeric type by stripping the unit suffixes from the entries. ✔️

- ***Android Ver*** may have many categories, which can be reduced. ✔️


- ***Rating*** is our output column and contains null values -> ( Drop Missing Values ) ⚠️

In [67]:

data.drop(columns=['Size_raw', 'Size', 'Android_Ver_raw'], inplace=True)


Deleted the raw columns that were kept as backups in case something went wrong.

In [68]:
data.shape

(10840, 11)

In [69]:


max_date = data['Last Updated'].max()

data['Days Since Last Update'] = (max_date - data["Last Updated"]).dt.days

def freshness(days):
    if days <=90:
        return "Recently Updated"
    elif 91 <= days <= 365:
        return "Moderately Updated"
    else:
        return "Outdated"
    
data["Update Freshness"] = data['Days Since Last Update'].apply(freshness)


data.drop(columns=['Last Updated'], inplace = True)

_The maximum of the date column (i.e. the most recent update date of any app) gives a rough idea of when the dataset was collected. Using it as the reference date, we create columns for the number of days since the last update and for the freshness of that update._


**Created two new columns from the _Last Updated_ column:**
1. **Days Since Last Update:** Total days between the app's last update and the most recent update date in the dataset.
2. **Update Freshness:** Category based on how recently the app was updated (up to 90 days, 91 to 365 days, or more than 365 days).


- Dropped the "Last Updated" column afterwards.
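The same three freshness buckets can also be produced without a row-wise `apply`. The sketch below uses `pd.cut` with the 90-day and 365-day thresholds from the function above; the sample day counts are made up to exercise the bin edges.

```python
import pandas as pd

# Sample day counts chosen to hit both sides of each threshold.
days = pd.Series([0, 90, 91, 365, 366, 1223])

freshness = pd.cut(
    days,
    # Right-inclusive bins: (-1, 90], (90, 365], (365, inf]
    bins=[-1, 90, 365, float("inf")],
    labels=["Recently Updated", "Moderately Updated", "Outdated"],
)

print(freshness.tolist())
```

Note that `pd.cut` returns an ordered categorical rather than plain strings, which may or may not be what a later encoding step expects.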


In [70]:
data.sample(6)

Unnamed: 0,Category,Rating,Reviews,Installs,Type,Content Rating,Genres,Android Ver,Price in Dollars,Size(in MBs),Days Since Last Update,Update Freshness
7613,FAMILY,4.3,82499,5000000,Free,Everyone,Entertainment;Music & Video,4.4,0.0,,34,Recently Updated
4472,PERSONALIZATION,3.6,53,5000,Free,Everyone,Personalization,,0.0,2.0,177,Moderately Updated
5377,GAME,4.5,5849,500000,Free,Everyone,Trivia,,0.0,5.1,329,Moderately Updated
7574,TOOLS,3.6,949,100000,Free,Everyone,Tools,1.6,0.0,0.4,117,Moderately Updated
8622,PRODUCTIVITY,4.3,596,50000,Free,Everyone,Productivity,4.1,0.0,22.0,35,Recently Updated
8410,FAMILY,4.3,1130,50000,Free,Teen,Simulation,4.2,0.0,48.0,1223,Outdated


- Dealing with Null Values...

In [71]:
data.isnull().sum()

Category                     0
Rating                    1474
Reviews                      0
Installs                     0
Type                         1
Content Rating               0
Genres                       0
Android Ver               3155
Price in Dollars             0
Size(in MBs)              1695
Days Since Last Update       0
Update Freshness             0
dtype: int64

Rows with null values in ***Android Ver*** and ***Size(in MBs)*** need to be dropped completely. These include the rows that had the value "Varies with device"; for ***Android Ver***, they also include three-part versions such as "4.0.3", which `pd.to_numeric` coerces to NaN. Keeping these rows (_by imputing_) would add noise and improper values that bias the data.
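Because `pd.to_numeric` cannot parse a three-part string such as "4.0.3", the extraction used earlier turns those versions into NaN and the rows are then dropped. If keeping them is preferred, one possible alternative (not what this notebook does) is to capture only the leading major.minor part. The sample strings below are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative raw values; NaN stands for an entry that was "Varies with device".
raw = pd.Series(["4.0.3 and up", "4.4W and up", "2.3 and up", "5.0 - 8.0", np.nan])

# Capture only the leading "major" or "major.minor" part so to_numeric can parse it.
major_minor = raw.str.extract(r"^(\d+(?:\.\d+)?)", expand=False)
android_ver = pd.to_numeric(major_minor, errors="coerce")

print(android_ver.tolist())
```

With this variant, "4.0.3 and up" becomes 4.0 instead of NaN, so fewer rows would be lost in the `dropna` step and the row counts reported below would change.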

**Missing values in Rating will be handled separately for the analysis part and the model building part:**

- For model building: the missing values will be dropped completely.
- For analysis: we will impute the missing values with the constant "Unrated".

In [72]:


data.dropna(subset=['Android Ver','Size(in MBs)'], inplace = True)


In [73]:
data.isnull().sum()

Category                     0
Rating                    1116
Reviews                      0
Installs                     0
Type                         0
Content Rating               0
Genres                       0
Android Ver                  0
Price in Dollars             0
Size(in MBs)                 0
Days Since Last Update       0
Update Freshness             0
dtype: int64

**Now we will create two different datasets...** 
One for Analysis and One for Model Building.


In [74]:


EDA_data = data.copy()


In [75]:


EDA_data.fillna("Unrated", inplace = True)


Filled the missing ratings with "Unrated" for data analysis. This leaves the Rating column with object dtype, since it now mixes floats and strings.
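A column that mixes floats with the string "Unrated" cannot be averaged or plotted numerically as is. As a small sketch on a toy frame (the `eda_toy` name is made up for this example), the placeholder can be coerced back to NaN whenever a numeric summary is needed.

```python
import pandas as pd

# Toy stand-in for the analysis dataset after filling with "Unrated".
eda_toy = pd.DataFrame({"Rating": [2.8, "Unrated", 4.1]})

# Coerce the placeholder back to NaN so numeric operations work.
numeric_rating = pd.to_numeric(eda_toy["Rating"], errors="coerce")

rated_share = numeric_rating.notna().mean()   # fraction of apps with a rating
mean_rating = numeric_rating.mean()           # NaN values are skipped

print(rated_share, mean_rating)
```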

In [76]:


data.dropna(inplace = True)


Dropped the rows with missing values for model building.

In [77]:
data.isnull().sum()

Category                  0
Rating                    0
Reviews                   0
Installs                  0
Type                      0
Content Rating            0
Genres                    0
Android Ver               0
Price in Dollars          0
Size(in MBs)              0
Days Since Last Update    0
Update Freshness          0
dtype: int64

In [78]:
EDA_data.isnull().sum()

Category                  0
Rating                    0
Reviews                   0
Installs                  0
Type                      0
Content Rating            0
Genres                    0
Android Ver               0
Price in Dollars          0
Size(in MBs)              0
Days Since Last Update    0
Update Freshness          0
dtype: int64

No null values remain in either dataset.

In [79]:
data.sample(3)

Unnamed: 0,Category,Rating,Reviews,Installs,Type,Content Rating,Genres,Android Ver,Price in Dollars,Size(in MBs),Days Since Last Update,Update Freshness
4014,FAMILY,4.3,43,10000,Free,Everyone,Education,4.1,0.0,3.7,235,Moderately Updated
5194,FAMILY,4.6,27,10000,Free,Everyone,Education,4.1,0.0,4.2,25,Recently Updated
7659,LIFESTYLE,4.7,12,500,Free,Everyone,Lifestyle,4.0,0.0,6.0,163,Moderately Updated


In [80]:
EDA_data.sample(3)

Unnamed: 0,Category,Rating,Reviews,Installs,Type,Content Rating,Genres,Android Ver,Price in Dollars,Size(in MBs),Days Since Last Update,Update Freshness
8377,TOOLS,2.8,27,1000,Free,Everyone,Tools,4.4,0.0,34.0,169,Moderately Updated
8752,PERSONALIZATION,Unrated,1,100,Free,Everyone,Personalization,4.0,0.0,3.9,78,Recently Updated
10671,GAME,4.1,88941,1000000,Free,Teen,Action,4.0,0.0,34.0,743,Outdated


- Checking Final Shape of Data : 

In [81]:


print(f"\nOur Dataset(for model building) has now {data.shape[0]} rows and {data.shape[1]} columns\n")



Our Dataset(for model building) has now 6233 rows and 12 columns



In [82]:


print(f"\nOur Dataset(for Analysis) has now {EDA_data.shape[0]} rows and {EDA_data.shape[1]} columns\n")



Our Dataset(for Analysis) has now 7349 rows and 12 columns



In [83]:


data.to_csv("Cleaned Google Playstore Data (ML).csv")


In [84]:


EDA_data.to_csv("Cleaned Google Playstore Data (Analysis).csv")
