The goal of this project is to predict the user rating of a Google Play Store application from a set of predictor variables.
We use three datasets:
Google Play Store: downloaded from Kaggle; contains information about the apps on the Google Play Store.
Google Play Store User Reviews: contains user reviews of apps on the Google Play Store.
Ratings result: created by us; maps the rating scale to its labels.

In [37]:
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import numpy as np

In [2]:
#Importing all the datasets
app = pd.read_csv('googleplaystore.csv')
app_reviews = pd.read_csv('googleplaystore_user_reviews.csv')
app_ratings = pd.read_csv('Rating_result.csv')
app = app.rename(columns={"Content Rating": "content_rating","Current Ver": "current_ver", "Last Updated": "last_updated"})

In [3]:
for col in app_reviews:
    print(col)

App
Translated_Review
Sentiment
Sentiment_Polarity
Sentiment_Subjectivity


In [4]:
app.shape

(10840, 13)

In [5]:
app_reviews.shape

(64295, 5)

In [7]:
app.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
content_rating     object
Genres             object
last_updated       object
current_ver        object
Android Ver        object
dtype: object

Apart from Rating (float64) and Reviews (int64), every column is stored as object, so most columns are candidates for categorical treatment.
Reviews and Size are continuous variables and will not be treated as categorical.
We check the number of distinct values in each column: a single unique value or an extremely high number would make a column a poor categorical feature.

In [11]:
cat_columns = ['App', 'Category', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'content_rating', 'Genres', 'last_updated', 'current_ver',
       'Android Ver']
# Count the distinct values in each column with .nunique()
app[cat_columns].apply(lambda col: col.nunique())

App               9659
Category            33
Reviews           6001
Size               461
Installs            21
Type                 2
Price               92
content_rating       6
Genres             119
last_updated      1377
current_ver       2783
Android Ver         33
dtype: int64

We observe that App, current_ver and last_updated have an extremely high number of distinct values.
Based on domain knowledge, we also judge Price to be unhelpful for this modelling problem.
We therefore drop current_ver, last_updated and Price, along with Android Ver. App is kept for now because it is the key used to merge the two datasets.

In [12]:
# Data cleaning
# Drop columns that are not useful for determining the response variable
app = app.drop(['current_ver', 'last_updated', 'Android Ver', 'Price'], axis=1)
app_reviews = app_reviews.drop(['Translated_Review', 'Sentiment_Polarity', 'Sentiment_Subjectivity'], axis=1)
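The distinct-value check that motivates these drops can be wrapped in a small helper. The sketch below is illustrative only: the function name `cardinality_report` and the thresholds `low` and `high_ratio` are assumptions, not values used in this analysis, and the toy frame stands in for the real data.

```python
import pandas as pd

def cardinality_report(df, low=1, high_ratio=0.1):
    """Flag columns with too few or too many distinct values.

    `low` and `high_ratio` are hypothetical thresholds chosen for illustration.
    """
    n_unique = df.nunique()
    ratio = n_unique / len(df)
    return pd.DataFrame({
        "n_unique": n_unique,
        "ratio": ratio,
        # Flag constant columns and columns where most rows are distinct
        "flag": (n_unique <= low) | (ratio > high_ratio),
    })

# Toy data: one identifier-like column, one binary column, one constant column
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    "Type": ["Free", "Paid"] * 5,
    "Store": ["Google Play"] * 10,
})
report = cardinality_report(toy, high_ratio=0.5)
print(report)
```

A flagged column is only a candidate for removal; an identifier such as App may still be needed as a join key.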

In [13]:
app.isnull().sum(axis=0).reset_index()

Unnamed: 0,index,0
0,App,0
1,Category,0
2,Rating,1474
3,Reviews,0
4,Size,0
5,Installs,0
6,Type,1
7,content_rating,0
8,Genres,0
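Almost all missing values are in Rating, the response variable, with a single missing Type. As a hedged illustration on toy data (not the real dataset), the sketch below contrasts dropping rows with any missing value against dropping only rows whose target is missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, 4.5],
    "Type": ["Free", "Free", np.nan, "Paid"],
})

# Missing values per column, as a Series indexed by column name
missing = toy.isnull().sum()

# Drop a row if any of its columns is missing
dropped_any = toy.dropna(axis=0, how="any")

# Drop a row only if the target column is missing
dropped_target = toy.dropna(subset=["Rating"])

print(missing)
print(len(toy), len(dropped_any), len(dropped_target))
```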


Cleaning the first dataset: drop every row that contains a missing value.

In [15]:
app = app.dropna(axis=0, how='any')
app

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,content_rating,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,Everyone,Art & Design
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,Everyone,Art & Design;Pretend Play
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,Everyone,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,Teen,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...
10833,FR Calculator,FAMILY,4.0,7,2.6M,500+,Free,Everyone,Education
10835,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,Everyone,Education
10836,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,Everyone,Education
10838,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,Mature 17+,Books & Reference
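Size and Installs are stored as strings such as '19M', 'Varies with device' and '10,000+'. If they were to be used as numeric predictors, they would first need parsing. The sketch below is one possible approach under stated assumptions: the helper names are hypothetical, the 'k' suffix is assumed (it does not appear in the rows shown above), and unparseable sizes are mapped to NaN.

```python
import numpy as np
import pandas as pd

def parse_size_mb(value):
    """Convert strings such as '19M' to megabytes; anything else becomes NaN."""
    if isinstance(value, str):
        if value.endswith("M"):
            return float(value[:-1])
        if value.endswith("k"):  # assumed kilobyte suffix
            return float(value[:-1]) / 1024
    return np.nan

def parse_installs(value):
    """Convert strings such as '10,000+' to the integer lower bound 10000."""
    return int(value.replace(",", "").rstrip("+"))

toy = pd.DataFrame({
    "Size": ["19M", "8.7M", "Varies with device"],
    "Installs": ["10,000+", "5,000,000+", "500+"],
})
toy["size_mb"] = toy["Size"].map(parse_size_mb)
toy["installs_min"] = toy["Installs"].map(parse_installs)
print(toy)
```

This notebook takes a different route: Installs is one-hot encoded as a category and Size is dropped before modelling.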


In [16]:
app_reviews.isnull().sum(axis=0).reset_index()


Unnamed: 0,index,0
0,App,0
1,Sentiment,26863


In [17]:
app_reviews.dtypes

App          object
Sentiment    object
dtype: object

Cleaning the second dataset: drop every review that has no Sentiment label.

In [18]:
app_reviews = app_reviews.dropna(axis=0, how='any')
app_reviews

Unnamed: 0,App,Sentiment
0,10 Best Foods for You,Positive
1,10 Best Foods for You,Positive
3,10 Best Foods for You,Positive
4,10 Best Foods for You,Positive
5,10 Best Foods for You,Positive
...,...,...
64222,Housing-Real Estate & Property,Positive
64223,Housing-Real Estate & Property,Positive
64226,Housing-Real Estate & Property,Negative
64227,Housing-Real Estate & Property,Positive


Merge the two cleaned datasets on their common column, 'App'.

In [19]:
# Join the datasets: keep only apps that appear in both tables
joined_data = pd.merge(app, app_reviews, on='App', how='inner')
joined_data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,content_rating,Genres,Sentiment
0,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,Everyone,Art & Design;Pretend Play,Negative
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,Everyone,Art & Design;Pretend Play,Negative
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,Everyone,Art & Design;Pretend Play,Neutral
3,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,Everyone,Art & Design;Pretend Play,Positive
4,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,Everyone,Art & Design;Pretend Play,Negative
...,...,...,...,...,...,...,...,...,...,...
72571,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,Varies with device,"10,000,000+",Free,Everyone,Photography,Positive
72572,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,Varies with device,"10,000,000+",Free,Everyone,Photography,Positive
72573,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,Varies with device,"10,000,000+",Free,Everyone,Photography,Positive
72574,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,Varies with device,"10,000,000+",Free,Everyone,Photography,Neutral
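The merged table has more than 72,000 rows, which is more than the 64,295 rows of the raw review table. An inner join can only grow like this when the key is repeated on both sides, and the app table does have fewer distinct App values (9,659) than rows (10,840). The toy sketch below, with made-up data, shows the effect and one possible remedy; whether de-duplication is appropriate here is an assumption left to the analyst.

```python
import pandas as pd

apps = pd.DataFrame({
    "App": ["A", "A", "B"],  # "A" is listed twice
    "Rating": [4.0, 4.0, 3.5],
})
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C"],
    "Sentiment": ["Positive", "Negative", "Neutral", "Positive", "Positive"],
})

# Every "A" row in apps pairs with every "A" review: 2 * 3 rows, plus 1 for "B"
joined = pd.merge(apps, reviews, on="App", how="inner")
print(len(joined))

# Keeping one row per app first yields exactly one joined row per matching review
deduped = pd.merge(apps.drop_duplicates(subset="App"), reviews, on="App", how="inner")
print(len(deduped))
```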


Using One-Hot Encoding

In [20]:
# Select only the categorical variables
cat_joined_data = joined_data[['Category', 'Installs', 'Type', 'content_rating', 'Genres', 'Sentiment']]

In [21]:
from sklearn.preprocessing import OneHotEncoder
# Create an encoder that returns a dense array instead of a sparse matrix
encoder = OneHotEncoder(sparse=False)
# Fit the encoder to the columns we want to transform
encoder.fit(cat_joined_data)

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=False)

In [22]:
# Transform these columns
cat_cols_1hot = encoder.transform(cat_joined_data)
cat_cols_1hot

array([[1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [23]:
# Convert the result back to a DataFrame
cat_cols_1hot_df = pd.DataFrame(cat_cols_1hot, columns=encoder.get_feature_names())
cat_cols_1hot_df

Unnamed: 0,x0_ART_AND_DESIGN,x0_AUTO_AND_VEHICLES,x0_BEAUTY,x0_BOOKS_AND_REFERENCE,x0_BUSINESS,x0_COMICS,x0_COMMUNICATION,x0_DATING,x0_EDUCATION,x0_ENTERTAINMENT,...,x4_Sports;Action & Adventure,x4_Strategy,x4_Tools,x4_Travel & Local,x4_Travel & Local;Action & Adventure,x4_Video Players & Editors,x4_Weather,x5_Negative,x5_Neutral,x5_Positive
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
72572,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
72573,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
72574,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
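The `sparse` argument and `get_feature_names()` used above belong to older scikit-learn releases; newer releases use `sparse_output` and `get_feature_names_out()` instead. As a version-independent sketch of the same idea (not the method used in this notebook), `pd.get_dummies` produces one indicator column per category and names each column after its source column. The toy frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Reviews": [967, 159, 87510],
    "Type": ["Free", "Paid", "Free"],
    "Sentiment": ["Negative", "Positive", "Neutral"],
})

# Encode only the listed columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Type", "Sentiment"], dtype=float)
print(encoded.columns.tolist())
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less suited to encoding new data consistently later on.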


In [26]:
# Replace the original categorical columns with their one-hot encoded versions
cat_names = ['Category', 'Installs', 'Type', 'content_rating', 'Genres', 'Sentiment']
joined_data = pd.concat([joined_data.drop(cat_names, axis=1), cat_cols_1hot_df], axis=1)

In [45]:
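# Features: drop the target (Rating), the app name, and Size, which is still a string column (e.g. '19M', 'Varies with device')
X = joined_data.drop(['Rating', 'App', 'Size'], axis=1)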

In [44]:
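# This cell ran as In [44], before Size was dropped in In [45], so the listing below still includes 'Size' and reports 125 columns
X.columns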

Index(['Reviews', 'Size', 'x0_ART_AND_DESIGN', 'x0_AUTO_AND_VEHICLES',
       'x0_BEAUTY', 'x0_BOOKS_AND_REFERENCE', 'x0_BUSINESS', 'x0_COMICS',
       'x0_COMMUNICATION', 'x0_DATING',
       ...
       'x4_Sports;Action & Adventure', 'x4_Strategy', 'x4_Tools',
       'x4_Travel & Local', 'x4_Travel & Local;Action & Adventure',
       'x4_Video Players & Editors', 'x4_Weather', 'x5_Negative', 'x5_Neutral',
       'x5_Positive'],
      dtype='object', length=125)

In [41]:
y = joined_data['Rating']

In [46]:
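# Hold out 30% of the rows for testing; no random_state is set, so the split and the scores below vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)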


(50803, 124) (50803,)
(21773, 124) (21773,)
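Because the merged table repeats each app's row once per review, a row-level random split can place rows of the same app, with the same Rating and the same app-level features, in both the training and the test set. This can make the test score optimistic. One possible alternative, sketched here on made-up data and not used in this notebook, is to split by app with scikit-learn's `GroupShuffleSplit`.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the merged table: one row per (app, review) pair
toy = pd.DataFrame({
    "App": ["A"] * 4 + ["B"] * 3 + ["C"] * 3 + ["D"] * 2,
    "Reviews": [10] * 4 + [20] * 3 + [30] * 3 + [40] * 2,
    "Rating": [4.0] * 4 + [3.5] * 3 + [4.5] * 3 + [2.6] * 2,
})
X_toy = toy[["Reviews"]]
y_toy = toy["Rating"]

# All rows of an app land on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=toy["App"]))

train_apps = set(toy["App"].iloc[train_idx])
test_apps = set(toy["App"].iloc[test_idx])
print(train_apps, test_apps)
```

With a grouped split, `test_size` refers to the share of groups (apps), not the share of rows.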


In [35]:
from sklearn.linear_model import Lasso,LinearRegression,Ridge


In [48]:
# Fit an ordinary least squares linear regression and predict on the test set
Lin_R = LinearRegression()
lin_model = Lin_R.fit(X_train, y_train)
y_predict = lin_model.predict(X_test)


In [56]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predict})
df.head(15)

Unnamed: 0,Actual,Predicted
22289,4.3,4.237835
25862,4.4,4.432558
43137,4.6,4.380386
69315,4.0,4.235869
49305,2.6,4.072698
53919,4.7,4.386543
3923,4.5,4.168204
38151,4.7,4.582389
31812,4.5,4.521642
33707,4.4,4.3988


In [54]:
# Coefficient of determination (R^2) on the test set
Lin_R.score(X_test, y_test)

0.33328719176305766
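For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, so a value of about 0.33 means the model accounts for roughly a third of the variance of Rating in the test set. The sketch below computes R^2 and RMSE by hand on toy numbers; the helper names `r_squared` and `rmse` are hypothetical and the inputs are not taken from the results above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [1.0, 2.0, 3.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0])   # predictions equal the targets
baseline = r_squared(y_true, [2.0, 2.0, 2.0])  # always predict the mean
print(perfect, baseline, rmse(y_true, [2.0, 2.0, 2.0]))
```

Predicting the mean for every row gives R^2 = 0, which makes it a useful baseline to compare the fitted model against.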