The goal of this project is to predict User Ratings of a google play store application based on predictor variables. 
We are using 3 datasets:
Google Play Store: This dataset is downloaded from Kaggle, which has information of all the apps on the google play store
Google Play Store User Reviews: This dataset has data about the user review of the apps on the google store
Ratings result: This dataset is created by us, which maps the rating scale and to its labels.

In [51]:
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import numpy as np

In [52]:
#Importing all the datasets
app = pd.read_csv('googleplaystore.csv')
app_reviews = pd.read_csv('googleplaystore_user_reviews.csv')
app_ratings = pd.read_csv('Rating_result.csv')
app = app.rename(columns={"Content Rating": "content_rating","Current Ver": "current_ver", "Last Updated": "last_updated"})

In [53]:
for col in app_reviews:
    print(col)

App
Translated_Review
Sentiment
Sentiment_Polarity
Sentiment_Subjectivity


In [54]:
app.shape

(10840, 13)

In [55]:
app_reviews.shape

(64295, 5)

In [56]:
app.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
content_rating     object
Genres             object
last_updated       object
current_ver        object
Android Ver        object
dtype: object

So apart from Rating all others could be categorical.
Reviews, Size are continous variables and would not considered as categorical variables
We would check the categorical columns for number of distinct values -- a single unique value or an extremely high number would be bad.

In [57]:
#Function to convert size column to integer, in megabytes
def convert_bytes(size):
    if 'M' in size:
        x = size[:-1]
        x = float(x) * 1000
        return(float(x))
    elif 'k' in size:
        return size[:-1]
    else:
        return 0

 
app["Size"] = app["Size"].map(convert_bytes)
app.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
content_rating     object
Genres             object
last_updated       object
current_ver        object
Android Ver        object
dtype: object

In [58]:
cat_columns = ['App', 'Category', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'content_rating', 'Genres', 'last_updated', 'current_ver',
       'Android Ver']
# Use apply to check the data type and call the .nunique() method on each column
app[cat_columns].apply(lambda col: col.nunique())

App               9659
Category            33
Reviews           6001
Size               460
Installs            21
Type                 2
Price               92
content_rating       6
Genres             119
last_updated      1377
current_ver       2783
Android Ver         33
dtype: int64

We observe that App, Current Version, Last Updated have an extremely high number of distinct values.
Based on the domain knowledge we find out that the Price is unhelpful in this modelling problem.
So we are going to drop these columns in our model.

In [59]:
#Data cleaning
#Dropping columns which are not useful for determining the response variable
app = app.drop(['current_ver','last_updated','Android Ver','Price'] , axis = 1)
app_reviews = app_reviews.drop(['Translated_Review','Sentiment_Polarity','Sentiment_Subjectivity'] , axis = 1)

In [60]:
app.isnull().sum(axis=0).reset_index()

Unnamed: 0,index,0
0,App,0
1,Category,0
2,Rating,1474
3,Reviews,0
4,Size,0
5,Installs,0
6,Type,1
7,content_rating,0
8,Genres,0


Cleaning the 1st Dataset

In [61]:
app=app.dropna(axis=0,how='any')
app

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,content_rating,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000,"10,000+",Free,Everyone,Art & Design
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000,"500,000+",Free,Everyone,Art & Design;Pretend Play
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700,"5,000,000+",Free,Everyone,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000,"50,000,000+",Free,Teen,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800,"100,000+",Free,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...
10833,FR Calculator,FAMILY,4.0,7,2600,500+,Free,Everyone,Education
10835,Sya9a Maroc - FR,FAMILY,4.5,38,53000,"5,000+",Free,Everyone,Education
10836,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3600,100+,Free,Everyone,Education
10838,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,0,"1,000+",Free,Mature 17+,Books & Reference


In [62]:
app_reviews.isnull().sum(axis=0).reset_index()


Unnamed: 0,index,0
0,App,0
1,Sentiment,26863


In [63]:
app_reviews.dtypes

App          object
Sentiment    object
dtype: object

Cleaning the 2nd Dataset.

In [64]:
app_reviews=app_reviews.dropna(axis=0,how='any')
app_reviews

Unnamed: 0,App,Sentiment
0,10 Best Foods for You,Positive
1,10 Best Foods for You,Positive
3,10 Best Foods for You,Positive
4,10 Best Foods for You,Positive
5,10 Best Foods for You,Positive
...,...,...
64222,Housing-Real Estate & Property,Positive
64223,Housing-Real Estate & Property,Positive
64226,Housing-Real Estate & Property,Negative
64227,Housing-Real Estate & Property,Positive


Merging both the cleaned datasets by using the common attribute 'App' in both the tables.

In [77]:
#Joining the datasets
joined_data = pd.merge(app,app_reviews,on='App',how='inner')
joined_data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,content_rating,Genres,Sentiment
0,Coloring book moana,ART_AND_DESIGN,3.9,967,14000,"500,000+",Free,Everyone,Art & Design;Pretend Play,Negative
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000,"500,000+",Free,Everyone,Art & Design;Pretend Play,Negative
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14000,"500,000+",Free,Everyone,Art & Design;Pretend Play,Neutral
3,Coloring book moana,ART_AND_DESIGN,3.9,967,14000,"500,000+",Free,Everyone,Art & Design;Pretend Play,Positive
4,Coloring book moana,ART_AND_DESIGN,3.9,967,14000,"500,000+",Free,Everyone,Art & Design;Pretend Play,Negative
...,...,...,...,...,...,...,...,...,...,...
72571,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,0,"10,000,000+",Free,Everyone,Photography,Positive
72572,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,0,"10,000,000+",Free,Everyone,Photography,Positive
72573,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,0,"10,000,000+",Free,Everyone,Photography,Positive
72574,A+ Gallery - Photos & Videos,PHOTOGRAPHY,4.5,223941,0,"10,000,000+",Free,Everyone,Photography,Neutral


In [90]:
# Apps Category getting the highest number of Sentiments
joined_data.groupby('Category').size().nlargest(5).reset_index(name='frequency')

Unnamed: 0,Category,frequency
0,GAME,19125
1,FAMILY,5910
2,HEALTH_AND_FITNESS,4508
3,SPORTS,3504
4,DATING,3198


In [None]:
joined_data.pivot_table(columns='Sentiment',index='',values='App')

Using One-Hot Encoding

In [78]:
## selecting only the categorical variables
cat_joined_data= joined_data[['Category','Installs','Type','content_rating','Genres','Sentiment']]

In [75]:
from sklearn.preprocessing import OneHotEncoder
#Create an Encoder
encoder=OneHotEncoder(sparse=False)
#Fit the encoder to the columns we want to transform
encoder.fit(cat_joined_data)

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=False)

In [68]:
#Transform these columns
cat_cols_1hot=encoder.transform(cat_joined_data)
cat_cols_1hot

array([[1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [69]:
#Convert the results back to dataframe
cat_cols_1hot_df=pd.DataFrame(cat_cols_1hot,columns=encoder.get_feature_names())
cat_cols_1hot_df

Unnamed: 0,x0_ART_AND_DESIGN,x0_AUTO_AND_VEHICLES,x0_BEAUTY,x0_BOOKS_AND_REFERENCE,x0_BUSINESS,x0_COMICS,x0_COMMUNICATION,x0_DATING,x0_EDUCATION,x0_ENTERTAINMENT,...,x4_Sports;Action & Adventure,x4_Strategy,x4_Tools,x4_Travel & Local,x4_Travel & Local;Action & Adventure,x4_Video Players & Editors,x4_Weather,x5_Negative,x5_Neutral,x5_Positive
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
72572,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
72573,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
72574,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [70]:
# Append them back to the original
joined_data=pd.concat([joined_data.drop(['Category','Installs','Type','content_rating','Genres','Sentiment'],axis=1),cat_cols_1hot_df],axis=1)

In [71]:
X=joined_data.drop(['Rating','App'], axis=1)

KeyError: "['Genres'] not found in axis"

In [22]:
X.columns

Index(['Reviews', 'Size', 'x0_ART_AND_DESIGN', 'x0_AUTO_AND_VEHICLES',
       'x0_BEAUTY', 'x0_BOOKS_AND_REFERENCE', 'x0_BUSINESS', 'x0_COMICS',
       'x0_COMMUNICATION', 'x0_DATING',
       ...
       'x4_Sports;Action & Adventure', 'x4_Strategy', 'x4_Tools',
       'x4_Travel & Local', 'x4_Travel & Local;Action & Adventure',
       'x4_Video Players & Editors', 'x4_Weather', 'x5_Negative', 'x5_Neutral',
       'x5_Positive'],
      dtype='object', length=125)

In [23]:
y = joined_data['Rating']

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)


(50803, 125) (50803,)
(21773, 125) (21773,)


In [38]:
from sklearn.linear_model import Lasso,LinearRegression,Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
from sklearn.metrics import confusion_matrix

In [98]:
# Building a Linear Regression
Lin_R=LinearRegression()
lin_model = Lin_R.fit(X_train,y_train)
y_predict=lin_model.predict(X_test)


In [99]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predict})
df.head(15)

Unnamed: 0,Actual,Predicted
1856,4.1,4.391855
27868,4.2,4.418661
25854,4.4,4.394914
14731,3.6,4.290714
6061,4.1,3.964087
58245,3.5,4.293734
37552,4.5,4.459729
6804,4.4,4.141243
68724,4.5,4.241156
2080,4.4,4.278517


In [100]:
Lin_R.score(X_test, y_test)

0.33526711908592877

In [104]:
# Building a Lasso Regression
Lasso_R=Lasso()
lasso_model = Lasso_R.fit(X_train,y_train)
y_predict_lasso=lasso_model.predict(X_test)


In [106]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predict_lasso})
df.head(15)

Unnamed: 0,Actual,Predicted
1856,4.1,4.31801
27868,4.2,4.330322
25854,4.4,4.446956
14731,3.6,4.324275
6061,4.1,4.326392
58245,3.5,4.321037
37552,4.5,4.354577
6804,4.4,4.32186
68724,4.5,4.318178
2080,4.4,4.31724


In [107]:
Lasso_R.score(X_test, y_test)

0.016269779422790176

In [26]:
# Decision Tree
Dt = DecisionTreeRegressor()
dt_model = Dt.fit(X_train,y_train)
y_predict_dt=dt_model.predict(X_test)


In [27]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predict_dt})
df.head(15)

Unnamed: 0,Actual,Predicted
37492,4.5,4.5
45330,4.3,4.3
5567,3.7,3.7
16785,4.5,4.5
68487,3.8,3.8
27713,4.2,4.2
33689,4.4,4.4
57799,4.5,4.5
13260,3.9,3.9
43007,4.6,4.6


In [28]:
Dt.score(X_test, y_test)

0.9986792190734259

In [41]:
dt_species = np.array(y_test)
dt_predictions = np.array(y_predict_dt)


In [29]:
# Random Forest
Rf=RandomForestRegressor()
Rf_model = Rf.fit(X_train,y_train)
y_predict_rf=Rf_model.predict(X_test)



In [30]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predict_rf})
df.head(15)

Unnamed: 0,Actual,Predicted
37492,4.5,4.5
45330,4.3,4.3
5567,3.7,3.7
16785,4.5,4.5
68487,3.8,3.8
27713,4.2,4.2
33689,4.4,4.4
57799,4.5,4.5
13260,3.9,3.9
43007,4.6,4.6


In [31]:
Rf.score(X_test, y_test)

0.9985652427550747