# Machine learning to predict app ratings

This is a continuation of the project using the Google Play Store data. I previously cleaned and preprocessed the data and saved the result to a separate CSV file.

# Libraries

In [1]:
import pandas as pd
import numpy as np

## Normalising data

Some of the columns have a huge difference between their largest and smallest values, so it would be nice to normalise them before running machine learning. Here I standardise them (zero mean, unit variance) with `StandardScaler`.

In [2]:
# load the data
df = pd.read_csv("Processed_google_df.csv")
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Category_num,Types_num,Content_num,Ratings_imp,Size_Mb,Installs_log
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,10000,Free,0.0,Everyone,Art & Design,2018-01-07,1.0.0,4.0.3 and up,0,0,1,4.1,19.0,9.21044
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,500000,Free,0.0,Everyone,Art & Design;Pretend Play,2018-01-15,2.0.0,4.0.3 and up,0,0,1,3.9,14.0,13.122365
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,5000000,Free,0.0,Everyone,Art & Design,2018-08-01,1.2.4,4.0.3 and up,0,0,1,4.7,8.7,15.424949
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,50000000,Free,0.0,Teen,Art & Design,2018-06-08,Varies with device,4.2 and up,0,0,4,4.5,25.0,17.727534
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,100000,Free,0.0,Everyone,Art & Design;Creativity,2018-06-20,1.1,4.4 and up,0,0,1,4.3,2.8,11.512935


In [3]:
# library
from sklearn import preprocessing

# normalise
cols_to_normalise = ["Size_Mb", "Installs", "Price", "Reviews"]
scaler = preprocessing.StandardScaler()
for col in cols_to_normalise:
    new_col = col + "_scaled"
    x = df[col].values.reshape(-1, 1)
    x_scaled = scaler.fit_transform(x)
    df[new_col] = x_scaled
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,...,Category_num,Types_num,Content_num,Ratings_imp,Size_Mb,Installs_log,Size_Mb_scaled,Installs_scaled,Price_scaled,Reviews_scaled
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,10000,Free,0.0,Everyone,Art & Design,...,0,0,1,4.1,19.0,9.21044,-0.115361,-0.144491,-0.065232,-0.118192
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,500000,Free,0.0,Everyone,Art & Design;Pretend Play,...,0,0,1,3.9,14.0,13.122365,-0.344851,-0.135375,-0.065232,-0.117751
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,5000000,Free,0.0,Everyone,Art & Design,...,0,0,1,4.7,8.7,15.424949,-0.58811,-0.051659,-0.065232,-0.070489
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,50000000,Free,0.0,Teen,Art & Design,...,0,0,4,4.5,25.0,17.727534,0.160027,0.785508,-0.065232,-0.000513
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,100000,Free,0.0,Everyone,Art & Design;Creativity,...,0,0,1,4.3,2.8,11.512935,-0.858908,-0.142817,-0.065232,-0.117751
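
One caveat with the cell above: the scaler is fitted on the whole dataset before the train/test split, so the test rows influence the mean and standard deviation. A minimal sketch of the alternative is below. It uses a small made-up frame (`toy` and its values are illustrative, not the real Play Store data), fits `StandardScaler` on the training rows only, and scales several columns in one call instead of looping:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical toy data standing in for the real columns
toy = pd.DataFrame({
    "Size_Mb": [19.0, 14.0, 8.7, 25.0, 2.8, 5.6, 29.0, 33.0],
    "Reviews": [159, 967, 87510, 215644, 967, 167, 178, 36815],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

toy_scaler = StandardScaler()
# learn the mean and standard deviation from the training rows only
train_scaled = pd.DataFrame(
    toy_scaler.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
# reuse the training statistics on the test rows
test_scaled = pd.DataFrame(
    toy_scaler.transform(toy_test), columns=toy.columns, index=toy_test.index
)
```

For tree-based models the difference is probably small, but it matters more for distance-based models such as KNN or SVR.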


## Subset the training and test data

In [4]:
# load libraries
from sklearn.model_selection import train_test_split

# subset the df for the features
variables = ['Reviews_scaled', 'Price_scaled', 'Category_num', 'Types_num', 'Content_num',
       'Ratings_imp', 'Size_Mb_scaled', 'Installs_scaled']
new_df = df[variables]
x = new_df.drop("Ratings_imp", axis=1)
y = new_df["Ratings_imp"]
x_train, x_test, y_train, y_test = train_test_split(x,y, train_size=0.7, test_size=0.3, random_state=0)

I've encoded the categorical data such as Category, Type and Content Rating into numerical codes. I also created a log-scaled Installs column (Installs_log), although the models below use the standardised Installs_scaled column. The code for these steps is in the previous notebook. I've also shown previously that the variables are not normally distributed, so I won't be applying linear regression.

# Feature selection

## Recursive Feature Selection: GradientBoostingRegressor

In [5]:
# the independent variables
ind_var = x.columns.tolist()

# import libraries
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor

rfe_selector = RFE(estimator=GradientBoostingRegressor(), n_features_to_select=5, step=10, verbose=5)  # step=number features to remove at each iteration
rfe_selector.fit(x, y)
rfe_support = rfe_selector.get_support()
rfe_feature = [var for var, boolean in zip(ind_var, rfe_support) if boolean == True]
rfe_feature

Fitting estimator with 7 features.


['Reviews_scaled',
 'Price_scaled',
 'Category_num',
 'Content_num',
 'Installs_scaled']
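
In the cell above, `step=10` is larger than the number of features, so RFE appears to drop the two weakest features in a single pass (which would explain the single "Fitting estimator" message). The sketch below, on synthetic data from `make_regression` (an assumption for illustration, not the app data), uses `step=1` so the estimator is refitted after each removal, and reads the elimination order from `ranking_`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# synthetic stand-in: 7 features, only 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=300, n_features=7, n_informative=3, noise=5.0, random_state=0
)

demo_rfe = RFE(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=0),
    n_features_to_select=5,
    step=1,  # remove one feature per iteration, refitting each time
)
demo_rfe.fit(X_demo, y_demo)

print(demo_rfe.support_)   # boolean mask of the kept features
print(demo_rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```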

## SelectFromModel: RandomForestRegressor

In [6]:
# correlation matrix
x.corr()

Unnamed: 0,Reviews_scaled,Price_scaled,Category_num,Types_num,Content_num,Size_Mb_scaled,Installs_scaled
Reviews_scaled,1.0,-0.007597,0.017293,-0.033076,0.05563,0.176755,0.625165
Price_scaled,-0.007597,1.0,-0.01377,0.223868,-0.014485,-0.024251,-0.009404
Category_num,0.017293,-0.01377,1.0,0.016989,-0.093881,-0.093615,0.031655
Types_num,-0.033076,0.223868,0.016989,1.0,-0.041921,-0.030307,-0.041746
Content_num,0.05563,-0.014485,-0.093881,-0.041921,1.0,0.175586,0.049814
Size_Mb_scaled,0.176755,-0.024251,-0.093615,-0.030307,0.175586,1.0,0.132662
Installs_scaled,0.625165,-0.009404,0.031655,-0.041746,0.049814,0.132662,1.0


In [7]:
# no pair of variables is very highly correlated; the strongest is Reviews_scaled vs Installs_scaled (about 0.63)

# libraries
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

rf_selector = SelectFromModel(RandomForestRegressor(), max_features=5)
rf_selector.fit(x, y)

rf_support = rf_selector.get_support()
rf_features = [var for var, boolean in zip(ind_var, rf_support) if boolean == True]
rf_features

['Reviews_scaled', 'Category_num', 'Size_Mb_scaled', 'Installs_scaled']
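
Note that `max_features=5` is only an upper bound. With the default threshold, `SelectFromModel` first drops features whose importance is below the mean importance, which is presumably why only 4 features come back here. To select purely by count, scikit-learn's documentation suggests setting `threshold=-np.inf`. A small sketch on synthetic data (the `make_regression` data and the names below are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# synthetic stand-in: 7 features, only 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=300, n_features=7, n_informative=3, noise=5.0, random_state=0
)

count_selector = SelectFromModel(
    RandomForestRegressor(n_estimators=30, random_state=0),
    threshold=-np.inf,  # disable the importance threshold
    max_features=5,     # keep the 5 most important features
)
count_selector.fit(X_demo, y_demo)
n_selected = count_selector.get_support().sum()
print(n_selected)
```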

## SelectFromModel: XGBoostRegressor

In [8]:
# library
from xgboost import XGBRegressor
model = XGBRegressor(objective='reg:squarederror')

xg_selector = SelectFromModel(model, max_features=5)
xg_selector.fit(x, y)

xg_support = xg_selector.get_support()
xg_features = [var for var, boolean in zip(ind_var, xg_support) if boolean == True]
xg_features


['Price_scaled', 'Category_num', 'Content_num', 'Installs_scaled']

## Mutual Information

Mutual information measures how much knowing one variable reduces the uncertainty about another. Unlike Pearson correlation, it can also pick up non-linear relationships.

In [9]:
# library
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import SelectKBest

selector = SelectKBest(mutual_info_regression, k=5)
new_training = selector.fit_transform(x, y)
mi_mask = selector.get_support()
mi_features = [var for var, boolean in zip(ind_var, mi_mask) if boolean == True]
mi_features

['Reviews_scaled',
 'Price_scaled',
 'Category_num',
 'Size_Mb_scaled',
 'Installs_scaled']
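
To illustrate the point about non-linear relationships, here is a small sketch on made-up data (not the app data): `y_demo` depends on `x_demo` through a square, so the Pearson correlation is close to zero while the mutual information is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, size=2000)
# quadratic dependence with a little noise: strong, but not linear
y_demo = x_demo ** 2 + rng.normal(0, 0.01, size=2000)

pearson = np.corrcoef(x_demo, y_demo)[0, 1]
mi = mutual_info_regression(x_demo.reshape(-1, 1), y_demo, random_state=0)[0]

print(f"pearson is {pearson}, mutual information is {mi}")
```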

## Combining all of them

In [10]:
data_dict = {
    "Features": ind_var,
    "GBRegressor": rfe_support,
    "RFRegressor": rf_support,
    "XGBRegressor": xg_support,
    "MI": mi_mask
}
feature_selection_df = pd.DataFrame(data_dict)
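# count how many methods selected each feature (sum only the boolean columns)
feature_selection_df["Total"] = feature_selection_df[["GBRegressor", "RFRegressor", "XGBRegressor", "MI"]].sum(axis=1)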

# sort
feature_selection_df = feature_selection_df.sort_values(["Total", "Features"], ascending=False)
feature_selection_df=feature_selection_df.reset_index(drop=True)
feature_selection_df

Unnamed: 0,Features,GBRegressor,RFRegressor,XGBRegressor,MI,Total
0,Installs_scaled,True,True,True,True,4
1,Category_num,True,True,True,True,4
2,Reviews_scaled,True,True,False,True,3
3,Price_scaled,True,False,True,True,3
4,Size_Mb_scaled,False,True,False,True,2
5,Content_num,True,False,True,False,2
6,Types_num,False,False,False,False,0


It appears that Installs and Category are the most consistently important contributors to predicting the ratings: all 4 feature selection methods picked them.

There is one caveat to the method I was using, however. I applied label encoding to turn the categorical variables into numbers, so during training a model may treat them as ordinal, i.e. as if category 2 were "more" than category 1. However, we have 33 levels in Category alone, which is a bit much for one-hot encoding. We'll experiment with one-hot encoding afterwards to see if there's a difference.
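
The difference between the two encodings can be sketched on a tiny made-up column. I'm using pandas category codes here as a stand-in for the label encoding done in the previous notebook, which is an assumption; the values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Category": ["GAME", "TOOLS", "FAMILY", "GAME", "MEDICAL"]})

# label encoding: one integer per level, which implies an arbitrary order
toy["Category_num"] = toy["Category"].astype("category").cat.codes

# one-hot encoding: one indicator column per level, no implied order
toy_dummies = pd.get_dummies(toy["Category"], prefix="Category")

print(toy)
print(toy_dummies)
```

With label encoding the model sees a single numeric column, while one-hot encoding adds one column per level, which is why 33 categories become 33 columns.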

# Machine Learning Models

I decided to use Random Forest Regressor and XGBoost Regressor. RF is a popular algorithm because it's fast and generally gives pretty accurate results. It does have limitations: in particular, it doesn't extrapolate well beyond the range of targets seen in training. As for XGBoost, I just want to try it.
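
The extrapolation limitation can be seen in a small sketch on made-up data (a straight line, not the app data). A forest predicts by averaging training targets, so its prediction can never exceed the largest target it was trained on:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy data: y = 2x for x between 0 and 9.9
X_line = np.arange(0, 10, 0.1).reshape(-1, 1)
y_line = 2 * X_line[:, 0]

line_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_line, y_line)

inside = line_model.predict([[5.0]])[0]    # within the training range
outside = line_model.predict([[20.0]])[0]  # far outside the training range

print(f"prediction at x=5 is {inside}, prediction at x=20 is {outside}")
print(f"largest training target is {y_line.max()}")
```

For app ratings, which are bounded between 1 and 5, this is less of a concern than it would be for an unbounded target.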

## With all variables

### Random Forest Regressor

In [11]:
# model
rf_model = RandomForestRegressor(max_leaf_nodes=100, random_state=0)
rf_model.fit(x_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=100,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

In [12]:
# libraries
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rf_pred = rf_model.predict(x_test)
r2_rf = r2_score(y_test, rf_pred)
msr_rf = mean_squared_error(y_test, rf_pred)
mae_rf = mean_absolute_error(y_test, rf_pred)
r2_rf, msr_rf, mae_rf
print(f"r2 is {r2_rf}, mae is {mae_rf} and msr is {msr_rf}.")

r2 is 0.2663787790090971, mae is 0.35733750397542874 and msr is 0.2500778304548192.


In [13]:
rf_model.score(x_train, y_train)

0.45126532160548605

In [14]:
# experiment with different max_leaf_node
leaf_nodes = [5,10,100, 500, 1000, 5000, 10000, None]
for nodes in leaf_nodes:
    tmp_model = RandomForestRegressor(max_leaf_nodes=nodes, random_state=0)
    tmp_model.fit(x_train, y_train)
    tmp_pred = tmp_model.predict(x_test)
    tmp_r2 = r2_score(y_test, tmp_pred)
    tmp_mae = mean_absolute_error(y_test, tmp_pred)
    print(f"Leaf nodes is {nodes}, r2 is {tmp_r2} and mae is {tmp_mae}.")  # the best score achieved is when max_leaf_nodes is 500

Leaf nodes is 5, r2 is 0.0957725407727712 and mae is 0.40031693315129185.
Leaf nodes is 10, r2 is 0.14661349629197118 and mae is 0.3913406641186324.
Leaf nodes is 100, r2 is 0.2663787790090971 and mae is 0.35733750397542874.
Leaf nodes is 500, r2 is 0.28847541086249673 and mae is 0.3445534238296447.
Leaf nodes is 1000, r2 is 0.2792830150910389 and mae is 0.3453183903148569.
Leaf nodes is 5000, r2 is 0.2698598875639675 and mae is 0.3469667130192774.
Leaf nodes is 10000, r2 is 0.2698598875639675 and mae is 0.3469667130192774.
Leaf nodes is None, r2 is 0.26899070280145243 and mae is 0.34781902623765165.


### XGBoost

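I understand that Random Forest and XGBoost are similar in that both use decision trees as their building blocks, but they differ in how the trees are built and combined: a random forest grows its trees independently on bootstrap samples and averages their predictions, whereas boosting grows trees one after another, each one trying to correct the errors of the trees before it. https://www.datasciencecentral.com/profiles/blogs/decision-tree-vs-random-forest-vs-boosted-trees-explained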
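
The difference in how the trees are combined can be sketched with plain `DecisionTreeRegressor`s. This is a simplified illustration on made-up data, not what `RandomForestRegressor` or XGBoost do internally (both add feature subsampling, regularisation and more); the learning rate and tree depth are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 6, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=300)

# bagging (random forest style): independent trees on bootstrap samples, averaged
bag_preds = []
for i in range(20):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))  # sample rows with replacement
    tree = DecisionTreeRegressor(max_depth=3, random_state=i).fit(X_toy[idx], y_toy[idx])
    bag_preds.append(tree.predict(X_toy))
bagged = np.mean(bag_preds, axis=0)

# boosting: trees fitted one after another to the current residuals, summed
learning_rate = 0.3
boosted = np.full(len(y_toy), y_toy.mean())
for i in range(20):
    residual = y_toy - boosted
    tree = DecisionTreeRegressor(max_depth=3, random_state=i).fit(X_toy, residual)
    boosted += learning_rate * tree.predict(X_toy)

bagged_mse = np.mean((y_toy - bagged) ** 2)
boosted_mse = np.mean((y_toy - boosted) ** 2)
print(f"training mse: bagged is {bagged_mse}, boosted is {boosted_mse}")
```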

In [15]:
# libraries 
import xgboost as xg

xg_model = xg.XGBRegressor(objective="reg:squaredlogerror", random_state=0, eta=0.25, max_depth=8)
xg_model.fit(x_train, y_train)

xg_pred = xg_model.predict(x_test)
xg_r2 = r2_score(y_test, xg_pred)
xg_msr = mean_squared_error(y_test, xg_pred)
xg_mae = mean_absolute_error(y_test, xg_pred)
print(f"r2 is {xg_r2}, mae is {xg_mae} and msr is {xg_msr}.") 
# even with eta and max_depth set by hand, the xg model performs slightly worse than random forest

r2 is 0.23722900655660106, mae is 0.3621262373268132 and msr is 0.26001444576063826.


## Using only 4 variables

Using only 4 variables chosen after feature selection (Category_num, Content_num, Reviews_scaled and Installs_scaled), let's test and see if the models fare better!

In [16]:
# subset
new_var = ["Category_num", "Content_num", "Reviews_scaled", "Installs_scaled", "Ratings_imp"]
fs_df = df[new_var]  # fs is feature selection
fs_x = fs_df.drop("Ratings_imp", axis=1)
fs_y = fs_df["Ratings_imp"]

# train_test_split
fs_x_train, fs_x_test, fs_y_train, fs_y_test = train_test_split(fs_x,fs_y, train_size=0.7, test_size=0.3, random_state=0)

### RandomForestRegressor

In [17]:
fs_rf_model = RandomForestRegressor(max_leaf_nodes=100, random_state=0)
fs_rf_model.fit(fs_x_train, fs_y_train)

fs_rf_pred = fs_rf_model.predict(fs_x_test)
fs_r2_rf = r2_score(fs_y_test, fs_rf_pred)
fs_msr_rf = mean_squared_error(fs_y_test, fs_rf_pred)
fs_mae_rf = mean_absolute_error(fs_y_test, fs_rf_pred)
print(f"r2 is {fs_r2_rf}, mae is {fs_mae_rf} and msr is {fs_msr_rf}.")  

# compared to using all variables, using only 4 variables fares slightly better: r2 is higher, and msr and mae are a bit smaller

r2 is 0.27535109094579546, mae is 0.3521197209149075 and msr is 0.24701933618135377.


### XGBoost

In [18]:
fs_xg_model = xg.XGBRegressor(objective="reg:squaredlogerror", random_state=0, eta=0.25, max_depth=8)
fs_xg_model.fit(fs_x_train, fs_y_train)

fs_xg_pred = fs_xg_model.predict(fs_x_test)
fs_r2_xg = r2_score(fs_y_test, fs_xg_pred)
fs_msr_xg = mean_squared_error(fs_y_test, fs_xg_pred)
fs_mae_xg = mean_absolute_error(fs_y_test, fs_xg_pred)
print(f"r2 is {fs_r2_xg}, mae is {fs_mae_xg} and msr is {fs_msr_xg}.") 
# when using 4 variables, the metrics are better than when using all variables, but still not better than random forest

r2 is 0.24815473371158425, mae is 0.3584163443848441 and msr is 0.2562900685685922.


So it seems Random Forest performs better than XGBoost in this case, but it's still not a great model. Somebody used KNN and reported 94% accuracy, so I'm going to give it a go.

### KNN

In [19]:
# library
from sklearn.neighbors import KNeighborsRegressor
knn_fs_model = KNeighborsRegressor()

knn_fs_model.fit(fs_x_train, fs_y_train)
knn_pred = knn_fs_model.predict(fs_x_test)

knn_r2 = r2_score(fs_y_test, knn_pred)
knn_msr = mean_squared_error(fs_y_test, knn_pred)
knn_mae = mean_absolute_error(fs_y_test, knn_pred)
print(f"r2 is {knn_r2}, mae is {knn_mae} and msr is {knn_msr}.") 
# nope, this doesn't look like the best model

r2 is 0.2101375248796329, mae is 0.3560316244372145 and msr is 0.2692494280208729.


### Support Vector Regression

In [20]:
# I'm curious about it so decided to give it a try
# library
from sklearn.svm import SVR

# data
variables = ["Category_num", "Content_num", "Reviews_scaled", "Installs_scaled", "Ratings_imp"]
new_df = df[variables]
x = new_df.drop("Ratings_imp", axis=1)
y = new_df["Ratings_imp"]
x_train, x_test, y_train, y_test = train_test_split(x,y, train_size=0.9, test_size=0.1, random_state=8)

# model
svr_model = SVR(kernel="poly")  # tried all the kernel options; poly seems to be the best on this data
svr_model.fit(x_train, y_train)
svr_pred = svr_model.predict(x_test)
svr_r2 = r2_score(y_test, svr_pred)
svr_mse = mean_squared_error(y_test, svr_pred)
svr_mae = mean_absolute_error(y_test, svr_pred)

print(f"r2 is {svr_r2}, mae is {svr_mae} and msr is {svr_mse}.") 

r2 is -0.025321337538159572, mae is 0.3995576515715841 and msr is 0.3595575781722654.


## Using only the variables selected by each model

What if I use only the variables selected by each model?

In [21]:
variables = ['Category_num','Ratings_imp','Installs_scaled', 'Reviews_scaled', 'Size_Mb_scaled']
new_df = df[variables]
x = new_df.drop("Ratings_imp", axis=1)
y = new_df["Ratings_imp"]
x_train, x_test, y_train, y_test = train_test_split(x,y, train_size=0.7, test_size=0.3, random_state=0)

In [22]:
# random forest
rf_model = RandomForestRegressor(random_state=0)
rf_model.fit(x_train, y_train)

rf_pred = rf_model.predict(x_test)
r2_rf = r2_score(y_test, rf_pred)
msr_rf = mean_squared_error(y_test, rf_pred)
mae_rf = mean_absolute_error(y_test, rf_pred)
r2_rf, msr_rf, mae_rf
print(f"r2 is {r2_rf}, mae is {mae_rf} and msr is {msr_rf}.")  # nope not any better

r2 is 0.1935692731259585, mae is 0.3641288622499256 and msr is 0.27489723690975937.


In [23]:
feature_selection_df

Unnamed: 0,Features,GBRegressor,RFRegressor,XGBRegressor,MI,Total
0,Installs_scaled,True,True,True,True,4
1,Category_num,True,True,True,True,4
2,Reviews_scaled,True,True,False,True,3
3,Price_scaled,True,False,True,True,3
4,Size_Mb_scaled,False,True,False,True,2
5,Content_num,True,False,True,False,2
6,Types_num,False,False,False,False,0


In [24]:
# xg
variables = ['Category_num','Ratings_imp', 'Installs_scaled', 'Content_num', 'Price_scaled']
new_df = df[variables]
x = new_df.drop("Ratings_imp", axis=1)
y = new_df["Ratings_imp"]
x_train, x_test, y_train, y_test = train_test_split(x,y, train_size=0.7, test_size=0.3, random_state=0)

xg_model = xg.XGBRegressor(objective="reg:squaredlogerror", random_state=0, eta=0.25, max_depth=8)
xg_model.fit(x_train, y_train)

xg_pred = xg_model.predict(x_test)
xg_r2 = r2_score(y_test, xg_pred)
xg_msr = mean_squared_error(y_test, xg_pred)
xg_mae = mean_absolute_error(y_test, xg_pred)
print(f"r2 is {xg_r2}, mae is {xg_mae} and msr is {xg_msr}.")  # nope, not better either

r2 is 0.16795558792457632, mae is 0.38504045950656945 and msr is 0.28362846583531115.


## Sub-final Machine Learning model

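It looks like the best model I can use at the moment is the Random Forest Regressor; on the 70/30 split above, max_leaf_nodes at 500 gave the best score. The variables are the 4 used above: Category_num, Content_num, Reviews_scaled and Installs_scaled. The cells below use the standardised columns; since tree-based models are largely insensitive to monotonic rescaling of the features, this should make little difference compared with the unscaled ones.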

In [25]:
new_df.shape

(9660, 5)

In [26]:
# get the data
variables = ["Category_num", "Content_num", "Reviews_scaled", "Installs_scaled", "Ratings_imp"]
# variables = ['Reviews', 'Price', 'Category_num', 'Types_num', 'Content_num',
#        'Ratings_imp', 'Size_Mb', 'Installs_log']
new_df = df[variables]
x = new_df.drop("Ratings_imp", axis=1)
y = new_df["Ratings_imp"]
x_train, x_test, y_train, y_test = train_test_split(x,y, train_size=0.9, test_size=0.1, random_state=8)

In [27]:
rf_model = RandomForestRegressor(max_leaf_nodes=260, random_state=8, n_jobs=-1)
rf_model.fit(x_train, y_train)

rf_pred = rf_model.predict(x_test)
r2_rf = r2_score(y_test, rf_pred)
msr_rf = mean_squared_error(y_test, rf_pred)
mae_rf = mean_absolute_error(y_test, rf_pred)
print(f"r2 is {r2_rf}, mae is {mae_rf} and msr is {msr_rf}.")  # the best so far r2 is 0.2893709583033749, mae is 0.339268447634886 and msr is 0.242240224138613.

r2 is 0.3950345586381413, mae is 0.32034424502864833 and msr is 0.21214803692300035.


In [28]:
leaf_nodes = np.arange(10, 1000, 50)
best = 0
for nodes in leaf_nodes:
    tmp_model = RandomForestRegressor(max_leaf_nodes=nodes, random_state=8, n_jobs=-1)
    tmp_model.set_params(n_estimators=80)
    tmp_model.fit(x_train, y_train)
    tmp_pred = tmp_model.predict(x_test)
    tmp_r2 = r2_score(y_test, tmp_pred)
    tmp_mae = mean_absolute_error(y_test, tmp_pred)
    if tmp_r2 > best:
        print(f"Leaf nodes is {nodes}, r2 is {tmp_r2} and mae is {tmp_mae}.")
        best = tmp_r2

Leaf nodes is 10, r2 is 0.14575517445110397 and mae is 0.3820703485709907.
Leaf nodes is 60, r2 is 0.3336184372867338 and mae is 0.34406207771861713.
Leaf nodes is 110, r2 is 0.3724521929586293 and mae is 0.3317959913545915.
Leaf nodes is 160, r2 is 0.39055920779465714 and mae is 0.32489821370291894.
Leaf nodes is 210, r2 is 0.39618254921764406 and mae is 0.3220245918320109.
Leaf nodes is 260, r2 is 0.3971318851179779 and mae is 0.3204367515399854.


In [29]:
estimators = np.arange(10, 500, 10)
best = 0
for n in estimators:
    rf_model.set_params(n_estimators=n)
    rf_model.fit(x_train, y_train)
    rf_pred = rf_model.predict(x_test)
    r2_rf = r2_score(y_test, rf_pred)
    mse_rf = mean_squared_error(y_test, rf_pred)
    mae_rf = mean_absolute_error(y_test, rf_pred)
    if best < r2_rf:
        print(f"estimator is {n},r2 is {r2_rf}, mae is {mae_rf} and mse is {mse_rf}.")
        best = r2_rf

estimator is 10,r2 is 0.3677699546153661, mae is 0.3255453376707456 and mse is 0.22170913219464733.
estimator is 20,r2 is 0.37738119399287506, mae is 0.32272443882033136 and mse is 0.2183386825343403.
estimator is 30,r2 is 0.383127869015377, mae is 0.32280750539288977 and mse is 0.21632345035493883.
estimator is 40,r2 is 0.38667106341056756, mae is 0.32201374213955425 and mse is 0.21508093023067504.
estimator is 50,r2 is 0.39053522518870676, mae is 0.3217662699252988 and mse is 0.21372585392459761.
estimator is 60,r2 is 0.39551137620779897, mae is 0.321793590724716 and mse is 0.21198082751820269.
estimator is 70,r2 is 0.3967557764855585, mae is 0.3208985655826423 and mse is 0.21154444378778853.
estimator is 80,r2 is 0.39713188511797815, mae is 0.3204367515399854 and mse is 0.21141255078600318.


It looks like the best I can get with Random Forest for now is max_leaf_nodes at 260, n_estimators at 80 with 4 variables:
- Installs_scaled
- Category_num
- Reviews_scaled
- Content_num

In [30]:
# the conclusive machine learning model for this round
variables = ["Category_num", "Content_num", "Reviews_scaled", "Installs_scaled", "Ratings_imp"]
new_df = df[variables]
x = new_df.drop("Ratings_imp", axis=1)
y = new_df["Ratings_imp"]
x_train, x_test, y_train, y_test = train_test_split(x,y, train_size=0.9, test_size=0.1, random_state=8)

rf_model = RandomForestRegressor(max_leaf_nodes=260, random_state=8, n_jobs=-1, n_estimators=80)
rf_model.fit(x_train, y_train)
rf_pred = rf_model.predict(x_test)
r2_rf = r2_score(y_test, rf_pred)
msr_rf = mean_squared_error(y_test, rf_pred)
mae_rf = mean_absolute_error(y_test, rf_pred)
print(f"r2 is {r2_rf}, mae is {mae_rf} and msr is {msr_rf}.") 
rf_model.score(x_test, y_test)  # R² of the model on the test set (same value as r2_score above)

r2 is 0.3971318851179779, mae is 0.3204367515399854 and msr is 0.21141255078600324.


0.3971318851179779

In [31]:
import math
math.sqrt(msr_rf)  # rmse

0.45979620571075097
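
For reference, the metrics used throughout can be computed by hand. The sketch below uses a few made-up ratings and predictions (illustrative values only) and checks the manual formulas against scikit-learn. It also shows why R² is not an accuracy: it is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical true ratings and predictions
y_true = np.array([4.1, 3.9, 4.7, 4.5, 4.3])
y_hat = np.array([4.0, 4.2, 4.4, 4.5, 4.1])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2_manual = 1 - ss_res / ss_tot
mse_manual = np.mean((y_true - y_hat) ** 2)
mae_manual = np.mean(np.abs(y_true - y_hat))
rmse_manual = np.sqrt(mse_manual)

print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
```

RMSE and MAE are in the same units as the rating itself, so an RMSE of about 0.46 means the predictions are typically off by a bit less than half a star.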

# Machine learning on data with dummy variables

In [32]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,...,Category_num,Types_num,Content_num,Ratings_imp,Size_Mb,Installs_log,Size_Mb_scaled,Installs_scaled,Price_scaled,Reviews_scaled
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,10000,Free,0.0,Everyone,Art & Design,...,0,0,1,4.1,19.0,9.21044,-0.115361,-0.144491,-0.065232,-0.118192
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,500000,Free,0.0,Everyone,Art & Design;Pretend Play,...,0,0,1,3.9,14.0,13.122365,-0.344851,-0.135375,-0.065232,-0.117751
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,5000000,Free,0.0,Everyone,Art & Design,...,0,0,1,4.7,8.7,15.424949,-0.58811,-0.051659,-0.065232,-0.070489
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,50000000,Free,0.0,Teen,Art & Design,...,0,0,4,4.5,25.0,17.727534,0.160027,0.785508,-0.065232,-0.000513
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,100000,Free,0.0,Everyone,Art & Design;Creativity,...,0,0,1,4.3,2.8,11.512935,-0.858908,-0.142817,-0.065232,-0.117751


In [33]:
variables = ['Reviews_scaled', 'Category','Price_scaled', 'Types_num', 'Content_num',
       'Ratings_imp', 'Size_Mb_scaled', 'Installs_scaled']
tmp_df = df[variables]
tmp_df.head()

Unnamed: 0,Reviews_scaled,Category,Price_scaled,Types_num,Content_num,Ratings_imp,Size_Mb_scaled,Installs_scaled
0,-0.118192,ART_AND_DESIGN,-0.065232,0,1,4.1,-0.115361,-0.144491
1,-0.117751,ART_AND_DESIGN,-0.065232,0,1,3.9,-0.344851,-0.135375
2,-0.070489,ART_AND_DESIGN,-0.065232,0,1,4.7,-0.58811,-0.051659
3,-0.000513,ART_AND_DESIGN,-0.065232,0,4,4.5,0.160027,0.785508
4,-0.117751,ART_AND_DESIGN,-0.065232,0,1,4.3,-0.858908,-0.142817


In [34]:
# dummy variable
data = pd.get_dummies(tmp_df, prefix=["Category"], columns=["Category"])

In [35]:
data.columns

Index(['Reviews_scaled', 'Price_scaled', 'Types_num', 'Content_num',
       'Ratings_imp', 'Size_Mb_scaled', 'Installs_scaled',
       'Category_ART_AND_DESIGN', 'Category_AUTO_AND_VEHICLES',
       'Category_BEAUTY', 'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS',
       'Category_COMICS', 'Category_COMMUNICATION', 'Category_DATING',
       'Category_EDUCATION', 'Category_ENTERTAINMENT', 'Category_EVENTS',
       'Category_FAMILY', 'Category_FINANCE', 'Category_FOOD_AND_DRINK',
       'Category_GAME', 'Category_HEALTH_AND_FITNESS',
       'Category_HOUSE_AND_HOME', 'Category_LIBRARIES_AND_DEMO',
       'Category_LIFESTYLE', 'Category_MAPS_AND_NAVIGATION',
       'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES', 'Category_PARENTING',
       'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
       'Category_PRODUCTIVITY', 'Category_SHOPPING', 'Category_SOCIAL',
       'Category_SPORTS', 'Category_TOOLS', 'Category_TRAVEL_AND_LOCAL',
       'Category_VIDEO_PLAYERS', 'Category_WEA

In [36]:
# random forest
variables = ["Content_num", "Reviews_scaled", "Installs_scaled", "Ratings_imp", 'Category_ART_AND_DESIGN', 'Category_AUTO_AND_VEHICLES',
       'Category_BEAUTY', 'Category_BOOKS_AND_REFERENCE', 'Category_BUSINESS',
       'Category_COMICS', 'Category_COMMUNICATION', 'Category_DATING',
       'Category_EDUCATION', 'Category_ENTERTAINMENT', 'Category_EVENTS',
       'Category_FAMILY', 'Category_FINANCE', 'Category_FOOD_AND_DRINK',
       'Category_GAME', 'Category_HEALTH_AND_FITNESS',
       'Category_HOUSE_AND_HOME', 'Category_LIBRARIES_AND_DEMO',
       'Category_LIFESTYLE', 'Category_MAPS_AND_NAVIGATION',
       'Category_MEDICAL', 'Category_NEWS_AND_MAGAZINES', 'Category_PARENTING',
       'Category_PERSONALIZATION', 'Category_PHOTOGRAPHY',
       'Category_PRODUCTIVITY', 'Category_SHOPPING', 'Category_SOCIAL',
       'Category_SPORTS', 'Category_TOOLS', 'Category_TRAVEL_AND_LOCAL',
       'Category_VIDEO_PLAYERS', 'Category_WEATHER']
new_data = data[variables]
x = new_data.drop("Ratings_imp", axis=1)
y = new_data["Ratings_imp"]
x_train, x_test, y_train, y_test = train_test_split(x,y, train_size=0.9, test_size=0.1, random_state=8)

rf_model = RandomForestRegressor(max_leaf_nodes=260, random_state=8, n_jobs=-1, n_estimators=80)
rf_model.fit(x_train, y_train)
rf_pred = rf_model.predict(x_test)
r2_rf = r2_score(y_test, rf_pred)
msr_rf = mean_squared_error(y_test, rf_pred)
mae_rf = mean_absolute_error(y_test, rf_pred)
print(f"r2 is {r2_rf}, mae is {mae_rf} and msr is {msr_rf}.") 
math.sqrt(msr_rf)
# it seems the model with one-hot encoding is a tad better?

r2 is 0.4115481391317368, mae is 0.31838953740912235 and msr is 0.20635708847411077.


0.4542654383442689

In [37]:
leaf_nodes = np.arange(10, 1000, 50)
estimators = np.arange(10, 500, 10)
best_r2 = 0
best_mae = 100
for nodes in leaf_nodes:
    for n in estimators:
        tmp_model = RandomForestRegressor(max_leaf_nodes=nodes, random_state=8, n_jobs=-1, n_estimators = n)
        tmp_model.fit(x_train, y_train)
        tmp_pred = tmp_model.predict(x_test)
        tmp_r2 = r2_score(y_test, tmp_pred)
        tmp_mae = mean_absolute_error(y_test, tmp_pred)
        tmp_mse = mean_squared_error(y_test, tmp_pred)
        if (tmp_r2 > best_r2) and (tmp_mae < best_mae):
            best_r2 = tmp_r2
            best_mae = tmp_mae
            print(f"Leaf nodes is {nodes}, estimators is {n}, r2 is {tmp_r2}, mae is {tmp_mae} and mse is {tmp_mse}.")

Leaf nodes is 10, estimators is 10, r2 is 0.1917770919375088, mae is 0.3824855872948222 and mse is 0.28342594736596854.
Leaf nodes is 10, estimators is 20, r2 is 0.20322383920541298, mae is 0.37992973061518925 and mse is 0.27941182557320476.
Leaf nodes is 60, estimators is 10, r2 is 0.33850350429674503, mae is 0.345528283513934 and mse is 0.2319722308087154.
Leaf nodes is 60, estimators is 20, r2 is 0.33936729363810925, mae is 0.3433076755188741 and mse is 0.23166931893878626.
Leaf nodes is 60, estimators is 30, r2 is 0.34379379659425247, mae is 0.3424847720770799 and mse is 0.23011704198480745.
Leaf nodes is 60, estimators is 40, r2 is 0.3481138289869925, mae is 0.3419162893213485 and mse is 0.22860210190905628.
Leaf nodes is 60, estimators is 50, r2 is 0.34825507951598467, mae is 0.34184111929165417 and mse is 0.22855256846401756.
Leaf nodes is 60, estimators is 70, r2 is 0.34919416668054803, mae is 0.3415102103834518 and mse is 0.2282232513082917.
Leaf nodes is 60, estimators is 150
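
The loops above pick `max_leaf_nodes` and `n_estimators` by their score on the test set, which tends to make the final test score look better than it would on truly unseen data. A possible alternative, sketched here on synthetic data with a deliberately tiny grid (the data, grid values and `cv=3` are all assumptions for illustration), is to tune with cross-validation on the training set and touch the test set only once at the end:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the app data
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=8)

param_grid = {"max_leaf_nodes": [10, 60], "n_estimators": [10, 30]}
search = GridSearchCV(
    RandomForestRegressor(random_state=8),
    param_grid,
    scoring="r2",
    cv=3,  # 3-fold cross-validation on the training set only
)
search.fit(X_tr, y_tr)

# the test set is used once, for the final estimate
test_r2 = search.score(X_te, y_te)
print(f"best params are {search.best_params_}, test r2 is {test_r2}")
```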

In [38]:
final_model = RandomForestRegressor(max_leaf_nodes=260, n_estimators=230, random_state=8, n_jobs=-1)
final_model.fit(x_train, y_train)
final_pred = final_model.predict(x_test)
final_r2 = r2_score(y_test, final_pred)
final_mse = mean_squared_error(y_test, final_pred)
final_mae = mean_absolute_error(y_test, final_pred)
final_r2, final_mse, final_mae

(0.4140740648976029, 0.2054713020208625, 0.31787991608878896)

In [39]:
math.sqrt(final_mse)

0.45328942412200895

# Conclusion

It seems that the best model I could come up with is using Random Forest Regressor (max_leaf_nodes=260, n_estimators=230, random_state=8, n_jobs=-1) using the following variables:

- Content_num: The encoded Content Rating, ie "Everyone", "Teen", etc.
- Reviews_scaled: Number of reviews standardised.
- Installs_scaled: Number of installs standardised.
- The categories that have been encoded into dummy variables.

With these variables, the model achieved an R² of roughly 0.41 (not an accuracy in the classification sense), a mean absolute error of 0.32, a mean squared error of 0.21 and a root mean squared error of 0.45. It's not ideal, but it's the best I can achieve at the moment. While it is likely that there are better models, I think there's a possibility that the variables here simply can't fully predict the average rating of an app. For example, if you look at the correlation matrix (from the data analysis), ratings are not strongly correlated with any variable. I understand that there are a lot of dummy variables (33 in total!), but they gave me slightly better metrics overall. Let me know what you think, but be nice!