# Data Splitting and Modelling

This notebook contains the code used in splitting and modelling the data.

## 1. Import Libraries and Dataset

In [1]:
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy import stats

from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import MinMaxScaler

Loading main dataset from CSV file:

In [3]:
df = pd.read_csv("ready.csv")

In [4]:
df.head(2)

Unnamed: 0,rating_score,votes,alcohol,aldehydic,almond,amber,animalic,anis,aquatic,aromatic,...,yellow floral,longevity,sillage,spring,summer,autumn,winter,gender_man,gender_unisex,gender_women
0,4.1,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.5,2.4,0.11,0.11,0.44,0.33,0.0,0.0,1.0
1,4.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.83,2.12,0.37,0.16,0.16,0.32,0.0,0.0,1.0


## 2. Data splitting

Splitting my data into 'X' and 'y' datasets:

The 'X' dataset contains all features.

The 'y' dataset contains the dependent variable, 'rating_score'.

In [229]:
X = df.drop('rating_score', axis = 1)  #independent variables
y = df[['rating_score']] #dependent variable
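
The `df.head(2)` output above shows an `Unnamed: 0` column, which usually is a row index saved by an earlier `to_csv` call. As written, `X` keeps it as a feature. A minimal sketch of dropping it, using a small made-up frame rather than `ready.csv` (all values below are illustrative):

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "rating_score": [4.1, 4.0, 3.8],
    "votes": [22.0, 24.0, 30.0],
    "longevity": [3.5, 3.83, 3.1],
})

# Drop the target and the leftover index column from the feature matrix.
X_toy = toy.drop(columns=["rating_score", "Unnamed: 0"])
y_toy = toy["rating_score"]

print(list(X_toy.columns))
```

Loading the file with `pd.read_csv("ready.csv", index_col=0)` would likely have the same effect, assuming the first column really is the old index.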

Splitting my data into train and test datasets:

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 1)

In [12]:
print("My training set is: ", X_train.shape)
print("My test set is: ", X_test.shape)
print("My training dependent variable is: ", y_train.shape)
print("My test dependent variable is: ", y_test.shape)

My training set is:  (14948, 86)
My test set is:  (3737, 86)
My training dependent variable is:  (14948, 1)
My test dependent variable is:  (3737, 1)


Splitting my train data into train and validation datasets:

In [13]:
X_train_f, X_validation, y_train_f, y_validation = train_test_split(X_train, y_train, test_size = 0.20, random_state = 1)

In [14]:
print("My final training set is: ", X_train_f.shape)
print("My validation set is: ", X_validation.shape)
print("My training dependent variable is: ", y_train_f.shape)
print("My validation dependent variable is: ", y_validation.shape)

My final training set is:  (11958, 86)
My validation set is:  (2990, 86)
My training dependent variable is:  (11958, 1)
My validation dependent variable is:  (2990, 1)


All subsets are large enough to validate with a simple train/validation/test split.
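
Two successive 80/20 splits leave roughly 64% of the rows for training, 16% for validation and 20% for testing, which matches the shapes printed above (11958, 2990 and 3737 out of 18685 rows). A small sketch on synthetic data (the array sizes are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5 features.
rng = np.random.default_rng(1)
X_demo = rng.random((1000, 5))
y_demo = rng.random(1000)

# First split off the test set, then split the remainder again.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)
X_tr_f, X_val, y_tr_f, y_val = train_test_split(X_tr, y_tr, test_size=0.20, random_state=1)

n = len(X_demo)
for name, part in [("train", X_tr_f), ("validation", X_val), ("test", X_te)]:
    print(f"{name}: {len(part)} rows ({len(part) / n:.0%})")
```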

### Scaling

Among other models, I'm going to run a Lasso regression model, so my features need to be on the same scale.

Scaling three columns, 'votes', 'longevity' and 'sillage', using MinMaxScaler:

In [43]:
mms = MinMaxScaler()

In [44]:
X_train_f[['votes','longevity','sillage']] = mms.fit_transform(X_train_f[['votes','longevity','sillage']])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.loc._setitem_with_indexer((slice(None), indexer), value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_array(key, value)


In [45]:
X_validation[['votes','longevity','sillage']] = mms.transform(X_validation[['votes','longevity','sillage']])



In [46]:
X_test[['votes','longevity','sillage']] = mms.transform(X_test[['votes','longevity','sillage']])



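Now the three scaled columns are in the range 0 to 1 on the training set. Validation and test values can fall slightly outside this range, because the scaler was fitted on the training data only.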
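
The pandas warning printed by the scaling cells is typically raised because the frames returned by `train_test_split` are slices of the original DataFrame. One way to avoid it, sketched here on synthetic data (the column values are made up), is to take explicit copies before assigning:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the perfume features.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "votes": rng.integers(1, 500, size=200).astype(float),
    "longevity": rng.uniform(1, 5, size=200),
    "sillage": rng.uniform(1, 4, size=200),
    "amber": rng.integers(0, 2, size=200).astype(float),
})
cols = ["votes", "longevity", "sillage"]

train_part, val_part = train_test_split(demo, test_size=0.20, random_state=1)
# Explicit copies, so the assignments below modify independent frames.
train_part = train_part.copy()
val_part = val_part.copy()

scaler = MinMaxScaler()
train_part[cols] = scaler.fit_transform(train_part[cols])  # fit on training rows only
val_part[cols] = scaler.transform(val_part[cols])          # reuse the training min/max

print(train_part[cols].min().tolist(), train_part[cols].max().tolist())
```

Fitting the scaler on the training rows only, as the notebook already does, keeps information from the validation and test sets out of the preprocessing step.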

## 3. Modelling

In this part, I run statistical models to find the best method to predict the 'rating_score' value.

### Training and Evaluating on the Training Set

I train a linear regression model as my baseline model:

In [139]:
lin_reg = LinearRegression()

In [140]:
lin_reg.fit(X_train_f, y_train_f) 

#parameters = lin_reg.coef_   # -> shape: (1,86)
#intercept = lin_reg.intercept_ # -> (1,)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [152]:
print(f'R^2 for training set: {lin_reg.score(X_train_f, y_train_f)}')
print(f'R^2 for validation set: {lin_reg.score(X_validation, y_validation)}')

R^2 for training set: 0.09293191697928027
R^2 for validation set: -2.8899773933717357e+19


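The R^2 value for the basic linear regression is very low on the training set (about 0.09), and the hugely negative validation R^2 (about -2.9e19) shows that the fitted coefficients do not generalise at all.

This can mean that the data, in this form, doesn't suit a plain linear regression model.

In the next steps I try to improve it.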
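
One possible explanation for the extreme validation score (an assumption, not verified on this dataset) is perfect collinearity: in the rows shown by `df.head(2)`, the three `gender_*` columns sum to 1, which duplicates the intercept. A toy sketch of how such redundancy shows up as a rank deficiency:

```python
import numpy as np

# Toy one-hot block: each row has exactly one of the three flags set,
# as the gender_man / gender_unisex / gender_women columns appear to.
dummies = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])
intercept = np.ones((dummies.shape[0], 1))
design = np.hstack([intercept, dummies])

# The three dummy columns sum to the intercept column, so one column is redundant.
rank = np.linalg.matrix_rank(design)
print(f"columns: {design.shape[1]}, rank: {rank}")
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution and can return very large coefficients. Dropping one dummy per group, or using a regularised model, are common remedies.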

### Lasso Model

I try to get better scores using Lasso Regression.

In [156]:
lasso = Lasso()

In [157]:
lasso.fit(X_train_f, y_train_f)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [159]:
print(f'R^2 for training set: {lasso.score(X_train_f, y_train_f)}')
print(f'R^2 for validation set: {lasso.score(X_validation, y_validation)}')

R^2 for training set: 0.0
R^2 for validation set: -1.5286184418261684e-05


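R^2 values are still very low. A training R^2 of exactly 0.0 indicates that, with the default alpha=1.0, Lasso has shrunk every coefficient to zero and simply predicts the mean rating.

On the training set, the performance of Lasso regression is even worse than the performance of the baseline model.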
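
A training R^2 of exactly 0.0 is what a Lasso model produces when its penalty is strong enough to set every coefficient to zero, leaving only the intercept. The sketch below reproduces that behaviour on synthetic data with features in [0, 1] and a weak signal (the data-generating numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X_demo = rng.random((500, 10))  # features in [0, 1]
y_demo = 4.0 + 0.3 * X_demo[:, 0] + rng.normal(0, 0.3, size=500)

strong = Lasso(alpha=1.0).fit(X_demo, y_demo)   # default penalty
weak = Lasso(alpha=0.001).fit(X_demo, y_demo)   # much weaker penalty

print("alpha=1.0   non-zero coefficients:", np.count_nonzero(strong.coef_))
print("alpha=0.001 non-zero coefficients:", np.count_nonzero(weak.coef_))
print("alpha=1.0   training R^2:", strong.score(X_demo, y_demo))
print("alpha=0.001 training R^2:", weak.score(X_demo, y_demo))
```

This is why searching over smaller alpha values is a reasonable next step when the features are scaled to a narrow range.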

Improving results using GridSearchCV:

In [168]:
# find optimal alpha with grid search
alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=lasso, param_grid=param_grid, scoring='r2', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train_f, y_train_f)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Best Score:  0.07843873934437766
Best Params:  {'alpha': 0.001}


[Parallel(n_jobs=-1)]: Done  35 out of  35 | elapsed:    2.7s finished


In [169]:
lasso = Lasso(alpha=0.001)

In [170]:
lasso.fit(X_train_f, y_train_f)

Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [171]:
print(f'R^2 for training set: {lasso.score(X_train_f, y_train_f)}')
print(f'R^2 for validation set: {lasso.score(X_validation, y_validation)}')

R^2 for training set: 0.08323895305637352
R^2 for validation set: 0.08133919142865054


After tuning alpha with GridSearchCV, R^2 rises to about 0.083 on the training set and 0.081 on the validation set, but the model still explains less than 10% of the variance.
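
Instead of re-creating the model by hand with the best alpha, the fitted search object can be used directly: with the default `refit=True`, `GridSearchCV` refits the best configuration on the full data passed to `fit`. A minimal sketch on synthetic data (the data and the grid are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_demo = rng.random((300, 8))
y_demo = 4.0 + 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] + rng.normal(0, 0.3, size=300)

search = GridSearchCV(
    estimator=Lasso(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1, 10]},
    scoring="r2",
    cv=5,
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already refitted on all of X_demo.
best_lasso = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
print(best_lasso.score(X_demo, y_demo))
```

Note that `best_score_` is the mean cross-validated R^2, so it is expected to differ a little from the score on a separate validation set.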

### Ridge Model

I try to get better results using Ridge regression, repeating the same steps.

In [172]:
ridge = Ridge()

In [173]:
ridge.fit(X_train_f, y_train_f)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [176]:
print(f'R^2 for training set: {ridge.score(X_train_f, y_train_f)}')
print(f'R^2 for validation set: {ridge.score(X_validation, y_validation)}')

R^2 for training set: 0.09425871348061476
R^2 for validation set: 0.07988939184365762


In [179]:
# finding optimal alpha with grid search

alpha = [0.001, 0.01, 0.1, 1, 10, 20, 30, 40, 50, 100, 1000]
param_grid = dict(alpha=alpha)

grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='r2', verbose=1, n_jobs=-1)

grid_result = grid.fit(X_train_f, y_train_f)

print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)

Fitting 5 folds for each of 11 candidates, totalling 55 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Best Score:  0.08089376329529138
Best Params:  {'alpha': 20}


[Parallel(n_jobs=-1)]: Done  40 out of  55 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  55 out of  55 | elapsed:    0.3s finished


In [181]:
ridge = Ridge(alpha=20)

In [182]:
ridge.fit(X_train_f, y_train_f)

Ridge(alpha=20, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [183]:
print(f'R^2 for training set: {ridge.score(X_train_f, y_train_f)}')
print(f'R^2 for validation set: {ridge.score(X_validation, y_validation)}')

R^2 for training set: 0.09068855368382034
R^2 for validation set: 0.0820614358251387


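After using GridSearchCV, the validation R^2 rises slightly (from about 0.080 to 0.082) while the training R^2 falls slightly (from about 0.094 to 0.091). The model still explains less than 10% of the variance.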

### Comparison of Models

In [197]:
print(f'Basic Linear Regression R^2 for training set: {lin_reg.score(X_train_f, y_train_f)}')
print(f'Basic Linear Regression R^2 for validation set: {lin_reg.score(X_validation, y_validation)}')

print(f'\nLasso R^2 for training set: {lasso.score(X_train_f, y_train_f)}')
print(f'Lasso R^2 for validation set: {lasso.score(X_validation, y_validation)}')

print(f'\nRidge R^2 for training set: {ridge.score(X_train_f, y_train_f)}')
print(f'Ridge R^2 for validation set: {ridge.score(X_validation, y_validation)}')

Basic Linear Regression R^2 for training set: 0.09293191697928027
Basic Linear Regression R^2 for validation set: -2.8899773933717357e+19

Lasso R^2 for training set: 0.08323895305637352
Lasso R^2 for validation set: 0.08133919142865054

Ridge R^2 for training set: 0.09068855368382034
Ridge R^2 for validation set: 0.0820614358251387
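
The same comparison can be collected into a small table, which is easier to read than separate print statements. A sketch on synthetic data (the data are made up, and the alpha values simply mirror the ones chosen above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.random((400, 6))
y_demo = 4.0 + 0.4 * X_demo[:, 0] + rng.normal(0, 0.3, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

models = {
    "Linear": LinearRegression(),
    "Lasso (alpha=0.001)": Lasso(alpha=0.001),
    "Ridge (alpha=20)": Ridge(alpha=20),
}

# Fit each model and record its R^2 on both subsets.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_r2": model.score(X_tr, y_tr),
        "validation_r2": model.score(X_val, y_val),
    })

summary = pd.DataFrame(rows).set_index("model").round(4)
print(summary)
```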


## 4. Model selection

Comparing the R^2 values on the training and validation sets, Ridge regression performed best, by a small margin over Lasso. All values are very low, which is not what I expected, but let's try to explain it.

In [207]:
print(f'\nRidge R^2 for training set: {ridge.score(X_train_f, y_train_f)}')
print(f'Ridge R^2 for validation set: {ridge.score(X_validation, y_validation)}')


Ridge R^2 for training set: 0.09068855368382034
Ridge R^2 for validation set: 0.0820614358251387


The Ridge model explains only about 9% of the variance of 'rating_score' in the training set and 8% in the validation set.

This means that this set of independent variables, in its current form, is not enough for linear regression models to predict the dependent variable.

## 5. Final check on the test dataset

In [210]:
test_r_squared = ridge.score(X_test, y_test)

y_pred_test = ridge.predict(X_test)
# the square root of the MSE is the RMSE
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print(f'Test r^2: {test_r_squared} \nTest RMSE: {test_rmse}')

Test r^2: 0.09268805541546654 
Test RMSE: 0.3590187092613027
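
The cell above takes the square root of `mean_squared_error`, so the printed 0.359 is a root mean squared error (RMSE). The corresponding MSE is about 0.129. The relationship between the two metrics, sketched on a few made-up ratings:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ratings, for illustration only.
y_true = np.array([4.1, 4.0, 3.6, 4.4])
y_pred = np.array([3.9, 4.0, 3.9, 4.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # same units as rating_score

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Because the RMSE is in the units of the target, it is usually the easier of the two to interpret: here it is the typical size of an error in rating points.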


Measure the R^2, MSE and MAE of the final model on the training set:

In [218]:
initial_score_r = ridge.score(X_train_f, y_train_f)
print('The initial R-Squared value for the ridge model is:', initial_score_r.round(4))

y_train_f_pred_ridge = ridge.predict(X_train_f)

mse = mean_squared_error(y_train_f, y_train_f_pred_ridge)
print('The Mean Squared Error value for the ridge model is:', mse.round(4))

mae = mean_absolute_error(y_train_f, y_train_f_pred_ridge)
print('The Mean Absolute Error value for the ridge model is:', mae.round(4))

The initial R-Squared value for the ridge model is: 0.0907
The Mean Squared Error value for the ridge model is: 0.1381
The Mean Absolute Error value for the ridge model is: 0.284


In [240]:
# final model initialisation

final_model_ridge = Ridge(alpha=20)
final_model = final_model_ridge.fit(X_train_f, y_train_f)
final_model.score(X_test, y_test)

0.09268805541546654

## 6. Possible improvements and future work:

My data proved to be difficult to model.
My starting idea of building a linear regression model didn't work well.

I assume that further feature engineering would be a good next step to extend the project.

My plan is to combine some of the features into groups to decrease the number of independent variables, and to add some new ones to improve the performance of the model.

Examples of new information:

* perfume's price
* common capacity 
* price per 100 ml
* popularity and worldwide availability
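
The grouping and price ideas above could look roughly like the sketch below. Everything in it is hypothetical: the `price` and `capacity_ml` columns, the `rose` note, the group names and all values are assumptions for illustration, not part of the current dataset.

```python
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
perfumes = pd.DataFrame({
    "rose": [1.0, 0.0, 0.0],
    "yellow floral": [0.0, 1.0, 0.0],
    "amber": [0.0, 0.0, 1.0],
    "price": [80.0, 120.0, 45.0],        # assumed currency units
    "capacity_ml": [50.0, 100.0, 30.0],  # assumed bottle size
})

# Hypothetical grouping of note columns into broader families.
note_groups = {
    "floral_notes": ["rose", "yellow floral"],
    "warm_notes": ["amber"],
}
for group, members in note_groups.items():
    # 1.0 if any member note is present, otherwise 0.0
    perfumes[group] = perfumes[members].max(axis=1)

# Normalise the price to a common volume so bottles of different sizes compare.
perfumes["price_per_100ml"] = perfumes["price"] / perfumes["capacity_ml"] * 100
print(perfumes[["floral_notes", "warm_notes", "price_per_100ml"]])
```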