## Data Science - Applications

In data science, we need to find the model that fits best for prediction. In regression, this means finding the best model for predicting a continuous variable. Until now, we learned to select the best model by finding the highest R^2 score using Linear Regression with various predictors and interaction terms. Alternatively, we can apply other algorithms that may fit a particular problem better.

This section demonstrates some related regression algorithms:
- Lasso Regression
- Ridge Regression
- ElasticNet
- Random Forest Regressor
- Support Vector Regressor

The experiments in this section use the Diabetes dataset.

In [1]:
#import 
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

Loading Diabetes Dataset

In [2]:
# Load the Diabetes dataset
columns = "age sex bmi map tc ldl hdl tch ltg glu".split() # Declare the columns names
diabetes = datasets.load_diabetes() # Call the diabetes dataset from sklearn
df = pd.DataFrame(diabetes.data, columns=columns) # load the features as a pandas DataFrame (df is the feature matrix X)
y = diabetes.target # define the target variable (dependent variable) as y

In [3]:
df.head()
# the features have been mean-centered and scaled
# so that each column has unit sum of squares (values lie roughly between -0.2 and 0.2)

,age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641


Now we can use the train_test_split function to make the split. The test_size=0.2 argument is the fraction of the data that should be held out for testing. The split is usually around 80/20 or 70/30.

In [4]:
# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
#(353, 10) (353,) : training
#(89, 10) (89,) : testing

(353, 10) (353,)
(89, 10) (89,)
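The split above is not seeded, so the rows in each set (and every score reported in this section) will change from run to run. A minimal sketch of a reproducible split follows; the seed value `random_state=0` and the names `split_a` and `split_b` are arbitrary choices for illustration, not part of the original experiment.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# random_state=0 is an arbitrary seed chosen for illustration
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)

# With the same seed, both calls put the same rows in the test set
same_test_rows = bool((split_a[1] == split_b[1]).all())
print(same_test_rows)
print(split_a[0].shape, split_a[1].shape)
```

Fixing the seed makes results comparable across runs, but it will not reproduce the exact scores shown in this section, because the seed of the original split is unknown.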


Let's fit on the training data.

In [5]:
# fit a model
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

The first five predictions are as follows.

In [6]:
predictions[0:5] # the first 5 testing data

array([166.2160369 ,  68.4771832 , 276.94662252, 186.06352781,
        81.16345835])

In [7]:
print('The R^2 of Linear Regression on training set: {:.2f}'
     .format(model.score(X_train, y_train)))
print('The R^2 of Linear Regression on test set: {:.2f}'
     .format(model.score(X_test, y_test)))

The R^2 of Linear Regression on training set: 0.55
The R^2 of Linear Regression on test set: 0.39
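The `score` method of a scikit-learn regressor returns R^2, defined as 1 - SS_res / SS_tot. As a sketch of what that number means, the code below computes it by hand and compares it with `r2_score` and `score`. It uses its own seeded split (`random_state=0` is an assumption for illustration), so the value will differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
# Seeded split for illustration only; the original split is unseeded
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lm = LinearRegression().fit(X_train, y_train)
y_pred = lm.predict(X_test)

# Residual sum of squares: error of the model
ss_res = np.sum((y_test - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean of y_test
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_test, y_pred), lm.score(X_test, y_test))
```

An R^2 of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values are possible on held-out data.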


### Ridge Regression

Ridge regression is a regularized linear regression model. The regularization strength λ (called `alpha` in scikit-learn, left at its default of 1.0 here) is a hyperparameter that is usually chosen by cross-validation rather than fixed in advance. The model minimizes the linear least squares loss plus a penalty given by the squared L2 norm of the coefficients.

In [8]:
# Ridge regression model (default alpha=1.0)
from sklearn.linear_model import Ridge
model_ridge = Ridge()
model_ridge.fit(X_train, y_train)
print('The R^2 of Ridge regressor on training set: {:.2f}'
     .format(model_ridge.score(X_train, y_train)))
print('The R^2 of Ridge regressor on test set: {:.2f}'
     .format(model_ridge.score(X_test, y_test)))

The R^2 of Ridge regressor on training set: 0.45
The R^2 of Ridge regressor on test set: 0.38
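The Ridge model above keeps the default regularization strength. As a hedged sketch of choosing it by cross-validation instead, scikit-learn's `RidgeCV` can search a grid of candidate values. The grid `alphas`, the 5-fold setting, and the seeded split are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
# Seeded split for illustration only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regularization strengths (an illustrative grid, not a recommendation)
alphas = np.logspace(-4, 1, 20)

# 5-fold cross-validation on the training set picks the best alpha
model_ridge_cv = RidgeCV(alphas=alphas, cv=5)
model_ridge_cv.fit(X_train, y_train)

print('Selected alpha:', model_ridge_cv.alpha_)
print('Training R^2: {:.2f}'.format(model_ridge_cv.score(X_train, y_train)))
print('Test R^2: {:.2f}'.format(model_ridge_cv.score(X_test, y_test)))
```

Because the features of this dataset are scaled to very small values, the selected strength may turn out to be well below the default of 1.0.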


### Lasso (Least Absolute Shrinkage and Selection Operator) Regression

Lasso regression is a shrinkage and variable selection method for linear regression models. Its goal is to find the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that shrinks the regression coefficients toward zero, and some of them exactly to zero. Variables whose coefficient is zero after shrinkage are excluded from the model, while variables with non-zero coefficients are the ones most strongly associated with the response variable. Running a lasso regression can therefore help you decide which variables your model should contain. This keeps the model from becoming overly complex and reduces the risk of over-fitting, which would otherwise hurt its performance on new data.

The only difference between lasso and ridge regression is that the lasso penalty uses the absolute values of the coefficients (the L1 norm) instead of their squares (the L2 norm). But this difference has a large impact on the trade-off. Lasso overcomes a disadvantage of ridge regression by not only penalizing large values of the coefficients β but actually setting them to zero when they are not relevant. You may therefore end up with fewer features in the model than you started with, which is a significant advantage.

In [9]:
# Lasso regression model (LassoCV chooses the penalty strength by cross-validation)
from sklearn.linear_model import LassoCV
model_lasso = LassoCV()
model_lasso.fit(X_train, y_train)
print('The R^2 of Lasso regressor on training set: {:.2f}'
     .format(model_lasso.score(X_train, y_train)))
print('The R^2 of Lasso regressor on test set: {:.2f}'
     .format(model_lasso.score(X_test, y_test)))

The R^2 of Lasso regressor on training set: 0.55
The R^2 of Lasso regressor on test set: 0.38
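To illustrate the variable-selection behaviour described above, the sketch below fits Lasso and Ridge with the same penalty strength and counts how many coefficients each sets exactly to zero. The choice `alpha=1.0` and fitting on the full dataset are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

data = load_diabetes()
X, y = data.data, data.target

# Same penalty strength for both models (illustrative value)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso can drive coefficients exactly to zero; Ridge only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print('Coefficients set to zero by Lasso:', n_zero_lasso)
print('Coefficients set to zero by Ridge:', n_zero_ridge)

# Features that Lasso kept in the model
kept = [name for name, c in zip(data.feature_names, lasso.coef_) if c != 0]
print('Features kept by Lasso:', kept)
```

The features with non-zero coefficients are the ones Lasso considers most useful at this penalty strength; a smaller `alpha` would keep more of them.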


### Elastic Net

Elastic net combines L1 and L2 regularization. Ridge-like and Lasso-like behaviour can therefore both be obtained from elastic net by tuning its parameters.

###### L1: Manhattan norm / L2: Euclidean norm
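For reference, the objective that scikit-learn's `ElasticNet` minimizes can be written as follows, where β are the coefficients, n is the number of samples, α corresponds to the `alpha` parameter, and ρ to `l1_ratio`. The symbols α and ρ are notation chosen for this note.

$$
\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 \;+\; \alpha \rho \lVert \beta \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \lVert \beta \rVert_2^2
$$

Setting ρ = 1 leaves only the L1 term (the Lasso penalty), while ρ = 0 leaves only the squared L2 term (a Ridge-type penalty). Values in between mix the two.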

In [10]:
from sklearn.linear_model import ElasticNet
model_eln = ElasticNet()
model_eln.fit(X_train, y_train)
print('The R^2 of ElasticNet regressor on training set: {:.2f}'
     .format(model_eln.score(X_train, y_train)))
print('The R^2 of ElasticNet regressor on test set: {:.2f}'
     .format(model_eln.score(X_test, y_test)))
# with the default parameters the model explains almost nothing

The R^2 of ElasticNet regressor on training set: 0.01
The R^2 of ElasticNet regressor on test set: 0.01


Without parameter tuning, the result of Elastic Net is poor: with the default `alpha=1.0` the penalty is too strong for this dataset. You can set the Elastic Net parameters to fit your dataset.

In [11]:
model_eln = ElasticNet(alpha=0.001, l1_ratio=0.5) # alpha: overall penalty strength (tunable); l1_ratio=0.5 weights the L1 and L2 penalties equally
model_eln.fit(X_train, y_train)
print('The R^2 of ElasticNet regressor on training set: {:.2f}'
     .format(model_eln.score(X_train, y_train)))
print('The R^2 of ElasticNet regressor on test set: {:.2f}'
     .format(model_eln.score(X_test, y_test)))

The R^2 of ElasticNet regressor on training set: 0.54
The R^2 of ElasticNet regressor on test set: 0.40


### Random Forest Regressor

A Random Forest is an ensemble technique that can perform both regression and classification tasks using multiple decision trees and a technique called bootstrap aggregation, commonly known as bagging. The basic idea is to combine the outputs of multiple decision trees to determine the final output rather than relying on an individual decision tree.

Approach:

1. Pick K data points at random (with replacement) from the training set.
2. Build the decision tree associated with those K data points.
3. Choose the number Ntree of trees you want to build and repeat steps 1 and 2.
4. For a new data point, make each one of your Ntree trees predict the value of Y for the data point, and assign the new data point the average across all of the predicted Y values.

https://www.geeksforgeeks.org/random-forest-regression-in-python/
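The approach listed above can be sketched by hand with individual decision trees. This is a simplified illustration, not how `RandomForestRegressor` is implemented internally: a real random forest can also subsample features at each split. The names `n_trees`, `k`, and `rng`, the seed, and the value of 50 trees are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
# Seeded split for illustration only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 50           # Ntree: number of trees to build (illustrative)
k = len(X_train)       # K: size of each bootstrap sample

trees = []
for _ in range(n_trees):
    # Step 1: draw K training rows at random, with replacement
    idx = rng.integers(0, len(X_train), size=k)
    # Step 2: build a decision tree on that bootstrap sample
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: every tree predicts, and the forest output is the average
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
forest_pred = all_preds.mean(axis=0)

print(all_preds.shape, forest_pred.shape)
```

Averaging many trees trained on different bootstrap samples reduces the variance of a single deep tree, which is the motivation for bagging.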

In [14]:
# Random Forest regression model
from sklearn.ensemble import RandomForestRegressor
model_RFR = RandomForestRegressor()
model_RFR.fit(X_train, y_train)
print('The R^2 of Random Forest regressor on training set: {:.2f}'
     .format(model_RFR.score(X_train, y_train)))
print('The R^2 of Random Forest regressor on test set: {:.2f}'
     .format(model_RFR.score(X_test, y_test)))
# training score >> test score : overfitting (the model fits the training data very well)
# big score difference between training and testing : overfitting
# small gap between training and testing : good

The R^2 of Random Forest regressor on training set: 0.93
The R^2 of Random Forest regressor on test set: 0.35
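Because each tree only sees a bootstrap sample, the rows left out of that sample can serve as a built-in validation set. The sketch below enables the out-of-bag (OOB) estimate of `RandomForestRegressor` to get a more honest score than the training R^2 without touching the test set. The values `n_estimators=200` and `random_state=0` are assumptions for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
# Seeded split for illustration only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row
model_oob = RandomForestRegressor(
    n_estimators=200, oob_score=True, random_state=0
)
model_oob.fit(X_train, y_train)

train_r2 = model_oob.score(X_train, y_train)
oob_r2 = model_oob.oob_score_
test_r2 = model_oob.score(X_test, y_test)

print('Training R^2: {:.2f}'.format(train_r2))
print('Out-of-bag R^2: {:.2f}'.format(oob_r2))
print('Test R^2: {:.2f}'.format(test_r2))
```

The out-of-bag score is usually much closer to the test score than the training score is, which makes the overfitting visible during training.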


### Support Vector Regressor

Support Vector Regression (SVR) is a regression algorithm that applies the technique of Support Vector Machines (SVM) to regression analysis. Regression targets are continuous real numbers. To fit such data, the SVR model approximates the target values within a given margin called the ε-tube (epsilon-tube, where ε sets the tube width), balancing model complexity against the error rate.
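The ε-tube corresponds to the ε-insensitive loss: errors smaller than ε cost nothing, and larger errors are penalized only by the amount they exceed ε. A minimal sketch of that loss follows; the function name `epsilon_insensitive_loss` and the sample values are hypothetical, chosen for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon-tube, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])

# Errors are 0.05, 0.5 and 1.0; only the part beyond epsilon=0.1 is penalized
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)  # approximately [0, 0.4, 0.9]
```

A wider tube (larger ε) ignores more of the small errors and tends to give a simpler model with fewer support vectors.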

In [15]:
from sklearn.svm import SVR
svm = SVR()
svm.fit(X_train, y_train)
print('The R^2 of SVR on training set: {:.2f}'
     .format(svm.score(X_train, y_train)))
print('The R^2 of SVR on test set: {:.2f}'
     .format(svm.score(X_test, y_test)))
# result is not that good

The R^2 of SVR on training set: 0.19
The R^2 of SVR on test set: 0.14


SVR requires particular parameters to be set to get a better result. Choosing the parameters of an SVM (in this case SVR) requires understanding the theoretical basis of the algorithm as well as the distribution of the dataset. For further information, please refer to the following link.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html



In [16]:
# try a linear kernel with a large C (gamma is not used by the linear kernel)
from sklearn.svm import SVR
svm = SVR(kernel='linear', C=100000, gamma='auto')
svm.fit(X_train, y_train)
print('The R^2 of SVR on training set: {:.2f}'
     .format(svm.score(X_train, y_train)))
print('The R^2 of SVR on test set: {:.2f}'
     .format(svm.score(X_test, y_test)))

The R^2 of SVR on training set: 0.54
The R^2 of SVR on test set: 0.36


In [17]:
from sklearn.svm import SVR
svm = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1) # rbf : radial basis function
svm.fit(X_train, y_train)
print('The R^2 of SVR on training set: {:.2f}'
     .format(svm.score(X_train, y_train)))
print('The R^2 of SVR on test set: {:.2f}'
     .format(svm.score(X_test, y_test)))
# not that good result

The R^2 of SVR on training set: 0.30
The R^2 of SVR on test set: 0.26
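Rather than trying SVR parameters one combination at a time, the search can be automated. The sketch below uses `GridSearchCV` with 5-fold cross-validation on the training set. The grid values in `param_grid` and the seeded split are assumptions picked for illustration and are not tuned recommendations.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
# Seeded split for illustration only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative grid: 4 * 3 * 3 = 36 combinations
param_grid = {
    "kernel": ["rbf"],
    "C": [1, 10, 100, 1000],
    "gamma": ["scale", 0.1, 1.0],
    "epsilon": [0.1, 1.0, 10.0],
}

# Each combination is scored by mean R^2 over 5 folds of the training set
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Best cross-validated R^2: {:.2f}'.format(search.best_score_))
print('Test R^2: {:.2f}'.format(search.score(X_test, y_test)))
```

The test set is used only once, after the search has finished, so the reported test score is not biased by the parameter selection.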


In [18]:
# LR / Lasso / tuned ElasticNet : good (representative of the data), small train/test gap
# RFR : overfitting, the high training score is misleading
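The comparison above rests on a single random split, so differences of a few hundredths in R^2 may be noise. As a hedged sketch of a steadier comparison, the code below scores each model with 5-fold cross-validation on the full dataset. The dictionary `models`, the chosen parameter values (taken from the cells above where available), and `random_state=0` are assumptions for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LassoCV, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# Models to compare; parameter values mirror the experiments in this section
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "LassoCV": LassoCV(),
    "ElasticNet(alpha=0.001)": ElasticNet(alpha=0.001, l1_ratio=0.5),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "SVR(rbf, C=100)": SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1),
}

cv_scores = {}
for name, model in models.items():
    # One R^2 value per fold; the mean and spread summarize the model
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[name] = (scores.mean(), scores.std())
    print("{}: mean R^2 = {:.2f} (+/- {:.2f})".format(
        name, scores.mean(), scores.std()))
```

The standard deviation across folds gives a rough sense of how much a single train/test split can move the score.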

### Summary

This section briefly presented several regression algorithms: Linear Regression, Lasso Regression, Ridge Regression, ElasticNet, Random Forest Regressor, and Support Vector Regressor.
While some algorithms gave reasonable results with their default settings, others such as ElasticNet and SVR require further experiments to tune their hyperparameters and find the best results.