### Concrete data set

We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (the `CMS` column in the data) from the following eight predictors:

| Variable | Description |
| -------- | ----------- |
| Cement in kg | Cement in a m3 mixture |
| Blast Furnace Slag in kg | Blast Furnace Slag in a m3 mixture |
| Fly Ash in kg | Fly Ash in a m3 mixture |
| Water in kg | Water in a m3 mixture |
| Superplasticizer in kg | Superplasticizer in a m3 mixture |
| Coarse Aggregate in kg | Coarse Aggregate in a m3 mixture |
| Fine Aggregate in kg | Fine Aggregate in a m3 mixture |
| Age in days | Days (1-365) |

In [1]:
import pandas                       as     pd
import numpy                        as     np
import scipy.stats                  as     stats

import seaborn                      as     sns
import matplotlib.pyplot            as     plt
import matplotlib

matplotlib.rcParams.update({'font.size': 12})

from   statsmodels.compat           import lzip

from   sklearn                      import model_selection

from sklearn.linear_model           import  LinearRegression
from sklearn.linear_model           import  Lasso
from sklearn.linear_model           import  Ridge
from sklearn.linear_model           import  ElasticNet

from   sklearn.tree                 import DecisionTreeRegressor
from   sklearn.ensemble             import RandomForestRegressor
from   sklearn.neural_network       import MLPRegressor
from   sklearn                      import ensemble
from   sklearn.ensemble             import GradientBoostingRegressor

from   sklearn.neighbors            import KNeighborsRegressor
from   sklearn.svm                  import SVR

from   sklearn.model_selection      import GridSearchCV
from   sklearn.model_selection      import cross_val_score, cross_val_predict


from   sklearn.metrics              import mean_squared_error, mean_absolute_error


In [2]:
def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [3]:
def report_rmse_mape(lm, y, X, title):
    '''
        Reports RMSE and MAPE for the given model, pair of dataset (y and X) and title
    '''
    y_pred   = lm.predict(X)    # predict once and reuse for both metrics
    rmse_    = np.sqrt(mean_squared_error(y_true = y, y_pred = y_pred))
    mape_    = mean_absolute_percentage_error(y, y_pred)
    print("\n")
    print(title)
    print("--------------------------------------")
    print('RMSE is {}'.format(rmse_))
    print('MAPE is {}'.format(mape_))

#### End


In [4]:
train_df       =   pd.read_csv('train.csv', usecols =  ['Cement', 'Blast', 'Fly Ash', 'Water', 'Superplasticizer',\
       'CA', 'FA', 'Age', 'CMS'])
test_df        =   pd.read_csv('test.csv')
print(train_df.shape)
print(test_df.shape)
print(train_df.head().T)   
print(test_df.head().T)   

(823, 9)
(206, 9)
                        0       1       2       3        4
Cement             275.10  516.00  393.00  183.90   246.80
Blast                0.00    0.00    0.00  122.60     0.00
Fly Ash            121.40    0.00    0.00    0.00   125.10
Water              159.50  162.00  192.00  203.50   143.30
Superplasticizer     9.90    8.20    0.00    0.00    12.00
CA                1053.60  801.00  940.00  959.20  1086.80
FA                 777.50  802.00  758.00  800.00   800.90
Age                 56.00   28.00   90.00   28.00     3.00
CMS                 56.85   41.37   48.79   24.05    23.52
                      0      1       2        3       4
Cement            318.8  362.6  322.00   212.00  446.00
Blast             212.5  189.0    0.00     0.00   24.00
Fly Ash             0.0    0.0    0.00   124.80   79.00
Water             155.7  164.9  203.00   159.00  162.00
Superplasticizer   14.3   11.6    0.00     7.80   11.60
CA                852.1  944.7  974.00  1085.40  967.00


In [5]:
X_train  = train_df[['Cement', 'Blast', 'Fly Ash', 'Water', 'Superplasticizer','CA', 'FA', 'Age']]
X_test   = test_df[['Cement', 'Blast', 'Fly Ash', 'Water', 'Superplasticizer','CA', 'FA', 'Age']]
y_train  = train_df['CMS'] 
y_test   = test_df['CMS']
 

In [6]:
df_names      = ['x_train shape', 'x_test shape', 'y_train shape', 'y_test shape']
shapes        = (X_train.shape, X_test.shape,  y_train.shape, y_test.shape)
types         = (type(X_train), type(X_test), type(y_train),type(y_test))
lzip(df_names,shapes, types)

[('x_train shape', (823, 8), pandas.core.frame.DataFrame),
 ('x_test shape', (206, 8), pandas.core.frame.DataFrame),
 ('y_train shape', (823,), pandas.core.series.Series),
 ('y_test shape', (206,), pandas.core.series.Series)]

### Evaluating model performance

### Prediction Accuracy

The prediction error, or residual, is the difference between the actual value of the target variable and the value the model predicts for it.

The most popular measure of model performance is the Root Mean Square Error (RMSE), which is the square root of the arithmetic mean of the squared residuals.
Among competing models, the one with the lowest RMSE on held-out data is preferred.
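
As a small worked illustration of the definition above, the sketch below computes RMSE by hand, together with the MAPE that this notebook also reports. The function names and the four toy values are made up for illustration and are not taken from the concrete data.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean of the squared residuals."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)

# Toy values, invented for illustration only
actual    = [10.0, 20.0, 40.0, 50.0]
predicted = [12.0, 18.0, 44.0, 50.0]

toy_rmse = rmse(actual, predicted)   # residuals -2, 2, -4, 0 -> sqrt(24 / 4)
toy_mape = mape(actual, predicted)   # relative errors 0.2, 0.1, 0.1, 0.0

print("RMSE is", toy_rmse)
print("MAPE is", toy_mape)
```

RMSE is expressed in the units of the target variable, while MAPE is a percentage, so the two are not directly comparable with each other.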

### OLS

In [7]:
seed                  =   12345

In [8]:
lm_ols              =   LinearRegression()

In [9]:
lm_ols.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [10]:
print('\nOLS Multiple Linear Regression Models\n')
print("R Square value for OLS Regression complete data %4.2f" % np.round(lm_ols.score(X_train, y_train) * 100, 2))


OLS Multiple Linear Regression Models

R Square value for OLS Regression complete data 61.23


In [11]:
report_rmse_mape(lm_ols, y_train, X_train, 'The model performance for training set - \nOLS regression')



The model performance for training set - 
OLS regression
--------------------------------------
RMSE is 10.284254442304984
MAPE is 30.736849660088144


In [12]:
report_rmse_mape(lm_ols, y_test, X_test, 'The model performance for test set - \nOLS regression')



The model performance for test set - 
OLS regression
--------------------------------------
RMSE is 10.6308239107729
MAPE is 32.80051602830256


### Ridge

In [21]:
lm_ridge              =   Ridge()

In [22]:
lm_ridge.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [23]:
print("R Square value for Ridge Regression complete data %4.2f" \
      % np.round(lm_ridge.score(X_train, y_train) * 100, 2))

R Square value for Ridge Regression complete data 61.23


In [24]:
report_rmse_mape(lm_ridge, y_train, X_train, 'The model performance for training set - \nRidge regression')



The model performance for training set - 
Ridge regression
--------------------------------------
RMSE is 10.28425444268223
MAPE is 30.73687002900695


In [25]:
report_rmse_mape(lm_ridge, y_test, X_test, 'The model performance for testing set - \nRidge regression')



The model performance for testing set - 
Ridge regression
--------------------------------------
RMSE is 10.630824528151651
MAPE is 32.800547581387825


### Lasso

In [26]:
lm_lasso              =   Lasso()

In [27]:
lm_lasso.fit(X_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [28]:
print("R Square value for Lasso Regression complete data %4.2f" \
      %np.round(lm_lasso.score(X_train, y_train) * 100, 2))

R Square value for Lasso Regression complete data 61.20


In [29]:
report_rmse_mape(lm_lasso, y_train, X_train, \
                 'The model performance for training set - \nLasso regression')



The model performance for training set - 
Lasso regression
--------------------------------------
RMSE is 10.287164739928244
MAPE is 30.838741686702022


In [30]:
report_rmse_mape(lm_lasso, y_test, X_test, \
                 'The model performance for testing set - \nLasso regression')



The model performance for testing set - 
Lasso regression
--------------------------------------
RMSE is 10.63250210808717
MAPE is 32.901672169697


In [13]:
lm_elastic            =   ElasticNet()

### Non linear models

# shuffle = True is needed whenever random_state is set; recent scikit-learn versions raise an error otherwise
kfold                 =   model_selection.KFold(n_splits = 10, shuffle = True, random_state = seed)
lm_CART               =   DecisionTreeRegressor()
lm_RF                 =   RandomForestRegressor(random_state = seed)
lm_ANN                =   MLPRegressor(alpha = 0.000001, activation = 'tanh', random_state = seed, tol = 0.001)
lm_GB                 =   ensemble.GradientBoostingRegressor()
lm_SVR                =   SVR(kernel = 'linear', C = 1.0, epsilon = 0.2)
lm_KNN                =   KNeighborsRegressor()




lm_elastic.fit(X_train, y_train) 
lm_CART.fit(X_train, y_train) 
lm_RF.fit(X_train, y_train)
lm_GB.fit(X_train, y_train)
lm_SVR.fit(X_train, y_train)  
lm_KNN.fit(X_train, y_train)  
lm_ANN.fit(X_train, y_train)

MLPRegressor(activation='tanh', alpha=1e-06, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=12345,
       shuffle=True, solver='adam', tol=0.001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [14]:
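print('\nOther Linear Models\n')
print("R Square value for Ridge Regression complete data %4.2f" % np.round(lm_ridge.score(X_train,y_train) * 100, 2))
print("R Square value for Lasso Regression complete data %4.2f" % np.round(lm_lasso.score(X_train,y_train) * 100, 2))
print("R Square value for Elastic Net Regression complete data %4.2f" % np.round(lm_elastic.score(X_train,y_train) * 100, 2))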


Other Linear Models

R Square value for Ridge Regression complete data 61.23
R Square value for Lasso Regression complete data 61.20
R Square value for Elastic Net Regression complete data 61.22


In [15]:
print('\nNon linear Models\n')
print("R Square value for CART Regression complete data %4.2f" % np.round(lm_CART.score(X_train,y_train) * 100, 2))
print("R Square value for Random Forest Regression complete data %4.2f" % np.round(lm_RF.score(X_train,y_train) * 100, 2))
print("R Square value for Artificial Neural Network Regression complete data %4.2f" % np.round(lm_ANN.score(X_train,y_train) * 100, 2))

print("R Square value for Gradient Boosting Regression complete data %4.2f" % np.round(lm_GB.score(X_train,y_train) * 100, 2))
print("R Square value for SVR Regression complete data %4.2f" % np.round(lm_SVR.score(X_train,y_train) * 100, 2))
print("R Square value for KNN Regression complete data %4.2f" % np.round(lm_KNN.score(X_train,y_train) * 100, 2))


Non linear Models

R Square value for CART Regression complete data 99.88
R Square value for Random Forest Regression complete data 98.21
R Square value for Artificial Neural Network Regression complete data 3.54
R Square value for Gradient Boosting Regression complete data 95.12
R Square value for SVR Regression complete data 59.20
R Square value for KNN Regression complete data 81.20
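
The training-set R Square values above can flatter flexible models: a fully grown tree can reproduce its training rows almost exactly. One possible check, sketched below, is k-fold cross-validation with `cross_val_score`, which scores the model only on rows it was not fitted on. The data here are synthetic stand-ins generated for illustration (not the concrete data), and the names `X_demo`, `y_demo`, and `kfold_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): 300 rows, 8 features, noisy linear target
rng    = np.random.default_rng(12345)
X_demo = rng.normal(size=(300, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=1.0, size=300)

tree = DecisionTreeRegressor(random_state=12345)

# R^2 on the same rows the tree was fitted on
train_r2 = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# R^2 on held-out folds; shuffle=True is needed for random_state to be accepted
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=12345)
cv_r2      = cross_val_score(tree, X_demo, y_demo, cv=kfold_demo, scoring="r2")

print("training R^2        : %.3f" % train_r2)
print("cross-validated R^2 : %.3f (mean of 10 folds)" % cv_r2.mean())
```

A large gap between the two numbers suggests that the training score reflects memorisation rather than predictive ability.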


In [16]:
### For training dataset
seed                  =   12345
X                     =   X_train
y                     =   y_train

In [17]:
print('\nOther Linear Models\n')

report_rmse_mape(lm_ridge, y, X, 'The model performance for training set - \nRidge regression')
report_rmse_mape(lm_lasso, y, X, 'The model performance for training set - \nLasso regression')
report_rmse_mape(lm_elastic, y, X, 'The model performance for training set - \nElasticNet regression')

print('\nOther Non Linear Models\n')

report_rmse_mape(lm_CART, y, X, 'The model performance for training set - \nCART regression')
report_rmse_mape(lm_RF, y, X,   'The model performance for training set - \nRandom forest regression')
report_rmse_mape(lm_GB, y, X,   'The model performance for training set - \nGradient Boosting regression')
report_rmse_mape(lm_SVR, y, X,  'The model performance for training set - \nSVR regression')
report_rmse_mape(lm_KNN, y, X,  'The model performance for training set - \nKNN regression')
report_rmse_mape(lm_ANN, y, X,  'The model performance for training set - \nNeural Network regression')


Other Linear Models



The model performance for training set - 
Ridge regression
--------------------------------------
RMSE is 10.28425444268223
MAPE is 30.73687002900695


The model performance for training set - 
Lasso regression
--------------------------------------
RMSE is 10.287164739928244
MAPE is 30.838741686702022


The model performance for training set - 
ElasticNet regression
--------------------------------------
RMSE is 10.28536906864074
MAPE is 30.79460310596816

Other Non Linear Models



The model performance for training set - 
CART regression
--------------------------------------
RMSE is 0.5748490917895828
MAPE is 0.15747031194231098


The model performance for training set - 
Random forest regression
--------------------------------------
RMSE is 2.210776925933515
MAPE is 5.095857583457737


The model performance for training set - 
Gradient Boosting regression
--------------------------------------
RMSE is 3.6502454977381755
MAPE is 9.385840937349945


The mod

In [18]:
### For testing dataset
seed                  =   12345
X                     =   X_test
y                     =   y_test

In [19]:
print('\nOther Linear Models\n')

report_rmse_mape(lm_ridge, y, X, 'The model performance for testing set - \nRidge regression')
report_rmse_mape(lm_lasso, y, X, 'The model performance for testing set - \nLasso regression')
report_rmse_mape(lm_elastic, y, X, 'The model performance for testing set - \nElasticNet regression')

print('\nOther Non Linear Models\n')

report_rmse_mape(lm_CART, y, X, 'The model performance for testing set - \nCART regression')
report_rmse_mape(lm_RF, y, X,   'The model performance for testing set - \nRandom forest regression')
report_rmse_mape(lm_GB, y, X,   'The model performance for testing set - \nGradient Boosting regression')
report_rmse_mape(lm_SVR, y, X,  'The model performance for testing set - \nSVR regression')
report_rmse_mape(lm_KNN, y, X,  'The model performance for testing set - \nKNN regression')
report_rmse_mape(lm_ANN, y, X,  'The model performance for testing set - \nNeural Network regression')


Other Linear Models



The model performance for testing set - 
Ridge regression
--------------------------------------
RMSE is 10.630824528151651
MAPE is 32.800547581387825


The model performance for testing set - 
Lasso regression
--------------------------------------
RMSE is 10.63250210808717
MAPE is 32.901672169697


The model performance for testing set - 
ElasticNet regression
--------------------------------------
RMSE is 10.631645631613996
MAPE is 32.86193566967066

Other Non Linear Models



The model performance for testing set - 
CART regression
--------------------------------------
RMSE is 6.056301149753441
MAPE is 14.563738165213444


The model performance for testing set - 
Random forest regression
--------------------------------------
RMSE is 6.150536680671961
MAPE is 14.790039862335297


The model performance for testing set - 
Gradient Boosting regression
--------------------------------------
RMSE is 5.685615822808401
MAPE is 14.35568453834877


The model perfor
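
To compare the models whose results appear in the outputs above, the training and testing RMSE values can be put side by side. The sketch below does that with the reported numbers, rounded to three decimals. The dictionary name and layout are illustrative choices, and the SVR, KNN, and neural network results are left out because their outputs are cut off.

```python
# (training RMSE, testing RMSE) copied from the outputs above, rounded to 3 decimals
reported_rmse = {
    "OLS":               (10.284, 10.631),
    "Ridge":             (10.284, 10.631),
    "Lasso":             (10.287, 10.633),
    "ElasticNet":        (10.285, 10.632),
    "CART":              (0.575, 6.056),
    "Random forest":     (2.211, 6.151),
    "Gradient Boosting": (3.650, 5.686),
}

# Rank the models by testing RMSE, lowest (best) first
ranked = sorted(reported_rmse.items(), key=lambda item: item[1][1])
best_model = ranked[0][0]

print("%-18s %10s %10s %12s" % ("Model", "Train RMSE", "Test RMSE", "Test / Train"))
for name, (train, test) in ranked:
    # A ratio far above 1 means the model fits the training rows much better than new rows
    print("%-18s %10.3f %10.3f %12.2f" % (name, train, test, test / train))

print("Lowest testing RMSE:", best_model)
```

A testing RMSE far above the training RMSE, as for CART, is a sign of overfitting to the training rows, whereas the linear models show nearly the same error on both sets.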