# K-fold cross-validation


1. Randomly split the entire dataset into k "folds".
2. For each fold, build the model on the other k – 1 folds of the dataset, then test the model on the held-out kth fold to check how well it performs.
3. Record the error you see on the held-out predictions.
4. Repeat this until each of the k folds has served as the test set.
5. The average of the k recorded errors is called the cross-validation error and serves as the performance metric for the model.
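
The five steps above can be sketched by hand with NumPy alone. This is a minimal illustration, not the code used later in this notebook: the helper name `k_fold_cv_rmse`, the synthetic data, and the choice of RMSE as the error are all assumptions made for the example.

```python
import numpy as np

def k_fold_cv_rmse(X, y, k=5, seed=0):
    """Return the per-fold RMSE of a least-squares linear fit (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))      # step 1: shuffle, then split into k folds
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):                     # steps 2-4: each fold is the test set once
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit a linear model with an intercept on the k - 1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Record the error on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - A_test @ coef
        errors.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(errors)

# Synthetic data for illustration only: y = 2*x0 - x1 + small noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 2))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=50)

fold_errors = k_fold_cv_rmse(X_demo, y_demo, k=5)
cv_error = fold_errors.mean()              # step 5: average of the k errors
print(fold_errors, cv_error)
```

In practice `cross_val_score` and `KFold` from scikit-learn do the same bookkeeping, as the cells below show.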

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [53]:
df= pd.read_csv('50_Startups.csv')

In [54]:
df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [55]:
df.describe()

| | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


Here a minimum value of 0 is acceptable, since a spend can be zero.

In [56]:
df.dtypes

R&D Spend          float64
Administration     float64
Marketing Spend    float64
State               object
Profit             float64
dtype: object

In [57]:
cat_var= [i for i in df.columns if (df[i].dtypes=='object')]

In [58]:
df[cat_var].nunique()

State    3
dtype: int64

In [59]:
for i in cat_var:
    print(i)
    print(df[i].value_counts())

State
New York      17
California    17
Florida       16
Name: State, dtype: int64


In [60]:
df.isnull().any()

R&D Spend          False
Administration     False
Marketing Spend    False
State              False
Profit             False
dtype: bool

So none of the variables in the data has null values.

In [61]:
# Separate the independent variables (X) from the dependent variable (Y)
X = df.drop(columns='Profit')
Y = df['Profit']

In [62]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [63]:
# Label-encode each categorical column (State becomes 0, 1, 2)
l = LabelEncoder()
for i in cat_var:
    X[i] = l.fit_transform(X[i])

In [64]:
X.head()

| | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 2 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 2 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1 |


In [65]:
# One-hot encode the label-encoded column with pandas; the older
# OneHotEncoder(categorical_features=...) argument no longer exists in current scikit-learn.
# drop_first=True avoids the dummy variable trap; dtype=int keeps the dummies numeric.
X = pd.get_dummies(X, columns=cat_var, drop_first=True, dtype=int)


In [68]:
X.shape

(50, 5)
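
To see why one dummy column is dropped, the sketch below builds the dummies for a tiny, made-up `State` column with and without `drop_first=True`. With the full set, the dummy columns always sum to 1 in every row, which duplicates the intercept column of a linear model; dropping one level removes that redundancy. The small DataFrame here is an assumption for illustration, not the startup data.

```python
import pandas as pd

# Tiny illustrative frame (not the 50_Startups data)
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

full = pd.get_dummies(states, columns=["State"], dtype=int)
reduced = pd.get_dummies(states, columns=["State"], drop_first=True, dtype=int)

print(full.columns.tolist())      # one column per state
print(reduced.columns.tolist())   # first level (alphabetically) dropped

# Every row of the full dummy set sums to 1, i.e. it is collinear with an intercept
print(full.sum(axis=1).tolist())
```

The dropped level becomes the baseline, and the remaining dummy coefficients are read as differences from it.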

In [74]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [75]:
from sklearn.linear_model import LinearRegression
reg= LinearRegression()
reg.fit(x_train,y_train)
yp=reg.predict(x_test)

In [76]:
reg.coef_

array([ 7.73467193e-01,  3.28845975e-02,  3.66100259e-02, -9.59284160e+02,
        6.99369053e+02])

In [77]:
from sklearn.metrics import mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(y_test,yp))
r2 = r2_score(y_test,yp)

In [78]:
print('rmse: {}' .format(rmse))
print('r square: {}' .format(r2))

rmse: 9137.990152794959
r square: 0.9347068473282423


In [79]:
from sklearn.model_selection import cross_val_score
cv_score= np.sqrt(np.abs(cross_val_score(reg, X, Y, cv=10 , scoring='neg_mean_squared_error')))

In [80]:
cv_score.mean()

8892.041217617565

Available scoring metrics include ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples',
                     'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted',
                     'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']. For a regression model such as this one, only the regression scorers ('neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error', 'r2') are meaningful.
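
As a rough sketch of how the regression scorers are used: scikit-learn scorers follow a "greater is better" convention, so error metrics are returned negated and need their sign flipped before taking a square root. The synthetic data from `make_regression` and the variable names below are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
model = LinearRegression()

# Regression scorers: error metrics come back as negative numbers
for name in ["neg_mean_squared_error", "neg_mean_absolute_error", "r2"]:
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring=name)
    print(name, scores.mean())

# Convert the negated MSE of each fold into an RMSE
neg_mse = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean())
```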

In [81]:
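# Caution: 'adjusted_rand_score' is a clustering metric, so the 1.0 below is not a meaningful regression score.
from sklearn.model_selection import cross_val_score
cvs= np.sqrt(np.abs(cross_val_score(reg, X, Y, cv=10 , scoring='adjusted_rand_score')))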

In [82]:
cvs.mean()

1.0

In [27]:
c= cross_val_score(reg, X, Y, scoring='r2')



In [28]:
c.mean()

0.6525555448295809

## KFold

In [97]:
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.linear_model import LinearRegression
scores = []
reg = LinearRegression()
cv = KFold(n_splits=5,shuffle=True, random_state=10)
for train_index, test_index in cv.split(X):
    print("Train Index: ",train_index)
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X.iloc[train_index],X.iloc[test_index],Y.iloc[train_index],Y.iloc[test_index]
    reg.fit(X_train, y_train)
    scores.append(reg.score(X_test, y_test))
print('scores: {}'.format(scores))
print('mean score:{}'.format(np.array(scores).mean()))
print('std of scores:{}'.format(np.array(scores).std()))

Train Index:  [ 0  1  2  4  5  8  9 10 11 12 13 14 15 16 17 18 19 21 22 24 25 26 27 28
 29 31 32 33 34 35 36 38 39 40 41 43 45 46 48 49]
Test Index:  [ 3  6  7 20 23 30 37 42 44 47]
Train Index:  [ 0  1  3  4  5  6  7  8  9 11 12 13 14 15 16 17 19 20 22 23 24 25 26 28
 29 30 33 34 36 37 38 41 42 43 44 45 46 47 48 49]
Test Index:  [ 2 10 18 21 27 31 32 35 39 40]
Train Index:  [ 0  2  3  4  6  7  8  9 10 11 14 15 16 18 20 21 23 24 25 27 28 29 30 31
 32 33 35 36 37 38 39 40 41 42 43 44 46 47 48 49]
Test Index:  [ 1  5 12 13 17 19 22 26 34 45]
Train Index:  [ 0  1  2  3  5  6  7  8  9 10 12 13 15 17 18 19 20 21 22 23 25 26 27 28
 29 30 31 32 34 35 36 37 39 40 42 43 44 45 47 49]
Test Index:  [ 4 11 14 16 24 33 38 41 46 48]
Train Index:  [ 1  2  3  4  5  6  7 10 11 12 13 14 16 17 18 19 20 21 22 23 24 26 27 30
 31 32 33 34 35 37 38 39 40 41 42 44 45 46 47 48]
Test Index:  [ 0  8  9 15 25 28 29 36 43 49]
scores: [0.9901105113396018, 0.9399733860983267, 0.912144418872901, 0.9249814949116109, 0.

## Stratified K-fold CV (only applicable to classification problems)


In [None]:
# scores = []
# reg = LinearRegression()
# cv = StratifiedKFold(n_splits=5,shuffle=True, random_state=7)
# for train_index, test_index in cv.split(X,Y):
#     print("Train Index: ",train_index)
#     print("Test Index: ", test_index)
#     X_train, X_test, y_train, y_test = X.iloc[train_index],X.iloc[test_index],Y.iloc[train_index],Y.iloc[test_index]
#     reg.fit(X_train, y_train)
#     scores.append(reg.score(X_test, y_test))
# print('scores: {}'.format(scores))
# print('mean score:{}'.format(np.array(scores).mean()))
# print('mean variance:{}'.format(np.array(scores).std()))
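
The block above stays commented out because `StratifiedKFold` needs class labels; with a continuous target such as Profit it raises an error. A minimal sketch on a made-up, imbalanced classification target shows what stratification does: every test fold keeps the same class proportions as the full dataset. The arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative imbalanced labels: 80 negatives, 20 positives
X_cls = np.arange(100).reshape(-1, 1)
y_cls = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Count the positives that land in each test fold
positives_per_fold = [int(y_cls[test_idx].sum())
                      for _, test_idx in skf.split(X_cls, y_cls)]
print(positives_per_fold)
```

Each test fold holds 20 rows with the same 80/20 split as the whole dataset, which plain `KFold` does not guarantee.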

# Backward elimination

"""Backward elimination keeps the independent variables that have the most
statistically significant impact on the dependent variable
and eliminates the least significant ones, one at a time.
First we need to append a new column to X with the constant value 1, which
represents the constant b0 in our linear regression formula y = b0 + b1x1 + ... + bnxn. statsmodels does not add this constant
on its own, so to get the full equation we have to add it ourselves."""

In [29]:
import statsmodels.api as sm

In [30]:

# X= np.append(arr= np.ones((len(X),1)).astype(int), values= X , axis= 1)
# The line above would add a constant column of ones so that the model includes the intercept b0.
# Ordinary least squares (OLS) is a type of linear least squares method
# for estimating the unknown parameters in a linear regression model.

In [31]:
# X_opt = X.loc[:, :]
# Create an ordinary least squares (OLS) model object and fit it on all predictors
regressor_OLS = sm.OLS(endog = Y, exog = X).fit()
# Check the summary for p-values: a high p-value means less impact, so we eliminate,
# one at a time, the columns whose p-value is higher than the 5% significance level
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.988 |
| Model: | OLS | Adj. R-squared: | 0.986 |
| Method: | Least Squares | F-statistic: | 727.1 |
| Date: | Thu, 25 Jul 2019 | Prob (F-statistic): | 7.87e-42 |
| Time: | 13:53:03 | Log-Likelihood: | -545.15 |
| No. Observations: | 50 | AIC: | 1100.0 |
| Df Residuals: | 45 | BIC: | 1110.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| R&D Spend | 0.7182 | 0.066 | 10.916 | 0.000 | 0.586 | 0.851 |
| Administration | 0.3113 | 0.035 | 8.885 | 0.000 | 0.241 | 0.382 |
| Marketing Spend | 0.0786 | 0.023 | 3.429 | 0.001 | 0.032 | 0.125 |
| State_1 | 3464.4536 | 4905.406 | 0.706 | 0.484 | -6415.541 | 1.33e+04 |
| State_2 | 5067.8937 | 4668.238 | 1.086 | 0.283 | -4334.419 | 1.45e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.355 | Durbin-Watson: | 1.288 |
| Prob(Omnibus): | 0.508 | Jarque-Bera (JB): | 1.241 |
| Skew: | -0.237 | Prob(JB): | 0.538 |
| Kurtosis: | 2.391 | Cond. No. | 828000.0 |


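Here Administration is said to have the highest p-value, 0.608, which is greater than our significance level (SL) of 5% (0.05), so we remove it. Note that the summary printed above, fitted without a constant column, reports 0.000 for Administration and larger p-values for State_1 (0.484) and State_2 (0.283), so the 0.608 figure appears to come from a different run.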

In [32]:
X_opt = X.drop(columns=['Administration'])
# Create a new OLS model object and fit it on the remaining predictors
regressor_OLS = sm.OLS(endog = Y, exog = X_opt).fit()
# Check the summary for p-values again: a high p-value means less impact, so we eliminate,
# one at a time, the columns whose p-value is higher than the 5% significance level
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.966 |
| Model: | OLS | Adj. R-squared: | 0.963 |
| Method: | Least Squares | F-statistic: | 330.0 |
| Date: | Thu, 25 Jul 2019 | Prob (F-statistic): | 3.12e-33 |
| Time: | 13:53:03 | Log-Likelihood: | -570.48 |
| No. Observations: | 50 | AIC: | 1149.0 |
| Df Residuals: | 46 | BIC: | 1157.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| R&D Spend | 0.9242 | 0.101 | 9.145 | 0.000 | 0.721 | 1.128 |
| Marketing Spend | 0.1055 | 0.037 | 2.831 | 0.007 | 0.030 | 0.181 |
| State_1 | 1.806e+04 | 7587.023 | 2.381 | 0.021 | 2790.267 | 3.33e+04 |
| State_2 | 2.166e+04 | 7022.878 | 3.084 | 0.003 | 7524.659 | 3.58e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 6.065 | Durbin-Watson: | 1.064 |
| Prob(Omnibus): | 0.048 | Jarque-Bera (JB): | 2.42 |
| Skew: | -0.174 | Prob(JB): | 0.298 |
| Kurtosis: | 1.98 | Cond. No. | 691000.0 |


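Now Marketing Spend is taken to have a p-value higher than the significance level of 0.05, so we remove it. Note that the summary printed above reports 0.007 for Marketing Spend, so this remark appears to refer to a different run than the output shown.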

In [33]:
X_opt = X.drop(columns=['Administration', 'Marketing Spend'])
# Create a new OLS model object and fit it on the remaining predictors
regressor_OLS = sm.OLS(endog = Y, exog = X_opt).fit()
# Check the summary for p-values again: a high p-value means less impact, so we eliminate,
# one at a time, the columns whose p-value is higher than the 5% significance level
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.96 |
| Model: | OLS | Adj. R-squared: | 0.958 |
| Method: | Least Squares | F-statistic: | 380.6 |
| Date: | Thu, 25 Jul 2019 | Prob (F-statistic): | 5.84e-33 |
| Time: | 13:53:03 | Log-Likelihood: | -574.49 |
| No. Observations: | 50 | AIC: | 1155.0 |
| Df Residuals: | 47 | BIC: | 1161.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| R&D Spend | 1.1645 | 0.059 | 19.805 | 0.000 | 1.046 | 1.283 |
| State_1 | 2.477e+04 | 7726.535 | 3.206 | 0.002 | 9225.790 | 4.03e+04 |
| State_2 | 2.503e+04 | 7419.964 | 3.373 | 0.001 | 1.01e+04 | 4e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 7.516 | Durbin-Watson: | 0.91 |
| Prob(Omnibus): | 0.023 | Jarque-Bera (JB): | 2.46 |
| Skew: | -0.005 | Prob(JB): | 0.292 |
| Kurtosis: | 1.913 | Cond. No. | 223000.0 |


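Now we compare all three models on adjusted R². The reasoning is that removing Administration increases adjusted R², while removing Marketing Spend reduces it, which is not desirable. That happens because the p-value of Marketing Spend was not much higher than our significance level, so we can keep it in the model, and that is why the second model is the preferred one. Note that the summaries printed above show adjusted R² falling at each step (0.986, 0.963, 0.958); they were fitted without a constant column, so they may not be the runs this comparison was based on.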
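
The comparison above relies on adjusted R². For reference, the usual definition for a model with an intercept, with n observations and p predictors, is sketched below. The numbers in the worked line are hypothetical and are not taken from the summaries in this notebook.

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

$$
n = 50,\; p = 3,\; R^2 = 0.95 \;\Rightarrow\; \bar{R}^2 = 1 - 0.05 \cdot \frac{49}{46} \approx 0.9467
$$

Unlike plain R², this quantity can fall when an added predictor does not improve the fit enough, which is why it is used to compare models with different numbers of variables.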