## Cross Validation

Cross validation is generally used to compare the cost functions and scores of several candidate models before choosing the best performer.

The 'cross_val_score' function can now be used to evaluate the Ridge Regression model by splitting the data into several smaller training and validation sets, so that the model is trained and evaluated on separate rows. This is the K-fold cross validation technique, and I have split the data into 10 separate folds with cv=10 (which can be changed).
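
To make the splitting concrete, here is a minimal sketch of what a 10-fold split looks like. The array `X_demo` is hypothetical stand-in data (100 rows, 4 columns), not the BTC/CAD dataset; for a regressor, passing cv=10 uses this same unshuffled `KFold` scheme.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in data: 100 rows, 4 feature columns
X_demo = np.arange(400).reshape(100, 4)

# For a regressor, cv=10 corresponds to KFold(n_splits=10) without shuffling
kfold = KFold(n_splits=10)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo), start=1):
    # each fold holds out a different tenth of the rows for validation
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Every row is used for validation exactly once, and the model is refit from scratch on the remaining rows each time.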

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score 
from sklearn.model_selection import cross_val_score

# import data from filepath
btc_cad = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# storing dataset from filepath into new dataframe
df = pd.DataFrame(btc_cad).dropna(axis=0)

# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Ridge is imported directly above, so it is called without the linear_model prefix
ridge_reg = Ridge(alpha=0.1, solver="auto")
ridge_reg.fit(X_train, y_train)
# 10-fold cross-validated r-squared scores on the full dataset
print(cross_val_score(ridge_reg, X, y, cv=10))

The R-squared scores range from 0.8797 to 0.9953 across the 10 folds.

In [None]:
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

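# cross-validated r-squared scores on the full dataset (one per fold)
cv_scores = cross_val_score(ridge_reg, X, y, cv=10)
# predictions from the fitted model on the training set
y_pred = ridge_reg.predict(X_train)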

Now looking at the first 5 predictions from the fitted model (on the training set):

In [None]:
print(y_pred[:5])

Measuring the RMSE and r-squared score for the linear model (based on training set):

In [None]:
y_pred = ridge_reg.predict(X_train)
ridge_mse = mean_squared_error(y_train, y_pred)
ridge_rmse = np.sqrt(ridge_mse)
print(ridge_rmse)
    
r2_train = r2_score(y_train, y_pred)
print(r2_train)

This appears to have reduced the Root Mean Squared Error for this particular sample of observations. The RMSE is the square root of the Mean Squared Error: the differences between the predictions and the actual observations (here, the training set values) are squared, averaged, and then square-rooted, which puts the error back in the same units as the target (the Close price).
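
As a small worked illustration of how an RMSE is computed, the sketch below uses four made-up price values (the `_demo` arrays are hypothetical, not taken from the dataset) and checks the manual calculation against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted closing prices
y_true_demo = np.array([100.0, 102.0, 98.0, 105.0])
y_pred_demo = np.array([101.0, 100.0, 99.0, 103.0])

errors = y_true_demo - y_pred_demo   # -1, 2, -1, 2
mse_demo = np.mean(errors ** 2)      # (1 + 4 + 1 + 4) / 4 = 2.5
rmse_demo = np.sqrt(mse_demo)        # square root of 2.5, about 1.58

# the same value via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_demo, rmse_sklearn)
```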

Next, I compare the RMSE and r-squared for the Test Set to determine how well the model generalizes to unseen data.

In [None]:
y_pred = ridge_reg.predict(X_test)
ridge_mse = mean_squared_error(y_test, y_pred)
ridge_rmse = np.sqrt(ridge_mse)
print(ridge_rmse)
    
r2_test = r2_score(y_test, y_pred)
print(r2_test)

The loss (RMSE) has been reduced and the r-squared (a goodness-of-fit measure rather than an accuracy) has improved slightly, which suggests that the ridge regression (regularization) applied to the model translates fairly well to unseen data.

Next, I want to apply a different scoring parameter, 'Negative Mean Squared Error', and display the results from 10 separate folds.

In [None]:
from sklearn.model_selection import cross_val_score

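# 10-fold cross validation on the training set; scikit-learn returns negated MSE values
scores = cross_val_score(ridge_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
# negate and take the square root to get one RMSE per fold
ridge_rmse_scores = np.sqrt(-scores)
# r-squared on the test set, using the test predictions from the previous cell
r2_test = r2_score(y_test, y_pred)

def display_scores(scores):
    # print the per-fold scores, their mean and their standard deviation
    print("Scores:", scores)
    print("\n")
    print("Mean", scores.mean())
    print("\n")
    print("Standard Deviation", scores.std())
    print("\n")
    # r2_test is read from the notebook's global scope, not computed from scores
    print("R-Squared:", r2_test)
    print("\n")

display_scores(ridge_rmse_scores)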

The lowest RMSE achieved across the 10 folds is 389.29477471, and the average across the folds is 705.1488929437664, so the RMSE of 472.2805639459104 measured above is considerably lower than the cross-validated average. Each fold's RMSE is the square root of the negated 'neg_mean_squared_error' score, while the Standard Deviation measures how much those RMSE values vary from fold to fold.
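
The relationship between the negated scores, the per-fold RMSE values and their spread can be sketched as follows. This is an illustration on hypothetical synthetic data from `make_regression`, not on the BTC/CAD data, so the numbers will differ from those quoted above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic regression data standing in for the real dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# scikit-learn maximizes scores, so MSE is reported with its sign flipped
neg_mse = cross_val_score(Ridge(alpha=0.1), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=10)

# flip the sign back and take the square root: one RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)

print("Mean RMSE:", rmse_per_fold.mean())          # typical error across folds
print("Standard deviation:", rmse_per_fold.std())  # fold-to-fold variation
```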

In [None]:
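# per-fold RMSE values, followed by their average (sum of the 10 values divided by 10)
print(ridge_rmse_scores)
print(ridge_rmse_scores.sum() / 10)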

In [None]:
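# unregularized baseline for comparison (re-created here in case lin_reg is not already in scope)
lin_reg = LinearRegression()
scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)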
lin_rmse_scores = np.sqrt(-scores)
display_scores(lin_rmse_scores)

Averaging the 10 r-squared values from the Ridge Regression cross-validation above gives:

In [None]:
X_score = (0.97503778 + 0.87973022 + 0.98669072 + 0.96474476 + 0.98429722 + 0.99525415 + 0.99115778 + 0.89504007 + 0.94847447 + 0.90033073) / 10
print(X_score)

In [None]:
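# X_poly is assumed to hold the PolynomialFeatures-transformed X (it is not defined in this section)
print(cross_val_score(ridge_reg, X_poly, y, cv=10))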

In [None]:
X_poly_score = (0.97006818 + 0.85076677 + 0.99061033 + 0.97345638 + 0.98865061 + 0.99535352 + 0.99329041 + 0.79100498 + 0.79524206 + 0.87280349) / 10
print(X_poly_score)

In [None]:
print(cross_val_score(ridge_reg, X_train, y_train, cv=10))

In [None]:
X_train_score = (0.99859545 + 0.9982508 + 0.99873121 + 0.99971355 + 0.97387001 + 0.99905469 + 0.99721033 + 0.99879823 + 0.99922155 + 0.99974627) / 10
print(X_train_score)

The first important detail to note is that the average cross-validation score is higher for the training set (about 0.9963) than for the entire dataset (about 0.9521) or the polynomial dataset (about 0.9221).
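
Part of this gap may come from how rows are assigned to folds: 'train_test_split' shuffles the rows by default, whereas an integer cv uses consecutive, unshuffled blocks of rows. The sketch below is one way to check whether ordering matters. It uses a hypothetical synthetic trending series (the `_demo` arrays), not the BTC/CAD data, and makes no claim about which variant scores higher on the real dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical trending series: the target drifts upward over 300 ordered rows
rng = np.random.default_rng(0)
trend = np.linspace(0, 100, 300)
X_demo = np.column_stack([trend + rng.normal(0, 1, 300), rng.normal(0, 1, 300)])
y_demo = trend + rng.normal(0, 1, 300)

model = Ridge(alpha=0.1)

# consecutive blocks of rows (what an integer cv uses for a regressor)
contiguous = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=10))

# rows shuffled before being assigned to folds
shuffled = cross_val_score(model, X_demo, y_demo,
                           cv=KFold(n_splits=10, shuffle=True, random_state=0))

print("Contiguous folds, mean r-squared:", contiguous.mean())
print("Shuffled folds, mean r-squared:", shuffled.mean())
```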

Trying a different scoring parameter, 'explained_variance', gives consistently high scores.

In [None]:
from sklearn.metrics import explained_variance_score

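# MSE on the test set (squared=True was the default, and the argument is deprecated in recent scikit-learn releases)
mse = mean_squared_error(y_test, ridge_reg.predict(X_test))
# note: despite its name, mse_scores holds one explained variance score per fold
mse_scores = cross_val_score(ridge_reg, X_train, y_train, scoring="explained_variance", cv=10)
mse_scores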

In [None]:
average_mse = (0.99859549 + 0.99826071 + 0.99876091 + 0.9997251 + 0.97460254 + 0.99909548 + 0.99725663 + 0.99898202 + 0.99922302 + 0.99974633) / 10
print(average_mse)

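Next, taking the square root of each explained variance score and averaging the results (despite the variable names, these are not RMSE values, which would be in the units of the Close price):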

In [None]:
rmse_scores = np.sqrt(mse_scores)
rmse_scores

In [None]:
average_rmse = (0.9992975 + 0.99912998 + 0.99938027 + 0.99986254 + 0.9872196 + 0.99954764 + 0.99862737 + 0.99949088 + 0.99961143 + 0.99987316) / 10
print(average_rmse)

Now to evaluate the r-squared scoring parameter and cross-validating ten different sub-samples or folds once more:

In [None]:
from sklearn.metrics import r2_score

r2_scores_train = cross_val_score(ridge_reg, X_train, y_train, scoring="r2", cv=10)
r2_scores_train

Using the r-squared scores, I can first determine the average r2 value across the 10 folds.

In [None]:
average_r2_train = (0.99859545 + 0.9982508 + 0.99873121 + 0.99971355 + 0.97387 + 0.99905469 + 0.99721033 + 0.99879823 + 0.99922155 + 0.99974627) / 10
print(average_r2_train)

To see how this model generalizes to the test data:

In [None]:
r2_scores_test = cross_val_score(ridge_reg, X_test, y_test, scoring="r2", cv=10)
r2_scores_test

In [None]:
average_r2_test = (0.99984803 + 0.99985311 + 0.99813416 + 0.99927484 + 0.99985759 + 0.99957481 + 0.99982543 + 0.99799085 + 0.96522031 + 0.99736867) / 10
print(average_r2_test)

The average score drops slightly for the test data (about 0.9957, against about 0.9963 for the training data), but the two are approximately the same. This is a good sign for the model, because it suggests the fit will generally translate well to unseen data, with an average r-squared of about 0.9957 rather than an accuracy in the classification sense.
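
Note that 'cross_val_score' refits a fresh copy of the model on each fold, so calling it on X_test trains on parts of the test set. A complementary check is to fit once on the training split and score the held-out split directly. The sketch below does this on hypothetical synthetic data (`make_regression` stands in for the BTC/CAD data, and the `_tr`/`_te` names are illustrative).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the real dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# fit once on the training split only
model = Ridge(alpha=0.1).fit(X_tr, y_tr)

# .score() returns r-squared for regressors
train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)   # rows the model has never seen

print("Train r-squared:", train_r2)
print("Test r-squared:", test_r2)
```

A test r-squared close to the training r-squared points the same way as the cross-validated averages above.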

Another thing to notice is how similar the 'explained_variance' scores on the training data are to the 'r-squared' scores.
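
This similarity is expected: explained variance ignores any constant offset in the residuals, while r-squared penalizes it, so the two coincide whenever the residuals average to zero. A minimal sketch with made-up numbers (all `_demo` and `_pred` arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true_demo = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# residuals are -1, 1, 0, -1, 1, which average to zero
unbiased_pred = np.array([11.0, 11.0, 14.0, 17.0, 17.0])
# the same predictions shifted up by a constant 2
biased_pred = unbiased_pred + 2.0

ev_unbiased = explained_variance_score(y_true_demo, unbiased_pred)
r2_unbiased = r2_score(y_true_demo, unbiased_pred)
ev_biased = explained_variance_score(y_true_demo, biased_pred)
r2_biased = r2_score(y_true_demo, biased_pred)

print(ev_unbiased, r2_unbiased)  # identical when the mean residual is zero
print(ev_biased, r2_biased)      # explained variance is unchanged, r-squared drops
```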

The 'cross_val_predict' function also comes from the 'model_selection' module. Rather than a score, it returns one prediction per row, each made by a model that did not see that row during fitting. Using indexing to return the first 5 values, I get the following predictions:

In [None]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(ridge_reg, X, y, cv=10)
print(y_pred[0:5])

Cross validation is meant to estimate the performance of different machine learning algorithms so they can be compared and contrasted. It is not meant for training a final model, but for checking different models. I will apply cross-validation to the dataset with all 4 features (Open, High, Low and Volume) and the target column (Close). Starting from the beginning, I will attempt to find the optimal value for K, that is, the one that gives the highest r-squared or the lowest MSE score.
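
A search over K could be sketched along these lines. The candidate values (3, 5, 10) and the synthetic data from `make_regression` are assumptions for illustration only; on the real data, X and y would replace `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for the real dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

results = {}
for k in (3, 5, 10):  # candidate fold counts, chosen arbitrarily for this sketch
    neg_mse = cross_val_score(Ridge(alpha=0.1), X_demo, y_demo,
                              scoring="neg_mean_squared_error", cv=k)
    results[k] = -neg_mse.mean()  # average MSE across the k folds
    print(f"K={k}: mean MSE = {results[k]:.2f}")

# the K with the lowest average MSE
best_k = min(results, key=results.get)
print("Lowest mean MSE at K =", best_k)
```

Changing K changes how each estimate is computed (fewer, larger validation folds versus more, smaller ones) rather than the model itself, so small differences between values of K should be read with care.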