#  **Evaluating Techniques**

I will discuss some common techniques for evaluating a Machine Learning model.

*(If you would like to understand the various evaluation methods, I have written another notebook on that as well.)*

---
---

## **Preparing Data**

### Importing Libraries

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing

### Importing Data

In [2]:
X, y = fetch_california_housing(as_frame=True,
                               return_X_y=True)

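With `return_X_y=True`, the *data* and *target* values are returned separately as `X` and `y`. We pass both to the splitter so that they are split into *Train* and *Test* sets together and their rows stay aligned.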

### Splitting the data
***Please Note:** I will not be focusing on splitting the data in this notebook*

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

---

## **Preprocessing the data**

Let's take a quick look at our data to see what kind of data we are dealing with.

In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB


As we can see, there are no categorical values, so we only need to prepare our data with some numerical transformations. We could combine our model and all the preprocessing steps into a single `Pipeline`, but for the sake of simplicity let's just create a pre-processing pipeline.

### Importing Libraries

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

### Creating a pre-processing pipeline

In [6]:
preprocessing = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scaler', StandardScaler())
])

### Pre-processing the data

In [7]:
X_train_prepared = preprocessing.fit_transform(X_train)
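
For comparison, here is a minimal sketch of the combined version mentioned earlier, where the model is the last step of the pipeline. It uses synthetic data from `make_regression` so that it runs without downloading anything; the names `X_demo`, `y_demo` and `full_pipeline` are illustrative and not part of this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: used only so the sketch runs offline)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# Same pre-processing steps as above, with the model appended as the final step
full_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# fit() runs the transformers and then fits the model;
# predict() applies the already-fitted transformers before predicting
full_pipeline.fit(X_demo, y_demo)
demo_pred = full_pipeline.predict(X_demo[:5])
print(demo_pred.shape)
```

A pipeline like this can be passed directly as the `estimator` to the cross-validation functions, in which case the pre-processing steps are refitted on each training fold.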

---

## **Preparing a simple regression model**
*(For the purpose of this notebook, we will use a simple Linear Regression Model)*

### Importing Library

In [8]:
from sklearn.linear_model import LinearRegression

### Initializing the model

In [9]:
lin_reg = LinearRegression()

### Fitting the model to our pre-processed data

In [10]:
lin_reg.fit(X_train_prepared, y_train)

LinearRegression()

---

## **Let's get our predictions**

In [11]:
y_pred = lin_reg.predict(X_train_prepared)
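
Predictions like these can be scored directly against the known targets. The sketch below computes a training RMSE with `mean_squared_error`; it uses synthetic data and illustrative names (`X_demo`, `y_demo`, `demo_model`) rather than the housing data, so the number it prints is not comparable to the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption: used only so the sketch runs offline)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# Fit and predict on the same data, as done above
demo_model = LinearRegression().fit(X_demo, y_demo)
demo_pred = demo_model.predict(X_demo)

# RMSE is the square root of the mean squared error
train_rmse = np.sqrt(mean_squared_error(y_demo, demo_pred))
print(f'Training RMSE: {train_rmse:.3f}')
```

Because this score is measured on the same data the model was trained on, it tends to be optimistic, which is one motivation for cross-validation.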

---
---

# **Cross Validation**

- The general idea is to split the training set into several smaller training sets and validation sets *(folds)*, then train your model on each smaller training set and evaluate it against the corresponding validation set.
- Scikit-Learn provides a function called `cross_validate` for this purpose *([link to Scikit-Learn Page](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate))*
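
To make the idea concrete, here is a rough sketch of the loop written by hand with `KFold`. It is only an illustration of the general idea, not how Scikit-Learn implements it internally, and it uses synthetic data with illustrative names (`X_demo`, `y_demo`, `fold_mse`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: used only so the sketch runs offline)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

kfold = KFold(n_splits=5)
fold_mse = []

for train_idx, val_idx in kfold.split(X_demo):
    # Train a fresh model on the smaller training set
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Evaluate it on the held-out validation fold
    val_pred = model.predict(X_demo[val_idx])
    fold_mse.append(mean_squared_error(y_demo[val_idx], val_pred))

fold_mse = np.array(fold_mse)
print(fold_mse)
```

Each sample ends up in a validation fold exactly once, so we get one score per fold.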


## *`cross_validate` :*

- It reports the score on both the *test* and *train* sets, along with the *fit* and *score* times
- You can provide multiple scoring metrics

*To run cross-validation for a single metric evaluation, you can use `cross_val_score`*

#### Some of the available parameters:
- estimator: Estimator object implementing 'fit'
    - The object to use to fit the data
- scoring: str, callable, list, tuple, or dict, default=*None*
    - Strategy to evaluate the performance of the cross-validated model on the test *(validation)* set
    - You can find the complete list of values [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
- cv: int, cross-validation generator or an iterable, *default=None*
    - Determines the splitting strategy; an int sets the number of folds to create from your data set
    - None, to use the default 5-fold cross validation
- return_train_score: bool, *default=False*
    - Whether to include train scores
    - Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off
    - Computing scores on the training set can be computationally expensive and is not strictly required to select parameters that yield the best generalization performance

#### Implementing `cross_validate` :

In [12]:
from sklearn.model_selection import cross_validate

scores = cross_validate(estimator=lin_reg,
                       X=X_train_prepared,
                       y=y_train,
                       scoring='neg_mean_squared_error',
                       cv=10,
                       return_train_score=True)

In [13]:
scores

{'fit_time': array([0.00579262, 0.00482678, 0.00386262, 0.00386119, 0.0038588 ,
        0.00384784, 0.00579071, 0.00425839, 0.00303721, 0.00286961]),
 'score_time': array([0.00096607, 0.        , 0.        , 0.0009656 , 0.        ,
        0.00096488, 0.00096583, 0.00099468, 0.00096631, 0.00101519]),
 'test_score': array([-0.46912055, -0.57023278, -0.52267997, -0.48319985, -0.54304536,
        -0.49879977, -0.47454501, -0.54283267, -0.54130712, -0.55059256]),
 'train_score': array([-0.52340385, -0.51216672, -0.51747305, -0.52183985, -0.51516999,
        -0.52019339, -0.52281692, -0.51562862, -0.5154741 , -0.51435056])}

#### `cross_validate` returns a dict with the following keys:
- fit_time
    - The time for fitting the estimator on the train set for each cv split
- score_time
    - The time for scoring the estimator on the test set for each cv split
    - Time for scoring on the train set is not included even if `return_train_score` is set to `True`
- test_score
    - The score array for test scores on each cv split
    - Suffix `_score` in `test_score` changes to a specific metric like `test_r2` or `test_auc` if there are multiple scoring metrics in the scoring parameter
- train_score
    - The score array for train scores on each cv split
    - This is available only if `return_train_score` parameter is `True`
    - Suffix `_score` in `train_score` changes to a specific metric like `train_r2` or `train_auc` if there are multiple scoring metrics in the scoring parameter
- estimator
    - The estimator objects for each cv split
    - This is available only if `return_estimator` parameter is set to `True`
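
As a small sketch of how these keys change, the example below passes two metrics and sets both `return_train_score` and `return_estimator` to `True`. It runs on synthetic data, and the names `X_demo`, `y_demo` and `multi_scores` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: used only so the sketch runs offline)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

multi_scores = cross_validate(estimator=LinearRegression(),
                              X=X_demo,
                              y=y_demo,
                              scoring=['r2', 'neg_mean_squared_error'],
                              cv=5,
                              return_train_score=True,
                              return_estimator=True)

# With several metrics, the keys are named test_<metric> and train_<metric>
print(sorted(multi_scores.keys()))

# One fitted estimator is kept per cv split
print(len(multi_scores['estimator']))
```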

If you noticed, we set `scoring` to `neg_mean_squared_error`, which returns the negated MSE rather than the RMSE. We can convert it to RMSE by negating the scores and applying the `sqrt` function from `numpy`



In [14]:
rmse_scores = np.sqrt(-scores['test_score'])

Since we set `cv` to 10, we have 10 different scores. Another advantage of cross-validation is that you can measure how much your scores vary across folds, for example with the standard deviation

In [15]:
mean_score = np.mean(rmse_scores)
std_score = np.std(rmse_scores)

print(f'Mean: {mean_score}')
print(f'Std Dev: {std_score}')

Mean: 0.720472599387967
Std Dev: 0.023554156653336395
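
As an alternative to taking the square root manually, Scikit-Learn (version 0.22 and later) also accepts `neg_root_mean_squared_error` as a scoring string. The sketch below checks, on synthetic data with illustrative names, that both routes give the same per-fold RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: used only so the sketch runs offline)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# Route 1: negated MSE, converted to RMSE by hand
mse_result = cross_validate(LinearRegression(), X_demo, y_demo,
                            scoring='neg_mean_squared_error', cv=10)
rmse_manual = np.sqrt(-mse_result['test_score'])

# Route 2: negated RMSE straight from the scorer
rmse_result = cross_validate(LinearRegression(), X_demo, y_demo,
                             scoring='neg_root_mean_squared_error', cv=10)
rmse_direct = -rmse_result['test_score']

print(np.allclose(rmse_manual, rmse_direct))
```

Either way the scores come back negated, because Scikit-Learn scorers follow the convention that higher values are better.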


## *`cross_val_score` :*

- We can use `cross_val_score` instead of `cross_validate` if we just want to run cross-validation for a single metric evaluation
- We can't specify multiple metrics for evaluation
- It returns only an array of test scores, one per cv split

In [16]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(estimator=lin_reg,
                           X=X_train_prepared,
                           y=y_train,
                           scoring='neg_mean_squared_error',
                           cv=10)

In [17]:
cv_scores

array([-0.46912055, -0.57023278, -0.52267997, -0.48319985, -0.54304536,
       -0.49879977, -0.47454501, -0.54283267, -0.54130712, -0.55059256])
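
These values match the `test_score` array we got from `cross_validate` earlier. A short sketch of that relationship, using synthetic data and illustrative names (`cv_only`, `cv_full`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_validate

# Synthetic stand-in data (assumption: used only so the sketch runs offline)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# cross_val_score returns a plain NumPy array of test scores
cv_only = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=5)

# cross_validate returns a dict; the same scores sit under 'test_score'
cv_full = cross_validate(LinearRegression(), X_demo, y_demo,
                         scoring='neg_mean_squared_error', cv=5)

print(type(cv_only).__name__)
print(np.allclose(cv_only, cv_full['test_score']))
```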

---
---