# Module 2: Cross-Validation - Practice

In this practice you will apply **20-fold cross-validation** to a **Gaussian Naive Bayes model**
fit to the **titanic** dataset. The entire dataset is used for training with cross-validation.

+ Look for **placeholders** in the code and fill in the appropriate code.
+ Expect requirements in **bold** font when provided.
+ The presentation of printouts is not strict, as long as they are readable and equivalent.


In [4]:
import os, sys
from collections import Counter
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB


## Load Dataset

Load the dataset from file into a pandas DataFrame.

In [5]:
# Dataset location
DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| count | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 |
| mean | 2.31236 | 0.642697 | 29.548697 | 0.503371 | 0.351685 | 32.865772 | 0.895506 | 0.389888 |
| std | 0.837241 | 0.479475 | 13.379025 | 1.095286 | 0.790069 | 52.639685 | 0.529535 | 0.487999 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 1.0 | 0.0 |
| 50% | 3.0 | 1.0 | 28.0 | 0.0 | 0.0 | 13.775 | 1.0 | 0.0 |
| 75% | 3.0 | 1.0 | 37.0 | 1.0 | 0.0 | 29.925 | 1.0 | 1.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 2.0 | 1.0 |


In [6]:
dataset.head()

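| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 21.0 | 1 | 0 | 11.5 | 1 | 0 |
| 1 | 2 | 1 | 31.0 | 1 | 1 | 37.0042 | 0 | 0 |
| 2 | 3 | 1 | 27.0 | 0 | 0 | 8.6625 | 1 | 1 |
| 3 | 3 | 1 | 26.0 | 0 | 0 | 7.8958 | 1 | 0 |
| 4 | 2 | 1 | 28.0 | 0 | 0 | 10.5 | 1 | 0 |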

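The load cell shuffles the rows without a seed, so the fold contents and the scores below will differ from run to run. A minimal sketch of a reproducible shuffle, assuming a toy frame with made-up values in place of the real data and an arbitrary seed of `0`:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({"age": [21.0, 31.0, 27.0, 26.0], "survived": [0, 0, 1, 0]})

# random_state fixes the permutation, so reruns produce the same row order.
shuffled_a = toy.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Both shuffles are identical, and the index is rebuilt as 0..n-1.
print(shuffled_a.equals(shuffled_b))
```

Passing the same `random_state` in the load cell would make the printed scores repeatable across kernel restarts.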

## Part 1: Cross-validation with sklearn

Run a **20-fold** cross-validation using `cross_val_score()` from sklearn.

In [7]:
model = GaussianNB()

# Add your code below this comment (Question #P01)
# ----------------------------------
X = dataset.iloc[:,:-1]    # include all the columns except the last one
y = dataset.survived   # last col (survived)


In [8]:
from sklearn.preprocessing import MinMaxScaler

# perform scaling

scaler = MinMaxScaler().fit(X)  

X_scaled = scaler.transform(X)


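The scaling cell fits `MinMaxScaler` on every row, including rows that later act as held-out folds. One common alternative, sketched here on synthetic data (`X_demo` and `y_demo` are stand-ins, not the Titanic columns), is to wrap the scaler and the model in a pipeline so the scaler is refit on the training folds of each split:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the features and labels (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# cross_val_score clones and fits the whole pipeline per split, so the
# held-out fold never influences the scaler's min and max.
pipeline = make_pipeline(MinMaxScaler(), GaussianNB())
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=20)
print(pipeline_scores.mean())
```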
In [10]:
cv_scores = sklearn.model_selection.cross_val_score(model, X_scaled, y, cv=20)
cv_scores

array([0.77777778, 0.77777778, 0.66666667, 0.73333333, 0.77777778,
       0.75555556, 0.77777778, 0.86666667, 0.82222222, 0.77777778,
       0.86363636, 0.81818182, 0.59090909, 0.81818182, 0.68181818,
       0.77272727, 0.72727273, 0.86363636, 0.79545455, 0.68181818])

In [11]:
np.mean(cv_scores)

0.7673484848484848

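When `cv` is an integer and the estimator is a classifier with binary or multiclass labels, `cross_val_score()` builds its folds with `StratifiedKFold`, which keeps the class proportions roughly equal in every fold. The sketch below checks that equivalence on synthetic data (`X_demo` and `y_demo` are assumptions, not the Titanic columns):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the features and labels (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Integer cv: sklearn chooses the splitter itself.
scores_int = cross_val_score(GaussianNB(), X_demo, y_demo, cv=20)

# Explicit splitter: unshuffled stratified folds.
scores_explicit = cross_val_score(
    GaussianNB(), X_demo, y_demo, cv=StratifiedKFold(n_splits=20)
)

print(np.allclose(scores_int, scores_explicit))
```

This matters when comparing against a hand-written splitter: contiguous chunks are not stratified, so per-fold scores need not match.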
## Part 2: Create cross-validation manually

Make a 20-fold cross-validation **without** using scikit-learn's cross-validation scoring method.

(This cell is just a copy in case you lose the original code.)

```python
# Add your code below this comment (Question #P02)
# ----------------------------------
def cross_val_score(model, X, y, cv):
    X_folds = np.array_split(<placeholder>, <placeholder>)
    y_folds = np.array_split(<placeholder>, <placeholder>)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))

# Add your code below this comment (Question #P03)
# ----------------------------------
    for i in range(cv):
        X_train = np.concatenate([X_folds[<placeholder>] for j in range(cv) if <placeholder>])
        X_test = X_folds[<placeholder>]
        y_train = np.concatenate([y_folds[<placeholder>] for j in range(cv) if <placeholder>])
        y_test = y_folds[<placeholder>]
        model.fit(<placeholder>, <placeholder>)
        yield model.score(<placeholder>, <placeholder>)

print("Cross-validation:")
for i, score in enumerate(cross_val_score(model, X, y, cv=<placeholder>)):
    print(('\tscore[%d] ='%i), score)
```


In [14]:
# Add your code below this comment (Question #P02)
# ----------------------------------
def cross_val_score(model, X, y, cv):
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))

# Add your code below this comment (Question #P03)
# ----------------------------------
    for i in range(cv):
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j!=i])
        X_test = X_folds[i]
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j!=i])
        y_test = y_folds[i]
        model.fit(X_train, y_train)
        yield model.score(X_test, y_test)

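`np.array_split` accepts a row count that does not divide evenly by `cv`: the first `n % cv` chunks get one extra row. The arithmetic for the 890 rows reported by `describe()` can be sketched as follows (`fold_sizes` is a hypothetical helper written for illustration, not part of NumPy):

```python
def fold_sizes(n_samples, cv):
    """Chunk sizes that np.array_split yields for n_samples rows and cv folds."""
    base, extra = divmod(n_samples, cv)
    # The first `extra` folds receive one additional row each.
    return [base + 1] * extra + [base] * (cv - extra)

sizes = fold_sizes(890, 20)
print(sizes.count(45), sizes.count(44))  # prints "10 10"
```

Since 890 = 20 × 44 + 10, ten folds hold 45 rows and ten hold 44, which matches the `Counter` printed when the function runs.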

In [15]:
# now, test the above function
print("Cross-validation:")
cv_scores = []
for i, score in enumerate(cross_val_score(model, X_scaled, y, cv=20)):
    print(('\tscore[%d] ='%i), score)
    cv_scores.append(score)

Cross-validation:
X_folds Counter({(45, 7): 10, (44, 7): 10}) y_folds Counter({(45,): 10, (44,): 10})
	score[0] = 0.8
	score[1] = 0.7555555555555555
	score[2] = 0.6666666666666666
	score[3] = 0.7333333333333333
	score[4] = 0.7777777777777778
	score[5] = 0.7555555555555555
	score[6] = 0.8
	score[7] = 0.8444444444444444
	score[8] = 0.8222222222222222
	score[9] = 0.7777777777777778
	score[10] = 0.9090909090909091
	score[11] = 0.7727272727272727
	score[12] = 0.5909090909090909
	score[13] = 0.7727272727272727
	score[14] = 0.7727272727272727
	score[15] = 0.7045454545454546
	score[16] = 0.7727272727272727
	score[17] = 0.9090909090909091
	score[18] = 0.7272727272727273
	score[19] = 0.7272727272727273


In [16]:
np.mean(cv_scores)

0.769621212121212

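The manual scores differ slightly from Part 1 because the manual splitter cuts the rows into contiguous chunks, while an integer `cv` makes sklearn stratify the folds for a classifier. As a rough check, assuming synthetic data in place of the Titanic columns, an unshuffled `KFold` should reproduce the manual scores (`manual_scores` is a list-returning variant of the function above, written for this sketch):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in; 205 rows so the folds have unequal sizes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(205, 7))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=205) > 0).astype(int)

def manual_scores(model, X, y, cv):
    """Contiguous-chunk cross-validation, mirroring the Part 2 function."""
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    scores = []
    for i in range(cv):
        # Train on every fold except fold i, then score on fold i.
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        model.fit(X_train, y_train)
        scores.append(model.score(X_folds[i], y_folds[i]))
    return np.array(scores)

manual = manual_scores(GaussianNB(), X_demo, y_demo, cv=20)
unshuffled = cross_val_score(GaussianNB(), X_demo, y_demo, cv=KFold(n_splits=20))
print(np.allclose(manual, unshuffled))
```

Unshuffled `KFold` uses the same convention as `np.array_split`: the first `n % cv` folds get one extra row.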
# Save your notebook!  Then `File > Close and Halt`