# Module 2: Cross Validation - Answers for practice

In this practice you will apply **20-fold cross validation** to a **Gaussian Naive Bayes model** that is fitted to the **titanic** dataset.

+ Look for **placeholders** in the code and fill in the blanks.
+ Expect requirements in **bold** font when provided.
+ The presentation of printouts is not strict, as long as they are readable and equivalent.


In [None]:
import os, sys
from collections import Counter
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB


## Load Dataset

Load the dataset from the CSV file into a pandas DataFrame and shuffle its rows.

In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

In [None]:
dataset.head()
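
The shuffle above relies on `sample(frac=1)`, which returns every row in random order, so each run produces different folds. The following is a minimal sketch on a made-up five-row frame (the `passenger` column and its values are hypothetical) showing how passing `random_state` would make the shuffle reproducible:

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this sketch
demo = pd.DataFrame({'passenger': [1, 2, 3, 4, 5],
                     'survived': [0, 1, 1, 0, 1]})

# frac=1 keeps all rows; random_state fixes the permutation
shuffled_a = demo.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = demo.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a)
print(shuffled_a.equals(shuffled_b))
```

Whether a fixed seed is desirable depends on the exercise; without it, the cross validation scores below will vary slightly from run to run.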

## Part 1: Cross validation with sklearn

Run 20-fold cross validation using cross_val_score() provided by sklearn.

In [None]:
model = GaussianNB()

# Add your code below this comment (Question #P01)
# ----------------------------------
X = dataset.iloc[:,:-1]  # include all the columns except the last one
y = dataset.survived


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Learn the per-feature minimum and maximum from X
scaler = MinMaxScaler().fit(X)

# Rescale every feature to the [0, 1] range
X_scaled = scaler.transform(X)
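
Note that the scaler above is fitted on all rows, so the rows later used for testing also influence the scaling. One possible alternative, sketched here on synthetic data rather than the titanic dataset (`X_demo` and `y_demo` are made up for illustration), is to wrap the scaler and the model in a pipeline so that the scaler is refitted on the training part of each fold:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, not the titanic dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# The pipeline fits MinMaxScaler on each training split only
pipeline = make_pipeline(MinMaxScaler(), GaussianNB())
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(pipeline_scores.mean())
```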


In [None]:

cv_scores = sklearn.model_selection.cross_val_score(model, X_scaled, y, cv=20) # <placeholder>
cv_scores

In [None]:
np.mean(cv_scores)
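
When `cv` is an integer and the estimator is a classifier with binary or multiclass targets, sklearn's cross_val_score() builds stratified folds without shuffling. The sketch below, again on synthetic data (`X_demo` and `y_demo` are hypothetical), illustrates this by comparing `cv=20` with an explicit `StratifiedKFold`:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data, not the titanic dataset
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Integer cv versus an explicit stratified splitter
scores_int = cross_val_score(GaussianNB(), X_demo, y_demo, cv=20)
scores_skf = cross_val_score(GaussianNB(), X_demo, y_demo,
                             cv=StratifiedKFold(n_splits=20))
print(np.allclose(scores_int, scores_skf))
```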

## Part 2: Create cross validation manually

In [None]:
# Add your code below this comment (Question #P02)
# ----------------------------------
def cross_val_score(model, X, y, cv):
    # Convert to NumPy arrays so that DataFrames and Series split the same way
    X_folds = np.array_split(np.asarray(X), cv)
    y_folds = np.array_split(np.asarray(y), cv)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))

    # Add your code below this comment (Question #P03)
    # ----------------------------------
    for i in range(cv):
        # Train on every fold except fold i, then score on fold i
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        X_test = X_folds[i]
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        y_test = y_folds[i]
        model.fit(X_train, y_train)
        yield model.score(X_test, y_test)
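
The fold sizes printed by the function come from `np.array_split`, which, unlike `np.split`, accepts a length that is not divisible by the number of folds and gives the extra items to the first folds. A small sketch with ten made-up items split three ways:

```python
import numpy as np

items = np.arange(10)

# 10 is not divisible by 3, so the first fold gets one extra item
folds = np.array_split(items, 3)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)  # [4, 3, 3]
```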



In [None]:
print("Cross validation:")
cv_scores = []
for i, score in enumerate(cross_val_score(model, X_scaled, y, cv=20)):
    print(f'\tscore[{i}] =', score)
    cv_scores.append(score)

In [None]:
np.mean(cv_scores)
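
The manual mean may differ slightly from the Part 1 result, because the manual version cuts contiguous, unstratified folds. As a sanity check, the sketch below uses synthetic data and a hypothetical helper `manual_cv_scores` (a list-returning variant of the generator above) to compare the manual loop against sklearn's cross_val_score() with a plain, unshuffled `KFold`, which is assumed here to produce the same contiguous folds:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def manual_cv_scores(model, X, y, cv):
    """Hypothetical helper: manual k-fold scores returned as an array."""
    X_folds = np.array_split(np.asarray(X), cv)
    y_folds = np.array_split(np.asarray(y), cv)
    scores = []
    for i in range(cv):
        # Train on all folds except fold i, score on fold i
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        model.fit(X_train, y_train)
        scores.append(model.score(X_folds[i], y_folds[i]))
    return np.array(scores)

# Synthetic stand-in data, not the titanic dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(203, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=203) > 0).astype(int)

manual = manual_cv_scores(GaussianNB(), X_demo, y_demo, cv=20)
reference = cross_val_score(GaussianNB(), X_demo, y_demo, cv=KFold(n_splits=20))
print(np.allclose(manual, reference))
```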

# Save your notebook!