# Cross validation

In [2]:
import pandas as pd

In [5]:
df = pd.read_csv('train.csv')

In [40]:
y = df['Survived']
X = df[['Pclass', 'SibSp']]

## Split the data into training and test sets

In [41]:
# Split the data into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape


((712, 2), (179, 2))

## Create a model

In [42]:
# Build a simple logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_train, y_train)

0.6235955056179775

## Cross validation

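Cross-validation splits the training set into several parts (folds): the model is fitted on some of them and validated on the rest. E.g., for cv=3 the training set is divided into three folds, and in each of three rounds two folds are used for fitting and the remaining one for validation, so every fold serves as the validation set exactly once. If the three scores are similar, the model's performance does not depend much on which rows it was trained on. If they vary a lot, the model is not robust, and overfitting might be a problem.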
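The splitting described above can be sketched by hand. This is only an illustration: `k_fold_indices` is a hypothetical helper, not part of scikit-learn, and it uses contiguous, unshuffled folds. For a classifier and an integer `cv`, `cross_val_score` uses stratified folds by default, which this sketch ignores. The 712 rows match the training set size from the split above.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds.

    Hypothetical helper for illustration only.
    """
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        # One fold is held out for validation ...
        val_idx = list(range(start, stop))
        # ... and all remaining rows are used for fitting.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(n_samples=712, k=3))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"round {i}: fit on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Each row lands in the validation fold exactly once, so the three scores together cover the whole training set.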

In [34]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy')
scores

array([0.68907563, 0.66666667, 0.66244726])

In [35]:
scores.mean().round(3), scores.std().round(3)

(0.673, 0.012)

## Optimization finished, what next?

In [36]:
# The built-in round() works whether score() returns a Python float or a NumPy float
print('training score: ', round(model.score(X_train, y_train), 3))
print('test score: ', round(model.score(X_test, y_test), 3))

training score:  0.673
test score:  0.704


#### Interpretation:
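- training and test scores are similar: the model generalizes as expected
- training score much higher than test score: the model is probably overfitting
- training score lower than test score: usually a random fluctuation caused by a small data set; if the gap is large, look for a bug in the pipeline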
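The rules of thumb above can be written as a small check. This is a rough sketch: `interpret_scores` is a hypothetical helper, and the `tolerance` of 0.05 is an arbitrary assumption, not a value taken from this notebook or from scikit-learn. What counts as "similar" depends on the size of the data set.

```python
def interpret_scores(train_score, test_score, tolerance=0.05):
    """Map a train/test score pair to one of the three cases above.

    Hypothetical helper; `tolerance` is an assumed, arbitrary threshold.
    """
    gap = train_score - test_score
    if gap > tolerance:
        # Training score clearly higher than test score.
        return "possible overfitting"
    if gap < -tolerance:
        # Test score clearly higher than training score.
        return "test score above training score: check data set size and pipeline"
    return "scores are similar"


# Scores printed in the cell above
print(interpret_scores(0.673, 0.704))
```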