## Pipeline: Fit a basic model

This notebook uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

In this section, we will fit and evaluate a basic model using 5-fold Cross-Validation.

### Read in data

![Initial Model](../../img/fit_model.png)

In [32]:
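import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import warnings

# Silence FutureWarning messages so the cell output stays readable
warnings.filterwarnings("ignore", category=FutureWarning)

# Training features, one row per passenger
tr_features = pd.read_csv('../../../train_features.csv')
# Training labels: skip the file's header row, so the single column is named 0
tr_labels = pd.read_csv('../../../train_labels.csv', header=None, skiprows=1)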
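
The `header=None, skiprows=1` combination used for the labels is easy to misread. As a small sketch on made-up inline CSV text (the column name `Survived` and the values are assumptions for illustration, not the contents of `train_labels.csv`), it discards the header row and leaves the single column named `0`:

```python
import io
import pandas as pd

# Hypothetical stand-in for a one-column labels file with a header row
csv_text = "Survived\n1\n0\n1\n"

# Default parsing: the first row is used as the column name
with_header = pd.read_csv(io.StringIO(csv_text))

# skiprows=1 drops the header row; header=None then numbers the columns from 0
no_header = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1)

print(list(with_header.columns))  # ['Survived']
print(list(no_header.columns))    # [0]
print(no_header.shape)            # (3, 1)
```

Both calls return the same three rows; only the column name differs.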

In [33]:
tr_features.head()

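| | Pclass | Sex | Age | Fare | Family_cnt | Cabin_ind |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 62.0 | 10.5 | 0 | 0 |
| 1 | 3 | 0 | 8.0 | 29.125 | 5 | 0 |
| 2 | 3 | 0 | 32.0 | 56.4958 | 0 | 0 |
| 3 | 3 | 1 | 20.0 | 9.825 | 1 | 0 |
| 4 | 2 | 1 | 28.0 | 13.0 | 0 | 0 |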


In [28]:
tr_labels.head()

| | 0 |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


In [29]:
tr_features.shape

(534, 6)

In [30]:
tr_labels.shape

(534, 1)

### Fit and evaluate a basic model using 5-fold Cross-Validation

![CV Image](../../img/CV_image.png)
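
To make the idea concrete, here is a minimal sketch of what 5-fold cross-validation does under the hood. It runs on synthetic data from `make_classification` rather than the Titanic files, and the fixed `random_state` and `n_estimators=20` are assumptions chosen only to keep the example fast and repeatable. The manual loop trains on four folds, scores on the held-out fold, and is then compared with `cross_val_score`:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 6 features, binary labels
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Fixed seed and a small forest so both approaches are comparable and fast
rf = RandomForestClassifier(n_estimators=20, random_state=0)

# For a classifier, an integer cv is treated as a stratified k-fold split
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(rf)                      # fresh, unfitted copy for each fold
    model.fit(X[train_idx], y[train_idx])  # train on the other four folds
    # Accuracy on the held-out fold
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(rf, X, y, cv=5)

print(manual_scores)
print(library_scores)
```

With a seeded estimator the two arrays should match, which shows that each score is simply the accuracy on one held-out fold.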

In [45]:
tr_labels.values.ravel()

array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
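
As a small illustration of what `.values.ravel()` changes, the sketch below uses a hypothetical four-row label frame (not the real `tr_labels`) and compares the shape before and after flattening:

```python
import pandas as pd

# Hypothetical one-column label frame, mimicking the (n_rows, 1) shape above
labels = pd.DataFrame({0: [1, 0, 1, 0]})

column_vector = labels.values   # 2-D NumPy array with a single column
flat = labels.values.ravel()    # 1-D NumPy array

print(column_vector.shape)  # (4, 1)
print(flat.shape)           # (4,)
print(flat.tolist())        # [1, 0, 1, 0]
```

The values are unchanged; only the second dimension is dropped.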

In [46]:
tr_labels

| | 0 |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| ... | ... |
| 529 | 1 |
| 530 | 0 |
| 531 | 0 |
| 532 | 1 |


In [38]:
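# Random forest with scikit-learn's default hyperparameters
rf = RandomForestClassifier()

# 5-fold cross-validation; ravel() flattens the (534, 1) labels into a 1-D array.
# For a classifier, each score is the accuracy on one held-out fold.
scores = cross_val_score(rf, tr_features, tr_labels.values.ravel(), cv=5)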

In [39]:
tr_features

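| | Pclass | Sex | Age | Fare | Family_cnt | Cabin_ind |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 62.000000 | 10.5000 | 0 | 0 |
| 1 | 3 | 0 | 8.000000 | 29.1250 | 5 | 0 |
| 2 | 3 | 0 | 32.000000 | 56.4958 | 0 | 0 |
| 3 | 3 | 1 | 20.000000 | 9.8250 | 1 | 0 |
| 4 | 2 | 1 | 28.000000 | 13.0000 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 529 | 3 | 1 | 21.000000 | 7.6500 | 0 | 0 |
| 530 | 1 | 0 | 29.699118 | 31.0000 | 0 | 0 |
| 531 | 3 | 0 | 41.000000 | 14.1083 | 2 | 0 |
| 532 | 1 | 1 | 14.000000 | 120.0000 | 3 | 1 |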


In [40]:
scores

array([0.81308411, 0.82242991, 0.78504673, 0.78504673, 0.80188679])
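
Five separate numbers are hard to compare at a glance, so it can help to reduce them to a mean and a spread. The sketch below copies the fold accuracies from the output above into a new array. Reporting the population standard deviation is just one reasonable choice, not something the notebook prescribes:

```python
import numpy as np

# Fold accuracies copied from the output above
scores = np.array([0.81308411, 0.82242991, 0.78504673, 0.78504673, 0.80188679])

mean_acc = scores.mean()
std_acc = scores.std()  # population standard deviation across the five folds

print(f"mean accuracy: {mean_acc:.3f}")  # mean accuracy: 0.801
print(f"std across folds: {std_acc:.3f}")
```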

In [43]:
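# Here the labels are passed as a one-column DataFrame rather than a 1-D array.
# The call still runs, but scikit-learn warns once per fold; tr_labels.values.ravel() avoids this.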
scores = cross_val_score(rf, tr_features, tr_labels, cv=5)

  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)


In [44]:
scores

array([0.82242991, 0.82242991, 0.81308411, 0.79439252, 0.83962264])
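
The repeated `estimator.fit(...)` lines above look like the tail of a warning raised once per fold. The sketch below tries to reproduce that behaviour directly, assuming the warning in question is scikit-learn's `DataConversionWarning` for column-vector labels. The data is synthetic and the helper `fit_and_collect_warnings` is a hypothetical name introduced only for this example:

```python
import warnings

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import DataConversionWarning

# Tiny synthetic data set (assumption), just large enough to fit a forest
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y_column = (X[:, 0] > 0).astype(int).reshape(-1, 1)  # shape (40, 1)


def fit_and_collect_warnings(y):
    """Fit a small forest and return the categories of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
    return [w.category for w in caught]


column_warnings = fit_and_collect_warnings(y_column)        # 2-D labels
flat_warnings = fit_and_collect_warnings(y_column.ravel())  # 1-D labels

print(DataConversionWarning in column_warnings)
print(DataConversionWarning in flat_warnings)
```

If the assumption holds for the installed scikit-learn version, the first check should be true and the second false, which matches the advice to pass `tr_labels.values.ravel()`.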