# Train Test Split

To make sure that the classifier doesn't simply "remember" all the data, we will put some data aside before we train the classifier. This way, we can check whether the algorithm actually extracts useful rules instead of merely memorizing examples.

We will do this by using the function [`sklearn.model_selection.train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), which shuffles the data and splits it into a training set and a test set.
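
To see what a stratified split does, here is a minimal sketch on toy data. The arrays `X` and `y` are made up for illustration (38 of 100 labels are positive); they are not the Titanic data. Without an explicit `test_size`, `train_test_split` holds out 25% of the rows, and passing `stratify=y` should keep the share of positives roughly the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, assumed for illustration only: 38 positives out of 100 samples.
y = np.array([1] * 38 + [0] * 62)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Default split: 75 rows for training, 25 rows for testing.
print(len(X_train), len(X_test))
# Share of positives in each part; both should be close to 0.38.
print(y_train.mean(), y_test.mean())
```

The fixed `random_state` only makes the shuffle reproducible; any other integer would work as well.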


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('train.csv')
df = df.fillna(0)

In [None]:
from sklearn.model_selection import train_test_split

features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

xTrain, xTest, yTrain, yTest = train_test_split(
    df[features],
    df["Survived"],
    stratify=df["Survived"],  # keep the survival ratio equal in both sets
    random_state=42,  # make the split reproducible
)

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(xTrain, yTrain)
tree.score(xTrain, yTrain)

In [None]:
tree.score(xTest, yTest)

In [None]:
df["Survived"].mean()

As we can see, the accuracy we can expect on unseen data is only about 0.63 (63%). Given that the probability of dying is $1 - 0.38 = 0.62$, we could have achieved nearly the same accuracy by predicting that everyone dies. Not too good, is it?
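
The "everyone dies" baseline can be made explicit with scikit-learn's `DummyClassifier`. The following is a minimal sketch on toy labels that mimic the 62/38 ratio mentioned above; the arrays `X` and `y` are assumptions for illustration, not the Titanic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels, assumed for illustration: 62% are 0 ("died"), 38% are 1.
y = np.array([0] * 62 + [1] * 38)
X = np.zeros((100, 1))  # the dummy classifier ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)  # 0.62
```

In this notebook, fitting such a baseline on `xTrain, yTrain` and scoring it on `xTest, yTest` would give the number a real classifier has to beat.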

This happens when the algorithm learns noise. For example, most of us would agree that learning both of the following rules will not generalize well:
```
if age > 29 -> dead
if age > 31 -> survived
```
We would clearly see that there was simply one person of age 30 in the dataset who died, but there is (probably) no general knowledge extractable from this.
```
if age > 15 -> dead
if age > 60 -> survived
```
This, however, would be a completely different story! Children are normally more likely to get a spot in a lifeboat (and maybe their grandmas and grandpas, too).


This behaviour is called overfitting. It shows up as a training score that is clearly higher than the test score. On the other hand, there is underfitting: the training score and the test score are nearly equal, and both are low. In that case the algorithm is not complex enough to grasp all relations in your data, or there are simply no such relations.
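
The two situations can be compared side by side by looking at the training and test scores of a very flexible and a very simple model. The sketch below uses synthetic, noisy data from `make_classification` (an assumption for illustration, not the Titanic data) and contrasts an unrestricted tree with a tree of depth 1. The unrestricted tree should show the large gap typical of overfitting; whether the depth-1 tree looks like underfitting depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), assumed for illustration only.
X, y = make_classification(
    n_samples=600,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    flip_y=0.2,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

scores = {}
for depth in (None, 1):  # None: grow until every training sample is fitted
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    scores[depth] = (train_score, test_score)
    # A large positive gap between the two scores hints at overfitting.
    print(depth, train_score, test_score, train_score - test_score)
```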


So how can we restrict the algorithm so that it learns more generally applicable knowledge? It turns out that each algorithm has a lot of parameters we can adjust. Let us look back at the documentation of the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) and try to restrict it.

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(xTrain, yTrain)
tree.score(xTrain, yTrain)

In [None]:
tree.score(xTest, yTest)

Well, this makes it better, but not by much. What else can we do?
If you remember, one of our first decisions was to discard all non-numeric data.
Let us have a look at [feature extraction and engineering](./05-FeatureEngineering.ipynb) to make use of the remaining columns.