# Titanic Dataset: Your model is already pretty good 
## ...and it's certainly better than the top 150 "models".


So you've just joined Kaggle and, whether through the Kaggle ML courses, an ML book or because its reputation precedes it, you've tried your hand at the famous Titanic dataset. Perhaps you've tried a few things and you've managed to increase your model's categorization accuracy score ever so slightly.

My first score was 0.75598 and after 12 new attempts, I managed to get 0.77990 (there was a 13th new attempt, didn't go too well...).

#### Not particularly impressive, right?

At the time of writing, there are about 155 Kaggle teams appearing in the public leaderboard with a perfect score of 1.00000 (notice the five 0s, these models must be at least 0.999995 accurate if there's any rounding going on). 

How did they do it? There are a lot of novices at the top, novices just like you and me. Do they just have a knack, a gift for ML? Why are you wasting your time with this notebook when you could have a look at some of theirs? (OK, I'll answer that one right off the bat: These notebooks are mostly identical and none of them show you how to get a score of 1.00000 from the data on Kaggle)

#### What if I told you your latest model, nay, your very first attempt at a model is/was already pretty good and better than the top 150 or so "models". How could this be? How could I possibly know this?

Let's look at a maritime disaster from a parallel dimension...

## <center>-- The Tatinac disaster --</center>

Up until the early morning hours of 15 April 1912, the RMS Titanic and RMS Tatinac differed in name only. The passengers on both ships were identical to each other in every way, shape or form...that is, until evacuation began. Then something utterly bizarre happened in the Tatinac universe...people formed orderly queues. As on the RMS Titanic, cries rang out "Women and children first" and on that basis, people queued. Unfortunately, when the last woman and the last child under the age of 15 got onto the lifeboats, there suddenly weren't any boats left...and so all the men sank with the ship, having humbly accepted their fate.

    ¯\_(ツ)_/¯

#### Let us now take the Titanic dataset and transform it into the Tatinac dataset, including a survivor column in the test set.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

tatinac_train = pd.read_csv("../input/titanic/train.csv")
tatinac_test = pd.read_csv("../input/titanic/test.csv")

#let's start from a blank slate, with no survival...how bleak...
tatinac_train["Survived"] = 0 #this changes the values in the pre-existing column "Survived" to 0.
tatinac_test["Survived"] = 0 #this adds a new column "Survived" to the test set.

#To reiterate, in the Tatinac universe, all women and children (under 15) survive.
tatinac_train.loc[(tatinac_train["Sex"]=="female") | (tatinac_train["Age"]<15),"Survived"] = 1 
tatinac_test.loc[(tatinac_test["Sex"]=="female") | (tatinac_test["Age"]<15), "Survived"] = 1
#There are a few men whose age is not given. NaN < 15 evaluates to False, so for now they are all treated as 15 or older.

This is what the data for the Tatinac disaster looks like. Eyeballing it, the pattern quickly becomes fairly obvious.

In [2]:
#This is what the data for the Tatinac disaster looks like.
tatinac_train.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 1 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
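
To make the labelling rule concrete, here is a minimal sketch on a handful of made-up passengers (the rows are hypothetical, not taken from the Kaggle files). It applies the same "female or under 15" condition and shows how a missing age is handled.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Age": [40.0, 30.0, 8.0, np.nan, 14.0],
})

# Same rule as in the Tatinac universe: women and children under 15 survive.
# NaN < 15 evaluates to False, so a man of unknown age is treated as an adult.
survives = (toy["Sex"] == "female") | (toy["Age"] < 15)
toy["Survived"] = survives.astype(int)

print(toy)
print(toy.groupby("Sex")["Survived"].mean())  # survival rate per sex
```

Every woman survives, and among the men only the 8-year-old does, which is exactly the pattern a model should pick up.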


It will not come as a surprise to anyone that any model, even a simple one, that picks up on the all-female and all-under-15 pattern in the data will do exceptionally well. Armed with the gift of foresight, let's just use a simple decision tree.

(As with the Titanic dataset, we do need to do a little preprocessing on the data. Don't worry if this next cell is unclear, just skip to the next cell. I have borrowed this section from Aurélien Géron's excellent 'Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow'. It is not vital for understanding the point of this notebook.)

In [3]:
#preprocessing start...
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

num_pipeline = Pipeline([
    ('select', DataFrameSelector(["Age","Fare"])),
    ('imputer',SimpleImputer(strategy="median")),
])

cat_pipeline = Pipeline([
    ('select', DataFrameSelector(["Pclass","Sex","SibSp"])),
    ('imputer',SimpleImputer(strategy="most_frequent")),
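    # Boolean False, not the string 'False' (which is truthy). The argument is called
    # sparse_output from scikit-learn 1.2 onwards; older versions call it sparse.
    ('onehot', OneHotEncoder(sparse_output=False, handle_unknown="ignore")),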
])

preprocess_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
##...preprocessing end...phew, now back to the good stuff.
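
If the cell above feels opaque, the following sketch shows the same idea on a tiny made-up frame: impute missing numbers with the median, then one-hot encode the categorical columns. It uses scikit-learn's `ColumnTransformer` rather than the custom `DataFrameSelector`, which is a swapped-in alternative and not what this notebook runs; the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 2.0, 38.0],
    "Fare": [7.25, 8.46, 21.08, 71.28],
    "Pclass": [3, 3, 3, 1],
    "Sex": ["male", "male", "male", "female"],
    "SibSp": [1, 0, 3, 1],
})

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values with the column median.
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    # Categorical columns: fill with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Pclass", "Sex", "SibSp"]),
])

X = preprocess.fit_transform(toy)
X = X.toarray() if hasattr(X, "toarray") else X  # densify if a sparse matrix came back

print(X.shape)  # 2 numeric columns + one column per category value
print(X[1])     # the second passenger, with the missing age filled in
```

The missing age is replaced by the median of the known ages (22.0 here), and each distinct value of `Pclass`, `Sex` and `SibSp` becomes its own 0/1 column.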

Now we fit a simple decision tree classifier to our preprocessed Tatinac dataset.

In [4]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz 
import graphviz ##we will want to visualize the tree we've created.

tt = tatinac_train.copy(deep=True)

X_train = preprocess_pipeline.fit_transform(tt)
y_train = tatinac_train["Survived"]

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

DecisionTreeClassifier()
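
If you'd rather read the learned rules as text than open a Graphviz PDF, scikit-learn's `export_text` prints them directly. The sketch below trains a tree on synthetic passengers (randomly generated ages and sexes, an assumption made purely for illustration) labelled with the Tatinac rule, and shows that the tree recovers that rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500

# Synthetic passengers: uniformly random ages and a 0/1 flag for sex.
age = rng.uniform(1, 70, size=n)
is_female = rng.integers(0, 2, size=n)
X = np.column_stack([age, is_female])

# Tatinac rule: women and children under 15 survive.
y = ((is_female == 1) | (age < 15)).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

rules = export_text(tree, feature_names=["Age", "is_female"])
print(rules)
print("training accuracy:", tree.score(X, y))
```

Because the labels are a deterministic function of the two features, the tree only needs a split on sex and a split on age close to 15 to classify every synthetic passenger correctly.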

Let's see how well our model did.

In [5]:
from sklearn.metrics import accuracy_score

#Reuse the preprocessing fitted on the training set: only transform the test set, never refit on it.
X_test = preprocess_pipeline.transform(tatinac_test.drop("Survived", axis=1))
y_test = tatinac_test["Survived"]
y_pred = dtc.predict(X_test)

print("categorization accuracy score:", accuracy_score(y_test, y_pred))

##have a look at your decision tree in your output folder...makes a lot of sense, doesn't it?
visual = graphviz.Source(export_graphviz(dtc,class_names=True))
visual.render("imageA")

categorization accuracy score: 1.0


'imageA.pdf'
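
A side note on preprocessing: statistics such as the imputation median are normally learned from the training set only and then reused unchanged on the test set. The toy sketch below (made-up ages, not Titanic data) shows the difference between `fit` followed by `transform` and calling `fit_transform` on the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages, invented for illustration only.
train_age = np.array([[20.0], [30.0], [40.0], [np.nan]])
test_age = np.array([[np.nan], [60.0], [70.0]])

# Recommended: learn the median from the training data, then apply it to the test data.
imputer = SimpleImputer(strategy="median")
imputer.fit(train_age)
filled_from_train = imputer.transform(test_age).ravel()

# Refitting on the test set silently swaps in the test median instead.
filled_from_test = SimpleImputer(strategy="median").fit_transform(test_age).ravel()

print("train median used:", filled_from_train)
print("test median used: ", filled_from_test)
```

The same passenger ends up with an age of 30 in one case and 65 in the other, so a model trained with one convention can see different inputs at prediction time if the other is used.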

#### So it IS possible to get an accuracy of 1.00000 after all!?

Wait a minute, even in the Tatinac universe reality is a little more complex. Remember we have men whose age is not given. Each of them could be on either side of 15. Let's revisit those few cases and assign them a random chance of survival. There are just a few of them, so it shouldn't change the logic of the training or test set considerably.

In [6]:
import random

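#Men whose age is not known get a randomly drawn survival value (i.e. we no longer assume they were all 15 or older).
#Note: random.choice([0,1]) is evaluated once per statement, so all such men within one dataset share the same draw.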
tatinac_train.loc[(tatinac_train["Sex"]=="male") & (tatinac_train["Age"].isna()), "Survived"] = random.choice([0,1])
tatinac_test.loc[(tatinac_test["Sex"]=="male") & (tatinac_test["Age"].isna()), "Survived"] = random.choice([0,1])

In [7]:
#We repeat all the same steps as earlier, now that values have been updated.
X_train = preprocess_pipeline.fit_transform(tatinac_train.drop("Survived", axis=1))
y_train = tatinac_train["Survived"]

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

X_test = preprocess_pipeline.transform(tatinac_test.drop("Survived", axis=1))  #transform only, the pipeline was fitted on the training set
y_test = tatinac_test["Survived"]
y_pred = dtc.predict(X_test)

print("categorization accuracy score:", accuracy_score(y_test, y_pred))

##The decision tree is very likely exactly the same. If not, it'll be at the very least functionally identical.
visual = graphviz.Source(export_graphviz(dtc,class_names=True))
visual.render("imageB")

categorization accuracy score: 0.8492822966507177


'imageB.pdf'

#### Ouch, that took a turn...

Just by integrating age uncertainty the categorization accuracy score falls to a value of around 0.85 (note: occasionally one gets lucky and gets 0.95. Respectable, but a far cry from 1). The model arrived at was the same (how could it be any different?) and yet because of age uncertainty, we can't do much better than this.
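
The cell above draws a single random value per dataset, so all men of unknown age share one fate. As a rough sketch of what happens when each of them gets an independent coin flip instead, the simulation below estimates the accuracy of a model that is right about everyone else and always predicts "did not survive" for the uncertain men. The figure of 63 uncertain passengers is an assumption inferred from the printed score (355/418 ≈ 0.84928), not something read from the data.

```python
import numpy as np

rng = np.random.default_rng(42)

n_test = 418       # rows in the test set, consistent with the scores in this notebook
n_uncertain = 63   # ASSUMPTION: test passengers whose outcome is a coin flip

# The model predicts 0 for every uncertain passenger and is correct on all others.
y_pred_uncertain = np.zeros(n_uncertain, dtype=int)

scores = []
for _ in range(1000):
    # One independent draw per passenger, rather than one draw for the whole group.
    y_true_uncertain = rng.integers(0, 2, size=n_uncertain)
    n_correct = (n_test - n_uncertain) + int(np.sum(y_true_uncertain == y_pred_uncertain))
    scores.append(n_correct / n_test)

mean_score = float(np.mean(scores))
worst_case = (n_test - n_uncertain) / n_test  # every uncertain passenger survives

print("mean accuracy with per-passenger coin flips:", round(mean_score, 3))
print("worst case:", worst_case)
```

Under these assumptions the expected score sits around 1 - 63/(2 × 418), and the worst case reproduces the 0.849 seen above. Either way, no amount of modelling recovers outcomes that are random given the available features.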

Ok, so maybe you're not entirely convinced yet. Maybe the models that get 1.00000 are doing something clever and age is just not the right metric to focus on due to the various NaN entries. Alright, let's make age perfectly predictable again and focus on sex.

In [8]:
#Reverting the survival value of men of unknown age back to 0.
tatinac_train.loc[(tatinac_train["Sex"]=="male") & (tatinac_train["Age"].isna()), "Survived"] = 0
tatinac_test.loc[(tatinac_test["Sex"]=="male") & (tatinac_test["Age"].isna()), "Survived"] = 0

Let's now create an example dataset relating events nearly as improbable as our original Tatinac disaster. Let us say that the test set contains one counterexample: one adult man survived. He was next in the queue and got on right before the cutoff. The first entry in our test dataset is a man, so he is our lone male survivor.

In [9]:
tatinac_test.loc[0,"Survived"]=1 ##the one lucky survivor
tatinac_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 1 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S | 1 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 1 |


In [10]:
#We repeat all the same steps as earlier, now that values have been updated. 
X_train = preprocess_pipeline.fit_transform(tatinac_train.drop("Survived", axis=1))
y_train = tatinac_train["Survived"]

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

X_test = preprocess_pipeline.transform(tatinac_test.drop("Survived", axis=1))  #transform only, the pipeline was fitted on the training set
y_test = tatinac_test["Survived"]
y_pred = dtc.predict(X_test)

print("categorization accuracy score:", accuracy_score(y_test, y_pred))

##The decision tree is STILL functionally identical!
visual = graphviz.Source(export_graphviz(dtc,class_names=True))
visual.render("imageC")

categorization accuracy score: 0.9976076555023924


'imageC.pdf'

#### What?! 0.99761?! 

That certainly doesn't round to 1 if we're considering five decimals! Not even close. 
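
The arithmetic behind that number is worth spelling out. The test set has 418 rows, so accuracy can only move in steps of 1/418. The helper below (`accuracy_after` is a name made up for this sketch) prints the score for a few error counts.

```python
def accuracy_after(n_wrong: int, n_total: int = 418) -> float:
    """Accuracy when n_wrong out of n_total predictions are incorrect."""
    return (n_total - n_wrong) / n_total


for n_wrong in (0, 1, 2, 5, 10):
    print(f"{n_wrong:>2} wrong -> {accuracy_after(n_wrong):.5f}")
```

A single wrong prediction already costs about 0.0024, so a displayed score of 1.00000 leaves no room for even one mistake.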

Notice that all it took was just ONE exception in the test set. One single exception. If that's all it takes not to get a clean 1.00000 prediction for this ever-so-slightly modified Tatinac dataset, imagine how much lower the maximum categorization accuracy is going to be for the Titanic dataset! Let's add a few more exceptions to get a sense of how quickly an otherwise perfect prediction loses accuracy.

In [11]:
tatinac_test.loc[1,"Survived"]=0 ##the second entry is a woman. Sadly she now falls overboard.
tatinac_test.loc[2,"Survived"]=1 ##the third entry is a man. Floats on top a door.
tatinac_test.loc[3,"Survived"]=1 ##There's enough space on the door for two people.
tatinac_test.loc[4,"Survived"]=0 ##But not for the original three.

tatinac_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 1 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S | 0 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 1 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 1 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 0 |


In [12]:
##We haven't changed the training dataset this time so we can just power ahead with the test data preprocessing and prediction.
X_test = preprocess_pipeline.transform(tatinac_test.drop("Survived", axis=1))  #transform only, no refitting on the test set
y_test = tatinac_test["Survived"]
y_pred = dtc.predict(X_test)

print("categorization accuracy score:", accuracy_score(y_test, y_pred))

##You get the gist...
visual = graphviz.Source(export_graphviz(dtc,class_names=True))
visual.render("imageD")

categorization accuracy score: 0.9880382775119617


'imageD.pdf'

#### Only 5 miserable examples running counter to the Tatinac disaster's orderly evacuation and the prediction is already down to 0.988?!

Now you're truly wondering how anyone can possibly get 1.00000 on the Titanic dataset.

Perhaps they somehow got hold of the survivor column for the test set (hidden on Kaggle for very good reason!) and, in an act of excusable silliness, ran the test set through the fitting phase of their model instead of the training set, thus perhaps overfitting? Oh dear!

That's too forgiving an explanation and it doesn't even work. Let's give it a try (obviously, never do this in any other context).

In [13]:
dtc = DecisionTreeClassifier()
dtc.fit(X_test, y_test) #training with the test set. Bad, terrible, no good practice!

X_test = preprocess_pipeline.transform(tatinac_test.drop("Survived", axis=1))
y_test = tatinac_test["Survived"]
y_pred = dtc.predict(X_test)

print("categorization accuracy score:", accuracy_score(y_test, y_pred))

##Even when trained on the test set itself, the score stops short of 1!
visual = graphviz.Source(export_graphviz(dtc,class_names=True))
visual.render("imageE")

categorization accuracy score: 0.9976076555023924


'imageE.pdf'

#### We still don't get 1.00000! 

This is because, even though we are plugging in the test set to predict the test set, we are asking the model to derive rules from the features, not to copy the survivor column. Our decision tree still learns that women and children under 15 survive while the men don't, and it adds extra branches to memorize most of the exceptions (which is why the score rises from 0.98804 to 0.99761). It still falls short of 1.00000, presumably because at least two passengers look identical in the features we kept and yet have different outcomes, and no rule can tell them apart.

Again, none of this is advice. Overfitting is the opposite of what you'd normally do: you want a model that generalizes well, which is incidentally another reason 1.00000 is never a desirable score. The only other scenario where you get 1.00000 is one where your data is too simple to need ML.
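
Here is a minimal sketch of why even training on the very data you predict can fall short of a perfect score. The six rows are hypothetical, and two of them have identical features but different outcomes, so no tree, however deep, can get both right.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical passengers: columns are [Age, is_female].
# The first two rows are indistinguishable to the model, yet only one survives.
X = np.array([
    [27.0, 0],
    [27.0, 0],
    [62.0, 0],
    [47.0, 1],
    [22.0, 1],
    [8.0, 0],
])
y = np.array([1, 0, 0, 1, 1, 1])

# No depth limit, so the tree is free to memorize everything it can separate.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_score = tree.score(X, y)

print("accuracy on the data it was trained on:", train_score)
```

The tree classifies the four distinguishable passengers correctly and is forced to get one of the two identical rows wrong, so it tops out at 5 out of 6.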

#### So what IS happening? 

Well, in case it isn't obvious yet, each and every one of these 150 or so teams simply submitted the test set with the correct survivor column attached, which they somehow found online (I didn't look for it and neither should you!). Given that the sinking of the RMS Titanic is something that actually happened, one should be able to find this information and match it to the test set.

The problem here isn't just the cheating: nothing is gained from simply submitting the answers. You cannot even replicate this dishonest strategy for any other dataset (so it isn't even 'clever' cheating). Those who do it arrest their potential growth as MLers on Kaggle. What a shame!

#### Think about your prediction again. Whether you've got 0.75 or higher, you're doing great. 

Keep it up, try a few things. Go learn some more ML tricks and try again. A seemingly small increment is a leap in your knowledge of ML. Quickly move on to other datasets. Maybe come back to the Titanic dataset from time to time. Whatever you do, DO NOT get disheartened by the fact you're not getting anything near 1.00000 or 0.98804; these are illusory results.

If you look for the holy grail of models in the various 'winning' dataset notebooks, you'll quickly notice a lot of notebooks look suspiciously the same (because they are) and they mostly cover a lot of data visualization (which is nice, but that's probably not why you were looking at them in the first place, right?).

#### So what is the highest legitimate categorization accuracy score?

It is nearly impossible to tell where the dividing line between legitimate and fake lies, but some notebooks below 0.85 look legit (this is a wild guess based on some circumstantial evidence, it could be a little higher). Don't worry too much about it. Try to improve your score, then move on to other datasets. **You're doing great, don't get discouraged!**