# Titanic Survival Prediction
## Scientific Python - Final Project - SS23
### By Lucia Walther and Milena Schlichting

On April 15th, 1912, the infamous sinking of the RMS Titanic caused the tragic death of more than 1500 people, making it the deadliest sinking of a single ship up to that time. To this day the fate of the "unsinkable" ship remains prominent in our minds; it led to improvements in maritime safety and to criticism of social class divisions. The latter is of interest to us, since we want to investigate how social status, gender, age and other factors influenced the survival chances of the crew and passengers aboard.

In this project you will get an informed prediction about your survival during the sinking of the Titanic, based on your personalized input. 

 ...
- Which ML algorithms do we use?
- Which data set?


First we import the necessary libraries, load the data set and have a look at it to see what we are working with.

In [65]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")

In [67]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


As we can see, there are 12 features:
- PassengerId: numbers the passengers with integers ranging from 1 to 891
- Survived: either 1 (survived) or 0 (did not survive)
- Pclass: a proxy for the socio-economic status, can be either 1st (upper) class, 2nd (middle) class or 3rd (lower) class
- Name: the name of the passenger/crew member
- Sex: either female or male
- Age: the age in years; ages below 1 are given as a fraction, and estimated ages are given in the form xx.5
- SibSp: indicates with how many siblings or spouses the person traveled
- Parch: indicates with how many parents or children the person traveled. Some children traveled with just a nanny and therefore have the value 0.
- Ticket: the ticket number
- Fare: passenger fare
- Cabin: the cabin number
- Embarked: port of embarkation, either C = Cherbourg,  Q = Queenstown or S = Southampton
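The convention for the "Age" feature described above can be turned into a simple check. The following is only a sketch on a few made-up ages (the name `is_estimated` and the sample values are our own assumptions, not part of the data set): it flags an age as estimated if it is at least 1 and ends in .5.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages, not rows from the data set.
ages = pd.Series([22.0, 0.42, 28.5, np.nan, 35.0])

# An age counts as estimated here if it is at least 1 and ends in .5.
# Comparisons with NaN evaluate to False, so missing ages are not flagged.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)
print(is_estimated.tolist())
```

Ages below 1 are excluded on purpose, because they are given as fractions regardless of whether they were estimated.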

Our target class is "Survived". In the following, we investigate which features are important for predicting the target class. By excluding any non-informative features we can reduce computation time and complexity.

## Normalizing
In order to work with the data we are going to normalize it and format certain features to make them easier to work with. 

There are some missing values, which we convert into NumPy `NaN` values.

In [68]:
df = df.replace('nan', np.nan)
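To see how many values are actually missing, one could count the `NaN` entries per column. The sketch below uses a small hand-made frame (`toy` is a stand-in of our own, not rows from `titanic.csv`), so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# Count the missing values in every column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to the real data frame, the same call would be `df.isna().sum()`. Note that `pd.read_csv` already reads empty fields as `NaN` by default.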

We will start with the feature "Sex" and turn the possible values "male" and "female" into numerical values, with 1 = female and 0 = male.

In [69]:
df['Sex'] = df['Sex'].replace('male', 0)
df['Sex'] = df['Sex'].replace('female', 1)

Next we have a look at the unique values of the "Cabin" feature. On the Encyclopedia Titanica website (https://www.encyclopedia-titanica.org/cabins.html) we can see that the letters A-G and T indicate the deck the cabin was on. This is interesting, as the deck is closely tied to the class of the passenger. The room number, on the other hand, is of no further interest. <br>
Therefore we simplify the "Cabin" feature into "Deck", which holds the deck letter (A-G, T) of the cabin, by taking the first letter of each value. Values that list several cabins, like *"B57 B59 B63 B66"* and *"F G63"*, are therefore reduced to the deck of their first listed entry.

![Titanic Cutout](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Olympic_%26_Titanic_cutaway_diagram.png/220px-Olympic_%26_Titanic_cutaway_diagram.png "Titanic Cutout")

In [70]:
print(df["Cabin"].unique())

# keep only the first character of each value (the deck letter); missing values stay NaN
df["Deck"] = df["Cabin"].str[0] 

[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33'
 'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110'
 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49'
 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77'
 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106'
 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91'
 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34'
 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79'
 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68'
 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58'
 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90'
 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6'
 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50'
 'B42' 'C148']
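As a small check of the first-letter rule, the sketch below applies `.str[0]` to a handful of values from the output above. The variable names `cabins` and `decks` are our own.

```python
import numpy as np
import pandas as pd

# Sample values taken from the "Cabin" column shown above.
cabins = pd.Series([np.nan, "C85", "B57 B59 B63 B66", "F G63", "T"])

# .str[0] keeps the first character and leaves missing values as NaN.
decks = cabins.str[0]
print(decks.tolist())
```

A value like "F G63" ends up on deck F, even though its second entry names deck G. This is the simplification described above.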


To be able to do calculations with these values we turn the letters A-G into the numbers 0-6 and T = 7. 

In [71]:
replacement_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'T': 7}
df['Deck'] = df['Deck'].replace(replacement_dict)

When normalizing the "Embarked" column (the port where each passenger boarded the ship), we replace the strings with numbers in the order in which the ship called at the ports: <br>
- S (*Southampton*) will be __0__
- C (*Cherbourg*) will be __1__
- Q (*Queenstown*) will be __2__.


In [72]:
df["Embarked"] = df["Embarked"].replace("S", 0)
df["Embarked"] = df["Embarked"].replace("C", 1)
df["Embarked"] = df["Embarked"].replace("Q", 2)
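The three replacements can also be written as a single dictionary lookup. The following is only a sketch of that alternative on made-up values (`embarked` and `port_order` are names we introduce here), not the way the notebook does it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of port codes, including one missing value.
embarked = pd.Series(["S", "C", "Q", np.nan, "S"])

# Ports in the order of the route: Southampton, Cherbourg, Queenstown.
port_order = {"S": 0, "C": 1, "Q": 2}

# map() looks every value up in the dictionary; missing values stay NaN.
embarked_num = embarked.map(port_order)
print(embarked_num.tolist())
```

Because `NaN` is a float, a column that contains missing values is stored as floats after the conversion, so the ports would appear as 0.0, 1.0 and 2.0 rather than as integers.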

![Titanic Route](https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/TitanicRoute.svg/1200px-TitanicRoute.svg.png "Titanic Route")


### Data set after normalizing:

In [78]:
# Select the columns of interest; a list of column names needs double brackets.
sub = df[["Survived", "Pclass", "SibSp", "Parch", "Fare", "Sex", "Deck", "Age", "Embarked"]]

In [73]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Deck
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,0.0,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,1.0,2.0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,0.0,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,0.0,2.0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,0.0,


## Finding correlation in the data

Now we can work with the data and investigate the relationships between the different features. To get an overview of the correlations, we plot a heatmap. We leave out the features *"Name", "Ticket"* and *"Cabin"*, because they are non-numeric and provide no further information in their raw form. Instead of *"Cabin"* we use the normalized feature *"Deck"*.
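The correlations themselves can be computed with pandas' `DataFrame.corr`, which returns the pairwise Pearson correlation of all numeric columns. The sketch below uses made-up values (the frame `toy` is our own and does not contain rows from the data set) to show what such a correlation matrix looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from the Titanic data set.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Age":      [22.0, 38.0, np.nan, 35.0, 27.0, 54.0],
})

# Pairwise Pearson correlation; NaN values are skipped pair by pair.
corr = toy.corr()
print(corr.round(2))
```

Each entry lies between -1 and 1, and the diagonal is always 1, since every feature correlates perfectly with itself.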

In [74]:
def make_heatmap(df):
    """ Plots a heatmap of the pairwise correlations between the columns of df. """
    # Compute the pairwise Pearson correlations (NaN values are skipped pairwise).
    corr = df.corr()

    # Create a subplot.
    fig, ax = plt.subplots()
    # Plot the correlation matrix using imshow.
    im = ax.imshow(corr, cmap = "inferno", vmin = -1, vmax = 1)

    # Set the necessary labels and title for the Axes object.
    _ = ax.set(
        xticks = range(len(corr.columns)),
        xticklabels = corr.columns,
        yticks = range(len(corr.columns)),
        yticklabels = corr.columns,
        title = "Correlation between Features"
    )
    plt.setp(ax.get_xticklabels(), rotation=90, ha="right",
         rotation_mode="anchor")

    # Create a colorbar on the right side of the plot.
    fig.colorbar(im)

In [76]:
# Select several columns at once by passing a list of names (double brackets).
df_subset = df[["Survived", "Pclass", "SibSp", "Parch", "Fare", "Sex", "Deck", "Age", "Embarked"]]
make_heatmap(df_subset)

In [None]:
from matplotlib.colors import LogNorm

# Count passengers per (Survived, Pclass) combination: 2 x 3 bins.
hist, xedges, yedges = np.histogram2d(df_subset["Survived"], df_subset['Pclass'], bins=[2, 3])
plt.imshow(hist, interpolation='nearest', origin='lower', aspect='auto', cmap='hot', norm=LogNorm())
plt.colorbar(label='count in bin')
plt.show()
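The 2D histogram above counts how many passengers fall into each combination of "Survived" and "Pclass". For two features with only a few distinct values, the same counts can be read off a contingency table. The sketch below shows this with `pd.crosstab` on made-up values (the frame `toy` is our own and does not contain rows from the data set).

```python
import pandas as pd

# Hypothetical toy data, not rows from the Titanic data set.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2],
})

# Count how many passengers fall into each (Survived, Pclass) combination.
counts = pd.crosstab(toy["Survived"], toy["Pclass"])
print(counts)
```

The rows of the table correspond to the values of "Survived" and the columns to the passenger classes, which makes the labels easier to read than the bin edges of a histogram.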