# Titanic Data Analysis

The purpose of this project is to perform some statistical analysis in Python on the [Titanic data set](https://www.kaggle.com/c/titanic/data) found on Kaggle, and to draw some initial inferences about the characteristics of the passengers and their probability of survival based on these characteristics.

### Importing the data

We have previously downloaded the data from the link shown above and saved it with the name ['train.csv'](/dataset/train.csv). 

We use the pandas `read_csv` function to load the data into a DataFrame, then inspect the column types, the first few rows, and the summary statistics.

In [125]:
import pandas as pd 
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('datasets/train.csv')
print(df.dtypes)
df.head()

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [126]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


From the initial description of the data we can see that we have 891 data points at our disposal, and 12 attributes:
- A unique passenger ID, stored as an integer
- A binary variable indicating whether the passenger survived the sinking (this is the variable we will ultimately try to predict)
- The class of the passenger's ticket, stored as an integer (1: first class, 2: second class, 3: third class)
- The name of the passenger, stored as a string
- The sex of the passenger, stored as a string
- The age of the passenger, stored as a float
- The number of siblings or spouses that the passenger had on board, stored as an integer
- The number of parents or children that the passenger had on board, stored as an integer
- The ticket number, stored as a string
- The price of the ticket, stored as a float
- The cabin number, stored as a string
- The port of embarkation, stored as a string


### Data cleaning

#### Column names
We strip any leading or trailing whitespace from the column names and convert them to lowercase.

In [127]:
df.columns = [x.strip().lower() for x in df.columns]
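
The same normalization can also be written with the vectorized string methods on the column index. The sketch below uses a small, made-up DataFrame (`toy` is a hypothetical name, not the Titanic data) to show the effect:

```python
import pandas as pd

# Hypothetical toy frame with untidy headers (not the Titanic file)
toy = pd.DataFrame({" PassengerId ": [1, 2], "Sex": ["male", "female"]})

# Vectorized equivalent of the list comprehension:
# strip surrounding whitespace, then lowercase every column name
toy.columns = toy.columns.str.strip().str.lower()

print(list(toy.columns))
```

Both forms give the same result; the list comprehension is arguably easier to read, while the `.str` accessor keeps the result as a pandas `Index`.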

#### Missing values
First, we check whether the dataset contains any missing values and, if it does, deal with them accordingly.

In [128]:
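def missing_values(df):
    """Print the share of missing values per column and return the affected columns."""
    columns_missing = []
    for column in df.columns:
        if df[column].isnull().any():
            # Divide by the actual number of rows instead of a hard-coded 891
            missing_pct = df[column].isnull().sum() / len(df)
            # %d truncates, so shares below 1% are printed as 0%
            print('Column %s contains %d%% of missing values' % (column, missing_pct * 100))
            columns_missing.append(column)
        else:
            print('Column', column, 'contains NO missing values')
    return columns_missing

columns_missing = missing_values(df)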


Column passengerid contains NO missing values
Column survived contains NO missing values
Column pclass contains NO missing values
Column name contains NO missing values
Column sex contains NO missing values
Column age contains 19% of missing values
Column sibsp contains NO missing values
Column parch contains NO missing values
Column ticket contains NO missing values
Column fare contains NO missing values
Column cabin contains 77% of missing values
Column embarked contains 0% of missing values


Now we know that three of the columns contain missing values:
- Age is missing for 19% of the observations; this is few enough that we can fill in the gaps
- Cabin is missing for 77% of the observations, so we drop this column
- Embarked is missing for less than 1% of the observations (printed as 0% because the percentage is truncated to an integer); we drop this column too, as it won't be used
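
As a cross-check on the percentages above, the share of missing values can also be computed without a loop. This is a minimal sketch on a made-up frame; `missing_share` and `toy` are hypothetical names introduced only for illustration:

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values for each column that has any."""
    # isnull() gives booleans; their mean is the fraction of missing entries
    pct = frame.isnull().mean() * 100
    return pct[pct > 0].round(2)


# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

print(missing_share(toy))
```

Keeping two decimals rather than truncating to an integer makes small shares, such as the one in the embarked column, visible.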

We will replace missing values of the age attribute according to the sex. That is, all male missing values will be replaced with the average age of all men, and the same will be applied for females.

In [129]:
# Obtaining average ages
avg_male_age = df['age'][df['sex'] == 'male'].mean()
avg_female_age = df['age'][df['sex'] == 'female'].mean()

# Creating a column with the average age of the sex of each passenger
df['age_avgs'] = df.apply(lambda x: avg_male_age if x['sex'] == 'male' 
                                                else avg_female_age, axis=1)

# Filling missing values with the average of the sex of each passenger
df['age'] = df.apply(lambda x: x['age_avgs'] if pd.isnull(x['age']) else x['age'], axis=1)

# Dropping unnecessary columns (the helper column plus the attributes we won't use)
df.drop(columns=['age_avgs', 'cabin', 'embarked', 'ticket'], inplace=True)

df.head(6)


| | passengerid | survived | pclass | name | sex | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.05 |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 30.726645 | 0 | 0 | 8.4583 |
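
The per-sex imputation described above can also be expressed with `groupby` and `transform`, which avoids the helper column and the row-wise `apply`. This is only an alternative sketch on made-up data (`toy` and `group_mean` are hypothetical names), not the method used in the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male"],
    "age": [20.0, 40.0, 28.0, np.nan, np.nan],
})

# Mean age per sex, broadcast back to every row of the group (NaN is skipped)
group_mean = toy.groupby("sex")["age"].transform("mean")

# Fill only the missing ages; observed ages are left untouched
toy["age"] = toy["age"].fillna(group_mean)

print(toy)
```

Here the missing female age becomes 28.0 and the missing male age becomes 30.0, the means of the observed values in each group.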


#### Converting attributes to numeric values
To use the data in numerical analysis, we need to convert the text attributes of interest into a numeric representation. In this case the only remaining attribute to convert is "sex".

We will define female = 1 and male = 0

In [130]:
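# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which may not update df in recent pandas versions
df['sex'] = df['sex'].map({'male': 0, 'female': 1})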

df.head()

| | passengerid | survived | pclass | name | sex | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | 7.25 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | 7.925 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 8.05 |
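
When encoding a text column it can be worth checking that every label was actually converted. The sketch below, on a made-up series (`toy`, `encoded` and `unmapped` are hypothetical names), encodes with a dictionary and lists any labels that the dictionary did not cover:

```python
import pandas as pd

# Hypothetical toy column with one unexpected spelling
toy = pd.Series(["male", "female", "Female", "male"], name="sex")

# Labels missing from the dictionary become NaN
encoded = toy.map({"male": 0, "female": 1})

# Original labels that were not converted
unmapped = toy[encoded.isna()].unique().tolist()

print(unmapped)
```

An empty list means every value was encoded. In the Titanic data the sex column contained no missing values, so any entry here would point to an unexpected label.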


#### Extracting information from name
The raw data contains a name attribute in the form "Last name, Title. First name Middle name". To extract as much information as possible from this attribute, we will decompose it into its individual parts and create new attributes:
- First name
- Middle name
- Last name
- Title

To do this we have to remove the parentheses, keeping the text between them, and then split the text into the components mentioned above.

In [None]:
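def extract_names(df):
    # First step only: remove the parentheses but keep the text between them
    df['name'] = df['name'].str.replace(r'[()]', '', regex=True)
    return df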
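
One possible way to carry out the full decomposition described above is with a regular expression. The following is a sketch under stated assumptions, not the author's implementation: the function name `split_name_parts` and the new column names are hypothetical, the pattern assumes every name follows the "Last name, Title. Given names" layout, and everything after the first given name is treated as the middle name.

```python
import pandas as pd


def split_name_parts(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with last_name, title, first_name, middle_name added."""
    out = frame.copy()

    # Remove the parentheses but keep the text between them
    cleaned = out["name"].str.replace(r"[()]", "", regex=True)

    # Assumed layout: "Last name, Title. Given names"
    pattern = r"^\s*(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<given>.*)$"
    parts = cleaned.str.extract(pattern)

    # First token is the first name; anything left is treated as the middle name
    given = parts["given"].str.strip().str.split(n=1, expand=True)
    given = given.reindex(columns=[0, 1])  # guarantee both columns exist

    out["last_name"] = parts["last_name"].str.strip()
    out["title"] = parts["title"].str.strip()
    out["first_name"] = given[0]
    out["middle_name"] = given[1]
    return out


# Three names taken from the rows shown earlier
toy = pd.DataFrame({"name": [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]})

result = split_name_parts(toy)
print(result[["last_name", "title", "first_name", "middle_name"]])
```

Names that do not match the assumed layout yield missing values in the new columns, so it is worth counting those before relying on the result.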
