<h1><center>Titanic</center></h1>

<h1>1. Problem definition</h1>

In this notebook, I consider the problem of predicting passenger survival on the Titanic using the Titanic dataset. It is based on the Kaggle competition "Titanic - Machine Learning from Disaster". This is a classification problem: we will clean and analyze the dataset, and then build a machine learning model with the scikit-learn library. This solution was created solely for educational purposes.

<h2>1.1 Preliminary assumptions</h2>

- PassengerId should not matter to the model; it only numbers the passengers.
- Passenger wealth probably improved the chance of survival, so variables such as Pclass, Fare, Name (especially honorifics) and perhaps Ticket should matter to the model.
- The gender of the passenger should matter; in this type of disaster, and especially in that era, women were more likely to be rescued first.
- The age of the passenger should matter; the youngest and the oldest passengers may have a higher survival rate.
- SibSp and Parch should matter; passengers travelling with more family members (especially children) may have a higher survival rate.
- Port of embarkation and cabin may matter because they can dictate location on the ship.
- The numbers in the Ticket variable are most likely useless, but the letters may carry some meaning.

<h1>2. Data preparation</h1>

<h3>Import packages</h3>

In [1]:
# data analysis and manipulation
import pandas as pd
import numpy as np

<h3>Load datasets</h3>

In [2]:
# load the training and test data from CSV files into dataframes
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

<h3>Check the data</h3>

In [3]:
# print first five records
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We have 12 variables. We will use the data dictionary from Kaggle*, which contains their descriptions. Here are some observations about them:
- PassengerId is a unique identifier for each passenger and almost certainly carries no useful information. It is the first candidate for removal, but we will decide after a closer examination.
- Survived is our target (dependent) variable; we want to create a model that predicts it.
- There are 5 categorical independent variables: Pclass, Sex, Cabin, Ticket and Embarked.
- Pclass is an ordinal variable; 1 is the first class, and so on.
- Ticket will require further investigation; it is unclear whether it is unique and whether it follows a pattern.
- The Name variable contains honorifics.
- The Embarked variable contains abbreviations of the embarkation ports.
- The Cabin variable contains empty values as well as letters; this will also require further investigation.

*https://www.kaggle.com/competitions/titanic/data
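
Several of the observations above (empty values in Cabin, whether Ticket is unique) can be checked directly. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_columns` is hypothetical, and the `sample` frame only copies three columns from the five rows printed above. In the notebook, the helper would be called on the full `train` dataframe.

```python
import pandas as pd


def summarize_columns(df):
    """Return dtype, missing-value count and unique-value count per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'n_unique': df.nunique(),  # NaN is not counted as a value
    })


# Three columns copied from the train.head() output (illustration only)
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'],
    'Cabin': [None, 'C85', None, 'C123', None],
})

summary = summarize_columns(sample)
print(summary)
```

On only five rows the numbers say little about the whole dataset; the point is the shape of the check, which scales to all 12 columns.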

<h2>2.1 Data cleaning</h2>

<h3>Identify columns that contain a single value or very few unique values</h3>

In [18]:
# get the Series of numbers of unique values
unique_values_series = train.nunique()

# iterate through the Series and calculate unique values as a percentage of all values
for i, v in unique_values_series.items():
    percentage = v / train.shape[0] * 100
    print('%s: %d, %.2f%%' % (i, v, percentage))

PassengerId: 891, 100.00%
Survived: 2, 0.22%
Pclass: 3, 0.34%
Name: 891, 100.00%
Sex: 2, 0.22%
Age: 88, 9.88%
SibSp: 7, 0.79%
Parch: 7, 0.79%
Ticket: 681, 76.43%
Fare: 248, 27.83%
Cabin: 147, 16.50%
Embarked: 3, 0.34%


Key observations:
- PassengerId and Name are unique for every row. PassengerId seems to have no predictive power, so we will drop it.
- Name contains titles, which may have predictive power, so we will keep the column and extract them.
- Survived is our target variable, so we will keep it.
- Pclass, Sex and Embarked are categorical variables, and SibSp and Parch are small discrete counts, so their low percentage of unique values is expected.
- Age and Fare need more investigation; we will check their variance, but first we need to learn more about them.
- Ticket and Cabin need some simplification because they have too many unique values.
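
One possible way to extract the titles and to simplify Ticket and Cabin is sketched below. This is an assumption about how the step could look, not the approach the notebook necessarily takes: the new column names `Title`, `Deck` and `TicketPrefix`, the regular expression, and the `'NONE'` placeholder are all chosen here for illustration. The `sample` frame reuses values from the five rows printed earlier.

```python
import pandas as pd

# Values copied from the first five rows of the training data (illustration only)
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Th...',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry'],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'],
    'Cabin': [None, 'C85', None, 'C123', None],
})

# Title: the text between the comma and the first period in Name
sample['Title'] = sample['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Deck: the first letter of Cabin; missing cabins stay missing
sample['Deck'] = sample['Cabin'].str[0]

# TicketPrefix: the part before the number, or 'NONE' for purely numeric tickets
sample['TicketPrefix'] = sample['Ticket'].str.split().apply(
    lambda parts: parts[0] if len(parts) > 1 else 'NONE'
)

print(sample[['Title', 'Deck', 'TicketPrefix']])
```

Reducing Cabin to a deck letter and Ticket to a prefix trades detail for far fewer categories, which is the simplification the last bullet asks for.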

<h3>Identify rows that contain duplicate data</h3>

In [19]:
# check dataset for duplicates
dups = train.duplicated()
print(dups.any())

False


The dataset contains no duplicate rows.
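
Because PassengerId is unique for every row, a full-row comparison can never flag a duplicate, so it may be worth repeating the check with the identifier left out. The sketch below assumes a hypothetical helper `count_duplicates` and uses a made-up three-row `demo` frame purely to show the difference. Since Name is also unique in the training data, the real dataset would still be expected to report zero.

```python
import pandas as pd


def count_duplicates(df, ignore=('PassengerId',)):
    """Count rows that repeat an earlier row once the `ignore` columns are left out."""
    cols = [c for c in df.columns if c not in ignore]
    return int(df.duplicated(subset=cols).sum())


# Made-up toy data: rows 1 and 2 differ only in the identifier
demo = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['A', 'A', 'B'],
    'Age': [22.0, 22.0, 30.0],
})

full_row_dups = int(demo.duplicated().sum())  # identifier makes every row distinct
dups_without_id = count_duplicates(demo)      # second row repeats the first

print(full_row_dups, dups_without_id)
```

In the notebook, the equivalent call would be `count_duplicates(train)`.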

<h3>Identify outliers</h3>