# TITANIC DATASET EDA (EXPLORATORY DATA ANALYSIS)

Dataset
The Titanic dataset has the following variables:

1. PassengerId: ID of the passenger
2. Survived: Survival (0 = No; 1 = Yes)
3. Pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
4. Name: Name of the passenger
5. Gender: Gender of the passenger (female / male)
6. Age: Age of the passenger
7. SibSp: Number of siblings/spouses aboard
8. Parch: Number of parents/children aboard
9. Ticket: Ticket number
10. Fare: Passenger fare (British pounds)
11. Cabin: Cabin number
12. Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Project Roadmap
* Import libraries and dataset
* Exploratory data analysis
* Model construction and evaluation
* Summary

# Import Libraries and Dataset

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('tested.csv')
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

# Exploratory Data Analysis

### Missing Data

In [None]:
# missing data
df.isna().sum()

Some columns contain missing values, so let's handle those before exploring further.

In [None]:
# Visualize missing data
sns.heatmap(df.isnull());

In [None]:
# Deal with missing data
# Assign the result back: calling replace(..., inplace=True) on a selected column
# may not update df in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
df = df.drop('Cabin', axis=1)
df.isna().sum()

We imputed the missing values in 'Age' and 'Fare' with the column mean, and we dropped 'Cabin' because it has too many missing values.
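The imputation step above can be sketched on a small toy frame. The values below and the helper name `impute_and_drop` are made up for illustration and are not taken from `tested.csv`; the sketch only shows mean imputation with `fillna` and dropping a sparse column, without modifying the input.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not rows from tested.csv).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 38.0],
    "Fare": [7.25, 8.05, np.nan, 71.28],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
})

def impute_and_drop(frame, mean_cols, drop_cols):
    """Return a new frame with `drop_cols` removed and `mean_cols` mean-imputed."""
    out = frame.drop(columns=drop_cols)
    for col in mean_cols:
        # mean() skips NaN by default, so it is computed from observed values only
        out[col] = out[col].fillna(out[col].mean())
    return out

clean = impute_and_drop(toy, mean_cols=["Age", "Fare"], drop_cols=["Cabin"])
print(clean.isna().sum())
```

Returning a new frame instead of mutating in place keeps the original data available for comparison, for example to check how much the imputation shifted the distribution.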

In [None]:
df.nunique()

In [None]:
# Let's make separate lists of categorical and numerical variables

cat_var = ['Pclass', 'Gender', 'SibSp', 'Parch', 'Embarked']
num_var = ['Age', 'Fare']

In [None]:
round(df['Survived'].value_counts(normalize=True)*100, 2)

This dataset is slightly imbalanced: 36.36% of the passengers survived the Titanic disaster and 63.64% did not.

In [None]:
ax = sns.countplot(x=df['Survived'], palette='RdBu')
ax.bar_label(ax.containers[0])
plt.show()

In [None]:
# Left: overall counts per category; right: counts split by survival
for column in cat_var:
    plt.figure(figsize=(12,5))
    plt.subplot(1,2,1)
    ax = sns.countplot(x=column, data=df, palette='Accent')
    ax.bar_label(ax.containers[0])

    plt.subplot(1,2,2)
    ax = sns.countplot(x=column, data=df, hue='Survived', palette='Accent')
    ax.bar_label(ax.containers[0])
    ax.bar_label(ax.containers[1])
    plt.show()

* The number of passengers who did not survive is high in passenger classes 3 and 2.
* The plot shows that every female passenger survived and every male passenger did not. A split this clean is unusual for real data, so it is worth checking how the 'Survived' labels in this file were generated before drawing conclusions from it.
* Passengers with no siblings/spouses aboard have a lower chance of survival.
* Passengers with no parents/children aboard have a lower chance of survival.
* Passengers who embarked at Queenstown have a good chance of survival, while passengers who embarked at Southampton have a lower chance of survival.
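The count plots show raw numbers, so statements about "chance of survival" are easier to verify as per-category rates. A minimal sketch, assuming a frame with a 0/1 `Survived` column; the toy values and the helper name `survival_rate` are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Toy frame with hypothetical values (not rows from tested.csv).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

def survival_rate(frame, by, target="Survived"):
    """Percentage of survivors within each level of `by`."""
    # The mean of a 0/1 column is the share of ones in each group
    return (frame.groupby(by)[target].mean() * 100).round(2)

rates = survival_rate(toy, "Pclass")
print(rates)
```

On the real data, the same helper could be called once per entry of `cat_var`, for example `survival_rate(df, 'Embarked')`.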

In [None]:
df[num_var].describe()

In [None]:
for column in num_var:
    plt.figure(figsize=(14,5))
    plt.subplot(1,2,1)
    ax = sns.boxplot(x=df[column])

    plt.subplot(1,2,2)
    # histplot with kde=True replaces the deprecated distplot
    ax = sns.histplot(x=df[column], kde=True, stat='density')
    plt.show()

* There are some outliers in the 'Age' variable, and its distribution shows that most passengers were between 20 and 40 years old.
* There are several outliers in the 'Fare' variable, and its distribution is right-skewed.
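The outlier and skewness observations can also be checked numerically. The sketch below assumes the same 1.5 × IQR rule that box plots use by default for their whiskers; the helper name `iqr_outliers` and the fare values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Toy fares with hypothetical values (not rows from tested.csv).
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 30.0, 250.0], name="Fare")

outliers = iqr_outliers(fares)
skewness = fares.skew()  # positive values indicate a longer right tail
print(len(outliers), skewness)
```

On the real data, `iqr_outliers(df['Fare'])` and `df['Fare'].skew()` would give a count of flagged values and a skewness figure to accompany the plots.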

In [None]:
sns.scatterplot(data=df, x='Age', y='Fare', hue='Survived')
plt.show()

* The plot suggests that passengers who paid higher fares had a better chance of survival.
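A scatter plot makes this relationship hard to judge when points overlap, so a per-outcome summary of 'Fare' can serve as a quick cross-check. This is a minimal sketch on hypothetical toy values, not a result from `tested.csv`.

```python
import pandas as pd

# Toy frame with hypothetical values (not rows from tested.csv).
toy = pd.DataFrame({
    "Fare":     [7.25, 8.05, 13.0, 26.0, 71.28, 211.5],
    "Survived": [0, 0, 0, 1, 1, 1],
})

# Median is less sensitive than the mean to the few very high fares
fare_by_outcome = toy.groupby("Survived")["Fare"].agg(["median", "mean"])
print(fare_by_outcome)
```

Applied to the real data, the equivalent call would be `df.groupby('Survived')['Fare'].agg(['median', 'mean'])`.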

In [None]:
sns.scatterplot(data=df, x='Age', y='Fare', hue='Pclass')
plt.show()

In [None]:
# Let's drop identifier-like columns that are not useful as features before modeling
df_update = df.drop(['PassengerId','Name', 'Ticket'], axis=1)
df_update.head()

In [None]:
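# Encode categorical columns as integers (the codes are arbitrary labels, not an ordering)
# Assign the result back instead of using inplace=True on a selected column
df_update['Gender'] = df_update['Gender'].map({'male':1, 'female':0})
df_update['Embarked'] = df_update['Embarked'].map({'Q':0, 'S':1, 'C':2})
df_update.head()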

In [None]:
# Let's see the correlation between variables
sns.heatmap(df_update.corr(), annot=True, cmap='viridis')
plt.title('Correlation')
plt.show()