# Introduction
The sinking of the Titanic is one of the most notorious shipwrecks in history. In 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board.

<font color = "blue">
Content:
    
1. [Load and Check Data](#1)
2. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable Analysis](#4)
        * [Numerical Variable Analysis](#5)
3. [Basic Data Analysis](#6) 
4. [Outlier Detection](#7)
5. [Missing Value](#8)  
    * [Find Missing Value](#9)
    * [Fill Missing Value](#10)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
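# "seaborn-whitegrid" was renamed "seaborn-v0_8-whitegrid" in matplotlib 3.6, so use whichever name is available
plt.style.use("seaborn-whitegrid" if "seaborn-whitegrid" in plt.style.available else "seaborn-v0_8-whitegrid")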
import seaborn as sns
from collections import Counter

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="1"></a>
# Load and Check Data

In [None]:
df_train = pd.read_csv("/kaggle/input/titanic/train.csv")
df_test = pd.read_csv("/kaggle/input/titanic/test.csv")
df_passengerid = df_test["PassengerId"]

In [None]:
df_train

<a id="2"></a>
# Variable Description

1. **PassengerId** - Unique ID number of each passenger
1. **Survived** - Whether the passenger survived (1) or died (0)
1. **Pclass** - Passenger class 
1. **Name** - Passenger name
1. **Sex** - Gender of passenger
1. **Age** - Age of passenger
1. **SibSp** - Number of siblings/spouses aboard
1. **Parch** - Number of parents/children aboard
1. **Ticket** - Ticket number
1. **Fare** - Amount of money spent on ticket
1. **Cabin** - Cabin number
1. **Embarked** - Port where passenger embarked ( C = Cherbourg, Q = Queenstown, S = Southampton )

* **float64(2)** : Fare, Age
* **int64(5)** : Pclass, SibSp, Parch, PassengerId, Survived
* **object(5)** : Cabin, Embarked, Ticket, Name, Sex

In [None]:
df_train.info()

<a id="3"></a>
# Univariate Variable Analysis
* **Categorical Variable:** Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp, Parch
* **Numerical Variable:** Age, PassengerId, Fare
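
The split above is a judgment call: SibSp and Parch are stored as integers but take only a few distinct values, so they are treated as categories. A minimal sketch of one way to make that call programmatically is shown here. The helper `split_variable_types`, its `max_categories` threshold, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def split_variable_types(df, max_categories=10):
    """Heuristic: non-numeric or low-cardinality columns count as categorical."""
    categorical, numerical = [], []
    for col in df.columns:
        is_text = not pd.api.types.is_numeric_dtype(df[col])
        few_levels = df[col].nunique(dropna=True) <= max_categories
        if is_text or few_levels:
            categorical.append(col)
        else:
            numerical.append(col)
    return categorical, numerical

# Toy data standing in for a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})
categorical, numerical = split_variable_types(toy, max_categories=3)
print("categorical:", categorical)
print("numerical:", numerical)
```

A heuristic like this would not reproduce the list above exactly: it would label PassengerId as numerical (as the notebook does) but could not tell that high-cardinality text columns such as Name or Ticket are identifiers rather than useful categories.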

<a id="4"></a>
## Categorical Variable

In [None]:
def bar_plot(variable):
    """
        input : variable ex: "Sex"
        output : bar plot & value counts
    """
    # get feature
    var = df_train[variable]
    # count the number of samples in each category
    varValue = var.value_counts()
    
    # visualize
    
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{} : \n {}".format(variable,varValue))

In [None]:
category1 = ["Survived","Sex","Pclass","Embarked","SibSp","Parch"]
for c in category1:
    bar_plot(c)

In [None]:
category2 = ["Cabin", "Name", "Ticket"]
for c in category2:
    print("{} \n".format(df_train[c].value_counts()))

<a id="5"></a>
## Numerical Variable

In [None]:
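def plot_hist(variable):
    """
        input : variable ex: "Age"
        output : histogram of the variable
    """
    plt.figure(figsize = (9,3))
    # drop missing values (e.g. in Age) before binning
    plt.hist(df_train[variable].dropna(), bins = 100)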
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numericVar = ["Fare", "Age", "PassengerId"]
for n in numericVar:
    plot_hist(n)

<a id="6"></a>
# Basic Data Analysis
* Pclass - Survived
* Sex - Survived
* SibSp - Survived
* Parch - Survived
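
Because Survived is coded 0/1, the mean of Survived within a group is that group's survival rate, which is what the cells below compute. A minimal sketch of the same idea as a reusable helper follows. The name `survival_rate_by` and the toy data are assumptions for illustration; the group size is added because a rate computed from very few passengers can be misleading.

```python
import pandas as pd

def survival_rate_by(df, feature, target="Survived"):
    """Return survival rate and group size per value of `feature`, highest rate first."""
    return (
        df.groupby(feature)[target]
          .agg(rate="mean", count="size")
          .sort_values("rate", ascending=False)
          .reset_index()
    )

# Toy data (values are made up)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1],
})
result = survival_rate_by(toy, "Pclass")
print(result)
```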

In [None]:
# Pclass vs Survived

df_train[["Pclass","Survived"]].groupby(["Pclass"], as_index = False).mean().sort_values(by = "Survived",ascending = False)

In [None]:
# Sex vs Survived

df_train[["Sex","Survived"]].groupby(["Sex"],as_index = False).mean().sort_values(by="Survived",ascending = False)

In [None]:
# SibSp vs Survived

df_train[["SibSp","Survived"]].groupby(["SibSp"],as_index=False).mean().sort_values(by = "Survived",ascending = False)

In [None]:
# Parch vs Survived

df_train[["Parch","Survived"]].groupby(["Parch"],as_index=False).mean().sort_values(by = "Survived",ascending= False)

<a id="7"></a>
# Outlier Detection
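
The function in this section uses the interquartile-range (IQR) rule: a value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile, and a row is reported only if it is flagged in more than two features. A minimal sketch of the rule for a single column is shown here. The helper `iqr_fences` and the sample fares are assumptions for illustration.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) bounds of the IQR rule, ignoring NaN values."""
    q1, q3 = np.nanpercentile(values, [25, 75])
    step = k * (q3 - q1)
    return q1 - step, q3 + step

# Made-up fares with one extreme value
fares = np.array([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])
lower, upper = iqr_fences(fares)
outliers = fares[(fares < lower) | (fares > upper)]
print("bounds:", lower, upper)
print("outliers:", outliers)
```

Here the quartiles are 9.2875 and 29.0, so the upper bound is 29.0 + 1.5 × 19.7125 = 58.56875 and only the 512.33 fare is flagged.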

In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    for c in features:
        # 1st quartile (nanpercentile ignores missing values, e.g. in Age;
        # np.percentile would return NaN and no Age outlier would ever be flagged)
        Q1 = np.nanpercentile(df[c],25)
        # 3rd quartile
        Q3 = np.nanpercentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect outliers and their indices
        outlier_list_col = df[(df[c]<Q1-outlier_step) | (df[c]>Q3 + outlier_step)].index
        # Store indices
        outlier_indices.extend(outlier_list_col)
    # Count how many features flag each row
    outlier_indices = Counter(outlier_indices)
    # Keep rows flagged as outliers in more than 2 features
    multiple_outliers = [i for i, v in outlier_indices.items() if v>2]

    return multiple_outliers

In [None]:
df_train.loc[detect_outliers(df_train,["Age","SibSp","Parch","Fare"])]

In [None]:
# Drop outliers
df_train = df_train.drop(detect_outliers(df_train,["Age","SibSp","Parch","Fare"]),axis= 0).reset_index(drop = True)

<a id="8"></a>
# Missing Value
* Find Missing Value
* Fill Missing Value
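
The next cell stacks the training and test sets so that missing values are found and filled in both at once, and it stores the training length first. A minimal sketch of how that stored length could later be used to separate the two sets again is shown here. The toy frames, the median fill, and the names `train_again` and `test_again` are assumptions for illustration; the notebook itself does not show the split.

```python
import pandas as pd

# Toy stand-ins for the train and test sets (values are made up)
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Fare": [7.25, 71.28, None]})
test = pd.DataFrame({"PassengerId": [4, 5], "Fare": [8.05, None]})

train_len = len(train)  # remember where the training rows end
combined = pd.concat([train, test], axis=0).reset_index(drop=True)

# Any fill applied here reaches both sets
combined["Fare"] = combined["Fare"].fillna(combined["Fare"].median())

# Split back by position; the test rows never had a Survived label
train_again = combined.iloc[:train_len].copy()
test_again = combined.iloc[train_len:].drop(columns="Survived").reset_index(drop=True)
print(train_again)
print(test_again)
```

Note that after the concatenation the Survived column is missing for every test row, so it will show up in any missing-value count even though nothing is wrong with it.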

In [None]:
df_train_len=len(df_train)
df_train = pd.concat([df_train,df_test],axis=0).reset_index(drop=True)

<a id="9"></a>
# Find Missing Value

In [None]:
df_train.columns[df_train.isnull().any()]

In [None]:
df_train.isnull().sum()

<a id="10"></a>
# Fill Missing Value
* Embarked has 2 missing values
* Fare has only 1 missing value
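
The cells below fill the missing Fare with the mean fare of third-class passengers. A minimal sketch of a more general variant, which fills each missing fare with the mean of that row's own class, is shown here. The toy data and the choice of `groupby(...).transform("mean")` are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy data with one missing third-class fare (values are made up)
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 8.0, 12.0, np.nan],
})

# Per-row mean fare of the row's class; NaN values are skipped when averaging
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```

With a single missing value the two approaches give the same result; the grouped version only matters when missing fares occur in more than one class.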

In [None]:
df_train[df_train["Embarked"].isnull()]

In [None]:
df_train.boxplot(column = "Fare",by = "Embarked")
plt.show()

In [None]:
df_train["Embarked"] = df_train["Embarked"].fillna("C")

In [None]:
# Fill the missing Fare with the mean fare of 3rd-class passengers
df_train["Fare"] = df_train["Fare"].fillna(df_train.loc[df_train["Pclass"] == 3, "Fare"].mean())

In [None]:
df_train[df_train["Fare"].isnull()]