# Introduction

The sinking of the Titanic is one of the most notorious shipwrecks in history. In 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board.

## Contents

1. [Load and Check Data](#load-and-check-data)  
2. [Variable Description](#variable-description)  
    * [Univariate Variable Analysis](#univariate-variable-analysis)  
        * [Categorical Variable Analysis](#categorical-variable-analysis)  
        * [Numerical Variable Analysis](#numerical-variable-analysis)  
3. [Basic Data Analysis](#basic-data-analysis)  
4. [Outlier Detection](#outlier-detection)  
5. [Missing Value](#missing-value)  
    * [Find Missing Value](#find-missing-value)  
    * [Fill Missing Value](#fill-missing-value)


In [6]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load and Check Data


In [7]:
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
test_PassengerId = test_df["PassengerId"]



In [None]:
train_df.columns

In [None]:
train_df.head()

In [None]:
train_df.describe()


# Variable Description
1. PassengerId: unique ID number assigned to each passenger
2. Survived: whether the passenger survived (1) or died (0)
3. Pclass: passenger class
4. Name: name of the passenger
5. Sex: gender of the passenger
6. Age: age of the passenger
7. SibSp: number of siblings/spouses aboard
8. Parch: number of parents/children aboard
9. Ticket: ticket number
10. Fare: amount of money spent on the ticket
11. Cabin: cabin number
12. Embarked: port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)


    
  

In [None]:
train_df.info()

* float64(2): Fare and Age
* int64(5): PassengerId, Survived, Pclass, SibSp, and Parch
* object(5): Name, Sex, Ticket, Cabin, and Embarked
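
The dtype grouping above can also be read off programmatically instead of from the `info()` printout. The following is a minimal sketch on a hypothetical four-column sample (`sample_df` and its values are illustrative assumptions, not rows from `train.csv`):

```python
import pandas as pd

# Hypothetical sample with a few Titanic-style columns (not the real data)
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Name": ["A", "B", "C"],
    "Fare": [7.25, 71.28, 7.92],
})

# Number of columns per dtype (dtype names converted to plain strings)
dtype_counts = sample_df.dtypes.astype(str).value_counts()

# Column names split into numeric and non-numeric groups
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
other_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Running the same three statements on `train_df` should reproduce the grouping listed above.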

## Univariate Variable Analysis
 * Categorical Variable: Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp and Parch
 * Numerical Variable: Fare, Age and PassengerId


### Categorical Variable Analysis

In [None]:
def bar_plot(variable):
    """
    Plot a bar chart of the value counts of a categorical column and print the counts.

    variable: column name in train_df, e.g. "Sex"
    """
    # select the column
    var = train_df[variable]
    # count how many samples fall into each category
    varValue = var.value_counts()

    # visualize
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{} : \n {}".format(variable,varValue))

In [None]:
category1 = ["Survived","Sex","Pclass","Embarked","SibSp","Parch"]
for c in category1:
    bar_plot(c)

In [None]:
# These variables have too many distinct values to plot clearly, so print their value counts instead
category2 = ["Cabin","Name","Ticket"]
for c in category2:
    print("{} \n".format(train_df[c].value_counts()))
    

### Numerical Variable Analysis

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    # drop missing values (e.g. in Age) before binning
    plt.hist(train_df[variable].dropna())
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numericVar = ["Fare","Age","PassengerId"]
for n in numericVar: 
    plot_hist(n)

# Basic Data Analysis
* Pclass - Survived
* Sex - Survived
* SibSp - Survived
* Parch - Survived
* Embarked - Survived
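
Because `Survived` is coded as 0/1, the mean of `Survived` within a group is that group's survival rate. The comparisons listed above all follow the same pattern, so they can be wrapped in one helper. This is a sketch only: `survival_rate_by` and `sample_df` are hypothetical names, and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; the values are illustrative, not taken from train.csv
sample_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

def survival_rate_by(df, feature):
    """Return the mean of Survived for each value of `feature`, highest first."""
    return (
        df[[feature, "Survived"]]
        .groupby(feature, as_index=False)
        .mean()
        .sort_values(by="Survived", ascending=False)
    )

# Print one small table per feature
for feature in ["Pclass", "Sex"]:
    print(survival_rate_by(sample_df, feature), "\n")
```

With the real data, the call would be `survival_rate_by(train_df, "Pclass")` and so on for each feature.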

In [None]:
# Pclass vs Survived
train_df[["Pclass","Survived"]].groupby(["Pclass"],as_index = False).mean().sort_values(by="Survived",ascending = False)


In [None]:
# Sex vs Survived
train_df[["Sex","Survived"]].groupby(["Sex"],as_index = False).mean().sort_values(by="Survived",ascending = False)

In [None]:
# SibSp vs Survived
train_df[["SibSp","Survived"]].groupby(["SibSp"],as_index = False).mean().sort_values(by = "Survived",ascending = False)

In [None]:
# Parch vs Survived
train_df[["Parch","Survived"]].groupby(["Parch"],as_index = False).mean().sort_values(by = "Survived", ascending = False)

In [None]:
# Embarked vs Survived
train_df[["Embarked","Survived"]].groupby(["Embarked"],as_index = False).mean().sort_values(by = "Survived",ascending = False)

# Outlier Detection

In [None]:
def detect_outliers(df, features):
    outlier_indices = []

    for c in features:
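        # 1st quartile (nanpercentile ignores missing values, e.g. in Age;
        # np.percentile would return NaN and flag nothing for that column)
        Q1 = np.nanpercentile(df[c], 25)
        # 3rd quartile
        Q3 = np.nanpercentile(df[c], 75)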
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect outlier and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # Store indices
        outlier_indices.extend(outlier_list_col)

    # Count the number of outliers per row index
    outlier_indices = Counter(outlier_indices)
    # Only keep those that are outliers in more than 2 features
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]

    return multiple_outliers



In [None]:
train_df.loc[detect_outliers(train_df,["Age","SibSp","Parch","Fare"])]

In [None]:
train_df = train_df.drop(detect_outliers(train_df,["Age","SibSp","Parch","Fare"]),axis = 0).reset_index(drop = True)
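
The rule inside `detect_outliers` can be checked by hand on a small array. The sketch below uses hypothetical values (not real fares) to show the 1.5 × IQR fences for one column, followed by the "outlier in more than 2 features" filter on a made-up list of flagged row indices.

```python
from collections import Counter

import numpy as np

# Hypothetical values for one feature; 100.0 is deliberately extreme
values = np.array([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 100.0])

q1 = np.percentile(values, 25)   # 1st quartile
q3 = np.percentile(values, 75)   # 3rd quartile
iqr = q3 - q1                    # interquartile range
step = 1.5 * iqr                 # outlier step
lower, upper = q1 - step, q3 + step

# Values outside [lower, upper] are flagged as outliers for this feature
outliers = values[(values < lower) | (values > upper)]
print(q1, q3, lower, upper, outliers)

# Hypothetical row indices, one entry per feature that flagged the row:
# row 0 was flagged by three features, row 5 by two, row 7 by one
flagged = Counter([0, 0, 0, 5, 5, 7])
multiple_outliers = [i for i, n in flagged.items() if n > 2]
print(multiple_outliers)
```

Only row 0 passes the filter here, since the condition `n > 2` requires a row to be an outlier in at least three of the checked features.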

# Missing Value
* Find Missing Value
* Fill Missing Value

In [None]:
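# Remember how many rows came from the training set, then stack train and test
# so that missing values can be handled once for both
train_df_len = len(train_df)
train_df = pd.concat([train_df,test_df],axis = 0).reset_index(drop = True)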

In [None]:
train_df.head()
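
Storing the training length before concatenating makes it possible to separate the two sets again later. The sketch below shows one way this could be done on hypothetical miniature frames. The names `train_part`, `test_part`, and `combined` are illustrative, and the split step itself is an assumption about later use, not something this notebook performs.

```python
import pandas as pd

# Hypothetical miniature train/test frames; test has no Survived column
train_part = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
test_part = pd.DataFrame({"PassengerId": [4, 5]})

train_len = len(train_part)
combined = pd.concat([train_part, test_part], axis=0).reset_index(drop=True)

# Rows that came from the test set have NaN in Survived
print(combined)

# Split back by position: the first train_len rows are the training rows
train_again = combined[:train_len]
test_again = combined[train_len:].drop(columns=["Survived"]).reset_index(drop=True)
print(len(train_again), len(test_again))
```

Note that the combined frame reports missing `Survived` values for every test row, so that column will show up in any missing-value count taken after the concatenation.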

## Find Missing Value

In [None]:
train_df.columns[train_df.isnull().any()]

In [None]:
train_df.isnull().sum()
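
Raw counts are easier to judge next to the share of rows they represent. As a sketch, the two calls above can be combined into one summary table. The frame `sample_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in Age and Embarked
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = sample_df.isnull().sum()
missing_summary = pd.DataFrame({
    "missing": missing_count,
    "percent": 100 * missing_count / len(sample_df),
})

# Keep only columns that actually have missing values, largest first
missing_summary = (
    missing_summary[missing_summary["missing"] > 0]
    .sort_values("missing", ascending=False)
)
print(missing_summary)
```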

## Fill Missing Value
* Embarked has 2 missing values
* Fare has only 1 missing value

In [None]:
train_df[train_df["Embarked"].isnull()]

In [None]:
train_df.boxplot(column = "Fare",by = "Embarked")
plt.show()

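In [None]:
# Fill the missing ports with "C" (Cherbourg), chosen by comparing these passengers'
# fares with the per-port fare distributions in the boxplot
train_df["Embarked"] = train_df["Embarked"].fillna("C")
# Check that no missing Embarked values remain (expect an empty frame)
train_df[train_df["Embarked"].isnull()]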

In [None]:
train_df[train_df["Fare"].isnull()]

In [None]:
np.mean(train_df[train_df["Pclass"] == 3]["Fare"])

In [None]:
# Fill the missing fare with the mean fare of third-class passengers (Pclass == 3)
train_df["Fare"] = train_df["Fare"].fillna(np.mean(train_df[train_df["Pclass"] == 3]["Fare"]))

In [None]:
train_df[train_df["Fare"].isnull()]
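
The fill above hard-codes class 3 because the single missing fare belongs to that group. A more general variant, shown here only as a sketch on hypothetical data, fills each missing fare with the mean fare of the passenger's own class using `groupby(...).transform("mean")`. This is a swapped-in alternative, not the approach the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one third-class passenger has no fare
sample_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 60.0, 8.0, 10.0, np.nan],
})

# Mean fare of each row's own class, aligned to the original rows (NaN is skipped)
class_mean_fare = sample_df.groupby("Pclass")["Fare"].transform("mean")

# Only the missing entries are replaced; existing fares are left untouched
sample_df["Fare"] = sample_df["Fare"].fillna(class_mean_fare)
print(sample_df)
```

For a single missing value the two approaches give the same result. The group-wise version mainly helps when gaps are spread across several classes.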