<a href="https://www.kaggle.com/code/osmantekdamar/titanic?scriptVersionId=218300884" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction
The Titanic was a British passenger liner that famously sank on its maiden voyage in April 1912. Here is a summary of the Titanic story:

1. Construction and Maiden Voyage:
   - The Titanic was one of three sister ships, along with the RMS Olympic and HMHS Britannic, built by Harland and Wolff for the White Star Line.
   - It was touted as the most luxurious and largest ship of its time, with advanced safety features.

2. Departure and Passengers:
   - The Titanic left Southampton, England, on April 10, 1912, and made stops in Cherbourg, France, and Queenstown (now known as Cobh), Ireland, before heading to New York City.
   - Onboard were over 2,200 passengers and crew members, including many wealthy and prominent individuals.

3. Collision with an Iceberg:
   - On the night of April 14, 1912, the Titanic struck an iceberg in the North Atlantic Ocean.
   - The collision damaged the ship's hull, leading to the flooding of several compartments below deck.

4. Sinking:
   - The crew worked to evacuate passengers, but there were not enough lifeboats for everyone on board.
   - The Titanic sank in the early hours of April 15, 1912, about two hours and forty minutes after hitting the iceberg.

5. Rescue:
   - The nearby RMS Carpathia received distress signals and rushed to the scene.
   - Over 700 survivors were rescued from lifeboats, but more than 1,500 people perished in the disaster.

6. Aftermath:
   - The sinking of the Titanic was a profound tragedy that shocked the world and prompted significant changes in maritime safety regulations.
   - Investigations revealed shortcomings in the ship's design, safety measures, and the handling of the emergency.

7. Cultural Impact:
   - The Titanic disaster has inspired numerous books, films, and documentaries, including James Cameron's 1997 movie "Titanic."
   - It remains a symbol of human hubris and the consequences of disregarding safety measures.

The sinking of the Titanic is a well-known and tragic event in history, and its legacy continues to captivate people's imaginations and serve as a cautionary tale about the importance of safety at sea.

Here is a simple sketch of the ship:

![Map](https://i.pinimg.com/originals/73/90/a6/7390a6730ca3c56602bc0495910c1a54.png)

<font color= "blue">
Content:

1. [Load and Check Data](#1)
1. [Variable Description and Basic Data Analysis](#2)
    * [Categorical Variable](#3)
    * [Numerical Variable](#4)
    * [Cardinal Variable](#5)
1. [Outlier Detection](#6)
1. [Missing Value](#7)
    * [Find Missing Value](#8)
    * [Fill Missing Value](#9)
1. [Visualization](#10)
    * [Correlation Between Sibsp -- Parch -- Age -- Fare -- Survived](#11)
    * [SibSp -- Survived](#12)
    * [Parch -- Survived](#13)
    * [Pclass -- Survived](#14)
    * [Age -- Survived](#15)
    * [Pclass -- Survived -- Age](#16)
    * [Embarked -- Sex -- Pclass -- Survived](#17)
    * [Embarked -- Sex -- Fare -- Survived](#18)
    * [Fill Missing: Age Feature](#19)
1. [Feature Engineering](#20)
    * [Name -- Title](#21)
    * [Family Size](#22)
    * [Embarked](#23)
    * [Ticket](#24)
    * [Pclass](#25)
    * [Sex](#26)
    * [Drop Passenger ID and Cabin](#27)
1. [Modeling](#28)
    * [Train - Test Split](#29)
    * [Simple Logistic Regression](#30)
    * [Hyperparameter Tuning -- Grid Search -- Cross Validation](#31)
    * [Ensemble Modeling](#32)
    * [Prediction and Submission](#33)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import time

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


In [2]:
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.width', 1000)

<a id = "1"></a>
# Load and Check Data

In [3]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
submission = test_data[["PassengerId"]]
train_data.drop("PassengerId", axis = 1, inplace = True)
test_data.drop("PassengerId", axis = 1, inplace = True)

In [4]:
train_data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
train_data.describe([0.10, 0.25, 0.50, 0.75, 0.90]).T

Unnamed: 0,count,mean,std,min,10%,25%,50%,75%,90%,max
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,0.0,1.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,1.0,2.0,3.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,14.0,20.125,28.0,38.0,50.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,0.0,1.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,0.0,2.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.55,7.9104,14.4542,31.0,77.9583,512.3292


<a id = "2"></a>
# Variable Description

1. PassengerId : Unique identification number for each passenger. I dropped this variable because it does not help predict survival.
1. Survived : Target variable. Indicates whether the passenger died (0) or survived (1).
1. Pclass : Passenger class (1, 2, 3)
1. Name : Passenger name
1. Sex : Gender of the passenger
1. Age : Age of the passenger
1. SibSp : Number of siblings/spouses aboard
1. Parch : Number of parents/children aboard
1. Ticket : Ticket number
1. Fare : Ticket price
1. Cabin : Cabin number
1. Embarked : Port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


In [7]:
def analyze_data(dataframe, cat_th=10, car_th=20):
    """
    Return the names of the categorical, numerical, and categorical-but-cardinal
    variables in the data set, and print a summary of missing values.

    Parameters
    ------
        dataframe: dataframe
            The dataframe from which variable names are to be retrieved
        cat_th: int, optional
            Unique-value threshold below which a numeric variable is treated as categorical
        car_th: int, optional
            Unique-value threshold above which a categorical variable is treated as cardinal

    Returns
    ------
        cat_cols: list
            Categorical variable list (including numeric-but-categorical variables)
        num_cols: list
            Numerical variable list
        cat_but_car: list
            List of variables that look categorical but are cardinal
    """
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtype == "O"]
    num_cols = [col for col in dataframe.columns if dataframe[col].dtype != "O"]

    num_but_cat = [col for col in num_cols if dataframe[col].nunique() < cat_th]
    cat_but_car = [col for col in cat_cols if dataframe[col].nunique() > car_th]

    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    cat_cols = cat_cols + num_but_cat
    
    print(f"Number of Observations: {dataframe.shape[0]}")
    print(f"Number of Variables: {dataframe.shape[1]}")
    print(f'Cat cols: {len(cat_cols)}, Num cols: {len(num_cols)}, Cat but car cols: {len(cat_but_car)}')
    print("\nMissing Data")
    print(dataframe.isna().sum())

    return cat_cols, num_cols, cat_but_car

In [8]:
cat_cols, num_cols, cat_but_car = analyze_data(train_data)

Number of Observations: 891
Number of Variables: 11
Cat cols: 6, Num cols: 2, Cat but car cols: 3

Missing Data
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
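
The thresholds decide which bucket each column lands in. As a minimal sketch of that rule on a toy frame: `classify_column` is a hypothetical helper, not part of this notebook, and it uses `pd.api.types.is_numeric_dtype` instead of the `dtype == "O"` check above. The thresholds are deliberately tiny so that four rows are enough to show every case.

```python
import pandas as pd

def classify_column(series, cat_th=10, car_th=20):
    """Hypothetical helper: apply the same threshold rule to a single column."""
    n_unique = series.nunique()
    if pd.api.types.is_numeric_dtype(series):
        # Numeric columns with few distinct values behave like categories
        return "categorical" if n_unique < cat_th else "numerical"
    # Non-numeric columns with many distinct values are "cardinal"
    return "cardinal" if n_unique > car_th else "categorical"

# Toy data for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Fare": [7.25, 71.28, 7.93, 53.1],
    "Sex": ["male", "female", "female", "male"],
    "Name": ["A", "B", "C", "D"],
})

kinds = {col: classify_column(toy[col], cat_th=4, car_th=3) for col in toy.columns}
print(kinds)
```

With the default thresholds on the real training data, `Name`, `Ticket`, and `Cabin` each have far more than 20 unique values, which is why they are reported as cardinal.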


In [9]:
analyze_data(test_data)

Number of Observations: 418
Number of Variables: 10
Cat cols: 5, Num cols: 2, Cat but car cols: 3

Missing Data
Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64


(['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch'],
 ['Age', 'Fare'],
 ['Name', 'Ticket', 'Cabin'])

In [10]:
cat_cols.remove("Survived")
target_feature = "Survived"

# Visualizations

In [11]:
train_data_vis = train_data.copy()

In [12]:
train_data_vis[cat_cols] = train_data[cat_cols].astype("str")

## Categorical Variable

In [13]:
def analyze_categorical_features(train_data, cat_cols, target_feature):
    """Build count and percentage bar charts for each categorical feature against the target."""
    # Work on a copy with the categorical columns cast to str so plotly treats them as discrete
    data = train_data.copy()
    data[cat_cols] = data[cat_cols].astype("str")
    figures_dict = {}
    for feature in cat_cols:
        value_counts = data[feature].value_counts().reset_index()
        value_counts.columns = [feature, 'Count']
        value_counts_fig = px.bar(value_counts, x=feature, y='Count', title=f'{feature} Value Counts')
        
        # Cross table: feature values as rows, target values as columns
        cross_table = pd.crosstab(data[feature], data[target_feature])
        
        # Calculate row-wise percentages
        cross_table_percent = cross_table.div(cross_table.sum(1), axis=0) * 100
        
        # Relationship Bar Plot (Counts)
        relationship_fig_counts = px.bar(cross_table, y=cross_table.index, x=cross_table.columns, title=f'{feature} - {target_feature} Relationship (Count)')
        relationship_fig_counts.update_layout(yaxis_title=feature)
    
        # Relationship Bar Plot (Percentages)
        relationship_fig_percent = px.bar(cross_table_percent, y=cross_table_percent.index, x=cross_table_percent.columns, title=f'{feature} - {target_feature} Relationship (Percent)')
        relationship_fig_percent.update_layout(yaxis_title=feature)
    
        # Store figures in dictionary
        figures_dict[f'{feature}_value_counts'] = value_counts_fig
        figures_dict[f'{feature}_{target_feature}_counts'] = relationship_fig_counts
        figures_dict[f'{feature}_{target_feature}_percentages'] = relationship_fig_percent
        
    return figures_dict


## Numerical Variable

In [14]:
def analyze_numerical_features(data, num_cols, target_feature):
    """Build distribution, boxplot, mean, and IQR-outlier charts for each numerical feature."""
    figures_dict = {}
    for feature in num_cols:
        # Histogram
        hist_fig = px.histogram(data, x=feature, title=f'{feature} Distribution')
        
        # Box Plot (with Survived Comparison)
        box_fig = px.box(data, x=target_feature, y=feature, 
                         title=f'{feature} Boxplot - {target_feature} Comparison',
                         color=target_feature)
        
        # Mean Comparison
        grouped_data = data.groupby(target_feature)[feature].mean().reset_index()
        bar_fig = px.bar(grouped_data, x=target_feature, y=feature, 
                         title=f'{feature} Mean - {target_feature}')
        bar_fig.update_layout(yaxis_title='Mean')
        
        # Outlier Analysis (1.5 * IQR rule)
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Create a column for outlier detection (overwritten for each feature)
        data['Outlier'] = ((data[feature] < lower_bound) | 
                           (data[feature] > upper_bound))
        
        # Create cross tables for outliers
        cross_table = pd.crosstab(data['Outlier'], data[target_feature])
        
        # Calculate percentages
        cross_table_percent = cross_table.div(cross_table.sum(1), axis=0) * 100
        
        # Outlier Relationship Plot (Counts)
        outlier_counts_fig = px.bar(cross_table, 
                                   y=cross_table.index, 
                                   x=cross_table.columns,
                                   title=f'{feature} Outlier - {target_feature} Relationship (Count)')
        outlier_counts_fig.update_layout(yaxis_title='Outlier Status')
        
        # Outlier Relationship Plot (Percentages)
        outlier_percent_fig = px.bar(cross_table_percent, 
                                    y=cross_table_percent.index, 
                                    x=cross_table_percent.columns,
                                    title=f'{feature} Outlier - {target_feature} Relationship (Percent)')
        outlier_percent_fig.update_layout(yaxis_title='Outlier Status')
        
        # Store figures in dictionary
        figures_dict[f'{feature}_distribution'] = hist_fig
        figures_dict[f'{feature}_boxplot'] = box_fig
        figures_dict[f'{feature}_mean_comparison'] = bar_fig
        figures_dict[f'{feature}_outlier_counts'] = outlier_counts_fig
        figures_dict[f'{feature}_outlier_percentages'] = outlier_percent_fig
    
    return figures_dict

In [36]:
cat_figures = analyze_categorical_features(train_data, cat_cols, target_feature)

In [37]:
num_figures = analyze_numerical_features(train_data_vis, num_cols, target_feature)


In [17]:
cat_figures.keys()

dict_keys(['Sex_value_counts', 'Sex_Survived_counts', 'Sex_Survived_percentages', 'Embarked_value_counts', 'Embarked_Survived_counts', 'Embarked_Survived_percentages', 'Pclass_value_counts', 'Pclass_Survived_counts', 'Pclass_Survived_percentages', 'SibSp_value_counts', 'SibSp_Survived_counts', 'SibSp_Survived_percentages', 'Parch_value_counts', 'Parch_Survived_counts', 'Parch_Survived_percentages'])

In [40]:
for name, fig in cat_figures.items():
    try:
        fig.show("iframe")
    except Exception:
        # Fall back to printing the figure if the iframe renderer is unavailable
        print(fig)
    time.sleep(1)

Most of the passengers who paid high fares survived, so money evidently had an influence on survival.
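
One way to check the fare observation numerically is to compare survival rates across equal-frequency fare bands. The sketch below is only an illustration: `survival_rate_by_fare_band` is a hypothetical helper, and the rows are toy values, not taken from the Titanic data.

```python
import pandas as pd

def survival_rate_by_fare_band(df, n_bands=4):
    """Hypothetical helper: mean of Survived within equal-frequency Fare bands."""
    bands = pd.qcut(df["Fare"], q=n_bands, duplicates="drop")
    return df.groupby(bands, observed=True)["Survived"].mean()

# Toy rows for illustration only
toy = pd.DataFrame({
    "Fare": [5.0, 7.0, 8.0, 10.0, 20.0, 30.0, 80.0, 200.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

rates = survival_rate_by_fare_band(toy)
print(rates)
```

Calling the same helper on `train_data` would show whether the survival rate actually rises from the cheapest band to the most expensive one.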

In [19]:
for name, fig in num_figures.items():
    try:
        fig.show("iframe")
    except Exception:
        # Fall back to printing this figure (not the whole dict) if the renderer is unavailable
        print(fig)
    time.sleep(1)

<a id = "5"></a>
### Cardinal Variable

In [20]:
train_data[cat_but_car].dropna().sample(20)

Unnamed: 0,Name,Ticket,Cabin
307,"Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)",PC 17758,C65
151,"Pears, Mrs. Thomas (Edith Wearne)",113776,C2
700,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",PC 17757,C62 C64
370,"Harder, Mr. George Achilles",11765,E50
139,"Giglio, Mr. Victor",PC 17593,B86
435,"Carter, Miss. Lucile Polk",113760,B96 B98
185,"Rood, Mr. Hugh Roscoe",113767,A32
621,"Kimball, Mr. Edwin Nelson Jr",11753,D19
123,"Webber, Miss. Susan",27267,E101
292,"Levy, Mr. Rene Jacques",SC/Paris 2163,D


In [21]:
train_data[cat_but_car].nunique()

Name      891
Ticket    681
Cabin     147
dtype: int64

<a id = "6"></a>
### Outlier Detection

In [22]:
train_data

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [23]:
def check_outliers(data, columns, threshold=1.5):
    """
    Check for outliers in multiple columns of a DataFrame.
    
    Parameters:
    - data: DataFrame containing the data.
    - columns: List of column names to check for outliers.
    - threshold: Multiplier for the IQR (Interquartile Range) to determine outliers.
    
    Returns:
    - outliers_dict: A dictionary where keys are column names and values are lists of outliers' indices.
    """
    outliers_dict = {}
    
    for col in columns:
        # Calculate the IQR (Interquartile Range) for the column
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        
        # Define the lower and upper bounds for outliers
        lower_bound = Q1 - threshold * IQR
        upper_bound = Q3 + threshold * IQR
        
        # Find the indices of outliers in the column
        outliers_indices = data[(data[col] < lower_bound) | (data[col] > upper_bound)].index.tolist()
        
        # Store the outliers in the dictionary
        outliers_dict[col] = outliers_indices
    
    return outliers_dict




In [24]:
outliers = check_outliers(train_data, num_cols)

In [25]:
train_data.loc[outliers["Fare"]]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
27,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S
31,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
34,0,1,"Meyer, Mr. Edgar Joseph",male,28.0,1,0,PC 17604,82.1708,,C
52,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
...,...,...,...,...,...,...,...,...,...,...,...
846,0,3,"Sage, Mr. Douglas Bullen",male,,8,2,CA. 2343,69.5500,,S
849,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)",female,,1,0,17453,89.1042,C92,C
856,1,1,"Wick, Mrs. George Dennick (Mary Hitchcock)",female,45.0,1,1,36928,164.8667,,S
863,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
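
To see why these rows are flagged, the IQR bounds for `Fare` can be worked out by hand. This is a minimal sketch that plugs in the rounded quartiles from the `describe()` output earlier in the notebook (25% = 7.9104, 75% = 31.0), so the exact bounds computed on the full column may differ slightly.

```python
# Rounded Fare quartiles as shown in the describe() output above
q1, q3 = 7.9104, 31.0
threshold = 1.5

iqr = q3 - q1
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr
print(f"IQR={iqr:.4f}, lower={lower_bound:.4f}, upper={upper_bound:.4f}")

# A few fares that appear in the tables above
fares = [7.25, 69.55, 71.2833, 263.0]
flagged = [fare for fare in fares if fare < lower_bound or fare > upper_bound]
print(flagged)
```

Because the lower bound is negative, no fare can be a low outlier; only fares above roughly 65.6 are flagged.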


### Feature Engineering

In [26]:
train_data["family_size"] = train_data["SibSp"] + train_data["Parch"]+1
test_data["family_size"] = test_data["SibSp"] + test_data["Parch"]+1

In [27]:
train_data[(train_data["Cabin"].notna())&(train_data["Survived"] == 0)]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,family_size
6,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,1
27,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S,6
54,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C,2
62,0,1,"Harris, Mr. Henry Birkhardt",male,45.0,1,0,36973,83.4750,C83,S,2
75,0,3,"Moen, Mr. Sigurd Hansen",male,25.0,0,0,348123,7.6500,F G73,S,1
...,...,...,...,...,...,...,...,...,...,...,...,...
789,0,1,"Guggenheim, Mr. Benjamin",male,46.0,0,0,PC 17593,79.2000,B82 B84,C,1
806,0,1,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0000,A36,S,1
815,0,1,"Fry, Mr. Richard",male,,0,0,112058,0.0000,B102,S,1
867,0,1,"Roebling, Mr. Washington Augustus II",male,31.0,0,0,PC 17590,50.4958,A24,S,1


Only 68 passengers with cabin information did not survive, which is a third of the 204 passengers with a known cabin. This suggests that missing cabin information is strongly associated with not surviving, so we can turn it into a feature.

In [28]:
train_data["Cabin_isNa"] = train_data["Cabin"].isna().astype("int")
test_data["Cabin_isNa"] = test_data["Cabin"].isna().astype("int")
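
The motivation for `Cabin_isNa` can be checked with simple arithmetic on counts already reported in this notebook. The sketch below assumes the figure of 68 non-survivors with a known cabin from the note above, and it recovers the total number of survivors from the reported mean of `Survived` (0.383838 over 891 rows).

```python
total = 891                                 # rows in the training set
survived_total = round(total * 0.383838)    # mean of Survived from describe()
died_total = total - survived_total

cabin_known = 204                           # non-null Cabin count from info()
cabin_known_died = 68                       # figure quoted in the note above
cabin_missing = total - cabin_known
cabin_missing_died = died_total - cabin_known_died

rate_known = 1 - cabin_known_died / cabin_known
rate_missing = 1 - cabin_missing_died / cabin_missing
print(f"Survival rate, cabin known:   {rate_known:.3f}")
print(f"Survival rate, cabin missing: {rate_missing:.3f}")
```

Under these assumptions, about two thirds of passengers with a recorded cabin survived, against roughly 30% of those without one.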

In [29]:
train_data.groupby(["Pclass","family_size"]).agg({"Fare":"mean"})

Pclass,family_size,Fare
1,1,63.672514
1,2,91.848039
1,3,95.681075
1,4,133.521429
1,5,262.375
1,6,263.0
2,1,14.066106
2,2,24.682962
2,3,31.693819
2,4,36.575969


In [30]:
# Bin Age into categories; labels in English: Child, Teen, Adult, Middle Age, Senior, 65+
train_data["Age_Category"] = pd.cut(train_data['Age'], 
      bins=[0, 12, 18, 35, 50, 65, float('inf')],
      labels=['Çocuk', 'Genç', 'Yetişkin', 'Orta Yaş', 'Yaşlı', '65+ Yaş'])
# Missing ages get their own explicit 'nan' category
train_data['Age_Category'] = train_data['Age_Category'].cat.add_categories('nan')
train_data['Age_Category'] = train_data["Age_Category"].fillna("nan")

In [31]:
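# Bin Fare into categories; labels in English: Very Low, Low, Medium, High, Very High
# Intervals are right-closed, so Fare == 0 falls outside the first bin and ends up in 'nan'
train_data['Fare_Category'] = pd.cut(train_data['Fare'], 
                                      bins=[0, 7.91, 14.454, 31, 100, float('inf')],
                                      labels=['Çok Düşük', 'Düşük', 'Orta', 'Yüksek', 'Çok Yüksek'])

train_data['Fare_Category'] = train_data['Fare_Category'].cat.add_categories('nan')
train_data['Fare_Category'] = train_data["Fare_Category"].fillna("nan")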
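
A detail of `pd.cut` worth knowing here: by default its intervals are open on the left and closed on the right, so a value equal to the first bin edge is not assigned to any bin. The sketch below reuses the fare bin edges from above on a few sample values; the English labels are placeholders chosen for this example, not the labels used in the notebook.

```python
import pandas as pd

fares = pd.Series([0.0, 7.25, 7.91, 14.0, 31.0, 512.3292])
bins = [0, 7.91, 14.454, 31, 100, float("inf")]
labels = ["very_low", "low", "medium", "high", "very_high"]  # placeholder labels

# Default behaviour: (0, 7.91], (7.91, 14.454], ... so 0.0 is left unbinned
binned = pd.cut(fares, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
with_lowest = pd.cut(fares, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"fare": fares, "default": binned, "include_lowest": with_lowest}))
```

Whether zero fares should count as "very low" or stay in a separate `nan` group is a modelling choice; `include_lowest=True` is one option if they should be binned.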