<a href="https://colab.research.google.com/github/merdogan97/Projects/blob/main/Titanic_Project_25_02_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INTRODUCTION**
**The sinking of the Titanic is one of the most notorious shipwrecks in history. In 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.**

# **Content:**

 [**1. Loading and Checking Data**](#1)

 [**2. Variable Description**](#2)
     
   [**2.a Univariate Variable Analysis**](#3)
     
   [- Categorical Variables Analysis](#4)
        
   [- Numerical Variables Analysis](#5)
 
[**3. Basic Data Analysis**](#6)
 
[**4. Outlier Detection**](#7)

[**5. Missing Values**](#8)

  [- Finding Missing Values](#9)
  
  [- Filling Missing Values](#10)
 


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
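# Matplotlib 3.6 renamed the seaborn styles; fall back to the old name on older versions
plt.style.use("seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid")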
import seaborn as sns

from collections import Counter
import warnings

warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



<a id ="1"></a><br>
# **Loading and Checking Data**

In [None]:
train_df = pd.read_csv("/kaggle/input/traincsv/train.csv")

In [None]:
# plt.style.available

In [None]:
train_df.columns

In [None]:
train_df.head()

In [None]:
train_df.sample(10)

In [None]:
train_df.describe()

In [None]:
train_df.info()

<a id ="2"></a><br>
# **Variable Descriptions**
 1. PassengerId : unique ID number of each passenger
 2. Survived    : whether the passenger survived (1) or died (0)
 3. Pclass      : passenger class
 4. Name        : name of the passenger
 5. Sex         : gender of the passenger
 6. Age         : age of the passenger
 7. SibSp       : number of siblings/spouses aboard
 8. Parch       : number of parents/children aboard
 9. Ticket      : ticket number
 10. Fare       : amount of money spent on the ticket
 11. Cabin      : cabin category
 12. Embarked   : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
train_df.info()

* float64(2): Fare and Age
* int64(5)  : Pclass, SibSp, Parch, PassengerId and Survived
* object(5) : Cabin, Embarked, Ticket, Name and Sex
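
The dtype groups listed above can also be read off programmatically. A minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative assumptions, not the Titanic data):

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Fare": [7.25, 71.28, 8.05],
    "Sex": ["male", "female", "female"],
})

# Split the column names into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(text_cols)
```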

<a id ="3"></a><br>
# Univariate Variable Analysis
 * Categorical Variables Analysis : Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp, and Parch
 * Numerical Variables Analysis : Age, PassengerId, Fare

<a id ="4"></a><br>
# **Categorical Variable Analysis**

In [None]:
def bar_plot(variable):
    """
        input : variable ex:"Sex"
        output: bar plot & value count
    """
    var = train_df[variable]          # select the feature column
    varValue = var.value_counts()     # number of samples in each category
    
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
category1 = ["Survived", "Sex", "Pclass", "Embarked", "SibSp", "Parch"]
for c in category1:
    bar_plot(c)

 # **Analysis:**
 
1. The classes are not balanced: 549 passengers died and 342 survived, so the Survived column is imbalanced.
2. Sex is also imbalanced: there are about half as many women as men. If we knew nothing else about a passenger, the most probable guess would be that the passenger is male.
3. There are three passenger classes, and third class is the largest. Most of the passengers who died travelled in third class, probably on the lower decks of the ship. A likely explanation is that during the evacuation, places in the lifeboats were given first to first-class passengers, women, and children.
4. Most passengers embarked at Southampton (644); the fewest embarked at Queenstown (77).
5. According to the fifth figure (SibSp), most passengers (608) had no siblings or spouses aboard.
6. According to the sixth figure (Parch), 678 passengers had no parents or children aboard. One possible reading is that while first-class passengers were travelling for leisure, many of the passengers travelling alone were heading to America to start a new life.
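
The imbalance in point 1 is easier to judge as a share of the total. A minimal sketch that rebuilds the Survived column from the two counts quoted above (the series is reconstructed for illustration, not read from the CSV):

```python
import pandas as pd

# Survival counts quoted in the analysis: 549 died (0), 342 survived (1).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True turns the counts into shares of the total.
shares = survived.value_counts(normalize=True)
print(shares.round(3))
```

On the real data the equivalent call would be `train_df["Survived"].value_counts(normalize=True)`.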

In [None]:
category2 = ["Cabin", "Name", "Ticket"]
for c in category2:
    print("{} \n".format(train_df[c].value_counts()))

<a id ="5"></a><br>
# **Numerical Variable Analysis**

In [None]:
def plot_hist(variable):
    
    plt.figure(figsize=(9,3)) 
    plt.hist(train_df[variable].dropna(), bins = 100)    # 100 bins (the default is 10); missing values are dropped
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()
    


In [None]:
numericVar = ["Fare", "Age", "PassengerId"]

for n in numericVar:
    plot_hist(n)

 # **Analysis:**

**Figure 1:**
Most passengers paid less than $50, so they were most likely second- and third-class passengers. A small group paid more than $100. Their tickets may have been bought as a group for a family or relatives, so we cannot infer from the fare alone whether these passengers were rich or poor.

**Figure 2:** Passengers come from every age group. In particular, the number of children is too large to ignore. Most passengers are between 20 and 40 years old, with the largest concentration in their twenties. If we ignore ages below 14, the distribution is right-skewed, which means that the mean age is greater than the median and the mode.
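
The right-skew claim (mean greater than median) can be checked numerically. A minimal sketch on made-up ages with a long right tail; the values are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up ages with a long right tail (not taken from the dataset).
ages = pd.Series([20, 22, 24, 25, 26, 28, 30, 35, 50, 70])

print("mean:", ages.mean())
print("median:", ages.median())
print("skew:", ages.skew())   # a positive value indicates right skew
```

On the real column, `train_df["Age"].skew()` would give the corresponding figure; pandas skips missing values by default.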

In [None]:
# line-Plot: 

train_df.Age.plot(kind = 'line', color = 'black',label = 'Age',linewidth=2,alpha = 1,grid = True,linestyle = '-', figsize=(16,8))
train_df.Fare.plot(color = 'orange',label = 'Fare',linewidth=1, alpha = 1,grid = True,linestyle = '-')
plt.legend(loc='upper right')     
plt.xlabel('Row index')
plt.ylabel('Value')
plt.title('Line Plot')            
plt.show()

In [None]:
# Scatter Plot 

train_df.plot(kind='scatter', x='Age', y='Fare',alpha = 0.3,color = 'red')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age-Fare Scatter Plot')
plt.show()

<a id ="6"></a><br>
# **Basic Data Analysis**

**1. Pclass - Survived**

**2. Sex - Survived**

**3. SibSp - Survived**

**4. Parch - Survived**

In [None]:
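# correlation map (numeric columns only; pandas 2.0 and later raise an error on text columns)
f,ax = plt.subplots(figsize=(12, 10))
sns.heatmap(train_df.select_dtypes(include="number").corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)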
plt.show()

In [None]:
df_corr = train_df.select_dtypes(include="number").corr()[["Survived"]].sort_values(by="Survived", ascending=False)
df_corr

In [None]:
plt.figure(figsize=(3, 6)) 
sns.heatmap(df_corr, annot= True, cmap="BrBG", vmin= -1, vmax= 1)

 # **Analysis:**
 
Correlation measures the linear relationship between two variables. The coefficient r reflects both the strength of the relationship (its absolute value, on a scale from 0 to 1) and its direction (positive or negative). Values of r near zero indicate no linear relationship.

* -1 indicates a perfectly negative linear correlation between two variables.
* 0 indicates no linear correlation between two variables.
* 1 indicates a perfectly positive linear correlation between two variables.

Now we can read the relationships between pairs of variables from the map. For example, there is a negative correlation between Fare and Pclass: as the class number goes up (from first to third class), the fare goes down. There is also a slightly negative correlation between Survived and Pclass, meaning that the survival rate is lower among second- and third-class passengers. On the other hand, there is a slightly positive correlation between Fare and Survived: as the ticket price goes up, the share of passengers who survived goes up slightly.
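
To make the coefficient concrete, the sketch below computes Pearson's r by hand and compares it with pandas. The class and fare pairs are made up for illustration and are not the Titanic values:

```python
import numpy as np
import pandas as pd

# Made-up class/fare pairs: higher class number, lower fare.
toy = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3, 3],
                    "Fare": [80.0, 60.0, 25.0, 20.0, 8.0, 7.0]})

# Deviations from the mean of each column.
x = toy["Pclass"] - toy["Pclass"].mean()
y = toy["Fare"] - toy["Fare"].mean()

# Pearson r: co-variation scaled by the spread of both variables.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["Pclass"].corr(toy["Fare"])
print(round(r_manual, 3), round(r_pandas, 3))
```

Both numbers agree and are negative, which matches the direction described for Fare and Pclass.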

# Pclass- Survived:

In [None]:
train_df[["Pclass", "Survived"]].groupby(["Pclass"],as_index= False).mean().sort_values(by="Survived", ascending=False)

# Sex- Survived:

In [None]:
train_df[["Sex", "Survived"]].groupby(["Sex"],as_index= False).mean().sort_values(by="Survived", ascending=False)

# SibSp- Survived:

In [None]:
train_df[["SibSp", "Survived"]].groupby(["SibSp"],as_index= False).mean().sort_values(by="Survived", ascending=False)

# Parch - Survived:

In [None]:
train_df[["Parch", "Survived"]].groupby(["Parch"],as_index= False).mean().sort_values(by="Survived", ascending=False)
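
Because Survived is coded as 0 or 1, the group mean computed in the cells above is the survival rate of each group. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up passengers: both women survived, one of three men survived.
toy = pd.DataFrame({"Sex": ["female", "female", "male", "male", "male"],
                    "Survived": [1, 1, 0, 1, 0]})

# Mean of a 0/1 column per group = share of survivors in that group.
rates = toy.groupby("Sex")["Survived"].mean().sort_values(ascending=False)
print(rates)
```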

<a id ="7"></a><br>
# **Outlier Detection**

In [None]:
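def detect_outliers(df, features):
    """Return indices of rows that are IQR outliers in more than two of the given features."""
    outlier_indices = []

    for c in features:
        # NaN propagates through np.percentile, so a column with missing values
        # (such as Age) yields NaN bounds and flags no rows.
        Q1 = np.percentile(df[c], 25)    # first quartile
        Q3 = np.percentile(df[c], 75)    # third quartile
        IQR = Q3 - Q1
        outlier_step = IQR * 1.5
        # rows below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)

    # count how many features flagged each row; keep rows flagged more than twice
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i,v in outlier_indices.items() if v > 2)

    return multiple_outliers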
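
A NaN-aware variant of the same 1.5 * IQR rule can be sketched for a single column with `Series.quantile`, which skips missing values, whereas `np.percentile` returns NaN when the input contains NaN. The helper name `iqr_outlier_index` and the ages below are assumptions for illustration, not part of the notebook:

```python
import numpy as np
import pandas as pd

def iqr_outlier_index(series, k=1.5):
    """Return index labels outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN."""
    q1 = series.quantile(0.25)   # Series.quantile skips NaN
    q3 = series.quantile(0.75)
    step = k * (q3 - q1)
    mask = (series < q1 - step) | (series > q3 + step)
    return series[mask].index.tolist()

# Made-up ages with one missing value and one extreme value.
ages = pd.Series([22, 25, np.nan, 28, 30, 31, 95])

print(np.percentile(ages, 25))   # nan: NaN propagates
print(iqr_outlier_index(ages))   # [6]
```

Swapping this logic into `detect_outliers` would let Age contribute to the count, which may change which rows are dropped.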

**Dropping Outliers:**

In [None]:
train_df.loc[detect_outliers(train_df,["Age","SibSp", "Parch", "Fare"])]       # rows flagged as outliers

In [None]:
train_df = train_df.drop(detect_outliers(train_df,["Age","SibSp", "Parch", "Fare"]), axis= 0).reset_index(drop= True)

In [None]:
train_df

<a id ="8"></a><br>
# **Missing Values**

In [None]:
test_df = pd.read_csv("/kaggle/input/titanic-machine-learning-from-disaster/test.csv")

In [None]:
train_df_len = len(train_df)
train_df= pd.concat([train_df,test_df], axis=0).reset_index(drop=True)
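
Storing the training length before the concat makes it possible to split the combined frame back later. The test file has no Survived column, which is why that column has missing values after the concat. A minimal sketch on made-up frames (`toy_train`, `toy_test`, and `n_train` are hypothetical names):

```python
import pandas as pd

toy_train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"PassengerId": [4, 5]})  # no Survived column

n_train = len(toy_train)
combined = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# Rows before n_train are the training rows; the rest came from the test file.
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns="Survived")

# The test rows are the ones with a missing Survived value.
print(combined["Survived"].isnull().sum())
```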

In [None]:
train_df.head()

<a id ="9"></a><br>
# Finding Missing Values:

In [None]:
train_df.columns

In [None]:
train_df.isnull()

In [None]:
train_df.columns[train_df.isnull().any()]

In [None]:
train_df.isnull().sum()

 # **Analysis:**

Missing values are in the columns Survived (418), Age (256), Fare (1), Cabin (1007) and Embarked (2). Because passengers are spread across many age groups, we should be careful when filling the missing values of Age.
The Cabin column is not important for this analysis, so we leave it as it is.
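
The counts above can be paired with percentages to see how much of each column is missing. A minimal sketch on a made-up frame (the `toy` frame and the `summary` layout are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps (not the Titanic data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# Count and percentage of missing values, only for columns that have any.
missing = toy.isnull().sum()
summary = pd.DataFrame({"missing": missing,
                        "percent": 100 * missing / len(toy)})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```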

<a id ="10"></a><br>
# Filling Missing Values:
*  Embarked has 2 missing values
*  Fare has only 1 missing value

# Embarked:

In [None]:
train_df[train_df["Embarked"].isnull()]

In [None]:
train_df.boxplot(column="Fare", by= "Embarked")
plt.show()

 # **Analysis:**

The boxplot shows the fare distribution for each port of embarkation: the first box is Cherbourg, the second Queenstown, and the third Southampton. The median fare of the second box is very low, which suggests that these passengers were low-income. The third box sits higher than the second, but we can still classify it as a lower- to middle-income group. The first box shows clearly higher fares than both the second and the third. The two passengers with a missing Embarked value each paid a fare of $80. The upper whiskers of the second and third boxes are below $80, whereas $80 fits well within the first box. We therefore conclude that the two missing values belong with the first box and that these passengers probably embarked at Cherbourg.
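
The visual argument can be approximated numerically by comparing the known fare with the median fare of each port. This is a simplified sketch on made-up fares, not the notebook's method; `known_fare` and `closest_port` are hypothetical names:

```python
import pandas as pd

# Made-up fares per port; the real distributions come from the boxplot cell.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S"],
    "Fare": [90.0, 75.0, 60.0, 8.0, 7.5, 13.0, 26.0, 8.0],
})
known_fare = 80.0  # fare paid by the passengers whose port is missing

# Median fare per port, then the port whose median is closest to the known fare.
medians = toy.groupby("Embarked")["Fare"].median()
closest_port = (medians - known_fare).abs().idxmin()
print(medians)
print(closest_port)
```

A median comparison ignores the spread of each group, so the boxplot remains the better check when groups overlap.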

In [None]:
train_df["Embarked"] = train_df["Embarked"].fillna("C")

In [None]:
train_df["Embarked"].isnull().any() 

# Fare:

In [None]:
train_df[train_df["Fare"].isnull()]

In [None]:
train_df[train_df["Pclass"] == 3]["Fare"]

In [None]:
fare_mean = train_df[train_df["Pclass"] == 3]["Fare"].mean()
fare_mean

In [None]:
train_df["Fare"]= train_df["Fare"].fillna(fare_mean)

In [None]:
train_df["Fare"].isnull().any() 
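
The Fare fill above uses the mean fare of third class. The same idea generalizes to any class with `groupby(...).transform`, which fills each gap with the mean of its own group. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up fares; the last third-class fare is missing.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 60.0, 8.0, 10.0, np.nan],
})

# Replace each missing fare with the mean fare of the same class.
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```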

# Finally:
**After detecting the missing values, we replaced them with meaningful new values without degrading the DataFrame.**
**DataFrames must be handled carefully while they are being manipulated. Otherwise, we can distort the data, get wrong results, and draw wrong conclusions.**