
**EXPLORATORY DATA ANALYSIS OF THE TITANIC DATASET FROM KAGGLE**

We will explore the Titanic dataset to answer the questions listed below and uncover interesting insights.

*   Number of survivors
*   Distribution of passengers by gender
*   Average age of passengers and average fare paid
*   Survival rate by cabin class and gender
*   Survival rate among the oldest and youngest passengers
*   Boarding stops








In [5]:
import pandas as pd
import numpy as np

In [8]:
# Read the CSV file, which was downloaded from Kaggle and stored in the /content folder of our session

df = pd.read_csv('/content/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
# Shape of the data
# The shape attribute tells us how many rows and columns there are in our dataset
df.shape

(891, 12)

In [10]:
# The info method gives a quick overview of the dataset: data types, non-null counts, and index, among others.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [11]:
# Chaining sum() onto isnull() returns the number of null values in each column,
# giving us a clear view of where null values need handling in our dataset

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
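
One possible way to act on these counts is sketched below. It is not part of the original analysis: the strategy (median for `Age`, most frequent value for `Embarked`, a presence flag instead of `Cabin`) is an assumption, the `HasCabin` column name is hypothetical, and the small frame uses illustrative values rather than the real file.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same three columns (illustrative values, not the real dataset)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

cleaned = sample.copy()

# Numeric column: fill gaps with the median, which is robust to outliers
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Categorical column with few gaps: fill with the most frequent value
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

# Mostly-empty column: keep only a flag saying whether a cabin was recorded
cleaned["HasCabin"] = cleaned["Cabin"].notna()
cleaned = cleaned.drop(columns="Cabin")

print(cleaned.isnull().sum())
```

Whether imputation is appropriate depends on the question being asked; for simple descriptive statistics, pandas already skips missing values by default.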

In [12]:
# The describe method gives a statistical summary of the dataset's numerical columns.
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [13]:
# The duplicated method returns a boolean Series, and chaining sum() gives the total number of duplicated values in the column
# In this case there is no duplicated PassengerId, so no passenger is listed more than once.

df['PassengerId'].duplicated().sum()

0

**EXPLORING THE DATASET WITH SPECIFIC QUESTIONS**

In [14]:
# How many people survived the tragedy?

survived_count = df['Survived'].sum()

print('The number of people who survived is', survived_count)

The number of people who survived is 342
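
Because `Survived` is coded as 0 or 1, its sum is the number of survivors and its mean is the survival rate. The sketch below illustrates this on the five `Survived` values shown in the `df.head()` output, then recomputes the overall rate from the two figures reported in this notebook (342 survivors out of 891 rows). The variable names are chosen for illustration.

```python
import pandas as pd

# The five 'Survived' values from the df.head() output
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

count = int(survived.sum())   # number of 1s
rate = survived.mean()        # share of 1s
print(f"{count} of {len(survived)} survived ({rate:.1%})")

# Overall rate from the figures reported in this notebook
full_rate = 342 / 891
print(f"Overall survival rate: {full_rate:.2%}")
```

The overall rate agrees with the mean of `Survived` (0.383838) in the `describe` output.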


In [15]:
# What was the gender distribution of the passengers?
# We will use the value_counts method for this.

df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [16]:
# We can also look at the gender distribution in terms of percentages

round(df['Sex'].value_counts(normalize = True)*100,2)

male      64.76
female    35.24
Name: Sex, dtype: float64

In [31]:
# What was the average age of the passengers on board?
# We will use the mean and median methods
# Note: mean() skips the 177 missing Age values, so this is the average over the 714 known ages
avg_age = df['Age'].mean()

print('The average age of a passenger on board the Titanic was', round(avg_age,0))

The average age of a passenger on board the Titanic was 30.0


In [20]:
df['Age'].median()

28.0

In [30]:
# What were the mean and median fares paid by the passengers on board?
average = df['Fare'].mean()

print('The average fare paid by a passenger on board the Titanic was £', round(average,2))

The average fare paid by a passenger on board the Titanic was £ 32.2


In [24]:
df['Fare'].median()

14.4542
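
The mean fare (32.20) is more than twice the median (14.45), which points to a right-skewed distribution: a few very expensive tickets pull the mean up. The sketch below shows the same pattern on the five fares from the `df.head()` output. Using `skew()` to quantify this is an addition for illustration, not part of the original analysis.

```python
import pandas as pd

# The five 'Fare' values from the df.head() output
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()   # positive values indicate a long right tail

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skew={skewness:.2f}")
```

For skewed columns like this one, the median is usually the better description of a typical passenger.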

In [19]:
# Which passengers paid the maximum fare?
# All 3 passengers who paid the highest fare on the ship survived
# This is consistent with the assumption that priority was given to first-class passengers
df[df['Fare'] == df['Fare'].max()]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C


In [33]:
# We can check this assumption by calculating the survival rate per cabin class
# Survival rate by class using the groupby method

df.groupby('Pclass')['Survived'].mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [34]:
# We can dig deeper by bringing Sex into the groupby
# Survival rate by class and sex combined
# In every class, women survived at a far higher rate than men, which suggests that women were given priority during the rescue

df.groupby(['Pclass', 'Sex'])['Survived'].mean()

Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64
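
A two-level result like this can be easier to compare as a grid, with classes as rows and sexes as columns. The sketch below does this with `unstack` on the five rows from the `df.head()` output; on the full DataFrame the same two calls would apply. The `sample`, `rates`, and `table` names are chosen for illustration.

```python
import pandas as pd

# Pclass, Sex and Survived for the five rows in the df.head() output
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Same groupby as above, then pivot the inner index level (Sex) into columns
rates = sample.groupby(["Pclass", "Sex"])["Survived"].mean()
table = rates.unstack("Sex")

print(table.round(2))
```

Combinations that do not occur in the data (first-class men in this tiny sample) show up as NaN rather than 0.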

In [35]:
# We can check this by looking at the overall survival rate by gender

df.groupby('Sex')['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

In [37]:
# How many passengers older than 70 were on the passenger list?
# There were 5, and only the oldest passenger on the ship, aged 80, survived

df[df['Age'] > 70.0]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S


In [39]:
# How many babies (aged 1 or under) were there among the passengers?
# There were 14 babies among the passengers and 12 of them survived
# This suggests that priority was also given to children

df[df['Age'] <= 1.0]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S
164,165,0,3,"Panula, Master. Eino Viljami",male,1.0,4,1,3101295,39.6875,,S
172,173,1,3,"Johnson, Miss. Eleanor Ileen",female,1.0,1,1,347742,11.1333,,S
183,184,1,2,"Becker, Master. Richard F",male,1.0,2,1,230136,39.0,F4,S
305,306,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S
381,382,1,3,"Nakid, Miss. Maria (""Mary"")",female,1.0,0,2,2653,15.7417,,C
386,387,0,3,"Goodwin, Master. Sidney Leonard",male,1.0,5,2,CA 2144,46.9,,S
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
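
Counts like these can be computed directly from the boolean mask instead of being read off the displayed table. The sketch below assumes the same `Age <= 1.0` filter and runs on five rows taken from the outputs in this notebook (two adults from `df.head()` and three infants from the table above). The variable names are chosen for illustration.

```python
import pandas as pd

# Age and Survived for five passengers shown in this notebook's outputs
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 0.83, 1.0, 0.75],
    "Survived": [0, 1, 1, 0, 1],
})

is_baby = sample["Age"] <= 1.0   # rows with a missing Age compare as False

n_babies = int(is_baby.sum())
n_survived = int(sample.loc[is_baby, "Survived"].sum())
baby_rate = sample.loc[is_baby, "Survived"].mean()

print(f"{n_survived} of {n_babies} babies survived ({baby_rate:.0%})")
```

Passengers whose age is unknown are excluded by the comparison, so the count is a lower bound on the true number of infants.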


In [40]:
# Where did the passengers board?
# S stands for Southampton,
# C for Cherbourg, and
# Q for Queenstown

df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64
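
The counts above add up to 889, not 891, because `value_counts` drops the 2 missing `Embarked` values by default. The sketch below maps the codes to port names and keeps missing values visible with `dropna=False`. It uses the five `Embarked` values from the `df.head()` output plus one missing value added for illustration; the `port_names` dictionary is a helper introduced here.

```python
import pandas as pd

# Five 'Embarked' values from df.head(), plus one missing value added for illustration
embarked = pd.Series(["S", "C", "S", "S", "S", None], name="Embarked")

# Hypothetical helper: code-to-name mapping taken from the comments above
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

ports = embarked.map(port_names)           # unmapped or missing codes become NaN
counts = ports.value_counts(dropna=False)  # keep the NaN bucket in the result

print(counts)
```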

**END**