# TITANIC SURVIVOR ANALYSIS

In [None]:
import pandas as pd
import numpy as np

**Description of the attributes of the dataset**                      

---


PassengerId : Passenger ID  
Survived : Survival (0=No, 1=Yes)  
Pclass : Passenger Class (1=1st, 2=2nd, 3=3rd)  
Name : Name  
Sex : Sex  
Age : Age  
SibSp : Number of siblings/spouses aboard  
Parch : Number of parents/children aboard  
Ticket : Ticket Number  
Fare : Passenger Fare  
Cabin : Cabin  
Embarked : Port of Embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)

In [None]:
df = pd.read_csv('/content/train (1).csv')  # read_csv already returns a DataFrame
print(df)


     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]


In [None]:
df.shape

(891, 12)

Locating the null values in the dataset:

In [None]:
df.isnull()

     PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  Parch  Ticket   Fare  Cabin  Embarked
0          False     False   False  False  False  False  False  False   False  False   True     False
1          False     False   False  False  False  False  False  False   False  False  False     False
2          False     False   False  False  False  False  False  False   False  False   True     False
3          False     False   False  False  False  False  False  False   False  False  False     False
4          False     False   False  False  False  False  False  False   False  False   True     False
...          ...       ...     ...    ...    ...    ...    ...    ...     ...    ...    ...       ...
886        False     False   False  False  False  False  False  False   False  False   True     False
887        False     False   False  False  False  False  False  False   False  False  False     False
888        False     False   False  False  False   True  False  False   False  False   True     False
889        False     False   False  False  False  False  False  False   False  False  False     False


Summing the null values in each column makes them easier to read:

In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We cannot leave the null values as they are. If a column is missing a large share of its entries, we should drop it; otherwise, we should fill its nulls with an appropriate value after analysing the dataset.
Here, we drop every column in which more than 40% of the entries are null.

In [None]:
a = df.isnull().sum()
drop_col = a[a>(40/100 * df.shape[0] )]
drop_col

Cabin    687
dtype: int64

Only the column 'Cabin' has null values in more than 40% of its entries (687 of 891, about 77%). So, it is better to drop this column.
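The same threshold test can also be written on the fraction of missing values instead of the raw count. The following is a minimal sketch on a small made-up frame (not the Titanic data); the helper name `columns_above_missing_threshold` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(frame: pd.DataFrame, threshold: float = 0.4) -> pd.Series:
    """Return the missing-value fraction of every column that exceeds `threshold`."""
    # The mean of a boolean mask is the share of True values, i.e. the share of nulls.
    missing_fraction = frame.isnull().mean()
    return missing_fraction[missing_fraction > threshold]


# Made-up data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

sparse_columns = columns_above_missing_threshold(toy)
toy_clean = toy.drop(columns=sparse_columns.index)  # returns a new frame, toy is unchanged

print(sparse_columns)
print(toy_clean.columns.tolist())
```

Working with fractions keeps the threshold readable (`0.4` rather than `40/100 * df.shape[0]`), and `drop(columns=...)` without `inplace` leaves the original frame available for comparison.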

In [None]:
drop_col.index

Index(['Cabin'], dtype='object')

Finding the sum of null values in each column after dropping column 'Cabin' :

In [None]:
df.drop(drop_col.index, axis=1, inplace = True)
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

Columns 'Age' and 'Embarked' still have null values, so we need to fill them.

Filling the null values with the mean of the respective column and counting the nulls in each column again:

In [None]:
df.fillna(df.mean(numeric_only=True), inplace=True)  # only numeric columns have a mean
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

The column 'Embarked' still has 2 null values because it contains strings, and strings do not have a mean.
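Why the mean fill skips a string column can be seen on a tiny example. This is a sketch on made-up values (not the Titanic data): `mean(numeric_only=True)` returns an entry only for numeric columns, and `fillna` with that Series fills only the columns named in it.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "C"],
})

column_means = toy.mean(numeric_only=True)  # contains 'Age' only
filled = toy.fillna(column_means)           # fills columns that appear in column_means

print(column_means)
print(filled.isnull().sum())
```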

Inspecting the description of the column 'Embarked' so that its two null values can be filled accordingly:

In [None]:
df['Embarked'].describe()

count     891
unique      3
top         S
freq      646
Name: Embarked, dtype: object

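There are three unique values in the column 'Embarked', and 'S' is by far the most frequent, with 644 occurrences among the 889 non-null entries (the summary above shows 891 and 646, apparently because it was captured after the fill). So, we fill the two null values with 'S'.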
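Instead of reading the most frequent value off the summary and hard-coding it, the value can be looked up with `Series.mode()`. A minimal sketch on a made-up series (not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (hypothetical values).
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores nulls by default and returns a Series, since ties are possible;
# taking the first entry picks one most frequent value.
most_common = ports.mode().iloc[0]
filled = ports.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

If two values tie for the highest frequency, `mode()` returns both in sorted order, so taking the first entry is an arbitrary but reproducible choice.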

In [None]:
df['Embarked'] = df['Embarked'].fillna('S')  # assign back instead of a chained inplace fill

Finding the sum of null values under each head after filling the null values of the column 'Embarked' :

In [None]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Computing the correlation between the numeric columns of the dataset:

In [None]:
df.corr(numeric_only=True)  # skip string columns such as Name, Sex and Ticket

             PassengerId   Survived     Pclass        Age      SibSp      Parch       Fare
PassengerId     1.000000  -0.005007  -0.035144   0.033207  -0.057527  -0.001652   0.012658
Survived       -0.005007   1.000000  -0.338481  -0.069809  -0.035322   0.081629   0.257307
Pclass         -0.035144  -0.338481   1.000000  -0.331339   0.083081   0.018443  -0.549500
Age             0.033207  -0.069809  -0.331339   1.000000  -0.232625  -0.179191   0.091566
SibSp          -0.057527  -0.035322   0.083081  -0.232625   1.000000   0.414838   0.159651
Parch          -0.001652   0.081629   0.018443  -0.179191   0.414838   1.000000   0.216225
Fare            0.012658   0.257307  -0.549500   0.091566   0.159651   0.216225   1.000000


SibSp = Number of siblings/spouses aboard    
Parch = Number of parents/children aboard    
Creating a new column 'Family_Size' by adding the columns 'SibSp' and 'Parch':

In [None]:
df['Family_Size']=df['SibSp']+df['Parch']
df.drop(['SibSp','Parch'],axis=1,inplace=True)
df.corr(numeric_only=True)  # skip string columns such as Name, Sex and Ticket


             PassengerId   Survived     Pclass        Age       Fare  Family_Size
PassengerId     1.000000  -0.005007  -0.035144   0.033207   0.012658    -0.040143
Survived       -0.005007   1.000000  -0.338481  -0.069809   0.257307     0.016639
Pclass         -0.035144  -0.338481   1.000000  -0.331339  -0.549500     0.065997
Age             0.033207  -0.069809  -0.331339   1.000000   0.091566    -0.248512
Fare            0.012658   0.257307  -0.549500   0.091566   1.000000     0.217138
Family_Size    -0.040143   0.016639   0.065997  -0.248512   0.217138     1.000000


**Family size has almost no linear correlation with survival (about 0.017).**
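A correlation near zero only rules out a linear relationship. The sketch below uses deliberately constructed, made-up data (not the Titanic data) in which the Pearson correlation is zero although the survival rate clearly differs between family sizes, which is why a grouped comparison is still worth checking.

```python
import pandas as pd

# Made-up data for illustration only: survival is high for family size 1
# and low for sizes 0 and 2, so the dependence is not linear.
toy = pd.DataFrame({
    "Family_Size": [0, 0, 1, 1, 2, 2],
    "Survived":    [0, 0, 1, 1, 0, 0],
})

linear_corr = toy["Family_Size"].corr(toy["Survived"])        # Pearson correlation
rate_by_size = toy.groupby("Family_Size")["Survived"].mean()  # survival rate per size

print(linear_corr)
print(rate_by_size)
```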


Let us check whether travelling alone has any effect on the survival rate. We create a new column 'Alone' that is 0 when the family size is greater than 0 and 1 when the family size is 0.

In [None]:
df['Alone'] = np.where(df['Family_Size'] > 0, 0, 1)  # 1 = travelling alone, 0 = with family
print(df)

     PassengerId  Survived  Pclass  ... Embarked Family_Size  Alone
0              1         0       3  ...        S           1      0
1              2         1       1  ...        C           1      0
2              3         1       3  ...        S           0      1
3              4         1       1  ...        S           1      0
4              5         0       3  ...        S           0      1
..           ...       ...     ...  ...      ...         ...    ...
886          887         0       2  ...        S           0      1
887          888         1       1  ...        S           0      1
888          889         0       3  ...        S           3      0
889          890         1       1  ...        C           0      1
890          891         0       3  ...        Q           0      1

[891 rows x 11 columns]


In [None]:
df.groupby(['Alone'])['Survived'].mean()

Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64
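A group mean is easier to judge when the group size is shown next to it. The following sketch, on made-up data (not the Titanic data), returns both in one table; the helper name `survival_summary` is an assumption for illustration.

```python
import pandas as pd


def survival_summary(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Survival rate and passenger count for each value of column `by`."""
    return frame.groupby(by)["Survived"].agg(rate="mean", passengers="count")


# Made-up data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Alone":    [0, 0, 0, 1, 1, 1, 1, 1],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

summary = survival_summary(toy, "Alone")
print(summary)
```

The same helper could be called with `"Embarked"` or `"Sex"` as the grouping column on a frame that contains them.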

**A passenger travelling alone had a lower chance of surviving (about 30% versus 51%).**    
One possible explanation is that passengers travelling with family more often belonged to the wealthier classes, which may have been prioritized over the others.

In [None]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

**The passengers whose port of embarkation was Cherbourg have the highest survival rate**.

In [None]:
df[['Alone','Fare']].corr()

          Alone      Fare
Alone  1.000000 -0.271832
Fare  -0.271832  1.000000


**Passengers who were not alone tended to pay higher fares (the correlation between 'Alone' and 'Fare' is about -0.27).**

In [None]:
df['Sex'] = np.where(df['Sex'] == 'male', 0, 1)  # 0 = male, 1 = female
df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

**Female passengers had a much higher chance of survival than male passengers (about 74% versus 19%).**   
This is consistent with women being prioritized over men during the rescue.

# CONCLUSION

---


*   A passenger travelling alone had a lower chance of surviving. In other words, passengers travelling with their family had a higher survival rate.
*   Female passengers survived at a much higher rate than male passengers, which suggests that women were prioritized over men.
*   Passengers travelling in a higher class had higher chances of survival (the correlation between 'Pclass' and 'Survived' is about -0.34, and a lower class number means a higher class). This hierarchy may have been kept in mind while rescuing people.
*   Passengers who embarked at Cherbourg survived in a higher proportion than passengers from the other ports of embarkation.
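The conclusions about sex and class are each drawn from a single column. One way to check whether they hold jointly is a table of survival rates per sex and class. The sketch below uses made-up data (not the Titanic data), so its numbers say nothing about the real passengers.

```python
import pandas as pd

# Made-up data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male", "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows: sex, columns: passenger class, cells: mean of Survived (the survival rate).
rate_table = toy.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
print(rate_table)
```

On the notebook's `df`, the same call would need to run before 'Sex' is recoded to 0/1, or use those codes as row labels.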















