Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menu bar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menu bar, select Cell$\rightarrow$Run All).

Make sure that in addition to the code, you provide written answers for all questions of the assignment. You can add a new cell and set the type as "Markdown" so you can include your answers in this notebook.

Below, please fill in your name:

In [1]:
NAME = "Iswaryaah Balakrishnan"

## Assignment 3 - Data Analysis using Pandas
**(15 points total)**

For this assignment, we will analyze data on the passengers aboard the Titanic.

Use the .csv file provided. The definition of all variables can be found below:


- survival --> Survival --> 0 = No, 1 = Yes
- pclass --> Ticket class --> 1 = 1st, 2 = 2nd, 3 = 3rd
- sex --> Sex --> M = Male, F = Female
- Age --> Age in years
- sibsp --> # of siblings / spouses aboard the Titanic
- parch --> # of parents / children aboard the Titanic
- ticket --> Ticket number
- fare --> Passenger fare --> Price paid for the fare
- cabin --> Cabin number
- embarked --> Port of Embarkation --> C = Cherbourg, Q = Queenstown, S = Southampton

The main purpose of this assignment is to identify which passengers were more likely to survive the Titanic.

**Part 1.**  _(25 points)_
* Import the data into a pandas DataFrame (1 point)

* Use the describe() and info() functions to assess the data. What can you conclude? (2 points)

* Write a function to identify how many men vs. women were on board, and how many survived in each group (3 points)

* Write a function to identify how many men vs. women were traveling with families of 3 or more members, and how many were traveling alone or in pairs (3 points)

* Write a function to identify how many passengers departed from each of the 3 ports (1 point)

* Write a function to identify how many passengers were in each class (can be inferred from the cabin variable) (1 point)

* Write a function to identify how many passengers paid high vs. low fare tickets (1 point)

* Define a function to classify each person in an age group with the following groupings: 0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 60+. Add a new column in the DataFrame which identifies each person's age group. Then, count the number of survivors in each age group (8 points)

* Now that you have some basic information about the passengers, conduct additional analysis to identify which passengers were most likely to survive. You should assess all the variables provided and anchor on the Survived variable. As a hint, you should also combine variables. For example, were all women equally as likely to survive? Or only those with larger families or those who were younger? (5 points)

**Part 2.**  _(5 points)_
* How did you approach your analysis using Python? Which functionalities did you use? Why?


## Import the data into a pandas DataFrame (1 point)

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [166]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('TitanicData.csv')

## Use the describe() and info() functions to assess the data. What can you conclude? (2 points)

In [4]:
df.info()

# Based on the info() function, we can see that we have 714 observations
# stored as float, integer and string data types.
# We know that some columns hold strings because their Dtype is 'object'.
# There are 12 columns in total.
# We can also see that some Cabin and Embarked data is missing:
# there are only 185 entries in the Cabin column and 712 entries in the Embarked column
# where there should be 714 entries.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  714 non-null    int64  
 1   Survived     714 non-null    int64  
 2   Pclass       714 non-null    int64  
 3   Name         714 non-null    object 
 4   Sex          714 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        714 non-null    int64  
 7   Parch        714 non-null    int64  
 8   Ticket       714 non-null    object 
 9   Fare         714 non-null    float64
 10  Cabin        185 non-null    object 
 11  Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 67.1+ KB
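
The missing-value observation can also be checked directly rather than read off the `info()` output. The following is a minimal sketch on a tiny made-up sample (not the real `TitanicData.csv`); the name `sample` is only for illustration.

```python
import pandas as pd

# Tiny made-up sample standing in for the Titanic data (not the real file).
sample = pd.DataFrame({
    "Cabin": ["C85", None, None, "C123"],
    "Embarked": ["S", "C", None, "S"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Number of missing entries per column, largest first.
missing = sample.isna().sum().sort_values(ascending=False)
print(missing)
```

Running the same `isna().sum()` call on `df` should list Cabin and Embarked as the only columns with gaps, assuming the file matches the `info()` output above.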


In [5]:
df.describe(include='all')

# Based on the describe() function, we can conclude the following:
# There were 714 passengers in total
# The age range of passengers is 0.42 to 80 years
# The maximum number of parents/children of a boarded passenger is 6
# The maximum number of siblings/spouses of a boarded passenger is 5
# The maximum fare price was 512
# The minimum fare price was 0

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,714.0,714.0,714.0,714,714,714.0,714.0,714.0,714.0,714.0,185,712
unique,,,,714,2,,,,542.0,,134,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,G6,S
freq,,,,1,453,,,,7.0,,4,554
mean,448.582633,0.406162,2.236695,,,29.699118,0.512605,0.431373,,34.694514,,
std,259.119524,0.49146,0.83825,,,14.526497,0.929783,0.853289,,52.91893,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,222.25,0.0,1.0,,,20.125,0.0,0.0,,8.05,,
50%,445.0,0.0,2.0,,,28.0,0.0,0.0,,15.7417,,
75%,677.75,1.0,3.0,,,38.0,1.0,1.0,,33.375,,


In [6]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Write a function to identify how many men vs. women were on board and how many survived in each group (3 points)

In [20]:
df.groupby('Sex')['Sex'].count().to_frame()

# 261 females were on board
# 453 males were on board

Unnamed: 0_level_0,Sex
Sex,Unnamed: 1_level_1
female,261
male,453


In [165]:
df.groupby(['Sex','Survived'])['Survived'].count().to_frame()

# 197 out of 261, i.e. about 75.5%, of female passengers survived.
# 93 out of 453, i.e. about 20.5%, of male passengers survived.

# This large gap is consistent with women having been given priority during the rescue.
# Whether children were also given priority cannot be seen from this table alone.

Unnamed: 0_level_0,Unnamed: 1_level_0,Survived
Sex,Survived,Unnamed: 2_level_1
female,0,64
female,1,197
male,0,360
male,1,93
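
Since the task asks for a function, the two `groupby` calls above could be wrapped into one. This is a sketch only: `survival_by_sex` and the sample data are hypothetical names and values, and the column names are assumed to match those in `df`.

```python
import pandas as pd

def survival_by_sex(frame):
    """Return passengers, survivors and survival rate for each sex."""
    summary = frame.groupby("Sex")["Survived"].agg(
        passengers="count",  # rows per sex
        survivors="sum",     # Survived is 0/1, so the sum counts survivors
    )
    summary["rate"] = summary["survivors"] / summary["passengers"]
    return summary

# Made-up sample, not the real data.
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 1, 0],
})
result = survival_by_sex(sample)
print(result)
```

Calling `survival_by_sex(df)` would give the counts and the percentages in a single table.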


## Write a function to identify how many men vs. women were traveling with families of 3 or more members, and how many were traveling alone or in pairs (3 points)

In [9]:
#df['Family'] = df['SibSp'] + df['Parch']<=(2)
#df[(df['Sex']=='male') & (df['Family']==True)].count()

In [10]:
#df['Family'] = df['SibSp'] + df['Parch']>(2)
#df[(df['Sex']=='male') & (df['Family']==True)].count()

In [11]:
#df['Family'] = df['SibSp'] + df['Parch']<=(2)
#df[(df['Sex']=='female') & (df['Family']==True)].count()

In [12]:
#df['Family'] = df['SibSp'] + df['Parch']>(2)
#df[(df['Sex']=='female') & (df['Family']==True)].count()

In [13]:
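# Family is True when a passenger had more than 2 relatives aboard (SibSp + Parch > 2).
df['Family'] = (df['SibSp'] + df['Parch']) > 2
df.groupby(['Sex','Family'])['Family'].count().to_frame()

# Using this threshold for "alone or in pairs" vs. "families of 3 or more":
# 216 females were travelling alone or in pairs
# 45 females were travelling with families of 3 or more
# 420 males were travelling alone or in pairs
# 33 males were travelling with families of 3 or more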

Unnamed: 0_level_0,Unnamed: 1_level_0,Family
Sex,Family,Unnamed: 2_level_1
female,False,216
female,True,45
male,False,420
male,True,33
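
The family threshold depends on whether the passenger is counted as a family member. As a hedged alternative, the sketch below assumes family size = SibSp + Parch + 1 (the passenger included), so "3 or more members" means at least two relatives aboard. The function name, labels and sample rows are made up for illustration, and this definition would give different counts from the table above.

```python
import pandas as pd

def family_group_counts(frame, min_family_size=3):
    """Count passengers per sex in small vs. large travelling groups."""
    # Assumption: the passenger counts as one member of their own family.
    family_size = frame["SibSp"] + frame["Parch"] + 1
    group = family_size.ge(min_family_size).map(
        {True: "family of 3+", False: "alone or pair"}
    )
    return frame.assign(Group=group).groupby(["Sex", "Group"]).size()

# Made-up sample, not the real data.
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "SibSp": [0, 1, 1, 0, 2],
    "Parch": [0, 0, 1, 0, 2],
})
group_counts = family_group_counts(sample)
print(group_counts)
```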


## Write a function to identify how many passengers departed from each of the 3 ports (1 point)

In [25]:
df.groupby(['Embarked'])['Embarked'].count().to_frame()

# 130 passengers departed from port C
# 28 passengers departed from port Q
# 554 passengers departed from port S

Unnamed: 0_level_0,Embarked
Embarked,Unnamed: 1_level_1
C,130
Q,28
S,554


## Write a function to identify how many passengers were in each class (can be inferred from the cabin variable) (1 point)

In [162]:
df.groupby(['Pclass'])['Pclass'].count().to_frame()

# There were 186 passengers in first class
# There were 173 passengers in second class
# There were 355 passengers in third class

Unnamed: 0_level_0,Pclass
Pclass,Unnamed: 1_level_1
1,186
2,173
3,355


In [164]:
df.groupby(['Pclass','Survived'])['Pclass'].count().to_frame()

# About 66% of first class passengers survived (122 of 186), nearly twice as many as died.
# About 48% of second class passengers survived (83 of 173).
# About 24% of third class passengers survived (85 of 355).

Unnamed: 0_level_0,Unnamed: 1_level_0,Pclass
Pclass,Survived,Unnamed: 2_level_1
1,0,64
1,1,122
2,0,90
2,1,83
3,0,270
3,1,85
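
The counts in the table above can be turned into survival rates with a little arithmetic. The sketch below re-enters those six counts by hand (it does not read the CSV) and divides survivors by the class total.

```python
import pandas as pd

# Counts copied from the grouped output above: (Pclass, Survived) -> passengers.
counts = pd.Series(
    [64, 122, 90, 83, 270, 85],
    index=pd.MultiIndex.from_product(
        [[1, 2, 3], [0, 1]], names=["Pclass", "Survived"]
    ),
)

# One row per class, one column per outcome (0 = died, 1 = survived).
table = counts.unstack("Survived")

# Share of each class that survived.
rate = (table[1] / table.sum(axis=1)).round(3)
print(rate)
```

On the full DataFrame the same rates could be obtained with `df.groupby('Pclass')['Survived'].mean()`, because the mean of a 0/1 column is the share of ones.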


## Write a function to identify how many passengers paid high vs. low fare tickets (1 point)

In [16]:
df['medianFare'] = df['Fare'].median()
df['highFare'] = df['Fare'] >= df['medianFare']
df.groupby(['highFare'])['highFare'].count().to_frame()

# Assuming that a low fare ticket is one that costs less than the median ticket price,
# and a high fare ticket is one that costs the median ticket price or more...
# There are 356 passengers who paid for low fare tickets
# and 358 passengers who paid for high fare tickets

Unnamed: 0_level_0,highFare
highFare,Unnamed: 1_level_1
False,356
True,358
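
The same median split can be written as a function that keeps the median as a single number instead of storing it in a column. This is a sketch under the same assumption (high = at or above the median fare); `fare_band_counts` and the sample fares are made up.

```python
import numpy as np
import pandas as pd

def fare_band_counts(frame):
    """Count passengers paying at/above vs. below the median fare."""
    median_fare = frame["Fare"].median()  # a single scalar value
    band = np.where(frame["Fare"] >= median_fare, "high", "low")
    return pd.Series(band, index=frame.index).value_counts()

# Made-up sample fares, not the real data.
sample = pd.DataFrame({"Fare": [7.25, 71.28, 7.925, 53.1, 8.05]})
band_counts = fare_band_counts(sample)
print(band_counts)
```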


## Define a function to classify each person in an age group with the following groupings: 0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 60+. Add a new column in the DataFrame which identifies each person's age group. Then, count the number of survivors in each age group (8 points)

In [17]:
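# Note: bins=7 splits the observed age range (0.42 to 80) into seven equal-width
# intervals of roughly 11.4 years each, so the labels only approximate the decade groupings.
df['ageGroups'] = pd.cut(df['Age'], bins=7, labels=('0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '60+'))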

In [151]:
df.groupby(['ageGroups'])['Survived'].sum().to_frame()

# 39 passengers in the 0-10 age group survived
# 64 passengers in the 11-20 age group survived
# 93 passengers in the 21-30 age group survived
# 56 passengers in the 31-40 age group survived
# 28 passengers in the 41-50 age group survived
# 9 passengers in the 51-60 age group survived
# 1 passenger in the 60+ age group survived

# Most survivors were in the 21-30 age group, followed by 11-20 and then 31-40.
# These are raw counts, so they partly reflect how many passengers were in each group.

# Taken together with the earlier analysis of sex and survival, the results are
# consistent with priority having been given to women and children, although the
# survival rate within each age group would be needed to confirm this.

Unnamed: 0_level_0,Survived
ageGroups,Unnamed: 1_level_1
0-10,39
11-20,64
21-30,93
31-40,56
41-50,28
51-60,9
60+,1
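
To make the groups match the requested ranges exactly, `pd.cut` can be given explicit bin edges instead of a bin count. The sketch below is one possible way to do this on made-up ages; `AGE_EDGES`, `AGE_LABELS` and `add_age_group` are hypothetical names, and applying it to `df` would give counts that differ from the table above.

```python
import numpy as np
import pandas as pd

# Right-closed edges: (0, 10], (10, 20], ..., (60, inf)
AGE_EDGES = [0, 10, 20, 30, 40, 50, 60, np.inf]
AGE_LABELS = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "60+"]

def add_age_group(frame):
    """Return a copy of frame with an 'ageGroup' column."""
    out = frame.copy()
    out["ageGroup"] = pd.cut(
        out["Age"], bins=AGE_EDGES, labels=AGE_LABELS, include_lowest=True
    )
    return out

# Made-up sample, not the real data.
sample = pd.DataFrame({
    "Age": [0.42, 10, 10.5, 28, 60, 80],
    "Survived": [1, 0, 1, 1, 0, 1],
})
tagged = add_age_group(sample)

# Survivors per age group; observed=False keeps empty groups in the result.
survivors = tagged.groupby("ageGroup", observed=False)["Survived"].sum()
print(survivors)
```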


## Now that you have some basic information about the passengers, conduct additional analysis to identify which passengers were most likely to survive. 
## You should assess all the variables provided and anchor on the Survived variable. As a hint, you should also combine variables. 
## For example, were all women equally as likely to survive? Or only those with larger families or those who were younger? (5 points)

In [None]:
#df[(df['Sex']=='female') & (df['Survived']==1) & (df['Family'] == True)].count()
#df[(df['Sex']=='female') & (df['Survived']==1) & (df['Family'] == False)].count()
#df[(df['Sex']=='male') & (df['Survived']==1) & (df['Family'] == True)].count()
#df[(df['Sex']=='male') & (df['Survived']==1) & (df['Family'] == False)].count()

In [157]:
df.groupby(['Survived','Sex','Family'])['Family'].count().to_frame()

# 172 out of the 216 female passengers who travelled alone or in pairs survived (about 79.6%)
# 25 out of the 45 female passengers who travelled with a family of 3 or more survived (about 55.6%)
# 87 out of the 420 male passengers who travelled alone or in pairs survived (about 20.7%)
# 6 out of the 33 male passengers who travelled with a family of 3 or more survived (about 18.2%)

# From this analysis, it appears that passengers who travelled alone or in pairs
# had a better chance of survival than those in families of 3 or more.
# The difference is much larger for women than for men.

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Family
Survived,Sex,Family,Unnamed: 3_level_1
0,female,False,44
0,female,True,20
0,male,False,333
0,male,True,27
1,female,False,172
1,female,True,25
1,male,False,87
1,male,True,6


In [142]:
print(df['Age'].mean())
df['meanAge'] = df['Age'].mean()
df['Younger'] = df['Age'] < df['meanAge']

29.69911764705882


In [155]:
# 92 of the female passengers who survived were at or above the mean age
# 105 of the female passengers who survived were younger than the mean age of passengers
# 42 of the male passengers who survived were at or above the mean age
# 51 of the male passengers who survived were younger than the mean age


df.groupby(['Survived','Sex','Younger'])['Younger'].count().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Younger
Survived,Sex,Younger,Unnamed: 3_level_1
0,female,False,22
0,female,True,42
0,male,False,174
0,male,True,186
1,female,False,92
1,female,True,105
1,male,False,42
1,male,True,51


In [144]:
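# The analysis shows that the effect of age differs by sex.
# Male passengers younger than the mean age had a slightly higher survival rate
# (51 of 237, about 21.5%) than the other male passengers (42 of 216, about 19.4%).
# Female passengers younger than the mean age had a lower survival rate
# (105 of 147, about 71.4%) than the other female passengers (92 of 114, about 80.7%).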

In [154]:
df.groupby(['Survived','Pclass','Embarked'])['Embarked'].count().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Embarked
Survived,Pclass,Embarked,Unnamed: 3_level_1
0,1,C,21
0,1,Q,1
0,1,S,42
0,2,C,7
0,2,Q,1
0,2,S,82
0,3,C,23
0,3,Q,18
0,3,S,229
1,1,C,53


In [160]:
# From this analysis, we can deduce the following:
# Those who embarked at Q in 1st or 2nd class had a 50% chance of survival.
# Those who embarked at Q in 3rd class had about a one-in-three chance of survival.

# Those who embarked at C in 1st class were more than twice as likely to survive as not.
# Those who embarked at C in 2nd class had about a 50% chance of survival.
# Those who embarked at C in 3rd class had a lower than 50% chance of survival.

# Those who embarked at S in 1st class had more than a 50% chance of survival.
# Those who embarked at S in 2nd class had a less than 50% chance of survival.
# Those who embarked at S in 3rd class had a 21% chance of survival.

# Generally, first class passengers had the highest chance of survival (across all ports).
# Third class passengers had the lowest chance of survival (particularly
# third class passengers who embarked from port S).
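
Survival rates for every class and port combination can be read more easily from a two-way table than from grouped counts. The sketch below uses `pivot_table` on a small made-up sample (not the real data); combinations with no passengers show up as NaN.

```python
import pandas as pd

# Made-up sample, not the real data.
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Embarked": ["C", "C", "S", "S", "S", "Q", "S"],
    "Survived": [1, 1, 0, 0, 1, 0, 0],
})

# Rows are classes, columns are ports, cells are the share that survived.
rates = sample.pivot_table(
    index="Pclass", columns="Embarked", values="Survived", aggfunc="mean"
)
print(rates)
```

The same call on `df` would put all nine class and port rates in one table, which makes statements like the ones above straightforward to check.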

## How did you approach your analysis using Python? Which functionalities did you use? Why?