<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" />

# Lab: Titanic EDA

---
For this lab, we're going to take a look at the Titanic manifest. We'll be exploring this data to see what we can learn regarding the survival rates of different groups of people.

## Step 1: Reading the data

1. Read the Titanic data (the `train.csv` file in this repo) using the appropriate pandas method.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

In [None]:
train = pd.read_csv("train.csv")

In [None]:
train.columns

In [None]:
train.isnull().sum()

### Data Dictionary

| Variable | Description | Details |
|----------|-------------|---------|
| survival | Survival | 0 = No; 1 = Yes |
| pclass | Passenger Class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| name | First and Last Name | |
| sex | Sex | |
| age | Age | |
| sibsp | Number of Siblings/Spouses Aboard | |
| parch | Number of Parents/Children Aboard | |
| ticket | Ticket Number | |
| fare | Passenger Fare | |
| cabin | Cabin | |
| embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |

## Step 2: Cleaning the data
####  1. Create a bar chart showing how many missing values are in each column

In [None]:
msno.bar(train) # bar height = non-null count per column, so shorter bars mean more missing values

####  2. Which column has the most `NaN` values? How many cells in that column are empty?


In [None]:
train.isnull().sum() # The 'Cabin' column has the most NaNs = 687.
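The same question can also be answered with pandas alone. Below is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; values are made up
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Count NaNs per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Column with the most missing values and how many of its cells are empty
worst_column = missing_counts.idxmax()
worst_count = int(missing_counts.max())
print(worst_column, worst_count)
```

Calling `missing_counts.plot(kind='bar')` on the result would draw the missing-value counts directly, as an alternative to `msno.bar`.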

####  3. Delete all rows where `Embarked` is empty

In [None]:
train = train.dropna(subset=['Embarked']).reset_index(drop=True)

In [None]:
train['Embarked'].isnull().sum()

#### 4. Fill all empty cabins with **¯\\_(ツ)_/¯**

Note: `NaN`, empty, and missing are synonymous.

In [None]:
train['Cabin'] = train['Cabin'].fillna("¯\\_(ツ)_/¯")

In [None]:
train['Cabin'].isnull().sum() #check

## Step 3: Feature extraction

#### 1.  There are two columns that pertain to how many family members are on the boat for a given person. Create a new column called `FamilyCount` which will be the sum of those two columns.

In [None]:
train['FamilyCount'] = train['SibSp'] + train['Parch']

In [None]:
train.info()

#### 2. Reverends have a special title in their name. Create a column called `IsReverend`: 1 if they're a preacher, 0 if they're not.


In [None]:
#train['Name'].unique()

In [None]:
train['IsReverend'] = train['Name'].apply(lambda x: 1 if 'Rev.' in x else 0)
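A vectorized alternative to `apply` is `Series.str.contains`. One caveat: by default it treats the pattern as a regular expression, where `.` matches any character. The sketch below uses hypothetical names (not from the dataset) to show the difference.

```python
import pandas as pd

# Hypothetical names for illustration only
names = pd.Series([
    "Doe, Rev. John",
    "Revell, Mr. Sam",   # contains "Reve", which the regex "Rev." also matches
    "Smith, Mrs. Anna",
])

# regex=False treats "Rev." as literal text, so the dot must really be a dot
is_reverend = names.str.contains("Rev.", regex=False).astype(int)

# With the default regex=True, "." matches any single character
loose_match = names.str.contains("Rev.").astype(int)

print(is_reverend.tolist())
print(loose_match.tolist())
```

The literal match flags only the first name, while the regex version also flags the surname "Revell".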

#### 3. In order to feed our training data into a classification algorithm, we need to convert our categories into 1's and 0's using `pd.get_dummies`.

  - Familiarize yourself with the [**`pd.get_dummies` documentation**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
  - Create 3 columns: `Embarked_C`, `Embarked_Q` and `Embarked_S`. These columns will have 1's and 0's that correspond to the `C`, `Q` and `S` values in the `Embarked` column
  - Do the same thing for `Sex`
  - BONUS (required): Extract the title from everyone's name and create dummy columns

In [None]:
train['Embarked'].unique()

In [None]:
# Create integer (0/1) dummy columns, one per port of embarkation
embarked_dummies = pd.get_dummies(train['Embarked'], prefix='Embarked', dtype=int)
# Append the dummy columns to the original DataFrame
train = pd.concat([train, embarked_dummies], axis=1)

In [None]:
train.head()

In [None]:
# Extract titles from the Name column
train['Title'] = train['Name'].str.extract('([A-Za-z]+)\\.', expand=False)
 
# Create dummy variables for Sex
sex_dummies = pd.get_dummies(train['Sex'], prefix='Sex', dtype=int)

# Create dummy variables for Title
title_dummies = pd.get_dummies(train['Title'], prefix='Title', dtype=int)

# Merge the dummy variables with the original DataFrame
train = pd.concat([train, sex_dummies, title_dummies], axis=1)

In [None]:
train.columns
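`pd.get_dummies` can also encode several columns in one call through its `columns=` parameter. This is a small sketch on made-up rows; only the column names mirror the lab's data.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the lab's data
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.75, 8.05],
})

# columns= picks which columns to encode; each is replaced by prefixed
# 0/1 columns and every other column is left untouched
encoded = pd.get_dummies(toy, columns=["Embarked", "Sex"], dtype=int)
print(encoded.columns.tolist())
```

Unlike the `pd.concat` approach above, this form drops the original `Embarked` and `Sex` columns from the result, so use it only when the raw categories are no longer needed.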

## Step 4: Exploratory analysis

_[`df.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) may be very useful._

#### 1. What was the survival rate overall?


In [None]:
train.shape

In [None]:
train['Survived'].value_counts() # (340/889)*100

In [None]:
survival_rate = train['Survived'].mean() * 100 # mean of a 0/1 column = share of passengers who survived (340/889, about 38.2%)
print(survival_rate)

#### 2. Which gender fared the worst? What was their survival rate?

In [None]:
#train['Sex'].value_counts() #577 males, 312 females in total??
total_survivors = train[train['Survived']==1].shape[0]

In [None]:
#train = train.loc[:,~train.columns.duplicated()]

In [None]:
female_survivors = train[(train['Sex'] == 'female') & (train['Survived'] == 1)].shape[0] #output 231 females survived

In [None]:
male_survivors = train[(train['Sex'] == 'male') & (train['Survived'] == 1)].shape[0] #output 109 males survived

In [None]:
female_survival_rate = (female_survivors / (train['Sex'] == 'female').sum()) * 100 # % of all females who survived

In [None]:
male_survival_rate = (male_survivors / (train['Sex'] == 'male').sum()) * 100 # % of all males who survived

In [None]:
print(f"The males fared the worst. Only {male_survival_rate:.2f}% of males survived.")
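A survival rate divides a group's survivors by the size of that group. Dividing by the total number of survivors answers a different question: what share of survivors the group makes up. The sketch below contrasts the two on hypothetical data.

```python
import pandas as pd

# Hypothetical passengers: 4 men (1 survived) and 2 women (both survived)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Survival rate: survivors in a group / everyone in that group.
# The mean of a 0/1 column is exactly that fraction.
rate_by_sex = toy.groupby("Sex")["Survived"].mean() * 100

# Share of survivors: survivors in a group / all survivors
share_of_survivors = (
    toy.groupby("Sex")["Survived"].sum() / toy["Survived"].sum() * 100
)

print(rate_by_sex)
print(share_of_survivors)
```

The same `groupby(...)['Survived'].mean()` pattern applies to any grouping column used in this step, such as `Pclass` or `Embarked`.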

#### 3. What was the survival rate for each `Pclass`?

In [None]:
get_survivors = train[['Survived', 'Pclass']].groupby(['Pclass']).sum()
get_survivors

In [None]:
# Survival rate per class = survivors in the class / all passengers in that class
# (survivor counts from the cell above: 1st = 134, 2nd = 87, 3rd = 119)
pclass_survival_rate = train.groupby('Pclass')['Survived'].mean() * 100
pclass_survival_rate.round(1)

#### 4. Did any reverends survive? How many?

In [None]:
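# regex=False matches the literal text 'Rev.' (otherwise '.' matches any character)
reverends = train[train['Name'].str.contains('Rev.', regex=False)]

reverends_survived = reverends[reverends['Survived'] == 1]

# Get the number of reverends who survived
num_reverends_survived = reverends_survived.shape[0]
num_reverends_survived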

#### 5. What is the survival rate for cabins marked **¯\\_(ツ)_/¯**

In [None]:
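# All passengers whose 'Cabin' was filled with '¯\_(ツ)_/¯'
unknown_cabins = train[train['Cabin'] == r"¯\_(ツ)_/¯"]

# Survivors among those passengers
marked_cabins = unknown_cabins[unknown_cabins['Survived'] == 1]

# Survival rate = survivors in the group / everyone in the group
survival_rate_marked_cabins = marked_cabins.shape[0] / unknown_cabins.shape[0] * 100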

In [None]:
marked_cabins.shape[0]

In [None]:
print(f"The survival rate for marked cabins is {survival_rate_marked_cabins:.2f}%.")

In [None]:
# Sanity check: passenger counts for each (Survived, Cabin) pair
#train[['Survived', 'Cabin']].value_counts() 
#print("(206/687)*100 = 30.0%")  # 206 survivors among the 687 passengers with no recorded cabin

#### 6. What is the survival rate for people whose `Age` is empty?

In [None]:
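nan_age_passengers = train[train['Age'].isna()] # passengers with no recorded age
survival_rate_nan_age = nan_age_passengers['Survived'].mean() * 100 # % of those passengers who survived
print(f"The survival rate for passengers with no recorded age is {survival_rate_nan_age:.2f}%.")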

####  7. What is the survival rate for each port of embarkation?

In [None]:
#Histogram
axes = train.hist('Survived', by='Embarked', layout=[1,3], figsize =[7,2])
# Add labels on top of the bars
for ax in axes.flatten():
    for patch in ax.patches:
        ax.annotate(f'{int(patch.get_height())}', 
                    (patch.get_x() + patch.get_width() / 2, patch.get_height()), 
                    ha='center', va='bottom');

# Survival rate per port = survivors who boarded there / all passengers who boarded there
# (survivor counts read off the histogram: C = 93, Q = 30, S = 217)
embarked_survival_rate = train.groupby('Embarked')['Survived'].mean() * 100
print(embarked_survival_rate.round(2))

#### 8. What is the survival rate for children (under 12) in each `Pclass`?

In [None]:
train[train['Age'] <12].shape  #total children under 12 = 68 kids.

In [None]:
train[(train['Age'] <12) & (train['Survived']==1)] # =39 kids

In [None]:
train[(train['Age'] <12) & (train['Survived']==1)].groupby('Pclass').size() # surviving children per class

In [None]:
# Survival rate = surviving children / all children under 12, computed within each Pclass
(train[train['Age'] < 12].groupby('Pclass')['Survived'].mean() * 100).round(1)

####  9. Did the captain of the ship survive? Is he on the list?

In [None]:
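captain = train[train['Name'].str.contains('Capt.', regex=False)] # literal match on the title 'Capt.'
# The ship's captain, Edward John Smith, did not survive and is not in this passenger data.
print("The only 'Capt.' on the list is the passenger Crosby, Capt. Edward Gifford, who did not survive.")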

#### 10. Of all the people that died, who had the most expensive ticket? How much did it cost?

In [None]:
train[train['Survived'] == 0].describe() # Fare max = 263 (fares are recorded in pounds sterling)

In [None]:
train[(train['Survived'] == 0) & (train['Fare']==263)] # 2 people with the highest fare. Name: Fortune, Mr. Charles Alexander and Fortune, Mr. Mark

In [None]:
deaths = train[train['Survived'] == 0] #df filtered just the deaths i.e. survived =0 

In [None]:
deaths['Fare'].max()

#### 11. Does having family on the boat help or hurt your chances of survival?

In [None]:
train[['FamilyCount', 'Survived']].value_counts() # Raw counts: 374 solo travelers (FamilyCount = 0) died and 161 survived.
# Solo travelers are the largest group among both the dead and the survivors, so counts alone are misleading.
# Compare the survival rate within each FamilyCount group instead.
family_survival_rate = train.groupby('FamilyCount')['Survived'].mean() * 100
singles_survived = family_survival_rate.loc[0]
print(f"The survival rate for passengers traveling alone is {singles_survived:.2f}%.")
print(family_survival_rate.round(2))
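When comparing groups of very different sizes, it helps to report the group size next to the rate. The sketch below uses named aggregation on hypothetical data; `FamilyCount` mirrors the column built in Step 3, but the values are made up.

```python
import pandas as pd

# Hypothetical data; FamilyCount mirrors the column built in Step 3
toy = pd.DataFrame({
    "FamilyCount": [0, 0, 0, 0, 1, 1, 2, 2],
    "Survived":    [0, 0, 0, 1, 1, 0, 1, 1],
})

# One row per family size, with the sample size shown next to the rate
summary = toy.groupby("FamilyCount")["Survived"].agg(
    passengers="size",   # number of passengers in the group
    survivors="sum",     # number with Survived == 1
    rate="mean",         # survivors / passengers
)
print(summary)
```

A rate computed from only a handful of passengers (large families, for example) is far less reliable than one computed from hundreds, which the `passengers` column makes visible.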

## Step 5: Plotting
Using Matplotlib and Seaborn, create multiple charts showing the survival rates of different groups of people. It's fine if a handful of charts are basic (Gender, Age, etc), but what we're really looking for is something beneath the surface.


In [None]:
train['Sex'].value_counts().plot(kind='bar');

In [None]:
#Survival rate for Males and Females

fig, axs = plt.subplots(1,2)
train[train['Sex'] == 'female'].Survived.value_counts().plot(kind='barh', ax=axs[0], title="Females Survived (Survived=1)")
train[train['Sex'] == 'male'].Survived.value_counts().plot(kind='barh', ax=axs[1], title="Male Survived (Survived=1)");

In [None]:
# Countplot for Males and Females survived
sns.catplot(x ="Sex", hue ="Survived",  
kind ="count", data = train);

In [None]:
#Survival rate for Children <12 Years Old
train[train['Age'] <12].Survived.value_counts().plot(kind='barh', title="Children <12 Survived (survived=1)");

In [None]:
#count plot for embarked place, class and survival rate. Seems like 3rd class had the worst survival, who mostly embarked from S.
 
sns.catplot(x ='Embarked', hue ='Survived',  
kind ='count', col ='Pclass', data = train); 


In [None]:
#Distribution of Age
sns.displot(train['Age']);

In [None]:
# Survivor counts by age: most survivors were children under 8 or adults in their 20s to 50s (counts, not rates).
# Bin the ages into 10 bins
train['AgeBin'] = pd.cut(train['Age'], bins=10)

# Count the number of survivors for each age bin
age_bin_survivor_count = train[train['Survived'] == 1].groupby('AgeBin')['Survived'].count()

# Plot the bar chart
plt.figure(figsize=(12, 6))
age_bin_survivor_count.plot(kind='bar', color='blue')
plt.title('Survivor Count by Age Range on the Titanic', fontsize=15)
plt.xlabel('Age Range', fontsize=12)
plt.ylabel('Survivor Count', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.show();

In [None]:
#survival rate for family

# Filter the DataFrame for survivors
survivors_train = train[train['Survived'] == 1]

#  histogram for FamilyCount
plt.figure(figsize=(10, 6))
plt.hist(survivors_train['FamilyCount'], bins=range(survivors_train['FamilyCount'].max() + 2), color='blue', edgecolor='black')
plt.title('Histogram of Family Count for Survived Passengers on the Titanic', fontsize=15)
plt.xlabel('Number of Family Members', fontsize=12)
plt.ylabel('Count of Survivors', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show();

In [None]:
# Filter the DataFrame for passengers who did not survive
non_survivors_train = train[train['Survived'] == 0]

#  histogram for FamilyCount
plt.figure(figsize=(10, 6))
plt.hist(non_survivors_train['FamilyCount'], bins=range(non_survivors_train['FamilyCount'].max() + 2), color='blue', edgecolor='black')
plt.title('Histogram of Family Count for Passengers on the Titanic Who Did Not Survive', fontsize=15)
plt.xlabel('Number of Family Members', fontsize=12)
plt.ylabel('Count of Non-Survivors', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show();

In [None]:
#survival rate for fare : seems like higher fares have higher survival rates.

# Divide Fare into 4 quantile bins (quartiles), each holding roughly a quarter of the passengers
train['Fare_Range'] = pd.qcut(train['Fare'], 4)
sns.barplot(x ='Fare_Range', y ='Survived',  
data = train);
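The numbers behind a bar plot like this can be checked with `pd.qcut` followed by a grouped mean. The sketch below uses hypothetical fares, and the `Q1` to `Q4` labels are an assumption added for readability.

```python
import pandas as pd

# Hypothetical fares and outcomes, for illustration only
toy = pd.DataFrame({
    "Fare":     [5, 7, 8, 10, 15, 20, 30, 50],
    "Survived": [0, 0, 0, 1, 1, 0, 1, 1],
})

# qcut cuts at quantiles, so each bin holds about the same number of rows
toy["FareQuartile"] = pd.qcut(toy["Fare"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Survival rate within each fare quartile
rate_by_quartile = toy.groupby("FareQuartile", observed=True)["Survived"].mean()
print(rate_by_quartile)
```

`pd.cut`, used for `AgeBin` above, makes equal-width bins instead. That can leave some bins nearly empty when the data is skewed, as fares typically are.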