# Titanic Dataset - Exploratory Data Analysis

## AI/ML Fellowship Week 2 Task

**Objective**: Perform exploratory data analysis on the Titanic dataset to understand passenger demographics, survival patterns, and how features such as sex, class, age, fare, and family size relate to survival.

### Dataset Overview
- Contains one row per passenger for a sample of 891 passengers aboard the Titanic (15 columns in the seaborn copy used here)
- Features include age, sex, passenger class, fare, family counts (`sibsp`, `parch`), embarkation port, and deck
- Target variable: `survived` (binary: 0 = did not survive, 1 = survived)

## Step 1: Import Libraries and Load Data

First, let's import the necessary libraries and load the Titanic dataset.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

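# Set style for plots. Apply the matplotlib style sheet first: calling it
# after sns.set_style would override the whitegrid settings.
plt.style.use('seaborn-v0_8')  # style name available in matplotlib >= 3.6
sns.set_style("whitegrid")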

print("Libraries imported successfully!")

In [None]:
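# Load the Titanic dataset
# sns.load_dataset fetches the data from seaborn's online data repository
# (cached locally afterwards), so the first run needs internet access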
titanic = sns.load_dataset('titanic')

print(f"Dataset shape: {titanic.shape}")
print(f"Columns: {list(titanic.columns)}")

## Step 2: Initial Data Exploration

Let's take a look at the first few rows and get basic information about the dataset.

In [None]:
# Display first few rows
titanic.head()

In [None]:
# Basic information about the dataset
titanic.info()

In [None]:
# Statistical summary
titanic.describe()

In [None]:
# Check for missing values
print("Missing values per column:")
print(titanic.isnull().sum())

# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=True, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
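
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the Titanic data) to show that `isnull().mean()` gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, None, "C", None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# The mean of a boolean mask is the fraction of True values,
# so this is the share of missing entries in each column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `titanic`, the same expression would make it clear which columns are only slightly incomplete and which are mostly empty.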

## Step 3: Data Cleaning and Preprocessing

Address missing values and prepare the data for analysis.

In [None]:
# Check unique values in categorical columns
categorical_cols = ['sex', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']

for col in categorical_cols:
    print(f'{col}: {titanic[col].unique()}')

In [None]:
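# Handle missing values
# Age: fill with the median age. Assign the result back instead of calling
# fillna(inplace=True) on a column selection, which newer pandas warns about
median_age = titanic['age'].median()
titanic['age'] = titanic['age'].fillna(median_age)

# Embarked: fill with the most common value
most_common_embarked = titanic['embarked'].mode()[0]
titanic['embarked'] = titanic['embarked'].fillna(most_common_embarked)

# Embark town: mirrors 'embarked', so fill it the same way
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])

# Deck: mostly missing, so keep the column and label the gaps. It is a
# categorical column, so the new label must be added as a category first
titanic['deck'] = titanic['deck'].cat.add_categories('Unknown').fillna('Unknown')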

# Check if there are still missing values
print("Missing values after cleaning:")
print(titanic.isnull().sum())
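
In seaborn's copy of the dataset, `deck` is stored as a pandas categorical, and a categorical only accepts values from its declared categories. The sketch below, on made-up data (`toy` is hypothetical), shows one way to add a placeholder label as a category before filling:

```python
import pandas as pd

# Hypothetical categorical column with gaps
toy = pd.DataFrame({
    "deck": pd.Categorical(["A", None, "C", None], categories=["A", "B", "C"])
})

# Filling with a label that is not a declared category is rejected by pandas
# (the exception type has varied between versions, so both are caught here)
try:
    toy["deck"].fillna("Unknown")
    direct_fill_failed = False
except (TypeError, ValueError):
    direct_fill_failed = True

# Declaring the category first makes the fill valid
filled = toy["deck"].cat.add_categories("Unknown").fillna("Unknown")
print(direct_fill_failed)
print(filled.tolist())
```

An alternative would be to convert the column to plain strings with `astype(object)` before filling, at the cost of losing the category ordering.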

## Step 4: Exploratory Data Analysis

Analyze relationships between variables and survival rates.

In [None]:
# Overall survival rate
survival_rate = titanic['survived'].mean()
print(f"Overall survival rate: {survival_rate:.2%}")

# Survival rate by gender
gender_survival = titanic.groupby('sex')['survived'].mean()
print("\nSurvival rate by gender:")
print(gender_survival)
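
Because `survived` is coded 0/1, its mean within a group is that group's survival rate. A small sketch on made-up rows (not real passengers) shows the rate next to the counts it is based on, which helps when judging how reliable a rate is:

```python
import pandas as pd

# Hypothetical toy data: two women, three men
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "survived": [1, 1, 0, 1, 0],
})

# mean of a 0/1 column = survivors / group size
summary = toy.groupby("sex")["survived"].agg(rate="mean", survivors="sum", n="count")
print(summary)
```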

In [None]:
# Visualize survival by gender
plt.figure(figsize=(8, 5))
sns.barplot(data=titanic, x='sex', y='survived')
plt.title('Survival Rate by Gender')
plt.ylabel('Survival Rate')
plt.show()

In [None]:
# Survival rate by passenger class
class_survival = titanic.groupby('class', observed=True)['survived'].mean()
print("Survival rate by class:")
print(class_survival)

# Visualize survival by class
plt.figure(figsize=(8, 5))
sns.barplot(data=titanic, x='class', y='survived')
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.show()

In [None]:
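# Survival rate by age groups
# Bins are right-inclusive: (0, 12], (12, 18], (18, 35], (35, 60], (60, 100]
# Note: ages imputed with the median in Step 3 all fall into the 'Adult' bin
titanic['age_group'] = pd.cut(
    titanic['age'],
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teen', 'Adult', 'Middle-Aged', 'Senior'],
)

age_survival = titanic.groupby('age_group', observed=True)['survived'].mean()
print("Survival rate by age group:")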
print(age_survival)

# Visualize survival by age group
plt.figure(figsize=(10, 6))
sns.barplot(data=titanic, x='age_group', y='survived')
plt.title('Survival Rate by Age Group')
plt.ylabel('Survival Rate')
plt.xticks(rotation=45)
plt.show()
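
The bin edges passed to `pd.cut` decide where boundary ages land. The sketch below runs a few hand-picked ages (illustrative values, not dataset rows) through the same bins and labels to show the default right-inclusive behaviour:

```python
import pandas as pd

# Hand-picked ages that sit on or near the bin edges
ages = pd.Series([0, 0.42, 12, 12.5, 18, 35, 60, 80])

groups = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teen', 'Adult', 'Middle-Aged', 'Senior'],
)

# Each interval excludes its left edge and includes its right edge,
# so an age of exactly 0 gets no label while 12 is still a 'Child'
print(pd.DataFrame({"age": ages, "age_group": groups}))
```

Passing `include_lowest=True` would bring an age of exactly 0 into the first bin, should that ever be needed.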

In [None]:
# Survival rate by embarkation port
embark_survival = titanic.groupby('embarked')['survived'].mean()
print("Survival rate by embarkation port:")
print(embark_survival)

# Visualize survival by embarkation port
plt.figure(figsize=(8, 5))
sns.barplot(data=titanic, x='embarked', y='survived')
plt.title('Survival Rate by Embarkation Port')
plt.ylabel('Survival Rate')
plt.show()

In [None]:
# Correlation matrix for numerical features
numerical_features = titanic.select_dtypes(include=[np.number])
correlation_matrix = numerical_features.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, square=True)
plt.title('Correlation Matrix of Numerical Features')
plt.show()
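
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough illustration on made-up numbers (the `toy` frame is hypothetical), the value for one pair can be reproduced from the definition:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a binary outcome and a numeric feature
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1, 0],
    "fare": [7.0, 8.0, 30.0, 80.0, 26.0, 13.0],
})

x, y = toy["fare"], toy["survived"]

# Pearson r = covariance of x and y divided by the product of their spreads
dx, dy = x - x.mean(), y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = toy.corr().loc["fare", "survived"]
print(manual_r, pandas_r)
```

With a 0/1 column such as `survived`, this coefficient only captures a linear association, so a weak value does not rule out a non-linear relationship.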

In [None]:
# Distribution of age by survival status
plt.figure(figsize=(10, 6))
sns.histplot(data=titanic, x='age', hue='survived', multiple='stack', bins=30)
plt.title('Distribution of Age by Survival Status')
plt.show()
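
Because missing ages were filled with the median in Step 3, the histogram can be expected to show a spike at that value. The sketch below uses made-up ages (not dataset rows) to show how median imputation piles values onto one point and shrinks the spread:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three gaps
ages = pd.Series([4.0, 22.0, np.nan, 28.0, np.nan, 35.0, np.nan, 61.0])

imputed = ages.fillna(ages.median())

# The median is unchanged, but the standard deviation drops because
# every imputed value sits exactly at the centre of the distribution
print(ages.median(), imputed.median())
print(ages.std(), imputed.std())
print((imputed == ages.median()).sum())
```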

In [None]:
# Fare distribution by survival status
plt.figure(figsize=(10, 6))
sns.boxplot(data=titanic, x='survived', y='fare')
plt.title('Fare Distribution by Survival Status')
plt.ylabel('Fare')
plt.xlabel('Survived (0=No, 1=Yes)')
plt.show()
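
Points drawn beyond the whiskers of a box plot follow, by default, the 1.5 × IQR rule. A minimal sketch with made-up fares (illustrative values only) computes those limits by hand:

```python
import numpy as np

# Hypothetical fares with one extreme value
fares = np.array([7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 80.0, 512.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1

# Values outside these limits are drawn as individual outlier points
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]
print(q1, q3, lower, upper, outliers)
```

Fares are strongly right-skewed, so plotting them on a log scale is one option when the outliers compress the boxes.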

In [None]:
# Survival by family size (combining sibsp and parch)
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

family_survival = titanic.groupby('family_size')['survived'].mean()
print("Survival rate by family size:")
print(family_survival)

# Visualize survival by family size
plt.figure(figsize=(12, 6))
sns.barplot(data=titanic, x='family_size', y='survived')
plt.title('Survival Rate by Family Size')
plt.ylabel('Survival Rate')
plt.xlabel('Family Size')
plt.show()

## Step 5: Insights and Conclusions

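Before moving on, note the main patterns from Step 4: survival differs sharply by sex and by passenger class, and also varies with age group, fare, and family size. The exact rates are printed in Step 7.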

## Step 6: Advanced Analysis

Next, let's combine variables: cross-tabulate sex and class against survival, compare age distributions, and look at the effect of traveling alone and of deck.

In [None]:
# Cross-tabulation of sex and class with survival
sex_class_survival = pd.crosstab([titanic['sex'], titanic['class']], titanic['survived'], margins=True)
print("Cross-tabulation of Sex, Class and Survival:")
print(sex_class_survival)

# Survival rates by sex and class, with group sizes
survival_by_sex_class = titanic.groupby(['sex', 'class'], observed=True)['survived'].agg(['mean', 'count']).round(3)
print("\nSurvival Rates by Sex and Class:")
print(survival_by_sex_class)
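
The count table and the rate table above carry the same information. As a sketch on made-up rows (the `toy` frame is hypothetical), `pd.crosstab` with `normalize="index"` turns counts into row-wise proportions, and the column for `survived == 1` matches the grouped mean:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "class": ["First", "Third", "Third", "First", "Third", "Third"],
    "survived": [1, 1, 0, 1, 0, 0],
})

# Each row sums to 1; column 1 is the survival rate of that sex/class group
rates = pd.crosstab([toy["sex"], toy["class"]], toy["survived"], normalize="index")

# The same rates via groupby, for comparison
grouped = toy.groupby(["sex", "class"])["survived"].mean()
print(rates)
print(grouped)
```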

In [None]:
# Visualize survival rates by sex and class
plt.figure(figsize=(10, 6))
sns.barplot(data=titanic, x='class', y='survived', hue='sex')
plt.title('Survival Rate by Class and Gender')
plt.ylabel('Survival Rate')
plt.legend(title='Gender')
plt.show()

In [None]:
# Age distributions: overall, survivors, non-survivors, and overlaid densities
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Overall age distribution
axes[0,0].hist(titanic['age'], bins=30, edgecolor='black', alpha=0.7)
axes[0,0].set_title('Overall Age Distribution')
axes[0,0].set_xlabel('Age')
axes[0,0].set_ylabel('Frequency')

# Age distribution for survivors
axes[0,1].hist(titanic[titanic['survived']==1]['age'], bins=30, edgecolor='black', alpha=0.7, color='green')
axes[0,1].set_title('Age Distribution - Survivors')
axes[0,1].set_xlabel('Age')
axes[0,1].set_ylabel('Frequency')

# Age distribution for non-survivors
axes[1,0].hist(titanic[titanic['survived']==0]['age'], bins=30, edgecolor='black', alpha=0.7, color='red')
axes[1,0].set_title('Age Distribution - Non-Survivors')
axes[1,0].set_xlabel('Age')
axes[1,0].set_ylabel('Frequency')

# Combined age distribution by survival
sns.kdeplot(data=titanic, x='age', hue='survived', ax=axes[1,1], fill=True, alpha=0.5)
axes[1,1].set_title('Age Distribution by Survival Status')
axes[1,1].set_xlabel('Age')
axes[1,1].set_ylabel('Density')

plt.tight_layout()
plt.show()

In [None]:
# Impact of having family members on board
titanic['is_alone'] = (titanic['family_size'] == 1).astype(int)

plt.figure(figsize=(8, 5))
sns.barplot(data=titanic, x='is_alone', y='survived')
plt.title('Survival Rate: Alone vs With Family')
plt.ylabel('Survival Rate')
plt.xlabel('Travel Status')
plt.xticks([0, 1], ['With Family', 'Alone'])
plt.show()

print(f"Survival rate for passengers traveling alone: {titanic[titanic['is_alone']==1]['survived'].mean():.2%}")
print(f"Survival rate for passengers with family: {titanic[titanic['is_alone']==0]['survived'].mean():.2%}")
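
The two derived columns used here are simple arithmetic on `sibsp` and `parch`. A worked example on made-up passengers (hypothetical rows) makes the definitions concrete:

```python
import pandas as pd

# Hypothetical passengers: siblings/spouses and parents/children aboard
toy = pd.DataFrame({
    "sibsp": [1, 0, 0, 3],
    "parch": [0, 0, 2, 1],
    "survived": [1, 0, 1, 0],
})

# Family size counts the passenger plus all relatives aboard
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# A passenger is alone when the family consists of just themselves
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

# Survival rate and group size for each value of the flag
alone_summary = toy.groupby("is_alone")["survived"].agg(rate="mean", n="count")
print(toy)
print(alone_summary)
```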

In [None]:
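# Deck analysis ('Unknown' marks passengers with no recorded deck)
deck_survival = titanic.groupby('deck', observed=True)['survived'].agg(['mean', 'count']).round(3)
print("Survival Rates by Deck:")
print(deck_survival)

# Visualize deck survival rates for passengers with a recorded deck
# (pass an explicit order so the empty 'Unknown' category is not drawn)
deck_known = titanic[titanic['deck'] != 'Unknown']
plt.figure(figsize=(10, 6))
sns.barplot(data=deck_known, x='deck', y='survived', order=sorted(deck_known['deck'].dropna().unique()))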
plt.title('Survival Rate by Deck')
plt.ylabel('Survival Rate')
plt.xlabel('Deck')
plt.show()

## Step 7: Final Insights and Conclusions

Summarize the key findings from our analysis.

In [None]:
# Summary of key findings
print("Key Findings from Titanic EDA:\n")
print(f"1. Overall survival rate: {survival_rate:.2%}")
print(f"2. Female survival rate: {titanic[titanic['sex'] == 'female']['survived'].mean():.2%}")
print(f"3. Male survival rate: {titanic[titanic['sex'] == 'male']['survived'].mean():.2%}")
print(f"4. First class survival rate: {titanic[titanic['class'] == 'First']['survived'].mean():.2%}")
print(f"5. Third class survival rate: {titanic[titanic['class'] == 'Third']['survived'].mean():.2%}")

print("\nInsights:")
print("- Women had much higher survival rates than men, consistent with a 'women and children first' evacuation")
print("- Higher passenger class was associated with higher survival rates")
print("- Age group and family size also appear to be related to survival")
print("- Survivors paid higher fares on average, which largely reflects passenger class")
print("- Passengers traveling alone had lower survival rates than those with family")
print("- Deck may have influenced survival, but most deck values are missing, so this is inconclusive")