<a href="https://colab.research.google.com/github/yourusername/CMSC173/blob/main/tex/eda_lecture/eda_companion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis (EDA) - Companion Notebook

**Course**: CMSC 173 - Data Science for Computer Scientists  
**Topic**: Comprehensive EDA with the Titanic Dataset  
**Companion to**: EDA Lecture Slides

---

This notebook provides a hands-on implementation of the concepts covered in the EDA lecture slides. We'll work through a complete data analysis pipeline using the famous Titanic dataset.

## 📚 Setup and Imports

First, let's import all necessary libraries for our EDA journey.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")

## 🚢 Data Loading and Initial Exploration

Let's load the Titanic dataset and perform our initial exploration as discussed in **Slide 2** of the lecture.

In [None]:
# Load the Titanic dataset
# Using seaborn's built-in dataset for convenience
titanic = sns.load_dataset('titanic')

print("📊 Dataset loaded successfully!")
print(f"Dataset shape: {titanic.shape}")
print(f"Columns: {list(titanic.columns)}")

In [None]:
# First look at the data
print("🔍 First 5 rows of the dataset:")
display(titanic.head())

print("\n📈 Dataset information:")
titanic.info()

## 🏷️ Data Types Analysis (Slide 3)

Let's examine and categorize our data types as shown in the lecture slides.
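
The function in the next cell labels low-cardinality text columns as nominal. When a variable does have a natural order (ordinal), pandas can record that explicitly with an ordered `category` dtype. The sketch below illustrates the idea on a toy frame; the frame `toy`, its columns `ticket_class` and `port`, and the list `class_order` are hypothetical names chosen for illustration, not part of the notebook's data.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic dataset)
toy = pd.DataFrame({
    "ticket_class": ["Third", "First", "Second", "Third", "First"],
    "port": ["S", "C", "Q", "S", "S"],
})

# Ordinal: the levels have a meaningful order, so declare it explicitly
class_order = ["Third", "Second", "First"]  # lowest to highest
toy["ticket_class"] = pd.Categorical(
    toy["ticket_class"], categories=class_order, ordered=True
)

# Nominal: levels without any inherent order
toy["port"] = toy["port"].astype("category")

# Ordered categoricals support comparisons and min/max
above_third = toy["ticket_class"] > "Third"
codes = toy["ticket_class"].cat.codes

print(above_third.tolist())        # [False, True, True, False, True]
print(toy["ticket_class"].max())   # First
print(codes.tolist())              # [0, 2, 1, 0, 2]
```

The integer codes follow the declared order, which is what makes them safe to use as an ordinal encoding; for the unordered `port` column the codes carry no such meaning.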

In [None]:
# Create a comprehensive data type analysis
def analyze_data_types(df):
    analysis = []
    
    for col in df.columns:
        dtype = df[col].dtype
        unique_count = df[col].nunique()
        null_count = df[col].isnull().sum()
        
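        # Determine data type category (bool columns are treated as binary)
        if pd.api.types.is_bool_dtype(dtype):
            category = 'Binary'
        elif pd.api.types.is_numeric_dtype(dtype):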
            if unique_count == 2:
                category = 'Binary'
            elif unique_count < 10:
                category = 'Discrete'
            else:
                category = 'Continuous'
        else:
            if unique_count < 10:
                category = 'Nominal'
            else:
                category = 'Text/Identifier'
        
        analysis.append({
            'Column': col,
            'Data Type': str(dtype),
            'Category': category,
            'Unique Values': unique_count,
            'Missing Values': null_count,
            'Missing %': round(null_count/len(df)*100, 2)
        })
    
    return pd.DataFrame(analysis)

# Analyze our Titanic dataset
type_analysis = analyze_data_types(titanic)
print("🏷️ Data Types Analysis:")
display(type_analysis)

## 🔍 Missing Data Analysis (Slide 8)

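Missing values can bias summary statistics and break many modeling routines, so we first quantify how much is missing in each column and then choose a handling strategy.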
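
Two handling strategies that come up often are a missing-value indicator and simple imputation. The sketch below shows both on a toy frame. The values are made up for illustration, and the column names `age` and `deck` are only borrowed from the Titanic data for readability; `age_missing` and `age_median` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "deck": ["C", None, None, "E", None],
})

# 1. Record where values were missing before filling them in
toy["age_missing"] = toy["age"].isna().astype(int)

# 2. Simple imputation: fill the numeric column with its median
age_median = toy["age"].median()
toy["age"] = toy["age"].fillna(age_median)

# 3. Heavily missing categorical: treat "missing" as its own level
toy["deck"] = toy["deck"].fillna("Unknown")

print(toy)
```

Keeping the indicator column preserves the information that a value was imputed, which can itself be predictive when values are not missing at random.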

In [None]:
# Missing data visualization
def plot_missing_data(df):
    # Calculate missing data
    missing_data = df.isnull().sum()
    missing_percent = (missing_data / len(df)) * 100
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Missing data counts
    missing_data[missing_data > 0].plot(kind='bar', ax=ax1, color='coral')
    ax1.set_title('Missing Data Count by Column')
    ax1.set_ylabel('Number of Missing Values')
    ax1.tick_params(axis='x', rotation=45)
    
    # Missing data percentages
    missing_percent[missing_percent > 0].plot(kind='bar', ax=ax2, color='lightblue')
    ax2.set_title('Missing Data Percentage by Column')
    ax2.set_ylabel('Percentage of Missing Values')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    return missing_data[missing_data > 0], missing_percent[missing_percent > 0]

print("🔍 Missing Data Analysis:")
missing_counts, missing_percentages = plot_missing_data(titanic)

In [None]:
# Missing data patterns analysis
print("📊 Missing Data Summary:")
for col in missing_counts.index:
    count = missing_counts[col]
    percentage = missing_percentages[col]
    print(f"{col}: {count} missing values ({percentage:.1f}%)")
    
    # Suggest handling strategy
    if percentage < 5:
        strategy = "Consider removing rows or simple imputation"
    elif percentage < 20:
        strategy = "Use advanced imputation methods"
    else:
        strategy = "Consider dropping column or creating 'missing' indicator"
    print(f"  → Suggested strategy: {strategy}\n")

## 📊 Univariate Analysis (Slides 4-5)

Let's explore individual variables through various visualization techniques.
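
Box plots such as the ones drawn below mark points as outliers when they fall more than 1.5 × IQR beyond the quartiles (matplotlib's default whisker setting). As a rough sketch of that rule, computed by hand on a handful of hypothetical fare values (the Series `fares` is toy data, not the notebook's column):

```python
import pandas as pd

# Hypothetical fares with one extreme value
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

# Quartiles and interquartile range
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

# Positive skewness indicates a long right tail
skew = fares.skew()

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {outliers.tolist()}")
print(f"Skewness: {skew:.2f}")
```

A strongly right-skewed variable like this is often easier to read on a log scale, which is one reason to check skewness before choosing a plot.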

In [None]:
# Numerical variables analysis
numerical_cols = ['age', 'fare', 'sibsp', 'parch']

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
fig.suptitle('Univariate Analysis - Numerical Variables', fontsize=16, fontweight='bold')

for i, col in enumerate(numerical_cols):
    # Histogram
    axes[0, i].hist(titanic[col].dropna(), bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0, i].set_title(f'Distribution of {col.title()}')
    axes[0, i].set_xlabel(col.title())
    axes[0, i].set_ylabel('Frequency')
    
    # Box plot
    axes[1, i].boxplot(titanic[col].dropna(), patch_artist=True,
                      boxprops=dict(facecolor='lightgreen', alpha=0.7))
    axes[1, i].set_title(f'Box Plot of {col.title()}')
    axes[1, i].set_ylabel(col.title())

plt.tight_layout()
plt.show()

In [None]:
# Categorical variables analysis
categorical_cols = ['survived', 'pclass', 'sex', 'embarked']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Univariate Analysis - Categorical Variables', fontsize=16, fontweight='bold')

for i, col in enumerate(categorical_cols):
    row = i // 2
    col_idx = i % 2
    
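    # Count plot: string labels place bars at positions 0..n-1, so the value labels below line up
    data_counts = titanic[col].value_counts().sort_index()
    axes[row, col_idx].bar(data_counts.index.astype(str), data_counts.values,
                          color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A'][:len(data_counts)])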
    axes[row, col_idx].set_title(f'Distribution of {col.title()}')
    axes[row, col_idx].set_xlabel(col.title())
    axes[row, col_idx].set_ylabel('Count')
    
    # Add value labels on bars
    for j, v in enumerate(data_counts.values):
        axes[row, col_idx].text(j, v + 5, str(v), ha='center', va='bottom')

plt.tight_layout()
plt.show()

## 🔗 Bivariate Analysis (Slide 6)

Exploring relationships between variables to uncover insights.
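
For two categorical variables, a chi-square test of independence is one way to check whether an apparent relationship is larger than chance would explain. The sketch below uses `scipy.stats.chi2_contingency` (already imported in the setup cell) on a toy contingency table; the counts and the labels `group_a` and `group_b` are hypothetical, not the real Titanic figures.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table (hypothetical counts)
table = pd.DataFrame(
    [[20, 80], [70, 30]],
    index=["group_a", "group_b"],
    columns=["survived", "did_not_survive"],
)

# For 2x2 tables scipy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
print("Expected counts under independence:")
print(expected)
```

On the notebook's data, a table of this shape could be built with `pd.crosstab(titanic['sex'], titanic['survived'])`. A small p-value suggests the two variables are not independent, but it says nothing about the strength or direction of the association.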

In [None]:
# Correlation matrix for numerical variables
numerical_data = titanic[['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']].corr()

plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(numerical_data, dtype=bool))
sns.heatmap(numerical_data, mask=mask, annot=True, cmap='RdYlBu_r', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Correlation Matrix - Numerical Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("🔍 Key Correlations with Survival:")
survival_corr = numerical_data['survived'].drop('survived').sort_values(key=abs, ascending=False)
for var, corr in survival_corr.items():
    print(f"{var.title()}: {corr:.3f}")

In [None]:
# Survival analysis by different factors
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Survival Analysis by Key Factors', fontsize=16, fontweight='bold')

# Survival by Gender
survival_by_sex = titanic.groupby(['sex', 'survived']).size().unstack()
survival_by_sex.plot(kind='bar', ax=axes[0,0], color=['#FF6B6B', '#4ECDC4'])
axes[0,0].set_title('Survival by Gender')
axes[0,0].set_xlabel('Gender')
axes[0,0].set_ylabel('Count')
axes[0,0].legend(['Did not survive', 'Survived'])
axes[0,0].tick_params(axis='x', rotation=0)

# Survival by Class
survival_by_class = titanic.groupby(['pclass', 'survived']).size().unstack()
survival_by_class.plot(kind='bar', ax=axes[0,1], color=['#FF6B6B', '#4ECDC4'])
axes[0,1].set_title('Survival by Passenger Class')
axes[0,1].set_xlabel('Passenger Class')
axes[0,1].set_ylabel('Count')
axes[0,1].legend(['Did not survive', 'Survived'])
axes[0,1].tick_params(axis='x', rotation=0)

# Age distribution by survival
survived = titanic[titanic['survived'] == 1]['age'].dropna()
not_survived = titanic[titanic['survived'] == 0]['age'].dropna()
axes[1,0].hist([not_survived, survived], bins=30, alpha=0.7, 
              label=['Did not survive', 'Survived'], color=['#FF6B6B', '#4ECDC4'])
axes[1,0].set_title('Age Distribution by Survival')
axes[1,0].set_xlabel('Age')
axes[1,0].set_ylabel('Frequency')
axes[1,0].legend()

# Fare distribution by survival
survived_fare = titanic[titanic['survived'] == 1]['fare'].dropna()
not_survived_fare = titanic[titanic['survived'] == 0]['fare'].dropna()
axes[1,1].hist([not_survived_fare, survived_fare], bins=50, alpha=0.7,
              label=['Did not survive', 'Survived'], color=['#FF6B6B', '#4ECDC4'])
axes[1,1].set_title('Fare Distribution by Survival')
axes[1,1].set_xlabel('Fare')
axes[1,1].set_ylabel('Frequency')
axes[1,1].legend()
axes[1,1].set_xlim(0, 200)  # Limit x-axis for better visualization

plt.tight_layout()
plt.show()

## 🔧 Feature Engineering (Slide 9)

Creating new features and transforming existing ones to improve our analysis.
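
Binning a continuous variable is one of the transformations used below, and the bin edges deserve some care. The sketch contrasts fixed edges with quantile-based edges on hypothetical fares. The Series `fares` is toy data; the edge values and labels mirror the ones in the next cell.

```python
import pandas as pd

# Hypothetical fares, including a zero fare
fares = pd.Series([0, 5, 7, 8, 10, 13, 15, 20, 30, 50, 80, 500], dtype=float)
labels = ["Low", "Medium", "High", "Very High"]
edges = [0, 7.9, 14.5, 31.0, float("inf")]

# Fixed edges: intervals are right-closed, so a value equal to the
# left-most edge falls outside every bin and becomes NaN by default
fixed_default = pd.cut(fares, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left as well
fixed_inclusive = pd.cut(fares, bins=edges, labels=labels, include_lowest=True)

# Quantile edges: each bin receives roughly the same number of rows
by_quartile = pd.qcut(fares, q=4, labels=labels)

print(fixed_default.isna().sum())                  # 1 (the zero fare)
print(fixed_inclusive.iloc[0])                     # Low
print(by_quartile.value_counts().sort_index())
```

Quantile bins adapt to the data and stay balanced, while fixed edges keep the same meaning across datasets; which is preferable depends on whether comparability or balance matters more.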

In [None]:
# Create a copy for feature engineering
titanic_fe = titanic.copy()

# 1. Family size feature
titanic_fe['family_size'] = titanic_fe['sibsp'] + titanic_fe['parch'] + 1

# 2. Is alone feature
titanic_fe['is_alone'] = (titanic_fe['family_size'] == 1).astype(int)

# 3. Age groups
def categorize_age(age):
    if pd.isna(age):
        return 'Unknown'
    elif age < 12:
        return 'Child'
    elif age < 18:
        return 'Teen'
    elif age < 65:
        return 'Adult'
    else:
        return 'Senior'

titanic_fe['age_group'] = titanic_fe['age'].apply(categorize_age)

# 4. Fare categories (include_lowest=True keeps fare == 0 in the 'Low' bin instead of NaN)
titanic_fe['fare_category'] = pd.cut(titanic_fe['fare'], 
                                    bins=[0, 7.9, 14.5, 31.0, float('inf')],
                                    labels=['Low', 'Medium', 'High', 'Very High'],
                                    include_lowest=True)

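# 5. Title-like feature: seaborn's Titanic table has no name column to parse,
#    so the 'who' column (man / woman / child) serves as a stand-in
titanic_fe['title'] = titanic_fe['who'].str.title()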

print("🔧 New Features Created:")
new_features = ['family_size', 'is_alone', 'age_group', 'fare_category', 'title']
for feature in new_features:
    print(f"✅ {feature}")

print("\n📊 Sample of engineered features:")
display(titanic_fe[['survived'] + new_features].head(10))

In [None]:
# Analyze impact of new features on survival
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Impact of Engineered Features on Survival', fontsize=16, fontweight='bold')

# Family size vs survival
family_survival = titanic_fe.groupby('family_size')['survived'].mean()
axes[0,0].bar(family_survival.index, family_survival.values, color='lightcoral')
axes[0,0].set_title('Survival Rate by Family Size')
axes[0,0].set_xlabel('Family Size')
axes[0,0].set_ylabel('Survival Rate')

# Age group vs survival
age_survival = titanic_fe.groupby('age_group')['survived'].mean()
axes[0,1].bar(age_survival.index, age_survival.values, color='lightblue')
axes[0,1].set_title('Survival Rate by Age Group')
axes[0,1].set_xlabel('Age Group')
axes[0,1].set_ylabel('Survival Rate')
axes[0,1].tick_params(axis='x', rotation=45)

# Fare category vs survival (observed=True makes the categorical groupby explicit)
fare_survival = titanic_fe.groupby('fare_category', observed=True)['survived'].mean()
axes[1,0].bar(fare_survival.index.astype(str), fare_survival.values, color='lightgreen')
axes[1,0].set_title('Survival Rate by Fare Category')
axes[1,0].set_xlabel('Fare Category')
axes[1,0].set_ylabel('Survival Rate')

# Title vs survival
title_survival = titanic_fe.groupby('title')['survived'].mean()
axes[1,1].bar(title_survival.index, title_survival.values, color='gold')
axes[1,1].set_title('Survival Rate by Title')
axes[1,1].set_xlabel('Title')
axes[1,1].set_ylabel('Survival Rate')

plt.tight_layout()
plt.show()

## 🎯 Feature Selection (Slide 7)

Identifying the most important features for predicting survival.
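
The cells below rank features with a random forest's built-in (impurity-based) importances and with F-scores. Impurity-based importances are computed on the training data and can favor features with many distinct values, so permutation importance on held-out data is a commonly used cross-check. The sketch below is a minimal illustration on synthetic data in which only the first feature carries signal; the arrays `X` and `y` are generated for the example and are unrelated to the notebook's variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: three features, only feature 0 determines the label
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in accuracy
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")
```

Features whose shuffling barely changes the score contribute little to the model's held-out performance, regardless of how often the trees split on them.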

In [None]:
# Prepare data for machine learning
# Handle missing values and encode categorical variables

ml_data = titanic_fe.copy()

# Fill missing ages with median (assign back; in-place fillna on a column selection is unreliable)
ml_data['age'] = ml_data['age'].fillna(ml_data['age'].median())

# Fill missing embarked with mode
ml_data['embarked'] = ml_data['embarked'].fillna(ml_data['embarked'].mode()[0])

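# Encode categorical variables as integer codes.
# LabelEncoder assigns codes in sorted (alphabetical) order, so the numeric order is
# arbitrary for nominal features: tree models tolerate this, but linear models and
# F-scores treat the codes as a real ordering.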
le = LabelEncoder()
categorical_columns = ['sex', 'embarked', 'age_group', 'fare_category', 'title']

for col in categorical_columns:
    ml_data[col + '_encoded'] = le.fit_transform(ml_data[col].astype(str))

# Select features for analysis
feature_columns = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'family_size', 'is_alone'] + \
                 [col + '_encoded' for col in categorical_columns]

X = ml_data[feature_columns]
y = ml_data['survived']

print(f"🎯 Feature matrix shape: {X.shape}")
print(f"📊 Target variable shape: {y.shape}")
print(f"✅ Features selected: {len(feature_columns)}")

In [None]:
# Feature importance using Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(12, 8))
sns.barplot(data=feature_importance.head(10), x='Importance', y='Feature', palette='viridis')
plt.title('Top 10 Most Important Features (Random Forest)', fontsize=14, fontweight='bold')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()

print("🏆 Top 10 Most Important Features:")
for i, (_, row) in enumerate(feature_importance.head(10).iterrows(), 1):
    print(f"{i:2d}. {row['Feature']:20s}: {row['Importance']:.4f}")

In [None]:
# Statistical feature selection
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get selected features
selected_features = X.columns[selector.get_support()]
feature_scores = pd.DataFrame({
    'Feature': X.columns,
    'Score': selector.scores_,
    'Selected': selector.get_support()
}).sort_values('Score', ascending=False)

print("📊 Statistical Feature Selection (F-score):")
print(f"Selected {len(selected_features)} features out of {len(X.columns)}")
print("\n🏆 Top features by F-score:")
for i, (_, row) in enumerate(feature_scores.head(10).iterrows(), 1):
    status = "✅" if row['Selected'] else "❌"
    print(f"{i:2d}. {status} {row['Feature']:20s}: {row['Score']:.2f}")

## 📏 Data Normalization (Slide 10)

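Scaling puts numerical features on comparable ranges. This matters for distance-based and gradient-based algorithms (for example k-nearest neighbors or regularized logistic regression), while tree-based models are largely unaffected.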
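
To make the three scalers compared below concrete, the sketch reproduces each one by hand on a tiny array with one outlier. The array `x` is hypothetical toy data, and the `*_manual` names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier (toy values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x).ravel()
minmax = MinMaxScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# The same transformations written out explicitly
standard_manual = ((x - x.mean()) / x.std()).ravel()            # z-score, population std
minmax_manual = ((x - x.min()) / (x.max() - x.min())).ravel()   # rescale to [0, 1]
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust_manual = ((x - median) / (q3 - q1)).ravel()              # center on median, scale by IQR

print("standard:", standard.round(2))
print("minmax:  ", minmax.round(2))
print("robust:  ", robust.round(2))   # [-1.  -0.5  0.   0.5 48.5]
```

With min-max scaling the outlier squeezes the other four values close to zero, whereas the robust scaler keeps them evenly spread because the median and IQR ignore the extreme value.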

In [None]:
# Compare different normalization techniques
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Select numerical features for normalization
numerical_features = ['age', 'fare', 'family_size']
original_data = ml_data[numerical_features]

# Apply different scaling techniques
scalers = {
    'Original': None,
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

fig, axes = plt.subplots(len(numerical_features), len(scalers), figsize=(20, 12))
fig.suptitle('Comparison of Normalization Techniques', fontsize=16, fontweight='bold')

for col_idx, feature in enumerate(numerical_features):
    for scaler_idx, (scaler_name, scaler) in enumerate(scalers.items()):
        if scaler is None:
            data_to_plot = original_data[feature]
        else:
            scaled_data = scaler.fit_transform(original_data[[feature]])
            data_to_plot = scaled_data.flatten()
        
        axes[col_idx, scaler_idx].hist(data_to_plot, bins=30, alpha=0.7, 
                                      color=plt.cm.Set3(scaler_idx))
        axes[col_idx, scaler_idx].set_title(f'{feature.title()} - {scaler_name}')
        axes[col_idx, scaler_idx].set_ylabel('Frequency')
        
        # Add statistics
        mean_val = np.mean(data_to_plot)
        std_val = np.std(data_to_plot)
        axes[col_idx, scaler_idx].axvline(mean_val, color='red', linestyle='--', alpha=0.7)
        axes[col_idx, scaler_idx].text(0.02, 0.95, f'μ={mean_val:.2f}\nσ={std_val:.2f}', 
                                     transform=axes[col_idx, scaler_idx].transAxes, 
                                     verticalalignment='top',
                                     bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

## 🚀 Machine Learning Pipeline (Slide 11)

Putting it all together in a complete ML pipeline from EDA insights to model evaluation.
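
One design point worth keeping in mind is that preprocessing steps such as scaling should be fitted on training data only. A scikit-learn `Pipeline` combined with cross-validation handles this automatically, because the scaler is refitted inside every fold. The sketch below is a minimal illustration on synthetic data from `make_classification`, not on the notebook's feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for a real feature matrix)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)

# Scaling and the classifier are bundled, so each CV fold fits the scaler
# on its own training portion only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds gives a more honest picture than a single train/test split, especially on a dataset as small as Titanic.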

In [None]:
# Complete ML Pipeline based on EDA insights

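# 1. Use top features identified in feature selection
#    Note: the importances were computed on the full dataset, including rows that
#    end up in the test set, so the accuracy reported below may be optimistic.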
top_features = feature_importance.head(8)['Feature'].tolist()
X_final = X[top_features]

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, 
                                                    random_state=42, stratify=y)

# 3. Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
for model_name, model in models.items():
    # Train model
    if model_name == 'Logistic Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate accuracy
    accuracy = (y_pred == y_test).mean()
    results[model_name] = {'accuracy': accuracy, 'predictions': y_pred}
    
    print(f"🎯 {model_name} Accuracy: {accuracy:.4f}")

print(f"\n📊 Training set size: {len(X_train)}")
print(f"📊 Test set size: {len(X_test)}")
print(f"🎯 Features used: {len(top_features)}")

In [None]:
# Detailed model evaluation
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

for idx, (model_name, result) in enumerate(results.items()):
    y_pred = result['predictions']
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx])
    axes[idx].set_title(f'Confusion Matrix - {model_name}')
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')
    
    # Print classification report
    print(f"\n📊 {model_name} - Detailed Results:")
    print(classification_report(y_test, y_pred, target_names=['Did not survive', 'Survived']))

plt.tight_layout()
plt.show()

## 💡 Key Insights and Takeaways

Based on our comprehensive EDA of the Titanic dataset, here are the main findings:

In [None]:
# Generate summary insights
print("🚢 TITANIC DATASET - KEY INSIGHTS FROM EDA")
print("=" * 50)

# 1. Survival statistics
survival_rate = titanic['survived'].mean()
print(f"\n📊 Overall Survival Rate: {survival_rate:.1%}")

# 2. Gender impact
gender_survival = titanic.groupby('sex')['survived'].mean()
print(f"\n👥 Survival by Gender:")
for gender, rate in gender_survival.items():
    print(f"   {gender.title()}: {rate:.1%}")

# 3. Class impact
class_survival = titanic.groupby('pclass')['survived'].mean()
print(f"\n🎫 Survival by Class:")
for pclass, rate in class_survival.items():
    class_name = ['First', 'Second', 'Third'][pclass-1]
    print(f"   {class_name} Class: {rate:.1%}")

# 4. Age insights
child_survival = titanic[titanic['age'] < 16]['survived'].mean()
adult_survival = titanic[titanic['age'] >= 16]['survived'].mean()
print(f"\n👶 Children (<16) Survival Rate: {child_survival:.1%}")
print(f"👨 Adults (≥16) Survival Rate: {adult_survival:.1%}")

# 5. Family size impact
alone_survival = titanic_fe[titanic_fe['is_alone'] == 1]['survived'].mean()
family_survival = titanic_fe[titanic_fe['is_alone'] == 0]['survived'].mean()
print(f"\n👤 Traveling Alone Survival Rate: {alone_survival:.1%}")
print(f"👨‍👩‍👧‍👦 Traveling with Family Survival Rate: {family_survival:.1%}")

# 6. Top predictive features
print(f"\n🏆 Most Predictive Features:")
for i, feature in enumerate(feature_importance.head(5)['Feature'], 1):
    print(f"   {i}. {feature}")

print(f"\n🎯 Best Model Performance:")
best_model = max(results.keys(), key=lambda x: results[x]['accuracy'])
best_accuracy = results[best_model]['accuracy']
print(f"   {best_model}: {best_accuracy:.1%} accuracy")

print("\n✨ EDA PROCESS COMPLETE! ✨")

## 🎓 Conclusion

This notebook demonstrated a comprehensive EDA workflow that includes:

1. **Data Understanding**: Loading and initial exploration
2. **Data Types Analysis**: Categorizing variables appropriately
3. **Missing Data Handling**: Identifying patterns and strategies
4. **Univariate Analysis**: Understanding individual variable distributions
5. **Bivariate Analysis**: Exploring relationships between variables
6. **Feature Engineering**: Creating meaningful new features
7. **Feature Selection**: Identifying the most predictive variables
8. **Data Normalization**: Preparing data for machine learning
9. **ML Pipeline**: Implementing a complete predictive model

### 🚀 Next Steps

- Try advanced feature engineering techniques
- Experiment with different machine learning algorithms
- Perform hyperparameter tuning
- Create ensemble models for better prediction accuracy
- Apply similar EDA techniques to other datasets
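
As a starting point for the hyperparameter tuning step, here is a minimal sketch using scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification`, and the grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (stand-in for the engineered Titanic features)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # tuning only ever sees the training split

test_accuracy = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

The test split is touched once, after tuning, so the final score is not biased by the search itself.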

### 📚 Additional Resources

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)
- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html)

---

**Happy Data Exploring! 🎉**