# Titanic Pandas Assignment - A Learning-Focused Walkthrough

This notebook is structured not only to complete the assignment but also to explain the **why** behind each pandas operation, helping you build a solid foundation.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better DataFrame output in the notebook
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 100)
pd.set_option('display.max_colwidth', None)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Pandas version: 2.3.2
NumPy version: 2.2.6


# PART A - BASIC: Building the Foundation

This section covers the fundamentals: loading data, getting a high-level overview, and performing initial inspections. These are the first steps in any data analysis project.

## 📊 Task 1: Load & Inspect

**Goal:** Load the `train.csv` dataset into a pandas DataFrame and perform the first-look inspections.

In [4]:
# We use a try-except block to handle potential errors, like the file not being in the right place.
# This makes our code more robust.
try:
    # pd.read_csv() is the fundamental function for reading CSV files. 
    # It parses the text file and loads it into a highly optimized DataFrame structure.
    df = pd.read_csv('../data_titanic/train.csv')
    print("✅ Data loaded successfully!")
    
except FileNotFoundError:
    print("❌ Error: Could not find train.csv in ../data_titanic/")
    print("Please make sure your data file is in the correct location!")

✅ Data loaded successfully!
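To see what `pd.read_csv()` does with empty fields, here is a minimal sketch that parses a tiny in-memory CSV instead of `train.csv`. The rows and the name `toy` are made up for illustration and are not taken from the Titanic file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file (not actual Titanic rows).
csv_text = """PassengerId,Survived,Age,Cabin
1,0,22.0,
2,1,,C85
3,1,26.0,
"""

# read_csv accepts any file-like object, so StringIO behaves like a file path.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)         # (rows, columns)
print(toy.dtypes)        # dtype inferred for each column
print(toy.isna().sum())  # empty fields were parsed as NaN
```

Empty fields become `NaN` by default, which is why a column with gaps, such as `Age` here, is read as `float64` rather than `int64`.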


In [7]:
# LEARNING POINT: The Three Essential First-Look Commands

# 1. .shape: An *attribute* (not a method, so no '()') that gives you the dimensions (rows, columns).
# It's the quickest way to understand the scale of your data.
print("🔍 Dataset Shape:")
print(df.shape)

🔍 Dataset Shape:
(891, 12)


In [8]:
# 2. .info(): A *method* that provides a technical summary of the DataFrame.
# Why it's critical: It shows column names, the number of non-null values, and the data type (dtype) of each column.
# This is your first and best tool for spotting missing data and incorrect data types.
print("\n📋 Dataset Info:")
df.info()


📋 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
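The `+` in `memory usage: 83.7+ KB` means the figure is a lower bound: for `object` columns, pandas counts only the pointers unless asked to inspect the strings themselves. The sketch below illustrates the difference on a small made-up frame; `toy`, `shallow`, and `deep` are names chosen for this example.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text (object) column.
toy = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64"),
    "name": pd.Series(["a", "bb", "a much longer string"], dtype=object),
})

shallow = toy.memory_usage().sum()        # object column counted as pointers only
deep = toy.memory_usage(deep=True).sum()  # also measures the Python strings

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
toy.info(memory_usage="deep")             # reports a figure without the '+'
```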


In [9]:
# 3. .head(): A *method* to view the first few rows of the data (default is 5).
# Why it's critical: It gives you a feel for the actual values and format of the data in each column.
print("\n👀 First 5 rows:")
df.head()


👀 First 5 rows:


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 📊 Task 2: Column Summary

**Goal:** Create a summary table to profile each column, focusing on data types and missing values.

**Why this is important:** A programmatic summary helps you quickly identify which columns need cleaning or special handling. Sorting by missing values immediately brings the most problematic columns to your attention.

In [12]:
# We create a function to make this process reusable for any DataFrame.
def create_column_summary(dataframe):
    """
    Creates a comprehensive summary of a DataFrame's columns.
    """
    summary_list = []
    for col in dataframe.columns:
        col_info = {
            'column_name': col,
            'dtype': dataframe[col].dtype,
            '# missing': dataframe[col].isnull().sum(),
            '% missing': round(dataframe[col].isnull().sum() / len(dataframe) * 100, 2),
            '# unique values': dataframe[col].nunique(),
        }
        summary_list.append(col_info)
    
    summary_df = pd.DataFrame(summary_list)
    return summary_df.sort_values(by='# missing', ascending=False).reset_index(drop=True)

# Apply our custom function to the DataFrame
column_summary = create_column_summary(df)
print("📋 Column Summary (sorted by missing values):")
column_summary

📋 Column Summary (sorted by missing values):


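|   | column_name | dtype | # missing | % missing | # unique values |
|---|---|---|---|---|---|
| 0 | Cabin | object | 687 | 77.1 | 147 |
| 1 | Age | float64 | 177 | 19.87 | 88 |
| 2 | Embarked | object | 2 | 0.22 | 3 |
| 3 | PassengerId | int64 | 0 | 0.0 | 891 |
| 4 | Name | object | 0 | 0.0 | 891 |
| 5 | Pclass | int64 | 0 | 0.0 | 3 |
| 6 | Survived | int64 | 0 | 0.0 | 2 |
| 7 | Sex | object | 0 | 0.0 | 2 |
| 8 | Parch | int64 | 0 | 0.0 | 7 |
| 9 | SibSp | int64 | 0 | 0.0 | 7 |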


**Insight from the summary:**
- `Cabin` has a very high percentage of missing values (~77%), suggesting it might be difficult to use directly.
- `Age` is missing about 20% of its values, which is significant. We will need to handle this (impute).
- `Embarked` has only 2 missing values, which should be easy to fix.
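
The same kind of profile can also be built without a Python loop, because `.dtypes`, `.isna().sum()`, and `.nunique()` each return a Series indexed by column name. The following is a sketch on a small made-up frame, offered as one possible alternative rather than a replacement for the function above; `toy` and `summary` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up data with deliberate gaps (not real Titanic rows).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# Each value below is a Series indexed by column name, so they align automatically.
summary = pd.DataFrame({
    "dtype": toy.dtypes,
    "# missing": toy.isna().sum(),
    "% missing": (toy.isna().mean() * 100).round(2),  # mean of booleans = share missing
    "# unique values": toy.nunique(),                 # NaN is not counted as a value
}).sort_values("# missing", ascending=False)

print(summary)
```

Taking the mean of the boolean mask gives the missing share directly, which avoids computing the missing count twice.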

## 📊 Task 3: Value Counts & Proportions

**Goal:** For key categorical columns, understand the distribution of their values.

**Why this is important:** This helps you spot imbalances in your data. For example, we can see there are far more passengers in 3rd class than in 1st or 2nd, and more males than females. These imbalances can influence analysis results.

In [15]:
categorical_columns = ['Pclass', 'Sex', 'Embarked']

for col in categorical_columns:
    print(f"\n🏷️  Distribution for '{col}':")
    
    # .value_counts() is the go-to method for counting unique values in a Series.
    counts = df[col].value_counts()
    
    # normalize=True gives proportions instead of raw counts.
    # By default, missing values are excluded, so the proportions are shares of the non-missing rows.
    percentages = df[col].value_counts(normalize=True) * 100
    
    # We combine them into a new DataFrame for a clean, readable output.
    distribution_df = pd.DataFrame({
        'Count': counts,
        'Percentage': percentages.round(2)
    })
    
    print(distribution_df)
    
    # It's also good practice to report the number of missing values here.
    missing_count = df[col].isnull().sum()
    if missing_count > 0:
        print(f"⚠️  Missing values: {missing_count}")


🏷️  Distribution for 'Pclass':
        Count  Percentage
Pclass                   
3         491       55.11
1         216       24.24
2         184       20.65

🏷️  Distribution for 'Sex':
        Count  Percentage
Sex                      
male      577       64.76
female    314       35.24

🏷️  Distribution for 'Embarked':
          Count  Percentage
Embarked                   
S           644       72.44
C           168       18.90
Q            77        8.66
⚠️  Missing values: 2
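
Note that the `Embarked` percentages above are shares of the 889 non-missing rows, because `.value_counts()` drops `NaN` by default. A minimal sketch of how `dropna=False` changes this, using a made-up Series (`embarked` and the other names are chosen for this example):

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry (not the real Embarked column).
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

default_counts = embarked.value_counts()            # NaN dropped
with_missing = embarked.value_counts(dropna=False)  # NaN kept as its own category
shares = embarked.value_counts(normalize=True, dropna=False).mul(100).round(2)

print(with_missing)
print(shares)
```

With `dropna=False`, the counts add up to the full number of rows and the percentages are shares of all passengers, which may or may not be what the analysis calls for.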


## 📊 Task 4: Select & Filter

**Goal:** Create a new DataFrame containing only passengers who meet multiple specific criteria.

**Why this is important:** Filtering is a core data manipulation task. It allows you to isolate subsets of your data for focused analysis. We'll explore two common ways to do this.

### Method 1: Boolean Indexing (The Standard Way)

This is the most common and powerful way to filter in pandas. We create a "mask" (a Series of `True`/`False` values) for each condition and then combine them.

In [16]:
# Each condition must be wrapped in parentheses when combining with '&' (AND):
# '&' binds more tightly than '==' and '>', so without parentheses the expression is parsed incorrectly.
female_firstclass_over_30 = df[
    (df['Sex'] == 'female') & 
    (df['Pclass'] == 1) & 
    (df['Age'] > 30)
]

print(f"Found {len(female_firstclass_over_30)} passengers matching the criteria.")

Found 50 passengers matching the criteria.
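
One detail worth knowing about the `Age > 30` condition: a comparison against `NaN` evaluates to `False`, so passengers with a missing age are silently left out of the result. The sketch below shows this on made-up rows and stores each mask in a named variable, which is an optional style choice; all names here are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows (not real Titanic data); the second passenger has no recorded age.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "female"],
    "Pclass": [1, 1, 1, 3],
    "Age": [35.0, np.nan, 40.0, 50.0],
})

# Each mask is a boolean Series aligned with the DataFrame's index.
is_female = toy["Sex"] == "female"
is_first_class = toy["Pclass"] == 1
over_30 = toy["Age"] > 30   # NaN > 30 is False

selected = toy[is_female & is_first_class & over_30]
print(over_30.tolist())
print(selected)
```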


### Method 2: The `.query()` Method (Alternative)

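The `.query()` method allows you to write your conditions as a string, which can be more readable for complex queries. It can also be faster on large DataFrames when the optional `numexpr` package is installed; on a dataset as small as this one, any speed difference is negligible.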

In [None]:
female_firstclass_over_30_query = df.query("Sex == 'female' and Pclass == 1 and Age > 30")

print(f"Found {len(female_firstclass_over_30_query)} passengers using .query().")

# We can verify that both methods produce the exact same result.
print(f"Results are identical: {female_firstclass_over_30.equals(female_firstclass_over_30_query)}")
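
If the threshold lives in a Python variable, `.query()` can reference it with the `@` prefix. A small sketch on made-up rows, where `min_age`, `toy`, and the result names are assumptions made for this example:

```python
import pandas as pd

# Made-up rows (not real Titanic data).
toy = pd.DataFrame({
    "Sex": ["female", "female", "male"],
    "Pclass": [1, 3, 1],
    "Age": [35.0, 45.0, 50.0],
})

min_age = 30  # hypothetical threshold kept outside the query string

# '@min_age' looks up the local Python variable instead of a column.
via_query = toy.query("Sex == 'female' and Pclass == 1 and Age > @min_age")

# Equivalent boolean-indexing version for comparison.
via_mask = toy[(toy["Sex"] == "female") & (toy["Pclass"] == 1) & (toy["Age"] > min_age)]

print(via_query.equals(via_mask))
```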

### Sorting the Results

Now we sort the filtered DataFrame to find the top 10 passengers by fare.

In [17]:
# .sort_values() is used to reorder the DataFrame based on column values.
# `ascending=False` sorts from highest to lowest.
# We chain .head(10) to get just the top 10 rows after sorting.
top_10_by_fare = female_firstclass_over_30.sort_values(by='Fare', ascending=False).head(10)

print("🎖️ Top 10 female passengers in 1st class over 30, sorted by Fare:")
top_10_by_fare

🎖️ Top 10 female passengers in 1st class over 30, sorted by Fare:


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
| 299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.0 | 0 | 0 | PC 17757 | 227.525 | C45 | C |
| 380 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.0 | 0 | 0 | PC 17757 | 227.525 | NaN | C |
| 779 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) | female | 43.0 | 0 | 1 | 24160 | 211.3375 | B3 | S |
| 318 | 319 | 1 | 1 | Wick, Miss. Mary Natalie | female | 31.0 | 0 | 2 | 36928 | 164.8667 | C7 | S |
| 856 | 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.0 | 1 | 1 | 36928 | 164.8667 | NaN | S |
| 268 | 269 | 1 | 1 | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.0 | 0 | 1 | PC 17582 | 153.4625 | C125 | S |
| 609 | 610 | 1 | 1 | Shutes, Miss. Elizabeth W | female | 40.0 | 0 | 0 | PC 17582 | 153.4625 | C125 | S |
| 195 | 196 | 1 | 1 | Lurette, Miss. Elise | female | 58.0 | 0 | 0 | PC 17569 | 146.5208 | B80 | C |
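
Several fares in this result are tied, so a fixed cutoff such as `.head(10)` may split a group of equal values. As an alternative sketch, `DataFrame.nlargest()` with `keep="all"` returns every row tied at the cutoff. The data and names below are made up for illustration.

```python
import pandas as pd

# Made-up fares with a tie at the top (not real Titanic rows).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Fare": [10.0, 50.0, 30.0, 50.0],
})

# sort + head returns exactly one row, even though two rows share the top fare.
top_sorted = toy.sort_values(by="Fare", ascending=False).head(1)

# nlargest with keep="all" returns every row tied for the top fare.
top_all_ties = toy.nlargest(1, "Fare", keep="all")

print(top_sorted)
print(top_all_ties)
```

Which behavior is preferable depends on whether the task asks for exactly ten rows or for everyone at or above the tenth-highest fare.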


## 📊 Task 5: Basic Aggregations

**Goal:** Compute summary statistics for the entire dataset and for specific groups.

**Why this is important:** Aggregations condense large amounts of data into single, meaningful numbers (like an average or a total). This is the foundation of quantitative analysis.

In [18]:
# --- Mean, Median, and Mode of Age ---
# These methods skip missing (NaN) values by default.
mean_age = df['Age'].mean()
median_age = df['Age'].median() # The median is less sensitive to outliers, so it is often better for skewed data like age/income
mode_age = df['Age'].mode()[0]  # .mode() returns a Series (there can be ties), so we take the first item with [0]

print("📊 Age Statistics:")
print(f"   - Mean Age:   {mean_age:.2f}")
print(f"   - Median Age: {median_age:.2f}")
print(f"   - Mode Age:   {mode_age:.2f}")

📊 Age Statistics:
   - Mean Age:   29.70
   - Median Age: 28.00
   - Mode Age:   24.00
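
To make the NaN handling and the shape of `.mode()` concrete, here is a small sketch on a made-up Series; the values and the name `ages` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value and two equally common values.
ages = pd.Series([22.0, 24.0, 24.0, 30.0, 30.0, np.nan])

mean_default = ages.mean()             # NaN skipped: (22 + 24 + 24 + 30 + 30) / 5
mean_strict = ages.mean(skipna=False)  # NaN propagates into the result
median_default = ages.median()         # middle of the five non-missing values
modes = ages.mode()                    # a Series, because ties are possible

print(mean_default, mean_strict, median_default)
print(modes.tolist())
```

When there is a tie, `.mode()` returns all tied values in sorted order, so `[0]` picks the smallest of them.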


In [19]:
# --- Mean Fare per Pclass ---
# This introduces .groupby(), one of the most powerful tools in pandas.
# 1. df.groupby('Pclass'): Splits the DataFrame into 3 separate groups (one for each class).
# 2. ['Fare']: Selects the 'Fare' column within each of those groups.
# 3. .mean(): Calculates the mean of the 'Fare' column for each group.
mean_fare_per_pclass = df.groupby('Pclass')['Fare'].mean()

print("\n💰 Mean Fare per Pclass:")
print(mean_fare_per_pclass)


💰 Mean Fare per Pclass:
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
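
The same split, select, aggregate pattern extends to several statistics at once through `.agg()`. A minimal sketch on made-up fares, with `toy` and `fare_stats` as illustrative names:

```python
import pandas as pd

# Made-up fares per class (not real Titanic rows).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 100.0, 20.0, 8.0, 10.0, 12.0],
})

# Passing a list of function names yields one column per statistic.
fare_stats = toy.groupby("Pclass")["Fare"].agg(["count", "mean", "median"])

print(fare_stats)
```

Reporting the count next to the mean helps show how many rows each group average is based on.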


In [20]:
# --- Survival Rate (Overall and by Gender) ---
# Since 'Survived' is 1 for survived and 0 for not, the mean is the survival rate!

# Overall survival rate
overall_survival_rate = df['Survived'].mean()
print(f"\n🆘 Overall Survival Rate: {overall_survival_rate:.2%}")

# We use groupby() again to see how this rate changes for different groups.
survival_by_gender = df.groupby('Sex')['Survived'].mean()

print("\n🆘 Survival Rate by Gender:")
# We can use .map() to apply formatting to each value in the resulting Series
print(survival_by_gender.map('{:.2%}'.format))


🆘 Overall Survival Rate: 38.38%

🆘 Survival Rate by Gender:
Sex
female    74.20%
male      18.89%
Name: Survived, dtype: object
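
Grouping by more than one column works the same way: pass a list of keys. The sketch below computes a survival rate per sex and class on made-up rows and adds the group size next to each rate; the data and names are assumptions made for illustration.

```python
import pandas as pd

# Made-up passengers (not real Titanic rows).
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

overall_rate = toy["Survived"].mean()  # mean of 0/1 values = share of 1s

# One row per (Sex, Pclass) combination, with the rate and the group size.
rate_by_group = toy.groupby(["Sex", "Pclass"])["Survived"].agg(["mean", "size"])

print(f"{overall_rate:.2%}")
print(rate_by_group)
```

The result has a two-level index, so a single group can be read with a tuple, for example `rate_by_group.loc[("female", 1)]`.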


# PART A COMPLETE - Foundation Established!

## 🎓 What you learned in Part A:

*   ✅ **Loading and Inspecting:** Using `pd.read_csv`, `.shape`, `.info()`, and `.head()`.
*   ✅ **Data Profiling:** Systematically checking for missing data, data types, and unique values.
*   ✅ **Distribution Analysis:** Using `.value_counts()` to understand categorical data.
*   ✅ **Filtering Data:** Using boolean indexing (`df[...]`) and the `.query()` method to select data subsets.
*   ✅ **Basic Aggregations:** Calculating statistics like `.mean()`, `.median()`, and `.mode()`.
*   ✅ **Grouped Analysis:** Using `.groupby()` to perform aggregations on specific groups within the data.

# PART B - INTERMEDIATE: Data Transformation Skills

Now we move on to modifying the data. This includes handling missing values (imputation), creating new features from existing ones (feature engineering), and reshaping data for analysis.

## 🚀 Ready to continue with Part B!

Each section will build on previous concepts while introducing new, powerful pandas functionality.