# Simple Titanic Data Processing - One Step at a Time

Welcome to your first hands-on data processing experience! 🚢

In this notebook, we'll work with the famous Titanic dataset and learn data processing **one simple step at a time**. Each step focuses on a single concept, making it easy to understand and practice.

## About the Titanic Dataset
The Titanic dataset contains information about passengers aboard the RMS Titanic, such as their age, sex, and whether they survived. We'll use this real-world data to learn essential data processing skills.

## What You'll Learn (Step by Step):
1. Load and view the dataset
2. Find missing values in a specific column
3. Find all categorical columns
4. Find all numerical columns
5. Count unique values in categorical columns
6. Check data types
7. Get basic statistics
8. Analyze individual columns and assess overall data quality

Let's start our journey! 🎯

## Step 1: Import Libraries and Load Data

First, let's import the libraries we need and load our Titanic dataset.

In [None]:
# Import pandas for data manipulation and NumPy for numerical operations
import pandas as pd
import numpy as np

print("📚 Libraries imported successfully!")

In [None]:
# Load the Titanic dataset
df = pd.read_csv('titanic.csv')

print("🚢 Titanic dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"This means we have {df.shape[0]} rows (passengers) and {df.shape[1]} columns (features)")
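
If you want to see how `pd.read_csv` and `.shape` behave without the real file, here is a small sketch that reads a made-up CSV from memory. The `sample_csv` text and its three rows are invented for illustration and are not real passenger records.

```python
import io

import pandas as pd

# Hypothetical three-row sample; NOT taken from the real Titanic file
sample_csv = io.StringIO(
    "survived,sex,age\n"
    "0,male,22.0\n"
    "1,female,\n"
    "1,female,26.0\n"
)

sample_df = pd.read_csv(sample_csv)

# .shape is a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# The empty age field in the second row is read as NaN (a missing value)
print(sample_df["age"].isna().sum())
```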

## Step 2: View the First Few Rows

Let's see what our data looks like by viewing the first 5 rows.

In [None]:
# Display the first 5 rows of the dataset
print("👀 First 5 rows of the Titanic dataset:")
print(df.head())

## Step 3: See All Column Names

Let's see what columns (features) we have in our dataset.

In [None]:
# Display all column names
print("📋 All columns in the dataset:")
for i, column in enumerate(df.columns, 1):
    print(f"{i:2d}. {column}")

print(f"\nTotal number of columns: {len(df.columns)}")

## Step 4: Find Missing Values in a Specific Column

Let's check for missing values in the 'age' column specifically.

In [None]:
# Check missing values in the 'age' column
column_name = 'age'

missing_count = df[column_name].isnull().sum()
total_count = len(df[column_name])
missing_percentage = (missing_count / total_count) * 100

print(f"🔍 Missing values analysis for '{column_name}' column:")
print(f"Missing values: {missing_count}")
print(f"Total values: {total_count}")
print(f"Missing percentage: {missing_percentage:.2f}%")

if missing_count > 0:
    print(f"\n⚠️  The '{column_name}' column has missing values!")
else:
    print(f"\n✅ The '{column_name}' column has no missing values!")
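
The same check can be wrapped in a small helper so you can reuse it for any column. This is only a sketch: `missing_report` is a hypothetical function name, and the `toy` DataFrame is made-up data rather than the Titanic file.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, column: str) -> dict:
    """Return the missing count, total count, and missing percentage for one column."""
    missing = int(frame[column].isna().sum())
    total = len(frame)
    percent = (missing / total) * 100 if total else 0.0
    return {"missing": missing, "total": total, "percent": percent}


# Made-up toy data for illustration only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

report_age = missing_report(toy, "age")
report_fare = missing_report(toy, "fare")
print(report_age)
print(report_fare)
```

Returning a dictionary instead of printing makes it easy to collect the results for several columns and compare them later.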

## Step 5: Find Missing Values in ALL Columns

Now let's check missing values in all columns at once.

In [None]:
# Check missing values in all columns
print("🔍 Missing values in ALL columns:")
print("=" * 40)

missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

# Create a summary table
missing_summary = pd.DataFrame({
    'Column': missing_data.index,
    'Missing Count': missing_data.values,
    'Missing %': missing_percentage.values
})

# Sort by missing count (highest first)
missing_summary = missing_summary.sort_values('Missing Count', ascending=False)

print(missing_summary)

# Show only columns with missing values
columns_with_missing = missing_summary[missing_summary['Missing Count'] > 0]
print(f"\n📊 Summary: {len(columns_with_missing)} columns have missing values")
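
A shorter way to get the same percentages, shown here as a sketch on made-up data (the `toy_missing` DataFrame is invented for illustration): because `True` counts as 1 and `False` as 0, the mean of `isna()` is the fraction of missing values in each column.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_missing = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "sex": ["male", "female", "female", "male"],
})

# isna() gives True/False per cell; the mean of booleans is the fraction of True
missing_pct = toy_missing.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Keep only the columns that actually have gaps
only_missing = missing_pct[missing_pct > 0]
print(only_missing)
```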

## Step 6: Find All Categorical Columns

Let's identify which columns contain categorical (text/string) data.

In [None]:
# Find columns with the 'object' data type (usually text or categorical data)
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()

print("📝 Categorical columns (object data type):")
print("=" * 45)

for i, column in enumerate(categorical_columns, 1):
    unique_count = df[column].nunique()
    print(f"{i:2d}. {column:<15} (has {unique_count} unique values)")

print(f"\nTotal categorical columns: {len(categorical_columns)}")

In [None]:
# Let's see some examples from each categorical column
print("🔍 Sample values from each categorical column:")
print("=" * 50)

for column in categorical_columns:
    print(f"\n{column}:")
    # Compute the unique values once (excluding NaN) and reuse them
    unique_values = df[column].dropna().unique()
    # Show the first 5 unique values
    for value in unique_values[:5]:
        print(f"  - {value}")

    if len(unique_values) > 5:
        print(f"  ... and {len(unique_values) - 5} more values")

## Step 7: Find All Numerical Columns

Now let's identify columns that contain numerical data.

In [None]:
# Find columns with numerical data types
# ('number' covers every integer and float width, not only int64 and float64)
numerical_columns = df.select_dtypes(include='number').columns.tolist()

print("🔢 Numerical columns:")
print("=" * 25)

for i, column in enumerate(numerical_columns, 1):
    data_type = df[column].dtype
    min_val = df[column].min()
    max_val = df[column].max()
    print(f"{i:2d}. {column:<15} ({data_type}) - Range: {min_val} to {max_val}")

print(f"\nTotal numerical columns: {len(numerical_columns)}")
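
One thing to keep in mind: a column can be stored as numbers and still describe categories (for example a 0/1 flag). The sketch below uses a rough heuristic that treats numeric columns with very few distinct values as category-like. The `toy_numeric` data and the `MAX_CATEGORIES` cut-off are arbitrary assumptions for this example, not rules from pandas.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_numeric = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass": [3, 1, 3, 1, 2, 3],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

MAX_CATEGORIES = 3  # arbitrary cut-off chosen for this sketch

numeric_cols = toy_numeric.select_dtypes(include="number").columns

# Numeric columns with only a handful of distinct values often act like categories
category_like = [
    column for column in numeric_cols
    if toy_numeric[column].nunique() <= MAX_CATEGORIES
]
print(category_like)
```

On a real dataset you would pick the cut-off by looking at the data, since a small sample can make a truly continuous column look category-like.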

## Step 8: Check Data Types of All Columns

Let's see the data type of each column in our dataset.

In [None]:
# Display data types of all columns
print("📊 Data types of all columns:")
print("=" * 35)

data_types = df.dtypes

for column, dtype in data_types.items():
    print(f"{column:<15} : {dtype}")

# Summary of data types
print("\n📈 Summary of data types:")
type_counts = df.dtypes.value_counts()
for dtype, count in type_counts.items():
    print(f"{dtype}: {count} columns")
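
You can also test a single column's type in code. This sketch uses `is_numeric_dtype` from `pandas.api.types` on a made-up DataFrame (`toy_types` is invented for illustration).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up toy data for illustration only
toy_types = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
})

# Build a column -> True/False mapping: is this column numeric?
numeric_flags = {
    column: is_numeric_dtype(toy_types[column])
    for column in toy_types.columns
}
print(numeric_flags)
```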

## Step 9: Count Unique Values in Categorical Columns

For each categorical column, let's see how many unique values it has.

In [None]:
# Count unique values in each categorical column
print("🔍 Unique value counts for categorical columns:")
print("=" * 50)

for column in categorical_columns:
    unique_count = df[column].nunique()
    total_count = df[column].count()  # Non-null values
    
    print(f"\n{column}:")
    print(f"  Unique values: {unique_count}")
    print(f"  Non-null values: {total_count}")
    
    # Show the actual unique values if there are not too many
    if unique_count <= 10:
        print(f"  Values: {list(df[column].dropna().unique())}")
    else:
        print("  (Too many unique values to display - showing first 5)")
        print(f"  Sample values: {list(df[column].dropna().unique()[:5])}")
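
`nunique()` tells you how many distinct values there are, while `value_counts()` tells you how often each one appears. A minimal sketch on a made-up Series (`toy_embarked` is invented, not real data), including how missing values are treated:

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="embarked")

# By default, value_counts() leaves missing values out
counts_without_nan = toy_embarked.value_counts()
print(counts_without_nan)

# dropna=False adds a row for the missing values
counts_with_nan = toy_embarked.value_counts(dropna=False)
print(counts_with_nan)

# nunique() also ignores missing values by default
n_unique = toy_embarked.nunique()
print(n_unique)
```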

## Step 10: Get Basic Statistics for Numerical Columns

Let's get some basic statistics (mean, median, etc.) for our numerical columns.

In [None]:
# Get basic statistics for numerical columns
print("📊 Basic statistics for numerical columns:")
print("=" * 45)

numerical_stats = df[numerical_columns].describe()
print(numerical_stats)

print("\n💡 What these statistics mean:")
print("  count: Number of non-missing values")
print("  mean:  Average value")
print("  std:   Standard deviation (how spread out the values are)")
print("  min:   Smallest value")
print("  25%:   25th percentile (1st quartile)")
print("  50%:   50th percentile (median)")
print("  75%:   75th percentile (3rd quartile)")
print("  max:   Largest value")
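
To make the table easier to read, here is a tiny worked example on made-up ages (`toy_ages` is invented for illustration). Notice that `count` skips the missing value and that the `50%` row is the median.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_ages = pd.Series([22.0, 38.0, np.nan, 35.0, 2.0], name="age")

age_stats = toy_ages.describe()
print(age_stats)

# count ignores the NaN, so it is smaller than the length of the Series
print(len(toy_ages), age_stats["count"])

# The 50% row and median() report the same value
print(age_stats["50%"], toy_ages.median())
```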

## Step 11: Focus on a Specific Column - Age Analysis

Let's do a detailed analysis of the 'age' column.

In [None]:
# Detailed analysis of the 'age' column
print("👶👴 Detailed Age Analysis:")
print("=" * 30)

age_column = df['age']

# Basic information
print(f"Total passengers: {len(age_column)}")
print(f"Ages recorded: {age_column.count()}")
print(f"Missing ages: {age_column.isnull().sum()}")
print(f"Missing percentage: {(age_column.isnull().sum() / len(age_column)) * 100:.1f}%")

# Age statistics (only for non-missing values)
print(f"\n📊 Age Statistics:")
print(f"Youngest passenger: {age_column.min():.1f} years old")
print(f"Oldest passenger: {age_column.max():.1f} years old")
print(f"Average age: {age_column.mean():.1f} years old")
print(f"Median age: {age_column.median():.1f} years old")

# Age groups
print(f"\n👶 Age Groups:")
children = age_column[age_column < 18].count()
adults = age_column[(age_column >= 18) & (age_column < 65)].count()
seniors = age_column[age_column >= 65].count()

print(f"Children (< 18): {children} passengers")
print(f"Adults (18-64): {adults} passengers")
print(f"Seniors (65+): {seniors} passengers")
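
The three age groups can also be built in one step with `pd.cut`. This is a sketch on made-up ages, and the bin edges are assumed to mirror the groups used above; `right=False` makes each bin include its lower edge, so 18 counts as an adult and 65 as a senior.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_age_values = pd.Series([4.0, 17.0, 18.0, 40.0, 64.0, 65.0, 71.0, np.nan], name="age")

# Bins are [0, 18), [18, 65), [65, inf) because right=False
age_group = pd.cut(
    toy_age_values,
    bins=[0, 18, 65, np.inf],
    labels=["Children (< 18)", "Adults (18-64)", "Seniors (65+)"],
    right=False,
)

# sort=False keeps the groups in bin order; the missing age is not counted
group_counts = age_group.value_counts(sort=False)
print(group_counts)
```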

## Step 12: Focus on a Specific Column - Sex Analysis

Let's analyze the 'sex' column to understand the gender distribution.

In [None]:
# Detailed analysis of the 'sex' column
print("👨👩 Gender Distribution Analysis:")
print("=" * 35)

sex_column = df['sex']

# Basic information
print(f"Total passengers: {len(sex_column)}")
print(f"Gender recorded: {sex_column.count()}")
print(f"Missing gender info: {sex_column.isnull().sum()}")

# Gender distribution
print(f"\n📊 Gender Distribution:")
gender_counts = sex_column.value_counts()
gender_percentages = sex_column.value_counts(normalize=True) * 100

for gender in gender_counts.index:
    count = gender_counts[gender]
    percentage = gender_percentages[gender]
    print(f"{gender.capitalize()}: {count} passengers ({percentage:.1f}%)")

# Show unique values
print(f"\n🔍 Unique values in sex column: {list(sex_column.unique())}")

## Step 13: Focus on a Specific Column - Survival Analysis

Let's analyze the 'survived' column - this is very important!

In [None]:
# Detailed analysis of the 'survived' column
print("🚢⚰️  Survival Analysis:")
print("=" * 25)

survived_column = df['survived']

# Basic information
print(f"Total passengers: {len(survived_column)}")
print(f"Survival info available: {survived_column.count()}")
print(f"Missing survival info: {survived_column.isnull().sum()}")

# Survival statistics
print(f"\n📊 Survival Statistics:")
survival_counts = survived_column.value_counts().sort_index()
survival_percentages = survived_column.value_counts(normalize=True).sort_index() * 100

for status in survival_counts.index:
    count = survival_counts[status]
    percentage = survival_percentages[status]
    status_text = "Survived" if status == 1 else "Did not survive"
    print(f"{status_text}: {count} passengers ({percentage:.1f}%)")

# Overall survival rate
survival_rate = survived_column.mean() * 100
print(f"\n💡 Overall survival rate: {survival_rate:.1f}%")
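
Why does `mean()` give the survival rate? Because the column holds only 0s and 1s, the sum is the number of survivors, and dividing by the number of rows gives the proportion. A quick check on made-up values (`toy_survived` is invented for illustration):

```python
import pandas as pd

# Made-up toy data for illustration only: 3 survivors out of 8
toy_survived = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name="survived")

# The sum of a 0/1 column is the number of 1s
survivors = int(toy_survived.sum())

rate_from_counts = survivors / len(toy_survived) * 100
rate_from_mean = toy_survived.mean() * 100

print(f"From counts: {rate_from_counts:.1f}%")
print(f"From mean:   {rate_from_mean:.1f}%")
```

Note that this shortcut only works when there are no missing values and the column contains nothing but 0 and 1.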

## Step 14: Find Columns with No Missing Values

Let's identify which columns are complete (no missing values).

In [None]:
# Find columns with no missing values
print("✅ Columns with NO missing values:")
print("=" * 35)

complete_columns = []
incomplete_columns = []

for column in df.columns:
    missing_count = df[column].isnull().sum()
    if missing_count == 0:
        complete_columns.append(column)
    else:
        incomplete_columns.append((column, missing_count))

# Show complete columns
print(f"Complete columns ({len(complete_columns)} total):")
for i, column in enumerate(complete_columns, 1):
    print(f"{i:2d}. {column}")

# Show incomplete columns
print(f"\n❌ Columns with missing values ({len(incomplete_columns)} total):")
for i, (column, missing_count) in enumerate(incomplete_columns, 1):
    missing_percentage = (missing_count / len(df)) * 100
    print(f"{i:2d}. {column:<15} - {missing_count} missing ({missing_percentage:.1f}%)")
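
The loop above can also be written without a loop. A sketch on a made-up DataFrame (`toy_complete_check` is invented for illustration), using `isna().any()` to get one True/False flag per column:

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_complete_check = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [22.0, np.nan, 26.0],
    "sex": ["male", "female", "female"],
    "deck": [np.nan, "C", np.nan],
})

# True for every column that contains at least one missing value
has_missing = toy_complete_check.isna().any().to_numpy()

toy_complete = toy_complete_check.columns[~has_missing].tolist()
toy_incomplete = toy_complete_check.columns[has_missing].tolist()

print(f"Complete columns: {toy_complete}")
print(f"Columns with missing values: {toy_incomplete}")
```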

## Step 15: Simple Data Quality Check

Let's do a simple overall data quality assessment.

In [None]:
# Overall data quality assessment
print("🔍 Data Quality Assessment:")
print("=" * 30)

total_cells = df.shape[0] * df.shape[1]
missing_cells = df.isnull().sum().sum()
complete_cells = total_cells - missing_cells

print(f"Dataset dimensions: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Total cells: {total_cells:,}")
print(f"Complete cells: {complete_cells:,} ({(complete_cells/total_cells)*100:.1f}%)")
print(f"Missing cells: {missing_cells:,} ({(missing_cells/total_cells)*100:.1f}%)")

print(f"\n📊 Column Summary:")
print(f"Complete columns: {len(complete_columns)}")
print(f"Columns with missing data: {len(incomplete_columns)}")
print(f"Numerical columns: {len(numerical_columns)}")
print(f"Categorical columns: {len(categorical_columns)}")

# Data quality score (simple)
quality_score = (complete_cells / total_cells) * 100
print(f"\n⭐ Data Quality Score: {quality_score:.1f}%")

if quality_score >= 90:
    print("   Excellent data quality! 🌟")
elif quality_score >= 80:
    print("   Good data quality! 👍")
elif quality_score >= 70:
    print("   Fair data quality - some cleaning needed 🔧")
else:
    print("   Poor data quality - significant cleaning needed ⚠️")
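
The score above is just the share of non-missing cells, so it can also be computed in one line. A sketch on made-up data (`toy_quality` is invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only: 12 cells, 3 of them missing
toy_quality = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [22.0, np.nan, 26.0],
    "sex": ["male", "female", "female"],
    "deck": [np.nan, "C", np.nan],
})

# notna() marks each present cell as True; the mean over all cells is the completeness
quality_score_toy = toy_quality.notna().to_numpy().mean() * 100
print(f"Data Quality Score: {quality_score_toy:.1f}%")

# The same number the long way: (total cells - missing cells) / total cells
total_cells_toy = toy_quality.size
missing_cells_toy = int(toy_quality.isna().sum().sum())
print((total_cells_toy - missing_cells_toy) / total_cells_toy * 100)
```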

## 🎯 Practice Exercises

Now it's your turn! Try these simple exercises:

### Exercise 1: Explore Another Column
Pick any column from the dataset and do a detailed analysis like we did for 'age', 'sex', and 'survived'.

### Exercise 2: Find Specific Missing Values
Find which passengers (rows) have missing age information.

### Exercise 3: Simple Filtering
Find all passengers who were:
- Under 18 years old
- Female
- Survived the disaster

### Exercise 4: Count and Compare
Compare survival rates between different groups (male vs female, different passenger classes, etc.)

Try these exercises in the cells below! 👇

In [None]:
# Exercise 1: Your code here
# Pick a column and analyze it



In [None]:
# Exercise 2: Your code here
# Find passengers with missing age information



In [None]:
# Exercise 3: Your code here
# Find specific groups of passengers



In [None]:
# Exercise 4: Your code here
# Compare survival rates between groups


## 🎉 Congratulations!

You've completed your first step-by-step data processing journey! 🚀

### What You've Learned:
✅ How to load and explore a dataset  
✅ How to find missing values in specific columns  
✅ How to identify categorical and numerical columns  
✅ How to count unique values  
✅ How to check data types  
✅ How to get basic statistics  
✅ How to analyze specific columns in detail  
✅ How to assess overall data quality  

### Next Steps:
🔄 Practice with different datasets  
📊 Learn data visualization  
🧹 Learn data cleaning techniques  
🔧 Learn feature engineering  

Keep practicing and exploring! Every dataset tells a story - you're learning to read it! 📖✨