# Step 1: Setting Up the Project

**Objective:**  
Prepare the project structure and ensure the Titanic dataset is ready for analysis.

**What happens in this step:**  
1. We create a dedicated folder `data` to store datasets and output files (like JSON).  
2. We define file paths for the CSV dataset (`titanic.csv`) and the final JSON output (`titanic_data.json`).  
3. The script ensures the `data` folder exists, so later steps won’t fail when reading or writing files.  

**Why this matters:**  
- Keeping data organized makes the project easier to maintain.  
- Defining paths as variables allows us to reference files consistently throughout the analysis.
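
The path setup described above can be sketched in isolation. This is a minimal illustration, not the notebook's own cell: the helper name `setup_paths` is hypothetical, and a temporary directory stands in for the project root so nothing is written to the real `data` folder.

```python
import tempfile
from pathlib import Path


def setup_paths(base_dir):
    """Create <base_dir>/data if needed and return the three project paths."""
    data_dir = Path(base_dir) / "data"
    # parents=True also creates missing parent folders; exist_ok=True makes reruns safe
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, data_dir / "titanic.csv", data_dir / "titanic_data.json"


# Use a throwaway directory as the (assumed) project root
with tempfile.TemporaryDirectory() as tmp:
    data_dir, csv_file, json_file = setup_paths(tmp)
    folder_created = data_dir.is_dir()
    csv_present = csv_file.exists()

print(f"Folder created: {folder_created}")
print(f"CSV present: {csv_present}")
```

Creating the folder does not supply the dataset, so `csv_present` stays false until `titanic.csv` is copied in. Checking `CSV_FILE.exists()` before Step 2 would give a clearer message than the error raised by `pd.read_csv()`.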


In [21]:
"""
Titanic Data Analysis and JSON Export
Author: Marco Martins
Description: Analyze Titanic passenger data, engineer features, and export to JSON
"""

import pandas as pd
import numpy as np
import json
from pathlib import Path
from datetime import datetime

# -----------------------------
# Setup Project Paths
# -----------------------------
DATA_DIR = Path("data")  # Folder for datasets and outputs
CSV_FILE = DATA_DIR / "titanic.csv"  # Titanic dataset CSV
JSON_FILE = DATA_DIR / "titanic_data.json"  # JSON export file

# Create data folder if it doesn't exist
DATA_DIR.mkdir(exist_ok=True)

print("Project setup complete!")
print(f"Data directory: {DATA_DIR}")
print(f"CSV file location: {CSV_FILE}")


Project setup complete!
Data directory: data
CSV file location: data\titanic.csv


# Step 2: Importing and Exploring the Data

**Objective:**  
Load the Titanic dataset into a pandas DataFrame and explore its structure.

**What happens in this step:**  
1. We read the CSV file (`titanic.csv`) into a pandas DataFrame using `pd.read_csv()`.  
2. We inspect the dataset’s shape (rows × columns) to understand its size.  
3. We list all columns to know which features are available.  
4. We display the first few rows to get a quick look at the data and verify it loaded correctly.

**Why this matters:**  
- Loading and exploring data is a fundamental first step in any data analysis workflow.  
- It allows us to understand what features exist, detect potential issues (like missing values), and plan feature engineering.


In [22]:
# -----------------------------
# Step 2: Importing and Exploring the Data
# -----------------------------

# Load Titanic dataset
df = pd.read_csv(CSV_FILE)

# Basic dataset information
print(f"Dataset loaded successfully! Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

# Display first few rows
print("\nFirst few rows of the dataset:")
print(df.head())



Dataset loaded successfully! Shape: (891, 12)

Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

First few rows of the dataset:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0 

# Step 3: Calculating Descriptive Statistics

**Objective:**  
Calculate summary statistics for numeric columns in the Titanic dataset, such as mean, median, and standard deviation.

**What happens in this step:**  
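1. We identify all numeric columns in the DataFrame (Age, Fare, SibSp, Parch, etc.). This also picks up `PassengerId`, `Survived`, and `Pclass` because they are stored as numbers; their statistics are less informative, although the mean of `Survived` equals the overall survival rate.  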
2. For each numeric column, we calculate:  
   - Mean: average value  
   - Median: middle value  
   - Standard deviation: measure of variability  
3. We display the results in a clear, readable format so we can understand the distribution of numeric features.

**Why this matters:**  
- Summary statistics help detect unusual values or outliers.  
- They give a quick overview of data distributions, which is crucial before feature engineering or modeling.
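
The same three statistics can be computed in one call. The sketch below assumes pandas' `DataFrame.agg` with a list of function names; the `toy` frame is illustrative (a few values in the style of the dataset plus one missing age), not the real data.

```python
import pandas as pd

# Small illustrative frame; None becomes NaN in a numeric column
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, None],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# One row per statistic, then transpose so each column of `toy` becomes a row
summary = toy.agg(["mean", "median", "std"]).T
print(summary)
```

Missing values are skipped by default, so the `Age` statistics here are based on three values rather than four. The same applies to the real `Age` column, whose mean and median are computed only over passengers with a recorded age.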


In [23]:
# -----------------------------
# Step 3: Calculating Descriptive Statistics
# -----------------------------

# Display all columns and prevent truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Select numeric columns only
numeric_columns = df.select_dtypes(include=np.number).columns

# Calculate statistics
stats = pd.DataFrame({
    'mean': df[numeric_columns].mean(),
    'median': df[numeric_columns].median(),
    'std': df[numeric_columns].std()
})

print("\nDescriptive Statistics for Numeric Columns:")
print(stats)



Descriptive Statistics for Numeric Columns:
                   mean    median         std
PassengerId  446.000000  446.0000  257.353842
Survived       0.383838    0.0000    0.486592
Pclass         2.308642    3.0000    0.836071
Age           29.699118   28.0000   14.526497
SibSp          0.523008    0.0000    1.102743
Parch          0.381594    0.0000    0.806057
Fare          32.204208   14.4542   49.693429


# Step 4: Identifying Missing Values

**Objective:**  
Detect missing values in the Titanic dataset and understand their distribution.

**What happens in this step:**  
1. For each column, we count how many values are missing (`NaN`).  
2. We calculate the percentage of missing values relative to the total number of rows.  
3. We store these results in a dictionary for easy access.  
4. We print a readable summary to quickly identify which columns have the most missing data.

**Why this matters:**  
- Missing values can affect analysis and modeling.  
- Understanding which columns are incomplete helps decide how to handle them (fill, drop, or leave as-is).  
- For example, in this dataset `Age` is missing for 177 passengers (19.87%) and `Cabin` for 687 (77.10%), which shapes how these columns are handled during feature engineering.
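
The count and percentage can also be computed without an explicit loop. This is a minimal sketch on an illustrative `toy` frame; the column names `missing_count` and `missing_percent` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

missing_summary = pd.DataFrame({
    # isna() marks missing cells; summing booleans counts them
    "missing_count": toy.isna().sum(),
    # the mean of a boolean column is the share of True values
    "missing_percent": (toy.isna().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

Sorting by count puts the most incomplete columns first, which makes the summary easier to scan on wider datasets.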

In [24]:
# -----------------------------
# Step 4: Identifying Missing Values
# -----------------------------

print("\n" + "="*50)
print("MISSING VALUES ANALYSIS")
print("="*50)

# Dictionary to store missing value info
missing_data = {}

# Loop through each column in the DataFrame
for col in df.columns:
    # Count how many missing values in this column
    missing_count = df[col].isnull().sum()
    
    # Calculate percentage of missing values
    missing_percent = (missing_count / len(df)) * 100
    
    # Store results in the dictionary
    missing_data[col] = {'Missing Count': missing_count, 'Missing Percent': missing_percent}

# Print results in a readable format
# (the loop variable is named `info` so it does not overwrite the `stats` DataFrame from Step 3)
for col, info in missing_data.items():
    print(f"{col}: {info['Missing Count']} missing ({info['Missing Percent']:.2f}%)")



MISSING VALUES ANALYSIS
PassengerId: 0 missing (0.00%)
Survived: 0 missing (0.00%)
Pclass: 0 missing (0.00%)
Name: 0 missing (0.00%)
Sex: 0 missing (0.00%)
Age: 177 missing (19.87%)
SibSp: 0 missing (0.00%)
Parch: 0 missing (0.00%)
Ticket: 0 missing (0.00%)
Fare: 0 missing (0.00%)
Cabin: 687 missing (77.10%)
Embarked: 2 missing (0.22%)


# Step 5: Feature Engineering

**Objective:**  
Create new features that may help distinguish between passengers who survived and those who did not.

**What happens in this step:**  
1. **FamilySize**: Calculates the total number of family members aboard including the passenger (`SibSp + Parch + 1`).  
2. **IsAlone**: Indicates whether the passenger was traveling alone (`1` if FamilySize = 1, else `0`).  
3. **AgeGroup**: Categorizes passengers by age into meaningful groups:
   - Child: < 18  
   - Young Adult: 18–29  
   - Adult: 30–49  
   - Senior: 50+  
   - Unknown: missing age  

4. After creating these features, we compare them between survivors and non-survivors to see if they help explain survival patterns.

**Why this matters:**  
- Feature engineering can reveal hidden patterns and improve predictive power for survival analysis.  
- Understanding these patterns informs later steps like modeling or more advanced analysis.
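
The age bands in this step can also be written as bin edges. The sketch below is an alternative formulation, assuming pandas' `pd.cut` with left-closed bins (`right=False`); the `ages` series is illustrative and chosen to hit each boundary.

```python
import numpy as np
import pandas as pd

# Illustrative ages, including the boundary values 18, 30 and 50 and one missing age
ages = pd.Series([2.0, 17.9, 18.0, 29.0, 30.0, 49.5, 50.0, np.nan])

bins = [0, 18, 30, 50, np.inf]          # [0, 18), [18, 30), [30, 50), [50, inf)
labels = ["Child", "Young Adult", "Adult", "Senior"]

age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
# Missing ages fall outside every bin, so give them an explicit category
age_group = age_group.cat.add_categories("Unknown").fillna("Unknown")

print(age_group.tolist())
```

With `right=False` each bin includes its lower edge and excludes its upper edge, which matches the `<` comparisons in the step description (an 18-year-old is a `Young Adult`, a 50-year-old is a `Senior`).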

In [25]:
# -----------------------------
# Step 5: Feature Engineering
# -----------------------------

# Create a copy of the dataframe for feature engineering
df_features = df.copy()

# -----------------------------
# Feature 1: Family Size
# -----------------------------
# Total family members aboard including the passenger
df_features['FamilySize'] = df_features['SibSp'] + df_features['Parch'] + 1
print(df_features[['SibSp', 'Parch', 'FamilySize']].head(10))

# -----------------------------
# Feature 2: Is Alone
# -----------------------------
# 1 if passenger has no family aboard, else 0
df_features['IsAlone'] = (df_features['FamilySize'] == 1).astype(int)
print(df_features[['FamilySize', 'IsAlone']].head(10))

# -----------------------------
# Feature 3: Age Groups
# -----------------------------
def categorize_age(age):
    """Categorize age into groups"""
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return 'Child'
    elif age < 30:
        return 'Young Adult'
    elif age < 50:
        return 'Adult'
    else:
        return 'Senior'

# Apply age categorization
df_features['AgeGroup'] = df_features['Age'].apply(categorize_age)
print(df_features[['Age', 'AgeGroup']].head(10))

# -----------------------------
# Analyze feature differences between survivors and non-survivors
# -----------------------------
print("\n" + "="*50)
print("FEATURE ANALYSIS: SURVIVED vs NOT SURVIVED")
print("="*50)

# Family Size statistics by survival
print("\nFamily Size by Survival:")
family_survival = df_features.groupby('Survived')['FamilySize'].agg(['mean', 'median', 'std'])
print(family_survival)

# Compare Family Size between survivors and non-survivors
print("\n" + "="*50)
print("FEATURE DIFFERENTIATION ANALYSIS")
print("="*50)

# Separate survivors and non-survivors
survived = df_features[df_features['Survived'] == 1]
not_survived = df_features[df_features['Survived'] == 0]

# Family Size comparison
print("\nFamily Size:")
print(f"  Survived mean: {survived['FamilySize'].mean():.2f}")
print(f"  Not Survived mean: {not_survived['FamilySize'].mean():.2f}")
print(f"  Difference: {abs(survived['FamilySize'].mean() - not_survived['FamilySize'].mean()):.2f}")



   SibSp  Parch  FamilySize
0      1      0           2
1      1      0           2
2      0      0           1
3      1      0           2
4      0      0           1
5      0      0           1
6      0      0           1
7      3      1           5
8      0      2           3
9      1      0           2
   FamilySize  IsAlone
0           2        0
1           2        0
2           1        1
3           2        0
4           1        1
5           1        1
6           1        1
7           5        0
8           3        0
9           2        0
    Age     AgeGroup
0  22.0  Young Adult
1  38.0        Adult
2  26.0  Young Adult
3  35.0        Adult
4  35.0        Adult
5   NaN      Unknown
6  54.0       Senior
7   2.0        Child
8  27.0  Young Adult
9  14.0        Child

FEATURE ANALYSIS: SURVIVED vs NOT SURVIVED

Family Size by Survival:
              mean  median       std
Survived                            
0         1.883424     1.0  1.830669
1         1.938596     2.0 

# Step 5.1: Handling Missing Data (Feature Engineering)

**Objective:**  
Fill in missing values in key columns and create new features to improve data completeness and usability.

**What happens in this step:**  
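1. **Age**: Replace missing values with the median age of all passengers, so the `Age` feature is complete and can be used in `AgeGroup` and further analysis. Because the median (28) falls in the 18–29 range, every imputed passenger is assigned to `Young Adult` and the `Unknown` group no longer appears.  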
2. **Cabin**: Too many missing values (~77%), so instead of filling, we create a new feature `HasCabin`:  
   - `1` if the passenger has a cabin listed  
   - `0` if the cabin is missing  

**Why this matters:**  
- Missing values can distort statistics and break downstream calculations or models.  
- Handling missing data is part of **feature engineering** because it transforms raw, incomplete data into usable, meaningful features.  
- Note: Passengers without cabin info may often be in **3rd class**, but this is an assumption that would need confirmation.

**Next steps:**  
- Updated `Age` and `HasCabin` will later be included in our JSON export.
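
The assumption about missing cabins and passenger class could be checked with a cross-tabulation. This is a sketch on made-up rows, assuming pandas' `pd.crosstab` with `normalize="index"`; the values in `toy` are illustrative and say nothing about the real distribution.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Cabin": ["C85", None, None, None, None, "G6"],
})
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

# Share of passengers with and without a cabin, within each class (rows sum to 1)
share = pd.crosstab(toy["Pclass"], toy["HasCabin"], normalize="index")
print(share)
```

Running the same two lines on `df_features` would show, per class, what fraction of passengers has `HasCabin == 0`, which is the figure needed to confirm or reject the assumption.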

In [26]:
# -----------------------------
# Step 5.1: Handling Missing Data
# -----------------------------

# Replace missing Age with the median age
median_age = df_features['Age'].median()
df_features['Age'] = df_features['Age'].fillna(median_age)

# Update AgeGroup after filling missing ages
df_features['AgeGroup'] = df_features['Age'].apply(categorize_age)

# Create HasCabin feature: 1 if cabin exists, 0 if missing
df_features['HasCabin'] = df_features['Cabin'].apply(lambda x: 0 if pd.isna(x) else 1)

# Optional: quick verification
print("\nUpdated missing data handling:")
print(df_features[['Age', 'AgeGroup', 'Cabin', 'HasCabin']].head(10))

# Verify no more missing ages
missing_ages = df_features['Age'].isnull().sum()
print(f"\nMissing ages after imputation: {missing_ages}")

# Check HasCabin distribution
has_cabin_counts = df_features['HasCabin'].value_counts()
print("\nHasCabin feature counts:")
print(has_cabin_counts)



Updated missing data handling:
    Age     AgeGroup Cabin  HasCabin
0  22.0  Young Adult   NaN         0
1  38.0        Adult   C85         1
2  26.0  Young Adult   NaN         0
3  35.0        Adult  C123         1
4  35.0        Adult   NaN         0
5  28.0  Young Adult   NaN         0
6  54.0       Senior   E46         1
7   2.0        Child   NaN         0
8  27.0  Young Adult   NaN         0
9  14.0        Child   NaN         0

Missing ages after imputation: 0

HasCabin feature counts:
HasCabin
0    687
1    204
Name: count, dtype: int64


# Step 5.2: Creating the FarePerPerson Feature

**Objective:**  
Create a new feature `FarePerPerson` to capture the fare paid per individual passenger.

**What happens in this step:**  
- We divide the total `Fare` by `FamilySize` to normalize the fare per person.  

**Why this matters:**  
- Passengers paying more per person may reflect higher socio-economic status, which could correlate with survival chances.  
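
As a worked example of the division, the sketch below rebuilds `FamilySize` and `FarePerPerson` for three rows whose `Fare`, `SibSp`, and `Parch` values match the first rows shown earlier in the notebook. The `toy` frame is a stand-in for `df_features`.

```python
import pandas as pd

# Three rows copied from the head of the dataset shown earlier
toy = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# FamilySize counts the passenger too, so it is always at least 1
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["FarePerPerson"] = toy["Fare"] / toy["FamilySize"]

print(toy[["Fare", "FamilySize", "FarePerPerson"]])
```

Because `FamilySize` includes the passenger, the divisor can never be zero, so the division cannot produce infinite values.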


In [32]:
# Step 5.2: Fare per person feature
# Calculate fare per person by dividing total Fare by FamilySize
df_features['FarePerPerson'] = df_features['Fare'] / df_features['FamilySize']

# Display first few rows to verify
print(df_features[['Fare', 'FamilySize', 'FarePerPerson']].head(10))

      Fare  FamilySize  FarePerPerson
0   7.2500           2        3.62500
1  71.2833           2       35.64165
2   7.9250           1        7.92500
3  53.1000           2       26.55000
4   8.0500           1        8.05000
5   8.4583           1        8.45830
6  51.8625           1       51.86250
7  21.0750           5        4.21500
8  11.1333           3        3.71110
9  30.0708           2       15.03540


# Step 5.3: Extracting Titles from Names
**Objective:** Derive a `Title` feature from the passenger's `Name`.  
Titles like Mr, Mrs, Miss, Master, etc., can provide useful information about social status, age, or gender, which may help in predicting survival.

**Notes:**  
- Titles are extracted from the `Name` column using string parsing.  
- Some rare titles may be grouped into `Other` to simplify categories.  
- This is part of feature engineering, so it complements the previous steps (FamilySize, IsAlone, AgeGroup, Age/Cabin handling).  
- Later, this feature will be included in the JSON export.
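
The string parsing can also be done on the whole column at once. The sketch below is an alternative, assuming pandas' `Series.str.extract` with a regular expression that captures the text between the first comma and the following period; the `names` series reuses a few names from the dataset plus one missing value.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    None,  # a missing name stays missing
])

# ",\s*" skips the comma and spaces; "([^.]+)" captures everything up to the period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Map spelling variants onto a standard title
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

print(titles.tolist())
```

This version leaves unmatched or missing names as missing values instead of returning `'Other'`, so grouping rare titles would still need a separate step.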


In [None]:
# -----------------------------
# Step 5.3: Extract Titles from Names
# -----------------------------

# Titles kept as their own category; anything else becomes 'Other'
KNOWN_TITLES = {'Mr', 'Mrs', 'Miss', 'Master', 'Dr', 'Rev', 'Col', 'Major',
                'Sir', 'Lady', 'Capt', 'Don', 'Jonkheer'}
# Spelling variants mapped to a standard title
TITLE_ALIASES = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}

def extract_title(name):
    """Extract the title from a passenger name such as 'Braund, Mr. Owen Harris'."""
    if pd.isna(name):
        return None
    try:
        # The title sits between the first comma and the following period
        title = name.split(',')[1].split('.')[0].strip()
    except (IndexError, AttributeError):
        # No comma in the name, or the value is not a string
        return 'Other'
    title = TITLE_ALIASES.get(title, title)
    return title if title in KNOWN_TITLES else 'Other'

# Apply function to create Title column
df_features['Title'] = df_features['Name'].apply(extract_title)

# Preview the new feature
print(df_features[['Name', 'Title']].head(10))

# Optional: check the distribution of titles
print("\nTitle counts:")
print(df_features['Title'].value_counts())


                                                Name   Title
0                            Braund, Mr. Owen Harris      Mr
1  Cumings, Mrs. John Bradley (Florence Briggs Th...     Mrs
2                             Heikkinen, Miss. Laina    Miss
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)     Mrs
4                           Allen, Mr. William Henry      Mr
5                                   Moran, Mr. James      Mr
6                            McCarthy, Mr. Timothy J      Mr
7                     Palsson, Master. Gosta Leonard  Master
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)     Mrs
9                Nasser, Mrs. Nicholas (Adele Achem)     Mrs

Title counts:
Title
Mr          517
Miss        185
Mrs         126
Master       40
Dr            7
Rev           6
Major         2
Col           2
Don           1
Lady          1
Sir           1
Capt          1
Other         1
Jonkheer      1
Name: count, dtype: int64


# Step 6: Creating a Data Export Class with Comprehensive Statistics

**Objective:**  
Encapsulate the Titanic dataset and passenger data in Python classes and export everything to JSON. Include detailed summary statistics by sex, age group, and engineered features.

---

## Classes

### **Passenger**
- Represents a single passenger.  
- Stores all attributes, including engineered features like `FamilySize`, `IsAlone`, `AgeGroup`, `HasCabin`, and `FarePerPerson`.  
- Handles missing values and converts data to appropriate types.  
- `to_dict()` prepares the passenger for JSON export.  

### **TitanicDataset**
- Represents the full dataset.  
- Converts each row into a `Passenger` object.  
- Stores metadata such as total passengers and survival rate.  
- `get_summary_stats()` computes comprehensive statistics:
  - Total survived / not survived  
  - Average age, fare, and fare per person  
  - Survival by sex with percentages relative to group and total  
  - Survival by age group with percentages relative to group and total  
- `to_json()` exports all passengers and summary stats to a JSON file in the `data` folder.
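
One reason the `Passenger` class converts missing values to `None` is that `json.dumps` writes a float NaN as the bare token `NaN`, which is not valid JSON. The sketch below shows the idea with a dataclass; `PassengerRecord` and `none_if_nan` are hypothetical names for this example, not the notebook's classes, and the field values are illustrative.

```python
import json
import math
from dataclasses import asdict, dataclass
from typing import Optional


def none_if_nan(value):
    """Return None for a float NaN, otherwise the value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


@dataclass
class PassengerRecord:
    """Cut-down stand-in for the Passenger class, with only four fields."""
    passenger_id: int
    name: str
    age: Optional[float]
    survived: int

    def to_dict(self):
        # asdict() turns the dataclass into a plain dict; NaN values become None
        return {key: none_if_nan(value) for key, value in asdict(self).items()}


record = PassengerRecord(passenger_id=6, name="Moran, Mr. James",
                         age=float("nan"), survived=0)
payload = json.dumps(record.to_dict())
print(payload)
```

A `None` value is written as `null`, which any JSON parser accepts, whereas a stray `NaN` could make the exported file unreadable for stricter consumers.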


In [35]:
# -----------------------------
# Step 6: Data Export Class with Comprehensive Statistics
# -----------------------------

class Passenger:
    """Represents a passenger with all attributes and engineered features."""
    def __init__(self, passenger_id, name, age, sex, survived, pclass, 
                 fare, embarked=None, family_size=None, is_alone=None, 
                 title=None, age_group=None, has_cabin=None, fare_per_person=None):
        self.passenger_id = int(passenger_id) if pd.notna(passenger_id) else None
        self.name = str(name) if pd.notna(name) else None
        self.age = float(age) if pd.notna(age) else None
        self.sex = str(sex) if pd.notna(sex) else None
        self.survived = int(survived) if pd.notna(survived) else None
        self.pclass = int(pclass) if pd.notna(pclass) else None
        self.fare = float(fare) if pd.notna(fare) else None
        self.embarked = str(embarked) if pd.notna(embarked) else None
        self.family_size = int(family_size) if pd.notna(family_size) else None
        self.is_alone = int(is_alone) if pd.notna(is_alone) else None
        self.title = str(title) if pd.notna(title) else None
        self.age_group = str(age_group) if pd.notna(age_group) else None
        self.has_cabin = int(has_cabin) if pd.notna(has_cabin) else None
        self.fare_per_person = float(fare_per_person) if pd.notna(fare_per_person) else None

    def to_dict(self):
        return {
            'passenger_id': self.passenger_id,
            'name': self.name,
            'age': self.age,
            'sex': self.sex,
            'survived': self.survived,
            'pclass': self.pclass,
            'fare': self.fare,
            'embarked': self.embarked,
            'family_size': self.family_size,
            'is_alone': self.is_alone,
            'title': self.title,
            'age_group': self.age_group,
            'has_cabin': self.has_cabin,
            'fare_per_person': self.fare_per_person
        }

class TitanicDataset:
    """Represents the entire Titanic dataset with comprehensive stats and JSON export."""
    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.passengers = []
        self._create_passengers()

    def _create_passengers(self):
        for idx, row in self.dataframe.iterrows():
            passenger = Passenger(
                passenger_id=row.get('PassengerId', idx),
                name=row.get('Name'),
                age=row.get('Age'),
                sex=row.get('Sex'),
                survived=row.get('Survived'),
                pclass=row.get('Pclass'),
                fare=row.get('Fare'),
                embarked=row.get('Embarked'),
                family_size=row.get('FamilySize'),
                is_alone=row.get('IsAlone'),
                title=row.get('Title'),
                age_group=row.get('AgeGroup'),
                has_cabin=row.get('HasCabin'),
                fare_per_person=row.get('FarePerPerson')
            )
            self.passengers.append(passenger)

    def get_summary_stats(self):
        total_passengers = len(self.passengers)
        survived_count = sum(1 for p in self.passengers if p.survived == 1)
        did_not_survive_count = total_passengers - survived_count

        ages = [p.age for p in self.passengers if p.age is not None]
        fares = [p.fare for p in self.passengers if p.fare is not None]
        fare_per_person = [p.fare_per_person for p in self.passengers if p.fare_per_person is not None]

        # Survival by sex
        male = [p for p in self.passengers if p.sex == 'male']
        female = [p for p in self.passengers if p.sex == 'female']
        male_survived = sum(1 for p in male if p.survived == 1)
        female_survived = sum(1 for p in female if p.survived == 1)

        # Survival by age group
        age_groups = {}
        for p in self.passengers:
            group = p.age_group if p.age_group else 'Unknown'
            if group not in age_groups:
                age_groups[group] = {'total': 0, 'survived': 0}
            age_groups[group]['total'] += 1
            if p.survived == 1:
                age_groups[group]['survived'] += 1
        for g, stats in age_groups.items():
            stats['survived_pct_of_group'] = round(stats['survived'] / stats['total'] * 100, 2) if stats['total'] else 0
            stats['survived_pct_of_total'] = round(stats['survived'] / total_passengers * 100, 2) if total_passengers else 0

        return {
            'total_passengers': total_passengers,
            'survived': survived_count,
            'did_not_survive': did_not_survive_count,
            'average_age': round(sum(ages)/len(ages),2) if ages else None,
            'average_fare': round(sum(fares)/len(fares),2) if fares else None,
            'average_fare_per_person': round(sum(fare_per_person)/len(fare_per_person),2) if fare_per_person else None,
            'male_stats': {
                'count': len(male),
                'survived': male_survived,
                'survived_pct_of_males': round(male_survived/len(male)*100,2) if male else 0,
                'survived_pct_of_total': round(male_survived/total_passengers*100,2) if total_passengers else 0
            },
            'female_stats': {
                'count': len(female),
                'survived': female_survived,
                'survived_pct_of_females': round(female_survived/len(female)*100,2) if female else 0,
                'survived_pct_of_total': round(female_survived/total_passengers*100,2) if total_passengers else 0
            },
            'age_group_stats': age_groups
        }

    def to_json(self, filename='titanic_data.json'):
        filepath = DATA_DIR / filename
        data = {
            'metadata': {
                'dataset_name': 'Titanic Passenger Dataset',
                'export_date': datetime.now().isoformat(),
                'total_passengers': len(self.passengers),
                'survival_rate': round(sum(1 for p in self.passengers if p.survived==1)/len(self.passengers)*100,2) if self.passengers else 0
            },
            'passengers': [p.to_dict() for p in self.passengers],
            'summary_stats': self.get_summary_stats()
        }
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2)
        print(f"Data exported to {filepath}")
        return data

# -----------------------------
# Create dataset object and export
# -----------------------------
if 'df_features' in locals() and not df_features.empty:
    dataset = TitanicDataset(df_features)
    print(f"\nTotal passengers: {len(dataset.passengers)}")

    stats = dataset.get_summary_stats()

    # Nicely print summary statistics
    print("\nSummary Statistics:")
    print(f"Total Survived: {stats['survived']}")
    print(f"Total Did Not Survive: {stats['did_not_survive']}")
    print(f"Average Age: {stats['average_age']}")
    print(f"Average Fare: {stats['average_fare']}")
    print(f"Average Fare Per Person: {stats['average_fare_per_person']}\n")

    print("=== Survival by Sex ===")
    for sex, s in [('Male', stats['male_stats']), ('Female', stats['female_stats'])]:
        print(f"{sex}: {s['survived']}/{s['count']} survived "
              f"({s['survived_pct_of_males' if sex=='Male' else 'survived_pct_of_females']}% of {sex.lower()}s, "
              f"{s['survived_pct_of_total']}% of total)")

    print("\n=== Survival by Age Group ===")
    for group, s in stats['age_group_stats'].items():
        print(f"{group}: {s['survived']}/{s['total']} survived "
              f"({s['survived_pct_of_group']}% of group, {s['survived_pct_of_total']}% of total)")

    dataset.to_json('titanic_data.json')




Total passengers: 891

Summary Statistics:
Total Survived: 342
Total Did Not Survive: 549
Average Age: 29.36
Average Fare: 32.2
Average Fare Per Person: 19.92

=== Survival by Sex ===
Male: 109/577 survived (18.89% of males, 12.23% of total)
Female: 233/314 survived (74.2% of females, 26.15% of total)

=== Survival by Age Group ===
Young Adult: 147/448 survived (32.81% of group, 16.5% of total)
Adult: 107/256 survived (41.8% of group, 12.01% of total)
Senior: 27/74 survived (36.49% of group, 3.03% of total)
Child: 61/113 survived (53.98% of group, 6.85% of total)
Data exported to data\titanic_data.json


# Step 7: Testing and Validation

**Objective:**  
Verify that the JSON export is correct, complete, and valid.

**What we are doing:**
1. **Load JSON** from the `data` folder.
2. **Inspect Metadata**: Check total passengers, survival rate, and export date.
3. **Verify Required Fields**: Make sure every passenger has all mandatory fields (`passenger_id`, `name`, `age`, etc.).
4. **Check Engineered Features**: Verify that the features created during feature engineering (`family_size`, `is_alone`, `age_group`, `has_cabin`, `fare_per_person`) are present for all passengers.
5. **Basic JSON Validity**: Serialize the loaded data back to a string to confirm it contains only JSON-compatible values. The file itself has already been parsed successfully by the time it is loaded.

**Notes:**
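- Missing engineered features or required fields usually indicate issues in feature engineering or JSON export. The exception is `embarked`: two passengers are expected to be flagged, because `Embarked` is missing for 2 rows in the source data (see Step 4).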
- This step ensures that all preprocessing, cleaning, and feature creation steps are correctly captured in the JSON output.
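
The checks in this step can also be collected into a single function that returns a list of problems, which is easier to reuse or assert on than printed output. This is a minimal sketch: `find_problems` and the `toy_export` dictionary are hypothetical and only mirror the shape of the exported JSON.

```python
def find_problems(data, required_fields):
    """Return a list of human-readable problems found in an export dictionary."""
    problems = []
    passengers = data.get("passengers", [])

    # The metadata count should agree with the number of passenger records
    declared = data.get("metadata", {}).get("total_passengers")
    if declared != len(passengers):
        problems.append(f"metadata says {declared} passengers, found {len(passengers)}")

    # A field counts as missing if the key is absent or its value is None
    for i, passenger in enumerate(passengers):
        missing = [field for field in required_fields if passenger.get(field) is None]
        if missing:
            problems.append(f"passenger {i} missing {missing}")
    return problems


# Illustrative export with one missing value
toy_export = {
    "metadata": {"total_passengers": 2},
    "passengers": [
        {"passenger_id": 1, "name": "A", "embarked": "S"},
        {"passenger_id": 2, "name": "B", "embarked": None},
    ],
}

problems = find_problems(toy_export, ["passenger_id", "name", "embarked"])
print(problems)
```

An empty list means every check passed, so the result can be used directly in a condition or a test.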

In [36]:
# -----------------------------
# Step 7: Testing and Validation of JSON Export
# -----------------------------

# Full path to JSON file in the existing data folder
JSON_FILE = DATA_DIR / "titanic_data.json"

# Load the JSON file exported in Step 6
with open(JSON_FILE, 'r', encoding='utf-8') as f:
    json_data = json.load(f)

# -----------------------------
# 1. Print Metadata
# -----------------------------
print("\n=== METADATA ===")
for key, value in json_data.get('metadata', {}).items():
    print(f"{key}: {value}")

# -----------------------------
# 2. Inspect Passengers
# -----------------------------
passengers = json_data.get('passengers', [])
total_passengers = len(passengers)
print(f"\nTotal passengers in JSON: {total_passengers}")

# -----------------------------
# 3. Check Required Fields
# -----------------------------
required_fields = [
    'passenger_id', 'name', 'age', 'sex', 'survived',
    'pclass', 'fare', 'embarked', 'family_size', 'is_alone', 'title'
]

print("\n=== REQUIRED FIELDS CHECK ===")
missing_any = False
for i, p in enumerate(passengers):
    missing = [field for field in required_fields if field not in p or p[field] is None]
    if missing:
        missing_any = True
        print(f"Passenger {i} missing fields: {missing}")
if not missing_any:
    print("All passengers have all required fields.")

# -----------------------------
# 4. Check Engineered Features
# -----------------------------
engineered_fields = ['family_size', 'is_alone', 'age_group', 'has_cabin', 'fare_per_person']

print("\n=== ENGINEERED FEATURES CHECK ===")
missing_eng_any = False
for i, p in enumerate(passengers):
    missing_eng = [field for field in engineered_fields if field not in p or p[field] is None]
    if missing_eng:
        missing_eng_any = True
        print(f"Passenger {i} missing engineered fields: {missing_eng}")
if not missing_eng_any:
    print("All passengers have all engineered features.")

# -----------------------------
# 5. Validate Engineered Feature Counts
# -----------------------------
print("\n=== ENGINEERED FEATURE COUNTS ===")
# HasCabin
has_cabin_count = sum(1 for p in passengers if p.get('has_cabin') == 1)
no_cabin_count = total_passengers - has_cabin_count
print(f"Has Cabin: {has_cabin_count} ({has_cabin_count/total_passengers*100:.2f}%)")
print(f"No Cabin: {no_cabin_count} ({no_cabin_count/total_passengers*100:.2f}%)")

# IsAlone
alone_count = sum(1 for p in passengers if p.get('is_alone') == 1)
not_alone_count = total_passengers - alone_count
print(f"Is Alone: {alone_count} ({alone_count/total_passengers*100:.2f}%)")
print(f"Not Alone: {not_alone_count} ({not_alone_count/total_passengers*100:.2f}%)")

# -----------------------------
# 6. Validate Summary Statistics
# -----------------------------
summary_stats = json_data.get('summary_stats', {})
print("\n=== SUMMARY STATISTICS ===")

# Basic stats
print(f"Total Survived: {summary_stats.get('survived')}")
print(f"Total Did Not Survive: {summary_stats.get('did_not_survive')}")
print(f"Average Age: {summary_stats.get('average_age')}")
print(f"Average Fare: {summary_stats.get('average_fare')}")
print(f"Average Fare Per Person: {summary_stats.get('average_fare_per_person')}")

# Survival by sex
print("\n--- Survival by Sex ---")
for sex, key in [('Male', 'male_stats'), ('Female', 'female_stats')]:
    s = summary_stats.get(key, {})
    print(f"{sex}: {s.get('survived')}/{s.get('count')} survived "
          f"({s.get('survived_pct_of_males' if sex=='Male' else 'survived_pct_of_females')}% of {sex.lower()}s, "
          f"{s.get('survived_pct_of_total')}% of total)")

# Survival by age group
print("\n--- Survival by Age Group ---")
for group, g in summary_stats.get('age_group_stats', {}).items():
    print(f"{group}: {g.get('survived')}/{g.get('total')} survived "
          f"({g.get('survived_pct_of_group'):.2f}% of group, {g.get('survived_pct_of_total'):.2f}% of total)")

# -----------------------------
# 7. JSON Validity Check
# -----------------------------
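# json.load() above already proved the file parses; here we confirm the
# loaded data survives a serialize/parse round trip unchanged
try:
    json_string = json.dumps(json_data)  # Convert back to string
    if json.loads(json_string) == json_data:
        print("\nJSON is valid and round-trips without changes!")
    else:
        print("\nJSON round trip changed the data!")
except (TypeError, ValueError) as e:
    print(f"\nJSON serialization error: {e}")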



=== METADATA ===
dataset_name: Titanic Passenger Dataset
export_date: 2026-02-06T20:51:42.574743
total_passengers: 891
survival_rate: 38.38

Total passengers in JSON: 891

=== REQUIRED FIELDS CHECK ===
Passenger 61 missing fields: ['embarked']
Passenger 829 missing fields: ['embarked']

=== ENGINEERED FEATURES CHECK ===
All passengers have all engineered features.

=== ENGINEERED FEATURE COUNTS ===
Has Cabin: 204 (22.90%)
No Cabin: 687 (77.10%)
Is Alone: 537 (60.27%)
Not Alone: 354 (39.73%)

=== SUMMARY STATISTICS ===
Total Survived: 342
Total Did Not Survive: 549
Average Age: 29.36
Average Fare: 32.2
Average Fare Per Person: 19.92

--- Survival by Sex ---
Male: 109/577 survived (18.89% of males, 12.23% of total)
Female: 233/314 survived (74.2% of females, 26.15% of total)

--- Survival by Age Group ---
Young Adult: 147/448 survived (32.81% of group, 16.50% of total)
Adult: 107/256 survived (41.80% of group, 12.01% of total)
Senior: 27/74 survived (36.49% of group, 3.03% of total)
Chi