To verify that you have mastered the "Architect" mindset, I have designed a Data Engineering Case Study based on the Titanic dataset found in your MDA repository.

This is not a "tutorial" assignment. It is a Production Simulation. You are acting as the Lead Data Engineer for a maritime insurance firm analyzing historical disaster data to build a risk model.

The Objective
Transform the raw titanic.csv into a "Gold-Standard" Analytical Base Table (ABT). You must clean, type-cast, feature-engineer, and aggregate the data using the Modern Pandas (2.0+) standards we discussed.

Phase 1: Ingestion & Schema Enforcement (Data Types)

Scenario: The raw data is messy. Integers are loaded as floats due to NaNs, categoricals are strings, and memory usage is unoptimized. Task: Create an ingestion function load_and_optimize_titanic(path: str) that:

Loads only the relevant columns (consult the Data Dictionary).

Strictly enforces these Modern Pandas types:

string[pyarrow] for text (Name, Ticket).

category for low-cardinality fields (Sex, Embarked, Pclass).

Int16 (Nullable) for Age (rounded) and SibSp/Parch.

Float32 for Fare (to save memory).

Refactor: The pclass column is numeric (1, 2, 3) but logically ordinal. Convert it to an Ordered Categorical type so that 1st > 2nd > 3rd.

Architect's Hint: Don't forget pd.options.mode.copy_on_write = True. Use .assign() for the casting.


Phase 2: 

The "Bouncer" & The "Broadcaster" (Filtering & Transform)

Scenario: We need to handle missing data and outliers before aggregation. Task: Extend your pipeline with a function clean_demographics(df) that:

Filter (The Bouncer): We suspect data corruption in the fare column. Remove any group of passengers sharing the same ticket number if their group size is greater than 8 (likely data entry errors or non-standard tickets).

Transform (The Broadcaster): Fill missing age values. Instead of a generic mean, fill missing ages with the median age of their specific sex and pclass group. (e.g., A missing age for a Female in 1st Class gets the median age of all Females in 1st Class).

Architect's Hint: Use .groupby([...])['age'].transform(...) combined with .fillna().


Phase 3: Feature Engineering (Binning & Method Chaining)

Scenario: "Age" and "Fare" are too granular for risk reporting. We need logical buckets. Task: Create a function engineer_features(df) that adds two columns:

age_group: Bin age into: ['Child' (0-12), 'Teen' (13-17), 'Adult' (18-59), 'Senior' (60+)]. Use specific bin edges.

fare_quantile: Discretize fare into 5 equal-frequency buckets (Quantiles) labeled ['Very Low', 'Low', 'Med', 'High', 'Very High'].

Architect's Hint: Use pd.cut for Age and pd.qcut for Fare. Ensure you handle NaNs explicitly if any remain.

Phase 4: The Executive Report (Grouping & Aggregation)

Scenario: The VP of Risk wants a summary table showing survival rates by Class and Life Stage. Task: Create the final aggregation function generate_risk_report(df) that:

Groups by pclass and your new age_group.

Calculates:

Survival Rate: Mean of survived (formatted as percentage).

Avg Fare: Mean of fare.

Total Passengers: Count of passengers.

Most Common Embarked: The port where most of these passengers boarded.

Sorts the result by Class (High to Low) and Survival Rate (Low to High).

Architect's Hint: Use Named Aggregation inside .agg(). For the "Most Common Embarked," you might need a lambda with .mode().

In [195]:
titanic_dd = pd.read_csv('../data/dictionaries/titanic_data_dictionary.csv')

titanic_dd

Unnamed: 0,Variable,Definition,Key
0,survival,Survival,"0 = No, 1 = Yes"
1,pclass,Ticket class,"1 = 1st, 2 = 2nd, 3 = 3rd"
2,sex,Sex,
3,Age,Age in years,
4,sibsp,# of siblings / spouses aboard the Titanic,
5,parch,# of parents / children aboard the Titanic,
6,ticket,Ticket number,
7,fare,Passenger fare,
8,cabin,Cabin number,
9,embarked,Port of Embarkation,"C = Cherbourg, Q = Queenstown, S = Southampton"


In [None]:
from numpy import astype
import pandas as pd 
import numpy as np

pd.options.mode.copy_on_write = True

titanic_raw = pd.read_csv('../data/titanic.csv')

def load_clean_df(df: pd.DataFrame):

    #ticket class 
    ticket_class = pd.CategoricalDtype(
        categories=[3,2,1], 
        ordered=True
    )

    return(df   #load data
            .rename(columns=lambda c:c.strip().lower()) #standardize column names
            #.dropna(subset=['age'])
            .assign(
                #STRING OPTIMIZATION (PyArrow)
                name=lambda x: x['name'].astype("string[pyarrow]"),
                ticket=lambda x: x['ticket'].astype("string[pyarrow]"),

                #NUMERIC OPTIMIZATION
                passengerid = lambda x:x['passengerid'].astype('Int16'),
                age = lambda x:x['age'].round().astype('Int8'),
                sibsp = lambda x:x['sibsp'].astype('Int8'),
                parch = lambda x:x['parch'].astype('Int8'),
                fare = lambda x:x['fare'].astype('float32'),

                #Boolean optimization
                survived = lambda x:x['survived'].astype(bool),

                #Ordered categories
                pclass = lambda x:x['pclass'].astype(ticket_class),
                
                #categroy
                sex = lambda x:x['sex'].astype('category'),
                embarked = lambda x:x['embarked'].astype('category')
            )
            
        )

def clean_demographics(df: pd.DataFrame):

    return (df
            .groupby('ticket').filter(lambda x: x['passengerid'].size < 7) #Removing groups of passengers sharing the same ticket number if their group size is greater than 6
            #add median age and fill missing age with the median age
            .assign(
                group_median_age = df.groupby(['sex','pclass'], observed=True)['age'].transform('median'), #get the median age by using transform
                age = lambda x:x['age'].fillna(x['group_median_age'].round()).astype('Int8')
            )
            .drop(columns=['name','group_median_age'])
        )

def feature_engineering(df: pd.DataFrame):

    return (df
            .assign(
                age_group =  pd.cut(
                                df['age'],
                                bins=[-1, 12,17,59, np.inf],
                                labels=['Child', 'Teen', 'Adult','Senior']
                            ),
                fare_quantile = pd.qcut(
                                df['fare'],
                                q=5,
                                labels=['Very Low', 'Low', 'Med','High','Very High']
                            )
            )

    )


def generate_risk_report(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generates the final ABT (Analytical Base Table).
    Aggregates by Class and Life Stage to determine Survival Risk.
    """
    return (df
        # 1. Grouping
        # observed=True ensures we don't generate rows for empty categories (e.g., Senior in 3rd class if none exist)
        .groupby(['pclass', 'age_group'], observed=True)
        
        # 2. Named Aggregation (The Modern Standard)
        # Syntax: new_col_name = (target_col, function)
        .agg(
            survival_rate=('survived', 'mean'),
            avg_fare=('fare', 'mean'),
            total_passengers=('passengerid', 'size'),
            # Custom Aggregation: Mode (Most common element)
            # We use a lambda to get the first value of mode() in case of a tie
            most_common_embarked=('embarked', lambda x: x.mode().iloc[0] if not x.mode().empty else None)
        )
        
        # 3. Final Formatting
        .reset_index()
        .assign(
            # Convert decimal to percentage for readability
            survival_rate=lambda x: (x['survival_rate'] * 100).round(1)
        )
        
        # 4. Sorting (Risk Analysis View)
        # Sort by Class (High to Low) then Survival (Low to High)
        # Note: Since Pclass is Ordered [3 < 2 < 1], Descending (False) puts 1st Class at top.
        .sort_values(by=['pclass', 'survival_rate'], ascending=[False, True])
    )



if __name__ == "__main__":
    final_report = (
        load_clean_df(titanic_raw)
        .pipe(clean_demographics)
        .pipe(feature_engineering)
    )


In [260]:
final_report.pipe(generate_risk_report)

Unnamed: 0,pclass,age_group,survival_rate,avg_fare,total_passengers,most_common_embarked
11,1,Senior,29.4,60.033337,17,S
10,1,Adult,64.2,84.812233,187,S
8,1,Child,75.0,126.239578,4,S
9,1,Teen,100.0,99.0,8,S
7,2,Senior,25.0,17.625,4,S
6,2,Adult,41.4,19.962978,157,S
5,2,Teen,66.7,18.095133,6,S
4,2,Child,100.0,28.7402,17,S
3,3,Senior,20.0,7.82,5,S
2,3,Adult,21.4,10.451241,392,S


In [247]:
final_report.pipe(generate_risk_report)

  .groupby(['pclass','age_group'])


Unnamed: 0_level_0,Unnamed: 1_level_0,avg_survival_rate
pclass,age_group,Unnamed: 2_level_1
3,Child,0.465116
3,Teen,0.3
3,Adult,0.214286
3,Senior,0.2
2,Child,1.0
2,Teen,0.666667
2,Adult,0.414013
2,Senior,0.25
1,Child,0.75
1,Teen,1.0


In [167]:
titanic_raw.pipe(load_clean_df).groupby(['age','pclass'], observed=True)['age'].transform('median')

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    <NA>
889    26.0
890    32.0
Name: age, Length: 891, dtype: Float64

In [None]:
cleaned_df = titanic_raw.pipe(load_clean_df)

In [132]:
cleaned_df

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,False,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.250000,,S
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.283302,C85,C
2,3,True,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925000,,S
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.099998,C123,S
4,5,False,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.050000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,False,2,"Montvila, Rev. Juozas",male,27,0,0,211536,13.000000,,S
887,888,True,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30.000000,B42,S
888,889,False,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.450001,,S
889,890,True,1,"Behr, Mr. Karl Howell",male,26,0,0,111369,30.000000,C148,C


In [None]:
cleaned_df.loc[lambda df_:df_['age'].isna()]['passengerid'].

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
5,6,False,3,"Moran, Mr. James",male,,0,0,330877,8.458300,,Q
17,18,True,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.000000,,S
19,20,True,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225000,,C
26,27,False,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225000,,C
28,29,True,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.879200,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,False,3,"Razi, Mr. Raihed",male,,0,0,2629,7.229200,,C
863,864,False,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.550003,,S
868,869,False,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.500000,,S
878,879,False,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.895800,,S


In [None]:
cleaned_df.groupby('ticket').agg(passenger_count=('passengerid','size')).sort_values('passenger_count',ascending=False)

Unnamed: 0_level_0,passenger_count
ticket,Unnamed: 1_level_1
1601,7
CA. 2343,7
347082,7
CA 2144,6
3101295,6
...,...
112052,1
112050,1
111428,1
111427,1


In [137]:
ticet = ['1601','CA. 2343','347082']

len(cleaned_df.query('ticket in @ticet'))

21

In [None]:
from numpy import iinfo


iinfo('int8')

ValueError: Invalid integer data type 'U'.