# 1_ingest: Build Master Database

This notebook:
1. Reads raw CSV/Parquet from `data/raw/`
2. Cleans and filters
3. Engineers core feature flags
4. Writes out `data/processed/master.parquet`
5. Quick sanity checks


In [17]:
# 0. Import Libraries

import pandas as pd

## 1. Extract
Read raw file into one DataFrame.


In [18]:
# 1. Read the CSV into a DataFrame with explicit dtypes
dtype_map = {
    'Context ID': str,
    'Booking ID': str,
    'Session ID': str,
    'Search Days Ahead': 'Int64',
    'Search Charge': 'float',
    'Search Charge Type': 'category',
    'Venue ID': str,
    'Venue Name': 'category',
    'Party Size': 'Int64',
    'Was Search Available': 'boolean',
    'Reservation Days Ahead': 'Int64',
    'Reservation Charge': 'float',
    'Reservation Charge Type': 'category',
    'Year': 'Int64',
    'Month': 'Int64',
    'Reservation Cost ($)': 'float',
    'Packages Cost ($)': 'float',
    'Add Ons Cost ($)': 'float',
    'Promo Code Discount ($)': 'float',
    'Total Cost ($)': 'float',
    'Deposit Amount': 'float',
}
df = pd.read_csv(
    '../data/raw/Clays_data.csv',
    dtype=dtype_map,
    encoding='latin1',
    low_memory=False
)

# 2. Safety copy
df.to_parquet('../data/raw/full_raw.parquet', index=False)

# 3. Standardize date/time columns
df['Search At'] = pd.to_datetime(df['Search At'], dayfirst=True, errors='coerce')
df['Search Date'] = pd.to_datetime(df['Search Date'], dayfirst=True, errors='coerce')
df['Reservation Date'] = pd.to_datetime(df['Reservation Date'], dayfirst=True, errors='coerce')
df['Reservation Datetime'] = pd.to_datetime(df['Reservation Datetime'], dayfirst=True, errors='coerce')

# 4. Drop rows without Context ID (must be a bug)
df = df.dropna(subset=['Context ID'])

# 5. Save cleaned DataFrame
df.to_parquet('../data/processed/full_cleaned.parquet', index=False)

df


  df['Search Date'] = pd.to_datetime(df['Search Date'], dayfirst=True, errors='coerce')
  df['Reservation Date'] = pd.to_datetime(df['Reservation Date'], dayfirst=True, errors='coerce')


Unnamed: 0,Context ID,Booking ID,Session ID,Search At,Search Date,Search Time,Search Time Iso,Search Days Ahead,Search Charge,Search Charge Type,...,Reservation Reference Code,Reservation Tags,Reservation Cost ($),Packages Cost ($),Add Ons Cost ($),Promo Code Discount ($),Total Cost ($),Deposit Amount,Year,Month
0,202406010624Q11YGA,202406010624Q11YGA,202406010624Q11YGA,2024-01-06 06:24:00,2024-07-13,63900000000000,17:45:00,42,14.0,person,...,45A3LK4B,"[""""]",84.0,72.0,0.0,0.0,156.0,156.0,2024,6
1,202406010714KXIEZJ,202406010714KXIEZJ,202406010714KXIEZJ,2024-01-06 07:14:00,2024-06-01,41400000000000,11:30:00,0,10.0,person,...,64UKL232,"[""dietary#vegetarian""]",20.0,0.0,0.0,0.0,20.0,20.0,2024,6
2,202406010726X2ZGX5,202406010726X2ZGX5,202406010726X2ZGX5,2024-01-06 07:26:00,2024-06-01,42300000000000,11:45:00,0,10.0,person,...,64UL3834,"[""""]",20.0,0.0,0.0,0.0,20.0,20.0,2024,6
3,202406010755ACH0QY,202406010755ACH0QY,202406010755ACH0QY,2024-01-06 07:55:00,2024-06-05,70200000000000,19:30:00,4,14.0,person,...,45A998P8,"[""""]",154.0,0.0,0.0,0.0,154.0,154.0,2024,6
4,202406010815SQJ15X,202406010815SQJ15X,202406010815SQJ15X,2024-01-06 08:15:00,2024-06-02,54900000000000,15:15:00,1,12.0,person,...,64UP3B4A,"[""""]",24.0,24.0,0.0,0.0,48.0,48.0,2024,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439770,202502261147417BET,,86.186.215.75,NaT,2025-02-28,48600000000000,13:30:00,2,8.0,person,...,,"[""""]",0.0,0.0,0.0,0.0,0.0,0.0,2025,2
439771,202502261148AC8B1E,202502261149KMFIF6,80.46.30.140,NaT,2025-02-26,68400000000000,19:00:00,0,10.0,person,...,,"[""""]",20.0,0.0,0.0,0.0,20.0,20.0,2025,2
439772,202502261148OMT8T6,,157.125.9.187,NaT,2025-02-26,73800000000000,20:30:00,0,10.0,person,...,,"[""""]",0.0,0.0,0.0,0.0,0.0,0.0,2025,2
439773,202502261148WO5U9E,,185.156.117.147,NaT,2025-04-19,54000000000000,15:00:00,52,12.0,person,...,,"[""""]",0.0,0.0,0.0,0.0,0.0,0.0,2025,2


## 2. Clean
Business-rule filters:
- Drop `party_size > 20`
- Remove negative money & days ahead
- Cap `Search Days Ahead` at 99th percentile or 180 days


In [19]:
# 2. Clean
df = pd.read_parquet("../data/processed/full_cleaned.parquet")

# Drop invalid party sizes: <= 0 or > 20
mask_party = (df['Party Size'] > 0) & (df['Party Size'] <= 20)

# Remove negative days ahead and cap at 99th percentile (or 180 days)
# First drop negative values
df = df[ df['Search Days Ahead'] >= 0 ]

# Compute 99th percentile
pct_99 = df['Search Days Ahead'].quantile(0.99)
# Use guardrail of 180 days or computed pct, whichever is smaller
cap_days = min(180, pct_99)
#    Cap values
df['Search Days Ahead'] = df['Search Days Ahead'].clip(upper=cap_days)

# Drop rows outside party-size mask
df = df[ mask_party ]

# Remove negative money columns and absurd values
money_cols = [
    'Search Charge', 'Reservation Charge',
    'Reservation Cost ($)', 'Packages Cost ($)',
    'Add Ons Cost ($)', 'Promo Code Discount ($)',
    'Total Cost ($)', 'Deposit Amount'
]
for col in money_cols:
    # drop or set to NaN? here, i think we drop rows with negative
    df = df[df[col] >= 0]

# Reset index
df = df.reset_index(drop=True)

# Write out filtered dataset
df.to_parquet("../data/processed/mid_step_filtered.parquet", index=False)

print(f"Business-rule filtering complete. Rows now: {len(df)}")



  df = df[ mask_party ]


Business-rule filtering complete. Rows now: 431440


## 3. Feature Engineering
Core flags:
- `was_booked`
- `lead_time_days`
- `hour_of_day`, `day_of_week`, `is_weekend`
- Convert key cols to categorical


In [20]:
# 3. Feature Engineering
def add_feature_flags(
    input_path: str = "../data/processed/mid_step_filtered.parquet",
    output_path: str = "../data/processed/master.parquet"
):
    # 1. Load filtered data
    df = pd.read_parquet(input_path)
    
    # 2. was_booked: 1 if there's a Reservation ID, else 0
    df["was_booked"] = df["Reservation ID"].notnull().astype("int8")
    
    # 3. lead_time_days = (reservation_date − search_date).days
    df["lead_time_days"] = (
        df["Reservation Date"] - df["Search Date"]
    ).dt.days.astype("Int16")
    # Drop any negative lead times that remain
    df = df[df["lead_time_days"] >= 0]
    
    # 4. Extract search-time features
    df["hour_of_day"] = pd.to_datetime(df["Search Time"]).dt.hour
    
    # undecided how to do these:
    #   df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype("boolean")
    #   df["day_of_week"] = df["Search At"].dt.dayofweek.astype("Int8")  # Monday=0

    
    # 5. Convert business columns to categorical
    for col in [
        "Venue Name",
        "Search Charge Type",
        "Reservation Charge Type",
        "Booking Status",
    ]:
        if col in df:
            df[col] = df[col].astype("category")
    
    # 6. Persist the feature‐augmented dataset
    df.to_parquet(output_path, index=False)
    print(f"✅ Features added & saved to {output_path}")

if __name__ == "__main__":
    add_feature_flags()

✅ Features added & saved to ../data/processed/master.parquet


## 4. Quick QA
Ensure everything looks good.


In [21]:
# 4. Load
df2 = pd.read_parquet('../data/processed/master.parquet')
print(df2.info())
df2.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233214 entries, 0 to 233213
Data columns (total 54 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   Context ID                  233214 non-null  object        
 1   Booking ID                  233214 non-null  object        
 2   Session ID                  233214 non-null  object        
 3   Search At                   21064 non-null   datetime64[ns]
 4   Search Date                 233214 non-null  datetime64[ns]
 5   Search Time                 233214 non-null  int64         
 6   Search Time Iso             233214 non-null  object        
 7   Search Days Ahead           233214 non-null  Int64         
 8   Search Charge               233214 non-null  float64       
 9   Search Charge Type          233214 non-null  category      
 10  Venue ID                    233214 non-null  object        
 11  Venue Name                  233214 non-

Unnamed: 0,Context ID,Booking ID,Session ID,Search At,Search Date,Search Time,Search Time Iso,Search Days Ahead,Search Charge,Search Charge Type,...,Packages Cost ($),Add Ons Cost ($),Promo Code Discount ($),Total Cost ($),Deposit Amount,Year,Month,was_booked,lead_time_days,hour_of_day
0,202406010624Q11YGA,202406010624Q11YGA,202406010624Q11YGA,2024-01-06 06:24:00,2024-07-13,63900000000000,17:45:00,42,14.0,person,...,72.0,0.0,0.0,156.0,156.0,2024,6,1,0,17
1,202406010714KXIEZJ,202406010714KXIEZJ,202406010714KXIEZJ,2024-01-06 07:14:00,2024-06-01,41400000000000,11:30:00,0,10.0,person,...,0.0,0.0,0.0,20.0,20.0,2024,6,1,0,11
2,202406010726X2ZGX5,202406010726X2ZGX5,202406010726X2ZGX5,2024-01-06 07:26:00,2024-06-01,42300000000000,11:45:00,0,10.0,person,...,0.0,0.0,0.0,20.0,20.0,2024,6,1,0,11
3,202406010755ACH0QY,202406010755ACH0QY,202406010755ACH0QY,2024-01-06 07:55:00,2024-06-05,70200000000000,19:30:00,4,14.0,person,...,0.0,0.0,0.0,154.0,154.0,2024,6,1,0,19
4,202406010815SQJ15X,202406010815SQJ15X,202406010815SQJ15X,2024-01-06 08:15:00,2024-06-02,54900000000000,15:15:00,1,12.0,person,...,24.0,0.0,0.0,48.0,48.0,2024,6,1,0,15


## 5. Data Quality Report
Double check how the data is looking like before moving forward.


In [None]:
# Export HTML summary as data_quality.html

from ydata_profiling import ProfileReport

def generate_data_quality_report(
    input_path: str = "../data/processed/master.parquet",
    output_path: str = "../outputs/data_quality.html"
):
    # 1. Load your feature-augmented data
    df = pd.read_parquet(input_path)
    
    # 2. Create a profiling report
    profile = ProfileReport(
        df,
        title="Clays Data Quality Report",
        explorative=True,
        minimal=False  # set True for a slimmer report
    )
    
    # 3. Export to HTML
    profile.to_file(output_path)
    print(f"✅ Data-quality report written to {output_path}")

if __name__ == "__main__":
    generate_data_quality_report()

100%|██████████| 54/54 [00:05<00:00,  9.04it/s]7<00:01,  4.21it/s, Describe variable: hour_of_day]               
Summarize dataset: 100%|██████████| 60/60 [00:08<00:00,  7.45it/s, Completed]                     
Generate report structure: 100%|██████████| 1/1 [00:11<00:00, 11.28s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.68it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 50.89it/s]

✅ Data-quality report written to ../outputs/data_quality_slimmer.html



