# 2 – Load Processed Dataset

This notebook loads the processed extreme precipitation dataset from Amazon S3.
It validates schema integrity, checks class balance, and confirms the dataset
is ready for downstream registration in Athena and Feature Store.


## Import Libraries and Initialize SageMaker Session


In [11]:
import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role

# Initialize session
sess = sagemaker.Session()
bucket = sess.default_bucket()
region = boto3.Session().region_name
role = get_execution_role()

s3 = boto3.client("s3")

print("Bucket:", bucket)
print("Region:", region)
print("Role:", role)


Bucket: sagemaker-us-east-1-083422367993
Region: us-east-1
Role: arn:aws:iam::083422367993:role/LabRole


## Define S3 Location of Processed Dataset


In [14]:
project_prefix = "ghcn-extreme"

processed_csv_prefix = f"{project_prefix}/processed_csv"
processed_parquet_prefix = f"{project_prefix}/processed_parquet"

csv_key = f"{processed_csv_prefix}/extreme_precip_processed.csv"
parquet_key = f"{processed_parquet_prefix}/extreme_precip_processed.parquet"

csv_path = f"s3://{bucket}/{csv_key}"
parquet_path = f"s3://{bucket}/{parquet_key}"

print("CSV path:", csv_path)
print("Parquet path:", parquet_path)


CSV path: s3://sagemaker-us-east-1-083422367993/ghcn-extreme/processed_csv/extreme_precip_processed.csv
Parquet path: s3://sagemaker-us-east-1-083422367993/ghcn-extreme/processed_parquet/extreme_precip_processed.parquet
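Athena external tables point at an S3 folder (prefix) rather than a single file, so the later registration step will likely need the folder that contains the Parquet object. The sketch below derives that folder from a bucket and key. The helper name `table_location` and the bucket `example-bucket` are illustrative assumptions, not part of the original notebook.

```python
def table_location(bucket: str, key: str) -> str:
    """Return the s3:// folder URI that contains the object at `key`."""
    # Drop the file name and keep everything up to the last "/".
    prefix = key.rsplit("/", 1)[0]
    return f"s3://{bucket}/{prefix}/"


# Hypothetical bucket name; the key mirrors parquet_key defined above.
example_key = "ghcn-extreme/processed_parquet/extreme_precip_processed.parquet"
location = table_location("example-bucket", example_key)
print(location)
```

Keeping one dataset per prefix means the folder can be registered without picking up unrelated files.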


## Load Dataset from S3 (Parquet Format)


In [15]:
df = pd.read_parquet(parquet_path)

df.head()


|   | station_id | date | year | month | TMAX | TMIN | prcp_lag_1 | prcp_roll_7 | extreme_precip_tomorrow |
|---|---|---|---|---|---|---|---|---|---|
| 0 | USW00012921 | 2006-02-18 | 2006 | 2 | 3.9 | -1.1 | 1.5 | 0.214286 | 0 |
| 1 | USW00012921 | 2006-02-19 | 2006 | 2 | 5.6 | -1.7 | 0.0 | 0.257143 | 0 |
| 2 | USW00012921 | 2006-02-20 | 2006 | 2 | 8.9 | 1.7 | 0.3 | 0.400000 | 0 |
| 3 | USW00012921 | 2006-02-21 | 2006 | 2 | 13.9 | 6.1 | 1.0 | 0.442857 | 0 |
| 4 | USW00012921 | 2006-02-22 | 2006 | 2 | 22.2 | 12.8 | 0.3 | 0.442857 | 0 |


## Dataset Shape and Column Overview


In [16]:
print("Dataset shape:", df.shape)
print("\nColumns:")
df.columns.tolist()


Dataset shape: (36444, 9)

Columns:


['station_id',
 'date',
 'year',
 'month',
 'TMAX',
 'TMIN',
 'prcp_lag_1',
 'prcp_roll_7',
 'extreme_precip_tomorrow']
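Each row is presumably meant to describe one station on one day, which the shape and column list alone do not confirm. A minimal sketch of a uniqueness check on the (`station_id`, `date`) pair follows. The helper `duplicate_keys` and the three-row sample frame are assumptions made for illustration; in the notebook the function would be called on `df`.

```python
import pandas as pd


def duplicate_keys(frame: pd.DataFrame, keys=("station_id", "date")) -> pd.DataFrame:
    """Return every row whose key columns occur more than once."""
    # keep=False flags all members of a duplicated group, not just the repeats.
    mask = frame.duplicated(subset=list(keys), keep=False)
    return frame.loc[mask].sort_values(list(keys))


# Tiny synthetic frame: station "A" appears twice on the same day.
sample = pd.DataFrame({
    "station_id": ["A", "A", "B"],
    "date": pd.to_datetime(["2006-02-18", "2006-02-18", "2006-02-18"]),
    "TMAX": [3.9, 4.0, 5.6],
})
dupes = duplicate_keys(sample)
print(len(dupes), "rows share a (station_id, date) key")
```

An empty result on the real data would support treating (`station_id`, `date`) as the record key downstream.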

## Validate Data Types


In [17]:
df.dtypes


station_id                         object
date                       datetime64[ns]
year                                int32
month                               int32
TMAX                              float64
TMIN                              float64
prcp_lag_1                        float64
prcp_roll_7                       float64
extreme_precip_tomorrow             int64
dtype: object
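Inspecting `df.dtypes` by eye works for nine columns, but a programmatic comparison fails loudly if an upstream job changes a type. The sketch below compares a frame against an expected mapping copied from the output above. The names `EXPECTED_DTYPES` and `find_schema_problems` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# Expected dtypes, copied from the df.dtypes output shown above.
EXPECTED_DTYPES = {
    "station_id": "object",
    "date": "datetime64[ns]",
    "year": "int32",
    "month": "int32",
    "TMAX": "float64",
    "TMIN": "float64",
    "prcp_lag_1": "float64",
    "prcp_roll_7": "float64",
    "extreme_precip_tomorrow": "int64",
}


def find_schema_problems(frame: pd.DataFrame, expected=EXPECTED_DTYPES) -> list:
    """List missing columns, dtype mismatches, and unexpected columns."""
    problems = []
    for column, dtype in expected.items():
        if column not in frame.columns:
            problems.append(f"missing column: {column}")
        elif str(frame[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, found {frame[column].dtype}")
    for column in frame.columns:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


# Synthetic frame with a wrong integer width and a missing column.
sample = pd.DataFrame({"year": pd.Series([2006], dtype="int64")})
print(find_schema_problems(sample, {"year": "int32", "month": "int32"}))
```

In the notebook, `assert not find_schema_problems(df)` would turn the visual check into a hard gate.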

## Date Range Validation


In [18]:
print("Date range:")
print("Start:", df["date"].min())
print("End:", df["date"].max())


Date range:
Start: 2006-02-18 00:00:00
End: 2026-02-11 00:00:00
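The `year` and `month` columns duplicate information already held in `date`, so it is worth confirming that they agree. The following sketch counts rows where they do not. `calendar_mismatches` is an illustrative helper, and the sample frame is synthetic.

```python
import pandas as pd


def calendar_mismatches(frame: pd.DataFrame) -> int:
    """Count rows whose year or month column disagrees with the date column."""
    year_ok = frame["date"].dt.year == frame["year"]
    month_ok = frame["date"].dt.month == frame["month"]
    return int((~(year_ok & month_ok)).sum())


# Second row is deliberately wrong: the date is in March but month says 2.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2006-02-18", "2006-03-01"]),
    "year": [2006, 2006],
    "month": [2, 2],
})
print(calendar_mismatches(sample))
```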


## Station Coverage Validation


In [19]:
print("Number of stations:", df["station_id"].nunique())
df["station_id"].value_counts()


Number of stations: 5


station_id
USW00023174    7299
USW00094728    7299
USW00012921    7298
USW00013904    7285
USW00023293    7263
Name: count, dtype: int64
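The date range reported earlier, 2006-02-18 through 2026-02-11, spans 7,299 calendar days. Assuming one row per station per day, the two stations with 7,299 rows look complete, while the other three appear to be short by between 1 and 36 days. The sketch below counts the missing days between each station's own first and last date. `missing_days_per_station` is a hypothetical helper, and the sample data is synthetic.

```python
import pandas as pd


def missing_days_per_station(frame: pd.DataFrame) -> pd.Series:
    """Count calendar days absent between each station's first and last date."""
    gaps = {}
    for station, dates in frame.groupby("station_id")["date"]:
        # Inclusive number of days the station's record should cover.
        span = (dates.max() - dates.min()).days + 1
        gaps[station] = span - dates.nunique()
    return pd.Series(gaps, name="missing_days")


# Station "A" skips 2006-02-20 and 2006-02-21; station "B" has no gaps.
sample = pd.DataFrame({
    "station_id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime([
        "2006-02-18", "2006-02-19", "2006-02-22",
        "2006-02-18", "2006-02-19",
    ]),
})
gaps = missing_days_per_station(sample)
print(gaps)
```

This only finds gaps inside each station's record. A station that starts late or ends early would still need a comparison against the overall date range.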

## Class Distribution Check (Extreme Event Rate)


In [20]:
df["extreme_precip_tomorrow"].value_counts(normalize=True)


extreme_precip_tomorrow
0    0.949786
1    0.050214
Name: proportion, dtype: float64
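A positive rate of about 5% means negatives outnumber positives by roughly 19 to 1. The source does not say which model is trained downstream, so the following is only a sketch of how that ratio could be computed for use as a class weight (for example, XGBoost's `scale_pos_weight` parameter). `negative_to_positive_ratio` is an illustrative helper name.

```python
import pandas as pd


def negative_to_positive_ratio(target: pd.Series) -> float:
    """Return (# of 0 labels) / (# of 1 labels) for a binary target."""
    counts = target.value_counts()
    positives = int(counts.get(1, 0))
    negatives = int(counts.get(0, 0))
    if positives == 0:
        raise ValueError("target contains no positive examples")
    return negatives / positives


# Synthetic target with 19 negatives and 1 positive (a 5% event rate).
sample_target = pd.Series([0] * 19 + [1])
ratio = negative_to_positive_ratio(sample_target)
print(ratio)
```

With imbalance this strong, accuracy alone is a weak metric: always predicting class 0 would already score about 95%.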

## Missing Value Analysis


In [21]:
df.isna().mean().sort_values(ascending=False)


station_id                 0.0
date                       0.0
year                       0.0
month                      0.0
TMAX                       0.0
TMIN                       0.0
prcp_lag_1                 0.0
prcp_roll_7                0.0
extreme_precip_tomorrow    0.0
dtype: float64
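The fractions above are all zero, but a reader has to scan the output to see that. A small sketch of a check that returns only the offending columns follows. `null_report` is a hypothetical helper, and the sample frame is synthetic.

```python
import pandas as pd


def null_report(frame: pd.DataFrame) -> dict:
    """Map each column containing nulls to its null count (empty if clean)."""
    null_counts = frame.isna().sum()
    offenders = null_counts[null_counts > 0]
    return {column: int(count) for column, count in offenders.items()}


# TMIN has one missing value; TMAX is complete.
sample = pd.DataFrame({"TMAX": [3.9, 5.6], "TMIN": [-1.1, None]})
print(null_report(sample))
```

In the notebook, `assert not null_report(df)` would stop execution as soon as a null appears in a future refresh of the data.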

## Summary Statistics for Numerical Features


In [22]:
df.describe()


|   | date | year | month | TMAX | TMIN | prcp_lag_1 | prcp_roll_7 | extreme_precip_tomorrow |
|---|---|---|---|---|---|---|---|---|
| count | 36444 | 36444.0 | 36444.0 | 36444.0 | 36444.0 | 36444.0 | 36444.0 | 36444.0 |
| mean | 2016-02-14 02:22:55.041158912 | 2015.620898 | 6.525244 | 23.238536 | 12.642764 | 1.915896 | 1.915957 | 0.050214 |
| min | 2006-02-18 00:00:00 | 2006.0 | 1.0 | -10.5 | -18.2 | 0.0 | 0.0 | 0.0 |
| 25% | 2011-02-14 00:00:00 | 2011.0 | 4.0 | 17.8 | 7.8 | 0.0 | 0.0 | 0.0 |
| 50% | 2016-02-15 00:00:00 | 2016.0 | 7.0 | 23.3 | 13.3 | 0.0 | 0.257143 | 0.0 |
| 75% | 2021-02-12 00:00:00 | 2021.0 | 10.0 | 29.4 | 18.3 | 0.0 | 2.428571 | 0.0 |
| max | 2026-02-11 00:00:00 | 2026.0 | 12.0 | 43.3 | 28.9 | 317.2 | 66.114286 | 1.0 |
| std | NaN | 5.770397 | 3.447617 | 8.432097 | 7.588402 | 8.012012 | 3.574008 | 0.218389 |
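Summary statistics show the extremes but not whether individual rows are internally consistent. The sketch below counts a few basic violations: a daily maximum below the daily minimum, negative precipitation features, and target values other than 0 or 1. The rule set and the helper name `count_range_violations` are assumptions chosen for illustration, not checks taken from the original notebook.

```python
import pandas as pd


def count_range_violations(frame: pd.DataFrame) -> dict:
    """Count rows that break simple plausibility rules."""
    return {
        "tmax_below_tmin": int((frame["TMAX"] < frame["TMIN"]).sum()),
        "negative_prcp_lag_1": int((frame["prcp_lag_1"] < 0).sum()),
        "negative_prcp_roll_7": int((frame["prcp_roll_7"] < 0).sum()),
        "target_not_binary": int(
            (~frame["extreme_precip_tomorrow"].isin([0, 1])).sum()
        ),
    }


# Second row is deliberately invalid on three of the four rules.
sample = pd.DataFrame({
    "TMAX": [3.9, 1.0],
    "TMIN": [-1.1, 2.0],
    "prcp_lag_1": [1.5, -0.1],
    "prcp_roll_7": [0.2, 0.0],
    "extreme_precip_tomorrow": [0, 2],
})
violations = count_range_violations(sample)
print(violations)
```

A large but valid value, such as the `prcp_lag_1` maximum of 317.2, passes these rules. Outlier review would be a separate step.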


## Confirm Dataset is Ready for Athena Registration

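The checks above confirm that the dataset:

- Contains the engineered, leakage-safe features `prcp_lag_1` and `prcp_roll_7`
- Includes the next-day extreme precipitation target `extreme_precip_tomorrow`, with about 5% positive cases
- Has the expected schema (36,444 rows, 9 columns) and covers 2006-02-18 through 2026-02-11 across 5 stations
- Contains no null values in any column

The dataset is ready for Athena external table registration.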
