# Day 9: Session-level Aggregate Features
This notebook focuses on extracting session-level aggregate features from the OTTO dataset.
We will compute:
- Total number of interactions per session
- Number of unique items per session
- Average time difference between consecutive interactions per session


## Step 1: Load Data

In [None]:
import pandas as pd

# Read the Parquet files (Parquet is typically faster to load and smaller on disk than CSV)
train = pd.read_parquet('train.parquet')
test = pd.read_parquet('test.parquet')

# Knowledge point:
# - pandas.read_parquet(): Efficiently reads parquet files.
# - DataFrame: Pandas' core structure for storing tabular data.
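
Before computing features, it can help to confirm that a loaded frame exposes the columns the later steps use (`session`, `aid`, `ts`). The sketch below is only an illustration: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the toy frame stands in for the real data.

```python
import pandas as pd

# Columns that the later steps of this notebook rely on.
REQUIRED_COLUMNS = ("session", "aid", "ts")


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]


# Toy stand-in for a loaded frame; the values are made up for illustration.
toy = pd.DataFrame({"session": [1, 1, 2], "aid": [10, 11, 10]})

# `ts` is deliberately left out of the toy frame, so it is reported as missing.
print(missing_columns(toy))
```

Running the same check on `train` and `test` right after loading would surface a schema problem before any groupby is attempted.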


## Step 2: Combine Train and Test

In [None]:
# Concatenate the train and test sets so feature statistics are computed uniformly (avoids train/test feature mismatch)
df = pd.concat([train, test], axis=0)

# Knowledge point:
# - pd.concat(): Combines DataFrames either by rows (axis=0) or columns (axis=1).
# - Combining before feature engineering ensures consistent statistics.
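
One detail worth knowing about row-wise concatenation is what happens to the index. The sketch below uses toy frames, and the `is_train` column is a hypothetical addition (not part of this notebook) that records which split each row came from.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up).
toy_train = pd.DataFrame({"session": [1, 1], "aid": [10, 11]})
toy_test = pd.DataFrame({"session": [2], "aid": [10]})

# By default each frame keeps its own 0..n-1 labels, so index labels repeat.
stacked = pd.concat([toy_train, toy_test], axis=0)

# ignore_index=True rebuilds a clean 0..n-1 index; the hypothetical `is_train`
# flag makes it possible to split the combined frame again later.
combined = pd.concat(
    [toy_train.assign(is_train=True), toy_test.assign(is_train=False)],
    axis=0,
    ignore_index=True,
)

print(stacked.index.tolist())
print(combined.index.tolist())
```

The features in this notebook are keyed on `session`, so the repeated labels do no harm here. A unique index mainly matters if later steps select rows by label.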


## Step 3: Total Interactions per Session

In [None]:
# Count the number of interactions in each session
session_interactions = df.groupby('session')['aid'].count().reset_index()
session_interactions.columns = ['session', 'session_total_interactions']

# Knowledge point:
# - groupby(): Groups rows by key(s).
# - count(): Counts non-null values in each group (use size() to count all rows).
# - reset_index(): Moves the group keys from the index back into regular columns.
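
The difference between `count()` and `size()` only shows up when the counted column has missing values. A minimal sketch on a toy frame (the `NaN` is planted on purpose; whether the real `aid` column contains any nulls is not something this example assumes):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing aid in session 1.
toy = pd.DataFrame({"session": [1, 1, 2], "aid": [10.0, np.nan, 12.0]})

# count() skips the missing value, so session 1 reports a single interaction.
non_null_counts = toy.groupby("session")["aid"].count()

# size() counts every row, including the one whose aid is missing.
row_counts = toy.groupby("session").size()

print(non_null_counts.to_dict())
print(row_counts.to_dict())
```

If `aid` has no nulls, both calls give the same result.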


## Step 4: Unique Items per Session

In [None]:
# Count the number of unique items in each session
session_unique_items = df.groupby('session')['aid'].nunique().reset_index()
session_unique_items.columns = ['session', 'session_unique_items']

# Knowledge point:
# - nunique(): Counts distinct values in each group.
# - This reflects the diversity of items in a session.
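
The interaction count and the unique-item count could also be computed in a single pass with pandas named aggregation. This is an alternative sketch on toy data, not the approach used in the cells above; the output column names simply mirror the ones in this notebook.

```python
import pandas as pd

# Toy frame: session 1 views item 10 twice and item 11 once.
toy = pd.DataFrame({"session": [1, 1, 1, 2], "aid": [10, 10, 11, 12]})

# Named aggregation: each keyword becomes an output column,
# defined as a (source column, aggregation function) pair.
session_stats = (
    toy.groupby("session")
    .agg(
        session_total_interactions=("aid", "count"),
        session_unique_items=("aid", "nunique"),
    )
    .reset_index()
)

print(session_stats)
```

A session whose total equals its unique count never revisited an item. A gap between the two indicates repeat interactions.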


## Step 5: Average Time Difference per Session

In [None]:
# Convert the timestamps to datetime format
df['ts'] = pd.to_datetime(df['ts'], unit='ms')

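# Sort so that diff() compares consecutive events in chronological order
df = df.sort_values(['session', 'ts'])

# Calculate the average time gap between consecutive events (NaT for single-event sessions)
session_time_diff = df.groupby('session')['ts'].apply(lambda x: x.diff().mean()).reset_index()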
session_time_diff.columns = ['session', 'session_avg_time_diff']

# Knowledge point:
# - pd.to_datetime(): Converts timestamps to datetime.
# - diff(): Finds differences between consecutive rows.
# - mean(): Calculates the average.
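
A per-group `apply` with a Python lambda can be slow when there are many sessions. The sketch below shows one vectorized alternative on toy data, with the gap expressed in seconds as a plain float. Using seconds, and the variable names, are assumptions made for illustration rather than choices made by this notebook.

```python
import pandas as pd

# Toy events, deliberately out of order inside session 1.
toy = pd.DataFrame(
    {
        "session": [1, 1, 1, 2],
        "ts": pd.to_datetime([3000, 0, 1000, 5000], unit="ms"),
    }
)

# Sort so that diff() compares events in chronological order within a session.
toy = toy.sort_values(["session", "ts"])

# groupby().diff() is vectorized; the first event of each session gets NaT.
gaps = toy.groupby("session")["ts"].diff()

# Convert the gaps to float seconds (NaT becomes NaN), then average per session.
gap_seconds = gaps.dt.total_seconds()
avg_gap_seconds = gap_seconds.groupby(toy["session"]).mean()

print(avg_gap_seconds)
```

Session 1 has gaps of 1 s and 2 s, so its average is 1.5 s. Session 2 has a single event and therefore no gap, which leaves a missing value that downstream models or imputations need to handle.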


## Step 6: Merge Features Back to Train and Test

In [None]:
# Merge all session-level features
features = session_interactions.merge(session_unique_items, on='session', how='left')
features = features.merge(session_time_diff, on='session', how='left')

train = train.merge(features, on='session', how='left')
test = test.merge(features, on='session', how='left')

# Save the updated datasets
train.to_parquet('train_day9.parquet')
test.to_parquet('test_day9.parquet')

# Knowledge point:
# - merge(): Joins DataFrames on keys.
# - how='left': Keeps all rows from left DataFrame.
# - Saving as parquet allows faster subsequent reads.
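
A left merge silently duplicates rows if the right-hand frame has repeated keys. As a hedged sketch on toy data, the `validate` argument of `merge` can guard against that by asserting that each session appears at most once in the feature table.

```python
import pandas as pd

# Toy event rows and a toy per-session feature table (values are made up).
toy_events = pd.DataFrame({"session": [1, 1, 2, 3], "aid": [10, 11, 10, 12]})
toy_features = pd.DataFrame(
    {"session": [1, 2], "session_total_interactions": [2, 1]}
)

# validate="many_to_one" raises MergeError if `session` is duplicated
# on the right side, so the row count of the left frame is preserved.
merged = toy_events.merge(
    toy_features, on="session", how="left", validate="many_to_one"
)

# Session 3 has no matching feature row, so its feature value is missing.
print(len(merged), int(merged["session_total_interactions"].isna().sum()))
```

Comparing `len(merged)` with the length of the original frame is another quick way to confirm that the merge did not add rows.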
