## Neural Network Development Playground
To use this notebook, you need to have the "clean_data.parquet" file in the ./data/ directory. I will share this file with you on Google Drive. However, if you want to create it on your own, you will need to:
1. Download the impressions and conversions data from Claritas using the AWS CLI commands in the README.md file. (Credentials are in GH secrets.)
2. Run `python main.py preprocess --impressions-file ./data/impressions.csv --conversions-file ./data/conversions.csv --output-file ./data/clean_data.parquet`
   - This command will clean the data for you and save it as a parquet file. It's pretty cool, actually, because I've made it use multiprocessing to speed up the user agent string parsing.
### Goal For this Notebook
I want to use the `src.data_processing.datasets.AuctionDataset` class to create a PyTorch Dataset and develop the required functionality for this class. Then I want to use this dataset to train a simple neural network.

### Dataset Overview
Each row of the dataset describes a single impression and includes a `conversion_flag` column that indicates whether that impression resulted in a conversion. We're trying to predict `conversion_flag` from the other information about the impression.


In [1]:
import pandas as pd
from src.data_processing.datasets import AuctionDataset

dataset = AuctionDataset(dataframe=pd.read_parquet('./data/clean_data.parquet'))

Dataset initialized. Number of samples: 3845798
Number of features: 16
Feature names: ['placement_id', 'cnxn_type', 'dma', 'country', 'prizm_premier_code', 'campaign_id', 'ua_browser', 'ua_os', 'ua_device_family', 'ua_device_brand', 'ua_is_mobile', 'ua_is_tablet', 'ua_is_pc', 'ua_is_bot', 'impression_hour', 'impression_dayofweek']


In [2]:
dataset.features.info()
display(dataset.features.describe(include='all'))
dataset.features.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3845798 entries, 0 to 3845797
Data columns (total 16 columns):
 #   Column                Dtype 
---  ------                ----- 
 0   placement_id          int64 
 1   cnxn_type             object
 2   dma                   int32 
 3   country               object
 4   prizm_premier_code    object
 5   campaign_id           int32 
 6   ua_browser            object
 7   ua_os                 object
 8   ua_device_family      object
 9   ua_device_brand       object
 10  ua_is_mobile          bool  
 11  ua_is_tablet          bool  
 12  ua_is_pc              bool  
 13  ua_is_bot             bool  
 14  impression_hour       int32 
 15  impression_dayofweek  int32 
dtypes: bool(4), int32(4), int64(1), object(7)
memory usage: 308.1+ MB


Unnamed: 0,placement_id,cnxn_type,dma,country,prizm_premier_code,campaign_id,ua_browser,ua_os,ua_device_family,ua_device_brand,ua_is_mobile,ua_is_tablet,ua_is_pc,ua_is_bot,impression_hour,impression_dayofweek
count,3845798.0,3845798,3845798.0,3845798,3845798,3845798.0,3845798,3845798,3845798,3845798,3845798,3845798,3845798,3845798,3845798.0,3845798.0
unique,,3,,47,69,,121,17,914,29,2,2,2,2,,
top,,Cable/DSL,,us,Unknown,,Podcasts,iOS,iOS-Device,Apple,True,False,False,False,,
freq,,2094122,,3843051,2428584,,1720968,2641499,1867796,2647091,3186944,3718120,3755881,3797739,,
mean,590214.7,,588.3996,,,9317.0,,,,,,,,,11.6502,2.886486
std,14612.52,,86.41265,,,0.0,,,,,,,,,6.868596,1.977513
min,557650.0,,0.0,,,9317.0,,,,,,,,,0.0,0.0
25%,596771.0,,602.0,,,9317.0,,,,,,,,,6.0,1.0
50%,596772.0,,602.0,,,9317.0,,,,,,,,,12.0,3.0
75%,596772.0,,602.0,,,9317.0,,,,,,,,,18.0,5.0


Unnamed: 0,placement_id,cnxn_type,dma,country,prizm_premier_code,campaign_id,ua_browser,ua_os,ua_device_family,ua_device_brand,ua_is_mobile,ua_is_tablet,ua_is_pc,ua_is_bot,impression_hour,impression_dayofweek
0,557650,Corporate,602,us,Unknown,9317,Mobile Safari UI/WKWebView,iOS,iPhone,Apple,True,False,False,False,14,3
1,557650,Cable/DSL,623,us,Unknown,9317,Podcasts,iOS,iOS-Device,Apple,True,False,False,False,18,3
2,557650,Cable/DSL,602,us,21,9317,Podcasts,iOS,iOS-Device,Apple,True,False,False,False,11,3
3,557650,Cable/DSL,602,us,21,9317,Podcasts,iOS,iOS-Device,Apple,True,False,False,False,3,3
4,557650,Cellular,602,us,Unknown,9317,Mobile Safari UI/WKWebView,iOS,iPhone,Apple,True,False,False,False,18,3


In [3]:
# The class implements __getitem__ so that we can use indexing to get a single sample from the dataset.

features, target = dataset[3000000]
print(f"Features: {features}")
print(f"Target: {target}")

Features: {'placement_id': 596772, 'cnxn_type': 'Corporate', 'dma': 602, 'country': 'us', 'prizm_premier_code': 'Unknown', 'campaign_id': 9317, 'ua_browser': 'Podcasts', 'ua_os': 'iOS', 'ua_device_family': 'iOS-Device', 'ua_device_brand': 'Apple', 'ua_is_mobile': True, 'ua_is_tablet': False, 'ua_is_pc': False, 'ua_is_bot': False, 'impression_hour': 10, 'impression_dayofweek': 0}
Target: 0.0
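
For anyone unfamiliar with the indexing protocol, here is a minimal sketch of how a map-style dataset can support `dataset[i]`. `TinyAuctionDataset`, its list-of-dicts storage, and the sample rows are all hypothetical and exist only for illustration; the real `AuctionDataset` is built from a DataFrame and its implementation is not shown in this notebook.

```python
class TinyAuctionDataset:
    """Hypothetical, dependency-free stand-in for AuctionDataset."""

    def __init__(self, rows, target_column="conversion_flag"):
        # Separate each row into its feature dict and its target value.
        self.features = [
            {k: v for k, v in row.items() if k != target_column} for row in rows
        ]
        self.target = [float(row[target_column]) for row in rows]

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        # Indexing returns one (features, target) pair.
        return self.features[idx], self.target[idx]


# Made-up rows for illustration only.
rows = [
    {"cnxn_type": "Corporate", "impression_hour": 14, "conversion_flag": 0},
    {"cnxn_type": "Cellular", "impression_hour": 18, "conversion_flag": 1},
]
tiny = TinyAuctionDataset(rows)
sample_features, sample_target = tiny[1]
print(sample_features, sample_target)
```

Defining `__len__` alongside `__getitem__` is what lets a class like this be consumed by a PyTorch `DataLoader`.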


## Train a simple neural network
Now that we have the dataset and our features in a nice format, we can train a simple neural network.
The data needs a bit more preparation first, though.

1. The categorical features need to be converted to numerical values. I haven't thought through the "best" way to do this, but I will proceed with an ordinal (integer) encoding scheme for now.

2. The datetime features (day of week, hour of day) also need to be converted to numerical values. It makes sense to apply a sine/cosine transformation to these features so that values at opposite ends of the cycle end up close together (hour 23 is adjacent to hour 0, for example).

3. The numerical features should be scaled. Note that the code below standardizes them with `StandardScaler` (zero mean, unit variance) rather than rescaling them to a 0-1 range.

4. The boolean features should be converted to 0/1 values.

After this we can proceed with training a simple neural network. 
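
To make the sine/cosine idea from step 2 concrete, here is a small sketch using only the standard library. The helper names `cyclical_encode` and `distance` are made up for this illustration. It compares how far apart hours 23 and 0 are before and after the transformation.

```python
import math


def cyclical_encode(value, period):
    """Map a value on a cycle of length `period` to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


hour_0 = cyclical_encode(0, 24)
hour_12 = cyclical_encode(12, 24)
hour_23 = cyclical_encode(23, 24)

print(f"raw gap, 23 vs 0:          {abs(23 - 0)}")
print(f"encoded distance, 23 vs 0: {distance(hour_23, hour_0):.3f}")
print(f"encoded distance, 12 vs 0: {distance(hour_12, hour_0):.3f}")
```

On the raw scale, 23 and 0 look maximally far apart. After encoding they are about 0.26 apart, the same as hours 1 and 0, while noon and midnight sit at opposite points of the circle.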

In [9]:
from sklearn.model_selection import train_test_split

# Step 1: Split the data into training, validation, and test sets

features_df = dataset.features
targets_tensor = dataset.target

targets_np = targets_tensor.cpu().numpy()

TEST_SIZE = 0.15
VAL_SIZE = 0.15
TRAIN_SIZE = 1.0 - TEST_SIZE - VAL_SIZE

num_samples = len(features_df)
indices = list(range(num_samples))

train_val_indices, test_indices = train_test_split(
    indices,
    test_size=TEST_SIZE,
    random_state=42,
    stratify=targets_np  # https://scikit-learn.org/stable/modules/cross_validation.html#stratification
)

val_size = VAL_SIZE / (1.0 - TEST_SIZE)

train_indices, val_indices = train_test_split(
    train_val_indices,
    test_size=val_size,
    random_state=42,
    stratify=targets_np[train_val_indices]
)

X_train = features_df.iloc[train_indices].copy()
y_train = targets_tensor[train_indices]

X_val = features_df.iloc[val_indices].copy()
y_val = targets_tensor[val_indices]

X_test = features_df.iloc[test_indices].copy()
y_test = targets_tensor[test_indices]

# --- Print shapes to verify ---
print("Original shapes:")
print(f"  Features: {features_df.shape}")
print(f"  Targets: {targets_tensor.shape}")

print("\nSplit shapes:")
print(f"  Train: X={X_train.shape}, y={y_train.shape} ({len(train_indices)} samples)")
print(f"  Validation: X={X_val.shape}, y={y_val.shape} ({len(val_indices)} samples)")
print(f"  Test: X={X_test.shape}, y={y_test.shape} ({len(test_indices)} samples)")

# --- Check stratification (optional, but good practice) ---
# Calculate target proportions (conversion rate) in each set
original_prop = targets_np.mean()
train_prop = y_train.float().mean().item() # Use .float() for mean and .item() to get Python number
val_prop = y_val.float().mean().item()
test_prop = y_test.float().mean().item()

print("\nTarget variable proportions (Conversion Rate):")
print(f"  Original: {original_prop:.4f}")
print(f"  Train: {train_prop:.4f}")
print(f"  Validation: {val_prop:.4f}")
print(f"  Test: {test_prop:.4f}")


Original shapes:
  Features: (3845798, 16)
  Targets: torch.Size([3845798])

Split shapes:
  Train: X=(2692058, 16), y=torch.Size([2692058]) (2692058 samples)
  Validation: X=(576870, 16), y=torch.Size([576870]) (576870 samples)
  Test: X=(576870, 16), y=torch.Size([576870]) (576870 samples)

Target variable proportions (Conversion Rate):
  Original: 0.0072
  Train: 0.0072
  Validation: 0.0072
  Test: 0.0072
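
The `val_size = VAL_SIZE / (1.0 - TEST_SIZE)` line is easy to misread, so here is the arithmetic worked through on its own. This is a sketch, assuming a fractional split size is rounded up to a whole number of samples; under that assumption it reproduces the split sizes printed above.

```python
import math

n_samples = 3_845_798  # sample count reported when the dataset was loaded
TEST_SIZE = 0.15
VAL_SIZE = 0.15

# First split: carve the test set out of the full data.
# Assumption: the fractional size is rounded up to whole samples.
n_test = math.ceil(TEST_SIZE * n_samples)
n_train_val = n_samples - n_test

# Second split: VAL_SIZE is a fraction of the FULL data, so it must be
# rescaled to a fraction of what remains once the test set is removed.
val_fraction_of_remainder = VAL_SIZE / (1.0 - TEST_SIZE)
n_val = math.ceil(val_fraction_of_remainder * n_train_val)
n_train = n_train_val - n_val

print(f"rescaled validation fraction: {val_fraction_of_remainder:.4f}")
print(f"train={n_train}, val={n_val}, test={n_test}")
```

Passing `VAL_SIZE` directly to the second split would take 15% of the remaining 85%, leaving a validation set of roughly 12.75% of the data instead of 15%.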


### Step 2: Make everything numeric
Now that we have our train, validation, and test sets, we can proceed with making everything numeric.

In [21]:
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
import joblib


categorical_features = [
    'placement_id', 'cnxn_type', 'dma', 'country', 'prizm_premier_code',
    'campaign_id', 'ua_browser', 'ua_os', 'ua_device_family', 'ua_device_brand'
]

boolean_features = [
    'ua_is_mobile', 'ua_is_tablet', 'ua_is_pc', 'ua_is_bot'
]

cyclical_features = [
    'impression_hour', 'impression_dayofweek'
]

all_features = categorical_features + boolean_features + cyclical_features
assert set(all_features) == set(X_train.columns), "Features do not match"

print("\nFitting categorical encoder...")
categorical_encoder = OrdinalEncoder(
    handle_unknown='use_encoded_value',
    unknown_value=-1, # Use -1 for unknown, easier to map later if needed
    dtype=np.int64 # Ensure integer output
)
categorical_encoder.fit(X_train[categorical_features])


category_sizes = {
    col: len(cats) + 1 # Add 1 for the 'unknown' category we'll map to index 0
    for col, cats in zip(categorical_features, categorical_encoder.categories_)
}
print("Category mappings fitted. Example sizes (including unknown):")
for col, size in category_sizes.items():
    print(f"  '{col}': {size} unique values")


print("\nPreparing numerical features for scaling...")
temp_numeric_df = pd.DataFrame(index=X_train.index)

# Convert booleans to float
for col in boolean_features:
    temp_numeric_df[col] = X_train[col].astype(float)

# Apply cyclical transformations
hour = X_train['impression_hour']
day = X_train['impression_dayofweek']
temp_numeric_df['hour_sin'] = np.sin(2 * np.pi * hour / 24.0)
temp_numeric_df['hour_cos'] = np.cos(2 * np.pi * hour / 24.0)
temp_numeric_df['day_sin'] = np.sin(2 * np.pi * day / 7.0)
temp_numeric_df['day_cos'] = np.cos(2 * np.pi * day / 7.0)


numerical_features_to_scale = temp_numeric_df.columns.tolist()
print(f"Numerical columns to scale: {numerical_features_to_scale}")

print("Fitting numerical scaler...")
numerical_scaler = StandardScaler()
numerical_scaler.fit(temp_numeric_df[numerical_features_to_scale])
print("Numerical scaler fitted.")


import os

preprocessor_dir = './preprocessors'
os.makedirs(preprocessor_dir, exist_ok=True)  # Create the directory if it doesn't exist

joblib.dump(categorical_encoder, os.path.join(preprocessor_dir, 'categorical_encoder.joblib'))
joblib.dump(numerical_scaler, os.path.join(preprocessor_dir, 'numerical_scaler.joblib'))
joblib.dump(category_sizes, os.path.join(preprocessor_dir, 'category_sizes.joblib'))
joblib.dump(categorical_features, os.path.join(preprocessor_dir, 'categorical_features.joblib'))
joblib.dump(boolean_features, os.path.join(preprocessor_dir, 'boolean_features.joblib'))
joblib.dump(cyclical_features, os.path.join(preprocessor_dir, 'cyclical_features.joblib'))
joblib.dump(numerical_features_to_scale, os.path.join(preprocessor_dir, 'numerical_features_to_scale.joblib'))

print(f"\nPreprocessors (encoder, scaler, category sizes, feature lists) saved to '{preprocessor_dir}'")



Fitting categorical encoder...
Category mappings fitted. Example sizes (including unknown):
  'placement_id': 4 unique values
  'cnxn_type': 4 unique values
  'dma': 179 unique values
  'country': 47 unique values
  'prizm_premier_code': 70 unique values
  'campaign_id': 2 unique values
  'ua_browser': 116 unique values
  'ua_os': 18 unique values
  'ua_device_family': 830 unique values
  'ua_device_brand': 30 unique values

Preparing numerical features for scaling...
Numerical columns to scale: ['ua_is_mobile', 'ua_is_tablet', 'ua_is_pc', 'ua_is_bot', 'hour_sin', 'hour_cos', 'day_sin', 'day_cos']
Fitting numerical scaler...
Numerical scaler fitted.

Preprocessors (encoder, scaler, category sizes, feature lists) saved to './preprocessors'
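
One thing worth noticing is that the boolean columns go through the same scaler, so they will no longer hold plain 0/1 values once it is applied. The sketch below applies the standardization formula `z = (x - mean) / std` by hand to a made-up column. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
import statistics

# Made-up boolean column, already cast to float as in the cell above.
is_mobile = [1.0, 1.0, 1.0, 0.0]

mean = statistics.fmean(is_mobile)
std = statistics.pstdev(is_mobile)  # population standard deviation (ddof=0)

scaled = [(x - mean) / std for x in is_mobile]
print([round(z, 3) for z in scaled])
```

The scaled column has zero mean and unit variance, with the rarer value pushed further from zero than the common one.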


We now have:
- `categorical_encoder`: Fitted sklearn OrdinalEncoder.
- `numerical_scaler`: Fitted sklearn StandardScaler.
- `category_sizes`: Dictionary mapping categorical feature names to their vocabulary size (including unknown).
- Lists of feature names for each type saved to disk.
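
The encoder marks unseen categories with -1, while `category_sizes` reserves one extra slot per feature for them. One way to reconcile the two, sketched below, is to add 1 to every code so that unknown becomes index 0 and the known categories occupy 1..N. The helpers `fit_vocab` and `encode` are hypothetical; the dictionary lookup is a dependency-free stand-in that mimics the fitted `OrdinalEncoder`, not the notebook's actual transform step.

```python
UNKNOWN_CODE = -1  # mirrors unknown_value=-1 in the encoder above


def fit_vocab(train_values):
    """Assign codes 0..N-1 to the sorted unique training values."""
    return {value: code for code, value in enumerate(sorted(set(train_values)))}


def encode(values, vocab):
    """Look up each value, falling back to UNKNOWN_CODE for unseen ones."""
    return [vocab.get(value, UNKNOWN_CODE) for value in values]


train_cnxn = ["Cable/DSL", "Cellular", "Corporate", "Cable/DSL"]
val_cnxn = ["Cellular", "Satellite"]  # "Satellite" is made up and unseen in training

vocab = fit_vocab(train_cnxn)
raw_codes = encode(val_cnxn, vocab)             # unknown -> -1
shifted_indices = [c + 1 for c in raw_codes]    # unknown -> 0, known -> 1..N
vocab_size = len(vocab) + 1                     # same "+ 1" as in category_sizes

print(raw_codes, shifted_indices, vocab_size)
```

Every shifted index then falls inside `[0, vocab_size)`, which is what a lookup table of that size (an embedding layer, for instance) would expect.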