## Neural Network Development Playground
To use this notebook, you need the `clean_data.parquet` file in the `./data/` directory. I will share this file with you on Google Drive. If you would rather create it yourself, you will need to:
```
1. Download the impressions and conversions data from Claritas: use the AWS CLI commands in the README.md file. (Credentials are in GH secrets.)
2. Run `python main.py preprocess --impressions-file ./data/impressions.csv --conversions-file ./data/conversions.csv --output-file ./data/clean_data.parquet`
     - This command cleans the data and saves it as a parquet file. It uses multiprocessing to speed up the user agent string parsing, which is pretty cool.
```
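
The multiprocessing speed-up mentioned for the preprocess command could look roughly like the sketch below. This is not the code in `main.py`: `parse_user_agent` is a hypothetical stand-in that only does crude substring checks, and the function names and the `processes` parameter are assumptions made for illustration.

```python
from multiprocessing import Pool
from typing import Dict, List, Optional


def parse_user_agent(ua: str) -> Dict[str, object]:
    """Hypothetical stand-in parser: crude substring checks, not a real UA library."""
    lowered = ua.lower()
    if "iphone" in lowered or "ipad" in lowered:
        ua_os = "iOS"
    elif "android" in lowered:
        ua_os = "Android"
    else:
        ua_os = "Other"
    return {
        "ua_os": ua_os,
        "ua_is_mobile": "mobile" in lowered or "iphone" in lowered,
        "ua_is_bot": "bot" in lowered,
    }


def parse_user_agents(user_agents: List[str], processes: Optional[int] = None) -> List[Dict[str, object]]:
    """Parse every user agent string, spreading the work over worker processes."""
    if processes == 1:
        # Serial path: handy for debugging and for small inputs.
        return [parse_user_agent(ua) for ua in user_agents]
    # processes=None lets Pool use one worker per available CPU.
    with Pool(processes=processes) as pool:
        # pool.map preserves input order; chunksize cuts per-item overhead.
        return pool.map(parse_user_agent, user_agents, chunksize=1024)
```

On platforms that start workers by spawning a fresh interpreter, the call to `parse_user_agents` should sit under an `if __name__ == "__main__":` guard.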
### Goal for This Notebook
I want to use the `src.data_processing.datasets.AuctionDataset` class to create a PyTorch Dataset and develop the required functionality for this class. Then I want to use this dataset to train a simple neural network.
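
The core of that functionality can be sketched without PyTorch, because a map-style dataset only needs `__len__` and `__getitem__`. `ToyAuctionDataset`, its constructor arguments, and the two rows below are made up for illustration; the real `AuctionDataset` takes a dataframe and would normally subclass `torch.utils.data.Dataset`.

```python
from typing import Any, Dict, List, Tuple


class ToyAuctionDataset:
    """Hypothetical stand-in for AuctionDataset, backed by a list of row dicts."""

    def __init__(self, rows: List[Dict[str, Any]], target_column: str = "conversion_flag"):
        self.target_column = target_column
        # Split each row into its features and its target once, up front.
        self.features = [
            {name: value for name, value in row.items() if name != target_column}
            for row in rows
        ]
        self.targets = [float(row[target_column]) for row in rows]

    def __len__(self) -> int:
        # Number of samples; a DataLoader uses this to know the index range.
        return len(self.targets)

    def __getitem__(self, index: int) -> Tuple[Dict[str, Any], float]:
        # Return one (features, target) pair.
        return self.features[index], self.targets[index]


# Two made-up rows, only to exercise the class.
rows = [
    {"cnxn_type": "Cable/DSL", "impression_hour": 14, "conversion_flag": 0},
    {"cnxn_type": "Cellular", "impression_hour": 18, "conversion_flag": 1},
]
toy = ToyAuctionDataset(rows)
features, target = toy[1]
```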

### Dataset Overview
Each row of the dataset describes a single impression and includes a `conversion_flag` column that indicates whether that impression resulted in a conversion. We're trying to predict `conversion_flag` from the other information about the impression.
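
Since `conversion_flag` is binary, this is a binary classification problem. One common setup, assumed here rather than taken from the project, is for the network to output a single logit, squash it with a sigmoid, and score it with binary cross-entropy. A plain-Python sketch of those two formulas:

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw model output (logit) to a probability in (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative logits, this form avoids overflow in math.exp.
    z = math.exp(logit)
    return z / (1.0 + z)


def binary_cross_entropy(probability: float, target: float, eps: float = 1e-12) -> float:
    """Loss for one sample; target is 1.0 for a conversion and 0.0 otherwise."""
    p = min(max(probability, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


p = sigmoid(0.0)  # a logit of 0 corresponds to a probability of 0.5
loss_if_converted = binary_cross_entropy(p, 1.0)
loss_if_not = binary_cross_entropy(p, 0.0)
```

In PyTorch, `torch.nn.BCEWithLogitsLoss` combines both steps in a numerically stable way.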


In [1]:
import pandas as pd
from src.data_processing.datasets import AuctionDataset

dataset = AuctionDataset(dataframe=pd.read_parquet('./data/clean_data.parquet'))

Dataset initialized. Number of samples: 3845798
Number of features: 19
Feature names: ['placement_id', 'cnxn_type', 'dma', 'country', 'os', 'prizm_premier_code', 'device_type', 'campaign_id', 'goal_name', 'ua_browser', 'ua_os', 'ua_device_family', 'ua_device_brand', 'ua_is_mobile', 'ua_is_tablet', 'ua_is_pc', 'ua_is_bot', 'impression_hour', 'impression_dayofweek']


In [2]:
dataset.features.head()

| | placement_id | cnxn_type | dma | country | os | prizm_premier_code | device_type | campaign_id | goal_name | ua_browser | ua_os | ua_device_family | ua_device_brand | ua_is_mobile | ua_is_tablet | ua_is_pc | ua_is_bot | impression_hour | impression_dayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 557650 | Corporate | 602 | us | iOS | Unknown | p | 9317 | No Goal Name | Mobile Safari UI/WKWebView | iOS | iPhone | Apple | True | False | False | False | 14 | 3 |
| 1 | 557650 | Cable/DSL | 623 | us | unknown | Unknown | p | 9317 | No Goal Name | Podcasts | iOS | iOS-Device | Apple | True | False | False | False | 18 | 3 |
| 2 | 557650 | Cable/DSL | 602 | us | unknown | 21 | p | 9317 | No Goal Name | Podcasts | iOS | iOS-Device | Apple | True | False | False | False | 11 | 3 |
| 3 | 557650 | Cable/DSL | 602 | us | unknown | 21 | p | 9317 | No Goal Name | Podcasts | iOS | iOS-Device | Apple | True | False | False | False | 3 | 3 |
| 4 | 557650 | Cellular | 602 | us | iOS | Unknown | p | 9317 | No Goal Name | Mobile Safari UI/WKWebView | iOS | iPhone | Apple | True | False | False | False | 18 | 3 |


In [5]:
# The class implements __getitem__ so that we can use indexing to get a single sample from the dataset.

features, target = dataset[20000]
print(f"Features: {features}")
print(f"Target: {target}")

Features: {'placement_id': 557650, 'cnxn_type': 'Cable/DSL', 'dma': 602, 'country': 'us', 'os': 'unknown', 'prizm_premier_code': '06', 'device_type': 'Unknown', 'campaign_id': 9317, 'goal_name': 'No Goal Name', 'ua_browser': 'Other', 'ua_os': 'WatchOS', 'ua_device_family': 'Other', 'ua_device_brand': 'Unknown', 'ua_is_mobile': False, 'ua_is_tablet': False, 'ua_is_pc': False, 'ua_is_bot': False, 'impression_hour': 14, 'impression_dayofweek': 3}
Target: 0.0
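
The sample above is a dict of mostly string-valued categories, which a neural network cannot consume directly. One common approach, sketched here as an assumption and not as what `AuctionDataset` currently does, is to map each category to an integer index (reserving 0 for unseen values) so the indices can feed embedding layers. `build_vocabularies` and `encode` are hypothetical helpers, and the samples are a hand-picked subset of columns and values.

```python
from typing import Any, Dict, Iterable, List

UNKNOWN_INDEX = 0  # reserved for categories not seen when the vocabulary was built


def build_vocabularies(
    samples: Iterable[Dict[str, Any]], columns: List[str]
) -> Dict[str, Dict[Any, int]]:
    """Assign each distinct value in each column an integer index, starting at 1."""
    vocabularies: Dict[str, Dict[Any, int]] = {column: {} for column in columns}
    for sample in samples:
        for column in columns:
            vocabulary = vocabularies[column]
            value = sample[column]
            if value not in vocabulary:
                vocabulary[value] = len(vocabulary) + 1
    return vocabularies


def encode(sample: Dict[str, Any], vocabularies: Dict[str, Dict[Any, int]]) -> List[int]:
    """Turn one feature dict into a list of indices, one per vocabulary column."""
    return [
        vocabulary.get(sample[column], UNKNOWN_INDEX)
        for column, vocabulary in vocabularies.items()
    ]


samples = [
    {"cnxn_type": "Corporate", "ua_os": "iOS"},
    {"cnxn_type": "Cable/DSL", "ua_os": "iOS"},
    {"cnxn_type": "Cellular", "ua_os": "WatchOS"},
]
vocabularies = build_vocabularies(samples, ["cnxn_type", "ua_os"])
# "Android" was never seen, so it falls back to UNKNOWN_INDEX.
encoded = encode({"cnxn_type": "Cable/DSL", "ua_os": "Android"}, vocabularies)
```

The vocabularies would need to be built from the training split only, so that validation data exercises the unknown-value path the same way production data would.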
