# Data Cleaning
These parquet files need cleaning before we can use them to train the neural network. At a high level, we need to:
1. Figure out the linking strategy between impressions and conversions, i.e. which impressions led to which conversions.
2. Ingest the data into a torch dataset.
   1. Remove unused or underused columns.
   2. Handle missing values.
   3. Rename columns to be more descriptive.
   4. Parse any columns that need parsing (e.g. user-agent strings).
   5. Finally, think about what feature engineering needs to be done.
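
As a rough illustration of steps 2.1 to 2.4, here is a minimal sketch on a made-up frame. The helper name `clean_frame`, the 50% missing-value threshold, the new column names, and the `"Mobi"` substring check for mobile user agents are all assumptions chosen for the example, not decisions already made for this dataset.

```python
import pandas as pd

# Toy frame: column names mirror the impressions data, values are invented.
toy = pd.DataFrame({
    "dttm_utc": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:00",
        "2024-01-02 09:30", "2024-01-02 12:15",
    ]),
    "cnxn_type": ["wifi", None, "cell", "wifi"],
    "user_agent": [
        "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) Mobile/15E148",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        None,
        "Mozilla/5.0 (Linux; Android 14) Mobile Safari/537.36",
    ],
    "mostly_empty": [None, None, None, "x"],
})


def clean_frame(df, max_null_frac=0.5):
    """Hypothetical helper: drop sparse columns, fill gaps, rename, parse."""
    out = df.copy()

    # 2.1 Drop columns whose share of missing values exceeds the threshold.
    null_frac = out.isna().mean()
    out = out.drop(columns=null_frac[null_frac > max_null_frac].index)

    # 2.2 Replace missing categorical values with an explicit "unknown" label.
    for col in ["cnxn_type", "user_agent"]:
        if col in out.columns:
            out[col] = out[col].fillna("unknown")

    # 2.4 Crude user-agent parse (assumption: "Mobi" marks a mobile browser).
    if "user_agent" in out.columns:
        out["is_mobile"] = (
            out["user_agent"].str.contains("Mobi", regex=False).astype(int)
        )

    # 2.3 Rename terse columns to more descriptive names.
    return out.rename(columns={
        "dttm_utc": "impression_time_utc",
        "cnxn_type": "connection_type",
    })


cleaned = clean_frame(toy)
print(cleaned.columns.tolist())
```

A dedicated user-agent parsing library would likely be more robust than the substring check; the sketch only shows where such a step would slot in.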

In [48]:
import pandas as pd

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000) # Adjust width for better table display if needed

impressions_path = './data/test_dataset/impressions_test.pqt/'
conversions_path = './data/test_dataset/conversions_test.pqt/'
device_types_path = './data/data_dictionary/device_types.csv'

# Load the impressions and conversions datasets.
# Note: a parquet dataset can be stored as a directory of files; pandas reads it as one frame.
df_impressions = pd.read_parquet(impressions_path)
df_conversions = pd.read_parquet(conversions_path)

# df_conversions[['imp_click_dttm_utc', 'conv_dttm_utc']].head()

# Drop every column whose name starts with 'aip' from both frames.
df_impressions = df_impressions.drop(columns=[col for col in df_impressions.columns if col.startswith('aip')])
df_conversions = df_conversions.drop(columns=[col for col in df_conversions.columns if col.startswith('aip')])



## 1. Identify the linking strategy between impressions and conversions.
Claritas has updated the datasets to include unique IDs: impressions carry a `unique_id` column, and conversions reference the originating impression through `imp_click_unique_id`. This should make linking much easier.

DEPRECATED: My earlier idea was to match on the `imp_click_dttm_utc` and `conv_dttm_utc` timestamps together with `campaign_id` and `placement_id` from impressions and `imp_click_campaign_id` and `imp_click_placement_id` from conversions. I think this combination should be unique, but I have not verified it.
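
Whether a candidate key really is unique can be checked directly before relying on it. The sketch below uses a hypothetical helper, `key_is_unique`, and invented toy rows; only the column names come from the datasets.

```python
import pandas as pd


def key_is_unique(df, key_cols):
    """Return True if no two rows share the same values in key_cols."""
    return not df.duplicated(subset=key_cols).any()


# Toy impressions: two rows share campaign, placement, and timestamp.
toy_impressions = pd.DataFrame({
    "campaign_id": [1, 1, 2],
    "placement_id": [10, 10, 20],
    "dttm_utc": pd.to_datetime([
        "2024-01-01 10:00:00",
        "2024-01-01 10:00:00",
        "2024-01-01 10:00:05",
    ]),
    "unique_id": ["a", "b", "c"],
})

composite_key = ["campaign_id", "placement_id", "dttm_utc"]
print(key_is_unique(toy_impressions, composite_key))
print(key_is_unique(toy_impressions, ["unique_id"]))
```

In this toy data the composite key collides while `unique_id` does not, which is the kind of case that would make a timestamp-based join ambiguous.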

In [51]:
import pandas as pd

# Assuming df_impressions and df_conversions are already loaded with the new data

# --- Linking using Unique IDs ---
# Perform a left join using the unique IDs
df_merged_final = pd.merge(
    df_impressions,
    df_conversions[['imp_click_unique_id', 'conv_dttm_utc', 'goal_id', 'goal_name']], # Select relevant conversion cols + the key
    left_on='unique_id',
    right_on='imp_click_unique_id',
    how='left'
)

# Create the conversion flag: 1 where the left join found a matching conversion, else 0.
# A non-null value in a column from the right frame (here 'conv_dttm_utc') marks a match.
df_merged_final['conversion_flag'] = df_merged_final['conv_dttm_utc'].notna().astype(int)

# Optional: Drop the redundant ID column from the conversions table if desired
# df_merged_final = df_merged_final.drop(columns=['imp_click_unique_id'])

# --- Verification ---
print("--- Merged Data Info ---")
df_merged_final.info()

print(f"\nNumber of impressions successfully linked to a conversion: {df_merged_final['conversion_flag'].sum()}")

print("\n--- Sample of Merged Data ---")
# Display relevant columns to check the merge
display(df_merged_final[['unique_id', 'dttm_utc', 'conv_dttm_utc', 'goal_name', 'conversion_flag']].head())

# Check a few linked rows
print("\n--- Sample of Linked Rows (conversion_flag == 1) ---")
display(df_merged_final[df_merged_final['conversion_flag'] == 1][['unique_id', 'dttm_utc', 'conv_dttm_utc', 'goal_name', 'conversion_flag']].head())

--- Merged Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   dte                  0 non-null      object        
 1   campaign_id          0 non-null      int64         
 2   placement_id         0 non-null      int64         
 3   dttm_utc             0 non-null      datetime64[ns]
 4   cnxn_type            0 non-null      object        
 5   user_agent           0 non-null      object        
 6   dma                  0 non-null      int32         
 7   country              0 non-null      object        
 8   os                   0 non-null      object        
 9   prizm_premier_code   0 non-null      object        
 10  device_type          0 non-null      object        
 11  unique_id            0 non-null      object        
 12  imp_click_unique_id  0 non-null      object        
 13  conv_dttm_utc        0

|  | unique_id | dttm_utc | conv_dttm_utc | goal_name | conversion_flag |
|---|---|---|---|---|---|



--- Sample of Linked Rows (conversion_flag == 1) ---


|  | unique_id | dttm_utc | conv_dttm_utc | goal_name | conversion_flag |
|---|---|---|---|---|---|
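
One thing to watch with a left join: if an impression has more than one conversion, it appears once per conversion in the merged frame, so summing `conversion_flag` counts rows rather than impressions. A possible guard, sketched on invented toy data, is to reduce conversions to one row per impression first. Keeping the earliest conversion is an assumption, not a decided rule. pandas can then enforce the one-to-one relationship through `validate`.

```python
import pandas as pd

toy_impressions = pd.DataFrame({"unique_id": ["a", "b", "c"]})
toy_conversions = pd.DataFrame({
    "imp_click_unique_id": ["a", "a", "c"],  # impression "a" converts twice
    "conv_dttm_utc": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-05"]),
})

# Keep only the earliest conversion per impression so the join is one-to-one.
first_conversions = (
    toy_conversions.sort_values("conv_dttm_utc")
    .drop_duplicates(subset="imp_click_unique_id", keep="first")
)

merged = toy_impressions.merge(
    first_conversions,
    left_on="unique_id",
    right_on="imp_click_unique_id",
    how="left",
    validate="one_to_one",  # raises MergeError if either key has duplicates
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Flag rows where the join found a matching conversion.
merged["conversion_flag"] = (merged["_merge"] == "both").astype(int)
print(merged[["unique_id", "conv_dttm_utc", "conversion_flag"]])
```

With this shape the merged frame keeps exactly one row per impression, so the flag total equals the number of converting impressions.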


In [53]:
import pandas as pd

# Assuming df_impressions and df_conversions are loaded

# --- Sanity Check: ID Existence ---

# Get the unique IDs from conversions
conversion_ids = df_conversions['imp_click_unique_id'].unique()
print(f"Number of unique imp_click_unique_id in conversions: {len(conversion_ids)}")
# Display a few to check format
print("Sample conversion IDs:", conversion_ids[:5])


# Check which unique_ids from impressions are present in the conversion IDs
matching_mask = df_impressions['unique_id'].isin(conversion_ids)
num_matches_found = matching_mask.sum()

print(f"\nNumber of impression unique_ids found within conversion imp_click_unique_ids: {num_matches_found}")

# If matches were found, show a few of the matching IDs from impressions
if num_matches_found > 0:
    matching_impression_ids = df_impressions.loc[matching_mask, 'unique_id'].unique()
    print("\nSample of matching unique_ids found in impressions:")
    print(matching_impression_ids[:10]) # Show up to 10 matching IDs
else:
    print("\nNo unique_ids from the impressions dataset were found in the conversions dataset's imp_click_unique_id column.")



Number of unique imp_click_unique_id in conversions: 49
Sample conversion IDs: ['93b16c75-919c-4f8e-addc-a3f0f60ab549'
 '92e13770-1ee7-4168-8d2c-913cafc22b42'
 'd6bdf968-9356-4435-85c1-d456d892ff38'
 'e8b0e97b-c6f3-4aed-bc0e-5286d5985451'
 '1e75cbc6-b4e6-49ba-96d5-6f6d343d3c64']

Number of impression unique_ids found within conversion imp_click_unique_ids: 0

No unique_ids from the impressions dataset were found in the conversions dataset's imp_click_unique_id column.
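
Zero matches needs a closer look before the linking strategy can be trusted. The `info()` output earlier reports 0 entries for the merged frame, so a first step is to confirm that `df_impressions` actually contains rows. Another possible cause, which is only a guess at this point, is a formatting difference between the two ID columns (case, surrounding whitespace). The sketch below compares overlap before and after normalisation; `diagnose_id_overlap` is a hypothetical helper and the toy IDs are invented.

```python
import pandas as pd


def diagnose_id_overlap(impression_ids, conversion_ids):
    """Summarise how two ID columns overlap before and after normalisation."""
    def normalise(ids):
        # Drop missing values, then strip whitespace and lowercase the rest.
        return set(ids.dropna().astype(str).str.strip().str.lower())

    raw_left = set(impression_ids.dropna())
    raw_right = set(conversion_ids.dropna())
    norm_left = normalise(impression_ids)
    norm_right = normalise(conversion_ids)

    return {
        "n_impression_ids": len(raw_left),
        "n_conversion_ids": len(raw_right),
        "raw_matches": len(raw_left & raw_right),
        "normalised_matches": len(norm_left & norm_right),
    }


# Toy IDs: the first pair differs only by case and padding.
toy_impression_ids = pd.Series([" 93B16C75-919C ", "aaaa-bbbb", None])
toy_conversion_ids = pd.Series(["93b16c75-919c", "cccc-dddd"])

report = diagnose_id_overlap(toy_impression_ids, toy_conversion_ids)
print(report)
```

On the real data this would be called as `diagnose_id_overlap(df_impressions['unique_id'], df_conversions['imp_click_unique_id'])`. If `n_impression_ids` comes back as 0, the problem is upstream of the join.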
