## Initial Data Exploration

This notebook performs an initial exploration of the impressions and conversions datasets provided.
We will look at the data structure, sample rows, basic statistics, and missing values to get a preliminary understanding of the data.

In [31]:
import pandas as pd

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000) # Adjust width for better table display if needed

### Data Dictionaries Summary

Based on the provided `data/data_dictionary` files:

**Impressions Data (`impressions_test.pqt`)**
Contains information about ad impressions served. Key columns include:
- `dte`: Date of the impression.
- `campaign_id`, `placement_id`: Identifiers for the ad campaign and placement.
- `dttm_utc`: Timestamp of the impression.
- `cnxn_type`: Type of internet connection (e.g., Cable/DSL, Cellular).
- `user_agent`: Browser/device user agent string.
- `dma`: DMA code (Designated Market Area).
- `country`: Country code.
- `os`: Operating system.
- `prizm_premier_code`: Nielsen PRIZM Premier marketing segmentation code.
- `device_type`: Type of device (likely uses codes from `device_types.csv`).
- `aip_*`: Additional device/connection info (screen dimensions, ISP, OS name, hardware).

**Conversions Data (`conversions_test.pqt`)**
Contains information about conversion events attributed to ad impressions/clicks. Key columns include:
- `dte`: Date of the conversion.
- `conv_dttm_utc`: Timestamp of the conversion event.
- `imp_click_dttm_utc`: Timestamp of the attributed impression or click.
- `imp_click_campaign_id`, `imp_click_placement_id`: Identifiers for the campaign/placement associated with the conversion.
- `conv_property_id`: Identifier for the property (e.g., website pixel) where the conversion occurred.
- `conv_dma`: DMA code at time of conversion.
- `conv_user_agent`: User agent at time of conversion.
- `goal_id`, `goal_name`: Identifier and name for the conversion goal (e.g., 'Purchase', 'Sign Up').
- `conv_prizm_premier_code`: Prizm Premier code at time of conversion.
- `conv_cnxn_type`: Connection type at time of conversion.
- `conv_device_type`: Device type at time of conversion.

**Device Types (`device_types.csv`)**
A simple mapping from short device type codes (e.g., 't', 'p') to full names (e.g., 'Tablet', 'Mobile Phone').
- `short`: Device code used in other datasets.
- `full_name`: Descriptive name of the device type.

### Define File Paths

In [2]:
impressions_path = './data/test_dataset/impressions_test/'
conversions_path = './data/test_dataset/conversions_test/'
device_types_path = './data/data_dictionary/device_types.csv'

### Impressions Data Exploration

In [3]:
# Load the impressions dataset
# Note: Parquet datasets can be stored as directories. Pandas reads them correctly.
try:
    df_impressions = pd.read_parquet(impressions_path)
    print(f"Successfully loaded impressions data from {impressions_path}\nShape: {df_impressions.shape}")
except Exception as e:
    print(f"Error loading {impressions_path}: {e}")
    df_impressions = pd.DataFrame() # Create empty dataframe if load fails

Successfully loaded impressions data from ./data/test_dataset/impressions_test/
Shape: (3825710, 20)


In [4]:
# Display basic information and schema
print("\nImpressions Data Info:")
if not df_impressions.empty:
    df_impressions.info()
else:
    print("Impressions DataFrame is empty.")


Impressions Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3825710 entries, 0 to 3825709
Data columns (total 20 columns):
 #   Column              Dtype         
---  ------              -----         
 0   placement_id        int64         
 1   dttm_utc            datetime64[ns]
 2   cnxn_type           object        
 3   user_agent          object        
 4   dma                 int32         
 5   country             object        
 6   os                  object        
 7   prizm_premier_code  object        
 8   device_type         object        
 9   unique_id           object        
 10  aip_device_brand    object        
 11  aip_screen_width    object        
 12  aip_screen_height   object        
 13  aip_screen_ratio    object        
 14  aip_isp             object        
 15  aip_asn             object        
 16  aip_osName          object        
 17  aip_hardware        object        
 18  campaign_id         category      
 19  dte               

In [5]:
# Show the first few rows
print("\nImpressions Data - First 5 Rows:")
if not df_impressions.empty:
    display(df_impressions.head())
else:
    print("Impressions DataFrame is empty.")


Impressions Data - First 5 Rows:


Unnamed: 0,placement_id,dttm_utc,cnxn_type,user_agent,dma,country,os,prizm_premier_code,device_type,unique_id,aip_device_brand,aip_screen_width,aip_screen_height,aip_screen_ratio,aip_isp,aip_asn,aip_osName,aip_hardware,campaign_id,dte
0,557650,2025-01-02 14:49:33,Corporate,PodcastRepublic/18.0 (iPhone; CPU iPhone OS 14...,602,us,iOS,,p,b0eea5e2-0f98-4609-b6c1-141118980e82,,,,,,,,,9317,2025-01-02
1,557650,2025-01-02 18:48:43,Cable/DSL,Podcasts/4024.310.3 CFNetwork/1568.300.101 Dar...,623,us,unknown,,p,b5352d12-7098-4378-929f-51db0bf40d8f,,,,,,,,,9317,2025-01-02
2,557650,2025-01-02 11:14:21,Cable/DSL,Podcasts/4024.210.1 CFNetwork/1568.200.51 Darw...,602,us,unknown,21.0,p,75d06ce4-d3a2-4ec8-94ea-42c0ffaf6ed2,,,,,,,,,9317,2025-01-02
3,557650,2025-01-02 03:12:29,Cable/DSL,Podcasts/4024.210.1 CFNetwork/1568.200.51 Darw...,602,us,unknown,21.0,p,b715b1c9-b150-4fbb-a5df-c02b4e252c9e,,,,,,,,,9317,2025-01-02
4,557650,2025-01-02 18:13:09,Cellular,iHeartRadio/10.47.0 (iPhone; iOS 18.1.1; iPhon...,602,us,iOS,,p,43f8dc0d-3aee-4b7d-8b11-38bb9c778671,,,,,,,,,9317,2025-01-02


In [6]:
# Get descriptive statistics for all columns (including object/string types)
print("\nImpressions Data - Descriptive Statistics:")
if not df_impressions.empty:
    display(df_impressions.describe(include='all'))
else:
    print("Impressions DataFrame is empty.")


Impressions Data - Descriptive Statistics:


Unnamed: 0,placement_id,dttm_utc,cnxn_type,user_agent,dma,country,os,prizm_premier_code,device_type,unique_id,aip_device_brand,aip_screen_width,aip_screen_height,aip_screen_ratio,aip_isp,aip_asn,aip_osName,aip_hardware,campaign_id,dte
count,3825710.0,3825710,3825710,3825710,3825710.0,3825710,3825710,1414182.0,3431224,3825710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3825710.0,3825710
unique,,,3,41424,,47,3,68.0,9,3825710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,103
top,,,Cable/DSL,Podcasts/4024.400.4 CFNetwork/3826.400.120 Dar...,,us,unknown,4.0,p,b0eea5e2-0f98-4609-b6c1-141118980e82,,,,,,,,,9317.0,2025-01-02
freq,,,2074801,699358,,3822963,2508421,78578.0,3173785,1,,,,,,,,,3825710.0,59008
mean,590220.6,2025-02-20 06:30:15.050688768,,,588.3364,,,,,,,,,,,,,,,
min,557650.0,2025-01-02 00:14:15,,,0.0,,,,,,,,,,,,,,,
25%,596771.0,2025-01-24 08:47:58.750000128,,,602.0,,,,,,,,,,,,,,,
50%,596772.0,2025-02-18 03:26:04,,,602.0,,,,,,,,,,,,,,,
75%,596772.0,2025-03-19 19:31:28.750000128,,,602.0,,,,,,,,,,,,,,,
max,596772.0,2025-04-14 23:59:58,,,881.0,,,,,,,,,,,,,,,


In [7]:
# Check for missing values
print("\nImpressions Data - Missing Value Counts:")
if not df_impressions.empty:
    missing_imp = df_impressions.isnull().sum()
    missing_imp_filtered = missing_imp[missing_imp > 0].sort_values(ascending=False)
    if not missing_imp_filtered.empty:
        print("Columns with missing values:")
        display(missing_imp_filtered)
    else:
        print("No missing values found.")
else:
    print("Impressions DataFrame is empty.")


Impressions Data - Missing Value Counts:
Columns with missing values:


aip_device_brand      3825710
aip_screen_width      3825710
aip_screen_height     3825710
aip_screen_ratio      3825710
aip_isp               3825710
aip_asn               3825710
aip_osName            3825710
aip_hardware          3825710
prizm_premier_code    2411528
device_type            394486
dtype: int64

### Conversions Data Exploration

In [8]:
# Load the conversions dataset
try:
    df_conversions = pd.read_parquet(conversions_path)
    print(f"Successfully loaded conversions data from {conversions_path}\nShape: {df_conversions.shape}")
except Exception as e:
    print(f"Error loading {conversions_path}: {e}")
    df_conversions = pd.DataFrame() # Create empty dataframe if load fails

Successfully loaded conversions data from ./data/test_dataset/conversions_test/
Shape: (28056, 14)


In [9]:
# Display basic information and schema
print("\nConversions Data Info:")
if not df_conversions.empty:
    df_conversions.info()
else:
    print("Conversions DataFrame is empty.")


Conversions Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28056 entries, 0 to 28055
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   conv_dttm_utc            28056 non-null  datetime64[ns]
 1   imp_click_dttm_utc       28056 non-null  datetime64[ns]
 2   imp_click_placement_id   28056 non-null  int64         
 3   imp_click_unique_id      28056 non-null  object        
 4   conv_property_id         28056 non-null  int64         
 5   conv_dma                 28056 non-null  int32         
 6   conv_user_agent          28056 non-null  object        
 7   goal_id                  28056 non-null  int64         
 8   goal_name                28056 non-null  object        
 9   conv_prizm_premier_code  5126 non-null   object        
 10  conv_cnxn_type           28043 non-null  object        
 11  conv_device_type         28002 non-null  object        
 12  imp_clic

In [10]:
# Show the first few rows
print("\nConversions Data - First 5 Rows:")
if not df_conversions.empty:
    display(df_conversions.head())
else:
    print("Conversions DataFrame is empty.")


Conversions Data - First 5 Rows:


Unnamed: 0,conv_dttm_utc,imp_click_dttm_utc,imp_click_placement_id,imp_click_unique_id,conv_property_id,conv_dma,conv_user_agent,goal_id,goal_name,conv_prizm_premier_code,conv_cnxn_type,conv_device_type,imp_click_campaign_id,dte
0,2025-01-01 01:51:56,2024-12-19 17:18:06,557650,127f8d07-ca50-4e34-af4f-1e10628d8f99,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 18_1_1 like...,58454,lead,,Cable/DSL,p,9317,2025-01-01
1,2025-01-01 03:19:51,2024-12-20 01:50:56,557650,61c40d8a-8540-433a-951e-e295c90957a4,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 18_1_1 like...,58454,lead,,Cable/DSL,p,9317,2025-01-01
2,2025-01-01 07:05:02,2024-12-19 06:48:07,557650,77eb1ab1-354e-440a-aa5e-ecdde7f11fac,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like M...,58454,lead,,Cable/DSL,p,9317,2025-01-01
3,2025-01-01 23:21:14,2024-12-19 20:24:38,557650,de677643-3e11-41a9-b332-f2e3efd825f6,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 18_1_1 like...,58454,lead,,Cable/DSL,p,9317,2025-01-01
4,2025-01-01 18:01:06,2024-12-21 03:33:09,557650,dd0fd289-8e3a-4508-b3f8-839cf2e322e3,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 17_6_1 like...,58454,lead,,Cable/DSL,p,9317,2025-01-01


In [11]:
# Get descriptive statistics for all columns
print("\nConversions Data - Descriptive Statistics:")
if not df_conversions.empty:
    try:
        display(df_conversions.describe(include='all'))
    except Exception as e:
        print(f"Could not generate descriptive statistics: {e}")
else:
    print("Conversions DataFrame is empty.")


Conversions Data - Descriptive Statistics:


Unnamed: 0,conv_dttm_utc,imp_click_dttm_utc,imp_click_placement_id,imp_click_unique_id,conv_property_id,conv_dma,conv_user_agent,goal_id,goal_name,conv_prizm_premier_code,conv_cnxn_type,conv_device_type,imp_click_campaign_id,dte
count,28056,28056,28056.0,28056,28056.0,28056.0,28056,28056.0,28056,5126.0,28043,28002,28056.0,28056
unique,,,,7827,,,1202,,2,64.0,3,4,1.0,104
top,,,,769f42b0-9f92-455d-9b36-9f632a4972ca,,,Mozilla/5.0 (iPhone; CPU iPhone OS 18_1_1 like...,,lead,4.0,Cable/DSL,p,9317.0,2025-01-25
freq,,,,91,,,7176,,26542,363.0,26554,22749,28056.0,1162
mean,2025-02-12 14:13:03.331836416,2025-02-08 09:12:15.321143552,588849.890825,,4228.0,586.734388,,58454.053964,,,,,,
min,2025-01-01 00:10:27,2024-12-18 03:18:03,557650.0,,4228.0,0.0,,58454.0,,,,,,
25%,2025-01-21 17:16:33.500000,2025-01-16 04:49:30,596771.0,,4228.0,602.0,,58454.0,,,,,,
50%,2025-02-05 20:18:20,2025-02-02 00:48:31,596771.0,,4228.0,602.0,,58454.0,,,,,,
75%,2025-03-06 14:45:46,2025-03-03 15:50:45,596772.0,,4228.0,602.0,,58454.0,,,,,,
max,2025-04-14 11:35:55,2025-04-14 02:21:09,596772.0,,4228.0,718.0,,58455.0,,,,,,


In [12]:
# Check for missing values
print("\nConversions Data - Missing Value Counts:")
if not df_conversions.empty:
    missing_conv = df_conversions.isnull().sum()
    missing_conv_filtered = missing_conv[missing_conv > 0].sort_values(ascending=False)
    if not missing_conv_filtered.empty:
        print("Columns with missing values:")
        display(missing_conv_filtered)
    else:
        print("No missing values found.")
else:
    print("Conversions DataFrame is empty.")



Conversions Data - Missing Value Counts:
Columns with missing values:


conv_prizm_premier_code    22930
conv_device_type              54
conv_cnxn_type                13
dtype: int64

### Device Types Mapping

In [13]:
# Load the device types mapping
try:
    df_device_types = pd.read_csv(device_types_path)
    print(f"Successfully loaded device types data from {device_types_path}\nShape: {df_device_types.shape}")
except Exception as e:
    print(f"Error loading {device_types_path}: {e}")
    df_device_types = pd.DataFrame()

Successfully loaded device types data from ./data/data_dictionary/device_types.csv
Shape: (10, 2)


In [14]:
# Display basic information
print("\nDevice Types Info:")
if not df_device_types.empty:
    df_device_types.info()
else:
    print("Device Types DataFrame is empty.")


Device Types Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   short      10 non-null     object
 1   full_name  10 non-null     object
dtypes: object(2)
memory usage: 292.0+ bytes


In [15]:
# Show all rows (it's a small file)
print("\nDevice Types Data:")
if not df_device_types.empty:
    display(df_device_types)
else:
    print("Device Types DataFrame is empty.")


Device Types Data:


Unnamed: 0,short,full_name
0,t,Tablet
1,p,Mobile Phone
2,d,Desktop
3,g,Game Console
4,m,Media Player
5,v,TV
6,o,Other
7,s,Set Top Box
8,e,E-Reader
9,x,Mobile Device


## Insights

**Insights from the Exploration:**

1.  **Data Scope & Size:**
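    *   The two datasets differ greatly in size: 3,825,710 impressions (20 columns) versus 28,056 conversions (14 columns).
    *   The primary event date (`dte`) takes 103 distinct values in impressions and 104 in conversions.
    *   Impressions (`dttm_utc`) run from 2025-01-02 to 2025-04-14; conversions (`conv_dttm_utc`) run from 2025-01-01 to 2025-04-14.
    *   The attributed impression/click timestamps (`imp_click_dttm_utc` in conversions) start earlier (2024-12-18), indicating an attribution window. Some attributed events therefore predate the earliest row in the impressions file.
    *   The data is focused: a single `campaign_id` (9317), a single `conv_property_id` (4228), and almost exclusively `country` 'us' (3,822,963 of 3,825,710 impressions, although 47 country codes appear). `placement_id` takes at least three values (557650, 596771, 596772), with 596772 the median in impressions. Findings will be specific to this segment.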
2.  **Missing Data:**
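    *   **Impressions:** The `aip_*` columns (`aip_device_brand`, `aip_screen_width`, etc.) are *entirely empty* (3,825,710 nulls each). `prizm_premier_code` is missing in about 63% of rows (2,411,528) and `device_type` in about 10% (394,486). `os` is 'unknown' in about 66% of rows (2,508,421).
    *   **Conversions:** `conv_prizm_premier_code` is missing in about 82% of rows (22,930 of 28,056). `conv_device_type` (54) and `conv_cnxn_type` (13) have only a handful of nulls.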
3.  **Key Features & Values:**
    *   **Timestamps:** Loaded correctly as datetime objects. The difference between `imp_click_dttm_utc` and `conv_dttm_utc` represents conversion latency.
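    *   **Identifiers:** Campaign and property IDs are constant; placement IDs take only a few values. `dma` is 602 at every quartile in both files, with a minimum of 0 (needs clarification - unknown?).
    *   **Categorical:** `cnxn_type` (3 values), `os` (3), `device_type` (9 in impressions, 4 in conversions), and `goal_name` (2) have few unique values. `device_type` uses codes ('p', 'd', 't') which map to full names via `device_types.csv`.
    *   **User Agent:** Highly variable strings (41,424 distinct values in impressions, 1,202 in conversions); require parsing for useful features.
    *   **Conversion Goals:** Primarily 'lead' (26,542 of 28,056, about 95%); the remaining 1,514 rows belong to the second goal, 'content'.
    *   **Device Types:** Impressions are mostly 'p' (Mobile Phone): 3,173,785 of the 3,431,224 rows with a device type, about 92%. Conversions are also mostly 'p' (22,749 of 28,002, about 81%), with the rest falling under three other codes, 'd' (Desktop) among them. `df_device_types` provides the lookup.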
4.  **Data Quality/Usefulness:**
    *   The `aip_*` columns are unusable in this dataset: they contain no non-null values.
    *   Missing `prizm_premier_code` needs a strategy, since most rows in both files lack it.
    *   The single-campaign, single-property focus of the data limits generalizability.
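
The missing-value figures discussed above are easier to compare as shares than as raw counts. The following is a minimal sketch on a toy frame, not part of the original notebook: `missing_share` is a hypothetical helper name and the toy values are made up for illustration. In the notebook it would be called on `df_impressions` or `df_conversions`.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null share per column, for columns with at least one null."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "share_missing": counts / len(df),
    })
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("share_missing", ascending=False)


# Toy stand-in for df_impressions (values are illustrative only).
toy = pd.DataFrame({
    "os": ["iOS", "unknown", "unknown", "Android"],
    "prizm_premier_code": [None, "21", None, None],
    "aip_isp": [None, None, None, None],
})

print(missing_share(toy))

# Placeholder strings such as 'unknown' are not nulls, so they are counted separately.
unknown_os_share = (toy["os"] == "unknown").mean()
print(f"os == 'unknown': {unknown_os_share:.0%}")
```

Keeping the 'unknown' check separate from the null check matters here because `isnull()` alone would report `os` as fully populated.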

**Suggested Further Exploration Steps:**

1.  **Data Cleaning & Preprocessing:**
    *   **Drop Useless Columns:** Remove the `aip_*` columns from `df_impressions`.
    *   **Handle Missing Values:**
        *   Decide on a strategy for `prizm_premier_code` and `conv_prizm_premier_code` (e.g., fill with a placeholder like 'Unknown', investigate imputation later if models require it).
        *   Decide how to handle `os == 'unknown'`.
        *   Investigate what `dma == 0` represents.
    *   **Map Device Types:** Merge `df_device_types` into both `df_impressions` and `df_conversions` using the `device_type`/`conv_device_type` codes (on `short`) to get the full names. Standardize the column name (e.g., to `device_type_full`).
    *   **Parse User Agents:** Use a library (like `user_agents`) to parse `user_agent` and `conv_user_agent` into structured features (e.g., browser family, OS family, device type/brand if available). This might provide more detail than the existing `os` and `device_type` columns.
    *   **Check Timestamps/Dates:** Check whether the `dte` columns are needed or whether the timestamps suffice. Convert `dte` to datetime if it is kept.
2.  **Feature Engineering:**
    *   **Conversion Latency:** Calculate `df_conversions['conversion_latency'] = df_conversions['conv_dttm_utc'] - df_conversions['imp_click_dttm_utc']`. Analyze its distribution (e.g., using `.describe()` or a histogram).
    *   **Time Features:** Extract hour of day and day of week from the relevant timestamps. The data spans roughly 15 weeks, so both are usable.
3.  **Deeper Analysis (after cleaning):**
    *   **Univariate Analysis:** Plot distributions of key cleaned features (histograms for numeric/latency, bar charts for categoricals like `device_type_full`, `cnxn_type`, `goal_name`, parsed UA features, `prizm_premier_code`).
    *   **Bivariate Analysis:**
        *   How do features in the `conversions` dataset relate to `goal_name`? (e.g., `pd.crosstab(df_conversions['device_type_full'], df_conversions['goal_name'])`)
        *   Is there a relationship between `device_type_full` (from the impression) and conversion likelihood? (This requires defining a target variable, potentially by linking impressions to conversions if possible, for example via `unique_id` and `imp_click_unique_id` if these turn out to share a key, or by using the conversion dataset itself.)
        *   Analyze conversion latency by different dimensions (e.g., device type).
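
The cleaning steps listed under item 1 can be sketched as follows. This is an illustration on toy data under stated assumptions, not the notebook's actual implementation: the frames `impressions` and `device_types` are small made-up stand-ins for `df_impressions` and `df_device_types`, the 'Unknown' placeholder is only one possible choice, and the lookup uses `Series.map` instead of the merge suggested above.

```python
import pandas as pd

# Toy stand-ins for df_impressions and df_device_types (illustrative values only).
impressions = pd.DataFrame({
    "device_type": ["p", "d", None, "t"],
    "prizm_premier_code": ["21", None, "04", None],
    "aip_isp": [None, None, None, None],
    "aip_asn": [None, None, None, None],
})
device_types = pd.DataFrame({
    "short": ["t", "p", "d"],
    "full_name": ["Tablet", "Mobile Phone", "Desktop"],
})

# 1. Drop columns that contain no data at all (here: the aip_* columns).
cleaned = impressions.dropna(axis="columns", how="all")

# 2. Fill missing segment codes with an explicit placeholder.
cleaned = cleaned.assign(
    prizm_premier_code=cleaned["prizm_premier_code"].fillna("Unknown")
)

# 3. Map device codes to full names; missing or unmapped codes stay NaN.
code_to_name = device_types.set_index("short")["full_name"]
cleaned["device_type_full"] = cleaned["device_type"].map(code_to_name)

print(cleaned)
```

Mapping through a lookup series keeps the row count and order unchanged, which is convenient on a frame with millions of rows; a left merge on `short` would give the same column.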

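The feature-engineering and bivariate steps can be sketched in the same way. Again this is a hedged illustration on a made-up toy frame standing in for `df_conversions`; the derived column names `latency_days`, `conv_hour`, and `conv_day_of_week` are assumptions introduced for the example, and the raw `conv_device_type` codes are used in place of the mapped full names.

```python
import pandas as pd

# Toy stand-in for df_conversions (illustrative values only).
conversions = pd.DataFrame({
    "conv_dttm_utc": pd.to_datetime([
        "2025-01-01 01:51:56", "2025-01-01 03:19:51", "2025-01-02 12:00:00",
    ]),
    "imp_click_dttm_utc": pd.to_datetime([
        "2024-12-19 17:18:06", "2024-12-20 01:50:56", "2025-01-02 09:00:00",
    ]),
    "conv_device_type": ["p", "p", "d"],
    "goal_name": ["lead", "lead", "content"],
})

# Conversion latency as a Timedelta, plus a numeric version in days for plotting.
conversions["conversion_latency"] = (
    conversions["conv_dttm_utc"] - conversions["imp_click_dttm_utc"]
)
conversions["latency_days"] = (
    conversions["conversion_latency"].dt.total_seconds() / 86_400
)

# Simple calendar features from the conversion timestamp.
conversions["conv_hour"] = conversions["conv_dttm_utc"].dt.hour
conversions["conv_day_of_week"] = conversions["conv_dttm_utc"].dt.day_name()

# Latency distribution overall and by device code, then device code vs. goal.
print(conversions["latency_days"].describe())
print(conversions.groupby("conv_device_type")["latency_days"].median())
print(pd.crosstab(conversions["conv_device_type"], conversions["goal_name"]))
```

Converting the latency to a float number of days makes it straightforward to pass to a histogram or to `groupby` aggregations, while the original Timedelta column remains available for readable output.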