## Initial Data Exploration

This notebook performs an initial exploration of the impressions and conversions datasets provided.
We will look at the data structure, sample rows, basic statistics, and missing values to get a preliminary understanding of the data.

In [1]:
import pandas as pd

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000) # Adjust width for better table display if needed

### Data Dictionaries Summary

Based on the provided `data/data_dictionary` files:

**Impressions Data (`impressions_test.pqt`)**
Contains information about ad impressions served. Key columns include:
- `dte`: Date of the impression.
- `campaign_id`, `placement_id`: Identifiers for the ad campaign and placement.
- `dttm_utc`: Timestamp of the impression.
- `cnxn_type`: Type of internet connection (e.g., Cable/DSL, Cellular).
- `user_agent`: Browser/device user agent string.
- `dma`: DMA code (Designated Market Area).
- `country`: Country code.
- `os`: Operating system.
- `prizm_premier_code`: Nielsen PRIZM Premier marketing segmentation code.
- `device_type`: Type of device (likely uses codes from `device_types.csv`).
- `aip_*`: Additional device/connection info (screen dimensions, ISP, OS name, hardware).

**Conversions Data (`conversions_test.pqt`)**
Contains information about conversion events attributed to ad impressions/clicks. Key columns include:
- `dte`: Date of the conversion.
- `conv_dttm_utc`: Timestamp of the conversion event.
- `imp_click_dttm_utc`: Timestamp of the attributed impression or click.
- `imp_click_campaign_id`, `imp_click_placement_id`: Identifiers for the campaign/placement associated with the conversion.
- `conv_property_id`: Identifier for the property (e.g., website pixel) where the conversion occurred.
- `conv_dma`: DMA code at time of conversion.
- `conv_user_agent`: User agent at time of conversion.
- `goal_id`, `goal_name`: Identifier and name for the conversion goal (e.g., 'Purchase', 'Sign Up').
- `conv_prizm_premier_code`: PRIZM Premier code at time of conversion.
- `conv_cnxn_type`: Connection type at time of conversion.
- `conv_device_type`: Device type at time of conversion.

**Device Types (`device_types.csv`)**
A simple mapping from short device type codes (e.g., 't', 'p') to full names (e.g., 'Tablet', 'Mobile Phone').
- `short`: Device code used in other datasets.
- `full_name`: Descriptive name of the device type.
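
As a rough illustration of how this lookup might be applied, the sketch below maps short codes to full names with `Series.map`. The three code/name pairs are taken from `device_types.csv`; the `events` frame, the unmatched code `'z'`, and the `device_type_full` column name are hypothetical.

```python
import pandas as pd

# Lookup table in the same shape as device_types.csv (a subset of its rows).
device_types = pd.DataFrame(
    {"short": ["t", "p", "d"], "full_name": ["Tablet", "Mobile Phone", "Desktop"]}
)

# Hypothetical event rows that carry only the short code ('z' is deliberately unmatched).
events = pd.DataFrame({"device_type": ["p", "p", "d", "z"]})

# Build a code -> name Series, then map it onto the event rows.
# Codes missing from the lookup come back as NaN rather than raising.
code_to_name = device_types.set_index("short")["full_name"]
events["device_type_full"] = events["device_type"].map(code_to_name)

print(events)
```

Using `map` keeps the row count unchanged, which makes unmatched codes easy to spot as nulls afterwards.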

### Define File Paths

In [12]:
impressions_path = './data/test_dataset/impressions_test.pqt/'
conversions_path = './data/test_dataset/conversions_test.pqt/'
device_types_path = './data/data_dictionary/device_types.csv'

### Impressions Data Exploration

In [27]:
# Load the impressions dataset
# Note: Parquet datasets are often stored as a directory of part files; pd.read_parquet accepts the directory path.
try:
    df_impressions = pd.read_parquet(impressions_path)
    print(f"Successfully loaded impressions data from {impressions_path}\nShape: {df_impressions.shape}")
except Exception as e:
    print(f"Error loading {impressions_path}: {e}")
    df_impressions = pd.DataFrame() # Create empty dataframe if load fails

Successfully loaded impressions data from ./data/test_dataset/impressions_test.pqt/part-00000-tid-5776622799839121443-364b55fd-bac7-4594-9804-54ded5d936c4-228037-1-c000.snappy.parquet
Shape: (97, 19)


In [28]:
# Display basic information and schema
print("\nImpressions Data Info:")
if not df_impressions.empty:
    df_impressions.info()
else:
    print("Impressions DataFrame is empty.")


Impressions Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   dte                 97 non-null     object        
 1   campaign_id         97 non-null     int64         
 2   placement_id        97 non-null     int64         
 3   dttm_utc            97 non-null     datetime64[ns]
 4   cnxn_type           97 non-null     object        
 5   user_agent          97 non-null     object        
 6   dma                 97 non-null     int32         
 7   country             97 non-null     object        
 8   os                  97 non-null     object        
 9   prizm_premier_code  43 non-null     object        
 10  device_type         97 non-null     object        
 11  aip_device_brand    0 non-null      object        
 12  aip_screen_width    0 non-null      object        
 13  aip_screen_height   0 non-nu

In [15]:
# Show the first few rows
print("\nImpressions Data - First 5 Rows:")
if not df_impressions.empty:
    display(df_impressions.head())
else:
    print("Impressions DataFrame is empty.")


Impressions Data - First 5 Rows:


Unnamed: 0,dte,campaign_id,placement_id,dttm_utc,cnxn_type,user_agent,dma,country,os,prizm_premier_code,device_type,aip_device_brand,aip_screen_width,aip_screen_height,aip_screen_ratio,aip_isp,aip_asn,aip_osName,aip_hardware
0,2025-04-09,9317,596772,2025-04-09 02:58:19,Cable/DSL,Podcasts/4024.400.4 CFNetwork/3826.400.120 Dar...,602,us,unknown,19,p,,,,,,,,
1,2025-04-09,9317,596772,2025-04-09 02:58:24,Cable/DSL,Podcasts/4024.400.4 CFNetwork/3826.400.120 Dar...,602,us,unknown,19,p,,,,,,,,
2,2025-04-09,9317,596772,2025-04-09 02:58:38,Cable/DSL,Podcasts/4024.400.4 CFNetwork/3826.400.120 Dar...,602,us,unknown,19,p,,,,,,,,
3,2025-04-09,9317,596772,2025-04-09 02:58:54,Cable/DSL,Podcasts/4024.400.4 CFNetwork/3826.400.120 Dar...,602,us,unknown,19,p,,,,,,,,
4,2025-04-09,9317,596772,2025-04-09 02:59:21,Cable/DSL,Podcasts/4024.400.4 CFNetwork/3826.400.120 Dar...,602,us,unknown,19,p,,,,,,,,


In [17]:
# Get descriptive statistics for all columns (including object/string types)
print("\nImpressions Data - Descriptive Statistics:")
if not df_impressions.empty:
    display(df_impressions.describe(include='all'))
else:
    print("Impressions DataFrame is empty.")


Impressions Data - Descriptive Statistics:


Unnamed: 0,dte,campaign_id,placement_id,dttm_utc,cnxn_type,user_agent,dma,country,os,prizm_premier_code,device_type,aip_device_brand,aip_screen_width,aip_screen_height,aip_screen_ratio,aip_isp,aip_asn,aip_osName,aip_hardware
count,97,97.0,97.0,97,97,97,97.0,97,97,43.0,97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
unique,1,,,,3,35,,1,3,24.0,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
top,2025-04-09,,,,Cable/DSL,Podcasts/4024.400.4 CFNetwork/3826.400.120 Dar...,,us,unknown,19.0,p,,,,,,,,
freq,97,,,,52,24,,97,69,5.0,79,,,,,,,,
mean,,9317.0,596771.824742,2025-04-09 06:32:53.329896960,,,554.391753,,,,,,,,,,,,
min,,9317.0,596771.0,2025-04-09 00:27:52,,,0.0,,,,,,,,,,,,
25%,,9317.0,596772.0,2025-04-09 03:05:15,,,602.0,,,,,,,,,,,,
50%,,9317.0,596772.0,2025-04-09 06:40:02,,,602.0,,,,,,,,,,,,
75%,,9317.0,596772.0,2025-04-09 09:38:55,,,602.0,,,,,,,,,,,,
max,,9317.0,596772.0,2025-04-09 11:51:13,,,602.0,,,,,,,,,,,,


In [18]:
# Check for missing values
print("\nImpressions Data - Missing Value Counts:")
if not df_impressions.empty:
    missing_imp = df_impressions.isnull().sum()
    missing_imp_filtered = missing_imp[missing_imp > 0].sort_values(ascending=False)
    if not missing_imp_filtered.empty:
        print("Columns with missing values:")
        display(missing_imp_filtered)
    else:
        print("No missing values found.")
else:
    print("Impressions DataFrame is empty.")


Impressions Data - Missing Value Counts:
Columns with missing values:


aip_device_brand      97
aip_screen_width      97
aip_screen_height     97
aip_screen_ratio      97
aip_isp               97
aip_asn               97
aip_osName            97
aip_hardware          97
prizm_premier_code    54
dtype: int64

### Conversions Data Exploration

In [29]:
# Load the conversions dataset
try:
    df_conversions = pd.read_parquet(conversions_path)
    print(f"Successfully loaded conversions data from {conversions_path}\nShape: {df_conversions.shape}")
except Exception as e:
    print(f"Error loading {conversions_path}: {e}")
    df_conversions = pd.DataFrame() # Create empty dataframe if load fails

Successfully loaded conversions data from ./data/test_dataset/conversions_test.pqt/part-00000-tid-5580591345044934395-614171d7-bf52-4ea7-bad7-1efbcf97748d-228631-1-c000.snappy.parquet
Shape: (100, 13)


In [20]:
# Display basic information and schema
print("\nConversions Data Info:")
if not df_conversions.empty:
    df_conversions.info()
else:
    print("Conversions DataFrame is empty.")


Conversions Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   dte                      100 non-null    object        
 1   conv_dttm_utc            100 non-null    datetime64[ns]
 2   imp_click_dttm_utc       100 non-null    datetime64[ns]
 3   imp_click_campaign_id    100 non-null    int64         
 4   imp_click_placement_id   100 non-null    int64         
 5   conv_property_id         100 non-null    int64         
 6   conv_dma                 100 non-null    int32         
 7   conv_user_agent          100 non-null    object        
 8   goal_id                  100 non-null    int64         
 9   goal_name                100 non-null    object        
 10  conv_prizm_premier_code  58 non-null     object        
 11  conv_cnxn_type           100 non-null    object        
 12  conv_device_t

In [21]:
# Show the first few rows
print("\nConversions Data - First 5 Rows:")
if not df_conversions.empty:
    display(df_conversions.head())
else:
    print("Conversions DataFrame is empty.")


Conversions Data - First 5 Rows:


Unnamed: 0,dte,conv_dttm_utc,imp_click_dttm_utc,imp_click_campaign_id,imp_click_placement_id,conv_property_id,conv_dma,conv_user_agent,goal_id,goal_name,conv_prizm_premier_code,conv_cnxn_type,conv_device_type
0,2025-04-09,2025-04-09 00:06:02,2025-04-02 20:26:30,9317,596772,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_2 like...,58454,lead,,Cable/DSL,p
1,2025-04-09,2025-04-09 03:03:17,2025-04-09 01:54:07,9317,596772,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_2 like...,58455,content,30.0,Cable/DSL,p
2,2025-04-09,2025-04-09 00:48:00,2025-04-09 00:33:15,9317,596772,4228,602,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7...,58454,lead,12.0,Cable/DSL,d
3,2025-04-09,2025-04-09 11:43:39,2025-03-26 20:17:28,9317,596772,4228,602,Mozilla/5.0 (Linux; Android 13; T432W Build/TP...,58454,lead,,Cable/DSL,p
4,2025-04-09,2025-04-09 05:00:22,2025-04-05 21:34:48,9317,596772,4228,602,Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_2 like...,58454,lead,36.0,Cable/DSL,p


In [22]:
# Get descriptive statistics for all columns
print("\nConversions Data - Descriptive Statistics:")
if not df_conversions.empty:
    try:
        display(df_conversions.describe(include='all'))
    except Exception as e:
        print(f"Could not generate descriptive statistics: {e}")
else:
    print("Conversions DataFrame is empty.")


Conversions Data - Descriptive Statistics:


Unnamed: 0,dte,conv_dttm_utc,imp_click_dttm_utc,imp_click_campaign_id,imp_click_placement_id,conv_property_id,conv_dma,conv_user_agent,goal_id,goal_name,conv_prizm_premier_code,conv_cnxn_type,conv_device_type
count,100,100,100,100.0,100.0,100.0,100.0,100,100.0,100,58.0,100,100
unique,1,,,,,,,31,,2,19.0,1,2
top,2025-04-09,,,,,,,Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_2 like...,,lead,12.0,Cable/DSL,p
freq,100,,,,,,,25,,95,13.0,100,77
mean,,2025-04-09 08:10:46.079999744,2025-04-06 03:07:07.800000256,9317.0,596771.89,4228.0,595.98,,58454.05,,,,
min,,2025-04-09 00:06:02,2025-03-26 20:17:28,9317.0,596771.0,4228.0,0.0,,58454.0,,,,
25%,,2025-04-09 02:53:52.500000,2025-04-03 16:08:09.750000128,9317.0,596772.0,4228.0,602.0,,58454.0,,,,
50%,,2025-04-09 04:18:34.500000,2025-04-07 09:25:21,9317.0,596772.0,4228.0,602.0,,58454.0,,,,
75%,,2025-04-09 11:05:43.249999872,2025-04-08 21:55:34,9317.0,596772.0,4228.0,602.0,,58454.0,,,,
max,,2025-04-09 23:11:36,2025-04-09 15:55:04,9317.0,596772.0,4228.0,602.0,,58455.0,,,,


In [23]:
# Check for missing values
print("\nConversions Data - Missing Value Counts:")
if not df_conversions.empty:
    missing_conv = df_conversions.isnull().sum()
    missing_conv_filtered = missing_conv[missing_conv > 0].sort_values(ascending=False)
    if not missing_conv_filtered.empty:
        print("Columns with missing values:")
        display(missing_conv_filtered)
    else:
        print("No missing values found.")
else:
    print("Conversions DataFrame is empty.")



Conversions Data - Missing Value Counts:
Columns with missing values:


conv_prizm_premier_code    42
dtype: int64

### Device Types Mapping

In [24]:
# Load the device types mapping
try:
    df_device_types = pd.read_csv(device_types_path)
    print(f"Successfully loaded device types data from {device_types_path}\nShape: {df_device_types.shape}")
except Exception as e:
    print(f"Error loading {device_types_path}: {e}")
    df_device_types = pd.DataFrame()

Successfully loaded device types data from ./data/data_dictionary/device_types.csv
Shape: (10, 2)


In [25]:
# Display basic information
print("\nDevice Types Info:")
if not df_device_types.empty:
    df_device_types.info()
else:
    print("Device Types DataFrame is empty.")


Device Types Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   short      10 non-null     object
 1   full_name  10 non-null     object
dtypes: object(2)
memory usage: 292.0+ bytes


In [26]:
# Show all rows (it's a small file)
print("\nDevice Types Data:")
if not df_device_types.empty:
    display(df_device_types)
else:
    print("Device Types DataFrame is empty.")


Device Types Data:


Unnamed: 0,short,full_name
0,t,Tablet
1,p,Mobile Phone
2,d,Desktop
3,g,Game Console
4,m,Media Player
5,v,TV
6,o,Other
7,s,Set Top Box
8,e,E-Reader
9,x,Mobile Device


## Insights

**Insights from the Exploration:**

1.  **Data Scope & Size:**
    *   The datasets are quite small (97 impressions, 100 conversions).
    *   The primary event date (`dte`) is only '2025-04-09' in both files.
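    *   Impression timestamps (`dttm_utc`) cover only the first half of the day (00:27 to 11:51 UTC), while conversion timestamps (`conv_dttm_utc`) span nearly the whole day (00:06 to 23:11 UTC).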
    *   The attributed impression/click timestamps (`imp_click_dttm_utc` in conversions) go back further (to late March), indicating an attribution window.
    *   The data seems focused: single `campaign_id` (9317), predominantly one `placement_id` (596772), single `country` ('us'), and single `conv_property_id` (4228). Findings will be specific to this segment.
2.  **Missing Data:**
    *   **Impressions:** The `aip_*` columns (`aip_device_brand`, `aip_screen_width`, etc.) are *entirely empty* in this sample. `prizm_premier_code` is missing in over half of the rows (54/97). `os` is 'unknown' in most rows (69/97), which is effectively missing data even though it is not null.
    *   **Conversions:** `conv_prizm_premier_code` is missing in 42/100 rows.
3.  **Key Features & Values:**
    *   **Timestamps:** The `*_dttm_utc` columns load as `datetime64[ns]`, but `dte` loads as `object` (string) in both files. Conversion latency is `conv_dttm_utc` minus `imp_click_dttm_utc`.
    *   **Identifiers:** A single campaign (9317) and two placement IDs (596771 and 596772) appear in both files. `dma` is mostly 602, with some rows at 0 (needs clarification; possibly an 'unknown' code).
    *   **Categorical:** `cnxn_type`, `os`, `device_type`, `goal_name` have limited unique values. `device_type` uses codes ('p', 'd', 't') which map to full names via `device_types.csv`.
    *   **User Agent:** Highly variable strings; require parsing for useful features.
    *   **Conversion Goals:** Primarily 'lead' (95/100), few 'content'.
    *   **Device Types:** Impressions are mostly 'p' (Mobile Phone, 79/97). Conversions are also mostly 'p' (77/100), with some 'd' (Desktop, 23/100). `df_device_types` provides the lookup.
4.  **Data Quality/Usefulness:**
    *   The `aip_*` columns seem unusable based on this sample.
    *   Missing `prizm_premier_code` needs a strategy.
    *   The small, focused nature of the sample limits generalizability.
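
The missing-value counts earlier only catch true nulls, while placeholder values such as `os == 'unknown'` or a `dma` of 0 may hide further gaps. The sketch below is one way to count both together. It assumes those two placeholders really mean "not available", which the data dictionary does not confirm; the sample frame, `sentinels`, and `gap_rate` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical sample with the same kinds of gaps seen in the impressions data.
df = pd.DataFrame({
    "os": ["unknown", "iOS", "unknown", "Android"],
    "dma": [602, 0, 602, 602],
    "prizm_premier_code": ["19", None, None, "12"],
})

# ASSUMPTION: these placeholder values stand for "missing".
sentinels = {"os": "unknown", "dma": 0}


def gap_rate(frame, sentinels):
    """Share of rows per column that are null or equal to that column's sentinel."""
    is_gap = frame.isna()
    for col, value in sentinels.items():
        is_gap[col] = is_gap[col] | (frame[col] == value)
    return is_gap.mean()


rates = gap_rate(df, sentinels)
print(rates.sort_values(ascending=False))
```

Reporting a rate instead of a raw count makes the two datasets comparable even though their row counts differ.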

**Suggested Further Exploration Steps:**

1.  **Data Cleaning & Preprocessing:**
    *   **Drop Useless Columns:** Remove the `aip_*` columns from `df_impressions`.
    *   **Handle Missing Values:**
        *   Decide on a strategy for `prizm_premier_code` and `conv_prizm_premier_code` (e.g., fill with a placeholder like 'Unknown', investigate imputation later if models require it).
        *   Decide how to handle `os == 'unknown'`.
        *   Investigate what `dma == 0` represents.
    *   **Map Device Types:** Merge `df_device_types` into both `df_impressions` and `df_conversions` using the `device_type`/`conv_device_type` codes (on `short`) to get the full names. Standardize the column name (e.g., to `device_type_full`).
    *   **Parse User Agents:** Use a library (like `user_agents`) to parse `user_agent` and `conv_user_agent` into structured features (e.g., browser family, OS family, device type/brand if available). This might provide more detail than the existing `os` and `device_type` columns.
    *   **Check Timestamps/Dates:** Decide whether the `dte` columns are needed or whether the timestamps suffice. If `dte` is kept, convert it from `object` to datetime.
2.  **Feature Engineering:**
    *   **Conversion Latency:** Calculate `df_conversions['conversion_latency'] = df_conversions['conv_dttm_utc'] - df_conversions['imp_click_dttm_utc']`. Analyze its distribution (e.g., using `.describe()` or a histogram).
    *   **Time Features:** Extract hour of day, day of week (though less useful with single-day data) from relevant timestamps.
3.  **Deeper Analysis (after cleaning):**
    *   **Univariate Analysis:** Plot distributions of key cleaned features (histograms for numeric/latency, bar charts for categoricals like `device_type_full`, `cnxn_type`, `goal_name`, parsed UA features, `prizm_premier_code`).
    *   **Bivariate Analysis:**
        *   How do features in the `conversions` dataset relate to `goal_name`? (e.g., `pd.crosstab(df_conversions['device_type_full'], df_conversions['goal_name'])`)
        *   Is there a relationship between `device_type_full` (from impression) and conversion likelihood? (This requires defining a target variable, potentially by linking impressions to conversions if possible, or using the conversion dataset itself).
        *   Analyze conversion latency by different dimensions (e.g., device type).
4.  **Joining/Modeling Prep:**
    *   **Clarify Goal:** What is the ultimate aim? Predicting conversion probability per impression? Analyzing post-conversion behavior? Understanding attribution drivers?
    *   **Define Target:** If predicting conversions, you'll need a target variable (e.g., 1 if an impression led to a conversion, 0 otherwise). This might involve complex joining or assuming the `conversions` dataset represents the "1"s and sampling/finding the "0"s from `impressions`. Given the attribution seems pre-calculated in the `conversions` table, analysis might focus *on the characteristics of converted impressions*.
    *   **Merge Data:** Combine cleaned features (like mapped device types, parsed user agents) into the relevant DataFrames.
5.  **Address Limitations:** Acknowledge the small sample size and limited scope. Results are preliminary. If possible, obtain a larger, more representative dataset covering more campaigns, time periods, and possibly user identifiers for better linking.
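
A few of the steps above (dropping empty columns, conversion latency, and a goal-by-device crosstab) can be sketched on toy data as follows. This is a minimal sketch, not a finished pipeline: the rows are invented, the `impressions_clean`, `latency_hours`, and `goal_by_device` names are hypothetical, and `dropna(how="all")` is used in place of removing the `aip_*` columns by name, so it would also drop any other column that happens to be entirely null.

```python
import pandas as pd

# Hypothetical rows shaped like the two datasets (values are invented).
impressions = pd.DataFrame({
    "device_type": ["p", "d", "p"],
    "aip_isp": [None, None, None],
    "aip_asn": [None, None, None],
})
conversions = pd.DataFrame({
    "conv_dttm_utc": pd.to_datetime(["2025-04-09 03:00:00", "2025-04-09 12:00:00"]),
    "imp_click_dttm_utc": pd.to_datetime(["2025-04-09 01:00:00", "2025-04-08 12:00:00"]),
    "conv_device_type": ["p", "d"],
    "goal_name": ["lead", "content"],
})

# Drop every column that contains no values at all (here: the aip_* columns).
impressions_clean = impressions.dropna(axis="columns", how="all")

# Conversion latency: conversion time minus attributed impression/click time.
conversions["conversion_latency"] = (
    conversions["conv_dttm_utc"] - conversions["imp_click_dttm_utc"]
)
# Hours are easier to summarize and plot than raw Timedelta values.
latency_hours = conversions["conversion_latency"].dt.total_seconds() / 3600

# Count conversions per device code and goal.
goal_by_device = pd.crosstab(conversions["conv_device_type"], conversions["goal_name"])

print(impressions_clean.columns.tolist())
print(latency_hours.describe())
print(goal_by_device)
```

On the real data, `latency_hours.describe()` would give a first look at the attribution window before any plotting.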
