## Initial Data Exploration

This notebook takes a first look at the provided impressions and conversions datasets.
We inspect the data structure, sample rows, basic statistics, and missing values to build a preliminary understanding before any deeper analysis.

In [None]:
import pandas as pd
from IPython.display import display  # explicit import; Jupyter also provides display() implicitly

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)  # Widen console output so wide tables are not wrapped

### Data Dictionaries Summary

Based on the provided `data/data_dictionary` files:

**Impressions Data (`impressions_test.pqt`)**
Contains information about ad impressions served. Key columns include:
- `dte`: Date of the impression.
- `campaign_id`, `placement_id`: Identifiers for the ad campaign and placement.
- `dttm_utc`: Timestamp of the impression.
- `cnxn_type`: Type of internet connection (e.g., Cable/DSL, Cellular).
- `user_agent`: Browser/device user agent string.
- `dma`: DMA code (Designated Market Area).
- `country`: Country code.
- `os`: Operating system.
- `prizm_premier_code`: Nielsen PRIZM Premier marketing segmentation code.
- `device_type`: Type of device (likely uses codes from `device_types.csv`).
- `aip_*`: Additional device/connection info (screen dimensions, ISP, OS name, hardware).
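
The `aip_*` family can be pulled out as a group for a quick look. A minimal sketch, assuming these columns share the literal `aip_` prefix; the toy frame and the names `aip_screen_width` and `aip_isp` are hypothetical stand-ins, not the real schema:

```python
import pandas as pd

# Toy frame; the column names after the prefix are hypothetical, not the real schema
toy_impressions = pd.DataFrame({
    "campaign_id": [1, 2],
    "aip_screen_width": [1920, 390],
    "aip_isp": ["ExampleNet", "ExampleMobile"],
})

# Keep only the columns whose name starts with "aip_"
aip_cols = [c for c in toy_impressions.columns if c.startswith("aip_")]
aip_view = toy_impressions[aip_cols]
print(aip_cols)
```

On the real data, the same list comprehension over `df_impressions.columns` would show which `aip_*` fields are actually present.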

**Conversions Data (`conversions_test.pqt`)**
Contains information about conversion events attributed to ad impressions/clicks. Key columns include:
- `dte`: Date of the conversion.
- `conv_dttm_utc`: Timestamp of the conversion event.
- `imp_click_dttm_utc`: Timestamp of the attributed impression or click.
- `imp_click_campaign_id`, `imp_click_placement_id`: Identifiers for the campaign/placement associated with the conversion.
- `conv_property_id`: Identifier for the property (e.g., website pixel) where the conversion occurred.
- `conv_dma`: DMA code at time of conversion.
- `conv_user_agent`: User agent at time of conversion.
- `goal_id`, `goal_name`: Identifier and name for the conversion goal (e.g., 'Purchase', 'Sign Up').
- `conv_prizm_premier_code`: PRIZM Premier code at time of conversion.
- `conv_cnxn_type`: Connection type at time of conversion.
- `conv_device_type`: Device type at time of conversion.
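
Because each conversion carries both `conv_dttm_utc` and `imp_click_dttm_utc`, the lag between the attributed event and the conversion can be derived directly. A minimal sketch on made-up rows, assuming the timestamps arrive as strings (the real columns may already be datetime-typed); the `time_to_conv` column name is hypothetical:

```python
import pandas as pd

# Toy rows; the timestamps are made up for illustration
toy_conversions = pd.DataFrame({
    "imp_click_dttm_utc": ["2024-01-01 10:00:00", "2024-01-01 12:30:00"],
    "conv_dttm_utc": ["2024-01-01 10:45:00", "2024-01-02 12:30:00"],
})

# Parse the strings as UTC timestamps
for col in ["imp_click_dttm_utc", "conv_dttm_utc"]:
    toy_conversions[col] = pd.to_datetime(toy_conversions[col], utc=True)

# Lag between the attributed impression/click and the conversion
toy_conversions["time_to_conv"] = (
    toy_conversions["conv_dttm_utc"] - toy_conversions["imp_click_dttm_utc"]
)
print(toy_conversions["time_to_conv"])
```

Negative lags on real data would indicate a conversion recorded before its attributed event, which is worth flagging during exploration.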

**Device Types (`device_types.csv`)**
A simple mapping from short device type codes (e.g., 't', 'p') to full names (e.g., 'Tablet', 'Mobile Phone').
- `short`: Device code used in other datasets.
- `full_name`: Descriptive name of the device type.
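
The mapping can be applied to a column of device codes with `Series.map`. A minimal sketch on toy data, assuming the two example codes pair up in the order listed above (`'t'` with `'Tablet'`, `'p'` with `'Mobile Phone'`); the code `'x'` is made up to show what happens to an unmapped value:

```python
import pandas as pd

# Toy mapping built from the two example codes in the data dictionary
toy_device_types = pd.DataFrame({
    "short": ["t", "p"],
    "full_name": ["Tablet", "Mobile Phone"],
})
# Toy device codes; "x" is a made-up code that is absent from the mapping
toy_codes = pd.Series(["p", "t", "x", "p"], name="device_type")

# Build a code -> name lookup and apply it; unmapped codes become NaN
code_to_name = toy_device_types.set_index("short")["full_name"]
device_names = toy_codes.map(code_to_name)

# Codes present in the data but missing from the mapping
unmapped = sorted(set(toy_codes.dropna()) - set(code_to_name.index))
print(device_names.tolist(), unmapped)
```

Listing the unmapped codes is a cheap way to test the assumption that `device_type` and `conv_device_type` really use the codes from `device_types.csv`.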

### Define File Paths

In [None]:
# Define file paths relative to the notebook's location in the 'notebooks' directory
impressions_path = '../data/test_dataset/impressions_test.pqt'
conversions_path = '../data/test_dataset/conversions_test.pqt'
device_types_path = '../data/data_dictionary/device_types.csv'
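
Relative paths like these resolve against the notebook's working directory, so it can help to confirm they exist before loading. A minimal sketch; `report_paths` is a hypothetical helper, and the paths in the usage example are placeholders rather than the project's files:

```python
from pathlib import Path

def report_paths(paths):
    """Return {label: exists?} for each path; works for files and directories."""
    return {label: Path(p).exists() for label, p in paths.items()}

# Placeholder usage: one path that exists and one that should not
status = report_paths({"cwd": ".", "missing": "./no_such_dir/no_such_file.pqt"})
print(status)
```

`Path.exists()` returns `True` for both files and directories, so the check also covers Parquet datasets stored as a directory.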

### Impressions Data Exploration

In [None]:
# Load the impressions dataset
# Note: a Parquet dataset may be a single file or a directory of part files; pd.read_parquet accepts either.
try:
    df_impressions = pd.read_parquet(impressions_path)
    print(f"Successfully loaded impressions data from {impressions_path}\nShape: {df_impressions.shape}")
except Exception as e:
    print(f"Error loading {impressions_path}: {e}")
    df_impressions = pd.DataFrame() # Create empty dataframe if load fails
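
The load-or-fall-back pattern in the cell above recurs for each dataset in this notebook, so it could be factored into one function. A minimal sketch; `load_or_empty` is a hypothetical helper, not part of pandas, and the path in the usage line is a placeholder chosen to trigger the fallback:

```python
import pandas as pd

def load_or_empty(reader, path):
    """Call reader(path); on any failure, report it and return an empty DataFrame."""
    try:
        df = reader(path)
        print(f"Loaded {path}, shape: {df.shape}")
        return df
    except Exception as e:
        print(f"Error loading {path}: {e}")
        return pd.DataFrame()

# A path that should not exist, to exercise the fallback branch
df_missing = load_or_empty(pd.read_csv, "./no_such_dir/no_such_file.csv")
```

With this helper, loading would read as `load_or_empty(pd.read_parquet, impressions_path)`. Catching every exception keeps the notebook running, but it also hides the failure type, so a narrower `except` may be preferable outside exploratory work.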

In [None]:
# Display basic information and schema
print("\nImpressions Data Info:")
if not df_impressions.empty:
    df_impressions.info()
else:
    print("Impressions DataFrame is empty.")

In [None]:
# Show the first few rows
print("\nImpressions Data - First 5 Rows:")
if not df_impressions.empty:
    display(df_impressions.head())
else:
    print("Impressions DataFrame is empty.")

In [None]:
# Get descriptive statistics for all columns (including object/string types)
print("\nImpressions Data - Descriptive Statistics:")
if not df_impressions.empty:
    try:
        # datetime_is_numeric was removed in pandas 2.0 (datetimes are now always summarized numerically)
        display(df_impressions.describe(include='all'))
    except Exception as e:
        print(f"Could not generate descriptive statistics: {e}")
else:
    print("Impressions DataFrame is empty.")

In [None]:
# Check for missing values
print("\nImpressions Data - Missing Value Counts:")
if not df_impressions.empty:
    missing_imp = df_impressions.isnull().sum()
    missing_imp_filtered = missing_imp[missing_imp > 0].sort_values(ascending=False)
    if not missing_imp_filtered.empty:
        print("Columns with missing values:")
        display(missing_imp_filtered)
    else:
        print("No missing values found.")
else:
    print("Impressions DataFrame is empty.")
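
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch that adds a percentage column to the missing-value check; `missing_summary` is a hypothetical helper and the toy frame holds made-up values:

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with made-up values
toy = pd.DataFrame({"a": [1, None, None, 4], "b": ["x", "y", None, "z"], "c": [1, 2, 3, 4]})
summary = missing_summary(toy)
print(summary)
```

The same function applies unchanged to `df_impressions` and `df_conversions`.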

### Conversions Data Exploration

In [None]:
# Load the conversions dataset
try:
    df_conversions = pd.read_parquet(conversions_path)
    print(f"Successfully loaded conversions data from {conversions_path}\nShape: {df_conversions.shape}")
except Exception as e:
    print(f"Error loading {conversions_path}: {e}")
    df_conversions = pd.DataFrame() # Create empty dataframe if load fails

In [None]:
# Display basic information and schema
print("\nConversions Data Info:")
if not df_conversions.empty:
    df_conversions.info()
else:
    print("Conversions DataFrame is empty.")

In [None]:
# Show the first few rows
print("\nConversions Data - First 5 Rows:")
if not df_conversions.empty:
    display(df_conversions.head())
else:
    print("Conversions DataFrame is empty.")

In [None]:
# Get descriptive statistics for all columns
print("\nConversions Data - Descriptive Statistics:")
if not df_conversions.empty:
    try:
        # datetime_is_numeric was removed in pandas 2.0 (datetimes are now always summarized numerically)
        display(df_conversions.describe(include='all'))
    except Exception as e:
        print(f"Could not generate descriptive statistics: {e}")
else:
    print("Conversions DataFrame is empty.")

In [None]:
# Check for missing values
print("\nConversions Data - Missing Value Counts:")
if not df_conversions.empty:
    missing_conv = df_conversions.isnull().sum()
    missing_conv_filtered = missing_conv[missing_conv > 0].sort_values(ascending=False)
    if not missing_conv_filtered.empty:
        print("Columns with missing values:")
        display(missing_conv_filtered)
    else:
        print("No missing values found.")
else:
    print("Conversions DataFrame is empty.")


### Device Types Mapping

In [None]:
# Load the device types mapping
try:
    df_device_types = pd.read_csv(device_types_path)
    print(f"Successfully loaded device types data from {device_types_path}\nShape: {df_device_types.shape}")
except Exception as e:
    print(f"Error loading {device_types_path}: {e}")
    df_device_types = pd.DataFrame()

In [None]:
# Display basic information
print("\nDevice Types Info:")
if not df_device_types.empty:
    df_device_types.info()
else:
    print("Device Types DataFrame is empty.")

In [None]:
# Show all rows (it's a small file)
print("\nDevice Types Data:")
if not df_device_types.empty:
    display(df_device_types)
else:
    print("Device Types DataFrame is empty.")