
# Sampling Bias Correction in TV Viewer Data

## Introduction
In this analysis, we address sampling bias in a dataset of TV viewers. The dataset is a biased sample from a universe of television-watching individuals in which certain demographic categories are over- or under-represented. Our goal is to calculate a set of person-level weights that unbias the dataset.

## Data Loading and Initial Exploration
First, we load the 'demographic attributes.csv' and 'demo ground truth.csv' datasets and explore their structure and content.


In [4]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Load the datasets
demographic_attributes = pd.read_csv('/Users/yannik/Projects/VideoAmp/demographic_attributes.csv')
demo_ground_truth = pd.read_csv('/Users/yannik/Projects/VideoAmp/demo_ground_truth_csv.csv')

# Show the first few rows of each dataset
demographic_attributes.head(), demo_ground_truth.head()


(   person id     age        education ethnicity
 0          0   75_84     Some College     white
 1          1  85_120       HS Diploma     white
 2          2   25_34     Some College     white
 3          3   55_64       HS Diploma     black
 4          4   45_54  Bachelor Degree     white,
   demographic category  number of individuals
 0                18_24               11839159
 1                25_34               16399632
 2                35_44               15335704
 3                45_54               16430762
 4                55_64               15148777)


## Data Cleaning and Validation
Next, we check for missing values and validate the consistency of demographic categories between the two datasets.


In [5]:

# Checking for missing values in both datasets
missing_values_demographic_attributes = demographic_attributes.isnull().sum()
missing_values_demo_ground_truth = demo_ground_truth.isnull().sum()

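# Handling missing values in 'education' column by categorizing them as a separate category
# (assigning the result back avoids the chained-assignment pitfall of inplace=True on a column)
demographic_attributes['education'] = demographic_attributes['education'].fillna('Unknown')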

# Checking unique values in demographic categories to validate consistency
unique_age = demographic_attributes['age'].unique()
unique_education = demographic_attributes['education'].unique()
unique_ethnicity = demographic_attributes['ethnicity'].unique()

missing_values_demographic_attributes, missing_values_demo_ground_truth, unique_age, unique_education, unique_ethnicity


(person id        0
 age              0
 education    22190
 ethnicity        0
 dtype: int64,
 demographic category     0
 number of individuals    0
 dtype: int64,
 array(['75_84', '85_120', '25_34', '55_64', '45_54', '18_24', '65_74',
        '35_44'], dtype=object),
 array(['Some College', 'HS Diploma', 'Bachelor Degree', 'Graduate Degree',
        'Unknown', '< Than HS Diploma'], dtype=object),
 array(['white', 'black', 'hispanic', 'asian', 'islander'], dtype=object))



### Missing Values
Missing values occur only in the `education` column (22,190 rows), which the previous cell labeled 'Unknown'. We reassign each of these individuals to a known education level, drawn at random in proportion to the distribution of known education levels. This eliminates the 'Unknown' category from the data and resolves the missing data issue.

### Consistency of Demographic Categories
We'll also check that the demographic categories in our sample dataset are consistent with the categories in the ground truth dataset, to ensure that our analysis is based on a correct understanding of the data.


In [6]:
# Step 1: Identify 'Unknown' individuals
unknown_education_mask = demographic_attributes['education'] == 'Unknown'

# Step 2: Calculate distribution of known education levels
known_education_distribution = demographic_attributes[~unknown_education_mask]['education'].value_counts(normalize=True)

# Step 3: Redistribute 'Unknown' individuals
np.random.seed(42)  # For reproducibility
redistributed_education = np.random.choice(known_education_distribution.index, size=unknown_education_mask.sum(), p=known_education_distribution.values)

# Step 4: Update the dataset
demographic_attributes.loc[unknown_education_mask, 'education'] = redistributed_education

# Check the first few rows of the dataset to verify the changes
demographic_attributes.head()


   person id     age        education ethnicity
0          0   75_84     Some College     white
1          1  85_120       HS Diploma     white
2          2   25_34     Some College     white
3          3   55_64       HS Diploma     black
4          4   45_54  Bachelor Degree     white


In [7]:

# Calculating the observed proportions in the sampled dataset (Us)
age_distribution_sampled = demographic_attributes['age'].value_counts(normalize=True)
education_distribution_sampled = demographic_attributes['education'].value_counts(normalize=True)
ethnicity_distribution_sampled = demographic_attributes['ethnicity'].value_counts(normalize=True)

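# Calculating the ground truth proportions in the national population (U)
# Note: the ground truth file lists age, education, and ethnicity categories in a single column,
# so this total adds up counts from every category type and the proportions below
# do not sum to 1 within a single attribute.
total_individuals_national = demo_ground_truth['number of individuals'].sum()
ground_truth_proportions = demo_ground_truth.set_index('demographic category')
ground_truth_proportions['proportion'] = ground_truth_proportions['number of individuals'] / total_individuals_national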

# Separating the proportions by demographic category type (age, education, ethnicity)
age_categories = ['18_24', '25_34', '35_44', '45_54', '55_64', '65_74', '75_84', '85_120']
education_categories = ['< Than HS Diploma', 'HS Diploma', 'Some College', 'Bachelor Degree', 'Graduate Degree']
ethnicity_categories = ['white', 'hispanic', 'black', 'asian', 'islander']

age_proportions_national = ground_truth_proportions.loc[age_categories]['proportion']
education_proportions_national = ground_truth_proportions.loc[education_categories]['proportion']
ethnicity_proportions_national = ground_truth_proportions.loc[ethnicity_categories]['proportion']

# Calculating the weights for each individual in the sampled dataset
# (ratio of ground truth proportion to observed proportion for the person's category)
demographic_attributes['age_weight'] = demographic_attributes['age'].map(age_proportions_national / age_distribution_sampled)
demographic_attributes['education_weight'] = demographic_attributes['education'].map(education_proportions_national / education_distribution_sampled)
demographic_attributes['ethnicity_weight'] = demographic_attributes['ethnicity'].map(ethnicity_proportions_national / ethnicity_distribution_sampled)

# Calculating the overall weight for each individual (product of age, education, and ethnicity weights)
demographic_attributes['overall_weight'] = demographic_attributes['age_weight'] * demographic_attributes['education_weight'] * demographic_attributes['ethnicity_weight']

# Normalizing the weights
sum_overall_weights = demographic_attributes['overall_weight'].sum()
normalization_factor = total_individuals_national / sum_overall_weights
demographic_attributes['normalized_weight'] = demographic_attributes['overall_weight'] * normalization_factor

# Verifying that the sum of normalized weights equals the total number of individuals in the national population
sum_normalized_weights = demographic_attributes['normalized_weight'].sum()
assert np.isclose(sum_normalized_weights, total_individuals_national)





## Calculate Sampling Weights
In this section, we calculate the sampling weights for each individual in the sampled dataset. The weights are calculated based on the discrepancies between the observed proportions in the sampled dataset (Us) and the ground truth proportions in the national population (U) for age, education, and ethnicity.

### Steps:
1. **Calculate Observed Proportions**: Compute the distribution of each demographic category in the sampled dataset.
2. **Calculate Ground Truth Proportions**: Using the 'demo ground truth.csv' file, calculate the proportion of each demographic category in the national population.
3. **Calculate Initial Weights**: For each individual, calculate weights for age, education, and ethnicity based on the ratio of ground truth to observed proportions.
4. **Calculate Overall Weights**: Multiply the age, education, and ethnicity weights together to get an overall weight for each individual.
5. **Normalize Weights**: Adjust the weights so that their sum equals the total number of individuals in the national population, ensuring that the weighted sampled dataset is representative of the national population.
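
The ratio in step 3 can be illustrated on a small example. The sketch below uses a hypothetical six-person sample and made-up ground truth counts for a single attribute (none of these values come from the datasets above), and it assumes the target shares are normalized within the attribute so that they sum to 1.

```python
import pandas as pd

# Hypothetical toy sample: six people, one attribute
sample = pd.DataFrame({"age": ["18_24", "18_24", "18_24", "18_24", "25_34", "25_34"]})

# Hypothetical ground truth counts for the same attribute
age_counts = pd.Series({"18_24": 400, "25_34": 600})

# Target shares, normalized within the attribute so they sum to 1
target_share = age_counts / age_counts.sum()

# Observed shares in the sample
sample_share = sample["age"].value_counts(normalize=True)

# Per-category weight = target share / observed share, mapped to each person
category_weight = target_share / sample_share
sample["age_weight"] = sample["age"].map(category_weight)

# Weighted shares of the sample, to compare against target_share
weighted_share = sample.groupby("age")["age_weight"].sum() / sample["age_weight"].sum()
print(weighted_share)
```

Here the over-represented category receives a weight below 1 and the under-represented category a weight above 1, so the weighted shares line up with the target shares for that attribute.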

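This process helps to correct for sampling bias in the dataset, making it more representative of the national population. Multiplying per-attribute weights does not guarantee that the age, education, and ethnicity distributions all match at the same time, so the next cell refines these weights by raking: it adjusts the weights to one attribute at a time and repeats until they stop changing.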


In [26]:

# Optimized raking function
def rake_optimized(weights, variable, demographic_data, ground_truth_series):
    """Rescale weights so the weighted shares of `variable` move to the ground truth shares."""
    # Calculating the weighted distribution for the current variable
    weighted_distribution = weights.groupby(demographic_data[variable]).sum() / weights.sum()

    # Calculating the raking ratio
    raking_ratio = ground_truth_series / weighted_distribution
    raking_ratio = raking_ratio.fillna(1)  # Replace NaN with 1 for categories with no samples

    # Adjusting the weights: each person is scaled by the ratio for their category
    return weights * demographic_data[variable].map(raking_ratio)

# Initializing weights with the normalized weights
raked_weights = demographic_attributes['normalized_weight'].copy()

# Setting a threshold for convergence
convergence_threshold = 1e-6

# Performing raking until convergence (capped so the loop cannot run forever)
max_iterations = 1000
for _ in range(max_iterations):
    # Previous weights
    prev_weights = raked_weights.copy()

    # Raking for 'age'
    raked_weights = rake_optimized(raked_weights, 'age', demographic_attributes, age_proportions_national)

    # Raking for 'education'
    raked_weights = rake_optimized(raked_weights, 'education', demographic_attributes, education_proportions_national)

    # Raking for 'ethnicity'
    raked_weights = rake_optimized(raked_weights, 'ethnicity', demographic_attributes, ethnicity_proportions_national)

    # Checking for convergence
    if np.abs(raked_weights - prev_weights).sum() < convergence_threshold:
        break
else:
    print(f'Raking did not converge within {max_iterations} iterations')

# Normalizing the raked weights
raked_weights = normalize(raked_weights.values.reshape(1, -1), norm='l1') * total_individuals_national
raked_weights = raked_weights.flatten()

# Adding the raked weights to the dataset
demographic_attributes['raked_weight'] = raked_weights

# Checking the first few rows of the dataset with the raked weights
demographic_attributes.head()


   person id     age        education ethnicity  age_weight  education_weight  ethnicity_weight  overall_weight  normalized_weight  raked_weight
0          0   75_84     Some College     white    0.197185          0.313308          0.290993        0.017977         516.925692    423.634924
1          1  85_120       HS Diploma     white    0.245415          0.266008          0.290993        0.018997         546.232925    622.333438
2          2   25_34     Some College     white    0.837213          0.313308          0.290993        0.076329        2194.777954   1851.716351
3          3   55_64       HS Diploma     black    0.175287          0.266008          0.480094        0.022386         643.681720    736.878989
4          4   45_54  Bachelor Degree     white    0.220188          0.222091          0.290993        0.014230         409.173739    450.725903
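
To see what the raking loop does on a small scale, the following sketch runs the same idea on a hypothetical eight-person sample with two attributes. The data, the target shares, and the helper name `rake_once` are all made up for illustration, and the targets are assumed to sum to 1 within each attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample with two attributes
toy = pd.DataFrame({
    "age": ["young", "young", "young", "old", "old", "old", "old", "old"],
    "ethnicity": ["a", "a", "b", "a", "a", "a", "b", "b"],
})

# Hypothetical target shares; each attribute sums to 1
targets = {
    "age": pd.Series({"young": 0.5, "old": 0.5}),
    "ethnicity": pd.Series({"a": 0.4, "b": 0.6}),
}

def rake_once(weights, column, target):
    """Rescale weights so the weighted shares of `column` match `target`."""
    current = weights.groupby(toy[column]).sum() / weights.sum()
    return weights * toy[column].map(target / current)

# Start from equal weights and alternate between the attributes
weights = pd.Series(1.0, index=toy.index)
for iteration in range(200):  # cap the number of passes
    previous = weights.copy()
    for column, target in targets.items():
        weights = rake_once(weights, column, target)
    # Stop when a full pass no longer changes the weights
    if np.abs(weights - previous).max() < 1e-12:
        break

# Weighted shares per attribute, to compare against the targets
for column in targets:
    print(weights.groupby(toy[column]).sum() / weights.sum())
```

Adjusting for one attribute disturbs the other, which is why the passes are repeated; each full pass leaves a smaller mismatch than the one before.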



## Validation
In this section, we validate that the weighted distribution of demographic categories in the sampled dataset now matches the ground truth distribution. This is a crucial step to ensure that our weighting procedure has been successful in correcting the sampling bias.

### Steps:
1. **Calculate Weighted Distribution**: Using the raked weights, calculate the weighted distribution of each demographic category in the sampled dataset.
2. **Compare with Ground Truth**: Compare the weighted distribution with the ground truth distribution to ensure that they are closely aligned.

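A successful validation indicates that our sampling weights have effectively adjusted the sampled dataset to be representative of the national population, mitigating the effects of sampling bias. Note that the two columns in the output below are on different scales: the ground truth proportions were divided by the total count across all category types, so they do not sum to 1 within a single attribute (the age values sum to about 0.281). Within each attribute the Weighted column is a constant multiple of the Ground Truth column (about 3.56 for age and education, about 2.83 for ethnicity), so the relative shares agree.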


In [28]:
import pandas as pd

# Assuming demographic_attributes is your DataFrame and it has a column 'raked_weight' for the weights
# Also assuming demo_ground_truth is your ground truth DataFrame

# 1. Calculate Weighted Distribution
total_raked_weight = demographic_attributes['raked_weight'].sum()
weighted_age_distribution = demographic_attributes.groupby('age')['raked_weight'].sum() / total_raked_weight
weighted_education_distribution = demographic_attributes.groupby('education')['raked_weight'].sum() / total_raked_weight
weighted_ethnicity_distribution = demographic_attributes.groupby('ethnicity')['raked_weight'].sum() / total_raked_weight

# 2. Compare with Ground Truth
# Assuming ground truth distributions are stored in age_proportions_national, education_proportions_national, and ethnicity_proportions_national

age_comparison = pd.DataFrame({'Weighted': weighted_age_distribution, 'Ground Truth': age_proportions_national})
education_comparison = pd.DataFrame({'Weighted': weighted_education_distribution, 'Ground Truth': education_proportions_national})
ethnicity_comparison = pd.DataFrame({'Weighted': weighted_ethnicity_distribution, 'Ground Truth': ethnicity_proportions_national})

# Displaying the results
print("Age Comparison:\n", age_comparison)
print("\nEducation Comparison:\n", education_comparison)
print("\nEthnicity Comparison:\n", ethnicity_comparison)


Age Comparison:
         Weighted  Ground Truth
18_24   0.127771      0.035888
25_34   0.176989      0.049712
35_44   0.165507      0.046487
45_54   0.177325      0.049806
55_64   0.163489      0.045920
65_74   0.107819      0.030284
75_84   0.056351      0.015828
85_120  0.024749      0.006951

Education Comparison:
                    Weighted  Ground Truth
< Than HS Diploma  0.132464      0.037206
Bachelor Degree    0.175975      0.049427
Graduate Degree    0.100834      0.028322
HS Diploma         0.278429      0.078204
Some College       0.312297      0.087716

Ethnicity Comparison:
           Weighted  Ground Truth
asian     0.052633      0.018628
black     0.125276      0.044337
hispanic  0.188032      0.066547
islander  0.001631      0.000577
white     0.632427      0.223823
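
To compare the two columns on a common scale, one option is to rescale each side so that it sums to 1 within the attribute. The sketch below assumes a hypothetical helper, `compare_shares`, which is not part of the notebook, and applies it to the ethnicity values printed above.

```python
import pandas as pd

def compare_shares(weighted, ground_truth):
    """Rescale both inputs to sum to 1 and line them up by category."""
    comparison = pd.DataFrame({
        "Weighted": weighted / weighted.sum(),
        "Ground Truth": ground_truth / ground_truth.sum(),
    })
    comparison["Abs Diff"] = (comparison["Weighted"] - comparison["Ground Truth"]).abs()
    return comparison

# Ethnicity values copied from the output above
weighted = pd.Series({"asian": 0.052633, "black": 0.125276, "hispanic": 0.188032,
                      "islander": 0.001631, "white": 0.632427})
ground_truth = pd.Series({"asian": 0.018628, "black": 0.044337, "hispanic": 0.066547,
                          "islander": 0.000577, "white": 0.223823})

result = compare_shares(weighted, ground_truth)
print(result)
print("Max absolute difference:", result["Abs Diff"].max())
```

With both sides rescaled, any remaining gap should be on the order of the six-decimal rounding in the printed values. The same helper could be applied to the age and education series in the notebook.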
