# Privacy Policy Data Extraction Workflow

This notebook demonstrates the workflow for extracting and preparing data broker privacy policies for analysis. The core functionality has been moved to `data_utils.privacy_policy_extractor` for better code organization and reusability.

## Purpose
- Extract unique privacy policy URLs from the data broker registry
- Prepare datasets for manual or automated privacy policy analysis
- Document the LLM-based analysis workflow used in the research

In [1]:
import sys
import os
from pathlib import Path

# Add project root to path for imports
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Import utility functions
from data_utils.privacy_policy_extractor import (
    prepare_privacy_policy_dataset, 
    create_llm_analysis_prompt,
    validate_llm_output
)

print("Privacy Policy Extraction Utilities Loaded")
print("Core functions available from data_utils.privacy_policy_extractor")

Privacy Policy Extraction Utilities Loaded
Core functions available from data_utils.privacy_policy_extractor


In [2]:
# Configuration
DATA_PATH = '../data/cleaned_data/uq-data-brokers.csv'
OUTPUT_DIR = '../data/cleaned_data/privacy_policies'

# Prepare privacy policy datasets using utility function
print("Preparing Privacy Policy Datasets...")
clean_data, unique_policies = prepare_privacy_policy_dataset(DATA_PATH, OUTPUT_DIR)

print("\nProcessing Complete!")
print(f"Total policies processed: {len(clean_data)}")
print(f"Unique policies identified: {len(unique_policies)}")
print(f"Files saved to: {OUTPUT_DIR}")

Preparing Privacy Policy Datasets...
Datasets saved to ../data/cleaned_data/privacy_policies
   All policies: privacy_policies_cleaned.csv (775 rows)
   Unique policies: privacy_policies_unique_shuffled.csv (657 rows)

Processing Complete!
Total policies processed: 775
Unique policies identified: 657
Files saved to: ../data/cleaned_data/privacy_policies


In [3]:
# Display sample of the processed data
print("Sample of Unique Privacy Policies Dataset:")
display(unique_policies.head(10))

Sample of Unique Privacy Policies Dataset:


| Unnamed: 0 | Name | PrivacyPolicyURL | Name_Clean | PrivacyPolicyURL_Clean |
| --- | --- | --- | --- | --- |
| 0 | VIDEOAMP | https://www.videoamp.com/privacy-policy/ | videoamp | https://www.videoamp.com/privacy-policy/ |
| 1 | HUNT CLUB | https://www.exploreatlas.io/privacy | huntclub | https://www.exploreatlas.io/privacy |
| 2 | INFORMA USA | https://privacy.informa.com/ | informausa | https://privacy.informa.com/ |
| 3 | AZERION US | https://www.azerion.com/azerion-global-corpora... | azerionus | https://www.azerion.com/azerion-global-corpora... |
| 4 | INTELIUS | https://www.intelius.com/privacy-policy/ | intelius | https://www.intelius.com/privacy-policy/ |
| 5 | BLACK PEARL GROUP LIMITED | https://www.blackpearl.com/privacy-policy | blackpearlgrouplimited | https://www.blackpearl.com/privacy-policy |
| 6 | WISDOM MEDIA GROUP | https://www.wisdommediagroupllc.com/privacy.php | wisdommediagroup | https://www.wisdommediagroupllc.com/privacy.php |
| 7 | TECHTARGET | https://www.techtarget.com/privacy-policy/ | techtarget | https://www.techtarget.com/privacy-policy/ |
| 8 | BEESWAX | https://www.beeswax.com/privacy/ | beeswax | https://www.beeswax.com/privacy/ |
| 9 | FD HOLDINGS | https://www.factualdata.com/privacy/ | fdholdings | https://www.factualdata.com/privacy/ |
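
The internals of `prepare_privacy_policy_dataset` are not shown in this notebook. As a rough sketch of the kind of cleaning the sample above suggests (names lowercased with non-alphanumeric characters removed, one row kept per URL, then a shuffle), something like the following could work. The helper names `clean_name` and `dedupe_and_shuffle`, the whitespace-only URL cleaning, and the seed are assumptions for illustration, not the actual implementation. It uses only the standard library so it runs without the project installed.

```python
import random
import re


def clean_name(name: str) -> str:
    # Lowercase and drop everything except letters and digits,
    # e.g. "HUNT CLUB" -> "huntclub" (as in the Name_Clean column above).
    return re.sub(r"[^a-z0-9]", "", name.lower())


def dedupe_and_shuffle(rows, seed=0):
    """Hypothetical helper: keep one row per cleaned URL, then shuffle."""
    seen = set()
    unique = []
    for row in rows:
        # Assumed URL cleaning: trim surrounding whitespace only.
        url = row["PrivacyPolicyURL"].strip()
        if url in seen:
            continue
        seen.add(url)
        unique.append({
            **row,
            "Name_Clean": clean_name(row["Name"]),
            "PrivacyPolicyURL_Clean": url,
        })
    # Seeded shuffle so the order is reproducible across runs.
    random.Random(seed).shuffle(unique)
    return unique


rows = [
    {"Name": "HUNT CLUB", "PrivacyPolicyURL": "https://www.exploreatlas.io/privacy"},
    {"Name": "HUNT CLUB", "PrivacyPolicyURL": " https://www.exploreatlas.io/privacy"},
    {"Name": "BEESWAX", "PrivacyPolicyURL": "https://www.beeswax.com/privacy/"},
]
unique_rows = dedupe_and_shuffle(rows)
print(len(rows), "->", len(unique_rows))  # 3 -> 2
```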


In [9]:
# Show data quality metrics
print("Data Quality Assessment:")
print(f"   Original data points: {len(clean_data)}")
print(f"   Unique privacy policy URLs: {len(unique_policies)}")
print(f"   Duplicate reduction: {len(clean_data) - len(unique_policies)} removed")
print(f"   Uniqueness ratio: {len(unique_policies)/len(clean_data):.1%}")

# Check for potential data quality issues
print("\nData Quality Checks:")
empty_urls = unique_policies['PrivacyPolicyURL'].str.strip().eq('').sum()
short_names = unique_policies['Name'].str.len().lt(3).sum()

if empty_urls > 0:
    print(f"   WARNING: {empty_urls} empty privacy policy URLs found")
if short_names > 0:
    print(f"   WARNING: {short_names} very short company names found")
if empty_urls == 0 and short_names == 0:
    print("   No major data quality issues detected")

Data Quality Assessment:
   Original data points: 775
   Unique privacy policy URLs: 657
   Duplicate reduction: 118 removed
   Uniqueness ratio: 84.8%

Data Quality Checks:
   No major data quality issues detected
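
The checks above only catch empty URLs and very short names. A possible extra check, sketched here with the standard library, flags URLs that lack an `http`/`https` scheme or a host name. The function name `is_well_formed_url` and the second and third sample entries are made up for illustration.

```python
from urllib.parse import urlparse


def is_well_formed_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parts = urlparse(url.strip())
    return parts.scheme in {"http", "https"} and bool(parts.netloc)


sample_urls = [
    "https://www.videoamp.com/privacy-policy/",
    "www.example.com/privacy",  # hypothetical entry with no scheme
    "",                         # hypothetical empty entry
]
malformed = [u for u in sample_urls if not is_well_formed_url(u)]
print(f"{len(malformed)} malformed URL(s) found")  # 2 malformed URL(s) found
```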


In [10]:
# Generate the standardized LLM analysis prompt
llm_prompt = create_llm_analysis_prompt()

print("LLM Analysis Prompt Generated")
print("=" * 50)
print(llm_prompt)
print("=" * 50)
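print("\nPrompt Statistics:")
print(f"   Length: {len(llm_prompt)} characters")
# splitlines() counts lines; counting newline characters undercounts by one
# when the prompt does not end with a newline
print(f"   Lines: {len(llm_prompt.splitlines())} lines")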

LLM Analysis Prompt Generated
Please extract the following information from this privacy policy:

A. If applicable, list all the sources of information that this entity says it collects

B. If applicable, state how the entity uses and shares information

C. If applicable, state which entities the information might be shared to

D. If applicable, state the exact statutes that are mentioned in relation to privacy (e.g., California's Civil Code Section 1798.83)

E. If applicable, state the rights that users have to control and delete their data

F. If applicable, state the methods to contact the data broker

G. If applicable, state the sources where data was collected

While synthesizing the information, make little to no changes to the wording and semantic meaning of the text. Be concise. Remove every instance of '[cite_start]' from your previous response. Please structure your response in this format:

{
    "A": "If applicable, list all the sources of information that this entity says 

## Privacy Policy Extraction Process

**1) Manually download each privacy policy, in the order of the URLs, using the browser's print dialog ("Ctrl + P") with default settings, and save it in our folder as a PDF**

## LLM-Based Privacy Policy Analysis Workflow

### **Step 1: Policy Collection**
1. **Download PDFs**: Based on `uq-data-brokers.csv`, use the generated URLs to download privacy policies
   - The `uq-data-brokers.csv` dataframe was randomly shuffled, and the first 200 data brokers were taken as the sample
   - Save as PDFs using the browser's "Print to PDF" functionality
   - Organize files by company name or URL hash, following the original .csv file
   - The PDFs are stored in this UChicago Box folder: https://uchicago.box.com/s/m5pwd81qc202r4xp3snm58x6nhi200df
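
One way to name the saved PDFs "by company name or URL hash" is to combine both, so that files stay readable and two brokers with similar names do not collide. This is only a sketch: the function `pdf_filename`, the 8-character SHA-256 prefix, and the `name_hash.pdf` pattern are assumptions, not the naming scheme actually used for the files in the Box folder.

```python
import hashlib
import re


def pdf_filename(name: str, url: str) -> str:
    """Hypothetical naming scheme: cleaned company name plus a short URL hash."""
    # Same cleaning as the Name_Clean column: lowercase, letters and digits only.
    slug = re.sub(r"[^a-z0-9]", "", name.lower())
    # A short, stable digest of the URL distinguishes policies that share a name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug}_{digest}.pdf"


fname = pdf_filename("HUNT CLUB", "https://www.exploreatlas.io/privacy")
print(fname)
```

Because the hash depends only on the URL, re-downloading a policy later produces the same filename.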

### **Step 2: LLM Analysis** 
2. **Upload to LLM**: Use Gemini 2.5 Pro (“Reasoning, math & code”)
   - Upload each PDF file individually and run the prompt 3 times per file
   - Apply the standardized prompt (listed above)
   - Collect the structured LLM responses
   - Record the responses in the shared spreadsheet, on the row of the associated data broker
      - The spreadsheet is here: https://docs.google.com/spreadsheets/d/1vHZNOK6ATJ-8pJgTpaPzP7RKGs3A9dMk6WbmVli-KBE/edit?usp=sharing
   - Download the spreadsheet and store it in the `data/raw_data/privacy_policies/` directory as `privacy-policy-scraping-final.csv`
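
Since each policy is prompted 3 times, it can help to see which prompt fields differ between runs before recording a response. The sketch below assumes each run returns a JSON object keyed by the prompt letters A to G, as the prompt's requested format suggests. The function name `fields_in_disagreement` and the sample responses are invented for illustration.

```python
import json

# Field letters taken from the standardized prompt above.
FIELDS = ["A", "B", "C", "D", "E", "F", "G"]


def fields_in_disagreement(raw_responses):
    """Return the prompt fields whose text differs across repeated runs."""
    parsed = [json.loads(r) for r in raw_responses]
    differing = []
    for field in FIELDS:
        # Missing fields are treated as empty strings.
        answers = {str(p.get(field, "")).strip() for p in parsed}
        if len(answers) > 1:
            differing.append(field)
    return differing


# Hypothetical responses from three runs on the same policy.
runs = [
    '{"A": "web forms", "F": "privacy@example.com"}',
    '{"A": "web forms", "F": "privacy@example.com"}',
    '{"A": "web forms", "F": "mail only"}',
]
print(fields_in_disagreement(runs))  # ['F']
```

Exact string comparison is strict; small wording differences between runs will be flagged, which is usually what a reviewer wants to look at by hand.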

### **Step 3: Quality Control**
3. **Manual Verification**: Verify a random sample of 20 responses
   - Check the accuracy of information extraction (~97.86% accurate in the verified sample)
   - Verify citation integrity and completeness
   - Identify common LLM errors or biases
      - Responses to the targeted advertising prohibition question were somewhat inconsistent
      - Some privacy policies had to be re-downloaded, e.g., when a policy was split across two pages or embedded within a larger page
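
Drawing the verification sample and computing an accuracy figure can be sketched as follows. The notebook does not state how the accuracy figure was computed; the tally below (20 responses with 7 fields each, 3 checks judged wrong) is invented purely to show the arithmetic, and the function names and seed are assumptions.

```python
import random


def draw_verification_sample(broker_ids, k=20, seed=0):
    """Pick k brokers at random, reproducibly, for manual checking."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(broker_ids), k))


def accuracy(checks):
    """checks: list of booleans, one per manually verified item."""
    return sum(checks) / len(checks)


sample_ids = draw_verification_sample(range(200), k=20)

# Hypothetical tally: 20 responses x 7 fields = 140 checks, 3 judged wrong.
checks = [True] * 137 + [False] * 3
print(f"{accuracy(checks):.2%}")  # 97.86%
```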

### **Step 4: Data Integration**
4. **Compile Results**:
   - Create consolidated dataset for further analysis
   - Generate summary statistics and insights

### **Example LLM Output Format**

The LLM should return a structured response in the form of a list of numbers, like this: `[1, 2, 2, 1, 1]`.
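
A response in the list-of-numbers form described above can be checked before it is recorded. The sketch below is not the `validate_llm_output` function imported earlier, whose signature is not shown in this notebook. The name `parse_numeric_response` and the expected length of 5 (taken from the example) are assumptions.

```python
import ast


def parse_numeric_response(text: str, expected_len: int = 5):
    """Parse a response such as '[1, 2, 2, 1, 1]' into a list of ints."""
    # literal_eval only evaluates Python literals, so it is safe on LLM output.
    value = ast.literal_eval(text.strip().rstrip("."))
    if not isinstance(value, list) or len(value) != expected_len:
        raise ValueError(f"expected a list of {expected_len} numbers, got {value!r}")
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in value):
        raise ValueError(f"non-integer entry in {value!r}")
    return value


print(parse_numeric_response("[1, 2, 2, 1, 1]."))  # [1, 2, 2, 1, 1]
```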

## Next Steps: Advanced Analysis

After completing the LLM-based privacy policy analysis, you can conduct various analyses:

### **Compliance Analysis**
- **Statute Coverage**: Which privacy laws are most commonly referenced?
- **Geographic Patterns**: Do brokers in different states follow different regulations?
- **Completeness Scoring**: Rate policies based on information comprehensiveness
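
As a starting point for the statute coverage question, the statutes extracted under prompt field D could be tallied per broker. The sketch assumes the statutes have already been reduced to short labels. The broker keys and labels below are placeholder data, not results from this study.

```python
from collections import Counter


def statute_coverage(statutes_per_broker):
    """Count how many brokers mention each statute (each broker counted once)."""
    counts = Counter()
    for statutes in statutes_per_broker.values():
        # Normalize case and de-duplicate within a single broker.
        counts.update({s.strip().upper() for s in statutes})
    return counts


# Placeholder data for illustration only.
mentions = {
    "broker_a": ["CCPA", "GDPR"],
    "broker_b": ["ccpa", "CCPA"],
    "broker_c": ["GDPR", "CCPA", "VCDPA"],
}
coverage = statute_coverage(mentions)
print(coverage.most_common(2))  # [('CCPA', 3), ('GDPR', 2)]
```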

### **Data Practice Patterns** 
- **Collection Sources**: What are the most common data sources?
- **Sharing Networks**: Which third-party categories receive the most data?
- **User Rights**: How consistently are user rights communicated?

### **Transparency Assessment**
- **Contact Accessibility**: How easy is it to reach data brokers?
- **Policy Clarity**: Text readability and complexity analysis
- **Disclosure Quality**: Completeness and specificity of disclosures

### **Integration with Registry Data**
- **Cross-Reference**: Link policy analysis with broker collection declarations
- **Discrepancy Detection**: Identify mismatches between stated and actual practices
- **Risk Profiling**: Combine multiple data sources for comprehensive broker assessment
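
Discrepancy detection can be framed as a set comparison between what a broker declares in the registry and what its privacy policy states. The sketch below assumes both sources have been mapped to comparable category labels keyed by the cleaned broker name. The function name, the labels, and the sample data are all hypothetical.

```python
def find_discrepancies(declared, stated):
    """Compare registry declarations with policy statements per broker."""
    report = {}
    # Only brokers present in both sources can be compared.
    for broker in sorted(declared.keys() & stated.keys()):
        registry_only = declared[broker] - stated[broker]
        policy_only = stated[broker] - declared[broker]
        if registry_only or policy_only:
            report[broker] = {
                "registry_only": sorted(registry_only),
                "policy_only": sorted(policy_only),
            }
    return report


# Hypothetical category labels for illustration only.
declared = {"huntclub": {"contact", "employment"}, "beeswax": {"device"}}
stated = {"huntclub": {"contact"}, "beeswax": {"device"}}
print(find_discrepancies(declared, stated))
# {'huntclub': {'registry_only': ['employment'], 'policy_only': []}}
```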