# Survey Analysis

This notebook analyzes survey data to understand how respondents view data broker practices across three key dimensions:

## Analysis Components

### 1. **Data Use Practices** (LLM Q1)
- Marketing purposes
- Personalized advertising 
- Employment decisions
- Consumer finance
- Law enforcement (without subpoena)

### 2. **Entity Sharing Practices** (LLM Q2)  
- Government entities
- Corporations
- Educational/Research institutions

### 3. **User Rights and Controls** (LLM Q3)
- Data access rights
- Correction capabilities
- Deletion rights  
- Non-discrimination protections
- Opt-out from targeted advertising
- Opt-out option for data collection or sharing

## Methodology

We administered a Google Forms survey to college students, framed around the three categories above, and obtained 62 responses. For many questions, students were asked to select checkboxes based on their preferences (e.g., "select all that apply"). The exact wording of survey questions can be found in "Survey-Questions.pdf" in ".../raw_data/survey/".

While university students are not a representative sample of the American public, we believe this initial dataset offers insight into what consumers want from privacy policies.
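
Because "select all that apply" answers are exported as one comma-separated string per respondent, each such column has to be split into one indicator per option before percentages can be computed. A minimal sketch of that step, using made-up responses (the `toy` series and its option labels are illustrative, not survey data):

```python
import pandas as pd

# Hypothetical checkbox answers: one comma-separated string per respondent.
toy = pd.Series([
    "Marketing, Personalized advertising",
    "Personalized advertising",
    "None of the above",
])

# One indicator column per option; 1 means the respondent ticked it.
indicators = toy.str.get_dummies(sep=", ")

# Share of respondents selecting each option, in percent.
percent_selected = (indicators.mean() * 100).round(1)
print(percent_selected)
```

Options whose labels themselves contain commas, such as "Biometric data (e.g., fingerprint, voice, facial recognition)", would be split incorrectly by this approach, which is why the analysis functions shorten those labels before splitting.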

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt

In [2]:
# Configuration and Data Loading
DATA_PATH = '../data/raw_data/survey/survey_results.csv'

print("Loading survey data...")
try:
    survey = pd.read_csv(DATA_PATH)
except FileNotFoundError:
    # read_csv raises on a bad path instead of returning None
    print("Failed to load data. Please check the file path.")
    raise
else:
    print("Data loaded successfully!")
    print(f"Columns available: {list(survey.columns)}")
    print(f"Shape: {survey.shape}")

Loading survey data...
Data loaded successfully!
Columns available: ['Timestamp', '1. How familiar are you with data brokers?', '2. What types of data do you think data brokers collect? (Select all that apply)', '3. Which types of your personal data are you comfortable being used for any purpose? (Select all that apply)', '4. What purposes are you comfortable with your personal data being used for? (Select all that apply)', '5. Which entities are you comfortable sharing your personal data with? (Select all that apply)', '6. Which of the following do you think should be included in data privacy laws and policies? (Select all that apply)', '7. Have you ever tried to request deletion of your data from a data broker?', '8. Are you a resident of California?', '9. Have you heard of the California DELETE Act? \nThe Act requires the California Privacy Protection Agency (CPPA) to create an accessible universal deletion mechanism by January 1, 2026. Consumers will then be able to access the plat

In [3]:
survey.columns = ["timestamp", "familiarity", "type_collected", "type_collected_comfort", "purposes_comfortable", "entities", "included_laws", "request_deletion", "ca_resident", "delete_act_awareness", "delete_act_use"]
survey.head()

Unnamed: 0,timestamp,familiarity,type_collected,type_collected_comfort,purposes_comfortable,entities,included_laws,request_deletion,ca_resident,delete_act_awareness,delete_act_use
0,11/2/2025 16:13:40,I have a good understanding of their role and ...,"Location data, Biometric data (e.g., fingerpri...",Employment-related data,"Consumer financer decisions (e.g., loans, cred...","Government agencies, Educational or research i...","Right to access personal data, Right to correc...",No,Yes,No,Maybe
1,11/2/2025 16:22:26,I have a basic understanding of their role and...,"Location data, Biometric data (e.g., fingerpri...","Commercial data (e.g., purchasing and transact...","Personalized advertising, Law enforcement acce...","Government agencies, Educational or research i...","Right to access personal data, Right to correc...",I didn't know that was possible,Yes,No,Maybe
2,11/2/2025 17:14:17,I have a basic understanding of their role and...,"Location data, Biometric data (e.g., fingerpri...","Commercial data (e.g., purchasing and transact...","Marketing, Personalized advertising",Educational or research institutions,"Right to access personal data, Right to correc...",Yes,Yes,Yes,Yes
3,11/2/2025 18:03:35,I’ve heard of them but don’t know what they do,"Location data, Biometric data (e.g., fingerpri...",Employment-related data,Personalized advertising,"Government agencies, Educational or research i...","Right to access personal data, Right to delete...",I didn't know that was possible,Yes,No,Maybe
4,11/2/2025 18:47:14,I’ve heard of them but don’t know what they do,"Location data, Biometric data (e.g., fingerpri...",Employment-related data,"Marketing, Personalized advertising, Employmen...",Educational or research institutions,"Protections against discrimination, Restrictio...",I didn't know that was possible,No,,


In [4]:
def process_familiarity(df):
    '''
    Process and analyze data corresponding to the survey question "How familiar are you with data brokers?"
    Returns a dataframe of ordered counts with corresponding percentages, by response category.
    Displays a bar chart of the response frequencies.
    '''
    # List of possible responses, based on survey administration
    possible_responses = [
        "I’ve never heard of data brokers",
        "I’ve heard of them but don’t know what they do",
        "I have a basic understanding of their role and practices",
        "I have a good understanding of their role and practices",
        "I’m very familiar with how data brokers operate"
    ]
    # Force column into ordered categorical
    cat_type = pd.api.types.CategoricalDtype(categories=possible_responses, ordered=True)
    df["familiarity"] = df["familiarity"].astype(cat_type)
    ordered_counts = (
        df["familiarity"]
        .value_counts(sort=False)
        .reset_index()
    )
    ordered_counts.columns = ["response_category", "count"]

    # Graph responses with Altair
    bars = (
        alt.Chart(ordered_counts)
        .mark_bar()
        .encode(
            x=alt.X("count:Q", title="Number of Responses"),
            y=alt.Y(
                "response_category:N",
                sort=possible_responses,
                title="Response Category"
            ),
            tooltip=["response_category", "count"]
        )
        .properties(
            width=650,
            height=220,
            title="Familiarity with Data Brokers"
        )
    )

    # Find percentages from the raw count
    ordered_counts["percent"] = (
        ordered_counts["count"] / ordered_counts["count"].sum() * 100
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        # Label each bar with its percentage, rounded to one decimal place
        text=alt.Text('percent:Q', format='.1f')
    )

    display(bars + text)

    return ordered_counts
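
The ordered categorical is what keeps the response scale in survey order and keeps options that nobody chose in the table with a count of zero. A small sketch with invented answers (the three-level `scale` is hypothetical and much shorter than the survey's wording):

```python
import pandas as pd

# Hypothetical three-level scale, listed from least to most familiar.
scale = ["Never heard of them", "Basic understanding", "Very familiar"]
answers = pd.Series(["Basic understanding", "Never heard of them", "Basic understanding"])

# Casting to an ordered categorical fixes the category order.
ordered = answers.astype(pd.CategoricalDtype(categories=scale, ordered=True))

# sort=False keeps the scale order instead of sorting by frequency;
# categories that were never selected still appear with a count of 0.
counts = ordered.value_counts(sort=False)
print(counts)
```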

In [6]:
def rename_categories(cell, rename_map):
    # Helper function for renaming categories
    for long_name, short_name in rename_map.items():
        cell = cell.replace(long_name, short_name)
    items = [item.strip() for item in cell.split(", ")]
    return ", ".join(items)

def analyze_type_collected(df):
    '''
    Process and analyze data corresponding to the survey question "What types of data do you think data brokers collect? (Select all that apply)".
    Displays a dataframe with the percentage of respondents who think data brokers collect each data type.
    Returns an Altair bar chart of those percentages.
    '''
    # Rename category names so it is easier to split by commas
    rename_map = {
        "Location data": "Location",
        "Biometric data (e.g., fingerprint, voice, facial recognition)": "Biometric",
        "Reproductive health-related information": "Reproductive Health",
        "Commercial data (e.g., purchasing and transaction history)": "Commercial",
        "Employment-related data": "Employment",
        "Personal information of individuals under 18": "Minors",
        "Social Security Number and government ID information": "SSN and ID",
        "Network data (e.g., IP address, browsing history)": "Network",
        "Not sure": "Not Sure"
    }
    df["type_collected"] = df["type_collected"].apply(lambda x: rename_categories(x, rename_map))

    # Split by comma and explode into rows and strip whitespace
    split_df = df["type_collected"].str.split(",").explode()
    split_df = split_df.str.strip()

    # One-Hot Encode Columns
    one_hot = pd.get_dummies(split_df)
    one_hot = one_hot.groupby(split_df.index).max()
    one_hot = pd.DataFrame(one_hot)

    # Calculate Percentages and put in a dictionary
    percentages = dict(zip(one_hot.columns, np.round(one_hot.mean() * 100, 3)))
    percent_df = pd.DataFrame({
        'Category': list(percentages.keys()),
        'Percent': list(percentages.values())
    })

    # Graph
    bars = (
        alt.Chart(percent_df)
        .mark_bar()
        .encode(
            x=alt.X('Percent:Q', title='Percent of Respondents'),
            y=alt.Y('Category:N', title="Data Types"),
            tooltip=['Category', 'Percent']
        )
        .properties(height=300, width=500, title="Data Types that Respondents Believe Data Brokers Collect")
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        text='Percent:Q'
    )

    display(percent_df)
    chart = bars + text

    return chart

survey_type_collected = analyze_type_collected(survey)
display(survey_type_collected)
survey_type_collected.save("imgs/Data Types that Respondents Believe Data Brokers Collect.svg")


Unnamed: 0,Category,Percent
0,Biometric,62.903
1,Commercial,91.935
2,Employment,83.871
3,Location,88.71
4,Minors,37.097
5,Network,91.935
6,Not Sure,11.29
7,Reproductive Health,54.839
8,SSN and ID,40.323
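
The split, one-hot, and average steps in `analyze_type_collected` do not depend on the question being analyzed, so they could be factored into a single helper. A possible sketch (the name `multi_select_percentages` is hypothetical and not part of the notebook), assuming labels have already been shortened so that they contain no commas:

```python
import pandas as pd

def multi_select_percentages(responses: pd.Series) -> pd.DataFrame:
    """Percent of respondents selecting each option of a multi-select question."""
    # One row per (respondent, selected option); the index still identifies the respondent.
    exploded = responses.str.split(",").explode().str.strip()
    # Indicator columns, collapsed back to one row per respondent.
    one_hot = pd.get_dummies(exploded).groupby(level=0).max()
    percent = (one_hot.mean() * 100).round(3)
    return percent.rename_axis("Category").reset_index(name="Percent")

# Made-up answers for illustration only.
demo = pd.Series(["Location, Network", "Network", "Not Sure", "Location, Network"])
result = multi_select_percentages(demo)
print(result)
```

Each analysis function would then only need to apply its own label renaming and chart titles.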


In [7]:
def analyze_type_collected_comfort(df):
    '''
    Process and analyze data corresponding to the survey question "Which types of your personal data are you comfortable being used for any purpose? (Select all that apply)".
    Displays a dataframe with the percentage of respondents who are comfortable with each data type being used.
    Returns an Altair bar chart of those percentages.
    '''
    # Rename category names so it is easier to split by commas
    rename_map = {
        "Location data": "Location",
        "Biometric data (e.g., fingerprint, voice, facial recognition)": "Biometric",
        "Reproductive health-related information": "Reproductive Health",
        "Commercial data (e.g., purchasing and transaction history)": "Commercial",
        "Employment-related data": "Employment",
        "Personal information of individuals under 18": "Minors",
        "Social Security Number and government ID information": "SSN and ID",
        "Network data (e.g., IP address, browsing history)": "Network",
        "Not sure": "Not Sure"
    }
    df["type_collected_comfort"] = df["type_collected_comfort"].apply(lambda x: rename_categories(x, rename_map))

    # Split by comma and explode into rows and strip whitespace
    split_df = df["type_collected_comfort"].str.split(",").explode()
    split_df = split_df.str.strip()

    # One-Hot Encode Columns
    one_hot = pd.get_dummies(split_df)
    one_hot = one_hot.groupby(split_df.index).max()
    one_hot = pd.DataFrame(one_hot)

    # Calculate Percentages and put in a dictionary
    percentages = dict(zip(one_hot.columns, np.round(one_hot.mean() * 100, 3)))
    percent_df = pd.DataFrame({
        'Category': list(percentages.keys()),
        'Percent': list(percentages.values())
    })

    # Graph
    bars = (
        alt.Chart(percent_df)
        .mark_bar()
        .encode(
            x=alt.X('Percent:Q', title='Percent of Respondents'),
            y=alt.Y('Category:N', title="Data Types"),
            tooltip=['Category', 'Percent']
        )
        .properties(height=300, width=500, title="Data Types That Respondents Are Comfortable with Data Brokers Collecting")
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        text='Percent:Q'
    )

    display(percent_df)
    chart = bars + text

    return chart

survey_type_comfort = analyze_type_collected_comfort(survey)
display(survey_type_comfort)
survey_type_comfort.save("imgs/Data Types That Respondents Are Comfortable with Data Brokers Collecting.svg")



Unnamed: 0,Category,Percent
0,Biometric,11.29
1,Commercial,53.226
2,Employment,35.484
3,Location,16.129
4,Minors,3.226
5,Network,24.194
6,Not Sure,27.419
7,Reproductive Health,12.903
8,SSN and ID,1.613
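
Putting the two output tables side by side shows how far comfort lags behind perceived collection for each data type. The sketch below copies the printed percentages into two dictionaries and computes the gap in percentage points. It is an illustrative comparison, not a step from the original notebook, and it leaves out the "Not Sure" option because that option means something different in each question.

```python
# Percentages copied from the two output tables above.
believed_collected = {
    "Biometric": 62.903, "Commercial": 91.935, "Employment": 83.871,
    "Location": 88.71, "Minors": 37.097, "Network": 91.935,
    "Reproductive Health": 54.839, "SSN and ID": 40.323,
}
comfortable = {
    "Biometric": 11.29, "Commercial": 53.226, "Employment": 35.484,
    "Location": 16.129, "Minors": 3.226, "Network": 24.194,
    "Reproductive Health": 12.903, "SSN and ID": 1.613,
}

# Gap in percentage points: believed collected minus comfortable with use.
gap = {k: round(believed_collected[k] - comfortable[k], 3) for k in believed_collected}

# Print the largest gaps first.
for category, points in sorted(gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<20} {points:6.3f}")
```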


In [8]:
def analyze_purposes_comfortable(df):
    '''
    Process and analyze data corresponding to the survey question "What purposes are you comfortable with your personal data being used for? (Select all that apply)".
    Displays a dataframe with the percentage of respondents who are comfortable with each purpose.
    Returns an Altair bar chart of those percentages.
    '''
    # Rename category names so it is easier to split by commas
    rename_map = {
        "Marketing": "Marketing",
        "Personalized advertising": "Personalized Advertising",
        "Employment-related decisions": "Employment Decisions",
        "Consumer financer decisions (e.g., loans, credit scores)": "Consumer Finance",
        "Law enforcement access without a subpoena": "Law Enforcement Without Subpoena",
        "None of the above": "None"
    }
    df["purposes_comfortable"] = df["purposes_comfortable"].apply(lambda x: rename_categories(x, rename_map))

    # Split by comma and explode into rows and strip whitespace
    split_df = df["purposes_comfortable"].str.split(",").explode()
    split_df = split_df.str.strip()

    # One-Hot Encode Columns
    one_hot = pd.get_dummies(split_df)
    one_hot = one_hot.groupby(split_df.index).max()
    one_hot = pd.DataFrame(one_hot)

    # Calculate Percentages and put in a dictionary
    percentages = dict(zip(one_hot.columns, np.round(one_hot.mean() * 100, 3)))
    percent_df = pd.DataFrame({
        'Category': list(percentages.keys()),
        'Percent': list(percentages.values())
    })

    # Graph
    bars = (
        alt.Chart(percent_df)
        .mark_bar()
        .encode(
            x=alt.X('Percent:Q', title='Percent of Respondents'),
            y=alt.Y('Category:N', title="Purposes"),
            tooltip=['Category', 'Percent']
        )
        .properties(height=300, width=500, title="Use Cases That Respondents Are Comfortable With")
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        text='Percent:Q'
    )

    display(percent_df)
    chart = bars + text

    return chart

survey_purposes_comfort = analyze_purposes_comfortable(survey)
display(survey_purposes_comfort)
survey_purposes_comfort.save("imgs/Use Cases That Respondents Are Comfortable With.svg")

Unnamed: 0,Category,Percent
0,Consumer Finance,20.968
1,Employment Decisions,25.806
2,Law Enforcement Without Subpoena,6.452
3,Marketing,43.548
4,None,24.194
5,Personalized Advertising,62.903


In [9]:
def analyze_sharing_entities(df):
    '''
    Process and analyze data corresponding to the survey question "Which entities are you comfortable sharing your personal data with? (Select all that apply)".
    Displays a dataframe with the percentage of respondents who are comfortable with their data being shared with each entity type.
    Returns an Altair bar chart of those percentages.
    '''
    # Split by comma and explode into rows and strip whitespace
    split_df = df["entities"].str.split(",").explode()
    split_df = split_df.str.strip()

    # One-Hot Encode Columns
    one_hot = pd.get_dummies(split_df)
    one_hot = one_hot.groupby(split_df.index).max()
    one_hot = pd.DataFrame(one_hot)

    # Calculate Percentages and put in a dictionary
    percentages = dict(zip(one_hot.columns, np.round(one_hot.mean() * 100, 3)))
    percent_df = pd.DataFrame({
        'Category': list(percentages.keys()),
        'Percent': list(percentages.values())
    })

    # Graph
    bars = (
        alt.Chart(percent_df)
        .mark_bar()
        .encode(
            x=alt.X('Percent:Q', title='Percent of Respondents'),
            y=alt.Y('Category:N', title="Entities"),
            tooltip=['Category', 'Percent']
        )
        .properties(height=300, width=500, title="Entities That Respondents Are Comfortable With Their Data Shared To")
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        text='Percent:Q'
    )

    display(percent_df)
    chart = bars + text

    return chart

survey_entities = analyze_sharing_entities(survey)
display(survey_entities)
survey_entities.save("imgs/Entities That Respondents Are Comfortable With Their Data Shared To.svg")


Unnamed: 0,Category,Percent
0,Corporations,24.194
1,Educational or research institutions,85.484
2,Government agencies,35.484
3,None of the above,11.29


In [10]:
def analyze_policy_preferences(df):
    '''
    Process and analyze data corresponding to the survey question "Which of the following do you think should be included in data privacy laws and policies? (Select all that apply)".
    Displays a dataframe with the percentage of respondents who selected each policy provision.
    Returns an Altair bar chart of those percentages.
    '''
    # Split by comma and explode into rows and strip whitespace
    split_df = df["included_laws"].str.split(",").explode()
    split_df = split_df.str.strip()

    # One-Hot Encode Columns
    one_hot = pd.get_dummies(split_df)
    one_hot = one_hot.groupby(split_df.index).max()
    one_hot = pd.DataFrame(one_hot)

    # Calculate Percentages and put in a dictionary
    percentages = dict(zip(one_hot.columns, np.round(one_hot.mean() * 100, 3)))
    percent_df = pd.DataFrame({
        'Category': list(percentages.keys()),
        'Percent': list(percentages.values())
    })

    # Graph
    bars = (
        alt.Chart(percent_df)
        .mark_bar()
        .encode(
            x=alt.X('Percent:Q', title='Percent of Respondents'),
            y=alt.Y('Category:N', title="Policy Preferences", axis=alt.Axis(labelLimit=300)),
            tooltip=['Category', 'Percent']
        )
        .properties(height=300, width=500, title="Provisions In Privacy Policies and Laws Respondents Desire")
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        text='Percent:Q'
    )

    display(percent_df)
    chart = bars + text

    return chart

survey_user_controls = analyze_policy_preferences(survey)
display(survey_user_controls)
survey_user_controls.save("imgs/Provisions In Privacy Policies and Laws Respondents Desire.svg")

Unnamed: 0,Category,Percent
0,Option to opt out of data collection or sharing,93.548
1,Protections against discrimination,83.871
2,Restrictions on targeted advertising,75.806
3,Right to access personal data,77.419
4,Right to correct personal data,69.355
5,Right to delete personal data,79.032


In [11]:
def process_request_deletion(df):
    '''
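    Process and analyze data corresponding to the survey question "Have you ever tried to request deletion of your data from a data broker?"
    Prints the percentage of respondents who have requested deletion and the combined percentage who have not or did not know it was possible.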
    '''
    total_responses = len(df)
    pct_yes = len(df[df["request_deletion"] == "Yes"])/total_responses * 100
    pct_unaware = len(df[df["request_deletion"] == "I didn't know that was possible"])/total_responses * 100
    pct_no = len(df[df["request_deletion"] == "No"])/total_responses * 100
    combined_unaware_no = pct_unaware + pct_no

    print(f"% Requested Deletion Before: {pct_yes}")
    print(f"% Unaware about Deletion or Never Requested: {combined_unaware_no}")

process_request_deletion(survey)
    

% Requested Deletion Before: 4.838709677419355
% Unaware about Deletion or Never Requested: 95.16129032258064
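
The same shares can be read off in one step with `value_counts(normalize=True)`, which also makes any unexpected or missing answers visible. A sketch on invented answers (the `toy_answers` series is hypothetical, not survey data):

```python
import pandas as pd

# Hypothetical answers, including one skipped question (None).
toy_answers = pd.Series(["No", "Yes", "I didn't know that was possible", "No", None])

# dropna=False keeps skipped answers as their own category, so the shares
# are taken over all respondents, matching the len(df) denominator used above.
shares = toy_answers.value_counts(normalize=True, dropna=False) * 100
print(shares.round(1))
```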


In [12]:
# Californian analysis
def process_california(df):
    '''
    Find the frequency percentage for each response category for the questions that reference California:
        - DELETE Act Awareness
        - Intended use of DELETE Act 
    '''
    # Keep only the rows corresponding to CA residents
    ca_df = df[df["ca_resident"] == "Yes"]

    num_californians = len(ca_df)
    print(f"Number of Californian respondents, {num_californians}")

    delete_act_awareness_percentage = len(ca_df[ca_df["delete_act_awareness"]=="Yes"])/num_californians * 100
    print(f"Percentage of Californian respondents aware of the DELETE Act, {delete_act_awareness_percentage}")

    yes_use_pct = len(ca_df[ca_df["delete_act_use"] == "Yes"]) / num_californians * 100 
    maybe_use_pct = len(ca_df[ca_df["delete_act_use"] == "Maybe"]) / num_californians * 100
    print(f"Percentage of Californian respondents who responded that they WOULD use the DELETE Act's tools (Yes), {yes_use_pct}")
    print(f"Percentage of Californian respondents who responded that they MIGHT use the DELETE Act's tools (Maybe), {maybe_use_pct}")

process_california(survey)

Number of Californian respondents, 12
Percentage of Californian respondents aware of the DELETE Act, 25.0
Percentage of Californian respondents who responded that they WOULD use the DELETE Act's tools (Yes), 25.0
Percentage of Californian respondents who responded that they MIGHT use the DELETE Act's tools (Maybe), 66.66666666666666
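
With only 12 Californian respondents, each percentage corresponds to a handful of people, which is easier to judge as a raw count. A quick conversion of the printed percentages back to counts (the dictionary labels are shorthand chosen for this sketch):

```python
# Values printed by process_california above.
num_californians = 12
percentages = {
    "aware of DELETE Act": 25.0,
    "would use (Yes)": 25.0,
    "might use (Maybe)": 66.66666666666666,
}

# Convert each percentage back to the number of respondents behind it.
counts = {label: round(pct / 100 * num_californians) for label, pct in percentages.items()}
print(counts)
```

The Yes and Maybe counts together cover 11 of the 12 respondents, so one Californian respondent either answered otherwise or left the question blank.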
