
In [None]:
# Import pandas for data manipulation
import pandas as pd
import numpy as np # For some numerical operations if needed

# Illustrating Concepts from "Conceptualization of Categories" Lecture

This notebook provides simple Python and Pandas examples to illustrate key concepts
discussed this week on how categories are conceptualized, why they matter,
and the ethical implications of classification in data science.

In [None]:
# --- 2. Why Categories? The Need for Classification in Computing ---
print("--- 2. Why Categories? The Need for Classification in Computing ---")

# Boolean data
light_switch_on = True
is_user_logged_in = False
print(f"Light switch is on: {light_switch_on} (Type: {type(light_switch_on)})")

# Integer data
age = 35
items_in_cart = 3
print(f"Person's age: {age} (Type: {type(age)})")
print(f"Items in cart: {items_in_cart} (Type: {type(items_in_cart)})")

# String data
name = "Dr. Evelyn Hayes"
address = "456 Innovation Drive, Techville"
sentence = "Categories enable computers to organize data."
print(f"Person's name: {name} (Type: {type(name)})")
print(f"Sentence: {sentence} (Type: {type(sentence)})")

# Chaos without categories (e.g., grocery store)
unorganized_grocery_items = ["Milk", "Cereal", "Lightbulbs", "Apples", "Cleaning Spray", "Bananas", "Bread", "Shampoo"]
print(f"\nUnorganized items: {unorganized_grocery_items}")

# Organizing with categories
organized_grocery_store = {
    "Dairy": ["Milk"],
    "Pantry": ["Cereal", "Bread"],
    "Produce": ["Apples", "Bananas"],
    "Household": ["Lightbulbs", "Cleaning Spray", "Shampoo"]
}
print("\nOrganized Grocery Store:")
for category, items in organized_grocery_store.items():
    print(f"  {category}: {items}")

# Using Pandas for structured organization
grocery_df_data = [
    {"item": "Milk", "category": "Dairy", "price": 3.50},
    {"item": "Cereal", "category": "Pantry", "price": 4.00},
    {"item": "Lightbulbs", "category": "Household", "price": 8.00},
    {"item": "Apples", "category": "Produce", "price": 0.75},
    {"item": "Bananas", "category": "Produce", "price": 0.50}
]
grocery_df = pd.DataFrame(grocery_df_data)
print("\nGrocery Items in a Pandas DataFrame:")
print(grocery_df)
print(f"\nItems in 'Produce' category:\n{grocery_df[grocery_df['category'] == 'Produce']}")
print("\n--- End of Section 2 ---\n")

--- 2. Why Categories? The Need for Classification in Computing ---
Light switch is on: True (Type: <class 'bool'>)
Person's age: 35 (Type: <class 'int'>)
Items in cart: 3 (Type: <class 'int'>)
Person's name: Dr. Evelyn Hayes (Type: <class 'str'>)
Sentence: Categories enable computers to organize data. (Type: <class 'str'>)

Unorganized items: ['Milk', 'Cereal', 'Lightbulbs', 'Apples', 'Cleaning Spray', 'Bananas', 'Bread', 'Shampoo']

Organized Grocery Store:
  Dairy: ['Milk']
  Pantry: ['Cereal', 'Bread']
  Produce: ['Apples', 'Bananas']
  Household: ['Lightbulbs', 'Cleaning Spray', 'Shampoo']

Grocery Items in a Pandas DataFrame:
         item   category  price
0        Milk      Dairy   3.50
1      Cereal     Pantry   4.00
2  Lightbulbs  Household   8.00
3      Apples    Produce   0.75
4     Bananas    Produce   0.50

Items in 'Produce' category:
      item category  price
3   Apples  Produce   0.75
4  Bananas  Produce   0.50

--- End of Section 2 ---
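
The grocery example stores each category as free text. As a minimal sketch, pandas can also store such a column with its `category` dtype, which makes the classification scheme explicit: a fixed list of labels plus an integer code per row. The small `items` table below is a hypothetical copy of the grocery rows above, not part of the lecture.

```python
import pandas as pd

# Hypothetical mini-table mirroring the grocery rows above.
items = pd.DataFrame({
    "item": ["Milk", "Cereal", "Lightbulbs", "Apples", "Bananas"],
    "category": ["Dairy", "Pantry", "Household", "Produce", "Produce"],
})

# Convert the free-text column to pandas' categorical dtype.
items["category"] = items["category"].astype("category")

# The label set pandas inferred (sorted alphabetically by default).
category_labels = list(items["category"].cat.categories)
# The integer code stored for each row: an index into category_labels.
category_codes = items["category"].cat.codes.tolist()

print(category_labels)  # ['Dairy', 'Household', 'Pantry', 'Produce']
print(category_codes)   # [0, 2, 1, 3, 3]
```

Under the hood the computer only sees the codes. The meaning lives in the label list, which someone had to decide on.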



In [None]:
# --- 3. Material Consequences of Classification: Census Data Example ---
print("--- 3. Material Consequences of Classification: Census Data Example ---")
census_data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'age': [34, 67, 22, 45, 58, 29, 72, 40],
    'district': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],
    'income_bracket': ['Medium', 'High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Medium'],
    'voted_last_election': [True, True, False, True, True, True, False, True]
}
census_df = pd.DataFrame(census_data)
print("Mock Census Data:")
print(census_df)

# Drawing voting districts (simplified: count per district)
print("\nPopulation count per district (could influence districting):")
print(census_df['district'].value_counts())

# Policy decisions (e.g., resource allocation by income)
print("\nAverage age by income bracket (could inform services for seniors/youth):")
print(census_df.groupby('income_bracket')['age'].mean())

# Budget allocation (e.g., voter outreach based on turnout)
print("\nVoter turnout by district (could inform budget for voter education):")
# The mean of a boolean column is the share of True values, i.e. the turnout rate
print(census_df.groupby('district')['voted_last_election'].mean())
print("\n--- End of Section 3 ---\n")

--- 3. Material Consequences of Classification: Census Data Example ---
Mock Census Data:
   id  age district income_bracket  voted_last_election
0   1   34        A         Medium                 True
1   2   67        B           High                 True
2   3   22        A            Low                False
3   4   45        C         Medium                 True
4   5   58        B           High                 True
5   6   29        C            Low                 True
6   7   72        A         Medium                False
7   8   40        B         Medium                 True

Population count per district (could influence districting):
district
A    3
B    3
C    2
Name: count, dtype: int64

Average age by income bracket (could inform services for seniors/youth):
income_bracket
High      62.50
Low       25.50
Medium    47.75
Name: age, dtype: float64

Voter turnout by district (could inform budget for voter education):
district
A    0.333333
B    1.000000
C    1.000000
Name

In [None]:
# --- 4. Intersectionality: Beyond Single Categories ---
print("--- 4. Intersectionality: Beyond Single Categories ---")
intersectional_data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'race': ['White', 'Black', 'Asian', 'Hispanic', 'Black', 'White', 'Hispanic', 'Asian', 'Black', 'White'],
    'gender': ['Woman', 'Woman', 'Man', 'Woman', 'Man', 'Man', 'Man', 'Woman', 'Woman', 'Man'],
    'employment_status': ['Employed', 'Employed', 'Unemployed', 'Employed', 'Employed', 'Student', 'Unemployed', 'Employed', 'Student', 'Employed'],
    'income': [60000, 65000, 20000, 55000, 70000, 15000, 22000, 75000, 12000, 80000]
}
intersectional_df = pd.DataFrame(intersectional_data)
print("Mock Intersectional Data:")
print(intersectional_df)

# Examining one category
print("\nAverage income by race:")
print(intersectional_df.groupby('race')['income'].mean())

print("\nAverage income by gender:")
print(intersectional_df.groupby('gender')['income'].mean())

# Examining an intersection: Black Women
black_women_df = intersectional_df[
    (intersectional_df['race'] == 'Black') &
    (intersectional_df['gender'] == 'Woman')
]
print("\nData for Black Women:")
print(black_women_df)
print(f"Average income for Black Women: ${black_women_df['income'].mean():.2f}")

# Examining another intersection: Unemployed Hispanic Men
unemployed_hispanic_men_df = intersectional_df[
    (intersectional_df['race'] == 'Hispanic') &
    (intersectional_df['gender'] == 'Man') &
    (intersectional_df['employment_status'] == 'Unemployed')
]
print("\nData for Unemployed Hispanic Men:")
print(unemployed_hispanic_men_df)
if not unemployed_hispanic_men_df.empty:
    print(f"Average income for Unemployed Hispanic Men: ${unemployed_hispanic_men_df['income'].mean():.2f}")
else:
    print("No data for Unemployed Hispanic Men in this sample.")
print("\n--- End of Section 4 ---\n")


--- 4. Intersectionality: Beyond Single Categories ---
Mock Intersectional Data:
   id      race gender employment_status  income
0   1     White  Woman          Employed   60000
1   2     Black  Woman          Employed   65000
2   3     Asian    Man        Unemployed   20000
3   4  Hispanic  Woman          Employed   55000
4   5     Black    Man          Employed   70000
5   6     White    Man           Student   15000
6   7  Hispanic    Man        Unemployed   22000
7   8     Asian  Woman          Employed   75000
8   9     Black  Woman           Student   12000
9  10     White    Man          Employed   80000

Average income by race:
race
Asian       47500.000000
Black       49000.000000
Hispanic    38500.000000
White       51666.666667
Name: income, dtype: float64

Average income by gender:
gender
Man      41400.0
Woman    53400.0
Name: income, dtype: float64

Data for Black Women:
   id   race gender employment_status  income
1   2  Black  Woman          Employed   65000
8   9  Bl
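
The cell above filters one intersection at a time. As a rough sketch, grouping on both columns at once lists every race and gender combination together with its group size. The frame `df` below is a hypothetical copy of the three relevant columns of the mock data, and the column names `mean_income` and `n` are illustrative choices.

```python
import pandas as pd

# Hypothetical copy of the race, gender, and income columns from the mock data.
df = pd.DataFrame({
    "race": ["White", "Black", "Asian", "Hispanic", "Black",
             "White", "Hispanic", "Asian", "Black", "White"],
    "gender": ["Woman", "Woman", "Man", "Woman", "Man",
               "Man", "Man", "Woman", "Woman", "Man"],
    "income": [60000, 65000, 20000, 55000, 70000,
               15000, 22000, 75000, 12000, 80000],
})

# Group on both categories at once and report the mean income
# alongside the number of people in each intersection.
by_intersection = (
    df.groupby(["race", "gender"])["income"]
      .agg(mean_income="mean", n="count")
)
print(by_intersection)
```

Reporting `n` next to each mean matters here: every intersection in this sample holds only one or two people, so the averages are illustrative rather than reliable estimates.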

In [None]:
# --- 5. Illustrating Domains from the Matrix of Domination (Conceptual Examples) ---
print("--- 5. Illustrating Domains from the Matrix of Domination ---")

# Structural Domain: Absence of national paid parental leave
# (Markdown cell explanation in Colab)
"""
**Structural Domain Example:**
The lecture mentions the absence of a national law guaranteeing paid parental leave
in the US. If we had an employment dataset, the *lack* of a column like
'has_access_to_paid_parental_leave' or seeing this column be 'False' for most
US-based employees would reflect this structural issue.
This isn't directly coded as a 'law' in data, but its effects are seen in the data.
"""
employment_data_structural = {
    'employee_id': [101, 102, 103, 104],
    'country': ['US', 'Canada', 'US', 'Germany'],
    'paid_parental_leave_weeks': [0, 52, 0, 58] # Hypothetical based on national policies
}
employment_structural_df = pd.DataFrame(employment_data_structural)
print("Structural Domain (Paid Parental Leave Example):")
print(employment_structural_df)


# Disciplinary Domain: Racial covenants
property_applications_data = {
    'applicant_id': [1, 2, 3, 4],
    'applicant_race': ['White', 'Black', 'Asian', 'White'],
    'property_has_racial_covenant': [False, True, False, False], # True if deed had historical covenant
    'application_status': ['Pending'] * 4
}
property_df = pd.DataFrame(property_applications_data)
print("\nDisciplinary Domain (Racial Covenants Example - Initial):")
print(property_df)

# Simulating a neighborhood association's disciplinary action (not a law)
covenant_blocks_applicant = (
    property_df['property_has_racial_covenant'] &
    (property_df['applicant_race'] == 'Black')
)
property_df['application_status'] = np.where(
    covenant_blocks_applicant,
    'Rejected (Neighborhood Assoc. Rule)',
    'Approved / Further Review'
)
print("\nAfter Disciplinary 'Rule' Application:")
print(property_df)

# Hegemonic Domain: Media representation
# (Markdown cell explanation in Colab)
"""
**Hegemonic Domain Example:**
The lecture mentions media representations like 'Leave It to Beaver' shaping ideas
of a 'normal' family. This isn't directly coded but influences the data we *collect*
or how features are *interpreted*. For example, if datasets about family structures
historically over-represented a nuclear family, it reinforces this hegemonic ideal.
The image search bias discussed later is another strong example of this.
"""

# Interpersonal Domain: "Pink Tax"
product_data_interpersonal = {
    'product_id': ['A1', 'A2', 'B1', 'B2'],
    'product_type': ['Razor', 'Razor', 'Shampoo', 'Shampoo'],
    'target_consumer': ['Men', 'Women', 'Unisex', 'Women (Floral Scent)'],
    'base_cost': [2.0, 2.0, 3.0, 3.0],
    'price': [5.0, 6.5, 6.0, 7.0] # Women's versions priced higher
}
products_interpersonal_df = pd.DataFrame(product_data_interpersonal)
products_interpersonal_df['markup'] = (products_interpersonal_df['price'] - products_interpersonal_df['base_cost']) / products_interpersonal_df['base_cost']
print("\nInterpersonal Domain ('Pink Tax' Example):")
print(products_interpersonal_df)
print("\nNote the higher markup for products targeted at 'Women' for similar base items.")
print("\n--- End of Section 5 ---\n")

--- 5. Illustrating Domains from the Matrix of Domination ---
Structural Domain (Paid Parental Leave Example):
   employee_id  country  paid_parental_leave_weeks
0          101       US                          0
1          102   Canada                         52
2          103       US                          0
3          104  Germany                         58

Disciplinary Domain (Racial Covenants Example - Initial):
   applicant_id applicant_race  property_has_racial_covenant  \
0             1          White                         False   
1             2          Black                          True   
2             3          Asian                         False   
3             4          White                         False   

  application_status  
0            Pending  
1            Pending  
2            Pending  
3            Pending  

After Disciplinary 'Rule' Application:
   applicant_id applicant_race  property_has_racial_covenant  \
0             1          White     

In [None]:
# --- 6. Missing Data: The Library of Missing Datasets Concept ---
print("--- 6. Missing Data: The Library of Missing Datasets Concept ---")
# Example: University Faculty Satisfaction Survey Data
faculty_satisfaction_data = {
    'faculty_id': [201, 202, 203, 204, 205],
    'department': ['CS', 'History', 'CS', 'Biology', 'History'],
    'tenured': [True, True, False, True, False],
    'satisfaction_score (1-5)': [4, 5, 3, 4, 2],
    'years_at_university': [10, 15, 2, 8, 3]
}
faculty_df = pd.DataFrame(faculty_satisfaction_data)
print("Collected Data: Faculty Satisfaction Survey")
print(faculty_df)

--- 6. Missing Data: The Library of Missing Datasets Concept ---
Collected Data: Faculty Satisfaction Survey
   faculty_id department  tenured  satisfaction_score (1-5)  \
0         201         CS     True                         4   
1         202    History     True                         5   
2         203         CS    False                         3   
3         204    Biology     True                         4   
4         205    History    False                         2   

   years_at_university  
0                   10  
1                   15  
2                    2  
3                    8  
4                    3  


**Missing Datasets (Inspired by Mimi Onuoha):**
The collected data above gives some insights, but what's missing?
Following Mimi Onuoha's concept, we can imagine 'empty folders' for datasets like:

* 'Reasons_for_faculty_of_color_leaving_tenure_track_positions_globally'
* 'Incidents_of_microaggressions_reported_by_non_tenured_faculty_by_department'
* 'Salary_discrepancies_for_equivalent_roles_adjusted_for_years_experience_race_gender'
* As the lecture mentioned: 'Police_killings_North_Carolina' or 'Femicides_US_Mexico_border'
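
One way to make the "empty folders" idea concrete, sketched under assumptions: write down the fields a study would need and compare them with what was actually collected. The `desired_fields` list and the small `collected` frame below are hypothetical and only illustrate the comparison.

```python
import pandas as pd

# Hypothetical subset of the collected survey data.
collected = pd.DataFrame({
    "faculty_id": [201, 202, 203],
    "department": ["CS", "History", "CS"],
    "tenured": [True, True, False],
})

# Hypothetical wish list of fields an equity study might need.
desired_fields = [
    "department",
    "tenured",
    "reason_for_leaving",
    "reported_microaggressions",
    "salary",
]

# Fields we wanted but never collected: the "empty folders".
missing_fields = [f for f in desired_fields if f not in collected.columns]
print(missing_fields)
# ['reason_for_leaving', 'reported_microaggressions', 'salary']
```

A check like this only surfaces gaps that someone thought to list. Deciding what belongs on the wish list is itself a judgment about what counts.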

**Childbirth Morbidity Example:**
Imagine older health datasets:

In [None]:
old_health_data = {
    'patient_id': [1,2,3],
    'birth_year_of_mother': [1950, 1951, 1950],
    'child_health_outcome': ['Good', 'Fair', 'Good'],
    # 'maternal_morbidity_during_childbirth': MISSING!
}
old_health_df = pd.DataFrame(old_health_data)
print("\nExample: Older Health Data (Missing Maternal Morbidity)")
print(old_health_df)
print("The absence of 'maternal_morbidity_during_childbirth' makes that form of suffering invisible in this dataset.")
print("\n--- End of Section 6 ---\n")


Example: Older Health Data (Missing Maternal Morbidity)
   patient_id  birth_year_of_mother child_health_outcome
0           1                  1950                 Good
1           2                  1951                 Fair
2           3                  1950                 Good
The absence of 'maternal_morbidity_during_childbirth' makes that form of suffering invisible in this dataset.

--- End of Section 6 ---



In [None]:
# --- 7. Risks of Data Collection: De-anonymization in Small Groups ---
print("--- 7. Risks of Data Collection: De-anonymization in Small Groups ---")
# "Anonymized" university faculty data (names removed)
anonymized_faculty_data = {
    # 'name': ['Removed'],
    'department': ['Physics', 'Physics', 'History', 'Art', 'Physics', 'Art', 'History'],
    'tenure_status': ['Tenured', 'Tenured', 'Tenured', 'Associate', 'Tenured', 'Associate', 'Tenured'],
    'research_area': ['Quantum Computing', 'Astrophysics', 'Medieval Europe', 'Sculpture', 'Particle Physics', 'Digital Art', 'Renaissance Italy'],
    'underrepresented_group_status': [False, False, False, True, False, True, False] # Simplified
}
anon_faculty_df = pd.DataFrame(anonymized_faculty_data)
print("Hypothetical 'Anonymized' Faculty Data:")
print(anon_faculty_df)

# Attempt to identify an individual from an underrepresented group.
# Suppose an outsider knows only that the target is in the Art department, holds the
# rank 'Associate', and belongs to an underrepresented group.
potential_identification = anon_faculty_df[
    (anon_faculty_df['department'] == 'Art') &
    (anon_faculty_df['tenure_status'] == 'Associate') &  # Adapted from the lecture's tenured example to fit this dataset
    anon_faculty_df['underrepresented_group_status']  # Already boolean, so no '== True' needed
]
print("\nSearching for: Art department, Associate Professor, Underrepresented Group")
print(potential_identification)
if len(potential_identification) == 1:
    print("Result: A single individual matches. Privacy is compromised despite 'anonymization'.")
elif len(potential_identification) > 1:
    print("Result: Multiple individuals match. Identification is harder but still a risk with more data.")
else:
    print("Result: No individuals match this specific query.")

# The lecture's example involves tenured faculty: if the Art department had only one
# tenured faculty member and that person was from an underrepresented group, they would be identifiable.
# Here is a scenario where the risk is clearer:
# What if there is only ONE person in 'Physics' who works on 'Quantum Computing' and is 'Tenured'?
# If that person is also known to be from an underrepresented group (even if that fact
# was not recorded in this dataset), external knowledge could bridge the gap.

potential_identification_physics = anon_faculty_df[
    (anon_faculty_df['department'] == 'Physics') &
    (anon_faculty_df['research_area'] == 'Quantum Computing') &
    (anon_faculty_df['tenure_status'] == 'Tenured')
]
print("\nSearching for: Physics, Quantum Computing, Tenured")
print(potential_identification_physics)
if len(potential_identification_physics) == 1:
    print("Result: A single individual matches based on department, research area and tenure. If this person is known to be from an underrepresented group, their specific record here is identified.")

print("\n--- End of Section 7 ---\n")


--- 7. Risks of Data Collection: De-anonymization in Small Groups ---
Hypothetical 'Anonymized' Faculty Data:
  department tenure_status      research_area  underrepresented_group_status
0    Physics       Tenured  Quantum Computing                          False
1    Physics       Tenured       Astrophysics                          False
2    History       Tenured    Medieval Europe                          False
3        Art     Associate          Sculpture                           True
4    Physics       Tenured   Particle Physics                          False
5        Art     Associate        Digital Art                           True
6    History       Tenured  Renaissance Italy                          False

Searching for: Art department, Associate Professor, Underrepresented Group
  department tenure_status research_area  underrepresented_group_status
3        Art     Associate     Sculpture                           True
5        Art     Associate   Digital Art              
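
The queries above test one combination of attributes at a time. A common way to summarize the risk across the whole table is to count how many rows share each combination of quasi-identifiers; the smallest count is often called k, as in k-anonymity. That term is introduced here for illustration and is not confirmed to be from the lecture. The sketch below uses a hypothetical copy of the mock faculty data.

```python
import pandas as pd

# Hypothetical copy of the mock 'anonymized' faculty data.
faculty = pd.DataFrame({
    "department": ["Physics", "Physics", "History", "Art", "Physics", "Art", "History"],
    "tenure_status": ["Tenured", "Tenured", "Tenured", "Associate",
                      "Tenured", "Associate", "Tenured"],
    "research_area": ["Quantum Computing", "Astrophysics", "Medieval Europe", "Sculpture",
                      "Particle Physics", "Digital Art", "Renaissance Italy"],
    "underrepresented_group_status": [False, False, False, True, False, True, False],
})

def smallest_group(df, columns):
    """Return the size of the smallest group sharing the same values in `columns`."""
    return int(df.groupby(columns).size().min())

# Without research_area, every row shares its attributes with at least one other row.
k_coarse = smallest_group(
    faculty, ["department", "tenure_status", "underrepresented_group_status"]
)
# Adding research_area makes every row unique.
k_fine = smallest_group(
    faculty,
    ["department", "tenure_status", "underrepresented_group_status", "research_area"],
)
print(k_coarse, k_fine)  # 2 1
```

A k of 1 means at least one person can be singled out from these columns alone, even though no names are stored.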

In [None]:
# --- 8. Implicit Bias in Language & Data: Algorithmic Bias Simulation ---
print("--- 8. Implicit Bias in Language & Data: Algorithmic Bias Simulation ---")
# (Markdown cell explanation in Colab for "Banana" example)
"""
**Implicit Bias in Language: The "Banana" Example**
As the lecture states:
> Consider that people say "green bananas," but people never say, "hand me a yellow banana."
This implies that "yellow" is the default, unstated assumption for a banana.
This matters in data labeling and model training. If all training images of bananas
are yellow and unlabeled for color, the model assumes 'banana = yellow banana'.
It might then misclassify or struggle with green or red bananas unless explicitly trained.
"""

# Simulating Algorithmic Bias (like biased image search results)
# We'll create "profiles" or "image descriptors" that a system might pull from.
# These are intentionally biased to mimic the lecture's image search examples.

# Computer Engineer
# Lecture: "overwhelmingly computers themselves or men"
computer_engineer_profiles = [
    {"primary_tag": "male", "secondary_tags": ["circuits", "code", "server"]},
    {"primary_tag": "technology", "secondary_tags": ["laptop", "network"]},
    {"primary_tag": "male", "secondary_tags": ["coding", "desk", "glasses"]},
    {"primary_tag": "male", "secondary_tags": ["software", "system"]},
    {"primary_tag": "female", "secondary_tags": ["collaboration", "whiteboard", "code"]} # Few female representations
]
ce_df = pd.DataFrame(computer_engineer_profiles)
print("\nSimulated 'Computer Engineer' Profiles/Image Descriptors:")
print(ce_df['primary_tag'].value_counts())

# House Cleaner
# Lecture: "all of the images procured for me are of women... race intersects with gender"
house_cleaner_profiles = [
    {"primary_tag": "female", "secondary_tags": ["cleaning supplies", "home", "apron"], "race_hint": "Hispanic/Latina"},
    {"primary_tag": "female", "secondary_tags": ["vacuum", "kitchen", "smiling"], "race_hint": "White"},
    {"primary_tag": "female", "secondary_tags": ["gloves", "spray bottle", "bathroom"], "race_hint": "Black"},
    {"primary_tag": "female", "secondary_tags": ["mop", "bucket", "window"], "race_hint": "Asian"},
    {"primary_tag": "female", "secondary_tags": ["dusting", "living_room"], "race_hint": "Hispanic/Latina"}
]
hc_df = pd.DataFrame(house_cleaner_profiles)
print("\nSimulated 'House Cleaner' Profiles/Image Descriptors:")
print(hc_df['primary_tag'].value_counts())
print(hc_df['race_hint'].value_counts())


# Lawyer
# Lecture: "not necessarily gendered or racialized, but we can see how these tools...reproduce these biases"
lawyer_profiles = [
    {"primary_tag": "male", "secondary_tags": ["suit", "courtroom", "briefcase"], "race_hint": "White"},
    {"primary_tag": "male", "secondary_tags": ["books", "desk", "confident"], "race_hint": "White"},
    {"primary_tag": "female", "secondary_tags": ["office", "client", "focused"], "race_hint": "White"}, # Fewer female
    {"primary_tag": "male", "secondary_tags": ["gavel", "law firm", "serious"], "race_hint": "White"},
    {"primary_tag": "male", "secondary_tags": ["negotiation", "documents"], "race_hint": "Black"} # Few non-white
]
lawyer_df = pd.DataFrame(lawyer_profiles)
print("\nSimulated 'Lawyer' Profiles/Image Descriptors:")
print(lawyer_df['primary_tag'].value_counts())
print(lawyer_df['race_hint'].value_counts())

# Teacher
# Lecture: "teacher is not gendered, but perhaps I would have had to put in 'male teacher'"
teacher_profiles = [
    {"primary_tag": "female", "secondary_tags": ["classroom", "children", "books"]},
    {"primary_tag": "female", "secondary_tags": ["whiteboard", "students", "apple"]},
    {"primary_tag": "female", "secondary_tags": ["elementary", "caring", "lesson_plan"]},
    {"primary_tag": "female", "secondary_tags": ["high_school", "grading", "patient"]},
    {"primary_tag": "male", "secondary_tags": ["university", "lecture", "focused"]} # Fewer male, often different context
]
teacher_df = pd.DataFrame(teacher_profiles)
print("\nSimulated 'Teacher' Profiles/Image Descriptors:")
print(teacher_df['primary_tag'].value_counts())

print("\nThese skewed 'profiles' would lead an algorithm to reproduce societal biases,")
print("similar to the image search results shown in the lecture.")
print("\n--- End of Section 8 ---\n")



--- 8. Implicit Bias in Language & Data: Algorithmic Bias Simulation ---

Simulated 'Computer Engineer' Profiles/Image Descriptors:
primary_tag
male          3
technology    1
female        1
Name: count, dtype: int64

Simulated 'House Cleaner' Profiles/Image Descriptors:
primary_tag
female    5
Name: count, dtype: int64
race_hint
Hispanic/Latina    2
White              1
Black              1
Asian              1
Name: count, dtype: int64

Simulated 'Lawyer' Profiles/Image Descriptors:
primary_tag
male      4
female    1
Name: count, dtype: int64
race_hint
White    4
Black    1
Name: count, dtype: int64

Simulated 'Teacher' Profiles/Image Descriptors:
primary_tag
female    4
male      1
Name: count, dtype: int64

These skewed 'profiles' would lead an algorithm to reproduce societal biases,
similar to the image search results shown in the lecture.

--- End of Section 8 ---
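
The counts above are easier to compare across occupations as proportions. As a minimal sketch, `pd.crosstab` with `normalize="index"` turns the tag counts into a share per occupation. The `tags` frame is a hypothetical flattening of the simulated profiles above and, like them, is invented data rather than real search results.

```python
import pandas as pd

# Hypothetical flattening of the simulated profiles: one row per profile.
tags = pd.DataFrame({
    "occupation": (["computer engineer"] * 5 + ["house cleaner"] * 5
                   + ["lawyer"] * 5 + ["teacher"] * 5),
    "primary_tag": (["male", "technology", "male", "male", "female"]
                    + ["female"] * 5
                    + ["male", "male", "female", "male", "male"]
                    + ["female"] * 4 + ["male"]),
})

# Share of each primary tag within each occupation (each row sums to 1).
share = pd.crosstab(tags["occupation"], tags["primary_tag"], normalize="index")
print(share.round(2))
```

In this simulated data, 80% of teacher profiles and 100% of house cleaner profiles are tagged female, while 80% of lawyer profiles are tagged male.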



**The Double-Edged Sword (Mimi Onuoha):**

As the lecture highlights, quoting Mimi Onuoha:
> "Black and Brown Americans face a double-edged sword. Rarely is data collected
> that meaningfully impacts their lives... But you also see how the inclusion of
> persons might increase surveillance."

This section is more conceptual but vital for data scientists:

* **Lack of Beneficial Data:**
    * Example from lecture: Data on how criminal records exclude people from public housing.
    * The *absence* of such data makes it hard to advocate for change or quantify the problem.

* **Increased Surveillance/Risk with Inclusion:**
    * Example from lecture: Including undocumented immigrants on the census.
    * Example: Stingray phone trackers used by police departments.
    * The very act of collecting data, even with good intentions, can create
        vulnerabilities for the populations being studied if not handled with
        extreme care, consent, and an understanding of the terms of inclusion.

Data scientists must constantly weigh the potential benefits of data collection
against the potential harms, especially for minoritized and vulnerable groups.
Consent, transparency, and data minimization are key principles.
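
Data minimization can be sketched in pandas as two steps: keep only the columns the analysis needs, and coarsen precise values into broader bands. The `raw` table, the placeholder names, and the age bands below are hypothetical choices for illustration, assuming the goal is turnout by district and age band.

```python
import pandas as pd

# Hypothetical raw records; names are placeholders, not real people.
raw = pd.DataFrame({
    "name": ["Person 1", "Person 2", "Person 3", "Person 4"],
    "age": [22, 34, 58, 72],
    "district": ["A", "A", "B", "B"],
    "voted_last_election": [False, True, True, False],
})

# Step 1: keep only the fields the analysis needs (drops the direct identifier).
needed = ["age", "district", "voted_last_election"]
minimal = raw[needed].copy()

# Step 2: replace exact ages with broad bands, then drop the exact value.
minimal["age_band"] = pd.cut(
    minimal["age"],
    bins=[0, 29, 64, 120],
    labels=["under 30", "30-64", "65+"],
)
minimal = minimal.drop(columns="age")
print(minimal)
```

This reduces what could leak, but it does not remove re-identification risk on its own, and it does nothing to address consent or transparency, which need to be handled outside the code.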