In [1]:
your_name = "Myles Sartor"
your_uid = "119017708"

# Assignment 04: Data Quality

This assignment explores how data quality issues can lead to misleading conclusions. You'll simulate datasets, reproduce biased results, and test whether AI can spot the issues.


In [2]:
# Setup
import pandas as pd
import numpy as np
np.random.seed(42)

## PART 1: Simpson's Paradox - Berkeley Admissions

In 1973, UC Berkeley's graduate admissions data raised concerns about gender discrimination. Aggregate data showed men had higher admission rates, but within most departments women had equal or higher rates. This is an example of **Simpson's Paradox**: men applied mostly to high-acceptance departments, while women applied mostly to low-acceptance ones.

Let's first create a synthetic dataset simulating the students' application and admission status.


In [4]:
# Simulation parameters
n_applicants = 10000
n_departments = 6
department_acceptance_rates = [0.64, 0.63, 0.35, 0.34, 0.24, 0.07]  # High to low acceptance
male_application_dist = [0.31, 0.21, 0.12, 0.16, 0.07, 0.13]  # Men favor high-acceptance depts
female_application_dist = [0.02, 0.01, 0.27, 0.17, 0.24, 0.29]  # Women favor low-acceptance depts
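
Because the same department acceptance rates apply to both genders, the aggregate gap implied by these parameters can be worked out before simulating anything: each gender's expected overall rate is the application-weighted average of the department rates. A minimal sketch of that arithmetic follows; the `expected_*` names are illustrative and not part of the assignment.

```python
import numpy as np

# Values copied from the simulation parameters cell above.
department_acceptance_rates = np.array([0.64, 0.63, 0.35, 0.34, 0.24, 0.07])
male_application_dist = np.array([0.31, 0.21, 0.12, 0.16, 0.07, 0.13])
female_application_dist = np.array([0.02, 0.01, 0.27, 0.17, 0.24, 0.29])

# Expected overall rate = sum over departments of P(apply to dept) * P(admit | dept).
expected_male_rate = float(male_application_dist @ department_acceptance_rates)
expected_female_rate = float(female_application_dist @ department_acceptance_rates)

print(f"Expected male rate:   {expected_male_rate:.4f}")
print(f"Expected female rate: {expected_female_rate:.4f}")
```

Under these parameters the expected rates come out to roughly 0.45 for men and 0.25 for women, so a gap of about 20 percentage points arises from department choice alone.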

In [5]:
# Generate department choices for each gender
n_males = n_applicants // 2
n_females = n_applicants - n_males

# Men choose departments based on male_application_dist
male_depts = np.random.choice(n_departments, size=n_males, p=male_application_dist)

# Women choose departments based on female_application_dist
female_depts = np.random.choice(n_departments, size=n_females, p=female_application_dist)

print(f"Generated {n_males} male applicants and {n_females} female applicants")
print(f"\nMale department distribution:")
print(pd.Series(male_depts).value_counts().sort_index())
print(f"\nFemale department distribution:")
print(pd.Series(female_depts).value_counts().sort_index())


Generated 5000 male applicants and 5000 female applicants

Male department distribution:
0    1574
1    1035
2     599
3     793
4     359
5     640
Name: count, dtype: int64

Female department distribution:
0      98
1      42
2    1382
3     895
4    1214
5    1369
Name: count, dtype: int64


In [6]:
# Determine admission for each applicant
# Same acceptance rates for both genders (no discrimination at dept level)
male_admitted = (np.random.random(n_males) < np.array(department_acceptance_rates)[male_depts]).astype(int)
female_admitted = (np.random.random(n_females) < np.array(department_acceptance_rates)[female_depts]).astype(int)

print("Admission decisions made!")

Admission decisions made!


In [7]:
# Combine everything into a DataFrame
df_berkeley = pd.DataFrame({
    'gender': ['Male'] * n_males + ['Female'] * n_females,
    'department': np.concatenate([male_depts, female_depts]),
    'admitted': np.concatenate([male_admitted, female_admitted])
})

print("Dataset created!")
print(df_berkeley.head(10))

Dataset created!
  gender  department  admitted
0   Male           1         1
1   Male           5         0
2   Male           3         1
3   Male           2         0
4   Male           0         1
5   Male           0         0
6   Male           0         1
7   Male           4         0
8   Male           2         0
9   Male           3         0


### Aggregate vs. Stratified Rates

Compute admission rates two ways:

1. **Aggregate:** Overall rate by gender 
2. **Stratified:** Rate by gender WITHIN each department 


In [9]:
# YOUR CODE HERE
# Aggregate rates
aggregate_rates = df_berkeley.groupby('gender')['admitted'].mean()
print("Aggregate Admission Rates (overall rate by gender):")
print(aggregate_rates)
print()

# Stratified rates
stratified_rates = df_berkeley.groupby(['department', 'gender'])['admitted'].mean().unstack()
print("Stratified Admission Rates (Rate by gender within each department):")
print(stratified_rates)
print()


Aggregate Admission Rates (overall rate by gender):
gender
Female    0.2428
Male      0.4564
Name: admitted, dtype: float64

Stratified Admission Rates (Rate by gender within each department):
gender        Female      Male
department                    
0           0.632653  0.632783
1           0.642857  0.650242
2           0.335745  0.355593
3           0.310615  0.351828
4           0.228995  0.231198
5           0.076698  0.059375



Did you reproduce Simpson's Paradox?

Yes, Simpson's Paradox shows up once the data are aggregated and then stratified. Men have a much higher overall admission rate (45.64%) than women (24.28%), yet within each department the two rates are nearly identical: the largest gap is about 4 percentage points (department 3), and women have the higher rate in department 5. The gender bias that one might assume from the aggregate largely disappears when we control for department, because it comes from where each group applied rather than from how applicants were treated. This is quite surprising, especially for the time period in which the original data were collected.
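
One way to make "controlling for department" concrete is direct standardization: apply each gender's department-level rates to a common department mix and compare the adjusted rates. The sketch below uses the counts and rates printed in the outputs above. Using the pooled applicant mix as the standard is an assumption made for illustration; other reference mixes are equally defensible.

```python
import numpy as np

# Department counts and within-department admission rates copied from the outputs above.
male_counts = np.array([1574, 1035, 599, 793, 359, 640])
female_counts = np.array([98, 42, 1382, 895, 1214, 1369])
male_rates = np.array([0.632783, 0.650242, 0.355593, 0.351828, 0.231198, 0.059375])
female_rates = np.array([0.632653, 0.642857, 0.335745, 0.310615, 0.228995, 0.076698])

# Raw aggregate rates weight each department by that gender's own application counts.
raw_male = (male_counts * male_rates).sum() / male_counts.sum()
raw_female = (female_counts * female_rates).sum() / female_counts.sum()

# Assumption: use the pooled applicant mix as the common standard.
pooled_mix = (male_counts + female_counts) / (male_counts.sum() + female_counts.sum())
adjusted_male = (pooled_mix * male_rates).sum()
adjusted_female = (pooled_mix * female_rates).sum()

print(f"Raw gap:      {raw_male - raw_female:+.4f}")
print(f"Adjusted gap: {adjusted_male - adjusted_female:+.4f}")
```

With these numbers the raw gap is about 21 percentage points, while the adjusted gap shrinks to under one percentage point.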

## PART 2: Selection Bias - Wage Data

When wages are observed only for people who work, missingness is informative. People with higher potential wages are more likely to work (self-selection bias).

Let's begin by creating a synthetic dataset.


In [None]:
# Simulation parameters
n_population = 5000
true_wage_mean = 50000
true_wage_std = 15000
reservation_wage_mean = 35000
reservation_wage_std = 5000
selection_correlation = 0.2  # How correlated are true_wage and reservation_wage

# Generate true wages (what each person would earn if they worked)
true_wages = np.maximum(np.random.normal(true_wage_mean, true_wage_std, n_population), 0)

print(f"Generated {n_population} true wages")
print(f"Mean: ${true_wages.mean():,.2f}")
print(f"Std: ${true_wages.std():,.2f}")

# Generate reservation wages (minimum wage person would accept to work)
# Has some correlation with true_wage, but also independent component
reservation_base = np.random.normal(reservation_wage_mean, reservation_wage_std, n_population)
reservation_correlated = reservation_wage_mean + selection_correlation * (true_wages - true_wage_mean) * (reservation_wage_std / true_wage_std)
reservation_wages = np.maximum(reservation_base * 0.7 + reservation_correlated * 0.3, 0)

# Determine Participation
# People are more likely to participate (work) the more their true_wage exceeds their reservation_wage.
# Participation is probabilistic (not guaranteed), so some low earners work and some high earners do not:

wage_gap = true_wages - reservation_wages
participation_prob = 1 / (1 + np.exp(-wage_gap / 5000))  # Sigmoid function
participation_prob = np.clip(participation_prob, 0.05, 0.95)  # Cap between 5% and 95%
participated = (np.random.random(n_population) < participation_prob).astype(int)

print(f"Participation rate: {participated.mean():.2%}")
print(f"Number of participants: {participated.sum()}")
print(f"Number of non-participants: {(participated == 0).sum()}")

# Create observed_wage
# We only observe wages for people who participated. 
# Non-participants have missing values (NaN):
observed_wages = np.where(participated == 1, true_wages, np.nan)

# Combine into DataFrame
df_wages = pd.DataFrame({
    'true_wage': true_wages,
    'reservation_wage': reservation_wages,
    'participated': participated,
    'observed_wage': observed_wages
})

print("Dataset created!")
print(df_wages.head(10))


Generated 5000 true wages
Mean: $50,226.81
Std: $15,047.34
Participation rate: 79.38%
Number of participants: 3969
Number of non-participants: 1031
Dataset created!
      true_wage  reservation_wage  participated  observed_wage
0  37942.952299      40331.551745             1   37942.952299
1  58789.876535      30544.396782             1   58789.876535
2  69594.207136      30422.382772             1   69594.207136
3  31596.264856      30464.442285             1   31596.264856
4  75684.615926      35984.048365             1   75684.615926
5  48071.734484      25130.898616             1   48071.734484
6  35374.679226      32646.688101             1   35374.679226
7  61366.321703      40152.173946             1   61366.321703
8  43303.182826      38304.599564             1   43303.182826
9  48773.949860      33193.647720             0            NaN


We have created a visualization showing the true wage distribution and the distributions that result when missing values are handled with the two naive fixes. When rows with missing values are removed (middle panel), the observed wage distribution shifts to the right and the estimated mean is higher than the true mean. When missing values are replaced with 0, there is an artificial spike at 0 and the estimated mean is lower than the true mean.

![Basic visualization](https://raw.githubusercontent.com/aiwei/course-umd/main/data/assignment04_part2_vis.png)

In the cells below, please calculate the average wage using the two naive fixes.

In [11]:
# YOUR CODE HERE
# naive fix 1: dropna
dropna_mean = df_wages['observed_wage'].dropna().mean() # YOUR CODE: mean of observed_wage after dropna()

# naive fix 2: fillna(0)
fillna_mean = df_wages['observed_wage'].fillna(0).mean() # YOUR CODE: mean of observed_wage after fillna(0)

# Compare with true mean
true_mean = df_wages['true_wage'].mean()
print(f"True mean: ${true_mean:,.2f}")
print(f"dropna() mean: ${dropna_mean:,.2f} (bias: +${dropna_mean-true_mean:,.2f})")
print(f"fillna(0) mean: ${fillna_mean:,.2f} (bias: ${fillna_mean-true_mean:,.2f})")


True mean: $50,226.81
dropna() mean: $54,273.64 (bias: +$4,046.83)
fillna(0) mean: $43,082.42 (bias: $-7,144.39)
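
Both naive fixes are biased because participation depends on the wage itself. In this simulation every person's participation probability is known, which makes it possible to sketch a correction by inverse probability weighting (IPW): each observed worker is weighted by the reciprocal of their participation probability. The example below is a self-contained illustration that rebuilds a simplified version of the wage simulation with its own random generator (no correlation between true and reservation wages; the seed and variable names are arbitrary). With real data these probabilities are unknown and would have to be estimated from observed covariates, so this is not a drop-in fix.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves the notebook's global seed untouched

n = 5000
true_wage = np.maximum(rng.normal(50_000, 15_000, n), 0)
reservation = np.maximum(rng.normal(35_000, 5_000, n), 0)  # simplification: independent of true_wage

# Same participation rule as the simulation above: sigmoid of the wage gap, clipped to [0.05, 0.95].
p_work = np.clip(1 / (1 + np.exp(-(true_wage - reservation) / 5_000)), 0.05, 0.95)
works = rng.random(n) < p_work

true_mean = true_wage.mean()
naive_mean = true_wage[works].mean()  # equivalent to the dropna() fix

# IPW: each observed worker stands in for 1 / p_work people from the full population.
weights = 1 / p_work[works]
ipw_mean = np.sum(weights * true_wage[works]) / np.sum(weights)

print(f"True mean:  ${true_mean:,.2f}")
print(f"Naive mean: ${naive_mean:,.2f}")
print(f"IPW mean:   ${ipw_mean:,.2f}")
```

The weighted estimate should land much closer to the true mean than the `dropna()` estimate, at the cost of higher variance because low-probability participants receive large weights.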


## PART 3: Testing Your AI Assistant

Now, let's see whether an AI assistant can spot these issues and help us correct them. Try the following prompts with your favorite (or least favorite) AI chatbot:




### 3A: Naive Questions (No Context)

Ask AI **WITHOUT** mentioning how data was generated:

**Berkeley Prompt (this prompt purposely omits the department information):**

```
I have admissions data, that look like the following and I am analyzing whether there are gender discrimination in the admission process. Help me create a data analysis plan.

gender admitted
0   Male 0
1   Male 0
2   Male 0
3   Male 0
4   Male 1
5   Female 0
6   Female 0
7   Female 1
8   Female 0
9   Female 1
```

**Question**

How did the LLM respond? The latest models have very likely been trained on Simpson's Paradox and may even ask, "Do you have other information, such as department? If so, you may encounter Simpson's Paradox." Still, look closely at the language the LLM used when bringing up Simpson's Paradox. Did it merely mention that "you will see different or reversed results," or did it try to reason about which conclusion is right? Did it discuss the issue from the data-generation perspective?

**Wages Prompt:**

```
I would like to study women's wage in the labor market, what is their mean wage and what is the impact factor of their wage. I have data that look like the following

 wage, education, chilren_in_family, age
0, 1000, "high-school", 0, 30
1, NA, "bachelor", 2, 35
2, 1000, "post-graduate", 1, 27

My first step is to calculate the mean wage of the population, how should I do that.

```

**Question**

How did the LLM respond? Pay attention to the following points:
1. Did the LLM notice the missing values in the dataset?
2. Did the LLM offer solutions (dropna, fillna) for dealing with missing values: one, both, or more?
3. Did the LLM discuss how those solutions may bias the estimated mean, and in which direction?
4. Did the LLM offer a more comprehensive plan to tackle the missing-value issue?

Please put your **answers** and your reflection in the next two cells. Please do **NOT** paste the AI's response. Instead, summarize what the LLM says and misses in your own words.

**Please note that this summary must be in your own words, and we reserve the right to penalize answers that do not comply with this requirement.**



**Your Response to the first case study (Berkeley):**

The LLM's plan began by framing a clear research question and then walked me through inspecting and cleaning the data for exploratory data analysis. Because the question centered on differences in admission rates between male and female applicants, it suggested bar charts of counts and rates grouped by gender as a way to show any disparity. After the plots and descriptive statistics, it proposed a chi-square test of independence to check whether admission is independent of gender, since both variables are categorical. On top of this, it recommended an odds ratio as an effect size and a logistic regression. Even so, the LLM urged me to check for Simpson's Paradox and called it the most important step in the process. Citing real admissions cases such as the famous one at UC Berkeley, it noted that aggregated data can show a gender difference even when the pattern reverses within each department. Its summary was that a statistically significant difference in admission rates could reflect chance or confounding variables rather than malicious intent.
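
The aggregate-only analysis described above can be sketched with the Part 1 numbers. The 2x2 counts below are derived from the aggregate rates in the Part 1 output (0.4564 and 0.2428 of 5,000 applicants each). The test is computed by hand without a continuity correction, which is a simplification; a statistics library would normally be used instead.

```python
import math
import numpy as np

# Aggregate 2x2 table implied by the Part 1 output:
# rows = gender (Male, Female), columns = (admitted, rejected).
observed = np.array([[2282, 2718],
                     [1214, 3786]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()  # counts expected under independence

chi2 = ((observed - expected) ** 2 / expected).sum()
# For a 2x2 table (1 degree of freedom), the chi-square tail probability
# equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

# Odds of admission for men divided by odds of admission for women.
odds_ratio = (observed[0, 0] * observed[1, 1]) / (observed[0, 1] * observed[1, 0])

print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}, odds ratio = {odds_ratio:.2f}")
```

The aggregate test is overwhelmingly significant even though the simulation applies identical acceptance rates to both genders within every department, which is why a significant result on the two-column data cannot by itself establish discrimination.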

**Your Response to the second case study (Wage):**

1. The missing values were the first thing the LLM noticed about the data, and it remarked that I needed to decide how to handle them.

2. Upon noticing the missing data, the LLM gave me two options. The first was to exclude the missing wages altogether, which it described as a standard and recommended approach in labor economics unless there is a valid reason to do something else. The second was to impute the missing wages if justified, using the mean, the median, or a regression model to choose the fill-in values.

3. The LLM warned against imputing data blindly or without regard for the reliability of later models, since regression results can be biased as a result. In this case, the model might understate true wages (a downward bias) if the imputed values are too low or if too large a share of the observations is imputed.

4. The LLM offered a well-thought-out plan, with pandas code that skipped missing values when computing the average wage and used synthetic data as an example. Exploring wage inequalities across groups and running a regression analysis to find the factors that affect wages (education, age, children) appeared to be its next steps toward a complete data analysis plan for this pseudo-project.
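
Mean imputation, one of the options mentioned in item 2, is worth a quick check because it cannot remove selection bias: filling gaps with the observed mean reproduces the `dropna()` mean exactly and only shrinks the spread. A minimal sketch on hypothetical toy values (not the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy wages with two missing entries.
wages = pd.Series([30_000, 45_000, np.nan, 60_000, np.nan, 75_000], dtype=float)

# Series.mean() skips NaN by default, so this fills gaps with the observed-only mean.
mean_imputed = wages.fillna(wages.mean())

print(f"Observed-only mean: {wages.mean():,.2f}, std: {wages.std():,.2f}")
print(f"Mean-imputed mean:  {mean_imputed.mean():,.2f}, std: {mean_imputed.std():,.2f}")
```

The two means match, while the standard deviation after imputation is smaller, so any bias in the observed-only mean carries over unchanged.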


**Your Reflection -- What do you think AI does well and does poorly, and how does that affect your own use of AI in data analysis?**

AI does really well at explaining concepts and recognizing patterns in data and scenarios similar to what it has been trained on, especially for common data science problems that have shown up, and will keep showing up, in the real world. As a result, it can readily offer comprehensive solutions that apply statistical ideas clearly. However, as the problem gets more complicated and the information given to an LLM gets scarcer, its judgment and reasoning can suffer without proper guidance about the context. Because it lacks real-world understanding of how the numbers it analyzes came to be, questions about the data-generating process can leave AI stuck. As I engage more with data analysis, I've come to use AI as a tool that advances my learning in data domains I'm already familiar with, so that I can focus on what's important. Rather than treating it as a full replacement, I can use it at opportune times to check data and analyses for faults and to surface ideas I wouldn't have seen otherwise (provided I frame my prompts and goals correctly). Human judgment will remain essential for interpreting results in their real-world context and for data-driven decision-making.

## Submission Instructions

- Jupyter Notebook: File → Download as → Notebook (.ipynb)
- Google Colab: File → Download → Download .ipynb
- **Submit the `.ipynb` file to ELMS**