# COGS 108 - Data Checkpoint

## Authors
  
- Evan Ngo: Writer, Background Research
- Leo Wong: Analysis, Software, Visualization
- David Yu: Writer, Background Research
- Roel Torralba: Writer, Background Research

## Research Question


- How do socioeconomic and demographic factors (income, race, gender, parental education, location) influence a student’s choice of college major?  

- Across the University of California system, are STEM majors disproportionately represented at campuses with wealthier or more educated student populations?

**Operational definitions:**  
- *Campuses that are “wealthier”*: a lower share of students receiving Pell Grants and less need-based aid.  
- *Campuses that are “more educated”*: a lower share of first-generation students.

## Background and Prior Work

Because we are examining how socioeconomic and demographic factors influence a student's choice of major, specifically within the University of California, one directly relevant study is **“Rich Grad, Poor Grad: Family Background and College Major Choice”** by Leighton and Speer. They used regression models with the dependent variable *chosen major’s earnings growth* and predictors including *parent education/income*. They show a strong correlation between parent education and the type of major a student chooses. Specifically, students with less educated parents are more likely to pick “safe” majors with higher early expected pay, whereas students with more educated parents tend to pick less safe majors with lower early earnings. Although STEM-related majors tend to be associated with higher-paying jobs upon graduation, this paper does not directly address that association.

Another relevant paper is **“Too Poor to Science: How wealth determines who succeeds in STEM”** by Craig R. McClain. This work discusses financial barriers in STEM such as the costs of after-school tutoring, summer camps, laboratory equipment, and even the expectation to pursue unpaid internships. These economic barriers are a major factor limiting students from pursuing STEM degrees. We can use this article to explore whether, within the University of California, economic factors limit not only a student's choice of major but also their potential future earnings.

A similar line of research appears in **“The Changing Role of Family Income in College Selection and Beyond”**. This work narrows the broader question of family income in college decisions to variables of college choice, degree, and post-school earnings. It relates to our project because it examines income across multiple stages of college progression, while our question focuses on whether income affects the choice of major within the UC system. Their research draws on multiple datasets relating family income to test scores, college entry, institutional quality, and graduation, and analyzes them with regression models (probit and multinomial logit).

**References:**  
- Leighton, M., & Speer, J. (2023). *Rich Grad, Poor Grad: Family Background and College Major Choice*. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4436699  
- McClain, C. R. (2025). *Too Poor to Science: How Wealth Determines Who Succeeds in STEM*. PLOS Biology, 23(6), e3003243. https://doi.org/10.1371/journal.pbio.3003243 (Accessed 2025-07-02)  
- Leukhina, O. (2023). *The Changing Role of Family Income in College Selection and Beyond*. Federal Reserve Bank of St. Louis Review. https://www.stlouisfed.org/publications/review/2023/05/15/the-changing-role-of-family-income-in-college-selection-and-beyond

## Hypothesis


**Draft hypothesis:**  
We predict a strong positive correlation, across University of California campuses, between the proportion of STEM majors and the proportion of students from wealthier and more educated households. Students from such households typically have more resources for and exposure to STEM fields. This can take the form of extracurricular opportunities, which are often out-of-pocket expenses that students who rely on need-based aid are less able to afford.
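
One way this hypothesis could be checked once campus-level data are available is a simple Pearson correlation between each campus's Pell share (our proxy for a less wealthy student population) and its STEM share. The sketch below is only an illustration: the function name `pearson_r` and both lists are hypothetical placeholders, not real UC values. Under the hypothesis we would expect a negative coefficient, since a higher Pell share indicates a less wealthy campus.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT real UC data: one entry per campus.
pell_share = [0.20, 0.25, 0.30, 0.35, 0.40]
stem_share = [0.50, 0.46, 0.41, 0.33, 0.31]

r = pearson_r(pell_share, stem_share)
print(f"Pearson r between Pell share and STEM share: {r:.3f}")
```

With only a handful of campuses, any coefficient should be read as descriptive rather than as strong statistical evidence.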

## Data

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
{ 
    'url': 'https://drive.google.com/uc?export=download&id=132-gnm1v6oJWUPDDBAfg7RQ_ysRj3LAg',
    'filename': 'UCAidAward.pdf',
}
]

get_data.get_raw(datafiles, destination_directory='data/00-raw/')



## Data Overview


### Dataset 1 – UC Universitywide Undergraduate Grant Aid Detail (2024–25)

- **Dataset Name:** UC Universitywide Undergraduate Grant Aid Detail, by Award Type (2024–25)  
- **Link to the dataset (original source):** UC Office of the President, Aid Award Detail – Undergraduate Grants (exported as `Aid Award Detail.pdf`)  
- **Raw file in repo:** `data/00-raw/UCAidAward.pdf` (the filename written by the download cell above)  
- **Planned processed file:** `data/02-processed/uc_aid_award_detail_2024_25.csv`

- **Number of observations (rows):**  
  - Conceptually: one row per **grant award type** (e.g., Pell Grant, Cal Grant A – Entitlement, UC Need-based Grants, etc.) plus a total row.  
  - This is likely around **10–20 rows**, but the exact count will be confirmed after table extraction (`df.shape`).

- **Number of variables (columns):**  
  - For each award type, the columns include: `paid_dollars`, `recipients_headcount`, `recipients_fye`, `percent_with_aid`, `average_award`, and `per_capita_award`.  
  - We will confirm the exact column list after converting the PDF to CSV.

- **Variables most relevant to this project:**
  - For the **“Pell Grant”** award row:  
    - `pell_paid_dollars` – total Pell dollars awarded systemwide.  
    - `pell_recipients_headcount` – number of Pell Grant recipients.  
    - `pell_recipients_fye` – full-year equivalent Pell recipients.  
    - `pell_percent_with_aid` – percent of undergraduates receiving Pell.  
  - Summary row at the bottom:  
    - `total_headcount_2024_25` – headcount of all undergraduates in 2024–25.  
    - `total_fye_2024_25` – full-year equivalent enrollment of all undergraduates.  
  - **Derived variables:**  
    - `pell_share_headcount = pell_recipients_headcount / total_headcount_2024_25`  
    - `pell_share_fye = pell_recipients_fye / total_fye_2024_25`  

- **Shortcomings / limitations for this project:**
  - Data are **universitywide** for a **single year (2024–25)**, so they do not provide trends over time.  
  - There is **no breakdown by campus or by major/STEM**; this dataset alone cannot tell us how Pell students are distributed across majors or campuses.  
  - It covers **grants only**; loans and work-study are outside the scope of this table.
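
Cells extracted from a PDF table typically arrive as strings with currency symbols, thousands separators, and percent signs, so they need cleaning before any arithmetic. The sketch below shows one possible cleaning step. The helper names `parse_number` and `clean_row`, the column order in `COLUMNS`, and the example row are all assumptions for illustration; the real column order must be confirmed against the extracted table.

```python
import re

# Assumed column order after the award-type label; confirm against the PDF.
COLUMNS = [
    "paid_dollars",
    "recipients_headcount",
    "recipients_fye",
    "percent_with_aid",
    "average_award",
    "per_capita_award",
]


def parse_number(cell):
    """Convert a cell such as '$1,234,567', '45.2%' or '12,345' to a float.

    Blank or non-numeric cells return None. Percentages are returned as
    fractions (for example '45.2%' becomes 0.452).
    """
    if cell is None:
        return None
    text = cell.strip()
    if not text:
        return None
    is_percent = text.endswith("%")
    cleaned = re.sub(r"[$,%\s]", "", text)
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value / 100 if is_percent else value


def clean_row(raw_row):
    """Turn one extracted table row (list of strings) into a typed record."""
    award_type, *cells = raw_row
    record = {"award_type": award_type.strip()}
    for name, cell in zip(COLUMNS, cells):
        record[name] = parse_number(cell)
    return record


# Hypothetical example row, NOT real values from the UC table.
raw = ["Pell Grant ", "$1,000,000", "200", "180.5", "25.0%", "$5,000", "$1,250"]
record = clean_row(raw)
print(record)
```

Returning `None` for unparseable cells keeps missing values visible instead of silently turning them into zeros.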

---

### Detailed description – UC Universitywide Undergraduate Grant Aid Detail (2024–25)

This dataset dives deeper into **how grant aid is distributed by award type** in a single year (2024–25). For each grant program (e.g., Pell Grant, Cal Grant A/B, UC Need-based Grants), it reports:

- **Paid dollars**: the total amount of money distributed through that program in 2024–25 (measured in US dollars).  
- **Recipients headcount**: number of individual students who received that award at any point in the year.  
- **Recipients FYE**: full-year equivalent count of recipients, which accounts for students receiving aid for less than a full academic year.  
- **Percent with aid**: the percentage of all UC undergraduates who received that specific award.  
- **Average award**: the average grant amount per recipient for that program.  
- **Per capita award**: the average amount of that grant per student when spread across the entire undergraduate population (including non-recipients).
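
Because several of these columns are defined in terms of the others, the extracted values can be cross-checked against each other. The following is a minimal sketch of such a check. It assumes that average award, per capita award, and percent with aid are all computed from headcounts; the source table may use FYE counts instead, in which case a mismatch here would point to that rather than to an extraction error. The function name `check_award_row` and the example rows are hypothetical.

```python
def check_award_row(row, total_headcount, rel_tol=0.01):
    """Return a list of warnings where reported values differ from recomputed ones."""
    if not row["recipients_headcount"] or not total_headcount:
        return ["cannot recompute: zero or missing denominator"]
    # Assumption: all three ratios use headcount denominators.
    expected = {
        "average_award": row["paid_dollars"] / row["recipients_headcount"],
        "per_capita_award": row["paid_dollars"] / total_headcount,
        "percent_with_aid": row["recipients_headcount"] / total_headcount,
    }
    warnings = []
    for name, value in expected.items():
        reported = row[name]
        if abs(reported - value) > rel_tol * abs(value):
            warnings.append(f"{name}: reported {reported}, recomputed {value:.4g}")
    return warnings


# Hypothetical rows, NOT real UC figures.
good_row = {
    "paid_dollars": 1_000_000,
    "recipients_headcount": 200,
    "average_award": 5_000,
    "per_capita_award": 1_250,
    "percent_with_aid": 0.25,
}
bad_row = dict(good_row, average_award=6_000)  # deliberately inconsistent

print(check_award_row(good_row, total_headcount=800))
print(check_award_row(bad_row, total_headcount=800))
```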

For our project, the **Pell Grant** row is particularly important. From that row we can compute the **Pell proportion for 2024–25** (`pell_recipients_headcount / total_headcount_2024_25`), which tells us what fraction of UC undergraduates received Pell in the most recent year in our data. We can then compare this to the proportion of STEM majors in 2024–25 from a STEM-enrollment dataset.
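
The Pell proportion described above is a single division, but computing the headcount and FYE versions side by side makes it easy to see whether the choice of denominator matters. This is a sketch only: the function name `pell_shares` is hypothetical and the counts are placeholders, not real UC figures.

```python
def pell_shares(pell_headcount, pell_fye, total_headcount, total_fye):
    """Return Pell recipients as a fraction of all undergraduates.

    Computes both derived variables: one based on headcount and one based
    on full-year-equivalent (FYE) enrollment.
    """
    if total_headcount <= 0 or total_fye <= 0:
        raise ValueError("enrollment totals must be positive")
    return {
        "pell_share_headcount": pell_headcount / total_headcount,
        "pell_share_fye": pell_fye / total_fye,
    }


# Hypothetical placeholder counts, NOT real UC figures.
shares = pell_shares(pell_headcount=200, pell_fye=180,
                     total_headcount=800, total_fye=750)
for name, value in shares.items():
    print(f"{name}: {value:.1%}")
```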

**Major concerns / biases:**

- The dataset is both **universitywide** and **single-year**. It does not show how Pell proportions vary by campus or how they have changed over time. This limits us to using it as a detailed snapshot rather than a full trend analysis.  
- The table includes **grants only**; student loans and work-study are not part of this dataset. So while it gives a good view of gift aid, it does not capture total student debt or work burden.  
- This dataset does not contain any information on **majors or STEM status**; we must combine it conceptually with separate STEM-enrollment data.

From an ethical perspective, we must be careful in how we interpret Pell-related metrics. Pell receipt indicates financial need, not academic ability. Any associations between Pell proportions and STEM participation should be framed in terms of **structural and economic factors** (access to preparation, advising, support, and affordability), not as evidence that Pell students are less capable.



## Ethics

**Instructions:** Keep the contents of this cell. For each item on the checklist, put an `X` if you've considered the item. **If the item is relevant**, add a short paragraph after the checklist item discussing the issue. Items here are to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Document these discussions and decisions. You don't have to solve these problems; just acknowledge potential harm, no matter how unlikely.  

A. **Data Collection**

- [ ] **A.1 Informed consent:** If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?  
  There are no direct human subject interactions for this data. The Common Data Set (CDS) for each University of California campus is aggregated and anonymous, so informed consent is not required.

- [ ] **A.2 Collection bias:** Have we considered sources of bias introduced during data collection/survey design and taken steps to mitigate those?  
  We acknowledge that the CDS for each campus is self-reported and may contain inconsistencies across campuses. To mitigate this, we will use clearly defined and comparable data (e.g., percentages receiving need-based aid).

- [ ] **A.3 Limit PII exposure:** Have we considered ways to minimize exposure of personally identifiable information (PII)?  
  The CDS contains no PII.

- [ ] **A.4 Downstream bias mitigation:** Have we considered ways to test downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?  
  We will avoid causal framing and focus on representation patterns across UC campuses. Findings will be framed with respect to structural and systemic factors.

B. **Data Storage**

- [ ] **B.1 Data security:** Plan to protect and secure data?  
  We plan to store data locally and on a private GitHub repository. Since the data are public and contain no PII, heavy security measures are not necessary.

- [ ] **B.2 Right to be forgotten:** Mechanism for removal requests?  
  Not applicable; we are not collecting data directly.

- [ ] **B.3 Data retention plan:** Schedule or plan to delete data after it is no longer needed?  
  _[Team to define a retention policy for archival or deletion after course completion.]_

C. **Analysis**

- [ ] **C.1 Missing perspectives:** Have we sought to address blind spots by engaging stakeholders/experts?  
  Our analysis focuses on quantitative metrics and may capture only part of the story. We acknowledge broader social factors likely influence STEM major selection.

- [ ] **C.2 Dataset bias:** Have we examined the data for possible sources of bias and taken steps to mitigate?  
  Biases may arise from misreported numbers. We will sanity-check values and investigate outliers.

- [ ] **C.3 Honest representation:** Are visualizations and summaries designed to honestly represent the underlying data?  
  _[Commit to clear scales, labeled axes, and context for comparisons.]_

- [ ] **C.4 Privacy in analysis:** Ensure data with PII are not used or displayed unless necessary?  
  No PII involved.

- [ ] **C.5 Auditability:** Is the analysis process well documented and reproducible?  
  _[Maintain scripts/notebooks; track data sources and transforms.]_

D. **Modeling** _(if applicable)_

- [ ] **D.1 Proxy discrimination:** Avoid variables/proxies that are unfairly discriminatory.  
- [ ] **D.2 Fairness across groups:** Test results for disparate error rates.  
- [ ] **D.3 Metric selection:** Consider the effects of optimizing chosen metrics and evaluate alternatives.  
- [ ] **D.4 Explainability:** Ensure decisions are explainable in understandable terms.  
- [ ] **D.5 Communicate limitations:** Clearly communicate shortcomings, limitations, and biases.

E. **Deployment** _(if applicable)_

- [ ] **E.1 Monitoring and evaluation:** Plan to monitor model and impacts post-deployment.  
- [ ] **E.2 Redress:** Discuss a plan for response if users are harmed by results.  
- [ ] **E.3 Roll back:** Ability to turn off/roll back the model in production if necessary.  
- [ ] **E.4 Unintended use:** Identify and prevent unintended uses/abuses; plan to monitor.

## Team Expectations 

- **Communications / Meetings:**  
  Meetings will be held on Google Meet using the same posted link each time. Communication between members will happen in the group text chat, which is mainly used for announcements, concerns, and plans related to this project.

- **Group Norms:**  
  Tone should always be respectful. If a disagreement occurs, both sides will present their case, and the group will work toward a solution that addresses both perspectives through productive discussion.

- **Group Decisions:**  
  Decisions are made by a vote of all group members. If a disagreement occurs, refer back to the Group Norms. If time constraints prevent everyone from responding, the vote or discussion will proceed with the members who are present.

- **Contributions / Tasks:**  
  Before starting work (including before and during meetings), the group will outline the current work plan and decide who does what based on each member’s strengths and weaknesses. Work should be distributed as equally as possible; if it is not, make-up work will be assigned later in the project.

- **Member Expectations:**  
  Members are expected to complete their assigned tasks (independently or with help from another member), and should communicate if they are struggling or have outside conflicts that affect their work. Refer to the Contributions / Tasks section for additional details.


## Project Timeline Proposal

| Meeting Date | Meeting Time    | Completed Before Meeting                                              | Discuss at Meeting                                                                                                             |
| ------------ | --------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| 1/20         | 1 PM            | Read & Think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 1/26         | 10 AM           | Do background research on topic                                       | Discuss ideal dataset(s) and ethics; draft project proposal                                                                    |
| 2/1          | 10 AM           | Edit, finalize, and submit proposal; Search for datasets              | Discuss wrangling and possible analytical approaches; Assign group members to lead each specific part                          |
| 2/14         | 6 PM            | Import & Wrangle Data (Ant Man); EDA (Hulk)                           | Review/Edit wrangling/EDA; Discuss Analysis Plan                                                                               |
| 2/23         | 12 PM           | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor)               | Discuss/edit Analysis; Complete project check-in                                                                               |
| 3/13         | 12 PM           | Complete analysis; Draft results/conclusion/discussion (Wasp)         | Discuss/edit full project                                                                                                      |
| 3/20         | Before 11:59 PM | NA                                                                    | Turn in Final Project & Group Project Surveys                                                                                  |
