# COGS108 Project Proposal



## Authors
  
- Evan Ngo: Writer, Background Research
- Leo Wong: Analysis, Software, Visualization  
- David Yu: Writer, Background Research  
- Roel Torralba: Writer, Background Research

## Research Question


- How do socioeconomic and demographic factors (income, race, gender, parental education, location) influence a student’s choice of college major?  

- Across the University of California system, are STEM majors disproportionately represented at campuses with wealthier or more educated student populations?

**Operational definitions:**  
- *Campuses that are “wealthier”*: A smaller share of students receiving Pell Grants and need-based aid.  
- *Campuses that are “more educated”*: A smaller share of first-generation students.

### Preliminary UC Campus Snapshot

| Campus           | % STEM Majors | % First-Gen | % Pell | % Need-Based Aid |
|------------------|---------------|-------------|--------|------------------|
| UC Berkeley      | 52            | 22          | 26     | 49               |
| UC Davis         | 47            | 30          | 33     | 55               |
| UC Irvine        | 46            | 37          | 40     | 58               |
| UCLA             | 50            | 28          | 25     | 52               |
| UC Merced        | 35            | 61          | 66     | 81               |
| UC Riverside     | 39            | 57          | 59     | 76               |
| UC San Diego     | 51            | 30          | 27     | 54               |
| UC Santa Barbara | 44            | 25          | 23     | 50               |
| UC Santa Cruz    | 41            | 35          | 40     | 61               |
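
As a rough first look, the snapshot above can be summarized with Pearson correlations between STEM share and each socioeconomic indicator. The sketch below uses only the Python standard library. The `pearson` helper and the dictionary keys are our own illustrative names, and with only nine campuses the results are descriptive rather than conclusive.

```python
from math import sqrt

# Values copied from the preliminary snapshot table (all in percent).
snapshot = {
    "UC Berkeley":      {"stem": 52, "first_gen": 22, "pell": 26, "need_aid": 49},
    "UC Davis":         {"stem": 47, "first_gen": 30, "pell": 33, "need_aid": 55},
    "UC Irvine":        {"stem": 46, "first_gen": 37, "pell": 40, "need_aid": 58},
    "UCLA":             {"stem": 50, "first_gen": 28, "pell": 25, "need_aid": 52},
    "UC Merced":        {"stem": 35, "first_gen": 61, "pell": 66, "need_aid": 81},
    "UC Riverside":     {"stem": 39, "first_gen": 57, "pell": 59, "need_aid": 76},
    "UC San Diego":     {"stem": 51, "first_gen": 30, "pell": 27, "need_aid": 54},
    "UC Santa Barbara": {"stem": 44, "first_gen": 25, "pell": 23, "need_aid": 50},
    "UC Santa Cruz":    {"stem": 41, "first_gen": 35, "pell": 40, "need_aid": 61},
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


stem = [row["stem"] for row in snapshot.values()]

# Correlate STEM share with each socioeconomic indicator in turn.
correlations = {
    indicator: pearson(stem, [row[indicator] for row in snapshot.values()])
    for indicator in ("first_gen", "pell", "need_aid")
}

for indicator, r in correlations.items():
    print(f"STEM vs {indicator}: r = {r:+.2f}")
```

A negative coefficient here would be consistent with our operational definitions, since higher Pell, need-based aid, and first-gen shares indicate a less wealthy or less educated campus population.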

## Background and Prior Work

Because we are examining how socioeconomic and demographic factors influence a student's choice of major, specifically within the University of California, one directly relevant study is **“Rich Grad, Poor Grad: Family Background and College Major Choice”** by Leighton and Speer. They used regression models with the *chosen major’s earnings growth* as the dependent variable and predictors including *parent education/income*. They show a strong correlation between parent education and the type of major a student chooses. Specifically, students with less educated parents are more likely to pick “safe” majors with higher early expected pay, whereas students with more educated parents tend to pick majors that are less safe and have slower early earnings. Although STEM-related majors tend to be associated with higher-paying jobs upon graduation, this paper does not directly address that association.

Another relevant paper is **“Too Poor to Science: How wealth determines who succeeds in STEM”** by Craig R. McClain. This work discusses financial barriers in STEM such as the costs of after-school tutoring, summer camps, and laboratory equipment, and even the expectation to pursue unpaid internships. These economic barriers are a major factor limiting students from pursuing STEM degrees. We can use this article to explore whether, within the University of California, economic factors limit not only a student's choice of major but also their potential future earnings.

A similar line of research appears in **“The Changing Role of Family Income in College Selection and Beyond”**. This work narrows the broader question of family income in college decisions to three outcomes: college choice, degree, and post-school earnings. It relates to our project because it examines income across multiple stages of college progression, while our question focuses on whether income affects the choice of major within the UC system. Their research draws on multiple datasets relating income to test scores, college entry, institutional quality, and graduation, analyzed with probit and multinomial logit regressions.

**References:**  
- Leighton, M., & Speer, J. (2023). *Rich Grad, Poor Grad: Family Background and College Major Choice*. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4436699  
- McClain, C. R. (2025). *Too Poor to Science: How Wealth Determines Who Succeeds in STEM*. PLOS Biology, 23(6), e3003243. https://doi.org/10.1371/journal.pbio.3003243 (Accessed 2025-07-02)  
- Leukhina, O. (2023). *The Changing Role of Family Income in College Selection and Beyond*. Federal Reserve Bank of St. Louis Review. https://www.stlouisfed.org/publications/review/2023/05/15/the-changing-role-of-family-income-in-college-selection-and-beyond

## Hypothesis

**Draft hypothesis:**  
We predict a strong positive association between a University of California campus's proportion of STEM majors and how wealthy and educated its student population is. Under our operational definitions, this means campuses with smaller shares of Pell Grant recipients, need-based aid recipients, and first-generation students should have larger shares of STEM majors. Students from wealthier and more educated households typically have more resources for, and earlier exposure to, STEM fields. This can take the form of extracurricular opportunities, which are out-of-pocket expenses that students with financial need are less able to afford.
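
One possible way to check this prediction on campus-level data is a permutation test, which makes few distributional assumptions and suits a sample of only nine campuses. The sketch below is an assumption about how we might proceed, not a committed method. It uses the % STEM and % Pell columns from the preliminary snapshot (campuses in table order), and the function names, number of shuffles, and seed are illustrative choices.

```python
import random
from math import sqrt

# Campus-level percentages from the preliminary snapshot, in table order:
# Berkeley, Davis, Irvine, UCLA, Merced, Riverside, San Diego, Santa Barbara, Santa Cruz.
stem_pct = [52, 47, 46, 50, 35, 39, 51, 44, 41]
pell_pct = [26, 33, 40, 25, 66, 59, 27, 23, 40]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the Pearson correlation."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        # Shuffling breaks any real pairing between campuses' x and y values.
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_shuffles + 1)


r = pearson(stem_pct, pell_pct)
p = permutation_p_value(stem_pct, pell_pct)
print(f"r = {r:+.2f}, permutation p = {p:.4f}")
```

Because a higher Pell share marks a less wealthy campus, the hypothesis corresponds to a negative `r` in this particular pairing.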

## Data 


**Draft approach:**  
The ideal data for this project would come from the Common Data Sets (CDS) published by campuses across the UC system. Specifically, we would look at the proportion of students who received Pell Grants and need-based aid, as well as the proportion of first-generation students, and compare these against students' choice of college major. Because the CDS is released publicly for multiple academic years, we would be able to collect a large number of observations across cohorts at all UC campuses. We plan to keep this data tidy by creating tables keyed by UC campus, year, major, and demographic and socioeconomic categories. Separate tables can also be made for enrollment and financial aid status.
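
To illustrate what "tidy" could mean here, the sketch below reshapes one-row-per-campus records into one-row-per-measure records keyed by campus and year. The `Observation` class, the column names, and the year label are hypothetical placeholders. The percentages are taken from our preliminary snapshot table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One tidy record: a single measure for one campus in one year."""
    campus: str
    year: str
    measure: str
    value: float


def to_tidy(wide_rows, key_columns=("campus", "year")):
    """Turn one-row-per-campus-year dicts into one record per measure."""
    tidy = []
    for row in wide_rows:
        for column, value in row.items():
            if column in key_columns:
                continue  # keys identify the row; they are not measures
            tidy.append(Observation(row["campus"], row["year"], column, float(value)))
    return tidy


# Percentages from the preliminary snapshot; the year label is a placeholder.
wide = [
    {"campus": "UC Berkeley", "year": "2024-2025", "pct_pell": 26, "pct_first_gen": 22},
    {"campus": "UC Merced", "year": "2024-2025", "pct_pell": 66, "pct_first_gen": 61},
]

tidy = to_tidy(wide)
for obs in tidy:
    print(obs)
```

Keeping each measure in its own row makes it straightforward to add further years or measures later without changing the table's shape.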

Example source: https://opa.berkeley.edu/campus-data/common-data-set (file: 2024-2025-CDS.xlsx)

The link and file above are one example of a data set we will use. UC Berkeley, like the other UC campuses, publishes its Common Data Set annually as an Excel spreadsheet with separate tabs for general campus information, enrollment, admissions, student life, financial aid, degrees conferred, and more, all of which can be viewed and downloaded publicly without authorization. The key variable for us is degrees conferred by field, which serves as a temporary proxy for major choice. We will then cross-reference it with a second data set of admitted majors by demographics, which gives a more direct view of major choice and lets us analyze the strength of the correlation between these measures and the factors in our research question.
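
As a sketch of how degrees conferred by field could be turned into a campus-level STEM share, the function below sums the fields we classify as STEM and divides by the total. The field labels and the `STEM_FIELDS` grouping are assumptions that the team would still need to agree on, and the example numbers are made up for illustration rather than taken from any CDS file.

```python
# Hypothetical STEM grouping; which fields count as STEM is an open decision.
STEM_FIELDS = frozenset({
    "Biological/life sciences",
    "Computer and information sciences",
    "Engineering",
    "Mathematics and statistics",
    "Physical sciences",
})


def stem_share(degrees_by_field, stem_fields=STEM_FIELDS):
    """Return the percentage of degrees conferred in STEM fields.

    `degrees_by_field` maps a field name to a count or percentage of degrees.
    """
    total = sum(degrees_by_field.values())
    if total == 0:
        raise ValueError("no degrees recorded")
    stem_total = sum(
        amount for field, amount in degrees_by_field.items() if field in stem_fields
    )
    return 100 * stem_total / total


# Made-up values for illustration only; not real CDS figures.
example_degrees = {
    "Engineering": 20.0,
    "Computer and information sciences": 10.0,
    "Social sciences": 30.0,
    "Psychology": 15.0,
    "Other": 25.0,
}

print(f"STEM share: {stem_share(example_degrees):.1f}%")
```

Dividing by the total rather than assuming it equals 100 lets the same function work on raw degree counts as well as on percentages.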


## Ethics

**Instructions:** Keep the contents of this cell. For each item on the checklist, put an `X` if you've considered the item. **If the item is relevant**, add a short paragraph after the checklist item discussing the issue. Items here are to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Document these discussions and decisions. You don't have to solve these problems; just acknowledge potential harm, no matter how unlikely.  

A. **Data Collection**

- [ ] **A.1 Informed consent:** If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?  
  There are no direct human subject interactions for this data. The Common Data Set for each University of California campus is aggregated and anonymous, so informed consent is not required.

- [ ] **A.2 Collection bias:** Have we considered sources of bias introduced during data collection/survey design and taken steps to mitigate those?  
  We acknowledge that the CDS for each campus is self-reported and may contain inconsistencies across campuses. To mitigate this, we will use clearly defined and comparable data (e.g., percentages receiving need-based aid).

- [ ] **A.3 Limit PII exposure:** Have we considered ways to minimize exposure of personally identifiable information (PII)?  
  The CDS contains no PII.

- [ ] **A.4 Downstream bias mitigation:** Have we considered ways to test downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?  
  We will avoid causal framing and focus on representation patterns across UC campuses. Findings will be framed with respect to structural and systemic factors.

B. **Data Storage**

- [ ] **B.1 Data security:** Plan to protect and secure data?  
  We plan to store data locally and on a private GitHub repository. Since the data are public and contain no PII, heavy security measures are not necessary.

- [ ] **B.2 Right to be forgotten:** Mechanism for removal requests?  
  Not applicable; we are not collecting data directly.

- [ ] **B.3 Data retention plan:** Schedule or plan to delete data after it is no longer needed?  
  _[Team to define a retention policy for archival or deletion after course completion.]_

C. **Analysis**

- [ ] **C.1 Missing perspectives:** Have we sought to address blind spots by engaging stakeholders/experts?  
  Our analysis focuses on quantitative metrics and may capture only part of the story. We acknowledge broader social factors likely influence STEM major selection.

- [ ] **C.2 Dataset bias:** Have we examined the data for possible sources of bias and taken steps to mitigate?  
  Biases may arise from misreported numbers. We will sanity-check values and investigate outliers.

- [ ] **C.3 Honest representation:** Are visualizations and summaries designed to honestly represent the underlying data?  
  _[Commit to clear scales, labeled axes, and context for comparisons.]_

- [ ] **C.4 Privacy in analysis:** Ensure data with PII are not used or displayed unless necessary?  
  No PII involved.

- [ ] **C.5 Auditability:** Is the analysis process well documented and reproducible?  
  _[Maintain scripts/notebooks; track data sources and transforms.]_

D. **Modeling** _(if applicable)_

- [ ] **D.1 Proxy discrimination:** Avoid variables/proxies that are unfairly discriminatory.  
- [ ] **D.2 Fairness across groups:** Test results for disparate error rates.  
- [ ] **D.3 Metric selection:** Consider the effects of optimizing chosen metrics and evaluate alternatives.  
- [ ] **D.4 Explainability:** Ensure decisions are explainable in understandable terms.  
- [ ] **D.5 Communicate limitations:** Clearly communicate shortcomings, limitations, and biases.

E. **Deployment** _(if applicable)_

- [ ] **E.1 Monitoring and evaluation:** Plan to monitor model and impacts post-deployment.  
- [ ] **E.2 Redress:** Discuss a plan for response if users are harmed by results.  
- [ ] **E.3 Roll back:** Ability to turn off/roll back the model in production if necessary.  
- [ ] **E.4 Unintended use:** Identify and prevent unintended uses/abuses; plan to monitor.

## Team Expectations

- **Team Expectation 1:** All members share a clear understanding of the project timeline.  
- **Team Expectation 2:** Members stay responsive and communicative in the group chat, even when not physically present.  
- **Team Expectation 3:** Tasks within assignments are split as evenly as possible across members.

## Project Timeline Proposal


**Our group’s timeline:**

| Meeting Date | Meeting Time | Completed Before Meeting                                   | Discuss at Meeting                                                          |
|--------------|--------------|-------------------------------------------------------------|------------------------------------------------------------------------------|
| 10/29        | 3 PM         | Read Project Proposal & Brainstorm Ideas                    | Brainstorm ideas together; separate tasks for proposal; submit by midnight  |
| 11/4         | 3 PM         | Research potential data sources / survey methods            | Brainstorm Data Checkpoint 1; discuss sources together                      |
| 11/12        | 3 PM         | Edit, finalize, and submit proposal; find datasets; check-ins | Finalize and submit Data Checkpoint 1; preview Checkpoint 2: EDA             |
| 11/19        | 3 PM         | Brainstorm wrangling/EDA; weekly check-in surveys           | Begin analysis                                                               |
| 11/25        | 3 PM         | Submit Data Checkpoint 2                                    | Discuss/edit analysis; complete project check-in                            |
| 12/2         | 3 PM         | Complete analysis draft; weekly check-in surveys            | Discuss/edit full project; plan final project video & README                |
| 12/15        | 3 PM         | Finalize project; work on video & documentation             | Turn in Final Project & Group Project Surveys                               |