# COGS 108 - Data Checkpoint

## Authors
  
- Evan Ngo: Writer, Background Research
- Leo Wong: Analysis, Software, Visualization  
- David Yu: Writer, Background Research
- Roel Torralba: Writer, Background Research

## Research Question


- How do socioeconomic and demographic factors (income, race, gender, parental education, location) influence a student’s choice of college major?  

- Across the University of California system, are STEM majors disproportionately represented at campuses with wealthier or more educated student populations?

**Operational definitions:**  
- *Campuses that are “wealthier”*: A smaller share of students receiving Pell Grants and less need-based aid.  
- *Campuses that are “more educated”*: A smaller share of first-generation students.

## Background and Prior Work

Since we are looking at how socioeconomic and demographic factors influence a student's choice of major, specifically within the University of California, one directly relevant study is **“Rich Grad, Poor Grad: Family Background and College Major Choice”** by Leighton and Speer. They used regression models with the dependent variable *chosen major’s earnings growth* and predictors including *parent education/income*. They show a strong correlation between parent education and the type of major a student chooses. Specifically, students with less educated parents are more likely to pick “safe” majors with higher early expected pay, whereas students with more educated parents tend to pick majors that are less safe and have lower early earnings. Although STEM-related majors tend to be associated with higher-paying jobs upon graduation, this paper does not directly address that association.

Another relevant paper is **“Too Poor to Science: How wealth determines who succeeds in STEM”** by Craig R. McClain. This work discusses financial barriers in STEM such as the costs of after-school tutoring, summer camps, laboratory equipment, and even the expectation to pursue unpaid internships. These economic barriers are a major factor limiting students from pursuing STEM degrees. We can use this article to explore whether, within the University of California, economic factors limit not only a student's choice of major but also their potential future earnings.

A similar line of research appears in **“The Changing Role of Family Income in College Selection and Beyond”**. This work narrows the broader question of how family income shapes college decisions to three outcomes: college choice, degree attainment, and post-school earnings. It relates to our project because it examines income across multiple stages of college progression, while our question focuses on whether income affects the choice of major within the UC system. Their research relies on multiple datasets relating income to test scores, college entry, institutional quality, and graduation, which they analyze with regressions (probit and multinomial logit models).

**References:**  
- Leighton, M., & Speer, J. (2023). *Rich Grad, Poor Grad: Family Background and College Major Choice*. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4436699  
- McClain, C. R. (2025). *Too Poor to Science: How Wealth Determines Who Succeeds in STEM*. PLOS Biology, 23(6), e3003243. https://doi.org/10.1371/journal.pbio.3003243 (Accessed 2025-07-02)  
- Leukhina, O. (2024). *The Changing Role of Family Income in College Selection and Beyond*. Federal Reserve Bank of St. Louis Review. https://www.stlouisfed.org/publications/review/2023/05/15/the-changing-role-of-family-income-in-college-selection-and-beyond

## Hypothesis


**Draft hypothesis:**  
We predict a strong positive correlation between the proportion of STEM majors and the proportion of Pell Grant recipients at each University of California campus. We will quantify this relationship using a Pearson correlation coefficient and an OLS linear regression (STEM % ~ Pell %). We will compare the coefficient and regression results from UC-wide campus data with those for individual UC campuses, using a permutation test at a significance level of α = 0.01 to determine significance.
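The three pieces of this plan (Pearson correlation, an OLS fit of STEM % on Pell %, and a permutation test) can be sketched as follows. This is a minimal illustration in plain Python, not our final analysis code: the function names, the number of permutations, and the toy proportions are all assumptions made for the example, and the numbers are invented rather than taken from the College Scorecard. In practice, library routines such as `scipy.stats.pearsonr` or a `statsmodels` OLS model would likely replace the hand-written helpers.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ols_fit(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def permutation_p_value(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real pairing between x and y.
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)


# Made-up campus-level proportions, for illustration only (not real data).
pell_pct = [0.22, 0.27, 0.31, 0.35, 0.38, 0.42, 0.47, 0.55]
stem_pct = [0.30, 0.41, 0.33, 0.45, 0.39, 0.52, 0.48, 0.50]

ALPHA = 0.01  # significance level stated in the hypothesis
r = pearson_r(pell_pct, stem_pct)
slope, intercept = ols_fit(pell_pct, stem_pct)
p_value = permutation_p_value(pell_pct, stem_pct)
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"permutation p = {p_value:.4f}, significant at alpha={ALPHA}: {p_value < ALPHA}")
```

The permutation test compares the observed |r| against the distribution of |r| obtained when the Pell and STEM values are randomly re-paired, so it does not rely on normality assumptions, which matters with only a handful of campuses.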


## Data

In [5]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [1]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
{ 
    'url': 'https://drive.google.com/uc?export=download&id=18oOygdaKaoIpPQpe67W4FzL67mDeOQBm',
    'filename': 'CollegeScorecardDataset.csv',
}
]

get_data.get_raw(datafiles, destination_directory='data/00-raw/')


Overall Download Progress: 100%|██████████| 1/1 [00:00<00:00,  2.18it/s]

Successfully downloaded: CollegeScorecardDataset.csv





## Dataset: U.S. Department of Education – College Scorecard (Most Recent Cohorts)
Link to data: https://collegescorecard.ed.gov/data/  

### Dataset Description

We use the U.S. Department of Education College Scorecard – Most Recent Cohorts (Institution-Level) dataset, which contains standardized information on over 6,000 colleges and universities nationwide. For our analysis, we extract only the variables directly relevant to our research question: the proportion of students receiving Pell Grants (PCTPELL) and the institutional distribution of degree awards across STEM fields. Pell Grant proportion serves as a widely accepted socioeconomic indicator, as Pell eligibility is strongly tied to low-income status.

To measure how STEM-heavy each institution is, we use College Scorecard’s CIP-based program fields, which give the share of degrees awarded in specific STEM disciplines: PCIP11 (Computer Science), PCIP14 (Engineering), PCIP15 (Engineering Technologies), PCIP26 (Biological Sciences), PCIP27 (Mathematics/Statistics), PCIP40 (Physical Sciences), and PCIP41 (Science Technologies). These are summed to create a single composite measure (STEM_PCT) indicating the share of total degrees granted in STEM. To ensure comparability, we restrict the dataset to public, predominantly bachelor’s-degree-granting institutions using CONTROL = 1 and PREDDEG = 3. Additional columns such as INSTNM, STABBR, CITY, UGDS, and MD_EARN_WNE_P10 are retained for context but are not central to the analysis. Together, these processed variables allow us to investigate whether campuses with higher proportions of lower-income students tend to grant a higher or lower percentage of STEM degrees.
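The filtering and STEM_PCT construction described above can be sketched with pandas as shown here. This is an illustrative sketch under stated assumptions rather than our final wrangling code: the helper name `build_stem_table` is hypothetical, the toy rows are invented so the example runs without the downloaded file, and we assume missing values appear as text placeholders (such as `NULL` or `PrivacySuppressed`) that should be coerced to NaN. The column names are the ones listed in the description.

```python
import io

import pandas as pd

STEM_COLS = ["PCIP11", "PCIP14", "PCIP15", "PCIP26", "PCIP27", "PCIP40", "PCIP41"]
CONTEXT_COLS = ["INSTNM", "STABBR", "CITY", "UGDS", "MD_EARN_WNE_P10", "PCTPELL"]
NUMERIC_COLS = STEM_COLS + ["PCTPELL", "PREDDEG", "CONTROL", "UGDS", "MD_EARN_WNE_P10"]


def build_stem_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Filter to public bachelor's institutions and add a STEM_PCT column.

    Hypothetical helper for illustration; not part of any library.
    """
    df = raw.copy()
    # Coerce to numeric so text placeholders for missing values become NaN.
    df[NUMERIC_COLS] = df[NUMERIC_COLS].apply(pd.to_numeric, errors="coerce")
    # PREDDEG == 3: predominantly bachelor's; CONTROL == 1: public.
    df = df.loc[(df["PREDDEG"] == 3) & (df["CONTROL"] == 1)]
    # Drop rows that cannot contribute to the STEM-vs-Pell comparison.
    df = df.dropna(subset=STEM_COLS + ["PCTPELL"])
    # Sum the seven program shares into one composite measure.
    df = df.assign(STEM_PCT=df[STEM_COLS].sum(axis=1))
    return df[CONTEXT_COLS + ["STEM_PCT"]].reset_index(drop=True)


# Invented rows for illustration only; these are not real institutions.
TOY_CSV = """INSTNM,STABBR,CITY,UGDS,MD_EARN_WNE_P10,PCTPELL,PREDDEG,CONTROL,PCIP11,PCIP14,PCIP15,PCIP26,PCIP27,PCIP40,PCIP41
Example Public U,CA,Townsville,20000,60000,0.30,3,1,0.10,0.15,0.00,0.12,0.03,0.05,0.00
Example Private U,CA,Cityburg,5000,70000,0.20,3,2,0.05,0.05,0.00,0.05,0.01,0.02,0.00
Example Community College,CA,Villageton,8000,35000,0.45,2,1,0.02,0.00,0.01,0.00,0.00,0.00,0.00
Example Sparse U,CA,Hamlet,3000,PrivacySuppressed,NULL,3,1,0.10,0.10,0.00,0.10,0.02,0.03,0.00
"""

stem_table = build_stem_table(pd.read_csv(io.StringIO(TOY_CSV)))
print(stem_table)
```

With the real file, the same helper would be applied to the DataFrame read from `data/00-raw/CollegeScorecardDataset.csv`. In the toy data, only the first row survives: the second is private, the third is not predominantly bachelor's-granting, and the fourth has no Pell value.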

## Ethics

**Instructions:** Keep the contents of this cell. For each item on the checklist, put an `X` if you've considered the item. **If the item is relevant**, add a short paragraph after the checklist item discussing the issue. Items here are to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Document these discussions and decisions. You don't have to solve these problems; just acknowledge potential harm, no matter how unlikely.  

A. **Data Collection**

- [ ] **A.1 Informed consent:** If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?  
  There are no direct human subject interactions for this data. The College Scorecard data are aggregated at the institution level and anonymous, so informed consent is not required.

- [ ] **A.2 Collection bias:** Have we considered sources of bias introduced during data collection/survey design and taken steps to mitigate those?  
  We acknowledge that College Scorecard figures are largely reported by the institutions themselves and may contain inconsistencies across campuses. To mitigate this, we will use clearly defined and comparable measures (e.g., the share of students receiving Pell Grants).

- [ ] **A.3 Limit PII exposure:** Have we considered ways to minimize exposure of personally identifiable information (PII)?  
  The institution-level College Scorecard data contain no PII.

- [ ] **A.4 Downstream bias mitigation:** Have we considered ways to test downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?  
  We will avoid causal framing and focus on representation patterns across UC campuses. Findings will be framed with respect to structural and systemic factors.

B. **Data Storage**

- [ ] **B.1 Data security:** Plan to protect and secure data?  
  We plan to store data locally and on a private GitHub repository. Since the data are public and contain no PII, heavy security measures are not necessary.

- [ ] **B.2 Right to be forgotten:** Mechanism for removal requests?  
  Not applicable; we are not collecting data directly.

- [ ] **B.3 Data retention plan:** Schedule or plan to delete data after it is no longer needed?  
  _[Team to define a retention policy for archival or deletion after course completion.]_

C. **Analysis**

- [ ] **C.1 Missing perspectives:** Have we sought to address blind spots by engaging stakeholders/experts?  
  Our analysis focuses on quantitative metrics and may capture only part of the story. We acknowledge broader social factors likely influence STEM major selection.

- [ ] **C.2 Dataset bias:** Have we examined the data for possible sources of bias and taken steps to mitigate?  
  Biases may arise from misreported numbers. We will sanity-check values and investigate outliers.

- [ ] **C.3 Honest representation:** Are visualizations and summaries designed to honestly represent the underlying data?  
  _[Commit to clear scales, labeled axes, and context for comparisons.]_

- [ ] **C.4 Privacy in analysis:** Ensure data with PII are not used or displayed unless necessary?  
  No PII involved.

- [ ] **C.5 Auditability:** Is the analysis process well documented and reproducible?  
  _[Maintain scripts/notebooks; track data sources and transforms.]_

D. **Modeling** _(if applicable)_

- [ ] **D.1 Proxy discrimination:** Avoid variables/proxies that are unfairly discriminatory.  
- [ ] **D.2 Fairness across groups:** Test results for disparate error rates.  
- [ ] **D.3 Metric selection:** Consider the effects of optimizing chosen metrics and evaluate alternatives.  
- [ ] **D.4 Explainability:** Ensure decisions are explainable in understandable terms.  
- [ ] **D.5 Communicate limitations:** Clearly communicate shortcomings, limitations, and biases.

E. **Deployment** _(if applicable)_

- [ ] **E.1 Monitoring and evaluation:** Plan to monitor model and impacts post-deployment.  
- [ ] **E.2 Redress:** Discuss a plan for response if users are harmed by results.  
- [ ] **E.3 Roll back:** Ability to turn off/roll back the model in production if necessary.  
- [ ] **E.4 Unintended use:** Identify and prevent unintended uses/abuses; plan to monitor.

## Team Expectations 

- **Communications / Meetings:**  
  Meetings will be held on Google Meet using the same posted link each time. Communication between members will happen in the group text chat, which is mainly used for announcements, concerns, and plans related to this project.

- **Group Norms:**  
  Tone should always be respectful. If a disagreement occurs, both sides will present their case, and the group will work toward a solution that addresses both perspectives through productive discussion.

- **Group Decisions:**  
  Decisions are made by a group vote in which each member participates. If a disagreement occurs, refer back to the Group Norms. If time constraints prevent everyone from responding, the vote or discussion will proceed with the members who are currently present.

- **Contributions / Tasks:**  
  Before starting work (including before and during meetings), the group will outline the current work plan and discuss who will do what based on each member’s strengths and weaknesses. Work should be distributed as equally as possible; if it is not, make-up work will be assigned later in the project.

- **Member Expectations:**  
  Members are expected to complete their assigned tasks (independently or with help from another member), and should communicate if they are struggling or have outside conflicts that affect their work. Refer to the Contributions / Tasks section for additional details.


## Project Timeline Proposal

| Meeting Date | Meeting Time    | Completed Before Meeting                                              | Discuss at Meeting                                                                                                             |
| ------------ | --------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| 1/20         | 1 PM            | Read & Think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 1/26         | 10 AM           | Do background research on topic                                       | Discuss ideal dataset(s) and ethics; draft project proposal                                                                    |
| 2/1          | 10 AM           | Edit, finalize, and submit proposal; Search for datasets              | Discuss wrangling and possible analytical approaches; Assign group members to lead each specific part                          |
| 2/14         | 6 PM            | Import & Wrangle Data (Ant Man); EDA (Hulk)                           | Review/Edit wrangling/EDA; Discuss Analysis Plan                                                                               |
| 2/23         | 12 PM           | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor)               | Discuss/edit Analysis; Complete project check-in                                                                               |
| 3/13         | 12 PM           | Complete analysis; Draft results/conclusion/discussion (Wasp)         | Discuss/edit full project                                                                                                      |
| 3/20         | Before 11:59 PM | NA                                                                    | Turn in Final Project & Group Project Surveys                                                                                  |
