## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before the deadline. This cell is for your reference only and is not needed in your report.

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 0

If students address the detailed feedback in a future checkpoint, they will earn these points back.






|                                  | **Unsatisfactory**                                                                                                                                                                                                                                                                                                                        | **Developing**                                                                                                                                                                                                       | **Proficient**                                                                                                                                                                                            | **Excellent**                                                                                                                                                                            |
|----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **EDA relevance**                | EDA is mostly neither relevant to the question nor helpful in figuring out how to address the question. Or the EDA does address the question, but many obviously relevant variables / analyses / figures were not included. | EDA is partly irrelevant/unhelpful. EDA missed one or two obviously relevant analyses (distributions of single variables or relationships between variables) | EDA includes the obviously relevant / helpful variables in addressing the question. | Thorough EDA fully explored the dataset |
| **EDA analysis and description** | Many of the analyses are poor choices (e.g., using means instead of medians for obviously skewed data), or are poorly described in the text, or do not aid understanding the data                                                                                                                                                     | Some of the analyses are poor choices, or are poorly described in the text, or do not aid understanding the data                                                                                                 | All analyses are correct choices. Only one or two have minor issues in the text descriptions supporting them. Mostly they fit well with other elements of the EDA and support understanding the data  | All analyses are correct choices with clear text descriptions supporting them. The figures fit well with the other elements of the EDA, producing a clear understanding of the data. |
| **EDA figures**                  | Many of the figures are poor plot choices (e.g., using a bar plot to represent a time series where it would be better to use a line plot) or have poor aesthetics (including colormap, data point shape/color, axis labels, titles, annotations, text legibility) or do not aid understanding the data                                | Some of the figures are poor plot choices or have poor aesthetics. Some figures do not aid understanding the data                                                                                                | All figures are correct plot choices. Only one or two have minor questionable aesthetic choices. The figures mostly fit well with the other elements of the EDA and support understanding the data    | All figures are correct plot choices with beautiful aesthetics. The figures fit well with the other elements of the EDA, producing a clear understanding of the data.                |





# COGS 108 - EDA Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

# Research Question

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback



## Background and Prior Work

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

# Hypothesis


Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Data

### Data overview

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your data checkpoint feedback


In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [3]:
# Setup code -- Run only once after cloning!!! 
#
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [

{ 
    'url': 'https://drive.google.com/uc?export=download&id=1n4Lw7TohcZKeXrXn_CgRzZW4-3l2WCkc',
    'filename': 'CollegeScorecardDataset.csv',
}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Overall Download Progress: 100%|██████████| 1/1 [00:12<00:00, 12.97s/it]

Successfully downloaded: CollegeScorecardDataset.csv





## Dataset: U.S. Department of Education – College Scorecard (Most Recent Cohort)  
Link to data: https://collegescorecard.ed.gov/data/  

### Dataset Description

We use the U.S. Department of Education College Scorecard – Most Recent Cohorts (Institution-Level) dataset, which contains standardized information on over 6,000 colleges and universities nationwide.    
For our analysis, we extract only the variables directly relevant to our research question: the proportion of students receiving Pell Grants (PCTPELL) and the institutional distribution of degree awards across STEM fields. Pell Grant proportion serves as a widely accepted socioeconomic indicator, as Pell eligibility is strongly tied to low-income status.

To measure how STEM-heavy each institution is, we use College Scorecard’s CIP-based program fields, which give the share of degrees awarded in specific STEM disciplines (stored as proportions between 0 and 1): PCIP11 (Computer Science), PCIP14 (Engineering), PCIP15 (Engineering Technologies), PCIP26 (Biological Sciences), PCIP27 (Mathematics/Statistics), PCIP40 (Physical Sciences), and PCIP41 (Science Technologies). We sum these fields into a single composite measure (STEM_PCT) indicating the share of total degrees granted in STEM. To keep institutions comparable, we restrict the dataset to public institutions that predominantly grant Bachelor’s degrees (CONTROL = 1 and PREDDEG = 3). Additional columns such as INSTNM, STABBR, CITY, UGDS, and MD_EARN_WNE_P10 are retained for context but are not central to the analysis. Together, these processed variables allow us to investigate whether campuses with higher proportions of lower-income students tend to grant a higher or lower share of STEM degrees.

## Import + Loading

In [4]:
import pandas as pd
import numpy as np

raw_path = "data/00-raw/CollegeScorecardDataset.csv"
df_raw = pd.read_csv(raw_path)

df_raw.head()



Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,COUNT_WNE_MALE1_P11,GT_THRESHOLD_P11,MD_EARN_WNE_INC1_P11,MD_EARN_WNE_INC2_P11,MD_EARN_WNE_INC3_P11,MD_EARN_WNE_INDEP0_P11,MD_EARN_WNE_INDEP1_P11,MD_EARN_WNE_MALE0_P11,MD_EARN_WNE_MALE1_P11,SCORECARD_SECTOR
0,100654,100200.0,1002.0,Alabama A & M University,Normal,AL,35762,Southern Association of Colleges and Schools C...,www.aamu.edu/,www.aamu.edu/admissions-aid/tuition-fees/net-p...,...,777.0,0.625,36650.0,41070.0,47016.0,38892.0,41738.0,38167.0,40250.0,4
1,100663,105200.0,1052.0,University of Alabama at Birmingham,Birmingham,AL,35294-0110,Southern Association of Colleges and Schools C...,https://www.uab.edu/,https://tcc.ruffalonl.com/University of Alabam...,...,1157.0,0.7588,47182.0,51896.0,54368.0,50488.0,51505.0,46559.0,59181.0,4
2,100690,2503400.0,25034.0,Amridge University,Montgomery,AL,36117-3553,Southern Association of Colleges and Schools C...,https://www.amridgeuniversity.edu/,https://www2.amridgeuniversity.edu:9091/,...,67.0,0.5986,35752.0,41007.0,,,38467.0,32654.0,49435.0,5
3,100706,105500.0,1055.0,University of Alabama in Huntsville,Huntsville,AL,35899,Southern Association of Colleges and Schools C...,www.uah.edu/,uah.clearcostcalculator.com/student/default/ne...,...,802.0,0.781,51208.0,62219.0,62577.0,55920.0,60221.0,47787.0,67454.0,4
4,100724,100500.0,1005.0,Alabama State University,Montgomery,AL,36104-0271,Southern Association of Colleges and Schools C...,www.alasu.edu/,tcc.ruffalonl.com/Alabama State University/Fre...,...,1049.0,0.5378,32844.0,36932.0,37966.0,34294.0,31797.0,32303.0,36964.0,4
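
The preview above elides most of the columns in the raw file. If loading the full file is slow or produces mixed-type warnings, one option is to read only the columns the analysis needs. The sketch below is a minimal illustration on a made-up in-memory CSV, not the real file; `toy_csv`, `needed_cols`, and `EXTRA_COL` are hypothetical names, and the string marker in the `PCTPELL` column is only an example of a non-numeric entry.

```python
import io
import pandas as pd

# Toy stand-in for the raw CSV (values are made up for illustration only)
toy_csv = io.StringIO(
    "UNITID,INSTNM,PCTPELL,EXTRA_COL\n"
    "1,College A,0.40,x\n"
    "2,College B,PrivacySuppressed,y\n"
)

# Hypothetical subset of columns; in the notebook this would mirror essential_cols
needed_cols = ["UNITID", "INSTNM", "PCTPELL"]

# usecols skips unneeded columns at parse time; low_memory=False lets pandas
# infer each column's dtype from the whole file instead of chunk by chunk
toy = pd.read_csv(toy_csv, usecols=needed_cols, low_memory=False)

# Non-numeric entries become NaN so the column can be analyzed as numbers
toy["PCTPELL"] = pd.to_numeric(toy["PCTPELL"], errors="coerce")
print(toy.dtypes)
```

Reading a subset this way would make the separate column-selection step unnecessary, at the cost of having to re-read the file whenever another variable is needed.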


## Select Only Relevant Columns

In [5]:
essential_cols = [
    'UNITID', 'INSTNM', 'STABBR', 'CITY',
    
    # Institution filters
    'CONTROL', 'PREDDEG',
    
    # Socioeconomic variable
    'PCTPELL',
    
    # STEM major fields
    'PCIP11', 'PCIP14', 'PCIP15', 'PCIP26',
    'PCIP27', 'PCIP40', 'PCIP41',
    
    # Context
    'UGDS', 'MD_EARN_WNE_P10',
]

df = df_raw[essential_cols].copy()
df.head()


Unnamed: 0,UNITID,INSTNM,STABBR,CITY,CONTROL,PREDDEG,PCTPELL,PCIP11,PCIP14,PCIP15,PCIP26,PCIP27,PCIP40,PCIP41,UGDS,MD_EARN_WNE_P10
0,100654,Alabama A & M University,AL,Normal,1,3,0.6441,0.0509,0.1115,0.0372,0.1487,0.0098,0.0137,0.0,5726.0,40628.0
1,100663,University of Alabama at Birmingham,AL,Birmingham,1,3,0.3318,0.0284,0.0581,0.0,0.1334,0.0046,0.0231,0.0,12118.0,54501.0
2,100690,Amridge University,AL,Montgomery,2,3,0.6842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,226.0,37621.0
3,100706,University of Alabama in Huntsville,AL,Huntsville,1,3,0.225,0.101,0.3086,0.0037,0.0484,0.0147,0.0514,0.0,6650.0,61767.0
4,100724,Alabama State University,AL,Montgomery,1,3,0.7203,0.0568,0.0168,0.0,0.1221,0.0084,0.0126,0.0,3322.0,34502.0


## Filter Institutions
We keep only:

* Bachelor’s degree–granting institutions (PREDDEG = 3)

* Public colleges/universities (CONTROL = 1)

In [6]:
df = df[
    (df['PREDDEG'] == 3) &
    (df['CONTROL'] == 1)
].copy()

len(df)


600

### Convert STEM Columns to Numeric | Set Up STEM Percentage Variable | Convert Pell Grant % to Numeric


In [8]:
stem_cols = ['PCIP11','PCIP14','PCIP15','PCIP26','PCIP27','PCIP40','PCIP41']

for col in stem_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df['STEM_PCT'] = df[stem_cols].sum(axis=1, min_count=1)
df['PCTPELL'] = pd.to_numeric(df['PCTPELL'], errors='coerce')
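
The `min_count=1` argument in the sum above matters for missing data. The sketch below, on made-up values with only two of the STEM columns, compares the default row sum with the guarded one; `toy`, `default_sum`, and `guarded_sum` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): one complete, one partly missing, one all missing
toy = pd.DataFrame({
    "PCIP11": [0.10, 0.05, np.nan],
    "PCIP14": [0.20, np.nan, np.nan],
})

# The default sum turns an all-NaN row into 0.0, which would read as "no STEM degrees"
default_sum = toy.sum(axis=1)

# min_count=1 requires at least one non-missing value; otherwise the result is NaN
guarded_sum = toy.sum(axis=1, min_count=1)

print(pd.DataFrame({"default": default_sum, "min_count_1": guarded_sum}))
```

Note that a partly missing row is still summed with its missing fields skipped, so `STEM_PCT` can understate the STEM share for institutions with incomplete program data.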


### Data Cleaning

In [10]:
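# Keep only institutions with both analysis variables present
df_analysis = df.dropna(subset=['PCTPELL', 'STEM_PCT']).copy()

# Summary statistics; the `count` row reports how many institutions remain
df_analysis[['PCTPELL', 'STEM_PCT']].describe()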


Unnamed: 0,PCTPELL,STEM_PCT
count,595.0,595.0
mean,0.347797,0.196676
std,0.143365,0.142166
min,0.0,0.0
25%,0.24505,0.105
50%,0.331,0.1692
75%,0.42275,0.2529
max,0.8571,0.9958
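
In the summary above, the STEM_PCT mean (0.197) sits above its median (0.169), and the maximum (0.996) is far from the 75th percentile (0.253), which hints at a right-skewed distribution. A quick way to check this before choosing between mean-based and median-based summaries is sketched below on made-up values; `values` and `summary` are illustrative names, and in the notebook the same calls could be applied to `df_analysis['STEM_PCT']`.

```python
import pandas as pd

# Made-up right-skewed values standing in for a column such as STEM_PCT
values = pd.Series([0.05, 0.10, 0.12, 0.15, 0.17, 0.20, 0.25, 0.95])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "skew": values.skew(),  # sample skewness; positive means a longer right tail
}
print(summary)
```

When the skew is clearly positive, the median and interquartile range are usually the safer descriptions of a typical institution.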


## Results

### Exploratory Data Analysis

Instructions: replace the words in this subsection with whatever words you need to set up and preview the EDA you're going to do.   

Please explicitly load the fully wrangled data you will use from `data/02-processed`.  This is a good idea rather than forcing people to re-run the data getting / wrangling cells above.  Sometimes it takes a long time to get / wrangle data compared to reloading the fixed up dataset.
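
One way to do this is to write the cleaned table to `data/02-processed` once and reload it at the start of the EDA. The sketch below assumes a CSV file and uses a made-up table; `toy_analysis`, `processed_path`, `df_eda`, and the filename `scorecard_public_bachelors.csv` are hypothetical, and in the notebook the table being saved would be `df_analysis`.

```python
from pathlib import Path
import pandas as pd

# Toy stand-in for the cleaned table (made-up values)
toy_analysis = pd.DataFrame({
    "UNITID": [1, 2],
    "PCTPELL": [0.40, 0.25],
    "STEM_PCT": [0.15, 0.30],
})

# Hypothetical output location and filename
processed_dir = Path("data/02-processed")
processed_dir.mkdir(parents=True, exist_ok=True)
processed_path = processed_dir / "scorecard_public_bachelors.csv"

# index=False avoids writing the row index as an extra unnamed column
toy_analysis.to_csv(processed_path, index=False)

# Later cells can reload the processed file without re-running the wrangling
df_eda = pd.read_csv(processed_path)
print(df_eda.shape)
```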

Carry out whatever EDA you need to for your project in the code cells below.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

Please note that you should consider the use of python modules in your work.  Any code which gets called repeatedly should be modularized. So if you run the same pre-processing, analysis or visualization on different subsets of the data, then you should turn that into a function or class.  Put that function or class in a .py file that lives in `modules/`.  Import the module you made and use it to get your work done.  For reference see `get_raw()` which is inside `modules/get_data.py`. 
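
As an illustration, the STEM composite computed earlier in this notebook could be wrapped in a function. The sketch below is one possible shape, not an existing module: the file name `modules/stem_features.py`, the function `add_stem_pct`, and the constant `STEM_COLS` are all hypothetical, and the toy values are made up.

```python
import pandas as pd

# Hypothetical contents of a file such as modules/stem_features.py
STEM_COLS = ("PCIP11", "PCIP14", "PCIP15", "PCIP26", "PCIP27", "PCIP40", "PCIP41")

def add_stem_pct(df, stem_cols=STEM_COLS):
    """Return a copy of df with numeric STEM columns and a STEM_PCT row sum."""
    cols = list(stem_cols)
    out = df.copy()
    for col in cols:
        # Coerce non-numeric entries to NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # min_count=1 keeps rows with no STEM data as NaN rather than 0.0
    out["STEM_PCT"] = out[cols].sum(axis=1, min_count=1)
    return out

# Usage on a toy frame that has only two of the STEM columns
toy = pd.DataFrame({"PCIP11": ["0.10", "0.05"], "PCIP14": [0.20, None]})
result = add_stem_pct(toy, stem_cols=["PCIP11", "PCIP14"])
print(result)
```

Returning a copy leaves the input untouched, so the same function can be applied to different subsets without side effects.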



#### Section 1 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

#### Section 2 of EDA if you need it  - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also, if you lost points on the previous checkpoint, fix them.