# From Ninth Grade to STEM Pathways: Socioeconomic Disparities and Psychological Factors in the High School Longitudinal Study of 2009

# Introduction 

This analysis uses the High School Longitudinal Study of 2009 to examine critical questions about educational equity and STEM participation in U.S. secondary and postsecondary education. Drawing on data from more than 23,000 students across multiple waves (2009, 2012-2014, 2016-2018), we explore how socioeconomic background influences students’ ability to pursue a STEM pathway and examine how psychological factors, specifically mathematical identity and self-efficacy, vary across demographic groups. By highlighting how structural inequalities shape early educational trajectories and evolve over time, this analysis offers actionable insights for educational and social policy aimed at better understanding and supporting young adults interested in STEM fields.

# Group Members

The group members for this project are:

- Qiran Hu 
- Hahyeong Kim
- Weiting Yang
- Kevin Xia

# Dataset Overview

## Official Dataset Name

The dataset used in this analysis is the **"High School Longitudinal Study of 2009"**. The data is stored as `hsls_17_student_pets_sr_v1_0.csv`, which is the public version of the Student Parent Environment Teacher School dataset series. In the research literature and documentation, this dataset is referred to as **HSLS:09**.

HSLS:09 is one of the major longitudinal investigations undertaken by the National Center for Education Statistics as part of its Secondary Longitudinal Studies Program. It continues a tradition of federal educational data collection that dates back to 1972. Building upon foundations established by the National Longitudinal Study of the High School Class of 1972, the High School and Beyond Longitudinal Study of 1980, the National Education Longitudinal Study of 1988, and the Education Longitudinal Study of 2002, HSLS:09 introduces several novel methodologies compared to the previous studies.

The study began in the 2009 academic year with a nationally representative sample of 23,503 ninth grade students drawn from 944 carefully selected public and private schools across the United States and the District of Columbia. It starts in ninth grade because that year is widely recognized as a pivotal transition in students’ academic trajectories: students make decisions about mathematics and science course sequences that ultimately limit or expand their future STEM opportunities.

# The Origin of the Dataset

## List of Organizations Contributed to the Dataset

The National Center for Education Statistics and the Institute of Education Sciences in the U.S. Department of Education are responsible for collecting and analyzing educational data to inform education policy. Established in 1867, NCES is known as the nation's oldest federal statistical agency. The center's mission includes providing timely and accurate information about the condition and progress of American education, from early childhood through adult education, so that policymakers at the federal, state, and local levels can make informed decisions and allocate resources more effectively.

NCES operates under strict protocols to protect respondent confidentiality and maintain the accuracy of its findings. Through robust peer review processes, all data collected by NCES are carefully reviewed to safeguard participants' personal information and ensure compliance with federal statistical standards.

Researchers can access organizational resources through several official channels. The NCES main website at https://nces.ed.gov/ provides comprehensive information about all NCES programs and initiatives. The `HSLS:09` project homepage at https://nces.ed.gov/surveys/hsls09/ hosts study-specific documentation, publications, and updates. Data products are available directly at https://nces.ed.gov/datalab/onlinecodebook/session/codebook/56bd6f3d-80d2-45ef-8b34-c4109651c076, where researchers can download data files, codebooks, and documentation for a specific study.

## The Background Information about the Organization

RTI International, a nonprofit research institute based in North Carolina, conducted the HSLS:09 study on behalf of NCES. RTI draws on experienced professionals to conduct large-scale national surveys, including previous NCES longitudinal studies. The organization has successfully navigated the complex challenges inherent in tracking thousands of geographically dispersed students over multiple years. Its capabilities in sample design, questionnaire selection, testing, assessment construction, and validation support high-quality data.

## Academic Literature, Government Collection, Organizations, and Surveys

Several major academic and government publications document the methodology of HSLS:09. The core technical documentation consists of three primary reports that describe the study’s design and implementation across multiple data collection waves. The base year report outlines the initial data collection procedures, sampling strategy, and construction of baseline variables. It is available at https://nces.ed.gov/pubs2012/2012001.pdf and gives researchers context for understanding how the study was planned and executed.

The first follow-up report covers the spring 2012 data collection, when most participants were in their junior year of high school. This report is available at https://nces.ed.gov/pubs2014/2014361.pdf and describes revisions to the survey instruments, tracking procedures, and the creation of longitudinal variables that span the first three years of the study.

The most comprehensive documentation covers the period from the base year to the second follow-up. It is available at https://nces.ed.gov/pubs2018/2018140.pdf and describes the study design in detail, making it the primary reference for researchers working with data from the second follow-up wave. The documentation describes sampling procedures and the construction of weights, presents the full survey instruments, annotates variable coding schemes and missing data, and discusses nonresponse analyses with weighting adjustments. It also provides variance estimation guidance for complex survey analysis and distinguishes the public-use version from the restricted-use version.



 - **Citations for the literature will be added**
  - *Since the initial release of `HSLS:09`, the scholarly literature using these data has grown substantially in education as well as in adjacent fields such as sociology, economics, psychology, and public policy. Researchers have explored how gender gaps in mathematics self-efficacy persist even when boys and girls perform similarly on standardized measures. Others have analyzed how students make college enrollment decisions, how high school coursework influences the likelihood of choosing a STEM major in college, and how academic undermatch occurs when high-achieving students from disadvantaged backgrounds attend less selective institutions. Although college enrollment has increased across all socioeconomic groups, research on socioeconomic inequality has shown that gaps in college completion remain and have even grown larger in certain regions. Even when students from lower socioeconomic backgrounds perform well in high school mathematics and science relative to their peers, they remain considerably less likely to enter STEM pathways at the postsecondary level. These patterns point to hidden challenges facing these students, such as unequal access to information about educational options, different levels of encouragement and support from counselors and teachers, and financial barriers that constrain college choices.*

# Data Acquisition

## Official URLs to the Dataset 

The public version of the `HSLS:09` dataset can be obtained through multiple official websites. The NCES DataLab Online Codebook, a web application located at https://nces.ed.gov/datalab/onlinecodebook, hosts the zip file of the `HSLS:09` dataset. This platform allows researchers to browse variables, create custom visualizations, and download data in multiple formats including SAS, SPSS, Stata, R, and CSV files that are suitable for Python and other data analysis tools.

For researchers who prefer working with complete datasets, the NCES Data Products page at https://nces.ed.gov/surveys/hsls09/hsls09_data.asp provides direct download access to full data files with detailed documentation. The files are compressed archives that include both the data and syntax files for major statistical software packages, which is useful for researchers who want offline access to the full documentation or plan to work with the complete set of variables.

The HSLS:09 project homepage at https://nces.ed.gov/surveys/hsls09/ also serves as the primary gateway to all related resources. It provides links to data products, documentation, publications, and updates about future releases, making it the best source of up-to-date information about the progress of the study and related research findings.

# Data Loading

## Access Through Relative Path

For this project, we use a relative path to load `hsls_17_student_pets_sr_v1_0.csv` into our codebase because it requires no modification to accommodate different directory structures.

The dataset is read directly from the current directory by using the `read_csv()` function with the relative path `./hsls_17_student_pets_sr_v1_0.csv`. This simple path specification assumes the data file resides in the same directory as the analysis scripts, a structure that promotes reproducibility and simplifies project organization.
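
The loading step described above could look roughly like the following. This is a minimal sketch rather than the project's actual loader: the helper name `load_hsls` and its `data_dir` and `columns` parameters are hypothetical, and only the file name comes from the text.

```python
from pathlib import Path

import pandas as pd

# File name taken from the text above; everything else here is illustrative.
DATA_FILE = "hsls_17_student_pets_sr_v1_0.csv"


def load_hsls(data_dir=".", columns=None):
    """Read the HSLS:09 CSV from data_dir, optionally keeping only some columns."""
    path = Path(data_dir) / DATA_FILE
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; download it from NCES and place it next to the scripts."
        )
    # usecols limits memory use when only a handful of variables is needed.
    return pd.read_csv(path, usecols=columns)
```

A call such as `load_hsls(columns=["STU_ID", "X1SES_U"])` would then read only those two variables from the current directory. Failing early with a clear message is a design choice that helps anyone replicating the analysis notice a misplaced file.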

After loading the raw data, we immediately address NCES's structured missing value coding system. Unlike many datasets that represent missing values with empty cells, NCES uses specific negative integer codes to distinguish different types of missingness. The codes -4, -5, -8, and -9 each convey distinct information about why a particular value is absent. We convert these NCES-specific codes to NumPy's standard NaN representation using the `replace` method with a dictionary mapping all missing codes to `np.nan`. This conversion enables the use of pandas' native missing data handling while preserving the ability to investigate missingness patterns when relevant to the research questions.
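
The conversion described above can be sketched as follows. The data frame is a small made-up stand-in for the real file, and the set of codes is the one named in the paragraph; the exact codes that apply to each variable should be confirmed in the HSLS:09 codebook.

```python
import numpy as np
import pandas as pd

# Codes as named in the text above; verify them per variable in the codebook.
MISSING_CODES = {-4: np.nan, -5: np.nan, -8: np.nan, -9: np.nan}

# Tiny made-up frame standing in for the real data.
toy = pd.DataFrame({"X1TXMQUINT": [5, -8, 3], "X1MTHEFF": [0.12, -9.0, -0.7]})

# Record which cells carried a reserved code before overwriting them, so the
# missingness pattern can still be examined later.
was_coded_missing = toy.isin(list(MISSING_CODES))

# A flat dict passed to replace maps each key to its value across all columns.
clean = toy.replace(MISSING_CODES)
```

Keeping the boolean mask `was_coded_missing` (a hypothetical name) is one way to retain information about where values were absent after the codes themselves have been replaced.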

The data file follows a simple organizational structure designed for clarity and ease of access. The project root directory contains the main data file `hsls_17_student_pets_sr_v1_0.csv` alongside the analysis scripts `stem_pipeline_analysis.py` and `math_identity_analysis.py`, as well as supporting documentation in README.md files. This flat structure avoids nested directories that could complicate path specifications while keeping all essential project components readily accessible.

This straightforward relative path approach ensures the code runs consistently across different computing environments without requiring path modifications. Researchers replicating the analysis need only download the data file from NCES and place it in the project root directory alongside the analysis scripts. The scripts will then automatically locate and load the data using the relative path specification.

## Data Download Instructions for Replication

Researchers seeking to replicate this analysis should follow a systematic procedure to obtain the HSLS:09 dataset. The process begins by navigating to the NCES DataLab Online Codebook at https://nces.ed.gov/datalab/onlinecodebook, where the intuitive interface guides users through data selection and download. After reaching the platform, researchers should select "HSLS:09" from the menu of available studies, which displays all data releases for the High School Longitudinal Study.

Within the HSLS:09 interface, researchers can choose to download the complete data file rather than creating a custom extract. This option provides access to the full public-use dataset, including all 9,614 variables across all available waves. The platform offers downloads in multiple formats, with the CSV option being the most appropriate for use with Python pandas. After the researcher selects the preferred format and accepts the NCES data usage agreement, the download begins and transfers the approximately 400-450 MB compressed file to the researcher's computer.

Once downloaded, the compressed archive should be extracted to reveal the data file `hsls_17_student_pets_sr_v1_0.csv`. This file should then be placed in the same directory as the analysis scripts, matching the relative path structure described earlier. With the data properly positioned, the Python scripts will automatically load it without requiring any code modifications.
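
For readers who prefer to script the extraction step, a minimal sketch using the standard library is shown here. The function name `extract_csv` and the archive path passed to it are hypothetical, since the name of the downloaded archive is not given above; only the CSV file name comes from the text.

```python
import shutil
import zipfile
from pathlib import Path

CSV_NAME = "hsls_17_student_pets_sr_v1_0.csv"


def extract_csv(archive_path, dest_dir="."):
    """Copy the student CSV out of a downloaded zip archive into dest_dir."""
    target = Path(dest_dir) / CSV_NAME
    with zipfile.ZipFile(archive_path) as zf:
        # The member may sit inside a subfolder, so match on the file name only.
        matches = [m for m in zf.namelist() if Path(m).name == CSV_NAME]
        if not matches:
            raise FileNotFoundError(f"{CSV_NAME} not found in {archive_path}")
        # Stream the member to disk instead of loading it into memory at once.
        with zf.open(matches[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Calling `extract_csv("path/to/downloaded_archive.zip")` from the project root would leave the CSV next to the analysis scripts, matching the relative path layout described earlier.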

While automated download within Python scripts is technically possible, manual download is recommended for several reasons. Manual acquisition provides transparency about data provenance, ensures researchers review and accept the data usage agreement, facilitates proper version control through Git LFS, and gives researchers the opportunity to examine accompanying documentation before beginning analysis.

# Comprehensive Data Analysis



In [1]:
import pandas as pd
import numpy as np
import altair as alt

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [2]:
df_raw = pd.read_csv("hsls_subset.csv")

In [3]:
stem_vars = ["STU_ID", "X4INCOMECAT", "X4RFDGMJSTEM", "X4OCCFBSTEM1", "S4JOBSAT2",
             "X5STEM3ERN", "X1TXMQUINT", "X1MTHINT", "X1SCIINT", "X1MTHUTI",
             "X1SCIUTI", "X1MTHEFF", "X1SCIEFF", "X1MTHID", "X1SCIID",
             "S1FAVSUBJ", "X2TXMQUINT", "S2MUSELIFE", "S2MUSECLG", "X1DADEDU",
             "X1MOMEDU", "S2EDUASP", "X3TAGPA12", "S4BIRTHSEX", "X1RACE", "X1SES_U"]

In [4]:
df = df_raw[stem_vars].copy() # Create a copy to avoid SettingWithCopyWarning
df

Unnamed: 0,STU_ID,X4INCOMECAT,X4RFDGMJSTEM,X4OCCFBSTEM1,S4JOBSAT2,X5STEM3ERN,X1TXMQUINT,X1MTHINT,X1SCIINT,X1MTHUTI,...,X2TXMQUINT,S2MUSELIFE,S2MUSECLG,X1DADEDU,X1MOMEDU,S2EDUASP,X3TAGPA12,S4BIRTHSEX,X1RACE,X1SES_U
0,10001,4,1,-7,-7,43,5,0.12,-0.23,1.31,...,5,1,1,5,5,6,3.5,-5,8,1.6907
1,10002,12,0,0,-9,0,2,-9.00,-9.00,1.31,...,4,2,2,2,3,5,4.0,-5,8,-0.3923
2,10003,7,1,0,-7,13,5,0.86,0.93,0.27,...,4,3,3,0,7,7,2.5,-5,3,1.1271
3,10004,8,-7,0,-4,-6,3,0.19,-0.67,-0.70,...,-8,-8,-8,0,0,-8,4.0,-5,8,0.4283
4,10005,1,-7,-7,-7,-6,5,-0.96,-1.50,-1.10,...,3,-9,1,0,4,5,3.0,-5,8,0.2147
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23498,35202,-8,-8,-8,-8,-8,4,-0.36,0.93,-0.13,...,4,2,1,-8,-8,6,-1.0,-5,5,0.0356
23499,35203,6,-7,0,1,-6,2,1.41,0.22,1.31,...,3,1,1,0,2,6,1.0,-5,5,-1.3350
23500,35204,2,0,-7,-3,0,3,-0.96,-1.38,-0.30,...,3,1,1,2,2,7,2.0,-5,8,-0.0031
23501,35205,13,0,0,-7,0,5,-0.96,-0.23,-0.93,...,3,4,4,0,5,7,2.5,-5,8,0.7236


In [5]:
missing_codes = [-8, -9, -4, -7, -3]

for col in df.columns:
    df[col] = df[col].replace(missing_codes, np.nan)

df

Unnamed: 0,STU_ID,X4INCOMECAT,X4RFDGMJSTEM,X4OCCFBSTEM1,S4JOBSAT2,X5STEM3ERN,X1TXMQUINT,X1MTHINT,X1SCIINT,X1MTHUTI,...,X2TXMQUINT,S2MUSELIFE,S2MUSECLG,X1DADEDU,X1MOMEDU,S2EDUASP,X3TAGPA12,S4BIRTHSEX,X1RACE,X1SES_U
0,10001,4.0,1.0,,,43.0,5.0,0.12,-0.23,1.31,...,5.0,1.0,1.0,5.0,5.0,6.0,3.5,-5,8.0,1.6907
1,10002,12.0,0.0,0.0,,0.0,2.0,,,1.31,...,4.0,2.0,2.0,2.0,3.0,5.0,4.0,-5,8.0,-0.3923
2,10003,7.0,1.0,0.0,,13.0,5.0,0.86,0.93,0.27,...,4.0,3.0,3.0,0.0,7.0,7.0,2.5,-5,3.0,1.1271
3,10004,8.0,,0.0,,-6.0,3.0,0.19,-0.67,-0.70,...,,,,0.0,0.0,,4.0,-5,8.0,0.4283
4,10005,1.0,,,,-6.0,5.0,-0.96,-1.50,-1.10,...,3.0,,1.0,0.0,4.0,5.0,3.0,-5,8.0,0.2147
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23498,35202,,,,,,4.0,-0.36,0.93,-0.13,...,4.0,2.0,1.0,,,6.0,-1.0,-5,5.0,0.0356
23499,35203,6.0,,0.0,1.0,-6.0,2.0,1.41,0.22,1.31,...,3.0,1.0,1.0,0.0,2.0,6.0,1.0,-5,5.0,-1.3350
23500,35204,2.0,0.0,,,0.0,3.0,-0.96,-1.38,-0.30,...,3.0,1.0,1.0,2.0,2.0,7.0,2.0,-5,8.0,-0.0031
23501,35205,13.0,0.0,0.0,,0.0,5.0,-0.96,-0.23,-0.93,...,3.0,4.0,4.0,0.0,5.0,7.0,2.5,-5,8.0,0.7236
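
The output above still shows negative whole-number values in some columns (for example `-6.0` in `X5STEM3ERN` and `-5` in `S4BIRTHSEX`), because those values are not in `missing_codes`. A quick audit along the following lines can show which columns still hold such values. It is a sketch on a made-up frame, and it assumes that reserved codes are the whole numbers from -9 to -1, which should be checked against the codebook for each variable.

```python
import pandas as pd

# Assumption: reserved codes are the whole numbers -9 through -1; confirm the
# exact set for each variable in the HSLS:09 codebook.
RESERVED = list(range(-9, 0))

toy = pd.DataFrame({
    "X5STEM3ERN": [43.0, 0.0, -6.0, -6.0],
    "S4BIRTHSEX": [-5, -5, -5, -5],
    "X1MTHINT": [0.12, -0.96, 0.86, 0.19],  # legitimate negative scale scores
})

# Count, per column, how many cells still hold a reserved code.
leftover = toy.isin(RESERVED).sum()
leftover = leftover[leftover > 0].sort_values(ascending=False)
```

Matching on exact whole numbers rather than on "any negative value" keeps standardized scale scores such as `X1MTHINT` out of the audit, although a scale score that happens to equal a whole number would still be flagged.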


In [6]:
INCOME_MAP = {1: "No income", 2: "$1-5k", 3: "$5-10k", 4: "$10-15k", 5: "$15-20k", 6: "$20-25k", 
              7: "$25-30k", 8: "$30-35k",9: "$35-40k", 10: "$40-50k", 11: "$50-60k", 12: "$60-75k", 
              13: "$75-100k", 14: "$100k+"}

EDUCATION_MAP = {1: "Less than HS", 2: "HS Diploma/GED", 3: "Some college", 4: "Associate's", 
                 5: "Bachelor's", 6: "Master's", 7: "Ph.D./M.D./Law"}

JOB_SAT_MAP = {1: "Very satisfied", 2: "Somewhat satisfied", 3: "Somewhat dissatisfied", 4: "Very dissatisfied"}

In [7]:
df["income_label"] = df["X4INCOMECAT"].map(INCOME_MAP)
df["income_numeric"] = df["X4INCOMECAT"]

df["stem_credits"] = df["X5STEM3ERN"].map({0: "No STEM Credits", 1: "STEM Credits Earned"})
df["stem_degree"] = df["X4RFDGMJSTEM"].map({0: "Non-STEM", 1: "STEM"})

df["stem_job"] = df["X4OCCFBSTEM1"].apply(lambda x: "STEM" if x > 0 else ("Non-STEM" if x == 0 else np.nan))
df["job_satisfaction"] = df["S4JOBSAT2"].map(JOB_SAT_MAP)

df["parent_max_edu"] = df[["X1DADEDU", "X1MOMEDU"]].max(axis = 1)
df["parent_edu_label"] = df["parent_max_edu"].map(EDUCATION_MAP)
df["parent_edu_group"] = df["parent_max_edu"].apply(
    lambda x: "HS or Less" if x <= 2 else ("Some College/Associate" if x <= 4 else "Bachelor+"))

df["math_2009_group"] = df["X1TXMQUINT"].apply(
    lambda x: "Low (Q1-Q2)" if x <= 2 else ("Medium (Q3)" if x == 3 else "High (Q4-Q5)"))

df

Unnamed: 0,STU_ID,X4INCOMECAT,X4RFDGMJSTEM,X4OCCFBSTEM1,S4JOBSAT2,X5STEM3ERN,X1TXMQUINT,X1MTHINT,X1SCIINT,X1MTHUTI,...,income_label,income_numeric,stem_credits,stem_degree,stem_job,job_satisfaction,parent_max_edu,parent_edu_label,parent_edu_group,math_2009_group
0,10001,4.0,1.0,,,43.0,5.0,0.12,-0.23,1.31,...,$10-15k,4.0,,STEM,,,5.0,Bachelor's,Bachelor+,High (Q4-Q5)
1,10002,12.0,0.0,0.0,,0.0,2.0,,,1.31,...,$60-75k,12.0,No STEM Credits,Non-STEM,Non-STEM,,3.0,Some college,Some College/Associate,Low (Q1-Q2)
2,10003,7.0,1.0,0.0,,13.0,5.0,0.86,0.93,0.27,...,$25-30k,7.0,,STEM,Non-STEM,,7.0,Ph.D./M.D./Law,Bachelor+,High (Q4-Q5)
3,10004,8.0,,0.0,,-6.0,3.0,0.19,-0.67,-0.70,...,$30-35k,8.0,,,Non-STEM,,0.0,,HS or Less,Medium (Q3)
4,10005,1.0,,,,-6.0,5.0,-0.96,-1.50,-1.10,...,No income,1.0,,,,,4.0,Associate's,Some College/Associate,High (Q4-Q5)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23498,35202,,,,,,4.0,-0.36,0.93,-0.13,...,,,,,,,,,Bachelor+,High (Q4-Q5)
23499,35203,6.0,,0.0,1.0,-6.0,2.0,1.41,0.22,1.31,...,$20-25k,6.0,,,Non-STEM,Very satisfied,2.0,HS Diploma/GED,HS or Less,Low (Q1-Q2)
23500,35204,2.0,0.0,,,0.0,3.0,-0.96,-1.38,-0.30,...,$1-5k,2.0,No STEM Credits,Non-STEM,,,2.0,HS Diploma/GED,HS or Less,Medium (Q3)
23501,35205,13.0,0.0,0.0,,0.0,5.0,-0.96,-0.23,-0.93,...,$75-100k,13.0,No STEM Credits,Non-STEM,Non-STEM,,5.0,Bachelor's,Bachelor+,High (Q4-Q5)
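
One detail in the cell above is worth noting: comparisons such as `x <= 2` evaluate to `False` for NaN, so a row with a missing value falls through to the final `else` branch. Row 23498 in the output, for instance, has no parent education value but is labelled "Bachelor+". A NaN-preserving alternative with the same cut points can be sketched with `pd.cut`; the frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "parent_max_edu": [0, 2, 3, 4, 5, 7, np.nan],
    "X1TXMQUINT": [1, 2, 3, 4, 5, np.nan, 3],
})

# Right-closed bins reproduce the <= 2 / <= 4 / else thresholds; NaN stays NaN.
toy["parent_edu_group"] = pd.cut(
    toy["parent_max_edu"],
    bins=[-np.inf, 2, 4, np.inf],
    labels=["HS or Less", "Some College/Associate", "Bachelor+"],
)

# Same idea for the math quintile groups: Q1-Q2, Q3, Q4-Q5.
toy["math_2009_group"] = pd.cut(
    toy["X1TXMQUINT"],
    bins=[-np.inf, 2, 3, np.inf],
    labels=["Low (Q1-Q2)", "Medium (Q3)", "High (Q4-Q5)"],
)
```

Because the grouped columns stay missing where the inputs are missing, later filters such as `df["parent_edu_group"].notna()` would then exclude those rows instead of counting them in the top category.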


In [8]:
stem_interest_cols = ["X1MTHINT", "X1SCIINT", "X1MTHUTI", "X1SCIUTI","X1MTHEFF", "X1SCIEFF", "X1MTHID", "X1SCIID"]

num_cols = len(stem_interest_cols)

df["stem_interest_2009"] = df[stem_interest_cols].sum(axis = 1) / num_cols
valid_scores = df["stem_interest_2009"].dropna()

low_cut = valid_scores.quantile(1/3)
high_cut = valid_scores.quantile(2/3)

df["stem_interest_group"] = df["stem_interest_2009"].apply(
    lambda x: "Low Interest" if x < low_cut else ("Medium Interest" if x < high_cut else "High Interest"))

df["edu_aspiration_group"] = df["S2EDUASP"].apply(
    lambda x: "HS or Less" if x <= 2 else ("Some College" if x <= 4 
                                           else ("Bachelor's" if x == 6 else "Advanced Degree")))

df["gender"] = df["S4BIRTHSEX"].map({1: "Male", 2: "Female"})

df

Unnamed: 0,STU_ID,X4INCOMECAT,X4RFDGMJSTEM,X4OCCFBSTEM1,S4JOBSAT2,X5STEM3ERN,X1TXMQUINT,X1MTHINT,X1SCIINT,X1MTHUTI,...,stem_job,job_satisfaction,parent_max_edu,parent_edu_label,parent_edu_group,math_2009_group,stem_interest_2009,stem_interest_group,edu_aspiration_group,gender
0,10001,4.0,1.0,,,43.0,5.0,0.12,-0.23,1.31,...,,,5.0,Bachelor's,Bachelor+,High (Q4-Q5),0.69875,High Interest,Bachelor's,
1,10002,12.0,0.0,0.0,,0.0,2.0,,,1.31,...,Non-STEM,,3.0,Some college,Some College/Associate,Low (Q1-Q2),0.57250,High Interest,Advanced Degree,
2,10003,7.0,1.0,0.0,,13.0,5.0,0.86,0.93,0.27,...,Non-STEM,,7.0,Ph.D./M.D./Law,Bachelor+,High (Q4-Q5),0.23625,High Interest,Advanced Degree,
3,10004,8.0,,0.0,,-6.0,3.0,0.19,-0.67,-0.70,...,Non-STEM,,0.0,,HS or Less,Medium (Q3),-0.05750,Medium Interest,Advanced Degree,
4,10005,1.0,,,,-6.0,5.0,-0.96,-1.50,-1.10,...,,,4.0,Associate's,Some College/Associate,High (Q4-Q5),-0.60750,Low Interest,Advanced Degree,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23498,35202,,,,,,4.0,-0.36,0.93,-0.13,...,,,,,Bachelor+,High (Q4-Q5),0.62000,High Interest,Bachelor's,
23499,35203,6.0,,0.0,1.0,-6.0,2.0,1.41,0.22,1.31,...,Non-STEM,Very satisfied,2.0,HS Diploma/GED,HS or Less,Low (Q1-Q2),0.33250,High Interest,Bachelor's,
23500,35204,2.0,0.0,,,0.0,3.0,-0.96,-1.38,-0.30,...,,,2.0,HS Diploma/GED,HS or Less,Medium (Q3),-0.48625,Low Interest,Advanced Degree,
23501,35205,13.0,0.0,0.0,,0.0,5.0,-0.96,-0.23,-0.93,...,Non-STEM,,5.0,Bachelor's,Bachelor+,High (Q4-Q5),-0.57500,Low Interest,Advanced Degree,
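
Two properties of the cell above affect the composite score. `sum(axis = 1)` skips NaN by default, so dividing by `num_cols` treats every unanswered item as a zero, and a row with all items missing receives a score of 0 rather than NaN. In the aspiration grouping, NaN and any value between 4 and 6 that is not exactly 6 fall through to "Advanced Degree". The sketch below shows one NaN-aware alternative on made-up data with a shortened item list. It assumes that code 5 of `S2EDUASP` belongs with the bachelor's band, which should be checked against the codebook, and `pd.qcut` places values that sit exactly on a cut point slightly differently than the `<` comparisons do.

```python
import numpy as np
import pandas as pd

items = ["X1MTHINT", "X1SCIINT", "X1MTHEFF"]  # shortened list for illustration
toy = pd.DataFrame({
    "X1MTHINT": [0.9, np.nan, -0.6, np.nan, 0.3, -1.2],
    "X1SCIINT": [0.5, 0.8, -0.4, np.nan, 0.1, -0.9],
    "X1MTHEFF": [0.7, np.nan, -0.2, np.nan, 0.2, -0.6],
    "S2EDUASP": [2, 4, 5, 6, 7, np.nan],
})

# mean(axis=1) averages over the items a student actually answered and
# returns NaN when every item is missing.
toy["stem_interest_2009"] = toy[items].mean(axis=1)

# qcut splits the non-missing scores into terciles and leaves NaN as NaN.
toy["stem_interest_group"] = pd.qcut(
    toy["stem_interest_2009"],
    q=3,
    labels=["Low Interest", "Medium Interest", "High Interest"],
)

# Assumption: codes 5 and 6 both count as bachelor's-level aspirations.
toy["edu_aspiration_group"] = pd.cut(
    toy["S2EDUASP"],
    bins=[-np.inf, 2, 4, 6, np.inf],
    labels=["HS or Less", "Some College", "Bachelor's", "Advanced Degree"],
)
```

Whether to average over answered items or to require a minimum number of answered items is an analytic decision; the sketch only illustrates the mechanics.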


In [9]:
# Altair selections don't persist across cell executions in Jupyter notebooks
# So I combine the graphs in one cell

In [10]:
stage1 = df[df["stem_interest_group"].notna()].copy()
stage1["stage"] = "1. STEM Interest (2009)"
stage1["stem_track"] = stage1["stem_interest_group"].apply(
    lambda x: "High" if x == "High Interest" else "Low/Med"
)

stage2 = df[df["X2TXMQUINT"].notna()].copy()
stage2["stage"] = "2. Math Score (2012)"
stage2["stem_track"] = stage2["X2TXMQUINT"].apply(lambda x: "High" if x >= 4 else "Low/Med")

stage3 = df[df["stem_degree"].notna()].copy()
stage3["stage"] = "3. STEM Degree"
stage3["stem_track"] = stage3["stem_degree"].replace({"STEM": "High", "Non-STEM": "Low/Med"})

stage4 = df[df["stem_job"].notna()].copy()
stage4["stage"] = "4. STEM Career (2016)"
stage4["stem_track"] = stage4["stem_job"].replace({"STEM": "High", "Non-STEM": "Low/Med"})

pathway = pd.concat([stage1, stage2, stage3, stage4], ignore_index = True)
pathway_summary = pathway.groupby(["stage", "stem_track"]).size().reset_index(name = "count")
stage_totals = pathway.groupby("stage").size().reset_index(name = "total")
pathway_summary = pathway_summary.merge(stage_totals, on = "stage")
pathway_summary["proportion"] = pathway_summary["count"] / pathway_summary["total"]

click_track = alt.selection_point(fields = ["stem_track"], empty = True)

bar_chart = alt.Chart(pathway_summary).mark_bar().encode(
    x = alt.X("stage:N",
           title = "Progression Stage",
           axis = alt.Axis(labelAngle = -25, labelFontSize = 12),
           sort = ["1. STEM Interest (2009)", "2. Math Score (2012)",
                 "3. STEM Degree", "4. STEM Career (2016)"]),
    y = alt.Y("count:Q",
           title = "Number of Students",
           axis = alt.Axis(labelFontSize = 13)),
    color = alt.Color("stem_track:N",
                   title = "STEM Track",
                   scale = alt.Scale(domain = ["High", "Low/Med"],
                                 range = ["#e74c3c", "#95a5a6"]),
                   legend = alt.Legend(titleFontSize = 13, labelFontSize = 12)),
    opacity = alt.condition(click_track, alt.value(1.0), alt.value(0.3)),
    tooltip = [
        alt.Tooltip("stage:N", title = "Stage"),
        alt.Tooltip("stem_track:N", title = "Track"),
        alt.Tooltip("count:Q", title = "Students"),
        alt.Tooltip("proportion:Q", title = "Proportion", format = ".1%")
    ]
).add_params(click_track).properties(
    width = 550,
    height = 450,
    title = {
        "text": ["STEM Pathway Progression: Interest to Career"],
        "fontSize": 16,
        "fontWeight": "bold"
    }
)

line_plot = alt.Chart(pathway_summary).mark_line(
    point = alt.OverlayMarkDef(filled = True, size = 100),
    strokeWidth = 4
).encode(
    x = alt.X("stage:N",
           title = "Progression Stage",
           axis = alt.Axis(labelAngle = -25, labelFontSize = 12),
           sort = ["1. STEM Interest (2009)", "2. Math Score (2012)",
                 "3. STEM Degree", "4. STEM Career (2016)"]),
    y = alt.Y("proportion:Q",
           title = "Proportion of Students",
           axis = alt.Axis(format = ".0%", labelFontSize = 13),
           scale = alt.Scale(domain = [0, 1])),
    color = alt.Color("stem_track:N",
                   scale = alt.Scale(domain = ["High", "Low/Med"],
                                 range = ["#e74c3c", "#95a5a6"]),
                   legend = None),
    opacity = alt.condition(click_track, alt.value(1.0), alt.value(0.1)),
    tooltip = [
        alt.Tooltip("stage:N", title = "Stage"),
        alt.Tooltip("stem_track:N", title = "Track"),
        alt.Tooltip("proportion:Q", title = "Proportion", format = ".1%"),
        alt.Tooltip("count:Q", title = "Students")
    ]
).properties(
    width = 550,
    height = 450,
    title = {
        "text": "STEM Track Proportions Over Time",
        "fontSize": 16,
        "fontWeight": "bold"
    }
)

viz1 = (bar_chart.properties(width = 250) | line_plot.properties(width = 250))
viz1

In [11]:
click_parent_edu = alt.selection_point(fields = ["parent_edu_group"], empty = True)

asp_data = df[df["parent_edu_group"].notna() & df["edu_aspiration_group"].notna()]
asp_summary = asp_data.groupby(["parent_edu_group", "edu_aspiration_group"]).size().reset_index(name = "count")
totals = asp_data.groupby("parent_edu_group").size().reset_index(name = "total")
asp_summary = asp_summary.merge(totals, on = "parent_edu_group")
asp_summary["percentage"] = (asp_summary["count"] / asp_summary["total"]) * 100

stacked_bar = alt.Chart(asp_summary).mark_bar().encode(
    x = alt.X("parent_edu_group:N",
           title = "Parent Education Level",
           axis = alt.Axis(labelAngle = -30, labelFontSize = 12),
           sort = ["HS or Less", "Some College/Associate", "Bachelor+"]),
    y = alt.Y("percentage:Q",
           title = "Percentage of Students (%)",
           axis = alt.Axis(labelFontSize = 12)),
    color = alt.Color("edu_aspiration_group:N",
                   title = "Student Educational Aspiration",
                   scale = alt.Scale(
                       domain = ["HS or Less", "Some College", "Bachelor's", "Advanced Degree"],
                       range = ["#e74c3c", "#f39c12", "#3498db", "#2ecc71"]
                   ),
                   legend = alt.Legend(titleFontSize = 13, labelFontSize = 11, orient = "top")),
    opacity = alt.condition(click_parent_edu, alt.value(1.0), alt.value(0.3)),
    tooltip = [
        alt.Tooltip("parent_edu_group:N", title = "Parent Education"),
        alt.Tooltip("edu_aspiration_group:N", title = "Student Aspiration"),
        alt.Tooltip("percentage:Q", title = "Percentage", format = ".1f")
    ]
).add_params(click_parent_edu).properties(
    width = 550,
    height = 450,
    title = {
        "text": ["Student Educational Aspiration by Parent Education"],
        "fontSize": 16,
        "fontWeight": "bold"
    }
)

stem_outcome_data = df[df["parent_edu_group"].notna() & df["stem_degree"].notna()]
heatmap_summary = stem_outcome_data.groupby(["parent_edu_group", "stem_degree"]).size().reset_index(name = "count")
heat_totals = stem_outcome_data.groupby("parent_edu_group").size().reset_index(name = "total")
heatmap_summary = heatmap_summary.merge(heat_totals, on = "parent_edu_group")
heatmap_summary["percentage"] = (heatmap_summary["count"] / heatmap_summary["total"]) * 100

heatmap = alt.Chart(heatmap_summary).mark_rect(stroke = "white", strokeWidth = 2).encode(
    x = alt.X("stem_degree:N", title = "College Major", axis = alt.Axis(labelFontSize = 12)),
    y = alt.Y("parent_edu_group:N",
           title = "Parent Education Level",
           axis = alt.Axis(labelFontSize = 12),
           sort = ["HS or Less", "Some College/Associate", "Bachelor+"]),
    color = alt.Color("percentage:Q",
                   title = "Percentage (%)",
                   scale = alt.Scale(scheme = "blues", domain = [0, 100]),
                   legend = alt.Legend(titleFontSize = 13)),
    tooltip = [
        alt.Tooltip("parent_edu_group:N", title = "Parent Education"),
        alt.Tooltip("stem_degree:N", title = "Student Major"),
        alt.Tooltip("percentage:Q", title = "Percentage", format = ".1f"),
        alt.Tooltip("count:Q", title = "Count")
    ]
).transform_filter(click_parent_edu).properties(
    width = 550,
    height = 450,
    title = {"text": "STEM Degree Rate by Parent Education", "fontSize": 16, "fontWeight": "bold"}
)

viz2 = (stacked_bar.properties(width = 250) | heatmap.properties(width = 250))
viz2
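
The count, total, merge, and percentage steps recur for several summaries in this notebook. As a sketch only, they could be collected into a small helper; the name `share_within` and its parameters are hypothetical, and the frame below is made up.

```python
import pandas as pd


def share_within(data, group_col, category_col):
    """Count rows per (group, category) and add each category's share of its group."""
    summary = (
        data.dropna(subset=[group_col, category_col])
        .groupby([group_col, category_col])
        .size()
        .reset_index(name="count")
    )
    # transform("sum") broadcasts the group total back onto every row,
    # replacing the separate totals table and merge.
    summary["total"] = summary.groupby(group_col)["count"].transform("sum")
    summary["percentage"] = summary["count"] / summary["total"] * 100
    return summary


toy = pd.DataFrame({
    "parent_edu_group": ["HS or Less"] * 4 + ["Bachelor+"] * 2,
    "stem_degree": ["STEM", "Non-STEM", "Non-STEM", "Non-STEM", "STEM", "Non-STEM"],
})
heat = share_within(toy, "parent_edu_group", "stem_degree")
```

With such a helper, a summary like `heatmap_summary` would reduce to a single call, for example `share_within(df, "parent_edu_group", "stem_degree")`.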

In [12]:
selection = alt.selection_point(fields = ["stem_job"], empty = True)

income_data = df[df["income_numeric"].notna() & df["stem_job"].notna() & (df["income_numeric"] > 0)]
income_summary = income_data.groupby(["stem_job", "income_label", "income_numeric"]).size().reset_index(name = "count")
income_totals = income_data.groupby("stem_job").size().reset_index(name = "total")
income_summary = income_summary.merge(income_totals, on = "stem_job")
income_summary["percentage"] = (income_summary["count"] / income_summary["total"]) * 100

income_order = ["No income", "$1-5k", "$5-10k", "$10-15k", "$15-20k", "$20-25k", 
                "$25-30k", "$30-35k", "$35-40k", "$40-50k", "$50-60k", 
                "$60-75k", "$75-100k", "$100k+"]

income_chart = alt.Chart(income_summary).mark_bar().encode(
    x = alt.X("income_label:N",
           title = "Income Range",
           axis = alt.Axis(labelAngle = -45, labelFontSize = 10),
           sort = income_order),
    y = alt.Y("percentage:Q",
           title = "Percentage of Workers (%)",
           axis = alt.Axis(labelFontSize = 12)),
    color = alt.Color("stem_job:N",
                   title = "Career Type",
                   scale = alt.Scale(domain = ["STEM", "Non-STEM"], 
                                 range = ["#3498db", "#95a5a6"]),
                   legend = alt.Legend(titleFontSize = 13, labelFontSize = 12, orient = "top")),
    opacity = alt.condition(selection, alt.value(1.0), alt.value(0.3)),
    tooltip = [
        alt.Tooltip("stem_job:N", title = "Career Type"),
        alt.Tooltip("income_label:N", title = "Income Range"),
        alt.Tooltip("percentage:Q", title = "Percentage", format = ".1f"),
        alt.Tooltip("count:Q", title = "Count")
    ]
).add_params(selection).properties(
    width = 700,
    height = 400,
    title = {
        "text": ["Income Distribution by Career Type (2017)"],
        "fontSize": 16,
        "fontWeight": "bold"
    }
)

job_sat_data = df[df["job_satisfaction"].notna() & df["stem_job"].notna()]
sat_summary = job_sat_data.groupby(["stem_job", "job_satisfaction"]).size().reset_index(name = "count")
sat_totals = job_sat_data.groupby("stem_job").size().reset_index(name = "total")
sat_summary = sat_summary.merge(sat_totals, on = "stem_job")
sat_summary["percentage"] = (sat_summary["count"] / sat_summary["total"]) * 100

job_sat_chart = alt.Chart(sat_summary).mark_bar().encode(
    y = alt.Y("stem_job:N", 
           title = "Career Type",
           axis = alt.Axis(labelFontSize = 13)),
    x = alt.X("percentage:Q", 
           title = "Percentage (%)",
           stack = "normalize",
           axis = alt.Axis(format = ".0%", labelFontSize = 12)),
    color = alt.Color("job_satisfaction:N",
                   title = "Job Satisfaction Level",
                   scale = alt.Scale(
                       domain = ["Very satisfied", "Somewhat satisfied",
                             "Somewhat dissatisfied", "Very dissatisfied"],
                       range = ["#2ecc71", "#3498db", "#f39c12", "#e74c3c"]
                   ),
                   legend = alt.Legend(titleFontSize = 13, labelFontSize = 11, orient = "top")),
    opacity = alt.condition(selection, alt.value(1.0), alt.value(0.3)),
    tooltip = [
        alt.Tooltip("stem_job:N", title = "Career Type"),
        alt.Tooltip("job_satisfaction:N", title = "Satisfaction Level"),
        alt.Tooltip("percentage:Q", title = "Percentage", format = ".1f"),
        alt.Tooltip("count:Q", title = "Count")
    ]
).properties(
    width = 700,
    height = 400,
    title = {"text": "Job Satisfaction Distribution by Career Type (2016)", "fontSize": 16, "fontWeight": "bold"}
)

viz3 = (income_chart.properties(width = 250) | job_sat_chart.properties(width = 250))
viz3