# Main Notebook - California EWS Capstone

_Authors: Amaryani Balbuena, Jun Clemente, Tanya Ortega_

_Applied Data Science, ADS-599 - University of San Diego_


# 1. Project Overview

This notebook provides a high-level summary of the California Early Warning System (EWS) developed for the ADS-599 Capstone. The goal of the project is to identify public high schools that are at risk of low graduation outcomes, using only public, non-PII statewide datasets.

The EWS model uses indicators aligned with the ABC framework - Attendance (A), Behavior (B), and Course performance (C) - to generate transparent and reproducible risk predictions. This notebook loads the final modeling dataset, the selected top-15 features, and the trained Random Forest classifier, and demonstrates a sample prediction. Links to the full workflow notebooks and the data dictionaries are included below.


# 2. Load Final Modeling Dataset


In [3]:
import pandas as pd

# load the final modeling dataset (a pickled pandas DataFrame)
final_dataset = pd.read_pickle("./data/modeling_dataset_final.pkl")

# 3. Preview Columns + Shape


In [15]:
# show the shape of the dataset
display(f"The shape of the final dataset is: {final_dataset.shape}")

# show first five rows of final dataset
display(final_dataset.head())


'The shape of the final dataset is: (958, 17)'

| | still_enrolled_rate | chronicabsenteeismrate | unexcused_absences_percent | met_uccsu_grad_reqs_rate | percent__eligible_free_k12 | frpm_count_k12 | stu_tch_ratio | pct_experienced | cohortstudents | pct_senior_cohort | stu_adm_ratio | grade_retention_ratio | pct_bachelors_plus | stu_psv_ratio | pct_bachelors | graduation_rate | high_grad_rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 12.7 | 23.5 | 73.9 | 0.172013 | 327.0 | 23.3 | 0.863158 | 394.0 | 0.498894 | 452.0 | 1.058568 | 0.315789 | 361.6 | 0.126316 | 92.4 | Graduated / On Track |
| 1 | 0.0 | 70.3 | 46.2 | 67.8 | 0.174389 | 307.0 | 22.0 | 0.894523 | 284.0 | 0.501264 | 414.0 | 0.996721 | 0.264133 | 274.7 | 0.166667 | 95.1 | Graduated / On Track |
| 2 | 0.8 | 5.2 | 24.1 | 62.3 | 0.262259 | 935.0 | 19.2 | 0.855932 | 861.0 | 0.496276 | 374.7 | 1.014907 | 0.148305 | 228.7 | 0.427966 | 90.5 | Graduated / On Track |
| 3 | 0.0 | 3.5 | 28.0 | 72.8 | 0.166358 | 491.0 | 22.6 | 0.912752 | 672.0 | 0.470174 | 396.9 | 0.829876 | 0.315436 | 178.2 | 0.187919 | 96.4 | Graduated / On Track |
| 5 | 0.0 | 29.6 | 67.9 | 76.3 | 0.551136 | 102.0 | 11.6 | 1.0 | 40.0 | 0.545455 | 176.0 | 1.170732 | 0.166667 | 274.7 | 0.333333 | 95.0 | Graduated / On Track |
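A few quick sanity checks can complement the preview. The sketch below is illustrative only: it rebuilds three columns of the five previewed rows by hand (the names `preview`, `label_counts`, `missing_per_column`, and `index_gaps` are introduced here and are not part of the notebook), then counts outcome labels, missing values, and gaps in the row index.

```python
import pandas as pd

# Hand-copied subset of the five previewed rows (three of the 17 columns).
preview = pd.DataFrame(
    {
        "chronicabsenteeismrate": [12.7, 70.3, 5.2, 3.5, 29.6],
        "graduation_rate": [92.4, 95.1, 90.5, 96.4, 95.0],
        "high_grad_rate": ["Graduated / On Track"] * 5,
    },
    index=[0, 1, 2, 3, 5],
)

# How many schools fall under each outcome label.
label_counts = preview["high_grad_rate"].value_counts()

# Number of missing values in each column.
missing_per_column = preview.isna().sum()

# Index values skipped between 0 and the largest index; a gap hints that
# rows were dropped in an earlier step without resetting the index.
index_gaps = sorted(set(range(preview.index.max() + 1)) - set(preview.index))

print(label_counts)
print(missing_per_column)
print("Skipped index values:", index_gaps)
```

Applied to the real data, the same three checks would run on `final_dataset` instead of the hand-built `preview` frame.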


# 4. Load Top 15 Features


In [24]:
import joblib 
from pathlib import Path 

# read top 15 features
top_features_path = Path("./models/top_features.pkl")
top_features = joblib.load(top_features_path)
top_features

['still_enrolled_rate',
 'chronicabsenteeismrate',
 'unexcused_absences_percent',
 'met_uccsu_grad_reqs_rate',
 'percent__eligible_free_k12',
 'frpm_count_k12',
 'stu_tch_ratio',
 'pct_experienced',
 'cohortstudents',
 'pct_senior_cohort',
 'stu_adm_ratio',
 'grade_retention_ratio',
 'pct_bachelors_plus',
 'stu_psv_ratio',
 'pct_bachelors']
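Before modeling, it can help to confirm that every selected feature is a column of the dataset and that the outcome columns stay out of the feature matrix. The following is a minimal sketch under stated assumptions: the toy `dataset` is built by hand from the first two previewed rows, and treating `high_grad_rate` as the target is an assumption based on the preview, not something this notebook states explicitly.

```python
import pandas as pd

# The 15 selected feature names, as listed in the output above.
top_features = [
    "still_enrolled_rate", "chronicabsenteeismrate", "unexcused_absences_percent",
    "met_uccsu_grad_reqs_rate", "percent__eligible_free_k12", "frpm_count_k12",
    "stu_tch_ratio", "pct_experienced", "cohortstudents", "pct_senior_cohort",
    "stu_adm_ratio", "grade_retention_ratio", "pct_bachelors_plus",
    "stu_psv_ratio", "pct_bachelors",
]

# Toy stand-in for final_dataset: the first two rows shown in the preview.
columns = top_features + ["graduation_rate", "high_grad_rate"]
rows = [
    [1.0, 12.7, 23.5, 73.9, 0.172013, 327.0, 23.3, 0.863158, 394.0, 0.498894,
     452.0, 1.058568, 0.315789, 361.6, 0.126316, 92.4, "Graduated / On Track"],
    [0.0, 70.3, 46.2, 67.8, 0.174389, 307.0, 22.0, 0.894523, 284.0, 0.501264,
     414.0, 0.996721, 0.264133, 274.7, 0.166667, 95.1, "Graduated / On Track"],
]
dataset = pd.DataFrame(rows, columns=columns)

# Fail early if any selected feature is absent from the dataset.
missing = [name for name in top_features if name not in dataset.columns]
if missing:
    raise KeyError(f"Features absent from the dataset: {missing}")

# Feature matrix in the saved feature order; assumed target column.
X = dataset[top_features]
y = dataset["high_grad_rate"]

# Columns deliberately left out of the feature matrix.
non_feature_columns = [c for c in dataset.columns if c not in top_features]
print(X.shape, non_feature_columns)
```

The 15 features plus `graduation_rate` and `high_grad_rate` account for all 17 columns reported in the dataset shape.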

# 5. Load Final Random Forest Model


In [25]:
# read in predictive model 
model_path = Path("./models/random_forest_ews.pkl")
final_model = joblib.load(model_path)

# display final model 
print(type(final_model))
final_model

<class 'sklearn.ensemble._forest.RandomForestClassifier'>


| Parameter | Value |
| --- | --- |
| n_estimators | 400 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 3 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |

# 6. Workflow Diagram (Markdown only)

# 7. Links to Detailed Notebooks

# 8. Links to Data Dictionaries

# 9. Link to Streamlit App


# 10. References

Austin, G., Hanson, T., Bala, N., & Zheng, C. (2023). Student engagement and well-being in California, 2019-21: Results of the Eighteenth Biennial State California Healthy Kids Survey, Grades 7, 9, and 11. WestEd. https://data.calschls.org/resources/18th_Biennial_State_1921.pdf

California Department of Education. (n.d.). Retrieved October 26, 2025, from https://www.cde.ca.gov/

Chen, T., Wanberg, R. C., Gouioa, E. T., Brown, M. J. S., Chen, J. C.-Y., & Kurt Kraiger, J. J. (2019). Engaging Parents Involvement in K-12 Online Learning Settings: Are We Meeting the Needs of Underserved Students? Journal of e-Learning and Knowledge Society, 15(2). https://doi.org/10.20368/1971-8829/1563

Cobb, C. D. (2020). Geospatial Analysis: A New Window Into Educational Equity, Access, and Opportunity. Review of Research in Education, 44(1), 97–129. https://doi.org/10.3102/0091732X20907362

Rumberger, R., Addis, H., Allensworth, E., Balfanz, R., Bruch, J., Dillon, E., Duardo, D., Dynarski, M., Furgeson, J., Jayanthi, M., Newman-Gonchar, R., Place, K., & Tuttle, C. (2017). Preventing Dropout in Secondary Schools (No. NCEE 2017-4028). National Center for Education Evaluation and Regional Assistance (NCEE), Institute of Education Sciences, U.S. Department of Education. https://whatworks.ed.gov

Sava, S., Bunoiu, M., & Malita, L. (2017). Ways to Improve Students’ Decision for Academic Studies. Acta Didactica Napocensia, 10(4), 109–120. https://doi.org/10.24193/adn.10.4.11

Siegle, D., Gubbins, E. J., O’Rourke, P., Langley, S. D., Mun, R. U., Luria, S. R., Little, C. A., McCoach, D. B., Knupp, T., Callahan, C. M., & Plucker, J. A. (2016). Barriers to Underserved Students’ Participation in Gifted Programs and Possible Solutions. Journal for the Education of the Gifted, 39(2), 103–131. https://doi.org/10.1177/0162353216640930

The California School Climate, Health, and Learning Survey (CalSCHLS) System—Home. (n.d.). Retrieved October 26, 2025, from https://calschls.org/


# 11. Appendix

**Data Dictionaries**

1. Combined EDA Dataset - [01_data_dictionary_combined.md](./docs/01_data_dictionary_combined.md)
2. Modeling Dataset - [02_data_dictionary_modeling.md](./docs/02_data_dictionary_modeling.md)
3. Final Top 15 Features - [03_data_dictionary_top15.md](./docs/03_data_dictionary_top15.md)
