level-data-1a

[process_data.ipynb](./process_data.ipynb) - converts provided data from .json to .csv and renames files for consistency
[make_dfs.ipynb](./make_dfs.ipynb) - merges provided data into dataframes which we used to train models (includes creating encoded versions of dataframes)
Model spreadsheet - contains metadata, metrics for each model
- Within this repo, individual models can be found at architecture/dataframe.ipynb. All models are trained on dataframes within dataframes/.

Predicting Proficiency

Project Description

Overview, Objectives, and Goals

For a given district:
- Identify which factors most strongly contribute to student proficiency (subject-based)
- Utilize factors to build predictive models that flag students at risk of falling below proficiency level

For example: Our ML model could help us flag a current 11th grader as at risk of falling below proficiency in math based on 10th grade features
Educators can then provide extra help / attention / services

Scope: Predict subject-specific proficiency given past year’s data and grade level (potentially generalize across grade levels)

Methodology

Raw Dataset Overview

Data Tables Provided
- scores
- benchmarks
- courseSections
- courseSectionRosters
- schools
- vendorUsage

Data Attributes
- District 18, 45 (where 18 is a superset of 45)
- student ID (all data tables)
- year (all data tables)
- grade level (all data tables)
- demographics (ethnicity, ELL, etc.), from scores
- course enrollment, from courseRosterSections, courseRosters
- school, from schools (anonymized)
- vendor usage, from vendorUsage

Note: Data is anonymized

Data Usage

Using benchmark threshold and student score, we created two data types:

Boolean — is_proficient
Continuous — proficient_score (score/threshold)
- < 1 is not proficient, >= 1 is proficient
- Captures “how” proficient a student is

For example, if a student's score is 21 and the proficiency threshold is 18, then is_proficient is True and proficient_score is approximately 1.1667 (21/18).
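The two labels can be sketched in a few lines of plain Python. This is a minimal illustration of the score/threshold definition above, not the code from the notebooks; the function name `proficiency_labels` is a hypothetical helper introduced here.

```python
def proficiency_labels(score, threshold):
    """Return (is_proficient, proficient_score) for one student score.

    Hypothetical helper: proficient_score is score / threshold, and a
    student counts as proficient when that ratio is at least 1.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    proficient_score = score / threshold   # continuous label
    is_proficient = proficient_score >= 1  # boolean label
    return is_proficient, proficient_score


# Worked example from the text: score 21, threshold 18.
is_proficient, proficient_score = proficiency_labels(score=21, threshold=18)
print(is_proficient, round(proficient_score, 4))  # True 1.1667
```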

Then, for a given student, grade level, and subject, we merged dataframes in the following manner:

studentId	subgroup_ethnicity	...	course_Algebra II	...	school_A	...	iready_math	...	proficient_score
45440	1		1		0		1		0.941176
45054	0		0		1		0		0.529412

1 = a student is part of a subgroup, course, school, etc.
0 = a student is not part of a subgroup, course, school, etc.
Label: proficient_score
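As a rough sketch of how one merged row is laid out, the snippet below builds the 0/1 indicator columns for a single student from a set of memberships. It uses plain Python to stay dependency-free; the notebooks themselves work with dataframes, and the helper name `indicator_row` is an assumption made for illustration.

```python
def indicator_row(student_id, memberships, columns, proficient_score):
    """Build one merged row: 1 if the student belongs to a column's
    subgroup/course/school/vendor, 0 otherwise, plus the label.

    Hypothetical helper, not taken from make_dfs.ipynb.
    """
    row = {"studentId": student_id}
    for column in columns:
        row[column] = 1 if column in memberships else 0
    row["proficient_score"] = proficient_score  # label column
    return row


# Column names and values taken from the first example row above.
columns = ["subgroup_ethnicity", "course_Algebra II", "school_A", "iready_math"]
row = indicator_row(
    student_id=45440,
    memberships={"subgroup_ethnicity", "course_Algebra II", "iready_math"},
    columns=columns,
    proficient_score=0.941176,
)
print(row)
```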

Final Datasets

Math and Reading Proficiency for Grades 3–8
- scantronMath, scantronReading
- 2017 features, 2018 labels
- Features: courses, schools, vendors, past_proficiency
- ~20,000 students in each DataFrame

Subject Proficiency for Grade 11
- ACT (Reading, English, Math, Science)
- 2017 features, 2018 labels
- Features: courses, schools, vendors
- ~2,500 students in each DataFrame

Dimension Reduction

We used two different methods to reduce the number of columns in our final dataframes: Encoding and Principal Component Analysis (PCA)

Encoding

In the raw data we were given, original course names had formats such as English 5 and LifeSci Gr7. To encode this information in our dataframes, we processed it in the following way:

Extract grade level
Create subject areas using keywords
Create feature for course:
- 0 = Not enrolled
- 1 = Below grade level
- 2 = At grade level
- 3 = Above grade level
Create binary feature for electives
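The encoding steps can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the keyword map `SUBJECT_KEYWORDS` is hypothetical (the actual keywords are not listed here), courses that cannot be classified are skipped, when a student has several courses in one subject the highest level is kept, and the elective flag is omitted.

```python
import re

# Hypothetical keyword map; the project's real keyword list is not shown.
SUBJECT_KEYWORDS = {
    "english": ("english", "eng"),
    "science": ("sci", "chem", "physic"),
    "math": ("math", "alg", "geometry"),
}


def extract_grade(course_name):
    """Return the first integer found in the course name, or None."""
    match = re.search(r"\d+", course_name)
    return int(match.group()) if match else None


def subject_of(course_name):
    """Return the first subject whose keyword appears in the name."""
    name = course_name.lower()
    for subject, keywords in SUBJECT_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return subject
    return None


def encode_student(course_names, student_grade):
    """Encode courses as 0 (not enrolled), 1 (below), 2 (at), 3 (above)."""
    features = {f"subject_{subject}": 0 for subject in SUBJECT_KEYWORDS}
    for name in course_names:
        subject = subject_of(name)
        course_grade = extract_grade(name)
        if subject is None or course_grade is None:
            continue  # assumption: skip courses we cannot classify
        if course_grade < student_grade:
            level = 1
        elif course_grade == student_grade:
            level = 2
        else:
            level = 3
        key = f"subject_{subject}"
        features[key] = max(features[key], level)  # assumption: keep highest
    return features


# Student grades here are assumed for the sake of the example.
print(encode_student(["English 5"], student_grade=5))
print(encode_student(["LifeSci Gr7"], student_grade=6))
```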

PCA

What is PCA?
- PCA reduces the number of columns (features) in a dataset.
How does it work?
- Combines original columns into fewer “principal components.”
- Captures the most important differences (variation) in the data.
- We kept 80% of the variation and removed excess columns.
Why do we use it?
- Hundreds of columns make data hard to analyze.
- Removing unnecessary components didn’t hurt performance.
- Eliminates redundancy and simplifies the data.
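The "keep 80% of the variation" step amounts to choosing the smallest number of components whose cumulative explained variance reaches the target. The sketch below shows only that selection rule, given per-component explained-variance ratios as a PCA implementation would report them; the function name and the example ratios are hypothetical. Libraries such as scikit-learn can apply the same rule directly when `PCA` is given a fractional `n_components`, though which library the notebooks use is not stated here.

```python
def components_to_keep(explained_variance_ratio, target=0.80):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches `target`.

    Assumes the ratios are sorted from largest to smallest, as PCA
    implementations conventionally report them.
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return len(explained_variance_ratio)  # target never reached: keep all


# Made-up ratios for illustration only.
ratios = [0.5, 0.2, 0.15, 0.1, 0.05]
print(components_to_keep(ratios))  # 3
```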

Outcomes of PCA

Metric	ACT_math	ACT_reading	Scantron_Math	Scantron_reading
Columns in original dataframe	240	240	139	139
Columns in dataframe after PCA	109	109	46	46

Before encoding: 142 features

studentId	course_English 5	...	course_LifeSci Gr7
43588	1		0
30983	0		1

After encoding: 26 features

studentId	subject_english	...	subject_science
43588	2		0
30983	0		3

Exploratory Data Analysis

Correlations

We used correlation matrices to find how closely related the columns of our dataframes are. Here are our key findings from this process:

Strong connections between lunch status, gender, and ethnicity.
ACT sections (e.g., math & reading) are closely related.
Doing well in one ACT section often means doing well in another.
Similar pattern for Scantron Math and Reading exams (not shown).

Class Imbalance

We found that only about 15-20% of students represented in the dataset are proficient (i.e., score at least the benchmark). This impacted our models’ ability to predict proficient students.
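One way to see why this imbalance matters: a model that always predicts the majority class already scores high on accuracy while finding no proficient students at all. The sketch below computes the proficient share and that majority-class baseline on a toy label list; the helper names and the data are illustrative assumptions, not project results.

```python
def positive_rate(labels):
    """Fraction of labels that are True (proficient)."""
    return sum(1 for label in labels if label) / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the more common class."""
    rate = positive_rate(labels)
    return max(rate, 1 - rate)


# Toy data: 2 proficient students out of 10 (20%).
labels = [True] * 2 + [False] * 8
print(positive_rate(labels))
print(majority_baseline_accuracy(labels))
```

Because such a baseline is already strong on accuracy, a class-averaged metric like macro F1 gives a fairer picture of how well proficient students are identified.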

Modeling

Model Selection

We had three main requirements when choosing which machine learning models to use to predict proficiency:

First, given that our intended audience is school administrators and educators, who likely do not have a data science background, we wanted our models to be interpretable.
Second, we decided to treat proficiency as a continuous label instead of a simple binary yes-or-no label. After discussing with our Challenge Advisors and TA, we thought that a continuous label would be more beneficial because it accounts for different levels of proficiency.
Third, we wanted our models to fit the data we were given well, meaning they neither underfit (fail to learn enough of the data's complexity) nor overfit (learn too much of the training data's complexity). Note that model training and tuning also affect how the model performs in the end.
Ultimately, we decided to train linear regression, decision tree, random forest, and gradient boosted decision tree models.

Model Training

Our approach: try many different models and see which ones perform best.
Our model spreadsheet contains metadata and metrics for each model.

Here are our highest performing models:
Note that accuracy and macro F1 scores are computed after converting the continuous prediction into a binary value (e.g., 1.6 → true, because a proficient_score of at least 1 counts as proficient).
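The conversion and the two classification metrics can be sketched as below. This is a plain-Python illustration, assuming a cutoff of 1.0 on proficient_score as defined earlier; the function names and the toy scores are hypothetical, and the notebooks may compute these metrics with a library instead.

```python
def to_binary(scores, cutoff=1.0):
    """Convert continuous proficient_score values to True/False."""
    return [score >= cutoff for score in scores]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (True and False)."""
    return (f1_for_class(y_true, y_pred, True)
            + f1_for_class(y_true, y_pred, False)) / 2


# Toy scores for illustration only.
actual = to_binary([1.6, 0.9, 0.5, 1.1, 0.7])
predicted = to_binary([1.2, 1.05, 0.6, 0.95, 0.8])
print(accuracy(actual, predicted), macro_f1(actual, predicted))
```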

Math, reading for grades 3-8 (Using Scantron Math/Reading data)

Model Name	Features	Evaluation Metrics	Insights
Simple Linear Regression	Past proficiency	Math (RMSE: 0.06, R^2: 0.44, Accuracy: 0.87, Macro F1: 0.83); Reading (RMSE: 0.07, R^2: 0.59, Accuracy: 0.87, Macro F1: 0.86)	Only one feature, because adding features either did not change or worsened the RMSE and R^2 values. Predicts poorly for students whose future scores differ from their past scores. Treated as a baseline model for Scantron math/reading.
XGBoost	Schools, courses, vendor usage, past proficiency	RMSE: 0.06, R^2: 0.59, Accuracy: 0.88, Macro F1: 0.87	Past_proficient_score is positively correlated with the label. Average performance varies by grade level. Co-enrollment in advanced classes. Scantron Math had no advanced courses in key features.
Decision Tree	Schools, courses, vendor usage	RMSE: 0.07, R^2: 0.5, Accuracy: 0.87, Macro F1: 0.87	Encoded versions performed slightly better than PCA. School comes up as an important feature.
Random Forest	Schools, courses, vendor usage	RMSE: 0.05, R^2: 0.5, Accuracy: 0.86, Macro F1: 0.86	Decision Tree performed slightly better. PCA performed better than encoded. Schools are an important factor.

Math, reading for grade 11 (Using ACT Math/Reading data)

Model Name	Features	Evaluation Metrics	Insights
XGBoost	Schools, courses, vendor usage	RMSE: 0.08, R^2: 0.6, Accuracy: 0.79, Macro F1: 0.82	Positive correlations with proficiency: iReady Math (value of 1) correlates positively with ACT math scores, as do the advanced courses course_Alg II/Trig, course_Eng Gr10 Adv, course_USHis I Adv, course_Geometry Adv, and course_ChemistryAdv. Negative correlations with proficiency: Algebra I B, Physical Science, English Grade 10.
Decision Tree	Schools, courses, vendor usage	RMSE: 0.18, R^2: 0.38, Accuracy: 0.79, Macro F1: 0.87	In the ACT Math model, science courses were important. In the ACT Reading model, STEM courses were more important than English 10 enrollment.
Random Forest	Schools, courses, vendor usage	RMSE: 0.16, R^2: 0.5, Accuracy: 0.80, Macro F1: 0.79	Random Forest performs better than Decision Trees.
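For readers less familiar with the regression metrics in the tables above, the sketch below shows the standard definitions of RMSE and R^2 applied to a continuous label such as proficient_score. It is a plain-Python illustration with made-up values; the function names are hypothetical and the notebooks may rely on a library for these calculations.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    1.0 is a perfect fit; 0.0 matches always predicting the mean.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


# Made-up proficient_score values for illustration only.
actual_scores = [0.5, 0.75, 1.0, 1.25]
predicted_scores = [0.5, 1.0, 1.0, 1.0]
print(rmse(actual_scores, predicted_scores))
print(r_squared(actual_scores, predicted_scores))
```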

Results and Key Findings

About the data:
- A majority of students are not proficient (according to the benchmarks)
- We have enough data to train predictive models for students in grades 3–8 and 11
Past proficiency is a relatively good predictor of proficiency
Proficiency in one subject (e.g. reading) correlates with proficiency in another (e.g. math)
We can predict proficiency for the stated demographics with over 70% accuracy in most cases
Our models have a harder time predicting proficient students

Potential Next Steps

Add additional data to improve accuracy and representation
- Performance on courses
- Complete demographic data
- More benchmark exams (e.g. Aspire)
Try to get the model to generalize better
- We can generalize across grade levels given enough data
- Can we generalize across subjects?
Represent proficiency more holistically
- Different tiers of proficiency
- Different data sources

Usage

The dataframes folder contains all of the CSV files that we used to train models
Individual models are in their respective folders
- E.g. You can find all decision tree models in the DecisionTree folder
Inside each folder, you will find a Jupyter Notebook for each dataframe that the model was trained on
You can download and run the Jupyter Notebooks in an IDE of your choosing (we used Visual Studio Code)

Credits and Acknowledgements

Student Team: Allison Huang, Manjari Muruganandam, Louise Marie Maganto, Maya Patel

Thank you to our TA, Blessing Nwogu, and Challenge Advisors from Level Data, Pradnya Bhawalkar and Eddie Shek, for supporting us through this project!
