gregfaletto/fragilefamilies

fragilefamilies

Summary

The Fragile Families Challenge was a contest to create models predicting six outcome variables for disadvantaged children at age 15. Panel data derived from interviews with the children and their caretakers were available to train models. In total, there were nearly 13,000 predictor variables, with a training set of 2,121 children. A test set was held out by the organizers of the contest to evaluate the models at the end.

I entered the contest as one of my first attempts at training machine learning models on a real-life data set, one with a great deal of missing data.

I wrote my code in R. After pre-processing the data and handling missing values (imputation, feature removal where necessary, etc.), my code trained lasso-penalized linear regression models as well as principal components linear regression (PCR) models for three of the six response variables, as measured at age 15: GPA, grit, and material hardship.
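The pre-processing step could look something like the following sketch. It is not taken from ffc2.R: the helper name `preprocess_features`, the 50% missingness threshold, and the choice of median imputation are all assumptions made for illustration.

```r
# Hypothetical helper (not from ffc2.R): drop unusable columns, then impute.
preprocess_features <- function(df, max_missing = 0.5) {
  # Keep numeric columns only, to keep the sketch simple.
  df <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]

  # Feature removal: drop columns whose share of missing values is too high.
  missing_share <- colMeans(is.na(df))
  df <- df[, missing_share <= max_missing, drop = FALSE]

  # Imputation: fill remaining gaps with the column median (an assumed choice).
  for (j in seq_along(df)) {
    col <- df[[j]]
    col[is.na(col)] <- median(col, na.rm = TRUE)
    df[[j]] <- col
  }

  # Drop constant columns, which cannot help a regression model.
  keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
  df[, keep, drop = FALSE]
}

# Toy usage with made-up data.
toy <- data.frame(
  a = c(1, NA, 3, 4),          # one gap: imputed
  b = c(NA, NA, NA, 2),        # mostly missing: dropped
  c = c(5, 5, 5, 5),           # constant: dropped
  d = c("x", "y", "z", "w")    # non-numeric: dropped
)
clean <- preprocess_features(toy)
```

With nearly 13,000 predictors and about 2,000 training rows, some filtering of this kind is needed before either model type can be fit sensibly.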

I used cross-validation to find the best lasso-penalized model as well as the best PCR model. Finally, I compared the cross-validated root-mean-squared error (RMSE) of the best lasso-penalized model to that of the best PCR model to choose the final model for each variable. The final selections were as follows:

  • GPA model: Lasso-penalized linear regression with 57 variables.

  • Grit model: Principal components linear regression with 1 (!) component.

  • Material hardship model: Lasso-penalized linear regression with 57 variables.
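The selection procedure described above (cross-validate a lasso, cross-validate PCR, keep whichever has the lower cross-validated RMSE) can be sketched on simulated data as follows. This is an illustration rather than the code in ffc2.R: it assumes the glmnet package for the lasso, builds PCR by hand from `prcomp` and `lm`, and uses made-up data, a made-up fold count, and a made-up maximum number of components.

```r
library(glmnet)  # assumption: glmnet is used here for the lasso fit

set.seed(1)
n <- 200
p <- 30
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", seq_len(p))))
y <- drop(x[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)  # simulated response

# Shared fold assignment so both model families see the same splits.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Lasso: cv.glmnet reports mean squared error for each penalty value.
lasso_cv <- cv.glmnet(x, y, alpha = 1, foldid = fold_id)
lasso_rmse <- sqrt(min(lasso_cv$cvm))
lasso_coefs <- as.matrix(coef(lasso_cv, s = "lambda.min"))[-1, 1]
n_lasso_vars <- sum(lasso_coefs != 0)

# PCR: for each fold, fit PCA on the training rows only, then regress y
# on the first m principal component scores for m = 1, ..., max_comp.
max_comp <- 10
sq_err <- matrix(0, nrow = k, ncol = max_comp)
for (f in seq_len(k)) {
  train <- fold_id != f
  pca <- prcomp(x[train, ], center = TRUE, scale. = TRUE)
  scores_train <- pca$x
  scores_test <- predict(pca, newdata = x[!train, , drop = FALSE])
  for (m in seq_len(max_comp)) {
    train_df <- data.frame(y = y[train], scores_train[, 1:m, drop = FALSE])
    fit <- lm(y ~ ., data = train_df)
    test_df <- data.frame(scores_test[, 1:m, drop = FALSE])
    pred <- predict(fit, newdata = test_df)
    sq_err[f, m] <- sum((y[!train] - pred)^2)
  }
}
pcr_rmse_by_m <- sqrt(colSums(sq_err) / n)
best_m <- which.min(pcr_rmse_by_m)
pcr_rmse <- pcr_rmse_by_m[best_m]

# Final choice: whichever family has the lower cross-validated RMSE.
chosen <- if (lasso_rmse <= pcr_rmse) "lasso" else "PCR"
```

Fitting the PCA inside each fold, rather than once on all rows, keeps the held-out rows from influencing the components they are later scored on.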

All three of my models finished in the top 40% of entries and outperformed a baseline null model on the test set.
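As a rough illustration of what beating a null model means, the sketch below compares a model's test RMSE with that of a baseline that predicts the training-set mean for every child. The README does not define the baseline, so treating it as the training mean is an assumption, and all numbers are made up.

```r
# Root-mean-squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up values for illustration only.
y_train <- c(2.5, 3.0, 3.5, 2.0, 4.0)
y_test <- c(3.0, 2.0, 3.5)
model_pred <- c(2.9, 2.4, 3.3)  # hypothetical model predictions on the test set

# Assumed null model: predict the training mean for every test observation.
null_pred <- rep(mean(y_train), length(y_test))

null_rmse <- rmse(y_test, null_pred)
model_rmse <- rmse(y_test, model_pred)
beats_null <- model_rmse < null_rmse
```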

Thank you to Stephen McKay AKA the_Brit, whose code 'FFC-simple-R-code.R' I used to get started; Viola Mocz (vmocz) and Sonia Hashim (shashim), whose code 'FeatEngineering.R' I used; and hty, whose code 'COS424_HW2_imputation_Rcode.R' I used. Those files were taken from the Fragile Families GitHub, at https://github.com/fragilefamilieschallenge.

More information about the Fragile Families Challenge is available at http://www.fragilefamilieschallenge.org.

Files

Here are descriptions of each of the files in this repository:

  • ffc2.R: The code for my final entry, which loads the raw data, pre-processes it, and trains the models described above.

  • narrative.txt: A brief description of the models trained by my code for my final submission, including model type and number of variables included. This file was generated by my R script in order to preserve this information.

  • prediction.csv: The predictions generated by my models for both training and test set data.

About

Code from my entry to Princeton University's Fragile Families Challenge.
