R programs for analyzing eight years of foster care placement records: preparing data for machine learning, extrapolating values, performing deeper analysis (feature engineering: ranking and weighting different case types, etc.), and combining data into single dataframes (ex. "all_numbers") to best suit analysis procedures. Some adapted model code is also in this repository - remember to set the proper columns / characteristics.
The purpose of this project is to use statistical analysis, machine learning, and data science tools and procedures to find insights into how child welfare services can be improved in Northern Florida. With the help of a local company that shared its data, I was able to build a number of different programs / data analysis schemes that aided in the understanding of the associated foster care records as a whole. These records spanned from 2010 to 2017, with approximately 40,000 total cases and 170,000 participants (including children, parents, caregivers, etc).
Children's cases are often imperfect: children can "bounce" around the system through different placement settings, leave, and then come back for further cases. One of the main goals of this project was to find any characteristics or identifiable factors that would lead to a child being removed from his or her family multiple times, and/or being re-entered into the system after being placed out of it. As discovered, the majority of the children in the given data have multiple records, different placement types, and multiple caregivers, making this initiative a multi-faceted and complex one.
Data Manipulation & Feature Engineering Programs
Reproducible design: most programs build a dataframe that other programs then use for deeper insights. Feature engineering became a pivotal piece of the project before machine learning could be done, so these programs culminate in the construction of a dataframe of numerical characteristics. This final dataframe, "all_numbers", was used for machine learning and statistical analysis.
--> outputs each number case as dataframe (ex. "second_removals")
--> creates percentages for movements within the removal dataframes above
--> these percentages do not consider the case BEFORE the current one
--> outputs wide file "paths" with 1-5 placements
--> uses "paths" to track specific placement type through case history
--> outputs "child_flow_custom"
--> 5-7 use child_flow_custom to calculate % of A to B movements (user chooses A & B)
--> these percentages WILL consider the case BEFORE the current one (as specified by user)
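The A-to-B movement percentages described above can be sketched with dplyr. This is a minimal illustration, not the repository's actual code; the column names (`child_id`, `placement_order`, `setting`) are hypothetical stand-ins for whatever `child_flow_custom` actually contains:

```r
library(dplyr)

# Hypothetical long-format placement history: one row per placement,
# ordered by placement_order within each child.
child_flow_custom <- data.frame(
  child_id        = c(1, 1, 2, 2, 3),
  placement_order = c(1, 2, 1, 2, 1),
  setting         = c("Foster Home - Relative", "Pre-Adoptive Home",
                      "Institution", "Institution", "Foster Home - Relative")
)

# Pair each placement (A) with the placement that follows it (B),
# then compute the percentage of each A -> B movement.
transitions <- child_flow_custom %>%
  arrange(child_id, placement_order) %>%
  group_by(child_id) %>%
  mutate(next_setting = lead(setting)) %>%
  ungroup() %>%
  filter(!is.na(next_setting)) %>%
  count(setting, next_setting) %>%
  group_by(setting) %>%
  mutate(pct = 100 * n / sum(n))
```

Conditioning on the case BEFORE the current one, as programs 5-7 do, would add a second `lag(setting)` column and a `filter()` on the user's chosen prior setting.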
--> isolates last case of each child and pulls information
--> feature engineering, builds ranking system using last placement in case
--> feature engineering, extrapolates & calculates columns of features for machine learning
--> examples: case duration (in days), number of participants, average caregiver age, pop density
--> final feature engineering scheme, "weights" entirety of case history, not just last place
--> "all_numbers" dataframe complete and ready for machine learning
Machine Learning Programs
--> 11-14 are models
--> adapted from public code, use "all_numbers" or "ML_removals" to build model
--> RF also uses ggplot2 to visualize case success influences
Data Manipulation: Removals Info Programs
--> creates data frame with details about why child was removed and then re-entered
--> includes placement, service, and end-reason information for both the removal AND the RE-ENTRY TO SYSTEM
--> builds percentages table of a child's END REASON on RE-ENTRY CASE
--> user can select what end reason of FIRST REMOVAL was (say, reunification)
--> similar to merge_FIRST_THROUGH_FIFTH.R
--> tracks REMOVALS, not placements, with placement / service / end reason for each
--> also calculates length of time between removals (in days)
--> only does up to 4 removals (highest # of out-of-home episode types is 4)
--> outputs WIDE file "ALL_REMS_WIDE.csv"
--> builds full table of % of each Placement Setting or Service per a certain End Reason
--> these Place/Services are in the 2nd and 3rd removals
--> user can select End Reason of 1st and 2nd removals
--> takes removal data conglomerated in past few programs
--> merges into machine learning df with "all_numbers" details
--> creates "ML_removals" df in R Studio
The current goal of these new programs is to visualize what happens when children are reunified with their parents and then pulled out of that home a second time (their second removal). If they also have a 3rd removal, the programs examine what sorts of services / end reasons those cases have as well.
Analysis in Progress
--> custom algorithm to find the most influential characteristics
--> concept is fairly simple, see code for more details
--> uses ML_removals, and outputs "STAT_ALG" and "STAT_ALG_RESULTS.csv"
How children's placements over time, as they move through the foster care system, affect whether their case ends after a specific placement or whether they continue with further placements in the system, moving to better (or worse) settings.
I built a multi-layered pie chart that includes the placement settings of the child after the three most common first settings: Foster Home - Relative, Foster Home - NonRelative, and Institution.
The outer rings of the pie show the percentage of children that went to the indicated placement setting, after this first setting. To be clear, this only visualizes the first and second placements. The R programs that we built can track any number of cases, but after three or four the amount of data starts to dwindle significantly, so tracking beyond that point has so far been fairly unfruitful.
Having “No Case” is a significant insight because it means that when the program was run, there was an “NA” in the data for the next case. This means that there is no additional record for this child after moving out of a placement setting. This is good because I can assume that the child does not have a need to be re-entered into the system and was provided for properly by the system. One of the original goals of this research was to find factors that influenced this sort of behavior, so, from this visualization, I can conclude the following:
● After Foster Relative - highest chance (55%) of leaving system
● After Foster-Not Relative - high chance (75%) of staying in system
● After Institution - lowest chance (9%) of leaving system
● Highest chance of going to Pre-Adoptive Home after Foster-Relative (12%)
Foster care with a relative is a clear influence toward reduced retention in the system. In addition to the insights that can be extrapolated from this visualization, I also discovered that 98% of the children who move into the Pre-Adoptive Home placement setting (which is ONLY a significant movement after foster care with a relative) do not return for any further cases. This is further evidence that foster care with a relative is superior to care with a non-relative.
Characteristic Analysis & Random Forest
After building the feature engineering R program to calculate the "weight" of each case, the next step was to determine the most influential factors / characteristics for the success of a child's case in the foster care system (higher success = higher likelihood of being provided for properly and exiting the system). The success factor was used as the characteristic to be predicted, based on various other details about the child's case, their caregivers, location, etc. The result was obtained via a random forest model in R and plotted in ggplot2.
- Case Duration: actual duration of case, in days, from case begin date to case end date
- Age: age of child during associated case (estimated with MM/YYYY)
- CareAge: age of caregiver (estimated with MM/YYYY)
- NumParticipants: number of people involved in case (parents, caregivers, other children, relatives, etc)
- PerCapIncome: income per capita of associated case zip code
- MedHousIncome: median household income
- NumCaregivers: number of caregivers in case
- VC: violent crime rate
- ZipDens: density of zip code per square mile
- ZipCount: number of other cases in same zip code as case
- AvgHome: average cost of a house in area
- PC: property crime rate
- Zip: zip code where case took place
- PercUnder18: percent of population in zip under 18
- LowBirthWeight: rate of low birth weights in zip
- InfantMortRate: rate of infant mortality in zip
- JuvDelinquency: rate of juvenile delinquency in zip
- ProbZip: a factor created to track the most densely packed zip codes in terms of cases
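A model of this shape can be sketched with the randomForest package, under the assumption that `all_numbers` holds the columns listed above with `Weight` as the engineered success measure (a sketch, not the repository's exact adapted code):

```r
library(randomForest)
library(ggplot2)

# Regression forest: predict the engineered success measure "Weight"
# from all other numeric case features.
set.seed(42)
rf <- randomForest(Weight ~ ., data = all_numbers, importance = TRUE)

# %IncMSE: how much prediction error grows when a feature is permuted;
# larger values mean a more influential characteristic.
imp <- data.frame(feature = rownames(importance(rf)),
                  inc_mse = importance(rf)[, "%IncMSE"])

ggplot(imp, aes(x = reorder(feature, inc_mse), y = inc_mse)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "% increase in MSE when permuted")
```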
This plot also incorporates the crime/financial data, which was compiled by another team member. It included financial and crime statistics for about 60 of the most populated zip codes in the data set. That data turned out to be fairly important, appearing among the top five influences.
Statistical Analysis using Top Influences
This is not yet all-inclusive of the removals analysis, but it gives good insight into what is a positive and/or negative influence on a child's case in the foster care system. These results come from statistical analysis guided by the Random Forest model's results: for each top influence, I took the average value and compared cases above and below it.
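The above/below-average comparison can be sketched as a small helper function; the feature name passed in is illustrative, and `all_numbers` is assumed to hold the engineered features with `Weight` as the outcome:

```r
# Split cases at the mean of one influential feature and compare
# the average outcome ("Weight") above vs. below that cutoff.
split_compare <- function(df, feature, outcome = "Weight") {
  cutoff <- mean(df[[feature]], na.rm = TRUE)
  above  <- df[[outcome]][df[[feature]] >  cutoff]
  below  <- df[[outcome]][df[[feature]] <= cutoff]
  c(mean_above = mean(above, na.rm = TRUE),
    mean_below = mean(below, na.rm = TRUE))
}

# Example call with an illustrative feature name:
split_compare(all_numbers, "CaseDuration")
```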
Removal to Re-Entry Case Tracking
Children are removed from their homes to start their path in the foster care system, but they are often also removed from the placements that they are put into. It may take days, weeks, months, or years for them to return. After speaking with the company, they recommended that I track children who are reunified with their parents as an End Reason, and why they are put back into the system afterward. This implies that they are then removed AGAIN from the same home that they were originally taken out of. I built more programs (#'s 15, 16, 17, 18, 19) to track these "re-entry" case details as children are taken out of the same home that they first were removed from, for a second time.
Reunification Cases; Removal Tracking
These are children whose first removal is marked with an end reason of "Reunification with Parents". Which placements and services do these children move into after their 1st removal ends in reunification?
The neural network was used to assess how accurately machine learning models could predict the "success" of a case (the "Weight" feature engineered earlier). The model below also includes the crime data; it was created primarily to test whether our statistical analysis / machine learning dataframe "all_numbers" contained enough data, or whether more was needed to make an accurate prediction. To our surprise, it was extremely accurate, so I did not need to seek out any additional details beyond what I had.
Once this neural network was done (error was only 4.5%), I knew that I had enough data to make accurate predictions of "Weight" (the success factor).
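A network of this kind can be sketched with the neuralnet package. This is an assumed setup, not the exact model used: the hidden-layer sizes are illustrative, and `all_numbers` is assumed to be fully numeric with `Weight` as the target.

```r
library(neuralnet)

# neuralnet is sensitive to feature scale, so min-max normalize
# every column to [0, 1] before training.
scaled <- as.data.frame(lapply(all_numbers, function(x)
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))))

set.seed(42)
nn <- neuralnet(Weight ~ ., data = scaled,
                hidden = c(5, 3),      # illustrative architecture
                linear.output = TRUE)  # regression, not classification

# Predictions on the scaled data, for computing an error rate.
pred <- predict(nn, scaled)
```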
Results: Random Forest on Removal Data
Once again, the RF model in R produced strong results. This time, different characteristics appear that were present in the ML_removals df created by the removals programs. There are many good insights here.
Using only the newer removal data in the ML frame to predict case success, to see how it ranks:
I used a number of different decision trees in an attempt to gain more insight from the data. However, most of the results were self-explanatory. For example, below, we can see that a lower Rank usually results in a higher Weight. This makes sense because Rank was based on the success of a child's final case. A lower Rank means a less successful case result, so the child would most likely have a higher Weight for their entire case as well.
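A single tree of this kind can be sketched with rpart; a minimal example, assuming `all_numbers` carries the `Rank` and `Weight` features engineered earlier:

```r
library(rpart)
library(rpart.plot)

# Regression tree predicting Weight; as noted above, Rank tends to
# dominate the splits because the two are built from related information.
tree <- rpart(Weight ~ ., data = all_numbers, method = "anova")

# Visualize the fitted tree.
rpart.plot(tree)
```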
Geographic Heat Mapping
The goal of heat mapping the data (there were ~900 unique zip codes present) was to see if any areas particularly stuck out as having a high number of cases per capita.
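The per-capita aggregation behind the heat map can be sketched as below; `zip_population` is a hypothetical lookup table (Zip, population) drawn from the crime/financial dataset mentioned earlier:

```r
library(dplyr)

# Count cases per zip code, then scale by population to get
# cases per 1,000 residents.
per_capita <- all_numbers %>%
  count(Zip, name = "cases") %>%
  inner_join(zip_population, by = "Zip") %>%   # hypothetical table
  mutate(cases_per_1000 = 1000 * cases / population) %>%
  arrange(desc(cases_per_1000))

# Joined to zip-code polygons (e.g., an sf object keyed on Zip),
# ggplot2's geom_sf(aes(fill = cases_per_1000)) renders the heat map.
```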