/
Predicting_Retention.Rmd
191 lines (147 loc) · 6.85 KB
/
Predicting_Retention.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
---
title: "Predicting Retention"
author: "Jason Bryer, Ph.D."
date: "2019-11-14"
output:
ioslides_presentation:
smaller: yes
widescreen: yes
editor_options:
chunk_output_type: console
---
<style>
blockquote {
background: #f9f9f9;
border-left: 5px solid #ccc;
margin: 1.5em 10px;
padding: 0.5em 1.5em;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE)
library(DT)
library(ggplot2)
library(psych)
```
## Overview
In higher education, being able to predict retention is important for multiple reasons including:
* Provide support services for students "at-risk"
* Used by some college rankings (e.g. [US News](https://www.usnews.com/best-colleges))
Many colleges use admissions tests (e.g. ACT, SAT) to determine admission to the college. Our guiding research question is:
**Does addmission tests predict college retention?**
Retention is defined as:
> A measure of the rate at which students persist in their educational program at an institution, expressed as a percentage. For four-year institutions, this is the percentage of first-time bachelors (or equivalent) degree-seeking undergraduates from the previous fall who are again enrolled in the current fall. For all other institutions this is the percentage of first-time degree/certificate-seeking students from the previous fall who either re-enrolled or successfully completed their program by the current fall. ([IPEDS, 2019](https://surveys.nces.ed.gov/ipeds/VisInstructions.aspx?survey=6&id=30074&show=all))
## Data Source
The [Integrated Postsecondary Education Data System (IPEDS)](https://nces.ed.gov/ipeds/) provides information about all higher education institutions that provide Federal Finacial Aid to students.
```{r, message=FALSE, echo=FALSE}
library(ipeds)
data(surveys)
DT::datatable(surveys[,1:3], rownames = FALSE)
```
## Data Preparation
The [`ipeds`](https://github.com/jbryer/ipeds) R package provides an interface to download IPEDS data directly into R.
```{r message=FALSE, warning=FALSE, cache=TRUE}
directory <- getIPEDSSurvey('HD', 2011)
admissions <- getIPEDSSurvey("IC", 2011)
retention <- getIPEDSSurvey("EFD", 2011)
```
We will subset the columns we are interested in and rename them.
```{r}
directory <- directory[,c('unitid', 'instnm', 'sector', 'control')]
admissions <- admissions[,c('unitid', 'admcon1', 'admcon2', 'admcon7', 'applcnm',
'applcnw', 'applcn', 'admssnm', 'admssnw', 'admssn',
'enrlftm', 'enrlftw', 'enrlptm', 'enrlptw', 'enrlt',
'satnum', 'satpct', 'actnum', 'actpct', 'satvr25',
'satvr75', 'satmt25', 'satmt75', 'satwr25', 'satwr75',
'actcm25', 'actcm75', 'acten25', 'acten75', 'actmt25',
'actmt75', 'actwr25', 'actwr75')]
retention <- retention[,c('unitid', 'ret_pcf', 'ret_pcp')]
```
## Data Preparation: Rename Columns
```{r}
names(admissions) <- c("unitid", "UseHSGPA", "UseHSRank", "UseAdmissionTestScores",
"ApplicantsMen", "ApplicantsWomen", "ApplicantsTotal",
"AdmissionsMen", "AdmissionsWomen", "AdmissionsTotal",
"EnrolledFullTimeMen", "EnrolledFullTimeWomen",
"EnrolledPartTimeMen", "EnrolledPartTimeWomen",
"EnrolledTotal", "NumSATScores", "PercentSATScores",
"NumACTScores", "PercentACTScores", "SATReading25",
"SATReading75", "SATMath25", "SATMath75", "SATWriting25",
"SATWriting75", "ACTComposite25", "ACTComposite75",
"ACTEnglish25", "ACTEnglish75", "ACTMath25", "ACTMath75",
"ACTWriting25", "ACTWriting75")
names(retention) = c("unitid", "FullTimeRetentionRate", "PartTimeRetentionRate")
```
## Data Preparation: Recoding
Recode the `openadmission` and `distanceEd` variables to factors and `enrollment` to an integer.
```{r message=FALSE, warning=FALSE}
admissionsLabels = c("Required", "Recommended", "Neither requiered nor recommended",
"Do not know", "Not reported", "Not applicable")
admissions$UseHSGPA = factor(admissions$UseHSGPA, levels=c(1,2,3,4,-1,-2),
labels=admissionsLabels)
admissions$UseHSRank = factor(admissions$UseHSRank, levels=c(1,2,3,4,-1,-2),
labels=admissionsLabels)
admissions$UseAdmissionTestScores = factor(admissions$UseAdmissionTestScores, levels=c(1,2,3,4,-1,-2),
labels=admissionsLabels)
```
## Data Preparation: Merging
```{r}
ret <- merge(directory, admissions, by="unitid")
ret <- merge(ret, retention, by="unitid")
#Use schools that require or recommend admission tests
ret2 <- ret[ret$UseAdmissionTestScores %in%
c('Required', 'Recommended', 'Neither requiered nor recommended'),]
#Remove schools with low retention rates. Are these errors in the data?
ret2 <- ret2[-which(ret2$FullTimeRetentionRate < 20),]
```
```{r}
head(ret2, n = 3)
```
```{r, echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}
DT::datatable(ret2)
```
## Data Preparation: SAT and ACT Scores
IPEDS only provides the 25th and 75th percentile in SAT and ACT scores. We will use the mean of these two values as a proxy for the mean.
```{r, warning=FALSE, message=FALSE}
ret2$SATMath75 <- as.numeric(ret2$SATMath75)
ret2$SATMath25 <- as.numeric(ret2$SATMath25)
ret2$SATMath <- (ret2$SATMath75 + ret2$SATMath25) / 2
ret2$SATWriting75 <- as.numeric(ret2$SATWriting75)
ret2$SATWriting25 <- as.numeric(ret2$SATWriting25)
ret2$SATWriting <- (ret2$SATWriting75 + ret2$SATWriting25) / 2
ret2$SATTotal <- ret2$SATMath + ret2$SATWriting
ret2$NumSATScores <- as.integer(ret2$NumSATScores)
```
## Data Preparation: Selectivity
Calculate the the selectivity of the institution by calculating the acceptance rate (i.e. # admissions / # applicants).
```{r, warning=FALSE, message=FALSE}
ret2$Selectivity <- as.numeric(ret2$AdmissionsTotal) /
as.numeric(ret2$ApplicantsTotal)
ret2$UseAdmissionTestScores <- as.factor(as.character(ret2$UseAdmissionTestScores))
```
## SAT Scores vs. Full-Time Retention
```{r echo=FALSE, fig.height=5, fig.width=10, message=FALSE, warning=FALSE}
ggplot(ret2, aes(x=SATTotal, y=FullTimeRetentionRate, size=NumSATScores,
color=UseAdmissionTestScores)) +
geom_point(alpha = 0.3) +
geom_smooth(method='loess', alpha=0.1) +
xlab('Median SAT Score') + ylab('Full-Time Retention Rate') +
scale_size('Number of Scores Reported') +
scale_color_brewer('Use Admission Test Scores', palette = 2, type = 'qual')
```
## Regression Results
```{r}
lm.out <- lm(FullTimeRetentionRate ~ SATWriting + SATMath +
Selectivity + UseAdmissionTestScores,
data=ret2,
weights=ret2$NumSATScores)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
options('width' = 120)
lm.out.sum <- summary(lm.out)
printCoefmat(lm.out.sum$coefficients)
```
Adjusted $R^2$ = `r round(lm.out.sum$adj.r.squared, digits = 2)`
## Conclusion
SAT scores are a significant predictor of full-time retention rates for higher education institutions in the United States.
SAT scores and selectivity account for approximately `r round(round(lm.out.sum$adj.r.squared * 100))`% of the variance of full-time retention.