# PH 245 Final Project - Flu Absenteeism 

In [None]:
library(data.table)
library(boot)

prefix = "../absentee/Combined-data/"
filenames = c("absentee_all.csv","absentee-flu.csv", "absentee-nonflu.csv", "ILIData_CA_201101_201739.csv",
              "absentee.RData"
             )

In [None]:
# Loading Data (using high-speed data.tables)
absenteeData = fread( file=paste(prefix, filenames[1], sep=""), stringsAsFactors=TRUE )

In [None]:
head(absenteeData)
colnames(absenteeData)

# Creating a smaller sample for use until final analysis
#absenteeData = absenteeData[sample(.N, 1000000)]
nrow(absenteeData)

In [None]:
# Cleaning data and adding more useful variables

absenteeData[, date := as.Date(date, "%d%b%Y")]
absenteeData[, month := as.numeric(format(date, "%m"))]
absenteeData[, week := week(date)]
absenteeData[, yr := year(date)]

absenteeData$fluseasCDC = ifelse(absenteeData$month <= 4 | absenteeData$month >= 10, 1, 0)

absenteeData$dist.n = ifelse(absenteeData$dist == "OUSD", 1, 0)

absenteeData$grade = as.factor(absenteeData$grade)

absenteeData$race <- factor(absenteeData$race, levels = c("White","African American",
      "Asian","Latino","Multiple Ethnicity","Native American","Not Reported",
      "Pacific Islander"))

# Since WCCUSD has different labeling and fewer races reported than OUSD, 
# reduce all races to a common subset for uniformity
absenteeData = absenteeData[race %in% c("Native American", "Multiple Ethnicity", "Not Reported"), 
                            race := "Don't know Other"]

# The sum of any row will be 0 if there was no absence 
# or 1 if there was an absence for any reason
absenteeData$absence = absenteeData$absent_nonill + absenteeData$absent_ill

# End result
head(absenteeData)

#### Exploratory Data Analysis (EDA)
The first and most important step is to examine how many absences occurred in total. Then we'll break them down by flu season versus the rest of the year.

Absences are defined within the absent_nonill and absent_ill columns. Both columns having a 0 means the student was present. A 1 appears in one of the columns if there was an absence.

In examining our dataset, some other good things to understand include racial breakdown and grade distribution.
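
As a small illustration of the absence definition above, the sketch below builds a toy table and derives the combined indicator. The values are made up and only the column names come from the dataset. It also checks the assumption that no record is flagged in both columns at once, which is what keeps the sum at 0 or 1.

```r
# Toy records (hypothetical values); columns mirror absent_nonill / absent_ill
toy = data.frame(
    absent_nonill = c(0, 1, 0, 0),
    absent_ill    = c(0, 0, 1, 0)
)

# Combined all-cause indicator: 0 = present, 1 = absent for any reason
toy$absence = toy$absent_nonill + toy$absent_ill

# Assumption check: no row is flagged in both columns, so absence is 0 or 1
stopifnot(all(toy$absence %in% c(0, 1)))

# Proportion of toy records with an absence
absenceRate = mean(toy$absence)
absenceRate
```

If the check fails on real data, the sum would contain values of 2 and the indicator would need to be capped, for example with `pmin(absence, 1)`.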

In [None]:
# Beginning Exploratory Data Analysis
summary(absenteeData)

In [None]:
pieAbsenceBreakdown = function(data, pieTitle) {
    # Creates a pie chart of the absences and presences in dataset
    numAbsences = sum(data$absence)
    numPresences = length(data$absence) - numAbsences
    rawBreakdown = c(numAbsences, numPresences)
    
    piePercent = paste(round(100*rawBreakdown/sum(rawBreakdown), 2), "%", sep="")
    
    pie(rawBreakdown, 
        labels=piePercent, 
        col=rainbow(length(rawBreakdown)),
        main=pieTitle
       )
    
    legend("topright", 
           c("Absences","Presences"), 
           fill=rainbow(length(rawBreakdown))
          )
}

# Examining total absence/presence breakdown
pieAbsenceBreakdown(data=absenteeData, pieTitle="All Year Absence/Presence breakdown")


# Examining flu-specific absence/presence breakdown
fluData = absenteeData[fluseasCDC==1]
nonFluData = absenteeData[fluseasCDC==0]

pieAbsenceBreakdown(data=fluData, pieTitle="Flu Season Absence/Presence breakdown")
pieAbsenceBreakdown(data=nonFluData, pieTitle="NonFlu Season Absence/Presence breakdown")

In [None]:
# Creating a pie chart of ethnicities

races = absenteeData[,.N,by="race"]
piePercent2 = paste(round(100*races$N/sum(races$N), 2), "%", sep="")

pie(x=races$N, labels=piePercent2, col=rainbow(length(races$race)), cex = 0.4)
legend("topright", legend=races$race, fill=rainbow(length(races$race)), cex = 0.6, title="Ethnic breakdown")
races

In [None]:
# Examining overall grade distribution
grades = absenteeData[,.N,by="grade"][order(grade)]

barplot(grades$N, names.arg=grades$grade)

In [None]:
# Sixth graders are all from one district - drop all sixth graders
sixthGraders = absenteeData[grade==6]
unique(sixthGraders$dist)

head(sixthGraders)

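fullNumRows = nrow(absenteeData)
absenteeData = absenteeData[grade != 6]
print(paste("Lost", (fullNumRows-nrow(absenteeData)), "rows in eliminating sixth graders.", 
            nrow(absenteeData), "rows remain")
     )

# Rebuild the seasonal subsets so that later analyses also exclude sixth graders
fluData = absenteeData[fluseasCDC==1]
nonFluData = absenteeData[fluseasCDC==0]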

#### Interpreting Our EDA Results

Absences make up a relatively small share of our overall dataset (this is good!). Because the sample is so large, we still have plenty of absences to examine.

The first thing we did was examine the overall share of absences during flu season versus the nonflu season. As one would expect, flu season had a slightly greater percentage of students absent.

In the rest of our EDA, we explored the ethnic breakdown and grade distribution of our dataset. Our subject population differs considerably in ethnic breakdown from the United States as a whole, so how well our project's findings generalize to populations with different breakdowns is less certain.

Also note that our 6th grade population is so small because only one of the two school districts contributed data for that grade, so for this analysis we'll proceed with grades K-5 only. 
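
One way to spot a grade that only one district reports is to cross-tabulate grade against district and look for zero cells. The sketch below does this on toy records. The values, and which district supplies the sixth graders, are hypothetical.

```r
# Toy records (hypothetical); the real columns are grade and dist
toy = data.frame(
    grade = c("K", "K", "5", "5", "6", "6"),
    dist  = c("OUSD", "WCCUSD", "OUSD", "WCCUSD", "OUSD", "OUSD")
)

# Cross-tabulate grade against district
gradeByDist = table(toy$grade, toy$dist)

# Grades that have zero records in at least one district
singleDistrictGrades = rownames(gradeByDist)[apply(gradeByDist == 0, 1, any)]
singleDistrictGrades
```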


#### Analyzing Absenteeism Variation among Matched Schools
To continue, let's try to understand how much variation in absenteeism there was between matched schools during the nonflu season. This will be important as a baseline for analyzing the variation between the same matched schools during flu season, when the intervention took place. Schools that were matched have a matchid that is *not* 0. 
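
Differencing two districts row by row only works if both tables list exactly the same pairs in the same order. As a hedged alternative, the sketch below joins the two districts on matchid before differencing, so rows cannot misalign. The data frame and its values are hypothetical, and it assumes one school per district per matchid.

```r
# Toy school-level averages (hypothetical values); columns mirror matchid, dist, absenceAverage
toyAverages = data.frame(
    matchid        = c(1, 1, 2, 2, 3),
    dist           = c("OUSD", "WCCUSD", "WCCUSD", "OUSD", "OUSD"),
    absenceAverage = c(0.05, 0.04, 0.06, 0.03, 0.07)
)

ousd   = toyAverages[toyAverages$dist == "OUSD",   c("matchid", "absenceAverage")]
wccusd = toyAverages[toyAverages$dist == "WCCUSD", c("matchid", "absenceAverage")]

# Inner join on matchid: a school without a partner (matchid 3 here) drops out
# instead of shifting the remaining rows out of alignment
pairs = merge(ousd, wccusd, by = "matchid", suffixes = c(".OUSD", ".WCCUSD"))
pairs$difference = pairs$absenceAverage.OUSD - pairs$absenceAverage.WCCUSD
pairs[, c("matchid", "difference")]
```

A positive difference means the OUSD school in the pair had the higher absence proportion.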

In [None]:
# Calculating the average percentage of absences per school
# For now, we'll only include the intervention time period
nonFluDataInterventionTime = nonFluData[nonFluData$yr > 2014 | nonFluData$schoolyr == "2014-15"]

nonFluAbsenceAverages = nonFluDataInterventionTime[,.(absenceAverage=mean(absence)),by=c("matchid", "dist", "school")][order(matchid, dist)]
head(nonFluAbsenceAverages)
tail(nonFluAbsenceAverages)

In [None]:
# Drop schools that were not matched by the matching algorithm and group by matchid
nonFluMatchedAbsenceAverages = nonFluAbsenceAverages[matchid != 0][order(matchid, dist)]
head(nonFluMatchedAbsenceAverages)

In [None]:
# Let's find the baseline difference between the two groups for each matched school

OUSDNonFlu = nonFluMatchedAbsenceAverages[dist=="OUSD"][order(matchid)]
WCCUSDNonFlu = nonFluMatchedAbsenceAverages[dist=="WCCUSD"][order(matchid)]

differenceNonFlu = OUSDNonFlu[,difference:=(OUSDNonFlu$absenceAverage - WCCUSDNonFlu$absenceAverage)][,c("matchid", "difference")]
head(differenceNonFlu)
barplot(differenceNonFlu$difference)

print("Mean difference in proportion of absences between matched pairs of schools during nonflu season")
mean(differenceNonFlu$difference)

In [None]:
# Now, let's repeat the same set of steps to analyze whether the intervention seemed to have any effect.
# We would expect OUSD, which had the intervention, to have absenteeism less impacted by illness. 
# On the other hand WCCUSD, which did not have any intervention
# would have greater absenteeism as flu became more prevalent during flu season. 
# Thus, we would expect a downward shift in the barplot
fluDataInterventionTime = fluData[fluData$yr > 2014 | fluData$schoolyr == "2014-15"]


fluAbsenceAverages = fluDataInterventionTime[,.(absenceAverage=mean(absence)),by=c("matchid", "dist", "school")][order(matchid, dist)]
fluMatchedAbsenceAverages = fluAbsenceAverages[matchid != 0][order(matchid, dist)]
OUSDFlu = fluMatchedAbsenceAverages[dist=="OUSD"][order(matchid)]
WCCUSDFlu = fluMatchedAbsenceAverages[dist=="WCCUSD"][order(matchid)]

differenceFlu = OUSDFlu[,difference:=(OUSDFlu$absenceAverage - WCCUSDFlu$absenceAverage)][,c("matchid", "difference")]
head(differenceFlu)
barplot(differenceFlu$difference, col="black")

print("Mean difference in proportion of absences between matched pairs of schools during flu season")
mean(differenceFlu$difference)

# Calculate the proportion of matched pairs where the expected "downward shift" during flu season occurred
print("Proportion of matched pairs with expected downward shift:")
sum(differenceFlu$difference < differenceNonFlu$difference)/length(differenceFlu$difference)

#### Interpreting the result

This is... mildly worrying, if I'm interpreting the data correctly, though the test we ran was rather informal and intended only to check whether the data fit our intuitions. It seems that schools receiving the intervention actually had a larger increase in absenteeism during flu season, relative to the rest of the year, than the matched control schools that did not receive the intervention. Our analysis did not look at illness-specific data (which is pretty important for drawing an actual conclusion), but the trends in the data are very counterintuitive. 
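
The comparison described above amounts to a difference of differences: for each matched pair, take the OUSD minus WCCUSD gap during flu season and subtract the same gap from the nonflu season. A minimal sketch on made-up numbers (none of these values come from our data):

```r
# Hypothetical per-pair differences (OUSD minus WCCUSD) in absence proportion
toyDiffs = data.frame(
    matchid          = 1:4,
    differenceNonFlu = c(0.010, -0.005, 0.000, 0.020),
    differenceFlu    = c(0.015, -0.010, 0.005, 0.018)
)

# Change in the OUSD-WCCUSD gap from nonflu season to flu season;
# negative values are the "downward shift" expected if the intervention helped
toyDiffs$shift = toyDiffs$differenceFlu - toyDiffs$differenceNonFlu

meanShift = mean(toyDiffs$shift)
proportionDownward = mean(toyDiffs$shift < 0)
```

A positive `meanShift` corresponds to the counterintuitive pattern: the intervention district's gap widened during flu season rather than narrowing.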


#### Moving Forward
Nevertheless, we'll move on to fitting linear and logistic regression models to estimate how certain factors affect all-cause and illness-specific absenteeism. 

In [None]:
# Since we're generating predictions with regression, need to bring in other school-specific variables to fit on

getSchoolData = function(aggregationData, dropColumns, aggregationColumns) {
    oldw <- getOption("warn")
    options(warn = -1)
    
    cleanAggregationData = aggregationData[,(dropColumns):=NULL]
    groupedSchoolData = cleanAggregationData[,head(.SD, 1),by=aggregationColumns]
    
    options(warn = oldw)
    
    print(paste("Data collected for", nrow(groupedSchoolData), "schools"))

    return(groupedSchoolData)
}

# Dropping irrelevant columns (for specific schools) from aggregation data
dropColumns = c("V1", "schoolyr", "date", "grade", "race", "absent_nonill", "absent_ill",
                "matchid", "month", "flusesn", "absent_all", "weekending", "peakwk", "week", "yr",
                "fluseasCDPH", "fluseasCDC"
               )

aggregationColumns = c("dist", "school", "enrolled") # Unique identifying key for a school

#load(file = paste(prefix, filenames[5], sep=""))
attach(paste(prefix, filenames[5], sep="")); 
flu = flu; 
detach()

schoolData = getSchoolData(aggregationData=flu, dropColumns=dropColumns, aggregationColumns=aggregationColumns)
head(schoolData)
colnames(schoolData)

In [None]:
# Merging school-level data into our set of student-level records
combinedFluDataInterventionTime = merge(x=fluDataInterventionTime[matchid!=0,!c("schoolyr", "date", "absence")],
                                        y=schoolData, 
                                        by=c("dist", "school", "dist.n")
                                       )
head(combinedFluDataInterventionTime)
colnames(combinedFluDataInterventionTime)

In [None]:
# Fitting logistic regression for illness-specific absenteeism and nonspecific absenteeism
# family=binomial is required: glm() defaults to gaussian, which would fit a linear model instead

glm.log.ill = glm(absent_ill~., family=binomial, data=combinedFluDataInterventionTime[,!c("dist", "school", "absent_nonill", "matchid")])
glm.log.nonill = glm(absent_nonill~., family=binomial, data=combinedFluDataInterventionTime[,!c("dist", "school", "absent_ill", "matchid")])

summary(glm.log.ill)
summary(glm.log.nonill)

In [None]:
# Using 2-fold cross-validation to estimate the prediction error of our two models
# (cv.glm's default cost is the average squared error, returned in $delta)

oldw <- getOption("warn")
options(warn = -1)

cv.log.ill.predError = cv.glm(data=combinedFluDataInterventionTime[,!c("dist", "school", "absent_nonill", "matchid")],
                              glmfit = glm.log.ill,
                              K=2
                             )$delta

cv.log.nonill.predError = cv.glm(data=combinedFluDataInterventionTime[,!c("dist", "school", "absent_ill", "matchid")],
                              glmfit = glm.log.nonill, 
                              K=2
                             )$delta

options(warn = oldw)

cv.log.ill.predError
cv.log.nonill.predError

#### Logistic Regression Interpretation

Though our cross-validated prediction errors are very low, it's important to recognize how imbalanced our data was to begin with. We started with a dataset composed of < 5% absences, so a naive model that simply guesses "present" every time would still reach 95%+ accuracy. The model is thus able to pick up on some of the variables that matter for the classification, but the skew toward one class distorts its view of which variables are most important. That said, dist.n *is* thankfully one of the significant predictors, though that should be taken with a grain of salt given the above. 
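
To make the imbalance point concrete, the sketch below computes two baselines on a made-up outcome vector with a 5% absence rate: the accuracy of always guessing "present", and the mean squared error of an intercept-only model that predicts the base rate for everyone. The second number is the one to compare a cross-validated squared-error estimate against. The vector is hypothetical and not drawn from our data.

```r
# Hypothetical outcome vector: 5 absences out of 100 student-days
y = rep(c(1, 0), times = c(5, 95))

# Accuracy of a naive classifier that always predicts "present" (0)
naiveAccuracy = mean(y == 0)

# Mean squared error of an intercept-only model that predicts the base rate for everyone
baseRate = mean(y)
nullMSE = mean((y - baseRate)^2)

c(naiveAccuracy = naiveAccuracy, nullMSE = nullMSE)
```

A fitted model only adds predictive value to the extent that its error falls below `nullMSE`, which equals `baseRate * (1 - baseRate)` for a 0/1 outcome.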

To further explore whether Shoo-the-flu had an impact:

#### Multiple Linear Regression on All-Cause and Illness-Specific School-level Absenteeism

In [None]:
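# Having fit a logistic regression model, a multiple linear regression model may now help us discern
# effects of many of these variables on absenteeism percentage by school
# Note: yr=yr below is not aggregated, so each group keeps one row per original record,
# with the group means repeated on every row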

granularSchoolAbsenceAverages = absenteeData[,.(absenceAverage=mean(absence)*100, yr=yr, 
                                                illnessAbsenceAverage=mean(absent_ill)*100),
                                             by=c("matchid", "dist", "school", "schoolyr", "fluseasCDC")][order(matchid, dist)]
head(granularSchoolAbsenceAverages)
tail(granularSchoolAbsenceAverages)

In [None]:
# Merging school level data into our set of all-cause absenteeism  
combinedGranularSchoolAbsenceAverages = merge(x=granularSchoolAbsenceAverages,
                                              y=schoolData,
                                              by=c("dist", "school")
                                             )

head(combinedGranularSchoolAbsenceAverages)
tail(combinedGranularSchoolAbsenceAverages)
colnames(combinedGranularSchoolAbsenceAverages)

In [None]:
# Marking rows where schools were under intervention - the hope is of course that
# intervention contributes significantly to each type of absenteeism prediction
# The earlier filters use the label "2014-15", so both spellings of the school year are accepted here

combinedGranularSchoolAbsenceAverages = combinedGranularSchoolAbsenceAverages[
    ,"intervention":= ifelse( (yr>2014|schoolyr %in% c("2014-15", "2014-2015")), dist.n, 0)]

print("Proportion of all rows under intervention: ")
mean(combinedGranularSchoolAbsenceAverages$intervention)

In [None]:
glm.linReg.absenceAverage = glm(absenceAverage~., data=combinedGranularSchoolAbsenceAverages[,!c("dist", "school", "matchid", "illnessAbsenceAverage")])

glm.linReg.illnessAbsenceAverage = glm(illnessAbsenceAverage~., data=combinedGranularSchoolAbsenceAverages[,!c("dist", "school", "matchid", "absenceAverage")])

summary(glm.linReg.absenceAverage)
summary(glm.linReg.illnessAbsenceAverage)

In [None]:
print("Cross Validation Linear Regression Prediction Error for all cause absenteeism:")
cv.linReg.absenceAverage.predError = cv.glm(data=combinedGranularSchoolAbsenceAverages[,!c("dist", "school", "matchid", "illnessAbsenceAverage")],
                              glmfit = glm.linReg.absenceAverage,
                              K=2
                             )$delta

cv.linReg.absenceAverage.predError[1]

print("Compare to the mean percentage of all-cause absenteeism across schools:")
mean(combinedGranularSchoolAbsenceAverages$absenceAverage)

In [None]:
print("Cross Validation Linear Regression Prediction Error for illness-specific absenteeism:")
cv.linReg.illnessAbsenceAverage.predError = cv.glm(data=combinedGranularSchoolAbsenceAverages[,!c("dist", "school", "matchid", "absenceAverage")],
                              glmfit = glm.linReg.illnessAbsenceAverage,
                              K=2
                             )$delta

cv.linReg.illnessAbsenceAverage.predError[1]

print("Compare to the mean percentage of illness-specific absenteeism across schools:")
mean(combinedGranularSchoolAbsenceAverages$illnessAbsenceAverage)

#### Interpreting our linear regression

Based on our cross-validation estimates, our linear regression model isn't awful, but it isn't great either at using these school-level variables to predict either type of absenteeism, and the residuals are substantial. Unfortunately, we are no closer to discovering how important our intervention variable really is. We can only note that it was a significant contributor to the regression, but since every other variable was as well... that doesn't say much. Our regression does, however, allow us to predict (albeit with a very large margin of error) average absenteeism over any given time period at the school level. This, of course, has the potential to highlight schools in areas that require 