# **Absenteeism in Toms River High School East**
![banner.jpg](https://www.trschools.com/hseast/imgs/banner.jpg)

***

## Introduction
Toms River High School East (TRSHE), a comprehensive four-year high school in New Jersey, has a high rate of absenteeism. In fact, its chronic absenteeism rates have been above the NJ state averages in every grade from 9th to 12th. This kernel's objective is to better understand the underlying patterns of TRSHE absenteeism, how certain factors contribute to these patterns, and which methods are best for prediction.

T-Test to check for significance.

Outline:
1. [Data Overview](#section-one)

In [18]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

student_data = pd.read_csv('../input/dataproj/High School East Student Data - Sheet1.csv')

<a id="section-one"></a>
# 1. Data Overview

The data was collected with the help of Mrs. Anders and Dr. Kretz, both faculty of TRSHE. Certain variables were chosen because chronic absenteeism is sensitive to factors like **Limited English Proficiency, 504/IDEA Disability, Race/Ethnicity**, and the like. A large, comprehensive dataset of students in the United States, collected by the National Center for Education Statistics, exists, but many important variables were suppressed in the public-use version. Accessing the restricted data is a **rigorous** and **security-tight** process involving various academic officers that no regular person could pass. So that wasn't happening, and I resorted to my school's data.

Thus, the data collected for this project was **manually** recorded. 
***

**A6-A12**: represents absences from 6th grade to 12th grade

**T6-T12**: represents tardies from 6th grade to 12th grade

**IEP/Specialized**: represents whether a student is in special education

In [2]:
student_data.head()

Unnamed: 0,Student,English Langauge Learner,Has a Disability?,Student on Free or Reduced Lunch,Race/Ethnic,A6,A7,A8,A9,A10,...,A12,T6,T7,T8,T9,T10,T11,T12,Gender,IEP/Specialized
0,CA,Yes/No?,No,No,Asian,1,0,0,0,1,...,0.0,9,4,5,1,2,2,1,M,No
1,CI,No,No,,White,5,9,5,6,9,...,3.0,2,2,3,4,2,4,1,F,No
2,CIS,,No,,White/Hispanic,0,0,0,0,0,...,2.0,0,0,0,0,0,5,2,F,No
3,DIP,,No,,White,2,7,8,7,10,...,13.0,1,0,3,6,7,9,3,F,No
4,EA,,No,,White,7,10,7,4,1,...,1.0,1,0,0,0,5,0,0,F,No


In [15]:
print(student_data.iloc[0].values[5:12]) # absence columns
print(student_data.iloc[0].values[12:19]) #tardy columns

['1' '0' '0' '0' '1' 2.0 0.0]
['9' '4' '5' '1' 2 2 1]


Because some absence and tardy columns contain a string ("TRANSFER"), some of the values in the **A6** to **A12** and **T6** to **T12** columns are stored as strings instead of integers. So I made some adjustments and converted the affected values to ints.

https://stackoverflow.com/questions/59084770/one-hot-encoder-what-is-the-industry-norm-to-encode-before-train-split-or-after

## 1.1 Preparing Data for Graph

I make a separate copy of student_data because I will be preprocessing the data before the train/test split so that seaborn and matplotlib can work with it. Here, I convert numeric values stored as strings to int, even if the string holds a float (some students have absences like 2.5 for a given year, which is truncated to 2). I also convert "TRANSFER" to 0.

In [34]:
dataForGraph = pd.read_csv('../input/dataproj/High School East Student Data - Sheet1.csv')

#easy way of accessing A6, A7, ... columns (end is exclusive)
def column_list(letter, start, end):
    return ["%s%d" % (letter, i) for i in range(start, end)]

#convert strings to int type even if it's a float
def convertStat(x):
    # "TRANSFER" marks a year spent at another school; count it as 0
    if x == "TRANSFER":
        return 0
    if isinstance(x, int):
        return x
    try:
        #we don't know if the string holds a float or an int
        return int(x)
    except ValueError:
        #strings like "2.5" fail int(), so go through float first
        #the float is truncated, not rounded (2.5 -> 2)
        return int(float(x))
    
for i in ["A", "T"]:
    for j in column_list(i, 6, 13):
        dataForGraph[j] = dataForGraph[j].apply(convertStat)

print(list(dataForGraph.iloc[0].values[5:12])) # absence columns of first student
print(list(dataForGraph.iloc[0].values[12:19])) #tardy columns


[1, 0, 0, 0, 1, 2, 0]
[9, 4, 5, 1, 2, 2, 1]
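
To make the conversion rule concrete, here is a minimal sketch on a few single values. The helper name `to_int_count` and the sample inputs are made up for illustration; it only mirrors the idea of `convertStat` and is not the code used on the dataset. Note that `int(float(x))` truncates toward zero, which is not the same as rounding.

```python
def to_int_count(value):
    """Hypothetical helper: turn one raw cell into a whole-number count."""
    # "TRANSFER" carries no count, so it is treated as 0
    if value == "TRANSFER":
        return 0
    # going through float handles "3", "2.5" and plain numbers alike;
    # int() then drops the fractional part
    return int(float(value))

# made-up sample cells, not taken from the TRSHE data
for raw in ["3", "2.5", 2.9, "TRANSFER"]:
    print(repr(raw), "->", to_int_count(raw))

# truncation versus rounding on the same value
truncated = to_int_count(2.9)
rounded = round(2.9)
print(truncated, rounded)
```

If half days should count toward the total, rounding (or keeping the float) would be a reasonable alternative to truncation.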


In [None]:

#sums all absences and tardies from all grades
student_data['AbsentSum'] = student_data[column_list('A', 6, 13)].sum(axis=1)
student_data['TardySum'] = student_data[column_list('T', 6, 13)].sum(axis=1)

#sum absences in middle and high school
student_data['AbsencesSum_MS'] = student_data[column_list('A', 6, 9)].sum(axis=1)
student_data['AbsencesSum_HS'] = student_data[column_list('A', 9, 13)].sum(axis=1)


## 1.2 Train/Test Split and Preprocessing

Here I convert the needed values into int types that the learning models can use, and I sum the absences and tardies into new columns. All of this happens after the train/test split.


In [4]:
from sklearn.model_selection import train_test_split

random_seed = 1
test_size = 0.2

features = ["A6", "A7", "A8", "A9", "T6", "T7", "T8", "Gender", "IEP/Specialized"]
# target: total high school absences (the AbsencesSum_HS column must already have been computed)
y_objective = student_data["AbsencesSum_HS"]

#split and preprocess
train_X, val_X, train_y, val_y = train_test_split(student_data[features], y_objective, random_state=random_seed, test_size=test_size)

train_X = pd.get_dummies(train_X)
val_X = pd.get_dummies(val_X)

# get_dummies can create different columns for each set, but both need the same features:
# add the columns val_X is missing (filled with 0) and keep the column order of train_X
val_X = val_X.reindex(columns=train_X.columns, fill_value=0)

KeyError: 'AbsencesSum_HS'
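
The column alignment step in the cell above can be sketched on toy data. The two small frames below are hypothetical and only illustrate why one-hot encoding the training and validation sets separately can leave them with different columns, and how the validation set can be brought back in line.

```python
import pandas as pd

# hypothetical toy split: the validation set happens to contain no "F" rows
train = pd.DataFrame({"A6": [1, 4, 0], "Gender": ["M", "F", "F"]})
val = pd.DataFrame({"A6": [2], "Gender": ["M"]})

train_enc = pd.get_dummies(train)
val_enc = pd.get_dummies(val)

# columns that exist after encoding train but not after encoding val
missing = sorted(set(train_enc.columns) - set(val_enc.columns))
print(missing)

# add the missing columns as 0 and use the training column order
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(val_enc.columns))
```

An alternative design is to fit a single encoder on the training set and reuse it on the validation set, which avoids the mismatch in the first place.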

### To address the issue of "TRANSFER" values converting tardy and absent columns into strings:

In [None]:


# TODO: preprocess the data after the train/test split; for now it stays as is so the plots work

#convert strings to int type even if it's a float
def convertStat(x):
    # "TRANSFER" marks a year spent at another school; count it as 0
    if x == "TRANSFER":
        return 0
    if isinstance(x, int):
        return x
    try:
        #we don't know if the string holds a float or an int
        return int(x)
    except ValueError:
        #strings like "2.5" fail int(), so go through float, which truncates (2.5 -> 2)
        return int(float(x))

def convertColumnData(df):
    for i in range(6,13):
        # train_X and val_X only hold some of the A/T columns, so skip the ones that are absent
        for column in ("A%d" % i, "T%d" % i):
            if column in df.columns:
                df[column] = df[column].apply(convertStat)

#convert absences and tardies to integers in both splits
convertColumnData(train_X)
convertColumnData(val_X)


print(list(student_data.iloc[0].values[5:12])) # absence columns of first student
print(list(student_data.iloc[0].values[12:19])) #tardy columns

# 2. Evaluating Data and Preprocessing

Plotting to see if there are any glaringly obvious patterns or relationships between variables.

In [None]:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,7))
#plt.figure(figsize=(2,10))

# positions 5:12 hold the absence columns (A6 to A12)
# positions 12:19 hold the tardy columns (T6 to T12)
x_values = range(6,13)
for i in range(len(student_data)):
    absences_y = student_data.iloc[i].values[5:12]
    tardies_y = student_data.iloc[i].values[12:19]

    ax1.plot(x_values, np.array(absences_y), alpha=0.7)    
    ax2.plot(x_values, np.array(tardies_y), alpha=0.7)

ax1.set_title("Absences from 6th to 12th grade")
ax2.set_title("Tardies from 6th to 12th grade")

ax1.set_ylabel("Absences")
ax2.set_ylabel("Tardies")

ax2.set_xlabel("Grade")
ax1.set_xlabel("Grade")

plt.subplots_adjust(wspace=0.2)

As expected, there isn't a one-size-fits-all absence and tardy pattern across the grade levels. No clear pattern emerges, likely because many different factors contribute to absences over the years. But the peaks look similar, so it would be useful to see the correlation between absences and tardies. [Based on TRSHE data,](https://rc.doe.state.nj.us/report.aspx?type=school&lang=english&county=29&district=5190&school=030&schoolyear=2018-2019#P99cba7ec593f446e8cbf8d62c3db0208_2_oHit0) the chronic absenteeism rates grow throughout the grade levels. In grade 12, **21**% of students were chronically absent, 2% higher than the NJ state average. In grade 11, **20**% of TRSHE students were chronically absent, which is 6% higher than the NJ state average.
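
For reference, a per-grade chronic absenteeism rate like the ones quoted above could be computed along these lines. This is only a sketch: the 180-day school year and the 10% threshold are assumptions stated in the code, and the absence counts are made up rather than taken from the TRSHE data.

```python
import pandas as pd

SCHOOL_DAYS = 180          # assumption: number of school days in a year
CHRONIC_THRESHOLD = 0.10   # assumption: chronically absent = missing 10% or more of school days

# hypothetical absence counts, one row per student
toy = pd.DataFrame({
    "A9":  [2, 20, 5, 19],
    "A10": [1, 25, 7, 3],
})

# flag each student-year at or above the threshold, then average per grade
is_chronic = (toy / SCHOOL_DAYS) >= CHRONIC_THRESHOLD
chronic_rate = is_chronic.mean()
print(chronic_rate)
```

The same two lines would apply to the real absence columns once they hold numeric values.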

For our dataset, **what is the distribution of tardies/absences over the grades, and what are some variable relationships?**

In [None]:
plt.clf()

#easy way of accessing A_6, A_7, ... A_N columns
def column_list(letter, start, end):
    return ["%s%d" % (letter, i) for i in range(start, end)]
        
#sums all absences and tardies from all grades
student_data['AbsentSum'] = student_data[column_list('A', 6, 13)].sum(axis=1)
student_data['TardySum'] = student_data[column_list('T', 6, 13)].sum(axis=1)

#sum absences in middle and high school
student_data['AbsencesSum_MS'] = student_data[column_list('A', 6, 9)].sum(axis=1)
student_data['AbsencesSum_HS'] = student_data[column_list('A', 9, 13)].sum(axis=1)


# ============================= ABSENCES AND TARDIES PLOTTING =====================================

# left: total tardies vs total absences; right: middle school vs high school absences
fg, (ax1, ax2) = plt.subplots(1,2, figsize=(12,6))

#First plot
ax1.title.set_position([.5, 1.05])
#sns.regplot(x="TardySum", y="AbsentSum", data=student_data, ax=ax1)
sns.scatterplot(x="TardySum", y="AbsentSum", hue="Student", data=student_data, legend=False, ax=ax1)
ax1.set(xlabel="Sum of Tardies", ylabel="Sum of Absences", title="Relationship between Tardies and Absences")

#Second plot
ax2.title.set_position([.5, 1.05])
#sns.regplot(x="AbsencesSum_MS", y="AbsencesSum_HS", data=student_data, ax=ax2)
sns.scatterplot(x="AbsencesSum_MS", y="AbsencesSum_HS", hue="Student", data=student_data, legend=False, ax=ax2)
ax2.set(xlabel="Middle School Absences", ylabel="High School Absences", title="Relationship between Middle School and High School Absences")


plt.subplots_adjust(wspace=0.5)


As expected, there is a linear relationship between absences and tardies. On average, the number of tardies is less than the number of absences.
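
One way to put a number on that linear relationship is the Pearson correlation coefficient. The sketch below uses made-up totals for five students, not the TRSHE data, so the value it prints says nothing about the real dataset.

```python
import numpy as np

# hypothetical totals for five students (illustration only)
tardy_sum = np.array([3, 10, 0, 25, 8])
absent_sum = np.array([12, 30, 4, 61, 20])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(tardy_sum, absent_sum)[0, 1]
print(round(r, 3))
```

On the notebook's own frame, `student_data["TardySum"].corr(student_data["AbsentSum"])` should give the equivalent figure, assuming both columns are numeric.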

In [None]:
plt.clf()

# bin edges every 10 absences, from 0 to 90
Histbins = list(range(0,100,10))

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,6))

# histogram plus a rug of individual students
# (histplot/rugplot replace distplot, which is deprecated since seaborn 0.11)
sns.histplot(student_data["AbsencesSum_HS"], bins=Histbins, ax=ax1)
sns.rugplot(student_data["AbsencesSum_HS"], ax=ax1)
sns.boxplot(x="AbsencesSum_HS", data=student_data, ax=ax2)


#plt.xticks(np.arange(0, 95, 10))

# 3. Learning Models

## 3.1 Decision Trees/Random Forests

The objective is to find the optimal number of leaf nodes for the decision tree. The next step is to compare the models against each other to find the best one to use.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


#==================== Validate the Decision Tree Model ===========================

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    
    return mae

leaf_node_range = range(2,30)
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in leaf_node_range}
# leaf count with the lowest validation MAE
low = min(scores, key=scores.get)
student_tree_model = DecisionTreeRegressor(random_state=1, max_leaf_nodes=low)
student_tree_model.fit(train_X, train_y)

absences_predictions = student_tree_model.predict(val_X)
mae = mean_absolute_error(val_y, absences_predictions)

print("Validation MAE: {:,.0f}".format(mae))
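
For readers unfamiliar with the metric, mean absolute error is the average size of the prediction errors, in the same unit as the target (here, absences). A minimal hand-rolled sketch with made-up numbers, where the function name is hypothetical and is not scikit-learn's implementation:

```python
def mae_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# hypothetical high school absence totals and model predictions
actual = [10, 4, 22]
predicted = [12, 4, 16]

# absolute errors are 2, 0 and 6, so the mean is 8 / 3
result = mae_by_hand(actual, predicted)
print(result)
```

A lower MAE means the predicted absence totals are, on average, closer to the real ones.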

https://towardsdatascience.com/why-random-forests-outperform-decision-trees-1b0f175a0b5
https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/

https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

random_forest = RandomForestRegressor(random_state=random_seed,
                                    max_features="log2",
                                    n_estimators=400,
                                    max_leaf_nodes=low,
                                    min_samples_split=12,
                                    min_samples_leaf=40)

random_forest.fit(train_X, train_y)
absences_predictions = random_forest.predict(val_X)
print(mean_absolute_error(val_y, absences_predictions))

# 5-fold cross-validated MAE of the random forest on the training set
# (scikit-learn reports negated MAE, hence the sign flip)
scores = -1 * cross_val_score(random_forest, train_X, train_y,
                              cv=5,
                            scoring='neg_mean_absolute_error')

print(scores.mean())
print(list(scores))



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor

models = []
# note: LogisticRegression is a classifier, so each distinct absence total is treated as a class label
models.append(('LogisticRegression', LogisticRegression()))

def runModels(models, trainX, trainY, valY, cv):
    cross_valid = []
    for name, model in models:
        # scikit-learn reports negated MAE, hence the sign flip
        scores = -1 * cross_val_score(model, 
            trainX, 
            trainY,
            cv=cv,
            scoring="neg_mean_absolute_error")

        cross_valid.append((name, scores.mean(), scores))

    return cross_valid

logReg = LogisticRegression(random_state=1, max_iter=10000)
logReg.fit(train_X, train_y)
absences_predictions = logReg.predict(val_X)
print(mean_absolute_error(val_y, absences_predictions))


In [None]:
from sklearn.svm import SVC

svcmodel = SVC(random_state=random_seed)
svcmodel.fit(train_X, train_y)
absences_predictions = svcmodel.predict(val_X)
print(mean_absolute_error(absences_predictions, val_y))

scores = -1 * cross_val_score(svcmodel, train_X, train_y,
                              cv=3,
                            scoring='neg_mean_absolute_error')

print(scores.mean())
print(list(scores))

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor(random_state=random_seed, n_estimators=2000, learning_rate=0.0006)
xgb.fit(train_X, train_y)
absences_predictions = xgb.predict(val_X)
print(mean_absolute_error(absences_predictions, val_y))


scores = -1 * cross_val_score(xgb, train_X, train_y,
                              cv=4,
                            scoring='neg_mean_absolute_error')

print(scores.mean())
print(list(scores))


In [None]:
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(train_X, train_y)
absences_predictions = linear.predict(val_X)
print(mean_absolute_error(absences_predictions, val_y))

scores = -1 * cross_val_score(linear, train_X, train_y,
                              cv=4,
                            scoring='neg_mean_absolute_error')

print(scores.mean())

print(list(scores))


In [None]:
from sklearn.svm import SVR

In [None]:
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor()
neigh.fit(train_X, train_y)
absences_predictions = neigh.predict(val_X)
print(mean_absolute_error(absences_predictions, val_y))
scores = -1 * cross_val_score(neigh, train_X, train_y,
                              cv=5,
                            scoring='neg_mean_absolute_error')

print(scores.mean())
print(list(scores))
