# **Analysis Challenge Assignment 2** 
by **Kan Yamane, Wendy Weng and Pooja Addla.**

In the following analysis, our team built a classifier to predict on-task or off-task behavior of K-4 students in 22 classrooms across 5 local charter schools, using the dataset `aca2_dataset_training.csv`.

The dataset consisted of:

**1) General Variables:**
*  UNIQUEID: The unique id for each observation
*  SCHOOL: School name. Five schools in total.
*  Class: Classroom name
*  GRADE: Grade level, 0 = Kindergarten; 1 = First Grade;...
*  STUDENTID: 1226 unique students in total.
*  Gender: 0 = Female, 1 = Male

**2) Observation Variables:**
*  CODER: The coder who coded on/off task behavior.
*  OBSNUM: The sequence number of the observation made on a student. 1 = The first observation on the student;... 32 = The 32nd observation on the student.
*  Activity: Six different formats of activity: (1) individual work, (2) small-group or partner work, (3) whole-group instruction at desks, (4) whole-group instruction while sitting on the carpet, (5) dancing, and (6) testing
*  ONTASK: Y = On-task; N = Off-task
*  Total Time: Duration of each activity in seconds. 0 means the instruction was given but the activity did not actually happen.

**3) Class Session Variables:**
*  totalobs-forsession: Total observations made per session.
*  NumACTIVITIES: Number of activities in one session.
*  TRANSITIONS: Number of times the activity changed in one session. Transitions were noted every time the teacher paused instruction to change from one activity to another (e.g., transitioning from working on a math problem to listening to a short story).
*  NumFORMATS: Number of activity formats used in one session.
*  FORMATchanges: Number of times the format of instruction changed in one session.
*  Obsv/act: The average duration of an instructional activity (sec): the total duration of an observation session divided by the number of activities.
*  Transitions/Durations: Transition rate per session: the total number of activities divided by the duration of an observation session (sec).




In [None]:
# Upload and name the validation dataset 'original_va', and the training dataset 'original_tr'
import pandas as pd
original_va = pd.read_csv("aca2_dataset_validation.csv")
original_tr = pd.read_csv("aca2_dataset_training.csv")

print(original_va.info())
print(original_tr.info())
# We find that the data is relatively clean with no missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5547 entries, 0 to 5546
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   UNIQUEID               5547 non-null   int64  
 1   SCHOOL                 5547 non-null   object 
 2   Class                  5547 non-null   object 
 3   GRADE                  5547 non-null   int64  
 4   CODER                  5547 non-null   object 
 5   STUDENTID              5547 non-null   int64  
 6   Gender                 5547 non-null   int64  
 7   OBSNUM                 5547 non-null   int64  
 8   totalobs-forsession    5547 non-null   int64  
 9   Activity               5547 non-null   object 
 10  ONTASK                 5547 non-null   object 
 11  TRANSITIONS            5547 non-null   int64  
 12  NumACTIVITIES          5547 non-null   int64  
 13  FORMATchanges          5547 non-null   int64  
 14  NumFORMATS             5547 non-null   int64  
 15  Obsv

Since there were no missing values in the original datasets, we clean the data further by turning the categorical variables into dummy variables (with values 1 and 0) and removing redundant columns.

In [None]:
#Turn the categorical variables ('SCHOOL', 'GRADE', 'Activity', 'ONTASK') into dummy variables across multiple columns
df_va = pd.get_dummies(original_va, columns = ['SCHOOL', 'GRADE', 'Activity', 'ONTASK'])
df_tr = pd.get_dummies(original_tr, columns = ['SCHOOL', 'GRADE', 'Activity', 'ONTASK'])

#Keep 'ONTASK_Y' as a single 'ONTASK' column containing the dependent boolean variable
df_va = df_va.rename(columns = {'ONTASK_Y': 'ONTASK'})
df_tr = df_tr.rename(columns = {'ONTASK_Y': 'ONTASK'})

#Turn 'NumACTIVITIES', 'NumFORMATS', and 'Obsv/act' into dummy variables for analysis
#A data point above the mean of its variable is coded 1; 0 otherwise
df_va['NUMACTIVITIES_HIGH'] = (df_va['NumACTIVITIES'] > df_va['NumACTIVITIES'].mean()).astype(int)
df_tr['NUMACTIVITIES_HIGH'] = (df_tr['NumACTIVITIES'] > df_tr['NumACTIVITIES'].mean()).astype(int)
df_va['NUMFORMATS_HIGH'] = (df_va['NumFORMATS'] > df_va['NumFORMATS'].mean()).astype(int)
df_tr['NUMFORMATS_HIGH'] = (df_tr['NumFORMATS'] > df_tr['NumFORMATS'].mean()).astype(int)
df_va['OBSV/ACT_HIGH'] = (df_va['Obsv/act'] > df_va['Obsv/act'].mean()).astype(int)
df_tr['OBSV/ACT_HIGH'] = (df_tr['Obsv/act'] > df_tr['Obsv/act'].mean()).astype(int)

#Drop columns unnecessary for classification analysis
drop_cols = ['UNIQUEID',
             'Class',
             'STUDENTID',
             'CODER',
             'OBSNUM',
             'ONTASK_N',
             'totalobs-forsession',
             'NumACTIVITIES',
             'TRANSITIONS',
             'NumFORMATS',
             'FORMATchanges',
             'Obsv/act',
             'Transitions/Durations',
             'Total Time']
df_va = df_va.drop(drop_cols, axis = 1)
df_tr = df_tr.drop(drop_cols, axis = 1)

#Turn all column names to lowercase
df_va.columns = df_va.columns.str.lower()
df_tr.columns = df_tr.columns.str.lower()
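
One caveat with the cleaning cell above: dummy-encoding the two files separately yields identical columns only if every category appears in both files, and each file is thresholded at its own mean. A minimal sketch of one alternative follows. It reindexes the validation dummies to the training columns and reuses the training mean as the cut-off. The frames `toy_tr` and `toy_va` and their values are made up for illustration and are not taken from the assignment data.

```python
import pandas as pd

# Toy stand-ins for the training and validation files (hypothetical values)
toy_tr = pd.DataFrame({"Activity": ["Dancing", "Testing", "Individual"],
                       "NumACTIVITIES": [2, 4, 9]})
toy_va = pd.DataFrame({"Activity": ["Testing", "Individual"],
                       "NumACTIVITIES": [3, 6]})

# Dummy-encode each frame; 'Dancing' never occurs in toy_va,
# so toy_va initially lacks the 'Activity_Dancing' column
dum_tr = pd.get_dummies(toy_tr, columns=["Activity"])
dum_va = pd.get_dummies(toy_va, columns=["Activity"])

# Give the validation frame exactly the training columns;
# dummies missing from validation are filled with 0
dum_va = dum_va.reindex(columns=dum_tr.columns, fill_value=0)

# Use the training mean as the cut-off for both frames
threshold = dum_tr["NumACTIVITIES"].mean()
dum_tr["NUMACTIVITIES_HIGH"] = (dum_tr["NumACTIVITIES"] > threshold).astype(int)
dum_va["NUMACTIVITIES_HIGH"] = (dum_va["NumACTIVITIES"] > threshold).astype(int)

print(list(dum_va.columns))
```

Reusing the training threshold keeps the meaning of "high" the same in both sets, which matters when a model fitted on one set is scored on the other.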

We now train and test several classifiers (logistic regression, decision tree, and naive Bayes) on the cleaned data.

In [None]:
#Create functions to train and test logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

#Function to create a logistic regression model given a list of the independent variable column names
y_tr = df_tr['ontask'].to_numpy()
def new_log_reg (x_cols):
    Xs_tr = df_tr[x_cols].to_numpy()
    log_reg_model = LogisticRegression()
    log_reg_model.fit(Xs_tr, y_tr)
    return log_reg_model

#Function to test a logistic regression model
y_va = df_va['ontask'].to_numpy()
def test_log_reg (x_cols, model):
    Xs_va = df_va[x_cols].to_numpy()
    pred = model.predict(Xs_va)
    print(confusion_matrix(y_va, pred))
    print(f"Accuracy Score: {accuracy_score(y_va, pred)}")
    
#Put function together
def log_reg_tester (x_cols):
    new_model = new_log_reg(x_cols)
    test_log_reg(x_cols, new_model)

In [None]:
#Create functions to train and test decision tree model
from sklearn.tree import DecisionTreeClassifier

#Function to create a decision tree model given a list of the independent variable column names
y_tr = df_tr['ontask']
def new_dec_tree (x_cols):
    Xs_tr = df_tr[x_cols]
    dec_tree_model = DecisionTreeClassifier()
    dec_tree_model.fit(Xs_tr, y_tr)
    return dec_tree_model

#Function to test a decision tree model
y_va = df_va['ontask']
def test_dec_tree(x_cols, model):
    Xs_va = df_va[x_cols]
    pred = model.predict(Xs_va)
    print(confusion_matrix(y_va, pred))
    print(f"Accuracy Score: {accuracy_score(y_va, pred)}")
    
#Put function together
def dec_tree_tester (x_cols):
    new_model = new_dec_tree(x_cols)
    test_dec_tree(x_cols, new_model)

In [None]:
#Create functions to train and test naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Function to create a naive Bayes model given a list of the independent variable column names
def new_naive_bayes (x_cols):
    Xs_tr = df_tr[x_cols]
    naive_bayes_model = GaussianNB()
    naive_bayes_model.fit(Xs_tr, y_tr)
    return naive_bayes_model

#Function to test a naive Bayes model
def test_naive_bayes(x_cols, model):
    Xs_va = df_va[x_cols]
    pred = model.predict(Xs_va)
    print(confusion_matrix(y_va, pred))
    print(f"Accuracy Score: {accuracy_score(y_va, pred)}")
    
#Put function together
def naive_bayes_tester (x_cols):
    new_model = new_naive_bayes(x_cols)
    test_naive_bayes(x_cols, new_model)

After conducting significance tests on the chosen variables (independently using SPSS) in the various models above, we removed variables with p-value > 0.05. The following is the model that yielded the highest accuracy score.

In [None]:

#Model with highest accuracy score upon testing
x_cols = [
'school_a',
'school_b',
'school_c',
'school_d',
'school_e',
'grade_0',
'grade_1',
'grade_2',
'grade_3',
'grade_4',
'activity_dancing',
'activity_smallgroup',
'activity_testing',
'activity_wholedesks',
'numactivities_high',
'obsv/act_high'
]
dec_tree_tester(x_cols)

[[ 174 1675]
 [ 132 3566]]
Accuracy Score: 0.6742383270236164
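
The confusion matrix printed above can be unpacked by hand to see where the accuracy comes from. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, ordered 0 then 1) and that class 1 is the retained `ontask` dummy. It uses only the four counts shown in the output.

```python
# Counts copied from the printed confusion matrix
# (assumed layout: rows = true class, columns = predicted class)
cm = [[174, 1675],
      [132, 3566]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Majority-class baseline: always predict the more frequent true class
class_totals = [sum(row) for row in cm]
baseline = max(class_totals) / total

# Recall for each true class, and their mean (balanced accuracy)
recall = [cm[i][i] / class_totals[i] for i in range(2)]
balanced_accuracy = sum(recall) / 2

print(f"accuracy          = {accuracy:.4f}")           # 0.6742
print(f"majority baseline = {baseline:.4f}")           # 0.6667
print(f"recall class 0    = {recall[0]:.4f}")          # 0.0941
print(f"recall class 1    = {recall[1]:.4f}")          # 0.9643
print(f"balanced accuracy = {balanced_accuracy:.4f}")  # 0.5292
```

Under these assumptions, the model recovers about 96% of class 1 but only about 9% of class 0, so most of its accuracy comes from predicting the majority class.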


**Observation/Insights**  

The accuracy score above indicates that our classifier predicts on-task or off-task behavior correctly 67.42% of the time. This is only slightly better than a 'blind guess'. With a blind guess, the optimal strategy is to always predict that the student will pay attention, since students pay attention in the majority of observations. This naive strategy yields a 66.67% accuracy rate (the share of on-task observations in the validation data), so our model improves on it by less than one percentage point.  
There are several insights we can gather from the variables used in the final model. To begin with, the accuracy rates were higher when the “school” variables were included. This suggests that attention rates may differ between schools; simply put, students in one school may be more likely to pay attention than students in another. This may be due to differences in the student pool or instructional quality. Secondly, the “gender” variable ended up being omitted. This suggests that the student’s gender is not a significant signal of whether the student is likely to be paying attention in class. Finally, and perhaps most interestingly, certain activities were left in the final model while others (“individual” and “wholecarpet”) were omitted. This came as somewhat of a surprise, as we initially conducted our analysis treating all activities equally. However, this result suggests that a specific subset of the activities (those that remained in the model) has a relatively higher effect on whether or not the students pay attention.
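
The significance screening behind these omissions was run separately in SPSS, and its exact procedure is not shown here. As a rough Python analogue for a single binary predictor such as gender, a minimal sketch could look like the following. It assumes a Pearson chi-square test of independence on a 2x2 table; the helper `chi_square_2x2` and the counts in `toy_table` are hypothetical and are not the study's data.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts
    (no continuity correction). Returns (statistic, p_value)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: rows = gender (0, 1), columns = (off-task, on-task)
toy_table = [[300, 700],
             [310, 690]]

stat, p_value = chi_square_2x2(toy_table)
keep_variable = p_value <= 0.05  # same 0.05 cut-off as described in the text
print(f"chi2 = {stat:.4f}, p = {p_value:.4f}, keep = {keep_variable}")
```

With these toy counts the on-task rates of the two groups are nearly identical, so the p-value is well above 0.05 and the variable would be dropped.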