## ICE 7

**Author**: Nicolas Dussaillant

This exercise reuses one of the methods from ACA 2 (logistic regression with feature selection) and evaluates it with several metrics:

In [1]:
import pandas as pd

cols = [
        'SCHOOL',
        'GRADE',
        'CODER',
        'Gender',
        'OBSNUM',
        'totalobs-forsession',
        'Activity',
        'ONTASK',
        'TRANSITIONS',
        'FORMATchanges',
        'Obsv/act',
        'Transitions/Durations',
        'Total Time'
        ]
df = pd.read_csv('aca2_dataset/aca2_dataset_training.csv', usecols = cols)
dv = pd.read_csv('aca2_dataset/aca2_dataset_validation.csv', usecols = cols)


In [2]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Helper that prepares the data for this case; it is reused for the validation data and for other methods
def prepare_data(df):
    dfx = df.loc[:, df.columns != 'ONTASK']
    # Encode dummy variables (Gender is not included because it is already encoded)
    Xs = pd.get_dummies(dfx, columns = ['SCHOOL', 'GRADE', 'CODER', 'Activity'])

    # Format ONTASK as binary
    Y = df['ONTASK'].replace(to_replace = ['Y', 'N'], value = [1, 0])
    return Xs, Y

In [3]:
Xs, Y = prepare_data(df)
logit2 = LogisticRegression(max_iter = 500) # Increase the number of iterations so the solver can converge

min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(estimator=logit2, step=1, cv=None,
              scoring='accuracy',
              min_features_to_select=min_features_to_select)
rfecv.fit(Xs, Y)

Xtest, Ytest = prepare_data(dv)
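
Because `prepare_data` calls `pd.get_dummies` separately on each split, the training and validation matrices only share the same columns if every category appears in both files. That seems to hold here, since prediction runs without error, but it is not guaranteed in general. A minimal sketch of one way to guard against it, using small hypothetical toy frames (the values and the `X_train` / `X_valid` names are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: the validation split is missing Activity 'C'
train = pd.DataFrame({"Activity": ["A", "B", "C"], "Total Time": [10, 20, 30]})
valid = pd.DataFrame({"Activity": ["A", "B"], "Total Time": [15, 25]})

X_train = pd.get_dummies(train, columns=["Activity"])
X_valid = pd.get_dummies(valid, columns=["Activity"])

# Align validation columns to the training columns:
# dummies missing from validation are filled with 0, extra ones are dropped
X_valid_aligned = X_valid.reindex(columns=X_train.columns, fill_value=0)

print(list(X_valid_aligned.columns))
```

In the notebook this would amount to reindexing `Xtest` against `Xs.columns` before calling `rfecv.predict`.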



### Metrics

In [4]:
import sklearn.metrics as mt
y_train_pred = rfecv.predict(Xs)
y_test_pred = rfecv.predict(Xtest)

Precision:

In [6]:
print("Training data precision:", mt.precision_score(Y, y_train_pred))
print("Testing data precision:", mt.precision_score(Ytest, y_test_pred))

Training data precision: 0.6766954938552572
Testing data precision: 0.6690948825350573


Recall:

In [7]:
print("Training data recall:", mt.recall_score(Y, y_train_pred))
print("Testing data recall:", mt.recall_score(Ytest, y_test_pred))

Training data recall: 0.9952470210202169
Testing data recall: 0.9935100054083288


Accuracy:

In [8]:
print("Training data accuracy:", mt.accuracy_score(Y, y_train_pred))
print("Testing data accuracy:", mt.accuracy_score(Ytest, y_test_pred))

Training data accuracy: 0.676613775694194
Testing data accuracy: 0.668108887687038


F1:

In [9]:
print("Training data F1:", mt.f1_score(Y, y_train_pred))
print("Testing data F1:", mt.f1_score(Ytest, y_test_pred))

Training data F1: 0.8056247967920235
Testing data F1: 0.7996517575361846
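
As a quick sanity check, F1 is the harmonic mean of precision and recall, so the values printed above can be recombined by hand. The sketch below only reuses the numbers reported in the previous cells; the helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the precision and recall outputs above
f1_train = f1_from(0.6766954938552572, 0.9952470210202169)
f1_test = f1_from(0.6690948825350573, 0.9935100054083288)

print("Training F1 from precision/recall:", f1_train)
print("Testing F1 from precision/recall:", f1_test)
```

Both results should agree with `mt.f1_score` up to floating-point rounding.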


Cohen's Kappa:

In [10]:
print("Training data Cohen's Kappa:", mt.cohen_kappa_score(Y, y_train_pred))
print("Testing data Cohen's Kappa:", mt.cohen_kappa_score(Ytest, y_test_pred))

Training data Cohen's Kappa: 0.01997234434010986
Testing data Cohen's Kappa: 0.014278065322148814


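Comparing the metrics exposes the class imbalance in the data: recall is close to 1 (about 0.99) while precision is only about 0.67, which is consistent with a model that predicts the positive class for almost every sample. Cohen's Kappa (about 0.02 on the training data and 0.01 on the testing data) indicates none-to-slight agreement beyond chance. So although the accuracy does not look too bad, it mainly reflects the imbalance between the classes rather than real predictive power.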
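
To illustrate how imbalanced classes can produce this pattern, the sketch below computes the same metrics by hand for a trivial classifier that always predicts the positive class. The 67/33 class split is a hypothetical choice made to roughly resemble the precision reported above; it is not taken from the dataset, and `binary_metrics` is an illustrative helper rather than part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and Cohen's Kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = tp + fp + fn + tn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Agreement expected by chance, from the marginal label frequencies
    p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}

# Hypothetical imbalanced labels: 67 positives, 33 negatives
y_true = [1] * 67 + [0] * 33
# Trivial classifier that always predicts the positive class
y_pred = [1] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For this baseline, accuracy and precision equal the share of positives, recall is perfect, and Kappa is zero, because the agreement with the true labels is no better than what the label frequencies alone would give.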