# Applying Trained Predictors to Other Target Variables

Given the predictors trained on 2022-01-17, AP2 asked whether they could be
applied to other target variables, and what the results would be.

As a recap, each predictor reached the following balanced accuracy on its own target variable:
* Master admission: ~ 0.66
* Fourth term AP: ~ 0.72
* Fourth term CP: ~ 0.75
* Dropout: ~ 0.75
* RSZ: ~ 0.6
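
Balanced accuracy is the unweighted mean of the per-class recall, so a predictor that always outputs the majority class scores 0.5 no matter how imbalanced the labels are. The following is a minimal sketch with invented toy labels (not RAPP data); `balanced_accuracy` is a hand-rolled stand-in for illustration, while the notebook itself uses `sklearn.metrics.balanced_accuracy_score`.

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (hand-rolled for illustration)."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of all samples that truly belong to this class.
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels, invented for illustration: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2
y_majority = [0] * 10  # a predictor that always outputs the majority class

plain_acc = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
bacc = balanced_accuracy(y_true, y_majority)
print(plain_acc, bacc)
```

Plain accuracy rewards the majority-class guess here, whereas balanced accuracy stays at chance level, which is why it is the metric listed above.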

The target variables are defined as follows:

| Variable | Description |
| --- | --- |
| Master admission | whether the average grade was 2.3 or better |
| Fourth term AP | whether 10 exams were passed after the 4th term |
| Fourth term CP | whether 100 ECTS were acquired after the 4th term |
| Dropout | whether a student will drop out of the study course or not |
| RSZ | whether the degree is achieved within the regular period of study of 6 terms |

## Load datasets

In [1]:
import pandas as pd
import sqlite3

In [2]:
db = sqlite3.connect("data/rapp.db")
admission_sql = "sql/cs/cs-first_term_modules-master_admission.sql"
dropout_sql = "sql/cs/cs-first_term_modules-3_dropout.sql"
four_term_ap_sql = "sql/cs/cs-first_term_modules-4term_ap.sql"
four_term_cp_sql = "sql/cs/cs-first_term_modules-4term_ectp.sql"
rsz_sql = "sql/cs/cs-first_term_modules-rsz.sql"

def load_sql(file, con):
    # Read the whole SQL file and run it as a single query.
    with open(file) as f:
        sql_query = f.read()
    return pd.read_sql_query(sql_query, con)


In [3]:
admission_df = load_sql(admission_sql, db)
dropout_df = load_sql(dropout_sql, db)
four_term_ap_df = load_sql(four_term_ap_sql, db)
four_term_cp_df = load_sql(four_term_cp_sql, db)
rsz_df = load_sql(rsz_sql, db)

### Prepare train/test split

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
def split_df(df, label):
    seed = 42  # Currently used in the RAPP pipeline [2022-01-20]
    train_size = 0.8  # Current setting in RAPP pipeline [2022-01-20]

    X = df.drop(label, axis=1, inplace=False)
    y = df[label]

    # Prepare categorical data
    columns = ["Geschlecht"]
    categorical = pd.get_dummies(data=X[columns], columns=columns)
    X = pd.concat([X, categorical], axis=1)
    X = X.drop(["Geschlecht"], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=seed)
    return X_train, X_test, y_train, y_test

In [6]:
sets = {}
dfs = {
    "dropout": {"df": dropout_df, "label": "Dropout"},
    "admission": {"df": admission_df, "label": "MasterZulassung"},
    "4term_ap": {"df": four_term_ap_df, "label": "FourthTermAP"},
    "4term_cp": {"df": four_term_cp_df, "label": "FourthTermCP"},
    "rsz": {"df": rsz_df, "label": "RSZ"},
}

for target in dfs.keys():
    sets[target] = {}
    X_train, X_test, y_train, y_test = split_df(**dfs[target])
    sets[target]["X_train"] = X_train
    sets[target]["X_test"] = X_test
    sets[target]["y_train"] = y_train
    sets[target]["y_test"] = y_test
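
Applying a model to another target's test set only makes sense if that test set has the same feature columns, in the same order, as the data the model was trained on. The sketch below shows one way to check this. `find_feature_mismatches` is a hypothetical helper, not part of the RAPP pipeline, and the column names in the example are invented.

```python
def find_feature_mismatches(columns_by_target, reference):
    """Return the targets whose column list differs from the reference's."""
    expected = columns_by_target[reference]
    return {
        target: cols
        for target, cols in columns_by_target.items()
        if cols != expected  # list comparison is order-sensitive
    }


# Invented column names, for illustration only.
example = {
    "dropout": ["Alter", "Geschlecht_m", "Geschlecht_w"],
    "rsz": ["Alter", "Geschlecht_m", "Geschlecht_w"],
    "admission": ["Alter", "Geschlecht_w", "Geschlecht_m"],  # different order
}
mismatches = find_feature_mismatches(example, reference="dropout")
print(mismatches)
```

In this notebook the helper could be fed with `{t: list(s["X_test"].columns) for t, s in sets.items()}`; an empty result would mean all feature sets line up. The check is worth running because `pd.get_dummies` only creates columns for categories that actually occur in a given data frame.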

## Load the models

In [7]:
import joblib

In [8]:
# Deeper Models
# admission_clf = joblib.load("reports/2022-01-17/admission/DecisionTreeClassifier/additional_models/76.joblib")
# dropout_clf = joblib.load("reports/2022-01-17/dropout/DecisionTreeClassifier/additional_models/83.joblib")
# four_term_ap_clf = joblib.load("reports/2022-01-17/4term_ap/DecisionTreeClassifier/additional_models/141.joblib")
# four_term_cp_clf = joblib.load("reports/2022-01-17/4term_cp/DecisionTreeClassifier/additional_models/148.joblib")
# rsz_clf = joblib.load("reports/2022-01-17/rsz/DecisionTreeClassifier/additional_models/63.joblib")

# Shallower models
admission_clf = joblib.load("reports/2022-01-17/admission/DecisionTreeClassifier/additional_models/79.joblib")
dropout_clf = joblib.load("reports/2022-01-17/dropout/DecisionTreeClassifier/additional_models/96.joblib")
four_term_ap_clf = joblib.load("reports/2022-01-17/4term_ap/DecisionTreeClassifier/additional_models/141.joblib")
four_term_cp_clf = joblib.load("reports/2022-01-17/4term_cp/DecisionTreeClassifier/additional_models/153.joblib")
rsz_clf = joblib.load("reports/2022-01-17/rsz/DecisionTreeClassifier/additional_models/70.joblib")

In [9]:
models = {
    "admission": admission_clf,
    "dropout": dropout_clf,
    "4term_ap": four_term_ap_clf,
    "4term_cp": four_term_cp_clf,
    "rsz": rsz_clf
}

## Make Cross-Predictions

In [10]:
from sklearn.metrics import balanced_accuracy_score

In [11]:
targets = ["admission", "dropout", "4term_ap", "4term_cp", "rsz"]

scores = {}

for t1 in targets:
    clf = models[t1]
    model_key = t1 + "_clf"
    scores[model_key] = {}
    for t2 in targets:
        X_test, y_test = sets[t2]["X_test"], sets[t2]["y_test"]
        y_pred = clf.predict(X_test)
        bacc = balanced_accuracy_score(y_test, y_pred)
        scores[model_key][t2] = bacc

In [12]:
results = pd.DataFrame(scores)

In [13]:
print(results)

           admission_clf  dropout_clf  4term_ap_clf  4term_cp_clf   rsz_clf
admission       0.638889     0.500000      0.522222      0.527778  0.444444
dropout         0.419643     0.742857      0.212500      0.280357  0.300000
4term_ap        0.636276     0.256157      0.766049      0.749135  0.699121
4term_cp        0.643359     0.238010      0.761006      0.783533  0.733749
rsz             0.612500     0.468750      0.550000      0.556250  0.568750
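
In this table, each column is a model and each row is the target whose test set it was evaluated on. Several off-diagonal scores lie well below 0.5, for example `dropout_clf` on `4term_cp` (0.238010). For a binary target, flipping every prediction turns a balanced accuracy of b into 1 - b. A score far below chance may therefore point to labels that are inversely related rather than unrelated. That is a plausible reading and has not been verified here. The following minimal sketch uses invented toy labels and a hand-rolled `balanced_accuracy` in place of scikit-learn's function:

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (hand-rolled for illustration)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy binary labels and predictions, invented for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
y_flipped = [1 - p for p in y_pred]  # invert every binary prediction

b = balanced_accuracy(y_true, y_pred)
b_flipped = balanced_accuracy(y_true, y_flipped)

# The two scores are complementary: b + b_flipped equals 1.
print(b, b_flipped, b + b_flipped)
```

Applied to the table, inverting the `dropout_clf` predictions on `4term_cp` would give 1 - 0.238010 = 0.761990, which is in the range of the 0.783533 reached by `4term_cp_clf` on its own target.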
