# Introduction
This notebook implements the iterative classification algorithm described on slide 37 of http://web.stanford.edu/class/cs224w/slides/05-message.pdf  
It classifies a node using both its own features and the labels of its neighbours.

## Definitions
$v$: Node  
$Y_v$: Labels of node $v$  
$f_v$: feature vector of node $v$  
$z_v$: summary of labels of $v$'s neighbours (a vector)  
$\phi_1(f_v)$: predict node label based on node feature vector $f_v$  
$\phi_2(f_v, z_v)$: predict node label based on node feature vector $f_v$ and summary $z_v$ of labels of $v$'s neighbours

## Phase 1: Train a Classifier based on node attributes only
The classifier can be a linear classifier, a neural network, etc. It is trained on the training set to predict the label of each node.

$\phi_1(f_v)$: predicts $Y_v$ from $f_v$  
$\phi_2(f_v, z_v)$: predicts $Y_v$ from $f_v$ and the summary $z_v$ of the labels of $v$'s neighbours  
For vector $z_v$ of neighbourhood labels, let

- $I$ = incoming neighbour label info vector  
  $I_0$ = 1 if at least one of the incoming nodes is labelled 0.  
  $I_1$ = 1 if at least one of the incoming nodes is labelled 1.
- $O$ = outgoing neighbour label info vector  
  $O_0$ = 1 if at least one of the outgoing nodes is labelled 0.  
  $O_1$ = 1 if at least one of the outgoing nodes is labelled 1.
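
The definition of $z_v$ above can be sketched in plain Python. This is a minimal illustration, not the notebook's `get_summary_zv`: the function name `summarise_neighbour_labels`, the edge-list representation, and the ordering $z_v = [I_0, I_1, O_0, O_1]$ are all assumptions made for the example.

```python
def summarise_neighbour_labels(nodes, edges, labels):
    """Return {node: [I_0, I_1, O_0, O_1]} for a directed graph.

    nodes  : iterable of node ids (must contain every edge endpoint)
    edges  : iterable of (src, dst) pairs, meaning src -> dst
    labels : dict node -> 0 or 1; nodes without a label are ignored
    """
    z = {v: [0, 0, 0, 0] for v in nodes}
    for src, dst in edges:
        src_label = labels.get(src)
        dst_label = labels.get(dst)
        if src_label in (0, 1):
            # src is an incoming neighbour of dst: set I_0 or I_1 of dst
            z[dst][src_label] = 1
        if dst_label in (0, 1):
            # dst is an outgoing neighbour of src: set O_0 or O_1 of src
            z[src][2 + dst_label] = 1
    return z


# Hypothetical toy graph: a -> b, c -> b, b -> a
toy_nodes = ["a", "b", "c"]
toy_edges = [("a", "b"), ("c", "b"), ("b", "a")]
toy_labels = {"a": 0, "b": 1, "c": 1}
toy_z = summarise_neighbour_labels(toy_nodes, toy_edges, toy_labels)
print(toy_z)
```

Here node `b` receives edges from a node labelled 0 and a node labelled 1, so both $I_0$ and $I_1$ are set, while `c` has no incoming edges and only $O_1$ is set.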

## Phase 2: Iterate till Convergence

On the test set, initialise the labels with the classifier from Phase 1, then iterate as described in the steps below.

## Step 1: Train Classifier

On a different training set, train two classifiers:

- node attribute vector only: $\phi_1$
- node attribute and link vectors: $\phi_2$

## Step 2: Apply Classifier to test set

On the test set, use the trained node-feature classifier $\phi_1$ to set $Y_v$.

## Step 3.1: Update relational vectors z

Update $z_v$ for all nodes in the test set.

## Step 3.2: Update Label

Reclassify all nodes with $\phi_2$.

## Iterate

Repeat the two updates below until the labels stop changing or a maximum number of iterations is reached:

- update $z_v$
- update $Y_v = \phi_2(f_v, z_v)$
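
The loop described in Steps 2 to 3.2 can be sketched as follows. This is only an illustration under stated assumptions, not the `IterativeClassification` class used later in the notebook: `iterative_classification`, `neighbour_summary`, and the toy classifiers `toy_phi1` and `toy_phi2` are hypothetical names, and the two classifiers are plain functions rather than trained models.

```python
def neighbour_summary(nodes, edges, labels):
    """z_v = [I_0, I_1, O_0, O_1] for each node of a directed graph."""
    z = {v: [0, 0, 0, 0] for v in nodes}
    for src, dst in edges:
        z[dst][labels[src]] = 1        # incoming neighbour label of dst
        z[src][2 + labels[dst]] = 1    # outgoing neighbour label of src
    return z


def iterative_classification(nodes, edges, features, phi1, phi2,
                             max_iterations=10):
    # Step 2: bootstrap every label from the node features only
    labels = {v: phi1(features[v]) for v in nodes}
    for _ in range(max_iterations):
        # Step 3.1: recompute z_v for all nodes from the current labels
        z = neighbour_summary(nodes, edges, labels)
        # Step 3.2: reclassify all nodes with phi2(f_v, z_v)
        new_labels = {v: phi2(features[v], z[v]) for v in nodes}
        if new_labels == labels:       # converged: no label changed
            break
        labels = new_labels
    return labels


# Hypothetical stand-ins for trained classifiers
def toy_phi1(f):
    return 1 if f > 0.5 else 0


def toy_phi2(f, z):
    # Predict 1 if the feature is high or an incoming neighbour is labelled 1
    return 1 if f > 0.5 or z[1] == 1 else 0


chain_nodes = ["a", "b", "c"]
chain_edges = [("a", "b"), ("b", "c")]
chain_features = {"a": 0.9, "b": 0.4, "c": 0.1}
final_labels = iterative_classification(
    chain_nodes, chain_edges, chain_features, toy_phi1, toy_phi2)
print(final_labels)
```

In this toy chain only `a` is labelled 1 by $\phi_1$; the label then propagates to `b` and `c` over successive iterations until no label changes. The sketch updates all $z_v$ first and then all labels, which matches the "update all nodes" wording of the steps above.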

In [1]:
import pandas as pd
import networkx as nx
from collective.constants import get_summary_zv
from collective.Iterative import IterativeClassification

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix

import sys
# Put ../src on the import path before loading the local helper modules
sys.path.insert(1, '../src')
import parse
import preprocess

In [5]:
df_node = pd.read_csv('../data/unified_node_data.csv', keep_default_na=False)
df_edge = pd.read_csv('../data/max_edge_weights.csv')
df_formatted = preprocess.nodes1(df_node)
df_clean = preprocess.nodes_filter(df_formatted, df_edge)
df_impute = preprocess.impute(df_clean)
X_train, X_val, X_test, y_train, y_val, y_test = preprocess.stratified_train_val_test_split(df_impute)
# Fold the validation split into the test split
X_test = pd.concat([X_val, X_test])
y_test = pd.concat([y_val, y_test])

# Binarise the target: 1 if the value is positive, otherwise 0
y_train = y_train.apply(lambda x: 1 if x > 0 else 0).rename("label")
y_test = y_test.apply(lambda x: 1 if x > 0 else 0).rename("label")

# Attach the binary label and drop confessed_assignments, which leaks the target
X_train = pd.concat([X_train, y_train], axis = 1)
X_train = X_train.drop(['confessed_assignments'], axis = 1)
X_test = pd.concat([X_test, y_test], axis = 1)
X_test = X_test.drop(['confessed_assignments'], axis = 1)

In [11]:
print("Training model2")
train_x_model2 = X_train.drop(['label', 'name'], axis=1)
train_y_model2 = X_train['label']
test_x_model2 = X_test.drop(['label', 'name'], axis=1)
test_y_model2 = X_test['label']
model2 = LogisticRegression(max_iter = 1000)
model2.fit(train_x_model2, train_y_model2)
y_pred2 = model2.predict(test_x_model2)

print(f1_score(test_y_model2.to_numpy(), y_pred2))
confusion_matrix(test_y_model2.to_numpy(), y_pred2)

Training model2
0.0


array([[264,   0],
       [ 44,   0]], dtype=int64)
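
The F1 score of 0.0 follows directly from the confusion matrix above: the model predicts class 0 for every node, so there are no true positives. A small sketch of the arithmetic, assuming scikit-learn's layout (rows are true labels, columns are predicted labels); `f1_from_confusion` is a helper written for this example, not part of the notebook:

```python
def f1_from_confusion(cm):
    """F1 of the positive class from a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    # Treat undefined ratios (zero denominator) as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


cm = [[264, 0], [44, 0]]
score = f1_from_confusion(cm)
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(score, accuracy)
```

Accuracy is 264/308 here even though no positive node is found, which is why F1 is the more informative metric on this imbalanced label.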

## Further cleaning
Note: `confessed_assignments` and `num_confessed_assignments` must both be dropped from the features, since each indicates whether the student cheated and would leak the label.

In [None]:
network_graph_train = parse.create_nx_graph_nodes(X_train)
network_graph_train = parse.add_nx_graph_edges(network_graph_train, df_edge)

network_graph_test = parse.create_nx_graph_nodes(X_test)
network_graph_test = parse.add_nx_graph_edges(network_graph_test, df_edge)

# Gets L1_max, L0_max, L1_mean, L0_mean

In [None]:
network_graph_train = get_summary_zv(network_graph_train)
network_graph_test = get_summary_zv(network_graph_test)
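
The source does not show how `get_summary_zv` computes `L1_max`, `L0_max`, `L1_mean`, and `L0_mean`. One plausible reading, offered purely as an assumption, is that they are the maximum and mean edge weight to neighbours labelled 1 and 0. The sketch below illustrates that reading only; `weighted_label_summary`, the undirected treatment of edges, and the default of 0.0 for nodes with no such neighbour are all hypothetical choices.

```python
from statistics import mean


def weighted_label_summary(nodes, weighted_edges, labels):
    """Per node: max/mean edge weight to neighbours labelled 1 and 0.

    ASSUMPTION: this is a guess at what the four summary columns mean,
    not the notebook's actual get_summary_zv implementation.
    """
    weights = {v: {0: [], 1: []} for v in nodes}
    for u, v, w in weighted_edges:
        # Edges are treated as undirected (assumption)
        if labels.get(v) in (0, 1):
            weights[u][labels[v]].append(w)
        if labels.get(u) in (0, 1):
            weights[v][labels[u]].append(w)
    summary = {}
    for v, by_label in weights.items():
        summary[v] = {
            "L1_max": max(by_label[1], default=0.0),
            "L0_max": max(by_label[0], default=0.0),
            "L1_mean": mean(by_label[1]) if by_label[1] else 0.0,
            "L0_mean": mean(by_label[0]) if by_label[0] else 0.0,
        }
    return summary


# Hypothetical toy data
w_nodes = ["a", "b", "c"]
w_edges = [("a", "b", 2.0), ("a", "c", 4.0), ("b", "c", 1.0)]
w_labels = {"a": 1, "b": 0, "c": 1}
w_summary = weighted_label_summary(w_nodes, w_edges, w_labels)
print(w_summary["b"])
```

Node `b` has two neighbours labelled 1 (weights 2.0 and 1.0) and none labelled 0, so its label-1 columns are filled and its label-0 columns fall back to the default.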

In [None]:
# Flatten the node attributes of each graph into a DataFrame indexed by node id.
# Rows are collected first and converted once, instead of concatenating per node.
train_rows = []
for node in network_graph_train.nodes:
    network_graph_train.nodes[node]['index'] = node
    train_rows.append(network_graph_train.nodes[node])
df_train = pd.DataFrame(train_rows).set_index('index')

test_rows = []
for node in network_graph_test.nodes:
    network_graph_test.nodes[node]['index'] = node
    test_rows.append(network_graph_test.nodes[node])
df_test = pd.DataFrame(test_rows).set_index('index')

# Model 1: Logistic Regression without L1_max, L0_max, L1_mean, L0_mean

In [None]:
# model1
print("Training model1")
train_x_model1 = df_train.drop(['L1_max', 'L0_max', 'L1_mean', 'L0_mean', 'label'], axis=1)
train_y_model1 = df_train['label']
test_x_model1 = df_test.drop(['L1_max', 'L0_max', 'L1_mean', 'L0_mean', 'label'], axis=1)
test_y_model1 = df_test['label']
model1 = LogisticRegression(max_iter = 1000)
model1.fit(train_x_model1, train_y_model1)
y_pred1 = model1.predict(test_x_model1)

print(f1_score(test_y_model1.to_numpy(), y_pred1))
confusion_matrix(test_y_model1.to_numpy(), y_pred1)

# Model 2: Logistic Regression with L1_max, L0_max, L1_mean, L0_mean

In [None]:
print("Training model2")
train_x_model2 = df_train.drop(['label'], axis=1)
train_y_model2 = df_train['label']
test_x_model2 = df_test.drop(['label'], axis=1)
test_y_model2 = df_test['label']
model2 = LogisticRegression(max_iter = 1000)
model2.fit(train_x_model2, train_y_model2)
y_pred2 = model2.predict(test_x_model2)

print(f1_score(test_y_model2.to_numpy(), y_pred2))
confusion_matrix(test_y_model2.to_numpy(), y_pred2)

# Iterative Classification

In [None]:
print("Iterative classification")
ic = IterativeClassification(max_iterations=5)
new_gnx = ic.predict(network_graph_test, model1, model2)

In [None]:
# Collect the predicted label of every test node, in the same order as test_y_model2
new_gnx_pred = pd.Series(
    [new_gnx.nodes[node]['label'][0] for node in df_test.index],
    index=df_test.index, name='label')
print(f1_score(test_y_model2, new_gnx_pred))
confusion_matrix(test_y_model2, new_gnx_pred)