# Email Classification Example
---

This graph dataset contains the emails exchanged inside an enterprise. Each node *u* represents an employee's email address and is labeled with that employee's department; each directed edge *(u,v)* means that *u* sent at least one email to *v*.

Our objective is to predict the department in which the employee works.
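
To make the edge semantics concrete, here is a minimal sketch of reading a space-separated edge list into a directed adjacency structure with plain Python. The sample lines and the helper name `read_directed_edges` are made up for illustration; the notebook itself relies on `networkx` for this step.

```python
from collections import defaultdict
import io

def read_directed_edges(lines):
    """Build {sender: set(receivers)} from 'u v' lines; direction is preserved."""
    successors = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        u, v = parts
        successors[u].add(v)  # u sent at least one email to v
    return dict(successors)

# Hypothetical sample: 0 wrote to 1 and 2, and 1 wrote back to 0
sample = io.StringIO("0 1\n0 2\n1 0\n")
adjacency = read_directed_edges(sample)
print(adjacency)
```

Node 2 only receives emails in this sample, so it never appears as a key: an edge *(u,v)* does not imply an edge *(v,u)*.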

## Definition of some useful functions
---

In [None]:
#basic imports
%matplotlib inline
import matplotlib.pyplot as plt
import networkx as nx
from networkx import DiGraph
from node2vec import Node2Vec

#Default name for the embedding file
EMBEDDING_FILE = "embeddings.emb"

## Computing the Embedding
---

First, we define the path to the dataset and to its edges file

In [None]:
dataset = "Email_dataset/" #path to dataset
graph_file = dataset+"edges.ssv" #path to edges file

Then, we load the graph

In [None]:
graph = nx.read_edgelist(graph_file, delimiter=" ", create_using=DiGraph()) #the graph is directed, so we use DiGraph()

Next, we create the Node2Vec object with the parameters *p* and *q*, which control the transition probabilities used during the random walks

In [None]:
n2v = Node2Vec(graph, dimensions=64, walk_length=120, num_walks=200, workers=4, p=2, q=.25)
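
As a rough illustration of what *p* and *q* do, the sketch below computes the unnormalized node2vec weights for the next step of a walk on an unweighted graph: `1/p` for returning to the previous node, `1` for a node that is also a neighbor of the previous node, and `1/q` for any other neighbor. The toy graph and the function name `walk_weights` are assumptions made for this example, and the graph is treated as undirected for simplicity; this is not the code the `node2vec` package runs internally.

```python
def walk_weights(adjacency, previous, current, p, q):
    """Unnormalized node2vec weights for leaving `current`, having arrived from `previous`."""
    weights = {}
    for candidate in adjacency[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p  # step back to where we came from
        elif candidate in adjacency[previous]:
            weights[candidate] = 1.0      # stays one hop away from `previous`
        else:
            weights[candidate] = 1.0 / q  # moves further away from `previous`
    return weights

# Hypothetical undirected toy graph with edges a-b, a-c, b-c, b-d
toy = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}

weights = walk_weights(toy, previous="a", current="b", p=2, q=0.25)
total = sum(weights.values())
probabilities = {node: w / total for node, w in weights.items()}
print(probabilities)
```

With `p=2` and `q=.25`, as in the cell above, stepping back is discouraged and moving outward is favored, so the walks tend to explore further from their starting node.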

Now, we compute the embedding using Node2Vec

In [None]:
n2v_model = n2v.fit(window=10, min_count=1, batch_words=128)

n2v_model.wv.save_word2vec_format(dataset+EMBEDDING_FILE)
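
The word2vec text format written here starts with a header line holding the number of vectors and their dimension; every following line holds a token (here, a node id) followed by the vector components. The lines follow the model's vocabulary order, which is not necessarily the node id order. The sketch below parses such content into a dictionary with plain Python; the sample content and the helper name `parse_word2vec_text` are made up for illustration.

```python
import io

def parse_word2vec_text(handle):
    """Return {token: [floats]} from word2vec text-format content."""
    count, dimensions = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():  # ignore blank lines
            continue
        token, *values = line.split()
        if len(values) != dimensions:
            raise ValueError("unexpected vector length for token " + token)
        vectors[token] = [float(v) for v in values]
    if len(vectors) != count:
        raise ValueError("header announces a different number of vectors")
    return vectors

# Hypothetical file content: 3 nodes, 2 dimensions, rows not sorted by node id
sample = io.StringIO("3 2\n7 0.1 -0.2\n0 0.3 0.4\n12 -0.5 0.6\n")
embeddings = parse_word2vec_text(sample)
print(embeddings["0"])
```

Keying the vectors by token avoids relying on the row position, which matters because row *n* of the file is generally not the vector of node *n*.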

Now we are going to use the generated embedding to predict the employees' departments

## Data Processing
---

In [None]:
import numpy as np
from numpy import array

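Since we saved the embedding, we can load it back with Numpy. Each row of the saved file starts with the node id, followed by its vector, and the rows are not ordered by node id, so we index the vectors by node id

In [None]:
raw = np.loadtxt(dataset+EMBEDDING_FILE,delimiter=' ',skiprows=1) #skip the header line
vectors = {int(row[0]): row[1:] for row in raw} #node id -> embedding vector

We can also define a function to get the node embedding representation

In [None]:
def to_embedded(n):
    return vectors[n]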

Now, we load the data from the *labels.ssv* file and build the dataset with the embeddings and the departments

In [None]:
data = [] #data matrix initialized empty

with open("Email_dataset/labels.ssv") as f:
    for line in f:
        node,department = line.split() #get the node id and its department (class)
        node_embedded = to_embedded(int(node)) #get the embedded representation of the node
        data.append(np.append(node_embedded,array([department]))) #insert the embedding and the class inside the data matrix

data = array(data,dtype=float) #transform the data matrix in a Numpy array
data

We shuffle the data and split it into train and test subsets, using the *train_percentage* factor

In [None]:
np.random.shuffle(data)
train_percentage = 0.7
train_size = int(len(data)*train_percentage)

train_data = array(data[0:train_size])
test_data = array(data[train_size:])
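
Because the shuffle above is not seeded, the split (and therefore the score) changes from run to run. A minimal sketch of a reproducible split with the standard library follows; the helper name `split_rows` and the seed value are assumptions made for this example.

```python
import random

def split_rows(rows, train_fraction, seed):
    """Shuffle a copy of `rows` with a fixed seed and cut it into train and test parts."""
    shuffled = list(rows)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # private generator, same order for the same seed
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for the rows of the data matrix
train_rows, test_rows = split_rows(rows, 0.7, seed=42)
print(len(train_rows), len(test_rows))
```

In the notebook, a similar effect can be obtained by calling `np.random.seed` with a fixed value before `np.random.shuffle`.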

## Prediction
---
We are going to use the KNN model to predict the classes

In [None]:
from sklearn.neighbors import KNeighborsClassifier
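
For intuition, KNN labels a point with the most common class among its *k* closest training points. The sketch below shows that idea in plain Python with Euclidean distance and `k=5` as the default, matching the default `n_neighbors` of `KNeighborsClassifier`; it is a simplified illustration on made-up points, not scikit-learn's implementation.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-dimensional embeddings for two departments
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["4", "4", "4", "14", "14", "14"]

prediction = knn_predict(points, labels, query=(0.3, 0.3), k=3)
print(prediction)
```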

We also define a function to train the model and give the score

In [None]:
def train_and_eval(model):
    model.fit(train_data[:,0:-1], train_data[:,-1])
    return model.score(test_data[:,:-1], test_data[:,-1])

Finally, we train the model and compute its score

In [None]:
score = train_and_eval(KNeighborsClassifier())
print("KNeighborsClassifier score: {}%".format(score*100))

Our accuracy is very low. To understand this, let's analyse the distribution of departments in the data

In [None]:
import pandas as pd
df = pd.read_csv("Email_dataset/labels.ssv", delimiter=" ", names=["Node","Dep"])
_ = df["Dep"].value_counts().plot(kind="bar", figsize=(8,8))

Looking at the plot, we can see that the dataset is unbalanced, which can explain the low accuracy we got.
So we are going to take a subgraph that contains only certain departments.
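
When classes are unbalanced, a useful reference point is the accuracy of a trivial classifier that always answers with the most frequent class. The sketch below computes that baseline on a made-up label list; the helper name `majority_baseline_accuracy` and the sample labels are assumptions, not values from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained on `labels` by always predicting the most frequent one."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Hypothetical unbalanced labels: 6 employees in "4", 3 in "14", 1 in "1"
labels = ["4"] * 6 + ["14"] * 3 + ["1"]
baseline = majority_baseline_accuracy(labels)
print("Majority baseline: {}%".format(baseline * 100))
```

A model is only informative if its score clearly exceeds this baseline on the same labels.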

## Retrieving the Subgraph
---
First, we build a dictionary with each node's department

In [None]:
labels_dict = {}
with open("Email_dataset/labels.ssv") as f:
    for line in f:
        node,department = line.split()
        labels_dict[node] = department

Next, we write a filtered edges file that keeps only the edges whose two endpoints belong to the departments listed in *labels_filter*

In [None]:
labels_filter = ["4","14"]
with open("Email_dataset/edges.ssv",'r') as f_in:
    with open("Email_dataset/edges_filtered.ssv",'w') as f_out:
        for line in f_in:
            src,trg = line.split()
            if labels_dict[src] in labels_filter and labels_dict[trg] in labels_filter:
                f_out.write(line)
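
The same filtering rule can be sketched in memory, which makes it easy to check on a small example. The labels, edges, and helper name `filter_edges` below are made up for illustration. Unlike the cell above, this version uses `dict.get`, so a node without a label is dropped instead of raising a `KeyError`.

```python
def filter_edges(edges, labels, keep):
    """Keep the edges whose source and target both have a label in `keep`."""
    keep = set(keep)
    return [
        (u, v)
        for u, v in edges
        if labels.get(u) in keep and labels.get(v) in keep
    ]

# Hypothetical labels and edges
labels = {"0": "4", "1": "14", "2": "7", "3": "4"}
edges = [("0", "1"), ("1", "2"), ("3", "0"), ("2", "3")]

kept = filter_edges(edges, labels, keep=["4", "14"])
print(kept)
```

An employee from a selected department who only exchanged emails with other departments has no edge left after filtering and therefore does not appear in the resulting subgraph.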

Now, we reload the graph and compute its embedding

In [None]:
graph = nx.read_edgelist("Email_dataset/edges_filtered.ssv", delimiter=" ", create_using=DiGraph())

n2v = Node2Vec(graph, dimensions=128, walk_length=50, num_walks=30, workers=4, p=1, q=1)

n2v_model = n2v.fit(window=50, min_count=1, batch_words=64)

n2v_model.wv.save_word2vec_format(dataset+EMBEDDING_FILE)

And now we process the data again, but we keep only the employees that are in the graph

In [None]:
raw = np.loadtxt(dataset+EMBEDDING_FILE,delimiter=' ',skiprows=1) #skip the header line
vectors = {int(row[0]): row[1:] for row in raw} #node id -> embedding vector (rows are not ordered by node id)

def to_embedded(n):
    return vectors[n]

data = []
with open("Email_dataset/labels.ssv") as f:
    for line in f:
        node,department = line.split()
        #keep only the selected departments and the nodes that are present in the filtered graph
        if department in labels_filter and int(node) in vectors:
            node_embedded = to_embedded(int(node))
            data.append(np.append(node_embedded,array([department])))

data = array(data,dtype=float)

np.random.shuffle(data)
train_percentage = 0.7
train_size = int(len(data)*train_percentage)

train_data = array(data[0:train_size])
test_data = array(data[train_size:])

Finally, we run KNN again

In [None]:
score = train_and_eval(KNeighborsClassifier())
print("KNeighborsClassifier score: {}%".format(score*100))

The accuracy improved. Let's check the accuracy of other models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron

from sklearn.svm import SVC

from sklearn.neural_network import MLPClassifier

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB

In [None]:
score = train_and_eval(LogisticRegression())
print("LogisticRegression score: {}%".format(score*100))
score = train_and_eval(SGDClassifier(max_iter=100, tol=0.001))
print("SGDClassifier score: {}%".format(score*100))
score = train_and_eval(Perceptron(max_iter=100, tol=0.001))
print("Perceptron score: {}%".format(score*100))

score = train_and_eval(SVC())
print("SVC score: {}%".format(score*100))

score = train_and_eval(MLPClassifier())
print("MLPClassifier score: {}%".format(score*100))

score = train_and_eval(GaussianProcessClassifier())
print("GaussianProcessClassifier score: {}%".format(score*100))

score = train_and_eval(DecisionTreeClassifier())
print("DecisionTreeClassifier score: {}%".format(score*100))

score = train_and_eval(BernoulliNB())
print("BernoulliNB score: {}%".format(score*100))
score = train_and_eval(GaussianNB())
print("GaussianNB score: {}%".format(score*100))

score = train_and_eval(GradientBoostingClassifier())
print("GradientBoostingClassifier score: {}%".format(score*100))
score = train_and_eval(RandomForestClassifier())
print("RandomForestClassifier score: {}%".format(score*100))
score = train_and_eval(ExtraTreesClassifier())
print("ExtraTreesClassifier score: {}%".format(score*100))
score = train_and_eval(AdaBoostClassifier())
print("AdaBoostClassifier score: {}%".format(score*100))
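
The repeated score/print pairs above can also be written as a loop over a dictionary of models. The sketch below shows the pattern with stand-in values so that it runs on its own; the helper names `evaluate_all` and `report` are assumptions made for this example.

```python
def evaluate_all(models, evaluate):
    """Return {name: score in percent} for each (name, model) pair."""
    return {name: evaluate(model) * 100 for name, model in models.items()}

def report(scores):
    """Format the scores from best to worst, one line per model."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ["{} score: {:.1f}%".format(name, score) for name, score in ranked]

# Stand-ins for illustration only: the "models" are plain numbers and the
# evaluation function returns them unchanged as accuracies.
fake_models = {"ModelA": 0.5, "ModelB": 0.75}
lines = report(evaluate_all(fake_models, evaluate=lambda model: model))
print("\n".join(lines))
```

In the notebook, `models` would map names such as `"SVC"` to instances such as `SVC()`, and `evaluate` would be `train_and_eval`.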

For further tests, we can change the parameters used to generate the transition probabilities, resulting in different embeddings. We leave this as an exercise to the reader ;)
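
As a possible starting point for that exercise, the sketch below enumerates combinations of *p* and *q* and picks the best one according to a scoring function. The candidate values, the helper names, and the placeholder scoring function are all assumptions made for illustration; the placeholder does not reflect any real result.

```python
from itertools import product

def parameter_grid(p_values, q_values):
    """All combinations of the candidate p and q values."""
    return [{"p": p, "q": q} for p, q in product(p_values, q_values)]

def best_setting(grid, score_for):
    """Return the setting with the highest score according to `score_for`."""
    return max(grid, key=score_for)

grid = parameter_grid([0.25, 1, 4], [0.25, 1, 4])

# Placeholder scoring for illustration only; it is not a measured accuracy.
def toy_score(setting):
    return -abs(setting["p"] - 1) - abs(setting["q"] - 4)

best = best_setting(grid, toy_score)
print(len(grid), best)
```

In practice, `score_for` would build a `Node2Vec` object with the given `p` and `q`, fit it, rebuild the data matrix, and return the value of `train_and_eval` for a chosen classifier.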