# Task for ML Interview

The task is to develop a machine learning approach that predicts the subject of a scientific paper.

## Dataset
The Cora dataset consists of 2708 scientific publications classified into one of seven classes (`Case_Based`, `Genetic_Algorithms`, `Neural_Networks`, `Probabilistic_Methods`, `Reinforcement_Learning`, `Rule_Learning`, `Theory`). The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details.

Download Link: https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz
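
To make the file layout concrete, here is a minimal sketch of how the two files in the archive could be parsed with the standard library alone. It assumes the layout described in the dataset README: `cora.content` is tab-separated with a paper ID, the 0/1 word indicators, and the class label on each line, and `cora.cites` holds one `cited_id<TAB>citing_id` pair per line. The sample rows and the helper names `parse_content` and `parse_cites` are made up for illustration (4 words instead of 1433).

```python
import csv
import io

# Toy stand-ins for cora.content and cora.cites; the values are invented.
SAMPLE_CONTENT = (
    "31336\t0\t1\t0\t0\tNeural_Networks\n"
    "1061127\t1\t0\t0\t1\tRule_Learning\n"
    "13195\t0\t0\t1\t0\tReinforcement_Learning\n"
)
SAMPLE_CITES = "31336\t1061127\n13195\t31336\n"


def parse_content(text):
    """Return {paper_id: (word_vector, subject)} from cora.content-style text."""
    papers = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # First field is the ID, last field is the label, the rest are words.
        paper_id, *words, subject = row
        papers[paper_id] = ([int(w) for w in words], subject)
    return papers


def parse_cites(text):
    """Return a list of (cited_id, citing_id) pairs from cora.cites-style text."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter="\t")]


papers = parse_content(SAMPLE_CONTENT)
edges = parse_cites(SAMPLE_CITES)
print(len(papers), "papers,", len(edges), "links")
print(papers["31336"])
```

On the real files the same functions would be called with the file contents instead of the sample strings; the notebook itself uses pandas and StellarGraph for this step.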


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Install dependencies

For this task, we use the StellarGraph library.
It depends on TensorFlow, so we install that first.

In [None]:
!pip install tensorflow==2.11.0



Now install StellarGraph

In [None]:
!pip install git+https://github.com/VenkateshwaranB/stellargraph.git

Collecting git+https://github.com/VenkateshwaranB/stellargraph.git
  Cloning https://github.com/VenkateshwaranB/stellargraph.git to /tmp/pip-req-build-jyqrvasb
  Running command git clone --filter=blob:none --quiet https://github.com/VenkateshwaranB/stellargraph.git /tmp/pip-req-build-jyqrvasb
  Resolved https://github.com/VenkateshwaranB/stellargraph.git to commit efa1f847109a4ba490e7a5105646a20ee09a3243
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: stellargraph
  Building wheel for stellargraph (setup.py) ... done
  Created wheel for stellargraph: filename=stellargraph-1.3.0b0-py3-none-any.whl size=431845 sha256=9c46bdf5bf88fd3fe5d4536b62f8be2d557540c57bfc6eef95ca72212826c1cf
  Stored in directory: /tmp/pip-ephem-wheel-cache-moyk5ytb/wheels/f3/06/0f/089f69af27d308a1830638f855b6c5755311d8ffc451de9980
Successfully built stellargraph
Installing collected packages: stellargraph
Successfully installed stellargraph-1.3.0b0


# Import libraries

In [None]:
import os
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import stellargraph as sg
from sklearn.model_selection import KFold, train_test_split  # for cross-validation
from sklearn.preprocessing import LabelBinarizer  # for label encoding

# For modeling
import tensorflow as tf
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN

from tensorflow.keras import layers, optimizers, losses, metrics, Model

# Loading Data

In [None]:
data_dir = os.path.expanduser("/content/drive/MyDrive/ML_interview_XAI/data")

In [21]:
# Name the 1433 word-indicator columns and the final subject column.
# The file has one more column than there are names, so pandas uses the
# first column (the paper ID) as the index.
feature_names = [f"F_{ii}" for ii in range(1433)]
column_names = feature_names + ["subject"]
data = pd.read_csv(os.path.join(data_dir, "cora.content"), sep='\t', header=None, names=column_names)
nodes = len(data)
labels = data['subject'].values

In [22]:
data.head()

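| | F_0 | F_1 | F_2 | F_3 | F_4 | F_5 | F_6 | F_7 | F_8 | F_9 | ... | F_1424 | F_1425 | F_1426 | F_1427 | F_1428 | F_1429 | F_1430 | F_1431 | F_1432 | subject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Neural_Networks |
| 1061127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Rule_Learning |
| 1106406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Reinforcement_Learning |
| 13195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Reinforcement_Learning |
| 37879 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Probabilistic_Methods |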


**Loading the data using StellarGraph library**

StellarGraph also ships its own loader for the Cora dataset. We use it below so that the graph is built in the form the model consumes directly.

In [31]:
# Define a function that performs both the cross-validation and the modeling

def Simple_GCN_Model(path, K, lr, epochs):
  """
  Train a GCN with K-fold cross-validation and return the averaged predictions.

  path: Directory that contains cora.content
  K: The number of folds
  lr: The learning rate
  epochs: The number of epochs

  We use a Graph Convolutional Network (https://arxiv.org/abs/1609.02907).
  """

  # Read the subject of every paper, indexed by paper ID (the first column)
  path = os.path.expanduser(path)
  content = pd.read_csv(os.path.join(path, "cora.content"), sep='\t', header=None, index_col=0)
  data = pd.DataFrame({"subject": content.iloc[:, -1]})



  # Load the Cora graph (node features and citation edges) using StellarGraph
  loader = sg.datasets.Cora()
  G, nodes = loader.load()

  # Full-batch generator that feeds the whole graph to the GCN
  generator = FullBatchNodeGenerator(G, method="gcn")

  # K-fold cross-validation over the node IDs. Shuffling only changes which
  # node IDs land in each fold; the graph itself is left untouched.
  fold = KFold(n_splits=K, shuffle=True)

  # One-hot label encoder
  encode = LabelBinarizer()

  # Accumulator for the class probabilities predicted by each fold's model,
  # shaped (1, number of papers, number of classes)
  predictions = np.zeros((1, 2708, 7))



  # Train one model per fold
  for train_index, test_index in fold.split(data):

    # Split the labelled nodes into the training and validation parts of this fold
    train_data, test_data = data.iloc[train_index], data.iloc[test_index]

    # One-hot encode the targets. The encoder is fitted on the training labels
    # only and reused for the validation labels so both share one class order.
    train_target_data = encode.fit_transform(train_data)
    test_target_data = encode.transform(test_data)

    # Training generator for this fold
    train_gen = generator.flow(train_data.index, train_target_data)

    # Two graph-convolution layers of 16 units each, with dropout
    graph_conv = GCN(layer_sizes=[16, 16], activations=["relu", "relu"], generator=generator, dropout=0.5)

    x_input, x_output = graph_conv.in_out_tensors()

    # Softmax output with one unit per class
    prediction = layers.Dense(units=train_target_data.shape[1], activation="softmax")(x_output)

    # Model definition
    model = Model(inputs=x_input, outputs=prediction)
    model.compile(
      optimizer=tf.optimizers.Adam(lr),
      loss=losses.categorical_crossentropy,
      metrics=["acc"],
    )

    # Validation generator for this fold
    test_gen = generator.flow(test_data.index, test_target_data)


    history = model.fit(
        train_gen,
        epochs=epochs,
        validation_data=test_gen,
        verbose=True,
        shuffle=False)



    # Predict class probabilities for every node in the graph with this fold's model
    all_nodes = data.index
    all_gen = generator.flow(all_nodes)
    preds = model.predict(all_gen)

    # Accumulate this fold's predictions
    predictions += preds

  # Average the predictions over the K folds
  pred_folds = predictions / K

  # Map the averaged probabilities back to class labels
  outputs = encode.inverse_transform(pred_folds.squeeze())
  print(outputs)
  print(type(outputs))

  # data.index is the order in which the predictions were generated
  return pd.DataFrame({"paper_id": data.index, "class_label": outputs})
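
For readers unfamiliar with the model, the propagation rule from the paper cited in the docstring can be sketched in a few lines of NumPy. This is only an illustration of a single layer, ReLU(D^-1/2 (A + I) D^-1/2 X W), on a toy graph; it is not StellarGraph's implementation, which also handles dropout, bias terms, and sparse matrices. The function name `gcn_layer` and the toy matrices are assumptions made for this example.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)                          # degree including the self-loop
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt         # symmetric normalisation
    return np.maximum(a_norm @ features @ weights, 0.0)


# Toy graph: 3 papers in a chain (0 - 1 - 2), 2 input features per paper.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)  # identity weights, so the output is just the smoothed features

H = gcn_layer(A, X, W)
print(H.round(3))
```

Each output row mixes a paper's own features with those of the papers it is linked to, which is why the citation network helps the classifier beyond the word vectors alone.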

In [33]:
results = Simple_GCN_Model(data_dir, K=10,lr=0.01,epochs=30)



Using GCN (local pooling) filters...
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30

In [12]:
results.to_csv('predictions.tsv', sep='\t', index=False)
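
The labels written to the file come from averaging the softmax outputs of the K fold models and then decoding the result. A minimal sketch of that step with plain NumPy follows; the probabilities and the three-class subset are invented for illustration, and taking the argmax stands in for what `LabelBinarizer.inverse_transform` does for multi-class input.

```python
import numpy as np

CLASSES = ["Case_Based", "Genetic_Algorithms", "Neural_Networks"]  # toy subset

# Hypothetical softmax outputs from K = 2 fold models for 3 papers.
fold_preds = [
    np.array([[0.7, 0.2, 0.1], [0.1, 0.5, 0.4], [0.3, 0.3, 0.4]]),
    np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.2, 0.5, 0.3]]),
]

# Average over the folds, then pick the most probable class per paper.
mean_probs = np.mean(fold_preds, axis=0)  # shape (n_papers, n_classes)
labels = [CLASSES[i] for i in mean_probs.argmax(axis=1)]
print(labels)
```

Averaging probabilities rather than voting on hard labels lets a confident fold outweigh an uncertain one, as happens for the second and third papers here.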

In [37]:
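from sklearn.metrics import accuracy_score

# Note: the predictions are averaged over K models, and every paper was in the
# training split of K-1 of them, so this is not a purely held-out accuracy.
print(f'\nSimple_GCN_Model test accuracy:{accuracy_score(data.subject.values,results.class_label.values)*100:.4}')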


Simple_GCN_Model test accuracy:92.21
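
This accuracy is computed over all 2708 papers, and each fold's model was trained on most of them. One way to get a held-out estimate is to keep, for every paper, only the prediction of the model that did not train on it. The sketch below shows that bookkeeping with NumPy; `out_of_fold_accuracy`, the `fit_predict` callback, and the majority-class baseline are hypothetical stand-ins, not part of the notebook or of StellarGraph.

```python
import numpy as np


def out_of_fold_accuracy(labels, fit_predict, k=5, seed=0):
    """Score each item only with the model that did not train on it."""
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(labels))
    oof = np.empty(len(labels), dtype=labels.dtype)
    for test_idx in np.array_split(order, k):
        train_idx = np.setdiff1d(order, test_idx)
        # fit_predict trains on train_idx and returns labels for test_idx.
        oof[test_idx] = fit_predict(train_idx, test_idx)
    return float((oof == labels).mean())


def majority_baseline(y):
    """Toy stand-in for the GCN: always predict the most common training label."""
    def fit_predict(train_idx, test_idx):
        values, counts = np.unique(y[train_idx], return_counts=True)
        return np.full(len(test_idx), values[counts.argmax()])
    return fit_predict


y = np.array(["Theory"] * 8 + ["Rule_Learning"] * 2)
acc = out_of_fold_accuracy(y, majority_baseline(y), k=5)
print(f"out-of-fold accuracy: {acc:.2f}")
```

In `Simple_GCN_Model` the equivalent change would be to store each fold's predictions only at the positions in `test_index` instead of summing predictions for all nodes.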
