# Neural graph learning on terrorist attacks

## Introduction to the dataset

The [Terrorist attack dataset]() consists of 1293 terrorist attacks. Each attack has 106 features, represented as a 0/1 vector describing the absence/presence of each feature. Each attack is assigned one of 6 labels describing the type of attack.
The dataset also comes with two possible graph structures, both with attacks as nodes. In the first, two attacks are linked if they share a location (co-located attacks). In the second, two attacks are linked if they are co-located and were organized by the same terrorist organization.

The dataset is a subset of the original dataset. For a related paper on the same dataset, see [Bin Zhao et al. "Entity and Relationship Labeling in Affiliation Networks." ICML. 2006.](https://linqspub.soe.ucsc.edu/basilic/web/Publications/2006/zhao:sna06/zhaosna06.pdf)

During preprocessing, the ids were shortened and the data was exported as separate TSV files. The notebook for preprocessing can be found [here](preprocessing_terrorist_attack_dataset.ipynb).

Dataset summary:
* 1293 terrorist attacks.
* 106 distinct features.
* Labels for classifying the terrorist attacks:
    * Arson
    * Bombing
    * Kidnapping
    * NBCR_Attack
    * Weapon_Attack
    * other_attack
* Each terrorist attack has one label.
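
To make the two ingredients of the dataset concrete, the sketch below parses a few made-up records into binary feature vectors, labels and an undirected neighbor map. The tab-separated node layout (id, 0/1 features, label) and the two-ids-per-line edge layout are assumptions modelled on Cora-style files; the ids, the three-feature vectors and the edges are invented for illustration, and the real `.nodes` and `.edges` files may differ.

```python
from collections import defaultdict

# Hypothetical toy data: the real dataset has 106 features per attack.
NODE_LINES = [
    "a1\t1\t0\t1\tBombing",
    "a2\t0\t0\t1\tArson",
    "a3\t1\t1\t0\tBombing",
]
EDGE_LINES = ["a1 a3", "a2 a3"]


def parse_nodes(lines):
    """Return {attack_id: (feature_vector, label)} from tab-separated lines."""
    nodes = {}
    for line in lines:
        fields = line.strip().split("\t")
        attack_id, label = fields[0], fields[-1]
        nodes[attack_id] = ([int(v) for v in fields[1:-1]], label)
    return nodes


def parse_edges(lines):
    """Return an undirected neighbor map {attack_id: set_of_neighbor_ids}."""
    neighbors = defaultdict(set)
    for line in lines:
        source, target = line.split()
        neighbors[source].add(target)
        neighbors[target].add(source)
    return dict(neighbors)


nodes = parse_nodes(NODE_LINES)
neighbors = parse_edges(EDGE_LINES)
print(nodes["a1"])
print(sorted(neighbors["a3"]))
```

The same node records can be combined with either edge list, which is what makes it possible to compare the location graph with the location-and-organization graph.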

## Experiment

The goal is to correctly classify the terrorist attacks and examine the performance difference between the Neural Graph Learning model and a base model.

To properly examine the performance difference between the models, each model is trained with a training fraction of 0.01 and then with fractions from 0.05 to 0.85 in increments of 0.05. The models are run 5 times at each training size, after which the average results are presented in a graph.

## References

Large parts of the code for preprocessing, for loading train, test and validation data, for evaluation and for generating the Keras functional models are modified from TensorFlow's tutorials and resources introducing Neural Structured Learning. The original code can be found here: [Tensorflows github](https://github.com/tensorflow/neural-structured-learning) and [Guide and Tutorials](https://www.tensorflow.org/neural_structured_learning/framework).
Additional information about the API can be found [here](https://www.tensorflow.org/neural_structured_learning/api_docs/python/nsl).

# Experiment

### Importing needed libraries

In [13]:
# !pip install --quiet neural-structured-learning
from __future__ import absolute_import, division, print_function, unicode_literals
import neural_structured_learning as nsl
import tensorflow as tf
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.expand_frame_repr', False)
import warnings
warnings.filterwarnings('ignore')

### Getting data

In [14]:
!tar -C /tmp -xvzf /Users/johanweisshansen/Documents/DTU/3.semester/advanced_project/oticon_project/terror_data/nsl/dataset_test_4/TerrorAttackNew/Archive2.tgz

x terrorist_attack_loc.edges
x terrorist_attack.nodes
x terrorist_attack_loc_org.edges


### Defining hyperparameters

In [None]:
class HParams(object):
  """Hyperparameters used for training."""
  def __init__(self):
    ### dataset parameters
    self.num_classes = 6
    self.max_seq_length = 106
    ### neural graph learning parameters
    self.distance_type = nsl.configs.DistanceType.L2
    self.graph_regularization_multiplier = 0.3
    self.num_neighbors = 1
    ### model architecture
    self.num_fc_units = [50,50]
    ### training parameters
    self.train_epochs = 50
    self.batch_size = 50
    self.dropout_rate = 0.5
    ### eval parameters
    self.eval_steps = None  # All instances in the test set are evaluated.

HPARAMS = HParams()

### Load train, valid and test data

In [4]:
def parse_example(example_proto):

    feature_spec = {
        'attributes':
            tf.io.FixedLenFeature([HPARAMS.max_seq_length],
                                tf.int64,
                                default_value=tf.constant(
                                    0,
                                    dtype=tf.int64,
                                    shape=[HPARAMS.max_seq_length])),
        'label':
          tf.io.FixedLenFeature((), tf.int64, default_value=-1),
    }
    # We also extract corresponding neighbor features in a similar manner to
    # the features above.
    for i in range(HPARAMS.num_neighbors):
        nbr_feature_key = '{}{}_{}'.format('NL_nbr_', i, 'attributes')
        nbr_weight_key = '{}{}{}'.format('NL_nbr_', i, '_weight')
        feature_spec[nbr_feature_key] = tf.io.FixedLenFeature(
            [HPARAMS.max_seq_length],
            tf.int64,
            default_value=tf.constant(
                0, dtype=tf.int64, shape=[HPARAMS.max_seq_length]))

        # We assign a default value of 0.0 for the neighbor weight so that
        # graph regularization is done on samples based on their exact number
        # of neighbors. In other words, non-existent neighbors are discounted.
        # This stays inside the loop so that every neighbor gets a weight spec.
        feature_spec[nbr_weight_key] = tf.io.FixedLenFeature(
            [1], tf.float32, default_value=tf.constant([0.0]))

    features = tf.io.parse_single_example(example_proto, feature_spec)

    labels = features.pop('label')
    return features, labels


def make_dataset(file_path, training=False):
    """Creates a batched dataset from a TFRecord file, shuffled when training."""

    dataset = tf.data.TFRecordDataset([file_path])
    if training:
        dataset = dataset.shuffle(10000)
    dataset = dataset.map(parse_example)
    dataset = dataset.batch(HPARAMS.batch_size)
    return dataset
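
The parser above looks up one feature key and one weight key per neighbor. The sketch below only reproduces that naming pattern so the expected keys can be inspected for a given neighbor count; the helper name `neighbor_feature_keys` is hypothetical and not part of the notebook or of the NSL API.

```python
def neighbor_feature_keys(num_neighbors, feature_name="attributes", prefix="NL_nbr_"):
    """List the neighbor feature and weight keys for `num_neighbors` neighbors."""
    keys = []
    for i in range(num_neighbors):
        keys.append("{}{}_{}".format(prefix, i, feature_name))  # neighbor features
        keys.append("{}{}_weight".format(prefix, i))            # neighbor edge weight
    return keys


for key in neighbor_feature_keys(2):
    print(key)
```

With more than one neighbor, every index needs both entries in the feature spec, otherwise the corresponding fields are not parsed from the records.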

## Functional base model

In [5]:
def functional_model(hparams):
    """Creates a functional API-based multi-layer perceptron model."""
    inputs = tf.keras.Input(shape=(hparams.max_seq_length,), dtype='int64', name='attributes')

    # Cast the integer 0/1 attributes to floating point.
    cur_layer = tf.keras.layers.Lambda(
      lambda x: tf.keras.backend.cast(x, tf.float32))(
          inputs)

    for num_units in hparams.num_fc_units:
        cur_layer = tf.keras.layers.Dense(num_units, activation='relu')(cur_layer)
        cur_layer = tf.keras.layers.Dropout(hparams.dropout_rate)(cur_layer)
#         cur_layer = tf.keras.layers.BatchNormalization()(cur_layer)

    outputs = tf.keras.layers.Dense(
      hparams.num_classes, activation='softmax')(
          cur_layer)

    model = tf.keras.Model(inputs, outputs=outputs)
    return model
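
As a rough sanity check on the model size, the trainable parameters of the dense stack can be counted by hand. This is a small sketch using the values from `HParams` (106 inputs, two hidden layers of 50 units, 6 classes); the helper `dense_parameter_count` is hypothetical. The cast and dropout layers contribute no trainable parameters.

```python
def dense_parameter_count(input_dim, hidden_units, num_classes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in list(hidden_units) + [num_classes]:
        total += previous * units + units  # weight matrix plus bias vector
        previous = units
    return total


# 106*50 + 50 = 5350, 50*50 + 50 = 2550, 50*6 + 6 = 306
print(dense_parameter_count(106, [50, 50], 6))
```

With only 1293 attacks in total, a model of this size has several times more parameters than there are samples, which is one reason the dropout rate matters at small training sizes.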

### Function to evaluate the models

In [6]:
# Helper function to print evaluation metrics.
def print_metrics(model_desc, eval_metrics):
    print('\n')
    print('Eval accuracy for ', model_desc, ': ', eval_metrics['accuracy'])
#     print('Eval loss for ', model_desc, ': ', eval_metrics['loss'])


### Function for training base model

In [8]:
def traning_base_model(train_dataset, valid_dataset):
    base_model = functional_model(HPARAMS)
    # Compile and train the base MLP model
    base_model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    base_model_history = base_model.fit(train_dataset, epochs=HPARAMS.train_epochs, verbose=0, validation_data=valid_dataset)
    
    return base_model

### Function for training graph model

In [9]:
def training_graph_model(train_dataset, valid_dataset):
    # Build a new base MLP model.
    base_reg_model = functional_model(HPARAMS)
    
    # Wrap the base MLP model with graph regularization.
    graph_reg_config = nsl.configs.make_graph_reg_config(
        max_neighbors=HPARAMS.num_neighbors,
        multiplier=HPARAMS.graph_regularization_multiplier,
        distance_type=HPARAMS.distance_type,
        sum_over_axis=-1)
    graph_reg_model = nsl.keras.GraphRegularization(base_reg_model, graph_reg_config)
    
    graph_reg_model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    graph_reg_history = graph_reg_model.fit(train_dataset, epochs=HPARAMS.train_epochs, verbose=0, validation_data=valid_dataset)
    
    return graph_reg_model
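
The idea behind graph regularization is that the training loss gets an extra term that penalizes differences between the model's output for a sample and its output for that sample's neighbors. The NumPy sketch below illustrates this for one neighbor per sample, using a squared L2 distance summed over the class axis. It is a simplified illustration under those assumptions, not the NSL implementation: the function name is hypothetical and the exact reduction and normalization inside `nsl.keras.GraphRegularization` may differ.

```python
import numpy as np


def graph_regularized_loss(probs, nbr_probs, nbr_weights, labels, multiplier):
    """Supervised cross-entropy plus a weighted neighbor-distance penalty.

    probs:       (batch, classes) predictions for the labeled samples
    nbr_probs:   (batch, classes) predictions for one neighbor per sample
    nbr_weights: (batch,) edge weights, 0.0 where a sample has no neighbor
    labels:      (batch,) integer class ids
    """
    probs = np.asarray(probs, dtype=float)
    nbr_probs = np.asarray(nbr_probs, dtype=float)
    nbr_weights = np.asarray(nbr_weights, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Sparse categorical cross-entropy, averaged over the batch.
    picked = probs[np.arange(len(labels)), labels]
    supervised = -np.mean(np.log(picked))

    # Squared L2 distance between each sample and its neighbor.
    distances = np.sum((probs - nbr_probs) ** 2, axis=-1)
    penalty = np.mean(nbr_weights * distances)

    return supervised + multiplier * penalty


probs = [[0.5, 0.5], [0.5, 0.5]]
nbr_probs = [[1.0, 0.0], [0.5, 0.5]]
loss = graph_regularized_loss(probs, nbr_probs, [1.0, 0.0], [0, 1], multiplier=0.3)
print(loss)
```

A neighbor weight of 0.0 removes a sample from the penalty, which matches the default used for missing neighbors in `parse_example`.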

### Function for generating train, test and validation data

In [10]:
def generateTrainingData(train_percent):
    # Run the preprocessing script to write the train, test and validation
    # TFRecord files for the given training fraction.
    !python preprocess_terror_attack_dataset.py \
    --input_content=/tmp/terrorist_attack.nodes \
    --input_graph=/tmp/terrorist_attack_loc_org.edges \
    --max_nbrs=15 \
    --train_percentage=$train_percent\
    --output_train_data=/tmp/train_merged_examples.tfr \
    --output_test_data=/tmp/test_examples.tfr \
    --output_valid_data=/tmp/valid_examples.tfr \
    
    # Build the train, test and validation datasets from the generated files.
    train_dataset = make_dataset('/tmp/train_merged_examples.tfr', training=True)
    test_dataset = make_dataset('/tmp/test_examples.tfr')
    valid_dataset = make_dataset('/tmp/valid_examples.tfr')
    
    return train_dataset, test_dataset, valid_dataset
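
The `!python ...` line is IPython shell syntax, so the function above only works inside a notebook. Outside a notebook, one option is to build the same command as an argument list and pass it to `subprocess`. The sketch below only constructs the list; the script name, flags and paths are copied from the cell above, while the helper name `build_preprocess_command` is hypothetical.

```python
import sys


def build_preprocess_command(train_percent,
                             input_content="/tmp/terrorist_attack.nodes",
                             input_graph="/tmp/terrorist_attack_loc_org.edges",
                             max_nbrs=15):
    """Return the argument list for one preprocessing run."""
    return [
        sys.executable,
        "preprocess_terror_attack_dataset.py",
        "--input_content={}".format(input_content),
        "--input_graph={}".format(input_graph),
        "--max_nbrs={}".format(max_nbrs),
        "--train_percentage={}".format(train_percent),
        "--output_train_data=/tmp/train_merged_examples.tfr",
        "--output_test_data=/tmp/test_examples.tfr",
        "--output_valid_data=/tmp/valid_examples.tfr",
    ]


command = build_preprocess_command(0.05)
print(" ".join(command[1:]))
# To actually run it (needs `import subprocess` and the script on disk):
# subprocess.run(command, check=True)
```

Passing `input_graph="/tmp/terrorist_attack_loc.edges"` would select the location-only graph instead.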

### Iterating over different training sizes

In [11]:
import numpy as np
# defining the training size we need to iterate over
train_percentage = []
train_percentage.append(0.01) # starting the list at 1% of training data

for i in np.arange(0.05, 0.9, 0.05):
    train_percentage.append(round(i,2))
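
`np.arange` with a non-integer step can be sensitive to floating-point rounding at the end point, and the NumPy documentation suggests alternatives such as `np.linspace` for that case. As a hedged alternative, the same 18 fractions can be generated from an integer counter; the variable name `train_fractions` is only used for this illustration.

```python
# 0.01 followed by 0.05, 0.10, ..., 0.85, built from integers to avoid
# accumulating floating-point error in the step.
train_fractions = [0.01] + [round(0.05 * k, 2) for k in range(1, 18)]

print(len(train_fractions))
print(train_fractions)
```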

In [12]:
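# Lists for holding results: one entry per run, each entry holding the
# evaluation results for every training size (averaged later).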
graph_accuracy_by_training_size_avg = []
base_accuracy_by_training_size_avg = []

for j in range(5):
    
    print("----------------------- iteration: ", j+1, "------------------------" )
    base_model_results_list = []
    graph_model_results_list = []
    for i in range(len(train_percentage)):

        print("---------------------training at percentage ", train_percentage[i], "--------------------------------")

        # creating test and training data
        train_dataset, test_dataset, valid_dataset = generateTrainingData(train_percentage[i])

        # creating and training the base model
        base_model = traning_base_model(train_dataset, valid_dataset)

        eval_results_base_model = dict(
        zip(base_model.metrics_names,
            base_model.evaluate(test_dataset, steps=HPARAMS.eval_steps)))
        print_metrics('Base MLP model', eval_results_base_model)

        base_model_results_list.append(eval_results_base_model)

        # creating and training the graph model
        graph_model = training_graph_model(train_dataset, valid_dataset)

        eval_results_graph_regulated_model = dict(
        zip(graph_model.metrics_names,
            graph_model.evaluate(test_dataset, steps=HPARAMS.eval_steps)))
        print_metrics('MLP + graph regularization', eval_results_graph_regulated_model)

        graph_model_results_list.append(eval_results_graph_regulated_model)


    graph_accuracy_by_training_size_avg.append(graph_model_results_list)
    base_accuracy_by_training_size_avg.append(base_model_results_list)
    

----------------------- iteration:  1 ------------------------
---------------------training at percentage  0.01 --------------------------------
  with open(in_file, 'rU') as cora_content:
Reading graph file: /tmp/terrorist_attack_loc_org.edges...
Done reading 571 edges from: /tmp/terrorist_attack_loc_org.edges (0.00 seconds).
Making all edges bi-directional...
Done (0.00 seconds). Total graph nodes: 257
Joining seed and neighbor tf.train.Examples with graph edges...
Done creating and writing 15 merged tf.train.Examples (0.00 seconds).
Out-degree histogram: [(0, 13), (2, 2)]
Output training data written to TFRecord file: /tmp/train_merged_examples.tfr.
Output test data written to TFRecord file: /tmp/test_examples.tfr.
Output valid data written to TFRecord file: /tmp/valid_examples.tfr.
Total running time: 0.00 minutes.
Eval accuracy for  Base MLP model :  0.5803723
Eval accuracy for  MLP + graph regularization :  0.42808798
---------------------training at percentage  0.05 -----------

KeyboardInterrupt: 

# Results

Since we ran 5 iterations, we first have to average the test accuracy at each training size.

In [15]:
def average_accuracy(results_by_run):
    """Averages the test accuracy over all runs for each training size."""
    num_runs = len(results_by_run)
    averages = []
    for i in range(len(results_by_run[0])):
        total = 0
        for j in range(num_runs):
            total += results_by_run[j][i]['accuracy']
        # Divide by the number of runs actually recorded instead of a fixed 5.
        averages.append(total / num_runs)
    return averages

graph_avg_list = average_accuracy(graph_accuracy_by_training_size_avg)
base_avg_list = average_accuracy(base_accuracy_by_training_size_avg)

IndexError: list index out of range
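
The `IndexError` above is consistent with the interrupted training cell: the per-run result lists are only appended once a full run has finished, so after a `KeyboardInterrupt` in the first run they are still empty and indexing element 0 fails. A more defensive way to average, sketched below with NumPy, skips incomplete runs and returns an empty list when there is nothing to average. The function name `mean_accuracy_per_size` and the toy results are made up for illustration.

```python
import numpy as np


def mean_accuracy_per_size(runs, num_sizes, metric="accuracy"):
    """Average a metric over complete runs for every training size.

    runs holds one list per run; each list holds one metrics dictionary per
    training size. Runs that did not cover all `num_sizes` sizes are skipped.
    """
    complete = [run for run in runs if len(run) == num_sizes]
    if not complete:
        return []
    values = np.array([[result[metric] for result in run] for run in complete])
    return values.mean(axis=0).tolist()


# Toy results: two complete runs over two training sizes, one interrupted run.
toy_runs = [
    [{"accuracy": 0.5}, {"accuracy": 0.7}],
    [{"accuracy": 0.7}, {"accuracy": 0.9}],
    [{"accuracy": 0.1}],
]
print(mean_accuracy_per_size(toy_runs, num_sizes=2))
print(mean_accuracy_per_size([], num_sizes=2))
```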

To get a better idea of the difference in learning at different training sizes, we subtract the base model's accuracy from the graph model's accuracy.

A positive value indicates a gain for the graph-based model over the base model.

In [None]:
diff_graph_and_basemodel = []

for i in range(len(base_avg_list)):
    diff_graph_and_basemodel.append(graph_avg_list[i]- base_avg_list[i])

collected_list = []
collected_list.append(base_avg_list)
collected_list.append(graph_avg_list)
collected_list.append(diff_graph_and_basemodel)

In [None]:
import numpy as np
import matplotlib.pyplot as plt


plt.figure(figsize=(12, 6))

columns = train_percentage
rows = ['base model', 'graph model', 'perf. diff.']

n_rows = len(collected_list)

# Format each row of results as text for the table below the plot.
cell_text = []
for row in range(n_rows):
    cell_text.append(['%1.3f' % x for x in collected_list[row]])

# Add a table at the bottom of the axes
table = plt.table(cellText=cell_text,
                      rowLabels=rows,
                      colLabels=columns,
                      rowColours=['#5b9ac4', '#fd8e39', '#ffffff'],
                      loc='bottom')

# table.auto_set_font_size(False)
table.set_fontsize(14)
table.scale(1, 1.7)
# Adjust layout to make room for the table:
plt.subplots_adjust(left=0.2, bottom=0.2)


# plt.plot(graph_collected)
plt.plot(base_avg_list)
plt.plot(graph_avg_list)

# plt.ylabel("Loss in ${0}'s".format(value_increment))
# plt.yticks(values * value_increment, ['%d' % val for val in values])
plt.xticks([])
plt.title('Comparison between base and graph model accuracy')
plt.ylabel('accuracy')
# plt.xlabel('training size')
plt.annotate('training size', xy=(1,0), xytext=(-669, -3), ha='left', va='top',
            xycoords='axes fraction', textcoords='offset points')
plt.legend(['base model', 'graph reg model'], loc='upper left')
plt.savefig('plots/terroist_attack_comparison.png', bbox_inches='tight', pad_inches=0.1)

plt.show()