<a href="https://colab.research.google.com/github/mbmackenzie/peace-speech-project/blob/master/Encoder_Generator_with_rationale.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About Notebook

This notebook builds a model mentioned from the [paper](https://arxiv.org/abs/1606.04155). One can directly infer to the github repo mentioned in the paper for the code. However, I have used the following [repo](https://github.com/yala/text_nn) for the model. Most of the codes are from the repo, and I have transformed the functions in order to fit our data. Instead of putting the raw textfile for the input, I have changed it to accept the *pd.DataFrame* Format. One can manipulate the AbstractDataset class to change it according to one's data format.



In [None]:
from google.colab import drive
import sys
drive.mount('/content/drive')

#Reference
!git clone https://github.com/yala/text_nn.git

sys.path.append('/content/text_nn')
sys.path.append('/content/text_nn/scripts')

In [2]:
#Libraries
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from importlib import reload
import sys
from imp import reload
import warnings


warnings.filterwarnings('ignore')
if sys.version[0] == '2':
    reload(sys)
    sys.setdefaultencoding("utf-8")

In [None]:
import tensorflow as tf
import torch

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

#To confirm that we are using GPU for the training later

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

Found GPU at: /device:GPU:0
There are 1 GPU(s) available.
We will use the GPU: Tesla V100-SXM2-16GB


## 1. Import Data

The dataset is n-gram processed articles with lemmatization and stop word removal. One can refer to each preprocess on the repo.

In [None]:
peaceful_countries = ['GB','AU','CA','SG','NZ','IE']
non_peaceful_countries = ['PK','BD','NG','KE','ZA','TZ']
directory = '/content/drive/My Drive/capstone_data/data/5_domestic_filter_Ngram_stopwords_lemmatize'

#data = pd.DataFrame([])

data = []

for entry in os.scandir(directory):
  if ("domestic_Ngram_stopword_lematize.csv" in entry.path):
    country = entry.name.split("_domestic_Ngram_stopword_lematize.csv")[0]
    # print(country)
    if (country in peaceful_countries):
      country_csv_path = entry.path
      df = pd.read_csv(country_csv_path,index_col=[0])
      df.rename(columns={'article_text_Ngram_stopword_lemmatize':'Processed_Reviews'}, inplace=True)
      df['peaceful'] = 1
      df = df[['Processed_Reviews','peaceful']]
      data.append(df)
      # print(data)
      # print(df)
    elif (country in non_peaceful_countries):
      country_csv_path = entry.path
      df = pd.read_csv(country_csv_path,index_col=[0])
      df.rename(columns={'article_text_Ngram_stopword_lemmatize':'Processed_Reviews'}, inplace=True)
      df['peaceful'] = 0
      df = df[['Processed_Reviews','peaceful']]
      data.append(df)
      # print(df)
      # print(data)
    else:
      continue

In [None]:
df_full = pd.concat(data, axis=0, ignore_index=True)
df_full.dropna(inplace=True)
df_full.reset_index(drop=True, inplace=True)
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552329 entries, 0 to 552328
Data columns (total 2 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Processed_Reviews  552329 non-null  object
 1   peaceful           552329 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 8.4+ MB


In [None]:
#Sampling with seed

import random

random.seed(42)

peace_index = random.sample(list(df_full[df_full['peaceful'] == 0].index), 100000)
nonpeaceful_index = random.sample(list(df_full[df_full['peaceful'] == 1].index), 100000)
index = peace_index + nonpeaceful_index

sample_df = df_full.iloc[index, :]
sample_df

Unnamed: 0,Processed_Reviews,peaceful
480560,Maverick Citizen Reflection Langa heal one sin...,0
49380,Well country 's young TV station officially la...,0
26752,President Uhuru Kenyatta outline Kenya 's clim...,0
507332,Road traffic collective consciousness In book ...,0
92293,Alter Ego brings luxury lifestyle Nigeria Priv...,0
...,...,...
276321,Surfers pounce Prowlers roar back life Six luc...,1
163731,Christchurch base Stoney Range Wines joint ven...,1
288728,In song Red Rose Caf Fureys sing place poet sa...,1
206317,Dr. Aw Knowing give terminal illness fight may...,1


In [None]:
sample_df.index = np.arange(0, len(sample_df))

def remove_digit(s):
  return ''.join([i for i in s if not i.isdigit()])

for i in range(len(sample_df)):
  sample_df['Processed_Reviews'][i] = remove_digit(sample_df['Processed_Reviews'][i])

## 2. Further Prerpocess

Since the result from the above dataset was not satisfactory enough, we have removed the NORP, PERSON, GPE annotated words from the articles, and digits also.

In [None]:
!python -m spacy download en_core_web_md

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_md
nlp = en_core_web_md.load()

In [None]:
def apo_s_remove(s):
  s = re.sub(r'\'s', '', s)
  return s

for i in range(len(sample_df)):
  sample_df['Processed_Reviews'][i] = apo_s_remove(sample_df['Processed_Reviews'][i])

In [None]:
def exclude_entity(s):
  doc = nlp(s)
  entity_list = [(X.text, X.label_) for X in doc.ents if (X.label_ == 'GPE') or X.label_ == 'PERSON' or (X.label_ == 'NORP')]
  entity_list = set(entity_list)
  for entity in entity_list:
    s = re.sub(entity[0], '', s)
  return s

In [None]:
%time
import re
for i in range(len(sample_df)):
  s = sample_df['Processed_Reviews'][i]
  if len(s) > 1000000:
    s = s[:1000000-1]
  doc = nlp(s)
  entity_list = [(X.text, X.label_) for X in doc.ents if (X.label_ == 'GPE') or X.label_ == 'PERSON' or (X.label_ == 'NORP')]
  entity_list = set(entity_list)
  for entity in entity_list:
    s = s.replace(entity[0], '')
  sample_df['Processed_Reviews'][i] = s

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


In [None]:
sample_df.to_csv('/content/drive/MyDrive/2020 Capstone/processed_sample_df.csv')

In [None]:
train_sample, test_dev_sample = train_test_split(sample_df.index, test_size = 0.3, random_state = 42)
dev_sample, test_sample = train_test_split(test_dev_sample, test_size = 0.5, random_state = 42)

In [None]:
sample_df['set'] = ''
sample_df.loc[train_sample, 'set'] = 'train'
sample_df.loc[dev_sample, 'set'] ='dev'
sample_df.loc[test_sample,'set'] = 'test'

## 3. Embedding

In later section, we have to use word embedding for the representation of the word. We have created word index linked to the 300 dimensional word vectors from the glove.6B.300d. One needs to download the file from the following [link](https://nlp.stanford.edu/projects/glove/).

In [None]:

# Load the Embeddings
import numpy as np

# Set the path where you have downloaded the embeddings
emb_path = "/content/drive/MyDrive/glove_word_vec/glove.6B.300d.txt"

# Set the embedding size
emb_dims = 300


def load_embeddings(emb_path, emb_dims):
    '''
    Load the embeddings from a text file
    
        :param emb_path: Path of the text file
        :param emb_dims: Embedding dimensions
        
        :return emb_tensor: tensor containing all word embeedings
        :return word_to_indx: dictionary with word:index
    '''

    # Load the file
    lines = open(emb_path).readlines()
    
    # Creating the list and adding the PADDING embedding
    emb_tensor = [np.zeros(emb_dims)]
    word_to_indx = {'PADDING_WORD':0}
    
    # For each line, save the embedding and the word:index
    for indx, l in enumerate(lines):
        word, emb = l.split()[0], l.split()[1:]
        
        if not len(emb) == emb_dims:
            continue
        
        # Update the embedding list and the word:index dictionary
        emb_tensor.append([float(x) for x in emb])
        word_to_indx[word] = indx+1
    
    # Turning the list into a numpy object
    emb_tensor = np.array(emb_tensor, dtype=np.float32)
    return emb_tensor, word_to_indx

In [None]:
# Calling load_embeddings and printing the size of the returned objects
emb_tensor, word_to_indx = load_embeddings(emb_path, emb_dims)

print('Words: {}\nVectors (+ zero-padding): {}'.format(len(word_to_indx.keys()), emb_tensor.shape))

Words: 400001
Vectors (+ zero-padding): (400001, 300)


## 4. Convert dataset

We need to convert our dataset that is suitable for the models later. One can change the AbstractDataset Class for one's own dataset. 

In [None]:
from abc import ABCMeta, abstractmethod, abstractproperty
import torch.utils.data as data
import torch

import re
import random
import tqdm


# Classes in the dataset
classes = {1:'peaceful',
           0:'non-peaceful'}

In [None]:
def preprocess(data):
        '''
        Return a list of (text, label and label_name)

            :param data: 20 newsgroup dataset as imported by SK-Learn
            
            :return processed_data: list of text, label and label_name
        '''
        processed_data = []
        for indx, sample in enumerate(data['data']):
            text, label = sample, data['target'][indx]
            label_name = data['target_names'][label]
            text = re.sub('\W+', ' ', text).lower().strip()
            processed_data.append((text, label, label_name))
        return processed_data

In [None]:
# Load the Dataset

from sklearn.datasets import fetch_20newsgroups
from abc import ABCMeta, abstractmethod, abstractproperty
import torch.utils.data as data
import torch

import re
import random
import tqdm


# Classes in the dataset
classes = {0:'peaceful',
           1:'non-peaceful'}

class AbstractDataset(data.Dataset):
    '''
    Abstract class that adds general method to the Newsgroup dataset
    '''
    
    __metaclass__ = ABCMeta

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        sample = self.dataset[index]
        return sample


class Article(AbstractDataset):
    '''
    Newsgroup dataset loader
    '''
    
    def __init__(self,dataframe, set_type, classes, word_to_indx, max_length=80):
        '''
        Load the dataset from SK-Learn

            :param set_type: string containing either 'train', 'dev' or 'test'
            :param classes: list of strings containing the classes
            :param word_to_indx: dictionary of word:index
            :param max_length: integer with max word to consider
            :return: nothing
        '''

        # Deterministic randomization
        random.seed(0)
        
        n_classes = len(classes)
        class_balance = {}
        self.dataset = []

        # If train or dev...
        if set_type == 'train':
          df = dataframe[dataframe['set'] == 'train']
        elif set_type == 'dev':
          df = dataframe[dataframe['set'] == 'dev']
        else:
          df = dataframe[dataframe['set'] == 'test']

        df.index = np.arange(0,len(df))
        data = [(df['Processed_Reviews'][i], df['peaceful'][i], classes[df['peaceful'][i]]) for i in range(len(df))]
       
        # For every unprocessed_sample in the created set, process it
        for indx, unprocessed_sample in tqdm.tqdm(enumerate(data)):
            sample = self.process_line(unprocessed_sample, word_to_indx, max_length)
            
            # If the sample is not empty, save it and add its y to the class_balance dictionary
            if sample['text'] != '':
                if not sample['y'] in class_balance:
                    class_balance[sample['y']] = 0
                class_balance[sample['y']] += 1
                self.dataset.append(sample)

            
   

    
    def get_indices_tensor(self, text_arr, word_to_indx, max_length):
        '''
        Return a tensor of max_length with the word indices
        
            :param text_arr: text array
            :param word_to_indx: dictionary word:index
            :param max_length: maximum length of returned tensors
            
            :return x: tensor containing the indices
        '''
        
        pad_indx = 0
        text_indx = [word_to_indx[x] if x in word_to_indx else pad_indx for x in text_arr][:max_length]
        
        # Padding
        if len(text_indx) < max_length:
            text_indx.extend([pad_indx for _ in range(max_length - len(text_indx))])

        x =  torch.LongTensor([text_indx])

        return x


    def process_line(self, row, word_to_indx, max_length, case_insensitive=True):
        '''
        Return every line as a dictionary with text, x, y, y_name

            :param row: document (or comment)
            :param word_to_indx: dictionary of word:index
            :param max_length: integer with max word to consider
            
            :return sample: dictionary of text, x, y, y_name
        '''
        
        text, label, label_name = row
        
        if case_insensitive:
            text = " ".join(text.split()[:max_length]).lower()
        else:
            text = " ".join(text.split()[:max_length])
            
        x =  self.get_indices_tensor(text.split(), word_to_indx, max_length)
        
        sample = {'text':text,'x':x, 'y':label, 'y_name': label_name}
        return sample

In [None]:
# Loading the dataset
train_data = Article(sample_df,'train', classes, word_to_indx,  max_length=512)
dev_data = Article(sample_df,'dev', classes, word_to_indx, max_length=512)
test_data = Article(sample_df,'test', classes, word_to_indx,  max_length=512)

# Printing 3 datapoints
for datapoint in train_data[:1]:
    print(datapoint)

140000it [00:19, 7134.65it/s]
30000it [00:04, 6904.69it/s]
30000it [00:04, 7275.42it/s]


{'text': 'well country young tv station officially launch yesterday glorify reception excitement yet wear the move rms game changer especially come time nothing fresh go tv front standard group hr news channel ktn news while viewer rms doubt celebrate party go quite butt hurt either ignore move utilize opportunity victim change indirect competitor so five main loser game nation media standard group mediamax basically however smart unruffled big medium house may try appear know almost everyone know lose big time with digital migration guy view content big negotiation ability advertiser company broadcasting they already trail citizen tv term viewership get another news channel struggle make unique qtv quite ratchet tv station compete like gbs rather local content offer citizen so long rely butter form citizen tv rms equity bank best target million low market segment with inooro tv vernacular station come rms quite stranglehold viewership country the exist vernacular station with trumpet 

## 5. Model Parameters

The following model parameters is for the models (encoder, decoder, tagger). One can change the following into argparser for the python file.

In [None]:
args = {'train':True, 'test':True, 'cuda':True, 'class_balance':False, 'num_gpus':1, 'debug_mode':False,
        'model_form':'cnn', 'dropout': 0.1,
        'init_lr':0.001, 'epochs':2, 'batch_size':512, 'patience':10,
        'save_dir':'snapshot', 'model_path':'/content/model.pt', 'results_path':' /content', 'model':'TextCNN',
        'hidden_dims':100, 'num_layers':1, 'dropout':0.1, 'weight_decay':1e-3,
        'filter_num':100, 'filters':[3, 4, 5], 'num_class':2, 'emb_dims':300,
        'tuning_metric':'loss', 'num_workers':4, 'objective':'cross_entropy',
        'use_as_tagger':False, 'get_rationales': True,
        'selection_lambda': 0.01,'continuity_lambda':0.01, 'snapshot':None,
        'gumbel_temprature':1, 'gumbel_decay': 1e-5
        }

## 6. Encoder

The following part is to implement the Encoder mentioned from the paper. In this notebook, we have used CNN from Yoon Kim, but one can implement different methods. The purpose of encoder is to classify the article with a set of Bernoulli variable masks.

In [None]:
import pdb
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.nn.functional as F


# Encoder
class Encoder(nn.Module):
    '''
    Load the embeddings and encode them
    '''

    def __init__(self, embeddings, args):
        '''
        Load embeddings and call the TextCNN model
        
            :param embeddings: tensor with word embeddings
            :param model: default is 'TextCNN'
            
            :return: nothing
        '''
        super(Encoder, self).__init__()
        
        # Saving the parameters
        self.model = args['model']
        self.num_class = args['num_class']
        self.hidden_dims = args['hidden_dims']
        self.num_layers = args['num_layers']
        self.filters = args['filters']
        self.filter_num = args['filter_num']
        self.cuda = args['cuda']
        self.dropout = args['dropout']
        
        # Loading the word embeddings in the Neural Network
        vocab_size, hidden_dim = embeddings.shape
        self.emb_dims = hidden_dim
        self.emb_layer = nn.Embedding(vocab_size, hidden_dim)
        self.emb_layer.weight.data = torch.from_numpy(embeddings)
        self.emb_layer.weight.requires_grad = True
        self.emb_fc = nn.Linear(hidden_dim, hidden_dim)
        self.emb_bn = nn.BatchNorm1d(hidden_dim)
        
        # Calling the model, followed by a fully connected hidden layer
        if self.model == 'TextCNN':
            self.cnn = TextCNN(args, max_pool_over_time=True)
            # The hidden fully connected layer size is given by the number of filters
            # times the filter size, by the number of hidden dimensions
            self.fc = nn.Linear(len(self.filters) * self.filter_num, hidden_dim)
        else:
            raise NotImplementedError("Model {} not yet supported for encoder!".format(model))

        # Dropout and final layer
        self.dropout = nn.Dropout(self.dropout)
        self.hidden = nn.Linear(hidden_dim, self.num_class)
        
        
    def forward(self, x_indx, mask = None):
        '''
        Forward step
        
            :param x_indx: batch of word indices
            
            :return logit: predictions
            :return: hidden layer
        '''
        
        x = self.emb_layer(x_indx.squeeze(1))
        if self.cuda:
            x = x.to(device)
        if not mask is None:
            x = x * mask.unsqueeze(-1)

        # Non linear projection with dropout
        x = F.relu(self.emb_fc(x))
        x = self.dropout(x)
        # TextNN, fully connected and non linearity
        if self.model == 'TextCNN':
            x = torch.transpose(x, 1, 2) # Transpose x dimensions into (Batch, Emb, Length)
            hidden = self.cnn(x)
            hidden = F.relu(self.fc(hidden))
        else:
            raise Exception("Model {} not yet supported for encoder!".format(self.model))

        # Dropout and final layer
        hidden = self.dropout(hidden)
        logit = self.hidden(hidden)
        return logit, hidden


# Model
class TextCNN(nn.Module):
    '''
    CNN for Text Classification
    '''

    def __init__(self, args, max_pool_over_time=False):
        '''
        Convolutional Neural Network
        
            :param num_layers: number of layers
            :param filters: filters shape
            :param filter_num: number of filters
            :param emb_dims: embedding dimensions
            :param max_pool_over_time: boolean
            
            :return: nothing
        '''
        super(TextCNN, self).__init__()

        # Saving the parameters
        self.num_layers = args['num_layers']
        self.filters = args['filters']
        self.filter_num = args['filter_num']
        self.emb_dims = args['emb_dims']
        self.cuda = args['cuda']
        self.max_pool = max_pool_over_time
        
        self.layers = []
        
        # For every layer...
        for l in range(self.num_layers):
            convs = []
            
            # For every filter...
            for f in self.filters:
                # Defining the sizes
                in_channels =  self.emb_dims if l == 0 else self.filter_num * len(self.filters)
                kernel_size = f
                
                # Adding the convolutions in the list
                conv = nn.Conv1d(in_channels=in_channels, out_channels=self.filter_num, kernel_size=kernel_size)
                self.add_module('layer_' + str(l) + '_conv_' + str(f), conv)
                convs.append(conv)
                
            self.layers.append(convs)


    def _conv(self, x):
        '''
        Left padding and returning the activation
        
            :param x: input tensor (batch, emb, length)
            :return layer_activ: activation
        '''
        
        layer_activ = x
        
        for layer in self.layers:
            next_activ = []
            
            for conv in layer:
                # Setting the padding dimensions: it is like adding
                # kernel_size - 1 empty embeddings
                left_pad = conv.kernel_size[0] - 1
                pad_tensor_size = [d for d in layer_activ.size()]
                pad_tensor_size[2] = left_pad
                left_pad_tensor = autograd.Variable(torch.zeros(pad_tensor_size))
                
                if self.cuda:
                    left_pad_tensor = left_pad_tensor.to(device)
                    
                # Concatenating the padding to the tensor
                padded_activ = torch.cat((left_pad_tensor, layer_activ), dim=2)
                
                # onvolution activation
                next_activ.append(conv(padded_activ))

            # Concatenating accross channels
            layer_activ = F.relu(torch.cat(next_activ, 1))
            #pdb.set_trace()
        return layer_activ


    def _pool(self, relu):
        '''
        Max Pool Over Time
        '''
        
        pool = F.max_pool1d(relu, relu.size(2)).squeeze(-1)
        return pool


    def forward(self, x):
        '''
        Forward steps over the x
        
            :param x: input (batch, emb, length)

            :return activ: activation
        '''
        
        activ = self._conv(x)
        
        # Pooling over time?
        if self.max_pool:
            activ = self._pool(activ)
            
        return activ

## 7. Generator

This part is to generate the rationale mentioned from the paper. 

In [None]:
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.nn.functional as F
import rationale_net.models.cnn as cnn
import rationale_net.utils.learn as learn
import pdb

'''
    The generator selects a rationale z from a document x that should be sufficient
    for the encoder to make it's prediction.
    Several froms of Generator are supported. Namely CNN with arbitary number of layers, and @taolei's FastKNN
'''



class Generator(nn.Module):

    def __init__(self, embeddings, args):
        super(Generator, self).__init__()
        vocab_size, hidden_dim = embeddings.shape
        self.embedding_layer = nn.Embedding( vocab_size, hidden_dim)
        self.embedding_layer.weight.data = torch.from_numpy( embeddings )
        self.embedding_layer.weight.requires_grad = False

        self.model = args['model']
        self.filters = args['filters']
        self.filter_num = args['filter_num']
        self.dropout = args['dropout']
        self.cuda = args['cuda']
        self.gumbel_temprature = args['gumbel_temprature']
        self.gumbel_decay = args['gumbel_decay']

        if self.model == 'TextCNN':
            self.cnn = TextCNN(args, max_pool_over_time=False)
        else:
            raise NotImplementedError("Model {} not yet supported for encoder!".format(model))

        self.z_dim = 2

        self.hidden = nn.Linear((len(self.filters)* self.filter_num), self.z_dim)
        self.dropout = nn.Dropout(self.dropout)



    def  __z_forward(self, activ):
        '''
            Returns prob of each token being selected
        '''
        activ = activ.transpose(1,2)
        logits = self.hidden(activ)
        probs = gumbel_softmax(logits, self.gumbel_temprature, self.cuda)
        z = probs[:,:,1]
        return z


    def forward(self, x_indx):
        '''
            Given input x_indx of dim (batch, length), return z (batch, length) such that z
            can act as element-wise mask on x
        '''
        if self.model== 'TextCNN':
            x = self.embedding_layer(x_indx.squeeze(1))
            if self.cuda:
                x = x.to(device)
            
            x = torch.transpose(x, 1, 2) # Switch X to (Batch, Embed, Length)
            activ = self.cnn(x)
            if self.cuda:
              activ = activ.to(device)
           
        else:
            raise NotImplementedError("Model form {} not yet supported for generator!".format(self.model))

        z = self.__z_forward(F.relu(activ))
        if self.cuda:
          z= z.to(device)
        mask = self.sample(z)
        if self.cuda:
          mask = mask.to(device)
        return mask, z


    def sample(self, z):
        '''
            Get mask from probablites at each token. Use gumbel
            softmax at train time, hard mask at test time
        '''
        mask = z
        if self.training:
            mask = z
        else:
            ## pointwise set <.5 to 0 >=.5 to 1
            mask = get_hard_mask(z)
        return mask


    def loss(self, mask, x_indx):
        '''
            Compute the generator specific costs, i.e selection cost, continuity cost, and global vocab cost
        '''
        selection_cost = torch.mean( torch.sum(mask, dim=1) )
        l_padded_mask =  torch.cat( [mask[:,0].unsqueeze(1), mask] , dim=1)
        r_padded_mask =  torch.cat( [mask, mask[:,-1].unsqueeze(1)] , dim=1)
        continuity_cost = torch.mean( torch.sum( torch.abs( l_padded_mask - r_padded_mask ) , dim=1) )
        return selection_cost, continuity_cost

## 8. Empty module

In [None]:
import torch
import torch.nn as nn
import pdb

class Empty(torch.nn.Module):
    def __init__(self):
        super(Empty, self).__init__()

    def forward(self, x):
        return x

## 9. Tagger


In [None]:
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.nn.functional as F
import rationale_net.models.cnn as cnn
import pdb

'''
    Implements a CNN with arbitary number of layers for tagging (predicts 0/1 for each token in text if token matches label), no max pool over time.
'''



class Tagger(nn.Module):

    def __init__(self, embeddings, args):
        super(Tagger, self).__init__()
        vocab_size, hidden_dim = embeddings.shape
        self.embedding_layer = nn.Embedding(vocab_size, hidden_dim)
        self.embedding_layer.weight.data = torch.from_numpy(embeddings)
        self.embedding_layer.weight.requires_grad = False
        
        self.model = args['model']
        self.filters = args['filters']
        self.filter_num = args['filter_num']
        self.num_tags = args['num_class']
        self.dropout = args['dropout']
        self.cuda = args['cuda']

        if args.model == 'TextCNN':
            self.cnn = TextCNN(args, max_pool_over_time=False)

        self.hidden = nn.Linear((len(self.filters)*self.filter_num), self.num_tags)
        self.dropout = nn.Dropout(self.dropout)

    
    def forward(self, x_indx, mask):
        '''Given input x_indx of dim (batch_size, 1, max_length), return z (batch, length) such that z
        can act as element-wise mask on x'''
        if self.model == 'TextCNN':
            ## embedding layer takes in dim (batch_size, max_length), outputs x of dim (batch_size, max_length, hidden_dim)
            x = self.embedding_layer(x_indx.squeeze(1))
        
            if self.cuda:
                x = x.to(device)
            ## switch x to dim (batch_size, hidden_dim, max_length)
            x = torch.transpose(x, 1, 2)
            ## activ of dim (batch_size, len(filters)*filter_num, max_length)
            activ = self.cnn(x)
        else:
            raise NotImplementedError("Model form {} not yet supported for generator!".format(model))

        ## hidden layer takes activ transposed to dim (batch_size, max_length, len(filters)*filter_num) and outputs logit of dim (batch_size, max_length, num_tags)
        logit = self.hidden(torch.transpose(activ, 1, 2))
        return logit, self.hidden

## 10. Get Model

get encoder, generator or tagger and empty generator

In [None]:
def get_model(args, embeddings, train_data):

    if args['snapshot'] is None:
        if args['use_as_tagger'] == True:
            gen = empty.Empty()
            model = Tagger(embeddings, args)
        else:
            gen   = Generator(embeddings, args)
            model = Encoder(embeddings, args)
    else :
        print('\nLoading model from [%s]...' % args['snapshot'])
        try:
            gen_path = learn.get_gen_path(args['snapshot'])
            if os.path.exists(gen_path):
                gen   = torch.load(gen_path)
            model = torch.load(args['snapshot'])
        except :
            print("Sorry, This snapshot doesn't exist."); exit()

    if args['num_gpus'] > 1:
        model = nn.DataParallel(model,
                                    device_ids=range(args['num_gpus']))

        if not gen is None:
            gen = nn.DataParallel(gen,
                                    device_ids=range(args['num_gpus']))
    return gen, model

## 11. Utils.learn

utility functions for the models and training. The following functions are changed into our dataformat.

In [None]:
# Train the model
import sklearn.metrics
import sys, os

def get_hard_mask(z, return_ind=False):
    '''
        -z: torch Tensor where each element probablity of element
        being selected
        -args: experiment level config
        returns: A torch variable that is binary mask of z >= .5
    '''
    max_z, ind = torch.max(z, dim=-1)
    if return_ind:
        del z
        return ind
    masked = torch.ge(z, max_z.unsqueeze(-1)).float()
    del z
    return masked

def get_gen_path(model_path):
    '''
        -model_path: path of encoder model
        returns: path of generator
    '''
    return '{}.gen'.format(model_path)

def one_hot(label, num_class):
    vec = torch.zeros( (1, num_class) )
    vec[0][label] = 1
    return vec

    
def gumbel_softmax(input, temperature, cuda):
    noise = torch.rand(input.size())
    noise.add_(1e-9).log_().neg_()
    noise.add_(1e-9).log_().neg_()
    noise = autograd.Variable(noise)
    if cuda:
        noise = noise.to(device)
    x = (input + noise) / temperature
    x = F.softmax(x.view(-1,  x.size()[-1]), dim=-1)
    return x.view_as(input)


def get_optimizer(models, args):
    '''
    Save the parameters of every model in models and pass them to
    Adam optimizer.
    
        :param models: list of models (such as TextCNN, etc.)
        :param args: arguments
        
        :return: torch optimizer over models
    '''
    params = []
    for model in models:
        params.extend([param for param in model.parameters() if param.requires_grad])
    return torch.optim.Adam(params, lr=args['lr'],  weight_decay=args['weight_decay'])


def init_metrics_dictionary(modes):
    '''
    Create dictionary with empty array for each metric in each mode
    
        :param modes: list with either train, dev or test
        
        :return epoch_stats: statistics for a given epoch
    '''
    epoch_stats = {}
    metrics = ['loss', 'obj_loss', 'k_selection_loss', 'k_continuity_loss',
               'accuracy', 'precision', 'recall', 'f1', 'confusion_matrix', 'mse']
    for metric in metrics:
        for mode in modes:
            key = "{}_{}".format(mode, metric)
            epoch_stats[key] = []
    return epoch_stats


def get_train_loader(train_data, args):
    '''
    Iterative train loader with sampler and replacer if class_balance
    is true, normal otherwise.
    
        :param train_data: training data
        :param args: arguments
        
        :return train_loader: iterable training set
    '''
    
    if args['class_balance']:
        # If the class_balance is true: sample and replace
        sampler = data.sampler.WeightedRandomSampler(
                weights=train_data.weights,
                num_samples=len(train_data),
                replacement=True)
        train_loader = data.DataLoader(
                train_data,
                num_workers=args['num_workers'],
                sampler=sampler,
                batch_size=args['batch_size'])
    else:
        # If the class_balance is false, do not sample
        train_loader = data.DataLoader(
            train_data,
            batch_size=args['batch_size'],
            shuffle=True,
            num_workers=args['num_workers'],
            drop_last=False)
    return train_loader


def get_dev_loader(dev_data, args):
    '''
    Iterative dev loader
    
        :param dev_data: dev set
        :param args: arguments
        
        :return dev_loader: iterative dev set
    '''
    
    dev_loader = data.DataLoader(
        dev_data,
        batch_size=args['batch_size'],
        shuffle=False,
        num_workers=args['num_workers'],
        drop_last=False)
    return dev_loader


def get_x_indx(batch, eval_model):
    '''
    Given a batch, return all the x
    
        :param batch: batch of dictionaries
        :param eval_model: true or false, for volatile
        
        :return x_indx: tensor of batch*x
    '''
    
    x_indx = autograd.Variable(batch['x'], volatile=eval_model)
    return x_indx


def get_loss(logit, y, args):
    '''
    Return the cross entropy or mse loss
    
        :param logit: predictions
        :param y: gold standard
        :param args: arguments
        
        :return loss: loss
    '''
    
    if args['objective'] == 'cross_entropy':
        loss_fn = nn.CrossEntropyLoss()
        loss = F.cross_entropy(logit, y)
    elif args['objective'] == 'mse':
        loss = F.mse_loss(logit, y.float())
    else:
        raise Exception("Objective {} not supported!".format(args['objective']))
    return loss


def tensor_to_numpy(tensor):
    '''
    Return a numpy matrix from a tensor

        :param tensor: tensor
        
        :return numpy_matrix: numpy matrix
    '''
    return tensor.data[0]


def get_metrics(preds, golds, args):
    '''
    Return the metrics given predictions and golds
    
        :param preds: list of predictions
        :param golds: list of golds
        :param args: arguments
        
        :return metrics: metrics dictionary
    '''
    metrics = {}

    if args['objective']  in ['cross_entropy', 'margin']:
        metrics['accuracy'] = sklearn.metrics.accuracy_score(y_true=golds, y_pred=preds)
        metrics['confusion_matrix'] = sklearn.metrics.confusion_matrix(y_true=golds,y_pred=preds)
        metrics['precision'] = sklearn.metrics.precision_score(y_true=golds, y_pred=preds, average="weighted")
        metrics['recall'] = sklearn.metrics.recall_score(y_true=golds,y_pred=preds, average="weighted")
        metrics['f1'] = sklearn.metrics.f1_score(y_true=golds,y_pred=preds, average="weighted")
        metrics['mse'] = "NA"
    elif args['objective'] == 'mse':
        metrics['mse'] = sklearn.metrics.mean_squared_error(y_true=golds, y_pred=preds)
        metrics['confusion_matrix'] = "NA"
        metrics['accuracy'] = "NA"
        metrics['precision'] = "NA"
        metrics['recall'] = "NA"
        metrics['f1'] = 'NA'
    return metrics


def collate_epoch_stat(stat_dict, epoch_details, mode, args):
    '''
    Update stat_dict with details from epoch_details and create
    log statement

        :param stat_dict: a dictionary of statistics lists to update
        :param epoch_details: list of statistics for a given epoch
        :param mode: train, dev or test
        :param args: model run configuration

        :return stat_dict: updated stat_dict with epoch details
        :return log_statement: log statement sumarizing new epoch

    '''
    log_statement_details = ''
    for metric in epoch_details:
        loss = epoch_details[metric]
        stat_dict['{}_{}'.format(mode, metric)].append(loss)

        log_statement_details += ' -{}: {}'.format(metric, loss)

    log_statement = '\n {} - {}\n--'.format(args['objective'], log_statement_details )

    return stat_dict, log_statement

## 12. Train model

Function to train model and extracts rationales, statistics

In [None]:
import os
import sys
import torch
import torch.autograd as autograd
import torch.nn.functional as F
import rationale_net.utils.generic as generic
import rationale_net.utils.metrics as metrics
import tqdm
import numpy as np
import pdb
import sklearn.metrics
import rationale_net.utils.learn as learn



def train_model(train_data, dev_data, model, gen, args):
    '''
    Train model and tune on dev set. If model doesn't improve dev performance within args.patience
    epochs, then halve the learning rate, restore the model to best and continue training.
    At the end of training, the function will restore the model to best dev version.
    returns epoch_stats: a dictionary of epoch level metrics for train and test
    returns model : best model from this call to train
    '''

    snapshot = '{}'.format(os.path.join(args['save_dir'], args['model_path']))

    if args['cuda']:
        model = model.to(device)
        gen = gen.to(device)

    args['lr'] = args['init_lr']
    optimizer = get_optimizer([model, gen], args)

    num_epoch_sans_improvement = 0
    epoch_stats = init_metrics_dictionary(modes=['train', 'dev'])
    step = 0
    tuning_key = "dev_{}".format(args['tuning_metric'])
    best_epoch_func = min if tuning_key == 'loss' else max

    train_loader = get_train_loader(train_data, args)
    dev_loader = get_dev_loader(dev_data, args)

    rationale_list = []
    gold_list = []
    y_list = []
    text_list = []

    for epoch in range(1, args['epochs'] + 1):

        print("-------------\nEpoch {}:\n".format(epoch))
        for mode, dataset, loader in [('Train', train_data, train_loader), 
                                      ('Dev', dev_data, dev_loader)]:

            train_model = mode == 'Train'
            print('{}'.format(mode))
            key_prefix = mode.lower()
            epoch_details, step, losses, preds, golds, rationales, text, y = run_epoch(
                data_loader=loader,
                train_model=train_model,
                model=model,
                gen=gen,
                optimizer=optimizer,
                step=step,
                args=args)


            rationale_list.append(rationales)
            gold_list.append(golds)
            y_list.append(y)
            text_list.append(text)
            epoch_stats, log_statement = collate_epoch_stat(epoch_stats, epoch_details, key_prefix, args)

            # Log  performance
            print(log_statement)


        # Save model if beats best dev
        best_func = min if args['tuning_metric'] == 'loss' else max
        if best_func(epoch_stats[tuning_key]) == epoch_stats[tuning_key][-1]:
            num_epoch_sans_improvement = 0
            if not os.path.isdir(args['save_dir']):
                os.makedirs(args['save_dir'])
            # Subtract one because epoch is 1-indexed and arr is 0-indexed
            epoch_stats['best_epoch'] = epoch - 1
            torch.save(model, snapshot)
            torch.save(gen, get_gen_path(args['model_path']))
        else:
            num_epoch_sans_improvement += 1

        if not train_model:
            print('---- Best Dev {} is {:.4f} at epoch {}'.format(
                args['tuning_metric'],
                epoch_stats[tuning_key][epoch_stats['best_epoch']],
                epoch_stats['best_epoch'] + 1))

        if num_epoch_sans_improvement >= args['patience']:
            print("Reducing learning rate")
            num_epoch_sans_improvement = 0
            model.cpu()
            gen.cpu()
            model = torch.load(snapshot)
            gen = torch.loadget_gen_path(snapshot)

            if args['cuda']:
                model = model.to(device)
                gen   = gen.to(device)
            args['lr'] *= .5
            optimizer = get_optimizer([model, gen], args)

    # Restore model to best dev performance
    if os.path.exists(args['model_path']):
        model.cpu()
        model = torch.load(args['model_path'])
        gen.cpu()
        gen = torch.load(get_gen_path(args['model_path']))

    return epoch_stats, model, gen, rationale_list, gold_list, y_list, text_list

## 13. Run Epoch

Function used in train model

In [None]:
def run_epoch(data_loader, train_model, model, gen, optimizer, step, args):
    '''
    Train model for one pass of train data, and return loss, acccuracy
    
        :param data_loader: iterable dataset
        :param train_model: true if training, false otherwise
        :param model: text classifier, such as TextCNN
        :param optimizer: Adam
        :param args: arguments
        
        :return epoch_stat:
        :return step: number of steps
        :return losses: list of losses
        :return preds: list of predictions
        :return golds: list of gold standards
    '''
    
    eval_model = not train_model
    data_iter = data_loader.__iter__()

    losses = []
    obj_losses = []
    k_selection_losses = []
    k_continuity_losses = []

    preds = []
    golds = []
    losses = []
    texts = []
    rationales = []
    y_list = []
    text_list = []

    if train_model:
        model.train()
        gen.train()
    else:
        model.eval()
        gen.eval()

    num_batches_per_epoch = len(data_iter)
    if train_model:
        num_batches_per_epoch = min(len(data_iter), 10000)

    for _ in tqdm.tqdm(range(num_batches_per_epoch)):
        # Get the batch
        batch = data_iter.next()
        
        if train_model:
            step += 1
            #if step % 100 == 0:
            #    args['gumbel_temprature'] = max(np.exp((step+1) * -1 * args['gumbel_decay']), .05)

        # Load X and Y
        x_indx = get_x_indx(batch, eval_model)
        text = batch['text']
        y = autograd.Variable(batch['y'], volatile=eval_model)

        text_list.append(text)
        y_list.append(y)

        if args['cuda']:
            x_indx, y = x_indx.to(device), y.to(device)

        if train_model:
            optimizer.zero_grad()
        
        if args['get_rationales']:
            mask, z = gen(x_indx)
        else:
          mask = None

        logit, _ = model(x_indx, mask = mask)
        
        # Calculate the loss
        loss = get_loss(logit, y, args)
        obj_loss = loss
        
        if args['get_rationales']:
            selection_cost, continuity_cost = gen.loss(mask, x_indx)

            loss += args['selection_lambda'] * selection_cost
            loss += args['continuity_lambda'] * continuity_cost


        # Backward step
        if train_model:
            loss.backward()
            optimizer.step()
        
        if args['get_rationales']:
            k_selection_losses.append( selection_cost.item())
            k_continuity_losses.append(continuity_cost.item())

        # Saving loss
        obj_losses.append(obj_loss.item()) #obj_losses.append(obj_loss)
        losses.append(loss.item())         #losses.append(loss.item) 
        
        # Softmax, preds, text and gold
        batch_softmax = F.softmax(logit, dim=-1).cpu()
        preds.extend(torch.max(batch_softmax, 1)[1].view(y.size()).data.numpy())
        
        texts.extend(text)
        rationales.extend(learn.get_rationales(mask, text))


        if args['use_as_tagger']:
            golds.extend(batch['y'].view(-1).numpy())
        else:
            golds.extend(batch['y'].numpy())

        

    # Get metrics
    epoch_metrics = get_metrics(preds, golds, args)
    epoch_stat = {'loss' : np.mean(losses), 'obj_loss': np.mean(obj_losses)}

    for metric_k in epoch_metrics.keys():
        epoch_stat[metric_k] = epoch_metrics[metric_k]
    
    if args['get_rationales']:
        epoch_stat['k_selection_loss'] = np.mean(k_selection_losses)
        epoch_stat['k_continuity_loss'] = np.mean(k_continuity_losses)

    return epoch_stat, step, losses, preds, golds, rationales, text_list, y_list

## 14. Training the model

In [None]:
# Creating the encoder and TextCNN, and printing an output from a random input

encoder = Encoder(emb_tensor, args)                    
#encoder.to(device)
print("Output logits for the first (randomly sorted) element of the dataset:\n\n")

print(encoder(train_data[0]['x']))

Output logits for the first (randomly sorted) element of the dataset:


(tensor([[-0.0080, -0.0564]], grad_fn=<AddmmBackward>), tensor([[0.0000, 0.0000, 0.1002, 0.0000, 0.0000, 0.0000, 0.1807, 0.0000, 0.0000,
         0.0000, 0.0436, 0.0164, 0.0000, 0.0220, 0.0000, 0.0000, 0.0000, 0.0646,
         0.0000, 0.0000, 0.1417, 0.0866, 0.0929, 0.0000, 0.0368, 0.1307, 0.0000,
         0.0000, 0.0561, 0.0620, 0.1641, 0.0000, 0.0000, 0.2966, 0.0000, 0.0000,
         0.0000, 0.0000, 0.2636, 0.0000, 0.0000, 0.0585, 0.0149, 0.0000, 0.0000,
         0.0326, 0.0895, 0.0000, 0.0000, 0.2108, 0.0000, 0.0487, 0.0000, 0.0841,
         0.0000, 0.3120, 0.0000, 0.1544, 0.1043, 0.0000, 0.0952, 0.0000, 0.0000,
         0.0366, 0.0000, 0.0053, 0.0000, 0.1681, 0.0964, 0.0000, 0.0000, 0.0403,
         0.0000, 0.0175, 0.0000, 0.0000, 0.0000, 0.2582, 0.0000, 0.0209, 0.0000,
         0.0000, 0.1520, 0.0769, 0.0000, 0.0000, 0.0000, 0.0887, 0.0000, 0.0000,
         0.0000, 0.0000, 0.2660, 0.0000, 0.0000, 0.0000, 0.000

In [None]:
gen, enc = get_model(args, emb_tensor, train_data)

In [None]:
epoch_stats, model, gen, rationale_list, gold_list, y_list, text_list = train_model(train_data, dev_data, enc, gen, args)

-------------
Epoch 1:

Train


100%|██████████| 274/274 [28:33<00:00,  6.25s/it]



 cross_entropy -  -loss: 1.0681992891061045 -obj_loss: 1.0681992891061045 -accuracy: 0.7982214285714285 -confusion_matrix: [[49831 20318]
 [ 7931 61920]] -precision: 0.8079327126025113 -recall: 0.7982214285714285 -f1: 0.7966682340159457 -mse: NA -k_selection_loss: 34.716494389694105 -k_continuity_loss: 31.95054263268074
--
Dev


100%|██████████| 59/59 [05:52<00:00,  5.98s/it]



 cross_entropy -  -loss: 0.4079273651211949 -obj_loss: 0.4079273651211949 -accuracy: 0.8262 -confusion_matrix: [[10803  4097]
 [ 1117 13983]] -precision: 0.8394009955752212 -recall: 0.8262 -f1: 0.8243492063492064 -mse: NA -k_selection_loss: 1.0 -k_continuity_loss: 1.9846276347920047
--
---- Best Dev loss is 0.4079 at epoch 1
-------------
Epoch 2:

Train


100%|██████████| 274/274 [29:18<00:00,  6.42s/it]



 cross_entropy -  -loss: 0.33957835020375077 -obj_loss: 0.33957835020375077 -accuracy: 0.8700642857142857 -confusion_matrix: [[56062 14087]
 [ 4104 65747]] -precision: 0.8777827390839731 -recall: 0.8700642857142857 -f1: 0.8694202402229159 -mse: NA -k_selection_loss: 1.7312692229765174 -k_continuity_loss: 3.264578269345917
--
Dev


100%|██████████| 59/59 [06:10<00:00,  6.29s/it]



 cross_entropy -  -loss: 0.3985718892792524 -obj_loss: 0.3985718892792524 -accuracy: 0.8438666666666667 -confusion_matrix: [[11358  3542]
 [ 1142 13958]] -precision: 0.8527498666666666 -recall: 0.8438666666666667 -f1: 0.8427759497260907 -mse: NA -k_selection_loss: 1.0 -k_continuity_loss: 1.9820768994800115
--
---- Best Dev loss is 0.3986 at epoch 2


## 15. Retrieve the lexicons from the model

In [None]:
#Using the rationale set from last epochs' training and validation set.
final_epoch_train_rationale = rationale_list[-2]
final_epoch_dev_rationale = rationale_list[-1]

final_epoch_train_y = []
final_epoch_dev_y = []

for i in range(len(y_list[-2])):
  final_epoch_train_y += list(y_list[-2][i].cpu().numpy())

for i in range(len(y_list[-1])):
  final_epoch_dev_y += list(y_list[-1][i].cpu().numpy())

#Check whether they match or not
print(len(final_epoch_train_rationale), len(final_epoch_train_y))
print(len(final_epoch_dev_rationale), len(final_epoch_dev_y))

final_epoch_train_text = []
final_epoch_dev_text = []

for i in range(len(text_list[-2])):
  final_epoch_train_text += text_list[-2][i]

for i in range(len(text_list[-1])):
  final_epoch_dev_text += text_list[-1][i]

140000 140000
30000 30000


In [None]:
#Extract the vocabulary from _ _ _ _ word _ _ _ form

import re
def extract_vocab(rationale):
  ret_set = []
  for article in rationale:
    vocab_list = re.findall(r'[a-zA-Z0-9]+', article)
    ret_set.append(vocab_list)
  return ret_set


val_vocab = extract_vocab(final_epoch_dev_rationale)
train_vocab = extract_vocab(final_epoch_train_rationale)

In [None]:
final_epoch_train =pd.DataFrame(final_epoch_train_rationale)
final_epoch_train['label'] = final_epoch_train_y
final_epoch_train['text'] = final_epoch_train_text
final_epoch_train['vocab'] = train_vocab

final_epoch_dev =pd.DataFrame(final_epoch_dev_rationale)
final_epoch_dev['label'] = final_epoch_dev_y
final_epoch_dev['text']  = final_epoch_dev_text
final_epoch_dev['vocab'] = val_vocab
final_epoch_dev

Unnamed: 0,0,label,text,vocab
0,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,0,brings luxury lifestyle private atelier privat...,[abuja]
1,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,0,at last a frustrate officer revealed magical p...,[]
2,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,0,minister foreign affairs tuesday say important...,[kashmir]
3,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,0,again flood sacks hundreds calabar a heavy dow...,[calabar]
4,jonty _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,0,jonty work great accepted work specialist cons...,[jonty]
...,...,...,...,...
29995,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,1,representatives woman soccer team lock intense...,[footballers]
29996,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ jalan _ ...,1,singapore the football association fas conduct...,[jalan]
29997,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,1,a man accuse repeatedly rap girl take explicit...,[]
29998,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,1,but decision make government would proceed sec...,[stirling]


In [25]:
train_peace = final_epoch_train[final_epoch_train['label'] == 1]
train_nonpeace = final_epoch_train[final_epoch_train['label'] == 0]
val_peace = final_epoch_dev[final_epoch_dev['label'] == 1]
val_nonpeace = final_epoch_dev[final_epoch_dev['label'] == 0]

In [42]:
peace_vocab_list = []
nonpeace_vocab_list = []
for vocabs in train_peace['vocab'].values:
  peace_vocab_list += vocabs

for vocabs in val_peace['vocab'].values:
  peace_vocab_list += vocabs

for vocabs in train_nonpeace['vocab'].values:
  nonpeace_vocab_list+= vocabs

for vocabs in val_nonpeace['vocab'].values:
  nonpeace_vocab_list += vocabs


In [27]:
print( len(set(peace_vocab_list)), len(set(nonpeace_vocab_list)))

9653 8694


In [74]:
peace_vocab_set = set(peace_vocab_list) - set(nonpeace_vocab_list)
nonpeace_vocab_set = set(nonpeace_vocab_list) -set(peace_vocab_list)
print(len(set(peace_vocab_set)), len(set(nonpeace_vocab_set)))

7080 6121


In [None]:
df_exp_peace_vocab = pd.DataFrame(set(peace_vocab_set))
df_exp_nonpeace_vocab = pd.DataFrame(set(nonpeace_vocab_set))

In [83]:
peace_vocab = list(peace_vocab_set)
nonpeace_vocab = list(nonpeace_vocab_set)
print(len(peace_vocab), len(nonpeace_vocab))

7080 6121


In [84]:
print(len(set(peace_vocab) - set(nonpeace_vocab)),len(set(nonpeace_vocab) - set(nonpeace_vocab)))

7080 6121


In [93]:
#Count the frequency of words for the ranking

peace_dict = {peace_vocab[i]: 0 for i in range(len(peace_vocab))}
nonpeace_dict = {nonpeace_vocab[i]: 0 for i in range(len(nonpeace_vocab))}

for i in range(len(final_epoch_dev)):
  vocab_list = final_epoch_dev['vocab'][i]
  if vocab_list == []:
    continue
  if final_epoch_train['label'][i] == 0:
    for vocab_ in vocab_list:
      try:
        peace_dict[vocab_] += 1
      except:
        pass
  else:
    for vocab_ in vocab_list:
      try:
        nonpeace_dict[vocab_] += 1
      except:
        pass

for i in range(len(final_epoch_train)):
  vocab_list = final_epoch_train['vocab'][i]
  if vocab_list == []:
    continue
  if final_epoch_train['label'][i] == 0:
    for vocab_ in vocab_list:
      try:
        peace_dict[vocab_] += 1
      except:
        pass
  else:
    for vocab_ in vocab_list:
      try:
        nonpeace_dict[vocab_] += 1
      except:
        pass


In [95]:
dict(sorted(peace_dict.items(), key=lambda item: item[1], reverse = True))

{'rogers': 24,
 'rcmp': 22,
 'rte': 21,
 'meath': 15,
 'news': 15,
 'limerick': 14,
 'gardai': 14,
 'taranaki': 12,
 'postmedia': 12,
 'iwi': 11,
 'hutt': 10,
 'southland': 9,
 'tallaght': 9,
 'defenceman': 9,
 'wicklow': 9,
 'rotorua': 9,
 'croke': 8,
 'afl': 7,
 'invercargill': 7,
 'cashel': 7,
 'westmeath': 7,
 'tipperary': 7,
 'carlow': 7,
 'clair': 7,
 'autocar': 7,
 'rnz': 6,
 'ocbc': 6,
 'bathurst': 6,
 'asb': 6,
 'palmerston': 6,
 'optus': 6,
 'oireachtas': 6,
 'nzx': 6,
 'qantas': 6,
 'aotearoa': 5,
 'quebec': 5,
 'gillard': 5,
 'wairarapa': 5,
 'ont': 5,
 'timaru': 5,
 'whanganui': 5,
 'ballina': 5,
 'hdb': 5,
 'truro': 5,
 'mediacorp': 5,
 'sbs': 5,
 'newsasia': 5,
 'raptors': 5,
 'leitrim': 5,
 'northland': 5,
 'scottish': 5,
 'navan': 5,
 'sligo': 5,
 'ardern': 4,
 'taoiseach': 4,
 'wanganui': 4,
 'brunton': 4,
 'newstalk': 4,
 'unsw': 4,
 'crowe': 4,
 'niall': 4,
 'stoppers': 4,
 'wollongong': 4,
 'taupo': 4,
 'guelph': 4,
 'ute': 4,
 'dail': 4,
 'nenagh': 4,
 'mick': 4,


In [96]:
dict(sorted(nonpeace_dict.items(), key=lambda item: item[1], reverse = True))

{'absa': 39,
 'enugu': 36,
 'naira': 35,
 'mombasa': 25,
 'tehreek': 18,
 'awami': 18,
 'sbp': 17,
 'odm': 15,
 'vodacom': 14,
 'nakuru': 14,
 'inec': 14,
 'muhammadu': 14,
 'mtn': 13,
 'multan': 12,
 'moi': 12,
 'anambra': 12,
 'upazila': 12,
 'shilling': 11,
 'thisday': 11,
 'stellenbosch': 11,
 'nse': 11,
 'kenyatta': 11,
 'limpopo': 11,
 'nigerian': 10,
 'iqbal': 10,
 'incite': 10,
 'iebc': 9,
 'ispr': 9,
 'bangla': 9,
 'idps': 9,
 'wits': 9,
 'retd': 9,
 'kaduna': 9,
 'authenticates': 8,
 'ekiti': 8,
 'abia': 8,
 'kse': 8,
 'faisalabad': 8,
 'kisumu': 8,
 'jinnah': 8,
 'sialkot': 8,
 'machakos': 8,
 'sukkur': 8,
 'balochistan': 8,
 'bangabandhu': 8,
 'benazir': 8,
 'awolowo': 8,
 'kiambu': 8,
 'tehsil': 7,
 'kra': 7,
 'igp': 7,
 'sargodha': 7,
 'sh': 7,
 'eldoret': 7,
 'saddar': 7,
 'katsina': 7,
 'haq': 6,
 'raila': 6,
 'kzn': 6,
 'kcb': 6,
 'mirpur': 6,
 'wapda': 6,
 'syed': 6,
 'obasanjo': 6,
 'bhutto': 6,
 'quetta': 6,
 'hossain': 6,
 'saps': 6,
 'ondo': 6,
 'pokot': 5,
 'mahl