# Training the Anger Model

**SPECIAL NOTE:**
This file corresponds to the training of the anger model referenced in the article *The Border Effect: Analyzing the Geographic Relationship of Angry Immigration Tweets Classified by a Gated Recurrent Unit*.

This file takes you through the steps of training a tweet anger classification model on [SemEval 2018 Task 1 data](https://competitions.codalab.org/competitions/17751) with [pretrained word vectors](https://nlp.stanford.edu/projects/glove/).

## Training Data
Path to the training data set: `data/semeval/rawdata/train/training_anger_tri_class.csv`

Attributes of Interest:

Column name     | Description
-----------     | -----------
`SentimentText` | Tweet text
`Sentiment`     | Class (0: not angry, 1: somewhat angry, 2: angry)
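
Before training, it can help to check how the three classes are distributed. The following is a minimal sketch, not part of the original pipeline: the helper name `class_counts` is hypothetical, and it assumes the training file is comma-delimited with a header row containing the `Sentiment` column.

```python
import csv
from collections import Counter

# Class labels as described in the table above
LABELS = {0: "not angry", 1: "somewhat angry", 2: "angry"}


def class_counts(lines, label_column="Sentiment"):
    """Count how many tweets fall into each anger class.

    `lines` is an open text file or any iterable of CSV lines. The comma
    delimiter and the header row are assumptions about the training file.
    """
    reader = csv.DictReader(lines)
    counts = Counter(int(row[label_column]) for row in reader)
    # Report every known class, including classes with no examples
    return {name: counts.get(label, 0) for label, name in LABELS.items()}
```

To use it, open the training file with `open(path, newline='', encoding='utf-8')` and pass the file object to `class_counts`.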

  

## Preprocessing Pretrained Word Vectors

Download `glove.twitter.27B.zip` from [here](https://nlp.stanford.edu/projects/glove/) and unzip the file. Then run the following code to preprocess the pretrained word vectors.

```python
# MODIFY THIS CODE TO MATCH THE PATH TO THE `glove.twitter.27B.100d.txt` FILE
txt_file = 'path/to/glove.twitter.27B.100d.txt'

# MODIFY THIS CODE. THIS IS WHERE YOU WILL SAVE THE PREPROCESSED `np.array` VECTORS
preprocessed_save_path = 'path/to/preprocess_pretrained_vectors.npz'
```

```python
import numpy as np

EMBED_DIM = 100  # dimensionality of the glove.twitter.27B.100d vectors

# index 0 is reserved for the blank word and its all-zero vector
words = ['']
vectors = [[0.0] * EMBED_DIM]

# each line of the file holds a word followed by its vector components
with open(txt_file, "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f):
        parts = line.rstrip('\n').split(' ')
        try:
            vector = [float(v) for v in parts[1:EMBED_DIM + 1]]
            if len(vector) != EMBED_DIM:
                raise ValueError("unexpected vector length")
        except ValueError:
            # print the line number of any entry that cannot be parsed and skip it
            print(line_number)
            continue
        # append the word and its vector together so the two lists stay aligned
        words.append(parts[0])
        vectors.append(vector)

# create arrays
vocab = np.array(words)
embeddings = np.array(vectors)

# save the array objects
np.savez(preprocessed_save_path, vocab=vocab, embeddings=embeddings)
```
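
As a quick sanity check on the saved file, the arrays can be loaded back and turned into a word lookup. This is a minimal sketch under the assumption that the file was written with the `vocab` and `embeddings` keys used above; the helper name `load_embeddings` is hypothetical and is not part of the training scripts.

```python
import numpy as np


def load_embeddings(npz_path):
    """Load the saved arrays and build a word -> row-index lookup."""
    with np.load(npz_path) as data:
        vocab = data["vocab"]
        embeddings = data["embeddings"]
    # Every word needs exactly one embedding row
    if len(vocab) != len(embeddings):
        raise ValueError("vocab and embeddings have different lengths")
    word_to_index = {word: index for index, word in enumerate(vocab.tolist())}
    return word_to_index, embeddings
```

After calling `load_embeddings(preprocessed_save_path)`, the blank word `''` should map to index 0 and `embeddings.shape[1]` should equal 100.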

## Training the Anger Model

Now that the word vectors are preprocessed, training the model takes a single command. From inside the `lovelace/` directory, run the following. Be sure to change the `--embedding_file` argument to the path where you saved the preprocessed word vectors.

```bash
python train_scully.py --data_file data/semeval/rawdata/train/training_anger_tri_class.csv \
 --embedding_file path/to/preprocess_pretrained_vectors.npz  \
 --test_size 0.15 --batch_size 128 --hidden_dim 120 --num_classes 3 --num_epochs 5000 \
 --learning_rate 0.00001 --embed_dim 100 --cell_type "gru"
```

A model will be saved, and its file name and path will be printed to the terminal. You can stop training at any time by pressing `Ctrl + C`.

## Classifying Novel Data

To classify novel data, note the path to the trained model (it should look something like `logs/rnn/22_Feb_2018-20_22_02`) and the name of the column that holds the text in the data file you want to classify. The `--output_file` argument is the path where the output will be saved.

Run the following from the terminal:

```bash
python predict.py --logdir path/to/model/dir --data_file path/to/novel/data.csv \
 --output_file path/to/save/file.csv --text_name 'text_attribute_name' --delimiter ',' --qchar '"'
```
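
Before running the command, it can be useful to confirm that the data file parses with the delimiter and quote character you plan to pass. The sketch below is a hypothetical helper (`read_text_column` is not part of `predict.py`), and it assumes that `--delimiter` and `--qchar` behave like the `delimiter` and `quotechar` options of Python's `csv` module.

```python
import csv


def read_text_column(lines, text_name, delimiter=",", qchar='"'):
    """Return the values of the text column from an iterable of CSV lines.

    `lines` can be an open text file or a list of strings. The mapping of
    `delimiter` and `qchar` onto the csv module is an assumption.
    """
    reader = csv.DictReader(lines, delimiter=delimiter, quotechar=qchar)
    # Fail early if the requested column is missing from the header row
    if text_name not in (reader.fieldnames or []):
        raise KeyError(f"column {text_name!r} not found in {reader.fieldnames}")
    return [row[text_name] for row in reader]
```

If the helper raises a `KeyError` or returns text that is split in unexpected places, the column name, delimiter, or quote character probably does not match the file.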