# Simple Classification

This section is inspired by [Ben Trevett's notebook](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb).

We'll build a machine learning model that detects IT jobs (i.e. decides whether a job posting is for an IT job or not) using PyTorch and TorchText. We'll use the [Armenian job postings dataset](https://www.kaggle.com/madhab/jobposts/home) published on Kaggle.

In this first notebook, we'll start very simple to understand the general concepts whilst not really caring about good results. Further notebooks will build on this knowledge, to actually get good results.

We'll be using a recurrent neural network (RNN), which reads a sequence of words and outputs a hidden state for each word (sometimes called a time step). The hidden state is fed back in together with the next word in the sequence, until the final word has been fed into the RNN. This final hidden state is then used to predict whether the job posting is an IT job.

![RNN Image](https://camo.githubusercontent.com/45a3950547d071988cea037b78c8183cdbed0b17/68747470733a2f2f692e696d6775722e636f6d2f566564593969472e706e67 "RNN")

## Preparing Data

One of the main concepts of TorchText is the `Field`. Fields define how your data should be processed. In our IT job classification task we have two sources of data: the raw string of the job post description and the IT label, either "True" or "False".

We use the TEXT field to handle the job description and the LABEL field to handle the IT label.

The parameters of a Field specify how the data should be processed.

Our TEXT field has `tokenize='spacy'`, which defines that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the spaCy tokenizer. If no tokenize argument is passed, the default is simply splitting the string on spaces.
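
To see why the choice of tokenizer matters, here is a small sketch comparing plain whitespace splitting with a slightly smarter tokenizer. The regex-based `simple_tokenize` is a hypothetical stand-in written for this illustration; it is not spaCy's tokenizer and will not produce identical tokens.

```python
import re

text = "Provide technical assistance to the company's global customer base;"

# Default behaviour described above: split on whitespace only,
# so punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

def simple_tokenize(s):
    """Hypothetical tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s)

regex_tokens = simple_tokenize(text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `base;` and `base` would be two different vocabulary entries. Separating punctuation from words avoids that kind of duplication.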

LABEL is defined by a `LabelField`, a special subclass of the `Field` class specifically for handling labels. We will explain the `tensor_type` argument later.

For more on Fields, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).

We also set the random seeds for reproducibility.

In [1]:
import torch
from torchtext import data

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/data job posts.csv')

df.head(5)

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\nJOB TITL...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\nc...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\nI...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\nJOB...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\nn...,,2004,1,False
3,Manoff Group\nJOB TITLE: BCC Specialist\nPOSI...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\nPe...,,23 January 2004\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\nJOB TITLE: Software D...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\n- CV; \n-...,,"20 January 2004, 18:00",,,,2004,1,True


In [3]:
df[['jobpost', 'IT']].to_csv('../data/data_jobpost_it.csv', index=False)

In [4]:
pos = data.TabularDataset(
    path='../data/data_jobpost_it.csv', format='csv', 
    fields=[
        ('text', TEXT), 
        ('label', LABEL)
    ]
)

In [5]:
train_data, test_data = pos.split()

We can see how many examples are in each split by checking their length.

In [6]:
print('len(train_data):', len(train_data))
print('len(test_data):', len(test_data))

len(train_data): 13301
len(test_data): 5701


We can check the fields of the data, hoping that they match the Fields given earlier.

In [7]:
print('train_data.fields:', train_data.fields)

train_data.fields: {'text': <torchtext.data.field.Field object at 0x7f6908977940>, 'label': <torchtext.data.field.LabelField object at 0x7f6908977c18>}


We can also check an example.

In [8]:
print('vars(train_data[0]):', vars(train_data[0]))

vars(train_data[0]): {'text': ['Aspid', 'Technologies', 'Co.', 'Ltd', '\n', 'TITLE', ':', ' ', 'Helpdesk/', 'Administrative', 'Assistant', '\n', 'TERM', ':', ' ', 'Full', 'time', 'or', 'Part', 'time', '\n', 'START', 'DATE/', 'TIME', ':', ' ', 'Immediately', '\n', 'DURATION', ':', ' ', 'Long', 'term', 'with', '2', 'months', 'probation', 'period', '\n', 'LOCATION', ':', ' ', 'Yerevan', ',', 'Armenia', '\n', 'JOB', 'DESCRIPTION', ':', ' ', 'N', '/', 'A', '\n', 'JOB', 'RESPONSIBILITIES', ':', '\n', '-', 'Provide', 'technical', 'assistance', 'to', 'the', 'company', "'s", 'global', 'customer', 'base', ';', '\n', '-', 'Answer', ',', 'transfer', 'and', 'record', 'phone', 'calls', 'and', 'emails', ';', 'send', 'and', 'receive', '\n', 'documents', 'via', 'fax', ',', 'post', 'offices', ';', '\n', '-', 'Receive', 'and', 'control', 'visitors', ',', 'external', 'and', 'internal', 'people', ';', '\n', '-', 'Check', 'internal', 'and', 'external', 'emails', ';', 'record', 'incoming', 'and', 'outgoing',

We now split the training data once more to create a validation set. By default `split` divides the data 70/30. Passing a `split_ratio` argument changes the ratio, e.g. a `split_ratio` of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set.

We also seed Python's `random` module and pass its state to the `random_state` argument, ensuring that we get the same train/validation split each time.

In [9]:
import random

# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())
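
The idea behind a seeded split can be sketched without TorchText. The helper below is an illustrative assumption, not TorchText's implementation: it shuffles a copy of the examples with a private, seeded generator and cuts the list at the requested ratio.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` with a fixed seed and cut it in two."""
    rng = random.Random(seed)          # private generator, global state untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(10))
train_part, valid_part = split_examples(examples, split_ratio=0.8)
again_train, again_valid = split_examples(examples, split_ratio=0.8)

print(len(train_part), len(valid_part))
print(train_part == again_train)       # same seed, same split
```

Applying the same rounding to the 13,301 training examples at the default ratio of 0.7 gives 9,311 and 3,990 examples, which matches the lengths printed in the next cell.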

Again, we'll view how many examples are in each split.

In [10]:
print('len(train_data):', len(train_data))
print('len(valid_data):', len(valid_data))
print('len(test_data):', len(test_data))

len(train_data): 9311
len(valid_data): 3990
len(test_data): 5701


Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in your dataset (every word that occurs in any of your examples) has a corresponding index (an integer).

![image](https://camo.githubusercontent.com/6bc54d31095cbf20e35cfb8c1d9ea5e63ece0886/68747470733a2f2f692e696d6775722e636f6d2f306f35476461722e706e67)

We do this as our machine learning model cannot operate on strings, only numbers. Each index is used to construct a one-hot vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary.

A corpus of job postings can easily contain tens of thousands of unique tokens, and each one-hot vector has one dimension per token. Vectors that large make training slow and may not fit on your GPU (if you're using one).

There are two ways to effectively cut down our vocabulary: we can either keep only the top $n$ most common words or ignore words that appear fewer than $m$ times. We'll do the former, keeping at most the 50,000 most common words.

What do we do with words that appear in examples but that we have cut from the vocabulary? We replace them with a special unknown token, `<unk>`. For example, if the sentence was "This job requires working knowledge in Microsoft Office" but the word "Microsoft" was not in the vocabulary, it would become "This job requires working knowledge in `<unk>` Office".
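
The lookup table and the `<unk>` replacement can be sketched in a few lines of plain Python. The helpers `build_vocab` and `numericalize` are hypothetical names for this illustration and only mimic the behaviour described above; TorchText's own implementation differs in detail.

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    """Keep the `max_size` most frequent tokens, plus two special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

train_tokens = [
    ["this", "job", "requires", "office"],
    ["this", "job", "requires", "travel"],
    ["this", "job"],
]
itos, stoi = build_vocab(train_tokens, max_size=3)

print(itos)
print(numericalize(["this", "job", "requires", "microsoft", "office"], stoi))
```

Here "microsoft" never occurred in the toy training data and "office" was cut by the size limit, so both map to index 0, the `<unk>` token.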

In [11]:
TEXT.build_vocab(train_data, max_size=50000)
LABEL.build_vocab(train_data)

Why do we only build the vocabulary on the training set? When testing any machine learning system you do not want to look at the test set in any way. We do not include the validation set as we want it to reflect the test set as much as possible.

In [12]:
print('len(TEXT.vocab):', len(TEXT.vocab))
print('len(LABEL.vocab):', len(LABEL.vocab))

len(TEXT.vocab): 36206
len(LABEL.vocab): 3


When we feed sentences into our model, we feed a batch of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. To ensure this, any sentence shorter than the longest one in the batch is padded with the special `<pad>` token.
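
As a minimal sketch of padding, assuming the pad token sits at index 1 (as it does in the vocabulary printed further down), a batch of index sequences can be brought to a common length like this. `pad_batch` is a hypothetical helper, not a TorchText function.

```python
def pad_batch(sequences, pad_index=1):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 8, 2], [7], [4, 9, 6, 3]]
padded = pad_batch(batch)

print(padded)
```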

We can also view the most common words in the vocabulary.

In [13]:
print(TEXT.vocab.freqs.most_common(20))

[('\n', 470238), ('-', 193482), ('and', 164214), (',', 144175), ('the', 128180), (';', 124992), (':', 124759), ('of', 115034), ('.', 99890), (' ', 88374), ('in', 84923), ('to', 73901), ('for', 41879), ('a', 40413), ('with', 29454), ('"', 28565), ('your', 25730), ('or', 23616), (')', 23221), ('is', 21895)]


We can also see the vocabulary directly using either the `stoi` (string to int) or `itos` (int to string) attribute.

In [14]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', '\n', '-', 'and', ',', 'the', ';', ':', 'of']


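We can also check the labels, ensuring 0 is for False and 1 is for True. The mapping also contains a third entry, 'IT'. This is the header row of the CSV file, which was read as an ordinary example because `skip_header=True` was not passed to `TabularDataset`. It is also why `len(LABEL.vocab)` above is 3 rather than 2.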

In [15]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f68f06e0b70>, {'False': 0, 'True': 1, 'IT': 2})


The final step of preparing the data is creating the iterators.

`BucketIterator` first sorts the examples using the `sort_key` (here, the length of the sentences) and then partitions them into buckets. When the iterator is called it returns a batch of examples from the same bucket, so the examples in a batch have similar lengths and the amount of padding is minimized.

In [16]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
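
To illustrate why grouping by length helps, the sketch below counts how many pad tokens a set of batches needs with and without sorting first. It is a simplified stand-in under stated assumptions: the real `BucketIterator` also shuffles, and `make_batches` and `padding_needed` are hypothetical helpers.

```python
def make_batches(examples, batch_size):
    """Cut a list of examples into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

def padding_needed(batches):
    """Total pad tokens required to make each batch rectangular."""
    return sum(max(len(e) for e in b) - len(e) for b in batches for e in b)

# Six toy "sentences" of lengths 2, 9, 3, 8, 2 and 9
examples = [[0] * n for n in (2, 9, 3, 8, 2, 9)]

unsorted_batches = make_batches(examples, batch_size=2)
sorted_batches = make_batches(sorted(examples, key=len), batch_size=2)

print(padding_needed(unsorted_batches))
print(padding_needed(sorted_batches))
```

Sorting by length before batching cuts the padding in this toy case from 19 tokens to 5.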

## Build the Model

The next stage is building the model that we'll eventually train and evaluate.

There is a small amount of boilerplate code when creating models in PyTorch. Note how our RNN class is a sub-class of `nn.Module` and how it uses `super`.

Within the `__init__` we define the layers of the module. Our three layers are an *embedding* layer, our RNN, and a *linear* layer.

The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller). This embedding layer is simply a single fully connected layer. The theory is that words that have a similar impact on the classification are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).

The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, transforming it to the correct output dimension.

![Image](https://camo.githubusercontent.com/32eb89dd587c125f8379b22e58a0954d42a2fcf8/68747470733a2f2f692e696d6775722e636f6d2f47496f76337a462e706e67)

The forward method is called when we feed examples into our model.

Each batch, x, is a tensor of size **_[sentence length, batch size]_**. That is a batch of sentences, each having each word converted into a one-hot vector.

You may notice that this tensor should have another dimension due to the one-hot vectors, however PyTorch conveniently stores a one-hot vector as its index value.
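
Storing the index is enough because multiplying a one-hot vector by a weight matrix simply selects one row of that matrix. A small pure-Python check, using a made-up 4-word, 2-dimensional embedding table:

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector (length V) times matrix (V x D) gives a row vector (length D)."""
    return [sum(v * row[d] for v, row in zip(vec, matrix))
            for d in range(len(matrix[0]))]

# Hypothetical embedding table: 4 words, 2 dimensions each
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
word_index = 2

via_one_hot = vec_times_matrix(one_hot(word_index, 4), embedding)
via_lookup = embedding[word_index]

print(via_one_hot, via_lookup)
```

Both routes give the same vector, but the lookup never materialises the vocabulary-sized one-hot vector.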

The input batch is then passed through the embedding layer to get `embedded`, where each one-hot vector is now converted to a dense vector. `embedded` is a tensor of size **_[sentence length, batch size, embedding dim]_**.

embedded is then fed into the RNN. In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.

The RNN returns 2 tensors, `output` of size **_[sentence length, batch size, hidden dim]_** and `hidden` of size **_[1, batch size, hidden dim]_**. `output` is the concatenation of the hidden state from every time step, whereas `hidden` is simply the final hidden state. We verify this using the assert statement. Note the squeeze method, which is used to remove a dimension of size 1.

Finally, we feed the last hidden state, hidden, through the linear layer to produce a prediction.

In [17]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):

        #x = [sent len, batch size]
        
        embedded = self.embedding(x)
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))
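
To make the recurrence concrete, here is a minimal pure-Python sketch of a single-layer RNN with the tanh non-linearity, processing one sequence (no batch dimension). The weights are made-up numbers and a single bias vector is used for brevity, so treat it as an illustration of the idea rather than a reproduction of `nn.RNN`.

```python
import math

def rnn_step(x, h, w_ih, w_hh, b):
    """One step: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_ih[j], x))
            + sum(w * hi for w, hi in zip(w_hh[j], h))
            + b[j]
        )
        for j in range(len(h))
    ]

def run_rnn(inputs, hidden_dim, w_ih, w_hh, b):
    h = [0.0] * hidden_dim             # initial hidden state of all zeros
    outputs = []
    for x in inputs:                   # one step per word vector
        h = rnn_step(x, h, w_ih, w_hh, b)
        outputs.append(h)
    return outputs, h

# Made-up weights: embedding dim 2, hidden dim 2
w_ih = [[0.5, -0.2], [0.1, 0.3]]
w_hh = [[0.0, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three "embedded words"

outputs, hidden = run_rnn(inputs, 2, w_ih, w_hh, b)
print(outputs[-1] == hidden)
```

The last entry of `outputs` is the final hidden state, which is the same property the `assert` in the model's `forward` method checks.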

We now create an instance of our RNN class.

The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size.

The embedding dimension is the size of the dense word vectors. One rule of thumb puts it around the square root of the vocab size, but it is a hyperparameter you can tune.

The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but depends on the vocab size, embedding dimension and the complexity of the task.

The output dimension is usually the number of classes. With only 2 classes, however, a single output value suffices: the model emits one real number per example, which the sigmoid later maps to a value between 0 and 1, so the output can be 1-dimensional, i.e. a single scalar.

In [18]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
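
As a rough sanity check on model size, the parameter count can be worked out by hand. The sketch assumes the default layer layouts: an embedding table of vocab size by embedding dim, a single-layer RNN with an input-to-hidden matrix, a hidden-to-hidden matrix and two bias vectors, and a linear layer with a bias. The vocabulary size of 36,206 is taken from the output printed earlier.

```python
def count_parameters(vocab_size, embedding_dim, hidden_dim, output_dim):
    """Parameter counts per layer under the assumptions stated above."""
    embedding = vocab_size * embedding_dim
    rnn = (hidden_dim * embedding_dim      # input-to-hidden weights
           + hidden_dim * hidden_dim       # hidden-to-hidden weights
           + 2 * hidden_dim)               # two bias vectors
    fc = hidden_dim * output_dim + output_dim
    return {"embedding": embedding, "rnn": rnn, "fc": fc,
            "total": embedding + rnn + fc}

counts = count_parameters(vocab_size=36206, embedding_dim=100,
                          hidden_dim=256, output_dim=1)
print(counts)
```

Under these assumptions almost all of the roughly 3.7 million parameters sit in the embedding layer.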

## Train the Model

Now we'll set up the training and then train the model.

First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. Here, we'll use stochastic gradient descent (SGD). The first argument is the parameters that will be updated by the optimizer, the second is the learning rate, i.e. how much we'll change the parameters by when we do an update.

In [19]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)
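
The update rule behind plain SGD is short enough to write out: each parameter moves against its gradient, scaled by the learning rate. The toy problem below (minimising $(w - 3)^2$) and the helper `sgd_step` are illustrative assumptions, not part of the model above.

```python
def sgd_step(params, grads, lr):
    """Plain SGD update: p <- p - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3)
w = [0.0]
for _ in range(100):
    grads = [2 * (w[0] - 3.0)]
    w = sgd_step(w, grads, lr=0.1)

print(w)
```

With a learning rate that is too small the parameters barely move per step, which is one reason training can appear to stall.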

Next, we'll define our loss function. In PyTorch this is commonly called a criterion.

The loss function here is binary cross entropy with logits.

The prediction for each sentence is an unbounded real number. As our labels are either 0 or 1, we want to restrict the prediction to between 0 and 1, which we do using the sigmoid function, see [here](https://en.wikipedia.org/wiki/Sigmoid_function).

We then calculate the loss from this bounded scalar using binary cross entropy, see [here](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/). `BCEWithLogitsLoss` carries out both the sigmoid and the binary cross entropy steps.

In [20]:
criterion = nn.BCEWithLogitsLoss()
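
For a single example, the two steps can be written out by hand. The sketch below computes the loss the naive way (sigmoid, then cross entropy) and with a commonly used numerically stable rearrangement, $\max(x, 0) - x y + \log(1 + e^{-|x|})$. It is meant to show that the two agree, not to describe how PyTorch implements the loss internally.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Sigmoid followed by binary cross entropy."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Same loss, rearranged so no log of a value near 0 is taken."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit in (-2.0, 0.0, 3.5):
    for target in (0.0, 1.0):
        print(logit, target, bce_naive(logit, target), bce_with_logits(logit, target))
```

The stable form also survives extreme logits such as 1000, where the naive version would try to take the logarithm of zero.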

PyTorch has excellent support for NVIDIA GPUs via CUDA. torch.cuda.is_available() returns True if PyTorch detects a GPU.

Using .to, we can place the model and the criterion on the GPU.

In [21]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)

Our criterion function calculates the loss, however we have to write our own function to calculate the accuracy.

This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, and then rounds them to the nearest integer. This rounds any value greater than 0.5 to 1 (an IT job).

We then calculate how many rounded predictions equal the actual labels and average it across the batch.

In [52]:
import torch.nn.functional as F

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer (torch.sigmoid replaces the deprecated F.sigmoid)
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division
    acc = correct.sum()/len(correct)
    return acc

The train function iterates over all examples, a batch at a time.

`model.train()` is used to put the model in "training mode", which turns on dropout and batch normalization. Although we aren't using them in this model, it's good practice to include it.

For each batch, we first zero the gradients. Each parameter in a model has a grad attribute which stores the gradient calculated by the criterion. PyTorch does not automatically remove (or zero) the gradients calculated from the last gradient calculation so they must be manually cleared.

We then feed the batch of sentences, `batch.text`, into the model. Note, you do not need to do `model.forward(batch.text)`, simply calling the model works. The squeeze is needed as the predictions are initially size **_[batch size, 1]_**, and we need to remove the dimension of size 1.

The loss and accuracy are then calculated using our predictions and the labels, `batch.label`.

We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using the gradients and optimizer algorithm with `optimizer.step()`.

The loss and accuracy are accumulated across the epoch. The `.item()` method is used to extract a scalar from a tensor which only contains a single value.

Finally, we return the loss and accuracy, averaged across the epoch. The len of an iterator is the number of batches in the iterator.

You may recall when initializing the `LABEL` field, we set `tensor_type=torch.FloatTensor`. This is because TorchText sets tensors to be `LongTensors` by default, however our criterion expects both inputs to be `FloatTensors`. As we have manually set the tensor_type to be `FloatTensors`, this conversion is done for us.

Another method would be to do the conversion inside the train function by passing `batch.label.float()` instead of `batch.label` to the criterion.

In [49]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
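
One subtlety of dividing by the number of batches: every batch counts equally, even if the last one is smaller. The sketch below uses made-up (prediction, label) pairs to compare the mean of per-batch accuracies with the accuracy over all examples; both helper names are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, as the train function above does."""
    accs = [sum(p == y for p, y in batch) / len(batch) for batch in batches]
    return sum(accs) / len(accs)

def example_weighted_accuracy(batches):
    """Fraction of correct predictions over all examples."""
    correct = sum(p == y for batch in batches for p, y in batch)
    total = sum(len(batch) for batch in batches)
    return correct / total

# (prediction, label) pairs; the last batch is smaller than the others
batches = [
    [(1, 1), (0, 0), (1, 1), (0, 0)],   # 4 of 4 correct
    [(1, 0), (0, 0), (1, 1), (0, 1)],   # 2 of 4 correct
    [(1, 0), (0, 1)],                   # 0 of 2 correct
]

print(mean_of_batch_accuracies(batches))
print(example_weighted_accuracy(batches))
```

With a batch size of 64 and thousands of examples the difference is small, but it explains why the reported accuracy need not match a count over individual examples exactly.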

evaluate is similar to train, with a few modifications as you don't want to update the parameters when evaluating.

`model.eval()` puts the model in "evaluation mode", this turns off _dropout_ and _batch normalization_. Again, we are not using them in this model, but it is good practice to include it.

Inside the `no_grad()`, no gradients are calculated which speeds up computation.

The rest of the function is the same as train, with the removal of `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`.

In [50]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the split.

In [25]:
N_EPOCHS = 5

for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.549, Train Acc: 77.15%, Val. Loss: 0.510, Val. Acc: 80.13%
Epoch: 02, Train Loss: 0.502, Train Acc: 80.21%, Val. Loss: 0.505, Val. Acc: 80.13%
Epoch: 03, Train Loss: 0.502, Train Acc: 80.11%, Val. Loss: 0.504, Val. Acc: 80.13%
Epoch: 04, Train Loss: 0.501, Train Acc: 80.18%, Val. Loss: 0.503, Val. Acc: 80.13%
Epoch: 05, Train Loss: 0.500, Train Acc: 80.23%, Val. Loss: 0.502, Val. Acc: 80.13%


You may have noticed the loss is barely decreasing and the validation accuracy does not move from 80.13%. This is due to several issues with the model which we'll improve in the next notebook.

## Model Evaluation

Finally, the metric you actually care about, the test loss and accuracy.

In [53]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.502, Test Acc: 80.19%


In [32]:
def to_binary(preds):
    """
    Convert predicted torch array to either 0 or 1
    """

    #round predictions to the closest integer (torch.sigmoid replaces the deprecated F.sigmoid)
    return torch.round(torch.sigmoid(preds))

def predict(model, iterator):
    
    model.eval()
    
    all_predictions = [] 
    with torch.no_grad():
    
        for batch in iterator:

            predictions = to_binary(model(batch.text).squeeze(1)).cpu().numpy()
            all_predictions += predictions.tolist()
            
    return all_predictions

We now collect the predicted labels for the test set and compare them with the actual labels in a confusion matrix.

In [34]:
test_predicted_labels = predict(model, test_iterator)

In [40]:
from sklearn.metrics import confusion_matrix

test_actual_labels = []
with torch.no_grad():
    for batch in test_iterator:
        test_actual_labels += batch.label.cpu().numpy().tolist()
        
cf = confusion_matrix(test_actual_labels, test_predicted_labels)
cf


array([[4572,    8],
       [1121,    0]])

In [42]:
test_actual_labels_hist = np.histogram(test_actual_labels, bins=[0, 1, 2])
test_actual_labels_hist

(array([4580, 1121]), array([0, 1, 2]))

In [43]:
test_predicted_labels_hist = np.histogram(test_predicted_labels, bins=[0, 1, 2])
test_predicted_labels_hist

(array([5693,    8]), array([0, 1, 2]))
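
The numbers above are worth reading closely. In scikit-learn's confusion matrix the rows are the actual classes and the columns the predicted ones, so a few derived metrics can be computed directly from the printed values:

```python
# Confusion matrix printed above: rows = actual, columns = predicted
cf = [[4572, 8],
      [1121, 0]]

tn, fp = cf[0]          # actual non-IT jobs
fn, tp = cf[1]          # actual IT jobs
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
majority_baseline = (tn + fp) / total      # always predicting "not IT"
recall_it = tp / (tp + fn)
precision_it = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"IT recall:         {recall_it:.4f}")
print(f"IT precision:      {precision_it:.4f}")
```

The model finds none of the 1,121 IT jobs, and its accuracy is slightly below what always predicting "not IT" would achieve. The roughly 80% accuracy therefore reflects the class balance of the data rather than anything the model has learned.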

Finally, we can save our simple RNN model for the IT job classification task.

In [29]:
torch.save(model, './models/01_simple_rnn.pth')


## Next Steps

In the next notebook, the improvements we will make are:

* different optimizer
* use pre-trained word embeddings
* different RNN architecture
* bidirectional RNN
* multi-layer RNN
* regularization

This will allow us to achieve ~85% accuracy.