In [1]:
!pip3 install torch torchvision



In [2]:
!pip install torchtext



In [3]:
!pip install gensim
!pip install -U -q PyDrive



In [4]:
!python -m spacy download en


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')



In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import GoogleCredentials
from google.colab import auth

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# Download a file from a Google Drive folder into the current working directory
def download_drive_file(drive_directory, filename):
  # Look up the folder by title, ignoring anything in the trash
  list_file_query = "title='{}' and trashed=false".format(drive_directory)
  file_list = drive.ListFile({'q': list_file_query}).GetList()

  if len(file_list) > 0:
    # Use the first folder that matches the title
    directory_id = file_list[0]['id']

    # List every file directly inside that folder
    list_file_query = "'{}' in parents".format(directory_id)

    file_list = drive.ListFile({'q': list_file_query}).GetList()

    for file1 in file_list:
      if file1['title'] == filename:
        print("downloading file {}".format(file1['title']))
        file1.GetContentFile(file1['title'])

In [7]:
download_drive_file("Datasets", "data_jobposts_it.csv")

downloading file data_jobposts_it.csv


# 3 - Faster Classification using FastText

In the previous notebook, we managed to achieve a decent test accuracy of ~85% using all of the common techniques used for sentiment analysis. In this notebook, we'll implement a model that achieves comparable results a lot faster. More specifically, we'll be implementing the "FastText" model from the paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).

This will allow us to achieve the same ~85% test accuracy as the last model, but much faster.

## Preparing Data

One of the key concepts in the FastText paper is that they calculate the n-grams of an input sentence and append them to the end of a sentence. Here, we'll use bi-grams. Briefly, a bi-gram is a pair of words/tokens that appear consecutively within a sentence.

For example, in the sentence "how are you ?", the bi-grams are: "how are", "are you" and "you ?".

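The `generate_bigrams` function takes a sentence that has already been tokenized, calculates the bi-grams and appends them to the end of the tokenized list. Because the bi-grams are collected in a set, repeated bi-grams are appended only once and the order in which they are appended is not guaranteed.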

In [0]:
def generate_bigrams(x):
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

As an example:

In [9]:
generate_bigrams(['This', 'job', 'requires', 'you', 'to', 'be', 'fluent', 'in', 'Mandarin'])

['This',
 'job',
 'requires',
 'you',
 'to',
 'be',
 'fluent',
 'in',
 'Mandarin',
 'This job',
 'to be',
 'job requires',
 'fluent in',
 'requires you',
 'in Mandarin',
 'be fluent',
 'you to']
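The function above modifies its input list in place and, because it goes through a `set`, appends the bi-grams in an arbitrary order. If a reproducible order is preferred, a minimal alternative sketch could look like the following. The name `generate_bigrams_ordered` is hypothetical and not part of the original notebook; unlike the set-based version, it keeps repeated bi-grams.

```python
def generate_bigrams_ordered(tokens):
    """Return a new list: the tokens followed by their bi-grams in sentence order."""
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    # Build a new list so the caller's list is left untouched
    return list(tokens) + bigrams


example = generate_bigrams_ordered(['how', 'are', 'you', '?'])
print(example)
# ['how', 'are', 'you', '?', 'how are', 'are you', 'you ?']
```

Either version can be passed as the preprocessing function; the choice only affects the order and multiplicity of the appended bi-grams, which the averaging model below is largely insensitive to.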

`TorchText` Fields have a `preprocessing` argument. A function passed here will be applied to a sentence after it has been tokenized (transformed from a string into a list of tokens), but before it has been indexed (transformed from a token to an integer). Here, we pass our `generate_bigrams` function.
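To make the order of these steps concrete, here is a toy sketch of the three stages in plain Python. It does not use TorchText: `toy_tokenize`, `add_bigrams`, `numericalize` and the vocabulary `toy_stoi` are all hypothetical stand-ins chosen for illustration.

```python
def toy_tokenize(sentence):
    # Stand-in for the spaCy tokenizer: split on whitespace only
    return sentence.split()


def add_bigrams(tokens):
    # Stand-in preprocessing step: append bi-grams in sentence order
    return tokens + [' '.join(pair) for pair in zip(tokens, tokens[1:])]


def numericalize(tokens, stoi, unk_index=0):
    # Map each token (or bi-gram) to an integer; unknown entries share one index
    return [stoi.get(token, unk_index) for token in tokens]


# Hypothetical vocabulary; index 0 is reserved for unknown tokens
toy_stoi = {'<unk>': 0, 'fluent': 1, 'in': 2, 'Mandarin': 3, 'fluent in': 4}

tokens = toy_tokenize('fluent in Mandarin')    # 1. tokenize
processed = add_bigrams(tokens)                # 2. preprocess
indices = numericalize(processed, toy_stoi)    # 3. index

print(processed)  # ['fluent', 'in', 'Mandarin', 'fluent in', 'in Mandarin']
print(indices)    # [1, 2, 3, 4, 0]
```

Note that a bi-gram is looked up in the vocabulary as a single entry, so a bi-gram that was not seen often enough to enter the vocabulary maps to the unknown index just like a rare word would.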

In [0]:
import torch
from torchtext import data
import random

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy', preprocessing=generate_bigrams)
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

pos = data.TabularDataset(
    path='data_jobposts_it.csv', format='csv', 
    fields=[
        ('text', TEXT), 
        ('label', LABEL)
    ]
)

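# random.seed() returns None, so seed the generator first
# and pass its state explicitly to make the split reproducible
random.seed(SEED)

train_data, test_data = pos.split()

train_data, valid_data = train_data.split(random_state=random.getstate())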

Build the vocab and load the pre-trained word embeddings.

In [0]:
TEXT.build_vocab(train_data, max_size=50000, vectors="glove.6B.100d")
LABEL.build_vocab(train_data)

And create the iterators.

In [0]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)

## Build the Model

This model has far fewer parameters than the previous model because only two of its layers have any parameters: the embedding layer and the linear layer. There is no RNN component in sight!

Instead, it first calculates the word embedding for each word using the Embedding layer, then calculates the average of all of the word embeddings and feeds this through the Linear layer, and that's it!

![](https://camo.githubusercontent.com/faa26b515c1fe979636b2e8ca433bacb71815d0d/68747470733a2f2f692e696d6775722e636f6d2f653073575a6f5a2e706e67)

We implement the averaging with the `avg_pool2d` (average pool 2-dimensions) function. Initially, 2-dimensional pooling may seem strange: surely our sentences are 1-dimensional, not 2-dimensional? However, you can think of the word embeddings as a 2-dimensional grid, where the words are along one axis and the dimensions of the word embeddings are along the other. The image below shows an example sentence after being converted into 5-dimensional word embeddings, with the words along the vertical axis and the embeddings along the horizontal axis.

![](https://camo.githubusercontent.com/e5d4a50200df0c001003675cf42280a21df1de6d/68747470733a2f2f692e696d6775722e636f6d2f53534832354e542e706e67)

The `avg_pool2d` passes a filter of size `embedded.shape[1]` (i.e. the length of the sentence) by 1. This is shown in pink in the image below.

![](https://camo.githubusercontent.com/97ed7e6f89deaac7884c664382a3f37b46dcbf3d/68747470733a2f2f692e696d6775722e636f6d2f553765526e49652e706e67)

The average value of each embedding dimension is calculated across all of the words, and the averages are concatenated into one tensor for each sentence (5-dimensional in our pictorial examples, 100-dimensional in the code). This tensor is then passed through the linear layer to produce our prediction.
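As a quick sanity check, the pooling described here should give the same result as a plain mean over the sentence axis. The sketch below uses a small random tensor in place of real embeddings; the shapes (2 sentences, 4 tokens, 5 dimensions) are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy batch: 2 sentences, 4 tokens each, 5-dimensional embeddings
embedded = torch.randn(2, 4, 5)  # [batch size, sent len, emb dim]

# The pooling window spans the whole sentence axis and one embedding dimension
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)

# The same computation written as a mean over the sentence axis
averaged = embedded.mean(dim=1)

print(pooled.shape)                                  # torch.Size([2, 5])
print(torch.allclose(pooled, averaged, atol=1e-6))   # True
```

One consequence worth keeping in mind: any padding tokens in a batch are part of this average too, since the window always covers the full padded sentence length.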

In [0]:
import torch.nn as nn
import torch.nn.functional as F


class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, x):
        
        #x = [sent len, batch size]
        
        embedded = self.embedding(x)
                
        #embedded = [sent len, batch size, emb dim]
        
        embedded = embedded.permute(1, 0, 2)
        
        #embedded = [batch size, sent len, emb dim]
        
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
        
        #pooled = [batch size, embedding_dim]
                
        return self.fc(pooled)

As previously, we'll create an instance of our FastText class.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM = 1

model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM)
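To get a feel for how small this model is, its parameter count can be worked out by hand. The sketch below assumes a vocabulary of 50,002 entries (the 50,000 tokens allowed by `max_size` plus the `<unk>` and `<pad>` specials); the real value is whatever `len(TEXT.vocab)` returns for your data. The helper name `count_fasttext_parameters` is hypothetical.

```python
def count_fasttext_parameters(vocab_size, embedding_dim, output_dim):
    # One embedding vector per vocabulary entry
    embedding_params = vocab_size * embedding_dim
    # Linear layer: weight matrix plus one bias per output
    linear_params = embedding_dim * output_dim + output_dim
    return embedding_params + linear_params


# Assumption: 50,000 tokens from max_size plus the <unk> and <pad> specials
assumed_vocab_size = 50_002
total = count_fasttext_parameters(assumed_vocab_size, 100, 1)
print(f'{total:,} parameters')  # 5,000,301 parameters
```

Almost all of the parameters sit in the embedding layer; the linear layer contributes only 101. On the actual model the same number can be obtained with `sum(p.numel() for p in model.parameters())`.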

And copy the pre-trained vectors to our embedding layer.

In [34]:
pretrained_embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

## Train the Model

Training the model is exactly the same as last time.

We initialize our optimizer...

In [0]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

The rest of the steps for training the model are unchanged.

We define the criterion and place the model and criterion on the GPU (if available)...

In [0]:
criterion = nn.BCEWithLogitsLoss()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)

We implement the function to calculate accuracy...

In [0]:
import torch

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
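The same calculation can be followed by hand on a few made-up numbers. This is a plain-Python sketch with hypothetical logits and labels, not output from the model; `binary_accuracy_plain` is an illustrative name.

```python
import math


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def binary_accuracy_plain(logits, labels):
    # Threshold the sigmoid output at 0.5, then compare with the labels
    predictions = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical raw model outputs and true labels for one batch of five examples
logits = [2.0, -1.5, 0.3, -0.2, 1.1]
labels = [1.0, 0.0, 0.0, 0.0, 1.0]

print(binary_accuracy_plain(logits, labels))  # 0.8
```

A positive logit maps to a probability above 0.5 and is rounded to 1, so the third example (logit 0.3, label 0) is the only mistake, giving 4 correct out of 5.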

We define a function for training our model...

**Note:** we are no longer using dropout so we do not need to use `model.train()`, but as mentioned in the 1st notebook, it is good practice to use it.

In [0]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We define a function for testing our model...

**Note**: again, we leave `model.eval()` even though we do not use dropout.

In [0]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Finally, we train our model...

In [40]:
N_EPOCHS = 5

for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')



Epoch: 01, Train Loss: 0.541, Train Acc: 80.02%, Val. Loss: 0.463, Val. Acc: 80.39%
Epoch: 02, Train Loss: 0.468, Train Acc: 80.00%, Val. Loss: 0.414, Val. Acc: 80.39%
Epoch: 03, Train Loss: 0.429, Train Acc: 80.07%, Val. Loss: 0.335, Val. Acc: 83.11%
Epoch: 04, Train Loss: 0.373, Train Acc: 81.54%, Val. Loss: 0.257, Val. Acc: 89.27%
Epoch: 05, Train Loss: 0.316, Train Acc: 85.83%, Val. Loss: 0.251, Val. Acc: 90.16%


...and get the test accuracy!

The results are not as good as the results in the last notebook, but training takes considerably less time.

## Model Evaluation

Finally, the metrics we actually care about: the test loss and accuracy.

In [41]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')



Test Loss: 0.224, Test Acc: 90.95%


In [0]:
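def to_binary(preds):
    """
    Convert raw model outputs (logits) to hard 0/1 predictions
    """

    #squash logits to [0, 1] with a sigmoid, then round to the closest integer
    #(torch.sigmoid replaces the deprecated F.sigmoid)
    return torch.round(torch.sigmoid(preds))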

def predict(model, iterator):
    
    model.eval()
    
    all_predictions = [] 
    with torch.no_grad():
    
        for batch in iterator:

            predictions = to_binary(model(batch.text).squeeze(1)).cpu().numpy()
            all_predictions += predictions.tolist()
            
    return all_predictions

Next, we build the confusion matrix from the predicted and actual test labels.

In [43]:
test_predicted_labels = predict(model, test_iterator)



In [44]:
from sklearn.metrics import confusion_matrix

test_actual_labels = []
with torch.no_grad():
    for batch in test_iterator:
        test_actual_labels += batch.label.cpu().numpy().tolist()
        
cf = confusion_matrix(test_actual_labels, test_predicted_labels)
cf



array([[4336,  251],
       [ 270,  844]])
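The cells above walk over `test_iterator` twice, once for the predictions and once for the labels, which only lines up if the iterator yields its batches in the same order both times. A sketch that sidesteps this by collecting both in a single pass is shown below. `predict_with_labels`, `ToyModel` and the fake batches are hypothetical and exist only to make the example runnable on its own.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


def predict_with_labels(model, iterator):
    """Collect binary predictions and true labels in a single pass."""
    model.eval()
    predicted, actual = [], []
    with torch.no_grad():
        for batch in iterator:
            logits = model(batch.text).squeeze(1)
            predicted += torch.round(torch.sigmoid(logits)).cpu().tolist()
            actual += batch.label.cpu().tolist()
    return predicted, actual


class ToyModel(nn.Module):
    """Hypothetical stand-in: the logit is the sum of token ids minus 5."""

    def forward(self, x):
        # x = [sent len, batch size] -> output = [batch size, 1]
        return (x.sum(dim=0).float() - 5.0).unsqueeze(1)


# Two fake batches that mimic the .text and .label attributes of a real batch
toy_batches = [
    SimpleNamespace(text=torch.tensor([[1, 4], [2, 3]]), label=torch.tensor([0.0, 1.0])),
    SimpleNamespace(text=torch.tensor([[0], [1]]), label=torch.tensor([0.0])),
]

predicted, actual = predict_with_labels(ToyModel(), toy_batches)
print(predicted)  # [0.0, 1.0, 0.0]
print(actual)     # [0.0, 1.0, 0.0]
```

With the real model, the two returned lists could be passed straight to `confusion_matrix(actual, predicted)`.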

In [45]:
import numpy as np


test_actual_labels_hist = np.histogram(test_actual_labels, bins=[0, 1, 2])
test_actual_labels_hist

(array([4587, 1114]), array([0, 1, 2]))

In [46]:
test_predicted_labels_hist = np.histogram(test_predicted_labels, bins=[0, 1, 2])
test_predicted_labels_hist

(array([4606, 1095]), array([0, 1, 2]))
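The histograms show that roughly four out of five test examples belong to class 0, so accuracy alone can be flattering. The sketch below derives a few complementary metrics directly from the confusion matrix above; the counts are copied from that output, and treating class 1 as the "positive" class is an assumption made for illustration.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 4336, 251
fn, tp = 270, 844

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0
precision = tp / (tp + fp)             # how many predicted 1s are correct
recall = tp / (tp + fn)                # how many actual 1s are found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy          {accuracy:.3f}')
print(f'majority baseline {majority_baseline:.3f}')
print(f'precision         {precision:.3f}')
print(f'recall            {recall:.3f}')
print(f'f1                {f1:.3f}')
```

The accuracy computed this way is about 90.9%, marginally different from the figure printed by `evaluate`, which averages per-batch accuracies. Always predicting class 0 would already score about 80.5% on this test set, which may explain why the validation accuracy sits near 80% during the first epochs; precision and recall for class 1 both come out at roughly 76-77%.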

We can finally save our FastText model for the IT job classification task.

In [47]:
torch.save(model, '03_fasttext.pth')



In [48]:
!ls -lAh

total 90M
-rw-r--r-- 1 root root  23M Oct  7 03:19 02_lstm.pth
-rw-r--r-- 1 root root  20M Oct  7 04:15 03_fasttext.pth
-rw-r--r-- 1 root root 2.5K Oct  7 02:26 adc.json
drwxr-xr-x 1 root root 4.0K Oct  7 02:26 .config
-rw-r--r-- 1 root root  49M Oct  7 03:43 data_jobposts_it.csv
drwxr-xr-x 2 root root 4.0K Sep 28 23:32 sample_data
drwxr-xr-x 2 root root 4.0K Oct  7 02:41 .vector_cache


In [0]:
from google.colab import files

files.download('03_fasttext.pth') 

### Next Steps

In the final notebook we'll use convolutional neural networks (CNNs) to perform sentiment analysis, and get our best accuracy yet!