# **Lab: Unstructured Data**

## Exercise 3: Text Classification

In this exercise, we will train a PyTorch model with embeddings to classify texts into topics. The lab was designed around the DBpedia dataset:
https://wiki.dbpedia.org/about

The steps are:
1.   Setup Repository
2.   Load Dataset
3.   Prepare Data
4.   Define Architecture
5.   Train Model
6.   Push Changes

**Note**: This lab was originally written for the DBpedia dataset, but DBpedia and some other datasets could not be downloaded. <br>AG_NEWS worked fine: it has no access or download issues (so far), so it is the dataset actually loaded below.<br>
https://pytorch.org/text/0.8.1/_modules/torchtext/datasets/text_classification.html

### 1. Setup Repository

**[1.1]** Go inside the created folder `adv_dsi_lab_6`

**[1.2]** Create a new git branch called pytorch_dbpedia

In [None]:
# Go inside the created folder adv_dsi_lab_6
cd adv_dsi_lab_6

# Create a new git branch called pytorch_dbpedia
git checkout -b pytorch_dbpedia

**[1.3]** Run the built image

**[1.4]** Display last 50 lines of logs
- Copy the url displayed and paste it to a browser in order to launch Jupyter Lab

In [None]:
# Run the built image
docker run  -dit --rm --name adv_dsi_lab_6 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/Projects/adv_dsi/adv_dsi_lab_6:/home/jovyan/work -v ~/Projects/adv_dsi/src:/home/jovyan/work/src -v ~/Projects/adv_dsi/data:/home/jovyan/work/data pytorch-notebook:latest 
                    
# Display last 50 lines of logs
docker logs --tail 50 adv_dsi_lab_6             

**[1.5]** Launch the magic commands for auto-reloading external modules

In [10]:
# Launch the magic commands for auto-reloading external modules
%load_ext autoreload
%autoreload 2

### 2. Load Dataset

**[2.0]** Run !pip install once only

In [1]:
# Run !pip install once only
!pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchtext==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
    
# Attempted newer versions but they have not fixed the bug
#!pip install torch==1.8.1+cpu torchvision==0.9.1+cpu torchtext==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


**[2.1]** Import torch, torchtext and text_classification from torchtext.datasets

**[2.2]** Create 2 variables called `NGRAMS` and `BATCH_SIZE` that will contain respectively the values 2 and 16

**[2.3]** Download the dataset into the `data/` folder with 2 ngrams and no vocabulary (DBpedia in the original lab; AG_NEWS is used here because DBpedia failed to download)


In [1]:
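import os

# Sanity check: confirm the raw data folder exists (the printed text is only a marker)
if os.path.isdir('./data/raw'):
    print('os.path.isdir("./.data")')

# Experimenting with pinning the torchtext version (left disabled)
import pkg_resources
# pkg_resources.require("torchtext==0.8.0")
# import torchtext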

os.path.isdir("./.data")


In [7]:
!pip show torchtext

Name: torchtext
Version: 0.9.0
Summary: Text utilities and datasets for PyTorch
Home-page: https://github.com/pytorch/text
Author: PyTorch core devs and James Bradbury
Author-email: jekbradbury@gmail.com
License: BSD
Location: /opt/conda/lib/python3.7/site-packages
Requires: numpy, torch, requests, tqdm
Required-by: 


### Notes on `torchtext.datasets.text_classification`
- Version 0.8.1 source: https://pytorch.org/text/0.8.1/_modules/torchtext/datasets/text_classification.html
- There appears to be an issue with the other datasets; only AG_NEWS loaded successfully

In [21]:
# Import torch and torchtext
import torch
import torchtext
from torchtext.datasets import text_classification

# Create 2 variables called NGRAMS and BATCH_SIZE that will contain respectively the values 2 and 16
NGRAMS = 2
BATCH_SIZE = 16
UNKNOWN_TOK = True  # passed as include_unk when the dataset is built

# Download DBpedia dataset into data/raw/ folder with 2 ngrams and no vocabulary (disabled: the download fails)
# train_dataset, test_dataset = text_classification.DATASETS['DBpedia'](root='../data/raw', ngrams=NGRAMS, vocab=None)
# train_dataset, test_dataset = text_classification.DATASETS['DBpedia'](root='./data/raw', ngrams=NGRAMS, vocab=None, include_unk=UNKNOWN_TOK)

# DBpedia appears to have an access issue, so use AG_NEWS instead, which works fine
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](root='./data', ngrams=NGRAMS, vocab=None, include_unk=UNKNOWN_TOK)

120000lines [00:10, 11317.07lines/s]
120000lines [00:21, 5545.51lines/s]
7600lines [00:01, 5144.64lines/s]
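
To illustrate what `ngrams=2` means in the call above, here is a minimal pure-Python sketch of the idea: each text contributes its individual tokens plus every pair of adjacent tokens. The function name `add_ngrams` and the sample sentence are illustrative assumptions, not torchtext's own code.

```python
def add_ngrams(tokens, n):
    """Return the tokens followed by all contiguous 2..n-grams, joined by spaces."""
    features = list(tokens)
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            features.append(" ".join(tokens[start:start + size]))
    return features


# Hypothetical tokenised headline, used only for illustration
features = add_ngrams(["wall", "street", "rallies"], 2)
print(features)
# ['wall', 'street', 'rallies', 'wall street', 'street rallies']
```

A text with k tokens therefore yields k unigrams and k - 1 bigrams, which is why the vocabulary built with bigrams is much larger than a unigram-only one.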


**[2.4]** Print the length of the training and testing sets

In [4]:
# Print the length of the training and testing sets
print(len(train_dataset))
print(len(test_dataset))

120000
7600


### 3. Prepare Data

**[3.1]** Extract the size of the vocabulary from the training set and save it into a variable called `VOCAB_SIZE`

**[3.2]** Extract the number of classes from the training set and save it into a variable called `NUM_CLASS`

In [5]:
# Extract the size of the vocabulary from the training set and save it into a variable called VOCAB_SIZE
VOCAB_SIZE = len(train_dataset.get_vocab())

# Extract the number of classes from the training set and save it into a variable called NUM_CLASS
NUM_CLASS = len(train_dataset.get_labels())

**[3.3]** Import random_split from torch.utils.data.dataset

**[3.4]** Create 2 variables called `train_len` and `valid_len` that will contain values that represent respectively 95% and 5% of the training data

In [6]:
# Import random_split from torch.utils.data.dataset
from torch.utils.data.dataset import random_split

# Create 2 variables called train_len and valid_len that will contain values that represent respectively 95% and 5% of the training data
train_len = int(len(train_dataset) * 0.95)
valid_len = len(train_dataset) - train_len
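
As a quick worked example of the arithmetic above, assuming the 120000 training observations printed in step 2.4: computing `valid_len` by subtraction, rather than as a second rounded percentage, guarantees that the two lengths add up to the full dataset size, which `random_split` requires.

```python
n_observations = 120_000  # training set size printed in step 2.4

train_len = int(n_observations * 0.95)
valid_len = n_observations - train_len  # remainder, so nothing is lost to rounding

print(train_len, valid_len)
print(train_len + valid_len == n_observations)
```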

**[3.5]** Split the training data into training and validation sets

**[3.6]** Create a generator on `train_data` and extract the first observation

In [7]:
# Split the training data into training and validation sets
train_data, valid_data = random_split(train_dataset, [train_len, valid_len])

# Create a generator on train_data and extract the first observation
# Each observation is a (label, text) tuple, so unpack it in that order
examples = enumerate(train_data)
batch_idx, (example_label, example_text) = next(examples)

example_text

tensor([    101,    7852,    9472,     559,       7,    3175,      12,       3,
            528,   12337,     139,       7,       3,     516,      99,     320,
            511,   24192,     158,    6900,       4,    1360,    3175,      39,
           1002,      47,      22,    1649,     709,       6,     586,    2419,
              2,   59907, 1079516,  663215,    3559,  158314,  103366,      60,
          48076,  975461,   66409,    1705,      29,    1958, 1250075,   54120,
         281037, 1261818,  109992,   21110,   40499,  299012,    8923,  163675,
           7341,    9746,   13580,   48861,  830861,    5872,    3432,  926778,
          52114])

**[3.7]** Define a function that will extract the label and text from a batch of observations, calculate the length of each text and store the cumulative lengths as offsets (marking where each new text starts)

In [8]:
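# Collate a batch of (label, text) observations into flat tensors
def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]

    # Offsets mark where each text starts in the concatenated tensor:
    # prepend 0, drop the last length, then take the cumulative sum
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label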
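
The offset computation in `generate_batch` can be traced by hand. The sketch below repeats the same steps in plain Python (lists and `itertools.accumulate` instead of tensors and `cumsum`) on three made-up texts; the token ids are arbitrary.

```python
from itertools import accumulate

# Three hypothetical texts of lengths 3, 2 and 4 (token ids are arbitrary)
texts = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

lengths = [len(t) for t in texts]
# Prepend 0, drop the last length, then take the running sum
offsets = list(accumulate([0] + lengths[:-1]))
# Concatenate all texts into one flat sequence
flat = [token for t in texts for token in t]

print(offsets)  # [0, 3, 5]

# Each text can be recovered from the flat sequence using its offset and length
recovered = [flat[start:start + n] for start, n in zip(offsets, lengths)]
print(recovered == texts)  # True
```

The last length is dropped because the final text simply runs to the end of the flat sequence, so only start positions are needed.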

### 4. Define Architecture

**[4.1]** Create a class called `TextTopic` that inherits from `nn.Module` with:
- inputs:
    - vocabulary size
    - embedding dimension
    - number of classes
- attributes:
    - `embedding`: bag of embeddings (`nn.EmbeddingBag`) of shape vocabulary size x embedding dimension
    - `fc`: fully-connected layer with as many neurons as there are classes
    - `softmax`: Softmax activation function
- methods:
    - `forward()`: takes `text` and `offsets` as input parameters and applies, in order, the embedding layer with dropout, the fully-connected layer with dropout, and softmax

**[4.2]** Import torch.nn as nn and torch.nn.functional as F

**[4.3]** Create a variable called `EMBED_DIM` that will contain the value 32

**[4.4]** Instantiate a `TextTopic()` and save it into a variable called `model` 

**[4.5]** Import the `get_device` function from src.models.pytorch 

In [12]:
# Import torch.nn as nn and torch.nn.functional as F
import torch.nn as nn
import torch.nn.functional as F

# Create a variable called EMBED_DIM that will contain the value 32
EMBED_DIM = 32

# Import from `pytorch.py` & instantiate a TextTopic() and save it into a variable called model
from src.models.pytorch import TextTopic
model = TextTopic(VOCAB_SIZE, EMBED_DIM, NUM_CLASS)

# Import the get_device function from src.models.pytorch
#from src.models.pytorch import get_device

def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu') # don't have GPU 
    return device

**[4.6]** Get the device available and set to the model to use it

**[4.7]** Print the architecture of this model

In [13]:
# Get the device available and set to the model to use it
device = get_device()
model.to(device)

# Print the architecture of this model
# (model.to(device) returns the model, so the notebook displays it without a separate `model` line)

TextTopic(
  (embedding): EmbeddingBag(1308844, 32, mode=mean)
  (fc): Linear(in_features=32, out_features=4, bias=True)
  (softmax): Softmax(dim=1)
)
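
The `TextTopic` class itself lives in `src/models/pytorch.py` and is not shown in this notebook. The sketch below is one possible implementation that matches the specification in step 4.1 and the printed architecture; the class name `TextTopicSketch`, the dropout rate of 0.5 and the use of functional dropout are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F


class TextTopicSketch(nn.Module):
    """Hypothetical re-creation of TextTopic, not the lab's actual source."""

    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc = nn.Linear(embed_dim, num_class)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, text, offsets):
        # Average the embeddings of each text, then apply dropout (rate assumed)
        embedded = F.dropout(self.embedding(text, offsets), p=0.5, training=self.training)
        # Fully-connected layer with dropout, followed by softmax over the classes
        out = F.dropout(self.fc(embedded), p=0.5, training=self.training)
        return self.softmax(out)


# Two toy texts packed into one flat tensor: tokens [1, 2, 3] and [4, 5]
sketch = TextTopicSketch(vocab_size=100, embed_dim=32, num_class=4)
sketch.eval()  # disable dropout for a deterministic forward pass
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])
probs = sketch(text, offsets)
print(probs.shape)  # one row of 4 class probabilities per text
```

Note that `nn.CrossEntropyLoss` expects raw scores and applies log-softmax internally, so a model that already returns softmax probabilities, as specified here, still trains but its loss cannot approach zero.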

### 5. Train Model

**[5.1]** Import DataLoader from torch.utils.data

**[5.2]** Create a variable called `N_EPOCHS` that will take the value 5

**[5.3]** Instantiate a `nn.CrossEntropyLoss()` and save it into a variable called `criterion`

**[5.6]** Instantiate a `torch.optim.SGD()` optimizer with the model's parameters and 4 as learning rate and save it into a variable called `optimizer`

**[5.7]** Instantiate a `torch.optim.lr_scheduler.StepLR()` scheduler that will decrease the learning rate by a coefficient of 0.9 for each epoch


In [26]:
# Import DataLoader from torch.utils.data
from torch.utils.data import DataLoader

# Create a variable called N_EPOCHS that will take the value 5
N_EPOCHS = 5
BATCH_SIZE = 32  # overrides the value of 16 set in step 2.2

# Import train_classification_unstruct and test_classification_unstruct from src.models.pytorch
from src.models.pytorch import train_classification_unstruct, test_classification_unstruct

# Instantiate a nn.CrossEntropyLoss() and save it into a variable called criterion
criterion = torch.nn.CrossEntropyLoss()

# Instantiate a torch.optim.SGD() optimizer with the model's parameters and 4 as learning rate 
# and save it into a variable called optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)

# Instantiate a torch.optim.lr_scheduler.StepLR() scheduler that will 
# decrease the learning rate by a coefficient of 0.9 for each epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
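
With a step size of 1 and `gamma=0.9`, the learning rate is multiplied by 0.9 every time the scheduler steps. The short calculation below shows the resulting values, assuming `scheduler.step()` is called exactly once per epoch (how `train_classification_unstruct` calls it is not shown in this notebook).

```python
initial_lr = 4.0
gamma = 0.9
n_epochs = 5

# Learning rate in effect during each epoch: initial_lr * gamma ** epoch
schedule = [initial_lr * gamma ** epoch for epoch in range(n_epochs)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.4f}")
# epoch 0: lr = 4.0000
# epoch 1: lr = 3.6000
# epoch 2: lr = 3.2400
# epoch 3: lr = 2.9160
# epoch 4: lr = 2.6244
```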

**[5.8]** Create a for loop that will iterate through the specified number of epochs and will train the model with the training set and assess the performance on the validation set and print their scores

In [27]:
# Create a for loop that will iterate through the specified number of epochs and will train the model with the training set and 
# assess the performance on the validation set and print their scores
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_classification_unstruct(train_data, model, criterion, optimizer, batch_size=BATCH_SIZE, device=device, scheduler=scheduler, generate_batch=generate_batch)
    valid_loss, valid_acc = test_classification_unstruct(valid_data, model, criterion, batch_size=BATCH_SIZE, device=device, generate_batch=generate_batch)

    print(f'Epoch: {epoch}')
    print(f'\t(train)\t|\tLoss: {train_loss:.4f}\t|\tAcc: {train_acc * 100:.1f}%')
    print(f'\t(valid)\t|\tLoss: {valid_loss:.4f}\t|\tAcc: {valid_acc * 100:.1f}%')

Epoch: 0
	(train)	|	Loss: 0.0382	|	Acc: 49.7%
	(valid)	|	Loss: 0.0349	|	Acc: 62.0%
Epoch: 1
	(train)	|	Loss: 0.0335	|	Acc: 65.8%
	(valid)	|	Loss: 0.0329	|	Acc: 67.6%
Epoch: 2
	(train)	|	Loss: 0.0321	|	Acc: 70.0%
	(valid)	|	Loss: 0.0321	|	Acc: 70.0%
Epoch: 3
	(train)	|	Loss: 0.0315	|	Acc: 72.2%
	(valid)	|	Loss: 0.0316	|	Acc: 71.8%
Epoch: 4
	(train)	|	Loss: 0.0310	|	Acc: 73.6%
	(valid)	|	Loss: 0.0313	|	Acc: 72.8%
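
The helper `train_classification_unstruct` is imported from `src/models/pytorch.py` and is not listed here. As a rough sketch of what one training epoch for this kind of model might look like, the code below uses hypothetical names (`train_epoch_sketch`, `TinyBagModel`, `collate`) and toy data. Dividing the summed batch losses by the number of observations is an assumption; it is one possible reading of the small loss values printed above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


def collate(batch):
    # Same idea as generate_batch: flat text tensor plus start offsets
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = torch.tensor([0] + [len(t) for t in text][:-1]).cumsum(dim=0)
    return torch.cat(text), offsets, label


class TinyBagModel(nn.Module):
    # Hypothetical stand-in model, only here to make the sketch runnable
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text, offsets):
        return self.fc(self.embedding(text, offsets))


def train_epoch_sketch(data, model, criterion, optimizer, batch_size, device,
                       generate_batch, scheduler=None):
    model.train()
    total_loss, total_correct = 0.0, 0
    loader = DataLoader(data, batch_size=batch_size, shuffle=True,
                        collate_fn=generate_batch)
    for text, offsets, label in loader:
        text, offsets, label = text.to(device), offsets.to(device), label.to(device)
        optimizer.zero_grad()
        output = model(text, offsets)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        total_correct += (output.argmax(dim=1) == label).sum().item()
    if scheduler is not None:
        scheduler.step()  # assumed: decay the learning rate once per epoch
    return total_loss / len(data), total_correct / len(data)


# Toy (label, text) observations, purely illustrative
torch.manual_seed(0)
toy_data = [(0, torch.tensor([1, 2, 3])), (1, torch.tensor([4, 5])),
            (0, torch.tensor([1, 3])), (1, torch.tensor([5, 4, 6]))]
toy_model = TinyBagModel(vocab_size=10, embed_dim=4, num_class=2)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.5)
toy_scheduler = torch.optim.lr_scheduler.StepLR(toy_optimizer, 1, gamma=0.9)

loss, acc = train_epoch_sketch(toy_data, toy_model, nn.CrossEntropyLoss(),
                               toy_optimizer, batch_size=2,
                               device=torch.device("cpu"),
                               generate_batch=collate, scheduler=toy_scheduler)
print(f"loss: {loss:.4f}, acc: {acc:.2f}")
```

A matching evaluation function would follow the same loop under `model.eval()` and `torch.no_grad()`, without the optimizer and scheduler steps.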


**[5.9]** Save the model into the `models` folder

In [28]:
# Save the model into the models folder
torch.save(model, "../models/pytorch_agnews.pt")

### 6. Push Changes

**[6.1]** Add your changes to git staging area

**[6.2]** Create the snapshot of your repository and add a description

**[6.3]** Push your snapshot to GitHub

**[6.4]** Check out the master branch

**[6.5]** Pull the latest updates

**[6.6]** Check out the `pytorch_dbpedia` branch

**[6.7]** Merge the `master` branch and push your changes

**[6.8]** Go to GitHub and merge the branch after reviewing the code and fixing any conflict

In [None]:
"""
# Add your changes to git staging area
git add .

# Create the snapshot of your repository and add a description
git commit -m "pytorch dbpedia"

# Push your snapshot to GitHub (replace <YOUR_GITHUB_TOKEN> with your own token; never commit a real one)
git push https://<YOUR_GITHUB_TOKEN>@github.com/CazMayhem/adv_dsi_lab_6.git

# Check out the master branch
git checkout master

# Pull the latest updates
git pull https://<YOUR_GITHUB_TOKEN>@github.com/CazMayhem/adv_dsi_lab_6.git

# Check out the pytorch_dbpedia branch
git checkout pytorch_dbpedia

# Merge the master branch and push your changes
git merge master
git push https://<YOUR_GITHUB_TOKEN>@github.com/CazMayhem/adv_dsi_lab_6.git

# Now go to GitHub and merge the branch after reviewing the code and fixing any conflict
"""

**[6.9]** Stop the Docker container

In [None]:
# Stop the Docker container
docker stop adv_dsi_lab_6