# **Lab: Unstructured Data**



## Exercise 3: Text Classification

In this exercise, we will train a PyTorch model with embeddings to classify texts into topics. We will work with the DBpedia dataset:
https://wiki.dbpedia.org/about

The steps are:
1.   Setup Repository
2.   Load Dataset
3.   Prepare Data
4.   Define Architecture
5.   Train Model
6.   Push Changes


### 1. Setup Repository

**[1.1]** Go inside the created folder `adv_dsi_lab_6`

In [None]:
# Placeholder for student's code (1 command line)
# Task: Go inside the created folder adv_dsi_lab_6

In [None]:
# Solution
cd adv_dsi_lab_6

**[1.2]** Create a new git branch called pytorch_dbpedia

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create a new git branch called pytorch_dbpedia

In [None]:
# Solution
git checkout -b pytorch_dbpedia

**[1.3]** Run the built image

In [None]:
docker run  -dit --rm --name adv_dsi_lab_6 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/Projects/adv_dsi/adv_dsi_lab_6:/home/jovyan/work -v ~/Projects/adv_dsi/src:/home/jovyan/work/src pytorch-notebook:latest 

**[1.4]** Display last 50 lines of logs

In [None]:
docker logs --tail 50 adv_dsi_lab_6

**[1.5]** Copy the URL displayed and paste it into a browser to launch Jupyter Lab

**[1.6]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `3_pytorch_dbpedia.ipynb`

### 2.   Load Dataset

In [None]:
#!pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchtext==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cpu
  Downloading https://download.pytorch.org/whl/cpu/torch-1.7.1%2Bcpu-cp37-cp37m-linux_x86_64.whl (159.4MB)
Collecting torchvision==0.8.2+cpu
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.8.2%2Bcpu-cp37-cp37m-linux_x86_64.whl (11.8MB)
Collecting torchtext==0.8.1
  Downloading https://files.pythonhosted.org/packages/13/80/046f0691b296e755ae884df3ca98033cb9afcaf287603b2b7999e94640b8/torchtext-0.8.1-cp37-cp37m-manylinux1_x86_64.whl (7.0MB)
Installing collected packages: torch, torchvision, torchtext
  Found existing installation: torch 1.7.1+cu101
    Uninstalling torch-1.7.1+cu101:
      Successfully uninstalled torch-1.7.1+cu101
  Found existing installation: torchvision

**[2.1]** Import torch, torchtext and text_classification from torchtext.datasets


In [None]:
# Placeholder for student's code (3 lines of Python code)
# Task: Import torch, torchtext and text_classification from torchtext.datasets

In [None]:
#Solution
import torch
import torchtext
from torchtext.datasets import text_classification

**[2.2]** Create 2 variables called `NGRAMS` and `BATCH_SIZE` that will contain respectively the values 2 and 16


In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Create 2 variables called NGRAMS and BATCH_SIZE that will contain respectively the values 2 and 16

In [None]:
#Solution
NGRAMS = 2
BATCH_SIZE = 16

**[2.3]** Download the DBpedia dataset into the `data/raw/` folder using n-grams of size 2 and no predefined vocabulary

In [None]:
#train_dataset, test_dataset = text_classification.DATASETS['DBpedia'](root='../data/raw', ngrams=NGRAMS, vocab=None)
train_dataset, test_dataset = text_classification.DATASETS['DBpedia'](root='./data/raw', ngrams=NGRAMS, vocab=None)

dbpedia_csv.tar.gz: 68.3MB [00:00, 110MB/s] 
560000lines [01:03, 8782.51lines/s]
560000lines [01:51, 5013.94lines/s]
70000lines [00:14, 4714.07lines/s]
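To make the `ngrams=2` setting more concrete, here is a small framework-free sketch of how a tokenised sentence can be expanded with bigrams before the tokens are mapped to ids. The helper `add_ngrams` and the sample sentence are hypothetical and only used for illustration; it is not the torchtext implementation, although torchtext applies a comparable expansion when `ngrams` is greater than 1.

```python
def add_ngrams(tokens, ngrams):
    """Return the tokens followed by all n-grams of size 2..ngrams (hypothetical helper)."""
    result = list(tokens)
    for n in range(2, ngrams + 1):
        for start in range(len(tokens) - n + 1):
            # Join n consecutive tokens into a single n-gram "token"
            result.append(" ".join(tokens[start:start + n]))
    return result

tokens = ["the", "quick", "brown", "fox"]
expanded = add_ngrams(tokens, ngrams=2)
print(expanded)
# ['the', 'quick', 'brown', 'fox', 'the quick', 'quick brown', 'brown fox']
```

Because every distinct bigram gets its own vocabulary entry, a vocabulary built this way is typically much larger than one built from single words only.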


**[2.4]** Print the length of the training and testing sets

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Print the length of the training and testing sets

In [None]:
# Solution
print(len(train_dataset))
print(len(test_dataset))

560000
70000


### 3. Prepare Data



**[3.1]** Extract the size of the vocabulary from the training set and save it into a variable called `VOCAB_SIZE`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Extract the size of the vocabulary from the training set and save it into a variable called VOCAB_SIZE

In [None]:
# Solution
VOCAB_SIZE = len(train_dataset.get_vocab())

**[3.2]** Extract the number of classes from the training set and save it into a variable called `NUM_CLASS`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Extract the number of classes from the training set and save it into a variable called NUM_CLASS

In [None]:
# Solution
NUM_CLASS = len(train_dataset.get_labels())

**[3.3]** Import random_split from torch.utils.data.dataset

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import random_split from torch.utils.data.dataset

In [None]:
# Solution:
from torch.utils.data.dataset import random_split

**[3.4]** Create 2 variables called `train_len` and `valid_len` that will contain the number of observations corresponding respectively to 95% and 5% of the training data

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Create 2 variables called train_len and valid_len that will contain values that represent respectively 95% and 5% of the training data

In [None]:
# Solution
train_len = int(len(train_dataset) * 0.95)
valid_len = len(train_dataset) - train_len
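As a quick sanity check of the arithmetic, the same computation can be reproduced on the training-set size printed earlier (560000). This is only a sketch with the size hard-coded in a variable named `total`, which is an assumption made for illustration. Computing `valid_len` as the remainder, rather than as `int(total * 0.05)`, guarantees that the two lengths add up exactly to the dataset size, which `random_split` requires.

```python
total = 560_000  # size of the DBpedia training set printed earlier

train_len = int(total * 0.95)   # 95% of the observations
valid_len = total - train_len   # remainder, so both parts always sum to total

print(train_len, valid_len)
# 532000 28000
```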

**[3.5]** Split the training data into training and validation sets

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Split the training data into training and validation sets

In [None]:
# Solution
train_data, valid_data = random_split(train_dataset, [train_len, valid_len])

**[3.6]** Create a generator on `train_data` and extract the first observation

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Create a generator on train_data and extract the first observation

In [None]:
# Solution
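# Each observation is a (label, token ids) pair, so despite their names
# `example_data` holds the label and `example_targets` holds the token ids
examples = enumerate(train_data)
batch_idx, (example_data, example_targets) = next(examples)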

In [None]:
example_targets

tensor([  34887,  330820, 3255093,     238,   16441,       2,  176676,  170439,
          20850,  330820,     238,      70,     311,    1154,    2684,   16441,
            238,       7,       6,     490,    3578,   28181,     902,   11034,
        4244776,     169,   10188,    1734,     441,       4,     411,       2,
            644,       5,    1154,       8,     311,    1154,       3,    2536,
         404551,     398,    1690,    2047,      20,  119286,     169,   52648,
              3, 3242498, 2339510, 3255094,  568496,  302357,  239112,  176677,
         178204, 6217895, 2339509,   77267,   87365,    5743,  315091, 5208092,
        2348866,   11687,      11,   10422,    8447,   47942,   36810, 3489086,
        3583243, 4244777, 5452755,   20246,   52947,    3465,    3281,    3261,
           2915,    1266,    7727,   12026,    4009,    5743,    3303, 1210135,
        5450200, 5990363,  582316,  357302,  438135,  677244,  219862, 5452431,
         411325])

**[3.7]** Define a function that will extract the label and text from a batch of observations, calculate the length of each text and store the cumulative lengths as offsets (marking where each new text starts)

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Define a function that extracts the labels and texts from a batch and computes the offsets

In [None]:
# Solution
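def generate_batch(batch):
    # Each entry is a (label, token ids) pair
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]

    # Offsets are the cumulative text lengths, i.e. the start index of each text
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

    # Concatenate all texts into a single 1D tensor
    text = torch.cat(text)
    return text, offsets, label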
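The offsets logic can be easier to follow on plain Python lists. The sketch below mirrors the computation in `generate_batch` without PyTorch; the token ids are made-up values and the variable names (`texts`, `flat`, `recovered`) are assumptions chosen for illustration.

```python
from itertools import accumulate

# Three toy "texts" already converted to token ids (made-up values)
texts = [[4, 8, 15], [16, 23], [42, 7, 9, 1]]

lengths = [len(t) for t in texts]                 # [3, 2, 4]
offsets = [0] + list(accumulate(lengths))[:-1]    # start index of each text
flat = [token for t in texts for token in t]      # all token ids concatenated

# Consecutive offsets delimit each text inside the flat list
bounds = offsets + [len(flat)]
recovered = [flat[a:b] for a, b in zip(bounds, bounds[1:])]

print(offsets)
# [0, 3, 5]
```

The flat list together with the offsets carries the same information as the original list of texts, which is the input format expected by an embedding bag layer.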

### 4. Define Architecture

**[4.1]** Import torch.nn as nn and torch.nn.functional as F

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Import torch.nn as nn and torch.nn.functional as F

In [None]:
# Solution
import torch.nn as nn
import torch.nn.functional as F

**[4.2]** Create a class called `TextTopic` that inherits from `nn.Module` with:
- inputs:
    - vocabulary size
    - embedding dimension
    - number of classes
- attributes:
    - `embedding`: embedding bag (`nn.EmbeddingBag`) of shape vocabulary size × embedding dimension
    - `fc`: fully-connected layer with a number of neurons equal to the number of classes
    - `softmax`: Softmax activation function
- methods:
    - `forward()`: takes `text` and `offsets` as input parameters and sequentially applies the embedding layer with dropout, followed by the fully-connected layer with dropout and softmax

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a class called TextTopic that inherits from nn.Module

In [None]:
# Solution:
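class TextTopic(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, text, offsets):
        # Pass training=self.training so that dropout is disabled by model.eval()
        # (F.dropout is always active by default)
        x = F.dropout(self.embedding(text, offsets), 0.3, training=self.training)
        x = F.dropout(self.fc(x), 0.3, training=self.training)
        # Note: nn.CrossEntropyLoss applies log-softmax itself; softmax is kept
        # here because the exercise asks for it
        return self.softmax(x)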
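To give an intuition of what the embedding bag layer computes, here is a pure-Python sketch of mean pooling over bags of token ids (`mean` is the default mode of `nn.EmbeddingBag`). The function `mean_bag`, the toy embedding table and the ids are all hypothetical; this is an illustration of the idea under the assumption that every bag is non-empty, not the PyTorch implementation.

```python
def mean_bag(embedding_table, flat_ids, offsets):
    """Average the embedding vectors of each bag (pure-Python illustration)."""
    bounds = list(offsets) + [len(flat_ids)]
    pooled = []
    for start, end in zip(bounds, bounds[1:]):
        # Look up the embedding vector of every token id in this bag
        vectors = [embedding_table[i] for i in flat_ids[start:end]]
        dim = len(vectors[0])
        # Average the vectors dimension by dimension
        pooled.append([sum(v[d] for v in vectors) / len(vectors) for d in range(dim)])
    return pooled

# Toy embedding table: 4 tokens, embedding dimension 2 (made-up values)
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat_ids = [1, 2, 3, 0]   # two texts: [1, 2] and [3, 0]
offsets = [0, 2]

print(mean_bag(table, flat_ids, offsets))
# [[2.0, 3.0], [2.5, 3.0]]
```

Each text, whatever its length, is reduced to a single vector of size `embed_dim`, which is what the fully-connected layer then receives.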

**[4.3]** Create a variable called `EMBED_DIM` that will contain the value 32

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a variable called EMBED_DIM that will contain the value 32

In [None]:
# Solution:
EMBED_DIM = 32

**[4.4]** Instantiate a `TextTopic()` and save it into a variable called `model` 

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Instantiate a TextTopic() and save it into a variable called model

In [None]:
# Solution:
model = TextTopic(VOCAB_SIZE, EMBED_DIM, NUM_CLASS)

**[4.5]** Import the `get_device` function from src.models.pytorch 

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import the get_device function from src.models.pytorch

In [None]:
# Solution:
#from src.models.pytorch import get_device

def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu')  # fall back to CPU when no GPU is available
    return device

**[4.6]** Get the available device and move the model to it

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Get the available device and move the model to it

In [None]:
# Solution:
device = get_device()
model.to(device)

TextTopic(
  (embedding): EmbeddingBag(6375026, 32, mode=mean)
  (fc): Linear(in_features=32, out_features=14, bias=True)
  (softmax): Softmax(dim=1)
)

**[4.7]** Print the architecture of this model

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Print the architecture of this model

In [None]:
# Solution:
model

TextTopic(
  (embedding): EmbeddingBag(6375026, 32, mode=mean)
  (fc): Linear(in_features=32, out_features=14, bias=True)
  (softmax): Softmax(dim=1)
)

### 5. Train the model

**[5.1]** Import DataLoader from torch.utils.data

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import DataLoader from torch.utils.data

In [None]:
# Solution:
from torch.utils.data import DataLoader

**[5.2]** Create a function called `train_classification()` that will perform forward and back propagation and calculate loss and accuracy scores

In [None]:
def train_classification(train_data, model, criterion, optimizer, batch_size, device, scheduler=None, generate_batch=None):
    """Train a PyTorch multi-class classification model

    Parameters
    ----------
    train_data : torch.utils.data.Dataset
        PyTorch dataset
    model : torch.nn.Module
        PyTorch model
    criterion : function
        Loss function
    optimizer : torch.optim
        Optimizer
    batch_size : int
        Number of observations per batch
    device : torch.device or str
        Device used for the model
    scheduler : torch.optim.lr_scheduler
        PyTorch scheduler used for updating the learning rate
    generate_batch : function
        Collate function defining the required pre-processing steps

    Returns
    -------
    float
        Loss score
    float
        Accuracy score
    """
    
    # Set model to training mode
    model.train()
    train_loss = 0
    train_acc = 0
    
    # Create data loader
    data = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)
    
    # Iterate through data by batch of observations
    for feature, offsets, target_class in data:

        # Reset gradients
        optimizer.zero_grad()
        
        # Load data (including offsets) to the specified device
        feature, offsets, target_class = feature.to(device), offsets.to(device), target_class.to(device)
        
        # Make predictions
        output = model(feature, offsets)
        
        # Calculate loss for given batch
        loss = criterion(output, target_class.long())

        # Calculate global loss
        train_loss += loss.item()
        
        # Calculate gradients
        loss.backward()

        # Update Weights
        optimizer.step()
        
        # Calculate global accuracy
        train_acc += (output.argmax(1) == target_class).sum().item()

    # Adjust the learning rate
    if scheduler:
        scheduler.step()

    return train_loss / len(train_data), train_acc / len(train_data)

**[5.3]** Create a function called `test_classification()` that will perform a forward pass and calculate loss and accuracy scores

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a function called test_classification() that will perform a forward pass and calculate loss and accuracy scores

In [None]:
# Solution:
def test_classification(test_data, model, criterion, batch_size, device, generate_batch=None):
    """Calculate the performance of a PyTorch multi-class classification model

    Parameters
    ----------
    test_data : torch.utils.data.Dataset
        PyTorch dataset
    model : torch.nn.Module
        PyTorch model
    criterion : function
        Loss function
    batch_size : int
        Number of observations per batch
    device : torch.device or str
        Device used for the model
    generate_batch : function
        Collate function defining the required pre-processing steps

    Returns
    -------
    float
        Loss score
    float
        Accuracy score
    """
    
    # Set model to evaluation mode
    model.eval()
    test_loss = 0
    test_acc = 0
    
    # Create data loader
    data = DataLoader(test_data, batch_size=batch_size, collate_fn=generate_batch)
    
    # Iterate through data by batch of observations
    for feature, offsets, target_class in data:
        
        # Load data (including offsets) to the specified device
        feature, offsets, target_class = feature.to(device), offsets.to(device), target_class.to(device)
        
        # Set no update to gradients
        with torch.no_grad():
            
            # Make predictions
            output = model(feature, offsets)
            
            # Calculate loss for given batch
            loss = criterion(output, target_class.long())

            # Calculate global loss
            test_loss += loss.item()
            
            # Calculate global accuracy
            test_acc += (output.argmax(1) == target_class).sum().item()

    return test_loss / len(test_data), test_acc / len(test_data)
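The accuracy bookkeeping used in both functions can be illustrated without PyTorch. In the sketch below, the model outputs and true classes are made-up values and `argmax` is a small hypothetical helper standing in for `output.argmax(1)`: correct predictions are counted batch by batch and divided by the total number of observations at the end.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up model outputs for two batches of 2 observations and 3 classes
batches = [
    ([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]], [0, 1]),   # (outputs, true classes)
    ([[0.2, 0.5, 0.3], [0.3, 0.3, 0.4]], [1, 2]),
]

correct = 0
n_obs = 0
for outputs, targets in batches:
    # Predicted class is the index of the highest score
    predictions = [argmax(row) for row in outputs]
    correct += sum(int(p == t) for p, t in zip(predictions, targets))
    n_obs += len(targets)

accuracy = correct / n_obs
print(accuracy)
# 0.75
```

The loss is aggregated slightly differently: `nn.CrossEntropyLoss` returns the mean loss of a batch by default, so summing the batch losses and dividing by the number of observations gives a value that is roughly the average loss divided by the batch size.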
    

**[5.4]** Create a variable called `N_EPOCHS` that will take the value 5

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a variable called N_EPOCHS that will take the value 5

In [None]:
# Solution:
N_EPOCHS = 5

**[5.5]** Instantiate a `nn.CrossEntropyLoss()` and save it into a variable called `criterion`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Instantiate a nn.CrossEntropyLoss() and save it into a variable called criterion

In [None]:
# Solution:
criterion = torch.nn.CrossEntropyLoss()

**[5.6]** Instantiate a `torch.optim.SGD()` optimizer with the model's parameters and 4 as learning rate and save it into a variable called `optimizer`

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Instantiate a torch.optim.SGD() optimizer with the model's parameters and 4 as learning rate and save it into a variable called optimizer

In [None]:
# Solution:
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)

**[5.7]** Instantiate a `torch.optim.lr_scheduler.StepLR()` scheduler that will multiply the learning rate by 0.9 after each epoch

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Instantiate a torch.optim.lr_scheduler.StepLR() scheduler that will decrease the learning rate by a coefficient of 0.9 for each epoch

In [None]:
# Solution:
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
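With a step size of 1 and `gamma=0.9`, the learning rate after `epoch` scheduler steps is `initial_lr * gamma ** epoch`. The sketch below computes the resulting schedule in plain Python, assuming the scheduler is stepped once per epoch and using the initial learning rate (4.0) and number of epochs (5) defined in this exercise; the variable names are chosen for illustration.

```python
initial_lr = 4.0   # learning rate passed to the optimizer
gamma = 0.9        # multiplicative factor applied after each epoch
n_epochs = 5

# Learning rate in effect during each epoch
schedule = [initial_lr * gamma ** epoch for epoch in range(n_epochs)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.4f}")
# epoch 0: lr = 4.0000
# epoch 1: lr = 3.6000
# epoch 2: lr = 3.2400
# epoch 3: lr = 2.9160
# epoch 4: lr = 2.6244
```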

**[5.8]** Create a for loop that will iterate through the specified number of epochs and will train the model with the training set and assess the performance on the validation set and print their scores

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a for loop that will iterate through the specified number of epochs and will train the model with the training set and assess the performance on the validation set and print their scores

In [None]:
# Solution:
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_classification(train_data, model, criterion, optimizer, batch_size=BATCH_SIZE, device=device, scheduler=scheduler, generate_batch=generate_batch)
    valid_loss, valid_acc = test_classification(valid_data, model, criterion, batch_size=BATCH_SIZE, device=device, generate_batch=generate_batch)

    print(f'Epoch: {epoch}')
    print(f'\t(train)\t|\tLoss: {train_loss:.4f}\t|\tAcc: {train_acc * 100:.1f}%')
    print(f'\t(valid)\t|\tLoss: {valid_loss:.4f}\t|\tAcc: {valid_acc * 100:.1f}%')

Epoch: 0
	(train)	|	Loss: 0.1343	|	Acc: 60.5%
	(valid)	|	Loss: 0.1289	|	Acc: 68.3%
Epoch: 1
	(train)	|	Loss: 0.1282	|	Acc: 69.2%
	(valid)	|	Loss: 0.1275	|	Acc: 70.1%
Epoch: 2
	(train)	|	Loss: 0.1271	|	Acc: 70.9%
	(valid)	|	Loss: 0.1271	|	Acc: 70.9%
Epoch: 3
	(train)	|	Loss: 0.1267	|	Acc: 71.6%
	(valid)	|	Loss: 0.1264	|	Acc: 72.2%
Epoch: 4
	(train)	|	Loss: 0.1264	|	Acc: 72.0%
	(valid)	|	Loss: 0.1263	|	Acc: 72.0%


**[5.9]** Save the model into the `models` folder

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Save the model into the models folder

In [None]:
# Solution:
torch.save(model, "../models/pytorch_dbpedia.pt")

### 6.   Push changes

**[6.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (1 command line)
# Task: Add your changes to the git staging area

In [None]:
# Solution:
git add .

**[6.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create the snapshot of your repository and add a description

In [None]:
# Solution:
git commit -m "pytorch dbpedia"

**[6.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (1 command line)
# Task: Push your snapshot to GitHub

In [None]:
# Solution:
git push

**[6.4]** Check out the master branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out the master branch

In [None]:
# Solution:
git checkout master

**[6.5]** Pull the latest updates

In [None]:
# Placeholder for student's code (1 command line)
# Task: Pull the latest updates

In [None]:
# Solution:
git pull

**[6.6]** Check out the `pytorch_dbpedia` branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out the pytorch_dbpedia branch

In [None]:
# Solution:
git checkout pytorch_dbpedia

**[6.7]** Merge the `master` branch and push your changes

In [None]:
# Placeholder for student's code (2 command lines)
# Task: Merge the master branch and push your changes

In [None]:
# Solution:
git merge master
git push

**[6.8]** Go to GitHub and merge the branch after reviewing the code and fixing any conflicts

**[6.9]** Stop the Docker container

In [None]:
# Placeholder for student's code (1 command line)
# Task: Stop the Docker container

In [None]:
# Solution:
docker stop adv_dsi_lab_6