# Practical Machine Learning and Deep Learning
# Lesson 3: Deep Learning in Natural Language Processing


## About the Data
The data consists of Amazon product reviews.
#### Files
train.csv - 40k Amazon product reviews for training

test.csv - 10k reviews for testing
#### Categories
There are 6 target categories: health personal care, toys games, beauty, pet supplies, baby products, and grocery gourmet food.

#### Columns
1. Title
Format is string
2. Helpfulness - Assessment of review from other users
Format is int/int. 4/5 means that 4 out of 5 assessments rated the review as helpful and 1 as not helpful
3. Score - Score assigned by the user
Format is float. 1.0 is minimum, 5.0 is maximum.
4. Text - Text of the review
Format is string
5. Category - one of the 6 target categories
Format is string


# Reading the data

The training and test data are available separately, so instead of reading the complete data and then splitting it into train and test sets, we read the two files separately. Later in this lab, however, we will further split the training data into training and validation sets.

In [2]:
import pandas as pd

train_dataframe = pd.read_csv('train.csv')
test_dataframe = pd.read_csv('test.csv')


## Data Preprocessing

Text preprocessing in Natural Language Processing (NLP) involves a series of steps to clean, normalize, and prepare raw text data for further analysis or modeling. It is a crucial step because it transforms messy, unstructured text data into a structured form that machine learning models can understand and work with more effectively.

The various text preprocessing steps are:

* Tokenization
* Lower casing
* Stop words removal
* Stemming
* Lemmatization

These steps are widely used to reduce the vocabulary size, and with it the dimensionality of the text representation.

But before that, let's look at the data that we have:


In [None]:
train_dataframe.head()

The training data has `4` features (`Title`, `Helpfulness`, `Score` and `Text`) and a target column (`Category`). The test data has the same features but no target column.


## Normalize the values

First, let's write functions for preprocessing the helpfulness and score features in case we need them.

In [4]:
def preprocess_score_inplace(df):
    """
    Normalizes score to make it from 0 to 1.

    For now it is from 1.0 to 5.0, so natural choice
    is to normalize by (f - 1.0)/4.0
    """
    # Subtract 1.0 from each score, changing the range from [1.0, 5.0] to [0.0, 4.0].
    # Divide by 4.0, normalizing the range from [0.0, 4.0] to [0.0, 1.0].
    df['Score'] = (df['Score'] - 1.0) / 4.0
    return df

def preprocess_helpfulness_inplace(df):
    """
    Splits feature by '/' and normalize helpfulness to make it from 0 to 1

    The total number of assessments can be 0, so let's substitute it
    with 1. The resulting helpfulness still will be zero but we
    remove the possibility of division by zero exception.
    """
    # Split 'Helpfulness' on '/' into two string columns: helpful votes and total votes.
    _splitted = df['Helpfulness'].str.split('/', expand=True)
    _helpful, _total = _splitted[0], _splitted[1]
    # Replace a total of "0" with "1" to prevent division by zero.
    _total = _total.replace("0", "1")
    # Convert both columns to integers and compute the ratio, which lies in the range from 0 to 1.
    df['Helpfulness'] = _helpful.astype(int) / _total.astype(int)
    return df
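
As a quick sanity check of the arithmetic in the two functions above, the sketch below applies the same formulas to single values in plain Python. The helper names `normalize_score` and `helpfulness_ratio` are illustrative only and are not part of the notebook.

```python
def normalize_score(score: float) -> float:
    """Map a score from the range [1.0, 5.0] to [0.0, 1.0]."""
    return (score - 1.0) / 4.0


def helpfulness_ratio(raw: str) -> float:
    """Turn a 'helpful/total' string into a ratio, treating a total of 0 as 1."""
    helpful, total = (int(part) for part in raw.split("/"))
    return helpful / max(total, 1)


print(normalize_score(1.0), normalize_score(3.0), normalize_score(5.0))
print(helpfulness_ratio("4/5"), helpfulness_ratio("0/0"))
```

A review with no assessments ("0/0") ends up with a helpfulness of 0.0 instead of raising a division-by-zero error.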

## Concat the textual data

The two other features are both text. For simplicity, let's concatenate them so that we have a single text feature. The resulting code is also a function.

In [5]:
def concat_title_text_inplace(df):
    """
    Concatenates Title and Text columns together
    """
    df['Text'] = df['Title'] + " " + df['Text']
    df.drop('Title', axis=1, inplace=True)
    return df

## Encode the Target Values

Also, encode the target categories, so that each category becomes an integer index.

In [6]:
# define categories indices
cat2idx = {
    'toys games': 0,
    'health personal care': 1,
    'beauty': 2,
    'baby products': 3,
    'pet supplies': 4,
    'grocery gourmet food': 5,
}
# define reverse mapping
idx2cat = {
    v:k for k,v in cat2idx.items()
}

In [8]:
def encode_categories(df):
    df['Category'] = df['Category'].apply(lambda x: cat2idx[x])
    return df
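
To illustrate how the two mappings work together, here is a small round trip on a plain Python list instead of a DataFrame column. The `labels` list is made up for the example.

```python
cat2idx = {
    'toys games': 0,
    'health personal care': 1,
    'beauty': 2,
    'baby products': 3,
    'pet supplies': 4,
    'grocery gourmet food': 5,
}
idx2cat = {v: k for k, v in cat2idx.items()}

# Made-up labels, used only for this illustration
labels = ['beauty', 'pet supplies', 'toys games']

encoded = [cat2idx[label] for label in labels]  # category name -> index
decoded = [idx2cat[index] for index in encoded]  # index -> category name

print(encoded)
print(decoded)
```

Note that a category name missing from `cat2idx` would raise a `KeyError`, so the mapping has to cover every category in the data.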

Let's visualize our first stage of preprocessing.

In [None]:
train_copy = train_dataframe.head().copy()

encode_categories(preprocess_score_inplace(preprocess_helpfulness_inplace(concat_title_text_inplace(train_copy))))

### Text cleaning

For text cleaning, you can use lowercasing, punctuation removal, number removal, tokenization, stop word removal, and stemming. Together, these steps remove most of the noise from the text.


---

## EXERCISE 1:


  1. Lower text - In Python, you can use the `.lower()` or `.upper()` string methods to change the case of a string

  For example:

  `"PmLDl".upper()`
  would result in
  `PMLDL`


  2. Remove numbers - Use [Regex](https://docs.python.org/3/library/re.html) to remove numbers from a string efficiently, replacing them with a space

  For example:

  "123abc456def789" ==> " abc def "


  3. Remove punctuation - Remove the punctuation from a string as in the step above, replacing it with a space

  For example:

  "hello, what's up??" ==> "hello  what s up "


  4. Remove multiple spaces - You can use any approach to replace runs of spaces with a single space

  For example:

  "hi  how are you?" ==> "hi how are you?"


---
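
The regex-based steps above can be sketched as follows. This is one possible approach, not the reference solution: the function names are hypothetical, and the patterns are chosen so that a whole run of digits or punctuation marks is replaced by a single space, which matches the examples given in the exercise.

```python
import re


def strip_numbers(text: str) -> str:
    # Replace every run of digits with a single space
    return re.sub(r"\d+", " ", text)


def strip_punctuation(text: str) -> str:
    # Replace every run of characters that are neither word characters
    # nor whitespace with a single space
    return re.sub(r"[^\w\s]+", " ", text)


def squeeze_spaces(text: str) -> str:
    # Collapse two or more consecutive spaces into one
    return re.sub(r" {2,}", " ", text)


print(repr("PmLDl".lower()))
print(repr(strip_numbers("123abc456def789")))
print(repr(strip_punctuation("hello, what's up??")))
print(repr(squeeze_spaces("hi  how are you?")))
```

Note that `\w` also matches the underscore, so underscores survive `strip_punctuation` in this sketch.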




In [10]:
import re

def lower_text(text: str):
    ...

def remove_numbers(text: str):
    """
    Substitute all numbers with a space in case of
    "there is5dogs".

    If subs with '' -> "there isdogs"
    With ' ' -> "there is dogs"
    """
    ...

def remove_punctuation(text: str):
    """
    Substitute all punctuation with a space in case of
    "hello!nice to meet you"

    If subs with '' -> "hellonice to meet you"
    With ' ' -> "hello nice to meet you"
    """
    ...

def remove_multiple_spaces(text: str):
    ...

Once implemented, these functions should pass the checks below and give us clean text.

In [None]:
assert lower_text("MiXeD CaSe") == "mixed case"
assert remove_numbers("123abc456def789") == " abc def "
assert remove_punctuation("hello, what's up??") == "hello  what s up "
assert remove_multiple_spaces("hi  how are you?") == "hi how are you?"

print("Passed all test cases")

In [None]:
sample_text = train_copy['Text'][4]

_lowered = lower_text(sample_text)
_without_numbers = remove_numbers(_lowered)
_without_punct = remove_punctuation(_without_numbers)
_single_spaced = remove_multiple_spaces(_without_punct)

print(sample_text)
print('-'*10)
print(_lowered)
print('-'*10)
print(_without_numbers)
print('-'*10)
print(_without_punct)
print('-'*10)
print(_single_spaced)

Now for the harder preprocessing steps: tokenization, stop words removal and stemming.

### Tokenization
Tokenization is the process of breaking down a text into smaller units called tokens. Tokens can be words, subwords, or characters.
Tokenization helps in converting raw text into a structured format that can be easily analyzed. It is the first step in many NLP tasks.

#### Example:

Input: "The quick brown fox jumps over the lazy dog."

Output: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

### Stop Words Removal
Stop words are common words that usually do not carry significant meaning and are often removed from text data. Examples include "the", "is", "in", "and", etc.
Removing stop words helps in reducing the size of the dataset and focusing on the more meaningful words, which can improve the performance of NLP models.

#### Example:

Input: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Output: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
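
The tokenization and stop word removal steps described above can be sketched in plain Python. This is only an illustration: `simple_tokenize`, `drop_stop_words` and the tiny `DEMO_STOP_WORDS` set are made up for the example, and a real stop list such as the one shipped with NLTK is much longer.

```python
import re

# Tiny hand-picked stop list for illustration only (an assumption, not NLTK's list)
DEMO_STOP_WORDS = {"the", "is", "in", "and", "a", "over"}


def simple_tokenize(text: str) -> list[str]:
    # Each run of word characters is a token; each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", text)


def drop_stop_words(tokens: list[str]) -> list[str]:
    # Compare in lowercase so that "The" is removed as well as "the"
    return [token for token in tokens if token.lower() not in DEMO_STOP_WORDS]


tokens = simple_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
print(drop_stop_words(tokens))
```

In this sketch the final period is kept as a separate token, which is also what the `tokenize_text` test later in the notebook expects.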

### Stemming
Stemming is the process of reducing words to their root or base form, typically by stripping suffixes. The resulting root form may not be a valid word.
Stemming normalizes words to a common base form, which reduces the number of unique words in the text and helps in identifying related words.

#### Example:

Input: ["running", "jumps", "easily", "faster"]

Output: ["run", "jump", "easili", "faster"]


---


## EXERCISE 2:

  Implement the functions below so that they behave as shown in the examples above.

  ### Tools

  For that you can use several packages, but we encourage you to use `nltk` - Natural Language ToolKit as well as `torchtext`.


  Take a look at:
  * `nltk.tokenize.word_tokenize` or `torchtext.data.utils.get_tokenizer` for tokenization
  * `nltk.corpus.stopwords` for stop words removal
  * `nltk.stem.PorterStemmer` for stemming

---


In [None]:

def tokenize_text(text: str) -> list[str]:
  ...

def remove_stop_words(tokenized_text: list[str]) -> list[str]:
  ...

def stem_words(tokenized_text: list[str]) -> list[str]:
  ...

In [None]:
def test_tokenize_text():
    text = "This is a sample sentence."
    expected_tokens = ["This", "is", "a", "sample", "sentence", "."]
    assert tokenize_text(text) == expected_tokens
    print("tokenize_text test passed.")

test_tokenize_text()

def test_remove_stop_words():
    tokenized_text = ["This", "is", "a", "sample", "sentence", "."]
    expected_output = ["sample", "sentence", "."]
    assert remove_stop_words(tokenized_text) == expected_output
    print("remove_stop_words test passed.")

test_remove_stop_words()

def test_stem_words():
    tokenized_text = ["running", "quickly", "cats"]
    expected_output = ["run", "quickli", "cat"]
    assert stem_words(tokenized_text) == expected_output
    print("stem_words test passed.")

test_stem_words()



In [None]:
_tokenized = tokenize_text(_single_spaced)
_without_sw = remove_stop_words(_tokenized)
_stemmed = stem_words(_without_sw)

print(_single_spaced)
print('-'*10)
print(_tokenized)
print('-'*10)
print(_without_sw)
print('-'*10)
print(_stemmed)

As you can see, many words have been removed, and the remaining ones have been stripped of their grammatical endings. Now we can construct the full text cleaning stage.

---

## EXERCISE 3:

Create the function `preprocessing_stage`, which applies the following preprocessing steps to a string, in order:

1. Convert text to lowercase.
2. Remove numbers from the text.
3. Remove punctuation from the text.
4. Replace multiple spaces with a single space.
5. Tokenize the text.
6. Remove stop words from the tokenized text.
7. Apply stemming to the remaining tokens.

---
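
Chaining steps in a fixed order can be sketched generically: each step takes the output of the previous one. The `run_pipeline` helper and the stand-in steps in `demo_steps` are hypothetical and cover only part of the list above; in the exercise you would chain your own functions from Exercises 1 and 2 instead.

```python
import re
from functools import reduce


def run_pipeline(text, steps):
    # Feed the result of each step into the next one, in list order
    return reduce(lambda value, step: step(value), steps, text)


# Stand-in steps for illustration only
demo_steps = [
    str.lower,                            # 1. lowercase
    lambda t: re.sub(r"\d+", " ", t),     # 2. remove numbers
    lambda t: re.sub(r"\s+", " ", t),     # 4. collapse whitespace
    str.split,                            # 5. naive whitespace tokenization
]

print(run_pipeline("Buy 2 TOYS  now", demo_steps))
```

The order matters: tokenization comes after the string-level steps because the later steps operate on a list of tokens rather than on a string.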


In [17]:
def preprocessing_stage(text):
    """
    Performs a series of preprocessing steps on the input text.

    This function applies the following preprocessing steps in order:
    1. Converts text to lowercase.
    2. Removes numbers from the text.
    3. Removes punctuation from the text.
    4. Replaces multiple spaces with a single space.
    5. Tokenizes the text.
    6. Removes stop words from the tokenized text.
    7. Applies stemming to the remaining tokens.

    Args:
    text (str): The input text to be preprocessed.

    Returns:
    list: A list of preprocessed and stemmed tokens.
    """

    ...


def clean_text_inplace(df):
    """
    Applies the preprocessing_stage function to the 'Text' column of the DataFrame.

    This function modifies the DataFrame in place by applying the preprocessing_stage
    function to each entry in the 'Text' column.

    Args:
    df (pd.DataFrame): The input DataFrame.

    Returns:
    pd.DataFrame: The DataFrame with preprocessed 'Text' column.
    """

    df['Text'] = df['Text'].apply(preprocessing_stage)
    return df

def preprocess(df):
    """
    Applies a series of preprocessing steps to the DataFrame.

    This function performs the following steps:
    1. Fills any missing values in the DataFrame with a space.
    2. Normalizes the 'Score' column.
    3. Normalizes the 'Helpfulness' column.
    4. Concatenates the 'Title' and 'Text' columns.
    5. Encodes the 'Category' column if it exists.
    6. Applies text cleaning to the 'Text' column.

    Args:
    df (pd.DataFrame): The input DataFrame to be preprocessed.

    Returns:
    pd.DataFrame: The preprocessed DataFrame.
    """
    df.fillna(" ", inplace=True)
    _preprocess_score = preprocess_score_inplace(df)
    _preprocess_helpfulness = preprocess_helpfulness_inplace(_preprocess_score)
    _concatted = concat_title_text_inplace(_preprocess_helpfulness)

    if 'Category' in df.columns:
        _encoded = encode_categories(_concatted)
        _cleaned = clean_text_inplace(_encoded)
    else:
        _cleaned = clean_text_inplace(_concatted)
    return _cleaned


In [20]:
assert preprocessing_stage("Hi, this is a sample text :)") == ['hi', 'sampl', 'text']
print("Preprocessing stage Passed")

Preprocessing stage Passed


And now let's apply it on our train and test dataframes.

In [None]:
train_preprocessed = preprocess(train_dataframe)
test_preprocessed = preprocess(test_dataframe)

train_preprocessed.head()

Now, let's split our original train dataset into train and val sets.

In [22]:
from sklearn.model_selection import train_test_split

ratio = 0.2  # fraction of the training data held out for validation
train, val = train_test_split(
    train_preprocessed, stratify=train_preprocessed['Category'], test_size=ratio, random_state=420
)

And now, let's move away from pandas so that we can work with torchtext directly. For that, let's create an iterator that yields samples for us.

# Creating dataloaders

First, generate the vocab from the train set.

For that, use `torchtext.vocab.build_vocab_from_iterator`.


The function `yield_tokens` takes a DataFrame (`df`) as input and iterates over its rows using `iterrows()`. For each row (sample), it yields the list of tokens stored in the `Text` column. It serves as a generator that provides tokens for building the vocabulary.



In [None]:
from torchtext.vocab import build_vocab_from_iterator

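def yield_tokens(df):
    # Iterate over the DataFrame passed in, not the global `train`
    for _, sample in df.iterrows():
        # Select the token list by column name instead of by position
        yield sample['Text']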


# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

vocab = build_vocab_from_iterator(yield_tokens(train), specials=special_symbols)
vocab.set_default_index(UNK_IDX)

And then use our vocab to encode the tokenized sequence

In [None]:
# Use positional indexing: after the split, index label 2 may not be in `train`
sample = train['Text'].iloc[2]
print(sample)
encoded = vocab(sample)
print(encoded)
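
To make the idea of a vocabulary concrete, the sketch below builds a token-to-index mapping in plain Python. It mimics the general behaviour (special symbols first, unknown tokens mapped to the `<unk>` index) but is not torchtext's implementation, and its ordering of regular tokens may differ. The names `build_demo_vocab` and `encode` and the toy `corpus` are made up for the example.

```python
from collections import Counter


def build_demo_vocab(token_lists, specials):
    # Count how often each token occurs across all samples
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Special symbols take the first indices, then tokens from most to least frequent
    itos = list(specials) + [tok for tok, _ in counts.most_common() if tok not in specials]
    return {token: index for index, token in enumerate(itos)}


def encode(tokens, stoi, unk_idx=0):
    # Tokens missing from the vocabulary fall back to the <unk> index
    return [stoi.get(token, unk_idx) for token in tokens]


corpus = [["great", "toy"], ["great", "snack"], ["great", "toy", "dog"]]
stoi = build_demo_vocab(corpus, ["<unk>", "<pad>", "<bos>", "<eos>"])

print(encode(["great", "toy", "unicorn"], stoi))
```

Here "unicorn" never occurs in the toy corpus, so it is encoded as the `<unk>` index 0, which is the role `vocab.set_default_index(UNK_IDX)` plays above.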

Now we can define our collate function and create dataloaders



---

## EXERCISE 4:

Write a `collate_batch` function that takes a batch of samples and converts it into a format suitable for input to a neural network. The function should do the following:

1. Initialize lists:

  `label_list`, `text_list`, `score_list` and `helpfulness_list` to store the respective values from each sample in the batch.

  `offsets` to store the starting index of each sequence in the concatenated batch (useful for layers such as `nn.EmbeddingBag` that take a flat tensor plus offsets).

2. Iterate over the batch:

  For each item in the batch, extract `_helpfulness`, `_score`, `_text` and `_label`.

  Append `_label` to `label_list`.

  Convert `_text` to token indices using `text_pipeline`, wrap the result in a tensor, and append it to `text_list`.

  Append `_score` to `score_list`.

  Append `_helpfulness` to `helpfulness_list`.

  Accumulate the sequence lengths to compute `offsets`.

3. Pad sequences:

  Use `pad_sequence` to pad the text sequences so that all sequences in the batch have the same length. Use `PAD_IDX` (1) as the `padding_value`, and pass `batch_first=True` so that the result has the shape (batch, sequence length) that the model defined below expects.

4. Convert lists to tensors:

  Convert `label_list`, `score_list`, `helpfulness_list` and `offsets` to tensors.

  Move these tensors to the specified device (CPU or GPU).

5. Return tensors:

  Return the batched tensors in the order in which the training loop unpacks them: labels, texts, offsets, scores, helpfulness.

---
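
To show what padding and offsets look like on concrete numbers, here is a plain-Python sketch on nested lists. It only illustrates the values involved, assuming the batch-first layout; it does not use tensors and is not the exercise solution. The helper names `pad_to_longest` and `start_offsets` and the toy `batch` are made up for the example.

```python
from itertools import accumulate

PAD_IDX = 1  # same padding index as in the vocabulary above


def pad_to_longest(sequences, pad_value=PAD_IDX):
    # Extend every sequence on the right until it is as long as the longest one
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


def start_offsets(sequences):
    # Start index of each sequence if all of them were concatenated into one flat list
    lengths = [len(seq) for seq in sequences]
    return [0] + list(accumulate(lengths))[:-1]


batch = [[4, 5, 9], [7], [6, 8]]  # three already-encoded token sequences

print(pad_to_longest(batch))
print(start_offsets(batch))
```

Padding produces a rectangular batch suitable for `nn.Embedding` followed by an LSTM, while offsets describe the same batch as one flat sequence, which is the layout `nn.EmbeddingBag` works with.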


In [28]:
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as Fun

torch.manual_seed(420)

text_pipeline = lambda x: vocab(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    ...

train_dataloader = DataLoader(
    train.to_numpy(), batch_size=128, shuffle=True, collate_fn=collate_batch
)

val_dataloader = DataLoader(
    val.to_numpy(), batch_size=128, shuffle=False, collate_fn=collate_batch
)

# Defining Network


To write the network, we can use `torch.nn.Embedding` or `torch.nn.EmbeddingBag`. This allows your network to learn an embedding vector for each token.

As for the other modules in your network, consider these options:
* Simple linear layers, activations, and the other basic building blocks of a network
* It is possible not to use the offsets (start indices of the sequences) in the forward pass and instead use a predefined sequence length (the maximum length, some fixed value, etc.). If you choose this option, change the `collate_batch` function to match your architecture.
* You could use recurrent layers (RNN, GRU, LSTM) or even a Transformer, it's all up to you, but remember to keep track of the dimensions and hidden states

In [23]:
import torch.nn as nn

class TextClassificationModel(nn.Module):
    def __init__(self, num_classes):
        super(TextClassificationModel, self).__init__()
        num_words = len(vocab.get_itos())
        embed_dim = 1024
        self.embedding = nn.Embedding(num_embeddings=num_words, embedding_dim=embed_dim)
        self.dropout = nn.Dropout(p=0.6)
        # a two-layer bidirectional LSTM with a hidden_dim of 128
        self.lstm = nn.LSTM(embed_dim, 128, bidirectional=True, batch_first=True, num_layers=2, dropout=0.5)
        # the output layer produces one raw score (logit) per class;
        # nn.CrossEntropyLoss applies log-softmax itself, so no Softmax layer is used here
        # input (512) = 128 + 128 (both directions) for mean pooling and the same for max pooling
        self.out = nn.Sequential(
            nn.Linear(512, num_classes),
        )

    def forward(self, text):
        x = self.embedding(text)
        x = self.dropout(x)
        # move the embedding output to lstm
        x,_ = self.lstm(x)
        # apply mean and max pooling over the sequence dimension of the LSTM output
        avg_pool = torch.mean(x,1)
        max_pool, _ = torch.max(x,1)
        # concatenate mean and max pooling, which is why the output layer takes 512 features:
        # 128 for each direction = 256 features per time step,
        # so avg_pool = 256 and max_pool = 256
        out = torch.cat((avg_pool,max_pool), 1)
        # pass through the output layer and return the output
        out = self.out(out)
        return out

In [None]:
from tqdm.autonotebook import tqdm

def train_one_epoch(
    model,
    loader,
    optimizer,
    loss_fn,
    epoch_num=-1
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: train",
        leave=True,
    )
    model.train()
    train_loss = 0.0
    for i, batch in loop:
        labels, texts, offsets, scores, helpfulness = batch
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(texts)
        # loss calculation
        loss = loss_fn(outputs, labels)

        # backward pass
        loss.backward()

        # optimizer run
        optimizer.step()

        train_loss += loss.item()
        # loss_fn already averages over the batch, so report the mean over the batches seen so far
        loop.set_postfix({"loss": train_loss / i})

def val_one_epoch(
    model,
    loader,
    loss_fn,
    epoch_num=-1,
    best_so_far=0.0,
    ckpt_path='best.pt'
):

    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: val",
        leave=True,
    )
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            labels, texts, offsets, scores, helpfulness = batch

            # forward pass
            outputs = model(texts)
            # loss calculation
            loss = loss_fn(outputs, labels)

            _, predicted = outputs.data.max(1, keepdim=True)
            total += labels.size(0)
            # .item() converts the one-element tensors into plain Python numbers
            correct += predicted.eq(labels.data.view_as(predicted)).sum().item()

            val_loss += loss.item()
            # loss_fn already averages over the batch, so report the mean over the batches seen so far
            loop.set_postfix({"loss": val_loss / i, "acc": correct / total})

        if correct / total > best_so_far:
            torch.save(model.state_dict(), ckpt_path)
            return correct / total

    return best_so_far

In [25]:
epochs = 20
model = TextClassificationModel(len(train_preprocessed['Category'].unique())).to(device)
optimizer = torch.optim.Adam(model.parameters(),lr = 1e-3)
loss_fn = nn.CrossEntropyLoss()
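
`nn.CrossEntropyLoss` combines a log-softmax with the negative log-likelihood, so it expects raw, unnormalized class scores (logits) and integer class indices as targets. The plain-Python sketch below shows that computation for a single sample. The helper names and the example `logits` are made up for the illustration; this is not how PyTorch implements the loss internally.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # Negative log of the probability assigned to the correct class
    return -math.log(softmax(logits)[target])


logits = [2.0, 0.5, -1.0]  # made-up scores for three classes

print(softmax(logits))
print(cross_entropy(logits, 0))  # correct class has the highest score: small loss
print(cross_entropy(logits, 2))  # correct class has the lowest score: large loss
```

When all logits are equal, the loss equals `log(num_classes)`, which is a handy reference value for an untrained classifier.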

In [None]:
best = -float('inf')
for epoch in range(epochs):
    train_one_epoch(model, train_dataloader, optimizer, loss_fn, epoch_num=epoch)
    best = val_one_epoch(model, val_dataloader, loss_fn, epoch, best_so_far=best)

# Predictions

---

## EXERCISE 5:

The `collate_batch` function is redefined here to process batches of test data, which have no labels, into a suitable format. Here's how you can complete it:

1. Initialization:

  Lists: Four lists are initialized to store text sequences (`text_list`), scores (`score_list`), helpfulness scores (`helpfulness_list`), and offsets (`offsets`).
  Offsets: The `offsets` list is initialized with [0].

2. Iterate over the batch:

  The function iterates through each item in the batch. For each item, it extracts `_helpfulness`, `_score`, `_text` and `id` (though `id` is not used further).

  - Text processing:
  The `_text` is processed using `text_pipeline`, which converts the list of tokens to a list of token indices.
  The processed text is then converted to a tensor and appended to `text_list`.

  - Scores and helpfulness:
  `_score` is appended to `score_list`.
  `_helpfulness` is appended to `helpfulness_list`.

  - Offsets:
  Offsets are intended to store the starting index of each sequence in the concatenated batch, but in this function they are not updated after initialization, because the model does not use them.

3. Padding sequences:

  `pad_sequence` is used to pad the sequences in `text_list` so that they all have the same length. It pads sequences on the right, using `PAD_IDX` (1) as the padding value.

4. Convert to tensors:

  The lists `text_list`, `score_list` and `helpfulness_list` are converted to tensors.
  The `offsets` list is also converted to a tensor, but since it is never updated, it remains [0].

5. Return tensors:

  The function returns the padded text sequences, offsets, scores, and helpfulness scores as tensors, in that order. These tensors are moved to the specified device (CPU or GPU).

---


In [32]:
def collate_batch(batch):
    ...

# Create DataLoader for test data
test_dataloader = DataLoader(
    test_preprocessed.to_numpy(),  # Convert preprocessed test data to NumPy array
    batch_size=128,                # Set batch size to 128
    shuffle=False,                 # Do not shuffle data for predictions
    collate_fn=collate_batch       # Use collate_batch function to process batches
)

In [33]:
# Define predict function to make predictions
def predict(
    model,
    loader,
):
    """
    Predicts labels for data batches using the provided model and DataLoader.

    This function iterates over batches in the DataLoader, performs a forward pass
    through the model to obtain predicted outputs, and collects these predictions.

    Args:
    model (torch.nn.Module): The trained PyTorch model for prediction.
    loader (torch.utils.data.DataLoader): DataLoader containing batches of data.

    Returns:
    list: List of predicted labels for all batches in DataLoader.
    """

    # Create tqdm progress bar for prediction loop
    loop = tqdm(
        enumerate(loader, 1),       # Enumerate over DataLoader with start index 1
        total=len(loader),          # Set total number of batches for tqdm
        desc="Predictions:",        # Description for tqdm progress bar
        leave=True,                 # Leave tqdm progress bar after completion
    )
    predictions = []                # Initialize empty list to store predictions
    with torch.no_grad():
        model.eval()                # Set model to evaluation mode (disables dropout)

        # Iterate over batches in DataLoader
        for i, batch in loop:
            texts, offsets, scores, helpfulness = batch  # Unpack batch data

            # Forward pass: the model defined above takes only the padded token indices
            outputs = model(texts)

            # Get predicted labels: argmax along the second dimension (class dimension)
            _, predicted = torch.max(outputs.data, 1)

            # Convert predicted tensor to list and append to predictions
            predictions += predicted.detach().cpu().tolist()

    return predictions



In [42]:
def predict(
    model,
    loader,
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc="Predictions:",
        leave=True,
    )
    predictions = []
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            texts, offsets, scores, helpfulness = batch

            # forward pass
            outputs = model(texts)

            _, predicted = torch.max(outputs.data, 1)
            predictions += predicted.detach().cpu().tolist()

    return predictions

In [None]:
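# Load the best model checkpoint from the "best.pt" file
# map_location makes the checkpoint loadable even if it was saved on a different device
ckpt = torch.load("best.pt", map_location=device)
# Update the model with the loaded state_dict
model.load_state_dict(ckpt)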

# Make predictions using the model and test DataLoader
predictions = predict(model, test_dataloader)  # Get predictions for test data
predictions[:10]


In [45]:
# Convert predictions to corresponding category labels and save to CSV
# Map predictions to category labels using idx2cat
results = pd.Series(predictions).apply(lambda x: idx2cat[x])

# Save results to CSV with 'id' as index label
results.to_csv('result.csv', index_label='id')