<a href="https://colab.research.google.com/github/nosportugal/faast-data-science/blob/main/courses/deep_learning/unit5/assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 5: Word Embeddings

By now, you should have the files `labeledTrainData.tsv` and `testData.tsv` in a folder named `ldsa-dl-course-data` in your Google Drive. If you don't, please check the README file of Unit 2 for instructions.

We recommend that you use [Weights & Biases](https://wandb.ai/site) (W&B) to track your experiments. Sign up on W&B with your Google account so that the connection with the Google Colab environment is seamless. Follow the [documentation](https://docs.wandb.ai/guides/integrations/lightning) to integrate W&B with PyTorch Lightning.

## 1) Setup

In [None]:
!pip install lightning==2.0.1 wandb --quiet

In [None]:
from google.colab import drive

drive.mount("/content/drive")

In [None]:
import wandb

# This will open a window so you can log in to W&B on Google Colab.
# If that doesn't work, set your W&B API key below.
# If you do, remove your key before publishing to GitHub.

# %env WANDB_API_KEY=YOUR_WANDB_API_KEY
wandb.login()
run = wandb.init(project="imdb_sentiment")

## 2) Load the train dataset

Load the train dataset from the TSV file stored in your Google Drive. Split it into train and validation datasets.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(
    "/content/drive/My Drive/ldsa-dl-course-data/labeledTrainData.tsv",
    header=0,
    delimiter="\t",
    quoting=3,
)

# Shuffle the rows; drop=True discards the old index instead of keeping it as a column.
df_shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

df_train = df_shuffled.iloc[:20000]
df_val = df_shuffled.iloc[20000:25000]

## 3) Tokenization

The goal of this section is to transform the data, such that each word is segmented and mapped to an integer.
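To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not the Keras implementation: the helper names `build_word_index` and `texts_to_ids` are hypothetical, and splitting on whitespace stands in for the real filtering rules.

```python
from collections import Counter


def build_word_index(texts, num_words):
    """Map the `num_words` most frequent words to integers, starting at 1."""
    # Count how often each lowercased, whitespace-separated word occurs.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so that it can later serve as the padding value.
    most_common = counts.most_common(num_words)
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}


def texts_to_ids(texts, word_index):
    """Replace each known word by its integer; unknown words are dropped."""
    return [
        [word_index[word] for word in text.lower().split() if word in word_index]
        for text in texts
    ]


toy_texts = ["a great movie", "a bad movie", "a great great cast"]
toy_word_index = build_word_index(toy_texts, num_words=3)
toy_ids = texts_to_ids(["a great unknown movie"], toy_word_index)

print(toy_word_index)
print(toy_ids)
```

More frequent words receive smaller integers, and words outside the vocabulary simply disappear from the sequence.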

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
# We use the Keras text Tokenizer, as it is quite simple to use for this end.
# Note that this is a pre-processing step, which we can decouple from the model.
# Keras is being used in a way similar to how sklearn was used in previous
# units, simply as a means of doing data processing. The model side still uses
# PyTorch only. Other alternatives for the processing are spaCy and torchtext.

tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=10000, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True
)

# The tokenizer looks at all the words that exist in the training dataset and
# assigns an integer to each unique word, ranked by frequency. Only the most
# common words (indices below num_words) are kept when converting texts.
tokenizer.fit_on_texts(df_train["review"])

In [None]:
# This function first transforms each sentence into an array of integers. It
# then pads with zeros (and truncates longer texts) so that every sequence has
# length max_seq_len.
def tokenize_to_array(texts, max_seq_len):
    tokenized_texts = tokenizer.texts_to_sequences(texts)

    # Integer dtype, since the values are indices into an embedding table.
    X = np.zeros((len(texts), max_seq_len), dtype=np.int64)

    for i, tokenized_text in enumerate(tokenized_texts):
        tokenized_text = tokenized_text[:max_seq_len]
        X[i, : len(tokenized_text)] = tokenized_text

    return X

In [None]:
# Discover the length of the longest review in the labeled dataset (train and
# validation splits together).
all_texts = df["review"]
print(
    f"Max. sequence length on labeled dataset: {len(max(tokenizer.texts_to_sequences(all_texts), key=len))}"
)

In [None]:
max_seq_len = 2200  # Add an extra margin.

X_train = tokenize_to_array(df_train["review"], max_seq_len)
X_val = tokenize_to_array(df_val["review"], max_seq_len)

In [None]:
# Shape and contents of the tokenized training set (one row per review).
print(X_train.shape)
print(X_train)

## 4) Data loader

Create a PyTorch `Dataset` and the corresponding `DataLoader` for the train and validation datasets.
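One possible shape for this step is sketched below. The class name `ReviewDataset` and the toy arrays are hypothetical stand-ins; the sketch assumes the inputs are padded integer arrays like the ones produced in section 3 and that the labels are binary.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ReviewDataset(Dataset):
    """Hypothetical dataset pairing padded token ids with sentiment labels."""

    def __init__(self, X, y):
        # Embedding layers expect integer (long) indices; the labels are kept
        # as floats so they can be fed to a binary cross-entropy loss.
        self.X = torch.as_tensor(np.asarray(X), dtype=torch.long)
        self.y = torch.as_tensor(np.asarray(y), dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Toy stand-ins for the padded reviews and their labels.
X_toy = np.array([[4, 2, 0, 0], [7, 1, 3, 0], [5, 0, 0, 0]])
y_toy = np.array([1, 0, 1])

toy_dataset = ReviewDataset(X_toy, y_toy)
toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)

first_X, first_y = next(iter(toy_loader))
print(first_X.shape, first_y.shape)
```

In the notebook, the same class could presumably be built from `X_train` and the label column of `df_train` (assumed here to be named `sentiment`), with `shuffle=True` for the training loader only.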

## 5) Model definition

Define a PyTorch model and the corresponding PyTorch Lightning module.
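As a starting point, the sketch below shows one simple embedding-based classifier: it looks up an embedding per token, averages the non-padding embeddings, and maps the result to a single logit. The class name `AveragedEmbeddingClassifier` and the hyperparameters are assumptions chosen for illustration, not a prescribed solution.

```python
import torch
from torch import nn


class AveragedEmbeddingClassifier(nn.Module):
    """Hypothetical baseline: mean of word embeddings followed by a linear layer."""

    def __init__(self, vocab_size=10000, embedding_dim=50):
        super().__init__()
        # padding_idx=0 keeps the padding vector at zero and excludes it from
        # gradient updates.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.linear = nn.Linear(embedding_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embedding_dim)
        # Mask out padding positions so they do not affect the average.
        mask = (token_ids != 0).unsqueeze(-1).to(embedded.dtype)
        summed = (embedded * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)
        # One logit per review; apply a sigmoid to obtain a probability.
        return self.linear(summed / counts).squeeze(-1)


toy_model = AveragedEmbeddingClassifier(vocab_size=10, embedding_dim=4)
toy_logits = toy_model(torch.tensor([[4, 2, 0, 0], [7, 1, 3, 0]]))
print(toy_logits.shape)
```

Because the average ignores padding, the output for a review does not depend on how many zeros were appended to it.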

In [None]:
pytorch_model = ...

In [None]:
from lightning import LightningModule

In [None]:
class LightningModel(LightningModule):
    def __init__(self, model, learning_rate):
        pass

## 6) Model training

Train your model using a Lightning trainer.

## 7) Inference

Load the test dataset from the tsv file stored in your Google Drive and the model from the checkpoints you created on W&B. Finally, perform inference with the model on the test dataset.
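The inference step itself can be sketched as a batched loop with gradients disabled. The helper `predict_labels` and the stand-in model below are hypothetical; the sketch assumes the model returns one logit per review and runs on the CPU.

```python
import numpy as np
import torch
from torch import nn


def predict_labels(model, X, batch_size=256):
    """Hypothetical helper: return 0/1 labels from a one-logit-per-review model."""
    model.eval()  # Disable dropout and similar training-only behavior.
    predictions = []
    with torch.no_grad():  # No gradients are needed for inference.
        for start in range(0, len(X), batch_size):
            batch = torch.as_tensor(X[start : start + batch_size], dtype=torch.long)
            logits = model(batch)
            # A logit above 0 corresponds to a sigmoid probability above 0.5.
            predictions.append((logits > 0).long())
    return torch.cat(predictions)


class FirstTokenModel(nn.Module):
    """Stand-in model used only to exercise the helper."""

    def forward(self, token_ids):
        return token_ids[:, 0].float() - 2.5


X_demo = np.array([[1, 0], [3, 0], [5, 0]])
demo_labels = predict_labels(FirstTokenModel(), X_demo, batch_size=2)
print(demo_labels)
```

If the model was trained on a GPU, each batch would also need to be moved to the model's device before the forward pass.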

In [None]:
df_test = pd.read_csv(
    "/content/drive/My Drive/ldsa-dl-course-data/testData.tsv",
    header=0,
    delimiter="\t",
    quoting=3,
)

X_test = tokenize_to_array(df_test["review"], max_seq_len)

In [None]:
# Define checkpoint reference.
checkpoint_reference = "[USERNAME]/imdb_sentiment/model-[MODEL_ID]:best"

# Download checkpoint locally (if not already cached).
artifact = run.use_artifact(checkpoint_reference, type="model")
artifact_dir = artifact.download()

# Load checkpoint.
model = LightningModel.load_from_checkpoint(str(artifact_dir) + "/model.ckpt")

In [None]:
predicted_labels = ...

In [None]:
wandb.finish()

## 8) Post-process for Kaggle submission

Assuming the predicted class labels are stored in `predicted_labels` (as a Torch tensor), create a CSV file ready for submission on Kaggle.

In [None]:
# Convert the tensor to a NumPy array so pandas stores plain integers.
output = pd.DataFrame(
    data={"id": df_test["id"], "sentiment": predicted_labels.detach().cpu().numpy()}
)

In [None]:
output.to_csv("output.csv", index=False, quoting=3)