<a href="https://colab.research.google.com/github/nosportugal/faast-data-science/blob/main/courses/deep_learning/unit3/assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 3: Lightning and W&B

Your challenge in this unit will be to classify the sentiment expressed in [IMDb](https://www.imdb.com/) movie reviews using a neural network and to submit your results to [this Kaggle competition](https://www.kaggle.com/competitions/word2vec-nlp-tutorial/overview).

By now, you should have the files `labeledTrainData.tsv` and `testData.tsv` in a folder named `ldsa-dl-course-data` in your Google Drive. If you don't, please check the README file of Unit 2 for instructions.

[This Jupyter notebook](https://github.com/Lightning-AI/dl-fundamentals/blob/main/unit08-large-language-models/8.2-bag-of-words/8.2-part2-bag-of-words-classifier.ipynb) from Unit 8.2 of the Deep Learning Fundamentals course should help you adapt your solution from the previous week to use Lightning. When you're happy with your results, run inference on the test set and make your first submission to the Kaggle competition.

We recommend that you use [Weights & Biases](https://wandb.ai/site) (W&B) to track your experiments. Sign up on W&B with your Google account so that the connection with the Google Colab environment is seamless. Follow the [documentation](https://docs.wandb.ai/guides/integrations/lightning) to integrate W&B with PyTorch Lightning.

## 1) Setup

In [None]:
!pip install lightning==2.0.1 wandb --quiet

In [None]:
from google.colab import drive

drive.mount("/content/drive")

In [None]:
import wandb

# This will open a window so you can log in to W&B on Google Colab.
# If that doesn't work, set your W&B API key below.
# If you do, remove your key before publishing to GitHub.

# %env WANDB_API_KEY=YOUR_WANDB_API_KEY
wandb.login()
run = wandb.init(project="imdb_sentiment")

## 2) Load the train dataset

Load the train dataset from the TSV file stored in your Google Drive. Split it into train and validation datasets.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(
    "/content/drive/My Drive/ldsa-dl-course-data/labeledTrainData.tsv",
    header=0,
    delimiter="\t",
    quoting=3,
)

# Shuffle the rows; drop=True avoids keeping the old index as an extra column.
df_shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

df_train = df_shuffled.iloc[:20000]
df_val = df_shuffled.iloc[20000:25000]

## 3) Vectorization

Use Bag-of-Words for vectorizing the dataset.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(lowercase=True, max_features=10_000, stop_words="english")

cv.fit(df_train["review"])

X_train = cv.transform(df_train["review"])
X_val = cv.transform(df_val["review"])
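
To make the fit/transform split concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not how `CountVectorizer` is implemented: the toy corpus and the helpers `tokenize` and `to_counts` are hypothetical, and the whitespace tokenizer is a simplification (scikit-learn uses a regex tokenizer and, in the cell above, also removes stop words and caps the vocabulary size).

```python
from collections import Counter

# Toy corpus standing in for the IMDb reviews (hypothetical examples).
train_docs = ["a great great movie", "a boring movie"]


def tokenize(text):
    # Simplified tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


# "Fit": build the vocabulary from the training documents only.
vocabulary = sorted({token for doc in train_docs for token in tokenize(doc)})


def to_counts(text):
    # "Transform": count each vocabulary word; out-of-vocabulary words are ignored.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


train_vectors = [to_counts(doc) for doc in train_docs]
unseen_vector = to_counts("a great unseen film")

print(vocabulary)      # ['a', 'boring', 'great', 'movie']
print(train_vectors)   # [[1, 0, 2, 1], [1, 1, 0, 1]]
print(unseen_vector)   # [1, 0, 1, 0]
```

Because the vocabulary is built from the training documents alone, words that only appear in validation or test reviews contribute nothing to their vectors. This is why the vectorizer above is fitted on `df_train` and only applied to `df_val`.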

## 4) Data loader

Create a PyTorch `Dataset` and the corresponding `DataLoader` for the train and validation datasets.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

In [None]:
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.int64)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.labels[index]
        return x, y

    def __len__(self):
        return self.labels.shape[0]

In [None]:
# toarray() returns a plain ndarray; todense() returns the legacy np.matrix type.
train_ds = TextDataset(X_train.toarray(), df_train["sentiment"].values)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
)

In [None]:
# toarray() returns a plain ndarray; todense() returns the legacy np.matrix type.
val_ds = TextDataset(X_val.toarray(), df_val["sentiment"].values)

val_loader = DataLoader(
    dataset=val_ds,
    batch_size=32,
    shuffle=False,  # validation order does not need shuffling
)

In [None]:
for batch_idx, (features, class_labels) in enumerate(train_loader):
    break

features.shape

## 5) Model definition

Define a PyTorch model and the corresponding PyTorch Lightning module.

In [None]:
pytorch_model = ...

In [None]:
from lightning import LightningModule

In [None]:
class LightningModel(LightningModule):
    def __init__(self, model, learning_rate):
        pass

## 6) Model training

Train your model using a Lightning trainer.

## 7) Inference

Load the test dataset from the tsv file stored in your Google Drive and the model from the checkpoints you created on W&B. Finally, perform inference with the model on the test dataset.

In [None]:
df_test = pd.read_csv(
    "/content/drive/My Drive/ldsa-dl-course-data/testData.tsv",
    header=0,
    delimiter="\t",
    quoting=3,
)

In [None]:
# Define checkpoint reference.
checkpoint_reference = "[USERNAME]/imdb_sentiment/model-[MODEL_ID]:best"

# Download checkpoint locally (if not already cached).
artifact = run.use_artifact(checkpoint_reference, type="model")
artifact_dir = artifact.download()

# Load checkpoint.
model = LightningModel.load_from_checkpoint(str(artifact_dir) + "/model.ckpt")

In [None]:
predicted_labels = ...
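
A minimal sketch of batched inference follows, assuming the model returns one logit per class so that the predicted label is the arg-max over the class dimension. The helper `predict_labels`, the stand-in linear model, and the random inputs are hypothetical and only there so the example runs on its own.

```python
import torch


@torch.no_grad()  # no gradients are needed at inference time
def predict_labels(model, features, batch_size=256):
    """Return the arg-max class index for each row of `features`."""
    model.eval()  # disable training-only behavior such as dropout
    predictions = []
    for start in range(0, features.shape[0], batch_size):
        logits = model(features[start:start + batch_size])
        predictions.append(logits.argmax(dim=1))
    return torch.cat(predictions)


# Stand-in model and random inputs (both hypothetical).
torch.manual_seed(0)
toy_model = torch.nn.Linear(20, 2)
toy_features = torch.rand(10, 20)
toy_predictions = predict_labels(toy_model, toy_features, batch_size=4)
print(toy_predictions.shape)  # torch.Size([10])
```

In this notebook the features would come from applying the fitted vectorizer to `df_test["review"]` and converting the result to a `float32` tensor on the same device as the model.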

In [None]:
wandb.finish()

## 8) Post-process for Kaggle submission

Assuming the predicted class labels are stored in `predicted_labels` (as a Torch tensor), create a CSV file ready for submission to Kaggle.

In [None]:
# Convert the tensor to a NumPy array so the column holds plain integers.
output = pd.DataFrame(
    data={"id": df_test["id"], "sentiment": predicted_labels.cpu().numpy()}
)

In [None]:
output.to_csv("output.csv", index=False, quoting=3)