<a href="https://colab.research.google.com/github/nosportugal/faast-data-science/blob/main/courses/deep_learning/unit2/assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 2: Neural Networks

Your challenge in this unit is to classify the sentiment expressed in [IMDb](https://www.imdb.com/) movie reviews using a neural network.

By now, you should have the files `labeledTrainData.tsv` and `testData.tsv` in a folder named `ldsa-dl-course-data` in your Google Drive. If you don't, please check the README file of Unit 2 for instructions.

This notebook already contains implementations of Bag-of-Words, the data loaders, and the model definition, as covered in the 8.2 videos. Review the code carefully, then implement the training loop in vanilla PyTorch, following this [reference notebook](https://github.com/Lightning-AI/dl-fundamentals/blob/main/unit04-multilayer-nets/4.3-mlp-pytorch/4.3-mlp-pytorch-part3-5-mnist/4.3-mlp-pytorch-part5-mnist.ipynb) shown in 4.3.

## 1) Setup

In [None]:
from google.colab import drive

drive.mount("/content/drive")

In [None]:
!pip install wandb --quiet

In [None]:
import wandb

# This opens a window so you can log in to W&B on Google Colab.
# If that doesn't work, uncomment the line below and set your W&B API key.
# If you do, remove your key before publishing to GitHub.

# %env WANDB_API_KEY=YOUR_WANDB_API_KEY
wandb.login()

## 2) Load the train dataset

Load the train dataset from the TSV file stored in your Google Drive. Shuffle it, then split it into train (20,000 reviews) and validation (5,000 reviews) datasets.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(
    "/content/drive/My Drive/ldsa-dl-course-data/labeledTrainData.tsv",
    header=0,
    delimiter="\t",
    quoting=3,
)

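# Shuffle the rows reproducibly; drop=True avoids keeping the old index as an extra column.
df_shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)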

df_train = df_shuffled.iloc[:20000]
df_val = df_shuffled.iloc[20000:25000]

## 3) Vectorization

Use Bag-of-Words for vectorizing the dataset.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(lowercase=True, max_features=10_000, stop_words="english")

cv.fit(df_train["review"])

X_train = cv.transform(df_train["review"])
X_val = cv.transform(df_val["review"])

## 4) Data loader

Create a PyTorch `Dataset` and a corresponding `DataLoader` for the train and validation datasets.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

In [None]:
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.int64)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.labels[index]
        return x, y

    def __len__(self):
        return self.labels.shape[0]

In [None]:
# toarray() returns a plain NumPy array (todense() returns the legacy np.matrix type).
train_ds = TextDataset(X_train.toarray(), df_train["sentiment"].values)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
)

In [None]:
val_ds = TextDataset(X_val.toarray(), df_val["sentiment"].values)

# Validation data is only used for evaluation, so shuffling is not needed.
val_loader = DataLoader(
    dataset=val_ds,
    batch_size=32,
    shuffle=False,
)

In [None]:
# Fetch a single batch to sanity-check the tensor shapes.
for batch_idx, (features, class_labels) in enumerate(train_loader):
    break

features.shape

## 5) Model definition

Define a PyTorch model.

In [None]:
class LogisticRegression(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.linear = torch.nn.Linear(num_features, num_classes)

    def forward(self, x):
        logits = self.linear(x)
        return logits


model = LogisticRegression(num_features=10_000, num_classes=2)
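
The model returns raw logits, not probabilities. The sketch below uses a small stand-in `torch.nn.Linear` layer and a random batch (both assumptions for illustration, not the assignment's data) to show how logits can be turned into class probabilities and predicted labels.

```python
import torch

torch.manual_seed(0)

# Stand-in for the model's linear layer: 5 input features, 2 classes (illustrative sizes).
linear = torch.nn.Linear(in_features=5, out_features=2)

# A made-up batch of 3 examples with 5 features each.
x = torch.rand(3, 5)

logits = linear(x)                    # shape: (3, 2), unnormalized scores
probs = torch.softmax(logits, dim=1)  # each row sums to 1
preds = torch.argmax(logits, dim=1)   # predicted class index per example

print(logits.shape, probs.sum(dim=1), preds)
```

Note that `torch.nn.functional.cross_entropy` expects the raw logits and applies the softmax internally, so the softmax is only needed when you want to inspect probabilities.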

## 6) Model training

Define your training loop using vanilla PyTorch.
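
As a starting point, here is a minimal sketch of the general shape of a vanilla PyTorch training loop, run on synthetic data. Everything in it is an assumption for illustration: the toy tensors, the names `toy_loader`, `toy_model`, and `compute_accuracy`, and the choice of SGD, learning rate, and epoch count. It is not the reference solution, so adapt it to `model`, `train_loader`, and `val_loader` from this notebook.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)

# Synthetic stand-in data: 200 examples, 20 features, label depends on feature 0.
X_toy = torch.randn(200, 20)
y_toy = (X_toy[:, 0] > 0).long()
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=32, shuffle=True)

# Stand-in model: a single linear layer producing 2 logits.
toy_model = torch.nn.Linear(20, 2)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)


def compute_accuracy(model, loader):
    """Return the fraction of correctly classified examples in loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for features, labels in loader:
            preds = torch.argmax(model(features), dim=1)
            correct += (preds == labels).sum().item()
            total += labels.shape[0]
    return correct / total


num_epochs = 20
for epoch in range(num_epochs):
    toy_model.train()
    for features, labels in toy_loader:
        logits = toy_model(features)            # forward pass
        loss = F.cross_entropy(logits, labels)  # loss on raw logits

        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()        # backpropagation
        optimizer.step()       # parameter update

    acc = compute_accuracy(toy_model, toy_loader)
    print(f"Epoch {epoch + 1:02d}/{num_epochs} | last batch loss {loss.item():.4f} | accuracy {acc:.3f}")
```

In your own loop, compute the accuracy on `val_loader` as well as on `train_loader` so you can spot overfitting.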