# Question 1 - Creating a Dataset

In this question you'll create a dataset class for the Amazon sentiment analysis dataset.

Add the following to the class below:
```__init__```:
1. Enumerate the different labels and create two dict attributes: ```self.label2idx```, ```self.idx2label```.
2. Instantiate a ```TfidfVectorizer``` and use ```TfidfVectorizer.fit_transform``` to transform the sentences into tf-idf vectors. Documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform).
3. Set the attribute ```self.vocab_size``` using the tokenizer's ```vocabulary_``` attribute.
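
The three ```__init__``` steps above can be tried on a toy corpus first. The sketch below is a minimal illustration, not the assignment solution; the sentences, the labels, and the variable names are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical, stands in for the reviewText and label columns)
sentences = ["good movie", "bad movie", "good good show"]
labels = ["Positive", "Negative", "Positive"]

# Step 1: map each distinct label to an index and back
label2idx = {label: ii for ii, label in enumerate(sorted(set(labels)))}
idx2label = {ii: label for label, ii in label2idx.items()}

# Step 2: fit the vectorizer and transform the sentences in one call
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence

# Step 3: the vocabulary size is the number of learned terms
vocab_size = len(vectorizer.vocabulary_)

print(label2idx)
print(tfidf.shape, vocab_size)
```

Sorting the labels before enumerating them keeps the mapping stable across runs and across files that contain the same set of labels.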


```__getitem__```:
1. Reimplement the method so that it returns the sentence's tf-idf vector as a tensor. The tensor should have shape ```[vocab_size]```, not ```[1, vocab_size]```. You can use the ```Tensor.squeeze()``` method to do this ([documentation](https://pytorch.org/docs/stable/generated/torch.squeeze.html#torch.squeeze)).
2. Return the index of the label instead of the label itself.
3. The output should be in the following format: ```data = {"input_vectors": sentence, "labels": label}```
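
To see what the shape requirement means in practice, here is a small sketch of turning one sparse row into a 1-D tensor. It assumes a hand-written 2 x 5 matrix in place of the real tf-idf output, and the names ```tfidf```, ```row```, ```dense``` and ```vector``` are chosen only for this illustration.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix

# Hypothetical 2 x 5 tf-idf matrix (2 sentences, vocab_size = 5)
tfidf = csr_matrix(np.array([[0.0, 0.6, 0.0, 0.8, 0.0],
                             [1.0, 0.0, 0.0, 0.0, 0.0]]))
labels = ["Positive", "Negative"]
label2idx = {"Negative": 0, "Positive": 1}

item = 0
row = tfidf[item, :]                    # still sparse, shape (1, 5)
dense = torch.tensor(row.toarray())     # dense tensor, shape [1, 5]
vector = dense.squeeze().float()        # leading axis dropped, shape [5]

data = {"input_vectors": vector, "labels": label2idx[labels[item]]}
print(dense.shape, vector.shape, data["labels"])
```

Calling ```squeeze()``` without arguments removes every axis of size 1; ```squeeze(0)``` removes only the leading one, which is the safer choice if the vocabulary could ever have a single term.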

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
df = pd.read_csv('./amazon_sa/train.csv')
df.head()

Unnamed: 0,reviewText,label
0,"Remember when you were a kid, and something ca...",Positive
1,I've enjoyed this show from the beginning.The ...,Positive
2,This is an awesome TV show! Very entertaining...,Positive
3,It's fun to get a glimpse into a totally diffe...,Positive
4,Under the Dome is an interesting way to see a ...,Positive


In [5]:
lb = df['label'].tolist()

In [12]:
d = {label: ii for ii, label in enumerate(df['label'].unique())}

In [15]:
tf = TfidfVectorizer()

In [49]:
ttt = tf.fit_transform(df['reviewText'].tolist())

In [51]:
ttt[0, :]

<1x49221 sparse matrix of type '<class 'numpy.float64'>'
	with 512 stored elements in Compressed Sparse Row format>

In [106]:
class ClassificationDataset(Dataset):

    def __init__(self, file_path, tokenizer=None):
        # Read data
        self.file_path = file_path
        data = pd.read_csv(self.file_path)

        # Split to sentences and labels
        self.sentences = data['reviewText'].tolist()
        self.labels = data['label'].tolist()

        # Enumerate labels
        self.label2idx = {label: ii for ii, label in enumerate(sorted(data['label'].unique()))}
        self.idx2label = {v: k for k, v in self.label2idx.items()}
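
        # Tokenize sentences
        if tokenizer is not None:
            # Reuse an already fitted vectorizer (e.g. the training one) so
            # that the vocabulary and idf weights match the training data
            self.tokenizer = tokenizer
            self.tokenized_sent = self.tokenizer.transform(self.sentences)
        else:
            # No vectorizer given: fit a new one on this file's sentences
            self.tokenizer = TfidfVectorizer()
            self.tokenized_sent = self.tokenizer.fit_transform(self.sentences)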

        # Set vocab_size attribute
        self.vocab_size = len(self.tokenizer.vocabulary_)

    def __getitem__(self, item):
        # Tensorize sentence
        sentence = self.tokenized_sent[item, :].toarray().squeeze()
        sentence = torch.tensor(sentence).float()
        # Get label idx
        label = self.label2idx[self.labels[item]]
        data = {"input_vectors": sentence, "labels": label}
        return data

    def __len__(self):
        return len(self.sentences)

In [107]:
train_dataset = ClassificationDataset('./amazon_sa/train.csv', tokenizer=None)
test_dataset = ClassificationDataset('./amazon_sa/test.csv', tokenizer=train_dataset.tokenizer)

batch_size = 4
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)

In [108]:
train_dataset[0]['input_vectors']

tensor([0., 0., 0.,  ..., 0., 0., 0.])

In [109]:
# Grab a single batch from the loader
batch = next(iter(train_loader))
print(batch)

{'input_vectors': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]), 'labels': tensor([1, 1, 1, 1])}


# Question 2 - Modeling

In this question you will implement a simple neural network that will classify a sentence given its tf-idf vector.

Implement a model with the following architecture:
1. A linear layer from ```vocab_size``` to ```hidden_dim```.
2. A ReLU activation function.
3. A linear layer from ```hidden_dim``` to ```num_classes```.
4. A cross-entropy loss.

```forward```:
1. If labels are passed, return the output of the second linear layer and the loss.
2. Otherwise, return the output of the second linear layer and ```None```.
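
The loss in step 4 is applied to the raw output of the second linear layer. As a minimal sketch of that contract (the tensors below are hypothetical, not taken from the dataset), ```nn.CrossEntropyLoss``` takes unnormalized logits of shape ```[batch, num_classes]``` and integer class indices of shape ```[batch]```:

```python
import math
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

# Hypothetical batch: 2 examples, 2 classes, all logits equal
logits = torch.zeros(2, 2)           # raw scores, no softmax applied
labels = torch.tensor([0, 1])        # integer class indices, dtype int64

loss = loss_fn(logits, labels)
print(loss.item(), math.log(2))      # both are about 0.6931
```

With equal logits every class gets probability 0.5, so the loss is ln 2. No softmax layer is needed in the model itself, because the loss applies log-softmax internally.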

In [110]:
from torch import nn

In [155]:
class TfIdfClassifier(nn.Module):

    def __init__(self, vocab_size, num_classes, hidden_dim=100):
        super().__init__()
        self.fc1 = nn.Linear(vocab_size, hidden_dim)
        self.activation = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        # CrossEntropyLoss expects raw logits and integer class indices
        self.loss = nn.CrossEntropyLoss()

    def forward(self, input_vectors, labels=None):
        x = self.fc1(input_vectors)
        x = self.activation(x)
        x = self.fc2(x)
        if labels is None:
            return x, None
        else:
            return x, self.loss(x, labels)

In [156]:
model = TfIdfClassifier(train_dataset.vocab_size, len(train_dataset.label2idx), hidden_dim=100)
print(model)

TfIdfClassifier(
  (fc1): Linear(in_features=49221, out_features=100, bias=True)
  (activation): ReLU()
  (fc2): Linear(in_features=100, out_features=2, bias=True)
  (loss): CrossEntropyLoss()
)


In [157]:
model(**batch)

(tensor([[-0.0446, -0.0075],
         [-0.0432, -0.0075],
         [-0.0445, -0.0090],
         [-0.0440, -0.0066]], grad_fn=<AddmmBackward0>),
 tensor(0.6751, grad_fn=<NllLossBackward0>))
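
As a sanity check, the printed loss can be recomputed by hand from the printed logits. The sketch below uses only the standard library and assumes that the batch passed to the model is the one printed earlier, whose labels are all 1; because the printed logits are rounded, agreement is only expected to about four decimals.

```python
import math

# Logits copied from the model output; labels assumed from the printed batch
logits = [(-0.0446, -0.0075), (-0.0432, -0.0075),
          (-0.0445, -0.0090), (-0.0440, -0.0066)]
labels = [1, 1, 1, 1]

def cross_entropy(row, target):
    # Negative log of the softmax probability assigned to the target class
    log_norm = math.log(sum(math.exp(z) for z in row))
    return log_norm - row[target]

# CrossEntropyLoss averages over the batch by default
loss = sum(cross_entropy(r, t) for r, t in zip(logits, labels)) / len(labels)
print(round(loss, 4))
```

The result is close to 0.6751 and only slightly below ln 2 ≈ 0.6931, which is what an untrained two-class model with near-equal logits should produce.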