# DSCI 572 Lab 3: CBOW model, minibatch training and (optionally) pretrained embeddings

In this lab, we'll work on a familiar task: sentiment analysis. We'll build a CBOW model using PyTorch. We'll also incorporate pretrained embeddings, which turn out to have a substantial impact on model performance. Finally, we'll investigate the impact of minibatch training and dropout on model accuracy.

**Note!** This can be a good opportunity to rehearse running code on Google Colab, where you get access to a GPU. Check this [tutorial](https://www.marktechpost.com/2021/01/09/getting-started-with-pytorch-in-google-collab-with-free-gpu/). 

## Getting started

Run the following code:

## IMDB data:
https://github.com/jungyeul/mds-cl-2023-24/blob/main/block4/data.zip

In [1]:
from copy import deepcopy
from collections import Counter
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import numpy as np
import torch
import torch.nn as nn
import nltk

# We'll use 32-bit floats (torch.float32) as the default dtype for our tensors
torch.set_default_dtype(torch.float32)

# Checks if GPU is available, otherwise use CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
torch.backends.cudnn.deterministic=True
print(device)

cpu


We'll now read data for sentiment analysis. This code is given to you. 

We get 500 training examples, 1000 development examples and 8476 test examples.

In [2]:
train = pd.read_csv("data/IMDB.train.tsv", header=None, names=["text", "sentiment"], sep="\t")[:500]
dev = pd.read_csv("data/IMDB.dev.tsv", header=None, names=["text", "sentiment"], sep="\t")
test = pd.read_csv("data/IMDB.test.tsv", header=None, names=["text", "sentiment"], sep="\t")

print(f"Number of training examples: {len(train)}")
print(f"Number of development examples: {len(dev)}")
print(f"Number of test examples: {len(test)}")

Number of training examples: 500
Number of development examples: 1000
Number of test examples: 8476


We'll then encode sentiment labels (`positive` and `negative`) as numbers.

In [3]:
label_encoder = LabelEncoder()
label_encoder.fit(train.sentiment)

train_y = label_encoder.transform(train.sentiment)
dev_y = label_encoder.transform(dev.sentiment)
test_y = label_encoder.transform(test.sentiment)

In [4]:
train.sentiment

0      positive
1      positive
2      negative
3      positive
4      positive
         ...   
495    negative
496    negative
497    negative
498    positive
499    negative
Name: sentiment, Length: 500, dtype: object

In [5]:
print(train_y)

[1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 0
 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 1 1 1 1 0 0
 1 1 0 1 0 1 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 0 1 0 1 0
 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1
 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0
 1 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1
 0 1 1 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1
 0 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 1 0
 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0
 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 0 1 0
 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1 0
 0 0 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1
 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 0 1
 1 0 1 1 1 0 0 0 0 0 1 0 

## Assignment 1

We'll start by training baseline sklearn sentiment analysis systems using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for feature extraction. 

### Assignment 1.1
rubric={accuracy:2, quality:1}

Start by fitting a `CountVectorizer` and `TfidfVectorizer` using the training set `train`. For now, you don't need to worry about setting any of the parameters for either vectorizer.

You can then transform our datasets into two sets of matrices:

* `train_count_X`, `dev_count_X` and `test_count_X` (using `CountVectorizer`)
* `train_tfidf_X`, `dev_tfidf_X` and `test_tfidf_X` (using `TfidfVectorizer`)
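
As a minimal sketch of the fit-then-transform pattern described above, the snippet below uses a made-up three-review training set and a one-review development set. All `toy_` names and texts are hypothetical and not part of the lab data. The point it illustrates is that each vectorizer is fitted on the training texts only and then reused to transform the other splits, so all matrices share the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy data standing in for train.text and dev.text
toy_train_texts = ["a great movie", "a dull movie", "great great fun"]
toy_dev_texts = ["great unknown movie"]

toy_count_vectorizer = CountVectorizer()
toy_tfidf_vectorizer = TfidfVectorizer()

# Fit on the training texts only, then transform every split with the same vocabulary
toy_train_count_X = toy_count_vectorizer.fit_transform(toy_train_texts)
toy_dev_count_X = toy_count_vectorizer.transform(toy_dev_texts)

toy_train_tfidf_X = toy_tfidf_vectorizer.fit_transform(toy_train_texts)
toy_dev_tfidf_X = toy_tfidf_vectorizer.transform(toy_dev_texts)

print(toy_train_count_X.shape, toy_dev_count_X.shape)
```

Words that were not seen during fitting (here `unknown`) are simply ignored when transforming the development text.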

In [6]:
from nltk.corpus import stopwords
en_stopwords = stopwords.words("english")

# your code here



train_count_X, dev_count_X, test_count_X = ...
train_tfidf_X, dev_tfidf_X, test_tfidf_X = ...


In [None]:
###########################################################################
# This is useful for discrete probabilistic models that model binary events 
###########################################################################
count_vectorizer = CountVectorizer(stop_words=en_stopwords, binary=True) 

### Assignment 1.2
rubric={accuracy:1, reasoning:1}

You should now fit two [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) models:

* `lr_count` using your count vectorizer features
* `lr_tfidf` using your tfidf vectorizer features

Evaluate your models on the **development** data using the sklearn function [`f1_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). `lr_count` should get f-score > 75% and `lr_tfidf` > 55%.

In [7]:
lr_count = LogisticRegression()
# fit, predict, results 
lr_tfidf = LogisticRegression()
# fit, predict, results 


# 'micro': f1 for parsing (lab2)
# Calculate metrics globally by counting the total true positives, false negatives and false positives.
# 'macro': f1 for "each" label; 
# Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

F-score (macro) for Count: 0.7814496603940944
F-score (micro) for Count: 0.795
F-score (macro) for TF-IDF: 0.575930278181811
F-score (micro) for TF-IDF: 0.679
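
To make the difference between the two averaging modes concrete, here is a small pure-Python sketch that computes macro and micro F-score by hand. The gold and predicted labels are made up for illustration, and the helper names (`f1_for_label`, `macro_f1`, `micro_f1`) are hypothetical; in the lab you would call sklearn's `f1_score` with the `average` argument instead.

```python
def counts_for_label(gold, pred, label):
    """Return (true positives, false positives, false negatives) for one label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_for_label(gold, pred, label):
    return f1_from_counts(*counts_for_label(gold, pred, label))

def macro_f1(gold, pred):
    # Unweighted mean of the per-label F-scores
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)

def micro_f1(gold, pred):
    # Pool the counts over all labels, then compute a single F-score
    labels = sorted(set(gold) | set(pred))
    tp = fp = fn = 0
    for label in labels:
        l_tp, l_fp, l_fn = counts_for_label(gold, pred, label)
        tp, fp, fn = tp + l_tp, fp + l_fp, fn + l_fn
    return f1_from_counts(tp, fp, fn)

# Made-up labels: four positive and two negative gold examples
toy_gold = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 1, 1, 0]

print("macro:", macro_f1(toy_gold, toy_pred))
print("micro:", micro_f1(toy_gold, toy_pred))
```

In a single-label setting like ours, the micro F-score coincides with plain accuracy, while the macro F-score is pulled down by the label that is predicted poorly.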


Why do you think CountVectorizer would achieve better performance than TfidfVectorizer on this task?

## Assignment 2

We'll then convert our training data into pytorch tensors. We *will not* use the output of sklearn vectorizers for this assignment. Instead we will directly numericalize our `train`, `dev` and `test` datasets. 

### Assignment 2.1
rubric={accuracy:1}

Start by creating a [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) `vocabulary` which gives the count for each word type in the `train` dataset.

To tokenize the sentences in `train`, you can simply split at spaces.
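
A minimal sketch of counting word types by splitting at spaces, using three made-up reviews (the `toy_` names are hypothetical and only for illustration):

```python
from collections import Counter

# Hypothetical toy reviews standing in for train.text
toy_reviews = ["great movie !", "not a great movie", "a dull movie"]

# Split each review at spaces and count every resulting token
toy_vocabulary = Counter(
    token for review in toy_reviews for token in review.split(" ")
)

print(toy_vocabulary["movie"], toy_vocabulary["great"], len(toy_vocabulary))
```

A `Counter` returns 0 for words it has never seen, which is convenient later when filtering by frequency.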

In [8]:
# your code here

vocabulary = Counter(...)

Assertions to test your code: 

In [9]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
assert vocabulary["the"] == 6416
assert vocabulary["dog"] == 7

In [12]:
len(vocabulary)

12775

Next, create a mapping `word2id` which translates every word type in `vocabulary` into a unique id number in the range `1 ... len(vocabulary)`. `word2id` should also map the symbol `PAD="<pad>"` to the ID `0`.
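
One possible way to build such a mapping is sketched below on a made-up vocabulary. The names `toy_vocabulary` and `toy_word2id` are hypothetical; the only requirements taken from the text are that `PAD` maps to 0 and every word type gets a unique ID in `1 ... len(vocabulary)`.

```python
from collections import Counter

PAD = "<pad>"

# Hypothetical toy vocabulary standing in for `vocabulary`
toy_vocabulary = Counter({"movie": 3, "great": 2, "a": 2, "!": 1})

# Reserve ID 0 for padding, then number the word types starting from 1
toy_word2id = {PAD: 0}
for word_id, word in enumerate(toy_vocabulary, start=1):
    toy_word2id[word] = word_id

print(toy_word2id)
```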

In [10]:
# Please use this constant whenever you refer to the padding symbol
PAD="<pad>"

# your code here

In [11]:
len(word2id)

12776

Assertions as a partial check of your code:

In [13]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
assert word2id[PAD] == 0
assert len(word2id) == len(vocabulary) + 1

### Assignment 2.2
rubric={accuracy:1}

Write a function `numericalize_ex` which takes the following arguments:

1. `ex`, **a string representing a review**. E.g. `"great movie !"`
1. `vocabulary`, the word type counter which we created above
1. `word2id`, the mapping words -> ID numbers which we created above
1. `min_count`, the minimum count of word type. Rarer words are filtered out.
1. `max_count`, the maximum count of word type. More frequent words are filtered out.

Your function should first split `ex` into individual tokens (you can split at spaces). You should then filter out all words whose frequency is < `min_count` or > `max_count`. 

Then, transform the example into a set and transform all the remaining words into ID numbers using `word2id`. Return a `torch.tensor` of shape `n`, where `n` is the count of ID numbers.

When you initialize the tensor, use `dtype=torch.long`.
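
The steps above can be sketched on toy data as follows. This is an illustration under assumptions, not the reference solution: the function name `toy_numericalize_ex` and the toy vocabulary are hypothetical, `min_count` is assumed to be at least 1 (so unseen words are always filtered out), and the IDs are sorted only to make the output reproducible, which the lab does not require.

```python
from collections import Counter
import torch

def toy_numericalize_ex(ex, vocabulary, word2id, min_count, max_count):
    tokens = ex.split(" ")
    # Keep each word type once (a set) if its training count is within the bounds
    kept = {tok for tok in tokens if min_count <= vocabulary[tok] <= max_count}
    # Sorting is an assumption made here for reproducibility only
    ids = sorted(word2id[tok] for tok in kept)
    return torch.tensor(ids, dtype=torch.long)

# Hypothetical toy vocabulary and mapping
toy_vocabulary = Counter({"movie": 3, "great": 2, "a": 2, "!": 1, "not": 1, "dull": 1})
toy_word2id = {"<pad>": 0, "movie": 1, "great": 2, "a": 3, "!": 4, "not": 5, "dull": 6}

toy_tensor = toy_numericalize_ex("not a great great film", toy_vocabulary, toy_word2id, 2, 2)
print(toy_tensor, toy_tensor.dtype)
```

Here `not` is too rare, `film` is unseen, and the repeated `great` is counted once, so only the IDs of `great` and `a` remain.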

In [25]:
min_count = 5
max_count = 100

def numericalize_ex(ex, vocabulary, word2id, min_count, max_count):
    # your code here
    # using ex, make a tensor (>= min_count and <= max_count) for each "ex" (a SINGLE review);
    ...


In [29]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
assert numericalize_ex(train.text[0], vocabulary, word2id, 5, 100).size()[0] == 70
numericalize_ex(train.text[0], vocabulary, word2id, 5, 100).size() # it contains 70 words... 

torch.Size([70])

### Assignment 2.3
rubric={accuracy:2}

Write a function `numericalize()` which takes the following arguments:

1. `data`, one of our datasets `train`, `dev` or `test`
1. `data_y`, the list of numeric labels for the examples in `data` (0 or 1 corresponding to negative and positive sentiment, respectively)
1. `vocabulary`, the word type counter which we created above
1. `word2id`, the mapping words -> ID numbers which we created above
1. `min_count`, the minimum count of word type in `train`. All rarer words are filtered out.
1. `max_count`, the maximum count of word type in `train`. All more frequent words are filtered out.
1. `batch_size`, our minibatch size.

You should first convert all examples in `data` into tensors using `numericalize_ex`. 

Then, pack the examples and their labels into minibatches containing `batch_size` examples each. Every minibatch should be a 3-tuple containing:

1. A minibatch `b` of input examples of dimension `batch_size x k`, where `k` is the maximal length of an example vector in the minibatch: `mbatch = pad_sequence(sequences=mbatch, batch_first=True)` makes a tensor; 
1. A minibatch of sequence lengths of shape `batch_size`, where the elements are the lengths of the examples in `b` before padding is applied: lengths (of `ex` in `mbatch`) should be a tensor; 
1. A minibatch of labels of shape `batch_size`, where each label `i` corresponds to example `b[i]`: `mbatch_y` (labels of `mbatch`) should also be a tensor;


You will need to pad all examples in `b` to the same length using the padding symbol `word2id[PAD]`. Use the PyTorch function [`pad_sequence`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html?highlight=pad_sequence) to convert a list of examples `x` of shapes `len_x` into a padded minibatch of shape `batch_size x max_len_x`. You will need to call the function with the argument `batch_first=True` because we want the batch size to be the first dimension.

If `batch_size` does not evenly divide `len(data)`, you may need to create one smaller minibatch to account for all training examples. This is okay.
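
The batching and padding described above can be sketched on toy tensors like this. The example tensors, labels and `toy_` names are made up, and `PAD_ID` stands in for `word2id[PAD]`; this is an illustration of `pad_sequence`, not the full `numericalize` function.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical already-numericalized examples of different lengths
toy_examples = [
    torch.tensor([3, 1, 2]),
    torch.tensor([4]),
    torch.tensor([5, 6]),
    torch.tensor([1, 2, 3, 4]),
    torch.tensor([2]),
]
toy_labels = [1, 0, 1, 0, 1]
toy_batch_size = 2
PAD_ID = 0  # stands in for word2id[PAD]

toy_batches = []
for start in range(0, len(toy_examples), toy_batch_size):
    chunk = toy_examples[start:start + toy_batch_size]
    # Record the lengths before padding is applied
    lengths = torch.tensor([len(ex) for ex in chunk])
    # Pad to the longest example in this minibatch; batch dimension first
    mbatch = pad_sequence(chunk, batch_first=True, padding_value=PAD_ID)
    mbatch_y = torch.tensor(toy_labels[start:start + toy_batch_size])
    toy_batches.append((mbatch, lengths, mbatch_y))

print(toy_batches[0])
print(len(toy_batches), toy_batches[-1][0].shape)
```

With five examples and a batch size of two, the last minibatch contains a single example, which mirrors the "one smaller minibatch" case mentioned above.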

**Note:** We're returning a list from this function. It is, however, often better to create a [data loader](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel). This can save memory when we're dealing with very large training sets. You'll learn more about this later.  

In [44]:
from torch.nn.utils.rnn import pad_sequence

batch_size = 15

def numericalize(data, 
                 data_y,
                 vocabulary, 
                 word2id,
                 min_count, 
                 max_count, 
                 batch_size):
# your code here
    
    # !!! convert your data into a list of tensors (note: each numericalized ex is a tensor) !!!
    
    # a result list of tuples
    res = []             
    
    for i in range(0, len(data), batch_size):           # get data with `batch_size`
        # you may want to get your mbatch using the current start and the end positions
        #   mbatch = data[start:end]
        # mbatch should be padded!  `batch_first=True`
        # get the length, and make it a tensor
        # get the data_y, and make it a tensor
    
        res.append((mbatch, length, mbatch_y))

    return res

Some tests to check that the number of batches which you generate looks okay:

In [42]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
batches = numericalize(train, train_y, vocabulary, word2id, 5, 100, 15)
assert 500//15 + 1 == len(batches)
assert batches[-1][0].size()[0] == 500 % 15

RESULT of the first batch:
tensor([[ 132,   28,  116,  ...,    0,    0,    0],
        [ 205,  285,  292,  ...,    0,    0,    0],
        [ 132,  205,  351,  ...,    0,    0,    0],
        ...,
        [1235, 1171,  292,  ...,    0,    0,    0],
        [1290, 1117, 1318,  ...,    0,    0,    0],
        [1399, 1409, 1390,  ...,    0,    0,    0]]) 131
tensor([ 70, 100,  40,  68, 131,  14, 111,  62,  37,  35,  33,  77, 116,  75,
         40])
tensor([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0])


Let's then numericalize the training, development and test data using `min_count` 5, `max_count` 100 and `batch_size` 10:

In [45]:
torch_train = numericalize(train, train_y, vocabulary, word2id, 5, 100, 10)
torch_dev = numericalize(dev, dev_y, vocabulary, word2id, 5, 100, 10)
torch_test = numericalize(test, test_y, vocabulary, word2id, 5, 100, 10)

## Assignment 3

We'll now build a CBOW model for sentiment classification. 


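**Note:** In word2vec, CBOW uses the context (surrounding words) to predict a middle word. In this lab, we instead average the word embeddings of a review and use the result to predict its sentiment.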

### Assignment 3.1
rubric={accuracy:5}

We'll now write a baseline torch model `CBOW` for classification of CBOW inputs. This model does not yet implement dropout or pretrained embeddings.

#### The `__init__` function

Your `__init__` function should take the following parameters:

1. `num_words`, the number of unique word type features + 1 for the symbol `PAD` (i.e. `len(word2id)`) 
1. `num_classes`, the number of output classes. In our case, this will always be 2 because we have exactly two classes: positive and negative.
1. `dropout_prob`, the dropout probability

Your model should contain the following layers in order:

1. `self.embedding`, an embedding of dimension `EMB_SIZE` which can embed all word types recognized by `word2id`: (`nn.Embedding`)
1. `self.linear1`, a linear layer which maps `EMB_SIZE`-dimensional inputs to `HIDDEN_SIZE`-dimensional outputs: (`nn.Linear`)
1. `self.dropout`, dropout with the probability defined by `dropout_prob`: (`nn.Dropout`)
1. `self.relu` which applies relu to the output of `self.linear1`: (`nn.ReLU`)
1. `self.linear2` which maps `HIDDEN_SIZE`-dimensional inputs to `num_classes`-dimensional outputs
1. `self.log_softmax` which applies log-softmax to the output of `self.linear2`: (`nn.LogSoftmax`)

**Note**, when you initialize `self.embedding`, make sure to define `word2id[PAD]` as the padding symbol as explained [in the documentation](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). The effect is that `PAD` will always be embedded as the zero vector.
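
A small sketch of the effect of `padding_idx`, using a tiny embedding table. The sizes and the name `toy_embedding` are arbitrary choices for illustration; ID 0 plays the role of `word2id[PAD]`.

```python
import torch
import torch.nn as nn

# 5 word types (including padding at ID 0), 3-dimensional embeddings
toy_embedding = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=0)

# One example of length 2, padded to length 4 with ID 0
toy_ids = torch.tensor([[2, 4, 0, 0]])
toy_embedded = toy_embedding(toy_ids)

print(toy_embedded.shape)   # batch_size x k x embedding_dim
print(toy_embedded[0, 2:])  # the padded positions are embedded as zero vectors
```

Because padded positions contribute zero vectors, summing over the sequence dimension gives the same result as summing only the real tokens.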

#### The `forward` function

Your forward function takes two arguments: 

1. A minibatch of examples `x` having shape `batch_size x k`.
1. A tensor `lengths` which indicates the length of each example in `x`.

Your forward function should:

1. Apply `self.embedding` to `x`. This results in a `batch_size x k x EMB_SIZE` tensor.
1. You should then compute the sum of the embeddings for each example in the batch using [`torch.sum`](https://pytorch.org/docs/stable/generated/torch.sum.html?highlight=sum#torch.sum), resulting in a `batch_size x EMB_SIZE` tensor.
1. Normalize each embedded example by dividing by the lengths in `lengths`. You can first use [`unsqueeze`](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html?highlight=unsqueeze#torch.unsqueeze) and [`expand`](https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html?highlight=expand) to turn `lengths` into a `batch_size x EMB_SIZE` tensor and then use [`torch.div`](https://pytorch.org/docs/stable/generated/torch.div.html).
1. Pass the averaged embeddings through `self.linear1`, `self.dropout`, `self.relu`, `self.linear2` and finally `self.log_softmax`. This results in a `batch_size x num_classes` tensor.
1. Return the result.
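
The sum-and-normalize step can be sketched in isolation as follows. To keep the numbers easy to check, the snippet starts from a hand-written tensor that stands in for the output of `self.embedding` (with zero vectors at padded positions); all values and `toy_` names are made up.

```python
import torch

# Stand-in for embedded input: batch_size=2, k=3, EMB_SIZE=2
# The second example has length 1, so its last two positions are zero (padding)
toy_embedded = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 4.0], [0.0, 0.0], [0.0, 0.0]],
])
toy_lengths = torch.tensor([3, 1])

# Sum over the sequence dimension: batch_size x EMB_SIZE
toy_summed = toy_embedded.sum(dim=1)

# batch_size -> batch_size x 1 -> batch_size x EMB_SIZE
toy_expanded = toy_lengths.unsqueeze(1).expand(-1, toy_summed.size(1))

# Element-wise division gives the average embedding per example
toy_averaged = torch.div(toy_summed, toy_expanded)

print(toy_expanded.shape)
print(toy_averaged)
```

Note that an example whose length is 0 (for instance, when every token was filtered out) would lead to a division by zero here, so it is worth checking whether such examples occur in your data.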

In [63]:
HIDDEN_SIZE = 100
EMB_SIZE = 100


class CBOW(nn.Module):
    def __init__(self, num_words, num_classes, dropout_prob):
        super().__init__()
#your code here
        # number of input words -> embedding size
        # for nn.Embedding;

        # embedding size -> hidden size
        # for nn.Linear
        
        # dropout
        # nn.Dropout

        # relu
        # nn.ReLU

        # hidden size -> number of output class
        # for nn.Linear 

        # logsoftmax
        # for LogSoftmax
        
    def forward(self, x, lengths):
#your code here
# x -> embedding -> linear -> dropout -> relu -> linear -> return log_softmax; 

# for your x:
    # 1. You should then compute the sum of the embeddings for each example in the batch using 
    #       [`torch.tensor.sum`] resulting in a `batch_size x EMB_SIZE` tensor.
    # 1. Normalize each embedded example by dividing with the lengths in `lengths`. 
    # You can first use [`unsqueeze`] and [`expand`] `lengths` to a `batch_size x EMB_SIZE` tensor and 
    # then use [`torch.div`]

        # TO DIV your x;

        # print("x after embedding: ", x.size(), ": batch_size x k x EMB_SIZE")
        # print("x after computing the sum of the embeddings for each example: ", x.size(), ": batch_size x EMB_SIZE")

        # print("length before unsqueeze:", lengths.size())
        # print("length after unsqueeze:", lengths.size())
        # print("length after expand:", lengths.size()) 
        # print("then, you can use `div`, and `linear`, ....")


### RESULT:
# x after embedding:  torch.Size([10, 131, 100]) : batch_size x k x EMB_SIZE
# x after computing the sum of the embeddings for each example:  torch.Size([10, 100]) : batch_size x EMB_SIZE

# length before unsqueeze: torch.Size([10])
# length after unsqueeze: torch.Size([10, 1])
# then, get batch_size to expand; 
# length after expand: torch.Size([10, 100])

# then, you can use `div`, and `linear`, ....

Assertions to check your code:

In [69]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.

model = CBOW(len(word2id), 2, 0)    # initialize with `num_words, num_classes, dropout_prob`
model.train(False)                  # disable dropout
x = torch_train[0]                  # your x is tuple (mbatch, lengths, mbatch_y);
res = model(x[0], x[1])             # forward using mbatch and lengths, which gives log_softmax results: values in the range [-inf, 0)
assert res.size()[0] == x[0].size()[0]
assert abs(res[0].exp().sum() - 1) < 0.001
print(res[0])                       # log_softmax results;
print(res[0].exp())                 # returns a new tensor with the exponential;
print(res[0].exp().sum())           # sum = 1

tensor([-0.7598, -0.6307], grad_fn=<SelectBackward0>)
tensor([0.4678, 0.5322], grad_fn=<ExpBackward0>)
tensor(1., grad_fn=<SumBackward0>)


## Assignment 4

### Assignment 4.1
rubric={accuracy:2}

Write a function `eval_model` which takes two arguments:

1. `data`, a torch data set containing examples `(input_minibatch, lengths, output_minibatch)` 
1. `model` a CBOW model

The function applies `model` to each input minibatch in `data` and returns the macro F-score computed by the sklearn function `f1_score`. 

Before running inference, make sure to call `model.train(False)` to disable dropout.

**Remember** to use `with torch.no_grad()` in order to avoid tracking gradients during evaluation!
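
A minimal sketch of inference without gradient tracking is shown below. It uses a tiny stand-in model (`toy_model`, a linear layer followed by log-softmax) and random inputs rather than the `CBOW` class, so it only illustrates the evaluation pattern: disable training mode, wrap the forward pass in `torch.no_grad()`, and take the `argmax` over the class dimension.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model: 4 input features, 2 output classes
toy_model = nn.Sequential(nn.Linear(4, 2), nn.LogSoftmax(dim=1))
toy_model.train(False)  # evaluation mode (this would disable dropout)

toy_x = torch.randn(3, 4)  # a made-up minibatch of 3 examples

sys_labels = []
with torch.no_grad():
    toy_out = toy_model(toy_x)          # batch_size x num_classes log-probabilities
    toy_pred = toy_out.argmax(dim=1)    # predicted class per example
    sys_labels.extend(toy_pred.tolist())

print(toy_out.requires_grad, sys_labels)
```

Collecting the predictions (and, analogously, the gold labels) in plain Python lists makes it straightforward to pass them to `f1_score` once all minibatches have been processed.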

In [87]:

# iterate data: input_minibatch, lengths, output_minibatch
# forward(input_minibatch, lengths) 
# save result using argmax      (sys)
# save output_minibatch         (gold)
# then, f1_score(gold, sys)


You can now evaluate an untrained model on the development set. The performance is unlikely to be particularly good.

In [88]:
model = CBOW(len(word2id), 2, 0)
eval_model(torch_dev, model)

0.37347980766160427

### Assignment 4.2
rubric={accuracy:3}

You should now write a training function `train_model`. The function takes the following parameters:

1. `model`, a CBOW model
1. `train_data`, a dataset of torch training examples
1. `dev_data`, a dataset of torch development examples
1. `max_epochs`, the maximum number of epochs for training

The `model` passed to `train_model` should be a `CBOW` model initialized with `len(word2id)` word types, 2 output classes and dropout probability `dropout_prob`. Inside the function, you should first:

1. Initialize an `Adam` optimizer for `model` (you can use the defaults for `lr` and `betas`)
1. Initialize an [`NLLLoss`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss function.

Run training for `max_epochs`. Each epoch iterates over the training examples `(x, lengths, y)` in `train_data` and:

1. Calls `model.train(True)` to enable dropout
1. Calls `zero_grad` to erase old gradients
    - this is required on every iteration over the data!
1. Applies the model to `x`
1. Computes the loss w.r.t. `y`.
1. Runs one step of backprop.
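
The update steps listed above can be sketched on a toy problem as follows. This is not the `train_model` function: the model is a small stand-in (a linear layer plus log-softmax), the four data points are made up, the whole dataset is used as a single batch, and the learning rate of 0.05 is an arbitrary choice for the toy example rather than the Adam default.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up, linearly separable toy data: 4 examples, 2 features, 2 classes
toy_x = torch.tensor([[0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
toy_y = torch.tensor([0, 1, 0, 1])

toy_model = nn.Sequential(nn.Linear(2, 2), nn.LogSoftmax(dim=1))
optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)  # lr chosen for the toy example
loss_function = nn.NLLLoss()

losses = []
toy_model.train(True)
for step in range(100):
    optimizer.zero_grad()                   # erase old gradients
    log_probs = toy_model(toy_x)            # apply the model
    loss = loss_function(log_probs, toy_y)  # loss w.r.t. the gold labels
    loss.backward()                         # compute gradients
    optimizer.step()                        # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

`NLLLoss` expects log-probabilities as input, which is why the model ends in a log-softmax layer.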

You should keep track of the average loss per training example over the epoch. As a general rule, the average loss should decrease through training. 

Once every epoch, you need to evaluate your model on the development data. Print the average loss and the `f1_score` on the development set.

Keep track of the best development F-score and store the model which attains it. You can use `deepcopy` to save the model so that its parameters won't be affected by subsequent updates.
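
The role of `deepcopy` when keeping the best model can be illustrated without any training at all. In the sketch below, a plain dictionary (`toy_model_state`) stands in for a model whose parameters are updated in place, and the development scores are invented numbers used only to drive the example.

```python
from copy import deepcopy

# Stand-in for a model: parameters that are modified in place every epoch
toy_model_state = {"weights": [0.0, 0.0]}

# Invented development scores, one per epoch (for illustration only)
toy_dev_scores = [0.40, 0.65, 0.60]

best_score = -1.0
best_model = None
for epoch, dev_score in enumerate(toy_dev_scores):
    toy_model_state["weights"][0] += 1.0  # pretend this is a training update

    if dev_score > best_score:
        best_score = dev_score
        # A deep copy is a snapshot that later in-place updates cannot change
        best_model = deepcopy(toy_model_state)

print(best_score, best_model, toy_model_state)
```

Without `deepcopy`, `best_model` would be just another reference to the same object and would silently follow every later update.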

Finally, return the best model you found.

In [2]:
# Your code here
def train_model(model, train_data, dev_data, max_epochs):

    # torch.optim.Adam(...)
    # nn.NLLLoss()
    ...

    for epoch ...
        ...
        model.train(True)
        for ... using `train_data`
            # 1. Call `zero_grad` to erase old gradients
            #     - this is required on every iteration over the data!
            # 1. Apply the model to `x`
            # 1. Compute the loss w.r.t. `y`.
            # 1. Run one step of backprop.

            # keep track of the average loss per training      
            ...
        ...

        print(f"Epoch {epoch + 1}: Average loss = {total_loss/len(train_data)}, Dev F1 = {f1}")
    return best_model

Now, train a model for at most 50 epochs (`max_epochs=50`) with dropout probability 30%. You will probably get within 5 percentage points of CountVectorizer, but without pretrained embeddings it is hard to do better.

In [91]:
model = CBOW(len(word2id), 2, 0.3)
model = train_model(model, torch_train, torch_dev, 50)

Epoch 1: Average loss = 0.8060874354314297, Dev F1 = 0.3714644877435575
Epoch 2: Average loss = 0.7578593166999064, Dev F1 = 0.3714644877435575
Epoch 3: Average loss = 0.7017631801750285, Dev F1 = 0.3714644877435575
Epoch 4: Average loss = 0.7046927934780495, Dev F1 = 0.3714644877435575
Epoch 5: Average loss = 0.679618664286193, Dev F1 = 0.3714644877435575
Epoch 6: Average loss = 0.6864398440660732, Dev F1 = 0.3714644877435575
Epoch 7: Average loss = 0.6642061903010327, Dev F1 = 0.3714644877435575
Epoch 8: Average loss = 0.6666451351601858, Dev F1 = 0.3714644877435575
Epoch 9: Average loss = 0.6604696580776849, Dev F1 = 0.3714644877435575
Epoch 10: Average loss = 0.6538994665610914, Dev F1 = 0.3963202566089417
Epoch 11: Average loss = 0.6331285712388628, Dev F1 = 0.4202847373036473
Epoch 12: Average loss = 0.6037663973416716, Dev F1 = 0.45232760766741353
Epoch 13: Average loss = 0.5653785109101361, Dev F1 = 0.6370346432870142
Epoch 14: Average loss = 0.4918996149264689, Dev F1 = 0.6915

Print the F-score of your model on the test data. Compare this against our CountVectorizer and TfidfVectorizer models.

CBOW will probably land somewhere between CountVectorizer and TfidfVectorizer. 

In [92]:
# your code here



## macro results:
print(f"F-score CBOW: {cbow_f1}")
print(f"F-score CountVectorizer: {count_f1}")
print(f"F-score TfidfVectorizer: {tfidf_f1}")

F-score CBOW: 0.7413886839126629
F-score CountVectorizer: 0.7772952581265258
F-score TfidfVectorizer: 0.5661413372993042
