# Document vectors
As usual, we begin by importing the libraries and modules we're going to use today. We're introducing a new library called ```datasets```, which is part of the ```huggingface``` universe.

```datasets``` provides easy access to a wide range of example datasets that are widely known in the NLP world. It's worth spending some time looking around to see what you can find. For example, here is a collection of [multi-class classification datasets](https://huggingface.co/datasets?task_ids=task_ids:multi-class-classification&sort=downloads).

We'll be working with the ```huggingface``` ecosystem more and more as we progress this semester.

In [2]:
# data processing
import pandas as pd
import numpy as np

# huggingface datasets
from datasets import load_dataset

# scikit learn tools
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# plotting tools
import matplotlib.pyplot as plt



## Load data
We're going to be working with actual text data, specifically a subset of the well-known [GLUE Benchmarks](https://gluebenchmark.com/). These benchmarks are regularly used to test how well models perform across a range of different language tasks. Today we'll work specifically with the Stanford Sentiment Treebank 2 (SST2). You can learn more [here](https://huggingface.co/datasets/glue) and [here](https://nlp.stanford.edu/sentiment/index.html).

The dataset we get back is a complex, hierarchical object with lots of different features. I recommend that you dig around a little and see what it contains. For today, we're going to work with only the training split, and we're going to separate it into sentences and labels.

In [3]:
# load the sst2 dataset
dataset = load_dataset("glue", "sst2")
# select the train split
train_data = dataset["train"]
X = train_data["sentence"]
y = train_data["label"]

In [16]:
dataset["train"].features
dataset["train"][:20]

pd.DataFrame(dataset["train"][:20])

Unnamed: 0,sentence,label,idx
0,hide new secretions from the parental units,0,0
1,"contains no wit , only labored gags",0,1
2,that loves its characters and communicates som...,1,2
3,remains utterly satisfied to remain the same t...,0,3
4,on the worst revenge-of-the-nerds clichés the ...,0,4
5,that 's far too tragic to merit such superfici...,0,5
6,demonstrates that the director of such hollywo...,1,6
7,of saucy,1,7
8,a depressed fifteen-year-old 's suicidal poetry,0,8
9,are more deeply thought through than in most `...,1,9


Let's split the data into a training set and a test set. Later we will train a simple classifier to start looking at what one can do with vector representations of text, which is why we need to set some documents aside. For now, let's simply focus on the training set to estimate our <span style="color:red">document-term model.</span>

In [21]:
import random
seed = 42
random.seed(seed) # fix the seed so that the split is reproducible
train_idx = random.sample(range(len(X)), k=int(len(X)*.7)) # we are sampling 70% as training set
train_X, test_X, train_y, test_y = [], [], [], []
for i in train_idx:
    train_X.append(X[i])
    train_y.append(y[i])
for i in set(range(len(X))) - set(train_idx):
    test_X.append(X[i])
    test_y.append(y[i])

In [22]:
list(zip(train_X[:10], train_y[:10])) # zip is a function that combines two lists

[('lack depth or complexity , ', 0),
 ('true inspiration ', 1),
 ("the author 's devotees will probably find it fascinating ; ", 1),
 ("detailing how one international city welcomed tens of thousands of german jewish refugees while the world 's democracie ",
  1),
 ('is greatness here ', 1),
 ('laced with liberal doses of dark humor , gorgeous exterior photography , and a stable-full of solid performances , no such thing is a fascinating little tale . ',
  1),
 ("'s better than one might expect when you look at the list of movies starring ice-t in a major role . ",
  1),
 ("than an hour-and-a-half-long commercial for britney 's latest album ", 0),
 ("i have a confession to make : i did n't particularly like e.t. the first time i saw it as a young boy ",
  0),
 ('released in imax format ', 1)]

In [23]:
print('Number of training examples: ', len(train_X))
print('Number of test examples: ', len(test_X))

Number of training examples:  47144
Number of test examples:  20205
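
The manual split above can also be written with `scikit-learn`'s `train_test_split`. Here is a minimal sketch on a toy dataset; `toy_X` and `toy_y` are made-up stand-ins for the real sentences and labels, and the resulting split will not be identical to the one produced by the loop above.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the sentences (X) and labels (y) loaded above
toy_X = ["good film", "bad film", "great acting", "dull plot", "fine cast",
         "weak script", "lovely score", "poor pacing", "solid work", "flat jokes"]
toy_y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# train_size=0.7 mirrors the 70/30 split; random_state makes it repeatable
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, train_size=0.7, random_state=42
)

print(len(toy_train_X), len(toy_test_X))
```

Passing the same `random_state` again gives the same split, which is handy when you want to compare models on identical data.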



## Create document representations
We're going to work with a bag-of-words model (like the ones we talked about in class), which we can create quite simply using the ```CountVectorizer()``` class available via ```scikit-learn```. You can read more about the default parameters of the vectorizer [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

After we initialize the vectorizer, we first _fit_ this vectorizer to our data (the model learns parameters such as which words to include in the vocabulary, based on the statistics of the text and the parameters passed to  `CountVectorizer`) and then _transform_ the original data into the bag-of-words representation.

Let's start by fitting a vectorizer with the default settings, which place no limit on vocabulary size.

In [39]:
simple_vectorizer = CountVectorizer()
X_vect = simple_vectorizer.fit_transform(train_X) # X_vect is a sparse matrix containing the counts of each word in each sentence

In [64]:
# inspect the vocabulary
simple_vectorizer.vocabulary_

{'lack': 6791,
 'depth': 3167,
 'or': 8346,
 'complexity': 2396,
 'true': 12452,
 'inspiration': 6266,
 'the': 12058,
 'author': 904,
 'devotees': 3257,
 'will': 13321,
 'probably': 9227,
 'find': 4596,
 'it': 6455,
 'fascinating': 4454,
 'detailing': 3222,
 'how': 5836,
 'one': 8295,
 'international': 6344,
 'city': 2117,
 'welcomed': 13225,
 'tens': 12016,
 'of': 8254,
 'thousands': 12125,
 'german': 5093,
 'jewish': 6539,
 'refugees': 9736,
 'while': 13269,
 'world': 13442,
 'democracie': 3110,
 'is': 6444,
 'greatness': 5310,
 'here': 5642,
 'laced': 6789,
 'with': 13373,
 'liberal': 6992,
 'doses': 3549,
 'dark': 2933,
 'humor': 5872,
 'gorgeous': 5237,
 'exterior': 4325,
 'photography': 8817,
 'and': 565,
 'stable': 11317,
 'full': 4944,
 'solid': 11089,
 'performances': 8735,
 'no': 8105,
 'such': 11650,
 'thing': 12093,
 'little': 7090,
 'tale': 11881,
 'better': 1206,
 'than': 12053,
 'might': 7611,
 'expect': 4261,
 'when': 13257,
 'you': 13527,
 'look': 7143,
 'at': 839,
 'l

This is the number of words the vectorizer uses as features (i.e., words that are *not* excluded for being too frequent or too infrequent).

In [70]:
len(simple_vectorizer.vocabulary_)

4856

In [67]:
print(X_vect.shape)
print(X_vect.toarray())

(47144, 12653)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


As you can see, the resulting matrix has dimensions `[n_documents, n_words]`.
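Note that there is a simple way to get a term-term matrix: multiply the transpose of the document-term matrix by the document-term matrix itself. With raw counts, each entry sums the products of two words' counts over all documents; with binary (presence/absence) counts, it is the number of documents in which the two words co-occur.

In [68]:
(X_vect.T @ X_vect).toarray() # the diagonal approximates how often a term occurs overall (exactly so when a term appears at most once per sentence)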

array([[ 4,  4,  0, ...,  0,  0,  0],
       [ 4, 72,  0, ...,  0,  0,  0],
       [ 0,  0, 11, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ...,  2,  0,  0],
       [ 0,  0,  0, ...,  0,  2,  0],
       [ 0,  0,  0, ...,  0,  0,  3]])

What happens to dimensionality if we manipulate input parameters, e.g., `min_df`? Try playing with the `CountVectorizer` parameters to get familiar with the class.

In [80]:
# Trying out different parameters for the CountVectorizer.
# Each assignment overwrites the previous one, so only the last vectorizer is fitted below;
# reorder or comment out lines to try the other settings.
simple_vectorizer = CountVectorizer(min_df=20) # only include words that appear in at least 20 documents
simple_vectorizer = CountVectorizer(min_df=20, max_df=20) # only include words that appear in exactly 20 documents
simple_vectorizer = CountVectorizer(max_features=100) # only include the 100 most frequent words
simple_vectorizer = CountVectorizer() # default settings

X_vect = simple_vectorizer.fit_transform(train_X) # X_vect is a sparse matrix containing the counts of each word in each sentence

print(X_vect.shape)
print(X_vect.toarray())

(47144, 13576)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
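
Here is a small sketch of how these parameters change the vocabulary size, using a made-up corpus small enough to count by hand (`toy_docs` and the `settings` dictionary are illustrative only). Note that integer values of `min_df` and `max_df` refer to numbers of documents, not to total occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the film is good", "the film is bad", "the plot is thin", "the cast"]

settings = {
    "default": {},
    "min_df=2": {"min_df": 2},              # keep words found in at least 2 documents
    "max_df=3": {"max_df": 3},              # drop words found in more than 3 documents
    "max_features=2": {"max_features": 2},  # keep the 2 most frequent words overall
}

vocab_sizes = {}
for name, params in settings.items():
    vec = CountVectorizer(**params).fit(toy_docs)
    vocab_sizes[name] = len(vec.vocabulary_)
    print(name, sorted(vec.vocabulary_))
```

The number of columns in the transformed matrix always equals the size of the fitted vocabulary, so tightening these thresholds shrinks the matrix.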


### Dimensionality reduction
Our current matrix is fairly sparse. Could we apply what we have learned during the lecture to convert it to a dense and more compact matrix? Let's apply the `SVD` algorithm we discussed in class.

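SVD (singular value decomposition) factorizes a matrix into singular vectors and singular values. Keeping only the components with the largest singular values (truncated SVD) gives a lower-dimensional, dense representation of each document.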

In [11]:
svd = TruncatedSVD(n_components=500)
svd.fit(X_vect)
X_svd = svd.transform(X_vect)

In [81]:
TruncatedSVD?

Init signature:
TruncatedSVD(
    n_components=2,
    *,
    algorithm='randomized',
    n_iter=5,
    n_oversamples=10,
    power_iteration_normalizer='auto',
    random_state=None,
    tol=0.0,
)
Docstring:
Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of
truncated singular value decomposition (SVD). Contrary to PCA, this
estima

What does our vector space look like?

In [12]:
X_svd

array([[ 0.14780454,  0.11506785,  0.1554124 , ...,  0.00799068,
        -0.00481978, -0.00969582],
       [ 0.95225277,  0.58317257,  1.86374944, ..., -0.02462209,
        -0.04920296,  0.03677421],
       [ 0.48328449,  0.26313435,  0.97998618, ...,  0.04034044,
        -0.01923677,  0.01243743],
       ...,
       [ 0.4526981 ,  0.3015155 ,  0.80824259, ..., -0.05395082,
         0.03215544, -0.05053857],
       [ 1.16343675, -0.59805182, -0.47743977, ..., -0.04093378,
        -0.0663819 ,  0.01634439],
       [ 1.16586976, -0.17648983,  1.01953916, ...,  0.11087603,
        -0.01884062, -0.03356088]])
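
To get a feel for what the reduction does, here is a minimal sketch on a made-up corpus (`toy_docs` and the choice of 2 components are assumptions for illustration). It shows the change in shape and how much of the variance in the counts the kept components retain.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good film good cast", "bad film bad plot",
            "good plot", "bad cast", "good film", "bad film"]
toy_counts = CountVectorizer().fit_transform(toy_docs)  # sparse, one column per word

# keep 2 components; random_state fixes the randomized solver
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_dense = toy_svd.fit_transform(toy_counts)  # dense, one column per component

print(toy_counts.shape, "->", toy_dense.shape)
print(toy_svd.explained_variance_ratio_.sum())  # share of variance retained
```

On the real data, the same `explained_variance_ratio_` attribute can help you judge whether 500 components are more or fewer than you need.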

### Classifying sentiment

Congratulations! You have created your first document representation. 

We will dive deeper into classification in the coming weeks, but to demonstrate what we can do with these representations, let's go through an example.

As we saw earlier, our documents have labels indicating the sentiment of each document. Can we predict sentiment on the basis of bag-of-words representations of our documents?
Let's use a simple `scikit-learn` classifier to learn to predict sentiment from text. We will learn more about this later on. For now, all you need to know is that the classifier estimates a relation between input and output, so that it can predict the output (in this case, the sentiment of the sentence, which is `0` for negative sentences and `1` for positive ones) from the input.

We will use a `LogisticRegression` classifier (not necessarily the best, but one of the fastest), but you can experiment with other classifiers (e.g., https://scikit-learn.org/stable/modules/svm.html).

In [13]:
classifier = LogisticRegression(max_iter=2000).fit(X_vect, train_y)


Let's transform the test data, which we need for evaluation.

In [14]:
X_vect_test = simple_vectorizer.transform(test_X)

And finally, let's compute how often the model predictions match the true labels.

In [15]:
print('Model accuracy: ', np.mean(classifier.predict(X_vect_test) == test_y))

Model accuracy:  0.8917594654788419
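
Accuracy is a single number. The `classification_report` function we imported at the top breaks performance down per class (precision, recall, F1). Here is a self-contained sketch on made-up sentences (all `toy_*` names are illustrative). On the real data you would pass `test_y` and the predictions for `X_vect_test` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Made-up training and test sentences; 0 = negative, 1 = positive
toy_train_X = ["good film", "great cast", "good plot",
               "bad film", "awful cast", "bad plot"]
toy_train_y = [1, 1, 1, 0, 0, 0]
toy_test_X = ["good cast", "bad cast", "great film", "awful plot"]
toy_test_y = [1, 0, 1, 0]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression().fit(toy_vec.fit_transform(toy_train_X), toy_train_y)
toy_pred = toy_clf.predict(toy_vec.transform(toy_test_X))

# target_names follow the sorted label order: 0 first, then 1
report = classification_report(toy_test_y, toy_pred,
                               target_names=["negative", "positive"])
print(report)
```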


That's pretty good: let's take a look at a couple of examples.

In [16]:
list(zip(test_X, classifier.predict(X_vect_test)))

[('demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . ',
  1),
 ("for those moviegoers who complain that ` they do n't make movies like they used to anymore ",
  1),
 ("the part where nothing 's happening , ", 0),
 ('saw how bad this movie was ', 0),
 ('in world cinema ', 1),
 ('the action is stilted ', 0),
 ('will find little of interest in this film , which is often preachy and poorly acted ',
  0),
 ("it 's about issues most adults have to face in marriage and i think that 's what i liked about it -- the real issues tucked between the silly and crude storyline ",
  1),
 ('covers this territory with wit and originality , suggesting that with his fourth feature ',
  1),
 ('gorgeous and deceptively minimalist ', 1),
 ("proves once again he has n't lost his touch , bringing off a superb performance in an admittedly middling film . ",
  1),
 ("poor ben bratt could n't find stardom if mapques

### Some optional tasks
- Does performance change if we use a `TfidfVectorizer`?
- Can you write your own version of `CountVectorizer()`? In other words, a function that takes a corpus of documents and creates a bag-of-words representation for every document?
- What about `TfidfVectorizer()`? Look over the formulae in the slides from Tuesday.

In [17]:
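# my own version of CountVectorizer()
# (splits on whitespace only; no lowercasing or punctuation handling)
class MyCountVectorizer:
    def __init__(self):
        self.vocab = {}  # maps each word to its column index
        self.vocab_size = 0

    def fit(self, X):
        # assign a new column index to every word not seen before
        for x in X:
            for word in x.split():
                if word not in self.vocab:
                    self.vocab[word] = self.vocab_size
                    self.vocab_size += 1
        return self

    def transform(self, X):
        X_vect = np.zeros((len(X), self.vocab_size))
        for i, x in enumerate(X):
            for word in x.split():
                if word in self.vocab:  # skip words not seen during fit
                    X_vect[i, self.vocab[word]] += 1
        return X_vect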
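
For the `TfidfVectorizer()` task, here is a minimal sketch of the weighting that `scikit-learn` applies with its default settings: raw counts multiplied by a smoothed inverse document frequency, followed by L2 normalisation of each row. This may differ from the formula in the slides, and `toy_docs` is a made-up corpus. The last lines compare the hand-computed matrix with the library's output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["good film good cast", "bad film", "good plot"]

counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # document frequency of each term

# smoothed idf, matching scikit-learn's default (smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1
tfidf = counts * idf  # weight raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalise each row

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches = np.allclose(tfidf, reference)
print(matches)
```

Words that occur in many documents get a smaller idf, so they contribute less to the document vector than rarer, more distinctive words.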
    