<a href="https://colab.research.google.com/github/nerealegui/NLP-MBD-EN-PT/blob/main/tagging_parsing_practice/bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab Configuration

**Execute these steps to configure the Google Colab environment so that you can run this notebook. They are not required if you are running it locally and have configured your local environment as explained in the GitHub repository.**

The first step is to clone the repository to have access to all the data and files.

In [2]:
repository_name = "NLP-MBD-EN-PT"
repository_url = 'https://github.com/nerealegui/' + repository_name

In [3]:
! git clone $repository_url

Cloning into 'NLP-MBD-EN-PT'...
remote: Enumerating objects: 4548, done.
remote: Counting objects: 100% (208/208), done.
remote: Compressing objects: 100% (139/139), done.
remote: Total 4548 (delta 130), reused 113 (delta 68), pack-reused 4340 (from 1)
Receiving objects: 100% (4548/4548), 16.68 MiB | 14.23 MiB/s, done.
Resolving deltas: 100% (231/231), done.


Install the requirements

Now you have everything you need to execute the code in Colab

# Bag-of-words

In [4]:
import nltk
nltk.download('shakespeare')
nltk.download('stopwords')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

[nltk_data] Downloading package shakespeare to /root/nltk_data...
[nltk_data]   Unzipping corpora/shakespeare.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


The `nltk` library includes several corpora for experimentation. In this notebook we are going to use the corpus containing a set of Shakespeare's plays.

In the following cell, I will load the corpus and create a dataframe with the name of the book and the textual content.

In [7]:
shakespeare_df = pd.DataFrame(columns=["book", "words"])
for ii, book in enumerate(nltk.corpus.shakespeare.fileids()):
    shakespeare_df.loc[ii] = (book, " ".join(nltk.corpus.shakespeare.words(book)))
print(shakespeare_df)

           book                                              words
0   a_and_c.xml  The Tragedy of Antony and Cleopatra Dramatis P...
1     dream.xml  A Midsummer Night ' s Dream Dramatis Personae ...
2    hamlet.xml  The Tragedy of Hamlet , Prince of Denmark Dram...
3  j_caesar.xml  The Tragedy of Julius Caesar Dramatis Personae...
4   macbeth.xml  The Tragedy of Macbeth Dramatis Personae DUNCA...
5  merchant.xml  The Merchant of Venice Dramatis Personae The D...
6   othello.xml  The Tragedy of Othello , the Moor of Venice Dr...
7   r_and_j.xml  The Tragedy of Romeo and Juliet Text placed in...


While this representation can be useful for humans, it is of no use if you want to use these data for an NLP system.

As we discussed in class, we need to create the document-term matrix, which will be the input for any NLP system we build on top of it. The document-term matrix has one row per document (here, each of Shakespeare's plays) and one column per distinct word in the dataset. Each cell holds the weight of the word in the document (for example, how many times the word appears in it).
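To make the idea concrete before using the library, here is a minimal sketch that builds a document-term matrix by hand with the standard library only. The two toy documents are made up for the example, and splitting on whitespace is a simplifying assumption; it is not how `CountVectorizer` tokenizes.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the Shakespeare data).
docs = {
    "doc_a": "to be or not to be",
    "doc_b": "to sleep perchance to dream",
}

# One Counter of raw term counts per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary (the matrix columns) is the sorted set of all distinct words.
vocabulary = sorted(set().union(*counts.values()))

# Each row holds the weight (here, the raw count) of every vocabulary word.
dt_matrix = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dt_matrix.items():
    print(name, row)
```

Words that a document does not contain get a 0, which is why real document-term matrices are mostly zeros.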

In class we presented several weighting approaches. Let's see how to create them.

Let's start with the simplest one: binary weighting. Binary weighting only records whether a word appears (1) or does not appear (0) in a document.

Here is what happens in the snippet below, step by step:

1.  Instantiate the vectorizer  
     `binary_weighting = CountVectorizer(binary=True)`  
      – Creates a vectorizer that, once fitted, will mark each word as present (1) or absent (0) in a document. No vocabulary is built at this point.

2.  Learn the vocabulary and transform the text into a sparse matrix  
     `binary_shakespeare = binary_weighting.fit_transform(shakespeare_df.words)`  
      – `fit` scans all documents to build the vocab (one column per unique token).  
      – `transform` turns each document into a row of 0/1s, stored in a SciPy sparse matrix of shape `(n_docs, n_terms)`.

3.  Convert to a pandas DataFrame  
     `binary_shakespeare.toarray()`  
      – Converts the sparse matrix to a dense NumPy array.  
     `columns=binary_weighting.get_feature_names_out()`  
      – Labels each column with the corresponding token.  
     `pd.DataFrame(...)`  
      – Wraps the array in a DataFrame for easy inspection.

4.  Inspect the result  
     `print(binary_dt_matrix)`  
      – Shows a table where each row is a play, each column is a word, and each cell is 0 or 1 indicating absence/presence.

This binary‐weighted document–term matrix is useful when you only care about whether a term occurs at all, not how often.

In [9]:
# Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
binary_weighting = CountVectorizer(binary=True)
binary_shakespeare = binary_weighting.fit_transform(shakespeare_df.words)
binary_dt_matrix = pd.DataFrame(binary_shakespeare.toarray(), columns=binary_weighting.get_feature_names_out())
print(binary_dt_matrix)

   1992  1996  1998  1999  abandon  abate  abatements  abbey  abhor  abhorred  \
0     0     0     0     0        0      0           0      0      0         0   
1     0     0     0     0        0      1           0      0      0         0   
2     0     0     0     0        0      1           1      0      0         1   
3     0     0     0     0        0      0           0      0      0         0   
4     0     0     0     0        0      0           0      0      0         1   
5     0     0     0     0        0      1           0      0      0         0   
6     0     0     0     0        1      0           0      0      1         0   
7     1     1     1     1        0      1           0      1      0         1   

   ...  your  yours  yourself  yourselves  youth  youthful  youths  zeal  \
0  ...     1      1         1           1      1         0       0     0   
1  ...     1      1         1           1      1         0       0     0   
2  ...     1      1         1           1 

Let's inspect the most and least important terms for document 6 (Othello).

In [10]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][-25:])



25 most important terms for document othello.xml
about           1
above           1
yet             1
aboard          1
zounds          1
abode           1
abuser          1
abused          1
abuse           1
access          1
absolute        1
absent          1
absence         1
abroad          1
yond            1
abuses          1
acquainted      1
acquaintance    1
acknown         1
achieved        1
aches           1
ache            1
write           1
writ            1
wrongs          1
Name: 6, dtype: int64
25 least important terms for document othello.xml
1998           0
yesty          0
yesternight    0
accepted       0
accept         0
accents        0
abysm          0
according      0
accord         0
yarely         0
yare           0
yard           0
xv             0
xml            0
xiv            0
yearns         0
yawning        0
youthful       0
yourselves     0
younker        0
youngest       0
younger        0
yorick         0
yaw            0
yaughan        0
Name

As you can see, the representation is not very useful as it is. Knowing only whether a word appears in a document does not give us much information: every word that appears has the same weight of 1, so the order within the "most important" list is arbitrary. **Can you think of a situation where binary weighting is sufficient?**

The next thing to know will be whether the word appears only once or several times.

In [11]:
tf_weighting = CountVectorizer() # now the vector is not binary
tf_shakespeare = tf_weighting.fit_transform(shakespeare_df.words)
tf_dt_matrix = pd.DataFrame(tf_shakespeare.toarray(), columns=tf_weighting.get_feature_names_out())
print(tf_dt_matrix)

   1992  1996  1998  1999  abandon  abate  abatements  abbey  abhor  abhorred  \
0     0     0     0     0        0      0           0      0      0         0   
1     0     0     0     0        0      1           0      0      0         0   
2     0     0     0     0        0      1           1      0      0         1   
3     0     0     0     0        0      0           0      0      0         0   
4     0     0     0     0        0      0           0      0      0         1   
5     0     0     0     0        0      1           0      0      0         0   
6     0     0     0     0        1      0           0      0      3         0   
7     1     1     1     1        0      1           0      1      0         1   

   ...  your  yours  yourself  yourselves  youth  youthful  youths  zeal  \
0  ...   140     11        15           1      5         0       0     0   
1  ...   123      4         3           3      7         0       0     0   
2  ...   242      6        15           1 

Ok, now we have the words weighted according to how many times they appear in the document.

Let's now check the most and least important words in Othello.

In [12]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document othello.xml
and          794
the          761
to           629
you          486
of           475
my           416
that         395
iago         360
in           341
othello      336
not          318
it           317
is           309
me           281
cassio       254
he           246
for          240
desdemona    230
be           226
but          221
with         221
this         220
do           219
her          215
have         207
Name: 6, dtype: int64
25 least important terms for document othello.xml
1998           0
yesty          0
yesternight    0
accepted       0
accept         0
accents        0
abysm          0
according      0
accord         0
yarely         0
yare           0
yard           0
xv             0
xml            0
xiv            0
yearns         0
yawning        0
youthful       0
yourselves     0
younker        0
youngest       0
younger        0
yorick         0
yaw            0
yaughan        0
Name: 6, dtype: int64


**What problem do you see with the most important words? Are they really representative?**



Let's now see how to create the TF-IDF weighting and whether it improves this representation.

**TF-IDF (Term Frequency-Inverse Document Frequency)** features are numerical representations of text that reflect the importance of a word within a document relative to a collection of documents.

- **Term Frequency (TF):**
Measures how often a word appears in a specific document. A higher TF indicates the word is more important within that document.

- **Inverse Document Frequency (IDF):**
Measures how rare or common a word is across the entire corpus of documents. Rare words have a higher IDF, indicating they are more informative and important.
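As a rough illustration of how the two factors combine, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The three sentences and the `tf_idf` helper are made up for the example, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences).
docs = [
    "iago speaks to othello",
    "hamlet speaks to horatio",
    "macbeth speaks to banquo",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    # Raw count times smoothed IDF (assumed to match scikit-learn's default).
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tf_idf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Here `speaks` and `to` occur in every document, so they receive the minimum IDF of 1, while `iago` and `othello` occur in only one document and end up with higher weights.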

In [13]:
tf_idf_weighting = TfidfVectorizer()
tf_idf_shakespeare = tf_idf_weighting.fit_transform(shakespeare_df.words)
tf_idf_dt_matrix = pd.DataFrame(tf_idf_shakespeare.toarray(), columns=tf_idf_weighting.get_feature_names_out())
print(tf_idf_dt_matrix)

       1992      1996      1998      1999   abandon     abate  abatements  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.001132    0.000000   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.000551    0.000869   
3  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.000000   
4  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.000000   
5  0.000000  0.000000  0.000000  0.000000  0.000000  0.000843    0.000000   
6  0.000000  0.000000  0.000000  0.000000  0.000973  0.000000    0.000000   
7  0.001163  0.001163  0.001163  0.001163  0.000000  0.000738    0.000000   

      abbey     abhor  abhorred  ...      your     yours  yourself  \
0  0.000000  0.000000  0.000000  ...  0.062407  0.004903  0.006686   
1  0.000000  0.000000  0.000000  ...  0.087673  0.002851  0.002138   
2  0.000000  0.000000  0.000628  ...  0.083986  0.002082  0.005206   
3  0.000000  0.000000  0.0

In [14]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document othello.xml
iago         0.350125
othello      0.326783
and          0.308385
the          0.295568
cassio       0.247033
to           0.244300
desdemona    0.223691
you          0.188760
of           0.184487
my           0.161572
that         0.153416
emilia       0.134215
in           0.132442
not          0.123509
it           0.123121
is           0.120014
me           0.109139
roderigo     0.101147
he           0.095545
for          0.093215
be           0.087777
but          0.085835
with         0.085835
this         0.085447
do           0.085058
Name: 6, dtype: float64
25 least important terms for document othello.xml
juggling      0.0
juice         0.0
juiced        0.0
jule          0.0
juliet        0.0
joyful        0.0
joyfully      0.0
joys          0.0
kerns         0.0
judges        0.0
judgments     0.0
jour          0.0
journeymen    0.0
journeys      0.0
jovial        0.0
jowls         0.0
jointress     0.0
joints        0.0
joi

**What do you see now in the representation? Have we solved all the problems?**

# StopWords

In the previous section we ran into some problems related to stopwords, such as `and` or `of`. These words carry little meaning on their own and are unlikely to help any subsequent NLP task, so it is usually safe to remove them.

Let's see how to do it via NLTK.

Since stopwords are language-dependent, NLTK provides a list for several languages.

In [15]:
from nltk.corpus import stopwords
print("Languages for which NLTK provides a stopword list:", ", ".join(stopwords.fileids()))

Languages for which NLTK provides a stopword list: albanian, arabic, azerbaijani, basque, belarusian, bengali, catalan, chinese, danish, dutch, english, finnish, french, german, greek, hebrew, hinglish, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, tamil, turkish


We are only interested in the English stopword list.

In [16]:
print("Example of 25 English stopwords:", ", ".join(stopwords.words("english")[:25]))

Example of 25 English stopwords: a, about, above, after, again, against, ain, all, am, an, and, any, are, aren, aren't, as, at, be, because, been, before, being, below, between, both


We can remove stopwords when building the document-term matrix. Note that the cell below passes `stop_words='english'`, which selects scikit-learn's built-in English list rather than the NLTK list printed above. Let's check.

In [17]:
sw_free_tf_idf_weighting = TfidfVectorizer(stop_words='english')
sw_free_tf_idf_shakespeare = sw_free_tf_idf_weighting.fit_transform(shakespeare_df.words)
sw_free_tf_idf_dt_matrix = pd.DataFrame(sw_free_tf_idf_shakespeare.toarray(), columns=sw_free_tf_idf_weighting.get_feature_names_out())
print(sw_free_tf_idf_dt_matrix)

       1992      1996      1998      1999   abandon     abate  abatements  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.002333    0.000000   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.001020    0.001609   
3  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.000000   
4  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.000000   
5  0.000000  0.000000  0.000000  0.000000  0.000000  0.001867    0.000000   
6  0.000000  0.000000  0.000000  0.000000  0.001506  0.000000    0.000000   
7  0.001902  0.001902  0.001902  0.001902  0.000000  0.001206    0.000000   

      abbey     abhor  abhorred  ...     young   younger  youngest   younker  \
0  0.000000  0.000000  0.000000  ...  0.002220  0.001340  0.000000  0.000000   
1  0.000000  0.000000  0.000000  ...  0.010286  0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.001163  ...  0.010921  0.001163  0.000000

In [18]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document othello.xml
iago            0.542255
othello         0.506104
cassio          0.382591
desdemona       0.346441
emilia          0.207864
roderigo        0.156651
thou            0.086619
brabantio       0.070794
lodovico        0.066276
moor            0.064270
venice          0.059331
shall           0.058348
good            0.055340
montano         0.054225
tis             0.051130
come            0.050528
let             0.049927
lord            0.048723
thy             0.047520
love            0.046919
ll              0.045716
handkerchief    0.045188
thee            0.045114
know            0.043310
bianca          0.042175
Name: 6, dtype: float64
25 least important terms for document othello.xml
outcries       0.0
outbreak       0.0
outbrave       0.0
ousel          0.0
ourself        0.0
ounce          0.0
outlawry       0.0
overplus       0.0
overpeering    0.0
overpeer       0.0
overheard      0.0
overhear       0.0
overflown      0.0
overd

It's much better now, isn't it?

Try playing with the previous code: change the document to see how the different weightings affect its representation, or use a different corpus from the ones included in NLTK.
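To make that exploration easier, a small helper can replace the repeated `argsort` expression. This is only a sketch: `top_terms` is a hypothetical name, and the toy matrix stands in for any of the document-term DataFrames built above.

```python
import pandas as pd

def top_terms(dt_matrix, document, k=25):
    """Return the k highest-weighted terms in one row of a document-term matrix."""
    return dt_matrix.loc[document].nlargest(k)

# Toy document-term matrix (made-up weights) to show the call.
toy = pd.DataFrame(
    [[0.0, 0.5, 0.1], [0.3, 0.0, 0.9]],
    columns=["abate", "iago", "moor"],
)
print(top_terms(toy, 0, k=2))
```

With the real data the call would look like `top_terms(sw_free_tf_idf_dt_matrix, 6)`. Terms with equal weights may come out in a different order than with the `argsort` version, since the two approaches break ties differently.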