<a href="https://colab.research.google.com/github/nerealegui/NLP-MBD-EN-PT/blob/main/tagging_parsing_practice/bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab Configuration

**Run these steps to set up the Google Colab environment before executing this notebook. They are not required if you are running it locally and have configured your local environment as explained in the GitHub repository.**

The first step is to clone the repository so that you have access to all the data and files.

In [19]:
repository_name = "NLP-MBD-EN-PT"
repository_url = 'https://github.com/nerealegui/' + repository_name

In [20]:
! git clone $repository_url

fatal: destination path 'NLP-MBD-EN-PT' already exists and is not an empty directory.


Install the requirements

Now you have everything you need to execute the code in Colab

# Bag-of-words

In [21]:
# Install missing packages
%pip install nltk
%pip install pandas
%pip install numpy


Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [22]:

# The available NLTK corpora are listed at https://www.nltk.org/nltk_data/
import nltk
nltk.download('abc')
nltk.download('stopwords')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

[nltk_data] Downloading package abc to /Users/nerealegui/nltk_data...
[nltk_data]   Package abc is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nerealegui/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The `nltk` library includes several corpora for experimentation. In this notebook we are going to use the ABC corpus (Australian Broadcasting Commission), which contains two files of news text: `rural.txt` and `science.txt`.

In the following cell, I load the corpus and create a DataFrame with the name of each file (the `book` column) and its textual content (the `words` column).

In [23]:
abc_df = pd.DataFrame(columns=["book", "words"])
print(nltk.corpus.abc.fileids())
for ii, book in enumerate(nltk.corpus.abc.fileids()):
    abc_df.loc[ii] = (book, " ".join(nltk.corpus.abc.words(book)))
print(abc_df)

['rural.txt', 'science.txt']
          book                                              words
0    rural.txt  PM denies knowledge of AWB kickbacks The Prime...
1  science.txt  Cystic fibrosis affects 30 , 000 children and ...


While this representation is readable for humans, an NLP system cannot work with raw strings directly: it needs a numeric representation.

As we discussed in class, we need to create the document-term matrix, which will be the input for any NLP system we build on top of it. The document-term matrix has one row per document (here, the two ABC files) and one column per distinct word in the dataset. Each cell holds the weight of the word in the document (for example, how many times the word appears in the document).
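To make the idea concrete before using scikit-learn, here is a minimal sketch that builds a count-based document-term matrix by hand. The two sentences and the names `docs`, `counts`, and `toy_dt_matrix` are made up for illustration, and the tokenization is a plain whitespace split, which is simpler than what a real vectorizer does.

```python
from collections import Counter

import pandas as pd

# Two toy documents (hypothetical, not taken from the ABC corpus).
docs = {
    "doc_a": "the farmer sold the wheat",
    "doc_b": "the scientist studied wheat genes",
}

# Count how often each whitespace-separated token occurs in each document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is the sorted set of all tokens seen in any document.
vocabulary = sorted(set().union(*counts.values()))

# Rows are documents, columns are terms; a Counter returns 0 for absent terms.
toy_dt_matrix = pd.DataFrame(
    [[counts[name][term] for term in vocabulary] for name in docs],
    index=list(docs),
    columns=vocabulary,
)
print(toy_dt_matrix)
```

Each row can now be treated as a numeric vector describing one document, which is the form the weighting schemes in the rest of the notebook operate on.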

In class we presented several weighting approaches. Let's see how we can create them.

Let's start with the simplest one: binary weighting. Binary weighting only records whether a word appears (1) or does not appear (0) in a document.

Here is what the binary-weighting cell (`In [24]`) does, step by step:

1.  Instantiate the vectorizer  
     `binary_weighting = CountVectorizer(binary=True)`  
      – Builds a vocabulary and, when transforming, will mark each word as present (1) or absent (0) in a document.

2.  Learn the vocabulary and transform the text into a sparse matrix  
     `binary_abc = binary_weighting.fit_transform(abc_df.words)`  
      – `fit` scans all documents to build the vocab (one column per unique token).  
      – `transform` turns each document into a row of 0/1s, stored in a SciPy sparse matrix of shape `(n_docs, n_terms)`.

3.  Convert to a pandas DataFrame  
     `binary_abc.toarray()`  
      – Converts the sparse matrix to a dense NumPy array.  
     `columns=binary_weighting.get_feature_names_out()`  
      – Labels each column with the corresponding token.  
     `pd.DataFrame(...)`  
      – Wraps the array in a DataFrame for easy inspection.

4.  Inspect the result  
     `print(binary_dt_matrix)`  
      – Shows a table where each row is a document (here `rural.txt` or `science.txt`), each column is a word, and each cell is 0 or 1 indicating absence/presence.

This binary‐weighted document–term matrix is useful when you only care about whether a term occurs at all, not how often.

In [24]:
# Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
binary_weighting = CountVectorizer(binary=True)
binary_abc = binary_weighting.fit_transform(abc_df.words)
binary_dt_matrix = pd.DataFrame(binary_abc.toarray(), columns=binary_weighting.get_feature_names_out())
print(binary_dt_matrix)

   00  000  000ºc  00am  00pm  010  02  02811  03  03549  ...  zoology  zoom  \
0   1    1      0     1     1    1   1      0   1      0  ...        0     0   
1   0    1      1     0     0    0   1      1   1      1  ...        1     1   

   zooming  zooplankton  zoos  zuberb  zucchini  zukerman  zulu  zwingmann  
0        0            0     0       0         0         0     1          0  
1        1            1     1       1         1         1     0          1  

[2 rows x 27623 columns]


Let's inspect the most and least important terms for document 1 (`science.txt`).

In [28]:
document = 1
print("25 most important terms for document", abc_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", abc_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][-25:])



25 most important terms for document science.txt
zwingmann          1
fun                1
fulfilling         1
full               1
fuller             1
fullerenes         1
fully              1
fumed              1
fumes              1
fumigating         1
fumo               1
function           1
fujiwara           1
functional         1
functionalised     1
functionality      1
functioning        1
functions          1
fund               1
fundaci            1
fundamental        1
fundamentalists    1
fuk                1
fujinaga           1
gainesville        1
Name: 1, dtype: int64
25 least important terms for document science.txt
earmarked       0
earn            0
cellar          0
tabulate        0
productively    0
integral        0
cdma            0
cease           0
ceased          0
tablelands      0
cecil           0
tableland       0
ceduna          0
tablegrapes     0
celcius         0
earthier        0
celebrates      0
celebrating     0
germinated      0
earns       

As you can see, the representation is not very useful as it is. Knowing only whether a word appears in a document gives us little information: every term that is present gets the same weight of 1, so the ranking among them is arbitrary. **Can you think of a situation where this binary weighting is sufficient?**

The next step is to record how many times each word appears, not just whether it appears.

In [29]:
tf_weighting = CountVectorizer() # now the vector is not binary
tf_abc = tf_weighting.fit_transform(abc_df.words)
tf_dt_matrix = pd.DataFrame(tf_abc.toarray(), columns=tf_weighting.get_feature_names_out())
print(tf_dt_matrix)

   00  000  000ºc  00am  00pm  010  02  02811  03  03549  ...  zoology  zoom  \
0  13  451      0     3     2    1   1      0   1      0  ...        0     0   
1   0  248      1     0     0    0   1      1   1      1  ...        4     1   

   zooming  zooplankton  zoos  zuberb  zucchini  zukerman  zulu  zwingmann  
0        0            0     0       0         0         0     1          0  
1        2            7     2       2         1         1     0          2  

[2 rows x 27623 columns]


OK, now the words are weighted according to how many times they appear in the document.

Let's now check the most and least important words in `science.txt`.

In [31]:
document = 1
print("25 most important terms for document", abc_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", abc_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document science.txt
the     23705
of      11878
to       9068
and      8665
in       7549
that     4960
says     4474
is       4016
it       2997
for      2723
are      2357
be       2286
as       2198
on       2173
with     2135
have     2130
they     2063
from     2028
at       1974
by       1760
he       1703
this     1669
but      1665
or       1501
an       1435
Name: 1, dtype: int64
25 least important terms for document science.txt
duel        0
duffing     0
duffy       0
dugdale     0
saddam      0
dunkeld     0
dunmall     0
dunmore     0
dunn        0
dunsmore    0
dunstan     0
duping      0
sackings    0
duress      0
durham      0
sack        0
durong      0
durum       0
dusting     0
sabotage    0
sabien      0
dutton      0
sabah       0
rutile      0
00          0
Name: 1, dtype: int64


**What problem do you see with the most important words? Are they really representative?**



Let's now create the TF-IDF weighting to see whether it improves this representation.

**TF-IDF (Term Frequency-Inverse Document Frequency)** features are numerical representations of text that reflect the importance of a word within a document relative to a collection of documents.

- **Term Frequency (TF):**
Measures how often a word appears in a specific document. A higher TF indicates the word is more important within that document.

- **Inverse Document Frequency (IDF):**
Measures how rare or common a word is across the entire corpus of documents. Rare words have a higher IDF, indicating they are more informative and important.
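The two ingredients can be combined by hand. The sketch below follows the formula documented for scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`): $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, with each row then scaled to unit Euclidean length. The `counts` array is made up for illustration and is not derived from the ABC corpus.

```python
import numpy as np

# Hypothetical raw counts: 2 documents (rows) x 3 terms (columns).
# Term 0 occurs in both documents; terms 1 and 2 occur in one document each.
counts = np.array([[10.0, 2.0, 0.0],
                   [8.0, 0.0, 3.0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale every row to unit L2 norm.
weighted = counts * idf
tf_idf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print("idf:", idf)
print(tf_idf)
```

With only two documents, a term's document frequency is either 1 or 2, so under this formula the IDF takes just two values: 1.0 for terms found in both documents and roughly 1.41 for terms found in one. That is worth keeping in mind when reading the TF-IDF results for this corpus.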

In [32]:
tf_idf_weighting = TfidfVectorizer()
tf_idf_abc = tf_idf_weighting.fit_transform(abc_df.words)
tf_idf_dt_matrix = pd.DataFrame(tf_idf_abc.toarray(), columns=tf_idf_weighting.get_feature_names_out())
print(tf_idf_dt_matrix)

         00       000     000ºc     00am      00pm       010        02  \
0  0.000692  0.017086  0.000000  0.00016  0.000106  0.000053  0.000038   
1  0.000000  0.007449  0.000042  0.00000  0.000000  0.000000  0.000030   

      02811        03     03549  ...   zoology      zoom   zooming  \
0  0.000000  0.000038  0.000000  ...  0.000000  0.000000  0.000000   
1  0.000042  0.000030  0.000042  ...  0.000169  0.000042  0.000084   

   zooplankton      zoos    zuberb  zucchini  zukerman      zulu  zwingmann  
0     0.000000  0.000000  0.000000  0.000000  0.000000  0.000053   0.000000  
1     0.000295  0.000084  0.000084  0.000042  0.000042  0.000000   0.000084  

[2 rows x 27623 columns]


In [34]:
document = 1
print("25 most important terms for document", abc_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", abc_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document science.txt
the     0.711989
of      0.356761
to      0.272361
and     0.260257
in      0.226737
that    0.148976
says    0.134378
is      0.120622
it      0.090016
for     0.081786
are     0.070793
be      0.068661
as      0.066018
on      0.065267
with    0.064126
have    0.063975
they    0.061963
from    0.060912
at      0.059290
by      0.052862
he      0.051150
this    0.050129
but     0.050009
or      0.045083
an      0.043101
Name: 1, dtype: float64
25 least important terms for document science.txt
ecumenical        0.0
schuller          0.0
eddie             0.0
eden              0.0
schneider         0.0
edmonds           0.0
edmund            0.0
edna              0.0
schild            0.0
eion              0.0
schemes           0.0
educating         0.0
schembri          0.0
edwards           0.0
scheduling        0.0
schaap            0.0
sceptics          0.0
scenic            0.0
egan              0.0
scaremongering    0.0
egrets      

**What do you see now in the representation? Have we solved all the problems?**

# StopWords

In the previous section we ran into problems caused by stopwords, such as `and` or `of`. These words carry little meaning on their own and are unlikely to help any subsequent NLP task, so in most cases it is safe to remove them.

Let's see how to do it via NLTK.

Since stopwords are language-dependent, NLTK provides a list for several languages.

In [35]:
from nltk.corpus import stopwords
print("Languages for which NLTK provides a stopword list:", ", ".join(stopwords.fileids()))

Languages for which NLTK provides a stopword list: albanian, arabic, azerbaijani, basque, belarusian, bengali, catalan, chinese, danish, dutch, english, finnish, french, german, greek, hebrew, hinglish, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, tamil, turkish


We are only interested in the English stopword list.

In [36]:
print("Example of 25 English stopwords:", ", ".join(stopwords.words("english")[:25]))

Example of 25 English stopwords: a, about, above, after, again, against, ain, all, am, an, and, any, are, aren, aren't, as, at, be, because, been, before, being, below, between, both


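We can remove stopwords when building the document-term matrix. Note that the cell below passes `stop_words='english'`, which selects scikit-learn's built-in English list rather than the NLTK list printed above; to use the NLTK list, pass `stop_words=stopwords.words("english")` instead. Let's check.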

In [37]:
sw_free_tf_idf_weighting = TfidfVectorizer(stop_words='english')
sw_free_tf_idf_abc = sw_free_tf_idf_weighting.fit_transform(abc_df.words)
sw_free_tf_idf_dt_matrix = pd.DataFrame(sw_free_tf_idf_abc.toarray(), columns=sw_free_tf_idf_weighting.get_feature_names_out())
print(sw_free_tf_idf_dt_matrix)

         00       000     000ºc      00am      00pm       010        02  \
0  0.002867  0.070776  0.000000  0.000662  0.000441  0.000221  0.000157   
1  0.000000  0.038415  0.000218  0.000000  0.000000  0.000000  0.000155   

      02811        03     03549  ...   zoology      zoom   zooming  \
0  0.000000  0.000157  0.000000  ...  0.000000  0.000000  0.000000   
1  0.000218  0.000155  0.000218  ...  0.000871  0.000218  0.000435   

   zooplankton      zoos    zuberb  zucchini  zukerman      zulu  zwingmann  
0     0.000000  0.000000  0.000000  0.000000  0.000000  0.000221   0.000000  
1     0.001524  0.000435  0.000435  0.000218  0.000218  0.000000   0.000435  

[2 rows x 27329 columns]


In [39]:
document = 1
print("25 most important terms for document", abc_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", abc_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document science.txt
says           0.693014
new            0.158306
researchers    0.157376
research       0.139408
say            0.137704
university     0.135691
people         0.123764
study          0.120201
years          0.114625
scientists     0.113540
like           0.111527
dr             0.097586
professor      0.087363
used           0.081012
team           0.078378
time           0.074506
australian     0.070014
australia      0.068310
human          0.063198
journal        0.062424
technology     0.059326
cells          0.058551
year           0.057312
make           0.056383
world          0.056228
Name: 1, dtype: float64
25 least important terms for document science.txt
reconciliation     0.0
gartrell           0.0
cbh                0.0
garth              0.0
reclassified       0.0
garside            0.0
thiele             0.0
thiel              0.0
thieblemont        0.0
garrett            0.0
garrard            0.0
garnishes          0.0
r

It's much better now, isn't it?

Try experimenting with the previous code: change the `document` index to see how the different weightings affect each document's representation, or load a different corpus from the ones included in NLTK.
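For this kind of experimentation, a small helper saves repeating the ranking expression for every matrix. The function `top_terms` below is a hypothetical convenience, not part of the notebook; it sorts one row with `Series.sort_values` instead of `np.argsort`, so the order among tied weights may differ from the outputs shown above. The `demo` matrix is made up.

```python
import pandas as pd


def top_terms(dt_matrix: pd.DataFrame, document: int, n: int = 25) -> pd.Series:
    """Return the n highest-weighted terms for one row of a document-term matrix."""
    return dt_matrix.iloc[document].sort_values(ascending=False).head(n)


# Tiny made-up matrix: 2 documents x 3 terms.
demo = pd.DataFrame(
    [[0.1, 0.7, 0.0], [0.5, 0.0, 0.2]],
    columns=["cells", "wheat", "zulu"],
)
print(top_terms(demo, document=0, n=2))
```

In the notebook, the same call could be applied to any of the matrices built earlier, for example `top_terms(sw_free_tf_idf_dt_matrix, document=0)` to look at `rural.txt`.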