# TF-IDF on the Governance Set
This notebook runs TF-IDF on the governance data set. It combines analysis and learning, so it contains a few things that someone experienced with TF-IDF would not need. We leave them in to show our learning steps.

The learning part of this notebook is based on [TF-IDF Vectorizer scikit-learn](https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a) by Mukesh Chaudhary. We replicate their steps and build from there.

Prerequisite: This notebook expects that the "_Explore the Governance Data Set_" notebook has created a clean data set in `CACHE_DIR`.

**braindump**

* TF-IDF over all sustainability documents: extract which words are typical for DV compared to all words. --> train on the DV subset, then run on the ALL superset.

* Distinguish between documents / clustering: make vectors of all documents; TF-IDF of each DV document separately relative to all DV documents; classification *within* DV documents; train on all DV docs; test against each DV document.

* Alternatively: take only the words that are truly unique to the DV set (from the first step, that is) and carry only those into the test step; this reduces the feature set to the DV-unique words. Feed k-means clustering with this.
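
The third idea, reducing the feature set to words that only occur in the DV set, could be sketched on toy data as follows. The documents and the `vocabulary` helper are hypothetical stand-ins and not part of the notebook's pipeline; the tokenization (lowercase words of two or more characters) is an assumption chosen to resemble the scikit-learn default.

```python
import re


def vocabulary(documents):
    """Return the set of lowercase word tokens (2+ characters) in the documents."""
    words = set()
    for text in documents:
        words.update(re.findall(r"\b\w\w+\b", text.lower()))
    return words


# Hypothetical stand-ins for the DV subset and for the remaining documents.
dv_documents = ["duurzaamheid en energie", "energie en klimaat"]
other_documents = ["begroting en belasting", "begroting en energie"]

# Words that occur in the DV set but nowhere else.
dv_only_words = vocabulary(dv_documents) - vocabulary(other_documents)
print(sorted(dv_only_words))  # ['duurzaamheid', 'klimaat']
```

A set like this could then be passed to a vectorizer through its `vocabulary` parameter, so that only these words become features.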



---
## Dependencies and Imports

In [3]:
!pip install pandas scikit-learn



In [4]:
import re
import sys
from pathlib import Path
WRITE='w'
READ_BINARY='rb'
print("python=={}".format(re.sub(r'\s.*', '', sys.version)))

from sklearn import __version__ as sklearn__version__
print(f"scikit-learn=={sklearn__version__}")
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import pandas as pd
print(f"pandas=={pd.__version__}")

import numpy as np
print(f"numpy=={np.__version__}")


python==3.10.9
scikit-learn==1.2.2
pandas==2.0.1
numpy==1.24.3


---
## Data Loading
All data was preprocessed by the "_Explore the Governance Data Set_" notebook, so our data loading steps here can be simplified. We don't have to worry about tokenization, stemming and stop words. It is still useful to have convenience functions and constants to make the rest of the code more readable.

In [5]:
CACHE_DIR = Path('../cache/Governance').resolve()

# The files containing the extracted text from the raw documents.
GLOB_ALL_DOCUMENTS = str(CACHE_DIR) + '/GM????????.txt'

GLOB_CA = str(CACHE_DIR) + '/GM????CA??.txt'
GLOB_DV = str(CACHE_DIR) + '/GM????DV??.txt'
GLOB_EX = str(CACHE_DIR) + '/GM????EX??.txt'
GLOB_IK = str(CACHE_DIR) + '/GM????IK??.txt'
GLOB_JS = str(CACHE_DIR) + '/GM????JS??.txt'
GLOB_OB = str(CACHE_DIR) + '/GM????OB??.txt'
GLOB_PB = str(CACHE_DIR) + '/GM????PB??.txt'
GLOB_TV = str(CACHE_DIR) + '/GM????TV??.txt'
GLOB_WS = str(CACHE_DIR) + '/GM????WS??.txt'

# Expand a glob pattern into a sorted list of paths. We return a list rather than
# the raw glob object, since that is a generator that gets "exhausted" once you
# have iterated over it. Sorting keeps the document order the same between calls.
# https://stackoverflow.com/questions/51108256/how-to-take-a-pathname-string-with-wildcards-and-resolve-the-glob-with-pathlib
def expand_glob(glob):
    p = Path(glob)
    return sorted(p.parent.expanduser().glob(p.name))

print(f"all text documents = {GLOB_ALL_DOCUMENTS}")

def load_documents_as_string_array(glob):
    return [file.read_text() for file in expand_glob(glob)]

def document_names(glob):
    return [file.stem for file in expand_glob(glob)]


all text documents = /home/jovyan/jads/execute-enexis/cache/Governance/GM????????.txt
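
The remark about globs getting "exhausted" can be illustrated with a small, self-contained sketch. It uses a temporary directory with two made-up file names instead of `CACHE_DIR`, so it is only a demonstration of the behaviour, not of our actual data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical files that follow the GM????????.txt naming scheme.
    for name in ("GM0001DV01.txt", "GM0002CA01.txt"):
        (Path(tmp) / name).write_text("example", encoding="utf-8")

    matches = Path(tmp).glob("GM????????.txt")
    first_pass = sorted(path.stem for path in matches)
    # The same glob object yields nothing the second time around.
    second_pass = sorted(path.stem for path in matches)

print(first_pass)   # ['GM0001DV01', 'GM0002CA01']
print(second_pass)  # []
```

This is why the helper functions call `expand_glob` again each time they need the file list, rather than storing the glob object.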


---
## Replicate Mukesh's Vectorizers (with Adaptations)

This section replicates [TF-IDF Vectorizer scikit-learn](https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a) by Mukesh Chaudhary for the sake of understanding their steps better. We leave out the initialisation parameters from the example where they only repeat the defaults; `analyzer='word'`, for example, is already the default. We pack the result into a Pandas dataframe so it looks nice, labelling the rows with the document names and the columns with the words used.

We also keep the stop words by _not_ specifying a stop word list. A large part of why we plan on using TF-IDF is so that common words are automatically filtered. Thus, we don't specify `stop_words='english'`.

In [6]:
train_names     = ['Doc1',             'Doc2']
# to test its behaviour, you can try some alternatives
# train_documents = ['The sky is blue. the the the the the the ', 'The sun is bright.']
# train_documents = ['The sky is blue. sky sky sky sky sky sky ', 'The sun is bright.']
train_documents = ['The sky is blue.', 'The sun is bright.']
test_names      = ['Doc3',                         'Doc4']
test_documents  = ['The sun in the sky is bright', 'We can see the shining sun, the bright sun.']


First, the count vectorizer. We train the vectorizer and build up the internal word list. Then we look at the generated word count matrix and take a peek at the internal vocabulary.

In [7]:
def count_vectorize_strings(index, strings):
    # run the vectorizer on the data
    vectorizer = CountVectorizer()
    word_matrix = vectorizer.fit_transform(strings)
    words_list = vectorizer.get_feature_names_out()

    # take the output and package it into various useful data frames
    per_document    = pd.DataFrame(index=index, columns=words_list, data=word_matrix.toarray())
    sum_over_corpus = pd.DataFrame(per_document.sum(), columns=['sum']).T

    return vectorizer, per_document, sum_over_corpus

count_vectorizer, count_per_document, count_sum_over_corpus = count_vectorize_strings(train_names, train_documents)

# show the word count per document. This is just that: a word count for each
# word in the vocabulary, counted for each document separately.

count_per_document


Unnamed: 0,blue,bright,is,sky,sun,the
Doc1,1,0,1,1,0,1
Doc2,0,1,1,0,1,1


In [8]:
# show the sum of word counts over the corpus

count_sum_over_corpus


Unnamed: 0,blue,bright,is,sky,sun,the
sum,1,1,2,1,1,2


In [10]:
# take a peek at the internal word list of the vectorizer. The numbers are
# internal indices into its private data structure. Not very useful for our
# code, but still good to get some idea of how the vectorizer works internally.

pd.DataFrame.from_dict(count_vectorizer.vocabulary_, orient='index', columns=['vocabulary index']).sort_values('vocabulary index').T


Unnamed: 0,blue,bright,is,sky,sun,the
vocabulary index,0,1,2,3,4,5
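
To check our understanding of the count matrix, we can reproduce it with plain Python. This is a minimal sketch that assumes the vectorizer's default settings (lowercasing, and tokens of two or more word characters); the names `TOKEN_PATTERN` and `count_words` are our own and not part of scikit-learn.

```python
import re
from collections import Counter

# Words of two or more characters, mirroring the default token pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_words(text):
    """Count the lowercase tokens in a single document."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


documents = ["The sky is blue.", "The sun is bright."]
counts = [count_words(text) for text in documents]

# The vocabulary is the sorted union of all words seen during training.
vocabulary = sorted(set().union(*counts))
rows = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)  # ['blue', 'bright', 'is', 'sky', 'sun', 'the']
for row in rows:
    print(row)
```

The two printed rows should match the `Doc1` and `Doc2` rows of the count matrix above.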


Next, we explore the TF-IDF vectorizer, which internally consists of a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), followed by a [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).


In [37]:
def tfidf_vectorize_strings(index, strings, max_df=1.0, max_features=None):
    # run the vectorizer on the data
    vectorizer = TfidfVectorizer(max_df=max_df, max_features=max_features)
    word_matrix = vectorizer.fit_transform(strings)
    words_list = vectorizer.get_feature_names_out()

    # take the output and package it into various useful data frames
    matrix = pd.DataFrame(index=index, columns=words_list, data=word_matrix.toarray())
    idf = pd.DataFrame(columns=words_list, data=[vectorizer.idf_])

    return vectorizer, matrix, idf

tfidf_vectorizer, tfidf_matrix, tfidf_idf = tfidf_vectorize_strings(train_names, train_documents)
tfidf_matrix

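# Note: the common words "is" and "the" do not get a weight of 0. scikit-learn's
# default (smoothed) IDF is ln((1 + n) / (1 + df)) + 1, which is 1.0 rather than
# 0 for a word that occurs in every document.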


Unnamed: 0,blue,bright,is,sky,sun,the
Doc1,0.576152,0.0,0.409937,0.576152,0.0,0.409937
Doc2,0.0,0.576152,0.409937,0.0,0.576152,0.409937


In [38]:
tfidf_idf

Unnamed: 0,blue,bright,is,sky,sun,the
0,1.405465,1.405465,1.0,1.405465,1.405465,1.0
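
The numbers above can be recomputed by hand. The sketch below assumes the scikit-learn defaults as we understand them from its documentation: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, raw term counts as TF, and L2 normalisation of each document row. The variable names are our own.

```python
import math

n_documents = 2
# In how many training documents each word occurs.
document_frequency = {"blue": 1, "bright": 1, "is": 2, "sky": 1, "sun": 1, "the": 2}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {
    word: math.log((1 + n_documents) / (1 + df)) + 1
    for word, df in document_frequency.items()
}

# Doc1 = "The sky is blue.": every word occurs exactly once.
doc1_counts = {"blue": 1, "is": 1, "sky": 1, "the": 1}
raw = {word: count * idf[word] for word, count in doc1_counts.items()}

# L2 normalisation: divide by the Euclidean length of the row.
norm = math.sqrt(sum(value * value for value in raw.values()))
doc1_tfidf = {word: value / norm for word, value in raw.items()}

print({word: round(value, 6) for word, value in idf.items()})
print({word: round(value, 6) for word, value in doc1_tfidf.items()})
```

Words that occur in both documents get an IDF of exactly 1.0, which is why "is" and "the" still carry a non-zero weight (about 0.41) in the matrix.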


### Stop Word Elimination with `max_df`
We can control stop word elimination with the `max_df` parameter. As a float between 0.0 and 1.0, it is the highest fraction of documents in which a word may occur; words with a higher document frequency are dropped from the vocabulary. The default value of 1.0 means "keep all words", which effectively disables stop word elimination. Lowering `max_df` makes the vectorizer eliminate words that occur in many documents and therefore carry little information. We try a value of 0.5 here and look at the resulting matrix and the list of words that were eliminated this way.

In [39]:
tfidf_vectorizer, tfidf_matrix, tfidf_idf = tfidf_vectorize_strings(train_names, train_documents, max_df=0.5)
tfidf_matrix


Unnamed: 0,blue,bright,sky,sun
Doc1,0.707107,0.0,0.707107,0.0
Doc2,0.0,0.707107,0.0,0.707107


In [40]:
# Let's have a quick peek at the stop word list that the vectorizer built up.

tfidf_vectorizer.stop_words_


{'is', 'the'}
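
The effect of `max_df` can be mimicked in a few lines of plain Python. This is a sketch of the filtering rule only (drop a word when it occurs in more than `max_df * n` documents); the function name `split_by_max_df` is hypothetical.

```python
from collections import Counter


def split_by_max_df(token_sets, max_df):
    """Split the vocabulary into kept and removed words for a float max_df."""
    threshold = max_df * len(token_sets)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for tokens in token_sets for word in tokens)
    kept = sorted(word for word, count in df.items() if count <= threshold)
    removed = sorted(word for word, count in df.items() if count > threshold)
    return kept, removed


# The two training documents as sets of tokens.
token_sets = [{"the", "sky", "is", "blue"}, {"the", "sun", "is", "bright"}]

kept, removed = split_by_max_df(token_sets, max_df=0.5)
print(kept)     # ['blue', 'bright', 'sky', 'sun']
print(removed)  # ['is', 'the']
```

With only two documents, a threshold of 0.5 removes exactly the words that occur in both documents; a word that occurs in one of the two documents sits on the threshold and is kept.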

Next comes the inference step, which no longer modifies the internal state of the vectorizer. We reuse the structure of the data frames that we generated during training to make the new data frame more readable.

In [41]:
# Here we use the count vectorizer again, which does not have a stop word list
# and thus will show values for "the" and "is". The TF-IDF vectorizer comes
# after.

count_vector = count_vectorizer.transform(test_documents)
pd.DataFrame(index=test_names, columns=count_per_document.columns,
             data=count_vector.toarray())

Unnamed: 0,blue,bright,is,sky,sun,the
Doc3,0,1,1,1,1,2
Doc4,0,1,0,0,2,2


In [42]:
# and using the last TF-IDF vectorizer, which does have stop word elimination.

tfidf_data = tfidf_vectorizer.transform(test_documents).toarray()
pd.DataFrame(index=test_names, columns=tfidf_matrix.columns,
             data=tfidf_data)


Unnamed: 0,blue,bright,sky,sun
Doc3,0.0,0.57735,0.57735,0.57735
Doc4,0.0,0.447214,0.0,0.894427
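
The inference step can also be retraced by hand. The sketch below assumes that transforming a new document only uses the vocabulary and IDF values learned during training, so that unseen words such as "shining" are simply ignored. The `transform` function here is our own toy version, not the scikit-learn method.

```python
import math
import re

# Vocabulary and IDF values as learned from the two training documents
# with max_df=0.5: each remaining word occurs in 1 of 2 documents.
vocabulary = ["blue", "bright", "sky", "sun"]
idf = {word: math.log(3 / 2) + 1 for word in vocabulary}


def transform(text):
    """Weigh the known words of a new document and L2-normalise the row."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    weights = [tokens.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(weight * weight for weight in weights))
    return [weight / norm if norm else 0.0 for weight in weights]


doc4 = transform("We can see the shining sun, the bright sun.")
print([round(value, 6) for value in doc4])  # [0.0, 0.447214, 0.0, 0.894427]
```

Because all four words share the same IDF here, the result only depends on the counts: "sun" occurs twice and "bright" once, which gives 2/√5 and 1/√5 after normalisation.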


### Stop Word Elimination using `max_features`
An alternative to using `max_df` is to use `max_features` to control stop word elimination. `max_features` sets an upper limit on the number of words in the vocabulary: only the words with the highest term frequency across the corpus are kept.

In [43]:
tfidf_vectorizer, tfidf_matrix, tfidf_idf = tfidf_vectorize_strings(train_names, train_documents, max_features=2)
tfidf_matrix


Unnamed: 0,is,the
Doc1,0.707107,0.707107
Doc2,0.707107,0.707107


In [44]:
# Let's have a quick peek at the stop word list that the vectorizer built up.

tfidf_vectorizer.stop_words_


{'blue', 'bright', 'sky', 'sun'}

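... oh. So this gives us the _reverse_ of what we need: the vocabulary keeps the most frequent words ("is" and "the") and drops the informative ones. Limiting the vocabulary by term frequency is therefore not useful for stop word elimination.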
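
The same outcome follows from a plain Python sketch of the selection rule, assuming that `max_features` keeps the words with the highest total count over the whole corpus. The variable names are our own.

```python
import re
from collections import Counter

documents = ["The sky is blue.", "The sun is bright."]

# Total number of occurrences of each word over the whole corpus.
corpus_counts = Counter(
    word
    for text in documents
    for word in re.findall(r"\b\w\w+\b", text.lower())
)

# Keep the two most frequent words, as with max_features=2.
kept = sorted(word for word, _ in corpus_counts.most_common(2))
removed = sorted(set(corpus_counts) - set(kept))

print(kept)     # ['is', 'the']
print(removed)  # ['blue', 'bright', 'sky', 'sun']
```

The most frequent words are exactly the ones we would like to get rid of, which is why `max_df` is the better fit for our purpose.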

---
## Apply Vectorizers to Governance Data Set
With the how-to replicated, we can now apply the same steps to our own data set. We reimplement the function here so that it takes a file glob rather than a string array. We could reduce the code duplication somewhat, but that is not a high priority at this stage.

In [30]:
def tfidf_vectorize(glob, max_df=1.0):
    # run the vectorizer on the data
    vectorizer = TfidfVectorizer(max_df=max_df)
    word_matrix = vectorizer.fit_transform(load_documents_as_string_array(glob))
    words_list = vectorizer.get_feature_names_out()

    # take the output and package it into various useful data frames
    matrix = pd.DataFrame(index=document_names(glob), columns=words_list, data=word_matrix.toarray())
    idf = pd.DataFrame(columns=words_list, data=[vectorizer.idf_])

    return vectorizer, matrix, idf


In [33]:
all_docs_vectorizer, all_docs_matrix, all_docs_idf = tfidf_vectorize(GLOB_ALL_DOCUMENTS, max_df=0.01)
all_docs_matrix

Unnamed: 0,aaaa,aaaaccommodaties,aaaactieprogramma,aaaactiva,aaaaf,aaaafkortingenlijst,aaaafval,aaaafvalstoff,aaaalgemen,aaaann,...,èffeoturer,èii,èlèèmnn,èmd,èmjim,ères,èàca,èàcaa,èèn,èèè
GM0003DV02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0003EX06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0003OB01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0003OB02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0005CA01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GM1987DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1987IK01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1987JS01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1987PB01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
#all_docs_vectorizer.stop_words_
len(all_docs_vectorizer.stop_words_)

18733

In [35]:
dv_docs_vectorizer, dv_docs_matrix, dv_docs_idf = tfidf_vectorize(GLOB_DV, max_df=0.01)
dv_docs_matrix

Unnamed: 0,aab,aachener,aadefg,aadijk,aak,aalburg,aalderveld,aalscholver,aalsmeerder,aalst,...,zwollenar,zwoud,zzh,zzprs,zzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zèlf,àlle,ànciew,ècht
GM0003DV02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0005DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0007DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0009DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0034DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GM1945DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1955DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1955DV02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1955DV03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
#dv_docs_vectorizer.stop_words_
len(dv_docs_vectorizer.stop_words_)


14343

---
## Cross Referencing Data Sets
Next, we try training on one set and running on the other: we apply the `all_docs_vectorizer`, which was fitted on all documents, to the DV data set.

In [46]:
all_vs_dv_matrix = all_docs_vectorizer.transform(load_documents_as_string_array(GLOB_DV))

In [47]:
pd.DataFrame(index=document_names(GLOB_DV), columns=all_docs_vectorizer.get_feature_names_out(), data=all_vs_dv_matrix.toarray())

Unnamed: 0,aaa,aaaa,aaaaccommodaties,aaaactieprogramma,aaaactiva,aaaaf,aaaafkortingenlijst,aaaafval,aaaafvalstoff,aaaalgemen,...,èffeoturer,èii,èlèèmnn,èmd,èmjim,ères,èàca,èàcaa,èèn,èèè
GM0003DV02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0005DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0007DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0009DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM0034DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GM1945DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1955DV01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1955DV02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GM1955DV03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
