# Roget's Thesaurus in the 21st Century
---
>Spanakis Panagiotis-Alexios, Undergraduate Student
Department of Management Science and Technology
Athens University of Economics and Business
t8200158@aueb.gr

## Before we start

We first need to go over the dependencies this notebook needs in order to run properly.

### Dependencies for this Notebook


We will need libraries that are not included in the Python Standard Library; they are managed with Poetry.

To install the dependencies, we run the following commands in the terminal:

1. Install Poetry (if it is not already installed)
```bash
pip install poetry
```

2. Once Poetry is installed, install the project dependencies
```bash
poetry install
```

3. (Optional) To use Jupyter Notebook inside the virtual environment created by Poetry, activate that environment first
```bash
poetry shell
```

4. With the dependencies installed, start Jupyter Notebook
```bash
jupyter notebook
```




We will use the following libraries:

- `pandas` for data manipulation
- `numpy` for numerical operations
- `matplotlib` for plotting

---
- `requests` for making HTTP requests
- `beautifulsoup4` for web scraping
---
- `re` for regular expressions
- `os` for file manipulation
- `json` for JSON manipulation
---
- `langchain` to augment the power of LLMs with our data and to handle the embedding models
- `chromadb` to store the embeddings to a vector database
- `nomic` in order to interact with the nomic API


In [1]:
import requests
from bs4 import BeautifulSoup
import re
import os
import json

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Get Roget's Thesaurus Classification

### How the words, classes, divisions, and sections are organized on the Gutenberg page

The words are organized in the page in the following manner:

1) They belong to a class
2) They belong to a division (if it exists) within the class
3) They belong to a section within the division (if it exists) or within the class

Let's start by getting the page and parsing it

In [3]:
# Get the page
r = requests.get("https://www.gutenberg.org/files/10681/old/20040627-10681-h-body-pos.htm")
# Parse the page
html = r.text
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')


1) We can notice that all classes on the page are represented by a `<dt>` tag with an `<a>` tag inside it. 
The `<a>` tag has a name attribute that starts with **`CLASS`**. The text of the `<a>` tag is the name of the class.


2) We can notice that all divisions on the page are represented by a `<dt>` tag with an `<a>` tag inside it.
The `<a>` tag has a name attribute that starts with **`DIVISION`**. The text of the `<a>` tag is the name of the division.


3) We can notice that all sections on the page are represented by a `<dt>` tag with an `<a>` tag inside it.
The `<a>` tag has a name attribute that starts with **`SECTION`**. The text of the `<a>` tag is the name of the section.

Lastly, the words are represented by a `<dt>` tag that contains an `<a>` tag whose name attribute is a number or a number with a letter at the end.
The word itself is not the text of that `<a>` tag: it sits in the `<b>` tag that follows it.
So, to get a word, we take the text of the first `<b>` tag that comes after the `<a>` tag whose name attribute is a number.
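
The naming rules above can be condensed into a small helper. This is only a minimal sketch using the standard `re` module: `classify_anchor` is a hypothetical name, the sample anchor names are made up for illustration rather than copied from the page, and the entry pattern (an integer with an optional decimal part) is the one used in the extraction loop of this notebook.

```python
import re

# Entry anchors: an integer, optionally followed by a decimal part
ENTRY_NAME = re.compile(r"^\d+(\.\d+)?$")


def classify_anchor(name):
    """Return the hierarchy level suggested by an <a name="..."> value."""
    for level in ("CLASS", "DIVISION", "SECTION"):
        if name.startswith(level):
            return level.lower()
    if ENTRY_NAME.match(name):
        return "word"
    return "other"


# Hypothetical anchor names, for illustration only
for sample in ["CLASS1", "DIVISION1", "SECTION2", "12", "intro"]:
    print(sample, "->", classify_anchor(sample))
```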

With the above in mind, we can extract the hierarchy of classes, divisions, sections, and words from the page, using a dictionary to hold the entire hierarchy.
We can then create a directory structure that mirrors the structure of the page and save the words of each section to a file inside it.

In [4]:
# Initialize a dictionary to hold the entire hierarchy
hierarchy = {}
current_class = None
current_division = None
current_section = None

# Find all <dt> tags
dt_tags = soup.find_all('dt')

With the dictionary initialized and the `<dt>` tags found, we can now iterate through each `<dt>` tag and extract the hierarchy of classes, divisions, sections, and words from the page. Entries that contain a comma or a period are split, and each resulting word is added to the hierarchy separately.

We will use the following logic to extract the hierarchy:

1) If the `<dt>` tag has an `<a>` tag with a name attribute that starts with **`CLASS`** then we have found a class. We will add the class to the hierarchy and continue to the next `<dt>` tag to check for division/section.
2) If the `<dt>` tag has an `<a>` tag with a name attribute that starts with **`DIVISION`** then we have found a division. We will add the division to the hierarchy and continue to the next `<dt>` tag to check for section.
3) If the `<dt>` tag has an `<a>` tag with a name attribute that starts with **`SECTION`** then we have found a section. We will add the section to the hierarchy and continue to the next `<dt>` tag to check for words.
4) Lastly, if the `<dt>` tag has an `<a>` tag with a name attribute that is a number or a number with a letter at the end, then we have found a word. We will add the word to the hierarchy under the section it belongs to. Note that we split the text on commas and periods and add the resulting words to the hierarchy separately. Also, if a word contains a dagger (†, \u0086), we remove it so that we are left with a clean word.
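
The splitting and cleaning described in step 4 can be tried out in isolation. The snippet below is a minimal sketch with a made-up input string; `split_entry` is a hypothetical helper, not part of the notebook, and it removes the dagger before stripping whitespace, which is a slight simplification of the loop that follows.

```python
import re


def split_entry(text):
    """Split a heading such as 'A, B.' into clean, non-empty words."""
    # Collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    words = []
    for part in re.split(r",|\.", text):
        # Drop the dagger marker and any surrounding whitespace
        part = part.replace("†", "").strip()
        if part:
            words.append(part)
    return words


print(split_entry("Existence†,  Being."))  # ['Existence', 'Being']
```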

In [5]:
# Iterate through each <dt> tag
for dt in dt_tags:
    # Check for class
    class_a_tag = dt.find('a', attrs={'name': re.compile("^CLASS")})
    if class_a_tag:
        current_class = re.sub(r'\s+', ' ', class_a_tag.text).strip()
        hierarchy[current_class] = {'divisions': {}, 'sections': {}}
        current_division = None
        current_section = None
        # Now that we got the class we can continue to the next <dt> tag to check for division/section
        continue

    # Check for division
    division_a_tag = dt.find('a', attrs={'name': re.compile("^DIVISION")})
    if division_a_tag and current_class:
        current_division = re.sub(r'\s+', ' ', division_a_tag.text).strip()
        hierarchy[current_class]['divisions'][current_division] = {'sections': {}}
        current_section = None
        # Now that we got the division we can continue to the next <dt> tag to check for section
        continue

    # Check for section
    section_a_tag = dt.find('a', attrs={'name': re.compile("^SECTION")})
    if section_a_tag:
        current_section = re.sub(r'\s+', ' ', section_a_tag.text).strip()
        if current_division:
            hierarchy[current_class]['divisions'][current_division]['sections'][current_section] = []
        else:
            hierarchy[current_class]['sections'][current_section] = []
        # Now that we got the section we can continue to the next <dt> tag to check for words
        continue

    # Check for words (each word follows an <a> tag whose name attribute is a number, integer or decimal)
    word_a_tags = dt.find_all('a', attrs={'name': re.compile(r"^\d+(\.\d+)?$")})
    for word_a_tag in word_a_tags:
        word = word_a_tag.find_next('b').get_text() if word_a_tag.find_next('b') else ''
        word = re.sub(r'\s+', ' ', word).strip()

        # Split the word regarding the comma and . and add the words to the hierarchy
        words = re.split(r',|\.', word)
        for word in words:
            word = word.strip()
            # If the word has (\u0086) then remove it
            if '†' in word:
                word = word.replace('†', '')
            if current_section and word:
                if current_division:
                    hierarchy[current_class]['divisions'][current_division]['sections'][current_section].append(word)
                else:
                    hierarchy[current_class]['sections'][current_section].append(word)

Now that we have the hierarchy of classes, divisions, sections, and words, we can take a look at the hierarchy we created.

In [6]:
hierarchy['WORDS EXPRESSING ABSTRACT RELATIONS']

{'divisions': {},
 'sections': {'EXISTENCE': ['Existence',
   'Inexistence',
   'Substantiality',
   'Unsubstantiality',
   'Intrinsicality',
   'Extrinsicality',
   'State',
   'Circumstance'],
  'RELATION': ['Relation',
   'Irrelation',
   'Consanguinity',
   'Correlation',
   'Identity',
   'Contrariety',
   'Difference',
   'Uniformity',
   'Nonuniformity',
   'Similarity',
   'Dissimilarity',
   'Imitation',
   'Nonimitation',
   'Variation',
   'Copy',
   'Prototype',
   'Agreement',
   'Disagreement'],
  'QUANTITY': ['Quantity',
   'Degree',
   'Equality',
   'Inequality',
   'Mean',
   'Compensation',
   'Greatness',
   'Smallness',
   'Superiority',
   'Inferiority',
   'Increase',
   'Nonincrease',
   'Decrease',
   'Addition',
   'Nonaddition',
   'Subtraction',
   'Adjunct',
   'Remainder',
   'Decrement',
   'Mixture',
   'Simpleness',
   'Junction',
   'Disjunction',
   'Connection',
   'Coherence',
   'Incoherence',
   'Combination',
   'Decomposition',
   'Whole',
   'P

Let's now save the hierarchy to a JSON file so that we can use it later.

In [7]:
# Save the hierarchy to a json file
import json

with open('hierarchy.json', 'w') as file:
    json.dump(hierarchy, file)
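
When the JSON file is loaded back later, it is often convenient to flatten the nested dictionary into rows. The following is a minimal sketch under stated assumptions: it uses a small made-up hierarchy with the same shape as the one built above, writes it to a temporary file instead of `hierarchy.json`, and `iter_words` is a hypothetical helper.

```python
import json
import os
import tempfile


def iter_words(hierarchy):
    """Yield (class, division, section, word) tuples; division may be None."""
    for class_name, content in hierarchy.items():
        # Sections attached directly to the class
        for section, words in content["sections"].items():
            for word in words:
                yield class_name, None, section, word
        # Sections nested inside a division
        for division, div_content in content["divisions"].items():
            for section, words in div_content["sections"].items():
                for word in words:
                    yield class_name, division, section, word


# Toy hierarchy with the same shape as the real one (names are made up)
toy = {
    "CLASS A": {"divisions": {}, "sections": {"S1": ["alpha", "beta"]}},
    "CLASS B": {"divisions": {"D1": {"sections": {"S2": ["gamma"]}}}, "sections": {}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_hierarchy.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy, f)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

rows = list(iter_words(loaded))
print(rows)
```

Each tuple keeps the full path of a word in the hierarchy, which makes it easy to build a table of labels later on.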

### Create the directory structure and save the words to files

Now that we have the hierarchy, we can create a directory structure that mirrors the structure of the page and save the words under each section to a file in the directory structure.

Let's start by creating a function that writes the words to a file.

In [8]:
def write_words_to_file(words, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write('\n'.join(words))

Next, we create the directory structure that mirrors the structure of the page and write the words of each section to a `words.txt` file.

In [9]:
def create_directory_structure(base_path, hierarchy):
    for class_name, class_content in hierarchy.items():
        class_path = os.path.join(base_path, class_name)
        os.makedirs(class_path, exist_ok=True)

        for division_name, division_content in class_content.get('divisions', {}).items():
            division_path = os.path.join(class_path, division_name)
            os.makedirs(division_path, exist_ok=True)

            for section_name, words in division_content.get('sections', {}).items():
                section_path = os.path.join(division_path, section_name)
                os.makedirs(section_path, exist_ok=True)
                write_words_to_file(words, os.path.join(section_path, 'words.txt'))

        for section_name, words in class_content.get('sections', {}).items():
            section_path = os.path.join(class_path, section_name)
            os.makedirs(section_path, exist_ok=True)
            write_words_to_file(words, os.path.join(section_path, 'words.txt'))


# Create the directory structure
base_path = 'roget_thesaurus_final'
create_directory_structure(base_path, hierarchy)
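
Because the directory layout encodes the hierarchy, the class, division, and section of a file can be recovered from its path alone. The sketch below assumes the layout `base/class/[division/]section/words.txt` produced above; `labels_from_path` and the example paths are hypothetical.

```python
import os


def labels_from_path(file_path, base_path):
    """Recover (class, division, section) from the path of a words.txt file."""
    relative = os.path.relpath(file_path, base_path)
    parts = relative.split(os.sep)[:-1]  # drop the file name
    if len(parts) == 2:
        class_name, section = parts
        return class_name, None, section
    if len(parts) == 3:
        return tuple(parts)
    raise ValueError(f"Unexpected path layout: {file_path}")


# Hypothetical paths, built with os.path.join so they work on any OS
p1 = os.path.join("base", "CLASS A", "S1", "words.txt")
p2 = os.path.join("base", "CLASS B", "D1", "S2", "words.txt")
print(labels_from_path(p1, "base"))
print(labels_from_path(p2, "base"))
```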

Now that we have created the directory structure and saved the words to files, we can take a look at how many words we have in the thesaurus.

In [10]:
# Count the lines (one word per line) across all files
line_count = 0
for root, dirs, files in os.walk(base_path):
    for file in files:
        with open(os.path.join(root, file), encoding='utf-8') as f:
            line_count += len(f.readlines())
line_count

1057

We have 1057 words in the thesaurus. 

> Note: Some words can exist in multiple sections, so the actual number of unique words is less than 1057.
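
To see how the total and the unique counts can differ, both can be computed in one pass. This is a minimal sketch on a made-up temporary directory rather than on the real thesaurus files; `count_words` is a hypothetical helper, and unlike the count above it skips blank lines.

```python
import os
import tempfile


def count_words(base_path):
    """Return (total_words, unique_words) over all files under base_path."""
    total = 0
    unique = set()
    for root, _dirs, files in os.walk(base_path):
        for name in files:
            with open(os.path.join(root, name), encoding="utf-8") as f:
                words = [line.strip() for line in f if line.strip()]
            total += len(words)
            unique.update(words)
    return total, len(unique)


# Toy directory in which one word appears in two sections (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    for section, words in {"S1": ["Mean", "Degree"], "S2": ["Mean"]}.items():
        os.makedirs(os.path.join(tmp, section))
        with open(os.path.join(tmp, section, "words.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(words))
    total, unique = count_words(tmp)

print(total, unique)  # 3 2
```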

## Get Word Embeddings

Embedding models are available from various sources, such as `GloVe`, `fastText`, `Word2Vec`, and `BERT`.
More recent models, such as `Gemini` or OpenAI's `text-embedding-3`, are more advanced and can provide better embeddings.

In our case, we will use the recently released `Nomic` embeddings, and more specifically `nomic-embed-text-v1.5`,
a text embedding model that supports variable output sizes.

It is specialized for retrieval, similarity, clustering, and classification. Recommended output sizes are 768, 512, 256, 128, and 64.
Source: [Nomic Embeddings Descriptions](https://docs.nomic.ai/atlas/models/text-embedding).
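
A variable output size generally means that a prefix of the full vector can be kept and rescaled to unit length. The sketch below illustrates only that general idea in plain Python with a made-up vector; it is not Nomic's exact procedure, which may involve additional steps, and `shrink_embedding` is a hypothetical helper.

```python
import math


def shrink_embedding(vector, size):
    """Keep the first `size` components and rescale them to unit length."""
    if not 0 < size <= len(vector):
        raise ValueError("size must be between 1 and len(vector)")
    head = vector[:size]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.5]  # made-up 4-dimensional "embedding"
small = shrink_embedding(full, 2)
print(small)
```

Smaller vectors take less storage and are faster to compare, usually at some cost in quality.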

It is worth noting that `nomic-embed-text-v1.5` has an MTEB score of 62.28, which is a very good score and close to
the 64.6 of OpenAI's `text-embedding-3-large`, with the same input sequence length of 8192.
In addition, `nomic-embed-text-v1.5` is open-source and free to use, and the API comes with a free trial of 1M tokens.

More Details here : [Nomic Embeddings](https://blog.nomic.ai/posts/nomic-embed-text-v1)

We will save two different versions of the embeddings. 
One will be embeddings with a task type specialized for clustering
and one will be embeddings with a task type specialized for classification.

According to Nomic, the best option to use Nomic Embed is through its production-ready Nomic Embedding API.

We can access the API via `nomic`, a Python client for the Nomic API.
We will use this library to get the embeddings for the words in the thesaurus.

We can get a Nomic Atlas API key at [Nomic Atlas](https://atlas.nomic.ai/) after following the instructions to create an account and get the API key.

Then, in our terminal, we need to run the following command to log in with the API key
```bash
nomic login YOUR_API_KEY
```


With the above set we can now get the embedding model and use it to get the embeddings for the words in the thesaurus.

LangChain provides document loaders that make reading the files easier.

We can start by loading the words from the files using the `DirectoryLoader` and `TextLoader` classes.

In [3]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# pip install langchain-community==0.0.19

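# Directory that holds the words.txt files (if they were generated with the cells above, this must match `base_path`)
path = "roget_thesaurus"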
text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader,
                         loader_kwargs=text_loader_kwargs, show_progress=True,
                         use_multithreading=True)
docs = loader.load()

100%|██████████| 39/39 [00:00<00:00, 5571.64it/s]


In [4]:
# Create a list of the documents where each document is a word from the files and split the documents according to \n
docs = [doc.page_content.split("\n") for doc in docs]
# Now make the documents a flat list
docs = [word for doc in docs for word in doc]
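
Splitting on `\n` can leave empty strings behind when a file ends with a newline or contains blank lines. The snippet below is a minimal sketch of the same split-and-flatten step with blank entries filtered out; the `contents` list is made-up data standing in for `doc.page_content`.

```python
# Made-up file contents standing in for `doc.page_content`
contents = ["Existence\nInexistence\n", "State\n\nCircumstance"]

# Split every file on newlines, flatten, and drop blank entries
words = [
    word.strip()
    for content in contents
    for word in content.split("\n")
    if word.strip()
]
print(words)  # ['Existence', 'Inexistence', 'State', 'Circumstance']
```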

Now we will store the embeddings in `chromadb`, a vector database that can store and query embeddings.

In [ ]:
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection_clustering = client.create_collection(name="nomic_clustering_v1")
collection_classification = client.create_collection(name="nomic_classification_v1.5")

We can now use the `nomic` library to get the embeddings for the words in the thesaurus with the `nomic-embed-text-v1.5` model
and the specialized task type for classification.

From some background tests, it was concluded that the `nomic-embed-text-v1` model produces better embeddings for clustering
so we will use this model for the specialized task type for clustering.

In [ ]:
from nomic import embed

nomic_embeddings_clustering = embed.text(
    texts=docs,
    model='nomic-embed-text-v1',
    task_type='clustering'
)

nomic_embeddings_classification = embed.text(
    texts=docs,
    model='nomic-embed-text-v1.5',
    task_type='classification'
)

We can now add the embeddings to the vector database using `chromadb` into two different collections, one for clustering and one for classification.

In [ ]:
collection_clustering.add(
    documents=docs,
    embeddings=nomic_embeddings_clustering['embeddings'],
    metadatas=[{"word": word} for word in docs],
    ids=[f"word_{i}" for i in range(len(docs))]
)

collection_classification.add(
    documents=docs,
    embeddings=nomic_embeddings_classification['embeddings'],
    metadatas=[{"word": word} for word in docs],
    ids=[f"word_{i}" for i in range(len(docs))]
)
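
Conceptually, querying a vector database means finding the stored vectors closest to a query vector. The sketch below shows a brute-force version of that lookup in plain Python, using cosine similarity on made-up 2-dimensional vectors. It is an illustration only: it does not call `chromadb`, the distance metric a database actually uses may differ, and `nearest` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the k words whose vectors are most similar to `query`."""
    ranked = sorted(
        vectors,
        key=lambda word: cosine_similarity(query, vectors[word]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 2-dimensional vectors, not real embeddings
toy_vectors = {
    "Increase": [1.0, 0.1],
    "Addition": [0.9, 0.2],
    "Decrease": [-1.0, 0.1],
}
print(nearest([1.0, 0.0], toy_vectors))  # ['Increase', 'Addition']
```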