# Roget's Thesaurus in the 21st Century
---
>Spanakis Panagiotis-Alexios, Pregraduate Student
Department of Management Science and Technology
Athens University of Economics and Business
t8200158@aueb.gr

## Before we start

We first need to go over the dependencies this notebook requires to function properly.

### Dependencies for all the Notebooks in the Project


We will need libraries that are not included in the Python Standard Library; these are managed with Poetry.

To install the dependencies, we will need to run the following commands in the terminal:

1. (If we haven't already installed Poetry)
```bash
pip install poetry
```

2. After we have installed Poetry, we will need to run the following command in the terminal
```bash
poetry install
```

3. (Optional) If we want to use the Jupyter Notebook with the virtual environment created by Poetry, we will need to run the following command in the terminal
```bash
poetry shell
```

4. After we have installed the dependencies, we will need to run the following command in the terminal
```bash
jupyter notebook
```
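
Once the environment is set up, a quick check that the third-party packages can be found may save some debugging later. The snippet below is only a sketch: the helper name `missing_modules` is our own, and the module names are the import names of the libraries listed in this section (`beautifulsoup4` is imported as `bs4`).

```python
from importlib.util import find_spec

# Import names of the third-party packages this notebook relies on.
# Note: the beautifulsoup4 distribution is imported as "bs4".
REQUIRED_MODULES = ["pandas", "requests", "bs4", "chromadb", "nomic"]


def missing_modules(names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, running `poetry install` again inside the project directory is the first thing to try.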


We will use the following libraries:

- `pandas` for data manipulation
---
- `requests` for making HTTP requests
- `beautifulsoup4` for web scraping
---
- `re` for regular expressions
- `os` for file manipulation
- `json` for JSON manipulation
---
- `chromadb` to store the embeddings to a vector database
- `nomic` in order to interact with the nomic API


In [8]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import os
import json
import chromadb
from nomic import embed

## Get Roget's Thesaurus Classification

### Let's explain how the words along with their classes and divisions and sections are organized in the Gutenberg page

The words are organized in the page in the following manner:

1) They belong to a class
2) They belong to a division (if it exists) within the class
3) They belong to a section within the division (if it exists) or within the class

Let's start by getting the page and parsing it

In [9]:
# Get the page
r = requests.get("https://www.gutenberg.org/files/10681/old/20040627-10681-h-body-pos.htm")
# Parse the page
html = r.text
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')


1) We can notice that all classes on the page are represented by a `<dt>` tag with an `<a>` tag inside it. 
The `<a>` tag has a name attribute that starts with **`CLASS`**. The text of the `<a>` tag is the name of the class.


2) We can notice that all divisions on the page are represented by a `<dt>` tag with an `<a>` tag inside it.
The `<a>` tag has a name attribute that starts with **`DIVISION`**. The text of the `<a>` tag is the name of the division.


3) We can notice that all sections on the page are represented by a `<dt>` tag with an `<a>` tag inside it.
The `<a>` tag has a name attribute that starts with **`SECTION`**. The text of the `<a>` tag is the name of the section.

Lastly, the words are represented by a `<dt>` tag with an `<a>` tag inside it. The `<a>` tag has a name attribute that is a number, or a number with a letter at the end.
We notice that the words come after the `<a>` tag whose name attribute is a number, and that this tag sits inside a `<b>` tag.
So, in order to get the words, we must get the text of the `<b>` tag that comes after the `<a>` tag whose name attribute is a number.
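
The naming rules above can be summarized as a small classifier over the `name` attribute. This is a sketch for illustration only: the function `classify_anchor` and the sample names are our own assumptions, not values taken from the page.

```python
import re

# Patterns mirroring the naming rules described above.
# "entry" covers names that are a number, optionally followed by one letter.
ANCHOR_PATTERNS = [
    ("class", re.compile(r"^CLASS")),
    ("division", re.compile(r"^DIVISION")),
    ("section", re.compile(r"^SECTION")),
    ("entry", re.compile(r"^\d+[A-Za-z]?$")),
]


def classify_anchor(name):
    """Return the hierarchy level suggested by an <a name="..."> value, or None."""
    for level, pattern in ANCHOR_PATTERNS:
        if pattern.search(name):
            return level
    return None


# Hypothetical attribute values, used only to exercise each rule.
for sample in ["CLASS1", "DIVISION1", "SECTION1", "1", "104a", "top"]:
    print(sample, "->", classify_anchor(sample))
```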

With the above in mind, we can extract the hierarchy of classes, divisions, sections, and words from the page.
We can then build a nested structure that mirrors the structure of the page and store the words under each section,
using a dictionary to hold the entire hierarchy.
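
To make the target structure concrete, the dictionary could take roughly the following shape. The class, division, and section names here are placeholders, and `words_for` is a hypothetical helper added only to show how a section is reached with and without a division.

```python
# Hypothetical class, division, and section names, used only to show the shape.
example_hierarchy = {
    "CLASS WITH DIVISIONS": {
        "divisions": {
            "DIVISION A": {
                "sections": {"SECTION X": ["alpha", "beta"]},
            },
        },
        "sections": {},
    },
    "CLASS WITHOUT DIVISIONS": {
        "divisions": {},
        "sections": {"SECTION Y": ["gamma"]},
    },
}


def words_for(hierarchy, class_name, section_name, division_name=None):
    """Look up a section's word list, going through a division when one is given."""
    node = hierarchy[class_name]
    if division_name is not None:
        node = node["divisions"][division_name]
    return node["sections"][section_name]


print(words_for(example_hierarchy, "CLASS WITH DIVISIONS", "SECTION X", "DIVISION A"))
print(words_for(example_hierarchy, "CLASS WITHOUT DIVISIONS", "SECTION Y"))
```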

In [10]:
# Initialize a dictionary to hold the entire hierarchy
hierarchy = {}
current_class = None
current_division = None
current_section = None

# Find all <dt> tags
dt_tags = soup.find_all('dt')

With the dictionary initialized and the `<dt>` tags found, we can now iterate through each `<dt>` tag and extract the hierarchy of classes, divisions, sections, and words from the page. Words separated by a comma or a dot are split and added to the hierarchy separately.

We will use the following logic to extract the hierarchy:

1) If the `<dt>` tag has an `<a>` tag with a name attribute that starts with **`CLASS`** then we have found a class. We will add the class to the hierarchy and continue to the next `<dt>` tag to check for division/section.
2) If the `<dt>` tag has an `<a>` tag with a name attribute that starts with **`DIVISION`** then we have found a division. We will add the division to the hierarchy and continue to the next `<dt>` tag to check for section.
3) If the `<dt>` tag has an `<a>` tag with a name attribute that starts with **`SECTION`** then we have found a section. We will add the section to the hierarchy and continue to the next `<dt>` tag to check for words.
4) Lastly, if we have already found a section, then the `<dt>` tags that follow contain its words. We will add the words to the hierarchy under the section.
5) We will also exclude common stop words such as "adj", "plur", "adv", "lat", "v", "n", "c", "i", "no", "and", "phr" from the words, convert the remaining words to lowercase,
and remove any duplicates.
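
The cleaning in steps 4 and 5 can be tried in isolation on a single line of text. The following is a minimal sketch under a few assumptions: `extract_words` is a name we introduce here, the sample line is made up, and the result is sorted only to make the output reproducible.

```python
import re

# Abbreviations and filler tokens to drop, taken from the list above.
STOP_WORDS = {"adj", "plur", "adv", "lat", "v", "n", "c", "i", "no", "and", "phr"}


def extract_words(text, stop_words=STOP_WORDS):
    """Return the sorted, de-duplicated lowercase words found in a line of text."""
    # Drop bracketed annotations such as "[Lat.]" before tokenizing.
    text = re.sub(r"\[.*?\]", "", text)
    # Keep alphabetic tokens of at least two letters.
    tokens = re.findall(r"\b[a-zA-Z]{2,}\b", text)
    return sorted({t.lower() for t in tokens if t.lower() not in stop_words})


# A made-up line in the style of a thesaurus entry.
sample = "Existence. N. existence, being, entity; ens [Lat.], esse [Lat.]. Adj. existing."
print(extract_words(sample))
# → ['being', 'ens', 'entity', 'esse', 'existence', 'existing']
```

Single letters such as "N" never reach the stop-word check because the pattern requires at least two letters.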

In [15]:
# Define a set of common stop words to exclude
stop_words = {"adj", "plur", "adv", "lat", "v", "n", "c", "i", "no", "and", "phr"}

for dt in dt_tags:
    class_a_tag = dt.find('a', attrs={'name': re.compile("^CLASS")})
    if class_a_tag:
        current_class = re.sub(r'\s+', ' ', class_a_tag.text).strip()
        hierarchy[current_class] = {'divisions': {}, 'sections': {}}
        current_division = None
        current_section = None
        continue

    division_a_tag = dt.find('a', attrs={'name': re.compile("^DIVISION")})
    if division_a_tag and current_class:
        current_division = re.sub(r'\s+', ' ', division_a_tag.text).strip()
        hierarchy[current_class]['divisions'][current_division] = {'sections': {}}
        current_section = None
        continue

    section_a_tag = dt.find('a', attrs={'name': re.compile("^SECTION")})
    if section_a_tag:
        current_section = re.sub(r'\s+', ' ', section_a_tag.text).strip()
        if current_division:
            hierarchy[current_class]['divisions'][current_division]['sections'][current_section] = []
        else:
            hierarchy[current_class]['sections'][current_section] = []
        continue

    if current_section:
        text = dt.get_text(separator=' ', strip=True)
        text = re.sub(r'\[.*?\]', '', text)
        words = re.findall(r'\b[a-zA-Z]{2,}\b', text)
        filtered_words = set(word.lower() for word in words if word.lower() not in stop_words)
        
        if current_division:
            hierarchy[current_class]['divisions'][current_division]['sections'][current_section].extend(list(filtered_words))
            hierarchy[current_class]['divisions'][current_division]['sections'][current_section] = list(set(hierarchy[current_class]['divisions'][current_division]['sections'][current_section]))
        else:
            hierarchy[current_class]['sections'][current_section].extend(list(filtered_words))
            hierarchy[current_class]['sections'][current_section] = list(set(hierarchy[current_class]['sections'][current_section]))

for class_key, class_value in hierarchy.items():
    for section_key in class_value['sections']:
        class_value['sections'][section_key] = list(set(class_value['sections'][section_key]))
    for division_key, division_value in class_value['divisions'].items():
        for section_key in division_value['sections']:
            division_value['sections'][section_key] = list(set(division_value['sections'][section_key]))

print(hierarchy)




Now that we have the hierarchy of classes, divisions, sections, and words, we can take a look at the hierarchy we created.

In [16]:
hierarchy['WORDS EXPRESSING ABSTRACT RELATIONS']

{'divisions': {},
 'sections': {'EXISTENCE': ['earth',
   'abeyance',
   'object',
   'principle',
   'inhesion',
   'mold',
   'substantiality',
   'nirvana',
   'creature',
   'features',
   'environment',
   'emergence',
   'of',
   'peculiarities',
   'nullity',
   'moods',
   'declensions',
   'tone',
   'footing',
   'point',
   'oddity',
   'space',
   'objectiveness',
   'faggot',
   'status',
   'formal',
   'abstract',
   'one',
   'speciality',
   'core',
   'disposition',
   'lot',
   'ontology',
   'nothing',
   'nihil',
   'fool',
   'pith',
   'tabula',
   'reality',
   'nihility',
   'absence',
   'nil',
   'imagination',
   'air',
   'esse',
   'talk',
   'crasis',
   'are',
   'form',
   'uncertainty',
   'actuality',
   'ii',
   'in',
   'thing',
   'oblivion',
   'john',
   'inmost',
   'nature',
   'inanity',
   'marrow',
   'fabric',
   'non',
   'annihilation',
   'appearance',
   'juncture',
   'power',
   'circumstance',
   'turn',
   'ignis',
   'humor',
   'd

Let's now save the hierarchy to a JSON file so that we can use it later.

In [17]:
# Save the hierarchy to a json file

with open('hierarchy_full.json', 'w') as file:
    json.dump(hierarchy, file)
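
Before moving on, it can be reassuring to confirm that the saved file loads back and to see how many sections and words each class holds. The sketch below uses toy data with placeholder names and a temporary file; `summarize_hierarchy` is a helper we introduce for illustration, and the same function could be pointed at the contents of `hierarchy_full.json`.

```python
import json
import os
import tempfile


def summarize_hierarchy(hierarchy):
    """Return {class_name: (number_of_sections, number_of_words)} for a hierarchy dict."""
    summary = {}
    for class_name, class_value in hierarchy.items():
        # Sections directly under the class plus those nested in divisions.
        section_lists = list(class_value["sections"].values())
        for division_value in class_value["divisions"].values():
            section_lists.extend(division_value["sections"].values())
        summary[class_name] = (len(section_lists), sum(len(w) for w in section_lists))
    return summary


# Toy data with hypothetical names; the real input would be hierarchy_full.json.
toy = {
    "CLASS A": {
        "divisions": {"DIVISION 1": {"sections": {"S1": ["x", "y"]}}},
        "sections": {"S2": ["z"]},
    }
}

# Round-trip through a temporary JSON file to mimic save-then-load.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_hierarchy.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(toy, file)
    with open(path, "r", encoding="utf-8") as file:
        reloaded = json.load(file)

print(summarize_hierarchy(reloaded))
```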

## Get Word Embeddings

We can get embedding models from various sources, such as `GloVe`, `fastText`, `Word2Vec`, and `BERT`.
Newer models, such as `Gemini` or OpenAI's `text-embedding-3`, are more advanced and can provide better embeddings.

In our case, we will use the brand new `Nomic` embeddings and more specifically the `nomic-embed-text-v1`,
a text embedding model that supports variable output sizes.
Source: [Nomic Embeddings Descriptions](https://docs.nomic.ai/atlas/models/text-embedding).

It is worth noting that `nomic-embed-text-v1` has an MTEB score of 62.39, which is a very good score and close to
the 64.6 of OpenAI's `text-embedding-3-large`, with the same input sequence length of 8192.
In addition, `nomic-embed-text-v1` is open source and free to use, with a free trial of 1M tokens.

> Note: The nomic-embed-text-v1.5 model was also tested, 
> but it did not provide better embeddings for the thesaurus according to some background tests for classification and clustering tasks.

More Details here : [Nomic Embeddings](https://blog.nomic.ai/posts/nomic-embed-text-v1)

We will save two different versions of the embeddings. 
One will be embeddings with a task type specialized for clustering, 
and one will be embeddings with a task type specialized for classification.

The best option to use Nomic Embed is through the production-ready Nomic Embedding API.

We can access the API via `nomic`, a Python client for the Nomic API.
We will use the `nomic` library to get the embeddings for the words in the thesaurus.

We can get a Nomic Atlas API key at [Nomic Atlas](https://atlas.nomic.ai/) after following the instructions to create an account and get the API key.

Then, in our terminal, we need to run the following command to log in with the API key
```bash
nomic login YOUR_API_KEY
```

With the above set, we can now get the embedding model and use it to get the embeddings for the words in the thesaurus.

Now we will store the embeddings to a vector database using `chromadb` which is a vector database that can store and query embeddings.

In [30]:
client = chromadb.PersistentClient(path="./chroma_db")
collection_clustering = client.get_or_create_collection(name="nomic_clustering_v1_new")
collection_classification = client.get_or_create_collection(name="nomic_classification_v1_new")

Before building the DataFrame, we load the hierarchy back from the JSON file we saved earlier.

In [19]:
with open("hierarchy_full.json", "r") as f:
    categories = json.load(f)

Let's now retrieve the words along with their respective classes, divisions, and sections from the hierarchy and store them in a DataFrame.

In [20]:
def json_to_df(categories, class_name=None, division_name='N/A'):
    df = pd.DataFrame()
    for key, value in categories.items():
        # Check if the current key is a class
        if 'divisions' in value and 'sections' in value:
            class_df = json_to_df(value['divisions'], class_name=key)
            section_df = json_to_df(value['sections'], class_name=key,
                                    division_name='N/A')  # Reset division for sections
            df = pd.concat([df, class_df, section_df], ignore_index=True)
        elif isinstance(value, dict):  # This is a division or a section
            if 'sections' in value:  # This is a division
                sub_df = json_to_df(value['sections'], class_name=class_name, division_name=key)
            else:  # This is a section without a division
                sub_df = json_to_df(value, class_name=class_name, division_name=division_name)
            df = pd.concat([df, sub_df], ignore_index=True)
        else:  # This is the actual list of words
            for word in value:
                new_row = pd.DataFrame({
                    "word": [word],
                    "class": [class_name],
                    "division": [division_name],
                    "section": [key],
                })
                df = pd.concat([df, new_row], ignore_index=True)
    return df


categories_df = json_to_df(categories)

# Display the DataFrame (the notebook shows its first and last rows)
categories_df

| | word | class | division | section |
|---|---|---|---|---|
| 0 | earth | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE |
| 1 | abeyance | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE |
| 2 | object | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE |
| 3 | principle | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE |
| 4 | inhesion | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE |
| ... | ... | ... | ... | ... |
| 44124 | resurrection | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS |
| 44125 | nine | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS |
| 44126 | bible | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS |
| 44127 | testament | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS |
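
As an alternative to the recursive function, the same flattening can be sketched by collecting plain row dictionaries first. This is only an illustration under our own assumptions: `flatten_hierarchy` and the toy names are hypothetical, and the input is expected to have the `divisions`/`sections` layout used in this notebook.

```python
def flatten_hierarchy(hierarchy):
    """Return one {'word', 'class', 'division', 'section'} dict per word."""
    rows = []
    for class_name, class_value in hierarchy.items():
        # Pair each group of sections with its division name ('N/A' when absent).
        groups = [
            (division_name, division_value["sections"])
            for division_name, division_value in class_value["divisions"].items()
        ]
        groups.append(("N/A", class_value["sections"]))
        for division_name, sections in groups:
            for section_name, words in sections.items():
                for word in words:
                    rows.append({
                        "word": word,
                        "class": class_name,
                        "division": division_name,
                        "section": section_name,
                    })
    return rows


# Toy hierarchy with hypothetical names.
toy_hierarchy = {
    "CLASS A": {
        "divisions": {"DIVISION 1": {"sections": {"S1": ["x", "y"]}}},
        "sections": {"S2": ["z"]},
    }
}

toy_rows = flatten_hierarchy(toy_hierarchy)
print(len(toy_rows))
print(toy_rows[0])
```

Passing the resulting list to `pd.DataFrame(rows)` would build the table in a single step, which avoids calling `pd.concat` once per word.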


We can now use the `nomic` library to get the embeddings for the words in the thesaurus with the `nomic-embed-text-v1` model
and the specialized task type for clustering and classification.

In [21]:
nomic_embeddings_clustering = embed.text(
    texts=categories_df['word'].tolist(),
    model='nomic-embed-text-v1',
    task_type='clustering'
)

nomic_embeddings_classification = embed.text(
    texts=categories_df['word'].tolist(),
    model='nomic-embed-text-v1',
    task_type='classification'
)

Let's separate the different embeddings and store them in the DataFrame.

In [22]:
categories_df['embedding_clustering'] = nomic_embeddings_clustering['embeddings']
categories_df['embedding_classification'] = nomic_embeddings_classification['embeddings']

With that done, let's take a look at the DataFrame with the embeddings.

In [23]:
categories_df

| | word | class | division | section | embedding_clustering | embedding_classification |
|---|---|---|---|---|---|---|
| 0 | earth | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE | [0.04272461, 0.033111572, -0.021224976, -0.031... | [0.024032593, 0.022918701, -0.024398804, -0.06... |
| 1 | abeyance | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE | [0.01852417, 0.039001465, -0.02949524, -0.0682... | [0.005935669, 0.02468872, -0.03262329, -0.0731... |
| 2 | object | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE | [-0.015281677, 0.0048713684, -0.0038871765, -0... | [-0.03942871, -0.018493652, 0.0003092289, -0.0... |
| 3 | principle | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE | [0.039489746, 0.047302246, -0.011039734, -0.05... | [0.021484375, 0.016967773, -0.009857178, -0.06... |
| 4 | inhesion | WORDS EXPRESSING ABSTRACT RELATIONS | | EXISTENCE | [0.01637268, -0.032806396, -0.025100708, -0.04... | [0.015586853, -0.045928955, -0.027908325, -0.0... |
| ... | ... | ... | ... | ... | ... | ... |
| 44124 | resurrection | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS | [0.012008667, 0.018371582, -0.018218994, -0.03... | [-0.0010471344, -0.0064468384, -0.022781372, -... |
| 44125 | nine | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS | [0.038970947, 0.0703125, -0.021026611, -0.0572... | [0.014175415, 0.049072266, -0.019195557, -0.09... |
| 44126 | bible | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS | [-0.00028800964, 0.022644043, -0.019973755, -0... | [-0.024993896, -0.0103302, -0.01550293, -0.072... |
| 44127 | testament | WORDS RELATING TO THE SENTIENT AND MORAL POWERS | | RELIGIOUS AFFECTIONS | [0.021850586, -0.025665283, -0.018066406, -0.0... | [-0.0036010742, -0.045654297, -0.019226074, -0... |


Now, in order to efficiently store the class, division, and section of each word, we will prepare a dictionary with the word as the key and the class, division, and section as the values.

In [24]:
# Create a dictionary where each word maps to its corresponding class, division, and section
words_dict = categories_df.groupby('word').agg('first').to_dict('index')

# Create the metadata list using list comprehension
metadatas = [
    {
        "class": words_dict[word]['class'],
        "division": words_dict[word].get('division', 'N/A'),
        "section": words_dict[word]['section']
    }
    for word in categories_df['word'] if word in words_dict
]
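
Note that grouping by word keeps only the first class, division, and section seen for each word, so a word listed under several sections gets the same metadata on every one of its rows. If per-row metadata is preferred, it could be built directly from the rows, as in the sketch below. The helper `build_metadatas` and the sample rows are hypothetical, and a missing division is assumed to be `None` or an empty string.

```python
def build_metadatas(rows):
    """Return one metadata dict per row, in the same order as the rows."""
    return [
        {
            "class": row["class"],
            # Fall back to 'N/A' when the division is missing or empty.
            "division": row.get("division") or "N/A",
            "section": row["section"],
        }
        for row in rows
    ]


# Hypothetical rows: the same word listed under two different sections.
sample_rows = [
    {"word": "form", "class": "CLASS A", "division": None, "section": "EXISTENCE"},
    {"word": "form", "class": "CLASS B", "division": "DIVISION 1", "section": "SHAPE"},
]
metadatas_by_row = build_metadatas(sample_rows)
print(metadatas_by_row)
```

With a DataFrame, the rows could be obtained with `categories_df.to_dict('records')`.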

With the metadata in place, we can now add the embeddings to the vector database using `chromadb` into two different collections, one for clustering and one for classification.

In [31]:
from tqdm import tqdm


def add_in_batches(collection, documents, embeddings, metadatas, ids, max_batch_size=5461):
    """Add items to a chromadb collection in slices of at most max_batch_size."""
    total_items = len(documents)
    for start_idx in tqdm(range(0, total_items, max_batch_size)):
        end_idx = min(start_idx + max_batch_size, total_items)
        batch_documents = documents[start_idx:end_idx]
        batch_embeddings = embeddings[start_idx:end_idx]
        batch_metadatas = metadatas[start_idx:end_idx]
        batch_ids = ids[start_idx:end_idx]
        
        collection.add(
            documents=batch_documents,
            embeddings=batch_embeddings,
            metadatas=batch_metadatas,
            ids=batch_ids
        )

# Assuming categories_df, nomic_embeddings_clustering, nomic_embeddings_classification, and metadatas are already defined
documents = categories_df['word'].tolist()
embeddings_clustering = nomic_embeddings_clustering['embeddings']
embeddings_classification = nomic_embeddings_classification['embeddings']
ids = [f"word_{i}" for i in range(len(documents))]

# Add documents in batches to the clustering collection
add_in_batches(collection_clustering, documents, embeddings_clustering, metadatas, ids)

# Add documents in batches to the classification collection
add_in_batches(collection_classification, documents, embeddings_classification, metadatas, ids)

100%|██████████| 9/9 [00:27<00:00,  3.04s/it]
100%|██████████| 9/9 [00:23<00:00,  2.61s/it]
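
The nine iterations reported by each progress bar follow from the row count and the batch size. The slicing arithmetic can be reproduced on its own, as sketched below; `batch_bounds` is a name introduced here for illustration, and 44128 is the number of rows in the DataFrame shown earlier (indices 0 to 44127).

```python
def batch_bounds(total_items, max_batch_size):
    """Yield (start, end) index pairs covering range(total_items) in order."""
    for start in range(0, total_items, max_batch_size):
        yield start, min(start + max_batch_size, total_items)


# Row count and batch size taken from the cells above.
bounds = list(batch_bounds(44128, 5461))
print(len(bounds))   # → 9
print(bounds[-1])    # → (43688, 44128)
```

Eight full batches of 5461 items cover 43688 rows, and the ninth batch holds the remaining 440.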


### Now we have successfully stored the embeddings to a vector database using `chromadb` and we can use them for clustering and classification tasks.