In this notebook I process the thesaurus dataset and extract the words and phrases from it. I got the thesaurus in plain text format from the following link: https://www.gutenberg.org/ebooks/10681. I chose this format and process it with a mixed strategy of regex and LLMs, as you will see in the following cells.

In [1]:
import re
import json 

In [2]:
with open('pg10681.txt', 'r', encoding='utf-8') as file:
    file_content = file.read()

My strategy is to find the classes of the thesaurus and get the text of each class, and then to repeat the same approach on smaller parts to get the divisions and sections of each class. I used regex to find the classes, divisions and sections, and a further step to find the words and phrases within the sections. The code for each part is shown in the following cells.

The first part is to extract the classes from the thesaurus and their text. I used the following code to do this.

In [3]:
def extract_classes_with_text(text):
    class_pattern = r'(CLASS\s+[IVXLCDM]+\n[^\n]+)'  # Pattern to match class titles
    end_marker = "End of E-Thesaurus"  # Define the end marker
    
    # Find all class titles with the positions where each title starts and ends
    classes = [(match.group(), match.start(), match.end()) for match in re.finditer(class_pattern, text)]
    
    # Add an artificial end marker for the last class to simplify logic
    end_match = re.search(re.escape(end_marker), text)
    end_position = end_match.start() if end_match else len(text)
    classes.append(("END", end_position, end_position))
    
    class_dict = {}
    # Iterate through classes to assign text to each class
    for i in range(len(classes) - 1):
        class_title = classes[i][0].strip()
        start_index = classes[i][2]    # text begins where this title ends
        end_index = classes[i + 1][1]  # and stops where the next title begins
        # Extract text for the current class (without the next class's title)
        class_text = text[start_index:end_index].strip()
        class_dict[class_title] = class_text
    
    return class_dict
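
To see the slicing idea in isolation, here is a minimal sketch on a made-up two-class sample. The helper name `split_on_titles` and the sample text are assumptions for illustration only, not part of the notebook. Each body is taken from the end of its own title up to the start of the next title.

```python
import re

def split_on_titles(text, title_pattern):
    """Map each matched title to the text between it and the next title (illustrative helper)."""
    matches = list(re.finditer(title_pattern, text))
    result = {}
    for i, match in enumerate(matches):
        body_start = match.end()
        # The body stops where the next title starts, or at the end of the text
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        result[match.group().strip()] = text[body_start:body_end].strip()
    return result

# Made-up sample that imitates the layout of the thesaurus file
sample = (
    "CLASS I\nWORDS EXPRESSING ABSTRACT RELATIONS\n"
    "1. Existence -- N. existence, being.\n"
    "CLASS II\nWORDS RELATING TO SPACE\n"
    "180. Space -- N. space, extension.\n"
)
parts = split_on_titles(sample, r'CLASS\s+[IVXLCDM]+\n[^\n]+')
for title, body in parts.items():
    print(repr(title), '->', repr(body))
```

The same slicing applies unchanged to divisions and sections; only the title pattern differs.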



Helper that replaces the newlines in the dictionary keys with ': ', so that each class title becomes a single line.

In [4]:
def remove_newlines_from_dict_keys(input_dict):
    modified_dict = {key.replace('\n', ': '): value for key, value in input_dict.items()}
    return modified_dict
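
As a quick illustration of the effect on a key, here is a small sketch with a made-up one-entry dictionary (the key and value are assumptions, chosen to imitate a class title):

```python
# Made-up entry: the class title spans two lines, as matched by the class pattern
raw = {"CLASS I\nWORDS EXPRESSING ABSTRACT RELATIONS": "text of the class"}

# Same transformation as in the cell above: join the two title lines with ': '
flattened = {key.replace('\n', ': '): value for key, value in raw.items()}
print(list(flattened))
```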


The same idea is used to extract the divisions of each class. Classes without divisions keep their whole text under a "NO_DIVISION" key.

In [5]:
def extract_divisions(class_dict):
    division_pattern = r"DIVISION\s(?:I|V|X|L|C|D|M)+\n[A-Z\s]+\n"  # Regex pattern for division titles

    for class_title, class_text in class_dict.items():
        # Find all division titles with the positions where each title starts and ends
        divisions = [(match.group().strip(), match.start(), match.end()) for match in re.finditer(division_pattern, class_text)]
        
        # Handle case with no divisions
        if not divisions:
            class_dict[class_title] = {"NO_DIVISION": class_text}
            continue

        # Add an artificial end marker for the last division to simplify logic
        divisions.append(("END", len(class_text), len(class_text)))

        division_dict = {}
        # Iterate through divisions to assign text to each division
        for i in range(len(divisions) - 1):
            division_title = divisions[i][0].strip()
            start_index = divisions[i][2]    # text begins where this title ends
            end_index = divisions[i + 1][1]  # and stops where the next title begins
            # Extract text for the current division (without the next division's title)
            division_text = class_text[start_index:end_index].strip()
            division_dict[division_title] = division_text

        # Update the class in the dictionary with its divisions
        class_dict[class_title] = division_dict

    return class_dict


The same idea is used to extract the sections of each division. Divisions without sections keep their whole text under a "NO_SECTION" key.

In [6]:
def refine_section_extraction(class_dict):
    # Pattern to match section identifiers and optional descriptive titles
    full_section_pattern = r'(SECTION\s+[IVXLCDM]+\.)\s*([^\n]*)'
    
    for class_title, divisions in class_dict.items():
        for division_title, division_text in divisions.items():
            # Temporary dictionary to store sections for the current division
            temp_section_dict = {}
            
            # Find all matches for the full section pattern
            matches = list(re.finditer(full_section_pattern, division_text))
            
            for i, match in enumerate(matches):
                # Determine the start of the next match or use the end of the division text if at the last match
                end_index = matches[i + 1].start() if i + 1 < len(matches) else len(division_text)
                
                # Extract the full section title, combining the identifier and the optional descriptive title
                full_section_title = match.group(1) + (' ' + match.group(2).strip() if match.group(2).strip() else '')
                # Extract the text for this section
                section_text = division_text[match.end():end_index].strip()
                
                temp_section_dict[full_section_title] = section_text
            
            # If no sections were found, use a placeholder
            if not temp_section_dict:
                temp_section_dict["NO_SECTION"] = division_text
            
            # Update the division entry with its sections
            divisions[division_title] = temp_section_dict

    return class_dict
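
The sketch below shows what the two groups of the section pattern capture. The sample text is made up to imitate the file layout; it is only an illustration, not an excerpt checked against the source file.

```python
import re

# Same pattern as in the cell above: group 1 is the identifier, group 2 the optional title
full_section_pattern = r'(SECTION\s+[IVXLCDM]+\.)\s*([^\n]*)'

# Made-up sample with two sections
sample = (
    "SECTION I. EXISTENCE\n"
    "1. Existence -- N. existence, being.\n"
    "SECTION II. RELATION\n"
    "9. Relation -- N. relation, bearing.\n"
)

found = [(m.group(1), m.group(2).strip()) for m in re.finditer(full_section_pattern, sample)]
for identifier, description in found:
    print(identifier, '|', description)
```

Note that `\s*` also matches a newline, so for an identifier with nothing after it on the same line, group 2 would capture the following line instead of staying empty.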


The following method handles subsection titles such as "1. BEING, IN THE ABSTRACT" inside a section. Sections that contain no such title keep their text under a "NO_TITLE" key.

In [7]:
def process_sections_with_no_title_key(class_dict):
    # Adjusted pattern for subsection titles within the section text
    adjusted_pattern_for_subsections = r'^\d\.\s?([A-Z]+([,\s]+[A-Z]+)*([,\s]+[a-z]+)*)*$'

    for class_title, divisions in class_dict.items():
        for division_title, sections in divisions.items():
            updated_sections = {}
            for section_title, section_text in sections.items():
                # Attempt to capture the full section title and text excluding this title
                full_title_search = re.search(r'(SECTION\s+[IVXLCDM]+\.)\s*([^\n]+)', section_text)
                if full_title_search:
                    full_section_title = full_title_search.group(1) + " " + full_title_search.group(2).strip()
                    section_text_without_title = section_text[len(full_section_title):].strip()
                else:
                    full_section_title = section_title
                    section_text_without_title = section_text

                titles_matches = list(re.finditer(adjusted_pattern_for_subsections, section_text_without_title, re.MULTILINE))
                if titles_matches:
                    title_dict = {}
                    for k in range(len(titles_matches)):
                        title = titles_matches[k].group(0).strip()
                        title_start_index = titles_matches[k].end()
                        title_end_index = titles_matches[k + 1].start() if k + 1 < len(titles_matches) else len(section_text_without_title)
                        subsection_text = section_text_without_title[title_start_index:title_end_index].strip()
                        title_dict[title] = subsection_text

                    updated_sections[full_section_title] = title_dict
                else:
                    updated_sections[full_section_title] = {"NO_TITLE": section_text_without_title}

            divisions[division_title] = updated_sections

    return class_dict
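
To show which lines the subsection pattern accepts, here is a small sketch on made-up lines. All-caps titles with a single leading digit match, while ordinary numbered entries do not. The sample text is an assumption for illustration.

```python
import re

# Same pattern as in the cell above
subsection_title_pattern = r'^\d\.\s?([A-Z]+([,\s]+[A-Z]+)*([,\s]+[a-z]+)*)*$'

# Made-up sample: two all-caps subsection titles and two ordinary entries
sample = (
    "1. BEING, IN THE ABSTRACT\n"
    "1. Existence -- N. existence, being.\n"
    "2. BEING, IN THE CONCRETE\n"
    "3. Substantiality -- N. substantiality.\n"
)

titles = [m.group(0) for m in re.finditer(subsection_title_pattern, sample, re.MULTILINE)]
print(titles)
```

Because the pattern starts with a single `\d`, a title numbered 10 or higher would not be recognised.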

The text under each title sometimes starts with a subtitle that is not needed, for example "Internal conditions" under "3. FORMAL EXISTENCE". The following method removes it: if the text does not start with a number, the first two lines are dropped. The remaining text is then split into its numbered entries.

In [8]:
def transform_title_values(class_dict):
    for class_title, divisions in class_dict.items():
        for division_title, sections in divisions.items():
            for section_title, section in sections.items():
                for title, text in section.items():
                    # Determine if the text starts with a number
                    if not text.strip().startswith(tuple('0123456789')):
                        lines = text.splitlines()
                        remaining_lines = lines[2:]
                        text = '\n'.join(remaining_lines)  # Rejoin the remaining lines
                    
                    # Define the pattern that marks the start of each numbered entry
                    entry_pattern = r'(?=\n\d+[a-z]?\. )'
                    # Split the text into entries (named so it does not shadow the outer `sections`)
                    entries = re.split(entry_pattern, text)
                    
                    # Update the text for the title with the list of entries
                    section[title] = entries

    return class_dict
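
The split relies on a zero-width lookahead, so nothing is consumed and every entry keeps its own number. A minimal sketch on made-up text (the entry "2a." is invented to exercise the optional letter suffix); splitting on empty matches requires Python 3.7 or later:

```python
import re

# Same pattern as in the cell above: a newline followed by a number,
# an optional lowercase letter, a dot and a space
entry_pattern = r'(?=\n\d+[a-z]?\. )'

# Made-up sample text with three numbered entries
text = (
    "1. Existence -- N. existence, being.\n"
    "2. Inexistence -- N. inexistence, nonentity.\n"
    "2a. Sample -- N. sample entry."
)

entries = re.split(entry_pattern, text)
for entry in entries:
    print(repr(entry))
```

Every entry after the first keeps its leading newline; the later whitespace normalisation in the cleaning step takes care of it.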


Method that exports the dictionary to a JSON file.

In [9]:
def write_data_to_json_file(data, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)


The following code extracts the classes, divisions and sections and writes the result to the thesaurus.json file. At this stage each entry still contains raw text like this:

`3. Substantiality -- N. substantiality, hypostasis; person, being,
thing, object, article, item; something, a being, an existence;
creature, body, substance, flesh and blood, stuff, substratum; matter
&c 316; corporeity^, element, essential nature, groundwork, materiality,
substantialness, vital part.
     [Totality of existences], world &c 318; plenum.
Adj. substantive, substantial; hypostatic; personal, bodily, tangible
&c (material) 316; corporeal.
Adv. substantially &c adj.; bodily, essentially.`

This text still has to be cleaned to get the individual words and phrases.

In [10]:
classes = extract_classes_with_text(file_content)
classes = remove_newlines_from_dict_keys(classes)
classes = extract_divisions(classes)
classes = refine_section_extraction(classes)
classes = process_sections_with_no_title_key(classes)
classes = transform_title_values(classes)
write_data_to_json_file(classes, 'thesaurus.json')


In [11]:
def clean_text(text):
    cleaned_text = re.sub(r'\d+', '', text)  # Remove digits
    cleaned_text = re.sub(r'&c\.?', '', cleaned_text)  # Remove "&c" and "&c."
    pattern = r".*?-- N\."  # Entry heading up to "-- N." (raw string avoids an invalid escape)
    cleaned_text = re.sub(pattern, '', cleaned_text)
    cleaned_text = re.sub(r'\^', '', cleaned_text)  # Remove "^" markers
    text_markers_pattern = r'\b(?:Adv|adj|adv|Verb|Phr|V|v)\.?\b'  # Part-of-speech labels
    cleaned_text = re.sub(text_markers_pattern, '', cleaned_text, flags=re.IGNORECASE)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Collapse whitespace
    cleaned_text = re.sub(r'\{[^}]*\}', '', cleaned_text)  # Remove {...} annotations
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    cleaned_text = cleaned_text.lower()
    cleaned_text = re.sub(r'\[.*?\]', '', cleaned_text)  # Remove [...] annotations
    # Split on separators and drop empty tokens
    tokens = [token.strip() for token in re.split('[,;.]', cleaned_text) if token.strip()]
    return tokens
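
To make the effect of the individual steps visible, the sketch below applies the main substitutions one at a time to a shortened, made-up entry and prints the intermediate text. The `steps` list and the sample are assumptions for illustration; the `{...}` and `[...]` removals are left out because the sample contains neither.

```python
import re

# Shortened, made-up entry in the style of the raw thesaurus text
sample = (
    "3. Substantiality -- N. substantiality, hypostasis; matter\n"
    "&c 316; corporeity^.\n"
    "Adj. substantive, substantial."
)

# (description, pattern, flags), applied in the same order as in clean_text
steps = [
    ("digits", r'\d+', 0),
    ("'&c' cross-reference markers", r'&c\.?', 0),
    ("heading up to '-- N.'", r'.*?-- N\.', 0),
    ("'^' markers", r'\^', 0),
    ("part-of-speech labels", r'\b(?:Adv|adj|adv|Verb|Phr|V|v)\.?\b', re.IGNORECASE),
]

text = sample
for description, pattern, flags in steps:
    text = re.sub(pattern, '', text, flags=flags)
    print(f"after removing {description}: {text!r}")

# Collapse whitespace, lowercase, then split on the separators
text = re.sub(r'\s+', ' ', text).strip().lower()
tokens = [t.strip() for t in re.split('[,;.]', text) if t.strip()]
print(tokens)
```

The part-of-speech pattern removes the label but leaves its trailing dot; that dot is harmless because the final split treats it as a separator.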


In [12]:
for class_title, divisions in classes.items():
    for division_title, sections in divisions.items():
        for section_title, titles in sections.items():
            for title, list_of_word_sections in titles.items():
                for i, lst in enumerate(list_of_word_sections):
                    cleaned_text = clean_text(lst)
                    titles[title][i] = cleaned_text
                    


We also add a method that removes the irrelevant words that the LLM missed, and we remove duplicates within each class.
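
One possible way to remove duplicates within each class is sketched below. The helper `dedupe_tokens_per_class` and the toy data are hypothetical, not the notebook's own code; the sketch only assumes the nested class, division, section, title layout built in the cells above and keeps the first occurrence of each token. The LLM-based filtering is not sketched here.

```python
def dedupe_tokens_per_class(classes):
    """Keep only the first occurrence of each token within a class (hypothetical helper)."""
    for divisions in classes.values():
        seen = set()  # tokens already kept for the current class
        for sections in divisions.values():
            for titles in sections.values():
                for word_lists in titles.values():
                    for i, tokens in enumerate(word_lists):
                        unique = []
                        for token in tokens:
                            if token not in seen:
                                seen.add(token)
                                unique.append(token)
                        word_lists[i] = unique
    return classes

# Toy data with the same nesting as the `classes` dictionary
toy = {
    "CLASS I": {
        "NO_DIVISION": {
            "SECTION I. EXISTENCE": {
                "NO_TITLE": [["being", "existence"], ["being", "substance"]],
            }
        }
    }
}
dedupe_tokens_per_class(toy)
print(toy["CLASS I"]["NO_DIVISION"]["SECTION I. EXISTENCE"]["NO_TITLE"])
```

A new `seen` set is created for every class, so the same word can still appear in different classes.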

Finally, we write the data to a JSON file.

In [13]:
write_data_to_json_file(classes, 'thesaurus_final.json')