In this notebook I process the thesaurus dataset and extract its words and phrases. I downloaded the thesaurus in plain-text format from the following link: https://www.gutenberg.org/ebooks/10681. I chose this format and process it with a mixed strategy of regular expressions (regex) and LLMs, as shown in the following cells.

In [385]:
import re

In [386]:
with open('pg10681.txt', 'r', encoding='utf-8') as file:
    file_content = file.read()

The strategy is to find the classes of the thesaurus and get the text of each class, then to repeat the same idea at finer levels to get the divisions and sections of each class. I used regex to find the classes, divisions and sections, and a further step to find the words and phrases inside the sections. The code for each part is shown in the following cells.

The first part is to extract the classes from the thesaurus and their text. I used the following code to do this.

In [387]:
def extract_classes_with_text(text):
    class_pattern = r'(CLASS\s+[IVXLCDM]+\n[^\n]+)'  # Pattern to match class titles
    end_marker = "End of E-Thesaurus"  # Define the end marker
    
    # Find all class titles with the start and end positions of each title
    classes = [(match.group(), match.start(), match.end()) for match in re.finditer(class_pattern, text)]
    
    # Add an artificial end marker for the last class to simplify logic
    end_match = re.search(end_marker, text)
    end_position = end_match.start() if end_match else len(text)
    classes.append(("END", end_position, end_position))
    
    class_dict = {}
    # Iterate through classes to assign text to each class
    for i in range(len(classes) - 1):
        class_title = classes[i][0].strip()
        start_index = classes[i][2]    # text starts after the current title
        end_index = classes[i + 1][1]  # and stops before the next title
        # Extract text for the current class
        class_text = text[start_index:end_index].strip()
        class_dict[class_title] = class_text
    
    return class_dict
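
The heading-based slicing used here (and reused for divisions and sections) can be tried on a tiny input. The following is only a minimal sketch: `split_by_headings` is a hypothetical helper name, and the sample string is a made-up miniature, not the real Gutenberg file.

```python
import re

def split_by_headings(text, heading_pattern):
    """Map each heading to the text between it and the next heading."""
    matches = list(re.finditer(heading_pattern, text))
    chunks = {}
    for i, match in enumerate(matches):
        body_start = match.end()
        # The body stops where the next heading starts (or at the end of the text)
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks[match.group().strip()] = text[body_start:body_end].strip()
    return chunks

# Made-up miniature input that imitates the layout of the class headings
sample = (
    "CLASS I\nWORDS EXPRESSING ABSTRACT RELATIONS\n"
    "1. Existence -- N. existence, being.\n"
    "CLASS II\nWORDS RELATING TO SPACE\n"
    "180. Space -- N. space, extension.\n"
)

chunks = split_by_headings(sample, r"CLASS\s+[IVXLCDM]+\n[^\n]+")
for heading, body in chunks.items():
    print(repr(heading), "->", repr(body))
```

Each heading owns the text up to the start of the next heading, so no heading leaks into the body of the previous chunk.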



Helper function that replaces the newlines in the dictionary keys with ": ".

In [388]:
def remove_newlines_from_dict_keys(input_dict):
    modified_dict = {key.replace('\n', ': '): value for key, value in input_dict.items()}
    return modified_dict


The same idea is applied to extract the divisions of each class. A class without divisions keeps its whole text under a `NO_DIVISION` key.

In [389]:
def extract_divisions(class_dict):
    division_pattern = r"DIVISION\s(?:I|V|X|L|C|D|M)+\n[A-Z\s]+\n"  # Regex pattern for division titles

    for class_title, class_text in class_dict.items():
        # Find all division titles with the start and end positions of each title
        divisions = [(match.group().strip(), match.start(), match.end()) for match in re.finditer(division_pattern, class_text)]
        
        # Handle case with no divisions
        if not divisions:
            class_dict[class_title] = {"NO_DIVISION": class_text}
            continue

        # Add an artificial end marker for the last division to simplify logic
        divisions.append(("END", len(class_text), len(class_text)))

        division_dict = {}
        # Iterate through divisions to assign text to each division
        for i in range(len(divisions) - 1):
            division_title = divisions[i][0].strip()
            start_index = divisions[i][2]    # text starts after the current title
            end_index = divisions[i + 1][1]  # and stops before the next title
            # Extract text for the current division
            division_text = class_text[start_index:end_index].strip()
            division_dict[division_title] = division_text

        # Update the class in the dictionary with its divisions
        class_dict[class_title] = division_dict

    return class_dict


The same idea is applied again to extract the sections of each division. A division without sections keeps its whole text under a `NO_SECTION` key.

In [390]:
def refine_section_extraction(class_dict):
    # Pattern to match section identifiers and optional descriptive titles
    full_section_pattern = r'(SECTION\s+[IVXLCDM]+\.)\s*([^\n]*)'
    
    for class_title, divisions in class_dict.items():
        for division_title, division_text in divisions.items():
            # Temporary dictionary to store sections for the current division
            temp_section_dict = {}
            
            # Find all matches for the full section pattern
            matches = list(re.finditer(full_section_pattern, division_text))
            
            for i, match in enumerate(matches):
                # Determine the start of the next match or use the end of the division text if at the last match
                end_index = matches[i + 1].start() if i + 1 < len(matches) else len(division_text)
                
                # Extract the full section title, combining the identifier and the optional descriptive title
                full_section_title = match.group(1) + (' ' + match.group(2).strip() if match.group(2).strip() else '')
                # Extract the text for this section
                section_text = division_text[match.end():end_index].strip()
                
                temp_section_dict[full_section_title] = section_text
            
            # If no sections were found, use a placeholder
            if not temp_section_dict:
                temp_section_dict["NO_SECTION"] = division_text
            
            # Update the division entry with its sections
            divisions[division_title] = temp_section_dict

    return class_dict


The following function processes sections that contain numbered subsection titles such as "1. BEING, IN THE ABSTRACT". Sections without such titles keep their text under a `NO_TITLE` key.

In [391]:
def process_sections_with_no_title_key(class_dict):
    # Adjusted pattern for subsection titles within the section text
    adjusted_pattern_for_subsections = r'^\d\.\s?([A-Z]+([,\s]+[A-Z]+)*([,\s]+[a-z]+)*)*$'

    for class_title, divisions in class_dict.items():
        for division_title, sections in divisions.items():
            updated_sections = {}
            for section_title, section_text in sections.items():
                # Attempt to capture the full section title and text excluding this title
                full_title_search = re.search(r'(SECTION\s+[IVXLCDM]+\.)\s*([^\n]+)', section_text)
                if full_title_search:
                    full_section_title = full_title_search.group(1) + " " + full_title_search.group(2).strip()
                    section_text_without_title = section_text[len(full_section_title):].strip()
                else:
                    full_section_title = section_title
                    section_text_without_title = section_text

                titles_matches = list(re.finditer(adjusted_pattern_for_subsections, section_text_without_title, re.MULTILINE))
                if titles_matches:
                    title_dict = {}
                    for k in range(len(titles_matches)):
                        title = titles_matches[k].group(0).strip()
                        title_start_index = titles_matches[k].end()
                        title_end_index = titles_matches[k + 1].start() if k + 1 < len(titles_matches) else len(section_text_without_title)
                        subsection_text = section_text_without_title[title_start_index:title_end_index].strip()
                        title_dict[title] = subsection_text

                    updated_sections[full_section_title] = title_dict
                else:
                    updated_sections[full_section_title] = {"NO_TITLE": section_text_without_title}

            divisions[division_title] = updated_sections

    return class_dict
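
To see what the subsection-title pattern accepts, it can be run on a few lines. This is a small sketch on made-up sample text that only imitates the layout of the thesaurus; the pattern itself is copied from the function above.

```python
import re

# Same pattern as in process_sections_with_no_title_key
subsection_title = r'^\d\.\s?([A-Z]+([,\s]+[A-Z]+)*([,\s]+[a-z]+)*)*$'

# Made-up sample: two upper-case titles and two ordinary numbered entries
sample = (
    "1. BEING, IN THE ABSTRACT\n"
    "1. Existence -- N. existence, being.\n"
    "2. BEING, IN THE CONCRETE\n"
    "3. Substantiality -- N. substantiality."
)

# re.MULTILINE lets ^ and $ match at the start and end of every line
titles = [m.group(0) for m in re.finditer(subsection_title, sample, re.MULTILINE)]
print(titles)

# The pattern begins with a single \d, so a two-digit title number does not match
two_digit = re.search(subsection_title, "10. SOMETHING", re.MULTILINE)
print(two_digit)
```

Only the upper-case title lines are matched; the numbered entries are left in the text between them. Because the pattern starts with a single `\d`, a title numbered 10 or higher would not be detected, which may or may not matter for this dataset.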

The text under each title sometimes begins with a subtitle that is not needed, for example "Internal conditions" in "3. FORMAL EXISTENCE Internal conditions". The following function drops the first two lines of the text when the text does not start with a number, and then splits the remaining text into its numbered entries.

In [392]:
import re

def transform_title_values(class_dict):
    for class_title, divisions in class_dict.items():
        for division_title, sections in divisions.items():
            for section_title, section in sections.items():
                for title, text in section.items():
                    # Determine if the text starts with a number
                    if not text.strip().startswith(tuple('0123456789')):
                        lines = text.splitlines()
                        remaining_lines = lines[2:]
                        text = '\n'.join(remaining_lines)  # Rejoin the remaining lines
                    
                    # Define the pattern that marks the start of each numbered entry
                    entry_pattern = r'(?=\n\d+[a-z]?\. )'
                    # Split the text into entries (a separate name avoids shadowing the outer `sections`)
                    entries = re.split(entry_pattern, text)
                    
                    # Update the text for the title with the list of entries
                    section[title] = entries

    return class_dict
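
The split relies on a zero-width lookahead, so the entry number stays attached to its entry instead of being consumed by the separator. A minimal sketch on made-up sample text (the "2a." entry is invented purely to exercise the optional letter) could look like this. It assumes Python 3.7 or later, where `re.split` can split on empty matches.

```python
import re

# Same boundary pattern as in transform_title_values
entry_boundary = r'(?=\n\d+[a-z]?\. )'

# Made-up sample entries
sample = (
    "1. Existence -- N. existence, being.\n"
    "2. Inexistence -- N. inexistence, nonentity.\n"
    "2a. Placeholder -- N. made-up entry."
)

entries = re.split(entry_boundary, sample)
print(entries)

# Every entry after the first keeps its leading newline; strip() removes it
cleaned = [entry.strip() for entry in entries]
print(cleaned)
```

Each entry after the first still begins with the newline that preceded its number, which is why a later cleanup step for newlines is useful.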


Function that exports the dictionary to a JSON file.

In [393]:
import json 
def write_data_to_json_file(data, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)


The following code extracts the text of the classes, divisions and sections and writes it to the thesaurus.json file. The entries in that file still contain raw text like this:

`3. Substantiality -- N. substantiality, hypostasis; person, being,
thing, object, article, item; something, a being, an existence;
creature, body, substance, flesh and blood, stuff, substratum; matter
&c 316; corporeity^, element, essential nature, groundwork, materiality,
substantialness, vital part.
     [Totality of existences], world &c 318; plenum.
Adj. substantive, substantial; hypostatic; personal, bodily, tangible
&c (material) 316; corporeal.
Adv. substantially &c adj.; bodily, essentially.`

This text has to be preprocessed to get the words and phrases.

In [394]:
classes = extract_classes_with_text(file_content)
classes = remove_newlines_from_dict_keys(classes)
classes = extract_divisions(classes)
classes = refine_section_extraction(classes)
classes = process_sections_with_no_title_key(classes)
classes = transform_title_values(classes)
write_data_to_json_file(classes, 'thesaurus.json')


To address the challenge of text preprocessing, I initially explored various strategies and regular expressions (regex) to extract words and phrases. However, finding a regex capable of capturing every desired word and phrase proved difficult. Consequently, I experimented with providing a sample text to ChatGPT, asking it to identify and return unique words and phrases. This experiment demonstrated ChatGPT's ability to effectively identify unique terms and eliminate noise from the data.

Based on these insights, I decided to leverage LangChain to construct a pipeline specifically designed for processing text. This pipeline aims to systematically extract words and phrases from the dataset. The implementation details and code for this pipeline will be presented in the subsequent cells.

The following code sets up a pipeline that sends a prompt to a language model (LLM). The prompt tells the LLM which task to perform. We use OpenAI's GPT-3.5 Turbo model, and we add an output parser that organizes the extracted words and phrases into a list. Together these form a chain that takes the raw input text and returns the words and phrases it contains.

In [395]:
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

output_parser = CommaSeparatedListOutputParser() # Create a strict output parser that will give us a list of words and phrases

format_instructions = output_parser.get_format_instructions()
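prompt = PromptTemplate(
    template="You are a data extraction tool. Your task is to process the following thesaurus text, identify and extract only the words and phrases it contains, and ignore any explanatory text like (Adv, Adj, V, N), examples, and symbols used in thesaurus context.\n{format_instructions}\nThe input is this: {raw_text}",
    input_variables=["raw_text"],
    # Pass the parser's format instructions so the model is asked for a comma-separated list
    partial_variables={"format_instructions": format_instructions},
)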

model = ChatOpenAI(temperature=0)  # The API key is read from the OPENAI_API_KEY environment variable; never hard-code it

chain = prompt | model | output_parser
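
To make the role of the output parser concrete, here is a simplified pure-Python stand-in. It is not LangChain's implementation, and `parse_comma_separated` is a hypothetical name; it only illustrates the idea of turning the model's reply into a list by splitting on commas.

```python
def parse_comma_separated(text):
    """Simplified stand-in for a comma-separated list parser (not LangChain's code)."""
    return [item.strip() for item in text.strip().split(",")]

# A reply that follows the expected format becomes a clean list
good = parse_comma_separated("Decrement, discount, defect, loss")
print(good)

# A reply with one term per line contains no commas, so it stays a single item
bad = parse_comma_separated("Nonimitation\nno imitation\noriginality")
print(bad)
```

The second case shows why a reply that ignores the requested format can end up as a one-element list whose single string still holds every term.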

This section of the code extends the pipeline with concurrent execution, which speeds up text processing considerably. The OpenAI API limits the number of requests and tokens per minute for GPT-3.5 Turbo, so we add a simple rate-limiting strategy: the inputs are sent in batches of 50 concurrent calls, and if a batch finishes in under a minute the code sleeps for the remainder of that minute before starting the next batch. This keeps the automated calls within the allowed thresholds and avoids rate-limit errors. The code below combines this rate limiting with the concurrent pipeline calls.

In [399]:
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def invoke_chain(word, path):
    result = chain.invoke({"raw_text": word})
    print(result)
    return result, path

def update_dictionary(classes, result, path):
    # Update the dictionary in place based on the path and result
    d = classes
    for key in path[:-1]:  # Navigate to the correct position
        d = d[key]
    d[path[-1]] = result  # Update with the result

def process_words_concurrently(words_with_paths, classes):
    with ThreadPoolExecutor(max_workers=50) as executor:
        futures = {executor.submit(invoke_chain, word, path): (word, path) for word, path in words_with_paths}
        
        for future in as_completed(futures):
            word, path = futures[future]
            try:
                result, path = future.result()
                update_dictionary(classes, result, path)
            except Exception as exc:
                print(f'Word {word} generated an exception: {exc}')

def batch_process_with_rate_limiting(words_with_paths, classes):
    batch_size = 50
    for i in range(0, len(words_with_paths), batch_size):
        batch = words_with_paths[i:i+batch_size]
        start_time = time.time()
        process_words_concurrently(batch, classes)
        end_time = time.time()
        elapsed = end_time - start_time
        if elapsed < 60:  # If less than a minute, sleep the remainder
            time.sleep(60 - elapsed)

def collect_words(classes, path=None):
    path = path or []  # avoid a mutable default argument
    words_with_paths = []
    for key, value in classes.items():
        if isinstance(value, dict):
            words_with_paths.extend(collect_words(value, path + [key]))
        elif isinstance(value, list):
            for i, word in enumerate(value):
                words_with_paths.append((word, path + [key, i]))
    return words_with_paths

def main(classes):
    words_with_paths = collect_words(classes)
    batch_process_with_rate_limiting(words_with_paths, classes)


main(classes)  # Comment out this line to skip the API calls
print(classes)


['Decrement', 'discount', 'defect', 'loss', 'deduction', 'afterglow', 'eduction', 'waste.']
['Variation', 'alteration', 'modification', 'moods', 'tenses', 'discrepance', 'discrepancy', 'divergency', 'deviation', 'aberration', 'innovation', 'vary', 'deviate', 'diverge', 'alternate', 'swerve', 'varied', 'modified', 'diversified.']
['Nonimitation\nno imitation\noriginality\ncreativeness\ninvention\ncreation\nunimitated\nuncopied\nunmatched\nunparalleled\ninimitable\nunique\noriginal\ncreative\ninventive\nuntranslated\nexceptional\nrare\nsui generis\nuncommon\nunexampled']
['Extrinsicality', 'objectiveness', 'non ego', 'extraneousness', 'accident', 'appearance', 'phenomenon', 'derived from without', 'objective', 'extrinsic', 'extraneous', 'modal', 'adventitious', 'ascititious', 'adscititious', 'incidental', 'accidental', 'nonessential', 'contingent', 'fortuitous', 'implanted', 'ingrafted', 'inculcated', 'infused', 'outward', 'apparent', 'extrinsically.']
['Nonuniformity', 'diversity', 'irr
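
The bookkeeping behind the concurrent run (remember where each text came from, process the texts in parallel, write each result back to the same place) can be tried without any API call. The following is only a sketch: `collect_leaves`, `set_by_path` and `fake_extract` are hypothetical names, `fake_extract` is a stand-in for `chain.invoke`, and the nested dictionary is a made-up miniature.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_leaves(node, path=()):
    """Yield (text, path) for every string stored in nested dicts and lists."""
    items = node.items() if isinstance(node, dict) else enumerate(node)
    for key, value in items:
        if isinstance(value, (dict, list)):
            yield from collect_leaves(value, path + (key,))
        else:
            yield value, path + (key,)

def set_by_path(root, path, new_value):
    """Replace the value found by following `path` from `root`."""
    node = root
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = new_value

def fake_extract(text):
    """Hypothetical stand-in for chain.invoke: split on commas, no API call."""
    return [part.strip() for part in text.split(",")]

# Made-up miniature of the nested structure
data = {"CLASS I": {"SECTION I.": ["being, entity", "nothing, nought"]}}

leaves = list(collect_leaves(data))  # take a snapshot before mutating `data`
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fake_extract, text): path for text, path in leaves}
    for future in as_completed(futures):
        # Results are written back in the main thread, one at a time
        set_by_path(data, futures[future], future.result())

print(data)
```

Because every result is written back by the main thread as its future completes, the worker threads never touch the shared dictionary.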

Although we use a strict output parser from LangChain, the LLM does not always follow the expected format. Some responses come back as a single string, for example a bulleted list or one term per line, instead of a comma-separated list, so the parser returns a list with only one element. This affects downstream processing and analysis. The following cells identify these cases and convert them so that every output has the required list structure.


In [429]:
for class_title, divisions in classes.items():
    for division_title, sections in divisions.items():
        for section_title, titles in sections.items():
            for title, list_of_word_sections in titles.items():
                for i, lst in enumerate(list_of_word_sections):
                    if lst and ("words and phrases" in lst[0].lower() or len(lst) == 1):
                        text = lst[0]
                        # Extract the terms after the bulleted-list prefix, if there is one
                        _, separator, rest = text.partition(":\n\n- ")
                        if separator:
                            words_list = rest.split("\n- ")
                        else:
                            # No bullets: fall back to one term per line so nothing is lost
                            words_list = text.splitlines()
                        # Directly replace the old list with the new one
                        titles[title][i] = words_list
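
The recovery step can be checked in isolation. The sketch below uses a hypothetical helper, `recover_terms`, and two made-up replies. The bulleted format is an assumption inferred from the prefix and separator that the cell above looks for.

```python
def recover_terms(text):
    """Turn a free-form reply into a list of terms (illustrative helper)."""
    _, separator, rest = text.partition(":\n\n- ")
    if separator:
        # Bulleted reply: one term per "- " bullet after the introductory line
        return [term.strip() for term in rest.split("\n- ") if term.strip()]
    # No bullets found: treat every non-empty line as one term
    return [line.strip() for line in text.splitlines() if line.strip()]

# Made-up replies in the two shapes handled above
bulleted = "Words and phrases extracted:\n\n- decrement\n- discount\n- loss"
one_per_line = "Nonimitation\nno imitation\noriginality"

from_bullets = recover_terms(bulleted)
from_lines = recover_terms(one_per_line)
print(from_bullets)
print(from_lines)
```

`str.partition` returns an empty separator when the marker is missing, which makes it easy to tell the two reply shapes apart.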


Some more preprocessing of the words: remove any remaining newline characters.

In [430]:
for class_title, divisions in classes.items():
    for division_title, sections in divisions.items():
        for section_title, titles in sections.items():
            for title, list_of_word_sections in titles.items():
                for i, lst in enumerate(list_of_word_sections):
                    for j, word in enumerate(lst):
                        # Remove \n from the words
                        titles[title][i][j] = word.replace('\n', '')

We also remove the part-of-speech labels (such as "adj" or "N") that the LLM failed to filter out, and we remove duplicate words within each class.

In [ ]:
not_relevant_words = {'adj', 'adv', 'n', 'v', 'adj.', 'adv.', 'n.', 'v.', 'Phr', 'Adv', 'Adj.', 'N', 'V'}
for class_section, divisions in classes.items():
    seen = set()
    for division_section, sections in divisions.items():
        for section, titles in sections.items():
            for title, list_of_words in titles.items():
                for i, words in enumerate(list_of_words):
                    if len(words) < 3:
                        continue
                    new_words = []
                    for word in words:
                        if word not in seen and word not in not_relevant_words:
                            seen.add(word)
                            new_words.append(word)
                    list_of_words[i] = new_words
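
The filtering logic can be isolated into a small function to see how the shared `seen` set carries over from one list to the next. This is a sketch with a hypothetical helper name, `dedupe_keep_order`, and made-up word lists.

```python
def dedupe_keep_order(words, seen, stop_words):
    """Keep the first occurrence of each word and drop stop words, preserving order."""
    kept = []
    for word in words:
        if word in seen or word in stop_words:
            continue
        seen.add(word)
        kept.append(word)
    return kept

labels = {"adj", "adv", "n", "v", "Adj.", "Adv", "N", "V", "Phr"}
seen = set()  # shared across all lists of one class

# Made-up word lists from the same class
first = dedupe_keep_order(["being", "N", "entity", "being"], seen, labels)
second = dedupe_keep_order(["entity", "adj", "real"], seen, labels)
print(first)
print(second)
```

Because `seen` is shared, a word that already appeared in an earlier list of the same class is dropped from later lists as well.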

Finally we write the data to a json file.

In [431]:
write_data_to_json_file(classes, 'thesaurus_final.json')

The culmination of our text processing efforts is the `thesaurus_final.json` file, which houses the extracted words and phrases from the thesaurus dataset. This file represents a critical resource for subsequent phases of our project. Initially, I attempted to employ regular expressions (regex) for data cleaning, but encountered significant challenges. Despite dedicating four hours to regex development, the outcomes were not satisfactory. 

In contrast, the strategy of leveraging a language model (LLM) API proved not only more effective but also cost-efficient. With an expenditure of just $1.50 on the LLM API, we achieved superior results compared to manual regex efforts. This experience underscores the value of integrating LLMs into data processing workflows, especially for large-scale projects demanding extensive manual labor and sophisticated data cleaning. This approach offers a pragmatic balance between cost, time investment, and quality of results, making it a compelling strategy for complex data processing tasks.