In [2]:
import os
import re

def combine_text_files(directory):
    combined_string = ""

    # Loop through all files in the directory
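    # sorted() makes the concatenation order deterministic across runs
    for filename in sorted(os.listdir(directory)):
        # Only include cleaned text files (names ending in ".txt.clean")
        if filename.endswith(".txt.clean"):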
            filepath = os.path.join(directory, filename)
            
            # Open and read the file
            with open(filepath, 'r', encoding='utf-8') as file:
                combined_string += file.read() + "\n"  # Add a newline after each file's content

    return combined_string

# Store the result under a new name so the function is not shadowed
combined_corpus = combine_text_files('../storage/corpus/')

with open('../storage/corpus/complete_corpus.txt', 'w', encoding='utf-8') as output:
    output.write(combined_corpus)

### 1. **Mixed Document Formatting**

Inconsistent document formatting can undermine the pre-processing phase, which matters because reliable embeddings and subsequent search queries depend on it. When the text contains irregular spacing, truncated paragraphs, and non-standard date formats, errors can arise in tokenization, sentence segmentation, and the overall embedding process:

-   **Tokenization Issues**: Inconsistent spacing and formatting can confuse the tokenization process where text is split into words or subwords. For instance, the phrase "October 30,1735 July 4, 1826" might be incorrectly tokenized, resulting in poor quality embeddings. If the tokenization fails, the model might generate embeddings that do not accurately reflect the content, leading to irrelevant or incorrect search results.
-   **Sentence Segmentation**: The model relies on clear sentence boundaries to generate accurate embeddings. Consider for example: "Adams, a sponsor of the American Revolution in Massachusetts, was a driving force for independence in 1776; Jefferson called him the 'Colossus of Independence'." The semicolon here could be misinterpreted as a sentence boundary, leading to fragmented embeddings and poor information retrieval.
-   **Date and Number Parsing**: Non-standard date formats and number representations can also lead to errors in understanding the timeline of events. For example, the format "1797 1801" without clear punctuation might be misinterpreted by the model, affecting its understanding of historical sequences or the retrieval of date-specific information.
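
The sentence segmentation issue above can be illustrated with a minimal sketch. Both splitter functions are hypothetical helpers written for this illustration, not the segmentation logic of any particular embedding pipeline; the example sentence is the one quoted above.

```python
import re

sentence = ("Adams, a sponsor of the American Revolution in Massachusetts, "
            "was a driving force for independence in 1776; "
            "Jefferson called him the 'Colossus of Independence'.")

def naive_split(text):
    # Hypothetical splitter that treats semicolons as sentence boundaries
    return [s.strip() for s in re.split(r"[.;!?]", text) if s.strip()]

def conservative_split(text):
    # Hypothetical splitter that breaks only after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(len(naive_split(sentence)))         # 2: the sentence is cut at the semicolon
print(len(conservative_split(sentence)))  # 1: the sentence stays whole
```

With the naive splitter, the clause about Jefferson is embedded separately from the clause that names Adams, so the pronoun "him" loses its referent.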

To mitigate these problems, we can consider:

-   **Preprocessing Normalization**: The system could implement text normalization during preprocessing to standardize spacing, punctuation, and date formats. This could involve using regular expressions to correct common formatting issues before tokenization.
-   **Advanced Tokenization Techniques**: The system can use advanced tokenization techniques that can handle irregular spacing and formatting, such as subword tokenization or specialized tokenizers designed for historical texts.
-   **Contextual Embedding Models**: The system can employ models that can better understand context even with formatting issues, such as transformers with pre-trained contextual embeddings that are less sensitive to minor formatting variations.
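
As a rough sketch of the regex-based normalization idea, the function below fixes the spacing and date patterns mentioned earlier. `normalize_text` is a hypothetical helper, and treating two adjacent four-digit numbers as a year range is an assumption that may not hold for every occurrence in the corpus.

```python
import re

def normalize_text(text):
    """Hypothetical helper: standardize spacing and a few date patterns."""
    # Add the missing space between a day and a four-digit year: "30,1735" -> "30, 1735"
    # (the four-digit lookahead leaves thousands separators like "1,000" alone)
    text = re.sub(r"(?<=\d),(?=\d{4}\b)", ", ", text)
    # Assumption: two four-digit numbers separated only by spaces form a year range
    text = re.sub(r"\b(\d{4})[ \t]+(\d{4})\b", r"\1-\2", text)
    # Collapse runs of spaces and tabs, keeping newlines so paragraph breaks survive
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

print(normalize_text("October 30,1735  July 4, 1826"))  # October 30, 1735 July 4, 1826
print(normalize_text("President 1797   1801"))          # President 1797-1801
```

Running a step like this before tokenization keeps the original files untouched while giving the tokenizer more regular input.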



In [3]:
with open("../storage/corpus/complete_corpus.txt", "r", encoding="utf-8") as file:
    corpus = file.read()

# Inconsistent date formats: two four-digit years separated only by whitespace, e.g. "1797 1801"
date_pattern = re.compile(r'\b\d{4}\s+\d{4}\b')
inconsistent_dates = date_pattern.findall(corpus)

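# Spacing issues: runs of two or more whitespace characters.
# Note: \s also matches newlines, so blank lines between paragraphs are counted too.
spacing_issues = re.findall(r'\s{2,}', corpus)
total_spacing_issues = len(spacing_issues)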

print(f"Total Inconsistent Date Formats: {len(inconsistent_dates)}")
print(f"Total Spacing Issues: {total_spacing_issues}")

Total Inconsistent Date Formats: 14
Total Spacing Issues: 1524


### 2. **Character and Named Entity Recognition (NER)**

The corpus contains numerous references to historical figures and events, making accurate Named Entity Recognition (NER) crucial for the model's ability to retrieve and generate relevant answers. Misidentifying or failing to recognize entities like "John Adams" or "Thomas Jefferson" could lead to incorrect or misleading information being retrieved, which would be particularly problematic given the domain-specific and institution-specific focus of the system. Here are some issues worth considering:

-   **Disambiguation of Common Names**: Names like "John Adams" or "Thomas Jefferson" could refer to multiple individuals within different contexts. Without proper NER, the model might confuse these entities, leading to errors in information retrieval. For example, "John Adams" could refer to the second President of the United States or another historical figure with the same name, depending on the context.
-   **Historical Context**: Many entities in the corpus are tied to specific historical events. Accurate recognition of these entities is essential for understanding the context of a query. For example, understanding that "Colossus of Independence" refers to John Adams and his role in the American Revolution is critical for accurate retrieval.
-   **Complex Entity Relationships**: The corpus features complex relationships between entities, such as family connections (e.g., "John Adams" and "John Quincy Adams") and political affiliations. The model needs to correctly recognize these relationships to provide accurate and contextually relevant answers.

To mitigate these problems, we can consider:

-   **Advanced NER Models**: The system can utilize state-of-the-art NER models, such as BERT-based NER or custom-trained models on similar historical text corpora, to improve the accuracy of entity recognition.
-   **Entity Linking**: The system could add an entity linking step that disambiguates entities based on context. For instance, it could link "John Adams" to the second U.S. President rather than to another individual with the same name.
-   **Contextual Embeddings**: Use contextual embeddings that can capture the nuances of the relationships between entities and their historical significance. This would help in maintaining the context and providing accurate, relevant information in response to queries.
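
The entity linking idea can be sketched with a simple cue-word overlap score. Everything here is illustrative: the `CANDIDATES` table, its labels and cue words, and the `link_entity` function are hypothetical, and a production system would more likely rely on a trained linker or a knowledge base.

```python
import re

# Hypothetical candidate table: each surface name maps to possible entities,
# each paired with context cue words chosen by hand for this illustration.
CANDIDATES = {
    "John Adams": [
        ("John Adams (2nd U.S. President)",
         {"president", "independence", "massachusetts", "jefferson"}),
        ("John Adams (composer)",
         {"composer", "opera", "music"}),
    ],
}

def link_entity(mention, context):
    """Return the candidate whose cue words overlap most with the context, or None."""
    words = set(re.findall(r"[a-z0-9]+", context.lower()))
    best_label, best_score = None, 0
    for label, cues in CANDIDATES.get(mention, []):
        score = len(cues & words)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

context = "Adams was a driving force for independence in 1776; Jefferson called him the Colossus."
print(link_entity("John Adams", context))  # John Adams (2nd U.S. President)
```

Returning `None` when no cue word matches lets the caller fall back to another strategy instead of silently picking the wrong entity.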

In [4]:
from collections import Counter

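# Heuristic entity pattern: two consecutive capitalized words.
# This is a rough proxy for NER and will also match titles such as "Vice President".
entity_pattern = re.compile(r'\b[A-Z][a-z]+\s[A-Z][a-z]+\b')
entities = entity_pattern.findall(corpus)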

# Count the frequency of entities
entity_counts = Counter(entities)
common_entities = entity_counts.most_common(10)

print("Top 10 Entities and Their Frequencies:")
for entity, count in common_entities:
    print(f"{entity}: {count}")

Top 10 Entities and Their Frequencies:
United States: 122
New York: 110
White House: 58
Theodore Roosevelt: 45
Abraham Lincoln: 34
Vice President: 30
Republican Party: 27
John Adams: 23
Woodrow Wilson: 20
Supreme Court: 19


By addressing these two critical issues, the system can significantly improve its ability to accurately retrieve and generate relevant answers, particularly in a domain-specific context where precision is paramount.