# Normalizer for Peter Green's translation of the Odyssey

A solid, clean copy. Its main characteristics before and after normalization are listed below.

## Before
- Clean from origin
    * no pagination or other in page marginalia
    * no weird characters in text
    * native digital version (no scan, OCR, etc)
- Separated by books
- Line number every five lines
- Prosody "diacritics":  
    - [] Macron (¯) on long vowels (Ancient Greek quirk)
    - [] Dieresis (¨) on vowels pronounced separately from the preceding vowel, e.g. Antinoös (Ancient Greek quirk)
    - [] Grave and acute accents (`|´)
    - [] Circumflex (^) on vowels with high and falling pitch (Ancient Greek quirk)
- Superscripts for footnotes remain
- In `Cleaning`, the source indicator at the end of each book was stripped, i.e., the footer line: EBSCOhost: eBook Collection (EBSCOhost) printed on 3/7/2025 11:00:14 PM UTC via UNIVERSITAET TUEBINGEN. All use subject to https://www.ebsco.com/terms-of-use.
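
The footer removal happened in the earlier `Cleaning` step, whose code is not shown here. As a rough sketch only, a line filter like the one below could drop such footers. `FOOTER_RE` and `strip_footers` are hypothetical names, and the pattern assumes every footer starts and ends like the line quoted above, while the date, time, and institution in the middle may vary.

```python
import re

# Hypothetical pattern for the footer quoted above.
# The middle part (date, time, institution) is deliberately left loose.
FOOTER_RE = re.compile(
    r"EBSCOhost: eBook Collection \(EBSCOhost\) printed on .*"
    r"All use subject to https://www\.ebsco\.com/terms-of-use\.?"
)

def strip_footers(lines):
    """Return the lines that are not an EBSCOhost footer line."""
    return [line for line in lines if not FOOTER_RE.fullmatch(line.strip())]
```

Matching the whole stripped line, rather than searching inside it, keeps the filter from touching verse lines that merely mention part of the footer.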

## After
- clean text: alphabetic characters plus the punctuation listed below
- kept syntactical punctuation (. , ; : - " ') for future analysis
- kept one line = one verse, preserving the line-break structure for prosodic analysis
- removed macrons and accents
- removed book titles
- removed line numbers
- removed numeral digits
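
The goals "one line = one verse" and "removed line numbers" interact: in the cleaned text the line numbers sit on lines of their own (the Step 1 preview shows a bare `5`), so deleting only the digits would leave an empty line between verses. The sketch below is one possible way to handle both at once. It is not the notebook's own method, and `remove_digits_keep_verses` is a hypothetical helper name.

```python
import re

# A line that holds nothing but a number, e.g. "5" or "360".
LINE_NUMBER_ONLY = re.compile(r"\s*\d+\s*")

def remove_digits_keep_verses(text):
    """Drop number-only lines entirely, then delete any digits left inside verses."""
    kept = []
    for line in text.splitlines():
        if LINE_NUMBER_ONLY.fullmatch(line):
            continue  # skip the line-number marker instead of leaving a blank line
        kept.append(re.sub(r"\d+", "", line))
    return "\n".join(kept)
```

Lines that were already blank in the input are kept, since they do not match the number-only pattern.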


In [None]:
# Step 1: read the text file and remove the book headers
filepath = "/Users/debr/odysseys_en/cleaned_txts/Odyssey_Green_Cleaned.txt"
output_filepath = "/Users/debr/odysseys_en/normalized_txts/Odyssey_Green_Normalized_v1.txt"

# Build the list of book headers ("Book 1" to "Book 24")
book_headers = [f'Book {i}' for i in range(1, 25)]

# Read the entire file (explicit UTF-8, since the text contains macrons and curly quotes)
with open(filepath, 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Filter out the book headers
new_lines = [line for line in lines if line.strip() not in book_headers]

# Join the remaining lines and print a preview of the processed text
text = "".join(new_lines)
print(text[:300])

The man, Muse—tell me about that resourceful man, who wandered
far and wide, when he’d sacked Troy’s sacred citadel:
many men’s townships he saw, and learned their ways of thinking,
many the griefs he suffered at heart on the open sea,
battling for his own life and his comrades’ homecoming. Yet
5
no


## Path A: Controlled normalization
A transparent, "manual" normalization: a regex removes the digits, and an explicit lookup table replaces accented vowels with their base letters.  

In [10]:
# Step 2: Normalize the text
import re

# Removing numeral digits
def remove_digits(text):
    """
    Remove all digit characters from the given text.
    """
    return re.sub(r'\d+', '', text)

# Removing diacritics, macrons, circumflex, dieresis/umlaut, etc
def normalize_text(text):
    """
    Normalize text by replacing diacritics with their base characters.
    """
    # Dictionary mapping diacritics to their base characters
    replacements = {
        # Dieresis/umlaut characters
        'ä': 'a', 'ë': 'e', 'ï': 'i', 'ö': 'o', 'ü': 'u',
        'Ä': 'A', 'Ë': 'E', 'Ï': 'I', 'Ö': 'O', 'Ü': 'U',
        
        # Circumflex characters
        'â': 'a', 'ê': 'e', 'î': 'i', 'ô': 'o', 'û': 'u',
        'Â': 'A', 'Ê': 'E', 'Î': 'I', 'Ô': 'O', 'Û': 'U',
        
        # Macron characters
        'ā': 'a', 'ē': 'e', 'ī': 'i', 'ō': 'o', 'ū': 'u',
        'Ā': 'A', 'Ē': 'E', 'Ī': 'I', 'Ō': 'O', 'Ū': 'U',
        
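        # Even more diacritics
        # Acute
        'á': 'a', 'é': 'e', 'í': 'i', 'ó': 'o', 'ú': 'u',
        'Á': 'A', 'É': 'E', 'Í': 'I', 'Ó': 'O', 'Ú': 'U',
        # Grave
        'à': 'a', 'è': 'e', 'ì': 'i', 'ò': 'o', 'ù': 'u',
        'À': 'A', 'È': 'E', 'Ì': 'I', 'Ò': 'O', 'Ù': 'U',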
    }
    
    # Replacement time!
    for original, replacement in replacements.items():
        text = text.replace(original, replacement)
    
    return text

# Apply the normalization to the text
text = remove_digits(text)
text = normalize_text(text)
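

The lookup table only covers the characters it lists, so a letter outside it (or an accent stored as a separate combining character) would pass through unchanged. As a possible cross-check, and not as a replacement for the controlled approach of Path A, the standard-library `unicodedata` module can strip combining marks generically. `strip_diacritics` is a hypothetical helper name.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (macron, dieresis, accents) from every character."""
    # NFD splits a precomposed letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks, so those are dropped.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)
```

If `strip_diacritics(text)` and `normalize_text(text)` give different results on the same input, the difference points at characters missing from the table.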


In [7]:
# Keep a header-free, pre-normalization copy for the before/after comparison
bf_norm = "".join(new_lines)

In [9]:
def extract_and_compare_section(text_to_compare, start_phrase, end_phrase):
    """
    Extract a section of text between two phrases and compare before/after normalization.
    Only normalize_text is applied here, so digits (line numbers) stay in both versions.
    """
    # Find the start and end indices
    start_index = text_to_compare.find(start_phrase)
    if start_index == -1:
        return "Start phrase not found in text"
    
    # Add the length of the start phrase to get to the end of it
    start_index += len(start_phrase)
    
    # Find the end phrase starting from after the start phrase
    end_index = text_to_compare.find(end_phrase, start_index)
    if end_index == -1:
        return "End phrase not found after the start phrase"
    
    # Extract the section
    section = text_to_compare[start_index:end_index]
    
    # Normalize the section
    normalized_section = normalize_text(section)
    
    # Create a comparison output
    result = "Original section:\n"
    result += section
    result += "\n\nNormalized section:\n"
    result += normalized_section
    
    return result


text_to_compare = bf_norm

# Extract and compare the section
comparison = extract_and_compare_section(
    text_to_compare,
    "where proud attendants unloaded their gear, while they",
    "one succeeding another; and when the sun went down"
)

print(comparison)

Original section:

360
themselves all went to assembly, but would not permit
any others to sit with them, either young or old; and next
Eupeithēs’ son Antinoös addressed them, saying: “Look here,
see how the gods have saved this fellow from harm!
Day after day our lookouts perched up on the windy heights,
365


Normalized section:

360
themselves all went to assembly, but would not permit
any others to sit with them, either young or old; and next
Eupeithes’ son Antinoos addressed them, saying: “Look here,
see how the gods have saved this fellow from harm!
Day after day our lookouts perched up on the windy heights,
365
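

Reading the two printed passages side by side works for a short section, but differences are easy to miss in longer ones. A small sketch that lists only the words that changed is shown below. `changed_words` is a hypothetical helper, and it assumes the normalization replaces characters one for one, so both versions contain the same number of whitespace-separated words.

```python
def changed_words(before, after):
    """Pair up words by position and return the (before, after) pairs that differ."""
    return [(b, a) for b, a in zip(before.split(), after.split()) if b != a]
```

For the section printed above, calling it with the original and normalized passages should list only the two proper names that carry a macron or a dieresis.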



In [None]:
# Create output directory if it doesn't exist
import os
os.makedirs(os.path.dirname(output_filepath), exist_ok=True)

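# Write the normalized text to a new file
# (new_lines holds the header-free but un-normalized lines, so write text instead)
with open(output_filepath, 'w', encoding='utf-8') as file:
    file.write(text)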
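

Before relying on the written file, it may help to check the text against the goals in the "After" list. The sketch below is an assumed helper, not part of the original notebook: `audit_normalized` is a hypothetical name, and it only reports what it finds without modifying anything.

```python
import re

def audit_normalized(text):
    """Report leftovers that a fully normalized text should not contain."""
    return {
        # Remaining decimal digits (line numbers, stray numerals)
        "digits": len(re.findall(r"\d", text)),
        # Letters outside ASCII, i.e. vowels that still carry a diacritic
        "accented_letters": sorted({ch for ch in text if ch.isalpha() and not ch.isascii()}),
        # Empty lines, which break the one line = one verse structure
        "blank_lines": sum(1 for line in text.splitlines() if not line.strip()),
    }
```

Passing the normalized `text` to it gives a quick summary; zero digits and an empty `accented_letters` list would suggest the normalization goals were met.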