# Normalizer for Emily Wilson's translation of the Odyssey

A solid, clean copy. Its main characteristics are listed below.

## Before
- Clean from origin
    * no pagination or other in-page marginalia
    * no weird characters in the text
- Separated into books
- Line numbers every ten lines
## After
- clean text: only alphabetic characters, plus the punctuation listed in the next item
- kept syntactical punctuation (. , ; : - " ') for future analysis
- kept one line = one verse; the line-break structure is preserved for prosodic analysis
- removed book titles
- removed line numbers
- removed numeral digits


## Path B: Broad-brush method for normalization

An indiscriminate approach to normalization using Python's `unicodedata` module.
It is fast and efficient, but it gives us less control over individual changes than Path A.
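
To make the decompose-and-filter idea concrete, here is a minimal sketch on a made-up sample string (not taken from the source file). `strip_marks` is a hypothetical helper name. The sketch also compares NFD with NFKD: NFD leaves compatibility characters such as the `ﬀ` ligature untouched, while NFKD expands them, which may or may not be wanted for this corpus.

```python
import unicodedata

def strip_marks(text, form="NFD"):
    """Decompose `text` with the given normalization form, then drop combining marks."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

# Hypothetical sample: two accented letters and one "ff" ligature (U+FB00)
sample = "Télémaque su\ufb00ered"

print(strip_marks(sample, "NFD"))   # accents removed, ligature kept as one character
print(strip_marks(sample, "NFKD"))  # accents removed, ligature expanded to "ff"
```

Switching the form is a one-word change, but NFKD also rewrites other compatibility characters (for example, the single-character ellipsis becomes three periods), so it is worth inspecting the result before adopting it.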

#### Output file: "Odyssey_Wilson_Normalized_v2.txt"

In [1]:
import os
import unicodedata
import re

def normalize_text_unicodedata(text):
    """
    Normalize text using Python's unicodedata module.
    This strips every diacritic that Unicode encodes as a combining mark
    after decomposition.
    """
    # NFD decomposes characters into base characters and combining marks, e.g. "é" -> "e" + "◌́"
    # Then we filter out the combining marks (category starts with 'M')
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                  if not unicodedata.category(c).startswith('M'))

def remove_digits(text):
    """
    Remove all digits from a string.
    """
    if isinstance(text, list):
        # If text is a list, apply the function to each element
        return [remove_digits(line) for line in text]
    else:
        # Remove digits using regex
        return re.sub(r'\d', '', text)

# Read clean text again
filepath = "/Users/debr/odysseys_en/cleaned_txts/Odyssey_Wilson_Cleaned.txt"

with open(filepath, 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Filter out the book headers
book_headers = [f'Book {i}' for i in range(1, 25)]
filtered_lines = [line for line in lines if line.strip() not in book_headers]

# Normalize each line
normalized_lines = [normalize_text_unicodedata(line) for line in filtered_lines]

# Remove digits
final_lines = remove_digits(normalized_lines) 

# Create output directory if it doesn't exist
output_filepath = "/Users/debr/odysseys_en/normalized_txts/Odyssey_Wilson_Normalized_v2.txt"
os.makedirs(os.path.dirname(output_filepath), exist_ok=True)

# Write to a new file
with open(output_filepath, 'w', encoding='utf-8') as file:
    file.writelines(final_lines)

print(f"Normalization complete. File saved to: {output_filepath}")

Normalization complete. File saved to: /Users/debr/odysseys_en/normalized_txts/Odyssey_Wilson_Normalized_v2.txt
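
Deleting digits one character at a time leaves behind whatever whitespace separated a line number from its verse, which is probably why some lines in the text inspected in the next cell end with a space. A minimal sketch of a stricter variant follows, assuming the line numbers sit at the end of a verse line. `strip_line_number` is a hypothetical helper, not part of the cell above, and the numbered sample line is a reconstruction for illustration.

```python
import re

def strip_line_number(line):
    """Remove a trailing line number and the spaces around it, keeping the newline."""
    newline = "\n" if line.endswith("\n") else ""
    body = line.rstrip("\n")
    body = re.sub(r"\s*\d+\s*$", "", body)  # trailing number plus surrounding spaces
    body = re.sub(r"\d", "", body)          # any digits left elsewhere in the line
    return body.rstrip() + newline

# Assumed input shape: a verse followed by its line number
samples = [
    "tell the old story for our modern times. 10\n",
    "Find the beginning.\n",
]
cleaned = [strip_line_number(s) for s in samples]
print(cleaned)
```

Because the function works line by line and restores the newline, the one line = one verse structure is preserved.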


In [2]:
text_uni = "".join(final_lines)
text_uni


'Tell me about a complicated man.\nMuse, tell me how he wandered and was lost\nwhen he had wrecked the holy town of Troy,\nand where he went, and who he met, the pain\nhe suﬀered in the storms at sea, and how\nhe worked to save his life and bring his men\nback home. He failed to keep them safe; poor fools,\nthey ate the Sun God’s cattle, and the god\nkept them from home. Now goddess, child of Zeus,\ntell the old story for our modern times. \nFind the beginning.\nAll the other Greeks\nwho had survived the brutal sack of Troy\nsailed safely home to their own wives—except\nthis man alone. Calypso, a great goddess,\nhad trapped him in her cave; she wanted him\nto be her husband. When the year rolled round\nin which the gods decreed he should go home\nto Ithaca, his troubles still went on.\nThe man was friendless. All the gods took pity,\nexcept Poseidon’s anger never ended \nuntil Odysseus was back at home.\nBut now the distant Ethiopians,\nwho live between the sunset and the dawn,\nwere w
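
The output above still contains characters outside the set listed under "After": the `ﬀ` ligature in "suﬀered", a typographic apostrophe, and an em dash. None of them is a combining mark, so the NFD filter leaves them alone. A minimal sketch of an audit that lists what survives, so each case can be handled deliberately. `audit_non_ascii` is a hypothetical helper, and the sample is a short excerpt of the output rather than the full file.

```python
from collections import Counter
import unicodedata

def audit_non_ascii(text):
    """Return (character, Unicode name, count) for every non-ASCII character in `text`."""
    counts = Counter(c for c in text if ord(c) > 127)
    return [(c, unicodedata.name(c, "UNKNOWN"), n) for c, n in counts.most_common()]

# Short excerpt from the normalized output
sample = (
    "he su\ufb00ered in the storms at sea, and how\n"
    "they ate the Sun God\u2019s cattle, and the god\n"
    "sailed safely home to their own wives\u2014except\n"
)

for char, name, n in audit_non_ascii(sample):
    print(f"{char!r}  U+{ord(char):04X}  {name}  x{n}")
```

In the notebook, the same call could be run on `text_uni` to get the full inventory before deciding whether to map these characters to ASCII equivalents or keep them.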