<a href="https://colab.research.google.com/github/kitamuramoe/aleppo-domari-glossing/blob/main/aleppo_domari_glossing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1a: Morphological Segmentation (Preprocessing + Word Replacement)

Insert morphological boundaries into words in two steps:

1. **Preprocessing**:  
   This step inserts a space before word-final punctuation marks and a space plus hyphen (` -`) before certain endings.  
   For example, the clitic `šii` (meaning ‘also’) is separated from the preceding word,  
   as it typically follows an inflected phrase and should be treated as an independent morpheme.  
   This prevents segmentation errors in the next step.

2. **Word replacement using a segmentation list**:  
   Words are checked for known endings and replaced with their segmented equivalents.  
   These replacements are defined in a text file (`morpheme_boundary_dic.txt`)  
   where each line lists an original word ending and its segmented version (with hyphens), separated by a tab.  
   Longer patterns are applied first to avoid partial matches overriding full ones.

Lines starting with a quotation mark (e.g., translations) are skipped and preserved as-is.

Example:
- `saaʕidkarrisaa.` → `saaʕid- -kar-r-is-aa .`
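
The two passes can be tried in memory before running them on files. The sketch below is a minimal illustration, not the notebook's implementation: the function names and both entries in `ENDINGS` are assumptions made for the example (the real entries live in `morpheme_boundary_dic.txt`), the clitic rule for `šii` is left out, and a final period is added to the example word so that the punctuation pass has something to do.

```python
# Hypothetical in-memory version of the two passes in Step 1a.
PUNCTUATION = ['!?', ',', '.', '!', '?', ':', ';']  # longest mark first

# Assumed sample entries, chosen only to show why longer endings go first.
ENDINGS = {'karrisaa': '- -kar-r-is-aa', 'aa': '-aa'}


def split_punctuation(word):
    """Detach one word-final punctuation mark with a space."""
    for mark in PUNCTUATION:
        if word.endswith(mark):
            return word[:-len(mark)] + ' ' + mark
    return word


def apply_endings(word, endings):
    """Replace the longest matching ending with its segmented form."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending):
            return word[:-len(ending)] + endings[ending]
    return word


def segment(line):
    """Run the punctuation pass, re-split on spaces, then the ending pass."""
    tokens = ' '.join(split_punctuation(w) for w in line.split()).split()
    return ' '.join(apply_endings(t, ENDINGS) for t in tokens)


print(segment('saaʕidkarrisaa.'))
```

If the shorter ending `aa` were tried first, the word would come out as `saaʕidkarris-aa` and the longer entry would never apply, which is why the endings are sorted by length.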

## Input Format

The input file should consist of unsegmented Aleppo Domari sentences,  
each followed by an English translation enclosed in quotation marks (e.g., `'`).  
Each sentence pair should appear on two consecutive lines, followed by a blank line to separate entries.  
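Lines starting with a quotation mark are copied unchanged by the line-based steps (1a and 2). The whole-file replacements in Step 1b, the Step 4 cleanup, and Step 5 apply to all text, so their patterns can also match inside translation lines.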
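
A quick format check can catch malformed entries before any processing. The following is a sketch under assumptions: `read_pairs` is a hypothetical helper that is not part of the notebook, and the translations in `sample` are placeholders rather than real data.

```python
import re


def read_pairs(text):
    """Group input text into (sentence, translation) pairs.

    Raises ValueError if an entry is not exactly one sentence line
    followed by one line starting with a single quotation mark.
    """
    pairs = []
    for block in re.split(r'\n\s*\n', text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) != 2 or not lines[1].startswith("'"):
            raise ValueError(f"Malformed entry: {block!r}")
        pairs.append((lines[0], lines[1]))
    return pairs


# Placeholder translations; only the layout matters here.
sample = (
    "saaʕidkarrisaa.\n"
    "'(placeholder translation 1)'\n"
    "\n"
    "ḍom\n"
    "'(placeholder translation 2)'\n"
)
pairs = read_pairs(sample)
print(len(pairs), "entries look well formed")
```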

In [None]:
def preprocess_morpheme_boundaries(input_file, output_file):
    # Define a mapping of patterns to replace
    replacements = {
        # '!?' must be listed before '?', otherwise it is never reached
        '!?' : ' !?',
        ',' : ' ,',
        '.' : ' .',
        '!' : ' !',
        '?' : ' ?',
        ':' : ' :',
        ';' : ' ;',
        'šii': ' -šii'
    }

    # Read the input file
    with open(input_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Process each line
    updated_lines = []
    for line in lines:
        # Skip lines that begin with a quotation mark
        if line.startswith("'"):
            updated_lines.append(line.strip())
            continue
        # Split the line into words and process each word
        words = line.strip().split()
        updated_words = []
        for word in words:
            # Check if the word ends with any pattern in the replacements dictionary
            for original, replacement in replacements.items():
                if word.endswith(original):
                    word = word[:len(word) - len(original)] + replacement
                    break  # Stop checking other patterns once a match is found
            updated_words.append(word)
        # Reconstruct the line with updated words
        updated_lines.append(' '.join(updated_words))

    # Write the updated content to the output file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write('\n'.join(updated_lines))

    print(f"Processed file saved as {output_file}")

# Process the uploaded file
input_file = "input_texts/sample_input.txt"
output_file = "output_texts/morpheme_boundary_base.txt"
preprocess_morpheme_boundaries(input_file, output_file)


In [None]:
def load_replacements(replacements_file):
    """Load replacements from a text file into a dictionary."""
    replacements = {}
    with open(replacements_file, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line and '\t' in line:  # Ensure valid entries
                original, replacement = line.split('\t', 1)
                replacements[original] = replacement

    # Sort replacements by length (longest first) to prevent partial matches interfering
    sorted_replacements = dict(sorted(replacements.items(), key=lambda x: len(x[0]), reverse=True))

    return sorted_replacements

def add_morpheme_boundaries(input_file, output_file, replacements_file):
    """Process the input file, replacing words based on the replacements dictionary."""

    # Load replacements from file
    replacements = load_replacements(replacements_file)

    # Read the input file
    with open(input_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Process each line
    updated_lines = []
    for line in lines:
        if line.startswith("'"):  # Skip lines that begin with a quotation mark
            updated_lines.append(line.strip())
            continue

        words = line.strip().split()
        updated_words = []
        for word in words:
            for original, replacement in replacements.items():
                if word.endswith(original):
                    word = word[:len(word) - len(original)] + replacement
                    break  # Stop checking other patterns once a match is found
            updated_words.append(word)

        updated_lines.append(' '.join(updated_words))

    # Write the updated content to the output file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write('\n'.join(updated_lines))

    print(f"Processed file saved as {output_file}")

# Define file paths
replacements_file = "dictionaries/morpheme_boundary_dic.txt"
input_file = "output_texts/morpheme_boundary_base.txt"
output_file = "output_texts/morpheme_boundary_processed.txt"

# Run the function
add_morpheme_boundaries(input_file, output_file, replacements_file)


## Step 1b: Correcting Over-segmentation

This step applies additional adjustments to morpheme segmentation using a list of replacement patterns.  
It is used to correct cases where automatic segmentation may have over-applied boundary rules.

The script reads a replacement dictionary (`adjust_morpheme_boundary.txt`),  
where each line lists an original word and its adjusted version, separated by a tab.

Replacements are applied from longest to shortest match to avoid partial overlap issues.

Example:
- `ḍ -o-m` → `ḍom`
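
The effect of the longest-first ordering can be seen on a single string. This is a minimal sketch, not the notebook's code: `apply_adjustments` is a hypothetical name, and the second dictionary entry is made up purely to show what would go wrong with the opposite order.

```python
# The first entry is the example from the text; the second is a made-up
# entry that would break the first one if it were applied earlier.
adjustments = {
    'ḍ -o-m': 'ḍom',
    '-o-m': '-om',
}


def apply_adjustments(text, adjustments):
    """Apply plain substring replacements, longest pattern first."""
    for old in sorted(adjustments, key=len, reverse=True):
        text = text.replace(old, adjustments[old])
    return text


print(apply_adjustments('ḍ -o-m', adjustments))
```

Because the replacement is a plain substring match over the whole text, a pattern can also match inside a translation line, so short patterns are best avoided in the adjustment list.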

In [None]:
def load_replacements(replacements_file):
    """Load boundary-adjustment replacements from a text file into a dictionary, sorted by length (longest first)."""
    replacements = {}
    with open(replacements_file, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line and '\t' in line:  # Ensure valid entries
                original, replacement = line.split('\t', 1)
                replacements[original] = replacement

    # Sort replacements by length (longest first) to prevent partial matches interfering
    sorted_replacements = dict(sorted(replacements.items(), key=lambda x: len(x[0]), reverse=True))
    return sorted_replacements

def adjust_morpheme_boundaries(input_file, output_file, replacements):
    """Read a text file, replace occurrences based on a dictionary, and save the result."""
    # Read content
    with open(input_file, "r", encoding="utf-8") as file:
        content = file.read()

    # Apply replacements
    for old, new in replacements.items():
        content = content.replace(old, new)

    # Write the updated content to the output file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(content)

    print(f"Processed file saved as {output_file}")

# Define file paths
replacements_file = "dictionaries/adjust_morpheme_boundary.txt"
input_file = "output_texts/morpheme_boundary_processed.txt"  # Change this to your actual input file
output_file = "output_texts/morpheme_boundary_adjusted.txt"

# Load replacements and process the file
replacements = load_replacements(replacements_file)
adjust_morpheme_boundaries(input_file, output_file, replacements)


## Step 2: Splitting into One Word Per Line

This step converts each sentence into a list of individual words, with one word per line.

This step is applied to segmented text.  
Lines that start with a quotation mark (e.g., translations or comments) are preserved as they are.

After this step, **download the output file and manually check whether each word is correctly segmented** into morphemes.  
If over-segmentation or incorrect boundaries are found, correct them manually before proceeding.
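
The splitting rule can be sketched on a string instead of a file. `one_word_per_line` is a hypothetical name used only for this illustration, and the translation is a placeholder.

```python
def one_word_per_line(text):
    """Put each word on its own line; keep blank and quoted lines as they are."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("'"):
            out.append(line)          # separator or translation: keep whole
        else:
            out.extend(line.split())  # segmented sentence: one word per line
    return '\n'.join(out)


sample = "saaʕid- -kar-r-is-aa .\n'(placeholder translation)'\n"
print(one_word_per_line(sample))
```

A segmented form that contains a space, such as `saaʕid- -kar-r-is-aa`, ends up on two lines, which is worth keeping in mind during the manual check.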



In [None]:
input_file = 'output_texts/morpheme_boundary_adjusted.txt'
output_file = 'output_texts/processed_words.txt'

# Open the input and output files
with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
    for line in infile:
        # Strip leading/trailing whitespace
        line = line.strip()

        # Skip processing if the line starts with a quotation mark
        if not line or line.startswith("'"):
            outfile.write(line + '\n')
        else:
            # Split the line into words and write each word on a new line
            words = line.split()
            for word in words:
                outfile.write(word + '\n')

print(f"Processed text has been saved to '{output_file}'.")


## Step 3: Interlinear Glossing

In this step, each word is matched against a predefined gloss dictionary and assigned a gloss if available.

The script reads a dictionary file (`aleppo_domari_dic_updated.txt`) from the `dictionaries/` folder.  
Each line in the file lists a word and its corresponding gloss, separated by a tab character.

The script then reads the list of segmented words from `output_texts/processed_words.txt`,  
and for each word:
- If a gloss is found, the word and its gloss are written side by side (tab-separated).
- If no gloss is found, the word is written alone to indicate a missing entry.

Example:
- `ḍom` → `ḍom	Dom/people`


The glossed output is saved as `output_texts/glossed.txt`.
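
The lookup itself is a plain dictionary match, sketched below on an in-memory list. `gloss_words` is a hypothetical helper; collecting the unmatched words in `missing` is an addition for this example and is not something the notebook does.

```python
def gloss_words(words, dic):
    """Return tab-separated glossed lines and the list of unglossed words."""
    glossed, missing = [], []
    for w in words:
        if w in dic:
            glossed.append(f"{w}\t{dic[w]}")
        else:
            glossed.append(w)
            # Blank separators and translation lines are not missing entries
            if w and not w.startswith("'"):
                missing.append(w)
    return glossed, missing


dic = {'ḍom': 'Dom/people'}  # single entry taken from the example above
words = ['ḍom', '-kar-r-is-aa', "'(placeholder translation)'", '']
glossed, missing = gloss_words(words, dic)
print('\n'.join(glossed))
print('No gloss found for:', missing)
```

Such a list of unmatched words could serve as a checklist when extending the dictionary file.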

In [None]:
dic = {}
# Read as UTF-8 explicitly: the entries contain non-ASCII characters
with open('dictionaries/aleppo_domari_dic_updated.txt', 'r', encoding='utf-8') as dictfile:
    for line in dictfile:
        # Strip trailing whitespace and split the line by tabs
        line = line.rstrip().split('\t')
        # Ensure there are at least two elements (key and value)
        if len(line) >= 2:
            key = line[0]
            value = line[1]
            # Add the key-value pair to the dictionary
            dic[key] = value
print(dic)

In [None]:
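words = []
# Read as UTF-8 explicitly and close the file when done
with open('output_texts/processed_words.txt', 'r', encoding='utf-8') as txtfile:
    for line in txtfile:
        # Drop the trailing newline; blank lines stay in the list as separators
        words.append(line.rstrip())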

In [None]:
output_file = 'output_texts/glossed.txt'

# Open the output file for writing
with open(output_file, 'w', encoding='utf-8') as out_f:
    # Iterate through each word in the list
    for w in words:
        if w in dic:
            # If the word is in the dictionary, write the word and its value
            out_f.write(f"{w}\t{dic[w]}\n")
        else:
            # If the word is not in the dictionary, write the word alone
            out_f.write(f"{w}\n")

# Print to verify the output file content (optional)
with open(output_file, 'r', encoding='utf-8') as out_f:
    print(out_f.read())

## Step 4: Formatting for Interlinear Glossing

This step reformats the glossed text into a standard three-line interlinear glossing format.

The input file (`output_texts/glossed.txt`) contains one word per line, optionally followed by a gloss separated by a tab.  
Each sentence block ends with a free translation enclosed in quotation marks (e.g., `'` or `"`) and is separated from the next block by an empty line; the script writes a block only if it contains a translation line.

For each block, the script outputs:

1. A segmented sentence (space-separated)
2. A gloss line (aligned with the sentence)
3. A free translation line

If a gloss is missing:
- `*` is inserted for general words
- `-*` is inserted if the word begins with a hyphen (e.g., for clitics)

After formatting, an additional cleaning step is applied to remove formatting artifacts such as:
- Extra spaces before punctuation (e.g., `' ,'` → `','`)
- Double asterisks (`**`) from unglossed items
- Redundant or broken hyphenation (e.g., `- -` → `-`)

Example:
- ḥasanee laavtiiy-ee,
- `*` girl-COP.PRES/2SG.PRES/PL
- 'the daughter of ḥasanee,'
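
The formatting and cleaning rules can be sketched for a single block as follows. `format_block` and `clean` are hypothetical names for this illustration; the punctuation token is left out of the sample block so that the output matches the example above, and `partition` is used here as a simplification of the tab handling.

```python
# Cleanup rules in the order in which they are applied.
CLEANUP = [
    (' ,', ','), (' .', '.'), (' !', '!'), (' ?', '?'), (' :', ':'),
    (' ;', ';'), (' **', ''), ('- -', '-'), (' -', '-'), ('- ', '-'),
]


def format_block(entries, translation):
    """Turn 'word<TAB>gloss' entries into a three-line interlinear block."""
    words, glosses = [], []
    for entry in entries:
        word, _, gloss = entry.partition('\t')
        words.append(word)
        if gloss:
            glosses.append(gloss)
        else:
            # Placeholder for a missing gloss; hyphen-initial items get '-*'
            glosses.append('-*' if word.startswith('-') else '*')
    return '\n'.join([' '.join(words), ' '.join(glosses), translation])


def clean(text):
    """Remove spacing artifacts left by the earlier steps."""
    for old, new in CLEANUP:
        text = text.replace(old, new)
    return text


entries = ['ḥasanee', 'laavtiiy-ee\tgirl-COP.PRES/2SG.PRES/PL']
block = format_block(entries, "'the daughter of ḥasanee,'")
print(block)
print(clean('saaʕid- -kar-r-is-aa .'))
```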

In [None]:
# Function to process the input file and reformat the content
def format_glossing(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
        sentence = []
        gloss = []
        translation = None

        for line in infile:
            line = line.strip()
            if not line:  # Empty line indicates end of an entry
                if sentence and gloss and translation:
                    outfile.write(' '.join(sentence) + '\n')
                    outfile.write(' '.join(gloss) + '\n')
                    outfile.write(translation + '\n\n')
                sentence = []
                gloss = []
                translation = None
                continue

            # Detect English translation enclosed in single or double quotes
            if line.startswith(("'", '"', "‘", "’")) and line.endswith(("'", '"', "‘", "’")):
                translation = line
            else:
                # Extract the word and gloss
                parts = line.split('\t')
                if len(parts) == 2:
                    sentence.append(parts[0])
                    gloss.append(parts[1])
                elif len(parts) == 1:  # If no gloss is provided
                    sentence.append(parts[0])
                    if parts[0].startswith('-'):
                        gloss.append('-*')  # Add '-*' if the sentence part starts with '-'
                    else:
                        gloss.append('*')   # Default: append '*'

        # Write the last entry if the file doesn't end with a blank line
        if sentence and gloss and translation:
            outfile.write(' '.join(sentence) + '\n')
            outfile.write(' '.join(gloss) + '\n')
            outfile.write(translation + '\n')

# Specify input and output file paths
input_file = 'output_texts/glossed.txt'  # Replace with your actual file name
output_file = 'output_texts/formatted_glossing.txt'

# Process the file
format_glossing(input_file, output_file)

print("Data reformatted and saved to", output_file)


In [None]:
# Define replacements
replacements = {
    ' ,': ',',
    ' .': '.',
    ' !': '!',
    ' ?': '?',
    ' :': ':',
    ' ;': ';',
    ' **': '',
    '- -': '-',
    ' -': '-',
    '- ': '-'
}

# Load the file
input_file = "output_texts/formatted_glossing.txt"  # Adjust path if necessary
output_file = "output_texts/cleaned_formatted_glossing.txt"

# Read content
with open(input_file, "r", encoding="utf-8") as file:
    content = file.read()

# Apply replacements
for old, new in replacements.items():
    content = content.replace(old, new)

# Save cleaned content
with open(output_file, "w", encoding="utf-8") as file:
    file.write(content)

print(f"Cleaning complete. Saved as {output_file}")


## Step 5: Gloss Disambiguation

This step performs **gloss disambiguation**: glosses that list several alternative analyses (separated by `/`) are replaced with more specific, context-appropriate ones.

The input file (`output_texts/cleaned_formatted_glossing.txt`) may contain glosses with multiple possible analyses.  
To improve clarity, a set of replacement rules is applied using a dictionary file (`gloss_replacement.txt`), where each line lists an original gloss string and its disambiguated version, separated by a tab.

Example:
- d-išt-ir-r-ee
- give-COP/PROG-3SG-2SG-PRES → give-PROG-3SG-2SG-PRES

The disambiguated and finalized output is saved as `output_texts/formatted_glossing_final.txt`.

This automatic disambiguation process works for glosses that can be clarified based on surrounding morphemes.  
However, certain glosses—such as those involving lexemes with multiple context-dependent meanings (e.g., `qay-` glossed as `eat/vomit`)—cannot be resolved by this process.  
Such cases should be manually reviewed and resolved after this step.
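
One way to support the manual review is to list the glosses that still contain alternatives after the rules have run. The sketch below rests on two assumptions: that `/` is the only marker of alternatives, as in the examples in this notebook, and that the sample gloss line, which is made up from glosses quoted above, stands in for real output. `disambiguate` and `still_ambiguous` are hypothetical helpers.

```python
def disambiguate(text, rules):
    """Apply gloss replacement rules, longest pattern first."""
    for old in sorted(rules, key=len, reverse=True):
        text = text.replace(old, rules[old])
    return text


def still_ambiguous(gloss_line):
    """Return the glosses on a line that still list alternatives with '/'."""
    return [g for g in gloss_line.split() if '/' in g]


# Rule taken from the example above; real rules come from gloss_replacement.txt.
rules = {'give-COP/PROG-3SG-2SG-PRES': 'give-PROG-3SG-2SG-PRES'}

line = disambiguate('give-COP/PROG-3SG-2SG-PRES eat/vomit', rules)
print(line)
print('Needs manual review:', still_ambiguous(line))
```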


In [None]:
def load_replacements(replacements_file):
    """Load gloss replacements from a text file into a dictionary, sorted by length (longest first)."""
    replacements = {}
    with open(replacements_file, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line and '\t' in line:  # Ensure valid entries
                original, replacement = line.split('\t', 1)
                replacements[original] = replacement

    # Sort replacements by length (longest first) to prevent partial matches interfering
    sorted_replacements = dict(sorted(replacements.items(), key=lambda x: len(x[0]), reverse=True))
    return sorted_replacements

def replace_text(input_file, output_file, replacements):
    """Read a text file, replace occurrences based on a dictionary, and save the result."""
    # Read content
    with open(input_file, "r", encoding="utf-8") as file:
        content = file.read()

    # Apply replacements
    for old, new in replacements.items():
        content = content.replace(old, new)

    # Write the updated content to the output file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(content)

    print(f"Processed file saved as {output_file}")

# Define file paths
replacements_file = "dictionaries/gloss_replacement.txt"
input_file = "output_texts/cleaned_formatted_glossing.txt"  # Change this to your actual input file
output_file = "output_texts/formatted_glossing_final.txt"

# Load replacements and process the file
replacements = load_replacements(replacements_file)
replace_text(input_file, output_file, replacements)
