## French formality detector in translation memories

Imagine you have worked with a client for three years and built up a large English-French translation memory (TM). The client now decides to change their tone: they want to switch from formal French (using 'vous') to informal French (using 'tu') in all their content.
Your TM contains many thousands of segments, so checking every one of them by hand would be a huge task.

### What this notebook does
- Takes a TMX file containing English-French translations
- Identifies segments using formal French ("vous")
- Creates a list of segments that need to be changed to informal French ("tu")
- Exports results to a CSV file

### Recommended: familiarity with
* Python basics (variables, functions)
* Environment management
* Directory structure
* Package management
* Git for version control
* TMX/XML structure
* French formality rules

## 1. Setting up our tools

In [None]:
# Load packages
import os  # Operating system interface for cross-platform file and directory operations
import xml.etree.ElementTree as ET  # Built-in XML parser for handling TMX files
import pandas as pd  # Data manipulation library for creating and managing our translation DataFrames
import spacy  # NLP library
from collections import Counter  # To count occurrences of items

## 2. spaCy demo
Before we analyze our TMX, let's see how we can detect language patterns. We'll use spaCy, a natural language processing library.

### 2.1 POS and morphological tagging

In [None]:
# Load the English language model from spaCy (this is like loading a tool that helps us understand language)
# spaCy's part-of-speech and morphology labels follow the Universal Dependencies scheme for most languages: https://github.com/UniversalDependencies
nlp = spacy.load("en_core_web_sm")

# Our example sentence we want to analyze
text = "I love New York"
doc = nlp(text)

for token in doc:  # For each token, we collect four pieces of information
    token_text = token.text    # 1. token_text: the token itself
    token_pos = token.pos_     # 2. token_pos: part of speech
    token_dep = token.dep_     # 3. token_dep: dependency relation to its head
    token_morph = token.morph  # 4. token_morph: morphological features
    # :<10 left-aligns each value in a field 10 characters wide
    print(f"{token_text:<10}{token_pos:<10}{token_dep:<10}{token_morph}")
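The `:<10` format specifier used above can be tried on its own. This is a small sketch with hand-written values (they are typed in for illustration, not produced by the model), showing that each value is padded on the right to a fixed width so the columns line up.

```python
# Hand-written values for illustration; not actual spaCy output
columns = ("love", "VERB", "ROOT")

# :<10 pads each value with spaces on the right, up to 10 characters
row = "".join(f"{value:<10}" for value in columns)

# Show the padding explicitly by marking the end of the row
print(row + "|")
```

Values longer than 10 characters are not truncated; they simply push the following columns to the right.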

### 2.2 Dependency and entity information

In [None]:
from spacy import displacy
displacy.render(doc, style="dep") #display dependency information
displacy.render(doc, style="ent") #display entity information 

## 3. Working with Translation Memories
Now let's work with our TM.
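For orientation, here is a minimal sketch of the structure a TMX parser has to navigate. The sample is a hypothetical, stripped-down TMX written inline (it is not taken from `alice.tmx`, and a real header carries more attributes). It shows that each `<tu>` holds one `<tuv>` per language, and that ElementTree exposes the `xml:lang` attribute under the reserved XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for illustration only
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="en"><seg>How are you?</seg></tuv>
      <tuv xml:lang="fr"><seg>Comment allez-vous ?</seg></tuv>
    </tu>
    <tu tuid="2">
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="fr"><seg>Merci.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# Fully qualified name under which ElementTree stores xml:lang
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(SAMPLE_TMX)
rows = []
for tu in root.iter("tu"):
    row = {"tuid": tu.get("tuid")}
    for tuv in tu.findall("tuv"):
        # Key each segment by its language code
        row[tuv.get(XML_LANG)] = tuv.findtext("seg")
    rows.append(row)

for row in rows:
    print(row)
```

Note that the `tuid` values come back as strings, because XML attributes are always text.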

### 3.1 Defining a function to parse TMX Files

In [None]:
def parse_tmx_to_df(tmx_file_path):
    """
    Parse a TMX file and extract source (English) and target (French) texts,
    together with their translation unit IDs, into a pandas DataFrame.
    
    Parameters:
    tmx_file_path (str): Path to your TMX file
    
    Returns:
    pandas.DataFrame: DataFrame with columns ['tuid', 'source', 'target']
    
    Example usage:
        df = parse_tmx_to_df(file_path)
    """
    # xml:lang belongs to the reserved XML namespace, so ElementTree
    # exposes it under this fully qualified attribute name
    xml_lang = '{http://www.w3.org/XML/1998/namespace}lang'
    
    # Parse the XML file
    tree = ET.parse(tmx_file_path)
    root = tree.getroot()
    
    # Prepare a list to store the data
    data = []
    
    # Find all translation units
    for tu in root.findall('.//tu'):
        # Get the ID from the tuid attribute
        tuid = tu.get('tuid')
        
        # Collect one segment per language, keyed by the two-letter prefix so
        # that regional codes such as 'en-US' or 'fr-FR' are matched as well
        segments = {}
        for tuv in tu.findall('./tuv'):
            lang = (tuv.get(xml_lang) or '').lower()[:2]
            seg = tuv.find('seg')
            if seg is not None:
                # seg.text is None for an empty segment; note that it only
                # holds the text that precedes any inline tag
                segments.setdefault(lang, seg.text or '')
        
        # Skip units that lack an English or a French segment
        if 'en' not in segments or 'fr' not in segments:
            continue
        
        # Add to data list
        data.append({
            'tuid': tuid,
            'source': segments['en'],
            'target': segments['fr']
        })
    
    # Create DataFrame and sort by tuid (assumes every tuid is an integer)
    df = pd.DataFrame(data, columns=['tuid', 'source', 'target'])
    df['tuid'] = df['tuid'].astype(int)
    df = df.sort_values('tuid')
    
    return df
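The cast to `int` before sorting matters because `tuid` values are read from the XML as strings, and strings sort character by character. The sketch below uses a hypothetical list of IDs to show the difference with plain Python sorting.

```python
# Hypothetical tuid values, as they would come out of the XML (strings)
tuids = ["1", "10", "2", "21", "3"]

# Sorting as text compares character by character: "10" comes before "2"
sorted_as_text = sorted(tuids)

# Sorting on the integer value gives the order a reader expects
sorted_as_numbers = sorted(tuids, key=int)

print(sorted_as_text)
print(sorted_as_numbers)
```

If a TM uses non-numeric IDs, the integer conversion will fail, so this step assumes numeric `tuid` attributes.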

### 3.2 Using our parsing function to parse our TM

In [None]:
if __name__ == "__main__":
    # Define input and output directories
    input_dir = 'input_data'
    output_dir = 'output_data'
    
    # Create directories if they don't exist
    os.makedirs(input_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)
    
    # Construct file paths using os.path.join so it can be run on any machine
    input_file = os.path.join(input_dir, 'alice.tmx')
    output_file = os.path.join(output_dir, 'translations.csv')
    
    # Load TMX and create dataframe
    df = parse_tmx_to_df(input_file)
    
    # Save to csv
    df.to_csv(output_file, index=False, encoding='utf-8')

    # Display TMX length
    print("\nTMX length:")
    print(f"Total number of segments: {len(df)}")
    
    # Display settings and preview
    pd.set_option('display.max_colwidth', 150) 
    display(df.head()) #display first 5 rows


## 4. Detecting formal language
Now that we can load TMX files, let's analyze the French text for formality.

In [None]:
# Load the French model
nlp = spacy.load("fr_core_news_sm")

### 4.1 Defining a function to detect formality

In [None]:
def check_formality_type(text, nlp):
    """
    Check formality type in French text based on number and person.
    
    Plural 'vous' can also address several people, so 'Formal' is a
    heuristic label rather than a guarantee.
    
    Parameters:
    text (str): French text to analyze
    nlp: spaCy language object
    
    Returns:
    str: 'Formal' if all second-person forms are plural
         'Informal' if all second-person forms are singular
         'Not Applicable' if both types are found or neither is found
    """

    # Empty or missing segments contain no second-person forms
    if not text:
        return 'Not Applicable'

    plural_found = False
    singular_found = False

    doc = nlp(text)
    
    for token in doc:
        text_lower = token.text.lower()

        # Check for specific pronouns and possessives by surface form.
        # We list the words explicitly because not every surface form carries
        # the expected morphological features, so relying on features alone
        # risks misclassification. Ambiguous forms such as "tiens" (also a
        # verb form) or "ton" (also a noun) can still cause false positives.
        if text_lower in ["vous", "vôtre", "vôtres", "votre", "vos"]:
            plural_found = True
        elif text_lower in ["tu", "te", "t'", "t’", "toi", "tien", "tienne", "tiens", "tiennes", "ton", "ta", "tes"]:
            singular_found = True
        
        # Check verbs and auxiliaries for second-person forms
        if token.pos_ in ["VERB", "AUX"]:
            person = token.morph.get("Person")
            number = token.morph.get("Number")
            if person == ["2"]:
                if number == ["Plur"]:
                    plural_found = True
                elif number == ["Sing"]:
                    singular_found = True

    # Determine return value based on found forms
    if plural_found and singular_found:
        return 'Not Applicable'  # Both found
    elif plural_found:
        return 'Formal'  # All forms are plural
    elif singular_found:
        return 'Informal'  # All forms are singular
    else:
        return 'Not Applicable'  # Neither found
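The surface-form part of the check can be sanity-tested without loading a spaCy model. The sketch below is a simplified stand-in, not the notebook's method: `surface_forms` is a hypothetical helper that tokenizes with a regular expression and reuses the word lists from `check_formality_type`. It also normalizes the typographic apostrophe, which is common in translated content.

```python
import re

# Word lists copied from check_formality_type
FORMAL = {"vous", "vôtre", "vôtres", "votre", "vos"}
INFORMAL = {"tu", "te", "t'", "toi", "tien", "tienne", "tiens",
            "tiennes", "ton", "ta", "tes"}


def surface_forms(text):
    """Return the (formal, informal) marker words found in text."""
    # Normalize the typographic apostrophe so that "t’aime" yields "t'"
    normalized = text.lower().replace("\u2019", "'")
    # A token is either an elided form such as "t'" or a run of letters
    tokens = re.findall(r"[^\W\d_]+'|[^\W\d_]+", normalized)
    formal = [t for t in tokens if t in FORMAL]
    informal = [t for t in tokens if t in INFORMAL]
    return formal, informal


print(surface_forms("Je t’aime, et vous ?"))
print(surface_forms("Comment allez-vous ?"))
print(surface_forms("Venez ici."))
```

The last sentence contains no pronoun at all, so this word-list approach finds nothing in it. Cases like that are why the function also inspects the `Person` and `Number` features of verbs.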

## 5. Using our formality function with our data

In [None]:
# Create output directory if it doesn't exist
os.makedirs('output_data', exist_ok=True)

# Add formality type column to DataFrame
df['formality_type'] = df['target'].apply(lambda x: check_formality_type(x, nlp))


# Save results
output_file = os.path.join('output_data', 'translations_with_formality_types.csv')
df.to_csv(output_file, index=False, encoding='utf-8')

### 5.1 Show a sample of our data with formality

In [None]:
print("Sample of all segments:")
display(df[['tuid', 'source', 'target', 'formality_type']].head())

### 5.2 Show a sample of our data with formality, only formal segments

In [None]:
print("\nFormal segments (plural):")
display(df[df['formality_type'] == 'Formal'][['tuid', 'source', 'target', 'formality_type']].head())

## 6. Statistics and distribution of formalities in our TM

In [None]:
print("\nDistribution of formality types:")
print(df['formality_type'].value_counts())
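Raw counts are easier to interpret alongside percentages. The sketch below uses `Counter` on a hypothetical list of labels standing in for `df['formality_type'].tolist()`; pandas offers the same through `value_counts(normalize=True)`.

```python
from collections import Counter

# Hypothetical labels standing in for df['formality_type'].tolist()
labels = ["Formal", "Formal", "Informal", "Not Applicable", "Formal"]

counts = Counter(labels)
total = sum(counts.values())

# Percentage share of each label, most frequent first
shares = {label: round(100 * n / total, 1) for label, n in counts.most_common()}

for label, pct in shares.items():
    print(f"{label:<16}{counts[label]:>5}{pct:>7.1f}%")
```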

## 7. Testing and verification

In [None]:
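test_text = "Ceci est ton exemple et votre document."
doc = nlp(test_text)
for token in doc:
    print(f"{token.text:<10}{token.pos_:<10}{token.dep_:<10}{token.morph}")

# Test our function with new content. It expects the raw string, not a Doc.
# The sentence mixes "ton" and "votre", so we expect 'Not Applicable'.
check_formality_type(test_text, nlp)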
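A single sentence is a thin test. One way to widen it is a small table of sentences with the label we would expect for each, checked in a loop. The sketch below is an assumption about how this could be organized: `find_mismatches` and the sentences in `CASES` are hypothetical, and the helper accepts any classifier function so it can be tried without a spaCy model.

```python
def find_mismatches(classify, cases):
    """Return (text, expected, actual) for every case the classifier gets wrong."""
    mismatches = []
    for text, expected in cases:
        actual = classify(text)
        if actual != expected:
            mismatches.append((text, expected, actual))
    return mismatches


# Hypothetical test sentences with the label we would expect for each
CASES = [
    ("Pouvez-vous m'aider ?", "Formal"),
    ("Peux-tu m'aider ?", "Informal"),
    ("Ceci est ton exemple et votre document.", "Not Applicable"),
    ("Le chat dort.", "Not Applicable"),
]

# In the notebook, the classifier would be wrapped like this:
# find_mismatches(lambda t: check_formality_type(t, nlp), CASES)
```

An empty list means every expectation was met; any entries returned point to sentences worth inspecting token by token, as in the cell above.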