# Translation Corrector

## About
This notebook validates the the German and English translations.

There are two phases:

- the first phase just checks the structural aspects of the AsciiDoc files. It makes life so easier if there are the same amount of text lines and blank lines for each language section. The result are hint where the structure of one translation doesn't match the other.
- the second phase checks for content and differences between the translations. The result is a assessment sheet as well as merge files that include corrections. Note that this makes only sense to use if the files are structural identical because the translation checks on a line-by-line basis.

## Prerequisites


### Set up OpenAI
This cell configures OpenAI's ChatGPT. The very important part is the structured response format of the request to ChatGPT.

There is also a separate file written just in case of failures of the calls to OpenAI. Technically, you could read in this file and work with it afterwards.

In [21]:
import pandas as pd
from openai import OpenAI
from tqdm.notebook import tqdm
import os
import csv
import json
from datetime import datetime

# set the role of the LLM
ai_system = """
You are a translator for iSAQB curricula between German and English.
Your task is to identify translation errors and inconsistencies between translations."
"""

# define the schema and format including what we expect from the LLM's result
response_json_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "result",
        "schema": {
            "type": "object",
            "properties": {
                "correctness": {
                    "type": "number",
                    "description" : "a value between 0 and 1 that indicates how well the two texts fit together"},
                "assessment": {
                    "type": "string",
                    "description" : "a brief assessment of how good the translation into English is"},
                "corrected_text_de": {
                    "type": "string",
                    "description" : "an improved translation of the German text"},
                "corrected_text_en": {
                        "type": "string",
                        "description": "an improved translation of the English text"}
            },
            "required": ["correctness", "assessment", "corrected_text_de", "corrected_text_en"],
            "additionalProperties": False
        },
        "strict": True
    }
}

openai_client = OpenAI(
   api_key = os.environ["OPENAI_API_KEY"]
)

# the magic function that does all the work for us!
def correct_translation(content):

    messages=[
        {"role": "system", "content": ai_system},
        {"role": "user", "content": content}
    ]

    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages = messages,
        response_format = response_json_format,
        temperature = 0,
        n=1)

    return completion.choices[0]


def check_translations(df, column_DE="text_DE", column_EN="text_EN", cache_filepath="llm_results.csv"):

    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    temp_filepath = f"temp/{timestamp}_{cache_filepath}"
    # Clear the file by opening it in write mode initially
    with open(temp_filepath, 'w') as csv_file:
        pass  # This clears the file

    results = []

    # Open the CSV file in append mode
    with open(temp_filepath, 'a', newline='') as file:
        
        csv_file = csv.writer(file)
        csv_file.writerow(["correctness", "assessment", "corrected_text_de", "corrected_text_en"])

        for i, r in tqdm(df.iterrows(), total=len(df), desc="Checking translations"):
            try:
                # Define the prompt for the LLM
                prompt = f"""
                Here is the German and English translation: 

                German text:

                {r[column_DE]}

                English text:

                {r[column_EN]}

                Keep the AsciiDoc formatting as it is. Give also hints and corrections if there are differences between the German formatting and the English formatting.

                Return a result in the defined JSON schema.
                """

                # Call the translation correction function
                result = correct_translation(prompt)

                # Convert the result to a Python dictionary
                result_dict = json.loads(result.message.content)

                # Append the result to the result list
                results.append(result_dict)

                # Append the result to the CSV file
                csv_file.writerow([
                    result_dict['correctness'],
                    result_dict['assessment'],
                    result_dict['corrected_text_de'],
                    result_dict['corrected_text_en']]
                    )

                # Optional: Immediately flush the file to ensure the result is written
                file.flush()

            except Exception as e:
                # Log the error and continue with the next row
                print(f"Error occurred on row {i}: {str(e)}")

    return df.join(pd.DataFrame.from_dict(results))

### Helper function: show output as HTML

This function used mostly at the end of a cell will render the later included links accordingly. But is can also called within a cell in between if needed.


In [None]:
from IPython.display import display, HTML

def show(df):
    return display(HTML(df.to_html(escape=False)))

### Helper function: show errors

All the follwing checks work the same: They try to find errors in the documents.If there are errors, an non empty DataFrame is the result. Thus by purley checking for the length of a validation result, we can show a non-breaking red error message.

In [125]:
from IPython.display import display, HTML

def check(result, error_message="unknown"):
    if not len(result) == 0:
        display(HTML(f'<h2><span style="color: red;">ERROR: {error_message}</h2></b>'))
        show(result)
    else:
        display(HTML(f'<span style="color: green;">OK</span>'))

## Data input

### Import docs into a pandas DataFrames

In [139]:
import glob

structural_df = None

def load_files():

    global structural_df 
    
    # get all relevant AsciiDoc files
    all_adoc_files = glob.glob('../docs/[0-9]*/*.adoc', recursive=True)

    # filter out
    # - "00-introduction" because they are just the files for combining AsciiDoc files
    # - the reference list with books where translation validation makes no sense
    file_list = [f for f in all_adoc_files if not f.endswith('00-introduction.adoc') and not f.endswith('00-references.adoc')]

    print(f"{len(file_list)} files (re-)loaded.")
    file_list[:5] # just show first 5 items to not clutter the notebook with too much text

    df_list = []

    # make a Dataframe for each file
    for file in file_list:

        # read in the AsciiDoc file, keep the blank lines for the line numbers
        df = pd.read_csv(file, names=['text'], sep="\r", skip_blank_lines=False, )

        # add file name as first column
        df.insert(0,'filename',file)

        # Mark the rows with language tags
        df.loc[df['text'] == '// tag::DE[]', 'lang'] = 'DE'
        df.loc[df['text'] == '// tag::EN[]', 'lang'] = 'EN'

        # Forward fill the 'lang' column to propagate the tags to the subsequent rows
        df['lang'] = df['lang'].ffill()

        # add line numbers in file
        df['line'] = df.index + 1

        df_de = df[df['lang'] == 'DE'].reset_index(drop=True)
        df_en = df[df['lang'] == 'EN'].reset_index(drop=True)
        
        # Create Side-by-Side (sbs) Dataframe
        sbs_df = df_de[['filename', 'text', 'line']].join(df_en[['text', 'line']], lsuffix="_DE", rsuffix="_EN")

        # Enrich Dataframe with direct links into file in VS Code via HTML
        # Maybe the `replace` at the end needs to be adjusted to whatever IDE and operating system you have, but the default should work fine for most environments
        urls_DE = sbs_df.apply(lambda x: f"<a href=\"vscode://file/{os.path.abspath(x['filename'])}:{float(x['line_DE'])}\">{float(x['line_DE'])}</a>".replace("/mnt/c/", "C:/"), axis=1).apply(lambda x: f"{x}")
        urls_EN = sbs_df.apply(lambda x: f"<a href=\"vscode://file/{os.path.abspath(x['filename'])}:{float(x['line_EN'])}\">{float(x['line_EN'])}</a>".replace("/mnt/c/", "C:/"), axis=1).apply(lambda x: f"{x}")
        sbs_df.insert(sbs_df.columns.get_loc('line_DE') + 1, 'link_DE', urls_DE)
        sbs_df.insert(sbs_df.columns.get_loc('line_EN') + 1, 'link_EN', urls_EN)

        df_list.append(sbs_df)

        structural_df = pd.concat(df_list, ignore_index=True)


load_files()

show(structural_df.head())

28 files (re-)loaded.


Unnamed: 0,filename,text_DE,line_DE,link_DE,text_EN,line_EN,link_EN
0,../docs\00a-preamble\01-what-to-expect-of-an-advanced-level-module.adoc,// tag::DE[],1,1.0,// tag::EN[],10.0,10.0
1,../docs\00a-preamble\01-what-to-expect-of-an-advanced-level-module.adoc,=== Was vermittelt ein Advanced Level Modul?,2,2.0,=== What is taught in an Advanced Level module?,11.0,11.0
2,../docs\00a-preamble\01-what-to-expect-of-an-advanced-level-module.adoc,,3,3.0,,12.0,12.0
3,../docs\00a-preamble\01-what-to-expect-of-an-advanced-level-module.adoc,Das Modul kann unabhängig von einer CPSA-F-Zertifizierung besucht werden.,4,4.0,The module can be attended independently of a CPSA-F certification.,13.0,13.0
4,../docs\00a-preamble\01-what-to-expect-of-an-advanced-level-module.adoc,,5,5.0,,14.0,14.0


## Learning Goals Check

### Learning Goals

In [128]:
def get_learning_goals(df, column_name, lg_name):

    lang = column_name.split("_")[-1]
 
    lg_filter = df[column_name].fillna('').str.startswith(f"==== {lg_name}")
    lgs = df[lg_filter][[column_name]]

    lgs[column_name] = lgs[column_name].str.replace("==== ","").str.replace("\ \[.*\]" , "", regex=True)

    lgs[[f'main_section_{lang}', f'sub_section_{lang}', f'pure_text_{lang}']] = \
        lgs[column_name].str.extract(f"{lg_name} ([0-9]*)-([0-9]*)\:(.*)")
    
    lgs[[f'main_section_ref_{lang}', f'sub_section_ref_{lang}']] = \
        df[column_name].shift()[lg_filter].str.extract(f"{lg_name}-([0-9]*)-([0-9]*)")

    
    return lgs.reset_index(drop=True)

lgs_DE = get_learning_goals(structural_df, 'text_DE', 'LZ')
lgs_EN = get_learning_goals(structural_df, 'text_EN', 'LG')

lgs = lgs_DE.join(lgs_EN, how="outer")
lgs.head(1)


Unnamed: 0,text_DE,main_section_DE,sub_section_DE,pure_text_DE,main_section_ref_DE,sub_section_ref_DE,text_EN,main_section_EN,sub_section_EN,pure_text_EN,main_section_ref_EN,sub_section_ref_EN
0,LZ 1-1: Gründe für Veränderungen an Software,1,1,Gründe für Veränderungen an Software,1,1,LG 1-1: Reasons for software changes,1,1,Reasons for software changes,1,1


### Check for consistency

010 check main sections numbers

In [129]:
check(lgs[lgs['main_section_DE'] != lgs['main_section_EN']], "main section numbers are different")

020 Check subsections numbers


In [130]:
check(lgs[lgs['sub_section_DE'] != lgs['sub_section_EN']], "sub section numbers are different")

030 Check references numbers sections

In [132]:
check(
    lgs[
    (lgs['main_section_DE'] != lgs['main_section_ref_DE']) |
    (lgs['sub_section_DE'] != lgs['sub_section_ref_DE']) |
    (lgs['main_section_EN'] != lgs['main_section_ref_EN']) |
    (lgs['sub_section_EN'] != lgs['sub_section_ref_EN'])
    ], 
    "References above section numbers don't fit."
)

040 Check for duplication



In [133]:
check(
    lgs[lgs.duplicated(keep=False, subset=["main_section_DE", "sub_section_DE"])],
    "There are secions that have the same numbers at different places.")

### Check translations

In [134]:
checked_lgs = check_translations(lgs, "pure_text_DE", "pure_text_EN", "lg_results.csv")
check(checked_lgs[checked_lgs['correctness'] <= 0.9], "There are some differences between the German and English translations regarding learing goals.")

Checking translations:   0%|          | 0/29 [00:00<?, ?it/s]

Unnamed: 0,text_DE,main_section_DE,sub_section_DE,pure_text_DE,main_section_ref_DE,sub_section_ref_DE,text_EN,main_section_EN,sub_section_EN,pure_text_EN,main_section_ref_EN,sub_section_ref_EN,correctness,assessment,corrected_text_de,corrected_text_en
2,LZ 1-3: Kernbegriffe für Softwareevolution und -änderung,1,3,Kernbegriffe für Softwareevolution und -änderung,1,3,LG 1-3: Core terms of software evolution and -change,1,3,Core terms of software evolution and -change,1,3,0.9,"The translation is mostly accurate, but the use of the hyphen before 'change' is inconsistent with the German text. The term 'change' should be translated without the hyphen for better clarity.",Kernbegriffe für Softwareevolution und -änderung,Core terms of software evolution and change
3,LZ 1-4: Mögliche Vorgehensweisen bei Änderungen,1,4,Mögliche Vorgehensweisen bei Änderungen,1,4,LG 1-4: Possible approaches for changes,1,4,Possible approaches for changes,1,4,0.9,"The translation is mostly accurate, but 'Vorgehensweisen' could be better translated as 'procedures' or 'methods' instead of 'approaches' for a more precise meaning.",Mögliche Vorgehensweisen bei Änderungen,Possible procedures for changes
4,LZ 2-1: Grundlagen der Analyse zur Trennung von „Problem“ und „Lösung“,2,1,Grundlagen der Analyse zur Trennung von „Problem“ und „Lösung“,2,1,LG 2-1: Basics of the analysis to distinguish “problem” from “solution”,2,1,Basics of the analysis to distinguish “problem” from “solution”,2,1,0.9,"The translation is mostly accurate, but the term 'Grundlagen' could be better translated as 'Fundamentals' to convey a stronger sense of foundational principles.",Grundlagen der Analyse zur Trennung von „Problem“ und „Lösung“,Fundamentals of the analysis to distinguish “problem” from “solution”
5,LZ 2-2: Typische Praktiken und Methoden zur Ist-Analyse,2,2,Typische Praktiken und Methoden zur Ist-Analyse,2,2,LG 2-2: Typical practices and methods for current state analysis,2,2,Typical practices and methods for current state analysis,2,2,0.9,"The translation is mostly accurate, but the term 'Ist-Analyse' is more commonly translated as 'as-is analysis' in English, which is a standard term in business analysis contexts.",Typische Praktiken und Methoden zur Ist-Analyse,Typical practices and methods for as-is analysis
10,LZ 2-7: Grundlegende Laufzeitanalysen,2,7,Grundlegende Laufzeitanalysen,2,7,LG 2-7: Fundamental runtime analysis,2,7,Fundamental runtime analysis,2,7,0.9,"The translation is mostly accurate, but 'Grundlegende' could be better translated as 'Basic' instead of 'Fundamental' for a more precise meaning in this context.",Grundlegende Laufzeitanalysen,Basic runtime analysis
11,LZ 3-1: Mittels unterschiedlicher Arten betriebswirtschaftlicher Größen situativ argumentieren,3,1,Mittels unterschiedlicher Arten betriebswirtschaftlicher Größen situativ argumentieren,3,1,LG 3-1: Argue contextually using different types of business metrics,3,1,Argue contextually using different types of business metrics,3,1,0.9,The translation is mostly accurate but could be improved for clarity and precision. The term 'betriebswirtschaftlicher Größen' is more accurately translated as 'business variables' rather than 'business metrics'.,Mittels unterschiedlicher Arten betriebswirtschaftlicher Größen situativ argumentieren,Argue contextually using different types of business variables
13,LZ 3-3: Probleme und Lösungsansätze schätzen,3,3,Probleme und Lösungsansätze schätzen,3,3,LG 3-3: Estimate for problems and solution approaches,3,3,Estimate for problems and solution approaches,3,3,0.6,The translation is not entirely accurate. The phrase 'Estimate for problems and solution approaches' does not convey the same meaning as the German text. A more appropriate translation would focus on the act of assessing rather than estimating.,Probleme und Lösungsansätze bewerten,Assess problems and solution approaches
14,LZ 4-1: Bewertete Probleme und Lösungsansätze explizit darstellen,4,1,Bewertete Probleme und Lösungsansätze explizit darstellen,4,1,LG 4-1: Explicitly represent evaluated problems and solution approaches,4,1,Explicitly represent evaluated problems and solution approaches,4,1,0.9,"The translation is mostly accurate, but the term 'solution approaches' could be more naturally translated as 'solutions' or 'approaches to solutions'.",Bewertete Probleme und Lösungsansätze explizit darstellen,Explicitly represent evaluated problems and solutions
16,LZ 4-3: Effekte von „Rewrite“ versus „kontinuierliche Verbesserung“,4,3,Effekte von „Rewrite“ versus „kontinuierliche Verbesserung“,4,3,LG 4-3: Impact of “rewrite” versus “continuous improvement”,4,3,Impact of “rewrite” versus “continuous improvement”,4,3,0.9,"The translation is mostly accurate, but the term 'Effekte' could be better translated as 'Effects' instead of 'Impact' for a more precise meaning in this context.",Effekte von „Rewrite“ versus „kontinuierliche Verbesserung“,Effects of “rewrite” versus “continuous improvement”
17,LZ 5-1: Möglichkeiten zur Optimierung von Entwicklungsprozessen,5,1,Möglichkeiten zur Optimierung von Entwicklungsprozessen,5,1,LG 5-1: Potential approaches to optimize development processes,5,1,Potential approaches to optimize development processes,5,1,0.9,"The translation is mostly accurate, but the word 'Möglichkeiten' could be better translated as 'Options' instead of 'Potential approaches' for a more direct translation.",Möglichkeiten zur Optimierung von Entwicklungsprozessen,Options for optimizing development processes


## Cleaning Phase I: Structural checking


### Check for different blank lines

Check which lines in the German version are blank but not so in the English version.

You can delete the `head()` method in the line before the last line for showing all findings.

Tip: Work backwards from the bottom to the top when you want to fix this manally.

Pro-Tip: Introduce a automated formatting for AsciiDoc file or write a tool (if existent)

In [138]:
load_files()

# Filter rows where both text_DE and text_EN are not NaN at the same time
blanks_df = structural_df[(structural_df['text_DE'].isna() & ~structural_df['text_EN'].isna())]

# Display the filtered DataFrame
check(blanks_df, "Blank lines on German and English parts don't match.")

28 files found.


### Check for wrong abbreviations

In [31]:
ev_de_check = structural_df[structural_df['text_DE'].fillna('').str.contains("e\.V\.")]
display(HTML(ev_de_check.to_html(escape=False)))

Unnamed: 0,filename,text_DE,line_DE,link_DE,text_EN,line_EN,link_EN


In [32]:
ev_en_check = structural_df[structural_df['text_EN'].fillna('').str.contains("e\.V\.")]
display(HTML(ev_en_check.to_html(escape=False)))

Unnamed: 0,filename,text_DE,line_DE,link_DE,text_EN,line_EN,link_EN


### Check for different amount of sentences in a line

<span style="color:blue">Status: doesn't work, don't know ecactly why (e.g. because of "z. B." and so on... but should work in theory)</span>

In [141]:
import spacy
from spacy.cli import download

nlp_de = None
# Download the German model
try:
    nlp_de = spacy.load('de_core_news_sm')
except:
    download("de_core_news_sm")
    nlp_de = spacy.load('de_core_news_sm')


nlp_en = None
# Download the English model
try:
    nlp_en = spacy.load('en_core_web_sm')
except:
    download("en_core_web_sm")
    nlp_en = spacy.load('en_core_web_sm')

sentence_de_count = lambda x : len(list(nlp_de(x).sents))
sentence_en_count = lambda x : len(list(nlp_en(x).sents))

sentences_df = structural_df.copy()
sentences_df['text_DE'] = sentences_df['text_DE'].fillna('')
sentences_df['text_EN'] = sentences_df['text_EN'].fillna('')


sentences_df['sentence_count_DE'] = sentences_df['text_DE'].str.replace("=", "").apply(sentence_de_count)
sentences_df['sentence_count_EN'] = sentences_df['text_EN'].str.replace("=", "").apply(sentence_en_count)

different_amount_of_sentences_df = sentences_df[sentences_df['sentence_count_DE'] != sentences_df['sentence_count_EN']]

print(f"{len(different_amount_of_sentences_df)} differences regarding amount of sentences found!")


#display(HTML(different_amount_of_sentences_df.to_html(escape=False)))

253 differences regarding amount of sentences found!


### Check durations 

In [185]:
number_of_duration_term_documents = len(structural_df[structural_df['filename'].str.contains("duration-terms")]['filename'].drop_duplicates())
number_of_duration_term_documents

6

In [194]:
number_of_duration_term_documents = len(structural_df[structural_df['filename'].str.contains("duration-terms")]['filename'].drop_duplicates())
number_of_correct_formatted_duration_terms_DE = len(structural_df[structural_df["text_DE"].fillna('').str.contains(r"\| Lehre: \d{2,3} Min\. \| Übung: \d{2,3} Min\.")])
number_of_correct_formatted_duration_terms_DE
assert number_of_duration_term_documents == number_of_correct_formatted_duration_terms_DE, "Falsch formatierte Zeitangaben in DE vorhanden"

AssertionError: Falsch formatierte Zeitangaben in DE vorhanden

In [None]:

assert number_of_duration_term_documents == number_of_correct_formatted_duration_terms_DE

Check times structure

In [163]:
#| Lehre: 30 Min. | Übung: 75 Min.
#| Teaching: 30 min | Practice: 75 min
times_df["text_DE"].str.contains(r"\| Lehre: \d{2} Min\. \| Übung: \d{2} Min\.")
times_df["text_EN"].str.contains(r"\| Teaching: \d{2} min \| Practice: \d{2} min")


50    False
51    False
54    False
55    False
58    False
59    False
62    False
63    False
66    False
67    False
70    False
71    False
Name: text_EN, dtype: bool

Check different times

In [None]:

times_df = structural_df[
    (structural_df['text_DE'].str.strip().str.contains("^\| [0-9]*$")) |
    (structural_df['text_EN'].str.strip().str.contains("^\| [0-9]*$"))
].copy()

times_df.head(1)

times_df["times_theory_DE"], times_df["times_practice_DE"] = times_df["text_DE"].str.extract(".*([0-9]+) Min. (Lehre)([0-9]+)")
times_df["times_theory_EN"], times_df["times_practice_EN"] = times_df["text_EN"].str.extract(".*([0-9]+).*([0-9]+)")

result = times_df[
    (times_df['times_theory_DE'] != times_df['times_theory_EN']) |
    (times_df['times_practice_DE'] != times_df['times_practice_DE'])]
check(result, "There are different times in the documents.")

Calculate the complete times

In [146]:
times_df['times_DE'].sum()

np.int64(1150)

### Check bullet-point

In [36]:
indentation = lambda x: len(x) - len(x.lstrip()) if isinstance(x, str) and x.lstrip().startswith(('-', '*')) else None

bullets_df = structural_df.copy()


bullets_df['indentation_DE'] = bullets_df['text_DE'].apply(indentation)
bullets_df['indentation_EN'] = bullets_df['text_EN'].apply(indentation)
bullets_df = bullets_df.dropna(subset=['indentation_DE', 'indentation_EN'], how='all')
bullets_result = bullets_df[bullets_df['indentation_DE'] != bullets_df['indentation_EN']]
show(bullets_result)

Unnamed: 0,filename,text_DE,line_DE,link_DE,text_EN,line_EN,link_EN,indentation_DE,indentation_EN
292,../docs\02-analyze\02-learning-goals.adoc,(Statische) Analyse von bestehendem Quellcode und dessen Strukturen durchführen und dokumentieren.,56,56.0,* Perform and document (static) analysis of existing source code and its structure.,118.0,118.0,,0.0
487,../docs\06-examples\01-duration-terms.adoc,**Anmerkung**:,7,7.0,,25.0,25.0,0.0,


### Check URLs

In [37]:
import requests

check_url = lambda x : requests.get(x, timeout=3).status_code

urls = pd.DataFrame(structural_df['text_DE'].str.extract("(https?://.*)").dropna())
urls['url'] = urls[0].str.split("[").str[0].str.split(")").str[0]
urls['status'] = urls['url'].apply(check_url)
urls[urls['status'] != 200]

Unnamed: 0,0,url,status


## Phase II: Translation correction

### Prepare Dataframe

An additional Dataframe based on the `structural_df`.  For this, we simplay drop all blank lines which don't need to be checked.

In [84]:
import re

regex_dict = {}
regex_dict["meta_data"] = "^\["
regex_dict["comments"] = "^//"
regex_dict["table_header"] = "^\|===$"
regex_dict["table_number_cells"] = "^\| [0-9]*$"
regex_dict["table_emtpy_cell"] = "^\|$"

regex = re.compile(r"|".join(regex_dict.values()))

result_dfs = []
for filename, group in structural_df.groupby('filename'):
    # Reset index for each group to ensure correct row ordering
    group = group.reset_index(drop=True)

    # Remove NaN rows for DE section and shift the rest upwards
    non_nan_de = group.dropna(subset=['text_DE']).reset_index(drop=True)
    group['text_DE'], group['line_DE'] = non_nan_de['text_DE'], non_nan_de['line_DE']
    
    # Remove NaN rows for EN section and shift the rest upwards
    non_nan_en = group.dropna(subset=['text_EN']).reset_index(drop=True)
    group['text_EN'], group['line_EN'] = non_nan_en['text_EN'], non_nan_en['line_EN']
    
    # get rid of meta data items; we checked before if that would be a problem
    group = group[~(
        (group['text_DE'].str.contains(regex)) |
       (group['text_EN'].str.contains(regex))
    )]
        
    # Append the processed non NaN group to the result list (it's enough doing it via one of the lines columns)
    result_dfs.append(group.dropna(subset=['line_DE']))

# Concatenate all the processed groups back together 
translations_df = pd.concat(result_dfs, ignore_index=True)
translations_df.head(1)

Unnamed: 0,filename,text_DE,line_DE,link_DE,text_EN,line_EN,link_EN
0,../docs\00a-preamble\01-what-to-expect-of-an-a...,=== Was vermittelt ein Advanced Level Modul?,2.0,"<a href=""vscode://file/c:\dev\repos\curriculum...",=== What is taught in an Advanced Level module?,11.0,"<a href=""vscode://file/c:\dev\repos\curriculum..."


### Create corrected translations

This is the core of this notebook. It sends the German and the English translation to ChatGPT and requests an assessment.

Note: This might take very long for huge data.

In [57]:
checked_translations = check_translations(translations_df, cache_filepath="text_results.csv")
checked_translations.head(1)

Checking translations:   0%|          | 0/276 [00:00<?, ?it/s]

KeyboardInterrupt: 

### Rearrange translation result's columns for easier correcting

In [46]:
assessed_content = checked_translations[[
    'filename',
    'text_DE',
    'text_EN',
    'corrected_text_en',
    'correctness',
    'assessment',
    'link_DE',
    'link_EN',
    'line_DE',
    'line_EN']]

assessed_content.head(1)

Unnamed: 0,filename,text_DE,text_EN,corrected_text_en,correctness,assessment,link_DE,link_EN,line_DE,line_EN
0,../docs\00a-preamble\01-what-to-expect-of-an-a...,=== Was vermittelt ein Advanced Level Modul?,=== What is taught in an Advanced Level module?,=== What is conveyed in an Advanced Level module?,0.9,"The translation is mostly accurate, but the te...","<a href=""vscode://file/c:\dev\repos\curriculum...","<a href=""vscode://file/c:\dev\repos\curriculum...",2.0,11.0


### Create assessment output

In [47]:
assessed_content.to_html("temp/translation_assessment_report.html", escape=False)
assessed_content.to_excel("temp/translation_assessment_report.xlsx", index=None)
assessed_content.head(1)

Unnamed: 0,filename,text_DE,text_EN,corrected_text_en,correctness,assessment,link_DE,link_EN,line_DE,line_EN
0,../docs\00a-preamble\01-what-to-expect-of-an-a...,=== Was vermittelt ein Advanced Level Modul?,=== What is taught in an Advanced Level module?,=== What is conveyed in an Advanced Level module?,0.9,"The translation is mostly accurate, but the te...","<a href=""vscode://file/c:\dev\repos\curriculum...","<a href=""vscode://file/c:\dev\repos\curriculum...",2.0,11.0


### Set the threshold
Set the threshold for correctness. The default with 0.8 may be too ambitious.m

In [45]:
THRESHOLD = 0.5
needed_corrections = assessed_content[assessed_content['correctness'] <= THRESHOLD]
display(HTML(needed_corrections.to_html(escape=False)))

Unnamed: 0,filename,text_DE,text_EN,corrected_text_en,correctness,assessment,link_DE,link_EN,line_DE,line_EN
21,../docs\00b-basics\02-curriculum-structure-and-chronological-breakdown.adoc,|===,|===,|===,0.0,"The translations do not match at all, indicating a complete failure in conveying the original meaning and structure.",5.0,54.0,6.0,55.0
33,../docs\00b-basics\02-curriculum-structure-and-chronological-breakdown.adoc,| 30,| 60,| 30,0.0,"The translations do not match in content, as the German text indicates '30' while the English text indicates '60'. This is a significant discrepancy that needs to be addressed.",17.0,66.0,22.0,71.0
38,../docs\00b-basics\02-curriculum-structure-and-chronological-breakdown.adoc,| 280,| 180,| 280,0.0,"The translations do not match in content, indicating a significant error in translation. The German text indicates a value of 280, while the English text states 180, which is inconsistent.",22.0,71.0,29.0,78.0
43,../docs\00b-basics\02-curriculum-structure-and-chronological-breakdown.adoc,|,|,,0.0,"The translations do not match at all, indicating a complete lack of coherence between the German and English texts.",27.0,76.0,36.0,85.0
44,../docs\00b-basics\02-curriculum-structure-and-chronological-breakdown.adoc,|,|,,0.0,"The translations do not match at all, indicating a complete lack of correspondence between the German and English texts.",28.0,77.0,37.0,86.0
45,../docs\00b-basics\02-curriculum-structure-and-chronological-breakdown.adoc,|,|,,0.0,"The translations do not match at all, indicating a complete lack of correspondence between the German and English texts.",29.0,78.0,38.0,87.0
49,../docs\00b-basics\02-curriculum-structure-and-chronological-breakdown.adoc,|===,|===,|===,0.0,"The translations do not match at all, indicating a complete failure in conveying the original meaning and structure.",33.0,82.0,43.0,92.0
59,../docs\00b-basics\03-timing-didactics-and-more.adoc,|===,|===,|===,0.0,"The translations do not match at all, indicating a complete failure in conveying the original meaning and structure.",15.0,34.0,18.0,37.0
63,../docs\00b-basics\03-timing-didactics-and-more.adoc,|===,|===,|===,0.0,"The translations do not match at all, indicating a complete failure in conveying the original meaning and structure.",19.0,38.0,22.0,41.0
64,../docs\00b-basics\04-prerequisites-for-this-training.adoc,=== Voraussetzungen für das Modul IMPROVE,=== Prerequisites,=== Prerequisites for the IMPROVE Module,0.5,"The English translation is incomplete as it does not convey the full meaning of the German text, which specifies 'Voraussetzungen für das Modul IMPROVE' (Prerequisites for the IMPROVE module).",2.0,22.0,2.0,22.0


### Produce output files



#### Create mergeable version

Produces a merge file to work with a merge editor. If you're brave, you can also set `is_override` to true to merge the suggestions of the LLM directly into the original file.

Tip: Overriding only should be enabled if both sections of the translations are almost structurally identically because otherwise it leads to complete chaos!

**Warning: Before setting it to True, make sure that you committed your local changes! Good luck!**

In [67]:
# flag for setting the override of the original file
is_override = False

# Group correctiony for each file to process each file only once
for filename, group in needed_corrections.groupby('filename'):
    # Open the file once for reading
    with open(filename, 'r') as file:
        lines = file.readlines()

    # Modify the relevant lines in memory
    modified = False
    for index, row in group.iterrows():
        if row['corrected_text_en']:  # Check if corrected_text_en is not empty
            line_number = int(row['line_EN']) - 1  # Convert to zero-based index
            # Insert merge-style corrections
            lines[line_number] = (
                f"<<<<<<< ORIGINAL\n{lines[line_number]}"
                f"=======\n{row['corrected_text_en']}\n"
                f">>>>>>> CORRECTED | German text: \'{row['text_DE']}\' (Line {int(row['line_DE'])}) | Assessment: {row['assessment']}\n"
            )
            modified = True  # Mark as modified since we are replacing the line

    # Write the updated content to a new file ending with '_merge.adoc'
    if modified:
        new_filename = filename if is_override else filename.replace('.adoc', '_merge.adoc')
        with open(new_filename, 'w') as file:
            file.writelines(lines)

        print(f"Merge-style corrected file saved as {new_filename}")

Merge-style corrected file saved as ../docs/00b-basics/02-curriculum-structure-and-chronological-breakdown_merge.adoc
Merge-style corrected file saved as ../docs/00b-basics/03-timing-didactics-and-more_merge.adoc
Merge-style corrected file saved as ../docs/04-planning/01-duration-terms_merge.adoc


ValueError: cannot convert float NaN to integer

#### Create diffable version
Version that produces a separate file to work with a diff editor. It's disabled by default via the `is_active` flag because the mergable version is way better (IMHO).

In [None]:
# this is just a flag to set this block active or not
is_active = False

# Group by 'filename' to process each file only once
for filename, group in needed_corrections.groupby('filename'):
    # Open the file once for reading
    with open(filename, 'r') as file:
        lines = file.readlines()

    # Modify the relevant lines in memory
    modified = False
    for index, row in group.iterrows():
        line_number = int(row['line_EN']) - 1  # Convert to zero-based index
        lines[line_number] = row['corrected_text_en'] + '\n'
        modified = True  # Mark as modified since we are replacing the line

    # Write the updated content to a new file ending with '_corrected.adoc'
    if modified and is_active:
        new_filename = filename.replace('.adoc', '_aicorrected.adoc')
        with open(new_filename, 'w') as file:
            file.writelines(lines)

## Summary

Happy correcting!

Markus Harrer, October 2024