# Pipeline - Translate to German

This notebook takes the paraphrased sentences from the idiom paraphrase model(s) and feeds them through a regular T5 model (not fine-tuned) to translate them from English to German.

## Load packages and data

In [None]:
!pip install sentencepiece -q
!pip install transformers -q
!pip install torch -q


In [None]:
# Drive
from google.colab import drive

# Util
import os
import re
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', None)

# ML
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import inputs and predictions of the models


# Import split 1 no prefix (easiest for the inputs)
path = "file path here"
val_data = pd.read_csv(path+"data/data_nopref_val_split1.csv", sep="=")
en_id_data = val_data[['input']]

# test_data = pd.read_csv(path+"data/data_nopref_test_split1.csv", sep="=")
# en_id_data = test_data['input']


In [None]:
# Loading in prediction data
file_path = "file path here"
predictions_file = pd.read_csv(file_path+"predictions_IOBs.csv")

# Isolate the columns we need
en_lit_data = predictions_file[['Generated Text']].rename(columns={"Generated Text": "input"})
en_id_data = predictions_file[['Input']].rename(columns={"Input": "input"})
# references = predictions_file['Actual Text']
# sources = predictions_file['Input']

In [None]:
en_lit_data.iloc[0, 0]

"Let's assume that she is right."

In [None]:
def cleanSources(sources, prefix, space_punct=False, suffix=None):
  '''Remove prefixes (and optionally a suffix) from the source sentences.

  sources: pandas Series of strings.
  prefix: list of prefix strings to strip, e.g. ["paraphrase:"].
  space_punct: if True, put a space before commas, periods and apostrophes.
  suffix: if given, drop the first occurrence of this marker and everything after it.
  '''
  if space_punct:
    # regex=False so that "." is treated literally, not as "any character"
    sources = sources.str.replace(",", " ,", regex=False)
    sources = sources.str.replace(".", " .", regex=False)
    sources = sources.str.replace("'", " '", regex=False)
    sources = sources.str.replace("  ", " ", regex=False)

  sources = sources.str.replace('"', '', regex=False)
  # Removes any remainder of a prefix that might have stuck
  for p in prefix:
    sources = sources.str.replace(p, "", regex=False)

  # Removes the idiom marker and everything following it
  if suffix is not None:
    sources = sources.str.split(suffix, n=1).str[0]
  return sources
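To make the effect of the cleaning step concrete, here is a minimal single-string sketch of the same idea in plain Python. The helper name `clean_source` and the example sentence are illustrative assumptions, and the final `strip()` is an addition that the notebook's function does not perform.

```python
def clean_source(text, prefixes=("id_par sentence:",), suffix="idiom: "):
    """Hypothetical single-string version of the cleaning step."""
    # Drop double quotes, as the notebook's function does
    text = text.replace('"', "")
    # Remove every known prefix
    for p in prefixes:
        text = text.replace(p, "")
    # Cut at the first occurrence of the suffix marker, if one is given
    if suffix is not None:
        text = text.split(suffix, 1)[0]
    # Assumption: trim leftover whitespace (not done in the original)
    return text.strip()


raw = "id_par sentence: The turning point came late. idiom: turning point"
cleaned = clean_source(raw)
print(cleaned)
```

Only the sentence itself should remain after the prefix and the trailing idiom annotation are removed.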

In [None]:
en_id_data['input'] = cleanSources(en_id_data['input'], prefix=["id_par sentence:"], suffix="idiom: ")

In [None]:
# Create input data

# Sentences with idioms (needed for the max-length check below)
en_id_input = pd.DataFrame()
en_id_input['input'] = en_id_data.agg('translate English to German: {0[input]}'.format, axis=1)

# Sentences with idioms paraphrased
en_lit_input = pd.DataFrame()
en_lit_input['input'] = en_lit_data.agg('translate English to German: {0[input]}'.format, axis=1)
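The `agg(...format...)` call above simply prepends the T5 task prefix to each sentence. A minimal sketch of the same transformation on a plain list follows; the helper name `add_task_prefix` is an assumption, and the whitespace normalisation mirrors what `InputData` does later rather than anything this cell does.

```python
def add_task_prefix(sentences, prefix="translate English to German: "):
    """Hypothetical helper: prepend a T5 task prefix to every sentence."""
    # " ".join(s.split()) collapses repeated whitespace into single spaces
    return [f"{prefix}{' '.join(str(s).split())}" for s in sentences]


prefixed = add_task_prefix(["Let's assume that she is right."])
print(prefixed[0])
```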


In [None]:
# Check max length
lengths_en_id = en_id_input["input"].str.split(" ")
lengths_en_lit = en_lit_input["input"].str.split(" ")

print("Max number of tokens input = ", max(lengths_en_id.str.len().max(),lengths_en_lit.str.len().max()))

Max number of tokens input =  107
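Note that this check counts whitespace-separated words, whereas the model's limit applies to subword tokens, and a sentence typically yields more subword tokens than words. The sketch below shows one way to count how many inputs would exceed a given limit. The function name `count_overflow` is hypothetical, and `str.split` is used only as a stand-in tokenizer; in the notebook one would presumably pass the T5 tokenizer's `tokenize` method instead.

```python
def count_overflow(texts, tokenize, max_len):
    """Return the longest token count and the indices of texts longer than max_len.

    tokenize: any callable mapping a string to a list of tokens (assumption).
    """
    lengths = [len(tokenize(t)) for t in texts]
    too_long = [i for i, n in enumerate(lengths) if n > max_len]
    return max(lengths, default=0), too_long


texts = [
    "translate English to German: Let's assume that she is right.",
    "translate English to German: Not at all.",
]
# Stand-in tokenizer: whitespace split
longest, overflow = count_overflow(texts, str.split, max_len=8)
print(longest, overflow)
```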


In [None]:
# Check data
en_lit_input.head(n=10)

Unnamed: 0,input
0,translate English to German: Let's assume that...
1,translate English to German: The management wa...
2,translate English to German: I am really happy...
3,translate English to German: The boiling point...
4,translate English to German: The teacher makes...
5,translate English to German: I heard you had a...
6,translate English to German: I appreciate the ...
7,translate English to German: The client has be...
8,translate English to German: Losing that job t...
9,translate English to German: The teacher had a...


## Setup functions & classes

### CLASS: InputData

The InputData class wraps the input dataframe as a PyTorch Dataset: it tokenizes each source sentence so that a DataLoader can batch the results and feed them to the model.

In [None]:
class InputData(Dataset):
    """
    Creating a dataset class for reading the dataset and
    loading it into the dataloader, to pass it to the
    neural network (only source, no target)

    """

    def __init__(
        self, dataframe, tokenizer, input_len, input_text
    ):
        """
        Initializes an InputData class

        Args:
            dataframe (pandas.DataFrame): Input dataframe
            tokenizer (transformers.tokenizer): Transformers tokenizer
            input_len (int): Max length of source text
            input_text (str): column name of source text
        """
        self.tokenizer = tokenizer
        self.data = dataframe
        self.input_len = input_len
        self.input_text = self.data[input_text]

    def __len__(self):
        """returns the length of dataframe"""

        return len(self.input_text)

    def __getitem__(self, index):
        """return the input ids and attention masks"""

        input_text = str(self.input_text[index])

        # Collapse runs of whitespace into single spaces
        input_text = " ".join(input_text.split())

        # padding="max_length" replaces the deprecated pad_to_max_length flag
        encoded = self.tokenizer.batch_encode_plus(
            [input_text],
            max_length=self.input_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

        input_ids = encoded["input_ids"].squeeze()
        input_mask = encoded["attention_mask"].squeeze()

        return {
            "input_ids": input_ids.to(dtype=torch.long),
            "input_mask": input_mask.to(dtype=torch.long),
        }
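To illustrate what `padding="max_length"` together with `truncation=True` produces, here is a simplified pure-Python sketch. The helper `pad_or_truncate` and the token IDs are made up for illustration, the pad ID of 0 is an assumption matching T5's usual vocabulary, and a real tokenizer additionally keeps the end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Hypothetical helper: fix a sequence to max_len and build its attention mask."""
    # Truncate anything beyond max_len
    ids = list(token_ids)[:max_len]
    # Real tokens get mask 1
    mask = [1] * len(ids)
    # Pad the remainder; padding positions get mask 0
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Illustrative token IDs, not real tokenizer output
ids, mask = pad_or_truncate([13959, 1566, 12, 1], max_len=6)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padded positions, which is why `generate` passes it alongside the input IDs.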

### FUNC: generate

The generate function runs the model in inference mode over every batch in the loader and decodes the generated token IDs back into text.



In [None]:
def generate(tokenizer, model, device, loader):

  """
  Function to generate predictions using the model

  """
  model.eval()
  predictions = []
  with torch.no_grad():
      for _, data in enumerate(loader, 0):
          ids = data['input_ids'].to(device, dtype = torch.long)
          mask = data['input_mask'].to(device, dtype = torch.long)

          # Generate outputs
          generated_ids = model.generate(
              input_ids = ids,
              attention_mask = mask, 
              max_length=150, 
              num_beams=2,
              repetition_penalty=2.5, 
              length_penalty=1.0, 
              early_stopping=True
              )
          preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]

          predictions.extend(preds)

  print("Outputs generated.")


  return predictions
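The `repetition_penalty=2.5` setting discourages the decoder from repeating tokens it has already produced. A rough sketch of the rule, assuming the usual formulation (scores of already-generated tokens are divided by the penalty when positive and multiplied by it when negative), is given below. The function name and the example logits are illustrative only.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Sketch: penalise the scores of tokens that were already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Either branch lowers the score when penalty > 1
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Tokens 0 and 1 were already generated; token 2 was not
penalised = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.5)
print(penalised)
```

A penalty as high as 2.5 is fairly aggressive, which may be worth keeping in mind when inspecting odd or truncated translations.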

### FUNC: T5Generate

T5Generate accepts the input data and utilizes the InputData class for data handling and the generate function to generate outputs from the T5 model.

In [None]:
def T5Generate(
    input_data, input_text, model_type="t5-small", output_dir="./outputs/"
):

    """
    T5Generate has 4 arguments:

      input_data: Dataframe containing the input text
      input_text: Column name of the input text
      model_type: Model name or path passed to from_pretrained (default "t5-small")
      output_dir: Output directory to save results

    """

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(42)  # pytorch random seed
    np.random.seed(42)  # numpy random seed
    torch.backends.cudnn.deterministic = True

    # tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained(model_type)

    # Defining the model. The model is then sent to device (GPU/TPU)
    model = T5ForConditionalGeneration.from_pretrained(model_type)
    model = model.to(device)

    # Importing the raw dataset
    input_data = input_data[[input_text]]

    # Creation of InputData and Dataloader
    print(f"INPUT data: {input_data.shape}")

    # Creating the Input dataset for further creation of Dataloader
    input_set = InputData(
        input_data,
        tokenizer,
        110,
        input_text,
    )

    # Defining the parameters for creation of dataloaders
    input_params = {
        "batch_size": 4,
        "shuffle": False,
        "num_workers": 0,
    }

    # Creation of Dataloaders for data
    input_loader = DataLoader(input_set, **input_params)

    # Generating output
    translations = generate(tokenizer, model, device, input_loader)
    final_df = pd.DataFrame({"Input": input_data[input_text], "Generated Text": translations})
    # Make sure the output directory exists before writing
    os.makedirs(output_dir, exist_ok=True)
    final_df.to_csv(os.path.join(output_dir, "translations.csv"))

    print(
        f"""Generated data saved @ {os.path.join(output_dir,'translations.csv')}\n"""
    )
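Because T5Generate always writes to `translations.csv`, two calls with the same `output_dir` overwrite each other's results (for example, a paraphrase run followed by a translation run). One possible way around this, sketched here with a hypothetical helper `stage_output_path` that is not part of the notebook, is to derive the file name from the pipeline stage.

```python
import os


def stage_output_path(output_dir, stage, basename="translations"):
    """Hypothetical helper: build a per-stage CSV path inside output_dir."""
    return os.path.join(output_dir, f"{basename}_{stage}.csv")


paraphrase_path = stage_output_path("outputs", "paraphrase")
translate_path = stage_output_path("outputs", "translate")
print(paraphrase_path)
print(translate_path)
```

With distinct paths per stage, the paraphrase output would still be available on disk after the translation step has run.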

## Translate sentences

#### Paraphrased sentences

In [None]:
# Generate translations to paraphrased sentences
T5Generate(input_data=en_lit_input, input_text="input", model_type="t5-small", output_dir=file_path)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


INPUT data: (412, 1)
Outputs generated.
Generated data saved @ /content/drive/MyDrive/Pipeline/Data/Idpar_Idiom_IOBs/outputs/test_data/translations.csv



In [None]:
# Load translations
translations_lit = pd.read_csv(file_path+"translations.csv")
# en_id_data = en_id_data[['input']].rename(columns={"input": "Source"})
translations = en_id_data.join(translations_lit["Input"])
translations["Input"] = cleanSources(translations['Input'], prefix=["translate English to German:"], suffix=None)
translations = translations.join(translations_lit["Generated Text"])
translations.head()

translations.to_csv(os.path.join(file_path, "translations_Source_German.csv"))

# for i in range(len(translations_lit)):
#   print(i)
#   print(translations_lit['Input'].iloc[i])
#   print(translations_lit['Generated Text'].iloc[i])
#   print()


In [None]:
# Still some cases where the output is basically blank.
print(en_id_input['input'].iloc[153])
print(translations_lit['Input'].iloc[153])
print(translations_lit['Generated Text'].iloc[153])
print()
print(en_id_input['input'].iloc[223])
print(translations_lit['Input'].iloc[223])
print(translations_lit['Generated Text'].iloc[223])
print()
print(en_id_input['input'].iloc[362])
print(translations_lit['Input'].iloc[362])
print(translations_lit['Generated Text'].iloc[362])
print()

# Also a case where the sentence is not translated to german but rewritten in english?
print()
print(en_id_input['input'].iloc[4])
print(translations_lit['Input'].iloc[4])
print(translations_lit['Generated Text'].iloc[4])
print()

# Also some cases where the english sentence was unnecessarily rewritten (though not really a problem)
print()
print(en_id_input['input'].iloc[2])
print(translations_lit['Input'].iloc[2])
print(translations_lit['Generated Text'].iloc[2])
print()

# Also many cases where idiom is not translated correctly, so the end result is weird as well
print()
print(en_id_input['input'].iloc[15])
print(translations_lit['Input'].iloc[15]) # Here missing punctuation also affects things
print(translations_lit['Generated Text'].iloc[15])
print()



697
translate English to German: Did you enjoy the ballet this weekend? Not at all.
Das Ballett hat Sie an diesem Wochenende genossen, aber gar nicht.

698
translate English to German: His piercing blue eyes are at opposites with the rest of his features and
Seine blauen Augen sind gegensätzlich mit dem Rest seiner Merkmale und

699
translate English to German: You seem to be crying fake tears at the thought of having to miss work tomorrow.
Sie scheinen gefälschte Tränen zu schreien, wenn man denkt, morgen die Arbeit verpassen zu müssen.

700
translate English to German: Working and studying at the same time has led to me having to use only my energy at the same time.
Die Arbeit und das Studium hat dazu geführt, dass ich nur meine Energie gleichzeitig nutzen musste.

701
translate English to German: Going into a business without carrying out proper studies is very risky.
Es ist sehr riskant, in ein Unternehmen zu gehen, ohne ordentliche Studien durchzuführen.

702
translate English t
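Rather than looking up problem rows by hand, the blank and untranslated cases noted above could be flagged automatically. The following is a rough sketch under stated assumptions: the helper name `flag_suspicious` and the `min_chars` threshold are hypothetical, and the sources are assumed to have had the task prefix removed already.

```python
def flag_suspicious(sources, outputs, min_chars=3):
    """Flag outputs that are (nearly) blank or identical to their source."""
    flagged = []
    for i, (src, out) in enumerate(zip(sources, outputs)):
        # Treat None and NaN (out != out) as empty; pandas reads blank CSV cells as NaN
        text = "" if out is None or out != out else str(out).strip()
        if len(text) < min_chars:
            flagged.append((i, "blank"))
        elif text.lower() == str(src).strip().lower():
            flagged.append((i, "copied"))
    return flagged


sources = ["Not at all.", "Take care.", "Good morning.", "See you."]
outputs = ["Gar nicht.", "", "Good morning.", float("nan")]
print(flag_suspicious(sources, outputs))
```

This catches only exact copies; paraphrased English output would need a language-identification step, which is outside the scope of this sketch.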

#### Idiomatic sentences

In [None]:
# Generate translations to idiomatic sentences
#T5Generate(input_data=en_id_input, input_text="input", model_type="t5-small", output_dir=path+"outputs")

In [None]:
# Load translations
translations_id = pd.read_csv(path+"outputs/translations_idiomatic.csv")
for i in range(len(translations_id)):
  print(i)
  print(translations_id['Input'].iloc[i])
  print(translations_id['Generated Text'].iloc[i])
  print()


0
translate English to German: I don't believe that he didn't take the money , but I will give him the benefit of the doubt until I can prove otherwise .
Ich glaube nicht, dass er das Geld nicht genommen hat, aber ich werde ihm den Vorteil des Zweifels geben, solange ich nicht nachweisen kann.

1
translate English to German: She manages to give her father a ballpark amount that she would need every week .
Sie schafft es, ihrem Vater einen Ballparkbetrag zu geben, den sie jede Woche benötigen würde.

2
translate English to German: It was really good to have you here and I would like to thank all of you from the bottom of my heart .
Es war wirklich gut, Sie hier zu haben und ich möchte Ihnen allen von Herzen danken.

3
translate English to German: I have seen many turning points in my life and don't believe that only one of them ever became the reason for my success .
Ich habe viele Wendepunkte in meinem Leben gesehen und glaube nicht, dass nur einer von ihnen jemals der Grund für meinen

## Full pipeline

Load English sentences with idioms, feed them into the paraphrase model to get literal sentences, then feed those into the basic T5 model for translation to German.
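The two stages can be summarised as a simple function composition. In this sketch, `run_pipeline` is a hypothetical helper, and `paraphrase` and `translate` are assumed to be callables that map a list of strings to a list of strings; the stand-ins used in the demo are not real models and only echo their input.

```python
def run_pipeline(examples, paraphrase, translate):
    """Sketch of the two-stage pipeline over (sentence, idiom) pairs."""
    # Stage 1: paraphrase the idiomatic sentence into a literal one
    para_inputs = [f"id_par sentence: {s} idiom: {i}" for s, i in examples]
    literal = paraphrase(para_inputs)
    # Stage 2: translate the literal sentence with the T5 task prefix
    mt_inputs = [f"translate English to German: {t}" for t in literal]
    german = translate(mt_inputs)
    return [
        {"Idiom": i, "Input": s, "Paraphrased": p, "Translated": g}
        for (s, i), p, g in zip(examples, literal, german)
    ]


# Stand-in models for illustration only: they do not paraphrase or translate
def fake_paraphrase(batch):
    return [x.split(" idiom: ")[0].replace("id_par sentence: ", "") for x in batch]


def fake_translate(batch):
    return [x.upper() for x in batch]


rows = run_pipeline([("In a nutshell , it works .", "in a nutshell")],
                    fake_paraphrase, fake_translate)
print(rows[0])
```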

#### Paraphrase dataset

In [None]:
# Loading data
path = "/content/drive/MyDrive/MRP Idiom Translation/"
data = pd.read_csv(path+"data/data_idpar_test_split1.csv", sep="=")
data_en_id = data[['input']]

In [None]:
# Paraphrase
model_dir = "/content/drive/MyDrive/MRP Idiom Translation/outputs/id_par prefix + idiom (paraphrase model)/model 1: 50 epochs, batch 4, split 1/"
T5Generate(input_data=data_en_id, input_text="input", model_type=model_dir+"model_files", output_dir=model_dir)

INPUT data: (823, 1)
Outputs generated.
Generated data saved @ /content/drive/MyDrive/MRP Idiom Translation/outputs/id_par prefix + idiom (paraphrase model)/model 1: 50 epochs, batch 4, split 1/translations.csv



In [None]:
# Create input data for translation
data_para = pd.read_csv(model_dir+"translations.csv")

data_en_lit = pd.DataFrame()
data_en_lit['input'] = data_para.agg('translate English to German: {0[Generated Text]}'.format, axis=1)

In [None]:
# Translate
T5Generate(input_data=data_en_lit, input_text="input", model_type="t5-small", output_dir=model_dir)

INPUT data: (823, 1)
Outputs generated.
Generated data saved @ /content/drive/MyDrive/MRP Idiom Translation/outputs/id_par prefix + idiom (paraphrase model)/model 1: 50 epochs, batch 4, split 1/translations.csv



In [None]:
# Properly structure output file
data_translated = pd.read_csv(model_dir+"translations.csv")
data_par_test_IOB = pd.read_csv(path+'data/'+"data_test_IOB_split1.csv", sep="=")

# Clean idiom input (work on a copy to avoid pandas' SettingWithCopyWarning)
idiom_sents = data_para[["Input"]].copy()
idiom_sents["Input"] = idiom_sents["Input"].apply(lambda x: re.sub(r"id_par sentence: ", "", x))
idiom_sents["Input"] = idiom_sents["Input"].apply(lambda x: re.sub(r"idiom: (.|\s)*", "", x))
data_output = pd.DataFrame({"Idiom": data_par_test_IOB['Idiom'], "Input": idiom_sents['Input'], "Paraphrased": data_para['Generated Text'], "Translated": data_translated["Generated Text"]})


In [None]:
data_output

Unnamed: 0,Idiom,Input,Paraphrased,Translated
0,the benefit of the doubt,Let's give her the benefit of the doubt and as...,Let's doubt her and assume that she is right.,Lassen Sie uns sie zweifeln und davon ausgehen...
1,ballpark figure,The management was given a ballpark figure at ...,The management was given an estimated cost at ...,Das Management erhielt zu Beginn der Präsentat...
2,from the bottom of my heart,I am really happy with the new job and I mean ...,I am really happy with the new job and I mean ...,"Ich bin wirklich froh über die neue Arbeit, un..."
3,turning point,The turning point in the story came when the p...,The boiling point in the story came when the p...,"Der Brennpunkt der Geschichte kam, als der Pro..."
4,go the extra mile,"When it comes to weaker students , the teacher...","When it comes to weaker students, the teacher ...","Wenn es um schwächere Schüler geht, tut der Le..."
5,pop the question,I heard you had a special date with Tom yester...,I heard you had a special date with Tom yester...,"Ich hörte, dass Sie gestern ein besonderes Dat..."
6,sense of humour,I appreciate the fact that you have a sense of...,I appreciate the fact that you have an ability...,"Ich schätze die Tatsache, dass Sie humorvoll s..."
7,turn back on,"The client has been given a commitment , we ca...","The client has been given a commitment, we can...","Der Kunde wurde eine Verpflichtung gegeben, kö..."
8,a blessing in disguise,Losing that job turned out to be a blessing in...,Losing that job turned out to be an apparent m...,Der Verlust dieses Arbeitsplatzes erwies sich ...
9,put a sock in it,The teacher had asked the student to be quiet ...,The teacher had asked the student to be quiet ...,"Der Lehrer hatte den Schüler gebeten, mehrmals..."


In [None]:
# Save results
data_output.to_csv(model_dir+"translations_para.csv")


#### Comparison test set

In [None]:
# Load comparison
path= "/content/drive/MyDrive/MRP Idiom Translation/data/"
data = pd.read_csv(path+"comparison_data.csv")

In [None]:
# Prepare for paraphrasing
data_en_id = pd.DataFrame()
data_en_id['input'] = data.agg('id_par sentence: {0[input]} idiom: {0[idiom]}'.format, axis=1)

In [None]:
# Check max length
lengths_data = data_en_id["input"].str.split(" ")

print("Max number of tokens input = ", lengths_data.str.len().max())

Max number of tokens input =  95


In [None]:
# Paraphrase
model_dir = "/content/drive/MyDrive/MRP Idiom Translation/outputs/id_par prefix + idiom (paraphrase model)/model 1: 50 epochs, batch 4, split 1/"
T5Generate(input_data=data_en_id, input_text="input", model_type=model_dir+"model_files", output_dir=model_dir)

INPUT data: (122, 1)
Outputs generated.
Generated data saved @ /content/drive/MyDrive/MRP Idiom Translation/outputs/id_par prefix + idiom (paraphrase model)/model 1: 50 epochs, batch 4, split 1/translations.csv



In [None]:
# Create input data for translation
data_para = pd.read_csv(model_dir+"translations.csv")

data_en_lit = pd.DataFrame()
data_en_lit['input'] = data_para.agg('translate English to German: {0[Generated Text]}'.format, axis=1)

In [None]:
# Translate
T5Generate(input_data=data_en_lit, input_text="input", model_type="t5-small", output_dir=model_dir)

INPUT data: (122, 1)
Outputs generated.
Generated data saved @ /content/drive/MyDrive/MRP Idiom Translation/outputs/id_par prefix + idiom (paraphrase model)/model 1: 50 epochs, batch 4, split 1/translations.csv



In [None]:
# Properly structure output file
data_translated = pd.read_csv(model_dir+"translations.csv")

# Clean idiom input (work on a copy to avoid pandas' SettingWithCopyWarning)
idiom_sents = data_para[["Input"]].copy()
idiom_sents["Input"] = idiom_sents["Input"].apply(lambda x: re.sub(r"id_par sentence: ", "", x))
idiom_sents["Input"] = idiom_sents["Input"].apply(lambda x: re.sub(r"idiom: (.|\s)*", "", x))
data_output = pd.DataFrame({"Idiom": data['idiom'], "Input": idiom_sents['Input'], "Paraphrased": data_para['Generated Text'], "Translated": data_translated["Generated Text"]})


In [None]:
data_output

Unnamed: 0,Idiom,Input,Paraphrased,Translated
0,head over heels,Tom and Mary are head over heels in love with ...,Tom and Mary are deeply in love with each othe...,Tom und Mary verlieben sich tief miteinander u...
1,a sight for sore eyes,I can't believe that I haven't seen you in a y...,I can't believe that I haven't seen you in a y...,"Ich kann nicht glauben, dass ich Sie in einem ..."
2,in a nutshell,"In a nutshell , all the new mayor was saying i...",I am certain that all the new mayor was saying...,"Ich bin sicher, dass der neue Bürgermeister ge..."
3,beyond a shadow of doubt,The government has clarified beyond a shadow o...,The government has clarified for certain that ...,"Die Regierung hat sicher klargestellt, dass di..."
4,kill two birds with one stone,"I have to go to the bank , and on the way back...","I have to go to the bank, and on the way back,...","Ich muss an die Bank gehen, und auf dem Weg zu..."
5,think outside the box,The team always thinks outside the box to come...,The team always thinks of solutions out of the...,"Das Team denkt stets an Lösungen, die aus der ..."
6,go the extra mile,"When it comes to weaker students , the teacher...","When it comes to weaker students, the teacher ...","Wenn es um schwächere Schüler geht, tut der Le..."
7,bite the bullet,"When the time comes , I'll bite the bullet and...","When the time comes, I'll be quick and take my...","Wenn die Zeit kommt, werde ich schnell sein un..."
8,take care,Take care not to cut yourself on that rusty pi...,Take care not to cut yourself on that rusty pi...,"Stellen Sie sicher, dass Sie sich nicht auf di..."
9,pig in a poke,If you buy a used car without examining it tho...,If you buy a used car without examining it tho...,"Wenn Sie ein gebrauchtes Auto kaufen, ohne es ..."


In [None]:
# Save results
data_output.to_csv(model_dir+"translations_comparison_pipeline.csv")


In [None]:
# Also directly translate sentences with T5
en_id_input = pd.DataFrame()
en_id_input['input'] = data.agg('translate English to German: {0[input]}'.format, axis=1)

T5Generate(input_data=en_id_input, input_text="input", model_type="t5-small", output_dir=path)

INPUT data: (122, 1)
Outputs generated.
Generated data saved @ /content/drive/MyDrive/MRP Idiom Translation/data/translations.csv



In [None]:
# Properly structure output file
data_translated = pd.read_csv(path+"translations.csv")

data_output = pd.DataFrame({"Idiom": data['idiom'],"Input": idiom_sents['Input'], "Translated": data_translated["Generated Text"]})


Unnamed: 0,Idiom,Input,Translated
0,head over heels,Tom and Mary are head over heels in love with ...,Tom und Mary verlieben sich einander in den Ko...
1,a sight for sore eyes,I can't believe that I haven't seen you in a y...,"Ich kann nicht glauben, dass ich Sie in einem ..."
2,in a nutshell,"In a nutshell , all the new mayor was saying i...","Kurz gesagt, der neue Bürgermeister sagte: Der..."
3,beyond a shadow of doubt,The government has clarified beyond a shadow o...,"Die Regierung hat unbestreitbar klargestellt, ..."
4,kill two birds with one stone,"I have to go to the bank , and on the way back...",Ich muss an die Bank gehen und auf dem Rückweg...
5,think outside the box,The team always thinks outside the box to come...,"Das Team denkt immer außerhalb der Box, um ein..."
6,go the extra mile,"When it comes to weaker students , the teacher...","Wenn es um schwächere Studenten geht, macht de..."
7,bite the bullet,"When the time comes , I'll bite the bullet and...","Wenn die Zeit kommt, werde ich die Kugel beiße..."
8,take care,Take care not to cut yourself on that rusty pi...,"Stellen Sie sicher, dass Sie sich nicht auf di..."
9,pig in a poke,If you buy a used car without examining it tho...,"Wenn man ein gebrauchtes Auto kauft, ohne es z..."


In [None]:
# Save results
data_output.to_csv(path+"translations_comparison_baseline.csv")

## Evaluation using COMET

Using the reference-free COMET metric, we evaluate how good the German translations are when the idiomatic sentence is translated directly and when it goes through the pipeline.

The model could not be loaded with the limited RAM available on Colab.

(Also, this is unlikely to work well, since COMET probably doesn't understand idioms.)
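For reference, a quality-estimation model of this kind takes one `{"src": ..., "mt": ...}` record per segment rather than a dictionary of columns. The sketch below shows how the paired columns could be converted and how a system-level score can be obtained as the mean of the segment scores. Both helper names are hypothetical, and the scores in the demo are made-up numbers, not COMET output.

```python
def build_qe_samples(sources, translations):
    """Pair each source sentence with its machine translation."""
    if len(sources) != len(translations):
        raise ValueError("sources and translations must have the same length")
    return [{"src": str(s), "mt": str(m)} for s, m in zip(sources, translations)]


def system_score(segment_scores):
    """System-level score as the arithmetic mean of segment scores."""
    return sum(segment_scores) / len(segment_scores)


samples = build_qe_samples(
    ["Take care.", "Not at all."],
    ["Pass auf dich auf.", "Gar nicht."],
)
print(samples[0])
# Illustrative segment scores only
print(system_score([0.25, 0.75]))
```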

In [None]:
!pip install unbabel-comet -q

from comet import download_model, load_from_checkpoint



Downloading wmt21-comet-qe-da.tar.gz
wmt21-comet-qe-da.tar.gz: 1.72GB [00:49, 34.9MB/s]                            
Extracting /root/.cache/torch/unbabel_comet/wmt21-comet-qe-da.tar.gz
Extracted /root/.cache/torch/unbabel_comet/wmt21-comet-qe-da.tar.gz


In [None]:
model_path = download_model("wmt21-comet-qe-da")

model = load_from_checkpoint(model_path)

wmt21-comet-qe-mqm is already in cache.


In [None]:
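# Prepare data: COMET expects a list of {"src": ..., "mt": ...} dicts, one per segment
comet_id_input = [{"src": s, "mt": m} for s, m in zip(en_id_data['input'], translations_id['Generated Text'])]
comet_lit_input = [{"src": s, "mt": m} for s, m in zip(en_id_data['input'], translations_lit['Generated Text'])]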


In [None]:
model = load_from_checkpoint(model_path)

seg_scores_id, sys_score_id = model.predict(comet_id_input, batch_size=4, gpus=1)
seg_scores_lit, sys_score_lit = model.predict(comet_lit_input, batch_size=4, gpus=1)

In [None]:
#model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
    }
]

seg_scores, sys_score = model.predict(data, batch_size=8)

MisconfigurationException: ignored