<a href="https://colab.research.google.com/github/mgelinass/a05mteval/blob/main/bleuFileV05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BLEU: BiLingual Evaluation Understudy

*NLP evaluation metric used in Machine Translation tasks*

*Suitable for measuring corpus-level similarity*

*$n$-gram comparison between a candidate sentence and one or more reference sentences*

*Range: 0 (no match) to 1 (exact match)*
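
For orientation, the points above correspond to the standard BLEU definition, written here as a sketch. The symbols are introduced only for this note: $p_n$ is the clipped $n$-gram precision, $w_n$ its weight (NLTK defaults to $N = 4$ with $w_n = 0.25$), $c$ the candidate length and $r$ the reference length closest to $c$.

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

A candidate identical to a reference has every $p_n = 1$ and $\mathrm{BP} = 1$, which gives the upper bound of 1. Without smoothing, any $p_n = 0$ drives the score to 0.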

### 1. Libraries
*Install and import necessary libraries*


In [2]:
import nltk
import nltk.translate.bleu_score as bleu

import math
import numpy
import os

try:
  nltk.data.find('tokenizers/punkt')
except LookupError:
  nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### 2. Dataset
*Array of words: candidate and reference sentences split into words*

In [None]:
hyp1 = 'The fat cat was comfortably sitting on a big mat'.split()
hyp2 = 'The fat cat was comfortably sitting on a big rug'.split()
hyp3 = 'The clever cat was comfortably sitting on a small pillow'.split()
hyp4 = 'The big dog was comfortably sitting on a small mat'.split()


ref_a = 'The fat cat was comfortably sitting on a big rug'.split()
ref_b = 'The plump cat lounged cozily on a large mat'.split()
ref_c = 'A chubby feline was resting peacefully atop a spacious rug'.split()
ref_d = 'The hefty cat relaxed contentedly on a broad mat'.split()


### 3. *Sentence* score calculation
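*Compares one hypothesis (the candidate translation) with one or more reference sentences. With several references, each $n$-gram count is clipped at its maximum count in any single reference, and the brevity penalty uses the reference length closest to the hypothesis length.*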

In [None]:
score_ref_a = bleu.sentence_bleu([ref_a], hyp1)
print("Hyp1 and ref_a : {}".format(score_ref_a))



In [None]:
score_hyp2 = bleu.sentence_bleu([ref_a], hyp2)
print("Hyp2 and ref_a : {}".format(score_hyp2))


In [None]:
score_hyp3 = bleu.sentence_bleu([ref_a], hyp3)
print("Hyp3 and ref_a : {}".format(score_hyp3))

In [None]:
score_hyp4 = bleu.sentence_bleu([ref_a], hyp4)
print("Hyp4 and ref_a : {}".format(score_hyp4))

In [None]:
score_multi = bleu.sentence_bleu([ref_a, ref_b, ref_c, ref_d], hyp4)
print("Hyp4 vs multiple refs (ref_a to ref_d): {}".format(score_multi))
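
To make the numbers above easier to interpret, here is a minimal pure-Python sketch of an unsmoothed sentence-level BLEU. It is not NLTK's implementation: the names `ngrams`, `modified_precision`, `brevity_penalty` and `simple_bleu` are made up for this illustration, and uniform weights over 1- to 4-grams are assumed.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision: a hypothesis n-gram is credited at most as
    often as it occurs in the reference where it is most frequent."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def brevity_penalty(references, hypothesis):
    """Penalize hypotheses shorter than the closest reference length."""
    c = len(hypothesis)
    if c == 0:
        return 0.0
    # Closest reference length; ties are resolved in favor of the shorter one.
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(references, hypothesis, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n-grams."""
    precisions = [modified_precision(references, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(references, hypothesis) * math.exp(log_mean)


hyp1 = 'The fat cat was comfortably sitting on a big mat'.split()
ref_a = 'The fat cat was comfortably sitting on a big rug'.split()
print(simple_bleu([ref_a], hyp1))
```

For `hyp1` against `ref_a`, the four precisions are 9/10, 8/9, 7/8 and 6/7 and the lengths are equal, so the score is the fourth root of 0.6. Only the last word differs, yet it breaks one $n$-gram at every order.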

In [None]:
hyp1

['The', 'fat', 'cat', 'was', 'comfortably', 'sitting', 'on', 'a', 'big', 'mat']

# Test the following sentences:


## Reference:

"More than ever, we need to stand up for science. Science that is universal – shared by all humanity – and that is unifying," the president of the European Commission said on Monday in a speech delivered at La Sorbonne University in Paris.


## Original (DE):

"Mehr denn je müssen wir für die Wissenschaft eintreten. Eine Wissenschaft, die universell ist - von der ganzen Menschheit geteilt - und die uns vereint", sagte der Präsident der Europäischen Kommission am Montag in einer Rede an der Universität La Sorbonne in Paris.

### hypothesis 1 (Google)
"More than ever, we must stand up for science. A science that is universal — shared by all humanity — and that unites us," said the President of the European Commission in a speech at La Sorbonne University in Paris on Monday.


### hypothesis 2 (DeepL)

“More than ever, we must stand up for science. A science that is universal—shared by all of humanity—and that unites us,” said the President of the European Commission on Monday in a speech at La Sorbonne University in Paris.

### hypothesis 3 (ChatGPT)

"More than ever, we must stand up for science. A science that is universal – shared by all of humanity – and that unites us," said the President of the European Commission on Monday in a speech at the University of La Sorbonne in Paris.

### hypothesis 4 (SYSTRAN)

"More than ever, we need to stand up for science. A science that is universal - shared by all humanity - and that unites us", said the President of the European Commission in a speech at the University of La Sorbonne in Paris on Monday.

## Original (UK)


«Сьогодні, як ніколи раніше, ми повинні відстоювати науку. Науку, яка є універсальною, спільною для всього людства і яка об’єднує», — сказала голова Європейської комісії у своїй промові в університеті Сорбонна в Парижі.

### hypothesis 5 (Google (uk>en))

"Today, more than ever, we must stand up for science. Science that is universal, common to all humanity and that unites," said the European Commission President in her speech at the Sorbonne University in Paris.



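The sentences above differ in quote and dash characters (`"` vs `“ ”`, `-` vs `–` vs `—`), and a plain `split()` leaves punctuation glued to words, so `science.` and `science` would not match. Before scoring, it may help to normalize and tokenize consistently. The following is one possible sketch; `simple_tokenize` and the character mapping are illustrative choices, not part of NLTK.

```python
import re

# Map typographic quotes and dashes to plain ASCII forms (an illustrative choice).
_NORMALIZE = str.maketrans({
    '“': '"', '”': '"', '«': '"', '»': '"',
    '’': "'", '–': '-', '—': '-',
})


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Words may contain inner apostrophes (e.g. "don't"); every other
    non-space symbol becomes a token of its own.
    """
    text = text.translate(_NORMALIZE)
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


# Both dash styles now yield the same token sequence.
print(simple_tokenize('universal—shared by all'))
print(simple_tokenize('universal – shared by all'))
```

The same tokenizer should be applied to the reference and to every hypothesis. Scores computed with different tokenizations are not comparable.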
# Evaluation of files and directories

## Downloading zip files (from a URL on HeiBox)

Text 2 (cultural capital):

text02mtDeepL2en.zip : https://heibox.uni-heidelberg.de/f/0e1e6d0e1a274157be7f/

text02mtGoogle2en.zip : https://heibox.uni-heidelberg.de/f/ddec498c9ff44ffc8b14/

text02ori.zip : https://heibox.uni-heidelberg.de/f/8ffeca89e2d04dc4baac/

Text 3 (interview):

text03mtDeepL2en.zip : https://heibox.uni-heidelberg.de/f/51193f8a6e074330adaf/

text03mtGoogle2en.zip : https://heibox.uni-heidelberg.de/f/c93ef08f70e7417fbbe8/

text03ori.zip : https://heibox.uni-heidelberg.de/f/b997f0ca33b948a09ff2/



In [1]:
!wget https://heibox.uni-heidelberg.de/f/0e1e6d0e1a274157be7f/?dl=1
!mv index.html?dl=1 text02mtDeepL2en.zip
!unzip text02mtDeepL2en.zip
!rm text02mtDeepL2en.zip

!wget https://heibox.uni-heidelberg.de/f/ddec498c9ff44ffc8b14/?dl=1
!mv index.html?dl=1 text02mtGoogle2en.zip
!unzip text02mtGoogle2en.zip
!rm text02mtGoogle2en.zip

!wget https://heibox.uni-heidelberg.de/f/8ffeca89e2d04dc4baac/?dl=1
!mv index.html?dl=1 text02ori.zip
!unzip text02ori.zip
!rm text02ori.zip







--2025-08-26 13:52:26--  https://heibox.uni-heidelberg.de/f/0e1e6d0e1a274157be7f/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/6a9b3bec-e56e-416a-bf8f-3aceca2dcfaa/text02mtDeepL2en.zip [following]
--2025-08-26 13:52:27--  https://heibox.uni-heidelberg.de/seafhttp/files/6a9b3bec-e56e-416a-bf8f-3aceca2dcfaa/text02mtDeepL2en.zip
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 200 OK
Length: 18673 (18K) [application/zip]
Saving to: ‘index.html?dl=1’


2025-08-26 13:52:27 (461 KB/s) - ‘index.html?dl=1’ saved [18673/18673]

Archive:  text02mtDeepL2en.zip
   creating: text02mtDeepL2en/
  inflating: text02mtDeepL2en/text02uk2enD.txt  
  inflating: __MACOSX/text02mtDeepL2en/._text02uk2enD.txt  
  infl

In [None]:
!wget https://heibox.uni-heidelberg.de/f/51193f8a6e074330adaf/?dl=1
!mv index.html?dl=1 text03mtDeepL2en.zip
!unzip text03mtDeepL2en.zip
!rm text03mtDeepL2en.zip

!wget https://heibox.uni-heidelberg.de/f/c93ef08f70e7417fbbe8/?dl=1
!mv index.html?dl=1 text03mtGoogle2en.zip
!unzip text03mtGoogle2en.zip
!rm text03mtGoogle2en.zip

!wget https://heibox.uni-heidelberg.de/f/b997f0ca33b948a09ff2/?dl=1
!mv index.html?dl=1 text03ori.zip
!unzip text03ori.zip
!rm text03ori.zip
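
The shell cells above depend on `wget` and `unzip` being available, as they are on Colab. A rough Python equivalent using only the standard library could look like the sketch below. The helper names `extract_clean` and `download_and_extract` are hypothetical, the `?dl=1` suffix is taken from the commands above, and skipping the `__MACOSX` entries seen in the unzip log is an added choice. Unlike the shell cells, it leaves the downloaded zip file in place.

```python
import urllib.request
import zipfile


def extract_clean(zip_path, target_dir='.'):
    """Extract a zip archive, skipping macOS metadata entries (__MACOSX/)."""
    with zipfile.ZipFile(zip_path) as archive:
        members = [name for name in archive.namelist()
                   if not name.startswith('__MACOSX/')]
        archive.extractall(target_dir, members=members)
    return members


def download_and_extract(share_url, zip_name, target_dir='.'):
    """Download a share link as a zip file (by appending ?dl=1) and unpack it."""
    urllib.request.urlretrieve(share_url.rstrip('/') + '/?dl=1', zip_name)
    return extract_clean(zip_name, target_dir)
```

A call would then look like `download_and_extract('https://heibox.uni-heidelberg.de/f/51193f8a6e074330adaf/', 'text03mtDeepL2en.zip')`.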

## Recursively crawling directories

In [None]:
import os

def get_files_in_directory(directory_path):
    """Recursively collect the paths of all .txt files below directory_path."""
    file_paths = []
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            # endswith() matches only a real .txt extension
            if file_name.endswith('.txt'):
                file_paths.append(os.path.join(root, file_name))
    return file_paths


def printBLEUscores(file_list, file_ref):
    """Print one tab-separated line per file: its path and its BLEU score against file_ref."""
    for file_path in sorted(file_list):
        # each file is read whole and scored as one long token sequence
        with open(file_path, 'r', encoding='utf-8') as f:
            file_test = f.read().split()
        score = bleu.sentence_bleu([file_ref], file_test)
        print(f'{file_path}\t{score}')
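
`printBLEUscores` scores each file as one long "sentence". Corpus-level BLEU is usually computed differently: clipped matches and $n$-gram totals are summed over aligned segments before the precisions are formed. The sketch below shows that pooling in plain Python. It assumes one reference per segment and that reference and hypothesis are already split into aligned, tokenized segments, which may not hold for the files used here. `corpus_bleu_sketch` is a hypothetical name, not NLTK's `corpus_bleu`.

```python
import math
from collections import Counter


def _ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(ref_segments, hyp_segments, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per segment.

    Clipped matches and n-gram totals are pooled over all segments, and the
    brevity penalty compares the summed hypothesis and reference lengths.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(ref_segments, hyp_segments):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = _ngram_counts(ref, n)
            hyp_counts = _ngram_counts(hyp, n)
            matches[n - 1] += sum(min(count, ref_counts[gram])
                                  for gram, count in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_mean = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_mean)
```

Because counts are pooled, one segment with no matching 4-gram does not zero the whole score, as it would if sentence scores were computed separately and then averaged.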




### Text 2

In [None]:
# reference file
file_reference = '/content/text02ori/text02en.txt'

with open(file_reference, 'r') as file:
    file_content_ref = file.read()

file_ref = file_content_ref.split()

In [None]:
file_ref

In [None]:
filesGoogle = get_files_in_directory('/content/text02mtGoogle2en')
filesDeepL = get_files_in_directory('/content/text02mtDeepL2en')

In [None]:
printBLEUscores(filesGoogle, file_ref)

In [None]:
printBLEUscores(filesDeepL, file_ref)

### Text 3

In [None]:
# reference file
file_reference = '/content/text03ori/text03en.txt'

with open(file_reference, 'r') as file:
    file_content_ref = file.read()

file_ref = file_content_ref.split()

In [None]:
filesGoogle = get_files_in_directory('/content/text03mtGoogle2en')
filesDeepL = get_files_in_directory('/content/text03mtDeepL2en')

In [None]:
printBLEUscores(filesGoogle, file_ref)

In [None]:
printBLEUscores(filesDeepL, file_ref)