# Training an inflectional system

## Imports & Installs

In [3]:
!pip install wandb



In [4]:
!wandb login "$WANDB_API_KEY"  # assumes the key is stored in this environment variable; never hard-code it in the notebook

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [5]:
import wandb

wandb.init(project="danish-inflection-beta")

[34m[1mwandb[0m: Currently logged in as: [33mignacioct_[0m ([33mignacio_at_ai[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [6]:
!pip install fairseq



In [7]:
!pip install tensorboardX



## Preprocess the data

In [8]:
!bash ./preprocess.sh dan

2024-01-06 23:02:52.734264: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-06 23:02:52.734314: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-06 23:02:52.735307: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-06 23:02:52.740641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-06 23:02:57 | INFO | fairseq_cli.preprocess |

## Train

Train with the default parameters, which roughly correspond to the baseline in the SIGMORPHON 2020 shared task.
Let this run until the loss on the validation (dev) set no longer improves (about 10 minutes with a GPU).
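The stopping rule described above can be sketched in a few lines. This is only an illustration of patience-based early stopping, assuming that `train.sh` forwards `--patience 3` to the trainer and that the trainer stops after three validation checks without a new best loss. The function name `should_stop` and the loss values are hypothetical.

```python
def should_stop(val_losses, patience=3):
    """Return True once the best validation loss is at least `patience` checks old."""
    if not val_losses:
        return False
    # Index of the first occurrence of the best (lowest) loss so far
    best_index = val_losses.index(min(val_losses))
    checks_since_best = len(val_losses) - 1 - best_index
    return checks_since_best >= patience


# Hypothetical validation losses, one per epoch (illustrative numbers only)
history = [2.10, 1.40, 1.05, 1.07, 1.06, 1.09]
for epoch in range(1, len(history) + 1):
    if should_stop(history[:epoch], patience=3):
        print(f"stop after epoch {epoch}")  # prints "stop after epoch 6"
        break
```

With these numbers the best loss is reached at epoch 3, so the third check without improvement falls on epoch 6.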

In [9]:
!bash ./train.sh dan --patience 3

2024-01-06 23:03:04 | INFO | numexpr.utils | NumExpr 

## Generate predictions on test data

Read all the inputs from tst.dan.input and write the generated outputs to the file tst.dan.output (this is slow and takes about a minute).

In [8]:
#!fairseq-interactive data-bin/dan/ --source-lang=dan.input --target-lang=dan.output --path=checkpoints/dan-models/checkpoint_best.pt --input=tst.dan.input | grep -P "D-[0-9]+" | cut -f3 > tst.dan.output

2024-01-06 22:54:48 | INFO | fairseq_cli.interactive 
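The `grep -P "D-[0-9]+" | cut -f3` part of the command keeps only the `D-<id>` lines of the fairseq-interactive output and extracts their third tab-separated field, which holds the hypothesis text. A rough Python equivalent is sketched below. The helper name `extract_predictions` and the sample lines are made up for illustration, and unlike the `grep` pattern the regular expression here is anchored to the start of the line.

```python
import re

# D-<id> <TAB> <score> <TAB> <hypothesis>; only the third field is kept
D_LINE = re.compile(r"^D-(\d+)\t([^\t]*)\t([^\t]*)")


def extract_predictions(log_lines):
    """Collect the third tab-separated field of every line that starts with D-<id>."""
    predictions = []
    for line in log_lines:
        match = D_LINE.match(line.rstrip("\n"))
        if match:
            predictions.append(match.group(3))
    return predictions


# Made-up sample in the S-/H-/D- layout; the scores are illustrative only
sample = [
    "S-0\t< p i l o t > N DEF NOM SG",
    "H-0\t-0.25\t< p i l o t e n s >",
    "D-0\t-0.25\t< p i l o t e n s >",
]
print(extract_predictions(sample))  # ['< p i l o t e n s >']
```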

In [9]:
# Read the test inputs and the generated outputs, then display the first 20 pairs side by side
with open("tst.dan.input", encoding="utf-8") as f:
    linesinput = [l.strip() for l in f]
with open("tst.dan.output", encoding="utf-8") as f:
    linesoutput = [l.strip() for l in f]
tuple(zip(linesinput, linesoutput))[:20]  # First 20 test inputs and predicted outputs

(('< t a g e t e s > N DEF NOM SG', '< t a g e s t e n s >'),
 ('< k o r s > N INDF NOM SG', '< k r o s >'),
 ('< p r o n o m e n > N DEF NOM SG', '< p r o n o n e s >'),
 ('< b i o l o g i > N INDF GEN PL', '< b i l o g i e r s >'),
 ('< i n f o r m a t i o n > N INDF GEN PL', '< i n i f o r a m e r s >'),
 ('< o r d k l a s s e > N INDF NOM PL', '< o r d l a s e r s >'),
 ('< p r æ s i d e n t > N INDF NOM SG', '< p r æ s i n d e >'),
 ('< j æ g e r > N DEF NOM SG', '< ø g e r n e s >'),
 ('< p i l o t > N DEF NOM SG', '< p i l o t e n s >'),
 ('< p å f u g l > N INDF GEN SG', '< p u f l g s >'),
 ('< f i l i a l > N INDF GEN SG', '< f i l a l s >'),
 ('< g a f f e l > N INDF GEN PL', '< g a f e l e r s >'),
 ('< r o o m i e > N INDF GEN PL', '< r o m e m i e r s >'),
 ('< n æ s e > N INDF NOM SG', '< n e s >'),
 ('< m o d e r s m å l > N INDF NOM PL', '< m o d r e s >'),
 ('< T V - p r o g r a m > N INDF NOM PL', '< h u p r o p r a m e r >'),
 ('< k r o k o d i l l e > N INDF GEN PL
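The pairs above are easier to read once the space-separated characters are joined back into words. The sketch below assumes, based only on the lines shown, that `<` and `>` mark the word boundaries and that everything after `>` is a morphological tag. The helper name `parse_example` is hypothetical.

```python
def parse_example(line):
    """Split '< c h a r s > TAG TAG' into (word, list_of_tags)."""
    tokens = line.split()
    start = tokens.index("<")   # opening boundary marker
    end = tokens.index(">")     # closing boundary marker
    word = "".join(tokens[start + 1:end])
    tags = tokens[end + 1:]     # empty for output lines, which carry no tags
    return word, tags


print(parse_example("< p i l o t > N DEF NOM SG"))  # ('pilot', ['N', 'DEF', 'NOM', 'SG'])
print(parse_example("< p i l o t e n s >"))         # ('pilotens', [])
```

Joining on the empty string would drop any spaces inside a multi-word lemma, so this is only suitable for quick inspection.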

## Calculate test accuracy and Levenshtein distance

In [12]:
# Creating the predictions from the checkpoint
!fairseq-interactive data-bin/dan/ --source-lang=dan.input --target-lang=dan.output --path=checkpoint_best.pt --input=tst.dan.input | grep -P "D-[0-9]+" | cut -f3 > tst.dan.prediction

2024-01-06 23:43:12 | INFO | fairseq_cli.interactive 

In [13]:
# Calculating the accuracy between the ground truth and the predictions
with open("tst.dan.prediction", encoding="utf-8") as f:
    linesprediction = [l.strip() for l in f]
with open("tst.dan.output", encoding="utf-8") as f:
    linesground = [l.strip() for l in f]

# Both files must be non-empty and have the same number of lines
assert len(linesprediction) == len(linesground)
assert len(linesprediction) != 0

# Count the predictions that match the ground truth exactly
hits = 0
lines = 0
for pred, ground in zip(linesprediction, linesground):
    if pred == ground:
        hits += 1
    lines += 1

print("Accuracy: " + str(hits/lines))


Accuracy: 0.6266666666666667


In [15]:
# Calculating the Levenshtein distance between the ground truth and the predictions

# Function from ChatGPT
def levenshtein_distance(str1, str2):
    len_str1 = len(str1) + 1
    len_str2 = len(str2) + 1

    # Create a matrix to store the distances
    matrix = [[0 for _ in range(len_str2)] for _ in range(len_str1)]

    # Initialize the matrix
    for i in range(len_str1):
        matrix[i][0] = i
    for j in range(len_str2):
        matrix[0][j] = j

    # Fill in the matrix
    for i in range(1, len_str1):
        for j in range(1, len_str2):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,        # Deletion
                matrix[i][j - 1] + 1,        # Insertion
                matrix[i - 1][j - 1] + cost  # Substitution
            )

    # The bottom-right cell contains the Levenshtein distance
    return matrix[-1][-1]

with open("tst.dan.prediction", encoding="utf-8") as f:
    linesprediction = [l.strip() for l in f]
with open("tst.dan.output", encoding="utf-8") as f:
    linesground = [l.strip() for l in f]

# Both files must be non-empty and have the same number of lines
assert len(linesprediction) == len(linesground)
assert len(linesprediction) != 0

# Sum the per-line distances, then average over the whole test set
distances = 0
lines = 0
for pred, ground in zip(linesprediction, linesground):
    distances += levenshtein_distance(pred, ground)
    lines += 1

print("Levenshtein distance: " + str(distances/lines))

Levenshtein distance: 1.38


The value printed above is the mean distance per test line, so it does not have to be an integer. For a single pair of strings:

Distance = 0: The strings are identical. No edits are needed.

Distance = 1: The strings differ by exactly one edit: a single insertion, deletion, or substitution. For example, "cat" and "cot" have a Levenshtein distance of 1 because changing one character transforms one into the other.

Distance > 1: The larger the distance, the more the strings differ. A distance of 2 or more means that several edits are needed.
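One caveat when reading the number: the files store each form as space-separated characters (for example `< p i l o t e n s >`), so a distance computed on the raw lines also counts the separating spaces. The sketch below compares the raw-string distance with a distance over whitespace-split tokens. The function `edit_distance` is a compact stand-in written for this example, and the two strings are illustrative rather than taken from the test files.

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences, keeping only one previous row."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


pred = "< p i l o t e n >"    # illustrative prediction
gold = "< p i l o t e n s >"  # illustrative reference
print(edit_distance(pred, gold))                  # 2: the missing "s" plus its space
print(edit_distance(pred.split(), gold.split()))  # 1: one missing character token
```

If a character-level figure is wanted, splitting each line on whitespace before measuring is one option. The average of 1.38 reported above was computed on the raw lines.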