<a href="https://colab.research.google.com/github/pruzhinki/JetBrains/blob/main/CodingChallenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this coding challenge, I was asked to fine-tune and evaluate a small pre-trained model (gs://t5-data/pretrained_models/mt5/small) on a new task: predicting the language a given text is written in (e.g. using the 14 languages from the XNLI dataset).

This project uses the TensorFlow T5 codebase, so I was asked to use the [t5.models.HfPyTorchModel API](https://github.com/google-research/text-to-text-transfer-transformer/blob/a08f0d1c4a7caa6495aec90ce769a29787c3c87c/t5/models/hf_model.py#L38).

In [None]:
%%capture
!git clone https://github.com/google-research/text-to-text-transfer-transformer.git
%cd text-to-text-transfer-transformer  
!pip install .

In [None]:
import t5
import t5.models
import torch
import tensorflow.compat.v1 as tf
import transformers
import tensorflow_datasets as tfds
import seqio
import functools

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Dataset and data preprocessing

I will use the [Language Identification dataset](https://huggingface.co/datasets/amazon_reviews_multi) from Kaggle to fine-tune mT5-small. I split the dataset into train (80%) and validation (20%) sets and uploaded both to my GitHub repository. Each record contains the text and its language label, so we only need to convert the dataset into the right format before checking how well mT5 performs on it. After that, we fine-tune on this dataset to improve performance.
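
The notebook does not show how the 80/20 split was produced. As a minimal sketch of one way to do it, the snippet below shuffles records with a fixed seed and slices off a validation fraction. The function name `split_rows`, the seed, and the toy records are assumptions made for illustration, not the procedure actually used for the uploaded files.

```python
import random


def split_rows(rows, val_fraction=0.2, seed=0):
    """Shuffle rows deterministically and split them into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical toy records in the same (sentence, language) shape as the CSV rows.
rows = [(f"sentence {i}", "English" if i % 2 == 0 else "Swedish")
        for i in range(10)]
train_rows, val_rows = split_rows(rows)
print(len(train_rows), len(val_rows))
```

Fixing the seed keeps the split reproducible, so the validation set stays the same across runs.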

In [None]:
URL_train = 'https://raw.githubusercontent.com/pruzhinki/JetBrains/main/train_dataset.csv'
URL_val = 'https://raw.githubusercontent.com/pruzhinki/JetBrains/main/val_dataset.csv'
file_path_train = tf.keras.utils.get_file(origin=URL_train)
file_path_val = tf.keras.utils.get_file(origin=URL_val)

file_path = {
    "train": file_path_train,
    "validation": file_path_val
}

Downloading data from https://raw.githubusercontent.com/pruzhinki/JetBrains/main/train_dataset.csv
Downloading data from https://raw.githubusercontent.com/pruzhinki/JetBrains/main/val_dataset.csv


Define a function to load the CSV data as tf.data.Dataset in TensorFlow.

In [None]:
def dataset_fn(split, shuffle_files=False):
  # We only have one file for each split.
  del shuffle_files
  ds = tf.data.TextLineDataset(file_path[split])
  ds = ds.map(
      functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                        field_delim=",", use_quote_delim=False),
      num_parallel_calls=tf.data.experimental.AUTOTUNE)
  # Map each tuple to a {"sentence": ... "language": ...} dict.
  ds = ds.map(lambda *ex: dict(zip(["sentence", "language"], ex)))
  return ds

print("A few raw validation examples...")
for ex in tfds.as_numpy(dataset_fn("validation").take(3)):
  print(ex)

A few raw validation examples...
{'sentence': b'klement gottwaldi surnukeha palsameeriti ning paigutati mausoleumi surnukeha oli aga liiga hilja ja oskamatult palsameeritud ning hakkas ilmutama lagunemise tundem\xc3\xa4rke  aastal viidi ta surnukeha mausoleumist \xc3\xa4ra ja kremeeriti zl\xc3\xadni linn kandis aastatel \xe2\x80\x93 nime gottwaldov ukrainas harkivi oblastis kandis zmiivi linn aastatel \xe2\x80\x93 nime gotvald', 'language': b'Estonian'}
{'sentence': b'sebes joseph pereira thomas  p\xc3\xa5 eng the jesuits and the sino-russian treaty of nerchinsk  the diary of thomas pereira bibliotheca instituti historici s i --   rome libris', 'language': b'Swedish'}
{'sentence': b'\xe0\xb8\x96\xe0\xb8\x99\xe0\xb8\x99\xe0\xb9\x80\xe0\xb8\x88\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x8d\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb8\xe0\xb8\x87 \xe0\xb8\xad\xe0\xb8\xb1\xe0\xb8\x81\xe0\xb8\xa9\xe0\xb8\xa3\xe0\xb9\x82\xe0\xb8\xa3\xe0\xb8\xa1\xe0\xb8\xb1\xe0\xb8\x99 thanon charoen krung \xe0\xb9\x80\xe0\xb8\

Now, we write a preprocessing function that converts the examples in the tf.data.Dataset into a text-to-text format with both `inputs` and `targets` fields. We also prepend 'language prediction: ' to the inputs so that the model knows which task it is trying to solve.

In [None]:
def dataset_preprocessor(ds):
  def to_inputs_and_targets(ex):
    """Map {"sentence": ..., "language": ...}->{"inputs": ..., "targets": ...}."""
    return {
        'inputs': tf.strings.join(['language prediction: ', ex['sentence']]),
        'targets': ex['language'],
    }
  return ds.map(to_inputs_and_targets,
                num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Creating new task

T5 uses seqio for managing data pipelines and evaluation metrics. Two core components of seqio are Task and Mixture objects.

A Task is a dataset along with preprocessing functions and evaluation metrics. A Mixture is a collection of Task objects along with a mixing rate or a function defining how to compute a mixing rate based on the properties of the constituent Tasks.
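
To make the idea of a mixing rate concrete, here is a small pure-Python sketch that turns per-task example counts into sampling probabilities, optionally capping large tasks. It is not seqio's implementation; the function name `mixing_probabilities`, the task `other_task`, and the example counts are hypothetical.

```python
def mixing_probabilities(task_sizes, cap=None):
    """Convert per-task example counts into sampling probabilities.

    If `cap` is given, each task's rate is min(size, cap), which limits
    how much a very large task can dominate the mixture.
    """
    rates = {
        name: (min(size, cap) if cap is not None else size)
        for name, size in task_sizes.items()
    }
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}


# Hypothetical example counts; this notebook only defines one task.
sizes = {"language_prediction": 8000, "other_task": 2000}
probs = mixing_probabilities(sizes)
capped_probs = mixing_probabilities(sizes, cap=2000)
print(probs)
print(capped_probs)
```

Without a cap the tasks are sampled in proportion to their size; with the cap both tasks end up with the same rate.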

For this example, we fine-tune the model on language prediction only, so we create a single new task called 'language_prediction' and add it to the registry.

In [None]:
DEFAULT_OUTPUT_FEATURES = {
    "inputs":
        seqio.Feature(
            vocabulary=t5.data.get_default_vocabulary(), add_eos=True),
    "targets":
        seqio.Feature(
            vocabulary=t5.data.get_default_vocabulary(), add_eos=True)
}


seqio.TaskRegistry.add(
    "language_prediction",
    # Specify the task source.
    source=seqio.FunctionDataSource(
        # Supply a function which returns a tf.data.Dataset.
        dataset_fn=dataset_fn,
        splits=["train", "validation"],
         ),
    # Supply a list of functions that preprocess the input tf.data.Dataset.
    preprocessors=[
        dataset_preprocessor,
        seqio.preprocessors.tokenize_and_append_eos
    ],
    # We'll use accuracy as our evaluation metric.
    metric_fns=[t5.evaluation.metrics.accuracy],
    output_features=DEFAULT_OUTPUT_FEATURES,
)

<seqio.dataset_providers.Task at 0x7f8241a32d90>

Now the new task is stored in the registry. Let's look at a few pre-processed examples from the validation set. Note they contain both the tokenized (integer) and plain-text inputs and targets.

In [None]:
task = seqio.TaskRegistry.get("language_prediction")
ds = task.get_dataset(split="validation", sequence_length={"inputs": 32, "targets": 6})
print("A few preprocessed validation examples...")
for ex in tfds.as_numpy(ds.take(3)):
  print(ex)

A few preprocessed validation examples...
{'inputs_pretokenized': b'language prediction: toplumun b\xc3\xbcy\xc3\xbck bir k\xc4\xb1sm\xc4\xb1 okuma yazma bilmedi\xc4\x9fi i\xc3\xa7in molla nasredin dergisi yazarlar\xc4\xb1 edebi dilin geli\xc5\x9ftirilmesinde halk diline y\xc3\xb6nelme gere\xc4\x9fini savunmu\xc5\x9flard\xc4\xb1r dergi \xc3\xbclkenin gelece\xc4\x9fi a\xc3\xa7\xc4\xb1s\xc4\xb1ndan molla nasreddin ile pek \xc3\xa7ok konuda ama\xc3\xa7 birli\xc4\x9fi i\xc3\xa7erisinde olan f\xc3\xbcy\xc3\xbbzat ile dil meselesinde ayr\xc4\xb1lm\xc4\xb1\xc5\x9ft\xc4\xb1r f\xc3\xbcy\xc3\xbbzat arap\xc3\xa7a- fars\xc3\xa7a tamlamalarla ve zaman zaman a\xc4\x9fdal\xc4\xb1 bir osmanl\xc4\xb1 \xc3\xbcslubuyla se\xc3\xa7kin bir kesime hitap ederken molla nasreddin tamamen a\xc3\xa7\xc4\xb1k duru bir azerbaycan t\xc3\xbcrk\xc3\xa7esiyle ve sade bir \xc3\xbcslupla topluma y\xc3\xb6nelik yay\xc4\xb1n yapm\xc4\xb1\xc5\x9ft\xc4\xb1r', 'inputs': array([ 1612, 21332,    10,   420,  5171,   202,     3, 

In [None]:
model = t5.models.HfPyTorchModel('google/mt5-small', "./hft5/", device)

Downloading:   0%|          | 0.00/553 [00:00<?, ?B/s]

You are using a model of type mt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


Downloading:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

Before fine-tuning, mT5 does not generate any meaningful output for this task.

In [None]:
inputs = ["language prediction: This is a book.",
          "language prediction: Das ist ein Buch." ,
          "language prediction: 这是一本书。"
          ]
model.predict(
    inputs,
    sequence_length={"inputs": 32},
    batch_size=2
)

INFO:absl:language prediction: This is a book.
  ->  ⁇  est game here  language est prediction şihinhin     B est  
INFO:absl:language prediction: Das ist ein Buch.
  ->  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
INFO:absl:language prediction: 这是一本书。
  ->  ⁇  ⁇  ⁇  ⁇ 


Evaluate the pre-trained checkpoint, before further fine-tuning. The accuracy is 0.

In [None]:
model.eval(
    "language_prediction",
    sequence_length={"inputs": 32, "targets": 6},
    batch_size=256,
)

INFO:absl:Adding task 'language_prediction' with predict metric_fn(s).
 Got: {'inputs': 32, 'targets': 6}, 
 Max Lengths:{'inputs': 925, 'targets': 5}
INFO:absl:Evaluating checkpoint step: 0
INFO:absl:eval/language_prediction/accuracy at step 0: 0.000
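
The logged accuracy appears to be on a 0 to 100 scale (later checkpoints report values such as 21.750). As a rough sketch of what such a metric computes, assuming exact string match between target and prediction, the helper below returns the percentage of matching pairs. The function name `accuracy_percent` and the example labels are hypothetical, and this is not the library's own code.

```python
def accuracy_percent(targets, predictions):
    """Return the percentage of predictions that exactly match their target."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    correct = sum(t == p for t, p in zip(targets, predictions))
    return 100.0 * correct / len(targets)


# Hypothetical labels: two of the three predictions match.
targets = ["English", "German", "Chinese"]
predictions = ["English", "English", "Chinese"]
score = accuracy_percent(targets, predictions)
print(f"{score:.3f}")
```

Because the comparison is an exact match, an output that is only close to the label (for example a misspelled language name) still counts as wrong.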


Run 30,000 steps of fine-tuning.

In [None]:
model.train(
    mixture_or_task_name='language_prediction',
    steps=30000,
    save_steps=3000,
    sequence_length={"inputs": 32, "targets": 6},
    split="train",
    batch_size=200,
    optimizer=functools.partial(transformers.AdamW, lr=1e-4),
)

INFO:absl:Saving checkpoint for step 0
INFO:absl:Saving checkpoint for step 3000
INFO:absl:Saving checkpoint for step 6000
INFO:absl:Saving checkpoint for step 9000
INFO:absl:Saving checkpoint for step 12000
INFO:absl:Saving checkpoint for step 15000
INFO:absl:Saving checkpoint for step 18000
INFO:absl:Saving checkpoint for step 21000
INFO:absl:Saving checkpoint for step 24000
INFO:absl:Saving checkpoint for step 27000
INFO:absl:Saving final checkpoint for step 30000
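
The checkpoint schedule in the log above follows directly from `steps` and `save_steps`. The short sketch below reproduces it with plain arithmetic and also computes how many training sequences are processed in total, assuming every batch is full. It is an illustration consistent with the log, not code taken from the library.

```python
steps = 30000
save_steps = 3000
batch_size = 200

# A checkpoint every `save_steps` steps starting at step 0, plus the final step.
checkpoint_steps = list(range(0, steps, save_steps)) + [steps]

# Total sequences processed, assuming every batch holds `batch_size` examples.
examples_seen = steps * batch_size

print(checkpoint_steps)
print(examples_seen)
```

How many passes over the training set this corresponds to depends on the number of training records, which the notebook does not state.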


Evaluate after fine-tuning. The accuracy improves to 69%.

In [None]:
model.eval(
    'language_prediction',
    checkpoint_steps="all",
    sequence_length={"inputs": 32, "targets": 6},
    batch_size=256,
)

INFO:absl:Adding task 'language_prediction' with predict metric_fn(s).
 Got: {'inputs': 32, 'targets': 6}, 
 Max Lengths:{'inputs': 925, 'targets': 5}
INFO:absl:Evaluating checkpoint step: 0
INFO:absl:Loading from ./hft5/model-0.checkpoint
INFO:absl:eval/language_prediction/accuracy at step 0: 0.000
INFO:absl:Evaluating checkpoint step: 3000
INFO:absl:Loading from ./hft5/model-3000.checkpoint
INFO:absl:eval/language_prediction/accuracy at step 3000: 21.750
INFO:absl:Evaluating checkpoint step: 6000
INFO:absl:Loading from ./hft5/model-6000.checkpoint
INFO:absl:eval/language_prediction/accuracy at step 6000: 35.023
INFO:absl:Evaluating checkpoint step: 9000
INFO:absl:Loading from ./hft5/model-9000.checkpoint
INFO:absl:eval/language_prediction/accuracy at step 9000: 57.773
INFO:absl:Evaluating checkpoint step: 12000
INFO:absl:Loading from ./hft5/model-12000.checkpoint
INFO:absl:eval/language_prediction/accuracy at step 12000: 65.750
INFO:absl:Evaluating checkpoint step: 15000
INFO:absl:Lo

The fine-tuned mT5 does not always recognize the language correctly: the German example below is predicted as English. It does, however, now generate a valid language name rather than unintelligible tokens. For English and Chinese, the model predicts the correct language.

In [28]:
inputs = ["language prediction: How are the young Germans doing? What does the pandemic mean for them? And are there really many of them in Germany? The phenomenon in numbers.",
          "language prediction: Wer schon sehr gut Deutsch spricht und in Deutschland studieren möchte, kann sich zum TestDaF anmelden.",
          "language prediction: 如果你的德语已经说得很好，并且想在德国学习，你可以报名参加TestDaF。"
          ]
model.predict(
    inputs,
    sequence_length={"inputs": 32},
    batch_size=2
)

INFO:absl:language prediction: How are the young Germans doing? What does the pandemic mean for them? And are there really many of them in Germany? The phenomenon in numbers.
  -> English
INFO:absl:language prediction: Wer schon sehr gut Deutsch spricht und in Deutschland studieren möchte, kann sich zum TestDaF anmelden.
  -> English
INFO:absl:language prediction: 如果你的德语已经说得很好，并且想在德国学习，你可以报名参加TestDaF。
  -> Chinese
