# CS 195: Natural Language Processing
## Transfer Learning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb)

## Reference

Hugging Face NLP Course Chapter 1: Transformer Models https://huggingface.co/learn/nlp-course/chapter1/1

Hugging Face NLP Course Chapter 3: Fine-tuning a model with the Trainer API or Keras https://huggingface.co/learn/nlp-course/chapter3/1

Hugging Face NLP Course Chapter 7, Section 5: Summarization https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf

In [1]:
!pip install --quiet bitsandbytes
!pip install --quiet --upgrade transformers # Install latest version of transformers
!pip install --quiet --upgrade accelerate
!pip install --quiet sentencepiece

In [2]:
import sys
!{sys.executable} -m pip install --no-cache-dir datasets keras tensorflow sentencepiece accelerate bitsandbytes

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


## Transfer Learning

**Transfer Learning** is the process of taking a model that was trained (**pre-trained**) on one task and then **fine-tuning** it for another task.

Today we're going to practice fine-tuning a pre-trained **transformer** model - we'll cover transformers in more detail next week, but they work a lot like the other neural network models we've looked at so far.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/pretraining.svg?raw=1" width=700>
    <br />
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/finetuning.svg?raw=1" width=700>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter1/4?fw=tf

## Common pre-trained models

There are a variety of pre-trained models out there
* usually *very large*
* pretrained on *massive amounts of data*

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/model_parameters.png?raw=1" width=800>
</div>

**Encoders:** BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
* Usually trained on masked input - model tries to predict the missing word in a sequence


**Decoders:** CTRL, GPT, GPT-2, Transformer XL
* Neural language models - usually trying to predict the next word in a sequence

**Encoder-Decoder Models:** BART, mBART, Marian, T5
* full sequence-to-sequence models
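
To make the difference between the encoder and decoder training objectives concrete, here is a minimal sketch in plain Python. It is illustrative only: the whitespace tokenization, the `[MASK]` string, and the helper names `masked_lm_example` and `next_word_examples` are assumptions made for this example, not how any of the models listed above actually prepare their data.

```python
def masked_lm_example(tokens, position, mask_token="[MASK]"):
    """Hide one token; the model would be asked to predict it."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target


def next_word_examples(tokens):
    """Build (context, next word) pairs for a left-to-right language model."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the ferry crossed the bay".split()

# Encoder-style objective: fill in the blank
masked, target = masked_lm_example(tokens, position=1)
print(masked, "->", target)

# Decoder-style objective: predict what comes next
for context, next_word in next_word_examples(tokens):
    print(context, "->", next_word)
```

An encoder-decoder model combines the two ideas: the encoder reads the whole input, and the decoder produces the output one token at a time.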


## Working Example

We're going to work through our text-to-emoji example, fine-tuning a variant of T5.

### Load and filter our dataset just like before

In [None]:
from datasets import load_dataset


# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")

# Filter the dataset
dataset = dataset.filter(is_not_none)
dataset

Dataset({
    features: ['text', 'emoji', 'topic'],
    num_rows: 503682
})

### Choosing a sample to work with

Even the smaller transformer models take too long to train on the full dataset during class, so let's choose a small sample to work with.

In [None]:
# Shuffle the dataset
shuffled_dataset = dataset.shuffle(seed=42)

# Select a small sample
sample_size = 5000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))

#if you want to use the entire dataset just uncomment the following
#sample_dataset = shuffled_dataset

### Train/test split

Hugging Face datasets actually include a `train_test_split` function for splitting into training and testing sets if you don't already have them split.

In [None]:
dataset_split = sample_dataset.train_test_split(test_size=0.2)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 1000
    })
})

### Reminder of what the data looks like

In [None]:
print(dataset_split["train"]["text"][46])
print(dataset_split["train"]["emoji"][46])

Riding a ferry across the bay offers incredible views of the skyline.
⛴🌉🌊👀


### The Tokenizer

Since we are starting from an existing model, we need to prepare our data the same way the data was prepared when that model was trained.

**T5:** an encoder-decoder Transformer architecture suitable for sequence-to-sequence tasks

**mT5:** A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages

**mt5-small:** A small version of mT5, suitable for getting things working before attempting to train on a large model

`mt5-small` uses the SentencePiece tokenizer

In [None]:
from transformers import AutoTokenizer
# mt5-small uses the SentencePiece tokenizer
model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


### Looking at an example of the tokenization

You'll see that the token ids get returned as `input_ids`

It also includes an `attention_mask`, which tells the model's attention mechanism which positions hold real tokens (1) and which are padding to ignore (0) - it is all 1s here because a single sentence needs no padding.

In [None]:
inputs = tokenizer(dataset_split["train"]["text"][46])
inputs

{'input_ids': [486, 90367, 4256, 339, 12431, 484, 2119, 336, 19195, 259, 264, 3162, 3644, 3171, 4133, 261, 259, 114441, 263, 533, 3658, 772, 52743, 8125, 261, 305, 287, 5169, 18040, 260, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
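
The mask only becomes interesting once sequences of different lengths are put in the same batch. The sketch below shows the idea in plain Python: shorter sequences are padded, and the padded positions get a 0 in the mask. The helper `pad_batch` and the short id lists are hypothetical, and the pad id of 0 is passed in explicitly as an assumption rather than read from the real tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[486, 90367, 4256, 1], [486, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
```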

### Converting ids back to tokens

Here's what the tokens look like.

The `▁` prefix (marking a preceding space) is a hallmark of the SentencePiece tokenizer, and `</s>` is the end-of-sequence token appended to every sequence.

In [None]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁The',
 '▁RV',
 '▁life',
 '▁is',
 '▁exact',
 'ly',
 '▁what',
 '▁I',
 '▁needed',
 '▁',
 '-',
 '▁end',
 'less',
 '▁high',
 'way',
 ',',
 '▁',
 'sunset',
 's',
 '▁that',
 '▁take',
 '▁your',
 '▁breath',
 '▁away',
 ',',
 '▁and',
 '▁the',
 '▁open',
 '▁road',
 '.',
 '</s>']
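
Because `▁` records where the spaces were, the original text can be rebuilt by gluing the pieces together and turning each `▁` back into a space. The following is a simplified sketch of that idea; `pieces_to_text` is a hypothetical helper written for this example and is not the tokenizer's actual decoding code.

```python
def pieces_to_text(pieces, special_tokens=("</s>", "<pad>")):
    """Rebuild text from SentencePiece-style pieces (simplified)."""
    kept = [p for p in pieces if p not in special_tokens]
    # "\u2581" is the "▁" character that stands for a space before the piece
    return "".join(kept).replace("\u2581", " ").strip()


pieces = ["\u2581The", "\u2581RV", "\u2581life", "\u2581is", "\u2581exact", "ly",
          "\u2581what", "\u2581I", "\u2581needed", "</s>"]
print(pieces_to_text(pieces))
```

In practice `tokenizer.decode` does this for you, as shown a few cells further down.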

### How does it work on the emojis?

Fortunately, this seems to work pretty well for the emoji output too.

Emojis that are not in the vocabulary may be split into byte tokens such as `<0xF0>`, or come back as `<unk>` for unknown tokens.

In [None]:
target = tokenizer(dataset_split["train"]["emoji"][46])
target

{'input_ids': [259, 248919, 243, 162, 158, 166, 4667, 59597, 4667, 247172, 245240, 22717, 4667, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['▁',
 '🚐',
 '<0xF0>',
 '<0x9F>',
 '<0x9B>',
 '<0xA3>',
 '️',
 '☀',
 '️',
 '🌇',
 '🌄',
 '❤',
 '️',
 '</s>']

In [None]:
tokenizer.decode(target.input_ids)

'🚐🛣️☀️🌇🌄❤️</s>'
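
The four `<0x..>` tokens in the list above are byte-level pieces: that emoji does not appear to be in the vocabulary as a single piece, so it is represented by its UTF-8 bytes instead. A quick check in plain Python (no tokenizer involved) shows that those four bytes decode to the road emoji that shows up in the decoded string.

```python
# Byte values copied from the token list: <0xF0> <0x9F> <0x9B> <0xA3>
fallback_bytes = bytes([0xF0, 0x9F, 0x9B, 0xA3])

emoji = fallback_bytes.decode("utf-8")
print(emoji)                  # the single character the four tokens stand for
print(f"U+{ord(emoji):04X}")  # its Unicode code point
print(emoji.encode("utf-8") == fallback_bytes)  # round trip back to bytes
```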

### Let's define a preprocessing function

This will allow us to tokenize both the text and the emojis, and to add the token ids from the emojis under the `"labels"` key in the overall data structure, where it will be convenient to have them for training.

In [None]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["emoji"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



Hugging Face datasets have a `map` method that allows you to apply a preprocessing function like this to every example in the data set.

Notice that we get everything we had before (text, emoji, topic), but now we also have the input_ids (the tokens), the attention mask, and the labels (also token ids).

In [None]:
#turn the tokenized data back into a dataset
tokenized_datasets = dataset_split.map(preprocess_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})
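
With `batched=True`, the function passed to `map` receives a dictionary of lists (one list per column) instead of a single example, and it returns new columns of the same length. Here is a rough pure-Python imitation of that contract. The `fake_tokenize` stand-in (one id per character) and `toy_preprocess` are hypothetical and only mimic the shape of the data, not real tokenization.

```python
def fake_tokenize(texts, max_length):
    """Toy stand-in for a tokenizer: one id per character, truncated."""
    return {"input_ids": [[ord(ch) for ch in t][:max_length] for t in texts]}


def toy_preprocess(examples, max_input_length=10, max_target_length=3):
    model_inputs = fake_tokenize(examples["text"], max_input_length)
    labels = fake_tokenize(examples["emoji"], max_target_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# A "batch" in the shape map(batched=True) uses: column name -> list of values
batch = {
    "text": ["ferry ride", "open road ahead"],
    # two emojis and four emojis, written as escapes to keep lengths unambiguous
    "emoji": ["\u26f4\U0001F30A", "\U0001F690\U0001F6E3\u2600\U0001F307"],
}
new_columns = toy_preprocess(batch)

print([len(ids) for ids in new_columns["input_ids"]])  # inputs capped at 10
print([len(ids) for ids in new_columns["labels"]])     # labels capped at 3
```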

### Grabbing the pre-trained model

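As a reminder, `model_checkpoint` was defined earlier - it is `"google/mt5-small"`

Note that this is an encoder-decoder transformer model. The original T5 was pretrained on a roughly 750 GB dataset (C4) together with supervised tasks such as summarization, translation, question answering, and classification; mT5 was pretrained only on the multilingual mC4 corpus, so it needs fine-tuning before it is useful on a task like ours.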

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


### Using a data collator

Hugging Face provides a Data Collator class which is used to collect the training data into batches and dynamically pad them so that each batch is appropriately padded but without an overall fixed length.

With `return_tensors="tf"` we're saying we want the data back in a data structure suitable for use with Keras/TensorFlow.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
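
Dynamic padding means each batch is padded only up to its own longest example. For sequence-to-sequence data the labels are padded too, and by default `DataCollatorForSeq2Seq` uses -100 for that so the padded label positions are ignored by the loss. The sketch below imitates both behaviors in plain Python; `collate` is a hypothetical helper and the ids are invented.

```python
def collate(examples, pad_id=0, label_pad_id=-100):
    """Pad inputs and labels to the longest example in this batch only."""
    max_in = max(len(ex["input_ids"]) for ex in examples)
    max_lab = max(len(ex["labels"]) for ex in examples)
    return {
        "input_ids": [
            ex["input_ids"] + [pad_id] * (max_in - len(ex["input_ids"]))
            for ex in examples
        ],
        "labels": [
            ex["labels"] + [label_pad_id] * (max_lab - len(ex["labels"]))
            for ex in examples
        ],
    }


examples = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 1], "labels": [9, 8, 1]},
    {"input_ids": [5, 6, 7, 8, 9, 1], "labels": [9, 1]},
]

first_batch = collate(examples[:2])   # inputs padded to length 4
second_batch = collate(examples[2:])  # inputs padded to length 6
print(first_batch)
print(second_batch)
```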

Let's make a version of the dataset where the original text fields are removed so we can use it with the collator.

In [None]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["text","emoji","topic"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [None]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

### Setting up the optimizer

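When fine-tuning a pre-trained model, you usually want to use a smaller learning rate so that the pre-trained weights are adjusted gently rather than overwritten.

Note that we do not specify a loss function - when none is given, the Hugging Face model falls back to its own built-in loss for the task.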

*NB:* I'm using values that were in the example on the website (https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf ) for a different dataset - I don't know if these are the best for this problem

In [None]:
from transformers import create_optimizer
import tensorflow as tf

num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16 - can be helpful if running on a GPU
#tf.keras.mixed_precision.set_global_policy("mixed_float16")
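
As far as I understand it, with `num_warmup_steps=0` the schedule returned by `create_optimizer` decays the learning rate linearly from `init_lr` to 0 over `num_train_steps`. The sketch below reproduces that shape in plain Python so you can see what learning rate is in effect at a given step. The helper `linear_decay_lr` is hypothetical, and the 125 steps per epoch assume 4000 training examples with a batch size of 32.

```python
def linear_decay_lr(step, init_lr, num_train_steps):
    """Learning rate after `step` updates under linear decay to zero."""
    remaining = max(0.0, 1.0 - step / num_train_steps)
    return init_lr * remaining


steps_per_epoch = 4000 // 32           # 125 batches per epoch (assumed)
num_train_steps = steps_per_epoch * 8  # 8 epochs

for step in (0, 250, 500, 1000):
    print(step, linear_decay_lr(step, 5.6e-5, num_train_steps))
```

This is why `num_train_steps` has to match the number of epochs actually passed to `fit`: otherwise the learning rate reaches zero too early or never gets there.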

In [None]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_train_epochs)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.src.callbacks.History at 0x2c4214e20>

### Saving a copy of the model's weights

This will allow us to load the model later and work with it without completely retraining.

In [None]:
model.save_pretrained("models/emoji-model-v2")

### Reload a saved model

In [None]:
#model = TFAutoModelForSeq2SeqLM.from_pretrained("models/emoji-model-v2")

### Inference

Let's suppose we have an example to get a prediction for. For now, let's grab one from the test set

In [None]:
print( tokenized_datasets["test"]["text"][15] )
print( tokenized_datasets["test"]["emoji"][15] )
print( tokenized_datasets["test"]["input_ids"][15] )

The Swiss Army Knife is the ultimate tool to have on hand  From opening packages to cutting loose threads, it does it all!
✂️🔑✉️✅🔧🛠️🔪🔢
[486, 53155, 259, 42777, 412, 70379, 339, 287, 259, 54148, 16080, 288, 783, 351, 3993, 7119, 30363, 259, 126431, 288, 259, 66127, 259, 104463, 259, 159548, 261, 609, 259, 6975, 609, 751, 309, 1]


Use the `generate` method to get a prediction sequence from the input IDs.

If you don't already have the tokens, make sure to use your tokenizer first.

In [None]:
prediction = model.generate([tokenized_datasets["test"]["input_ids"][15]], max_length=max_target_length)
tokenizer.convert_ids_to_tokens(prediction[0])

['<pad>', '▁', '✨', '✨', '</s>']

In [None]:
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)
decoded_output

'✨✨'

## Applied Exploration

The applied exploration for this fortnight will be a little different. I want everyone to get some experience fine-tuning an existing model, so this will be the task for the entire fortnight.

Fine-tune an existing model with the following requirements
* Choose a different starting model - you can use any Hugging Face model, but consider starting with a general one like BART or Llama2.
* Choose a different data set - think about something that would be good to include in an application that interests you
* Evaluate how well it performed. For a sequence-to-sequence model, try going back and using ROUGE from Fortnight 1.
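
As a reminder of what a ROUGE-style score measures, here is a minimal sketch of a unigram-overlap F1 in plain Python. It is a simplified stand-in, not the official ROUGE implementation: `rouge1_f1` is a hypothetical helper, it splits on whitespace only, and it does no stemming or other normalization.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Counter & Counter keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```

For outputs without spaces (such as emoji sequences) you would need a different notion of token, for example individual characters.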

I went through about 10 models and originally tried to get a dataset called Orca to work with some language and question-answering models. These models kept crashing my notebook, and I was unable to actually use them.

I eventually switched to a dataset that translates Chinese to English and trained a translation model on it. I used https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt, which helped me understand how to use PyTorch to apply the dataset and model I needed.

In [3]:
from datasets import load_dataset


# Define a function to check if 'input' is not None
def is_not_none(example):
    return example['input'] is not None

dataset = load_dataset("suolyer/translate_zh2en",split="train")

# Filter the dataset
dataset = dataset.filter(is_not_none)
dataset

Dataset({
    features: ['input', 'output', 'id'],
    num_rows: 10000
})

In [4]:
# Shuffle the dataset
shuffled_dataset = dataset.shuffle(seed=42)

# Select a small sample
sample_size = 5000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))

#if you want to use the entire dataset just uncomment the following
#sample_dataset = shuffled_dataset

In [5]:
dataset_split = sample_dataset.train_test_split(test_size=0.2)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'id'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input', 'output', 'id'],
        num_rows: 1000
    })
})

In [6]:
print(dataset_split["train"]["input"][46])
print(dataset_split["train"]["output"][46])

请你把中文翻译成为英文
例如，如果出于某种原因，人们对未来变得不那么有信心，他们就会削减支出，囤积更多的钱。
For instance, if for some reason people have become less confident about the future, they will cut back on their outlays and hoard more money.


In [7]:
from transformers import AutoTokenizer

# This checkpoint uses the SentencePiece-based T5 tokenizer
model_checkpoint = "acul3/mt5-translate-en-id"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)


In [8]:
inputs = tokenizer(dataset_split["train"]["input"][46])
inputs

{'input_ids': [259, 20256, 4235, 9803, 18884, 138961, 27674, 102031, 259, 122434, 261, 21304, 2371, 5162, 23085, 18910, 34635, 261, 79316, 2991, 25704, 150257, 1597, 43050, 1637, 145895, 261, 16171, 95099, 76195, 35244, 150502, 261, 241447, 72909, 111234, 10270, 306, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁',
 '请',
 '你',
 '把',
 '中文',
 '翻译',
 '成为',
 '英文',
 '▁',
 '例如',
 ',',
 '如果',
 '出',
 '于',
 '某',
 '种',
 '原因',
 ',',
 '人们',
 '对',
 '未来',
 '变得',
 '不',
 '那么',
 '有',
 '信心',
 ',',
 '他们',
 '就会',
 '削',
 '减',
 '支出',
 ',',
 '囤',
 '积',
 '更多的',
 '钱',
 '。',
 '</s>']

In [10]:
target = tokenizer(dataset_split["train"]["output"][46])
target

{'input_ids': [1102, 259, 13371, 261, 955, 332, 2155, 10870, 2559, 783, 259, 11467, 24691, 92537, 1388, 287, 11350, 261, 287, 276, 898, 35424, 3004, 351, 259, 1616, 1350, 103929, 305, 623, 3588, 1097, 8129, 260, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['▁For',
 '▁',
 'instance',
 ',',
 '▁if',
 '▁for',
 '▁some',
 '▁reason',
 '▁people',
 '▁have',
 '▁',
 'become',
 '▁less',
 '▁confident',
 '▁about',
 '▁the',
 '▁future',
 ',',
 '▁the',
 'y',
 '▁will',
 '▁cut',
 '▁back',
 '▁on',
 '▁',
 'their',
 '▁out',
 'lays',
 '▁and',
 '▁ho',
 'ard',
 '▁more',
 '▁money',
 '.',
 '</s>']

In [12]:
tokenizer.decode(target.input_ids)

'For instance, if for some reason people have become less confident about the future, they will cut back on their outlays and hoard more money.</s>'

In [13]:
max_input_length = 75
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["input"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["output"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



In [14]:
tokenized_datasets = dataset_split.map(preprocess_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input', 'output', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

# Load the TensorFlow sequence-to-sequence version of the checkpoint
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# PyTorch alternative with 8-bit loading (needs bitsandbytes and accelerate):
#from transformers import AutoModelForSeq2SeqLM
#model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, device_map="auto", load_in_8bit=True)

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [None]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(['id', 'input', 'output'])
tokenized_datasets_no_text

In [18]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

In [19]:
from transformers import create_optimizer
import tensorflow as tf

num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

# Compile with the fine-tuning optimizer (no loss given, so the model's built-in loss is used)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16 - can be helpful if running on a GPU
#tf.keras.mixed_precision.set_global_policy("mixed_float16")

In [20]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_train_epochs)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.src.callbacks.History at 0x78dbc0488c70>

In [67]:
def is_not_none(example):
    return example['input'] is not None

raw_datasets = load_dataset("suolyer/translate_zh2en",split="train")

# Filter the dataset
raw_datasets = raw_datasets.filter(is_not_none)
# Shuffle the dataset
shuffled_dataset = raw_datasets.shuffle(seed=42)

# Select a small sample
sample_size = 5000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))
dataset_split = sample_dataset.train_test_split(test_size=0.2)

In [68]:
from transformers import pipeline

# Sanity-check the pre-trained translation checkpoint before fine-tuning
model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")



[{'translation_text': '默认为扩展线索'}]

In [69]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [70]:
zh_sentence = dataset_split["train"]["input"][46]
en_sentence = dataset_split["train"]["output"][46]

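# text_target tokenizes the target sentence and stores its ids under "labels"
inputs = tokenizer(zh_sentence, text_target=en_sentence)
inputs
# For comparison: the target tokenized as if it were source-language text
wrong_targets = tokenizer(en_sentence)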

In [71]:
max_length = 128


def preprocess_function(examples):
    inputs = examples["input"]
    targets = examples["output"]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

In [72]:
tokenized_datasets = dataset_split.map(preprocess_function, batched=True, remove_columns=dataset_split["train"].column_names)

In [73]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [74]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [75]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [76]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])
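
The extra `decoder_input_ids` key holds a shifted copy of the labels: during training the decoder is fed the target sequence moved one position to the right, beginning with a start token, and learns to predict the next token at each position. Here is a simplified sketch of that shift. `shift_right` is a hypothetical helper, the ids are made up, and using 0 as both the start token and the pad id is an assumption for the illustration.

```python
def shift_right(labels, decoder_start_id=0, pad_id=0, ignore_id=-100):
    """Build decoder inputs by shifting the labels one step to the right."""
    shifted = [decoder_start_id] + list(labels[:-1])
    # -100 only has meaning in the labels, so swap it for the pad id
    return [pad_id if tok == ignore_id else tok for tok in shifted]


labels = [37, 512, 9, 1, -100, -100]
print(labels)
print(shift_right(labels))
```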

In [80]:
!pip install --quiet sacrebleu evaluate

In [77]:
import evaluate

metric = evaluate.load("sacrebleu")

In [78]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}
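
The fields in this result are enough to recompute the score by hand. Assuming the standard BLEU definition (the geometric mean of the four n-gram precisions multiplied by a brevity penalty), the following sketch reproduces the score from the counts, totals, and lengths printed above, using only the standard library.

```python
import math

counts = [11, 6, 4, 3]    # matching n-grams for n = 1..4
totals = [12, 11, 10, 9]  # n-grams in the prediction for n = 1..4
sys_len, ref_len = 12, 13

precisions = [c / t for c, t in zip(counts, totals)]
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: predictions shorter than the reference are penalized
bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)

score = 100 * bp * geo_mean
print(round(bp, 4), round(score, 2))
```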

In [81]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [95]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "Helsinki-NLP/opus-mt-en-zh",  # output directory for checkpoints
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
batch["labels"]

In [43]:
from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["test"], collate_fn=data_collator, batch_size=8
)

In [44]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [45]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)



In [46]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [50]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)


In [51]:
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
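
The `np.where` step is needed because -100 is only a placeholder for "ignore this position" and is not a valid token id, so it has to be swapped for a real id before decoding. The small sketch below isolates that step. The label values and the pad id of 0 are assumptions for the illustration; in the function above the pad id comes from `tokenizer.pad_token_id`.

```python
import numpy as np

pad_token_id = 0  # assumed pad id for this illustration
labels = np.array([[37, 512, 1, -100, -100],
                   [37, 9, 14, 88, 1]])

# Keep real ids, replace the -100 placeholders with the pad id
cleaned = np.where(labels != -100, labels, pad_token_id)
print(cleaned)
```

Decoding with `skip_special_tokens=True` then drops the pad tokens, so they do not show up in the decoded text.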