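In this article we use the Transformers library to tackle a translation task. Translation is a typical Seq2Seq (sequence-to-sequence) task: given an input sequence of words, the model outputs a corresponding sequence of words. Translation is very similar to text summarization, and the procedure in this article also carries over to other Seq2Seq tasks, for example: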

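- Style transfer: converting text written in one style into another, for example Classical Chinese into modern vernacular Chinese, or Shakespearean English into modern English;
- Generative question answering: given a question, generating the corresponding answer from the context.

With enough data we could train a translation model from scratch, but fine-tuning a pretrained translation model is faster, for example fine-tuning a multilingual model such as mT5 or mBART on a specific language pair.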

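In this article we fine-tune a translation model for English-to-Romanian translation. The model has already been pretrained for translation on the large-scale Opus corpus, so it can be used for translation out of the box. Fine-tuning can further improve its performance on a specific corpus.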

In [13]:
import os
os.environ["WANDB_DISABLED"]="true"

# Install the required packages

In [14]:
! pip install datasets transformers sacrebleu torch sentencepiece transformers[sentencepiece] -i https://pypi.tuna.tsinghua.edu.cn/simple

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.


Make sure your version of Transformers is at least 4.11.0, since the functionality used here was introduced in that version.

In [15]:
import transformers
print(transformers.__version__)

4.18.0
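The version requirement above can also be checked in code. The following is a minimal sketch, assuming version strings of the form `major.minor.patch`; the helper `version_tuple` is a hypothetical name introduced for this illustration, not part of Transformers. It compares versions numerically, because a plain string comparison would wrongly rank "4.9.0" above "4.11.0".

```python
import re


def version_tuple(version):
    """Turn a string such as '4.18.0' into (4, 18, 0).

    Only the leading digits of the first three dot-separated parts are used,
    so a suffix such as 'dev0' is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


installed = "4.18.0"  # the value printed by the cell above
required = "4.11.0"

# Numeric comparison: 4.18.0 satisfies the requirement.
print(version_tuple(installed) >= version_tuple(required))

# Pitfall: as plain strings, "4.9.0" sorts after "4.11.0",
# although numerically it is the older release.
print("4.9.0" >= required, version_tuple("4.9.0") >= version_tuple(required))
```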


# Fine-tuning a model on a translation task

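In this notebook we will see how to fine-tune a Hugging Face Transformers model for an English-to-Romanian translation task. The data comes from the WMT dataset, a machine translation dataset assembled from a variety of sources, including news commentary and parliamentary proceedings.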

Below is an example translation produced by the pretrained model.
Reference links: https://huggingface.co/Helsinki-NLP/opus-mt-en-zh
https://huggingface.co/Helsinki-NLP/opus-mt-en-ro?text=My+name+is+Sarah+and+I+live+in+London

![](https://img-blog.csdnimg.cn/3331f29f447a401aade229fb356a9708.png)

The rest of this article shows how to load a translation dataset with Datasets and how to fine-tune the model with the Trainer API.


The approach below works for other model checkpoints as well, as long as the model has a sequence-to-sequence version in the Transformers library. Here we pick the Helsinki-NLP/opus-mt-en-ro checkpoint.

In [4]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"

### Loading the dataset

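We will use the [datasets](https://github.com/huggingface/datasets/tree/master/datasets/wmt16) library to download the data and to get the metric we need for evaluation (to compare our model against the benchmark). Both are easily done with the functions load_dataset and load_metric. Here we use the English/Romanian portion of the WMT16 dataset.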


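WMT is the most widely used dataset in neural machine translation, and many machine translation systems are trained on it. Google engineers prepared a script for WMT16 en-de, wmt16_en_de.sh. The script first downloads the data, then applies the Moses tokenizer, cleans the training data, and uses BPE to build a vocabulary of 32,000 subwords. The preprocessed files can be downloaded directly (a proxy may be needed on some networks):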
https://github.com/huggingface/datasets/tree/master/datasets/wmt16

In [17]:
from datasets import load_dataset, load_metric
raw_datasets = load_dataset("wmt16", "ro-en") # ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
metric = load_metric("sacrebleu")

Reusing dataset wmt16 (C:\Users\yanqiang\.cache\huggingface\datasets\wmt16\ro-en\1.0.0\af3c5d746b307726d0de73ebe7f10545361b9cb6f75c83a1734c000e48b6264f)


  0%|          | 0/3 [00:00<?, ?it/s]

The dataset object is a [DatasetDict](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each of the training, validation and test sets: train, validation and test.

In [18]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 610320
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
})

In [36]:
raw_datasets['train'].to_pandas().to_csv('data/09/train.csv',index=None)
raw_datasets['validation'].to_pandas().to_csv('data/09/validation.csv',index=None)
raw_datasets['test'].to_pandas().to_csv('data/09/test.csv',index=None)

In [19]:
raw_datasets["train"][0]

{'translation': {'en': 'Membership of Parliament: see Minutes',
  'ro': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}

To get a rough feel for the data, the following function displays a few samples picked at random from the dataset.

In [20]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(raw_datasets["train"])

| | translation |
|---|---|
| 0 | {'en': 'It has 25 mayoral and 700 city-council candidates.', 'ro': 'Aceasta are 25 de candidaţi la primării şi 700 la consiliile locale.'} |
| 1 | {'en': 'This is the question which we will have to ask ourselves once the Commission has finished its analysis.', 'ro': 'Aceasta este întrebarea pe care trebuie să ne-o punem imediat ce Comisia va fi terminat analiza sa.'} |
| 2 | {'en': 'Sekulovski: To our great surprise and joy, the second awarded work -- the Museum of Memory and Tolerance in Mexico -- shares a similar story.', 'ro': 'Sekulovski: Spre marea noastră surpriză şi bucurie, a doua lucrare premiată -- Muzeul Memoriei şi Toleranţei din Mexic -- împărtăşeţte o poveste similară.'} |
| 3 | {'en': 'He also discussed NATO's continuing engagement in Kosovo, noting that the improving situation has allowed the Alliance to reduce its number of troops there. (Balkan Web, Shekulli, Alsat, Klan - 30/09/10)', 'ro': 'El a discutat despre menţinerea implicării NATO în Kosovo, menţionând că îmbunătăţirea situaţiei a permis Alianţei să îşi reducă numărul de trupe de acolo. (Balkan Web, Shekulli, Alsat, Klan - 30/09/10)'} |
| 4 | {'en': 'in writing. - I wish to express my regret in view of the imprisonment of Gilad Shalit.', 'ro': 'în scris. - Doresc să îmi exprim regretul cu privire la încarcerarea lui Gilad Shalit.'} |
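The sampling loop in show_random_elements draws indices until it has collected enough distinct ones. As a small sketch of an alternative, the standard library's `random.sample` draws distinct indices in one call. The helper name `pick_unique_indices` and the fixed seed are assumptions made for this illustration only.

```python
import random


def pick_unique_indices(n_items, num_examples=5):
    """Return num_examples distinct indices drawn from range(n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample never returns the same element twice.
    return random.sample(range(n_items), num_examples)


random.seed(0)  # fixed seed only so that the sketch is repeatable
picks = pick_unique_indices(610320, 5)  # 610320 is the size of the train split
print(picks)
```

The resulting list can be used in the same way as `picks` in the function above, for example `dataset[picks]`.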


In [21]:
metric

Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions: The system stream (a sequence of segments).
    references: A list of one or more reference streams (each a sequence of segments).
    smooth_method: The smoothing method to use. (Default: 'exp').
    smooth_value: The smoothing value. Only valid for 'floor' and 'add-k'. (Defaults: floor: 0.1, add-k: 1).
    tokenize: Tokenization method to use for BLEU. If not provided, defaults to 'zh' for Chinese, 'ja-mecab' for
        Japanese and '13a' (mteval) otherwise.
    lowercase: Lowercase the data. If True, enables case-insensitivity. (Default: False).
    force: Insist that your tokenized input is actually detokenized.

Returns:
    'score': BLEU score,
    'counts'

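[SacreBLEU] points out three problems with the traditional way of computing BLEU. The most important one is that existing scripts require users to supply tokenized output, and sometimes tokenized reference translations as well. Since different people tokenize differently, the resulting scores differ too (for example, whether the hyphen inside a compound word is split off, or how UNK is counted). Ideally, users supply detokenized output and never touch the reference translations, and all intermediate processing is left to the evaluation script. To this end the authors provide the Python package sacrebleu, which implements the above, has built-in download of the reference translations for all language directions of recent WMT campaigns, and ensures that its scores agree with the official WMT evaluation. It is therefore recommended to always evaluate model output with SacreBLEU.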

- [Neural Translation Notes 5, Extension c: Common Evaluation Metrics for Machine Translation Systems](https://zhuanlan.zhihu.com/p/258207437)
- [A Call for Clarity in Reporting BLEU Scores](https://arxiv.org/pdf/1804.08771v2.pdf)
- [sacreBLEU](https://github.com/mjpost/sacrebleu)

In [22]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 0.0,
 'counts': [4, 2, 0, 0],
 'totals': [4, 2, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}
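The score of 0.0 above may look surprising, since the predictions are identical to the references. BLEU combines the 1-gram to 4-gram precisions in a geometric mean, and two-word sentences contain no 3-grams or 4-grams at all. The following is a simplified sketch of that computation, assuming whitespace tokenization, one reference per prediction and no smoothing. It is not sacrebleu's implementation, and `simple_bleu` is a hypothetical helper written for this illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(predictions, references, max_order=4):
    """Unsmoothed corpus-level BLEU with a single reference per prediction."""
    counts = [0] * max_order  # clipped n-gram matches, per order
    totals = [0] * max_order  # n-grams found in the predictions, per order
    sys_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        pred_tok, ref_tok = pred.split(), ref.split()
        sys_len += len(pred_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_order + 1):
            pred_ngrams, ref_ngrams = ngrams(pred_tok, n), ngrams(ref_tok, n)
            totals[n - 1] += sum(pred_ngrams.values())
            # Counter intersection keeps the minimum count of each n-gram.
            counts[n - 1] += sum((pred_ngrams & ref_ngrams).values())
    if min(counts) == 0:
        # One order without matches drives the geometric mean to zero.
        score = 0.0
    else:
        bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
        log_precision = sum(math.log(c / t) for c, t in zip(counts, totals)) / max_order
        score = 100 * bp * math.exp(log_precision)
    return {"score": score, "counts": counts, "totals": totals}


result = simple_bleu(["hello there", "general kenobi"],
                     ["hello there", "general kenobi"])
print(result)
```

With these inputs the counts and totals are `[4, 2, 0, 0]`, as in the sacrebleu output above, and the score is 0.0. A sentence of at least four words that matches its reference exactly would score 100.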

# Preprocessing the data

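Before feeding these texts to the model we need to preprocess them. This is done with a Transformers Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and puts them into the format the model expects.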

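Below we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary that was used when pretraining this specific checkpoint,
- the vocabulary is cached, so it is not downloaded again the next time we run the cell.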

If you prefer to download the model manually, you can pass the local model path instead.

In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/770k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.33M [00:00<?, ?B/s]

In [6]:
tokenizer

PreTrainedTokenizer(name_or_path='Helsinki-NLP/opus-mt-en-ro', vocab_size=59543, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

eos: end of sentence.

The tokenizer can be called directly on a single sentence or on a list of sentences:

In [26]:
tokenizer("Hello, this one sentence!")

{'input_ids': [125, 778, 3, 63, 141, 9191, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [24]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[125, 778, 3, 63, 141, 9191, 23, 0], [187, 32, 716, 9191, 2, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

In [17]:
tokenizer.convert_ids_to_tokens(tokenizer("Hello, this one sentence!")['input_ids'])

['▁He', 'llo', ',', '▁this', '▁one', '▁sentence', '!', '</s>']

To prepare the translation targets for the model, we use the as_target_tokenizer context manager, which controls the special tokens used for the targets:

In [25]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[10334, 1204, 3, 15, 8915, 27, 452, 59, 29579, 581, 23, 0], [235, 1705, 11, 32, 8, 1205, 5305, 59, 29579, 581, 2, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [27]:
with tokenizer.as_target_tokenizer(): 
    print(tokenizer("Hello, this one sentence!"))
    model_input = tokenizer("Hello, this one sentence!")
    tokens = tokenizer.convert_ids_to_tokens(model_input['input_ids'])
    # Print the tokens to inspect the special tokens
    print('tokens: {}'.format(tokens))

{'input_ids': [10334, 1204, 3, 15, 8915, 27, 452, 59, 29579, 581, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokens: ['▁Hel', 'lo', ',', '▁', 'this', '▁o', 'ne', '▁se', 'nten', 'ce', '!', '</s>']
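The two token lists above describe the same English sentence, split once with the source (English) vocabulary and once with the target (Romanian) vocabulary. In SentencePiece output the `▁` character marks the start of a new word. The sketch below joins the pieces back into text under that convention. `pieces_to_text` is a hypothetical helper for illustration. In practice `tokenizer.decode` does this job.

```python
def pieces_to_text(pieces):
    """Join SentencePiece-style pieces back into plain text (simplified)."""
    kept = [p for p in pieces if p != "</s>"]  # drop the end-of-sentence token
    # "▁" marks a word boundary, so it becomes a space after joining.
    return "".join(kept).replace("▁", " ").strip()


# Pieces produced by the source-side tokenizer (copied from the output above).
source_pieces = ['▁He', 'llo', ',', '▁this', '▁one', '▁sentence', '!', '</s>']
# Pieces produced inside as_target_tokenizer (copied from the output above).
target_pieces = ['▁Hel', 'lo', ',', '▁', 'this', '▁o', 'ne', '▁se', 'nten', 'ce', '!', '</s>']

print(pieces_to_text(source_pieces), len(source_pieces))
print(pieces_to_text(target_pieces), len(target_pieces))
```

Both lists decode to the same sentence, but the target-side vocabulary needs 12 pieces instead of 8, because it was built for Romanian rather than English text.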


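Below is the function that preprocesses the text. We simply feed the texts to the tokenizer with truncation=True and a max_length, which ensures that any input longer than the chosen maximum (128 tokens here) is truncated to that length. Padding is handled later (in the data collator), so that examples are padded to the longest length in their batch rather than the longest length in the whole dataset.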

In [11]:
prefix = ""
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"
def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    # Tokenize and encode the inputs
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [13]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[393, 4462, 14, 1137, 53, 216, 28636, 0], [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0], [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]]}

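To apply this function to all sentence pairs in our dataset, we just use the map method of the dataset object we created earlier. This applies the function to all elements of all splits, so the training, validation and test data are preprocessed with a single command.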

In [30]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/611 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
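The progress bars above count batches, not rows. With batched=True the function receives many rows per call. The sketch below works out the number of calls per split, assuming the default map batch size of 1000 rows. `num_map_batches` is a hypothetical helper for this arithmetic.

```python
import math


def num_map_batches(num_rows, batch_size=1000):
    """Number of calls to the mapped function for one split."""
    return math.ceil(num_rows / batch_size)


# Split sizes taken from the DatasetDict printed earlier.
split_sizes = {"train": 610320, "validation": 1999, "test": 1999}
for split, rows in split_sizes.items():
    print(split, num_map_batches(rows))
```

This gives 611 batches for the training split and 2 each for validation and test, matching the totals in the progress bars.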

## Fine-tuning the model

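Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the AutoModelForSeq2SeqLM class. As with the tokenizer, the from_pretrained method downloads and caches the model for us.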

In [31]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/287M [00:00<?, ?B/s]

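To instantiate a Seq2SeqTrainer we need to define three more things. The most important is the [Seq2SeqTrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), a class that holds all the attributes used to customize training. It requires one folder name, which is used to save the model checkpoints. All other arguments are optional: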

In [32]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,  # weight decay applied by the optimizer
    save_total_limit=3,  # keep at most 3 checkpoints on disk
    num_train_epochs=1,
    predict_with_generate=True    
)

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


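Here we set evaluation to run at the end of every epoch, adjust the learning rate, use the batch_size defined at the top of the cell, and customize the weight decay. Since the Seq2SeqTrainer saves the model regularly and our dataset is quite large, we tell it to keep at most 3 checkpoints. Finally, we use the predict_with_generate option so that evaluation generates translations properly. Mixed-precision training (fp16=True) would be a bit faster, but it is not enabled in the cell above.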

The model will be saved in the **{model_name}-finetuned-{source_lang}-to-{target_lang}** directory.
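To get a feel for what these arguments imply, the sketch below estimates the number of optimizer steps and the checkpoints that remain on disk. It assumes a single device, no gradient accumulation, and a checkpoint every 500 steps (the Trainer default when save_steps is not set). These assumptions are not stated in the cell above.

```python
import math

train_rows = 610320    # size of the train split printed earlier
batch_size = 16        # per_device_train_batch_size
num_train_epochs = 1
save_steps = 500       # assumption: Trainer default, not set explicitly above
save_total_limit = 3

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a checkpoint is written, and the ones kept at the end.
saved = list(range(save_steps, total_steps + 1, save_steps))
kept = saved[-save_total_limit:]

print(total_steps)                        # 38145
print([f"checkpoint-{s}" for s in kept])  # the three most recent checkpoints
```

Under these assumptions one epoch takes 38145 steps, and the last checkpoints left on disk are checkpoint-37000, checkpoint-37500 and checkpoint-38000.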

Next we need a special kind of data collator, which pads not only the inputs to the longest length in the batch, but also the labels:

In [35]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
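To illustrate what padding inputs and labels per batch means, here is a simplified sketch in plain Python. It is not the DataCollatorForSeq2Seq implementation: it omits tensor conversion and decoder inputs, and `PAD_TOKEN_ID` is a placeholder value chosen for this example (in practice it comes from `tokenizer.pad_token_id`). Labels are padded with -100, the value the loss ignores.

```python
PAD_TOKEN_ID = 59542   # placeholder for this sketch; read tokenizer.pad_token_id in practice
LABEL_PAD_ID = -100    # label positions with this value are ignored by the loss


def collate(features):
    """Pad every field to the longest example in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * in_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * in_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * lab_pad)
    return batch


# The two examples returned by preprocess_function earlier in the notebook.
features = [
    {"input_ids": [393, 4462, 14, 1137, 53, 216, 28636, 0],
     "attention_mask": [1] * 8,
     "labels": [42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0]},
    {"input_ids": [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0],
     "attention_mask": [1] * 10,
     "labels": [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]},
]

batch = collate(features)
print([len(row) for row in batch["input_ids"]])  # both inputs now have length 10
print(batch["labels"][0][-5:])                   # padded tail of the shorter label row
```

Padding per batch keeps sequences short: a batch of short sentences is never padded to the length of the longest sentence in the whole dataset.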

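The last thing to define for our Seq2SeqTrainer is how to compute the metrics from the predictions. We need to define a function for this, which simply uses the metric we loaded earlier. We also have to do a bit of preprocessing to decode the predictions into text: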

In [33]:
import numpy as np
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
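Two steps in compute_metrics are easy to miss: the -100 entries that pad the labels must be replaced by the pad token ID before decoding, and gen_len counts only the non-pad tokens of each prediction. The sketch below reproduces both steps in plain Python on made-up ID rows. `PAD_TOKEN_ID` and the example rows are assumptions for illustration, not values taken from a training run.

```python
PAD_TOKEN_ID = 59542  # placeholder for this sketch; read tokenizer.pad_token_id in practice


def restore_padding(label_rows, pad_id=PAD_TOKEN_ID):
    """Replace the ignore index -100 by the pad ID so the rows can be decoded."""
    return [[pad_id if tok == -100 else tok for tok in row] for row in label_rows]


def mean_generated_length(pred_rows, pad_id=PAD_TOKEN_ID):
    """Average number of non-pad tokens per prediction (the gen_len metric)."""
    lengths = [sum(1 for tok in row if tok != pad_id) for row in pred_rows]
    return sum(lengths) / len(lengths)


# Made-up rows: the first label row was padded with -100 by the collator.
labels = [[42140, 494, 0, -100, -100], [36199, 6612, 9, 15202, 0]]
preds = [[59542, 42140, 494, 0, 59542], [59542, 36199, 6612, 9, 0]]

print(restore_padding(labels))
print(mean_generated_length(preds))
```

The function above does the same with `np.where` and `np.count_nonzero` on whole arrays.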

Then we just pass all of this, together with our datasets, to the Seq2SeqTrainer:

In [37]:
trainer = Seq2SeqTrainer(
    model,  # the model instantiated above
    args,  # the training arguments
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by calling the train method:

In [None]:
trainer.train()

In [39]:
import os
for dirname, _, filenames in os.walk('opus-mt-en-ro-finetuned-en-to-ro'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\config.json
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\optimizer.pt
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\pytorch_model.bin
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\rng_state.pth
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\scheduler.pt
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\source.spm
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\special_tokens_map.json
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\target.spm
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\tokenizer_config.json
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\trainer_state.json
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\training_args.bin
opus-mt-en-ro-finetuned-en-to-ro\checkpoint-500\vocab.json
opus-mt-en-ro-finetuned-en-to-ro\runs\Apr10_12-34-33_DESKTOP-G5E8965\events.out.tfevents.1649565367.DESKTOP-G5E8965.20652.0
opus-mt-en-ro-finetuned-en-to-ro\runs\Apr10_12-34-33_DESKTOP-G5E8965\1649565367.2931008\events.out.tfevents.164956

* training_args.bin: the serialized training arguments
* tokenizer_config.json: the configuration of the Helsinki-NLP/opus-mt-en-ro tokenizer
* trainer_state.json: the state of the training run

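Our fine-tuned model is saved in checkpoint folders under *opus-mt-en-ro-finetuned-en-to-ro/*, such as *checkpoint-38000*. The cells below load *checkpoint-500*, the checkpoint shown in the listing above.

Load the model and translate some text from English to Romanian.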

In [40]:
from transformers import MarianMTModel, MarianTokenizer
src_text = ['My name is Sarah and I live in London']
model_name = 'opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)

Didn't find file opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500\added_tokens.json. We won't load it.
loading file opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500\source.spm
loading file opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500\target.spm
loading file opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500\vocab.json
loading file opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500\tokenizer_config.json
loading file None
loading file opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500\special_tokens_map.json


[]


In [41]:
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

loading configuration file opus-mt-en-ro-finetuned-en-to-ro/checkpoint-500\config.json
Model config MarianConfig {
  "_name_or_path": "Helsinki-NLP/opus-mt-en-ro",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      59542
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 59542,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "L

['Numele meu este Sarah şi locuiesc în Londra.']

On this example our fine-tuned model improves on the pretrained model and matches the output of Google Translate:

**input text** -> My name is Sarah and I live in London

**pre-trained model prediction** -> Numele meu este Sarah şi locuiesc în Londra.

**fine-tune model prediction** -> Numele meu este Sarah şi locuiesc la Londra

**google translator prediction** -> Numele meu este Sarah şi locuiesc la Londra


![](https://img-blog.csdnimg.cn/063a6de022c44c4ebb48c470dbcfaa2a.png)


### Fine-tuning an English-to-Chinese translation model

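Background

5.2 million Chinese-English parallel sentence pairs (raw data 1.1 GB, compressed file 596 MB).

Data description
The corpus contains 5.2 million Chinese-English parallel pairs. Each pair consists of an English text and its Chinese counterpart. In most cases each side is one complete sentence with punctuation.
Across the parallel pairs, the Chinese side averages 36 characters and the English side averages 19 words (a word being, for example, "she").
Dataset split: the data was deduplicated and then split into a training set of 5.16 million pairs and a validation set of 39,000 pairs.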

**Structure:**

{"english": <english>, "chinese": <chinese>}

Here english is the English sentence and chinese is the Chinese sentence. The two correspond one to one.

**Example:**

{"english": "In Italy, there is no real public pressure for a new, fairer tax system.", "chinese": "在意大利，公众不会真的向政府施压，要求实行新的、更公平的税收制度。"}


https://www.heywhale.com/mw/dataset/5de5fcafca27f8002c4ca993/content
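The records of this corpus use the keys english and chinese, while the preprocess_function defined earlier reads `examples["translation"]` with language-code keys. The sketch below parses one record and reshapes it into that layout. It assumes one JSON object per line. The helper name `to_translation_record` and the key "zh" are choices made for this illustration, not something the dataset prescribes.

```python
import json


def to_translation_record(json_line, source_key="english", target_key="chinese"):
    """Parse one JSON line and reshape it into the WMT-style layout."""
    record = json.loads(json_line)
    # "en" and "zh" are illustrative key names chosen for this sketch.
    return {"translation": {"en": record[source_key], "zh": record[target_key]}}


# The example record shown above.
line = ('{"english": "In Italy, there is no real public pressure for a new, '
        'fairer tax system.", "chinese": "在意大利，公众不会真的向政府施压，'
        '要求实行新的、更公平的税收制度。"}')

example = to_translation_record(line)
print(example["translation"]["en"])
print(example["translation"]["zh"])
```

With records in this shape, setting target_lang = "zh" would let the earlier preprocessing function be reused, provided the tokenizer comes from an English-to-Chinese checkpoint.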

In [None]:
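# Load the dataset from local files (offline)
import os
from datasets import load_dataset

data_path = 'data/09'  # data directory, e.g. /home/mw/input/task063578
cache_dir = './data/cache'
data_files = {
    'train': os.path.join(data_path, 'translation2019zh_train.json'),  # training set
    'validation': os.path.join(data_path, 'translation2019zh_valid.json'),  # validation set
    # 'test': os.path.join(data_path, 'test.csv')  # test set
}
# Use the "json" loader explicitly, and avoid naming the result `datasets`,
# which would shadow the `datasets` module imported earlier.
raw_datasets_zh = load_dataset('json', data_files=data_files, cache_dir=cache_dir)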