## Introduction to Natural Language Processing

Homework No. 11

Lesson 11. The Transformer Model, Part 1

*Homework file naming format: FIO_NLP_HW_N.ipynb, where N is the homework number*

Study how the attention-based translation model is structured, then run it to translate from Russian to English (other language pairs may be used if desired).
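
As a reference point for the attention mechanism named in the task, here is a minimal single-head scaled dot-product attention sketch in plain Python. It is a simplified illustration and not the code of the pretrained model used later in this notebook; the function names `softmax` and `attention` and the toy vectors are assumptions made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d_k)) applied to values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# toy data (hypothetical): the query points in the direction of the first key
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0])
print(outputs[0])
```

Because the query is aligned with the first key, most of the attention weight goes to the first value vector.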

In [1]:
!pip install transformers
!pip install transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2

In [2]:
import io
import re

In [3]:
!wget http://www.manythings.org/anki/rus-eng.zip

--2023-05-19 15:14:52--  http://www.manythings.org/anki/rus-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15460248 (15M) [application/zip]
Saving to: ‘rus-eng.zip’


2023-05-19 15:14:53 (53.5 MB/s) - ‘rus-eng.zip’ saved [15460248/15460248]



In [4]:
!mkdir rus-eng
!unzip rus-eng.zip -d rus-eng/

Archive:  rus-eng.zip
  inflating: rus-eng/rus.txt         
  inflating: rus-eng/_about.txt      


In [5]:
path_to_file = "rus-eng/rus.txt"

In [6]:
def preprocess_sentence(w):
  w = w.lower().strip()

  # put spaces around punctuation so that each mark becomes its own token
  # e.g. "he is a boy." => "he is a boy ."
  # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,])", r" \1 ", w)
  # collapse runs of spaces (and double quotes) into a single space
  w = re.sub(r'[" "]+', " ", w)

  # replace everything except Latin and Cyrillic letters, ".", "?", "!", "," and "'" with a space
  # ("ё"/"Ё" lie outside the а-я/А-Я ranges, so they are listed explicitly)
  w = re.sub(r"[^a-zA-Zа-яА-ЯёЁ?.!,']+", " ", w)

  w = w.strip()

  # add a start and an end token to the sentence
  # so that the model knows when to start and stop predicting
  w = '<start> ' + w + ' <end>'
  return w
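
One detail about Cyrillic character classes is easy to miss: the range `а-я` does not contain the letter `ё`, so the letter has to be listed explicitly to survive the filtering step. The sketch below checks this with the standard `re` module; the helper `keep_allowed` and the sample phrase are hypothetical and exist only for this illustration.

```python
import re

YO = "\u0451"  # the letter 'ё', written as an escape to make the code point explicit

# 'ё' (U+0451) lies outside the contiguous range 'а'..'я' (U+0430..U+044F)
in_range = re.fullmatch("[а-я]", YO) is not None
in_extended = re.fullmatch("[а-я" + YO + "]", YO) is not None
print(in_range, in_extended)

def keep_allowed(text, allowed):
    """Replace every run of characters outside `allowed` with one space."""
    return re.sub("[^" + allowed + "]+", " ", text).strip()

sample = "вс" + YO + " хорошо"
narrow = keep_allowed(sample, "a-zа-я?.!,'")       # 'ё' is dropped
wide = keep_allowed(sample, "a-zа-я" + YO + "?.!,'")  # 'ё' is kept
print(narrow)
print(wide)
```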

In [7]:
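def create_dataset(path, num_examples):
  # read the whole file and close it afterwards
  with io.open(path, encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')

  # keep the first two tab-separated fields of each line; num_examples=None keeps every line
  word_pairs = [[preprocess_sentence(w) for w in l.split('\t')[:2]] for l in lines[:num_examples]]

  # transpose the list of pairs into two parallel sequences
  return zip(*word_pairs)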
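
To see what `l.split('\t')[:2]` and `zip(*word_pairs)` do, here is a small sketch on two made-up lines. It assumes each line of `rus.txt` is laid out as English, tab, Russian, tab, attribution; the sample lines and the names `sample_lines`, `en_side` and `ru_side` are hypothetical.

```python
# Hypothetical sample lines in the assumed tab-separated layout
sample_lines = [
    "Go.\tИди.\tattribution text",
    "Duck!\tПригнись!\tattribution text",
]

# keep only the first two fields of every line
pairs = [line.split("\t")[:2] for line in sample_lines]

# zip(*pairs) transposes [[en, ru], [en, ru]] into (en, en) and (ru, ru)
en_side, ru_side = zip(*pairs)
print(en_side)
print(ru_side)
```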

In [8]:
en, ru = create_dataset(path_to_file, None)
print(en[25])
print(ru[25])

<start> duck ! <end>
<start> пригнись ! <end>


In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en", do_lower_case=True)
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")

In [10]:
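def translate(sentence):
  # encode the source sentence into token ids (PyTorch tensor)
  inputs = tokenizer.encode(sentence, return_tensors="pt")
  # beam search with 4 beams, output capped at 40 tokens
  outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

  # decode the best hypothesis; special tokens such as <pad> and </s> are kept
  result = tokenizer.decode(outputs[0])

  return result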

In [13]:
# pick arbitrary sentences from the corpus used in the previous (10th) homework

for i in [152, 210, 157, 120, 256, 523, 526, 548, 4536, 61458, 56355, 15296]:
    print(ru[i], translate(ru[i]))

<start> я заплатила . <end> <pad> #start> I paid. <end></s>
<start> будь спокоен . <end> <pad> <start> stay calm. <end></s>
<start> я плаваю . <end> <pad> <start> I swim. <end></s>
<start> поезжайте сейчас . <end> <pad> <start> go now. <end></s>
<start> сделай это . <end> <pad> <start> do it. <end></s>
<start> ловите меня . <end> <pad> <start> catch me. <end></s>
<start> взбодрись ! <end> <pad> #start> cheer up! #end></s>
<start> иди отсюда . <end> <pad> Get out of here. <end></s>
<start> я тренер . <end> <pad> <start> I'm coach. <end></s>
<start> сохрани это на потом . <end> <pad> <start> save this for later. <end></s>
<start> мне всех видно . <end> <pad> <start> I can see everyone. <end></s>
<start> разрежьте это пополам . <end> <pad> <start> cut it in half. <end></s>


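In my view, the pretrained Transformer model handled these short sentences quite well, though not perfectly: "я тренер" became "I'm coach" (the article is missing), and the `<start>`/`<end>` markers added during preprocessing leak into the output, sometimes garbled as `#start>` or `#end>`.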
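
The printed translations above still carry `<pad>` and `</s>` (the model's special tokens) along with the `<start>`/`<end>` markers. One possible way to tidy such strings after the fact is a small regex helper, sketched below; `clean_translation` is a hypothetical name, and the two inputs are copied from the output above. Passing `skip_special_tokens=True` to `tokenizer.decode` should remove `<pad>` and `</s>` at the source, and translating sentences without the markers would likely avoid the rest.

```python
import re

# Tokens seen in the printed output: the model's special tokens and the
# preprocessing markers (sometimes garbled as '#start>' / '#end>').
_MARKER_RE = re.compile(r"<pad>|</s>|[<#](?:start|end)>")

def clean_translation(text):
    """Remove marker tokens and collapse whitespace (illustrative helper)."""
    without_markers = _MARKER_RE.sub(" ", text)
    return " ".join(without_markers.split())

first = clean_translation("<pad> #start> I paid. <end></s>")
second = clean_translation("<pad> Get out of here. <end></s>")
print(first)
print(second)
```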