### Assignment
1. Take a pretrained transformer architecture and solve a translation task.
2. (Optional extra task) Take a Russian-language classification dataset from `datasets`, take a model pretrained for that kind of classification task, and measure its quality on this dataset before and after fine-tuning.

### Solving the translation task

Based on material from https://stackoverflow.com/questions/68185061/strange-results-with-huggingface-transformermarianmt-translation-of-larger-tex

In [1]:
from transformers import MarianMTModel, MarianTokenizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import LineTokenizer
import math
import torch

In [2]:
dev = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device(dev)

In [3]:
mname = 'Helsinki-NLP/opus-mt-ru-en'
tokenizer = MarianTokenizer.from_pretrained(mname)
model = MarianMTModel.from_pretrained(mname)
model.to(device)

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(62518, 512, padding_idx=62517)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(62518, 512, padding_idx=62517)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1

In [4]:
lt = LineTokenizer()
batch_size = 8

In [5]:
text = 'Иностранную валюту в России запрещать не собираются, заявила глава Банка России Эльвира Набиуллина \
на полях Петербургского международного экономического форума (ПМЭФ). По ее словам, доллар, евро и другие валюты останутся \
в обороте, валютные вклады граждан конфискованы не будут. А вот валютные ограничения по мере стабилизации финансовой \
системы должны быть сняты, подчеркнула она. Набиуллина напомнила, что большинство ограничительных мер на движение капитала \
введены в ответ на "заморозку" российских активов и направлены в основном на нерезидентов из недружественных стран. \
"В условиях заморозки наших золотовалютных резервов мы можем только перекрыть отток капитала. Однако по мере стабилизации \
финансовой системы все ограничения постепенно ослабляются", - подчеркнула она.'

In [6]:
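# Split the text into paragraphs (one per line), then translate each paragraph sentence by sentence.
paragraphs = lt.tokenize(text)
translated_paragraphs = []
for paragraph in paragraphs:
    # Note: sent_tokenize uses English rules by default; NLTK also accepts language="russian".
    sentences = sent_tokenize(paragraph)
    batches = math.ceil(len(sentences) / batch_size)
    translated = []
    for i in range(batches):
        # Take the next slice of at most batch_size sentences.
        sent_batch = sentences[i*batch_size:(i+1)*batch_size]
        # Pad to the longest sentence in the batch and cut anything beyond 500 tokens.
        model_inputs = tokenizer(sent_batch, return_tensors="pt", padding=True, truncation=True, max_length=500).to(device)
        # Inference only, so gradients are not needed.
        with torch.no_grad():
            translated_batch = model.generate(**model_inputs)
        translated += translated_batch
    # Convert generated token ids back to text and rebuild the paragraph.
    translated = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    translated_paragraphs += [" ".join(translated)]
translated_text = "\n".join(translated_paragraphs)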
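
The batching arithmetic in the loop above can be checked without loading the model. The following is a minimal sketch: `split_into_batches` is a hypothetical helper name (not part of `transformers` or `nltk`), and the sentences are placeholder strings standing in for the output of `sent_tokenize`.

```python
import math


def split_into_batches(items, batch_size):
    """Split a list into consecutive slices of at most batch_size items."""
    n_batches = math.ceil(len(items) / batch_size)
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]


# Placeholder sentences; 19 items with batch_size=8 give two full batches and a remainder.
sentences = [f"sentence {i}" for i in range(19)]
batches = split_into_batches(sentences, batch_size=8)
print([len(b) for b in batches])  # [8, 8, 3]
```

Because the last slice simply runs past the end of the list, no special handling of the remainder is needed.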

In [7]:
translated_text

'Foreign currency is not being banned in Russia, says the head of the Bank of Russia, Elvira Nabiullin, in the fields of the Petersburg International Economic Forum (PMEF). According to her, the dollar, the euro and other currencies would remain in circulation, and citizens &apos; foreign exchange deposits would not be confiscated. The foreign exchange restrictions should be lifted as the financial system stabilized, she stressed. Nabiullina recalled that most of the restrictive measures on capital flows had been introduced in response to the freezing of Russian assets and targeted mainly at non-residents from unfriendly countries. "With the freezing of our foreign exchange reserves, we can only stop capital outflows. However, as the financial system is stabilized, all restrictions are gradually relaxed," she stressed.'

### Fine-tuning a pretrained model

In [8]:
import numpy as np

import tensorflow as tf
from tensorflow.keras.losses import SparseCategoricalCrossentropy

from datasets import load_dataset
from datasets import load_metric

from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification

In [9]:
my_datasets = load_dataset("blinoff/kinopoisk")
my_datasets

Reusing dataset kinopoisk (C:\Users\User\.cache\huggingface\datasets\blinoff___kinopoisk\simple\1.0.0\62f52027aea59f64f49c7b16165b82cb4dc45031bad3660c2719bf2a6ea4a44e)


DatasetDict({
    train: Dataset({
        features: ['content', 'title', 'grade3', 'movie_name', 'part', 'review_id', 'author', 'date', 'grade10', 'Idx'],
        num_rows: 36591
    })
    validation: Dataset({
        features: ['content', 'title', 'grade3', 'movie_name', 'part', 'review_id', 'author', 'date', 'grade10', 'Idx'],
        num_rows: 36591
    })
})

In [10]:
set(my_datasets["train"]["grade3"])

{'Bad', 'Good', 'Neutral'}

In [11]:
set(my_datasets["validation"]["grade3"])

{'Bad', 'Good', 'Neutral'}

In [12]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [13]:
def tokenize_function(example):
    return tokenizer(example["content"], padding="max_length", truncation=True)

In [14]:
# Map the string sentiment label to an integer class id.
LABEL_TO_ID = {'Bad': 0, 'Good': 1, 'Neutral': 2}

def transform_labels(example):
    # Unknown labels fall back to 0, matching the original if/elif chain.
    return {'labels': LABEL_TO_ID.get(example['grade3'], 0)}

In [15]:
tokenized_datasets = my_datasets.map(tokenize_function, batched=True)

remove_columns = ['title', 'grade3', 'movie_name', 'part', 'review_id', 'author', 'date', 'grade10', 'Idx']
tokenized_datasets = tokenized_datasets.map(transform_labels, remove_columns=remove_columns)




In [16]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(500))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(100))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["validation"]

In [17]:
tf_train_dataset = small_train_dataset.remove_columns(["content"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["content"]).with_format("tensorflow")

In [18]:
tf_eval_dataset[:5]

{'input_ids': <tf.Tensor: shape=(5, 512), dtype=int64, numpy=
 array([[  101,   467, 17424, ...,     0,     0,     0],
        [  101,   450, 28404, ..., 16948, 28401,   102],
        [  101,   455, 10286, ..., 28396, 10286,   102],
        [  101,   464, 28400, ...,   476, 28405,   102],
        [  101,   464, 28401, ...,   106,   460,   102]], dtype=int64)>,
 'token_type_ids': <tf.Tensor: shape=(5, 512), dtype=int64, numpy=
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)>,
 'attention_mask': <tf.Tensor: shape=(5, 512), dtype=int64, numpy=
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1]], dtype=int64)>,
 'labels': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([1, 0, 1, 2, 1], dtype=int64)>}
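
The first row printed above ends in zeros while the other rows end in 102. That is the pattern `padding="max_length"` and `truncation=True` produce for a short review and for reviews cut at 512 tokens. Below is a minimal pure-Python sketch of that behaviour, assuming the BERT conventions visible in the output (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding). `pad_or_truncate` is a hypothetical helper, not the tokenizer's actual implementation, and the token ids are made up.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed BERT special-token ids


def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    body = token_ids[:max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remainder with padding that the attention mask marks as ignored.
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# A short sequence is padded; a long one is truncated and still ends with [SEP].
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(200, 220)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```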

In [19]:
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["labels"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(4)

eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["labels"]))
eval_tf_dataset = eval_tf_dataset.batch(4)

In [20]:
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
eval_pred = model.predict(eval_tf_dataset)['logits']



In [23]:
eval_class = np.argmax(eval_pred, axis=1)

In [24]:
metric = load_metric("accuracy")

In [25]:
metric.compute(predictions=eval_class, references=tf_eval_dataset['labels'])

{'accuracy': 0.59}
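
Accuracy here is the share of examples whose arg-max logit matches the label. The classification head is newly initialized at this point, so it can help to compare the 0.59 figure with a majority-class baseline. Below is a minimal sketch with made-up logits and labels (not taken from the notebook); both function names are hypothetical.

```python
from collections import Counter


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the labelled class index."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Toy data: three classes, four examples.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.0], [0.0, 0.3, 0.9], [0.2, 1.1, 0.4]]
toy_labels = [1, 0, 1, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
print(majority_baseline(toy_labels))                 # 0.75
```

Applied to the evaluation labels converted to a plain list, `majority_baseline` would indicate how much of the pre-training accuracy could be explained by class imbalance alone.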

In [26]:
model.compile(
    optimizer="adam",
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(
    train_tf_dataset,
    validation_data=eval_tf_dataset,
    epochs=3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1363036cac0>
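
`SparseCategoricalCrossentropy(from_logits=True)` takes integer class labels and raw logits, applies a softmax internally, and returns the negative log-probability of the true class. Below is a minimal per-example sketch in plain Python. It ignores Keras's batching and reduction, and the function name and logits are illustrative only.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example given raw logits and an integer label."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Equivalent to -log(softmax(logits)[label]).
    return log_sum_exp - logits[label]


# Equal logits over three classes give a loss of log(3);
# a large logit on the true class drives the loss towards zero.
uniform = sparse_categorical_crossentropy([0.0, 0.0, 0.0], label=2)
confident = sparse_categorical_crossentropy([0.0, 5.0, 0.0], label=1)
print(uniform, confident)
```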

In [29]:
eval_pred = model.predict(eval_tf_dataset)['logits']
eval_class = np.argmax(eval_pred, axis=1)
metric.compute(predictions=eval_class, references=tf_eval_dataset['labels'])



{'accuracy': 0.8}