Train a simple recurrent neural network (no GRU/LSTM, no attention) to decrypt a Caesar cipher:
1. Write a Caesar cipher algorithm to generate the dataset (shift every letter by N). For example, with N=2 the letter A becomes C. You may experiment with a language of your choice (German, Russian, etc.).
2. Create the recurrent neural network architecture.
3. Train it (input: the encrypted phrase, output: the decrypted phrase).
4. Evaluate the quality of the model.


2 points for a correctly completed assignment.

In [74]:
import re
import torch
import warnings
import time
import numpy as np

warnings.filterwarnings("ignore")
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
np.set_printoptions(threshold=1000)

Download the text file from the internet

In [75]:
!wget https://tululu.org/txt.php?id=51554

--2022-09-18 11:30:43--  https://tululu.org/txt.php?id=51554
Resolving tululu.org (tululu.org)... 104.21.82.5, 172.67.167.88, 2606:4700:3034::6815:5205, ...
Connecting to tululu.org (tululu.org)|104.21.82.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 667704 (652K) [text/plain]
Saving to: ‘txt.php?id=51554.5’


2022-09-18 11:30:43 (12.3 MB/s) - ‘txt.php?id=51554.5’ saved [667704/667704]



In [76]:
file_path = '/content/txt.php?id=51554'
string_size = 60
batch_size = 10
NUM_EPOCHS = 20
LEARNING_RATE = 0.01

Class that encodes text with a Caesar cipher for a given shift and decodes it again, which is used to check the model on an unseen text. It also builds the alphabet used for encoding while reading the file.

In [77]:
class Cesar(object):
    """Caesar cipher over an alphabet collected from a text file."""

    def __init__(self, step):
        self.step = step  # shift applied to every character
        self.alphabet = ''
        self.len_alphabet = 0

    def alphabet_from_file(self, file_path):
        # Collect the distinct characters of the file, then keep only
        # Latin letters, '.', '!', '?' and the space, in sorted order.
        with open(file_path) as file:
            chars = set(file.read())
        self.alphabet = re.sub(r'[^a-zA-Z.!? ]+', r'', ''.join(sorted(chars)))
        self.len_alphabet = len(self.alphabet)

    def encode(self, text):
        # Shift each character forward by step; characters outside the alphabet are dropped.
        res = ''
        for c in text:
            if c in self.alphabet:
                res += self.alphabet[(self.alphabet.index(c) + self.step) % len(self.alphabet)]
        return res

    def decode(self, text):
        # Shift each character back by step; the modulo wraps the whole difference.
        res = ''
        for c in text:
            if c in self.alphabet:
                res += self.alphabet[(self.alphabet.index(c) - self.step) % len(self.alphabet)]
        return res

coder = Cesar(2)
coder.alphabet_from_file(file_path)
alpha = coder.alphabet
alpha

' !.?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
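
The wrap-around behaviour of the cipher on this alphabet can be checked with a small stand-alone sketch. It assumes the 56-character alphabet printed above, rebuilt here as a literal; ALPHA and shift are illustrative names and are not part of the notebook.

```python
import string

# Assumption: the alphabet printed by the notebook, rebuilt as a literal.
ALPHA = " !.?" + string.ascii_uppercase + string.ascii_lowercase


def shift(text, step, alphabet=ALPHA):
    """Shift every character of the alphabet by `step` positions with wrap-around.

    Characters outside the alphabet are dropped, mirroring Cesar.encode.
    """
    n = len(alphabet)
    return "".join(
        alphabet[(alphabet.index(ch) + step) % n] for ch in text if ch in alphabet
    )


plain = "Hello world."
cipher = shift(plain, 2)       # encode with N=2
restored = shift(cipher, -2)   # decode by shifting back

print(cipher)    # Jgnnq.yqtnfA
print(restored)  # Hello world.
```

With a shift of 2 the space becomes '.', and '.' becomes 'A', which is why the encrypted text in this notebook contains dots where the original has spaces.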

Convert chunks of the text into arrays of numbers (the index of each character in the alphabet) and build the tensors for training

In [78]:
def sent_to_index(sentence):
    return [alpha.find(y) for y in sentence]
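
A minimal sketch of what this mapping produces, assuming the alphabet printed above is pasted in as a literal (ALPHA and to_indices are illustrative names). Note that str.find returns -1 for a character outside the alphabet, and an embedding layer cannot look up a negative index, so the text has to be sanitized before it is indexed.

```python
import string

# Assumption: the alphabet printed by the notebook, rebuilt as a literal.
ALPHA = " !.?" + string.ascii_uppercase + string.ascii_lowercase


def to_indices(sentence, alphabet=ALPHA):
    """Map each character to its position in the alphabet (-1 if absent)."""
    return [alphabet.find(ch) for ch in sentence]


print(to_indices("Hi!"))  # [11, 38, 1]
print(to_indices("a,b"))  # [30, -1, 31]  (the comma is not in the alphabet)
```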

In [79]:
def make_tensor(file_path, chunk_size):
    # Read the file in fixed-size chunks; characters outside the alphabet become spaces.
    text_array = []
    with open(file_path) as file:
        while True:
            text = file.read(chunk_size)
            if not text:
                break
            text_array.append(re.sub(r'[^a-zA-Z.!? ]', r' ', text))
    # Drop the last chunk: it is usually shorter than chunk_size and would break torch.tensor.
    del text_array[-1]
    # The first 80% of the chunks go to training, the remaining 20% to testing.
    split = 4 * len(text_array) // 5
    train_chunks, test_chunks = text_array[:split], text_array[split:]

    y_train = torch.tensor([sent_to_index(chunk) for chunk in train_chunks])
    x_train = torch.tensor([sent_to_index(coder.encode(chunk)) for chunk in train_chunks])

    y_test = torch.tensor([sent_to_index(chunk) for chunk in test_chunks])
    x_test = torch.tensor([sent_to_index(coder.encode(chunk)) for chunk in test_chunks])

    return x_train, y_train, x_test, y_test

In [80]:
x_train, y_train, x_test, y_test = make_tensor(file_path, string_size)
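
The chunking and the 80/20 split can be tried on an in-memory string. This is a sketch only: chunk_text and split_80_20 are hypothetical helpers, and unlike make_tensor, which always drops the last chunk, this version drops it only when it is shorter than the chunk size.

```python
import re


def chunk_text(text, size):
    """Cut text into fixed-size chunks and replace unsupported characters with spaces."""
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    # A short trailing chunk cannot be stacked into a rectangular tensor.
    if chunks and len(chunks[-1]) < size:
        chunks.pop()
    return [re.sub(r"[^a-zA-Z.!? ]", " ", chunk) for chunk in chunks]


def split_80_20(items):
    """Return the first 80% of the items and the remaining 20%."""
    cut = 4 * len(items) // 5
    return items[:cut], items[cut:]


sample = "abc, def! " * 10 + "xyz"  # 103 characters
chunks = chunk_text(sample, 10)
train, test = split_80_20(chunks)

print(repr(chunks[0]))                     # 'abc  def! '
print(len(chunks), len(train), len(test))  # 10 8 2
```

Because the split is positional, the test set is the final fifth of the book rather than a random sample of it.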

Dataset class used to feed the DataLoader.

In [81]:
class MyDataset(torch.utils.data.Dataset):

    def __init__(self, x, y):
        super().__init__()
        self._len = len(x)
        self.y = y
        self.x = x
    
    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

In [82]:
train_ds = torch.utils.data.DataLoader(MyDataset(x_train, y_train), 
                                       batch_size=batch_size, 
                                       shuffle=True)
test_ds = torch.utils.data.DataLoader(MyDataset(x_test, y_test),
                                       batch_size=batch_size,
                                       shuffle=False)  # evaluation does not need shuffling

A simple RNN model

In [83]:
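class RNNModel(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # One embedding vector per alphabet character (the original hard-coded 60 rows).
        self.embed = torch.nn.Embedding(len(alpha), 32)
        self.rnn = torch.nn.RNN(32, 128, batch_first=True)
        self.linear = torch.nn.Linear(128, len(alpha))

    def forward(self, sentence, state=None):
        x = self.embed(sentence)         # (batch, seq_len, 32)
        out, hidden = self.rnn(x, state)  # out: (batch, seq_len, 128)
        return self.linear(out)          # logits: (batch, seq_len, len(alpha))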
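
The tensor shapes flowing through these three layers can be checked on random data. The sketch below assumes a vocabulary of 56 characters (the alphabet size printed earlier) and the same batch size and sequence length as the notebook; VOCAB and the random batch are illustrative only.

```python
import torch

VOCAB = 56  # assumption: size of the alphabet printed earlier

embed = torch.nn.Embedding(VOCAB, 32)
rnn = torch.nn.RNN(32, 128, batch_first=True)
linear = torch.nn.Linear(128, VOCAB)

# A random batch of character indices: (batch, seq_len)
batch = torch.randint(0, VOCAB, (10, 60))
out, hidden = rnn(embed(batch))
logits = linear(out)

print(out.shape)     # torch.Size([10, 60, 128])
print(hidden.shape)  # torch.Size([1, 10, 128])
print(logits.shape)  # torch.Size([10, 60, 56])

# CrossEntropyLoss expects (N, C) logits and (N,) targets,
# so both tensors are flattened over the batch and time axes.
targets = torch.randint(0, VOCAB, (10, 60))
loss_value = torch.nn.CrossEntropyLoss()(logits.view(-1, VOCAB), targets.view(-1))
print(loss_value.item())
```

The model emits one prediction per input position, so the output sequence has the same length as the encrypted input.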

Initialize the model, the loss function and the optimizer

In [84]:
model = RNNModel().to(DEVICE)
loss = torch.nn.CrossEntropyLoss().to(DEVICE)
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

In [85]:
for epoch in range(NUM_EPOCHS):
    train_loss, train_acc, iter_num = .0, .0, .0
    start_epoch_time = time.time()
    model.train()
    for x, y in train_ds:
        x = x.to(DEVICE)
        y = y.view(-1).to(DEVICE)  # flatten targets to (batch * seq_len,)
        optimizer.zero_grad()
        out = model(x).view(-1, len(alpha))  # flatten logits to (batch * seq_len, vocab)
        l = loss(out, y)
        train_loss += l.item()  # the reported loss is summed over batches, not averaged
        batch_acc = (out.argmax(dim=1) == y)
        train_acc += batch_acc.sum().item() / batch_acc.shape[0]
        l.backward()
        optimizer.step()
        iter_num += 1
    print(f"Epoch: {epoch+1}, loss: {train_loss:.4f}, acc: "
        f"{train_acc / iter_num:.4f}",
        end=" | ")
    test_loss, test_acc, iter_num = .0, .0, .0
    model.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        for x, y in test_ds:
            x = x.to(DEVICE)
            y = y.view(-1).to(DEVICE)  # targets must be on the same device as the logits
            out = model(x).view(-1, len(alpha))
            l = loss(out, y)
            test_loss += l.item()
            batch_acc = (out.argmax(dim=1) == y)
            test_acc += batch_acc.sum().item() / batch_acc.shape[0]
            iter_num += 1
    print(
        f"test loss: {test_loss:.4f}, test acc: {test_acc / iter_num:.4f} | "
        f"{time.time() - start_epoch_time:.2f} sec."
    )

Epoch: 1, loss: 1181.0266, acc: 0.7861 | test loss: 100.9240, test acc: 0.9479 | 10.12 sec.
Epoch: 2, loss: 250.0898, acc: 0.9697 | test loss: 42.2988, test acc: 0.9771 | 10.17 sec.
Epoch: 3, loss: 133.7215, acc: 0.9786 | test loss: 28.5908, test acc: 0.9785 | 10.04 sec.
Epoch: 4, loss: 98.0773, acc: 0.9821 | test loss: 22.6376, test acc: 0.9826 | 10.12 sec.
Epoch: 5, loss: 79.9862, acc: 0.9852 | test loss: 19.0181, test acc: 0.9857 | 9.92 sec.
Epoch: 6, loss: 68.1352, acc: 0.9880 | test loss: 16.4295, test acc: 0.9892 | 9.98 sec.
Epoch: 7, loss: 59.3662, acc: 0.9899 | test loss: 14.4260, test acc: 0.9910 | 10.12 sec.
Epoch: 8, loss: 52.4600, acc: 0.9910 | test loss: 12.8142, test acc: 0.9924 | 10.01 sec.
Epoch: 9, loss: 46.8313, acc: 0.9925 | test loss: 11.4864, test acc: 0.9934 | 10.02 sec.
Epoch: 10, loss: 42.1517, acc: 0.9935 | test loss: 10.3737, test acc: 0.9939 | 10.09 sec.
Epoch: 11, loss: 38.1818, acc: 0.9945 | test loss: 9.4285, test acc: 0.9947 | 10.14 sec.
Epoch: 12, loss: 

Cell for testing the model on arbitrary text: any English text can be put into sentence to check it.

In [86]:
sentence = """Jupyter Notebook
Jupyter notebook, formerly known as the IPython notebook, is a flexible tool that helps you create readable analyses, 
as you can keep code, images, comments, formulae and plots together."""
encrypted_sentence = coder.encode(sentence)
encrypted_sentence_idx = sent_to_index(encrypted_sentence)
result = model(torch.tensor([encrypted_sentence_idx]).to(DEVICE)).argmax(dim=2)
deencrypted_sentence = "".join([alpha[i.item()] for i in result.flatten()])
print(f'Encrypted sentence is : \n{encrypted_sentence}')
print("-" * 20)
print(f'Predicted sentence: \n{deencrypted_sentence}')
print(f'Decrypted sentence is : \n{coder.decode(encrypted_sentence)}')

Encrypted sentence is : 
Lwr vgt.PqvgdqqmLwr vgt.pqvgdqqm.hqtogtn .mpqyp.cu.vjg.KR vjqp.pqvgdqqm.ku.c.hngzkdng.vqqn.vjcv.jgnru. qw.etgcvg.tgcfcdng.cpcn ugu.cu. qw.ecp.mggr.eqfg.kocigu.eqoogpvu.hqtowncg.cpf.rnqvu.vqigvjgtA
--------------------
Predicted sentence: 
Jupyter Notebookuupyter notebook formerly known as the Iiython notebook is a flexible tool that helps you create readable analyses as you can keep code images comments formulae and plots together.
Decrypted sentence is : 
Jupyter NotebookJupyter notebook formerly known as the IPython notebook is a flexible tool that helps you create readable analyses as you can keep code images comments formulae and plots together.
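
To see exactly where the prediction differs from the reference, the two strings can be compared character by character. The sketch below uses prefixes copied from the predicted and decrypted output above; the variable names are illustrative.

```python
# Prefixes copied from the predicted and decrypted output above.
predicted = "Jupyter Notebookuupyter notebook formerly known as the Iiython notebook"
expected = "Jupyter NotebookJupyter notebook formerly known as the IPython notebook"

# Positions where the model's character differs from the reference.
mismatches = [i for i, (p, e) in enumerate(zip(predicted, expected)) if p != e]
accuracy = 1 - len(mismatches) / len(expected)

for i in mismatches:
    print(f"position {i}: predicted {predicted[i]!r}, expected {expected[i]!r}")
print(f"character accuracy on this prefix: {accuracy:.4f}")
```

On this prefix the only two errors fall on the uppercase letters 'J' and 'P', both of which appear in the middle of the text rather than after a space or at the start of a sentence.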
