# Creative Articulator 

## Overview

This demo walks through the main processes in Creative Articulator, namely:
 - the creation of a parallel corpus
 - the division of text into fragments
 - the translation of text.

## Parallel corpus

A parallel corpus is a handy tool that lets you link corresponding parts of different corpora. With it, all the corpora are stored in a single zip file.

In this demonstration, we will work with texts and retellings of Fyodor Mikhailovich Dostoevsky's novel Crime and Punishment, which are located in the `source` folder.


The first step is to create a corpus for each of the texts and retellings, which are stored in Markdown (`.md`) format.

#### Corpus for English text:

In [None]:
from tg.grammar_ru.corpus import CorpusBuilder
from pathlib import Path


CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/eng_crime_and_puhishment.base.zip'),
    Path('./source/book/eng'),
    ['book']
)


In [None]:
from tg.grammar_ru.corpus import CorpusReader


eng_book_reader = CorpusReader(Path('./files/eng_crime_and_puhishment.base.zip'))
eng_book = eng_book_reader.get_toc().index

eng_book_reader.get_toc() 

#### Corpus for English retelling:

In [None]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/eng_retell.base.zip'),
    Path('./source/retell/eng'),
    ['book']
)

In [None]:
eng_retell_reader = CorpusReader(Path('./files/eng_retell.base.zip'))
eng_retell = eng_retell_reader.get_toc().index
eng_retell_reader.get_toc() 

#### Corpus for Russian text:

In [None]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_crime_and_puhishment.base.zip'),
    Path('./source/book/ru'),
    ['book']
)

In [None]:
ru_book_reader = CorpusReader(Path('./files/ru_crime_and_puhishment.base.zip'))
ru_book = ru_book_reader.get_toc().index
ru_book_reader.get_toc() 

#### Corpus for Russian retelling:

In [None]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_retell.base.zip'),
    Path('./source/retell/ru'),
    ['book']
)

In [None]:
ru_retell_reader = CorpusReader(Path('./files/ru_retell.base.zip'))
ru_retell = ru_retell_reader.get_toc().index
ru_retell_reader.get_toc() 

Next, we form the parallel corpus:

In [None]:
import pandas as pd

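def add_relation(df_1, df_2, name_1, name_2):
    # Build the relation in both directions; df_1 and df_2 are expected to be
    # equally long sequences of file ids (for example, TOC indexes).
    rel_1 = pd.DataFrame({'file_1': df_1, 'file_2': df_2, 'relation_name': f"{name_1}_{name_2}"})
    rel_2 = pd.DataFrame({'file_1': df_2, 'file_2': df_1, 'relation_name': f"{name_2}_{name_1}"})
    # pd.concat needs DataFrames, not plain dicts
    rel = pd.concat([rel_1, rel_2], ignore_index=True)
    return rel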

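def add_dfs(reader):
    # Pair each table-of-contents id with its frame
    # (assumes get_frames() yields frames in TOC order)
    frames = list(reader.get_frames())
    dfs = dict(zip(reader.get_toc().index, frames))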

    return dfs
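
To make the idea of linking parts of different corpora concrete, here is a minimal, dependency-free sketch of bidirectional relation records. The function `build_relations` and the ids `ru-1`, `en-1`, and so on are hypothetical and exist only for illustration; the field names mirror those used in `add_relation` above, but this is not the library's own storage format.

```python
def build_relations(ids_a, ids_b, name_a, name_b):
    """Link the i-th id of one corpus to the i-th id of another, in both directions."""
    if len(ids_a) != len(ids_b):
        raise ValueError("both corpora must have the same number of parts")
    pairs = list(zip(ids_a, ids_b))
    # Forward direction: corpus A -> corpus B
    forward = [
        {"file_1": a, "file_2": b, "relation_name": f"{name_a}_{name_b}"}
        for a, b in pairs
    ]
    # Backward direction: corpus B -> corpus A
    backward = [
        {"file_1": b, "file_2": a, "relation_name": f"{name_b}_{name_a}"}
        for a, b in pairs
    ]
    return forward + backward


# Hypothetical chapter ids, used only for this illustration
relations = build_relations(["ru-1", "ru-2"], ["en-1", "en-2"], "ru_book", "eng_book")
for record in relations:
    print(record)
```

Storing both directions lets a lookup start from either corpus without having to invert the relation at query time.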

In [None]:
CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_book_reader),
    "ru_book",
    None)

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_retell_reader),
    "ru_retell",
    None)

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_retell_reader),
    "eng_retell",
    None)

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_book_reader),
    "eng_book",
    None)


We check that all parts have been successfully recorded:

In [21]:
reader = CorpusReader(Path('./files/parallel_corpus.zip'))

df = reader.get_toc()

df.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'eng_retell', 'eng_book'], dtype=object)

In [None]:
from tg.grammar_ru.corpus import ParallelCorpus
parallel_corpus = ParallelCorpus(Path('./files/parallel_corpus.zip'))

## Translation of text

In [None]:
from tg.projects.retell.translate.utils import get_array_chapters

ru_retell_text = get_array_chapters(parallel_corpus.ru_retell)
eng_retell_text = get_array_chapters(parallel_corpus.eng_retell)
ru_book_text = get_array_chapters(parallel_corpus.ru_book)


print(ru_book_text[0][:500])

In [None]:
!pip install googletrans==3.1.0a0

In [None]:
from tg.projects.retell.translate.utils import translate
translate_retell = translate(eng_retell_text)
print(translate_retell[0])

In [None]:
from tg.projects.retell.retell_utils.metrics import get_jaccard_index
from tg.projects.retell.translate.utils import jac_metric
import numpy as np

# Jaccard index between each Russian chapter and its Russian retelling
jaccard_sim = np.array([
    get_jaccard_index(ru_book_text[i], ru_retell_text[i])
    for i in range(len(ru_retell_text))
])
jac_metric(jaccard_sim)

In [None]:
# The same metric for the translated English retelling
jaccard_sim = np.array([
    get_jaccard_index(ru_book_text[i], translate_retell[i])
    for i in range(len(ru_retell_text))
])
jac_metric(jaccard_sim)
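
For readers unfamiliar with the metric, the sketch below shows the standard Jaccard index over word sets, |A ∩ B| / |A ∪ B|. It assumes that `get_jaccard_index` computes something along these lines; the project's own implementation may tokenize or lemmatize differently. The function `jaccard_index` and the sample sentences are hypothetical.

```python
import re


def jaccard_index(text_a: str, text_b: str) -> float:
    """Jaccard index of the word sets of two texts: |A & B| / |A | B|."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    union = words_a | words_b
    if not union:
        # Two empty texts: define the similarity as 0.0 to avoid dividing by zero
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical sample data, used only for this illustration
chapters = ["the old pawnbroker counted her money", "he walked to the bridge"]
retellings = ["the pawnbroker counted money", "she stayed at home"]

# zip pairs chapter i with retelling i and stops at the shorter list
scores = [jaccard_index(c, r) for c, r in zip(chapters, retellings)]
print(scores)
```

A score near 1 means the retelling reuses most of the chapter's vocabulary, while a score near 0 means the two texts share almost no words.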

In [None]:
from yo_fluq_ds import FileIO

result = ''

for text in translate_retell:
    result += "\n## part\n"

    result += text


FileIO.write_text(result, Path("./source/translate/translate.md"))
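
The assembly step above can also be sketched with the standard library only. This is a minimal illustration, assuming each translated part is a plain string; `build_markdown` is a hypothetical helper, and the file is written to a temporary directory rather than to `./source/translate`.

```python
import tempfile
from pathlib import Path


def build_markdown(parts, header="## part"):
    """Join text parts, each preceded by a header line on its own line."""
    return "".join(f"\n{header}\n{text}" for text in parts)


# Hypothetical translated parts, used only for this illustration
parts = ["First translated chapter.", "Second translated chapter."]
document = build_markdown(parts)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "translate" / "translate.md"
    # Create the output folder first; writing fails if it does not exist
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(document, encoding="utf-8")
    restored = target.read_text(encoding="utf-8")

print(document)
```

Building the string with a single `join` avoids repeated string concatenation in a loop, and passing `encoding="utf-8"` explicitly keeps Cyrillic text intact regardless of the platform default.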


In [None]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/translate.base.zip'),
    Path('./source/translate'),
    ['book']
)

In [None]:
ru_translate_reader = CorpusReader(Path('./files/translate.base.zip'))
ru_translate_reader.get_toc() 

In [None]:
CorpusBuilder.update_parallel_data(
    Path('./files/translate.base.zip'),
    add_dfs(ru_translate_reader),
    "ru_translate",
    None)