# Creative Articulator 

## Overview

In this demo, we walk through all the processes in Creative Articulator, namely:
 - creating a parallel corpus
 - dividing text into fragments
 - translating text.

## Parallel corpus

A parallel corpus is a handy tool for linking parts of different corpora. It also stores all the corpora in a single zip file.

In this demonstration, we work with the text and retellings of Fyodor Mikhailovich Dostoevsky's novel Crime and Punishment, which are located in the `source` folder.


The first step is to build a corpus from each of the texts and retellings, which are stored in Markdown format.

#### Corpus for English text:

In [1]:
from grammar_ru.corpus import CorpusBuilder
from pathlib import Path


CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/en_crime_and_puhishment.base.zip'),
    Path('./source/book/en'),
    ['book'],
    custom_guid_factory = lambda index: f'en_book_{index}'
)



Note: all the corpora in this demo use custom file ids, produced by `custom_guid_factory`. These ids are used in the following parts to bind the ML retellings to the original fragments. You can skip this if you build a corpus only once and then distribute it as a zip file.
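To make the naming scheme concrete, the sketch below applies factories of the same shape to a few part indices. It is only an illustration: `make_guid_factory` is a hypothetical helper that is not part of `grammar_ru`, and we assume, judging by the tables in this demo, that the factory receives the zero-based position of each part.

```python
from typing import Callable


def make_guid_factory(prefix: str) -> Callable[[int], str]:
    """Return a function that maps a part index to a predictable id."""
    return lambda index: f"{prefix}_{index}"


# Hypothetical factories mirroring the lambdas used in this demo.
en_book_guid = make_guid_factory("en_book")
en_retell_guid = make_guid_factory("en_retell")

# Parts with the same index get ids that differ only in the prefix,
# which is what makes it easy to pair them up later.
pairs = [(en_book_guid(i), en_retell_guid(i)) for i in range(3)]
print(pairs)
```

With random ids this pairing would have to be stored separately; with predictable ids it can be rebuilt from the index alone.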

In [2]:
from grammar_ru.corpus import CorpusReader


en_book_reader = CorpusReader(Path('./files/en_crime_and_puhishment.base.zip'))
en_book = en_book_reader.get_toc().index

en_book_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| en_book_0 | en_crime_and_punishment.md | 2024-03-31 12:51:22.640883 | 0 | 4028 | 15089 | 0 | Part One | I | Part One / I | en_crime_and_punishment | 4029 |
| en_book_1 | en_crime_and_punishment.md | 2024-03-31 12:51:22.656883 | 1 | 8965 | 32957 | 1 | Part One | II | Part One / II | en_crime_and_punishment | 22995 |
| en_book_2 | en_crime_and_punishment.md | 2024-03-31 12:51:22.678882 | 2 | 6439 | 24251 | 2 | Part One | III | Part One / III | en_crime_and_punishment | 39435 |
| en_book_3 | en_crime_and_punishment.md | 2024-03-31 12:51:22.696882 | 3 | 6473 | 23395 | 3 | Part One | IV | Part One / IV | en_crime_and_punishment | 55909 |
| en_book_4 | en_crime_and_punishment.md | 2024-03-31 12:51:22.713882 | 4 | 5342 | 19413 | 4 | Part One | V | Part One / V | en_crime_and_punishment | 71252 |


#### Corpus for English retelling:

In [3]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/en_retell.base.zip'),
    Path('./source/retell/en'),
    ['book'],
    custom_guid_factory = lambda index: f'en_retell_{index}'
)


In [4]:
en_retell_reader = CorpusReader(Path('./files/en_retell.base.zip'))
en_retell = en_retell_reader.get_toc().index
en_retell_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| en_retell_0 | en_retell.md | 2024-03-31 12:51:25.453882 | 0 | 776 | 3295 | 0 | Part I | Chapter 1 | Part I / Chapter 1 | en_retell | 777 |
| en_retell_1 | en_retell.md | 2024-03-31 12:51:25.457882 | 1 | 1191 | 5178 | 1 | Part I | Chapter 2 | Part I / Chapter 2 | en_retell | 11969 |
| en_retell_2 | en_retell.md | 2024-03-31 12:51:25.473883 | 2 | 1042 | 4464 | 2 | Part I | Chapter 3 | Part I / Chapter 3 | en_retell | 23012 |
| en_retell_3 | en_retell.md | 2024-03-31 12:51:25.488882 | 3 | 1098 | 4711 | 3 | Part I | Chapter 4 | Part I / Chapter 4 | en_retell | 34111 |
| en_retell_4 | en_retell.md | 2024-03-31 12:51:25.503883 | 4 | 954 | 3932 | 4 | Part I | Chapter 5 | Part I / Chapter 5 | en_retell | 45066 |


#### Corpus for Russian text:

In [5]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_crime_and_puhishment.base.zip'),
    Path('./source/book/ru'),
    ['book'],
    custom_guid_factory = lambda index: f'ru_book_{index}'
)


In [6]:
ru_book_reader = CorpusReader(Path('./files/ru_crime_and_puhishment.base.zip'))
ru_book = ru_book_reader.get_toc().index
ru_book_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ru_book_0 | ru_crime_and_punishment.md | 2024-03-31 12:51:34.489882 | 0 | 3243 | 13794 | 0 | ЧАСТЬ ПЕРВАЯ | I | ЧАСТЬ ПЕРВАЯ / I | ru_crime_and_punishment | 3244 |
| ru_book_1 | ru_crime_and_punishment.md | 2024-03-31 12:51:34.496883 | 1 | 7122 | 29360 | 1 | ЧАСТЬ ПЕРВАЯ | II | ЧАСТЬ ПЕРВАЯ / II | ru_crime_and_punishment | 20367 |
| ru_book_2 | ru_crime_and_punishment.md | 2024-03-31 12:51:34.527882 | 2 | 5526 | 22721 | 2 | ЧАСТЬ ПЕРВАЯ | III | ЧАСТЬ ПЕРВАЯ / III | ru_crime_and_punishment | 35894 |
| ru_book_3 | ru_crime_and_punishment.md | 2024-03-31 12:51:34.553882 | 3 | 5231 | 21071 | 3 | ЧАСТЬ ПЕРВАЯ | IV | ЧАСТЬ ПЕРВАЯ / IV | ru_crime_and_punishment | 51126 |
| ru_book_4 | ru_crime_and_punishment.md | 2024-03-31 12:51:34.580882 | 4 | 4285 | 17322 | 4 | ЧАСТЬ ПЕРВАЯ | V | ЧАСТЬ ПЕРВАЯ / V | ru_crime_and_punishment | 65412 |


#### Corpus for Russian retelling:

In [7]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_retell.base.zip'),
    Path('./source/retell/ru'),
    ['book'],
    custom_guid_factory = lambda index: f'ru_retell_{index}'
)


In [8]:
ru_retell_reader = CorpusReader(Path('./files/ru_retell.base.zip'))
ru_retell = ru_retell_reader.get_toc().index
ru_retell_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ru_retell_0 | ru_retell.md | 2024-03-31 12:51:36.176882 | 0 | 396 | 1814 | 0 | Часть 1 | Глава 1: Знакомство с Родионом | Часть 1 / Глава 1: Знакомство с Родионом | ru_retell | 397 |
| ru_retell_1 | ru_retell.md | 2024-03-31 12:51:36.179882 | 1 | 704 | 3090 | 1 | Часть 1 | Глава 2: Встреча с Мармеладовым | Часть 1 / Глава 2: Встреча с Мармеладовым | ru_retell | 11102 |
| ru_retell_2 | ru_retell.md | 2024-03-31 12:51:36.193882 | 2 | 280 | 1330 | 2 | Часть 1 | Глава 3: Письмо матери | Часть 1 / Глава 3: Письмо матери | ru_retell | 21383 |
| ru_retell_3 | ru_retell.md | 2024-03-31 12:51:36.206882 | 3 | 461 | 2020 | 3 | Часть 1 | Глава 4: Мысли о сестре | Часть 1 / Глава 4: Мысли о сестре | ru_retell | 31845 |
| ru_retell_4 | ru_retell.md | 2024-03-31 12:51:36.219882 | 4 | 193 | 892 | 4 | Часть 1 | Глава 5: Сон Раскольникова | Часть 1 / Глава 5: Сон Раскольникова | ru_retell | 42039 |


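Next, we form a parallel corpus. Each sub-corpus is added in turn, together with a table of relations that links its files to the files of a sub-corpus added earlier.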

In [9]:
import pandas as pd
import os

PARALLEL_CORPUS_PATH = Path('./files/parallel_corpus.zip')
if PARALLEL_CORPUS_PATH.is_file():
    os.unlink(PARALLEL_CORPUS_PATH)

def add_relation(df_1, df_2, name_1, name_2):
    """Link the i-th id in df_1 to the i-th id in df_2, in both directions."""
    rel_1 = pd.DataFrame({'file_1': df_1, 'file_2': df_2, 'relation_name': f"{name_1}_{name_2}"})
    rel_2 = pd.DataFrame({'file_1': df_2, 'file_2': df_1, 'relation_name': f"{name_2}_{name_1}"})
    rel = pd.concat([rel_1, rel_2])
    return rel

CorpusBuilder.update_parallel_data(
    PARALLEL_CORPUS_PATH,
    ru_book_reader,
    "ru_book",
    None)

CorpusBuilder.update_parallel_data(
    PARALLEL_CORPUS_PATH,
    ru_retell_reader,
    "ru_retell",
    add_relation(ru_book_reader.get_toc().index,ru_retell_reader.get_toc().index,"ru_book","ru_retell"))

CorpusBuilder.update_parallel_data(
    PARALLEL_CORPUS_PATH,
    en_book_reader,
    "en_book",
    add_relation(ru_book_reader.get_toc().index,en_book_reader.get_toc().index,"ru_book","en_book"))


CorpusBuilder.update_parallel_data(
    PARALLEL_CORPUS_PATH,
    en_retell_reader,
    "en_retell",
    add_relation(en_book_reader.get_toc().index,en_retell_reader.get_toc().index,"en_book","en_retell"))
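To see what a relation table looks like without building a corpus, here is a minimal standalone sketch that follows the same idea as `add_relation`. The name `build_symmetric_relation` and the toy ids are ours, introduced only for illustration; the sketch does not call `grammar_ru` at all.

```python
import pandas as pd


def build_symmetric_relation(ids_1, ids_2, name_1, name_2):
    """Pair ids position by position and record the link in both directions."""
    forward = pd.DataFrame({
        "file_1": list(ids_1),
        "file_2": list(ids_2),
        "relation_name": f"{name_1}_{name_2}",
    })
    backward = pd.DataFrame({
        "file_1": list(ids_2),
        "file_2": list(ids_1),
        "relation_name": f"{name_2}_{name_1}",
    })
    return pd.concat([forward, backward], ignore_index=True)


# Toy ids standing in for the indices of two tables of contents.
book_ids = pd.Index(["ru_book_0", "ru_book_1"])
retell_ids = pd.Index(["ru_retell_0", "ru_retell_1"])

relations = build_symmetric_relation(book_ids, retell_ids, "ru_book", "ru_retell")
print(relations)
```

Note that the pairing is positional: the first id of one list is linked to the first id of the other, and so on. In this sketch, lists of different lengths make pandas raise an error, so both sub-corpora need the same number of parts in the same order.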



## Browsing the parallel corpus

We check that all the sub-corpora have been recorded successfully.

In [10]:
from grammar_ru.corpus import ParallelCorpus

reader = ParallelCorpus(Path('./files/parallel_corpus.zip'))

df = reader.get_toc()

df.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'en_book', 'en_retell'], dtype=object)

In [11]:
reader.en_book.get_toc().head()

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id | subcorpus_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| en_book_0 | en_crime_and_punishment.md | 2024-03-31 12:51:39.028882 | 0 | 4028 | 15089 | 0 | Part One | I | Part One / I | en_crime_and_punishment | 1056634 | en_book |
| en_book_1 | en_crime_and_punishment.md | 2024-03-31 12:51:39.057882 | 1 | 8965 | 32957 | 1 | Part One | II | Part One / II | en_crime_and_punishment | 1075600 | en_book |
| en_book_2 | en_crime_and_punishment.md | 2024-03-31 12:51:39.098882 | 2 | 6439 | 24251 | 2 | Part One | III | Part One / III | en_crime_and_punishment | 1092040 | en_book |
| en_book_3 | en_crime_and_punishment.md | 2024-03-31 12:51:39.133882 | 3 | 6473 | 23395 | 3 | Part One | IV | Part One / IV | en_crime_and_punishment | 1108514 | en_book |
| en_book_4 | en_crime_and_punishment.md | 2024-03-31 12:51:39.168882 | 4 | 5342 | 19413 | 4 | Part One | V | Part One / V | en_crime_and_punishment | 1123857 | en_book |
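Since the table of contents carries `subcorpus_name` and `token_count` columns, ordinary pandas aggregation can be used to compare sub-corpora. The sketch below runs on a small hand-made frame rather than on the real corpus; its token counts are copied from the first two parts of `en_book` and `en_retell` shown earlier in this demo.

```python
import pandas as pd

# Toy table of contents with only the columns needed for the aggregation.
toc = pd.DataFrame(
    {
        "subcorpus_name": ["en_book", "en_book", "en_retell", "en_retell"],
        "token_count": [4028, 8965, 776, 1191],
    },
    index=pd.Index(
        ["en_book_0", "en_book_1", "en_retell_0", "en_retell_1"],
        name="file_id",
    ),
)

# Total number of tokens in each sub-corpus.
tokens_per_subcorpus = toc.groupby("subcorpus_name").token_count.sum()
print(tokens_per_subcorpus)
```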


In [12]:
reader.get_relations()

|  | file_1 | file_2 | relation_name |
|---|---|---|---|
| 0 | ru_book_0 | ru_retell_0 | ru_book_ru_retell |
| 1 | ru_book_1 | ru_retell_1 | ru_book_ru_retell |
| 2 | ru_book_2 | ru_retell_2 | ru_book_ru_retell |
| 3 | ru_book_3 | ru_retell_3 | ru_book_ru_retell |
| 4 | ru_book_4 | ru_retell_4 | ru_book_ru_retell |
| ... | ... | ... | ... |
| 36 | en_retell_36 | en_book_36 | en_retell_en_book |
| 37 | en_retell_37 | en_book_37 | en_retell_en_book |
| 38 | en_retell_38 | en_book_38 | en_retell_en_book |
| 39 | en_retell_39 | en_book_39 | en_retell_en_book |
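A relations table of this shape can be queried with plain pandas filtering. The following sketch looks up the counterpart of a file under a given relation; `find_counterpart` is a hypothetical helper, and the frame is a hand-made stand-in with the same three columns as the output of `get_relations()`.

```python
import pandas as pd

# Toy relations table with the columns file_1, file_2 and relation_name.
relations = pd.DataFrame({
    "file_1": ["ru_book_0", "ru_book_1", "ru_retell_0", "ru_retell_1"],
    "file_2": ["ru_retell_0", "ru_retell_1", "ru_book_0", "ru_book_1"],
    "relation_name": ["ru_book_ru_retell"] * 2 + ["ru_retell_ru_book"] * 2,
})


def find_counterpart(relations, file_id, relation_name):
    """Return the ids linked to file_id through the given relation."""
    mask = (relations.file_1 == file_id) & (relations.relation_name == relation_name)
    return relations.loc[mask, "file_2"].tolist()


counterparts = find_counterpart(relations, "ru_book_1", "ru_book_ru_retell")
print(counterparts)
```

Because every link is stored in both directions, the same helper works when starting from a retelling and asking for the `ru_retell_ru_book` relation.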


## Translate text

In [13]:
from ca.utils_translate import translate_subcorpus

translate_subcorpus(
    PARALLEL_CORPUS_PATH,
    "en_retell",
    'en_to_ru_retell', 
    custom_guid_factory = lambda index: f'en_to_ru_retell_{index}'
)


In [14]:
reader = ParallelCorpus(PARALLEL_CORPUS_PATH)

df = reader.get_toc()

df.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'en_book', 'en_retell', 'en_to_ru_retell'],
      dtype=object)

In [15]:
reader.get_relations().feed(lambda z: z.loc[z.file_1.str.startswith('en_to_ru')]).head()

|  | file_1 | file_2 | relation_name |
|---|---|---|---|
| 0 | en_to_ru_retell_0 | en_retell_0 | en_to_ru_retell_en_retell |
| 1 | en_to_ru_retell_1 | en_retell_1 | en_to_ru_retell_en_retell |
| 2 | en_to_ru_retell_2 | en_retell_2 | en_to_ru_retell_en_retell |
| 3 | en_to_ru_retell_3 | en_retell_3 | en_to_ru_retell_en_retell |
| 4 | en_to_ru_retell_4 | en_retell_4 | en_to_ru_retell_en_retell |


### Remove redundant files

In [16]:
import os
from pathlib import Path

os.remove(Path('./files/en_crime_and_puhishment.base.zip'))
os.remove(Path('./files/en_retell.base.zip'))
os.remove(Path('./files/ru_crime_and_puhishment.base.zip'))
os.remove(Path('./files/ru_retell.base.zip'))
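
`os.remove` raises `FileNotFoundError` when a file is already gone, so the cell above fails if it is run twice. A more tolerant variant might look like the sketch below; `remove_if_exists` is a hypothetical helper, and the demonstration uses a temporary directory with placeholder file names instead of the real `./files` folder.

```python
import tempfile
from pathlib import Path


def remove_if_exists(paths):
    """Delete each path that is an existing file and return those removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "en_retell.base.zip"
    existing.write_bytes(b"")  # placeholder file
    missing = Path(tmp) / "ru_retell.base.zip"  # never created

    removed = remove_if_exists([existing, missing])
    still_there = existing.exists()

print([p.name for p in removed])
```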