# Creative Articulator 

## Overview

In this demo, we walk through all the processes in Creative Articulator, namely:
 - the creation of a parallel corpus
 - the division of text into fragments
 - the translation of text.

## Parallel corpus

A parallel corpus is a handy tool that lets you link parts of different corpora. With it, all the corpora are stored in a single zip file.

In this demonstration, we work with the texts and retellings of Fyodor Mikhailovich Dostoevsky's novel Crime and Punishment, which are located in the `source` folder.


The first step is to create corpora from the texts and retellings, which are stored in Markdown (`md`) format.

#### Corpus for English text:

In [1]:
from grammar_ru.corpus import CorpusBuilder
from pathlib import Path


CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/eng_crime_and_puhishment.base.zip'),
    Path('./source/book/eng'),
    ['book'],
    custom_guid_factory = lambda index: f'eng_book_{index}'
)



Note: for all the corpora in this demo we use custom file naming, provided by `custom_guid_factory`. This is because these names will be used to 

In [2]:
from grammar_ru.corpus import CorpusReader


eng_book_reader = CorpusReader(Path('./files/eng_crime_and_puhishment.base.zip'))
eng_book = eng_book_reader.get_toc().index

eng_book_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| eng_book_0 | eng_crime_and_punishment.md | 2024-03-31 12:04:53.320693 | 0 | 4028 | 15089 | 0 | Part One | I | Part One / I | eng_crime_and_punishment | 4029 |
| eng_book_1 | eng_crime_and_punishment.md | 2024-03-31 12:04:53.347693 | 1 | 8965 | 32957 | 1 | Part One | II | Part One / II | eng_crime_and_punishment | 22995 |
| eng_book_2 | eng_crime_and_punishment.md | 2024-03-31 12:04:53.382693 | 2 | 6439 | 24251 | 2 | Part One | III | Part One / III | eng_crime_and_punishment | 39435 |
| eng_book_3 | eng_crime_and_punishment.md | 2024-03-31 12:04:53.410693 | 3 | 6473 | 23395 | 3 | Part One | IV | Part One / IV | eng_crime_and_punishment | 55909 |
| eng_book_4 | eng_crime_and_punishment.md | 2024-03-31 12:04:53.438693 | 4 | 5342 | 19413 | 4 | Part One | V | Part One / V | eng_crime_and_punishment | 71252 |
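The `file_id` values in the table above follow the pattern set by `custom_guid_factory`. As a minimal sketch, the same lambda can be called by hand to see which names it yields. It assumes, judging only from the `file_id` and `part_index` columns shown here, that the builder passes the zero-based position of each part; `first_ids` is an illustrative name.

```python
# The factory passed to CorpusBuilder in the cell above, reproduced standalone.
custom_guid_factory = lambda index: f'eng_book_{index}'

# Assumption: the factory is called with 0, 1, 2, ... for consecutive parts.
first_ids = [custom_guid_factory(i) for i in range(3)]
print(first_ids)
```

Readable, predictable identifiers of this kind are easier to inspect than automatically generated ones when files from different corpora are matched later.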


#### Corpus for English retelling:

In [3]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/eng_retell.base.zip'),
    Path('./source/retell/eng'),
    ['book'],
    custom_guid_factory = lambda index: f'eng_retell_{index}'
)


In [4]:
eng_retell_reader = CorpusReader(Path('./files/eng_retell.base.zip'))
eng_retell = eng_retell_reader.get_toc().index
eng_retell_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| eng_retell_0 | eng_retell.md | 2024-03-31 12:05:27.271693 | 0 | 776 | 3295 | 0 | Part I | Chapter 1 | Part I / Chapter 1 | eng_retell | 777 |
| eng_retell_1 | eng_retell.md | 2024-03-31 12:05:27.275693 | 1 | 1191 | 5178 | 1 | Part I | Chapter 2 | Part I / Chapter 2 | eng_retell | 11969 |
| eng_retell_2 | eng_retell.md | 2024-03-31 12:05:27.291693 | 2 | 1042 | 4464 | 2 | Part I | Chapter 3 | Part I / Chapter 3 | eng_retell | 23012 |
| eng_retell_3 | eng_retell.md | 2024-03-31 12:05:27.305693 | 3 | 1098 | 4711 | 3 | Part I | Chapter 4 | Part I / Chapter 4 | eng_retell | 34111 |
| eng_retell_4 | eng_retell.md | 2024-03-31 12:05:27.320693 | 4 | 954 | 3932 | 4 | Part I | Chapter 5 | Part I / Chapter 5 | eng_retell | 45066 |


#### Corpus for Russian text:

In [5]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_crime_and_puhishment.base.zip'),
    Path('./source/book/ru'),
    ['book'],
    custom_guid_factory = lambda index: f'ru_book_{index}'
)


In [6]:
ru_book_reader = CorpusReader(Path('./files/ru_crime_and_puhishment.base.zip'))
ru_book = ru_book_reader.get_toc().index
ru_book_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ru_book_0 | ru_crime_and_punishment.md | 2024-03-31 12:05:47.681693 | 0 | 3243 | 13794 | 0 | ЧАСТЬ ПЕРВАЯ | I | ЧАСТЬ ПЕРВАЯ / I | ru_crime_and_punishment | 3244 |
| ru_book_1 | ru_crime_and_punishment.md | 2024-03-31 12:05:47.688693 | 1 | 7122 | 29360 | 1 | ЧАСТЬ ПЕРВАЯ | II | ЧАСТЬ ПЕРВАЯ / II | ru_crime_and_punishment | 20367 |
| ru_book_2 | ru_crime_and_punishment.md | 2024-03-31 12:05:47.720693 | 2 | 5526 | 22721 | 2 | ЧАСТЬ ПЕРВАЯ | III | ЧАСТЬ ПЕРВАЯ / III | ru_crime_and_punishment | 35894 |
| ru_book_3 | ru_crime_and_punishment.md | 2024-03-31 12:05:47.747693 | 3 | 5231 | 21071 | 3 | ЧАСТЬ ПЕРВАЯ | IV | ЧАСТЬ ПЕРВАЯ / IV | ru_crime_and_punishment | 51126 |
| ru_book_4 | ru_crime_and_punishment.md | 2024-03-31 12:05:47.772693 | 4 | 4285 | 17322 | 4 | ЧАСТЬ ПЕРВАЯ | V | ЧАСТЬ ПЕРВАЯ / V | ru_crime_and_punishment | 65412 |


#### Corpus for Russian retelling:

In [7]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_retell.base.zip'),
    Path('./source/retell/ru'),
    ['book'],
    custom_guid_factory = lambda index: f'ru_retell_{index}'
)


In [8]:
ru_retell_reader = CorpusReader(Path('./files/ru_retell.base.zip'))
ru_retell = ru_retell_reader.get_toc().index
ru_retell_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ru_retell_0 | ru_retell.md | 2024-03-31 12:06:42.880248 | 0 | 396 | 1814 | 0 | Часть 1 | Глава 1: Знакомство с Родионом | Часть 1 / Глава 1: Знакомство с Родионом | ru_retell | 397 |
| ru_retell_1 | ru_retell.md | 2024-03-31 12:06:42.884248 | 1 | 704 | 3090 | 1 | Часть 1 | Глава 2: Встреча с Мармеладовым | Часть 1 / Глава 2: Встреча с Мармеладовым | ru_retell | 11102 |
| ru_retell_2 | ru_retell.md | 2024-03-31 12:06:42.898249 | 2 | 280 | 1330 | 2 | Часть 1 | Глава 3: Письмо матери | Часть 1 / Глава 3: Письмо матери | ru_retell | 21383 |
| ru_retell_3 | ru_retell.md | 2024-03-31 12:06:42.911249 | 3 | 461 | 2020 | 3 | Часть 1 | Глава 4: Мысли о сестре | Часть 1 / Глава 4: Мысли о сестре | ru_retell | 31845 |
| ru_retell_4 | ru_retell.md | 2024-03-31 12:06:42.924248 | 4 | 193 | 892 | 4 | Часть 1 | Глава 5: Сон Раскольникова | Часть 1 / Глава 5: Сон Раскольникова | ru_retell | 42039 |
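The four build cells above differ only in their paths and in the `file_id` prefix, so they could be driven from one table. The sketch below only assembles that table and the factories; `make_guid_factory` and `CORPORA` are hypothetical names, and the actual `CorpusBuilder` call is left as a comment because it is copied from the cells above rather than re-verified here.

```python
from pathlib import Path


def make_guid_factory(prefix):
    # Binding the prefix through a function argument avoids the late-binding
    # pitfall of writing `lambda index: f'{prefix}_{index}'` directly in a loop.
    return lambda index: f'{prefix}_{index}'


# (zip file to create, folder with the md sources, file_id prefix),
# all copied from the cells above.
CORPORA = [
    (Path('./files/eng_crime_and_puhishment.base.zip'), Path('./source/book/eng'), 'eng_book'),
    (Path('./files/eng_retell.base.zip'), Path('./source/retell/eng'), 'eng_retell'),
    (Path('./files/ru_crime_and_puhishment.base.zip'), Path('./source/book/ru'), 'ru_book'),
    (Path('./files/ru_retell.base.zip'), Path('./source/retell/ru'), 'ru_retell'),
]

factories = [make_guid_factory(prefix) for _, _, prefix in CORPORA]

for (zip_path, source_folder, prefix), factory in zip(CORPORA, factories):
    # In the notebook, this is where one would call:
    # CorpusBuilder.convert_interformat_folder_to_corpus(
    #     zip_path, source_folder, ['book'], custom_guid_factory=factory)
    print(zip_path.name, '<-', source_folder, '| first id:', factory(0))
```

Each factory keeps its own prefix, so the identifiers of the four corpora never collide.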


Next, we form the parallel corpus.

In [9]:
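import pandas as pd

def add_relation(df_1, df_2, name_1, name_2):
    # df_1 and df_2 are the file_id indexes of two corpora; files are paired by position.
    rel_1 = pd.DataFrame({'file_1': df_1, 'file_2': df_2, 'relation_name': f"{name_1}_{name_2}"})
    # The reverse relation, so that the link can be followed in both directions.
    rel_2 = pd.DataFrame({'file_1': df_2, 'file_2': df_1, 'relation_name': f"{name_2}_{name_1}"})
    return pd.concat([rel_1, rel_2])

def add_dfs(reader):
    # Map each file_id from the table of contents to the frame of that file.
    frames = list(reader.get_frames())
    return dict(zip(reader.get_toc().index, frames))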
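To make the shape of the relation table concrete, here is a plain-Python picture of the rows that `add_relation` produces. `relation_rows` is a hypothetical helper written for this illustration, not part of `grammar_ru`; it uses tuples instead of a DataFrame, and the explicit length check spells out the assumption that both corpora contain the same number of parts in the same order.

```python
def relation_rows(ids_1, ids_2, name_1, name_2):
    # Hypothetical stand-in for add_relation: returns
    # (file_1, file_2, relation_name) tuples instead of a DataFrame.
    ids_1, ids_2 = list(ids_1), list(ids_2)
    if len(ids_1) != len(ids_2):
        raise ValueError('both corpora must have the same number of parts')
    # Forward rows: part i of the first corpus -> part i of the second.
    forward = [(a, b, f'{name_1}_{name_2}') for a, b in zip(ids_1, ids_2)]
    # Backward rows: the same pairs with the roles swapped.
    backward = [(b, a, f'{name_2}_{name_1}') for a, b in zip(ids_1, ids_2)]
    return forward + backward


rows = relation_rows(
    ['ru_book_0', 'ru_book_1'],
    ['ru_retell_0', 'ru_retell_1'],
    'ru_book',
    'ru_retell',
)
for row in rows:
    print(row)
```

Because files are matched purely by position, a retelling with a missing or extra chapter would pair the wrong parts, so it is worth comparing the lengths of the two tables of contents before building relations.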

In [10]:
CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_book_reader),
    "ru_book",
    None)

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_retell_reader),
    "ru_retell",
    add_relation(ru_book_reader.get_toc().index,ru_retell_reader.get_toc().index,"ru_book","ru_retell"))

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_book_reader),
    "eng_book",
    add_relation(ru_book_reader.get_toc().index,eng_book_reader.get_toc().index,"ru_book","eng_book"))


CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_retell_reader),
    "eng_retell",
    add_relation(eng_book_reader.get_toc().index,eng_retell_reader.get_toc().index,"eng_book","eng_retell"))
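The calls above register three two-way relations: `ru_book` with `ru_retell`, `ru_book` with `eng_book`, and `eng_book` with `eng_retell`. The sketch below treats them as an undirected graph and finds the chain that connects two subcorpora. It only illustrates how the subcorpora are connected; whether the library itself follows such chains is not shown in this demo, and `build_graph` and `shortest_chain` are hypothetical helpers.

```python
from collections import deque

# Pairs of subcorpora linked by the add_relation calls above.
RELATIONS = [
    ('ru_book', 'ru_retell'),
    ('ru_book', 'eng_book'),
    ('eng_book', 'eng_retell'),
]


def build_graph(pairs):
    # Each relation is stored in both directions, as add_relation does.
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_chain(graph, start, goal):
    # Breadth-first search over subcorpus names.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(RELATIONS)
chain = shortest_chain(graph, 'ru_retell', 'eng_retell')
print(' -> '.join(chain))
```

The two retellings are not linked directly; they are connected only through the two book subcorpora.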



We check that all parts have been recorded successfully.

In [11]:
reader = CorpusReader(Path('./files/parallel_corpus.zip'))

df = reader.get_toc()

df.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'eng_book', 'eng_retell'], dtype=object)

## Translate text

In [12]:
path_parallel_corpus = Path('./files/parallel_corpus.zip')

In [13]:
from ca.utils_translate import translate_subcorpus

translate_subcorpus(path_parallel_corpus,"eng_retell")


In [14]:
reader = CorpusReader(Path('./files/parallel_corpus.zip'))

df = reader.get_toc()

df.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'eng_book', 'eng_retell', 'ru_translate'],
      dtype=object)
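Comparing the two listings of `subcorpus_name` shows what the translation step added. The sketch below does that comparison on the values copied from the outputs of cells 11 and 14; `before`, `after` and `added` are illustrative names.

```python
# Subcorpus names before and after translate_subcorpus, copied from the outputs above.
before = ['ru_book', 'ru_retell', 'eng_book', 'eng_retell']
after = ['ru_book', 'ru_retell', 'eng_book', 'eng_retell', 'ru_translate']

# Keep the original order while selecting only the new names.
added = [name for name in after if name not in before]
print(added)
```

In this run the only new subcorpus is `ru_translate`, which holds the translation of `eng_retell`.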

### Remove corpora

In [15]:
import os
from pathlib import Path

os.remove(Path('./files/eng_crime_and_puhishment.base.zip'))
os.remove(Path('./files/eng_retell.base.zip'))
os.remove(Path('./files/ru_crime_and_puhishment.base.zip'))
os.remove(Path('./files/ru_retell.base.zip'))
os.remove(Path('./files/translate.base.zip'))
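`os.remove` raises `FileNotFoundError` when a file is already gone, so re-running the cell above fails on its first line. A minimal sketch of a cleanup that can be run repeatedly, using only `pathlib`, is shown below; `remove_if_present` and `INTERMEDIATE_FILES` are hypothetical names, and the paths are the ones from the cell above.

```python
from pathlib import Path

# Intermediate archives listed in the cleanup cell above.
INTERMEDIATE_FILES = [
    Path('./files/eng_crime_and_puhishment.base.zip'),
    Path('./files/eng_retell.base.zip'),
    Path('./files/ru_crime_and_puhishment.base.zip'),
    Path('./files/ru_retell.base.zip'),
    Path('./files/translate.base.zip'),
]


def remove_if_present(paths):
    # Delete each existing file and report which ones were actually removed.
    removed = []
    for path in paths:
        path = Path(path)
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed


removed = remove_if_present(INTERMEDIATE_FILES)
print(f'removed {len(removed)} file(s)')
```

The parallel corpus itself (`./files/parallel_corpus.zip`) is deliberately not in the list, since it is the result of the demo.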