# Creative Articulator 

## Overview

This demo walks through the main processes in Creative Articulator, namely:
 - the creation of a parallel corpus
 - the division of text into fragments
 - the translation of text.

## Parallel corpus

A parallel corpus is a handy tool that lets you link corresponding parts of different corpora. With it, all the corpora are stored together in a single zip file.

In this demonstration, we will work with the text and retellings of Fyodor Mikhailovich Dostoevsky's novel Crime and Punishment, which are located in the `source` folder.


The first step is to build a corpus from each text and each retelling; the sources are stored in Markdown (`.md`) format.
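To make the relationship between a Markdown source and the table of contents shown later more concrete, here is a minimal sketch of how a file could be split into one part per innermost heading. It is not the `grammar_ru` implementation: the function `split_markdown`, the `#`/`##` heading layout, and the sample text are assumptions made for illustration only.

```python
import re

# Matches ATX-style Markdown headings such as "# Part One" or "## I".
HEADING = re.compile(r'^(#{1,6})\s+(.*\S)\s*$')


def split_markdown(text):
    """Split Markdown text into parts, one per innermost heading (hypothetical helper)."""
    parts = []
    headers = {}  # heading level -> current title at that level
    body = []     # lines collected since the last heading

    def flush():
        content = '\n'.join(body).strip()
        if content:
            titles = [headers[level] for level in sorted(headers)]
            parts.append({'headers': ' / '.join(titles), 'text': content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes every heading at the same or a deeper level.
            for stale in [lvl for lvl in headers if lvl >= level]:
                del headers[stale]
            headers[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return parts


sample = """# Part One
## I
First chapter text.
## II
Second chapter text.
"""

parts = split_markdown(sample)
for part in parts:
    print(part['headers'])
```

The joined titles mirror the `headers` column in the tables of contents below (for example `Part One / I`), although the real builder also records token counts, ids, and timestamps.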

#### Corpus for English text:

In [1]:
from grammar_ru.corpus import CorpusBuilder
from pathlib import Path


CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/eng_crime_and_puhishment.base.zip'),
    Path('./source/book/eng'),
    ['book']
)


  0%|          | 0/1 [00:00<?, ?it/s]

In [2]:
from grammar_ru.corpus import CorpusReader


eng_book_reader = CorpusReader(Path('./files/eng_crime_and_puhishment.base.zip'))
eng_book = eng_book_reader.get_toc().index

eng_book_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 25107a73-3506-42f4-8cd4-25cf70535747 | eng_crime_and_punishment.md | 2024-03-30 15:49:43.545227 | 0 | 4028 | 15089 | 0 | Part One | I | Part One / I | eng_crime_and_punishment | 4029 |
| 5c009685-0056-421a-aad4-f6c579e3c2d0 | eng_crime_and_punishment.md | 2024-03-30 15:49:43.558227 | 1 | 8965 | 32957 | 1 | Part One | II | Part One / II | eng_crime_and_punishment | 22995 |
| 9a5bc191-da76-46de-bdbd-c6cdd16206fa | eng_crime_and_punishment.md | 2024-03-30 15:49:43.579227 | 2 | 6439 | 24251 | 2 | Part One | III | Part One / III | eng_crime_and_punishment | 39435 |
| ed93829c-6110-4d26-b76c-85f2dbd18194 | eng_crime_and_punishment.md | 2024-03-30 15:49:43.596227 | 3 | 6473 | 23395 | 3 | Part One | IV | Part One / IV | eng_crime_and_punishment | 55909 |
| 268d3381-533c-4aa2-9d57-7715aa293e29 | eng_crime_and_punishment.md | 2024-03-30 15:49:43.613226 | 4 | 5342 | 19413 | 4 | Part One | V | Part One / V | eng_crime_and_punishment | 71252 |


#### Corpus for English retelling:

In [3]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/eng_retell.base.zip'),
    Path('./source/retell/eng'),
    ['book']
)

  0%|          | 0/1 [00:00<?, ?it/s]

In [4]:
eng_retell_reader = CorpusReader(Path('./files/eng_retell.base.zip'))
eng_retell = eng_retell_reader.get_toc().index
eng_retell_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fa5a2451-7eb8-4872-80b7-0cb5b0b02d18 | eng_retell.md | 2024-03-30 15:49:46.614792 | 0 | 776 | 3295 | 0 | Part I | Chapter 1 | Part I / Chapter 1 | eng_retell | 777 |
| 2ee63fc9-4a07-46c4-b3b4-c131199a5da9 | eng_retell.md | 2024-03-30 15:49:46.618792 | 1 | 1191 | 5178 | 1 | Part I | Chapter 2 | Part I / Chapter 2 | eng_retell | 11969 |
| 04882510-a03b-4414-b44c-5f93469e7641 | eng_retell.md | 2024-03-30 15:49:46.634792 | 2 | 1042 | 4464 | 2 | Part I | Chapter 3 | Part I / Chapter 3 | eng_retell | 23012 |
| e241242c-60dd-4ba8-9ac5-e5e3d6155897 | eng_retell.md | 2024-03-30 15:49:46.649792 | 3 | 1098 | 4711 | 3 | Part I | Chapter 4 | Part I / Chapter 4 | eng_retell | 34111 |
| ef0722e1-f963-4f2b-9565-53812544ada6 | eng_retell.md | 2024-03-30 15:49:46.664792 | 4 | 954 | 3932 | 4 | Part I | Chapter 5 | Part I / Chapter 5 | eng_retell | 45066 |


#### Corpus for Russian text:

In [5]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_crime_and_puhishment.base.zip'),
    Path('./source/book/ru'),
    ['book']
)

  0%|          | 0/1 [00:00<?, ?it/s]

In [6]:
ru_book_reader = CorpusReader(Path('./files/ru_crime_and_puhishment.base.zip'))
ru_book = ru_book_reader.get_toc().index
ru_book_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8ce8c632-7e8f-40a3-850d-31924bf42ca5 | ru_crime_and_punishment.md | 2024-03-30 15:49:55.816047 | 0 | 3243 | 13794 | 0 | ЧАСТЬ ПЕРВАЯ | I | ЧАСТЬ ПЕРВАЯ / I | ru_crime_and_punishment | 3244 |
| 4c1b3705-7199-49a7-b7d8-44a2e7a7c37a | ru_crime_and_punishment.md | 2024-03-30 15:49:55.823047 | 1 | 7122 | 29360 | 1 | ЧАСТЬ ПЕРВАЯ | II | ЧАСТЬ ПЕРВАЯ / II | ru_crime_and_punishment | 20367 |
| c1f49617-5911-4bd4-8258-90c49f3d1cdc | ru_crime_and_punishment.md | 2024-03-30 15:49:55.853120 | 2 | 5526 | 22721 | 2 | ЧАСТЬ ПЕРВАЯ | III | ЧАСТЬ ПЕРВАЯ / III | ru_crime_and_punishment | 35894 |
| cc29bddf-1dd9-49a3-bc8d-7a323cc3f59c | ru_crime_and_punishment.md | 2024-03-30 15:49:55.880047 | 3 | 5231 | 21071 | 3 | ЧАСТЬ ПЕРВАЯ | IV | ЧАСТЬ ПЕРВАЯ / IV | ru_crime_and_punishment | 51126 |
| 5e98b7f6-c134-4eee-bc45-df0ecf8a9459 | ru_crime_and_punishment.md | 2024-03-30 15:49:55.904048 | 4 | 4285 | 17322 | 4 | ЧАСТЬ ПЕРВАЯ | V | ЧАСТЬ ПЕРВАЯ / V | ru_crime_and_punishment | 65412 |


#### Corpus for Russian retelling:

In [7]:
CorpusBuilder.convert_interformat_folder_to_corpus(
    Path('./files/ru_retell.base.zip'),
    Path('./source/retell/ru'),
    ['book']
)

  0%|          | 0/1 [00:00<?, ?it/s]

In [8]:
ru_retell_reader = CorpusReader(Path('./files/ru_retell.base.zip'))
ru_retell = ru_retell_reader.get_toc().index
ru_retell_reader.get_toc()[:5]

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | header_1 | headers | book | max_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 03e9e420-c07c-447f-933e-8c22203b3293 | ru_retell.md | 2024-03-30 15:49:57.499147 | 0 | 396 | 1814 | 0 | Часть 1 | Глава 1: Знакомство с Родионом | Часть 1 / Глава 1: Знакомство с Родионом | ru_retell | 397 |
| 0cac389e-95a6-40de-bd74-0cf8c65b1f43 | ru_retell.md | 2024-03-30 15:49:57.503147 | 1 | 704 | 3090 | 1 | Часть 1 | Глава 2: Встреча с Мармеладовым | Часть 1 / Глава 2: Встреча с Мармеладовым | ru_retell | 11102 |
| 82477b01-06dd-4d57-a9f8-c1c703ac0320 | ru_retell.md | 2024-03-30 15:49:57.517147 | 2 | 280 | 1330 | 2 | Часть 1 | Глава 3: Письмо матери | Часть 1 / Глава 3: Письмо матери | ru_retell | 21383 |
| 2061975f-e7f4-4b2a-9674-854cb68b418c | ru_retell.md | 2024-03-30 15:49:57.530147 | 3 | 461 | 2020 | 3 | Часть 1 | Глава 4: Мысли о сестре | Часть 1 / Глава 4: Мысли о сестре | ru_retell | 31845 |
| a8cb9f27-3fec-41dd-9bd8-e44f3dad31e6 | ru_retell.md | 2024-03-30 15:49:57.544147 | 4 | 193 | 892 | 4 | Часть 1 | Глава 5: Сон Раскольникова | Часть 1 / Глава 5: Сон Раскольникова | ru_retell | 42039 |


Next, we form the parallel corpus:

In [9]:
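import pandas as pd


def add_relation(df_1, df_2, name_1, name_2):
    # Link two sequences of file ids in both directions.
    # The i-th id of df_1 is paired with the i-th id of df_2,
    # so both inputs must have the same length and order.
    rel_1 = pd.DataFrame({'file_1': df_1, 'file_2': df_2, 'relation_name': f"{name_1}_{name_2}"})
    rel_2 = pd.DataFrame({'file_1': df_2, 'file_2': df_1, 'relation_name': f"{name_2}_{name_1}"})
    return pd.concat([rel_1, rel_2])


def add_dfs(reader):
    # Map each file id from the reader's table of contents to its frame.
    # This relies on get_frames() yielding frames in the same order as the TOC.
    frames = list(reader.get_frames())
    return dict(zip(reader.get_toc().index, frames))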
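To see what the relation table looks like, the sketch below calls `add_relation` on two short lists. The helper is repeated so the snippet runs on its own, and the file ids (`book-ch1`, `retell-ch1`, and so on) are made-up placeholders, not ids from the real corpus.

```python
import pandas as pd


def add_relation(df_1, df_2, name_1, name_2):
    # Same helper as in the notebook: pair ids by position, in both directions.
    rel_1 = pd.DataFrame({'file_1': df_1, 'file_2': df_2, 'relation_name': f"{name_1}_{name_2}"})
    rel_2 = pd.DataFrame({'file_1': df_2, 'file_2': df_1, 'relation_name': f"{name_2}_{name_1}"})
    return pd.concat([rel_1, rel_2])


# Hypothetical file ids standing in for reader.get_toc().index
book_ids = ['book-ch1', 'book-ch2']
retell_ids = ['retell-ch1', 'retell-ch2']

relations = add_relation(book_ids, retell_ids, 'ru_book', 'ru_retell')
print(relations)
```

Each pair appears twice, once per direction, so a chapter of the book can be looked up from its retelling and vice versa. Because the ids are paired by position, the two inputs need the same number of parts in the same order.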

In [10]:
CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_book_reader),
    "ru_book",
    None)

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_retell_reader),
    "ru_retell",
    add_relation(ru_book_reader.get_toc().index,ru_retell_reader.get_toc().index,"ru_book","ru_retell"))

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_book_reader),
    "eng_book",
    add_relation(ru_book_reader.get_toc().index,eng_book_reader.get_toc().index,"ru_book","eng_book"))


CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_retell_reader),
    "eng_retell",
    add_relation(eng_book_reader.get_toc().index,eng_retell_reader.get_toc().index,"eng_book","eng_retell"))



We then check that all subcorpora have been recorded successfully:

In [11]:
reader = CorpusReader(Path('./files/parallel_corpus.zip'))

df = reader.get_toc()

df.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'eng_book', 'eng_retell'], dtype=object)
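Instead of inspecting the array by eye, the same check could be automated. The following is a small sketch: `missing_subcorpora` is a hypothetical helper, and the `found` list stands in for the result of `df.subcorpus_name.unique()`.

```python
def missing_subcorpora(found, expected):
    """Return the expected subcorpus names that are absent from `found`, in order."""
    found = set(found)
    return [name for name in expected if name not in found]


expected = ['ru_book', 'ru_retell', 'eng_book', 'eng_retell']

# Stand-in for df.subcorpus_name.unique(); one subcorpus is deliberately left out.
found = ['ru_book', 'ru_retell', 'eng_book']

print(missing_subcorpora(found, expected))  # ['eng_retell']
```

An empty result means every expected subcorpus was written to the parallel corpus.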

## Translate text

In [12]:
path_parallel_corpus = Path('./files/parallel_corpus.zip')

In [13]:
from ca.utils_translate import translate_subcorpus

translate_subcorpus(path_parallel_corpus,"eng_retell")

  0%|          | 0/1 [00:00<?, ?it/s]

In [14]:
reader = CorpusReader(Path('./files/parallel_corpus.zip'))

df = reader.get_toc()

df.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'eng_book', 'eng_retell', 'ru_translate'],
      dtype=object)

### Remove corpora

In [15]:
import os
from pathlib import Path

os.remove(Path('./files/eng_crime_and_puhishment.base.zip'))
os.remove(Path('./files/eng_retell.base.zip'))
os.remove(Path('./files/ru_crime_and_puhishment.base.zip'))
os.remove(Path('./files/ru_retell.base.zip'))
os.remove(Path('./files/translate.base.zip'))
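Note that `os.remove` raises `FileNotFoundError` when a file is already gone, for example when this cell is run a second time. A more forgiving variant might look like the sketch below; `remove_files` is a hypothetical helper, and the demo runs in a temporary directory rather than in `./files`.

```python
import tempfile
from pathlib import Path


def remove_files(paths):
    """Delete the files that exist and return the paths that were removed."""
    removed = []
    for path in map(Path, paths):
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed


# Demo in a throwaway directory: one file exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / 'eng_retell.base.zip'
    existing.touch()
    missing = Path(tmp) / 'translate.base.zip'

    removed = remove_files([existing, missing])
    print([path.name for path in removed])
```

Returning the removed paths makes it easy to see which intermediate archives were actually present.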