# Text fragmentation

## Overview
This demo walks through text fragmentation and post-processing of model retells, then assembles the results into a parallel corpus:
- text fragmentation with localization (English, Russian)
- post-processing of model retells
- translating English retells into Russian
- writing the relevant data to the parallel corpus

#### Text fragmentation with localization (English, Russian)
`FragmentsBuilder` parses the text of a book corpus into a JSON file that contains:
- the prompt
- text fragments limited to a preset number of words
- metadata about each text frame (such as the first and last word IDs in the book corpus, and the chapter).
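
The fragment layout described above can be illustrated with a minimal sketch. This is not the `FragmentsBuilder` implementation: the function `split_into_fragments`, the field names, and the prompt template are hypothetical, and the sketch cuts on a plain word count without regard to sentence or chapter boundaries, which the real builder may respect.

```python
from typing import Dict, List


def split_into_fragments(words: List[str], words_limit: int, prompt: str = "{}") -> List[Dict]:
    """Cut a word list into consecutive fragments of at most `words_limit` words.

    Hypothetical helper: field names only mirror the kind of data the text
    describes (prompt, fragment text, first and last word ids).
    """
    if words_limit < 1:
        raise ValueError("words_limit must be at least 1")
    fragments = []
    for start in range(0, len(words), words_limit):
        chunk = words[start:start + words_limit]
        text = " ".join(chunk)
        fragments.append({
            "prompt": prompt.format(text),          # template with one placeholder
            "text": text,
            "first_word_id": start,                 # position of the first word
            "last_word_id": start + len(chunk) - 1  # position of the last word
        })
    return fragments


words = "On an exceptionally hot evening early in July".split()
fragments = split_into_fragments(words, words_limit=3, prompt="Retell briefly: {}")
for fragment in fragments:
    print(fragment)
```

The last fragment is shorter than the limit when the word count is not a multiple of it.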

In [1]:
from grammar_ru.corpus import CorpusBuilder, CorpusReader, ParallelCorpus
from pathlib import Path

parallel_corpus = ParallelCorpus(Path('files/parallel_corpus.zip'))
toc = parallel_corpus.get_toc()
toc.subcorpus_name.unique()

array(['ru_book', 'ru_retell', 'eng_book', 'eng_retell', 'ru_translate'],
      dtype=object)

In [2]:
eng_corpus = parallel_corpus.eng_book

Fragment the English and Russian books.

In [3]:
from ca.book_fragments.fragments_builder import FragmentsBuilder
from ca.book_fragments.localizators.ru_localizator import RussianLocalizator
from ca.book_fragments.localizators.eng_localizator import EnglishLocalizator

In [None]:
eng_fragments_builder = FragmentsBuilder(
    eng_corpus, 
    output_path='./files/fragments', 
    file_name="eng_crime_and_punishment_fragments", 
    localizator=EnglishLocalizator()
)

eng_fragments_builder.construct_fragments_json()

Processing frame with id: d4d7834a-5476-4c35-b450-f7320e92607d, 16/123

In [None]:
ru_corpus = Path('./files/ru_crime_and_puhishment.base.zip')

ru_fragments_builder = FragmentsBuilder(
    ru_corpus, 
    output_path='./files/fragments', 
    file_name="ru_crime_and_punishment_fragments", 
    localizator=RussianLocalizator(),
    prompt='{}'
)

ru_fragments_builder.construct_fragments_json()

#### Model retells post-processing

The model returns a JSON file with retells and some log information. The retells are then cleaned and prettified,
and the prepared texts are parsed into the corresponding corpora.
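
What "cleaned and prettified" involves is not specified here. As a rough illustration only, the sketch below assumes that cleanup means normalizing whitespace and removing stray spaces before punctuation. The function `clean_retell` is hypothetical and is not part of `parse_retells_to_corpus`.

```python
import re


def clean_retell(text: str) -> str:
    """Normalize a raw retell string (assumed cleanup rules, see lead-in)."""
    # Collapse every run of whitespace, including newlines, into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop spaces that ended up before punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text


raw = "  Raskolnikov   walks .\n\n He thinks , then leaves .  "
cleaned = clean_retell(raw)
print(cleaned)
```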

In [None]:
from tg.ca.book_fragments.utils.parse_retells_to_corpus import parse_retells_to_corpus

In [None]:
parse_retells_to_corpus(
    Path('./files/fragments/eng_crime_and_punishment_fragments.json'),
    Path('./source/retell/eng/eng_crime_and_punishment_fragments.json'),
    Path('./files/eng_crime_and_punishment_retell.base.zip'),
)

In [None]:
parse_retells_to_corpus(
    Path('./files/fragments/ru_crime_and_punishment_fragments.json'),
    Path('./source/retell/ru/ru_crime_and_punishment_fragments.json'),
    Path('./files/ru_crime_and_punishment_retell.base.zip'),
)

In [None]:
ru_retell_reader = CorpusReader(Path('./files/ru_crime_and_punishment_retell.base.zip'))
ru_retell = ru_retell_reader.get_toc().index
ru_retell_reader.get_frames().first()

In [None]:
eng_retell_reader = CorpusReader(Path('./files/eng_crime_and_punishment_retell.base.zip'))
eng_retell = eng_retell_reader.get_toc().index
eng_retell_reader.get_toc()

Helper functions for assembling the parallel corpus.

In [None]:
import pandas as pd

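def add_relation(df_1, df_2, name_1, name_2):
    # Build a symmetric relation table: one row per pair in each direction.
    # df_1 and df_2 are file-id sequences of equal length, paired by position.
    rel_1 = pd.DataFrame({'file_1': df_1, 'file_2': df_2, 'relation_name': f"{name_1}_{name_2}"})
    rel_2 = pd.DataFrame({'file_1': df_2, 'file_2': df_1, 'relation_name': f"{name_2}_{name_1}"})
    return pd.concat([rel_1, rel_2])

def add_dfs(reader):
    # Map each file id from the reader's table of contents to its frame.
    frames = list(reader.get_frames())
    return dict(zip(reader.get_toc().index, frames))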
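
To see what `add_relation` produces, the sketch below calls it on two short lists of made-up file ids (the ids `book-0`, `retell-0`, and so on are placeholders, not real corpus ids). The function is repeated so the example runs on its own.

```python
import pandas as pd


def add_relation(df_1, df_2, name_1, name_2):
    # Same logic as the notebook helper: pair ids by position, in both directions.
    rel_1 = pd.DataFrame({'file_1': df_1, 'file_2': df_2, 'relation_name': f"{name_1}_{name_2}"})
    rel_2 = pd.DataFrame({'file_1': df_2, 'file_2': df_1, 'relation_name': f"{name_2}_{name_1}"})
    return pd.concat([rel_1, rel_2])


# Placeholder ids standing in for the indices of two tables of contents.
book_ids = ['book-0', 'book-1']
retell_ids = ['retell-0', 'retell-1']

relations = add_relation(book_ids, retell_ids, 'eng_book', 'eng_retell')
print(relations)
```

Each pair appears twice, once per direction, so two book/retell pairs yield four rows. The two id sequences must have the same length, otherwise `pd.DataFrame` raises a `ValueError`.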

Record the English and Russian book corpora.

In [None]:
eng_book_reader = CorpusReader(Path('./files/eng_crime_and_puhishment.base.zip'))
eng_book_reader.get_toc()

In [None]:
ru_book_reader = CorpusReader(Path('./files/ru_crime_and_puhishment.base.zip'))
ru_book_reader.get_toc()

In [None]:
CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_book_reader),
    "eng_book",
    None)

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_book_reader),
    "ru_book",
    None)

Record the English and Russian retells and add their relations to the books.

In [None]:
eng_retell_reader = CorpusReader(Path('./files/eng_crime_and_punishment_retell.base.zip'))
eng_retell_reader.get_toc()

In [None]:
ru_retell_reader = CorpusReader(Path('./files/ru_crime_and_punishment_retell.base.zip'))
ru_retell_reader.get_toc()

In [None]:
CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(eng_retell_reader),
    "eng_retell",
    add_relation(eng_book_reader.get_toc().index, eng_retell_reader.get_toc().index, "eng_book", "eng_retell"))

CorpusBuilder.update_parallel_data(
    Path('./files/parallel_corpus.zip'),
    add_dfs(ru_retell_reader),
    "ru_retell",
    add_relation(ru_book_reader.get_toc().index, ru_retell_reader.get_toc().index, "ru_book", "ru_retell"))

In [None]:
path_parallel_corpus = Path('./files/parallel_corpus.zip')

In [None]:
!pip install googletrans==3.1.0a0

Add the translated English retells.

In [None]:
from tg.ca.utils_translate import translate_subcorpus

translate_subcorpus(path_parallel_corpus,"eng_retell")

Finally, show the contents of the parallel corpus.

In [None]:
reader = CorpusReader(path_parallel_corpus)
df = reader.get_toc()
df.subcorpus_name.unique()

Clean up the intermediate corpora.

In [None]:
import os
from pathlib import Path

# os.remove(Path('./files/eng_crime_and_puhishment.base.zip'))
# os.remove(Path('./files/ru_crime_and_puhishment.base.zip'))
os.remove(Path('./files/eng_crime_and_punishment_retell.base.zip'))
os.remove(Path('./files/ru_crime_and_punishment_retell.base.zip'))
os.remove(Path('./files/translate.base.zip'))
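
`os.remove` raises `FileNotFoundError` when a target file is missing, for example when the cleanup cell is run twice. A more forgiving variant could look like the sketch below. The helper `remove_if_exists` is hypothetical, and the demo runs in a temporary directory so it does not touch the notebook's files.

```python
import tempfile
from pathlib import Path


def remove_if_exists(paths):
    """Delete each existing file in `paths` and return the paths removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed


# Demo in a throwaway directory: one file exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "eng_retell.base.zip"
    existing.write_bytes(b"")
    missing = Path(tmp) / "translate.base.zip"
    removed = remove_if_exists([existing, missing])
    removed_names = [p.name for p in removed]

print(removed_names)
```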