# Data Formats

In this research we use a tabular representation of the words in a text. This differs from, e.g., Hugging Face Transformers models, where words are usually decomposed into sequences of tokens. The token approach would give us far more information than we need for our tasks, and we would have to work with large datasets to avoid overfitting. Instead, we want to provide less precise information about each word, such as Pymorphy or Slovnet outputs. Hence we need to organize this per-word data into tables.

In this demo we show a set of classes that encapsulate the tedious preprocessing of text, such as splitting it into sentences and words and running well-known tools on it. All of this happens under the hood, so after a few calls we get the data in tabular format and can dive directly into the interesting tasks instead of the peculiarities of NLP.

## Separator

Separator is the class that processes the text and produces a dataframe.

In [1]:
from grammar_ru import Separator

text = 'Маги только думали, что могут контролировать информационную среду, но на самом деле все происходило по таким же биологическим законам, по которым рыбы в океане выбирают, куда им плыть. Это не люди выстраивали картину мира, а картина мира выстраивала себя через них. Бесполезно было искать виноватых.'
df = Separator.separate_string(text)
df.head()

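| | word_id | sentence_id | word_index | paragraph_id | word_tail | word | word_type | word_length |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | Маги | ru | 4 |
| 1 | 1 | 0 | 1 | 0 | 1 | только | ru | 6 |
| 2 | 2 | 0 | 2 | 0 | 0 | думали | ru | 6 |
| 3 | 3 | 0 | 3 | 0 | 1 | , | punct | 1 |
| 4 | 4 | 0 | 4 | 0 | 1 | что | ru | 3 |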


Column names are mostly self-explanatory.
* `word_tail` is the number of spaces that followed the word in the original text.
* `word_id`, `sentence_id`, `paragraph_id` must be unique in the whole corpus.
* `word_index` is the position of the word inside the sentence.
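
To make the role of `word_tail` concrete, here is a minimal sketch that glues words back into text using only `pandas`. The frame is built by hand to mimic the output above, and `restore_text` is a hypothetical helper, not part of `grammar_ru`.

```python
import pandas as pd

# Hand-built frame that mimics the first rows of the Separator output.
df = pd.DataFrame({
    'word_id': [0, 1, 2, 3, 4],
    'sentence_id': [0, 0, 0, 0, 0],
    'word_index': [0, 1, 2, 3, 4],
    'word_tail': [1, 1, 0, 1, 1],
    'word': ['Маги', 'только', 'думали', ',', 'что'],
})

def restore_text(frame: pd.DataFrame) -> str:
    """Hypothetical helper: append `word_tail` spaces after each word and join."""
    tails = frame['word_tail'].map(lambda n: ' ' * n)
    return ''.join(frame['word'] + tails).rstrip()

restored = restore_text(df)
print(restored)
```

Because the comma has its own row and the preceding word has `word_tail` equal to 0, punctuation lands right after the word without an extra space.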

Once separated, the text can be viewed with `Separator.Viewer`:

In [2]:
Separator.Viewer().to_text(df.loc[df.sentence_id==2])

'Бесполезно было искать виноватых.'

## Bundle

Bundle is a set of named dataframes, typically describing the same text.

Separator can run featurizers on the text, placing the output of each featurizer into one or several dataframes in the bundle:

In [4]:
from grammar_ru.features import PyMorphyFeaturizer, SlovnetFeaturizer

db = Separator.build_bundle(text, [PyMorphyFeaturizer(), SlovnetFeaturizer()])

In [5]:
list(db.data_frames)

['src', 'pymorphy', 'slovnet']

In [6]:
db.data_frames['src'].head()

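| | word_id | sentence_id | word_index | paragraph_id | word_tail | word | word_type | word_length |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | Маги | ru | 4 |
| 1 | 1 | 0 | 1 | 0 | 1 | только | ru | 6 |
| 2 | 2 | 0 | 2 | 0 | 0 | думали | ru | 6 |
| 3 | 3 | 0 | 3 | 0 | 1 | , | punct | 1 |
| 4 | 4 | 0 | 4 | 0 | 1 | что | ru | 3 |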


The bundle also provides **read-only** access to the dataframes as attributes:

In [7]:
db.src.head()

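| | word_id | sentence_id | word_index | paragraph_id | word_tail | word | word_type | word_length |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | Маги | ru | 4 |
| 1 | 1 | 0 | 1 | 0 | 1 | только | ru | 6 |
| 2 | 2 | 0 | 2 | 0 | 0 | думали | ru | 6 |
| 3 | 3 | 0 | 3 | 0 | 1 | , | punct | 1 |
| 4 | 4 | 0 | 4 | 0 | 1 | что | ru | 3 |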


In [8]:
db.pymorphy.head()

| word_id | normal_form | alternatives | score | delta_score | POS | animacy | gender | number | case | aspect | transitivity | person | tense | mood | voice | involvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | маг | 1 | 1.0 | 1.0 | NOUN | anim | masc | plur | nomn | | | | | | | |
| 1 | только | 3 | 0.5 | 0.25 | ADVB | | | | | | | | | | | |
| 2 | думать | 2 | 0.5 | 0.0 | VERB | | | plur | | impf | intr | | past | indc | | |
| 3 | , | 1 | 1.0 | 1.0 | NONE | | | | | | | | | | | |
| 4 | что | 5 | 0.922033 | 0.891525 | CONJ | | | | | | | | | | | |


In [9]:
db.slovnet.head()

| word_id | POS | Animacy | Case | Gender | Number | Aspect | Mood | Tense | VerbForm | Voice | Person | Degree | Polarity | relation | syntax_parent_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NOUN | Anim | Nom | Fem | Plur | | | | | | | | | nsubj | 2 |
| 1 | PART | | | | | | | | | | | | | advmod | 2 |
| 2 | VERB | | | | Plur | Imp | Ind | Past | Fin | Act | | | | root | -1 |
| 3 | PUNCT | | | | | | | | | | | | | punct | 5 |
| 4 | SCONJ | | | | | | | | | | | | | mark | 5 |


Once the text is featurized, basic statistical research can be performed with the `pandas` library.
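
As a minimal sketch of such an analysis, the snippet below counts parts of speech and computes the average word length. The two frames are hand-built stand-ins for `db.src` and `db.pymorphy`, reduced to the first five words and a few columns, so the numbers are only illustrative.

```python
import pandas as pd

# Hand-built stand-ins for db.src and db.pymorphy (first five words only).
src = pd.DataFrame({
    'word_id': [0, 1, 2, 3, 4],
    'word': ['Маги', 'только', 'думали', ',', 'что'],
    'word_type': ['ru', 'ru', 'ru', 'punct', 'ru'],
    'word_length': [4, 6, 6, 1, 3],
})
pymorphy = pd.DataFrame(
    {'POS': ['NOUN', 'ADVB', 'VERB', 'NONE', 'CONJ']},
    index=pd.Index([0, 1, 2, 3, 4], name='word_id'),
)

# Attach the part of speech to every word, then drop punctuation.
words = src.merge(pymorphy[['POS']], left_on='word_id', right_index=True)
words = words.loc[words.word_type == 'ru']

# Frequency of each part of speech and the average word length.
pos_counts = words['POS'].value_counts()
mean_length = words['word_length'].mean()
print(pos_counts)
print(mean_length)
```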

`Separator.Viewer` can additionally be used to highlight words in the text:

In [10]:
df = db.src.merge(db.pymorphy[['POS']], left_on='word_id', right_index=True)
Separator.Viewer().highlight('POS','auto').tooltip('POS').to_html_display(df.loc[df.sentence_id==1])

## Corpus

A corpus is a set of bundles, each representing one "atomic" text in the collection, e.g. a chapter of a book. "Atomic" means that the statistical operations we perform on this text can be carried out within this one bundle, without joins across bundles. Splitting texts into bundles is important for controlling memory consumption: sometimes we need, e.g., to merge dataframes with themselves, and such operations are too memory-intensive to be performed on whole books.
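
One possible example of merging a dataframe with itself is resolving the `syntax_parent_id` column from Slovnet into the parent word. The sketch below uses a hand-built frame with only three columns; the choice of this particular merge is an illustration, not necessarily the operation used in the research.

```python
import pandas as pd

# Hand-built stand-in: words joined with the slovnet `syntax_parent_id` column.
words = pd.DataFrame({
    'word_id': [0, 1, 2, 3, 4],
    'word': ['Маги', 'только', 'думали', ',', 'что'],
    'syntax_parent_id': [2, 2, -1, 5, 5],
})

# Self-merge: look up each word's syntactic parent in the same frame.
parents = words[['word_id', 'word']].rename(
    columns={'word_id': 'syntax_parent_id', 'word': 'parent_word'}
)
with_parents = words.merge(parents, on='syntax_parent_id', how='left')
print(with_parents)
```

Words whose parent is the root (`-1`) or lies outside the sample get a missing value in `parent_word`. The merge needs both copies of the frame in memory at once, which is cheap for a chapter but adds up on a whole book.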

Corpuses are built from "pseudo-md" files that contain:

* Headers, starting with `#`
* Tags, starting with `$`
* Raw text
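
The three kinds of lines can be told apart by their first character. The sketch below is a hypothetical classifier, not the parser used by `grammar_ru`; in particular, the sample tag line `$ rating 10` is an assumed syntax chosen only for illustration.

```python
from typing import List, Tuple

def classify_lines(text: str) -> List[Tuple[str, str]]:
    """Label every non-empty line as 'header', 'tag' or 'text' by its first character."""
    result = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content
        if stripped.startswith('#'):
            kind = 'header'
        elif stripped.startswith('$'):
            kind = 'tag'
        else:
            kind = 'text'
        result.append((kind, stripped))
    return result

# Hypothetical sample; the exact tag syntax is an assumption.
sample = "# Title\n$ rating 10\nFirst paragraph.\n\nSecond paragraph."
labelled = classify_lines(sample)
for kind, line in labelled:
    print(kind, '|', line)
```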

When building a corpus from, e.g., HTML files, they must first be converted into the pseudo-md format and then fed to the `grammar_ru` pipelines. The advantage of this approach is that such pseudo-md files can be reviewed manually.

`grammar_ru` also contains auxiliary code to convert the `fb2` format into pseudo-md.

This is an example of a pseudo-md file:

In [11]:
from yo_fluq_ds import *
from pathlib import Path

# Assumption: the file is UTF-8 encoded. Passing the encoding explicitly avoids
# the platform default (e.g. cp1252 on Windows), which fails on Cyrillic text.
print(Path('files/mds/averchenko.md').read_text(encoding='utf-8')[:200] + '...')

This is how conversion works:

* First, we convert the `md` files inside a specified folder into a base corpus.
* Second, we run the featurizers on the base corpus, producing the featurized corpus.

Both corpuses are `zip` files that are easy to distribute.

In [13]:
from grammar_ru.corpus import CorpusBuilder
from pathlib import Path

corpus = Path('files/corpus.zip')
base_corpus = Path('files/corpus.base.zip')


CorpusBuilder.convert_interformat_folder_to_corpus(
    base_corpus,
    Path('files/mds'),
    ['author']
)

CorpusBuilder.featurize_corpus(
    base_corpus,
    corpus,
    [
        PyMorphyFeaturizer(),
        SlovnetFeaturizer()
    ]
)



The corpus contains a table-of-contents (toc) file that describes all bundles in the corpus: basic statistics, headers, and tags:

In [12]:
from grammar_ru.corpus import CorpusReader

reader = CorpusReader(corpus)
reader.get_toc()

| file_id | filename | timestamp | part_index | token_count | character_count | ordinal | header_0 | headers | author | max_id | tag_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 92c27d27-c4f7-43be-a40c-4004d880f228 | chekhov.md | 2023-02-12 14:00:46.016147 | 0 | 1422 | 6010 | 0 | Ванька | Ванька | chekhov | 1423 | |
| f0f7eae2-130a-4abc-a0f7-5628da95ab59 | averchenko.md | 2023-02-12 14:00:46.174728 | 0 | 204 | 865 | 1 | Баклуши | Баклуши | averchenko | 11628 | 10.0 |
| d01bfc6c-6254-4c0b-b5de-15b67e944a76 | averchenko.md | 2023-02-12 14:00:46.197242 | 1 | 453 | 2035 | 2 | Белые короли | Белые короли | averchenko | 22082 | 20.0 |
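
Since the toc is an ordinary dataframe, it can be filtered and aggregated with `pandas` before any bundle is read. The sketch below works on a hand-built stand-in for `reader.get_toc()` with a few columns and shortened, made-up `file_id` values.

```python
import pandas as pd

# Hand-built stand-in for reader.get_toc(), reduced to a few columns.
toc = pd.DataFrame(
    {
        'filename': ['chekhov.md', 'averchenko.md', 'averchenko.md'],
        'token_count': [1422, 204, 453],
        'author': ['chekhov', 'averchenko', 'averchenko'],
        'tag_rating': [None, 10.0, 20.0],
    },
    index=pd.Index(['id-0', 'id-1', 'id-2'], name='file_id'),  # shortened ids
)

# Ids of the bundles written by one author.
averchenko_ids = toc.loc[toc.author == 'averchenko'].index.tolist()

# Total number of tokens per author.
tokens_per_author = toc.groupby('author')['token_count'].sum()
print(averchenko_ids)
print(tokens_per_author)
```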


`reader.read_frames()` provides an iterator with the `src` frames of each bundle:

In [13]:
reader.read_frames().first().head()

| | word_id | sentence_id | word_index | paragraph_id | word_tail | word | word_type | word_length | file_id | corpus_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | Ванька | ru | 6 | 92c27d27-c4f7-43be-a40c-4004d880f228 | corpus.zip |
| 1 | 1 | 0 | 1 | 0 | 0 | Жуков | ru | 5 | 92c27d27-c4f7-43be-a40c-4004d880f228 | corpus.zip |
| 2 | 2 | 0 | 2 | 0 | 1 | , | punct | 1 | 92c27d27-c4f7-43be-a40c-4004d880f228 | corpus.zip |
| 3 | 3 | 0 | 3 | 0 | 1 | девятилетний | ru | 12 | 92c27d27-c4f7-43be-a40c-4004d880f228 | corpus.zip |
| 4 | 4 | 0 | 4 | 0 | 0 | мальчик | ru | 7 | 92c27d27-c4f7-43be-a40c-4004d880f228 | corpus.zip |
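
An iterator of frames makes it possible to accumulate statistics one bundle at a time, so only a single frame is in memory at any moment. In the sketch below, `fake_frames` is a hypothetical stand-in for `reader.read_frames()` and `count_word_types` is an illustrative helper, not part of `grammar_ru`.

```python
from collections import Counter
from typing import Iterable, Iterator

import pandas as pd

def fake_frames() -> Iterator[pd.DataFrame]:
    """Hypothetical stand-in for reader.read_frames(): yields one frame per bundle."""
    yield pd.DataFrame({
        'word': ['Ванька', 'Жуков', ','],
        'word_type': ['ru', 'ru', 'punct'],
        'file_id': 'a',
    })
    yield pd.DataFrame({
        'word': ['Баклуши', '.'],
        'word_type': ['ru', 'punct'],
        'file_id': 'b',
    })

def count_word_types(frames: Iterable[pd.DataFrame]) -> Counter:
    """Accumulate word_type counts across bundles without concatenating them."""
    totals = Counter()
    for frame in frames:
        totals.update(frame['word_type'].value_counts().to_dict())
    return totals

totals = count_word_types(fake_frames())
print(totals)
```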


`reader.read_bundles()` provides an iterator that reads whole bundles:

In [14]:
reader.read_bundles().first()

{'src': {'shape': (1422, 10), 'index_name': None}, 'pymorphy': {'shape': (1422, 16), 'index_name': 'word_id'}, 'slovnet': {'shape': (1422, 16), 'index_name': 'word_id'}}