### How it works

For a classification task it is important that the classes are roughly the same size. When there are several corpora, each containing a different number of tokens and sentences, their sizes have to be balanced.

Corpora should be balanced in terms of whatever the samples will be. For example, it is wrong to balance by token count if each sample is a separate sentence, because sentence lengths may differ between corpora.
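The point about choosing the right unit can be illustrated with a minimal sketch. The two toy corpora below are hypothetical and are not taken from the project: they contain the same number of tokens but a different number of sentences, so balancing by tokens would still leave the sentence-level samples imbalanced.

```python
# Hypothetical toy corpora: each inner list is one sentence, each string is one token.
corpus_a = [["a"] * 10, ["b"] * 10]        # 2 long sentences, 20 tokens in total
corpus_b = [["c"] * 4 for _ in range(5)]   # 5 short sentences, 20 tokens in total


def token_count(corpus):
    """Total number of tokens across all sentences."""
    return sum(len(sentence) for sentence in corpus)


def sentence_count(corpus):
    """Number of sentences, i.e. the number of samples for a sentence-level task."""
    return len(corpus)


for name, corpus in [("corpus_a", corpus_a), ("corpus_b", corpus_b)]:
    print(name, "tokens:", token_count(corpus), "sentences:", sentence_count(corpus))
```

Both corpora report 20 tokens, yet one yields 2 samples and the other 5, which is why the balancer described here counts sentences.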

Since the previous tasks worked only with sentences, the balancer operates at the sentence level. Let $c_1, c_2, \ldots, c_n$ be a sequence of corpora, and let $|c_i|$ denote the number of sentences in corpus $c_i$. Initially it is very likely that $|c_1| \neq |c_2| \neq \ldots \neq |c_n|$, whereas we want $|c_1| \approx |c_2| \approx \ldots \approx |c_n|$ with good accuracy. For each corpus, we iterate over its sentences and compute the integer (rounded) logarithm of each sentence's length $|s|$; this splits all sentences into buckets

$$\large B^k_{L} = \{ s \space | \space s \in c_k \space \wedge \space \lfloor \log (|s|) \rceil = L\}$$

where $k \in [1 \ldots n]$, and the range of values of $L$ depends on the base of the logarithm. All buckets are then collected into a single dataframe. Lowering the base of the logarithm increases the number of buckets. As the base approaches 1 from above (the logarithm is undefined for a base of exactly 1), sentences end up split simply by their length. This can be useful if later processing needs finer control over sentence length.
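The bucketing rule can be sketched in a few lines of plain Python. This is only an illustration, not the code from `corpus_balancer.py`: the function names and toy data are hypothetical, and it assumes that $\lfloor \cdot \rceil$ means rounding to the nearest integer.

```python
import math
from collections import defaultdict


def bucket_number(sentence_length: int, log_base: float = math.e) -> int:
    """Rounded logarithm of a sentence length (length >= 1, base > 1)."""
    return round(math.log(sentence_length, log_base))


def build_buckets(corpora: dict, log_base: float = math.e) -> dict:
    """Map (corpus name, bucket number) to the list of sentence indices in that bucket."""
    buckets = defaultdict(list)
    for name, lengths in corpora.items():
        for sentence_id, length in enumerate(lengths):
            buckets[(name, bucket_number(length, log_base))].append(sentence_id)
    return dict(buckets)


# Hypothetical toy data: each number is the length of one sentence.
toy_corpora = {"a": [1, 3, 8, 20, 21], "b": [2, 7, 50]}

buckets_e = build_buckets(toy_corpora, math.e)   # natural logarithm
buckets_small = build_buckets(toy_corpora, 1.2)  # smaller base, finer buckets

print(len(buckets_e), len(buckets_small))
```

With the natural logarithm, lengths 20 and 21 share bucket 3 of corpus `a`. With base 1.2 they fall into different buckets, which shows how a smaller base gives finer control over length.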

<img src="images/balancer.png" width="800"/>

Once the sentences are split into buckets, the balancer comes into play. It takes as input the dataframe with buckets and a number $Q$, the upper bound on bucket size. All it does is randomly select up to $Q$ sentences from each bucket (the whole bucket if it holds fewer than $Q$) and, when the index is built, filter out the sentences that did not make it into the sample. This always works correctly, because every sentence belongs to some corpus and the integer logarithm of its length can be computed, and the corpus together with the logarithm determines the address of the bucket.
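The selection step can be sketched as follows. This is a simplified stand-in for the real balancer under stated assumptions: the bucket layout `(corpus, bucket number) -> sentence ids`, the function name `select_sentences`, and the fixed seed are all hypothetical.

```python
import random


def select_sentences(buckets: dict, limit: int, seed: int = 0) -> set:
    """Pick at most `limit` random sentences from every bucket.

    Returns a set of (corpus, sentence_id) pairs that survive the balancing.
    """
    rng = random.Random(seed)
    selected = set()
    for (corpus, _bucket), sentence_ids in buckets.items():
        k = min(limit, len(sentence_ids))  # small buckets are kept whole
        selected.update((corpus, sid) for sid in rng.sample(sentence_ids, k))
    return selected


# Hypothetical buckets: corpus "a" has 6 sentences in bucket 2, corpus "b" has 3.
toy_buckets = {("a", 2): [0, 1, 2, 3, 4, 5], ("b", 2): [0, 1, 2]}
chosen = select_sentences(toy_buckets, limit=3)


def keep(corpus: str, sentence_id: int) -> bool:
    """Filter predicate: True only for sentences that were sampled."""
    return (corpus, sentence_id) in chosen


kept_a = sorted(sid for corpus, sid in chosen if corpus == "a")
print(len(kept_a), all(keep("b", sid) for sid in [0, 1, 2]))
```

After selection both corpora contribute 3 sentences from this bucket, and a predicate like `keep` can then be applied while the corpus is rewritten.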

### Usage example

The balancer code lives in ```tg/grammar_ru/ml/corpus/corpus_balancer.py```

In [1]:
from tg.grammar_ru.common import Loc
from tg.grammar_ru.ml.corpus import CorpusReader, CorpusBuilder, BucketCorpusBalancer
from tg.grammar_ru.ml.corpus.corpus_reader import read_data

from yo_fluq_ds import Queryable, Query, fluq

from typing import List, Union
from pathlib import Path

import math
import pandas as pd

As the first step, we build the list of corpora from which the bundle will be assembled and which need to be balanced.

In [2]:
CORPUS_NAMES = [
    "books.base.zip", 
    "ficbook.base.zip"
]

CORPUS_LIST = [Loc.corpus_path/corpus_name for corpus_name in CORPUS_NAMES]

In [3]:
CORPUS_LIST

[PosixPath('/home/dabdya/grammar_ru/data-cache/corpus/books.base.zip'),
 PosixPath('/home/dabdya/grammar_ru/data-cache/corpus/ficbook.base.zip')]

Let's check how balanced the corpora are to begin with, to see whether it makes sense to do anything further.

In [6]:
def get_sentences_count(corpus_path):
    """Count sentences: distinct sentence_id values per frame, summed over all frames."""
    return sum(
        len(frame.groupby("sentence_id"))
        for frame in read_data(corpus_path)
    )

In [7]:
for corpus_name, corpus_path in zip(CORPUS_NAMES, CORPUS_LIST):
    print(f"{corpus_name} = {get_sentences_count(corpus_path)} sentences")

100%|███████████████████████████████████████| 2533/2533 [00:43<00:00, 58.50it/s]


books.base.zip = 903668 sentences


100%|█████████████████████████████████████████| 998/998 [00:14<00:00, 67.08it/s]

ficbook.base.zip = 120270 sentences





The corpora are clearly imbalanced (903668 sentences versus 120270), so we fix the parameters for forming the buckets and start building.

In [8]:
LOG_BASE = math.e
BUCKET_PATH = Loc.corpus_path/"prepare/buckets/example.parquet"

In [9]:
BucketCorpusBalancer.build_buckets_frame(CORPUS_LIST, BUCKET_PATH, LOG_BASE)

100%|███████████████████████████████████████| 3531/3531 [04:08<00:00, 14.21it/s]


In [10]:
pd.read_parquet(BUCKET_PATH)

| bucket | sentences | bucket_size |
|---|---|---|
| books.base.zip/0 | [2105037, 2635119, 2635122, 2930082, 2930089, ... | 129 |
| books.base.zip/1 | [7, 8, 11, 15, 10711, 10730, 10731, 10745, 107... | 90591 |
| books.base.zip/2 | [1, 2, 6, 9, 12, 14, 16, 17, 18, 19, 20, 21, 2... | 434207 |
| books.base.zip/3 | [0, 3, 4, 5, 10, 13, 10712, 10716, 10718, 1072... | 355499 |
| books.base.zip/4 | [10715, 10717, 10724, 10744, 10777, 10793, 107... | 23177 |
| books.base.zip/5 | [189579, 784022, 2391783, 5783820, 5998276, 67... | 65 |
| ficbook.base.zip/0 | [73552, 93817, 117139, 117186, 117216, 117370,... | 449 |
| ficbook.base.zip/1 | [0, 10144, 10145, 10170, 10176, 10178, 10180, ... | 13029 |
| ficbook.base.zip/2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | 55187 |
| ficbook.base.zip/3 | [10138, 10139, 10140, 10146, 10147, 10152, 101... | 47695 |


Looking at the distribution of buckets, one can decide which bucket numbers ```BUCKET_NUMBERS``` to keep and which limit ```BUCKET_LIMIT``` on the bucket size to set. Everything depends on the ratio of the corpus sizes and on the distribution of sentence lengths in them. Setting these hyperparameters mechanically is not enough: it can fail, for example, when there is little data and one of the corpora consists mostly of short or mostly of long sentences.


If the resulting buckets cannot be combined with good accuracy, it is useful to go back one step and try lowering the base of the logarithm, so that sentence lengths can be controlled more flexibly.

In the situation above, the buckets can be combined in different ways. For example, if we keep numbers ```[2,3,4]```, then fairly accurate balancing requires a bucket limit of about ```4000```, because the smallest of the kept buckets, ```ficbook.base.zip/4```, holds only 3896 sentences. One can also sacrifice some accuracy and set the limit to ```4000 + K```, where ```K``` is the surplus allowed in favor of one of the corpora. Alternatively, one can keep numbers ```[2,3]``` and take a limit of ```45000```.

In short, several options are possible here. If there is a lot of data, there will most likely be no problems, but if data is scarce and heavily imbalanced to begin with, it is better to double-check.
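The options discussed above can be checked with simple arithmetic before running the balancer. The helper below is a hypothetical sketch, not part of `BucketCorpusBalancer`. It assumes that each kept bucket contributes `min(bucket_size, BUCKET_LIMIT)` sentences, and it uses the bucket sizes printed by the notebook.

```python
# Bucket sizes copied from the notebook's bucket tables.
BUCKET_SIZES = {
    "books.base.zip": {2: 434207, 3: 355499, 4: 23177},
    "ficbook.base.zip": {2: 55187, 3: 47695, 4: 3896},
}


def balanced_sizes(bucket_sizes, bucket_numbers, bucket_limit):
    """Sentences each corpus contributes when every kept bucket is capped at bucket_limit."""
    return {
        corpus: sum(min(sizes.get(n, 0), bucket_limit) for n in bucket_numbers)
        for corpus, sizes in bucket_sizes.items()
    }


print(balanced_sizes(BUCKET_SIZES, [2, 3, 4], 4000))
# {'books.base.zip': 12000, 'ficbook.base.zip': 11896}
print(balanced_sizes(BUCKET_SIZES, [2, 3], 45000))
# {'books.base.zip': 90000, 'ficbook.base.zip': 90000}
```

The first option keeps longer sentences at the cost of a much smaller dataset. The second gives an exact balance and far more data, but drops bucket 4 entirely.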

In [13]:
BUCKET_NUMBERS = [2, 3, 4]
BUCKET_LIMIT = 4000

In [14]:
BucketCorpusBalancer.filter_buckets_by_bucket_numbers(BUCKET_PATH, BUCKET_NUMBERS)

In [15]:
pd.read_parquet(BUCKET_PATH)

| bucket | sentences | bucket_size |
|---|---|---|
| books.base.zip/2 | [1, 2, 6, 9, 12, 14, 16, 17, 18, 19, 20, 21, 2... | 434207 |
| books.base.zip/3 | [0, 3, 4, 5, 10, 13, 10712, 10716, 10718, 1072... | 355499 |
| books.base.zip/4 | [10715, 10717, 10724, 10744, 10777, 10793, 107... | 23177 |
| ficbook.base.zip/2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | 55187 |
| ficbook.base.zip/3 | [10138, 10139, 10140, 10146, 10147, 10152, 101... | 47695 |
| ficbook.base.zip/4 | [10198, 21538, 21556, 21565, 93821, 93824, 938... | 3896 |


In [None]:
BALANCED_PATH = Loc.corpus_path/"prepare/balanced/example_balanced.zip"

```BALANCED_PATH``` points to the location where the corpora are to be merged. The function below is not part of the balancer; it is defined wherever the bundle is built and other transfuse operations are performed.

In [123]:
def balancing() -> None:
    balancer = BucketCorpusBalancer(
        buckets = pd.read_parquet(BUCKET_PATH), 
        log_base = LOG_BASE,
        bucket_limit = BUCKET_LIMIT,
    )

    CorpusBuilder.transfuse_corpus(
        sources = CORPUS_LIST,
        destination = BALANCED_PATH,
        selector = balancer.select
    )

In [124]:
balancing()

2022-12-01 11:11:11.268193+00:00 INFO: Processed 7 words. 1/3531
2022-12-01 11:11:11.389087+00:00 INFO: Processed 80 words. 2/3531
2022-12-01 11:11:11.550925+00:00 INFO: Processed 178 words. 3/3531
...
2022-12-01 11:12:57.631828+00:00 INFO: Processed 99845 words. 814/3531
2022-12-01 11:12:58.048241+00:00 INFO: Writing 297 frames, 50042 words, to write 49982, to keep 60
2022-12-01 11:12:58.230009+00:00 INFO: Processed 100039 words. 815/3531
...
2022-12-01 11:15:58.969647+00:00 INFO: Processed 249907 words. 2398/3531
2022-12-01 11:15:59.143205+00:00 INFO: Writing 251 frames, 50014 words, to write 49998, to keep 16
2022-12-01 11:15:59.330559+00:00 INFO: Processed 249971 words. 2399/3531
...
2022-12-01 11:17:05.044138+00:00 INFO: Processed 390422 words. 2958/3531
2022-12-01 11:17:05.407496+00:00 INFO: Processed 390470 words. 2959/3531
2022-12-01 11:17:05.635527+00:00 INFO: Processed 390477 words. 2960/3531
2022-12-01 11:17:05.831985+00:00 INFO: Processed 391159 words. 2961/3531
2022-12-01 11:17:06.088987+00:00 INFO: Processed 391475 words. 2962/3531
2022-12-01 11:17:06.231066+00:00 INFO: Processed 391475 words. 2963/3531
2022-12-01 11:17:06.375280+00:00 INFO: Processed 391475 words. 2964/3531
2022-12-01 11:17:06.614874+00:00 INFO: Processed 391573 words. 2965/3531
2022-12-01 11:17:06.738362+00:00 INFO: Processed 39

2022-12-01 11:17:16.435367+00:00 INFO: Processed 414932 words. 3064/3531
2022-12-01 11:17:16.509361+00:00 INFO: Processed 415004 words. 3065/3531
2022-12-01 11:17:16.573160+00:00 INFO: Processed 415063 words. 3066/3531
2022-12-01 11:17:16.664512+00:00 INFO: Processed 415712 words. 3067/3531
2022-12-01 11:17:16.782022+00:00 INFO: Processed 416325 words. 3068/3531
2022-12-01 11:17:16.856307+00:00 INFO: Processed 416503 words. 3069/3531
2022-12-01 11:17:16.937475+00:00 INFO: Processed 416879 words. 3070/3531
2022-12-01 11:17:17.010449+00:00 INFO: Processed 417054 words. 3071/3531
2022-12-01 11:17:17.093482+00:00 INFO: Processed 417285 words. 3072/3531
2022-12-01 11:17:17.179936+00:00 INFO: Processed 417457 words. 3073/3531
2022-12-01 11:17:17.270315+00:00 INFO: Processed 417924 words. 3074/3531
2022-12-01 11:17:17.337454+00:00 INFO: Processed 417930 words. 3075/3531
2022-12-01 11:17:17.424436+00:00 INFO: Processed 418364 words. 3076/3531
2022-12-01 11:17:17.497096+00:00 INFO: Processed 41

2022-12-01 11:17:25.870239+00:00 INFO: Writing 170 frames, 50001 words, to write 49995, to keep 6
2022-12-01 11:17:26.096072+00:00 INFO: Processed 449936 words. 3177/3531
2022-12-01 11:17:26.219883+00:00 INFO: Processed 450234 words. 3178/3531
2022-12-01 11:17:26.325130+00:00 INFO: Processed 450448 words. 3179/3531
2022-12-01 11:17:26.399542+00:00 INFO: Processed 450453 words. 3180/3531
2022-12-01 11:17:26.478787+00:00 INFO: Processed 450561 words. 3181/3531
2022-12-01 11:17:26.549869+00:00 INFO: Processed 450847 words. 3182/3531
2022-12-01 11:17:26.625347+00:00 INFO: Processed 451262 words. 3183/3531
2022-12-01 11:17:26.720114+00:00 INFO: Processed 451390 words. 3184/3531
2022-12-01 11:17:26.819684+00:00 INFO: Processed 451622 words. 3185/3531
2022-12-01 11:17:26.886766+00:00 INFO: Processed 451747 words. 3186/3531
2022-12-01 11:17:26.968254+00:00 INFO: Processed 451841 words. 3187/3531
2022-12-01 11:17:27.038048+00:00 INFO: Processed 452200 words. 3188/3531
2022-12-01 11:17:27.126892

2022-12-01 11:17:36.225946+00:00 INFO: Processed 480649 words. 3288/3531
2022-12-01 11:17:36.299170+00:00 INFO: Processed 480662 words. 3289/3531
2022-12-01 11:17:36.370349+00:00 INFO: Processed 480668 words. 3290/3531
2022-12-01 11:17:36.524107+00:00 INFO: Processed 480675 words. 3291/3531
2022-12-01 11:17:36.730497+00:00 INFO: Processed 481486 words. 3292/3531
2022-12-01 11:17:36.830373+00:00 INFO: Processed 481486 words. 3293/3531
2022-12-01 11:17:36.941490+00:00 INFO: Processed 481536 words. 3294/3531
2022-12-01 11:17:37.021522+00:00 INFO: Processed 481541 words. 3295/3531
2022-12-01 11:17:37.120319+00:00 INFO: Processed 481630 words. 3296/3531
2022-12-01 11:17:37.205128+00:00 INFO: Processed 481656 words. 3297/3531
2022-12-01 11:17:37.368434+00:00 INFO: Processed 482860 words. 3298/3531
2022-12-01 11:17:37.437481+00:00 INFO: Processed 482883 words. 3299/3531
2022-12-01 11:17:37.514148+00:00 INFO: Processed 483155 words. 3300/3531
2022-12-01 11:17:37.593364+00:00 INFO: Processed 48

2022-12-01 11:17:46.944996+00:00 INFO: Processed 506023 words. 3399/3531
2022-12-01 11:17:47.043099+00:00 INFO: Processed 506041 words. 3400/3531
2022-12-01 11:17:47.109454+00:00 INFO: Processed 506239 words. 3401/3531
2022-12-01 11:17:47.217246+00:00 INFO: Processed 507363 words. 3402/3531
2022-12-01 11:17:47.310186+00:00 INFO: Processed 507929 words. 3403/3531
2022-12-01 11:17:47.389551+00:00 INFO: Processed 508285 words. 3404/3531
2022-12-01 11:17:47.453534+00:00 INFO: Processed 508519 words. 3405/3531
2022-12-01 11:17:47.496678+00:00 INFO: Processed 508519 words. 3406/3531
2022-12-01 11:17:47.571957+00:00 INFO: Processed 508705 words. 3407/3531
2022-12-01 11:17:47.652756+00:00 INFO: Processed 508786 words. 3408/3531
2022-12-01 11:17:47.753461+00:00 INFO: Processed 508812 words. 3409/3531
2022-12-01 11:17:47.822423+00:00 INFO: Processed 508862 words. 3410/3531
2022-12-01 11:17:47.895344+00:00 INFO: Processed 508944 words. 3411/3531
2022-12-01 11:17:47.970325+00:00 INFO: Processed 50

2022-12-01 11:17:57.941031+00:00 INFO: Processed 535217 words. 3512/3531
2022-12-01 11:17:58.023017+00:00 INFO: Processed 535542 words. 3513/3531
2022-12-01 11:17:58.101163+00:00 INFO: Processed 536040 words. 3514/3531
2022-12-01 11:17:58.173395+00:00 INFO: Processed 536197 words. 3515/3531
2022-12-01 11:17:58.278895+00:00 INFO: Processed 536991 words. 3516/3531
2022-12-01 11:17:58.555105+00:00 INFO: Processed 539811 words. 3517/3531
2022-12-01 11:17:58.664554+00:00 INFO: Processed 540189 words. 3518/3531
2022-12-01 11:17:58.761572+00:00 INFO: Processed 540297 words. 3519/3531
2022-12-01 11:17:58.860239+00:00 INFO: Processed 540842 words. 3520/3531
2022-12-01 11:17:58.935381+00:00 INFO: Processed 541140 words. 3521/3531
2022-12-01 11:17:59.011206+00:00 INFO: Processed 541549 words. 3522/3531
2022-12-01 11:17:59.081859+00:00 INFO: Processed 541925 words. 3523/3531
2022-12-01 11:17:59.154413+00:00 INFO: Processed 542270 words. 3524/3531
2022-12-01 11:17:59.229717+00:00 INFO: Processed 54

In [126]:
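from collections import defaultdict

# Count how many distinct sentences each corpus contributes to the balanced data.
lengths = defaultdict(int)
for frame in read_data(BALANCED_PATH):
    for corpus_name in CORPUS_NAMES:
        corpus_rows = frame[frame.original_corpus_id == corpus_name]
        # Rows sharing a sentence_id belong to one sentence, so count distinct ids.
        lengths[corpus_name] += corpus_rows["sentence_id"].nunique()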

100%|███████████████████████████████████████████| 11/11 [00:01<00:00,  8.67it/s]


In [129]:
lengths

defaultdict(int, {'books.base.zip': 12000, 'ficbook.base.zip': 11896})

The corpora are now nearly the same size (12000 and 11896 sentences), so we can move on to downstream processing.
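
To put a number on "nearly the same size", one option is to compute the relative gap between the largest and the smallest class. The sketch below is only an illustration: the helper name `relative_imbalance` is hypothetical and not part of the notebook, and the counts are copied by hand from the `lengths` output above.

```python
def relative_imbalance(counts):
    """Return (max - min) / max over per-class sample counts (0.0 if all are empty)."""
    largest = max(counts.values())
    smallest = min(counts.values())
    if largest == 0:
        return 0.0
    return (largest - smallest) / largest


# Sentence counts copied from the notebook output above.
lengths = {"books.base.zip": 12000, "ficbook.base.zip": 11896}

imbalance = relative_imbalance(lengths)
print(f"{imbalance:.2%}")  # prints 0.87%
```

A gap below one percent is usually small enough that the classifier does not need class weights, although the acceptable threshold depends on the task.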