## Data Partition Notebook
This notebook documents how the large human-written text datasets created in the first notebook, **Human Data Collection**, are partitioned. The data is split into three non-overlapping partitions:

1. **Human Final**: human_final.csv, a subset containing 500 samples per genre, reserved for later experiments.
2. **Subsets for Generation**: social_for_generation.csv, news_for_generation.csv and poems_for_generation.csv, sets of 1,000 samples per genre, selected to provide examples to the generative model in the next steps.
3. **Extra**: social_human_extra.csv, news_human_extra.csv and poems_human_extra.csv, remaining data after the above subsets were extracted, ensuring distinct and non-overlapping partitions.

Adjust the file paths to match your local setup before running the cells.


In [None]:
import pandas as pd
import numpy as np

### Load the files

In [None]:
# Make sure to check the file path
df_social = pd.read_csv('data/original/social_media.csv')
df_social.head()

Unnamed: 0,texts,source,word_counts,genre
0,"Кстати, как неожиданно КПРФ стало не все равно...",vk,14,social
1,"можно и по-другому сказать: ""убогая клоунада"" ...",vk,36,social
2,Вот он тонкий незаметный ход против России. Зю...,vk,23,social
3,просто в этом паблике раньше подобных постов н...,vk,21,social
4,Это не КПРФ - это цирк. Коммунизм - это совсем...,vk,12,social


In [None]:
df_poems = pd.read_csv('data/original/poems.csv')
df_poems.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,На серебряные шпоры…,На серебряные шпоры…\nНа серебряные шпоры\nЯ в...,Лермонтов Михаил Юрьевич,59,poems
1,Вид гор из степей Козлова,Вид гор из степей Козлова\nПилигрим\nАллах ли ...,Лермонтов Михаил Юрьевич,113,poems
2,"К (О, не скрывай! Ты плакала об нем…)","К (О, не скрывай! Ты плакала об нем…)\nО, не ...",Лермонтов Михаил Юрьевич,63,poems
3,"Жалобы турка (письмо к другу, иностранцу)","Жалобы турка (письмо к другу, иностранцу)\nТы ...",Лермонтов Михаил Юрьевич,98,poems
4,К кн. Л. Г-ой,К кн. Л. Г-ой\nКогда ты холодно внимаешь\nРасс...,Лермонтов Михаил Юрьевич,104,poems


In [None]:
df_news = pd.read_csv('data/original/news.csv')
df_news.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,Синий богатырь,Синий богатырь\nВ 1930-е годы Советский Союз о...,lenta.ru,1905,news
1,Загитова согласилась вести «Ледниковый период»,Загитова согласилась вести «Ледниковый период»...,lenta.ru,154,news
2,Объяснена опасность однообразного питания,Объяснена опасность однообразного питания\nРос...,lenta.ru,140,news
3,«Предохраняться? А зачем?»,«Предохраняться? А зачем?»\nВ 2019 году телека...,lenta.ru,2915,news
4,Ефремов систематически употреблял наркотики,Ефремов систематически употреблял наркотики\nА...,lenta.ru,139,news


### Data Partition
We need to set aside the human data that will be used in later experiments so that it is never used as examples for model text generation. The selected samples should also be representative of the word count distribution and balanced by genre and by source (e.g., for the social media genre: VK, Facebook, Pikabu, etc.).

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def stratify_and_split(data, n_samples):
    """
    Stratify and set aside samples based on source and word count.

    Parameters:
        data (pd.DataFrame): The input dataset with 'source' and 'word_counts' columns.
        n_samples (int): Number of samples to set aside.

    Returns:
        pd.DataFrame: Stratified sampled data (at most n_samples rows).
        pd.DataFrame: Remaining data after exclusion.
    """
    # Work on a copy so the temporary bin column does not leak into the caller's DataFrame
    data = data.copy()

    # Create word count bins (quartiles) for stratification
    data['word_count_bin'] = pd.qcut(data['word_counts'], q=4, duplicates='drop')

    # Perform stratified sampling based on 'source' and 'word_count_bin'
    stratified_data = (
        data.groupby(['source', 'word_count_bin'], group_keys=False, observed=False)
        .apply(
            lambda group: group.sample(
                n=min(len(group), int(n_samples / len(data['source'].unique()))),
                random_state=42
            )
        )
    )

    # Downsample the candidate pool to at most n_samples rows
    stratified_data = stratified_data.sample(n=min(len(stratified_data), n_samples), random_state=42)

    # Drop the sampled rows from the dataset by index
    # (a text duplicated in another row is NOT removed by this step)
    remaining_data = data.drop(index=stratified_data.index).copy()

    # Reset index after exclusion to clean up the data
    stratified_data = stratified_data.reset_index(drop=True)
    remaining_data = remaining_data.reset_index(drop=True)

    # Remove the temporary 'word_count_bin' column
    stratified_data = stratified_data.drop(columns=['word_count_bin'], errors='ignore')
    remaining_data = remaining_data.drop(columns=['word_count_bin'], errors='ignore')

    return stratified_data, remaining_data
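
A small sketch of the arithmetic inside `stratify_and_split`, run on a synthetic `toy` DataFrame (an assumption for illustration, not the real data). Each (source, bin) cell contributes up to `n_samples / n_sources` rows, and the resulting candidate pool is then downsampled at random. Because that last step is a plain random draw, per-source counts in the result are approximately, not exactly, equal.

```python
import numpy as np
import pandas as pd

# Synthetic data: three sources with 400 rows each (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "source": np.repeat(["a", "b", "c"], 400),
    "word_counts": rng.integers(1, 500, size=1200),
})
n_samples = 30

# Quartile bins over the whole frame, as in stratify_and_split
toy["word_count_bin"] = pd.qcut(toy["word_counts"], q=4, duplicates="drop")

# Per-cell cap used by the function: n_samples divided by the number of sources
cap = int(n_samples / toy["source"].nunique())

# Size of the candidate pool before the final random downsampling
cell_sizes = toy.groupby(["source", "word_count_bin"], observed=True).size()
pool_size = int(cell_sizes.clip(upper=cap).sum())

print(f"cap per (source, bin) cell: {cap}")
print(f"candidate pool before downsampling: {pool_size}")
```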

**Let's start with the social media dataset first:**

In [None]:
# Set aside 500 samples for social media genre
social_samples, df_social_remaining = stratify_and_split(df_social, 500)

In [None]:
print(len(social_samples))

500


In [None]:
social_samples.head()

Unnamed: 0,texts,source,word_counts,genre
0,"При желании, по крайней мере до 5 версии ведра...",pikabu,17,social
1,😎 В сообществах ссылки бросать не принято. У ж...,vk,33,social
2,Между Зюгой и Монсоном стоит спортсменка Марья...,fb,26,social
3,Тогда другой пример: принтер Xerox 3124 в блок...,pikabu,41,social
4,"Скорее всего и показывает, если не этим нарост...",pikabu,27,social


In [None]:
#df_social_remaining.head()

In [None]:
social_samples['source'].value_counts()

Unnamed: 0_level_0,count
source,Unnamed: 1_level_1
fb,177
vk,163
pikabu,160


In [None]:
social_samples['word_counts'].describe()

Unnamed: 0,word_counts
count,500.0
mean,41.102
std,72.218908
min,1.0
25%,11.0
50%,20.0
75%,43.0
max,708.0
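
One way to judge whether a sample is representative is to place its summary statistics next to those of the full dataset. The sketch below does this on synthetic data; `population` and `sample` are hypothetical stand-ins for `df_social['word_counts']` and `social_samples['word_counts']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed word counts (illustrative stand-in for the real column)
rng = np.random.default_rng(42)
population = pd.Series(
    rng.lognormal(mean=3.0, sigma=1.0, size=10_000).round() + 1,
    name="word_counts",
)
sample = population.sample(n=500, random_state=42)

# Side-by-side summary statistics
comparison = pd.concat(
    [population.describe(), sample.describe()],
    axis=1,
    keys=["population", "sample"],
)
print(comparison.loc[["25%", "50%", "75%"]])
```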


In [None]:
print((social_samples['word_counts'] == 1).sum())

2


In [None]:
social_samples[social_samples['word_counts'] == 1]

Unnamed: 0,texts,source,word_counts,genre
171,Бывал 03-05,pikabu,1,social
439,это ЕР₽)),vk,1,social


In [None]:
len(df_social)

632216

In [None]:
len(df_social_remaining)

631716

Cross-check that the function worked properly and that the sampled texts are indeed absent from the remaining data:

In [None]:
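def filter_common_rows(df1, column1, df2, column2):
    """
    Return the rows of df1 whose value in column1 also occurs in df2[column2],
    together with the number of such rows.
    """
    common_mask = df1[column1].isin(df2[column2])
    common_rows_df = df1[common_mask]

    return common_rows_df, len(common_rows_df)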

In [None]:
common_texts_df, common_count = filter_common_rows(
    social_samples, 'texts', df_social_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
# Set aside 1000 samples for data generation purposes
social_generation, df_social_extras = stratify_and_split(df_social_remaining, 1000)

In [None]:
len(social_generation)

1000

In [None]:
len(df_social_remaining)

631716

In [None]:
len(df_social_extras)

630716

In [None]:
social_generation['word_counts'].describe()

Unnamed: 0,word_counts
count,1000.0
mean,39.941
std,70.586585
min,1.0
25%,10.0
50%,20.0
75%,43.0
max,1125.0


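The split drops rows by index, so a text that is duplicated across rows can remain in the extras. We found such common texts and will remove them from the DataFrame with extra texts: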

In [None]:
# Remove rows from df_social_extras that have texts in common with social_generation
df_social_extras = df_social_extras[~df_social_extras['texts'].isin(social_generation['texts'])]
print(f"Updated df_social_extras with {len(df_social_extras)} rows remaining.")

Updated df_social_extras with 630713 rows remaining.
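
A minimal sketch of how such overlaps can arise, using a hypothetical three-row `toy` DataFrame: dropping by index removes only the sampled row, so an identical text stored in another row stays behind until it is filtered out by value.

```python
import pandas as pd

# Rows 0 and 1 hold the same text (illustrative data)
toy = pd.DataFrame({"texts": ["same text", "same text", "unique text"]})

sampled = toy.loc[[0]]                     # pretend row 0 was sampled
remaining = toy.drop(index=sampled.index)  # index-based exclusion

# The duplicate in row 1 is still present in the remainder
overlap = remaining["texts"].isin(sampled["texts"])
n_overlap = int(overlap.sum())
print(n_overlap)

# Value-based filtering removes it
remaining = remaining[~overlap]
print(len(remaining))
```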


In [None]:
# Now there are no common texts between the two dfs
common_texts_df, common_count = filter_common_rows(
    social_generation, 'texts', df_social_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


Save to files for further use.

In [None]:
social_generation.to_csv("data/original/social_for_generation.csv", index=False, encoding='utf-8')
print("File saved as social_for_generation.csv")

File saved as social_for_generation.csv


In [None]:
df_social_extras.to_csv("data/original/social_human_extra.csv", index=False, encoding='utf-8')
print("File saved as social_human_extra.csv")

File saved as social_human_extra.csv


**We'll repeat the same steps for the news genre:**

In [None]:
# Set aside 500 samples for news genre
news_samples, df_news_remaining = stratify_and_split(df_news, 500)

In [None]:
common_texts_df, common_count = filter_common_rows(
    news_samples, 'texts', df_news_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
print(len(df_news))
print(len(news_samples))
print(len(df_news_remaining))

20801
500
20301


In [None]:
news_samples['source'].value_counts()

Unnamed: 0_level_0,count
source,Unnamed: 1_level_1
lenta.ru,185
ria.ru,174
meduza.io,141


In [None]:
news_samples['word_counts'].describe()

Unnamed: 0,word_counts
count,500.0
mean,274.206
std,397.248643
min,38.0
25%,132.75
50%,175.0
75%,245.25
max,4238.0


In [None]:
# Set aside 1000 samples for data generation purposes
news_generation, df_news_extras = stratify_and_split(df_news_remaining, 1000)

In [None]:
print(len(news_generation))
print(len(df_news_extras))
print(len(df_news_remaining))

1000
19301
20301


In [None]:
common_texts_df, common_count = filter_common_rows(
    news_generation, 'texts', df_news_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
news_generation.to_csv("data/original/news_for_generation.csv", index=False, encoding='utf-8')
print("File saved as news_for_generation.csv")

df_news_extras.to_csv("data/original/news_human_extra.csv", index=False, encoding='utf-8')
print("File saved as news_human_extra.csv")

File saved as news_for_generation.csv
File saved as news_human_extra.csv


**We'll repeat these steps one more time for the poems genre:**

In [None]:
# There are very long poems in this dataset. We'd like to inspect them and possibly exclude them from our analysis.
df_poems['word_counts'].describe()

Unnamed: 0,word_counts
count,19302.0
mean,190.239043
std,802.926799
min,5.0
25%,55.0
50%,85.0
75%,139.0
max,30118.0


In [None]:
threshold = 1000
very_long_poems = df_poems[df_poems['word_counts'] > threshold]

In [None]:
# This dataset appears to include long poems and plays.
# There are only a few such instances.
len(very_long_poems)

344

In [None]:
very_long_poems.sample(10)

Unnamed: 0,title,texts,source,word_counts,genre,word_count_bin
14170,Огонь,"Огонь\nНе устану тебя восхвалять,\nО, внезапны...",Бальмонт Константин Дмитриевич,1625,poems,"(139.0, 30118.0]"
7517,На Волге,На Волге\n(Детство Валежникова)\n1\n. . . . . ...,Некрасов Николай Алексеевич,1297,poems,"(139.0, 30118.0]"
16870,Песня о походе Владимира на Корсунь,Песня о походе Владимира на Корсунь\nЧАСТЬ ПЕР...,Толстой Алексей Константинович,1257,poems,"(139.0, 30118.0]"
7306,Сказка о царевне Ясносвете,"Сказка о царевне Ясносвете\nЦып, цып, цып! ко ...",Некрасов Николай Алексеевич,3011,poems,"(139.0, 30118.0]"
3468,Встреча,Встреча\nРассказ в стихах\nПосвящается А.Фету\...,Аполлон Александрович Григорьев,1117,poems,"(139.0, 30118.0]"
188,Исповедь,Исповедь\nI\nДень гас; в наряде голубом\nКрутя...,Лермонтов Михаил Юрьевич,1066,poems,"(139.0, 30118.0]"
1820,Меланиппа-философ. Трагедия,Меланиппа-философ. Трагедия\nПосвящается\nБори...,Анненский Иннокентий Федорович,10196,poems,"(139.0, 30118.0]"
827,Русалка,"Русалка\nБЕРЕГ ДНЕПРА. МЕЛЬНИЦА\nМельник , Доч...",Пушкин Александр Сергеевич,3665,poems,"(139.0, 30118.0]"
15355,На счастие,"На счастие\nВсегда прехвально, препочтемно,\nВ...",Гавриил Романович Державин,1001,poems,"(139.0, 30118.0]"
12811,Вольные мысли (1907),Вольные мысли (1907)\n(Посв. Г. Чулкову)\nО см...,Блок Александр Александрович,1827,poems,"(139.0, 30118.0]"


In [None]:
# We will remove them from the final dataset to focus on shorter poems only
df_poems = df_poems[df_poems['word_counts'] <= threshold].copy()

In [None]:
df_poems['word_counts'].describe()

Unnamed: 0,word_counts
count,18958.0
mean,119.92668
std,122.293804
min,5.0
25%,54.0
50%,83.0
75%,135.0
max,1000.0


In [None]:
# Set aside 500 samples for poems genre
poems_samples, df_poems_remaining = stratify_and_split(df_poems, 500)

In [None]:
print(len(df_poems))
print(len(poems_samples))
print(len(df_poems_remaining))

18958
500
18458


In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_samples, 'texts', df_poems_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 2


In [None]:
poems_samples.duplicated().sum()

0

In [None]:
df_poems_remaining = df_poems_remaining[~df_poems_remaining['texts'].isin(poems_samples['texts'])]

In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_samples, 'texts', df_poems_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
poems_samples['word_counts'].describe()

Unnamed: 0,word_counts
count,500.0
mean,125.592
std,132.344897
min,10.0
25%,53.0
50%,84.5
75%,138.5
max,967.0


In [None]:
# Set aside 1000 samples for data generation purposes
poems_generation, df_poems_extras = stratify_and_split(df_poems_remaining, 1000)

In [None]:
print(len(poems_generation))
print(len(df_poems_extras))
print(len(df_poems_remaining))

1000
17456
18456


In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_generation, 'texts', df_poems_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 3


In [None]:
# We will remove them from the remaining df to avoid using these texts twice.
df_poems_extras = df_poems_extras[~df_poems_extras['texts'].isin(poems_generation['texts'])]

In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_generation, 'texts', df_poems_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
poems_generation.to_csv("data/original/poems_for_generation.csv", index=False, encoding='utf-8')
print("File saved as poems_for_generation.csv")

df_poems_extras.to_csv("data/original/poems_human_extra.csv", index=False, encoding='utf-8')
print("File saved as poems_human_extra.csv")

File saved as poems_for_generation.csv
File saved as poems_human_extra.csv


#### Let's create the final human dataset for further experiments:

In [None]:
social_samples.head()

Unnamed: 0,texts,source,word_counts,genre
0,"При желании, по крайней мере до 5 версии ведра...",pikabu,17,social
1,😎 В сообществах ссылки бросать не принято. У ж...,vk,33,social
2,Между Зюгой и Монсоном стоит спортсменка Марья...,fb,26,social
3,Тогда другой пример: принтер Xerox 3124 в блок...,pikabu,41,social
4,"Скорее всего и показывает, если не этим нарост...",pikabu,27,social


In [None]:
news_samples.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,В СП рассказали о производстве бриллиантов в Р...,В СП рассказали о производстве бриллиантов в Р...,ria.ru,132,news
1,"Трамп не исключил, что США могут прекратить ве...","Трамп не исключил, что США могут прекратить ве...",ria.ru,81,news
2,Адвокаты Никулина просят отложить процесс в СШ...,Адвокаты Никулина просят отложить процесс в СШ...,ria.ru,417,news
3,"В Минске протестующие начали кидать в ОМОН ""ко...","В Минске протестующие начали кидать в ОМОН ""ко...",ria.ru,135,news
4,Найдена самая дешевая московская квартира,Найдена самая дешевая московская квартира\nКва...,lenta.ru,140,news


In [None]:
poems_samples.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,Рука Алкида тяжела…,"Рука Алкида тяжела…\nРука Алкида тяжела,\nУжас...",Толстой Алексей Константинович,42,poems
1,Я только сестра всему живому…,Я только сестра всему живому…\nЯ только сестра...,Аделаида Казимировна Герцык,85,poems
2,Альбаум,"Альбаум\nКогда аемны оставишь царствы,\nПойдеш...",Гавриил Романович Державин,233,poems
3,Эпиграмма на Д. И. Хвостова,Эпиграмма на Д. И. Хвостова\nПолезен ли другим...,Иван Андреевич Крылов,18,poems
4,"А.Н. Мальцевой («Пью ль мадеру, пью ли квас я…»)","А.Н. Мальцевой («Пью ль мадеру, пью ли квас я…...",Толстой Алексей Константинович,64,poems


In [None]:
print(len(social_samples))
print(len(news_samples))
print(len(poems_samples))

500
500
500


In [None]:
# Drop the 'title' column so that news and poems have the same columns as social media
news_samples1 = news_samples.drop(columns=['title'])
poems_samples1 = poems_samples.drop(columns=['title'])

In [None]:
# Concatenate the datasets
df_human = pd.concat([social_samples, news_samples1, poems_samples1], ignore_index=True)

In [None]:
# Add the 'class' column with a value of 0 for human data
df_human['class'] = 0

In [None]:
df_human.head()

Unnamed: 0,texts,source,word_counts,genre,class
0,"При желании, по крайней мере до 5 версии ведра...",pikabu,17,social,0
1,😎 В сообществах ссылки бросать не принято. У ж...,vk,33,social,0
2,Между Зюгой и Монсоном стоит спортсменка Марья...,fb,26,social,0
3,Тогда другой пример: принтер Xerox 3124 в блок...,pikabu,41,social,0
4,"Скорее всего и показывает, если не этим нарост...",pikabu,27,social,0


In [None]:
print(len(df_human))

1500


In [None]:
df_human['genre'].value_counts()

Unnamed: 0_level_0,count
genre,Unnamed: 1_level_1
social,500
news,500
poems,500


In [None]:
df_human.to_csv("data/original/human_final.csv", index=False, encoding='utf-8')
print("File saved as human_final.csv")

File saved as human_final.csv
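
As a final sanity check, all partitions can be compared pairwise in one pass. The helper below, `overlapping_texts`, is a hypothetical addition (not part of the original notebook) and is shown on toy data; in practice the dictionary would hold the real DataFrames, e.g. `df_human`, `social_generation`, and `df_social_extras`.

```python
from itertools import combinations

import pandas as pd


def overlapping_texts(partitions, column="texts"):
    """Return {(name_a, name_b): shared_texts} for every pair of partitions that overlaps."""
    overlaps = {}
    for (name_a, df_a), (name_b, df_b) in combinations(partitions.items(), 2):
        shared = set(df_a[column]) & set(df_b[column])
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy partitions: "t2" deliberately appears in two of them
partitions = {
    "final": pd.DataFrame({"texts": ["t1", "t2"]}),
    "generation": pd.DataFrame({"texts": ["t3", "t4"]}),
    "extra": pd.DataFrame({"texts": ["t5", "t2"]}),
}

result = overlapping_texts(partitions)
print(result)
```

An empty dictionary means the partitions are pairwise disjoint with respect to the chosen column.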
