## Data Partition Notebook
This notebook documents how the large human-written text datasets created in the first notebook, **Human Data Collection**, are partitioned into three groups of files:

1. **Human Final**: human_final.csv, a subset containing 500 samples per genre, reserved for later experiments.
2. **Subsets for Generation**: social_for_generation.csv, news_for_generation.csv and poems_for_generation.csv, sets of 1,000 samples per genre, selected to provide examples to the generative model in the next steps.
3. **Extra**: social_human_extra.csv, news_human_extra.csv and poems_human_extra.csv, remaining data after the above subsets were extracted, ensuring distinct and non-overlapping partitions.

Adjust the file paths to match your environment before running the cells.
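One way to make the paths easier to adjust is to define the data directory once and build every file name from it. The sketch below is only an illustration: `DATA_DIR` and `data_path` are hypothetical names that do not appear elsewhere in this notebook, and the directory is assumed from the `read_csv` calls used later.

```python
from pathlib import Path

# Hypothetical base directory; change it to wherever the CSV files live.
DATA_DIR = Path("data/original")


def data_path(filename: str) -> Path:
    """Return the full path of a file inside DATA_DIR (hypothetical helper)."""
    return DATA_DIR / filename


print(data_path("social_media.csv"))
```

With such a helper, a call like `pd.read_csv(data_path("news.csv"))` would only need `DATA_DIR` to change when the data moves.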


In [None]:
import pandas as pd
import numpy as np

### Load the files

In [None]:
# Make sure to check the file path
df_social = pd.read_csv('data/original/social_media.csv')
df_social.head()

Unnamed: 0,texts,source,word_counts,genre
0,"–ö—Å—Ç–∞—Ç–∏, –∫–∞–∫ –Ω–µ–æ–∂–∏–¥–∞–Ω–Ω–æ –ö–ü–†–§ —Å—Ç–∞–ª–æ –Ω–µ –≤—Å–µ —Ä–∞–≤–Ω–æ...",vk,14,social
1,"–º–æ–∂–Ω–æ –∏ –ø–æ-–¥—Ä—É–≥–æ–º—É —Å–∫–∞–∑–∞—Ç—å: ""—É–±–æ–≥–∞—è –∫–ª–æ—É–Ω–∞–¥–∞"" ...",vk,36,social
2,–í–æ—Ç –æ–Ω —Ç–æ–Ω–∫–∏–π –Ω–µ–∑–∞–º–µ—Ç–Ω—ã–π —Ö–æ–¥ –ø—Ä–æ—Ç–∏–≤ –†–æ—Å—Å–∏–∏. –ó—é...,vk,23,social
3,–ø—Ä–æ—Å—Ç–æ –≤ —ç—Ç–æ–º –ø–∞–±–ª–∏–∫–µ —Ä–∞–Ω—å—à–µ –ø–æ–¥–æ–±–Ω—ã—Ö –ø–æ—Å—Ç–æ–≤ –Ω...,vk,21,social
4,–≠—Ç–æ –Ω–µ –ö–ü–†–§ - —ç—Ç–æ —Ü–∏—Ä–∫. –ö–æ–º–º—É–Ω–∏–∑–º - —ç—Ç–æ —Å–æ–≤—Å–µ–º...,vk,12,social


In [None]:
df_poems = pd.read_csv('data/original/poems.csv')
df_poems.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,–ù–∞ —Å–µ—Ä–µ–±—Ä—è–Ω—ã–µ —à–ø–æ—Ä—ã‚Ä¶,–ù–∞ —Å–µ—Ä–µ–±—Ä—è–Ω—ã–µ —à–ø–æ—Ä—ã‚Ä¶\n–ù–∞ —Å–µ—Ä–µ–±—Ä—è–Ω—ã–µ —à–ø–æ—Ä—ã\n–Ø –≤...,–õ–µ—Ä–º–æ–Ω—Ç–æ–≤ –ú–∏—Ö–∞–∏–ª –Æ—Ä—å–µ–≤–∏—á,59,poems
1,–í–∏–¥ –≥–æ—Ä –∏–∑ —Å—Ç–µ–ø–µ–π –ö–æ–∑–ª–æ–≤–∞,–í–∏–¥ –≥–æ—Ä –∏–∑ —Å—Ç–µ–ø–µ–π –ö–æ–∑–ª–æ–≤–∞\n–ü–∏–ª–∏–≥—Ä–∏–º\n–ê–ª–ª–∞—Ö –ª–∏ ...,–õ–µ—Ä–º–æ–Ω—Ç–æ–≤ –ú–∏—Ö–∞–∏–ª –Æ—Ä—å–µ–≤–∏—á,113,poems
2,"–ö (–û, –Ω–µ —Å–∫—Ä—ã–≤–∞–π! –¢—ã –ø–ª–∞–∫–∞–ª–∞ –æ–± –Ω–µ–º‚Ä¶)","–ö (–û, –Ω–µ —Å–∫—Ä—ã–≤–∞–π! –¢—ã –ø–ª–∞–∫–∞–ª–∞ –æ–± –Ω–µ–º‚Ä¶)\n–û, –Ω–µ ...",–õ–µ—Ä–º–æ–Ω—Ç–æ–≤ –ú–∏—Ö–∞–∏–ª –Æ—Ä—å–µ–≤–∏—á,63,poems
3,"–ñ–∞–ª–æ–±—ã —Ç—É—Ä–∫–∞ (–ø–∏—Å—å–º–æ –∫ –¥—Ä—É–≥—É, –∏–Ω–æ—Å—Ç—Ä–∞–Ω—Ü—É)","–ñ–∞–ª–æ–±—ã —Ç—É—Ä–∫–∞ (–ø–∏—Å—å–º–æ –∫ –¥—Ä—É–≥—É, –∏–Ω–æ—Å—Ç—Ä–∞–Ω—Ü—É)\n–¢—ã ...",–õ–µ—Ä–º–æ–Ω—Ç–æ–≤ –ú–∏—Ö–∞–∏–ª –Æ—Ä—å–µ–≤–∏—á,98,poems
4,–ö –∫–Ω. –õ. –ì-–æ–π,–ö –∫–Ω. –õ. –ì-–æ–π\n–ö–æ–≥–¥–∞ —Ç—ã —Ö–æ–ª–æ–¥–Ω–æ –≤–Ω–∏–º–∞–µ—à—å\n–†–∞—Å—Å...,–õ–µ—Ä–º–æ–Ω—Ç–æ–≤ –ú–∏—Ö–∞–∏–ª –Æ—Ä—å–µ–≤–∏—á,104,poems


In [None]:
df_news = pd.read_csv('data/original/news.csv')
df_news.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,–°–∏–Ω–∏–π –±–æ–≥–∞—Ç—ã—Ä—å,–°–∏–Ω–∏–π –±–æ–≥–∞—Ç—ã—Ä—å\n–í 1930-–µ –≥–æ–¥—ã –°–æ–≤–µ—Ç—Å–∫–∏–π –°–æ—é–∑ –æ...,lenta.ru,1905,news
1,–ó–∞–≥–∏—Ç–æ–≤–∞ —Å–æ–≥–ª–∞—Å–∏–ª–∞—Å—å –≤–µ—Å—Ç–∏ ¬´–õ–µ–¥–Ω–∏–∫–æ–≤—ã–π –ø–µ—Ä–∏–æ–¥¬ª,–ó–∞–≥–∏—Ç–æ–≤–∞ —Å–æ–≥–ª–∞—Å–∏–ª–∞—Å—å –≤–µ—Å—Ç–∏ ¬´–õ–µ–¥–Ω–∏–∫–æ–≤—ã–π –ø–µ—Ä–∏–æ–¥¬ª...,lenta.ru,154,news
2,–û–±—ä—è—Å–Ω–µ–Ω–∞ –æ–ø–∞—Å–Ω–æ—Å—Ç—å –æ–¥–Ω–æ–æ–±—Ä–∞–∑–Ω–æ–≥–æ –ø–∏—Ç–∞–Ω–∏—è,–û–±—ä—è—Å–Ω–µ–Ω–∞ –æ–ø–∞—Å–Ω–æ—Å—Ç—å –æ–¥–Ω–æ–æ–±—Ä–∞–∑–Ω–æ–≥–æ –ø–∏—Ç–∞–Ω–∏—è\n–†–æ—Å...,lenta.ru,140,news
3,¬´–ü—Ä–µ–¥–æ—Ö—Ä–∞–Ω—è—Ç—å—Å—è? –ê¬†–∑–∞—á–µ–º?¬ª,¬´–ü—Ä–µ–¥–æ—Ö—Ä–∞–Ω—è—Ç—å—Å—è? –ê¬†–∑–∞—á–µ–º?¬ª\n–í 2019 –≥–æ–¥—É —Ç–µ–ª–µ–∫–∞...,lenta.ru,2915,news
4,–ï—Ñ—Ä–µ–º–æ–≤ —Å–∏—Å—Ç–µ–º–∞—Ç–∏—á–µ—Å–∫–∏ —É–ø–æ—Ç—Ä–µ–±–ª—è–ª –Ω–∞—Ä–∫–æ—Ç–∏–∫–∏,–ï—Ñ—Ä–µ–º–æ–≤ —Å–∏—Å—Ç–µ–º–∞—Ç–∏—á–µ—Å–∫–∏ —É–ø–æ—Ç—Ä–µ–±–ª—è–ª –Ω–∞—Ä–∫–æ—Ç–∏–∫–∏\n–ê...,lenta.ru,139,news


### Data Partition
We need to set aside the human data that will be used later in the experiments and ensure that it is not also used as examples for model text generation. The selected samples should be representative in terms of word count distribution and balanced by genre and source (e.g., for the social media genre: VK, Facebook, Pikabu, etc.).
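To illustrate what stratifying by source and word count means, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration only). It cuts the word counts into quartile bins with `pd.qcut` and counts how many rows fall into each (source, bin) stratum.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "source": ["vk", "vk", "vk", "vk", "fb", "fb", "fb", "fb"],
    "word_counts": [3, 10, 25, 80, 5, 12, 30, 200],
})

# Quartile bins over the word counts
toy["word_count_bin"] = pd.qcut(toy["word_counts"], q=4, duplicates="drop")

# Number of rows in each (source, word_count_bin) stratum
strata_sizes = toy.groupby(["source", "word_count_bin"], observed=False).size()
print(strata_sizes)
```

Sampling a fixed number of rows from every stratum keeps both short and long texts from each source in the selection.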

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def stratify_and_split(data, n_samples):
    """
    Stratify and set aside samples based on source and word count.

    Parameters:
        data (pd.DataFrame): The input dataset with 'source' and 'word_counts' columns.
        n_samples (int): Number of samples to set aside.

    Returns:
        pd.DataFrame: Stratified sampled data (at most n_samples rows).
        pd.DataFrame: Remaining data after exclusion.
    """
    # Create word count quartile bins for stratification.
    # Note: this adds a 'word_count_bin' column to the caller's DataFrame in place.
    data['word_count_bin'] = pd.qcut(data['word_counts'], q=4, duplicates='drop')

    # Sample from each (source, word_count_bin) group, taking at most
    # n_samples / number_of_sources rows per group
    stratified_data = (
        data.groupby(['source', 'word_count_bin'], group_keys=False, observed=False)
        .apply(
            lambda group: group.sample(
                n=min(len(group), int(n_samples / len(data['source'].unique()))),
                random_state=42
            )
        )
    )

    # Downsample so that the final sample has at most n_samples rows
    stratified_data = stratified_data.sample(n=min(len(stratified_data), n_samples), random_state=42)

    # Drop the stratified samples from the original dataset
    remaining_data = data.drop(index=stratified_data.index).copy()

    # Reset index after exclusion to clean up the data
    stratified_data = stratified_data.reset_index(drop=True)
    remaining_data = remaining_data.reset_index(drop=True)

    # Remove the temporary 'word_count_bin' column
    stratified_data = stratified_data.drop(columns=['word_count_bin'], errors='ignore')
    remaining_data = remaining_data.drop(columns=['word_count_bin'], errors='ignore')

    return stratified_data, remaining_data
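In recent pandas versions, calling `GroupBy.apply` with a function that receives the grouping columns may emit a deprecation warning. The sketch below shows one possible warning-free way to draw a capped number of random rows per stratum: shuffle the rows, then take the first rows of each group with `GroupBy.head`. It is an alternative technique, not the function used in this notebook; `sample_per_stratum` and the toy data are hypothetical.

```python
import pandas as pd


def sample_per_stratum(data, per_group, seed=42):
    """Hypothetical helper: up to `per_group` random rows per (source, bin) stratum."""
    binned = data.copy()  # leave the caller's DataFrame untouched
    binned["word_count_bin"] = pd.qcut(
        binned["word_counts"], q=4, duplicates="drop"
    )
    # Shuffle once, then keep the first `per_group` rows of every stratum
    shuffled = binned.sample(frac=1, random_state=seed)
    picked = shuffled.groupby(
        ["source", "word_count_bin"], observed=False
    ).head(per_group)
    return picked.drop(columns="word_count_bin")


# Toy data, invented for illustration
toy = pd.DataFrame({
    "texts": [f"text {i}" for i in range(12)],
    "source": ["vk"] * 6 + ["fb"] * 6,
    "word_counts": [3, 10, 25, 80, 7, 40, 5, 12, 30, 200, 9, 60],
})

picked = sample_per_stratum(toy, per_group=1)
remaining = toy.drop(index=picked.index)
print(len(picked), len(remaining))
```

Because the original index is preserved, the remaining rows can still be obtained with `drop(index=...)`, as in `stratify_and_split`.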

**Let's start with the social media dataset:**

In [None]:
# Set aside 500 samples for social media genre
social_samples, df_social_remaining = stratify_and_split(df_social, 500)


In [None]:
print(len(social_samples))

500


In [None]:
social_samples.head()

Unnamed: 0,texts,source,word_counts,genre
0,"–ü—Ä–∏ –∂–µ–ª–∞–Ω–∏–∏, –ø–æ –∫—Ä–∞–π–Ω–µ–π –º–µ—Ä–µ –¥–æ 5 –≤–µ—Ä—Å–∏–∏ –≤–µ–¥—Ä–∞...",pikabu,17,social
1,üòé –í —Å–æ–æ–±—â–µ—Å—Ç–≤–∞—Ö —Å—Å—ã–ª–∫–∏ –±—Ä–æ—Å–∞—Ç—å –Ω–µ –ø—Ä–∏–Ω—è—Ç–æ. –£ –∂...,vk,33,social
2,–ú–µ–∂–¥—É –ó—é–≥–æ–π –∏ –ú–æ–Ω—Å–æ–Ω–æ–º —Å—Ç–æ–∏—Ç —Å–ø–æ—Ä—Ç—Å–º–µ–Ω–∫–∞ –ú–∞—Ä—å—è...,fb,26,social
3,–¢–æ–≥–¥–∞ –¥—Ä—É–≥–æ–π –ø—Ä–∏–º–µ—Ä: –ø—Ä–∏–Ω—Ç–µ—Ä Xerox 3124 –≤ –±–ª–æ–∫...,pikabu,41,social
4,"–°–∫–æ—Ä–µ–µ –≤—Å–µ–≥–æ –∏ –ø–æ–∫–∞–∑—ã–≤–∞–µ—Ç, –µ—Å–ª–∏ –Ω–µ —ç—Ç–∏–º –Ω–∞—Ä–æ—Å—Ç...",pikabu,27,social


In [None]:
#df_social_remaining.head()

In [None]:
social_samples['source'].value_counts()

Unnamed: 0_level_0,count
source,Unnamed: 1_level_1
fb,177
vk,163
pikabu,160


In [None]:
social_samples['word_counts'].describe()

Unnamed: 0,word_counts
count,500.0
mean,41.102
std,72.218908
min,1.0
25%,11.0
50%,20.0
75%,43.0
max,708.0
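To judge whether a sample is representative in terms of word count, its quantiles can be placed next to those of the full dataset. The following is a minimal sketch on made-up data; `compare_quantiles` is a hypothetical helper and not part of this notebook.

```python
import pandas as pd


def compare_quantiles(full, sample, column="word_counts", qs=(0.25, 0.5, 0.75)):
    """Hypothetical helper: quantiles of `column` in the full data and in the sample."""
    qs = list(qs)
    return pd.DataFrame({
        "full": full[column].quantile(qs),
        "sample": sample[column].quantile(qs),
    })


# Toy data, invented for illustration
toy_full = pd.DataFrame({"word_counts": range(1, 101)})
toy_sample = toy_full.sample(n=20, random_state=42)

report = compare_quantiles(toy_full, toy_sample)
print(report)
```

Applied to the real data, the call would look like `compare_quantiles(df_social, social_samples)`; similar values in both columns suggest that the sample follows the overall distribution.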


In [None]:
print((social_samples['word_counts'] == 1).sum())

2


In [None]:
social_samples[social_samples['word_counts'] == 1]

Unnamed: 0,texts,source,word_counts,genre
171,–ë—ã–≤–∞–ª 03-05,pikabu,1,social
439,—ç—Ç–æ –ï–†‚ÇΩ)),vk,1,social


In [None]:
len(df_social)

632216

In [None]:
len(df_social_remaining)

631716

Cross-check that the function worked properly and that the sampled texts were indeed removed from the original df:

In [None]:
def filter_common_rows(df1, column1, df2, column2):
    """Return the rows of df1 whose column1 value also occurs in df2[column2], and their count."""
    common_mask = df1[column1].isin(df2[column2])
    common_rows_df = df1[common_mask]

    return common_rows_df, len(common_rows_df)

In [None]:
common_texts_df, common_count = filter_common_rows(
    social_samples, 'texts', df_social_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
# Set aside 1000 samples for data generation purposes
social_generation, df_social_extras = stratify_and_split(df_social_remaining, 1000)


In [None]:
len(social_generation)

1000

In [None]:
len(df_social_remaining)

631716

In [None]:
len(df_social_extras)

630716

In [None]:
social_generation['word_counts'].describe()

Unnamed: 0,word_counts
count,1000.0
mean,39.941
std,70.586585
min,1.0
25%,10.0
50%,20.0
75%,43.0
max,1125.0


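A few texts in `social_generation` also occur in `df_social_extras`, because the same text appears in more than one row of the source data. We remove them from the df with extra texts: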

In [None]:
# Remove rows from df_social_extras that have texts in common with social_generation
df_social_extras = df_social_extras[~df_social_extras['texts'].isin(social_generation['texts'])]
print(f"Updated df_social_extras with {len(df_social_extras)} rows remaining.")

Updated df_social_extras with 630713 rows remaining.


In [None]:
# Now there are no common texts between the two dfs
common_texts_df, common_count = filter_common_rows(
    social_generation, 'texts', df_social_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0
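The split removes rows by index, so two partitions can still share a text when the source data contains the same text in several rows. The sketch below reproduces this on made-up data and shows one possible alternative, deduplicating by text before splitting. This is an assumption about how the overlap could be avoided, not a step taken in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration: the text "a" occurs in two rows
toy = pd.DataFrame({"texts": ["a", "b", "a", "c", "d"]})

# Split by index: rows 0 and 1 are sampled, the rest remain
sampled = toy.loc[[0, 1]]
remaining = toy.drop(index=sampled.index)

# The rows are disjoint, but the duplicate "a" is still in `remaining`
overlap = remaining["texts"].isin(sampled["texts"]).sum()
print(overlap)  # 1

# Alternative: deduplicate by text first, then split
deduplicated = toy.drop_duplicates(subset="texts")
sampled_dd = deduplicated.loc[[0, 1]]
remaining_dd = deduplicated.drop(index=sampled_dd.index)
overlap_dd = remaining_dd["texts"].isin(sampled_dd["texts"]).sum()
print(overlap_dd)  # 0
```

Filtering the extras after the split, as done above, gives the same guarantee without changing the data that is sampled from.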


Save to files for further use.

In [None]:
social_generation.to_csv("data/original/social_for_generation.csv", index=False, encoding='utf-8')
print("File saved as social_for_generation.csv")

File saved as social_for_generation.csv


In [None]:
df_social_extras.to_csv("data/original/social_human_extra.csv", index=False, encoding='utf-8')
print("File saved as social_human_extra.csv")

File saved as social_human_extra.csv


**We'll repeat the same steps for the news genre:**

In [None]:
# Set aside 500 samples for news genre
news_samples, df_news_remaining = stratify_and_split(df_news, 500)


In [None]:
common_texts_df, common_count = filter_common_rows(
    news_samples, 'texts', df_news_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
print(len(df_news))
print(len(news_samples))
print(len(df_news_remaining))

20801
500
20301


In [None]:
news_samples['source'].value_counts()

Unnamed: 0_level_0,count
source,Unnamed: 1_level_1
lenta.ru,185
ria.ru,174
meduza.io,141


In [None]:
news_samples['word_counts'].describe()

Unnamed: 0,word_counts
count,500.0
mean,274.206
std,397.248643
min,38.0
25%,132.75
50%,175.0
75%,245.25
max,4238.0


In [None]:
# Set aside 1000 samples for data generation purposes
news_generation, df_news_extras = stratify_and_split(df_news_remaining, 1000)


In [None]:
print(len(news_generation))
print(len(df_news_extras))
print(len(df_news_remaining))

1000
19301
20301


In [None]:
common_texts_df, common_count = filter_common_rows(
    news_generation, 'texts', df_news_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
news_generation.to_csv("data/original/news_for_generation.csv", index=False, encoding='utf-8')
print("File saved as news_for_generation.csv")

df_news_extras.to_csv("data/original/news_human_extra.csv", index=False, encoding='utf-8')
print("File saved as news_human_extra.csv")

File saved as news_for_generation.csv
File saved as news_human_extra.csv


**We'll repeat these steps one more time for the poems genre:**

In [None]:
# There are very long poems in this dataset. We'd like to inspect them and possibly exclude them from our analysis.
df_poems['word_counts'].describe()

Unnamed: 0,word_counts
count,19302.0
mean,190.239043
std,802.926799
min,5.0
25%,55.0
50%,85.0
75%,139.0
max,30118.0


In [None]:
threshold = 1000
very_long_poems = df_poems[df_poems['word_counts'] > threshold]

In [None]:
# This dataset appears to include long poems and plays.
# There are only a few such instances.
len(very_long_poems)

344

In [None]:
very_long_poems.sample(10)

Unnamed: 0,title,texts,source,word_counts,genre,word_count_bin
14170,–û–≥–æ–Ω—å,"–û–≥–æ–Ω—å\n–ù–µ —É—Å—Ç–∞–Ω—É —Ç–µ–±—è –≤–æ—Å—Ö–≤–∞–ª—è—Ç—å,\n–û, –≤–Ω–µ–∑–∞–ø–Ω—ã...",–ë–∞–ª—å–º–æ–Ω—Ç –ö–æ–Ω—Å—Ç–∞–Ω—Ç–∏–Ω –î–º–∏—Ç—Ä–∏–µ–≤–∏—á,1625,poems,"(139.0, 30118.0]"
7517,–ù–∞ –í–æ–ª–≥–µ,–ù–∞ –í–æ–ª–≥–µ\n(–î–µ—Ç—Å—Ç–≤–æ –í–∞–ª–µ–∂–Ω–∏–∫–æ–≤–∞)\n1\n. . . . . ...,–ù–µ–∫—Ä–∞—Å–æ–≤ –ù–∏–∫–æ–ª–∞–π –ê–ª–µ–∫—Å–µ–µ–≤–∏—á,1297,poems,"(139.0, 30118.0]"
16870,–ü–µ—Å–Ω—è –æ –ø–æ—Ö–æ–¥–µ –í–ª–∞–¥–∏–º–∏—Ä–∞ –Ω–∞ –ö–æ—Ä—Å—É–Ω—å,–ü–µ—Å–Ω—è –æ –ø–æ—Ö–æ–¥–µ –í–ª–∞–¥–∏–º–∏—Ä–∞ –Ω–∞ –ö–æ—Ä—Å—É–Ω—å\n–ß–ê–°–¢–¨ –ü–ï–†...,–¢–æ–ª—Å—Ç–æ–π –ê–ª–µ–∫—Å–µ–π –ö–æ–Ω—Å—Ç–∞–Ω—Ç–∏–Ω–æ–≤–∏—á,1257,poems,"(139.0, 30118.0]"
7306,–°–∫–∞–∑–∫–∞ –æ —Ü–∞—Ä–µ–≤–Ω–µ –Ø—Å–Ω–æ—Å–≤–µ—Ç–µ,"–°–∫–∞–∑–∫–∞ –æ —Ü–∞—Ä–µ–≤–Ω–µ –Ø—Å–Ω–æ—Å–≤–µ—Ç–µ\n–¶—ã–ø, —Ü—ã–ø, —Ü—ã–ø! –∫–æ ...",–ù–µ–∫—Ä–∞—Å–æ–≤ –ù–∏–∫–æ–ª–∞–π –ê–ª–µ–∫—Å–µ–µ–≤–∏—á,3011,poems,"(139.0, 30118.0]"
3468,–í—Å—Ç—Ä–µ—á–∞,–í—Å—Ç—Ä–µ—á–∞\n–†–∞—Å—Å–∫–∞–∑ –≤ —Å—Ç–∏—Ö–∞—Ö\n–ü–æ—Å–≤—è—â–∞–µ—Ç—Å—è –ê.–§–µ—Ç—É\...,–ê–ø–æ–ª–ª–æ–Ω –ê–ª–µ–∫—Å–∞–Ω–¥—Ä–æ–≤–∏—á –ì—Ä–∏–≥–æ—Ä—å–µ–≤,1117,poems,"(139.0, 30118.0]"
188,–ò—Å–ø–æ–≤–µ–¥—å,–ò—Å–ø–æ–≤–µ–¥—å\nI\n–î–µ–Ω—å –≥–∞—Å; –≤ –Ω–∞—Ä—è–¥–µ –≥–æ–ª—É–±–æ–º\n–ö—Ä—É—Ç—è...,–õ–µ—Ä–º–æ–Ω—Ç–æ–≤ –ú–∏—Ö–∞–∏–ª –Æ—Ä—å–µ–≤–∏—á,1066,poems,"(139.0, 30118.0]"
1820,–ú–µ–ª–∞–Ω–∏–ø–ø–∞-—Ñ–∏–ª–æ—Å–æ—Ñ. –¢—Ä–∞–≥–µ–¥–∏—è,–ú–µ–ª–∞–Ω–∏–ø–ø–∞-—Ñ–∏–ª–æ—Å–æ—Ñ. –¢—Ä–∞–≥–µ–¥–∏—è\n–ü–æ—Å–≤—è—â–∞–µ—Ç—Å—è\n–ë–æ—Ä–∏...,–ê–Ω–Ω–µ–Ω—Å–∫–∏–π –ò–Ω–Ω–æ–∫–µ–Ω—Ç–∏–π –§–µ–¥–æ—Ä–æ–≤–∏—á,10196,poems,"(139.0, 30118.0]"
827,–†—É—Å–∞–ª–∫–∞,"–†—É—Å–∞–ª–∫–∞\n–ë–ï–†–ï–ì –î–ù–ï–ü–†–ê. –ú–ï–õ–¨–ù–ò–¶–ê\n–ú–µ–ª—å–Ω–∏–∫ , –î–æ—á...",–ü—É—à–∫–∏–Ω –ê–ª–µ–∫—Å–∞–Ω–¥—Ä –°–µ—Ä–≥–µ–µ–≤–∏—á,3665,poems,"(139.0, 30118.0]"
15355,–ù–∞ —Å—á–∞—Å—Ç–∏–µ,"–ù–∞ —Å—á–∞—Å—Ç–∏–µ\n–í—Å–µ–≥–¥–∞ –ø—Ä–µ—Ö–≤–∞–ª—å–Ω–æ, –ø—Ä–µ–ø–æ—á—Ç–µ–º–Ω–æ,\n–í...",–ì–∞–≤—Ä–∏–∏–ª –†–æ–º–∞–Ω–æ–≤–∏—á –î–µ—Ä–∂–∞–≤–∏–Ω,1001,poems,"(139.0, 30118.0]"
12811,–í–æ–ª—å–Ω—ã–µ –º—ã—Å–ª–∏ (1907),–í–æ–ª—å–Ω—ã–µ –º—ã—Å–ª–∏ (1907)\n(–ü–æ—Å–≤. –ì. –ß—É–ª–∫–æ–≤—É)\n–û —Å–º...,–ë–ª–æ–∫ –ê–ª–µ–∫—Å–∞–Ω–¥—Ä –ê–ª–µ–∫—Å–∞–Ω–¥—Ä–æ–≤–∏—á,1827,poems,"(139.0, 30118.0]"


In [None]:
# We will remove them from the final dataset to focus on shorter poems only
df_poems = df_poems[df_poems['word_counts'] <= 1000]

In [None]:
df_poems['word_counts'].describe()

Unnamed: 0,word_counts
count,18958.0
mean,119.92668
std,122.293804
min,5.0
25%,54.0
50%,83.0
75%,135.0
max,1000.0


In [None]:
# Set aside 500 samples for poems genre
poems_samples, df_poems_remaining = stratify_and_split(df_poems, 500)


In [None]:
print(len(df_poems))
print(len(poems_samples))
print(len(df_poems_remaining))

18958
500
18458


In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_samples, 'texts', df_poems_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 2


In [None]:
poems_samples.duplicated().sum()

0

In [None]:
df_poems_remaining = df_poems_remaining[~df_poems_remaining['texts'].isin(poems_samples['texts'])]

In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_samples, 'texts', df_poems_remaining, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
poems_samples['word_counts'].describe()

Unnamed: 0,word_counts
count,500.0
mean,125.592
std,132.344897
min,10.0
25%,53.0
50%,84.5
75%,138.5
max,967.0


In [None]:
# Set aside 1000 samples for data generation purposes
poems_generation, df_poems_extras = stratify_and_split(df_poems_remaining, 1000)


In [None]:
print(len(poems_generation))
print(len(df_poems_extras))
print(len(df_poems_remaining))

1000
17456
18456


In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_generation, 'texts', df_poems_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 3


In [None]:
# We will remove them from the remaining df to avoid using these texts twice.
df_poems_extras = df_poems_extras[~df_poems_extras['texts'].isin(poems_generation['texts'])]

In [None]:
common_texts_df, common_count = filter_common_rows(
    poems_generation, 'texts', df_poems_extras, 'texts'
)

print(f"Number of common texts: {common_count}")

Number of common texts: 0


In [None]:
poems_generation.to_csv("data/original/poems_for_generation.csv", index=False, encoding='utf-8')
print("File saved as poems_for_generation.csv")

df_poems_extras.to_csv("data/original/poems_human_extra.csv", index=False, encoding='utf-8')
print("File saved as poems_human_extra.csv")

File saved as poems_for_generation.csv
File saved as poems_human_extra.csv


#### Let's create the final human dataset for further experiments:

In [None]:
social_samples.head()

Unnamed: 0,texts,source,word_counts,genre
0,"–ü—Ä–∏ –∂–µ–ª–∞–Ω–∏–∏, –ø–æ –∫—Ä–∞–π–Ω–µ–π –º–µ—Ä–µ –¥–æ 5 –≤–µ—Ä—Å–∏–∏ –≤–µ–¥—Ä–∞...",pikabu,17,social
1,üòé –í —Å–æ–æ–±—â–µ—Å—Ç–≤–∞—Ö —Å—Å—ã–ª–∫–∏ –±—Ä–æ—Å–∞—Ç—å –Ω–µ –ø—Ä–∏–Ω—è—Ç–æ. –£ –∂...,vk,33,social
2,–ú–µ–∂–¥—É –ó—é–≥–æ–π –∏ –ú–æ–Ω—Å–æ–Ω–æ–º —Å—Ç–æ–∏—Ç —Å–ø–æ—Ä—Ç—Å–º–µ–Ω–∫–∞ –ú–∞—Ä—å—è...,fb,26,social
3,–¢–æ–≥–¥–∞ –¥—Ä—É–≥–æ–π –ø—Ä–∏–º–µ—Ä: –ø—Ä–∏–Ω—Ç–µ—Ä Xerox 3124 –≤ –±–ª–æ–∫...,pikabu,41,social
4,"–°–∫–æ—Ä–µ–µ –≤—Å–µ–≥–æ –∏ –ø–æ–∫–∞–∑—ã–≤–∞–µ—Ç, –µ—Å–ª–∏ –Ω–µ —ç—Ç–∏–º –Ω–∞—Ä–æ—Å—Ç...",pikabu,27,social


In [None]:
news_samples.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,–í –°–ü —Ä–∞—Å—Å–∫–∞–∑–∞–ª–∏ –æ –ø—Ä–æ–∏–∑–≤–æ–¥—Å—Ç–≤–µ –±—Ä–∏–ª–ª–∏–∞–Ω—Ç–æ–≤ –≤ –†...,–í –°–ü —Ä–∞—Å—Å–∫–∞–∑–∞–ª–∏ –æ –ø—Ä–æ–∏–∑–≤–æ–¥—Å—Ç–≤–µ –±—Ä–∏–ª–ª–∏–∞–Ω—Ç–æ–≤ –≤ –†...,ria.ru,132,news
1,"–¢—Ä–∞–º–ø –Ω–µ –∏—Å–∫–ª—é—á–∏–ª, —á—Ç–æ –°–®–ê –º–æ–≥—É—Ç –ø—Ä–µ–∫—Ä–∞—Ç–∏—Ç—å –≤–µ...","–¢—Ä–∞–º–ø –Ω–µ –∏—Å–∫–ª—é—á–∏–ª, —á—Ç–æ –°–®–ê –º–æ–≥—É—Ç –ø—Ä–µ–∫—Ä–∞—Ç–∏—Ç—å –≤–µ...",ria.ru,81,news
2,–ê–¥–≤–æ–∫–∞—Ç—ã –ù–∏–∫—É–ª–∏–Ω–∞ –ø—Ä–æ—Å—è—Ç –æ—Ç–ª–æ–∂–∏—Ç—å –ø—Ä–æ—Ü–µ—Å—Å –≤ –°–®...,–ê–¥–≤–æ–∫–∞—Ç—ã –ù–∏–∫—É–ª–∏–Ω–∞ –ø—Ä–æ—Å—è—Ç –æ—Ç–ª–æ–∂–∏—Ç—å –ø—Ä–æ—Ü–µ—Å—Å –≤ –°–®...,ria.ru,417,news
3,"–í –ú–∏–Ω—Å–∫–µ –ø—Ä–æ—Ç–µ—Å—Ç—É—é—â–∏–µ –Ω–∞—á–∞–ª–∏ –∫–∏–¥–∞—Ç—å –≤ –û–ú–û–ù ""–∫–æ...","–í –ú–∏–Ω—Å–∫–µ –ø—Ä–æ—Ç–µ—Å—Ç—É—é—â–∏–µ –Ω–∞—á–∞–ª–∏ –∫–∏–¥–∞—Ç—å –≤ –û–ú–û–ù ""–∫–æ...",ria.ru,135,news
4,–ù–∞–π–¥–µ–Ω–∞ —Å–∞–º–∞—è –¥–µ—à–µ–≤–∞—è –º–æ—Å–∫–æ–≤—Å–∫–∞—è –∫–≤–∞—Ä—Ç–∏—Ä–∞,–ù–∞–π–¥–µ–Ω–∞ —Å–∞–º–∞—è –¥–µ—à–µ–≤–∞—è –º–æ—Å–∫–æ–≤—Å–∫–∞—è –∫–≤–∞—Ä—Ç–∏—Ä–∞\n–ö–≤–∞...,lenta.ru,140,news


In [None]:
poems_samples.head()

Unnamed: 0,title,texts,source,word_counts,genre
0,–†—É–∫–∞ –ê–ª–∫–∏–¥–∞ —Ç—è–∂–µ–ª–∞‚Ä¶,"–†—É–∫–∞ –ê–ª–∫–∏–¥–∞ —Ç—è–∂–µ–ª–∞‚Ä¶\n–†—É–∫–∞ –ê–ª–∫–∏–¥–∞ —Ç—è–∂–µ–ª–∞,\n–£–∂–∞—Å...",–¢–æ–ª—Å—Ç–æ–π –ê–ª–µ–∫—Å–µ–π –ö–æ–Ω—Å—Ç–∞–Ω—Ç–∏–Ω–æ–≤–∏—á,42,poems
1,–Ø —Ç–æ–ª—å–∫–æ —Å–µ—Å—Ç—Ä–∞ –≤—Å–µ–º—É –∂–∏–≤–æ–º—É‚Ä¶,–Ø —Ç–æ–ª—å–∫–æ —Å–µ—Å—Ç—Ä–∞ –≤—Å–µ–º—É –∂–∏–≤–æ–º—É‚Ä¶\n–Ø —Ç–æ–ª—å–∫–æ —Å–µ—Å—Ç—Ä–∞...,–ê–¥–µ–ª–∞–∏–¥–∞ –ö–∞–∑–∏–º–∏—Ä–æ–≤–Ω–∞ –ì–µ—Ä—Ü—ã–∫,85,poems
2,–ê–ª—å–±–∞—É–º,"–ê–ª—å–±–∞—É–º\n–ö–æ–≥–¥–∞ –∞–µ–º–Ω—ã –æ—Å—Ç–∞–≤–∏—à—å —Ü–∞—Ä—Å—Ç–≤—ã,\n–ü–æ–π–¥–µ—à...",–ì–∞–≤—Ä–∏–∏–ª –†–æ–º–∞–Ω–æ–≤–∏—á –î–µ—Ä–∂–∞–≤–∏–Ω,233,poems
3,–≠–ø–∏–≥—Ä–∞–º–º–∞ –Ω–∞ –î. –ò. –•–≤–æ—Å—Ç–æ–≤–∞,–≠–ø–∏–≥—Ä–∞–º–º–∞ –Ω–∞ –î. –ò. –•–≤–æ—Å—Ç–æ–≤–∞\n–ü–æ–ª–µ–∑–µ–Ω –ª–∏ –¥—Ä—É–≥–∏–º...,–ò–≤–∞–Ω –ê–Ω–¥—Ä–µ–µ–≤–∏—á –ö—Ä—ã–ª–æ–≤,18,poems
4,"–ê.–ù. –ú–∞–ª—å—Ü–µ–≤–æ–π (¬´–ü—å—é –ª—å –º–∞–¥–µ—Ä—É, –ø—å—é –ª–∏ –∫–≤–∞—Å —è‚Ä¶¬ª)","–ê.–ù. –ú–∞–ª—å—Ü–µ–≤–æ–π (¬´–ü—å—é –ª—å –º–∞–¥–µ—Ä—É, –ø—å—é –ª–∏ –∫–≤–∞—Å —è‚Ä¶...",–¢–æ–ª—Å—Ç–æ–π –ê–ª–µ–∫—Å–µ–π –ö–æ–Ω—Å—Ç–∞–Ω—Ç–∏–Ω–æ–≤–∏—á,64,poems


In [None]:
print(len(social_samples))
print(len(news_samples))
print(len(poems_samples))

500
500
500


In [None]:
# Drop the 'title' column
news_samples1 = news_samples.drop(columns=['title'])
poems_samples1 = poems_samples.drop(columns=['title'])

In [None]:
# Concatenate the datasets
df_human = pd.concat([social_samples, news_samples1, poems_samples1], ignore_index=True)
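`pd.concat` aligns DataFrames by column name, which is presumably why the `title` column is dropped before concatenation: the social media data has no such column. A minimal sketch on made-up data (all frames here are assumptions for illustration) shows the difference.

```python
import pandas as pd

# Toy data, invented for illustration
social = pd.DataFrame({"texts": ["s1"], "genre": ["social"]})
news = pd.DataFrame({"title": ["t1"], "texts": ["n1"], "genre": ["news"]})

# Keeping 'title' leaves a missing value for the social media row
with_title = pd.concat([social, news], ignore_index=True)
print(with_title["title"].isna().sum())  # 1

# Dropping 'title' first gives identical columns and no missing values
without_title = pd.concat(
    [social, news.drop(columns=["title"])], ignore_index=True
)
print(list(without_title.columns))  # ['texts', 'genre']
```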

In [None]:
# Add the 'class' column with a value of 0 for human data
df_human['class'] = 0

In [None]:
df_human.head()

Unnamed: 0,texts,source,word_counts,genre,class
0,"–ü—Ä–∏ –∂–µ–ª–∞–Ω–∏–∏, –ø–æ –∫—Ä–∞–π–Ω–µ–π –º–µ—Ä–µ –¥–æ 5 –≤–µ—Ä—Å–∏–∏ –≤–µ–¥—Ä–∞...",pikabu,17,social,0
1,üòé –í —Å–æ–æ–±—â–µ—Å—Ç–≤–∞—Ö —Å—Å—ã–ª–∫–∏ –±—Ä–æ—Å–∞—Ç—å –Ω–µ –ø—Ä–∏–Ω—è—Ç–æ. –£ –∂...,vk,33,social,0
2,–ú–µ–∂–¥—É –ó—é–≥–æ–π –∏ –ú–æ–Ω—Å–æ–Ω–æ–º —Å—Ç–æ–∏—Ç —Å–ø–æ—Ä—Ç—Å–º–µ–Ω–∫–∞ –ú–∞—Ä—å—è...,fb,26,social,0
3,–¢–æ–≥–¥–∞ –¥—Ä—É–≥–æ–π –ø—Ä–∏–º–µ—Ä: –ø—Ä–∏–Ω—Ç–µ—Ä Xerox 3124 –≤ –±–ª–æ–∫...,pikabu,41,social,0
4,"–°–∫–æ—Ä–µ–µ –≤—Å–µ–≥–æ –∏ –ø–æ–∫–∞–∑—ã–≤–∞–µ—Ç, –µ—Å–ª–∏ –Ω–µ —ç—Ç–∏–º –Ω–∞—Ä–æ—Å—Ç...",pikabu,27,social,0


In [None]:
print(len(df_human))

1500


In [None]:
df_human['genre'].value_counts()

Unnamed: 0_level_0,count
genre,Unnamed: 1_level_1
social,500
news,500
poems,500


In [None]:
df_human.to_csv("data/original/human_final.csv", index=False, encoding='utf-8')
print("File saved as human_final.csv")

File saved as human_final.csv
