# Dataset split

The goal of this notebook is to create train and validation sets and check their consistency.

In [1]:
from thc.utils.env import check_repository_path


REPOSITORY_DIR = check_repository_path()
RAW_DATA_DIR = REPOSITORY_DIR.joinpath("data", "raw")
PROCESSED_DATA_DIR = REPOSITORY_DIR.joinpath("data", "processed")

As shown in the [previous notebook](https://github.com/mrtovsky/thc/blob/main/notebooks/00-texts-integrity.ipynb), the dataset is significantly imbalanced, so the validation set should be created in a **stratified** fashion, preserving the class proportions in both parts.
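To make the idea of stratification concrete, here is a minimal pure-Python sketch. It is not what the notebook runs (the actual split below relies on scikit-learn); the helper `stratified_indices` and the toy labels are hypothetical and only illustrate that every class is split in the same ratio.

```python
import random
from collections import Counter, defaultdict


def stratified_indices(labels, valid_fraction, seed=0):
    """Return (train_idx, valid_idx) with each class split in the same ratio."""
    # Group example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Hypothetical imbalanced labels: 90 examples of class 0, 10 of class 1.
toy_labels = [0] * 90 + [1] * 10
train_idx, valid_idx = stratified_indices(toy_labels, valid_fraction=0.3)
valid_counts = Counter(toy_labels[i] for i in valid_idx)
```

In this toy case the validation part keeps the original 9:1 class ratio, which a plain random split does not guarantee for a rare class.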

In [2]:
import codecs


TRAIN_TEXT_FILE = RAW_DATA_DIR.joinpath("training_set_clean_only_text.txt")
TRAIN_TAGS_FILE = RAW_DATA_DIR.joinpath("training_set_clean_only_tags.txt")

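# Read tweets and labels line by line; both lists must stay aligned by index.
with codecs.open(str(TRAIN_TEXT_FILE), mode="r", encoding="utf-8") as file:
    text = file.read().splitlines()
with codecs.open(str(TRAIN_TAGS_FILE), mode="r", encoding="utf-8") as file:
    # int() ignores the trailing newline of each line.
    tags = [int(tag) for tag in file]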

Split the training dataset to create a holdout set for validation. The corpus is small enough that cross-validation would be the more reliable way to measure model performance, but repeating the **DistilBERT** fine-tuning for every fold would take a lot of additional time.

In [3]:
from sklearn.model_selection import train_test_split


text_train, text_valid, tags_train, tags_valid = train_test_split(
    text, tags,
    test_size=0.3,
    random_state=42,
    shuffle=True,
    stratify=tags,
)
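Since the stated goal includes checking the consistency of the two sets, a check along the following lines could be run after the split. This is a sketch under assumptions: the helpers `class_shares` and `max_share_gap`, the toy labels, and the 0.01 tolerance are all hypothetical; in the notebook the two lists would be `tags_train` and `tags_valid`.

```python
from collections import Counter


def class_shares(labels):
    """Map each class to its share of the given label list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def max_share_gap(labels_a, labels_b):
    """Largest absolute difference in class share between two label lists."""
    shares_a, shares_b = class_shares(labels_a), class_shares(labels_b)
    classes = set(shares_a) | set(shares_b)
    return max(abs(shares_a.get(c, 0.0) - shares_b.get(c, 0.0)) for c in classes)


# Hypothetical labels standing in for tags_train and tags_valid.
toy_train = [0] * 63 + [1] * 7
toy_valid = [0] * 27 + [1] * 3
gap = max_share_gap(toy_train, toy_valid)

# The tolerance is an arbitrary choice for illustration.
assert gap < 0.01, f"class shares differ by {gap:.3f}"
```

With a stratified split the gap should be close to zero; a noticeably larger value would suggest the `stratify` argument was not applied as intended.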

Save the resulting splits to the **processed** folder, one example per line.

In [4]:
with codecs.open(str(PROCESSED_DATA_DIR.joinpath("train_text.txt")), "w", "utf-8") as file:
    for tweet in text_train:
        file.write(f"{tweet}\n")

with codecs.open(str(PROCESSED_DATA_DIR.joinpath("valid_text.txt")), "w", "utf-8") as file:
    for tweet in text_valid:
        file.write(f"{tweet}\n")
        
with codecs.open(str(PROCESSED_DATA_DIR.joinpath("train_tags.txt")), "w") as file:
    for label in tags_train:
        file.write(f"{label}\n")

with codecs.open(str(PROCESSED_DATA_DIR.joinpath("valid_tags.txt")), "w") as file:
    for label in tags_valid:
        file.write(f"{label}\n")
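Because texts and labels are stored in separate files, they stay aligned only if both files have the same number of lines. The sketch below shows one way to verify that after saving. The helper `count_lines` and the toy file names are hypothetical, and temporary files stand in for the real ones under `PROCESSED_DATA_DIR`.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    """Count lines terminated by "\n", matching how the files were written."""
    # newline="\n" disables newline translation, so only "\n" ends a line.
    with open(path, encoding="utf-8", newline="\n") as file:
        return sum(1 for _ in file)


# Hypothetical data written the same way as in the cell above.
toy_text = ["first tweet", "second tweet", "third tweet"]
toy_tags = [0, 1, 0]

with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = Path(tmp_dir, "toy_text.txt")
    tags_path = Path(tmp_dir, "toy_tags.txt")
    text_path.write_text("".join(f"{tweet}\n" for tweet in toy_text), encoding="utf-8")
    tags_path.write_text("".join(f"{label}\n" for label in toy_tags), encoding="utf-8")
    n_text, n_tags = count_lines(text_path), count_lines(tags_path)

# Texts and labels must line up one-to-one.
assert n_text == n_tags == len(toy_text)
```

A mismatch here would typically point to a tweet containing an embedded line break, which would shift every following label by one position.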

Copy the test data to the **processed** folder as well, renaming the files for consistency with the train and validation sets.

In [5]:
import shutil


shutil.copy(
    RAW_DATA_DIR.joinpath("test_set_only_text.txt"),
    PROCESSED_DATA_DIR.joinpath("test_text.txt"),
)
shutil.copy(
    RAW_DATA_DIR.joinpath("test_set_only_tags.txt"),
    PROCESSED_DATA_DIR.joinpath("test_tags.txt"),
)

PosixPath('/usr/local/coding/thc-project/thc/data/processed/test_tags.txt')