# Splitting data

Splitting data is a crucial step in validating models. Let's explore a couple of ways to do it.

## Random Split

First, let's split our data randomly into training and test sets.

In [1]:
import pandas as pd
import spacy
import umap
import numpy as np 
from pathlib import Path
import sys
sys.path.append("..")

from ml_editor.data_processing import format_raw_df

data_path = Path('../data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())



In [2]:
from ml_editor.data_processing import get_random_train_test_split, get_vectorized_inputs_and_label

train_df_rand, test_df_rand = get_random_train_test_split(df[df["is_question"]], test_size=0.2, random_state=40)

In [3]:
len(train_df_rand), len(test_df_rand)

(6376, 1595)
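To make the random split concrete, here is a minimal sketch on a toy DataFrame. It uses scikit-learn's `train_test_split` directly; whether `get_random_train_test_split` does the same internally is an assumption, and the `toy_questions` data is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the questions DataFrame
toy_questions = pd.DataFrame(
    {"Id": range(10), "body_text": [f"question {i}" for i in range(10)]}
)

# Shuffle the rows and hold out 20% of them; random_state makes the split repeatable
rand_train, rand_test = train_test_split(toy_questions, test_size=0.2, random_state=40)

print(len(rand_train), len(rand_test))  # 8 2
```

Each row is assigned independently here, so nothing prevents questions by the same author from landing on both sides of the split.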

## Author Split

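Some authors may be more successful on average, and that may be due to factors other than how well they formulate their questions, such as their popularity. To remove this potential source of bias, we can split by author, so that all of an author's questions fall in either the training set or the test set, never both.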

In [4]:
from ml_editor.data_processing import get_split_by_author

train_author, test_author = get_split_by_author(df[df["is_question"]])

print("%s questions in training, %s in test." % (len(train_author),len(test_author)))
train_owners = set(train_author['OwnerUserId'].values)
test_owners = set(test_author['OwnerUserId'].values)
print("%s different owners in the training set" % len(train_owners))
print("%s different owners in the testing set" % len(test_owners))
print("%s overlapping owners" % len(train_owners.intersection(test_owners)))

5676 questions in training, 2295 in test.
2723 different owners in the training set
1167 different owners in the testing set
0 overlapping owners
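A grouped split like this one can be sketched with scikit-learn's `GroupShuffleSplit`. This is an illustration under assumptions, not necessarily how `get_split_by_author` is implemented; the toy data and the `test_size=0.3` value are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical posts: five authors, some with several questions
toy_posts = pd.DataFrame(
    {
        "OwnerUserId": [1, 1, 1, 2, 2, 3, 4, 4, 5, 5],
        "body_text": [f"question {i}" for i in range(10)],
    }
)

# Hold out whole authors: test_size is a fraction of the groups, not of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=40)
train_idx, test_idx = next(splitter.split(toy_posts, groups=toy_posts["OwnerUserId"]))
group_train, group_test = toy_posts.iloc[train_idx], toy_posts.iloc[test_idx]

# No author appears on both sides of the split
overlap = set(group_train["OwnerUserId"]) & set(group_test["OwnerUserId"])
print(len(overlap))  # 0
```

Because the held-out fraction applies to authors rather than questions in this sketch, the share of questions in the test set depends on how prolific the held-out authors are.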


Going forward, we will use the author split.