# Train-Test Split

This notebook splits the provided data into training, validation, and test datasets.

This is done in a separate file so that all models are trained and evaluated on the same data.

A train/test split alone is not enough, because we also need a separate validation dataset for hyperparameter optimization.

The created datasets are saved as:

- data/train.csv
- data/validation.csv
- data/test.csv


### Load the data

The helper function below loads the raw data without any preprocessing.

In [1]:
from helper.data_loading import load_inital_dataframe

In [2]:
file = 'data/training_sample.tsv'

df = load_inital_dataframe(file)

In [3]:
df.head(2)

Unnamed: 0,text_tokens,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,engaged_with_user_id,...,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engaged_follows_engaging,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp
0,101\t56898\t137\t174\t63247\t10526\t131\t3197\...,,3C21DCFB8E3FEC1CB3D2BFB413A78220,Video,,,Retweet,76B8A9C3013AE6414A3E6012413CDC3B,1581467323,D1AA2C85FA644D64346EDD88470525F2,...,000046C8606F1C3F5A7296222C88084B,131,2105,False,1573978269,False,,,,
1,101\t102463\t10230\t10105\t21040\t10169\t12811...,,3D87CC3655C276F1771752081423B405,,BB422AA00380E45F312FD2CAA75F4960,92D397F8E0F1E77B36B8C612C2C51E23,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1580975391,4DC65AC7BD963DE1F7617C047C33DE99,...,00006047187D0D18598EF12A650E1DAC,22,50,False,1340673962,False,,,,


## Data Split

To simulate the challenge faithfully, we cannot simply split the data at random: in the challenge, the data used for testing is sampled at a later time (2 weeks) than the training data.

Therefore, we also need to split our data along the timeline. We first sort the DataFrame by the creation time of the tweets and then split it: the earliest 70% of the rows form the training set, and the remaining 30% are divided evenly between the validation and test sets.
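
The principle can be illustrated on a toy DataFrame. This is a minimal sketch that uses positional slicing with `iloc` instead of `train_test_split`; the function name `chronological_split`, the `train_frac` parameter, and the toy values are assumptions made for illustration, and rounding at the cut point may differ by one row from scikit-learn.

```python
import pandas as pd


def chronological_split(df, time_col, train_frac=0.7):
    """Sort by time_col and cut positionally: oldest rows first, newest rows last."""
    ordered = df.sort_values(by=time_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy data: ten tweets with made-up timestamps in scrambled order.
toy = pd.DataFrame({
    "tweet_id": list("abcdefghij"),
    "tweet_timestamp": [50, 10, 90, 30, 70, 20, 100, 40, 80, 60],
})

toy_train, toy_holdout = chronological_split(toy, "tweet_timestamp")

# Every training row is older than every hold-out row.
print(toy_train["tweet_timestamp"].max(), "<", toy_holdout["tweet_timestamp"].min())
```

Because the cut is purely positional on sorted data, no tweet in the hold-out part can be older than a tweet in the training part.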

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
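# Sort chronologically so that a positional split respects time order.
df = df.sort_values(by=['tweet_timestamp'])
# shuffle=False keeps the order: the oldest 70% become the training set.
train, test_val = train_test_split(df, test_size=0.3, shuffle=False)
# The later 30% is shuffled and halved, so val and test cover the same period.
val, test = train_test_split(test_val, test_size=0.5, shuffle=True)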

In [9]:
train.to_csv("data/train.csv", index=False)
val.to_csv("data/validation.csv", index=False)
test.to_csv("data/test.csv", index=False)

#### Sample Loading
The next section shows how to load the respective files.

In [14]:
from helper.data_loading import load_subsample
train = load_subsample("data/train.csv")
val = load_subsample("data/validation.csv")
test = load_subsample("data/test.csv")
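
`load_subsample` is a project helper whose implementation is not shown in this notebook. Assuming it reads the CSV and converts `tweet_timestamp` from Unix seconds to datetimes (which the `datetime64[ns]` outputs suggest), a rough plain-pandas stand-in could look like the sketch below. The name `load_split` is hypothetical, and an in-memory CSV is used so the example runs without the data files.

```python
import io

import pandas as pd


def load_split(path_or_buffer):
    """Hypothetical stand-in for load_subsample: read a CSV and parse tweet_timestamp."""
    frame = pd.read_csv(path_or_buffer)
    # Assumption: timestamps are stored as Unix seconds, as in the raw data.
    frame["tweet_timestamp"] = pd.to_datetime(frame["tweet_timestamp"], unit="s")
    return frame


# Two-row demo file; the timestamps are the values visible in the df.head(2) output.
csv_text = "tweet_id,tweet_timestamp\nA,1581467323\nB,1580975391\n"
demo = load_split(io.StringIO(csv_text))
print(demo.dtypes)
```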

In [15]:
train['tweet_timestamp'].tail()

56292   2020-02-10 19:59:59
56293   2020-02-10 19:59:59
56294   2020-02-10 20:00:00
56295   2020-02-10 20:00:00
56296   2020-02-10 20:00:01
Name: tweet_timestamp, dtype: datetime64[ns]

In [16]:
val['tweet_timestamp'].tail()

12059   2020-02-11 23:44:32
12060   2020-02-12 02:15:32
12061   2020-02-12 16:43:02
12062   2020-02-11 18:03:19
12063   2020-02-12 10:26:01
Name: tweet_timestamp, dtype: datetime64[ns]

In [17]:
test['tweet_timestamp'].head()

0   2020-02-11 05:25:19
1   2020-02-11 16:28:25
2   2020-02-11 08:39:29
3   2020-02-11 04:23:12
4   2020-02-11 15:49:30
Name: tweet_timestamp, dtype: datetime64[ns]

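The training set ends at 2020-02-10 20:00:01, and the validation and test samples shown above all date from 2020-02-11 or later, as expected from the chronological split: all samples in the validation and test sets were created later than the samples in the training set. The validation and test sets are not ordered relative to each other, because the second split shuffles the rows.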
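
Inspecting `head()` and `tail()` only covers a few rows. A programmatic check over whole frames could look like the following sketch, where the helper `is_chronological` is an assumed name and the toy frames reuse a few timestamps from the outputs above.

```python
import pandas as pd


def is_chronological(earlier, later, time_col="tweet_timestamp"):
    """Return True if no row in `later` is older than the newest row in `earlier`."""
    return earlier[time_col].max() <= later[time_col].min()


# Toy frames built from a handful of the timestamps displayed above.
toy_train = pd.DataFrame({"tweet_timestamp": pd.to_datetime(
    ["2020-02-10 19:59:59", "2020-02-10 20:00:01"])})
toy_val = pd.DataFrame({"tweet_timestamp": pd.to_datetime(
    ["2020-02-11 23:44:32", "2020-02-12 02:15:32"])})
toy_test = pd.DataFrame({"tweet_timestamp": pd.to_datetime(
    ["2020-02-11 05:25:19", "2020-02-11 16:28:25"])})

print(is_chronological(toy_train, toy_val))   # training precedes validation
print(is_chronological(toy_train, toy_test))  # training precedes test
print(is_chronological(toy_val, toy_test))    # validation and test overlap in time
```

On the real data, the same function could be applied to the `train`, `val`, and `test` frames loaded above.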