# Data Loading

In this notebook, we will preprocess the data. The preprocessing pipeline will contain the following steps:

1. Tokenization
2. Text Normalization
    1. Case folding
    2. Lemmatization
    3. Stemming
3. Conversion of tokens into indexes
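
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline this notebook ends up using: the helper names `tokenize`, `build_vocab`, and `encode` are hypothetical, tokenization is a simple whitespace split (which assumes the text already has spaces around punctuation), and lemmatization and stemming are left out because they would normally come from a library such as NLTK or spaCy.

```python
from collections import Counter


def tokenize(text):
    """Case-fold the text and split it on whitespace."""
    return text.lower().split()


def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Map each token to an integer index, most frequent tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Reserve the lowest indexes for the special tokens (an assumption of this sketch).
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common():
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to indexes, falling back to <unk> for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


sample_texts = ["I kinda miss it .", "Then again , space is pretty cool ."]
token_lists = [tokenize(t) for t in sample_texts]
vocab = build_vocab(token_lists)
encoded = [encode(toks, vocab) for toks in token_lists]
print(encoded)
```

Building the vocabulary from the training split only, and mapping everything else to `<unk>`, is a common design choice that keeps validation and test tokens from leaking into the index.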

In [63]:
import pandas as pd
import re

Import the training data. The target files contain the stories:

In [55]:
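# read the targets (stories), one per line
with open("data/writingprompts/train.wp_target", encoding="utf-8") as f:
    stories = f.readlines()
# collapse runs of whitespace and drop the trailing newline
stories = [" ".join(i.split()) for i in stories]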

In [56]:
stories[1]

"-Week 18 aboard the Depth Reaver , Circa 2023- <newline> <newline> I walk about the dull gray halls , the artificial gravity making my steps feel almost as if they were on land . Almost . I glance out a window as I pass it by . There 's the sun , and there 's the moon right there . And , of course , there 's the Earth . I kinda miss it . Then again , space is pretty cool . It 's got some brilliant views , and the wifi is surprisingly good . Even countless miles away from the Earth , I can crush Silver noobs on CS GO . <newline> <newline> I pass by Dale Malkowitz , the head scientist on board . <newline> <newline> `` Evening , Dale , '' I say . <newline> <newline> `` What up , Danny ? '' he replies cordially . <newline> <newline> `` Nothin ' much . A little bored , I guess . '' <newline> <newline> He shakes his head in disbelief . `` I really , *really* do n't understand how you can be bored in space . '' <newline> <newline> `` Well hey , '' I say slightly defensively , `` Aside from t

How many targets do we have?

In [57]:
len(stories)

272600

Let's load the sources (the prompts) as well. Each prompt starts with a tag such as `[WP]`, which we should remove.

In [80]:
with open("data/writingprompts/train.wp_source", encoding="utf-8") as f:
    prompts = f.readlines()
prompts = [" ".join(i.split()) for i in prompts]

# use a regex to remove two-letter tags in square brackets, like [WP], [EU], [IP],
# then strip the space left behind
pattern = r"\[\s*[A-Z]{2}\s*\]"
prompts = [re.sub(pattern, "", i).strip() for i in prompts]

In [59]:
len(prompts)

272600
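
As a quick check of the tag-stripping pattern used above, the sketch below runs it on a few made-up prompts (the sample strings are illustrative, not taken from the dataset). It highlights two properties worth keeping in mind: the pattern only matches tags of exactly two uppercase letters, and removing a leading tag leaves a leading space behind unless the result is stripped.

```python
import re

# Same pattern as in the cell above: two uppercase letters in square brackets,
# with optional whitespace inside the brackets.
pattern = r"\[\s*[A-Z]{2}\s*\]"

# Hypothetical example prompts, made up for illustration.
samples = [
    "[ WP ] You wake up aboard a research ship .",
    "[EU] A minor character tells the story .",
    "[ OT ] Weekly discussion thread",
    "[ wp ] lowercase tags are not matched",
]

# Remove the tag, then strip the whitespace it leaves behind.
cleaned = [re.sub(pattern, "", s).strip() for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

If lowercase or longer tags turn out to be present in the data, the character class and repetition count in the pattern would need to be widened accordingly.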

Now, we can put these in a dataframe where each row pairs a story with its prompt, and save it to disk:

In [84]:
train_df = pd.DataFrame({'stories': stories, 'prompts': prompts})
train_df.to_csv("data/train_df.csv")

We can do the same for the validation and testing datasets.

For the validation dataset:

In [86]:
# read the targets (stories)
with open("data/writingprompts/valid.wp_target", encoding="utf-8") as f:
    stories = f.readlines()
stories = [" ".join(i.split()) for i in stories]

# read the sources (prompts)
with open("data/writingprompts/valid.wp_source", encoding="utf-8") as f:
    prompts = f.readlines()
prompts = [" ".join(i.split()) for i in prompts]

# use a regex to remove two-letter tags in square brackets, like [WP], [EU], [IP],
# then strip the space left behind
pattern = r"\[\s*[A-Z]{2}\s*\]"
prompts = [re.sub(pattern, "", i).strip() for i in prompts]

# save the dataset
valid_df = pd.DataFrame({'stories': stories, 'prompts': prompts})
valid_df.to_csv("data/valid_df.csv")

For the testing dataset:

In [87]:
# read the targets (stories)
with open("data/writingprompts/test.wp_target", encoding="utf-8") as f:
    stories = f.readlines()
stories = [" ".join(i.split()) for i in stories]

# read the sources (prompts)
with open("data/writingprompts/test.wp_source", encoding="utf-8") as f:
    prompts = f.readlines()
prompts = [" ".join(i.split()) for i in prompts]

# use a regex to remove two-letter tags in square brackets, like [WP], [EU], [IP],
# then strip the space left behind
pattern = r"\[\s*[A-Z]{2}\s*\]"
prompts = [re.sub(pattern, "", i).strip() for i in prompts]

# save the dataset
test_df = pd.DataFrame({'stories': stories, 'prompts': prompts})
test_df.to_csv("data/test_df.csv")
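
Since the three splits go through the same steps, the loading logic could be folded into one helper. The sketch below is one possible refactoring, not part of the original notebook: `load_split` and `TAG_PATTERN` are hypothetical names, the function assumes the `<split>.wp_source` / `<split>.wp_target` file layout used above, and it returns plain lists so that it can be exercised without pandas.

```python
import re
from pathlib import Path

# Two uppercase letters in square brackets, e.g. [WP], [EU], [IP].
TAG_PATTERN = re.compile(r"\[\s*[A-Z]{2}\s*\]")


def load_split(data_dir, split):
    """Read one split (e.g. "train", "valid", "test"); return (stories, prompts)."""
    data_dir = Path(data_dir)
    # Targets are the stories; collapse whitespace and drop trailing newlines.
    with open(data_dir / f"{split}.wp_target", encoding="utf-8") as f:
        stories = [" ".join(line.split()) for line in f]
    # Sources are the prompts; also remove the leading tag.
    with open(data_dir / f"{split}.wp_source", encoding="utf-8") as f:
        prompts = [TAG_PATTERN.sub("", " ".join(line.split())).strip() for line in f]
    # Each prompt should line up with exactly one story.
    if len(stories) != len(prompts):
        raise ValueError(f"{split}: {len(stories)} stories but {len(prompts)} prompts")
    return stories, prompts
```

With this helper, a call such as `stories, prompts = load_split("data/writingprompts", "valid")` followed by `pd.DataFrame({'stories': stories, 'prompts': prompts})` would rebuild the same kind of dataframe as the cells above. The length check makes a misaligned pair of files fail loudly instead of raising a less obvious error later.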