## Data Preprocessing
In this notebook we show how to apply common preprocessing steps (lowercase standardization, tokenization, and stop word removal) to the SNLI corpus. We also reshape the data into the form needed to implement [Gensen](https://github.com/Maluuba/gensen), a model that learns rich fixed-length sentence embeddings, as described in [this paper](https://openreview.net/forum?id=B18WgG-CZ&noteId=B18WgG-CZ).

### 00 Global Settings

In [4]:
import sys
sys.path.append("../../../")

from utils_nlp.dataset import preprocess
from utils_nlp.dataset.preprocess import to_lowercase
from utils_nlp.dataset import snli

print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [5]:
DATA_DIR_PATH  = '../../../data'
# Raw and cleaned file locations for the dev split, used by the cleaning step below
SRC_FILE_PATH  = '../../../data/raw/snli_1.0_dev.txt'
DEST_FILE_PATH = '../../../data/clean/snli_1.0_clean_dev.txt'

### 01 Tokenize

We first load each split of the dataset (train, dev, and test) into a pandas DataFrame.

In [7]:
# Download and extract the data as txt files, then load each split into a pandas DataFrame
train = snli.load_pandas_df(DATA_DIR_PATH, file_split='train')
dev = snli.load_pandas_df(DATA_DIR_PATH, file_split='dev')
test = snli.load_pandas_df(DATA_DIR_PATH, file_split='test')

We also clean the data before tokenizing. This includes dropping unnecessary columns and renaming the relevant columns to [score, sentence1, sentence2].

In [None]:
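import os
def clean(file_split):
    # Build the source path for one split ('train', 'dev' or 'test')
    src_file_path = os.path.join(DATA_DIR_PATH, "snli_1.0_{}.txt".format(file_split))
    # Assumption: the destination follows the clean/ naming pattern from the settings cell
    dest_file_path = os.path.join(DATA_DIR_PATH, "clean", "snli_1.0_clean_{}.txt".format(file_split))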

In [5]:
transformed_df = snli.clean_snli(SRC_FILE_PATH, DEST_FILE_PATH)
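
To make the cleaning step concrete, here is a minimal standard-library sketch of the idea: keep only the label and the two sentence columns, and rename the label column to `score`. It is not the implementation of `snli.clean_snli`. The helper name `clean_rows`, its `lowercase` flag, and the inline sample are hypothetical; the raw column name `gold_label` is the one used in the SNLI tab-separated files.

```python
import csv
import io

# Tiny inline stand-in for a raw SNLI file (tab-separated, with an extra column).
RAW_SAMPLE = (
    "gold_label\tsentence1\tsentence2\tpairID\n"
    "neutral\tTwo women are embracing.\tThe sisters are hugging goodbye.\tp1\n"
    "entailment\tTwo women are embracing.\tTwo woman are holding packages.\tp2\n"
)

def clean_rows(raw_text, lowercase=False):
    """Keep the label and sentence columns, renaming gold_label to score."""
    reader = csv.DictReader(
        io.StringIO(raw_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    cleaned = []
    for row in reader:
        s1, s2 = row["sentence1"], row["sentence2"]
        if lowercase:
            # Optional lowercase standardization of both sentences
            s1, s2 = s1.lower(), s2.lower()
        cleaned.append({"score": row["gold_label"], "sentence1": s1, "sentence2": s2})
    return cleaned

rows = clean_rows(RAW_SAMPLE)
print(rows[0])
```

Passing `lowercase=True` applies lowercase standardization at the same time; the default leaves the original casing intact.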

Once we have a clean pandas DataFrame, we call the NLTK tokenizer, which adds two new columns: [sentence1_tokens, sentence2_tokens].

In [6]:
tokenized_df = preprocess.nltk_tokenizer(transformed_df)

The first five rows of the dataframe after tokenization:

In [7]:
tokenized_df.head(5)

Unnamed: 0,score,sentence1,sentence2,sentence1_tokens,sentence2_tokens
0,neutral,Two women are embracing while holding to go pa...,The sisters are hugging goodbye while holding ...,"[Two, women, are, embracing, while, holding, t...","[The, sisters, are, hugging, goodbye, while, h..."
1,entailment,Two women are embracing while holding to go pa...,Two woman are holding packages.,"[Two, women, are, embracing, while, holding, t...","[Two, woman, are, holding, packages, .]"
2,contradiction,Two women are embracing while holding to go pa...,The men are fighting outside a deli.,"[Two, women, are, embracing, while, holding, t...","[The, men, are, fighting, outside, a, deli, .]"
3,entailment,"Two young children in blue jerseys, one with t...",Two kids in numbered jerseys wash their hands.,"[Two, young, children, in, blue, jerseys, ,, o...","[Two, kids, in, numbered, jerseys, wash, their..."
4,neutral,"Two young children in blue jerseys, one with t...",Two kids at a ballgame wash their hands.,"[Two, young, children, in, blue, jerseys, ,, o...","[Two, kids, at, a, ballgame, wash, their, hand..."
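
As a rough illustration of what adding the token columns involves, the sketch below tokenizes both sentences of each row and stores the results under `sentence1_tokens` and `sentence2_tokens`. It uses a simple regular expression as a stand-in for the NLTK tokenizer, so its output can differ from NLTK's on contractions and other edge cases. The names `simple_tokenize` and `add_token_columns` are hypothetical and are not part of `utils_nlp`.

```python
import re

# Words (optionally with an internal apostrophe) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)

def add_token_columns(rows):
    """Return copies of the rows with sentence1_tokens and sentence2_tokens added."""
    return [
        {
            **row,
            "sentence1_tokens": simple_tokenize(row["sentence1"]),
            "sentence2_tokens": simple_tokenize(row["sentence2"]),
        }
        for row in rows
    ]

example = [
    {
        "score": "contradiction",
        "sentence1": "Two kids at a ballgame wash their hands.",
        "sentence2": "The men are fighting outside a deli.",
    }
]
tokenized = add_token_columns(example)
print(tokenized[0]["sentence2_tokens"])
# ['The', 'men', 'are', 'fighting', 'outside', 'a', 'deli', '.']
```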


### 02 Stopword Removal

If a sentence has not been tokenized yet, nltk_remove_stop_words() first tokenizes it and then removes the stop words. It adds two new columns: [sentence1_tokens_stop, sentence2_tokens_stop].


In [8]:
rm_stop_words_df = preprocess.nltk_remove_stop_words(tokenized_df)

The first five rows of the dataframe after removing stop words:

In [9]:
rm_stop_words_df.head(5)

Unnamed: 0,score,sentence1,sentence2,sentence1_tokens,sentence2_tokens,sentence1_tokens_stop,sentence2_tokens_stop
0,neutral,Two women are embracing while holding to go pa...,The sisters are hugging goodbye while holding ...,"[Two, women, are, embracing, while, holding, t...","[The, sisters, are, hugging, goodbye, while, h...","[Two, women, embracing, holding, go, packages, .]","[The, sisters, hugging, goodbye, holding, go, ..."
1,entailment,Two women are embracing while holding to go pa...,Two woman are holding packages.,"[Two, women, are, embracing, while, holding, t...","[Two, woman, are, holding, packages, .]","[Two, women, embracing, holding, go, packages, .]","[Two, woman, holding, packages, .]"
2,contradiction,Two women are embracing while holding to go pa...,The men are fighting outside a deli.,"[Two, women, are, embracing, while, holding, t...","[The, men, are, fighting, outside, a, deli, .]","[Two, women, embracing, holding, go, packages, .]","[The, men, fighting, outside, deli, .]"
3,entailment,"Two young children in blue jerseys, one with t...",Two kids in numbered jerseys wash their hands.,"[Two, young, children, in, blue, jerseys, ,, o...","[Two, kids, in, numbered, jerseys, wash, their...","[Two, young, children, blue, jerseys, ,, one, ...","[Two, kids, numbered, jerseys, wash, hands, .]"
4,neutral,"Two young children in blue jerseys, one with t...",Two kids at a ballgame wash their hands.,"[Two, young, children, in, blue, jerseys, ,, o...","[Two, kids, at, a, ballgame, wash, their, hand...","[Two, young, children, blue, jerseys, ,, one, ...","[Two, kids, ballgame, wash, hands, .]"
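
The filtering itself can be sketched in a few lines. Note that the output above keeps the capitalized "The", which suggests that tokens are compared to a lowercase stop-word list without lowercasing them first. The sketch below reproduces that behaviour by default and offers a case-insensitive variant. The stop-word set is a small hypothetical sample (NLTK's English list is much longer), and `remove_stop_words` is an illustrative helper, not the `utils_nlp` function.

```python
# Small illustrative stop-word set (hypothetical); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "are", "is", "in", "at", "to", "while", "their"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=True):
    """Return the tokens that are not stop words, preserving their order."""
    if case_sensitive:
        # Capitalized tokens such as "The" do not match the lowercase list
        return [tok for tok in tokens if tok not in stop_words]
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "men", "are", "fighting", "outside", "a", "deli", "."]
print(remove_stop_words(tokens))
# ['The', 'men', 'fighting', 'outside', 'deli', '.']
print(remove_stop_words(tokens, case_sensitive=False))
# ['men', 'fighting', 'outside', 'deli', '.']
```

Lowercasing the sentences before tokenizing would have the same effect as the case-insensitive variant, which is one reason lowercase standardization is often applied first.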
