<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

## Data Preprocessing
In this notebook we show how to apply common preprocessing steps (lowercase standardization and tokenization) to the [SNLI](https://nlp.stanford.edu/projects/snli/) corpus. We also reshape the data into the form required by [Gensen](https://github.com/Maluuba/gensen), a model that learns rich fixed-length sentence embeddings, as described in [this paper](https://openreview.net/forum?id=B18WgG-CZ&noteId=B18WgG-CZ).

### 00 Global Settings

In [1]:
import sys
sys.path.append("../../../")

import os
from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens
from utils_nlp.dataset import snli

print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [2]:
BASE_DATA_PATH  = '../../../data'

### 01 Tokenize

We first download the dataset and load each split (train, dev, and test) into a pandas DataFrame.

In [3]:
# Download and extract data
train = snli.load_pandas_df(BASE_DATA_PATH, file_split='train')
dev = snli.load_pandas_df(BASE_DATA_PATH, file_split='dev')
test = snli.load_pandas_df(BASE_DATA_PATH, file_split='test')

We also clean the data before tokenizing. This includes dropping unnecessary columns, keeping only the relevant columns under the names `score`, `sentence1`, and `sentence2`, and dropping rows with missing values.

In [4]:
def clean(df, file_split):
    # Note: clean_snli is given file paths rather than a dataframe, so the `df` argument is not used here.
    src_file_path = os.path.join(BASE_DATA_PATH, "raw/snli_1.0/snli_1.0_{}.txt".format(file_split))
    dest_file_path = os.path.join(BASE_DATA_PATH, "clean/snli_1.0/snli_1.0_clean_{}.txt".format(file_split))
    return snli.clean_snli(src_file_path, dest_file_path).dropna() # drop rows with any NaN vals

In [5]:
train = clean(train, 'train')
dev = clean(dev, 'dev')
test = clean(test, 'test')
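
The cleaning step described above can be illustrated on a toy dataframe. This is only a sketch under stated assumptions: the internals of `snli.clean_snli` are not shown in this notebook, the rows below are made up, and the raw column names (`gold_label`, `captionID`, `pairID`) are assumed to match the raw SNLI text file.

```python
import pandas as pd

# Hypothetical toy rows mimicking a few columns of the raw SNLI file.
raw = pd.DataFrame({
    "gold_label": ["neutral", "entailment", "contradiction"],
    "sentence1": ["Two women are embracing.", "Two women are embracing.", None],
    "sentence2": ["The sisters are hugging.", "Two woman are holding packages.", "The men are fighting."],
    "captionID": ["c1", "c1", "c2"],
    "pairID": ["p1", "p2", "p3"],
})

# Keep only the relevant columns, rename the label column to `score`,
# and drop rows that contain any missing value.
cleaned = (
    raw[["gold_label", "sentence1", "sentence2"]]
    .rename(columns={"gold_label": "score"})
    .dropna()
)

print(list(cleaned.columns))
print(len(cleaned))
```

The third toy row has no premise, so `dropna()` removes it and two rows remain.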

Once we have the clean pandas dataframes, we do lowercase standardization and tokenization. We use the [NLTK](https://www.nltk.org/) library for tokenization.

In [6]:
train_tok = to_nltk_tokens(to_lowercase(train))
dev_tok = to_nltk_tokens(to_lowercase(dev))
test_tok = to_nltk_tokens(to_lowercase(test))

[nltk_data] Downloading package punkt to /Users/caseyhong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to /Users/caseyhong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to /Users/caseyhong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
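
To make the shape of this step concrete, here is a minimal sketch of lowercasing followed by tokenization on a single sentence. It deliberately uses a simple regular expression as a stand-in for the NLTK tokenizer, so that it runs without downloading any NLTK data. The helper name `simple_tokenize` is hypothetical, and its output can differ from NLTK's on contractions, quotes, and other edge cases.

```python
import re

def simple_tokenize(sentence):
    """Lowercase a sentence, then split it into word and punctuation tokens.

    A regex stand-in for illustration only; it is not the NLTK tokenizer.
    """
    # \w+ matches runs of word characters; [^\w\s] matches a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

tokens = simple_tokenize("Two women are holding packages.")
print(tokens)
```

As in the notebook's pipeline, the final period becomes its own token rather than staying attached to the last word.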


In [7]:
dev_tok.head()

| | score | sentence1 | sentence2 | sentence1_tokens | sentence2_tokens |
|---|---|---|---|---|---|
| 0 | neutral | two women are embracing while holding to go pa... | the sisters are hugging goodbye while holding ... | [two, women, are, embracing, while, holding, t... | [the, sisters, are, hugging, goodbye, while, h... |
| 1 | entailment | two women are embracing while holding to go pa... | two woman are holding packages. | [two, women, are, embracing, while, holding, t... | [two, woman, are, holding, packages, .] |
| 2 | contradiction | two women are embracing while holding to go pa... | the men are fighting outside a deli. | [two, women, are, embracing, while, holding, t... | [the, men, are, fighting, outside, a, deli, .] |
| 3 | entailment | two young children in blue jerseys, one with t... | two kids in numbered jerseys wash their hands. | [two, young, children, in, blue, jerseys, ,, o... | [two, kids, in, numbered, jerseys, wash, their... |
| 4 | neutral | two young children in blue jerseys, one with t... | two kids at a ballgame wash their hands. | [two, young, children, in, blue, jerseys, ,, o... | [two, kids, at, a, ballgame, wash, their, hand... |


### 02 Reshape
The Gensen model expects its input in a specific layout. For each split, we prepare the data by
* Saving the tokens in a `snli_1.0_{split}.txt.clean` file with one sentence pair per line, where the two sentences and the score are tab-separated and the tokens within each sentence are separated by a single space.
* Saving the tokenized sentences and the labels in separate files named `snli_1.0_{split}.txt.s1.tok`, `snli_1.0_{split}.txt.s2.tok`, and `snli_1.0_{split}.txt.lab`.

In [8]:
split_map = {'train': train_tok, 'dev': dev_tok, 'test': test_tok}
for file_split, df in split_map.items():
    base_txt_path = os.path.join(BASE_DATA_PATH, "clean/snli_1.0/snli_1.0_{}.txt".format(file_split))
    df['s1.tok'] = df['sentence1_tokens'].apply(lambda x: ' '.join(x))
    df['s2.tok'] = df['sentence2_tokens'].apply(lambda x: ' '.join(x))
    df['s1.tok'].to_csv("{}.s1.tok".format(base_txt_path), sep=' ', header=False, index=False)
    df['s2.tok'].to_csv("{}.s2.tok".format(base_txt_path), sep=' ', header=False, index=False)
    df['score'].to_csv("{}.lab".format(base_txt_path), sep=' ', header=False, index=False)
    df[['s1.tok', 's2.tok', 'score']].to_csv("{}.clean".format(base_txt_path), sep="\t", header=False, index=False)
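
The file layout described above can be sketched with plain file writes. This is an illustration under assumptions, not the notebook's method: the two rows are made-up toy data, the files go to a temporary directory rather than `BASE_DATA_PATH`, and the standard library is used in place of pandas.

```python
import os
import tempfile

# Hypothetical toy rows: (premise tokens, hypothesis tokens, label).
rows = [
    (["two", "women", "are", "embracing", "."],
     ["two", "woman", "are", "holding", "packages", "."],
     "entailment"),
    (["two", "women", "are", "embracing", "."],
     ["the", "men", "are", "fighting", "."],
     "contradiction"),
]

out_dir = tempfile.mkdtemp()
base = os.path.join(out_dir, "snli_1.0_dev.txt")

with open(base + ".s1.tok", "w") as s1, open(base + ".s2.tok", "w") as s2, \
        open(base + ".lab", "w") as lab, open(base + ".clean", "w") as clean:
    for tokens1, tokens2, label in rows:
        # Tokens within a sentence are joined by a single space.
        sent1, sent2 = " ".join(tokens1), " ".join(tokens2)
        s1.write(sent1 + "\n")
        s2.write(sent2 + "\n")
        lab.write(label + "\n")
        # The combined file holds both sentences and the label, tab-separated.
        clean.write("\t".join([sent1, sent2, label]) + "\n")

with open(base + ".clean") as f:
    clean_lines = f.read().splitlines()

print(sorted(os.listdir(out_dir)))
print(clean_lines[0].split("\t"))
```

Line `i` of each of the four files refers to the same sentence pair, which is what keeps the separate `.s1.tok`, `.s2.tok`, and `.lab` files aligned.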

In [9]:
# Remove quotation marks from the .tok files. Because the files were written with
# sep=' ', to_csv wrapped every sentence containing a space in double quotes.
import shutil

for file_split in split_map.keys():
    for sentence in ("s1", "s2"):
        tok_path = os.path.join(BASE_DATA_PATH, "clean/snli_1.0/snli_1.0_{}.txt.{}.tok".format(file_split, sentence))
        tmp_path = "{}.tmp".format(tok_path)
        # Write a quote-free copy to a temporary file, then replace the original with it.
        with open(tok_path, 'r') as fin, open(tmp_path, 'w') as tmp:
            for line in fin:
                tmp.write(line.replace('"', ''))
        shutil.move(tmp_path, tok_path)
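
The quotation marks removed by the cell above come from CSV quoting rules. The sketch below reproduces the effect with Python's standard `csv` module, which by default quotes any field that contains the delimiter. It assumes that pandas' `to_csv` applies the same minimal quoting by default, and the sentence is a made-up example.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter=" ", lineterminator="\n")

# A field containing the delimiter (a space) is wrapped in double quotes.
writer.writerow(["two women are holding packages ."])
# A field without the delimiter is written as-is.
writer.writerow(["hello"])

written = buf.getvalue()
print(written)

# Stripping the quote characters recovers the plain space-separated tokens.
stripped = written.replace('"', "")
print(stripped)
```

Note that stripping every `"` character also removes any double quotes that were part of a sentence itself, so this cleanup trades a little fidelity for a simple one-sentence-per-line format.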