<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

## Data Load & Prep

In this notebook we show how to download and preprocess the [SNLI](https://nlp.stanford.edu/projects/snli/) dataset for sentence similarity.

We show how to apply common preprocessing steps (lowercase standardization and tokenization) to the [SNLI](https://nlp.stanford.edu/projects/snli/) corpus. We also reshape the data into the form needed by [GenSen](https://github.com/Maluuba/gensen), a model that learns fixed-length sentence embeddings, described in [this paper](https://openreview.net/forum?id=B18WgG-CZ&noteId=B18WgG-CZ).

### 00 Global Settings

In [1]:
import sys
sys.path.append("../../../")

import os
from utils_nlp.dataset.preprocess import to_lowercase_all, to_nltk_tokens
from utils_nlp.dataset import snli

print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]


In [2]:
BASE_DATA_PATH = '../../../data'

### 01 Load SNLI as a pandas dataframe
We provide a function `load_pandas_df` which

* Downloads the SNLI zipfile to the specified directory location
* Extracts the file for the specified split
* Loads the split as a pandas dataframe

The zipfile includes the following data files:

* snli_1.0_dev.txt
* snli_1.0_train.txt
* snli_1.0_test.txt
* snli_1.0_dev.jsonl
* snli_1.0_train.jsonl
* snli_1.0_test.jsonl

The loader reads the .txt file by default; to read the .jsonl file instead, set the optional `file_type` parameter when calling the function.

In [3]:
# defaults to txt
train = snli.load_pandas_df(BASE_DATA_PATH, file_split="train")

# or, load dataframe from jsonl
dev = snli.load_pandas_df(BASE_DATA_PATH, file_split="dev", file_type="jsonl")
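
As a rough picture of the extract-and-load steps, the sketch below reads one split out of a zip archive into a pandas dataframe. It is not the `utils_nlp` implementation: `read_split_from_zip` is a hypothetical helper, the member layout `snli_1.0/snli_1.0_{split}.{ext}` is an assumption, and a tiny in-memory archive stands in for the real download.

```python
import csv
import io
import zipfile

import pandas as pd


def read_split_from_zip(zip_bytes, file_split, file_type="txt"):
    """Hypothetical helper: load one split from a zip archive held in memory."""
    # Assumed member layout; the real archive may be organized differently.
    member = "snli_1.0/snli_1.0_{}.{}".format(file_split, file_type)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        text = zf.read(member).decode("utf-8")
    if file_type == "jsonl":
        # One JSON object per line.
        return pd.read_json(io.StringIO(text), lines=True)
    # Tab-separated text; disable quote handling so stray quotes stay literal.
    return pd.read_csv(io.StringIO(text), sep="\t", quoting=csv.QUOTE_NONE)


# Build a tiny stand-in archive so the sketch runs without a download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "snli_1.0/snli_1.0_dev.txt",
        "gold_label\tsentence1\tsentence2\n"
        "entailment\tChildren smiling and waving at camera\tThere are children present\n",
    )
    zf.writestr(
        "snli_1.0/snli_1.0_dev.jsonl",
        '{"gold_label": "entailment", "sentence1": "Children smiling and waving at camera", '
        '"sentence2": "There are children present"}\n',
    )

dev_txt = read_split_from_zip(buf.getvalue(), "dev")
dev_jsonl = read_split_from_zip(buf.getvalue(), "dev", file_type="jsonl")
```

Both calls return a one-row dataframe with the same three columns, which mirrors how the `file_type` parameter switches the source file without changing the shape of the result.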

In [4]:
train.head()

Unnamed: 0,gold_label,sentence1_binary_parse,sentence2_binary_parse,sentence1_parse,sentence2_parse,sentence1,sentence2,captionID,pairID,label1,label2,label3,label4,label5
0,neutral,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( is ( ( training ( his horse...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,A person is training his horse for a competition.,3416050480.jpg#4,3416050480.jpg#4r1n,neutral,,,,
1,contradiction,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( ( ( is ( at ( a diner ) ) )...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is at a diner, ordering an omelette.",3416050480.jpg#4,3416050480.jpg#4r1c,contradiction,,,,
2,entailment,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,"( ( A person ) ( ( ( ( is outdoors ) , ) ( on ...",(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is outdoors, on a horse.",3416050480.jpg#4,3416050480.jpg#4r1e,entailment,,,,
3,neutral,( Children ( ( ( smiling and ) waving ) ( at c...,( They ( are ( smiling ( at ( their parents ) ...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB...,Children smiling and waving at camera,They are smiling at their parents,2267923837.jpg#2,2267923837.jpg#2r1n,neutral,,,,
4,entailment,( Children ( ( ( smiling and ) waving ) ( at c...,( There ( ( are children ) present ) ),(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN...,Children smiling and waving at camera,There are children present,2267923837.jpg#2,2267923837.jpg#2r1e,entailment,,,,


### 02 Tokenize

Now that we've loaded the data into a pandas.DataFrame, we can tokenize the sentences.
We also clean the data before tokenizing. This includes dropping unnecessary columns and keeping only the label, renamed to `score`, and the two sentence columns, `sentence1` and `sentence2`.

In [5]:
def clean(df, file_split):
    # Note: `df` is not used; the data is re-read from the raw .txt file on disk.
    src_file_path = os.path.join(BASE_DATA_PATH, "raw/snli_1.0/snli_1.0_{}.txt".format(file_split))
    clean_dir = os.path.join(BASE_DATA_PATH, "clean/snli_1.0")
    os.makedirs(clean_dir, exist_ok=True)  # create the output directory if needed
    dest_file_path = os.path.join(clean_dir, "snli_1.0_{}.txt".format(file_split))
    clean_df = snli.clean_snli(src_file_path).dropna()  # drop rows with any NaN values
    clean_df.to_csv(dest_file_path)
    return clean_df
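
For intuition, the cleaning step described above (keep the label and the two sentences, rename the label to `score`, drop incomplete rows) can be sketched with plain pandas. `clean_sketch` is a hypothetical stand-in and is not a copy of `snli.clean_snli`, whose internals are not shown in this notebook.

```python
import pandas as pd


def clean_sketch(df):
    """Hypothetical stand-in for the cleaning step; not `snli.clean_snli`."""
    # Keep only the label and the two sentence columns.
    kept = df[["gold_label", "sentence1", "sentence2"]]
    # Rename the label column to match the cleaned output.
    kept = kept.rename(columns={"gold_label": "score"})
    # Drop rows where the label or either sentence is missing.
    return kept.dropna()


raw = pd.DataFrame(
    {
        "gold_label": ["neutral", "entailment", None],
        "sentence1": ["Children smiling and waving at camera"] * 3,
        "sentence2": ["They are smiling at their parents", None, "There are children present"],
        "captionID": ["2267923837.jpg#2"] * 3,
    }
)
cleaned = clean_sketch(raw)
```

Only the first row survives here, because the other two each have a missing value.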

In [6]:
train = clean(train, 'train')

In [7]:
train.head()

Unnamed: 0,score,sentence1,sentence2
0,neutral,A person on a horse jumps over a broken down a...,A person is training his horse for a competition.
1,contradiction,A person on a horse jumps over a broken down a...,"A person is at a diner, ordering an omelette."
2,entailment,A person on a horse jumps over a broken down a...,"A person is outdoors, on a horse."
3,neutral,Children smiling and waving at camera,They are smiling at their parents
4,entailment,Children smiling and waving at camera,There are children present


Once we have the clean pandas dataframe, we lowercase the text and tokenize it. We use the [NLTK](https://www.nltk.org/) library for tokenization.

In [8]:
train_tok = to_nltk_tokens(to_lowercase_all(train))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jamahaja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
train_tok.head()

Unnamed: 0,score,sentence1,sentence2,sentence1_tokens,sentence2_tokens
0,neutral,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, training, his, horse, for, a, ..."
1,contradiction,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, at, a, diner, ,, ordering, an,..."
2,entailment,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, outdoors, ,, on, a, horse, .]"
3,neutral,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]"
4,entailment,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]"
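
To make the shape of the two added columns concrete, here is a minimal sketch that lowercases the sentence columns and adds one token-list column per sentence. It deliberately swaps in a small regular-expression tokenizer (`simple_tokens`, a hypothetical helper) so that it runs without NLTK data; it is not the tokenizer used by `to_nltk_tokens` and will split some text differently.

```python
import re

import pandas as pd


def simple_tokens(text):
    """Regex stand-in for a word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


df = pd.DataFrame(
    {
        "score": ["entailment"],
        "sentence1": ["A person on a horse."],
        "sentence2": ["A person is outdoors, on a horse."],
    }
)

# Lowercase the sentence columns, then add one token-list column per sentence.
for col in ["sentence1", "sentence2"]:
    df[col] = df[col].str.lower()
    df[col + "_tokens"] = df[col].apply(simple_tokens)
```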


### 03 Reshape for GenSen model
We need to prepare our data in a specific way for the GenSen model to ingest it. We do this by

* Saving the tokens for each split in a `snli_1.0_{split}.txt.clean` file, with the sentence pairs and scores tab-separated and the tokens separated by a single space.
* Saving the tokenized sentences and labels separately, in the files `snli_1.0_{split}.txt.s1.tok`, `snli_1.0_{split}.txt.s2.tok`, and `snli_1.0_{split}.txt.lab`.

In [10]:
train = snli.load_pandas_df(BASE_DATA_PATH, file_split="train")
dev = snli.load_pandas_df(BASE_DATA_PATH, file_split="dev")
test = snli.load_pandas_df(BASE_DATA_PATH, file_split="test")

clean_train = clean(train, file_split="train")
clean_dev = clean(dev, file_split="dev")
clean_test = clean(test, file_split="test")

train_tok = to_nltk_tokens(to_lowercase_all(clean_train))
dev_tok = to_nltk_tokens(to_lowercase_all(clean_dev))
test_tok = to_nltk_tokens(to_lowercase_all(clean_test))

split_map = {'train': train_tok, 'dev': dev_tok, 'test': test_tok}
for file_split, df in split_map.items():
    base_txt_path = os.path.join(BASE_DATA_PATH, "clean/snli_1.0/snli_1.0_{}.txt".format(file_split))
    df['s1.tok'] = df['sentence1_tokens'].apply(lambda x: ' '.join(x))
    df['s2.tok'] = df['sentence2_tokens'].apply(lambda x: ' '.join(x))
    df['s1.tok'].to_csv("{}.s1.tok".format(base_txt_path), sep=' ', header=False, index=False)
    df['s2.tok'].to_csv("{}.s2.tok".format(base_txt_path), sep=' ', header=False, index=False)
    df['score'].to_csv("{}.lab".format(base_txt_path), sep=' ', header=False, index=False)
    df[['s1.tok', 's2.tok', 'score']].to_csv("{}.clean".format(base_txt_path), sep="\t", header=False, index=False)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jamahaja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jamahaja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jamahaja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
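
Because the `.s1.tok`, `.s2.tok`, and `.lab` files are matched up line by line, a quick sanity check is to confirm that they have the same number of lines and that every `.clean` row has three tab-separated fields. The sketch below is one way to do that; `check_alignment` is a hypothetical helper, it assumes no field contains a tab or a newline, and it is demonstrated on small throwaway files rather than the real output.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def check_alignment(base_txt_path):
    """Hypothetical check: the four output files describe the same number of rows."""
    s1 = read_lines(base_txt_path + ".s1.tok")
    s2 = read_lines(base_txt_path + ".s2.tok")
    lab = read_lines(base_txt_path + ".lab")
    clean = read_lines(base_txt_path + ".clean")
    same_length = len(s1) == len(s2) == len(lab) == len(clean)
    # Each .clean row should hold sentence 1, sentence 2, and the label.
    three_fields = all(len(row.split("\t")) == 3 for row in clean)
    return same_length and three_fields


# Demonstrate on throwaway files with one example row.
with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "example.txt")  # hypothetical base path
    contents = {
        ".s1.tok": "children smiling and waving at camera\n",
        ".s2.tok": "there are children present\n",
        ".lab": "entailment\n",
        ".clean": "children smiling and waving at camera\tthere are children present\tentailment\n",
    }
    for suffix, text in contents.items():
        with open(base + suffix, "w", encoding="utf-8") as f:
            f.write(text)
    aligned = check_alignment(base)
```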


In [11]:
# remove the double quotes that to_csv wrote into the .tok files
import shutil

for file_split in split_map.keys():
    for sentence in ("s1", "s2"):
        tok_path = os.path.join(
            BASE_DATA_PATH, "clean/snli_1.0/snli_1.0_{}.txt.{}.tok".format(file_split, sentence)
        )
        tmp_path = "{}.tmp".format(tok_path)
        # to_csv writes UTF-8 by default, so read and write with the same encoding
        with open(tok_path, 'r', encoding='utf-8') as fin, open(tmp_path, 'w', encoding='utf-8') as tmp:
            for line in fin:
                # note: this strips every double quote on the line, not only an enclosing pair
                tmp.write(line.replace('"', ''))
        shutil.move(tmp_path, tok_path)
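
The quotes most likely appear because `to_csv` is called with `sep=' '`, and the CSV writer quotes any field that contains the separator. One alternative, sketched below under that assumption, is to write the space-joined tokens with plain file I/O so that no quoting is applied in the first place. `write_token_lines` and the file name are hypothetical and are not part of `utils_nlp`.

```python
import os
import tempfile


def write_token_lines(token_lists, path):
    """Write one space-joined sentence per line, with no CSV quoting involved."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in token_lists:
            f.write(" ".join(tokens) + "\n")


tokens = [
    ["a", "person", "is", "at", "a", "diner", ",", "ordering", "an", "omelette", "."],
    ["there", "are", "children", "present"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example.s1.tok")  # hypothetical file name
    write_token_lines(tokens, out_path)
    with open(out_path, "r", encoding="utf-8") as f:
        written = f.read()
```

Writing the lines directly also leaves any double quotes that belong to the text untouched, whereas the replace-based cleanup removes every double quote it finds.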