<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

## Data Load & Prep
In this notebook we show how to download and preprocess the [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) dataset for sentence similarity.

For this task, the only preprocessing steps we need are:

* Make all text lowercase.
* Tokenize the text (segment it into words, punctuation marks, etc.). Here we show an example using [spaCy](https://spacy.io/).

### 00 Global Settings

In [1]:
import sys
sys.path.append("../../../")  # extend the module search path so that utils_nlp can be imported

import os
import azureml.dataprep as dp
import pandas as pd

from utils_nlp.dataset import stsbenchmark
from utils_nlp.dataset.preprocess import to_lowercase, to_spacy_tokens
from utils_nlp.dataset.url_utils import maybe_download

print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [2]:
STS_URL = "http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz"
BASE_DATA_PATH = "../../../data"
RAW_DATA_PATH = os.path.join(BASE_DATA_PATH, "raw")
CLEAN_DATA_PATH = os.path.join(BASE_DATA_PATH, "clean")

### 01 Data Download

Make a directory for the data if it doesn't already exist, and then download.

In [3]:
if not os.path.exists(RAW_DATA_PATH):
    os.makedirs(RAW_DATA_PATH)
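
On Python 3.2 and later, the existence check and the directory creation can be collapsed into one call by passing `exist_ok=True` to `os.makedirs`. The sketch below is one way to write that; `ensure_dir` is a hypothetical helper name, and the demo uses a temporary directory rather than the notebook's `RAW_DATA_PATH`.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parent directories) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated on a throwaway directory rather than the notebook's RAW_DATA_PATH.
demo_root = tempfile.mkdtemp()
raw_dir = ensure_dir(os.path.join(demo_root, "data", "raw"))
ensure_dir(raw_dir)  # a second call is a no-op instead of raising FileExistsError
```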

In [4]:
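def download_sts(url, dirpath):
    # Download the archive, then unpack it into the same directory.
    # `archive_path` avoids shadowing the standard-library `zipfile` module.
    archive_path = maybe_download(url, work_directory=dirpath)
    unzipped = stsbenchmark._extract_sts(archive_path, target_dirpath=dirpath, tmode="r:gz")
    return archive_path, unzipped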

In [5]:
tar_path, datapath = download_sts(STS_URL, RAW_DATA_PATH)
print("Data downloaded to {}".format(datapath))

418kB [00:07, 52.4kB/s]                            

Data downloaded to ../../../data/raw/stsbenchmark
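
The extraction step is hidden inside `stsbenchmark._extract_sts`. As a rough sketch of what such a helper plausibly does (an assumption, not the library's actual implementation), the standard-library `tarfile` module can unpack a gzipped tarball with the same `"r:gz"` mode string. `extract_tar_gz` is a hypothetical name, and the demo builds a tiny stand-in archive so that it runs without network access.

```python
import os
import tarfile
import tempfile


def extract_tar_gz(archive_path, target_dirpath, tmode="r:gz"):
    """Unpack a gzipped tarball into target_dirpath and return that directory.

    Only use this on archives from a trusted source.
    """
    os.makedirs(target_dirpath, exist_ok=True)
    with tarfile.open(archive_path, mode=tmode) as archive:
        archive.extractall(target_dirpath)
    return target_dirpath


# Build a small stand-in archive (not the real STS Benchmark data).
workdir = tempfile.mkdtemp()
sample_dir = os.path.join(workdir, "stsbenchmark")
os.makedirs(sample_dir)
with open(os.path.join(sample_dir, "sts-train.csv"), "w", encoding="utf-8") as f:
    f.write("placeholder\n")

archive_path = os.path.join(workdir, "sample.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(sample_dir, arcname="stsbenchmark")

out_dir = extract_tar_gz(archive_path, os.path.join(workdir, "raw"))
extracted_files = sorted(os.listdir(os.path.join(out_dir, "stsbenchmark")))
```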





### 02 Data Understanding
In this section we
* load the raw data into a dataframe
* peek at the first 10 rows

We can load the data using a `read` function that has built-in automatic filetype inference:

In [6]:
dflow = dp.auto_read_file(path=os.path.join(datapath, "sts-train.csv"))
dflow.head()

| | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 |
|---|---|---|---|---|---|---|---|
| 0 | main-captions | MSRvid | 2012test | 1 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | main-captions | MSRvid | 2012test | 4 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | main-captions | MSRvid | 2012test | 5 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | main-captions | MSRvid | 2012test | 6 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | main-captions | MSRvid | 2012test | 9 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |
| 5 | main-captions | MSRvid | 2012test | 11 | 4.25 | Some men are fighting. | Two men are fighting. |
| 6 | main-captions | MSRvid | 2012test | 12 | 0.5 | A man is smoking. | A man is skating. |
| 7 | main-captions | MSRvid | 2012test | 13 | 1.6 | The man is playing the piano. | The man is playing the guitar. |
| 8 | main-captions | MSRvid | 2012test | 14 | 2.2 | A man is playing on a guitar and singing. | A woman is playing an acoustic guitar and sing... |
| 9 | main-captions | MSRvid | 2012test | 16 | 5.0 | A person is throwing a cat on to the ceiling. | A person throws a cat on the ceiling. |
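
For readers without AzureML Data Prep, the same peek can be sketched with the standard-library `csv` module. This assumes the raw file is tab-separated with no header row (consistent with the auto-generated `Column1` to `Column7` names above) and that quoting should be disabled because sentences may contain quote characters. The field names in `STS_COLUMNS` are illustrative labels chosen here, not names taken from the file, and the two sample rows are re-encoded from the output above.

```python
import csv
import io

# Illustrative labels for the seven positional fields (not part of the raw file).
STS_COLUMNS = ["genre", "source", "year", "id", "score", "sentence1", "sentence2"]


def read_sts_rows(lines):
    """Parse tab-separated STS lines into dictionaries with a float score."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        record = dict(zip(STS_COLUMNS, fields[:7]))  # ignore any extra trailing fields
        record["score"] = float(record["score"])
        rows.append(record)
    return rows


sample = io.StringIO(
    "main-captions\tMSRvid\t2012test\t1\t5.0\t"
    "A plane is taking off.\tAn air plane is taking off.\n"
    "main-captions\tMSRvid\t2012test\t4\t3.8\t"
    "A man is playing a large flute.\tA man is playing a flute.\n"
)
rows = read_sts_rows(sample)
```

To read the real file, pass an open file handle instead of `sample`, for example `open(os.path.join(datapath, "sts-train.csv"), encoding="utf-8")`.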


The `auto_read_file` function from the AzureML Data Prep module returns a `Dataflow` object, which is documented in the [Dataflow API reference](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py). We can convert the data to a pandas DataFrame in a single line with the `to_pandas_dataframe` function, or we can continue manipulating it as a `Dataflow` object through the AzureML Data Prep API. In the next section we do the latter.

### 03 Data Cleaning
Now that we know the general shape of the data, we can clean it so that it is ready for further preprocessing. The main operation we need for the STS Benchmark data is to drop all columns except the sentence pairs and their scores.

In [7]:
sentences = dflow.keep_columns(['Column5', 'Column6', 'Column7']) \
                    .rename_columns({'Column5': 'score', 'Column6': 'sentence1', 'Column7': 'sentence2'})
sentences.head()

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |
| 5 | 4.25 | Some men are fighting. | Two men are fighting. |
| 6 | 0.5 | A man is smoking. | A man is skating. |
| 7 | 1.6 | The man is playing the piano. | The man is playing the guitar. |
| 8 | 2.2 | A man is playing on a guitar and singing. | A woman is playing an acoustic guitar and sing... |
| 9 | 5.0 | A person is throwing a cat on to the ceiling. | A person throws a cat on the ceiling. |
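
If the data has already been converted to a pandas DataFrame (for example with `to_pandas_dataframe`), the same column selection and renaming can be sketched in plain pandas. The small `raw` frame below is a hand-built stand-in containing two rows from the output above; it is not loaded from the real file.

```python
import pandas as pd

# Stand-in for the raw data: two rows copied from the output shown earlier.
raw = pd.DataFrame({
    "Column1": ["main-captions", "main-captions"],
    "Column2": ["MSRvid", "MSRvid"],
    "Column3": ["2012test", "2012test"],
    "Column4": [1, 4],
    "Column5": [5.0, 3.8],
    "Column6": ["A plane is taking off.", "A man is playing a large flute."],
    "Column7": ["An air plane is taking off.", "A man is playing a flute."],
})

# Keep only the score and the two sentences, then give them descriptive names.
column_names = {"Column5": "score", "Column6": "sentence1", "Column7": "sentence2"}
sentences_df = raw[list(column_names)].rename(columns=column_names)
```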


### 04 One-Shot Dataframe Loading
You can also use our STSBenchmark utils to automatically download, extract, and persist the data. You can then load the sanitized data as a pandas DataFrame in one line. 

In [8]:
# Calling this function runs the downloader and extractor behind the scenes
sts_train = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split="train")

In [9]:
sts_train.head()

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |


### 05 Make Lowercase
We start with simple standardization of the text by making all text lowercase.

In [10]:
sts_train_low = to_lowercase(sts_train)
sts_train_low.head()

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 5.0 | a plane is taking off. | an air plane is taking off. |
| 1 | 3.8 | a man is playing a large flute. | a man is playing a flute. |
| 2 | 3.8 | a man is spreading shreded cheese on a pizza. | a man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | three men are playing chess. | two men are playing chess. |
| 4 | 4.25 | a man is playing the cello. | a man seated is playing the cello. |
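
The internals of `to_lowercase` are not shown in this notebook. A minimal sketch of the idea, assuming the goal is to lowercase every string cell and leave numeric cells such as `score` untouched, could look like this (`lowercase_text` is a hypothetical name, not the library function):

```python
import pandas as pd


def lowercase_text(df):
    """Return a new DataFrame in which every string cell is lowercased."""
    return df.apply(
        lambda col: col.map(lambda v: v.lower() if isinstance(v, str) else v)
    )


demo = pd.DataFrame({
    "score": [5.0, 3.8],
    "sentence1": ["A plane is taking off.", "A man is playing a large flute."],
    "sentence2": ["An air plane is taking off.", "A man is playing a flute."],
})
lowered = lowercase_text(demo)
```

The input frame is left unchanged, so the original casing stays available if a later step needs it.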


### 06 Tokenize
We tokenize the text using spaCy's non-destructive tokenizer.

In [11]:
sts_train_tok = to_spacy_tokens(sts_train_low.head(10)) # operating on a small slice of the data as an example
sts_train_tok.head(10)

| | score | sentence1 | sentence2 | sentence1_tokens | sentence2_tokens |
|---|---|---|---|---|---|
| 0 | 5.0 | a plane is taking off. | an air plane is taking off. | [a, plane, is, taking, off, .] | [an, air, plane, is, taking, off, .] |
| 1 | 3.8 | a man is playing a large flute. | a man is playing a flute. | [a, man, is, playing, a, large, flute, .] | [a, man, is, playing, a, flute, .] |
| 2 | 3.8 | a man is spreading shreded cheese on a pizza. | a man is spreading shredded cheese on an uncoo... | [a, man, is, spreading, shreded, cheese, on, a... | [a, man, is, spreading, shredded, cheese, on, ... |
| 3 | 2.6 | three men are playing chess. | two men are playing chess. | [three, men, are, playing, chess, .] | [two, men, are, playing, chess, .] |
| 4 | 4.25 | a man is playing the cello. | a man seated is playing the cello. | [a, man, is, playing, the, cello, .] | [a, man, seated, is, playing, the, cello, .] |
| 5 | 4.25 | some men are fighting. | two men are fighting. | [some, men, are, fighting, .] | [two, men, are, fighting, .] |
| 6 | 0.5 | a man is smoking. | a man is skating. | [a, man, is, smoking, .] | [a, man, is, skating, .] |
| 7 | 1.6 | the man is playing the piano. | the man is playing the guitar. | [the, man, is, playing, the, piano, .] | [the, man, is, playing, the, guitar, .] |
| 8 | 2.2 | a man is playing on a guitar and singing. | a woman is playing an acoustic guitar and sing... | [a, man, is, playing, on, a, guitar, and, sing... | [a, woman, is, playing, an, acoustic, guitar, ... |
| 9 | 5.0 | a person is throwing a cat on to the ceiling. | a person throws a cat on the ceiling. | [a, person, is, throwing, a, cat, on, to, the,... | [a, person, throws, a, cat, on, the, ceiling, .] |
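
"Non-destructive" here means that tokenization keeps enough information (in particular the whitespace around each token) to reconstruct the original text exactly. The sketch below illustrates that property with a standard-library regular expression. It is a simplified stand-in, not spaCy's tokenization algorithm, and it assumes the text has no leading whitespace.

```python
import re

# A token is a run of word characters or a single punctuation mark;
# the second group captures the whitespace that follows it.
TOKEN_PATTERN = re.compile(r"(\w+|[^\w\s])(\s*)")


def tokenize_with_whitespace(text):
    """Split text into (token, trailing_whitespace) pairs."""
    return TOKEN_PATTERN.findall(text)


text = "a man is playing a large flute."
pairs = tokenize_with_whitespace(text)
tokens = [tok for tok, _ in pairs]

# Because the whitespace was kept, the original string can be rebuilt exactly.
rebuilt = "".join(tok + ws for tok, ws in pairs)
```

For this sentence the tokens agree with the `sentence1_tokens` entry in row 1 above, although a regex like this will diverge from spaCy on contractions, URLs, and similar cases.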


### 07 Optional: Remove Stop Words
Removing stop words is another common preprocessing step for NLP tasks. We use the `rm_spacy_stopwords` utility function to do this on the dataframe. This function makes use of the spaCy language model's default set of stop words. If we need to add our own set of stop words (for example, if we are doing an NLP task for a very specific domain of content), we can do this in-line by simply providing the list as the `custom_stopwords` parameter of `rm_spacy_stopwords`.

In [12]:
from utils_nlp.dataset.preprocess import rm_spacy_stopwords
rm_spacy_stopwords(sts_train_tok) # operating on a small slice of the data as an example

| | score | sentence1 | sentence2 | sentence1_tokens | sentence2_tokens | sentence1_tokens_rm_stopwords | sentence2_tokens_rm_stopwords |
|---|---|---|---|---|---|---|---|
| 0 | 5.0 | a plane is taking off. | an air plane is taking off. | [a, plane, is, taking, off, .] | [an, air, plane, is, taking, off, .] | [plane, taking, .] | [air, plane, taking, .] |
| 1 | 3.8 | a man is playing a large flute. | a man is playing a flute. | [a, man, is, playing, a, large, flute, .] | [a, man, is, playing, a, flute, .] | [man, playing, large, flute, .] | [man, playing, flute, .] |
| 2 | 3.8 | a man is spreading shreded cheese on a pizza. | a man is spreading shredded cheese on an uncoo... | [a, man, is, spreading, shreded, cheese, on, a... | [a, man, is, spreading, shredded, cheese, on, ... | [man, spreading, shreded, cheese, pizza, .] | [man, spreading, shredded, cheese, uncooked, p... |
| 3 | 2.6 | three men are playing chess. | two men are playing chess. | [three, men, are, playing, chess, .] | [two, men, are, playing, chess, .] | [men, playing, chess, .] | [men, playing, chess, .] |
| 4 | 4.25 | a man is playing the cello. | a man seated is playing the cello. | [a, man, is, playing, the, cello, .] | [a, man, seated, is, playing, the, cello, .] | [man, playing, cello, .] | [man, seated, playing, cello, .] |
| 5 | 4.25 | some men are fighting. | two men are fighting. | [some, men, are, fighting, .] | [two, men, are, fighting, .] | [men, fighting, .] | [men, fighting, .] |
| 6 | 0.5 | a man is smoking. | a man is skating. | [a, man, is, smoking, .] | [a, man, is, skating, .] | [man, smoking, .] | [man, skating, .] |
| 7 | 1.6 | the man is playing the piano. | the man is playing the guitar. | [the, man, is, playing, the, piano, .] | [the, man, is, playing, the, guitar, .] | [man, playing, piano, .] | [man, playing, guitar, .] |
| 8 | 2.2 | a man is playing on a guitar and singing. | a woman is playing an acoustic guitar and sing... | [a, man, is, playing, on, a, guitar, and, sing... | [a, woman, is, playing, an, acoustic, guitar, ... | [man, playing, guitar, singing, .] | [woman, playing, acoustic, guitar, singing, .] |
| 9 | 5.0 | a person is throwing a cat on to the ceiling. | a person throws a cat on the ceiling. | [a, person, is, throwing, a, cat, on, to, the,... | [a, person, throws, a, cat, on, the, ceiling, .] | [person, throwing, cat, ceiling, .] | [person, throws, cat, ceiling, .] |
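
To make the role of `custom_stopwords` concrete, here is a minimal standard-library sketch of stop word filtering on a token list. It assumes that custom stop words extend the default set rather than replace it. `remove_stopwords` is a hypothetical helper and `DEMO_STOPWORDS` is a tiny illustrative set, not spaCy's default stop word list.

```python
def remove_stopwords(tokens, stopwords, custom_stopwords=None):
    """Drop every token that appears in the default or the custom stop word set."""
    blocked = set(stopwords) | set(custom_stopwords or [])
    return [tok for tok in tokens if tok not in blocked]


# A tiny illustrative set; spaCy's default list is much larger.
DEMO_STOPWORDS = {"a", "an", "the", "is", "are", "on", "off", "to", "and"}

tokens = ["a", "plane", "is", "taking", "off", "."]
default_result = remove_stopwords(tokens, DEMO_STOPWORDS)

# Domain-specific additions are merged with the defaults.
custom_result = remove_stopwords(tokens, DEMO_STOPWORDS, custom_stopwords=["plane"])
```

With the demo set, `default_result` reproduces the `sentence1_tokens_rm_stopwords` entry in row 0 above. Punctuation survives because it is not in the stop word set.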
