## Data Load & Prep
In this notebook we show how to download the [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) data and prepare it for pre-processing. Because open-source data may have been curated for tasks that differ slightly from our own, it is useful to do some basic preliminary data exploration and clean the data before moving forward in the NLP pipeline.  

### 00 Global Settings

In [2]:
import sys
sys.path.append("../../") ## set the environment path

import os

import pandas as pd
import azureml.dataprep as dp

from utils_nlp.dataset.url_utils import maybe_download
from utils_nlp.dataset.stsbenchmark import extract_sts

print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [3]:
STS_URL = "http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz"
BASE_DATA_PATH = "../../data"
RAW_DATA_PATH = os.path.join(BASE_DATA_PATH, "raw")
CLEAN_DATA_PATH = os.path.join(BASE_DATA_PATH, "clean")

### 01 Data Download

Make a directory for the data if it doesn't already exist, and then download.

In [4]:
if not os.path.exists(RAW_DATA_PATH):
    os.makedirs(RAW_DATA_PATH)

In [5]:
def download_sts(url, dirpath):
    # Fetch the archive, then unpack the gzipped tarball into the same directory
    archive_path = maybe_download(url, work_directory=dirpath)
    unzipped = extract_sts(archive_path, target_dirpath=dirpath, tmode="r:gz")
    return archive_path, unzipped

In [6]:
archive_path, datapath = download_sts(STS_URL, RAW_DATA_PATH)
print("Data downloaded to {}".format(datapath))

418kB [00:03, 128kB/s]                             

Data downloaded to ../../data/raw/stsbenchmark
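
The helpers `maybe_download` and `extract_sts` come from `utils_nlp`, and their internals are not shown in this notebook. As a rough illustration of the pattern only, the sketch below uses just the standard library: it skips the download when the archive is already on disk, then unpacks the gzip-compressed tarball. The names `fetch_if_missing` and `extract_tar_gz` are hypothetical and are not part of `utils_nlp`.

```python
import os
import tarfile
import urllib.request


def fetch_if_missing(url, dirpath):
    """Download url into dirpath unless a file with that name already exists.

    Hypothetical stand-in for illustration; not the utils_nlp implementation.
    """
    os.makedirs(dirpath, exist_ok=True)
    filepath = os.path.join(dirpath, os.path.basename(url))
    if not os.path.exists(filepath):
        urllib.request.urlretrieve(url, filepath)
    return filepath


def extract_tar_gz(archive_path, target_dirpath):
    """Unpack a .tar.gz archive into target_dirpath and return the member names."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        # Only extract archives that come from a source you trust.
        tar.extractall(path=target_dirpath)
    return names
```

Calling `fetch_if_missing(STS_URL, RAW_DATA_PATH)` followed by `extract_tar_gz` on the returned path would mirror the two steps inside `download_sts`, under the assumptions above.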





### 02 Data Understanding
In this section we show how to: 
* load raw data into a dataframe
* peek into the first n rows

One way to do this is to check the file types of the downloaded data and use the matching pandas `read_*` function.

In [7]:
print(os.listdir(datapath))

['sts-test.csv', 'sts-dev.csv', 'readme.txt', 'correlation.pl', 'LICENSE.txt', 'sts-train.csv']


Because the data files are delimited text (tab-separated, despite the `.csv` extension), we can use the pandas `read_csv` function with `sep='\t'`.

In [11]:
## TODO figure out how to integrate the runtools extension that lets you run the entire NB at once, skipping errors
df = pd.read_csv(os.path.join(datapath, "sts-train.csv"), sep='\t')
df.head(10)

ParserError: Error tokenizing data. C error: Expected 7 fields in line 2508, saw 8


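We see that this throws a parsing error: line 2508 has 8 fields where pandas expected 7. One workaround is to pass explicit names for the 7 expected columns, which also stops pandas from treating the first data row as a header: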

In [8]:
df = pd.read_csv(os.path.join(datapath, "sts-train.csv"), sep='\t', names=list(range(7)))
df.head(10)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | main-captions | MSRvid | 2012test | 1 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | main-captions | MSRvid | 2012test | 4 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | main-captions | MSRvid | 2012test | 5 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | main-captions | MSRvid | 2012test | 6 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | main-captions | MSRvid | 2012test | 9 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |
| 5 | main-captions | MSRvid | 2012test | 11 | 4.25 | Some men are fighting. | Two men are fighting. |
| 6 | main-captions | MSRvid | 2012test | 12 | 0.5 | A man is smoking. | A man is skating. |
| 7 | main-captions | MSRvid | 2012test | 13 | 1.6 | The man is playing the piano. | The man is playing the guitar. |
| 8 | main-captions | MSRvid | 2012test | 14 | 2.2 | A man is playing on a guitar and singing. | A woman is playing an acoustic guitar and sing... |
| 9 | main-captions | MSRvid | 2012test | 16 | 5.0 | A person is throwing a cat on to the ceiling. | A person throws a cat on the ceiling. |
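
To see where a message like "Expected 7 fields ... saw 8" comes from, it can help to count the tab-separated fields on each line before settling on a parser configuration. The sketch below does this with the standard `csv` module on a small sample whose third row is made up (it is not taken from the STS data). `field_count_histogram` is a hypothetical helper, and `quoting=csv.QUOTE_NONE` reflects an assumption that quote characters inside sentences should be treated as ordinary text.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines, delimiter="\t"):
    """Return a Counter mapping number-of-fields -> number of lines."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return Counter(len(row) for row in reader)


# Two 7-field rows, plus one made-up row with an extra trailing field.
sample = io.StringIO(
    "main-captions\tMSRvid\t2012test\t1\t5.0\tA plane is taking off.\tAn air plane is taking off.\n"
    "main-captions\tMSRvid\t2012test\t4\t3.8\tA man is playing a large flute.\tA man is playing a flute.\n"
    "main-news\tsourceX\t2015\t7\t2.0\tFirst sentence.\tSecond sentence.\textra field\n"
)
histogram = field_count_histogram(sample)
print(histogram)
```

Applied to an open file handle for `sts-train.csv`, any count other than 7 would point at the rows that need attention.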


We could alternatively use a `read` function that has built-in automatic filetype inference:

In [9]:
dflow = dp.auto_read_file(path=os.path.join(datapath, "sts-train.csv"))
dflow.head(10)

| | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 |
|---|---|---|---|---|---|---|---|
| 0 | main-captions | MSRvid | 2012test | 1 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | main-captions | MSRvid | 2012test | 4 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | main-captions | MSRvid | 2012test | 5 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | main-captions | MSRvid | 2012test | 6 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | main-captions | MSRvid | 2012test | 9 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |
| 5 | main-captions | MSRvid | 2012test | 11 | 4.25 | Some men are fighting. | Two men are fighting. |
| 6 | main-captions | MSRvid | 2012test | 12 | 0.5 | A man is smoking. | A man is skating. |
| 7 | main-captions | MSRvid | 2012test | 13 | 1.6 | The man is playing the piano. | The man is playing the guitar. |
| 8 | main-captions | MSRvid | 2012test | 14 | 2.2 | A man is playing on a guitar and singing. | A woman is playing an acoustic guitar and sing... |
| 9 | main-captions | MSRvid | 2012test | 16 | 5.0 | A person is throwing a cat on to the ceiling. | A person throws a cat on the ceiling. |


The `auto_read_file` function from the AzureML Data Prep module returns a `Dataflow` object, which is described in the [Dataflow API reference](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py). We can convert the data to a pandas DataFrame (as before) in a single line with the `to_pandas_dataframe` method, or keep manipulating it as a `Dataflow` object through the AzureML Data Prep API. For the remainder of this notebook we do the latter.

### 03 Data Cleaning
Now that we know the general shape of the data, we can clean it so that it is ready for further preprocessing. The main operation we need for the STS Benchmark data is to drop all columns except the sentence pairs and the score, which will be used to supervise our sentence similarity training.

In [10]:
sentences = dflow.keep_columns(['Column5', 'Column6', 'Column7']) \
                .rename_columns({'Column5': 'score', 'Column6': 's1', 'Column7': 's2'})
sentences.head(10)

| | score | s1 | s2 |
|---|---|---|---|
| 0 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |
| 5 | 4.25 | Some men are fighting. | Two men are fighting. |
| 6 | 0.5 | A man is smoking. | A man is skating. |
| 7 | 1.6 | The man is playing the piano. | The man is playing the guitar. |
| 8 | 2.2 | A man is playing on a guitar and singing. | A woman is playing an acoustic guitar and sing... |
| 9 | 5.0 | A person is throwing a cat on to the ceiling. | A person throws a cat on the ceiling. |
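
For readers who prefer to stay in pandas, the same selection and renaming can be sketched with plain DataFrame operations. This is an illustrative equivalent on two rows copied from the sample output, not the approach this notebook follows. The integer column labels assume the data was read with `names=list(range(7))`, as in the earlier `read_csv` workaround.

```python
import io

import pandas as pd

# Two rows copied from the sample output, tab-separated like the raw files.
raw = io.StringIO(
    "main-captions\tMSRvid\t2012test\t1\t5.0\tA plane is taking off.\tAn air plane is taking off.\n"
    "main-captions\tMSRvid\t2012test\t6\t2.6\tThree men are playing chess.\tTwo men are playing chess.\n"
)
df = pd.read_csv(raw, sep="\t", names=list(range(7)))

# Keep the score and the two sentences, then give them descriptive names.
sentences = df[[4, 5, 6]].rename(columns={4: "score", 5: "s1", 6: "s2"})
print(sentences)
```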


We will want to do this for all the datasets (train, dev, and test) and then persist the results into a new clean directory.

In [11]:
def clean_sts(src_dir, filenames, target_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(target_dir, exist_ok=True)
    for filename in filenames:
        dat = dp.auto_read_file(path=os.path.join(src_dir, filename))
        # Keep only the score and the sentence pair, with descriptive column names
        s = dat.keep_columns(['Column5', 'Column6', 'Column7']) \
               .rename_columns({'Column5': 'score', 'Column6': 's1', 'Column7': 's2'})
        # to_csv returns None when given a path, so there is nothing to assign
        s.to_pandas_dataframe().to_csv(os.path.join(target_dir, filename), sep='\t')

In [12]:
sts_files = [f for f in os.listdir(os.path.join(RAW_DATA_PATH, "stsbenchmark")) if f.endswith(".csv")]
clean_sts(os.path.join(RAW_DATA_PATH, "stsbenchmark"), sts_files, os.path.join(CLEAN_DATA_PATH, "stsbenchmark"))
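
Before moving on, a quick sanity check on the cleaned rows can catch problems early. The sketch below is one possible check and is not part of `utils_nlp`: `find_bad_rows` is a hypothetical helper, and the 0 to 5 score range is an assumption based on the STS annotation scale. It runs on made-up in-memory rows shaped like the cleaned files.

```python
import csv
import io


def find_bad_rows(rows, min_score=0.0, max_score=5.0):
    """Return (row_index, reason) pairs for rows that fail basic checks.

    row_index is the 0-based position among the data rows.
    """
    problems = []
    for i, row in enumerate(rows):
        try:
            score = float(row["score"])
        except (KeyError, TypeError, ValueError):
            problems.append((i, "score is missing or not numeric"))
            continue
        if not min_score <= score <= max_score:
            problems.append((i, "score out of range"))
        if not (row.get("s1") or "").strip() or not (row.get("s2") or "").strip():
            problems.append((i, "empty sentence"))
    return problems


# Made-up rows in the cleaned layout (tab-separated with a header line).
cleaned = io.StringIO(
    "score\ts1\ts2\n"
    "5.0\tA plane is taking off.\tAn air plane is taking off.\n"
    "7.5\tA man is smoking.\tA man is skating.\n"
    "2.0\t\tTwo men are fighting.\n"
)
reader = csv.DictReader(cleaned, delimiter="\t", quoting=csv.QUOTE_NONE)
problems = find_bad_rows(reader)
print(problems)
```

The files written by `clean_sts` also carry the DataFrame index as an unnamed first column, which this check simply ignores.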

### 04 One-Shot Data Prep
You can also use our STSBenchmark utils to automatically download, extract, and persist the data. You can then load the sanitized data as a pandas DataFrame in one line. 

In [4]:
from utils_nlp.dataset.stsbenchmark import STSBenchmark

In [5]:
# Initializing this instance runs the downloader and extractor behind the scenes
sts_dev = STSBenchmark("dev", base_data_path=BASE_DATA_PATH)

In [7]:
df = sts_dev.as_dataframe()
df.head(10)

| | score | s1 | s2 |
|---|---|---|---|
| 0 | 5.0 | A man with a hard hat is dancing. | A man wearing a hard hat is dancing. |
| 1 | 4.75 | A young child is riding a horse. | A child is riding a horse. |
| 2 | 5.0 | A man is feeding a mouse to a snake. | The man is feeding a mouse to the snake. |
| 3 | 2.4 | A woman is playing the guitar. | A man is playing guitar. |
| 4 | 2.75 | A woman is playing the flute. | A man is playing a flute. |
| 5 | 2.615 | A woman is cutting an onion. | A man is cutting onions. |
| 6 | 5.0 | A man is erasing a chalk board. | The man is erasing the chalk board. |
| 7 | 2.333 | A woman is carrying a boy. | A woman is carrying her baby. |
| 8 | 3.75 | Three men are playing guitars. | Three men are on stage playing guitars. |
| 9 | 5.0 | A woman peels a potato. | A woman is peeling a potato. |
