## Data Preprocessing

The goal of this notebook is to demonstrate how to use [spaCy](https://spacy.io/) and pandas to preprocess the STS Benchmark data for the sentence similarity task. For this task, the only preprocessing steps we need are:

* Make all text lowercase
* Tokenize (segment the text into words, punctuation marks, etc.)

### 00 Global Settings

In [9]:
import sys
sys.path.append("../../")  # make the utils_nlp package importable from this notebook

import os
import pandas as pd
from utils_nlp.dataset.preprocess import to_lowercase, to_spacy_tokens

from utils_nlp.dataset.stsbenchmark import STSBenchmark

In [11]:
BASE_DATA_PATH = "../../data"

### 01 Load Data

We can use the STSBenchmark utils to load the data as a pandas dataframe.

In [3]:
df = STSBenchmark("train", base_data_path=BASE_DATA_PATH).as_dataframe()

In [4]:
df.head(5)

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 5.0 | A plane is taking off. | An air plane is taking off. |
| 1 | 3.8 | A man is playing a large flute. | A man is playing a flute. |
| 2 | 3.8 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | Three men are playing chess. | Two men are playing chess. |
| 4 | 4.25 | A man is playing the cello. | A man seated is playing the cello. |


### 02 Make Lowercase
We start with simple standardization of the text by making all text lowercase.

In [5]:
df_low = to_lowercase(df)
df_low.head(5)

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 5.0 | a plane is taking off. | an air plane is taking off. |
| 1 | 3.8 | a man is playing a large flute. | a man is playing a flute. |
| 2 | 3.8 | a man is spreading shreded cheese on a pizza. | a man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | three men are playing chess. | two men are playing chess. |
| 4 | 4.25 | a man is playing the cello. | a man seated is playing the cello. |
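
The implementation of `to_lowercase` is not shown in this notebook. As a rough illustration only, an equivalent transform could be sketched in plain pandas as follows, assuming the helper lowercases every text column and leaves numeric columns untouched. The function name `lowercase_text_columns` and the sample frame are hypothetical.

```python
import pandas as pd


def lowercase_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every string column lowercased.

    Hypothetical stand-in for to_lowercase; the real helper may differ.
    """
    out = df.copy()
    for col in out.columns:
        # Only touch columns that hold strings; numeric columns pass through.
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.lower()
    return out


# Small sample frame mirroring the first row shown above.
sample = pd.DataFrame(
    {
        "score": [5.0],
        "sentence1": ["A plane is taking off."],
        "sentence2": ["An air plane is taking off."],
    }
)
lowered = lowercase_text_columns(sample)
print(lowered)
```

Working on a copy keeps the original frame intact, which matches how the notebook keeps `df` and `df_low` as separate variables.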


### 03 Tokenize
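We tokenize the text using spaCy's non-destructive tokenizer. Note that the token lists in the output below omit stop words such as "a", "is", and "off", so the helper appears to filter them out after tokenization.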

In [6]:
import time
start = time.time()
df_tok = to_spacy_tokens(df_low)
print("Time to tokenize the data: {} s".format(time.time()-start))

Time to tokenize the data: 111.01340293884277 s


In [7]:
df_tok.head()

| | score | sentence1 | sentence2 | sentence1_tokens | sentence2_tokens |
|---|---|---|---|---|---|
| 0 | 5.0 | a plane is taking off. | an air plane is taking off. | [plane, taking, .] | [air, plane, taking, .] |
| 1 | 3.8 | a man is playing a large flute. | a man is playing a flute. | [man, playing, large, flute, .] | [man, playing, flute, .] |
| 2 | 3.8 | a man is spreading shreded cheese on a pizza. | a man is spreading shredded cheese on an uncoo... | [man, spreading, shreded, cheese, pizza, .] | [man, spreading, shredded, cheese, uncooked, p... |
| 3 | 2.6 | three men are playing chess. | two men are playing chess. | [men, playing, chess, .] | [men, playing, chess, .] |
| 4 | 4.25 | a man is playing the cello. | a man seated is playing the cello. | [man, playing, cello, .] | [man, seated, playing, cello, .] |
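
The internals of `to_spacy_tokens` are not shown here either. The sketch below is one way such a helper might be written, not the library's actual code. It assumes a blank English pipeline (tokenizer only, no trained model) and assumes stop words are dropped, since the token lists above omit words like "a" and "is". The function name `add_token_columns` and the `text_cols` parameter are hypothetical.

```python
import pandas as pd
import spacy

# Blank English pipeline: tokenizer and language defaults only,
# so no trained model needs to be downloaded.
nlp = spacy.blank("en")


def add_token_columns(df, text_cols=("sentence1", "sentence2")):
    """Return a copy of df with a '<col>_tokens' list column per text column.

    Hypothetical stand-in for to_spacy_tokens; stop-word filtering is an
    assumption based on the output shown in the notebook.
    """
    out = df.copy()
    for col in text_cols:
        # nlp.pipe processes the texts as a stream instead of one call each.
        docs = nlp.pipe(out[col].astype(str).tolist())
        out[col + "_tokens"] = [
            [tok.text for tok in doc if not tok.is_stop] for doc in docs
        ]
    return out


sample_low = pd.DataFrame(
    {
        "sentence1": ["a plane is taking off."],
        "sentence2": ["an air plane is taking off."],
    }
)
tokenized = add_token_columns(sample_low)
print(tokenized["sentence1_tokens"][0])
```

Punctuation is kept as its own token in this sketch, in line with the trailing `.` visible in the notebook's token lists.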


### 04 Persist
Since it is generally good practice to save data transformations incrementally, we apply both transforms (lowercasing and tokenization) to the train, dev, and test datasets and persist the results to the `data/preprocessed/stsbenchmark` directory.

In [14]:
preprocessed_dir = os.path.join(BASE_DATA_PATH, "preprocessed", "stsbenchmark")
# Create the output directory (and any missing parents) if it does not exist yet.
os.makedirs(preprocessed_dir, exist_ok=True)

In [19]:
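def preprocess_all():
    """Lowercase and tokenize each split, then save it to preprocessed_dir."""
    for split in ["train", "dev", "test"]:
        df = STSBenchmark(split, base_data_path=BASE_DATA_PATH).as_dataframe()
        df_low = to_lowercase(df)
        df_tok = to_spacy_tokens(df_low)
        # Note: the files are tab-separated despite the .csv extension.
        df_tok.to_csv(os.path.join(preprocessed_dir, "sts-{}.csv".format(split)), sep='\t')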

In [20]:
preprocess_all()
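
One thing to watch when reading these files back: `to_csv` writes each token list as its string representation, so a plain `read_csv` returns strings rather than lists. The sketch below shows one way to restore the lists with `ast.literal_eval`. It round-trips a small hypothetical frame through an in-memory buffer instead of the real files, so the frame contents here are illustrative only.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for a tokenized split (illustrative data only).
df_small = pd.DataFrame(
    {
        "score": [5.0],
        "sentence1": ["a plane is taking off."],
        "sentence1_tokens": [["plane", "taking", "."]],
    }
)

# Write tab-separated text, as preprocess_all does, but to memory.
buffer = io.StringIO()
df_small.to_csv(buffer, sep="\t")
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an
# "Unnamed: 0" column; the converter parses the list literal back
# into a Python list.
restored = pd.read_csv(
    buffer,
    sep="\t",
    index_col=0,
    converters={"sentence1_tokens": ast.literal_eval},
)
print(restored["sentence1_tokens"][0])
```

For the real files, the same `read_csv` arguments would apply to a path such as `os.path.join(preprocessed_dir, "sts-train.csv")`, with a converter for each token column.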