## Data Preprocessing

The goal of this notebook is to demonstrate how to use [spaCy](https://spacy.io/) and pandas to preprocess the STS Benchmark data for the sentence similarity task. For this task, the only preprocessing steps we need are:

* Make all text lowercase
* Tokenize (segment the text into words, punctuation marks, etc.)
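
As a rough illustration of these two steps on a single sentence, the sketch below lowercases the text and splits it with a regular expression. The regex is a simplified stand-in for spaCy's tokenizer, and `simple_preprocess` is a hypothetical helper that is not part of `utils_nlp`; the notebook itself relies on `to_lowercase` and `to_spacy_tokens`.

```python
import re

# Simplified stand-in for a real tokenizer: a run of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_preprocess(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_preprocess("A plane is taking off.")
print(tokens)
```

A rule-based tokenizer such as spaCy's covers cases this regex does not attempt, for example contractions and abbreviations.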

### 00 Global Settings

In [1]:
import sys
sys.path.append("../../../")  # add the repository root to the import path so utils_nlp can be found

import os
import pandas as pd

from utils_nlp.dataset.preprocess import to_lowercase, to_spacy_tokens
from utils_nlp.dataset.stsbenchmark import STSBenchmark

In [2]:
BASE_DATA_PATH = "../../../data"

### 01 Load Data

We can use the STSBenchmark utils to load the data as a pandas dataframe.

In [3]:
df = STSBenchmark("train", base_data_path=BASE_DATA_PATH).as_dataframe()

418kB [00:01, 210kB/s]

Data downloaded to ../../../data\raw\stsbenchmark
Writing clean dataframe to ../../../data\clean\stsbenchmark\sts-dev.csv
Writing clean dataframe to ../../../data\clean\stsbenchmark\sts-test.csv
Writing clean dataframe to ../../../data\clean\stsbenchmark\sts-train.csv


In [4]:
df.head(5)

Unnamed: 0,score,sentence1,sentence2
0,5.0,A plane is taking off.,An air plane is taking off.
1,3.8,A man is playing a large flute.,A man is playing a flute.
2,3.8,A man is spreading shreded cheese on a pizza.,A man is spreading shredded cheese on an uncoo...
3,2.6,Three men are playing chess.,Two men are playing chess.
4,4.25,A man is playing the cello.,A man seated is playing the cello.


### 02 Make Lowercase
We start with simple standardization of the text by making all text lowercase.

In [5]:
df_low = to_lowercase(df)
df_low.head(5)

Unnamed: 0,score,sentence1,sentence2
0,5.0,a plane is taking off.,an air plane is taking off.
1,3.8,a man is playing a large flute.,a man is playing a flute.
2,3.8,a man is spreading shreded cheese on a pizza.,a man is spreading shredded cheese on an uncoo...
3,2.6,three men are playing chess.,two men are playing chess.
4,4.25,a man is playing the cello.,a man seated is playing the cello.


### 03 Tokenize
We tokenize the text using spaCy's non-destructive tokenizer.

In [6]:
df_tok = to_spacy_tokens(df_low)
df_tok.head()

Unnamed: 0,score,sentence1,sentence2,sentence1_tokens,sentence2_tokens
0,5.0,a plane is taking off.,an air plane is taking off.,"[a, plane, is, taking, off, .]","[an, air, plane, is, taking, off, .]"
1,3.8,a man is playing a large flute.,a man is playing a flute.,"[a, man, is, playing, a, large, flute, .]","[a, man, is, playing, a, flute, .]"
2,3.8,a man is spreading shreded cheese on a pizza.,a man is spreading shredded cheese on an uncoo...,"[a, man, is, spreading, shreded, cheese, on, a...","[a, man, is, spreading, shredded, cheese, on, ..."
3,2.6,three men are playing chess.,two men are playing chess.,"[three, men, are, playing, chess, .]","[two, men, are, playing, chess, .]"
4,4.25,a man is playing the cello.,a man seated is playing the cello.,"[a, man, is, playing, the, cello, .]","[a, man, seated, is, playing, the, cello, .]"
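
spaCy describes its tokenization as non-destructive, meaning the original text can be rebuilt from the tokens together with their trailing whitespace. The sketch below illustrates that idea with a plain regular expression rather than spaCy itself; `tokenize_with_whitespace` and `reconstruct` are hypothetical names used only for this example, and the sketch assumes the text does not start with whitespace.

```python
import re

# Each match captures one token plus whatever whitespace follows it.
TOKEN_WITH_WS = re.compile(r"(\w+|[^\w\s])(\s*)")


def tokenize_with_whitespace(text):
    """Return (token, trailing_whitespace) pairs covering the text.

    Assumption: the text does not begin with whitespace, otherwise the
    leading whitespace would not be captured by any pair.
    """
    return [(m.group(1), m.group(2)) for m in TOKEN_WITH_WS.finditer(text)]


def reconstruct(pairs):
    """Rebuild the original text from the token/whitespace pairs."""
    return "".join(token + ws for token, ws in pairs)


text = "a man  is playing the cello."  # note the double space after "man"
pairs = tokenize_with_whitespace(text)
tokens = [token for token, _ in pairs]

print(tokens)
print(reconstruct(pairs) == text)
```

In spaCy, each token exposes the same information through its `text_with_ws` attribute, so joining those values yields the original string.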


### 04 Persist
Since it is generally good practice to save data transformations incrementally, we apply both transforms (lowercasing and tokenization) to the train, dev, and test splits and persist the results as tab-separated files in the data/preprocessed/stsbenchmark directory.

In [7]:
def preprocess_all(dir_name="preprocessed"):
    """Lowercase and tokenize every STS Benchmark split, then save each one to disk."""
    preprocessed_dir = os.path.join(BASE_DATA_PATH, dir_name, "stsbenchmark")
    os.makedirs(preprocessed_dir, exist_ok=True)  # create the output directory if needed
    for split in ["train", "dev", "test"]:
        df = STSBenchmark(split, base_data_path=BASE_DATA_PATH).as_dataframe()
        df_low = to_lowercase(df)
        df_tok = to_spacy_tokens(df_low)
        out_path = os.path.join(preprocessed_dir, "sts-{}.csv".format(split))
        print("Writing tokenized dataframe to {}".format(out_path))
        # The file is tab-separated even though it keeps the .csv extension.
        df_tok.to_csv(out_path, sep="\t")

In [8]:
preprocess_all()

Writing tokenized dataframe to ../../../data\preprocessed\stsbenchmark\sts-train.csv
Writing tokenized dataframe to ../../../data\preprocessed\stsbenchmark\sts-dev.csv
Writing tokenized dataframe to ../../../data\preprocessed\stsbenchmark\sts-test.csv
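
One thing to keep in mind when reading these files back: the output is tab-separated despite the `.csv` extension, and list-valued columns are stored as text. The sketch below uses only the standard library and a made-up row to show one way to recover the lists. It assumes the lists were serialized in Python's list literal form (for example `['a', 'plane']`); whether the real files look exactly like this depends on how pandas wrote them, so treat that as an assumption.

```python
import ast
import csv
import os
import tempfile

# Made-up example row; the token column holds a Python list.
rows = [{
    "score": 5.0,
    "sentence1": "a plane is taking off.",
    "sentence1_tokens": ["a", "plane", "is", "taking", "off", "."],
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sts-example.csv")

    # Write a tab-separated file; the csv module stringifies the list with str().
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    # Read it back: every field arrives as a string.
    with open(path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f, delimiter="\t"))

raw = loaded[0]["sentence1_tokens"]
restored_tokens = ast.literal_eval(raw)  # parse the string back into a list
print(type(raw).__name__, type(restored_tokens).__name__)
```

With pandas, the analogous step would be `pd.read_csv(path, sep="\t")` followed by applying `ast.literal_eval` to the token columns.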


### 05 Optional: Remove Stop Words
Removing stop words is another common preprocessing step for NLP tasks. We use the `rm_spacy_stopwords` utility function to do this on the dataframe. By default, the function uses the spaCy language model's stop-word list. To add our own stop words (for example, when working with a very specific content domain), we can pass them as a list via the `custom_stopwords` parameter of `rm_spacy_stopwords`.

In [9]:
from utils_nlp.dataset.preprocess import rm_spacy_stopwords

rm_spacy_stopwords(df_tok[:10]) # operating on a small slice of the data as an example

Unnamed: 0,score,sentence1,sentence2,sentence1_tokens,sentence2_tokens,sentence1_tokens_stop,sentence2_tokens_stop
0,5.0,a plane is taking off.,an air plane is taking off.,"[a, plane, is, taking, off, .]","[an, air, plane, is, taking, off, .]","[plane, taking, .]","[air, plane, taking, .]"
1,3.8,a man is playing a large flute.,a man is playing a flute.,"[a, man, is, playing, a, large, flute, .]","[a, man, is, playing, a, flute, .]","[man, playing, large, flute, .]","[man, playing, flute, .]"
2,3.8,a man is spreading shreded cheese on a pizza.,a man is spreading shredded cheese on an uncoo...,"[a, man, is, spreading, shreded, cheese, on, a...","[a, man, is, spreading, shredded, cheese, on, ...","[man, spreading, shreded, cheese, pizza, .]","[man, spreading, shredded, cheese, uncooked, p..."
3,2.6,three men are playing chess.,two men are playing chess.,"[three, men, are, playing, chess, .]","[two, men, are, playing, chess, .]","[men, playing, chess, .]","[men, playing, chess, .]"
4,4.25,a man is playing the cello.,a man seated is playing the cello.,"[a, man, is, playing, the, cello, .]","[a, man, seated, is, playing, the, cello, .]","[man, playing, cello, .]","[man, seated, playing, cello, .]"
5,4.25,some men are fighting.,two men are fighting.,"[some, men, are, fighting, .]","[two, men, are, fighting, .]","[men, fighting, .]","[men, fighting, .]"
6,0.5,a man is smoking.,a man is skating.,"[a, man, is, smoking, .]","[a, man, is, skating, .]","[man, smoking, .]","[man, skating, .]"
7,1.6,the man is playing the piano.,the man is playing the guitar.,"[the, man, is, playing, the, piano, .]","[the, man, is, playing, the, guitar, .]","[man, playing, piano, .]","[man, playing, guitar, .]"
8,2.2,a man is playing on a guitar and singing.,a woman is playing an acoustic guitar and sing...,"[a, man, is, playing, on, a, guitar, and, sing...","[a, woman, is, playing, an, acoustic, guitar, ...","[man, playing, guitar, singing, .]","[woman, playing, acoustic, guitar, singing, .]"
9,5.0,a person is throwing a cat on to the ceiling.,a person throws a cat on the ceiling.,"[a, person, is, throwing, a, cat, on, to, the,...","[a, person, throws, a, cat, on, the, ceiling, .]","[person, throwing, cat, ceiling, .]","[person, throws, cat, ceiling, .]"
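
To make the filtering step concrete, here is a minimal sketch of stop-word removal on a token list in plain Python. The small `DEFAULT_STOPWORDS` set is a hypothetical stand-in for spaCy's much larger default list, and `remove_stopwords` is an illustrative helper, not the implementation of `rm_spacy_stopwords`; it only mirrors the idea of combining default and custom stop words.

```python
# Hypothetical miniature stop-word list; spaCy's default list is far larger.
DEFAULT_STOPWORDS = {"a", "an", "the", "is", "are", "on", "off", "to"}


def remove_stopwords(tokens, custom_stopwords=()):
    """Drop tokens that appear in the default or the custom stop-word set."""
    stopwords = DEFAULT_STOPWORDS | set(custom_stopwords)
    return [token for token in tokens if token not in stopwords]


tokens = ["a", "plane", "is", "taking", "off", "."]

# Default stop words only.
print(remove_stopwords(tokens))

# Adding a domain-specific stop word on top of the defaults.
print(remove_stopwords(tokens, custom_stopwords=["plane"]))
```

Punctuation survives the filter because it is not in the stop-word set, which matches the trailing `.` tokens in the output table above.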
