# Natural Language Processing

OK, so we can classify images, but what about text? Let's have a go at some natural language processing (NLP) using the Hugging Face toolset.

## Importing Documents

We should probably start by setting up the environment and getting data (documents) into the notebook.

In [17]:
import pandas as pd
from pathlib import Path
from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [18]:
# Let's create a Pandas dataframe object by importing the CSV data
df = pd.read_csv('data/train.csv')

In [19]:
# Below we have some tabular data
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


In [20]:
# We can ask Pandas to describe the object to us
# Below we can see we're dealing with about 36K rows of data (36,473 to be exact)
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


In [21]:
# We can add a new column using bracket syntax (df['input']); this syntax is preferred for setting columns
# What we're doing is concatenating the strings in each row and storing all ~36K resulting strings in the new input column
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

In [22]:
# It looks like each row now has a single labeled input string, good!
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
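
To see what the concatenation does on a single row, here is a minimal sketch on a tiny hand-made dataframe. The two rows are copied from the table above; the name `toy` is hypothetical and only used for illustration.

```python
import pandas as pd

# A tiny stand-in for train.csv: two rows copied from the table above
toy = pd.DataFrame({
    'anchor':  ['abatement', 'wood article'],
    'target':  ['act of abating', 'wooden box'],
    'context': ['A47', 'B44'],
})

# String columns add element-wise, so each row becomes one combined string
toy['input'] = 'TEXT1: ' + toy.context + '; TEXT2: ' + toy.target + '; ANC1: ' + toy.anchor

print(toy.input[0])  # TEXT1: A47; TEXT2: act of abating; ANC1: abatement
```

Printing a single element shows the full string, whereas `head()` truncates long strings with `...`.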

## Tokenization (Tokenizing?)

Now that we have the data ready, how do we convert all these documents into numbers so our neural network can understand and work with the data? Well, we want to split each document into pieces. Splitting into whole words tends to give a vocabulary that's too big, so we split into sub-word pieces instead: tokens, if you will. For this we're going to leverage Hugging Face's 'transformers' library, which gives us access to thousands and thousands of pre-trained models.
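
To make the sub-word idea concrete, here is a toy sketch: a greedy longest-match splitter over a hand-picked vocabulary. This is not the algorithm the DeBERTa tokenizer uses (real tokenizers learn their vocabulary from data); `toy_vocab` and `toy_tokenize` are hypothetical names invented for this illustration.

```python
# Hypothetical, hand-picked sub-word vocabulary (a real one is learned from data)
toy_vocab = {'abate', 'ment', 'wood', 'en', 'box'}

def toy_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched here: emit a placeholder and skip one character
            pieces.append('[UNK]')
            start += 1
    return pieces

print(toy_tokenize('abatement', toy_vocab))  # ['abate', 'ment']
print(toy_tokenize('wooden', toy_vocab))     # ['wood', 'en']
```

Five pieces are enough to cover several related words here, which is the reason a sub-word vocabulary can stay much smaller than a whole-word one.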

In [23]:
# Let's create a Hugging Face Dataset from our Pandas dataframe (not to be confused with PyTorch's datasets)
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

In [24]:
# DeBERTa v3 (small) is a pre-trained model that generally works well as a starting point
model_nm = 'microsoft/deberta-v3-small'

In [25]:
# We can download the tokenizer that matches our model and instantiate it via AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


TypeError: Couldn't build proto file into descriptor pool: duplicate file name (sentencepiece_model.proto)

In [None]:
tokz.tokenize("A Token a day keeps the doctor away.")

## Numericalization

Numericalization gives each token a unique integer id based on its position in the vocabulary.
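
As a rough sketch of the idea, the snippet below maps tokens to ids using a tiny made-up vocabulary list. The vocabulary, the `[UNK]` fallback, and the names `toy_vocab_list`, `token_to_id` and `numericalize` are assumptions for illustration, not the real DeBERTa vocabulary or API.

```python
# Hypothetical toy vocabulary: each token's position in the list is its id
toy_vocab_list = ['[UNK]', 'abate', 'ment', 'wood', 'en']

# Build the token -> id lookup from the positions in the list
token_to_id = {tok: i for i, tok in enumerate(toy_vocab_list)}

def numericalize(tokens):
    """Map each token to its id, falling back to the id of '[UNK]'."""
    return [token_to_id.get(tok, token_to_id['[UNK]']) for tok in tokens]

print(numericalize(['abate', 'ment']))  # [1, 2]
print(numericalize(['wood', 'box']))    # [3, 0]
```

With a Hugging Face tokenizer, calling the tokenizer on a string returns these ids under the `input_ids` key.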