PART 3: Advanced word and document embeddings and neural classification.

Here we use only the Flair package, which is probably one of the easiest Python libraries for working with state-of-the-art pretrained neural language models. Using pretrained models is a major advantage because it lets us transfer information learned from huge datasets to smaller NLP problems (such as yours). We only need to fine-tune a pretrained model to make it perform well on a new problem.

We'll take the preprocessed data from PART 1, which we load here from a pickle file.

For full documentation and examples, see
https://github.com/zalandoresearch/flair/tree/master/resources/docs
https://www.analyticsvidhya.com/blog/2019/02/flair-nlp-library-python/

In [None]:
# root folder of the data, which contains the pickle file
DATA_ROOT = r'D:\Downloads\NLP_introduction-20191023T064334Z-001\NLP_introduction' + '\\'  # ends with a single path separator

We start with simple word vectors, which are also included in Flair (fastText type). These embeddings are not context-dependent: each word always maps to the same fixed vector.
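
To make "fixed" concrete, here is a minimal sketch of a static embedding as a plain lookup table. The words, the 3-dimensional vectors, and the helper `embed_static` are made up for illustration and are not part of Flair. Mapping unknown words to a zero vector is also just an assumption of this sketch.

```python
# Toy stand-in for a static embedding table: every word maps to one fixed vector.
# The vectors are invented; real fastText vectors have hundreds of dimensions.
toy_table = {
    "hiilijalanjälki": [0.29, -0.26, 0.28],
    "on": [0.01, 0.40, -0.12],
    "suuri": [-0.33, 0.05, 0.61],
    "pieni": [0.22, -0.08, -0.47],
}

def embed_static(tokens, table, dim=3):
    """Look up each token; unknown tokens get a zero vector (sketch assumption)."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]

first = embed_static("hiilijalanjälki on suuri".split(), toy_table)
second = embed_static("pieni hiilijalanjälki".split(), toy_table)

# The vector for the shared word is identical, whatever the neighbouring words are.
same_vector = first[0] == second[1]
print(same_vector)
```

The lookup never sees the surrounding words, so it cannot tell two uses of the same word apart.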

In [19]:
from flair.embeddings import WordEmbeddings
word_embedding = WordEmbeddings('fi') # fasttext word embeddings for Finnish

from flair.data import Sentence
sentence = Sentence('hiilijalanjälki')
word_embedding.embed(sentence)
# now check out the embedded token
vector = sentence[0].embedding.cpu().detach().numpy() # the result is a tensor; convert it to a plain NumPy vector
print('Simple word embedding for "%s": %s' % (sentence[0],str(vector)))

Simple word embedding for "Token: 1 hiilijalanjälki": [ 0.28942   -0.25729    0.2793    -0.14168    0.21692    0.33281
 -0.076142   0.21425   -0.14987    0.50143    0.38442   -0.030661
 -0.54667   -0.06934    0.1375    -0.40508    0.58006   -0.10255
  0.13013    0.20505    0.0084429  0.18679   -0.17214   -0.015876
  0.16071    0.012619  -0.26008    0.5827     0.13461    0.38794
  0.27849    0.31263   -0.28229   -0.29986   -0.36067    0.57393
  0.45562    0.25721   -0.16588   -0.34081   -0.029271  -0.053188
 -0.28379   -0.31579   -0.16162   -0.044539   0.11141   -0.56292
  0.042089  -0.17313   -0.10631   -0.046749  -0.37972    0.12351
 -0.14223   -0.55344   -0.4255     0.21749    0.56593   -0.30287
 -0.4045     0.28351    0.14293   -0.15708    0.56132    0.8697
 -0.48887    0.1861     0.092133  -0.0092559  0.50473   -0.090265
 -0.60152    0.0038192 -0.12302   -0.11521   -0.25384    0.27161
 -0.18558   -0.12193    0.44237   -0.017731   0.10056   -0.10506
  0.33185    0.029694   0.25036  

Next we get embeddings that depend on context. These embeddings are computed by models rather than looked up in a big fixed table like the simple embeddings. In Flair, we can easily combine multiple models.
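
Combining models in a stacked embedding amounts to concatenating, for each token, the vectors that the individual models produce. The sketch below shows only that idea with made-up numbers. The three short vectors and the helper `stack` are hypothetical, not Flair code.

```python
# Made-up per-token vectors standing in for the output of three embedding models.
forward_vec = [0.1, 0.2]       # stands in for a forward language model
backward_vec = [0.3, 0.4]      # stands in for a backward language model
bert_vec = [0.5, 0.6, 0.7]     # stands in for a BERT-style model

def stack(*vectors):
    """Concatenate several vectors for the same token into one long vector."""
    stacked = []
    for vec in vectors:
        stacked.extend(vec)
    return stacked

stacked_vec = stack(forward_vec, backward_vec, bert_vec)
print(len(stacked_vec))  # the dimensions add up: 2 + 2 + 3
```

Because the dimensions add up, stacking large models gives very long token vectors, which raises memory use.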

In [17]:
from flair.embeddings import FlairEmbeddings,WordEmbeddings,BertEmbeddings

# https://github.com/stefan-it/flair-lms#multilingual-flair-embeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('fi-forward')
flair_backward_embedding = FlairEmbeddings('fi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')

from flair.embeddings import StackedEmbeddings

# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

# Next we test the embeddings for example sentences
from flair.data import Sentence
import numpy as np

# make two highly similar sentences
sentence=[None]*2
sentence[0] = Sentence('Suomalaisen hiilijalanjälki on vuodessa keskimäärin noin 11 tonnia hiilidioksidiksi muutettuna .')
sentence[1] = Sentence('Japanilaisen hiilijalanjälki on vuodessa enintään noin 11 tonnia hiilidioksidiksi muutettuna .')
# NOTE: all tokens must be separated by spaces; no "smart" tokenizer is used here!

# get the embedding for the second word, which is the same in both sentences
vectors=[None]*2
for i,s in enumerate(sentence):
    # just embed a sentence using the StackedEmbedding as you would with any single embedding.
    stacked_embeddings.embed(s)
    # now check out the embedded tokens.
    token=s[1]
    vectors[i] = token.embedding.cpu().detach().numpy()
    print('Contextual embedding for "%s": %s' % (token,str(vectors[i])))
print("Correlation between vectors %f" % (np.corrcoef(vectors)[0,1]))

Contextual embedding for "Token: 2 hiilijalanjälki": [ 1.5275617e-04 -1.0290725e-01  4.7491589e-03 ...  5.4320967e-01
  3.1850666e-01 -3.9369926e-01]
Contextual embedding for "Token: 2 hiilijalanjälki": [ 1.8941310e-04 -6.0013626e-02  3.3751559e-03 ...  2.9572883e-01
  1.4452136e-01 -4.6072593e-01]
Correlation between vectors 0.978912
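
The two vectors differ slightly because the surrounding words differ, but they remain highly correlated. As a rough illustration of what the correlation figure measures, here is a pure-Python sketch of the Pearson correlation, with cosine similarity shown as a common alternative for comparing embeddings. The function names and the short example vectors are made up for this sketch.

```python
import math

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two made-up vectors that point in almost the same direction.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 8.5]
print(pearson(u, v), cosine(u, v))
```

Pearson correlation subtracts the mean of each vector first, whereas cosine similarity does not, so the two measures agree closely only when the vector means are near zero.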


Finally, we take the above idea a step further: we fine-tune the above models to create document embeddings that are used with a classifier. This takes roughly 10 minutes using PyTorch and a GPU.
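
Before training, the data has to be divided into training, development, and test sets. The cell below does this by shuffling the document indices and cutting them at the cumulative fractions 0.8, 0.9, and 1.0. Here is the same idea in isolation, as a small sketch. The helper `split_indices` and the item count of 50 are made up for illustration.

```python
import random

def split_indices(n_items, fractions=(0.8, 0.9, 1.0), seed=1):
    """Shuffle indices 0..n_items-1 and cut them at the cumulative fractions."""
    indices = list(range(n_items))
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    rng.shuffle(indices)
    splits, low = [], 0
    for frac in fractions:
        up = round(frac * n_items)  # upper boundary of this split
        splits.append(indices[low:up])
        low = up                    # the next split starts where this one ended
    return splits

train_idx, dev_idx, test_idx = split_indices(50)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Using cumulative boundaries rather than separate sizes guarantees that every document lands in exactly one split, even when the fractions do not divide the data evenly.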

In [None]:
from flair.data import Corpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from flair.datasets import CSVClassificationCorpus

# Step 1: Create .CSV files of our data
import pickle
import random
import pandas
import csv

with open(DATA_ROOT + 'turkuNLP_preprocessed_data.pickle','rb') as f:
    data = pickle.load(f)
# create shuffled indices
ind = list(range(len(data)))
random.seed(1)
random.shuffle(ind)

# create training, development and testing splits using ratios 8:1:1
# save each split into its own file
MAX_TOKENS = 1000 # need to limit document size, or else we run out of memory :(
cutter = lambda x: x[1:MAX_TOKENS] # the first token is just a repeat, so skip it; slicing also handles short documents
low = 0
for frac,file in zip([0.8,0.9,1.0],['train.csv','dev.csv','test.csv']):
    up = round(frac*len(data))
    frame = pandas.DataFrame()
    frame['label'] = [data[ind[i]]['label'] for i in range(low,up)]
    frame['text'] = [" ".join(cutter(data[ind[i]]['tokens_raw'])) for i in range(low,up)]
    # use pandas for easy file saving
    frame.to_csv(DATA_ROOT + file,encoding="utf-8",sep="\t",index=False,quoting=csv.QUOTE_NONE)
    low = up

# this is the folder in which train, test and dev files reside
data_folder = DATA_ROOT

# column format indicating which columns hold the text and label(s)
column_name_map = {0: "label_topic",1: "text"} # note: 0 = first column!

# load the corpus containing training, test and dev data; skip_header=True skips the header row of each CSV file
corpus: Corpus = CSVClassificationCorpus(data_folder,
                                         column_name_map,
                                         skip_header=True,
                                         delimiter='\t',    # tab-separated files
                                         in_memory=True
)
# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. make a list of word embeddings
word_embeddings = [WordEmbeddings('fi'),
                   #FlairEmbeddings('fi-forward'), # can add, if enough memory
                   #FlairEmbeddings('fi-backward'),
                   ]

# 4. initialize document embedding by passing list of word embeddings
# Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
document_embeddings: DocumentRNNEmbeddings = DocumentRNNEmbeddings(word_embeddings,
                                                                     hidden_size=512,
                                                                     reproject_words=True,
                                                                     reproject_words_dimension=150,
                                                                     )

# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 6. initialize the text classifier trainer
trainer = ModelTrainer(classifier, corpus)

# 7. start the training
import torch
torch.cuda.empty_cache()

trainer.train(data_folder+"flair_temp", # Main path to which all output during training is logged and models are saved
              learning_rate=0.1,
              mini_batch_size=40,
              anneal_factor=0.5,
              patience=4,
              max_epochs=100)
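
The parameters `anneal_factor` and `patience` control how the learning rate shrinks when the development score stops improving. The sketch below shows one plausible reading of that rule in plain Python. It is a simplification written for this notebook, not Flair's actual implementation, and the function `anneal_schedule` and the example scores are made up.

```python
def anneal_schedule(dev_scores, learning_rate=0.1, anneal_factor=0.5, patience=4):
    """Return the learning rate used in each epoch, given per-epoch dev scores.

    Simplified rule (an assumption of this sketch): once the dev score has
    failed to improve for more than `patience` consecutive epochs, multiply
    the learning rate by `anneal_factor` and reset the counter.
    """
    best = float("-inf")
    bad_epochs = 0
    rates = []
    for score in dev_scores:
        rates.append(learning_rate)   # rate in effect during this epoch
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            learning_rate *= anneal_factor
            bad_epochs = 0
    return rates

# Two improving epochs followed by a plateau of six epochs.
scores = [0.50, 0.60] + [0.55] * 6
rates = anneal_schedule(scores)
print(rates)
```

With these settings the rate stays at 0.1 until the plateau has lasted five epochs and is then halved. A smaller `patience` reacts faster but is more easily fooled by noisy development scores.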

After training, we can load the model and use it to make predictions for new texts.

In [None]:
# Finally we test the model with a couple of text segments
classifier = TextClassifier.load(data_folder+r"flair_temp\final-model.pt")

# create and test example sentences
from flair.data import Sentence
sentence = Sentence('Jälleenhankintahinnalla tarkoitetaan omaisuuden uushankintahintaa . Tällöin voidaan viitata siihen , mihin hintaan esimerkiksi tietty kone tai laite voitaisiin korvata hankkimalla se markkinoilta tänä päivänä .')
classifier.predict(sentence)
print('Predicted class for text "%s"\n%s' % (sentence.to_plain_string(),sentence.labels))

sentence = Sentence('Elinluovutuskortti antaa luvan käyttää kortin täyttäneen henkilön elimiä ja kudoksia kuoleman jälkeen toisten henkilöiden hengen pelastamiseksi tai terveyden parantamiseksi . Vuoden 2010 lakimuutoksen jälkeen kortti ei enää ole välttämätön , sillä nykyisin oletetaan vainajan suostuneen elintensä luovutukseen , ellei hänen tiedetä sitä elinaikanaan erityisesti kieltäneen . Elinluovutuskortin mukana kantaminen on edelleen hyvä varmistaa tahtonsa toteutuminen . Suomalaisista noin 18 prosenttia on allekirjoittanut elinluovutuskortin .')
classifier.predict(sentence)
print('Predicted class for text "%s"\n%s' % (sentence.to_plain_string(),sentence.labels))