# Lab3.4 Sentiment Classification using transformer models

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook explains how you can use a transformer model that is fine-tuned for sentiment analysis. Fine-tuned transformer models are published regularly on the Hugging Face platform: https://huggingface.co/models

These models are large (from hundreds of megabytes to several gigabytes) and require a computer with sufficient memory to load. Loading them also takes some time. You can also copy such a model to your disk and load the local copy, but loading it still requires a substantial amount of memory.

This notebook requires installing some deep learning packages: transformers, PyTorch and simpletransformers. If you are not experienced with installing these packages, make sure you first create a virtual environment for Python, activate this environment and install the packages in this environment.

Please consult the Python documentation for creating such an environment:

https://docs.python.org/3/library/venv.html

After activating your environment, you can install PyTorch, transformers and simpletransformers from the command line. If you start this notebook within the same virtual environment, you can also execute the next installation commands from your notebook. Once the packages are installed, you can comment out the next cell.

In [2]:
#!conda install pytorch cpuonly -c pytorch
#!pip install transformers
#!pip install simpletransformers
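
Before running the rest of the notebook, it can help to confirm that the packages are visible from the environment the notebook kernel is using. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for this illustration. Note that PyTorch is imported under the name `torch`.

```python
import importlib.util

# Import names of the packages this notebook relies on.
REQUIRED = ["torch", "transformers", "simpletransformers"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If a package is reported as missing even though you installed it, the notebook kernel is probably running in a different environment than the one you installed into.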

Hugging Face transformers provides an option to create a **pipeline** to perform an NLP task with a pretrained model:

"The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering."

More information can be found here: https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html

We will use the pipeline module to load a fine-tuned model to perform sentiment analysis.

In [3]:
from transformers import pipeline

We load the transformer model 'distilbert-base-uncased-finetuned-sst-2-english', which is fine-tuned for binary classification, from the Hugging Face repository:

https://huggingface.co/models

We need to load the model for sequence classification and the tokenizer that converts the sentences into tokens according to the vocabulary of the model.

Loading the model takes some time.

In [4]:
sentimentenglish = pipeline("sentiment-analysis", 
                            model="distilbert-base-uncased-finetuned-sst-2-english", 
                            tokenizer="distilbert-base-uncased-finetuned-sst-2-english")

Downloading: 100%|██████████| 629/629 [00:00<00:00, 156kB/s]
Downloading: 100%|██████████| 255M/255M [01:32<00:00, 2.89MB/s] 
Downloading: 100%|██████████| 48.0/48.0 [00:00<00:00, 13.7kB/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 474kB/s] 


We have now created an instance of a pipeline that can tokenize any sentence, obtain a sentence embedding from the transformer language model and perform the **sentiment-analysis** task. Let's try it out on an example sentence.

In [5]:
sentence_pos_en = "Nice hotel and the service is great"

In [6]:
sentimentenglish(sentence_pos_en)

[{'label': 'POSITIVE', 'score': 0.999881386756897}]

In [7]:
sentence_neg_en = "The rooms are dirty and the wifi does not work"

In [8]:
sentimentenglish(sentence_neg_en)

[{'label': 'NEGATIVE', 'score': 0.9997870326042175}]

This is easy and seems to work very well. 
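
The introduction mentioned that a model can be copied to disk and loaded from the local copy, which avoids downloading it again. The sketch below shows one way this might be done with the `save_pretrained` method of a pipeline object. The helper names and the directory `models/sst2-english` are hypothetical and chosen only for this illustration.

```python
from pathlib import Path


def save_local_copy(classifier, directory):
    """Write the model and tokenizer files of a loaded pipeline to `directory`."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    classifier.save_pretrained(str(target))
    return target


def load_local_copy(directory):
    """Create a sentiment-analysis pipeline from files stored in `directory`."""
    # Imported here so that save_local_copy can be defined without transformers.
    from transformers import pipeline

    path = str(directory)
    return pipeline("sentiment-analysis", model=path, tokenizer=path)


# Usage, assuming `sentimentenglish` was created as above
# (the directory name is a made-up example):
# local_dir = save_local_copy(sentimentenglish, "models/sst2-english")
# sentimentenglish_local = load_local_copy(local_dir)
```

Loading from disk saves the download time, but the model still has to fit in memory.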

## Using a Dutch fine-tuned transformer model

We can use a fine-tuned Dutch model for Dutch sentiment analysis by creating another pipeline. Again, loading this model takes some time. Also note that after loading, both models are held in memory. If you run into memory issues, you may want to restart the kernel and load only the Dutch pipeline.

In [9]:
sentimentdutch = pipeline("sentiment-analysis", 
                          model="wietsedv/bert-base-dutch-cased-finetuned-sentiment", 
                          tokenizer="wietsedv/bert-base-dutch-cased-finetuned-sentiment")

Downloading: 100%|██████████| 1.20k/1.20k [00:00<00:00, 340kB/s]
Downloading: 100%|██████████| 416M/416M [02:03<00:00, 3.53MB/s] 
Downloading: 100%|██████████| 40.0/40.0 [00:00<00:00, 19.7kB/s]
Downloading: 100%|██████████| 236k/236k [00:00<00:00, 280kB/s]  
Downloading: 100%|██████████| 112/112 [00:00<00:00, 33.2kB/s]


We test it on two similar Dutch sentences:

In [10]:
sentence_pos_nl = "Mooi hotel en de service is geweldig"
sentence_neg_nl = "De kamers zijn smerig en de wifi doet het niet"

In [11]:
sentimentdutch(sentence_pos_nl)

[{'label': 'pos', 'score': 0.9999955892562866}]

In [12]:
sentimentdutch(sentence_neg_nl)

[{'label': 'neg', 'score': 0.6675182580947876}]

This seems to work fine too, although the score for the negative label in the second example is much lower.
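
To get a feeling for what such a score means, the sketch below assumes that the pipeline turns two class logits into probabilities with a softmax, which is the usual setup for a two-label classifier. Under that assumption, a score can be converted back into the gap between the winning and the losing logit. The function names are made up for this illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def logit_gap(score):
    """Logit difference (winner minus loser) implied by a two-class score."""
    return math.log(score / (1.0 - score))


# Scores reported above for the two Dutch sentences.
for label, score in [("pos", 0.9999955892562866), ("neg", 0.6675182580947876)]:
    print(label, round(logit_gap(score), 2))
```

For the positive sentence the implied gap is roughly 12.3, while for the negative sentence it is only about 0.7, so the model separates the two classes far less clearly in the second case.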

## Inspecting sentence representations using Simpletransformers

The Simpletransformers package is built on top of the transformers package. It simplifies the use of transformers even more and provides excellent documentation: https://simpletransformers.ai

The site also explains how you can fine-tune models yourself or even how to build models from scratch, assuming you have the computing power and the data.

Here we are going to use it to inspect the sentence representations a bit more. Unfortunately, we need to load the English model again as an instance of a RepresentationModel. So if you have memory issues, please stop the kernel and start again from here.

Loading the model may give a lot of warnings. You can ignore these. If you do not have a graphics card (GPU), or CUDA is not installed to use the GPU, you need to set use_cuda to False, as shown below.

In [14]:
from simpletransformers.language_representation import RepresentationModel
        
# sentences = ["Example sentence 1", "Example sentence 2"]
model = RepresentationModel(
    model_type="bert",
    model_name="distilbert-base-uncased-finetuned-sst-2-english",
    use_cuda=False,  # set this to False if you cannot use a GPU
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing BertForTextRepresentation: ['distilbert.transformer.layer.1.attention.k_lin.bias', 'classifier.weight', 'distilbert.transformer.layer.3.sa_layer_norm.bias', 'distilbert.transformer.layer.2.attention.k_lin.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.3.attention.out_lin.bias', 'distilbert.transformer.layer.0.output_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.3.output_layer_norm.bias', 'distilbert.transformer.layer.3.attention.k_lin.bias', 'distilbert.transformer.layer.1.output_layer_norm.bias', 'distilbert.transformer.layer.5.ffn.lin1.weight', 'distilbert.transformer.layer.4
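
Two of the points above can be handled in code rather than by hand. The tokenizers warning itself suggests setting the environment variable TOKENIZERS_PARALLELISM explicitly, and the value for use_cuda can be derived from what PyTorch reports instead of being hard-coded. The following is a minimal sketch; the helper name `gpu_available` is made up for this illustration, and the environment variable should be set near the top of the notebook, before any tokenizer is used.

```python
import os

# The tokenizers warning suggests setting this variable explicitly.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def gpu_available():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


use_cuda = gpu_available()
print("use_cuda =", use_cuda)
```

The resulting value could then be passed as `use_cuda=use_cuda` when creating the RepresentationModel.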

The RepresentationModel allows you to obtain a sentence encoding. We do that next for the positive English example, which consists of 7 words:

In [15]:
sentence_pos_en

'Nice hotel and the service is great'

According to the simpletransformers API, the input must be a list, even when it is a single sentence. If you pass a string as input, it will be treated as a list of characters, with each character as a separate sentence.
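
The pitfall comes from the fact that iterating over a Python string yields its characters. A small guard can prevent the mistake. This is a sketch with a hypothetical helper name, `as_sentence_list`, which is not part of simpletransformers.

```python
def as_sentence_list(text_or_list):
    """Wrap a single string in a list; turn any other iterable into a list."""
    if isinstance(text_or_list, str):
        return [text_or_list]
    return list(text_or_list)


sentence = "Nice hotel"
print(list(sentence)[:4])          # ['N', 'i', 'c', 'e']
print(as_sentence_list(sentence))  # ['Nice hotel']
```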

In [16]:
word_vectors = model.encode_sentences([sentence_pos_en], combine_strategy=None)

The result is a NumPy array with the shape (1, 9, 768).

In [17]:
print(type(word_vectors))
print(word_vectors.shape)

<class 'numpy.ndarray'>
(1, 9, 768)


The first number indicates the number of sentences, which is **1** in our case. The second number, **9**, indicates the number of tokens, and the final number is the number of dimensions for each token according to the transformer model, which is **768** for BERT base models.
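
To make the three axes concrete, the sketch below uses a random NumPy array of the same shape as a stand-in for `word_vectors`. The values are not real embeddings; only the shapes matter here. It also shows one common way to reduce the token vectors to a single sentence vector, namely averaging over the token axis by hand.

```python
import numpy as np

# Random stand-in with the same shape as word_vectors: (sentences, tokens, dimensions).
rng = np.random.default_rng(0)
dummy_vectors = rng.normal(size=(1, 9, 768))

n_sentences, n_tokens, n_dims = dummy_vectors.shape
print(n_sentences, n_tokens, n_dims)

# Vector of the first token of the first sentence.
first_token = dummy_vectors[0, 0]
print(first_token.shape)

# Average over the token axis to get one vector per sentence.
sentence_vector = dummy_vectors.mean(axis=1)
print(sentence_vector.shape)
```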

We can ask for the full embedding representation for the first token:

In [18]:
print('Nr of dimensions for the embedding of the first token:', len(word_vectors[0][0]))
print(word_vectors[0][0])

Nr of dimensions for the embedding of the first token: 768
[ 7.67106533e-01  5.33791959e-01  2.67039895e-01 -6.76840127e-01
  1.36183822e+00 -2.70547777e-01 -6.71268627e-02 -1.05598474e+00
 -2.85047412e-01 -7.26978540e-01  9.67526853e-01  1.11051038e-01
 -1.32694316e+00 -5.28408587e-01 -9.15984392e-01 -7.58146644e-01
 -2.61910051e-01 -1.44160375e-01 -1.39175296e+00  7.00954318e-01
  1.08403742e+00  3.10060918e-01 -5.61872721e-01 -4.18358862e-01
 -4.46640819e-01  1.51834667e+00 -6.58160388e-01  6.48984611e-01
 -7.35697091e-01 -1.73979983e-01 -4.06234324e-01  3.64694476e-01
  1.38108611e+00 -9.75431651e-02  1.60878146e+00 -8.11416268e-01
 -7.32650161e-01  2.78848588e-01 -3.81985813e-01  3.93769711e-01
  9.50365543e-01  1.10369158e+00  1.60771802e-01 -1.97742209e-01
  7.30382919e-01 -1.19919455e+00  4.79121357e-01  2.61293501e-01
 -1.16474843e+00  9.78898108e-01  3.07603097e+00  1.25448906e+00
  1.23442978e-01  1.01036096e+00  6.45935953e-01 -7.15118110e-01
 -2.73068309e-01 -2.20954394e+00

**WAIT** Our sentence has 7 words so why do we get 9 tokens here?

We can use the tokenizer of the model to get the token representation of the transformer and check it out.

In [19]:
tokenized_sentence = model.tokenizer(sentence_pos_en)
tokenized_sentence

{'input_ids': [101, 100, 3309, 1998, 1996, 2326, 2003, 2307, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Although our sentence has 7 words, we get 9 identifiers. We can use the **decode** function to convert them back to words:

In [20]:
model.tokenizer.decode(101)

'[ C L S ]'

The first token is the special token **CLS**, which serves as an abstract sentence representation. Let's check another one:

In [21]:
model.tokenizer.decode(3309)

'h o t e l'

All right, this is a word from our sentence. Let's decode them all:

In [22]:
tokenid_list = tokenized_sentence['input_ids']
for token_id in tokenid_list:
    print(token_id, model.tokenizer.decode(token_id))

101 [ C L S ]
100 [ U N K ]
3309 h o t e l
1998 a n d
1996 t h e
2326 s e r v i c e
2003 i s
2307 g r e a t
102 [ S E P ]


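The transformer model added the special tokens **CLS** and **SEP**, but it also represented our "Nice" with the **UNK** token. Any idea why? Check the name of the model we used...

We used the uncased model, which means that all input was lowercased for training, so the vocabulary contains only lowercase tokens. The capitalized "Nice" was apparently not lowercased before the vocabulary lookup here, so it was not found and was replaced by **UNK**.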
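
The effect of lowercasing before a vocabulary lookup can be sketched with a toy vocabulary. The ids below are the ones printed above, except for "nice", whose id is a made-up placeholder. The sketch splits on whitespace, which is a simplification of what the real tokenizer does.

```python
# Placeholder id for "nice"; this is NOT the real id in the model vocabulary.
NICE_ID_PLACEHOLDER = 999999

# Toy lowercase-only vocabulary built from the ids printed above.
TOY_VOCAB = {
    "[CLS]": 101, "[UNK]": 100, "[SEP]": 102,
    "hotel": 3309, "and": 1998, "the": 1996,
    "service": 2326, "is": 2003, "great": 2307,
    "nice": NICE_ID_PLACEHOLDER,
}


def encode(sentence, lowercase):
    """Map whitespace-separated words to ids, using [UNK] for unknown words."""
    words = sentence.lower().split() if lowercase else sentence.split()
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


sentence = "Nice hotel and the service is great"
print(encode(sentence, lowercase=False))  # "Nice" is not in the vocabulary
print(encode(sentence, lowercase=True))   # "nice" is found
```

Without lowercasing, the second id is 100 (**UNK**), as in the tokenizer output above. With lowercasing, the word is found in the vocabulary.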

# End of this notebook