# Lab3.4 Sentiment Classification using transformer models

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook explains how you can use a transformer model that is fine-tuned for sentiment analysis. Fine-tuned transformer models are published regularly on the Hugging Face platform: https://huggingface.co/models

These models are large (hundreds of megabytes up to several gigabytes) and require a computer with sufficient memory to load. Loading them also takes some time. It is possible to copy such a model to your disk and load the local copy, but a substantial amount of memory is still needed to load it.
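
As a rough sanity check on these sizes, the sketch below estimates the size of a model's weights from its parameter count. It assumes 32-bit floats (4 bytes per weight) and approximate parameter counts of 66 million for DistilBERT and 110 million for BERT-base; the function name `estimate_weights_mb` is made up for this illustration. Actual memory use while loading and running a model is higher than the size of the weights alone.

```python
def estimate_weights_mb(num_parameters, bytes_per_parameter=4):
    """Approximate size of the model weights in megabytes (10**6 bytes)."""
    return num_parameters * bytes_per_parameter / 1_000_000

# Approximate parameter counts (assumptions, not read from the models themselves)
approx_parameters = {
    "DistilBERT": 66_000_000,
    "BERT-base": 110_000_000,
}

for name, count in approx_parameters.items():
    print(f"{name}: about {estimate_weights_mb(count):.0f} MB of weights")
```

These estimates are in the same range as the download sizes reported when the two models are loaded later in this notebook.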

This notebook requires installing some deep learning packages: transformers, pytorch and simpletransformers. If you are not experienced with installing these packages, make sure you first create a virtual environment for Python, activate this environment and install the packages in this environment.

Please consult the Python documentation for creating such an environment:

https://docs.python.org/3/library/venv.html
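
If you are unsure whether the notebook kernel really runs inside your virtual environment, a small check like the following sketch can help. It uses only the standard library; the helper names `in_virtual_environment` and `missing_packages` are made up for this illustration, and the check only recognises environments created with `venv` or similar tools (a conda environment, for example, may be reported as `False`).

```python
import importlib.util
import sys

def in_virtual_environment():
    """Return True if Python is running inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

print("Running in a virtual environment:", in_virtual_environment())
print("Not installed yet:", missing_packages(["torch", "transformers", "simpletransformers"]))
```

An empty list on the second line means that all three packages can be imported from the current environment.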

After activating your environment you can install pytorch, transformers and simpletransformers from the command line. If you start this notebook within the same virtual environment, you can also execute the next installation commands from your notebook. Once installed, you can comment out the next cell.

In [3]:
!pip3 install torch
!pip3 install transformers 
!pip3 install simpletransformers

ERROR: Could not find a version that satisfies the requirement torch (from versions: none)
ERROR: No matching distribution found for torch

[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip







Collecting simpletransformers
  Using cached simpletransformers-0.63.9-py3-none-any.whl (250 kB)
Collecting datasets
  Using cached datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting scipy
  Using cached scipy-1.10.1-cp311-cp311-win_amd64.whl (42.2 MB)
Collecting scikit-learn
  Using cached scikit_learn-1.2.1-cp311-cp311-win_amd64.whl (8.2 MB)
Collecting seqeval
  Using cached seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting tensorboard
  Using cached tensorboard-2.12.0-py3-none-any.whl (5.6 MB)
Collecting wandb>=0.10.32
  Using cached wandb-0.13.10-py3-none-any.whl (2.0 MB)
Collecting streamlit
  Using cached streamlit-1.19.0-py2.py3-none-any.whl (9.6 MB)
Collecting sentencepiece
  Using cached sentencepiece-0.1.97.tar.gz (524 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting GitPython>=1.0.0
  Using cached GitPython-3.1.31-p

  DEPRECATION: sentencepiece is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  error: subprocess-exited-with-error
  
  Running setup.py install for sentencepiece did not run successfully.
  exit code: 1
  
  [23 lines of output]
  running install
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-311
  creating build\lib.win-amd64-cpython-311\sentencepiece
  copying src\sentencepiece/__init__.py -> build\lib.win-amd64-cpython-311\sentencepiece
  copying src\sentencepiece/_version.py -> build\lib.win-amd64-cpython-311\sentencepiece
  copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-cpython-311\sentencepiece
  copying src\sentencepiece/sentencepiece_

Hugging Face transformers provides an option to create a **pipeline** to perform an NLP task with a pretrained model:

"The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering."

More information can be found here: https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html

We will use the pipeline module to load a fine-tuned model to perform sentiment analysis.

In [None]:
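# Diagnostics: show which pip is used and which packages are installed.
# Note: `which` only exists on Linux/macOS; on Windows use `where pip`.
!which pip
!pip list
# Reinstall transformers; -y skips the confirmation prompt, which cannot be answered from a notebook.
!pip uninstall -y transformers
!python -m pip install transformers
from transformers import pipeline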

'which' is not recognized as an internal or external command,
operable program or batch file.


Package                Version
---------------------- ----------
attrs                  22.2.0
blis                   0.7.9
catalogue              2.0.8
certifi                2022.12.7
charset-normalizer     3.0.1
click                  8.1.3
colorama               0.4.6
confection             0.0.4
cymem                  2.0.7
distlib                0.3.6
en-core-web-sm         3.5.0
et-xmlfile             1.1.0
filelock               3.9.0
huggingface-hub        0.12.1
idna                   3.4
Jinja2                 3.1.2
joblib                 1.2.0
langcodes              3.3.0
lxml                   4.9.2
MarkupSafe             2.1.2
munch                  2.5.0
murmurhash             1.0.9
nltk                   3.8.1
numpy                  1.24.2
openpyxl               3.0.10
packaging              23.0
pandas                 1.5.3
pathy                  0.10.1
pip                    22.3.1
platformdirs           3.0.0
plotly                 5.13.0
preshed                3.0.8




We load the transformer model 'distilbert-base-uncased-finetuned-sst-2-english', which is fine-tuned for binary sentiment classification, from the Hugging Face repository:

https://huggingface.co/models

We need to load the model for sequence classification and the tokenizer that converts sentences into tokens according to the vocabulary of the model.

Loading the model takes some time.

In [4]:
sentimentenglish = pipeline("sentiment-analysis", 
                            model="distilbert-base-uncased-finetuned-sst-2-english", 
                            tokenizer="distilbert-base-uncased-finetuned-sst-2-english")

Downloading (…)lve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

We have now created an instance of a pipeline that can tokenize any sentence, obtain a sentence embedding from the transformer language model and perform the **sentiment-analysis** task. Let's try it out on an example sentence.

In [5]:
sentence_pos_en = "Nice hotel and the service is great"

In [6]:
sentimentenglish(sentence_pos_en)

[{'label': 'POSITIVE', 'score': 0.999881386756897}]

In [7]:
sentence_neg_en = "The rooms are dirty and the wifi does not work"

In [8]:
sentimentenglish(sentence_neg_en)

[{'label': 'NEGATIVE', 'score': 0.9997870326042175}]

This is easy and seems to work very well. 
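
A pipeline can also be called with a list of sentences, in which case it returns one result dictionary per sentence. The sketch below pairs sentences with their predictions. `classify_all` and `fake_classifier` are hypothetical names for this illustration, and the stand-in classifier only mimics the output format so that the example runs without downloading a model.

```python
def classify_all(classifier, sentences):
    """Pair every sentence with the label and score the classifier returns for it."""
    results = classifier(sentences)  # one {'label': ..., 'score': ...} dict per sentence
    return [(sentence, result["label"], result["score"])
            for sentence, result in zip(sentences, results)]

def fake_classifier(sentences):
    """Stand-in that mimics the pipeline output format (NOT a real model)."""
    return [{"label": "POSITIVE" if "great" in s else "NEGATIVE", "score": 1.0}
            for s in sentences]

examples = ["Nice hotel and the service is great",
            "The rooms are dirty and the wifi does not work"]
for sentence, label, score in classify_all(fake_classifier, examples):
    print(f"{label:8} {score:.3f}  {sentence}")
```

With the real pipeline loaded, `classify_all(sentimentenglish, examples)` should work in the same way.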

## Using a Dutch fine-tuned transformer model

We can use a fine-tuned Dutch model for Dutch sentiment analysis by creating another pipeline. Again, loading this model takes some time. Also note that after loading, both models are held in memory. So if you have issues loading, you may want to start over and try again with just the Dutch pipeline.

In [9]:
sentimentdutch = pipeline("sentiment-analysis", 
                          model="wietsedv/bert-base-dutch-cased-finetuned-sentiment", 
                          tokenizer="wietsedv/bert-base-dutch-cased-finetuned-sentiment")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/436M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/241k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

We test it on two similar Dutch sentences:

In [10]:
sentence_pos_nl="Mooi hotel en de service is geweldig"
sentence_neg_nl="De kamers zijn smerig en de wifi doet het niet"

In [11]:
sentimentdutch(sentence_pos_nl)

[{'label': 'pos', 'score': 0.9999955892562866}]

In [12]:
sentimentdutch(sentence_neg_nl)

[{'label': 'neg', 'score': 0.6675326824188232}]

This seems to work fine too, although the score for the negative label in the second example (about 0.67) is much lower than the score for the positive example.
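
Because the score reflects how confident the model is in its predicted label, one option is to treat low-scoring predictions as uncertain instead of accepting them at face value. The sketch below applies this to the two outputs shown above. The helper name `flag_uncertain` and the threshold of 0.9 are arbitrary choices for illustration, not recommended values.

```python
def flag_uncertain(prediction, threshold=0.9):
    """Return the predicted label, or 'uncertain' if its score is below the threshold."""
    if prediction["score"] < threshold:
        return "uncertain"
    return prediction["label"]

# The two predictions copied from the output cells above
pos_prediction = {"label": "pos", "score": 0.9999955892562866}
neg_prediction = {"label": "neg", "score": 0.6675326824188232}

print(flag_uncertain(pos_prediction))
print(flag_uncertain(neg_prediction))
```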

## Inspecting sentence representations using Simpletransformers

The Simpletransformers package is built on top of the transformers package. It simplifies the use of transformers even more and provides excellent documentation: https://simpletransformers.ai

The site also explains how you can fine-tune models yourself or even how to build models from scratch, assuming you have the computing power and the data.

Here we are going to use it to inspect the sentence representations a bit more. Unfortunately, we need to load the English model again as an instance of a RepresentationModel. So if you have memory issues, please stop the kernel and start again from here.

Loading the model may give a lot of warnings. For the purposes of this notebook you can ignore these, but note that the output below reports that a DistilBERT checkpoint is loaded as model type `bert` and that some checkpoint weights are not used, so the embedding values themselves should be treated with caution. If you do not have a graphics card (GPU) and/or CUDA installed to use the GPU, you need to set use_cuda to False, as shown below.

In [13]:
from simpletransformers.language_representation import RepresentationModel

#sentences = ["Example sentence 1", "Example sentence 2"]
model = RepresentationModel(
    model_type="bert",
    model_name="distilbert-base-uncased-finetuned-sst-2-english",
    use_cuda=False,  # set this to False if you cannot use a GPU
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing BertForTextRepresentation: ['distilbert.transformer.layer.5.ffn.lin2.bias', 'distilbert.transformer.layer.1.attention.k_lin.bias', 'distilbert.transformer.layer.4.sa_layer_norm.bias', 'distilbert.transformer.layer.2.attention.out_lin.weight', 'distilbert.transformer.layer.0.ffn.lin2.bias', 'distilbert.transformer.layer.3.output_layer_norm.weight', 'distilbert.transformer.layer.3.attention.q_lin.weight', 'distilbert.transformer.layer.5.output_layer_norm.weight', 'distilbert.transformer.layer.1.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.3.ffn.lin1.bias', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.ffn.lin2.wei

The RepresentationModel allows you to obtain a sentence encoding. We do that next for the positive English example, which consists of 7 words:

In [14]:
sentence_pos_en

'Nice hotel and the service is great'

According to the simpletransformers API, the input must be a list, even when it is a single sentence. If you pass a string as input, it is treated as a list of characters, with each character handled as a separate sentence.
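
The pitfall comes from how Python iterates over strings, and it can be illustrated without loading any model. The helper `as_sentence_list` below is a hypothetical convenience function for this illustration, not part of simpletransformers.

```python
def as_sentence_list(text_or_list):
    """Wrap a single string in a list; leave lists of sentences unchanged."""
    if isinstance(text_or_list, str):
        return [text_or_list]
    return list(text_or_list)

sentence = "Nice hotel"
print(list(sentence))              # iterating over a string yields single characters
print(as_sentence_list(sentence))  # a list containing one sentence
```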

In [15]:
word_vectors = model.encode_sentences([sentence_pos_en], combine_strategy=None)

The result is a numpy array with the shape (1, 9, 768) 

In [16]:
print(type(word_vectors))
print(word_vectors.shape)

<class 'numpy.ndarray'>
(1, 9, 768)


The first number indicates the number of sentences, which is **1** in our case. The second number, **9**, is the number of tokens, and the final number is the number of dimensions of each token vector according to the transformer model, which is **768** for BERT-base-sized models.
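
To make the three axes concrete, the sketch below builds a random array with the same shape and indexes into it. The random values merely stand in for real embeddings. The last step averages over the token axis, which is one common way to turn token vectors into a single sentence vector.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
fake_vectors = rng.normal(size=(1, 9, 768))  # random stand-in for word_vectors

n_sentences, n_tokens, n_dimensions = fake_vectors.shape
first_sentence = fake_vectors[0]             # all token vectors of sentence 0: shape (9, 768)
first_token = fake_vectors[0][0]             # vector of the first token: shape (768,)
sentence_vector = fake_vectors.mean(axis=1)  # average over the tokens: shape (1, 768)

print(n_sentences, n_tokens, n_dimensions)
print(first_sentence.shape, first_token.shape, sentence_vector.shape)
```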

We can ask for the full embedding representation for the first token:

In [17]:
print('Nr of dimensions for the embedding of the first token:', len(word_vectors[0][0]))
print(word_vectors[0][0])

Nr of dimensions for the embedding of the first token: 768
[-9.06034887e-01 -4.87977117e-02  4.74802107e-02 -1.13801634e+00
  9.35453296e-01 -1.23630416e+00  2.17447853e+00  4.51255798e-01
  5.90349019e-01 -1.95086861e+00 -1.47528207e+00 -2.24826545e-01
  1.55494559e+00  1.06500769e+00 -1.40686542e-01  7.52872825e-01
 -1.61901936e-01  7.77709007e-01  5.06455123e-01  1.60853386e+00
 -1.06702197e+00 -4.81569618e-01  3.72172892e-01  1.15838265e+00
 -9.21475470e-01 -2.75591284e-01 -4.57448423e-01  8.95897299e-02
 -5.26742101e-01 -2.69893020e-01 -8.04319620e-01  6.22947991e-01
  1.48045027e+00 -1.96724489e-01 -6.02553070e-01 -1.02787495e+00
 -1.89065874e+00 -1.07499492e+00 -4.31458801e-02  1.59636819e+00
 -8.58116224e-02  7.70465076e-01 -1.70893526e+00  3.77522223e-02
  2.52258629e-01 -6.95083082e-01 -2.25055051e+00 -4.72664148e-01
 -5.02944648e-01 -3.95256788e-01 -4.64026779e-01  2.86221147e-01
 -9.66042340e-01  1.08464456e+00 -5.62285364e-01  1.42635727e+00
 -7.45007455e-01 -7.32180774e-01

**WAIT** Our sentence has 7 words so why do we get 9 tokens here?

We can use the tokenizer of the model to see how the sentence is split into tokens before it is passed to the transformer.

In [18]:
tokenized_sentence = model.tokenizer(sentence_pos_en)
tokenized_sentence

{'input_ids': [101, 100, 3309, 1998, 1996, 2326, 2003, 2307, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Although our sentence has 7 words, we get 9 identifiers. We can use the **decode** function to convert them back to words:

In [20]:
model.tokenizer.decode(101)

'[ C L S ]'

The first token is the special token **CLS** which is an abstract sentence representation. Let's check another one:

In [21]:
model.tokenizer.decode(3309)

'h o t e l'

All right, this is a word from our sentence. Let's decode them all:

In [22]:
tokenid_list = tokenized_sentence['input_ids']
for token_id in tokenid_list:
    print(token_id, model.tokenizer.decode(token_id))

101 [ C L S ]
100 [ U N K ]
3309 h o t e l
1998 a n d
1996 t h e
2326 s e r v i c e
2003 i s
2307 g r e a t
102 [ S E P ]


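The tokenizer added the special tokens **CLS** and **SEP** but also represented our "Nice" with the **UNK** token. Any idea why? Check the name of the model we used...

We used the uncased model, which means that for training all input was lowercased. Its vocabulary therefore contains only lowercase tokens. The tokenizer output above shows that our input was not lowercased here, so the capitalized "Nice" was not found in the vocabulary and was mapped to **UNK**.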
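
The effect can be reproduced with a toy tokenizer. The sketch below is not the real WordPiece tokenizer: it splits on whitespace and looks each word up in a small, made-up lowercase vocabulary, with a hypothetical `lowercase` switch to show the difference.

```python
# Toy vocabulary containing only lowercase words, as in an uncased model
TOY_VOCAB = {"nice", "hotel", "and", "the", "service", "is", "great"}

def toy_tokenize(sentence, lowercase):
    """Whitespace tokenizer that maps out-of-vocabulary words to [UNK]."""
    if lowercase:
        sentence = sentence.lower()
    words = [word if word in TOY_VOCAB else "[UNK]" for word in sentence.split()]
    return ["[CLS]"] + words + ["[SEP]"]

sentence = "Nice hotel and the service is great"
print(toy_tokenize(sentence, lowercase=False))  # "Nice" is not in the vocabulary
print(toy_tokenize(sentence, lowercase=True))   # "nice" is found
```

Lowercasing the input yourself before encoding (for example with `sentence_pos_en.lower()`) is one way to avoid the unknown token with an uncased model.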

# End of this notebook