<a href="https://colab.research.google.com/github/jcdumlao14/Sentiment_Analysis-HuggingFace-/blob/main/Analysis_%F0%9F%A4%97.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis with Hugging Face 🤗**

In [2]:
# Transformers installation
%%capture
! pip install transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.27.1


## Highlights tour

Join me on a tour of the powerful 🤗 Transformers library, where we'll explore its many features for Natural Language Understanding (NLU) and Natural Language Generation (NLG). With its pre-trained models and intuitive pipeline API, you'll be up and running in no time.

We'll start by taking a quick peek at the pipeline API, which lets you easily apply pretrained models to tasks like sentiment analysis and text generation. But we won't stop there. We'll also dive into the library's more advanced capabilities, including access to the underlying models and tools for preprocessing your data.

Whether you're a seasoned NLP expert or just starting out, this tour is sure to leave you impressed with what 🤗 Transformers can do. So buckle up and let's get started!

In the official 🤗 Transformers documentation, code examples come with a switch in the top left corner that toggles between the PyTorch and TensorFlow versions. If an example has no switch, the code works with both backends unchanged.

The cells in this notebook that load a model directly use the TensorFlow classes (the ones prefixed with `TF`).

## Accelerating your workflow with a pipeline

Ready to dive into the world of pretrained models? Look no further than 🤗 Transformers' `pipeline` feature! With just a few lines of code, you can harness the power of some of the most cutting-edge NLP models out there.

The `pipeline` feature supports a variety of tasks, including sentiment analysis, text generation, named entity recognition (NER), question answering, filling masked text, summarization, translation, and feature extraction. So whether you need to classify text, generate new language, or extract meaning from unstructured data, there is a task for it.

Let's take a closer look at sentiment analysis, just one of the many tasks available through the `pipeline`. The [task summary](https://huggingface.co/transformers/task_summary.html) covers all the other options.

Ready to see what 🤗 Transformers can do for you? Let's get started with `pipeline`!

In [3]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
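
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision printed in the warning; `build_classifier` is a hypothetical helper introduced here for illustration, not part of the library:

```python
# Model name and revision copied from the warning printed above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_classifier():
    """Create a sentiment-analysis pipeline pinned to one model and revision."""
    # Imported here so that defining the function does not load the library yet.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `classifier = build_classifier()` should then give the same classifier as the cell above, with the model choice stated in the code rather than left to the library's default.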


When you run this command for the first time, 🤗 Transformers automatically downloads and caches a pretrained model and its corresponding tokenizer. While we'll dive into the details of each later on, here's a quick overview: the tokenizer preprocesses the text, breaking it down into tokens and converting them to the numerical ids the model expects. The model then takes those ids as input and produces a prediction.

The `pipeline` feature ties these steps together into one workflow: it tokenizes the text, runs the model, and post-processes the prediction into a readable label and score. For example:

In [4]:
classifier('We are delighted to introduce you to the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9998339414596558}]

In [5]:
classifier('The pizza itself may not be exceptional, but the crust is undeniably fantastic.')

[{'label': 'POSITIVE', 'score': 0.9998449087142944}]

You can also process several sentences at once: pass a list of sentences, and the pipeline preprocesses them, sends them to the model as a batch, and returns a list of dictionaries, one per sentence:

In [6]:
results = classifier(["We are delighted to introduce you to the 🤗 Transformers library.",
           "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


In this output, the second sentence is classified as negative, but its score of 0.5309 is close to 0.5. For a classifier with two labels, that means the model is nearly undecided between POSITIVE and NEGATIVE, so the label should not be read as a confident prediction.
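
One way to act on a borderline score like the second one is to flag predictions that fall below a confidence threshold. The sketch below works on the results printed above; the helper name `flag_uncertain` and the threshold of 0.6 are assumptions chosen for illustration, not library defaults:

```python
# Results copied (rounded) from the batch output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]


def flag_uncertain(results, threshold=0.6):
    """Return the indices of predictions whose score falls below `threshold`."""
    return [i for i, result in enumerate(results) if result["score"] < threshold]


uncertain = flag_uncertain(results)
print(uncertain)  # prints [1]
```

Only the second sentence is flagged, which matches the reading of its score given above.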

The default model used in this pipeline is "distilbert-base-uncased-finetuned-sst-2-english". If you'd like to learn more about this model, you can check out its [model page](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), which provides additional information about its architecture and training data.

If you'd like to use a different model, such as one that has been trained on French data, you can browse the [model hub](https://huggingface.co/models) to find a suitable option. By filtering for tags such as "French" and "text-classification", you may find a suggestion like "nlptown/bert-base-multilingual-uncased-sentiment".

To use a different model, simply specify its name when calling `pipeline`:







In [7]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [8]:
classifier("Esperamos que no lo odie.")

[{'label': '3 stars', 'score': 0.33688193559646606}]
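
Unlike the default model, this one predicts a star rating rather than POSITIVE or NEGATIVE. A small sketch of turning such a label into a number, assuming every label follows the "N star(s)" pattern seen in the output above; `stars_to_int` is a hypothetical helper:

```python
def stars_to_int(label):
    """Parse a label such as '3 stars' or '1 star' into an integer rating."""
    return int(label.split()[0])


# Prediction copied from the output above.
prediction = {"label": "3 stars", "score": 0.33688193559646606}
rating = stars_to_int(prediction["label"])
print(rating, round(prediction["score"], 4))  # prints 3 0.3369
```

A score of roughly 0.34 spread over several possible ratings is a weak preference, so the numeric rating is best read together with its score.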

You now have a sentiment analysis classifier that can handle texts not only in English and French, but also in Dutch, German, Italian, and Spanish! Additionally, you can replace the model name with the path to a local folder where a pretrained model is stored (as we'll see later on). You can also pass a model object and its associated tokenizer if you prefer.

To achieve this, we need two classes. First, we'll use `AutoTokenizer` to download and instantiate the tokenizer associated with the model we chose. Second, we'll use `AutoModelForSequenceClassification` (or `TFAutoModelForSequenceClassification` if you're using TensorFlow, as this notebook does) to download the model itself. Note that a different task would need a different model class; the [task summary](https://huggingface.co/transformers/task_summary.html) lists which one to use.

In [9]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

Now, to download the model and tokenizer we found previously, we just have to call the `from_pretrained` method of each class (feel free to replace `model_name` with any other model from the model hub):

In [10]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [11]:
classifier("I exhibit good behavior.")

[{'label': '4 stars', 'score': 0.49289628863334656}]

If you don't find a model that has been pretrained on some data similar to yours, you will need to fine-tune a
pretrained model on your data.

## Exploring Pretrained Models: Understanding How They Work

In [12]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_57']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Using the tokenizer

The tokenizer plays a crucial role in preprocessing the text for the model. It first splits the text into tokens, which can include words, parts of words, punctuation symbols, and more. The way this is done can vary based on certain rules, which is why it's important to instantiate the tokenizer using the same name as the pretrained model to ensure consistency.

Once the tokens have been generated, the tokenizer then converts them into numerical values, creating a tensor that can be fed into the model. To accomplish this, the tokenizer utilizes a vocabulary, which is downloaded when we instantiate it with the `from_pretrained` method. This is necessary to ensure that the same vocabulary is used as when the model was originally pretrained.

To preprocess a given text, we can simply pass it through the tokenizer.

In [13]:
inputs = tokenizer("We are delighted to introduce you to the 🤗 Transformers library.")

The tokenizer returns a dictionary that maps strings to lists of integers. It contains the ids of the tokens (`input_ids`) along with other arguments the model needs. Here it also contains an attention mask (`attention_mask`), which tells the model which positions hold real tokens (1) and which hold padding (0):

In [14]:
print(inputs)

{'input_ids': [101, 2057, 2024, 15936, 2000, 8970, 2017, 2000, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
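
To make the two steps (splitting into tokens, then looking up ids) concrete, here is a toy sketch that reproduces the `input_ids` above. It is not the real tokenizer: `TOY_VOCAB` holds only the ids that can be read off this one output, the splitting rule is a simple regular expression rather than the model's subword algorithm, and `toy_encode` is a hypothetical name.

```python
import re

# Ids read off the tokenizer output above; the real vocabulary is far larger.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "delighted": 15936, "to": 2000,
    "introduce": 8970, "you": 2017, "the": 1996,
    "transformers": 19081, "library": 3075, ".": 1012,
}


def toy_encode(text):
    """Lowercase, split into words and punctuation, then look up each token id."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Tokens missing from the vocabulary (here the emoji) map to the unknown token.
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    # Wrap the sequence in the special start and end tokens.
    ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


encoded = toy_encode("We are delighted to introduce you to the 🤗 Transformers library.")
print(encoded["input_ids"])
```

The emoji is not in the vocabulary, which is why id 100 (the unknown token) appears in the middle of the sequence.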


To process multiple sentences at once and send them to the model as a batch, pass a list of sentences to the tokenizer. Because all sequences in a batch must have the same length, the shorter ones need to be padded, and any that exceed the maximum length the model accepts need to be truncated. Setting `return_tensors` makes the tokenizer return tensors instead of lists. Here is an example:

In [15]:
tf_batch = tokenizer(
    ["We are delighted to introduce you to the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

With `padding=True`, the padding is applied on the side expected by the model (in this case, on the right), using the padding token the model was pretrained with. The attention mask is adapted to take the padding into account. This ensures that all input sequences have the same length and can be processed as a batch by the model.

In [16]:
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2057, 2024, 15936, 2000, 8970, 2017, 2000, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
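
The padding and attention mask in this output can be reproduced with a few lines of plain Python. This is a sketch of the idea only, assuming right-padding with pad id 0 as seen above; `pad_batch` is a hypothetical helper, not the tokenizer's implementation:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token ids of the two sentences, taken from the output above.
first = [101, 2057, 2024, 15936, 2000, 8970, 2017, 2000, 1996, 100, 19081, 3075, 1012, 102]
second = [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102]

batch = pad_batch([first, second])
print(batch["attention_mask"][1])
```

The second sentence is three tokens shorter than the first, so it receives three padding ids and three zeros in its mask.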


You can learn more about tokenizers [here](https://huggingface.co/transformers/preprocessing.html).

### Using the model

After preprocessing your input with the tokenizer, you can send it directly to the model, since the tokenizer output contains everything the model needs. With a TensorFlow model, you can pass the dictionary returned by the tokenizer as is. With a PyTorch model, you need to unpack the dictionary using the `**` operator.

In [17]:
tf_outputs = tf_model(tf_batch)

In 🤗 Transformers, model outputs are `ModelOutput` objects that can also be indexed like a tuple. Here, the only field that is set holds the final activations of the model (the logits).

In [18]:
print(tf_outputs)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.223428  ,  4.4798603 ],
       [ 0.08181466, -0.04178996]], dtype=float32)>, hidden_states=None, attentions=None)


The model can return more than just the final activations (for example a loss, hidden states, or attention weights), which is why the output has several fields. Here we only asked for the final activations, so the other fields are `None`.

The final activations are logits, so let's apply the softmax function to turn them into probabilities.



In [19]:
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

These are the same scores the pipeline returned earlier: 0.9998 for POSITIVE on the first sentence and 0.5309 for NEGATIVE on the second:



In [20]:
print(tf_predictions)

tf.Tensor(
[[1.6601139e-04 9.9983394e-01]
 [5.3086185e-01 4.6913809e-01]], shape=(2, 2), dtype=float32)
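
For readers who want to see what the softmax step computes, here is a plain-Python sketch applied to the logits printed earlier. It is written for illustration and is not how TensorFlow implements `tf.nn.softmax` internally:

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above, one row per sentence.
logits = [[-4.223428, 4.4798603], [0.08181466, -0.04178996]]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the larger entry in each row is the score the pipeline reports for that sentence.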


If you have labels, you can pass them to the model as well. The output then contains the loss in addition to the final activations.



In [21]:
import tensorflow as tf
tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
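
As a rough picture of what that loss measures, the sketch below computes a per-sentence cross-entropy from the logits printed earlier and the labels `[1, 0]`. It assumes the standard cross-entropy used for single-label classification; whether and how the library reduces the loss over the batch is not shown here, and `cross_entropy` is a hypothetical helper:

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the correct class, computed from raw logits."""
    # log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[label]


# Logits from the earlier output and the labels passed in the cell above.
logits = [[-4.223428, 4.4798603], [0.08181466, -0.04178996]]
labels = [1, 0]
losses = [cross_entropy(row, label) for row, label in zip(logits, labels)]
print(losses)
```

The first loss is close to zero because the model is confident and correct; the second is much larger because the model is nearly undecided.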

The models are implemented as standard PyTorch or TensorFlow modules, which means you can use them in your regular training loop. Additionally, 🤗 Transformers provides a `Trainer` class to facilitate training, handling tasks such as distributed training, mixed precision, and more. Its TensorFlow counterpart, `TFTrainer`, has been deprecated in favor of calling Keras methods such as `fit` directly on the model.

> **NOTE:** Model outputs are special dataclasses so that you can get autocompletion for their attributes in an IDE.
> They also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice or a string), in which
> case the attributes that are not set (those with `None` values) are ignored.

Once your model is fine-tuned, you can save it with its tokenizer in the following way:



In [22]:
save_directory = "my_model/distilbert-base-uncased-finetuned-sst-2-english"

# Save the DistilBERT tokenizer and model loaded in cell 12.
# `model` still refers to the multilingual BERT classifier from cell 10, so save `tf_model` here.
tokenizer.save_pretrained(save_directory)
tf_model.save_pretrained(save_directory)


To load a saved model, call `from_pretrained` on the matching class and pass the directory name instead of the model name. 🤗 Transformers also lets you switch between PyTorch and TensorFlow by loading a saved model in the other framework. To load a saved PyTorch model in TensorFlow, pass `from_pt=True`:

In [None]:
from transformers import AutoTokenizer, TFAutoModel

# Left commented out: this only applies when `save_directory` holds PyTorch weights.
#tokenizer = AutoTokenizer.from_pretrained(save_directory)
#model = TFAutoModel.from_pretrained(save_directory, from_pt=True)


To load a saved TensorFlow model in PyTorch, pass `from_tf=True`:

In [24]:
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained(save_directory)
model = DistilBertModel.from_pretrained(save_directory, from_tf=True)



Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:

In [25]:
tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = tf_outputs[-2:]

### Viewing the source code

The `AutoModel` and `AutoTokenizer` classes serve as shortcuts to work with any pretrained model. However, the library has a model class for each combination of architecture and task. This structure makes it easy to access and modify the code if necessary.

In the previous example, the model used was named "distilbert-base-uncased-finetuned-sst-2-english", indicating that it uses the [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) architecture. Since `TFAutoModelForSequenceClassification` was used, the model that was automatically created is `TFDistilBertForSequenceClassification` (with the PyTorch class `AutoModelForSequenceClassification`, it would be `DistilBertForSequenceClassification`). You can refer to the model's documentation or browse the source code to view all relevant details for that specific model. To instantiate the model and tokenizer directly without using the auto shortcut, you can follow this example:

In [26]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_151']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Customizing the model

If you want to customize how the model is built, you can create a custom configuration class. Each architecture has its own relevant configuration class, such as `DistilBertConfig` for DistilBERT, which allows you to specify various parameters like hidden dimension and dropout rate. However, if you make significant changes to the architecture, such as modifying the hidden size, you cannot use a pretrained model and must train from scratch. In this case, you can initialize the model directly from the configuration.

To use the predefined vocabulary of DistilBERT, you can load the tokenizer with the `DistilBertTokenizer.from_pretrained` method. To instantiate the model from scratch using the configuration, you can use `TFDistilBertForSequenceClassification(config)` instead of `TFDistilBertForSequenceClassification.from_pretrained` (drop the `TF` prefix for the PyTorch classes).

In [27]:
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification(config)
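
When choosing values such as `n_heads` and `dim`, keep in mind that multi-head attention splits the hidden dimension evenly across the heads, so `dim` needs to be divisible by `n_heads`. The sketch below checks this for the values used above; `head_size` is a hypothetical helper written for illustration, not a library function:

```python
def head_size(dim, n_heads):
    """Size of each attention head; `dim` must split evenly across the heads."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads


# Values from the configuration in the cell above.
dim, n_heads, hidden_dim = 512, 8, 4 * 512
print(head_size(dim, n_heads), hidden_dim)  # prints 64 2048
```

With these settings each of the 8 heads works on 64 dimensions, and the feed-forward layer is four times as wide as the hidden dimension.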

If you only need to modify the head of a model, such as changing the number of output labels, you can still use a pretrained model for the body. For example, to create a classifier with 10 different labels using a pretrained body, you can pass any argument that a configuration would take to the `from_pretrained` method. This will update the default configuration with the provided argument. Alternatively, you can create a custom configuration with all the default values and just change the number of labels.

In [28]:
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_191', 'classifier', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use 