

# Introduction
<hr style="border:2px solid black"> </hr>


**What?** Sentiment analysis with Hugging Face Transformers



# Imports
<hr style="border:2px solid black"> </hr>

In [1]:
#!pip install transformers==4.10.0
# If you run into problems, pip uninstall torch and then reinstall the package above
from transformers import pipeline

In [2]:
import transformers

In [3]:
transformers.__version__

'4.10.0'

# Hugging Face `pipeline`
<hr style="border:2px solid black"> </hr>


- `pipeline()` is the easiest way to use a pretrained model for inference.
- You can use `pipeline()` out of the box for many tasks across different modalities, such as:
    - text classification
    - text generation
    - etc.
- When no model is specified, `pipeline()` downloads and caches a default pretrained model and tokenizer for the requested task (here, sentiment analysis).
- It can load any model from the Hugging Face Hub. If no suitable model exists, you will have to fine-tune one yourself, and you can then share it on the Hub if you wish.

- Remember, **architecture** refers to the skeleton of the model and **checkpoints** are the weights for a given architecture. For example, BERT is an architecture, while `bert-base-uncased` is a checkpoint. **Model** is a general term that can mean either an architecture or a checkpoint.
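
Conceptually, a text-classification pipeline tokenizes the input, runs the model, and post-processes the raw outputs (logits) into a label and a score. The snippet below is a minimal, pure-Python sketch of that last step only, assuming a two-class model whose outputs are turned into probabilities with a softmax. The `ID2LABEL` mapping, the `postprocess` helper, and the logit values are illustrative assumptions, not the library's actual code.

```python
import math

# Assumed label mapping for a two-class sentiment checkpoint (illustrative).
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Return the most likely label and its probability (hypothetical helper)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits for one sentence, ordered as [negative, positive].
example = postprocess([-2.0, 3.0])
print(example)
```

The returned dictionary has the same `label` and `score` keys that the pipeline results use, which is why a score close to 1 can be read as a confident prediction.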
    


In [4]:
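# Load the sentiment analysis pipeline
from transformers import AutoTokenizer

# Use one checkpoint for both tokenizer and model so they are guaranteed to match
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

clf = pipeline("sentiment-analysis",
               model=checkpoint,
               tokenizer=tokenizer)
# Alternative: let the pipeline pick its default model and tokenizer
#clf = pipeline("sentiment-analysis")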

2023-02-06 11:30:24.845260: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-06 11:30:24.862511: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


# Sentiment analysis
<hr style="border:2px solid black"> </hr>

In [5]:
# creating dummy dataset
data = ["I am happy to be reading this article",
        "I am not happy to read this article",
        "This is a really informative article",
        "Thank you for reading",
        "Umm, not sure but thank you for reading",
        "I found this 50% interesting and 50% uninteresting"]

# classifying each instance
results = clf(data)

for result in results:
    print(f"label: {result['label']}, score: {round(result['score'], 4)}")

label: POSITIVE, score: 0.9999
label: NEGATIVE, score: 0.9989
label: POSITIVE, score: 0.9998
label: POSITIVE, score: 0.9998
label: POSITIVE, score: 0.9996
label: NEGATIVE, score: 0.6725
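
The last sentence is deliberately ambiguous, and its score (0.6725) is much lower than the others. One possible way to act on this is to flag predictions whose score falls below a cut-off. The sketch below reuses the rounded scores printed above; the `flag_uncertain` helper and the 0.9 threshold are hypothetical choices for illustration, not part of the `transformers` API.

```python
# Predictions copied from the output above (scores rounded to 4 decimals).
results = [
    {"label": "POSITIVE", "score": 0.9999},
    {"label": "NEGATIVE", "score": 0.9989},
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "POSITIVE", "score": 0.9996},
    {"label": "NEGATIVE", "score": 0.6725},
]

THRESHOLD = 0.9  # hypothetical cut-off; tune it for your own data


def flag_uncertain(predictions, threshold=THRESHOLD):
    """Return the indices of predictions whose score is below the threshold."""
    return [i for i, p in enumerate(predictions) if p["score"] < threshold]


uncertain = flag_uncertain(results)
print(uncertain)  # [5]
```

Only the mixed "50% interesting and 50% uninteresting" sentence is flagged. Note that a high score does not guarantee a sensible label: the hesitant "Umm, not sure but thank you for reading" is still classified as POSITIVE with a score of 0.9996.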


# References
<hr style="border:2px solid black"> </hr>


- [Best Tools For NLP Projects That Every Data Scientist and ML Engineer Should Try](https://neptune.ai/blog/best-tools-for-nlp-projects)
- [Hugging Face Transformers quick tour](https://huggingface.co/docs/transformers/quicktour)



# Requirements
<hr style="border:2px solid black"> </hr>

In [6]:
%load_ext watermark
%watermark -v -iv -m

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Python implementation: CPython
Python version       : 3.9.7
IPython version      : 7.29.0

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 21.4.0
Machine     : x86_64
Processor   : i386
CPU cores   : 12
Architecture: 64bit

transformers: 4.10.0
autopep8    : 1.6.0
json        : 2.0.9
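
The `huggingface/tokenizers` message in the output above is a warning, not an error: the process was forked after the tokenizer had already used parallelism. As the message itself suggests, one way to avoid it is to set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it is run in the first cell, before any tokenizer is used:

```python
import os

# Set this before the tokenizer is first used, e.g. at the top of the notebook.
# "false" disables tokenizer parallelism, which avoids the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the value to `"true"` instead keeps parallelism enabled; which option is preferable depends on your workload.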

