# Introduction to `transformers`

💡 **Question 1**: I have a sentence below. How can I analyse it with `transformers`?

**Question 2**: We have a text dataset and want to quickly extract themes and sentiments to get a "feel" for the data.

**What we will cover**:

- Install the `transformers` library
- Work with `pipeline`
- Work with Hugging Face datasets

In this activity, we are going to start working with the `transformers` library and have a look at its key function, `pipeline`.

## Pipeline function

The pipeline function in the Hugging Face Transformers library is a high-level API that simplifies the process of using pre-trained models for various NLP tasks.

It abstracts away much of the complexity involved in loading models, tokenizing input, and processing output, so you can perform tasks like text classification, named entity recognition, sentiment analysis, question answering, translation, and more with just a few lines of code.

### Install required libraries and load your text

In [1]:
!pip install transformers datasets torch



We will import the following function and classes from the `transformers` library for use in this tutorial:

- `pipeline()`: A function for various NLP tasks such as sentiment analysis, text classification, etc.

- `AutoTokenizer`: A class that automatically loads the appropriate tokenizer for a given model.

- `AutoModelForSequenceClassification`: A class that automatically loads the appropriate model for sequence classification tasks.


In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

Large language models are resource intensive, so a GPU is preferred when one is available.

In [3]:
import torch
# Check if GPU is available and set device accordingly
device = 0 if torch.cuda.is_available() else -1

We will experiment with this text; feel free to replace it with one from your own area!

In [4]:
text="Cognitive behavioral therapy is a widely recommended treatment for managing anxiety and depression, helping individuals develop coping strategies to improve their mental health."

We import the pipeline function from the `transformers` library, create a text classification pipeline, specifying the device (GPU or CPU), and use the pipeline to classify the given text.

The `pipeline()` function has the following structure:

```python
from transformers import pipeline

# To use the default model and tokenizer for a given task (e.g. question-answering)
pipeline("<task-name>")

# To use an existing model
pipeline("<task-name>", model="<model_name>")

# To use a custom model/tokenizer
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")
```
Let's start with sentiment analysis.

## Sentiment analysis 


In [None]:
classifier = pipeline("text-classification", device=device)
result = classifier(text)
print(result)


You may see this warning when you run:

```python
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
```

When you use the pipeline function without specifying a model and revision, the `transformers` library defaults to a pre-configured model for the specified task. This makes it easier to get started quickly without needing to know which model, and which version of that model (`revision`), to use.

For example, if you create a text classification pipeline without specifying a model, it defaults to using the `distilbert/distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.

For production environments, it is better to explicitly specify the model and its revision to ensure consistency and avoid unexpected changes in behavior due to updates in the default model.

We can also run the pipeline with a different model:

`FacebookAI/roberta-large-mnli`: RoBERTa is an optimized version of BERT, pretrained on a larger corpus than BERT, including data derived from Common Crawl. This checkpoint is RoBERTa-large fine-tuned on the MNLI natural language inference dataset, so its labels describe entailment, neutrality, or contradiction rather than sentiment.

*Common Crawl: a large-scale web dataset that contains petabytes of web data collected over several years. It includes raw web page data, metadata, and text extractions.*

`cardiffnlp/twitter-roberta-base-sentiment`: A RoBERTa model fine-tuned on Twitter data for sentiment analysis.

In [1]:
# FacebookAI/roberta-large-mnli
classifier = pipeline("text-classification",
                      model="FacebookAI/roberta-large-mnli", device=device)
result = classifier(text)
print(result) 


In [None]:
# cardiffnlp/twitter-roberta-base-sentiment
classifier = pipeline("text-classification",
                      model="cardiffnlp/twitter-roberta-base-sentiment", device=device)
result = classifier(text)
print(result)   

If you have several texts, pass them as a list:

In [None]:
text2="Former President Donald Trump is struggling with an attack strategy against Kamala Harris, amid reports he called her a “b**ch” repeatedly in private as the Republican presidential nominee also sees support in the polls plunge."

In [None]:
classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device=device)
result = classifier([text, text2])
print(result) 
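
When a list is passed, the text-classification pipeline returns one `{'label': ..., 'score': ...}` dict per input, in the same order. The sketch below shows one way to pair each text with its prediction. The helper name `pair_predictions` and the values in `example_results` are made up for illustration and are not real model output.

```python
def pair_predictions(texts, results):
    """Pair each input text with its predicted label and score.

    Assumes `results` has the shape a text-classification pipeline returns
    for a list of inputs: one {'label': ..., 'score': ...} dict per text.
    """
    if len(texts) != len(results):
        raise ValueError("texts and results must have the same length")
    return [
        {"text": t, "label": r["label"], "score": round(r["score"], 3)}
        for t, r in zip(texts, results)
    ]


# Hypothetical values standing in for classifier([text, text2]); not real output.
example_texts = ["first text", "second text"]
example_results = [
    {"label": "POSITIVE", "score": 0.9876},
    {"label": "NEGATIVE", "score": 0.9123},
]

rows = pair_predictions(example_texts, example_results)
for row in rows:
    print(row)
```

With the real pipeline you would call `pair_predictions([text, text2], result)` instead of using the placeholder values.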

You can also apply the pipeline to a dataset, such as `ag_news`, described on [Papers with Code](https://paperswithcode.com/dataset/ag-news) and available on [Hugging Face](https://huggingface.co/datasets/fancyzhx/ag_news):

![PaperWithCode](../images/ag_news.png)

In [None]:
from datasets import load_dataset

# Load a sample dataset (you can replace this with your own dataset)
dataset = load_dataset('ag_news', split='train[:10]')  # Using a subset for demonstration

Using it with the `pipeline` function:

In [None]:
# Initialize the sentiment analysis pipeline
classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", revision="af0f99b", device=device)



In [None]:
# Apply the pipeline to each text in the dataset
results = [classifier(item['text']) for item in dataset]

# Print the results
for result in results:
    print(result)
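
Because the loop above passes one string at a time, each element of `results` is itself a one-element list. To get a quick "feel" for the data, you can tally the predicted labels. This is a minimal sketch: `count_labels` is a hypothetical helper, and the sample values are invented so the block runs without downloading a model.

```python
from collections import Counter


def count_labels(results):
    """Count predicted labels in a list of per-text pipeline outputs.

    Accepts either a dict per text or a one-element list per text (the
    shape produced by calling the classifier on one string at a time).
    """
    counts = Counter()
    for result in results:
        prediction = result[0] if isinstance(result, list) else result
        counts[prediction["label"]] += 1
    return counts


# Hypothetical outputs for three texts; not real model predictions.
example_results = [
    [{"label": "NEGATIVE", "score": 0.98}],
    [{"label": "POSITIVE", "score": 0.91}],
    [{"label": "NEGATIVE", "score": 0.75}],
]

label_counts = count_labels(example_results)
print(label_counts.most_common())
```

With the real output, `count_labels(results)` gives the same kind of summary for the dataset subset.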

Some other currently available pipelines are:

- zero-shot-classification
- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation


Let's try them!

## Zero-shot classification

Most of the time, training an ML model requires all possible labels (targets) to be known beforehand. For example, you may have a manually labelled dataset in which each news item is assigned to one of a limited number of topics (e.g. science, politics, or education).

What if your dataset has no labels, but you have some idea of which labels could apply? Zero-shot classification lets you supply candidate labels at inference time.

 

In [None]:
candidate_labels = ["economy", "health", "education", "foreign policy", "environment", "technology", "social justice"]

In [None]:
# Initialize the zero-shot classification pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=device)


In [None]:
# Apply the pipeline to each text in the dataset
results = [classifier(item['text'], candidate_labels=candidate_labels) for item in dataset]

# Print the results
for result in results:
    print(result)
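
Each zero-shot result is a dict with the input under `sequence` and two parallel lists, `labels` and `scores`. The sketch below picks the best-scoring theme and optionally discards weak matches. The helper `top_label`, the `min_score` parameter, and the sample values are assumptions made for illustration, not real model output.

```python
def top_label(result, min_score=0.0):
    """Return (label, score) for the best candidate, or None below min_score.

    Assumes `result` is a zero-shot-classification output dict with parallel
    'labels' and 'scores' lists.
    """
    label, score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return (label, score) if score >= min_score else None


# Hypothetical result for one text; the scores are invented.
example_result = {
    "sequence": "Example news text",
    "labels": ["economy", "technology", "health"],
    "scores": [0.62, 0.30, 0.08],
}

print(top_label(example_result))
print(top_label(example_result, min_score=0.7))
```

A threshold is a judgment call: a low top score may mean that none of the candidate labels fits the text well.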

In [None]:
# Initialize the sentiment analysis pipeline
classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                      revision="af0f99b", device=device)

# Apply the pipeline to each text in the dataset
results = [classifier(item['text']) for item in dataset]

# Print the results
for result in results:
    print(result)


## Question Answering

Imagine writing a comprehensive literature review that spans hundreds of research papers, each filled with detailed analyses and discussions. You might want to prepare a table that lists, for each reviewed paper, the sample size, the methods used, the timeframe for data collection, and the key findings. Instead of manually reading through each paper, you can use a question-answering model. It can quickly extract the information you are looking for across all the papers, streamlining your literature review and letting you concentrate on synthesizing the research.

What we need:

- a context for the model to search (the collection of papers to review)
- a question we want answered (e.g. "What is the sample size in this research?")



In [None]:
# Import the question-answering class and tokenizer from transformers
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

Specify the model and its tokenizer (notice that the model and the tokenizer share the same name):

In [None]:
model = "deepset/roberta-base-squad2"

task = 'question-answering'
QA_model = pipeline(task, model=model, tokenizer=model)

In [None]:
QA_input = {
    'question': 'when is Apple hosting an event?',
    # The context must be a string, so take the 'text' field of the last record
    'context': dataset[-1]['text']
}

In [None]:
import pandas as pd

model_response = QA_model(QA_input)
pd.DataFrame([model_response])

In [None]:
dataset[-1]
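
To move from a single question towards the literature-review table described above, you can ask the same set of questions of every paper. The sketch below assumes only that the QA callable accepts `question=` and `context=` keyword arguments and returns a dict with an `answer` key, as the question-answering pipeline does. The names `build_review_table` and `stub_qa`, and the sample paper text, are hypothetical; the stub stands in for the model so the block runs without a download.

```python
def build_review_table(qa, papers, questions):
    """Ask every question of every paper and collect one row per paper.

    `qa` is any callable that takes question= and context= keyword arguments
    and returns a dict with an 'answer' key.
    """
    table = []
    for title, context in papers.items():
        row = {"paper": title}
        for column, question in questions.items():
            row[column] = qa(question=question, context=context)["answer"]
        table.append(row)
    return table


def stub_qa(question, context):
    # Stand-in for a real model: simply returns the first sentence of the context.
    return {"answer": context.split(".")[0], "score": 1.0}


# Hypothetical paper text and question set, for illustration only.
papers = {
    "Paper A (hypothetical)": "We surveyed 120 students. Data were collected in 2021.",
}
questions = {"sample_size": "What is the sample size in this research?"}

table = build_review_table(stub_qa, papers, questions)
print(table)
```

With the real pipeline, pass `QA_model` in place of `stub_qa`; the resulting list of rows can be handed to `pd.DataFrame(table)`.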