# CS 195: Natural Language Processing
## Introduction to the Hugging Face Transformers Library

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F1_1_HuggingFace.ipynb)


## References

Hugging Face *Quicktour*: https://huggingface.co/docs/transformers/quicktour

Hugging Face *Run Inference with Pipelines tutorial*: https://huggingface.co/docs/transformers/pipeline_tutorial

Hugging Face *NLP Course, Chapter 2*: https://huggingface.co/learn/nlp-course/chapter2/1

## What is Hugging Face?

Hugging Face is a private company
* Founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf
* Based in New York City

Provides a popular, free, open-source Python library called **transformers** for NLP (and other) tasks

Hosts *hundreds of thousands of models* that you can use in your own programs

## Installing the transformers module

This is my favored way of installing packages from a Jupyter Notebook

If you have lots of Python distributions installed, it should use the right one

It may take a few minutes, but *you should only have to do this once*
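Since the install only needs to happen once, it can help to check whether the package is already present before running pip again. The sketch below uses only the standard library; the helper name `installed_version` is made up for this example and is not part of any library.

```python
import importlib.metadata


def installed_version(dist_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


version = installed_version("transformers")
if version is None:
    print("transformers is not installed for this interpreter yet")
else:
    print("transformers is already installed, version", version)
```

If the version prints, the install cell can be skipped for this Python distribution.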

In [None]:
import sys
!{sys.executable} -m pip install transformers

Collecting transformers
  Downloading transformers-4.33.0-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Using the sentiment analysis pipeline

**Sentiment analysis** attempts to identify the overall feeling intended by the writer of some text

The creators of this model **trained** it on lots of examples of text that were labeled as either *positive* or *negative*

A **pipeline** is a series of steps for performing **inference**
* tokenize and preprocess the input text (more on this later)
* ask the model for a prediction
* post-process the model's result and turn it into something you can use

![full_nlp_pipeline.svg](https://github.com/ericmanley/f23-CS195NLP/blob/main/images/full_nlp_pipeline.svg?raw=1)
image source: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt
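
The last step, post-processing, can be sketched in plain Python. A classification model typically outputs one raw score (a *logit*) per label; softmax turns these into probabilities, and the most probable label is reported together with its probability. The logit values and the `ID2LABEL` mapping below are made-up values for illustration, not output from a real model.

```python
import math

# Hypothetical label mapping and raw model outputs, for illustration only.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-3.1, 3.4]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    biggest = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(raw_scores, id2label):
    """Pick the most probable label and return it in a label/score dictionary."""
    probs = softmax(raw_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


print(postprocess(logits, ID2LABEL))
```

The returned dictionary has the same `label`/`score` shape as the results a text classification pipeline gives back.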

We *are* specifying the kind of task: `sentiment-analysis`

We *are not* asking for a specific model, so it picks one of many it has by default

The first time you do this, it will have to download the model - this can take some time depending on your network connection

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

classifier("I love how easy it is to build sentiment-aware applications with the transformers library!")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading (…)lve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

[{'label': 'POSITIVE', 'score': 0.9984305500984192}]

**Test it out:** Try changing the input to get different labels/scores
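
The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of that, reusing the model name and revision printed in the warning, might look like the following. The helper name `build_pinned_classifier` is made up for this example, and calling it downloads the model the first time.

```python
# Model name and revision copied from the warning message in the output above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier():
    """Create the sentiment classifier with the model and revision pinned."""
    # Imported inside the function so the settings above can be inspected
    # even where transformers is not installed.
    from transformers import pipeline

    return pipeline(**PINNED)


# Uncomment to build and use the classifier (requires a network connection the first time):
# classifier = build_pinned_classifier()
# print(classifier("I love how easy this is!"))
```

Pinning both values means the same model is used even if the library's default for the task changes later.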

## Working with batches of text

To get classifications of many different examples, pass in a list of strings.

In [None]:
results = classifier(["It's really cool that you can get classifications for a whole batch of text",
                      "I wonder if the rest of the class will be this easy.",
                     "Spolier alert: it won't be."])
print(results)

[{'label': 'POSITIVE', 'score': 0.9991173148155212}, {'label': 'NEGATIVE', 'score': 0.9557349681854248}, {'label': 'NEGATIVE', 'score': 0.9962737560272217}]


Note that the results come back as a list of dictionaries, one per input string and in the same order, so you can manipulate them in the normal ways.

In [None]:
print("The sentence had",results[0]["label"],"sentiment, with a score of",results[0]["score"])

The sentence had POSITIVE sentiment, with a score of 0.9991173148155212
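
Because each result lines up with the input at the same position, `zip` can pair them. The sketch below reuses the printed results above as hard-coded data so it runs without downloading a model; the 0.99 cutoff is an arbitrary value chosen for illustration.

```python
texts = [
    "It's really cool that you can get classifications for a whole batch of text",
    "I wonder if the rest of the class will be this easy.",
    "Spolier alert: it won't be.",
]
# Copied from the printed results above.
results = [
    {"label": "POSITIVE", "score": 0.9991173148155212},
    {"label": "NEGATIVE", "score": 0.9557349681854248},
    {"label": "NEGATIVE", "score": 0.9962737560272217},
]

# Print each input next to its label and score.
for text, result in zip(texts, results):
    print(f"{result['label']:8} {result['score']:.3f}  {text}")

# Keep only the inputs whose prediction score is at least the cutoff.
confident = [text for text, result in zip(texts, results) if result["score"] >= 0.99]
print(len(confident), "of", len(texts), "predictions scored at least 0.99")
```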


## Exercise: Specifying a model

Now try asking for a specific model.

Replace one line of code in your earlier example.

In [None]:
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/380 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

How is this model different from the first model?

Create a cell in this notebook and note the differences you see

In [None]:
results = classifier(["It's really cool that you can get classifications for a whole batch of text",
                      "I wonder if the rest of the class will be this easy.",
                     "Spolier alert: it won't be."])
print(results)

[{'label': 'admiration', 'score': 0.5409085750579834}, {'label': 'surprise', 'score': 0.7312608957290649}, {'label': 'neutral', 'score': 0.7667409181594849}]


How it is different:

*   Instead of only positive or negative labels, it gives more specific emotion labels (such as admiration, surprise, and neutral)
*   The scores are lower



## Applied Exploration

The `roberta-base-go_emotions` model is documented here: https://huggingface.co/SamLowe/roberta-base-go_emotions

Answer some questions about this:
* What is `roberta-base`? Write down some things you can learn about it from the documentation.
* What is `go_emotions`? Write down some things you can learn about it from the documentation.

Go to the Hugging Face models page: https://huggingface.co/models
* click `Text Classification`
* Try some additional models
    - test out at least one more sentiment/emotions model
    - test out at least two other kinds of models - like news topic classification or spam detection
    - write down some info about the models you found
        - what is it for?
        - who made it?
        - what kind of data was it trained on?
        - are they based on some other model and trained on new data (*fine-tuned*) for a specific task?

## Questions about roberta-base and go_emotions

What is roberta-base? Write down some things you can learn about it from the documentation.

1. CAN LEARN ROBERTA-BASE DEFINITION:
 - The documentation describes roberta-base as a model pretrained on English-language text using a masked language modeling (MLM) objective.
 - It also describes roberta-base as a transformers model pretrained on a large corpus of English data.

2. LEARN THE USES FOR ROBERTA-BASE:
 - The documentation states that the model is mainly intended to be fine-tuned on tasks that use the whole sentence to make decisions.

3. LEARN HOW TO USE:
 - The documentation provides example code and walks through how each part is used.


______________________________________________________________________________

What is go_emotions? Write down some things you can learn about it from the documentation.

1. CAN LEARN GO_EMOTIONS DEFINITION:
 - The documentation defines go_emotions as a dataset of Reddit comments labeled for 27 emotion categories or neutral.

2. CAN LEARN ABOUT THE DATASET STRUCTURE:
 - The documentation describes data instances, each of which is a specific comment in the dataset
 - The documentation describes the data fields in the raw data and in the simplified configuration

3. CAN CONSIDER THE IMPACT OF USING THE DATA:
 - The documentation talks about the social impact of using the dataset, including helping computers interact better with humans
 - It also mentions biases in how the data was collected and in the context the comments were taken from

### Trying out additional models

ABOUT THE SENTIMENT/EMOTIONAL MODEL:

Who made it? What is it used for?

The bert-base-multilingual-uncased-sentiment model was made by nlptown. It is mainly used to judge how favorable a product review is, which it reports as a rating from 1 to 5 stars.

What kind of data was it trained on?

It was trained on product review data in 6 different languages.


Are they based on some other model and trained on new data (fine-tuned) for a specific task?

Yes. As the model name suggests, it is based on bert-base-multilingual-uncased and fine-tuned on the product review data.


In [None]:
classifier_bertbase = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

Downloading (…)lve/main/config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/669M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
results1 = classifier_bertbase(["This vaccum destroyed all of the dust in my carpet.",
                                "The chocolate ice cream from joes ice cream shop lacks flavor",
                                "I love the tasy steak at AJ's steakhouse"])

print(results1)

[{'label': '1 star', 'score': 0.8253408670425415}, {'label': '2 stars', 'score': 0.4203665256500244}, {'label': '5 stars', 'score': 0.6982265114784241}]
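
The labels from this model are strings such as `'1 star'` and `'5 stars'`. If a number is more convenient, the leading digit can be parsed out. This sketch reuses the printed results above as data; `stars_from_label` is a helper name made up for this example.

```python
# Copied from the printed results above.
results1 = [
    {"label": "1 star", "score": 0.8253408670425415},
    {"label": "2 stars", "score": 0.4203665256500244},
    {"label": "5 stars", "score": 0.6982265114784241},
]


def stars_from_label(label):
    """Turn a label like '2 stars' into the integer 2."""
    return int(label.split()[0])


ratings = [stars_from_label(result["label"]) for result in results1]
print(ratings)  # [1, 2, 5]
```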


ABOUT THE SPAM DETECTION MODEL:

Who made it? What is it used for?

The bert-tiny-finetuned-sms-spam-detection model was made by mrm8488. It is used to detect spam messages. It labels a message LABEL_0 when it is most likely not spam and LABEL_1 when it most likely is spam.

What kind of data was it trained on?

The model was trained on the sms_spam dataset, which has close to 6k rows of spam and non-spam text messages.

Are they based on some other model and trained on new data (fine-tuned) for a specific task?

The model name indicates that it is a BERT-Tiny model fine-tuned for SMS spam detection.

In [None]:
classifier_spam = pipeline("sentiment-analysis", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")

In [None]:
results2 = classifier_spam(["I really do enjoy watching the Chicago Bears play football! BUY Bears tickets free join my website now! It is amazing. Here is the link bears.com",
                            "The game today was amazing! Good job guys.",
                            "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"])

print(results2)

[{'label': 'LABEL_0', 'score': 0.9288454055786133}, {'label': 'LABEL_0', 'score': 0.9367383122444153}, {'label': 'LABEL_1', 'score': 0.8992626667022705}]
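
Generic labels like `LABEL_0` are easier to read once they are mapped to names. The mapping below follows the interpretation described above (LABEL_0 as not spam, LABEL_1 as spam) and is an assumption that should be checked against the model's documentation. The results are copied from the printed output above.

```python
# Assumed meaning of the generic labels; verify against the model documentation.
LABEL_NAMES = {"LABEL_0": "not spam", "LABEL_1": "spam"}

# Copied from the printed results above.
results2 = [
    {"label": "LABEL_0", "score": 0.9288454055786133},
    {"label": "LABEL_0", "score": 0.9367383122444153},
    {"label": "LABEL_1", "score": 0.8992626667022705},
]

# Build new dictionaries with readable labels, leaving the originals untouched.
readable = [
    {"label": LABEL_NAMES[result["label"]], "score": result["score"]}
    for result in results2
]

for result in readable:
    print(f"{result['label']:8} {result['score']:.3f}")
```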


ABOUT THE CHATGPT DETECTION MODEL:

Who made it? What is it used for?

The model was made by Hello-SimpleAI.
It is used to determine whether text was generated by ChatGPT or written by a human.

What kind of data was it trained on?

The model was trained on a mix of full-text answers and answers split into sentences from Hello-SimpleAI/HC3.

Are they based on some other model and trained on new data (fine-tuned) for a specific task?

The documentation notes that the model is based on roberta-base and trained on the Hello-SimpleAI/HC3 data.

In [None]:
classifier_chatgpt = pipeline("sentiment-analysis", model="Hello-SimpleAI/chatgpt-detector-roberta")

Downloading (…)lve/main/config.json:   0%|          | 0.00/858 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/391 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

In [None]:
results3 = classifier_chatgpt(["ChatGPT is a large language model developed by OpenAI. It is trained on a massive dataset of text and is able to generate human-like responses to a wide range of prompts. It can be used for a variety of tasks such as language translation, text summarization, and conversation generation. It has been trained on a diverse set of internet text and is capable of understanding and generating text in a variety of languages and styles.",
                            "The game today was amazing! Good job guys.",
                            "In a browser you can access the information easier than a password manager due to the extended encryption techniques designed specifically to protect passwords."])

print(results3)

[{'label': 'ChatGPT', 'score': 0.9987819790840149}, {'label': 'Human', 'score': 0.8698444962501526}, {'label': 'ChatGPT', 'score': 0.9705960154533386}]


### Small project

###### Different file

In [None]:
classifier_emo = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/380 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]