<a href="https://colab.research.google.com/github/katmathematics/CS-195/blob/main/F1_1_HuggingFace_Mathesius.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Introduction to the Hugging Face Transformers Library

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F1_1_HuggingFace.ipynb)


## References

Hugging Face *Quicktour*: https://huggingface.co/docs/transformers/quicktour

Hugging Face *Run Inference with Pipelines tutorial*: https://huggingface.co/docs/transformers/pipeline_tutorial

Hugging Face *NLP Course, Chapter 2*: https://huggingface.co/learn/nlp-course/chapter2/1

## What is Hugging Face?

Hugging Face is a private company
* Founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf
* Based in New York City

It provides a popular, free, open-source Python library called **transformers** for NLP (and other) tasks

It hosts *hundreds of thousands of models* that you can use in your own programs

## Installing the transformers module

This is my favored way of installing packages from a Jupyter Notebook

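Because it runs pip through `sys.executable`, the package is installed into the same Python interpreter that is running the notebook, even if you have several Python distributions installed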

It may take a few minutes, but *you should only have to do this once*
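
Since the install only needs to happen once, it can be handy to check whether the package is already present before waiting on pip. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Look up transformers in the Python environment running this notebook
version = installed_version("transformers")
if version is None:
    print("transformers is not installed in this Python environment")
else:
    print("transformers", version, "is installed")
```

If this reports that the package is missing, run the install cell that follows.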

In [None]:
import sys
!{sys.executable} -m pip install transformers

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Using the sentiment analysis pipeline

**Sentiment analysis** attempts to identify the overall feeling intended by the writer of some text

The creators of this model **trained** it on lots of examples of text that were labeled as either *positive* or *negative*

A **pipeline** is a series of steps for performing **inference**
* tokenize and preprocess the input text (more on this later)
* ask the model for a prediction
* post-process the model's result and turn it into something you can use

![full_nlp_pipeline.svg](https://github.com/ericmanley/f23-CS195NLP/blob/main/images/full_nlp_pipeline.svg?raw=1)
image source: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt
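
To make the three pipeline steps concrete, here is a toy sketch in plain Python. It is not how the transformers library is implemented: the word weights in `TOY_WEIGHTS` are invented for illustration, and the function names (`tokenize`, `toy_model`, `postprocess`, `toy_pipeline`) are hypothetical. A real pipeline uses a subword tokenizer and a neural network, but the overall shape (text in, tokens, raw scores, label and score out) is the same.

```python
import math

# Made-up weights: word -> (negative logit, positive logit).
# A real model learns its parameters from labeled training data.
TOY_WEIGHTS = {
    "cool": (-1.0, 2.0),
    "easy": (-0.5, 1.5),
    "won't": (1.5, -1.0),
    "dangerous": (2.0, -1.5),
}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Step 1: split raw text into lowercase tokens (real tokenizers use subwords)."""
    return text.lower().replace(".", " ").replace(",", " ").split()


def toy_model(tokens):
    """Step 2: produce one raw score (logit) per label by summing per-word weights."""
    logits = [0.0, 0.0]
    for token in tokens:
        negative, positive = TOY_WEIGHTS.get(token, (0.0, 0.0))
        logits[0] += negative
        logits[1] += positive
    return logits


def postprocess(logits):
    """Step 3: turn logits into probabilities with softmax and report the best label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring the tokenize -> model -> post-process flow."""
    return postprocess(toy_model(tokenize(text)))


print(toy_pipeline("This class is really cool."))
print(toy_pipeline("The government saw his ideas as dangerous."))
```

The returned dictionary deliberately has the same `label` and `score` keys that the real sentiment pipeline produces.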

We *are* specifying the kind of task: `sentiment-analysis`

We *are not* asking for a specific model, so the library falls back to a default model for this task

The first time you do this, it will have to download the model - this can take some time depending on your network connection

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

#classifier("We are an exception among people. We belong to those who are not an integral part of humanity but exist only to teach the world some type of great lesson.")
classifier(" Luckily Olli arrives there also, and he seems to know too much about everything.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9993765950202942}]

**Test it out:** Try changing the input to get different labels/scores

## Working with batches of text

To get classifications of many different examples, pass in a list of strings.

In [None]:
results = classifier(["It's really cool that you can get classifications for a whole batch of text",
                      "I wonder if the rest of the class will be this easy.",
                     "Spolier alert: it won't be."])
print(results)

[{'label': 'POSITIVE', 'score': 0.9991173148155212}, {'label': 'NEGATIVE', 'score': 0.9557349681854248}, {'label': 'NEGATIVE', 'score': 0.9962737560272217}]


Note that the results come back as a list of dictionaries, one per input string, so you can index, loop over, and filter them like any other Python list.

In [None]:
print("The sentence had",results[0]["label"],"sentiment, with a score of",results[0]["score"])

The sentence had POSITIVE sentiment, with a score of 0.9991173148155212
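
Because the results are ordinary Python data, the usual list and dictionary tools apply. The sketch below hard-codes the three results from the batch output above so that it runs without loading a model; the variable names are just for illustration.

```python
# Results copied from the batch output above
results = [
    {"label": "POSITIVE", "score": 0.9991173148155212},
    {"label": "NEGATIVE", "score": 0.9557349681854248},
    {"label": "NEGATIVE", "score": 0.9962737560272217},
]

# Keep only the negative predictions
negatives = [r for r in results if r["label"] == "NEGATIVE"]

# Find the index of the prediction with the highest score
most_confident = max(range(len(results)), key=lambda i: results[i]["score"])

# Count how many times each label appears
label_counts = {}
for r in results:
    label_counts[r["label"]] = label_counts.get(r["label"], 0) + 1

print(len(negatives), "negative predictions")
print("Most confident prediction is at index", most_confident)
print(label_counts)
```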


## Exercise: Specifying a model

Now try asking for a specific model.

Replace one line of code in your earlier example.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

How is this model different from the first model?

Create a cell in this notebook and note the differences you see

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

classifier("We are an exception among people. We belong to those who are not an integral part of humanity but exist only to teach the world some type of great lesson.")

classifier("He extolled the achievements of Europe, especially in rational and logical thought, its progressive spirit, its leadership in science, and indeed its leadership on the path to freedom.")

classifier("The Russian government saw his ideas as dangerous and unsound.")

classifier(" Luckily Olli arrives there also, and he seems to know too much about everything.")

[{'label': 'neutral', 'score': 0.5005195140838623}]

## Applied Exploration

The `roberta-base-go_emotions` model is documented here: https://huggingface.co/SamLowe/roberta-base-go_emotions

Answer some questions about this:
* What is `roberta-base`? Write down some things you can learn about it from the documentation.
* What is `go_emotions`? Write down some things you can learn about it from the documentation.

Go to the Hugging Face models page: https://huggingface.co/models
* click `Text Classification`
* Try some additional models
    - test out at least one more sentiment/emotions model
    - test out at least two other kinds of models - like news topic classification or spam detection
    - write down some info about the models you found
        - what is it for?
        - who made it?
        - what kind of data was it trained on?
        - is it based on some other model and trained on new data (*fine-tuned*) for a specific task?

## What is roberta-base?

roberta-base is a general-purpose pretrained English language model, not a sentiment analysis model on its own. It was introduced in the 2019 paper *RoBERTa: A Robustly Optimized BERT Pretraining Approach* by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. The model was pretrained on raw, unlabeled text using masked language modeling. While it can be used in its base form to fill in masked words, it is intended to be fine-tuned for specific tasks such as text classification.

## What is go_emotions?

go_emotions is a dataset of about 58,000 Reddit comments, each labeled with one or more of 27 emotion categories, or with neutral (28 labels in total).
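
Since a go_emotions comment can carry more than one emotion, it is often more useful to look at the scores for every label than at the single top label. In recent versions of transformers, passing `top_k=None` when calling a text classification pipeline should return a score for every label; check this against the version you have installed. The sketch below does not call a model: the scores are made up, the helper `labels_above` is hypothetical, and it assumes the scores for one input arrive as a list of label/score dictionaries.

```python
# Made-up scores for ONE input, in the assumed shape of an all-labels result:
# a list of {"label", "score"} dictionaries (only four of the labels shown)
all_scores = [
    {"label": "admiration", "score": 0.62},
    {"label": "approval", "score": 0.41},
    {"label": "neutral", "score": 0.18},
    {"label": "annoyance", "score": 0.02},
]


def labels_above(scores, threshold=0.3):
    """Keep every label whose score meets the threshold, highest score first."""
    kept = [s for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)


print(labels_above(all_scores))
```

The threshold of 0.3 is arbitrary; raising it keeps fewer labels, and lowering it keeps more.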

# Further Sentiment Analysis

### Model: [rubert-tiny2-russian-sentiment](https://huggingface.co/seara/rubert-tiny2-russian-sentiment)
### Author: [seara](https://huggingface.co/seara)



In [None]:
# Russian text example
# Classifier Source: https://huggingface.co/seara/rubert-tiny2-russian-sentiment
from transformers import pipeline

quotes = []
philosophers = []

philosophers.append("Чаадаев")
quotes.append("тусклое и мрачное существование, лишённое силы и энергии, которое ничто не оживляло, кроме злодеяний, ничто не смягчало, кроме рабства. Ни пленительных воспоминаний, ни грациозных образов в памяти народа, ни мощных поучений в его предании… Мы живём одним настоящим, в самых тесных его пределах, без прошедшего и будущего, среди мёртвого застоя.")

classifier = pipeline("sentiment-analysis", model="seara/rubert-tiny2-russian-sentiment")

results = classifier(quotes)

# Pair each philosopher with the classification of their quote
for philosopher, result in zip(philosophers, results):
  print(philosopher + "'s writing is " + result["label"] + ".")

Чаадаев's writing is negative.


# Further Analysis

In [None]:
import os
from huggingface_hub import login
from transformers import pipeline

# Read the access token from an environment variable (HF_TOKEN is an assumed name) instead of pasting it into the notebook
login(token=os.environ["HF_TOKEN"])
classifier = pipeline("sentiment-analysis", model="kearney/infoquality")