<a href="https://colab.research.google.com/github/katmathematics/CS-195/blob/main/F1_1_HuggingFace_Mathesius.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Introduction to the Hugging Face Transformers Library

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F1_1_HuggingFace.ipynb)


## References

Hugging Face *Quicktour*: https://huggingface.co/docs/transformers/quicktour

Hugging Face *Run Inference with Pipelines tutorial*: https://huggingface.co/docs/transformers/pipeline_tutorial

Hugging Face *NLP Course, Chapter 2*: https://huggingface.co/learn/nlp-course/chapter2/1

## What is Hugging Face?

Hugging Face is a private company
* Founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf
* Based in New York City

It provides a popular, free, open-source Python library called **transformers** for NLP (and other) tasks

It hosts *hundreds of thousands of models* that you can use in your own programs

## Installing the transformers module

This is my favored way of installing packages from a Jupyter Notebook

If you have lots of Python distributions installed, it should use the right one

It may take a few minutes, but *you should only have to do this once*
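
As a quick check before installing, the sketch below prints which interpreter the notebook is running and whether a package can already be imported from it. The helper name `is_installed` is made up for this example; `importlib.util.find_spec` is part of the Python standard library.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if `package_name` can be imported by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


# The interpreter that `sys.executable -m pip` will install into
print("Python executable:", sys.executable)
print("transformers installed:", is_installed("transformers"))
```

If the second line prints `False`, the install cell still needs to be run for this interpreter.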

In [2]:
import sys
!{sys.executable} -m pip install transformers

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.1-py3-none-any.whl (294 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Using the sentiment analysis pipeline

**Sentiment analysis** attempts to identify the overall feeling intended by the writer of some text

The creators of this model **trained** it on lots of examples of text that were labeled as either *positive* or *negative*

A **pipeline** is a series of steps for performing **inference**
* tokenize and preprocess the input text (more on this later)
* ask the model for a prediction
* post-process the model's result and turn it into something you can use

![full_nlp_pipeline.svg](https://github.com/ericmanley/f23-CS195NLP/blob/main/images/full_nlp_pipeline.svg?raw=1)
image source: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt
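
The post-processing step in the list above can be illustrated without downloading anything. The sketch below assumes the model has already produced two raw scores (logits) for one sentence. The logit values and the helper `postprocess` are made up for illustration, and the `id2label` mapping is an assumption about how a two-class sentiment model names its outputs. It is a simplified picture of the idea, not the library's actual implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max for numerical stability
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and report it as a label/score dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence: index 0 = NEGATIVE, index 1 = POSITIVE
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [3.2, -2.9]

print(postprocess(example_logits, id2label))
```

The returned dictionary has the same `label`/`score` shape as the pipeline results shown in this notebook.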

We *are* specifying the kind of task: `sentiment-analysis`

We *are not* asking for a specific model, so it picks one of many it has by default

The first time you do this, it will have to download the model, which can take some time depending on your network connection

In [13]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

classifier("Luckily Olli arrives there also, and he seems to know too much about everything.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9993765950202942}]

**Test it out:** Try changing the input to get different labels/scores

## Working with batches of text

To get classifications of many different examples, pass in a list of strings.

In [3]:
results = classifier(["It's really cool that you can get classifications for a whole batch of text",
                      "I wonder if the rest of the class will be this easy.",
                     "Spolier alert: it won't be."])
print(results)

[{'label': 'POSITIVE', 'score': 0.9991173148155212}, {'label': 'NEGATIVE', 'score': 0.9557349681854248}, {'label': 'NEGATIVE', 'score': 0.9962737560272217}]


Note that the results come back as a list of dictionaries, so you can manipulate it in the normal ways.

In [4]:
print("The sentence had",results[0]["label"],"sentiment, with a score of",results[0]["score"])

The sentence had POSITIVE sentiment, with a score of 0.9991173148155212
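
For example, the inputs can be paired with their results and filtered by score. The sketch below reuses the three results printed above rather than calling the model again; the 0.99 threshold and the variable names are arbitrary choices for illustration.

```python
texts = [
    "It's really cool that you can get classifications for a whole batch of text",
    "I wonder if the rest of the class will be this easy.",
    "Spolier alert: it won't be.",
]

# Results copied from the batch output above
results = [
    {"label": "POSITIVE", "score": 0.9991173148155212},
    {"label": "NEGATIVE", "score": 0.9557349681854248},
    {"label": "NEGATIVE", "score": 0.9962737560272217},
]

# zip pairs each input string with its result dictionary
for text, result in zip(texts, results):
    print(f"{result['label']:<8} {result['score']:.3f}  {text}")

# Keep only the predictions the model is very sure about
threshold = 0.99
confident = [
    (text, result["label"])
    for text, result in zip(texts, results)
    if result["score"] >= threshold
]
print(len(confident), "of", len(results), "predictions scored at least", threshold)
```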


## Exercise: Specifying a model

Now try asking for a specific model.

Replace one line of code in your earlier example.

In [5]:
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")


How is this model different from the first model?

Create a cell in this notebook and note the differences you see

In [8]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

results = classifier(["We are an exception among people. We belong to those who are not an integral part of humanity but exist only to teach the world some type of great lesson.",
                      "He extolled the achievements of Europe, especially in rational and logical thought, its progressive spirit, its leadership in science, and indeed its leadership on the path to freedom.",
                      "The Russian government saw his ideas as dangerous and unsound."])
print(results)

[{'label': 'neutral', 'score': 0.6024043560028076}, {'label': 'admiration', 'score': 0.6494354009628296}, {'label': 'neutral', 'score': 0.5005195140838623}]


## Applied Exploration

The `roberta-base-go_emotions` model is documented here: https://huggingface.co/SamLowe/roberta-base-go_emotions

Answer some questions about this:
* What is `roberta-base`? Write down some things you can learn about it from the documentation.
* What is `go_emotions`? Write down some things you can learn about it from the documentation.

Go to the Hugging Face models page: https://huggingface.co/models
* click `Text Classification`
* Try some additional models
    - test out at least one more sentiment/emotions model
    - test out at least two other kinds of models - like news topic classification or spam detection
    - write down some info about the models you found
        - what is it for?
        - who made it?
        - what kind of data was it trained on?
        - are they based on some other model and trained on new data (*fine-tuned*) for a specific task?

## What is [roberta-base](https://huggingface.co/roberta-base)?

roberta-base is a general-purpose pretrained English-language model. The model was introduced in a 2019 paper by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. It was pre-trained on raw, unlabeled text using masked language modeling. While usable in its base form to fill in masked words, the model is intended to be fine-tuned for specific tasks such as sentiment analysis.
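
To give a feel for masked language modeling, the sketch below hides a random subset of words in a sentence and records what was hidden; during pre-training, the model is asked to predict those hidden words. This is a simplification: real training works on subword tokens rather than whitespace-separated words, and the `<mask>` string, the masking rates, and the function name `mask_tokens` are assumptions made for this illustration.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with a mask token.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    masked = []
    targets = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


sentence = "the model was trained on raw unlabeled text from the web"
masked, targets = mask_tokens(sentence.split(), mask_prob=0.3)

print(" ".join(masked))
print("Tokens to predict:", targets)
```

No labels written by people are needed here: the hidden words themselves serve as the answers, which is why this kind of training can use raw text.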

## What is [go_emotions](https://huggingface.co/datasets/go_emotions)?

GoEmotions (`go_emotions`) is a dataset of approximately 58,000 English-language Reddit comments, each labeled with one or more of 27 emotion categories or neutral. The dataset was created by the following researchers at Amazon Alexa, Google Research, and Stanford: Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, Sujith Ravi. It was annotated by 3 English-speaking crowdworkers from India.

The data includes 11 features. The three primary features are `text`, `labels`, and `comment_id` (which serves as the key for the data). The data also includes the author of the text, the subreddit it was taken from, the `link_id` and `parent_id` of the text, the timestamp the text was created at, the `rater_id` of who labeled the text, and an `example_very_unclear` field for text that could not be classified. The simplified data includes a train-validate-test split of 43,410 training, 5,426 validation, and 5,427 testing examples.
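
GoEmotions is a multi-label dataset: a single comment may carry more than one emotion label. A model trained on such data is typically read differently from a two-class sentiment model, because each label gets its own independent score and every label above a cutoff is kept. The sketch below illustrates that idea with made-up logits for four of the labels; the 0.5 cutoff and the helper names are assumptions for illustration, not values taken from the model.

```python
import math


def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def multi_label_predict(logits_by_label, threshold=0.5):
    """Score each label independently and keep those at or above the threshold."""
    scores = {label: sigmoid(logit) for label, logit in logits_by_label.items()}
    kept = {label: score for label, score in scores.items() if score >= threshold}
    # Sort so the strongest label comes first
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


# Hypothetical logits for one comment (a subset of the emotion labels)
example_logits = {
    "admiration": 1.4,
    "gratitude": 0.3,
    "anger": -4.0,
    "neutral": -0.8,
}

for label, score in multi_label_predict(example_logits):
    print(f"{label}: {score:.3f}")
```

Unlike a softmax over two classes, these scores do not have to sum to 1, so zero, one, or several labels can pass the cutoff.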

# Further Sentiment Analysis

### Model: [rubert-tiny2-russian-sentiment](https://huggingface.co/seara/rubert-tiny2-russian-sentiment)
### Author: [seara](https://huggingface.co/seara)

### Description:

Seara's RuBERT-tiny2-russian-sentiment model is a sentiment analysis model tuned for short Russian texts. It is based on [RuBERT-tiny2](https://huggingface.co/cointegrated/rubert-tiny2), a general-purpose BERT-based model for Russian text. RuBERT-tiny2-russian-sentiment was fine-tuned for sentiment analysis on short texts using the Kaggle Russian News Dataset, Linis Crowd 2015, Linis Crowd 2016, RuReviews, and RuSentiment.



In [7]:
# Classifier Source: https://huggingface.co/seara/rubert-tiny2-russian-sentiment
from transformers import pipeline

text = []
sources = []

sources.append("Чаадаев")
text.append("тусклое и мрачное существование, лишённое силы и энергии, которое ничто не оживляло, кроме злодеяний, ничто не смягчало, кроме рабства. Ни пленительных воспоминаний, ни грациозных образов в памяти народа, ни мощных поучений в его предании… Мы живём одним настоящим, в самых тесных его пределах, без прошедшего и будущего, среди мёртвого застоя.")

sources.append("Соловьёв")
text.append("Смысл и достоинство любви как чувства состоит в том, что она заставляет нас действительно всем нашим существом признать за другим то безусловное центральное значение, которое, в силу эгоизма, мы ощущаем только в самих себе. Любовь важна не как одно из наших чувств, а как перенесение всего нашего жизненного интереса из себя в другое, как перестановка самого центра нашей личной жизни.")

sources.append("Ivan Svyatetsky (who is definitely a philosopher and definitely not a random user on VK)")
text.append("хорошо, что снег хоть мягкий. кот и так ошарашенный от снега")

classifier = pipeline("sentiment-analysis", model="seara/rubert-tiny2-russian-sentiment")

results = classifier(text)

# Pair each source with its result instead of indexing both lists
for source, result in zip(sources, results):
  print(f"{source}'s writing is {result['label']}.")

Чаадаев's writing is negative.
Соловьёв's writing is neutral.
Ivan Svyatetsky (who is definitely a philosopher and definitely not a random user on VK)'s writing is positive.


# Further Analysis

## Topic Classification

### Model: [Classify Title Subject](https://huggingface.co/Mingyi/classify_title_subject)
### Author: [Mingyi](https://huggingface.co/Mingyi)
### Description:
Mingyi's "Classify Title Subject" model is intended to classify the subject of content such as YouTube videos, podcasts, and articles based on the content's title. The model is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased), a general-purpose model pretrained on 104 languages. This model specializes it through further training (fine-tuning) on a mix of primarily English and Chinese labeled content titles.

In [9]:
# Classifier Source: https://huggingface.co/Mingyi/classify_title_subject
from transformers import pipeline

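# "sentiment-analysis" is an alias for the general "text-classification" task,
# so it also works for topic classifiers like this one
classifier = pipeline("sentiment-analysis", model="Mingyi/classify_title_subject")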

label_dict = {
  "LABEL_0": "Art",
  "LABEL_1": "Personal Development",
  "LABEL_2": "World",
  "LABEL_3": "Health",
  "LABEL_4": "Science",
  "LABEL_5": "Business",
  "LABEL_6": "Humanities",
  "LABEL_7": "Technology",
}

label_frequency = {
  "LABEL_0": 0,
  "LABEL_1": 0,
  "LABEL_2": 0,
  "LABEL_3": 0,
  "LABEL_4": 0,
  "LABEL_5": 0,
  "LABEL_6": 0,
  "LABEL_7": 0,
}

bookshelf = ["Pocket History of Poland", "Insects and Spiders of North America","Gender Queer","Beyond Borscht","Statistics - Expanded Edition", "Calculus - Expanded Edition","Heretics!","Berlin-Charlottenburg einst & jetzt"]

results = classifier(bookshelf)

print("\n--------------------------------------------------")
for idx in range(len(results)):
  print("\"" + bookshelf[idx] + "\" Genre: " + label_dict[results[idx]["label"]] + ".")
  label_frequency[results[idx]["label"]] = label_frequency[results[idx]["label"]] + 1
print("--------------------------------------------------")
for key in label_frequency:
    if label_frequency[key] != 0:
      print(label_dict[key] + " makes up " + str(int((label_frequency[key]/len(results))*100)) + "% of the bookshelf")
print("--------------------------------------------------")

Some layers from the model checkpoint at Mingyi/classify_title_subject were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at Mingyi/classify_title_subject.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.



--------------------------------------------------
"Pocket History of Poland" Genre: Humanities.
"Insects and Spiders of North America" Genre: Science.
"Gender Queer" Genre: Humanities.
"Beyond Borscht" Genre: Personal Development.
"Statistics - Expanded Edition" Genre: Technology.
"Calculus - Expanded Edition" Genre: Technology.
"Heretics!" Genre: Science.
"Berlin-Charlottenburg einst & jetzt" Genre: Humanities.
--------------------------------------------------
Personal Development makes up 12% of the bookshelf
Science makes up 25% of the bookshelf
Humanities makes up 37% of the bookshelf
Technology makes up 25% of the bookshelf
--------------------------------------------------
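
The same tally can also be written with `collections.Counter` from the standard library. The sketch below starts from the genre names printed above instead of calling the model again, and formats the shares to one decimal place so that 12.5% is not truncated to 12%; the variable names are illustrative.

```python
from collections import Counter

# Genres copied from the output above, in bookshelf order
genres = [
    "Humanities", "Science", "Humanities", "Personal Development",
    "Technology", "Technology", "Science", "Humanities",
]

genre_counts = Counter(genres)

# most_common() lists genres from most to least frequent
for genre, count in genre_counts.most_common():
    share = 100 * count / len(genres)
    print(f"{genre} makes up {share:.1f}% of the bookshelf")
```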


## Content Detection

### Model: [AutoNLP Gibberish Detector](https://huggingface.co/madhurjindal/autonlp-Gibberish-Detector-492513457)
### Author: [Madhurjindal](https://huggingface.co/madhurjindal)
### Description:
Madhurjindal's AutoNLP Gibberish Detector is a model trained to separate text into varying levels of gibberish. It distinguishes 4 classes: "noise" (ex. "Sul Sul, O Vwa Vwaf Sna"), "word salad" (ex. "apple eat apple"), "mild gibberish" (ex. "fork spoon fork spoon fork spoon"), and "clean" (ex. "you can understand this sentence"). The model was trained using [AutoNLP](https://autonlp.ai/), a high-level tool designed to make building AI models more accessible. The base model is DistilBERT, a distilled, lightweight version of BERT. The training data used for fine-tuning is unspecified.

In [11]:
# Classifier Source: https://huggingface.co/madhurjindal/autonlp-Gibberish-Detector-492513457
from transformers import pipeline

text = "ajdsadalksdsakdksljdklsj kdjksldjfkdsjfkdsjfkl dkjfklsjdfkdsjfkjdsfkl sfkdsjfksdjfkdsjfksj skjjkdjft. cdat rat madt pat. ksdlkjljskjds. apple eat apple. giraffffffffffffffffffffffffffffffe. jdksjdklsjdksdjflkds dklfjsdklfjsdklfj sdjfskfklsdjfl klsjdfkjdskfj sklfjkdfjskjkdljf. Sul Sul, O Vwa Vwaf Sna. this is the secret message within the gibberish. fork spoon fork spoon fork spoon. the dog the dog the dogsdhfdskfdsj."
text_list = text.split(".")
# Removes any empty strings
text_list = [sentence for sentence in text_list if sentence != ""]

classifier = pipeline("sentiment-analysis", model="madhurjindal/autonlp-Gibberish-Detector-492513457")

results = classifier(text_list)

print("---------------------------------")
print("Original Text:")
print(text)
print("---------------------------------")

print("---------------------------------")
print("Clean Text:")

for idx in range(len(results)):
  if results[idx]["label"] == "clean":
      print(text_list[idx])

print("---------------------------------")

---------------------------------
Original Text:
ajdsadalksdsakdksljdklsj kdjksldjfkdsjfkdsjfkl dkjfklsjdfkdsjfkjdsfkl sfkdsjfksdjfkdsjfksj skjjkdjft. cdat rat madt pat. ksdlkjljskjds. apple eat apple. giraffffffffffffffffffffffffffffffe. jdksjdklsjdksdjflkds dklfjsdklfjsdklfj sdjfskfklsdjfl klsjdfkjdskfj sklfjkdfjskjkdljf. Sul Sul, O Vwa Vwaf Sna. this is the secret message within the gibberish. fork spoon fork spoon fork spoon. the dog the dog the dogsdhfdskfdsj.
---------------------------------
---------------------------------
Clean Text:
 this is the secret message within the gibberish
---------------------------------
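
Note the leading space in the clean sentence: `split(".")` keeps the whitespace that followed each period. A small variation, sketched below with a shortened sample string and a made-up helper name, strips each piece and drops the ones that end up empty.

```python
def split_sentences(text):
    """Split on periods, trim whitespace, and drop empty pieces."""
    pieces = (piece.strip() for piece in text.split("."))
    return [piece for piece in pieces if piece]


sample = "cdat rat madt pat. apple eat apple. this is the secret message within the gibberish."
for sentence in split_sentences(sample):
    print(repr(sentence))
```

Splitting on periods is still a rough heuristic; it would also break on abbreviations or decimal numbers in ordinary text.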
