<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_01_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# T81-558: Applications of Deep Neural Networks
**Module 11: Natural Language Processing with Hugging Face**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 11 Material

* **Part 11.1: Introduction to Hugging Face** [[Video]](https://www.youtube.com/watch?v=1IHXSbz02XM&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_01_huggingface.ipynb)
* Part 11.2: Hugging Face Tokenizers [[Video]](https://www.youtube.com/watch?v=U-EGU1RyChg&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_02_tokenizers.ipynb)
* Part 11.3: Hugging Face Datasets [[Video]](https://www.youtube.com/watch?v=Mq5ODegT17M&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_03_hf_datasets.ipynb)
* Part 11.4: Training Hugging Face Models [[Video]](https://www.youtube.com/watch?v=l69ov6b7DOM&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_04_hf_train.ipynb)
* Part 11.5: What are Embedding Layers in Keras [[Video]](https://www.youtube.com/watch?v=OuNH5kT-aD0&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN&index=58) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_05_embedding.ipynb)

# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: using Google CoLab


# Part 11.1: Introduction to Hugging Face

Transformers have become a mainstay of natural language processing. This module will examine the [Hugging Face](https://huggingface.co/) Python library for natural language processing, bringing together pretrained transformers, data sets, tokenizers, and other elements. Through the Hugging Face API, you can quickly begin using sentiment analysis, entity recognition, language translation, summarization, and text generation.

Colab does not install Hugging Face by default. Whether you install Hugging Face on a local computer or use it through Colab, the following commands will install the library.





In [None]:
!pip install transformers
!pip install transformers[sentencepiece]

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found ex

Now that we have Hugging Face installed, the following sections will demonstrate how to apply Hugging Face to a variety of everyday tasks. After this introduction, the remainder of this module will take a deeper look at several specific NLP tasks applied to Hugging Face.
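
Before moving on, it can help to confirm that the install succeeded. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of Hugging Face.

```python
from importlib import metadata, util


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    if util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this name
        return None


# Check the two packages that the pip commands above are expected to provide
status = {name: installed_version(name) for name in ("transformers", "sentencepiece")}
for name, version in status.items():
    print(f"{name}: {version or 'not installed'}")
```

If either package reports `not installed`, rerun the pip commands and restart the runtime.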

## Sentiment Analysis

Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to identify the tone of written text. Passages of written text can be classified into simple binary states of positive or negative tone. More advanced sentiment analysis might classify text into additional categories: sadness, joy, love, anger, fear, or surprise.

To demonstrate sentiment analysis, we begin by loading sample text, Shakespeare's [18th sonnet](https://en.wikipedia.org/wiki/Sonnet_18), a famous poem.

In [None]:
from urllib.request import urlopen

# Read sample text, a poem
URL = "https://data.heatonresearch.com/data/t81-558/datasets/sonnet_18.txt"
f = urlopen(URL)
text = f.read().decode("utf-8")

Usually, you have to preprocess text into embeddings or other vector forms before presentation to a neural network. Hugging Face provides a pipeline that simplifies this process greatly. The pipeline allows you to pass regular Python strings to the transformers and return standard Python values. 
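
To make the "standard Python values" concrete, here is a rough sketch of the final step a classification pipeline typically performs: turning raw model scores (logits) into a label and a probability. It is an illustration only, not the library's code. The logits are invented, and the helper names `softmax` and `postprocess` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, labels):
    """Pick the most probable label and report it as a plain dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits for a two-class sentiment model (not real model output)
labels = ["NEGATIVE", "POSITIVE"]
result = postprocess([-2.0, 2.1], labels)
print(result)
```

The pipeline hides this step, along with tokenization and the model call, behind a single function call.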

We begin by loading a text-classification model. We do not specify the exact model type wanted, so Hugging Face automatically chooses a network from the Hugging Face hub named:

* distilbert-base-uncased-finetuned-sst-2-english

To specify the model to use, pass the model parameter, such as:

```
pipe = pipeline(model="roberta-large-mnli")
```

The following code loads a model pipeline and performs sentiment analysis:





In [None]:
import pandas as pd
from transformers import pipeline

classifier = pipeline("text-classification")

outputs = classifier(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.984666 |


As you can see, the model classified the poem as POSITIVE with a score of about 0.98.
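
The classifier returns a list with one dictionary per input, which is why it loads directly into a DataFrame. The sketch below reads such a result without pandas. The values are copied from the output above, and the `threshold` cutoff of 0.9 is an arbitrary assumption for illustration.

```python
# Result copied from the output above; the pipeline returns a list of dicts
outputs = [{"label": "POSITIVE", "score": 0.984666}]


def describe(result, threshold=0.9):
    """Summarize one classification result; `threshold` is an arbitrary cutoff."""
    confidence = "confident" if result["score"] >= threshold else "uncertain"
    return f"{result['label']} ({result['score']:.1%}, {confidence})"


for result in outputs:
    print(describe(result))
```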

## Entity Tagging

Entity tagging is the process that takes source text and finds parts of that text that represent entities, such as one of the following:

* Location (LOC) 
* Organization (ORG)
* Person (PER)
* Miscellaneous (MISC)

The following code requests a "named entity recognizer" (ner) and processes the specified text. As you can see, the model recognizes the person (PER) Abraham Lincoln and the location (LOC) United States.

In [None]:
text2 = "Abraham Lincoln was a president who lived in the United States."

tagger = pipeline("ner", aggregation_strategy="simple")
outputs = tagger(text2)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.998893 | Abraham Lincoln | 0 | 15 |
| 1 | LOC | 0.999651 | United States | 49 | 62 |
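
The `start` and `end` columns are character offsets into the input string, so a plain slice recovers each entity. The sketch below uses the rows from the output above (scores omitted) to rebuild the sentence with inline tags. The bracketed tag format is an arbitrary choice for illustration.

```python
text2 = "Abraham Lincoln was a president who lived in the United States."

# Entity rows copied from the output above (scores omitted)
entities = [
    {"entity_group": "PER", "word": "Abraham Lincoln", "start": 0, "end": 15},
    {"entity_group": "LOC", "word": "United States", "start": 49, "end": 62},
]

# start/end are character offsets, so slicing the source recovers each span
spans = [(e["entity_group"], text2[e["start"]:e["end"]]) for e in entities]

# Insert tags from right to left so earlier offsets stay valid as the string grows
tagged = text2
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    label = f'[{ent["word"]}|{ent["entity_group"]}]'
    tagged = tagged[:ent["start"]] + label + tagged[ent["end"]:]
print(tagged)
```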


## Question Answering

Another common task for NLP is question answering from a reference text. For this example, we will pose the question "What now shall fade?" to Hugging Face, using [Sonnet 18](https://en.wikipedia.org/wiki/Sonnet_18) as the context. The model returns the expected answer, "eternal summer."

In [None]:
reader = pipeline("question-answering")
question = "What now shall fade?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.471141 | 414 | 428 | eternal summer |
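
Models fine-tuned on SQuAD, such as the default above, are extractive: they score each token as a possible start or end of the answer and return the best-scoring span from the context. The sketch below illustrates that span selection on one line of the sonnet. It is a simplified illustration, not the pipeline's implementation. The scores are invented, and the `best_span` helper and its `max_len` limit are hypothetical.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end) token indices that maximize start score + end score."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider spans that end at or after the start, up to max_len tokens
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best


tokens = ["But", "thy", "eternal", "summer", "shall", "not", "fade"]
# Hypothetical per-token scores, invented for illustration
start_scores = [0.1, 0.2, 2.5, 0.3, 0.1, 0.0, 0.4]
end_scores = [0.0, 0.1, 0.3, 2.8, 0.2, 0.1, 0.9]

i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)
```

In the real output, `start` and `end` are character offsets into the context rather than token indices.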


## Language Translation

Language translation is yet another common task for NLP and Hugging Face. The following code translates Sonnet 18 from English into German.

In [None]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Sonnet 18 Originaltext William Shakespeare Soll ich dich mit einem Sommertag vergleichen? Du bist schöner und gemäßigter: Raue Winde schütteln die lieblichen Knospen des Mai, Und der Sommervertrag hat zu kurz ein Datum: Irgendwann zu heiß das Auge des Himmels leuchtet, Und oft ist sein Gold Teint dimm'd; Und jede faire von Fair irgendwann sinkt, Durch Zufall oder die Natur wechselnden Kurs untrimm'd; Aber dein ewiger Sommer wird nicht verblassen noch verlieren Besitz von dem Schönen du schuld; noch wird der Tod prahlen du wandert in seinem Schatten, Wenn in ewigen Linien zur Zeit wachsen: So lange die Menschen atmen oder Augen sehen können, So lange lebt dies und dies gibt dir Leben.


## Summarization

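Summarization is an NLP task that condenses a longer text into just a few sentences. The following code summarizes the opening of the Wikipedia entry for "apple." The call sets `max_length=45`, which is below the model's default `min_length` of 56, so the library prints a warning before returning the summary.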

In [None]:
text2 = """
An apple is an edible fruit produced by an apple tree (Malus domestica). 
Apple trees are cultivated worldwide and are the most widely grown species 
in the genus Malus. The tree originated in Central Asia, where its wild 
ancestor, Malus sieversii, is still found today. Apples have been grown 
for thousands of years in Asia and Europe and were brought to North America 
by European colonists. Apples have religious and mythological significance 
in many cultures, including Norse, Greek, and European Christian tradition.
"""


summarizer = pipeline("summarization")
outputs = summarizer(text2, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


Your min_length=56 must be inferior than your max_length=45.


 An apple is an edible fruit produced by an apple tree (Malus domestica) Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. Apples have religious and mythological
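
The `min_length` warning in the output above can be avoided by passing both length bounds explicitly. The sketch below builds a consistent pair of keyword arguments. The helper `length_bounds` is hypothetical, its default of 56 mirrors the value reported in the warning, and halving `max_length` as a fallback is an arbitrary choice.

```python
def length_bounds(max_length, default_min=56):
    """Return generation kwargs with min_length kept below max_length.

    default_min=56 mirrors the value reported in the warning above;
    halving max_length as a fallback is an arbitrary choice for this sketch.
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2")
    min_length = default_min if default_min < max_length else max_length // 2
    return {"min_length": min_length, "max_length": max_length}


kwargs = length_bounds(45)
print(kwargs)
# Assumed usage with the summarizer defined earlier:
# outputs = summarizer(text2, clean_up_tokenization_spaces=True, **kwargs)
```

Both `min_length` and `max_length` are measured in tokens, not characters or words.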


## Text Generation

Finally, text generation takes an input text and asks the pretrained neural network to continue it. The following example generates additional text after Sonnet 18.

In [None]:
generator = pipeline("text-generation")
outputs = generator(text, max_length=400)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sonnet 18 original text
William Shakespeare

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.itia
With those three sentences from The Merchant of Venice, it seems probable that there is something of the sort heretical on the part of most of us. As a literary criticism, there is a bit more of the sort, and we have heard (and sometimes known) things that don't actually happen because other writers are much more a