<a href="https://colab.research.google.com/github/jeosol/tensorflow-tutorials/blob/main/Jeff_Hinton_Chapter11_Natural_Language_Processing_with_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers
!pip install transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstall

## Sentiment Analysis

Sentiment analysis identifies the tone of written text, typically as positive or negative. More advanced sentiment analysis might classify text into additional categories: sadness, joy, love, anger, fear, or surprise.

In [12]:
from urllib.request import urlopen
# read sample text, a poem
URL = 'https://data.heatonresearch.com/data/t81-558/datasets/sonnet_18.txt'
f = urlopen(URL)
text = f.read().decode('utf-8')


In [13]:
# Text must be converted into embeddings or another vector form before it is presented to a neural network.
# Hugging Face provides a pipeline that greatly simplifies this process.

# The pipeline lets you pass regular Python strings to the transformers and get standard
# Python values back.

# We begin by loading a text classification model. We do not specify the exact model,
# so Hugging Face automatically chooses one from the Hugging Face Hub, named:
# distilbert-base-uncased-finetuned-sst-2-english

# To specify the model to use, pass the model parameter, such as:
# pipe = pipeline(model='roberta-large-mnli')

import pandas as pd 
from transformers import pipeline

classifier = pipeline('text-classification')

# We can now display the sentiment analysis results as a Pandas DataFrame
outputs = classifier(text)
pd.DataFrame(outputs)



No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


|   | label | score |
|---|---|---|
| 0 | POSITIVE | 0.984666 |
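
Relying on the default means the result depends on whichever checkpoint the library selects, as the "No model was supplied" message above indicates. The following is a minimal sketch of pinning the checkpoint by name and reading the result. The helper names `build_classifier` and `top_label` are hypothetical, and the sample result is copied from the output above rather than recomputed.

```python
# Checkpoint name taken from the pipeline's own message above.
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'


def build_classifier(model_name=MODEL_NAME):
    """Create a text-classification pipeline for an explicitly named checkpoint."""
    # Imported lazily so the rest of this sketch runs without the library installed.
    from transformers import pipeline
    return pipeline('text-classification', model=model_name)


def top_label(outputs):
    """Return (label, score) for the first result in a pipeline output list."""
    first = outputs[0]
    return first['label'], first['score']


# Sample result copied from the output above (a stand-in for a live call).
sample_outputs = [{'label': 'POSITIVE', 'score': 0.984666}]
label, score = top_label(sample_outputs)
print(f'{label}: {score:.3f}')
```

In a live session, `top_label(build_classifier()(text))` would replace the hard-coded sample.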


## Entity Tagging

Entity tagging is the process of taking source text and finding the parts of that text that represent entities, such as one of the following:
* Location (LOC)
* Organization (ORG)
* Person (PER)
* Miscellaneous (MISC)


In [15]:
# Named entity recognition
# The following code requests a 'named entity recognizer' (ner) and
# processes the specified text

text2 = 'Abraham Lincoln was a president who lived in the United States.'
tagger = pipeline('ner', aggregation_strategy='simple')
outputs = tagger(text2)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.998893 | Abraham Lincoln | 0 | 15 |
| 1 | LOC | 0.999651 | United States | 49 | 62 |
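
The `start` and `end` columns are character offsets into the input string, so each entity can be recovered by slicing the original text. The sketch below illustrates this using the rows shown above, hard-coded rather than recomputed. The helper name `spans_by_group` is hypothetical.

```python
text2 = 'Abraham Lincoln was a president who lived in the United States.'

# Rows copied from the tagger output above (scores omitted for brevity).
entities = [
    {'entity_group': 'PER', 'word': 'Abraham Lincoln', 'start': 0, 'end': 15},
    {'entity_group': 'LOC', 'word': 'United States', 'start': 49, 'end': 62},
]


def spans_by_group(text, entities):
    """Map each entity group to the substrings its offsets select in text."""
    grouped = {}
    for ent in entities:
        span = text[ent['start']:ent['end']]
        grouped.setdefault(ent['entity_group'], []).append(span)
    return grouped


grouped = spans_by_group(text2, entities)
print(grouped)  # {'PER': ['Abraham Lincoln'], 'LOC': ['United States']}
```

Slicing by offset rather than trusting the `word` column can be useful when the exact original casing and spacing of the entity matter.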


## Question Answering

Another common NLP task is answering a question from a reference text. The following code loads a question-answering model and asks it a question about the sonnet.

In [17]:
reader = pipeline('question-answering')
question = 'What now shall fade?'
outputs = reader(question=question, context=text) 
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


|   | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.471141 | 414 | 428 | eternal summer |
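
The score here is below 0.5, so an application may want to treat the answer as tentative. One possible approach, sketched below, is to accept an answer only when its score reaches a chosen threshold. The helper `accept_answer` and the threshold values are hypothetical, and the result dictionary is copied from the output above.

```python
def accept_answer(output, min_score=0.5):
    """Return the answer string if its score reaches min_score, else None."""
    if output['score'] >= min_score:
        return output['answer']
    return None


# Result copied from the reader output above.
result = {'score': 0.471141, 'start': 414, 'end': 428, 'answer': 'eternal summer'}

# start and end are character offsets into the context, so their
# difference should equal the length of the answer string.
span_length = result['end'] - result['start']
print(span_length == len(result['answer']))   # True

print(accept_answer(result))                  # None
print(accept_answer(result, min_score=0.4))   # eternal summer
```

The right threshold depends on the model and the application, so the values used here are only placeholders.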


## Language Translation

Language translation is yet another common task for NLP and Hugging Face.

In [20]:
translator = pipeline('translation_en_to_de', model='Helsinki-NLP/opus-mt-en-de')

# The following code translates Sonnet 18 from English into German
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])



Sonnet 18 Originaltext William Shakespeare Soll ich dich mit einem Sommertag vergleichen? Du bist schöner und gemäßigter: Raue Winde schütteln die lieblichen Knospen des Mai, Und der Sommervertrag hat zu kurz ein Datum: Irgendwann zu heiß das Auge des Himmels leuchtet, Und oft ist sein Gold Teint dimm'd; Und jede faire von Fair irgendwann sinkt, Durch Zufall oder die Natur wechselnden Kurs untrimm'd; Aber dein ewiger Sommer wird nicht verblassen noch verlieren Besitz von dem Schönen du schuld; noch wird der Tod prahlen du wandert in seinem Schatten, Wenn in ewigen Linien zur Zeit wachsen: So lange die Menschen atmen oder Augen sehen können, So lange lebt dies und dies gibt dir Leben.
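
The translation above comes back as a single paragraph, so the poem's line structure is lost. One possible alternative, sketched here under assumptions, is to split the text into lines and translate them as a list. The helper names `split_lines` and `translate_lines` are hypothetical. The sketch also omits `min_length`, which suits a whole document better than a single line.

```python
def split_lines(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def translate_lines(text, model_name='Helsinki-NLP/opus-mt-en-de'):
    """Translate text one line at a time and return the translated lines."""
    # Imported lazily so split_lines can be used without the library installed.
    from transformers import pipeline
    translator = pipeline('translation_en_to_de', model=model_name)
    outputs = translator(split_lines(text))
    translated = []
    for out in outputs:
        # Assumption: each item is a dict, or a one-element list holding a dict,
        # depending on the library version.
        item = out[0] if isinstance(out, list) else out
        translated.append(item['translation_text'])
    return translated


# Two lines of the sonnet, separated by a blank line, to show the splitting step.
sample = "Shall I compare thee to a summer's day?\n\nThou art more lovely and more temperate:\n"
print(split_lines(sample))
```

Calling `print('\n'.join(translate_lines(text)))` would then print one translated line per original line. Translating lines in isolation removes cross-line context, which may change the wording.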
