# Overview
* In the first three lessons we did a decent amount of custom work.
* In this final lesson, we first go over the parts of HuggingFace that might be most useful.
  * We will be working from https://huggingface.co/transformers/usage.html
* We will then look at named entity recognition in spaCy.
* And finally we will have a couple of concluding thoughts.

# HuggingFace Pipelines
* HuggingFace provides pipelines that let you do common tasks quickly.
* See https://huggingface.co/transformers/main_classes/pipelines.html#the-task-specific-pipelines for more information.
* For the authoritative list of available pipelines, see the source: https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines.py

## Text Generation Pipeline
You can generate text from a prompt.

In [1]:
from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("The space aliens came to me to talk about", 
                     max_length=100))


Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'The space aliens came to me to talk about their own race to whom they had come to enslave. Because they were aliens, they were more like friends, the only beings of intelligence that humans came to know. I knew that for much but it just so happened that at least one of them was a human. And not even that one could live in my life, only by way of the Force, at least not to have any sort of sense of morality or morality or self. I was also'}]
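
The pipeline returns a list with one dict per generated sequence, and `generated_text` starts with the prompt itself. Below is a minimal sketch of separating the new text from the prompt. The `continuation` helper is hypothetical (not part of the library), and the generated text is a shortened copy of the output above.

```python
# Same shape as the text-generation output above: a list of dicts,
# one per generated sequence. The text is shortened for brevity.
prompt = "The space aliens came to me to talk about"
results = [
    {
        "generated_text": "The space aliens came to me to talk about "
        "their own race to whom they had come to enslave."
    }
]


def continuation(result, prompt):
    """Return only the newly generated text (hypothetical helper)."""
    text = result["generated_text"]
    if text.startswith(prompt):
        return text[len(prompt):].lstrip()
    return text


for result in results:
    print(continuation(result, prompt))
```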


## Summarization Pipeline
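* Works best on news articles. Many summarization models were trained on the CNN/Daily Mail dataset, where the reference summaries are the bullet-point highlights that accompany each article.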

In [8]:
from transformers import pipeline

# Contrary to the instructions, the model and tokenizer must be named explicitly, as described in
# https://github.com/huggingface/transformers/issues/4504
#summarizer = pipeline("summarization")
summarizer = pipeline('summarization',
                      model='bart-large-cnn', 
                      tokenizer='bart-large-cnn')

text = """
Cats don't like to wrestle.
It would be nice if we could abandon all of this inane talk about celebrities and go back to gossiping about people that we actually know.
Amazing how worthless I would be if you sent me back in time 10,000 years. I could describe amazing technology but couldn't build any of it.
I can't believe we still have pennies. They aren't even worth picking up off the ground.
Sometimes you wake and wish the adventure dream you were having didn't have to end. I want video games with that much excitement and realism
Why isn't there an app that lets me share photos with all my social media sites? Facebook, G+, Twitter. Oh yeah, #NoProfitInFreedom
This weekend, I was able to convince my youngest son, age 7, to branch beyond chicken nuggets to Orange Chicken. #CulinaryVictory
I've found that being a fair-weather sports fan is a real time saver.
My son, age 7, was playing "restaurant," and the first thing he did was set up security cameras. #ModernWorld
I don't know why, but refactoring code is a lot of fun. #OddlySatisfying
Happiness is working on your projects.
"""

print(summarizer(text, max_length=50, min_length=30))



[{'summary_text': "This weekend, I was able to convince my youngest son, age 7, to branch beyond chicken nuggets to Orange Chicken. I don't know why, but refactoring code is a lot of fun. #OddlyS"}]
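
The summary above ends mid-hashtag (`#OddlyS`), presumably because generation stopped at the `max_length=50` limit. One way to tidy such output is to drop whatever follows the last complete sentence. This is a rough sketch, not a library feature: `trim_to_last_sentence` is a hypothetical helper, and it assumes sentences end in `.`, `!` or `?`.

```python
import re


def trim_to_last_sentence(summary):
    """Drop a trailing fragment after the last '.', '!' or '?' (hypothetical helper)."""
    # Sentence enders that are followed by whitespace or the end of the string.
    matches = list(re.finditer(r"[.!?](?=\s|$)", summary))
    if not matches:
        return summary  # no complete sentence found; leave the text alone
    return summary[: matches[-1].end()]


summary = (
    "This weekend, I was able to convince my youngest son, age 7, to branch "
    "beyond chicken nuggets to Orange Chicken. I don't know why, but "
    "refactoring code is a lot of fun. #OddlyS"
)
print(trim_to_last_sentence(summary))
```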


## Translation Pipeline
* Only a few language pairs are set up at the moment.
* For comparison, Google Translate gives "Les chats n'aiment pas lutter."


In [11]:
from transformers import pipeline

translator = pipeline("translation_en_to_fr")
print(translator("Cats don't like to wrestle.", max_length=40))



[{'translation_text': "Les chats n'aiment pas se battre."}]


## Named Entity Recognition Pipeline


In [21]:
from transformers import pipeline

nlp = pipeline("ner")

tweet = """If you write "Jane Austin", Google Docs recommends a correction' \
        'to "Austen". If you write "Austin, Texas", it doesn't. Nice."""

ners = nlp(tweet)
import pprint
pprint.pprint(ners)



[{'entity': 'I-MISC', 'index': 5, 'score': 0.8857880234718323, 'word': 'Jane'},
 {'entity': 'I-MISC', 'index': 6, 'score': 0.957772970199585, 'word': 'Austin'},
 {'entity': 'I-ORG', 'index': 9, 'score': 0.9754951000213623, 'word': 'Google'},
 {'entity': 'I-ORG', 'index': 10, 'score': 0.6128684878349304, 'word': 'Doc'},
 {'entity': 'I-ORG', 'index': 11, 'score': 0.8213635683059692, 'word': '##s'},
 {'entity': 'I-MISC',
  'index': 20,
  'score': 0.9108381867408752,
  'word': 'Austen'},
 {'entity': 'I-LOC',
  'index': 27,
  'score': 0.9868152737617493,
  'word': 'Austin'},
 {'entity': 'I-LOC', 'index': 29, 'score': 0.9850055575370789, 'word': 'Texas'}]
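
The output above reports word pieces rather than whole words: "Docs" arrives as `Doc` and `##s`, and "Jane Austin" as two separate entries. Below is a minimal sketch of merging them by hand. It assumes `##` marks a continuation piece and that consecutive `index` values with the same label belong to one entity. `group_entities` is a hypothetical helper, and the scores are omitted for brevity.

```python
# Entries copied from the pipeline output above (scores omitted).
ners = [
    {"entity": "I-MISC", "index": 5, "word": "Jane"},
    {"entity": "I-MISC", "index": 6, "word": "Austin"},
    {"entity": "I-ORG", "index": 9, "word": "Google"},
    {"entity": "I-ORG", "index": 10, "word": "Doc"},
    {"entity": "I-ORG", "index": 11, "word": "##s"},
    {"entity": "I-MISC", "index": 20, "word": "Austen"},
    {"entity": "I-LOC", "index": 27, "word": "Austin"},
    {"entity": "I-LOC", "index": 29, "word": "Texas"},
]


def group_entities(tokens):
    """Merge word pieces and adjacent same-label tokens (hypothetical helper)."""
    groups = []
    prev_index = None
    for tok in tokens:
        word = tok["word"]
        adjacent = prev_index is not None and tok["index"] == prev_index + 1
        if adjacent and word.startswith("##"):
            # Continuation piece: glue it onto the previous word.
            groups[-1]["word"] += word[2:]
        elif adjacent and groups[-1]["entity"] == tok["entity"]:
            # Next word of the same entity: join with a space.
            groups[-1]["word"] += " " + word
        else:
            groups.append({"entity": tok["entity"], "word": word})
        prev_index = tok["index"]
    return groups


for group in group_entities(ners):
    print(group["entity"], group["word"])
```

"Austin" and "Texas" stay separate here because the comma between them (index 28) was not tagged, so their indices are not consecutive.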


# spaCy Named Entity Recognition
* spaCy exposes named entities on individual tokens: https://spacy.io/api/token
* It also exposes them at the document level: https://spacy.io/usage/linguistic-features#named-entities


In [22]:
import spacy
tweet = """If you write "Jane Austin", Google Docs recommends a correction' \
        'to "Austen". If you write "Austin, Texas", it doesn't. Nice."""
nlp = spacy.load("en_core_web_sm")  # note: the small model, not the vector model used before
tokens = nlp(tweet)
for token in tokens:
    print(token, token.ent_iob_, token.ent_type_)
    
print("****** start document entities *******")

doc = nlp(tweet)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

If O 
you O 
write O 
" O 
Jane B PERSON
Austin I PERSON
" O 
, O 
Google B PERSON
Docs I PERSON
recommends O 
a O 
correction O 
' O 
         O 
' O 
to O 
" O 
Austen B WORK_OF_ART
" O 
. O 
If O 
you O 
write O 
" O 
Austin B PERSON
, O 
Texas B GPE
" O 
, O 
it O 
does O 
n't O 
. O 
Nice O 
. O 
****** start document entities *******
Jane Austin 14 25 PERSON
Google Docs 28 39 PERSON
Austen 78 84 WORK_OF_ART
Austin 101 107 PERSON
Texas 109 114 GPE
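
The `start_char` and `end_char` values are character offsets into the original string, so slicing the text with them recovers each entity. The sketch below checks this without spaCy by copying the entities printed above. The string is rebuilt piece by piece to show what the triple-quoted literal actually contains, including the stray quotes and spaces that appear as tokens in the output.

```python
from collections import defaultdict

# The same string as the triple-quoted literal above: the backslash joins
# the two lines, so the stray quotes and indentation end up in the text.
tweet = (
    "If you write \"Jane Austin\", Google Docs recommends a correction' "
    + " " * 8
    + "'to \"Austen\". If you write \"Austin, Texas\", it doesn't. Nice."
)

# (text, start_char, end_char, label) copied from the doc.ents output above.
ents = [
    ("Jane Austin", 14, 25, "PERSON"),
    ("Google Docs", 28, 39, "PERSON"),
    ("Austen", 78, 84, "WORK_OF_ART"),
    ("Austin", 101, 107, "PERSON"),
    ("Texas", 109, 114, "GPE"),
]

by_label = defaultdict(list)
for text, start, end, label in ents:
    # Slicing with the offsets gives back the entity text.
    print(repr(tweet[start:end]), label)
    by_label[label].append(tweet[start:end])

print(dict(by_label))
```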


## Training your own named entity recognizer in spaCy
* You might want to train your own named entity recognizer for your domain-specific entities. You can see how to do that here
https://spacy.io/usage/training
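
Training needs examples annotated with character offsets. Below is a minimal sketch of preparing such data, assuming the `(text, {"entities": [(start, end, label)]})` tuples used in spaCy v2 training examples. Other spaCy versions may expect a different format, so check the guide linked above. The `make_example` helper, the sentence and the `PART` label are all made up for illustration.

```python
def make_example(text, spans):
    """Build one (text, annotations) pair from (substring, label) pairs.

    Hypothetical helper: it uses the first occurrence of each substring.
    """
    entities = []
    for substring, label in spans:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        entities.append((start, start + len(substring), label))
    return (text, {"entities": entities})


# Made-up domain example with a made-up entity label.
TRAIN_DATA = [
    make_example("Refill the XJ-200 tank before the flight.", [("XJ-200", "PART")]),
]

for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(text[start:end], start, end, label)
```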

# Conclusion

## Three main ways to get practical value from NLP now:
* Find similar documents with vectors
* Classify documents by converting them to vectors
* Extract named entities of interest
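
As a reminder of the first item, finding similar documents comes down to comparing their vectors, commonly with cosine similarity. The sketch below uses tiny made-up vectors in place of real document embeddings. The document names and numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 3-dimensional "document vectors"; real ones would come from a model.
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "taxes": [0.0, 0.1, 0.9],
}

query = "cats"
ranked = sorted(
    (name for name in docs if name != query),
    key=lambda name: cosine(docs[query], docs[name]),
    reverse=True,
)
print(ranked)  # ['dogs', 'taxes']
```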

## Other Tools
We ended up covering HuggingFace and spaCy. Other tools you might want to look at:
* OpenNMT: good for translation and summarization https://opennmt.net/
* AllenNLP: more research oriented https://allennlp.org/
* Stanford NLP: (in Java) if you need to do constituency parsing (building parse trees) https://nlp.stanford.edu/software/
* fast.ai: I haven't used it but I've heard good things. See https://www.fast.ai/ and https://towardsdatascience.com/fastai-with-transformers-bert-roberta-xlnet-xlm-distilbert-4f41ee18ecb2