# Natural Language Processing - Experiments

Natural Language Processing (NLP) is a vast field that people have worked on for decades using a variety of methodologies.

Over the last 10 years, deep learning has driven major progress in NLP, especially in the last 5 years with the arrival of **transformers**.

Explaining how these powerful tools work is beyond the scope of this notebook. But thanks to Hugging Face, we can experiment with them directly.

## Hugging Face
<div align="left" width=100%><img src="https://huggingface.co/front/assets/course-logo.svg" width=10%></div>

Hugging Face is a company of about 50 people that describes its ambition as: "Build, train and deploy state of the art models powered by the reference open source in natural language processing."

It provides tools that make pretrained language models available, including some you may have heard of: BERT, ALBERT, RoBERTa, DistilBERT, GPT, GPT-2, Transformer XL, BART, mBART, T5 ...

## Sentiment Analysis
This is an example of how an NLP application can extract meaning from text. In this case, we will classify a text as:
- positive sentiment
- negative sentiment

The system uses what is called a pretrained language model and a pretrained classifier.

First we create a classifier for sentiment analysis (the first time, a model will be downloaded). It is pretrained both on language and on sentiment classification, so we can immediately make predictions without training anything ourselves.

In [3]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

Write your own sentence and see how the classifier evaluates the sentiment. You will get a class as well as a score (how sure the model is).

In [4]:
txt = "This week at NiHub is really interesting and I am so glad I could join."

classifier(txt)

[{'label': 'POSITIVE', 'score': 0.9997544884681702}]
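
To make the `score` a bit more concrete: a classification model typically produces one raw number (a logit) per class, and a softmax turns these numbers into probabilities. The pipeline then reports the most probable class together with its probability. The sketch below reproduces only that last step in plain Python. The logit values and the helper name `to_prediction` are made up for illustration; they are not taken from the actual model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most probable class and report it in the pipeline's output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits for one sentence (not real model output).
print(to_prediction([-4.0, 4.3]))
```

A large gap between the two logits gives a score close to 1, which is why clear-cut sentences get scores such as 0.9997.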

In [6]:
txt = "I missed my train, I am so frustrated."

classifier(txt)

[{'label': 'NEGATIVE', 'score': 0.9997418522834778}]

In [7]:
txt = "I am totally lost with these AI stuff. Ways too technical for me this late in the afternoon."

classifier(txt)

[{'label': 'NEGATIVE', 'score': 0.9997511506080627}]

In [8]:
txt = "Our order book is full, our factory is running to maximum capacity. This quarter looks promising."

classifier(txt)

[{'label': 'POSITIVE', 'score': 0.997969925403595}]
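
Instead of calling the classifier once per sentence, we can pass it a list of sentences: a pipeline accepts a list of strings and returns one result per input, in the same order. The helper below is a small sketch built on that behaviour. The name `classify_batch` and the 0.9 threshold are arbitrary choices for illustration, and the classifier is passed in as an argument so that any compatible callable can be used.

```python
def classify_batch(classifier, texts, threshold=0.9):
    """Run `classifier` on several texts and pair each text with its result.

    `classifier` is any callable that maps a list of strings to a list of
    {'label': ..., 'score': ...} dicts, such as the pipeline created above.
    """
    texts = list(texts)
    predictions = classifier(texts)
    rows = []
    for text, pred in zip(texts, predictions):
        rows.append({
            "text": text,
            "label": pred["label"],
            "score": pred["score"],
            # Flag predictions the model is less sure about.
            "uncertain": pred["score"] < threshold,
        })
    return rows
```

With the sentiment classifier from this section, `classify_batch(classifier, ["I missed my train, I am so frustrated.", "This quarter looks promising."])` would return one row per sentence.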

### Chinese model
There is also a model pretrained on Chinese text that classifies a news item by topic.

In [24]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'uer/roberta-base-finetuned-chinanews-chinese'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 'sentiment-analysis' is an alias of 'text-classification'; the latter describes this topic classifier better.
cn_classification = pipeline('text-classification', model=model, tokenizer=tokenizer)

In [27]:
txt = "北京上个月召开了两会"

cn_classification(txt)

[{'label': 'mainland China politics', 'score': 0.72116619348526}]

In [28]:
txt = "近年来，部分食品和化妆品企业为追求高额利润，设计和使用层数过多、空隙率过大、成本过高的包装，将包装成本附加到消费者身上，既造成资源浪费和环境污染，又损害了消费者的合法权益。"

cn_classification(txt)

[{'label': 'financial news', 'score': 0.921950101852417}]

In [29]:
txt = "在第78届威尼斯国际电影节进行全球首映，影片斩获全场好评，映后掌声长达八分钟。"

cn_classification(txt)

[{'label': 'entertainment', 'score': 0.9383667707443237}]

## Zero-shot classification
Pretrained language models are really powerful. In some cases they even allow what is called zero-shot classification: classifying texts into categories for which the model has seen no labelled examples. This is useful because labelling text is time-consuming and often requires specific domain expertise.
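
One common way to do zero-shot classification, and as far as the Hugging Face documentation describes the one used by its `zero-shot-classification` pipeline, is to reuse a natural language inference (NLI) model. Each candidate label is turned into a hypothesis such as "This example is education." and the model scores how strongly the text entails it. The sketch below shows only the bookkeeping around that idea. The entailment scores are invented numbers, and `build_hypotheses` and `rank_labels` are hypothetical helper names.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a sentence the NLI model can judge."""
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment scores and sort labels by probability."""
    m = max(entailment_logits)
    exps = [math.exp(x - m) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

labels = ["education", "politics", "business"]
print(build_hypotheses(labels))

# Made-up entailment logits, one per hypothesis (not real model output).
for label, score in rank_labels(labels, [2.5, -0.5, 0.5]):
    print(f"{label}: {score:.2f}")
```

Because the labels are only plugged into a sentence template, we can choose any labels we like without retraining the model.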

Let's see what it gives.

First we create a classifier for zero-shot classification. The first time you create it, the pretrained model is downloaded. As the download log below shows, the largest file is about 1.6 GB.

In [9]:
classifier = pipeline("zero-shot-classification")

Downloading:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

In [10]:
txt = "This is a course about the Transformers library"
labels = ["education", "politics", "business"]

classifier(txt, candidate_labels=labels)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.844596266746521, 0.11197619885206223, 0.043427616357803345]}

The model estimates that the sentence is most probably about "education", with a confidence of 84%.

Next, we try three labels: **coding, cooking and woodworking**.

The texts are copied from related web pages.

Let's see what it gives.

In [19]:
txt_coding = 'C++ Operators are used to perform operations on variables and values. In the example below, we use the + operator to add together two values. Although the + operator is often used to add together two values, like in the example above, it can also be used to add together a variable and a value, or a variable and another variable.'
txt_woodwork = 'The most common type of biscuit joints is edge-to-edge joints. This is often used for gluing up table tops of various width boards of the same thickness, where biscuits are used along the planed long edges of the boards. To glue up a tabletop of various boards, lay out the boards side-by-side with each board\'s end grain turned in the opposite direction of that of the previous board.'
txt_cooking = 'In a large bowl, combine the beef, egg, onion, milk and bread OR cracker crumbs. Season with salt and pepper to taste and place in a lightly greased 9x5-inch loaf pan, or form into a loaf and place in a lightly greased 9x13-inch baking dish.'
labels=['coding', 'cooking', 'woodworking']

In [20]:
classifier(txt_coding, candidate_labels=labels)

{'sequence': 'C++ Operators are used to perform operations on variables and values. In the example below, we use the + operator to add together two values. Although the + operator is often used to add together two values, like in the example above, it can also be used to add together a variable and a value, or a variable and another variable.',
 'labels': ['coding', 'cooking', 'woodworking'],
 'scores': [0.9064509272575378, 0.05030824989080429, 0.04324081912636757]}

In [21]:
classifier(txt_cooking, candidate_labels=labels)

{'sequence': 'In a large bowl, combine the beef, egg, onion, milk and bread OR cracker crumbs. Season with salt and pepper to taste and place in a lightly greased 9x5-inch loaf pan, or form into a loaf and place in a lightly greased 9x13-inch baking dish.',
 'labels': ['cooking', 'coding', 'woodworking'],
 'scores': [0.9360044598579407, 0.04869146645069122, 0.015304102562367916]}

In [22]:
classifier(txt_woodwork, candidate_labels=labels)

{'sequence': "The most common type of biscuit joints is edge-to-edge joints. This is often used for gluing up table tops of various width boards of the same thickness, where biscuits are used along the planed long edges of the boards. To glue up a tabletop of various boards, lay out the boards side-by-side with each board's end grain turned in the opposite direction of that of the previous board.",
 'labels': ['woodworking', 'coding', 'cooking'],
 'scores': [0.5040045380592346, 0.30588528513908386, 0.1901102513074875]}

It seems our classifier is less confident about woodworking 😀 But it still gets it right.
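
One simple way to put a number on "less confident" is the margin between the best and the second-best score. The sketch below computes it for the three results above. The scores are copied from the outputs shown in this notebook, and `confidence_margin` is a hypothetical helper name.

```python
def confidence_margin(result):
    """Gap between the best and the second-best score of a zero-shot result."""
    scores = sorted(result["scores"], reverse=True)
    return scores[0] - scores[1]

# Labels and scores copied from the three outputs above.
results = {
    "coding": {
        "labels": ["coding", "cooking", "woodworking"],
        "scores": [0.9064509272575378, 0.05030824989080429, 0.04324081912636757],
    },
    "cooking": {
        "labels": ["cooking", "coding", "woodworking"],
        "scores": [0.9360044598579407, 0.04869146645069122, 0.015304102562367916],
    },
    "woodworking": {
        "labels": ["woodworking", "coding", "cooking"],
        "scores": [0.5040045380592346, 0.30588528513908386, 0.1901102513074875],
    },
}

for name, result in results.items():
    top = result["labels"][0]
    print(f"{name:12s} top={top:12s} margin={confidence_margin(result):.2f}")
```

A small margin, as for the woodworking text, suggests the prediction deserves a manual check.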

There are versions of this zero-shot classifier in several languages, but not yet in Chinese. The one below is for French.

In [23]:
classifier_fr = pipeline("zero-shot-classification", 
                         model="BaptisteDoyen/camembert-base-xnli")

In [30]:
txt = "L'équipe de France joue aujourd'hui au Parc des Princes"
labels = ["sport","politique","science"]

classifier_fr(txt, candidate_labels=labels)

{'sequence': "L'équipe de France joue aujourd'hui au Parc des Princes",
 'labels': ['sport', 'science', 'politique'],
 'scores': [0.6888960599899292, 0.20755235850811005, 0.10355162620544434]}

### Try your own!

In [None]:
txt = "Write your text here"
labels = ["education", "politics", "business"]

classifier(txt, candidate_labels=labels)

## Text generation

Text generation is a basic function that is useful for chatbots, advanced question-answering systems and other applications. It is not easy to generate a realistic sentence.

Here we use it in a simple way, for fun.

First we create our generator, then we start a sentence and let the generator continue it.

In [31]:
generator = pipeline("text-generation")

In [37]:
results = generator("In this worshop about AI, we will teach you how to")

print(results[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In this worshop about AI, we will teach you how to use the algorithms. We will create a class for you. I recommend you to click on the screenshot below

If you enjoy this, make sure you give us your attention! This


Sounds like the generator has had good marketing lessons!

In [49]:
txt = "I really want to get somewhere with my studies but"
length = 50
nbr_answers = 3

results = generator(txt, max_length=length, num_return_sequences=nbr_answers)
for i, r in enumerate(results):
    print(i+1, '>>>')
    print(r['generated_text'])
    print('-------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1 >>>
I really want to get somewhere with my studies but I don't want to go to college and live in Washington DC. What does that tell you about how to do things?"

So why did the DNC come out that way? That's a
-------------
2 >>>
I really want to get somewhere with my studies but now I have to go to Harvard and I can't go anywhere except New York. If there's anything we can do together, I want to."

McNaught has been called the most
-------------
3 >>>
I really want to get somewhere with my studies but in the end it doesn't really seem important.

You have just graduated from your MBA, where you studied electrical design. Which major are you looking for?

The course I'm going
-------------
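
The three continuations above differ from one another, presumably because the generator samples each next word from a probability distribution instead of always taking the single most likely word. The toy sketch below illustrates that idea with a tiny hand-written table of next-word probabilities. The table, the function name `generate` and the seed are all invented for illustration; a real language model computes these probabilities from the whole preceding text with a neural network.

```python
import random

# Hypothetical next-word probabilities (a real model learns these from data).
NEXT_WORD = {
    "we": {"will": 0.7, "can": 0.3},
    "will": {"teach": 0.5, "help": 0.3, "show": 0.2},
    "can": {"teach": 0.4, "show": 0.6},
    "teach": {"you": 1.0},
    "help": {"you": 1.0},
    "show": {"you": 1.0},
}

def generate(prompt, max_words, rng):
    """Extend `prompt` one sampled word at a time, up to `max_words` words."""
    words = prompt.split()
    if not words:
        return ""
    while len(words) < max_words:
        options = NEXT_WORD.get(words[-1])
        if not options:
            break  # no known continuation for the last word
        candidates = list(options)
        weights = list(options.values())
        # Sample the next word according to its probability.
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
for i in range(3):
    print(i + 1, ">>>", generate("we", max_words=4, rng=rng))
```

Note that `max_words` counts the prompt as well as the generated words, in the same way that `max_length` in the cells of this section limits the total length of prompt plus continuation (counted in tokens rather than words).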


You may have heard of the GPT family of models from OpenAI. Here we load DistilGPT2, a smaller, distilled version of GPT-2 that is still pretty good at text generation.

In [41]:
gpt2 = pipeline("text-generation", model="distilgpt2")

In [50]:
txt = "In this worshop, we will explain you how to"
length = 50
nbr_answers = 3
results = gpt2(txt, max_length=length, num_return_sequences=nbr_answers)
for i, r in enumerate(results):
    print(i+1, '>>>')
    print(r['generated_text'])
    print('-------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1 >>>
In this worshop, we will explain you how to get the most out of the day for your pet. By doing this, we won't get anywhere else for your pet... and we will do our best…
A very quick and simple way
-------------
2 >>>
In this worshop, we will explain you how to win against any combination of this and our decklists to begin we will provide you with something specific that can also help you in a way you can make the difference in the game.



-------------
3 >>>
In this worshop, we will explain you how to bring your ideas to the market with a new and convenient approach.


If you have any questions, feel free to ask and we will answer within an hour or so to show you how
-------------


## Conclusion

This is just a simple notebook to experiment with NLP. There is obviously a wide range of possibilities, especially when you integrate some of these tools into fine-tuned or customized systems.

Hopefully, this will give you a taste for machine learning and deep learning.