# Natural Language Processing
After attending the workshops by Olaf Janssen [\[1\]](#References) on Natural Language Processing (NLP), <br>
it is time to start playing with state-of-the-art pre-trained NLP models, such as BERT (and derivatives such as ALBERT and RoBERTa), GPT-2 or XLNet. <br>
<br>
The assignment is to first pick an NLP task (such as Q&A, a search engine, a chatbot, classification, etc.). <br>
Then choose a context in which this task is run (such as a game, a service, a smart application, etc.). <br>
And build it! <br>
The idea is that the student will be working with pre-trained models and that the focus is primarily on designing the right pipeline for the problem at hand. <br>
Endlessly fine-tuning the model is not a priority in this assignment; the goal is to demonstrate that the student understands how an NLP task is built up and can apply NLP techniques. <br>
The result must be an interactive prototype that tackles a specific NLP task.

# Introduction
This notebook will contain my NLP learning journey throughout the assignment. <br>
Before diving into the assignment for the workshop, I will first experiment with NLP pipelines. <br>
I do this because it helps me better understand how NLP libraries are used. <br>
Afterwards, I will explain the task I have chosen and how I plan to approach it. <br>
Lastly, I will conclude my learning journey with an interactive version of my NLP model. 

# Preparation
This section imports the libraries used throughout the notebook and fixes the random seeds for reproducible results.

In [1]:
import os
import sys

# Add the repository root (two levels above the working directory) to sys.path so other modules can be imported.
project_path = os.getcwd()
root_path = os.path.split(os.path.split(os.getcwd())[0])[0]
assert root_path.endswith("Fontys-ADS"), "The root path does not end with Fontys-ADS: " + root_path 
sys.path.insert(0, root_path)

import numpy as np
import tensorflow as tf
import transformers
transformers.logging.set_verbosity(40)  # 40 == logging.ERROR: only show errors
# set the seed for reproducible results.
np.random.seed(56)
tf.random.set_seed(56)

# optionally, let TensorFlow grow its GPU memory usage as needed instead of reserving all of it up front.
# physical_devices = tf.config.experimental.list_physical_devices('GPU')
# tf.config.experimental.set_memory_growth(physical_devices[0], True)
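
The root-path lookup in the cell above walks two directories up from the working directory. A minimal sketch of the same idea with `pathlib`, assuming the notebook lives exactly two levels below the repository root; the helper name `repo_root` and the example path are hypothetical:

```python
from pathlib import PurePosixPath


def repo_root(working_dir: str, levels_up: int = 2) -> str:
    """Return the directory `levels_up` levels above `working_dir`."""
    # parents[0] is the direct parent, parents[1] the grandparent, and so on.
    return str(PurePosixPath(working_dir).parents[levels_up - 1])


# Hypothetical notebook location, two levels below the repository root.
example_dir = "/home/user/Fontys-ADS/notebooks/nlp"
print(repo_root(example_dir))
```

In a real notebook, `Path.cwd()` would take the place of the hard-coded example path.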

Information about the installed packages:

In [2]:
print(f'transformers=={transformers.__version__}')
print(f'numpy=={np.__version__}')
print(f'tensorflow=={tf.__version__}')

transformers==3.5.0
numpy==1.19.2
tensorflow==2.3.0


# Hands-On with NLP pipelines
To experiment with NLP pipelines, I used the transformers library from Hugging Face [\[2\]](#References). <br>
It has an easy-to-use API and supports many NLP models. <br>
I will experiment with a few different NLP tasks using the pipeline API.

In [3]:
from transformers import pipeline

## Text generation pipeline
For text generation, I started by creating a pipeline with the small pre-trained GPT-2 model. <br>
But the small GPT-2 model did not produce good enough output, so I swapped it for the large model.

In [4]:
text_gen = pipeline("text-generation", model="gpt2-large")

I then let the model continue a few different prompts.

In [7]:
result = text_gen("Many moons ago", max_length=100)
display(result[0]['generated_text'])

'Many moons ago it had looked as if this island, in these waters, might belong to us. We now know, however, that what we took from the islands of the sea must have been of great value. And we were able to draw from this island a rich vein of gold. And it was known among us, in those days, as a place where the gold was in greatest demand. We did not have the means to mine it ourselves, so we sent people along the coast to find'

The pipeline produced some interesting continuations:
- <b> Providing "In a forgotten land, far far away, ", resulted in: </b> <br> 
"In a forgotten land, far far away, are the only things left: the walls from which I was to die, and the sun that shone when I slept. Therein lies a world of promise. And it I have known all"
- <b> Providing "Have you read Olaf's NLP blog post?", resulted in: </b> <br> 
"Have you read Olaf's NLP blog post? It's the most useful thing I've read on NLP, ever. (And you should read it.)\n\nAdvertisements\n\nLike this: Like Loading... Related\n\nPosted in Unc"
- <b> Providing "In a galaxy far away, the sith are preparing for", resulted in: </b> <br> 
"In a galaxy far away, the sith are preparing for a grand festival. The Sith Lord, Darth Vader, is about to make his grand return. As a reward for his master's treachery, Vader himself is given the title of Darth Vader"

Not every continuation makes sense. It also feels as if the model is just reproducing text it has read before, such as blog posts.
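
Models such as GPT-2 generate text autoregressively: each new token is chosen based on the tokens produced so far. The sketch below illustrates that loop with a toy word-level bigram table. It is not how GPT-2 works internally, and the corpus, function names and seed are made up for illustration:

```python
import random
from collections import defaultdict


def build_bigrams(corpus: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def continue_text(prompt: str, followers: dict, max_length: int, seed: int = 56) -> str:
    """Extend the prompt word by word until max_length words or a dead end."""
    rng = random.Random(seed)
    words = prompt.split()
    while len(words) < max_length:
        candidates = followers.get(words[-1])
        if not candidates:  # no known continuation for the last word
            break
        words.append(rng.choice(candidates))
    return " ".join(words)


# Tiny hypothetical corpus, only for illustration.
corpus = "many moons ago the island was rich and the island was quiet and the sea was calm"
followers = build_bigrams(corpus)
print(continue_text("many moons", followers, max_length=12))
```

A table like this can only recombine word pairs it has already seen, which is an extreme version of the "text it has read before" impression. A large language model generalizes far beyond that, but the generation loop has the same shape.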

## Question & Answer pipeline
For the question-answering task, I provide a context from which the model extracts its answers. <br>
For this task, I chose the most popular question-answering model on Hugging Face's website.

In [5]:
nlp = pipeline("question-answering", model="deepset/bert-large-uncased-whole-word-masking-squad2")

The context I provide is about a band I was listening to during the making of this notebook.

In [6]:
context = r"""
The band Æther Realm has a song called The Sun, The Moon, The Stars which is 20 minutes in length.
The Sun, The Moon, The Stars is most known for its lengthy guitar riffs and melodies.
But I also like the song called The Fool which is only 4 minutes in length.
After playing the 20 minute song, I played the rest of the album.
"""

The questions below can all be answered from the context.

In [7]:
questions = [
    "What is the name of the song which is 20 minutes in  length?",
    "What is the name of the band which has a song of 20 minutes in length?",
    "What happened after listening to The Sun, The Moon, The Star?",
    "Which song did I like?",
    "What is the name of the shortest song?",
    "What is the name of the lengthiest song?",
]
# ask every question against the same context and print the answer with its score and character offsets.
for question in questions:
    result = nlp(question=question, context=context)
    print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'The Sun, The Moon, The Stars', score: 0.9546, start: 40, end: 68
Answer: 'Æther Realm', score: 0.9729, start: 10, end: 21
Answer: 'I played the rest of the album.', score: 0.0283, start: 296, end: 327
Answer: 'The Fool', score: 0.9815, start: 218, end: 226
Answer: 'The Fool', score: 0.0494, start: 218, end: 226
Answer: 'The Sun, The Moon, The Stars', score: 0.0472, start: 40, end: 68
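
The `start` and `end` values in the output are character offsets into the context string, with `end` exclusive. A small check that slices the context with three of the recorded offsets; the offsets are copied from the output above and the variable names are illustrative:

```python
context = r"""
The band Æther Realm has a song called The Sun, The Moon, The Stars which is 20 minutes in length.
The Sun, The Moon, The Stars is most known for its lengthy guitar riffs and melodies.
But I also like the song called The Fool which is only 4 minutes in length.
After playing the 20 minute song, I played the rest of the album.
"""

# (answer, start, end) copied from the recorded pipeline output.
recorded = [
    ("Æther Realm", 10, 21),
    ("The Sun, The Moon, The Stars", 40, 68),
    ("The Fool", 218, 226),
]

for answer, start, end in recorded:
    span = context[start:end]  # plain string slicing recovers the answer text
    print(f"{start}:{end} -> {span!r}, matches answer: {span == answer}")
```

Because the context starts with a newline, the first visible character sits at offset 1 rather than 0.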


The answers to the questions below are not in the context, so the model should not be able to answer them.

In [10]:
result = nlp(question="Who is the lead singer of the band?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
result = nlp(question="What genre is this band?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'Æther Realm', score: 0.0, start: 10, end: 21
Answer: 'Æther Realm', score: 0.2439, start: 10, end: 21


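The model answered all six answerable questions correctly, although three of those answers received a score below 0.05. <br>
The two unanswerable questions were not answered correctly, as expected: the pipeline still returned 'Æther Realm', with scores of 0.0 and 0.2439. <br>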
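
Since the pipeline returns a span even when the context holds no answer, the score is the only signal for rejecting an answer. A minimal sketch of filtering on the score, using the values recorded above; the cutoff of 0.5 is an assumption chosen for illustration, not a value recommended by the library:

```python
# (kind, answer, score) tuples copied from the two recorded outputs.
recorded = [
    ("answerable", "The Sun, The Moon, The Stars", 0.9546),
    ("answerable", "Æther Realm", 0.9729),
    ("answerable", "I played the rest of the album.", 0.0283),
    ("answerable", "The Fool", 0.9815),
    ("answerable", "The Fool", 0.0494),
    ("answerable", "The Sun, The Moon, The Stars", 0.0472),
    ("unanswerable", "Æther Realm", 0.0),
    ("unanswerable", "Æther Realm", 0.2439),
]

THRESHOLD = 0.5  # assumed cutoff, for illustration only


def is_accepted(score: float, threshold: float = THRESHOLD) -> bool:
    """Treat an answer as usable only if its score reaches the threshold."""
    return score >= threshold


for kind in ("answerable", "unanswerable"):
    scores = [score for k, _, score in recorded if k == kind]
    accepted = sum(is_accepted(score) for score in scores)
    print(f"{kind}: {accepted} of {len(scores)} answers accepted")

# Compare the best score on an unanswerable question with the worst on an answerable one.
lowest_answerable = min(score for k, _, score in recorded if k == "answerable")
highest_unanswerable = max(score for k, _, score in recorded if k == "unanswerable")
print(f"overlap: {highest_unanswerable} > {lowest_answerable}")
```

With this cutoff both unanswerable questions are rejected, but so are three correct answers. The score ranges overlap (0.2439 for a wrong answer versus 0.0283 for a correct one), so no single threshold separates the two groups on this small sample. The question-answering pipeline also appears to accept a `handle_impossible_answer` argument for SQuAD 2.0 style models, which is worth checking in the documentation of the installed version.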

## Translation pipeline
For the translation pipeline, there are also many models available. <br>
I have chosen the T5 model to translate English to German. <br>

In [19]:
translator = pipeline("translation_en_to_de", model="t5-large")

In [22]:
translation = translator("My bike is stolen.", max_length=40)
translation[0]['translation_text']

'Mein Fahrrad ist gestohlen.'

I have not studied German in years, but I am fairly sure that this translation is correct.

# NLP Assignment

# Conclusion
Coming soon.

# References
\[1\] Janssen, O. (2020). NLP Introduction Overview. Olafjanssen.github.io. Retrieved from https://olafjanssen.github.io/nlp-workshop/. <br>
\[2\] Wolf, T., et al. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. Huggingface.co. Retrieved from https://huggingface.co/transformers/. <br>