# Notebook of Max de Goede on his NLP findings
I have done some work with NLP in the past[1] [Click here to see this](https://github.com/pontiacboy/PDR_Supported/blob/master/NLP%20-%20Final%20-%20Max%20de%20Goede.ipynb), but I used techniques like Word2vec and Seq2Seq models. Now I will take the approach of looking at new techniques such as the usage of GPT2 and playing around with it.

<span style="color:blue">
    <h1>Feedback Olaf</h1>
    <br>
Hi Max, thanks for the interesting read. I really like how you let it generate text and summarize itself and then apply sentiment analysis on both input and (summarized) text.
    <h2> Secondary Mini-Application. Self reflecting Text Generator </h2>
I would really like to see this in a combined mini-application (can just be a function within the notebook). Based on an input you could let the model generate 10 texts, which you summarize and sentiment-analyse. Based on comparing the score of where the sentiment conveys best the input, or where the summary is closest to the whole generated text, you return the best generated text. A self-reflecting text generator. :-)
    <h2>Form</h2>
On form: please hand in a pdf or upload your notebook to git so I can easily see a pretty-formatted output. Reread your text, you seem to contradict yourself: you say you will do transfer learning but you don't, you say you translate to Spanish/Dutch but then do it German etc.

Otherwise nice work.
    
 *Note, changes after feedback from Olaf will be noted as blue*
 </span>.

# My idea
There are 2<span style="color:blue">(3)</span> things i would like to focus on:
1. Pipelines
- I would like to play with available pipelines from the Transformers library. To see what we can do with the various available networks. The main idea here is to get to know what these networks can do.
2. Create a tool to converse with
- I would like to create a small app that allows the user to have a conversation with a conversational NLP model. 
3. <span style="color:blue"> Self reflecting Text Generator</span>
- <span style="color:blue"> I received feedback from Olaf, that it would be a cool idea to use all the techniques I explored during my research and combine them into 1. To generate an output that matches the input given. </span>

### Imports

<span style="color:blue">Not many imports are necesarry, mainly the transformer library</span>

In [None]:
from transformers import Conversation,AutoTokenizer, AutoModelForMaskedLM, BertConfig,pipeline, set_seed
import numpy as np

## 1. Pipelines
Let's play with some of the pipelines from the Transformers[2] library, with this we can get an idea of what we can do with the new techniques within NLP. We will look at the following things:
1. Quick View
2. Text generation
3. Summarization
4. Translate 
5. Sentiment analysis


### 1.1 Quickview
We will be using GPT-2 and make a network that will be able to autofill something depending on what we type. as the input
<br>
<br>
Let's start by importing the model. Let's take the Bert Model (this is just loading the config we don't do anything with this)

In [3]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = BertConfig.from_pretrained("gpt2")

In [3]:
model

BertConfig {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "type_vocab_size": 2,
  "voc

#### Lets see how well the model performs when it tries to generate a story. Lets see what type of stories it comes up with! 

In [45]:
generator = pipeline ('text-generation',model = 'gpt2', tokenizer = tokenizer)
set_seed(42)
results = generator('Once upon a time',max_length=300, num_return_sequences=5)

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [46]:
results[0]

{'generated_text': 'Once upon a time, it was hard to have a relationship over beer. Beer, as you are talking now is the most popular drink on this planet and even the best of great ones have no effect on your physical well being (your immune system). At some point, a lot of the best and worst ones will come along and the result will be a mess and that\'s always frustrating and even if, as it was, the beer was on the plate I\'d definitely give up on it all.\n\nThe problem we are looking specifically at here is our inability to see an ounce of truth in the beer. There seems to be almost zero truth here. I mean, it\'s a good beer, it might not last as long as some of the best but it\'s not very good and, for me, the best I ever got is something named Bud Light. In between the two and maybe I\'m not quite sure. Still, as a general rule, we feel like we all have a bit of a bad ass to have around.\n\nBeer doesn\'t just work on your skin, it also works on your body and, in the case of Bud Lig

In [47]:
results[1]

{'generated_text': 'Once upon a time, those who are interested will have seen the great thing in them: One who has not yet reached the age of enlightenment, and without any knowledge of what he wishes (since the present life is such a journey) can attain true liberation, which means he can live for a lifetime. This is true in all spiritual fields. (2.) As time goes on, his true goal becomes clear, in this way he will be able to be known in all people in the world, since the one who seeks such an emancipation of life is the one who can live in the earth.\n\nIt is because of this, with the development of life and enlightenment, that one who has not yet reached the age of enlightenment, and without any knowledge of what he desires, can attain true liberation. He knows that this is necessary and can only be attained right away by all beings. (3.) This is one of those who are very fortunate in becoming able not only to speak out against ignorance on earth, but also to practice mindfulness i

In [7]:
results[2]

{'generated_text': 'Once upon a time, even when they were in the right place, the world was changing. And there were more people getting killed, and more dead in general. We didn\'t know exactly how, but the people we were sending to stop our work were taking part in the events at that point, and we started a war of attrition between the nations of the region I was trying to take over.\n\nMy military adviser, Commander-in-Chief of the Joint Chiefs of Staff Thomas L. Bolton, was an enthusiastic supporter of a war plan. He said the war needed to continue, but there needed to be an "international, regional and worldwide effort" to ensure that all the members of America\'s defense forces were ready for the war. A few years later, we reached an agreement to create our own National Security Council, to be headed by a military council composed of representatives from 50 countries. That was my idea, and that is what the United States is now.\n\nIn the second half of the twenty-first century, w

#### Quickview Conclusion
The results are pretty funny to look at and read. But of course this is not the type of proper usage we could use for the network. 
<br>
<br> 
Up next we will look at what we can do with these networks. We will be using this page from huggingface(https://huggingface.co/transformers/main_classes/pipelines.html)

## Generate text and then sumarizing it 
Lets, generate a text (about data science) and then lets see how well the text can be sumarized

### 1.2 Text generation
Here we will give the pipeline the input of a start.  We will give it the following input "Data Science is the future, especially when you look at what is possible with the NLP techniques". Let's see what kind of results we get from this

In [8]:
Story = generator('Data Science is the future, especially when you look at what is possible with the NLP techniques.',max_length=1000, num_return_sequences=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [9]:
DataScience_story = Story[0]['generated_text']
DataScience_story

"Data Science is the future, especially when you look at what is possible with the NLP techniques. While the NLP techniques have been used before by many companies, many companies have tried to leverage the NLP techniques into their products. However there are limitations and weaknesses to try and get the best results for your business.\n\nWe all know that your marketing budget is constantly changing due to business demand, customers demand, and so you might find your customers are looking for a high level of information. However, all these factors often add up and you may not be the marketer that they were hoping to be. The solution is to have information you have to fill in and then make better decisions on business changes.\n\n5. Be strategic about your SEO plans. SEO is changing all the time and it's not a new concept. SEO is very effective when it comes to identifying the best features for your pages, how your website looks and how to leverage those features to deliver the best re

It seems like it was able to come up with quite a story on how you can use NLP in a business world and what type of effects it can have lets now try and sumarize the text!

### 1.3 Sumarization
A few parameters are important in the pipeline to sumarize:
1. Model
2. Tokenizer

In [10]:
generator = pipeline ('summarization',model = 't5-base', tokenizer='t5-base')
Sumarization = generator(DataScience_story,min_length=100,max_length=300)
Sumarization = Sumarization[0]['summary_text']
Sumarization

Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Token indices sequence length is longer than the specified maximum sequence length for this model (998 > 512). Running this sequence through the model will result in indexing errors


'data science is the future, especially when you look at what is possible with the NLP techniques . there are limitations and weaknesses to try and get the best results for your business . here are some of the most effective ways you should use your NLP on your email accounts . use direct marketing, SEO, and other free tools to get the most out of your email . know the best way email to reach you and how you can get from your product to your customers .'

Lets take a look of what the raw output would look like. 

In [11]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer

model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer.encode("summarize: " + DataScience_story, return_tensors="tf", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(outputs)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


tf.Tensor(
[[    0   331  2056    19     8   647     6   902   116    25   320    44
    125    19   487    28     8   445  6892  2097     3     5   186   688
     43  1971    12 11531     8   445  6892  2097   139    70   494     3
      5   132    33 10005    11 21506    12   653    11   129     8   200
    772    21    39   268     3     5]], shape=(1, 54), dtype=int32)


### 1.4 Translate
<span style="color:blue">Initially I wanted to translate it to Dutch and Spanish, but it seems that these languages are not yet supported. It seems that Spanish and dutch are not yet supported and for this reason, we will be translating in a supported language german.</span>

In [41]:
generator = pipeline ('translation_en_to_de', model = 't5-base')
Translate = generator(Sumarization,max_length=300)
Translate

Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'translation_text': 'Datenwissenschaft ist die Zukunft, insbesondere wenn man sich anschaut, was mit den NLP-Techniken möglich ist . Es gibt Grenzen und Schwächen, um die besten Ergebnisse für Ihr Unternehmen zu erzielen . hier sind einige der effektivsten Möglichkeiten, wie Sie Ihre NLP auf Ihren E-Mail-Konten verwenden sollten . Nutzen Sie Direktmarketing, SEO und andere kostenlose Tools, um das Beste aus Ihrer E-Mail zu erhalten .'}]

Lets take a look at what the the output would look like, if we take this in an encoder/decoder basis.

In [13]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer

model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer.encode(Sumarization, return_tensors="tf")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

print(outputs)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


tf.Tensor(
[[    0 32099   331  2056    19     8   647     6   902   116    25   320
     44   125    19   487    28     8   445  6892  2097     3     5   132
     33 10005    11 21506    12   653    11   129     8   200   772    21
     39   268     3     5]], shape=(1, 40), dtype=int32)


#### Conlsusion
The translation only worked on german, and with my vague understanding of german i think the text is fairly well trasnlated. Lets now try and check the sentiment of both the sumarized and full text.

### 1.5 Sentiment analysis
Lets check the sentiment of our generated text!

In [14]:
generator = pipeline ('sentiment-analysis')
Sentimentsum = generator(Sumarization,max_length=300)
print(Sentimentsum)
Sentimentinput = generator('Data Science is the future, especially when you look at what is possible with the NLP techniques.',max_length=300)
print(Sentimentinput)

[{'label': 'NEGATIVE', 'score': 0.8584626913070679}]
[{'label': 'POSITIVE', 'score': 0.9984856247901917}]


Sadly the story as a whole could not be analized on sentiment as it was too long. But it was interesting to see that the sentiment of the summary was seen as negatiive. Even though the input text was nearly as positive as it could be.

## 2. Converse tool 
Now let's go ahead and work on the project that we wanted to make, there are a few things that are important here:
1. Creating the model that you can converse with.
- This will be left on default which is a GPT2Model, I tried with both Bert & Rubert, but got results that were not satisfying at all. (I.e The language being returned was not recognizable)
- Another part of this was finetuning the variables of the Conversation pipeline. It is important to know that 2 different variables need to be given to this model. One is the topic, so the model knows what it will be about. And the rest of the input will be inputs of the conversation. For the model, it is important to receive the whole conversation so it is aware of the conversation. 
2. The second part, is to create the loop to be able to converse with the model.
- For this, it is important that you look at the input from the topic and the input of the messages separately

### Conversational model
We will create a conversational model with which the user can interact with. This leads to some pretty funny and interesting conversations. The user first has to input a topic and then the conversation can start. The network used is Gpt2 Conversational usage.
<br>
#### How to use the tool?
1. Input a topic (the model is much better with certain topics than others). I got the feeling that it was good with small talk.
2. After the topic input, one can give have the conversation. Talk about what you want and you will be able to see the result.

In [38]:
#DeepPavlov/bert-base-cased-conversational
#DeepPavlov/rubert-base-cased-conversational
#tokenizer = 't5-base'
conversational_pipeline = pipeline("conversational")

Some weights of GPT2Model were not initialized from the model checkpoint at microsoft/DialoGPT-medium and are newly initialized: ['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.12.attn.masked_bias', 'transformer.h.13.attn.masked_bias', 'transformer.h.14.attn.masked_bias', 'transformer.h.15.attn.masked_bias', 'transformer.h.16.attn.masked_bias', 'transformer.h.17.attn.masked_bias', 'transformer.h.18.attn.masked_bias', 'transformer.h.19.attn.masked_bias', 'transformer.h.20.attn.masked_bias', 'transformer.h.21.attn.masked_bias', 'transformer.h.22.attn.masked_bias', 'transformer.h.23.attn.masked

In [50]:
def Talkwith_model(message):
    reply = conversational_pipeline(message,max_lenght = 300)
    return reply

def runtalk():
    print('Choose a topic:')
    topic = input()
    topic = Conversation(topic)
    print(conversational_pipeline(topic))
    print('About to start conversation')
    conv = True
    listofmsg = []
    while conv == True:
        msg = input()
        if msg =='stop':
            conv = False
        else:
            topic.add_user_input(msg)
            listofmsg.append(msg)
            msg_to_send = Talkwith_model(topic)
            print(msg_to_send)
    return msg_to_send

runtalk()

Choose a topic:
Hey, how are you?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 

About to start conversation
I'm great thanks for asking


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 

How old are you?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 

And where are you from?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 

Do you have a name?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 
user >> Do you have a name? 
bot >> I don't, I'm not sure where I'm from 

Is it cold in the UK?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 
user >> Do you have a name? 
bot >> I don't, I'm not sure where I'm from 
user >> Is it cold in the UK? 
bot >> It's cold in the UK 

Does it rain often?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 
user >> Do you have a name? 
bot >> I don't, I'm not sure where I'm from 
user >> Is it cold in the UK? 
bot >> It's cold in the UK 
user >> Does it rain often? 
bot >> It's cold in the UK 

COld and rainy?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 
user >> Do you have a name? 
bot >> I don't, I'm not sure where I'm from 
user >> Is it cold in the UK? 
bot >> It's cold in the UK 
user >> Does it rain often? 
bot >> It's cold in the UK 
user >> COld and rainy? 
bot >> It's cold in the UK 

It is nice talking with you


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 
user >> Do you have a name? 
bot >> I don't, I'm not sure where I'm from 
user >> Is it cold in the UK? 
bot >> It's cold in the UK 
user >> Does it rain often? 
bot >> It's cold in the UK 
user >> COld and rainy? 
bot >> It's cold in the UK 
user >> It is nice talking with you 
bot >> I'm not a bad guy 

I know, bye


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 
user >> Do you have a name? 
bot >> I don't, I'm not sure where I'm from 
user >> Is it cold in the UK? 
bot >> It's cold in the UK 
user >> Does it rain often? 
bot >> It's cold in the UK 
user >> COld and rainy? 
bot >> It's cold in the UK 
user >> It is nice talking with you 
bot >> I'm not a bad guy 
user >> I know, bye 
bot >>  

stop


Conversation id: 3e85ea20-7a80-47f2-9ac5-b066224c2f55 
user >> Hey, how are you? 
bot >> I'm good, how are you? 
user >> I'm great thanks for asking 
bot >> I'm glad to hear that. 
user >> How old are you? 
bot >> I'm 19 
user >> And where are you from? 
bot >> I'm from the UK 
user >> Do you have a name? 
bot >> I don't, I'm not sure where I'm from 
user >> Is it cold in the UK? 
bot >> It's cold in the UK 
user >> Does it rain often? 
bot >> It's cold in the UK 
user >> COld and rainy? 
bot >> It's cold in the UK 
user >> It is nice talking with you 
bot >> I'm not a bad guy 
user >> I know, bye 
bot >>  

In [52]:
runtalk()

Choose a topic:
Do you have any movie suggestions?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 

About to start conversation
I like the star wars movies


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 

Why are you not sure if you can watch them?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 

Cool, do you have a favorite movie?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Cool, do you have a favorite movie? 
bot >> I don't really have a favorite, but I like the ones that are good. 

What movies are good?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Cool, do you have a favorite movie? 
bot >> I don't really have a favorite, but I like the ones that are good. 
user >> What movies are good? 
bot >> I'm not sure, but I like the ones that are good. 

My name is max


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Cool, do you have a favorite movie? 
bot >> I don't really have a favorite, but I like the ones that are good. 
user >> What movies are good? 
bot >> I'm not sure, but I like the ones that are good. 
user >> My name is max 
bot >> I'm Max 

No i am max


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Cool, do you have a favorite movie? 
bot >> I don't really have a favorite, but I like the ones that are good. 
user >> What movies are good? 
bot >> I'm not sure, but I like the ones that are good. 
user >> My name is max 
bot >> I'm Max 
user >> No i am max 
bot >> Max 

What is your name?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Cool, do you have a favorite movie? 
bot >> I don't really have a favorite, but I like the ones that are good. 
user >> What movies are good? 
bot >> I'm not sure, but I like the ones that are good. 
user >> My name is max 
bot >> I'm Max 
user >> No i am max 
bot >> Max 
user >> What is your name? 
bot >> Max 

What is the weather like?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Cool, do you have a favorite movie? 
bot >> I don't really have a favorite, but I like the ones that are good. 
user >> What movies are good? 
bot >> I'm not sure, but I like the ones that are good. 
user >> My name is max 
bot >> I'm Max 
user >> No i am max 
bot >> Max 
user >> What is your name? 
bot >> Max 
user >> What is the weather like? 
bot >> Max 

stop


Conversation id: ad6c3c54-3b34-4007-b9c7-c3540b11b1ca 
user >> Do you have any movie suggestions? 
bot >> I'm not sure, but I'm sure there are some good ones. 
user >> I like the star wars movies 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Why are you not sure if you can watch them? 
bot >> I'm not sure if I can watch them, but I'll check them out. 
user >> Cool, do you have a favorite movie? 
bot >> I don't really have a favorite, but I like the ones that are good. 
user >> What movies are good? 
bot >> I'm not sure, but I like the ones that are good. 
user >> My name is max 
bot >> I'm Max 
user >> No i am max 
bot >> Max 
user >> What is your name? 
bot >> Max 
user >> What is the weather like? 
bot >> Max 

## <span style="color:blue">3. Self reflecting Text Generator</span>
<span style="color:blue">After recieving feedback from Olaf, he came with a good idea. To create a Self reflecting text generator. As we concluded that even though you give a positive input text, the generated text would come out as negative. But the model can generate multiple texts. So it would be a cool idea how this would turn out if we make a mini-application that could evaluate multiple texts and return the sumarization which most closely resembles the input text when it comes to sentiment.</span>
### <span style="color:blue"> Necessary actions to create mini-application</span> 
#### <span style="color:blue">1. Generating text from input</span>
- <span style="color:blue">The model which initiates a text generation receives an input text, this can be as long as the user wants it to be. A text will then be generated form the input</span>
- <span style="color:blue">**For future improvements the user can think of creating parameters which alllow the user to change the lenght of the generated text.**</span>
#### <span style="color:blue">2. Sumarize text</span>
- <span style="color:blue">Sumarization of the text</span>
- <span style="color:blue">**A possible improvement would be to let the network generate multiple sumaries and then return the one which best represents the sentimental value**</span>
#### <span style="color:blue">3. Analysing summary sentiment</span>
- <span style="color:blue">Apply sentimental analysis to the summarization of the text.</span>

In [67]:
def run_textgeneration_sentiment(inputstring):
    ########################################################################################
    # Function inputs:                                                                     #
    #                 - inputstring, start of the storyline                                #
    ########################################################################################
    
    # Generate all pipeline variables    
    txtgen = pipeline ('text-generation',model = 'gpt2', tokenizer = tokenizer)
    sumagen = pipeline ('summarization',model = 't5-base', tokenizer='t5-base')
    sentiggenerator = pipeline ('sentiment-analysis')
    Story = txtgen(inputstring,max_length=1000, num_return_sequences=10)
    i = 0
    inputsent = sentiggenerator(inputtxt,max_length=300)
    inputsentiment = inputsent[0]['score']
    print('Input string sentiment: ' + str(inputsent))
    
    # Create array's to save the output from the pipelines
    stories = []
    textsentl = []
    textsents = []
    suma = []
    
    # Loop where the values get added ot the array
    for items in Story:
        text = Story[i]['generated_text']
        i = i + 1
        stories.append(text)
        sumarization = sumagen(text,min_length=100,max_length=300)
        suma.append(sumarization[0]['summary_text'])
        textsentiment = sentiggenerator(sumarization[0]['summary_text'],max_length=300)
        textsentl.append(textsentiment[0]['label'])
        textsents.append(textsentiment[0]['score'])
        print('Story ' + str(i)+ ': ' + str(textsentiment))
    
    # Matching the closest sentimental values. 
    sentiment_matching = min(textsents, key=lambda x:abs(x-inputsentiment))
    indexwithmatch = textsents.index(sentiment_matching)
    
    # Gnerate reply
    reply = 'The closest matching text on a sentimental basis. Sentimental rate input :'+str(inputsentiment)+ ' Sentimental rate output:' + str(textsents[indexwithmatch]) +    '------------------------------------------------------------------------------------------------------------------------------------- Generated summary: ' +suma[indexwithmatch]
    return reply
run_textgeneration_sentiment('Once upon a time')

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Token indices sequence length is longer than the specified maximum sequence length for this model (1053 > 512). Running this sequence t

Input string sentiment: [{'label': 'POSITIVE', 'score': 0.9984856247901917}]


Your max_length is set to 300, but you input_length is only 271. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


Story 1: [{'label': 'POSITIVE', 'score': 0.7254397869110107}]
Story 2: [{'label': 'NEGATIVE', 'score': 0.9757229685783386}]
Story 3: [{'label': 'NEGATIVE', 'score': 0.9975986480712891}]
Story 4: [{'label': 'POSITIVE', 'score': 0.9992242455482483}]
Story 5: [{'label': 'NEGATIVE', 'score': 0.999736487865448}]
Story 6: [{'label': 'NEGATIVE', 'score': 0.9986565113067627}]
Story 7: [{'label': 'POSITIVE', 'score': 0.9966144561767578}]
Story 8: [{'label': 'NEGATIVE', 'score': 0.9887039065361023}]
Story 9: [{'label': 'NEGATIVE', 'score': 0.970182478427887}]
Story 10: [{'label': 'NEGATIVE', 'score': 0.9086717963218689}]


"The closest matching text on a sentimental basis. Sentimental rate input :0.9984856247901917 Sentimental rate output:0.9986565113067627------------------------------------------------------------------------------------------------------------------------------------- Generated summary: many frozen foods don't have a way to store all the ingredients until you first get them out of the bag . if you have lots of cheese on hand, use only 1 tbsp, or one whole egg per egg . you can also use other cheeses to your advantage, such as butter as a base for your bacon cheesecake . be sure to place all the cheese on the stove top, not the baking sheet or the oven ."

In [68]:
run_textgeneration_sentiment('Natural language processing is an interesting topic!')

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Token indices sequence length is longer than the specified maximum sequence length for this model (1020 > 512). Running this sequence t

Input string sentiment: [{'label': 'POSITIVE', 'score': 0.9984856247901917}]
Story 1: [{'label': 'NEGATIVE', 'score': 0.8107666373252869}]
Story 2: [{'label': 'NEGATIVE', 'score': 0.9921165704727173}]
Story 3: [{'label': 'POSITIVE', 'score': 0.9733794331550598}]


Your max_length is set to 300, but you input_length is only 159. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


Story 4: [{'label': 'NEGATIVE', 'score': 0.6406677961349487}]
Story 5: [{'label': 'NEGATIVE', 'score': 0.7683529853820801}]
Story 6: [{'label': 'POSITIVE', 'score': 0.809773325920105}]


Your max_length is set to 300, but you input_length is only 91. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


Story 7: [{'label': 'POSITIVE', 'score': 0.8215650320053101}]
Story 8: [{'label': 'NEGATIVE', 'score': 0.9646899700164795}]
Story 9: [{'label': 'NEGATIVE', 'score': 0.9932735562324524}]
Story 10: [{'label': 'POSITIVE', 'score': 0.9691058993339539}]


'The closest matching text on a sentimental basis. Sentimental rate input :0.9984856247901917 Sentimental rate output:0.9932735562324524------------------------------------------------------------------------------------------------------------------------------------- Generated summary: text processing is designed to give users a sense of the human language and the meaning of its words . many of these tools are not available in most countries such as the u.s. or the uk . codepredict is a tool used by several leading computer companies to create programs that can be run in a few seconds (often in seconds or minutes) on a standard machine . in this blog, we will show you how to use a free scripting tool such as CodePredict .'

## Conclusion
My conclusion will contain, my point of view on a few insights. I will focus on what differences I see between networks such as GPT2 and the older work I did and I will focus on the findings that I found from working on this.
##### 1. What has changed?
Looking at what was possible in the past (last year), to be able to use these techniques you had to import multiple libraries to work with NLP. It is extremely handy that transformers have multiple NLP pipelines and that it supports all of the available networks on huggingface.co[huggingface.co]. It makes the usage much easier, **THOUGH** I will add that finding information on how to use the pipelines or retrain them was not an easy task but once it was easy to follow the documentation. I could not find information on how to retrain an existing network. Luckily this was not the task, and such I did my best playing around with the pipelines to see what you can do with the transformers library.
##### 2. Pipeline and usage.
The usage of the pipelines is extremely handy, especially with the fact that you can choose what model to run and add a tokenizer. The library is very well documented and as such is easy to use. Playing around with it taught me how this can be of use.<br><br> 
Already seeing the conversational model it is good for small talk. But once you start specifying the conversation you can tell that the model has trouble keeping up. Especially since it tries to keep in mind what the conversation is about.<br><br>
<span style="color:blue">The mini-application can self reflect on the text it generates and can choose the text which best matches the sentimental value of the input. Seeing how an application like this can be made is quite scary. And the scary thing is not that it is possible, but how easy it is to apply. I can already get an idea on how I could generate tweets given that I give it a set input and how I could filter (for example) the positive tweets out so the bot could proceed to only post negative tweets on a topic.</span>
##### 3. The future
Seeing what Olaf showed us, it seems very scary. As the NLP network could program HTLM, just by giving it inputs of what it should look like. Just by seeing what is available already, it is impressive. And how a network can generate a decent text just by giving it an input. The question is at what point can we tell the difference? This can be a huge topic in the coming years. I think NLP is one (IF THE MOST) dangerous part of Machine Learning. As the one thing that puts us above other species is the ability to communicate in the way that we do. If a computer can invade this part of us humans, we come to possibly the biggest ethical questions that we could ever face.

# References
[1] de Goede Max (2020). Github Repo. Retrived from: https://github.com/pontiacboy/PDR_Supported/blob/master/NLP%20-%20Final%20-%20Max%20de%20Goede.ipynb<br>
[2] Wolf et al (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. Huggingface.co. Retrieved from https://huggingface.co/transformers/.