# A Tour of Transformer Applications

In [1]:
text = """Dear Amazon, last week I ordered an Optimius Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""
text

'Dear Amazon, last week I ordered an Optimius Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.'

## Libraries

### Pandas

In [2]:
import pandas as pd

### Transformers

#### Pipeline

Pipelines abstract away all the steps needed to convert raw text into a set of predictions from a fine-tuned model.

In [3]:
from transformers import pipeline


## Text Classification

The task here is to extract the sentiment of an input text.

### Assigning Task

The first time the following cell runs, it downloads the model weights and stores them in a local cache. Later runs load the weights from that cache instead of downloading them again, so they start immediately and show no progress bar.

By default, the ``text-classification`` pipeline uses a model that's designed for sentiment analysis, but it also supports multiclass and multilabel classification.

In [3]:
classifier = pipeline('text-classification')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
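
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so follows, reusing the model name and revision hash printed in the warning. The ``PINNED`` dictionary and the ``build_classifier`` helper are illustrative names introduced here, not part of the library.

```python
# Model name and revision are copied from the warning printed above.
PINNED = {
    "task": "text-classification",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier(spec=PINNED):
    """Create the pipeline with an explicit model and revision."""
    # Imported lazily so the spec can be inspected without loading the library.
    from transformers import pipeline

    return pipeline(spec["task"], model=spec["model"], revision=spec["revision"])


print(PINNED["model"], PINNED["revision"])
```

Calling ``build_classifier()`` should give the same classifier as the cell above, without the "No model was supplied" message.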


### Testing

Running the default classifier pipeline on the ``text`` string defined at the top yields a ``NEGATIVE`` label, which matches the complaint expressed in the letter. The score of about 0.901 is the model's confidence that the sentiment is negative.

In [5]:
outputs = classifier(text)
outputs

[{'label': 'NEGATIVE', 'score': 0.9014247059822083}]

Only the ``NEGATIVE`` label is returned because the pipeline reports just the highest-scoring label by default. The default model has two labels, so the ``POSITIVE`` score is simply 1 minus the score above.
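
The relationship between the two scores can be sketched in plain Python. This assumes a two-label model whose scores sum to 1; the ``both_labels`` helper is a hypothetical name, and the input is the output copied from the cell above.

```python
# Output copied from the classifier cell above.
outputs = [{'label': 'NEGATIVE', 'score': 0.9014247059822083}]


def both_labels(outputs, labels=("NEGATIVE", "POSITIVE")):
    """Return a score for each label of a two-label model.

    Assumes the two scores sum to 1, so the missing one is the complement.
    """
    top = outputs[0]
    other = labels[1] if top["label"] == labels[0] else labels[0]
    return {top["label"]: top["score"], other: 1.0 - top["score"]}


scores = both_labels(outputs)
print(scores)
```

Recent versions of the library are understood to accept a ``top_k`` argument on the classifier call to return every label directly; that is worth checking against the installed version.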

In [6]:
pd.DataFrame(outputs)

|   | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901425 |


## Named Entity Recognition

The task here is to identify the named entities in the text.

### Assigning Task

In NLP, real-world objects like products, places, and people are called named entities, and extracting them from text is called ``named entity recognition (NER)``.

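The task name passed to ``pipeline`` is therefore ``'ner'``. The ``aggregation_strategy='simple'`` argument asks the pipeline to group adjacent tokens that share a label into a single entity.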

In [7]:
ner_tagger = pipeline('ner', aggregation_strategy='simple')
ner_tagger

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS

<transformers.pipelines.token_classification.TokenClassificationPipeline at 0x1e8aca52e90>

### NER Tagger

The only required input is the text in which named entities are to be found.

In [8]:
outputs = ner_tagger(text)
outputs

[{'entity_group': 'ORG',
  'score': 0.8653713,
  'word': 'Amazon',
  'start': 5,
  'end': 11},
 {'entity_group': 'MISC',
  'score': 0.97458327,
  'word': 'Optimius Prime',
  'start': 36,
  'end': 50},
 {'entity_group': 'LOC',
  'score': 0.99975747,
  'word': 'Germany',
  'start': 91,
  'end': 98},
 {'entity_group': 'MISC',
  'score': 0.49259824,
  'word': 'Mega',
  'start': 209,
  'end': 213},
 {'entity_group': 'PER',
  'score': 0.59641314,
  'word': '##tron',
  'start': 213,
  'end': 217},
 {'entity_group': 'ORG',
  'score': 0.6183118,
  'word': 'Decept',
  'start': 254,
  'end': 260},
 {'entity_group': 'MISC',
  'score': 0.50714475,
  'word': '##icons',
  'start': 260,
  'end': 265},
 {'entity_group': 'MISC',
  'score': 0.74730206,
  'word': 'Megatron',
  'start': 351,
  'end': 359},
 {'entity_group': 'MISC',
  'score': 0.97488445,
  'word': 'Optimus Prime',
  'start': 368,
  'end': 381},
 {'entity_group': 'PER',
  'score': 0.8137787,
  'word': 'Bumblebee',
  'start': 503,
  'end': 5

In [9]:
pd.DataFrame(outputs)

|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.865371 | Amazon | 5 | 11 |
| 1 | MISC | 0.974583 | Optimius Prime | 36 | 50 |
| 2 | LOC | 0.999757 | Germany | 91 | 98 |
| 3 | MISC | 0.492598 | Mega | 209 | 213 |
| 4 | PER | 0.596413 | ##tron | 213 | 217 |
| 5 | ORG | 0.618312 | Decept | 254 | 260 |
| 6 | MISC | 0.507145 | ##icons | 260 | 265 |
| 7 | MISC | 0.747302 | Megatron | 351 | 359 |
| 8 | MISC | 0.974884 | Optimus Prime | 368 | 381 |
| 9 | PER | 0.813779 | Bumblebee | 503 | 512 |
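
In the output above, ``Mega``/``##tron`` and ``Decept``/``##icons`` stay separate because the two word pieces of each name received different labels. One possible post-processing step, sketched here as an assumption rather than as anything the library does, is to merge groups whose character spans touch. The ``merge_adjacent`` helper is hypothetical, and keeping the label of the more confident piece is an arbitrary choice.

```python
def merge_adjacent(entities):
    """Merge entity groups whose character spans touch (end == next start)."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if prev is not None and ent["start"] == prev["end"]:
            word = ent["word"]
            # Drop the '##' marker that flags a word-piece continuation.
            prev["word"] += word[2:] if word.startswith("##") else word
            prev["end"] = ent["end"]
            # Assumption: trust the label of the more confident piece.
            if ent["score"] > prev["score"]:
                prev["entity_group"] = ent["entity_group"]
                prev["score"] = ent["score"]
        else:
            merged.append(dict(ent))  # copy so the input is left untouched
    return merged


# Four rows copied from the output above.
rows = [
    {"entity_group": "MISC", "score": 0.492598, "word": "Mega", "start": 209, "end": 213},
    {"entity_group": "PER", "score": 0.596413, "word": "##tron", "start": 213, "end": 217},
    {"entity_group": "ORG", "score": 0.618312, "word": "Decept", "start": 254, "end": 260},
    {"entity_group": "MISC", "score": 0.507145, "word": "##icons", "start": 260, "end": 265},
]
merged = merge_adjacent(rows)
for ent in merged:
    print(ent["word"], ent["entity_group"], ent["start"], ent["end"])
```

The pipeline's other aggregation strategies (``'first'``, ``'average'``, ``'max'``) are designed to resolve label disagreements within a word and may be a better fit than a hand-written merge.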


In [13]:
chat_gpt_answer = """
Improving predictions made using predictive analytics models involves several strategies and considerations. Here are some key approaches:

Data Quality and Quantity: Ensure that your data is clean, relevant, and sufficient for training the model. High-quality data leads to more accurate predictions. Additionally, having more data can help the model learn more complex patterns and relationships.

Feature Selection and Engineering: Identify the most important features that contribute to the prediction task and focus on those. Feature engineering involves creating new features or transforming existing ones to better represent the underlying patterns in the data.

Model Selection and Tuning: Experiment with different algorithms and model architectures to find the one that best fits your data and problem. Additionally, fine-tune hyperparameters to optimize the model's performance.

Cross-Validation: Use techniques like k-fold cross-validation to assess the model's performance on different subsets of the data. This helps ensure that the model generalizes well to unseen data.

Ensemble Methods: Combine predictions from multiple models to improve accuracy and robustness. Techniques like bagging, boosting, and stacking can help create more powerful predictive models.

Regularization: Prevent overfitting by applying regularization techniques such as L1 and L2 regularization, dropout, or early stopping. These techniques help prevent the model from memorizing noise in the training data.

Model Interpretability: Aim for models that are not only accurate but also interpretable. This allows stakeholders to understand why the model makes certain predictions, which can lead to more trust and easier implementation.

Domain Knowledge Integration: Incorporate domain knowledge into the model-building process. Subject matter experts can provide valuable insights that may not be apparent from the data alone.

Continuous Monitoring and Updating: Regularly monitor the model's performance in production and update it as needed. Data distributions may change over time, requiring the model to adapt to new patterns.

Feedback Loop: Establish a feedback loop where predictions are evaluated against ground truth outcomes, and the model is continuously refined based on this feedback. This iterative process helps improve the model's accuracy over time.

By following these strategies and continuously refining the predictive analytics process, you can improve the accuracy and reliability of your predictions.
""".replace('\n', '')

In [14]:
chat_gpt_answer_outputs = ner_tagger(chat_gpt_answer)
pd.DataFrame(chat_gpt_answer_outputs)

In [15]:
chat_gpt_answer_outputs

[]

In [16]:
alternative_text = """I am not getting much on the subject of Data Science. What can be the problem? Does it not recognize things like Predictive Analytics?"""

In [17]:
alternative_text_outputs = ner_tagger(alternative_text)
pd.DataFrame(alternative_text_outputs)

|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.617478 | Data Science | 40 | 52 |
| 1 | MISC | 0.511884 | Pre | 113 | 116 |
| 2 | MISC | 0.67387 | Analytics | 124 | 133 |


In [18]:
bert_definition = """BERT stands for “Bidirectional Encoder Representation with Transformers”. To put it in simple words BERT extracts patterns or representations from the data or word embeddings by passing it through an encoder. The encoder itself is a transformer architecture that is stacked together. It is a bidirectional transformer which means that during training it considers the context from both left and right of the vocabulary to extract patterns or representations. """

In [19]:
bert_definition_outputs = ner_tagger(bert_definition)
pd.DataFrame(bert_definition_outputs)

|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.599849 | BER | 0 | 3 |
| 1 | ORG | 0.360904 | Transformers | 59 | 71 |
| 2 | ORG | 0.916962 | BER | 100 | 103 |


### Conclusion

My observations are as follows:
1. The output for the original text was very good and directly usable for this kind of task.
2. The tagger returned no entities at all for the long ChatGPT answer, and it handled unfamiliar terms such as ``Predictive Analytics`` and ``BERT`` poorly, splitting them into word pieces, often with low scores.
3. The default model is trained on the CoNLL-2003 labels: ORG (organization), LOC (location), PER (person), and MISC (miscellaneous). Recognizing named entities from a different domain would require a further round of training on data from that domain.
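
Given the low scores on the unfamiliar terms, one cheap safeguard is to drop low-confidence entities before using them. The sketch below is only an illustration: the ``confident`` helper is a hypothetical name and the 0.7 threshold is an arbitrary assumption that would need tuning on real data.

```python
def confident(entities, threshold=0.7):
    """Keep only entities whose score reaches the threshold."""
    return [ent for ent in entities if ent["score"] >= threshold]


# Rows copied from the BERT definition output above.
bert_rows = [
    {"entity_group": "ORG", "score": 0.599849, "word": "BER", "start": 0, "end": 3},
    {"entity_group": "ORG", "score": 0.360904, "word": "Transformers", "start": 59, "end": 71},
    {"entity_group": "ORG", "score": 0.916962, "word": "BER", "start": 100, "end": 103},
]
kept = confident(bert_rows)
print(kept)
```

Note that the surviving entity is still the truncated ``BER``, so a threshold removes noise but does not repair a wrong span.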

## Question Answering

The task here is to get answers to direct questions from a text.

### Assigning Task

In question answering, we provide the model with a passage of text called the context, along with a question whose answer we'd like to extract. The model then returns a span of text corresponding to the answer.

In [20]:
reader = pipeline('question-answering')
reader

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x1e8aeca07d0>

### Reader

In [21]:
question = 'What does the customer want?'

In [22]:
outputs = reader(question=question, context=text)
outputs

{'score': 0.590901255607605,
 'start': 336,
 'end': 359,
 'answer': 'an exchange of Megatron'}

In [23]:
pd.DataFrame([outputs])

|   | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.590901 | 336 | 359 | an exchange of Megatron |
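
The ``start`` and ``end`` values are character offsets into the context, so slicing the context with them recovers the answer. The sketch below shows the idea on one sentence of the letter. The ``show_span`` helper is a hypothetical name, and the result dictionary is built by hand with ``str.find`` to mimic the reader's output rather than produced by the model.

```python
def show_span(context, result, window=20):
    """Return the answer in brackets with a little surrounding context."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - window):start]
    right = context[end:end + window]
    return f"...{left}[{context[start:end]}]{right}..."


context = (
    "To resolve the issue, I demand an exchange of Megatron "
    "for the Optimus Prime figure I ordered."
)
answer = "an exchange of Megatron"

# Hand-built stand-in with the same keys the reader returns.
start = context.find(answer)
result = {"answer": answer, "start": start, "end": start + len(answer)}

highlighted = show_span(context, result)
print(highlighted)
```

With the real objects from this notebook, ``text[outputs['start']:outputs['end']]`` should equal ``outputs['answer']``.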


In [24]:
chat_gpt_answer

"Improving predictions made using predictive analytics models involves several strategies and considerations. Here are some key approaches:Data Quality and Quantity: Ensure that your data is clean, relevant, and sufficient for training the model. High-quality data leads to more accurate predictions. Additionally, having more data can help the model learn more complex patterns and relationships.Feature Selection and Engineering: Identify the most important features that contribute to the prediction task and focus on those. Feature engineering involves creating new features or transforming existing ones to better represent the underlying patterns in the data.Model Selection and Tuning: Experiment with different algorithms and model architectures to find the one that best fits your data and problem. Additionally, fine-tune hyperparameters to optimize the model's performance.Cross-Validation: Use techniques like k-fold cross-validation to assess the model's performance on different subsets

In [25]:
question = 'How can feature selection improve predictions?'

In [26]:
chat_gpt_outputs = reader(question=question, context=chat_gpt_answer)
chat_gpt_outputs

{'score': 0.6316820383071899,
 'start': 2340,
 'end': 2428,
 'answer': 'By following these strategies and continuously refining the predictive analytics process'}

In [27]:
pd.DataFrame([chat_gpt_outputs])

|   | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631682 | 2340 | 2428 | By following these strategies and continuously... |


In [28]:
bert_definition

'BERT stands for “Bidirectional Encoder Representation with Transformers”. To put it in simple words BERT extracts patterns or representations from the data or word embeddings by passing it through an encoder. The encoder itself is a transformer architecture that is stacked together. It is a bidirectional transformer which means that during training it considers the context from both left and right of the vocabulary to extract patterns or representations. '

In [29]:
bert_question = 'What is unique about BERT?'

In [30]:
bert_answer = reader(question=bert_question, context=bert_definition)
bert_answer

{'score': 0.07160419970750809,
 'start': 105,
 'end': 174,
 'answer': 'extracts patterns or representations from the data or word embeddings'}

### Conclusions

The following conclusions can be drawn:

1. The default model works well for general-language questions and short queries over a small context.

2. The reader returns the start and end character positions of the span from which the answer was extracted.

3. The answer is a span copied from the text rather than a well-formed English sentence.

4. Answer quality drops as the context grows or as the answer becomes spread out across the context. For the ChatGPT answer the reader picked the closing sentence, and for the BERT definition the score was only about 0.07.

5. The book's authors call this **extractive question answering** because the answer is extracted directly from the text.

6. It is still useful when the same question is asked of many contexts with similar answers, as with the original letter: the extracted answers help route each issue to the right department.

## Summarization

The goal of text summarization is to take a long text as input and generate a short version with all the relevant facts. This requires the model to generate coherent text.

### Assigning Task

The summarization pipeline is created by passing the task name ``'summarization'`` to ``pipeline``:

In [31]:
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.

### Summarizer

Let us run the summarizer on the original text first and then on the ChatGPT answer.

#### Book Example

In [32]:
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
outputs

Your min_length=56 must be inferior than your max_length=45.


[{'summary_text': ' Bumblebee demands an exchange of Megatron for the Optimus Prime figure he ordered. The Decepticons are a lifelong enemy of the Decepticon, BumbleBee writes to Amazon.com.com.'}]

#### Predictive Analytics Example

In [34]:
outputs = summarizer(chat_gpt_answer, max_length=500, clean_up_tokenization_spaces=True)
outputs

Your max_length is set to 500, but your input_length is only 445. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=222)


[{'summary_text': ' High-quality data leads to more accurate predictions, and more data can help the model learn more complex patterns and relationships. Feature engineering involves creating new features or transforming existing ones to better represent the underlying patterns in the data. Experiment with different algorithms and model architectures to find the one that best fits your data and problem.'}]
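
Both runs above printed a length warning: in the first, ``min_length`` exceeded ``max_length``; in the second, ``max_length`` exceeded the input length. A minimal sketch of deriving consistent bounds from the input length follows. The ``summary_length_bounds`` helper and its default ratios are assumptions made for illustration, and ``input_length`` is the token count that the warning reports.

```python
def summary_length_bounds(input_length, max_ratio=0.5, min_ratio=0.25):
    """Pick (min_length, max_length) so that min < max < input length.

    The ratios are illustrative defaults, not values from the library.
    """
    max_length = max(2, int(input_length * max_ratio))
    min_length = max(1, min(int(input_length * min_ratio), max_length - 1))
    return min_length, max_length


# 445 is the input_length reported in the warning above.
bounds = summary_length_bounds(445)
print(bounds)
```

In the notebook the result might be passed along as ``summarizer(chat_gpt_answer, min_length=bounds[0], max_length=bounds[1])``. The upper bound matches the ``max_length=222`` that the warning itself suggests.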

### Conclusion

The following conclusions can be drawn:

1. The summaries contain some stray words (for example ``Amazon.com.com``), but the important points come through clearly. The default model is easy to use and seems small enough for an ordinary app to download.

2. The size and scope of the summary seem limited, but good-quality input does in fact lead to good-quality output.

3. The named entities within this small scope were identified well and placed sensibly in the summaries.

## Translation

Translation is a task whose output is generated text: the original text rendered from its source language into the requested target language.

### Assigning Task

A translation pipeline needs both the source and the target language. Here the task name states the two languages, and a model is specified explicitly as well.

In [35]:
translator = pipeline('translation_en_to_de', model='Helsinki-NLP/opus-mt-en-de')


In [36]:
!pip install sacremoses

Defaulting to user installation because normal site-packages is not writeable
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)



### Translator

The translator takes the text to translate, along with optional arguments such as ``clean_up_tokenization_spaces`` or ``min_length``:

#### Book Example

In [39]:
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
outputs

[{'translation_text': 'Sehr geehrter Amazon, letzte Woche habe ich eine Optimius Prime Action Figur von Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, dass Sie mein Dilemma verstehen können. Um das Problem zu lösen, fordere ich einen Austausch von Megatron für die Optimus Prime Figur, die ich bestellt hatte.'}]

This looks like a good enough translation, and according to the book's author it should be. Not being a German speaker myself, I cannot comment further, so I will try Hindi instead.

#### One in Hindi

In [40]:
hindi_translator = pipeline('translation_en_to_hi', model='Helsinki-NLP/opus-mt-en-hi')


In [41]:
outputs = hindi_translator(text, clean_up_tokenization_spaces=True, min_length=100)
outputs

[{'translation_text': 'प्रिय एमिला, पिछले हफ्ते मैंने जर्मनी में आपके ऑनलाइन स्टोर से एक ऑप्टिमिमिमीय कार्रवाई की घोषणा की. दुःख की बात है, जब मैंने पैकेज खोला था, तब मुझे लगा कि मैं मेगॉरॉन का एक कार्य आकृति भेजा गया था! इसके बजाय, मैं आशा करता हूँ कि आप मेरी दुविधा को समझ सकते हैं. मुझे लगता है कि आप मेरी दुविधाओं को बदलने की मांग कर रहे हैं. मुझे इस तस्वीर के बारे में जो मैंने कहा था, उसकी बहुत जल्द आप सुन सकते हैं.'}]

This may simply reflect the scarcity of English-to-Hindi training data, but it is a horrible translation.

In [42]:
hindi_translator = pipeline('translation_en_to_hi', model='Karn07/engilsh_to_hindi_translation')


In [46]:
outputs = hindi_translator(text, clean_up_tokenization_spaces=True, min_length=4000)
outputs

Your input_length: 121 is bigger than 0.9 * max_length: 20. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


[{'translation_text': ',         '}]

This alternative model does not produce a usable translation: the output is only a comma followed by whitespace.
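
One thing stands out in the outputs of this section: the German translation stops at "bestellt hatte." and omits the closing sentences of the letter, and the last run printed a warning about ``max_length``. If generation length limits are the cause, which is an assumption here, a possible workaround is to translate one sentence at a time. The sketch below uses a naive regex splitter and a stand-in translation function so that it runs without a model; ``split_sentences`` and ``translate_by_sentence`` are hypothetical helpers.

```python
import re


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_by_sentence(translate, text):
    """Apply `translate` (any str -> str callable) to each sentence."""
    return " ".join(translate(sentence) for sentence in split_sentences(text))


sample = "I expect to hear from you soon. Sincerely, Bumblebee."
sentences = split_sentences(sample)

# str.upper stands in for a real translation function in this sketch.
shouted = translate_by_sentence(str.upper, sample)
print(sentences)
print(shouted)
```

With the pipeline from this section, the callable could be something like ``lambda s: translator(s)[0]['translation_text']``. Translating sentences in isolation loses cross-sentence context, so the quality may differ from a single pass.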

### Conclusion

Here are a few of my observations:

1. Not knowing German, I cannot comment on the quality of the output from the book example.

2. The Hindi translation leaves a lot to be desired, although I can imagine worse. I attribute this to a general lack of Hindi text and of English-to-Hindi translations on the topic. I also believe the low quality stems from pretrained Hindi models being practically nonexistent.

3. The situation is likely to be even worse for rarer languages.

## Text Generation

Text generation works like autocomplete: given a prompt, the model generates text that continues it.

### Assigning Task

The generator is built by passing the task name ``'text-generation'`` to the ``pipeline`` function.

In [47]:
generator = pipeline('text-generation')

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

### Generator

#### Book Example

The cell above loads GPT-2, a model already well known for the quality of its output. Even so, a prompt has to be constructed before it can produce anything useful.

In [53]:
response = 'Dear Bumblebee, I am sorry to hear that your order was mixed up.'
prompt = text + '\n\nCustomer service response:\n' + response
outputs = generator(prompt, max_length=200)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
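
The cell above stores the result without displaying it. The text-generation pipeline returns a list of dictionaries with a ``generated_text`` key that, by default, contains the prompt followed by the continuation. A small sketch for keeping only the new text follows. The ``continuation`` helper is a hypothetical name, and the sample output is made up for illustration, not produced by GPT-2.

```python
def continuation(prompt, outputs):
    """Return the generated text with the leading prompt removed."""
    generated = outputs[0]["generated_text"]
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated


prompt = (
    "Customer service response:\n"
    "Dear Bumblebee, I am sorry to hear that your order was mixed up."
)

# Made-up stand-in with the same shape as the pipeline's return value.
fake_outputs = [{"generated_text": prompt + " We will send a replacement."}]

reply = continuation(prompt, fake_outputs)
print(reply)
```

In the notebook, ``print(outputs[0]['generated_text'])`` shows the full text, and ``continuation(prompt, outputs)`` would show only what the model added.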
