## 💻 UnpackAI DL201 Bootcamp - Week 3 - NLP pipelines

### 📕 Learning Objectives

* Get working examples that cover the main NLP tasks
* Learn about Hugging Face and the strength of its pre-trained models and all-in-one pipelines

### 📖 Concepts map

* Pipeline

In [28]:
# Imports (to install missing packages quietly, run e.g. "!pip install -Uqq transformers pandas" first, once you are sure there is no dependency error)

from transformers import pipeline
import pandas as pd

# Part 1. Introduction

This notebook gives working examples meant to inspire you to build your own AI project.

Hugging Face provides all-in-one ***pipelines*** that bundle the main steps of an NLP task (see https://huggingface.co/course/chapter2/2?fw=pt):
* choosing a pre-trained model
* adapting the input text to this model (tokenization, vectorization)
* running the model on the transformed input data
* converting the model output into something readable by humans (e.g. de-tokenization, to turn an output vector or numbers back into text)
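
The four steps above can be sketched with a toy, pure-Python stand-in. This is only an illustration of the flow, not how Hugging Face implements it: the vocabulary, the word-counting "model", and every function name below are made up for this sketch.

```python
import math

# Toy vocabulary: purely illustrative, not a real tokenizer vocabulary.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return text.lower().split()


def encode(tokens):
    """Map each token to an integer id; unknown words become [UNK]."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]


def toy_model(token_ids):
    """Stand-in for a neural network: one raw score (logit) per label."""
    positive = sum(1.0 for i in token_ids if ID_TO_TOKEN[i] == "love")
    negative = sum(1.0 for i in token_ids if ID_TO_TOKEN[i] == "hate")
    return [negative, positive]


def postprocess(logits):
    """Turn logits into a readable label and a probability (softmax)."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the steps: tokenize -> encode -> model -> postprocess."""
    return postprocess(toy_model(encode(tokenize(text))))


result = toy_pipeline("I love this course")
print(result)
```

A real pipeline follows the same shape, with a trained tokenizer and a trained model in place of the toy functions.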

Once the pipeline works, you can tune it step by step, the way one would modify a car for a race.
For example, you can:
* fine-tune the model or train it from scratch (instead of using the pre-trained model as is)
* use your own tokenizer (instead of the default one)
* clean the training data before feeding it to the model
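
As an illustration of the last point, here is a minimal cleaning sketch using only the standard library. Which cleaning steps are appropriate depends on your data; the function name `clean_text` and the three steps chosen here are assumptions for the example, not a recommended recipe.

```python
import html
import re


def clean_text(raw):
    """Unescape HTML entities, drop tags, and collapse whitespace."""
    text = html.unescape(raw)              # "&amp;" -> "&"
    text = re.sub(r"<[^>]+>", " ", text)   # replace HTML tags with a space
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


cleaned = clean_text("<p>Great   course,\n really &amp; truly!</p>")
print(cleaned)
```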


# Part 2. Example of question answering

In [2]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)
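
The log message above shows that calling `pipeline` without a model falls back to a default checkpoint. A minimal sketch of naming the checkpoint explicitly through the `model` argument, so the choice is visible in the code: `DEFAULT_CHECKPOINTS` and `build_pipeline` are hypothetical names, and the checkpoint names are the defaults reported in this notebook's log messages.

```python
# Default checkpoints reported in this notebook's log output, keyed by task.
DEFAULT_CHECKPOINTS = {
    "question-answering": "distilbert-base-cased-distilled-squad",
    "sentiment-analysis": "distilbert-base-uncased-finetuned-sst-2-english",
    "text-generation": "gpt2",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
}


def build_pipeline(task, checkpoint=None):
    """Create a pipeline with an explicit checkpoint (hypothetical helper)."""
    # Imported lazily so the mapping can be inspected without transformers.
    from transformers import pipeline

    return pipeline(task, model=checkpoint or DEFAULT_CHECKPOINTS[task])
```

With this helper, `build_pipeline("question-answering")` would be expected to load the same model as the cell above, without the "No model was supplied" message.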


In [3]:
my_answer = question_answerer(
    question="Where do I work?",
    context="My name is John and I work at unpackAI in Beijing."
)

In [12]:
my_answer

{'score': 0.5068178772926331, 'start': 30, 'end': 38, 'answer': 'unpackAI'}
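
The `start` and `end` values in this output are character offsets into the context string. A short check, using the result copied from the output above (the `highlight` helper is a hypothetical name added for illustration):

```python
context = "My name is John and I work at unpackAI in Beijing."
# Result copied from the notebook output above.
result = {'score': 0.5068178772926331, 'start': 30, 'end': 38, 'answer': 'unpackAI'}

# Slicing the context with the offsets recovers the answer text.
span = context[result['start']:result['end']]


def highlight(context, result):
    """Return the context with the answer span wrapped in brackets."""
    start, end = result['start'], result['end']
    return context[:start] + '[' + context[start:end] + ']' + context[end:]


print(span)
print(highlight(context, result))
```

The offsets are handy when you want to show the answer in place rather than on its own.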


In [11]:
my_answer['answer']

'unpackAI'

In [6]:
def my_question_answerer(my_question, my_context):
    """Return only the answer text for a question about the given context."""
    complete_answer = question_answerer(
        question=my_question,
        context=my_context
    )
    return complete_answer['answer']

In [11]:
context = 'My name is James and I work in Shenzhen'
question = 'Where do you work?'

In [12]:
my_question_answerer(question,context)

'Shenzhen'
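
A wrapper like `my_question_answerer` discards the `score`, so a low-confidence answer looks the same as a confident one. A possible variant, sketched here on a plain result dictionary so it runs without a model: `answer_or_none` and the `min_score` threshold are hypothetical, and the right threshold depends on your use case.

```python
def answer_or_none(result, min_score=0.5):
    """Return the answer text, or None when the model's score is too low."""
    if result['score'] >= min_score:
        return result['answer']
    return None


# Result copied from the first question-answering output in this notebook.
result = {'score': 0.5068178772926331, 'start': 30, 'end': 38, 'answer': 'unpackAI'}

print(answer_or_none(result))                 # default threshold of 0.5
print(answer_or_none(result, min_score=0.8))  # stricter threshold
```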

# Part 3. Example of Sentiment Analysis

In [31]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [81]:
def sentiment_analysis_pipeline(text):
    """Return a signed sentiment score for one piece of text.

    The text is cut to its first 512 characters before classification.
    This is a crude guard: the model's limit is 512 tokens, not
    characters, so a character cut only approximates it.

    Args:
        text (str): Text to classify, usually one cell of a pandas
            DataFrame column.

    Returns:
        float: A value between -1 and 1. Negative values mean a
            negative review and positive values a positive one.
            The closer the value is to -1 or 1, the more confident
            the model is in that label.
    """
    char_limit = 512
    output = classifier(text[:char_limit])[0]
    if output['label'] == 'NEGATIVE':
        return -output['score']
    return output['score']
    

In [14]:
sentence_list = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

In [15]:
my_answer = classifier(sentence_list)

In [46]:
my_answer[0]

{'label': 'POSITIVE', 'score': 0.9598047137260437}

In [17]:
for item in my_answer:
    print(item['label'])

POSITIVE
NEGATIVE
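
The signed-score convention used by `sentiment_analysis_pipeline` can be checked without loading a model by applying it to plain prediction dictionaries. In this sketch, `signed_score` is a hypothetical helper; the positive example is copied from the output above, while the negative score is a made-up value, since the notebook does not show the real one.

```python
def signed_score(prediction):
    """Map a {'label', 'score'} dict to a value in [-1, 1] without mutating it."""
    sign = -1.0 if prediction['label'] == 'NEGATIVE' else 1.0
    return sign * prediction['score']


positive = {'label': 'POSITIVE', 'score': 0.9598047137260437}  # from the output above
negative = {'label': 'NEGATIVE', 'score': 0.75}                # made-up score for illustration

print(signed_score(positive))
print(signed_score(negative))
```

A single signed number is convenient for sorting or averaging reviews, at the cost of hiding the label.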


# Part 4. Example of Text Generation

In [30]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


In [19]:
generator(
    "In this course, we will teach you how to utilize NLP",
    max_length=30,
    num_return_sequences=2
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to utilize NLP in NLP programming. In this course, you will learn how to use NLP'},
 {'generated_text': 'In this course, we will teach you how to utilize NLP or even NAP on a real project rather than using an in-house NLP'}]
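
Each `generated_text` above starts with the prompt itself. If only the newly generated part is needed, the prompt can be stripped off. A small sketch on the outputs copied from above; `strip_prompt` is a hypothetical helper name.

```python
prompt = "In this course, we will teach you how to utilize NLP"

# Outputs copied from the notebook cell above.
outputs = [
    {'generated_text': 'In this course, we will teach you how to utilize NLP in NLP programming. In this course, you will learn how to use NLP'},
    {'generated_text': 'In this course, we will teach you how to utilize NLP or even NAP on a real project rather than using an in-house NLP'},
]


def strip_prompt(text, prompt):
    """Return the text without the leading prompt, if it starts with it."""
    return text[len(prompt):] if text.startswith(prompt) else text


continuations = [strip_prompt(o['generated_text'], prompt) for o in outputs]
for continuation in continuations:
    print(repr(continuation))
```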

# Part 5. Example of Named Entity Recognition (NER)

Named Entity Recognition (NER) is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Typical entity types include person (PER), organization (ORG), date (DATE), and location (LOC).

In [29]:
# aggregation_strategy="simple" merges sub-word tokens into whole entities
ner_pipeline = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


In [21]:
ner_pipeline("My name is John and I work at unpackAI in Beijing.")

[{'entity_group': 'PER',
  'score': 0.9986534,
  'word': 'John',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.7754366,
  'word': 'unpackAI',
  'start': 30,
  'end': 38},
 {'entity_group': 'LOC',
  'score': 0.99954224,
  'word': 'Beijing',
  'start': 42,
  'end': 49}]
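
The list of dictionaries above is easy to post-process in plain Python. A minimal sketch that groups the detected words by entity type and optionally drops low-confidence entities: `group_by_type` and `min_score` are hypothetical names, and the data is copied from the output above.

```python
from collections import defaultdict

sentence = "My name is John and I work at unpackAI in Beijing."

# Entities copied from the notebook output above.
entities = [
    {'entity_group': 'PER', 'score': 0.9986534, 'word': 'John', 'start': 11, 'end': 15},
    {'entity_group': 'ORG', 'score': 0.7754366, 'word': 'unpackAI', 'start': 30, 'end': 38},
    {'entity_group': 'LOC', 'score': 0.99954224, 'word': 'Beijing', 'start': 42, 'end': 49},
]


def group_by_type(entities, min_score=0.0):
    """Group entity words by type, keeping only scores >= min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent['score'] >= min_score:
            grouped[ent['entity_group']].append(ent['word'])
    return dict(grouped)


print(group_by_type(entities))
print(group_by_type(entities, min_score=0.9))

# As with question answering, start/end are character offsets into the input.
for ent in entities:
    assert sentence[ent['start']:ent['end']] == ent['word']
```

With a threshold of 0.9, the less certain `ORG` entity is filtered out while `PER` and `LOC` remain.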