## 💻 UnpackAI DL201 Bootcamp - Week 3 - NLP pipelines

### 📕 Learning Objectives

* Get working examples that perform the main NLP tasks
* Know about Hugging Face, the strength of its pre-trained models, and its all-in-one pipelines

### 📖 Concepts map

* Pipeline

In [6]:
# Imports (to install a missing package quietly, use e.g. "!pip install -Uqq pandas" once you are sure there is no dependency error)

import os
from pathlib import Path
from transformers import pipeline

In [3]:
# import data and images if necessary, and choose the right path
is_kaggle = False   # True if you are on Kaggle, False for local Windows, Linux or Mac environments.

if is_kaggle:
    !pip install -Uqq transformers==4.10.2
    IMAGE_DIR = Path('/kaggle/working/DL201/img')
    DATA_DIR = Path('/kaggle/working/DL201/data')
else:
    # This branch is for local execution. It assumes that the notebook is in the 'Week-2' folder
    # of the DL201 repository.
    DATA_DIR = Path('../data')
    IMAGE_DIR = Path('../img')

# Finally, check that we found the right paths
if os.path.isdir(DATA_DIR):
    print(f'DATA_DIR is a directory, its path is {DATA_DIR}')
else:
    print("ERROR : DATA_DIR is not a directory")

if os.path.isdir(IMAGE_DIR):
    print(f'IMAGE_DIR is a directory, its path is {IMAGE_DIR}')
else:
    print("ERROR : IMAGE_DIR is not a directory")

DATA_DIR is a directory, its path is ..\data
IMAGE_DIR is a directory, its path is ..\img


# Part 1. Introduction

This notebook gives working examples meant to inspire you to build your own AI project.

Hugging Face provides all-in-one ***pipelines*** that bundle the main steps of an NLP task (see https://huggingface.co/course/chapter2/2?fw=pt):
* choosing a pre-trained model
* converting the input text into the form the model expects (tokenization, vectorization)
* running the model on the transformed input data
* converting the model output into something readable by humans (e.g. de-tokenization, to get an output text from an output vector of numbers)
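
The four steps above can be sketched in plain Python. This is only a toy illustration, not Hugging Face code: the vocabulary `VOCAB`, the scores in `WORD_SCORES`, and every function name are hypothetical. Real pipelines use sub-word tokenizers and neural networks.

```python
# Toy illustration of the pipeline stages; this is NOT Hugging Face code.
# The vocabulary, the word scores and the "model" are all hypothetical.

VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
WORD_SCORES = {2: 1.0, 3: -1.0}  # id of "love" -> positive, id of "hate" -> negative


def tokenize(text):
    """Split text into lowercase word tokens (real tokenizers use sub-words)."""
    return text.lower().replace("!", " ").replace(".", " ").split()


def vectorize(tokens):
    """Map each token to its integer id, using [UNK] for unknown words."""
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]


def toy_model(token_ids):
    """Stand-in for a neural network: sum the per-word scores."""
    return sum(WORD_SCORES.get(token_id, 0.0) for token_id in token_ids)


def postprocess(raw_score):
    """Turn the raw number into a human-readable label."""
    return "POSITIVE" if raw_score >= 0 else "NEGATIVE"


def toy_pipeline(text):
    """Chain the stages: text -> tokens -> ids -> raw score -> label."""
    tokens = tokenize(text)
    token_ids = vectorize(tokens)
    raw_score = toy_model(token_ids)
    return {"tokens": tokens, "ids": token_ids, "label": postprocess(raw_score)}


print(toy_pipeline("I love this course!"))
print(toy_pipeline("I hate this so much!"))
```

Words missing from the toy vocabulary map to the `[UNK]` id, which is roughly how unknown inputs are handled before the model sees them.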

Once the pipeline works, you can tune it little by little, as one would tune a car for a race.
For example, you can:
* fine-tune the model or train it from scratch (instead of using the pre-trained model as is)
* use your own tokenizer (instead of the default one)
* clean the training data before feeding it to the model


# Part 2. Example of question answering

In [None]:
question_answerer = pipeline("question-answering")

In [None]:
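# Store the result so that it can be reused in the next cell
my_answer = question_answerer(
    question="Where do I work?",
    context="My name is John and I work at unpackAI in Beijing."
)
my_answer

In [None]:
# The pipeline returns a dict with the keys 'score', 'start', 'end' and 'answer'
print(my_answer["answer"])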
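
For a single question, the question-answering pipeline returns a dictionary with the keys `score`, `start`, `end` and `answer`, where `start` and `end` are character offsets into the context. The sketch below shows one way to inspect such a result. It does not run a model: `sample_result` is hand-built, its score is a placeholder, and `describe_answer` and `min_score` are hypothetical names.

```python
# Illustrative only: `sample_result` mimics the shape of a question-answering
# result; the score is made up and no model is run here.
context = "My name is John and I work at unpackAI in Beijing."
answer_text = "unpackAI"  # hypothetical answer, chosen for the example

start = context.find(answer_text)
sample_result = {
    "score": 0.5,  # placeholder confidence, not a real model output
    "start": start,
    "end": start + len(answer_text),
    "answer": answer_text,
}


def describe_answer(result, context, min_score=0.3):
    """Check the answer span against the context and flag low confidence.

    `min_score` is a hypothetical threshold, to be tuned for your use case.
    """
    span = context[result["start"]:result["end"]]
    return {
        "answer": result["answer"],
        "span_matches": span == result["answer"],
        "confident": result["score"] >= min_score,
    }


report = describe_answer(sample_result, context)
print(report)
```

With a real pipeline, you would pass the returned dictionary to `describe_answer` in place of `sample_result`.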

# Part 3. Example of Sentiment Analysis

In [8]:
classifier = pipeline("sentiment-analysis")

In [8]:
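sentence_list = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

In [None]:
my_answer = classifier(sentence_list)

In [None]:
# The pipeline returns one dict per sentence, with the keys 'label' and 'score'
for result in my_answer:
    print(result["label"])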
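
When given a list of sentences, the sentiment-analysis pipeline returns a list with one dictionary per sentence, each holding a `label` and a `score`. The sketch below pairs each sentence with its prediction. No model is run: `sample_results` is hand-written with placeholder scores, the label names depend on the model used, and `summarize` is a hypothetical helper.

```python
# Illustrative only: `sample_results` mimics the list a sentiment-analysis
# pipeline returns (one dict per sentence); the scores are made up.
sentence_list = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
sample_results = [
    {"label": "POSITIVE", "score": 0.9},  # placeholder values
    {"label": "NEGATIVE", "score": 0.9},  # placeholder values
]


def summarize(sentences, results):
    """Pair each sentence with its predicted label and rounded score."""
    return [
        (sentence, result["label"], round(result["score"], 3))
        for sentence, result in zip(sentences, results)
    ]


for sentence, label, score in summarize(sentence_list, sample_results):
    print(f"{label:8} {score:.3f}  {sentence}")
```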

# Part 4. Example of Text Generation

In [None]:
generator = pipeline("text-generation")
generator(
    "In this course, we will teach you how to utilize NLP",
    max_length=30,
    num_return_sequences=2
)

# Part 5. Example of Named Entity Recognition (NER)

Named Entity Recognition (NER) is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Typical entity types include person (PER), organization (ORG), date (DATE), and location (LOC).

In [None]:
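# grouped_entities=True merges the tokens that belong to the same entity.
# It is deprecated in favor of aggregation_strategy (e.g. "simple").
ner_pipeline = pipeline("ner", grouped_entities=True)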

In [None]:
ner_pipeline("My name is John and I work at unpackAI in Beijing.")
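
To give an intuition for what grouping entities means, here is a toy sketch that merges tagged tokens into entities. It is not the Hugging Face implementation: `tagged_tokens` is hand-written in the common B-/I-/O tagging scheme (B = beginning of an entity, I = inside, O = outside), the tags are assumptions rather than real model output, and `group_entities` is a hypothetical helper.

```python
# Toy sketch of entity grouping; NOT the Hugging Face implementation.
# `tagged_tokens` is hand-written in the common B-/I-/O tagging scheme.
tagged_tokens = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("John", "B-PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("unpackAI", "B-ORG"), ("in", "O"), ("Beijing", "B-LOC"), (".", "O"),
]


def group_entities(tokens_with_tags):
    """Merge consecutive B-X / I-X tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, tag in tokens_with_tags:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)  # continue the open entity
            continue
        if current_words:  # close the previous entity, if any
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):  # start a new entity
            current_type, current_words = entity_type, [word]
        else:  # "O": outside any entity
            current_type, current_words = None, []
    if current_words:  # close an entity that runs to the end of the text
        entities.append((current_type, " ".join(current_words)))
    return entities


print(group_entities(tagged_tokens))
```

A multi-word entity such as `[("New", "B-LOC"), ("York", "I-LOC")]` is merged into a single `("LOC", "New York")` pair.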