# 0. Setup

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3

In [2]:
import transformers

In [3]:
!pip install transformers[sentencepiece]

Collecting sentencepiece==0.1.91
  Downloading sentencepiece-0.1.91-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.91


# 1. Transformer models

## Natural Language Processing


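* **What is NLP? It is a field of linguistics and machine learning focused on understanding everything related to human language.**

> * **Classifying whole sentences**: sentiment analysis of reviews, email spam detection, deciding whether two sentences are logically related, deciding whether a sentence is grammatically correct
> * **Classifying each word in a sentence**: analyzing grammatical components, e.g. POS tagging
> * **Generating text content**: automatically generating sentences, or filling in the masked words of a text
> * **Extracting an answer from a text**: given a question and a context, extracting the answer to the question from the context
> * **Generating a new sentence from an input text**: translation, summarization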

## Transformers, what can they do?


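The most basic object in the Transformers library is the **pipeline**.

A pipeline connects a model with the preprocessing and postprocessing steps it requires.

By default, a pipeline automatically fetches a model that has been fine-tuned for the task and runs it.

**Once a model has been downloaded it is cached, so running the cell again reuses the cached copy instead of downloading it again.**

**Available pipelines**

> * feature-extraction
> * fill-mask
> * ner
> * question-answering
> * sentiment-analysis
> * summarization
> * text-generation
> * translation
> * zero-shot-classification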

In [7]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # create a sentiment-analysis pipeline
classifier("I've been waiting for a HuggingFace course my whole life.")  # returns the sentence's sentiment label and its score

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

In [6]:
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!"
])  # a list of sentences works too

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

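**Main steps when text is passed through a pipeline:**

1. The text is preprocessed into a format the model can understand.

2. The preprocessed input is passed to the model.

3. The model's predictions are postprocessed so that they are human-readable.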
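
The three steps can be mimicked with a small, library-free sketch. Everything in it is a hypothetical stand-in: `preprocess` is a plain word splitter rather than a real subword tokenizer, and `toy_model` is a keyword counter rather than a Transformer. Only the overall shape (preprocess, model, postprocess with a softmax) is meant to mirror what a pipeline does.

```python
import math
import re


def preprocess(text):
    # Step 1 (stand-in tokenizer): lowercase the text and split it into words.
    return re.findall(r"[a-z']+", text.lower())


def toy_model(tokens):
    # Step 2 (stand-in model): return raw scores (logits) for [NEGATIVE, POSITIVE]
    # by counting a few hand-picked keywords. A real model is a neural network.
    negative_words = {"hate", "bad", "awful"}
    positive_words = {"love", "great", "good"}
    negative = sum(token in negative_words for token in tokens)
    positive = sum(token in positive_words for token in tokens)
    return [float(negative), float(positive)]


def postprocess(logits, labels=("NEGATIVE", "POSITIVE")):
    # Step 3: turn logits into probabilities with a softmax and keep the best label.
    shifted = [math.exp(value - max(logits)) for value in logits]
    total = sum(shifted)
    probabilities = [value / total for value in shifted]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {"label": labels[best], "score": probabilities[best]}


def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("I hate this so much!")
print(result)
```

The returned dictionary has the same `label`/`score` shape as the real classifier output above, but the numbers come from the toy model and are not comparable.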

### Zero-shot classification

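**Tackles the hard task of classifying text that has not been labelled.**

It does not depend on the labels of the pretrained model: besides classifying a sentence as positive or negative, you can classify text using any other set of labels you choose.

No fine-tuning is needed; the pipeline directly returns probability scores for the list of labels you supply.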

In [8]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'labels': ['education', 'business', 'politics'],
 'scores': [0.844597339630127, 0.11197540909051895, 0.043427303433418274],
 'sequence': 'This is a course about the Transformers library'}
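
As a small sketch of how this result can be consumed, the snippet below copies the dictionary from the output above and picks the top label. In this output the labels are already sorted by score and the scores sum to roughly 1, so they can be read as a probability distribution over the candidate labels. The variable names are only illustrative.

```python
# Result dictionary copied from the zero-shot cell output above.
result = {
    "labels": ["education", "business", "politics"],
    "scores": [0.844597339630127, 0.11197540909051895, 0.043427303433418274],
    "sequence": "This is a course about the Transformers library",
}

# Pair each label with its score and rank them from most to least likely.
ranked = sorted(
    zip(result["labels"], result["scores"]),
    key=lambda pair: pair[1],
    reverse=True,
)
best_label, best_score = ranked[0]
score_total = sum(result["scores"])

print(best_label, best_score)
print(score_total)
```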

### Text generation

**Given a prompt, the model auto-completes it by generating the remaining text.**

In [9]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to manage and monitor your personal information in the digital age. We will make sure your information is kept safe and secured.\n\nYou will also meet our team of computer experts. This team will help you'}]

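The `num_return_sequences` argument controls how many different sequences are generated.

The `max_length` argument controls the total length of the output text (measured in tokens, with the prompt included).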

✏️ Try it out! Use the num_return_sequences and max_length arguments to generate two sentences of 15 words each.



In [10]:
outputs = generator("In this course, we will teach you how to", num_return_sequences=2, max_length=15)
outputs

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build a scalable blockchain for'},
 {'generated_text': 'In this course, we will teach you how to use the system and it'}]
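
One detail worth checking in this output: `max_length` caps the number of tokens (prompt included), not the number of words. A quick sketch, using the two generated texts copied from above, counts whitespace-separated words and shows that neither output contains 15 words, which is consistent with punctuation such as the comma being counted as a token of its own.

```python
# Generated texts copied from the cell output above.
outputs = [
    {"generated_text": "In this course, we will teach you how to build a scalable blockchain for"},
    {"generated_text": "In this course, we will teach you how to use the system and it"},
]

# Count whitespace-separated words; this is NOT the same as the token count
# that max_length limits.
word_counts = [len(item["generated_text"].split()) for item in outputs]
print(word_counts)
```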

### Using any model from the Hub in a pipeline

Let's load the distilgpt2 model!

In [11]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator("In this course, we will teach you how to", max_length=30, num_return_sequences=2)

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use JavaScript as a template for your classes to make your class much more flexible. We\u202all'},
 {'generated_text': 'In this course, we will teach you how to control our heart.\u202e\n\n\nWe ask you –\nTo think of things as human'}]

### The Inference API

On the Hugging Face website, you can test models directly in the browser through this API.



### Mask filling

Fill in the blanks in a text (the `<mask>` token is replaced with a word).

`top_k` controls how many of the highest-scoring candidate words are returned.

In [12]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'score': 0.19619838893413544,
  'sequence': 'This course will teach you all about mathematical models.',
  'token': 30412,
  'token_str': ' mathematical'},
 {'score': 0.040527306497097015,
  'sequence': 'This course will teach you all about computational models.',
  'token': 38163,
  'token_str': ' computational'}]

✏️ Try it out! Search for the bert-base-cased model on the Hub and identify its mask word in the Inference API widget. What does this model predict for the sentence in our pipeline example above?

In [15]:
unmasker = pipeline("fill-mask", model="bert-base-cased")
unmasker("This course will teach you all about [MASK] models.", top_k=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.2596311569213867,
  'sequence': 'This course will teach you all about role models.',
  'token': 1648,
  'token_str': 'role'},
 {'score': 0.0942729040980339,
  'sequence': 'This course will teach you all about the models.',
  'token': 1103,
  'token_str': 'the'}]
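
The two cells above use different mask tokens: `<mask>` for distilroberta-base and `[MASK]` for bert-base-cased. As a minimal sketch, the hypothetical helper below (`build_masked_prompt` and the `MASK_TOKENS` table are illustrative names, not part of the library) builds the right prompt for each model from a single template.

```python
# Mask tokens as used in the two fill-mask cells above.
MASK_TOKENS = {
    "distilroberta-base": "<mask>",
    "bert-base-cased": "[MASK]",
}


def build_masked_prompt(template: str, model_name: str) -> str:
    """Fill the {mask} placeholder with the mask token of the given model."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "This course will teach you all about {mask} models."
prompts = {name: build_masked_prompt(template, name) for name in MASK_TOKENS}
for name, prompt in prompts.items():
    print(name, "->", prompt)
```

With a pipeline already loaded, the token can be read from `unmasker.tokenizer.mask_token` instead of being hard-coded in a table like this.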

### Named entity recognition

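**NER is the task of identifying which parts of the input text correspond to entities such as persons, locations, or organizations.**

With **grouped_entities=True**, the pipeline regroups the parts of the sentence that belong to the same entity, e.g. Hugging + Face => ORG. With False, a word such as Hugging is shown split into smaller subword pieces (similar to WordPiece tokenization). As the warning in the output notes, `grouped_entities` is deprecated in favor of `aggregation_strategy`.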

In [16]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'end': 18,
  'entity_group': 'PER',
  'score': 0.9981694,
  'start': 11,
  'word': 'Sylvain'},
 {'end': 45,
  'entity_group': 'ORG',
  'score': 0.97960204,
  'start': 33,
  'word': 'Hugging Face'},
 {'end': 57,
  'entity_group': 'LOC',
  'score': 0.99321055,
  'start': 49,
  'word': 'Brooklyn'}]

In [17]:
ner1 = pipeline("ner", grouped_entities=False)
ner1("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'end': 12,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.9993828,
  'start': 11,
  'word': 'S'},
 {'end': 14,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.99815476,
  'start': 12,
  'word': '##yl'},
 {'end': 16,
  'entity': 'I-PER',
  'index': 6,
  'score': 0.99590725,
  'start': 14,
  'word': '##va'},
 {'end': 18,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.9992327,
  'start': 16,
  'word': '##in'},
 {'end': 35,
  'entity': 'I-ORG',
  'index': 12,
  'score': 0.97389334,
  'start': 33,
  'word': 'Hu'},
 {'end': 40,
  'entity': 'I-ORG',
  'index': 13,
  'score': 0.976115,
  'start': 35,
  'word': '##gging'},
 {'end': 45,
  'entity': 'I-ORG',
  'index': 14,
  'score': 0.98879766,
  'start': 41,
  'word': 'Face'},
 {'end': 57,
  'entity': 'I-LOC',
  'index': 16,
  'score': 0.99321055,
  'start': 49,
  'word': 'Brooklyn'}]
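
To make the regrouping concrete, here is a simplified sketch that rebuilds the grouped result from the ungrouped tokens above. It is not the library's implementation: `group_entities` is a hypothetical helper that merges consecutive tokens with the same entity type, glues `##` pieces onto the previous piece, and averages the token scores. On this example, averaging happens to reproduce the scores of the grouped output shown earlier.

```python
# Ungrouped tokens copied from the grouped_entities=False output above.
tokens = [
    {"entity": "I-PER", "index": 4, "score": 0.9993828, "start": 11, "end": 12, "word": "S"},
    {"entity": "I-PER", "index": 5, "score": 0.99815476, "start": 12, "end": 14, "word": "##yl"},
    {"entity": "I-PER", "index": 6, "score": 0.99590725, "start": 14, "end": 16, "word": "##va"},
    {"entity": "I-PER", "index": 7, "score": 0.9992327, "start": 16, "end": 18, "word": "##in"},
    {"entity": "I-ORG", "index": 12, "score": 0.97389334, "start": 33, "end": 35, "word": "Hu"},
    {"entity": "I-ORG", "index": 13, "score": 0.976115, "start": 35, "end": 40, "word": "##gging"},
    {"entity": "I-ORG", "index": 14, "score": 0.98879766, "start": 41, "end": 45, "word": "Face"},
    {"entity": "I-LOC", "index": 16, "score": 0.99321055, "start": 49, "end": 57, "word": "Brooklyn"},
]


def group_entities(tokens):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-PER" -> "PER"
        prev = groups[-1] if groups else None
        if prev and prev["entity_group"] == label and tok["index"] == prev["last_index"] + 1:
            piece = tok["word"]
            # "##" marks a continuation of the previous word; otherwise add a space.
            prev["word"] += piece[2:] if piece.startswith("##") else " " + piece
            prev["end"] = tok["end"]
            prev["scores"].append(tok["score"])
            prev["last_index"] = tok["index"]
        else:
            groups.append({
                "entity_group": label,
                "word": tok["word"],
                "start": tok["start"],
                "end": tok["end"],
                "scores": [tok["score"]],
                "last_index": tok["index"],
            })
    # Report the mean token score per group and drop the bookkeeping fields.
    return [
        {
            "entity_group": g["entity_group"],
            "score": sum(g["scores"]) / len(g["scores"]),
            "word": g["word"],
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]


grouped = group_entities(tokens)
for group in grouped:
    print(group)
```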

### Question answering

Extracts the answer to a question from a given context.

In [18]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'Hugging Face', 'end': 45, 'score': 0.6949757933616638, 'start': 33}
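
The `start` and `end` fields in this result look like character offsets into the context, which fits the idea that the answer is extracted rather than generated. A short sketch, using the context and result copied from above, checks this by slicing the string:

```python
# Context and result copied from the question-answering cell above.
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
result = {"answer": "Hugging Face", "end": 45, "score": 0.6949757933616638, "start": 33}

# Slice the context with the reported offsets and compare with the answer.
extracted = context[result["start"]:result["end"]]
print(extracted)
print(extracted == result["answer"])
```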

### Summarization

**max_length** and **min_length** let you control the length of the generated text.

In [19]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

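### Translation

If you provide a language pair in the task name, such as "translation_en_to_fr", a default model is used. The cell below instead picks a model from the Hub explicitly.

As with summarization, "max_length" or "min_length" can be used to control the length of the result.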

In [20]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]