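# Natural Language Processing with the transformers Pipeline
+ In this session we will learn how to do natural language processing easily with the transformers Pipeline.
+ The transformers library provides architectures such as BERT and GPT, along with APIs, for natural language processing; all we need to do is install transformers and use it.
+ HuggingFace site: a community that provides BERT and GPT models and supports easy natural language processing
+ It is worth visiting the huggingface sites below and browsing around.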
  + https://huggingface.co/
  + https://github.com/huggingface/transformers



## What you can do with the transformers Pipeline
1. Sequence Classification
2. Unmasking
3. Question Answering
4. Text Generation
5. Named Entity Recognition(NER)
6. Summarization
7. Translation

### Learning outline
#### A. Installing transformers
#### B. Using the Pipeline
1. Sequence Classification
2. Unmasking
3. Question Answering
4. Text Generation
5. Named Entity Recognition(NER)
6. Summarization
7. Translation

## A. Installing transformers
+ Install the transformers library.
+ Using transformers requires tensorflow and keras upgraded to 2.4.0 or later.

In [1]:
# Suppress warning messages
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
!pip install transformers  



In [3]:
import transformers
import tensorflow as tf

print(transformers.__version__)
print(tf.__version__)

4.18.0
2.4.2
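
The version requirement above can also be checked in code rather than by eye. The following is a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical helper names, not part of transformers or tensorflow, and the parser only looks at the leading numeric part of each version component.

```python
import re

def version_tuple(version: str) -> tuple:
    """Turn '2.4.2' (or '2.4.0rc1') into a tuple of ints such as (2, 4, 2)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component that has no leading digits
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so '2.4' compares equal to '2.4.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

# tensorflow version printed by the cell above, checked against the 2.4.0 requirement
print(meets_minimum("2.4.2", "2.4.0"))
```

In a live notebook, `tf.__version__` could be passed in place of the hard-coded string.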


## B. Using the Pipeline
+ You can use the default model built into each Pipeline task or select a different model.
+ Reference: https://huggingface.co/transformers/main_classes/pipelines.html




In [4]:
from transformers import pipeline

### 1. Sequence Classification

In [5]:
# Sequence Classification: sentiment classification
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [6]:
print(classifier("I love you"))
print(classifier("I hate you"))

[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]


In [7]:
print(classifier('We are very happy to show you the Transformers library.'))

[{'label': 'POSITIVE', 'score': 0.9997994303703308}]


In [8]:
# A pre-trained model can be selected; this multilingual model also supports Korean.
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")


In [9]:
print(classifier('We are very happy to show you the Transformers library.'))

[{'label': '5 stars', 'score': 0.749592661857605}]


In [10]:
# Evaluate a Korean sentence (meaning: "That movie is boring and no fun.")
print(classifier('그 영화 지루하고 재미없네.'))

[{'label': '3 stars', 'score': 0.22962142527103424}]
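
The nlptown model returns labels such as `'3 stars'`, and the score of about 0.23 for the Korean sentence shows that the model is far from certain about it. A small helper can turn the label into a number and flag low-confidence predictions. This is only a sketch: `parse_star_rating` is a hypothetical helper, the 0.5 threshold is an arbitrary value chosen for illustration, and the input dictionaries are copied from the outputs above.

```python
def parse_star_rating(prediction: dict, min_score: float = 0.5) -> dict:
    """Convert a label like '3 stars' to an int and mark whether the score clears a threshold."""
    stars = int(prediction["label"].split()[0])  # '3 stars' -> 3, '1 star' -> 1
    return {
        "stars": stars,
        "score": prediction["score"],
        "confident": prediction["score"] >= min_score,  # threshold is an assumption
    }

# Predictions copied from the two cells above
english = {'label': '5 stars', 'score': 0.749592661857605}
korean = {'label': '3 stars', 'score': 0.22962142527103424}

print(parse_star_rating(english))
print(parse_star_rating(korean))
```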


### 2. Unmasking

In [11]:
# fill-mask: fill in the missing word
unmasker = pipeline('fill-mask', model='bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [12]:
# Suggests words that fit in place of [MASK].
unmasker("Hello I'm a [MASK] model.")

[{'score': 0.10731092840433121,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello i'm a fashion model."},
 {'score': 0.08774464577436447,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello i'm a role model."},
 {'score': 0.05338375270366669,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello i'm a new model."},
 {'score': 0.046672213822603226,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello i'm a super model."},
 {'score': 0.027095848694443703,
  'token': 2986,
  'token_str': 'fine',
  'sequence': "hello i'm a fine model."}]
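
The fill-mask output is a list of dictionaries, one per candidate token, so it can be processed with ordinary Python. The sketch below ranks the candidates and adds up their scores; `summarize_candidates` is a hypothetical helper, and the data is copied from the output above. The five candidates together account for only about a third of the probability, which suggests the model sees many other plausible fillers.

```python
# Output of unmasker("Hello I'm a [MASK] model.") copied from the cell above
candidates = [
    {'score': 0.10731092840433121, 'token': 4827, 'token_str': 'fashion',
     'sequence': "hello i'm a fashion model."},
    {'score': 0.08774464577436447, 'token': 2535, 'token_str': 'role',
     'sequence': "hello i'm a role model."},
    {'score': 0.05338375270366669, 'token': 2047, 'token_str': 'new',
     'sequence': "hello i'm a new model."},
    {'score': 0.046672213822603226, 'token': 3565, 'token_str': 'super',
     'sequence': "hello i'm a super model."},
    {'score': 0.027095848694443703, 'token': 2986, 'token_str': 'fine',
     'sequence': "hello i'm a fine model."},
]

def summarize_candidates(results):
    """Return (token, percent) pairs sorted by score, plus the total probability they cover."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    pairs = [(r["token_str"], round(100 * r["score"], 1)) for r in ranked]
    covered = sum(r["score"] for r in ranked)
    return pairs, covered

pairs, covered = summarize_candidates(candidates)
for token, percent in pairs:
    print(f"{token:10s} {percent:5.1f}%")
print(f"covered by these candidates: {covered:.3f}")
```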

### 3. Question Answering

In [13]:
# Question Answering: given a passage (context) and a question, the model finds the answer within the passage.
qa = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



In [14]:
# Context passage
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

In [15]:
# Find the answer to each question within the context.
# Result fields: start = position where the answer starts in the context, end = position where it ends, answer = the answer text

print(qa(question="What is extractive question answering?", context=context))
print(qa(question="What is a good example of a question answering dataset?", context=context))

{'score': 0.622244656085968, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5115309357643127, 'start': 147, 'end': 160, 'answer': 'SQuAD dataset'}
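
Because the answer is extracted rather than generated, `start` and `end` are character offsets into the context, and slicing the context with them reproduces `answer`. The sketch below checks this on the two results printed above; `answer_in_context` and its `window` parameter are hypothetical names introduced for illustration.

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

# Results copied from the output above
results = [
    {'score': 0.622244656085968, 'start': 34, 'end': 95,
     'answer': 'the task of extracting an answer from a text given a question'},
    {'score': 0.5115309357643127, 'start': 147, 'end': 160, 'answer': 'SQuAD dataset'},
]

def answer_in_context(result, context, window=20):
    """Slice the answer out of the context and return it with some surrounding text."""
    span = context[result["start"]:result["end"]]
    left = max(0, result["start"] - window)
    right = min(len(context), result["end"] + window)
    snippet = context[left:right].replace("\n", " ")
    return span, snippet

for r in results:
    span, snippet = answer_in_context(r, context)
    print(span == r["answer"], "|", snippet)
```

Note that the leading newline of the triple-quoted string counts as character 0, which is why the first answer starts at offset 34 rather than 33.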


### 4. Text Generation

In [16]:
# text-generation: given the beginning of a text, the model continues it.
text_generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)



In [17]:
# Pass in a prompt and the model generates the continuation.
print(text_generator("When the Titanic crashed, I", max_length=50, do_sample=False))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "When the Titanic crashed, I was in the middle of a long day, and I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What"}]
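
The generated text above falls into a loop, repeating the same phrase. A quick way to spot this is to count repeated word n-grams. The sketch below does that on the output copied from above; `repeated_ngrams` is a hypothetical helper, and splitting on whitespace is a simplification that keeps punctuation attached to words.

```python
from collections import Counter

def repeated_ngrams(text: str, n: int = 4) -> dict:
    """Return every run of n whitespace-separated words that occurs more than once."""
    words = text.split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in grams.items() if count > 1}

# Generated text copied from the output above
generated = ("When the Titanic crashed, I was in the middle of a long day, and I was thinking, "
             "'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' "
             "And I was thinking, 'What")

for gram, count in repeated_ngrams(generated).items():
    print(count, " ".join(gram))
```

Greedy decoding (`do_sample=False`) always picks the most likely next token, which is one common reason for loops like this; setting `do_sample=True` is the usual first alternative to try.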


### 5. Named Entity Recognition(NER)

In [18]:
# ner (Named Entity Recognition): tags each token with its entity type, similar to how morphological analysis tags each word
nlp = pipeline("ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



In [19]:
# Input text
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
 therefore very close to the Manhattan Bridge which is visible from the window."""

In [20]:
# List the entity label for each token

# O, Outside of a named entity
# B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
# I-MISC, Miscellaneous entity
# B-PER, Beginning of a person’s name right after another person’s name
# I-PER, Person’s name
# B-ORG, Beginning of an organisation right after another organisation
# I-ORG, Organisation
# B-LOC, Beginning of a location right after another location
# I-LOC, Location

# Reading the result
# Hu, ##gging : recognized as I-ORG (organisation)
# New, York : recognized as I-LOC (location)

print(nlp(sequence))

[{'entity': 'I-ORG', 'score': 0.99957865, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.9982224, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-ORG', 'score': 0.9994879, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}, {'entity': 'I-LOC', 'score': 0.9994344, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}, {'entity': 'I-LOC', 'score': 0.99931955, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}, {'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}, {'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}, {'entity': 'I-LOC', 'score': 0.95142674, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}, {'entity': 'I-LOC', 'score': 0.9336587, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}, {'entity': 'I-LOC', 'score': 0.9761654, 'index': 28, 'word': 'Manhat
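
The raw output labels word pieces (`Hu`, `##gging`) rather than whole entities. Since every token carries `start` and `end` character offsets, neighbouring tokens of the same type can be merged back into readable spans. The sketch below illustrates the idea on the first ten complete predictions shown above; `merge_tokens` is a hypothetical helper. The pipeline also accepts an `aggregation_strategy` argument that performs grouping inside the library, so this code is for understanding the output format, not a replacement for that option.

```python
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
 therefore very close to the Manhattan Bridge which is visible from the window."""

# First ten token predictions copied from the output above
tokens = [
    {'entity': 'I-ORG', 'score': 0.99957865, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2},
    {'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7},
    {'entity': 'I-ORG', 'score': 0.9982224, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12},
    {'entity': 'I-ORG', 'score': 0.9994879, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16},
    {'entity': 'I-LOC', 'score': 0.9994344, 'index': 11, 'word': 'New', 'start': 40, 'end': 43},
    {'entity': 'I-LOC', 'score': 0.99931955, 'index': 12, 'word': 'York', 'start': 44, 'end': 48},
    {'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53},
    {'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80},
    {'entity': 'I-LOC', 'score': 0.95142674, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82},
    {'entity': 'I-LOC', 'score': 0.9336587, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84},
]

def merge_tokens(tokens, text):
    """Merge neighbouring tokens (consecutive index, same entity type) into (type, text) spans."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")  # 'I-ORG' -> ('I', '-', 'ORG')
        last = spans[-1] if spans else None
        continues = (
            last is not None
            and prefix != "B"  # a B- tag always opens a new entity
            and last["label"] == label
            and tok["index"] == last["last_index"] + 1
        )
        if continues:
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "last_index": tok["index"]})
    # Use the character offsets to read the entity text from the original string
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]

print(merge_tokens(tokens, sequence))
```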

### 6. Summarization

In [21]:
# summarization: summarize the key points of a passage
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



In [22]:
# Define the article
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

In [23]:
# Print the summary of the article
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]


### 7. Translation

In [24]:
# translation_en_to_fr: translate English into French
translator = pipeline("translation_en_to_fr")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)



In [25]:
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

[{'translation_text': 'Hugging Face est une entreprise technologique basée à New York et à Paris.'}]


# Summary
1. We practiced doing natural language processing easily with the transformers Pipeline.
2. Using the transformers Pipeline, we tried Sequence Classification, Unmasking, Question Answering, Text Generation, Named Entity Recognition (NER), Summarization, and Translation.
3. All we need to do is install transformers and call the pipeline function.
4. Understand the transformers Pipeline and use it whenever you need it.