<a href="https://colab.research.google.com/github/jswooo/HuggingFace/blob/main/1_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tasks that Transformers can perform (using pipeline)

In [None]:
# Install transformers with the sentencepiece extra
!pip install transformers[sentencepiece]

In [None]:
import transformers

In [2]:
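# pipeline(): wraps the preprocessing and postprocessing a model needs
'''
1. Preprocess the text into a form the model can understand
2. Pass the preprocessed input to the model
3. Postprocess the output and return it to the user
'''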
from transformers import pipeline

# sentiment analysis: classify a text as positive or negative

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]
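
The three steps listed in the comment above can be sketched without any library. This is only a toy illustration: the simple tokenizer, the `TOY_WEIGHTS` table, and the function names are hypothetical stand-ins, not how `transformers` implements `pipeline()` internally.

```python
import math

# Hypothetical per-word weights: (negative logit, positive logit).
TOY_WEIGHTS = {
    "love": (-1.0, 2.0),
    "great": (-0.5, 1.5),
    "hate": (2.0, -1.0),
}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into tokens the toy model understands."""
    return [tok.strip(".,!?").lower() for tok in text.split()]


def toy_model(tokens):
    """Step 2: produce one raw score (logit) per class."""
    neg = sum(TOY_WEIGHTS.get(t, (0.0, 0.0))[0] for t in tokens)
    pos = sum(TOY_WEIGHTS.get(t, (0.0, 0.0))[1] for t in tokens)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the best label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring the shape of the real output."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love this course!"))
print(toy_pipeline("I hate this so much!"))
```

The returned dict has the same `label` and `score` keys as the real classifier output, which is the part the postprocessing step is responsible for.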

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
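
The warning printed by the first cell advises against relying on the default model in production. One way to follow that advice, sketched here, is to pass the model name and revision from that warning to `pipeline()` explicitly. The helper name `build_sentiment_classifier` is made up for this example, and the import sits inside the function only so that defining it does not trigger a download.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"  # revision reported in the warning of the first cell


def build_sentiment_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily; requires the transformers package to be installed.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)


# Usage (downloads the model on the first call):
# classifier = build_sentiment_classifier()
# classifier("I hate this so much!")
```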

In [None]:
# zero-shot-classification: classify a text against user-supplied labels. No fine-tuning is needed to use it

classifier = pipeline("zero-shot-classification")
classifier(
    "Confidence is in short supply among Chinese investors these days, confounding analysts who say reasons to own the market are finally coming true.",
    candidate_labels=["education", "politics", "economics"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'Confidence is in short supply among Chinese investors these days, confounding analysts who say reasons to own the market are finally coming true.',
 'labels': ['economics', 'education', 'politics'],
 'scores': [0.5652979612350464, 0.26237601041793823, 0.17232604324817657]}

In [None]:
classifier(
    "Kevin de Bruyne and Erling Haaland provided the silk and steel as Manchester City's seemingly unstoppable momentum consumed faltering Arsenal in their so-called Premier League title decider at Etihad Stadium.",
    candidate_labels=["education", "politics", "economics", "sports"],
)

{'sequence': "Kevin de Bruyne and Erling Haaland provided the silk and steel as Manchester City's seemingly unstoppable momentum consumed faltering Arsenal in their so-called Premier League title decider at Etihad Stadium.",
 'labels': ['sports', 'education', 'economics', 'politics'],
 'scores': [0.9251295328140259,
  0.0309822466224432,
  0.022845663130283356,
  0.021042505279183388]}
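
The result is a plain dict whose `labels` and `scores` lists are aligned and sorted from most to least likely. A small sketch of reading it, using the numbers from the output above (the `result` literal is copied by hand rather than produced by running the model):

```python
result = {
    "labels": ["sports", "education", "economics", "politics"],
    "scores": [
        0.9251295328140259,
        0.0309822466224432,
        0.022845663130283356,
        0.021042505279183388,
    ],
}

# Pair each label with its score; the first pair is the top prediction.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")

# With the default settings the scores are normalized across the candidate
# labels, so they add up to approximately 1.
total = sum(result["scores"])
print(f"top label: {top_label}, total: {total:.4f}")
```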

In [None]:
# text-generation: predicts what follows the user-supplied prompt (sampling is random, so outputs vary between runs). The results don't look very good, though.

generator = pipeline("text-generation")
generator(
    "I'm currently studying about deep learning.",
    max_length=30,
    num_return_sequences=2,
)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I'm currently studying about deep learning. If I had to say the exact same thing, I'd say learning how to write python is really boring and"},
 {'generated_text': "I'm currently studying about deep learning. One question is, where does it originate and how can it be replicated? Well, that's one of the"}]
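
The two sequences differ because generation samples each next token from a probability distribution instead of always taking the most likely one. The sketch below mimics that with a made-up next-word table (`NEXT_WORD_PROBS` and `sample_continuation` are hypothetical and not part of `transformers`) and shows that fixing the random seed makes the result repeatable.

```python
import random

# Hypothetical next-word distribution for a single generation step.
NEXT_WORD_PROBS = {"I": 0.4, "One": 0.3, "We'll": 0.2, "If": 0.1}


def sample_continuation(seed, steps=5):
    """Draw `steps` words from the toy distribution using a seeded RNG."""
    rng = random.Random(seed)
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=steps)


first = sample_continuation(seed=0)
second = sample_continuation(seed=0)
print(first)
print(first == second)  # same seed, same samples
```

For the real pipeline, `transformers` exposes a `set_seed` helper that can be called before generating to get repeatable samples.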

In [None]:
# Generate text with a specific model (tried gpt2 here)

generator = pipeline("text-generation", model="gpt2")
generator(
    "I'm currently studying about deep learning.",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I'm currently studying about deep learning. I think deep learning is interesting from a theoretical standpoint. When people use a framework to learn things they then learn"},
 {'generated_text': "I'm currently studying about deep learning. We'll talk about how to take something you're working on or why it's important to be more expressive."}]

In [None]:
# fill-mask: fills in the mask token. The form of the mask token can differ between models, so check it first.
unmasker = pipeline("fill-mask")

unmasker("most important knowlegde in physics is <mask>.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.07382698357105255,
  'token': 44999,
  'token_str': ' relativity',
  'sequence': 'most important knowlegde in physics is relativity.'},
 {'score': 0.06814121454954147,
  'token': 25634,
  'token_str': ' mathematics',
  'sequence': 'most important knowlegde in physics is mathematics.'}]
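
Since the mask token differs between model families (`<mask>` for RoBERTa-style checkpoints such as the `distilroberta-base` default used here, `[MASK]` for BERT-style ones), it can help to build the prompt from a variable rather than hard-coding the token. A minimal sketch; `make_masked_prompt` and the `{mask}` placeholder are conventions invented for this example:

```python
def make_masked_prompt(template, mask_token):
    """Fill the `{mask}` placeholder with a model-specific mask token."""
    if "{mask}" not in template:
        raise ValueError("template must contain a {mask} placeholder")
    return template.replace("{mask}", mask_token)


template = "The most important knowledge in physics is {mask}."

roberta_prompt = make_masked_prompt(template, "<mask>")  # RoBERTa-style
bert_prompt = make_masked_prompt(template, "[MASK]")     # BERT-style
print(roberta_prompt)
print(bert_prompt)
```

With a loaded pipeline, the token can be read from `unmasker.tokenizer.mask_token` instead of being typed by hand.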

In [None]:
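# NER (named entity recognition)

# grouped_entities=True merges tokens that belong to the same entity
# (newer transformers releases prefer aggregation_strategy="simple")
ner = pipeline("ner", grouped_entities=True)
ner("My name is Aaron and I work at Samsung Electronics in Gangnam.")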

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9990368,
  'word': 'Aaron',
  'start': 11,
  'end': 16},
 {'entity_group': 'ORG',
  'score': 0.9992042,
  'word': 'Samsung Electronics',
  'start': 31,
  'end': 50},
 {'entity_group': 'LOC',
  'score': 0.9164551,
  'word': 'Gangnam',
  'start': 54,
  'end': 61}]
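
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A short sketch using the entities from the output above (copied by hand, with the scores omitted):

```python
text = "My name is Aaron and I work at Samsung Electronics in Gangnam."

# Entities copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Aaron", "start": 11, "end": 16},
    {"entity_group": "ORG", "word": "Samsung Electronics", "start": 31, "end": 50},
    {"entity_group": "LOC", "word": "Gangnam", "start": 54, "end": 61},
]

# Slice the original text with each entity's offsets.
spans = [text[ent["start"]:ent["end"]] for ent in entities]

for ent, span in zip(entities, spans):
    print(f'{ent["entity_group"]}: {span}')
```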

In [4]:
# Question answering

qa = pipeline("question-answering")
qa(
    question="where do I learn?",
    context="my name is Aaron and I'm undergraduate student at Kyunghee university",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6885819435119629, 'start': 50, 'end': 58, 'answer': 'Kyunghee'}
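
This pipeline is extractive: the answer is a span of the context, located by `start` and `end`. The sketch below checks that against the output above and adds a simple confidence cut-off. The `accept_answer` helper and its 0.5 default threshold are assumptions made for illustration, not part of the library.

```python
context = "my name is Aaron and I'm undergraduate student at Kyunghee university"

# Result copied by hand from the output above.
result = {"score": 0.6885819435119629, "start": 50, "end": 58, "answer": "Kyunghee"}

# The answer text is exactly the slice of the context between the offsets.
answer_span = context[result["start"]:result["end"]]
print(answer_span)


def accept_answer(result, threshold=0.5):
    """Return the answer only if its score reaches the (hypothetical) threshold."""
    return result["answer"] if result["score"] >= threshold else None


print(accept_answer(result))
print(accept_answer(result, threshold=0.9))
```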

In [5]:
# Summarization: tried it on the abstract of the Transformer paper

summarize = pipeline("summarization")
summarize(
    """
    The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. 
    The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. 
    Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English- to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. 
    On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
    """
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder . We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely . The Transformer achieves 28.4 BLEU on the WMT 2014 English-to-German translation task .'}]

In [6]:
# Machine translation: fr => en

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator(
    "Ce cours est produit par Hugging Face."
)

[{'translation_text': 'This course is produced by Hugging Face.'}]