# **2022 Big Data Club Special Lecture - Orientation (July 11, 2022)**

Graduate School, Department of Management of Technology - Prof. 송지훈

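🧑👩 Welcome to the Big Data Club, everyone!\
Our time together is short, but the main goal is to help you build the **skills** to **learn on your own** and **take part** in **competitions**, using a range of special lectures as the starting point.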

**Tips for later**\
✅ You can only learn data science by doing data science. (You have to actually write the code yourself.)\
✅ Practice, practice, practice. (Keep practicing. These short lectures cannot cover every detail.)\
✅ Free resources everywhere. (The internet is full of free material for learning data analysis and programming. Seek it out and use it.)

In [5]:
# HuggingFace transformers: the most widely used, state-of-the-art NLP (natural language processing) library
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Pipeline

In [6]:
from transformers import pipeline
# A pipeline wraps pre-processing and post-processing around the model:
# pre-processing: tokenization
# post-processing: mapping the model output to a POSITIVE or NEGATIVE label

# No model is specified here, so the default model for the task is used.
# Other task names create other kinds of pipeline objects.
classifier = pipeline("sentiment-analysis")
res = classifier("I've been waiting for a HuggingFace course my whole life.")
print(res)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]


In [7]:
sentiment_check = classifier("We did not like the food we ordered last night.")
print(sentiment_check)

[{'label': 'NEGATIVE', 'score': 0.9989193677902222}]


In [20]:
sentiment_check = classifier(["We did not like the food we ordered last night.","I enjoyed the movie."])
print(sentiment_check)

[{'label': 'NEGATIVE', 'score': 0.9989193677902222}, {'label': 'POSITIVE', 'score': 0.9998699426651001}]
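
When a list of texts is passed in, the pipeline returns one result per input, in the same order. A minimal sketch of pairing each text with its prediction follows. It reuses the values printed above instead of calling the model again, so the `predictions` list is copied data, not a fresh run.

```python
# Input texts, in the order they were passed to the classifier
texts = [
    "We did not like the food we ordered last night.",
    "I enjoyed the movie.",
]

# Predictions copied from the printed output above (not recomputed here)
predictions = [
    {"label": "NEGATIVE", "score": 0.9989193677902222},
    {"label": "POSITIVE", "score": 0.9998699426651001},
]

# Results come back in input order, so zip lines them up
for text, pred in zip(texts, predictions):
    print(f"{pred['label']:<8} {pred['score']:.4f}  {text}")

# Keep only the labels, e.g. to store them in a table column
labels = [pred["label"] for pred in predictions]
```

With a real run, `predictions` would simply be the return value of `classifier(texts)`.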


In [9]:
generator = pipeline("text-generation", model="distilgpt2")

res = generator("In this course, we will teach you how to",
                max_length=30,
                num_return_sequences=2)
print(res)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use a lot of knowledge, because as well as some of the challenges the course presents, it should'}, {'generated_text': 'In this course, we will teach you how to apply an approach to one specific class of skills on your own.'}]
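
Each entry in the result holds the prompt followed by the generated continuation. A small sketch of separating the two is shown below. It works on the output printed above, so `res` is copied data rather than a new model call, and a new run would produce different text.

```python
prompt = "In this course, we will teach you how to"

# Output copied from the run above; text generation varies between runs
res = [
    {"generated_text": "In this course, we will teach you how to use a lot of knowledge, because as well as some of the challenges the course presents, it should"},
    {"generated_text": "In this course, we will teach you how to apply an approach to one specific class of skills on your own."},
]

continuations = []
for item in res:
    text = item["generated_text"]
    # Drop the prompt only if the text really starts with it
    if text.startswith(prompt):
        text = text[len(prompt):]
    continuations.append(text.strip())

for i, continuation in enumerate(continuations, start=1):
    print(f"{i}: {continuation}")
```

The first continuation stops mid-sentence because `max_length=30` limits the total number of tokens, prompt included.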


In [10]:
classifier = pipeline("zero-shot-classification")
res = classifier("This is a course about NLP using Python.",
                 candidate_labels=["education", "politics", "business"])
print(res)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)



In [19]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'Hugging Face', 'end': 45, 'score': 0.6949766278266907, 'start': 33}
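
The `start` and `end` fields are character offsets into the context string, so the answer can be recovered by slicing. A short check using the values printed above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the output above
result = {"answer": "Hugging Face", "end": 45, "score": 0.6949766278266907, "start": 33}

# Slice the context with the character offsets returned by the pipeline
span = context[result["start"]:result["end"]]
print(span)

# The slice and the reported answer should be the same text
print(span == result["answer"])
```

This is useful when the answer needs to be highlighted inside the original document rather than shown on its own.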

https://huggingface.co/docs/transformers/main_classes/pipelines

Where can you learn more about these models?

https://huggingface.co/course/chapter1/2?fw=pt

# Tokenizer and Model (How the Pipeline Works)

In [12]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
res = classifier("I've been waiting for a HuggingFace course my whole life.")
print(res)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]


In [13]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [14]:
result = tokenizer('Learning python is pretty fun.') # 101 is the [CLS] token added at the start, 102 is the [SEP] token added at the end
print(result)

{'input_ids': [101, 4083, 18750, 2003, 3492, 4569, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}


In [15]:
tokens = tokenizer.tokenize('Learning python is pretty fun.')
print(tokens)

['learning', 'python', 'is', 'pretty', 'fun', '.']


In [16]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[4083, 18750, 2003, 3492, 4569, 1012]
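
These IDs are the same as the `input_ids` printed earlier, minus the two special tokens. A quick sketch that rebuilds the full encoding by hand, using only the numbers shown in the outputs above:

```python
# Values copied from the earlier outputs
input_ids = [101, 4083, 18750, 2003, 3492, 4569, 1012, 102]  # from tokenizer(...)
ids = [4083, 18750, 2003, 3492, 4569, 1012]                  # from convert_tokens_to_ids(...)

CLS_ID = 101  # special token placed at the start of the sequence
SEP_ID = 102  # special token placed at the end of the sequence

# Calling the tokenizer directly = tokenize + convert to IDs + add special tokens
rebuilt = [CLS_ID] + ids + [SEP_ID]
print(rebuilt == input_ids)

# With a single unpadded sentence, every position is a real token
attention_mask = [1] * len(rebuilt)
print(attention_mask)
```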


In [17]:
decode_string = tokenizer.decode(ids)
print(decode_string)

learning python is pretty fun.
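
Tokenization is only the first half of what the pipeline does. The model then turns the token IDs into one raw score (logit) per class, and a post-processing step converts those logits into the label and score seen in the pipeline output. The sketch below shows that last step in plain Python. The logits are made-up values for illustration, not real model output, and the `id2label` mapping is assumed to match this checkpoint's NEGATIVE/POSITIVE labels.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence (illustration only)
logits = [-1.5, 1.7]

# Assumed mapping from class index to label name
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)

# Pick the class with the highest probability
best = max(range(len(probs)), key=lambda i: probs[i])
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

With the real model, the logits would come from calling `model` on the tokenizer output, and the rest of the steps stay the same.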


https://huggingface.co/models

https://huggingface.co/facebook/bart-large-cnn

In [18]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]


# Applying to Korean


In [22]:
from transformers import pipeline
classifier = pipeline("text-classification", model="matthewburke/korean_sentiment")
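# The inputs stay in Korean because the model was trained on Korean text
texts = ['현재 수시 원서 접수를 마친 고3이지.. 어벤져스 엔드게임 또 보며 눈물 콧물 흘리며 왔다..',  # a high-school senior rewatched Avengers: Endgame and came back in tears
         '이게 과연 영화라 부를 수 있는 것인가?',  # "Can this even be called a movie?"
         '모르겠다, 그냥 잠이나 잘 걸',  # "I don't know, I should have just slept"
         '또 보러 와야지, 개꿀잼']  # "I'll come see it again, so much fun"
# return_all_scores=True returns the score of every label for each text
preds = classifier(texts, return_all_scores=True)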


In [23]:
print(preds)

[[{'label': 'LABEL_0', 'score': 0.032331861555576324}, {'label': 'LABEL_1', 'score': 0.9676681756973267}], [{'label': 'LABEL_0', 'score': 0.9402838349342346}, {'label': 'LABEL_1', 'score': 0.059716179966926575}], [{'label': 'LABEL_0', 'score': 0.874341607093811}, {'label': 'LABEL_1', 'score': 0.12565837800502777}], [{'label': 'LABEL_0', 'score': 0.026859769597649574}, {'label': 'LABEL_1', 'score': 0.9731402397155762}]]


In [30]:
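# LABEL_1 scores high for the clearly positive texts above, so it is treated here as the positive class
is_positive = preds[3][1]['score'] > 0.5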
print(is_positive)

True
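
The same check can be applied to every text at once. The sketch below reuses the scores printed above and assumes, as the outputs suggest, that `LABEL_1` is the positive class. That mapping is an inference from the results, not something the model card is quoted on here.

```python
texts = ['현재 수시 원서 접수를 마친 고3이지.. 어벤져스 엔드게임 또 보며 눈물 콧물 흘리며 왔다..',
         '이게 과연 영화라 부를 수 있는 것인가?', '모르겠다, 그냥 잠이나 잘 걸', '또 보러 와야지, 개꿀잼']

# Scores copied from the printed output above
preds = [
    [{'label': 'LABEL_0', 'score': 0.032331861555576324}, {'label': 'LABEL_1', 'score': 0.9676681756973267}],
    [{'label': 'LABEL_0', 'score': 0.9402838349342346}, {'label': 'LABEL_1', 'score': 0.059716179966926575}],
    [{'label': 'LABEL_0', 'score': 0.874341607093811}, {'label': 'LABEL_1', 'score': 0.12565837800502777}],
    [{'label': 'LABEL_0', 'score': 0.026859769597649574}, {'label': 'LABEL_1', 'score': 0.9731402397155762}],
]

POSITIVE_LABEL = 'LABEL_1'  # assumption: LABEL_1 means positive


def positive_score(scores):
    """Look the positive score up by label name instead of by list position."""
    return next(s['score'] for s in scores if s['label'] == POSITIVE_LABEL)


flags = [positive_score(scores) > 0.5 for scores in preds]

for text, flag in zip(texts, flags):
    print('positive' if flag else 'negative', '|', text)
```

Looking the score up by label name avoids depending on the order in which the pipeline lists the labels.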
