#Chapter 10 Document Classification

#11-1 Naive Bayes Classifier

##1.1 Implementation from scratch

### Naive Bayes Classifier

In [1]:
training_set = [['me free lottery', 1],
 ['free get free you', 1],
 ['you free scholarship', 0],
 ['free to contact me', 0],
 ['you won award', 0],
 ['you ticket lottery', 1]]

### Counting token frequencies and per-class token totals (preparation for the probability calculation)

![Formula image](https://wikimedia.org/api/rest_v1/media/math/render/svg/98f086c560aa2f66650060277dda4f90e54e30c0)

In [14]:
from collections import defaultdict

# Total number of tokens in each class: 1 (spam), 0 (ham)
doccnt0 = 0
doccnt1 = 0

# Per-token frequency in each class: word -> [count in class 0, count in class 1]
wordfreq = defaultdict(lambda : [0, 0])

for doc, label in training_set:
    word_ls = doc.split()
    for word in word_ls:
        wordfreq[word][label] += 1

for key, (cnt0, cnt1) in wordfreq.items():
    doccnt0 += cnt0
    doccnt1 += cnt1

wordfreq

defaultdict(<function __main__.<lambda>()>,
            {'me': [1, 1],
             'free': [2, 3],
             'lottery': [0, 2],
             'get': [0, 1],
             'you': [2, 2],
             'scholarship': [1, 0],
             'to': [1, 0],
             'contact': [1, 0],
             'won': [1, 0],
             'award': [1, 0],
             'ticket': [0, 1]})

In [None]:
doccnt0

10

In [None]:
doccnt1

10
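Note that `doccnt0` and `doccnt1` hold per-class token totals, not document counts. A class prior is more commonly based on the number of documents per class. The sketch below counts documents instead; the names `label_counts` and `priors` are illustrative and do not appear in the notebook.

```python
from collections import Counter

training_set = [['me free lottery', 1],
                ['free get free you', 1],
                ['you free scholarship', 0],
                ['free to contact me', 0],
                ['you won award', 0],
                ['you ticket lottery', 1]]

# Number of documents per label (illustrative helper names)
label_counts = Counter(label for _, label in training_set)
total_docs = sum(label_counts.values())

# Class prior P(c) = documents in class c / all documents
priors = {label: n / total_docs for label, n in label_counts.items()}
print(priors)
```

For this data both priors are 0.5, which happens to match the token-based ratio 10 / 20 used later in the classification step.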

### Training: computing per-token conditional probabilities
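
As implemented in the training cell, each token's conditional probability is smoothed with a constant $k$. Here $\mathrm{count}(w, c)$ corresponds to `wordfreq[w][c]` and $N_c$ to the per-class token total (`doccnt0` or `doccnt1`):

$$
P(w \mid c) = \frac{k + \mathrm{count}(w, c)}{2k + N_c}, \qquad c \in \{0, 1\}
$$

With $k = 0.5$ and $N_c = 10$, a token seen once in a class gets $1.5 / 11 \approx 0.136$, and a token never seen in that class gets $0.5 / 11 \approx 0.045$.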

In [15]:
k = 0.5

wordprobs = defaultdict(lambda : [0, 0])

for key, (cnt0, cnt1) in wordfreq.items():
    wordprobs[key][0]= (k+cnt0)/(2*k+doccnt0)
    wordprobs[key][1]= (k+cnt1)/(2*k+doccnt1)
wordprobs

defaultdict(<function __main__.<lambda>()>,
            {'me': [0.13636363636363635, 0.13636363636363635],
             'free': [0.22727272727272727, 0.3181818181818182],
             'lottery': [0.045454545454545456, 0.22727272727272727],
             'get': [0.045454545454545456, 0.13636363636363635],
             'you': [0.22727272727272727, 0.22727272727272727],
             'scholarship': [0.13636363636363635, 0.045454545454545456],
             'to': [0.13636363636363635, 0.045454545454545456],
             'contact': [0.13636363636363635, 0.045454545454545456],
             'won': [0.13636363636363635, 0.045454545454545456],
             'award': [0.13636363636363635, 0.045454545454545456],
             'ticket': [0.045454545454545456, 0.13636363636363635]})
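
The denominator `2*k + doccnt` resembles the smoothing used for a two-outcome (present/absent) estimate, so the token probabilities within a class do not sum to 1. A common alternative for token counts adds `k` once per vocabulary entry. The sketch below shows that variant; it is not the notebook's method. The counts are copied from the `wordfreq` output so the block runs on its own, and `vocab_size`, `totals`, and `wordprobs_mn` are illustrative names.

```python
# Token counts per class: word -> [count in class 0, count in class 1]
wordfreq = {'me': [1, 1], 'free': [2, 3], 'lottery': [0, 2], 'get': [0, 1],
            'you': [2, 2], 'scholarship': [1, 0], 'to': [1, 0], 'contact': [1, 0],
            'won': [1, 0], 'award': [1, 0], 'ticket': [0, 1]}

k = 0.5
vocab_size = len(wordfreq)  # number of distinct tokens
totals = [sum(c[i] for c in wordfreq.values()) for i in (0, 1)]  # tokens per class

# Add k to every count and k * vocab_size to the denominator
wordprobs_mn = {
    word: [(k + counts[i]) / (k * vocab_size + totals[i]) for i in (0, 1)]
    for word, counts in wordfreq.items()
}

# Each class's probabilities now sum to 1 (up to floating-point error)
print(sum(p[0] for p in wordprobs_mn.values()))
print(sum(p[1] for p in wordprobs_mn.values()))
```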

### Classify: computing the probabilities for a new text



In [18]:
import math

doc = "free lottery"

tokens = doc.split()

# Initialize both log probabilities to 0
log_prob1 = log_prob0 = 0.0

# Iterate over every word in the vocabulary
for word, (prob0, prob1) in wordprobs.items():
    if word in tokens:
        log_prob0 += math.log(prob0)
        log_prob1 += math.log(prob1)

log_prob0 += math.log(doccnt0/(doccnt0+doccnt1))
log_prob1 += math.log(doccnt1/(doccnt0+doccnt1))

prob0 = math.exp(log_prob0)
prob1 = math.exp(log_prob1)


print(doc)
print("Ham probability : {:.2f}%".format(prob0 / (prob0 + prob1)*100))
print("Spam probability : {:.2f}%".format(prob1 / (prob0 + prob1)*100))

free lottery
Ham probability : 12.50%
Spam probability : 87.50%
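
The counting, smoothing, and scoring steps can be collected into two functions. This is a sketch that assumes the same smoothing and the same token-based prior as the cells above; the function names `train` and `spam_probability` are illustrative.

```python
import math
from collections import defaultdict

def train(training_set, k=0.5):
    """Return smoothed per-token probabilities and per-class token totals."""
    wordfreq = defaultdict(lambda: [0, 0])
    for doc, label in training_set:
        for word in doc.split():
            wordfreq[word][label] += 1
    totals = [sum(c[i] for c in wordfreq.values()) for i in (0, 1)]
    wordprobs = {w: [(k + c[i]) / (2 * k + totals[i]) for i in (0, 1)]
                 for w, c in wordfreq.items()}
    return wordprobs, totals

def spam_probability(doc, wordprobs, totals):
    """Posterior probability of class 1 for a whitespace-tokenized document."""
    tokens = set(doc.split())
    # Start from the log prior of each class
    log_probs = [math.log(totals[i] / sum(totals)) for i in (0, 1)]
    for word in tokens:
        if word in wordprobs:  # tokens unseen in training are skipped
            for i in (0, 1):
                log_probs[i] += math.log(wordprobs[word][i])
    p0, p1 = math.exp(log_probs[0]), math.exp(log_probs[1])
    return p1 / (p0 + p1)

training_set = [['me free lottery', 1],
                ['free get free you', 1],
                ['you free scholarship', 0],
                ['free to contact me', 0],
                ['you won award', 0],
                ['you ticket lottery', 1]]

wordprobs, totals = train(training_set)
print("{:.2f}%".format(spam_probability("free lottery", wordprobs, totals) * 100))
```

For "free lottery" this gives the same 87.50% as the cell above. A document made up only of unseen tokens falls back to the prior.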


In [None]:
log_prob1

-3.3198840257871636

In [None]:
prob0

0.00516528925619835

In [None]:
prob1

0.03615702479338842
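
Exponentiating the log scores works for a two-word document, but for long documents the sums can become so negative that `math.exp` underflows to 0. A common workaround, sketched below, subtracts the largest log score before exponentiating. The two joint probabilities are the ones from the "free lottery" example; `normalize_log_scores` is an illustrative name.

```python
import math

def normalize_log_scores(log_scores):
    """Turn unnormalized log scores into probabilities that sum to 1."""
    m = max(log_scores)
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot
    # overflow and the largest term is exactly 1.0
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Joint probabilities for "free lottery": class 0 (ham), class 1 (spam)
log_ham = math.log(0.00516528925619835)
log_spam = math.log(0.03615702479338842)
print(normalize_log_scores([log_ham, log_spam]))

# Still well defined when the raw scores are far too small for exp()
print(normalize_log_scores([-1000.0, -1002.0]))
```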

##1.2 Using sklearn (English news classification)

### Downloading the news data



In [19]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [20]:
print(twenty_train.target_names)
print(twenty_train.data[0])
print(twenty_train.target[0])

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have o

### Document classification (using a pipeline)

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)



In [22]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)


0.7738980350504514
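
To see what a pipeline does without downloading the 20 Newsgroups data, the same kind of pipeline can be fit on the six-message spam example from section 1.1. This is a sketch: the TF-IDF step is left out so the numbers stay easy to check by hand, and `toy_clf` is an illustrative name.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['me free lottery', 'free get free you', 'you free scholarship',
        'free to contact me', 'you won award', 'you ticket lottery']
labels = [1, 1, 0, 0, 0, 1]

# Step names become prefixes for parameters, e.g. 'clf__alpha'
toy_clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
toy_clf.fit(docs, labels)

# predict() takes raw strings; the vectorizer step is applied automatically
print(toy_clf.predict(['free lottery']))
print(toy_clf.predict_proba(['free lottery']))
```

With the default `alpha=1.0` and 11 vocabulary entries, the spam probability for "free lottery" works out to (4·3) / (4·3 + 3·1) = 0.8, slightly lower than the 87.5% from the hand-written version because the smoothing differs.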

###Grid Search

In [27]:
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',MultinomialNB())])

parameters_clf = {'vect__ngram_range' : [(1,1),(2,2)],
                  'tfidf__use_idf' : (True, False),
                  'clf__alpha':(1, 0.5, 0.01)}

gs_clf = GridSearchCV(text_clf, parameters_clf, n_jobs=-1, verbose=2)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)



Fitting 5 folds for each of 12 candidates, totalling 60 fits


In [28]:
print("Best score : {}".format(gs_clf.best_score_))
best_parameters = gs_clf.best_estimator_.get_params()
best_parameters

Best score : 0.9111718793038982


{'memory': None,
 'steps': [('vect', CountVectorizer()),
  ('tfidf', TfidfTransformer()),
  ('clf', MultinomialNB(alpha=0.01))],
 'verbose': False,
 'vect': CountVectorizer(),
 'tfidf': TfidfTransformer(),
 'clf': MultinomialNB(alpha=0.01),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': None,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,
 'tfidf__use_idf': True,
 'clf__alpha': 0.01,
 'clf__class_prior': None,
 'clf__fit_prior': True}
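
`get_params()` lists every parameter of the fitted pipeline, defaults included. When only the searched values matter, the `best_params_` attribute is shorter. A minimal sketch on the toy spam messages, using `cv=3` because each class has only three messages; the variable names are illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['me free lottery', 'free get free you', 'you free scholarship',
        'free to contact me', 'you won award', 'you ticket lottery']
labels = [1, 1, 0, 0, 0, 1]

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
param_grid = {'clf__alpha': [1, 0.5, 0.01]}

# cv=3 keeps one message of each class in every validation fold
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs, labels)

# Only the parameters named in param_grid, with their winning values
print(search.best_params_)
print(search.best_score_)
```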

vs. https://www.kaggle.com/thomastilli/simple-naive-bayes-classier-0-34974-on-lb

###Applying the best parameters

In [29]:
predicted = gs_clf.best_estimator_.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.8352363250132767
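
`np.mean(predicted == twenty_test.target)` gives a single overall accuracy. With 20 categories it can help to see how each class fares. The sketch below computes per-class recall from two label lists using only the standard library; the short `y_true` and `y_pred` lists are made-up stand-ins for `twenty_test.target` and `predicted`.

```python
from collections import Counter

# Made-up labels for illustration only
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2, 2]

# Overall accuracy: share of positions where prediction equals the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall: correct predictions for a class / examples of that class
support = Counter(y_true)
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
recall = {c: correct[c] / support[c] for c in sorted(support)}

print(accuracy)
print(recall)
```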

##1.3 Using sklearn (Korean news classification)

In [30]:
!curl -L -O https://github.com/kyungsoo-fininsight/mulcam_b/raw/master/data/2019news_1000.csv


In [32]:

%%time
import pandas as pd
df = pd.read_csv("./2019news_1000.csv")
df

CPU times: user 116 ms, sys: 17.3 ms, total: 134 ms
Wall time: 136 ms


| Unnamed: 0 | url | category1 | category2 | date | title | media | content |
|---|---|---|---|---|---|---|---|
| 0 | https://news.naver.com/main/read.nhn?mode=LS2D... | IT/과학 | 모바일 | 2019-05-02 | 인권단체 중국, 신장위구르 소수민족 감시용 모바일앱 가동 | 연합뉴스 | HRW "개인 정보 수집·보고서 작성·조사 활동에 앱 활용" "36가지 감시유형…뒷... |
| 1 | https://news.naver.com/main/read.nhn?mode=LS2D... | IT/과학 | 모바일 | 2019-05-09 | 카카오, 1분기 매출 7000억 넘어서…8분기째 최고치 경신(종합) | 뉴시스 | 영업익 166.0%↑·순이익 19. |
| 2 | https://news.naver.com/main/read.nhn?mode=LS2D... | IT/과학 | 과학 일반 | 2019-12-11 | 테라젠이텍스, '유전체 정보 관리 시스템' 특허 취득 | 연합뉴스 | (서울=연합뉴스) 김잔디 기자 = 테라젠이텍스는 유전체 분석 정보 관리 시스템에 관... |
| 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | IT/과학 | 컴퓨터 | 2019-10-14 | 두나무-삼성증권-딥서치, 비상장 주식 통합 거래 지원 플랫폼 출범 | 디지털데일리 | '증권플러스 비상장' 서비스를 설명중인 두나무 이성현 핀테크사업실장 [디지털데일리 ... |
| 4 | https://news.naver.com/main/read.nhn?mode=LS2D... | IT/과학 | 통신/뉴미디어 | 2019-10-24 | 과기정통부, 태풍 미탁 피해, 특별재난지역 전파사용료 6개월간 전액감면 | 전자신문 | 과기정통부 로고 과학기술정보통신부는 18호 태풍 '미탁'으로 인해 특별재난지역으로 ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | https://news.naver.com/main/read.nhn?mode=LS2D... | 생활/문화 | 종교 | 2019-06-29 | [가정예배 365-6월 29일] 부흥의 회복 | 국민일보 | 찬송 : ‘나의 죄를 정케 하사’ 320장(통 350장) 신앙고백 : 사도신경 본문... |
| 4996 | https://news.naver.com/main/read.nhn?mode=LS2D... | 생활/문화 | 건강정보 | 2019-07-09 | 쉽게 구할 수 있는 암 예방 식품 5 | 코메디닷컴 | [사진=jv_food01/gettyimagesbank] 전문가들은 "암은 여러 가지... |
| 4997 | https://news.naver.com/main/read.nhn?mode=LS2D... | 생활/문화 | 날씨 | 2019-08-03 | [오늘의 날씨] 부산·경남(3일, 토)…낮엔 폭염, 밤엔 열대야 | 뉴스1 | 자료사진 © News1 (부산ㆍ경남=뉴스1) 박기범 기자 = 3일 부산·경남은 가끔... |
| 4998 | https://news.naver.com/main/read.nhn?mode=LS2D... | 생활/문화 | 공연/전시 | 2019-07-11 | 아야스·진발라 “알고리즘·무속적 우주론…내년 광주비엔날레엔 ‘인간 지성’ 다룰 것” | 경향신문 | ㆍ예술감독 아야스·진발라, 계획 밝혀 ㆍ40주년 5·18은 다른 방식으로 준비 광주... |


In [None]:
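from sklearn.model_selection import train_test_split
# Assumption: 'content' holds the article text and 'category1' the label to predict
# Return order is: train inputs, validation inputs, train labels, validation labels
train_data, valid_data, train_label, valid_label = train_test_split(df['content'], df['category1'])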
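
`train_test_split` returns the splits grouped by input: both pieces of the first array, then both pieces of the second. A small sketch with placeholder data (the `texts` and `labels` lists are made up) showing the unpacking order together with a stratified split:

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the text and category columns
texts = ['doc {}'.format(i) for i in range(8)]
labels = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']

# Order of the return values: X_train, X_valid, y_train, y_valid
train_data, valid_data, train_label, valid_label = train_test_split(
    texts, labels,
    test_size=0.25,      # 2 of the 8 samples go to the validation split
    stratify=labels,     # keep the class ratio the same in both splits
    random_state=0)      # fixed seed so the split is reproducible

print(len(train_data), len(valid_data))
print(sorted(valid_label))
```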

#11-2 Support Vector Machine (SVM)

###Downloading the news data

###Document classification (using a pipeline)
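
A pipeline analogous to the Naive Bayes one can use a linear SVM as its final step. The following is a minimal sketch, assuming `LinearSVC` as the classifier and the toy spam messages from section 1.1 as data; neither choice comes from the original notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

docs = ['me free lottery', 'free get free you', 'you free scholarship',
        'free to contact me', 'you won award', 'you ticket lottery']
labels = [1, 1, 0, 0, 0, 1]

# Same vectorizer and TF-IDF steps as before; only the classifier changes
svm_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LinearSVC())])
svm_clf.fit(docs, labels)

# decision_function gives the signed score relative to the separating
# hyperplane; positive values fall on the class-1 (spam) side
print(svm_clf.predict(['free lottery']))
print(svm_clf.decision_function(['free lottery']))
```

Unlike `MultinomialNB`, `LinearSVC` has no `predict_proba`, so the signed score from `decision_function` is the closest thing to a confidence value.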

###Grid Search