# 20 Newsgroups Classification

## Inspecting the Text Data

In [1]:
# fetch_20newsgroups: 18,846 newsgroup posts across 20 topics
from sklearn.datasets import fetch_20newsgroups
import numpy as np
# subset: selects which part of the dataset to load
# 'train' for the training set, 'test' for the test set, 'all' for both
news_data = fetch_20newsgroups(subset='all', random_state=100)

In [3]:
print(type(news_data))
print(np.array(news_data.data).shape)

<class 'sklearn.utils._bunch.Bunch'>
(18846,)


In [4]:
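print('Target classes and their counts')
# target_names[i] is the name of label i, so it lines up with np.bincount directly
print(dict(zip(news_data.target_names, np.bincount(news_data.target))))

Target classes and their counts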
{'alt.atheism': 799, 'comp.graphics': 973, 'comp.os.ms-windows.misc': 985, 'comp.sys.ibm.pc.hardware': 982, 'comp.sys.mac.hardware': 963, 'comp.windows.x': 988, 'misc.forsale': 975, 'rec.autos': 990, 'rec.motorcycles': 996, 'rec.sport.baseball': 994, 'rec.sport.hockey': 999, 'sci.crypt': 991, 'sci.electronics': 984, 'sci.med': 990, 'sci.space': 987, 'soc.religion.christian': 997, 'talk.politics.guns': 910, 'talk.politics.mideast': 940, 'talk.politics.misc': 775, 'talk.religion.misc': 628}
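
As a small illustration of the counting step above, `np.bincount` returns how many times each integer label occurs, so position `i` of the result lines up with `target_names[i]`. The labels and names below are made up for this sketch and are not taken from the dataset.

```python
import numpy as np

# Hypothetical toy labels standing in for news_data.target / news_data.target_names
target_names = ['alt.atheism', 'comp.graphics', 'sci.space']
target = np.array([0, 2, 1, 2, 2, 0])

# counts[i] is the number of samples whose label equals i
counts = np.bincount(target)

# Cast to plain int so the dict prints the same way across NumPy versions
distribution = {name: int(n) for name, n in zip(target_names, counts)}
print(distribution)
```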


In [5]:
print(news_data.data[0])

From: ggr@koonda.acci.com.au (Greg Rose)
Subject: Authentication and one-time-pads (was: Re: Advanced one time pad)
Summary: presents one-time-pad based MAC
Organization: Australian Computing and Communications Institute
Lines: 93

In article <1s1dbmINNehb@elang05.acslab.umbc.edu> olson@umbc.edu (Bryan Olson; CMSC (G)) writes:
>The one-time-pad yeilds ideal security, but has a well-known flaw in
>authentication.  Suppose you use a random bit stream as the pad, and
>exclusive-or as the encryption operation.  If an adversary knows the 
>plaintext of a message, he can change it into any other message.  
>Here's how it works.
>
>Alice is sending Bob a plaintext P, under a key stream S
>Alice computes the ciphertext C = S xor P,  and sends it to Bob.
>
>Eve knows the plainext P, but wants the message to appear as P'.
>Eve intercepts C, and computes  C' = C xor P xor P' = S xor P'.
>Eve sends C' to Bob.
>
>Bob decrypts C' by computing  C'xor S = P',  thus receiving the 
>false message which 

In [6]:
news_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), random_state=100)
print(news_data.data[0])


Firstly, an aside:

I agree that the weakness exists, but I have a lot of trouble
believing that it represents a difficulty in real life. Given:

1. the purpose of the one-time pad is to give unbreakable security,
and the expense of key distribution etc., imply that the clients
really do want that level of security

2. These same people want to keep P a secret

I find it hard to believe that Eve might happen to have a copy of P
lying around.

(I am aware that the same argument applies to Eve knowing even a small
part of the message, but Eve must know EXACTLY where (which bytes) in
C her known susequence starts, or the result will be garbled. I find
this at least as surprising.)

Back to the question:

If I had the resources to use a one-time-pad for such transmissions, I
would also append a Message Authentication Code to the message, using up
the next bits of the one-time-pad as the key perhaps. Your original
question basically asked whether there was any way to authenticate the
messa
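
As a rough picture of what removing `'headers'` amounts to, the sketch below drops everything before the first blank line of a post. `strip_header` is a hypothetical helper written only for this illustration. It is a simplification, not scikit-learn's own code, and it does not handle footers or quotes.

```python
def strip_header(text: str) -> str:
    """Return the text after the first blank line (hypothetical helper)."""
    # Newsgroup posts put the header fields first, then a blank line, then the body
    _header, sep, body = text.partition('\n\n')
    # Without a blank line there is no recognizable header, so keep the text unchanged
    return body if sep else text


# Invented post that mimics the layout of the sample shown earlier
raw_post = (
    "From: someone@example.com\n"
    "Subject: one-time pads\n"
    "Lines: 2\n"
    "\n"
    "Firstly, an aside:\n"
    "I agree that the weakness exists."
)
print(strip_header(raw_post))
```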

## Splitting into Training and Test Data

In [7]:
train_news = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), random_state=100)
x_train = train_news.data
y_train = train_news.target

test_news = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), random_state=100)
x_test = test_news.data
y_test = test_news.target

print(f'Training set size: {len(x_train):,}, test set size: {len(x_test):,}')

Training set size: 11,314, test set size: 7,532


## Feature Vectorization

### Using CountVectorizer

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
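from nltk.corpus import stopwords  # needs the NLTK stopwords corpus: nltk.download('stopwords')

cnt_vect = CountVectorizer(stop_words=stopwords.words('english'))
cnt_vect.fit(x_train)  # fit() builds the vocabulary from the training data; the test data must be transformed with this same vocabulary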
x_train_cnt_vect = cnt_vect.transform(x_train)

In [12]:
# Use the CountVectorizer fitted on the training data to transform the test data into feature vectors
x_test_cnt_vect = cnt_vect.transform(x_test)

In [21]:
print(f'Training data vector shape: {x_train_cnt_vect.shape}')  # 11314 documents, 101487 vocabulary terms
print('Vocabulary size:', len(cnt_vect.vocabulary_))
# print(x_train_cnt_vect[0][:10])    # (document 0, n-th term) followed by the count of that term

Training data vector shape: (11314, 101487)
Vocabulary size: 101487
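
To make the fit-then-transform rule concrete, here is a toy sketch with invented sentences. The vocabulary comes only from the training documents, so a word that appears only in the test document has no column and is dropped during `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog barked"]   # made-up training texts
test_docs = ["the cat meowed"]                   # 'meowed' never appears in training

vect = CountVectorizer()
vect.fit(train_docs)                       # vocabulary is built from training data only
x_train_toy = vect.transform(train_docs)
x_test_toy = vect.transform(test_docs)     # reuses the training vocabulary

# Terms listed in column order (vocabulary_ maps term -> column index)
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(x_train_toy.shape, x_test_toy.shape)  # both have the same number of columns
print(x_test_toy.toarray())                 # 'meowed' contributes nothing
```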


### Using TfidfVectorizer (1)

In [24]:
# TF-IDF: each term's weight is its term frequency (tf) multiplied by its inverse document frequency (idf)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words=stopwords.words('english'))
tfidf_vect.fit(x_train)
x_train_tfidf_vect = tfidf_vect.transform(x_train)
x_test_tfidf_vect = tfidf_vect.transform(x_test)
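
The tf × idf product can be worked through by hand on a tiny example. The sketch below uses the formula documented for `TfidfVectorizer`'s default settings (`smooth_idf=True`, `norm='l2'`), namely idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The two pre-tokenized documents are invented, and tokenization and stop-word removal are left out.

```python
import math
from collections import Counter

# Invented, already-tokenized documents
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
n_docs = len(docs)

vocab = sorted({term for doc in docs for term in doc})
# Document frequency: in how many documents each term occurs
df = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed idf: terms found in every document get the minimum weight of 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocab}


def tfidf_row(doc):
    """tf * idf for every vocabulary term, scaled to unit L2 length."""
    tf = Counter(doc)
    raw = [tf[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm > 0 else raw


for doc in docs:
    print([round(v, 3) for v in tfidf_row(doc)])
```

'banana' occurs in both documents, so its idf is the lowest, and 'apple' dominates the first row because it is both frequent in that document and rare across the corpus.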

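### Using TfidfVectorizer (2)
- Apply n-grams with range (1,2)
- max_df: excludes term features whose document frequency across the corpus is too high (an integer is an absolute document count, a float is a proportion of documents)
- min_df: excludes term features whose document frequency across the corpus is too low
- ngram_range: the (minimum n, maximum n) range of n-grams, used to partly compensate for the word order that a bag-of-words model loses
  - With (1,2), the tokenized words are extracted as features one at a time and also as ordered pairs of adjacent words.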
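
A toy sketch of how these options change the vocabulary, using three invented phrases. With `ngram_range=(1, 2)` both single words and adjacent word pairs become features, and an integer `max_df=2` drops any term that occurs in more than 2 documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented phrases; 'apple' occurs in all three documents
docs = ["red apple pie", "green apple tart", "apple juice"]

vect = CountVectorizer(ngram_range=(1, 2), max_df=2)
vect.fit(docs)

features = sorted(vect.vocabulary_)
print(len(features))
print(features)
```

The unigram 'apple' is removed because its document frequency of 3 exceeds `max_df`, while bigrams such as 'apple pie' stay because each of them occurs in only one document.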

In [26]:
tfidf_vect = TfidfVectorizer(stop_words=stopwords.words('english'), ngram_range=(1,2), max_df=300)
tfidf_vect.fit(x_train)
x_train_tfidf_vect = tfidf_vect.transform(x_train)
x_test_tfidf_vect = tfidf_vect.transform(x_test)

In [27]:
print(len(tfidf_vect.vocabulary_))

990299


## Model Training and Evaluation

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(x_train_cnt_vect, y_train)

y_hat = lr_clf.predict(x_test_cnt_vect)
print(f'Accuracy: {accuracy_score(y_test, y_hat):.3f}')

Accuracy: 0.635
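
For reference, `accuracy_score` is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

acc = accuracy_score(y_true, y_pred)
manual = float(np.mean(y_true == y_pred))   # same quantity computed by hand

print(f'{acc:.3f} {manual:.3f}')
```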


In [25]:
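# Judging by the cell execution counts (In [25] ran before In [26]), these are the unigram features from TfidfVectorizer (1)
lr_clf = LogisticRegression(solver='liblinear')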
lr_clf.fit(x_train_tfidf_vect, y_train)

y_hat = lr_clf.predict(x_test_tfidf_vect)
print(f'Accuracy: {accuracy_score(y_test, y_hat):.3f}')

Accuracy: 0.688


## Hyperparameter Tuning

In [29]:
from sklearn.model_selection import GridSearchCV

params = {'C':[0.01, 0.1, 1, 5, 10]}
grid_cv_lr = GridSearchCV(lr_clf, param_grid=params, cv=3, scoring='accuracy')
grid_cv_lr.fit(x_train_tfidf_vect, y_train)
print('Best parameters:', grid_cv_lr.best_params_)
y_hat = grid_cv_lr.best_estimator_.predict(x_test_tfidf_vect)
print(f'Accuracy: {accuracy_score(y_test, y_hat):.3f}')

Best parameters: {'C': 10}
Accuracy: 0.703
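
One possible variation, which is not part of the notebook above: wrapping the vectorizer and the classifier in a `Pipeline` lets `GridSearchCV` tune vectorizer options such as `ngram_range` together with `C`, and refits the vectorizer inside each cross-validation split. The sketch below runs on an invented eight-sentence corpus, so its scores say nothing about the 20 Newsgroups data. The step names `tfidf` and `clf` are arbitrary choices made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: label 0 = sports, label 1 = space
docs = [
    "team wins the match", "goal scored in the match",
    "the team lost the game", "players score a late goal",
    "rocket launch reaches orbit", "satellite enters orbit",
    "the rocket engine failed", "new satellite launch planned",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(solver='liblinear')),
])

# '<step name>__<parameter>' addresses a parameter of one pipeline step
params = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}

# cv=2 only because the toy corpus is tiny
grid = GridSearchCV(pipeline, param_grid=params, cv=2, scoring='accuracy')
grid.fit(docs, labels)   # raw text goes in; vectorizing happens inside the pipeline

print(grid.best_params_)
print(grid.predict(["rocket launch"]))
```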
