## Data : 20 Newsgroups

### Basic exploration

In [1]:
from pprint import pprint
import numpy as np
from sklearn.datasets import fetch_20newsgroups

In [2]:
newsgroups = fetch_20newsgroups(subset='all')

In [3]:
pprint(newsgroups.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [4]:
print(len(newsgroups.data))

18846


In [5]:
print(newsgroups.data[2000])

From: tomh@metrics.com (Tom Haapanen)
Subject: RFD: comp.os.ms-windows.nt.{misc,setup}
Organization: Software Metrics Inc.
Lines: 76
NNTP-Posting-Host: rodan.uu.net

This is the official Request for Discussion (RFD) for the creation of two
new newsgroups for Microsoft Windows NT.  This is a second RFD, replacing
the one originally posted in January '93 (and never taken to a vote).  The
proposed groups are described below:

NAME: 	 comp.os.ms-windows.nt.setup
STATUS:  Unmoderated.
PURPOSE: Discussions about setting up and installing Windows NT, and about
	 system and peripheral compatability issues for Windows NT.

NAME:	 comp.os.ms-windows.nt.misc
STATUS:	 Unmoderated.
PURPOSE: Miscellaneous non-programming discussions about using Windows NT,
	 including issues such as security, networking features, console
	 mode and Windows 3.1 (Win16) compatability.

RATIONALE:
	Microsoft NT is the newest member of the Microsoft Windows family
	of operating systems (or operating environments for tho

In [6]:
print(newsgroups.target[2000])

2


In [7]:
# Indexing target_names with a target value returns the actual class name
newsgroups.target_names[2]

'comp.os.ms-windows.misc'

In [8]:
print(newsgroups.target_names[newsgroups.target[2000]])

comp.os.ms-windows.misc


In [9]:
def get_content(newsgroups, news_id):
    print('[CONTENT]')
    print(newsgroups.data[news_id])
    print('\n[CLASS]: %s (index = %d)'%(newsgroups.target_names[newsgroups.target[news_id]], newsgroups.target[news_id]))

In [10]:
get_content(newsgroups, 11111)

[CONTENT]
From: g_waugaman@nac.enet.dec.com (Glenn R. Waugaman)
Subject: Re: I've found the secret!
Article-I.D.: nntpd.1993Apr15.193907.24177
Organization: Digital Equipment Corporation
Lines: 23


In article <1993Apr15.161730.9903@cs.cornell.edu>, tedward@cs.cornell.edu (Edward [Ted] Fischer) writes...
> 
>Why are the Red Sox in first place?  Eight games into the season, they
>already have two wins each from Clemens and Viola.  Clemens starts
>again tonight, on three days rest.
> 
>What's up?  Are the Sox going with a four-man rotation?  Is this why
>Hesketh was used in relief last night?

Clemens is going on his normal four days' rest (last pitched Saturday). 
Hesketh only pitched one inning yesterday afternoon, his first outing
since an aborted 1-1/3 inning start 6 days before, so he should be plenty
rested to go in his expected turn this Saturday, as the 5th starter.  Not
that this is a good thing, of course.  I'd like to see a well-managed
four-man rotation with this team... 

--

In [11]:
get_content(newsgroups, 0)

[CONTENT]
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!



[CLASS]: rec.sport.hockey (index = 10)


In [12]:
target_id, counts = np.unique(newsgroups.target, return_counts=True)

In [13]:
dict(zip(target_id, counts))

{0: 799,
 1: 973,
 2: 985,
 3: 982,
 4: 963,
 5: 988,
 6: 975,
 7: 990,
 8: 996,
 9: 994,
 10: 999,
 11: 991,
 12: 984,
 13: 990,
 14: 987,
 15: 997,
 16: 910,
 17: 940,
 18: 775,
 19: 628}

In [14]:
dict(zip(newsgroups.target_names, counts))

{'alt.atheism': 799,
 'comp.graphics': 973,
 'comp.os.ms-windows.misc': 985,
 'comp.sys.ibm.pc.hardware': 982,
 'comp.sys.mac.hardware': 963,
 'comp.windows.x': 988,
 'misc.forsale': 975,
 'rec.autos': 990,
 'rec.motorcycles': 996,
 'rec.sport.baseball': 994,
 'rec.sport.hockey': 999,
 'sci.crypt': 991,
 'sci.electronics': 984,
 'sci.med': 990,
 'sci.space': 987,
 'soc.religion.christian': 997,
 'talk.politics.guns': 910,
 'talk.politics.mideast': 940,
 'talk.politics.misc': 775,
 'talk.religion.misc': 628}

## Training and test data

In [15]:
newsgroups_trn = fetch_20newsgroups(subset='train')
newsgroups_tst = fetch_20newsgroups(subset='test')

In [16]:
print("Number of training points : %d"%len(newsgroups_trn.data))
print("Number of test points : %d"%len(newsgroups_tst.data))

Number of training points : 11314
Number of test points : 7532


## More...
`sklearn.datasets.fetch_20newsgroups` includes an option (`remove`) for stripping parts of each post that are unrelated to its main content:
1. headers: [source](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/datasets/twenty_newsgroups.py#L110)
2. footers: [source](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/datasets/twenty_newsgroups.py#L134)
3. quotes: [source](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/datasets/twenty_newsgroups.py#L123)
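To make the idea behind these options concrete, here is a minimal sketch of header and quote stripping. It is a simplified illustration, not scikit-learn's actual code: the function names `drop_header` and `drop_quotes` and the sample post are hypothetical, and the library's own heuristics for quotes and footers are broader than what is shown here.

```python
def drop_header(text):
    """Return everything after the first blank line, where the header block ends."""
    _header, _sep, body = text.partition("\n\n")
    return body


def drop_quotes(text):
    """Drop lines that look like quoted text or attribution lines (simplified rule)."""
    kept = [
        line for line in text.split("\n")
        if not (line.startswith(">") or line.startswith("In article"))
    ]
    return "\n".join(kept)


# Hypothetical post, shaped like the examples printed in this notebook.
post = (
    "From: someone@example.com\n"
    "Subject: Re: pitching rotation\n"
    "\n"
    "In article <123@example.com>, other@example.com writes...\n"
    ">Are the Sox going with a four-man rotation?\n"
    "\n"
    "Clemens is going on his normal four days' rest."
)

cleaned = drop_quotes(drop_header(post)).strip()
print(cleaned)
```

Stripping this metadata matters for classification because headers such as `Subject:` or `Organization:` can reveal the newsgroup almost directly, which lets a model score well without learning much from the body text.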



In [17]:
newsgroups_removed = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

In [18]:
get_content(newsgroups, 11111)

[CONTENT]
From: g_waugaman@nac.enet.dec.com (Glenn R. Waugaman)
Subject: Re: I've found the secret!
Article-I.D.: nntpd.1993Apr15.193907.24177
Organization: Digital Equipment Corporation
Lines: 23


In article <1993Apr15.161730.9903@cs.cornell.edu>, tedward@cs.cornell.edu (Edward [Ted] Fischer) writes...
> 
>Why are the Red Sox in first place?  Eight games into the season, they
>already have two wins each from Clemens and Viola.  Clemens starts
>again tonight, on three days rest.
> 
>What's up?  Are the Sox going with a four-man rotation?  Is this why
>Hesketh was used in relief last night?

Clemens is going on his normal four days' rest (last pitched Saturday). 
Hesketh only pitched one inning yesterday afternoon, his first outing
since an aborted 1-1/3 inning start 6 days before, so he should be plenty
rested to go in his expected turn this Saturday, as the 5th starter.  Not
that this is a good thing, of course.  I'd like to see a well-managed
four-man rotation with this team... 

--

In [19]:
get_content(newsgroups_removed, 11111)

[CONTENT]

Clemens is going on his normal four days' rest (last pitched Saturday). 
Hesketh only pitched one inning yesterday afternoon, his first outing
since an aborted 1-1/3 inning start 6 days before, so he should be plenty
rested to go in his expected turn this Saturday, as the 5th starter.  Not
that this is a good thing, of course.  I'd like to see a well-managed
four-man rotation with this team... 

---
Glenn Waugaman
Digital Equipment Corporation
Littleton, MA
g_waugaman@nac.enet.dec.com

[CLASS]: rec.sport.baseball (index = 9)


In [20]:
get_content(newsgroups, 2000)

[CONTENT]
From: tomh@metrics.com (Tom Haapanen)
Subject: RFD: comp.os.ms-windows.nt.{misc,setup}
Organization: Software Metrics Inc.
Lines: 76
NNTP-Posting-Host: rodan.uu.net

This is the official Request for Discussion (RFD) for the creation of two
new newsgroups for Microsoft Windows NT.  This is a second RFD, replacing
the one originally posted in January '93 (and never taken to a vote).  The
proposed groups are described below:

NAME: 	 comp.os.ms-windows.nt.setup
STATUS:  Unmoderated.
PURPOSE: Discussions about setting up and installing Windows NT, and about
	 system and peripheral compatability issues for Windows NT.

NAME:	 comp.os.ms-windows.nt.misc
STATUS:	 Unmoderated.
PURPOSE: Miscellaneous non-programming discussions about using Windows NT,
	 including issues such as security, networking features, console
	 mode and Windows 3.1 (Win16) compatability.

RATIONALE:
	Microsoft NT is the newest member of the Microsoft Windows family
	of operating systems (or operating environmen

In [21]:
get_content(newsgroups_removed, 2000)

[CONTENT]
This is the official Request for Discussion (RFD) for the creation of two
new newsgroups for Microsoft Windows NT.  This is a second RFD, replacing
the one originally posted in January '93 (and never taken to a vote).  The
proposed groups are described below:

NAME: 	 comp.os.ms-windows.nt.setup
STATUS:  Unmoderated.
PURPOSE: Discussions about setting up and installing Windows NT, and about
	 system and peripheral compatability issues for Windows NT.

NAME:	 comp.os.ms-windows.nt.misc
STATUS:	 Unmoderated.
PURPOSE: Miscellaneous non-programming discussions about using Windows NT,
	 including issues such as security, networking features, console
	 mode and Windows 3.1 (Win16) compatability.

RATIONALE:
	Microsoft NT is the newest member of the Microsoft Windows family
	of operating systems (or operating environments for those who wish
	to argue about the meaning of an "OS").  The family ranges from
	Modular Windows through Windows 3.1 and Windows for Workgroups to
	Windows NT 

## Newsgroups_classification_simple_sklearn

In [22]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier # used here for the first time
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold, KFold, GridSearchCV

## 0. Data loading and preprocessing

### Data : 20newsgroups
- http://qwone.com/~jason/20Newsgroups/
- http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups

In [23]:
categories = ['rec.sport.baseball', 'soc.religion.christian', 'comp.windows.x', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories)
X_train = newsgroups_train.data
Y_train = newsgroups_train.target
X_test = newsgroups_test.data
Y_test = newsgroups_test.target

In [24]:
# Declare two vectorizers (min_df=40 keeps only terms that appear in at least 40 documents)
count_vectorizer = CountVectorizer(min_df=40)
tfidf_vectorizer = TfidfVectorizer(min_df=40)

In [25]:
# Fitting vectorizers to the training set
count_vectorizer = count_vectorizer.fit(X_train)
tfidf_vectorizer = tfidf_vectorizer.fit(X_train)

In [26]:
# Transform X_train and X_test using 2 vectorizers
X_train_count = count_vectorizer.transform(X_train)
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_count = count_vectorizer.transform(X_test)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [27]:
X_train_count.shape

(2382, 891)

In [28]:
X_train_tfidf.shape

(2382, 891)

In [29]:
print(X_train_count[0, :])

  (0, 60)	1
  (0, 63)	1
  (0, 64)	1
  (0, 67)	1
  (0, 74)	1
  (0, 105)	1
  (0, 137)	1
  (0, 143)	1
  (0, 144)	1
  (0, 228)	3
  (0, 233)	1
  (0, 250)	1
  (0, 301)	2
  (0, 359)	1
  (0, 374)	2
  (0, 384)	1
  (0, 385)	1
  (0, 406)	2
  (0, 421)	1
  (0, 443)	1
  (0, 540)	1
  (0, 587)	1
  (0, 603)	1
  (0, 651)	1
  (0, 665)	1
  (0, 760)	1
  (0, 761)	1
  (0, 762)	2
  (0, 785)	2
  (0, 818)	1
  (0, 819)	3
  (0, 861)	1
  (0, 881)	3


In [30]:
print(X_train_tfidf[0, :])

  (0, 881)	0.576862404502
  (0, 861)	0.0728941164216
  (0, 819)	0.398347048389
  (0, 818)	0.179213589409
  (0, 785)	0.0988093651309
  (0, 762)	0.0902077527633
  (0, 761)	0.058409513562
  (0, 760)	0.126509695262
  (0, 665)	0.113406230175
  (0, 651)	0.124142000922
  (0, 603)	0.181009461643
  (0, 587)	0.128065574918
  (0, 540)	0.0531646115053
  (0, 443)	0.0951448117918
  (0, 421)	0.100418829305
  (0, 406)	0.121579076985
  (0, 385)	0.167644109262
  (0, 384)	0.0550409424255
  (0, 374)	0.202261823679
  (0, 359)	0.13653670152
  (0, 301)	0.124485489929
  (0, 250)	0.163907662217
  (0, 233)	0.140253453503
  (0, 228)	0.262696725634
  (0, 144)	0.157695146004
  (0, 143)	0.0802208063813
  (0, 137)	0.0711835823945
  (0, 105)	0.0669565451494
  (0, 74)	0.162354864317
  (0, 67)	0.092112478996
  (0, 64)	0.0526512312659
  (0, 63)	0.0818953815887
  (0, 60)	0.112756441459
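The two printouts above list the same nonzero columns for document 0, once as raw counts and once as tf-idf weights. As a rough illustration of how the second set of values arises, the sketch below computes both representations on a tiny made-up corpus using only the standard library. It assumes the smoothed idf formula and L2 row normalization that `TfidfVectorizer` uses by default, and it tokenizes with a plain `split()`, which is simpler than the vectorizer's own tokenizer; the corpus and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the 20 Newsgroups data).
docs = [
    "the pitcher threw the ball",
    "the shuttle reached orbit",
    "the pitcher rested",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}


def count_row(tokens):
    """Raw term counts in vocabulary order."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]


def tfidf_row(tokens):
    """Counts scaled by idf, then normalized to unit L2 length."""
    weighted = [c * idf[tok] for c, tok in zip(count_row(tokens), vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm for w in weighted]


counts0 = count_row(tokenized[0])
tfidf0 = tfidf_row(tokenized[0])
for tok, c, w in zip(vocab, counts0, tfidf0):
    if c:
        print(f"{tok:8s} count={c} tfidf={w:.3f}")
```

A term that occurs in every document (here `the`) gets the minimum idf of 1, so its weight relative to rarer terms shrinks even when its raw count is the highest in the row.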


# 1. Fitting classifiers with count vectorizer

In [31]:
num_folds = 5
num_instances = len(X_train)
seed = 1234
scoring = 'accuracy'

## 1.1 Logistic Regression
- regularization : L1, L2
- C

In [32]:
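# liblinear supports both 'l1' and 'l2' penalties; set it explicitly because
# newer scikit-learn versions default to a solver that rejects 'l1'
model = LogisticRegression(solver='liblinear')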
penalty_set = ['l1', 'l2']
C_set = [0.5, 1]
param_grid = dict(penalty=penalty_set, C=C_set)

In [33]:
# Using count vectorizer
clf = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=num_folds, n_jobs=-1, verbose=1)
clf.fit(X_train_count, Y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Done   5 out of  20 | elapsed:    0.2s remaining:    0.7s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    1.5s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.5, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [34]:
clf.cv_results_



{'mean_fit_time': array([ 0.1882442 ,  0.59979267,  0.71771169,  0.38417873]),
 'mean_score_time': array([ 0.00148969,  0.00120149,  0.00098805,  0.00059419]),
 'mean_test_score': array([ 0.85222502,  0.84592779,  0.85180521,  0.84466835]),
 'mean_train_score': array([ 0.93304223,  0.97029806,  0.95969939,  0.97942856]),
 'param_C': masked_array(data = [0.5 0.5 1 1],
              mask = [False False False False],
        fill_value = ?),
 'param_penalty': masked_array(data = ['l1' 'l2' 'l1' 'l2'],
              mask = [False False False False],
        fill_value = ?),
 'params': [{'C': 0.5, 'penalty': 'l1'},
  {'C': 0.5, 'penalty': 'l2'},
  {'C': 1, 'penalty': 'l1'},
  {'C': 1, 'penalty': 'l2'}],
 'rank_test_score': array([1, 3, 2, 4], dtype=int32),
 'split0_test_score': array([ 0.86610879,  0.85564854,  0.87029289,  0.85146444]),
 'split0_train_score': array([ 0.93644958,  0.97006303,  0.95798319,  0.97846639]),
 'split1_test_score': array([ 0.83263598,  0.83263598,  0.82217573,  0.

In [35]:
print("Best params : ", clf.best_params_)
print("Best test : ", scoring, ":", clf.best_score_)

Best params :  {'C': 0.5, 'penalty': 'l1'}
Best test :  accuracy : 0.852225020991


In [36]:
best_logistic_count = clf.best_estimator_

## 1.2 MLPClassifier
Vary the size of the hidden layers:
- one hidden layer (100 nodes)
- two hidden layers (100 nodes each)
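To give a feel for how much the second hidden layer adds, the sketch below counts the weights and biases of each configuration. It assumes fully connected layers, 891 input features (the vocabulary size reported by `X_train_count.shape`), and 4 output units, one per class; `count_parameters` is a hypothetical helper, not part of scikit-learn.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases in a fully connected feed-forward network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(sizes, sizes[1:])
    )


n_features = 891  # columns of X_train_count
n_classes = 4     # the four selected newsgroups

for hidden in [(100,), (100, 100)]:
    print(hidden, count_parameters(n_features, hidden, n_classes))
```

Under these assumptions the first layer dominates the parameter count, so the two-layer network is only about 11% larger than the one-layer network.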

In [37]:
model = MLPClassifier(learning_rate_init=0.01, max_iter=300)

hidden_layer_sizes_set = [(100, ), (100, 100)]
param_grid = dict(hidden_layer_sizes=hidden_layer_sizes_set)

In [38]:
# Using count vectorizer
clf = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=num_folds, verbose=1, n_jobs=-1)
clf.fit(X_train_count, Y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    8.7s remaining:    5.8s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    9.0s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.01, max_iter=300, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'hidden_layer_sizes': [(100,), (100, 100)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [39]:
clf.cv_results_



{'mean_fit_time': array([ 8.08860846,  4.56226525]),
 'mean_score_time': array([ 0.01542382,  0.02134137]),
 'mean_test_score': array([ 0.8324937 ,  0.84130982]),
 'mean_train_score': array([ 0.98698487,  0.9803675 ]),
 'param_hidden_layer_sizes': masked_array(data = [(100,) (100, 100)],
              mask = [False False],
        fill_value = ?),
 'params': [{'hidden_layer_sizes': (100,)},
  {'hidden_layer_sizes': (100, 100)}],
 'rank_test_score': array([2, 1], dtype=int32),
 'split0_test_score': array([ 0.84728033,  0.83472803]),
 'split0_train_score': array([ 0.98581933,  0.95798319]),
 'split1_test_score': array([ 0.81799163,  0.83682008]),
 'split1_train_score': array([ 0.98634454,  0.98371849]),
 'split2_test_score': array([ 0.83018868,  0.82599581]),
 'split2_train_score': array([ 0.98635171,  0.98372703]),
 'split3_test_score': array([ 0.83157895,  0.85473684]),
 'split3_train_score': array([ 0.98793917,  0.98793917]),
 'split4_test_score': array([ 0.83544304,  0.85443038]),
 '

In [40]:
print("Best params : ", clf.best_params_)
print("Best test : ", scoring, ":", clf.best_score_)

Best params :  {'hidden_layer_sizes': (100, 100)}
Best test :  accuracy : 0.841309823678


In [41]:
best_mlp_count = clf.best_estimator_

## 1.3 Comparing the two models
Select the best-performing logistic regression model and the best-performing MLP model, then compare their performance on the test data.

In [42]:
best_models_count = []
best_models_count.append(("LogisticRegression", best_logistic_count))
best_models_count.append(("MLPClassifier", best_mlp_count))

In [43]:
results = []
scores = []
names = []
for name, model in best_models_count:
    Y_test_hat = model.predict(X_test_count)
    results.append(metrics.confusion_matrix(Y_test, Y_test_hat))
    scores.append(metrics.accuracy_score(Y_test, Y_test_hat))
    names.append(name)

In [44]:
for name, score, cm in list(zip(names, scores, results)):
    print('\n[%s]'%name)
    print('- test accuracy: %f'%score)
    print("- confusion matrix: \n", cm)


[LogisticRegression]
- test accuracy: 0.828283
- confusion matrix: 
 [[327  38  25   5]
 [  4 352  28  13]
 [ 19  46 311  18]
 [  9  38  29 322]]

[MLPClassifier]
- test accuracy: 0.791035
- confusion matrix: 
 [[323  23  27  22]
 [  7 306  32  52]
 [ 19  34 272  69]
 [  8  25  13 352]]
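The accuracies above can be recovered directly from the confusion matrices, which also show where each model goes wrong. The sketch below recomputes accuracy and per-class recall from the two printed matrices using plain Python. It assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes); the helper names are hypothetical.

```python
# Confusion matrices copied from the output above.
cm_logistic = [
    [327, 38, 25, 5],
    [4, 352, 28, 13],
    [19, 46, 311, 18],
    [9, 38, 29, 322],
]
cm_mlp = [
    [323, 23, 27, 22],
    [7, 306, 32, 52],
    [19, 34, 272, 69],
    [8, 25, 13, 352],
]


def accuracy(cm):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm):
    """For each true class, the fraction of its samples predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


for name, cm in [("LogisticRegression", cm_logistic), ("MLPClassifier", cm_mlp)]:
    recalls = ", ".join(f"{r:.3f}" for r in per_class_recall(cm))
    print(f"{name}: accuracy={accuracy(cm):.6f}, recall per class=[{recalls}]")
```

Per-class recall makes the gap easier to read than the raw matrices: the MLP matches or beats logistic regression on the last class but loses noticeably on the second and third.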


# 2. Fitting classifiers with tf-idf vectorizer

In [45]:
# Predefined options
num_folds = 5
num_instances = len(X_train)
seed = 1234
scoring = 'accuracy'

## 2.1 Logistic Regression
- Same setup as above, with a wider range of C values

In [46]:
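# liblinear supports both 'l1' and 'l2' penalties; set it explicitly because
# newer scikit-learn versions default to a solver that rejects 'l1'
model = LogisticRegression(solver='liblinear')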
penalty_set = ['l1', 'l2']
C_set = [0.1, 1, 10, 100]
param_grid = dict(penalty=penalty_set, C=C_set)

In [47]:
# Using tf-idf vectorizer
clf = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=num_folds, n_jobs=-1, verbose=1)
clf.fit(X_train_tfidf, Y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    1.0s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [48]:
clf.cv_results_



{'mean_fit_time': array([ 0.06846523,  0.11430616,  0.09718485,  0.12800169,  0.14617949,
         0.18747039,  0.16612148,  0.18234448]),
 'mean_score_time': array([ 0.00136476,  0.00123553,  0.00121598,  0.00083847,  0.00105443,
         0.0008441 ,  0.00087004,  0.00070577]),
 'mean_test_score': array([ 0.60537364,  0.82157851,  0.8434089 ,  0.85642317,  0.85516373,
         0.86607893,  0.84760705,  0.85558354]),
 'mean_train_score': array([ 0.61912064,  0.85998917,  0.8929446 ,  0.92999613,  0.98667013,
         0.98341674,  0.98698487,  0.98698487]),
 'param_C': masked_array(data = [0.1 0.1 1 1 10 10 100 100],
              mask = [False False False False False False False False],
        fill_value = ?),
 'param_penalty': masked_array(data = ['l1' 'l2' 'l1' 'l2' 'l1' 'l2' 'l1' 'l2'],
              mask = [False False False False False False False False],
        fill_value = ?),
 'params': [{'C': 0.1, 'penalty': 'l1'},
  {'C': 0.1, 'penalty': 'l2'},
  {'C': 1, 'penalty': 'l1'},


In [49]:
print("Best params : ", clf.best_params_)
print("Best test : ", scoring, ":", clf.best_score_)

Best params :  {'C': 10, 'penalty': 'l2'}
Best test :  accuracy : 0.866078925273


In [50]:
best_logistic_tfidf = clf.best_estimator_

## 2.2 MLPClassifier
- Same as above

In [51]:
model = MLPClassifier(learning_rate_init=0.01, max_iter=300)
hidden_layer_sizes_set = [(100, ), (100, 100)]
param_grid = dict(hidden_layer_sizes=hidden_layer_sizes_set)

In [52]:
clf = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=num_folds, n_jobs=-1, verbose=1)
clf.fit(X_train_tfidf, Y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    6.0s remaining:    4.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    6.7s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.01, max_iter=300, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'hidden_layer_sizes': [(100,), (100, 100)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [53]:
clf.cv_results_



{'mean_fit_time': array([ 5.86700387,  3.44230008]),
 'mean_score_time': array([ 0.02879663,  0.02137346]),
 'mean_test_score': array([ 0.84760705,  0.85264484]),
 'mean_train_score': array([ 0.98698487,  0.98698487]),
 'param_hidden_layer_sizes': masked_array(data = [(100,) (100, 100)],
              mask = [False False],
        fill_value = ?),
 'params': [{'hidden_layer_sizes': (100,)},
  {'hidden_layer_sizes': (100, 100)}],
 'rank_test_score': array([2, 1], dtype=int32),
 'split0_test_score': array([ 0.85983264,  0.87029289]),
 'split0_train_score': array([ 0.98581933,  0.98581933]),
 'split1_test_score': array([ 0.85146444,  0.84518828]),
 'split1_train_score': array([ 0.98634454,  0.98634454]),
 'split2_test_score': array([ 0.83857442,  0.85953878]),
 'split2_train_score': array([ 0.98635171,  0.98635171]),
 'split3_test_score': array([ 0.84842105,  0.84210526]),
 'split3_train_score': array([ 0.98793917,  0.98793917]),
 'split4_test_score': array([ 0.83966245,  0.84599156]),
 '

In [54]:
print("Best params : ", clf.best_params_)
print("Best test : ", scoring, ":", clf.best_score_)

Best params :  {'hidden_layer_sizes': (100, 100)}
Best test :  accuracy : 0.852644836272


In [55]:
best_mlp_tfidf = clf.best_estimator_

## 2.3 Comparing the two models

In [56]:
best_models = []
best_models.append(('LogisticRegression', best_logistic_tfidf))
best_models.append(('MLPClassifier', best_mlp_tfidf))

In [57]:
results = []
scores  = []
names   = []
for name, model in best_models:
    Y_test_hat = model.predict(X_test_tfidf)
    results.append(metrics.confusion_matrix(Y_test, Y_test_hat))
    scores.append(metrics.accuracy_score(Y_test, Y_test_hat))
    names.append(name)

In [58]:
for name, score, cm in list(zip(names, scores, results)):
    print('\n[%s]' % name)
    print('- test accuracy: %f' % score)
    print('- confusion matrix :\n', cm)


[LogisticRegression]
- test accuracy: 0.838384
- confusion matrix :
 [[335  31  23   6]
 [ 10 340  27  20]
 [ 20  41 306  27]
 [ 10  27  14 347]]

[MLPClassifier]
- test accuracy: 0.818813
- confusion matrix :
 [[334  21  29  11]
 [ 10 325  36  26]
 [ 24  44 299  27]
 [ 10  31  18 339]]
