## 20 newsgroups Dataset
- 대표적인 Text 분류 Toy dataset
- 20개의 뉴스 테스트 데이터를 분류하라 !
- Multiclass classification
- 약 20,000개의 news document 존재

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
news.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [3]:
print(news.data[0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [4]:
news.target[0]

10

In [5]:
news.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## 뉴스 분류기 모델 만들기

* 데이터 파악
* 전처리(Preprocessing)

    * 필요없는 단어 제거 (Data Cleansing)
    * CountVectorizer & Tf-idfVectorizer

---
    
* Modeling : BernoulliNB, MultinomialNB 사용
  * Cross Validation(Kfold 이용)
  
---

* Pipeline 이용

---

* Assignment Description
     * 위 신문 데이터를 바탕으로 신문 내용별 분류기를 개발하라
     * 위 데이터를 Traing / Test Dataset으로 나눠서 5-fold cross validation(5번 데이터를 training / testset으로 나눔, KV 활용)
     * Naive Bayesian Classifier와 Count Vector를 활용하여 각각 성능을 테스트하라
         * NB는 multinomial과 bernuoil 분포를 모두 사용하라
     * 가능할 경우, TF-IDF vector를 활용해 볼것 (검색어 - tf-idf scikit-learn)

# Dataset
   * 18846개의 데이터

In [6]:
news_df = pd.DataFrame({'News' : news.data, 'Target' : news.target})

In [7]:
news_df.head()

Unnamed: 0,News,Target
0,From: Mamatha Devineni Ratnam <mr47+@andrew.cm...,10
1,From: mblawson@midway.ecn.uoknor.edu (Matthew ...,3
2,From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...,17
3,From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...,3
4,From: Alexander Samuel McDiarmid <am2o+@andrew...,4


In [8]:
# Target 데이터 -> 문자 라벨링(뉴스마다 어떤 뉴스인지 보기 편하도록 만들기 위해서)
def word_labeling(lst, df):
    for idx, name in enumerate(lst):
        target_data = df['Target']
        for idx_, num_label in enumerate(target_data):
            if num_label == idx:
                df.loc[idx_, 'Target'] = name
    return df
news_df = word_labeling(news['target_names'], news_df)
news_df.head()

Unnamed: 0,News,Target
0,From: Mamatha Devineni Ratnam <mr47+@andrew.cm...,rec.sport.hockey
1,From: mblawson@midway.ecn.uoknor.edu (Matthew ...,comp.sys.ibm.pc.hardware
2,From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...,talk.politics.mideast
3,From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...,comp.sys.ibm.pc.hardware
4,From: Alexander Samuel McDiarmid <am2o+@andrew...,comp.sys.mac.hardware


* Data Cleansing
    * 이메일 제거
    * 불필요 숫자 제거
    * 문자 아닌 특수문자 제거
    * 단어 사이 공백 제거 : 띄어쓰기 별로 split해주고 join

In [9]:
def data_cleansing(df):
    delete_email = re.sub(r'\b[\w\+]+@[\w]+.[\w]+.[\w]+.[\w]+\b', ' ', df)
    delete_number = re.sub(r'\b|\d+|\b', ' ',delete_email)
    delete_non_word = re.sub(r'\b[\W]+\b', ' ', delete_number)
    cleaning_result = ' '.join(delete_non_word.split())
    return cleaning_result 

In [10]:
news_df.loc[:, 'News'] = news_df['News'].apply(data_cleansing)
news_df.head()

Unnamed: 0,News,Target
0,From Mamatha Devineni Ratnam Subject Pens fans...,rec.sport.hockey
1,From Matthew B Lawson Subject Which high perfo...,comp.sys.ibm.pc.hardware
2,From hilmi Hilmi Eren Subject Re ARMENIA SAYS ...,talk.politics.mideast
3,From Guy Dawson Subject Re IDE vs SCSI DMA and...,comp.sys.ibm.pc.hardware
4,From Alexander Samuel McDiarmid Subject driver...,comp.sys.mac.hardware


# Vectorizer
* CountVectorizer 
  * 문서 집합으로부터 단어의 수를 세어 카운트 행렬을 만듦
* TfidfVectorizer 
    * 단어를 갯수 그대로 카운트하지 않고 모든 문서에 공통적으로 들어있는 단어의 경우 문서 구별 능력이 떨어진다고 보아 가중치를 축소하는 방법
    * TF(Term Frequency) : 문서에서 해당 단어가 얼마나 나왔는지 나타내주는 빈도 수
    * DF(Document Frequency) : 해당 단어가 있는 문서의 수
    * IDF(Inverse Document Frequency) 해당 단어가 있는 문서의 수가 높아질 수록 가중치를 축소해주기 위해 역수 취해줌
        * log(N / (1 + DF))      
            * N : 전체 문서의 수
    * TF-IDF = TF * IDF
* CustomizedVectorizer - StemmedCounterVectorizer, StemmedTfidfVectorizer 

In [11]:
from nltk import stem
stmmer = stem.SnowballStemmer("english")
sentence = 'looking looks looked'
[stmmer.stem(word) for word in sentence.split()]

['look', 'look', 'look']

In [12]:
stmmer.stem("images"), stmmer.stem("imaging"), stmmer.stem("imagination")  # 명사, 동사 구분 가능

('imag', 'imag', 'imagin')

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
import  nltk
enlish_stemmer = nltk.stem.SnowballStemmer("english")

class StemmedCountVectorizer(CountVectorizer): # CountVectorizer의 자식 class 
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer,self).build_analyzer() # CountVectorizer의 analyzer 선언
        return lambda doc: (enlish_stemmer.stem(w) for w in analyzer(doc)) # Stemming 만 추가로 적용하여 반환

In [14]:
StemmedCountVectorizer(min_df=1, stop_words="english").fit([sentence]).vocabulary_

{'look': 0}

In [15]:
CountVectorizer(min_df=1, stop_words="english").fit([sentence]).vocabulary_

{'looking': 1, 'looks': 2, 'looked': 0}

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

enlish_stemmer = nltk.stem.SnowballStemmer("english")
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer,self).build_analyzer()
        return lambda doc: (enlish_stemmer.stem(w) for w in analyzer(doc))

# Modeling
* Pipeline
* Gridsearch
* Cross Validation

In [17]:
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), 3)
for grams in sixgrams:
  print (grams)

('this', 'is', 'a')
('is', 'a', 'foo')
('a', 'foo', 'bar')
('foo', 'bar', 'sentences')
('bar', 'sentences', 'and')
('sentences', 'and', 'i')
('and', 'i', 'want')
('i', 'want', 'to')
('want', 'to', 'ngramize')
('to', 'ngramize', 'it')


In [18]:
from sklearn.base import TransformerMixin, BaseEstimator

# GaussianNB 알고리즘 - Dense한 데이터를 입력값으로 가짐
# vectorizer의 output : CSR form
# CSR form을 Dense form으로 변환하기 위한 class
class DenseTransformer(BaseEstimator, TransformerMixin):

    def transform(self, X, y=None, **fit_params):
        return X.todense()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

## Modeling Plan - Pipeline

Vectorizer
- Tfidf
- Count
- StemmedTfidf
- StemmedCount

Algorithm
- LogisticRegression
- Bernoulli NB
- Multinomial NB
- Gaussian NB

Metrics
- CV 2 times
- Accuracy

In [19]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

vectorizer = [CountVectorizer(), TfidfVectorizer(), StemmedCountVectorizer(), StemmedTfidfVectorizer()]
# algorithms = [BernoulliNB(), MultinomialNB(), GaussianNB(), LogisticRegression()]
algorithms = [MultinomialNB(), BernoulliNB()]

pipelines  = []


import itertools
for case in list(itertools.product(vectorizer, algorithms)): # 모든 경우의 수를 리스트로 만들기
    if isinstance(case[1], GaussianNB): # case[1](=알고리즘)이 GaussianNB instance이면
        case = list(case)
        case.insert(1,  DenseTransformer()) # CSR form을 Dense한 데이터로 transform 하도록 함
        # case의 형태 : [vectorizer, DenseTransformer(), GaussianNB()]
    pipelines.append(make_pipeline(*case))
pipelines

[Pipeline(steps=[('countvectorizer', CountVectorizer()),
                 ('multinomialnb', MultinomialNB())]),
 Pipeline(steps=[('countvectorizer', CountVectorizer()),
                 ('bernoullinb', BernoulliNB())]),
 Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                 ('multinomialnb', MultinomialNB())]),
 Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                 ('bernoullinb', BernoulliNB())]),
 Pipeline(steps=[('stemmedcountvectorizer', StemmedCountVectorizer()),
                 ('multinomialnb', MultinomialNB())]),
 Pipeline(steps=[('stemmedcountvectorizer', StemmedCountVectorizer()),
                 ('bernoullinb', BernoulliNB())]),
 Pipeline(steps=[('stemmedtfidfvectorizer', StemmedTfidfVectorizer()),
                 ('multinomialnb', MultinomialNB())]),
 Pipeline(steps=[('stemmedtfidfvectorizer', StemmedTfidfVectorizer()),
                 ('bernoullinb', BernoulliNB())])]

In [20]:
# Vectorizer Common params
ngrams_params = [(1,1),(1,3)]
stopword_params = ["english"]
lowercase_params = [True, False]
max_df_params = np.linspace(0.4, 0.6, num=4)
min_df_params = np.linspace(0.0, 0.1, num=4)

attributes = {"ngram_range":ngrams_params, "max_df":max_df_params,"min_df":min_df_params,
              "lowercase":lowercase_params,"stop_words":stopword_params}
vectorizer_names = ["countvectorizer","tfidfvectorizer","stemmedcountvectorizer","stemmedtfidfvectorizer"]
vectorizer_params_dict = {}

for vect_name in vectorizer_names:
    vectorizer_params_dict[vect_name] = {}
    for key, value in attributes.items():
        param_name = vect_name + "__" + key
        vectorizer_params_dict[vect_name][param_name] =  value

In [21]:
attributes.items()

dict_items([('ngram_range', [(1, 1), (1, 3)]), ('max_df', array([0.4       , 0.46666667, 0.53333333, 0.6       ])), ('min_df', array([0.        , 0.03333333, 0.06666667, 0.1       ])), ('lowercase', [True, False]), ('stop_words', ['english'])])

In [22]:
vectorizer_params_dict

{'countvectorizer': {'countvectorizer__ngram_range': [(1, 1), (1, 3)],
  'countvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
  'countvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
  'countvectorizer__lowercase': [True, False],
  'countvectorizer__stop_words': ['english']},
 'tfidfvectorizer': {'tfidfvectorizer__ngram_range': [(1, 1), (1, 3)],
  'tfidfvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
  'tfidfvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
  'tfidfvectorizer__lowercase': [True, False],
  'tfidfvectorizer__stop_words': ['english']},
 'stemmedcountvectorizer': {'stemmedcountvectorizer__ngram_range': [(1, 1),
   (1, 3)],
  'stemmedcountvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
  'stemmedcountvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
  'stemmedcountvectorizer__lowercase': [True, False]

In [24]:
# Algorithms parameters
# algorithm_names = ["bernoullinb","multinomialnb","gaussiannb","logisticregression"]
algorithm_names = ["multinomialnb", "bernoullinb"]

algorithm_params_dict = {}


#'bernoullinb', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])
alpha_params = np.linspace(1.0, 2.0, num=5)
for i in range(2):
    algorithm_params_dict[algorithm_names[i]] = {
    algorithm_names[i]+ "__alpha" : alpha_params    
    }
# algorithm_params_dict[algorithm_names[2]] = {}


# LogisticRegression    
# multi_class : str, {‘ovr’, ‘multinomial’}, default: ‘ovr’
# C : float, default: 1.0
# solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’},
# n_jobs : int, default: 1
# penalty : str, ‘l1’ or ‘l2’, default: ‘l2’

# multi_class_params = ["ovr", "multinomial"]
# c_params = [0.1,  5.0, 7.0, 10.0, 15.0, 20.0, 100.0]
# algorithm_params_dict[algorithm_names[1]] = [{
#     "logisticregression__multi_class" : ["multinomial"],
#     "logisticregression__solver" : ["saga"],
#     "logisticregression__penalty" : ["l1"],
#     "logisticregression__C" : c_params
#     # },{
#     # "logisticregression__multi_class" : ["ovr"],
#     # "logisticregression__solver" : ['liblinear'],
#     # "logisticregression__penalty" : ["l2"],
#     # "logisticregression__C" : c_params
#     }
#     ]
algorithm_params_dict

{'multinomialnb': {'multinomialnb__alpha': array([1.  , 1.25, 1.5 , 1.75, 2.  ])},
 'bernoullinb': {'bernoullinb__alpha': array([1.  , 1.25, 1.5 , 1.75, 2.  ])}}

In [25]:
pipeline_params= []
for case in list(itertools.product(vectorizer_names, algorithm_names)):
    vect_params = vectorizer_params_dict[case[0]].copy()
    algo_params = algorithm_params_dict[case[1]]
    
    # 두 params를 한번에 넣어주기 위해 합치기
    if isinstance(algo_params, dict): # algo_params이 하나의 dict 형태인 경우
        vect_params.update(algo_params)
        pipeline_params.append(vect_params) # 합쳐진 params를 pipeline_parms에 넣기
    else: # algo_params이 list[dict] 형태인 경우
        temp = []
        for param in algo_params:
            vect_params.update(param)
            temp.append(vect_params)
        pipeline_params.append(temp)
pipeline_params

[{'countvectorizer__ngram_range': [(1, 1), (1, 3)],
  'countvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
  'countvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
  'countvectorizer__lowercase': [True, False],
  'countvectorizer__stop_words': ['english'],
  'multinomialnb__alpha': array([1.  , 1.25, 1.5 , 1.75, 2.  ])},
 {'countvectorizer__ngram_range': [(1, 1), (1, 3)],
  'countvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
  'countvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
  'countvectorizer__lowercase': [True, False],
  'countvectorizer__stop_words': ['english'],
  'bernoullinb__alpha': array([1.  , 1.25, 1.5 , 1.75, 2.  ])},
 {'tfidfvectorizer__ngram_range': [(1, 1), (1, 3)],
  'tfidfvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
  'tfidfvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
  'tfidfvect

# Learn! Learn!

In [26]:
from sklearn.preprocessing import LabelEncoder

X_data = news_df.loc[:, 'News'].tolist()[:1000] # 시간 단축 - 1000개만
y_data = news_df['Target'].tolist()[:1000]
y = LabelEncoder().fit_transform(y_data)
y

array([10,  3, 17,  3,  4, 12,  4, 10, 10, 19, 19, 11, 19, 13,  0, 17, 12,
       12, 11,  8,  7,  5,  1,  8, 10, 14, 16,  1,  6,  0,  7, 16,  5,  9,
       13,  4,  4, 18,  8,  8, 19,  1, 12,  7, 10,  5,  2,  6, 11,  2, 12,
        7, 18, 11,  7,  8,  0,  4, 19,  8,  9,  4,  1,  1, 17, 11, 10, 11,
        5,  1,  3, 17,  6, 14, 19, 14, 10,  2, 15, 10, 12,  7,  5, 12,  4,
       15, 16, 13,  8, 15,  9, 19, 15,  2, 17,  3,  2, 10, 16,  5, 17,  1,
        2, 17, 19, 10,  4, 18,  6,  2,  4,  7, 14,  5, 17, 17, 12, 11,  9,
        4,  3, 12,  2, 12, 16,  4,  2, 10, 12,  1, 17, 15, 16, 10,  3, 17,
        2, 11,  3,  8,  0,  2,  7,  5, 13, 11,  4,  1,  9,  6,  8,  8,  3,
        3, 18,  4,  8, 18, 14,  4,  1,  3, 12, 15,  1,  4, 15, 14,  4, 13,
        4, 13,  3,  0,  5,  2,  9, 15,  8, 10,  7,  9, 18, 12,  4,  2, 17,
       17, 17, 19,  6, 15, 11,  0, 19, 15, 17,  6, 13,  1, 19,  4, 15,  5,
        1,  7,  6, 15, 12,  0,  7,  5,  5,  3,  8, 16,  2, 17, 14, 11, 10,
       15,  3,  7, 11,  0

In [27]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

scoring = ['accuracy']
estimator_results = []
for i, (estimator, params) in enumerate(zip(pipelines,pipeline_params)):
    gs_estimator = GridSearchCV(
            refit="accuracy", estimator=estimator, param_grid=params, scoring=scoring, cv=5, verbose=1, n_jobs=36)
    print(gs_estimator)

    gs_estimator.fit(X_data, y)
    estimator_results.append(gs_estimator)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('countvectorizer', CountVectorizer()),
                                       ('multinomialnb', MultinomialNB())]),
             n_jobs=36,
             param_grid={'countvectorizer__lowercase': [True, False],
                         'countvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
                         'countvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
                         'countvectorizer__ngram_range': [(1, 1), (1, 3)],
                         'countvectorizer__stop_words': ['english'],
                         'multinomialnb__alpha': array([1.  , 1.25, 1.5 , 1.75, 2.  ])},
             refit='accuracy', scoring=['accuracy'], verbose=1)
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('countvectorizer', CountVectorizer()),
                                       ('bernoullinb',



GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('stemmedcountvectorizer',
                                        StemmedCountVectorizer()),
                                       ('multinomialnb', MultinomialNB())]),
             n_jobs=36,
             param_grid={'multinomialnb__alpha': array([1.  , 1.25, 1.5 , 1.75, 2.  ]),
                         'stemmedcountvectorizer__lowercase': [True, False],
                         'stemmedcountvectorizer__max_df': array([0.4       , 0.46666667, 0.53333333, 0.6       ]),
                         'stemmedcountvectorizer__min_df': array([0.        , 0.03333333, 0.06666667, 0.1       ]),
                         'stemmedcountvectorizer__ngram_range': [(1, 1),
                                                                 (1, 3)],
                         'stemmedcountvectorizer__stop_words': ['english']},
             refit='accuracy', scoring=['accuracy'], verbose=1)
Fitting 5 folds for each of 320 candidates, totalling 1600 fit

In [28]:
import pandas as pd
from pandas import DataFrame
result_df_dict = {}
result_attributes = ["vectorizer", "model", "accuracy", "recall_macro","precision_macro" , "min_df", 
                     "lowercase", "max_df", "binarize", "alpha", "ngram_range"
                     "multi_class", "penalty", "solver", "C"]

pieline_list =  list(itertools.product(vectorizer_names, algorithm_names))

for att in result_attributes:
    result_df_dict[att] = [None for i in range(16)]

result_df = DataFrame(result_df_dict)
result_df

Unnamed: 0,vectorizer,model,accuracy,recall_macro,precision_macro,min_df,lowercase,max_df,binarize,alpha,ngram_rangemulti_class,penalty,solver,C
0,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,
5,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,
7,,,,,,,,,,,,,,
8,,,,,,,,,,,,,,
9,,,,,,,,,,,,,,


In [29]:
pieline_list

[('countvectorizer', 'multinomialnb'),
 ('countvectorizer', 'bernoullinb'),
 ('tfidfvectorizer', 'multinomialnb'),
 ('tfidfvectorizer', 'bernoullinb'),
 ('stemmedcountvectorizer', 'multinomialnb'),
 ('stemmedcountvectorizer', 'bernoullinb'),
 ('stemmedtfidfvectorizer', 'multinomialnb'),
 ('stemmedtfidfvectorizer', 'bernoullinb')]

In [30]:
for i, estiamtor in enumerate(estimator_results):
    best_estimator = estiamtor.best_estimator_
    best_index = estiamtor.best_index_
    result_df_dict["vectorizer"][i] = pieline_list[i][0]
    result_df_dict["model"][i] = pieline_list[i][1]
    result_df_dict["accuracy"][i] = estiamtor.best_score_
#     result_df_dict["recall_micro"][i] = estiamtor.cv_results_["mean_test_recall_micro"][best_index]
#     result_df_dict["precision_micro"][i] = estiamtor.cv_results_["mean_test_precision_micro"][best_index]
    for key, value in estiamtor.best_params_.items():
        if key.split("__")[1] in result_df_dict:
            name = key.split("__")[1]
            result_df_dict[name][i] = value
#     print(estiamtor.best_params_)
#     print(a.named_steps)

In [31]:
result_df = DataFrame(result_df_dict, columns=result_attributes)
result_df.sort_values("accuracy",ascending=False)

Unnamed: 0,vectorizer,model,accuracy,recall_macro,precision_macro,min_df,lowercase,max_df,binarize,alpha,ngram_rangemulti_class,penalty,solver,C
0,countvectorizer,multinomialnb,0.675,,,0.0,True,0.466667,,1.0,,,,
4,stemmedcountvectorizer,multinomialnb,0.662,,,0.0,True,0.533333,,1.0,,,,
6,stemmedtfidfvectorizer,multinomialnb,0.63,,,0.0,True,0.6,,1.0,,,,
2,tfidfvectorizer,multinomialnb,0.622,,,0.0,True,0.4,,1.0,,,,
5,stemmedcountvectorizer,bernoullinb,0.396,,,0.033333,True,0.6,,1.0,,,,
7,stemmedtfidfvectorizer,bernoullinb,0.396,,,0.033333,True,0.6,,1.0,,,,
1,countvectorizer,bernoullinb,0.343,,,0.033333,True,0.533333,,1.5,,,,
3,tfidfvectorizer,bernoullinb,0.343,,,0.033333,True,0.533333,,1.5,,,,
8,,,,,,,,,,,,,,
9,,,,,,,,,,,,,,
