# Naive Bayes 20 Article Classifier

We will build a classifier for 20 types of articles. Several Naive Bayes (NB) models are trained, each one is scored with cross-validation, and the model with the highest score is selected. This "best model" is then evaluated and its result on the test dataset is reported.

## Imports & Variable Declaration

These two commands reload all modules automatically, so the whole script does not have to be re-run explicitly every time: https://ipython.org/ipython-doc/stable/config/extensions/autoreload.html

In [1]:
%load_ext autoreload
%autoreload 2

We import all the packages needed for processing, classification, and data handling:

In [2]:
import numpy as np
import time
import os
import pickle

import pandas         as pd
import dask.dataframe as dd

import nltk
from   nltk.tokenize import TreebankWordTokenizer
from   nltk.stem     import PorterStemmer, WordNetLemmatizer
from   nltk.corpus   import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection         import cross_val_score, train_test_split
from sklearn.naive_bayes             import MultinomialNB
from sklearn.datasets                import fetch_20newsgroups

Definition of the required variables and objects:

In [3]:
nltk.download('wordnet')
nltk.download('stopwords')

tokenizer  = TreebankWordTokenizer()
stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

random_seed = 0
test_size   = 0.3
cross_sets  = 5

[nltk_data] Downloading package wordnet to /home/nicolas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nicolas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We download the dataset with all 11314 articles. The dataset and its useful information are exported to a CSV file so that it can be accessed whenever needed. Note that the target is, in this case, the numeric code of the article's category.

In [4]:
def create_data_file(partition, filename, samples):
    """ This function asumes the data is organized: article <--> code of category """
    data = pd.DataFrame(
        {'text': partition.data,
         'label': [target for target in partition.target]}).dropna()[:samples]
    data.to_csv(filename, index=False)
    return filename

newsgroups_train = fetch_20newsgroups(subset="train")

train_filename = create_data_file(newsgroups_train, "train.csv", 11314)

print(f'File name of csv: {train_filename}')

File name of csv: train.csv


## Caching

We define the paths used to read and write files. This makes it possible to save and reload data that has already been processed.

In [5]:
caching      = True
dataset_path = train_filename

def get_nltk_cache_path(hp):
    cache_path = f'cache-{hp["is_lem"]}-{hp["is_stop"]}-{hp["is_stem"]}-{hp["is_alpha"]}'
    return cache_path

def get_sklearn_cache_path(hp):
    cache_path = f'cache-{hp["is_lem"]}-{hp["is_stop"]}-{hp["is_stem"]}-{hp["is_alpha"]}-{hp["tf_idf"]}-{hp["min_df"]}-{hp["max_df"]}'
    return cache_path
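
The cache paths above are used in a load-or-compute pattern: if a file for a given hyperparameter combination exists, it is read back; otherwise the result is computed and pickled. The following is a minimal sketch of that idea using only the standard library. The helper `load_or_compute` and the `expensive` function are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the cached object if the file exists; otherwise compute and cache it."""
    if os.path.isfile(cache_path):
        with open(cache_path, 'rb') as fp:
            return pickle.load(fp)
    result = compute()
    with open(cache_path, 'wb') as fp:
        pickle.dump(result, fp)
    return result


calls = []  # records how many times the expensive step actually runs


def expensive():
    calls.append(1)
    return {'text': ['hello world'], 'label': [7]}


with tempfile.TemporaryDirectory() as tmp:
    # Same naming scheme as get_nltk_cache_path: one flag per preprocessing option
    path = os.path.join(tmp, 'cache-True-True-False-True')
    first = load_or_compute(path, expensive)   # cache miss: computes and writes
    second = load_or_compute(path, expensive)  # cache hit: reads the pickle
```

Because the file name encodes only the hyperparameters that affect the result, any two combinations that share those values reuse the same file.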

## Hyperparameters

We specify which hyperparameters are used to generate the models and build one dictionary per combination. Each combination then becomes one row of a pandas DataFrame.

In [6]:
# Hyperparameters (hp) considered for the classifier
hyperparameters_specs = {
    'is_lem':   [True, False],
    'is_stop':  [True, False],
    'is_stem':  [True, False],
    'is_alpha': [True, False],
    'tf_idf':   [True, False],
    'min_df':   [0.01, 0.05, 0.1, 0.4],
    'max_df':   [0.5, 0.6, 0.7, 0.8, 0.99],
    'alpha':    [0.01, 0.1, 1.0, 10.0],
}

# Store hyperparameters in DataFrame, one row per combination.
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first.
import itertools

# Same column order as the original per-combination dictionary
column_order = ['is_lem', 'is_stop', 'is_stem', 'is_alpha', 'alpha', 'min_df', 'max_df', 'tf_idf']

# itertools.product varies the last key ('alpha') fastest, like the innermost loop of a nested for
rows = [
    dict(zip(hyperparameters_specs.keys(), combination))
    for combination in itertools.product(*hyperparameters_specs.values())
]
hyperparameters = pd.DataFrame(rows)[column_order]
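
As a quick sanity check on the size of this grid, the sketch below counts the combinations with the standard library only. It also counts how many distinct cache files each stage can produce, assuming the cache names depend only on the keys used by `get_nltk_cache_path` and `get_sklearn_cache_path` above. The variable names `nltk_keys`, `sklearn_keys`, and `count` are introduced here for illustration.

```python
import itertools
import math

hyperparameters_specs = {
    'is_lem':   [True, False],
    'is_stop':  [True, False],
    'is_stem':  [True, False],
    'is_alpha': [True, False],
    'tf_idf':   [True, False],
    'min_df':   [0.01, 0.05, 0.1, 0.4],
    'max_df':   [0.5, 0.6, 0.7, 0.8, 0.99],
    'alpha':    [0.01, 0.1, 1.0, 10.0],
}

# Every combination of all eight hyperparameters
total = len(list(itertools.product(*hyperparameters_specs.values())))

# Keys that appear in each cache file name
nltk_keys = ['is_lem', 'is_stop', 'is_stem', 'is_alpha']
sklearn_keys = nltk_keys + ['tf_idf', 'min_df', 'max_df']


def count(keys):
    """Number of distinct value combinations for the given keys."""
    return math.prod(len(hyperparameters_specs[k]) for k in keys)


print(total, count(nltk_keys), count(sklearn_keys))
```

The grid has 2560 rows, but only 16 distinct NLTK caches and 640 distinct sklearn caches, since `alpha` affects only the classifier.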

## NLTK Processing

In [7]:
# Callback for Dask parallel processing
def nltk_preprocessor_callback(**kwargs):
    """ kwargs -> hp
        NLTK preprocessing, same as in the previous class """

    def preprocessor(datapoint):
        raw_datapoint          = datapoint
        tokenized_datapoint    = tokenizer.tokenize(str(raw_datapoint))

        # Decide if we are going to lemmatize our data
        if kwargs.setdefault('is_lem', True):
            lemmatized_datapoint   = [lemmatizer.lemmatize(x,pos='v') for x in tokenized_datapoint]
        else:
            lemmatized_datapoint   = tokenized_datapoint

        # Decide if we are going to remove stopwords from our data, kwargs -> hp
        if kwargs.setdefault('is_stop', True):
            # Build the stopword set once per datapoint instead of once per token
            english_stopwords      = set(stopwords.words('english'))
            nonstop_datapoint      = [x for x in lemmatized_datapoint if x not in english_stopwords]
        else:
            nonstop_datapoint      = lemmatized_datapoint

        # Decide if we are going to apply stemming to our data, kwargs -> hp
        if kwargs.setdefault('is_stem', True):
            stemmed_datapoint      = [stemmer.stem(x) for x in nonstop_datapoint]
            filtered_datapoint     = stemmed_datapoint
        else:
            filtered_datapoint     = nonstop_datapoint
        
        # Skip this if not applying alpha
        if kwargs.setdefault('is_alpha', True):
            alphanumeric_datapoint = [x for x in filtered_datapoint if x.isalpha()]
            filtered_datapoint     = alphanumeric_datapoint

        return ' '.join(filtered_datapoint)

    return preprocessor

def run_nltk_preprocessor(hp, dataset=None):
    print('NLTK Preprocessing...')

    to = time.time()
    cache_path = get_nltk_cache_path(hp)
    
    # Check if NLTK processing has been done for this combination of hyperparameters
    if not (os.path.exists(cache_path) and os.path.isfile(cache_path)):
        print('Cache miss: ', cache_path)

        # Read data set
        if caching is True:
            dataset = pd.read_csv(dataset_path)
        else:
            dataset  = dataset.copy()

        preprocessor = nltk_preprocessor_callback(
            is_lem=hp['is_lem'],
            is_stop=hp['is_stop'],
            is_stem=hp['is_stem'],
            is_alpha=hp['is_alpha']
            )
        ddataset        = dd.from_pandas(dataset, npartitions=os.cpu_count())
        dataset['text'] = ddataset['text'].map_partitions(lambda df: df.apply(preprocessor)).compute(scheduler='multiprocessing')
        
        
        # Save this attempt to the cache
        if caching is True:
            cache_path = get_nltk_cache_path(hp)
            with open(cache_path, 'wb') as fp:
                pickle.dump(dataset, fp)
        
    tf = time.time()
    print('finished in', (int(tf-to)), 'seconds.')
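
To make the order of the optional steps easier to follow, here is a minimal standard-library sketch of the same toggle chain (stopword removal, then stemming, then the alphabetic filter). It is only an illustration: `STOPWORDS` is a hypothetical tiny stop list rather than NLTK's, `toy_stem` is a crude suffix stripper rather than the Porter algorithm, tokenization is a plain `split`, and lemmatization is omitted.

```python
STOPWORDS = {'the', 'is', 'a', 'of', 'to'}  # hypothetical stand-in for NLTK's English list


def toy_stem(token):
    """Crude suffix stripping for illustration only (NOT the Porter stemmer)."""
    for suffix in ('ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token


def preprocess(text, is_stop=True, is_stem=True, is_alpha=True):
    tokens = text.split()
    if is_stop:
        # Case-sensitive membership test, as in the notebook's comparison
        tokens = [t for t in tokens if t not in STOPWORDS]
    if is_stem:
        tokens = [toy_stem(t) for t in tokens]
    if is_alpha:
        # Keep purely alphabetic tokens; drops numbers and punctuation-bearing tokens
        tokens = [t for t in tokens if t.isalpha()]
    return ' '.join(tokens)


sample = 'The engine is making 2 strange noises'
print(preprocess(sample))
print(preprocess(sample, is_stop=False, is_stem=False, is_alpha=False))
```

Note that the capitalized "The" survives stopword removal because the comparison is case-sensitive and the stop list is lowercase.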

We run the preprocessing for all hyperparameter combinations.

In [8]:
for idx,hyperParam in hyperparameters.iterrows():
  run_nltk_preprocessor(hyperParam)

NLTK Preprocessing...
finished in 0 seconds.

## SKLEARN Processing

In [9]:
def run_sklearn_preprocessor(hp, dataset=None):
    print('sklearn preprocessing...')
    to = time.time()
    cache_path = get_sklearn_cache_path(hp)
    
    # Check whether this combination has already been processed
    if not (os.path.exists(cache_path) and os.path.isfile(cache_path)):    
        print('Cache miss: ', cache_path)   
        
        if caching is True:
            cache_path = get_nltk_cache_path(hp)
            with open (cache_path, 'rb') as fp:
                dataset = pickle.load(fp)
        else:
            dataset = dataset.copy()

        # Run the corresponding vectorizer, same as in the previous class
        V = (TfidfVectorizer if hp['tf_idf'] is True else CountVectorizer)(min_df=hp['min_df'], max_df=hp['max_df'])
        X = V.fit_transform(dataset['text']).toarray()
        Y = np.array([dataset['label'].values]).T
        D = np.hstack((X, Y))

        np.random.seed(seed=random_seed)
        np.random.shuffle(D)

        if caching is True:
            cache_path = get_sklearn_cache_path(hp)
            with open(cache_path, 'wb') as fp:
                pickle.dump(D, fp)

    tf = time.time()
    print('finished in', (int(tf-to)), 'seconds.')
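
The `min_df` and `max_df` values passed to the vectorizer are fractions of the number of documents: a term is kept only if its document frequency lies between the two thresholds. The sketch below reproduces that filtering rule with the standard library. It is a simplification under stated assumptions: tokens are split on whitespace, whereas sklearn's vectorizers apply their own lowercasing and token pattern, and `filter_vocabulary` is a hypothetical helper name.

```python
from collections import Counter


def filter_vocabulary(documents, min_df, max_df):
    """Keep terms whose document frequency is within [min_df, max_df] (as fractions)."""
    n_docs = len(documents)
    document_frequency = Counter()
    for doc in documents:
        # set() so that a term counts once per document, however often it appears
        document_frequency.update(set(doc.split()))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(t for t, c in document_frequency.items() if low <= c <= high)


docs = ['car engine oil', 'car wheel', 'car engine', 'space shuttle']
vocab = filter_vocabulary(docs, min_df=0.5, max_df=0.7)
print(vocab)
```

Here "car" appears in 3 of 4 documents (above the 0.7 ceiling) and is dropped as too common, the single-document terms fall below the 0.5 floor, and only "engine" remains.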

Before this processing, let's look at the result of the NLTK processing for one of the cases.

In [10]:
cache_path = 'cache-False-True-False-True'
with open (cache_path, 'rb') as fp:
    dataset = pickle.load(fp)
print("Nuestros textos procesados: ")
print(dataset['text'])
print("Labels: ")
print(dataset['label'].values)

Nuestros textos procesados: 
0        From lerxst thing Subject WHAT car Organizatio...
1        From guykuo Guy Kuo Subject SI Clock Poll Fina...
2        From twillis Thomas E Willis Subject PB questi...
3        From jgreen amber Joe Green Subject Re Weitek ...
4        From jcm Jonathan McDowell Subject Re Shuttle ...
                               ...                        
11309    From Jim Zisfein Subject Re Migraines scans Di...
11310    From ebodin Subject Screen Death Mac Lines Org...
11311    From westes Will Estes Subject Mounting CPU Co...
11312    From steve hcrlgw Steven Collins Subject Re Sp...
11313    From gunning Kevin Gunning Subject stolen Orga...
Name: text, Length: 11314, dtype: object
Labels: 
[7 4 4 ... 3 1 8]


We run the sklearn processing for all hyperparameter combinations.

In [11]:
for idx,hp2 in hyperparameters.iterrows():
    run_sklearn_preprocessor(hp2)

sklearn preprocessing...
finished in 0 seconds.

Let's look at the dataset for one case after the sklearn processing.

In [12]:
if caching is True:
    cache_path = 'cache-True-False-True-False-True-0.01-0.6'
    with open (cache_path, 'rb') as fp:
        D = pickle.load(fp)
else:
    D = dataset.copy()

X = D[:,:D.shape[1]-1]
Y = D[:,D.shape[1]-1:].flatten()

print("Dataset: ")
print(D)
print("Dataset shape: ")
print(D.shape)
print("Tipos: ")
print(Y[60:70])
print(f'Algunos valores de procesamiento para el articulo 60 de tipo {Y[60]}')
print(X[60][90:200])

Dataset: 
[[ 0.          0.          0.         ...  0.          0.
   1.        ]
 [ 0.          0.          0.         ...  0.          0.
  12.        ]
 [ 0.          0.          0.         ...  0.          0.
  13.        ]
 ...
 [ 0.13575459  0.          0.         ...  0.          0.
   8.        ]
 [ 0.          0.          0.         ...  0.          0.
  19.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]
Dataset shape: 
(11314, 2018)
Tipos: 
[10.  6. 19. 10.  8.  1. 19.  4. 12.  1.]
Algunos valores de procesamiento para el articulo 60 de tipo 10.0
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.08364989 0.         0

## Evaluating the Different Classifiers: Cross Validation

In [13]:
# Callback for Dask parallel processing
def score_callback(dataset=None):
    def score_classifier(hp):
        print(hp.to_dict())
        
        if caching is True:
            cache_path = get_sklearn_cache_path(hp)
            with open (cache_path, 'rb') as fp:
                D = pickle.load(fp)
        else:
            D = dataset.copy()

        X = D[:,:D.shape[1]-1]
        Y = D[:,D.shape[1]-1:].flatten()

        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, shuffle=False)

        # Define the classifier here
        clf = MultinomialNB(alpha=hp['alpha'], class_prior=None, fit_prior=False)
        
        # Compute the score
        scores = cross_val_score(clf, X_train, Y_train, cv=cross_sets)

        hp['score'] = scores.mean()
        
        return hp
    return score_classifier
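
The `alpha` hyperparameter passed to `MultinomialNB` is the additive smoothing term: each per-class feature probability is estimated as (count + alpha) / (class total + alpha * number of features), so features never seen in a class still get a non-zero probability. The following standard-library sketch illustrates that estimate with a uniform class prior, in the spirit of `fit_prior=False`. The functions `fit_multinomial_nb` and `predict` and the toy data are assumptions made for illustration, not sklearn's implementation.

```python
import math


def fit_multinomial_nb(X, y, alpha):
    """Return per-class log feature probabilities with additive smoothing."""
    n_features = len(X[0])
    log_prob = {}
    for c in sorted(set(y)):
        # Sum the feature counts over all rows that belong to class c
        counts = [0.0] * n_features
        for row, label in zip(X, y):
            if label == c:
                for j, value in enumerate(row):
                    counts[j] += value
        total = sum(counts) + alpha * n_features
        log_prob[c] = [math.log((n + alpha) / total) for n in counts]
    return log_prob


def predict(log_prob, row):
    """Pick the class with the highest log-likelihood (uniform prior)."""
    return max(log_prob, key=lambda c: sum(v * lp for v, lp in zip(row, log_prob[c])))


# Toy term-count matrix: 4 documents, 3 terms, 2 classes
X = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
y = [0, 0, 1, 1]
model = fit_multinomial_nb(X, y, alpha=1.0)
print(predict(model, [1, 0, 0]), predict(model, [0, 1, 0]))
```

Larger `alpha` values pull all feature probabilities toward uniform, which is one reason the grid explores several orders of magnitude.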

#### Caching scores

If the scores were computed previously, the cached copy is reused; otherwise (for example, after deleting the cache file) they are generated again.

In [14]:
to = time.time()

scores_cache_path = 'cache-scores'

if not os.path.isfile(scores_cache_path):
    print('No cache-scores')
    score_classifier = score_callback(dataset)
    # Score every hyperparameter combination, row by row, with pandas
    scores           = hyperparameters.apply(score_classifier, axis=1)
    if caching is True:
        with open(scores_cache_path, 'wb') as fp:
            pickle.dump(scores, fp)
else:
    print('Retrieving cache-scores....')
    scores = pd.read_pickle(scores_cache_path)

tf = time.time()
print('finished in', (int(tf-to)), 'seconds.')

Retrieving cache-scores....
finished in 0 seconds.
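
The load-or-compute pattern used for the scores can be sketched with the standard library alone. The helper name `load_or_compute`, the temporary path and the dictionary returned by `expensive` are all hypothetical, illustrative values; the sketch only shows that the expensive function runs once and later calls read the pickle.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if missing."""
    if os.path.isfile(path):
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    result = compute()
    with open(path, 'wb') as fp:
        pickle.dump(result, fp)
    return result


calls = []


def expensive():
    # Stand-in for the hyperparameter sweep; the values are made up.
    calls.append(1)
    return {'alpha': 0.1, 'score': 0.5}


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo-cache')
    first = load_or_compute(demo_path, expensive)   # computes and writes the file
    second = load_or_compute(demo_path, expensive)  # reads the file, no recompute

print(first == second, len(calls))
```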


Let's look at the score obtained for each combination of hyperparameters:

In [15]:
print(scores)

      is_lem  is_stop  is_stem  is_alpha  alpha  min_df  max_df  tf_idf  \
0       True     True     True      True   0.01    0.01    0.50    True   
1       True     True     True      True   0.10    0.01    0.50    True   
2       True     True     True      True   1.00    0.01    0.50    True   
3       True     True     True      True  10.00    0.01    0.50    True   
4       True     True     True      True   0.01    0.01    0.60    True   
...      ...      ...      ...       ...    ...     ...     ...     ...   
2555   False    False    False     False  10.00    0.40    0.80   False   
2556   False    False    False     False   0.01    0.40    0.99   False   
2557   False    False    False     False   0.10    0.40    0.99   False   
2558   False    False    False     False   1.00    0.40    0.99   False   
2559   False    False    False     False  10.00    0.40    0.99   False   

         score  
0     0.786211  
1     0.793661  
2     0.783938  
3     0.754388  
4     0.786211

## Training the Final Model

We pick the hyperparameter combination with the highest cross-validation score and train the model with it.

In [16]:
print('Training model with best hyperparameters...')

# Keep the best combination of hyperparameters.

if caching is True:
    with open ('cache-scores', 'rb') as fp:
        scores = pd.read_pickle(fp)
        
best_hp = scores.loc[scores['score'].idxmax()].drop(['score'])
print(best_hp.to_dict())

if caching is True:
    cache_path = get_sklearn_cache_path(best_hp)
    with open (cache_path, 'rb') as fp:
        D = pickle.load(fp)
else:
    D = dataset.copy()


X = D[:,:D.shape[1]-1]
Y = D[:,D.shape[1]-1:].flatten()

# Split the dataset into train and test (same deterministic split as before)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, shuffle=False)

# Create the classifier with the best hyperparameters
clf = MultinomialNB(alpha=best_hp['alpha'], class_prior=None, fit_prior=False)

# Train the model
clf.fit(X_train, Y_train)

Training model with best hyperparameters...
{'is_lem': False, 'is_stop': True, 'is_stem': True, 'is_alpha': False, 'alpha': 0.1, 'min_df': 0.01, 'max_df': 0.7, 'tf_idf': True}


MultinomialNB(alpha=0.1, class_prior=None, fit_prior=False)
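
The selection step can be shown on a tiny table. This is a sketch under stated assumptions: `demo_scores` is a hypothetical frame with made-up scores, not the notebook's results. It picks the row with the highest `score` and drops that column to recover the hyperparameters.

```python
import pandas as pd

# Hypothetical results table; the scores are illustrative only.
demo_scores = pd.DataFrame({
    'alpha':  [0.01, 0.1, 1.0],
    'tf_idf': [True, True, False],
    'score':  [0.70, 0.75, 0.72],
})

# idxmax returns the index label of the highest score; loc fetches that row.
best_row = demo_scores.loc[demo_scores['score'].idxmax()]
best_params = best_row.drop('score').to_dict()
print(best_params)
```

Note that `idxmax` returns the first index when several rows tie for the maximum.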

## Performance

We compute the accuracy of our model on the held-out test split:

In [17]:
print('Evaluating best model...')
    
if caching is True:
    cache_path = get_sklearn_cache_path(best_hp)
    with open (cache_path, 'rb') as fp:
        D = pickle.load(fp)
else:
    D = dataset.copy()

X = D[:,:D.shape[1]-1]
Y = D[:,D.shape[1]-1:].flatten()

# Split the set into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, shuffle=False)
    
# Final score of the model on the test split
score = clf.score(X_test, Y_test)
print("accuracy: {:.4}%".format(score*100))

Evaluating best model...
accuracy: 82.56%
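
Two details of the evaluation are worth making explicit: with `shuffle=False` the split is deterministic, so repeating `train_test_split` returns the same held-out rows, and the reported score is the fraction of test rows predicted correctly. The sketch below uses a hypothetical 10-row array and made-up predictions to illustrate both points.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1], dtype=float)

# Splitting twice without shuffling gives identical partitions.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=False)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=False)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# The test rows are simply the tail of the array.
X_tr, X_te, y_tr, y_te = split_a
tail_is_test = np.array_equal(np.vstack([X_tr, X_te]), X_demo)

# Made-up predictions: copy the true labels and flip the first one.
y_pred_demo = y_te.copy()
y_pred_demo[0] = 1.0 - y_pred_demo[0]

# Accuracy is the mean of the element-wise matches.
accuracy_demo = float(np.mean(y_pred_demo == y_te))
print(same_split, tail_is_test, accuracy_demo)
```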


## Examples and Some Analysis

We test the prediction for a single article:

In [56]:
n_art = 67
art = D[:,D.shape[1]-1:][n_art][0]
X = D[:,:D.shape[1]-1][n_art:n_art+1]
print(f'The prediction for the article of type {art} is:')
print(clf.predict(X)[0])

The prediction for the article of type 4.0 is:
4.0


We can also see a misclassified case by indexing the article at position 834:

In [50]:
n_art = 834
art = D[:,D.shape[1]-1:][n_art][0]
X = D[:,:D.shape[1]-1][n_art:n_art+1]
print(f'The prediction for the article of type {art} is:')
print(clf.predict(X)[0])

The prediction for the article of type 1.0 is:
2.0
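
The indexing used in these examples relies on the label being stored in the last column of `D`. A minimal sketch with a hypothetical 3-row array (`D_demo` and its values are made up) shows the equivalent negative-index form, and why the row is taken with a slice: `predict` expects a 2-D input, and `row:row + 1` keeps the shape `(1, n_features)`.

```python
import numpy as np

# Hypothetical matrix: three feature columns followed by the class label.
D_demo = np.array([
    [0.0, 0.2, 0.0, 4.0],
    [0.1, 0.0, 0.3, 1.0],
    [0.0, 0.0, 0.5, 2.0],
])

X_all = D_demo[:, :-1]  # same as D_demo[:, :D_demo.shape[1]-1]
y_all = D_demo[:, -1]   # same as D_demo[:, D_demo.shape[1]-1:].flatten()

row = 1
x_one = X_all[row:row + 1]  # slice keeps two dimensions: shape (1, 3)
x_flat = X_all[row]         # plain index drops a dimension: shape (3,)

# The notebook's label lookup and the shorter form agree.
same_label = D_demo[:, D_demo.shape[1] - 1:][row][0] == y_all[row]
print(x_one.shape, x_flat.shape, same_label)
```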
