This is a project for the Natural Language Processing (NLP) class "IFT-7022 Techniques et applications du traitement automatique de la langue (TALN)" of Luc Lamontagne.
There are two tasks to do in this homework:
- Task 1: classify positive and negative emotion in a document.
- Task 2: predict language in a document.
The current code is released under the BSD-3-Clause license. See LICENSE.md
. Copyright 2018 Guillaume Chevalier.
Note: as this is a university school project, the licences of the imported librairies, datasets, or other assets has not been checked.
Here is the structure of the folders for this code to run:
(Note: all text files have been ignored with grep so as to make the tree shorter, and the tree was captured before creating the src* folder. The data could be downloaded from the course website.)
!tree data | grep -v txt | grep -v text
!pwd
data
├── task1
│  └── Book
│  ├── neg_Bk
│  └── pos_Bk
└── task2
└── identification_langue
├── corpus_entrainement
└── corpus_test1
8 directories, 2044 files
/home/users_home/Documents/Session 7/NLP/TP2
Also see the requirements.txt
file:
!cat requirements.txt
# Python 3.6
conv
numpy
scikit-learn
nltk
# nltk.download('stopwords')
# nltk.download('tagsets')
# nltk.download('sentiwordnet')
matplotlib
# Python 3.6
import os
import glob
from src.pipeline_steps.nltk_word_tokenize import NLTKTokenizer
from src.pipeline_steps.to_lower_case import ToLowerCase
from src.pipeline_steps.remove_stop_words import RemoveStopWords
from src.pipeline_steps.keep_open_classes_only import KeepOpenClassesOnly
from src.pipeline_steps.sentiwordnet import SentiWordNetPosNegAttributes
from src.pipeline_steps.porter_stemmer import PorterStemmerStep
from src.pipeline_steps.data_shape_printer import ShapePrinter
from src.data_loading.task_1 import load_all_data_task_1
from src.text_classifier_pipelines.stop_words_open_class_stemmer.pipeline_factory import find_and_train_best_pipelines
[nltk_data] Downloading package stopwords to /home/gui/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /home/gui/nltk_data...
[nltk_data] Package sentiwordnet is already up-to-date!
neg_Bk_files = glob.glob(os.path.join(".", "data", "task1", "Book", "neg_Bk", "*.text"))
pos_Bk_files = glob.glob(os.path.join(".", "data", "task1", "Book", "pos_Bk", "*.text"))
X_train, y_train, X_test, y_test = load_all_data_task_1(neg_Bk_files, pos_Bk_files)
print(len(X_train), len(y_train), len(X_test), len(y_test))
1600 1600 400 400
best_trained_pipelines = find_and_train_best_pipelines(X_train, y_train)
Will start Cross Validation for Logistic Classifiers.
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.796875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute '...
Best hyperparameters for 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.79125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase all_attributes with_stemming '...
Best hyperparameters for 'Logistic Classifier with_lowercase all_attributes with_stemming ' (4-folds cross validation accuracy score=0.785625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase all_attributes '...
Best hyperparameters for 'Logistic Classifier with_lowercase all_attributes ' (4-folds cross validation accuracy score=0.79):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.7725):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 100.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute '...
Best hyperparameters for 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.759375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase remove_stop_words with_stemming '...
Best hyperparameters for 'Logistic Classifier with_lowercase remove_stop_words with_stemming ' (4-folds cross validation accuracy score=0.770625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase remove_stop_words '...
Best hyperparameters for 'Logistic Classifier with_lowercase remove_stop_words ' (4-folds cross validation accuracy score=0.766875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.789375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute '...
Best hyperparameters for 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.775625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase keep_open_classes_only with_stemming '...
Best hyperparameters for 'Logistic Classifier with_lowercase keep_open_classes_only with_stemming ' (4-folds cross validation accuracy score=0.78875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier with_lowercase keep_open_classes_only '...
Best hyperparameters for 'Logistic Classifier with_lowercase keep_open_classes_only ' (4-folds cross validation accuracy score=0.775625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier all_attributes with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Logistic Classifier all_attributes with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.794375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier all_attributes with_pos_neg_attribute '...
Best hyperparameters for 'Logistic Classifier all_attributes with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.785625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier all_attributes with_stemming '...
Best hyperparameters for 'Logistic Classifier all_attributes with_stemming ' (4-folds cross validation accuracy score=0.78875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier all_attributes '...
Best hyperparameters for 'Logistic Classifier all_attributes ' (4-folds cross validation accuracy score=0.775625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier remove_stop_words with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Logistic Classifier remove_stop_words with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.7725):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier remove_stop_words with_pos_neg_attribute '...
Best hyperparameters for 'Logistic Classifier remove_stop_words with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.764375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier remove_stop_words with_stemming '...
Best hyperparameters for 'Logistic Classifier remove_stop_words with_stemming ' (4-folds cross validation accuracy score=0.77375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier remove_stop_words '...
Best hyperparameters for 'Logistic Classifier remove_stop_words ' (4-folds cross validation accuracy score=0.76875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier keep_open_classes_only with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Logistic Classifier keep_open_classes_only with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.78875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier keep_open_classes_only with_pos_neg_attribute '...
Best hyperparameters for 'Logistic Classifier keep_open_classes_only with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.788125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier keep_open_classes_only with_stemming '...
Best hyperparameters for 'Logistic Classifier keep_open_classes_only with_stemming ' (4-folds cross validation accuracy score=0.789375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Logistic Classifier keep_open_classes_only '...
Best hyperparameters for 'Logistic Classifier keep_open_classes_only ' (4-folds cross validation accuracy score=0.7775):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'logistic_regression__C': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.801875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_pos_neg_attribute '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.805):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_stemming ' (4-folds cross validation accuracy score=0.798125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase all_attributes '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes ' (4-folds cross validation accuracy score=0.8025):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.78):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_pos_neg_attribute '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.78375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_stemming ' (4-folds cross validation accuracy score=0.783125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 0.1}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words ' (4-folds cross validation accuracy score=0.785):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 0.1}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.79625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.794375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_stemming ' (4-folds cross validation accuracy score=0.7925):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only ' (4-folds cross validation accuracy score=0.798125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.8):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.810625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 0.1}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier all_attributes with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier all_attributes with_stemming ' (4-folds cross validation accuracy score=0.79625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 0.1}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier all_attributes '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier all_attributes ' (4-folds cross validation accuracy score=0.80625):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 0.1}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier remove_stop_words with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier remove_stop_words with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.78):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier remove_stop_words with_pos_neg_attribute '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier remove_stop_words with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.78375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier remove_stop_words with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier remove_stop_words with_stemming ' (4-folds cross validation accuracy score=0.778125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 1, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 0.1}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier remove_stop_words '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier remove_stop_words ' (4-folds cross validation accuracy score=0.78375):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute with_stemming ' (4-folds cross validation accuracy score=0.803125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute ' (4-folds cross validation accuracy score=0.803125):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier keep_open_classes_only with_stemming '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier keep_open_classes_only with_stemming ' (4-folds cross validation accuracy score=0.801875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 3), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
Cross-Validation Grid Search for: 'Multinomial Naive Bayes Classifier keep_open_classes_only '...
Best hyperparameters for 'Multinomial Naive Bayes Classifier keep_open_classes_only ' (4-folds cross validation accuracy score=0.79875):
{'count_vect_that_remove_unfrequent_words_and_stopwords__lowercase': False, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_df': 0.98, 'count_vect_that_remove_unfrequent_words_and_stopwords__max_features': 500000, 'count_vect_that_remove_unfrequent_words_and_stopwords__min_df': 2, 'count_vect_that_remove_unfrequent_words_and_stopwords__ngram_range': (1, 2), 'count_vect_that_remove_unfrequent_words_and_stopwords__preprocessor': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__strip_accents': None, 'count_vect_that_remove_unfrequent_words_and_stopwords__tokenizer': <function identity at 0x7f3a6fc2a598>, 'naive_bayes_multi__alpha': 1.0}
print("The final test: classifying on test documents:")
print("")
print("Note: the test set was of 20% of full data, which was held-out of cross validation.")
print("")
max_score = 0
max_score_model = ""
for (model_name, model) in best_trained_pipelines.items():
score = model.score(X_test, y_test) * 100
if score > max_score:
max_score = score
max_score_model = model_name
print("Test set score for '{}': {}%".format(model_name, score))
print("")
print("Max score is by '{}': {}%".format(max_score_model, max_score))
print("")
The final test: classifying on test documents:
Note: the test set was of 20% of full data, which was held-out of cross validation.
Test set score for 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming ': 84.25%
Test set score for 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute ': 81.5%
Test set score for 'Logistic Classifier with_lowercase all_attributes with_stemming ': 81.75%
Test set score for 'Logistic Classifier with_lowercase all_attributes ': 81.25%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute with_stemming ': 80.0%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words with_pos_neg_attribute ': 81.5%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words with_stemming ': 78.75%
Test set score for 'Logistic Classifier with_lowercase remove_stop_words ': 79.25%
Test set score for 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute with_stemming ': 81.5%
Test set score for 'Logistic Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute ': 81.25%
Test set score for 'Logistic Classifier with_lowercase keep_open_classes_only with_stemming ': 81.5%
Test set score for 'Logistic Classifier with_lowercase keep_open_classes_only ': 82.75%
Test set score for 'Logistic Classifier all_attributes with_pos_neg_attribute with_stemming ': 83.75%
Test set score for 'Logistic Classifier all_attributes with_pos_neg_attribute ': 81.75%
Test set score for 'Logistic Classifier all_attributes with_stemming ': 81.5%
Test set score for 'Logistic Classifier all_attributes ': 80.0%
Test set score for 'Logistic Classifier remove_stop_words with_pos_neg_attribute with_stemming ': 81.75%
Test set score for 'Logistic Classifier remove_stop_words with_pos_neg_attribute ': 80.75%
Test set score for 'Logistic Classifier remove_stop_words with_stemming ': 79.5%
Test set score for 'Logistic Classifier remove_stop_words ': 79.0%
Test set score for 'Logistic Classifier keep_open_classes_only with_pos_neg_attribute with_stemming ': 80.0%
Test set score for 'Logistic Classifier keep_open_classes_only with_pos_neg_attribute ': 81.25%
Test set score for 'Logistic Classifier keep_open_classes_only with_stemming ': 81.75%
Test set score for 'Logistic Classifier keep_open_classes_only ': 80.5%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming ': 80.75%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_pos_neg_attribute ': 79.5%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes with_stemming ': 80.75%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase all_attributes ': 80.0%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_pos_neg_attribute with_stemming ': 81.75%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_pos_neg_attribute ': 79.25%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words with_stemming ': 78.25%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase remove_stop_words ': 78.5%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute with_stemming ': 79.75%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_pos_neg_attribute ': 79.25%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only with_stemming ': 78.25%
Test set score for 'Multinomial Naive Bayes Classifier with_lowercase keep_open_classes_only ': 79.0%
Test set score for 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute with_stemming ': 80.25%
Test set score for 'Multinomial Naive Bayes Classifier all_attributes with_pos_neg_attribute ': 79.75%
Test set score for 'Multinomial Naive Bayes Classifier all_attributes with_stemming ': 79.75%
Test set score for 'Multinomial Naive Bayes Classifier all_attributes ': 78.0%
Test set score for 'Multinomial Naive Bayes Classifier remove_stop_words with_pos_neg_attribute with_stemming ': 81.75%
Test set score for 'Multinomial Naive Bayes Classifier remove_stop_words with_pos_neg_attribute ': 78.5%
Test set score for 'Multinomial Naive Bayes Classifier remove_stop_words with_stemming ': 78.25%
Test set score for 'Multinomial Naive Bayes Classifier remove_stop_words ': 78.75%
Test set score for 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute with_stemming ': 81.5%
Test set score for 'Multinomial Naive Bayes Classifier keep_open_classes_only with_pos_neg_attribute ': 80.0%
Test set score for 'Multinomial Naive Bayes Classifier keep_open_classes_only with_stemming ': 80.5%
Test set score for 'Multinomial Naive Bayes Classifier keep_open_classes_only ': 79.25%
Max score is by 'Logistic Classifier with_lowercase all_attributes with_pos_neg_attribute with_stemming ': 84.25%
# Python 3.6
import os
import glob
from collections import Counter
import nltk
from src.data_loading.task_2 import load_train, load_test
from src.text_classifier_pipelines.char_ngram.logistic import find_and_train_best_logistic_pipeline
from src.text_classifier_pipelines.char_ngram.multinomial_naive_bayes import find_and_train_best_multinomial_pipeline
from src.evaluation.eval_and_plot import eval_and_plot
corpus_train = glob.glob(os.path.join(".", "data", "task2", "identification_langue", "corpus_entrainement", "*.txt"))
corpus_test1 = glob.glob(os.path.join(".", "data", "task2", "identification_langue", "corpus_test1", "*.txt"))
print("Train files found:", corpus_train)
print("")
print("Test files found:", corpus_test1)
print("")
X_train, y_train, labels_readable = load_train(corpus_train, sentence_tokenizer=nltk.sent_tokenize)
X_test, y_test = load_test(corpus_test1, labels_readable)
print("'print(len(X_train), len(y_train))':")
print(len(X_train), len(y_train))
print("'print(len(X_test), len(y_test))':")
print(len(X_test), len(y_test))
print("")
print("Do we have a balanced dataset after counting sentences in train, and items in test?")
print(" Train:", Counter(y_train))
print(" Test:", Counter(y_test))
print("Note: Indices are labels' indexes, and counts are their appearance. Labels are:")
print(list(enumerate(labels_readable)))
print("")
Train files found: ['./data/task2/identification_langue/corpus_entrainement/espanol-training.txt', './data/task2/identification_langue/corpus_entrainement/portuguese-training.txt', './data/task2/identification_langue/corpus_entrainement/french-training.txt', './data/task2/identification_langue/corpus_entrainement/english-training.txt']
Test files found: ['./data/task2/identification_langue/corpus_test1/test15-es.txt', './data/task2/identification_langue/corpus_test1/test26-pt.txt', './data/task2/identification_langue/corpus_test1/test39-es.txt', './data/task2/identification_langue/corpus_test1/test3-fr.txt', './data/task2/identification_langue/corpus_test1/test28-pt.txt', './data/task2/identification_langue/corpus_test1/test40-es.txt', './data/task2/identification_langue/corpus_test1/test12-fr.txt', './data/task2/identification_langue/corpus_test1/test20-fr.txt', './data/task2/identification_langue/corpus_test1/test24-pt.txt', './data/task2/identification_langue/corpus_test1/test35-en.txt', './data/task2/identification_langue/corpus_test1/test4-en.txt', './data/task2/identification_langue/corpus_test1/test14-en.txt', './data/task2/identification_langue/corpus_test1/test9-en.txt', './data/task2/identification_langue/corpus_test1/test33-en.txt', './data/task2/identification_langue/corpus_test1/test13-fr.txt', './data/task2/identification_langue/corpus_test1/test21-pt.txt', './data/task2/identification_langue/corpus_test1/test34-en.txt', './data/task2/identification_langue/corpus_test1/test10-fr.txt', './data/task2/identification_langue/corpus_test1/test30-pt.txt', './data/task2/identification_langue/corpus_test1/test11-es.txt', './data/task2/identification_langue/corpus_test1/test18-fr.txt', './data/task2/identification_langue/corpus_test1/test17-en.txt', './data/task2/identification_langue/corpus_test1/test5-en.txt', './data/task2/identification_langue/corpus_test1/test25-pt.txt', './data/task2/identification_langue/corpus_test1/test31-fr.txt', './data/task2/identification_langue/corpus_test1/test16-en.txt', './data/task2/identification_langue/corpus_test1/test29-pt.txt', './data/task2/identification_langue/corpus_test1/test32-fr.txt', './data/task2/identification_langue/corpus_test1/test2-fr.txt', './data/task2/identification_langue/corpus_test1/test23-pt.txt', './data/task2/identification_langue/corpus_test1/test6-fr.txt', './data/task2/identification_langue/corpus_test1/test22-pt.txt', './data/task2/identification_langue/corpus_test1/test37-es.txt', './data/task2/identification_langue/corpus_test1/test27-pt.txt', './data/task2/identification_langue/corpus_test1/test19-es.txt', './data/task2/identification_langue/corpus_test1/test8-es.txt', './data/task2/identification_langue/corpus_test1/test36-en.txt', './data/task2/identification_langue/corpus_test1/test1-es.txt', './data/task2/identification_langue/corpus_test1/test7-es.txt', './data/task2/identification_langue/corpus_test1/test38-es.txt']
'print(len(X_train), len(y_train))':
767 767
'print(len(X_test), len(y_test))':
40 40
Do we have a balanced dataset after counting sentences in train, and items in test?
Train: Counter({0: 258, 3: 211, 2: 188, 1: 110})
Test: Counter({1: 10, 3: 10, 2: 10, 0: 10})
Note: Indices are labels' indexes, and counts are their appearance. Labels are:
[(0, 'english'), (1, 'espanol'), (2, 'french'), (3, 'portuguese')]
Because here, we want to perform those steps in order:
- Lowercase the text (but keep accents).
- Convert the documents to char ngrams.
- Use the classifier of our choice later on.
This is done in the from src.text_classifier_pipelines
local package imported.
Cross Validation is performed.
best_trained_logistic_pipelines = find_and_train_best_logistic_pipeline(X_train, y_train)
Will start Cross Validation for Logistic Classifiers.
Cross-Validation Grid Search for: '1-gram'...
Best hyperparameters for '1-gram' (4-folds cross validation accuracy score=0.924380704041721):
{'count_vectorizer__lowercase': False, 'count_vectorizer__max_df': 0.98, 'count_vectorizer__max_features': 100000, 'count_vectorizer__min_df': 1, 'count_vectorizer__ngram_range': (1, 1), 'count_vectorizer__preprocessor': None, 'count_vectorizer__strip_accents': None, 'count_vectorizer__tokenizer': <function identity at 0x7f4226572a60>, 'logistic_regression__C': 1000.0, 'tf_idf__norm': 'l1', 'tf_idf__smooth_idf': True, 'tf_idf__sublinear_tf': False, 'to_ngram__ngram': 1, 'to_ngram__stride': 1}
Cross-Validation Grid Search for: '2-gram'...
Best hyperparameters for '2-gram' (4-folds cross validation accuracy score=0.9934810951760105):
{'count_vectorizer__lowercase': False, 'count_vectorizer__max_df': 0.98, 'count_vectorizer__max_features': 100000, 'count_vectorizer__min_df': 1, 'count_vectorizer__ngram_range': (1, 1), 'count_vectorizer__preprocessor': None, 'count_vectorizer__strip_accents': None, 'count_vectorizer__tokenizer': <function identity at 0x7f4226572a60>, 'logistic_regression__C': 1000.0, 'tf_idf__norm': 'l2', 'tf_idf__smooth_idf': True, 'tf_idf__sublinear_tf': False, 'to_ngram__ngram': 2, 'to_ngram__stride': 1}
Cross-Validation Grid Search for: '3-gram'...
Best hyperparameters for '3-gram' (4-folds cross validation accuracy score=0.9960886571056062):
{'count_vectorizer__lowercase': False, 'count_vectorizer__max_df': 0.98, 'count_vectorizer__max_features': 100000, 'count_vectorizer__min_df': 1, 'count_vectorizer__ngram_range': (1, 1), 'count_vectorizer__preprocessor': None, 'count_vectorizer__strip_accents': None, 'count_vectorizer__tokenizer': <function identity at 0x7f4226572a60>, 'logistic_regression__C': 1000.0, 'tf_idf__norm': 'l1', 'tf_idf__smooth_idf': True, 'tf_idf__sublinear_tf': True, 'to_ngram__ngram': 3, 'to_ngram__stride': 1}
Cross Validation is performed.
best_trained_bayes_pipelines = find_and_train_best_multinomial_pipeline(X_train, y_train)
Will start Cross Validation for Naive Bayes (MultinomialNB) Classifiers.
Cross-Validation Grid Search for: '1-gram'...
Best hyperparameters for '1-gram' (4-folds cross validation accuracy score=0.9269882659713168):
{'count_vectorizer__lowercase': False, 'count_vectorizer__max_df': 0.98, 'count_vectorizer__max_features': 100000, 'count_vectorizer__min_df': 1, 'count_vectorizer__ngram_range': (1, 1), 'count_vectorizer__preprocessor': None, 'count_vectorizer__strip_accents': None, 'count_vectorizer__tokenizer': <function identity at 0x7f4226572a60>, 'naive_bayes_multiclass__alpha': 0.01, 'tf_idf__norm': None, 'tf_idf__smooth_idf': True, 'tf_idf__sublinear_tf': False, 'to_ngram__ngram': 1, 'to_ngram__stride': 1}
Cross-Validation Grid Search for: '2-gram'...
Best hyperparameters for '2-gram' (4-folds cross validation accuracy score=0.9869621903520208):
{'count_vectorizer__lowercase': False, 'count_vectorizer__max_df': 0.98, 'count_vectorizer__max_features': 100000, 'count_vectorizer__min_df': 1, 'count_vectorizer__ngram_range': (1, 1), 'count_vectorizer__preprocessor': None, 'count_vectorizer__strip_accents': None, 'count_vectorizer__tokenizer': <function identity at 0x7f4226572a60>, 'naive_bayes_multiclass__alpha': 1.0, 'tf_idf__norm': None, 'tf_idf__smooth_idf': False, 'tf_idf__sublinear_tf': False, 'to_ngram__ngram': 2, 'to_ngram__stride': 1}
Cross-Validation Grid Search for: '3-gram'...
Best hyperparameters for '3-gram' (4-folds cross validation accuracy score=0.9973924380704041):
{'count_vectorizer__lowercase': False, 'count_vectorizer__max_df': 0.98, 'count_vectorizer__max_features': 100000, 'count_vectorizer__min_df': 1, 'count_vectorizer__ngram_range': (1, 1), 'count_vectorizer__preprocessor': None, 'count_vectorizer__strip_accents': None, 'count_vectorizer__tokenizer': <function identity at 0x7f4226572a60>, 'naive_bayes_multiclass__alpha': 0.1, 'tf_idf__norm': 'l2', 'tf_idf__smooth_idf': True, 'tf_idf__sublinear_tf': True, 'to_ngram__ngram': 3, 'to_ngram__stride': 1}
plot_range = 100
best_trained_pipelines = dict()
best_trained_pipelines.update(best_trained_logistic_pipelines)
best_trained_pipelines.update(best_trained_bayes_pipelines)
# eval_and_plot(best_trained_logistic_pipelines, X_test, y_test, plot_range=plot_range)
# eval_and_plot(best_trained_bayes_pipelines, X_test, y_test, plot_range=plot_range)
eval_and_plot(best_trained_pipelines, X_test, y_test, plot_range=plot_range)
print("")
print("The final test: classifying on test documents of full-length:")
print("")
max_score = 0
max_score_model = ""
for (model_name, model) in best_trained_pipelines.items():
score = model.score(X_test, y_test) * 100
if score > max_score:
max_score = score
max_score_model = model_name
print("Test set score for '{}': {}%".format(model_name, score))
print("")
print("Max score is by '{}': {}%".format(max_score_model, max_score))
print("")
The final test: classifying on test documents of full-length:
Test set score for '1-gram Char Logistic Classifier': 100.0%
Test set score for '2-gram Char Logistic Classifier': 100.0%
Test set score for '3-gram Char Logistic Classifier': 100.0%
Test set score for '1-gram Char Multinomial Naive Bayes Classifier': 95.0%
Test set score for '2-gram Char Multinomial Naive Bayes Classifier': 100.0%
Test set score for '3-gram Char Multinomial Naive Bayes Classifier': 100.0%
Max score is by '1-gram Char Logistic Classifier': 100.0%