# ML Pipeline Preparation & Algorithm choice
This notebook explores feature engineering options, then evaluates several models in order to choose one.  
We are dealing with text, and we want to classify each input message into one or more categories. 

**We will build a supervised NLP model with multilabel classification (and not multiclass!).**  
What is the difference? 
Well, _"in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task"_ [source](https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff).
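
To make the distinction concrete, here is a small illustrative sketch. The label names are inspired by this dataset but the values are invented: a multiclass target holds exactly one class per sample, whereas a multilabel target is a binary indicator matrix in which each row may contain zero, one or several 1s.

```python
# Toy illustration of multiclass vs. multilabel targets (values are invented).
labels = ["water", "food", "shelter"]

# Multiclass: exactly one class per sample.
y_multiclass = ["water", "shelter", "food"]

# Multilabel: one binary column per label; a row can hold several 1s or none.
y_multilabel = [
    [1, 1, 0],  # message about water AND food
    [0, 0, 1],  # message about shelter only
    [0, 0, 0],  # message that matches no label
]

# Number of active labels for each message.
labels_per_sample = [sum(row) for row in y_multilabel]
print(labels_per_sample)
```

The `Y` DataFrame built later in this notebook has the second shape: one 0/1 column per category.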

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re

# NLP transformations
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# sklearn pipelines import
from sklearn.pipeline import Pipeline, FeatureUnion

# sklearn estimators and transformers
from sklearn.base import BaseEstimator, TransformerMixin

# Training
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Multilabel / multioutput wrappers
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier

# Model evaluation
from sklearn.metrics import classification_report, f1_score, make_scorer, precision_score, recall_score, confusion_matrix

---
## 1. Load data from database on disk and build X, Y

In [2]:
# load data from database
engine = create_engine('sqlite:///../data/output/disaster.db')
df = pd.read_sql_table('messages', engine)

In [3]:
df.head()

Unnamed: 0,id,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,direct,1,0,0,1,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


* X (the input feature) is just the `message` text; neither `id` nor `genre` will help us
* the targets are all the other columns (36 categories)

In [4]:
X = df['message']
Y = df.drop(['id', 'message', 'genre'], axis=1)

In [5]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [6]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---
## 2. NLP tasks
The project suggests using **[nltk](https://www.nltk.org/)**, a well-known NLP library. I will instead use another package that has also become very popular recently, **[spacy](https://spacy.io/)**, because I find it easier to use.  

Before going further, make sure that the spacy package is installed and that the _'en_core_web_sm'_ model has been downloaded. This can be done with:
> ```pip install spacy```  
> ```python -m spacy download en_core_web_sm```

The first step is to load the downloaded model. I have chosen an English one, and a small one (note the trailing _'sm'_ in the _'en_core_web_sm'_ name). Larger models exist, but spacy recommends starting with a small one and moving to a larger one only if needed.  
**Note:** it is a good idea to give spacy the list of pipeline components that we will not use, so that they are skipped and processing is faster. Please refer to the [spacy documentation about pipelines](https://spacy.io/usage/processing-pipelines) to discover all of them.

In [5]:
# Dependency parsing, Named Entity Recognition and Text Categorization are not needed for our specific usage
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])

For the text transformation, there are easy tricks such as lowercasing everything, removing digits and stop words, etc. Then we have the [choice between Stemming and Lemmatization](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python):  
_"Stemming and Lemmatization helps us to achieve the root forms (sometimes called synonyms in search context) of inflected (derived) words."_  
* "Stemming is different to Lemmatization in the approach it uses to produce root forms of words and the word produced. Stemming a word or sentence may result in words that are not actual words. Stems are created by removing the suffixes or prefixes used with a word."
* "Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma is the canonical form, dictionary form, or citation form of a set of words."
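
The contrast quoted above can be illustrated with a toy sketch. Both helpers below are hypothetical simplifications written for this illustration (they are neither nltk nor spacy code): the stemmer blindly strips suffixes, while the lemmatizer relies on a tiny hand-written lookup table standing in for real vocabulary knowledge.

```python
# Toy contrast between stemming and lemmatization.
# Both helpers are hypothetical simplifications, not nltk or spacy code.

def naive_stem(word):
    """Strip a few common suffixes without checking the result is a real word."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer needs vocabulary knowledge; here a tiny hand-written table.
LEMMA_TABLE = {"studies": "study", "was": "be", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary form when known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "was", "better"]:
    print(w, "->", naive_stem(w), "(stem) |", toy_lemmatize(w), "(lemma)")
```

The stem of "studies" is not an English word, whereas its lemma is; irregular forms such as "was" are only handled by the lookup.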

I will **go for Lemmatization**: I will use the spacy NLP model that has been loaded, and remove punctuation, whitespace, stop words, digits, URLs and emails.

In [6]:
def lemmatize_txt(x):
    """
    Lemmatize text from a string.
    :param x: (string) the text we want to lemmatize
    :return: (string) string containing the lemmatized text.
    """
    doc = nlp.tokenizer(x.lower())
    # Keep the lemma of every token that is not noise (punctuation, stop word, URL...)
    lemma_txt = [token.lemma_ for token in doc if not (
              token.is_punct or
              token.is_space or
              token.is_stop or
              token.is_digit or
              token.is_quote or
              token.is_bracket or
              token.is_currency or
              token.like_url or
              token.like_email)]
    lemma_txt = ' '.join(lemma_txt)
    return lemma_txt

#### Trying it!

In [7]:
# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Verbs: ['start', 'work', 'drive', 'take', 'can', 'tell', 'would', 'shake', 'turn', 'be', 'talk', 'say']


In [8]:
lemmatize_txt(text)

'sebastian thrun start work self drive car google people outside company take seriously tell senior ceos major american car company shake hand turn away not worth talk say thrun interview recode early week'

---
## 3. Build model pipelines
### 3.1. Models to try
The next phase is to build different pipelines, one for each model that I will try:
* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), a simple and basic model for classification
* [RandomForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), a tree-based algorithm that I want to try because it is an ensemble model that is easier and faster to use than [GradientBoostingTree](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) or even [XGBoost](https://xgboost.readthedocs.io/en/latest/), which can both require huge computation time.

### 3.2. How to perform multilabel classification?
* As per the scikit-learn documentation, for Logistic Regression: _"In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme"_. So the Logistic Regression classifier will be wrapped in a [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html), which consists in _"fitting one classifier per class. For each classifier, the class is fitted against all the other classes"_.
* For the RandomForest, we can wrap it in a [MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html), which can be used to _"extend classifiers that do not natively support multi-target classification"_.

An example of MultiOutputClassifier used with RandomForestClassifier is available in the [scikit-learn documentation about multiclass](https://scikit-learn.org/stable/modules/multiclass.html).
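
As a minimal sketch of the wrapping idea, assuming a tiny invented dataset (`X_toy` and `Y_toy` are made up for this illustration), the wrapper fits one independent binary classifier per label column and stacks their predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Invented toy data: 6 samples, 2 numeric features, 2 binary labels.
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0],
                  [0.9, 0.2], [0.5, 0.5], [0.4, 0.6]])
Y_toy = np.array([[0, 1], [0, 1], [1, 0],
                  [1, 0], [1, 1], [0, 1]])

# One independent binary LogisticRegression is fitted per label column.
clf = MultiOutputClassifier(LogisticRegression(solver="liblinear"))
clf.fit(X_toy, Y_toy)

print(len(clf.estimators_))      # number of fitted estimators, one per label
print(clf.predict(X_toy).shape)  # (n_samples, n_labels)
```

With the 36 categories of this dataset, the same wrapper would hold 36 fitted estimators, which is one reason why training time grows quickly.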

### 3.3. For the pipeline, use _FeatureUnion_ or _ColumnTransformer_?
Based on this [reading](https://stackoverflow.com/questions/55604249/featureunion-vs-columntransformer), I have decided to go for [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) because it is said that:  
_"FeatureUnion applies different transformers to the whole of the input data and then combines the results by concatenating them.  
ColumnTransformer, on the other hand, applies different transformers to different subsets of the whole input data, and again concatenates the results."_

In our case, there is only one feature, the original text message, to which we want to apply several transformations.
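
The concatenation described in the quote can be sketched without scikit-learn. The two feature functions below are hypothetical stand-ins for transformers: each one receives the whole input and the results are joined column-wise, row by row.

```python
# Conceptual sketch of feature concatenation, written without scikit-learn.
# The two feature functions below are hypothetical stand-ins for transformers.

def word_count(text):
    return [len(text.split())]

def char_count(text):
    return [len(text)]

def union_features(texts, feature_funcs):
    """Apply every function to every text and concatenate the results per row."""
    rows = []
    for text in texts:
        row = []
        for func in feature_funcs:
            row.extend(func(text))
        rows.append(row)
    return rows

messages = ["we need water", "help"]
features = union_features(messages, [word_count, char_count])
print(features)
```

Each row ends up with one block of columns per feature function, which is the layout a classifier placed after the union would receive.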

#### 3.3.1. TF-IDF
Let's start with a very common feature engineering technique in NLP: TF-IDF vectorization. It gives more weight to important words (based on their frequency in a text and the number of documents/texts they appear in). The parameters are those of the sklearn [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html); all of them can be tuned through GridSearch, but for the first run let's use some arbitrary values:
* max_df: 0.5, to ignore terms that have a document frequency strictly higher than this threshold (50%)
* max_features: 5000 seems a reasonable starting point. The vocabulary will only keep the top 5000 terms ordered by term frequency across the corpus.
* ngram_range: (1, 2), to also consider bi-grams
* binary: False; if set to True, all non-zero term counts are set to 1 (even if the term appears more than once)
* norm: None, so no normalization for the moment (default is 'l2')
* use_idf: True, to enable the Inverse Document Frequency reweighting (this is the default)
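
To see what the weighting does, here is a hand-rolled sketch on an invented, already tokenized corpus. It assumes raw term counts, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` documented for scikit-learn's default settings, and no row normalization (as with `norm=None`); it is an illustration, not the library's implementation.

```python
import math
from collections import Counter

# Tiny invented corpus, already tokenized.
docs = [
    ["water", "need", "water"],
    ["need", "food"],
    ["need", "shelter", "food"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term count times IDF, without any row normalization."""
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(docs[0])
print(weights)
```

In this toy corpus "need" appears in every document, so it gets the lowest possible IDF; a `max_df=0.5` threshold would drop it from the vocabulary altogether.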

#### 3.3.2. Logistic Regression with OneVsRest pipeline
As per the sklearn documentation, the solver to use is liblinear, which _"is limited to one-versus-rest schemes"_.

In [144]:
pipeline_lr = Pipeline([
            ("features", FeatureUnion([
                ("text", TfidfVectorizer(tokenizer=lemmatize_txt, max_df=0.5, 
                                 max_features=5000, ngram_range=(1, 2),
                                 binary=False, norm=None, 
                                 use_idf=True))
            ])),    
            ("clf", OneVsRestClassifier(LogisticRegression(solver='liblinear')))
            ])
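
One point worth checking here: a `tokenizer` callable given to `TfidfVectorizer` is expected to return a sequence of tokens, while `lemmatize_txt` returns a single space-joined string, and iterating over a Python string yields characters rather than words. The sketch below shows the difference and one possible adaptation. `fake_lemmatize` is a stand-in for `lemmatize_txt` so that the example runs without spacy, and `lemmatize_tokens` is a hypothetical wrapper name.

```python
# Stand-in for lemmatize_txt so the sketch runs without spacy (assumption):
# like the original function, it returns ONE space-joined string.
def fake_lemmatize(text):
    return " ".join(word.lower() for word in text.split())

def lemmatize_tokens(text, lemmatizer=fake_lemmatize):
    """Hypothetical wrapper returning a list of tokens instead of a string."""
    return lemmatizer(text).split()

sample = "Need Water"
as_string = fake_lemmatize(sample)
print(list(as_string))           # iterating a string yields single characters
print(lemmatize_tokens(sample))  # a list yields whole words
```

If word-level features are wanted, the vectorizer could be built with such a list-returning wrapper. The scores reported in this notebook were obtained with `lemmatize_txt` as written, so they would likely change.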

#### 3.3.3. RandomForest with MultiOutputClassifier pipeline
Here the RandomForest classifier has only a few parameters set, without any tuning:
* random state to ensure reproducibility
* an arbitrary number of estimators
* max_depth, to limit the tree size and computation time

In [145]:
pipeline_rf = Pipeline([
            ("features", FeatureUnion([
                ("text", TfidfVectorizer(tokenizer=lemmatize_txt, max_df=0.5, 
                                 max_features=5000, ngram_range=(1, 2),
                                 binary=False, norm=None, 
                                 use_idf=True))
            ])),    
            ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42, max_depth=8,
                                                                 n_estimators=300)))
            ])

---
## 4. Split into train/test and fit the model pipelines

In [7]:
x_train_val, x_test, y_train_val, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

### 4.1. Logistic regression

In [147]:
%%time
pipeline_lr.fit(x_train_val, y_train_val)



Wall time: 1min 55s


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=5000, min_df=1,
        ngr...ne, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None))])

### 4.2. Random Forest

In [148]:
%%time
pipeline_rf.fit(x_train_val, y_train_val)

Wall time: 1min 37s


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=5000, min_df=1,
        ngr...           oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=None))])

--- 
## 5. Models evaluation
How can we evaluate? As explained in this [towardsdatascience post](https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff), we have to average the results obtained for all categories, and there are 2 options for that: micro-averaging or macro-averaging.  
Read this post for further information if needed. I will go for micro-averaging as _"it is a useful measure when your dataset varies in size"_ (and remember that our classes/targets are imbalanced).  
Another interesting source about the difference and what to choose is available [here](https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin), where it is written that **_"In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance (i.e you may have many more examples of one class than of other classes)."_**.

In [8]:
def evaluate_model_perf(model_name, pipeline, x_test, y_test):
    """
    Evaluate model performance by predicting on test dataset and then printing performance metrics
    :param model_name: (string) name of the model, just for display purpose
    :param pipeline: (object) the fitted sklearn classifier pipeline to use for inference
    :param x_test: (pandas Series) messages of the test subset used to evaluate
    :param y_test: (pandas DataFrame) ground truth targets for the test subset used to evaluate
    :return: (tuple) the predictions and a DataFrame holding the F1-Score of each class
    """
    y_pred = pipeline.predict(x_test)
    
    p = precision_score(y_test, y_pred, average="micro")
    r = recall_score(y_test, y_pred, average="micro")
    f1 = f1_score(y_test, y_pred, average="micro")
    wp = precision_score(y_test, y_pred, average="weighted")
    wr = recall_score(y_test, y_pred, average="weighted")
    wf1 = f1_score(y_test, y_pred, average="weighted")
    
    print("Model {} -- MICRO metrics -- Precision: {:.4f}, Recall: {:.4f}, F1-Score: {:.4f}".format(model_name, p, r, f1))
    print("Model {} -- WEIGHTED metrics -- Precision: {:.4f}, Recall: {:.4f}, F1-Score: {:.4f}".format(model_name, wp, wr, wf1))
    
    # Build a df with results, class per class
    records = []
    records.append(('GLOBAL', f1))
    for i, col in enumerate(y_test.columns):
        records.append((col, f1_score(y_test[col], y_pred.T[i])))
    
    df = pd.DataFrame.from_records(records, columns=['class', 'F1-Score {}'.format(model_name)])
    # set_index returns a new DataFrame, so the result has to be assigned
    df = df.set_index('class')
    return y_pred, df

### 5.1. Results for our first models

In [149]:
y_pred_lr, df_lr = evaluate_model_perf('LogisticReg', pipeline_lr, x_test, y_test)
y_pred_rf, df_rf = evaluate_model_perf('RandomForest', pipeline_rf, x_test, y_test)

Model LogisticReg -- MICRO metrics -- Precision: 0.6726, Recall: 0.5169, F1-Score: 0.5845
Model LogisticReg -- WEIGHTED metrics -- Precision: 0.6228, Recall: 0.5169, F1-Score: 0.5535




Model RandomForest -- MICRO metrics -- Precision: 0.7683, Recall: 0.3281, F1-Score: 0.4599
Model RandomForest -- WEIGHTED metrics -- Precision: 0.6506, Recall: 0.3281, F1-Score: 0.3327




In [153]:
df_lr

class,F1-Score LogisticReg
GLOBAL,0.584541
related,0.862293
request,0.519973
offer,0.034483
aid_related,0.624843
medical_help,0.311111
medical_products,0.325123
search_and_rescue,0.22
security,0.017699
military,0.248996


---
### 5.2. Let's have a better understanding of micro, macro and weighted average
Let's analyze a single class, the first one (`related`).

In [156]:
print(classification_report(y_test['related'], y_pred_lr.T[0]))

              precision    recall  f1-score   support

           0       0.59      0.39      0.47      1306
           1       0.82      0.91      0.86      3938

   micro avg       0.78      0.78      0.78      5244
   macro avg       0.71      0.65      0.67      5244
weighted avg       0.76      0.78      0.76      5244



In [155]:
# We will need the confusion matrix to see how many elements were classified as positive in this category
confusion_matrix(y_test['related'], y_pred_lr.T[0])

array([[ 510,  796],
       [ 350, 3588]], dtype=int64)

Row 1 is when truth is 0 whereas column 1 is when prediction is 0. So we have here:
* 510 True Negative (TN) (truth is 0 and prediction is 0 as well)
* 3588 True Positive (TP) (same for the 1)
* 796 False Positive (FP) (we say it is positive whereas actually it is not)
* 350 False Negative (FN) (we say it is negative and actually it is positive)

The Precision formula is TP/(TP + FP) and the Recall one is TP/(TP + FN).

In [157]:
print("Precision for Positive is {:.2f}, Recall is {:.2f}".format(3588/(3588+796), 3588/(3588+350)))
print("Precision for Negative is {:.2f}, Recall is {:.2f}".format(510/(510+350), 510/(510+796)))

Precision for Positive is 0.82, Recall is 0.91
Precision for Negative is 0.59, Recall is 0.39


There are 1306 samples that should be classified as Negative and 3938 as Positive (cf. 'support' column).  
The macro average is a plain mean of the Precision for Positive and the Precision for Negative.  
The weighted one takes into account the share of each class: Negative represents 1306/(1306+3938), which is around 25% of the samples.  
The micro one is still a mystery...

In [158]:
print("Precision macro is {}, Precision weighted is {:.3f}".format((0.59+0.82)/2, (1306*0.59+3938*0.82)/(1306+3938)))

Precision macro is 0.705, Precision weighted is 0.763


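For micro averaging it is a little bit trickier: the goal is to sum all true positives (over both classes) and divide by the sum of all true positives plus the sum of all false positives. Here every sample is either a true positive or a false positive for the class it was assigned to, so the denominator is simply the number of samples: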

In [159]:
print("Micro averaged precision is {:.2f}".format((3588+510)/(1306+3938)))

Micro averaged precision is 0.78


It will be the same value for recall, as we still have the same number of True Positives (3588+510) and now add the False Negatives instead of the False Positives. As very well explained in this [great post](https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1) with pictures, _"each prediction error (X is misclassified as Y) is a False Positive for Y, and a False Negative for X. Thus, the total number of False Negatives is again the total number of prediction errors"_
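
The three averages discussed above can be recomputed from the four cells of the `related` confusion matrix. This is a plain-Python sketch of the arithmetic (variable names are chosen for this illustration); it works from the exact counts rather than from the rounded values of the report.

```python
# Confusion matrix of the 'related' label, copied from the output above.
tn, fp, fn, tp = 510, 796, 350, 3588

# Per-class precision, treating 1 and then 0 as the "positive" class.
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)

support_pos = tp + fn  # samples whose truth is 1
support_neg = tn + fp  # samples whose truth is 0

macro = (precision_pos + precision_neg) / 2
weighted = (support_pos * precision_pos + support_neg * precision_neg) / (support_pos + support_neg)

# Micro: pool the counts of both classes before dividing.
# The false positives of one class are the false negatives of the other,
# so micro precision and micro recall share the same denominator.
micro_precision = (tp + tn) / ((tp + fp) + (tn + fn))
micro_recall = (tp + tn) / ((tp + fn) + (tn + fp))

print(round(macro, 3), round(weighted, 3), round(micro_precision, 3))
```

Rounded to two decimals, the values agree with the `macro avg`, `weighted avg` and `micro avg` precision rows of the classification report.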

When looking at a single class in a multilabel classification, the micro average of that class makes little sense, but overall it is a much better metric than the macro one, which just takes the mean of all precision, recall and F1 metrics. The weighted average is also a good option as it takes the number of samples into account; that is why **I am looking at both of them, micro and weighted**.

***Observations about the first results:***
* The **Logistic Regression has a better F1-Score than the Random Forest (so far)** (0.58 vs. 0.46).
* With the `weighted` average method (instead of `micro`), the result is even lower for both of them.
* There are classes with a very low F1-Score, near 0 (for instance `offer` or `security`)

Let's print the full classification report so that we will also have access to the number of samples for each class:

In [160]:
print(classification_report(y_test, y_pred_lr))



              precision    recall  f1-score   support

           0       0.82      0.91      0.86      3938
           1       0.66      0.43      0.52       895
           2       0.03      0.04      0.03        26
           3       0.67      0.58      0.62      2131
           4       0.47      0.23      0.31       422
           5       0.49      0.24      0.33       270
           6       0.30      0.17      0.22       127
           7       0.04      0.01      0.02        88
           8       0.33      0.20      0.25       155
           9       0.63      0.44      0.52       339
          10       0.71      0.55      0.62       595
          11       0.58      0.37      0.46       470
          12       0.47      0.27      0.34        73
          13       0.33      0.25      0.29       104
          14       0.06      0.03      0.04        60
          15       0.32      0.19      0.24       171
          16       0.39      0.25      0.31       237
          17       0.37    

### 5.3. Specific error analysis for categories `offer` and `security`

In [161]:
print(classification_report(y_test['offer'], y_pred_lr.T[2]))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      5218
           1       0.03      0.04      0.03        26

   micro avg       0.99      0.99      0.99      5244
   macro avg       0.51      0.52      0.51      5244
weighted avg       0.99      0.99      0.99      5244



In [162]:
print(classification_report(y_test['security'], y_pred_lr.T[7]))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5156
           1       0.04      0.01      0.02        88

   micro avg       0.98      0.98      0.98      5244
   macro avg       0.51      0.50      0.50      5244
weighted avg       0.97      0.98      0.97      5244



In [163]:
confusion_matrix(y_test['offer'], y_pred_lr.T[2])

array([[5187,   31],
       [  25,    1]], dtype=int64)

In [164]:
confusion_matrix(y_test['security'], y_pred_lr.T[7])

array([[5132,   24],
       [  87,    1]], dtype=int64)

***Observations:*** there are **too few samples that should be classified as 1, so our model tends to classify almost everything as 0**, and yet the micro average of each label still looks good. Even the weighted one looks good.  
Question: how much did the model have to learn from for those categories? Let's count how many training samples each of those 2 categories has:

In [165]:
print("{} samples with class {} in training dataset (so {:.2f}% of the dataset)".
      format(y_train_val['security'].sum(), 'security', y_train_val['security'].sum()*100/len(y_train_val)))

383 samples with class security in training dataset (so 1.83% of the dataset)


In [166]:
print("{} samples with class {} in training dataset (so {:.2f}% of the dataset)".
      format(y_train_val['offer'].sum(), 'offer', y_train_val['offer'].sum()*100/len(y_train_val)))

92 samples with class offer in training dataset (so 0.44% of the dataset)
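
The confusion matrix of `offer` shown above makes it possible to quantify how misleading the per-label micro average is for such a rare category. The sketch below is plain arithmetic on those four counts; the "always 0" baseline is a hypothetical model introduced only for comparison.

```python
# Confusion matrix of the 'offer' label, copied from the output above.
tn, fp, fn, tp = 5187, 31, 25, 1

# Share of correct predictions over both classes (the "micro avg" row).
micro = (tp + tn) / (tp + tn + fp + fn)

# Scores for the positive class only.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical baseline that never predicts 'offer':
# every positive sample becomes a false negative.
always_zero_micro = (tn + fp) / (tp + tn + fp + fn)

print(round(micro, 4), round(f1, 4), round(always_zero_micro, 4))
```

A model that never predicts `offer` would get an even higher micro average on this label than the fitted one, which is why the F1-Score of the positive class is the number to watch for rare categories.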


---
## 6. Adding new features to the pipelines
Before trying to fine-tune those models to improve the metrics, we could try to add some features, as the only one we have for the moment is the TF-IDF vectorization. Following this [post on Kaggle](https://www.kaggle.com/shaz13/feature-engineering-for-nlp-classification), we could add other information to see if it helps, such as:
* number of sentences, words
* length of the text, length of sentences
* by using the part-of-speech tags (a.k.a POS), count the number of NOUNS, VERBS, ADJECTIVES and so on
* ...

Okay, how do we do that? Thanks to scikit-learn we can build our own Transformer classes and put them in the pipeline.

### 6.1. Looking for POS tags with Spacy
As per the [Spacy documentation on POS tags](https://spacy.io/api/annotation#pos-tagging), we will look for:
* adjectives (POS='ADJ')
* nouns (POS='NOUN')
* verbs (POS='VERB')

Particles, pronouns, determiners, auxiliaries and so on are less relevant and should not bring very useful information.

In [9]:
def get_pos(text, pos):
    doc = nlp(text)
    elements = [token for token in doc if token.pos_ == pos]
    return elements, len(elements)

In [355]:
text = x_train_val.loc[22944]
print(text)
print("Verbs:", get_pos(text, "VERB"))
print("Nouns:", get_pos(text, "NOUN"))
print("Adjectives:", get_pos(text, "ADJ"))

* How do beneficiaries (gender/economic status) in varying circumstances (emergency/non-emergency) spend cash?
Verbs: ([do, varying, spend], 3)
Nouns: ([beneficiaries, gender, status, circumstances, emergency, emergency, cash], 7)
Adjectives: ([economic, non], 2)


### 6.2. Build custom transformers
The next section has been developed with the help of this [blog post](https://michelleful.github.io/code-blog/2015/06/20/pipelines/).

In [10]:
class WordCountExtractor(BaseEstimator, TransformerMixin):
    """Takes in dataframe and outputs number of words"""

    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self

    def nb_words(self, text):
        return len(text.split())

    def transform(self, x):
        """
        Need to encapsulate the result within a DataFrame otherwise an error is thrown ("ValueError: blocks[0,:] has 
        incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 20972.")
        """
        return pd.DataFrame(x.apply(self.nb_words))

In [11]:
class AverageWordLengthExtractor(BaseEstimator, TransformerMixin):
    """Takes in dataframe and outputs average word length"""

    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self

    def average_word_length(self, text):
        """
        Sometimes text is empty so need to handle this case
        """
        return np.mean([len(word) for word in text.split()]) if len(text.split()) > 0 else 0

    def transform(self, x):
        return pd.DataFrame(x.apply(self.average_word_length))

In [12]:
def get_sentences(text):
    """
    Homemade function to retrieve all sentences within a given text. I have decided that sentences have 
    words and several characters and should end by either a '.', a '?' or a '!'
    """
    # Add a trailing dot if the last character is not ".", "?" or "!" to ensure we will capture the last sentence
    if text.strip()[-1:] not in ['.', '?', '!']:
        text = text + '.'
    sentence_detection_regex = r"\s?[\w,-;:'()\"\s]+[.?!]"
    sentences = re.findall(sentence_detection_regex, text)
    return sentences
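
One detail of the regular expression above is worth checking: inside a character class, `,-;` is read as a range running from `,` to `;`, and that range contains the dot. The sketch below compares the pattern as written with a variant in which the hyphen is escaped (assuming a literal hyphen was intended); the sample text is invented.

```python
import re

# Pattern copied from get_sentences: inside the class, ",-;" is a RANGE.
original = r"\s?[\w,-;:'()\"\s]+[.?!]"
# Variant with the hyphen escaped so that it is a literal character
# (assumption: this is the intended meaning).
escaped = r"\s?[\w,\-;:'()\"\s]+[.?!]"

text = "Roads are blocked. We need water."
found_original = re.findall(original, text)
found_escaped = re.findall(escaped, text)
print(found_original)
print(found_escaped)
```

With the range, the dot belongs to the class and the greedy match swallows both sentences; with the escaped hyphen, each sentence is captured separately. The sentence counts used for the results in this notebook come from the pattern as written.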

In [13]:
class SentenceCountExtractor(BaseEstimator, TransformerMixin):
    """Takes in dataframe and outputs number of sentences"""

    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self

    def nb_sentences(self, text):
        return len(get_sentences(text))

    def transform(self, x):
        return pd.DataFrame(x.apply(self.nb_sentences))

In [14]:
class AverageSentenceLengthExtractor(BaseEstimator, TransformerMixin):
    """Takes in dataframe and outputs average sentence length"""

    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self

    def average_sentence_length(self, text):
        return np.mean([len(sentence) for sentence in get_sentences(text)]) if len(get_sentences(text)) > 0 else 0

    def transform(self, x):
        return pd.DataFrame(x.apply(self.average_sentence_length))

In [15]:
def count_pos(doc, pos):
    return len([token for token in doc if token.pos_ == pos])

In [16]:
class PosCountExtractor(BaseEstimator, TransformerMixin):
    """Takes in dataframe and outputs number of words tagged as the given part-of-speech (POS)"""

    def __init__(self):
        pass
    
    def fit(self, x, y=None):
        return self

    def nb_pos(self, text):
        """
        For performance reasons, we load through spacy nlp only once and get all counts we are interested in
        """
        doc = nlp(text)
        nb_nouns = count_pos(doc, "NOUN")
        nb_verbs = count_pos(doc, "VERB")
        nb_adjectives = count_pos(doc, "ADJ")
        return nb_nouns, nb_verbs, nb_adjectives

    def transform(self, x):
        # x.apply returns a Series in which each value is a tuple (nb_nouns, nb_verbs, nb_adjectives);
        # expand it into one integer column per count
        counts = x.apply(self.nb_pos)
        return pd.DataFrame(counts.tolist(), index=x.index,
                            columns=['nb_nouns', 'nb_verbs', 'nb_adjectives'])

### 6.3. Putting it all together in a new pipeline
#### 6.3.1. Logistic Regression

In [17]:
pipeline_lr_v2 = Pipeline([
            ("features", FeatureUnion([
                ("text", TfidfVectorizer(tokenizer=lemmatize_txt, max_df=0.5, 
                                 max_features=5000, ngram_range=(1, 2),
                                 binary=False, norm=None, 
                                 use_idf=True)),
                ("word-count", WordCountExtractor()),
                ("word-len", AverageWordLengthExtractor()),
                ("sentence-count", SentenceCountExtractor()),
                ("sentence-len", AverageSentenceLengthExtractor()),
                ("verb-count", PosCountExtractor())
            ])),    
            ("clf", OneVsRestClassifier(LogisticRegression(solver='liblinear')))
            ])

In [18]:
%%time
pipeline_lr_v2.fit(x_train_val, y_train_val)



Wall time: 6min 44s


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=5000, min_df=1,
        ngr...ne, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None))])

In [19]:
y_pred_lr_v2, _ = evaluate_model_perf('LogisticReg', pipeline_lr_v2, x_test, y_test)

Model LogisticReg -- MICRO metrics -- Precision: 0.6857, Recall: 0.5206, F1-Score: 0.5919
Model LogisticReg -- WEIGHTED metrics -- Precision: 0.6360, Recall: 0.5206, F1-Score: 0.5592


***Observations:*** we have made a little progress on the F1-Score metric.  
Note that fitting on the training dataset took about 3.5 times as long as with the TF-IDF vectorization alone (6min 44s vs. 1min 55s). This might become an issue, as it will take ages if we want to run a GridSearch Cross Validation (around 15 minutes for each combination).
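
One possible way to limit this cost, which is not done in this notebook, is to memoize the expensive per-message computation so that repeated fits over the same messages reuse earlier results. The sketch below uses `functools.lru_cache`; `expensive_features` is a hypothetical stand-in for a slow call such as `nlp(text)`, and the counter only exists to show how many real computations happen.

```python
from functools import lru_cache

call_counter = {"n": 0}

def expensive_features(text):
    """Hypothetical stand-in for a slow per-message computation such as nlp(text)."""
    call_counter["n"] += 1
    return (len(text.split()), len(text))

@lru_cache(maxsize=None)
def cached_features(text):
    """Same result as expensive_features, computed once per distinct message."""
    return expensive_features(text)

messages = ["we need water", "help", "we need water"]
for _ in range(3):  # e.g. three successive fits over the same messages
    results = [cached_features(m) for m in messages]

print(call_counter["n"])  # number of real computations actually performed
```

Such a cache lives in a single Python process, so it would not be shared between workers if the search were run in parallel.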

As a reminder, previous results were:

In [448]:
y_pred_lr, _ = evaluate_model_perf('LogisticReg', pipeline_lr, x_test, y_test)

Model LogisticReg -- MICRO metrics -- Precision: 0.6726, Recall: 0.5169, F1-Score: 0.5845
Model LogisticReg -- WEIGHTED metrics -- Precision: 0.6228, Recall: 0.5169, F1-Score: 0.5535


So **we gain about 0.7 points of micro F1-Score (0.5845 to 0.5919), mostly thanks to a higher _precision_** (0.6726 to 0.6857; the _recall_ remained more or less the same).

#### 6.3.2. Random Forest

In [449]:
pipeline_rf_v2 = Pipeline([
            ("features", FeatureUnion([
                ("text", TfidfVectorizer(tokenizer=lemmatize_txt, max_df=0.5, 
                                 max_features=5000, ngram_range=(1, 2),
                                 binary=False, norm=None, 
                                 use_idf=True)),
                ("word-count", WordCountExtractor()),
                ("word-len", AverageWordLengthExtractor()),
                ("sentence-count", SentenceCountExtractor()),
                ("sentence-len", AverageSentenceLengthExtractor()),
                ("verb-count", PosCountExtractor())
            ])),    
            ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42, max_depth=8,
                                                                 n_estimators=300)))
            ])

In [450]:
%%time
pipeline_rf_v2.fit(x_train_val, y_train_val)

Wall time: 6min 45s


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=5000, min_df=1,
        ngr...           oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=None))])

In [451]:
y_pred_rf_v2, df_rf_v2 = evaluate_model_perf('RandomForest', pipeline_rf_v2, x_test, y_test)


Model RandomForest -- MICRO metrics -- Precision: 0.7656, Recall: 0.3250, F1-Score: 0.4563
Model RandomForest -- WEIGHTED metrics -- Precision: 0.6381, Recall: 0.3250, F1-Score: 0.3305



In [452]:
y_pred_rf, df_rf = evaluate_model_perf('RandomForest', pipeline_rf, x_test, y_test)

Model RandomForest -- MICRO metrics -- Precision: 0.7683, Recall: 0.3281, F1-Score: 0.4599
Model RandomForest -- WEIGHTED metrics -- Precision: 0.6506, Recall: 0.3281, F1-Score: 0.3327


***Note:*** with the random forest and arbitrarily chosen parameters, the new features did not help; the F1-Score is even slightly lower (micro 0.4563 vs 0.4599).
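
The random forest also shows a large gap between its MICRO and WEIGHTED F1-Scores. One possible reason, which is an assumption and was not verified on this dataset, is that some labels are never predicted: such a label gets a per-label F1 of 0, which pulls the weighted average down. The sketch below uses invented toy data (not the disaster messages) to show how the two averages are computed.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_and_weighted_f1(y_true, y_pred):
    """Compute micro and support-weighted F1 for binary multilabel rows."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Micro: pool the counts of all labels, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts)
    )
    # Weighted: one F1 per label, averaged with the label support as weight.
    supports = [tp + fn for tp, _, fn in counts]
    weighted = sum(s * f1_from_counts(*c) for s, c in zip(supports, counts)) / sum(supports)
    return micro, weighted


# Toy data: label 0 is frequent and well predicted, label 1 is rare and never predicted.
y_true_toy = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred_toy = [[1, 0], [1, 0], [1, 0], [0, 0]]

micro_f1, weighted_f1 = micro_and_weighted_f1(y_true_toy, y_pred_toy)
print(f"micro F1 = {micro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
```

In this toy case the never-predicted label lowers the weighted score below the micro score, even though the pooled counts look reasonable.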

For the tuning phase, since Logistic Regression has fewer parameters than Random Forest and is faster to train, I will first fine-tune the TF-IDF with Logistic Regression. Then I will take the best parameters, apply them to the Random Forest pipeline, and run another GridSearch to see whether I can improve the metrics.
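
A minimal sketch of that second step, assuming the search returns a `best_params_` dictionary keyed with the usual `step__param` names. The dictionary values below are placeholders, not results from this notebook, and `rf_pipeline` is a reduced stand-in for `pipeline_rf_v2` (no custom tokenizer or extra extractors).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# Placeholder values standing in for the `best_params_` of a finished search.
best_params_lr = {
    "features__text__use_idf": False,
    "features__text__ngram_range": (1, 2),
    "clf__estimator__C": 1.0,  # classifier-specific: must NOT be copied to the forest
}

# Reduced stand-in for the random forest pipeline.
rf_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", TfidfVectorizer(max_df=0.5, max_features=5000)),
    ])),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(random_state=42, max_depth=8, n_estimators=300))),
])

# Keep only the TF-IDF settings, since both pipelines share the "features" step.
tfidf_params = {
    key: value
    for key, value in best_params_lr.items()
    if key.startswith("features__text__")
}
rf_pipeline.set_params(**tfidf_params)

print(sorted(tfidf_params))
```

Filtering on the `features__text__` prefix matters because `RandomForestClassifier` has no `C` parameter, so passing the full dictionary to `set_params` would raise an error.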

#### Handy: see the transformed dataset!

In [368]:
x_test.head()

7917             Please ,give me some help.Thanks before. 
25322    A positive development is that cooling has bee...
22191    Witnesses say 200 to 500 people have so far di...
18442    The 15-nation Economic Community of West Afric...
1336     and in addition what time will it be possible ...
Name: message, dtype: object

In [441]:
pipeline_lr_v2.named_steps['features'].transform(x_test[:5])

array([[  6.        ,   5.83333333,   1.        ,  40.        ,
          2.        ,   1.        ,   0.        ],
       [ 19.        ,   4.63157895,   1.        , 106.        ,
          6.        ,   4.        ,   1.        ],
       [ 52.        ,   4.61538462,   1.        , 291.        ,
         10.        ,   6.        ,   3.        ],
       [ 30.        ,   5.33333333,   1.        , 189.        ,
          5.        ,   4.        ,   3.        ],
       [ 16.        ,   3.6875    ,   1.        ,  74.        ,
          4.        ,   3.        ,   2.        ]])

---
## 7. Improve model
Here we will use grid search to find better parameters.

### 7.1. Logistic Regression
Here are the parameters that we can tune for the Logistic Regression model:

In [22]:
pipeline_lr_v2.get_params().keys()

dict_keys(['memory', 'steps', 'features', 'clf', 'features__n_jobs', 'features__transformer_list', 'features__transformer_weights', 'features__text', 'features__word-count', 'features__word-len', 'features__sentence-count', 'features__sentence-len', 'features__verb-count', 'features__text__analyzer', 'features__text__binary', 'features__text__decode_error', 'features__text__dtype', 'features__text__encoding', 'features__text__input', 'features__text__lowercase', 'features__text__max_df', 'features__text__max_features', 'features__text__min_df', 'features__text__ngram_range', 'features__text__norm', 'features__text__preprocessor', 'features__text__smooth_idf', 'features__text__stop_words', 'features__text__strip_accents', 'features__text__sublinear_tf', 'features__text__token_pattern', 'features__text__tokenizer', 'features__text__use_idf', 'features__text__vocabulary', 'clf__estimator__C', 'clf__estimator__class_weight', 'clf__estimator__dual', 'clf__estimator__fit_intercept', 'clf__es
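
The keys listed above follow scikit-learn's `<step>__<parameter>` naming convention, where each double underscore goes one level deeper into the nested estimators. The sketch below illustrates this on a reduced pipeline (`demo_pipeline` is a simplified stand-in, without the custom tokenizer and extractors); no fitting is needed to read or set parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# Reduced stand-in for the Logistic Regression pipeline.
demo_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", TfidfVectorizer(max_df=0.5, max_features=5000)),
    ])),
    ("clf", OneVsRestClassifier(LogisticRegression(solver="liblinear"))),
])

# "features" -> FeatureUnion step, "text" -> TfidfVectorizer, "use_idf" -> its parameter.
# "clf" -> OneVsRestClassifier step, "estimator" -> wrapped LogisticRegression, "C" -> its parameter.
demo_pipeline.set_params(features__text__use_idf=False, clf__estimator__C=0.5)

params = demo_pipeline.get_params()
print(params["features__text__use_idf"], params["clf__estimator__C"])
```

The same key strings are what `GridSearchCV` and `RandomizedSearchCV` expect in their parameter grids.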

Let's start with a single parameter to check that the search works and to measure how long it takes, before launching it with many parameters. We will toggle the use of IDF, which is a boolean value.

In [22]:
# Only TF-IDF parameters (Logistic Regression has more or less only the C value to change)
param_grid = {
    'features__text__use_idf': [True, False]
}

cv = GridSearchCV(pipeline_lr_v2, param_grid, cv=2, scoring=make_scorer(f1_score, average='micro'), verbose=3)

In [23]:
%%time
cv.fit(x_train_val, y_train_val)

Fitting 2 folds for each of 2 candidates, totalling 4 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] features__text__use_idf=True ....................................




[CV]  features__text__use_idf=True, score=0.5695081860495396, total= 1.9min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.4min remaining:    0.0s


[CV] features__text__use_idf=True ....................................




[CV]  features__text__use_idf=True, score=0.5710950729117004, total= 1.8min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  4.8min remaining:    0.0s


[CV] features__text__use_idf=False ...................................




[CV]  features__text__use_idf=False, score=0.5976489217053063, total= 1.7min
[CV] features__text__use_idf=False ...................................




[CV]  features__text__use_idf=False, score=0.5982751110723931, total= 1.9min


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  9.4min finished


Wall time: 12min 1s


GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=5000, min_df=1,
        ngr...ne, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'features__text__use_idf': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(f1_score, average=micro), verbose=3)

In [24]:
print("Best score is {:.4f} with params = {}".format(cv.best_score_, cv.best_params_))

Best score is 0.5980 with params = {'features__text__use_idf': False}


***Observations:***
* 12 minutes for a single parameter with 2 values and 2-fold cross-validation: this is far too slow. I will go for a [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) instead and see whether `n_jobs` can be used to run the fits in parallel.
* Interesting point: **it seems that Logistic Regression gives a better score when we do not use the IDF value in TF-IDF** (so only term frequency)!
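
To put the timing in perspective, here is a rough budget calculation based on the verbose log above (9.4 minutes elapsed for 4 fits). The 8-hour budget and the 3 folds are hypothetical values chosen for illustration, and the estimate assumes every fit costs about the same, which the log shows is only approximately true.

```python
# Observed in the verbose log of the grid search above.
elapsed_minutes = 9.4
n_fits_observed = 4
minutes_per_fit = elapsed_minutes / n_fits_observed

# Hypothetical constraints, chosen for illustration only.
budget_hours = 8
n_folds = 3

# Number of fits that fit in the budget, then number of candidates
# (each candidate costs one fit per fold; the final refit is ignored).
max_fits = int(budget_hours * 60 // minutes_per_fit)
max_candidates = max_fits // n_folds

print(f"~{minutes_per_fit:.2f} min per fit")
print(f"{max_fits} fits, i.e. at most {max_candidates} candidates in {budget_hours} hours")
```

A cap like this is what the `n_iter` argument of a randomized search expresses directly, whereas an exhaustive grid grows multiplicatively with every added parameter.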

In [25]:
# Only TF-IDF parameters (Logistic Regression has more or less only the C value to change)
param_grid = {
    'features__text__ngram_range': ((1, 1), (1, 2)),
    'features__text__max_df': [0.5, 0.7, 1.0],
    'features__text__max_features': [3000, 5000],
    'features__text__norm': [None, 'l1', 'l2'],
    'features__text__binary': [True, False],
    'features__text__use_idf': [False]
}

rs = RandomizedSearchCV(pipeline_lr_v2, param_distributions=param_grid, cv=3, n_iter=50, 
                        scoring=make_scorer(f1_score, average='micro'), verbose=3)
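
As a sanity check on the size of this search, the sketch below enumerates the grid with the standard library and counts the fits. It only mimics the idea of sampling candidates without replacement; it is not scikit-learn's actual sampling code, and the seed is arbitrary.

```python
import itertools
import random

# Same search space as the cell above.
param_grid = {
    "features__text__ngram_range": ((1, 1), (1, 2)),
    "features__text__max_df": [0.5, 0.7, 1.0],
    "features__text__max_features": [3000, 5000],
    "features__text__norm": [None, "l1", "l2"],
    "features__text__binary": [True, False],
    "features__text__use_idf": [False],
}

# Exhaustive grid: the cartesian product of all value lists.
all_candidates = [
    dict(zip(param_grid, values))
    for values in itertools.product(*param_grid.values())
]

n_iter, n_folds = 50, 3
rng = random.Random(42)  # arbitrary seed, for reproducibility of the sketch
sampled = rng.sample(all_candidates, n_iter)  # sampling without replacement

print(f"{len(all_candidates)} combinations in the full grid")
print(f"{len(sampled)} sampled candidates x {n_folds} folds = {n_iter * n_folds} fits")
```

With 50 of the 72 combinations sampled, the randomized search covers most of the grid while saving the fits for the remaining 22 combinations (66 fits with 3 folds).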

In [26]:
%%time
rs.fit(x_train_val, y_train_val)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.6137650162649522, total= 4.9min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.0min remaining:    0.0s


[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.6055517657553655, total= 3.4min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 11.2min remaining:    0.0s


[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.597986595102673, total= 2.0min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.4822838751974252, total= 1.4min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.4734446130500758, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.4873703477730324, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.4384428672381479, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.



[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.5909784025063074, total= 1.7min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.5886673956318166, total= 1.7min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.5922631773995838, total= 1.7min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.47765894236482465, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_



[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.4418209351917669, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.4541622760800843, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df



[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.44178883824072923, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.44808640819129675, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.6071161338596899, total= 2.8min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.5951271186440678, total= 2.8min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.5995923661293311, total= 3.0min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.4344044303995494, total= 1.8min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.4325352782622483, total= 1.8min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.4443234587670136, total= 1.9min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.5933151470061447, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.5965584235359424, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.5974011235317482, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.4435049727789775, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, 



[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.5986415882967607, total= 2.0min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.596294246329828, total= 1.9min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.5980428655001861, total= 2.0min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.45369143683006136, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.



[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.4647114474929044, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.4520783650798443, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.



[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.6137650162649522, total= 1.9min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.6055517657553655, total= 1.9min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.597986595102673, total= 1.8min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.5897698634689413, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.5906243007384201, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.5952340328726611, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.44345706439569255, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.43628157713282767, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.4507734475914433, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.6053467648198374, total= 2.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.6007670484396344, total= 2.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.5999468932554435, total= 2.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.4477195536147501, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.4418209351917669, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.4541622760800843, total= 1.0min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df



[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.4686640171201467, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.46575003096742235, total= 1.4min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.4757963380988212, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.6132267629238106, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.6035655889023683, total= 1.7min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.6100461437046802, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True, score=0.45836858006042297, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7



[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=False, score=0.4647114474929044, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True, score=0.5961803071165171, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True, score=0.5925822487851198, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True, score=0.5970233876683203, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.47979500471253533, total= 1.3min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__m



[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.47465842648974516, total= 1.3min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True, score=0.44048349490735056, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0.7, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=0



[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.5520478548327085, total= 1.8min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.5474762764161378, total= 1.7min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.5515305519442146, total= 1.9min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.5953979962567434, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.5938440950150552, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True, score=0.5990890316009845, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.6053467648198374, total= 2.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.6007670484396344, total= 2.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.5999468932554435, total= 2.1min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.5897698634689413, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.5906243007384201, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.5952340328726611, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.44345706439569255, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.43628157713282767, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.4507734475914433, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.4686640171201467, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.46575003096742235, total= 1.4min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=False, score=0.4757963380988212, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False, score=0.443970507586375, total= 1.1min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__ma



[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True, score=0.44743935309973043, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True, score=0.4458666666666667, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True, score=0.4584165636196455, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.5933151470061447, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.5965584235359424, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=False, score=0.5974011235317482, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.4272866856766835, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.42863019333607566, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=True, score=0.44054717642932306, total= 1.2min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True, score=0.611557064937738, total= 1.7min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True, score=0.6097256208673921, total= 1.8min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=0.7, features__text__binary=True, score=0.6131613582521717, total= 1.7min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.45994466646172766, total= 1.4min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.4600454135431895, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.46949184560166235, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.5953979962567434, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.5938440950150552, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=0.5, features__text__binary=True, score=0.5990890316009845, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l2, features__text__ngram_range=(1, 1), features__text__max_features=5000, features__text__max_df=0.5, f



[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.43598615916955014, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.43276368575670465, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=



[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.6071161338596899, total= 2.7min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.5951271186440678, total= 2.5min
[CV] features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=None, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=False, score=0.5995923661293311, total= 2.7min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.4344044303995494, total= 1.5min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.4325352782622483, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=3000, features__text__max_df=1.0, features__text__binary=True, score=0.4443234587670136, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 




[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.43598615916955014, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False, score=0.43276368575670465, total= 1.6min
[CV] features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=1.0, features__text__binary=False 
[CV]  features__text__use_idf=False, features__text__norm=l1, features__text__ngram_range=(1, 2), features__text__max_features=5000, features__text__max_df=

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 328.0min finished


Wall time: 5h 30min 13s


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=5000, min_df=1,
        ngr...ne, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None))]),
          fit_params=None, iid='warn', n_iter=50, n_jobs=None,
          param_distributions={'features__text__ngram_range': ((1, 1), (1, 2)), 'features__text__max_df': [0.5, 0.7, 1.0], 'features__text__max_features': [3000, 5000], 'features__text__norm': [None, 'l1', 'l2'], 'features__text__binary': [True, False], 'features__text__use_idf': [False]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn',
          scoring=make_scorer(f1

In [27]:
print("Best score is {:.4f} with params = {}".format(rs.best_score_, rs.best_params_))

Best score is 0.6115 with params = {'features__text__use_idf': False, 'features__text__norm': None, 'features__text__ngram_range': (1, 2), 'features__text__max_features': 5000, 'features__text__max_df': 0.7, 'features__text__binary': True}


#### See cross validation results table

In [31]:
rs.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_features__text__use_idf', 'param_features__text__norm', 'param_features__text__ngram_range', 'param_features__text__max_features', 'param_features__text__max_df', 'param_features__text__binary', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'mean_train_score', 'std_train_score'])

In [38]:
# Keep the searched parameters and the test scores, sorted by rank (best first)
rs_table = pd.DataFrame.from_dict(rs.cv_results_)[['param_features__text__norm',
                                                   'param_features__text__ngram_range',
                                                   'param_features__text__max_features',
                                                   'param_features__text__max_df',
                                                   'param_features__text__binary',
                                                   'mean_test_score',
                                                   'rank_test_score']].sort_values(by='rank_test_score')
rs_table.columns = ['norm', 'ngram_range', 'max_features', 'max_df', 'binary', 'mean_test_score', 'rank_test_score']
rs_table

Unnamed: 0,norm,ngram_range,max_features,max_df,binary,mean_test_score,rank_test_score
42,,"(1, 2)",5000,0.7,True,0.611481,1
24,,"(1, 2)",5000,0.5,True,0.608946,2
15,,"(1, 2)",3000,1.0,True,0.605768,3
0,,"(1, 2)",5000,1.0,True,0.605768,3
32,,"(1, 2)",5000,0.7,False,0.60202,5
18,,"(1, 2)",3000,0.7,False,0.60202,5
47,,"(1, 2)",3000,1.0,False,0.600612,7
7,,"(1, 2)",5000,1.0,False,0.600612,7
11,,"(1, 2)",3000,0.5,False,0.59766,9
44,l2,"(1, 2)",3000,0.5,True,0.59611,10


***Observations:***
* The 9 best scores all use ***ngram_range*** = _(1, 2)_ and ***norm*** = _None_, so I think it is safe to keep those values.
* The 4 best scores also have the ***binary*** parameter set to _True_, so let's go with it as well.
* For the others (***max_df*** and ***max_features***), we could run another search over new ranges to make sure we have found the best combination. I will skip that and move forward.
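The observations above can be checked by grouping the results table on one parameter at a time. The sketch below is only illustrative: it hard-codes the ten rows shown in the table output instead of reading the fitted `rs` object, and `summarize_by` is a hypothetical helper name, not part of the notebook. In the notebook itself the full `rs_table` would be passed instead of `top10`.

```python
import pandas as pd

# The ten best rows of rs_table, copied from the output above (illustration only).
top10 = pd.DataFrame({
    "norm": [None] * 9 + ["l2"],
    "ngram_range": ["(1, 2)"] * 10,
    "max_features": [5000, 5000, 3000, 5000, 5000, 3000, 3000, 5000, 3000, 3000],
    "max_df": [0.7, 0.5, 1.0, 1.0, 0.7, 0.7, 1.0, 1.0, 0.5, 0.5],
    "binary": [True, True, True, True, False, False, False, False, False, True],
    "mean_test_score": [0.611481, 0.608946, 0.605768, 0.605768, 0.60202,
                        0.60202, 0.600612, 0.600612, 0.59766, 0.59611],
})


def summarize_by(table, param):
    """Mean score, best score and row count for each value of one parameter."""
    # Turn values into strings so that norm=None forms its own group
    filled = table.assign(**{param: table[param].fillna("None").astype(str)})
    return (filled.groupby(param)["mean_test_score"]
                  .agg(["mean", "max", "count"])
                  .sort_values("mean", ascending=False))


for param in ["norm", "binary", "max_df", "max_features"]:
    print(summarize_by(top10, param), end="\n\n")
```

With only ten rows the summary is indicative at best; run on all 50 sampled combinations it gives a fairer view of how much each parameter matters.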

In [42]:
y_pred_lr_best, _ = evaluate_model_perf('LogisticRegression tuned', rs.best_estimator_, x_test, y_test)

Model LogisticRegression tuned -- MICRO metrics -- Precision: 0.7359, Recall: 0.5344, F1-Score: 0.6191
Model LogisticRegression tuned -- WEIGHTED metrics -- Precision: 0.6811, Recall: 0.5344, F1-Score: 0.5773


### 7.2. Apply TF-IDF tuned parameters to Random Forest
Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. We could also tune the Random Forest, but I will leave that as a further improvement; otherwise I may run out of time.  
For the moment I will only build the Random Forest pipeline with the best vectorizer parameters found for Logistic Regression and see whether it improves on the earlier Random Forest results, which were, as a reminder:

* **TF-IDF vectorizer only** with arbitrary values: Precision: 0.7683, Recall: 0.3281, **F1-Score: 0.4599**
* **TF-IDF vectorizer + new features**: Precision: 0.7656, Recall: 0.3250, **F1-Score: 0.4563**

Remember that those values are for `average='micro'`.
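To make the difference between the MICRO and WEIGHTED lines printed by `evaluate_model_perf` concrete, here is a minimal pure-Python sketch of the two averaging schemes on a toy multilabel example. The data and the helper names (`label_counts`, `micro_f1`, `weighted_f1`) are made up for illustration; the notebook itself relies on scikit-learn's `f1_score`, and treating an undefined per-label F1 as 0 is an assumption meant to mirror that behaviour.

```python
def label_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), set to 0 when it is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Pool the counts of all labels, then compute a single F1."""
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1_from_counts(tp, fp, fn)


def weighted_f1(y_true, y_pred):
    """Compute one F1 per label, then average them weighted by label support."""
    weighted_sum, total_support = 0.0, 0
    for j in range(len(y_true[0])):
        tp, fp, fn = label_counts(y_true, y_pred, j)
        support = tp + fn  # number of true positives for this label in y_true
        weighted_sum += support * f1_from_counts(tp, fp, fn)
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Toy data: label 0 is frequent and well predicted, label 1 is rare and never predicted.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 0]]

print("micro F1:    {:.4f}".format(micro_f1(y_true, y_pred)))
print("weighted F1: {:.4f}".format(weighted_f1(y_true, y_pred)))
```

In this toy case the never-predicted rare label pulls the weighted score below the micro score, which is the same direction as the gap between the two lines reported for each model.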

In [39]:
pipeline_rf_tfidf_tuned = Pipeline([
            ("features", FeatureUnion([
                ("text", TfidfVectorizer(tokenizer=lemmatize_txt, max_df=0.7, 
                                 max_features=5000, ngram_range=(1, 2),
                                 binary=True, norm=None, 
                                 use_idf=False)),
                ("word-count", WordCountExtractor()),
                ("word-len", AverageWordLengthExtractor()),
                ("sentence-count", SentenceCountExtractor()),
                ("sentence-len", AverageSentenceLengthExtractor()),
                ("verb-count", PosCountExtractor())
            ])),    
            ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42, max_depth=8,
                                                                 n_estimators=300)))
            ])

In [40]:
%%time
pipeline_rf_tfidf_tuned.fit(x_train_val, y_train_val)

Wall time: 3min 15s


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text', TfidfVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.7, max_features=5000, min_df=1,
        ngra...           oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=None))])

In [41]:
y_pred_rf_tfidf_tuned, _ = evaluate_model_perf('RandomForest with tuned TF-IDF', pipeline_rf_tfidf_tuned, x_test, y_test)

Model RandomForest with tuned TF-IDF -- MICRO metrics -- Precision: 0.7640, Recall: 0.3308, F1-Score: 0.4616
Model RandomForest with tuned TF-IDF -- WEIGHTED metrics -- Precision: 0.6428, Recall: 0.3308, F1-Score: 0.3350



Yes, **with the tuned TF-IDF vectorizer we managed to improve the micro F1-Score slightly (0.4616 vs. 0.4599)!** But this model remains far below what we were able to achieve with Logistic Regression (micro F1-Score of 0.6191). To go further, _we should now start a new tuning session for the RandomForest parameters themselves_, because there is a good chance that the poor results come from the arbitrarily fixed values (`max_depth=8`, `n_estimators=300`).  
Again, as the main objective is not to get the best metric, I will leave this part as a further improvement and move forward with the project.
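
For reference, such a tuning session could look like the sketch below. It is not something that was run in this notebook: it uses a small synthetic multilabel dataset from `make_multilabel_classification` instead of the disaster messages, and the parameter ranges are arbitrary assumptions. Inside the full pipeline the parameter names would need the step prefix, for example `clf__estimator__max_depth` instead of `estimator__max_depth`.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputClassifier

# Small synthetic multilabel dataset (a stand-in for the vectorized messages)
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=4, random_state=42)

rf = MultiOutputClassifier(RandomForestClassifier(random_state=42))

# Arbitrary example ranges; "estimator__" targets the wrapped RandomForestClassifier
param_distributions = {
    "estimator__n_estimators": [50, 100],
    "estimator__max_depth": [4, 8, None],
    "estimator__min_samples_leaf": [1, 5],
    "estimator__class_weight": [None, "balanced"],
}

rs_rf = RandomizedSearchCV(rf, param_distributions, n_iter=4, cv=3,
                           scoring="f1_micro", random_state=42)
rs_rf.fit(X, Y)

print("Best score is {:.4f} with params = {}".format(rs_rf.best_score_, rs_rf.best_params_))
```

On the real data each fit takes minutes rather than seconds, so `n_iter` and the ranges would have to be chosen with the time budget in mind.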