# Using ```basicnlp``` functions to customise vectorizers within ```sklearn``` pipelines 

As José Blanco explained in his blogpost (https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af), one way to create a customised vectorizer is to instantiate the vectorizer using a modified version of Sklearn’s default analyzer.

Under the hood, Sklearn’s vectorizers call a series of functions to convert a set of documents into a document-term matrix: 
- **```build_preprocessor```**: Returns a callable used to preprocess the input text before tokenization.
- **```build_tokenizer```**: Returns a function that splits a document into tokens.
- **```build_analyzer```**: Builds an analyzer function that applies preprocessing and tokenization, removes stop words and creates n-grams.
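To make these three steps concrete, here is a small sketch that calls each builder on a plain ```CountVectorizer```. The toy sentence and the two-word stop list are illustrative assumptions, not part of this notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vectorizer: unigrams and bigrams, with a tiny hand-picked stop list (assumption)
vec = CountVectorizer(ngram_range=(1, 2), stop_words=["the", "was"])

preprocess = vec.build_preprocessor()   # by default, lowercases the text
tokenize = vec.build_tokenizer()        # splits on the default token pattern
analyze = vec.build_analyzer()          # preprocessing + tokenization + stop words + n-grams

doc = "The movie was not good."
preprocessed = preprocess(doc)
tokens = tokenize(preprocessed)
features = analyze(doc)

print(preprocessed)
print(tokens)
print(features)
```

The analyzer output is what ends up as columns of the document-term matrix, which is why swapping in our own preprocessor or tokenizer changes the features the model sees.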

By providing our own custom pipelines for any or all of these steps, we can customise the way we transform the text documents for analysis.

And this is exactly what we'll do in this script:
1. First, we customise the text pre-processing by keeping only those sentences in each text that ```TextBlob```'s subjectivity algorithm evaluates as subjective.
2. Second, we customise the text tokenisation by marking negations, removing punctuation and digits, and lemmatising the words.
3. Finally, we train an SVM model to classify the polarity of each text using nested cross-validation, which allows us to validate the whole model-fitting procedure, including the hyperparameter tuning.


## Setup and Imports

Import modules and our user-defined functions

In [39]:
import pandas as pd
import os
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

from nlpfunctions.utils import *
from nlpfunctions.basicnlp import *
from nlpfunctions.nlppipelineutils import *

## Data

For the purpose of this notebook, we will import the labelled text data used in 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015.

In [1]:
#print(os.getcwd())
#print(os.listdir())

In [3]:
imdb = pd.read_excel('Data/imdb.xlsx', header=0)
yelp = pd.read_excel('Data/yelp_labelled.xlsx', header=0)

imdb['source'] = 'imdb'
yelp['source'] = 'yelp'

df = pd.concat([imdb, yelp])


Let's take a quick look at the data:

In [4]:
print(df.columns)

print(df.head(5))


Index(['text', 'score', 'source'], dtype='object')
                                                text  score source
0  A very, very, very slow-moving, aimless movie ...      0   imdb
1  Not sure who was more lost - the flat characte...      0   imdb
2  Attempting artiness with black & white and cle...      0   imdb
3       Very little music or anything to speak of.        0   imdb
4  The best scene in the movie was when Gerardo i...      1   imdb


In [5]:
print(df['text'].describe())   #there are some duplicates

pd.crosstab(df['source'], df['score'])

count                 1742
unique                1736
top       I won't be back.
freq                     2
Name: text, dtype: object


score     0    1
source
imdb    361  384
yelp    499  500


In [6]:
df[df.duplicated('text')]
df = df.drop_duplicates('text')

In [7]:
df[pd.isnull(df['text'])]   #yep, 1 case
df = df[pd.notnull(df['text'])]   

## Customise text pre-processing within ```sklearn``` pipelines in ```CountVectorizer()```

### 1. Build an ad-hoc pre-processing pipeline

For our customised pre-processing step (i.e., before tokenisation), let's remove all sentences within each text that score too low on subjectivity. We'll pick 0.3 as our subjectivity threshold. 

In [28]:
?remove_objective_sents
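The internals of ```remove_objective_sents``` are not shown in this notebook. As a rough sketch of the idea only, the filter below keeps sentences whose score reaches a threshold. The scorer is a toy stand-in (an assumption); with ```TextBlob``` one would presumably plug in ```TextBlob(sentence).sentiment.subjectivity``` instead. Whether the real function uses ```>=``` or ```>``` is also an assumption.

```python
def toy_subjectivity(sentence, opinion_words=frozenset({"great", "awful", "love", "boring"})):
    """Hypothetical stand-in scorer: share of tokens that are opinion words."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in opinion_words for t in tokens) / len(tokens)


def keep_subjective_sents(sentences, threshold, scorer=toy_subjectivity):
    """Keep only the sentences whose score reaches the threshold."""
    return [s for s in sentences if scorer(s) >= threshold]


sents = ["The film runs for two hours.", "I love it, great acting!"]
kept = keep_subjective_sents(sents, 0.3)
print(kept)
```

Passing the scorer as an argument keeps the filtering logic independent of the library that produces the subjectivity score.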

In [8]:
my_preprocessor = combine_functions(sent_tokenise
                                    , lambda x : remove_objective_sents(x, 0.3)
                                    )

### 2. Build ad-hoc word-tokenisation pipeline

Let's then lemmatise, mark negations, and remove stop words, numeric digits and punctuation as part of our customised tokenisation pipeline.
Lemmatisation requires that we first POS-tag the text.

Importantly, the pipeline must end with a single flat list of tokens for each text, as this is what the vectorizer expects from its tokenizer in the next step.

In [9]:
my_tokenizer_pipe = combine_functions(word_tokenise
                                       ,to_lower
                                       ,POS_tagging
                                       ,lemmatise
                                       ,fix_neg_auxiliary
                                       ,lambda x : remove_stopwords(x, extra_stopwords = [
                                           'x', "'s", 'us', 'ca', 'many', 'much', 'one', 'put', '¬ñ',
                                           'also', 'get', 'would', 'could', 'like', 'go', 'lot', 'make'
                                       ])
                                       ,lambda s: [[re.sub(r'\d+','',x) for x in subs] for subs in s]
                                       ,mark_neg
                                       ,flattenIrregularListOfLists  # now we have one list of tokens per text
                                       ,lambda x: list(filter(None, x))   # end with a list of token lists, each sublist is a text
                                       )   

In [31]:
# In case you are wondering

?fix_neg_auxiliary
?mark_neg
?flattenIrregularListOfLists
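For readers without access to the docstrings, here is a rough sketch of what negation marking and flattening could look like. It is not the ```nlpfunctions``` implementation: the negation list, the clause-ending punctuation and the ```NEG``` suffix rule are all assumptions chosen for illustration.

```python
NEGATIONS = {"not", "no", "never", "n't"}    # assumed negation cues
CLAUSE_END = {".", ",", ";", "!", "?"}       # assumed end of negation scope


def mark_negation_sketch(tokens, suffix="NEG"):
    """Append `suffix` to every token between a negation cue and the next clause end."""
    marked, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_scope = False
            marked.append(tok)
        elif tok in NEGATIONS:
            in_scope = True
            marked.append(tok)
        else:
            marked.append(tok + suffix if in_scope else tok)
    return marked


def flatten(nested):
    """Recursively flatten arbitrarily nested lists or tuples into one list."""
    flat = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


sentences = [["i", "wo", "n't", "be", "back", "."], ["great", "food"]]
marked = [mark_negation_sketch(s) for s in sentences]
flat = flatten(marked)
print(flat)
```

Marking tokens this way lets a bag-of-words model treat a negated word as a different feature from its plain form.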

### 3. Build ad-hoc ```CountVectorizer()```

We can now use our ad-hoc preprocessor and tokenizer within ```CountVectorizer```


In [59]:
my_vec = CountVectorizer(preprocessor = my_preprocessor,
                         tokenizer = my_tokenizer_pipe,
                         analyzer="word",
                         ngram_range = (1,3),
                         stop_words=None
                         #min_df=1
                         )
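The same wiring can be tried on a toy example that does not depend on ```nlpfunctions```. The two callables below are deliberately simplistic placeholders (assumptions); the point is only to show that ```CountVectorizer``` still builds the n-grams itself from whatever tokens our tokenizer returns.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_preprocessor(text):
    """Placeholder preprocessor: lowercase and drop exclamation marks."""
    return text.lower().replace("!", "")


def toy_tokenizer(text):
    """Placeholder tokenizer: split on whitespace, returning a flat list of tokens."""
    return text.split()


toy_vec = CountVectorizer(preprocessor=toy_preprocessor,
                          tokenizer=toy_tokenizer,
                          token_pattern=None,   # unused once a tokenizer is supplied
                          ngram_range=(1, 2))

counts = toy_vec.fit_transform(["Great food!", "Not great"])
vocabulary = sorted(toy_vec.vocabulary_)

print(vocabulary)
print(counts.toarray())
```

Note that the custom preprocessor replaces the default one entirely, so any lowercasing has to happen inside our own functions.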

### 4. Instantiate our custom BoW pipeline

We will use the transformers ```ColumnSelector()``` and ```Series2ListOfStrings()``` to select the ```pandas.Series``` that contains the text data and transform it into a list of strings, which is the input format that ```CountVectorizer()``` requires.

In [60]:
pipe_bags_words = Pipeline([
        
        ('selector', ColumnSelector(columns=['text'])),
        
        ('transformer', Series2ListOfStrings()),
        
        ('vec', my_vec)
        
        ])


### Let's apply our custom BoW pipeline to our data

Let's take a look at the most frequent words first

In [12]:
# https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial

all_words = df['text'].str.split(expand=True).unstack().value_counts()
all_words = all_words.to_frame().reset_index().rename(columns = {'index' : 'word', 0 : 'count'})

# get the 50 most frequent words, lots of "rubbish"
#all_words[:50].plot.bar(x='word')
all_words[:50].T


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
word,the,and,a,I,is,of,was,to,The,this,...,one,an,great,really,all,about,by,they,time,from
count,1012,779,618,534,473,468,461,441,354,287,...,74,71,69,68,66,62,61,59,59,59


In [13]:
print(type(df['text']))
print(df['text'][:5])

<class 'pandas.core.series.Series'>
0    A very, very, very slow-moving, aimless movie ...
1    Not sure who was more lost - the flat characte...
2    Attempting artiness with black & white and cle...
3         Very little music or anything to speak of.  
4    The best scene in the movie was when Gerardo i...
Name: text, dtype: object


In [29]:
# Let's take a look at a random part of the output of our customised Vectoriser
pd.DataFrame(pipe_bags_words.fit_transform(df).A, columns=my_vec.get_feature_names()).iloc[1:6, 100:120]


Unnamed: 0,average best,avoid,award,away,awesome,awesome service,awful,baby,back,backNEG,backNEG NEG,bacon,bad,bad acting,bad bad,bad even,bad experience,bad film,bad film ever,bad food
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Let's apply the customised Vectorizer to some new random data, using the vocabulary just learned

In [30]:
new_text = pd.DataFrame(["Would you tell me, please, which way I ought to go from here? That depends a good deal on where you want to get to. I don’t much care where! Then it doesn’t matter which way you go."], columns=['text'])

pd.DataFrame(pipe_bags_words.transform(new_text).A, columns=my_vec.get_feature_names()).iloc[:, 100:120]



Unnamed: 0,average best,avoid,award,away,awesome,awesome service,awful,baby,back,backNEG,backNEG NEG,bacon,bad,bad acting,bad bad,bad even,bad experience,bad film,bad film ever,bad food
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
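One likely reason for an all-zero slice like this is that ```transform()``` only counts terms already present in the fitted vocabulary and silently ignores everything else. A toy illustration with a default ```CountVectorizer``` (the three short strings are made up for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vec = CountVectorizer()
toy_vec.fit(["bad film", "awesome service"])      # vocabulary is learned here

vocabulary = sorted(toy_vec.vocabulary_)
new_counts = toy_vec.transform(["awesome rabbit hole"]).toarray().tolist()

print(vocabulary)
print(new_counts)   # 'rabbit' and 'hole' are not in the vocabulary, so they are dropped
```

The number of columns is fixed at fit time, so new documents always map onto the same feature space as the training data.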


## Classifier

Finally, let's train an SVM classifier on the combined BoW features we have just computed with our customised vectorizer.

In [89]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve

from sklearn.svm import SVC

In [61]:
pipe_bags_words = Pipeline([
        
        ('selector', ColumnSelector(columns=['text'])),
        
        ('transformer', Series2ListOfStrings()),
        
        ('vec', my_vec),
    
        ('tf_idf', TfidfTransformer())
        
        ])

In [42]:
svm = SVC(probability=True)

In [62]:
# Instantiate pipeline

pipe_bow_svm = Pipeline([
        
        ('bow', pipe_bags_words),
        
        ('classifier', svm)
        
        ])

In [None]:
# Check list of available hyperparameters

pipe_bow_svm.get_params().keys()

In [73]:
# Define parameters space and dictionary

parameters = dict(
        
        #bow__vec__max_features = np.arange(5000, 11000, step=1000),
    
        classifier__kernel = ['linear', 'rbf'],
        
        classifier__C = [0.01, 0.1, 1, 10, 100]
        
        )
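The double-underscore names in the grid address parameters of nested steps: ```<step>__<parameter>```, repeated for each level of nesting. A small sketch with default scikit-learn components standing in for our customised pipeline (the step names mirror ours, everything else is a placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

toy_bow = Pipeline([("vec", CountVectorizer()), ("tf_idf", TfidfTransformer())])
toy_pipe = Pipeline([("bow", toy_bow), ("classifier", SVC())])

# Every tunable parameter is exposed under its full path
param_names = list(toy_pipe.get_params().keys())

# The same paths can be used to set values, which is what GridSearchCV does internally
toy_pipe.set_params(classifier__C=10, bow__vec__ngram_range=(1, 2))

print([name for name in param_names if name.startswith("classifier__")][:5])
```

This is why the commented-out vectorizer parameter in the grid is spelled ```bow__vec__max_features```: it sits two levels deep.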

### GridSearch and nested cross-validation 

Nested cross-validation (CV) is often used to train a model in which hyperparameters also need to be optimized. Nested CV estimates the generalization error of the underlying model and its (hyper)parameter search. 

More on this here:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

And check these excellent explanations https://stats.stackexchange.com/a/65156 and https://chrisalbon.com/machine_learning/model_evaluation/nested_cross_validation/
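The structure is easier to see on a small synthetic problem. In this sketch (synthetic data, a linear SVM and a three-value grid are all assumptions made to keep it fast), ```GridSearchCV``` is the inner loop and ```cross_val_score``` the outer loop:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Small synthetic binary classification problem
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=1)   # used for tuning
outer = KFold(n_splits=3, shuffle=True, random_state=2)   # used for evaluation

# Inner loop: pick the best C on each outer training fold
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.1, 1, 10]},
                      cv=inner,
                      scoring="accuracy")

# Outer loop: score the whole "tune then fit" procedure on unseen folds
scores = cross_val_score(search, X, y, cv=outer, scoring="accuracy")
print(scores.mean(), scores.std())
```

Each outer fold may select a different value of ```C```, so the scores describe the fitting procedure rather than one specific set of hyperparameters.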

In [48]:
# Create a development dataset and a holdout dataset
df_dev, df_holdout = train_test_split(df, test_size=0.20, stratify=df.score, random_state=11)

In [52]:
X_dev = df_dev[['text']]
y_dev = df_dev.score.values

In [91]:
# Prepare for nested cross-validation

inner_cv = KFold(n_splits=5, shuffle=True, random_state=77)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=77)


In [92]:
# Initiate non_nested parameter search via GridSearch on the inner cv loop
pipe_bow_svm_cv = GridSearchCV(estimator=pipe_bow_svm,
                              param_grid=parameters,
                              cv=inner_cv,
                              return_train_score=True,
                              scoring='accuracy'    # could be something else, e.g., "neg_log_loss"
                              )

In [104]:
#pipe_bow_svm_cv.get_params()

In [93]:
# Nested CV with parameter optimization to estimate the generalization error (this might take a while...) 
# Note: The outer cross validation estimates generalization error for a training function that includes tuning. 

# pipe_bow_svm_cv's GridSearchCV runs the inner loop while cross_val_score() runs the outer loop
nested_score = cross_val_score(pipe_bow_svm_cv, X=X_dev, y=y_dev, cv=outer_cv)

In [100]:
nested_score

# model fitting approach seems to give stable results

array([ 0.77697842,  0.78417266,  0.75179856,  0.76895307,  0.75451264])
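To put a number on "stable", the five outer-fold scores can be summarised by their mean and standard deviation. The values below are copied from the output above; only the standard library is used.

```python
from statistics import mean, stdev

# Outer-fold accuracies copied from the nested CV output above
nested_scores = [0.77697842, 0.78417266, 0.75179856, 0.76895307, 0.75451264]

avg = mean(nested_scores)
spread = stdev(nested_scores)   # sample standard deviation across the five folds

print(f"nested CV accuracy: {avg:.3f} +/- {spread:.3f}")
```

A small spread relative to the mean suggests the estimate does not depend much on how the data happened to be split.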

### Evaluate the performance on the holdout set

In [109]:
X_hout = df_holdout[['text']]
y_hout = df_holdout.score.values

In [114]:
# Train the classifier procedure on the whole development set
# Our final model (trained on the whole data set) is basically the inner CV loop applied to the whole development data set.

pipe_bow_svm_cv.fit(X_dev, y_dev)

GridSearchCV(cv=KFold(n_splits=5, random_state=77, shuffle=True),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('bow', Pipeline(memory=None,
     steps=[('selector', ColumnSelector(columns=['text'])), ('transformer', Series2ListOfStrings()), ('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowe...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classifier__kernel': ['linear', 'rbf'], 'classifier__C': [0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [115]:
# let's take a look at the best model
pipe_bow_svm_cv.best_estimator_

Pipeline(memory=None,
     steps=[('bow', Pipeline(memory=None,
     steps=[('selector', ColumnSelector(columns=['text'])), ('transformer', Series2ListOfStrings()), ('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowe...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [118]:
print("Tuned pipeline's Hyperparameters: {}".format(pipe_bow_svm_cv.best_params_))
print("Accuracy or best score: {}".format(pipe_bow_svm_cv.best_score_))

Tuned pipeline's Hyperparameters: {'classifier__C': 1, 'classifier__kernel': 'linear'}
Accuracy or best score: 0.7672910662824207


In [119]:
# Inspect results for each combination of parameter values
cv_results = pipe_bow_svm_cv.cv_results_
#print(cv_results.keys())

for mean_test_score, mean_train_score, params in zip(cv_results["mean_test_score"], cv_results["mean_train_score"], cv_results["params"]):
    #print(type(mean_test_score))
    print(params, 'test score: %s ' %round(mean_test_score, 3) + 'vs. train score: %s' %round(mean_train_score, 3))


{'classifier__C': 0.01, 'classifier__kernel': 'linear'} test score: 0.532 vs. train score: 0.544
{'classifier__C': 0.01, 'classifier__kernel': 'rbf'} test score: 0.532 vs. train score: 0.544
{'classifier__C': 0.1, 'classifier__kernel': 'linear'} test score: 0.55 vs. train score: 0.553
{'classifier__C': 0.1, 'classifier__kernel': 'rbf'} test score: 0.532 vs. train score: 0.544
{'classifier__C': 1, 'classifier__kernel': 'linear'} test score: 0.767 vs. train score: 0.888
{'classifier__C': 1, 'classifier__kernel': 'rbf'} test score: 0.532 vs. train score: 0.544
{'classifier__C': 10, 'classifier__kernel': 'linear'} test score: 0.741 vs. train score: 0.888
{'classifier__C': 10, 'classifier__kernel': 'rbf'} test score: 0.532 vs. train score: 0.544
{'classifier__C': 100, 'classifier__kernel': 'linear'} test score: 0.741 vs. train score: 0.888
{'classifier__C': 100, 'classifier__kernel': 'rbf'} test score: 0.532 vs. train score: 0.544


### Evaluation

In [120]:
print(pipe_bow_svm_cv.score(X_hout, y_hout))    #testing accuracy

y_predictions = pipe_bow_svm_cv.predict(X_hout)

print(classification_report(y_true = y_hout, y_pred = y_predictions ))

# rows are true labels, columns are predicted labels: [[true-negative, false-positive], [false-negative, true-positive]]
print(confusion_matrix(y_true = y_hout, y_pred = y_predictions))


0.781609195402
             precision    recall  f1-score   support

          0       0.74      0.86      0.79       171
          1       0.84      0.71      0.77       177

avg / total       0.79      0.78      0.78       348

[[147  24]
 [ 52 125]]
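As a sanity check, the figures in the classification report can be recomputed by hand from the confusion matrix above (rows are true labels, columns are predictions):

```python
# Cells of the confusion matrix printed above
tn, fp = 147, 24    # true class 0: predicted 0, predicted 1
fn, tp = 52, 125    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)

precision_pos = tp / (tp + fp)   # of everything predicted 1, how much is really 1
recall_pos = tp / (tp + fn)      # of everything really 1, how much was found

precision_neg = tn / (tn + fn)   # same two measures for class 0
recall_neg = tn / (tn + fp)

print(round(accuracy, 4))
print(round(precision_neg, 2), round(recall_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2))
```

The model misses more positive reviews (52 false negatives) than it mislabels negative ones (24 false positives), which is what the lower recall for class 1 reflects.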
