# Using ```basicnlp``` functions to customise vectorizers within ```sklearn``` pipelines 

As José Blanco explained in his blog post on towardsdatascience.com (https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af), one way to create a customised vectorizer is to instantiate it with a modified version of Sklearn’s default analyzer.

Under the hood, Sklearn’s vectorizers call a series of functions to convert a set of documents into a document-term matrix: 
- **```build_preprocessor```**: Returns a callable utilized to preprocess the input text before tokenization.
- **```build_tokenizer```**: Returns a function that splits a document into tokens.
- **```build_analyzer```**: Returns an analyzer function that applies preprocessing and tokenization, removes stop words and creates n-grams.

By providing our own custom pipelines for any or all of these steps, we can customise the way we transform the documents for analysis.

And this is exactly what we'll do in this script.
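
To make the three steps concrete, here is a minimal sketch that inspects the callables built by a default ```CountVectorizer```. The sample sentence and variable names are illustrative only and are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A default word-level vectorizer that also builds bigrams
vec = CountVectorizer(ngram_range=(1, 2))

preprocess = vec.build_preprocessor()   # lowercases by default
tokenize = vec.build_tokenizer()        # regex-based, keeps tokens of 2+ characters
analyze = vec.build_analyzer()          # preprocess -> tokenize -> n-grams

sample = "The food was NOT good"

clean = preprocess(sample)
tokens = tokenize(clean)
ngrams = analyze(sample)

print(clean)    # the food was not good
print(tokens)   # ['the', 'food', 'was', 'not', 'good']
print(ngrams)   # the unigrams above, followed by the bigrams
```

Replacing the preprocessor or the tokenizer changes what the analyzer produces, while the n-gram step stays the same.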

## Setup and Imports

Import modules and our user-defined functions

In [39]:
import pandas as pd
import os
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

from nlpfunctions.utils import *
from nlpfunctions.basicnlp import *
from nlpfunctions.nlppipelineutils import *

## Data

For the purpose of this notebook, we will import the labelled text data used in 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015 (available here).

In [1]:
#print(os.getcwd())
#print(os.listdir())

In [3]:
imdb = pd.read_excel('Data/imdb.xlsx', header=0)
yelp = pd.read_excel('Data/yelp_labelled.xlsx', header=0)

imdb['source'] = 'imdb'
yelp['source'] = 'yelp'

df = pd.concat([imdb, yelp])


Let's take a quick look at the data:

In [4]:
print(df.columns)

print(df.head(5))


Index(['text', 'score', 'source'], dtype='object')
                                                text  score source
0  A very, very, very slow-moving, aimless movie ...      0   imdb
1  Not sure who was more lost - the flat characte...      0   imdb
2  Attempting artiness with black & white and cle...      0   imdb
3       Very little music or anything to speak of.        0   imdb
4  The best scene in the movie was when Gerardo i...      1   imdb


In [5]:
print(df['text'].describe())   #there are some duplicates

pd.crosstab(df['source'], df['score'])

count                 1742
unique                1736
top       I won't be back.
freq                     2
Name: text, dtype: object


score     0    1
source
imdb    361  384
yelp    499  500


In [6]:
df[df.duplicated('text')]
df = df.drop_duplicates('text')

In [7]:
df[pd.isnull(df['text'])]   #yep, 1 case
df = df[pd.notnull(df['text'])]   

## Include custom text-processing functions within sklearn ```CountVectorizer()```

### 1. Build an ad-hoc pre-processing pipeline

For pre-processing (i.e., before tokenisation), let's remove every sentence in each text whose subjectivity score is too low. We'll use 0.3 as the subjectivity threshold.

In [28]:
?remove_objective_sents

In [8]:
my_preprocessor = combine_functions(sent_tokenise
                                    , lambda x : remove_objective_sents(x, 0.3)
                                    )
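
The source of ```combine_functions``` (from ```nlpfunctions```) is not shown in this notebook. Assuming it chains its arguments left to right, feeding each function's output to the next, it could be sketched as below. ```compose_left_to_right``` and the two toy steps are hypothetical names used only for illustration.

```python
from functools import reduce

def compose_left_to_right(*funcs):
    """Return a function that applies funcs in order: first f1, then f2, ..."""
    def composed(x):
        return reduce(lambda acc, f: f(acc), funcs, x)
    return composed

# Toy stand-ins for a sentence tokeniser and a sentence filter
def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def keep_long_sentences(sents, min_words=3):
    return [s for s in sents if len(s.split()) >= min_words]

toy_preprocessor = compose_left_to_right(
    split_sentences,
    lambda sents: keep_long_sentences(sents, min_words=3),
)

result = toy_preprocessor("It was fine. I really loved the ending. Ok.")
print(result)   # ['It was fine', 'I really loved the ending']
```

The lambda plays the same role as in the cell above: it fixes the extra argument so that every step takes exactly one input.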

### 2. Build ad-hoc word-tokenisation pipeline

Let's then lemmatise, mark negations, and remove stop words, numeric digits and punctuation as part of our tokenisation pipeline. 
Lemmatisation requires that we first POS-tag the text.

Importantly, the pipeline must return one flat list of tokens per text (so a list of token lists over the whole corpus), as this is the input the next step in the vectorizer expects.

In [9]:
my_tokenizer_pipe = combine_functions(word_tokenise
                                       ,to_lower
                                       ,POS_tagging
                                       ,lemmatise
                                       ,fix_neg_auxiliary
                                       ,lambda x : remove_stopwords(x, extra_stopwords = [
                                           'x', "'s", 'us', 'ca', 'many', 'much', 'one', 'put', '¬ñ',
                                           'also', 'get', 'would', 'could', 'like', 'go', 'lot', 'make'
                                       ])
                                       ,lambda s: [[re.sub(r'\d+','',x) for x in subs] for subs in s]
                                       ,mark_neg
                                       ,flattenIrregularListOfLists  # now we have one list of tokens per text/paragraph
                                       ,remove_punctuation
                                       ,lambda x: list(filter(None, x))   # must end with a list of token lists, each sublist is a paragraph/text
                                       )   

In [31]:
# In case you are wondering

?fix_neg_auxiliary
?mark_neg
?flattenIrregularListOfLists
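
The help text for these functions is not reproduced here. Features such as ```backNEG``` in the vectorizer output further down suggest that ```mark_neg``` appends a ```NEG``` suffix to tokens that follow a negation. A rough sketch of that idea, under the assumption that the negation scope ends at the next punctuation mark, might look like this. ```mark_negation_sketch``` and its word lists are hypothetical and not the library's implementation.

```python
NEGATIONS = {"not", "no", "never", "n't", "cannot"}
CLAUSE_END = {".", ",", ";", "!", "?", ":"}

def mark_negation_sketch(tokens, suffix="NEG"):
    """Append suffix to every token between a negation word and the next punctuation mark."""
    marked, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_scope = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATIONS:
            in_scope = True           # the negation word itself is left unmarked
            marked.append(tok)
        elif in_scope:
            marked.append(tok + suffix)
        else:
            marked.append(tok)
    return marked

tokens = ["i", "will", "not", "go", "back", ",", "sadly"]
marked = mark_negation_sketch(tokens)
print(marked)
# ['i', 'will', 'not', 'goNEG', 'backNEG', ',', 'sadly']
```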

### 3. Build ad-hoc ```CountVectorizer()```

We can now use our ad-hoc preprocessor and tokenizer within ```CountVectorizer```


In [59]:
my_vec = CountVectorizer(preprocessor = my_preprocessor,
                         tokenizer = my_tokenizer_pipe,
                         analyzer="word",
                         ngram_range = (1,3),
                         stop_words=None
                         #min_df=1
                         )
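
As a self-contained illustration of how ```CountVectorizer``` chains user-supplied callables, here is a sketch with toy stand-ins for ```my_preprocessor``` and ```my_tokenizer_pipe```. Both helper functions are hypothetical. The preprocessor's return value is passed directly to the tokenizer, which returns a flat list of tokens for one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_preprocessor(doc):
    # Lowercase and split into sentences (returns a list of strings)
    return [s.strip() for s in doc.lower().split(".") if s.strip()]

def toy_tokenizer(sentences):
    # Split each sentence on whitespace and flatten into one token list
    return [tok for sent in sentences for tok in sent.split()]

toy_vec = CountVectorizer(
    preprocessor=toy_preprocessor,
    tokenizer=toy_tokenizer,
    token_pattern=None,      # unused when a tokenizer is given; avoids a warning
    ngram_range=(1, 2),
)

docs = ["Great food. Great service.", "Bad food."]
X = toy_vec.fit_transform(docs)

vocab = sorted(toy_vec.vocabulary_)
print(vocab)
print(X.toarray())
```

Because the tokens are flattened before the n-gram step, bigrams such as "food great" span a sentence boundary. The same applies to any tokenizer that flattens sentence-level token lists.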

### 4. Instantiate our custom BoW pipeline

We will use the transformers ```ColumnSelector()``` and ```Series2ListOfStrings()``` to select the ```pandas.Series``` that contains the text data and transform it into a list of strings, which is the input format ```CountVectorizer()``` requires.

In [60]:
pipe_bags_words = Pipeline([
        
        ('selector', ColumnSelector(columns=['text'])),
        
        ('transformer', Series2ListOfStrings()),
        
        ('vec', my_vec)
        
        ])


### Let's apply our custom BoW pipeline to our data

Let's take a look at the most frequent words first

In [12]:
# https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial

all_words = df['text'].str.split(expand=True).unstack().value_counts()
all_words = all_words.to_frame().reset_index().rename(columns = {'index' : 'word', 0 : 'count'})

# get the 50 most frequent words, lots of "rubbish"
#all_words[:50].plot.bar(x='word')
all_words[:50].T


          0    1    2    3    4    5    6    7    8     9  ...   40  41     42      43   44     45  46    47    48    49
word    the  and    a    I   is   of  was   to  The  this  ...  one  an  great  really  all  about  by  they  time  from
count  1012  779  618  534  473  468  461  441  354   287  ...   74  71     69      68   66     62  61    59    59    59


In [13]:
print(type(df['text']))
print(df['text'][:5])

<class 'pandas.core.series.Series'>
0    A very, very, very slow-moving, aimless movie ...
1    Not sure who was more lost - the flat characte...
2    Attempting artiness with black & white and cle...
3         Very little music or anything to speak of.  
4    The best scene in the movie was when Gerardo i...
Name: text, dtype: object


In [29]:
# Let's take a look at a random part of the output of our customised vectorizer
# (on scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out())
pd.DataFrame(pipe_bags_words.fit_transform(df).toarray(), columns=my_vec.get_feature_names_out()).iloc[1:6, 100:120]


Unnamed: 0,average best,avoid,award,away,awesome,awesome service,awful,baby,back,backNEG,backNEG NEG,bacon,bad,bad acting,bad bad,bad even,bad experience,bad film,bad film ever,bad food
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Let's apply the customised vectorizer to some new text, using the vocabulary just learned:

In [30]:
new_text = pd.DataFrame(["Would you tell me, please, which way I ought to go from here? That depends a good deal on where you want to get to. I don’t much care where! Then it doesn’t matter which way you go."], columns=['text'])

pd.DataFrame(pipe_bags_words.transform(new_text).toarray(), columns=my_vec.get_feature_names_out()).iloc[:, 100:120]



Unnamed: 0,average best,avoid,award,away,awesome,awesome service,awful,baby,back,backNEG,backNEG NEG,bacon,bad,bad acting,bad bad,bad even,bad experience,bad film,bad film ever,bad food
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
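
Calling ```transform``` on new text reuses the vocabulary learned during ```fit```, so words that were never seen in training are silently dropped and the number of columns stays the same. A small sketch with a default ```CountVectorizer``` and made-up sentences shows the behaviour:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["good food", "bad service"])      # learned vocabulary: bad, food, good, service

# "dessert" is not in the learned vocabulary and is ignored
new_counts = vec.transform(["good good dessert"])

print(sorted(vec.vocabulary_))
print(new_counts.toarray())   # [[0 0 2 0]]
```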


## Classifier

Finally, let's train an SVM classifier on the BoW features computed with our customised vectorizer, re-weighted with TF-IDF.

In [40]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve

from sklearn.svm import SVC

In [61]:
pipe_bags_words = Pipeline([
        
        ('selector', ColumnSelector(columns=['text'])),
        
        ('transformer', Series2ListOfStrings()),
        
        ('vec', my_vec),
    
        ('tf_idf', TfidfTransformer())
        
        ])

In [42]:
svm = SVC(probability=True)

In [62]:
# Instantiate pipeline

pipe_bow_svm = Pipeline([
        
        ('bow', pipe_bags_words),
        
        ('classifier', svm)
        
        ])

In [None]:
# Check list of available hyperparameters

pipe_bow_svm.get_params().keys()

In [73]:
# Define parameters space and dictionary

parameters = dict(
        
        #bow__vec__max_features = np.arange(5000, 11000, step=1000),
    
        classifier__kernel = ['linear', 'rbf'],
        
        classifier__C = [0.01, 0.1, 1, 10, 100]
        
        )
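
The keys follow scikit-learn's ```<step name>__<parameter>``` convention, so ```classifier__C``` targets the ```C``` parameter of the pipeline step named ```classifier```. As a quick sanity check on the size of the search, the grid can be expanded with ```ParameterGrid```. The sketch below repeats the dictionary so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

parameters = dict(
    classifier__kernel=["linear", "rbf"],
    classifier__C=[0.01, 0.1, 1, 10, 100],
)

grid = list(ParameterGrid(parameters))

n_candidates = len(grid)       # 2 kernels x 5 values of C
n_fits = n_candidates * 5      # with cv=5, before the final refit

print(n_candidates, n_fits)    # 10 50
print(grid[0])
```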

In [74]:
# Instantiate the GridSearchCV object: cv

pipe_bow_svm_cv = GridSearchCV(estimator=pipe_bow_svm,
                              param_grid=parameters,
                              cv=5,
                              scoring='accuracy'    # could be something else, e.g., "neg_log_loss"
                              )

In [71]:
#pipe_bow_svm_cv.get_params()

In [48]:
# Create a development dataset and a holdout dataset
df_dev, df_holdout = train_test_split(df, test_size=0.20, stratify=df.score, random_state=11)

In [52]:
X_dev = df_dev[['text']]
y_dev = df_dev.score.values

In [None]:
# Hyperparameter tuning with GridSearchCV
 
pipe_bow_svm_cv.fit(X_dev, y_dev)


In [None]:
print("Tuned pipeline's Hyperparameters: {}".format(pipe_bow_svm_cv.best_params_))
print("Accuracy or best score: {}".format(pipe_bow_svm_cv.best_score_))

pipe_bow_svm_cv.best_estimator_



# Inspect results for each combination of parameter values

cv_results = pipe_bow_svm_cv.cv_results_
cv_results.keys()

for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)
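
The holdout set created earlier is not used in the cells shown. One common way to use it is to score the tuned pipeline on data it has never seen. The sketch below does this on a small made-up dataset, with a plain ```TfidfVectorizer``` in place of the custom vectorizer and ```decision_function``` scores for the ROC AUC. The texts, labels and names are all illustrative, and the resulting scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up data: 8 positive and 8 negative snippets
texts = ["great film", "loved it", "wonderful food", "really good",
         "excellent service", "great acting", "loved the food", "good fun",
         "bad film", "hated it", "awful food", "really bad",
         "terrible service", "bad acting", "hated the food", "awful mess"]
labels = [1] * 8 + [0] * 8

X_dev, X_hold, y_dev, y_hold = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=11)

toy_pipe = Pipeline([("vec", TfidfVectorizer()), ("classifier", SVC())])
toy_cv = GridSearchCV(toy_pipe, {"classifier__C": [0.1, 1, 10]},
                      cv=2, scoring="accuracy")
toy_cv.fit(X_dev, y_dev)

# Score the refitted best pipeline on the untouched holdout set
hold_pred = toy_cv.predict(X_hold)
hold_scores = toy_cv.decision_function(X_hold)

hold_acc = accuracy_score(y_hold, hold_pred)
hold_auc = roc_auc_score(y_hold, hold_scores)
print("holdout accuracy:", hold_acc)
print("holdout ROC AUC:", hold_auc)
```

With the notebook's own objects, the equivalent step would be to call ```pipe_bow_svm_cv.predict(df_holdout[['text']])``` and compare the result with ```df_holdout.score.values```.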
