# Using ```basicnlp``` and ```nlppipelineutils``` functions to include ad-hoc text and non-text features within ```sklearn``` pipelines

In this example script, we build on Example 2 and expand the ```sklearn``` pipeline to include ad-hoc text and non-text features, using functions and transformers from ```nlppipelineutils``` and ```sklearn```'s ```FeatureUnion```.

We will:
1. Create a pipeline that computes three ad-hoc text features for each text: its VADER lexicon-based polarity score, the proportion of adjectives, and the count of meaningful punctuation symbols.
2. Create a pipeline that includes the text's source as a non-text ad-hoc feature.
3. Combine the BOW features and the text and non-text ad-hoc features into a single pipeline using ```FeatureUnion```.
4. Train and evaluate the pipeline with an SVM classifier.

## Setup and Imports

Import modules and our user-defined functions

In [144]:
import pandas as pd
import numpy as np
import os
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder, LabelBinarizer

from sklearn.feature_extraction import DictVectorizer

from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate, KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve

from sklearn.svm import SVC

from nlpfunctions.utils import *
from nlpfunctions.basicnlp import *
from nlpfunctions.nlppipelineutils import *

## Data

We will use the same labelled text data as in Example 2, from 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015 (available here).

In [2]:
imdb = pd.read_excel('Data/imdb.xlsx', header=0)
yelp = pd.read_excel('Data/yelp_labelled.xlsx', header=0)

imdb['source'] = 'imdb'
yelp['source'] = 'yelp'

df = pd.concat([imdb, yelp])


In [3]:
df[df.duplicated('text')]
df = df.drop_duplicates('text')

In [4]:
df[pd.isnull(df['text'])]   #yep, 1 case
df = df[pd.notnull(df['text'])]   

## Extract and include ad-hoc text features

### Numeric ad-hoc (text) features

Here we create a pipeline that extracts the following information from each text and makes it available as ad-hoc text features for classification:
- the proportion of adjectives
- the count of meaningful punctuation symbols
- the ```NLTK``` VADER polarity score (the average of the polarity scores of the text's sentences)

In [17]:
count_punkt_fn = combine_functions(sent_tokenise
                                       ,word_tokenise
                                       ,lambda x: count_punkt(x, ['?', '!', '...'])
                                       )   

count_adj = combine_functions(sent_tokenise
                                       ,word_tokenise
                                       ,POS_tagging
                                       ,lambda x: count_pos(x, pos_to_cnt='J', normalise=True)
                                       )   

VDR_score = combine_functions(sent_tokenise
                                       ,get_sentiment_score_VDR
                                       ,np.nanmean
                                       )   
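
The three feature functions above each chain simpler steps from left to right. As a rough illustration of that idea only, the sketch below composes toy stand-ins for the punctuation counter. The names ```compose```, ```split_sentences```, ```split_tokens``` and ```count_symbols``` are hypothetical helpers written for this example; they are not the ```nlpfunctions``` implementations, whose internals may differ.

```python
import re
from functools import reduce


def compose(*funcs):
    """Return a function that applies funcs one after another, left to right."""
    def composed(x):
        return reduce(lambda acc, f: f(acc), funcs, x)
    return composed


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_tokens(sentences):
    # Tokens are ellipses, words, or single punctuation marks.
    return [re.findall(r"\.\.\.|\w+|[^\w\s]", s) for s in sentences]


def count_symbols(tokenised_sentences, symbols):
    # Count every token that belongs to the given set of symbols.
    return sum(tok in symbols for sent in tokenised_sentences for tok in sent)


count_marks = compose(
    split_sentences,
    split_tokens,
    lambda t: count_symbols(t, {"?", "!", "..."}),
)

n_marks = count_marks("I hate cheese!!! But it is soft.")
```

Each step receives the output of the previous one, so the final function maps a raw string directly to a single number.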

In the pipeline:
- We first select the dataset column that contains the text data.
- We then use our custom transformer ```Series2ListOfStrings``` to turn the ```pandas.Series``` into a list of strings to which we can apply our text-feature extracting functions.
- We then extract the features by applying the ad-hoc text feature extraction functions created above, using our custom ```TextPipelineArrayFeaturizer``` transformer.
- Finally, as all the features are numeric, we scale them.
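
To give a feel for the first two steps, here is a minimal sketch of what such transformers could look like. ```SelectColumns``` and ```FrameToStrings``` are hypothetical stand-ins invented for this example; the actual ```ColumnSelector``` and ```Series2ListOfStrings``` in ```nlppipelineutils``` may be implemented differently.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SelectColumns(BaseEstimator, TransformerMixin):
    """Hypothetical column selector: keep only the named DataFrame columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.columns]


class FrameToStrings(BaseEstimator, TransformerMixin):
    """Hypothetical converter: single-column DataFrame -> list of strings."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Assumes X has exactly one column holding the texts.
        return X.iloc[:, 0].astype(str).tolist()


toy = pd.DataFrame({"text": ["Good film.", "Bad food!"],
                    "source": ["imdb", "yelp"]})

select_text = Pipeline([
    ("selector", SelectColumns(columns=["text"])),
    ("to_list", FrameToStrings()),
])

texts = select_text.fit_transform(toy)
```

Both classes keep all configuration in named ```__init__``` arguments, which is what ```sklearn``` expects when it clones a pipeline.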

In [14]:
# to know more
?ColumnSelector
?Series2ListOfStrings
?TextPipelineArrayFeaturizer

In [16]:
?count_punkt

In [18]:
pipe_text_features = Pipeline([
        
        ('selector', ColumnSelector(columns=['text']))
        
        ,('transformer', Series2ListOfStrings())
        
        ,('text_featurizer', TextPipelineArrayFeaturizer(count_punkt_fn, count_adj, VDR_score))
        
        ,('scaler', StandardScaler())
        
        ])


Let's take a look at what the pipeline does:
- Each row in the array represents a text.
- The first column contains the count of "!", "?" and "..." (our "meaningful" punctuation) in the text.
- The second column contains the proportion of adjectives in the text.
- The third column contains the text's VADER polarity score.

The columns match the order in which the functions are passed to ```TextPipelineArrayFeaturizer```. Counts, proportions, and scores have been standardised by ```StandardScaler```.

In [19]:
# on train data
pipe_text_features.fit_transform(df)


array([[-0.2810793 ,  1.80448558, -1.21046145],
       [-0.2810793 ,  0.97468564, -0.4766906 ],
       [-0.2810793 ,  1.15908563, -1.83865071],
       ..., 
       [-0.2810793 ,  0.79028566, -1.10636394],
       [-0.2810793 ,  0.60588567, -0.31683414],
       [-0.2810793 , -1.05371421,  0.50470936]])

Let's apply the pipeline to some new data. The ```transform``` method standardises the new scores using the mean and standard deviation learned from our original ```df.text``` data.

In [21]:
new_text = pd.DataFrame(['I hate cheese!!! But it is soft and gentle and that makes it hard to accept.', 
                         'You should regret all opportunities you let go...',
                         'There are several reasons. But I am going to tell her only one. Why? Because she deserves more.'], columns =['text'])

pipe_text_features.transform(new_text)

array([[ 3.09966208,  1.25128562, -0.37354714],
       [ 0.84583449, -1.05371421, -0.42623193],
       [ 0.84583449,  0.51368568, -0.31683414]])
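
The reuse of training statistics can be checked on a small example. The sketch below uses made-up numbers (not values from our dataset) to show that ```StandardScaler.transform``` applies the mean and standard deviation learned during ```fit``` to new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up "training" features: two columns, three texts.
train = np.array([[0.0, 0.10],
                  [2.0, 0.30],
                  [4.0, 0.20]])

# A new, unseen row.
new = np.array([[6.0, 0.20]])

scaler = StandardScaler().fit(train)   # learns mean_ and scale_ from train only
scaled_new = scaler.transform(new)     # reuses the training statistics

# Same computation by hand, using the training mean and (population) std.
manual = (new - train.mean(axis=0)) / train.std(axis=0)
```

This is why a new text can receive a score well outside the range seen in training: it is measured against the training distribution, not its own.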

## Extract and include ad-hoc non-text features

###  Ad-hoc binary features

What if we want to include features that are not computed from the texts themselves but are, for instance, stored in other columns of the dataset?

For instance, we might want to include the source of the text, i.e., whether the text comes from imdb or yelp. We might have good reasons to believe that people on yelp tend to leave more positive feedback, in which case the source would help predict the sentiment polarity of a text.

So here we create a pipeline that includes an existing ```pandas.DataFrame``` column containing some non-text ad-hoc features into our classification procedure.

The variable ```df.source``` is of type string, which makes our life easier. We can use our custom transformer ```CatToDictTransformer``` to pass the ```pandas``` dataframe column to ```DictVectorizer``` as a list of dictionaries, which is ```DictVectorizer```'s expected input, and obtain a numeric ('dummy') representation of the text's source (see more here http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html).

In [58]:
cat2dict = CatToDictTransformer()

```CatToDictTransformer``` creates a list of dictionaries, one per text, with the column name as key and the text's value for that column as value.

In [61]:
cat2dict.fit_transform(df.source)[:5]

[{'source': 'imdb'},
 {'source': 'imdb'},
 {'source': 'imdb'},
 {'source': 'imdb'},
 {'source': 'imdb'}]
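
For readers curious about how such a conversion might work, here is a minimal sketch built on ```pandas```' ```to_dict(orient='records')```. The class ```ToRecords``` is a hypothetical stand-in written for this example and is not the actual ```CatToDictTransformer``` code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToRecords(BaseEstimator, TransformerMixin):
    """Hypothetical converter: Series or DataFrame -> list of {column: value} dicts."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # A Series is promoted to a one-column DataFrame named after the Series.
        frame = X.to_frame() if isinstance(X, pd.Series) else X
        return frame.to_dict(orient="records")


toy = pd.DataFrame({"source": ["imdb", "yelp", "imdb"]})
records = ToRecords().fit_transform(toy["source"])
```

The resulting list has one dictionary per row, which is the input shape ```DictVectorizer``` expects.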

In [36]:
pipe_bin_features = Pipeline([
        
        ('selector', ColumnSelector(columns=['source'])),
        
        ('todictionary', CatToDictTransformer()),
    
        ('dv', DictVectorizer())
                  
        ])


And here are the two numeric dummy variables representing whether the text's source is imdb or yelp:

In [42]:
pd.DataFrame(pipe_bin_features.fit_transform(df).toarray()).head(5)   # .toarray() instead of .A, which recent SciPy versions no longer provide

     0    1
0  1.0  0.0
1  1.0  0.0
2  1.0  0.0
3  1.0  0.0
4  1.0  0.0


### Ad-hoc categorical multi-level features

What if our ad-hoc features have more than two levels? If the feature data is stored as strings, we can still apply ```DictVectorizer``` to get the desired 'dummy' (or 'one-hot-encoding') representation, as shown here.

For this purpose, let's create a three-level categorical variable for our dataset. Let's randomly assign each text to one of three possible favourite music genres for its author (jazz, pop, rock). Completely made up! But it is a truth universally acknowledged that jazz lovers have better taste in general.

In [75]:
import random
genre = ['jazz', 'pop', 'rock']

df['fav_music'] = random.choices(genre, k=df.shape[0]) 

In [77]:
cat2dict.fit_transform(df.fav_music)[:5]

[{'fav_music': 'pop'},
 {'fav_music': 'rock'},
 {'fav_music': 'pop'},
 {'fav_music': 'jazz'},
 {'fav_music': 'pop'}]

In [78]:
dv = DictVectorizer()

In [80]:
pd.DataFrame(dv.fit_transform(cat2dict.fit_transform(df.fav_music)[:5]).toarray())   # .toarray() instead of .A, which recent SciPy versions no longer provide

     0    1    2
0  0.0  1.0  0.0
1  0.0  0.0  1.0
2  0.0  1.0  0.0
3  1.0  0.0  0.0
4  0.0  1.0  0.0


In [81]:
pipe_cat_features = Pipeline([
        
        ('selector', ColumnSelector(columns=['source', 'fav_music'])),
        
        ('todictionary', CatToDictTransformer()),
    
        ('dv', DictVectorizer())
                  
        ])


In [83]:
pd.DataFrame(pipe_cat_features.fit_transform(df).toarray()).head(5)   # .toarray() instead of .A, which recent SciPy versions no longer provide
# note that the three dummy variables for fav_music are displayed first as columns 0, 1, 2.

     0    1    2    3    4
0  0.0  1.0  0.0  1.0  0.0
1  0.0  0.0  1.0  1.0  0.0
2  0.0  1.0  0.0  1.0  0.0
3  1.0  0.0  0.0  1.0  0.0
4  0.0  1.0  0.0  1.0  0.0
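
The column order can be inspected directly on the fitted ```DictVectorizer```. The toy records below are made up for illustration; the sketch shows that, with its default settings, ```DictVectorizer``` sorts the generated feature names, which is why the ```fav_music``` dummies come before the ```source``` dummies.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records in the same shape as the list of dictionaries shown earlier.
records = [
    {"source": "imdb", "fav_music": "pop"},
    {"source": "yelp", "fav_music": "rock"},
    {"source": "imdb", "fav_music": "jazz"},
]

dv_toy = DictVectorizer()                       # sort=True by default
dummies = dv_toy.fit_transform(records).toarray()

# One name per output column, in the form "<key>=<value>".
column_names = dv_toy.feature_names_
print(column_names)
print(dummies)
```

In a fitted pipeline the same attribute is reachable through the step name, for example via ```named_steps['dv']```.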


Another option is to use ```OneHotEncoder```, but in ```sklearn``` versions before 0.20 ```OneHotEncoder``` does not work with string data, which needs to be converted into integers first. One way to do so is to use ```LabelEncoder``` first. For the sake of completeness we will show it here.

To achieve this, we also introduce another Python module that helps bridge the gap between ```pandas``` and ```sklearn```:

In [136]:
# NOTE as of version 0.20 OneHotEncoder will deal with categorical string data

In [127]:
from sklearn_pandas import DataFrameMapper

In [133]:
mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in ['source', 'fav_music']]
)

pipe_cat_features2 = Pipeline([
    
    ("mapper", mapper),
    ("ohe", OneHotEncoder())
    ])

In [135]:
pipe_cat_features2.fit_transform(df).toarray()

array([[ 1.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       ..., 
       [ 0.,  1.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  1.,  0.,  0.]])
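
With ```sklearn``` 0.20 or later, the ```LabelEncoder``` step should no longer be needed. The sketch below, on a made-up three-row frame rather than our ```df```, applies ```OneHotEncoder``` directly to string columns; treat it as an illustration of the newer behaviour, not as part of the original pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data with the same two categorical columns as in the example above.
toy = pd.DataFrame({
    "source": ["imdb", "yelp", "imdb"],
    "fav_music": ["pop", "rock", "jazz"],
})

ohe = OneHotEncoder()
encoded = ohe.fit_transform(toy[["source", "fav_music"]]).toarray()

# categories_ holds the sorted levels found in each input column, in column order.
levels = [list(c) for c in ohe.categories_]
print(levels)
print(encoded)
```

Here the output columns follow the input column order (```source``` first, then ```fav_music```), unlike ```DictVectorizer```, which orders them by feature name.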

## Classifier 

Let's bring it all together by creating a classification pipeline that combines:
- bag-of-words features (see Example 2)
- ad-hoc text-related (numerical) features
- ad-hoc non-text related (categorical) features

We will use ```sklearn.pipeline.FeatureUnion``` to achieve this.

In [138]:
# ad-hoc features 

pipe_adhoc_features = Pipeline([
        
        ('adhoc', FeatureUnion([
                
                # pipeline for text-related (numerical) features
                ('text_num', Pipeline([
                    ('selector', ColumnSelector(columns=['text'])),
                    ('transformer', Series2ListOfStrings()),
                    ('text_featurizer', TextPipelineArrayFeaturizer(count_punkt_fn, count_adj, VDR_score)),
                    ('scaler', StandardScaler())
                ])),
    
                # pipeline for non-text (categorical) features
                ('cat', Pipeline([
                    ('selector', ColumnSelector(columns=['source', 'fav_music'])),
                    ('todictionary', CatToDictTransformer()),
                    ('dv', DictVectorizer())
                ]))
        ]))
    ])




# bag-of-words

my_preprocessor = combine_functions(sent_tokenise
                                    , lambda x : remove_objective_sents(x, 0.3)
                                    )

my_tokenizer_pipe = combine_functions(word_tokenise
                                       ,to_lower
                                       ,POS_tagging
                                       ,lemmatise
                                       ,fix_neg_auxiliary
                                       ,lambda x : remove_stopwords(x, extra_stopwords = [
                                           'x', "'s", 'us', 'ca', 'many', 'much', 'one', 'put', '¬ñ',
                                           'also', 'get', 'would', 'could', 'like', 'go', 'lot', 'make'
                                       ])
                                       ,lambda s: [[re.sub(r'\d+','',x) for x in subs] for subs in s]
                                       ,mark_neg
                                       ,flattenIrregularListOfLists  # now we have one list of tokens per text
                                       ,lambda x: list(filter(None, x))   # end with a list of token lists, each sublist is a text
                                       )   

my_vec = CountVectorizer(preprocessor = my_preprocessor,
                         tokenizer = my_tokenizer_pipe,
                         analyzer="word",
                         ngram_range = (1,3),
                         stop_words=None
                         #min_df=1
                         )

pipe_bags_words = Pipeline([
        
        ('selector', ColumnSelector(columns=['text'])),
        ('transformer', Series2ListOfStrings()),
        ('vec', my_vec),
        ('tf_idf', TfidfTransformer())
        
        ])



# classifier

svm = SVC(probability=True, C=1, kernel='linear')


In [139]:
pipe_svm_clf = Pipeline([
        
        # Combined text (bag-of-word) and ad-hoc features
        ('features', FeatureUnion(
                
                transformer_list = [
                        
                        ('adhoc', pipe_adhoc_features),
                
                        ('text_bow', pipe_bags_words)
                        
                        ],
                
                # weight components in FeatureUnion
                transformer_weights={
                        'adhoc': 0.6,
                        'text_bow': 1.0
                        }
                        
                )),
        
        # Use classifier on combined features
        ('classifier', svm)
        
        ])
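
To see what ```FeatureUnion``` and ```transformer_weights``` do in isolation, consider the toy sketch below. The two column-picking functions are hypothetical and exist only for this example: each block's output is multiplied by its weight, and the blocks are then concatenated side by side.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def first_column(X):
    # Toy feature block: keep only the first column.
    return X[:, :1]


def second_column(X):
    # Toy feature block: keep only the second column.
    return X[:, 1:2]


union = FeatureUnion(
    transformer_list=[
        ("left", FunctionTransformer(first_column)),
        ("right", FunctionTransformer(second_column)),
    ],
    # Each block is scaled by its weight before concatenation.
    transformer_weights={"left": 0.5, "right": 1.0},
)

X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
combined = union.fit_transform(X_toy)
```

In ```pipe_svm_clf``` the same mechanism scales the ad-hoc features by 0.6 relative to the bag-of-words features before they reach the classifier.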



Let's just run a cross-validation on the whole dataset we have available.


In [142]:
X = df[['text', 'source', 'fav_music']]
y = df.score.values


In [145]:
scores_svm = cross_validate(pipe_svm_clf, X, y, cv = 10, return_train_score=True)
sorted(scores_svm.keys())

print("Train Accuracy: %0.3f (std %0.3f)" % (scores_svm['train_score'].mean(), scores_svm['train_score'].std()))
print("Test Accuracy: %0.3f (std %0.3f)" % (scores_svm['test_score'].mean(), scores_svm['test_score'].std()))


RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'nlpfunctions.nlppipelineutils.TextPipelineArrayFeaturizer'> with constructor (self, *featurizers) doesn't  follow this convention.
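
The error above is raised because ```cross_validate``` clones the pipeline, and ```sklearn``` reads each estimator's parameters from the named arguments of its ```__init__```; a ```*featurizers``` signature gives it nothing to read. One possible way around this, sketched below under that assumption, is to accept the functions as a single list argument. ```CallableListFeaturizer``` and ```count_exclamations``` are hypothetical names for this example and this is not the actual ```nlppipelineutils``` code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class CallableListFeaturizer(BaseEstimator, TransformerMixin):
    """Hypothetical featurizer: apply each function to each text, one column per function."""

    def __init__(self, featurizers=None):
        # A single named argument (no *args) so get_params() and clone() work.
        self.featurizers = featurizers

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        funcs = self.featurizers or []
        return np.array([[f(text) for f in funcs] for text in X], dtype=float)


def count_exclamations(text):
    # Toy feature: number of exclamation marks in the text.
    return text.count("!")


featurizer = CallableListFeaturizer(featurizers=[len, count_exclamations])
cloned = clone(featurizer)   # succeeds because the parameters are discoverable
features = cloned.fit_transform(["Great!!", "Fine."])
```

If the custom transformer were rewritten along these lines, the pipeline step would receive the three functions as one list, for example ```featurizers=[count_punkt_fn, count_adj, VDR_score]```, instead of as separate positional arguments.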