# Using ```basicnlp``` and ```nlppipelineutils``` functions to include ad-hoc text and non-text features within ```sklearn``` pipelines

In this example script, we build on Example 2 and expand the ```sklearn``` pipeline to include text and non-text ad-hoc features, using functions and transformers from ```nlppipelineutils``` together with ```sklearn```'s ```FeatureUnion```.

We will:
1. Create a pipeline that extracts the proportion of adverbs, the proportion of adjectives and the count of meaningful punctuation symbols in each text as text ad-hoc features.
2. Create a pipeline that includes the text's source as a non-text ad-hoc feature.
3. Combine the BOW features and the text and non-text ad-hoc features into a single pipeline using ```FeatureUnion```.
4. Train and evaluate the pipeline with an SVM classifier.

## Setup and Imports

Import the modules and our user-defined functions.

In [20]:
import numpy as np   # needed for np.nanmean in the Vader score extractor
import pandas as pd
import os
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve

from sklearn.svm import SVC

from nlpfunctions.utils import *
from nlpfunctions.basicnlp import *
from nlpfunctions.nlppipelineutils import *

## Data

We will use the same labelled text data as in Example 2, from 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015 (available here).

In [3]:
imdb = pd.read_excel('Data/imdb.xlsx', header=0)
yelp = pd.read_excel('Data/yelp_labelled.xlsx', header=0)

imdb['source'] = 'imdb'
yelp['source'] = 'yelp'

df = pd.concat([imdb, yelp])


In [4]:
df[df.duplicated('text')]
df = df.drop_duplicates('text')

In [5]:
df[pd.isnull(df['text'])]   #yep, 1 case
df = df[pd.notnull(df['text'])]   

## Extract text ad-hoc features

Here we create a pipeline that extracts the following information from each text and makes it available as ad-hoc text features for classification:
- the proportion of adjectives
- the count of meaningful punctuation symbols
- the ```NLTK``` Vader polarity score (the average of the polarity scores of the text's sentences)

In [9]:
count_punkt_fn = combine_functions(sent_tokenise
                                       ,word_tokenise
                                       ,lambda x: count_punkt(x, ['?', '!', '...'])
                                       )   

count_adj = combine_functions(sent_tokenise
                                       ,word_tokenise
                                       ,POS_tagging
                                       ,lambda x: count_pos(x, pos_to_cnt='J', normalise=False)
                                       )   

VDR_score = combine_functions(sent_tokenise
                                       ,get_sentiment_score_VDR
                                       ,np.nanmean
                                       )   
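Each extractor above is built by chaining simple steps, with the output of one step passed to the next. The sketch below illustrates that idea only. It assumes `combine_functions` applies its arguments from left to right, which may differ from the actual implementation in `nlpfunctions`. The helper `compose_left_to_right` and the toy tokenising and counting steps are hypothetical stand-ins, not the library's functions.

```python
from functools import reduce


def compose_left_to_right(*funcs):
    """Return a function that applies `funcs` in order, feeding each result to the next."""
    def composed(value):
        return reduce(lambda acc, fn: fn(acc), funcs, value)
    return composed


# Toy stand-ins for the tokenising and counting steps (not the nlpfunctions versions).
def split_on_whitespace(text):
    return text.split()


def count_marks(tokens):
    # Count tokens that are exactly one of the punctuation symbols of interest.
    return sum(tok in {'?', '!', '...'} for tok in tokens)


count_marks_in_text = compose_left_to_right(split_on_whitespace, count_marks)
result = count_marks_in_text('Why ? Because she deserves more !')
print(result)
```

Composing small functions this way keeps each step testable on its own and makes it easy to swap a single step, for example a different tokeniser, without touching the rest.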

In the pipeline, we first select the dataset column that contains the text data. We then extract the features by applying the ad-hoc text feature extractors created above with our custom ```TextPipelineArrayFeaturizer``` transformer.
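To make the mechanics concrete, here is a minimal sketch of how a column selector and a function-based featurizer could be written as ```sklearn``` transformers. The classes `DemoColumnSelector` and `DemoFunctionFeaturizer` are hypothetical illustrations written for this example. They are not the ```nlppipelineutils``` implementations, whose signatures and behaviour may differ.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DemoColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: pick one DataFrame column and return it as a list of strings."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.column].astype(str).tolist()


class DemoFunctionFeaturizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: apply each function to each text, one output column per function."""

    def __init__(self, functions):
        self.functions = functions

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rows = [[fn(text) for fn in self.functions] for text in X]
        return np.array(rows, dtype=float)


def count_marks(text):
    # Toy feature: number of '!' and '?' characters in the text.
    return text.count('!') + text.count('?')


demo_df = pd.DataFrame({'text': ['Great film!', 'Was it good? No!', 'Fine.']})

demo_pipe = Pipeline([
    ('selector', DemoColumnSelector('text')),
    ('featurizer', DemoFunctionFeaturizer([len, count_marks])),
    ('scaler', StandardScaler()),
])

# One row per text, one standardised column per feature function.
features = demo_pipe.fit_transform(demo_df)
print(features.shape)
```

Because both classes expose `fit` and `transform`, they can sit in a `Pipeline` next to built-in steps such as `StandardScaler`.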

In [14]:
# display the docstrings to learn more about each transformer
?ColumnSelector
?Series2ListOfStrings
?TextPipelineArrayFeaturizer

In [22]:
pipe_text_features = Pipeline([
        
        ('selector', ColumnSelector(columns=['text']))
        
        ,('transformer', Series2ListOfStrings())
        
        ,('text_featurizer', TextPipelineArrayFeaturizer(count_punkt_fn, count_adj, VDR_score))
        
        ,('scaler', StandardScaler())
        
        ])


Let's take a look at what the pipeline does:

In [23]:
# on train data
pipe_text_features.fit_transform(df)



In [None]:

# fit on the full dataset, then apply the fitted pipeline to unseen texts
pipe_text_features.fit_transform(df)

new_text = pd.DataFrame(['I hate chese!!! But it is soft and gentle and that makes it hard to accept.', 
                         'You should regret all opportunities you let go...',
                         'There are several reasons. But I am going to tell her only one. Why? Because she deserves more.'], columns =['text'])
pipe_text_features.transform(new_text)



## Extract non-text ad-hoc features

Here we create a pipeline that extracts the source of each text as a non-text ad-hoc feature to include in our classification procedure.
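As a rough illustration of the idea, a categorical column such as `source` can be turned into numeric features by one-hot encoding it. The sketch below uses only standard ```sklearn``` components. The toy data frame `demo_df` and the helper `select_source` are assumptions made for this example and are not part of ```nlppipelineutils```.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Toy data standing in for the combined imdb/yelp data frame.
demo_df = pd.DataFrame({
    'text': ['Great film!', 'Terrible service.', 'Loved the food.'],
    'source': ['imdb', 'yelp', 'yelp'],
})


def select_source(frame):
    # Keep the column as a 2-D frame, which is what OneHotEncoder expects.
    return frame[['source']]


demo_source_pipe = Pipeline([
    ('selector', FunctionTransformer(select_source)),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

# OneHotEncoder returns a sparse matrix by default, so convert it for display.
source_features = demo_source_pipe.fit_transform(demo_df).toarray()
print(source_features)
```

Setting `handle_unknown='ignore'` means a source not seen during fitting is encoded as all zeros instead of raising an error at prediction time.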