# ADVANCED NLP PIPELINE USING SKLEARN

# CASE STUDY: CORPORATE MESSAGING

### by Tran Nguyen

This is a comprehensive tutorial and template for building an NLP pipeline.
Ref: Part of the material comes from Udacity's Data Science Nanodegree.

This notebook is the advanced version of Simple_NLP_pipeline_CaseStudy_Corporate_Messaging.ipynb. It covers:
- How to create a custom transformer
- How to use a feature union inside a pipeline
- How to add grid search to the pipeline

## 1. CREATE A CUSTOM TRANSFORMER

- Develop a custom transformer by extending scikit-learn's base classes (`BaseEstimator` and `TransformerMixin`).

- **STRUCTURE OF A TRANSFORMER**:
    + A transformer is an estimator, so it always has a `fit` method.
    + A transformer also has a `transform` method.
    
- **HOW TO CREATE A CUSTOM TRANSFORMER**:
    + First approach: Using the FunctionTransformer from scikit-learn's preprocessing module. Reference: [ref1](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer), [ref2](https://scikit-learn.org/stable/modules/preprocessing.html#custom-transformers)
    + Second approach: subclass `BaseEstimator` and `TransformerMixin`, as in the code below:

In [None]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

### Create a custom transformer
class CaseNormalizer(BaseEstimator, TransformerMixin):
    """ Custom transformer
    Transform all text to lowercase letters
    """
    def fit(self, X, y = None):
        """ This method is required for compatibility with scikit-learn.
        There is nothing to learn here, so it simply returns self.
        """
        return self
    
    def transform(self, X):
        """ Function to transform the data
        In this example, it converts all text to lowercase
        """
        return pd.Series(X).apply(lambda x: x.lower()).values

### Use it
X = np.array(['Data', 'Engineer', 'is', 'Offering', 'from', 'Udacity'])
case_normalizer = CaseNormalizer()
case_normalizer.transform(X)
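
For comparison, the first approach listed above (`FunctionTransformer`) can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper name `to_lowercase` and the variable names are made up for this example, and `validate=False` is passed so the string data is handed to the function unchecked.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def to_lowercase(X):
    """Hypothetical helper: lowercase every string in an array-like of text."""
    return np.array([text.lower() for text in X])

# Wrap the plain function so it exposes fit/transform like any transformer.
# validate=False leaves the input unchecked, which suits string data.
lowercaser = FunctionTransformer(to_lowercase, validate=False)

X_demo = np.array(['Data', 'Engineer', 'is', 'Offering', 'from', 'Udacity'])
lowered = lowercaser.fit_transform(X_demo)
print(lowered)
```

This route is convenient for stateless transformations; subclassing is the better fit when the transformer needs helper methods or learns something in `fit`.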

In [26]:
### Implement the starting_verb feature
import nltk
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
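nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# tagger model needed by nltk.pos_tag below
# (resource names can differ between NLTK versions)
nltk.download('averaged_perceptron_tagger')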

def tokenize(text):
    """ Function to process text:
         + replace each url with the placeholder "urlplaceholder"
         + tokenize text
         + lemmatize words, then lowercase them and strip whitespace
     """
    # raw string, so the backslashes reach the regex engine unchanged
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]

    return clean_tokens

class StartingVerbExtractor(BaseEstimator, TransformerMixin):
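    """ Custom transformer
    Check whether any sentence in the text starts with a verb (using pos_tags)
    """
    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lowercases every token, so a retweet marker appears as 'rt'
            if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                return True
        return False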

    def fit(self, x, y = None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)
    
x = 'win big at our Community Awards in Jersey, Channel Islands http://ow.ly/HMfp'
verb = StartingVerbExtractor()
verb.transform(x)

[nltk_data] Downloading package punkt to /Users/nhntran/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nhntran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/nhntran/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


      0
0  True


## 2. IMPLEMENT FEATURE UNION INTO THE WORKFLOW

- `FeatureUnion` is a class in scikit-learn's `sklearn.pipeline` module.
- It runs several transformers (for example, 2 sequences of data transformation) in parallel on the same input and concatenates their outputs as the input for the next step.
- A pipeline performs a list of steps in a linear sequence, while a feature union performs a list of steps in parallel and then combines their results.

- **HOW TO USE FEATURE UNION**:
    + Import it: `from sklearn.pipeline import Pipeline, FeatureUnion`
    + Define a pipeline whose steps are a `FeatureUnion` followed by 1 estimator at the end: `pipeline = Pipeline([('features', FeatureUnion([...])), ('clf', estimator)])`
    
- In the `corporate message` case study, we may want to add the `starting_verb` feature to the model.
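
To make the "parallel, then combine" idea concrete, here is a minimal sketch on made-up sentences. `TextLengthExtractor` is a hypothetical transformer invented for this illustration (it is not part of the case study); it contributes a single column, which the feature union appends to the bag-of-words columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with each text's character count."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])

docs = ["win big today", "thank you for the feedback", "new report released"]

union = FeatureUnion([
    ('bag_of_words', CountVectorizer()),     # one column per vocabulary word
    ('text_length', TextLengthExtractor()),  # one extra column
])
features = union.fit_transform(docs)

# Both transformers received the same docs; their outputs sit side by side.
n_vocab = len(union.transformer_list[0][1].vocabulary_)
print("vocabulary size:", n_vocab)
print("combined shape:", features.shape)
print("last column:", features.toarray()[:, -1])
```

The combined matrix has as many rows as there are documents and `n_vocab + 1` columns, which is what the classifier at the end of the pipeline would receive.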

In [8]:
#### Import the necessary Python packages
# basic packages
import pandas as pd
import numpy as np
# processing the data
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
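nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# tagger model needed by nltk.pos_tag below
# (resource names can differ between NLTK versions)
nltk.download('averaged_perceptron_tagger')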

from sklearn.model_selection import train_test_split
# transform and fit data
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
# evaluate the result
from sklearn.metrics import confusion_matrix
# build the pipeline and the feature union
from sklearn.pipeline import Pipeline, FeatureUnion

### Implement the starting_verb feature
from sklearn.base import BaseEstimator, TransformerMixin


#### functions to be used
def load_data():
    """ Load data from csv file
        Filter the data and get the X, y data
        return: X, y
    """
    df = pd.read_csv("corporate_messaging.csv", encoding = 'iso8859_4')
    
    #  The 4 categories: number of samples are: 
    # Information: 2129, Action: 724, Dialogue: 226, Exclude: 39
    # Choose the data that has confidence value == 1 and only choose the 3 main categories.
    df1 = df[(df["category:confidence"] == 1) & (df["category"] != "Exclude")]
    X = df1.text.values
    y = df1.category.values
    return X, y

def tokenize(text):
    """ Function to process text:
         + replace each url with the placeholder "urlplaceholder"
         + tokenize text
         + lemmatize words, then lowercase them and strip whitespace
     """
    # raw string, so the backslashes reach the regex engine unchanged
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]

    return clean_tokens

def display_results(y_test, y_pred):
    """ Display the confusion matrix and accuracy
    """
    labels = np.unique(y_pred)
    # calculate the confusion matrix from y_test and y_pred
    confusion_mat = confusion_matrix(y_test, y_pred, labels = labels)
    print("Confusion matrix:")
    print(labels)
    print(confusion_mat)
    accuracy = (y_pred == y_test).mean()
    # or accuracy = sum(y_pred == y_test)/len(y_pred)
    print("Accuracy:", accuracy)

### Implement the starting_verb feature

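class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """ Custom transformer
    Check whether any sentence in the text starts with a verb (using pos_tags)
    """
    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lowercases every token, so a retweet marker appears as 'rt'
            if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                return True
        return False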

    def fit(self, x, y = None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

def model_pipeline():
    """ Create the model that implements the feature union:
    text_pipeline and the starting_verb feature run in parallel
    """
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer = tokenize)),
                ('tfidf', TfidfTransformer())
            ])),
            ('starting_verb', StartingVerbExtractor())
        ])), 
        ('clf',RandomForestClassifier())
    ])
    return pipeline

def main():
    ## prepare train and test set
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    ## initialize the model
    model = model_pipeline()
    
    ## train classifier
    model.fit(X_train, y_train)
    
    #### Predict on test data
    y_pred = model.predict(X_test)
    
    display_results(y_test, y_pred)

main()

Confusion matrix:
['Action' 'Dialogue' 'Information']
[[102   0  24]
 [  0  23   4]
 [  7   0 441]]
Accuracy: 0.9417637271214643


## 3. IMPLEMENT GRID SEARCH INTO THE PIPELINE

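- `GridSearchCV` is a class in the `sklearn.model_selection` module.
- It can be used to optimize the hyperparameters of a model: it fits the model for every combination in a parameter grid and keeps the combination with the best cross-validated score.

- **HOW TO USE GRID SEARCH**:
    + Import it: `from sklearn.model_selection import GridSearchCV`
    + In the `build_model` function, define the parameter grid and wrap the pipeline in a grid search object `cv` (cross-validation).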
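
The naming rule for grid keys is easiest to see on a small case. The sketch below uses a made-up toy corpus (not the corporate messaging data) and a flat three-step pipeline; the texts, labels, and variable names are assumptions chosen only to keep the example fast. Each key is `<step name>__<parameter name>`, and nested steps chain further names with more double underscores.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus, kept tiny so the search finishes quickly
toy_texts = [
    "join us at the event", "register for the webinar now",
    "sign up for our newsletter", "download the free guide",
    "visit our booth today", "apply for the open position",
    "our quarterly results are out", "the company opened a new office",
    "the report covers last year", "a new product was announced",
    "the office is closed on monday", "revenue grew in the last quarter",
]
toy_labels = ["Action"] * 6 + ["Information"] * 6

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Keys follow the pattern <step name>__<parameter name>
toy_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__min_samples_split': [2, 3],
}

# 2 x 2 = 4 combinations, each scored with 3-fold cross-validation
toy_cv = GridSearchCV(toy_pipeline, param_grid=toy_parameters, cv=3)
toy_cv.fit(toy_texts, toy_labels)

print("Combinations tried:", len(toy_cv.cv_results_['params']))
print("Best parameters:", toy_cv.best_params_)
```

After fitting, the grid search object behaves like the best pipeline refit on all training data, so `toy_cv.predict(...)` can be called directly.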

In [37]:
#### Import the necessary Python packages
# basic packages
import pandas as pd
import numpy as np
# processing the data
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
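nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# tagger model needed by nltk.pos_tag below
# (resource names can differ between NLTK versions)
nltk.download('averaged_perceptron_tagger')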

from sklearn.model_selection import train_test_split
# transform and fit data
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
# evaluate the result
from sklearn.metrics import confusion_matrix
# build the pipeline and the feature union
from sklearn.pipeline import Pipeline, FeatureUnion

### create a new transformer
from sklearn.base import BaseEstimator, TransformerMixin
#### add the grid search
from sklearn.model_selection import GridSearchCV

#### functions to be used
def load_data():
    """ Load data from csv file
        Filter the data and get the X, y data
        return: X, y
    """
    df = pd.read_csv("corporate_messaging.csv", encoding = 'iso8859_4')
    
    #  The 4 categories: number of samples are: 
    # Information: 2129, Action: 724, Dialogue: 226, Exclude: 39
    # Choose the data that has confidence value == 1 and only choose the 3 main categories.
    df1 = df[(df["category:confidence"] == 1) & (df["category"] != "Exclude")]
    X = df1.text.values
    y = df1.category.values
    return X, y

def tokenize(text):
    """ Function to process text:
         + replace each url with the placeholder "urlplaceholder"
         + tokenize text
         + lemmatize words, then lowercase them and strip whitespace
     """
    # raw string, so the backslashes reach the regex engine unchanged
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]

    return clean_tokens

def display_results(model, y_test, y_pred):
    """ Display the confusion matrix, the accuracy,
    and the best parameters found by the grid search
    """
    labels = np.unique(y_pred)
    # calculate the confusion matrix from y_test and y_pred
    confusion_mat = confusion_matrix(y_test, y_pred, labels = labels)
    print("Confusion matrix:")
    print(labels)
    print(confusion_mat)
    accuracy = (y_pred == y_test).mean()
    # or accuracy = sum(y_pred == y_test)/len(y_pred)
    print("Accuracy:", accuracy)
    print("Best parameters:", model.best_params_)

### Implement the starting_verb feature

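class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """ Custom transformer
    Check whether any sentence in the text starts with a verb (using pos_tags)
    """
    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lowercases every token, so a retweet marker appears as 'rt'
            if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                return True
        return False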

    def fit(self, x, y = None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

def build_model():
    """ Create the model that implements the feature union
    (text_pipeline and the starting_verb feature run in parallel)
    and wrap it in a grid search over the parameters below.
    Return: cv, a GridSearchCV object
    After fitting, access the best parameters as: cv.best_params_
    """
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer = tokenize)),
                ('tfidf', TfidfTransformer())
            ])),
            ('starting_verb', StartingVerbExtractor())
        ])), 
        ('clf',RandomForestClassifier())
    ])
    
#     parameters = {
#         'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
#         'features__text_pipeline__vect__max_df': (0.5, 0.75, 1.0),
#         'features__text_pipeline__vect__max_features': (None, 5000, 10000),
#         'features__text_pipeline__tfidf__use_idf': (True, False),
#         'clf__n_estimators': [50, 100, 200],
#         'clf__min_samples_split': [2, 3, 4],
#         'features__transformer_weights': (
#             {'text_pipeline': 1, 'starting_verb': 0.5},
#             {'text_pipeline': 0.5, 'starting_verb': 1},
#             {'text_pipeline': 0.8, 'starting_verb': 1},
#         )
#     }
    ## short version of parameters
    parameters = {
        'features__text_pipeline__vect__max_df': (0.5, 1.0),
        'features__text_pipeline__vect__max_features': (None, 10000),
        'features__text_pipeline__tfidf__use_idf': (True, False),
        'clf__min_samples_split': [2, 3],
        'features__transformer_weights': (
            {'text_pipeline': 1, 'starting_verb': 0.5},
            {'text_pipeline': 0.8, 'starting_verb': 1},
        )
    }
    
    cv = GridSearchCV(pipeline, param_grid = parameters)
    return cv

def main():
    ## prepare train and test set
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    ## initialize the model
    model =  build_model()
    
    ## train classifier
    model.fit(X_train, y_train)
    
    #### Predict on test data
    y_pred = model.predict(X_test)
    
    display_results(model, y_test, y_pred)

main()


Confusion matrix:
['Action' 'Dialogue' 'Information']
[[108   0  20]
 [  0  20   3]
 [  1   0 449]]
Accuracy: 0.9600665557404326
Best parameters: {'clf__min_samples_split': 2, 'features__text_pipeline__tfidf__use_idf': True, 'features__text_pipeline__vect__max_df': 1.0, 'features__text_pipeline__vect__max_features': None, 'features__transformer_weights': {'text_pipeline': 1, 'starting_verb': 0.5}}
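
The grid above also searches over `features__transformer_weights`. As a minimal sketch of what that parameter does, assuming only the documented behaviour of `FeatureUnion` (each transformer's output is multiplied by its weight before the outputs are concatenated), the example below uses two identity transformers on made-up numbers; the step names `full` and `half` are invented for this illustration.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X_num = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# FunctionTransformer() with no function is the identity, so any difference
# between the two blocks of columns comes from the weights alone.
weighted_union = FeatureUnion(
    [('full', FunctionTransformer()),
     ('half', FunctionTransformer())],
    transformer_weights={'full': 1.0, 'half': 0.5},
)
weighted = weighted_union.fit_transform(X_num)
print(weighted)
```

The first two columns reproduce the input and the last two are the input scaled by 0.5. In the case study, the same mechanism rescales the TF-IDF block relative to the `starting_verb` column.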

