# SIMPLE NLP PIPELINE USING SKLEARN

# CASE STUDY: CORPORATE MESSAGING

### by Tran Nguyen

This is a tutorial and template for building an NLP pipeline with scikit-learn.

Ref: Part of the material comes from the Udacity Data Science Nanodegree.

**DATA**: A CSV file, `corporate_messaging.csv`, containing social media messages posted by corporations, each classified into one of 4 categories:
- Information: objective statements about the company or its activities
- Action: messages that ask for votes or ask users to click on links
- Dialogue: replies to users
- Exclude

**GOAL**: Classify a text message into a specific category.

**CONTENT**: This notebook includes 3 parts:

- Part 1. Create the machine learning workflow: load the data, process and transform it, fit a model, predict, and display the results.
- Part 2. Refactor the workflow from Part 1 into functions to automate it.
- Part 3. Use a pipeline for the workflow from Part 2.

## 1. CREATE THE WORKFLOW

### 1.1 LOAD DATA

In [54]:
import pandas as pd
import numpy as np
import re

In [4]:
df = pd.read_csv("corporate_messaging.csv")
# => UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 42-43: invalid continuation byte

In [5]:
### Check where the encodings module file is located
#from encodings import aliases
#aliases.__file__

## Import all the encoding names
from encodings.aliases import aliases
alias_values = set(aliases.values())

### Iterate through alias_values, trying each encoding to see which ones can read the file.
# Wrap the read in try/except: an encoding that cannot decode the file raises an error,
# and we simply skip it.
for encoding_name in alias_values:
    try:
        df = pd.read_csv('corporate_messaging.csv', encoding=encoding_name)
        print("Good encoding name", encoding_name)
    except Exception:
        pass

Good encoding name iso8859_4
Good encoding name cp852
Good encoding name mac_latin2
Good encoding name cp865
Good encoding name kz1048
Good encoding name cp775
Good encoding name cp864
Good encoding name cp858
Good encoding name iso8859_14
Good encoding name koi8_r
Good encoding name iso8859_2
Good encoding name iso8859_10
Good encoding name cp862
Good encoding name cp866
Good encoding name mac_iceland
Good encoding name mac_roman
Good encoding name iso8859_15
Good encoding name iso8859_13
Good encoding name cp1251
Good encoding name cp1256
Good encoding name latin_1
Good encoding name iso8859_5
Good encoding name cp855
Good encoding name ptcp154
Good encoding name cp850
Good encoding name cp863
Good encoding name iso8859_9
Good encoding name iso8859_16
Good encoding name cp1125
Good encoding name cp861
Good encoding name cp857
Good encoding name cp437
Good encoding name mac_turkish
Good encoding name mac_greek
Good encoding name mac_cyrillic
Good encoding name cp860
Good encoding name
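A caveat on the list above: a single-byte encoding can decode almost any byte sequence without raising an error, so "no error" does not mean "correct text". The sketch below is not part of the original notebook. It uses a made-up in-memory sample (a string encoded as cp1252) and a hand-picked candidate list, both assumptions for illustration, to show several encodings succeeding while producing different characters.

```python
def decodable_encodings(raw, candidates):
    """Return the candidate encodings that decode `raw` without raising."""
    working = []
    for name in candidates:
        try:
            raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: skip it
        working.append(name)
    return working


# Hypothetical sample: a pound sign stored as the single byte 0xA3 (cp1252)
sample = "prospectus for its £5.8bn Rights Issue".encode("cp1252")
candidates = ["utf-8", "cp1252", "latin_1", "iso8859_4"]

working = decodable_encodings(sample, candidates)
decoded = {name: sample.decode(name) for name in working}
for name, text in decoded.items():
    print(f"{name:10s} {text}")
```

Three of the four candidates decode the sample, but they do not all agree on the currency symbol. Inspecting a few decoded rows (currency symbols, accented names) is a reasonable way to choose among the encodings that load.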

In [38]:
df = pd.read_csv("corporate_messaging.csv", encoding = 'iso8859_4')
df.head(3)

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,category,category:confidence,category_gold,id,screenname,text
0,662822308,False,finalized,3,2/18/15 4:31,Information,1.0,,4.36528e+17,Barclays,Barclays CEO stresses the importance of regula...
1,662822309,False,finalized,3,2/18/15 13:55,Information,1.0,,3.86013e+17,Barclays,Barclays announces result of Rights Issue http...
2,662822310,False,finalized,3,2/18/15 8:43,Information,1.0,,3.7958e+17,Barclays,Barclays publishes its prospectus for its åŖ5....


In [7]:
df.shape

(3118, 11)

In [17]:
df.category.value_counts()

Information    2129
Action          724
Dialogue        226
Exclude          39
Name: category, dtype: int64

In [25]:
df.groupby("category")["category:confidence"].unique()

category
Action         [0.6747, 1.0, 0.6634, 0.6775, 0.6695, 0.664, 0...
Dialogue       [0.6606, 0.6666, 0.6747, 0.6628, 0.6695, 1.0, ...
Exclude        [0.33899999999999997, 0.3366, 0.6622, 1.0, 0.6...
Information    [1.0, 0.6573, 0.6643, 0.6569, 0.6656, 0.6614, ...
Name: category:confidence, dtype: object

In [36]:
df_highconfidence = df[df["category:confidence"] == 1]

In [37]:
df_highconfidence.category.value_counts()

Information    1823
Action          456
Dialogue        124
Exclude          27
Name: category, dtype: int64

In [42]:
### Percentage of data that has confidence value == 1 in each category
total_counts = df.category.value_counts()
high_counts = df_highconfidence.category.value_counts()
print("Percentage of data that has confidence value == 1:")
for name in total_counts.index:
    # look up both counts by category label rather than by position
    print(name, high_counts[name] / total_counts[name] * 100)

Percentage of data that has confidence value == 1:
Information 85.62705495537811
Action 62.98342541436463
Dialogue 54.86725663716814
Exclude 69.23076923076923
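
The same per-category percentage can also be computed without an explicit loop. This is a minimal sketch on a small made-up DataFrame (the rows are illustrative, not taken from `corporate_messaging.csv`); it only assumes the same two column names as the real data.

```python
import pandas as pd

# Made-up rows that reuse the column names of the real dataset
toy = pd.DataFrame({
    "category": ["Action", "Action", "Action", "Action", "Dialogue", "Dialogue"],
    "category:confidence": [1.0, 1.0, 0.66, 1.0, 1.0, 0.67],
})

# Boolean mask, grouped by category: the mean of a boolean column
# is the fraction of True values in each group
is_confident = toy["category:confidence"] == 1
pct_confident = is_confident.groupby(toy["category"]).mean() * 100
print(pct_confident)
```

Because the result is indexed by category label, it does not depend on the two `value_counts()` results happening to be in the same order.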


**=> Keep only the rows with confidence value == 1, and only the 3 main categories.**

In [44]:
df1 = df[(df["category:confidence"] == 1) & (df["category"] != "Exclude")]
df1.category.value_counts()

Information    1823
Action          456
Dialogue        124
Name: category, dtype: int64

**Generate X (text) and y (category) as NumPy arrays**

In [47]:
X = df1.text.values
y = df1.category.values
print(len(X), len(y))

2403 2403


In [50]:
y[:5]

array(['Information', 'Information', 'Information', 'Information',
       'Information'], dtype=object)

In [51]:
X[:5]

array(['Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG',
       'Barclays announces result of Rights Issue http://t.co/LbIqqh3wwG',
       'Barclays publishes its prospectus for its åŖ5.8bn Rights Issue: http://t.co/YZk24iE8G6',
       'Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health http://t.co/nkuHoAfnSD',
       'Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director http://t.co/c3fNGY6NMT'],
      dtype=object)

### 1.2. PREPARE DATA

#### 1.2.1. PROCESS THE X VALUES (TEXT DATA) 

- **To-do list**:
    + Replace each URL with a common placeholder such as "urlplaceholder".
    + Split the text into tokens.
    + Normalize each token: lowercase, strip white space, lemmatize.
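
Before bringing in NLTK, the first steps of this list can be sketched with the standard library alone. The block below is an illustration under stated assumptions, not the notebook's tokenizer: `URL_RE` is a simplified URL pattern, `simple_tokenize` is a hypothetical helper, a regex splitter stands in for `word_tokenize`, and lemmatization is left out entirely.

```python
import re

# Simplified URL pattern (an assumption): "http://" or "https://" up to the next space
URL_RE = re.compile(r"https?://\S+")


def simple_tokenize(text):
    """Replace URLs, split into word and punctuation tokens, lowercase and strip."""
    text = URL_RE.sub("urlplaceholder", text)
    # \w+ grabs runs of word characters; [^\w\s] grabs single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [tok.lower().strip() for tok in tokens]


tokens = simple_tokenize("Watch the video: http://t.co/7qOBZz0tSa")
print(tokens)
```

Replacing every URL with the same token lets the model learn from the presence of a link rather than from its random-looking characters.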

In [58]:
#### url replace:
# raw string, so that \( and \) reach the regex engine without invalid-escape warnings
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

text = re.sub(url_regex, 'urlplaceholder', X[3])
print(text)

Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health urlplaceholder


In [59]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /Users/nhntran/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nhntran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/nhntran/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [76]:
# !!!! It doesn't work if we tokenize the data this way!!! Why?

# def tokenize(text):
#     """ Function to process text:
#         + replace url with the common name for url "urlplaceholder"
#         + normalize case, remove punctuation
#         + tokenize text
#         + lemmatize and remove stop words
#     """
#     url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

#     text = re.sub(url_regex, 'urlplaceholder', text)
#     text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
#     tokens = word_tokenize(text)
#     #tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
#     tokens = [lemmatizer.lemmatize(word).strip() for word in tokens]
#     return tokens

In [89]:
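def tokenize(text):
    """ Replace URLs with "urlplaceholder", then tokenize, lemmatize,
        lowercase and strip each token.
    """
    # replace every detected URL with the same placeholder token
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # split the text into word and punctuation tokens
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # lemmatize first, then lowercase and strip white space
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens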

In [90]:
tokens = tokenize(X[2])
tokens

['barclays',
 'publishes',
 'it',
 'prospectus',
 'for',
 'it',
 'åŗ5.8bn',
 'rights',
 'issue',
 ':',
 'urlplaceholder']

#### 1.2.2. SPLIT DATA INTO TRAIN AND TEST SET

- Use the function `train_test_split` from `sklearn.model_selection`. With its default settings it splits train:test as 75%:25%.

In [91]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), len(X_test), len(y_train), len(y_test))

1802 601 1802 601
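
`train_test_split` shuffles the data randomly, so each run of the cell above yields a different split. The sketch below, on made-up toy data, shows two optional arguments that the notebook does not use: `random_state` makes the split repeatable, and `stratify` keeps the class proportions the same in both parts, which may be worth considering with classes as imbalanced as the ones here.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 12 samples of class "a" and 8 of class "b" (illustrative only)
X_toy = [f"message {i}" for i in range(20)]
y_toy = ["a"] * 12 + ["b"] * 8

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # the default, written out explicitly
    stratify=y_toy,    # keep the 60/40 class ratio in both parts
    random_state=42,   # same split on every run
)
print(len(X_tr), len(X_te))
print(Counter(y_tr), Counter(y_te))
```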


#### 1.2.3. TRAIN CLASSIFIER

##### 1.2.3.1. VECTORIZE TEXT DATA USING BAG OF WORDS AND TF-IDF VALUES

In [92]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer = tokenize)
# initialize tf-idf transformer object
transformer = TfidfTransformer()

#### Transform X_train data
# Get counts of each word
X_train_counts = vect.fit_transform(X_train)
# use counts from count vectorizer results to compute tf-idf values
X_train_tfidf = transformer.fit_transform(X_train_counts)
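
To make the fit/transform distinction concrete, here is a small sketch on a made-up corpus. It uses the default `CountVectorizer` tokenizer rather than the custom `tokenize` function (so that it runs without NLTK), and the document lists are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["vote for us today", "click the link today", "thanks for your reply"]
test_docs = ["please vote today", "unseen words only"]

vect = CountVectorizer()          # default tokenizer instead of the custom `tokenize`
transformer = TfidfTransformer()

# fit_transform learns the vocabulary and the IDF weights from the training docs
train_tfidf = transformer.fit_transform(vect.fit_transform(train_docs))
# transform reuses them; words never seen during training are dropped
test_tfidf = transformer.transform(vect.transform(test_docs))

print(sorted(vect.vocabulary_))
print(train_tfidf.shape, test_tfidf.shape)
print(test_tfidf.toarray().round(2))
```

Both matrices have one column per training-vocabulary word. The second test document shares no word with the training set, so its row is all zeros.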

##### 1.2.3.2. TRAIN CLASSIFIER USING RANDOM FOREST

In [93]:
from sklearn.ensemble import RandomForestClassifier

# initialize the random forest classifier
clf = RandomForestClassifier()
# fit X_train using random forest
clf.fit(X_train_tfidf, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

#### 1.2.4. PREDICT ON TEST DATA

In [94]:
# check the type of the TF-IDF features (X_test_tfidf is only created in the next cell)
type(X_train_tfidf)

scipy.sparse.csr.csr_matrix

In [95]:
#### Transform X_test data
# get counts of each word
X_test_counts = vect.transform(X_test)
# use counts from count vectorizer results to compute tf-idf values
X_test_tfidf = transformer.transform(X_test_counts)
#### Predict on test data
y_pred = clf.predict(X_test_tfidf)

In [96]:
X_test[1]

'#HealthyBytes 8: Take a guess! Does sleep deprivation make you fatter or slimmer? Watch the video http://t.co/7qOBZz0tSa'

In [97]:
y_pred[1]

'Action'

#### 1.2.5. EVALUATE THE ACCURACY OF A CLASSIFICATION

- Evaluate the predictions with a confusion matrix.
- Use the `confusion_matrix` function from `sklearn.metrics`.
- In a confusion matrix C, the entry Ci,j is the number of observations known to be in group i and predicted to be in group j. Example (rows: true category, columns: predicted category):

                        'Action'  'Dialogue'  'Information'
       'Action'            87         0            26
       'Dialogue'           3        21             8
       'Information'        4         0           452

- In binary classification, the count of true negatives is C0,0, false negatives is C1,0, true positives is C1,1 and false positives is C0,1.

                        False    True
               False      5        5
               True      10       80

- tn, fp, fn, tp can be extracted using the `ravel` function:
    
    Example: tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
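
The definition of Ci,j can be checked by building the matrix by hand. The sketch below uses plain Python; `confusion` is a hypothetical helper written for illustration, not the scikit-learn function. It reproduces the binary example above and then derives the accuracy of the 3-class example matrix.

```python
def confusion(y_true, y_pred, labels):
    """C[i][j] = number of samples with true label labels[i] and predicted label labels[j]."""
    position = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[position[true_label]][position[pred_label]] += 1
    return matrix


# Binary example from the text: flattening row by row gives tn, fp, fn, tp
binary = confusion([0, 1, 0, 1], [1, 1, 1, 0], labels=[0, 1])
tn, fp, fn, tp = [count for row in binary for count in row]
print(binary, (tn, fp, fn, tp))

# Accuracy is the diagonal (correct predictions) divided by the total
example = [[87, 0, 26], [3, 21, 8], [4, 0, 452]]
correct = sum(example[k][k] for k in range(3))
total = sum(sum(row) for row in example)
print(correct, total, round(correct / total, 4))
```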

In [98]:
labels = np.unique(y_pred)
labels

array(['Action', 'Dialogue', 'Information'], dtype=object)

In [106]:
from sklearn.metrics import confusion_matrix
# calculate the confusion matrix from y_test and y_pred
confusion_mat = confusion_matrix(y_test, y_pred, labels = labels)
print("Confusion matrix:")
print(labels)
print(confusion_mat)
# calculate the accuracy
accuracy = (y_pred == y_test).mean()
# or accuracy = sum(y_pred == y_test)/len(y_pred)
print("Accuracy:", accuracy)

Confusion matrix:
['Action' 'Dialogue' 'Information']
[[ 86   0  15]
 [  0  28   7]
 [  4   1 460]]
Accuracy: 0.9550748752079867
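
A single accuracy figure can hide how the smaller classes are doing. As a sketch, per-class precision and recall can be derived from the matrix printed above using plain Python (rows are true labels and columns are predicted labels, as returned by `confusion_matrix(y_test, y_pred)`). The numbers are copied from this run and will differ on another split.

```python
labels = ["Action", "Dialogue", "Information"]
# Matrix printed above: rows are true labels, columns are predicted labels
cm = [[86, 0, 15],
      [0, 28, 7],
      [4, 1, 460]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(labels))) / total

recall = {}
precision = {}
for k, label in enumerate(labels):
    row_sum = sum(cm[k])                   # samples that truly belong to this class
    col_sum = sum(row[k] for row in cm)    # samples predicted as this class
    recall[label] = cm[k][k] / row_sum
    precision[label] = cm[k][k] / col_sum

print(f"accuracy {accuracy:.4f}")
for label in labels:
    print(f"{label:12s} precision {precision[label]:.3f}  recall {recall[label]:.3f}")
```

In this run the recall for Dialogue (28 of 35) is noticeably lower than for Information (460 of 465), even though the overall accuracy is about 0.955.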


## 2. REFACTOR TO AUTOMATE THE WORKFLOW

In [107]:
#### Import the necessary Python packages
# basic packages
import pandas as pd
import numpy as np
import re

# processing the data
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from sklearn.model_selection import train_test_split

# transform and fit data
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

# evaluate the result
from sklearn.metrics import confusion_matrix

#### functions to be used
def load_data():
    """ Load data from csv file
        Filter the data and get the X, y data
        return: X, y
    """
    df = pd.read_csv("corporate_messaging.csv", encoding = 'iso8859_4')
    
    # Number of samples in each of the 4 categories:
    # Information: 2129, Action: 724, Dialogue: 226, Exclude: 39
    # Keep only the rows with confidence value == 1 and only the 3 main categories.
    df1 = df[(df["category:confidence"] == 1) & (df["category"] != "Exclude")]
    X = df1.text.values
    y = df1.category.values
    return X, y

def tokenize(text):
    """ Function to process text:
         + replace url with the common name for url "urlplaceholder"
         + tokenize text
         + lemmatize words, normalize case, strip white space
         (punctuation tokens are kept)
     """
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]

    return clean_tokens

def display_results(y_test, y_pred):
    """ Display the confusion matrix and accuracy
    """
    labels = np.unique(y_pred)
    # calculate the confusion matrix from y_test and y_pred
    confusion_mat = confusion_matrix(y_test, y_pred, labels = labels)
    print("Confusion matrix:")
    print(labels)
    print(confusion_mat)
    accuracy = (y_pred == y_test).mean()
    # or accuracy = sum(y_pred == y_test)/len(y_pred)
    print("Accuracy:", accuracy)

def main():
    ## prepare train and test set
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    ## prepare transformers and estimator
    # initialize count vectorizer object
    vect = CountVectorizer(tokenizer = tokenize)
    # initialize tf-idf transformer object
    transformer = TfidfTransformer()
    # initialize the random forest classifier
    clf = RandomForestClassifier()
    
    #### Transform X_train data
    # Get counts of each word
    X_train_counts = vect.fit_transform(X_train)
    # use counts from count vectorizer results to compute tf-idf values
    X_train_tfidf = transformer.fit_transform(X_train_counts)
    # fit X_train using random forest
    clf.fit(X_train_tfidf, y_train)
    
    #### Transform X_test data
    # get counts of each word
    X_test_counts = vect.transform(X_test)
    # use counts from count vectorizer results to compute tf-idf values
    X_test_tfidf = transformer.transform(X_test_counts)
    #### Predict on test data
    y_pred = clf.predict(X_test_tfidf)
    
    display_results(y_test, y_pred)

main()


Confusion matrix:
['Action' 'Dialogue' 'Information']
[[ 88   0  21]
 [  0  23   5]
 [  4   0 460]]
Accuracy: 0.9500831946755408


## 3. IMPLEMENT PIPELINE INTO THE WORKFLOW


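- Use the `Pipeline` class from `sklearn.pipeline`.

- **HOW TO USE PIPELINE**:
    
    + Define a pipeline: `pipeline = Pipeline([...])`, where the list holds `(name, object)` steps: any number of transformers followed by 1 estimator at the end
    + Train the model: `pipeline.fit(X_train, y_train)`
    + Predict: `y_pred = pipeline.predict(X_test)`

**ADVANTAGES OF USING PIPELINE**:

- Automates repetitive steps: one `fit` call runs `fit_transform` on every transformer and then fits the estimator
- Easy to understand
- Easy to optimize the workflow (tuning the parameters of every step together)
- Helps prevent data leakage: the transformers are fitted on the training data only and then applied unchanged to the test data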
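
As an illustration of the tuning and leakage points, the sketch below wraps a pipeline in `GridSearchCV`. It is a hedged example rather than part of the original workflow: the corpus is made up, the default `CountVectorizer` tokenizer replaces the custom `tokenize` function, and the parameter grid is an arbitrary choice.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to make the sketch runnable
texts = [
    "vote for us today", "click the link now", "please vote now",
    "click here to vote", "watch the video today", "sign up with this link",
    "we published our annual report", "our ceo spoke at the conference",
    "quarterly results announced", "new director appointed",
    "the company opened an office", "annual meeting results published",
]
labels = ["Action"] * 6 + ["Information"] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [10, 50],
}

# Each CV fold refits the whole pipeline, so the vectorizer never sees the held-out fold
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["click to vote today"]))
```

Tuning the vectorizer and the classifier in one grid is only possible because they live in the same pipeline object.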

In [108]:
#### Import the necessary Python packages
# basic packages
import pandas as pd
import numpy as np
import re
# processing the data
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from sklearn.model_selection import train_test_split
# transform and fit data
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
# evaluate the result
from sklearn.metrics import confusion_matrix
# import Pipeline to chain the transformers and the classifier
from sklearn.pipeline import Pipeline

#### functions to be used
def load_data():
    """ Load data from csv file
        Filter the data and get the X, y data
        return: X, y
    """
    df = pd.read_csv("corporate_messaging.csv", encoding = 'iso8859_4')
    
    # Number of samples in each of the 4 categories:
    # Information: 2129, Action: 724, Dialogue: 226, Exclude: 39
    # Keep only the rows with confidence value == 1 and only the 3 main categories.
    df1 = df[(df["category:confidence"] == 1) & (df["category"] != "Exclude")]
    X = df1.text.values
    y = df1.category.values
    return X, y

def tokenize(text):
    """ Function to process text:
         + replace url with the common name for url "urlplaceholder"
         + tokenize text
         + lemmatize words, normalize case, strip white space
         (punctuation tokens are kept)
     """
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]

    return clean_tokens

def display_results(y_test, y_pred):
    """ Display the confusion matrix and accuracy
    """
    labels = np.unique(y_pred)
    # calculate the confusion matrix from y_test and y_pred
    confusion_mat = confusion_matrix(y_test, y_pred, labels = labels)
    print("Confusion matrix:")
    print(labels)
    print(confusion_mat)
    accuracy = (y_pred == y_test).mean()
    # or accuracy = sum(y_pred == y_test)/len(y_pred)
    print("Accuracy:", accuracy)

def main():
    ## prepare train and test set
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    ## initialize pipeline
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('transformer', TfidfTransformer()),
        ('clf',RandomForestClassifier())
    ])
    
    ## train classifier
    pipeline.fit(X_train, y_train)
    
    #### Predict on test data
    y_pred = pipeline.predict(X_test)
    
    display_results(y_test, y_pred)

main()


Confusion matrix:
['Action' 'Dialogue' 'Information']
[[ 82   1  29]
 [  1  23   4]
 [  5   0 456]]
Accuracy: 0.9334442595673876
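
The three runs in this notebook report slightly different accuracies (about 0.955, 0.950 and 0.933). This is expected and does not show that the pipeline performs worse: neither `train_test_split` nor `RandomForestClassifier` is given a `random_state`, so every run uses a different split and a different forest. Here is a minimal sketch, on a made-up corpus with the default tokenizer (`run_once` is a hypothetical helper), of how seeding both makes a run repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up corpus, for illustration only
texts = [
    "vote for us", "click this link", "please vote now", "click to enter",
    "watch our video", "sign up today",
    "annual report published", "ceo speaks at conference", "results announced",
    "new director appointed", "office opened in london", "prospectus published",
]
labels = ["Action"] * 6 + ["Information"] * 6


def run_once(seed):
    """Split, fit and predict with every source of randomness seeded."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, stratify=labels, random_state=seed)
    pipeline = Pipeline([
        ("vect", CountVectorizer()),
        ("transformer", TfidfTransformer()),
        ("clf", RandomForestClassifier(n_estimators=20, random_state=seed)),
    ])
    pipeline.fit(X_train, y_train)
    return X_test, list(pipeline.predict(X_test))


first = run_once(seed=0)
second = run_once(seed=0)
print(first == second)
```

With the same seed, both the held-out messages and the predictions are identical across runs, so any change in accuracy can be attributed to a change in the code rather than to chance.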
