# Feature Engineering
The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

- Text cleaning and preparation: removal of special characters, lowercasing, removal of punctuation signs, possessive pronouns and stop words, and lemmatization.
- Label coding: creation of a dictionary to map each category to a code.
- Train-test split: to test the models on unseen data.
- Text representation: use of TF-IDF scores to represent text.
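
To make the first step concrete, here is a minimal sketch of this kind of cleaning using only the standard library. The `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical and only for illustration; the actual cleaning is done upstream with NLTK's stop-word list and lemmatizer, and lemmatization is left out here.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the real pipeline uses NLTK's).
STOP_WORDS = {"me", "if", "you", "are", "the", "a", "is", "how"}

def clean_text(text):
    """Lowercase, drop possessives and non-letters, then remove stop words."""
    text = text.lower()
    text = re.sub(r"'s\b", "", text)        # possessive endings: "heather's" -> "heather"
    text = re.sub(r"[^a-z\s]", " ", text)   # punctuation, digits and other special characters
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("Hi Ina! How are you?"))   # hi ina
```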

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np
pd.set_option('max_colwidth',100)

In [2]:
# Put here the paths of the pickle files saved from src/data/data_processing.py
data_folder_path = '/Users/mouhamethtakhafaye/Desktop/behavox_assignment/notebooks/Pickles/clean_corpus.pickle'
data_folder_path_2 = '/Users/mouhamethtakhafaye/Desktop/behavox_assignment/notebooks/Pickles/raw_corpus.pickle'

# Path for outfiles
outfile_path = '/Users/mouhamethtakhafaye/Desktop/behavox_assignment/notebook/'

In [3]:
with open(data_folder_path, 'rb') as data:
    clean_corpus = pickle.load(data)
with open(data_folder_path_2, 'rb') as f:
    raw_corpus = pickle.load(f)

In [4]:
clean_corpus

Unnamed: 0,Messages
CHATS,hello morning yeah ...
EMAILS,please let know still need curve shift thanks heather original message allen phillip k ...
SMS,sms hi ina ...


In [5]:
raw_corpus = raw_corpus.rename(columns={'Messages': 'Raw_Messages'})
raw_corpus 

Unnamed: 0,Raw_Messages
CHATS,\n Hello?\n \n Morning\n \n ...
EMAILS,"Please let me know if you still need Curve Shift. Thanks, Heather -----Original Message----- F..."
SMS,\n Sms #2\n \n Hi Ina! How are you?\n ...


In [6]:
combined_corpus = clean_corpus.join(raw_corpus).reset_index()
df = combined_corpus.rename(columns={'index': 'Channel'})
df

Unnamed: 0,Channel,Messages,Raw_Messages
0,CHATS,hello morning yeah ...,\n Hello?\n \n Morning\n \n ...
1,EMAILS,please let know still need curve shift thanks heather original message allen phillip k ...,"Please let me know if you still need Curve Shift. Thanks, Heather -----Original Message----- F..."
2,SMS,sms hi ina ...,\n Sms #2\n \n Hi Ina! How are you?\n ...


## 2. Label coding
We'll create a dictionary that maps each channel label to a numeric code:

In [7]:
channel_code = {
    'SMS': 1,
    'EMAILS': 2,
    'CHATS': 3,
    }

In [8]:
# Category mapping: translate each channel name into its numeric code
df['Channel_code'] = df['Channel'].map(channel_code)
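
Since the models will predict codes rather than names, it can help to keep the inverse mapping around as well. The sketch below uses plain Python; `code_to_channel` is a hypothetical name that does not appear in the notebook.

```python
channel_code = {'SMS': 1, 'EMAILS': 2, 'CHATS': 3}

# Hypothetical inverse lookup, useful for turning predictions back into names.
code_to_channel = {code: name for name, code in channel_code.items()}

channels = ['CHATS', 'EMAILS', 'SMS']
codes = [channel_code[name] for name in channels]
decoded = [code_to_channel[code] for code in codes]

print(codes)    # [3, 2, 1]
print(decoded)  # ['CHATS', 'EMAILS', 'SMS']
```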

In [9]:
df

Unnamed: 0,Channel,Messages,Raw_Messages,Channel_code
0,CHATS,hello morning yeah ...,\n Hello?\n \n Morning\n \n ...,3
1,EMAILS,please let know still need curve shift thanks heather original message allen phillip k ...,"Please let me know if you still need Curve Shift. Thanks, Heather -----Original Message----- F...",2
2,SMS,sms hi ina ...,\n Sms #2\n \n Hi Ina! How are you?\n ...,1


## 3. Train - test split
We'll set aside a test set to assess the quality of our models. We'll use cross-validation on the training set to tune the hyperparameters, and then measure performance on the unseen data of the test set.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(df['Messages'], 
                                                    df['Channel_code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

Since we have very few observations (only 3 documents, one per channel), we'll choose a test set size of 15% of the full dataset, which here amounts to a single held-out document.
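
The resulting split sizes can be checked with a bit of arithmetic. This sketch assumes the test count is rounded up and the remainder goes to training, which is how we understand `train_test_split` to behave for a float `test_size`; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(3, 0.15))  # (2, 1)
```

With three documents this gives two training rows and one test row, consistent with the feature matrix shapes printed in the text representation section.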

## 4. Text representation
We have various options:

- Count Vectors as features
- TF-IDF Vectors as features
- Word Embeddings as features
- Text / NLP based features
- Topic Models as features

We'll use TF-IDF Vectors as features.

We have to define the different parameters:

- ngram_range: We want to consider both unigrams and bigrams.
- max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
- min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
- max_features: If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
- See TfidfVectorizer? for further detail.

Note that we implicitly scale our data when representing it as TF-IDF features: the norm argument normalizes each document vector (here with the L2 norm).
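
To see what the weighting and the normalization do, here is a hand-rolled sketch of a TF-IDF vector for a toy corpus. It assumes sublinear term frequency (`1 + ln(tf)`), smoothed inverse document frequency (`ln((1 + n) / (1 + df)) + 1`) and L2 normalization, which is our understanding of the scikit-learn defaults combined with `sublinear_tf=True`. The toy documents and the `tfidf_vector` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy, already tokenized corpus (illustrative only).
docs = [["hello", "morning", "hello"],
        ["please", "let", "know", "morning"]]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return {term: weight} with sublinear tf, smoothed idf and L2 normalization."""
    weights = {}
    for term, tf in Counter(doc).items():
        sublinear_tf = 1.0 + math.log(tf)
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(docs[0])
print(vec)
```

In the first document, "hello" ends up with a larger weight than "morning" because it occurs twice and appears in only one document, and the squared weights of each vector sum to 1.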

In [11]:
# Parameter selection
# We have chosen these values as a first approximation.
ngram_range = (1,2)
min_df = 1
# Note: an integer threshold is an absolute document count; a float (e.g. 1.) is a proportion
max_df = 1
max_features = 400
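
The difference between integer and float thresholds is easy to overlook, so here is a small standard-library sketch of the document-frequency filter. It assumes integers are read as absolute document counts and floats as proportions of the corpus; `filter_by_df` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Floats are proportions of the corpus, integers are absolute counts.
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(t for t, c in doc_freq.items() if min_count <= c <= max_count)

docs = [["hello", "morning"], ["please", "morning"]]
print(filter_by_df(docs, max_df=1))    # ['hello', 'please']
print(filter_by_df(docs, max_df=1.0))  # ['hello', 'morning', 'please']
```

Under that assumption, `max_df = 1` keeps only terms that occur in a single document, whereas `max_df = 1.` would keep every term.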

In [12]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(2, 400)
(1, 400)


In [13]:
from sklearn.feature_selection import chi2
import numpy as np

for channel, channel_id in sorted(channel_code.items()):
    # Score every feature against a one-vs-rest indicator for this channel
    features_chi2 = chi2(features_train, labels_train == channel_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(channel))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")

# 'CHATS' category:
  . Most correlated unigrams:
. yeah
. chart
. trading
. short
. im
  . Most correlated bigrams:
. strong buy
. hi john

# 'EMAILS' category:
  . Most correlated unigrams:
. yeah
. chart
. trading
. short
. im
  . Most correlated bigrams:
. strong buy
. hi john

# 'SMS' category:
  . Most correlated unigrams:
. future
. full
. friday
. guys
. zdnet
  . Most correlated bigrams:
. gas intelligence
. full story



In [14]:
bigrams

['proprietary otherwise',
 'products services',
 'privileged proprietary',
 'private information',
 'primary account',
 'price save',
 'read price',
 'requested mattsmithenroncom',
 'request request',
 'request pending',
 'request id',
 'request create',
 'remove email',
 'recipient may',
 'received error',
 'otherwise private',
 'original use',
 'original message',
 'november pm',
 'notify sender',
 'please visit',
 'please use',
 'please try',
 'please reply',
 'please notify',
 'please note',
 'please let',
 'please contact',
 'please click',
 'phillip allen',
 'phase first',
 'resource name',
 'upgraded strong',
 'td typeblock',
 'td td',
 'upon request',
 'use email',
 'would like',
 'web site',
 'vpn octel',
 'resource type',
 'sender immediately',
 'save see',
 'risk acceptance',
 'rights reserved',
 'rights permanent',
 'review act',
 'strong buy',
 'ssn tin',
 'ssn ssn',
 'six digits',
 'coverage initiated',
 'contain privileged',
 'confidential information',
 'common stock',


We can see that the list contains many bigrams. This suggests that, with the current parameter values, bigrams make up a large share of the features. Note, however, that since we restrict the vocabulary to 400 features (max_features), only a subset of all possible bigrams is being considered.

Let's save the files we'll need in the next steps:

In [15]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
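
The repeated blocks above could also be written as a loop. The following is a minimal sketch with two hypothetical helpers, `save_artifacts` and `load_artifact`, demonstrated on placeholder values in a temporary folder rather than on the real notebook objects.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifacts(artifacts, folder):
    """Pickle every object in `artifacts` to <folder>/<name>.pickle."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(folder / f"{name}.pickle", "wb") as output:
            pickle.dump(obj, output)

def load_artifact(name, folder):
    """Load a single pickled object back by name."""
    with open(Path(folder) / f"{name}.pickle", "rb") as data:
        return pickle.load(data)

# Placeholder values standing in for the real train/test labels.
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts({"labels_train": [3, 2], "labels_test": [1]}, tmp)
    restored = load_artifact("labels_test", tmp)

print(restored)  # [1]
```

In the notebook, the dictionary would hold the actual objects (for example `{'X_train': X_train, 'tfidf': tfidf}`) and the folder would be `Pickles`.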