# Feature Engineering: TF-IDF
The next step is to extract features. We have several options:

- Count Vectors as features
- TF-IDF Vectors as features
- Word Embeddings as features
- Text / NLP based features
- Topic Modeling as features

Once the feature extraction technique is applied, it is up to us to interpret the results and check whether the mix of words in each topic makes sense. If it doesn't, we can try changing the number of topics, the terms in the document-term matrix, the model parameters, or even a different model.

For this notebook, we'll use TF-IDF vectors as features.
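
To make the difference between the first two options concrete, here is a minimal sketch on a made-up two-document corpus (toy_docs is an assumption for illustration, not the notebook's data). It compares raw counts with TF-IDF weights using scikit-learn's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, only for illustration.
toy_docs = ["buy buy sell", "buy hold"]

# Count vectors: one column per term, values are raw occurrence counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs).toarray()

# TF-IDF vectors: counts are reweighted so that terms appearing in
# fewer documents get a larger weight, then each row is L2-normalized.
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs).toarray()

# Both vectorizers sort the vocabulary alphabetically: buy, hold, sell.
print(sorted(count_vec.vocabulary_))
print(counts)
print(weights.round(3))
```

In the second toy document, buy and hold both occur once, but hold receives the larger TF-IDF weight because it appears in only one of the two documents.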

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np
pd.set_option('max_colwidth',100)

In [2]:
# Path for outfiles
outfile_path = '/Users/mouhamethtakhafaye/Desktop/behavox_assignment/notebook/'

In [3]:
# Load the cleaned corpus produced earlier
with open('Pickles/clean_corpus.pickle', 'rb') as data:
    clean_corpus = pickle.load(data)

In [4]:
clean_corpus

Unnamed: 0,Messages
CHATS,hello morning yeah ...
EMAILS,please let know still need curve shift thanks heather original message phillip sent f...
SMS,sms hi ina ...


In [5]:
df = clean_corpus.reset_index().rename(columns={'index': 'Channel'})

In [6]:
df

Unnamed: 0,Channel,Messages
0,CHATS,hello morning yeah ...
1,EMAILS,please let know still need curve shift thanks heather original message phillip sent f...
2,SMS,sms hi ina ...


### Label coding
We'll create a dictionary that maps each channel label to an integer code:

In [7]:
channel_code = {
    'SMS': 1,
    'EMAILS': 2,
    'CHATS': 3,
    }

In [8]:
# Communication Channel mapping
df['Channel_code'] = df['Channel']
df = df.replace({'Channel_code': channel_code})

In [9]:
df = df.drop('Channel', axis=1).copy()
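
As a side note, the same encoding can be sketched with Series.map, which leaves a NaN for any channel missing from the dictionary instead of silently keeping the original string. The toy frame and the code_to_channel reverse mapping below are assumptions added for illustration; they are not part of the notebook's pipeline.

```python
import pandas as pd

channel_code = {"SMS": 1, "EMAILS": 2, "CHATS": 3}

# Hypothetical stand-in for the notebook's DataFrame.
toy = pd.DataFrame({"Channel": ["CHATS", "EMAILS", "SMS"]})

# Encode: look up each channel name in the dictionary.
toy["Channel_code"] = toy["Channel"].map(channel_code)

# Decode: invert the dictionary to get back from codes to names.
code_to_channel = {code: name for name, code in channel_code.items()}
decoded = toy["Channel_code"].map(code_to_channel)

print(toy)
print(decoded.tolist())
```

Keeping the reverse mapping around makes it easy to turn model predictions back into channel names later on.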

## TF-IDF Vectors as features:

We have to define the following parameters:

- ngram_range: we want to consider both unigrams and bigrams.
- max_df: when building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. An integer is read as an absolute document count, a float as a proportion of documents.
- min_df: when building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The same integer/float convention applies.
- max_features: if not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
- See TfidfVectorizer? for further detail.
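
The effect of these parameters can be sketched on a small made-up corpus (toy_docs is an assumption for illustration, not the notebook's data). The example contrasts max_df given as a float with max_df given as an integer, and shows what ngram_range=(1, 2) adds to the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "please" occurs in all three documents.
toy_docs = [
    "please call me",
    "please email me",
    "please call the office",
]

# Float max_df: a proportion. Terms present in more than 70% of the
# documents (here: more than 2.1 documents) are dropped.
vec_prop = TfidfVectorizer(max_df=0.7).fit(toy_docs)
vocab_prop = sorted(vec_prop.vocabulary_)

# Integer max_df: an absolute count. With only 3 documents no term can
# appear in more than 7 of them, so nothing is filtered.
vec_abs = TfidfVectorizer(max_df=7).fit(toy_docs)
vocab_abs = sorted(vec_abs.vocabulary_)

# ngram_range=(1, 2) keeps unigrams and adds adjacent word pairs.
vec_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_docs)
vocab_bi = sorted(vec_bi.vocabulary_)

print(vocab_prop)
print(vocab_abs)
print(vocab_bi)
```

This matters for a corpus of three documents: an integer max_df of 7 never removes anything, whereas a float such as 0.7 would drop every term shared by all three channels.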

Note that representing the data as TF-IDF features implicitly scales it: with the norm argument set to 'l2', each document vector is normalized to unit Euclidean length.
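
The scaling can be checked directly. The sketch below uses a made-up corpus (toy_docs is an assumption, loosely modeled on the previews shown earlier) and verifies that every row of the TF-IDF matrix has Euclidean length 1 when norm='l2'.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of very different lengths.
toy_docs = [
    "hello morning yeah",
    "please let know still need curve shift thanks",
    "hi call immediately please",
]

vec = TfidfVectorizer(norm="l2", sublinear_tf=True)
X = vec.fit_transform(toy_docs).toarray()

# Euclidean length of each document vector (one value per row).
row_norms = np.linalg.norm(X, axis=1)
print(row_norms)
```

Because every row ends up with the same length, a long document does not dominate a short one simply by containing more words.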

In [10]:
# Parameter selection
# We tried different values as a first approximation; these are the ones that yield the most meaningful features
ngram_range = (1,2)
min_df = 1
max_df = 7  # an integer is an absolute document count, so 7 filters nothing with only 3 documents
max_features = 200

In [11]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features = tfidf.fit_transform(df.Messages).toarray()
labels = df.Channel_code
print(features.shape)
print(labels.shape)

(3, 200)
(3,)


In [12]:
for wrd, channel_id in sorted(channel_code.items()):
    features_chi2 = chi2(features, labels == channel_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]  # get_feature_names() was removed in scikit-learn 1.2
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' channel:".format(wrd))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-6:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-6:])))
    print("")

# 'CHATS' channel:
  . Most correlated unigrams:
. need
. email
. john
. going
. long
. short
  . Most correlated bigrams:
. downgraded buy
. coverage initiated
. original message
. enron corp
. let know
. strong buy

# 'EMAILS' channel:
  . Most correlated unigrams:
. click
. hi
. message
. phillip
. information
. buy
  . Most correlated bigrams:
. downgraded buy
. original message
. coverage initiated
. enron corp
. let know
. strong buy

# 'SMS' channel:
  . Most correlated unigrams:
. hello
. getting
. email
. need
. call
. immediately
  . Most correlated bigrams:
. downgraded buy
. original message
. coverage initiated
. enron corp
. let know
. strong buy



In [13]:
bigrams

['received error',
 'buy strong',
 'full story',
 'please let',
 'would like',
 'please click',
 'price save',
 'may contain',
 'intended recipient',
 'downgraded buy',
 'original message',
 'coverage initiated',
 'enron corp',
 'let know',
 'strong buy']

The full list contains more bigrams than the six printed for each channel. This suggests that, with a larger max_features value, bigrams would correlate more strongly with the channel than unigrams; because we restrict the vocabulary to the 200 most frequent terms, only a few bigrams are considered.

# Reminder:
There is a large imbalance in text size between the three communication channels. We have more files in the inbox (emails) folder than in the chats and sms folders. As a result, the most meaningful bigrams come from the email channel, because the number of words in the chats and sms channels is very limited.
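
One way to put a number on this imbalance is to count whitespace-separated tokens per channel. The sketch below uses a small made-up frame (toy and its contents are assumptions, so the figures say nothing about the real corpus); the same two lines could be applied to a DataFrame with a Messages column.

```python
import pandas as pd

# Hypothetical stand-in for the per-channel corpus.
toy = pd.DataFrame({
    "Channel": ["CHATS", "EMAILS", "SMS"],
    "Messages": [
        "hello morning yeah",
        "please let know still need curve shift thanks",
        "sms hi",
    ],
})

# Number of tokens per channel and each channel's share of the total.
toy["n_tokens"] = toy["Messages"].str.split().str.len()
toy["share"] = toy["n_tokens"] / toy["n_tokens"].sum()

print(toy[["Channel", "n_tokens", "share"]])
```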