# Feature Engineering
The next step is to extract features, and we have several options for that:

- Count Vectors as features
- TF-IDF Vectors as features
- Word Embeddings as features
- Text / NLP based features
- Topic Modeling as features

Once a feature extraction technique has been applied, our job is to interpret the results and check whether the mix of words in each channel makes sense. If it doesn't, we can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.

For this notebook, we'll start with count vectors and then re-weight them with TF-IDF:
- Why TF-IDF weighting?
The goal is to scale down the impact of tokens that occur very frequently in our corpus and that distort our analysis. We noticed many repeated words in the emails folder; for instance, words like Hi, Thanks, How are you, etc. are very frequent. TF-IDF weighting should help reduce this noise.
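
To make the down-weighting concrete, here is a minimal sketch on a hypothetical three-message toy corpus. The messages and the names `toy_docs`, `toy_vectorizer` and `idf` are illustrative assumptions, not part of the project data. It relies on scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, under which a token present in every document keeps the minimum weight of 1 rather than being removed outright.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, chosen so that "hi" appears in every document.
toy_docs = [
    "hi thanks please review curve shift",
    "hi thanks lunch friday",
    "hi curve zero net",
]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_docs)

# Map each vocabulary term to its inverse-document-frequency weight.
idf = {
    term: toy_vectorizer.idf_[column]
    for term, column in toy_vectorizer.vocabulary_.items()
}

# Print terms from most to least down-weighted.
for term, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{term:<8} {weight:.3f}")
```

In this toy corpus `hi` occurs in all three messages and receives the smallest weight, while a term such as `lunch` that occurs in a single message receives the largest.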

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np
pd.set_option('max_colwidth',100)

In [2]:
# Path for outfiles
outfile_ = '/Users/mouhamethtakhafaye/Desktop/behavox_assignment/models/'

In [3]:
with open('Pickles/clean_corpus.pickle', 'rb') as data:
    clean_corpus = pickle.load(data)

In [4]:
clean_corpus

Unnamed: 0,Messages
SMS,sms hi ina ...
CHATS,hello morning yeah ...
EMAILS,please let know still need curve shift thanks heather original message phillip sent f...


In [5]:
df = clean_corpus.reset_index().rename(columns={'index': 'Channel'})

In [6]:
df

Unnamed: 0,Channel,Messages
0,SMS,sms hi ina ...
1,CHATS,hello morning yeah ...
2,EMAILS,please let know still need curve shift thanks heather original message phillip sent f...


### Label coding
We'll create a dictionary that maps each channel label to a numeric code:

In [7]:
channel_code = {
    'SMS': 1,
    'EMAILS': 2,
    'CHATS': 3,
    }

In [8]:
# Communication Channel mapping
df['Channel_code'] = df['Channel'].map(channel_code)

In [9]:
df = df.drop('Channel', axis=1).copy()

In [10]:
df

Unnamed: 0,Messages,Channel_code
0,sms hi ina ...,1
1,hello morning yeah ...,3
2,please let know still need curve shift thanks heather original message phillip sent f...,2


## TF-IDF Count Vectorizer:

We have to define the following parameters:

- ngram_range: The lower and upper bounds on n-gram length. The cell below uses (3, 3), so only trigrams are extracted; (1, 2) would give unigrams and bigrams instead.
- max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
- min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
- max_features: If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
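
The effect of each parameter can be sketched on a small hypothetical corpus. The messages and the helper `fitted_vocabulary` below are illustrative assumptions, not part of the notebook. In scikit-learn, a float value for `max_df` or `min_df` is read as a proportion of documents and an integer as an absolute document count. With only three documents, as in this notebook, `max_df=0.80` therefore drops exactly the terms that appear in all three.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages; "hi" appears in all three documents.
toy_docs = [
    "hi thanks please review curve shift",
    "hi thanks lunch friday",
    "hi curve zero net",
]

def fitted_vocabulary(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    vectorizer = CountVectorizer(**params)
    vectorizer.fit(toy_docs)
    return sorted(vectorizer.vocabulary_)

all_terms = fitted_vocabulary()                    # defaults: unigrams, no filtering
without_common = fitted_vocabulary(max_df=0.80)    # drops terms in more than 80% of documents
without_rare = fitted_vocabulary(min_df=2)         # drops terms in fewer than 2 documents
most_frequent = fitted_vocabulary(max_features=1)  # keeps only the single most frequent term
bigrams = fitted_vocabulary(ngram_range=(2, 2))    # two-word sequences only

print(without_common)
print(without_rare)
print(most_frequent)
print(bigrams)
```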




## Reminder:
There is a large imbalance in text size between the three communication channels. We have more files in the emails folder than in the chats and SMS folders, and as a result most of the meaningful bigrams or trigrams will come from the email channel.
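
One quick way to quantify this imbalance is to count tokens per channel. The sketch below uses a hypothetical stand-in frame, `toy_df`, with shortened messages; the column names mirror the notebook's `df`, but the text and the resulting numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df` (messages shortened for illustration).
toy_df = pd.DataFrame({
    "Channel_code": [1, 3, 2],
    "Messages": [
        "call asap open email",
        "hello morning yeah",
        "please let know still need curve shift thanks heather original message",
    ],
})

# Whitespace token count per channel, and each channel's share of all tokens.
toy_df["n_tokens"] = toy_df["Messages"].str.split().str.len()
toy_df["token_share"] = toy_df["n_tokens"] / toy_df["n_tokens"].sum()

print(toy_df[["Channel_code", "n_tokens", "token_share"]])
```

Running the same two lines on the real `df` would show how much of the corpus each channel contributes before any n-grams are extracted.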



In [11]:
count_vec = CountVectorizer(stop_words=None, 
                            analyzer='word',
                            ngram_range=(3, 3),
                            max_df=0.80,
                            min_df=0.3, 
                            token_pattern=r"(?u)\b\w+\b", 
                            max_features=None)

In [12]:
dt_mat = count_vec.fit_transform(df.Messages)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

In [13]:
# count_vec.get_feature_names_out()   # uncomment to list the vocabulary

In [14]:
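# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
trigrams = pd.DataFrame(dt_mat.toarray(), index=df.index, columns=count_vec.get_feature_names_out())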
trigrams['channel_code'] = df.Channel_code

In [15]:
trigrams

Unnamed: 0,able give forecast,able golf friday,able stay within,able well get,abn amro coverage,abn amro downgraded,accelerated distribution psa,accelerating distribution psa,accenture houston parkway,accenture human performance,...,是东亚地区统一的一党主权国家 也是世界上人口最多的国家 按,澳门特别行政区 下午好 嗨,照总面积计算 它是第三大或第四大国家 取决于所咨询的来源,現在 hear last,裤脚 鞋子全部打湿完了 привет,許多人需要同意 很明顯 我會等待確認,還沒有 許多人需要同意 很明顯,重庆 和香港 澳门特别行政区,鞋子全部打湿完了 привет пример,channel_code
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,1
1,1,0,0,0,0,0,0,0,0,0,...,1,1,1,1,0,1,1,1,0,3
2,0,1,1,1,1,1,1,1,2,3,...,0,0,0,0,0,0,0,0,0,2


In [16]:
# Reshape to long format: one row per (document, trigram) pair with a non-zero count
trigrams_long = (
    pd.melt(trigrams.reset_index(), id_vars=['index', 'channel_code'], value_name='trigram_ct')
    .query('trigram_ct > 0')
    .sort_values(['index', 'channel_code'])
)

In [17]:
trigrams_long

Unnamed: 0,index,channel_code,variable,trigram_ct
1116,0,1,asap open email,1
1701,0,1,better hell happened,1
2358,0,1,call asap open,1
4011,0,1,crapthey morons nevermind,1
4059,0,1,critical wdim 裤脚,1
...,...,...,...,...
21647,2,2,zdnet today web,1
21650,2,2,zdnet today windows,1
21653,2,2,zero net curve,1
21656,2,2,zipper taking lead,1


In [18]:
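# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
tfidf = pd.DataFrame(tfidf_mat.toarray(), index=df.index, columns=count_vec.get_feature_names_out())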
tfidf['channel_code'] = df.Channel_code

tfidf_long = pd.melt(tfidf.reset_index(), 
                     id_vars=['index','channel_code'], 
                     value_name='tfidf').query('tfidf > 0')

In [19]:
fulldf = (trigrams_long.merge(tfidf_long,  on=['index','channel_code','variable']).set_index('index'))

In [20]:
fulldf.shape

(7268, 4)

In [21]:
# Keep the 30 highest TF-IDF scores for each channel
fulldf.groupby('channel_code').apply(lambda x: x.nlargest(30, 'tfidf')).reset_index(drop=True) 

Unnamed: 0,channel_code,variable,trigram_ct,tfidf
0,1,asap open email,1,0.121268
1,1,better hell happened,1,0.121268
2,1,call asap open,1,0.121268
3,1,crapthey morons nevermind,1,0.121268
4,1,critical wdim 裤脚,1,0.121268
...,...,...,...,...
85,3,alright okay alright,1,0.039715
86,3,analysis chart look,1,0.039715
87,3,another look would,1,0.039715
88,3,another opportunity enter,1,0.039715
