# Bag of Words

## Data Exploration

[Consumer complaints data](https://catalog.data.gov/dataset/consumer-complaint-database)

In [77]:
from pprint import pprint
import pandas as pd 
# read in data, df short for dataframe
# change your path here 
path_to_data = "/Users/jorjacman/Downloads/Consumer_Complaints.csv"
df = pd.read_csv(path_to_data)

In [78]:
# size
len(df)

1083164

In [79]:
# check out the first few rows
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,03/12/2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382,,,Referral,03/17/2014,Closed with explanation,Yes,No,759217
1,10/01/2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit repor...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10/05/2016,Closed with explanation,Yes,No,2141773
2,10/17/2016,Consumer Loan,Vehicle loan,Managing the loan or lease,,I purchased a new car on XXXX XXXX. The car de...,,"CITIZENS FINANCIAL GROUP, INC.",PA,177XX,Older American,Consent provided,Web,10/20/2016,Closed with explanation,Yes,No,2163100
3,06/08/2014,Credit card,,Bankruptcy,,,,AMERICAN EXPRESS COMPANY,ID,83854,Older American,,Web,06/10/2014,Closed with explanation,Yes,Yes,885638
4,09/13/2014,Debt collection,Credit card,Communication tactics,Frequent or repeated calls,,,"CITIBANK, N.A.",VA,23233,,,Web,09/13/2014,Closed with explanation,Yes,Yes,1027760


In [80]:
# understand available data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083164 entries, 0 to 1083163
Data columns (total 18 columns):
Date received                   1083164 non-null object
Product                         1083164 non-null object
Sub-product                     847994 non-null object
Issue                           1083164 non-null object
Sub-issue                       578497 non-null object
Consumer complaint narrative    305316 non-null object
Company public response         347949 non-null object
Company                         1083164 non-null object
State                           1069333 non-null object
ZIP code                        1064705 non-null object
Tags                            148845 non-null object
Consumer consent provided?      540277 non-null object
Submitted via                   1083164 non-null object
Date sent to company            1083164 non-null object
Company response to consumer    1083159 non-null object
Timely response?                1083164 non-null obje

In [81]:
# column of interest sample 
df.sample()["Consumer complaint narrative"].values[0]

nan

In [82]:
# class distribution 
from collections import Counter
Counter(df["Product"]).most_common()

[('Mortgage', 260335),
 ('Debt collection', 208423),
 ('Credit reporting', 140433),
 ('Credit reporting, credit repair services, or other personal consumer reports',
  134500),
 ('Credit card', 89191),
 ('Bank account or service', 86206),
 ('Student loan', 44917),
 ('Consumer Loan', 31605),
 ('Credit card or prepaid card', 28609),
 ('Checking or savings account', 24011),
 ('Vehicle loan or lease', 7025),
 ('Money transfer, virtual currency, or money service', 6812),
 ('Payday loan', 5546),
 ('Money transfers', 5354),
 ('Payday loan, title loan, or personal loan', 5300),
 ('Prepaid card', 3819),
 ('Other financial service', 1060),
 ('Virtual currency', 18)]

## Processing

In [83]:
# drop na in narrative
df = df[pd.notnull(df["Consumer complaint narrative"])]
len(df)

305316

In [84]:
# merge similar categories
mapping = {
    "Credit card": "Credit card or prepaid card",
    "Credit reporting": "Credit reporting, credit repair services, or other personal consumer reports",
    "Money transfers": "Money transfer, virtual currency, or money service",
    "Payday loan": "Payday loan, title loan, or personal loan",
    "Virtual currency": "Money transfer, virtual currency, or money service",
    "Prepaid card": "Credit card or prepaid card"
}

df = df.replace({"Product": mapping})

In [85]:
# updated class distribution, sorted
Counter(df["Product"]).most_common()

[('Credit reporting, credit repair services, or other personal consumer reports',
  91411),
 ('Debt collection', 69906),
 ('Mortgage', 46369),
 ('Credit card or prepaid card', 33653),
 ('Student loan', 18004),
 ('Bank account or service', 14887),
 ('Consumer Loan', 9474),
 ('Checking or savings account', 8005),
 ('Money transfer, virtual currency, or money service', 5306),
 ('Payday loan, title loan, or personal loan', 4462),
 ('Vehicle loan or lease', 3546),
 ('Other financial service', 293)]

In [86]:
# drop the vague category w/ low sample size
df = df[df["Product"] != 'Other financial service']

In [88]:
# focus on the narrative
df = df[["Product", "Consumer complaint narrative"]]
df.head(1)

Unnamed: 0,Product,Consumer complaint narrative
1,"Credit reporting, credit repair services, or o...",I have outdated information on my credit repor...


TO DO: Try using more of the features, e.g. the date. Categorical features will have to be encoded first. See: sklearn.preprocessing.LabelEncoder
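
A minimal sketch of the encoding step mentioned in the TO DO, assuming a categorical column such as `Submitted via`. The toy values are illustrative only. scikit-learn documents `LabelEncoder` as intended for target labels, so for input features `OrdinalEncoder` or `OneHotEncoder` may be a better fit.

```python
from sklearn.preprocessing import LabelEncoder

# toy values standing in for a categorical column such as "Submitted via"
submitted_via = ["Web", "Referral", "Web", "Phone"]

encoder = LabelEncoder()
codes = encoder.fit_transform(submitted_via)

# classes_ holds the sorted unique categories; each code indexes into it
print(encoder.classes_.tolist())   # ['Phone', 'Referral', 'Web']
print(codes.tolist())              # [2, 1, 2, 0]

# the mapping can be inverted to recover the original strings
print(encoder.inverse_transform(codes).tolist())
```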

## Feature Engineering

In [89]:
from sklearn.feature_extraction.text import CountVectorizer

# bag of words
cv = CountVectorizer(stop_words="english", 
                     max_features=500)

TO DO: experiment with different vectorizer parameters, e.g. ngram_range.

TO DO: additional text processing, e.g. contraction expansion, stemming/lemmatization

TO DO: try TF-IDF (term frequency-inverse document frequency) weighting instead of raw counts
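
A rough sketch of two of the TO DO items, on a made-up two-document corpus rather than the complaints data: `ngram_range=(1, 2)` adds bigrams to the vocabulary, and `TfidfVectorizer` down-weights terms that occur in many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus, made up for illustration
corpus = ["credit report error", "credit card fee"]

# unigrams and bigrams
ngram_cv = CountVectorizer(ngram_range=(1, 2))
counts = ngram_cv.fit_transform(corpus)
print(sorted(ngram_cv.vocabulary_))
print(counts.shape)    # (2, 9): 5 unigrams + 4 bigrams

# TF-IDF: "credit" appears in both documents, so it gets a lower weight
tfidf = TfidfVectorizer(ngram_range=(1, 2))
weights = tfidf.fit_transform(corpus)
credit_weight = weights[0, tfidf.vocabulary_["credit"]]
report_weight = weights[0, tfidf.vocabulary_["report"]]
print(credit_weight < report_weight)
```

Either vectorizer could be swapped in for `cv` in the next cell; `max_features` and `stop_words` are accepted by both.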

In [90]:
bag_of_words = cv.fit_transform(df["Consumer complaint narrative"]).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bag_of_words_df = pd.DataFrame(bag_of_words, 
                               columns=cv.get_feature_names_out()).set_index(df.index)

In [91]:
# merge with original data
df_with_bow = pd.concat([df, bag_of_words_df], axis=1).drop("Consumer complaint narrative", axis=1)
df_with_bow.head()

Unnamed: 0,Product,00,10,100,15,2015,2016,2017,30,able,...,wife,work,working,writing,written,wrong,xx,xxxx,year,years
1,"Credit reporting, credit repair services, or o...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Consumer Loan,0,3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,25,1,0
7,"Credit reporting, credit repair services, or o...",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
12,Debt collection,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16,Debt collection,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,41,1,0


In [93]:
# bag of words sample
import numpy as np
n = np.random.choice(50000)
print(df_with_bow.iloc[n]["Product"])
list(df_with_bow.columns[1:][df_with_bow.iloc[n].values[1:] > 1].values)

Student loan


['help', 'loan', 'pay', 'school', 'xxxx']

## Model 

### Building

In [94]:
from sklearn.ensemble import RandomForestClassifier

In [95]:
model = RandomForestClassifier(n_jobs=-1)

TO DO: try more classifiers, e.g. logistic regression, SVC. Be sure to standardize the data if necessary. 

TO DO: test different hyperparameters; the search can be automated. See: grid/random search
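
One way the two TO DO items might be combined, sketched on synthetic data from `make_classification` (a stand-in, not the complaints features): a pipeline that standardizes and then fits logistic regression, with `GridSearchCV` trying several values of the regularization strength `C`. The grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the bag-of-words features
X_toy, y_toy = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, n_classes=3,
                                   random_state=0)

# standardize the features, then fit logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# candidate values for C; make_pipeline names the step after the class, lowercased
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_weighted")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` has a similar interface and samples from the grid instead of trying every combination.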

### Training and Evaluation

In [98]:
# split features and target
X = df_with_bow.drop("Product", axis=1)
y = df_with_bow["Product"]

#### Overall Accuracy

In [100]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# overall accuracy
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.77836498111624008

#### f1 score

In [101]:
from sklearn.metrics import f1_score

In [102]:
predictions = model.predict(X_test)
f1_score(y_test, predictions, average="weighted")

0.76709059058341944
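
To make the `average="weighted"` option concrete, here is a hand-rolled sketch of the same idea: compute F1 per class, then average using each class's share of the true labels as its weight. The helper names and toy labels are made up for illustration.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    # true positives, false positives, false negatives for one class
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    # average the per-class F1 scores, weighted by each class's support
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, label) * count / total
               for label, count in support.items())

# toy labels, made up for illustration
y_true_toy = ["Mortgage", "Mortgage", "Mortgage", "Student loan"]
y_pred_toy = ["Mortgage", "Mortgage", "Student loan", "Student loan"]

print(round(per_class_f1(y_true_toy, y_pred_toy, "Mortgage"), 4))      # 0.8
print(round(per_class_f1(y_true_toy, y_pred_toy, "Student loan"), 4))  # 0.6667
print(round(weighted_f1(y_true_toy, y_pred_toy), 4))                   # 0.7667
```

With imbalanced classes like the ones in this dataset, the weighted average is dominated by the large classes; `average="macro"` would treat every class equally.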

#### K fold

In [103]:
from sklearn.model_selection import KFold
import numpy as np

splits = 5
kf = KFold(n_splits=splits, shuffle=True)
kf.get_n_splits(X)

# the per-fold metric is the weighted F1 score, not accuracy
scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = f1_score(y_test, predictions, average="weighted")
    scores.append(score)
    print("Score {}".format(score))
    
print("Mean weighted F1 of {} folds is {}".format(splits, round(np.mean(scores), 2)))

Score 0.7619959228554347
Score 0.7654075442473035
Score 0.7663424843254508
Score 0.765811381276321
Score 0.7663104437977075
Mean weighted F1 of 5 folds is 0.77
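
The manual loop can probably be condensed with `cross_val_score`. The sketch below runs on synthetic data (not the complaints features) and swaps in `StratifiedKFold`, which keeps class proportions similar across folds; that substitution is a suggestion, not what the cell above does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for X and y
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, n_classes=3,
                                   random_state=0)

# stratified folds preserve the class balance in each split
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X_toy, y_toy, cv=folds, scoring="f1_weighted")

print(cv_scores.round(3))
print(round(cv_scores.mean(), 2))
```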


## Making Predictions

In [104]:
# use all data
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [106]:
# example data point 
complaint = "mortgage is too much"
x_predict = cv.transform([complaint])
model.predict(x_predict)[0]

'Mortgage'

# From Bag of Words to Word Embeddings

## Word Embeddings

In [107]:
import embeddings
embedding_dimension = 300
word_embeddings = embeddings.glove.GloveEmbedding("common_crawl_840", d_emb=embedding_dimension)

TO DO: try different word embeddings. See: word2vec and fastText

In [108]:
word_embeddings.emb("Orlando")

[0.5393099784851074,
 0.47308000922203064,
 0.46004000306129456,
 -0.0021617000456899405,
 0.9738600254058838,
 -0.5797399878501892,
 0.26308000087738037,
 0.35378000140190125,
 0.4237399995326996,
 2.2569000720977783,
 -0.4085099995136261,
 0.010920999571681023,
 0.0022525000385940075,
 0.07054899632930756,
 -0.44312000274658203,
 -0.13399000465869904,
 -0.31718000769615173,
 0.34415000677108765,
 -0.056063998490571976,
 -0.087677001953125,
 0.08006100356578827,
 0.15408000349998474,
 -0.1426600068807602,
 -0.2718699872493744,
 0.003391300095245242,
 -0.6018999814987183,
 -0.6907899975776672,
 -0.6645699739456177,
 0.6168000102043152,
 -0.25374001264572144,
 -0.2653200030326843,
 0.003116799984127283,
 0.1532599925994873,
 0.07588300108909607,
 0.05749199911952019,
 -0.21755999326705933,
 0.4061500132083893,
 -0.17818999290466309,
 -0.2797499895095825,
 -0.09681499749422073,
 -0.25843000411987305,
 -0.33754000067710876,
 -0.4851300120353699,
 -0.3213599920272827,
 -0.09493499994277954,
 ...]

## Similarity

In [109]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def word_similarity(word_1, word_2): 
    
    vec_1 = np.array(word_embeddings.emb(word_1)).reshape(1, -1)
    vec_2 = np.array(word_embeddings.emb(word_2)).reshape(1, -1)
    
    return cosine_similarity(vec_1, vec_2)[0][0]

In [110]:
word_similarity("dog", "cat")

0.80168550555947127

In [111]:
word_similarity("dog", "puppy")

0.85852143512896784

In [112]:
word_similarity("dog", "hotdog")

0.3361896910630251

In [113]:
word_similarity("cool", "awesome")

0.76157662595540798

In [115]:
word_similarity("cool", "cold")

0.56441859204981282
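
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. A small standard-library sketch on toy 2-d vectors (the function name `cosine` is just for this example):

```python
import math

def cosine(u, v):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))   # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine([1.0, 2.0], [2.0, 4.0]))   # close to 1: scaling a vector does not change the angle
```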

## Data Prep

In [135]:
lucas_phone_text_messages = {
    "positive": [
        "i love you",
        "you are great",
        "wonderful",
        "terrific", 
        "ok awesome see you",
        "super fun"
    ],
    "negative": [
        "hate you",
        "you are mean", 
        "awful person",
        "asshole", 
        "you're a piece of shit",
        "not cool"
    ],
    "neutral": [
        "pizza?",
        "what's up",
        "ok see you then", 
        "donuts in the office",
        "on my way",
        "where are you"
    ]
}

In [136]:
embedding_vector_length = 300
def word_to_embedding(word):
        
    # out-of-vocabulary words fall back to an all-zero vector
    return word_embeddings.emb(word) if word in word_embeddings else np.zeros(embedding_vector_length)

In [137]:
max_allowable_length = 15


from nltk import word_tokenize
def words_to_matrix(text):
    tokens = word_tokenize(text)
    matrix = np.zeros((max_allowable_length, embedding_vector_length))
    for i in range(min(max_allowable_length, len(tokens))):
        matrix[i] = word_to_embedding(tokens[i])
    
    return matrix

TO DO: try different tokenization
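
One alternative to try, sketched with only the standard library: a regex tokenizer that lowercases the text and keeps contractions as single tokens. `simple_tokenize` is a hypothetical helper, not part of NLTK. Lowercasing is a trade-off here, since the `common_crawl_840` GloVe vocabulary is case-sensitive.

```python
import re

def simple_tokenize(text):
    # lowercase, then keep runs of letters/digits, allowing one internal apostrophe
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text.lower())

print(simple_tokenize("What's up"))             # ["what's", 'up']
print(simple_tokenize("pizza?"))                # ['pizza']
print(simple_tokenize("donuts in the office"))  # ['donuts', 'in', 'the', 'office']
```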

In [138]:
words_to_matrix("hello there")

array([[ 0.25233001,  0.10176   , -0.67484999, ...,  0.17869   ,
        -0.51916999,  0.33590999],
       [ 0.16718   ,  0.30592999, -0.13682   , ..., -0.035268  ,
         0.12809999,  0.023683  ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [139]:
def prep_training_data(training_data):
    labels = training_data.keys()
    label_idx = dict(zip(labels, range(len(labels))))

    # tokenize the words and one-hot encode the labels
    training_examples = []
    indices = []
    for label in labels:
        for example in training_data[label]:
            label_encoding = [0] * len(labels)
            label_encoding[label_idx[label]] = 1
            indices.append(label_encoding)
            training_examples.append(word_tokenize(example))

    # store embedded vectors
    training_data_embedded = np.zeros(shape=(len(training_examples), max_allowable_length, embedding_vector_length))
    for i in range(len(training_examples)):
        for j in range(min(max_allowable_length, len(training_examples[i]))):
            training_data_embedded[i, j] = word_to_embedding(training_examples[i][j])
    # np.int was removed in NumPy 1.24; the built-in int is equivalent here
    indices = np.array(indices, dtype=int)

    return labels, training_data_embedded, indices

In [140]:
prep_training_data(lucas_phone_text_messages)

(dict_keys(['positive', 'negative', 'neutral']),
 array([[[ 0.18732999,  0.40595001, -0.51174003, ...,  0.16495   ,
           0.18757001,  0.53873998],
         [ 0.13948999,  0.53452998, -0.25246999, ..., -0.015228  ,
           0.088408  ,  0.30217001],
         [-0.11076   ,  0.30785999, -0.51980001, ..., -0.059105  ,
           0.47604001,  0.05661   ],
         ..., 
         [ 0.        ,  0.        ,  0.        , ...,  0.        ,
           0.        ,  0.        ],
         [ 0.        ,  0.        ,  0.        , ...,  0.        ,
           0.        ,  0.        ],
         [ 0.        ,  0.        ,  0.        , ...,  0.        ,
           0.        ,  0.        ]],
 
        [[-0.11076   ,  0.30785999, -0.51980001, ..., -0.059105  ,
           0.47604001,  0.05661   ],
         [-0.19859   , -0.062818  , -0.36614001, ..., -0.58451003,
           0.27879   , -0.26205   ],
         [-0.093846  ,  0.58296001, -0.019271  , ..., -0.098804  ,
           0.027538  ,  0.30041999

## Model

In [141]:
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

[CNN for text classification resource](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/)

In [142]:
labels, training_data_embedded, indices = prep_training_data(lucas_phone_text_messages)

# build model
model = Sequential()
num_filters = 100
n_gram = 1
# conv layer
model.add(Conv1D(filters=num_filters,
                        kernel_size=n_gram,
                        padding="valid",
                        activation="relu",
                        input_shape=(max_allowable_length, embedding_vector_length)))

model.add(MaxPooling1D(pool_size=max_allowable_length - n_gram + 1)) # down sampling 
model.add(Flatten()) # correct dimension 
model.add(Dense(len(labels), activation="softmax")) # fully connected
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# train
num_epochs = 5
model.fit(training_data_embedded, indices, epochs=num_epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x124f34cc0>

TO DO: try different numbers of filters, kernel sizes, etc.
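
When varying the kernel size, it helps to see why the pooling size is `max_allowable_length - n_gram + 1`. The NumPy sketch below imitates the conv and pooling layers with random, untrained weights and no bias term; it is only meant to show the shapes, not to reproduce Keras internals.

```python
import numpy as np

rng = np.random.default_rng(0)

max_len, emb_dim = 15, 300       # same shapes as the notebook
n_gram, num_filters = 2, 100     # kernel size and filter count to vary

sentence = rng.normal(size=(max_len, emb_dim))              # stand-in for an embedded text
kernels = rng.normal(size=(n_gram, emb_dim, num_filters))   # random, untrained weights

# "valid" padding: the window only slides where it fits entirely
out_len = max_len - n_gram + 1
conv = np.zeros((out_len, num_filters))
for start in range(out_len):
    window = sentence[start:start + n_gram]                 # shape (n_gram, emb_dim)
    conv[start] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
conv = np.maximum(conv, 0)                                  # ReLU

# pooling over all out_len positions leaves one value per filter
pooled = conv.max(axis=0)
print(conv.shape, pooled.shape)   # (14, 100) (100,)
```

A larger `n_gram` shortens the conv output, so the pool size has to shrink with it to keep a single value per filter.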

## Predict

In [143]:
text = "amazing"
# prep text
matrix = np.array([words_to_matrix(text)])

# make predictions
predictions = model.predict(matrix)

# format score
predictions_and_scores = {}
for idx, label in zip(range(len(labels)), labels):
    predictions_and_scores[label] = predictions[0][idx]

predictions_and_scores

{'negative': 0.21429136, 'neutral': 0.20057806, 'positive': 0.58513051}
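
To turn the score dictionary into a single predicted label, taking the key with the highest score is probably enough. The values below are copied from the output above.

```python
# scores copied from the prediction output
predictions_and_scores = {"negative": 0.21429136,
                          "neutral": 0.20057806,
                          "positive": 0.58513051}

# label with the highest score
best_label = max(predictions_and_scores, key=predictions_and_scores.get)
print(best_label)   # positive

# softmax outputs should sum to roughly 1
print(round(sum(predictions_and_scores.values()), 3))   # 1.0
```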

## All Together

In [144]:
# the class below uses Conv1D, so import it under that name
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize
import numpy as np

import embeddings

class CNNEmbeddedVecClassifier:
    def __init__(self,
                 n_gram=1,
                 embedding_vector_length=300,
                 num_filters=100,
                 max_allowable_length=15):
        self.n_gram = n_gram
        self.embedding_vector_length = embedding_vector_length
        self.num_filters = num_filters
        self.max_allowable_length = max_allowable_length

        self.labels = None
        self.model = None

    def word_to_embedding(self, word):
        

        return word_embeddings.emb(word) if word in word_embeddings else np.zeros(self.embedding_vector_length)

    def words_to_matrix(self, text):
        tokens = word_tokenize(text)
        matrix = np.zeros((self.max_allowable_length, self.embedding_vector_length))
        for i in range(min(self.max_allowable_length, len(tokens))):
            matrix[i] = self.word_to_embedding(tokens[i])
        
        return matrix

    def prep_training_data(self, training_data):
        labels = training_data.keys()
        label_idx = dict(zip(labels, range(len(labels))))

        # tokenize the words and encode the labels
        training_examples = []
        indices = []
        for label in labels:
            for example in training_data[label]:
                label_encoding = [0] * len(labels)
                label_encoding[label_idx[label]] = 1
                indices.append(label_encoding)
                training_examples.append(word_tokenize(example))
        # np.int was removed in NumPy 1.24; the built-in int is equivalent here
        indices = np.array(indices, dtype=int)

        # create embedded vectors
        training_data_embedded = np.zeros(shape=(len(training_examples), self.max_allowable_length, self.embedding_vector_length))
        for i in range(len(training_examples)):
            for j in range(min(self.max_allowable_length, len(training_examples[i]))):
                training_data_embedded[i, j] = self.word_to_embedding(training_examples[i][j])

        return labels, training_data_embedded, indices

    def train(self, training_data):
        
        # convert training data
        self.labels, training_data_embedded, indices = self.prep_training_data(training_data)

        # build model
        model = Sequential()
        model.add(Conv1D(filters=self.num_filters,
                                kernel_size=self.n_gram,
                                padding='valid',
                                activation='relu',
                                input_shape=(self.max_allowable_length, self.embedding_vector_length)))
        model.add(MaxPooling1D(pool_size=self.max_allowable_length - self.n_gram + 1))
        model.add(Flatten())
        model.add(Dense(len(self.labels), activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=["accuracy"])

        # train
        model.fit(training_data_embedded, indices, epochs=5)

        self.model = model

    def predict(self, text):

        # prep text
        matrix = np.array([self.words_to_matrix(text)])

        # make predictions
        predictions = self.model.predict(matrix)

        # format score
        predictions_and_scores = {}
        for idx, label in zip(range(len(self.labels)), self.labels):
            predictions_and_scores[label] = predictions[0][idx]

        return predictions_and_scores

In [145]:
pprint(lucas_phone_text_messages)

{'negative': ['hate you',
              'you are mean',
              'awful person',
              'asshole',
              "you're a piece of shit",
              'not cool'],
 'neutral': ['pizza?',
             "what's up",
             'ok see you then',
             'donuts in the office',
             'on my way',
             'where are you'],
 'positive': ['i love you',
              'you are great',
              'wonderful',
              'terrific',
              'ok awesome see you',
              'super fun']}


In [146]:
model = CNNEmbeddedVecClassifier(n_gram=2)

In [147]:
model.train(training_data=lucas_phone_text_messages)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [150]:
model.predict("sweet")

{'negative': 0.19678415, 'neutral': 0.2164792, 'positive': 0.58673668}