## overview

**Goal**: predict user-defined spending category based on text summaries of search results of the business name provided by credit card

As this is only valuable for the user if correct, **accuracy** is the best metric of success.

## conclusion

With 9 categories, and only using the preview text of the google search, we are able to achieve 
**74% test accuracy** using **TF-IDF embedding and `SGDClassifier`**, saved as `tfidf-SGD-hinge.pkl`. Since most transactions will be at places already seen by the user, this level of accuracy is already helpful to the user in categorizing their transactions.

After gathering more than ~100 expenses, or by more elaborately web scraping the full results instead of just the preview text, this could likely be improved upon.



# EDA

In [111]:
import pandas as pd
import numpy as np

df = pd.read_csv('training_df.csv')[['category','search_string','text','search']]

print('shape: ',df.shape,'\n')
print(df.dtypes)

shape:  (598, 3) 

category         object
search_string    object
text             object
dtype: object


In [6]:
df.sample(1)

Unnamed: 0,category,search_string,text
258,dining,at BOBCAT BONNIES DETROIT MI,Zacharie Stephen Bobcat Bonnie's. 1800 Michiga...


In [2]:
print('number of nulls by column:')
df.isnull().sum()

number of nulls by column:


category         0
search_string    0
text             0
dtype: int64

Columns:
* **search_string** text provided by credit card on where transaction occurred
* **text** - website summaries, from one of top google searches of search_string*
* **category** labeled according to one of the following spending categories (notice class imbalance):

In [3]:
val_counts = df['category'].value_counts()
median_count = int(val_counts.median())
val_counts

dining            214
prof dev           82
groceries          55
misc               46
house              45
transportation     44
fun                41
recurring          41
pets               30
Name: category, dtype: int64

Class imbalance will need to be accounted for.

# exploratory model selection

Because of the small number of features, TF-IDF may struggle to have enough matches to differentiate classes. Probably using a pretrained word embedding will yield best results. 

In [112]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import roc_auc_score,accuracy_score

X = df['text'].values
y = df['category'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=222)

def assess(clf):
    # function used to assess final model
    preds = clf.predict(X_test)
    print('test accuracy:',accuracy_score(y_test,preds))    

## basline: undersample majority class > tf-idf > naive bayes

In [236]:
from sklearn.naive_bayes import MultinomialNB

dining_small = df[df['category']=='dining'].sample(median_count,random_state=22)

resampled = pd.concat([dining_small,df[df['category']!='dining']],ignore_index=True)

#bal prefix for balanced
balX = resampled['text'].values
baly = resampled['category'].values

balX_train, balX_test, baly_train, baly_test = train_test_split(balX, baly, test_size=0.3, random_state=222)

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB(alpha=0.01))
                    ])

text_clf.fit(balX_train, baly_train)

preds = text_clf.predict(balX_test)
probs = text_clf.predict_proba(balX_test)

print('                 accuracy:',accuracy_score(baly_test,preds))
print('AUC averaged, one vs rest:',roc_auc_score(baly_test,probs,multi_class='ovr'))

                 accuracy: 0.6744186046511628
AUC averaged, one vs rest: 0.9088571095392288


## tfidf > Linear SVM classifier

In [243]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 100, 200, 300),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': ([200,400,1000]),
    'clf__penalty': ('l2', 'elasticnet')
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1,cv=8)
grid_search.fit(X_train, y_train)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
print()
assess(grid_search)

Fitting 8 folds for each of 144 candidates, totalling 1152 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 328 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 828 tasks      | elapsed:    8.4s


Best score: 0.713
Best parameters set:
	clf__max_iter: 200
	clf__penalty: 'l2'
	vect__max_df: 0.5
	vect__max_features: None
	vect__ngram_range: (1, 1)

test accuracy: 0.7444444444444445


[Parallel(n_jobs=-1)]: Done 1152 out of 1152 | elapsed:   11.3s finished


## tfidf > SGDClassifier (logistic, SVC, Perceptron)

In [248]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

parameters = {
    'vect__max_df': (0.4, 0.5, 0.75),
    'vect__max_features': (None, 500),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__norm': ('l1', 'l2'),
    'clf__loss': ('hinge', 'log', 'squared_hinge', 'perceptron'),
    'clf__max_iter': ([50,100,300]),
    'clf__alpha': (0.01,0.005, 0.001),
    'clf__penalty': ['elasticnet'],
    'clf__l1_ratio': (0.01,0.05,0.1,0.2)
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1,cv=5)
grid_search.fit(X_train, y_train)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
print('\n')

assess(grid_search)

Fitting 5 folds for each of 1728 candidates, totalling 8640 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 177 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 427 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done 777 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done 1227 tasks      | elapsed:   19.8s
[Parallel(n_jobs=-1)]: Done 1777 tasks      | elapsed:   27.1s
[Parallel(n_jobs=-1)]: Done 2604 tasks      | elapsed:   36.1s
[Parallel(n_jobs=-1)]: Done 5604 tasks      | elapsed:   59.2s


Best score: 0.749
Best parameters set:
	clf__alpha: 0.001
	clf__l1_ratio: 0.2
	clf__loss: 'hinge'
	clf__max_iter: 300
	clf__penalty: 'elasticnet'
	vect__max_df: 0.5
	vect__max_features: None
	vect__ngram_range: (1, 2)


test accuracy: 0.7444444444444445


[Parallel(n_jobs=-1)]: Done 8640 out of 8640 | elapsed:  1.4min finished


Since 'hinge' was the best hyperparameter choice for the loss, and since 'hinge' is just a stochastic version of LinearSVC, this is very similar to the previous model.

In [249]:
# save best performing model
import joblib
joblib.dump(grid_search.best_estimator_,'tfidf-SGD-hinge.pkl')

['tfidf-SGD-hinge.pkl']

## tfidf >> XGBoostClassifier

In [226]:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV


pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', XGBClassifier())
     ])

parameters = {
    'vect__max_df': (0.5, 0.75),
    'vect__max_features': (None, 500),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__norm': ('l1', 'l2'),
    'clf__max_depth': ([2,4]),
    'clf__n_estimators': (30,50),
    'clf__learning_rate': (0.001,0.01,0.1),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1,cv=5)
grid_search.fit(X_train, y_train)

#print("Best score: %0.3f" % grid_search.best_score_)
#print("Best parameters set:")
#best_parameters = grid_search.best_estimator_.get_params()
#for param_name in sorted(parameters.keys()):
#    print("\t%s: %r" % (param_name, best_parameters[param_name]))   
#print('\n')   

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   34.1s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  1.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [227]:
preds = grid_search.predict(X_test)
print('test accuracy:',accuracy_score(y_test,preds))    

test accuracy: 0.5166666666666667


In [228]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))   
print('\n')   

Best score: 0.581
Best parameters set:
	clf__learning_rate: 0.1
	clf__max_depth: 4
	clf__n_estimators: 30
	vect__max_df: 0.5
	vect__max_features: 500
	vect__ngram_range: (1, 2)




## word2vec pretrained > SGDClassifier 

Without enough data to train a doc2vec model, one approach is to average the word2vec vectors on a pretrained word2vec model to get the embedding. Members of the Google team behind word2vec mentions this as an option for tasks where the word order is less important [(Le and Mikolov, 2014)](https://cs.stanford.edu/~quocle/paragraph_vector.pdf). The ability for pretrained model to know similarity between, e.g., 'taco' and 'sandwich' should help the model classify these terms within similar spending categories.

In [114]:
import gensim
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

model = gensim.models.KeyedVectors.load_word2vec_format(
    '/media/hdd_1tb/data/GoogleNews-vectors-negative300.bin', binary=True
)


In [115]:
def avgWords2Vec(sent):
    # averages word2vec vectors in sentence 
    sent = gensim.parsing.remove_stopwords(sent)
    lemmas = gensim.utils.simple_preprocess(sent)
    vecs = []
    for lem in lemmas:
        try:
            vecs.append(model.get_vector(lem))
        except KeyError:
            # word not in pretrained vocab
            None
    return np.mean(vecs,axis=0)

embedded_train = np.array([avgWords2Vec(sent) for sent in X_train])
embedded_test = np.array([avgWords2Vec(sent) for sent in X_test])

### visualizing word2vec embedding with PCA and t-SNE

In [197]:
import plotly.express as psx

vis1 = PCA(n_components=3).fit_transform(embedded_train)

fig = psx.scatter_3d(x=vis1[:,0],y=vis1[:,1],z=vis1[:,2],color=y_train)
fig.update_layout(title='3D visualization of PCA of 300-dimensional word2vec embedding')
fig.show()

In [198]:
vis2 = TSNE(n_components=3,perplexity=30,n_jobs=-1,verbose=1,n_iter=3000).fit_transform(
        embedded_train
)

import plotly.express as psx

fig = psx.scatter_3d(x=vis2[:,0],y=vis2[:,1],z=vis2[:,2],color=y_train)
fig.update_layout(title='3D visualization using t-SNE of 300-dimensional word2vec embedding')

fig.show()

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 418 samples in 0.008s...
[t-SNE] Computed neighbors for 418 samples in 0.107s...
[t-SNE] Computed conditional probabilities for sample 418 / 418
[t-SNE] Mean sigma: 0.296016
[t-SNE] KL divergence after 250 iterations with early exaggeration: 145.531570
[t-SNE] KL divergence after 3000 iterations: 1.966193


Even after several modifications of the perplexity, no clear clusters emerging from the t-SNE visualization. This could be a non-issue: the full 300 dimensional data may differentiate the categories which dissappears in the 3D represntation. However if tSNE cannot find any correct clusterss this could bode poorly for the potential of the word2vec embedding in this model.

### word2vec > SGD

In [200]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

sgd = SGDClassifier()

parameters = {
    'loss': ('hinge', 'log', 'squared_hinge', 'perceptron'),
    'alpha': (0.01,0.005, 0.001),
    'penalty': ['elasticnet'],
    'l1_ratio': (0.01,0.05,0.1,0.2),
    'max_iter': (1000,2000),
}

grid_search = GridSearchCV(sgd, parameters, n_jobs=-1,verbose=1,cv=5)
grid_search.fit(embedded_train, y_train)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
print('\n')

preds = grid_search.predict(embedded_test)
print('test accuracy:',accuracy_score(y_test,preds))    

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  32 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 223 tasks      | elapsed:   38.5s


Best score: 0.610
Best parameters set:
	alpha: 0.001
	l1_ratio: 0.01
	loss: 'hinge'
	max_iter: 1000
	penalty: 'elasticnet'


test accuracy: 0.5555555555555556


[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:   48.5s finished


The averaged, pretrained word2vec embedding is not proving helpful for feature extraction. One possible reason: including too many search results waters down the relevance of the data. Can test this hypothesis by running the same experiment but limiting to the top 2 search results rather than the top 5. 

## limiting search terms

Possibly using the top 5 search results leads to watered down data, which may make it harder for models to differentiate. To investigate this hypothesis, we reload the dataset only including the top 3 search results rather than the top 5.

In [232]:
Ldf = pd.read_csv('training_df.csv')
Ldf = Ldf[Ldf['search_ranking']<3]
Ldf.shape # Limited data frame

(369, 4)

In [233]:
LX = df['text'].values
Ly = df['category'].values

LX_train, LX_test, Ly_train, Ly_test = train_test_split(LX, Ly, test_size=0.3, random_state=222)

Lembedded_train = np.array([avgWords2Vec(sent) for sent in LX_train])
Lembedded_test = np.array([avgWords2Vec(sent) for sent in LX_test])

In [234]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

sgd = SGDClassifier()

parameters = {
    'loss': ('hinge', 'log', 'squared_hinge', 'perceptron'),
    'alpha': (0.01,0.005, 0.001),
    'penalty': ['elasticnet'],
    'l1_ratio': (0.01,0.05,0.1,0.2),
    'max_iter': (1000,2000),
}

grid_search = GridSearchCV(sgd, parameters, n_jobs=-1,verbose=1,cv=5)
grid_search.fit(Lembedded_train, y_train)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
print('\n')

preds = grid_search.predict(Lembedded_test)
print('test accuracy:',accuracy_score(Ly_test,preds))    

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  32 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 224 tasks      | elapsed:   53.1s


Best score: 0.610
Best parameters set:
	alpha: 0.001
	l1_ratio: 0.1
	loss: 'hinge'
	max_iter: 2000
	penalty: 'elasticnet'


test accuracy: 0.6111111111111112


[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  1.4min finished
