## Multi-Label Auto-Tagger (Banking)

This script is used to create the default banking classifier. 

### Automatic Comment Tagging

The goal is to build an automated tagging system that assigns each comment to one or more of a small set of predetermined categories.
One comment can have multiple tags, hence the multi-label approach.


Example: {'Fantastic meals....quite good service':['food','service']}


In [18]:
### Imports
import pickle
import re

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils import shuffle
from xgboost.sklearn import XGBClassifier

# English stop words from NLTK (the 'stopwords' corpus must already be downloaded).
# Stored as a list, which both older and newer scikit-learn versions accept for stop_words.
stop_words = list(stopwords.words('english'))

## Train

#### a.) Load the Data

In [2]:
csvdata = pd.read_csv('datasets/banking.csv')
csvdata.head()

Unnamed: 0,commid,bank,rating,comment,theme
0,254722829561,I&M Bank,,Efficient service,service
1,254722829561,I&M Bank,,Efficient service,speed/efficiency
2,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,account
3,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,facilities
4,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,service


In [3]:
# Clean up the comments by removing everything that is not a letter, a digit, whitespace, or basic punctuation (, ! .)

z = lambda x: str(re.sub(r"[^a-zA-Z0-9\s,!.]", "", x))  # Strip every character outside the allowed set
csvdata['clean_comments'] = [z(i) for i in csvdata['comment']]  # Create a new column with the cleaned text
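
The cleaning step above can be tried in isolation. The following is a minimal sketch that uses the same character class; the helper name `clean_comment` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re

# Same character class as the notebook's lambda: keep letters, digits,
# whitespace, commas, exclamation marks and periods; remove everything else.
KEEP_PATTERN = re.compile(r"[^a-zA-Z0-9\s,!.]")


def clean_comment(text):
    """Remove disallowed characters (hypothetical helper, not in the notebook)."""
    return KEEP_PATTERN.sub("", str(text))


sample = "(1) Fees are too high & the app is slow!"  # made-up sample comment
cleaned = clean_comment(sample)
print(repr(cleaned))
```

Characters are deleted rather than replaced, so removing a symbol that sits between two spaces (such as `&` here) leaves a double space behind.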


In [4]:
csvdata.head()

Unnamed: 0,commid,bank,rating,comment,theme,clean_comments
0,254722829561,I&M Bank,,Efficient service,service,Efficient service
1,254722829561,I&M Bank,,Efficient service,speed/efficiency,Efficient service
2,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,account,1 not all your customer care staff know the SO...
3,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,facilities,1 not all your customer care staff know the SO...
4,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,service,1 not all your customer care staff know the SO...


In [5]:
#Add an extra column for themes as a category
#Label encoding to represent each of the theme classes as numbers
theme_categories = csvdata['theme'].astype('category') #1. We first convert the column into a category
csvdata['theme_categories']  = theme_categories.cat.codes #2. assign the encoded variable to a new column using the cat.codes
target_names = list(theme_categories.cat.categories)
csvdata.head()

Unnamed: 0,commid,bank,rating,comment,theme,clean_comments,theme_categories
0,254722829561,I&M Bank,,Efficient service,service,Efficient service,10
1,254722829561,I&M Bank,,Efficient service,speed/efficiency,Efficient service,11
2,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,account,1 not all your customer care staff know the SO...,1
3,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,facilities,1 not all your customer care staff know the SO...,3
4,254722727743,CBA,3.0,(1) not all your customer care staff know the ...,service,1 not all your customer care staff know the SO...,10
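
To see how the label encoding assigns its numbers, here is a small sketch on a made-up series of themes (the values are chosen for illustration only). For string data, pandas sorts the inferred categories, so the codes follow alphabetical order and `cat.categories` can be used to map a code back to its name.

```python
import pandas as pd

# Made-up sample of theme labels, for illustration only.
themes = pd.Series(["service", "speed/efficiency", "account", "facilities", "service"])

as_category = themes.astype("category")
names = list(as_category.cat.categories)  # sorted category names
codes = as_category.cat.codes.tolist()    # integer code for each row

print(names)
print(codes)

# Map the integer codes back to the original theme names.
decoded = [names[code] for code in codes]
print(decoded)
```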


In [6]:
# df = pd.Series(csvdata)
# df.describe()

csvdata.describe()
csvdata['theme'].value_counts()


speed/efficiency             4926
service                      4648
accessibility                2611
staff                        1516
atm                          1374
account                      1274
mobile_banking               1176
system                        768
rates/charges                 624
facilities                    606
security                      531
online_banking                434
loan                          294
information/communication     283
Name: theme, dtype: int64

In [7]:
target_names, len(target_names)

(['accessibility',
  'account',
  'atm',
  'facilities',
  'information/communication',
  'loan',
  'mobile_banking',
  'online_banking',
  'rates/charges',
  'security',
  'service',
  'speed/efficiency',
  'staff',
  'system'],
 14)

In [8]:
processed_data = {}

# Group the theme codes by comment so that each comment maps to the list of all its labels
for _, row in csvdata.iterrows():
    processed_data.setdefault(row['comment'], []).append(row['theme_categories'])
        


In [9]:
my_data = {}
my_data['data'] = processed_data.keys()
my_data['target'] = processed_data.values()

X = my_data['data']
y = MultiLabelBinarizer().fit_transform(processed_data.values())
y.shape


(11628, 14)
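
The shape above is one row per unique comment and one column per theme. As a rough illustration of what `MultiLabelBinarizer` does, the sketch below runs it on three made-up label lists. Unlike the notebook cell, it keeps a reference to the fitted binarizer (the variable name `mlb` is an assumption), which makes `classes_` and `inverse_transform` available for decoding.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label lists: each inner list holds the theme codes of one comment.
label_lists = [[10, 11], [1, 3, 10], [3]]

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(label_lists)

print(mlb.classes_)  # column order: the sorted labels seen during fit
print(indicator)     # one row per comment, 1 where the label applies

# Turn indicator rows back into tuples of labels.
recovered = mlb.inverse_transform(indicator)
print(recovered)
```

The columns cover only the labels seen during fitting, in sorted order. In the notebook all 14 theme codes occur, which is why the columns line up with `target_names`.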

In [10]:
X_train, X_test, y_train, y_test = train_test_split(list(X), y, test_size=0.30, random_state=42, shuffle=True)

Try out some classifiers:
1. Logistic Regression
2. Random Forest
3. SVM
4. Multinomial Naive Bayes
5. XGBoost
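
All five candidates are wrapped in the same TF-IDF plus one-vs-rest pattern, so the comparison can also be written as a loop. The sketch below is one possible way to do that, not the notebook's own code: `build_pipeline`, the toy texts and the toy labels are invented for illustration, scikit-learn's built-in `'english'` stop-word list stands in for the NLTK list, and the scores are computed on the training data only to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline(estimator):
    """Wrap an estimator in TF-IDF + one-vs-rest (hypothetical helper)."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("clf", OneVsRestClassifier(estimator)),
    ])


# Toy data, invented for illustration: columns are [service, atm].
toy_texts = [
    "slow service at the branch",
    "atm swallowed my card",
    "atm queue and slow service",
    "good service today",
    "the atm was out of cash",
    "service and atm both failed",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [1, 1]])

candidates = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
}

scores = {}
for name, estimator in candidates.items():
    pipeline = build_pipeline(estimator)
    pipeline.fit(toy_texts, toy_labels)
    # Training-set score, shown only to demonstrate the loop.
    scores[name] = accuracy_score(toy_labels, pipeline.predict(toy_texts))
    print(name, scores[name])
```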

### Logistic Regression Classifier

In [11]:
from sklearn.linear_model import LogisticRegression

LogReg_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
            ])
print('... Processing')
LogReg_pipeline.fit(X_train, y_train)
# compute the testing accuracy
prediction = LogReg_pipeline.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(y_test, prediction)))


... Processing
Test accuracy is 0.7916308397821725
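
For multi-label indicator targets, scikit-learn's `accuracy_score` computes subset accuracy: a comment counts as correct only when every one of its labels is predicted exactly. The NumPy sketch below, on made-up arrays, contrasts this strict score with the fraction of individual label decisions that are right.

```python
import numpy as np

# Made-up ground truth and predictions: 4 comments, 3 labels each.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# Subset accuracy: a row counts only if all of its labels match.
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Per-label accuracy: fraction of individual label decisions that match.
label_accuracy = np.mean(y_true == y_pred)

print(subset_accuracy)  # 2 of 4 rows match exactly
print(label_accuracy)   # 10 of 12 label decisions match
```

Because subset accuracy is so strict, a classifier that gets most labels right on every comment can still score low. Metrics such as `sklearn.metrics.hamming_loss` or a micro-averaged `f1_score` give a complementary per-label view.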


### Random Forest Classifier

In [12]:
from sklearn.ensemble import RandomForestClassifier


In [13]:
RandomForest_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(RandomForestClassifier(max_depth=150, random_state=0), n_jobs=1)),
            ])
print('... Processing')
RandomForest_pipeline.fit(X_train, y_train)
# compute the testing accuracy
prediction = RandomForest_pipeline.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(y_test, prediction)))

... Processing
Test accuracy is 0.8733161364287761


### SVM Classifier

In [14]:
from sklearn import svm

SVM_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(svm.SVC(decision_function_shape='ovo'), n_jobs=1)),
            ])
print('... Processing')
SVM_pipeline.fit(X_train, y_train)
# compute the testing accuracy
prediction = SVM_pipeline.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(y_test, prediction)))


... Processing
Test accuracy is 0.0


### Naive Bayes Classifier

In [15]:
from sklearn.naive_bayes import MultinomialNB
Naives_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(MultinomialNB(), n_jobs=1)),
            ])
print('... Processing')
Naives_pipeline.fit(X_train, y_train)
# compute the testing accuracy
prediction = Naives_pipeline.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(y_test, prediction)))

... Processing
Test accuracy is 0.5032960733734595


### XGBoost Classifier

In [16]:
classes = len(csvdata['theme'].unique())  # number of classes
reg_lambda = 2  # XGBoost's L2 regularization term on weights; increasing it makes the model more conservative (default=1)


XGB_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(
                    XGBClassifier(                  
                        objective = "multi:softmax", 
                        seed =27,
                        reg_lambda=reg_lambda,
                        num_class = classes
                ), n_jobs=1)),
            ])
print('... Processing')
XGB_pipeline.fit(X_train, y_train)
# compute the testing accuracy
prediction = XGB_pipeline.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(y_test, prediction)))


... Processing
Test accuracy is 0.9415305245055889


### Validation

In [19]:
#Validation using the best classifier 
comment1 = 'Make me understand, why are there 2 teller on duty at kikuyu branch,av been here for close to 1 hr waiting and the lobby is almost full?' #
comment2 = 'I received a text of qualifying for loan but every time I request there are incoveniences'
comment3 = 'I misplaced my ATM'
comment4 = 'I am one dissapointed customer lately.Every time I transfer money from Mpesa to my account the transaction is not completed,meaning this cash doesn\'t hit my account'
comment5 = 'CB mobile app disappointing at its best and when you least expect it. The error message that keeps popping out is not funny'
comment6 = "Your service is soooooo slow, went to Karen branch no movement for 30 minutes then since was getting late decided to try your ongata Rongai branch, let's just say been here 1 hour but alas....... I hope never to need to visit your branch again in the near future."


comments_new = [comment1,comment2,comment3,comment4, comment5, comment6]


predicted = XGB_pipeline.predict(comments_new)
predicted = pd.DataFrame(predicted, columns=target_names)

predicted
   

# Print each comment next to the names of the themes predicted for it
for tw, (_, row) in zip(comments_new, predicted.iterrows()):
    themes = [name for name in target_names if row[name] == 1]
    print('\n%r ===> %s' % (tw, themes))



'Make me understand, why are there 2 teller on duty at kikuyu branch,av been here for close to 1 hr waiting and the lobby is almost full?' ===> ['facilities', 'staff']

'I received a text of qualifying for loan but every time I request there are incoveniences' ===> ['speed/efficiency']

'I misplaced my ATM' ===> ['atm']

"I am one dissapointed customer lately.Every time I transfer money from Mpesa to my account the transaction is not completed,meaning this cash doesn't hit my account" ===> ['account', 'mobile_banking', 'service', 'speed/efficiency']

'CB mobile app disappointing at its best and when you least expect it. The error message that keeps popping out is not funny' ===> ['mobile_banking']

"Your service is soooooo slow, went to Karen branch no movement for 30 minutes then since was getting late decided to try your ongata Rongai branch, let's just say been here 1 hour but alas....... I hope never to need to visit your branch again in the near future." ===> ['facilities', 'serv

#### Model Persistence

In [20]:
filename = 'banking.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(XGB_pipeline, model_file)
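
The saved pipeline can later be restored with `pickle.load` and used for prediction without retraining. The sketch below shows the round trip on a small stand-in pipeline, so it runs without the banking data or XGBoost. The toy texts, toy labels, and the temporary file location are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration: columns are [service, atm].
texts = [
    "slow service at the branch",
    "atm swallowed my card",
    "atm queue and slow service",
    "good service today",
    "the atm was out of cash",
    "service and atm both failed",
]
labels = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [1, 1]])

# Small stand-in for the notebook's XGB_pipeline.
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
toy_pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "banking.sav")
    with open(path, "wb") as model_file:
        pickle.dump(toy_pipeline, model_file)
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(restored.predict(texts), toy_pipeline.predict(texts))
print(same)
```

Pickle files should only be loaded from trusted sources. A model is best restored with the same library versions that were used to save it.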