In this snippet, we explore a problem that is not a traditional actuarial one but is extremely useful nonetheless: text classification. <br><br> The dataset can be found at https://www.kaggle.com/sebastienverpile/consumercomplaintsdata and consists of finance-related complaints that a company has received from its customers. <br><br> The problem is split into two parts: transforming text data into usable inputs, and performing classification with those inputs.

The packages that we will be using are listed below. <br>
pandas and NumPy for general data manipulation <br>
Matplotlib for general data visualisation <br>
scikit-learn for both feature extraction and the classification model

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

Since we are trying to predict the product from the complaint received, we can ignore the remaining columns for this exercise and keep only the single feature column and the target column. We will also drop null entries and, for simplicity, rows whose product falls into multiple categories.

In [2]:
df = pd.read_csv('Consumer_Complaints.csv')
# Keep only rows with a complaint narrative, and only the feature and target columns
df = df.loc[(df['Consumer complaint narrative'].notnull()),
            ['Consumer complaint narrative', 'Product']] \
       .reset_index(drop = True)
# Drop products that span multiple categories
df = df[(np.logical_not(df.Product.str.contains(',')))
        & (df.Product != 'Credit card or prepaid card')]
# Uncomment to keep only the first 100 rows
#df = df.iloc[:100, :]

df.columns = ['description', 'target']

Next, we will convert the target variable from text into integer values. There are a variety of ways to do this; here, we use sklearn's LabelEncoder class.

In [3]:
encoder = LabelEncoder()
encoder.fit(df.target)
df = df.assign(encoded_response = lambda x: encoder.transform(x.target))
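To make the encoding step concrete, here is a minimal sketch of what LabelEncoder does on a handful of made-up labels. The list `toy_targets` and the names `toy_encoder` and `toy_codes` are illustrative assumptions and are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical product labels standing in for df.target
toy_targets = ['Mortgage', 'Credit card', 'Mortgage', 'Debt collection']

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_targets)

# Classes are stored in sorted order; each code is a position in that array
print(toy_encoder.classes_)   # ['Credit card' 'Debt collection' 'Mortgage']
print(toy_codes)              # [2 0 2 1]

# inverse_transform maps integer codes back to the original text labels
print(toy_encoder.inverse_transform([0, 2]))
```

Keeping the fitted encoder around is what later allows integer predictions to be mapped back to product names.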

# Part 1 - Text to features

Machine learning models can make use of a variety of inputs, such as images, but whatever input is chosen must ultimately be converted into numerical features before a model can use it. Images, for example, are made up of pixels, and each pixel can be represented by an integer triplet (R, G, B) giving the intensity of each of the basic RGB colors. <br> In this section, we will find a meaningful way to represent our text data as numerical features. This is also known as feature extraction.

## Bag Of Words Model

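A bag-of-words (BOW) model represents each document by the tokens it contains, ignoring grammar and word order; every distinct token in the corpus becomes one feature. TF-IDF (term frequency-inverse document frequency) reweights those token counts so that tokens that are frequent in a document but rare across the corpus receive the largest weights. In the formula below, $w_{i,j}$ is the weight of token $i$ in document $j$, $tf_{i,j}$ is the number of times token $i$ occurs in document $j$, $N$ is the total number of documents, and $df_i$ is the number of documents that contain token $i$.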

$w_{i,j} = tf_{i,j} \times \log\left(\dfrac{N}{df_i}\right)$
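As a rough illustration of the formula, the sketch below computes the weights by hand for a tiny made-up corpus. The corpus `toy_docs` and the helper `tfidf_weights` are assumptions for illustration only. Note that TfidfVectorizer by default uses a smoothed IDF and L2-normalises each document vector, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenised
toy_docs = [
    ['late', 'fee', 'on', 'credit', 'card'],
    ['mortgage', 'payment', 'late'],
    ['credit', 'report', 'error', 'credit'],
]

n_docs = len(toy_docs)

# df_i: number of documents that contain token i
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf_weights(doc):
    """Return {token: tf * log(N / df)} for one document, using raw counts as tf."""
    term_freq = Counter(doc)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()}

weights = [tfidf_weights(doc) for doc in toy_docs]

# 'credit' occurs twice in the third document but appears in two of the three
# documents, so it ends up with a lower weight than the rarer 'report'
print(weights[2])
```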

In [6]:
x_train, x_test, y_train, y_test = train_test_split(df.description, 
                                                    df.encoded_response, 
                                                    test_size = 0.2, 
                                                    random_state = 123)

In [7]:
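# Like a gensim corpus, the vectorizer turns each document into a BOW vector
# Each token becomes a feature in the machine learning model
# max_df = 0.7 drops tokens that appear in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english',
                                   max_df = 0.7)
# A token_pattern regex could also be passed here to filter tokens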

In [8]:
# Learn the vocabulary and IDF weights from the training set and transform it in one step
tfidf_train = tfidf_vectorizer.fit_transform(x_train.values)
# Reuse the fitted vocabulary and IDF weights on the test set
tfidf_test = tfidf_vectorizer.transform(x_test.values)

In [10]:
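# Features (tokens) in the input matrix
# Loads of rubbish features here!
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
tfidf_vectorizer.get_feature_names()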

['00',
 '000',
 '0000',
 '00000',
 '0001',
 '00072',
 '000a',
 '000cash',
 '000dollars',
 '000if',
 '000ii',
 '000ins',
 '000s',
 '000xx',
 '001',
 '00101',
 '0015',
 '0017',
 '0018',
 '001a',
 '001b',
 '001c',
 '002',
 '0022',
 '0023',
 '003',
 '004',
 '0041',
 '0054',
 '0065',
 '0077',
 '0080',
 '0085',
 '009',
 '00additional',
 '00amount',
 '00balance',
 '00but',
 '00eventually',
 '00i',
 '00interest',
 '00lender',
 '00plus',
 '00purchase',
 '00time',
 '00total',
 '00usd',
 '00which',
 '01',
 '010',
 '0102',
 '01045',
 '010trustee',
 '011',
 '0116',
 '012',
 '01231',
 '0125',
 '014',
 '015',
 '016',
 '017',
 '02',
 '020',
 '021',
 '02240',
 '025',
 '027',
 '03',
 '030',
 '031',
 '0324',
 '036',
 '038',
 '04',
 '040',
 '041',
 '0412',
 '043',
 '0432',
 '045',
 '05',
 '050',
 '056',
 '06',
 '0608',
 '064',
 '067',
 '069',
 '07',
 '070',
 '073',
 '075',
 '07732',
 '078',
 '07800',
 '08',
 '080',
 '0804',
 '0805',
 '081',
 '0813',
 '082',
 '083',
 '084',
 '0875',
 '089',
 '09',
 '090',
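Many of the tokens listed above are purely numeric or start with digits. One possible way to filter them out is the `token_pattern` argument of TfidfVectorizer. The sketch below is an assumption about how this could be done, not the notebook's own preprocessing: the regex keeps only tokens made of two or more letters, and `toy_complaints` is made-up data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical complaint snippets containing amounts, dates and a fused token
toy_complaints = [
    'I was charged 35.00 in overdraft fees on 01/02',
    'The lender added 000dollars to my mortgage balance',
]

# Assumed pattern: whole tokens of two or more letters only, so that
# '35', '00' and '000dollars' are all discarded
alpha_only_vectorizer = TfidfVectorizer(stop_words='english',
                                        token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
alpha_only_vectorizer.fit(toy_complaints)

# vocabulary_ maps each surviving token to its column index
print(sorted(alpha_only_vectorizer.vocabulary_))
```

A stricter or looser pattern may suit the real data better; for instance, this one still keeps masked placeholders such as 'xx'.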


In [11]:
# First 5 rows of the TF-IDF matrix, converted from sparse to dense for printing
print(tfidf_train[:5].toarray())

[[0.06725613 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.1599056  0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
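The rows printed above are almost entirely zeros, which is why the vectorizer returns a sparse matrix. The sketch below, on a made-up corpus named `toy_docs`, shows one way to measure how sparse such a matrix is, and checks that each document vector has unit length under the default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the complaint narratives
toy_docs = [
    'late fee on credit card',
    'mortgage payment late',
    'credit report error',
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# nnz is the number of explicitly stored (non-zero) entries
n_rows, n_cols = toy_matrix.shape
density = toy_matrix.nnz / (n_rows * n_cols)
print(f'{n_rows} documents x {n_cols} tokens, {density:.0%} of entries are non-zero')

# With the default norm='l2', every document vector has unit length
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

On a real corpus with tens of thousands of tokens the density is typically far lower, which is why converting the whole matrix to a dense array can be expensive.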


In [12]:
# Converting the sparse matrix to a dense DataFrame (memory-hungry for a large vocabulary)
# Column names are the features or tokens in the corpus
tfidf_df = pd.DataFrame(tfidf_train.toarray(), 
                        columns = tfidf_vectorizer.get_feature_names())

In [13]:
tfidf_df.head()

Unnamed: 0,00,000,0000,00000,0001,00072,000a,000cash,000dollars,000if,...,zone,zoned,zones,zoning,zoo,zoom,zooms,ztuff,zwicker,zwickers
0,0.067256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.159906,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Part 2 - Classification Model

In [14]:
# Using naive bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [15]:
nb_classifier = MultinomialNB()
nb_classifier.fit(tfidf_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
pred = nb_classifier.predict(tfidf_test)

In [17]:
metrics.accuracy_score(y_test, pred)

0.7416719740813558

In [46]:
confusion_matrix = pd.DataFrame(metrics.confusion_matrix(y_test, pred),
                                index = encoder.classes_,
                                columns = encoder.classes_)

In [47]:
confusion_matrix

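| | Bank account or service | Checking or savings account | Consumer Loan | Credit card | Credit reporting | Debt collection | Money transfers | Mortgage | Other financial service | Payday loan | Prepaid card | Student loan | Vehicle loan or lease | Virtual currency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bank account or service | 1624 | 0 | 2 | 266 | 65 | 400 | 0 | 644 | 0 | 0 | 0 | 0 | 0 | 0 |
| Checking or savings account | 225 | 0 | 0 | 21 | 11 | 57 | 0 | 102 | 0 | 0 | 0 | 0 | 0 | 0 |
| Consumer Loan | 6 | 0 | 222 | 57 | 148 | 822 | 0 | 610 | 0 | 0 | 0 | 2 | 0 | 0 |
| Credit card | 64 | 0 | 0 | 2337 | 221 | 847 | 0 | 365 | 0 | 0 | 0 | 0 | 0 | 0 |
| Credit reporting | 7 | 0 | 7 | 63 | 4948 | 1047 | 0 | 158 | 0 | 0 | 0 | 16 | 0 | 0 |
| Debt collection | 11 | 0 | 6 | 48 | 351 | 8931 | 0 | 267 | 0 | 0 | 0 | 38 | 0 | 0 |
| Money transfers | 76 | 0 | 1 | 35 | 0 | 115 | 0 | 67 | 0 | 0 | 0 | 0 | 0 | 0 |
| Mortgage | 5 | 0 | 0 | 6 | 72 | 146 | 0 | 7082 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other financial service | 4 | 0 | 0 | 4 | 0 | 26 | 0 | 16 | 0 | 0 | 0 | 6 | 0 | 0 |
| Payday loan | 6 | 0 | 0 | 6 | 4 | 211 | 0 | 107 | 0 | 0 | 0 | 2 | 0 | 0 |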
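In the rows shown, several columns (for example 'Checking or savings account') contain only zeros, meaning the classifier never predicted those products, something a single accuracy figure hides. As a hedged sketch on made-up labels (`toy_true` and `toy_pred` are assumptions), per-class recall can be read off a confusion matrix like this:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for three classes;
# class 1 is never predicted, much like the all-zero columns above
toy_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
toy_pred = np.array([0, 0, 2, 2, 2, 2, 2, 2, 2, 0])

toy_cm = metrics.confusion_matrix(toy_true, toy_pred)

# Rows are actual classes and columns are predicted classes, so
# recall = diagonal (correct predictions) / row total (actual count)
recall_per_class = np.diag(toy_cm) / toy_cm.sum(axis=1)

print(toy_cm)
print(recall_per_class)
print(metrics.accuracy_score(toy_true, toy_pred))
```

Here overall accuracy is 0.6 even though recall for class 1 is zero, which is the kind of imbalance worth checking for in the real confusion matrix.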


In [50]:
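# Improving the model?
# alpha is the additive (Laplace/Lidstone) smoothing parameter of MultinomialNB;
# smaller values smooth the estimated token probabilities less
# Of the values tried here, alpha = 0.1 gives the best test accuracy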
alphas = np.arange(0, 1, 0.1)
def train_and_predict(alpha):
    nb_classifier = MultinomialNB(alpha = alpha)
    nb_classifier.fit(tfidf_train, y_train)
    pred = nb_classifier.predict(tfidf_test)
    score = metrics.accuracy_score(y_test, pred)
    return score

for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.0
  'setting alpha = %.1e' % _ALPHA_MIN)
Score:  0.7598648686068729

Alpha:  0.1
Score:  0.806939329327389

Alpha:  0.2
Score:  0.799988923656301

Alpha:  0.30000000000000004
Score:  0.7937307894663972

Alpha:  0.4
Score:  0.7861434940326198

Alpha:  0.5
Score:  0.7779469996954006

Alpha:  0.6000000000000001
Score:  0.7712181208982914

Alpha:  0.7000000000000001
Score:  0.7639354249162351

Alpha:  0.8
Score:  0.7564865837786946

Alpha:  0.9
Score:  0.7488992883449174
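The loop above picks alpha by comparing scores on the test set, which tends to make the reported test accuracy optimistic. One common alternative, sketched here under the assumption of a tiny made-up corpus (`toy_texts`, `toy_labels`) and an arbitrary grid of alpha values, is to select alpha by cross-validation on the training data only, using a Pipeline so the vectorizer is refitted inside each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data: 0 = bank account complaint, 1 = mortgage complaint
toy_texts = [
    'overdraft fee on my checking account',
    'bank closed my savings account',
    'deposit missing from checking account',
    'bank charged an account fee',
    'mortgage escrow payment was misapplied',
    'loan modification for my mortgage denied',
    'mortgage servicer lost my payment',
    'foreclosure notice on my home loan',
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])

# 'nb__alpha' addresses the alpha parameter of the pipeline step named 'nb'
alpha_grid = {'nb__alpha': [0.01, 0.1, 0.5, 1.0]}
search = GridSearchCV(toy_pipeline, alpha_grid, cv=2, scoring='accuracy')
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

The held-out test set is then used once, at the end, to score the model refitted with the selected alpha.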



In [51]:
class_labels = nb_classifier.classes_

In [52]:
class_labels

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13])

In [53]:
feature_names = tfidf_vectorizer.get_feature_names()

In [56]:
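# Pair each token with its log-probability under the first class (class 0), sorted ascending
# feature_log_prob_[0] holds the values coef_[0] returned here; recent scikit-learn versions no longer expose coef_ on MultinomialNB
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[0], feature_names))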

In [58]:
# Can see a lot of rubbish features, which means that more preprocessing is needed
# Use the token_pattern (regex) argument in the vectorizer
list(feat_with_weights)

[(-11.903588118643736, '0000'),
 (-11.903588118643736, '00000'),
 (-11.903588118643736, '0001'),
 (-11.903588118643736, '00072'),
 (-11.903588118643736, '000a'),
 (-11.903588118643736, '000cash'),
 (-11.903588118643736, '000dollars'),
 (-11.903588118643736, '000if'),
 (-11.903588118643736, '000ii'),
 (-11.903588118643736, '000ins'),
 (-11.903588118643736, '000s'),
 (-11.903588118643736, '000xx'),
 (-11.903588118643736, '001'),
 (-11.903588118643736, '00101'),
 (-11.903588118643736, '0015'),
 (-11.903588118643736, '0017'),
 (-11.903588118643736, '0018'),
 (-11.903588118643736, '001a'),
 (-11.903588118643736, '001b'),
 (-11.903588118643736, '001c'),
 (-11.903588118643736, '002'),
 (-11.903588118643736, '0022'),
 (-11.903588118643736, '0023'),
 (-11.903588118643736, '004'),
 (-11.903588118643736, '0041'),
 (-11.903588118643736, '0054'),
 (-11.903588118643736, '0065'),
 (-11.903588118643736, '0077'),
 (-11.903588118643736, '0080'),
 (-11.903588118643736, '0085'),
 (-11.903588118643736, '00

In [59]:
# Print the first class label and the 20 lowest-weighted feat_with_weights entries (the list is sorted ascending)
print(class_labels[0], feat_with_weights[:20])

0 [(-11.903588118643736, '0000'), (-11.903588118643736, '00000'), (-11.903588118643736, '0001'), (-11.903588118643736, '00072'), (-11.903588118643736, '000a'), (-11.903588118643736, '000cash'), (-11.903588118643736, '000dollars'), (-11.903588118643736, '000if'), (-11.903588118643736, '000ii'), (-11.903588118643736, '000ins'), (-11.903588118643736, '000s'), (-11.903588118643736, '000xx'), (-11.903588118643736, '001'), (-11.903588118643736, '00101'), (-11.903588118643736, '0015'), (-11.903588118643736, '0017'), (-11.903588118643736, '0018'), (-11.903588118643736, '001a'), (-11.903588118643736, '001b'), (-11.903588118643736, '001c')]


In [60]:
# Print the first class label and the 20 highest-weighted feat_with_weights entries
# feat_with_weights was built from index 0, so both ends of the list describe class 0
print(class_labels[0], feat_with_weights[-20:])

0 [(-6.260325448025164, 'did'), (-6.24788187024759, 'fargo'), (-6.246310631318104, 'wells'), (-6.237602275809984, 'america'), (-6.215087749046413, 'branch'), (-6.21093993077539, 'fee'), (-6.138331124085686, 'chase'), (-6.137932356281183, 'told'), (-6.070473506154503, 'funds'), (-6.064802807176095, 'deposit'), (-6.015548177420846, 'fees'), (-6.008945994973818, 'card'), (-5.938198955506924, 'overdraft'), (-5.867381227169975, 'checking'), (-5.732904055888358, 'money'), (-5.703664075313021, 'check'), (-5.670775349830528, 'xx'), (-5.526813497986442, '00'), (-4.835817207262399, 'bank'), (-4.808550700344095, 'account')]
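The tokens above ('account', 'bank', 'checking', 'overdraft') describe a single class. To inspect every class, one option is to loop over the rows of `feature_log_prob_`, which holds one row of log-probabilities per entry in `classes_`. The sketch below uses made-up texts and labels (`toy_texts`, `toy_labels`), and the choice of three tokens per class is arbitrary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical complaints with text labels
toy_texts = [
    'bank account overdraft fee',
    'bank account closed',
    'checking account fee',
    'mortgage payment escrow',
    'mortgage loan modification',
    'mortgage foreclosure notice',
]
toy_labels = ['Bank account', 'Bank account', 'Bank account',
              'Mortgage', 'Mortgage', 'Mortgage']

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_texts)
toy_model = MultinomialNB(alpha=0.1).fit(toy_matrix, toy_labels)

# vocabulary_ maps token -> column index; invert it to look tokens up by column
index_to_token = {idx: tok for tok, idx in toy_vectorizer.vocabulary_.items()}

top_tokens = {}
for row, label in enumerate(toy_model.classes_):
    # feature_log_prob_[row] holds log P(token | class) for classes_[row]
    best = np.argsort(toy_model.feature_log_prob_[row])[::-1][:3]
    top_tokens[label] = [index_to_token[i] for i in best]

print(top_tokens)
```

With the notebook's integer-encoded targets, the class labels could be mapped back to product names through the fitted LabelEncoder.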


Text classification has many extensions. Besides classifying claim descriptions into their respective groups, other extremely useful applications of text classification include sentiment analysis and recommendation engines.