# Consumer Complaints: Multi-class Text Classification Using LSTM

Objective: Given the text of a consumer complaint, predict which product the complaint is about.

In [192]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [193]:
#Load the data:

data = pd.read_csv(r'complaints.csv')

In [194]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1589258 entries, 0 to 1589257
Data columns (total 18 columns):
Date received                   1589258 non-null object
Product                         1589258 non-null object
Sub-product                     1354093 non-null object
Issue                           1589258 non-null object
Sub-issue                       1021648 non-null object
Consumer complaint narrative    527238 non-null object
Company public response         606327 non-null object
Company                         1589258 non-null object
State                           1562294 non-null object
ZIP code                        1438119 non-null object
Tags                            216607 non-null object
Consumer consent provided?      953233 non-null object
Submitted via                   1589258 non-null object
Date sent to company            1589258 non-null object
Company response to consumer    1589257 non-null object
Timely response?                1589258 non-null ob

In [195]:
# Keep only the two columns we need: the label and the complaint text
data = data[['Product','Consumer complaint narrative']]

In [196]:
data = data.dropna(subset=['Consumer complaint narrative'])

In [197]:
df = data

In [198]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 527238 entries, 0 to 1589257
Data columns (total 2 columns):
Product                         527238 non-null object
Consumer complaint narrative    527238 non-null object
dtypes: object(2)
memory usage: 12.1+ MB


In [199]:
# Product categories and their complaint counts
df.Product.value_counts()

Credit reporting, credit repair services, or other personal consumer reports    162990
Debt collection                                                                 112896
Mortgage                                                                         64352
Credit card or prepaid card                                                      35771
Credit reporting                                                                 31588
Student loan                                                                     25948
Checking or savings account                                                      21022
Credit card                                                                      18838
Bank account or service                                                          14885
Consumer Loan                                                                     9473
Vehicle loan or lease                                                             8903
Money transfer, virtual currency, or money 

Several of these categories overlap, so we can group them together:
- Consolidate “Credit reporting” into “Credit reporting, credit repair services, or other personal consumer reports”.
- Consolidate “Credit card” into “Credit card or prepaid card”.
- Consolidate “Payday loan” into “Payday loan, title loan, or personal loan”.
- Consolidate “Virtual currency” into “Money transfer, virtual currency, or money service”.
- “Other financial service” has very few complaints and is too vague to be a useful label, so it is removed.

In [200]:
# Map each legacy category to its consolidated name in a single pass
product_map = {
    'Credit reporting': 'Credit reporting, credit repair services, or other personal consumer reports',
    'Credit card': 'Credit card or prepaid card',
    'Payday loan': 'Payday loan, title loan, or personal loan',
    'Virtual currency': 'Money transfer, virtual currency, or money service',
}
df.loc[:,'Product'] = df['Product'].replace(product_map)


In [201]:
#drop other financial service:
df = df[~df['Product'].isin(['Other financial service'])]

In [202]:
print(df.Product.nunique())
df.Product.value_counts()

13


Credit reporting, credit repair services, or other personal consumer reports    194578
Debt collection                                                                 112896
Mortgage                                                                         64352
Credit card or prepaid card                                                      54609
Student loan                                                                     25948
Checking or savings account                                                      21022
Bank account or service                                                          14885
Consumer Loan                                                                     9473
Vehicle loan or lease                                                             8903
Payday loan, title loan, or personal loan                                         8779
Money transfer, virtual currency, or money service                                8554
Money transfers                            

After consolidation we have 13 labels. Let's plot their distribution as a bar chart:

In [203]:
import plotly
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=False)
df['Product'].value_counts().sort_values(ascending=False).iplot(kind='bar',yTitle='Number of complaints',title='Categorical value distribution')

# Text cleaning:

In [172]:
display(df['Consumer complaint narrative'])

0          transworld systems inc. \nis trying to collect...
2          I would like to request the suppression of the...
3          Over the past 2 weeks, I have been receiving e...
11         I was sold access to an event digitally, of wh...
12         While checking my credit report I noticed thre...
                                 ...                        
1589253    I was on automatic payment for my car loan. In...
1589254    I recieved a collections call from an unknown ...
1589255    On XXXX XXXX, 2015, I contacted XXXX XXXX, who...
1589256    I can not get from chase who services my mortg...
1589257    I made a payment to CITI XXXX Credit Card on X...
Name: Consumer complaint narrative, Length: 526946, dtype: object

In [204]:
df1 = df.reset_index()

In [62]:
# Print the narrative and product for the row with the given index label:

def print_content(index):
    rows = df[df.index==index][['Consumer complaint narrative','Product']].values
    if len(rows)>0:
        example = rows[0]
        print("Content: ",example[0])
        print('Product:',example[1])
        

print_content(100)


Content:  In XX/XX/XXXX, I received a letter from National Credit Systems , Inc. in response to a letter from a XXXX XXXX regarding a Debt collection. The letter was dated XX/XX/XXXX in regards to a debt owed to XXXX XXXX for the amount of {$730.00} ; however, I did not incur any debt when I moved from this apartment complex back in XX/XX/XXXX. I was never notified, by XXXX XXXX. ; of any questions following my leaving their site neither, and they had all my contact information. I responded with a letter of explanation to NCS on XX/XX/XXXX requesting information. I received only a copy of my lease and a screenshot print out of an incorrect move out date. This printout did not match the ledger details or any other paperwork from XXXX XXXX that I had received in the past. This appeared to be a scam. I received another letter XX/XX/XXXX from a XXXX XXXX of   XXXX XXXX XXXX, XXXX ( XXXX ) and spoke with a XXXX XXXX and XXXX - they both said that they are collecting a debt for XXXX XXXX, bu

In [205]:
df1.head()

Unnamed: 0,index,Product,Consumer complaint narrative
0,0,Debt collection,transworld systems inc. \nis trying to collect...
1,2,"Credit reporting, credit repair services, or o...",I would like to request the suppression of the...
2,3,Debt collection,"Over the past 2 weeks, I have been receiving e..."
3,11,"Money transfer, virtual currency, or money ser...","I was sold access to an event digitally, of wh..."
4,12,Debt collection,While checking my credit report I noticed thre...


In [64]:
print_content(56600)

Content:  Good Evening Manager/Executive office ; I dont recall this account ever being late. You are reporting inaccurate information that is ruining my credit and I ask that you remove this inaccurate information immediately from my credit report. I was shocked when I reviewed my credit report and found late payment on the dates below : XX/XX/2018, 30-days late? - how? 

I am not sure how this happened, I believe that I had made my payments to you when I received my statements. My only thought is that my monthly statement did not get to me.
Product: Credit card or prepaid card


# Text preprocessing steps:

- Convert all text to lowercase
- Remove digits
- Remove special characters
- Remove the "x" characters left by redacted text (e.g. XXXX)
- Remove stopwords and duplicated words
- Collapse repeated whitespace
- Apply lemmatization
- Drop words of one or two characters


In [66]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
ps = PorterStemmer()
lmtz = WordNetLemmatizer()
from string import digits
from nltk.tokenize import word_tokenize,sent_tokenize
eng_words = set(nltk.corpus.words.words())

    

str.maketrans builds a translation table, which is a mapping of integers or characters to integers, strings, or None. 
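
As a small illustration of the three-argument form used in the next cell: the first two arguments define character-to-character replacements (empty here), and every character in the third argument is mapped to None, which deletes it. The sample string is made up for this example.

```python
from string import digits

# Three-argument form: no replacements, delete every digit character
table = str.maketrans('', '', digits)

sample = "Paid 730 dollars on 12/05"
cleaned = sample.translate(table)
print(cleaned)  # digits are removed; all other characters are untouched
```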

In [206]:
#Function to remove digits
def remove_digits(text):
    remove_digits = str.maketrans('','',digits)
    res = text.translate(remove_digits)
    return res

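def text_prepare(text):
    # Raw strings avoid invalid-escape warnings in the regex patterns
    replace_by_space = re.compile(r'[/(){}\[\]\|@;,]')
    good_symbols = re.compile(r'[^0-9a-z]')
    stopwords_set = set(stopwords.words('english'))
    text = text.lower()
    text = replace_by_space.sub(' ',text)
    text = good_symbols.sub(' ',text)
    # Caution: this strips every "x", including those inside real words
    text = text.replace('x',' ')
    # Caution: set() removes duplicates but also discards word order
    text = ' '.join([word for word in set(text.split()) if word and word not in stopwords_set])
    return text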
    
def lemmatize(text):
    processed_text=[]
    words = word_tokenize(text)
    for word in words:
        x=lmtz.lemmatize(word)
        processed_text.append(x)
    return ' '.join(processed_text)

def remove_single_letter(text):
    # Despite the name, this drops every word of one or two characters
    processed_text = ' '.join([word for word in text.split() if len(word)>2])
    return processed_text
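
Because an LSTM reads tokens in sequence, it may be worth keeping the original word order during cleaning. The sketch below is one possible variant, not the pipeline used for the results in this notebook: it removes only standalone runs of "x" (redaction masks such as XXXX) and filters stopwords without going through a `set`. The function name `text_prepare_ordered` and the tiny stopword set are hypothetical stand-ins; in practice the NLTK stopword list could be passed in instead.

```python
import re

MASK = re.compile(r'\bx{2,}\b')          # standalone redaction masks like "xxxx"
NON_ALNUM = re.compile(r'[^0-9a-z]+')    # anything that is not a lowercase letter or digit

def text_prepare_ordered(text, stopwords_set):
    """Lowercase, strip symbols and masks, drop stopwords, keep word order."""
    text = text.lower()
    text = NON_ALNUM.sub(' ', text)
    text = MASK.sub(' ', text)
    return ' '.join(w for w in text.split() if w not in stopwords_set)

demo_stopwords = {'i', 'to', 'on', 'the', 'a'}   # hypothetical; not the NLTK list
demo = text_prepare_ordered("On XXXX, I sent the explanation to XXXX XXXX.", demo_stopwords)
print(demo)
```

Words such as "explanation" keep their "x" here, and repeated words stay in place, so the sequence fed to the model still resembles the original sentence.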




In [154]:
#Data cleaning:

def data_clean(data):
    data.loc[:,'Consumer complaint narrative'] = [remove_digits(str(x)) for x in data.loc[:,'Consumer complaint narrative']]
    data.loc[:,'Consumer complaint narrative'] = [text_prepare(x) for x in data.loc[:,'Consumer complaint narrative']]
    data.loc[:,'Consumer complaint narrative'] = [lemmatize(x) for x in data.loc[:,'Consumer complaint narrative']]
    data.loc[:,'Consumer complaint narrative'] = [remove_single_letter(x) for x in data.loc[:,'Consumer complaint narrative']]
    data1 = data.drop_duplicates()
    data1 = data1.dropna(subset=['Consumer complaint narrative'])
    return data1




In [207]:
df1 = df1.sample(frac=0.1)

In [208]:
df1=df1.reset_index()

In [209]:
df1=df1.head(10000)

In [210]:
result=data_clean(df1)

In [211]:
result.head()

Unnamed: 0,level_0,index,Product,Consumer complaint narrative
0,204172,660297,"Credit reporting, credit repair services, or o...",finally never see accurate back fico considere...
1,296719,1173700,Credit card or prepaid card,never could back come magazine tell would subs...
2,78693,171731,Debt collection,purchased could withholding would receiving ca...
3,439507,1390989,Credit card or prepaid card,purchased online plan never see total back com...
4,258273,1042475,Debt collection,also never ambulance service account filing ca...


In [212]:
result = result[['Product','Consumer complaint narrative']]

In [213]:
# Print the cleaned narrative and product for the row with the given index label:

def print_content(index):
    rows = result[result.index==index][['Consumer complaint narrative','Product']].values
    if len(rows)>0:
        example = rows[0]
        print("Content: ",example[0])
        print('Product:',example[1])
        

print_content(100)

Content:  seems never pas could charging back complaint verification charged would sharing immediately unlikely fact understand loan email process paystub effort every pressed shared give mailing day report statement basically even document new called along plained day though collected require feel regard underwriting card approve passed first secure said waive concern address taking whether concern appraisal stub phone timely treated fund manager market say underwriter respond credit closing complained asked circumvent also ordered since going pnc changed money change email unable stated payment accepted still asking school provided verified letter felt district indicated question rule unfairly work offer seller agreement may information returned sent paystubs conversation paperwork offered emailed copy doubt completed paid doubt marketing well rush ordering statement fashion possibly planation employment underwriter mailed able electronic act automated approved wanting close stub ban

# LSTM Modeling

- Tokenize the text with the Keras Tokenizer
- Pad every sequence to a fixed length (pad_sequences pads at the front by default)
- Limit the vocabulary to the 5,000 most frequent words
- Cap each complaint at 250 tokens
- Set the embedding dimension to 100

In [214]:
max_words =5000
max_seq_len = 250
Embed_dim = 100


In [227]:
from keras.models import Sequential
from keras.layers import Dense,LSTM,Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import SpatialDropout1D


In [219]:
tokenizer = Tokenizer(num_words=max_words,lower=True)
tokenizer.fit_on_texts(result['Consumer complaint narrative'].values)

In [220]:
word_index = tokenizer.word_index
print("Unique words",len(word_index))

Unique words 15997


In [223]:
# Pad or truncate every sequence to the same length:
X = tokenizer.texts_to_sequences(result['Consumer complaint narrative'].values)
X = pad_sequences(X,maxlen=max_seq_len)
print("Shape of tensor: ",X.shape)

Shape of tensor:  (10000, 250)
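
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings: zeros are added at the front of short sequences, and long sequences keep their last `maxlen` tokens. The helper `pad_pre` is hypothetical and only mimics that default behaviour for illustration; the token ids are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded

token_ids = [[4, 9], [7, 1, 3, 8, 2, 6]]   # made-up token ids
padded_demo = pad_pre(token_ids, maxlen=4)
print(padded_demo)
```

Passing `padding='post'` and `truncating='post'` to `pad_sequences` would instead pad and cut at the end of each sequence.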


In [224]:
#Converting categorical labels to numbers using pd.get_dummies:

Y = pd.get_dummies(result['Product']).values
print("Shape of label tensor: ",Y.shape)

Shape of label tensor:  (10000, 13)
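
One detail worth keeping in mind is that `pd.get_dummies` orders its columns alphabetically, not by frequency, so a column index only maps back to a product name through the columns of the dummy frame. A minimal sketch with three made-up rows:

```python
import pandas as pd

# Toy labels for illustration only
products = pd.Series(["Mortgage", "Debt collection", "Mortgage", "Student loan"])

dummies = pd.get_dummies(products)
labels = list(dummies.columns)              # alphabetical column order
Y_demo = dummies.values.astype("float32")   # one-hot matrix, one row per sample

# Decode each one-hot row back to its product name
decoded = [labels[i] for i in Y_demo.argmax(axis=1)]
print(labels)
print(decoded)
```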


In [225]:
#Train test split:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.1,random_state=101)

In [226]:
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)

(9000, 250) (9000, 13)
(1000, 250) (1000, 13)


- The first layer is an Embedding layer that represents each word as a 100-dimensional vector
- SpatialDropout1D drops entire embedding channels rather than individual values
- An LSTM layer with 100 memory units
- The output layer produces 13 values, one per class
- Softmax activation, since this is multi-class classification
- categorical_crossentropy as the loss function, matching the one-hot labels
- Adam as the optimizer
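
For intuition about the last two choices, the sketch below works through softmax and categorical cross-entropy by hand for a single sample with three classes. The scores are made up, and the two helper functions are simplified stand-ins written for this example, not the Keras implementations.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])                     # made-up scores for three classes
loss = categorical_crossentropy([1, 0, 0], probs)    # true class is index 0
print(probs, loss)
```

The loss shrinks toward zero as the probability of the true class approaches 1, which is what training pushes the network toward.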


In [228]:
model = Sequential()
model.add(Embedding(max_words,Embed_dim,input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100,dropout=0.2,recurrent_dropout=0.2))
model.add(Dense(13,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 250, 100)          500000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 250, 100)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 13)                1313      
Total params: 581,713
Trainable params: 581,713
Non-trainable params: 0
_________________________________________________________________
None
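
The parameter counts in the summary can be checked by hand. The arithmetic below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are chosen for this example.

```python
vocab_size, embed_dim, lstm_units, n_classes = 5000, 100, 100, 13

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per (LSTM unit, class) pair plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```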


In [229]:
model.fit(X_train,y_train,epochs=3,batch_size=64,validation_split=0.1)

Train on 8100 samples, validate on 900 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x1fcaafa6608>

Training on more data and for more epochs may improve accuracy.

In [231]:
accuracy = model.evaluate(X_test,y_test)
print("Test set loss {:0.3f} \n accuracy {:0.3f}: ".format(accuracy[0],accuracy[1]))

Test set loss 1.103 
 accuracy 0.677: 


# Test with a new complaint:

In [232]:
new_complaint = ['I am a victim of identity theft and someone stole my identity and personal information to open up a Visa credit card account with Bank of America. The following Bank of America Visa credit card account do not belong to me : XXXX.']
seq = tokenizer.texts_to_sequences(new_complaint)
padded = pad_sequences(seq, maxlen=max_seq_len)
pred = model.predict(padded)
# Take label names from the one-hot columns so they match the column order of Y (get_dummies sorts them alphabetically)
labels = pd.get_dummies(result['Product']).columns
print(pred, labels[np.argmax(pred)])

[[0.03810645 0.06836877 0.01760171 0.52929556 0.21278286 0.0402986
  0.02835477 0.0060277  0.01022005 0.01357683 0.00386436 0.00966665
  0.02183569]] Credit card or prepaid card
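
Looking beyond the single best class can show how confident the model is. As a small sketch, the probabilities printed above are ranked here to get the three most likely class indices; mapping an index to a product name still depends on the column order of the one-hot labels.

```python
# Probabilities copied from the prediction output above
pred_probs = [0.03810645, 0.06836877, 0.01760171, 0.52929556, 0.21278286, 0.0402986,
              0.02835477, 0.0060277, 0.01022005, 0.01357683, 0.00386436, 0.00966665,
              0.02183569]

# Indices of the three most probable classes, highest first
top3 = sorted(range(len(pred_probs)), key=lambda i: pred_probs[i], reverse=True)[:3]
for i in top3:
    print(i, round(pred_probs[i], 3))
```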
