# Classification of Consumer Complaints

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. 

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link 

As you work, answer the following questions: 
* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a TF-IDF vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

**Bonus:** A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [1]:
import pandas as pd
import numpy as np
import re

from tqdm.notebook import tqdm

from joblib import dump, load

from nltk import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix

In [11]:
complaints = pd.read_csv('../data/complaints.csv')

In [13]:
complaints.columns = ['description', 'issue']

In [None]:
complaints['issue'].value_counts().sort_index()

In [None]:
for statement in complaints.loc[complaints['issue'] == 'Attempts to collect debt not owed', 'description'].sample(3):
    print(statement)
    print('-----------------------------')

Preprocessing:

In [15]:
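def preprocessing(text):
    # Drop redacted tokens such as " XXXX" or " XX/XX/XXXX": whitespace, then a
    # run that starts with a capital X, up to the next whitespace
    text = re.sub(r'[\s]XX*[\S]*', '', text)
    # Strip digits and common punctuation
    text = re.sub(r'[0-9$,.?!{}()]', '', text)
    # Remove newline characters
    text = re.sub(r'\n', '', text)
    # Lowercase last, because the redaction pattern above relies on capital X
    text = text.lower()
    return text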

In [17]:
complaints['description'] = complaints['description'].apply(preprocessing)
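To see what the cleaning step does to a single narrative, here is a small check on a made-up complaint. The function is copied from the cell above so the snippet runs on its own; the sample strings are hypothetical and are not taken from the dataset.

```python
import re

def preprocessing(text):
    # Same steps as the notebook cell above
    text = re.sub(r'[\s]XX*[\S]*', '', text)
    text = re.sub(r'[0-9$,.?!{}()]', '', text)
    text = re.sub(r'\n', '', text)
    text = text.lower()
    return text

# Hypothetical narrative with redactions, an amount, and a line break
sample = "On XX/XX/2022 I paid {$150.00} to XXXX Bank.\nThey still report the debt!"
cleaned = preprocessing(sample)
print(repr(cleaned))

# Hypothetical narrative with an ordinary word that starts with a capital X
print(repr(preprocessing("I called Xfinity today")))
```

Two side effects are worth knowing about when reading the results: removing `\n` without inserting a space joins the words on either side of a line break, and the redaction pattern also removes any ordinary word that follows whitespace and begins with a capital X.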

In [None]:
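# Lemmatizing tokenizer (pass as tokenizer=LemmaTokenizer() to a vectorizer).
# It did not help accuracy and was very slow, so it is not used below.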
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

Model selection and training:

In [19]:
X = complaints[['description']]
y = complaints['issue']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [21]:
vect1 = CountVectorizer(stop_words = 'english')
vect2 = TfidfVectorizer(stop_words = 'english')
clf = MultinomialNB()

pipe = Pipeline([("vect", vect1), ("clf", clf)])

param_grid = {
    'vect': [vect1, vect2],
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'clf__fit_prior':[False, True]
}

In [23]:
# n_iter defaults to 10, so the search samples 10 of the 12 possible combinations
rs = RandomizedSearchCV(estimator = pipe, param_distributions = param_grid, verbose = 2, n_jobs = -1)
rs.fit(X_train['description'], y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [25]:
print(rs.best_params_)
print(rs.best_score_)

{'vect__ngram_range': (1, 2), 'vect': CountVectorizer(stop_words='english'), 'clf__fit_prior': True}
0.8704135443913941


In [None]:
pd.DataFrame(rs.cv_results_)
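The full `cv_results_` table is wide, so a compact view helps when comparing count versus TF-IDF features and the different n-gram ranges. The sketch below runs a tiny grid search on a hypothetical twelve-document corpus (the texts, labels, and the names `search` and `summary` are illustrative only) and then reduces the results to one row per candidate. With the real search, `rs.cv_results_` would take the place of `search.cv_results_`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 "debt" and 6 "report" complaints
texts = [
    "collector keeps calling about a debt i do not owe",
    "this debt is not mine and the collector will not stop",
    "i never owed this debt to the collection agency",
    "collection agency reports a debt that was already paid",
    "the collector demanded payment for a debt i do not recognize",
    "debt collector called my employer about an old debt",
    "my credit report shows an account that is not mine",
    "the credit bureau did not fix the error on my report",
    "incorrect information on my credit report lowered my score",
    "i disputed the wrong balance on my credit report",
    "the bureau keeps reporting a late payment in error",
    "my credit report lists an inquiry i never authorized",
]
labels = ["debt"] * 6 + ["report"] * 6

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
}
search = GridSearchCV(pipe, grid, cv=3).fit(texts, labels)

# One row per candidate: vectorizer type, n-gram range, mean CV accuracy
params = search.cv_results_["params"]
summary = pd.DataFrame({
    "vectorizer": [type(p["vect"]).__name__ for p in params],
    "ngram_range": [str(p["vect__ngram_range"]) for p in params],
    "mean_test_score": search.cv_results_["mean_test_score"],
}).sort_values("mean_test_score", ascending=False)

print(summary.to_string(index=False))
```

Scores on a toy corpus this small carry no meaning; the point is only the shape of the summary table.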

In [27]:
y_pred = rs.best_estimator_.predict(X_test['description'])

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8714434459811222
[[13641   459    77  3934   180]
 [ 1515  3518    20   154   104]
 [  174    20  2668   178    47]
 [ 3552    37    81 52938   718]
 [   46    11     4    48  4234]]


In [29]:
# Reuse the best vectorizer settings from the search (counts, unigrams + bigrams)
vect = CountVectorizer(ngram_range=(1,2), stop_words = 'english')

X_train_vec = vect.fit_transform(X_train['description'])
X_test_vec = vect.transform(X_test['description'])

In [31]:
lr = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)

y_pred = lr.predict(X_test_vec)

In [33]:
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8972815138414179
[[13706   684   116  3715    70]
 [  868  4162    21   240    20]
 [  123    25  2709   216    14]
 [ 2299    81    64 54757   125]
 [   75    38    17   265  3948]]
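Overall accuracy hides how unevenly the classes are handled, so it can help to reduce each confusion matrix to per-class recall and the single most common mix-up. The sketch below copies the two matrices printed above and assumes the scikit-learn convention that rows are true labels and columns are predicted labels, both in sorted label order. The helper name `summarize` is illustrative.

```python
import numpy as np

# Confusion matrices copied from the outputs above
nb_cm = np.array([
    [13641,   459,    77,  3934,   180],
    [ 1515,  3518,    20,   154,   104],
    [  174,    20,  2668,   178,    47],
    [ 3552,    37,    81, 52938,   718],
    [   46,    11,     4,    48,  4234],
])
lr_cm = np.array([
    [13706,   684,   116,  3715,    70],
    [  868,  4162,    21,   240,    20],
    [  123,    25,  2709,   216,    14],
    [ 2299,    81,    64, 54757,   125],
    [   75,    38,    17,   265,  3948],
])

def summarize(cm):
    """Return accuracy, per-class recall, and the largest (true, predicted) error cell."""
    correct = cm.diagonal()
    accuracy = correct.sum() / cm.sum()
    recall = correct / cm.sum(axis=1)
    errors = cm.copy()
    np.fill_diagonal(errors, 0)  # keep only misclassifications
    row, col = np.unravel_index(errors.argmax(), errors.shape)
    return accuracy, recall, (int(row), int(col))

nb_acc, nb_recall, nb_worst = summarize(nb_cm)
lr_acc, lr_recall, lr_worst = summarize(lr_cm)

# Share of the test set held by the largest class: the accuracy of always guessing it
majority_share = nb_cm.sum(axis=1).max() / nb_cm.sum()

print("majority-class baseline:", round(majority_share, 4))
print("NB recall by class:", nb_recall.round(4))
print("LR recall by class:", lr_recall.round(4))
print("most common confusion (true, predicted):", nb_worst, lr_worst)
```

On these numbers, both models most often predict class 3 for complaints whose true class is 0. Logistic regression raises recall for classes 0 through 3 (most visibly class 1, from about 0.66 to about 0.78), while recall for class 4 falls from about 0.97 to about 0.91.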


In [None]:
# One column of coefficients per class, alongside the n-gram each row refers to
coef_df = pd.DataFrame({
    'word': vect.get_feature_names_out(),
    **{cls: coefs for cls, coefs in zip(lr.classes_, lr.coef_)}
})

In [None]:
coef_df.sort_values(lr.classes_[4], ascending = False).head(10)

In [None]:
svm = SVC().fit(X_train_vec, y_train)

y_pred = svm.predict(X_test_vec)

In [None]:
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))