# Classification of Consumer Complaints

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. 

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link 

As you work, answer the following questions: 
* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

**Bonus:** A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [4]:
import pandas as pd
import numpy as np

from joblib import dump, load

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import re
import glob
from tqdm.notebook import tqdm

from nltk import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords

from collections import Counter
from sklearn.metrics import classification_report

In [6]:
complaints_df = pd.read_csv('../data/complaints.csv')
complaints_df

Unnamed: 0,Consumer complaint narrative,Issue
0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report
1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam
2,I have a particular account that is stating th...,Incorrect information on your report
3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed
4,Hello i'm writing regarding account on my cred...,Incorrect information on your report
...,...,...
353427,Collections account I have no knowledge of,Attempts to collect debt not owed
353428,"Dear CFPB Team, The reason for my complaint is...",Attempts to collect debt not owed
353429,FRCA violations : Failing to Follow Debt Dispu...,Attempts to collect debt not owed
353430,"My Father, a XXXX XXXX acquired an HECM rever...",Struggling to pay mortgage


First I will form a model without any preprocessing

In [9]:
X = complaints_df[['Consumer complaint narrative']]
y = complaints_df['Issue']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [11]:
vect = CountVectorizer()

X_train_vec = vect.fit_transform(X_train['Consumer complaint narrative'])
X_test_vec = vect.transform(X_test['Consumer complaint narrative'])

In [13]:
nb = MultinomialNB().fit(X_train_vec, y_train)

y_pred = nb.predict(X_test_vec)

In [15]:
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.7988976663120487
[[12086  2039   500  3343   323]
 [  587  4476    51   112    85]
 [   67    55  2813   110    42]
 [ 6610  1046   834 46997  1839]
 [   36    40    12    38  4217]]
                                      precision    recall  f1-score   support

   Attempts to collect debt not owed       0.62      0.66      0.64     18291
               Communication tactics       0.58      0.84      0.69      5311
                       Fraud or scam       0.67      0.91      0.77      3087
Incorrect information on your report       0.93      0.82      0.87     57326
          Struggling to pay mortgage       0.65      0.97      0.78      4343

                            accuracy                           0.80     88358
                           macro avg       0.69      0.84      0.75     88358
                        weighted avg       0.82      0.80      0.80     88358



In [21]:
feature_names = vect.get_feature_names_out()
log_probs = nb.feature_log_prob_
word_importance_df1 = pd.DataFrame(log_probs.T, index=feature_names, columns=nb.classes_)
#sorted_df = word_importance_df1.sort_values(by=log_probs.T, ascending=False)  
#sorted_df.head(10)
word_importance_df1

Unnamed: 0,Attempts to collect debt not owed,Communication tactics,Fraud or scam,Incorrect information on your report,Struggling to pay mortgage
00,-5.477826,-6.198689,-4.922377,-5.136400,-5.667744
000,-10.941502,-11.212767,-9.688146,-11.256993,-9.330217
0000,-16.035252,-14.580063,-13.936641,-15.077449,-14.163850
00000,-16.035252,-14.580063,-14.629788,-15.077449,-15.262462
000000000000000000000000000000000000000000000000000000000000000000000,-16.035252,-14.580063,-14.629788,-15.547452,-15.262462
...,...,...,...,...,...
zwicke,-15.342105,-14.580063,-14.629788,-17.156890,-15.262462
zwicker,-11.974809,-12.015113,-14.629788,-14.671984,-15.262462
zwickerpc,-16.035252,-13.886915,-14.629788,-17.156890,-15.262462
zwickers,-15.342105,-14.580063,-14.629788,-17.156890,-15.262462


In [25]:
mislabel_df1 = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
mislabel_df1['correct'] = mislabel_df1['actual'] == mislabel_df1['predicted']

In [27]:
misclassified_df1 = (
    mislabel_df1.groupby('actual')
    .apply(lambda g: 100 * (~g['correct']).mean())
    .reset_index()
    .rename(columns={0: 'Misclassification (%)'})
)

misclassified_df1

  .apply(lambda g: 100 * (~g['correct']).mean())


Unnamed: 0,actual,Misclassification (%)
0,Attempts to collect debt not owed,33.923788
1,Communication tactics,15.722086
2,Fraud or scam,8.875931
3,Incorrect information on your report,18.018002
4,Struggling to pay mortgage,2.90122


Now lets try some preprocessing, including lowercasing, excluding stopwords, stemming, and lemmatization.

In [30]:
#from nltk.stem import PorterStemmer, WordNetLemmatizer

#complaints_list = list(complaints_df[['Consumer complaint narrative']])
#token_list = []

#for complaint in complaints_list:
    #complaints_tokens = set(regexp_tokenize(complaint, r"[A-Za-z]+(?:['’][A-Za-z]+)?"))
    #complaints_tokens = list(complaints_tokens)
    #token_list.extend(complaints_tokens)


#stop_words = set(stopwords.words('english'))
#token_list = [x.lower() for x in token_list if x.lower() not in stop_words]
#porter = PorterStemmer()
#token_list  = [porter.stem(x) for x in token_list if x not in stop_words]
#wnl = WordNetLemmatizer()
#token_list = [wnl.lemmatize(x) for x in token_list if x not in stop_words]

In [32]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()
wnl = WordNetLemmatizer()

def clean_and_tokenize(text):
    tokens = regexp_tokenize(text, r"[A-Za-z]+(?:['’][A-Za-z]+)?")
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [porter.stem(t) for t in tokens]
    tokens = [wnl.lemmatize(t) for t in tokens]
    return ' '.join(tokens)  

In [34]:
complaints_df['cleaned_text'] = complaints_df['Consumer complaint narrative'].apply(clean_and_tokenize)

In [36]:
X = vect.fit_transform(complaints_df['cleaned_text'])  
y = complaints_df['Issue'].values

In [38]:
print(X.shape, len(y))

(353432, 53567) 353432


In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321, stratify=y)

In [42]:
#X_train_vec = vect.fit_transform(X_train)
#X_test_vec = vect.transform(X_test)

In [44]:
nb = MultinomialNB().fit(X_train, y_train)

y_pred2 = nb.predict(X_test)

In [46]:
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

Accuracy: 0.8033443207379009
[[ 9686  1589   352  2771   235]
 [  478  3567    34   103    67]
 [   52    50  2247    90    30]
 [ 5249   759   595 37916  1342]
 [   27    33     8    37  3370]]
                                      precision    recall  f1-score   support

   Attempts to collect debt not owed       0.63      0.66      0.64     14633
               Communication tactics       0.59      0.84      0.70      4249
                       Fraud or scam       0.69      0.91      0.79      2469
Incorrect information on your report       0.93      0.83      0.87     45861
          Struggling to pay mortgage       0.67      0.97      0.79      3475

                            accuracy                           0.80     70687
                           macro avg       0.70      0.84      0.76     70687
                        weighted avg       0.82      0.80      0.81     70687



In [50]:
feature_names2 = vect.get_feature_names_out()
log_probs2 = nb.feature_log_prob_
word_importance_df2 = pd.DataFrame(log_probs2.T, index=feature_names2, columns=nb.classes_)
#sorted_df2 = word_importance_df2.sort_values(by=log_probs.T, ascending=False)  
#sorted_df2.head(10)
word_importance_df2

Unnamed: 0,Attempts to collect debt not owed,Communication tactics,Fraud or scam,Incorrect information on your report,Struggling to pay mortgage
aa,-12.600603,-12.448425,-14.120264,-12.593616,-12.366037
aaa,-13.465601,-12.266103,-14.120264,-15.127313,-13.665320
aaaaan,-15.545042,-13.364716,-14.120264,-16.736751,-14.763932
aaac,-15.545042,-14.057863,-14.120264,-15.350456,-14.763932
aaadvantag,-15.545042,-14.057863,-14.120264,-16.043603,-14.763932
...,...,...,...,...,...
zwick,-14.851895,-14.057863,-14.120264,-16.736751,-14.763932
zwicker,-11.340350,-11.013340,-14.120264,-14.251844,-14.763932
zwickerpc,-15.545042,-13.364716,-14.120264,-16.736751,-14.763932
zxxxx,-15.545042,-14.057863,-14.120264,-16.736751,-14.763932


In [52]:
mislabel_df2 = pd.DataFrame({'actual': y_test, 'predicted': y_pred2})
mislabel_df2['correct'] = mislabel_df2['actual'] == mislabel_df2['predicted']

In [54]:
misclassified_df2 = (
    mislabel_df2.groupby('actual')
    .apply(lambda g: 100 * (~g['correct']).mean())
    .reset_index()
    .rename(columns={0: 'Misclassification (%)'})
)

misclassified_df2

  .apply(lambda g: 100 * (~g['correct']).mean())


Unnamed: 0,actual,Misclassification (%)
0,Attempts to collect debt not owed,33.807148
1,Communication tactics,16.050835
2,Fraud or scam,8.991495
3,Incorrect information on your report,17.324088
4,Struggling to pay mortgage,3.021583


Now lets see if a tfidf vectorizer leads to a better model than count vectorizer

In [57]:
tfidf = TfidfVectorizer()

X = tfidf.fit_transform(complaints_df['cleaned_text'])  
y = complaints_df['Issue'].values

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321, stratify=y)

In [61]:
nb = MultinomialNB().fit(X_train, y_train)

y_pred2 = nb.predict(X_test)

In [63]:
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

Accuracy: 0.7991285526334404
[[ 6540   160    13  7904    16]
 [ 1402  1694     1  1148     4]
 [  265    10  1195   994     5]
 [ 1093    11     9 44655    93]
 [   47     9     1  1014  2404]]
                                      precision    recall  f1-score   support

   Attempts to collect debt not owed       0.70      0.45      0.55     14633
               Communication tactics       0.90      0.40      0.55      4249
                       Fraud or scam       0.98      0.48      0.65      2469
Incorrect information on your report       0.80      0.97      0.88     45861
          Struggling to pay mortgage       0.95      0.69      0.80      3475

                            accuracy                           0.80     70687
                           macro avg       0.87      0.60      0.69     70687
                        weighted avg       0.80      0.80      0.78     70687



In [142]:
feature_names3 = tfidf.get_feature_names_out()
log_probs3 = nb.feature_log_prob_
word_importance_df3 = pd.DataFrame(log_probs3.T, index=feature_names3, columns=nb.classes_)
#sorted_df3 = word_importance_df3.sort_values(by=log_probs.T, ascending=False)  
#sorted_df3.head(10)
word_importance_df3

Unnamed: 0,Attempts to collect debt not owed,Communication tactics,Fraud or scam,Incorrect information on your report,Struggling to pay mortgage
aa,-12.604397,-12.564564,-12.818032,-12.939952,-12.686556
aa report,-13.662844,-12.938082,-12.818032,-14.039034,-12.955729
aa report credit,-13.662844,-12.938082,-12.818032,-14.039034,-12.955729
aaf,-11.942611,-12.610666,-12.818032,-12.816632,-12.976823
aargon,-10.637115,-11.953895,-12.818032,-12.991415,-12.976823
...,...,...,...,...,...
zombi debt,-12.129689,-12.404853,-12.818032,-13.466302,-12.976823
zone,-12.944587,-11.159083,-12.344983,-13.072821,-11.665814
zoom,-13.243207,-12.851438,-12.409287,-14.293242,-12.836540
zwicker,-11.888668,-11.852405,-12.818032,-13.858857,-12.976823


In [69]:
mislabel_df3 = pd.DataFrame({'actual': y_test, 'predicted': y_pred2})
mislabel_df3['correct'] = mislabel_df3['actual'] == mislabel_df3['predicted']

In [71]:
misclassified_df3 = (
    mislabel_df3.groupby('actual')
    .apply(lambda g: 100 * (~g['correct']).mean())
    .reset_index()
    .rename(columns={0: 'Misclassification (%)'})
)

misclassified_df3

  .apply(lambda g: 100 * (~g['correct']).mean())


Unnamed: 0,actual,Misclassification (%)
0,Attempts to collect debt not owed,55.306499
1,Communication tactics,60.131796
2,Fraud or scam,51.599838
3,Incorrect information on your report,2.629685
4,Struggling to pay mortgage,30.820144


The tfidf vectorizer above used only a unigram, lets try it with bigrams and trigrams

In [74]:
tfidf = TfidfVectorizer(ngram_range=(1,3), min_df = 20)

X = tfidf.fit_transform(complaints_df['cleaned_text'])  
y = complaints_df['Issue'].values

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321, stratify=y)

In [78]:
nb = MultinomialNB().fit(X_train, y_train)

y_pred2 = nb.predict(X_test)

In [80]:
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

Accuracy: 0.8633977959172126
[[10936   364    33  3226    74]
 [ 1231  2778     5   201    34]
 [  285    21  1883   262    18]
 [ 3183    48    18 42182   430]
 [   65    22     2   134  3252]]
                                      precision    recall  f1-score   support

   Attempts to collect debt not owed       0.70      0.75      0.72     14633
               Communication tactics       0.86      0.65      0.74      4249
                       Fraud or scam       0.97      0.76      0.85      2469
Incorrect information on your report       0.92      0.92      0.92     45861
          Struggling to pay mortgage       0.85      0.94      0.89      3475

                            accuracy                           0.86     70687
                           macro avg       0.86      0.80      0.83     70687
                        weighted avg       0.87      0.86      0.86     70687



In [82]:
feature_names4 = tfidf.get_feature_names_out()
log_probs4 = nb.feature_log_prob_
word_importance_df4 = pd.DataFrame(log_probs4.T, index=feature_names4, columns=nb.classes_)
#sorted_df4 = word_importance_df4.sort_values(by=log_probs4.T, ascending=False)  
#sorted_df4.head(10)
word_importance_df4

Unnamed: 0,Attempts to collect debt not owed,Communication tactics,Fraud or scam,Incorrect information on your report,Struggling to pay mortgage
aa,-12.604397,-12.564564,-12.818032,-12.939952,-12.686556
aa report,-13.662844,-12.938082,-12.818032,-14.039034,-12.955729
aa report credit,-13.662844,-12.938082,-12.818032,-14.039034,-12.955729
aaf,-11.942611,-12.610666,-12.818032,-12.816632,-12.976823
aargon,-10.637115,-11.953895,-12.818032,-12.991415,-12.976823
...,...,...,...,...,...
zombi debt,-12.129689,-12.404853,-12.818032,-13.466302,-12.976823
zone,-12.944587,-11.159083,-12.344983,-13.072821,-11.665814
zoom,-13.243207,-12.851438,-12.409287,-14.293242,-12.836540
zwicker,-11.888668,-11.852405,-12.818032,-13.858857,-12.976823


In [84]:
mislabel_df4 = pd.DataFrame({'actual': y_test, 'predicted': y_pred2})
mislabel_df4['correct'] = mislabel_df4['actual'] == mislabel_df4['predicted']

In [86]:
misclassified_df4 = (
    mislabel_df4.groupby('actual')
    .apply(lambda g: 100 * (~g['correct']).mean())
    .reset_index()
    .rename(columns={0: 'Misclassification (%)'})
)

misclassified_df4

  .apply(lambda g: 100 * (~g['correct']).mean())


Unnamed: 0,actual,Misclassification (%)
0,Attempts to collect debt not owed,25.264812
1,Communication tactics,34.619911
2,Fraud or scam,23.734305
3,Incorrect information on your report,8.022067
4,Struggling to pay mortgage,6.417266


Now lets try a few other models and see if predictions improve

In [89]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(penalty='l2',
    C=0.5,
    solver='liblinear',
    max_iter=200,
    class_weight='balanced',
    random_state=321,
    verbose = 99)

In [91]:
logreg.fit(X_train, y_train)

[LibLinear]

In [93]:
y_pred2 = logreg.predict(X_test)

In [95]:
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

Accuracy: 0.8839814958903335
[[11059   729   215  2518   112]
 [  503  3583    26   113    24]
 [   43    31  2273   108    14]
 [ 2859   180   189 42263   370]
 [   28    39    19    81  3308]]
                                      precision    recall  f1-score   support

   Attempts to collect debt not owed       0.76      0.76      0.76     14633
               Communication tactics       0.79      0.84      0.81      4249
                       Fraud or scam       0.84      0.92      0.88      2469
Incorrect information on your report       0.94      0.92      0.93     45861
          Struggling to pay mortgage       0.86      0.95      0.91      3475

                            accuracy                           0.88     70687
                           macro avg       0.84      0.88      0.86     70687
                        weighted avg       0.89      0.88      0.88     70687



In [97]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=321, max_depth=5)

In [99]:
clf.fit(X_train, y_train)

In [101]:
y_pred2 = clf.predict(X_test)

In [103]:
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

Accuracy: 0.7782053277123092
[[ 7776   787   149  5821   100]
 [  914  2673    24   609    29]
 [   55   243   860  1296    15]
 [ 3561   265   169 41701   165]
 [   60   153    37  1226  1999]]
                                      precision    recall  f1-score   support

   Attempts to collect debt not owed       0.63      0.53      0.58     14633
               Communication tactics       0.65      0.63      0.64      4249
                       Fraud or scam       0.69      0.35      0.46      2469
Incorrect information on your report       0.82      0.91      0.86     45861
          Struggling to pay mortgage       0.87      0.58      0.69      3475

                            accuracy                           0.78     70687
                           macro avg       0.73      0.60      0.65     70687
                        weighted avg       0.77      0.78      0.77     70687



In [105]:
feature_names5 = tfidf.get_feature_names_out()
importances = clf.feature_importances_

importance_df = pd.DataFrame({
    'word': feature_names5,
    'importance': importances
})

importance_df = importance_df.sort_values(by='importance', ascending=False)
importance_df.head(10)


Unnamed: 0,word,importance
66005,debt,0.31471
186005,report,0.230687
34413,call,0.153263
141897,mortgag,0.12432
44387,collect,0.065105
140303,money,0.039293
199414,scam,0.022467
58281,credit,0.013321
139821,modif,0.011399
215606,strike,0.011285


In [107]:
mislabel_df5 = pd.DataFrame({'actual': y_test, 'predicted': y_pred2})
mislabel_df5['correct'] = mislabel_df5['actual'] == mislabel_df5['predicted']

In [109]:
misclassified_df5 = (
    mislabel_df5.groupby('actual')
    .apply(lambda g: 100 * (~g['correct']).mean())
    .reset_index()
    .rename(columns={0: 'Misclassification (%)'})
)

misclassified_df5

  .apply(lambda g: 100 * (~g['correct']).mean())


Unnamed: 0,actual,Misclassification (%)
0,Attempts to collect debt not owed,46.859837
1,Communication tactics,37.09108
2,Fraud or scam,65.168084
3,Incorrect information on your report,9.070888
4,Struggling to pay mortgage,42.47482


In [111]:
from sklearn.ensemble import GradientBoostingClassifier

grad_clf = GradientBoostingClassifier(n_estimators=50,       
    learning_rate=1, 
    max_depth=3,                                
    random_state=321,
    verbose=99)       

In [113]:
grad_clf.fit(X_train, y_train)

      Iter       Train Loss   Remaining Time 
         1           0.7569           65.10m
         2           0.8618           63.47m
         3      693405.9437           61.83m
         4      693406.6671           60.18m
         5 2056308796538417645796968278362777866191050539501910116847706905515682672655407629002077202573831152647999484826380348162048.0000           58.69m
         6 2056308796538417645796968278362777866191050539501910116847706905515682672655407629002077202573831152647999484826380348162048.0000           57.19m
         7 2056308796538417645796968278362777866191050539501910116847706905515682672655407629002077202573831152647999484826380348162048.0000           55.77m
         8 2056308796538417645796968278362777866191050539501910116847706905515682672655407629002077202573831152647999484826380348162048.0000           54.49m
         9 2056308796538417645796968278362777866191050539501910116847706905515682672655407629002077202573831152647999484826380348162048.0000  

In [115]:
dump(grad_clf, "../models/cv_01.joblib")

FileNotFoundError: [Errno 2] No such file or directory: '../models/cv_01.joblib'

In [117]:
grad_clf = load("../models/cv_01.joblib")

FileNotFoundError: [Errno 2] No such file or directory: '../models/cv_01.joblib'

In [119]:
y_pred2 = grad_clf.predict(X_test)

In [121]:
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

Accuracy: 0.7801293024177006
[[ 8685   211   253  5403    81]
 [ 3400   226    37   578     8]
 [  364    66  1188   843     8]
 [ 2620   137   280 42602   222]
 [  532     8    71   420  2444]]
                                      precision    recall  f1-score   support

   Attempts to collect debt not owed       0.56      0.59      0.57     14633
               Communication tactics       0.35      0.05      0.09      4249
                       Fraud or scam       0.65      0.48      0.55      2469
Incorrect information on your report       0.85      0.93      0.89     45861
          Struggling to pay mortgage       0.88      0.70      0.78      3475

                            accuracy                           0.78     70687
                           macro avg       0.66      0.55      0.58     70687
                        weighted avg       0.76      0.78      0.76     70687



In [123]:
feature_names6 = tfidf.get_feature_names_out()
importances2 = grad_clf.feature_importances_

importance_df2 = pd.DataFrame({
    'word': feature_names6,
    'importance': importances2
})

importance_df2 = importance_df2.sort_values(by='importance', ascending=False)

importance_df2 = importance_df2[importance_df2['importance'] > 0]

importance_df2.head(10)

Unnamed: 0,word,importance
34413,call,0.453565
186005,report,0.35821
182110,refund,0.064186
66005,debt,0.032416
141897,mortgag,0.017994
58281,credit,0.013552
139821,modif,0.010271
44387,collect,0.009091
140303,money,0.008701
158630,owe,0.007746


In [125]:
mislabel_df6 = pd.DataFrame({'actual': y_test, 'predicted': y_pred2})
mislabel_df6['correct'] = mislabel_df6['actual'] == mislabel_df6['predicted']

In [127]:
misclassified_df6 = (
    mislabel_df6.groupby('actual')
    .apply(lambda g: 100 * (~g['correct']).mean())
    .reset_index()
    .rename(columns={0: 'Misclassification (%)'})
)

misclassified_df6

  .apply(lambda g: 100 * (~g['correct']).mean())


Unnamed: 0,actual,Misclassification (%)
0,Attempts to collect debt not owed,40.647851
1,Communication tactics,94.681101
2,Fraud or scam,51.883354
3,Incorrect information on your report,7.106256
4,Struggling to pay mortgage,29.669065


Now choose the model with the best results and find the words or phrases that influence the model most

In [130]:
logreg.fit(X_train, y_train)

[LibLinear]

In [132]:
y_pred2 = logreg.predict(X_test)

In [134]:
logreg.coef_

array([[ 8.70681639e-02, -5.06361713e-03, -5.06361713e-03, ...,
         7.30991778e-02,  4.60185152e-01,  1.77317548e-01],
       [-6.03178134e-02, -2.58961327e-03, -2.58961327e-03, ...,
        -7.92801876e-02,  8.91071385e-02,  2.26143289e-01],
       [-8.46965251e-02, -4.22564853e-04, -4.22564853e-04, ...,
         1.05260070e-01, -1.85753273e-01, -6.17144372e-02],
       [ 4.09665147e-02,  1.27955235e-02,  1.27955235e-02, ...,
        -6.19779217e-02, -2.16307630e-01, -1.30591556e-01],
       [-5.43522194e-02, -1.75004709e-02, -1.75004709e-02, ...,
        -4.11970461e-02, -9.90869263e-02, -5.54515772e-02]])

In [136]:
word_importance_df = pd.DataFrame({
    'word': tfidf.get_feature_names_out(),
    'coef': logreg.coef_[0]
})

word_importance_df.sort_values(by = 'coef', ascending = False)

Unnamed: 0,word,coef
66005,debt,10.292492
44387,collect,9.721083
158630,owe,8.293847
30488,bill,4.969906
204035,servic,4.412179
...,...,...
123900,late,-6.818254
116022,inquiri,-7.222541
233432,transunion,-9.187165
80575,equifax,-9.492708


In [138]:
mislabel_df = pd.DataFrame({'actual': y_test, 'predicted': y_pred2})
mislabel_df['correct'] = mislabel_df['actual'] == mislabel_df['predicted']

In [140]:
misclassified_df = (
    mislabel_df.groupby('actual')
    .apply(lambda g: 100 * (~g['correct']).mean())
    .reset_index()
    .rename(columns={0: 'Misclassification (%)'})
)

misclassified_df

  .apply(lambda g: 100 * (~g['correct']).mean())


Unnamed: 0,actual,Misclassification (%)
0,Attempts to collect debt not owed,24.424247
1,Communication tactics,15.674276
2,Fraud or scam,7.938437
3,Incorrect information on your report,7.845446
4,Struggling to pay mortgage,4.805755
