<a href="https://colab.research.google.com/github/prasvijaya/datascienceportfolio/blob/master/Topic_modelling_using_NMF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one example of a topic model: it builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions, and uses them to assign the text in a document to a particular topic. This notebook applies Non-negative Matrix Factorization (NMF) to TF-IDF features for the same purpose.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import spacy
import string
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
df= pd.read_csv('https://github.com/srivatsan88/YouTubeLI/blob/master/dataset/consumer_compliants.zip?raw=true', compression='zip', delimiter=',', quotechar='"')
df.shape

(57453, 18)

In [5]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,4/3/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Fraudulent loan,This auto loan was opened on XX/XX/2020 in XXX...,Company has responded to the consumer and the ...,TRUIST FINANCIAL CORPORATION,PA,,,Consent provided,Web,4/3/2020,Closed with explanation,Yes,,3591341
1,3/12/2020,Debt collection,Payday loan debt,Attempts to collect debt not owed,Debt is not yours,In XXXX of 2019 I noticed a debt for {$620.00}...,,CURO Intermediate Holdings,CO,806XX,,Consent provided,Web,3/12/2020,Closed with explanation,Yes,,3564184
2,2/6/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Credit denial,"As stated from Capital One, XXXX XX/XX/XXXX an...",,CAPITAL ONE FINANCIAL CORPORATION,OH,430XX,,Consent provided,Web,2/6/2020,Closed with explanation,Yes,,3521949
3,3/6/2020,Checking or savings account,Savings account,Managing an account,Banking errors,"Please see CFPB case XXXX. \n\nCapital One, in...",,CAPITAL ONE FINANCIAL CORPORATION,CA,,,Consent provided,Web,3/6/2020,Closed with explanation,Yes,,3556237
4,2/14/2020,Debt collection,Medical debt,Attempts to collect debt not owed,Debt is not yours,This debt was incurred due to medical malpract...,Company believes it acted appropriately as aut...,"Merchants and Professional Bureau, Inc.",OH,432XX,,Consent provided,Web,2/14/2020,Closed with explanation,Yes,,3531704


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57453 entries, 0 to 57452
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Date received                 57453 non-null  object 
 1   Product                       57453 non-null  object 
 2   Sub-product                   57453 non-null  object 
 3   Issue                         57453 non-null  object 
 4   Sub-issue                     57453 non-null  object 
 5   Consumer complaint narrative  57453 non-null  object 
 6   Company public response       57453 non-null  object 
 7   Company                       57453 non-null  object 
 8   State                         57453 non-null  object 
 9   ZIP code                      57453 non-null  object 
 10  Tags                          57453 non-null  object 
 11  Consumer consent provided?    57453 non-null  object 
 12  Submitted via                 57453 non-null  object 
 13  D

In [7]:
df.nunique()

Date received                     354
Product                             6
Sub-product                        32
Issue                              44
Sub-issue                         160
Consumer complaint narrative    55390
Company public response            11
Company                          2197
State                              60
ZIP code                         2874
Tags                                4
Consumer consent provided?          1
Submitted via                       1
Date sent to company              359
Company response to consumer        5
Timely response?                    2
Consumer disputed?                  0
Complaint ID                    57453
dtype: int64

In [8]:
df['Product'].value_counts()

Debt collection                21772
Credit card or prepaid card    13193
Mortgage                        9799
Checking or savings account     7003
Student loan                    2950
Vehicle loan or lease           2736
Name: Product, dtype: int64

Although the dataset has many columns, we keep only the three most informative ones for topic modelling: Product, Company and Consumer complaint narrative.

In [9]:
complaints_df= df[['Product', 'Company', 'Consumer complaint narrative']].rename(columns={'Consumer complaint narrative':'Complaints'})

In [10]:
pd.set_option('display.max_colwidth', None)  # None shows the full text; -1 is deprecated and triggers a warning
complaints_df


Unnamed: 0,Product,Company,Complaints
0,Vehicle loan or lease,TRUIST FINANCIAL CORPORATION,"This auto loan was opened on XX/XX/2020 in XXXX, NC with BB & T in my name. I have NEVER been to North Carolina and I have NEVER been a resident. I have filed a dispute twice through my credit bureaus but both times BB & T has claimed that this is an accurate loan. Which I wasn't aware of until today. I have tried to contact BB & T multiple times but I have never gotten through to a live person. I do n't drive and I have never owned a car before. I didn't have any knowledge of this account until I checked XXXXXXXX XXXX and noticed it. I've tried twice to dispute it. Additionally I never received any bills or information about this account. This is my last resort in trying to remove this fraudulent loan off of my account."
1,Debt collection,CURO Intermediate Holdings,"In XXXX of 2019 I noticed a debt for {$620.00} on my credit which i believed was mine I thought speedy cash had bought one of my old debts and sold it to XXXX XXXX XXXX XXXX. I contacted XXXX XXXX XXXX XXXX and after several attempts of giving my full name, nothing came up in their system. I gave my social and the rep said the account popped up but DID NOT tell me that the account was under someone elses name and continued to let me make a payment. The payment was for {$120.00}. Confirmation number-XXXX. After realizing it was not my account, I called back to get my money back and inform them of the mistake. I was told i needed to mail them an FTC report and dispute letter to get my money back. I completed all of this and when i called again they said they transferred the account back to speedy cash for fraud review and I would need to contact them. After contacting them i was again told that i can not get my money back. The issue im having is this representative at XXXX XXXX played blind to obvious fraud and let an innocent person make a payment on someone elses debt and i want my money back."
2,Vehicle loan or lease,CAPITAL ONE FINANCIAL CORPORATION,"As stated from Capital One, XXXX XX/XX/XXXX and XXXX 2018, My wife and I went to several car dealerships to request for a car loan to get a used car. However, according to their credit requirements unfortunately my credit score was insufficient for the car loan approval at that time. It seemed as though they pulled my credit report multiple times."
3,Checking or savings account,CAPITAL ONE FINANCIAL CORPORATION,"Please see CFPB case XXXX. \n\nCapital One, in the letter they provided ( and attached to that case as their response ) said this : "" The funds were reversed and sent back to XXXX XXXX XXXX on XX/XX/XXXX ''. \n\nXXXX XXXX XXXX ( now XXXX XXXX ) has not received these funds. Staff at XXXX XXXX - and also staff at the account-holder 's business - have looked for return of my money ( {$650.00} ) and find nothing. \n\nCapital One needs to document - actually prove - they returned the funds, as stated in their letter. Capital One must provide electronic information, if the return was made that way, or document the paper check they sent back to XXXX XXXX. \n\nI've left 3 messages about this problem for the person who signed the letter ( XXXX ) from Capital One. I have received no call-backs. \n\nSummary : Capital One said they returned my money on XX/XX/XXXX : they did not. If they continue claim they did, then they need to prove that."
4,Debt collection,"Merchants and Professional Bureau, Inc.","This debt was incurred due to medical malpractice ( XXXX XXXX XXXX, XXXX, TX ). I asked the doctor to turn over my claim to his malpractice insurance company. This has cost me thousands of dollars to XXXX XXXX XXXX. I am still trying to collect damages from this doctor. He never responded and turned over me to collections Merchants and Professional Collection Bureau , Inc. I sent them a letter describing exactly this issue and instead of not contacting me and verifying my debt they start reporting this debt to the credit reporting agencies. They never verified the debt, like I asked and they never stopped it from being reported when I specifically told them not to, due to the circumstances above."
...,...,...,...
57448,Student loan,"Nelnet, Inc.","I am attempting to make a payment toward my student loans on the Nelnet website today, XX/XX/20, and Nelnet will not allow me to post the payment sooner than XX/XX/20. By the time the payment posts, 2-3 days of additional interest will have accrued and my payments will apply more to interest than is due today, the day that I'm attempting to pay. My understanding was that I could make a payment at any time but this does not appear to be true. The funds are available in my bank account today regardless of whether Nelnet can collect over the weekend. I should not be penalized for this. \n\nI submitted complaint XXXX in XXXX for other deceptive practices with Nelnet. They have not yet resolved the issue identified in that complaint or contacted me as they said they would in their response. I believe this new issue is just one more deceptive practice by this company that causes financial harm to borrowers."
57449,Debt collection,"The Receivable Management Services LLC, New York, NY Branch",Received letter for {$480.00}. Original creditor didnt contact me until past statute of limitations for insurance company recoupment per Arizona law. Debt collection is illegal for phantom debt. Additionally they are phoning my office excessively.
57450,Debt collection,"Convergent Resources, Inc.","entire time 10 years until XX/XX/2020. XXXX makes my blood boil. I have called and was lied to told to provide my checking account information over the phone in order to turn my cell phone back on. i called at XXXX them at XXXX {$300.00} was added to my bill. \n\nScam scam scam I was told I can not call the office of the President just to write to XXXX XXXX XXXX XXXX XXXX XXXX XXXX, NM XXXX. I did three thousand times. the last letter I mailed on XX/XX/2020. Two collection agencies later. \n\nI chose to leave XXXX XXXX every time I called the XXXX supervisor would threaten me on a recorded line. I need peace of mind and a good Heart to beat inside of me. Im on a XXXX XXXX due to the stress at XXXX XXXX taking all my money 4 10 years."
57451,Checking or savings account,WELLS FARGO & COMPANY,"I am a customer with Wells Fargo Bank. Recently money was withdrawn on a couple of occasions without my permission or consent to pay for a timeshare account that was never used by me nor anyone connected to me because of unfair policies pertaining to the fees of the said timeshare. I tried cancelling the said timeshare account several times because of these fees that were never mentioned at the initiation. My account was debited to pay for the timeshare fees without my knowledge or consent several times. I tried correcting this with Wells Fargo bank with no avail. I would appreciate it if you can look into this matter for me. I was left with no funds in my account and as such I could not take care of the basic necessities of my day to day life. \nThanks in advance,"


In [11]:
#Initial Cleanup
complaints_df['Complaints']= complaints_df['Complaints'].str.lower()

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

# remove punctuation from every complaint
complaints_df['Complaints']= complaints_df['Complaints'].apply(remove_punctuation)

complaints_df.head()

Unnamed: 0,Product,Company,Complaints
0,Vehicle loan or lease,TRUIST FINANCIAL CORPORATION,this auto loan was opened on xxxx2020 in xxxx nc with bb t in my name i have never been to north carolina and i have never been a resident i have filed a dispute twice through my credit bureaus but both times bb t has claimed that this is an accurate loan which i wasnt aware of until today i have tried to contact bb t multiple times but i have never gotten through to a live person i do nt drive and i have never owned a car before i didnt have any knowledge of this account until i checked xxxxxxxx xxxx and noticed it ive tried twice to dispute it additionally i never received any bills or information about this account this is my last resort in trying to remove this fraudulent loan off of my account
1,Debt collection,CURO Intermediate Holdings,in xxxx of 2019 i noticed a debt for 62000 on my credit which i believed was mine i thought speedy cash had bought one of my old debts and sold it to xxxx xxxx xxxx xxxx i contacted xxxx xxxx xxxx xxxx and after several attempts of giving my full name nothing came up in their system i gave my social and the rep said the account popped up but did not tell me that the account was under someone elses name and continued to let me make a payment the payment was for 12000 confirmation numberxxxx after realizing it was not my account i called back to get my money back and inform them of the mistake i was told i needed to mail them an ftc report and dispute letter to get my money back i completed all of this and when i called again they said they transferred the account back to speedy cash for fraud review and i would need to contact them after contacting them i was again told that i can not get my money back the issue im having is this representative at xxxx xxxx played blind to obvious fraud and let an innocent person make a payment on someone elses debt and i want my money back
2,Vehicle loan or lease,CAPITAL ONE FINANCIAL CORPORATION,as stated from capital one xxxx xxxxxxxx and xxxx 2018 my wife and i went to several car dealerships to request for a car loan to get a used car however according to their credit requirements unfortunately my credit score was insufficient for the car loan approval at that time it seemed as though they pulled my credit report multiple times
3,Checking or savings account,CAPITAL ONE FINANCIAL CORPORATION,please see cfpb case xxxx \n\ncapital one in the letter they provided and attached to that case as their response said this the funds were reversed and sent back to xxxx xxxx xxxx on xxxxxxxx \n\nxxxx xxxx xxxx now xxxx xxxx has not received these funds staff at xxxx xxxx and also staff at the accountholder s business have looked for return of my money 65000 and find nothing \n\ncapital one needs to document actually prove they returned the funds as stated in their letter capital one must provide electronic information if the return was made that way or document the paper check they sent back to xxxx xxxx \n\nive left 3 messages about this problem for the person who signed the letter xxxx from capital one i have received no callbacks \n\nsummary capital one said they returned my money on xxxxxxxx they did not if they continue claim they did then they need to prove that
4,Debt collection,"Merchants and Professional Bureau, Inc.",this debt was incurred due to medical malpractice xxxx xxxx xxxx xxxx tx i asked the doctor to turn over my claim to his malpractice insurance company this has cost me thousands of dollars to xxxx xxxx xxxx i am still trying to collect damages from this doctor he never responded and turned over me to collections merchants and professional collection bureau inc i sent them a letter describing exactly this issue and instead of not contacting me and verifying my debt they start reporting this debt to the credit reporting agencies they never verified the debt like i asked and they never stopped it from being reported when i specifically told them not to due to the circumstances above
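
One side effect of deleting every character in string.punctuation is visible in the output above: dollar amounts such as {$620.00} lose their decimal point and become 62000, and redacted dates such as XX/XX/2020 collapse into xxxx2020. The sketch below reproduces this on a single sentence. The helper replace_punctuation_with_space is a hypothetical variant added only for comparison; it is not part of the original notebook.

```python
import string

PUNCT_TO_REMOVE = string.punctuation

def remove_punctuation(text):
    """Delete punctuation outright (the approach used in the notebook)."""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

def replace_punctuation_with_space(text):
    """Hypothetical variant: map each punctuation mark to a space instead."""
    table = str.maketrans(PUNCT_TO_REMOVE, ' ' * len(PUNCT_TO_REMOVE))
    # split/join collapses the runs of spaces left behind
    return ' '.join(text.translate(table).split())

sample = "a debt for {$620.00} opened on xx/xx/2020"
removed = remove_punctuation(sample)
spaced = replace_punctuation_with_space(sample)
print(removed)
print(spaced)
```

Which behaviour is preferable depends on whether amounts and dates should survive as tokens at all; the rest of the notebook keeps the deletion approach.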


In [12]:
from sklearn.model_selection import train_test_split

# Split the dataframe into a training set (40%) and a hold-out set (60%)
X_train, X_hold = train_test_split(complaints_df, test_size=0.6, random_state=999)

In [13]:
X_train['Product'].value_counts()

Debt collection                8676
Credit card or prepaid card    5254
Mortgage                       3939
Checking or savings account    2819
Student loan                   1210
Vehicle loan or lease          1083
Name: Product, dtype: int64

In [14]:
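stemmer= PorterStemmer()

def tokenize(text):
  # keep tokens longer than 3 characters that still have more than 2 characters
  # once leading/trailing redaction markers (X, x, /) are stripped
  tokens= [word for word in nltk.word_tokenize(text) if len(word) > 3 and len(word.strip('Xx/')) > 2]
  # NOTE: stems are computed but never returned, so the vectorizer receives the raw tokens
  # (the topic words shown later, e.g. 'payments' and 'loans', are accordingly unstemmed)
  stems= [stemmer.stem(WordNetLemmatizer().lemmatize(items)) for items in tokens]
  return tokens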
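
The filter inside tokenize is easier to follow on a concrete sentence. The sketch below applies the same condition to a whitespace-split string; str.split() stands in for nltk.word_tokenize (an assumption, made so the example runs without the punkt data), and keep_token is a hypothetical helper name.

```python
def keep_token(word):
    """Same condition as the list comprehension in tokenize()."""
    return len(word) > 3 and len(word.strip('Xx/')) > 2

# str.split() is a stand-in for nltk.word_tokenize in this sketch
sample = "on xxxx2020 i called xxxx bank in tx about the loan"
kept = [w for w in sample.split() if keep_token(w)]
print(kept)
```

Fully redacted tokens such as xxxx are dropped because nothing is left after stripping, while xxxx2020 survives because the digits remain. Short words (on, in, tx, the) are removed by the length check alone.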

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
# NMF needs non-negative input, so TF-IDF weights are suitable
tf_vectorizer= TfidfVectorizer(tokenizer=tokenize, stop_words='english', max_features=10000, max_df=0.75, min_df=50)
tf_vectors= tf_vectorizer.fit_transform(X_train.Complaints)
tf_vectors.shape

(22981, 3003)
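
The min_df and max_df arguments prune the vocabulary before max_features applies, which is presumably why the matrix above has 3003 columns rather than 10000. An integer min_df is a minimum document count, and a float max_df is a maximum fraction of documents. The toy corpus below is an assumption made for illustration and uses the default tokenizer rather than the custom tokenize function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: 'loan' is in every document, 'payment' and 'card' in two each,
# and every other word appears in a single document.
docs = [
    "loan payment late",
    "loan payment mortgage",
    "loan card charge",
    "loan card fraud",
]

# min_df=2 drops words seen in fewer than 2 documents;
# max_df=0.75 drops words seen in more than 75% of documents.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
X = vectorizer.fit_transform(docs)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X.shape)
```

Only the two mid-frequency words remain: the rare words fall below min_df and the ubiquitous one exceeds max_df.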

In [16]:
tf_vectors[:10].toarray()  # densify only the first 10 rows

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.15337845, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [17]:
tf_vectorizer.get_feature_names()

['1000',
 '10000',
 '100000',
 '1000000',
 '10000000',
 '11000',
 '110000',
 '1100000',
 '1200',
 '12000',
 '120000',
 '1200000',
 '13000',
 '130000',
 '1300000',
 '14000',
 '140000',
 '1400000',
 '1500',
 '15000',
 '150000',
 '1500000',
 '16000',
 '160000',
 '1600000',
 '1681c2',
 '1681m',
 '1692',
 '1692g',
 '17000',
 '170000',
 '18000',
 '180000',
 '19000',
 '190000',
 '2000',
 '20000',
 '200000',
 '2000000',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020',
 '21000',
 '210000',
 '22000',
 '220000',
 '23000',
 '230000',
 '24000',
 '240000',
 '2448',
 '2500',
 '25000',
 '250000',
 '2500000',
 '26000',
 '260000',
 '2700',
 '27000',
 '270000',
 '2800',
 '28000',
 '280000',
 '2900',
 '29000',
 '290000',
 '3000',
 '30000',
 '300000',
 '3000000',
 '31000',
 '32000',
 '320000',
 '33000',
 '330000',
 '34000',
 '340000',
 '3500',
 '35000',
 '350000',
 '3600',
 '36000',
 '360000',
 '37000',
 '3800',
 '38000',
 '3900',
 '39000',
 '4000',
 '40000',
 '400000',
 '41000',
 '43000',
 '44000',
 

In [18]:
from sklearn.decomposition import NMF

nmf= NMF(n_components=6, random_state=999, alpha=0.1, l1_ratio=.5, init='nndsvd')

W1= nmf.fit_transform(tf_vectors)
H1= nmf.components_
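
To make the factorization concrete: NMF approximates the non-negative document-term matrix V (tf_vectors here) by a product W·H, where W holds document-topic weights and H holds topic-term weights. The sketch below runs the classic multiplicative update rules on a tiny hand-made matrix. It is an illustrative stand-in, not the solver scikit-learn runs, and it ignores the regularization set through alpha and l1_ratio. The function name nmf_multiplicative and the toy matrix are assumptions.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Two "documents" use the first two terms, two use the last two terms
V = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 4, 1],
              [0, 0, 1, 4]], dtype=float)

W, H = nmf_multiplicative(V, k=2)
error = np.linalg.norm(V - W @ H)
print(np.round(W, 2))
print(np.round(H, 2))
print(error)
```

Each row of W plays the role of a row of W1, and each row of H plays the role of a row of H1.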

In [19]:
W1

array([[0.01666109, 0.        , 0.02763651, 0.        , 0.        ,
        0.        ],
       [0.00190764, 0.        , 0.00632466, 0.        , 0.        ,
        0.02581004],
       [0.        , 0.        , 0.02353073, 0.        , 0.        ,
        0.        ],
       ...,
       [0.01554095, 0.00252399, 0.00361476, 0.        , 0.01450731,
        0.02145971],
       [0.00768151, 0.02274074, 0.00391149, 0.        , 0.        ,
        0.        ],
       [0.        , 0.06891151, 0.        , 0.        , 0.        ,
        0.        ]])

In [20]:
H1

array([[0.02908341, 0.05906489, 0.07901992, ..., 0.01456046, 0.0126734 ,
        0.03373154],
       [0.        , 0.        , 0.05110936, ..., 0.00520972, 0.0080131 ,
        0.00307255],
       [0.02970536, 0.11159867, 0.10772307, ..., 0.00637386, 0.01091361,
        0.04122088],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.02110725],
       [0.02686155, 0.11731874, 0.06847792, ..., 0.00435401, 0.00583312,
        0.05450711]])

In [21]:
num_words=15

vocab = np.array(tf_vectorizer.get_feature_names())

top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_words-1:-1]]
topic_words = ([top_words(t) for t in H1])
topics = [' '.join(t) for t in topic_words]
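
The slice [:-num_words-1:-1] in top_words is terse, so here it is on a four-word example. np.argsort returns indices in ascending order of weight, and the negative-step slice walks back from the end to pick the num_words largest. The vocabulary and weights are made up for illustration.

```python
import numpy as np

vocab = np.array(['loan', 'debt', 'card', 'bank'])
weights = np.array([0.1, 0.5, 0.3, 0.9])  # one hypothetical row of H1
num_words = 2

order = np.argsort(weights)               # indices sorted by ascending weight
top_idx = order[:-num_words - 1:-1]       # last num_words indices, largest first
top = [str(vocab[i]) for i in top_idx]
print(top)
```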

In [22]:
topics

['loan payment mortgage payments late told paid company month home time called received escrow loans',
 'debt collection credit company report information letter reporting agency validation collect original creditor account sent',
 'account bank check money funds america closed checking told chase called said deposit number accounts',
 'theft identity belong victim affidavit report attached does legal debt result information reported police opened',
 'credit late accordance item misreport aforesaid tuned present unfavorable lately generate relation prefer recommend days60',
 'card credit charges capital charge balance report cards citi fraud chase dispute called limit purchase']

Topics Interpretation (Topic 1 to Topic 6 here correspond to columns Topic0 to Topic5 in the later tables):

Topic 1 contains words like loan, mortgage, payments and escrow, so it likely covers loans and mortgages.

Topic 2 contains words like debt, collection, creditor and validation, so it likely covers debt collection.

Topic 3 contains words like account, check, funds and deposit, so it likely covers checking or savings accounts.

Topic 4 contains words like theft, identity, victim and police, so it clearly covers identity theft and fraudulent activity.

Topic 5 contains words like credit, late, misreport and unfavorable, so it likely covers late payments reported on credit files.

Topic 6 contains words like card, charges, charge and balance, so it likely covers credit card fees or charges.


In [23]:
colnames = ["Topic" + str(i) for i in range(nmf.n_components)]
docnames = ["Doc" + str(i) for i in range(len(X_train.Complaints))]
df_doc_topic = pd.DataFrame(np.round(W1, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic['dominant_topic'] = significant_topic
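
One caveat about taking np.argmax over the rounded weights: a document whose weights all round to 0.00 is assigned topic 0, and ties resolve to the lowest index. The sketch below shows the effect on a hand-made weight matrix and one possible alternative, which takes the argmax of the unrounded weights and marks all-zero rows with -1. The matrix and the -1 convention are assumptions, not part of the original notebook.

```python
import numpy as np

# Hypothetical document-topic weights (3 documents, 3 topics)
W = np.array([[0.016, 0.000, 0.027],
              [0.000, 0.000, 0.000],   # no topic signal at all
              [0.001, 0.004, 0.002]])  # weak signal that rounds away

# Approach used above: argmax over weights rounded to 2 decimals
rounded_choice = np.argmax(np.round(W, 2), axis=1)

# Alternative sketch: argmax over raw weights, -1 when a row is all zeros
dominant = np.where(W.sum(axis=1) > 0, W.argmax(axis=1), -1)

print(rounded_choice)
print(dominant)
```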

In [24]:
df_doc_topic

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,dominant_topic
Doc0,0.02,0.00,0.03,0.0,0.00,0.00,2
Doc1,0.00,0.00,0.01,0.0,0.00,0.03,5
Doc2,0.00,0.00,0.02,0.0,0.00,0.00,2
Doc3,0.00,0.01,0.02,0.0,0.00,0.01,2
Doc4,0.00,0.00,0.03,0.0,0.00,0.00,2
...,...,...,...,...,...,...,...
Doc22976,0.01,0.00,0.00,0.0,0.00,0.00,0
Doc22977,0.00,0.01,0.02,0.0,0.00,0.00,2
Doc22978,0.02,0.00,0.00,0.0,0.01,0.02,0
Doc22979,0.01,0.02,0.00,0.0,0.00,0.00,1
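
A quick way to check the topic interpretation is to cross-tabulate the dominant topic against the known Product label. The sketch below does this on a small hypothetical frame. In the notebook the two columns would have to be aligned by position first, for example via X_train['Product'].values, since df_doc_topic uses Doc0, Doc1, ... as its index; that alignment step is an assumption and is not shown in the original.

```python
import pandas as pd

# Hypothetical labels and dominant topics for five documents
toy = pd.DataFrame({
    'Product': ['Mortgage', 'Mortgage', 'Debt collection',
                'Debt collection', 'Student loan'],
    'dominant_topic': [0, 0, 1, 1, 0],
})

# Rows are products, columns are topics, cells are document counts
table = pd.crosstab(toy['Product'], toy['dominant_topic'])
print(table)
```

A topic whose column is concentrated in one product row supports the corresponding interpretation; a column spread across several rows suggests a cross-cutting theme.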


In [25]:
WHold= nmf.transform(tf_vectorizer.transform(X_hold.Complaints[:5]))

In [26]:
colnames_hold = ["Topic" + str(i) for i in range(nmf.n_components)]
docnames_hold = ["Doc" + str(i) for i in range(len(X_hold.Complaints[:5]))]
df_doc_topic_hold = pd.DataFrame(np.round(WHold, 2), columns=colnames_hold, index=docnames_hold)
significant_topic_hold = np.argmax(df_doc_topic_hold.values, axis=1)
df_doc_topic_hold['dominant_topic'] = significant_topic_hold

In [29]:
df_doc_topic_hold.head()

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,dominant_topic
Doc0,0.0,0.0,0.03,0.0,0.0,0.01,2
Doc1,0.0,0.03,0.02,0.0,0.0,0.0,1
Doc2,0.0,0.0,0.02,0.0,0.0,0.02,2
Doc3,0.01,0.0,0.0,0.0,0.0,0.0,0
Doc4,0.0,0.03,0.0,0.0,0.01,0.0,1


In [31]:
X_hold.Complaints[:5]

8546     on xxxx19 2 separate transactions were withdrawn from my account 100 and 1900 i contacted my bank to inform them these were unauthorized transactions i also had my debit card cancelled and reissued i then contacted the company responsible for the transactions they informed me that they had no record of my account \n\nthen again on xxxx19 i had another charge of 1900 deducted from my account i am contacting my bank to report this as another unauthorized transaction