<a href="https://colab.research.google.com/github/prasvijaya/datascienceportfolio/blob/master/Topic_modelling_using_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one example of a topic model and is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both modeled with Dirichlet distributions.
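To make the two models concrete, here is a minimal sketch of how LDA assumes a single document is generated. The toy sizes and the prior values `alpha` and `eta` are assumptions chosen for illustration; they are not taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and symmetric Dirichlet priors (illustrative assumptions)
n_topics, vocab_size, doc_length = 3, 8, 20
alpha, eta = 0.5, 0.5

# Words-per-topic model: one distribution over the vocabulary for each topic
topic_word = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

# Topics-per-document model: one distribution over topics for this document
doc_topic = rng.dirichlet(np.full(n_topics, alpha))

# Generate the document: pick a topic for each word slot,
# then pick a word id from that topic's word distribution
topic_assignments = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topic_assignments])

print(doc_topic)
print(words)
```

Training LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the word distributions.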

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
df= pd.read_csv('https://github.com/srivatsan88/YouTubeLI/blob/master/dataset/consumer_compliants.zip?raw=true', compression='zip', delimiter=',', quotechar='"')

In [4]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,4/3/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Fraudulent loan,This auto loan was opened on XX/XX/2020 in XXX...,Company has responded to the consumer and the ...,TRUIST FINANCIAL CORPORATION,PA,,,Consent provided,Web,4/3/2020,Closed with explanation,Yes,,3591341
1,3/12/2020,Debt collection,Payday loan debt,Attempts to collect debt not owed,Debt is not yours,In XXXX of 2019 I noticed a debt for {$620.00}...,,CURO Intermediate Holdings,CO,806XX,,Consent provided,Web,3/12/2020,Closed with explanation,Yes,,3564184
2,2/6/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Credit denial,"As stated from Capital One, XXXX XX/XX/XXXX an...",,CAPITAL ONE FINANCIAL CORPORATION,OH,430XX,,Consent provided,Web,2/6/2020,Closed with explanation,Yes,,3521949
3,3/6/2020,Checking or savings account,Savings account,Managing an account,Banking errors,"Please see CFPB case XXXX. \n\nCapital One, in...",,CAPITAL ONE FINANCIAL CORPORATION,CA,,,Consent provided,Web,3/6/2020,Closed with explanation,Yes,,3556237
4,2/14/2020,Debt collection,Medical debt,Attempts to collect debt not owed,Debt is not yours,This debt was incurred due to medical malpract...,Company believes it acted appropriately as aut...,"Merchants and Professional Bureau, Inc.",OH,432XX,,Consent provided,Web,2/14/2020,Closed with explanation,Yes,,3531704


In [5]:
df.shape

(57453, 18)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57453 entries, 0 to 57452
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Date received                 57453 non-null  object 
 1   Product                       57453 non-null  object 
 2   Sub-product                   57453 non-null  object 
 3   Issue                         57453 non-null  object 
 4   Sub-issue                     57453 non-null  object 
 5   Consumer complaint narrative  57453 non-null  object 
 6   Company public response       57453 non-null  object 
 7   Company                       57453 non-null  object 
 8   State                         57453 non-null  object 
 9   ZIP code                      57453 non-null  object 
 10  Tags                          57453 non-null  object 
 11  Consumer consent provided?    57453 non-null  object 
 12  Submitted via                 57453 non-null  object 
 13  D

In [7]:
df.nunique()

Date received                     354
Product                             6
Sub-product                        32
Issue                              44
Sub-issue                         160
Consumer complaint narrative    55390
Company public response            11
Company                          2197
State                              60
ZIP code                         2874
Tags                                4
Consumer consent provided?          1
Submitted via                       1
Date sent to company              359
Company response to consumer        5
Timely response?                    2
Consumer disputed?                  0
Complaint ID                    57453
dtype: int64

The consumer complaint narratives are filed under 6 different products. We want to review the complaints and determine which product category each one belongs to, so we will set the number of topics to 6.

In [8]:
df['Product'].value_counts()

Debt collection                21772
Credit card or prepaid card    13193
Mortgage                        9799
Checking or savings account     7003
Student loan                    2950
Vehicle loan or lease           2736
Name: Product, dtype: int64

As you can see, the consumer complaints are filed under 6 product categories.

In [9]:
df['Company'].value_counts()

CITIBANK, N.A.                           3226
CAPITAL ONE FINANCIAL CORPORATION        2711
BANK OF AMERICA, NATIONAL ASSOCIATION    2580
JPMORGAN CHASE & CO.                     2409
WELLS FARGO & COMPANY                    2001
                                         ... 
Veristone Mortgage, LLC                     1
Lefkoff, Rubin, Gleason & Russo, P.C.       1
EMG ACQUISITION GROUP, LLC                  1
Tormey Bewley Corporation                   1
MORTGAGE INVESTORS GROUP                    1
Name: Company, Length: 2197, dtype: int64

The Company column contains the company against which each complaint is filed. Most complaints are filed against CITIBANK, N.A., followed by CAPITAL ONE FINANCIAL CORPORATION and BANK OF AMERICA, NATIONAL ASSOCIATION.

Though the dataset has many columns, we will pick the 3 most informative ones for topic modelling.

In [10]:
complaints_df= df[['Product', 'Company', 'Consumer complaint narrative']].rename(columns={'Consumer complaint narrative':'Complaints'})

In [11]:
pd.set_option('display.max_colwidth', None)  # None shows the full text; -1 is deprecated since pandas 1.0
complaints_df


Unnamed: 0,Product,Company,Complaints
0,Vehicle loan or lease,TRUIST FINANCIAL CORPORATION,"This auto loan was opened on XX/XX/2020 in XXXX, NC with BB & T in my name. I have NEVER been to North Carolina and I have NEVER been a resident. I have filed a dispute twice through my credit bureaus but both times BB & T has claimed that this is an accurate loan. Which I wasn't aware of until today. I have tried to contact BB & T multiple times but I have never gotten through to a live person. I do n't drive and I have never owned a car before. I didn't have any knowledge of this account until I checked XXXXXXXX XXXX and noticed it. I've tried twice to dispute it. Additionally I never received any bills or information about this account. This is my last resort in trying to remove this fraudulent loan off of my account."
1,Debt collection,CURO Intermediate Holdings,"In XXXX of 2019 I noticed a debt for {$620.00} on my credit which i believed was mine I thought speedy cash had bought one of my old debts and sold it to XXXX XXXX XXXX XXXX. I contacted XXXX XXXX XXXX XXXX and after several attempts of giving my full name, nothing came up in their system. I gave my social and the rep said the account popped up but DID NOT tell me that the account was under someone elses name and continued to let me make a payment. The payment was for {$120.00}. Confirmation number-XXXX. After realizing it was not my account, I called back to get my money back and inform them of the mistake. I was told i needed to mail them an FTC report and dispute letter to get my money back. I completed all of this and when i called again they said they transferred the account back to speedy cash for fraud review and I would need to contact them. After contacting them i was again told that i can not get my money back. The issue im having is this representative at XXXX XXXX played blind to obvious fraud and let an innocent person make a payment on someone elses debt and i want my money back."
2,Vehicle loan or lease,CAPITAL ONE FINANCIAL CORPORATION,"As stated from Capital One, XXXX XX/XX/XXXX and XXXX 2018, My wife and I went to several car dealerships to request for a car loan to get a used car. However, according to their credit requirements unfortunately my credit score was insufficient for the car loan approval at that time. It seemed as though they pulled my credit report multiple times."
3,Checking or savings account,CAPITAL ONE FINANCIAL CORPORATION,"Please see CFPB case XXXX. \n\nCapital One, in the letter they provided ( and attached to that case as their response ) said this : "" The funds were reversed and sent back to XXXX XXXX XXXX on XX/XX/XXXX ''. \n\nXXXX XXXX XXXX ( now XXXX XXXX ) has not received these funds. Staff at XXXX XXXX - and also staff at the account-holder 's business - have looked for return of my money ( {$650.00} ) and find nothing. \n\nCapital One needs to document - actually prove - they returned the funds, as stated in their letter. Capital One must provide electronic information, if the return was made that way, or document the paper check they sent back to XXXX XXXX. \n\nI've left 3 messages about this problem for the person who signed the letter ( XXXX ) from Capital One. I have received no call-backs. \n\nSummary : Capital One said they returned my money on XX/XX/XXXX : they did not. If they continue claim they did, then they need to prove that."
4,Debt collection,"Merchants and Professional Bureau, Inc.","This debt was incurred due to medical malpractice ( XXXX XXXX XXXX, XXXX, TX ). I asked the doctor to turn over my claim to his malpractice insurance company. This has cost me thousands of dollars to XXXX XXXX XXXX. I am still trying to collect damages from this doctor. He never responded and turned over me to collections Merchants and Professional Collection Bureau , Inc. I sent them a letter describing exactly this issue and instead of not contacting me and verifying my debt they start reporting this debt to the credit reporting agencies. They never verified the debt, like I asked and they never stopped it from being reported when I specifically told them not to, due to the circumstances above."
...,...,...,...
57448,Student loan,"Nelnet, Inc.","I am attempting to make a payment toward my student loans on the Nelnet website today, XX/XX/20, and Nelnet will not allow me to post the payment sooner than XX/XX/20. By the time the payment posts, 2-3 days of additional interest will have accrued and my payments will apply more to interest than is due today, the day that I'm attempting to pay. My understanding was that I could make a payment at any time but this does not appear to be true. The funds are available in my bank account today regardless of whether Nelnet can collect over the weekend. I should not be penalized for this. \n\nI submitted complaint XXXX in XXXX for other deceptive practices with Nelnet. They have not yet resolved the issue identified in that complaint or contacted me as they said they would in their response. I believe this new issue is just one more deceptive practice by this company that causes financial harm to borrowers."
57449,Debt collection,"The Receivable Management Services LLC, New York, NY Branch",Received letter for {$480.00}. Original creditor didnt contact me until past statute of limitations for insurance company recoupment per Arizona law. Debt collection is illegal for phantom debt. Additionally they are phoning my office excessively.
57450,Debt collection,"Convergent Resources, Inc.","entire time 10 years until XX/XX/2020. XXXX makes my blood boil. I have called and was lied to told to provide my checking account information over the phone in order to turn my cell phone back on. i called at XXXX them at XXXX {$300.00} was added to my bill. \n\nScam scam scam I was told I can not call the office of the President just to write to XXXX XXXX XXXX XXXX XXXX XXXX XXXX, NM XXXX. I did three thousand times. the last letter I mailed on XX/XX/2020. Two collection agencies later. \n\nI chose to leave XXXX XXXX every time I called the XXXX supervisor would threaten me on a recorded line. I need peace of mind and a good Heart to beat inside of me. Im on a XXXX XXXX due to the stress at XXXX XXXX taking all my money 4 10 years."
57451,Checking or savings account,WELLS FARGO & COMPANY,"I am a customer with Wells Fargo Bank. Recently money was withdrawn on a couple of occasions without my permission or consent to pay for a timeshare account that was never used by me nor anyone connected to me because of unfair policies pertaining to the fees of the said timeshare. I tried cancelling the said timeshare account several times because of these fees that were never mentioned at the initiation. My account was debited to pay for the timeshare fees without my knowledge or consent several times. I tried correcting this with Wells Fargo bank with no avail. I would appreciate it if you can look into this matter for me. I was left with no funds in my account and as such I could not take care of the basic necessities of my day to day life. \nThanks in advance,"


When a DataFrame is displayed, pandas normally truncates long text columns (to 50 characters by default). By setting display.max_colwidth with set_option(), we can show the full text of the column.

In [12]:
#Split the dataframe into Training and Hold out set
X_train, X_hold = train_test_split(complaints_df, test_size=0.6, random_state=999)

In [13]:
X_train['Product'].value_counts()

Debt collection                8676
Credit card or prepaid card    5254
Mortgage                       3939
Checking or savings account    2819
Student loan                   1210
Vehicle loan or lease          1083
Name: Product, dtype: int64

Since we have no target label and the model we are building is unsupervised, we don't have to worry about the slightly imbalanced distribution of the Product column. We are only going to group the documents into topics based on the text in the Complaints column.

**Text Preprocessing**

Features need to be in numerical form to train our model, so the text has to be vectorized before training. Before applying a vectorizer, we need to clean up the text and tokenize it into words.

In [14]:
stemmer= PorterStemmer()

In [15]:
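def tokenize(text):
  # Keep tokens longer than 3 characters that still have more than 2 characters
  # after stripping the masking characters 'X', 'x' and '/' from both ends
  tokens= [word for word in nltk.word_tokenize(text) if (len(word) > 3 and len(word.strip('Xx/')) > 2) ]
  # Stems are computed here but not returned; return stems instead to train on stemmed words
  stems= [stemmer.stem(token) for token in tokens]
  return tokens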

We could use the vectorizer's default token pattern, but we add a few customizations for better training. Sensitive and private details are masked with XXXX, so they carry no information for the model. We therefore strip the characters 'X', 'x' and '/' from both ends of each token and keep only tokens that still have more than 2 characters. Tokens of 3 or fewer characters don't add much value either, so we drop those too.
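The filtering rule can be tried on a few hand-picked tokens. This is a small sketch that applies the same length checks as tokenize() to an already tokenized list, so it does not need NLTK. The helper name `keep_token` and the sample tokens are made up for illustration.

```python
def keep_token(word):
    # Same rule as in tokenize(): more than 3 characters overall, and more than
    # 2 characters left after stripping 'X', 'x' and '/' from both ends
    return len(word) > 3 and len(word.strip('Xx/')) > 2

# Hypothetical sample tokens, including masked values and dates
sample = ['XXXX', 'XX/XX/2020', 'loan', 'the', 'BB', 'fraudulent', 'XX/XX/XXXX']
kept = [word for word in sample if keep_token(word)]
print(kept)  # ['XX/XX/2020', 'loan', 'fraudulent']
```

Note that a partially masked date such as XX/XX/2020 survives because the year remains after stripping, which is why tokens like xx/xx/2019 can still show up in the vocabulary.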

CountVectorizer and TfidfVectorizer are two bag-of-words (BoW) methods for vectorizing text. We are going to use the LDA algorithm to train our model. LDA needs the raw word counts per document, not a normalized form. We can use either CountVectorizer or TfidfVectorizer with use_idf=False and norm=None. By default, TfidfVectorizer applies the l2 norm.
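A quick way to convince ourselves of this is to compare the two vectorizers on a tiny corpus. The three sentences below are made up for illustration, and the vectorizers use their default tokenization rather than the custom tokenizer from this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy corpus
docs = ["late payment on my loan", "loan payment payment", "bank closed my account"]

counts = CountVectorizer().fit_transform(docs).toarray()
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()
default_tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# With idf and normalization switched off, TfidfVectorizer yields plain counts
same_as_counts = np.array_equal(counts, raw_tf)
# With the defaults, every non-empty row is scaled to unit l2 length
row_norms = np.linalg.norm(default_tfidf, axis=1)

print(same_as_counts)
print(row_norms)
```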

In [16]:
#instantiate TfidfVectorizer
tf_vectorizer= TfidfVectorizer(tokenizer=tokenize, stop_words='english', max_features=10000, max_df=0.75, use_idf=False, min_df=50,norm=None)
tf_vectors= tf_vectorizer.fit_transform(X_train.Complaints)
tf_vectors.shape

(22981, 2999)

max_df and min_df filter out very frequent and very rare words. With max_df=0.75, words that appear in more than 75% of the documents are dropped. With min_df=50, a word must appear in at least 50 documents to be kept.

In [17]:
tf_vectors[:10].toarray()  # toarray() is the portable equivalent of the .A shorthand

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [18]:
tf_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['0.00',
 '1.00',
 '10.00',
 '100.00',
 '1000.00',
 '10000.00',
 '100000.00',
 '110.00',
 '1100.00',
 '11000.00',
 '12.00',
 '120.00',
 '1200.00',
 '12000.00',
 '130.00',
 '1300.00',
 '13000.00',
 '140.00',
 '1400.00',
 '14000.00',
 '15.00',
 '150.00',
 '1500.00',
 '15000.00',
 '160.00',
 '1600.00',
 '16000.00',
 '1681c-2',
 '1681m',
 '1692',
 '1692g',
 '170.00',
 '1700.00',
 '180.00',
 '1800.00',
 '190.00',
 '1900.00',
 '2.00',
 '20.00',
 '200.00',
 '2000.00',
 '20000.00',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2019.',
 '2020',
 '210.00',
 '2100.00',
 '220.00',
 '2200.00',
 '230.00',
 '2300.00',
 '240.00',
 '2400.00',
 '25.00',
 '250.00',
 '2500.00',
 '25000.00',
 '260.00',
 '2600.00',
 '27.00',
 '270.00',
 '2700.00',
 '28.00',
 '280.00',
 '2800.00',
 '29.00',
 '290.00',
 '2900.00',
 '3.00',
 '30.00',
 '300.00',
 '3000.00',
 '30000.00',
 '310.00',
 '320.00',
 '3200.00',
 '330.00',
 '3300.00',
 '340.00',
 '3400.00',
 '35.00',
 '350.00',
 '3500.00',
 '36.00',
 '360.00',
 '3600.0

LDA is a generative statistical model that explains a set of observations through unobserved (latent) groups. It is a way of automatically discovering the topics present in the given documents, and the resulting topic mixtures can be used to compare how similar two documents are.

In [19]:
lda= decomposition.LatentDirichletAllocation(n_components=6, max_iter=3, learning_method='online',learning_offset=50, n_jobs=-1, random_state=999)

W1= lda.fit_transform(tf_vectors)
H1= lda.components_

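Some points:
max_iter=3 keeps training fast. Using more iterations generally gives a better fit. With learning_method='online', the model is updated incrementally on mini-batches of the training data, whereas 'batch' uses all the training data in each update and overwrites the components from earlier iterations.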
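One way to judge whether more iterations help is to compare perplexity (lower is better) across a few settings. The sketch below does this on a random stand-in matrix, so the numbers mean nothing in themselves. In this notebook the same loop could be run on tf_vectors, at a much higher cost.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Random stand-in for a document-term count matrix (60 documents, 30 words)
toy_counts = rng.poisson(1.0, size=(60, 30))

perplexities = {}
for n_iter in (3, 20):
    model = LatentDirichletAllocation(n_components=3, max_iter=n_iter,
                                      learning_method='online', learning_offset=50,
                                      random_state=999)
    model.fit(toy_counts)
    # Perplexity on the training matrix; a held-out matrix would be a fairer check
    perplexities[n_iter] = model.perplexity(toy_counts)

print(perplexities)
```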

In [20]:
W1

array([[0.00186464, 0.15979967, 0.67155152, 0.00187166, 0.00186016,
        0.16305235],
       [0.00258768, 0.15863537, 0.00258615, 0.08545595, 0.0730474 ,
        0.67768744],
       [0.01669148, 0.01684215, 0.16728716, 0.01675262, 0.01669063,
        0.76573596],
       ...,
       [0.16144692, 0.23236704, 0.00311287, 0.25677815, 0.00309927,
        0.34319575],
       [0.39332018, 0.00322877, 0.00324677, 0.44599193, 0.15096622,
        0.00324613],
       [0.00086729, 0.00722452, 0.00086671, 0.29171994, 0.6984547 ,
        0.00086683]])

Each row of W1 holds the probabilities of the topics for one document. Looking at the figures, the first document most likely belongs to the third topic (Topic2, counting from 0), since its highest probability (argmax) is 0.67 in the third column. The second document belongs to the sixth topic (Topic5), and so on.
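The same reading can be done with NumPy on the first three rows printed above. This is only a small sketch on copied values, and the variable names are chosen for illustration.

```python
import numpy as np

# First three rows of W1, copied from the output above
W1_head = np.array([
    [0.00186464, 0.15979967, 0.67155152, 0.00187166, 0.00186016, 0.16305235],
    [0.00258768, 0.15863537, 0.00258615, 0.08545595, 0.0730474, 0.67768744],
    [0.01669148, 0.01684215, 0.16728716, 0.01675262, 0.01669063, 0.76573596],
])

# Each row is a probability distribution over the 6 topics
row_sums = W1_head.sum(axis=1)
# Zero-based index of the most probable topic per document
dominant = W1_head.argmax(axis=1)

print(row_sums)
print(dominant)  # [2 5 5]
```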

In [21]:
H1

array([[8.31719468e+00, 1.83980041e-01, 6.79037827e+00, ...,
        3.40813574e-01, 5.45858565e+01, 8.15750002e+01],
       [2.90825422e+02, 6.43232866e+01, 5.95123852e+01, ...,
        1.69051612e-01, 1.90870288e-01, 4.46956143e+01],
       [2.86433286e+01, 1.33166222e+01, 4.04199868e+01, ...,
        5.27186560e+01, 3.59085550e-01, 2.48575664e+01],
       [5.10845617e+01, 1.09407386e+01, 4.92871699e-01, ...,
        1.00141495e+01, 2.16783633e+01, 5.77226199e+01],
       [2.39466764e-01, 1.69564599e-01, 1.68148616e-01, ...,
        6.75662173e+01, 1.72722441e-01, 1.69229866e-01],
       [1.05245058e+02, 1.03106686e+02, 1.01615900e+02, ...,
        4.60829419e+01, 3.82913919e+01, 1.28363738e+02]])

In [22]:
num_words=15

vocab = tf_vectorizer.get_feature_names_out()  # already a NumPy array; get_feature_names() was removed in scikit-learn 1.2

top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_words-1:-1]]
topic_words = ([top_words(t) for t in H1])
topics = [' '.join(t) for t in topic_words]

In [23]:
topics

['loan mortgage told time home payments loans company years said make called help asked received',
 'payment payments late account balance credit paid month statement mortgage received date fees monthly days',
 'insurance wells fargo escrow paid property xx/xx/2019 company received vehicle taxes fees offer refund months',
 'credit debt account report company collection letter information received sent dispute reporting agency reported removed',
 'debt information provide identity court consumer theft legal request company state contract federal violation reporting',
 'account card bank told called credit said money number phone received check time chase asked']

topics lists the top 15 words contributing to each topic. Running LDA for more iterations should give a cleaner separation of topics.
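The slice used to pick the top words is compact, so here is a small sketch of what np.argsort(t)[:-num_words-1:-1] does. The weights and the five-word vocabulary are made up for illustration.

```python
import numpy as np

# Made-up topic weights over a five-word vocabulary
weights = np.array([0.2, 5.0, 1.5, 0.1, 3.0])
vocab = np.array(['fee', 'loan', 'bank', 'the', 'card'])
n = 3

# argsort orders indices from smallest to largest weight; the slice walks
# backwards from the end and stops after n items, giving the n largest first
top_idx = np.argsort(weights)[:-n-1:-1]
top = vocab[top_idx].tolist()
print(top)  # ['loan', 'card', 'bank']
```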

Topics Interpretation:

The first topic consists of words like loan, mortgage, loans and payments, so it is about loans
Likewise, the second topic might be about payments
The third topic might be about insurance or its payment
The fourth topic might be about credit and debt accounts
The fifth topic might be about identity theft and legal complaints against debts
The sixth topic might be about account-related complaints, as words like card, number, phone and time are among the top words

Note: Topic interpretation relies on domain expertise. Get help from a domain expert to decide what each topic represents.
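A cross-tabulation of the dominant topic against the Product column can support this kind of interpretation. The sketch below uses made-up values to show the mechanics. In this notebook the two columns would come from df_doc_topic['dominant_topic'] and X_train['Product'], which is an assumption about how one might wire it up rather than something the notebook does.

```python
import pandas as pd

# Made-up toy values for illustration only
toy = pd.DataFrame({
    'Product': ['Mortgage', 'Mortgage', 'Debt collection',
                'Debt collection', 'Student loan', 'Debt collection'],
    'dominant_topic': [0, 0, 3, 3, 0, 4],
})

# Rows are topics, columns are products, cells count the documents
table = pd.crosstab(toy['dominant_topic'], toy['Product'])
print(table)
```

A topic whose row is concentrated in one product column is easier to name than one spread evenly across products.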

In [24]:
colnames = ["Topic" + str(i) for i in range(lda.n_components)]
docnames = ["Doc" + str(i) for i in range(len(X_train.Complaints))]
df_doc_topic = pd.DataFrame(np.round(W1, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic['dominant_topic'] = significant_topic

In [25]:
df_doc_topic

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,dominant_topic
Doc0,0.00,0.16,0.67,0.00,0.00,0.16,2
Doc1,0.00,0.16,0.00,0.09,0.07,0.68,5
Doc2,0.02,0.02,0.17,0.02,0.02,0.77,5
Doc3,0.15,0.01,0.01,0.72,0.01,0.12,3
Doc4,0.01,0.27,0.01,0.01,0.01,0.71,5
...,...,...,...,...,...,...,...
Doc22976,0.02,0.02,0.34,0.59,0.02,0.02,3
Doc22977,0.20,0.01,0.01,0.62,0.01,0.14,3
Doc22978,0.16,0.23,0.00,0.26,0.00,0.34,5
Doc22979,0.39,0.00,0.00,0.45,0.15,0.00,3


In [26]:
X_train.head()

Unnamed: 0,Product,Company,Complaints
33209,Mortgage,TCF FINANCIAL CORPORATION,"Chemical Bank sent a check from my escrow account to the wrong Homeowner 's Insurance provider, and my escrow account has not been credited for the mistaken transaction. \n\nXX/XX/2019 : Chemical Bank issues a payment of {$1500.00} to XXXX XX/XX/2019 : I switch Homeowner 's Insurance providers, from XXXX to XXXX XX/XX/2019 : XXXX issues a refund check for {$1300.00} for the cancelled homeowner 's policy. \nXX/XX/2019 : Chemical Bank issues a payment of {$1600.00} to XXXX ( the old Homeowner 's Insurance provider who no longer services me ) XX/XX/2019 : I deposit the refund check of {$1300.00} into my escrow account. \nXX/XX/2019 : I receive a letter from XXXX stating they have not received payment for my homeowner 's policy from my bank. I call Chemical Bank and the discover the original payment for {$1600.00} was sent to the wrong insurance company, XXXX. The issue a payment to XXXX for {$1600.00}. I am told to contact XXXX to about getting my money back. XXXX have no record of the payment associated to me, stating that the money would've come over in a larger check, associated with many policies, and have no record of anything for my account. I call Chemical Bank back, and they tell me they'll look into it and call me back. They never return my call."
23268,Credit card or prepaid card,"CITIBANK, N.A.","Please refer to case number : XXXX for background. \n\nI was assaulted and had my wallet stolen. One of the thieves ran up charges on my Best Buy Citi Visa card. \n\nThis is the worst credit card I've ever owned in my life, and I can not wait to dump it once this gets resolved. \n\nThat month, I paid off more than my entire interest bearing balance less the fraudulent charges. \n\nI was assessed {$15.00} in interest that month, because the citi had not removed the fraudulent charges from my account. Because of that, I was carrying an interest bearing balance for the month. However, if you reduced the interest bearing amount by the fraudulent amount, then I actually paid more than what I owed for interest bearing purchases. \n\nFast forward to today, I received a letter stating that they determined the charges were fraudulent and they refunded the fraudulent amount ; however, they never refunded the interest. \n\nThe assault happened on XX/XX/XXXX. It is now XX/XX/XXXX, and after repeated calls and time wasted, the money is still not refunded."
41868,Checking or savings account,U.S. BANCORP,"XX/XX/2020, I received forgery check from USBank amount {$2400.00}. It was counter withdraw from my checking account, but never been deposit."
3024,Debt collection,"Security Credit Services, LLC","XX/XX/XXXX I received a letter from the lender saying the account is paid and satisfied. XX/XX/XXXX on my credit report it says the exact same account, has a XXXX amount paid towards the balance owed which is incorrect, I contacted the lender explain to them, customer service person told me the account was deleted from their system and I didn't owe them anything, they would not and could not do anything about it being on my credit report."
34302,Checking or savings account,"BANK OF AMERICA, NATIONAL ASSOCIATION","XXXX2019 I have been BOAs branch in XXXX of XXXX, and I have take cash from ATM in this branch for {$100.00} dollars by twice, because the first time didnt work, then, when I check my bank app for transaction history, my account be charged twice, and the second day, all the account history be fixed, only appears once money out, and account still be charged twice, because balance never change."


In [27]:
WHold= lda.transform(tf_vectorizer.transform(X_hold.Complaints[:10]))

In [28]:
colnames_hold = ["Topic" + str(i) for i in range(lda.n_components)]
docnames_hold = ["Doc" + str(i) for i in range(len(X_hold.Complaints[:10]))]
df_doc_topic_hold = pd.DataFrame(np.round(WHold, 2), columns=colnames_hold, index=docnames_hold)
significant_topic_hold = np.argmax(df_doc_topic_hold.values, axis=1)
df_doc_topic_hold['dominant_topic'] = significant_topic_hold

In [29]:
df_doc_topic_hold.head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,dominant_topic
Doc0,0.01,0.01,0.01,0.12,0.01,0.86,5
Doc1,0.01,0.01,0.01,0.95,0.01,0.01,3
Doc2,0.0,0.35,0.0,0.16,0.06,0.42,5
Doc3,0.94,0.01,0.01,0.01,0.01,0.01,0
Doc4,0.0,0.0,0.0,0.48,0.51,0.0,4
Doc5,0.25,0.0,0.0,0.08,0.36,0.31,4
Doc6,0.92,0.01,0.05,0.01,0.01,0.01,0
Doc7,0.01,0.37,0.01,0.01,0.01,0.59,5
Doc8,0.13,0.29,0.0,0.58,0.0,0.0,3
Doc9,0.0,0.04,0.0,0.57,0.1,0.29,3


In [30]:
X_hold.Complaints[:10]

8546     On XX/XX/19 2 separate transactions were withdrawn from my account, {$1.00} and {$19.00}, I contacted my bank to inform them these were unauthorized transactions, I also had my debit card cancelled and reissued. I then contacted the company responsible for the transactions, they informed me that they had no record of my account. \n\nThen again on XX/XX/19 I had another charge of {$19.00} deducted from my account. I am contacting my bank to report this as another unauthorized transaction.

Notice that the 4th and 7th documents (Doc3 and Doc6) have words like loans, paperwork and home, and they are assigned to Topic0, which is about loans and mortgages.
Statements, transaction, address and interest are some of the words that belong to the sixth topic (Topic5), which is about account-related information.
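To make the hold-out results easier to read, the topic indices can be mapped to working labels. The labels below are shorthand for the interpretation given earlier and are an assumption. A domain expert may well rename them.

```python
# Working labels for Topic0..Topic5 (assumed shorthand, not part of the model)
topic_labels = {
    0: 'Loans',
    1: 'Payments',
    2: 'Insurance',
    3: 'Credit and debt accounts',
    4: 'Theft and legal',
    5: 'Account related',
}

# dominant_topic column of df_doc_topic_hold, copied from the output above
dominant_topic = [5, 3, 5, 0, 4, 4, 0, 5, 3, 3]

labelled = [topic_labels[t] for t in dominant_topic]
for doc_id, label in enumerate(labelled):
    print(f"Doc{doc_id}: {label}")
```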