## TF-IDF

TF-IDF stands for term frequency–inverse document frequency, a formula that measures how important a word is to a document in a collection of documents.

This metric calculates the number of times a word appears in a text (term frequency) and compares it with the inverse document frequency (how rare or common that word is in the entire data set).

Multiplying these two quantities provides the TF-IDF score of a word in a document. The higher the score is, the more relevant the word is to the document.
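
One common textbook form of this product is sketched below. The symbols are assumptions introduced here for illustration: $t$ is a term, $d$ a document, $N$ the number of documents in the collection, and $\mathrm{df}(t)$ the number of documents that contain $t$. Libraries often use variants of this formula (a different log base, smoothing, or normalization), so it is a reference point rather than the exact formula used later.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
$$

As a hypothetical worked case with the natural logarithm: a word that appears 3 times in a document and occurs in 10 of 10,000 documents scores $3 \times \ln(1000) \approx 20.7$, while a word with the same count that occurs in 5,000 of 10,000 documents scores only $3 \times \ln(2) \approx 2.1$.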

TF-IDF is used primarily as a pre-processing step for tasks like clustering, topic modeling, and text classification. But TF-IDF can also be used to extract keywords from a document to get a sense of the content of the document.

For example, if we are dealing with Wikipedia articles, we can use TF-IDF to extract the words that are most distinctive of a given article. Looking at these keywords in aggregate then supports further text analysis.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import spacy
nlp = spacy.load("en_core_web_sm", disable = ['parser','ner'])

## Dataset

We will be using data published by the Consumer Financial Protection Bureau, obtained from their [website](https://www.consumerfinance.gov/data-research/consumer-complaints/). The data is a collection of complaints about consumer financial products and services that CFPB sent to companies for response.

Complaints can give us insights into the problems people are experiencing. These insights can help not only in regulating consumer financial products and services but also in enforcing existing federal consumer financial laws.

The original dataset was too large and noisy for computation, so I will be using a cleaner version with 10,000 rows.

## Loading Data

In [2]:
df = pd.read_csv('bank_complaints.csv')

In [3]:
df.head()

Unnamed: 0,Complaint ID,Date received,Product,Issue,Consumer complaint narrative,Company
0,7244354,2023-07-13,Checking or savings account,Problem caused by your funds being low,Citibank allowed debit card transactions to ov...,"CITIBANK, N.A."
1,7108471,2023-06-13,"Credit reporting, credit repair services, or o...",Problem with a credit reporting company's inve...,I submitted a letter to the XXXX Credit Bureau...,"TRANSUNION INTERMEDIATE HOLDINGS, INC."
2,7497831,2023-09-05,Prepaid card,Trouble using the card,I was given a gift card for {$250.00} from US ...,U.S. BANCORP
3,7518662,2023-09-08,Prepaid card,Problem getting a card or closing an account,I have had several weeks of unemployment benef...,U.S. BANCORP
4,7499662,2023-09-06,Checking or savings account,Managing an account,"at branch # XXXX XXXX on XX/XX/23 at XXXX, I w...",WELLS FARGO & COMPANY


This dataset contains 6 fields, including Complaint ID, Date received, Product, and other metadata that we don’t need for this analysis. We are only interested in the Consumer complaint narrative.

Complaint narratives are consumers’ descriptions of their experiences in their own words. This is the source of text for keyword extraction.

## Cleaning the data

We cannot go straight from raw text to fitting a machine learning model. We will first clean the text by removing punctuation and numbers, and then normalize each word to its root form (lemma). For simplicity, we will only perform some mild pre-processing.

In [18]:
def clean_text(text):
    # Lowercase the text
    text = text.lower()

    # Replace runs of underscores with a space
    text = re.sub(r"_+", " ", text)
    # Drop redaction masks such as "xxxx" found in the narratives
    text = re.sub(r"x{2,}", " ", text)
    # Collapse runs of digits, punctuation, and whitespace into one space
    text = re.sub(r"(\d|\W)+", " ", text)
    return text

# Lemmatization: reduce each token to its dictionary form
def lemma_text(text):
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [5]:
df['Clean_text'] = df['Consumer complaint narrative'].apply(clean_text).apply(lemma_text).str.strip()
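
To make the effect of the regex steps concrete, here is a minimal sketch that applies the same three substitutions to a single string. The sample sentence is hypothetical and not taken from the dataset, and lemmatization is left out so that the snippet runs without spaCy.

```python
import re

# Hypothetical narrative fragment; not taken from the dataset.
sample = "At branch # XXXX XXXX on XX/XX/23, I deposited {$250.00}."

text = sample.lower()
text = re.sub(r"_+", " ", text)         # underscores -> space
text = re.sub(r"x{2,}", " ", text)      # redaction masks like "xxxx" -> space
text = re.sub(r"(\d|\W)+", " ", text)   # runs of digits/punctuation -> one space
cleaned = text.strip()

print(cleaned)  # at branch on i deposited
```

Note that the `x{2,}` pattern also removes any genuine double "x" inside a word. This is rare in English, which is presumably why the mild pre-processing accepts it.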

## Importing TfidfVectorizer and creating a TfidfVectorizer object

We are now ready to compute TF-IDF and then extract top keywords from the TF-IDF vectors.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
vectorizer = TfidfVectorizer(stop_words='english')

In [8]:
vectors = vectorizer.fit_transform(df['Clean_text'])
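
For intuition about what ends up in each row of `vectors`, the sketch below computes TF-IDF by hand in plain Python. It assumes the weighting that scikit-learn documents as its default (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The function name `tfidf_rows` and the toy documents are hypothetical, and the whitespace tokenization is a simplification: it ignores the vectorizer's own token pattern and stop-word removal.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (illustrative only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed to match scikit-learn's default settings)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical toy corpus
docs = ["card overdraft fee", "card gift card", "credit report error"]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first toy document, "overdraft" receives a higher weight than "card" because "card" also appears in another document. That is the behavior the keyword extraction relies on.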

We will build a dictionary (dict_of_tokens) in which the key is the column index and the value is the corresponding word.

In [9]:
dict_of_tokens = {index: token for token, index in vectorizer.vocabulary_.items()}

The next step is to sort the words in each row by TF-IDF weight in descending order and keep the top 5 keywords (without their weights) for each row.

In [12]:
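keywords = []
for row in vectors:
    # Pair each non-zero column index with its TF-IDF weight
    temp_tuple = zip(row.indices, row.data)
    # Sort the pairs by weight, highest first
    temp_list = sorted(temp_tuple, key=lambda x: x[1], reverse=True)
    # Keep the words for the five highest-weighted entries
    keywords.append([dict_of_tokens[x[0]] for x in temp_list[:5]])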
df['keywords'] = keywords
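
The same top-k selection can be sketched on toy data without scikit-learn. Everything below is hypothetical: the small `vocabulary` stands in for `vectorizer.vocabulary_`, and the two lists stand in for one sparse row's `indices` and `data`. Using `heapq.nlargest` is an alternative to a full sort that returns the k largest items in descending order.

```python
import heapq

# Hypothetical stand-in for vectorizer.vocabulary_ (word -> column index)
vocabulary = {"card": 0, "gift": 1, "fax": 2, "send": 3}
index_to_token = {index: token for token, index in vocabulary.items()}

# Hypothetical stand-ins for row.indices and row.data of one sparse row
row_indices = [0, 1, 2, 3]
row_weights = [0.41, 0.62, 0.48, 0.20]

def top_keywords(indices, weights, k=5):
    # Pick the k (index, weight) pairs with the largest weights
    best = heapq.nlargest(k, zip(indices, weights), key=lambda pair: pair[1])
    # Map column indices back to words and drop the weights
    return [index_to_token[i] for i, _ in best]

print(top_keywords(row_indices, row_weights, k=3))  # ['gift', 'fax', 'card']
```

For rows with only a handful of non-zero entries the difference from `sorted(...)[:5]` is negligible, so this is a matter of taste rather than a needed optimization.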

In [19]:
df.head(10)

Unnamed: 0,Complaint ID,Date received,Product,Issue,Consumer complaint narrative,Company,Clean_text,keywords
0,7244354,2023-07-13,Checking or savings account,Problem caused by your funds being low,Citibank allowed debit card transactions to ov...,"CITIBANK, N.A.",citibank allow debit card transaction to overd...,"[overdraft, debit, allow, card, citi]"
1,7108471,2023-06-13,"Credit reporting, credit repair services, or o...",Problem with a credit reporting company's inve...,I submitted a letter to the XXXX Credit Bureau...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",I submit a letter to the credit bureaus last t...,"[include, suspicious, pursue, think, prior]"
2,7497831,2023-09-05,Prepaid card,Trouble using the card,I was given a gift card for {$250.00} from US ...,U.S. BANCORP,I be give a gift card for from us bank sell at...,"[gift, fax, card, reactivate, send]"
3,7518662,2023-09-08,Prepaid card,Problem getting a card or closing an account,I have had several weeks of unemployment benef...,U.S. BANCORP,I have have several week of unemployment benef...,"[refuse, reliacard, card, unemployment, explan..."
4,7499662,2023-09-06,Checking or savings account,Managing an account,"at branch # XXXX XXXX on XX/XX/23 at XXXX, I w...",WELLS FARGO & COMPANY,at branch on at I be direct to deposit cash in...,"[teller, customer, deposit, cash, return]"
5,7257025,2023-07-15,"Credit reporting, credit repair services, or o...",Improper use of your report,In accordance with the Fair Credit Reporting a...,Experian Information Solutions Inc.,in accordance with the fair credit reporting a...,"[consumer, section, reporting, agency, state]"
6,7242995,2023-07-12,"Credit reporting, credit repair services, or o...",Incorrect information on your report,"On XX/XX/, 2023, XXXX XXXX admitted liability ...","BANK OF AMERICA, NATIONAL ASSOCIATION",on admit liability in exceed the rate of the s...,"[america, bank, domestic, creation, decision]"
7,7257145,2023-07-15,"Credit reporting, credit repair services, or o...",Incorrect information on your report,XX/XX/XXXX ] [ XXXX XXXX XXXX ] [ XXXX XXXX XX...,Experian Information Solutions Inc.,mi ssn re letter to remove inaccurate credit i...,"[remove, collection, mi, copy, account]"
8,7504183,2023-09-05,Prepaid card,Problem with a purchase or transfer,XX/XX/XXXX paid XXXX XXXXpayment made plus XXX...,Netspend Corporation,pay payment make plus charge over phone use ne...,"[netspend, payment, reversered, plus, reverse]"
9,7257121,2023-07-15,"Credit reporting, credit repair services, or o...",Improper use of your report,In accordance with the Fair Credit Reporting a...,Experian Information Solutions Inc.,in accordance with the fair credit reporting a...,"[consumer, section, reporting, agency, state]"


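Most of the extracted keywords make sense given the complaint text. For example, the Citibank complaint in row 0 yields overdraft, debit, and card, while the cash-deposit complaint in row 4 yields teller, deposit, and cash.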