# FINANCIAL COMPLAINT TOPIC MODELING - NMF

### Albert Opoku - Senior Statistical Consultant at Allianca data Inc

#### Contact me:
- Twitter [@opalbert](https://twitter.com/opalbert)
- LinkedIn [Albert Opoku](https://www.linkedin.com/in/albertopokupmachinelearning/)
- Email opalkabert@gmail.com
- Website [www.opokualbert.com](https://opokualbert.com/)

### CASE:

Consumers provide feedback on financial products and services. Our task is to extract the hidden themes (topics) from that feedback and assign each feedback document to one of those topics.

### Solution:

Train a Natural Language Processing (NLP) machine learning model to extract the topics from the open-ended complaint text documents.

### Data Source:

The data was downloaded from Kaggle: [consumer complaint data](https://www.kaggle.com/cfpb/us-consumer-finance-complaints)

**Topic Modeling** is an unsupervised machine learning technique for discovering the hidden (latent) thematic structure in a large corpus of text documents.

[Latent Dirichlet Allocation (LDA)](http://jmlr.org/papers/volume3/blei03a/blei03a.pdf) and [Non-Negative Matrix Factorization (NMF)](https://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf) are two of the most popular topic modeling techniques. LDA uses a probabilistic approach, whereas NMF uses a matrix factorization approach: it approximates the document-term matrix as the product of two smaller non-negative matrices, a document-topic matrix and a topic-term matrix.


<img src="NMF_Equation.PNG">

<img src="DTM_.png">

<img src="Doc_Topic_Terms.png">
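To make the factorization concrete, here is a minimal sketch of NMF using the multiplicative update rules from the Lee and Seung paper linked above, written with NumPy only. The function name `nmf_multiplicative`, the toy matrix, and the iteration count are illustrative assumptions; the notebook itself relies on scikit-learn's `NMF`, which uses a different solver by default.

```python
import numpy as np

def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Approximate V (docs x terms) as W (docs x topics) @ H (topics x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Random non-negative starting points
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared-error (Frobenius) objective;
        # eps guards against division by zero
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix with made-up counts: 4 documents x 5 terms
V = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 3, 0],
    [0, 1, 3, 4, 0],
], dtype=float)

W, H = nmf_multiplicative(V, n_topics=2)
doc_topic = W.argmax(axis=1)  # dominant topic per document
print(np.round(W @ H, 1))     # reconstruction of V
print(doc_topic)
```

Each row of `W` gives a document's weight on every topic, and each row of `H` gives a topic's weight on every term. Assigning a document to the topic with the largest weight in its row of `W` is one simple way to label documents.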

##### Import the needed packages

In [None]:
import pickle
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize, pos_tag
# from nltk.corpus import stopwords
# from nltk.stem import PorterStemmer
from sklearn.decomposition import NMF
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# NLTK resources: tokenizer (word_tokenize), POS tagger (pos_tag), WordNet (lemmatizer)
# Newer NLTK releases may instead need 'punkt_tab' and 'averaged_perceptron_tagger_eng'
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

In [None]:
df_orig = pd.read_csv("consumer_complaints.csv")

In [None]:
df_orig.shape

In [None]:
df_orig.head()

In [None]:
# Count the rows with a missing consumer_complaint_narrative
df_orig['consumer_complaint_narrative'].isnull().sum()

In [None]:
# extract only the relevant column for this project
df_orig = df_orig.loc[:,['consumer_complaint_narrative']]
# df_orig.to_csv('consumer_complaint_text.csv', index=False)

In [None]:
# Exclude all rows with null consumer_complaint_narrative
df_orig.dropna(inplace=True)
df_orig.shape

In [None]:
# Show long complaint texts and more rows when displaying DataFrames
pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.max_rows', 500)

In [None]:
# Create a document id for merging the results back to the original file
df_orig.insert(0, 'Doc_Id', range(len(df_orig)))
df_orig.head()

In [None]:
# save for use later
df_orig.to_pickle('df_orig.pkl')

In [None]:
df = df_orig.loc[:,['consumer_complaint_narrative','Doc_Id']]

In [None]:
# Get the word count for each document
df['word_count'] = df['consumer_complaint_narrative'].apply(lambda x: len(str(x).split(" ")))
df.head()
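One detail worth knowing about the count above: `split(" ")` splits on every single space, so consecutive spaces produce empty strings that are counted as words. The small sketch below, using a made-up sentence, compares it with `split()`, which splits on any run of whitespace. It is only an illustration; the thresholds used later in this notebook are based on the `split(" ")` counts.

```python
# Made-up complaint text containing two double spaces
sample = "I was charged  twice  for the same fee"

count_single_space = len(sample.split(" "))  # empty strings between double spaces are counted
count_whitespace = len(sample.split())       # runs of whitespace are treated as one separator

print(count_single_space, count_whitespace)
```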

In [None]:
# Summary statistics
df.word_count.describe()

In [None]:
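# Keep only complaints of 191 to 255 words to get a manageable subset for this tutorial.
# .copy() makes df1 an independent DataFrame, so later column assignments are safe.
df1 = df[(df['word_count'] >= 191) & (df['word_count'] <= 255)].copy()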

In [None]:
df1.shape

In [None]:
df1.head()

In [None]:
# Remove unwanted characters: 'X' (used to mask personal data), braces and slashes
df1['consumer_complaint_narrative'] = df1['consumer_complaint_narrative'].str.replace(r'[X{}/]', '', regex=True)
df1.head()
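Removing every `X` also deletes the letter from ordinary words. A more targeted alternative, sketched below with the standard `re` module, removes only runs of two or more capital X's. This assumes the masked values look like `XXXX` or `XX/XX/XXXX`; the helper name `strip_masks` and the sample sentence are hypothetical.

```python
import re

# Assumed mask format: runs of two or more capital X's, optionally joined by slashes
MASK = re.compile(r'X{2,}(?:/X{2,})*')

def strip_masks(raw_text):
    """Remove assumed redaction masks and braces, then collapse whitespace."""
    cleaned = MASK.sub(' ', raw_text)
    cleaned = re.sub(r'[{}]', '', cleaned)
    return re.sub(r'\s+', ' ', cleaned).strip()

sample = "On XX/XX/XXXX I called XXXX about my Xerox lease {$300.00}."
print(strip_masks(sample))
```

On a DataFrame column, the same function could be applied with `.apply(strip_masks)`.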

In [None]:
# Top 20 most frequent words
freq = pd.Series(' '.join(df1['consumer_complaint_narrative']).split()).value_counts()[:20]
freq
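The same kind of frequency table can be built with `collections.Counter` from the standard library, which avoids joining the whole corpus into one large string first. The two documents below are made up for illustration.

```python
from collections import Counter

# Made-up documents standing in for the complaint narratives
docs = ["the bank charged a fee", "the bank closed the account"]

# Count whitespace-separated tokens across all documents
counts = Counter(word for doc in docs for word in doc.split())
top_two = counts.most_common(2)
print(top_two)
```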

In [None]:
# Work with nouns only: use NLTK part-of-speech tags to keep the nouns in each document
def nouns(doc_text):
    """Return the nouns in doc_text (POS tags starting with 'NN') as a space-separated string."""
    tokenized = word_tokenize(doc_text)
    all_nouns = [word for word, pos in pos_tag(tokenized) if pos.startswith('NN')]
    return ' '.join(all_nouns)
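The noun filter works because all Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) start with `NN`. The sketch below shows the filter on its own. The tags are written by hand to mimic what `pos_tag` returns, so the example runs without any NLTK data; they are an assumption, not real tagger output.

```python
# Hand-written (word, tag) pairs in the shape returned by nltk.pos_tag
tagged = [
    ("The", "DT"), ("bank", "NN"), ("charged", "VBD"),
    ("fees", "NNS"), ("to", "TO"), ("Chase", "NNP"), ("customers", "NNS"),
]

# Keep only the words whose tag starts with 'NN'
noun_words = [word for word, pos in tagged if pos.startswith("NN")]
print(" ".join(noun_words))
```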

In [None]:
df1['data_nouns'] = df1['consumer_complaint_narrative'].apply(nouns)
df1.head()

In [None]:
# Further cleaning: strip markup and punctuation, remove stopwords, lemmatize
import re
from nltk.stem import WordNetLemmatizer

my_stop_words = text.ENGLISH_STOP_WORDS
lemmatizer = WordNetLemmatizer()
html_tag = re.compile(r'<.*?>')  # compiled once, outside the loop

temp = []
for sentence in df1['data_nouns']:
    sentence = sentence.lower()
    sentence = html_tag.sub(' ', sentence)          # remove HTML tags
    sentence = re.sub(r'[?!\'"#|]', '', sentence)   # drop these characters
    sentence = re.sub(r'[.,()/]', ' ', sentence)    # replace punctuation with a space

    # Remove stopwords, then lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in sentence.split() if word not in my_stop_words]
    temp.append(words)

final_X = temp
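A note on the punctuation patterns: inside a regex character class, `|` is a literal pipe character rather than an alternation operator. A class written with pipes between the characters therefore matches the same set as the shorter form, plus the pipe itself. The sketch below checks this on a made-up string.

```python
import re

# Pipes inside [...] are literal characters, not "or" operators
with_pipes = re.compile(r'[?|!|\'|"|#]')
without_pipes = re.compile(r'[?!\'"#|]')  # same character set, written once each

sample = 'why?! "fee" #1 a|b'
result_with = with_pipes.sub('', sample)
result_without = without_pipes.sub('', sample)
print(result_with)
```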

In [None]:
final_X[:2]

In [None]:
# Join each list of tokens back into a single space-separated string
final_X = [' '.join(row) for row in final_X]

In [None]:
print(final_X[:2])

In [None]:
# Store the cleaned text in a new column next to the original narrative
df1['cleaned'] = final_X
df1.head()

In [None]:
# save for use later
df1.to_pickle('data_prep.pkl')

In [None]:
# df1 = pd.read_pickle('data_prep.pkl')