## Problem Statement 

The goal is to build a model that classifies customer complaints by the product or service they concern. Segregating tickets this way helps resolve issues quickly.

Because the data is not labelled, NMF (a topic-modelling technique) is applied to analyse patterns and classify tickets into the following five clusters based on their products/services:

* Credit card / Prepaid card

* Bank account services

* Theft/Dispute reporting

* Mortgages/loans

* Others 

## Steps performed:

1.  Data loading

2. Text preprocessing

3. Exploratory data analysis (EDA)

4. Feature extraction

5. Topic modelling 

6. Model building using supervised learning

7. Model training and evaluation

8. Model inference

## Importing the necessary libraries

In [None]:
import json 
import numpy as np
import pandas as pd
import re, nltk, spacy, string
import en_core_web_sm
nlp = en_core_web_sm.load()
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from pprint import pprint

## Loading the data

The data is in JSON format and we need to convert it to a dataframe.

In [None]:
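# Open the JSON file; the context manager closes it once it has been read
with open("complaints-2021-05-14_08_16.json") as f:
    # json.load parses the file contents into Python objects
    data = json.load(f)
# Flatten the nested records into a dataframe
df=pd.json_normalize(data)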
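
To see why the column names later need the `_source.` prefix stripped, here is a minimal sketch of how `pd.json_normalize` flattens nested records. The two records are made up for illustration and only assume that each complaint carries an `_id` plus a nested `_source` object, as the column handling in this notebook suggests.

```python
import pandas as pd

# Hypothetical records: an "_id" key plus a nested "_source" object.
records = [
    {"_id": "1", "_source": {"product": "Mortgage", "complaint_what_happened": "Late fee"}},
    {"_id": "2", "_source": {"product": "Credit card", "complaint_what_happened": ""}},
]

# json_normalize flattens nested keys into dot-separated column names.
toy_df = pd.json_normalize(records)
print(list(toy_df.columns))
```

Each nested key becomes a column named `_source.<key>`, which is the prefix removed in the data preparation step.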

## Data preparation

In [None]:
# Inspect the dataframe to understand the given data.
df.head()


In [None]:
#Length of dataframe
print("Total entries in dataframe is :",len(df))

In [None]:
#print the column names
existing_cols=list(df.columns)
print(existing_cols)

In [None]:
#Assign new column names
new_cols=[cols.replace("_source.","") for cols in existing_cols]
#new_cols = [cols.replace("_","",1) if cols[0]=="_" for cols in new_cols]
df.columns=new_cols
print("The new column names are : \n",list(df.columns))

In [None]:
#Assign nan in place of blanks in the complaints column
df["complaint_what_happened"] = df["complaint_what_happened"].apply(lambda x: str(x).strip()).replace('', np.nan)
blank_complaints=df['complaint_what_happened'].isna().sum()
print("The total blank complaints are : ",blank_complaints)

In [None]:
#Remove all rows where complaints column is nan and reset the index
#(reset_index returns a new dataframe, so its result has to be assigned)
new_df= df[df["complaint_what_happened"].notna()].reset_index(drop=True)
print("Length of dataframe after removing blank complaints :",len(new_df))

## Preparing the text for topic modeling

After removing the blank complaints, the following preprocessing steps are applied:

* Make the text lowercase
* Remove text in square brackets
* Remove punctuation
* Remove words containing numbers


After the cleaning operations, the following steps are performed:
* Lemmatize the text
* Extract the POS tags of the lemmatized text and remove every word whose tag is not NN (`tag == "NN"`).
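
The four cleaning steps listed above can be traced on a single sentence. This is a minimal sketch using only the standard library; the sample sentence is invented, and the regular expressions are one reasonable way to express each step.

```python
import re

# Invented sample complaint, used only to trace the cleaning steps.
sample = "On 12/03/2021 I called Chase [XXXX] about my card ending 1234!!"

text = sample.lower()                      # 1. make the text lowercase
text = re.sub(r"\[.*?\]", "", text)        # 2. remove text in square brackets
text = re.sub(r"[^\w\s]", "", text)        # 3. remove punctuation
text = re.sub(r"\w*\d\w*", "", text)       # 4. remove words containing numbers
cleaned = text.strip()

print(cleaned.split())
```

Note that removing punctuation first turns `12/03/2021` into a single digit-only token, which the last step then drops as a whole.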


In [None]:
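# Function to clean the text and remove all the unnecessary elements.
def preprocess_complaints(complaint):
    # Make the text lowercase
    complaint=complaint.lower()
    # Remove text in square brackets (raw string avoids an invalid-escape warning)
    complaint=re.sub(r"\[.*?\]","",complaint)
    # Remove punctuation
    complaint=re.sub(r'[^\w\s]', '', complaint)
    # Remove words containing numbers
    complaint=re.sub(r'\w*\d\w*', '', complaint).strip()
    return complaint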

In [None]:
#Function to Lemmatize the texts
nlp = spacy.load('en_core_web_sm')
def lemmatize(complaint):
    doc = nlp(complaint)
    lemma=[]
    for token in doc:
        lemma.append(token.lemma_)
    lemmatized_complaint=" ".join(lemma)
    return lemmatized_complaint

In [None]:
#Creating a dataframe('df_clean') that will have only the complaints and the lemmatized complaints 
from tqdm import tqdm
df_clean = new_df[["_id","complaint_what_happened"]].copy()
# Cleaning complaints
df_clean["cleaned_complaints"] = df_clean["complaint_what_happened"].apply(preprocess_complaints)
lemm_complaints = []
for complaint in tqdm(list(df_clean["cleaned_complaints"])):
    lemm_complaints.append(lemmatize(complaint))
df_clean["lemm_complaints"] = lemm_complaints

In [None]:
df_clean.head()

In [None]:
#Function to extract the POS tags 

def pos_tag(text):
  doc = nlp(text)
  out = []
  for token in doc:
    if token.tag_ == "NN":
      out.append(token.text)
  return " ".join(out)

pos_removed = [pos_tag(text) for text in tqdm(lemm_complaints)]


#This column holds the lemmatized text with every word removed whose tag is not NN (tag == "NN")
df_clean["complaint_POS_removed"] =  pos_removed

In [None]:
#The clean dataframe should now contain the raw complaint, lemmatized complaint and the complaint after removing POS tags.
df_clean = df_clean[["complaint_what_happened","lemm_complaints","complaint_POS_removed"]]
df_clean.head()

## Exploratory data analysis to get familiar with the data.

*   Visualising the data according to the 'Complaint' character length
*   Using a word cloud to find the top 40 words by frequency among all the complaints after processing the text
*   Finding the top unigrams, bigrams and trigrams by frequency among all the complaints after processing the text

In [None]:
# Code to visualise the data according to the 'Complaint' character length
import seaborn as sb
import matplotlib.pyplot as plt
df_length = df_clean[["complaint_POS_removed"]].copy()
df_length["char_length"] = df_length["complaint_POS_removed"].apply(lambda x: len(str(x)))
length_comp=list(df_length["char_length"])
sb.boxplot(length_comp)
plt.show()
hist=sb.histplot(length_comp)
hist.set_xlim(0,4000)
plt.show()
df_length["char_length"].describe()

##### Observation: The maximum complaint length is 12160 characters and the average complaint length is 385 characters. Most of the complaints are within 500 characters.

#### Finding the top 40 words by frequency among all the complaints after processing the text.

In [None]:
#Using a word cloud to find the top 40 words by frequency among all the complaints after processing the text
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def show_wordcloud(data, title = None):
    stopwords = set(STOPWORDS)
    data_all=" ".join(list(data))
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=40,
        max_font_size=40, 
        scale=3,
        random_state=1
    ).generate(data_all)
    
    text_dict=wordcloud.process_text(data_all)
    word_freq={k: v for k, v in sorted(text_dict.items(),reverse=True, key=lambda item: item[1])}
    print("The top frequent words with their frequency are : ",list(word_freq.items())[:40])
    
    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

df_clean["complaint_POS_removed"] = df_clean["complaint_POS_removed"].astype(str)
show_wordcloud(df_clean["complaint_POS_removed"])

In [None]:
#Removing -PRON- from the text corpus
df_clean['Complaint_clean'] = df_clean['complaint_POS_removed'].str.replace('-PRON-', '')

#### Finding the top unigrams,bigrams and trigrams by frequency among all the complaints after processing the text.

In [None]:
#Code to find the top 30 unigrams by frequency among the complaints in the cleaned dataframe (df_clean).
from nltk import ngrams,FreqDist
from collections import OrderedDict

complaint_all=" ".join(list(df_clean['Complaint_clean']))
print("Length of all complaints together",len(complaint_all))

def ngram_extract(data,n,top):
  #split() without an argument collapses repeated spaces, so no empty tokens are counted
  ngram = ngrams(data.split(), n)
  #compute the frequency distribution of all the n-grams in the text
  fdist = FreqDist(ngram)
  word_freq={k: v for k, v in sorted(fdist.items(),reverse=True, key=lambda item: item[1])}
  top_ngrams=list(word_freq.items())[:top]
  return OrderedDict(word_freq),top_ngrams
  
unigram_freq_dist,top_unigrams=ngram_extract(complaint_all,1,30)
print(f"The top 30 unigrams with their frequency are : {top_unigrams}")
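
For readers without NLTK, the same n-gram counting idea can be sketched with the standard library alone. The helper name `top_ngrams` and the sample text are illustrative and not part of the notebook.

```python
from collections import Counter

def top_ngrams(text, n, top):
    """Return the `top` most frequent n-grams in whitespace-tokenised text."""
    tokens = text.split()
    # zip over n staggered views of the token list yields consecutive n-tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(top)

sample = "credit card payment credit card fee credit report"
top_unigrams_toy = top_ngrams(sample, 1, 2)
top_bigrams_toy = top_ngrams(sample, 2, 1)
print(top_unigrams_toy)
print(top_bigrams_toy)
```

As with the NLTK version, each n-gram is a tuple of words, which is why the notebook joins the tuples with a space before printing them.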

In [None]:
#Printing the top 10 words in the unigram frequency
def extract_top_words_fdist(fdist,top):
    words=list(fdist.keys())[:top]
    words=[" ".join(word) for word in words]
    return words

top_10_unigrams=extract_top_words_fdist(unigram_freq_dist,10)
print("The top 10 words by unigram frequency are : ",top_10_unigrams)

In [None]:
#Code to find the top 30 bigrams by frequency among the complaints in the cleaned dataframe (df_clean).
bigram_freq_dist,top_bigrams=ngram_extract(complaint_all,2,30)
print(f"The top 30 bigrams with their frequency are : {top_bigrams}")

In [None]:
#Printing top 10 words in the bigram frequency
top_10_bigrams=extract_top_words_fdist(bigram_freq_dist,10)
print("The top 10 words by bigram frequency are : ",top_10_bigrams)

In [None]:
#Code to find the top 30 trigrams by frequency among the complaints in the cleaned dataframe (df_clean).
trigram_freq_dist,top_trigrams=ngram_extract(complaint_all,3,30)
print(f"The top 30 trigrams with their frequency are : {top_trigrams}")

In [None]:
#Printing the top 10 words in the trigram frequency
top_10_trigrams=extract_top_words_fdist(trigram_freq_dist,10)
print("The top 10 words by trigram frequency are : ",top_10_trigrams)

#### The customers' personal details have been masked in the dataset with xxxx. The masked text is removed because it is of no use for our analysis.

In [None]:
df_clean['Complaint_clean'] = df_clean['Complaint_clean'].str.replace('xxxx','')

In [None]:
#All masked text has been removed
df_clean.head()

## Feature Extraction
Converting the raw texts to a matrix of TF-IDF features

In [None]:
#Initialise the TfidfVectorizer: ignore terms found in more than 95% of the complaints or in fewer than 2 complaints
tfidf = TfidfVectorizer(max_df = 0.95,min_df = 2)

#### Creating a document term matrix using fit_transform

In [None]:
#Code to create the Document Term Matrix by transforming the complaints column present in df_clean.
dtm = tfidf.fit_transform(df_clean['Complaint_clean'])
dtm_df=pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())
print("The shape of document term matrix is ",dtm_df.shape)

## Topic Modelling using NMF
Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so the model is trained without any topic labels.

In [None]:
from sklearn.decomposition import NMF

In [None]:
#Loading nmf_model with the n_components i.e 5
num_topics = 5

nmf_model = NMF(n_components=num_topics,random_state=40)
W = nmf_model.fit_transform(dtm)  # Document-topic matrix
H = nmf_model.components_       # Topic-term matrix
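
As a rough illustration of what the two matrices contain, the sketch below factorises a small hand-made matrix. The values, the `init` choice and the `max_iter` setting are assumptions made for this toy case, not settings used in the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative document-term matrix: 4 documents x 4 terms (made-up values).
# Documents 0-1 mostly use terms 0-1, documents 2-3 mostly use terms 2-3.
V = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=40, max_iter=500)
W_toy = toy_nmf.fit_transform(V)   # document-topic weights, shape (4, 2)
H_toy = toy_nmf.components_        # topic-term weights, shape (2, 4)

# The dominant topic of a document is the column with the largest weight in its row of W.
dominant = W_toy.argmax(axis=1)
print(W_toy.shape, H_toy.shape, dominant)
```

The product of `W_toy` and `H_toy` approximates `V`, and both factors stay non-negative, which is what makes the rows of `H` readable as weighted word lists.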

In [None]:
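#The model was already fitted by fit_transform in the previous cell, so it is not fitted again
#Size of the vocabulary learned by the vectorizer
len(tfidf.get_feature_names_out())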

In [None]:
#Printing the Top15 words for each of the topics
tot_words=15
words = np.array(tfidf.get_feature_names_out())
topic_words = pd.DataFrame(np.zeros((num_topics, tot_words)), index=[f'Topic {i + 1}' for i in range(num_topics)],
                           columns=[f'Word {i + 1}' for i in range(tot_words)]).astype(str)
for i in range(num_topics):
    ix = H[i].argsort()[::-1][:tot_words]
    topic_words.iloc[i] = words[ix]

topic_words

In [None]:
#Creating the best topic for each complaint in terms of integer value 0,1,2,3 & 4
#Topic 1 - Bank account services - (0)
#Topic 2 - Credit card / Prepaid card - (1)
#Topic 3 - Mortgages/loans - (2)
#Topic 4 - Theft/Dispute reporting - (3)
#Topic 5 - Others - (4)

topic_mapping = {
    'Topic 1': 0,
    'Topic 2': 1,
    'Topic 3': 2,
    'Topic 4': 3,
    'Topic 5': 4,
}

In [None]:
#Assigning the best topic to each of the complaints in Topic Column
W = pd.DataFrame(W, columns=[f'Topic {i + 1}' for i in range(num_topics)])
W['Topic'] = W.apply(lambda x: topic_mapping.get(x.idxmax()), axis=1)
W[pd.notnull(W['Topic'])].head(10)

In [None]:
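#W has a fresh 0..n-1 index, so assign by position to avoid misalignment with df_clean's index labels
df_clean["Topic"]=W["Topic"].values
df_clean.head()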
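
One thing to watch when copying a column between dataframes: pandas aligns a Series on index labels, not on row position. The sketch below uses made-up data to show how a filtered dataframe and a freshly indexed Series can fall out of step.

```python
import pandas as pd

# Rows 1 and 3 survive a filter, so the frame keeps the index labels 1 and 3.
complaints = pd.DataFrame({"text": ["a", "b", "c", "d"]}).iloc[[1, 3]]
# Topic ids computed separately carry a fresh 0..n-1 index.
topics = pd.Series([0, 4], name="Topic")

# Label-based assignment: label 1 picks up topics[1], label 3 has no match and becomes NaN.
by_label = complaints.copy()
by_label["Topic"] = topics

# Position-based assignment: the underlying array is copied row by row.
by_position = complaints.copy()
by_position["Topic"] = topics.to_numpy()

print(by_label)
print(by_position)
```

Resetting the index of the filtered dataframe, or assigning the underlying array, keeps each complaint paired with its own topic.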

In [None]:
#Printing the first 5 Complaint for each of the Topics
df_clean_group=df_clean.groupby('Topic').head(5)
df_clean_group.sort_values('Topic')

In [None]:
print(len(df_clean))

#### After evaluating the mapping, if the assigned topics are correct, assign these names to the relevant topics:
* Bank Account services
* Credit card or prepaid card
* Theft/Dispute Reporting
* Mortgage/Loan
* Others

In [None]:
Topic_names = {0:"Bank account services",1:"Credit card / Prepaid card",2:"Mortgages/loans",3:"Theft/Dispute reporting",4:"Others"}
#Replace Topics with Topic Names
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)

In [None]:
df_clean.head(10)

## Supervised model to predict any new complaints to the relevant Topics.

In [None]:
#Creating the reverse dictionary that maps Topic names back to numeric Topic ids

Topic_names = { "Bank account services":0,"Credit card / Prepaid card":1,"Mortgages/loans":2,"Theft/Dispute reporting":3,"Others":4 }
#Replacing Topic names with their numeric ids for model training
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)

In [None]:
df_clean.head(10)

In [None]:
#Keep the columns"complaint_what_happened" & "Topic" only in the new dataframe --> training_data
training_data=df_clean[["complaint_what_happened","Topic"]]

In [None]:
training_data.head()

#### Applying the supervised models on the training data created


In [None]:
#Build the word-count vectors for the complaints
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
count_vect = CountVectorizer()
train_count = count_vect.fit_transform(training_data['complaint_what_happened'])
tfidf_model = TfidfTransformer()
#Transform the word-count vectors to tf-idf
X = tfidf_model.fit_transform(train_count)
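
The two-step route used here (`CountVectorizer` followed by `TfidfTransformer`) yields the same matrix as a single `TfidfVectorizer` with default settings. A small check on made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up documents, used only for the comparison.
docs = ["card fee charged twice", "loan payment late", "card payment declined"]

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with default settings.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two fitted objects separate, as the notebook does, means both have to be applied in the same order to any new text before prediction.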

Models Used :
* Logistic regression
* Decision Tree
* Random Forest

In [None]:
# Split the features and labels into train and test sets for the 3 models
# Use a 1-D Series as the target; a one-column dataframe triggers a DataConversionWarning in fit
y = training_data["Topic"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Building the models - 
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [None]:
classifiers = [LogisticRegression(random_state=42),DecisionTreeClassifier(random_state=42),RandomForestClassifier(random_state=42)]

In [None]:
logistic = classifiers[0]
logistic.fit(X_train,y_train)
print("Logistic regression: ")
print(classification_report(y_test,logistic.predict(X_test)))

In [None]:
decision = classifiers[1]
decision.fit(X_train,y_train)
print("Decision tree: ")
print(classification_report(y_test,decision.predict(X_test)))

In [None]:
random = classifiers[2]
random.fit(X_train,y_train)
print("Random forest: ")
print(classification_report(y_test,random.predict(X_test)))

### Selecting the best model:
#### Based on the evaluation metrics, logistic regression performs best with an accuracy of 0.91, compared with the decision tree (acc=0.76) and the random forest (acc=0.82).

#### **Testing the model with a custom test case using the best model**

In [None]:
test_sentences = pd.DataFrame({"complaint_what_happened":["I want to check why my account is not having money, is it because my account is breached?"," The bill for my card is not paid, what is the due amount for this month for my card?","What is the interest rate your bank offers for buying a house"," I want to enquire about opening a new account with your bank","What is the best way to earn without job?"]})

In [None]:
X_test_sentences = count_vect.transform(test_sentences["complaint_what_happened"])
X_test_sentences = tfidf_model.transform(X_test_sentences)
Topic_names = {0:"Bank account services",1:"Credit card / Prepaid card",2:"Mortgages/loans",3:"Theft/Dispute reporting",4:"Others"}
test_sentences["Topic"]= logistic.predict(X_test_sentences)
test_sentences["Topic"] = test_sentences["Topic"].map(Topic_names)
test_sentences.head()
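
One possible way to package the inference steps is a scikit-learn pipeline, so that vectorising and predicting happen in a single call. This is a sketch on a tiny made-up training set, not the notebook's trained model; the topic ids 1 and 2 follow the mapping used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = Credit card / Prepaid card, 2 = Mortgages/loans.
train_texts = [
    "credit card fee charged",
    "card payment declined",
    "annual card fee",
    "mortgage loan payment late",
    "home loan interest rate",
    "mortgage escrow account",
]
train_labels = [1, 1, 1, 2, 2, 2]

# The pipeline fits the vectorizer and the classifier together.
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))
toy_pipeline.fit(train_texts, train_labels)

topic_names = {1: "Credit card / Prepaid card", 2: "Mortgages/loans"}
predicted = toy_pipeline.predict(["card fee dispute", "mortgage interest rate"])
predicted_names = [topic_names[label] for label in predicted]
print(predicted_names)
```

With a pipeline, new sentences are passed in as raw text and there is no separate transform step to forget or apply out of order.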


### **Conclusion :**

#### - We preprocessed the data to clean it and then applied topic modelling with NMF to create labels for the complaints
#### - We built classification models and found that logistic regression is best suited, with an accuracy of 0.91
#### - We also tested on a custom case and found accurate results

# End of Notebook