## Problem Statement 

You need to build a model that is able to classify customer complaints based on the products/services. By doing so, you can segregate these tickets into their relevant categories and, therefore, help in the quick resolution of the issue.

You will be doing topic modelling on the <b>.json</b> data provided by the company. Since this data is not labelled, you need to apply NMF to analyse patterns and classify tickets into the following five clusters based on their products/services:

* Credit card / Prepaid card

* Bank account services

* Theft/Dispute reporting

* Mortgages/loans

* Others 


With the help of topic modelling, you will be able to map each ticket onto its respective department/category. You can then use this data to train any supervised model such as logistic regression, decision tree or random forest. Using this trained model, you can classify any new customer complaint support ticket into its relevant department.

## Pipeline steps to be performed:

You need to perform the following eight major tasks to complete the assignment:

1.  Data loading

2. Text preprocessing

3. Exploratory data analysis (EDA)

4. Feature extraction

5. Topic modelling 

6. Model building using supervised learning

7. Model training and evaluation

8. Model inference

## Importing the necessary libraries

In [None]:
# We ran this notebook on Jarvis, so all the install/download commands below are kept

!pip install spacy
!pip install plotly
!pip install wordcloud

import nltk, spacy
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

!python -m spacy download en_core_web_sm

In [None]:
# importing all the libraries

import json 
import numpy as np
import pandas as pd

import re, nltk, spacy, string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from pprint import pprint

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

## Loading the data

The data is in JSON format and we need to convert it to a dataframe.

In [None]:
# Opening JSON file.
# Write the path to your data file and load it 
with open('complaints-2021-05-14_08_16.json', 'r') as f:
    # json.load parses the file into Python objects; the file is closed on leaving the block
    data = json.load(f)

df = pd.json_normalize(data)

## Data preparation

In [None]:
# Inspect the dataframe to understand the given data

print("Shape of dataframe:-", df.shape)

df.head()

In [None]:
#print the column names

df.columns

#### Comment about renaming columns:
- All the column names have the prefix `_source.` except the first 4 columns
- First add this prefix to the first 4 columns as well
- Then rename all the columns to remove the prefix `_source.`
- Shorten the column name `complaint_what_happened` to `complaints`
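The effect of flattening and renaming can be seen on a tiny example. The sketch below uses two made-up records shaped like the export (the field values are hypothetical, not taken from the dataset) and strips the prefixes in one pass with `lstrip("_")`, a shortcut that is assumed to give the same column names as the two-step rename described above.

```python
import pandas as pd

# Hypothetical records shaped like the export; not the real data
records = [
    {"_index": "complaint", "_id": "1",
     "_source": {"product": "Mortgage",
                 "complaint_what_happened": "Late fee charged"}},
    {"_index": "complaint", "_id": "2",
     "_source": {"product": "Credit card",
                 "complaint_what_happened": ""}},
]

toy = pd.json_normalize(records)
# Nested keys are flattened with a dot, e.g. '_source.product'
toy = toy.rename(columns=lambda c: c.replace("_source.", "").lstrip("_"))
toy = toy.rename(columns={"complaint_what_happened": "complaints"})
print(list(toy.columns))
```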

In [None]:
#Assign new column names

df.rename({'_index': '_source.index', '_type': '_source.type','_id': '_source.id', '_score': '_source.score'}, axis=1, inplace=True)
df.rename(columns=lambda x: x.replace('_source.', ''), inplace=True)
df.rename({'complaint_what_happened': 'complaints'}, axis=1, inplace=True)

# print new column names
df.columns

In [None]:
#Use regex to assign nan in place of blanks in the complaints column

df['complaints'] = df['complaints'].replace(r'^\s*$', np.nan, regex=True)
df.head()

In [None]:
#Remove all rows where complaints column is nan

df.dropna(subset = ["complaints"], inplace=True)

In [None]:
# Print shape of df after cleaning the dataset
print("Shape of dataframe:-", df.shape)

# print the head also
df.head()

## Prepare the text for topic modeling

Once you have removed all the blank complaints, you need to:

* Make the text lowercase
* Remove text in square brackets
* Remove punctuation
* Remove words containing numbers


Once you have done these cleaning operations you need to perform the following:
* Lemmatize the texts
* Use POS tags to get relevant words from the texts.


In [None]:
# Write your function here to clean the text and remove all the unnecessary elements.

def clean_text(df):
    
    df['complaints'] = df['complaints'].str.lower() # Make text lower case
    df['complaints'] = df['complaints'].str.replace("\n", " ", regex=False) # Replace new line chars with a space so words stay separated
    df['complaints'] = df['complaints'].replace(r'\[[^\]]*\]', '', regex=True) # Remove text in square brackets (must run before punctuation is stripped)
    df['complaints'] = df['complaints'].replace(r'\w*\d\w*', '', regex=True) # Remove words containing numbers
    df['complaints'] = df['complaints'].replace(r'[^\w\s]', '', regex=True) # Remove punctuation
    
    return df.head()

In [None]:
# Call the clean_text function to clean the complaints column

clean_text(df)

In [None]:
# Create a complaint list from df.complaints column

com_list = df.complaints.tolist()

# Print couple of sentences
com_list[:3]

In [None]:
#Write your function to Lemmatize the texts

stop_words = stopwords.words("english")

def preprocess(document):
    'removes stopwords and lemmatizes the remainder of the sentence'

    # tokenize into words
    words = word_tokenize(document)
    
    words = [wordnet_lemmatizer.lemmatize(word, pos='n') for word in words if word not in stop_words]

    # join words to make sentence
    document = " ".join(words)
    
    return document

In [None]:
# call preprocess function to create lemmatization list using complaint list
wordnet_lemmatizer = WordNetLemmatizer()

lemma_list = [preprocess(sentence) for sentence in com_list]

# print couple of sentences from lemmatize list
lemma_list[:3]

In [None]:
#Create a dataframe('df_clean') that will have only the complaints and the lemmatized complaints 

df_clean = pd.DataFrame(list(zip(com_list, lemma_list)), columns =['complaint', 'lemma_Com'])

# print head
df_clean.head()

In [None]:
#Write your function to extract the POS tags (keep only tokens tagged as NOUN)

nlp = spacy.load("en_core_web_sm", disable=['parser','ner'])

def get_POS_tags(document):
    # tag the document and keep only the nouns
    tagged_sentence = nlp(document)

    return " ".join(token.text for token in tagged_sentence if token.pos_ == "NOUN")

In [None]:
# call get_POS_tags function to get the words with NOUN POS tags only.  Create a new list com_after_rem_pos.

com_after_rem_pos = [get_POS_tags(sentence) for sentence in com_list]

# Print first couple of lines
com_after_rem_pos[:3]

In [None]:
#The clean dataframe should now contain the raw complaint, the lemmatized complaint and the complaint with only the nouns kept.

# add column complaint after removing POS tags
df_clean['com_after_rem_pos'] = com_after_rem_pos

df_clean.head()

## Exploratory data analysis to get familiar with the data.

Write the code in this task to perform the following:

*   Visualise the data according to the 'Complaint' character length
*   Using a word cloud, find the top 40 words by frequency among all the complaints after processing the text
*   Find the top unigrams, bigrams and trigrams by frequency among all the complaints after processing the text.




In [None]:
# Write your code here to visualise the data according to the 'Complaint' character length

plt.figure(figsize=(10,6))
doc_lens = [len(d) for d in df_clean.complaint]
plt.hist(doc_lens, bins = 50)
plt.xlabel('Character length')
plt.ylabel('Number of complaints');

#### Find the top 40 words by frequency among all the complaints after processing the text.

In [None]:
#Using a word cloud find the top 40 words by frequency among all the complaints after processing the text

from wordcloud import WordCloud

# Join the full text of every complaint; str() on the Series would only give a truncated preview
wordcloud = WordCloud(stopwords=stop_words,max_words=40).generate(" ".join(df_clean.com_after_rem_pos))

plt.figure(figsize=(10,6))
plt.imshow(wordcloud)
plt.axis('off')
plt.show();

In [None]:
print(spacy.__version__)

#Removing -PRON- from the text corpus
#df_clean['Complaint_clean'] = df_clean['complaint_POS_removed'].str.replace('-PRON-', '')

####  Since we are using a newer version of spaCy, we don't need to perform the above step.

#### Find the top unigrams,bigrams and trigrams by frequency among all the complaints after processing the text.
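The unigram, bigram and trigram counts differ only in the `ngram_range` passed to `CountVectorizer`, so one helper can cover all three. The sketch below is an illustration on a made-up corpus; the name `top_ngrams` and the toy sentences are assumptions, and stop-word removal is left out to keep the counts easy to follow.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(corpus, n_words, top_k=10):
    """Return the top_k most frequent n-grams made of n_words words."""
    vec = CountVectorizer(ngram_range=(n_words, n_words))
    counts = vec.fit_transform(corpus).sum(axis=0)   # 1 x vocabulary totals
    freqs = [(term, int(counts[0, idx]))
             for term, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda pair: pair[1], reverse=True)[:top_k]

toy_corpus = ["credit card payment late",
              "credit card fee",
              "late payment fee"]
top_bigram = top_ngrams(toy_corpus, n_words=2, top_k=1)
print(top_bigram)
```

Calling it with `n_words=1` or `n_words=3` gives the unigram and trigram tables in the same shape.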

In [None]:
#Write your code here to find the top 30 unigrams by frequency among the complaints in the cleaned dataframe (df_clean).

# function to find the top n unigram
def get_top_n_unigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(1, 1), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)

    sum_words = bag_of_words.sum(axis=0) 
      
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    
    return words_freq[:n]

In [None]:
# finding the top 30 unigrams

common_words_uni = get_top_n_unigram(df_clean.com_after_rem_pos, 30)

In [None]:
#plotting the top 10 words in the unigram frequency

plt.figure(figsize=(15,8))
df_uni = pd.DataFrame(common_words_uni[0:10], columns = ['unigram' , 'count'])
fig = sns.barplot(y=df_uni['unigram'], x=df_uni['count'])

In [None]:
#Write your code here to find the top 30 bigrams by frequency among the complaints in the cleaned dataframe (df_clean).

# function to find the top n bigram

def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
       
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)

    return words_freq[:n]

In [None]:
# finding the top 30 bigrams

common_words_bi = get_top_n_bigram(df_clean.com_after_rem_pos, 30)

In [None]:
#Plotting the top 10 words in the bigram frequency

plt.figure(figsize=(15,8))
df_bi = pd.DataFrame(common_words_bi[0:10], columns = ['bigram' , 'count'])
fig = sns.barplot(y=df_bi['bigram'], x=df_bi['count'])

In [None]:
#Write your code here to find the top 30 trigrams by frequency among the complaints in the cleaned dataframe (df_clean).

# function to find the top n trigrams

def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)

    sum_words = bag_of_words.sum(axis=0) 
    
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    
    return words_freq[:n]

In [None]:
# finding the top 30 trigrams

common_words_tri = get_top_n_trigram(df_clean.com_after_rem_pos, 30)

In [None]:
# plotting the top-10 trigrams

plt.figure(figsize=(15,8))
df_tri = pd.DataFrame(common_words_tri[0:10], columns = ['trigram' , 'count'])
fig = sns.barplot(y=df_tri['trigram'], x=df_tri['count'])

## The personal details of customers have been masked in the dataset with xxxx. Let's remove the masked text, as it is of no use for our analysis

In [None]:
# remove 'xxxx'

df_clean['com_after_rem_pos'] = df_clean['com_after_rem_pos'].str.replace('xxxx','')

In [None]:
#All masked text has been removed

df_clean

## Feature Extraction
Convert the raw texts to a matrix of TF-IDF features

**max_df** is used for removing terms that appear too frequently, also known as "corpus-specific stop words"
max_df = 0.95 means "ignore terms that appear in more than 95% of the complaints"

**min_df** is used for removing terms that appear too infrequently
min_df = 2 means "ignore terms that appear in fewer than 2 complaints"
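A small sketch shows how the two thresholds prune the vocabulary. The four toy documents are made up for illustration: "account" appears in all of them (100% > 95%) and is dropped by `max_df`, while words that occur in only one document are dropped by `min_df`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "account card fee",
    "account loan rate",
    "account card dispute",
    "account mortgage loan",
]

toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
toy_dtm = toy_tfidf.fit_transform(toy_docs)

# Terms that survived both filters
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms, toy_dtm.shape)
```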

In [None]:
#Write your code here to initialise the TfidfVectorizer 

tfidf = TfidfVectorizer(max_df = 0.95, min_df = 2)

#### Create a document term matrix using fit_transform

Printing a document term matrix lists its non-zero entries as (complaint_id, token_id) tuples, each followed by its tf-idf score.
Any (complaint_id, token_id) pair that is not listed has a tf-idf score of 0.

In [None]:
#Write your code here to create the Document Term Matrix by transforming the complaints column present in df_clean.

dtm = tfidf.fit_transform(df_clean.com_after_rem_pos)

In [None]:
print(dtm)

In [None]:
# dense view of the matrix (the explicit form of the dtm.A shorthand)
dtm.toarray()

In [None]:
# get feature names using tfidf.get_feature_names_out()
# (available from scikit-learn 1.0; the older get_feature_names() was removed in 1.2)

pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())

## Topic Modelling using NMF

Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so the model is not trained on any labelled topics. It decomposes (or factorizes) the high-dimensional document-term matrix into two lower-dimensional matrices, a document-topic matrix and a topic-term matrix, whose entries are all non-negative.
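To make the factorization concrete, here is a minimal NumPy sketch of the classic multiplicative-update rules on a random toy matrix. It is only an illustration of the idea `V ≈ W H`; scikit-learn's `NMF` has its own solvers and initialisation, and the matrix sizes, `k` and iteration count here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 8))      # toy non-negative "document-term" matrix
k = 2                       # number of topics (assumed for the demo)
W = rng.random((6, k))      # document-topic weights
H = rng.random((k, 8))      # topic-term weights
eps = 1e-9                  # guards against division by zero

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep every entry non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print(initial_error, final_error)
```

Each row of `W` says how strongly a document loads on each topic, and each row of `H` says which terms characterise a topic.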

In this task you have to perform the following:

* Find the best number of clusters 
* Apply the best number to create word clusters
* Inspect & validate the correctness of each cluster with respect to the complaints 
* Correct the labels if needed 
* Map the clusters to topics/cluster names

In [None]:
# import NMF library

from sklearn.decomposition import NMF

## Manual Topic Modeling
You need to take a trial & error approach to find the best number of topics for your NMF model.

The only parameter that is required is the number of components i.e. the number of topics we want. This is the most crucial step in the whole topic modeling process and will greatly affect how good your final topics are.
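One way to run the trial-and-error loop is to fit a model for each candidate number of topics and read the top words side by side. The sketch below does this on a made-up six-document corpus; the corpus, the candidate values and the choice of three words per topic are all assumptions for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "credit card charge fee", "credit card annual fee",
    "mortgage loan payment", "home loan mortgage rate",
    "checking account deposit", "savings account deposit fee",
]
vec = TfidfVectorizer()
X_toy = vec.fit_transform(toy_docs)
# Column index -> term, in column order
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

top_words = {}
for k in (2, 3):
    model = NMF(n_components=k, random_state=40, max_iter=500).fit(X_toy)
    top_words[k] = [
        [terms[i] for i in topic.argsort()[::-1][:3]]  # 3 heaviest terms
        for topic in model.components_
    ]
    print(k, top_words[k])
```

The number of topics whose word lists are the most distinct and interpretable is the one to keep.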

### We also tried 8 and 10 topics and found that 5 is the best number of topics

In [None]:
#Load your nmf_model with the n_components i.e 5
num_topics = 5 #write the value you want to test out

#keep the random_state =40
nmf_model = NMF(n_components=num_topics, random_state=40) #write your code here

In [None]:
# The model is fitted in the next cell via fit_transform, so a separate fit is not needed here

print(len(tfidf.get_feature_names_out()))

In [None]:
# Document-topic matrix
W1 = nmf_model.fit_transform(dtm)

# Topic-term matrix
H1 = nmf_model.components_

In [None]:
#Print the Top15 words for each of the topics

words = np.array(tfidf.get_feature_names_out())
topic_words = pd.DataFrame(np.zeros((num_topics, 15)), index=[f'Topic {i + 1}' for i in range(num_topics)],
                           columns=[f'Word {i + 1}' for i in range(15)]).astype(str)
for i in range(num_topics):
    ix = H1[i].argsort()[::-1][:15]
    topic_words.iloc[i] = words[ix]

topic_words

In [None]:
H1 = pd.DataFrame(H1, index=[f'Topic {i + 1}' for i in range(num_topics)], columns=tfidf.get_feature_names_out())
H1

In [None]:
# Top words for each topic

fig, ax = plt.subplots(nrows=5, ncols=1, figsize=(15, 10))
for i in range(5):
    words = H1.loc[f'Topic {i + 1}'].sort_values(ascending=False)[:20]
    words.plot(ax=ax[i], kind='bar', rot=0)
    ax[i].set_title(f'Topic {i + 1}')

plt.tight_layout()

### After evaluating the mapping, if the topics assigned are correct then assign these names to the relevant topic:
* Bank Account services
* Credit card or prepaid card
* Theft/Dispute Reporting
* Mortgage/Loan
* Others
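Before fixing the names, it can help to see how a mapping turns topic weights into labels. The sketch below uses a made-up 3x2 document-topic matrix and a hypothetical two-topic mapping: each row gets the name of the column with the largest weight.

```python
import pandas as pd

# Hypothetical document-topic weights for three complaints
W_toy = pd.DataFrame(
    [[0.9, 0.1], [0.2, 0.7], [0.6, 0.3]],
    columns=["Topic 1", "Topic 2"],
)
toy_mapping = {"Topic 1": "Bank Account services",
               "Topic 2": "Mortgage/Loan"}

# idxmax picks the dominant topic per row; map converts it to a name
dominant = W_toy.idxmax(axis=1).map(toy_mapping)
counts = dominant.value_counts()
print(dominant.tolist())
```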

In [None]:
# After analysing the Topic 1-5 we have created the following topic mapping

topic_mapping = {
    'Topic 1': 'Other',
    'Topic 2': 'Theft/Dispute reporting',
    'Topic 3': 'Mortgages/loans',
    'Topic 4': 'Credit card / Prepaid card',
    'Topic 5': 'Bank account services'
}

In [None]:
# Recall the document-topic matrix, W1
#Assign the best topic to each of the complaints in the max_topic column

W1 = pd.DataFrame(W1, columns=[f'Topic {i + 1}' for i in range(num_topics)])
W1['max_topic'] = W1.apply(lambda x: topic_mapping.get(x.idxmax()), axis=1)

nmf_topics = W1[pd.notnull(W1['max_topic'])]
nmf_topics.head(10)

In [None]:
#Add the `Topic` column to df_clean

df_clean['Topic'] = nmf_topics.max_topic # assign the dominant topic to each row

In [None]:
df_clean.head()

In [None]:
# Verify some complaints and corresponding topics

for i in range(15000,15005):
    print(df_clean.complaint[i])
    print("Topic:", df_clean.Topic[i])
    print("--"*65)

### After manually verifying 25-30 complaints and their respective topics in this way, we found that the topics mostly match the complaints, with very few mismatches. This is the best topic-complaint mapping we have found.

## Supervised model to assign new complaints to the relevant topics

You have now built the model that assigns a topic to each complaint. In the section below you will use these topics to classify any new complaints.

Since you will be using supervised learning techniques, we convert the topic names to integer class labels (0-4) for use as the target variable.

In [None]:
#Keep only the "complaint" & "Topic" columns in the new dataframe --> training_data

training_data = df_clean[['complaint', 'Topic']]

In [None]:
# training data

training_data

#### Apply the supervised models on the training data created. In this process, you have to do the following:

* Create the vector counts using Count Vectoriser
* Transform the word vectors to tf-idf
* Create the train & test data using the train_test_split on the tf-idf & topics
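The three steps can also be chained in a scikit-learn `Pipeline`, so that the vocabulary and idf weights are learned from the training texts only. This is a sketch of an alternative arrangement, not the approach used in the cells that follow; the toy texts, labels and step names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training complaints and their topics
train_texts = [
    "mortgage loan payment overdue",
    "home loan interest rate",
    "loan modification mortgage escrow",
    "credit card charge fraud",
    "card annual fee credit",
    "credit card late fee",
]
train_labels = ["Mortgages/loans"] * 3 + ["Credit card / Prepaid card"] * 3

clf = Pipeline([
    ("counts", CountVectorizer()),      # word counts
    ("tfidf", TfidfTransformer()),      # counts -> tf-idf
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)

pred = clf.predict(["mortgage loan escrow"])
print(pred[0])
```

With this layout, raw text is split into train and test first and the pipeline is fitted on the training part alone.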


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Create the vector counts using Count Vectoriser
count_vect = CountVectorizer(max_df = 0.95, min_df = 2)

# Transform the word vectors to tf-idf. We will use TfidfTransformer()

vectorized_complaint = count_vect.fit_transform(df_clean.complaint)
tfidf_transformer = TfidfTransformer()
tfidf_model = tfidf_transformer.fit_transform(vectorized_complaint)

In [None]:
# Map the target variable to numeric value. topics will be assigned value 0-4 as below.

df_clean['Topic'] = df_clean['Topic'].map({
    'Other': 0,
    'Theft/Dispute reporting': 1, 
    'Mortgages/loans': 2, 
    'Credit card / Prepaid card': 3,
    'Bank account services': 4
})

In [None]:
#create X and y variable

X = pd.DataFrame(tfidf_model.toarray(), columns=count_vect.get_feature_names_out())

y = df_clean['Topic']

X.head(5)

In [None]:
# value counts for each class

y.value_counts()

In [None]:
# split the data into train and test

from sklearn.model_selection import train_test_split

# split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size =0.3, random_state=100)

print("train shape:", X_train.shape)
print("test shape:", X_test.shape)

You have to try at least 2 models on the train & test data from these options:
* Logistic regression
* Decision Tree
* Random Forest
* Naive Bayes (optional)

**Using the required evaluation metrics judge the tried models and select the ones performing the best**
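Accuracy alone can hide weak classes, so it is worth reading per-class precision and recall off the confusion matrix as well. The sketch below does this for a made-up set of six true and predicted labels; the values are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted class labels
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows = true, columns = predicted
recall = cm.diagonal() / cm.sum(axis=1)        # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)     # correct / all predictions of the class
accuracy = cm.diagonal().sum() / cm.sum()

print(cm)
print(precision, recall, accuracy)
```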

### Models:
- Create a function to calculate the accuracy score and print the confusion matrix 
- We have decided to create the models below:
    1. Naive Bayes (Bernoulli and Multinomial)
    2. Random forest
    3. Logistic Regression

In [None]:
# import confusion matrix and accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score

# function to print the accuracy and the confusion matrix
def get_metrics(y_train, y_pred_train, y_test, y_pred_test):
    print("Train Accuracy :", accuracy_score(y_true=y_train, y_pred = y_pred_train))
    print("Train Confusion Matrix:")
    print(confusion_matrix(y_train, y_pred_train))
    print("-"*50)
    print("Test Accuracy :", accuracy_score(y_true=y_test, y_pred = y_pred_test))
    print("Test Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred_test))

### 1. Naive Bayes

In [None]:
# training the Bernoulli Naive Bayes model 
from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()

bnb.fit(X_train,y_train)

In [None]:
# predicting and getting the metrics for the train and test data
y_pred_train = bnb.predict(X_train)
y_pred_test = bnb.predict(X_test)

get_metrics(y_train, y_pred_train, y_test, y_pred_test)

In [None]:
# training the Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

mnb=MultinomialNB()
mnb.fit(X_train, y_train)

In [None]:
# predicting and getting the metrics for the train and test data
y_pred_train = mnb.predict(X_train)
y_pred_test = mnb.predict(X_test)

get_metrics(y_train, y_pred_train, y_test, y_pred_test)

### 2. Random Forest
- We have tuned the hyper-parameters and use the best parameters here to build the model

In [None]:
# training the Random Forest model
# The hyper-parameters below are the best values found during tuning
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42, 
                                  n_jobs=-1, 
                                  max_depth =100, 
                                  min_samples_leaf = 150, 
                                  max_features = 1000,
                                  n_estimators = 300)

rf_model.fit(X_train,y_train)

In [None]:
# predicting and getting the metrics for the train and test data
y_pred_train = rf_model.predict(X_train)
y_pred_test = rf_model.predict(X_test)

get_metrics(y_train, y_pred_train, y_test, y_pred_test)

### 3. Logistic Regression

In [None]:
# training the Logistic regression model
from sklearn.linear_model import LogisticRegression

lm = LogisticRegression().fit(X_train, y_train)

In [None]:
# predicting and getting the metrics for the train and test data
y_pred_train = lm.predict(X_train)
y_pred_test = lm.predict(X_test)

get_metrics(y_train, y_pred_train, y_test, y_pred_test)

## Conclusion:

We have the four models below.
1. Bernoulli Naive Bayes - 64% Test accuracy
2. Multinomial Naive Bayes - 72.7% Test accuracy
3. Random Forest - 82.2% Test accuracy
4. Logistic Regression - 93% Test accuracy

#### Among these, Logistic Regression is the best model in terms of test accuracy

## Model Inference

In [None]:
# Define some custom text
custom_text = "Applicable loan admin charge fee waiver is not applied on my loan account number xxxx-xxxxxx"

vectorized = tfidf_transformer.transform(count_vect.transform([custom_text]))
cust_x = pd.DataFrame(vectorized.toarray(), columns=count_vect.get_feature_names_out())

# make prediction
predictions = lm.predict(cust_x)

In [None]:
predictions

The topic labels and their topics are as follows:

0. Other
1. Theft/Dispute reporting
2. Mortgages/loans
3. Credit card / Prepaid card
4. Bank account services
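The table above can be turned into a lookup so that numeric predictions are printed as topic names. This is a small sketch; `label_names` and the example predictions are hypothetical and simply mirror the list above.

```python
# Inverse of the numeric encoding used for the target variable
label_names = {
    0: "Other",
    1: "Theft/Dispute reporting",
    2: "Mortgages/loans",
    3: "Credit card / Prepaid card",
    4: "Bank account services",
}

example_predictions = [2, 4]   # hypothetical model output
decoded = [label_names[int(p)] for p in example_predictions]
print(decoded)
```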

### The final model (Logistic Regression) is able to correctly predict custom text as `Mortgages/loans`