 ## Problem Statement



 You need to build a model that is able to classify customer complaints based on the products/services. By doing so, you can segregate these tickets into their relevant categories and, therefore, help in the quick resolution of the issue.



 You will be doing topic modelling on the <b>.json</b> data provided by the company. Since this data is not labelled, you need to apply NMF to analyse patterns and classify tickets into the following five clusters based on their products/services:



 * Credit card / Prepaid card



 * Bank account services



 * Theft/Dispute reporting



 * Mortgages/loans



 * Others





 With the help of topic modelling, you will be able to map each ticket onto its respective department/category. You can then use this data to train any supervised model such as logistic regression, decision tree or random forest. Using this trained model, you can classify any new customer complaint support ticket into its relevant department.

 ## Pipeline tasks to be performed



 You need to perform the following eight major tasks to complete the assignment:



 1.  Data loading



 2. Text preprocessing



 3. Exploratory data analysis (EDA)



 4. Feature extraction



 5. Topic modelling



 6. Model building using supervised learning



 7. Model training and evaluation



 8. Model inference

 ## Importing the necessary libraries

In [None]:
# %%
import tensorflow as tf
# Check if GPU is available
if tf.test.gpu_device_name():
    print('GPU found')
else:
    print("No GPU found")


In [None]:
# %%
import json 
import numpy as np
import pandas as pd
import re, nltk, spacy, string
import en_core_web_sm
nlp = en_core_web_sm.load()
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from pprint import pprint

import os



 ## Loading the data



 The data is in JSON format and we need to convert it to a dataframe.
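
 As a rough illustration of that conversion, the sketch below flattens two made-up records with `pd.json_normalize`. The record layout (an `_index` key plus a nested `_source` dictionary) and the variable names `records` and `toy_df` are assumptions for illustration only, not the real file contents.

```python
import pandas as pd

# Hypothetical records, invented for this example (not taken from the real data file)
records = [
    {"_index": "complaints",
     "_source": {"product": "Mortgage", "complaint_what_happened": "Escrow was miscalculated"}},
    {"_index": "complaints",
     "_source": {"product": "Credit card", "complaint_what_happened": ""}},
]

# json_normalize flattens nested dictionaries into dot-separated column names
toy_df = pd.json_normalize(records)
print(list(toy_df.columns))
```

 Nested keys end up as dotted column names such as `_source.product`, which is why a renaming step is usually needed afterwards.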

In [None]:
# %%
# Open the JSON file (write the path to your data file); the with-block closes it afterwards
with open('./complaints-2021-05-14_08_16.json') as f:
    # json.load returns the parsed JSON object
    data = json.load(f)

# Flatten the nested records into a dataframe
df = pd.json_normalize(data)



 ## Data preparation

In [None]:
# %%
# Inspect the dataframe to understand the given data.
display(df.head())
# df.info() prints its summary directly and returns None, so it is not wrapped in display()
df.info()
display(df.describe(include='all'))


In [None]:
# %%
#print the column names
df.columns



In [None]:
# %%
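#Assign new column names: strip leading underscores and the 'source.' prefix
df.columns = df.columns.str.lstrip('_')
# regex=False treats 'source.' as a literal string (otherwise '.' could match any character)
df.columns = df.columns.str.replace('source.', '', regex=False)
df.columns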



In [None]:
# %%
# Assign nan in place of blanks in the complaints column only (not the whole row)
df.loc[df['complaint_what_happened'] == '', 'complaint_what_happened'] = np.nan



In [None]:
# %%
# df shape before dropna
print('df.shape before dropna =',df.shape)
#Remove all rows where complaints column is nan
df.dropna(subset='complaint_what_happened', inplace=True)
# df shape after dropna
print('df.shape after dropna =',df.shape)



 ## Prepare the text for topic modeling



 Once you have removed all the blank complaints, you need to:



 * Make the text lowercase

 * Remove text in square brackets

 * Remove punctuation

 * Remove words containing numbers





 Once you have done these cleaning operations you need to perform the following:

 * Lemmatize the texts

 * Extract the POS tags of the text and remove every word that is not tagged as a noun (a tag starting with `NN`).
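
 The four cleaning operations listed above can be sketched on a single string as follows. This is a minimal illustration using only the standard library; `clean_text` and `sample` are hypothetical names, and the digit rule here drops any token containing a digit (including pure numbers), which is one possible reading of "words containing numbers".

```python
import re
import string

def clean_text(text: str) -> str:
    """Apply the four cleaning steps to one complaint (illustrative helper)."""
    # 1. Lowercase
    text = text.lower()
    # 2. Remove text in square brackets (non-greedy, so each bracket pair is handled separately)
    text = re.sub(r"\[.*?\]", "", text)
    # 3. Remove punctuation; re.escape keeps characters like ']' and '\' literal inside the class
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # 4. Remove every token that contains at least one digit
    text = re.sub(r"\b\w*\d\w*\b", "", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())

sample = "Hello [REDACTED] World! acct123 paid 45 dollars."
print(clean_text(sample))  # hello world paid dollars
```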



 #### Convert data types

In [None]:
# %%
df = df.convert_dtypes()
df.info()


 #### Make the text lowercase

In [None]:
# %%
# Write your function here to clean the text and remove all the unnecessary elements.
df['clean_complaints'] = df['complaint_what_happened'].str.lower()



 #### Create function to extract regex

In [None]:
# %%
def extract_regex(df, new_df, regexes):
    if not isinstance(regexes, list):
        regexes = [regexes]
    new_df = pd.DataFrame()
    for reg in regexes:
        if new_df.empty:
            new_df = pd.DataFrame(df[df['clean_complaints'].str.contains(reg, regex=True)]['clean_complaints'])            
        else:
            new_df = pd.concat([new_df, pd.DataFrame(df[df['clean_complaints'].str.contains(reg, regex=True)]['clean_complaints'])])
        
    # Apply the regular expressions to the DataFrame
    new_df['extracted'] = new_df['clean_complaints'].apply(lambda x: [match for regex in regexes for match in re.findall(regex, x)])
    # Display the 'extracted' column
    return pd.DataFrame((new_df['extracted']))



 #### Remove text in square brackets

In [None]:
# %%
df_square_brackets = pd.DataFrame()
regex = r'(\[.*?\])'
extract_regex(df, df_square_brackets, regex)


In [None]:
# %%
df['clean_complaints'] = df['clean_complaints'].str.replace(regex, '', regex = True)


In [None]:
# %%
extract_regex(df, df_square_brackets, regex)


 #### Remove punctuation

In [None]:
# %%
df_punctuation = pd.DataFrame()
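# Escape the punctuation so that characters such as ']', '\' and '^' are matched literally
regex = f'[{re.escape(string.punctuation)}]'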
extract_regex(df, df_punctuation, regex)


In [None]:
# %%
# Remove punctuation from 'clean_complaints'
df['clean_complaints'] = df['clean_complaints'].str.replace(regex, '', regex=True)


In [None]:
# %%
extract_regex(df, df_punctuation, regex)


 #### Remove words containing numbers

 Filter words containing numbers anywhere

In [None]:
# %%
# Create a new DataFrame that lists, per complaint, the words that contain digits
df_word_num = pd.DataFrame()

# Define the regular expressions
regex = [r'\b[A-Za-z]+\d+\w*\b', r'\b\d+[A-Za-z]+\w*\b']

extract_regex(df, df_word_num, regex).head(30)


 Filter words containing numbers in between

In [None]:
# %%
# Create a new DataFrame that lists, per complaint, the words with digits between letters
df_word_num = pd.DataFrame()

# Define the regular expressions
regex = r'\b[A-Za-z]+\d+[A-Za-z]+\b'

extract_regex(df, df_word_num, regex).head(10)


 Remove words containing numbers in between

In [None]:
# %%
# Remove words with digits in between from 'clean_complaints'
df['clean_complaints'] = df['clean_complaints'].str.replace(regex, '', regex=True)


In [None]:
# %%
extract_regex(df, df_word_num, regex).head(10)


 #### Clean space

In [None]:
# %%
print(df['clean_complaints'][1])


In [None]:
# %%
# Removing leading/trailing whitespace and empty sentences
df['clean_complaints'] = df['clean_complaints'].apply(lambda x: '\n'.join(sent.strip() for sent in x.split('\n') if sent.strip() != ''))


In [None]:
# %%
print(df['clean_complaints'][1])


In [None]:
# %%
# Removing extra spaces between words.
df['clean_complaints'] = df['clean_complaints'].apply(lambda x: '\n'.join(' '.join(word.strip() for word in sent.split() if word.strip()!= '') for sent in x.split('\n') if sent.strip()!= ''))


In [None]:
# %%
print(df['clean_complaints'][1])


 #### Drop empty rows

In [None]:
# %%
# df shape before dropna
print('df.shape before dropna =',df.shape)
#Remove all rows where complaints column is nan
df.dropna(subset='clean_complaints', inplace=True)
# Drop rows where column 'clean_complaints' is equal to ''
df = df[df['clean_complaints'] != '']
# df shape after dropna
print('df.shape after dropna =',df.shape)
# reset index
df.reset_index(drop=True, inplace=True)


 #### Drop duplicates

In [None]:
# %%
# df shape before drop_duplicates
print('df.shape before drop_duplicates =',df.shape)
# Drop duplicate rows based on column 'clean_complaints'
df = df.drop_duplicates(subset='clean_complaints')
# df shape after drop_duplicates
print('df.shape after drop_duplicates =',df.shape)


In [None]:
# %%
# Cache the cleaned dataframe: reuse df.csv if it already exists, otherwise write it
if os.path.isfile('df.csv'):
    # load df
    df = pd.read_csv('df.csv')
else:
    df.to_csv('df.csv', index=False)



 #### Lemmatize the texts

In [None]:
# %%
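#Write your function to Lemmatize the texts
def lemmatize(sent):
    # spacy.prefer_gpu() only takes effect when called before the pipeline is loaded,
    # so it is not called here on every sentence
    doc = nlp(sent)
    return ' '.join([token.lemma_ for token in doc])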



In [None]:
# %%
if os.path.isfile('df_clean.csv'):
    # load df_clean
    df_clean = pd.read_csv('df_clean.csv')
else:
    #tag remote collab
    df = pd.read_csv('df.csv')

    #Create a dataframe('df_clean') that will have only the complaints and the lemmatized complaints 
    df_clean = pd.DataFrame()

    # initialize 'complaints' column
    df_clean['complaints'] = df['clean_complaints']

    # process 'complaints_lemmatized' column
    df_clean['complaints_lemmatized'] = pd.DataFrame(df_clean['complaints'].apply(lambda x: '\n'.join(lemmatize(sent) for sent in x.split('\n'))))

    # Store df_clean for later use
    df_clean.to_csv('df_clean.csv', index=False)


In [None]:
# %%
print(df_clean['complaints_lemmatized'][0])


In [None]:
# %%
df_clean



In [None]:
# %%
def pos_tag(text):
    # Tokenize the text and tag each token with its part of speech
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    # Keep only the nouns (tags starting with 'NN': NN, NNS, NNP, NNPS) and lemmatize what remains
    return lemmatize(' '.join([word for word, tag in pos_tags if tag.startswith('NN')]))



In [None]:
# %%
#Write your function to extract the POS tags 

if os.path.isfile('df_clean_v1.csv'):
  #tag remote collab
  df = pd.read_csv('df.csv')
  # load df_clean
  df_clean = pd.read_csv('df_clean_v1.csv')
else:
  #tag remote collab
  df = pd.read_csv('df.csv')
  #tag remote collab
  df_clean = pd.read_csv('df_clean.csv')

  nltk.download('punkt')
  # Make sure you have the necessary NLTK data downloaded
  nltk.download('averaged_perceptron_tagger')

  df_clean['complaint_POS_removed'] = pd.DataFrame(df_clean['complaints'].apply(lambda x: '\n'.join(pos_tag(sent) for sent in x.split('\n'))))
  # Store df_clean for later use
  df_clean.to_csv('df_clean_v1.csv', index=False)



In [None]:
# %%
#The clean dataframe should now contain the raw complaint, lemmatized complaint and the complaint after removing POS tags.
df_clean


 ## The personal details of customers have been masked in the dataset with xxxx. Let's remove the masked text, as it is of no use for our analysis

In [None]:
# %%
df_clean['complaint_POS_removed'] = df_clean['complaint_POS_removed'].str.replace(r'xxxx*','', regex =True)
# Replace NaN values with an empty string
df_clean['complaint_POS_removed'] = df_clean['complaint_POS_removed'].fillna('')


In [None]:
# %%
# Removing leading/trailing whitespace and empty sentences
df_clean['complaint_POS_removed'] = df_clean['complaint_POS_removed'].apply(lambda x: '\n'.join(sent.strip() for sent in x.split('\n') if sent.strip() != ''))
# Removing extra spaces between words.
df_clean['complaint_POS_removed'] = df_clean['complaint_POS_removed'].apply(lambda x: '\n'.join(' '.join(word.strip() for word in sent.split() if word.strip()!= '') for sent in x.split('\n') if sent.strip()!= ''))


In [None]:
# %%
#All masked text has been removed
df_clean


 ## Exploratory data analysis to get familiar with the data.



 Write the code in this task to perform the following:



 *   Visualise the data according to the 'Complaint' character length

 *   Using a word cloud, find the top 40 words by frequency among all the complaints after processing the text

 *   Find the top unigrams, bigrams and trigrams by frequency among all the complaints after processing the text.







In [None]:
# %%
# Write your code here to visualise the data according to the 'Complaint' character length

# Compute the character length of each lemmatized complaint (a Series, not a new column)
complaint_length = df_clean['complaints_lemmatized'].apply(len)

# Set the figure size
plt.figure(figsize=(13, 5))

# Plot a histogram of the complaint lengths
sns.histplot(complaint_length, edgecolor='white', bins=50, alpha=0.55, kde=True)
plt.xlabel('Complaint Length')
plt.ylabel('Frequency')
plt.title('Distribution of Complaint Lengths')
plt.show()


 Distribution of complaint lengths is strongly right-skewed, indicating that most complaints are short, but there are a few very long ones that extend the range significantly.

 #### Find the top 40 words by frequency among all the complaints after processing the text.

In [None]:
# %%
# Get the list of English stop words
stop_words = nlp.Defaults.stop_words

# Create a new column with stop words removed
df_clean['complaint_POS_removed'] = df_clean['complaint_POS_removed'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
# Replace NaN values with an empty string
df_clean['complaint_POS_removed'] = df_clean['complaint_POS_removed'].fillna('')



In [None]:
# %%
#Using a word cloud find the top 40 words by frequency among all the complaints after processing the text

from wordcloud import WordCloud
# Combine all the complaints into a single string
all_complaints = ' '.join(df_clean['complaint_POS_removed'])

# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, max_words=40).generate(all_complaints)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()



In [None]:
# %%
#Removing pronouns from the text corpus (the result is cached in df_clean_v2.csv)
if os.path.isfile('df_clean_v2.csv'):
    #tag remote collab
    df = pd.read_csv('df.csv')
    # load df_clean
    df_clean = pd.read_csv('df_clean_v2.csv')
else:
    #tag remote collab
    df = pd.read_csv('df.csv')
    # Define a function that drops every token spaCy tags as a pronoun
    def remove_PRON(sent):
        doc = nlp(sent)
        return ' '.join([token.text for token in doc if token.pos_ != 'PRON'])

    # Apply the function to the 'complaint_POS_removed' column
    df_clean['Complaint_clean'] = df_clean['complaint_POS_removed'].apply(remove_PRON)
    # Replace NaN values with an empty string
    df_clean['Complaint_clean'] = df_clean['Complaint_clean'].fillna('')

    df_clean.to_csv('df_clean_v2.csv', index=False)



 #### Find the top unigrams, bigrams and trigrams by frequency among all the complaints after processing the text.

 - __Unigram__ means taking only one word at a time.

 - __Bigram__ means taking two words at a time.

 - __Trigram__ means taking three words at a time.



 Source: `https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python/`
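
 The definitions above can be made concrete with a small standard-library sketch. The helper `ngram_counts` and the two toy complaints are hypothetical; counting inside each complaint separately is a design choice made here so that no n-gram spans the boundary between two complaints.

```python
from collections import Counter

def ngram_counts(docs, n):
    """Count n-grams per document so that no n-gram crosses a document boundary."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # Zipping n shifted views of the token list yields consecutive n-token tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

toy_docs = ["credit card payment", "credit card fee"]
top_bigrams = ngram_counts(toy_docs, 2).most_common(1)
print(top_bigrams)  # [(('credit', 'card'), 2)]
```

 Calling `ngram_counts(toy_docs, 1)` or `ngram_counts(toy_docs, 3)` gives the unigram and trigram counts in the same way.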


In [None]:
# %%
#Write your code here to find the top 30 unigram frequency among the complaints in the cleaned dataframe(df_clean). 
from nltk import ngrams, FreqDist

# Replace NaN values with an empty string
df_clean['Complaint_clean'] = df_clean['Complaint_clean'].fillna('')

# Join all the complaints into a single string
all_words = ' '.join(df_clean['Complaint_clean']).split()

unigram_freq = FreqDist(all_words)

# Create a DataFrame from the result of unigram_freq.most_common(30)
df_unigram_freq = pd.DataFrame(unigram_freq.most_common(30), columns=['Unigram', 'Frequency'])

display (df_unigram_freq)


In [None]:
# %%
#Print the top 10 words in the unigram frequency
display(df_unigram_freq.head(10))


In [None]:
# %%
#Write your code here to find the top 30 bigram frequency among the complaints in the cleaned dataframe(df_clean). 

# Generate bigrams
bigrams = ngrams(all_words, 2)

bigram_freq = FreqDist(bigrams)

# Create a DataFrame from the result of bigram_freq.most_common(30)
df_bigram_freq = pd.DataFrame(bigram_freq.most_common(30), columns=['Bigram', 'Frequency'])

display (df_bigram_freq)



In [None]:
# %%
#Print the top 10 words in the bigram frequency
display(df_bigram_freq.head(10))


In [None]:
# %%
#Write your code here to find the top 30 trigram frequency among the complaints in the cleaned dataframe(df_clean). 

# Generate trigrams
trigrams = ngrams(all_words, 3)

trigram_freq = FreqDist(trigrams)

# Create a DataFrame from the result of trigram_freq.most_common(30)
df_trigram_freq = pd.DataFrame(trigram_freq.most_common(30), columns=['Trigram', 'Frequency'])

display (df_trigram_freq)



In [None]:
# %%
#Print the top 10 words in the trigram frequency
display(df_trigram_freq.head(10))


 ## Feature Extraction

 Convert the raw texts to a matrix of TF-IDF features



 **max_df** is used for removing terms that appear too frequently, also known as "corpus-specific stop words"

 max_df = 0.95 means "ignore terms that appear in more than 95% of the complaints"



 **min_df** is used for removing terms that appear too infrequently

 min_df = 2 means "ignore terms that appear in fewer than 2 complaints"

In [None]:
# %%
#Write your code here to initialise the TfidfVectorizer 
tfidf = TfidfVectorizer(max_df = 0.95, min_df = 2)


 #### Create a document term matrix using fit_transform



 The document-term matrix is stored sparsely: each stored entry is a (complaint_id, token_id) pair together with its tf-idf score.

 Pairs that are not stored have a tf-idf score of 0.

In [None]:
# %%
#Write your code here to create the Document Term Matrix by transforming the complaints column present in df_clean.
dtm = tfidf.fit_transform(df_clean['Complaint_clean']) 



In [None]:
# %%
dtm.shape


 ## Topic Modelling using NMF



 Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so the model is not trained on any labelled topics. NMF decomposes (factorizes) the high-dimensional document-term matrix into two lower-dimensional matrices, one relating complaints to topics and one relating topics to terms. All entries of both matrices are non-negative.



 In this task you have to perform the following:



 * Find the best number of clusters

 * Apply the best number to create word clusters

 * Inspect & validate the correctness of each cluster with respect to the complaints

 * Correct the labels if needed

 * Map the clusters to topics/cluster names
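
 A minimal sketch of the factorization on a tiny hand-made matrix may help before working on the real data. Everything here is an assumption for illustration: the four "complaints", the four terms, and the choice of `init="nndsvd"`. Rows 0-1 only use "card"/"fee" and rows 2-3 only use "loan"/"rate", so two topics should separate them.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix: 4 documents x 4 terms (invented values)
terms = np.array(["card", "fee", "loan", "rate"])
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
])

model = NMF(n_components=2, init="nndsvd", random_state=40, max_iter=500)
W = model.fit_transform(X)   # document-topic weights, shape (4, 2)
H = model.components_        # topic-term weights, shape (2, 4)

# The dominant topic of a document is the column with the largest weight in its row of W
doc_topic = W.argmax(axis=1)

# The top terms of a topic are the largest entries in its row of H
for k, row in enumerate(H):
    print(f"Topic {k}:", terms[row.argsort()[::-1][:2]])
print("Dominant topic per document:", doc_topic)
```

 Which integer ends up attached to which group of documents is arbitrary, which is why the topics have to be inspected before they are given names.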

In [None]:
# %%
from sklearn.decomposition import NMF


 ## Manual Topic Modeling

 You need to take a trial-and-error approach to find the best number of topics for your NMF model.



 The only parameter that is required is the number of components i.e. the number of topics we want. This is the most crucial step in the whole topic modeling process and will greatly affect how good your final topics are.
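
 One possible way to structure the trial and error is to fit NMF for several candidate topic counts and compare the reconstruction error that scikit-learn exposes as `reconstruction_err_`. This is a sketch on a random toy matrix, not the notebook's method; the candidate range and the variable names are assumptions. The error normally keeps falling as topics are added, so it is a guide to where the gains flatten out rather than a criterion to minimise, and the final choice should still rest on whether the top words of each topic make sense.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative toy matrix standing in for a document-term matrix
rng = np.random.default_rng(0)
X = rng.random((30, 12))

errors = {}
for k in range(2, 6):
    model = NMF(n_components=k, init="nndsvda", random_state=40, max_iter=1000)
    model.fit(X)
    # Frobenius norm of X - WH for the fitted model
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"n_components={k}: reconstruction error={err:.3f}")
```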

In [None]:
# %%
#Load your nmf_model with the n_components i.e 5
num_topics = 5 #write the value you want to test out

#keep the random_state =40
nmf_model = NMF(n_components=num_topics, random_state=40)
# W is the document-topic matrix (one row per complaint, one column per topic)
W = nmf_model.fit_transform(dtm)



In [None]:
# %%
# H is the topic-term matrix (one row per topic, one column per term)
H = nmf_model.components_
len(tfidf.get_feature_names_out())


In [None]:
# %%
#Print the Top15 words for each of the topics
words = np.array(tfidf.get_feature_names_out())
topic_words = pd.DataFrame(np.zeros((num_topics, 15)), index=[f'Topic {i + 1}' for i in range(num_topics)], columns=[f'Word {i + 1}' for i in range(15)]).astype(str)

for i in range(num_topics):
    ix = H[i].argsort()[::-1][:15]
    topic_words.iloc[i] = words[ix]

topic_words



In [None]:
# %%
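#Create the best topic for each complaint in terms of integer value 0,1,2,3 & 4
# Each row of the document-topic matrix holds one weight per topic
topic_results = nmf_model.transform(dtm)




In [None]:
# %%
#Assign the best topic to each of the complaints in Topic Column
# The best topic is the one with the largest weight in the row
df_clean['Topic'] = topic_results.argmax(axis=1)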


In [None]:
# %%
df_clean.head()


In [None]:
# %%
#Print the first 5 Complaint for each of the Topics
# Use a separate variable so that df_clean keeps all the complaints
df_first5 = df_clean.groupby('Topic').head(5)
df_first5.sort_values('Topic')


 #### After evaluating the mapping, if the topics assigned are correct then assign these names to the relevant topic:

 * Bank Account services

 * Credit card or prepaid card

 * Theft/Dispute Reporting

 * Mortgage/Loan

 * Others
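
 Once the topics have been inspected, the renaming itself is a dictionary lookup. The sketch below runs on a toy dataframe; the id-to-name assignment in `hypothetical_topic_names` is invented for illustration and must be replaced by whatever the top words of your own topics indicate.

```python
import pandas as pd

# HYPOTHETICAL assignment: the real one depends on inspecting each topic's top words
hypothetical_topic_names = {
    0: "Bank Account services",
    1: "Credit card or prepaid card",
    2: "Others",
    3: "Theft/Dispute Reporting",
    4: "Mortgage/Loan",
}

toy = pd.DataFrame({"Topic": [0, 3, 4, 1]})

# Series.map looks each topic id up in the dictionary
toy["Topic_name"] = toy["Topic"].map(hypothetical_topic_names)

# Inverting the dictionary converts the names back to integer labels
name_to_id = {name: idx for idx, name in hypothetical_topic_names.items()}
toy["Topic_id"] = toy["Topic_name"].map(name_to_id)
print(toy)
```

 Any value missing from the dictionary maps to `NaN`, so an empty or incomplete dictionary silently blanks out the column.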

In [None]:
# %%
#Create the dictionary of Topic names and Topics

Topic_names = {   }
#Replace Topics with Topic Names
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)


In [None]:
# %%
df_clean


 ## Supervised model to assign new complaints to the relevant topics



 You have now built the model that creates the topic for each complaint. In the section below, you will use these topics to classify any new complaints.



 Since you will be using supervised learning techniques, you have to convert the topic names to numbers (the models work on numeric arrays).

In [None]:
# %%
#Create the dictionary again of Topic names and Topics

Topic_names = {   }
#Replace Topic Names with Topic numbers
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)


In [None]:
# %%
df_clean


In [None]:
# %%
#Keep only the complaint text & "Topic" columns in the new dataframe --> training_data
# (in df_clean the complaint text is stored in the 'complaints' column)
training_data = df_clean[['complaints', 'Topic']]


In [None]:
# %%
training_data


 #### Apply the supervised models on the training data created. In this process, you have to do the following:

 * Create the vector counts using CountVectorizer

 * Transform the word vector to tf-idf

 * Create the train & test data using train_test_split on the tf-idf & topics
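
 The three steps above can be sketched on a toy corpus as follows. The texts, the labels, the 75/25 split and the `random_state` are assumptions for illustration. The vectorizer is fitted before the split here because the steps are listed in that order; fitting it on the training portion only would avoid leaking test-set vocabulary statistics.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Invented toy complaints with two classes (0 = card, 1 = loan)
texts = [
    "card charge fee", "card fraud charge", "card payment late", "card limit fee",
    "loan rate payment", "loan mortgage rate", "mortgage escrow payment", "loan escrow rate",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Step 1: raw term counts
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(texts)

# Step 2: re-weight the counts as tf-idf
X_tfidf = TfidfTransformer().fit_transform(X_counts)

# Step 3: stratified train/test split on the tf-idf matrix and the topics
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.25, random_state=42, stratify=labels
)
print(X_train.shape, X_test.shape)
```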



In [None]:
# %%

#Write your code to get the Vector count


#Write your code here to transform the word vector to tf-idf


 You have to try at least 3 models on the train & test data from these options:

 * Logistic regression

 * Decision Tree

 * Random Forest

 * Naive Bayes (optional)



 **Using the required evaluation metrics, judge the models you tried and select the best-performing one**
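
 A compact way to compare several classifiers is to loop over them with the same split and the same metric. The sketch below uses an invented toy corpus and macro-averaged F1 as the metric; both are assumptions, and the scores it prints say nothing about the real complaints. Wrapping each classifier in a pipeline with its own `TfidfVectorizer` means the vectorizer is fitted on the training texts only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Invented toy complaints with two classes (0 = card, 1 = loan)
texts = [
    "card charge fee", "card fraud charge", "card payment late",
    "card limit fee", "card dispute charge", "card fee refund",
    "loan rate payment", "loan mortgage rate", "mortgage escrow payment",
    "loan escrow rate", "mortgage rate refinance", "loan payment refinance",
]
labels = [0] * 6 + [1] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, clf in models.items():
    # The pipeline vectorizes the text and then fits the classifier
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(X_train, y_train)
    scores[name] = f1_score(y_test, pipe.predict(X_test), average="macro")

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.2f}")
```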

In [None]:
# %%
# Write your code here to build any 3 models and evaluate them using the required metrics






In [None]:
# %%



