## Theme reduction of user feedback comments using machine learning (NMF)

### Overview:

Built a theme reduction model which will identify the top themes appearing in the user feedback comments after their classification. Additionally, found out the top comments associated with the top themes.

The model helps in identifying the topics which appear frequently specific for each class and hence the overall themes which occur for that class.

## Code Index:


### 1. Theme reduction using machine learning for identifying themes in classified comments

    1.1 Importing essential libraries
    
    1.2 Implementing the theme reduction libraries and functions
    
    1.3 Null handling function
    
    1.4 Reading the data from the file
    
    1.5 Cleaning the nulls from the data and exploring it
    
    1.6 Lemmatization class using NLTK
    
    1.7 Removing all the non alpha-numeric characters
    
    1.8 Tf-IDF for vectorizing and tokenizing the text data and removing the stop words
    
    1.9 Function responsible for displaying the top topics/themes
    
    1.10 Implementing theme reduction using machine learning with NMF(Non-Negative Matrix Factorization)
    
    1.11 Getting the top comments associated with the themes

## Code:

### 1.1 Importing all the essential libraries and functions

In [15]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity="all"
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
pd.set_option('display.max_colwidth', -1)

### 1.2 Importing the theme reduction libraries and functions

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF


### 1.3 Null handling function

In [17]:
#Null cleaning function

def myfillna(series):
    if series.dtype is pd.np.dtype(float):
        return series.fillna('')
    elif series.dtype is pd.np.dtype(int):
        return series.fillna('')
    else:
        return series.fillna('NA')

### 1.4 Reading the data from the file

In [18]:
testdata=pd.read_csv('xplatest.csv',encoding='latin')

### 1.5 Cleaning the nulls from the dataset and exploring it

In [19]:
testdata=testdata.apply(myfillna)

In [20]:
testdata.isnull().sum()

environment               0
rating                    0
comment_text              0
Sentiment                 0
Bug_classifier            0
UX_classifier             0
Performance_classifier    0
dtype: int64

In [21]:
testdata.shape

(28018, 7)

In [22]:
testdata

Unnamed: 0,environment,rating,comment_text,Sentiment,Bug_classifier,UX_classifier,Performance_classifier
0,0,3,Ever since the SAP integration happened the software runs much much much slower - particularly on the mobile device.Also - the hotel search brings in hotels that do not meet the search criteria.,0.183333,0,0,0
1,0,5,Much easier than I expected. I suppose I felt a little intimidated by something new in the beginning.,0.012216,0,0,0
2,1,2,This new version is TOO SLOW. Too many steps.,0.112121,0,0,1
3,1,4,once you understand the process...it is pretty good.,0.475000,0,0,0
4,0,5,Excellent turnaround time on getting bank confirmation. Application is easy to use.,0.716667,0,0,0
5,0,4,"Send reminders more frequently, and include # OF DAYS DUE",0.158333,0,0,0
6,0,5,User friendly,0.375000,0,0,0
7,0,1,There is no need to itemize the hotel receipt tax because it is being submitted as an attachment.,0.000000,0,0,0
8,0,2,allocating expenses was awkward. I still find the system clunky.,-0.600000,0,0,0
9,0,5,Your staff is very friendly and professional.,0.293750,0,0,0


### 1.6 Lemmatization class using NLTK

In [23]:
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

### 1.7 Removing all the non alpha-numeric characters

In [29]:
RE_PREPROCESS = r'\W+|\d+' #the regular expressions that matches all non-characters
testdata.comment_text = np.array( [ re.sub(RE_PREPROCESS, ' ', comment).lower() for comment in testdata.comment_text])

### 1.8 Tf-IDF for vectorizing and tokenizing the text data and removing the stop words

In [34]:
no_features = 1000

#Vectorization
tfidf_vectorizer = TfidfVectorizer(tokenizer=LemmaTokenizer(),max_df=0.55, min_df=5, ngram_range=(1,2),max_features=no_features, stop_words='english',use_idf=True)
tfidf = tfidf_vectorizer.fit_transform(testdata[testdata.Bug_classifier==1].comment_text)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

#The reason being TF-IDF works best with NMF (Non-Negative Matrix Factorization)




In [35]:
tfidf_feature_names

['ability',
 'able',
 'absolute',
 'accept',
 'acceptable',
 'access',
 'account',
 'accurately',
 'action',
 'actual',
 'actually',
 'add',
 'add expense',
 'add receipt',
 'added',
 'adding',
 'addition',
 'additional',
 'address',
 'advance',
 'affidavit',
 'air',
 'airfare',
 'airline',
 'allocate',
 'allocated',
 'allocation',
 'allow',
 'amex',
 'android',
 'annoying',
 'anymore',
 'ap',
 'app',
 'app doe',
 'app doesn',
 'app ha',
 'app phone',
 'app using',
 'app wa',
 'app work',
 'appear',
 'application',
 'approval',
 'approve',
 'approved',
 'apps',
 'area',
 'aren',
 'aren t',
 'ask',
 'asked',
 'asking',
 'associate',
 'attach',
 'attach receipt',
 'attached',
 'attached receipt',
 'attaching',
 'attaching receipt',
 'attachment',
 'attempt',
 'attendee',
 'attendee list',
 'audit',
 'auditor',
 'authorization',
 'auto',
 'automated',
 'automatically',
 'available',
 'available receipt',
 'away',
 'awful',
 'bad',
 'best',
 'better',
 'big',
 'bit',
 'book',
 'booked',
 '

### 1.9 Function responsible for displaying the top topics

In [31]:

def reduce_themes(model, feature_names, documents, no_top_words, no_top_documents,countv):
    for topic_idx, topic in enumerate(model.components_):
        print('\033[1m' + 'Topic %d:' % (topic_idx) + "\033[0;0m")
        print ('\033[1m' + ' '.join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]) + "\033[0;0m")
        print(' ')
        top_doc_indices = np.argsort(model.transform(countv)[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print (documents[doc_index])
            print (' ')

no_topics = 20


#NMF is an equivalent for PCA but for text

### 1.10 Implementing theme reduction using machine learning with NMF(Non-Negative Matrix Factorization) 

In [32]:
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf) 

### 1.11 Getting the top comments associated with the themes


In [33]:

documents = []
for i in testdata.comment_text:
    documents.append(i)
    
no_top_documents= 10
no_top_words = 10

reduce_themes(nmf, tfidf_feature_names, documents, no_top_words, no_top_documents,countv=tfidf)

[1mTopic 0:[0;0m
[1mreceipt attach image attached receipt image attach receipt upload file attaching uploading[0;0m
 
this is my first exposure to this system it seems to work well once you get a little assistance in what the correct codes are for accounting purposes 
 
great application very easy to use
 
 
it s getting easier to use 
 
the itemization on hotels does not always work good and results in double work which is annoying 
 
would not calculate mileage several times had to close tab and return to expense report from main menu multiple times to complete report 
 
superb 
 
this is a complex and not exactly user friendly method
 
once again lost expensit receipts and multiple crashes 
 
complicated how concur handles situations when breakfast is included in the room price
 
[1mTopic 1:[0;0m
[1mwork t work app work work time did work work properly doesnt work function doesnt tool[0;0m
 
i think it is still buggy
 
the fax option never works anymore we get a transmission

[1mconcur travel wa use s booking reservation flight book expense[0;0m
 
great and fast service
 
this was my first expense report on the new system and it was extremely easy 
 
user interface could be better took time to figure out how to delete an expense from the report 
 
has to attach   times
 
unclear how to add airline service fee when one fee applies to two separate flights
 
 
i frequently see exceptions indicating that i have exceeded the daily meal allowance of     when the expense policy clearly says the daily allowance is   
 
inputting employee list is not user friendly 
 
quick and easy
 
it took a couple of times for my receipt to attach 
 
[1mTopic 15:[0;0m
[1mdon t don t clear know t know t clear t receipt think password[0;0m
 
it is frustrating trying to figure out on my own how to enter in hotel details when there is an issue i see the red explanation mark but don t know how to correct easily i end up entering information or expenses different ways until the e

In [None]:
References:
    
1. https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df
2.     