<a href="https://colab.research.google.com/github/panchamdesai777/NLP-Projects/blob/master/Haptik.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Domain classification of customer messages** (Haptik Case study)



>***Created by : Pancham Desai***



## **Problem Statement** : 

* Haptik is one of the world's largest conversational AI platforms. It is a personal assistant mobile app, powered by a combination of artificial intelligence and human assistance. It operates across multiple domains, including customer support, feedback, order status and live chat.

* The Haptik dataset contains the messages received from customers and the topic (class) each message refers to.

* We need to create a model predicting which class a particular message belongs to using NLP. We will also try to use techniques like LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) to assign topics to new messages.

![alt text](https://varindia.com/uploads/2018/02/5c66b40b956fd.jpg)

## **About the dataset**:
![alt text](https://storage.googleapis.com/ga-commit-live-prod-live-data/account/b92/11111111-1111-1111-1111-000000000000/b566/984701e4-eb7e-4127-97cb-614776062232/file.PNG)

The dataset consists of a message column along with one column for each topic a message can be associated with.

We have with us two variations of the same dataset:

* Train data (40000 rows) [We will train our model on this]

* Test data (10000 rows) [We will validate our model on this]

## **Importing all Required Libraries**

---



In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style # for styling the graphs
# style.available (to know the available list of styles)
style.use('ggplot') # chosen style
plt.rc('xtick',labelsize=13) # to globally set the tick size
plt.rc('ytick',labelsize=13) # to globally set the tick size
# To print multiple outputs together
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Change column display number during print
pd.set_option('display.max_columns', 500)
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
# To display float with 2 decimal, avoid scientific printing
pd.options.display.float_format = '{:.2f}'.format
import seaborn as sns

In [0]:
from google.colab import files
uploaded = files.upload()


Saving Haptik_train.csv to Haptik_train (1).csv


In [0]:
#Loading The Dataset
import io
#The command written below is generally used to load .csv format file or .data format file.
df = pd.read_csv(io.BytesIO(uploaded['Haptik_train.csv']))
df.head()

Unnamed: 0,message,food,recharge,support,reminders,travel,nearby,movies,casual,other
0,7am everyday,F,F,F,T,F,F,F,F,F
1,chocolate cake,T,F,F,F,F,F,F,F,F
2,closed mortice and tenon joint door dimentions,F,F,T,F,F,F,F,F,F
3,train eppo kelambum,F,F,F,F,T,F,F,F,F
4,yesterday i have cancelled the flight ticket,F,F,F,F,T,F,F,F,F


**Observation:**

* Looking at this dataframe, there is one column per category, and for each message the matching category is encoded as T (true) and the rest as F (false). In other words, only one category holds true per message.

* We need a function that reduces the dataframe to essentially two columns: the message column, and a column containing the category that was true for that message.
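
Before collapsing the columns, it can be worth checking the "one category per message" observation directly. The sketch below does this on a small toy frame that mimics the layout above with only three of the nine category columns; the frame and the names `toy`, `label_cols`, `true_counts` and `violations` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame mirroring the layout shown above (three rows, three of the nine categories).
toy = pd.DataFrame({
    "message": ["7am everyday", "chocolate cake", "train eppo kelambum"],
    "food": ["F", "T", "F"],
    "reminders": ["T", "F", "F"],
    "travel": ["F", "F", "T"],
})

label_cols = ["food", "reminders", "travel"]

# Count how many category flags are "T" in each row.
true_counts = (toy[label_cols] == "T").sum(axis=1)

# Rows that break the "exactly one category" assumption.
violations = toy[true_counts != 1]

print(true_counts.value_counts().to_dict())
print(len(violations))
```

On the real data, a non-empty `violations` frame would mean that some messages have no category or several, which matters because the cleaning function only keeps the first match.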

## **Data Cleaning**

We need to do an informal reverse of 'one hot encoding'.

* The function label_race() takes a row as input, checks it for a category marked T, and returns the name of the first such category.

* Create a new column category which contains the values obtained by applying the above-written function to all the rows of the dataframe.

In [0]:
#Data cleaning
def label_race(row):
  if row['food'] == "T":
    return 'food'
  elif row['recharge'] == "T":
    return 'recharge'
  elif row['support'] == 'T':
    return 'support' 
  elif row['reminders'] == 'T':
    return 'reminders'
  elif row['travel'] == 'T':
    return 'travel'
  elif row['nearby'] == 'T':
    return 'nearby'
  elif row['movies'] == 'T':
    return 'movies'
  elif row['casual'] == 'T':
    return 'casual'
  elif row['other'] ==  'T':
    return 'other'
df['category']=df.apply(lambda row : label_race(row),axis=1)
df.drop(['food','recharge','support','reminders','nearby','movies','casual','travel','other'],axis=1,inplace=True)

In [0]:
df.head(5)

Unnamed: 0,message,category
0,7am everyday,reminders
1,chocolate cake,food
2,closed mortice and tenon joint door dimentions,support
3,train eppo kelambum,travel
4,yesterday i have cancelled the flight ticket,travel
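
As an alternative to the row-wise `apply`, the same reverse one-hot step can be sketched with vectorized pandas operations. This is only a sketch on a toy frame with three of the nine categories; the frame and the names `flags` and `first_true` are illustrative assumptions. It relies on `idxmax(axis=1)`, which returns the label of the first column holding the row maximum.

```python
import pandas as pd

# Toy frame; the last row deliberately has no category marked "T".
toy = pd.DataFrame({
    "message": ["7am everyday", "chocolate cake", "no label here"],
    "food": ["F", "T", "F"],
    "reminders": ["T", "F", "F"],
    "travel": ["F", "F", "F"],
})
label_cols = ["food", "reminders", "travel"]

flags = toy[label_cols] == "T"       # boolean frame, True where the flag is "T"
first_true = flags.idxmax(axis=1)    # name of the first True column per row

# idxmax falls back to the first column when a row has no True, so mask those rows.
toy["category"] = first_true.where(flags.any(axis=1))
toy = toy.drop(columns=label_cols)
print(toy)
```

Rows without any T end up with a missing category, which mirrors `label_race()` returning `None` in that case.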


## **TFIDF VECTORIZER & LABEL ENCODING**

---
We need to convert this textual data into vectors so that we can apply machine learning algorithms to it. In this task we use a standard TF-IDF vectorizer to vectorize the message column and label encode the category column, which turns the task into a classification problem.
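
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy messages. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each message vector. The toy messages and the helper names `idf` and `tfidf_vector` are illustrative; stopword removal is left out for brevity.

```python
import math
from collections import Counter

# Three toy messages (illustrative only).
docs = [
    "book movie ticket",
    "book train ticket",
    "chocolate cake",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of messages containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 normalisation
    return {t: w / norm for t, w in raw.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first message, "movie" appears in only one message and therefore gets a larger weight than "book" or "ticket", which appear in two.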

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Sampling only 1000 samples of each category
df = df.groupby('category').apply(lambda x: x.sample(n=1000, random_state=0))

# Code starts here
all_text=df['message'].str.lower()

tf_idf=TfidfVectorizer(stop_words="english")
X=tf_idf.fit_transform(all_text).toarray()

# Initiating a label encoder object
le = LabelEncoder()

# Fitting the label encoder object on the data
le.fit(df["category"])

# Transforming the data and storing it
y = le.transform(df["category"])

LabelEncoder()
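
Label encoding itself is a simple mapping: `LabelEncoder` keeps the unique labels in sorted order and uses each label's position as its integer code. The following pure-Python sketch reproduces that idea for the nine category names from the dataset header; the dictionaries `to_code` and `to_name` are illustrative names, not scikit-learn attributes.

```python
# The nine category columns from the dataset header.
categories = ["food", "recharge", "support", "reminders", "travel",
              "nearby", "movies", "casual", "other"]

# Sorted unique labels; the position in this list is the integer code.
classes = sorted(set(categories))
to_code = {name: code for code, name in enumerate(classes)}
to_name = {code: name for name, code in to_code.items()}

print(to_code)
```

With the fitted encoder from the cell above, the same reverse lookup is available through `le.inverse_transform`, which is handy for turning predicted integers back into category names.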

## **Performance of Classification Model on Train data**

---
We have cleaned the data and converted the text into numbers so that we can apply machine learning models. In this task we apply Logistic Regression, Naive Bayes and Linear SVM models to the data.


In [0]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Code starts here

# Splitting the data into train and test sets
X_train, X_val,y_train, y_val = train_test_split(X,y, test_size = 0.3, random_state = 42)

# Implementing Logistic Regression model
log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train,y_train)
y_pred = log_reg.predict(X_val)
log_accuracy = accuracy_score(y_val,y_pred)
print (str(log_accuracy)+(" is the accuracy of the logistic regression model"))
print('=='*100)

# Implementing Multinomial NB model
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_val)
nb_accuracy = accuracy_score(y_val,y_pred)
print (str(nb_accuracy)+(" is the accuracy of the Naive Bayes model"))
print('=='*100)

# Implementing Linear SVM model
lsvm = LinearSVC(random_state=0)
lsvm.fit(X_train, y_train)
y_pred = lsvm.predict(X_val)
lsvm_accuracy = accuracy_score(y_val,y_pred)
print (str(lsvm_accuracy)+(" is the accuracy of the LinearSVC model"))
print('=='*100)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

0.7066666666666667 is the accuracy of the logistic regression model


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

0.7114814814814815 is the accuracy of the Naive Bayes model


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
          verbose=0)

0.7125925925925926 is the accuracy of the LinearSVC model
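
One thing to keep in mind is that the TF-IDF vocabulary above was fitted on all sampled messages before the train/validation split, so the validation messages influence the features. A possible way to avoid this, sketched here on toy data, is to wrap the vectorizer and the classifier in a scikit-learn pipeline and split the raw text instead. The toy messages, labels and the name `model` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy messages and labels (illustrative only, not the Haptik data).
messages = [
    "book a movie ticket", "movie show timings",
    "latest movie offers", "movie near me tonight",
    "order chocolate cake", "show me pizza",
    "pizza and cake delivery", "order food online",
]
labels = ["movies"] * 4 + ["food"] * 4

# Split the raw text, not the vectorized matrix.
X_train, X_val, y_train, y_val = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, so it only ever sees X_train.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(random_state=0),
)
model.fit(X_train, y_train)
print(model.score(X_val, y_val))
```

The pipeline also keeps the TF-IDF matrix sparse, whereas the cell above converts it to a dense array with `.toarray()`.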


## **Loading the Test Data**

---



In [0]:
from google.colab import files
uploaded = files.upload()

Saving Haptik_test.csv to Haptik_test.csv


In [0]:
#Loading The Dataset
import io
#The command written below is generally used to load .csv format file or .data format file.
df_test = pd.read_csv(io.BytesIO(uploaded['Haptik_test.csv']))
df_test.head()

Unnamed: 0,message,food,recharge,support,reminders,travel,nearby,movies,casual,other
0,Nearest metro station,F,F,F,F,F,T,F,F,F
1,Pick up n drop service trough cab,F,F,F,F,T,F,F,F,F
2,I wants to buy a bick,F,F,F,F,F,F,F,F,T
3,Show me pizza,T,F,F,F,F,F,F,F,F
4,What is the cheapest package to andaman and ni...,F,F,F,F,T,F,F,F,F


## **Performance of Classification Model on Test data**

---
Let's now see how well our models perform on the test set.

In [0]:
#Creating the new column category
df_test["category"] = df_test.apply (lambda row: label_race (row),axis=1)

#Dropping the other columns
drop= ["food", "recharge", "support", "reminders", "nearby", "movies", "casual", "other", "travel"]
df_test = df_test.drop(drop, axis=1)

# Code starts here
all_text=df_test['message'].str.lower()
X_test=tf_idf.transform(all_text)
y_test=le.transform(df_test['category'])
y_pred=log_reg.predict(X_test)
log_accuracy_2=accuracy_score(y_test,y_pred)
print('log_accuracy:',log_accuracy_2)
print('=='*100)

y_pred_nb_2=nb.predict(X_test)
nb_accuracy_2=accuracy_score(y_test,y_pred_nb_2)
print('nb accuracy:',nb_accuracy_2)
print('=='*100)


y_pred_lvsm_2=lsvm.predict(X_test)
lsvm_accuracy_2=accuracy_score(y_test,y_pred_lvsm_2)
print('lvsm_accuracy:',lsvm_accuracy_2)
print('=='*100)



log_accuracy: 0.77
nb accuracy: 0.6839
lvsm_accuracy: 0.7604


**Observation**
* Logistic regression (0.77) and linear SVM (0.76) both do well on the test data, while multinomial NB lags behind at 0.68.
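
Overall accuracy can hide large differences between categories. The sketch below shows, in plain Python, how accuracy and per-class recall are computed from true and predicted labels. The label lists are toy values chosen for illustration and the names `y_hat`, `support`, `hits` and `recall` are assumptions, not variables from the notebook.

```python
from collections import Counter

# Toy true and predicted labels (illustrative only).
y_true = ["food", "food", "travel", "travel", "travel", "other"]
y_hat = ["food", "other", "travel", "travel", "food", "other"]

# Accuracy: share of messages whose predicted label matches the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Per-class recall: correct predictions divided by the number of true members.
support = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_hat) if t == p)
recall = {label: hits[label] / support[label] for label in support}

print(round(accuracy, 3))
print({k: round(v, 3) for k, v in recall.items()})
```

On the real predictions, `classification_report`, which is already imported earlier in the notebook, gives the same per-class breakdown together with precision and F1.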

In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **Text Preprocessing**

---



In [0]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import gensim
from gensim.models.lsimodel import LsiModel
from gensim import corpora
from pprint import pprint
# import nltk
# nltk.download('wordnet')

# Creating a stopwords list
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
# Function to lemmatize and remove the stopwords
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# Creating a list of documents from the message column
list_of_docs = df["message"].tolist()
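
To see what the cleaning steps do without downloading any NLTK data, here is a simplified stand-in for `clean()`. It assumes a tiny hand-picked stopword set instead of NLTK's English list and skips lemmatization entirely; `clean_simple`, `STOP` and `PUNCT` are hypothetical names used only in this sketch.

```python
import string

# Tiny hand-picked stopword set for illustration; the notebook uses NLTK's English list.
STOP = {"i", "a", "the", "to", "is", "me", "my"}
PUNCT = set(string.punctuation)

def clean_simple(doc):
    # 1. Lower-case, split on whitespace and drop stopwords.
    stop_free = " ".join(w for w in doc.lower().split() if w not in STOP)
    # 2. Delete punctuation characters.
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCT)
    # 3. Lemmatization is omitted here; just collapse whitespace.
    return " ".join(punc_free.split())

print(clean_simple("I want to book a ticket!"))
print(clean_simple("Book the, ticket"))
```

The second call illustrates an ordering effect that also applies to `clean()`: stopwords are filtered before punctuation is stripped, so a token such as "the," is not recognised as a stopword and survives as "the".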


## **Topic Modelling using LSI**

---



In [0]:
import nltk
nltk.download('wordnet')
doc_clean = [clean(doc).split() for doc in list_of_docs]

# Code starts here
dictionary=corpora.Dictionary(doc_clean)
dictionary.save('dictionary.dict')
doc_term_matrix=[dictionary.doc2bow(doc) for doc in doc_clean]
corpora.MmCorpus.serialize('corpus.mm', doc_term_matrix)

lsimodel=LsiModel(corpus=doc_term_matrix, num_topics=5, id2word=dictionary)
print('Topics using LSI Model:')
pprint(lsimodel.print_topics())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Topics using LSI Model:
[(0,
  '0.347*"reminder" + 0.267*"like" + 0.267*"cancel" + 0.266*"would" + '
  '0.256*"apiname" + 0.256*"userid" + 0.256*"exotel" + 0.256*"offset" + '
  '0.255*"taskname" + 0.255*"reminderlist"'),
 (1,
  '0.831*"want" + 0.221*"u" + 0.187*"know" + 0.181*"movie" + 0.135*"book" + '
  '0.128*"ticket" + 0.114*"need" + 0.108*"hi" + 0.096*"please" + '
  '0.092*"service"'),
 (2,
  '0.451*"reminder" + -0.328*"call" + -0.316*"u" + -0.233*"wake" + '
  '0.205*"water" + -0.196*"march" + -0.192*"wakeup" + 0.185*"every" + '
  '0.181*"drink" + 0.168*"want"'),
 (3,
  '-0.611*"u" + 0.418*"want" + -0.244*"need" + -0.238*"reminder" + '
  '-0.197*"please" + -0.143*"movie" + -0.117*"service" + 0.102*"wake" + '
  '-0.101*"near" + -0.101*"help"'),
 (4,
  '0.621*"need" + -0.510*"u" + 0.491*"movie" + 0.189*"offer" + -0.137*"want" + '
  '0.115*"ticket" + 0.058*"know" + 0.052*"today" + -0.052*"find" + '
  '0.049*"book"')]
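
The signed weights in the LSI output come from the underlying method: LSI factorises the term-document matrix with a singular value decomposition (SVD) and keeps the leading components, which are orthogonal directions rather than probability distributions. The NumPy sketch below illustrates the idea on a toy count matrix; the matrix values and term list are made up for illustration, and this is not how gensim implements the decomposition internally.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = messages).
terms = ["reminder", "wake", "movie", "ticket"]
A = np.array([
    [2, 1, 0, 1],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 1, 1, 2],
], dtype=float)

# Singular value decomposition; singular values come back in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the top-k components: one column per topic, one row per term.
k = 2
topics = U[:, :k]

for t in range(k):
    weights = " + ".join(f"{w:+.3f}*{term}" for term, w in zip(terms, topics[:, t]))
    print(f"topic {t}: {weights}")
```

A negative weight therefore does not mean a word is "unlikely" in a topic; it only indicates the direction of that term along the component.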


## **Topic Modelling using LDA**

---



In [0]:
from gensim.models import LdaModel
from gensim.models import CoherenceModel

# doc_term_matrix - Word matrix created in the last task
# dictionary - Dictionary created in the last task

# Function to calculate coherence values
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    topic_list : No. of topics chosen
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    topic_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, random_state=0, num_topics=num_topics, id2word=dictionary, iterations=10)
        topic_list.append(num_topics)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return topic_list, coherence_values


# Code starts here

# Calling the function
topic_list, coherence_value_list = compute_coherence_values(dictionary=dictionary, corpus=doc_term_matrix, texts=doc_clean, start=1, limit=41, step=5)
print(coherence_value_list)
# Finding the index associated with maximum coherence value
max_index=coherence_value_list.index(max(coherence_value_list))

# Finding the optimum no. of topics associated with the maximum coherence value
opt_topic= topic_list[max_index]
print("Optimum no. of topics:", opt_topic)

# Implementing LDA with the optimum no. of topic
lda_model = LdaModel(corpus=doc_term_matrix, num_topics=opt_topic, id2word = dictionary, iterations=10, passes = 30,random_state=0)

# pprint(lda_model.print_topics(5))
lda_model.print_topic(1)



[0.3287476298674388, 0.47880273927681355, 0.4814029893799397, 0.5397516701273923, 0.5460013261684322, 0.5759412789481678, 0.5748979636384208, 0.5906039077915982]
Optimum no. of topics: 36


'0.203*"near" + 0.056*"place" + 0.041*"timing" + 0.039*"food" + 0.035*"me" + 0.023*"location" + 0.021*"budget" + 0.020*"help" + 0.020*"theatre" + 0.020*"visit"'
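
The selection step can be replayed from the printed output alone. The sketch below copies the coherence scores shown above, rebuilds the topic counts implied by `start=1, limit=41, step=5`, and picks the count with the highest score. The `gains` list is an extra added here for illustration.

```python
# Coherence scores copied from the output above.
coherence_value_list = [
    0.3287476298674388, 0.47880273927681355, 0.4814029893799397,
    0.5397516701273923, 0.5460013261684322, 0.5759412789481678,
    0.5748979636384208, 0.5906039077915982,
]

# Topic counts implied by start=1, limit=41, step=5.
topic_list = list(range(1, 41, 5))

# Pair each score with its topic count and take the pair with the highest score.
best_score, opt_topic = max(zip(coherence_value_list, topic_list))
print(opt_topic)

# Change in coherence from each candidate to the next.
gains = [round(b - a, 4) for a, b in zip(coherence_value_list, coherence_value_list[1:])]
print(gains)
```

Note that the best score sits at the largest topic count that was tried, so the optimum may lie beyond the searched range. The gains also shrink considerably after the first few steps, which could justify a smaller, more interpretable number of topics.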

## **Topics after LDA**

---



In [0]:
topics = lda_model.show_topics(formatted=False)
topics

[(12,
  [('service', 0.2062204),
   ('center', 0.08236956),
   ('pune', 0.053551704),
   ('dont', 0.021453576),
   ('courier', 0.018458763),
   ('message', 0.017256143),
   ('lenovo', 0.016755236),
   ('paytm', 0.016259223),
   ('redmi', 0.015074577),
   ('payment', 0.013782548)]),
 (8,
  [('find', 0.1421748),
   ('booking', 0.08821345),
   ('haptik', 0.034990184),
   ('still', 0.031731356),
   ('u', 0.030109473),
   ('take', 0.024798913),
   ('nearest', 0.023971396),
   ('full', 0.023918986),
   ('shoe', 0.023510173),
   ('trip', 0.022823216)]),
 (9,
  [('bill', 0.124040164),
   ('coupon', 0.056552365),
   ('code', 0.05137567),
   ('5', 0.046840403),
   ('rate', 0.038083807),
   ('postpaid', 0.03373633),
   ('electricity', 0.027187983),
   ('online', 0.022111442),
   ('web', 0.020902682),
   ('u', 0.015085907)]),
 (19,
  [('train', 0.25157505),
   ('number', 0.09544407),
   ('best', 0.06359507),
   ('car', 0.022665078),
   ('deal', 0.020447493),
   ('work', 0.01623739),
   ('working',
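
With `formatted=False`, `show_topics` returns a list of `(topic_id, [(word, weight), ...])` pairs, which is easy to reshape. The sketch below works on the first five words of two topics copied from the output above and reduces each topic to its top words; `top_words` and `summary` are hypothetical names introduced for this example.

```python
# First five words of topics 12 and 8, copied from the output above.
topics = [
    (12, [("service", 0.2062204), ("center", 0.08236956), ("pune", 0.053551704),
          ("dont", 0.021453576), ("courier", 0.018458763)]),
    (8, [("find", 0.1421748), ("booking", 0.08821345), ("haptik", 0.034990184),
         ("still", 0.031731356), ("u", 0.030109473)]),
]

def top_words(topics, n=3):
    # Map topic id -> its n highest-weighted words, dropping the weights.
    return {topic_id: [word for word, _ in pairs[:n]] for topic_id, pairs in topics}

summary = top_words(topics)
for topic_id, words in summary.items():
    print(topic_id, ", ".join(words))
```

As far as the gensim documentation describes it, `show_topics` returns only 10 topics by default, so with 36 topics the list above is a subset; passing `num_topics=-1` should return all of them.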

## **Visualisation of LDA using pyLDAvis**

---



In [0]:
!pip install pyLDAvis



In [0]:
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (the module is named pyLDAvis.gensim_models in pyLDAvis 3.3 and later)

In [0]:
pyLDAvis.enable_notebook()
corpus=doc_term_matrix
id2word=dictionary
vis = pyLDAvis.gensim.prepare(lda_model,corpus, id2word)
vis

### **Conclusion**

---
* In the visualisation above, topic 1 is clearly separated from the others and does not overlap with any of them. Its words all come from the reminders category.

* Topics 5 and 9 overlap, so they are somewhat similar to one another; both are mostly about the travel category.

* Finally, the topics in the 4th quadrant overlap heavily, which suggests they are closely related to one another.
