# Problem Statement

Haptik is one of the world's largest conversational AI platforms. It is a personal assistant mobile app, powered by a combination of artificial intelligence and human assistance. It operates across multiple domains, including customer support, feedback, order status, and live chat.

We have a Haptik dataset containing the messages it receives from customers and the topic (class) each message refers to.

We need to create a model that uses NLP to predict which class a particular message belongs to.
Additionally, you can try techniques like LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) to assign topics to new messages.
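
To make the optional LSA step concrete, here is a minimal sketch. It assumes scikit-learn's `TruncatedSVD` applied to a TF-IDF matrix (a common way to compute LSA) rather than the gensim `LsiModel` imported later in this notebook. The six messages and the new query are made up for illustration and are not taken from the dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages (hypothetical, not from the Haptik dataset).
messages = [
    "book a flight ticket to delhi",
    "cancel my flight ticket",
    "train ticket booking status",
    "order a chocolate cake",
    "pizza and cake delivery",
    "order pizza for dinner",
]

# Step 1: turn the messages into a TF-IDF matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(messages)

# Step 2: LSA = truncated SVD of that matrix; 2 latent topics is an arbitrary choice.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tfidf_matrix)

# Step 3: list the highest-weighted terms of each latent topic.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
for topic_id, component in enumerate(svd.components_):
    top_indices = component.argsort()[::-1][:3]
    print(topic_id, [index_to_term[int(i)] for i in top_indices])

# Step 4: project a new message into the same topic space.
new_topics = svd.transform(vectorizer.transform(["flight to mumbai"]))
print(new_topics)
```

LSA coordinates can be negative, so picking a single topic for a message (for example, by largest absolute value) is a heuristic rather than a probability. LDA yields a proper topic distribution instead.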


# About the dataset

The original data looked like:
![dataset](https://storage.googleapis.com/ga-commit-live-prod-live-data/account/b92/11111111-1111-1111-1111-000000000000/b566/984701e4-eb7e-4127-97cb-614776062232/file.PNG)



The original dataset consisted of a `message` column along with a separate column for each topic a message could be associated with.

We have combined the different topic columns into a single column called `category`. The snapshot of the data you will be working on:

![](T_Data.PNG)


The dataset has details of 40000 messages. You need to predict the category.

For final submission purposes, the label encoding of the category column is as follows:
```python

{0: 'casual',
 1: 'food',
 2: 'movies',
 3: 'nearby',
 4: 'other',
 5: 'recharge',
 6: 'reminders',
 7: 'support',
 8: 'travel'}
```
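
The mapping above can be used in both directions. A small sketch, with the helper names `decode` and `encode` chosen here for illustration:

```python
# Label encoding given in the problem statement.
LABELS = {0: 'casual', 1: 'food', 2: 'movies', 3: 'nearby', 4: 'other',
          5: 'recharge', 6: 'reminders', 7: 'support', 8: 'travel'}

# Reverse lookup: category name -> integer code.
NAME_TO_CODE = {name: code for code, name in LABELS.items()}


def decode(codes):
    """Translate integer predictions into category names."""
    return [LABELS[int(code)] for code in codes]


def encode(names):
    """Translate category names into their integer codes."""
    return [NAME_TO_CODE[name] for name in names]


print(decode([3, 8, 4, 0]))   # ['nearby', 'travel', 'other', 'casual']
print(encode(['food']))       # [1]
```

The codes follow the alphabetical order of the category names, which is also the order scikit-learn's `LabelEncoder` produces when fitted on these nine names.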

### About the dataset


A zipped file containing the following items is given:

- `train.csv`

The data file `train.csv` contains `40000` instances with `2` features, including the target feature (the target feature is not encoded).

- `test.csv`

The data file `test.csv` contains `10000` instances with `1` feature; the target feature is excluded.

- `sample_submission.csv` 

Explained under the `Submission` sub-heading


- `domain_classification_student_template.ipynb`

A template notebook explaining the task breakdown to solve the given problem statement


## Submission

After training the model on `train.csv` data, the learner has to predict the target feature of the `test.csv` data using the trained model. The learner has to then submit a csv file with the predicted feature.

A sample submission file (`sample_submission.csv`) is given to you as a reference for the format expected when you submit.
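
Writing the predictions to a CSV file can be sketched as follows. The column names `MID` and `category` and the helper `write_submission` are assumptions made for illustration; the authoritative layout is whatever `sample_submission.csv` contains, so check it before submitting.

```python
import csv
import io


def write_submission(ids, predictions, handle, id_col="MID", target_col="category"):
    """Write one row per test message; the column names are hypothetical defaults."""
    ids, predictions = list(ids), list(predictions)
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([id_col, target_col])
    for row_id, prediction in zip(ids, predictions):
        writer.writerow([row_id, prediction])


# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_submission([0, 1, 2], [3, 8, 4], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, pass `open("submission.csv", "w", newline="")` as the handle.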


## Evaluation metrics

For this particular dataset we use the macro-averaged F1 score (`f1_score` with `average="macro"`) as the evaluation metric.

Submissions will be evaluated based on [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) as per the thresholds below.

|Your `f1_score`|Points earned for the task|
|-----|-----|
|`f1_score` > 0.7|100% of the available points|
|0.68 <= `f1_score` <= 0.7|80% of the available points|
|0.66 <= `f1_score` <= 0.68|70% of the available points|
|`f1_score` <= 0.65|No points earned|
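
To show what macro averaging does, here is a plain-Python sketch that computes a per-class F1 score and then takes the unweighted mean. The function name `macro_f1` and the toy labels are illustrative; in practice `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` does this job.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy example: class 2 is rare and never predicted correctly.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(score, 4), round(accuracy, 4))
```

Here the per-class F1 scores are 0.75, 0.8, and 0.0, so the macro average is about 0.52 even though accuracy is about 0.71. Every class counts equally, so a model that ignores a small category is penalized.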

# Data cleaning

In [7]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
import re
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import string
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

import gensim
from gensim.models.lsimodel import LsiModel
from gensim import corpora
from pprint import pprint
from gensim.models import LdaModel
from gensim.models import CoherenceModel

import warnings
warnings.filterwarnings("ignore")

In this step we will load the dataset and perform basic cleaning to simplify the later steps.



In [8]:
data = pd.read_csv('../data/train_data.csv', index_col = 'MID')
data = data.dropna()
data.head()

|MID|message|category|
|-----|-----|-----|
|0|7am everyday|reminders|
|1|chocolate cake|food|
|2|closed mortice and tenon joint door dimentions|support|
|3|train eppo kelambum|travel|
|4|yesterday i have cancelled the flight ticket|travel|

# Data Processing

As we saw in the Text Analytics concepts, we need to convert this textual data into vectors before we can apply machine learning algorithms to it. In this task we use a standard TF-IDF vectorizer to vectorize the `message` column and label-encode the `category` column, which turns the task into a multi-class classification problem.


In [9]:
le = LabelEncoder()
data['category'] = le.fit_transform(data['category'])

In [10]:
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['category'], test_size = 0.3, random_state = 0)

In [11]:
# Fit the vectorizer on the training messages only, then reuse it for the held-out split
tfidf = TfidfVectorizer(stop_words='english')
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
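
To see what a TF-IDF vector contains, here is a simplified pure-Python sketch. It assumes a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is similar to scikit-learn's defaults. The tokenization is a plain whitespace split with no stop-word removal, so the numbers will not match `TfidfVectorizer` exactly. The function name and the three messages are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalized TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {token: math.log((1 + n_docs) / (1 + df[token])) + 1 for token in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[token] * idf[token] for token in vocab]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        vectors.append([value / norm for value in raw])
    return vocab, vectors


docs = ["book a flight ticket", "cancel the flight ticket", "chocolate cake"]
vocab, vectors = tfidf_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, [round(value, 2) for value in vector])
```

In the first message, `book` receives a higher weight than `flight` because it appears in fewer documents, which is the effect TF-IDF is designed to produce.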

# Classification implementation

In the previous tasks we cleaned the data and converted the text into numeric vectors so that we can apply machine learning models. In this task we will apply logistic regression, MultinomialNB, and a linear SVM (`LinearSVC`) to the data.

In [12]:
# Implementing Logistic Regression model
log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train,y_train)
y_pred_log = log_reg.predict(X_test)
log_f1 = f1_score(y_test,y_pred_log, average='macro')
print (str(log_f1)+(" is the f1 score of the logistic regression model"))

# Implementing Multinomial NB model
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred_nb = nb.predict(X_test)
nb_f1 = f1_score(y_test,y_pred_nb,average='macro')
print (str(nb_f1)+(" is the f1 score of the Naive Bayes model"))


# Implementing Linear SVM model
lsvm = LinearSVC(random_state=0)
lsvm.fit(X_train, y_train)
y_pred_lsvm = lsvm.predict(X_test)
lsvm_f1 = f1_score(y_test,y_pred_lsvm,average='macro')
print (str(lsvm_f1)+(" is the f1 of the LinearSVC model"))

0.7360972727004778 is the f1 score of the logistic regression model
0.6468078449174866 is the f1 score of the Naive Bayes model
0.7494978602566456 is the f1 of the LinearSVC model


#### The best score is given by the LinearSVC model

# Validation of test data

Let's now use the best-performing model to predict categories for the test set.


In [14]:
# Prediction on test data

test_data = pd.read_csv("../data/test_data.csv")

X_test_data = tfidf.transform(test_data['message'])

# Predict with the best-performing model (LinearSVC)
res = lsvm.predict(X_test_data)
res

array([3, 8, 4, ..., 0, 4, 0])