# Problem Statement

Develop an intent classifier that can be integrated into a chatbot.

# Solution

The solution to the problem statement is explained step by step below.

## Step 1. Understanding the problem

Building an intuition for the problem statement is one of the most important parts of solving a data science problem. I scanned through the data to get an overview of the problem I had to deal with and, based on that overview, decided on the workflow for creating the model.

## Step 2. Loading the dataset 

Loaded the dataset into a dataframe using the pandas read_csv function, then computed some basic statistics to get an idea of
the figures associated with the dataset: the number of null values, the shape of the dataset, etc.
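
A minimal sketch of this loading and inspection step is shown below. The in-memory CSV and the column names `text` and `intent` are assumptions made for illustration; they are not taken from the actual dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file; column names are assumptions.
csv_text = io.StringIO(
    "text,intent\n"
    "How do I reset my password?,reset_password\n"
    "What is my account balance?,check_balance\n"
    ",check_balance\n"
)

# With a real file this would be pd.read_csv("path/to/data.csv").
df = pd.read_csv(csv_text)

print(df.shape)                     # (rows, columns)
print(df.isnull().sum())            # number of null values per column
print(df["intent"].value_counts())  # how many examples each intent has
df.info()                           # dtypes and non-null counts
```

The third row deliberately has an empty `text` field so that the null count is non-zero.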

## Step 3. Data Cleaning

Data in raw format is not of much use to machine learning algorithms, so I performed a series of data cleaning steps, including:<br>
-> Removal of punctuation <br>
-> Tokenizing<br>
-> Removal of stopwords<br>
-> Lemmatization<br>
-> Stemming<br>
-> Spell Correction<br>

### 3.1. Removal of Punctuation

Removed the punctuation from the data with the help of the string and pandas libraries of Python.
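
One possible way to do this is sketched below using only the standard `string` module. The helper name `remove_punctuation` is hypothetical, and the exact approach used in the project may differ.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)


cleaned = remove_punctuation("Hi! Can't I reset my password?")
print(cleaned)  # Hi Cant I reset my password
```

With pandas, the same helper could be applied to a whole column, for example `df["text"].apply(remove_punctuation)`, where `text` is an assumed column name.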

### 3.2. Tokenizing

Text data is basically a linear sequence of symbols, so before any text processing is done, the text needs to be segmented into linguistic units such as words, punctuation, numbers, alphanumerics, etc.
In this project I performed standard (whitespace-based) tokenization with the help of NLTK, the natural language processing library for Python.
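
As a sketch, whitespace-based tokenization with NLTK can look like the following. Using `WhitespaceTokenizer` is an assumption; the project may have used a different NLTK tokenizer.

```python
from nltk.tokenize import WhitespaceTokenizer

# Splits on runs of whitespace; needs no extra NLTK data downloads.
tokenizer = WhitespaceTokenizer()

tokens = tokenizer.tokenize("how do i reset my password")
print(tokens)
```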

### 3.3. Removal of Stopwords

Stopwords are common words that carry less meaning than keywords. It is often said that 60 to 70 percent of the words in a typical text article are stopwords, which would imply that only 30 to 40 percent of the text drives the meaning of the article.

In this project the removal of stopwords was again performed with the help of the NLTK library.

**However, when stopword removal was skipped in another iteration, the accuracy of the model improved substantially. So I would advise against removing stopwords from this data. This is explained in more detail in the Observations section of this document.**
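
To illustrate why stopword removal can hurt on question-like data, here is a small sketch. The `STOPWORDS` set is a hand-picked assumption for illustration, not NLTK's actual list, and the example sentences are made up.

```python
# Hand-picked illustrative stopwords (an assumption, not NLTK's list).
STOPWORDS = {"where", "how", "is", "do", "i", "my", "the", "a", "to"}


def remove_stopwords(tokens):
    """Drop every token that appears in STOPWORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


q1 = remove_stopwords("where is my order".split())
q2 = remove_stopwords("how do i cancel my order".split())
print(q1)  # ['order']
print(q2)  # ['cancel', 'order']
```

After removal, the question words that signal what the user is asking are gone, and the first query is reduced to a single noun.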

### 3.4. Lemmatization
Performed lemmatization to reduce the inflectional forms of words to a common base form. Lemmatization usually refers to doing this properly, using a vocabulary and morphological analysis of words.

In this project the **WordNet Lemmatizer** from the NLTK library was used.

### 3.5. Stemming

Stemming is also used to reduce the inflectional forms of words. It usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time.

In this project **Porter Stemmer** from NLTK library was used.
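
A minimal sketch of stemming with NLTK's `PorterStemmer` follows. The word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words only; not taken from the project dataset.
words = ["running", "payments", "cancelled", "orders"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Because stemming is heuristic, some stems are not dictionary words, which is the main difference from the lemmatization step above.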

### 3.6. Spell Correction

Spell correction was performed to correct the spelling of mistyped words. In a real-life scenario mistyped words are common, so spell correction can improve the accuracy of the model.

In this project spell correction was performed using the **textblob** library of Python.

## Step 4. Feature Engineering

The next step is feature engineering. In this step the raw text data is converted into feature vectors, and new features are created from the existing dataset.

I implemented the following methods to obtain relevant features from the dataset.

1. Count Vectors as features
2. TF-IDF as features<br>
    -> Word level <br>
    -> n-gram level <br>
    -> character level<br>
    
All the implementations were done using the machine learning library **sklearn** of Python.
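
The four feature types listed above can be sketched with sklearn as follows. The example sentences and the `ngram_range` values are assumptions chosen for illustration; the settings used in the project are not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up sentences standing in for the cleaned dataset.
texts = [
    "how do i reset my password",
    "what is my account balance",
    "how do i check my balance",
]

vectorizers = {
    "count": CountVectorizer(),
    "tfidf_word": TfidfVectorizer(analyzer="word"),
    # Word n-grams of length 2 to 3 (range is an assumption).
    "tfidf_ngram": TfidfVectorizer(analyzer="word", ngram_range=(2, 3)),
    # Character n-grams of length 2 to 3 (range is an assumption).
    "tfidf_char": TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
}

# Each value is a sparse matrix with one row per input sentence.
features = {name: vec.fit_transform(texts) for name, vec in vectorizers.items()}

for name, matrix in features.items():
    print(name, matrix.shape)
```

Note that the default word tokenizer in these vectorizers ignores single-character tokens such as "i".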

## Step 5. Model Building

The next step is to train a classifier using the features created in the previous step. There are many machine learning algorithms that can be used to train the classifier. I used the following algorithms for this purpose:
1. Naive Bayes Classifier
2. Logistic Regression
3. Random Forest Classifier
4. XGBoost Classifier
5. Support Vector Machines (SVM)
6. Neural Networks
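
As a rough sketch of this step, the example below trains one of the listed model types on TF-IDF features. The toy sentences, the labels, and the choice of `LinearSVC` as the SVM variant are all assumptions; the project's actual data and settings are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training examples; labels are hypothetical intents.
train_texts = [
    "how do i reset my password",
    "i forgot my password",
    "change my password please",
    "what is my account balance",
    "how much money do i have",
    "show my balance",
]
train_labels = [
    "reset_password",
    "reset_password",
    "reset_password",
    "check_balance",
    "check_balance",
    "check_balance",
]

# Vectorizer and classifier chained so raw text can be passed directly.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predictions = model.predict(["i need a new password"])
print(predictions)
```

Swapping `LinearSVC()` for `MultinomialNB()`, `LogisticRegression()`, or `RandomForestClassifier()` from sklearn gives comparable pipelines for the other classical models in the list.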


## Step 6. Serialization and Deserialization of data
This step involves saving the model for later use by serializing (pickling) it in Python. Serialization makes the machine learning model portable: the pickled model can be deserialized and used for various purposes, such as creating a web API or integrating with a desktop application.<br>
The serialization was performed using the **sklearn** package of Python.
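
A minimal round-trip sketch is shown below. It uses the standard-library `pickle` module as an assumption (joblib is another common choice), and the toy model is made up for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data; labels are hypothetical intents.
texts = [
    "reset my password",
    "i forgot my password",
    "what is my balance",
    "show my account balance",
]
labels = ["reset_password", "reset_password", "check_balance", "check_balance"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

# Serialize to bytes; pickle.dump(model, file) would write to disk instead.
blob = pickle.dumps(model)

# Deserialize and check that the restored model predicts the same labels.
restored = pickle.loads(blob)
query = ["please reset my password"]
same = list(model.predict(query)) == list(restored.predict(query))
print(same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same sklearn version that created them.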

## Observations

1) Of all the models used, **Support Vector Machines** gave the most accurate results

![image.png](attachment:image.png)

2) **One of the most interesting observations is that, contrary to the general belief that removing stopwords improves the accuracy of a model, in this case removing stopwords substantially reduced the accuracy. This can be seen in the snapshots of the accuracy dataframe before and after removal of stopwords.**

#### Before Removal of Stopwords

![image.png](attachment:image.png)

#### After Removal of Stopwords

![image.png](attachment:image.png)

**This behaviour can be attributed to the dataset, which mostly contains interrogative sentences. Stopwords make up a major part of such sentences and help describe the intent behind the statement.**

3) Neural networks perform at an accuracy comparable to the other machine learning models rather than exceeding it, most likely because of the small amount of data

4) The XGBoost classifier was initially slow and inaccurate; after tuning hyperparameters such as maximum tree depth and learning rate, the accuracy improved

5) Deep neural networks have even lower accuracy than shallow neural networks, because deep neural networks need large datasets to perform well

6) In this project, stemming was one of the key factors in improving accuracy

## Enhancements which can be done

1) More data can be acquired from the client to train the model more efficiently

2) Advanced data preprocessing, such as resolving the meaning of shorthand words, internet slang, etc., can improve the accuracy of the model

3) In the feature engineering section, a number of different feature vectors were generated; combining them can help improve the accuracy of the classifier.

4) Tuning the hyperparameters is an important step. A number of parameters, such as tree depth, number of leaves, network parameters, etc., can be fine-tuned to get a best-fit model.

5) Ensemble models: stacking different models and blending their outputs can help further improve the results
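
As an illustration of enhancement 3, word-level and character-level TF-IDF features could be combined with sklearn's `FeatureUnion`. This is only a sketch under assumed settings; the sentences and n-gram range are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Made-up sentences standing in for the cleaned dataset.
texts = [
    "how do i reset my password",
    "what is my account balance",
    "how do i check my balance",
]

# Concatenates the columns produced by each vectorizer side by side.
combined = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word")),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])

X_combined = combined.fit_transform(texts)
X_word_only = TfidfVectorizer(analyzer="word").fit_transform(texts)

print(X_word_only.shape, X_combined.shape)
```

The combined matrix has the same number of rows but more columns than either feature set alone, and can be fed to any of the classifiers from Step 5.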