# ChatBot - Foul Word Detection
# Project Portfolio


## Abstract

Chatbots, also known as conversational interfaces, offer a new way for people to interact with computer systems. Traditionally, getting a question answered by a software program meant using a search engine or filling out a form. A chatbot lets a user simply ask a question the way they would ask a person. The best-known chatbots today are voice assistants such as Alexa and Siri, but chatbots are also being adopted rapidly on text chat platforms.

The technology at the core of the rise of the chatbot is natural language processing (“NLP”). Recent advances in machine learning have greatly improved the accuracy and effectiveness of natural language processing, making chatbots a viable option for many organizations. This improvement in NLP is driving a great deal of additional research, which should lead to continued improvement in the effectiveness of chatbots in the years to come.

We first evaluate different chatbots and then build one of our own. The goal is a chatbot that can be integrated with any website. For now it is domain specific, but it can be extended to scale across other platforms. We use Rasa NLU, with intent classification and entity extraction, to interpret user questions correctly, and we also handle foul language in the user input.

Foul words are detected with a profanity filter, which is what we implemented.

## Introduction

Rasa: A chatbot solution

Rasa provides a set of free tools for building a complete chatbot on your local desktop. Its flagship tools are:
- Rasa NLU: a natural language understanding solution that takes the user input, infers the intent, and extracts the available entities.
- Rasa Core: a dialogue management solution that builds a probability model to decide which actions to perform based on the previous user inputs.

Some keywords are used repeatedly in this document in reference to Rasa functions and tools:
- Intent: the aim or target of the user input. If a user says, “Which day is today?”, the intent is to find the day of the week.
- Entity: the useful information that can be extracted from the user input. In the previous example, the intent tells us that the aim is to find the day of the week, but for which date? If we extract “today” as the entity, we can perform the action for today.
- Actions: as the name suggests, an action is an operation that the bot can perform. It could be sending a reply, querying a database, or anything else that can be done in code.
- Stories: sample interactions between the user and the bot, defined in terms of the intents captured and the actions performed. They let the developer specify what to do when a user input has a given intent, with or without certain entities. For example: if the user intent is to find the day of the week and the entity is today, find the day of the week for today and reply.
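To make the four terms concrete, here is a minimal sketch in plain Python. It does not use Rasa; the intent name `find_day_of_week`, the entity name `date`, the confidence value, and the function `action_day_of_week` are all hypothetical names chosen for illustration.

```python
import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical NLU output for "Which day is today?" (names and confidence
# are made up for illustration, not produced by a trained model).
parsed = {
    "text": "Which day is today?",
    "intent": {"name": "find_day_of_week", "confidence": 0.93},
    "entities": [{"entity": "date", "value": "today"}],
}


def action_day_of_week(parsed, today=None):
    """Action: reply with the weekday for the extracted date entity."""
    today = today or datetime.date.today()
    # Index the extracted entity values by entity name.
    entities = {e["entity"]: e["value"] for e in parsed["entities"]}
    if parsed["intent"]["name"] != "find_day_of_week":
        return "Sorry, I did not understand that."
    if entities.get("date", "").lower() == "today":
        return "Today is " + DAY_NAMES[today.weekday()] + "."
    return "Which date do you mean?"


# A story written as plain data: this intent with this entity leads to this action.
story = [("find_day_of_week", {"date": "today"}, "action_day_of_week")]

print(action_day_of_week(parsed))
```

In a real Rasa project the intent and entities would come from the trained NLU model and the story would be learned by the dialogue model rather than hard-coded.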

## Data Set

We created our own dataset and stored the data in .json files.
We collected input from many users and used it to train and test the chatbot.
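As an illustration of what such a .json file can look like, the sketch below builds two training examples in the Rasa NLU JSON layout (`rasa_nlu_data` / `common_examples`). The example sentences, the intent names, and the helper `make_example` are assumptions for illustration and are not taken from our actual dataset.

```python
import json


def make_example(text, intent, entity_value=None, entity_name=None):
    """Build one training example; entity offsets are character positions in text."""
    example = {"text": text, "intent": intent, "entities": []}
    if entity_value is not None:
        start = text.index(entity_value)
        example["entities"].append({
            "start": start,
            "end": start + len(entity_value),
            "value": entity_value,
            "entity": entity_name,
        })
    return example


# Hypothetical examples; a real dataset needs many examples per intent.
training_data = {
    "rasa_nlu_data": {
        "common_examples": [
            make_example("hello there", "greet"),
            make_example("which day is today?", "find_day_of_week",
                         "today", "date"),
        ]
    }
}

print(json.dumps(training_data, indent=2))
```

Computing the `start` and `end` offsets from the text, rather than typing them by hand, avoids misaligned entity annotations.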



## Data Description

The chatbot takes the user input in the form of text.

## Libraries

1. import rasa_nlu
2. import rasa_core
3. import spacy
4. from rasa_nlu.training_data import load_data
5. from rasa_nlu.config import RasaNLUModelConfig
6. from rasa_nlu.model import Trainer
7. from rasa_nlu import config
8. from rasa_core.actions import Action
9. from rasa_core.events import SlotSet
10. from IPython.core.display import Image, display
11. from rasa_core.agent import Agent
12. from rasa_core.evaluate import run_story_evaluation
13. from profanity_check import predict, predict_prob


## Chatbot Architecture
Rasa NLU: Natural Language Understanding

Natural Language Understanding is the area of artificial intelligence that deals with machines comprehending written language.

In a chatbot, the NLU is the first component to receive the user input, as a string. Using a trained machine learning model, it identifies the user’s intent behind what they said, together with structured data (the user’s name, a requested location, and so on). The intent is then passed to the dialogue engine.

Rasa Core: Dialogue Engine

The dialogue engine takes an intent as its input and provides an action as its output. Machine learning on stories is used to teach the engine to respond to various intents in context. Rasa Core can either process the action and return a full text output or just return the recommended action, depending on your client architecture.
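The flow from text to intent to action can be sketched with two toy stand-ins. This is not Rasa code: `toy_nlu` uses keyword matching instead of a trained model, `toy_dialogue_engine` uses a fixed lookup table instead of a model learned from stories, and all intent and action names are hypothetical.

```python
# Toy stand-ins for the two components; all names here are hypothetical.
KEYWORDS = {"greet": ("hello", "hi"), "find_day_of_week": ("day",)}
POLICY = {"greet": "utter_greet", "find_day_of_week": "action_day_of_week"}


def toy_nlu(text):
    """NLU step: map raw text to an intent (keyword match, not a trained model)."""
    tokens = text.lower().replace("?", " ").split()
    for intent, words in KEYWORDS.items():
        if any(word in tokens for word in words):
            return {"intent": intent, "text": text}
    return {"intent": None, "text": text}


def toy_dialogue_engine(nlu_result):
    """Dialogue step: map an intent to the name of the next action."""
    return POLICY.get(nlu_result["intent"], "utter_default")


def handle(text):
    """Full pipeline: text -> intent -> action name."""
    return toy_dialogue_engine(toy_nlu(text))


print(handle("Which day is today?"))
```

Returning only the action name corresponds to the second client architecture mentioned above, where the caller decides how to execute the recommended action.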

## Intent Classifier
Rasa recently introduced a new intent classifier called intent_classifier_tensorflow_embedding. Its most important benefit is that it does not require a pre-trained model, which means that it works for any language and consumes less memory. It can also be used to assign multiple intents to a single input.

We were quite happy with it, so we did not have to tweak the parameters much. One thing to be careful about is that the ability to use multiple intents (intent_tokenization_flag) is enabled by default, with the underscore _ as the default intent separator. This means that you may be using this feature without knowing it if your intent names contain underscores. Turn it off or change the separator to another character if you do not use it.
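The underscore pitfall can be illustrated with a few lines of plain Python. This is only a sketch of the splitting behaviour described above, not the classifier's actual implementation; the function `split_intent` and its parameters are hypothetical.

```python
def split_intent(label, tokenize, split_symbol="_"):
    """Mimic intent tokenization: split one label into several intent tokens."""
    if not tokenize:
        return [label]
    return label.split(split_symbol)


# With tokenization on, a single intent name containing "_" becomes two intents.
print(split_intent("ask_weather", tokenize=True))
# With tokenization off, the name is kept whole.
print(split_intent("ask_weather", tokenize=False))
# With a different separator, underscores in names are harmless.
print(split_intent("greet+ask_weather", tokenize=True, split_symbol="+"))
```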


![image.png](attachment:image.png)

## Evaluation Metric 

The things your users say are the best source of training data for refining your models. Of course your model won’t be perfect, so you will have to manually go through each of these predictions and correct any mistakes before adding them to your training data. 
How is your model performing? Do you have enough data? Are your intents and entities well-designed?

### Method 1
Rasa NLU has an evaluate mode which helps you answer these questions. A standard technique in machine learning is to keep some data separate as a test set. If you’ve done this, you can see how well your model predicts the test cases using this command:

    python -m rasa_nlu.evaluate \
        --data data/examples/rasa/demo-rasa.json \
        --model projects/default/model_20180323-145833
    
The evaluation script will produce a report, confusion matrix and confidence histogram for your model.

The report logs precision, recall and F1 measure for each intent and entity, and also provides an overall average. You can save these reports as JSON files using the `--report` flag.

The confusion matrix shows you which intents are mistaken for others; any samples which have been incorrectly predicted are logged and saved to a file called errors.json for easier debugging.

The histogram that the script produces allows you to visualise the confidence distribution for all predictions, with the volume of correct and incorrect predictions being displayed by blue and red bars respectively. Improving the quality of your training data will move the blue histogram bars to the right and the red histogram bars to the left of the plot.
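To show what the per-intent numbers in such a report mean, here is a minimal sketch that computes precision, recall and F1 from lists of true and predicted intents. It is a plain-Python illustration, not the Rasa evaluation script; the function `per_intent_report` and the sample labels are made up.

```python
from collections import Counter


def per_intent_report(true_intents, predicted_intents):
    """Compute precision, recall and F1 for every intent label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_intents, predicted_intents):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this intent, but it was wrong
            fn[truth] += 1  # missed the true intent
    report = {}
    for intent in sorted(set(true_intents) | set(predicted_intents)):
        p_den = tp[intent] + fp[intent]
        r_den = tp[intent] + fn[intent]
        precision = tp[intent] / p_den if p_den else 0.0
        recall = tp[intent] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        report[intent] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical test set: one "greet" example is misclassified as "goodbye".
y_true = ["greet", "greet", "goodbye", "goodbye"]
y_pred = ["greet", "goodbye", "goodbye", "goodbye"]
report = per_intent_report(y_true, y_pred)
for intent, scores in report.items():
    print(intent, scores)
```

In this toy case the misclassified example lowers recall for `greet` and precision for `goodbye`, which is exactly the kind of confusion the confusion matrix makes visible.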

### Method 2

Hit and miss is an evaluation criterion in which we write down the expected response beforehand and compare it with the response the chatbot actually gives. If the chatbot gives the right answer, it is a hit. If not, it is a miss.
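A hit-and-miss check is straightforward to automate. The sketch below assumes the chatbot can be called as a function from question to reply; `toy_bot`, the questions, and the expected replies are hypothetical placeholders.

```python
def hit_and_miss(bot, expected):
    """Compare bot replies with expected replies; return (hits, misses)."""
    hits = sum(1 for question, want in expected.items() if bot(question) == want)
    return hits, len(expected) - hits


# Hypothetical stand-in for the chatbot: a lookup table with a default reply.
CANNED = {"hello": "Hi! How can I help?", "bye": "Goodbye!"}


def toy_bot(question):
    return CANNED.get(question, "Sorry, I did not understand that.")


expected = {
    "hello": "Hi! How can I help?",
    "bye": "Goodbye!",
    "what is the time": "It is noon.",
}

hits, misses = hit_and_miss(toy_bot, expected)
print("hits:", hits, "misses:", misses, "hit rate:", hits / len(expected))
```

Exact string comparison is the simplest choice; for replies with variable wording, comparing the predicted action or intent instead of the text would be more robust.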





## Algorithms

The NLU library works on a bag-of-words (BoW) representation.

The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms.

The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.
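As a small worked example, the sketch below turns two sentences into bag-of-words count vectors using only the standard library. The sentences and the whitespace tokenizer are illustrative assumptions; the NLU library uses its own tokenization and featurization.

```python
from collections import Counter


def build_vocabulary(docs):
    """Collect the sorted set of lowercase tokens across all documents."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in doc; word order is ignored."""
    counts = Counter(doc.lower().split())
    return [counts[token] for token in vocab]


docs = ["the bot is helpful", "the bot is the best"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Each document becomes a fixed-length vector with one position per vocabulary word, which is the form a classifier can consume.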

## Implementing Foul Word detection in Chatbot 

profanity-check uses a linear SVM model trained on 200k human-labeled samples of clean and profane text strings. Its model is simple but surprisingly effective, meaning profanity-check is both robust and extremely performant.

Why use profanity-check?

No explicit blacklist

Many profanity detection libraries use a hard-coded list of bad words to detect and filter profanity. For example, profanity uses a wordlist, and even better-profanity still uses a wordlist. This approach has glaring issues: these libraries might be performant, but they are not accurate at all.

In the library's own benchmark, profanity-check is anywhere from 300 to 4000 times faster than profanity-filter.

Installation
$ pip install profanity-check
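Once installed, the filter can be placed in front of the chatbot's normal reply logic. The sketch below shows one way to do that, under stated assumptions: `guarded_reply`, `stub_predict`, `echo_reply`, and the warning text are hypothetical. The prediction function is passed in as a parameter so the example runs without the package; in the real chatbot, `predict` from `profanity_check` (which takes a list of strings and returns a 1 for each offensive string and a 0 otherwise) would be passed as `predict_fn`.

```python
WARNING = "Please avoid offensive language."


def guarded_reply(message, reply_fn, predict_fn):
    """Return a warning when the message is flagged, else the normal reply."""
    # predict_fn takes a list of strings and returns one 0/1 label per string.
    flagged = int(predict_fn([message])[0]) == 1
    if flagged:
        return WARNING
    return reply_fn(message)


def stub_predict(texts):
    """Placeholder classifier for this sketch only: flags one mild word."""
    return [1 if "darn" in text.lower() else 0 for text in texts]


def echo_reply(message):
    """Placeholder for the chatbot's normal reply logic."""
    return "You said: " + message


print(guarded_reply("hello bot", echo_reply, stub_predict))
print(guarded_reply("this darn bot", echo_reply, stub_predict))
```

Checking the message before it reaches intent classification keeps the profanity handling independent of the dialogue model.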

![image.png](attachment:image.png)


1. Accuracy of foul word detection: 95%

2. Evaluation criteria: we tested the chatbot on a list of foul word samples collected from different users and sources and observed how well it detected them.


### Conclusion

An end-to-end chatbot was built using open source libraries: Rasa NLU and Rasa Core.
We recognized intents for our sample data and for labeled data from spaCy. We were able to successfully turn unstructured data into structured data. The chatbot was able to understand the words, classify them into intents, and perform meaningful actions.

We implemented foul word detection with 95% accuracy.

### Future scope

1. We will implement this chatbot on Slack.
2. We can scale this chatbot to any domain with fewer plugins.
3. We can integrate it with websites and keep it domain generic or domain specific, depending on the use case.



## Contributions Statement

20% of the work is drawn from the cited references and 80% is my own work.

## Citations

1. http://mohitmayank.com/building-a-chatbot-with-rasa/
2. https://towardsdatascience.com
3. https://matplotlib.org
4. https://stackoverflow.com
5. https://pandas.pydata.org
6. https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/