HelpBot

(A work in progress)

A simple question-answering (QA) system for information retrieval that attempts to answer user queries using a corpus of documents.

How does it work?

1. Extract HelpBot.zip, cd into the extracted directory, and run the following commands:
   a. export FLASK_APP=server.py
   b. flask run
2. Open a browser and go to "localhost:5000" (the default port is 5000).
3. Enter your search query into the text box and click Submit.
4. The results appear in the following format:
   a. The entered query, shown in color
   b. A set of matching questions and their answers, shown in color
5. See which question is closest to your search query and follow the steps in its answer.

Algorithm:

The algorithm is a form of unsupervised learning. A user query is matched by computing its cosine similarity with the existing questions in the dataset. The process consists of two parts:

1. Precomputation
   a. The text of the docs in SampleDocuments.zip was separated into pairs of questions and their answers.
   b. The text of the questions was cleaned, stemmed, and saved in a separate file, "stemmed_questions.bin".
2. Realtime Matching
   a. The input query is also stemmed.
   b. The stemmed questions are loaded from disk and used to create a TF-IDF matrix.
   c. The stemmed input query is converted to a TF-IDF vector using the vocabulary and weights fitted in the previous step.
   d. The dot product of the input query vector with each question vector (each row of the TF-IDF matrix) is calculated, and the results are sorted by descending score. If the TF-IDF vectors are L2-normalized, this dot product equals the cosine similarity.
   e. The 5 top-scoring questions are returned as the most relevant to the user query according to the TF-IDF measure.
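
The two parts above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the `stem` function is a toy suffix stripper standing in for a real stemmer, pickle is assumed as the serialization format for the stemmed questions, the IDF formula is one common smoothed variant, and every function name here is hypothetical.

```python
import math
import pickle
import re
from collections import Counter


def stem(word):
    """Toy suffix stripper; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_and_stem(text):
    """Lowercase the text, keep alphanumeric tokens, and stem each one."""
    return [stem(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())]


def vectorize(tokens, idf):
    """Build a unit-length TF-IDF vector; terms outside the vocabulary are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(stemmed_questions):
    """Fit smoothed IDF weights on the questions and vectorize each of them."""
    n = len(stemmed_questions)
    df = Counter(term for q in stemmed_questions for term in set(q))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorize(q, idf) for q in stemmed_questions]


def top_matches(query, questions, idf, vectors, k=5):
    """Score every question by dot product with the query and return the top k."""
    q = vectorize(clean_and_stem(query), idf)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    ranked = sorted(range(len(questions)), key=lambda i: scores[i], reverse=True)
    return [(questions[i], scores[i]) for i in ranked[:k]]


# Precomputation: stem the questions once and serialize them.
questions = [
    "How do I reset my password?",
    "How can I change my email address?",
    "Where do I download invoices?",
]
blob = pickle.dumps([clean_and_stem(q) for q in questions])

# Realtime matching: load the stemmed questions, fit TF-IDF, rank the query.
idf, vectors = build_index(pickle.loads(blob))
results = top_matches("password resets", questions, idf, vectors)
```

Here `results` is a list of `(question, score)` pairs sorted by descending score. Rebuilding the TF-IDF index on every request keeps the sketch short; caching `idf` and `vectors` at startup would avoid the repeated work.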

Improvements:

Currently, only questions are considered. We could also gather context from the answers to get better results.
Matching currently relies on exact words, so it fails when the user query uses synonyms of words in the knowledge base. This could be addressed by converting the questions and answers to word embeddings using the word2vec algorithm. The most relevant questions would then be those whose vectors have the smallest Euclidean distance to the word2vec vector of the user query, which should in theory handle this edge case.
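
The embedding idea can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the 2-D vectors in `TOY_VECTORS` are hand-picked rather than learned by word2vec, averaging word vectors is just one simple way to get a vector for a whole query, and the function names are hypothetical.

```python
import math

# Hypothetical 2-D word vectors chosen by hand. Real word2vec vectors are
# learned from a corpus and have far more dimensions.
TOY_VECTORS = {
    "password": (0.9, 0.1),
    "passcode": (0.85, 0.15),
    "reset": (0.7, 0.3),
    "invoice": (0.1, 0.9),
    "bill": (0.15, 0.85),
    "download": (0.2, 0.7),
}


def sentence_vector(tokens):
    """Average the vectors of known tokens; return None if none are known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def nearest_question(query_tokens, question_tokens):
    """Return the index of the question with the smallest Euclidean distance."""
    query_vec = sentence_vector(query_tokens)
    if query_vec is None:
        return None
    best_index, best_distance = None, math.inf
    for i, tokens in enumerate(question_tokens):
        vec = sentence_vector(tokens)
        if vec is None:
            continue  # skip questions with no known words
        distance = math.dist(query_vec, vec)
        if distance < best_distance:
            best_index, best_distance = i, distance
    return best_index


question_tokens = [["reset", "password"], ["download", "invoice"]]
# "passcode" and "bill" appear in no question, yet each query still finds
# the related question because its toy vector lies near a synonym.
synonym_match = nearest_question(["reset", "passcode"], question_tokens)
bill_match = nearest_question(["bill"], question_tokens)
```

With these toy vectors, a query containing "passcode" lands next to the password question even though the two share only one exact word, which is the behavior that exact-word TF-IDF matching cannot provide.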
