
Commit 7d01203

authored
Merge pull request #1 from Anmol-Singh-Jaggi/master
Humble request
2 parents 5c5f388 + 51c655e commit 7d01203

File tree

1 file changed

+26
-25
lines changed


README.md

Lines changed: 26 additions & 25 deletions
# HelpBot

(A work in progress)

A simple QA system for information-retrieval that attempts to solve user queries using a corpus of data.
**How does it work?**

1. Extract `HelpBot.zip`, cd into the directory and run the following commands:
   a. `export FLASK_APP=server.py`
   b. `flask run`
2. Open a browser and go to "localhost:5000" (the default port is 5000).
3. Enter your search query into the text-box and click submit.
4. You'll see results in the following format:
   a. The entered query in <BLUE> color
   b. A set of matching questions in <RED> and answers in <GREEN>
5. The user can see which question is closest to their search query and follow those steps.

**Algorithm:**

The algorithm is a type of unsupervised learning. The logic for matching the user query is based on the cosine-similarity of the query with existing questions/queries in the dataset. The process consists of two parts:

1. Precomputation
   a. The text of the docs in `SampleDocuments.zip` was separated into pairs of questions and their answers.
   b. The text of the questions was cleaned and stemmed and saved in a separate file, "stemmed_questions.bin".
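
The precomputation step might look roughly like this. It is only a sketch: the README does not show the actual cleaning/stemming code, so `clean`, `simple_stem` and the sample questions below are invented stand-ins (a real project would likely use a proper stemmer such as Porter's); only the file name `stemmed_questions.bin` comes from the text.

```python
import pickle
import re

def clean(text):
    """Lowercase and strip everything except letters, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

def simple_stem(word):
    """Toy suffix-stripping stemmer; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(question):
    return [simple_stem(w) for w in clean(question).split()]

# Hypothetical question/answer corpus stands in for SampleDocuments.zip.
questions = [
    "How do I reset my password?",
    "How can I change my email address?",
]

stemmed_questions = [preprocess(q) for q in questions]

# Persist the stemmed questions so the realtime step can load them from disk.
with open("stemmed_questions.bin", "wb") as f:
    pickle.dump(stemmed_questions, f)
```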

2. Realtime Matching
   a. The input query is also stemmed.
   b. The stemmed questions are loaded from disk and used to create a TF-IDF matrix.
   c. The stemmed input query is converted to a TF-IDF vector using the matrix created above (the TF-IDF matrix's vocabulary is set during the previous step).
   d. The dot product of the input query vector with each question vector (from the TF-IDF matrix) is calculated, and the results are sorted by descending product score.
   e. The top 5 scoring questions are returned, as they seem most relevant to the user query according to the TF-IDF measure.
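
The realtime matching steps can be sketched in pure Python. This is an illustration, not the project's actual code (which is not shown here and may well use a library such as scikit-learn's `TfidfVectorizer`); the function names and the toy token lists are invented for the example.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a vocabulary and one TF-IDF vector per document.

    docs: list of token lists (e.g. the stemmed questions loaded from disk).
    Returns a vectorizer bound to this vocabulary, plus the document matrix.
    """
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}

    def vectorize(tokens):
        # Terms outside the precomputed vocabulary are simply ignored.
        tf = Counter(t for t in tokens if t in index)
        vec = [0.0] * len(vocab)
        for t, c in tf.items():
            vec[index[t]] = c * idf[t]
        return vec

    return vectorize, [vectorize(doc) for doc in docs]

def top_matches(query_tokens, docs, k=5):
    """Return indices of the k questions with the highest product score."""
    vectorize, matrix = tfidf_matrix(docs)
    q = vectorize(query_tokens)  # vocabulary is fixed by the matrix above
    scores = [sum(a * b for a, b in zip(q, row)) for row in matrix]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy stemmed questions; only "password" overlaps with the query below.
questions = [
    ["reset", "password"],
    ["chang", "email", "address"],
    ["delet", "account"],
]
print(top_matches(["forgot", "password"], questions, k=2))
```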

**Improvements:**

1. Currently only questions are being considered. We can also gather context from the answers to get better results.
2. We are currently relying on exact words. This will fail in case the user query contains synonyms of words in the knowledge-database. We can correct this by converting the questions and answers data to word-embeddings using the **word2vec** algorithm. Using this, we could find the most relevant questions in the database with the smallest Euclidean distance to the word2vec vector of the user query, which should theoretically fix this edge-case.
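
The embedding-based idea in point 2 could be sketched as follows. The tiny hand-made vectors below are stand-ins for real word2vec embeddings (which would come from a trained model, e.g. via gensim); every name, vector and question here is hypothetical, and the sketch only illustrates the nearest-by-Euclidean-distance lookup.

```python
import math

# Toy 3-d "embeddings"; a real system would load trained word2vec vectors.
EMB = {
    "password": [0.9, 0.1, 0.0],
    "passcode": [0.85, 0.15, 0.05],  # placed near "password" (a synonym)
    "email":    [0.0, 0.9, 0.1],
    "reset":    [0.1, 0.0, 0.9],
}

def sentence_vector(tokens):
    """Average the embeddings of known tokens (a common simple baseline)."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return [0.0] * 3
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

questions = [["reset", "password"], ["email", "reset"]]
query = ["reset", "passcode"]  # uses a synonym absent from the questions

# Pick the question whose average embedding is nearest to the query's —
# exact-word TF-IDF would score zero overlap for "passcode" vs "password".
qv = sentence_vector(query)
best = min(range(len(questions)),
           key=lambda i: euclidean(qv, sentence_vector(questions[i])))
```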

0 commit comments