# HelpBot

(A work in progress)

A simple QA system for information retrieval that attempts to answer user queries using a corpus of data.

**How does it work?**

1. Extract `HelpBot.zip`, cd into the directory, and run the following commands:
   a. `export FLASK_APP=server.py`
   b. `flask run`
2. Open a browser and go to `localhost:5000` (the default port is 5000).
3. Enter your search query into the text box and click Submit.
4. You'll see results in the following format:
   a. The entered query in <BLUE>
   b. A set of matching questions in <RED> and answers in <GREEN>
5. The user can see which question is closest to their search query and follow those steps.


**Algorithm:**

The algorithm is a type of unsupervised learning.
The logic for matching the user query is based on the cosine similarity of the query with existing questions/queries in the dataset.
The process consists of two parts:

1. Precomputation
   a. The text of the docs in `SampleDocuments.zip` was separated into pairs of questions and their answers.
   b. The text of the questions was cleaned, stemmed, and saved in a separate file, `stemmed_questions.bin`.

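The precomputation step can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Q:`/`A:` line format is an assumption about how the sample docs are laid out, and `clean_and_stem` uses a toy suffix stripper as a stand-in for a real stemmer (e.g. NLTK's `PorterStemmer`).

```python
# Precomputation sketch: split raw doc text into (question, answer) pairs,
# clean and "stem" the questions, and pickle them to disk.
import pickle
import re

def split_pairs(raw: str):
    """Pair up Q:/A: lines from a document (assumed format)."""
    questions = re.findall(r"^Q:\s*(.+)$", raw, re.MULTILINE)
    answers = re.findall(r"^A:\s*(.+)$", raw, re.MULTILINE)
    return list(zip(questions, answers))

def clean_and_stem(text: str):
    """Lowercase, drop punctuation, and crudely strip common suffixes.
    A real pipeline would use a proper stemmer here."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [re.sub(r"(ing|ed|s)$", "", w) if len(w) > 4 else w for w in words]

if __name__ == "__main__":
    raw = "Q: How do I reset my password?\nA: Click 'Forgot password' on the login page."
    stemmed = [clean_and_stem(q) for q, _ in split_pairs(raw)]
    # Saved once here, then loaded from disk during realtime matching.
    with open("stemmed_questions.bin", "wb") as f:
        pickle.dump(stemmed, f)
```
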
2. Realtime Matching
   a. The input query is also stemmed.
   b. The stemmed questions are loaded from disk and used to create a TF-IDF matrix.
   c. The stemmed input query is converted to a TF-IDF vector using the matrix created above (the TF-IDF matrix's vocabulary is fixed in the previous step).
   d. The dot product of the input query vector with each question vector (from the TF-IDF matrix) is calculated, and the results are sorted by descending product score.
   e. The top 5 scoring questions are returned, as they are the most relevant to the user query according to the TF-IDF measure.

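The realtime-matching steps above can be sketched with scikit-learn (an assumption; the README does not name the TF-IDF implementation). Because `TfidfVectorizer` L2-normalizes each row by default, the dot product of the query vector with each question row is exactly their cosine similarity.

```python
# Realtime-matching sketch: fit TF-IDF on the stored questions, project the
# query into the same vocabulary, and rank questions by dot-product score.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_matches(questions, query, k=5):
    vectorizer = TfidfVectorizer()
    question_matrix = vectorizer.fit_transform(questions)  # one row per question
    query_vec = vectorizer.transform([query])              # reuses the fitted vocabulary
    scores = (question_matrix @ query_vec.T).toarray().ravel()
    ranked = sorted(zip(scores, questions), key=lambda p: p[0], reverse=True)
    # Keep only questions that share at least one weighted term with the query.
    return [q for score, q in ranked[:k] if score > 0]

questions = [
    "how do i reset my password",
    "how do i change my email address",
    "where can i download the invoice",
]
print(top_matches(questions, "reset password"))
```
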

**Improvements:**

1. Currently only questions are being considered. We could also gather context from the answers to get better results.
2. We are currently relying on exact word matches. This will fail when the user query contains synonyms of words in the knowledge database. We could correct this by converting the questions and answers to word embeddings using the word2vec algorithm. We could then find the most relevant questions in the database by the smallest Euclidean distance to the word2vec vector of the user query, which should theoretically fix this edge case.
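A minimal sketch of that proposed embedding-based matching. The hand-made 3-d vectors below are stand-ins for real word2vec embeddings (which would come from a trained model, e.g. gensim's `Word2Vec`); each text is represented by the mean of its word vectors and questions are ranked by Euclidean distance to the query.

```python
# Toy embedding lookup: synonyms ("reset"/"restore") sit close together,
# unrelated words ("invoice", "download") point elsewhere.
import numpy as np

embeddings = {
    "reset":    np.array([0.9, 0.1, 0.0]),
    "restore":  np.array([0.8, 0.2, 0.0]),
    "password": np.array([0.1, 0.9, 0.0]),
    "invoice":  np.array([0.0, 0.1, 0.9]),
    "download": np.array([0.1, 0.0, 0.8]),
}

def embed(text):
    """Mean of the word vectors for the known words in the text."""
    return np.mean([embeddings[w] for w in text.split() if w in embeddings], axis=0)

def closest_question(questions, query):
    qv = embed(query)
    return min(questions, key=lambda q: np.linalg.norm(embed(q) - qv))

questions = ["reset password", "download invoice"]
# "restore" never appears in the corpus, but its vector lies near "reset",
# so the right question is still retrieved.
print(closest_question(questions, "restore password"))
```
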