nirajdevpandey/passage-retrieval-chatbot


passage-retrieval-chatbot

Input a text file separated into many paragraphs, then ask a question to get the most relevant passage back, based on TF-IDF weights.


The following describes the details and workflow of the repository.

This chatbot works with a text file containing a number of passages. Most of the work goes into preprocessing this collection to index its candidate utterances using the TF-IDF model, so we can easily find the utterance that's most similar to what the user has just said. The division into passages is based on the blank line between two paragraphs.

def paragraphs(file, separator=None):
    """Yield paragraphs from a file, split wherever separator(line) is true.

    By default a paragraph ends at a blank line.
    """
    if not callable(separator):
        if separator is not None:
            raise TypeError("separator must be callable")
        def separator(line):
            return line == '\n'
    paragraph = []
    for line in file:
        if separator(line):
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield ''.join(paragraph)
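As a quick check, the generator can be exercised with an in-memory file; this is a minimal sketch in which `io.StringIO` stands in for the real text file, and the function is repeated so the snippet runs on its own:

```python
import io

def paragraphs(file, separator=None):
    # Same generator as above: split a file into paragraphs on blank lines.
    if not callable(separator):
        if separator is not None:
            raise TypeError("separator must be callable")
        def separator(line):
            return line == '\n'
    paragraph = []
    for line in file:
        if separator(line):
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield ''.join(paragraph)

sample = "First passage, line one.\nLine two.\n\nSecond passage.\n"
passages = list(paragraphs(io.StringIO(sample)))
print(len(passages))  # 2: the blank line splits the text into two passages
```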

In a large text corpus, some words occur very frequently (e.g. "the", "a", "is" in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating-point values suitable for use by a classifier, it is very common to apply the tf–idf transform.


Tf means term frequency, while tf–idf means term frequency times inverse document frequency. Term frequency refers to the number of times a particular word x appears in a document, whereas inverse document frequency measures how rare the word x is across the entire corpus of documents: the fewer documents it appears in, the higher its weight.
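The weighting can be sketched in a few lines of plain Python. This is a minimal illustration using the simplest definitions (tf = raw count, idf = log(N / df)); the example corpus is made up, and real libraries such as scikit-learn use smoothed variants of these formulas:

```python
import math
from collections import Counter

# Toy corpus purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds fly over the mat",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter()
for toks in tokenized:
    for term in set(toks):
        df[term] += 1

def tfidf(toks):
    # tf-idf weight per term: raw count times log(N / document frequency).
    counts = Counter(toks)
    return {t: c * math.log(N / df[t]) for t, c in counts.items()}

weights = tfidf(tokenized[0])
# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes,
# while "cat" (2 of 3 documents) keeps a positive weight.
print(weights["the"])
print(weights["cat"] > 0)
```

This is exactly why tf–idf helps retrieval: ubiquitous words contribute nothing to the similarity score, so the match is driven by the rarer, content-bearing terms.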


Note: While the tf–idf normalization is often very useful, there are cases where binary occurrence markers offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values, while the binary occurrence info is more stable.

As usual, the best way to tune the feature extraction parameters is a cross-validated grid search, for instance by pipelining the feature extractor with a classifier.
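That setup can be sketched with scikit-learn's Pipeline and GridSearchCV; the tiny labeled corpus, the chosen classifier (LogisticRegression), and the searched parameters are all assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labeled corpus purely for illustration.
texts = ["good movie", "great film", "bad movie", "awful film",
         "nice plot", "terrible plot", "great acting", "bad acting"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

# Chain the feature extractor and the classifier into one estimator...
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# ...so the grid search can tune feature-extraction parameters
# (prefixed with the step name, e.g. "tfidf__") alongside the model.
grid = GridSearchCV(
    pipe,
    param_grid={
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "tfidf__use_idf": [True, False],
    },
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_)
```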

Improvement: There are many improvements to information retrieval models nowadays, including Google's PageRank model to weight documents based on their importance and a variety of improved statistical models of document topics.

- This method was not that helpful for me, so I would suggest using other algorithms instead.
- The reason was the nature of the data I was dealing with.
