# Ranking documents based on TF-IDF scores

TF-IDF (term frequency-inverse document frequency) is a technique for weighing how important a word is to a document within a corpus. It combines how often the word occurs in the document (TF) with how rare the word is across the corpus (IDF); the score for a word is the product of its TF and IDF values. This method is widely used in information retrieval systems to rank documents by relevance to a query.

In [16]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import pickle
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

## Setup

#### The documents (the corpus) are loaded from pickle files.
#### The previously built inverted index is loaded as well; it supplies the term and document frequencies.
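The layout of the pickled index is not shown in this notebook. The ranking function further down reads element 0 of each posting as the document ID and element 2 as the term frequency, so an index of roughly the following shape would be compatible. This is only a sketch under that assumption: `build_inverted_index` and `toy_docs` are hypothetical names, the middle element (a list of token positions) is a guess, and stop-word removal and stemming are left out for brevity.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, positions, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Collect the token positions of every term in this document.
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        # One posting per (term, document) pair; the term frequency is the
        # number of recorded positions.
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list, len(pos_list)))
    return dict(index)

toy_docs = [
    "medical reports and medical claims",
    "auto insurance policy",
    "proof of claim medical reports",
]
toy_index = build_inverted_index(toy_docs)
print(toy_index["medical"])  # [(0, [0, 3], 2), (2, [3], 1)]
```

With this layout, the document frequency of a term is simply the length of its postings list.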

In [17]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
root = Path("../")

# Load the pickled corpus.
with open(root / "Pickled_files" / "Documents", 'rb') as dbfile:
    documents = pickle.load(dbfile)

# Load the pickled inverted index.
with open(root / "Pickled_files" / "Inverted_index", 'rb') as dbfile:
    inverted_index = pickle.load(dbfile)

## Algorithm

The TF-IDF scoring algorithm calculates a score for each term in the query and each document in the collection, using the following formula:

**TF-IDF(term, document) = TF(term, document) * IDF(term)**

where TF(term, document) is the term frequency of the term in the document, and IDF(term) is the inverse document frequency of the term. The term frequency represents the number of times a term appears in a document, while the inverse document frequency represents the rarity of the term across the collection of documents.

The TF-IDF score for each term in the query is calculated in the same way. Then, for each document, the algorithm calculates the dot product between the query's TF-IDF scores and the document's TF-IDF scores. The resulting score represents the relevance of the document to the query string.

By ranking the documents based on their scores, the algorithm can identify the most relevant documents for a given query. This technique is widely used in search engines and other information retrieval applications, as it provides an effective way to match documents to user queries.
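The dot-product scoring described above can be illustrated on a toy index. This is a minimal sketch, not the notebook's actual code: `idf`, `score_documents`, and the toy data are hypothetical, and the index is simplified to a mapping from term to `{doc_id: term_frequency}`. The IDF is taken as 1 / (1 + df), the form used by the ranking function in this notebook; a log-scaled IDF is a common alternative.

```python
def idf(term, index):
    # Inverse document frequency as 1 / (1 + df), where df is the number
    # of documents that contain the term.
    return 1.0 / (1 + len(index.get(term, {})))

def score_documents(query_tf, index, num_docs):
    """Dot product of query and document TF-IDF weights, per document."""
    scores = [0.0] * num_docs
    for term, qtf in query_tf.items():
        weight = idf(term, index)
        for doc_id, tf in index.get(term, {}).items():
            # (query TF-IDF) * (document TF-IDF) for this shared term
            scores[doc_id] += (qtf * weight) * (tf * weight)
    return scores

# Toy index: term -> {doc_id: term frequency}
toy_index = {
    "medic": {0: 1, 2: 3},
    "report": {0: 1, 1: 2, 2: 1},
}
toy_scores = score_documents({"medic": 1, "report": 1}, toy_index, num_docs=3)
toy_ranking = sorted(range(3), key=lambda d: toy_scores[d], reverse=True)
print(toy_ranking)  # [2, 0, 1]
```

Document 2 ranks first because it contains the rarer term "medic" three times; document 1 matches only the more common term "report" and ranks last.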

## Implementation

The rank_docs_by_tfidf function is a Python implementation of the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm used for ranking documents based on their relevance to a given query string.

The function takes a query string as input and returns a list of document IDs ranked in descending order of their relevance to the query. The algorithm first processes the query string by lowercasing it, removing stop words, and stemming each remaining term with the Porter stemmer to reduce it to its root form.
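The query preprocessing step can be sketched in isolation as follows. To keep the example free of external dependencies, `TOY_STOP_WORDS` and `toy_stem` are hypothetical stand-ins for NLTK's English stop-word list and `PorterStemmer.stem`; in this notebook one would pass `stop_words` and `stemmer.stem` instead. `preprocess_query` is likewise an illustrative name, not part of the notebook.

```python
from collections import Counter

# Stand-in for NLTK's English stop-word list (assumption for this sketch).
TOY_STOP_WORDS = {"the", "of", "and", "a", "for"}

def toy_stem(term):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def preprocess_query(query, stop_words=TOY_STOP_WORDS, stem=toy_stem):
    """Lowercase, drop stop words, stem, and count the remaining terms."""
    # split() without arguments splits on any whitespace, so a trailing
    # newline from a file read does not stick to the last term.
    terms = [t.lower() for t in query.split()]
    kept = [stem(t) for t in terms if t not in stop_words]
    return Counter(kept)

query_tf = preprocess_query("The medical reports of the Reports\n")
print(dict(query_tf))  # {'medical': 1, 'report': 2}
```

This sketch counts term frequencies after stemming, so inflected variants of the same word are merged into one query term.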

Then, the algorithm calculates the TF-IDF score for each term in the query. It multiplies the term frequency in the query by the inverse document frequency (IDF) of the term, computed here as 1 / (1 + df), where df is the number of documents containing the term. For each document that contains the term, it multiplies the document's TF-IDF score by the query's TF-IDF score for that term and adds the result to the document's score.

Finally, the algorithm returns a list of document IDs sorted in descending order of their scores. This allows the user to quickly identify the documents that are most relevant to the query string.
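When only the best few documents are needed, a full sort is not strictly required. As an alternative sketch (not what the function below does), the standard library's `heapq.nlargest` can select the top-scoring IDs directly; `top_k` and `example_scores` are hypothetical names used for illustration.

```python
import heapq

def top_k(document_scores, k=3):
    """Return the k highest-scoring document IDs, best first."""
    # Iterating a dict yields its keys; each key is ranked by its score.
    return heapq.nlargest(k, document_scores, key=document_scores.get)

example_scores = {0: 0.17, 1: 0.125, 2: 0.40, 3: 0.0}
print(top_k(example_scores))  # [2, 0, 1]
```

If `k` exceeds the number of documents, all document IDs are returned in descending score order.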

In [18]:
def rank_docs_by_tfidf(query):
    # Tokenize on any whitespace (this also drops a trailing newline),
    # lowercase, and remove stop words.
    query_terms = [term.lower() for term in query.split() if term.lower() not in stop_words]

    # Count how often each term occurs in the query.
    query_term_freq = {}
    for term in query_terms:
        query_term_freq[term] = query_term_freq.get(term, 0) + 1

    # Every document starts with a score of 0.
    document_scores = {i: 0 for i in range(len(documents))}

    for term, qtf in query_term_freq.items():
        normalised_term = stemmer.stem(term)
        postings = inverted_index.get(normalised_term)
        if postings is None:
            continue
        # IDF is taken as 1 / (1 + df), where df is the number of postings.
        idf = 1 / (1 + len(postings))
        query_score = qtf * idf
        for doc in postings:
            # doc[0] is used as the document ID, doc[2] as the term frequency.
            doc_score = doc[2] * idf
            document_scores[doc[0]] += query_score * doc_score

    # Sort ascending by score, then reverse to get descending relevance.
    ranked_docs = [doc_id for doc_id, _ in sorted(document_scores.items(), key=lambda x: x[1])[::-1]]
    return ranked_docs

## Sample Queries for testing the algorithm

The query string is read from the first line of phrase_input.txt and passed to the rank_docs_by_tfidf function, which returns a list of document IDs ranked in descending order of their relevance to the query. The printResults function then prints the query followed by the top 3 documents, or fewer if fewer documents are available. For each document, it prints the file path, then the document's content, then a separator line for readability.

In [19]:
def printResults(query, ranked_docs):
    print(query)
    print()
    # Show at most the top 3 documents.
    for doc_id in ranked_docs[:3]:
        print(documents[doc_id][1])  # file path
        print(documents[doc_id][0])  # document content
        print("---------------------------------------------------------------------------------------------------------------")

In [21]:
with open("phrase_input.txt", "r") as input_file:
    query = input_file.readline().strip()  # drop the trailing newline
ranked_docs = rank_docs_by_tfidf(query)
printResults(query, ranked_docs)

Medical reports

.\Docs\Auto\AU127-1.docx
allstate austo insurance poliacy policy : issued to : m effective : p l e d o c u m e anu127-1 t allstate insurance company stable of contents general 2 when and where the policy applies 2 changes 2 duty to report autos 2 combining limits of two or more autosa payment of benefits ; autopsy 9 consent of beneficiary 9 part 4 automobile disability income protection coverage cw 9proof of claim ; medical reports 9 prohibited 2 transfer 2 cancellation 2 insuring agreement 9 insured persons 9 definitions 9 part 1 automobilemliability insurance exclusions what is not covered 9 coverages aa and bb 3 insuring agreement 3 to whom and when payment is made 10 proof of claim ; medical reports 10 padditional payments allstate will make 3 insured persons 4 part 5 uninsured motorists insurance coverage ss 10 insured autos 4 definitions 4 insuring agreement 10 insured persons 10 exclusions what is not covered 4 definitions 11 financial responsibility ...........