harshilmak21/Search_Engine_pdf

tfid_indexer.py

TfidfVectorizer() --> scikit-learn tool that automatically calculates both term frequency (TF) and inverse document frequency (IDF). It handles all the logarithmic math behind the scenes.

Before scikit-learn can calculate any scores, it needs a corpus (a list of plain text strings).

DocumentProcessor previously broke down the PDFs into clean lists of words (like ['hello', 'welcome', 'servlet']).

corpus = []
for doc in documents:
    text = " ".join(doc.clean_tokens)
    corpus.append(text)

This loop takes those lists of individual tokens and uses " ".join(...) to glue them back together into clean, continuous text sentences (like "hello welcome servlet"). Each finished document string is appended to the corpus list.
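The corpus-building step can be tried in isolation with a minimal sketch. The `Document` class and the file names below are hypothetical stand-ins for whatever DocumentProcessor actually returns; only the `clean_tokens` attribute comes from the code above.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a processed PDF; only clean_tokens is from the README."""
    name: str
    clean_tokens: list = field(default_factory=list)


def build_corpus(documents):
    # Join each document's token list back into one space-separated string.
    return [" ".join(doc.clean_tokens) for doc in documents]


documents = [
    Document("a.pdf", ["hello", "welcome", "servlet"]),
    Document("b.pdf", ["servlet", "request", "response"]),
]
corpus = build_corpus(documents)
print(corpus)
```

The list comprehension does the same work as the explicit loop; either form yields one string per document, in the same order as `documents`.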

self.document_vectors = self.vectorizer.fit_transform(corpus)
return self.document_vectors

fit: The vectorizer reads the entire corpus to build a global vocabulary dictionary. It identifies every unique word across all documents (which resulted in those 92 unique words we saw earlier) and assigns an ID number to each word.

transform: It calculates the mathematical TF-IDF score for every single word in every single document.

The output stored in self.document_vectors is a compressed sparse matrix (SciPy's CSR format). Instead of wasting computer memory storing hundreds of zeros for words that don't appear in a file, it only remembers the coordinates and values of the non-zero entries.
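To see what fit and transform produce, here is a small sketch on a three-document toy corpus. The corpus strings are made up for illustration and are not the project's PDFs; `vocabulary_`, `shape`, and `nnz` are standard scikit-learn and SciPy attributes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
corpus = [
    "hello welcome servlet",
    "servlet request response",
    "hello request",
]

vectorizer = TfidfVectorizer()
document_vectors = vectorizer.fit_transform(corpus)

# fit: vocabulary_ maps each unique word to a column ID.
vocabulary = vectorizer.vocabulary_
print(sorted(vocabulary.items(), key=lambda kv: kv[1]))

# transform: one row per document, one column per vocabulary word.
print(document_vectors.shape)

# Sparse storage: only the non-zero scores are kept.
print(document_vectors.nnz)
```

With 3 documents and 5 unique words the matrix has 15 cells, but only 8 of them hold a score, and those 8 are all that the sparse matrix stores.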

search_engine.py

Calculating Relevance via Cosine Similarity

similarities = cosine_similarity(query_vector, self.document_vectors).flatten()

cosine_similarity(...): This functions as a geometric compass. In a multi-dimensional vector space, it calculates the cosine of the angle between your query vector and each document vector. If the vectors point in the exact same direction (sharing lots of high-value TF-IDF words), the score is close to $1.0$ (highly relevant).

If they share no words, the score is $0.0$ (completely irrelevant).
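The scoring step can be sketched end to end on a toy corpus. This assumes the query is vectorized with the same fitted vectorizer via `transform`, which the notes above do not show explicitly; the corpus and query strings are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (illustrative only).
corpus = [
    "hello welcome servlet",
    "servlet request response",
    "database connection pool",
]

vectorizer = TfidfVectorizer()
document_vectors = vectorizer.fit_transform(corpus)

# Assumption: the query reuses the fitted vocabulary, so its columns
# line up with the document matrix.
query_vector = vectorizer.transform(["servlet request"])

# Result has shape (1, n_documents); flatten() turns it into a 1-D array.
similarities = cosine_similarity(query_vector, document_vectors).flatten()
print(similarities)
```

The second document shares both query words and scores highest, the first shares only "servlet" and scores lower, and the third shares nothing and scores exactly 0.0.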

Sorting from Best to Worst (argsort)

ranked_indices = similarities.argsort()[::-1]

.argsort(): Instead of sorting the actual scores, it returns the indices (the position numbers) that would sort the array from lowest score to highest score.

[::-1]: This reverses the array. Now, the indices are arranged from highest score to lowest score (best matching document index first).

Example: If your scores were [0.0, 0.0, 0.85], argsort()[::-1] outputs [2, 1, 0], telling you that the document at index 2 is your best match. The relative order of the two tied zero-score documents is not meaningful.
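A quick way to check the ranking logic is to run it on a hand-written score array. The scores below are made up, and dropping zero-score documents is an optional extra step that the notes above do not describe.

```python
import numpy as np

# Made-up similarity scores for four documents.
similarities = np.array([0.10, 0.00, 0.85, 0.40])

# argsort() gives indices from lowest to highest score; [::-1] reverses them.
ranked_indices = similarities.argsort()[::-1]
print(ranked_indices)

# Optional: keep only documents that actually matched the query.
results = [(int(i), float(similarities[i])) for i in ranked_indices if similarities[i] > 0]
print(results)
```

Here the ranking is index 2, then 3, then 0, and the zero-score document at index 1 is filtered out of `results`.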

SnippetGenerator: extracts a small window of context around the search term. It shows the user why the document is relevant without making them open and read a massive 20-page PDF file.
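The windowing idea can be illustrated with a small function. This is a hypothetical sketch, not the project's SnippetGenerator: the function name `make_snippet`, the `window` parameter, and the word-based matching are all assumptions.

```python
def make_snippet(text, term, window=5):
    """Return up to `window` words on each side of the first match of `term`."""
    words = text.split()
    # Compare case-insensitively and ignore surrounding punctuation.
    lowered = [w.lower().strip(".,;:!?()\"'") for w in words]
    term = term.lower()
    if term not in lowered:
        return ""  # no match, so no snippet
    pos = lowered.index(term)
    start = max(0, pos - window)
    end = min(len(words), pos + window + 1)
    # Mark truncated context with ellipses.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(words[start:end]) + suffix


text = "A servlet is a Java class that handles requests and builds responses on the server."
print(make_snippet(text, "requests", window=2))
```

A word-based window keeps the snippet readable; a character-based window would be an equally valid choice if the PDFs contain very long tokens.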
