Calculating Relevance via Cosine Similarity

tfid_indexer.py

TfidfVectorizer() --> scikit-learn tool that automatically calculates both Term Frequency and Inverse Document Frequency. It handles all the logarithmic math behind the scenes.

Before scikit-learn can calculate any scores, it needs a corpus (a list of plain text strings).

DocumentProcessor previously broke down the PDFs into clean lists of words (like ['hello', 'welcome', 'servlet']).

corpus = []
for doc in documents:
    text = " ".join(doc.clean_tokens)
    corpus.append(text)

This loop takes those lists of individual tokens and uses " ".join(...) to glue them back together into clean, continuous text sentences (like "hello welcome servlet"). Each finished document string is appended to the corpus list.
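The loop above can be tried on its own with a small stand-in for the processed documents. This is a minimal sketch: the Document class and the sample tokens here are hypothetical placeholders, and only the clean_tokens attribute name comes from the snippet above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    # Hypothetical stand-in for whatever DocumentProcessor produces;
    # only the clean_tokens attribute is taken from the notes above.
    name: str
    clean_tokens: List[str]


documents = [
    Document("intro.pdf", ["hello", "welcome", "servlet"]),
    Document("http.pdf", ["servlet", "handles", "request"]),
]

# Rebuild one plain-text string per document, as TfidfVectorizer expects.
corpus = []
for doc in documents:
    text = " ".join(doc.clean_tokens)
    corpus.append(text)

print(corpus)
```

The same result can be written as a list comprehension, [" ".join(doc.clean_tokens) for doc in documents], if the explicit loop is not needed.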

self.document_vectors = self.vectorizer.fit_transform(corpus)
return self.document_vectors

fit: The vectorizer reads the entire corpus to build a global vocabulary dictionary. It identifies every unique word across all documents (which resulted in those 92 unique words we saw earlier) and assigns an ID number to each word.

transform: It calculates the mathematical TF-IDF score for every single word in every single document.

The output stored in self.document_vectors is a compressed sparse matrix with one row per document and one column per vocabulary word. Instead of wasting memory storing hundreds of zeros for words that don't appear in a file, it only stores the coordinates and values of the non-zero entries.
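The fit and transform steps can be seen on a toy corpus. This is a sketch under stated assumptions: the three sample strings are made up for illustration, and the vectorizer uses scikit-learn's default settings, which may differ from the settings used in tfid_indexer.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only, not the project's PDFs).
corpus = [
    "hello welcome servlet",
    "servlet handles request",
    "welcome page",
]

vectorizer = TfidfVectorizer()

# fit builds the vocabulary; transform computes the TF-IDF weights.
document_vectors = vectorizer.fit_transform(corpus)

# Mapping of each unique word to its column ID.
print(vectorizer.vocabulary_)

# One row per document, one column per unique word.
print(document_vectors.shape)  # (3, 6)

# Number of values actually stored in the sparse matrix.
print(document_vectors.nnz)  # 8
```

Only 8 of the 18 cells hold a value here. On a real corpus with a large vocabulary the share of stored cells is usually far smaller, which is where the sparse format pays off.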

search_engine.py

Calculating Relevance via Cosine Similarity

similarities = cosine_similarity(query_vector, self.document_vectors).flatten()

cosine_similarity(...): This functions as a geometric compass. In a multi-dimensional vector space, it calculates the cosine of the angle between your query vector and each document vector. If the vectors point in the exact same direction (sharing lots of high-value TF-IDF words), the score is close to $1.0$ (highly relevant).

If they share no words, the score is $0.0$ (completely irrelevant).
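To make the two extremes concrete, here is a pure-Python sketch of the cosine formula (dot product divided by the product of the vector lengths). It illustrates the math only and is not scikit-learn's implementation; the function name cosine and the three small vectors are made up for this example.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A vector with no terms has no direction; treat it as no match.
        return 0.0
    return dot / (norm_a * norm_b)


# Each position stands for one vocabulary word (illustrative weights).
query_vec = [0.0, 0.7, 0.7]
same_direction = [0.0, 1.4, 1.4]   # same words, same proportions
no_shared_words = [1.0, 0.0, 0.0]  # uses only a word the query lacks

score_same = cosine(query_vec, same_direction)
score_none = cosine(query_vec, no_shared_words)
print(score_same, score_none)
```

Note that same_direction is twice as long as query_vec yet still scores about 1.0: cosine similarity compares direction, not magnitude, so a long document is not favored just for repeating words more often.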

Sorting from Best to Worst (argsort)

ranked_indices = similarities.argsort()[::-1]

.argsort(): Instead of sorting the actual scores, it returns the indices (the position numbers) that would sort the array from lowest score to highest score.

[::-1]: This reverses the array. Now, the indices are arranged from highest score to lowest score (best matching document index first).

Example: If your scores were [0.0, 0.0, 0.85], argsort()[::-1] outputs [2, 1, 0], telling you that the document at index 2 is your best match. The other two documents are tied at 0.0, so their relative order carries no meaning.
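The two steps can be checked separately in NumPy. This is a small sketch with made-up scores, chosen to be distinct so the ordering is unambiguous.

```python
import numpy as np

# Illustrative similarity scores for three documents.
similarities = np.array([0.12, 0.0, 0.85])

# Indices that would sort the scores from lowest to highest.
ascending = similarities.argsort()

# Reverse to get best match first.
ranked_indices = ascending[::-1]

for rank, idx in enumerate(ranked_indices, start=1):
    print(rank, idx, similarities[idx])
```

Here ascending is [1, 0, 2] and ranked_indices is [2, 0, 1]: the document at index 2 ranks first, followed by index 0, then index 1.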

SnippetGenerator: Extracts a small window of context around the search term. It shows the user why the document is relevant without making them open and read a massive 20-page PDF file.
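The idea of a context window can be sketched as follows. This is not the project's SnippetGenerator, whose code is not shown here; the function name make_snippet, the window parameter, and the sample text are all hypothetical.

```python
def make_snippet(text, term, window=40):
    """Return the text around the first match of term, or None if absent.

    Hypothetical helper: matches case-insensitively and keeps up to
    `window` characters on each side of the match.
    """
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    start = max(0, pos - window)
    end = min(len(text), pos + len(term) + window)
    # Mark the sides where text was cut off.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


sample_text = "A servlet is a Java class that handles HTTP requests on the server."
snippet = make_snippet(sample_text, "java", window=10)
print(snippet)
```

A fuller version might snap the window to word boundaries or highlight the matched term, but a fixed character window is enough to show the user why a document matched.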
