
# Task
Create a VSM (Vector Space Model) for information retrieval with an example.

## Define the corpus

### Subtask:
Create a small collection of documents to work with.


**Reasoning**:
Create a list of strings to represent the documents in the corpus.



In [1]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is sleeping on the mat.",
    "The quick brown fox is a fast animal."
]

## Tokenize the documents

### Subtask:
Break down each document into individual words or tokens.


**Reasoning**:
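Tokenize each document in the corpus by converting it to lowercase and splitting on whitespace. Note that this leaves punctuation attached to tokens, so `dog.` and `dog` are treated as different terms.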



In [2]:
tokenized_corpus = []
for doc in corpus:
  tokenized_corpus.append(doc.lower().split())
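
Whitespace splitting keeps trailing punctuation on tokens. A minimal sketch of a regex-based alternative, assuming that runs of word characters are what should count as a token (the `tokenize` helper and its pattern are illustrative and not part of the original notebook):

```python
import re

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is sleeping on the mat.",
    "The quick brown fox is a fast animal.",
]

def tokenize(text):
    """Hypothetical helper: lowercase the text and return runs of word characters."""
    return re.findall(r"\w+", text.lower())

tokenized_clean = [tokenize(doc) for doc in corpus]
print(tokenized_clean[0])
```

With this variant, `dog.` and `dog` collapse into the single term `dog`.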

## Create a vocabulary

### Subtask:
Generate a unique list of all the tokens from the corpus.


**Reasoning**:
Generate a unique list of all the tokens from the corpus by iterating through the tokenized documents and adding each token to a set, then converting the set to a list and printing it.



In [3]:
vocabulary = set()
for doc in tokenized_corpus:
  for token in doc:
    vocabulary.add(token)

vocabulary = list(vocabulary)
print(vocabulary)

['the', 'jumps', 'is', 'brown', 'over', 'a', 'fast', 'lazy', 'dog.', 'quick', 'sleeping', 'fox', 'on', 'animal.', 'mat.', 'dog']


## Build the term-document matrix

### Subtask:
Represent each document as a vector based on the frequency of each term in the vocabulary.


**Reasoning**:
For each document, count how often each vocabulary term occurs. Each row of the resulting matrix is one document's term-frequency vector, with columns in vocabulary order.



In [4]:
term_document_matrix = []
for doc in tokenized_corpus:
  frequency_vector = [0] * len(vocabulary)
  for token in doc:
    try:
      token_index = vocabulary.index(token)
      frequency_vector[token_index] += 1
    except ValueError:
      # Handle tokens not in vocabulary if necessary, though in this case
      # the vocabulary is built from the corpus itself.
      pass
  term_document_matrix.append(frequency_vector)

# Display the term-document matrix
for row in term_document_matrix:
    print(row)

[2, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0]
[2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
[1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
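
The call `vocabulary.index(token)` scans the list for every token, and the order of a vocabulary built from a `set` can change between runs. The same counts can be sketched with `collections.Counter` and a sorted vocabulary, which keeps the column order reproducible. The helper name `build_tf_matrix` and the two sample documents are assumptions made for illustration:

```python
from collections import Counter

def build_tf_matrix(tokenized_docs):
    """Return (sorted vocabulary, term-frequency rows), one row per document."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in the document
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = [
    "the quick brown fox jumps over the lazy dog.".split(),
    "the lazy dog is sleeping on the mat.".split(),
]
vocab, tf_rows = build_tf_matrix(docs)
for row in tf_rows:
    print(row)
```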


## Calculate tf-idf weights

### Subtask:
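Apply TF-IDF (Term Frequency-Inverse Document Frequency) weighting so that terms occurring in few documents count for more than terms occurring in most of them.


**Reasoning**:
Use scikit-learn's `TfidfVectorizer`, fitted on the raw corpus rather than on the hand-built matrix above. The vectorizer does its own tokenization: it lowercases the text, ignores punctuation, and drops single-character tokens such as "a", so its vocabulary has 14 terms instead of the 16 produced by whitespace splitting.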



In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Instantiate a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit the TfidfVectorizer to the original corpus and transform it
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert the sparse result to a dense matrix (.todense() returns np.matrix, not an ndarray)
tfidf_matrix = tfidf_matrix.todense()

# Display the TF-IDF matrix
print(tfidf_matrix)

[[0.         0.30330642 0.30330642 0.         0.30330642 0.
  0.3988115  0.30330642 0.         0.         0.3988115  0.30330642
  0.         0.47108899]
 [0.         0.         0.30716074 0.         0.         0.30716074
  0.         0.30716074 0.40387946 0.40387946 0.         0.
  0.40387946 0.47707544]
 [0.46312056 0.35221512 0.         0.46312056 0.35221512 0.35221512
  0.         0.         0.         0.         0.         0.35221512
  0.         0.27352646]]
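
To see where these numbers come from, the weights can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: tokens of two or more word characters, raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is sleeping on the mat.",
    "The quick brown fox is a fast animal.",
]

def tokenize(text):
    # Assumed to mirror the default token pattern: lowercase, 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(corpus)
print(len(vocab))
print(round(rows[0][vocab.index("the")], 8))
```

The last column of the printed matrix corresponds to "the", which is the final term in alphabetical order, so the value printed here can be compared against the last entry of the first row.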


## Query processing

### Subtask:
Tokenize a sample query and represent it as a vector using the same vocabulary and weighting scheme.


**Reasoning**:
Define a sample query, tokenize it, and transform it into a TF-IDF vector using the previously fitted vectorizer.



In [6]:
sample_query = "quick fox"

# Tokenize the query
tokenized_query = sample_query.lower().split()
print("Tokenized Query:", tokenized_query)

# Transform the raw query string into a TF-IDF vector using the fitted vectorizer.
# The vectorizer tokenizes its input itself and expects an iterable of strings,
# so pass the query wrapped in a list (tokenized_query above is only for display).
query_tfidf_vector = vectorizer.transform([sample_query])

# Convert the sparse result to a dense matrix (.todense() returns np.matrix)
query_tfidf_vector = query_tfidf_vector.todense()

# Display the TF-IDF vector for the query
print("Query TF-IDF Vector:")
print(query_tfidf_vector)

Tokenized Query: ['quick', 'fox']
Query TF-IDF Vector:
[[0.         0.         0.         0.         0.70710678 0.
  0.         0.         0.         0.         0.         0.70710678
  0.         0.        ]]
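
The two non-zero entries can be traced back to terms. A minimal sketch, assuming the 14 features are ordered alphabetically and that query terms outside the vocabulary are ignored. `query_vector` is a hypothetical helper; with `idf=None` every term gets weight 1, which gives the same result here because "quick" and "fox" both occur in two documents and therefore share the same IDF.

```python
import math

# Feature order assumed to match the fitted vectorizer (alphabetical)
vocab = ["animal", "brown", "dog", "fast", "fox", "is", "jumps",
         "lazy", "mat", "on", "over", "quick", "sleeping", "the"]

def query_vector(query, vocab, idf=None):
    """Hypothetical helper: L2-normalized query vector; unknown terms are dropped."""
    index = {term: i for i, term in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for token in query.lower().split():
        if token in index:
            vec[index[token]] += 1.0 if idf is None else idf[token]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

vec = query_vector("quick fox unicorn", vocab)
print([(vocab[i], round(v, 4)) for i, v in enumerate(vec) if v])
# prints [('fox', 0.7071), ('quick', 0.7071)]
```

The out-of-vocabulary word "unicorn" contributes nothing, and the remaining two components land in columns 4 and 11.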


## Calculate similarity

### Subtask:
Compute the similarity between the query vector and each document vector using cosine similarity.


**Reasoning**:
The subtask is to compute the cosine similarity between the query vector and each document vector. This requires importing the `cosine_similarity` function and applying it to the previously generated TF-IDF vectors.



In [13]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Convert the matrices to numpy arrays
query_tfidf_vector_array = np.asarray(query_tfidf_vector)
tfidf_matrix_array = np.asarray(tfidf_matrix)

# Calculate the cosine similarity between the query vector and each document vector
similarity_scores = cosine_similarity(query_tfidf_vector_array, tfidf_matrix_array)

# Print or display the similarity scores
print("Similarity Scores:")
print(similarity_scores)

Similarity Scores:
[[0.42894006 0.         0.4981074 ]]
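
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below recomputes the scores in plain Python from the rounded values printed earlier in this notebook; the `cosine` helper is illustrative, and small differences in the last digits are expected because the inputs are rounded.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Rows copied from the printed TF-IDF matrix (rounded to 8 decimals)
docs = [
    [0, 0.30330642, 0.30330642, 0, 0.30330642, 0, 0.3988115,
     0.30330642, 0, 0, 0.3988115, 0.30330642, 0, 0.47108899],
    [0, 0, 0.30716074, 0, 0, 0.30716074, 0,
     0.30716074, 0.40387946, 0.40387946, 0, 0, 0.40387946, 0.47707544],
    [0.46312056, 0.35221512, 0, 0.46312056, 0.35221512, 0.35221512, 0,
     0, 0, 0, 0, 0.35221512, 0, 0.27352646],
]
query = [0.0] * 14
query[4] = query[11] = 0.70710678  # the two non-zero query components

scores = [cosine(query, doc) for doc in docs]
print([round(s, 4) for s in scores])
```

Because the TF-IDF rows are already L2-normalized, the denominator is close to 1 and each score is essentially the dot product.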


**Reasoning**:
`cosine_similarity` does not accept `np.matrix` input. Both `query_tfidf_vector` and `tfidf_matrix` were produced by `.todense()`, which returns `np.matrix`, so the cell above converts them to NumPy arrays with `np.asarray` before computing the similarity.


## Rank documents

### Subtask:
Rank the documents based on their similarity to the query.


**Reasoning**:
I need to rank the documents based on their similarity scores. I will create a list of tuples with document index and score, sort them, and then print the ranked indices.



In [9]:
# Create a list of tuples with document index and similarity score
document_scores = [(i, score) for i, score in enumerate(similarity_scores[0])]

# Sort the list in descending order based on similarity scores
document_scores.sort(key=lambda x: x[1], reverse=True)

# Extract the ranked document indices
ranked_documents = [doc[0] for doc in document_scores]

# Print the ranked document indices
print("Ranked Documents (by index):")
print(ranked_documents)

Ranked Documents (by index):
[2, 0, 1]
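
Indices alone are hard to read. A small sketch that pairs each score with its document text, using the scores printed earlier in this notebook as literal values (the variable names are illustrative):

```python
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is sleeping on the mat.",
    "The quick brown fox is a fast animal.",
]
scores = [0.42894006, 0.0, 0.4981074]  # similarity scores from the earlier output

# Sort document indices by score, highest first
ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

for rank, i in enumerate(ranked_indices, start=1):
    print(f"{rank}. doc {i} (score={scores[i]:.4f}): {corpus[i]}")
```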


## Summary:

### Data Analysis Key Findings

*   The corpus consists of three documents.
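*   Whitespace tokenization yields 16 distinct tokens because punctuation stays attached (for example, `dog.` and `dog`). `TfidfVectorizer` applies its own tokenization and produces a vocabulary of 14 unique terms.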
*   The TF-IDF matrix is a 3x14 matrix, where each row represents a document and each column represents a unique term, containing the TF-IDF scores.
*   The sample query "quick fox" was successfully transformed into a TF-IDF vector using the same vectorizer fitted on the corpus.
*   The cosine similarity scores between the query vector and the document vectors are `[[0.42894006 0.         0.4981074 ]]`, indicating the similarity to documents at index 0, 1, and 2 respectively.
*   Based on the similarity scores, the documents are ranked in descending order of relevance to the query as follows: Document at index 2 (score 0.4981074), Document at index 0 (score 0.42894006), and Document at index 1 (score 0.0).

### Insights or Next Steps

*   The document at index 2 ("The quick brown fox is a fast animal.") is the most relevant to the query "quick fox", followed by the document at index 0 ("The quick brown fox jumps over the lazy dog."). The document at index 1 ("The lazy dog is sleeping on the mat.") is the least relevant, with a similarity score of 0.
*   To improve the VSM, consider stemming or lemmatization to reduce terms to their root form, and stop-word removal to filter out common, less informative words.
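
The stop-word suggestion can be sketched without extra dependencies. The stop list below is a small hand-picked assumption for this corpus, not a standard list; scikit-learn's `TfidfVectorizer` also accepts `stop_words="english"` to use its built-in list.

```python
import re

# Illustrative stop list, hand-picked for this corpus
STOP_WORDS = {"the", "is", "a", "on", "over"}

def content_tokens(text):
    """Hypothetical helper: lowercase, tokenize on word characters, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(content_tokens("The lazy dog is sleeping on the mat."))
# prints ['lazy', 'dog', 'sleeping', 'mat']
```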
