# <span style="font-family: Cambria;">Building a search engine using VSM</span>

<span style="font-family: Cambria;">First, we should explain what VSM is.</span>

## <span style="font-family: Cambria;">VSM: Vector Space Model</span>

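<span style="font-family: Cambria;">The Vector Space Model (VSM) represents text numerically: each document becomes a vector in a high-dimensional space with one dimension per vocabulary term, and the value along each dimension reflects how often that term occurs in the document. Because documents and queries live in the same space, they can be compared geometrically, which is the basis for NLP tasks such as document retrieval and text similarity analysis.</span>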
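
<span style="font-family: Cambria;">As a minimal illustration of this idea (a sketch on a made-up two-document corpus, not the corpus used in this notebook), each document can be turned into a vector of term counts over a shared vocabulary. The helper name `to_vector` is hypothetical.</span>

```python
from collections import Counter

# Toy corpus, invented for illustration only
docs = ["neural networks learn", "recurrent neural networks remember"]

# The vocabulary defines the axes of the vector space (one dimension per term)
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocabulary):
    """Return the term-count vector of `doc` over `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

<span style="font-family: Cambria;">Both vectors have the same length, so they can be compared dimension by dimension even though the documents contain different words.</span>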

<span style="font-family: Cambria;">With the Vector Space Model in mind, we can start building our search engine. The steps are as follows:</span>

## <span style="font-family: Cambria;">Step 0: Importing the corpus</span>

<span style="font-family: Cambria;">In this step, we read the text files of our corpus from the local machine into a Python list:</span>

In [1]:
import os

directory_path = r'E:\Study\University\codes\Jupyter\Preprocessing in NLP\Preprocessing-Methods-NLP\Corpora'

text_list = []

for filename in os.listdir(directory_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory_path, filename)
        with open(file_path, 'r') as file:
            text = file.read()
            text_list.append(text)

<span style="font-family: Cambria;">Next, I wrote a function that concatenates the contents of text_list (a list of strings) into a single string. This is needed because the NLTK tokenizer operates on a string, not on a list. The result is stored in a variable named text.</span>

In [2]:
# Concatenate a list of strings into a single string

def listToString(s):
    # str.join is the idiomatic equivalent of repeated +=;
    # no separator is inserted between the documents
    return "".join(s)

text = listToString(text_list)
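
<span style="font-family: Cambria;">One caveat with plain concatenation is that the last word of one file and the first word of the next are glued together whenever a file does not end with whitespace. A possible alternative is to join with a newline, sketched here on made-up strings with a hypothetical helper named `join_documents`:</span>

```python
def join_documents(documents, separator="\n"):
    """Join document strings, keeping a separator between consecutive documents."""
    return separator.join(documents)

# Made-up stand-ins for the contents of two files
sample_docs = ["End of the first lecture.", "Digital signals are discrete."]

glued = "".join(sample_docs)             # plain concatenation
separated = join_documents(sample_docs)  # newline between documents

print(repr(glued))
print(repr(separated))
```

<span style="font-family: Cambria;">Whether a separator is appropriate depends on how the files end, so this is a design choice to check against the actual corpus rather than a required change.</span>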


## <span style="font-family: Cambria;">Step 1: Preprocessing & tokenizing</span>

<span style="font-family: Cambria;">In this step, we apply some preprocessing to our corpus to remove uninformative tokens that would complicate our calculations, and we split the text into sentences.</span>

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Kiarash
[nltk_data]     Rahmani\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
import re
from nltk.tokenize import sent_tokenize

def preprocess_text(text):
    # Lowercase so that "The" and "the" map to the same token
    text = text.lower()

    # Keep only letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove a small hand-picked set of stop words
    stop_words = set(["the", "and", "is", "in", "it"])
    text = ' '.join(word for word in text.split() if word not in stop_words)

    return text


# Split the corpus into sentences; each sentence gets a 1-based id.
# Note: preprocess_text is defined above but is not applied in this cell.
sentence_list = []
sentences = sent_tokenize(text)

for idx, sentence in enumerate(sentences):
    sentence_info = {"id": idx + 1, "sentence": sentence}
    sentence_list.append(sentence_info)

for item in sentence_list:
    print(f"ID: {item['id']}, Sentence: {item['sentence']}")

ID: 1, Sentence: What are recurrent neural networks?
ID: 2, Sentence: A recurrent neural network (RNN) is a type of artificial neural network which uses sequential data or time series data.
ID: 3, Sentence: These deep learning algorithms are commonly used for ordinal or temporal problems, such as language translation, natural language processing (nlp), speech recognition, and image captioning; they are incorporated into popular applications such as Siri, voice search, and Google Translate.
ID: 4, Sentence: Like feedforward and convolutional neural networks (CNNs), recurrent neural networks utilize training data to learn.
ID: 5, Sentence: They are distinguished by their “memory” as they take information from prior inputs to influence the current input and output.
ID: 6, Sentence: While traditional deep neural networks assume that inputs and outputs are independent of each other, the output of recurrent neural networks depend on the prior elements within the sequence.
ID: 7, Sentence: Wh
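
<span style="font-family: Cambria;">The cell above defines `preprocess_text` but lists the raw sentences. As a hedged sketch of how the same cleaning could be applied to a single sentence, the following uses the same rules with a hypothetical helper named `clean_sentence` that returns a list of tokens. The sample string is adapted from the output above, and the example needs only the standard library.</span>

```python
import re

# Same small stop-word list as in the preprocessing cell
STOP_WORDS = {"the", "and", "is", "in", "it"}

def clean_sentence(sentence):
    """Lowercase, strip non-letters, and drop stop words; returns a list of tokens."""
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z\s]", "", sentence)
    return [word for word in sentence.split() if word not in STOP_WORDS]

sample = "A recurrent neural network (RNN) is a type of artificial neural network."
tokens = clean_sentence(sample)
print(tokens)
```

<span style="font-family: Cambria;">With this cleaning, "network." and "network" become the same token, which matters later when tokens are matched between a query and the sentences.</span>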

## <span style="font-family: Cambria;">Step 2: Creating our Dataset</span>

In [6]:
# Store the sentences in a DataFrame and save them as a CSV file
import pandas as pd

sentence_df = pd.DataFrame(sentence_list)

sentence_df.to_csv('output.csv', index=False)

sentence_df.head()

| | id | sentence |
|---|---|---|
| 0 | 1 | What are recurrent neural networks? |
| 1 | 2 | A recurrent neural network (RNN) is a type of ... |
| 2 | 3 | These deep learning algorithms are commonly us... |
| 3 | 4 | Like feedforward and convolutional neural netw... |
| 4 | 5 | They are distinguished by their “memory” as th... |


## <span style="font-family: Cambria;">Step 3: Creating our Matrix</span>


In [8]:
import pandas as pd

# Load the sentences saved in Step 2
df = pd.read_csv('output.csv')
id_values = df['id'].tolist()
sentences_values = df['sentence'].tolist()

# Vocabulary: lowercased, whitespace-split tokens (punctuation stays attached to words)
tokens = [word.lower() for sentence in sentences_values for word in str(sentence).split()]
unique_tokens = list(set(tokens))

# Term-frequency (TF) matrix: one row per token, one column per sentence id
sparse_df = pd.DataFrame(0, index=unique_tokens, columns=id_values)

for idx, sentence in zip(id_values, sentences_values):
    for word in str(sentence).split():
        sparse_df.loc[word.lower(), idx] += 1  # Increment the count for TF

sparse_df.to_csv('sparse_output_tf.csv')

print(sparse_df)


                 1    2    3    4    5    6    7    8    9    10   ...  587  \
ways,              0    0    0    0    0    0    0    0    0    0  ...    0   
eyes               0    0    0    0    0    0    0    0    0    0  ...    0   
tutorials          0    0    0    0    0    0    0    0    0    0  ...    0   
fields.            0    0    0    0    0    0    0    0    0    0  ...    0   
?                  0    0    0    0    0    0    0    0    0    0  ...    0   
...              ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
attention.         0    0    0    0    0    0    0    0    0    0  ...    0   
distant            0    0    0    0    0    0    0    0    0    0  ...    0   
effort             0    0    0    0    0    0    0    0    0    0  ...    0   
lecture.digital    0    0    0    0    0    0    0    0    0    0  ...    0   
three              0    0    0    0    0    0    0    0    0    0  ...    0   

                 588  589  590  591  592  593  594 
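
<span style="font-family: Cambria;">Most entries of this matrix are zero, because each sentence uses only a handful of the vocabulary terms. As a small sketch of the same construction without pandas (made-up sentences, standard library only), the counts can be kept per sentence in a `Counter` and expanded to a term-by-sentence table only when needed:</span>

```python
from collections import Counter

# Made-up sentences keyed by id, standing in for the rows of output.csv
sentences = {
    1: "networks learn from data",
    2: "recurrent networks remember prior data",
}

# Term frequencies per sentence: only non-zero counts are stored
term_counts = {sid: Counter(text.lower().split()) for sid, text in sentences.items()}

# Term-by-sentence table as nested dicts: matrix[term][sentence_id] -> count
vocabulary = sorted(set().union(*term_counts.values()))
matrix = {term: {sid: term_counts[sid][term] for sid in sentences} for term in vocabulary}

for term, row in matrix.items():
    print(f"{term:10s} {row}")
```

<span style="font-family: Cambria;">Looking up a missing term in a `Counter` returns 0, which is what fills the empty cells of the table.</span>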

## <span style="font-family: Cambria;">Step 5: Calculating cosine similarity</span>

In [56]:
# Now that we have the term-frequency matrix, rank the sentences by cosine similarity to a query
import pandas as pd

sparse_df = pd.read_csv('sparse_output_tf.csv', index_col=0)
input_text = "This strategy allows us to use local information to understand the general structure of the data."

input_tokens = [word.lower() for word in input_text.split()]

# Build the query vector over the same vocabulary; words not in the vocabulary are ignored
input_df = pd.DataFrame(0, index=sparse_df.index, columns=['input'])
for word in input_tokens:
    if word in input_df.index:
        input_df.loc[word, 'input'] += 1

# The query magnitude does not depend on the sentence, so compute it once
magnitude_input = sum(input_df['input'] ** 2) ** 0.5

# Calculate the cosine similarity between the query and every sentence column
cosine_similarities = []
for column in sparse_df.columns:
    dot_product = sum(sparse_df[column] * input_df['input'])
    magnitude_sparse = sum(sparse_df[column] ** 2) ** 0.5

    if magnitude_sparse == 0 or magnitude_input == 0:
        similarity = 0  # Handle division by zero
    else:
        similarity = dot_product / (magnitude_sparse * magnitude_input)

    cosine_similarities.append(similarity)

    
result_df = pd.DataFrame({'id': sparse_df.columns, 'similarity': cosine_similarities})
result_df = result_df.sort_values(by='similarity', ascending=False)

result_df.to_csv('ranked_output.csv', index=False)

print(result_df)

      id  similarity
499  500    1.000000
501  502    0.645497
123  124    0.487950
9     10    0.484123
24    25    0.483046
..   ...         ...
479  480    0.000000
481  482    0.000000
254  255    0.000000
485  486    0.000000
0      1    0.000000

[596 rows x 2 columns]
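
<span style="font-family: Cambria;">The score used above is the cosine of the angle between the query vector q and a sentence vector d, that is, their dot product divided by the product of their lengths. It is 1 when the two vectors point in the same direction and 0 when they share no terms. A minimal standalone sketch of the same computation on made-up term counts (the function name `cosine_similarity` and the toy data are illustrative, not part of the notebook) could look like this:</span>

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid division by zero for empty vectors
    return dot / (norm_a * norm_b)

# Toy query and sentences, invented for illustration
query = Counter("recurrent networks".split())
docs = {
    1: Counter("recurrent networks".split()),
    2: Counter("convolutional networks".split()),
    3: Counter("speech recognition".split()),
}

# Rank sentence ids from most to least similar to the query
ranking = sorted(docs, key=lambda sid: cosine_similarity(query, docs[sid]), reverse=True)
print(ranking)
```

<span style="font-family: Cambria;">Only terms that occur in both mappings contribute to the dot product, so sentences that share no words with the query end up at the bottom of the ranking with a score of 0.</span>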


## <span style="font-family: Cambria;">Step6: </span>