In [117]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from scipy.spatial.distance import cosine

Today we will focus on finding similarities between documents by comparing their content. The same techniques can be used for a search engine query: we treat the query as just another document, calculate its similarity to every document, and return the most similar ones.

In [118]:
documents = ['Machine Learning',
 'Five Advanced Plots in Python - Matplotlib',
 'How to Make your Computer Talk with Python',
 'Anomaly Detection on Servo Drives',
 'Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction',
 'Animated Mathematical Analysis',
 'How to Perform Speech Recognition with Python',
 'Beyond The Semesters: E04',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 'Time Series Forecasting with ThymeBoost',
 'CHAPTER 2: Why I Chose Data Science!',
 'Training Provably-Robust Neural Networks',
 'Time Series Forecasting with ThymeBoost',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 '5 Cute Features of CatBoost',
 'Variance Inflation Factor (VIF) and it’s relationship with multicollinearity&nbsp;.',
 'Beyond The Semesters: E04',
 'Efficient Digital Transformation - Particle Swarm Optimiser',
 'MEASURE OF ASYMMETRY',
 'What is linear regression? A quick cover with a tutorial',
 'Correlation VS Covariance: The easy way',
 'Are Recommender System harming us?',
 '1 Line of Python Code That Will Speed Up Your AI by Up to 6x',
 'If You Are Serious About Data Science Job. You Must Know These 3 Things.',
 'Recommender System With Machine Learning and Statistics',
 'Bias detection and mitigation in IBM AutoAI',
 'Data Engineering: Create your own Dataset',
 'Graph Neural Networks and Generalizable Models in Neuroscience',
 'Fastest Way of Deploying Your Machine Learning Models',
 'A Novel Approach to Integrate Speech Recognition into Authentication Systems',
 '3 Lessons Learned in Teaching Machine Learning for Earth Observation Techniques',
 'Vision Transformer in Galaxy Morphology Classification',
 'Exploring Methods of Deep Reinforcement Learning with NLP Applications',
 '6 Essential Tips to Solve Data Science Projects',
 'Data Science Interview Questions My Friends and I got asked recently (III)',
 'Understanding Uber’s Generative Teaching Networks',
 'How to achieve efficient large-batch training?',
 'How Parallelization and Large Batch Size Improve the Performance of Deep Neural Networks.',
 'Why You Need to Know the Inner Workings of Models',
 'Let’s Build A Simple Object Classification Task I']

In [119]:
CountVec = CountVectorizer(ngram_range=(1,1), stop_words='english')

'''
CountVectorizer converts a collection of docs into a document-term matrix
row -> document
column -> term (n-gram); here ngram_range is (1,1), so each term is a single word

ngram_range=(1,1) -> single words, i.e. unigrams
ngram_range=(1,2) -> single words and pairs of words, i.e. unigrams and bigrams
ngram_range=(2,2) -> only bigrams
GENERALLY:
ngram_range=(n,m) -> n is the minimum and m the maximum number of words in an n-gram,
so if n==m then only n-grams of size n are considered

stop_words='english' -> removes common English words like 'the', 'is', 'and' etc.
'''

CountData = CountVec.fit_transform(documents)

'''
fit_transform(documents)
fit -> learn the vocabulary from the input docs: scan through them, identify tokens (unique words) after preprocessing
(here: lowercasing and removing stop words) and assign a column index to each token
transform -> convert the input docs into a document-term matrix
'''

CountData

<40x150 sparse matrix of type '<class 'numpy.int64'>'
	with 204 stored elements in Compressed Sparse Row format>
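
To make the `ngram_range` description above concrete, here is a minimal pure-Python sketch of how word n-grams for a range `(n, m)` can be enumerated. The helper name `word_ngrams` is hypothetical, and it only approximates what `CountVectorizer` does (no lowercasing, stop-word removal, or token-pattern handling).

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with sizes from n to m (inclusive)."""
    n, m = ngram_range
    grams = []
    for size in range(n, m + 1):
        # Slide a window of `size` words over the token list
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


tokens = ["time", "series", "forecasting"]
print(word_ngrams(tokens, (1, 1)))  # unigrams only
print(word_ngrams(tokens, (1, 2)))  # unigrams and bigrams
print(word_ngrams(tokens, (2, 2)))  # bigrams only
```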

The most basic way of storing information about documents is word counts: for each document, we store how many times each word appears. This could be stored in a dense array, but that is not the best option because it would be filled mostly with 0s. That is why the result is stored in a sparse matrix, which we can expand into a dense one when needed.
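
As a rough illustration of why sparse storage pays off, the sketch below keeps only the non-zero counts in a dictionary keyed by (row, column). It is a toy stand-in, not SciPy's Compressed Sparse Row format, and `to_sparse_counts` is a hypothetical helper.

```python
from collections import Counter


def to_sparse_counts(docs):
    """Map (doc_index, term_index) -> count, storing only non-zero entries."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    col = {term: j for j, term in enumerate(vocab)}
    entries = {}
    for i, d in enumerate(docs):
        for term, count in Counter(d.lower().split()).items():
            entries[(i, col[term])] = count
    return entries, vocab


docs = ["machine learning", "deep learning models", "time series forecasting"]
entries, vocab = to_sparse_counts(docs)
dense_cells = len(docs) * len(vocab)
print(f"stored entries: {len(entries)} of {dense_cells} dense cells")

# Same arithmetic for the matrix shown above: 204 stored values in a 40 x 150 matrix
density = 204 / (40 * 150)
print(f"density: {density:.1%}")
```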

In [120]:
df=pd.DataFrame(CountData.toarray(), columns=CountVec.get_feature_names_out(), index=documents)
df

Unnamed: 0,6x,achieve,advanced,ai,analysis,animated,anomaly,applications,approach,asked,...,tutorial,uber,understanding,variance,ventilator,vif,vision,vs,way,workings
Machine Learning,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Five Advanced Plots in Python - Matplotlib,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
How to Make your Computer Talk with Python,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Anomaly Detection on Servo Drives,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
Animated Mathematical Analysis,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
How to Perform Speech Recognition with Python,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Beyond The Semesters: E04,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"How to improve classification of e-commerce pages, incorporating multiple modalities",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Time Series Forecasting with ThymeBoost,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Task 1
We can reduce the size of the array, get rid of unnecessary words, and improve the quality of the comparison by first preprocessing the documents.
Check the array size after stemming/lemmatization and stop-word removal.

In [121]:
import nltk

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


from nltk.tokenize import word_tokenize, wordpunct_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [122]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
wordNet = WordNetLemmatizer()
stopWords = set(stopwords.words('english'))

# def wordPunctTokens(doc):
#     return wordpunct_tokenize(doc)


# def wordTokens(doc):
#     return word_tokenize(doc)

# def porterStem(tokens):
#     return [porter.stem(token) for token in tokens]


def preprocess_doc(doc, tokenizer=word_tokenize, stemmer=None, lemmatizer=None, use_lemmatizer=False):
    tokens = tokenizer(doc.lower())  # lowercase the doc, then tokenize it
    terms = [word for word in tokens if word.isalpha() and word not in stopWords]  # drop stopwords and non-alphabetic tokens
    if use_lemmatizer and lemmatizer:
        processed = [lemmatizer.lemmatize(word) for word in terms]  # lemmatize the words
    elif stemmer:
        processed = [stemmer.stem(word) for word in terms]  # stem the words
    else:
        processed = terms

    return ' '.join(processed)


In [123]:
processed_docs_porter = [preprocess_doc(doc, stemmer=porter) for doc in documents]
processed_docs_porter_wordpunct = [preprocess_doc(doc, tokenizer=wordpunct_tokenize, stemmer=porter) for doc in documents]

processed_docs_lancaster = [preprocess_doc(doc, stemmer=lancaster) for doc in documents]
processed_docs_lancaster_wordpunct = [preprocess_doc(doc, tokenizer=wordpunct_tokenize, stemmer=lancaster) for doc in documents]

processed_docs_wordnet = [preprocess_doc(doc, lemmatizer=wordNet, use_lemmatizer=True) for doc in documents]
processed_docs_wordnet_wordpunct = [preprocess_doc(doc, tokenizer=wordpunct_tokenize, lemmatizer=wordNet, use_lemmatizer=True) for doc in documents]


processed_docs = {
    "Porter (word_tokenize)": processed_docs_porter,
    "Porter (wordpunct_tokenize)": processed_docs_porter_wordpunct,
    "Lancaster (word_tokenize)": processed_docs_lancaster,
    "Lancaster (wordpunct_tokenize)": processed_docs_lancaster_wordpunct,
    "WordNet Lemmatizer (word_tokenize)": processed_docs_wordnet,
    "WordNet Lemmatizer (wordpunct_tokenize)": processed_docs_wordnet_wordpunct,
}

In [124]:
for name, processed_doc in processed_docs.items():
    CountVec = CountVectorizer(ngram_range=(1,1), stop_words='english')
    CountData = CountVec.fit_transform(processed_doc)
    num_columns = len(CountVec.get_feature_names_out())
    print(f"{name}: Number of terms = {num_columns}")

Porter (word_tokenize): Number of terms = 141
Porter (wordpunct_tokenize): Number of terms = 144
Lancaster (word_tokenize): Number of terms = 138
Lancaster (wordpunct_tokenize): Number of terms = 141
WordNet Lemmatizer (word_tokenize): Number of terms = 143
WordNet Lemmatizer (wordpunct_tokenize): Number of terms = 146


## Task 2

An easy technique for comparing two documents is Jaccard similarity, where $A$ and $B$ are the sets of terms in the two documents:
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Implement Jaccard similarity and a function that finds the document closest to a given query. Test different queries.

In [125]:
def jaccard(d1, d2):
    # Compare documents as sets of whitespace-separated terms
    set1 = set(d1.split())
    set2 = set(d2.split())
    union = set1 | set2
    if not union:  # both documents are empty after preprocessing
        return 0.0
    return len(set1 & set2) / len(union)

def closest(query, documents):
    processed_query = preprocess_doc(query, stemmer=porter)
    processed_docs = [preprocess_doc(doc, stemmer=porter) for doc in documents]
    similarities = [jaccard(processed_query, doc) for doc in processed_docs]
    max_index = similarities.index(max(similarities))
    return documents[max_index], max(similarities)

<a href="https://ibb.co/k4rRpf9"><img src="https://i.ibb.co/GW1KXLt/ir4.jpg" alt="ir4" border="0"></a>

In [126]:
documents

['Machine Learning',
 'Five Advanced Plots in Python - Matplotlib',
 'How to Make your Computer Talk with Python',
 'Anomaly Detection on Servo Drives',
 'Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction',
 'Animated Mathematical Analysis',
 'How to Perform Speech Recognition with Python',
 'Beyond The Semesters: E04',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 'Time Series Forecasting with ThymeBoost',
 'CHAPTER 2: Why I Chose Data Science!',
 'Training Provably-Robust Neural Networks',
 'Time Series Forecasting with ThymeBoost',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 '5 Cute Features of CatBoost',
 'Variance Inflation Factor (VIF) and it’s relationship with multicollinearity&nbsp;.',
 'Beyond The Semesters: E04',
 'Efficient Digital Transformation - Particle Swarm Optimiser',
 'MEASURE OF ASYMMETRY',
 'What is linear regression? A quick cov

In [127]:
queries = [
    "python",
    "plot neural network",
    "plot neural networks",
    "ploting neural networks",
    "data science",
    "5 Cute Features of CatBoost"
]
for q in queries:
    print(f"Query: {q}")
    # print(closest(q, df))

    closest_doc, similarity = closest(q, documents)
    print(f"Closest document: {closest_doc}")
    print(f"Jaccard Similarity: {similarity:.2f}\n")
    

Query: python
Closest document: How to Make your Computer Talk with Python
Jaccard Similarity: 0.25

Query: plot neural network
Closest document: Training Provably-Robust Neural Networks
Jaccard Similarity: 0.50

Query: plot neural networks
Closest document: Training Provably-Robust Neural Networks
Jaccard Similarity: 0.50

Query: ploting neural networks
Closest document: Training Provably-Robust Neural Networks
Jaccard Similarity: 0.50

Query: data science
Closest document: CHAPTER 2: Why I Chose Data Science!
Jaccard Similarity: 0.50

Query: 5 Cute Features of CatBoost
Closest document: 5 Cute Features of CatBoost
Jaccard Similarity: 1.00
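
The 0.50 score for the query "plot neural network" can be checked by hand. The sketch below assumes that preprocessing reduces the query to `plot`, `neural`, `network` and the title "Training Provably-Robust Neural Networks" to `train`, `neural`, `network` (the hyphenated token fails the `isalpha()` filter). These token sets are an assumption for illustration, not printed output of the notebook.

```python
query_terms = {"plot", "neural", "network"}
doc_terms = {"train", "neural", "network"}  # assumed stems of the matched title

shared = query_terms & doc_terms      # terms present in both sets
combined = query_terms | doc_terms    # all distinct terms
score = len(shared) / len(combined)
print(len(shared), len(combined), score)
```

Every term counts equally here, no matter how often it occurs or how rare it is in the corpus, which is the limitation the next task addresses.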



## Task 3

TF-IDF (term frequency–inverse document frequency) is usually a better approach than raw word counts. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

This approach combines two quantities:

TF (term frequency): $tf(t,d)$ is the relative frequency of term $t$ within document $d$. It can be expressed, for example, as the count of $t$ in $d$ divided by the number of terms in $d$, or by the maximum term count in $d$.

IDF (inverse document frequency): a measure of how much information the word provides. If a word appears in every document, it does not provide much information, but if it appears in only two documents, its impact on the similarity between those two documents is higher. The standard way to compute it is the logarithm of the number of documents $N$ divided by the number of documents $n_t$ containing term $t$: $IDF(t) = \log\left(\frac{N}{n_t}\right)$

TF-IDF is then just TF multiplied by IDF: $tfidf(t,d) = tf(t,d) \cdot IDF(t)$
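
The two TF normalizations mentioned above can be compared on a short example. This is only a minimal sketch; the function names `tf_by_length` and `tf_by_max` are hypothetical.

```python
from collections import Counter


def tf_by_length(tokens):
    """Term count divided by the number of terms in the document."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}


def tf_by_max(tokens):
    """Term count divided by the count of the most frequent term."""
    counts = Counter(tokens)
    top = max(counts.values())
    return {t: c / top for t, c in counts.items()}


tokens = "data science data job".split()
print(tf_by_length(tokens))
print(tf_by_max(tokens))
```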


Implement TF-IDF and compare it with sklearn's TfidfVectorizer.

In [128]:
tfidf=TfidfVectorizer(use_idf=True, smooth_idf=False)

dfTFIDF = pd.DataFrame(tfidf.fit_transform(documents).toarray(), index=documents, columns=tfidf.get_feature_names_out())
dfTFIDF

Unnamed: 0,6x,about,achieve,advanced,ai,analysis,and,animated,anomaly,applications,...,vision,vs,way,what,why,will,with,workings,you,your
Machine Learning,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Five Advanced Plots in Python - Matplotlib,0.0,0.0,0.0,0.450495,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
How to Make your Computer Talk with Python,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.249731,0.0,0.0,0.316067
Anomaly Detection on Servo Drives,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.459985,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Animated Mathematical Analysis,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
How to Perform Speech Recognition with Python,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.280999,0.0,0.0,0.0
Beyond The Semesters: E04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"How to improve classification of e-commerce pages, incorporating multiple modalities",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Time Series Forecasting with ThymeBoost,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.32486,0.0,0.0,0.0


In [129]:
pd.Series(tfidf.idf_, index=tfidf.get_feature_names_out()).sort_values()

of           2.491655
to           2.491655
with         2.609438
and          2.897120
how          2.897120
               ...   
integrate    4.688879
interview    4.688879
into         4.688879
need         4.688879
measure      4.688879
Length: 184, dtype: float64
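
The printed values can be reproduced by hand. With `smooth_idf=False`, scikit-learn documents its IDF as $\ln(N / df_t) + 1$, which differs from the plain $\log(N / n_t)$ formula by the added 1. The sketch below uses $N = 40$ and document frequencies of 1, 8 and 9; those counts were tallied from the titles for illustration and are an assumption, not output of the notebook.

```python
import math


def sklearn_unsmoothed_idf(n_docs, doc_freq):
    """IDF as documented for TfidfVectorizer(smooth_idf=False): ln(N / df) + 1."""
    return math.log(n_docs / doc_freq) + 1


# (term, assumed document frequency)
idf_values = {
    term: sklearn_unsmoothed_idf(40, df)
    for term, df in [("measure", 1), ("with", 8), ("of", 9)]
}
for term, value in idf_values.items():
    print(term, round(value, 6))
```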

In [130]:
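query = "how to machine learning"
# convert the query into a TF-IDF vector using the fitted vectorizer and its learned vocabulary
query_vec = tfidf.transform([query]).toarray()[0]
# scipy's cosine() is a distance, so 1 - distance is the similarity; sort most similar first
(1 - dfTFIDF.apply(lambda x: cosine(x, query_vec), axis=1)).sort_values(ascending=False)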

Machine Learning                                                                                    0.763355
Recommender System With Machine Learning and Statistics                                             0.364334
Fastest Way of Deploying Your Machine Learning Models                                               0.328159
How to Perform Speech Recognition with Python                                                       0.265814
3 Lessons Learned in Teaching Machine Learning for Earth Observation Techniques                     0.258540
How to achieve efficient large-batch training?                                                      0.246288
How to Make your Computer Talk with Python                                                          0.236235
How to improve classification of e-commerce pages, incorporating multiple modalities                0.221282
How to improve classification of e-commerce pages, incorporating multiple modalities                0.221282
Exploring Methods o
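
`scipy.spatial.distance.cosine` returns a distance, which is why the cell above subtracts it from 1. A minimal pure-Python sketch of the underlying similarity, with an explicit guard for all-zero vectors (where the cosine is undefined), might look like this. The name `cosine_similarity` is a hypothetical helper here, not the scikit-learn function of the same name.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = u.v / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch; mathematically undefined
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 0, 1], [1, 0, 1]))  # same direction
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # no shared terms
print(cosine_similarity([0, 0, 0], [1, 1, 0]))  # zero vector handled by the guard
```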

----------
my own approach

In [143]:
def compute_tf(doc):
    """
    Compute Term Frequency (TF) for a single document.
    """
    words = doc.split()
    word_count = len(words)
    tf = {word: words.count(word) / word_count for word in words}
    return tf

def compute_idf(corpus):
    """
    Compute Inverse Document Frequency (IDF) for all terms in the corpus.
    """
    num_docs = len(corpus)
    idf = {}
    all_words = set(word for doc in corpus for word in doc.split())
    for word in all_words:
        containing_docs = sum(1 for doc in corpus if word in doc.split())
        idf[word] = np.log((num_docs + 1) / (containing_docs + 1))  # Smoothed IDF; a word present in every document gets 0
    return idf


def compute_tfidf(corpus):
    """
    Compute TF-IDF for the entire corpus.
    """
    idf = compute_idf(corpus)
    tfidf = []
    for doc in corpus:
        tf = compute_tf(doc)
        tfidf.append({word: tf[word] * idf[word] for word in tf})
    return tfidf


# processed_documents = [preprocess_doc(doc, tokenizer=wordpunct_tokenize,stemmer=lancaster) for doc in documents]
processed_documents = [preprocess_doc(doc) for doc in documents]
# Compute TF-IDF (named custom_tfidf so the fitted sklearn `tfidf` vectorizer is not overwritten)
custom_tfidf = compute_tfidf(processed_documents)

# Convert to DataFrame
all_words = set(word for doc in processed_documents for word in doc.split())
tfidf_matrix = pd.DataFrame([{word: doc_tfidf.get(word, 0) for word in all_words} for doc_tfidf in custom_tfidf], index=processed_documents).sort_index(axis=1)

print("Custom TF-IDF Matrix:")
tfidf_matrix


Custom TF-IDF Matrix:


Unnamed: 0,achieve,advanced,ai,analysis,animated,anomaly,applications,approach,asked,asymmetry,...,uber,understanding,us,variance,ventilator,vif,vision,vs,way,workings
machine learning,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
five advanced plots python matplotlib,0.0,0.604085,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
make computer talk python,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
anomaly detection servo drives,0.0,0.0,0.0,0.0,0.0,0.755106,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
key takeaways kaggle recent time series competition ventilator pressure prediction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.302042,0.0,0.0,0.0,0.0,0.0
animated mathematical analysis,0.0,0.0,0.0,1.006808,1.006808,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
perform speech recognition python,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
beyond semesters,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
improve classification pages incorporating multiple modalities,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
time series forecasting thymeboost,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Task 4
Create a search engine based on TFIDF

In [144]:
def search(query, df):
    processed_query = preprocess_doc(query)
    query_tfidf = compute_tfidf([processed_query])[0]
    query_vector = np.array([query_tfidf.get(term, 0) for term in df.columns])
    similarity_scores = [1 - cosine(doc_vector, query_vector) for doc_vector in df.values]    
    ranked_results = pd.Series(similarity_scores, index=df.index).sort_values(ascending=False)
    return ranked_results
    
query = "how to machine learning"
results = search(query, tfidf_matrix)
print("Ranked Results:")
results

Ranked Results:


  dist = 1.0 - uv / math.sqrt(uu * vv)


machine learning                                                                     NaN
five advanced plots python matplotlib                                                NaN
make computer talk python                                                            NaN
anomaly detection servo drives                                                       NaN
key takeaways kaggle recent time series competition ventilator pressure prediction   NaN
animated mathematical analysis                                                       NaN
perform speech recognition python                                                    NaN
beyond semesters                                                                     NaN
improve classification pages incorporating multiple modalities                       NaN
time series forecasting thymeboost                                                   NaN
chapter chose data science                                                           NaN
training neural netwo

In [145]:
def search(query, df):
    """
    Perform a search using TF-IDF and cosine similarity.

    Parameters:
    - query: str, the user's search query.
    - df: pd.DataFrame, the TF-IDF matrix with terms as columns and documents as rows.

    Returns:
    - pd.Series: Ranked documents with their similarity scores.
    """
    # Preprocess the query
    processed_query = preprocess_doc(query)
    
    # Compute TF-IDF for the query
    query_tfidf = compute_tfidf([processed_query])[0]
    
    # Align the query vector with the TF-IDF matrix columns
    query_vector = np.array([query_tfidf.get(term, 0) for term in df.columns])
    print(query_vector)
    
    missing_terms = [term for term in processed_query.split() if term not in df.columns]
    if missing_terms:
        print(f"Warning: These query terms are not in the TF-IDF vocabulary: {missing_terms}")

    # Ensure the query vector is non-zero
    if np.linalg.norm(query_vector) == 0:
        return pd.Series([0] * len(df), index=df.index).sort_values(ascending=False)
    
    # Compute cosine similarity, handling zero document vectors
    similarity_scores = [
        1 - cosine(doc_vector, query_vector) if np.linalg.norm(doc_vector) != 0 else 0
        for doc_vector in df.values
    ]
    
    # Rank documents by similarity
    ranked_results = pd.Series(similarity_scores, index=df.index).sort_values(ascending=False)
    return ranked_results


query = "how to machine learning"
results = search(query, tfidf_matrix)
# print("Ranked Results:")
# results


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


In [146]:
def search(query, df):
    """
    Perform a search using TF-IDF and cosine similarity.

    Parameters:
    - query: str, the user's search query.
    - df: pd.DataFrame, the TF-IDF matrix with terms as columns and documents as rows.

    Returns:
    - pd.Series: Ranked documents with their similarity scores.
    """
    # Preprocess the query
    processed_query = preprocess_doc(query)
    print(f"Processed Query: {processed_query}")  # Debugging
    
    # Compute TF-IDF for the query
    query_tfidf = compute_tfidf([processed_query])[0]
    print(f"Query TF-IDF: {query_tfidf}")  # Debugging
    
    # Align the query vector with the TF-IDF matrix columns
    query_vector = np.array([query_tfidf.get(term, 0) for term in df.columns])
    print(f"Query Vector: {query_vector}")  # Debugging
    
    # Check for missing terms in the vocabulary
    missing_terms = [term for term in processed_query.split() if term not in df.columns]
    if missing_terms:
        print(f"Warning: These query terms are not in the TF-IDF vocabulary: {missing_terms}")
    
    # Ensure the query vector is non-zero
    if np.linalg.norm(query_vector) == 0:
        print("Warning: Query vector is all zeros.")
        return pd.Series([0] * len(df), index=df.index).sort_values(ascending=False)
    
    # Compute cosine similarity
    similarity_scores = [
        1 - cosine(doc_vector, query_vector) if np.linalg.norm(doc_vector) != 0 else 0
        for doc_vector in df.values
    ]
    
    # Rank documents by similarity
    ranked_results = pd.Series(similarity_scores, index=df.index).sort_values(ascending=False)
    return ranked_results


In [147]:
query = "how to machine learning"
results = search(query, tfidf_matrix)
print("Ranked Results:")
results

Processed Query: machine learning
Query TF-IDF: {'machine': 0.0, 'learning': 0.0}
Query Vector: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
Ranked Results:


machine learning                                                                      0
five advanced plots python matplotlib                                                 0
line python code speed ai                                                             0
serious data science job must know things                                             0
recommender system machine learning statistics                                        0
bias detection mitigation ibm autoai                                                  0
data engineering create dataset                                                       0
graph neural networks generalizable models neuroscience                               0
fastest way deploying machine learning models                                         0
novel approach integrate speech recognition authentication systems                    0
lessons learned teaching machine learning earth observation techniques                0
vision transformer galaxy morpho
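
The all-zero query vector is a consequence of how the query is weighted: `compute_tfidf([processed_query])` builds its IDF from a one-document corpus, so every query term gets $\log((1+1)/(1+1)) = 0$. One possible fix, sketched below under the assumption that the IDF should come from the document corpus, is to weight the query's term frequencies with the corpus IDF. The helper names (`term_freq`, `corpus_idf`, `query_vector`) and the toy corpus are hypothetical; the smoothing mirrors `compute_idf` above.

```python
import math


def term_freq(doc):
    """Relative frequency of each word in a whitespace-separated document."""
    words = doc.split()
    return {w: words.count(w) / len(words) for w in words}


def corpus_idf(corpus):
    """Same smoothed IDF as compute_idf: log((N + 1) / (n_t + 1))."""
    n = len(corpus)
    vocab = {w for d in corpus for w in d.split()}
    return {
        w: math.log((n + 1) / (sum(w in d.split() for d in corpus) + 1))
        for w in vocab
    }


def query_vector(query, idf):
    """Weight query term frequencies with a given IDF; unknown terms are skipped."""
    return {w: tf * idf[w] for w, tf in term_freq(query).items() if w in idf}


corpus = ["machine learning", "time series forecasting", "machine vision"]

# IDF learned from the query alone collapses every weight to zero...
broken = query_vector("machine learning", corpus_idf(["machine learning"]))
# ...while IDF learned from the corpus keeps informative weights.
fixed = query_vector("machine learning", corpus_idf(corpus))
print(broken)
print(fixed)
```

In the notebook's `search`, this would mean computing `compute_idf(processed_documents)` once and reusing it for every query instead of calling `compute_tfidf` on the query alone.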

## Task 5
Create a search engine that ranks documents based on a history containing more than one document.

In [148]:
def search(history, df):
    pass
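
One way to approach this task, offered here only as a sketch, is to treat the history as a single pseudo-query: average the TF-IDF vectors of the documents in the history and rank the remaining documents by cosine similarity to that centroid. The version below works on plain dictionaries rather than the DataFrame used above. The names `cosine_sim` and `search_by_history`, the toy vectors, and the choice of averaging are all assumptions, not the notebook's solution.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def search_by_history(history, vectors):
    """Rank documents not in `history` by similarity to the history centroid."""
    centroid = {}
    for title in history:
        for term, weight in vectors[title].items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(history)
    scores = {
        title: cosine_sim(centroid, vec)
        for title, vec in vectors.items()
        if title not in history
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy TF-IDF-like vectors (made-up weights, for illustration only)
vectors = {
    "machine learning": {"machine": 0.6, "learning": 0.8},
    "deep learning models": {"deep": 0.7, "learning": 0.5, "models": 0.5},
    "time series forecasting": {"time": 0.6, "series": 0.6, "forecasting": 0.5},
    "machine learning models": {"machine": 0.5, "learning": 0.5, "models": 0.7},
}
ranking = search_by_history(["machine learning", "deep learning models"], vectors)
print(ranking)
```

Averaging gives every history item the same influence; weighting recent items more heavily would be an equally reasonable design choice.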