In [63]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from scipy.spatial.distance import cosine

Today we will focus on finding similarities between documents by comparing their content. The same techniques apply to a search-engine query: we treat the query as just another document, compute its similarity to every document in the collection, and return the most similar ones.

In [64]:
documents = ['Machine Learning',
 'Five Advanced Plots in Python - Matplotlib',
 'How to Make your Computer Talk with Python',
 'Anomaly Detection on Servo Drives',
 'Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction',
 'Animated Mathematical Analysis',
 'How to Perform Speech Recognition with Python',
 'Beyond The Semesters: E04',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 'Time Series Forecasting with ThymeBoost',
 'CHAPTER 2: Why I Chose Data Science!',
 'Training Provably-Robust Neural Networks',
 'Time Series Forecasting with ThymeBoost',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 '5 Cute Features of CatBoost',
 'Variance Inflation Factor (VIF) and it’s relationship with multicollinearity&nbsp;.',
 'Beyond The Semesters: E04',
 'Efficient Digital Transformation - Particle Swarm Optimiser',
 'MEASURE OF ASYMMETRY',
 'What is linear regression? A quick cover with a tutorial',
 'Correlation VS Covariance: The easy way',
 'Are Recommender System harming us?',
 '1 Line of Python Code That Will Speed Up Your AI by Up to 6x',
 'If You Are Serious About Data Science Job. You Must Know These 3 Things.',
 'Recommender System With Machine Learning and Statistics',
 'Bias detection and mitigation in IBM AutoAI',
 'Data Engineering: Create your own Dataset',
 'Graph Neural Networks and Generalizable Models in Neuroscience',
 'Fastest Way of Deploying Your Machine Learning Models',
 'A Novel Approach to Integrate Speech Recognition into Authentication Systems',
 '3 Lessons Learned in Teaching Machine Learning for Earth Observation Techniques',
 'Vision Transformer in Galaxy Morphology Classification',
 'Exploring Methods of Deep Reinforcement Learning with NLP Applications',
 '6 Essential Tips to Solve Data Science Projects',
 'Data Science Interview Questions My Friends and I got asked recently (III)',
 'Understanding Uber’s Generative Teaching Networks',
 'How to achieve efficient large-batch training?',
 'How Parallelization and Large Batch Size Improve the Performance of Deep Neural Networks.',
 'Why You Need to Know the Inner Workings of Models',
 'Let’s Build A Simple Object Classification Task I']

In [65]:
CountVec = CountVectorizer(ngram_range=(1,1), stop_words='english')

'''
CountVectorizer converts a collection of docs into a document-term matrix
row -> document
column -> term (an n-gram); here ngram_range is (1,1), so each term is a single word

ngram_range=(1,1) -> single words, i.e. unigrams
ngram_range=(1,2) -> single words and pairs of words, i.e. unigrams and bigrams
ngram_range=(2,2) -> only bigrams
GENERALLY:
ngram_range=(n,m) -> n is the minimum and m the maximum number of words in an n-gram,
so if n==m only n-grams of size n are considered

stop_words='english' -> removes common English words like 'the', 'is', 'and' etc.
'''

CountData = CountVec.fit_transform(documents)

'''
fit_transform(documents)
fit -> learn the vocabulary from the input docs: scan them, extract tokens after preprocessing
(lowercasing and removing stop words here) and assign a column index to each unique token
transform -> convert the input docs into a document-term matrix of token counts
'''

CountData

<40x150 sparse matrix of type '<class 'numpy.int64'>'
	with 204 stored elements in Compressed Sparse Row format>

The most basic way to represent a document is by word counts: for each document we store how many times each word appears. These counts could be kept in a dense array, but that would be wasteful because almost all entries are 0; here only 204 of the 40 × 150 = 6000 cells are non-zero. That is why `CountVectorizer` returns a sparse matrix, which we can expand into a dense array when we want to inspect it.

In [66]:
df=pd.DataFrame(CountData.toarray(), columns=CountVec.get_feature_names_out(), index=documents)
df

Unnamed: 0,6x,achieve,advanced,ai,analysis,animated,anomaly,applications,approach,asked,...,tutorial,uber,understanding,variance,ventilator,vif,vision,vs,way,workings
Machine Learning,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Five Advanced Plots in Python - Matplotlib,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
How to Make your Computer Talk with Python,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Anomaly Detection on Servo Drives,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
Animated Mathematical Analysis,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
How to Perform Speech Recognition with Python,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Beyond The Semesters: E04,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"How to improve classification of e-commerce pages, incorporating multiple modalities",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Time Series Forecasting with ThymeBoost,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Task 1
We can shrink the matrix, drop uninformative words, and improve the quality of the comparison by preprocessing the documents first.
Check the size of the matrix after stemming or lemmatization combined with stop-word removal.

In [67]:
import nltk

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


from nltk.tokenize import word_tokenize, wordpunct_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmart\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [68]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
wordNet = WordNetLemmatizer()
stopWords = set(stopwords.words('english'))

# def wordPunctTokens(doc):
#     return wordpunct_tokenize(doc)


# def wordTokens(doc):
#     return word_tokenize(doc)

# def porterStem(tokens):
#     return [porter.stem(token) for token in tokens]


def preprocess_doc(doc, tokenizer=word_tokenize, stemmer=None, lemmatizer=None, use_lemmatizer=False):
    """Lowercase, tokenize, filter, and stem or lemmatize a document; return a space-joined string."""
    tokens = tokenizer(doc.lower())  # lowercase the doc, then tokenize it
    # keep purely alphabetic tokens that are not stop words (drops numbers, punctuation, and hyphenated tokens)
    terms = [word for word in tokens if word.isalpha() and word not in stopWords]
    if use_lemmatizer and lemmatizer:
        processed = [lemmatizer.lemmatize(word) for word in terms]
    elif stemmer:
        processed = [stemmer.stem(word) for word in terms]  # reduce each word to its stem
    else:
        raise ValueError("You must provide a stemmer or lemmatizer")

    return ' '.join(processed)


In [69]:
processed_docs_porter = [preprocess_doc(doc, stemmer=porter) for doc in documents]
processed_docs_porter_wordpunct = [preprocess_doc(doc, tokenizer=wordpunct_tokenize, stemmer=porter) for doc in documents]

processed_docs_lancaster = [preprocess_doc(doc, stemmer=lancaster) for doc in documents]
processed_docs_lancaster_wordpunct = [preprocess_doc(doc, tokenizer=wordpunct_tokenize, stemmer=lancaster) for doc in documents]

processed_docs_wordnet = [preprocess_doc(doc, lemmatizer=wordNet, use_lemmatizer=True) for doc in documents]
processed_docs_wordnet_wordpunct = [preprocess_doc(doc, tokenizer=wordpunct_tokenize, lemmatizer=wordNet, use_lemmatizer=True) for doc in documents]


processed_docs = {
    "Porter (word_tokenize)": processed_docs_porter,
    "Porter (wordpunct_tokenize)": processed_docs_porter_wordpunct,
    "Lancaster (word_tokenize)": processed_docs_lancaster,
    "Lancaster (wordpunct_tokenize)": processed_docs_lancaster_wordpunct,
    "WordNet Lemmatizer (word_tokenize)": processed_docs_wordnet,
    "WordNet Lemmatizer (wordpunct_tokenize)": processed_docs_wordnet_wordpunct,
}

In [70]:
for name, processed_doc in processed_docs.items():
    CountVec = CountVectorizer(ngram_range=(1,1), stop_words='english')
    CountData = CountVec.fit_transform(processed_doc)
    num_columns = len(CountVec.get_feature_names_out())
    print(f"{name}: Number of terms = {num_columns}")

Porter (word_tokenize): Number of terms = 141
Porter (wordpunct_tokenize): Number of terms = 144
Lancaster (word_tokenize): Number of terms = 138
Lancaster (wordpunct_tokenize): Number of terms = 141
WordNet Lemmatizer (word_tokenize): Number of terms = 143
WordNet Lemmatizer (wordpunct_tokenize): Number of terms = 146
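
The `wordpunct_tokenize` variants consistently yield a few more terms. One likely contributor is the `word.isalpha()` filter in `preprocess_doc`: a tokenizer that keeps a hyphenated word such as `e-commerce` as one token hands the filter a non-alphabetic string, which is discarded, while a tokenizer that splits on punctuation lets the alphabetic parts through. The sketch below illustrates this in plain Python. Whitespace splitting stands in for a tokenizer that keeps hyphenated words whole, and the regular expression is an assumed approximation of punctuation-based splitting, not NLTK's own code.

```python
import re

title = "how to improve classification of e-commerce pages"

# Stand-in for a tokenizer that keeps hyphenated words as one token
whole_tokens = title.split()

# Stand-in for a tokenizer that separates runs of word characters from punctuation
split_tokens = re.findall(r"\w+|[^\w\s]+", title)

# Same filter as in preprocess_doc: keep purely alphabetic tokens only
kept_whole = [t for t in whole_tokens if t.isalpha()]
kept_split = [t for t in split_tokens if t.isalpha()]

print(kept_whole)  # "e-commerce" is dropped as a whole
print(kept_split)  # "e" and "commerce" survive as separate tokens
```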


## Task 2

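A simple technique for comparing two documents is the Jaccard similarity: the size of the intersection of their term sets divided by the size of their union.
$J(A,B)={\frac {|A\cap B|}{|A\cup B|}}$

Implement Jaccard similarity and a function that finds the document closest to a given query. Test it with different queries.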

In [73]:
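def jaccard(d1, d2):
    """Jaccard similarity between the term sets of two whitespace-separated strings."""
    set1 = set(d1.split())
    set2 = set(d2.split())
    union = set1 | set2
    if not union:  # both inputs are empty after preprocessing; avoid dividing by zero
        return 0.0
    return len(set1 & set2) / len(union)

def closest(query, documents):
    """Return the document most similar to the query and its Jaccard similarity."""
    processed_query = preprocess_doc(query, stemmer=porter)
    processed_docs = [preprocess_doc(doc, stemmer=porter) for doc in documents]
    similarities = [jaccard(processed_query, doc) for doc in processed_docs]
    max_index = similarities.index(max(similarities))  # ties resolve to the first document
    return documents[max_index], similarities[max_index]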

<a href="https://ibb.co/k4rRpf9"><img src="https://i.ibb.co/GW1KXLt/ir4.jpg" alt="ir4" border="0"></a>

In [74]:
documents

['Machine Learning',
 'Five Advanced Plots in Python - Matplotlib',
 'How to Make your Computer Talk with Python',
 'Anomaly Detection on Servo Drives',
 'Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction',
 'Animated Mathematical Analysis',
 'How to Perform Speech Recognition with Python',
 'Beyond The Semesters: E04',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 'Time Series Forecasting with ThymeBoost',
 'CHAPTER 2: Why I Chose Data Science!',
 'Training Provably-Robust Neural Networks',
 'Time Series Forecasting with ThymeBoost',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 '5 Cute Features of CatBoost',
 'Variance Inflation Factor (VIF) and it’s relationship with multicollinearity&nbsp;.',
 'Beyond The Semesters: E04',
 'Efficient Digital Transformation - Particle Swarm Optimiser',
 'MEASURE OF ASYMMETRY',
 'What is linear regression? A quick cov

In [76]:
queries = [
    "python",
    "plot neural network",
    "plot neural networks",
    "ploting neural networks",
    "data science",
    "5 Cute Features of CatBoost"
]
for q in queries:
    print(f"Query: {q}")
    # print(closest(q, df))

    closest_doc, similarity = closest(q, documents)
    print(f"Closest document: {closest_doc}")
    print(f"Jaccard Similarity: {similarity:.2f}\n")
    

Query: python
Closest document: How to Make your Computer Talk with Python
Jaccard Similarity: 0.25

Query: plot neural network
Closest document: Training Provably-Robust Neural Networks
Jaccard Similarity: 0.50

Query: plot neural networks
Closest document: Training Provably-Robust Neural Networks
Jaccard Similarity: 0.50

Query: ploting neural networks
Closest document: Training Provably-Robust Neural Networks
Jaccard Similarity: 0.50

Query: data science
Closest document: CHAPTER 2: Why I Chose Data Science!
Jaccard Similarity: 0.50

Query: 5 Cute Features of CatBoost
Closest document: 5 Cute Features of CatBoost
Jaccard Similarity: 1.00
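
The score of 0.50 for the query `plot neural network` can be checked by hand. The sketch below assumes the preprocessing leaves the query with the stems `plot`, `neural`, `network` and the title `Training Provably-Robust Neural Networks` with `train`, `neural`, `network` (the hyphenated token being dropped by the `isalpha()` filter). These term sets are written out manually rather than produced by NLTK.

```python
# Assumed term sets after preprocessing (written by hand for illustration)
query_terms = {"plot", "neural", "network"}
doc_terms = {"train", "neural", "network"}

intersection = query_terms & doc_terms   # terms shared by query and title
union = query_terms | doc_terms          # all distinct terms from both

similarity = len(intersection) / len(union)
print(f"{len(intersection)} / {len(union)} = {similarity:.2f}")
```

The same arithmetic explains why the three `plot...` queries score identically against this title: whatever stem the first query term is reduced to, it does not occur in the title, so each case has 2 shared terms out of 4.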



## Task 3

TF-IDF (term frequency–inverse document frequency) is usually a better approach than comparing raw term sets. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.

The score combines two components:

TF (term frequency) - $tf(t,d)$ is the relative frequency of term $t$ within document $d$. It can be computed, for example, as the count of $t$ in $d$ divided by the total number of terms in $d$, or divided by the count of the most frequent term in $d$.

IDF (inverse document frequency) - a measure of how much information the word provides. A word that appears in every document provides little information, whereas a word that appears in only two documents has a much larger impact on the similarity between those two documents. The standard way to compute it is the logarithm of the number of documents $N$ divided by the number of documents $n_t$ that contain the term: $idf(t) = \log\left(\frac{N}{n_t}\right)$

TF-IDF is then simply TF multiplied by IDF: $tfidf(t,d) = tf(t,d) \cdot idf(t)$
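
As a minimal sketch of these definitions, the code below computes TF as the term count divided by the document length and IDF as $\log(N / n_t)$ with the natural logarithm. The three-document corpus and the helper names `tf`, `idf`, and `tfidf` are illustrative assumptions, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

# Made-up corpus, used only for illustration
toy_docs = [
    "machine learning with python",
    "deep learning with python",
    "time series forecasting",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

def tf(term, tokens):
    """Relative frequency of term within one tokenized document."""
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    """log(N / n_t); assumes the term occurs in at least one document."""
    n_t = sum(term in tokens for tokens in tokenized)
    return math.log(n_docs / n_t)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

for term in ["machine", "learning"]:
    print(term, round(tfidf(term, tokenized[0]), 4))
```

In the first document, `machine` (found in one document) receives a higher weight than `learning` (found in two), although both occur exactly once there. With this textbook formula a term that occurs in every document gets an IDF of 0.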


Implement TF-IDF and compare it with sklearn's `TfidfVectorizer`.

In [77]:
tfidf=TfidfVectorizer(use_idf=True, smooth_idf=False)

dfTFIDF = pd.DataFrame(tfidf.fit_transform(documents).toarray(), index=documents, columns=tfidf.get_feature_names_out())
dfTFIDF

Unnamed: 0,6x,about,achieve,advanced,ai,analysis,and,animated,anomaly,applications,...,vision,vs,way,what,why,will,with,workings,you,your
Machine Learning,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Five Advanced Plots in Python - Matplotlib,0.0,0.0,0.0,0.450495,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
How to Make your Computer Talk with Python,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.249731,0.0,0.0,0.316067
Anomaly Detection on Servo Drives,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.459985,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Animated Mathematical Analysis,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
How to Perform Speech Recognition with Python,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.280999,0.0,0.0,0.0
Beyond The Semesters: E04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"How to improve classification of e-commerce pages, incorporating multiple modalities",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Time Series Forecasting with ThymeBoost,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.32486,0.0,0.0,0.0


In [78]:
pd.Series(tfidf.idf_, index=tfidf.get_feature_names_out()).sort_values()

of           2.491655
to           2.491655
with         2.609438
and          2.897120
how          2.897120
               ...   
integrate    4.688879
interview    4.688879
into         4.688879
need         4.688879
measure      4.688879
Length: 184, dtype: float64
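
The values above do not match the plain $\log(N/n_t)$ formula because scikit-learn's default weighting differs in two ways: with `smooth_idf=False` it uses $\ln(N/n_t) + 1$ as the IDF, and it rescales each document vector to unit Euclidean length. This is consistent with the largest IDF shown, $\ln(40) + 1 \approx 4.688879$, for a term that occurs in one of the 40 titles. The sketch below reproduces `TfidfVectorizer` from raw counts on a made-up corpus (`toy_docs` is an assumption for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus, used only for illustration
toy_docs = [
    "machine learning with python",
    "deep learning with python",
    "time series forecasting",
]

# Raw term counts: rows are documents, columns are terms
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

doc_freq = (counts > 0).sum(axis=0)          # n_t: documents containing each term
idf = np.log(n_docs / doc_freq) + 1.0        # scikit-learn's IDF when smooth_idf=False
weighted = counts * idf                      # raw count times IDF
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

reference = TfidfVectorizer(smooth_idf=False).fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```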

In [81]:
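query = "how to machine learning"
query_vec = tfidf.transform([query]).toarray()[0]  # reuse the fitted vocabulary and IDF weights
# cosine() returns a distance, so similarity = 1 - distance
similarities = 1 - dfTFIDF.apply(lambda row: cosine(row, query_vec), axis=1)
similarities.sort_values(ascending=False)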

Machine Learning                                                                                    0.763355
Recommender System With Machine Learning and Statistics                                             0.364334
Fastest Way of Deploying Your Machine Learning Models                                               0.328159
How to Perform Speech Recognition with Python                                                       0.265814
3 Lessons Learned in Teaching Machine Learning for Earth Observation Techniques                     0.258540
How to achieve efficient large-batch training?                                                      0.246288
How to Make your Computer Talk with Python                                                          0.236235
How to improve classification of e-commerce pages, incorporating multiple modalities                0.221282
How to improve classification of e-commerce pages, incorporating multiple modalities                0.221282
Exploring Methods o
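
`scipy.spatial.distance.cosine` returns the cosine distance, which is why the cell above subtracts it from 1 to obtain a similarity. The sketch below computes the similarity directly with NumPy. The function name `cosine_similarity_vec` and the choice to return 0.0 for a zero vector (where the cosine is undefined) are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_vec(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # convention chosen here: no direction means no similarity
    return float(np.dot(u, v) / denom)

same_direction = cosine_similarity_vec([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine_similarity_vec([1, 0, 0], [0, 3, 0])      # no shared terms
empty_query = cosine_similarity_vec([0, 0, 0], [1, 1, 1])     # query with no known terms
print(same_direction, orthogonal, empty_query)
```

Because the measure depends only on direction, a short query and a long title can still score highly as long as their weight is concentrated on the same terms.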

## Task 4
Create a search engine based on TFIDF

In [None]:
def search(query, df):
    pass
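
One possible way to approach this task is sketched below. It is not the intended solution: the function name `search_tfidf`, its parameters, and the toy titles are assumptions, and it takes the fitted vectorizer and document matrix explicitly instead of a DataFrame. Because `TfidfVectorizer` normalizes every row to unit length by default, the dot product between the query vector and each document row equals their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def search_tfidf(query, vectorizer, doc_matrix, titles, top_k=3):
    """Return up to top_k (title, score) pairs with a positive score, best first."""
    query_vec = vectorizer.transform([query])               # 1 x n_terms sparse row
    scores = (doc_matrix @ query_vec.T).toarray().ravel()   # one cosine similarity per document
    order = np.argsort(-scores)[:top_k]                     # indices of the highest scores
    return [(titles[i], float(scores[i])) for i in order if scores[i] > 0]

# Made-up mini corpus, used only for illustration
toy_titles = [
    "How to Perform Speech Recognition with Python",
    "Time Series Forecasting with ThymeBoost",
    "Recommender System With Machine Learning and Statistics",
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

results = search_tfidf("speech recognition in python", toy_vectorizer, toy_matrix, toy_titles)
no_match = search_tfidf("quantum chemistry", toy_vectorizer, toy_matrix, toy_titles)
print(results)
print(no_match)
```

Documents with a score of 0 are filtered out, so a query that shares no vocabulary with the corpus returns an empty list instead of an arbitrary ranking.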

## Task 5
Create a search engine that ranks documents against a history containing more than one document

In [None]:
def search(history, df):
    pass
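
A possible reading of this task, stated here as an assumption, is that `history` is a list of titles the user has already opened and the engine should recommend unseen titles similar to them. The sketch below averages the TF-IDF vectors of the history into a single profile vector and ranks the remaining documents by cosine similarity to it. The name `search_by_history`, its parameters, and the toy titles are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search_by_history(history, vectorizer, doc_matrix, titles, top_k=3):
    """Rank unseen titles by cosine similarity to the mean TF-IDF vector of the history."""
    history_matrix = vectorizer.transform(history)
    # Average the history rows into one dense 1 x n_terms profile vector
    profile = np.asarray(history_matrix.mean(axis=0)).reshape(1, -1)
    scores = cosine_similarity(profile, doc_matrix).ravel()
    seen = set(history)
    ranked = [(titles[i], float(scores[i])) for i in np.argsort(-scores)
              if titles[i] not in seen and scores[i] > 0]
    return ranked[:top_k]

# Made-up mini corpus, used only for illustration
toy_titles = [
    "Time Series Forecasting with ThymeBoost",
    "Key takeaways from a time series competition",
    "How to Perform Speech Recognition with Python",
    "Speech Recognition in Authentication Systems",
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

history = [toy_titles[0], toy_titles[2]]
recommendations = search_by_history(history, toy_vectorizer, toy_matrix, toy_titles)
print(recommendations)
```

Averaging treats every history item equally. Weighting recent items more heavily, or taking the best score per history item instead of the mean, would be alternative designs.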