In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from scipy.spatial.distance import cosine

Today we will focus on finding similarities between documents by comparing their content. The same techniques apply to queries in a search engine: we treat the query as just another document, calculate its similarity to every document in the collection, and return the most similar ones.

In [None]:
documents = ['Machine Learning',
 'Five Advanced Plots in Python - Matplotlib',
 'How to Make your Computer Talk with Python',
 'Anomaly Detection on Servo Drives',
 'Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction',
 'Animated Mathematical Analysis',
 'How to Perform Speech Recognition with Python',
 'Beyond The Semesters: E04',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 'Time Series Forecasting with ThymeBoost',
 'CHAPTER 2: Why I Chose Data Science!',
 'Training Provably-Robust Neural Networks',
 'Time Series Forecasting with ThymeBoost',
 'How to improve classification of e-commerce pages, incorporating multiple modalities',
 '5 Cute Features of CatBoost',
 'Variance Inflation Factor (VIF) and it’s relationship with multicollinearity.',
 'Beyond The Semesters: E04',
 'Efficient Digital Transformation - Particle Swarm Optimiser',
 'MEASURE OF ASYMMETRY',
 'What is linear regression? A quick cover with a tutorial',
 'Correlation VS Covariance: The easy way',
 'Are Recommender System harming us?',
 '1 Line of Python Code That Will Speed Up Your AI by Up to 6x',
 'If You Are Serious About Data Science Job. You Must Know These 3 Things.',
 'Recommender System With Machine Learning and Statistics',
 'Bias detection and mitigation in IBM AutoAI',
 'Data Engineering: Create your own Dataset',
 'Graph Neural Networks and Generalizable Models in Neuroscience',
 'Fastest Way of Deploying Your Machine Learning Models',
 'A Novel Approach to Integrate Speech Recognition into Authentication Systems',
 '3 Lessons Learned in Teaching Machine Learning for Earth Observation Techniques',
 'Vision Transformer in Galaxy Morphology Classification',
 'Exploring Methods of Deep Reinforcement Learning with NLP Applications',
 '6 Essential Tips to Solve Data Science Projects',
 'Data Science Interview Questions My Friends and I got asked recently (III)',
 'Understanding Uber’s Generative Teaching Networks',
 'How to achieve efficient large-batch training?',
 'How Parallelization and Large Batch Size Improve the Performance of Deep Neural Networks.',
 'Why You Need to Know the Inner Workings of Models',
 'Let’s Build A Simple Object Classification Task I']

In [None]:
CountVec = CountVectorizer(ngram_range=(1,1), stop_words='english')
CountData = CountVec.fit_transform(documents)

CountData

The most basic way of storing information about documents is the word count: for each document we store how many times each word appears. This could be kept in a dense array, but that is wasteful because most entries are 0. That is why `CountVectorizer` returns a sparse matrix, which we can still expand into a full table when needed.

In [None]:
df=pd.DataFrame(CountData.toarray(), columns=CountVec.get_feature_names_out(), index=documents)
df

## Task 1
We can reduce the size of the array, get rid of unnecessary words, and improve the quality of the comparison by preprocessing the documents first.
Check the array size after stemming/lemmatization and after removing stop words.
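
One possible way to plug preprocessing into `CountVectorizer` is a custom analyzer. The sketch below is only a starting point: `crude_stem` is a hypothetical toy suffix stripper standing in for a real stemmer or lemmatizer (for example one from NLTK or spaCy), and `toy_docs` is a made-up corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration
toy_docs = [
    "Graph Neural Networks",
    "Training Neural Network models",
    "Plotting plots in Python",
]

def crude_stem(token):
    """Toy suffix stripper (hypothetical helper), NOT a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Reuse sklearn's own lowercasing, tokenization and stop-word removal
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(doc):
    # Apply the toy stemmer to every token that survives stop-word removal
    return [crude_stem(tok) for tok in base_analyzer(doc)]

plain = CountVectorizer(stop_words="english").fit(toy_docs)
stemmed = CountVectorizer(analyzer=stemmed_analyzer).fit(toy_docs)

print(len(plain.vocabulary_), "terms without stemming")
print(len(stemmed.vocabulary_), "terms with crude stemming")
```

Swapping `crude_stem` for a proper stemmer keeps the rest of the code unchanged, which makes it easy to compare array sizes for different preprocessing choices.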

## Task 2

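A simple technique for comparing two documents is the Jaccard similarity: the number of terms the two documents share divided by the number of distinct terms that appear in either of them, where $A$ and $B$ are the sets of terms in each document.
$J={\frac {|A\cap B|}{|A\cup B|}}.$

Implement the Jaccard similarity and a function that finds the document closest to a given query. Test it with different queries.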

In [None]:
def jaccard(d1, d2):
    pass

def closest(query, df):
    pass
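
For orientation, here is a minimal sketch of the Jaccard formula on plain Python sets. It is not the intended solution for the stubs above: the helper names `tokenize`, `jaccard_similarity` and `closest_by_jaccard` are hypothetical, and it splits on whitespace instead of using the count matrix `df`.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return set(text.lower().split())

def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def closest_by_jaccard(query, docs):
    """Return the document whose token set is most similar to the query."""
    query_tokens = tokenize(query)
    return max(docs, key=lambda d: jaccard_similarity(query_tokens, tokenize(d)))

sample_docs = ["Five Advanced Plots in Python", "Machine Learning"]
print(closest_by_jaccard("python plots", sample_docs))
```

Because the comparison is on exact tokens, a query such as "plot" would not match "plots" here, which is one reason the preprocessing from Task 1 matters.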



<a href="https://ibb.co/k4rRpf9"><img src="https://i.ibb.co/GW1KXLt/ir4.jpg" alt="ir4" border="0"></a>

In [None]:
queries = [
    "python",
    "plot neural network",
    "plot neural networks",
    "ploting neural networks",
    "data science",
]
for q in queries:
    print(q)
    print(closest(q, df))

## Task 3

TF-IDF (term frequency–inverse document frequency) is a much better approach. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.

It combines two components:

TF (term frequency), $tf(t,d)$, is the relative frequency of term $t$ within document $d$. It can be expressed, for example, as the count of $t$ in $d$ divided by the total number of terms in $d$, or divided by the maximum term count in $d$.

IDF (inverse document frequency) is a measure of how much information the word provides. A word that appears in every document does not provide much information, whereas a word that appears in only two documents has a much higher impact on the similarity between those two documents. The standard way to compute it is the logarithm of the number of documents $N$ divided by the number of documents $n_t$ containing the term: $IDF(t) = \log(\frac{N}{n_t})$

TF-IDF is then simply TF multiplied by IDF.


Implement TF-IDF and compare it with sklearn's TfidfVectorizer.
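
When comparing results, keep in mind that sklearn does not use exactly the textbook formula: with `smooth_idf=False` it computes the IDF as $\ln(N/n_t) + 1$, uses raw counts as TF, and by default L2-normalizes every row. The sketch below, on a made-up `toy_docs` corpus, computes both the textbook variant and an unnormalized sklearn-style variant so the difference is visible. It is one possible layout, not a reference implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus, used only for illustration
toy_docs = ["machine learning with python", "deep learning", "python plots"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # n_t: number of documents containing each term

# Textbook variant: relative term frequency times log(N / n_t)
tf = counts / counts.sum(axis=1, keepdims=True)
idf = np.log(n_docs / doc_freq)
textbook_tfidf = tf * idf

# sklearn-style variant (smooth_idf=False, no normalization): raw count times (ln(N / n_t) + 1)
sklearn_style = counts * (idf + 1)
sklearn_tfidf = TfidfVectorizer(smooth_idf=False, norm=None).fit_transform(toy_docs).toarray()

print(np.allclose(sklearn_style, sklearn_tfidf))
```

Setting `norm=None` is what makes the manual and sklearn matrices directly comparable; with the default `norm='l2'` each row would additionally be scaled to unit length.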

In [None]:
tfidf=TfidfVectorizer(use_idf=True, smooth_idf=False)

dfTFIDF = pd.DataFrame(tfidf.fit_transform(documents).toarray(), index=documents, columns=tfidf.get_feature_names_out())
dfTFIDF

In [None]:
pd.Series(tfidf.idf_, index=tfidf.get_feature_names_out()).sort_values()

In [None]:
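query = "how to machine learning"
# Project the query into the same TF-IDF space as the documents
query_vec = tfidf.transform([query]).toarray()[0]
# scipy's cosine() returns a distance, so similarity = 1 - distance
similarities = 1 - dfTFIDF.apply(lambda row: cosine(row, query_vec), axis=1)
similarities.sort_values(ascending=False)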

## Task 4
Create a search engine based on TF-IDF: given a query, return the documents ranked by their similarity to it.

In [None]:
def search(query, df):
    pass
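
A possible starting point, assuming the ranking is done by cosine similarity between TF-IDF vectors as in the cell above. The function name `tfidf_search`, its `top_k` parameter, and the choice to take a list of strings instead of a DataFrame are assumptions made for this sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_search(query, docs, top_k=3):
    """Rank docs by cosine similarity to the query in TF-IDF space (hypothetical helper)."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)
    query_vec = vectorizer.transform([query])  # query terms unseen in docs are ignored
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = np.argsort(scores)[::-1][:top_k]  # indices of the highest scores first
    return [(docs[i], float(scores[i])) for i in ranked]

sample_docs = [
    "How to Perform Speech Recognition with Python",
    "Time Series Forecasting with ThymeBoost",
    "Machine Learning",
]
for title, score in tfidf_search("speech recognition", sample_docs):
    print(f"{score:.3f}  {title}")
```

Refitting the vectorizer on every call keeps the sketch short; for a larger collection it would make more sense to fit once and reuse the document matrix.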

## Task 5
Create a search engine based on a history that contains more than one document.

In [None]:
def search(history, df):
    pass
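
One way to read this task, which is an assumption and not the only option, is to average the TF-IDF vectors of the history items into a single profile vector and rank the documents against that profile. The name `search_by_history` and the `top_k` parameter are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search_by_history(history, docs, top_k=3):
    """Rank docs against the mean TF-IDF vector of the history (hypothetical helper)."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs).toarray()
    # Average the history vectors into one 1 x vocabulary profile
    profile = vectorizer.transform(history).toarray().mean(axis=0, keepdims=True)
    scores = cosine_similarity(profile, doc_matrix).ravel()
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in ranked]

sample_docs = [
    "Time Series Forecasting with ThymeBoost",
    "Kaggle time series competition takeaways",
    "Speech Recognition with Python",
    "Machine Learning Models",
]
sample_history = ["time series forecasting", "series competition"]
for title, score in search_by_history(sample_history, sample_docs, top_k=2):
    print(f"{score:.3f}  {title}")
```

Other designs are equally plausible, for example scoring each history item separately and taking the maximum, or excluding documents that already appear in the history from the results.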