# Search Engine for Medium Articles

`In this project, a basic Search Engine was created using the TF-IDF method to rank documents based on relevance. The project involves the following steps:`

- **Text Preprocessing**
- **Word Co Occurence Matrix**
- **Continouous Bag of Words (CBoW)**
- **Skipgram**
- **Word2Vec**
- **TF IDF weighted Word2Vec**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim

import nltk, re, string, contractions
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

from scipy.sparse import csr_matrix

In [7]:
# nltk.download('punkt')
# nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Muthukumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [3]:
raw_data = pd.read_csv(r'F:\Muthu_2023\Personal\NextStep\NLP\NLP\Dataset\medium_articles_v3.csv')
# raw_data = pd.read_csv(r'E:\Nextstep\NLP\Dataset\medium_articles_v3.csv')
raw_data.head()

Unnamed: 0,link,title,sub_title,author,reading_time,text,id
0,https://towardsdatascience.com/ensemble-method...,"Ensemble methods: bagging, boosting and stacking",Understanding the key concepts of ensemble lea...,Joseph Rocca,20,This post was co-written with Baptiste Rocca.\...,1
1,https://towardsdatascience.com/understanding-a...,Understanding AUC - ROC Curve,"In Machine Learning, performance measurement i...",Sarang Narkhede,5,"In Machine Learning, performance measurement i...",2
2,https://towardsdatascience.com/how-to-work-wit...,How to work with object detection datasets in ...,"A comprehensive guide to defining, loading, ex...",Eric Hofesmann,10,Microsoft's Common Objects in Context dataset ...,3
3,https://towardsdatascience.com/11-dimensionali...,11 Dimensionality reduction techniques you sho...,Reduce the size of your dataset while keeping ...,Rukshan Pramoditha,16,"In both Statistics and Machine Learning, the n...",4
4,https://towardsdatascience.com/the-time-series...,The Time Series Transformer,Attention Is All You Need they said. Is it a m...,Theodoros Ntakouris,6,Attention Is All You Need they said. Is it a m...,5


In [4]:
raw_data = raw_data.drop(66)

`As per Analysis, Article 67 contains 10000+ unique words due to the presence of names of google scholars. Hence dropped`

## Text Preprocessing

- `Sentence Tokenization`
- `Text cleaning: Links, Numbers, AlphaNumeric words, concatenated words and Stopwords removal`

In [5]:
def text_preprocess(text):
    sent_tokens = sent_tokenize(text)
    stop_words = stopwords.words('English')
    sent_processed = []
    for sent in sent_tokens:
        sent = re.sub(r'[^a-zA-Z0-9 ]',' ', contractions.fix(sent.lower()))
        sent = re.sub(r'https://[^\s\n\r]+', '', sent) #Remove links
        sent = re.sub(r'http://[^\s\n\r]+', '', sent)
        sent = re.sub(r'[^a-zA-Z0-9 ]',' ', sent)
        word_list = []
        for word in sent.split():
            if word not in stop_words and len(word.strip()) > 1 and not word.isnumeric() and not bool(re.search(r'\d', word)) and len(word.strip()) < 20:
                word_list.append(word)
        if len(word_list)>0:
            sent_processed.append(' '.join(word_list))
    return(sent_processed)

In [6]:
raw_data['transformed_text'] = raw_data['text'].apply(text_preprocess)
raw_data.head()

Unnamed: 0,link,title,sub_title,author,reading_time,text,id,transformed_text
0,https://towardsdatascience.com/ensemble-method...,"Ensemble methods: bagging, boosting and stacking",Understanding the key concepts of ensemble lea...,Joseph Rocca,20,This post was co-written with Baptiste Rocca.\...,1,"[post co written baptiste rocca, unity strengt..."
1,https://towardsdatascience.com/understanding-a...,Understanding AUC - ROC Curve,"In Machine Learning, performance measurement i...",Sarang Narkhede,5,"In Machine Learning, performance measurement i...",2,[machine learning performance measurement esse...
2,https://towardsdatascience.com/how-to-work-wit...,How to work with object detection datasets in ...,"A comprehensive guide to defining, loading, ex...",Eric Hofesmann,10,Microsoft's Common Objects in Context dataset ...,3,[microsoft common objects context dataset coco...
3,https://towardsdatascience.com/11-dimensionali...,11 Dimensionality reduction techniques you sho...,Reduce the size of your dataset while keeping ...,Rukshan Pramoditha,16,"In both Statistics and Machine Learning, the n...",4,[statistics machine learning number attributes...
4,https://towardsdatascience.com/the-time-series...,The Time Series Transformer,Attention Is All You Need they said. Is it a m...,Theodoros Ntakouris,6,Attention Is All You Need they said. Is it a m...,5,"[attention need said, robust convolution, hack..."


In [11]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 207 entries, 0 to 207
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   link              207 non-null    object
 1   title             207 non-null    object
 2   sub_title         207 non-null    object
 3   author            207 non-null    object
 4   reading_time      207 non-null    int64 
 5   text              207 non-null    object
 6   id                207 non-null    int64 
 7   transformed_text  207 non-null    object
dtypes: int64(2), object(6)
memory usage: 14.6+ KB


In [12]:
raw_data['id'].nunique()

207

## Word Co Occurence Matrix

- `Prepare list of vocabulary`
- `Prepared word co occurence matrix: sparse`
- `Prepare training data`

In [7]:
sent_list = raw_data['transformed_text'].explode()
voc_list = sent_list.str.split().explode().unique()
print('Vocabulary Size: ', len(voc_list), 'No. of sentences: ', len(sent_list))

Vocabulary Size:  19511 No. of sentences:  26534


In [14]:
d = {}
for sentence in sent_list:
    words = sentence.split()
    for i in range(len(words)-2):
        if (words[i], words[i+1]) not in d:
            if (words[i+1], words[i]) not in d:
                d[(words[i], words[i+1])] = 1
            else:
                d[(words[i+1], words[i])] += 1
        else:
            d[(words[i], words[i+1])] += 1
            
        if (words[i], words[i+2]) not in d:
            if (words[i+2], words[i]) not in d:
                d[(words[i], words[i+2])] = 1
            else:
                d[(words[i+2], words[i])] += 1
        else:
            d[(words[i], words[i+2])] += 1

In [15]:
x_list = []
y_list = []
for sentence in sent_list:
    words = sentence.split()
    for ind in range(len(words)):
        pair_list = []
        for sub_ind in range(ind - 2, ind + 3):
            if sub_ind != ind and sub_ind >= 0 and sub_ind < len(words):
                pair_list.append(words[sub_ind])                
        if len(pair_list) > 0:
            x_list.append(pair_list)
            y_list.append([words[ind]])

In [16]:
len(x_list), len(y_list)

(244928, 244928)

In [17]:
mlb = MultiLabelBinarizer(classes = voc_list, sparse_output=True) # Generates Multi label Encoding with sparse output

In [18]:
xtrain = mlb.fit_transform(x_list)
ytrain = mlb.fit_transform(y_list)

In [20]:
mlb.classes_

array(['post', 'co', 'written', ..., 'serotonin', 'gobbled', 'critic'],
      dtype=object)

## Prepare Model

In [90]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, InputLayer
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras import Sequential

In [92]:
vec_size = 10
voc_size = len(voc_list)

In [93]:
model = Sequential()
model.add(InputLayer(input_shape=(voc_size,), sparse=True))
model.add(Dense(vec_size, activation='relu'))
model.add(Dense(voc_size, activation='softmax'))

In [94]:
model.compile(optimizer='adam', loss='SparseCategoricalCrossentropy', metrics='accuracy')

In [104]:
def convert_sparse_matrix_to_sparse_tensor(X):
    coo = X.tocoo()
    indices = np.mat([coo.row, coo.col]).transpose()
    return tf.SparseTensor(indices, coo.data, coo.shape)

In [108]:
tf.sparse.reorder(convert_sparse_matrix_to_sparse_tensor(ytrain))

SparseTensor(indices=tf.Tensor(
[[     0      0]
 [     1      1]
 [     2      2]
 ...
 [244925     62]
 [244926   1479]
 [244927     11]], shape=(244928, 2), dtype=int64), values=tf.Tensor([1 1 1 ... 1 1 1], shape=(244928,), dtype=int32), dense_shape=tf.Tensor([244928  19511], shape=(2,), dtype=int64))

In [None]:
model.fit(tf.sparse.reorder(convert_sparse_matrix_to_sparse_tensor(xtrain)), tf.sparse.reorder(convert_sparse_matrix_to_sparse_tensor(ytrain)), epochs=1, verbose=1)

`Problem with sparse matrix in the network: Needs further investigation`

# Word2Vec

- `Continuous Bag of Words implementation (CBoW) using word2vec`
- `Skipgram implementation using word2vec`

In [8]:
from gensim.models import Word2Vec

In [9]:
#Generate list of list
sentence_list = []
for sent in sent_list:
    word_list = []
    for word in sent.split():
        word_list.append(word)
    sentence_list.append(word_list)

## CBoW using Word2Vec

In [23]:
model = Word2Vec(sentence_list, window=2, vector_size=100, sg=0, min_count=0)

In [358]:
# model.build_vocab(sentence_list, progress_per=10000)
# model.train(sentence_list, total_examples=model.corpus_count, epochs=10)

In [24]:
model.wv.most_similar(positive=["learning"])

[('vending', 0.9217960834503174),
 ('translation', 0.8929186463356018),
 ('washing', 0.8903473615646362),
 ('repair', 0.8423986434936523),
 ('earliest', 0.8414889574050903),
 ('slot', 0.8408620953559875),
 ('envelopes', 0.8398900032043457),
 ('capsules', 0.8373095989227295),
 ('volunteered', 0.834150493144989),
 ('mastercard', 0.8338434100151062)]

`Extract similar words for the given word from the model`

In [25]:
model.wv['learning']

array([ 1.0237862 , -0.38215688, -0.05556021,  0.39030933,  0.26835766,
       -0.9685569 ,  1.1269547 ,  0.52827674, -0.67589307, -0.10392741,
        0.49126264, -0.1965297 , -0.37685892,  1.7085149 , -0.33567706,
       -0.59292144,  1.4235001 ,  0.5721935 , -0.15958312, -2.5173852 ,
        0.8205378 , -0.2597604 ,  1.7086773 , -0.3014652 , -0.41533405,
        0.42606047, -0.4873449 , -0.50534207, -0.03519989, -0.40178236,
        1.7446532 ,  1.5876681 , -0.11230429,  0.49677545, -1.1850039 ,
        0.61231095, -1.5197169 , -1.6955729 , -1.3053275 , -2.140613  ,
        0.86405545, -1.0840689 ,  0.9455743 , -1.2610888 ,  0.11255067,
        0.91128683, -1.0392733 , -1.7087599 , -0.18029949,  0.3970201 ,
        0.02042915, -1.580043  , -0.9794143 ,  0.09938424, -1.7822664 ,
       -0.83774525, -0.60228837, -1.6656373 , -1.8647523 , -1.2973278 ,
       -0.09623057, -0.44136506,  0.49341574,  0.07339007, -0.8351086 ,
        1.2139884 , -0.3670534 , -0.42476758, -1.1924977 , -0.47

`Extract word vector for the given word`

In [26]:
# Find Centroid
raw_data['Centroid_cbow'] = [[0.0] * 100] * raw_data.shape[0]
for index in range(len(raw_data)):
    centroid = np.array([0.0] * 100)
    article = raw_data['transformed_text'].iloc[index]
    for sent in article:
        for word in sent.split():
            try:
                centroid = np.add(centroid, model.wv[word])
            except:
                continue
    raw_data['Centroid_cbow'].iloc[index] = centroid.tolist()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  raw_data['Centroid_cbow'].iloc[index] = centroid.tolist()


`Generates centroid of the word vectors for each article`

In [42]:
def get_similar_article(query, df, model, col_name):
    cos_sim_list = []
    for index in range(len(df)):
        cent_article = np.array(df[col_name].iloc[index]).reshape(1,-1)
        cos_sim = 0
        for word in query.split():
            try:
                temp_cent = np.array(model.wv[word]).reshape(1,-1)
                cos_sim += (cosine_similarity(temp_cent, cent_article))
            except:
                continue
        cos_sim_list.append([df['title'].iloc[index], cos_sim[0,0]])
    return cos_sim_list

`Compares cosine similarity for each word from query with all the articles`

In [68]:
similar_articles = get_similar_article("principal component analysis", raw_data, model, 'Centroid_cbow')
temp_df = pd.DataFrame(similar_articles, columns=['title', 'score'])
temp_df.sort_values('score', ascending=False)['title'].iloc[:5]

18    17 Clustering Algorithms Used In Data Science ...
3     11 Dimensionality reduction techniques you sho...
52    TRAIN A CUSTOM YOLOv4 OBJECT DETECTOR (Using G...
19    Introduction to Genetic Algorithms  Including ...
74    The 5 Clustering Algorithms Data Scientists Ne...
Name: title, dtype: object

`Returns top 5 matching articles`

## Skipgram using Word2Vec

In [10]:
model_sg = Word2Vec(sentence_list, min_count=0, window=2, vector_size=100, sg=1)

In [70]:
model_sg.wv['learning']

array([ 0.4514971 , -0.06575663, -0.04660844,  0.05198604,  0.11167115,
       -0.33453867,  0.45436785,  0.08417453, -0.37928718,  0.00169973,
        0.1480776 , -0.15703738, -0.14837955,  0.83874077, -0.13342898,
       -0.38646498,  0.597749  ,  0.50272214, -0.1252687 , -0.97115844,
        0.49803537, -0.11483087,  0.9821957 , -0.17528768, -0.22842406,
        0.3397735 , -0.13709344, -0.5374775 , -0.12903762, -0.17602657,
        1.0036551 ,  0.83098614, -0.01324456,  0.20545445, -0.4712755 ,
        0.2481716 , -0.73464656, -0.727447  , -0.7408131 , -0.9686139 ,
        0.34138075, -0.5425545 ,  0.38642862, -0.5712795 ,  0.03600386,
        0.3683938 , -0.65368474, -0.88349015,  0.086861  ,  0.3330882 ,
        0.09210204, -0.7773117 , -0.55614394,  0.00394889, -0.84166384,
       -0.41658974, -0.25193766, -0.77118176, -0.9820485 , -0.56150347,
       -0.01979044, -0.3817165 ,  0.05722544,  0.06648361, -0.3433137 ,
        0.62319756, -0.20727728, -0.29181555, -0.5545251 , -0.11

In [71]:
model_sg.wv.most_similar(positive=['learning'])

[('translation', 0.8542159199714661),
 ('supervised', 0.8365312814712524),
 ('cpap', 0.8223248720169067),
 ('language', 0.8162795305252075),
 ('ai', 0.8143751621246338),
 ('geolocation', 0.8119624257087708),
 ('intelligence', 0.8092731833457947),
 ('brownlee', 0.8022918105125427),
 ('source', 0.802204430103302),
 ('python', 0.800692617893219)]

In [39]:
raw_data['Centroid_sg'] = [[0.0] * 100 ] * raw_data.shape[0]
for index in range(len(raw_data)):
    text = raw_data['transformed_text'].iloc[index]
    centroid_article = [0.0] * 100
    for sent in text:
        for word in sent.split():
            try:
                centroid_article = np.add(centroid_article, model_sg.wv[word])
            except:
                continue
    raw_data['Centroid_sg'].iloc[index] = centroid_article    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  raw_data['Centroid_sg'].iloc[index] = centroid_article


`Generates centroid of the word vectors for each article`

In [79]:
temp_df_sg = pd.DataFrame(get_similar_article('pca', raw_data, model_sg, 'Centroid_sg'), columns = ['title', 'score'])
temp_df_sg.sort_values('score', ascending=False).iloc[:5]

Unnamed: 0,title,score
49,PCA using Python (scikit-learn),0.967104
131,"Machine learning in finance: Why, what & how",0.966168
3,11 Dimensionality reduction techniques you sho...,0.963317
28,180 Data Science and Machine Learning Projects...,0.960358
98,Top 10 Data Science Projects for Beginners,0.959503


`Returns Top 5 matching articles`

In [89]:
model_sg.wv.most_similar(['regression'])

[('classification', 0.9591761827468872),
 ('non', 0.9544885158538818),
 ('linear', 0.9541516900062561),
 ('clustering', 0.9520301222801208),
 ('methods', 0.9456323385238647),
 ('dimensionality', 0.9435859322547913),
 ('technique', 0.9434919357299805),
 ('performance', 0.9420003890991211),
 ('ensemble', 0.9411723613739014),
 ('called', 0.9368484616279602)]

In [87]:
searchword = 'regression'
search_list = [sent for sent in sentence_list if searchword in sent]
len(search_list)

138

# TFIDF weighted Word2Vec

- `Calculate TF-IDF`
- `Apply the weights while calculating word vec centroid for each article`
- `Calculate averageTFIDF for each word and apply the weights for each word in query while calculating centroid for each query`

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
processed_text = [" ".join(article) for article in raw_data['transformed_text']]

In [22]:
tfidf = TfidfVectorizer()
tfidf_df = pd.DataFrame(tfidf.fit_transform(processed_text).todense())
tfidf_df.columns = tfidf.get_feature_names_out()

In [35]:
raw_data['tfidf_w2v'] = [[0.0] * 100 ] * raw_data.shape[0]
for index in range(len(raw_data)):
    text = raw_data['transformed_text'].iloc[index]
    centroid_article = [0.0] * 100
    for sent in text:
        for word in sent.split():
            try:                
                centroid_article = np.add(centroid_article, tfidf_df.iloc[index][word] * model_sg.wv[word])
            except:
                continue
    raw_data['tfidf_w2v'].iloc[index] = centroid_article   

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  raw_data['tfidf_w2v'].iloc[index] = centroid_article


In [82]:
def get_similar_article1(query, df, model, col_name):
    cos_sim_list = []
    for index in range(len(df)):
        cent_article = np.array(df[col_name].iloc[index]).reshape(1,-1)
        cos_sim = 0
        for word in query.split():
            try:
                temp_arr = np.array(tfidf_df.iloc[:][word])
                tfidf_mean = temp_arr[list(np.where(temp_arr))].mean()
                temp_cent = np.array(model.wv[word] * tfidf_mean).reshape(1,-1)
                cos_sim += (cosine_similarity(temp_cent, cent_article))
            except:
                continue
        cos_sim_list.append([df['title'].iloc[index], cos_sim[0,0]])
    return cos_sim_list

In [83]:
temp_df_tfidf = pd.DataFrame(get_similar_article1('pca', raw_data, model_sg, 'tfidf_w2v'), columns = ['title', 'score'])
temp_df_tfidf.sort_values('score', ascending=False).iloc[:5]

Unnamed: 0,title,score
3,11 Dimensionality reduction techniques you sho...,0.979759
49,PCA using Python (scikit-learn),0.976613
0,"Ensemble methods: bagging, boosting and stacking",0.970478
45,Understanding Contrastive Learning,0.968863
55,Time Series Forecasting with PyCaret Regressio...,0.966923


# Prepared by Muthukumar G