# Create A Simple Search Engine Using Python 
## Use TF-IDF and cosine similarity to retrieve articles similar to a query

Information retrieval is an important task today. You may be wondering how a system retrieves the articles we want from a query. Here are the steps:
1. Extract documents from the Internet (by web scraping or by collecting them manually)
2. Clean the documents to make retrieval easier
3. Create a term-document matrix with TF-IDF weighting
4. Write your query and convert it to a vector (using the same TF-IDF weighting)
5. Calculate the cosine similarity between the query vector and each document vector
6. Finally, show the documents ranked by similarity
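
The steps above can be sketched end to end on a tiny in-memory corpus, without any scraping. This is a minimal illustration using only the standard library, not the notebook's actual code: `toy_docs`, `build_idf`, `tfidf_vector`, `cosine` and `rank` are hypothetical names, and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` is assumed here because it matches scikit-learn's documented default smoothing.

```python
import math
from collections import Counter

# Toy corpus standing in for the scraped articles (illustrative data only)
toy_docs = [
    "messi scores as argentina beat australia",
    "france face poland in the last sixteen",
    "argentina fans celebrate in buenos aires",
]

def tokenize(text):
    return text.lower().split()

def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    """Map term -> tf * idf; terms that never occur in the corpus are ignored."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (document index, similarity) pairs, most similar first."""
    idf = build_idf(docs)
    doc_vecs = [tfidf_vector(d, idf) for d in docs]
    q_vec = tfidf_vector(query, idf)
    scores = [(i, cosine(q_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank("argentina", toy_docs)
for idx, score in ranking:
    print(round(score, 3), toy_docs[idx])
```

The two documents that mention "argentina" get a positive score, while the document that does not scores zero and lands last.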


In [2]:
import re
import string
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

def retrieve_docs_and_clean():

  r = requests.get('https://sports.ndtv.com/fifa-world-cup-2022/news')
  soup = BeautifulSoup(r.content, 'html.parser')

  # NOTE: the selectors below need to be modified to suit the URL above
  link = []
  for i in soup.find('div', {'class':"lst-pg_hd"}).find_all('a', {'class':"lst-pg_ttl"}):
    i['href'] = "https://sports.ndtv.com/" + i['href'] + '?page=all'
    link.append(i['href'])

  

  # Retrieve Paragraphs
  documents = []
  for url in link:
      r = requests.get(url)
      soup = BeautifulSoup(r.content, 'html.parser')

      sen = []
      for p in soup.find('div', {'class':'sp-cn pg-str-com js-ad-section'}).find_all('p'):
          sen.append(p.text)
      documents.append(' '.join(sen))

  # Clean Paragraphs
  documents_clean = []
  for d in documents:
      document_test = re.sub(r'[^\x00-\x7F]+', ' ', d)
      document_test = re.sub(r'@\w+', '', document_test)
      document_test = document_test.lower()
      document_test = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', document_test)
      document_test = re.sub(r'[0-9]', '', document_test)
      document_test = re.sub(r'\s{2,}', ' ', document_test)
      documents_clean.append(document_test)

  return documents_clean
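
To see what the cleaning steps do to a single string, the same regular expressions can be applied offline. This is a small sketch: `clean_text` and `sample` are hypothetical names introduced for illustration, while the patterns are copied from the cleaning loop above.

```python
import re
import string

def clean_text(text):
    """Apply the notebook's cleaning regexes to one string."""
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)   # runs of non-ASCII characters -> space
    text = re.sub(r'@\w+', '', text)             # drop @mentions
    text = text.lower()
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'[0-9]', '', text)            # drop digits
    text = re.sub(r'\s{2,}', ' ', text)          # collapse runs of whitespace
    return text

# \u2019 is a typographic apostrophe, i.e. a non-ASCII character
sample = "Messi\u2019s 789th goal @FIFA: magic!"
print(repr(clean_text(sample)))
```

The sample becomes `'messi s th goal magic '` (note the trailing space, since nothing strips it). Replacing apostrophes with spaces and deleting digits is what produces fragments such as "s th goal" in the printed articles further down.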

In [3]:
docs = retrieve_docs_and_clean()
# Create Term-Document Matrix with TF-IDF weighting
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Create a DataFrame
df = pd.DataFrame(X.T.toarray(), index=vectorizer.get_feature_names_out())
df.head()

term,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
abandoned,0.095733,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
able,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067416,0.0
about,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076461,0.0,0.0,0.0
accepting,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032035
according,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054104,0.083562,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
def get_similar_articles(q, df):
  print("query:", q)
  print("The following are articles with the highest cosine similarity values: ")
  q = [q]
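  q_vec = vectorizer.transform(q).toarray().reshape(df.shape[0],)
  q_norm = np.linalg.norm(q_vec)
  sim = {}
  # Score every document column rather than a hard-coded first ten
  for i in range(df.shape[1]):
    doc_vec = df.loc[:, i].values
    # Cosine similarity: dot product over the product of both norms (0 if either norm is 0)
    denom = np.linalg.norm(doc_vec) * q_norm
    sim[i] = np.dot(doc_vec, q_vec) / denom if denom else 0.0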
  
  sim_sorted = sorted(sim.items(), key=lambda x: x[1], reverse=True)
  
  for k, v in sim_sorted:
    if v != 0.0:
      print("Similarity Values:", v)
      print(docs[k])
      print()


q1 = 'barcelona'
q2 = 'spain'
q3 = 'argentina'

get_similar_articles(q1, df)
print('-'*100)
get_similar_articles(q2, df)
print('-'*100)
get_similar_articles(q3, df)

query: barcelona
The following are articles with the highest cosine similarity values: 
Similarity Values: 0.015616493922495082
france s bid to retain the world cup continues against poland on sunday while england s pursuit of a first major trophy in years will be tested by african champions senegal didier deschamps french side are aiming to become the first team to win successive world cups since brazil in and after winning group d in qatar window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js led by electric paris saint germain forward kylian mbappe the french are heavy favourites to progress past poland at doha s al thumama stadium but in a world cup packed with shocks like the group stage exits of germany and belgium subduing barcelona striker robert lewandowski is the key to france s hopes of avoiding another upset pol
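
Cosine similarity is the dot product of two vectors divided by the product of their norms, so it can also be computed for all documents at once instead of one column at a time. The following is a sketch with NumPy only: `cosine_scores` is a hypothetical helper, and the small matrix is made-up data with terms as rows and documents as columns, the same orientation as `df`.

```python
import numpy as np

def cosine_scores(term_doc, q_vec):
    """Cosine similarity between q_vec and every column of term_doc."""
    term_doc = np.asarray(term_doc, dtype=float)   # shape: (n_terms, n_docs)
    q_vec = np.asarray(q_vec, dtype=float)         # shape: (n_terms,)
    dots = term_doc.T @ q_vec
    norms = np.linalg.norm(term_doc, axis=0) * np.linalg.norm(q_vec)
    # Keep the score at 0 wherever a norm is 0 instead of dividing by zero
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Hypothetical 3-term x 3-document matrix, for illustration only
toy = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0])

scores = cosine_scores(toy, query)
order = np.argsort(scores)[::-1]   # document indices, most similar first
print(scores, order)
```

Documents 0 and 2 point in the same direction as the query and both score 1 even though their magnitudes differ, which is the property that makes cosine similarity insensitive to document length. With the notebook's objects, a call along the lines of `cosine_scores(df.values, q_vec)` should give one score per article.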

In [17]:
# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization.bm25 import BM25

def bm25_ranking(query):
  docs_tokens = [s.split() for s in docs]
  bm25 = BM25(docs_tokens)
  qry = query.split()
  scores = bm25.get_scores(qry, average_idf=20)

  best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
  for i,b in enumerate(best):
    print(f"Rank {i+1}: {docs[b]}")


In [18]:
bm25_ranking('argentina')

Rank 1: argentina will take on australia in the second round of match of the ongoing fifa world cup at ahmad bin ali stadium on sunday ist argentina faced a major blow after they lost to saudi arabia in their first group stage match but showed a massive comeback and finished as the group c toppers australia on the other hand have equal number of points and finished in the second position in group d window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js when will the argentina vs australia round of match be played the argentina vs australia round of match will be played on sunday december ist where will the argentina vs australia round of match will be played the argentina vs australia round of match will be played at the ahmad bin ali stadium sponsored by vuuklevar vuukle config apikey e a d bd bfcd articleid comments enable

In [19]:
bm25_ranking('messi')

Rank 1: lionel messi produced a moment of trademark quality to score the opener in a win over australia and send argentina into the quarter finals of the world cup on saturday messi s th goal in his th career appearance but first in the knockout rounds of the world cup helped set up a last eight clash with the unbeaten netherlands on friday julian alvarez scored the other in the th minute after some calamitous australian defending before enzo fernandez s th minute own goal set up an unexpectedly nervy finale now one of the best players of all time is seeking to win the tournament for the first time in what will surely be his final world cup window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js messi s classy first half goal bending the ball into the bottom corner broke the resistance and ultimately the hearts of a gutsy aus

In [20]:
bm25_ranking('poland')

Rank 1: france s bid to retain the world cup continues against poland on sunday while england s pursuit of a first major trophy in years will be tested by african champions senegal didier deschamps french side are aiming to become the first team to win successive world cups since brazil in and after winning group d in qatar window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js led by electric paris saint germain forward kylian mbappe the french are heavy favourites to progress past poland at doha s al thumama stadium but in a world cup packed with shocks like the group stage exits of germany and belgium subduing barcelona striker robert lewandowski is the key to france s hopes of avoiding another upset poland s all time leading scorer will be hoping to add to his meagre world cup tally of one goal against france goalkeeper 
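
For environments where `gensim.summarization` is not available, the idea behind the BM25 ranking can be sketched with the standard library alone. This is a minimal illustration, not gensim's implementation: `bm25_scores` and `bm25_toy_docs` are hypothetical names, `k1=1.5` and `b=0.75` are commonly used defaults assumed here, and the IDF uses the non-negative variant `ln(1 + (N - n + 0.5) / (n + 0.5))`, so the scores will not match gensim's numerically.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each whitespace-tokenized document against the query with Okapi BM25."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Made-up documents, for illustration only
bm25_toy_docs = [
    "argentina beat australia and argentina advance",
    "france face poland",
    "argentina fans celebrate",
]
bm25_result = bm25_scores("argentina", bm25_toy_docs)
best = sorted(range(len(bm25_result)), key=lambda i: bm25_result[i], reverse=True)
for position, idx in enumerate(best):
    print(f"Rank {position + 1}: {bm25_toy_docs[idx]}")
```

The document that mentions the query term twice ranks first, the one that mentions it once ranks second, and the document without the term scores zero.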