# Information Retrieval Project - Mini Document Search Engine

**Class: LB01**

1. Juan Sebastian Veron - 2201754311
2. Muhammad Alvito Kuntjoro - 2201788073
3. I Putu Prema Ananda D.N - 2201785001
4. Naxwell Fladico - 2201791282
5. Calvin Tantry - 2201749551

In this project, we build a mini document search engine that uses simple TF-IDF weighting and ranks documents by cosine similarity. The dataset consists of two categories: Medical and Technology.

In [107]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
import string
from collections import Counter
import numpy as np

# I.  Medical Dataset

In [8]:
medical_dataset = pd.read_csv('corona_dataset.csv')
medical_dataset = pd.DataFrame(medical_dataset)
medical_dataset = medical_dataset.drop(['Unnamed: 0','Unnamed: 0.1','category','source','date','title'], axis = 1)
medical_dataset = medical_dataset.dropna()
medical_dataset['preprocessed_body'] = medical_dataset['body']
medical_dataset

Unnamed: 0,body,preprocessed_body
0,New York Mayor Bill de Blasio is among those q...,New York Mayor Bill de Blasio is among those q...
1,A second patient has died of the COVID-19 viru...,A second patient has died of the COVID-19 viru...
2,"England will formally register COVID-19, a dis...","England will formally register COVID-19, a dis..."
3,Members of the Los Angeles Lakers are under qu...,Members of the Los Angeles Lakers are under qu...
4,Social media is awash with myths about how peo...,Social media is awash with myths about how peo...
...,...,...
1004,As thousands of fans queued to get into the Au...,As thousands of fans queued to get into the Au...
1005,Seven weeks after the first case of COVID-19 w...,Seven weeks after the first case of COVID-19 w...
1006,The government has a palette of options it can...,The government has a palette of options it can...
1007,Millions of face masks stockpiled by Ontario i...,Millions of face masks stockpiled by Ontario i...


# Preprocessing

# a. Lowercase

In [9]:
medical_dataset['preprocessed_body'] = medical_dataset['preprocessed_body'].astype(str).str.lower()
medical_dataset['preprocessed_body']

medical_dataset

Unnamed: 0,body,preprocessed_body
0,New York Mayor Bill de Blasio is among those q...,new york mayor bill de blasio is among those q...
1,A second patient has died of the COVID-19 viru...,a second patient has died of the covid-19 viru...
2,"England will formally register COVID-19, a dis...","england will formally register covid-19, a dis..."
3,Members of the Los Angeles Lakers are under qu...,members of the los angeles lakers are under qu...
4,Social media is awash with myths about how peo...,social media is awash with myths about how peo...
...,...,...
1004,As thousands of fans queued to get into the Au...,as thousands of fans queued to get into the au...
1005,Seven weeks after the first case of COVID-19 w...,seven weeks after the first case of covid-19 w...
1006,The government has a palette of options it can...,the government has a palette of options it can...
1007,Millions of face masks stockpiled by Ontario i...,millions of face masks stockpiled by ontario i...


# b. Number and Punctuation removal

In [10]:
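# regex=True is needed in pandas 2.0+, where str.replace matches literal text by default
medical_dataset['preprocessed_body'] = medical_dataset['preprocessed_body'].str.replace(r'[^\w\s]', '', regex=True)
medical_dataset['preprocessed_body'] = medical_dataset['preprocessed_body'].str.replace(r'[0-9]', '', regex=True)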
medical_dataset

Unnamed: 0,body,preprocessed_body
0,New York Mayor Bill de Blasio is among those q...,new york mayor bill de blasio is among those q...
1,A second patient has died of the COVID-19 viru...,a second patient has died of the covid virus i...
2,"England will formally register COVID-19, a dis...",england will formally register covid a disease...
3,Members of the Los Angeles Lakers are under qu...,members of the los angeles lakers are under qu...
4,Social media is awash with myths about how peo...,social media is awash with myths about how peo...
...,...,...
1004,As thousands of fans queued to get into the Au...,as thousands of fans queued to get into the au...
1005,Seven weeks after the first case of COVID-19 w...,seven weeks after the first case of covid was ...
1006,The government has a palette of options it can...,the government has a palette of options it can...
1007,Millions of face masks stockpiled by Ontario i...,millions of face masks stockpiled by ontario i...


# c. Stop Word Removal

In [11]:
stop_words = set(stopwords.words('english'))
medical_dataset['preprocessed_body']= medical_dataset.apply(lambda row: nltk.word_tokenize(row['preprocessed_body']), axis=1)
medical_dataset['preprocessed_body'] = medical_dataset['preprocessed_body'].apply(lambda x: ' '.join([word for word in x if word not in (stop_words)]))

# d. Preprocessed Dataset

In [12]:
medical_dataset

Unnamed: 0,body,preprocessed_body
0,New York Mayor Bill de Blasio is among those q...,new york mayor bill de blasio among questionin...
1,A second patient has died of the COVID-19 viru...,second patient died covid virus ireland total ...
2,"England will formally register COVID-19, a dis...",england formally register covid disease caused...
3,Members of the Los Angeles Lakers are under qu...,members los angeles lakers quarantine days tes...
4,Social media is awash with myths about how peo...,social media awash myths people might stop new...
...,...,...
1004,As thousands of fans queued to get into the Au...,thousands fans queued get australian grand pri...
1005,Seven weeks after the first case of COVID-19 w...,seven weeks first case covid confirmed u
1006,The government has a palette of options it can...,government palette options use shore economy i...
1007,Millions of face masks stockpiled by Ontario i...,millions face masks stockpiled ontario afterma...
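
The three preprocessing steps above can be sketched on a single string. This is a minimal illustration, not the notebook's exact pipeline: `preprocess_text` and `SAMPLE_STOP_WORDS` are hypothetical names, the sample sentence is made up to resemble row 1, the stop-word list is deliberately tiny, and `str.split()` stands in for `nltk.word_tokenize` so the sketch runs without NLTK data.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
SAMPLE_STOP_WORDS = {"a", "has", "of", "the", "in", "is", "are"}

def preprocess_text(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, strip punctuation and digits, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = re.sub(r"[0-9]", "", text)     # remove digits
    tokens = text.split()                 # stand-in for nltk.word_tokenize
    return [tok for tok in tokens if tok not in stop_words]

# Illustrative sentence modeled on row 1 of the medical dataset
sample = "A second patient has died of the COVID-19 virus in Ireland."
print(preprocess_text(sample))
```

Note that removing punctuation before digits turns `covid-19` into `covid`, which matches the preprocessed column shown above.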


# II. Technology Dataset

In [13]:
technology_dataset = pd.read_csv('CSV_sampleDataset_DataAnalyticsTrack.csv')

In [14]:
# Passing the axis positionally (drop(label, 1)) is no longer accepted in pandas 2.0+
technology_dataset = technology_dataset.drop(columns=['Unnamed: 0', 'articleId'])
technology_dataset.columns = ['body']
technology_dataset['preprocessed_body'] = technology_dataset['body']

In [15]:
technology_dataset

Unnamed: 0,body,preprocessed_body
0,Finding an off-the-shelf processor that is an ...,Finding an off-the-shelf processor that is an ...
1,"SAN JOSE, Calif. — Fresh details have emerged ...","SAN JOSE, Calif. — Fresh details have emerged ..."
2,"AUSTIN, Texas — The effort to develop a 157-nm...","AUSTIN, Texas — The effort to develop a 157-nm..."
3,The following sources offer additional informa...,The following sources offer additional informa...
4,As embedded systems and the microprocessor cor...,As embedded systems and the microprocessor cor...
...,...,...
4995,"Scottsdale, Ariz. — NKK Switches offers a new ...","Scottsdale, Ariz. — NKK Switches offers a new ..."
4996,"Lisle, Ill. — A new range of 0.95-mm height Sl...","Lisle, Ill. — A new range of 0.95-mm height Sl..."
4997,"El Segundo, Calif. – International Rectifier's...","El Segundo, Calif. – International Rectifier's..."
4998,A handful of vendors banned together to addres...,A handful of vendors banned together to addres...


In [16]:
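# Convert every entry (including NaN) to a string in one vectorized step,
# avoiding the chained assignment df[col][i] = ... inside a Python loop
technology_dataset['preprocessed_body'] = technology_dataset['preprocessed_body'].astype(str)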

In [17]:
technology_dataset['preprocessed_body'] = technology_dataset['preprocessed_body'].str.lower()
#x = technology_dataset['content'].isna()
technology_dataset

Unnamed: 0,body,preprocessed_body
0,Finding an off-the-shelf processor that is an ...,finding an off-the-shelf processor that is an ...
1,"SAN JOSE, Calif. — Fresh details have emerged ...","san jose, calif. — fresh details have emerged ..."
2,"AUSTIN, Texas — The effort to develop a 157-nm...","austin, texas — the effort to develop a 157-nm..."
3,The following sources offer additional informa...,the following sources offer additional informa...
4,As embedded systems and the microprocessor cor...,as embedded systems and the microprocessor cor...
...,...,...
4995,"Scottsdale, Ariz. — NKK Switches offers a new ...","scottsdale, ariz. — nkk switches offers a new ..."
4996,"Lisle, Ill. — A new range of 0.95-mm height Sl...","lisle, ill. — a new range of 0.95-mm height sl..."
4997,"El Segundo, Calif. – International Rectifier's...","el segundo, calif. – international rectifier's..."
4998,A handful of vendors banned together to addres...,a handful of vendors banned together to addres...


In [18]:
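# regex=True is needed in pandas 2.0+, where str.replace matches literal text by default
technology_dataset['preprocessed_body'] = technology_dataset['preprocessed_body'].str.replace(r'[^\w\s]', '', regex=True)
technology_dataset['preprocessed_body'] = technology_dataset['preprocessed_body'].str.replace(r'[0-9]', '', regex=True)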
technology_dataset

Unnamed: 0,body,preprocessed_body
0,Finding an off-the-shelf processor that is an ...,finding an offtheshelf processor that is an id...
1,"SAN JOSE, Calif. — Fresh details have emerged ...",san jose calif fresh details have emerged abo...
2,"AUSTIN, Texas — The effort to develop a 157-nm...",austin texas the effort to develop a nm litho...
3,The following sources offer additional informa...,the following sources offer additional informa...
4,As embedded systems and the microprocessor cor...,as embedded systems and the microprocessor cor...
...,...,...
4995,"Scottsdale, Ariz. — NKK Switches offers a new ...",scottsdale ariz nkk switches offers a new ser...
4996,"Lisle, Ill. — A new range of 0.95-mm height Sl...",lisle ill a new range of mm height slimstack ...
4997,"El Segundo, Calif. – International Rectifier's...",el segundo calif international rectifiers apu...
4998,A handful of vendors banned together to addres...,a handful of vendors banned together to addres...


In [19]:
stop_words = set(stopwords.words('english'))
technology_dataset['preprocessed_body'] = technology_dataset.apply(lambda row: nltk.word_tokenize(row['preprocessed_body']), axis=1)
technology_dataset['preprocessed_body'] = technology_dataset['preprocessed_body'].apply(lambda x: ' '.join([word for word in x if word not in (stop_words)]))
technology_dataset

Unnamed: 0,body,preprocessed_body
0,Finding an off-the-shelf processor that is an ...,finding offtheshelf processor ideal fit applic...
1,"SAN JOSE, Calif. — Fresh details have emerged ...",san jose calif fresh details emerged second sp...
2,"AUSTIN, Texas — The effort to develop a 157-nm...",austin texas effort develop nm lithography sol...
3,The following sources offer additional informa...,following sources offer additional information...
4,As embedded systems and the microprocessor cor...,embedded systems microprocessor cores based gr...
...,...,...
4995,"Scottsdale, Ariz. — NKK Switches offers a new ...",scottsdale ariz nkk switches offers new series...
4996,"Lisle, Ill. — A new range of 0.95-mm height Sl...",lisle ill new range mm height slimstack connec...
4997,"El Segundo, Calif. – International Rectifier's...",el segundo calif international rectifiers apu ...
4998,A handful of vendors banned together to addres...,handful vendors banned together address intero...


In [14]:
# technology_dataset['content'] = technology_dataset.apply(lambda row: nltk.word_tokenize(row['content']), axis=1)
# technology_dataset


In [21]:
medical_dataset

Unnamed: 0,body,preprocessed_body
0,New York Mayor Bill de Blasio is among those q...,new york mayor bill de blasio among questionin...
1,A second patient has died of the COVID-19 viru...,second patient died covid virus ireland total ...
2,"England will formally register COVID-19, a dis...",england formally register covid disease caused...
3,Members of the Los Angeles Lakers are under qu...,members los angeles lakers quarantine days tes...
4,Social media is awash with myths about how peo...,social media awash myths people might stop new...
...,...,...
1004,As thousands of fans queued to get into the Au...,thousands fans queued get australian grand pri...
1005,Seven weeks after the first case of COVID-19 w...,seven weeks first case covid confirmed u
1006,The government has a palette of options it can...,government palette options use shore economy i...
1007,Millions of face masks stockpiled by Ontario i...,millions face masks stockpiled ontario afterma...


# Combining 2 datasets

In [22]:
dataset = pd.concat([medical_dataset, technology_dataset], ignore_index=True)

In [23]:
dataset['preprocessed_body'] = dataset.apply(lambda row: nltk.word_tokenize(row['preprocessed_body']), axis=1)

In [24]:
dataset

Unnamed: 0,body,preprocessed_body
0,New York Mayor Bill de Blasio is among those q...,"[new, york, mayor, bill, de, blasio, among, qu..."
1,A second patient has died of the COVID-19 viru...,"[second, patient, died, covid, virus, ireland,..."
2,"England will formally register COVID-19, a dis...","[england, formally, register, covid, disease, ..."
3,Members of the Los Angeles Lakers are under qu...,"[members, los, angeles, lakers, quarantine, da..."
4,Social media is awash with myths about how peo...,"[social, media, awash, myths, people, might, s..."
...,...,...
6004,"Scottsdale, Ariz. — NKK Switches offers a new ...","[scottsdale, ariz, nkk, switches, offers, new,..."
6005,"Lisle, Ill. — A new range of 0.95-mm height Sl...","[lisle, ill, new, range, mm, height, slimstack..."
6006,"El Segundo, Calif. – International Rectifier's...","[el, segundo, calif, international, rectifiers..."
6007,A handful of vendors banned together to addres...,"[handful, vendors, banned, together, address, ..."


# Document Ranking with TF-IDF and Cosine Similarity and Main Program

Count the document frequency of each vocabulary term, i.e. the number of documents in our dataset that contain it.

In [66]:
DOC_COUNT = len(dataset)
DF = {}

for tokens in dataset["preprocessed_body"]:
    # set() ensures each document adds at most 1 to a token's count
    for tok in set(tokens):
        DF[tok] = DF.get(tok, 0) + 1

vocab = [x for x in DF]
VOCAB_SIZE = len(DF)
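
To make the distinction between document frequency and raw term count concrete, here is a small sketch on a made-up corpus. `toy_docs` and `toy_df` are hypothetical names, and `Counter` is used here only as a compact alternative to the dictionary loop.

```python
from collections import Counter

# Hypothetical tokenized documents, standing in for dataset["preprocessed_body"]
toy_docs = [
    ["covid", "quarantine", "covid"],
    ["covid", "vaccine"],
    ["processor", "chip"],
]

# set(tokens) makes each document count a term at most once
toy_df = Counter(tok for tokens in toy_docs for tok in set(tokens))

print(toy_df["covid"])       # number of documents containing the term, not total occurrences
print(toy_df["quarantine"])
print(toy_df["missing"])     # a Counter returns 0 for unseen tokens
```

Although `covid` occurs three times in total, it appears in only two documents, so its document frequency is 2.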

Calculate TF * IDF for each token in each document.
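
Written out, the weighting computed in the following cell appears to be the one below. The symbols are notation introduced here for illustration: $f_{t,d}$ is the count of token $t$ in document $d$ (`counter[tok]`), $|d|$ is the number of tokens in the document (`N_TOKENS`), $N$ is `DOC_COUNT`, and $\mathrm{df}(t)$ is `DF[tok]`.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t) + 1}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

As a worked example, the combined dataset has $N = 6008$ rows, so a hypothetical term found in 99 documents would get $\mathrm{idf} = \ln(6008 / 100) \approx 4.10$. Because of the $+1$ in the denominator, a term that occurs in every document gets a slightly negative weight, $\ln\frac{N}{N+1} < 0$.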

In [74]:
tfidf_doc = np.zeros((DOC_COUNT, VOCAB_SIZE))

for (idx, tokens) in enumerate(dataset["preprocessed_body"]):
    N_TOKENS = len(tokens)

    counter = Counter(tokens)

    for tok in set(tokens):
        tf = counter[tok] / N_TOKENS
        df = DF[tok]
        idf = np.log(DOC_COUNT / (df + 1))

        tfidf_doc[idx][vocab.index(tok)] = tf * idf
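
Calling `vocab.index(tok)` scans the vocabulary list on every lookup, which can become slow with a large vocabulary. A minimal sketch of the same computation with a dictionary mapping each token to its column is shown below; `toy_docs`, `vocab_index`, and the other names are hypothetical, and the corpus is made up.

```python
import numpy as np
from collections import Counter

# Hypothetical tokenized corpus standing in for dataset["preprocessed_body"]
toy_docs = [
    ["covid", "quarantine", "covid"],
    ["covid", "vaccine"],
    ["processor", "chip"],
    ["chip", "lithography"],
]
doc_count = len(toy_docs)

# Document frequency per term
df = Counter(tok for tokens in toy_docs for tok in set(tokens))

# Map each term to a fixed column once, so each lookup is a dict access
vocab = sorted(df)
vocab_index = {tok: col for col, tok in enumerate(vocab)}

tfidf = np.zeros((doc_count, len(vocab)))
for row, tokens in enumerate(toy_docs):
    counts = Counter(tokens)
    for tok, count in counts.items():
        tf = count / len(tokens)
        idf = np.log(doc_count / (df[tok] + 1))  # same smoothing as the notebook
        tfidf[row, vocab_index[tok]] = tf * idf

print(tfidf.shape)
print(tfidf[0, vocab_index["covid"]])
```

The resulting matrix has one row per document and one column per vocabulary term, the same layout as `tfidf_doc`.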

In [110]:
# Export our document TF-IDF matrix to a file so we don't have to recompute it at every startup of the main program
np.save("tfidf_doc.npy", tfidf_doc)
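
The saved file is only useful if a later run reads it back. One possible pattern, sketched under the assumption that the cache lives next to the notebook, is to load the file when it exists and rebuild it otherwise. `load_or_build` is a hypothetical helper, and the builders below are stand-ins for the real TF-IDF loop.

```python
import os
import tempfile
import numpy as np

def load_or_build(path, build_fn):
    """Return the cached array at `path`, building and saving it if absent."""
    if os.path.exists(path):
        return np.load(path)
    matrix = build_fn()
    np.save(path, matrix)
    return matrix

# Demonstration with a throwaway directory and stand-in builders
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "tfidf_doc.npy")
    first = load_or_build(cache_path, lambda: np.eye(3))     # builds and saves
    second = load_or_build(cache_path, lambda: np.zeros(3))  # loads the cached copy
    print(np.array_equal(first, second))
```

A cached matrix is only meaningful together with the vocabulary order and document frequencies that produced it, so `vocab` and `DF` would likely need to be saved alongside it.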

Create a function that returns the document frequency of a token. It must return 0 if the token does not exist in any document, rather than throwing an exception.

In [77]:
def document_freq(token):
    # dict.get returns the default (0) for tokens missing from the corpus
    return DF.get(token, 0)

Create a function that vectorizes the TF-IDF of our query. The vector is a 1D array with N entries, where N is our vocabulary size.

In [90]:
def vectorize_query_tokens(query_tokens):
    query_vec = np.zeros((VOCAB_SIZE))

    token_counter = Counter(query_tokens)
    N_TOKENS = len(query_tokens)
    
    # Calculate TF-IDF for each token in our query document
    for tok in set(query_tokens):
        # A token that never occurs in the corpus has no column in the
        # vocabulary (vocab.index would raise ValueError), so skip it
        if tok not in DF:
            continue

        tf = token_counter[tok] / N_TOKENS
        df = document_freq(tok)
        idf = np.log(DOC_COUNT / (df + 1))

        # Get index of token in our vocabulary list
        vocab_idx = vocab.index(tok)
        query_vec[vocab_idx] = tf * idf

    return query_vec

In [96]:
def preprocess_query(query):
    query = query.lower()
    # str.replace matches literal text, so use re.sub to apply the regex patterns
    query = re.sub(r'[^\w\s]', '', query)
    query = re.sub(r'[0-9]', '', query)
    query_tokenized = nltk.word_tokenize(query)
    query_tokenized = [t for t in query_tokenized if t not in stop_words]

    return query_tokenized
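
One detail worth checking when cleaning a plain Python string: the built-in `str.replace` treats its first argument as literal text, whereas the pandas `.str.replace` used on the datasets can interpret it as a regular expression. A minimal sketch of the difference, using a made-up query:

```python
import re

query = "COVID-19 quarantine!"

# str.replace looks for the literal text "[^\w\s]", which is not in the query,
# so nothing is removed
literal = query.lower().replace(r"[^\w\s]", "")

# re.sub interprets the pattern as a regular expression
cleaned = re.sub(r"[^\w\s]", "", query.lower())
cleaned = re.sub(r"[0-9]", "", cleaned)

print(literal)
print(cleaned)
```

Only the `re.sub` version produces tokens that match the preprocessed documents, where `covid-19` was reduced to `covid`.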

Calculate the similarity between the query and each document using cosine similarity: the dot product of the two vectors divided by the product of their Euclidean norms.

`cos(A, B) = (A · B) / (||A|| * ||B||)`

In [100]:
def cosine_similarity(n_docs, query):
    print("Query: {}\n".format(query))

    tokens = preprocess_query(query)
    print(tokens)
    print()

    query_vec = vectorize_query_tokens(tokens)
    query_norm = np.linalg.norm(query_vec)

    rank = []

    for doc in tfidf_doc:
        denom = query_norm * np.linalg.norm(doc)
        # An all-zero query or document vector would cause a division by zero
        # (NaN scores sort to the top), so score it as 0 instead
        cos_sim = np.dot(query_vec, doc) / denom if denom > 0 else 0.0
        rank.append(cos_sim)

    # Sort in descending order (most relevant to least relevant)
    rank_sorted_desc = np.array(rank).argsort()[-n_docs:][::-1]

    print("============================")
    print("Search result (Top {}):    ".format(n_docs))
    print("============================")
    print()

    for doc_idx in rank_sorted_desc:
        print("Document #{}".format(doc_idx))
        print(dataset["body"][doc_idx])
        print("\n==============================================================================\n")
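
The per-document loop can also be expressed as a single matrix operation. The sketch below is one possible vectorized variant, not the notebook's implementation: `rank_documents` is a hypothetical helper, and the matrix and query are made-up values over a three-term vocabulary.

```python
import numpy as np

def rank_documents(doc_matrix, query_vec, n_docs):
    """Return (indices, scores) of the n_docs rows most similar to query_vec."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    denom = doc_norms * query_norm

    scores = np.zeros(doc_matrix.shape[0])
    nonzero = denom > 0                      # skip all-zero rows or an all-zero query
    scores[nonzero] = (doc_matrix[nonzero] @ query_vec) / denom[nonzero]

    top = np.argsort(scores)[::-1][:n_docs]  # highest score first
    return top, scores[top]

# Hypothetical 3-term vocabulary and 4 documents
toy_matrix = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],   # empty document
])
toy_query = np.array([1.0, 0.0, 0.0])

indices, scores = rank_documents(toy_matrix, toy_query, n_docs=2)
print(indices, scores)
```

Document 0 points in exactly the same direction as the query and scores 1, while document 1 shares only one of its two terms and scores about 0.71. The empty document scores 0 rather than NaN.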

Main Search Engine Program

In [112]:
while True:
    query = input("Input your search query >> ")
    cosine_similarity(n_docs=5, query=query)
    print("\n\n\n")

Query: covid quarantine

['covid', 'quarantine']

Search result (Top 5):    

Document #3
Members of the Los Angeles Lakers are under quarantine for 14 days and will be tested for COVID-19, multiple media outlets reported Tuesday.


Document #473
Infectious disease and public health experts doubt the viability of plans by Italy's government to extend quarantine measures across the entire country, saying they are probably unsustainable, and unlikely to halt the spread of COVID-19.


Document #267
The companies, and other businesses, like Instacart, have also said they would compensate workers who contract the virus or are subject to quarantine orders.


Document #327
Seven weeks after the first case of COVID-19 was confirmed in the U.


Document #513
Seven weeks after the first case of COVID-19 was confirmed in the U.








KeyboardInterrupt: 
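
The session above ends with a `KeyboardInterrupt` because the loop has no exit condition. A minimal sketch of one way to stop cleanly is shown below; `run_search_loop` and its `exit_commands` parameter are hypothetical, and scripted input replaces `input()` so the example runs unattended.

```python
def run_search_loop(read_query, search, exit_commands=("", "exit", "quit")):
    """Call search(query) for each query until an exit command or end of input."""
    while True:
        try:
            query = read_query().strip()
        except (EOFError, KeyboardInterrupt):
            break
        if query.lower() in exit_commands:
            break
        search(query)

# Demonstration with scripted input instead of input()
scripted = iter(["covid quarantine", "chip lithography", "exit", "never reached"])
handled = []
run_search_loop(lambda: next(scripted), handled.append)
print(handled)
```

In the notebook, the same loop could be driven by `lambda: input("Input your search query >> ")` and `lambda q: cosine_similarity(n_docs=5, query=q)`.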