# 3. Create a search engine using TF-IDF
0. Import libraries and dataset
1. Preprocess the text data
2. Create a TF-IDF vectoriser from the literature included in the dataset
3. Cosine similarity
4. Evaluation of the search engine using the labelled data

## 3.0. Import Libraries and Dataset

In [126]:
# Import all the required libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
import pickle

# Text preprocessing libraries
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('wordnet') 
from nltk.stem.wordnet import WordNetLemmatizer

# libraries for keyword extraction with tf-idf
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\josep\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\josep\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [73]:
# Import the pickle files created from previous notebooks
scoped_categorised_literature = pd.read_pickle("./1_scoped_cat_lit.pkl")
extracted_literature_data = pd.read_pickle("./2_extracted_literature_data.pkl")

In [123]:
extracted_literature_data.columns

Index(['extract_id', 'json_path', 'section', 'text'], dtype='object')

In [124]:
scoped_categorised_literature.columns

Index(['Date', 'question_idx', 'pdf_json_files', 'pmc_json_files', 'Study',
       'Study Link', 'Journal', 'Study Type', 'Factors', 'Influential',
       'Excerpt', 'Measure of Evidence', 'Added on'],
      dtype='object')

## 3.1. Creating a function for text preprocessing


In [97]:
# printing all the stop words
stop_words = set(stopwords.words("english"))
print(stop_words)

{'doesn', 'what', 'in', 'but', 'having', 'of', 'yourselves', 'our', 'because', 'and', 'me', 'up', "shan't", 'own', 'needn', "wouldn't", 'i', 'd', "shouldn't", 'was', 'ourselves', "couldn't", 'wasn', 'by', 'itself', 'are', 'why', 'hadn', 'between', 'my', 'most', 'under', 'about', 'or', "didn't", 'which', 'same', 'no', 'only', 'herself', 'll', 'm', 'then', 'where', 'myself', 'do', 'other', 'further', 'below', 'against', "you'd", 'her', 'these', 'until', 'off', 'each', "you're", 'few', 'being', 'above', 'it', 'am', 'does', 'has', 'who', 'you', 've', 'once', "mightn't", 'through', "it's", 'don', 'ain', 're', 'been', 'himself', "you've", 'them', 'again', 'ours', 'him', 'than', 'whom', "don't", 'all', "doesn't", 'isn', 'any', 'during', 'after', 'this', 'some', "you'll", 'down', 'for', "that'll", "hasn't", "isn't", 'its', 'is', 'how', 'with', 'from', 'be', 'that', 'out', 'his', 'shouldn', 'while', 'can', 'we', 'hers', "wasn't", 'should', 'o', 't', "haven't", 'themselves', 'had', 'couldn', 'he

In [128]:
def preprocess(inputText):
    # Define the English stop words
    stop_words = set(stopwords.words("english"))
    # Lower-case the text
    outputText = inputText.lower()
    # Convert percentages (e.g. "45%") into the string "percent"
    outputText = re.sub(r'(\d+%)', 'percent', outputText)
    # Replace runs of digits and special characters with a single space
    outputText = re.sub(r"(\d|\W)+", " ", outputText)
    # Tokenisation (split on whitespace)
    outputText = outputText.split()
    # Remove stop words
    outputText = [word for word in outputText if word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    outputText = [ps.stem(word) for word in outputText]
    # Lemmatisation (applied to the stemmed tokens)
    lem = WordNetLemmatizer()
    outputText = [lem.lemmatize(word) for word in outputText]
    # Join the tokens back into a single string
    outputText = " ".join(outputText)

    return outputText

### 3.1.1. Testing the text preprocessing


In [129]:
# Testing the preprocessing
text = scoped_categorised_literature.iloc[3]['Excerpt']
print(text)
preprocess(text)

Figure 10 shows that the number of the exposed individuals in region1 decreases from 868.52 (without controls) to 482.05 (with controls) at the end of the implementation of the proposed strategy. Figure 11 demonstrates that the number of the infected individuals in region 1 decreases from 657.01 (without controls) to 364.95 (with controls) at the end of the implementation of the proposed strategy. Also, the number of the quarantined individuals increases significantly from 10.15 (without controls) to 224.57 (with controls).


'figur show number expo individu region decreas without control control end implement propos strategi figur demonstr number infect individu region decreas without control control end implement propos strategi also number quarantin individu increas significantli without control control'
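
The output above shows that every number, including the decimals, has been dropped and that punctuation has been collapsed into spaces. The sketch below isolates just the regex-based part of `preprocess` so that behaviour can be checked without NLTK. The helper name `normalise` and the sample sentence are illustrative assumptions, not part of the notebook; stop-word removal, stemming and lemmatisation are deliberately left out.

```python
import re

def normalise(text):
    """Regex-only cleaning: lower-case, map percentages, drop digits and punctuation.

    Hypothetical helper for illustration; it mirrors the first steps of
    preprocess() but skips stop words, stemming and lemmatisation.
    """
    text = text.lower()
    # "45%" becomes the token "percent"
    text = re.sub(r"\d+%", "percent", text)
    # Any run of digits or non-word characters becomes a single space
    text = re.sub(r"(\d|\W)+", " ", text)
    return text.split()

tokens = normalise("Cases fell by 45% in region1 (with controls).")
print(tokens)
```

Note that the digit attached to `region1` is removed as well, which is why `region1` and `region 1` in the excerpt both end up as `region`.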

## 3.2. Generate TF-IDF Vector space

In [213]:
# applying data preprocessing to all the text we've extracted from the JSON file
extracted_literature_data['text'] = extracted_literature_data['text'].apply(preprocess)

In [214]:
print(extracted_literature_data['text'])

0       covid declar pandem date covid affect peopl wo...
1       sever acut respiratori syndrom coronaviru sar ...
2       label pandem covid affect peopl worldwid major...
3       facilit characteri sar cov comparison made bet...
4       studi look first confirm case ncip provid evid...
                              ...                        
6491    studi period henc time seri length longer hube...
6492    conclu meteorolog factor influenc covid transm...
6493    declar conflict interest certifi peer review a...
6494    certifi peer review author funder grant medrxi...
6495    copyright holder preprint version post march h...
Name: text, Length: 6496, dtype: object


In [215]:
# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the corpus and
# returns a sparse matrix of TF-IDF scores (one row per document)
tfidf_transform = vectorizer.fit_transform(extracted_literature_data['text'])
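
To make the weighting less of a black box, here is a minimal pure-Python sketch of what `TfidfVectorizer` computes with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf_rows` and the three toy documents are assumptions made for illustration. The sketch tokenises on whitespace only, whereas scikit-learn's default tokeniser also discards single-character tokens, so it is not a drop-in replacement.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights.

    Hypothetical helper: a simplified re-implementation for illustration only.
    """
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalise so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = ["covid affect peopl", "covid transmiss factor", "peer review author"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first toy document, `covid` receives a lower weight than `affect` because it also appears in the second document, which is the behaviour the search engine relies on to down-weight common terms.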

## 3.3. Cosine Similarity 

### 3.3.1. Testing a user query against the TF-IDF search engine

In [216]:
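# Create a test example of how to run the search engine
query = 'breathing difficulty give oxygem therapy'
# Apply the same preprocessing to the query as to the documents
query = preprocess(query)
print("after preprocessing: ", query)
# Project the query into the fitted TF-IDF vector space
query_vec = vectorizer.transform([query])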

after preprocessing:  breath difficulti give oxygem therapi


In [212]:
# Compute the cosine similarity between the query and every document in the corpus
result = cosine_similarity(tfidf_transform, query_vec)
result = [i[0] for i in result]

# Obtain the indices of the top N scores (ascending order, best match last)
N = 5
top_5_idx = np.argsort(result)[-N:]
top_5_rating = []

print(top_5_idx)
# Collect the similarity score of each top-ranked document
for i in top_5_idx:
    top_5_rating.append(round(result[i],3))
print(top_5_rating)

[  34 6021  244   31   33]
[0.158, 0.162, 0.197, 0.259, 0.382]
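
Because `np.argsort` sorts in ascending order, the best match in the output above is the last entry (document 33 with a score of 0.382). The sketch below shows the same ranking idea in plain Python, returning the best match first. The names `cosine` and `top_n` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_n(doc_vectors, query_vector, n=5):
    """Return (index, score) pairs for the n most similar documents, best first."""
    scores = [cosine(d, query_vector) for d in doc_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return [(i, round(scores[i], 3)) for i in order]

# Toy document vectors and a query vector in a three-term vocabulary
doc_vectors = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
query_vector = [1.0, 0.0, 0.0]
ranked = top_n(doc_vectors, query_vector, n=2)
print(ranked)
```

Returning index and score together avoids keeping two parallel lists in sync, and a document that shares no terms with the query simply scores 0.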


## 3.4. Evaluation of the search engine using the labelled data