# 3. Create a search engine using TF-IDF
0. Import libraries and dataset
1. Create a function for text preprocessing
2. Generate the TF-IDF vector space from the literature included in the dataset
3. Cosine similarity
4. Evaluate the search engine using the labelled data

## 3.0. Import Libraries and Dataset

In [19]:
# Import all the required libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
import pickle

# Text preprocessing libraries
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('wordnet') 
from nltk.stem.wordnet import WordNetLemmatizer

# libraries for keyword extraction with tf-idf
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\josep\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\josep\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [20]:
# Import the pickle files created from previous notebooks
scoped_categorised_literature = pd.read_pickle("./1_scoped_cat_lit.pkl")
extracted_literature_data = pd.read_pickle("./2_extracted_literature_data.pkl")

In [21]:
extracted_literature_data.columns

Index(['extract_id', 'json_path', 'section', 'text'], dtype='object')

In [22]:
scoped_categorised_literature.columns

Index(['Date', 'topic_id', 'research_topic', 'pdf_json_files',
       'pmc_json_files', 'Study', 'Study Link', 'Journal', 'Study Type',
       'Factors', 'Influential', 'Excerpt', 'Measure of Evidence', 'Added on'],
      dtype='object')

## 3.1. Creating function for text preprocessing


In [23]:
# printing all the stop words
stop_words = set(stopwords.words("english"))
print(stop_words)

{'now', 'as', 'hasn', 'the', 'are', 'an', 'all', 'because', 'my', 'he', 'each', "isn't", 'there', 'o', 'from', 'why', 'shan', 'during', 'do', 'doing', 'them', 'where', "aren't", "hadn't", 'other', 'm', "hasn't", "shan't", 'while', 'had', 'what', 'about', 'mustn', "she's", 'at', 'these', 'this', 'y', 'him', 'mightn', "doesn't", 'out', 'not', 'wasn', 'me', "you'll", 'if', 'to', 'theirs', 'i', 'didn', "weren't", 'needn', 'yours', 'with', 'but', 'just', 'most', 'in', 'yourself', 's', 'itself', 'same', 'up', 'once', "you've", 'few', "mustn't", "that'll", 'here', 'then', 'over', 'which', 'can', 've', 'she', 'before', 'above', 'wouldn', 'been', "couldn't", "didn't", 'couldn', 'be', 'myself', 'no', 'is', 'so', 'after', 'for', 'too', 'they', 're', 'only', "don't", 'further', 'nor', "won't", 'did', 'or', 'it', 'himself', 'our', 'who', 'against', 'ma', 'was', 'those', 'of', 'how', "mightn't", 'should', "wouldn't", 'own', 'and', "you're", 'hers', 'when', 'some', "it's", 'being', 'does', 'doesn', "

In [24]:
def preprocess(inputText):
    # Define stop words
    stop_words = set(stopwords.words("english"))
    # Lower-case the text
    outputText = inputText.lower()
    # Convert percentages (e.g. "28%") into the string "percent"
    outputText = re.sub(r'(\d+%)', 'percent', outputText)
    # Replace runs of digits and special characters with a single space
    outputText = re.sub(r"(\d|\W)+", " ", outputText)
    # Tokenisation
    outputText = outputText.split()
    # Remove stop words
    outputText = [word for word in outputText if word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    outputText = [ps.stem(word) for word in outputText]
    # Lemmatisation (applied to the stemmed tokens)
    lem = WordNetLemmatizer()
    outputText = [lem.lemmatize(word) for word in outputText]
    # Join the tokens back into a single string
    outputText = " ".join(outputText)

    return outputText

### 3.1.1. Testing the text pre-processing


In [25]:
# Testing the preprocessing
text = scoped_categorised_literature.iloc[3]['Excerpt']
print(text)
preprocess(text)

Figure 10 shows that the number of the exposed individuals in region1 decreases from 868.52 (without controls) to 482.05 (with controls) at the end of the implementation of the proposed strategy. Figure 11 demonstrates that the number of the infected individuals in region 1 decreases from 657.01 (without controls) to 364.95 (with controls) at the end of the implementation of the proposed strategy. Also, the number of the quarantined individuals increases significantly from 10.15 (without controls) to 224.57 (with controls).


'figur show number expo individu region decreas without control control end implement propos strategi figur demonstr number infect individu region decreas without control control end implement propos strategi also number quarantin individu increas significantli without control control'
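
The two regular-expression steps explain why the numbers and brackets disappear from the output above. The following is a minimal sketch of only those steps, using the standard library; the helper name `normalise` and the sample sentence are illustrative assumptions, and stop-word removal, stemming and lemmatisation are deliberately left out.

```python
import re

def normalise(text):
    """Lower-case, map percentages to 'percent', drop digits and punctuation."""
    text = text.lower()
    # "28%" becomes the token "percent"
    text = re.sub(r"\d+%", "percent", text)
    # Any run of digits and non-word characters collapses to one space
    text = re.sub(r"(\d|\W)+", " ", text)
    return text.split()

# Hypothetical sample sentence, not taken from the dataset
tokens = normalise("Lockdown showed highest reduction (28%) in 14.5 days.")
print(tokens)
```

Because the percentage substitution runs before the digit removal, the information that a percentage was mentioned survives as a token while the value itself is discarded.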

## 3.2. Generate TF-IDF vector space

In [26]:
# applying data preprocessing to all the text we've extracted from the JSON file
processed_extracted_literature_data = extracted_literature_data['text'].apply(preprocess)

In [27]:
print(processed_extracted_literature_data)

0       background social distanc effort success slow ...
1       sar cov emerg central china late lead ongo pan...
2       initi data suggest earli social distanc guidel...
3       goal maintain reproduct number le one strateg ...
4       although rigor popul base serosurvey sar cov u...
                              ...                        
6491    peer review copyright holder preprint http doi...
6492    peer review copyright holder preprint http doi...
6493    peer review copyright holder preprint http doi...
6494    peer review copyright holder preprint http doi...
6495    peer review copyright holder preprint http doi...
Name: text, Length: 6496, dtype: object


In [28]:
# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then converts every document into a TF-IDF vector
tfidf_transform = vectorizer.fit_transform(processed_extracted_literature_data)
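
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how scikit-learn documents the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`). It tokenises on whitespace only, so it is an illustration rather than a drop-in replacement; the function name `tfidf_vectors` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Hypothetical, already-preprocessed toy documents
toy_docs = ["oxygen therapi patient", "patient quarantin control", "oxygen oxygen suppli"]
vocab, vectors = tfidf_vectors(toy_docs)
print(vocab)
print([round(w, 3) for w in vectors[0]])
```

In the first toy document, `therapi` appears in only one document and therefore receives a larger weight than `oxygen` or `patient`, which each appear in two.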

## 3.3. Cosine Similarity 

### 3.3.1. Testing a user query on the TF-IDF search engine

In [29]:
# Create a test example of how to run the search engine
query = 'breathing difficulty give oxygem therapy'
query = preprocess(query)
query_vec = vectorizer.transform([query])
print(query_vec)

  (0, 6636)	0.5407612421224066
  (0, 2613)	0.40779661991205124
  (0, 1687)	0.507304610499241
  (0, 739)	0.5328425921158624


In [30]:
# Use cosine_similarity to score the query against every document in the corpus
result = cosine_similarity(tfidf_transform, query_vec)
result = [i[0] for i in result]

# Obtain the indices of the top 5 values (best first) and collect their scores and text
N = 5
top_5_idx = np.argsort(result)[-N:]
top_5_idx = top_5_idx.tolist()
top_5_idx.reverse()
top_5_score = []
top_5_text =[]
for i in top_5_idx:
    top_5_score.append(round(result[i],3))
    top_5_text.append(extracted_literature_data.iloc[i]['text'])

test_df = pd.DataFrame(zip(top_5_idx, top_5_score, top_5_text), columns = ['idx', 'score', 'text'])
print(test_df)

    idx  score                                               text
0  5637  0.382  • Provide supplemental oxygen therapy immediat...
1  5635  0.259  For patients with severe disease (Figure 6) , ...
2   149  0.196  The registered cases continued to increase rap...
3  5207  0.162  In this manuscript, a method is presented that...
4  5638  0.158  • Closely monitor patients with SARI in case o...
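
The retrieval step above (score every document, then keep the N best) can be sketched without any third-party library. This is a minimal illustration on hand-written toy vectors; the helper names `cosine` and `top_n` are assumptions and are not part of the notebook or of scikit-learn.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n(doc_vectors, query_vector, n=5):
    """Return (index, score) pairs for the n most similar documents, best first."""
    scores = [cosine(doc, query_vector) for doc in doc_vectors]
    best = heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)
    return [(i, round(scores[i], 3)) for i in best]

# Hypothetical toy vectors standing in for rows of the TF-IDF matrix
toy_doc_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
hits = top_n(toy_doc_vectors, [1.0, 0.0, 0.0], n=2)
print(hits)
```

When several documents have identical scores, this helper may order them differently from the `np.argsort` approach used in the cell above.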


## 3.4. Evaluation of the search engine using the labelled data
The goal is to ask the engine the 11 key questions that researchers want answered.
For each question, at least one of the top N results returned by the engine should be a paragraph relevant to what the researcher is looking for.
Performance is measured using the labelled data:

1. Run each labelled excerpt through the search engine and keep the top 1 paragraph, i.e. the one with the highest cosine similarity score.
2. This yields a list of paragraphs (as search engine indices) grouped by topic question.
3. Query the search engine with the research topic itself and take the top N results (here N = 5).
4. Check whether any of the top N results appear in the list of paragraphs obtained from the labelled data.
5. A query counts as a correct response if at least one match occurs.
6. Divide the number of correct responses by the total number of queries to get the accuracy of the search engine model.
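
The scoring rule in the steps above can be sketched as follows. The function name `hit_rate` and both dictionaries are made-up toy values for illustration; they are not results from the dataset.

```python
def hit_rate(top_n_by_topic, labelled_idx_by_topic):
    """Fraction of topics whose top-N results contain at least one labelled paragraph index."""
    hits = sum(
        1
        for topic_id, top_n in top_n_by_topic.items()
        if set(top_n) & labelled_idx_by_topic.get(topic_id, set())
    )
    return hits / len(top_n_by_topic)

# Hypothetical top-3 results per topic (step 3)
toy_top_n = {1: [10, 11, 12], 2: [20, 21, 22], 3: [30, 31, 32]}
# Hypothetical paragraph indices matched from labelled excerpts (steps 1 and 2)
toy_labelled = {1: {12, 99}, 2: {5}, 3: {30}}

accuracy = hit_rate(toy_top_n, toy_labelled)
print(round(accuracy, 3))
```

In this toy case topics 1 and 3 each share one index with their labelled set and topic 2 shares none, so two of the three queries count as correct.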

### 3.4.1. Obtaining the list of key research topics with their topic ids

In [31]:
# Using the labelled dataset
topic_list = scoped_categorised_literature[['topic_id', 'research_topic']].drop_duplicates()
print(topic_list)

    topic_id                                     research_topic
1          1  Effectiveness of a multifactorial strategy to ...
0          2  Effectiveness of case isolation_isolation of e...
0          3       Effectiveness of community contact reduction
0          4    Effectiveness of inter_inner travel restriction
0          5                 Effectiveness of school distancing
1          6  Effectiveness of workplace distancing to preve...
10         7  Evidence that domesticated_farm animals can be...
0          8  How does temperature and humidity affect the t...
0          9  Methods to understand and regulate the spread ...
0         10                        Seasonality of transmission
0         11  What is the likelihood of significant changes ...


### 3.4.2. Find the top match for the labelled data through the search engine

In [32]:
topic_id_list = []
excerpt_list = []
top_idx_list = []
top_score_list = []
top_text_list = []
for index, row in scoped_categorised_literature.iterrows():
    topic_id_list.append(row['topic_id'])
    excerpt_list.append(row['Excerpt'])
    
    query = preprocess(row['Excerpt'])
    query = vectorizer.transform([query])
    cos_result = cosine_similarity(tfidf_transform, query)
    cos_result = [i[0] for i in cos_result]
    top_idx = np.argsort(cos_result)[-1]
    top_score = round(cos_result[top_idx],3)
    
    top_idx_list.append(top_idx)
    top_score_list.append(top_score)
    top_text_list.append(extracted_literature_data.iloc[top_idx]['text'])

labelled_topic_output = pd.DataFrame(zip(topic_id_list,excerpt_list,top_idx_list,top_score_list,top_text_list), 
                                     columns = ['topic_id', 'exceprt', 'top_idx', 'top_score', 'top_text'])

In [33]:
labelled_topic_output

Unnamed: 0,topic_id,exceprt,top_idx,top_score,top_text
0,1,"Comparing these four scenarios, we shall deduc...",1060,0.638,Par. Scenarios B 1 and B 2 show cases in which...
1,1,Our study reveals that the strict control meas...,3671,0.571,Background: The ongoing COVID-19 epidemic dila...
2,1,We then compare the transmission rates in diff...,3637,0.799,"more cases within a week, implying a fast grow..."
3,1,Figure 10 shows that the number of the exposed...,188,0.827,"To realize this strategy, we apply only the co..."
4,1,Lockdown showed highest reduction (28%) in num...,255,0.643,The copyright holder for this preprint this ve...
...,...,...,...,...,...
395,11,"Generally, the curves tended to be not associa...",2252,0.898,is the (which was not peer-reviewed) The copyr...
396,11,We find the high temperature and relative humi...,5750,0.657,We find that temperature negatively relates to...
397,11,We find the high temperature and relative humi...,5750,0.657,We find that temperature negatively relates to...
398,11,"The regression model, demonstrates that both a...",3997,0.948,Relationship with environmental factors. The r...


In [34]:
# Check whether the input excerpt matches the paragraph returned by the search engine
x = 5
print(labelled_topic_output.iloc[x]['exceprt'])
print('____________________________________________')
print(labelled_topic_output.iloc[x]['top_text'])

The epidemic would ultimately infect approximately 77% of the population (Fig 2B) and result in around 350 thousand fatalities among individuals aged over 60, and around 60 thousand aged below 60 (Fig 2C). Sustained social-distancing by older individuals (assumed to result in a 90% reduction in contacts with individuals under 25, a 70% reduction with 25-59, and a 50% reduction between one another), and moderately effective self-isolation by symptomatic individuals (at 20% efficacy) results in a shallower epidemic curve (Fig 2D) and a much smaller outbreak size among individuals aged 60+
____________________________________________
Sustained social-distancing by older individuals (assumed to result in a 90% reduction in contacts with individuals under 25, a 70% reduction with 25-59, and a 50% reduction between one another), and moderately effective self-isolation by symptomatic individuals (at 20% efficacy) results in a shallower epidemic curve ( Fig 2D) and a much smaller outbreak size

### 3.4.3. Using the research topic itself, query the search engine and get top 5 results

In [35]:
# now query the search engine based on the scientific question
topic_question_top_5_result_list = []

for index, row in topic_list.iterrows():  
    query = preprocess(row['research_topic'])
    query = vectorizer.transform([query])    
    result = cosine_similarity(tfidf_transform, query)
    result = [i[0] for i in result]
    # Obtain the indices of the top 5 values, best first
    N = 5
    top_5_idx = np.argsort(result)[-N:]
    top_5_idx = top_5_idx.tolist()
    top_5_idx.reverse()
    
    output = [row['topic_id']] + top_5_idx
    topic_question_top_5_result_list.append(output)

for topic_question_top_5_result in topic_question_top_5_result_list:
    print(topic_question_top_5_result)

[1, 194, 195, 190, 5631, 196]
[2, 4653, 1773, 4069, 196, 1785]
[3, 2321, 4950, 2297, 1153, 505]
[4, 1844, 651, 3235, 621, 5738]
[5, 2431, 2323, 84, 2575, 2462]
[6, 1451, 1472, 2162, 1584, 1479]
[7, 164, 159, 464, 2140, 154]
[8, 5750, 1735, 3416, 335, 681]
[9, 3449, 6321, 3494, 5937, 683]
[10, 870, 2676, 6441, 2215, 6439]
[11, 5182, 6137, 5329, 1706, 5174]


### 3.4.4. Evaluate the accuracy of the search engine
1. Check whether any of the top 5 results from each key topic query match the indices obtained from the labelled data.
2. Only one match needs to occur per query to count as a correct response.
3. Divide the number of correct responses by the total number of queries to get the accuracy of the search engine model.

In [36]:
match_count = 0
questions_asked = 0
for topic_question_top_5_result in topic_question_top_5_result_list:
    # The first element is the topic id; the remaining elements are the top 5 paragraph indices
    topic_id = topic_question_top_5_result[0]
    # Paragraph indices retrieved for the labelled excerpts of this topic
    labelled_idx = set(labelled_topic_output.loc[labelled_topic_output['topic_id'] == topic_id, 'top_idx'])
    # A query is correct if at least one of its top 5 indices is in the labelled set
    if any(idx in labelled_idx for idx in topic_question_top_5_result[1:]):
        match_count = match_count + 1
    questions_asked = questions_asked + 1

print("The accuracy of the Search Engine based on the above criteria is: ", match_count, "/", questions_asked, "=", round(match_count / questions_asked,3))

The accuracy of the Search Engine based on the above criteria is:  2 / 11 = 0.182
