# Universidade Tecnológica Federal do Paraná
## Programa de Pós-Graduação em Computação Aplicada
### Data Science 2 - 2021/1
### Team Evolution:
### Leila Fabiola Ferreira
### Mateus Cichelero da Silva
  
## Information Retrieval

This activity applies information retrieval by computing the cosine similarity between a user query and the preprocessed CORD-19 dataset.
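
As a minimal sketch of the measure used in this notebook: the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. The helper name `cosine` and the toy term-count vectors below are assumptions made for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Return 0.0 when either vector has no non-zero entries
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy term counts over the vocabulary ["air", "spread", "vaccine", "virus"]
query_vec = [1, 1, 0, 1]
doc_about_spread = [2, 1, 0, 3]
doc_about_vaccine = [0, 0, 4, 1]

print(cosine(query_vec, doc_about_spread))
print(cosine(query_vec, doc_about_vaccine))
```

The document that shares more terms with the query points in a closer direction and therefore receives the higher score, regardless of document length.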

In [None]:
import pandas as pd
import numpy as np
import nltk  
import string

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
def convert_lower_case(data):
    return np.char.lower(data)

In [None]:
def remove_stop_words(data):
    # A set gives constant-time membership tests
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(str(data))
    # Drop stop words and single-character tokens
    return " ".join(w for w in words if w not in stop_words and len(w) > 1)

In [None]:
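def remove_punctuation(data):
    # The backslash is escaped to avoid the invalid "\]" escape sequence
    symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
    for symbol in symbols:
        # Replace each symbol with a space, then collapse doubled spaces
        data = np.char.replace(data, symbol, ' ')
        data = np.char.replace(data, "  ", " ")
    # Commas are removed outright rather than replaced by a space
    data = np.char.replace(data, ',', '')
    return data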

In [None]:
def remove_apostrophe(data):
    return np.char.replace(data, "'", "")

In [None]:
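# Assumption: num2words comes from the third-party num2words package
from num2words import num2words

def convert_numbers(data):
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        try:
            # Spell out integer tokens, e.g. "19" -> "nineteen"
            w = num2words(int(w))
        except ValueError:
            # Not an integer: keep the token unchanged
            pass
        new_text = new_text + " " + w
    # num2words hyphenates compounds such as "twenty-one"
    new_text = np.char.replace(new_text, "-", " ")
    return new_text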

In [None]:
def lematiser(data):
    wnl = WordNetLemmatizer()
    tokens = word_tokenize(str(data))
    # Lemmatize each token (as a noun, the WordNetLemmatizer default) and rejoin
    return " ".join(wnl.lemmatize(w) for w in tokens)


In [None]:
def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data)
    data = remove_apostrophe(data)
    data = remove_stop_words(data)
    data = lematiser(data)
    return data

In [None]:
def doc_query_similarity(documents, query):
    # Fit one vectorizer so documents and query share a vocabulary and IDF weights
    vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True)
    doc_tfidf = vectorizer.fit_transform(documents)
    query_tfidf = vectorizer.transform([query])

    # One similarity score per document
    return cosine_similarity(query_tfidf, doc_tfidf).flatten()

Reading the preprocessed dataset

In [None]:
filepath = "/content/drive/MyDrive/Colab Notebooks/cord-19/dataset_cord19.csv"
df = pd.read_csv(filepath)

In [None]:
df.drop(['publish_time'], 
        axis='columns', inplace=True)

Getting the query from user input

In [None]:
query = input("Hello, please type your query:\n") 

Hello, please type your query:
spread air virus


Preprocessing the query so that the cosine similarity can be computed

In [None]:
query = preprocess(query)
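
The general idea behind this step is that query tokens can only match document tokens when both sides are normalized the same way. The sketch below illustrates it with a deliberately simplified normalizer that uses only the standard library; `simple_normalize` and its tiny stop-word list are hypothetical stand-ins for the notebook's NLTK-based `preprocess`, not a replacement for it.

```python
import string

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "of", "in", "a", "is", "how", "does", "through"}

def simple_normalize(text):
    """Lowercase, strip punctuation, and drop stop words and one-character tokens."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

raw_query = "How does the Virus spread through the AIR?"
doc_tokens = simple_normalize("Airborne spread of the virus in indoor air.")

# Without normalizing the query, case and punctuation hide some matches
raw_overlap = set(raw_query.split()) & set(doc_tokens)
# With the same normalization on both sides, shared terms line up
clean_overlap = set(simple_normalize(raw_query)) & set(doc_tokens)

print(sorted(raw_overlap))    # ['spread']
print(sorted(clean_overlap))  # ['air', 'spread', 'virus']
```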

Applying the function that computes the cosine similarity between the query vector and each article in the dataset

In [None]:
teste = doc_query_similarity(df['abstract'], query)
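
To make the scoring step more concrete, here is a from-scratch sketch of smoothed TF-IDF followed by cosine similarity on a three-document toy corpus. It is an illustration under stated assumptions, not the notebook's implementation: the function names and the corpus are made up, tokenization is a plain whitespace split, and the IDF formula `ln((1 + n) / (1 + df)) + 1` is the one scikit-learn documents for `smooth_idf=True`.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized TF-IDF weights; terms outside the fitted vocabulary are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def sparse_cosine(u, v):
    """For unit-length sparse vectors, the dot product equals the cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical toy corpus, already lowercased and cleaned
docs = [
    "virus spread air droplet",
    "vaccine trial result",
    "air quality indoor ventilation",
]
tokenized = [d.split() for d in docs]
idf = fit_idf(tokenized)
doc_vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]

query_vector = tfidf_vector("spread air virus".split(), idf)
scores = [sparse_cosine(query_vector, dv) for dv in doc_vectors]
print(scores)
```

The first document shares three terms with the query and scores highest, the third shares only `air`, and the second shares nothing and scores exactly zero. Fitting the IDF weights once and reusing them for the query keeps both vectors in the same space.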

Sorting the scores to get the document with the highest similarity

In [None]:
df['score'] = teste
df = df.sort_values(by=['score'], ascending=False)
# Build a recommendation message from the top-ranked article
res = f"Consider reading the article: {df['title'].iloc[0]} at the link(s) {df['url'].iloc[0]}"

In [None]:
df.head()

Unnamed: 0,title,abstract,url,score
71945,A predictive model for disease progression in ...,A predictive model for Corona Virus Disease 2019,https://www.ncbi.nlm.nih.gov/pubmed/32430433/;...,0.588206
43119,Suggestions for changes in professional proced...,Abstracts The COVID-19 (COrona Virus Disease 2...,https://doi.org/10.7416/ai.2021.2434; https://...,0.561158
22557,The Implications of COVID-19 in Radiation Onco...,The corona virus disease of 2019 (covid-19) ha...,https://doi.org/10.3747/co.27.7095; https://ww...,0.535063
63228,Understanding the fate of corona virus transmi...,We propose a simple model for understanding th...,https://arxiv.org/pdf/2003.10530v1.pdf,0.463793
119429,A Computer Simulation Study on novel Corona Vi...,The World Health Organization (WHO) on March 1...,http://medrxiv.org/cgi/content/short/2020.05.1...,0.448636


In [None]:
print(res)

Consider reading the article: A predictive model for disease progression in non-severe illness patients with Corona Virus Disease 2019 at the link(s) https://www.ncbi.nlm.nih.gov/pubmed/32430433/; https://doi.org/10.1183/13993003.01234-2020
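
Sorting the whole DataFrame works at this scale, but when only a handful of results is needed a partial selection is enough. The sketch below shows the idea with the standard library; `top_k` and the sample scores and titles are hypothetical, and in pandas `DataFrame.nlargest` plays a similar role.

```python
import heapq

def top_k(scores, k=3):
    """Return (index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Made-up similarity scores, one per document
scores = [0.12, 0.59, 0.00, 0.46, 0.56]
titles = ["doc A", "doc B", "doc C", "doc D", "doc E"]

for idx, score in top_k(scores, k=3):
    print(f"{titles[idx]}: {score:.2f}")
```

Returning several ranked candidates instead of a single article also gives the user alternatives when the top hit is not relevant.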
