## Identify similar papers based on cosine similarity

To test whether similar papers can be identified automatically from their abstracts, I start with universal-sentence-encoder, which converts each paper abstract into a vector of size 512. I then rank all abstracts by cosine similarity to a query abstract and keep the most similar ones (the code below retrieves the top 10; the first hit has similarity 1.0 and is the query abstract itself).
See the example output below. The input dataset was generated with the method from the notebooks *Keyword window "polymerase + therapeut" R* and *Part 2 TFIDF Clustering at Abstract level*. It is a subset of the CORD-19 data containing 1941 "polymerase"-related paper abstracts.

https://www.kaggle.com/leijiang1/keyword-window-polymerase-therapeut-r

https://www.kaggle.com/leijiang1/part-2-tfidf-clustering-at-abstract-level
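As a quick illustration of the similarity measure used throughout this notebook, here is a minimal sketch of cosine similarity on toy vectors. The 3-dimensional vectors and the helper name `cosine_similarity` are made up for the example; they are not encoder output.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (made-up numbers, not real encoder output).
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]  # parallel to the query, different length
orthogonal = [0.0, 3.0, 0.0]      # shares no direction with the query
in_between = [1.0, 1.0, 0.0]      # 45 degrees away from the query

for name, vec in [("same_direction", same_direction),
                  ("orthogonal", orthogonal),
                  ("in_between", in_between)]:
    print(name, round(cosine_similarity(query, vec), 4))
```

Cosine similarity ignores vector length and only compares direction, which is why `same_direction` scores the same as the query itself would.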


For test 2, after manually reading the input, I think the original abstract focuses on influenza medicine, and my top results also focus on influenza. I chose abstract-level analysis because I think using the full body text for topic modeling risks picking up too much noise. I chose a sentence encoder because it is better not to tokenize medical terms ourselves (they are domain-specific).



### Pros of my method
My method uses transfer learning via universal-sentence-encoder, which means I stand on the shoulders of giants. 🙂 

### Cons of my method
The outputs have to be read manually to validate whether the results are sound.
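One partial way to reduce the manual reading is to check how often the retrieved neighbours fall into the same cluster as the query. The sketch below is only an illustration under assumptions: the helper `neighbour_label_agreement` is hypothetical, the labels are made up, and agreement with a clustering is a rough proxy, not proof that the results are relevant.

```python
import numpy as np

def neighbour_label_agreement(labels, query_index, neighbour_indexes):
    """Fraction of neighbours (excluding the query) sharing the query's label."""
    labels = np.asarray(labels)
    neighbours = [i for i in neighbour_indexes if i != query_index]
    if not neighbours:
        return 0.0
    return float(np.mean(labels[neighbours] == labels[query_index]))

# Made-up cluster labels for six documents (not taken from the real dataset).
labels = np.array([0, 0, 1, 0, 2, 1])

# Pretend document 0 is the query and these are its retrieved neighbours.
agreement = neighbour_label_agreement(labels, query_index=0,
                                      neighbour_indexes=[0, 1, 3, 2, 4])
print(agreement)  # neighbours 1, 3, 2, 4 have labels 0, 0, 1, 2, so 2 of 4 agree
```

With the DataFrame used later in this notebook, the labels could perhaps be derived as `df[[f"G{i}" for i in range(1, 11)]].values.argmax(axis=1)`, assuming the `G1`..`G10` columns are one-hot TF-IDF cluster indicators.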


### Universal Sentence Encoder family

There are several different Universal Sentence Encoder models. I think universal-sentence-encoder-qa should be the best fit for the question-answering task.
However, I could not get it to run (ResourceExhaustedError).

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/3") # slow to load locally, fast on cloud; worked well


embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5") # took about 1-2 min to load even on cloud; got ResourceExhaustedError at the next step
The universal-sentence-encoder-large model is trained with a Transformer encoder. I tried to run it on cloud but got a ResourceExhaustedError.


embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/3?tf-hub-format=compressed") # tried the compressed large model; got TypeError: 'AutoTrackable' object is not callable

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-qa/3") # ResourceExhaustedError
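A ResourceExhaustedError usually means the model ran out of memory, and encoding all abstracts in a single call is one possible cause. The sketch below shows a common workaround, encoding in small batches. I have not verified that it fixes the errors above; `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in so the example runs without downloading a model.

```python
import numpy as np

def embed_in_batches(embed_fn, texts, batch_size=64):
    """Encode `texts` chunk by chunk and stack the results into one array."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Convert each batch to NumPy right away so only one batch of
        # model output has to be held by the encoder at a time.
        chunks.append(np.asarray(embed_fn(batch)))
    return np.concatenate(chunks, axis=0)

def fake_embed(batch):
    """Stand-in encoder for demonstration: maps each text to a 4-d vector."""
    return np.array([[len(t), t.count(" "), 1.0, 0.0] for t in batch],
                    dtype=np.float32)

texts = ["a b", "c", "d e f"]
vectors = embed_in_batches(fake_embed, texts, batch_size=2)
print(vectors.shape)
```

With the real model, the call might look like `embed_in_batches(lambda b: embed(b)["outputs"], data1list, batch_size=64)`, reusing the `embed(...)["outputs"]` call from the cells below; the batch size of 64 is an arbitrary starting point.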

### Use methods in combination

Furthermore, after finding the most similar papers/abstracts, we can use the Keyword window method described in previous posts to retrieve more detailed information. So these methods can be used in combination to achieve better results.
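The keyword-window notebook is not reproduced here. As a rough illustration of how the two steps could be chained, the sketch below assumes that "keyword window" means extracting a few words on either side of each keyword match; the function `keyword_windows` and its `width` parameter are hypothetical and may differ from the original R implementation.

```python
def keyword_windows(text, keyword, width=5):
    """Return snippets of `width` words either side of each keyword match."""
    words = text.split()
    key = keyword.lower()
    windows = []
    for i, word in enumerate(words):
        if key in word.lower():
            start = max(0, i - width)
            windows.append(" ".join(words[start:i + width + 1]))
    return windows

# A sentence fragment taken from test abstract 1 in this notebook.
sample = ("quercetin potently inhibited the activity of the ev71 protease "
          "3cpro blocking viral replication")
hits = keyword_windows(sample, "protease", width=2)
print(hits)
```

Applied to the abstracts returned by the similarity search, this would narrow each hit down to the passages that mention the keyword of interest.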



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/polymerasesubset/POLYMERASE_ABSTRACT_TFIDF_10clustersResultsWithOriginalABS.csv
/kaggle/input/CORD-19-research-challenge/metadata.csv
/kaggle/input/CORD-19-research-challenge/COVID.DATA.LIC.AGMT.pdf
/kaggle/input/CORD-19-research-challenge/json_schema.txt
/kaggle/input/CORD-19-research-challenge/metadata.readme
... (remaining .json paths under noncomm_use_subset/, biorxiv_medrxiv/, comm_use_subset/ and custom_license/ omitted)

In [2]:
import pandas as pd 
df = pd.read_csv('/kaggle/input/polymerasesubset/POLYMERASE_ABSTRACT_TFIDF_10clustersResultsWithOriginalABS.csv')
df.head()

Unnamed: 0,index,ABS,0,1,2,3,4,5,6,7,...,G1,G2,G3,G4,G5,G6,G7,G8,G9,G10
0,1,abstract astrocytes produce granulocytemacroph...,-0.050448,0.017385,-0.039777,-0.067159,-0.029633,0.074573,-0.050444,-0.010799,...,0,0,0,0,0,1,0,0,0,0
1,2,abstract replication of avian infectious bronc...,-0.128422,-0.084803,0.084813,-0.013748,0.006486,0.128668,0.032655,0.066775,...,0,0,0,0,0,1,0,0,0,0
2,3,abstract the infectivity of vesicular stomatit...,-0.095019,-0.032279,0.017571,-0.06586,0.001315,0.048199,-0.031072,0.010103,...,0,0,0,0,0,1,0,0,0,0
3,4,abstract two temporally and enzymatically dist...,-0.134657,-0.086097,0.026415,-0.027553,-0.02028,-0.044637,-0.013312,-0.033345,...,0,0,0,0,0,0,1,0,0,0
4,5,abstract rnadependent rna polymerase rdrp acti...,-0.222713,-0.100353,0.107456,-0.079828,0.030818,-0.09104,-0.014809,0.012756,...,0,0,0,0,0,0,1,0,0,0


In [5]:
df.shape

(1941, 42)

In [6]:
df.isnull().sum()

index    0
ABS      0
0        0
1        0
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9        0
10       0
11       0
12       0
13       0
14       0
15       0
16       0
17       0
18       0
19       0
20       0
21       0
22       0
23       0
24       0
25       0
26       0
27       0
28       0
29       0
G1       0
G2       0
G3       0
G4       0
G5       0
G6       0
G7       0
G8       0
G9       0
G10      0
dtype: int64

In [7]:
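# Unique abstracts as a NumPy array. Positions in data1 are later used to index
# df.ABS, which is only valid while no duplicates are dropped (both have 1941 rows here).
data1 = np.array(df.ABS.drop_duplicates(keep='last'))
data1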

array(['abstract astrocytes produce granulocytemacrophage colonystimulating factor gmcsf and support the survival and proliferation of microglia to study the functions of gmcsf in the central nervous system cns we examined the effects of gmcsf on cytokine production by glial cells gmcsf induced interleukin6 il6 production by microglia but not by astrocytes in a dosedependent manner as assessed by bioassay and the detection of il6 mrna by reverse transcriptasepolymerase chain reaction rtpcr analysis gmcsf did not induce tumor necrosis factor tnfa or il1 in microglia and astrocytes whereas lipopolysaccharide induced all these cytokines the induction of il6 by gmcsf in microglia was completely inhibited by antibodies to gmcsf neither il3 nor macrophagecsf mcsf induced il6 production in microglia given that il1 and tnf monokines derived from microglia induce il6 production in astrocytes but not in microglia results indicate that astrocytes and microglia may mutually regulate il6 production

In [8]:
data1.shape


(1941,)

In [9]:
data1list=data1.tolist()

In [3]:
%%capture
# Install the latest Tensorflow version.
#!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
!pip3 install tensorflow-hub
#!pip3 install seaborn

In [4]:
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/3")

In [10]:
embeddings = embed(data1list)["outputs"]

#print(embeddings)

In [11]:
embeddings.shape

TensorShape([1941, 512])

In [12]:
NParray1941papers512vector=np.array(embeddings)
NParray1941papers512vector

array([[ 0.02409276,  0.00333246,  0.02617527, ..., -0.05257908,
        -0.02694643, -0.06477401],
       [ 0.00073825,  0.02878778,  0.00838286, ..., -0.05337695,
        -0.03477789,  0.00727506],
       [-0.02900779,  0.03677293,  0.01629622, ..., -0.04934358,
        -0.00588756, -0.03237622],
       ...,
       [-0.0604236 , -0.00013403, -0.05867178, ..., -0.05803579,
        -0.06144454, -0.05860412],
       [-0.01348849,  0.04927895, -0.05944375, ..., -0.03317856,
        -0.04884239, -0.04074986],
       [ 0.01909221,  0.01242393, -0.05794841, ..., -0.04703324,
        -0.00612303, -0.06096034]], dtype=float32)
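With the embedding matrix in hand, the ranking step can be wrapped in a small reusable helper. This is a minimal sketch, not the exact code used in this notebook: `top_k_similar` is a hypothetical name, and the random matrix is a stand-in for the real 1941 x 512 embeddings so the example runs on its own.

```python
import numpy as np

def top_k_similar(corpus_vectors, query_vector, k=10):
    """Return (indexes, similarities) of the k rows most similar to the query."""
    corpus = np.asarray(corpus_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float).reshape(-1)  # accepts (1, d) or (d,)
    # After L2-normalisation, cosine similarity is a plain dot product.
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = corpus_unit @ query_unit
    order = np.argsort(sims)[::-1][:k]  # highest similarity first
    return order, sims[order]

# Random stand-in for the embedding matrix (not real encoder output).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 512))
query = corpus[42]  # a query that is itself part of the corpus

idx, scores = top_k_similar(corpus, query, k=5)
print(idx)
print(scores)
```

When the query is itself a row of the corpus, it comes back as the first hit with similarity 1.0, so the remaining k - 1 rows are the actual neighbours.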

In [13]:
testAbstract1=["background enterovirus 71 ev71 is one of the major causative agents of hand foot and mouth disease hfmd which is sometimes associated with severe central nervous system disease in children there is currently no specific medication for ev71 infection quercetin one of the most widely distributed flavonoids in plants has been demonstrated to inhibit various viral infections however investigation of the antiev71 mechanism has not been reported to date methods the antiev71 activity of quercetin was evaluated by phenotype screening determining the cytopathic effect cpe and ev71induced cells apoptosis the effects on ev71 replication were evaluated further by determining virus yield viral rna synthesis and protein expression respectively the mechanism of action against ev71 was determined from the effective stage and timeofaddition assays the possible inhibitory functions of quercetin via viral 2apro 3cpro or 3dpol were tested the interaction between ev71 3cpro and quercetin was predicted and calculated by molecular docking results quercetin inhibited ev71mediated cytopathogenic effects reduced ev71 progeny yields and prevented ev71induced apoptosis with low cytotoxicity investigation of the underlying mechanism of action revealed that quercetin exhibited a preventive effect against ev71 infection and inhibited viral adsorption moreover quercetin mediated its powerful therapeutic effects primarily by blocking the early postattachment stage of viral infection further experiments demonstrated that quercetin potently inhibited the activity of the ev71 protease 3cpro blocking viral replication but not the activity of the protease 2apro or the rna polymerase 3dpol modeling of the molecular binding of the 3cproquercetin complex revealed that quercetin was predicted to insert into the substratebinding pocket of ev71 3cpro blocking substrate recognition and thereby inhibiting ev71 3cpro activity conclusions quercetin can effectively prevent ev71induced cell injury "
"with low toxicity to host cells quercetin may act in more than one way to deter viral infection exhibiting some preventive and a powerful therapeutic effect against ev71 further quercetin potently inhibits ev71 3cpro activity thereby blocking ev71 replication"]

In [14]:
testAbstract2=["background to investigate the effects and immunological mechanisms of the traditional chinese medicine xinjiaxiangruyin on controlling influenza virus fm1 strain infection in mice housed in a hygrothermal environment methods mice were housed in normal and hygrothermal environments and intranasally infected with influenza virus fm1 a highperformance liquid chromatography fingerprint of xinjiaxiangruyin was used to provide an analytical method for quality control realtime quantitative polymerase chain reaction rtqpcr was used to measure messenger rna expression of tolllike receptor 7 tlr7 myeloid differentiation primary response 88 myd88 and nuclear factorkappa b nfb p65 in the tlr7 signaling pathway and virus replication in the lungs western blotting was used to measure the expression levels of tlr7 myd88 and nfb p65 proteins flow cytometry was used to detect the proportion of th17tregulatory cells results xinjiaxiangruyin effectively alleviated lung inflammation in c57bl6 mice in hot and humid environments guizhimahuanggebantang significantly reduced lung inflammation in c57bl6 mice the expression of tlr7 myd88 and nfb p65 mrna in lung tissue of wt mice in the normal environment gzmhgbt group was significantly lower than that in the model group p  005 in wt mice exposed to the hot and humid environment the expression levels of tlr7 myd88 and nfb p65 mrna in the xjxry group were significantly different from those in the virus group the expression levels of tlr7 myd88 and nfb p65 protein in lung tissue of wt mice exposed to the normal environment gzmhgbt group was significantly lower than those in the model group in wt mice exposed to hot and humid environments the expression levels of tlr7 myd88 and nfb p65 protein in xjxry group were significantly different from those in the virus group conclusion guizhimahuanggebantang demonstrated a satisfactory therapeutic effect on mice infected with the influenza a virus fm1 strain in a normal environment and "
"xinjiaxiangruyin demonstrated a clear therapeutic effect in damp and hot environments and may play a protective role against influenza through downregulation of the tlr7 signal pathway"]

In [15]:
Question1=['What are clinical effective therapeutics or drugs for COVID-19?']

In [16]:
embeddingsT1 = embed(testAbstract1)["outputs"]
embeddingsT2 = embed(testAbstract2)["outputs"]

embeddingsQ1 = embed(Question1)["outputs"]

test1=np.array(embeddingsT1)
test2=np.array(embeddingsT2)

question1=np.array(embeddingsQ1)

import textwrap

In [17]:
# Cosine similarity between every paper vector and the query vector test1 (shape (1, 512)).
result1 = np.sum(NParray1941papers512vector*test1,axis=1)/(np.sqrt(np.sum(NParray1941papers512vector*NParray1941papers512vector,axis=1))*np.sqrt(np.sum(test1*test1)))
# Indexes of the 10 largest similarities, highest first.
maxRows1=result1.argsort()[-10:][::-1]  #https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array
print("The indexes for most similar papers are:") 
print(maxRows1)
print("\n")
print("The cosine similarities for the top 10 papers are:") 
print(result1[maxRows1])
print("\n")
print("For Paper Abstract:\n")
print(textwrap.fill(testAbstract1[0],100))
print("\nWe found the top 10 most similar papers as listed below:\n")
print(df.ABS.iloc[maxRows1])

The indexes for most similar papers are:
[ 984 1499 1937 1064  663 1636  154   38  146  776]


The cosine similarities for the top 10 papers are:
[1.0000001  0.8270296  0.77556515 0.7716375  0.7676147  0.76736087
 0.75756586 0.7570242  0.75692165 0.75613713]


For Paper Abstract:

background enterovirus 71 ev71 is one of the major causative agents of hand foot and mouth disease
hfmd which is sometimes associated with severe central nervous system disease in children there is
currently no specific medication for ev71 infection quercetin one of the most widely distributed
flavonoids in plants has been demonstrated to inhibit various viral infections however investigation
of the antiev71 mechanism has not been reported to date methods the antiev71 activity of quercetin
was evaluated by phenotype screening determining the cytopathic effect cpe and ev71induced cells
apoptosis the effects on ev71 replication were evaluated further by determining virus yield viral
rna synthesis and protein expressio
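
The same cosine-similarity expression is repeated in cells 17 to 19 with only the query vector changing. As a minimal sketch (not the code the notebook actually ran), it could be factored into one helper. The name `top_k_cosine` and the toy arrays are assumptions made for illustration; in the notebook the corpus would be `NParray1941papers512vector` and the query would be `test1`, `test2` or `question1`.

```python
import numpy as np

def top_k_cosine(corpus, query, k=10):
    """Return (indices, scores) of the k corpus rows most similar to query."""
    corpus = np.asarray(corpus, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64).reshape(-1)
    # Cosine similarity = dot product divided by the product of the two norms.
    scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    top = np.argsort(scores)[-k:][::-1]  # k largest scores, best first
    return top, scores[top]

# Toy data standing in for the 1941 x 512 embedding matrix and a (1, 512) query
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([[1.0, 0.0]])
idx, sims = top_k_cosine(toy_corpus, toy_query, k=2)
print(idx, sims)
```

With a helper like this, each search cell would reduce to one call followed by the print statements.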

In [18]:
# Cosine similarity between testAbstract2 and each of the 1941 paper embeddings
dot2 = np.sum(NParray1941papers512vector*test2,axis=1)
norms2 = np.sqrt(np.sum(NParray1941papers512vector*NParray1941papers512vector,axis=1))*np.sqrt(np.sum(test2*test2))
result2 = dot2/norms2
# Indices of the 10 highest similarities, best first
# https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array
maxRows2=result2.argsort()[-10:][::-1]
print("The indexes of the most similar papers are:")
print(maxRows2)
print("\n")
print("The cosine similarities of the top 10 papers are:")
print(result2[maxRows2])
print("\n")
print("For Paper Abstract:\n")
print(textwrap.fill(testAbstract2[0],100))
print("\nWe found the top 10 most similar papers as listed below:\n")
print(df.ABS.iloc[maxRows2])

The indexes of the most similar papers are:
[1046 1453 1279 1383 1064 1168 1563  400  210 1419]


The cosine similarities of the top 10 papers are:
[0.99999994 0.7110583  0.6851455  0.6791582  0.6723436  0.66990966
 0.66609055 0.6534611  0.64800745 0.6457796 ]


For Paper Abstract:

background to investigate the effects and immunological mechanisms of the traditional chinese
medicine xinjiaxiangruyin on controlling influenza virus fm1 strain infection in mice housed in a
hygrothermal environment methods mice were housed in normal and hygrothermal environments and
intranasally infected with influenza virus fm1 a highperformance liquid chromatography fingerprint
of xinjiaxiangruyin was used to provide an analytical method for quality control realtime
quantitative polymerase chain reaction rtqpcr was used to measure messenger rna expression of
tolllike receptor 7 tlr7 myeloid differentiation primary response 88 myd88 and nuclear factorkappa b
nfb p65 in the tlr7 signaling pathway and virus replica
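
Both test abstracts come back with a top similarity of about 1.0, which suggests that each test abstract is itself one of the 1941 rows, so only nine of the ten hits are real neighbours. A hedged sketch of one way to skip such self-matches follows. The function name `top_k_without_self`, the tolerance `tol` and the toy scores are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def top_k_without_self(scores, k=10, tol=1e-4):
    """Indices of the k highest scores, skipping near-exact matches (score >= 1 - tol)."""
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1]          # all indices, best first
    keep = order[scores[order] < 1.0 - tol]   # drop the query itself and exact duplicates
    return keep[:k]

# Toy similarity vector; index 1 plays the role of the query's own row
toy_scores = np.array([0.71, 0.99999994, 0.69, 0.30, 0.68])
neighbours = top_k_without_self(toy_scores, k=3)
print(neighbours)
```

Note that this threshold also removes verbatim duplicates of the query abstract, which may or may not be what is wanted for a given corpus.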

In [19]:
# Cosine similarity between the question and each of the 1941 paper embeddings
dotq1 = np.sum(NParray1941papers512vector*question1,axis=1)
normsq1 = np.sqrt(np.sum(NParray1941papers512vector*NParray1941papers512vector,axis=1))*np.sqrt(np.sum(question1*question1))
resultq1 = dotq1/normsq1
# Indices of the 20 highest similarities, best first
# https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array
maxRowsq1=resultq1.argsort()[-20:][::-1]
print("The indexes of the most similar papers are:")
print(maxRowsq1)
print("\n")
print("The cosine similarities of the top 20 papers are:")
print(resultq1[maxRowsq1])
print("\n")
print("For Question:\n")
print(textwrap.fill(Question1[0],100))
print("\nWe found the top 20 most related papers as listed below:\n")
print(df.ABS.iloc[maxRowsq1])

The indexes of the most similar papers are:
[ 644  818 1457 1145 1829   66 1774  496 1825  637   84  298 1403 1467
 1263  436 1789  647 1273  217]


The cosine similarities of the top 20 papers are:
[0.31485555 0.28105113 0.27972364 0.2649105  0.2526728  0.24850303
 0.24528085 0.23647133 0.23592094 0.2335433  0.22230253 0.22137865
 0.22077075 0.21867044 0.21549016 0.21536064 0.2152237  0.21099517
 0.20463014 0.20306927]


For Question:

What are clinical effective therapeutics or drugs for COVID-19?

We found the top 20 most related papers as listed below:

644     ltpgtltbgtltbgt there is no specific drug that...
818     limited data is available on feline leishmanio...
1457    please cite this paper as wathen et al 2012 an...
1145    background rapid molecular methods have create...
1829    introduction the differentiation of viral from...
66      abstract viruses are major pathogenic agents c...
1774    abstractthe global spread of sarscov2 requires...
496     in this study 26 blood samples