# Kaggle COVID-19 NLP - Notebook review

Hello, I'm Tiago, a computational biology student in São Paulo, Brazil. I'm not a virology expert, nor an NLP ninja. So I was wondering: how can I actually make a contribution to this effort?


I gave it some thought and landed on the following conclusions:

### People are amazing!
The work done by other Kagglers in this effort is just beautiful. So, the first step of my contribution will be to *review as many COVID NLP notebooks on Kaggle as I possibly can*, even if each review is a quick one.

### Everyone knows something that can help, so everyone can contribute!
Tasks are so diverse and require so many complementary skills that anyone can help. I could try to squeeze a lifetime of NLP experience into a couple of months of pandemic, but that is likely neither effective nor very useful.
There is one thing, though, that I am very familiar with, and that is the open knowledge graph at [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page). What I can do is take statements from all these amazing analysis efforts and convert them into standardized, machine-readable, referenced, public-domain statements. tl;dr: *add a Wikidata layer to Kaggle efforts*.

For those of you who know Wikidata, join us at our [COVID-19 WikiProject](https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_COVID-19)! Every person counts!
For those of you who do not know Wikidata yet, join us anyway!


This notebook corresponds to part 1: a review of current notebooks on Kaggle.

A second notebook will be linked shortly, pointing towards the Wikidata-related work.

### First part: Review of current notebooks on Kaggle. 
For every notebook, I will write a short summary explaining in a few words what was done.
It is a personal view of what the authors aimed to do, the core tools that were used, and what can be directly useful for the Wikidata integration.


## Part 1 - Review of Notebooks

### Notebook #1 [Browsing research papers with a BM25 search engine](https://www.kaggle.com/dgunning/browsing-research-papers-with-a-bm25-search-engine) (v67)

Dwight Gunning built a search engine to rank papers based on keywords related to the tasks.
It is available on GitHub as a Python package, cord.

The user can choose subsets of papers based on dates. 

The output is a ranked list of papers related to the query words.
He also implements rendering of individual papers for exploration.

The query mechanism is based on matching query tokens against each document's abstract.

Older versions of the notebook show how the search engine was built.

Core tools used: Python: nltk for text processing; rank_bm25 for the searching algorithm.
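
To make the ranking idea concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the code from the notebook or from the rank_bm25 package: the `BM25` class, the whitespace `tokenize` helper, the parameter values `k1=1.5` and `b=0.75`, and the three toy abstracts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for nltk-based preprocessing)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over a list of tokenized documents."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_freqs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        self.avg_len = sum(self.doc_lens) / len(corpus)
        n_docs = len(corpus)
        # Number of documents that contain each term.
        containing = Counter(term for doc in corpus for term in set(doc))
        # Smoothed inverse document frequency; always positive.
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in containing.items()
        }

    def score(self, query, index):
        """BM25 score of document `index` for a tokenized query."""
        freqs = self.doc_freqs[index]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avg_len)
        return sum(
            self.idf[t] * freqs[t] * (self.k1 + 1) / (freqs[t] + norm)
            for t in query
            if t in freqs
        )

    def rank(self, query):
        """Document indices sorted from best to worst match."""
        scores = [(self.score(query, i), i) for i in range(len(self.doc_freqs))]
        return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]


# Toy abstracts, invented for this example.
abstracts = [
    "incubation period of the novel coronavirus",
    "bat coronavirus genome sequencing",
    "influenza vaccine trial results",
]
bm25 = BM25([tokenize(a) for a in abstracts])
ranking = bm25.rank(tokenize("coronavirus incubation period"))
print(ranking)
```

The first abstract matches all three query tokens, the second only one, and the third none, so the ranking follows that order.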


### Notebook #2 [CORD-19 Analysis with Sentence Embeddings](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) (v21)

David Mezzetti built a search engine to rank papers based on sentence embeddings matching a query. Furthermore, it displays the sentences from each article most likely to be of interest (i.e., those ranking highest in relation to the query).
It uses a set of heuristics to define valid sentences.

A table is generated for the sentences within the article.
Displays some EDA of the metadata related to the articles.

The use of embeddings allows matching concepts instead of matching only words.

It is available on GitHub as a Python package, cord19q.

David Mezzetti also rendered notebooks with highlights for each of the 10 tasks, each subdivided into specific topics, 
and defined a set of subquestions for each task.


Core tools used: SQLite to process the metadata.csv file as a database; fastText vectors trained on the CORD dataset for embeddings; a BM25 index to weight word embeddings into a sentence embedding; the Python package faiss for similarity search; the TextRank algorithm to highlight the best sentences.

*Note:* David's highlights are a super good resource for adding information to Wikidata.
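
As a rough illustration of how weighted word vectors can become a sentence embedding, here is a small sketch in plain Python. It is not the cord19q code: the 3-dimensional vectors and the per-token weights are made-up values standing in for fastText vectors and BM25 weights, and a brute-force cosine comparison stands in for the faiss index.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, standing in for fastText).
WORD_VECTORS = {
    "virus": [1.0, 0.0, 0.0],
    "infection": [0.9, 0.1, 0.0],
    "transmission": [0.8, 0.2, 0.0],
    "economy": [0.0, 0.0, 1.0],
    "market": [0.0, 0.1, 0.9],
}

# Hypothetical per-token weights, standing in for BM25 term weights.
TOKEN_WEIGHTS = {
    "virus": 1.0,
    "infection": 1.5,
    "transmission": 2.0,
    "economy": 2.0,
    "market": 1.5,
}


def sentence_embedding(tokens):
    """Weighted average of the vectors of all known tokens."""
    known = [t for t in tokens if t in WORD_VECTORS]
    if not known:
        raise ValueError("no known tokens in sentence")
    total = sum(TOKEN_WEIGHTS[t] for t in known)
    dim = len(next(iter(WORD_VECTORS.values())))
    return [
        sum(TOKEN_WEIGHTS[t] * WORD_VECTORS[t][i] for t in known) / total
        for i in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def best_match(query_tokens, sentences):
    """Index of the sentence whose embedding is closest to the query embedding."""
    query = sentence_embedding(query_tokens)
    return max(
        range(len(sentences)),
        key=lambda i: cosine(query, sentence_embedding(sentences[i])),
    )


sentences = [["economy", "market"], ["infection", "transmission"]]
print(best_match(["virus"], sentences))
```

The query "virus" shares no word with either sentence, yet the second sentence wins because its vectors point in a similar direction. That is the concept matching that plain keyword search cannot do.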





### Notebook #3 [NLP Text Mining - Disease behavior](https://www.kaggle.com/cstefanache/nlp-text-mining-disease-behavior) (v16)

Stefanache Cornel built a notebook on virus behaviour, such as symptoms, incubation periods, transmission methods and so on. He also built a simulator for virus propagation.

He assembled terms related to COVID-19 ("incubation", possible symptoms, virus references, organs, and connecting words such as "lower" or "higher") and coded NLP matcher functions to find matches.

It is not available as a package; the functions are spread across the notebook.


Core tools used: spaCy (Python NLP suite).




### Notebook #4 [Anserini+BERT-SQuAD for Semantic Corpus Search](https://www.kaggle.com/dirktheeng/anserini-bert-squad-for-semantic-corpus-search) (v43)

Dirk The Engineer's goal was to build a question-answering system for COVID-related questions. It uses two algorithms, Anserini and BERT-SQuAD. Anserini ranks sentences of text in relation to a question. BERT-SQuAD selects the best sentence segment to answer the question. He adapted the ideas of the paper by [Yang et al.](https://arxiv.org/abs/1902.01718) to the COVID-19 dataset (actually the Lucene-searchable CORD-19 database). Google's Universal Sentence Encoder is used to find semantic similarities between question and answer.

Defined a set of subquestions for each task.

Nice visualization, highlighting the specific part of each excerpt that best answers the query.


Core Tools used: Anserini (via Pyserini), Bert-SQuAD,  Google's Universal Sentence Encoder, BioPython/Entrez to also query PubMed

It is not available as a package.


*Note:* Q&A could be further refined if a knowledge base (such as Wikidata) was populated. Dirk's highlights are a super good resource for adding information to Wikidata. Also, this notebook is very well organized. Code folding does wonders.



### Notebook #5 [CoronaWhy.org - Team - 560+ people (join slack)](https://www.kaggle.com/arturkiulian/coronawhy-org-team-560-people-join-slack) (v44)


Artur Kiulian, Platon Mysnyk, Ivan Didur, Anton Polishko and others organized a notebook to bring together people in the Kaggle community to work on ML tasks related to COVID-19. There is a Slack group that anyone can join, where specific tasks and goals are discussed.

They also use Trello task boards and have daily calls that tackle open problems.
Anyone can join this effort through the website [CoronaWhy.org](https://www.coronawhy.org/).


    


### Notebook #6 [CORD : Tools and Knowledge graphs](https://www.kaggle.com/shahules/cord-tools-and-knowledge-graphs) (v24)

Shahules786 did some exploratory analysis of the dataset, checking average title sizes and most used words.

Uses Universal Sentence Encoder and DBSCAN with spaCy sentence vectors to find the titles that best match other titles.

For each task, they manually selected core articles, found the best matches, and detected keywords using either RAKE or pytextrank.

Did some work towards building a knowledge graph: detecting subjects, relations and objects.
The knowledge graphs were built with spaCy for each task. They are visualized as networks, but it is hard to extract meaning from them. It is a work in progress, though, and I imagine it will get better with time.


Core Tools used: rake-nltk, spaCy matcher, tensorflow, Universal Sentence Encoder, DBSCAN, pytextrank

It is not available as a package; the functions are spread across the notebook.

*Note:* Knowledge graph inference is very useful for Wikidata.

### Notebook #7 [COVID-19 Thematic tagging with Regular Expressions](https://www.kaggle.com/ajrwhite/covid-19-thematic-tagging-with-regular-expressions/) (v56)

[Andy White](https://www.kaggle.com/ajrwhite) tagged articles in the corpus using regex. His tags look like `tag_disease_covid19` or `tag_risk_smoking`. Among the motivations, the idea was to standardize some terminology and make papers easier to find. Each paper receives multiple tags. It is extensive work covering many terms of interest, ranging from biomedicine to geography.

As of now, he is building a tool to filter papers based on tags.

Core tools used: Python's re package for regular expressions.

It is not available as a package, but the functions are available at [this link](https://www.kaggle.com/ajrwhite/covid19-tools).
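
A minimal sketch of what regex-based tagging can look like follows. The two tag names are the ones quoted above, but the patterns themselves and the `tag_text` helper are my own guesses for illustration, not the ones used in the notebook.

```python
import re

# Hypothetical patterns; only the tag names come from the notebook.
TAG_PATTERNS = {
    "tag_disease_covid19": re.compile(
        r"covid[-\s]?19|sars[-\s]?cov[-\s]?2|2019[-\s]?ncov", re.IGNORECASE
    ),
    "tag_risk_smoking": re.compile(
        r"\bsmok(?:e|er|ers|ing)\b|\btobacco\b", re.IGNORECASE
    ),
}


def tag_text(text):
    """Return the sorted list of tags whose pattern occurs in the text."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(text))


print(tag_text("Smoking history and outcomes in SARS-CoV-2 patients"))
```

Because each pattern is checked independently, a single title or abstract can receive several tags, and different spellings of the same concept end up under one standardized tag.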


### Notebook #8 [CORD-19: EDA, parse JSON and generate clean CSV](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv) (v40)

[xhlulu](https://www.kaggle.com/xhlulu) processed the dataset to clean and update the part related to bioRxiv papers. Helper functions are made available. The cleaned data about the preprints is made available as a .csv file.


Core tools used: pandas and basic tools for data cleaning.

Not available as a package. Functions in the notebook.
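
The general shape of this kind of JSON-to-CSV flattening can be sketched with the standard library alone. This is not xhlulu's code (which uses pandas): the field names `paper_id`, `metadata.title`, `abstract` and `body_text` are assumed here from the CORD-19 JSON layout, and the sample paper and helper names are invented.

```python
import csv
import io
import json

FIELDS = ["paper_id", "title", "abstract", "text"]


def paper_to_row(paper):
    """Flatten one parsed paper into a dict of plain strings."""
    return {
        "paper_id": paper.get("paper_id", ""),
        "title": paper.get("metadata", {}).get("title", ""),
        # Each section is a list of paragraph objects with a "text" field.
        "abstract": "\n".join(p["text"] for p in paper.get("abstract", [])),
        "text": "\n".join(p["text"] for p in paper.get("body_text", [])),
    }


def papers_to_csv(papers):
    """Serialize flattened papers to CSV text, one row per paper."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for paper in papers:
        writer.writerow(paper_to_row(paper))
    return buffer.getvalue()


# Invented sample following the assumed layout.
raw = json.loads("""
{"paper_id": "abc123",
 "metadata": {"title": "A toy preprint"},
 "abstract": [{"text": "Short abstract."}],
 "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}]}
""")
print(papers_to_csv([raw]))
```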

### Notebook #9 [Mining COVID-19 scientific papers](https://www.kaggle.com/mobassir/mining-covid-19-scientific-papers) (v40)

[Mobassir](https://www.kaggle.com/mobassir) focused specifically on the risk factors of COVID-19. Uses the dataset from [CORD-19: EDA, parse JSON and generate clean CSV](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv). Works towards visualizing topics, but it is hard to extract meaning from the visualizations.


Core Tools used: tensorflow, nltk, bert, gensim for topic modelling, TextRank algorithm

### Notebook #10 [COVID-19 Literature Clustering](https://www.kaggle.com/maksimeren/covid-19-literature-clustering) (v22)

[MaksimEkin](https://www.kaggle.com/maksimeren)'s goal was to cluster articles by similarity, so people can find articles that are alike. 

It uses sklearn's HashingVectorizer and tf-idf to create a feature vector for each article from its bigrams (e.g. 'liveattenuated', 'attenuatedviruses', 'viruseshave').

It visualizes the papers using PCA/t-SNE + seaborn and clusters the articles using k-means.

He plays around with the number of clusters and creates an interactive t-SNE visualization.

Core Tools: sklearn

Not available as a package. Functions in the notebook.
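
To show the idea without sklearn, here is a simplified sketch in plain Python: adjacent words are joined into pairs as in the example above, hashed into a fixed-size count vector (the hashing trick), and clustered with a bare-bones k-means. It is only a stand-in for HashingVectorizer and KMeans. The tf-idf weighting is left out, and the feature size, the four toy documents and all function names are assumptions.

```python
import zlib


def word_bigrams(text):
    """Join each pair of adjacent words, e.g. 'live attenuated' -> 'liveattenuated'."""
    words = text.lower().split()
    return [a + b for a, b in zip(words, words[1:])]


def hashed_vector(text, n_features=4096):
    """Hashing trick: count bigrams into a fixed-size vector, then L2-normalize."""
    vec = [0.0] * n_features
    for gram in word_bigrams(text):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(gram.encode("utf-8")) % n_features] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec


def kmeans(vectors, k, n_iter=10):
    """Plain Lloyd's algorithm, seeded with the first k vectors for determinism."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy documents: two about vaccines, two about hospital capacity.
docs = [
    "live attenuated viruses have been used",
    "hospital bed capacity during the outbreak",
    "live attenuated viruses have been tested",
    "hospital bed capacity during the epidemic",
]
labels = kmeans([hashed_vector(d) for d in docs], k=2)
print(labels)
```

Documents that share most of their word pairs land in the same cluster, which is what lets a reader jump from one paper to similar ones.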


# Work in progress

To do: review more notebooks.