# COVID-19 Literature Analysis using Machine Learning and Deep Learning

## Table of Contents

  1. Introduction
  2. Dataset Description
  3. Methodology
  4. Graph Database
  5. Results
  6. Folder Structure

## Introduction

The coronavirus pandemic caused enormous health, economic, environmental, and social challenges for the entire human population. The research community worked tirelessly toward a vaccine, but could we help speed up those efforts even more?

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset (CORD-19). It is a resource of over 1 million scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset was provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.

This project aims to help researchers navigate this fast-growing body of coronavirus literature and efficiently find relevant, up-to-date information. We do this by clustering similar papers with various topic modeling algorithms, leveraging Hadoop for data storage management and PySpark for building the ML and DL pipelines.

## Dataset Description

The dataset consists of JSON and CSV files. Each paper is stored as a nested JSON file, while some additional metadata is available in a CSV file. A detailed description is available here. The image below summarizes the data preprocessing pipeline.

*Data preprocessing pipeline (figure)*
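
For orientation, here is a minimal sketch of the parsing step, assuming the public CORD-19 JSON layout (`paper_id`, `metadata`, `abstract`, `body_text`). The function names and CSV columns are illustrative; the repository's `cord19-parser.py` may handle more fields and edge cases.

```python
# Sketch: flatten nested CORD-19 paper JSONs into one structured CSV.
# Field names follow the public CORD-19 schema; helper names are hypothetical.
import csv
import json
from pathlib import Path

def flatten_paper(json_path):
    """Read one nested paper JSON and return a flat record."""
    with open(json_path) as f:
        paper = json.load(f)
    authors = "; ".join(
        f"{a.get('first', '')} {a.get('last', '')}".strip()
        for a in paper["metadata"].get("authors", [])
    )
    return {
        "paper_id": paper["paper_id"],
        "title": paper["metadata"].get("title", ""),
        "authors": authors,
        "abstract": " ".join(p["text"] for p in paper.get("abstract", [])),
        "body_text": " ".join(p["text"] for p in paper.get("body_text", [])),
    }

def papers_to_csv(json_dir, out_csv):
    """Flatten every paper JSON in a directory into a single CSV file."""
    rows = [flatten_paper(p) for p in Path(json_dir).glob("*.json")]
    if not rows:
        return
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```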

## Methodology

*Methodology overview (figure)*
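
As a rough illustration of one of the topic modeling pipelines, here is a minimal PySpark sketch using LDA. The input file name, the `body_text` column, and `k=10` topics are assumptions made for illustration; the project's modeling code may use different algorithms and parameters.

```python
# Sketch: a PySpark topic modeling pipeline (LDA) over the parsed paper text.
# "cord19_parsed.csv", the "body_text" column, and k=10 are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("cord19-topics").getOrCreate()
papers = spark.read.csv("cord19_parsed.csv", header=True)  # hypothetical parsed file

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="body_text", outputCol="tokens", pattern="\\W+"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=20000),
    LDA(k=10, maxIter=20, featuresCol="features"),
])

model = pipeline.fit(papers)
# Top terms per topic, used to interpret clusters like those in the Results section.
model.stages[-1].describeTopics(5).show(truncate=False)
```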

## Graph Database

- Graph databases provide a way to generate and visualize relationships between entities
- Both PySpark GraphFrames and Neo4j support graph-based data storage; we explored both tools (a minimal GraphFrames sketch follows this list)
- Each author, paper, and journal acts as a node
- Nodes are connected through the relationships "has_published" and "has_paper"
- Data was prepared with Python to make it ready for import into Neo4j
- Docker was used to run Neo4j (version 5.2.0)
- A Bash script (start_neo4j.sh) starts the Docker container and the Neo4j server, then imports the data
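
The sketch below shows the graph schema with PySpark GraphFrames. Only the node types (author, paper, journal) and the "has_published"/"has_paper" relationships come from the project; the sample rows are made up for illustration.

```python
# Sketch: the author/paper/journal graph schema in PySpark GraphFrames.
# Requires the graphframes package; sample node and edge values are invented.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("cord19-graph").getOrCreate()

# Vertices: every author, paper, and journal is a node.
vertices = spark.createDataFrame([
    ("a1", "Jane Doe", "author"),
    ("p1", "Antibody response in SARS-CoV-2", "paper"),
    ("j1", "The Lancet", "journal"),
], ["id", "name", "type"])

# Edges: authors publish papers, journals contain papers.
edges = spark.createDataFrame([
    ("a1", "p1", "has_published"),
    ("j1", "p1", "has_paper"),
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.edges.groupBy("relationship").count().show()
```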

*Sample graph and final graph visualizations (figures)*

## Results

Below are a few sample results of topic modeling.

*Word clouds for Topic 1, Topic 2, and Topic 3 (figures)*

- Topic 1 seems to be concerned with immune response and antibodies
- Topic 2 seems to cover the effects of the pandemic on society, mental health (stress, anxiety), and the work environment (behavior, support)
- Topic 3 papers could be related to infection detection, antibody sequencing, and the virus itself

## Folder Structure

```
covid19-literature-analysis
  |
  |--- data_prep: Code for preprocessing the raw data
         |--- cord19-parser.py: A Python parser to convert the raw data into a structured CSV file
         |--- Data-Preprocessing.ipynb: The same data parser, but using PySpark
  |--- data_viz: Visualizations to understand the data better
  |--- graph_db: Post-project exploratory work to store and represent the data using Neo4j and PySpark GraphFrames
  |--- images: README images
  |--- modeling: Modeling work
  |--- ppt: A presentation describing the whole project
```