# COVID-19 Literature Analysis using Machine Learning and Deep Learning

## Table of Contents

  1. Introduction
  2. Dataset Description
  3. Methodology
  4. Graph Database
  5. Results
  6. Folder Structure

## Introduction

The coronavirus pandemic caused enormous health, economic, environmental, and social challenges for the entire human population. The research community worked tirelessly toward a vaccine, but could we help speed up those efforts even more?

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset (CORD-19). It is a resource of over 1 million scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset was provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.

This project aims to help researchers navigate this fast-growing body of coronavirus literature and efficiently find relevant, up-to-date information. We do this by clustering similar papers with various topic modeling algorithms, leveraging Hadoop for data storage management and PySpark for building the ML and DL pipelines.

## Dataset Description

The dataset consists of JSON and CSV files. Each paper is stored as a nested JSON file, while some additional metadata is available in a CSV file. A detailed description is available here. The image below summarizes the data preprocessing pipeline.

*Data preprocessing pipeline (figure)*
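
For orientation, here is a minimal sketch of the parsing step, assuming the public CORD-19 JSON layout (`paper_id`, `metadata`, `abstract`, `body_text`). The function names and CSV columns are illustrative; the repository's `cord19-parser.py` may handle more fields and edge cases.

```python
# Sketch: flatten nested CORD-19 paper JSONs into one structured CSV.
# Field names follow the public CORD-19 schema; helper names are hypothetical.
import csv
import json
from pathlib import Path

def flatten_paper(json_path):
    """Read one nested paper JSON and return a flat record."""
    with open(json_path) as f:
        paper = json.load(f)
    authors = "; ".join(
        f"{a.get('first', '')} {a.get('last', '')}".strip()
        for a in paper["metadata"].get("authors", [])
    )
    return {
        "paper_id": paper["paper_id"],
        "title": paper["metadata"].get("title", ""),
        "authors": authors,
        "abstract": " ".join(p["text"] for p in paper.get("abstract", [])),
        "body_text": " ".join(p["text"] for p in paper.get("body_text", [])),
    }

def papers_to_csv(json_dir, out_csv):
    """Flatten every paper JSON in a directory into a single CSV file."""
    rows = [flatten_paper(p) for p in Path(json_dir).glob("*.json")]
    if not rows:
        return
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```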

## Methodology

*Methodology overview (figure)*
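
As a rough illustration of one of the topic modeling pipelines, here is a minimal PySpark sketch using LDA. The input file name, the `body_text` column, and `k=10` topics are assumptions made for illustration; the project's modeling code may use different algorithms and parameters.

```python
# Sketch: a PySpark topic modeling pipeline (LDA) over the parsed paper text.
# "cord19_parsed.csv", the "body_text" column, and k=10 are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("cord19-topics").getOrCreate()
papers = spark.read.csv("cord19_parsed.csv", header=True)  # hypothetical parsed file

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="body_text", outputCol="tokens", pattern="\\W+"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=20000),
    LDA(k=10, maxIter=20, featuresCol="features"),
])

model = pipeline.fit(papers)
# Top terms per topic, used to interpret clusters like those in the Results section.
model.stages[-1].describeTopics(5).show(truncate=False)
```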

## Graph Database

- Graph databases provide a way to generate and visualize relationships between entities
- Both PySpark GraphFrames and Neo4j support graph-based data storage; we explored both tools (a minimal GraphFrames sketch follows this list)
- Each author, paper, and journal acts as a node
- Nodes are connected through the relationships "has_published" and "has_paper"
- Data was prepared with Python to make it ready for import into Neo4j
- Docker was used to run Neo4j (version 5.2.0)
- A Bash script (start_neo4j.sh) starts the Docker container and the Neo4j server, then imports the data
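
The sketch below shows the graph schema with PySpark GraphFrames. Only the node types (author, paper, journal) and the "has_published"/"has_paper" relationships come from the project; the sample rows are made up for illustration.

```python
# Sketch: the author/paper/journal graph schema in PySpark GraphFrames.
# Requires the graphframes package; sample node and edge values are invented.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("cord19-graph").getOrCreate()

# Vertices: every author, paper, and journal is a node.
vertices = spark.createDataFrame([
    ("a1", "Jane Doe", "author"),
    ("p1", "Antibody response in SARS-CoV-2", "paper"),
    ("j1", "The Lancet", "journal"),
], ["id", "name", "type"])

# Edges: authors publish papers, journals contain papers.
edges = spark.createDataFrame([
    ("a1", "p1", "has_published"),
    ("j1", "p1", "has_paper"),
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.edges.groupBy("relationship").count().show()
```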

*Sample graph and final graph visualizations (figures)*

## Results

Below are a few sample results of topic modeling.

*Word clouds for Topic 1, Topic 2, and Topic 3 (figures)*

- Topic 1 seems to be concerned with immune response and antibodies
- Topic 2 seems to cover the effects of the pandemic on society, mental health (stress, anxiety), and the work environment (behavior, support)
- Topic 3 papers could be related to infection detection, antibody sequencing, and the virus itself

## Folder Structure

```
covid19-literature-analysis
  |
  |--- data_prep: Code for preprocessing the raw data
         |--- cord19-parser.py: A Python parser to convert the raw data into a structured CSV file
         |--- Data-Preprocessing.ipynb: The same data parser, but using PySpark
  |--- data_viz: Visualizations to understand the data better
  |--- graph_db: Post-project exploratory work to store and represent the data using Neo4j and PySpark GraphFrames
  |--- images: README images
  |--- modeling: Modeling work
  |--- ppt: A presentation describing the whole project
```