### Effectiveness of drugs being developed and tried to treat COVID-19 patients

### 1. Introduction
write intro here

### 2. Notebook’s Goal
The goal of this notebook is to help the medical community find answers to the questions below as they relate to COVID-19. Using data science and data mining techniques and tools, it aims to surface quickly the scholarly articles most relevant to these questions.

### 3. Methodology
The workflow in the figure below illustrates the approach taken to pre-process, model, and post-process the CORD-19 dataset.

The first step is pre-processing. It involves basic data cleansing across the dataset: selecting papers by language, checking that the full research document is available, and other basic checks. At the core of the modeling step is a Natural Language Processing (NLP) library (1), which supports tasks such as clustering, summarization, phrase matching, and ranking of the given research papers.

Finally, the most relevant documents are selected by representing their words and sentences as vectors. This representation provides a way to measure the semantic similarity between a document and a given search sub-task.
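
To make the vector-similarity idea concrete, here is a minimal sketch in plain Python. It is not the notebook's NLP library: plain term counts stand in for the learned word and sentence vectors, and the names `to_vector`, `cosine_similarity`, and `rank_documents` are hypothetical helpers introduced only for illustration.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a sparse term-count vector (a stand-in for real embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (score, index) pairs sorted from most to least similar to the query."""
    query_vector = to_vector(query)
    scores = [
        (cosine_similarity(query_vector, to_vector(doc)), index)
        for index, doc in enumerate(documents)
    ]
    # Highest score first; ties keep the original document order.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))
```

With real embeddings only `to_vector` would change; the ranking step stays the same, which is why the two are kept separate here.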

#### 3.1 Pre-processing

##### 3.1.1 Dataset description
Each paper in the dataset is represented by a JSON file. For each paper, we want to extract the following data:
- paper ID
- publication date
- title
- abstract text
- body text
- primary location and country for the author(s)
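
The extraction above can be sketched for a single parsed JSON record as follows. The keys `paper_id`, `abstract`, `body_text`, and `text` match the schema printed later in this notebook; `metadata`, `title`, `authors`, `affiliation`, `location`, and `country` are assumptions, since that part of the schema is not shown here. The publication date does not appear in the visible schema either, so this sketch leaves it as `None` rather than guess where it lives. `extract_paper_fields` is a hypothetical helper name.

```python
def extract_paper_fields(paper):
    """Pull the fields of interest out of one parsed paper record (a dict)."""
    metadata = paper.get("metadata") or {}   # assumed key, not in the printed schema
    authors = metadata.get("authors") or []  # assumed key

    # Use the first author that carries any location information.
    first_location = {}
    for author in authors:
        location = (author.get("affiliation") or {}).get("location") or {}
        if location:
            first_location = location
            break

    return {
        "paper_id": paper.get("paper_id"),
        "publish_date": None,  # not present in the visible schema; left unfilled
        "title": metadata.get("title"),
        "abstract": " ".join(p.get("text") or "" for p in paper.get("abstract") or []),
        "body_text": " ".join(p.get("text") or "" for p in paper.get("body_text") or []),
        "country": first_location.get("country"),
    }
```

Joining paragraphs with a single space gives one string per paper; keeping them as a list would suit paragraph-level analysis better.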

##### 3.1.2 Data cleaning
The following steps were considered for dataset cleaning:

- Remove unnecessary or unhelpful characters and words from the paper text
- Remove duplicate papers
- Remove papers which are not in English
- Eliminate null values
- Remove blank space
- Remove references and annotations
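
A minimal sketch of several of these steps in plain Python, assuming each paper is a dict with `paper_id` and `body_text` string fields and that references appear as bracketed numbers such as `[3]` or `[1, 2]`. Both are assumptions made for illustration. Language filtering is left out because it needs a language-detection library that this section does not name. `clean_text` and `clean_papers` are hypothetical helper names.

```python
import re

# Bracketed numeric citations such as [3], [1, 2] or [4-6] (assumed format).
CITATION_PATTERN = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")


def clean_text(text):
    """Strip citation markers and collapse runs of whitespace."""
    text = CITATION_PATTERN.sub("", text)
    text = re.sub(r"\s+", " ", text)            # collapse blank space
    text = re.sub(r"\s+([.,;:])", r"\1", text)  # no space before punctuation
    return text.strip()


def clean_papers(papers):
    """Drop null/empty and duplicate papers, returning cleaned copies."""
    seen_bodies = set()
    cleaned = []
    for paper in papers:
        body = paper.get("body_text")
        if not paper.get("paper_id") or not body:
            continue  # eliminate null values
        body = clean_text(body)
        if not body or body in seen_bodies:
            continue  # empty after cleaning, or a duplicate of an earlier paper
        seen_bodies.add(body)
        cleaned.append({**paper, "body_text": body})
    return cleaned
```

Duplicates are detected on the cleaned body text rather than on `paper_id`, so the same paper stored under two IDs is still caught.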

Import Required Libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

### 1. ETL

#### 1.1 Create SparkSession and SparkContext

In [4]:
spark = SparkSession.builder.appName("COVID-19 Effective Drugs").master("local[3]").getOrCreate()

Read data from given path

In [6]:
path = "data"
# multiLine=True: each paper is a single JSON document spread over many lines
raw_df = spark.read.json(path, multiLine=True)

Explore DataFrame Schema

In [8]:
raw_df.printSchema()
# raw_df.show(1, True)

root
 |-- abstract: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cite_spans: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- ref_id: string (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |    |    |    |-- text: string (nullable = true)
 |    |    |-- ref_spans: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- section: string (nullable = true)
 |    |    |-- text: string (nullable = true)
 |-- back_matter: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cite_spans: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- ref_id: string (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |  

Creating a row for each text element inside body text: explode

In [11]:
body_text_item_df = raw_df.select("paper_id", explode("body_text.text").alias("body_text_item"))
# body_text_item_df.printSchema()
body_text_item_df.show(10, truncate=True)

+--------------------+--------------------+
|            paper_id|      body_text_item|
+--------------------+--------------------+
|00a0ab182dc01b6c2...|The world was mad...|
|00a0ab182dc01b6c2...|Patients with MER...|
|00a0ab182dc01b6c2...|Chest radiography...|
|00a0ab182dc01b6c2...|The mean incubati...|
|00a0ab182dc01b6c2...|MERS can progress...|
|00a0ab182dc01b6c2...|As a group, child...|
|00a0ab182dc01b6c2...|The virus associa...|
|00a0ab182dc01b6c2...|MERS-CoV is a put...|
|00a0ab182dc01b6c2...|Complete genome d...|
|00a0ab182dc01b6c2...|To date, the MERS...|
+--------------------+--------------------+
only showing top 10 rows
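
For readers unfamiliar with `explode`, the plain-Python sketch below mimics what the cell above does: it emits one `(paper_id, text)` row per element of each paper's `body_text` array. The function name `explode_body_text` is hypothetical, and the input is assumed to be a list of parsed paper dicts rather than a Spark DataFrame.

```python
def explode_body_text(papers):
    """Return one (paper_id, paragraph_text) row per body_text element."""
    rows = []
    for paper in papers:
        # Like Spark's explode, a paper with a missing or empty body_text
        # array contributes no rows at all.
        for paragraph in paper.get("body_text") or []:
            rows.append((paper["paper_id"], paragraph["text"]))
    return rows
```

Because papers without body text vanish from the result, a later join back to the paper-level data will silently lose them unless that is handled explicitly.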


