# Classifying COVID19 Papers based on Severity

### by Luke Batchelder, Jessica Diehl, Drew Griffith and Joe Netti
### CSCI 635- Introduction to Machine Learning
### Project 2- COVID-19 Open Research Dataset Challenge
### 5/5/2020

## Introduction

In any technical field, researchers must keep pace with new papers. The coronavirus pandemic has generated far more papers than researchers can reasonably sift through. This project aims to help solve this problem that researchers face with the abundance of information using neural networks. Our code was based on being a bricks on bricks operation to other submissions. Based on our sources there seemed to be great tools for sorting the data into groups and generally querying on those groups and there had been solid efforts at ground up NN learning based on ensemble CNN networks. Our code was designed to test the effectiveness of combining these two functionalities by using the very effective code of (XYX) to find papers that fit our analyses and using the semantic analysis of (YXX) to find the results in order to classify papers if they discussed patients who had high severity. 

The task used for this project is the Risk Factors task, specifically, “Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups”. The sentiment analysis is used to provide additional papers other than ones the neural network has been trained to detect.  Does the sentiment within the high severity papers reflect that topics discussed are similar in other papers which have been found to have a similar sentiment?  Especially, do the other papers indicate information about the risk of fatality?  The high risk papers revolve around the severity of disease and contain indications of research involving fatalities or hospitalizations. The neural network used for sentiment analysis classifies papers into discussions about high severity for the disease or no discussion about the severity of the disease.  The aim for this classifier is to assist in finding additional papers which have information about severity indicated through fatalities.  The underlying network works similar to popular systems which would recommend a book to a reader, however this network would be based on the internal sentiment analysis of the book they are currently reading instead of metadata tagging a book. 

The dataset used for this project is the COVID-19 Open Research Dataset Challenge (CORD-19) dataset as provided in the CORD-19 research challenge.  The data for the neural network has been split into training, validation and test data.  For a positive sentiment, the training data consists of papers which were found to contain the word ‘fatalities’.  The validation data papers contained the word ‘hospitilization’.  For a negative sentiment, training data contains ‘recovery’ and validation constians ‘flu’.  Test data is the entirety of papers.  An additional dataset containing the word ‘mice’ is used, since mice will never be hospitalized.

In [23]:
import datetime
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import json
import sys
from string import punctuation
from os import listdir, mkdir, path
from collections import Counter
from nltk.corpus import stopwords
from string import punctuation
from os import listdir
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
import tensorflow as tf
%load_ext tensorboard

In [3]:
############################################
### Load All Papers - Cleaned and Spaced
############################################

df = pd.read_csv('spacy.csv',index_col=0)

Papers used for training and validation are associated with their hash number as found in the metadata.csv (provided by the dataset)  to make it easier to locate the document in the dataset, since there are some encoding errors in the titles (such as Unicode coding issues for certain characters).  After locating the documents, they are distilled down to their basic words using a bag of words style method.  Stop words are removed using nltk’s corpus stopwords and then stored.  The words are tokenized using Kera’s Tokenizer class.  The fit_on_texts and text_to_sequences convert the text to numbers for the neural network to process.  The training data is padded so all sequences are the same length.  Classifications are stored in the ytrain and ytest variables for use in the neural network. The vocabulary used for the embedding layer contains words that occur only 3 or more times in the training set entirety of the paper. We found that using words that occured 2 or more times caused the network to overfit on the training data.

In [5]:
############################################
### Load Paper Categories Hashes
############################################

paper_fns = ['positive_out.txt', 'negative_out.txt']
	
positive_hash = None
with open(paper_fns[0], 'r') as f:
	positive_hash = f.read().split('\n')
	
negative_hash = None
with open(paper_fns[1], 'r') as f:
	negative_hash = f.read().split('\n')

In [6]:
############################################
### Load Papers from Hashes
############################################

hashes = df['paper_id'].values.tolist()
paper_text = df['processed_text'].values.tolist()

positive_papers = []

for hash in positive_hash:
	for idx,hash2 in enumerate(hashes):
		if hash == hash2:
			try:
				positive_papers.append(paper_text[idx])
				break
			except:
				print(hash)
				
negative_papers = []
for hash in negative_hash:
	for idx,hash2 in enumerate(hashes):
		if hash == hash2:
			try:
				negative_papers.append(paper_text[idx])
				break
			except:
				print(hash)
			
print(len(positive_papers))
print(len(negative_papers))

320
316


In [7]:
############################################
### Create Vocabulary from Papers
############################################


# turn a doc into clean tokens
def clean_doc_vocab(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(doc, vocab):
	tokens = clean_doc_vocab(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs_vocab(doc_list, vocab):
		for doc in doc_list:
			add_doc_to_vocab(doc, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs_vocab(positive_papers, vocab)
process_docs_vocab(negative_papers, vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))


# keep tokens with a min occurrence
min_occurane = 1000
vocab = [k for k,c in vocab.items() if c >= min_occurane]
print(len(vocab))
vocab = set(vocab)

42745
[('cell', 15607), ('mouse', 11669), ('infection', 11483), ('use', 11024), ('virus', 10528), ('study', 9229), ('patient', 7792), ('viral', 5971), ('group', 5213), ('protein', 5112), ('day', 5080), ('result', 4798), ('disease', 4743), ('respiratory', 4557), ('high', 4433), ('control', 4253), ('response', 4067), ('increase', 4064), ('test', 3751), ('include', 3690), ('level', 3620), ('infect', 3494), ('sample', 3467), ('report', 3363), ('expression', 3313), ('human', 3261), ('datum', 3177), ('case', 3090), ('analysis', 2993), ('antibody', 2855), ('compare', 2843), ('find', 2832), ('effect', 2808), ('time', 2748), ('model', 2715), ('lung', 2711), ('low', 2696), ('detect', 2696), ('gene', 2696), ('follow', 2668), ('clinical', 2648), ('treatment', 2595), ('activity', 2561), ('influenza', 2502), ('type', 2472), ('animal', 2447), ('observe', 2415), ('child', 2410), ('table', 2389), ('numb', 2340)]
188


In [8]:
############################################
### Load Train, Validation, and Test sets
############################################


# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens


# load all docs in a directory
def process_docs(doc_list, vocab):
	documents = list()
	for doc in doc_list:
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# 70% Train
# 20% Validation
# 10% Test
train_split = 0.7
valid_split = 0.9

In [9]:
############################################
### Training Set
############################################

# load all training reviews
positive_docs = process_docs(positive_papers[:int(len(positive_papers)*train_split)], vocab)
negative_docs = process_docs(negative_papers[:int(len(negative_papers)*train_split)], vocab)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(len(positive_docs))] + [1 for _ in range(len(negative_docs))])

In [10]:
############################################
### Validation Set
############################################

# load all test reviews
positive_docs = process_docs(positive_papers[int(len(positive_papers)*train_split):int(len(positive_papers)*valid_split)], vocab)
negative_docs = process_docs(negative_papers[int(len(negative_papers)*train_split):int(len(negative_papers)*valid_split)], vocab)
valid_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(valid_docs)
# pad sequences
Xvalid = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
yvalid = array([0 for _ in range(len(positive_docs))] + [1 for _ in range(len(negative_docs))])

In [11]:
############################################
### Test Set
############################################

# load all test reviews
positive_docs = process_docs(positive_papers[int(len(positive_papers)*valid_split):], vocab)
negative_docs = process_docs(negative_papers[int(len(negative_papers)*valid_split):], vocab)
valid_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(valid_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(len(positive_docs))] + [1 for _ in range(len(negative_docs))])

In [12]:
print(Xtrain[-1])
print(len(Xtrain))
print(len(Xvalid))
print(Xvalid[-1])
print(len(Xtest))
print(Xtest[-1])

print(ytrain)
print(len(ytrain))
print(yvalid)
print(len(yvalid))
#print(ytest)
print(len(ytest))

[26  3 32 ...  0  0  0]
445
127
[ 80  14 167 ...   0   0   0]
64
[ 27  12 154 ...   0   0   0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1]
4

In [13]:
def model1(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model


def model2(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='relu'))
    return model


# doubles dense of model1 to 20
def model3(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model


def model4(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=128, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=3))
    model.add(Conv1D(filters=64, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model

In [19]:
############################################
### Run 1D CNN
############################################

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
epochs = 20

log_dir = "logs"
model_name = "model_1"
fit_dir = path.join(log_dir, "fit", str(model_name) + "_" + datetime.datetime.now().strftime("%Y%m%d-%H%M"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=fit_dir, histogram_freq=1)

The neural network starts with an embedding layer, which turns positive integers in the dense vectors of a fixed size. The vocabulary size and input length are passed to the embedding.  Then 1D convolutional layer is used to do temporal convolution.  The layer creates a convolution kernel which takes the input and convulves it over a single dimension to produce a tensor of outputs.  Then, a relu activation is applied to the outputs.  A max pooling layers is used to reduce the dimensionality of the data.  Then the flatten layer collapses the spatial dimension of the input to the channel dimension. Lastly, two dense layers reduce the output to 10 and then 1 value.  The last layer uses a sigmoid to classify two categories.

In [20]:
# define model
model = model1(vocab_size, max_length)

In [None]:
# define model
model = model2(vocab_size, max_length)

In [27]:
# define model
model = model3(vocab_size, max_length)

In [None]:
# define model
model = model4(vocab_size, max_length)

In [None]:
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
#model.fit(Xtrain, ytrain, epochs=100, validation_data=(Xvalid, yvalid), verbose=2)
model.fit(Xtrain, ytrain, epochs=epochs, validation_data=(Xvalid, yvalid), verbose=2, callbacks=[tensorboard_callback])
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Train on 445 samples, validate on 127 samples
Epoch 1/20
445/445 - 1s - loss: 0.6336 - accuracy: 0.6292 - val_loss: 0.3845 - val_accuracy: 0.8898
Epoch 2/20
445/445 - 0s - loss: 0.3350 - accuracy: 0.9034 - val_loss: 0.1085 - val_accuracy: 0.9764
Epoch 3/20
445/445 - 0s - loss: 0.1731 - accuracy: 0.9438 - val_loss: 0.1634 - val_accuracy: 0.9370
Epoch 4/20
445/445 - 0s - loss: 0.1280 - accuracy: 0.9573 - val_loss: 0.1177 - val_accuracy: 0.9528
Epoch 5/20
445/445 - 0s - loss: 0.0815 - accuracy: 0.9730 - val_loss: 0.1771 - val_accuracy: 0.9370


In [26]:
%tensorboard --logdir logs/fit

Reusing TensorBoard on port 6006 (pid 1976), started 0:01:21 ago. (Use '!kill 1976' to kill it.)

The network was used to classify all 16000 literatures (sans training and validation papers).  After prediction, one paper taken at random that was classified as relevant to the topic of Severity of Disease. “Impact of Middle East respiratory syndrome outbreak on the use of emergency medical resources in febrile patients”. This paper discusses the occurrence of a respiratory syndrome in 2015 in the Middle East.  From this paper we can conclude that the symptoms shown by the patients included a fever.  In addition, statistics are given regarding the duration of the fever at the emergency department and the patient’s length of stay in the emergency department.  The paper directly addresses the issue of fatality in symptomatic hospitalized patients : “We also found no change in mortality rates for febrile patients attending the ED after the outbreak.”; although there are no statistics about this topic.  The paper does highlight mortality rates for emergency room patients due to overcrowding, and cites some other papers which specifically address the issue.  The neural network classifier is specifically looking for papers regarding symptomatic patients and hospitalization, so this paper is considered useful to the Kaggle task assignment.  

About half the papers were classified relevant when a threshold of .5 prediction is used.  When predicting the classes directly with keras, . This neural network would be useful to researchers looking at papers since they would have a better idea of which papers they should start to read. Looking at the top 99th percentile of papers, there are some false positives, such as “An ethnic model of Japanese overseas tourism companies”.  


# Accuracy and Loss

We tested a few different neural networks and here are the validation accuracy and loss results from tensorboard (this can also be seen if line 26 is run.) 
![title](accuracy.png)
![title](loss.png)
We tested out 4 different models and included three of the models here. The best model is our first model, "model 1", which was inspired the movie sentiment analysis [1].  


# Categorizing training data
There did not exist labels for the papers so we had to make our own labels for training and validation. Our process was aided with the clustering kernel by maksimeren  in Kaggle [2]. First we searched in the search bar (see SHOW section in kernel) for keywords such as "hospitalization", "fatality", "elderly", "mice", etc. After searching a keyword, we looked randomly through the papers and labels papers based on whether we though the titles and skimming the papers suggested that they were about severve virus cases in humans.  

# Resources
[1] Brownlee, J. (2019, November 19). How to Develop a Deep Convolutional  
Neural Network for Sentiment Analysis (Text Classification). Retrieved May 4, 2020, from https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/
[2] 