# COVID-19 risk factors

COVID-19 is a serious public health threat to countries worldwide, and rapid action must be taken to avoid the collapse of health care systems.

Numerous studies are released every day containing information on risk factors, transmission, incubation, diagnostics, and potential vaccines and therapeutics, to name just a few of the topics covered.

However, as the number of research papers grows, it is becoming more and more difficult for humans to comb through the sheer mass of information and find patterns in the findings. Therefore, AI is needed to speed up the process and generate valuable insights from these studies.



In [1]:
import numpy as np
import pandas as pd
from glob import glob
import config
import json



## Load data

In [2]:
# read in the metadata
df_meta = pd.read_csv('metadata.csv')
df_meta.head()


Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,"Overall, James C.",American Heart Journal,,,False,custom_license
1,,Elsevier,Coronaviruses in Balkan nephritis,10.1016/0002-8703(80)90355-5,,6243850.0,els-covid,,1980-03-31,"Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;...",American Heart Journal,,,False,custom_license
2,,Elsevier,Cigarette smoking and coronary heart disease: ...,10.1016/0002-8703(80)90356-7,,7355701.0,els-covid,,1980-03-31,"Friedman, Gary D",American Heart Journal,,,False,custom_license
3,aecbc613ebdab36753235197ffb4f35734b5ca63,Elsevier,Clinical and immunologic studies in identical ...,10.1016/0002-9343(73)90176-9,,4579077.0,els-covid,"Abstract Middle-aged female identical twins, o...",1973-08-31,"Brunner, Carolyn M.; Horwitz, David A.; Shann,...",The American Journal of Medicine,,,True,custom_license
4,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285.0,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,"Garibaldi, Richard A.",The American Journal of Medicine,,,False,custom_license


In [3]:
# read in the json schema
with open('json_schema.txt') as open_json:
    json_schema = list(open_json)


In [4]:
# read in the studies
studies_all = glob(config.global_path+'**/*.json', recursive=True)
len(studies_all)


29315

In [5]:
# read in the first study and create dataframe for studies
with open(studies_all[0]) as file:
    first_study = json.load(file)
    
df_studies = pd.DataFrame.from_dict(first_study, orient='index').T
df_studies


Unnamed: 0,paper_id,metadata,abstract,body_text,bib_entries,ref_entries,back_matter
0,0015023cc06b5362d332b3baf348d11567ca2fbb,{'title': 'The RNA pseudoknots in foot-and-mou...,[{'text': 'word count: 194 22 Text word count:...,"[{'text': 'VP3, and VP0 (which is further proc...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Genetic...",{'FIGREF0': {'text': 'and-mouth disease virus ...,[{'text': 'author/funder. All rights reserved....
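
The one-row layout above comes from `orient='index'` followed by a transpose: each top-level JSON key first becomes a row label, and `.T` then turns those labels into columns. A minimal sketch with a made-up record (the `toy_study` dict and its values are illustrative stand-ins, not taken from the dataset):

```python
import pandas as pd

# hypothetical, heavily shortened stand-in for one parsed study file
toy_study = {
    'paper_id': 'abc123',
    'metadata': {'title': 'Toy title', 'authors': []},
    'abstract': [{'text': 'Toy abstract.'}],
}

# orient='index' puts each top-level key on its own row;
# transposing yields a single row with one column per key
df_toy = pd.DataFrame.from_dict(toy_study, orient='index').T
print(df_toy.columns.tolist())
print(df_toy.shape)
```

Nested values such as the `metadata` dict stay as Python objects inside a single cell, which is why later cells unpack them with `apply`.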


In [6]:
# collect the remaining studies in one list and concatenate them in a single pass
studies_list = [df_studies]

for study in studies_all[1:]:
    df_temp = pd.read_json(study, orient='index').T
    studies_list.append(df_temp)

df_studies = pd.concat(studies_list, ignore_index=True)




In [7]:
# remove the back_matter column, which is not needed
df_studies.drop(columns=['back_matter'], inplace=True)


In [113]:
# create new columns
df_studies['abstract_text'] = df_studies['abstract'].apply(lambda x: ','.join([i['text'] for i in x]) if x != [] else '')
df_studies['title'] = df_studies['metadata'].apply(lambda x: x['title'] if x != {} else '')
df_studies['authors'] = df_studies['metadata'].apply(lambda x: x['authors'] if x != {} else '')
df_studies['authors_list'] = df_studies['authors'].apply(lambda x: [' '.join([value if type(value) == str else 
                                                            (value[0] if (len(value) > 0 and type(value) == list) 
                                                            else (value+'; ' if key == 'last' else ''))
                                                            for key, value in i.items()]).strip() for i in x]
                                                            if x != [] else '')
# drop the helper authors column
df_studies.drop(columns=['authors'], inplace=True)
# create a temporary dataframe with sha and journal
df_meta_journal = df_meta[['sha', 'journal']].copy()
# rename sha to paper_id so it matches the key used in df_studies
df_meta_journal.rename(columns={'sha': 'paper_id'}, inplace=True)
# merge the journal onto the matching paper
df_studies = df_studies.merge(df_meta_journal, on='paper_id', how='inner')
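
The nested lambda that builds `authors_list` is hard to follow. A minimal sketch of the same idea as a named function, assuming each author entry is a dict with `first` (string), `middle` (list of strings) and `last` (string) keys. That layout is inferred from the code and output in this notebook rather than confirmed against the schema, and `format_author` is a hypothetical helper. Unlike the lambda, it keeps every middle name and ignores any other fields in the entry. The second example author is made up.

```python
def format_author(author):
    """Join first, middle and last name parts into one string."""
    parts = [author.get('first', '')]
    parts.extend(author.get('middle', []))
    parts.append(author.get('last', ''))
    # drop empty parts so missing names do not leave double spaces
    return ' '.join(p for p in parts if p)


# assumed author layout; the second entry is invented for illustration
example_authors = [
    {'first': 'Joseph', 'middle': ['C'], 'last': 'Ward'},
    {'first': 'Jane', 'middle': [], 'last': 'Doe'},
]
formatted = [format_author(a) for a in example_authors]
print(formatted)
```

If the assumption holds, something like `df_studies['authors'].apply(lambda x: [format_author(a) for a in x])` could stand in for the lambda.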


In [114]:
df_studies.head()

Unnamed: 0,abstract,bib_entries,body_text,metadata,paper_id,ref_entries,abstract_text,title,authors_list,journal
0,[{'text': 'word count: 194 22 Text word count:...,"{'BIBREF0': {'ref_id': 'b0', 'title': 'Genetic...","[{'text': 'VP3, and VP0 (which is further proc...",{'title': 'The RNA pseudoknots in foot-and-mou...,0015023cc06b5362d332b3baf348d11567ca2fbb,{'FIGREF0': {'text': 'and-mouth disease virus ...,word count: 194 22 Text word count: 5168 23 24...,The RNA pseudoknots in foot-and-mouth disease ...,"[Joseph C Ward, Lidia Lasecka-Dykes, Chris N...",
1,[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'World H...",[{'text': 'The 2019-nCoV epidemic has spread a...,{'title': 'Healthcare-resource-adjusted vulner...,004f0f8bb66cf446678dc13cf2701feec4f36d76,{'FIGREF0': {'text': '(A) The estimated number...,,Healthcare-resource-adjusted vulnerabilities t...,"[Hanchu Zhou, Jiannan Yang, Kaicheng Tang, ...",
2,[{'text': 'Infectious bronchitis (IB) causes s...,"{'BIBREF0': {'ref_id': 'b0', 'title': 'Emergen...","[{'text': 'Infectious bronchitis (IB), which i...","{'title': 'Real-time, MinION-based, amplicon s...",00d16927588fb04d4be0e6b269fc02f0d3c2aa7b,{'FIGREF0': {'text': '35 cycles of 94 °C for 3...,Infectious bronchitis (IB) causes significant ...,"Real-time, MinION-based, amplicon sequencing f...","[Salman L Butt, Eric C Erwood, Jian Zhang, Ho...",
3,[{'text': 'Nipah Virus (NiV) came into limelig...,"{'BIBREF0': {'ref_id': 'b0', 'title': 'Molecul...",[{'text': 'Nipah is an infectious negative-sen...,{'title': 'A Combined Evidence Approach to Pri...,0139ea4ca580af99b602c6435368e7fdbefacb03,{'FIGREF0': {'text': 'NVIK architecture illust...,Nipah Virus (NiV) came into limelight recently...,A Combined Evidence Approach to Prioritize Nip...,"[Nishi Kumari, Ayush Upadhyay, Kishan Kalia...",
4,[{'text': 'Background: A novel coronavirus (20...,"{'BIBREF0': {'ref_id': 'b0', 'title': 'A Novel...","[{'text': 'In December 2019, a cluster of pati...",{'title': 'Assessing spread risk of Wuhan nove...,013d9d1cba8a54d5d3718c229b812d7cf91b6c89,"{'FIGREF0': {'text': 'February 1 st , 2020, re...",Background: A novel coronavirus (2019-nCoV) em...,Assessing spread risk of Wuhan novel coronavir...,"[Shengjie Lai, Isaac I Bogoch, Nick W Ruktano...",
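
An inner merge on `paper_id` silently drops every study whose ID has no match in the metadata `sha` column. One way to check how many rows are affected is a left merge with `indicator=True`. The sketch below uses small made-up frames (`studies` and `meta` are stand-ins for `df_studies` and `df_meta_journal`, and all values are invented):

```python
import pandas as pd

# toy stand-ins for df_studies and df_meta_journal (values are made up)
studies = pd.DataFrame({'paper_id': ['a1', 'b2', 'c3']})
meta = pd.DataFrame({'paper_id': ['a1', 'c3', 'd4'],
                     'journal': ['Journal A', None, 'Journal D']})

# a left merge keeps every study; the _merge column records
# whether a matching metadata row was found
checked = studies.merge(meta, on='paper_id', how='left', indicator=True)
match_counts = checked['_merge'].value_counts()
print(match_counts)
```

Rows marked `left_only` are the ones an inner merge would have removed. Rows marked `both` can still have an empty `journal` when the metadata itself has no value.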


## Preprocessing

## Exploratory Data Analysis

## Preprocessing for model

## Modelling

## Evaluation