# COVID-19 risk factors

Coronavirus is a serious public health threat to countries worldwide and rapid action must be taken to avoid the collapse of health care systems.

Numerous studies are released every day, covering risk factors, transmission, incubation, diagnostics, and potential vaccines and therapeutics, to name just a few of the topics.

However, as the number of research papers grows, it is becoming more and more difficult for humans to comb through the sheer mass of information and find patterns in the findings. AI is therefore needed to speed up the process and generate valuable insights from these studies.



In [1]:
import numpy as np
import pandas as pd
from glob import glob
import config
import json



In [2]:
# read in the metadata
df_meta = pd.read_csv('metadata.csv')
df_meta.head()


Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,"Overall, James C.",American Heart Journal,,,False,custom_license
1,,Elsevier,Coronaviruses in Balkan nephritis,10.1016/0002-8703(80)90355-5,,6243850.0,els-covid,,1980-03-31,"Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;...",American Heart Journal,,,False,custom_license
2,,Elsevier,Cigarette smoking and coronary heart disease: ...,10.1016/0002-8703(80)90356-7,,7355701.0,els-covid,,1980-03-31,"Friedman, Gary D",American Heart Journal,,,False,custom_license
3,aecbc613ebdab36753235197ffb4f35734b5ca63,Elsevier,Clinical and immunologic studies in identical ...,10.1016/0002-9343(73)90176-9,,4579077.0,els-covid,"Abstract Middle-aged female identical twins, o...",1973-08-31,"Brunner, Carolyn M.; Horwitz, David A.; Shann,...",The American Journal of Medicine,,,True,custom_license
4,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285.0,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,"Garibaldi, Richard A.",The American Journal of Medicine,,,False,custom_license


In [3]:
# read in the json schema
with open('json_schema.txt') as open_json:
    json_schema = list(open_json)


In [4]:
# read in the studies
studies_all = glob(config.global_path+'**/*.json', recursive=True)
len(studies_all)


29315

In [5]:
# read in the first study and create dataframe for studies
with open(studies_all[0]) as file:
    first_study = json.load(file)
    
df_studies = pd.DataFrame.from_dict(first_study, orient='index').T
df_studies


Unnamed: 0,paper_id,metadata,abstract,body_text,bib_entries,ref_entries,back_matter
0,0015023cc06b5362d332b3baf348d11567ca2fbb,{'title': 'The RNA pseudoknots in foot-and-mou...,[{'text': 'word count: 194 22 Text word count:...,"[{'text': 'VP3, and VP0 (which is further proc...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Genetic...",{'FIGREF0': {'text': 'and-mouth disease virus ...,[{'text': 'author/funder. All rights reserved....


In [6]:
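# load the remaining studies and append them to the dataframe in one step
# (concatenating inside the loop copies the growing dataframe on every iteration)
studies_list = []

for study in studies_all[1:]:
    with open(study) as file:
        studies_list.append(json.load(file))

df_studies = pd.concat([df_studies, pd.DataFrame(studies_list)], ignore_index=True, sort=False)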




In [7]:
# remove any unwanted column
df_studies.drop(columns=['back_matter'], inplace=True)


In [9]:
# create new columns
df_studies['abstract_text'] = df_studies['abstract'].apply(lambda x: ','.join([i['text'] for i in x]) if x != [] else '')
df_studies['title'] = df_studies['metadata'].apply(lambda x: x['title'] if x != {} else '')
df_studies['authors'] = df_studies['metadata'].apply(lambda x: x['authors'] if x != {} else [])
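
The lambdas above can get hard to read once more fields are extracted. A minimal sketch of the same idea with named helpers follows. The function names `abstract_to_text` and `metadata_field` and the sample record are hypothetical and not part of the notebook, and the sketch joins abstract paragraphs with a space rather than a comma, which is an assumption about what reads best downstream.

```python
def abstract_to_text(paragraphs):
    """Join the 'text' field of each abstract paragraph into one string."""
    if not isinstance(paragraphs, list):
        return ''
    return ' '.join(p.get('text', '') for p in paragraphs if isinstance(p, dict))


def metadata_field(metadata, key, default):
    """Return metadata[key], or default when metadata is missing or not a dict."""
    if isinstance(metadata, dict):
        return metadata.get(key, default)
    return default


# hypothetical record that mimics the structure of one parsed study
record = {
    'abstract': [{'text': 'First paragraph.'}, {'text': 'Second paragraph.'}],
    'metadata': {'title': 'A title', 'authors': [{'first': 'Ada', 'last': 'Lovelace'}]},
}
# hypothetical record with nothing filled in
empty_record = {'abstract': [], 'metadata': {}}

abstract_text = abstract_to_text(record['abstract'])
title = metadata_field(record['metadata'], 'title', '')
empty_title = metadata_field(empty_record['metadata'], 'title', '')
empty_authors = metadata_field(empty_record['metadata'], 'authors', [])
```

Under these assumptions, the helpers could be passed to `apply` in place of the lambdas, for example `df_studies['abstract'].apply(abstract_to_text)`.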


In [None]:
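# 'authors' holds a list of dicts, so iterate over the authors first
# and build one full-name string per author
df_studies['authors_list'] = df_studies['authors'].apply(
    lambda authors: [' '.join([value if type(value) == str else
                               (value[0] if (type(value) == list and len(value) > 0) else '')
                               for key, value in author.items()]).strip()
                     for author in authors])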

In [279]:
df_studies['authors'][0]

[{'first': 'Joseph',
  'middle': ['C'],
  'last': 'Ward',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Lidia',
  'middle': [],
  'last': 'Lasecka-Dykes',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Chris',
  'middle': [],
  'last': 'Neil',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Oluwapelumi',
  'middle': [],
  'last': 'Adeyemi',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Sarah',
  'middle': [],
  'last': '',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': '',
  'middle': [],
  'last': 'Gold',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Niall',
  'middle': [],
  'last': 'Mclean',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Caroline',
  'middle': [],
  'last': 'Wright',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': 'Morgan',
  'middle': ['R'],
  'last': 'Herod',
  'suffix': '',
  'affiliation': {},
  'email': ''},
 {'first': '

In [261]:
for x in df_studies['metadata']:
    if x['authors'] == []:
        print(x)

{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'TITLE: Pulmonary Metagenomic Sequencing Suggests Missed Infections in Immunocompromised AFFILIATIONS', 'authors': []}
{'title': '', 'authors': []}
{'title': 'Characterisation of the faecal virome of captive and wild Tasmanian', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'Excess cases of Influenza like illnesses in France synchronous with COVID19 invasion. Pierre-Yves Boëlle 1 and the Sentinelles syndromic and viral surveillance group', 'authors': []}
{'title': '', 'authors': []}
{'title': 'A combined RNA-seq and whole genome', 'authors': []}
{'title': 'The Israeli Acute Paralysis Virus IRES captures host', 'authors': []}
{'title': 'Nanopore-based native RNA sequencing provides insights into', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'Herpesvirus infection reduces Pol II occupancy 

{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'Morbidity and Mortality Weekly Report', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'Bibliography of the current world literature', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'Sialic Acids and Sialoglycoconjugates in the Biology of life, Health and Disease', 'authors': []}
{'title': '', 'authors': []

{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'A Disease Around the Corner', 'authors': []}
{'title': 'Original Article', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'CLINICAL EXPERIMENTAL VACCINE RESEARCH', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'Supplementary Information 4: Recommendations for Laboratory Performance and Interpretation of the Direct Antiglobulin Test and Flow Cytometry for Erythrocyte-Bound', 'authors': []}
{'title': 'ARTICLE IN PRESS +Model', 'authors': []}
{'title': '', 'authors': []}
{'title': 'CLINICAL EXPERIMENTAL VACCINE RESEARCH', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'authors': []}
{'title': 'CLINICAL EXPERIMENTAL VACCINE RESEARCH', 'authors': []}
{'title': '', 'authors': []}
{'title': '', 'auth
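
Printing every record with an empty author list produces a lot of output. A rough sketch of summarising the same check with counts is shown here. The function name `summarise_missing` and the small `sample` list are hypothetical; the first two sample entries copy values from the output above, and the third is made up for illustration.

```python
from collections import Counter


def summarise_missing(metadata_records):
    """Count records by whether the title and the authors list are filled in."""
    counts = Counter()
    for meta in metadata_records:
        no_title = not meta.get('title')
        no_authors = not meta.get('authors')
        if no_title and no_authors:
            counts['no title, no authors'] += 1
        elif no_authors:
            counts['title only'] += 1
        elif no_title:
            counts['authors only'] += 1
        else:
            counts['complete'] += 1
    return counts


# hypothetical sample; on the real data this would be df_studies['metadata']
sample = [
    {'title': '', 'authors': []},
    {'title': 'Morbidity and Mortality Weekly Report', 'authors': []},
    {'title': 'Example', 'authors': [{'first': 'Ada', 'last': 'Lovelace'}]},
]
summary = summarise_missing(sample)
```

Calling `summarise_missing(df_studies['metadata'])` should give one count per category, assuming every entry in the column is a dict.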

In [232]:
[' '.join([value if type(value) == str else (value[0] if (len(value) > 0 and type(value) == list) else '')
                for key, value in x.items()]).strip() for x in df_studies['metadata'][0]['authors']]

['Joseph C Ward',
 'Lidia  Lasecka-Dykes',
 'Chris  Neil',
 'Oluwapelumi  Adeyemi',
 'Sarah',
 'Gold',
 'Niall  Mclean',
 'Caroline  Wright',
 'Morgan R Herod',
 'David  Kealy',
 'Emma',
 'Warner',
 'Donald P King',
 'Tobias J Tuthill',
 'David J Rowlands',
 'Nicola J',
 'Stonehouse  A#']
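
The names above contain double spaces where the middle name is empty, and the join would also pull in the `email` string whenever that field is filled. A minimal sketch of a formatter that uses only the name fields and skips empty parts is given below. The function name `format_author` is hypothetical, and the sketch assumes `middle` is always a list of strings, as in the records shown. It does not attempt to merge authors that the parser split in two, such as 'Sarah' and 'Gold'.

```python
def format_author(author):
    """Build 'First Middle Last Suffix' from one author dict, skipping empty parts."""
    parts = [author.get('first', '')]
    parts.extend(author.get('middle', []))   # assumed to be a list of strings
    parts.append(author.get('last', ''))
    parts.append(author.get('suffix', ''))
    return ' '.join(p.strip() for p in parts if isinstance(p, str) and p.strip())


# three author records copied from the first study's metadata
authors = [
    {'first': 'Joseph', 'middle': ['C'], 'last': 'Ward',
     'suffix': '', 'affiliation': {}, 'email': ''},
    {'first': 'Lidia', 'middle': [], 'last': 'Lasecka-Dykes',
     'suffix': '', 'affiliation': {}, 'email': ''},
    {'first': 'Sarah', 'middle': [], 'last': '',
     'suffix': '', 'affiliation': {}, 'email': ''},
]
names = [format_author(a) for a in authors]
```

With these inputs, `names` holds 'Joseph C Ward', 'Lidia Lasecka-Dykes' and 'Sarah', each separated by single spaces.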

In [233]:
df_studies['metadata'][0]

{'title': 'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3',
 'authors': [{'first': 'Joseph',
   'middle': ['C'],
   'last': 'Ward',
   'suffix': '',
   'affiliation': {},
   'email': ''},
  {'first': 'Lidia',
   'middle': [],
   'last': 'Lasecka-Dykes',
   'suffix': '',
   'affiliation': {},
   'email': ''},
  {'first': 'Chris',
   'middle': [],
   'last': 'Neil',
   'suffix': '',
   'affiliation': {},
   'email': ''},
  {'first': 'Oluwapelumi',
   'middle': [],
   'last': 'Adeyemi',
   'suffix': '',
   'affiliation': {},
   'email': ''},
  {'first': 'Sarah',
   'middle': [],
   'last': '',
   'suffix': '',
   'affiliation': {},
   'email': ''},
  {'first': '',
   'middle': [],
   'last': 'Gold',
   'suffix': '',
   'affiliation': {},
   'email': ''},
  {'first': 'Niall',
   'middle': [],
   'last': 'Mclean',
   'suffix': '',
   'affiliation': {},
   'email': ''},
  {'first': 'Caroline