# CORD-19 Explore dataset
This Jupyter notebook explores the CORD-19 dataset:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, the required packages are imported into the notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz 

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv' , converters={'software': lambda x: x[1:-1].split(',')})
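
The converter above strips the outer brackets and splits on commas, so each element keeps its surrounding quotes and any leading space. A minimal alternative sketch follows, assuming the raw CSV stores each cell as a Python list literal such as `['STATA', 'IAMCEST']`. The helper name `parse_software_list` and the sample rows are hypothetical and not part of the original notebook.

```python
import ast
import io

import pandas as pd


def parse_software_list(cell):
    """Turn a cell such as "['STATA', 'IAMCEST']" into a list of names."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Empty or malformed cells fall back to an empty list
        return []
    if not isinstance(parsed, list):
        return []
    return [str(name).strip() for name in parsed]


# Tiny in-memory stand-in for the real CSV (hypothetical rows)
sample_csv = io.StringIO(
    "paper_id,software\n"
    "a1,\"['STATA', 'IAMCEST']\"\n"
    "b2,\"['SPSS']\"\n"
)
sample = pd.read_csv(sample_csv, converters={"software": parse_software_list})
print(sample.loc[0, "software"])
```

With this converter the list elements are plain names, which should make later comparisons between software mentions less error-prone than with quote-wrapped strings.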

Show the head of the dataset to inspect all columns and obtain a broad overview. 

In [3]:
CORD19_CSV.head(20)

Unnamed: 0,paper_id,doi,title,source_x,license,publish_time,journal,url,software
0,00006903b396d50cc0037fed39916d57d50ee801,,Urban green space and happiness in developed c...,ArXiv,arxiv,2021-01-04,,https://arxiv.org/pdf/2101.00807v1.pdf,['Google Street View']
1,0000fcce604204b1b9d876dc073eb529eb5ce305,10.1016/j.regg.2021.01.002,La Geriatría de Enlace con residencias en la é...,Elsevier; PMC,els-covid,2021-01-13,Rev Esp Geriatr Gerontol,https://api.elsevier.com/content/article/pii/S...,['SEGG']
2,000122a9a774ec76fa35ec0c0f6734e7e8d0c541,10.1016/j.rec.2020.08.002,Impact of COVID-19 on ST-segment elevation myo...,Elsevier; Medline; PMC,no-cc,2020-09-08,Rev Esp Cardiol (Engl Ed),https://api.elsevier.com/content/article/pii/S...,"['STATA', 'IAMCEST']"
3,0001418189999fea7f7cbe3e82703d71c85a6fe5,10.1016/j.vetmic.2006.11.026,Absence of surface expression of feline infect...,Elsevier; Medline; PMC,no-cc,2007-03-31,Vet Microbiol,https://www.sciencedirect.com/science/article/...,['SPSS']
4,00033d5a12240a8684cfe943954132b43434cf48,10.3390/v12080849,Detection of Severe Acute Respiratory Syndrome...,Medline; PMC,cc-by,2020-08-04,Viruses,https://www.ncbi.nlm.nih.gov/pubmed/32759673/;...,"['R', 'MassARRAY Typer Analyzer']"
5,00035ac98d8bc38fbca02a1cc957f55141af67c0,10.3389/fpsyt.2020.559701,The Psychological Pressures of Breast Cancer P...,Medline; PMC,cc-by,2020-12-15,Front Psychiatry,https://doi.org/10.3389/fpsyt.2020.559701; htt...,"['Wechat', 'SPSS Statistics']"
6,00039b94e6cb7609ecbddee1755314bcfeb77faa,10.1111/j.1365-2249.2004.02415.x,Plasma inflammatory cytokines and chemokines i...,Medline; PMC,bronze-oa,2004-04-01,Clinical & Experimental Immunology,https://onlinelibrary.wiley.com/doi/pdfdirect/...,['Statistical Package for Social Sciences (SPS...
7,0004456994f6c1d5db7327990386d33c01cff32a,10.1186/1471-2334-10-8,Seasonal influenza risk in hospital healthcare...,PMC,cc-by,2010-01-12,BMC Infect Dis,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,"['STATA', 'STATA', 'Statacorp']"
8,00073cb65dd2596249230fab8b15a71c4a135895,10.1086/605034,Risk Parameters of Fulminant Acute Respiratory...,Medline; PMC,no-cc,2009-08-01,J Infect Dis,https://doi.org/10.1086/605034; https://www.nc...,"['SPSS', 'SPSS']"
9,0007f972812bb45abbe5b0edf8db5359d49c23eb,10.1186/s42234-020-00057-1,The role of nicotinic receptors in SARS-CoV-2 ...,Medline; PMC,cc-by,2020-10-28,Bioelectron Med,https://www.ncbi.nlm.nih.gov/pubmed/33292872/;...,"['geNorm', 'GraphPad Prism', 'GraphPad', 'C..."


### General check of the dataframe

First of all, it is interesting to investigate how many NaNs the DataFrame contains.

In [4]:
CORD19_CSV.isnull().sum().sum()

12476

How are the NaNs distributed between the columns of the DataFrame? 

In [5]:
def count_nan_per_column(df):
    """Count the NaNs in each column of the received dataframe.

    Returns a dataframe indexed by column name with a single column 'NaNs'.
    """
    # One NaN count per column, in column order
    col_num = [df[col].isnull().sum() for col in df.columns]
    df_joined = pd.DataFrame(col_num, index=df.columns, columns=['NaNs'])
    return df_joined

In [6]:
df_NaNs = count_nan_per_column(CORD19_CSV)
df_NaNs

Unnamed: 0,NaNs
paper_id,0
doi,3144
title,0
source_x,0
license,0
publish_time,0
journal,9332
url,0
software,0
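
The per-column counts add up to the total reported earlier: 3144 + 9332 = 12476. As a cross-check, the same table can probably be obtained directly from pandas without a helper function. The sketch below uses a small toy frame with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns (hypothetical values)
toy = pd.DataFrame({
    "paper_id": ["a", "b", "c"],
    "doi": ["10.1/x", np.nan, np.nan],
    "journal": [np.nan, "J Infect", "Viruses"],
})

# Per-column NaN counts as a one-column DataFrame
nan_counts = toy.isnull().sum().to_frame(name="NaNs")

# The grand total equals the sum of the per-column counts
total_nans = int(toy.isnull().sum().sum())
print(nan_counts)
print(total_nans)
```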


For further analysis, it is important to note that only the columns "doi" and "journal" contain NaNs.

In [7]:
# Planned checks to indicate a corrupted CSV
# paper_id -> check integrity with a base-16 check
# doi -> regex check
# title -> check for the existence of "covid" or "corona" in the title
# source_x -> show most used source_x
# license -> show most used license
# publish_time -> check how many rows are before 2020 and how many after
# journal -> show most used journal
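
Some of the checks listed above could be sketched as follows. This is only an illustration on toy data: the DOI pattern is a common heuristic rather than a full validator, the titles are hypothetical, and the cut-off of 1 January 2020 is an assumption about what "before 2020" means.

```python
import numpy as np
import pandas as pd

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", a suffix
DOI_PATTERN = r"^10\.\d{4,9}/\S+$"

# Toy rows (hypothetical titles; the first DOI appears in the head output)
toy = pd.DataFrame({
    "doi": ["10.1016/j.regg.2021.01.002", np.nan, "not-a-doi"],
    "title": ["COVID-19 case study", "Seasonal influenza risk",
              "Novel coronavirus report"],
    "publish_time": ["2020-09-08", "2010-01-12", "2021-01-13"],
})

# doi: flag values that are present but malformed; NaNs are handled separately
doi_present = toy["doi"].notna()
doi_valid = toy["doi"].str.match(DOI_PATTERN, na=False)
malformed_doi = doi_present & ~doi_valid

# title: case-insensitive search for "covid" or "corona"
mentions_covid = toy["title"].str.contains(r"covid|corona", case=False, na=False)

# publish_time: split at 1 January 2020
published = pd.to_datetime(toy["publish_time"], errors="coerce")
cutoff = pd.Timestamp("2020-01-01")
before_2020 = int((published < cutoff).sum())
from_2020 = int((published >= cutoff).sum())
```

The "most used" checks for source_x, license and journal would presumably reduce to `value_counts().head()` on the respective column.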

### Dedicated analysis based on columns

Create a separate Series for each column.

In [8]:
paper_id = CORD19_CSV.paper_id
doi = CORD19_CSV.doi
title = CORD19_CSV.title
source_x = CORD19_CSV.source_x
license = CORD19_CSV.license
publish_time = CORD19_CSV.publish_time
journal = CORD19_CSV.journal
url = CORD19_CSV.url

### Checking the column "paper_id"

First, check the integrity of the Series paper_id. Each row should have a unique ID that is not shared with any other row.

In [9]:
paper_id_counted = paper_id.value_counts()
paper_id_counted

0ed3c6a5559cd73307184f51fc53ccc76da559bc    3
5d6678f81812464543b367e7de138e23b3483ed1    2
0831fe32280e46ba8d5c1a9456111e1e009863ac    2
c89f86cdd9d41eeec127cc0b03990c52888a9635    2
5d0d0bd116976e1412c10a84902894999df4a342    2
                                           ..
92192bc6b6a3b6222439d4b508c00d7f5adc12fb    1
9410f59af4af3a11c46d0b1b8c53883ea2dbd7d6    1
6102dd48ba28756830876fe88d80c8a81bcc802e    1
3760682ee07bc410122aabc7ff0c494e678713f2    1
f083a0e1c8a168fbf63cc8c6eaac0ff896941c9c    1
Name: paper_id, Length: 77436, dtype: int64

The first finding for the column paper_id is that it does not solely contain unique IDs: some rows clearly share an ID. Next, the exact number of shared IDs needs to be determined.

In [10]:
def find_shared_values(col):
    """ This function counts the surplus entries of a column, i.e. the number
    of entries minus the number of distinct values, and deducts NaNs
    """
    col_shared = len(col) - len(col.value_counts())
    col_shared = col_shared - col.isnull().sum()
    return col_shared

In [11]:
paper_id_shared_ids_num = find_shared_values(paper_id)    
paper_id_shared_ids_num

12

The column contains 12 surplus entries caused by shared IDs, with one ID even used three times (such an ID contributes two surplus entries).

In [12]:
paper_id_counted.head(paper_id_shared_ids_num)

0ed3c6a5559cd73307184f51fc53ccc76da559bc    3
5d6678f81812464543b367e7de138e23b3483ed1    2
0831fe32280e46ba8d5c1a9456111e1e009863ac    2
c89f86cdd9d41eeec127cc0b03990c52888a9635    2
5d0d0bd116976e1412c10a84902894999df4a342    2
36e2047d1674c3095617f3eb97f9f61e48989dfe    2
ff40e6b44e151e42a54227e255a88d0c0c104876    2
d1dde1df11f93e8eae0d0b467cd0455afdc5b98c    2
dd74a3a343529174fe7c6485723cf2d5911c18ed    2
ec7d3038b8912a9fc92f4d02a2c30d566d4d0a93    2
46b053c7126c1603101f46e4bb6e411f790a45fc    2
f02c0b30ece7c578c8641ca794663b4bab14dfba    1
Name: paper_id, dtype: int64
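
The last entry in the output above occurs only once. This suggests that the value 12 counts surplus rows rather than distinct shared IDs, of which the listing shows 11. The difference can be illustrated on a toy Series with hypothetical IDs.

```python
import pandas as pd

# Toy IDs: "a" occurs three times, "b" twice, "c" once (hypothetical)
ids = pd.Series(["a", "a", "a", "b", "b", "c"])
counts = ids.value_counts()

# Surplus rows: what len(col) - len(col.value_counts()) measures
surplus_rows = len(ids) - len(counts)

# Distinct IDs that occur more than once
shared_ids = counts[counts > 1]
num_shared_ids = len(shared_ids)

print(surplus_rows, num_shared_ids)
```

Filtering with `counts > 1` instead of `head(n)` would avoid picking up an ID that is not actually shared.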

Check whether the duplicates are limited to the column paper_id or extend to whole DataFrame rows. The function collect_rows_of_df supports this step.

In [13]:
# Return all rows of a dataframe whose column value equals a given string
def collect_rows_of_df(df,column,st):
    """This function receives a dataframe, the name of a column contained in it
    and a string which can be found within the column.
    The string is compared to every entry of the column.
    When a match is found, the corresponding rows are returned as a dataframe.
    """
    subset = df[df[column] == st]
    return subset

Collecting rows which share their paper_id.

In [14]:
# DataFrame.append was removed in pandas 2.0, so the subsets are combined with pd.concat
shared_paper_id_df = pd.concat(
    [collect_rows_of_df(CORD19_CSV, 'paper_id', paper_id_counted.index[x])
     for x in range(paper_id_shared_ids_num)]
)
shared_paper_id_df

Unnamed: 0,paper_id,doi,title,source_x,license,publish_time,journal,url,software
4466,0ed3c6a5559cd73307184f51fc53ccc76da559bc,10.1016/j.jinf.2020.02.019,Simulating and forecasting the cumulative conf...,Elsevier; Medline; PMC,no-cc,2020-02-26,J Infect,https://doi.org/10.1016/j.jinf.2020.02.019; ht...,"['SAS', 'NLIA', 'NLIA', 'SAS', 'SAS', 'NL..."
4467,0ed3c6a5559cd73307184f51fc53ccc76da559bc,10.1016/j.jinf.2020.02.020,Novel coronavirus disease (Covid-19): The firs...,Elsevier; Medline; PMC,els-covid,2020-05-31,Journal of Infection,https://www.sciencedirect.com/science/article/...,"['SAS', 'NLIA', 'NLIA', 'SAS', 'SAS', 'NL..."
4468,0ed3c6a5559cd73307184f51fc53ccc76da559bc,10.1016/j.jinf.2020.02.011,Chinese medical personnel against the 2019-nCoV,Elsevier; Medline; PMC,els-covid,2020-05-31,Journal of Infection,https://api.elsevier.com/content/article/pii/S...,"['SAS', 'NLIA', 'NLIA', 'SAS', 'SAS', 'NL..."
28159,5d6678f81812464543b367e7de138e23b3483ed1,,Data-driven modeling for different stages of p...,PMC,cc-by,2020-09-21,ArXiv,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,"['Influenza Risk Assessment Tool (IRAT)', 'Pa..."
28160,5d6678f81812464543b367e7de138e23b3483ed1,,Data-driven modeling for different stages of p...,ArXiv,arxiv,2020-09-21,,https://arxiv.org/pdf/2009.10018v1.pdf,"['Influenza Risk Assessment Tool (IRAT)', 'Pa..."
2447,0831fe32280e46ba8d5c1a9456111e1e009863ac,,A trial emulation approach for policy evaluati...,ArXiv,arxiv,2020-11-11,,https://arxiv.org/pdf/2011.05826v1.pdf,"['NYT Tracker', 'NYT Tracker']"
2448,0831fe32280e46ba8d5c1a9456111e1e009863ac,,A trial emulation approach for policy evaluati...,Medline; PMC,cc-by,2020-11-11,ArXiv,https://www.ncbi.nlm.nih.gov/pubmed/33200083/,"['NYT Tracker', 'NYT Tracker']"
60796,c89f86cdd9d41eeec127cc0b03990c52888a9635,10.1101/2020.11.07.372938,A low power flexible dielectric barrier discha...,BioRxiv,biorxiv,2020-11-09,bioRxiv,https://doi.org/10.1101/2020.11.07.372938,"['ImageJ', 'ImageJ', 'Sterlis®', 'Sterlis®']"
60797,c89f86cdd9d41eeec127cc0b03990c52888a9635,,A low power flexible dielectric barrier discha...,ArXiv,arxiv,2020-11-08,,https://arxiv.org/pdf/2011.03898v1.pdf,"['ImageJ', 'ImageJ', 'Sterlis®', 'Sterlis®']"
28044,5d0d0bd116976e1412c10a84902894999df4a342,10.1016/j.jinf.2020.02.013,Analysis of angiotensin-converting enzyme 2 (A...,Elsevier; Medline; PMC,no-cc,2020-02-21,J Infect,https://www.sciencedirect.com/science/article/...,"['MAFFT', 'Epi2Me interface', 'MinION', 'MA..."


Apart from the column software, the rows sharing a paper_id vary across the other columns, so they are not exact duplicates of whole rows.
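
The distinction between a shared paper_id and a fully duplicated row can also be checked with `DataFrame.duplicated`. The sketch below uses a toy frame with hypothetical values. On the real data, the list-valued column software would likely need to be excluded or converted to strings first, since lists are not hashable.

```python
import pandas as pd

# Toy frame: "a" appears twice with differing DOIs, "b" twice identically
toy = pd.DataFrame({
    "paper_id": ["a", "a", "b", "b", "c"],
    "doi": ["10.1/x", "10.1/y", "10.2/z", "10.2/z", "10.3/w"],
    "license": ["no-cc", "els-covid", "cc-by", "cc-by", "arxiv"],
})

# Rows whose paper_id occurs more than once (all occurrences are marked)
same_id = toy.duplicated(subset=["paper_id"], keep=False)

# Rows that are duplicated across every column
same_row = toy.duplicated(keep=False)

# Number of distinct values per column within each shared paper_id
variation = toy[same_id].groupby("paper_id").nunique()
print(variation)
```

A value greater than 1 in `variation` marks a column that differs between rows with the same paper_id.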

In [15]:
#check paper_id base16 validity 
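
One possible way to implement this check is a regular expression. The sketch assumes that a valid paper_id consists of exactly 40 lowercase hexadecimal characters, which matches the IDs shown in the outputs above but is not confirmed by the dataset documentation. The two invalid IDs are hypothetical.

```python
import pandas as pd

# Assumed ID shape: exactly 40 lowercase hexadecimal characters
HEX_ID_PATTERN = r"^[0-9a-f]{40}$"

ids = pd.Series([
    "00006903b396d50cc0037fed39916d57d50ee801",  # taken from the head output
    "0000fcce604204b1b9d876dc073eb529eb5ce30",   # hypothetical: one character short
    "z000fcce604204b1b9d876dc073eb529eb5ce305",  # hypothetical: non-hex character
])

# True where the ID matches the assumed shape
is_valid_hex = ids.str.match(HEX_ID_PATTERN, na=False)
invalid_ids = ids[~is_valid_hex]
print(invalid_ids)
```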

### Checking the column "doi"

In [16]:
doi_counted = doi.value_counts()
doi_counted

10.1016/j.dsx.2020.04.012      2
10.31729/jnma.5498             2
10.1186/s42269-020-00434-5     1
10.1007/s40615-020-00931-3     1
10.1186/s12985-020-01345-7     1
                              ..
10.1177/2055116917704089       1
10.1007/s00705-003-0244-0      1
10.1056/nejmoa2034201          1
10.1101/2020.04.11.20062158    1
10.1088/1361-648x/aaa0f6       1
Name: doi, Length: 74302, dtype: int64

The column "doi" contains two DOIs that are each shared by two entries. Next, it needs to be explored how the rows sharing a DOI relate to each other.

In [17]:
doi_shared_dois = len(doi) - len(doi.value_counts())
doi_shared_dois

3146

In [21]:
doi_shared_dois_num = find_shared_values(doi)    
doi_shared_dois_num

2
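
The two results differ by the 3144 missing DOIs: `value_counts()` skips NaN while `len()` counts every row, so 3146 - 3144 = 2. The effect can be reproduced on a toy Series with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy DOI column: one DOI used twice, two missing values (hypothetical)
dois = pd.Series(["10.1/x", "10.1/x", "10.2/y", np.nan, np.nan])

# value_counts() skips NaN, so this difference also counts the missing DOIs
naive_surplus = len(dois) - len(dois.value_counts())

# Deducting the NaNs leaves only the genuinely repeated DOIs
true_surplus = naive_surplus - int(dois.isnull().sum())

# Rows whose non-missing DOI occurs more than once
shared_rows = dois[dois.notna() & dois.duplicated(keep=False)]
print(naive_surplus, true_surplus)
```

The `notna()` filter matters here because `duplicated` treats repeated NaNs as duplicates of each other.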

In [19]:
df_NaNs

Unnamed: 0,NaNs
paper_id,0
doi,3144
title,0
source_x,0
license,0
publish_time,0
journal,9332
url,0
software,0
