# Dataset diversity in COVID-19 repository

In an attempt to track data relevant to the 2020 COVID-19 outbreak, a group of researchers has developed a data clearinghouse, containing a variety of datasets along with metadata and data dictionaries.

Here, we attempt to understand the diversity of the attributes (columns) in those datasets.

## Strategy

1. Enumerate all of the directories and their associated data dictionaries, as found in the data_guide.csv files.
2. Examine each of the data_guide.csv files to identify the fields it contains.
3. Come up with a master list of fields and frequencies, as well as a list of fields per dataset.

## Enumerating and finding data dictionaries

In [2]:
from os import listdir
from os.path import isfile, join
import glob

In [3]:
# Recursively find every data_guide.csv below ../data
files = glob.glob("../data/**/data_guide.csv", recursive=True)

In [4]:
files

['../data/cases/south korea/confirmed_cases_movement/data_guide.csv',
 '../data/cases/china/daily_cases_chinacdc_ZH/data_guide.csv',
 '../data/cases/china/death_count_imperial_college/data_guide.csv',
 '../data/cases/china/daily_cases_chinacdc_EN/data_guide.csv',
 '../data/cases/china/cases_travel/data_guide.csv',
 '../data/cases/china/cumulative_cases_DXY/data_guide.csv',
 '../data/cases/europe/europe_situation_updates/data_guide.csv',
 '../data/cases/global/line_listings_nihfogarty/data_guide.csv',
 '../data/cases/global/line_listings_imperial_college/data_guide.csv',
 '../data/cases/global/line_listings_oxford/data_guide.csv',
 '../data/cases/canada/ontario_situation_updates/data_guide.csv',
 '../data/data_from_papers/dataset_luo-et-al_202002/data_guide.csv']

### Nice. Now split each path and iterate over all of them

OK, how to store this? We want a dictionary keyed by directory name ('confirmed_cases_movement'). Each item in that dictionary will have a type (cases, data_from_papers), a locale (south korea, etc.), and a list of fields that we will get from the data guide.
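As a rough sketch of the shape described above, the structure might look like the following. The directory names come from the file listing; the entries under `fields` are placeholders, not the real contents of any data guide.

```python
# Hypothetical illustration of the target structure. The 'fields' values are
# placeholders, not the real contents of any data_guide.csv.
datasets = {
    "confirmed_cases_movement": {
        "type": "cases",
        "locale": "south korea",
        "fields": ["placeholder_field_a", "placeholder_field_b"],
    },
    "dataset_luo-et-al_202002": {
        "type": "data_from_papers",
        # No 'locale' key: this dataset's path has no locale directory.
        "fields": ["placeholder_field_c"],
    },
}

# Entries without a locale can be read safely with dict.get().
for name, info in datasets.items():
    print(name, info["type"], info.get("locale", "(none)"), len(info["fields"]))
```

Leaving the `locale` key out, rather than storing an empty string, keeps "no locale" distinguishable from a locale that happens to be blank.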


OK. We're interested in full_name_en, the fourth field. Read the CSV with pandas and grab the fourth column.
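To illustrate the difference between grabbing the fourth column by position and by name, here is a small sketch that uses the standard library's `csv` module instead of pandas, so that it runs without extra dependencies. The sample CSV is made up: only the column name `full_name_en` and its position come from the text, and the other column names and all rows are placeholders.

```python
import csv
import io

# Hypothetical stand-in for a data_guide.csv. Only 'full_name_en' (the fourth
# column) comes from the text; the other column names and the rows are made up.
sample_guide = io.StringIO(
    "col_a,col_b,col_c,full_name_en\n"
    "x1,y1,z1,Province\n"
    "x2,y2,z2,Lat\n"
)

reader = csv.reader(sample_guide)
header = next(reader)   # first row holds the column names
rows = list(reader)     # remaining rows, one per documented field

# Select by position (fourth column, index 3)...
by_position = [row[3] for row in rows]

# ...or by name, which keeps working if the column order changes.
name_index = header.index("full_name_en")
by_name = [row[name_index] for row in rows]

print(by_position == by_name)
```

Selecting by name is usually the safer choice when the guides come from different contributors, since a reordered column would otherwise go unnoticed.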

OK. Tie this together: take a full file name and return a tuple of the dataset name and a dictionary holding the type, subtype, locale (when present), and list of fields.

In [39]:
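import pandas

def proc_data_guide(f):
    """Return (dataset name, info dict) for one data_guide.csv path."""
    res = {}
    # e.g. ['..', 'data', 'cases', 'china', 'cases_travel', 'data_guide.csv']
    path_comps = f.split('/')
    if len(path_comps) == 6:
        res['type'] = path_comps[1]
        res['subtype'] = path_comps[2]
        res['locale'] = path_comps[3]
        name = path_comps[4]
    else:  # shorter path: no locale component
        res['type'] = path_comps[1]
        res['subtype'] = path_comps[2]
        name = path_comps[3]

    # The full_name_en column of the data guide lists the field names.
    pfields = pandas.read_csv(f)
    cols = pfields['full_name_en']
    res['fields'] = cols.values.tolist()
    return (name, res)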
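The path handling alone could also be sketched with `pathlib`, which avoids splitting on a hard-coded separator. The helper name `describe_path` is hypothetical and is not part of the notebook; it is meant to reproduce the same type/subtype/locale assignment as `proc_data_guide` without reading the CSV.

```python
from pathlib import PurePosixPath

def describe_path(f):
    """Hypothetical helper: split a data_guide.csv path into labelled parts."""
    # e.g. ('..', 'data', 'cases', 'china', 'cases_travel', 'data_guide.csv')
    parts = PurePosixPath(f).parts
    # Drop the leading '..' and the trailing file name.
    middle = parts[1:-1]
    res = {'type': middle[0], 'subtype': middle[1]}
    if len(middle) == 4:  # type / subtype / locale / name
        res['locale'] = middle[2]
    # The last remaining component is the dataset directory name.
    return middle[-1], res

print(describe_path('../data/cases/china/cases_travel/data_guide.csv'))
```

This assumes every path starts with exactly one `..` component, as the paths returned by the glob above do.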

In [41]:
proc_data_guide(files[-1])

('dataset_luo-et-al_202002',
 {'type': 'data',
  'subtype': 'data_from_papers',
  'fields': ['Province',
   'R0_mean',
   'Absolute_Humidity',
   'Temperature_Mean',
   'Cumulative_Cases',
   'R0_confidence_interval_Upper',
   'R0_confidence_interval_Lower',
   'Country.Region',
   'PopulationDensity_2014',
   'Lat',
   'Long',
   'Capital']})

### Now we have to (a) do that for all files, and (b) accumulate all field names

In [42]:
def proc_files(flist):
    res={}
    for f in flist:
        (name,result)=proc_data_guide(f)
        res[name]=result
    return res

In [43]:
result = proc_files(files)

In [46]:
# Quick check of the membership test used in the next cell
b = {}
'a' not in b

True

In [48]:
# Count how many times each field name appears across all data guides
summary = {}
for name, data in result.items():
    fields = data['fields']
    for field in fields:
        if field not in summary:
            summary[field] = 0
        summary[field] = summary[field] + 1
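The same tally can be written more compactly with `collections.Counter`. The sketch below runs on a toy stand-in for `result`; the dataset names and field lists are placeholders chosen for illustration.

```python
from collections import Counter

# Toy stand-in for `result`; dataset names and field lists are placeholders.
toy_result = {
    'dataset_a': {'fields': ['Province', 'Lat', 'Long']},
    'dataset_b': {'fields': ['Province', 'date']},
}

toy_summary = Counter()
for data in toy_result.values():
    # update() adds one to the count of every field name in the list
    toy_summary.update(data['fields'])

print(toy_summary.most_common(1))  # → [('Province', 2)]
```

Both versions count occurrences, so a field listed twice in one data guide is counted twice. Passing `set(data['fields'])` to `update()` would count datasets instead.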

In [50]:
len(summary)

137

In [53]:
counts = list(summary.values())

In [55]:
import statistics
print(statistics.mean(counts))

1.1605839416058394


In [56]:
statistics.stdev(counts)

0.4575120802380646
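A mean this close to 1 suggests that most field names appear in only one data guide. One way to look at that directly is to tabulate how many fields occur exactly k times and to list the fields that are shared. The sketch below uses a toy dictionary shaped like `summary`, with illustrative values only.

```python
from collections import Counter

# Toy per-field counts shaped like `summary`; the values are illustrative only.
toy_summary = {'Province': 2, 'Lat': 1, 'Long': 1, 'date': 3}

# How many fields occur in exactly k data guides?
histogram = Counter(toy_summary.values())

# Fields that occur in more than one data guide.
shared = sorted(f for f, n in toy_summary.items() if n > 1)

print(sorted(histogram.items()))  # → [(1, 2), (2, 1), (3, 1)]
print(shared)                     # → ['Province', 'date']
```

Run against the real `summary`, the `shared` list would serve as a starting point for the master list of common fields named in the strategy.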