Skip to content

mongodb-developer/open-data-covid-19

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MongoDB Open Data COVID-19

This project retrieves and inserts into MongoDB the Johns Hopkins University COVID-19 dataset provided on Github.

Please read the blog post associated to this repository.

Databases and Collections

Database covid19

This database contains a few collections that have been carefully engineered to be as useful and as convenient as possible to work with.

In each of these collections:

  • the keys have been renamed correctly to provide more consistency across the different collections,
  • the fields have been casted into their correct types,
  • the loc fields contains GeoJSON points,
  • the date are ISODates,
  • etc.

5 collections are available:

  • metadata
  • global (the data from the time series global files)
  • us_only (the data from the time series US files)
  • global_and_us (the most complete one)
  • countries_summary (same as global but countries are grouped in a single doc for each date)

Collection metadata

This collection contains only one single document. It contains the list of all the values (obtained with mongodb distinct function) for the major fields along with the first and last dates.

{
  _id : "metadata",
  countries : [ "Afghanistan", "Albania", "Algeria", "..." ],
  states : [ "Alabama", "Alaska", "Alberta", "..." ],
  states_us : [ "Alabama", "Alaska", "American Samoa", "..." ],
  counties : [ "Abbeville", "Acadia", "Accomack", "..." ],
  iso3s : [ "ABW", "AFG", "AGO", "..." ],
  uids : [ 4, 8, 12, ... ],
  first_date : 2020-01-22T00:00:00.000+00:00,
  last_date : 2020-04-24T00:00:00.000+00:00
}

Collection global

This collection contains the equivalent of what is in the 3 documents:

  • time_series_covid19_confirmed_global.csv
  • time_series_covid19_deaths_global.csv
  • time_series_covid19_recovered_global.csv

Each document is also joined with its associated line from the UID_ISO_FIPS_LookUp_Table.csv file.

In this collection we have nb_entries(time_series_covid19_confirmed_global.csv) * number_days documents.

Here is an example document:

{
	"_id" : ObjectId("5ea45f2a8049cddb8cfa3822"),
	"uid" : 4,
	"country_iso2" : "AF",
	"country_iso3" : "AFG",
	"country_code" : 4,
	"country" : "Afghanistan",
	"combined_name" : "Afghanistan",
	"population" : 38928341,
	"loc" : {
		"type" : "Point",
		"coordinates" : [
			67.71,
			33.9391
		]
	},
	"date" : ISODate("2020-01-22T00:00:00Z"),
	"confirmed" : 0,
	"deaths" : 0,
	"recovered" : 0
}

Collection us_only

This collection contains the equivalent of what is in the 2 documents:

  • time_series_covid19_confirmed_US.csv
  • time_series_covid19_deaths_US.csv

Each document is also joined with its associated line from the UID_ISO_FIPS_LookUp_Table.csv file.

In this collection we have nb_entries(time_series_covid19_confirmed_US.csv) * number_days documents.

Here is an example document:

{
	"_id" : ObjectId("5ea45f2b8049cddb8cfa9912"),
	"uid" : 16,
	"country_iso2" : "AS",
	"country_iso3" : "ASM",
	"country_code" : 16,
	"fips" : 60,
	"state" : "American Samoa",
	"country" : "US",
	"combined_name" : "American Samoa, US",
	"population" : 55641,
	"loc" : {
		"type" : "Point",
		"coordinates" : [
			-170.132,
			-14.271
		]
	},
	"date" : ISODate("2020-01-22T00:00:00Z"),
	"confirmed" : 0,
	"deaths" : 0
}

Note: JHU does not provide recovered data for the US files. It's currently only available in the global files.

Collection countries_summary

This collection is calculated using the data from the global collection. It's the same collection but the countries are grouped together into a single document for each date.

So in this collection, you will find nb_countries * nb_days documents.

Also, because all the states are grouped into a single doc for each countries, some fields are arrays in this collection.

You can see the detailed aggregation pipeline in the 2-smart-insert.py file in the data-import folder.

Here is an example for France:

{
	"_id" : ObjectId("5eb1e2bdeb5fc5a3a38a33f8"),
	"uids" : [ 312, 175, 638, 666, 258, 250, 663, 254, 652, 540, 474 ],
	"confirmed" : 169583,
	"deaths" : 25204,
	"country" : "France",
	"date" : ISODate("2020-05-04T00:00:00Z"),
	"country_iso2s" : [ "GF", "MF", "PF", "YT", "GP", "RE", "PM", "MQ", "FR", "BL", "NC" ],
	"country_iso3s" : [ "SPM", "NCL", "REU", "BLM", "MAF", "MYT", "MTQ", "PYF", "GLP", "FRA", "GUF" ],
	"country_codes" : [ 652, 474, 666, 175, 250, 258, 254, 663, 312, 638, 540 ],
	"combined_names" : [
		"Saint Pierre and Miquelon, France",
		"Guadeloupe, France",
		"Reunion, France",
		"New Caledonia, France",
		"Saint Barthelemy, France",
		"France",
		"Martinique, France",
		"Mayotte, France",
		"French Polynesia, France",
		"French Guiana, France",
		"St Martin, France"
	],
	"population" : 298682,
	"recovered" : 51476,
	"states" : [
		"French Guiana",
		"French Polynesia",
		"Guadeloupe",
		"Mayotte",
		"New Caledonia",
		"Reunion",
		"Saint Barthelemy",
		"St Martin",
		"Martinique",
		"Saint Pierre and Miquelon"
	]
}

Collection global_and_us

This collection is the most complete collection in this database. This collection basically contains all the documents from the collections:

  • global
  • us_only

But with a little trick on top: the US cases are counted in both collections respectively:

  • at a country level in the first one,
  • and in a more detailed level (county and state) in the second one.

So to take advantages of both collections, I just removed the confirmed and deaths counts from the US documents which comes from the global collection.

This allow me to keep track of the recovered cases in the US while also keeping track of the confirmed and deaths cases at a more detailed level. This is really the best we can do here because JHU don't reported recovered cases at a detailed level for the US.

With this trick, the count for confirmed, deaths and recovered persons for a given date is correct.

This is the collection I'm using to build my charts in my charts blog posts:

The documents in this collection are exactly the same than in the collections mentioned above.

  • For a document that comes from the global collection:
{
	"_id" : ObjectId("5ea49768865a48ecca6d5ccb"),
	"uid" : 250,
	"country_iso2" : "FR",
	"country_iso3" : "FRA",
	"country_code" : 250,
	"country" : "France",
	"combined_name" : "France",
	"population" : 65273512,
	"loc" : {
		"type" : "Point",
		"coordinates" : [
			2.2137,
			46.2276
		]
	},
	"date" : ISODate("2020-04-24T00:00:00Z"),
	"confirmed" : 158636,
	"deaths" : 22245,
	"recovered" : 43493
}
  • For a document that comes from the us_only collection:
{
	"_id" : ObjectId("5ea4976b865a48ecca70df4d"),
	"uid" : 84042101,
	"country_iso2" : "US",
	"country_iso3" : "USA",
	"country_code" : 840,
	"fips" : 42101,
	"county" : "Philadelphia",
	"state" : "Pennsylvania",
	"country" : "US",
	"combined_name" : "Philadelphia, Pennsylvania, US",
	"population" : 1584064,
	"loc" : {
		"type" : "Point",
		"coordinates" : [
			-75.1379,
			40.0034
		]
	},
	"date" : ISODate("2020-04-24T00:00:00Z"),
	"confirmed" : 11877,
	"deaths" : 449
}
  • For the special document that comes from the global collection and represents the US entire country (the one with the trick):
{
	"_id" : ObjectId("5ea49768865a48ecca6d84d1"),
	"uid" : 840,
	"country_iso2" : "US",
	"country_iso3" : "USA",
	"country_code" : 840,
	"country" : "US",
	"combined_name" : "US",
	"population" : 329466283,
	"loc" : {
		"type" : "Point",
		"coordinates" : [
			-100,
			40
		]
	},
	"date" : ISODate("2020-04-24T00:00:00Z"),
	"recovered" : 99079
}

Just pay attention: the documents which come from the US detailed source don't have a recovered field because JHU doesn't provide this data. Only the documents (one for each date) that represents the US at a country level contain the recovered field.

Also JHU doesn't provide the recovered count for all the countries and states in the global files.

In this collection, we have (nb_entries(time_series_covid19_confirmed_global.csv) + nb_entries(time_series_covid19_confirmed_global.csv)) * number_days.

Dabatase covid19jhu

This database contains the raw CSV files imported with the mongoimport tool.

This database is updated by the 1-mongoimport-everything.sh in the data-import folder.

This script imports all the files matching these names with wildcards:

  • jhu/csse_covid_19_data/csse_covid_19_daily_reports/*.csv
  • jhu/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv
  • jhu/csse_covid_19_data/csse_covid_19_time_series/*.csv
  • jhu/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv

And it creates these collections:

  • Errata
  • UID_ISO_FIPS_LookUp_Table
  • daily
  • daily_us
  • time_series_covid19_confirmed_US
  • time_series_covid19_confirmed_global
  • time_series_covid19_deaths_US
  • time_series_covid19_deaths_global
  • time_series_covid19_recovered_global

Note: These collections are not clean and the schema designs are not great to work with because they come from raw and flat CSV files but at least they contain exactly what JHU is delivering.