<a href="https://colab.research.google.com/github/hikmahealth/covid19countymap/blob/master/notebooks/NYT_cases_export.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import io
import json
import urllib.request

import pandas as pd

Download case and death counts from the [New York Times](https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html) [GitHub repository](https://github.com/nytimes/covid-19-data).

In [0]:
raw_county_data = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
raw_state_data = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")

In [0]:
with urllib.request.urlopen("https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv") as infile:
  raw_data = infile.read()
# Deal with encoding issues in the Census CSV file.
replacements = {0xed: "í", 0xe1: "á", 0xf3: "ó", 0xf1: "ñ", 0xfc:"ü"}
for char, repl in replacements.items():
  raw_data = raw_data.replace(bytes([char]), repl.encode())
raw_census = pd.read_csv(io.BytesIO(raw_data))
census = raw_census.copy()
census["fips"] = census.STATE * 1000 + census.COUNTY
census.set_index("fips", inplace=True)
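
The five byte values replaced above are the Latin-1 (ISO 8859-1) codes for those same characters, so decoding the whole payload as Latin-1 should give an equivalent result. A minimal sketch on a made-up two-row sample (the `sample` bytes are illustrative, not taken from the Census file, and treating the real file as Latin-1 throughout is an assumption):

```python
import csv
import io

# Made-up sample in the shape of the Census file; 0xF1 is "ñ" in Latin-1.
sample = b"STATE,COUNTY,CTYNAME\n35,13,Do\xf1a Ana County\n36,61,New York County\n"

# Decode every byte as Latin-1 instead of replacing characters one by one.
rows = list(csv.DictReader(io.StringIO(sample.decode("latin-1"))))

# Compose the county FIPS code the same way as above: state * 1000 + county.
fips_to_name = {int(r["STATE"]) * 1000 + int(r["COUNTY"]): r["CTYNAME"] for r in rows}
print(fips_to_name)
```

With pandas, the corresponding option would be passing `encoding="latin-1"` to `pd.read_csv`.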

In [0]:
def get_latest(raw_data):
  """Return the maximum cases, deaths, and date per FIPS code."""
  # Filter out rows with missing FIPS codes (comparisons with NaN are False).
  data = raw_data[(raw_data.fips < 100000) & (raw_data.fips >= 0)].copy()
  data.fips = data.fips.astype(int)
  data = data[["fips", "cases", "deaths", "date"]]
  # Counts are cumulative, so the per-column maximum is taken as the latest value.
  return data.groupby("fips").max()
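
Note that `get_latest` takes the column-wise maximum per FIPS code rather than the last row. The two agree only if the counts never decrease, which is an assumption about the NYT data that a downward correction could break. A small sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up rows in the shape of the NYT county file; one row has no FIPS code.
toy = pd.DataFrame({
    "date": ["2020-04-01", "2020-04-02", "2020-04-01", "2020-04-02"],
    "fips": [1001.0, 1001.0, 1003.0, None],
    "cases": [10, 12, 5, 7],
    "deaths": [0, 1, 0, 0],
})

# Comparisons against NaN are False, so rows without a FIPS code drop out.
valid = toy[(toy.fips >= 0) & (toy.fips < 100000)].copy()
valid["fips"] = valid.fips.astype(int)

# Column-wise maximum per county: the largest cumulative count and the
# latest ISO-formatted date (ISO dates sort correctly as strings).
latest = valid.groupby("fips")[["cases", "deaths", "date"]].max()
print(latest)
```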

In [0]:
county_data = get_latest(raw_county_data)
state_data = get_latest(raw_state_data)

The NYTimes dataset lumps all [5 NYC counties](https://simple.wikipedia.org/wiki/List_of_counties_in_New_York) together. We apportion the combined totals among the five counties in proportion to their 2010 census populations from census.gov.

In [0]:
NYC_COUNTIES = [
                36061,  # New York County
                36047,  # Kings County
                36005,  # Bronx County
                36085,  # Richmond County
                36081,  # Queens County
]

In [0]:
nycensus = census[census.index.isin(NYC_COUNTIES)]
nyc_proportions = nycensus.CENSUS2010POP / nycensus.CENSUS2010POP.sum()
# Select the numeric and date columns first, then take the maximum of each.
nyc_data = raw_county_data.loc[raw_county_data.county == "New York City", ["cases", "deaths", "date"]].max()
nyc_scaled = {}
for fips, proportion in nyc_proportions.items():
  nyc_scaled[fips] = {
      "cases": round(proportion * nyc_data.cases),
      "deaths": round(proportion * nyc_data.deaths),
      "date": nyc_data.date,
  }
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
county_data = pd.concat([county_data, pd.DataFrame.from_dict(nyc_scaled, orient="index")])
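
The apportionment above can be sketched in plain Python with made-up numbers (the county labels, populations, and combined total below are illustrative only). The sketch also shows that independently rounded shares need not add back up to the combined total:

```python
# Made-up populations for three hypothetical counties.
populations = {"A": 400, "B": 350, "C": 250}
combined_cases = 101  # made-up combined total to split

# Each county's share of the combined population.
total_pop = sum(populations.values())
proportions = {name: pop / total_pop for name, pop in populations.items()}

# Scale the combined total by each share and round to whole cases.
scaled = {name: round(p * combined_cases) for name, p in proportions.items()}
print(scaled, sum(scaled.values()))
```

Here the rounded shares sum to 100 rather than 101, so a small discrepancy against the citywide total is to be expected.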

Add in census.gov population figures (the `CENSUS2010POP` column of the 2019 estimates file). The state population series built from that file is missing some entries (the territories), so they are added manually.

In [0]:
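# reindex yields NaN for FIPS codes absent from the census file instead of raising KeyError.
county_data["pop"] = census.CENSUS2010POP.reindex(county_data.index)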

In [0]:
state_populations = census[census.COUNTY == 0].set_index("STATE").CENSUS2010POP
state_populations[60] = 56079  # American Samoa
state_populations[66] = 159358  # Guam
state_populations[69] = 53971  # Northern Mariana Islands
state_populations[72] = 3725789  # Puerto Rico
state_populations[78] = 106405  # Virgin Islands

In [0]:
state_data["pop"] = state_populations
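
Assigning a Series as a new column aligns it on index labels, which is why a territory absent from the population Series would end up with a missing value unless it is added by hand. A toy illustration with made-up numbers:

```python
import pandas as pd

# Made-up case counts indexed by state FIPS code (72 is Puerto Rico).
cases = pd.DataFrame({"cases": [100, 50, 20]}, index=[1, 2, 72])

# Made-up populations; there is deliberately no entry for 72.
populations = pd.Series({1: 5000, 2: 700, 4: 6000})

# Assignment aligns on index labels, not on position: label 72 gets NaN
# and the unmatched label 4 is ignored.
cases["pop"] = populations
print(cases)
```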

In [0]:
with open("county_cases.json", "w") as outfile:
  outfile.write(county_data.to_json(orient="index"))

In [0]:
with open("state_cases.json", "w") as outfile:
  outfile.write(state_data.to_json(orient="index"))
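
For readers consuming these files, a short sketch of the shape that `orient="index"` produces, using a made-up one-row frame (the values are illustrative):

```python
import json

import pandas as pd

# Made-up frame in the shape of county_data.
toy = pd.DataFrame(
    {"cases": [12], "deaths": [1], "date": ["2020-04-02"], "pop": [54571]},
    index=[1001],
)

# orient="index" nests one record per index label; JSON keys are strings.
records = json.loads(toy.to_json(orient="index"))
print(records)
```

Because the index holds integers, the key is `"1001"` rather than the zero-padded `"01001"`, so a consumer expecting five-digit FIPS strings would need to pad the keys.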