# Process NYTimes data

This notebook processes the NYTimes county-level data into a dataframe holding, for each county, the data for the first date with cases and the data N days after that date, for each N listed in `NDAYS`. It also generates a FIPS-to-county-name lookup table for use in `010_process_kaiser`.

The data is read from the [NYTimes GitHub](https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv) repository.

In [1]:
# CHANGE THIS LIST TO THE NUMBERS OF DAYS AFTER THE FIRST CASE TO EXTRACT
NDAYS = [60, 70, 80]

import pandas as pd
import datetime

# read and sort the data
nytimes = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
nytimes = nytimes.sort_values(by=["date"])

Collect the unique FIPS codes in the data, after dropping rows with missing values.

In [2]:
fips_list = nytimes.dropna().fips.unique()
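
As a small illustration of this step, the sketch below runs the same selection on a made-up three-row CSV (the values are toy data, not taken from the real file). It assumes, as in the NYTimes file, that some rows have an empty `fips` field; pandas then reads the column as floats, which is why the code casts with `int(fips)` later on.

```python
import io

import pandas as pd

# Toy CSV in the same column layout as us-counties.csv (values are made up).
csv_text = """date,county,state,fips,cases,deaths
2020-03-01,New York City,New York,,1,0
2020-03-01,Snohomish,Washington,53061,2,0
2020-03-02,Snohomish,Washington,53061,4,1
"""
toy = pd.read_csv(io.StringIO(csv_text))

# A column with missing values cannot be stored as plain integers.
print(toy.fips.dtype)  # float64

# Same selection as above: drop incomplete rows, then take the unique codes.
fips_list = toy.dropna().fips.unique()
print([int(f) for f in fips_list])  # [53061]
```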

This is the bulk of the code. It loops through the FIPS codes, gathers the needed data for each one, and adds it to a list as a dictionary.

In [3]:
processed_nytimes_rows = []
for fips in fips_list:
    # get a dataframe with the data for this fips
    fips_df = nytimes.loc[nytimes.fips == fips]

    # get the date and row for the first case in this fips
    row_0days = fips_df.sort_values(by=["date"]).iloc[0]
    date_0days = datetime.datetime.strptime(row_0days.date, "%Y-%m-%d").date()

    row_dict = {
        "date_0days": row_0days.date,
        "cases_0days": row_0days.cases,
        "deaths_0days": row_0days.deaths,
        "fips": int(fips)
    }

    for N in NDAYS:
        row_dict[f"date_{N}days"] = None
        row_dict[f"cases_{N}days"] = None
        row_dict[f"deaths_{N}days"] = None

        # calculate the date N days from the first case
        date_ndays = date_0days + datetime.timedelta(days=N)

        # select the closest date on or before N days out, giving up after 10 tries
        row_ndays = None
        tries = 10
        while row_ndays is None and tries > 0:
            try:
                # try to get the row for that date
                row_ndays = fips_df.loc[fips_df.date == date_ndays.strftime("%Y-%m-%d")].iloc[0]
            except IndexError:
                # if there's no record for that date, decrement the date back one day and try again
                date_ndays -= datetime.timedelta(days=1)
            tries -= 1
        
        if row_ndays is not None:
            row_dict[f"date_{N}days"] = row_ndays.date
            row_dict[f"cases_{N}days"] = row_ndays.cases
            row_dict[f"deaths_{N}days"] = row_ndays.deaths

    # add this county's row to the list once, after every N has been processed
    processed_nytimes_rows.append(row_dict)
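
The backward search for a nearby date can also be read in isolation. The sketch below is one way to express it as a helper; `find_row_on_or_before` is a hypothetical name that does not appear in the notebook, and the data frame is toy data for a single county. It tries the target date first and then steps back one day at a time, for at most 10 attempts in total, as the loop above does.

```python
import datetime

import pandas as pd


def find_row_on_or_before(county_df, target_date, max_tries=10):
    """Return the row for target_date, or for the nearest earlier date.

    Tries target_date, then one day earlier, and so on, for max_tries
    attempts in total. Returns None if no row is found.
    """
    for offset in range(max_tries):
        day = (target_date - datetime.timedelta(days=offset)).strftime("%Y-%m-%d")
        matches = county_df.loc[county_df.date == day]
        if not matches.empty:
            return matches.iloc[0]
    return None


# Toy data for one county: there is no record for 2020-03-11.
toy = pd.DataFrame({
    "date": ["2020-03-09", "2020-03-10", "2020-03-12"],
    "cases": [1, 2, 5],
    "deaths": [0, 0, 1],
})

row = find_row_on_or_before(toy, datetime.date(2020, 3, 11))
print(row["date"], row["cases"])  # 2020-03-10 2

# Nothing within 10 tries of this date, so the helper gives up.
missing = find_row_on_or_before(toy, datetime.date(2020, 4, 30))
print(missing)  # None
```

Returning `None` instead of raising keeps the caller's `if row_ndays is not None` check unchanged.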

Finally, convert the list of rows into a dataframe and save:

In [4]:
processed_nytimes = pd.DataFrame(processed_nytimes_rows)
# NDAYS holds integers, so convert them to strings before joining
processed_nytimes.to_csv(f"nytimes_processed_{'-'.join(map(str, NDAYS))}days.csv")
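
To make the shape of the result concrete, here is a minimal sketch that builds a dataframe from two made-up row dictionaries and derives the output filename. The row values are illustrative only, and only the 60-day columns are shown to keep it short. It assumes the filename is meant to list the day counts separated by hyphens.

```python
import pandas as pd

NDAYS = [60, 70, 80]

# Two made-up rows in the shape built by the loop above.
rows = [
    {"date_0days": "2020-03-01", "cases_0days": 1, "deaths_0days": 0, "fips": 1001,
     "date_60days": "2020-04-30", "cases_60days": 40, "deaths_60days": 2},
    {"date_0days": "2020-03-15", "cases_0days": 2, "deaths_0days": 0, "fips": 1003,
     "date_60days": None, "cases_60days": None, "deaths_60days": None},
]

# Each dictionary becomes one row; the keys become the column names.
frame = pd.DataFrame(rows)
print(frame.shape)  # (2, 7)

# str.join only accepts strings, so the integer day counts are converted first.
filename = f"nytimes_processed_{'-'.join(str(n) for n in NDAYS)}days.csv"
print(filename)  # nytimes_processed_60-70-80days.csv
```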

This last step creates a lookup CSV to get the FIPS code from a county name or vice versa:

In [5]:
# keep one row per county; the source has one row per county per date
fips_county_df = nytimes.dropna().loc[:, ["fips", "county", "state"]].drop_duplicates()
fips_county_df.fips = fips_county_df.fips.astype(int)
fips_county_df = fips_county_df.set_index("fips")
fips_county_df.to_csv("fips_county_lookup.csv")
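
A lookup table like this might be used as sketched below. The small frame stands in for the contents of `fips_county_lookup.csv` (in practice it could be loaded with `pd.read_csv("fips_county_lookup.csv", index_col="fips")`), and the three counties are chosen only for illustration.

```python
import pandas as pd

# Stand-in for the lookup table, indexed by FIPS code.
lookup = pd.DataFrame({
    "fips": [53061, 17031, 6059],
    "county": ["Snohomish", "Cook", "Orange"],
    "state": ["Washington", "Illinois", "California"],
}).set_index("fips")

# FIPS code -> county name
county = lookup.loc[17031, "county"]
print(county)  # Cook

# County name -> FIPS code; the state is included because
# the same county name can appear in several states.
match = lookup[(lookup.county == "Orange") & (lookup.state == "California")]
fips = int(match.index[0])
print(fips)  # 6059
```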