CSVs retrieved from: https://data.ontario.ca/dataset/long-term-care-home-covid-19-data

In [1]:
import pandas as pd

Import

In [2]:
active = pd.read_csv("https://data.ontario.ca/datastore/dump/4b64488a-0523-4ebb-811a-fac2f07e6d59?bom=True")
active['outbreak_status'] = 'active'

inactive = pd.read_csv("https://data.ontario.ca/dataset/42df36df-04a0-43a9-8ad4-fac5e0e22244/resource/0cf2f01e-d4e1-48ed-8027-2133d059ec8b/download/resolvedltc.csv")
inactive['outbreak_status'] = 'inactive'

Clean

In [3]:
# convert to datetime
active["Report_Data_Extracted"] = pd.to_datetime(active["Report_Data_Extracted"])
inactive["Report_Data_Extracted"] = pd.to_datetime(inactive["Report_Data_Extracted"])

**NOTE:** Inactive outbreaks continued to be reported daily after their status changed from `active` to `inactive`, leading to duplicate entries. For simplicity, only the earliest entry was kept, under the assumption that its date is when the outbreak was declared over.

Active outbreaks were reported the same way, so the first date was likewise assumed to be when the outbreak was declared.
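
As a minimal sketch of this rule on made-up data: sorting by report date before dropping duplicates guarantees that the row kept for each home is its earliest report, even if the source file is not in chronological order. The toy frame and its values are illustrative only; the column names follow the Ontario CSVs.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the Ontario CSVs.
toy = pd.DataFrame({
    "LTC_Home": ["Home A", "Home B", "Home A", "Home B"],
    "Report_Data_Extracted": pd.to_datetime(
        ["2020-06-23", "2020-06-25", "2020-06-22", "2020-06-24"]
    ),
})

# Sort chronologically, then keep the first (earliest) row per home.
earliest = (
    toy.sort_values("Report_Data_Extracted")
       .drop_duplicates(subset="LTC_Home", keep="first")
       .reset_index(drop=True)
)
print(earliest)
```

Without the sort, `keep='first'` returns whichever row appears first in the file, which is the earliest report only if the file is already ordered by date.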

In [5]:
# active.groupby("LTC_Home").count()
# active.loc[active["LTC_Home"] == "Afton Park Place Long Term Care Community"]

Unnamed: 0,_id,Report_Data_Extracted,LTC_Home,LTC_City,Beds,Total_LTC_Resident_Cases,Total_LTC_Resident_Deaths,Total_LTC_HCW_Cases,outbreak_status
7738,7739,2020-06-22,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
7801,7802,2020-06-23,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
7863,7864,2020-06-24,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
7920,7921,2020-06-25,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
7977,7978,2020-06-26,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
8034,8035,2020-06-27,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
8089,8090,2020-06-28,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
8145,8146,2020-06-29,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
8200,8201,2020-06-30,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active
8248,8249,2020-07-01,Afton Park Place Long Term Care Community,,128.0,0.0,0,<5,active


In [6]:
# drop_duplicates returns index labels, so select the kept rows with .loc
keep_indices = inactive.LTC_Home.drop_duplicates(keep='first').index.to_list()
inactiveFiltered = inactive.loc[keep_indices]

keep_indices2 = active.LTC_Home.drop_duplicates(keep='first').index.to_list()
activeFiltered = active.loc[keep_indices2]

Merge - dropping duplicates where LTCs are in both active and inactive lists, keeping the active home.

First, check that no homes reported more than one outbreak.

In [7]:
outbreaks = pd.concat([inactiveFiltered, activeFiltered])

grouped = outbreaks.groupby(["LTC_Home","outbreak_status"]).count()
grouped.loc[grouped["Report_Data_Extracted"] > 1]

Empty DataFrame
Columns: [Report_Data_Extracted, City, Beds, Total_LTC_Resident_Deaths, _id, LTC_City, Total_LTC_Resident_Cases, Total_LTC_HCW_Cases]
Index: []


In [8]:
outbreaks.drop_duplicates(subset='LTC_Home', keep="last", inplace = True)
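
A small sketch of why `keep="last"` favours the active record: the inactive rows are concatenated first, so when a home appears in both lists its active row comes later and is the one retained. The home names below are hypothetical.

```python
import pandas as pd

# Hypothetical homes; "Home B" appears in both lists.
inactive_toy = pd.DataFrame({"LTC_Home": ["Home A", "Home B"],
                             "outbreak_status": "inactive"})
active_toy = pd.DataFrame({"LTC_Home": ["Home B", "Home C"],
                           "outbreak_status": "active"})

# Inactive rows come first, so keep="last" retains the active row on a clash.
combined = pd.concat([inactive_toy, active_toy], ignore_index=True)
deduped = combined.drop_duplicates(subset="LTC_Home", keep="last")
status = deduped.set_index("LTC_Home")["outbreak_status"].to_dict()
print(status)
```

The result depends on the concatenation order: swapping the two frames would keep the inactive row instead.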

Filter for those reported before August 1, 2020:

In [9]:
outbreaksAug1 = outbreaks.loc[outbreaks['Report_Data_Extracted'] < "2020-08-01"]

Import LTC data and merge with outbreaks:

In [10]:
ltc = pd.read_csv("../merge_LTC_database/webscrape_LTC_general_database.csv")

In [11]:
# assign() returns a new DataFrame, so no value is set on a slice of `outbreaks`
outbreaksAug1 = outbreaksAug1.assign(name=outbreaksAug1.LTC_Home.str.upper())
complete = pd.merge(ltc, outbreaksAug1, on = "name")


In [12]:
print("Number of homes in outbreak DF missing after merge: ", len(set(outbreaksAug1.name.unique())-set(complete.name.unique())))

Number of homes in outbreak DF missing after merge:  7


Fix name issues and re-merge

In [13]:
list(set(outbreaksAug1.name.unique())-set(complete.name.unique()))

['MACASSA  LODGE',
 'PINECREST NURSING HOME (BOBCAYGEON)',
 'RESIDENCE SAINT-LOUIS',
 "ST. PATRICK'S HOME",
 'FINLANDIA HOIVAKOTI NURSING HOME',
 'VISION NURSING HOME',
 'HEARTWOOD']
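
An alternative way to find unmatched homes, sketched on hypothetical data, is an outer merge with `indicator=True`. pandas then adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`, which shows mismatches on both sides in one pass.

```python
import pandas as pd

# Hypothetical stand-ins for the LTC database and the outbreak table.
ltc_toy = pd.DataFrame({"name": ["HOME A", "HOME B LIMITED"],
                        "beds": [100, 80]})
outbreaks_toy = pd.DataFrame({"name": ["HOME A", "HOME B"],
                              "outbreak_status": ["active", "inactive"]})

# how="outer" keeps unmatched rows; indicator=True adds the "_merge" column.
checked = pd.merge(ltc_toy, outbreaks_toy, on="name",
                   how="outer", indicator=True)

only_in_outbreaks = checked.loc[checked["_merge"] == "right_only", "name"].tolist()
only_in_ltc = checked.loc[checked["_merge"] == "left_only", "name"].tolist()
print(only_in_outbreaks, only_in_ltc)
```

Seeing the `left_only` names next to the `right_only` ones makes it easier to spot near-matches that need a manual rename.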

In [14]:
# Rename homes in the LTC database to match the names used in the outbreak data.
# Assigning the result back avoids relying on in-place replacement through attribute access.
ltc["name"] = ltc["name"].replace({
    'RESIDENCE SAINT- LOUIS': 'RESIDENCE SAINT-LOUIS',
    'FINLANDIA HOIVAKOTI NURSING HOME LIMITED': 'FINLANDIA HOIVAKOTI NURSING HOME',
    'PINECREST NURSING HOME - BOBCAYGEON': 'PINECREST NURSING HOME (BOBCAYGEON)',
    "ST PATRICK'S HOME": "ST. PATRICK'S HOME",
    "VISION '74 INC.": 'VISION NURSING HOME',
    'MACASSA LODGE': 'MACASSA  LODGE',
    'HEARTWOOD (FKA VERSA-CARE CORNWALL)': 'HEARTWOOD',
})
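
Some of these mismatches are whitespace-only (for example the double space in 'MACASSA  LODGE'). As a rough sketch, a normalization step applied to the name column of both tables before merging could remove that class of mismatch. `normalize_name` is a hypothetical helper, not part of this notebook, and real renames such as 'VISION \'74 INC.' would still need a manual mapping.

```python
import pandas as pd

def normalize_name(names: pd.Series) -> pd.Series:
    """Upper-case, trim, and collapse runs of whitespace (hypothetical helper)."""
    return (names.str.upper()
                 .str.strip()
                 .str.replace(r"\s+", " ", regex=True))

# Illustrative inputs only.
raw = pd.Series(["Macassa  Lodge", " Residence Saint- Louis "])
cleaned = normalize_name(raw).tolist()
print(cleaned)
```

Note that this only collapses whitespace: 'RESIDENCE SAINT- LOUIS' keeps its space after the hyphen and would still not match 'RESIDENCE SAINT-LOUIS'.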

In [15]:
complete = pd.merge(ltc, outbreaksAug1, on = "name")
print("Number of homes in outbreak DF missing after merge: ", len(set(outbreaksAug1.name.unique())-set(complete.name.unique())))
list(set(outbreaksAug1.name.unique())-set(complete.name.unique()))

Number of homes in outbreak DF missing after merge:  0


[]

**Clean & Export:**

In [16]:
complete.drop(columns = ['_id','LTC_City','additional_info','LTC_Home',
                        'management', 'city', 'City', 'index'], inplace = True)
complete.set_index('name', inplace = True)

In [17]:
complete.to_csv('../merge_LTC_database/LTC_general_DB_aug1.csv')