# Wayback Machine population data

This notebook reaches back in time and scrapes the population table from the<a href="https://www.cor.pa.gov/Pages/COVID-19.aspx"> PA DOC main Covid-19 page</a>. This page was created on march 21st but only contained population data starting on April 6th. **This notebook was last run on `11-20-2020`.** All data after 11/20 is scraped at 6:00PM daily from the cron job `population_scraper.py`. 

In [4]:
import pandas as pd
import urllib
from datetime import datetime
import glob

In [6]:
html_text = "https://web.archive.org/web/{date}/https:/www.cor.pa.gov/Pages/COVID-19.aspx"
date_range = pd.date_range(start="2020-03-21",end='2020-11-20').to_list()

In [7]:
for date in date_range:
    print(f'{datetime.strftime(date,"%m-%d")}',end=" | ")
    # get html frames
    html_frames = pd.read_html(html_text.format(date = datetime.strftime(date,"%Y%m%d")))
    for frame in html_frames:
        if frame.loc[0,0] == "INSTITUTION": # find population count frame
            
            # clean dataframe
            frame.columns = frame.loc[0]
            frame = frame.loc[1:,]
            frame.iloc[:,1:] = frame.iloc[:,1:].astype(float)
            frame['date'] = date
            
            frame.to_csv(f'../data/Daily_Populations/Daily_Populations_{datetime.strftime(date,"%m-%d")}.csv',index=False)

03-21 | 03-22 | 03-23 | 03-24 | 03-25 | 03-26 | 03-27 | 03-28 | 03-29 | 03-30 | 03-31 | 04-01 | 04-02 | 04-03 | 04-04 | 04-05 | 04-06 | 04-07 | 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':


04-08 | 04-09 | 04-10 | 04-11 | 04-12 | 04-13 | 04-14 | 04-15 | 04-16 | 04-17 | 04-18 | 04-19 | 04-20 | 04-21 | 04-22 | 04-23 | 04-24 | 04-25 | 04-26 | 04-27 | 04-28 | 04-29 | 04-30 | 05-01 | 05-02 | 05-03 | 05-04 | 05-05 | 05-06 | 05-07 | 05-08 | 05-09 | 05-10 | 05-11 | 05-12 | 05-13 | 05-14 | 05-15 | 05-16 | 05-17 | 05-18 | 05-19 | 05-20 | 05-21 | 05-22 | 05-23 | 05-24 | 05-25 | 05-26 | 05-27 | 05-28 | 05-29 | 05-30 | 05-31 | 06-01 | 06-02 | 06-03 | 06-04 | 06-05 | 06-06 | 06-07 | 06-08 | 06-09 | 06-10 | 06-11 | 06-12 | 06-13 | 06-14 | 06-15 | 06-16 | 06-17 | 06-18 | 06-19 | 06-20 | 06-21 | 06-22 | 06-23 | 06-24 | 06-25 | 06-26 | 06-27 | 06-28 | 06-29 | 06-30 | 07-01 | 07-02 | 07-03 | 07-04 | 07-05 | 07-06 | 07-07 | 07-08 | 07-09 | 07-10 | 07-11 | 07-12 | 07-13 | 07-14 | 07-15 | 07-16 | 07-17 | 07-18 | 07-19 | 07-20 | 07-21 | 07-22 | 07-23 | 07-24 | 07-25 | 07-26 | 07-27 | 07-28 | 07-29 | 07-30 | 07-31 | 08-01 | 08-02 | 08-03 | 08-04 | 08-05 | 08-06 | 08-07 | 08-08 | 08-09 | 08-10 | 

### Merge all dates to single aggregated DF

In [38]:
colmapper = {"INSTITUTION":"SCI",
            "TODAY'S POPULATION":"population",
            "REPRIEVE RELEASES":"reprieve_releases",
            "TODAY'S POPULATION AFTER REPRIEVE RELEASES":"population_after_reprieve",
            "INCREASE/ DECREASE FROM YESTERDAY":"population_change_one_day",
            "INCREASE/ DECREASE FROM LAST WEEK":"population_change_one_week",
            "INCREASE/ DECREASE FROM LAST MONTH":"population_change_one_month",
            "date":"date"}

# 5/22 - 5/25 had column formatting issues
colmapper_522_525 = {"INSTITUTION":"SCI",
            "TODAY'S POPULATION":"population",
            "REPRIEVE RELEASES":"reprieve_releases",
            "TODAY'S POPULATION AFTER REPRIEVE RELEASES":"population_after_reprieve",
            "INCREASE/ DECREASEFROM YESTERDAY":"population_change_one_day",
            "INCREASE/ DECREASEFROM LAST WEEK":"population_change_one_week",
            "INCREASE/ DECREASEFROM LAST MONTH":"population_change_one_month",
            "date":"date"}

df_list = []
for i in glob.glob('../data/Daily_Populations/*.csv'):
    df = pd.read_csv(i)
    if "INCREASE/ DECREASEFROM LAST WEEK" in df.columns:
        df = df.rename(columns=colmapper_522_525)
    else:
        df = df.rename(columns=colmapper)
    df_list.append(df)
    
combined['date'] = pd.to_datetime(combined['date'])
combined = pd.concat(df_list)
combined.to_csv('../data/Daily_Populations/Daily_Populations_aggregated.csv',index=False)