# Scrape Wayback Machine 

## Get snapshots of the URL we want to scrape

For the Wisconsin Department of Health Services page that lists all local WIC offices, scrape the Wayback Machine for every archived version of the URL.

**Note:** For now, run this command in the console.


In [1]:
! wayback-machine-scraper -a 'https://www.dhs.wisconsin.gov/wic/local-projects.htm$' https://www.dhs.wisconsin.gov/wic/local-projects.htm -o './data/'

2022-02-03 09:30:43 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-02-03 09:30:43 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 13:09:58) - [GCC 7.5.0], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Linux-5.13.0-28-generic-x86_64-with-glibc2.31
2022-02-03 09:30:43 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_START_DELAY': 1,
 'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0,
 'LOG_LEVEL': 'INFO',
 'USER_AGENT': 'Wayback Machine Scraper/1.0.8 '
               '(+https://github.com/sangaline/scrapy-wayback-machine)'}
2022-02-03 09:30:43 [scrapy.extensions.telnet] INFO: Telnet Password: f2dfb36b6647d0cb
2022-02-03 09:30:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUs

This creates the following directory structure in `./data/`:

```
data/
└── www.dhs.wisconsin.gov
    └── wic
        └── local-projects.htm
            ├── 20141213083147.snapshot
            ├── 20150801041055.snapshot
            ...
            └── 20211211141555.snapshot
```

We can list the retrieved snapshots as follows:

In [2]:
from os import listdir
from os.path import isfile, join

data_path = './data/www.dhs.wisconsin.gov/wic/local-projects.htm/'

snapshots = [f for f in listdir(data_path) if isfile(join(data_path, f)) and f.endswith('.snapshot')]
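
`listdir` returns entries in arbitrary order, so `snapshots[0]` is not necessarily the earliest capture. A minimal sketch of a chronological listing, assuming the file names follow the `YYYYMMDDhhmmss.snapshot` pattern shown in the directory tree; the helper name `list_snapshots` is hypothetical:

```python
from datetime import datetime
from pathlib import Path


def list_snapshots(directory):
    """Return (timestamp, path) pairs for *.snapshot files, oldest first.

    Assumes file names look like 20141213083147.snapshot; files whose
    stem does not parse as a timestamp are skipped.
    """
    pairs = []
    for path in Path(directory).glob("*.snapshot"):
        try:
            stamp = datetime.strptime(path.stem, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # not a Wayback-style timestamp
        pairs.append((stamp, path))
    return sorted(pairs)
```

Returning the parsed timestamp alongside the path would also avoid slicing the suffix off the file name later on.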

## Scrape data from the snapshots

### Example: Scrape data from first snapshot in the list.

In [3]:
from bs4 import BeautifulSoup
import pandas as pd
import re

with open(data_path + snapshots[0]) as fp:
    soup = BeautifulSoup(fp, 'html.parser')

table = soup.find_all('table')[0]

In [4]:
table_rows = table.find_all('tr')

In [5]:
df_county  = pd.DataFrame(columns=["county", "zip", "address"])
# Row 0 is skipped (presumably the table header); each remaining row is one county.
for i in range(1, len(table_rows)):
    county = table_rows[i].find_all('td')[0].text.strip()
    # The second cell holds one <p> per entry for that county.
    content = table_rows[i].find_all('td')[1].find_all('p')

    data_county = { "county":[], "zip": [], "address" : []}#, "name" : []}

    for c in content:
        t = c.text.strip().split('\n\t\t\t\t\t\t')
        # if (len(t) == 1) and ('Back to top ' not in t[0]):
            # data_county['name'].append(t[0])
        # el
        if (len(t) > 1):
            data_county['county'].append(county)
            # Keep only the text that precedes the phone number.
            t = " ".join(t).split('Telephone:')[0]
            data_county['address'].append(t)
            # Take the first standalone five-digit number as the ZIP code.
            data_county['zip'].append(re.findall(r"(?<!\d)\d{5}(?!\d)", t)[0])
    
    temp = pd.DataFrame(data_county)

    df_county = pd.concat([df_county, temp], ignore_index=True)

# Drop the 9-character '.snapshot' suffix to get the capture timestamp.
df_county["snapshot"] = pd.to_datetime(snapshots[0][:-9])

df_county

| | county | zip | address | snapshot |
|---|---|---|---|---|
| 0 | Adams | 53948 | 200 Hickory Street Mauston, WI 53948 | 2021-04-16 03:47:33 |
| 1 | Ashland | 54806 | 216 3rd St. West, Suite 100 Ashland, WI 54806 | 2021-04-16 03:47:33 |
| 2 | Ashland | 53585 | Bad River Health Center 53585 Nokomis Road Ash... | 2021-04-16 03:47:33 |
| 3 | Barron | 54812 | 335 E. Monroe Ave., Room 338 Barron, WI 54812 | 2021-04-16 03:47:33 |
| 4 | Barron | 54893 | St. Croix Tribal Center 4404 State Road 70 Web... | 2021-04-16 03:47:33 |
| ... | ... | ... | ... | ... |
| 95 | Waushara | 54982 | 400 South Townline Rd Wautoma, WI 54982 | 2021-04-16 03:47:33 |
| 96 | Winnebago | 54901 | 112 Otter Ave. PO Box 2808 Oshkosh, WI 54901-5... | 2021-04-16 03:47:33 |
| 97 | Winnebago | 54956 | 211 N. Commercial St. Neenah, WI 54956 | 2021-04-16 03:47:33 |
| 98 | Wood | 54495 | 111 W. Jackson St. Wisconsin Rapids, WI 54495 | 2021-04-16 03:47:33 |
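
Taking the first five-digit match can pick up a street number rather than the ZIP code: row 2 above reports `53585`, which is the house number in `53585 Nokomis Road`. A minimal sketch of one alternative, assuming the ZIP code is the last five-digit group in the address, optionally followed by a ZIP+4 suffix. `extract_zip` and `ZIP_PATTERN` are hypothetical names, not part of the original notebook:

```python
import re

# Five digits not embedded in a longer digit run, optionally followed by -NNNN.
ZIP_PATTERN = re.compile(r"(?<!\d)(\d{5})(?:-\d{4})?(?!\d)")


def extract_zip(address):
    """Return the last 5-digit ZIP-like group in address, or None if absent."""
    matches = ZIP_PATTERN.findall(address)
    return matches[-1] if matches else None
```

Returning `None` instead of indexing `[0]` on an empty list also keeps a single address without a ZIP code from aborting the whole loop with an `IndexError`.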


### Latitude and Longitude (for later)

In [6]:
# from geopy.geocoders import Nominatim
# from geopy.extra.rate_limiter import RateLimiter

# geolocator = Nominatim(user_agent="valsbobes@wisc.edu")


# # 1 - convenience function to add a delay between geocoding calls
# geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# # 2 - create location column
# df_county['location'] = df_county['address'].apply(geocode)

# # 3 - create longitude, latitude and altitude from location column (returns tuple)
# # df_county['point'] = df_county['location'].apply(lambda loc: tuple(loc.point) if loc else None)

In [7]:
# df_county
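
If the commented-out geocoding step is revived, the same address will appear in many snapshots, so caching lookups keeps the number of requests down. A minimal sketch, assuming `geocode` is any callable that maps an address string to an object with `latitude` and `longitude` attributes, or to `None` when nothing is found (for example the `RateLimiter`-wrapped `geolocator.geocode` above); `geocode_addresses` is a hypothetical helper:

```python
def geocode_addresses(addresses, geocode):
    """Map each distinct address to (latitude, longitude), or None if not found.

    geocode: callable taking an address string and returning an object
    with .latitude and .longitude attributes, or None.
    """
    coords = {}
    for address in addresses:
        if address in coords:
            continue  # already looked up; avoid a repeated request
        location = geocode(address)
        coords[address] = (
            (location.latitude, location.longitude) if location else None
        )
    return coords
```

The resulting dictionary could then be attached to the data frame with something like `df_county['address'].map(coords)`.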

## Putting it all together

In [8]:
df_state  = pd.DataFrame(columns=["county", "zip", "address","snapshot"])

for j in range(len(snapshots)):
    with open(data_path + snapshots[j]) as fp:
        soup = BeautifulSoup(fp, 'html.parser')

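    table = soup.find_all('table')[0]
    # Re-read the rows for this snapshot; otherwise the rows parsed from
    # the first snapshot in In [4] would be reused for every file.
    table_rows = table.find_all('tr')

    df_county  = pd.DataFrame(columns=["county", "zip", "address"])
    for i in range(1, len(table_rows)):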
        county = table_rows[i].find_all('td')[0].text.strip()
        content = table_rows[i].find_all('td')[1].find_all('p')

        data_county = { "county":[], "zip": [], "address" : []}#, "name" : []}

        for c in content:
            t = c.text.strip().split('\n\t\t\t\t\t\t')
            # if (len(t) == 1) and ('Back to top ' not in t[0]):
                # data_county['name'].append(t[0])
            # el
            if (len(t) > 1):
                data_county['county'].append(county)
                t = " ".join(t).split('Telephone:')[0]
                data_county['address'].append(t)
                data_county['zip'].append(re.findall(r"(?<!\d)\d{5}(?!\d)", t)[0])
        
        temp = pd.DataFrame(data_county)

        df_county = pd.concat([df_county, temp], ignore_index=True)

    df_county["snapshot"] = pd.to_datetime(snapshots[j][:-9])
    df_state = pd.concat([df_state, df_county], ignore_index=True)

df_state

| | county | zip | address | snapshot |
|---|---|---|---|---|
| 0 | Adams | 53948 | 200 Hickory Street Mauston, WI 53948 | 2021-04-16 03:47:33 |
| 1 | Ashland | 54806 | 216 3rd St. West, Suite 100 Ashland, WI 54806 | 2021-04-16 03:47:33 |
| 2 | Ashland | 53585 | Bad River Health Center 53585 Nokomis Road Ash... | 2021-04-16 03:47:33 |
| 3 | Barron | 54812 | 335 E. Monroe Ave., Room 338 Barron, WI 54812 | 2021-04-16 03:47:33 |
| 4 | Barron | 54893 | St. Croix Tribal Center 4404 State Road 70 Web... | 2021-04-16 03:47:33 |
| ... | ... | ... | ... | ... |
| 3295 | Waushara | 54982 | 400 South Townline Rd Wautoma, WI 54982 | 2016-02-01 10:06:17 |
| 3296 | Winnebago | 54901 | 112 Otter Ave. PO Box 2808 Oshkosh, WI 54901-5... | 2016-02-01 10:06:17 |
| 3297 | Winnebago | 54956 | 211 N. Commercial St. Neenah, WI 54956 | 2016-02-01 10:06:17 |
| 3298 | Wood | 54495 | 111 W. Jackson St. Wisconsin Rapids, WI 54495 | 2016-02-01 10:06:17 |


In [9]:
df_state.groupby('snapshot').count()

| snapshot | county | zip | address |
|---|---|---|---|
| 2014-12-13 08:31:47 | 100 | 100 | 100 |
| 2015-08-01 04:10:55 | 100 | 100 | 100 |
| 2015-11-19 07:10:46 | 100 | 100 | 100 |
| 2016-02-01 10:06:17 | 100 | 100 | 100 |
| 2016-05-02 02:35:34 | 100 | 100 | 100 |
| 2016-08-01 02:53:58 | 100 | 100 | 100 |
| 2016-11-01 03:21:06 | 100 | 100 | 100 |
| 2016-12-23 07:13:51 | 100 | 100 | 100 |
| 2017-02-01 03:53:29 | 100 | 100 | 100 |
| 2017-05-02 06:07:25 | 100 | 100 | 100 |


In [10]:
df_state.groupby(['county', 'snapshot']).count()

| county | snapshot | zip | address |
|---|---|---|---|
| Adams | 2014-12-13 08:31:47 | 1 | 1 |
| Adams | 2015-08-01 04:10:55 | 1 | 1 |
| Adams | 2015-11-19 07:10:46 | 1 | 1 |
| Adams | 2016-02-01 10:06:17 | 1 | 1 |
| Adams | 2016-05-02 02:35:34 | 1 | 1 |
| ... | ... | ... | ... |
| Wood | 2021-04-16 03:47:33 | 2 | 2 |
| Wood | 2021-04-29 19:03:28 | 2 | 2 |
| Wood | 2021-05-14 23:56:43 | 2 | 2 |
| Wood | 2021-09-24 08:15:24 | 2 | 2 |
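
The per-county counts can also be compared between consecutive snapshots to see when a county gained or lost an office. A minimal sketch in plain Python, assuming the input is an iterable of `(county, snapshot)` pairs with one pair per office row (for example `zip(df_state['county'], df_state['snapshot'])`) and that snapshot values sort chronologically; `count_changes` is a hypothetical helper:

```python
from collections import Counter, defaultdict


def count_changes(pairs):
    """Return (county, earlier, later, n_earlier, n_later) where counts differ.

    pairs: iterable of (county, snapshot) tuples, one per office row.
    Only snapshots in which a county appears at least once are compared,
    so a county that disappears entirely is not reported.
    """
    counts = Counter(pairs)
    by_county = defaultdict(list)
    for (county, snapshot), n in counts.items():
        by_county[county].append((snapshot, n))

    changes = []
    for county in sorted(by_county):
        series = sorted(by_county[county])  # chronological order
        for (earlier, n_earlier), (later, n_later) in zip(series, series[1:]):
            if n_earlier != n_later:
                changes.append((county, earlier, later, n_earlier, n_later))
    return changes
```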


In [13]:
df_state.to_csv('./data/WI_wic_locations.csv', index=False)

## Delete the snapshots

In [11]:
! rm -rf ./data/www.dhs.wisconsin.gov
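
The same cleanup can be done from Python, which avoids the shell escape and works on platforms without `rm`. A minimal sketch using the standard library's `shutil.rmtree`; `remove_snapshots` is a hypothetical helper, and the check that the target lies inside the data directory is an added safety assumption rather than part of the original workflow:

```python
import shutil
from pathlib import Path


def remove_snapshots(data_dir, site="www.dhs.wisconsin.gov"):
    """Delete data_dir/site and everything below it; return True if removed."""
    root = Path(data_dir).resolve()
    target = (root / site).resolve()
    # Refuse to delete anything that is not strictly inside data_dir.
    if root not in target.parents:
        raise ValueError(f"{target} is not inside {root}")
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Calling `remove_snapshots('./data')` would leave other files in `./data/`, such as the exported CSV, untouched.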