# Scrape Wayback Machine 

## Get snapshots of the URL we want to scrape

Scrape the Wayback Machine for every archived version of the Wisconsin Department of Health Services web page that lists all local WIC offices.

**Note:** This runs in the console for now.


In [26]:
! wayback-machine-scraper -a 'https://www.dhs.wisconsin.gov/wic/local-projects.htm$' https://www.dhs.wisconsin.gov/wic/local-projects.htm -o './data/'

2022-02-02 21:23:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-02-02 21:23:35 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 13:09:58) - [GCC 7.5.0], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Linux-5.13.0-28-generic-x86_64-with-glibc2.31
2022-02-02 21:23:35 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_START_DELAY': 1,
 'AUTOTHROTTLE_TARGET_CONCURRENCY': 10.0,
 'LOG_LEVEL': 'INFO',
 'USER_AGENT': 'Wayback Machine Scraper/1.0.8 '
               '(+https://github.com/sangaline/scrapy-wayback-machine)'}
2022-02-02 21:23:35 [scrapy.extensions.telnet] INFO: Telnet Password: e850e8c825195d19
2022-02-02 21:23:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUs

This creates the following directory structure in `./data/`:

```
data/
└── www.dhs.wisconsin.gov
    └── wic
        └── local-projects.htm
            ├── 20141213083147.snapshot
            ├── 20150801041055.snapshot
            ...
            └── 20211211141555.snapshot
```
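
The layout above suggests that the scraper mirrors the host name and URL path under the output directory. As a minimal sketch under that assumption, the snapshot directory can be derived from the URL instead of being typed by hand. The helper name `snapshot_dir` is hypothetical and not part of `wayback-machine-scraper`.

```python
import posixpath
from urllib.parse import urlparse


def snapshot_dir(out_dir: str, url: str) -> str:
    """Build the directory where snapshots of `url` are expected to be stored.

    Assumes the <out_dir>/<host>/<path> layout shown in the tree above.
    """
    parts = urlparse(url)
    # Drop the leading slash so the URL path is joined as a relative path.
    return posixpath.join(out_dir, parts.netloc, parts.path.lstrip("/"))


url = "https://www.dhs.wisconsin.gov/wic/local-projects.htm"
print(snapshot_dir("./data", url))
```

`posixpath` is used rather than `os.path` so that the result uses forward slashes on every platform.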

We can obtain a list of retrieved snapshots by:

In [42]:
from os import listdir
from os.path import isfile, join

data_path = './data/www.dhs.wisconsin.gov/wic/local-projects.htm/'

snapshots = [f for f in listdir(data_path) if isfile(join(data_path, f)) and f.endswith('.snapshot')]
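
`listdir` returns entries in arbitrary order, so `snapshots[0]` is not necessarily the earliest capture. The sketch below shows one way to order snapshot file names chronologically, assuming each name is a 14-digit `YYYYMMDDHHMMSS` timestamp followed by `.snapshot`, as in the tree above. The helper `snapshot_time` is a hypothetical name introduced here.

```python
from datetime import datetime

SUFFIX = ".snapshot"


def snapshot_time(filename: str) -> datetime:
    """Parse the capture time encoded in a snapshot file name."""
    stamp = filename[: -len(SUFFIX)]  # strip the '.snapshot' suffix
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


# File names taken from the directory tree above, deliberately out of order.
names = [
    "20211211141555.snapshot",
    "20141213083147.snapshot",
    "20150801041055.snapshot",
]
ordered = sorted(names, key=snapshot_time)
print(ordered)
```

Because the timestamps are fixed-width, a plain `sorted(names)` would give the same order. Parsing them also validates the file names and yields `datetime` objects that can be reused later.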

## Scrape data from the snapshots

### Example: Scrape data from the first snapshot in the list

In [90]:
from bs4 import BeautifulSoup
import pandas as pd

with open(data_path + snapshots[0]) as fp:
    soup = BeautifulSoup(fp, 'html.parser')

table = soup.find_all('table')[0]

In [67]:
table_rows = table.find_all('tr')

In [147]:
import re

df_county = pd.DataFrame(columns=["county", "zip", "address"])

# Start at row 1, skipping the first row (presumably the table header).
for i in range(1, len(table_rows)):
    cells = table_rows[i].find_all('td')
    county = cells[0].text.strip()
    content = cells[1].find_all('p')

    data_county = {"county": [], "zip": [], "address": []}

    for c in content:
        # Address paragraphs span several lines separated by a newline and six tabs.
        t = c.text.strip().split('\n\t\t\t\t\t\t')
        if len(t) > 1:
            # Keep only the text that precedes the phone number.
            t = " ".join(t).split('Telephone:')[0]
            data_county['county'].append(county)
            data_county['address'].append(t)
            # First standalone 5-digit number; this can be a street number rather than the ZIP code.
            data_county['zip'].append(re.findall(r"(?<!\d)\d{5}(?!\d)", t)[0])

    temp = pd.DataFrame(data_county)
    df_county = pd.concat([df_county, temp], ignore_index=True)

# The file name minus the 9-character '.snapshot' suffix is the capture timestamp.
df_county["snapshot"] = pd.to_datetime(snapshots[0][:-9])

df_county

|     | county    | zip   | address                                            | snapshot            |
|-----|-----------|-------|----------------------------------------------------|---------------------|
| 0   | Adams     | 53948 | 200 Hickory Street Mauston, WI 53948               | 2021-04-16 03:47:33 |
| 1   | Ashland   | 54806 | 216 3rd St. West, Suite 100 Ashland, WI 54806      | 2021-04-16 03:47:33 |
| 2   | Ashland   | 53585 | Bad River Health Center 53585 Nokomis Road Ash... | 2021-04-16 03:47:33 |
| 3   | Barron    | 54812 | 335 E. Monroe Ave., Room 338 Barron, WI 54812      | 2021-04-16 03:47:33 |
| 4   | Barron    | 54893 | St. Croix Tribal Center 4404 State Road 70 Web... | 2021-04-16 03:47:33 |
| ... | ...       | ...   | ...                                                | ...                 |
| 95  | Waushara  | 54982 | 400 South Townline Rd Wautoma, WI 54982            | 2021-04-16 03:47:33 |
| 96  | Winnebago | 54901 | 112 Otter Ave. PO Box 2808 Oshkosh, WI 54901-5... | 2021-04-16 03:47:33 |
| 97  | Winnebago | 54956 | 211 N. Commercial St. Neenah, WI 54956             | 2021-04-16 03:47:33 |
| 98  | Wood      | 54495 | 111 W. Jackson St. Wisconsin Rapids, WI 54495      | 2021-04-16 03:47:33 |
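
In the output above, the Bad River Health Center row gets `53585` as its ZIP code, which is the street number from "53585 Nokomis Road": the regex match at index `[0]` is simply the first standalone 5-digit number in the address. One possible fix, sketched below, is to take the last such number instead, on the assumption that the ZIP code always comes at the end of the address. The helper `extract_zip` and the two sample addresses marked as hypothetical are illustrations, not data from the snapshots.

```python
import re

# Same pattern as in the loop above: exactly five digits, not part of a longer number.
ZIP_RE = re.compile(r"(?<!\d)\d{5}(?!\d)")


def extract_zip(address: str):
    """Return the last standalone 5-digit number in `address`, or None."""
    matches = ZIP_RE.findall(address)
    return matches[-1] if matches else None


# Address taken from the table above.
print(extract_zip("216 3rd St. West, Suite 100 Ashland, WI 54806"))
# Hypothetical address with a 5-digit street number.
print(extract_zip("Example Clinic 12345 Main Road Springfield, WI 54321"))
# Hypothetical ZIP+4 address: the 4-digit extension is never matched.
print(extract_zip("PO Box 1 Springfield, WI 54321-0001"))
```

Returning `None` when nothing matches also avoids the `IndexError` that `[0]` raises on an address without any 5-digit number.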


In [133]:
i = 41
county = table_rows[i].find_all('td')[0].text.strip()
content = table_rows[i].find_all('td')[1].find_all('p')


In [134]:
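# data_county as built earlier has no 'name' key, so create it if it is missing.
data_county.setdefault('name', [])

for c in content:
    t = c.text.strip().split('\n\t\t\t\t\t\t')
    if (len(t) == 1) and ('Back to top ' not in t[0]):
        # Single-line paragraphs (other than the "Back to top" link) are treated as office names.
        data_county['name'].append(t[0])
        data_county['county'].append(county)
    elif (len(t) > 1):
        t = " ".join(t).split('Telephone:')[0]
        print(t)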

7120 West National Ave. West Allis, WI 53214 
5050 S. Lake Drive Cudahy, WI 53116 
7325 West Forest Home Ave. Greenfield, WI 53220 
1218 West Kilbourn Ave, Suite 207 Milwaukee, WI 53233 
1337 South Cesar Chavez Drive Milwaukee, WI 53204 
1445 South 32nd St. Milwaukee, WI 53215 
Northwest Health Center 7630 West Mill Road Milwaukee, WI 53218 
Keenan WIC Project 3200 N 36th St. Milwaukee, WI 53216 
South Side Health Center (SSHC) 1639 S. 23rd St., First Floor Milwaukee, WI 53204 
2555 North Dr. Martin Luther King Jr. Drive Milwaukee, WI 53212 
3882 North Teutonia Ave. Milwaukee, WI 53206 
5825 West Capitol Drive Milwaukee, WI 53216 
4630 W. North Ave. Milwaukee, WI 53208 
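
Some of the addresses printed above start with an office name ("Northwest Health Center", "Keenan WIC Project") and others start directly with the street number. As a rough heuristic, and only a sketch, the name can be separated by treating everything before the first digit as the name. The helper `split_name` is a hypothetical name, and the rule would mis-split an office name that itself contains a digit.

```python
import re

# Lazy run of non-digits (the name), then the rest starting at the first digit.
NAME_STREET_RE = re.compile(r"^(?P<name>\D*?)\s*(?P<street>\d.*)$")


def split_name(address: str):
    """Split an address into (office name, street address); the name may be empty."""
    text = address.strip()
    m = NAME_STREET_RE.match(text)
    if m is None:
        # No digit at all: treat the whole text as the address.
        return "", text
    return m.group("name"), m.group("street")


# Addresses taken from the output above.
for line in [
    "Northwest Health Center 7630 West Mill Road Milwaukee, WI 53218 ",
    "South Side Health Center (SSHC) 1639 S. 23rd St., First Floor Milwaukee, WI 53204 ",
    "7120 West National Ave. West Allis, WI 53214 ",
]:
    print(split_name(line))
```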
