# Scrape https://www.wicprograms.org/

First we scrape the state names from "https://www.wicprograms.org/" to create a list of urls to scrape.

In [2]:
import requests
from bs4 import BeautifulSoup


URL = "https://www.wicprograms.org/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')

states = [option for option in soup.find_all('select')[0].find_all('option')][1:]
states = [s.get('value') for s in states]

## Use wayback machine to get change history of the WIC clinincs at the state level

In [None]:
! wayback-machine-scraper -a 'https://www.wicprograms.org/$' https://www.wicprograms.org/ -o './data/'

This creates the following directory structure in `./data/`:

```
website/
└── www.wicprograms.org
       ├── 20101203122320.snapshot
       ├── 20110202214735.snapshot
       ...
       └── 20220119003812.snapshot
```

Im goint to ignore the snapshots that are older than Jun 21 2012 since they do not contain the table of WIC clinics by state.

In [21]:
from os import listdir
from os.path import isfile, join

data_path = './data/www.wicprograms.org/'

snapshots = [f for f in listdir(data_path) if isfile(join(data_path, f)) and f.endswith('.snapshot')]

snapshots = [s for s in snapshots if int(s[:14]) > 20120620000000 ]

In [73]:
from bs4 import BeautifulSoup
import pandas as pd
import re

problem = []
WIC_count_by_state = pd.DataFrame({'state': [], 'WIC_count': [], 'snapshot': []})
for i in range(len(snapshots)):
    with open(data_path + snapshots[i]) as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    state_data = {'state': [], 'WIC_count': [], 'snapshot': []}
    try:
        table = soup.find_all('div', {'class':"multicolumn"})[0]
    except:
        table = soup.find_all("ul", {'class':"statelist"})[0]
    statte_list = table.find_all('a')
    state_data['state'] = [s.text for s in statte_list]

    program_count = table.find_all('em')
    state_data['WIC_count'] = [int("".join(filter(str.isdigit, p.text))) for p in program_count]

    state_data['snapshot'] = pd.to_datetime([snapshots[0][:-9]]*len(state_data['state']))
    state_data = pd.DataFrame(state_data).head()
    WIC_count_by_state = pd.concat([WIC_count_by_state, state_data], ignore_index=True)

In [75]:
WIC_count_by_state

Unnamed: 0,state,WIC_count,snapshot
0,Alaska,23.0,2016-05-07 05:16:11
1,Alabama,64.0,2016-05-07 05:16:11
2,Arkansas,50.0,2016-05-07 05:16:11
3,Arizona,134.0,2016-05-07 05:16:11
4,California,223.0,2016-05-07 05:16:11
...,...,...,...
810,Alabama,20.0,2016-05-07 05:16:11
811,Arkansas,50.0,2016-05-07 05:16:11
812,Arizona,132.0,2016-05-07 05:16:11
813,California,78.0,2016-05-07 05:16:11


We can track the changes in the number of WIC clinics by state:

import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
sns.set()

x = WIC_count_by_state[WIC_count_by_state['state'] == 'Alabama'].snapshot
y = WIC_count_by_state[WIC_count_by_state['state'] == 'Alabama'].WIC_count

plt.plot(x, y)

## Use Wayback Machine to get the all versions of each state's page

In [None]:
import os

# for state in states:
#     os.system(f"wayback-machine-scraper -a 'https://www.wicprograms.org/state/{state}$' https://www.wicprograms.org/state/{state} -o './data/'")

This creates the following directory structure in `./data/`:

```
website/
└── www.wicprograms.org
    └── state
    |   └──alabama
    |       ├── 20110518012524.snapshot
    |       ├── 20110728175734.snapshot
    |       ...
    |       └── 20210508013023.snapshot
    ...
    |   └──wyoming
    |       ├── 20110518012524.snapshot
    |       ├── 20110728175734.snapshot
    |       ...
    |       └── 20210508013023.snapshot
```

**Note:** The this took 30 minutes to run.