[header stuff here]

In this notebook, we will access the [BLS Public Data API](https://www.bls.gov/developers/) to pull unemployment data by state over time. We'll also do some preprocessing to annualize rates by averaging.

---

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

import time
from datetime import date

end_year = date.today().year - 1  #most recent complete year

**NOTE**: BLS series IDs embed a two-digit state code. The codes are listed at [https://download.bls.gov/pub/time.series/sm/sm.state](https://download.bls.gov/pub/time.series/sm/sm.state), a link that comes from [this documentation](https://www.bls.gov/bls/data_finder.htm).

The cells below also restrict the list to the 50 states and DC, excluding territories and national aggregates.

In [2]:
bls_state_ids = pd.read_csv('../data/reference/bls_sm.state.txt',sep="\t")
#two-digit leading zero filling method (used when building the series IDs below) taken from: https://stackoverflow.com/a/51837162


In [3]:
#keep only the 50 states and DC:
excluded = ['All States','Puerto Rico','Virgin Islands','All Metropolitan Statistical Areas']
bls_state_ids = bls_state_ids[~bls_state_ids['state_name'].isin(excluded)]

In [4]:
bls_state_ids = {f'LASST{k:02}0000000000003':v for k,v in zip(bls_state_ids['state_code'],bls_state_ids['state_name'])}
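
To make the series ID format concrete, here is a minimal sketch that builds the same kind of mapping from a tiny hand-made table. The three rows are illustrative stand-ins for the reference file (the real file has more rows), and the prefix and suffix are copied from the cell above.

```python
import pandas as pd

# Illustrative stand-in for the reference file; the real file has many more rows.
sample_states = pd.DataFrame({
    "state_code": [1, 6, 11],
    "state_name": ["Alabama", "California", "District of Columbia"],
})

# Zero-pad each state code to two digits and wrap it in the series ID template.
sample_ids = {
    f"LASST{code:02}0000000000003": name
    for code, name in zip(sample_states["state_code"], sample_states["state_name"])
}

for series_id, name in sample_ids.items():
    print(series_id, name)
```

The `:02` format spec is what turns the code `1` into `01`, so single-digit state codes still produce series IDs of the same length as two-digit ones.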

---

Per the [BLS Public Data API docs](https://www.bls.gov/developers/api_signature_v2.htm#multiple), a multiple-series query can return data for **at most 50 series**, and we need 51. So we'll split the request into two batches: 26 series in the first pass and the remaining 25 in the second.

The function below makes the API call. It has to be called multiple times because of the following restrictions:
* Up to 50 series IDs can be requested per call
    * We'll get the unemployment data in two batches: states 0-25 and states 26-50.
* Only 20 years of data can be requested per call
    * We'll start from 1979 (which coincides with the start of the FBI data) and request unemployment data in 20-year increments. Covering 1979 through 2020 takes 3 calls per batch of states.
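
The two limits above multiply: the number of requests is the number of series batches times the number of year windows. A rough sketch of that arithmetic follows. `chunk` and `count_requests` are hypothetical helpers, not part of this notebook, and the 50-then-remainder split shown here differs from the 26/25 split the notebook uses.

```python
import math

MAX_SERIES_PER_CALL = 50  # per-call series limit cited above
MAX_YEARS_PER_CALL = 20   # per-call year limit cited above

def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_requests(n_series, start_year, end_year):
    """Number of API calls needed to cover every series and every year (inclusive)."""
    n_batches = math.ceil(n_series / MAX_SERIES_PER_CALL)
    n_windows = math.ceil((end_year - start_year + 1) / MAX_YEARS_PER_CALL)
    return n_batches * n_windows

series = [f"series_{i}" for i in range(51)]  # placeholder IDs
print([len(batch) for batch in chunk(series, MAX_SERIES_PER_CALL)])  # [50, 1]
print(count_requests(51, 1979, 2020))  # 6
```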

In [5]:
headers = {'Content-type': 'application/json'} #does not change between API calls

def request_unemployment_data(year_range,series_ids):
    data = json.dumps({"seriesid": series_ids,
                       "startyear": year_range[0],
                       "endyear": year_range[1],
                       "catalog": True,
                       "annualaverage": True,
                       "registrationkey": "" ### YOUR API KEY HERE
                      })

    #make the request and stop early on an HTTP error:
    p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    p.raise_for_status()

    #one dataframe per series; combined at the end
    #(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    frames = []
    for series in p.json()['Results']['series']:
        #get the unemployment data:
        df = pd.DataFrame(series['data'])
        #make string unemployment values into floats:
        df['value'] = df['value'].astype('float')
        #get averages by year:
        df = df.groupby('year')['value'].mean().reset_index()
        #add which state the series corresponds to:
        df['state'] = series['catalog']['area']
        frames.append(df)

    # must return a dataframe for the year range and specified series_ids:
    return pd.concat(frames, ignore_index=True)
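
To see what the loop does to a single series without calling the API, the sketch below runs the same steps on a hand-written payload. The payload only mimics the fields the function reads (`Results`, `series`, `data`, `year`, `value`, `catalog`, `area`). The numbers are invented, and a real response carries additional fields.

```python
import pandas as pd

# Hand-written stand-in for p.json(); values are invented for illustration.
mock_response = {
    "Results": {
        "series": [
            {
                "catalog": {"area": "Alabama"},
                "data": [
                    {"year": "1979", "value": "7.0"},
                    {"year": "1979", "value": "8.0"},
                    {"year": "1980", "value": "9.0"},
                ],
            }
        ]
    }
}

frames = []
for series in mock_response["Results"]["series"]:
    df = pd.DataFrame(series["data"])
    df["value"] = df["value"].astype(float)                # values arrive as strings
    df = df.groupby("year")["value"].mean().reset_index()  # average within each year
    df["state"] = series["catalog"]["area"]
    frames.append(df)

annual = pd.concat(frames, ignore_index=True)
print(annual)
```

The two 1979 observations collapse into one row with their mean, which is the annualizing step described at the top of the notebook.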

The cells below run the API call as many times as required to get state average unemployment rates from 1979 through the most recent complete year at the time of writing (2020).

In [6]:
def define_year_ranges(end_year,start_year=1979):
    #split [start_year, end_year] (inclusive) into windows of at most 20 years.
    #looping until end_year is covered avoids dropping the final year when
    #end_year - start_year is an exact multiple of 20.
    result = []
    first_year = start_year
    while first_year <= end_year:
        last_year = min(first_year + 19, end_year)
        result.append((first_year,last_year))
        first_year = last_year + 1
    return result
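
As a check on the windowing logic, here is a self-contained sketch of the same idea written as a loop over inclusive windows. `year_windows` is a hypothetical name used only for this example.

```python
def year_windows(start_year, end_year, max_years=20):
    """Split the inclusive range [start_year, end_year] into windows of at most max_years."""
    windows = []
    first = start_year
    while first <= end_year:
        last = min(first + max_years - 1, end_year)
        windows.append((first, last))
        first = last + 1
    return windows

print(year_windows(1979, 2020))
# [(1979, 1998), (1999, 2018), (2019, 2020)]
```

The case worth testing is an end year exactly 40 years after the start: `year_windows(1979, 2019)` should still end with a one-year window `(2019, 2019)` rather than stopping at 2018.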

In [7]:
#split the 51 series IDs into a batch of 26 and a batch of 25:
series_ids = list(bls_state_ids)
series_sets = {'set1': series_ids[:26],
               'set2': series_ids[26:]}

In [8]:
#one dataframe per (batch, year range) request, combined at the end
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = []
for series_set in series_sets.values():
    for date_range in define_year_ranges(2020):
        frames.append(request_unemployment_data(date_range,series_set))
state_unemployment = pd.concat(frames, ignore_index=True)

In [9]:
state_unemployment.shape

(2142, 3)

The row count above (51 series × 42 years, 1979 through 2020) should match the number of rows we obtained for the crime data in [Notebook](#).
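
One way to go beyond the row count is to confirm that every state has exactly one row per year. The sketch below does this on a small synthetic frame. `check_complete` is a hypothetical helper, and it relies only on the `state` and `year` columns that the frame has at this point in the notebook.

```python
import pandas as pd

def check_complete(df, n_states, n_years):
    """Return True when df has one row for every (state, year) pair and no duplicates."""
    no_duplicates = not df.duplicated(subset=["state", "year"]).any()
    per_state = df.groupby("state")["year"].nunique()
    return bool(
        no_duplicates
        and len(df) == n_states * n_years
        and len(per_state) == n_states
        and (per_state == n_years).all()
    )

# Synthetic example: 2 states x 3 years.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "year": ["1979", "1980", "1981"] * 2,
    "value": [5.0, 6.0, 7.0, 4.0, 4.5, 5.5],
})
print(check_complete(toy, n_states=2, n_years=3))            # True
print(check_complete(toy.iloc[:-1], n_states=2, n_years=3))  # False
```

For the data pulled here, the equivalent call would be `check_complete(state_unemployment, 51, 42)`.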

In [10]:
state_unemployment.rename(columns={'value': 'avg_unemployment_rate'}, inplace=True)

In [11]:
state_unemployment.columns

Index(['year', 'avg_unemployment_rate', 'state'], dtype='object')

In [12]:
state_unemployment = state_unemployment[['state','year','avg_unemployment_rate']]

In [13]:
state_unemployment = state_unemployment.sort_values(by=['state','year'])

In [14]:
state_unemployment.head()

| | state | year | avg_unemployment_rate |
|---|---|---|---|
| 0 | Alabama | 1979 | 7.225 |
| 1 | Alabama | 1980 | 8.816667 |
| 2 | Alabama | 1981 | 10.691667 |
| 3 | Alabama | 1982 | 13.95 |
| 4 | Alabama | 1983 | 13.816667 |


In [15]:
state_unemployment.to_csv('../data/bls_state_unemployment.csv',index=False)

---

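Get **national** averages too. Note that this is the unweighted mean of the 51 state-level annual rates, so it is not the same as a population-weighted national unemployment rate: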

In [16]:
us_unemployment = state_unemployment.copy()

In [17]:
us_unemployment = pd.DataFrame(us_unemployment.groupby('year')['avg_unemployment_rate'].mean()).reset_index()
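
The `groupby` above gives every state equal weight, regardless of size. The sketch below contrasts that with a weighted mean on invented numbers. The `labor_force` column is hypothetical and is not part of the data pulled in this notebook.

```python
import numpy as np
import pandas as pd

# Invented numbers; labor_force is a hypothetical weight column.
toy = pd.DataFrame({
    "state": ["Small", "Large"],
    "year": ["2020", "2020"],
    "avg_unemployment_rate": [4.0, 8.0],
    "labor_force": [1_000_000, 9_000_000],
})

# Same operation as the cell above: every state counts equally.
unweighted = toy.groupby("year")["avg_unemployment_rate"].mean()

# Alternative for this single year: weight each state's rate by its labor force.
weighted = np.average(toy["avg_unemployment_rate"], weights=toy["labor_force"])

print(unweighted.loc["2020"])
print(weighted)
```

With these made-up figures the unweighted mean sits halfway between the two states, while the weighted mean is pulled toward the larger one.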

In [18]:
us_unemployment[:5]

| | year | avg_unemployment_rate |
|---|---|---|
| 0 | 1979 | 5.504412 |
| 1 | 1980 | 6.817974 |
| 2 | 1981 | 7.273203 |
| 3 | 1982 | 9.189379 |
| 4 | 1983 | 9.154739 |


In [19]:
#write to dataset:
us_unemployment.to_csv('../data/bls_us_unemployment.csv',index=False)