# **Predicting U.S. Crime Rates**

## **CPI Data Retrieval**

In this notebook, we will access the [BLS Public Data API](https://www.bls.gov/developers/) to pull CPI data over time. 

Since the BLS publishes no CPI series for states (only for metropolitan areas), we'll use the annual average of the national CPI as a broad proxy for how expensive things are in every state.

**NOTE**: The code cells in this notebook closely mirror those in the [Unemployment Rate Retrieval notebook](unemployment_rates_retrieval.ipynb). Both datasets come from the BLS and are retrieved through the same API.


---

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

import time
from datetime import date

end_year = date.today().year - 1  # most recent complete year

---

Unfortunately, per the [BLS Public Data API docs](https://www.bls.gov/developers/api_signature_v2.htm#multiple), each call can return at most 20 years of data.

The function below makes a single API call. Because of that limit, it has to be called multiple times:
* We'll start from 1979 (which coincides with the start of the FBI data) and request CPI data in 20-year increments.
* As of 2021, covering that span takes at least 3 calls per batch of data requests.

In [2]:
headers = {'Content-type': 'application/json'}  # does not change between API calls

def request_unemployment_data(year_range, series_ids):
    """Request one batch (at most 20 years) of BLS data and return yearly averages.

    The name is carried over from the unemployment notebook; here it fetches CPI.
    """
    data = json.dumps({"seriesid": series_ids,
                       "startyear": year_range[0],
                       "endyear": year_range[1],
                       "catalog": True,
                       "annualaverage": True,
                       "registrationkey": ""  ### YOUR API KEY HERE
                       })

    # make the request
    p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    p.raise_for_status()  # fail early on HTTP errors

    # only the first requested series is used
    df = pd.DataFrame(p.json()['Results']['series'][0]['data'])
    # convert the string CPI values to floats:
    df['value'] = df['value'].astype('float')
    # average the values by year:
    df = df.groupby('year')['value'].mean().reset_index()

    return df
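
To make the parsing step concrete without an API key or network access, here is a minimal, dependency-free sketch of how a response shaped like the BLS v2 payload could be reduced to yearly averages. The `sample_response` dictionary and all of its values are made up for illustration, and `yearly_averages` is a hypothetical helper, not part of the notebook. Skipping non-monthly rows (such as an `M13` annual-average row) is an added assumption, so that they are not averaged in with the monthly values.

```python
from collections import defaultdict
from statistics import mean

# Made-up payload mimicking the shape of a BLS v2 response (values are illustrative only)
sample_response = {
    "status": "REQUEST_SUCCEEDED",
    "Results": {"series": [{
        "seriesID": "CUSR0000SA0",
        "data": [
            {"year": "2001", "period": "M02", "value": "110.0"},
            {"year": "2001", "period": "M01", "value": "100.0"},
            {"year": "2000", "period": "M13", "value": "999.0"},  # non-monthly row
            {"year": "2000", "period": "M02", "value": "90.0"},
            {"year": "2000", "period": "M01", "value": "80.0"},
        ],
    }]},
}

# Assumption: only periods M01..M12 are monthly observations
MONTHLY_PERIODS = {f"M{month:02d}" for month in range(1, 13)}

def yearly_averages(response):
    """Hypothetical helper: average the monthly values of the first series by year."""
    if response.get("status") != "REQUEST_SUCCEEDED":
        raise RuntimeError(f"BLS request failed: {response.get('message')}")
    by_year = defaultdict(list)
    for row in response["Results"]["series"][0]["data"]:
        if row["period"] in MONTHLY_PERIODS:
            # values arrive as strings, so convert before averaging
            by_year[row["year"]].append(float(row["value"]))
    return [{"year": year, "value": mean(values)}
            for year, values in sorted(by_year.items())]

averages = yearly_averages(sample_response)
print(averages)
```

Passing the resulting list to `pd.DataFrame` gives the same `year` / `value` layout that the function above returns.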

The function below splits the span from 1979 to the end year into consecutive ranges of at most 20 years, one per API call, so that the national average CPI can be collected from 1979 to the most recent complete year.

In [3]:
def define_year_ranges(end_year,start_year=1979):
    #how many years from start year through end year, inclusive:
    years = end_year - start_year + 1
    #how many splits needed (ceiling division by 20):
    splits = (years // 20) + (0 if years%20==0 else 1)
    result = []
    last_year = start_year
    for i in range(splits):
        new_last_year = last_year + min(19,end_year-last_year)
        result.append((last_year,new_last_year))
        last_year = new_last_year+1
    return result
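
An alternative way to build the ranges, sketched here with a hypothetical `chunk_years` helper (not part of the notebook), is to step through the start years directly. It treats both endpoints as inclusive, so a span that runs one year past a 20-year boundary still gets its own final range.

```python
def chunk_years(start_year, end_year, max_years=20):
    """Split [start_year, end_year] (inclusive) into ranges of at most max_years."""
    return [(first, min(first + max_years - 1, end_year))
            for first in range(start_year, end_year + 1, max_years)]

print(chunk_years(1979, 2020))  # [(1979, 1998), (1999, 2018), (2019, 2020)]
print(chunk_years(1979, 1999))  # [(1979, 1998), (1999, 1999)]
```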

The series ID passed in below, `CUSR0000SA0`, is the BLS identifier for **[Seasonally Adjusted CPI for All Urban Consumers](https://beta.bls.gov/dataViewer/view/timeseries/CUSR0000SA0)**.

In [4]:
#request each year range, then stack the results into one DataFrame
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
#2020 matches the output shown below; pass end_year for the latest complete year
national_cpi = pd.concat(
    [request_unemployment_data(date_range,["CUSR0000SA0"])
     for date_range in define_year_ranges(2020)],
    ignore_index=True)

In [5]:
national_cpi.shape

(42, 2)
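
The row count is consistent with one row per year from 1979 through 2020. A slightly stricter check, sketched below with a hypothetical `find_year_gaps` helper and made-up input, confirms that every expected year appears exactly once. It assumes the years may arrive as strings, as they do in the BLS response.

```python
def find_year_gaps(years, start_year, end_year):
    """Return (missing, duplicated) years as sorted lists of ints."""
    seen = [int(year) for year in years]
    expected = range(start_year, end_year + 1)
    missing = sorted(set(expected) - set(seen))
    duplicated = sorted({year for year in seen if seen.count(year) > 1})
    return missing, duplicated

# Illustrative input only
example_years = ["1979", "1980", "1982", "1982"]
missing, duplicated = find_year_gaps(example_years, 1979, 1982)
print(missing, duplicated)  # [1981] [1982]
```

In this notebook the call would be `find_year_gaps(national_cpi['year'], 1979, 2020)`, and both lists should come back empty.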

In [6]:
national_cpi.rename(columns={'value': 'avg_CPI'}, inplace=True)

In [7]:
national_cpi.columns

Index(['year', 'avg_CPI'], dtype='object')

In [8]:
national_cpi = national_cpi.sort_values(by=['year'])

In [9]:
national_cpi.head()

|   | year | avg_CPI |
|---|------|---------|
| 0 | 1979 | 72.583333 |
| 1 | 1980 | 82.383333 |
| 2 | 1981 | 90.933333 |
| 3 | 1982 | 96.533333 |
| 4 | 1983 | 99.583333 |


---

Write the result to a CSV file:

In [10]:
national_cpi.to_csv('../data/bls_cpi.csv',index=False)