# SI 330 Final Project
### Nicholas Ketchum, SI 330, Winter 2021

This project creates a visualization of, and computes a correlation between, unemployment insurance claims recorded by the Federal Reserve and COVID-19 cases in the United States. It uses public APIs and datasets to gather information, clean it, and store it in a database. It then makes calculations and plots a visualization using dataframes.

### Motivation
The goal is to see how new positive COVID-19 cases and initial unemployment claims (newly unemployed individuals) changed over a one-year period, and what the correlation between them is. I chose this comparison because of the public attention and controversy over how these two variables (among many others) might be correlated. Is there a correlation? What do the charts look like when initial unemployment claims are plotted next to new positive COVID-19 cases?

### Input

1. Dataset One
 - Name: Initial unemployment insurance claims (seasonally-adjusted) from the Federal Reserve
 - Size: 2831 records
 - Location: https://fred.stlouisfed.org/docs/api/fred/series_observations.html
 - Format: JSON
 - Access Method: Public API with access key
\
&nbsp;
2. Dataset Two
 - Name: The Covid Tracking Project
 - Size: 25921 records
 - Location: https://covidtracking.com/data/download
 - Format: CSV
 - Access Method: HTTP
 
### Output

1. Visualization One
 - Total initial unemployment claims (red)
 - Total new positive COVID-19 cases (blue)
 - Rolling window of total new initial unemployment claims (orange)
 - Rolling window of total new positive COVID-19 cases (sky blue)
\
&nbsp;
2. Visualization Two
 - Percent change of initial unemployment claims (red)
 - Percent change of new positive COVID-19 cases (blue)
 - Rolling window of percentage change of initial unemployment claims (orange)
 - Rolling window of percentage change of new positive COVID-19 cases (sky blue)
\
&nbsp;
3. Correlation
 - Provides a numeric correlation and an interpretation of changes in unemployment claims and COVID-19 cases.

### Instructions

1. Install these python modules:
    - matplotlib
    - numpy
    - pandas
    - requests
    - ipython-sql (provides the `sql` extension)
    - sqlalchemy
2. Run each code cell, in order, and check for the printed verification message before continuing.
3. Cells 1-11 only load modules, define functions, and populate variables. There's no output except the verification message.
4. Visualizations and other results start at item 12.
5. If execution problems arise, restart the kernel and try re-running each code cell in order. It is important that each preceding cell is executed before running the current cell.

### Design Notes

1. The database tables are only created and populated once to conserve resources. This emulates a process that would be useful for much larger datasets, where only portions of the data may be remotely retrieved and added to the database, and it avoids re-requesting the same data. Once the data is in the database, we can just use that for faster and more polite execution.

2. Daily COVID data retrieved from the public CSV could easily be resampled into weekly records, but that's not done here. Instead, each daily COVID record is stored in a dedicated table in the database. Weekly unemployment information is stored in another table. Weekly numbers for both are then selected with a JOIN on the week number and date, in a query intended to sum the COVID numbers for that week. (See cell number 7 for details.)
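
As a rough illustration of the resampling alternative mentioned in note 2, the sketch below sums daily counts into weekly totals with pandas. The counts are made up, and the `W-SAT` week anchor (weeks ending on Saturday) is an assumption chosen for the example, not something taken from the project.

```python
import pandas as pd

# Hypothetical daily counts for 14 consecutive days (values are made up).
daily = pd.DataFrame(
    {"positive_cases": [10] * 7 + [20] * 7},
    index=pd.date_range("2021-02-21", periods=14, freq="D"),
)

# Sum each week; "W-SAT" labels every bin by the Saturday that ends it.
weekly = daily.resample("W-SAT").sum()
print(weekly)
```

Each output row then represents one week of daily records, which is the shape the weekly unemployment data already has.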

---

## Program Execution

**1. Establish a database connection**

Using the sql extension and the sqlalchemy library, we will connect to the default database configuration provided by Jupyter through MichiganMads.org. This database stores the data pulled from APIs and other sources so it can be queried later.

In [None]:
# Establish a database connection.
%load_ext sql
# Recent SQLAlchemy versions only accept the "postgresql://" scheme.
%sql postgresql://jovyan:si330studentuser@localhost:5432/si330
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://jovyan:si330studentuser@localhost:5432/si330')
display('SQL loaded and database is initialized.')

**2. Import libraries and set data source variables**

First, we import all the other libraries we'll need in one spot. This makes it easier to keep track of what we've imported. For some reason, pandas requires an import of register_matplotlib_converters to properly plot some of my data.

The cell then calls register_matplotlib_converters() to register the pandas converters with matplotlib, and defines my Fed API key (for which I had to register) and the Fed API settings, which I like keeping in one spot for clarity.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from pandas.plotting import register_matplotlib_converters
import requests

# For some reason the "register_matplotlib_converters" function
# is required for plotting on this platform.
register_matplotlib_converters()

# Storing Fed api key and headers in one spot for more flexibility.
FED_API_KEY = "a964732354e30d669470642ff6b45f4c"

fed_api_settings = {
    'series_id': 'ICSA',
    'file_type': 'json',
    'sort_order': 'desc',
    'observation_start': "2020-01-07", # Trying to match the static end/start dates
    'observation_end': "2021-03-07",   # provided by the Covid tracker.
    'limit': '1300'
}

display('Python modules and settings() are loaded.')

**3. Retrieve Fed data from a public API using JSON**

This function takes the predefined settings and API key and requests roughly one year's worth of data aligning with the first year of the COVID-19 outbreak. It requests the JSON data, reads it into a dataframe, and then rebuilds (cleans) it into a new dataframe in which each date string is converted to a datetime with one extra day added. The extra day aligns these dates with the dates in the COVID-19 data set, which allows for joins.

The additional day is immaterial because these are weekly measurements. Because we're looking at overall trends in rough form, we're not concerned if data is off by one day; we're looking at chart shapes more generally.

Finally, the dataframe's index is set using the date column, and columns are named for clarity and convenience before the dataframe is returned to the caller.

In [None]:
# Get weekly numbers from a public API.
# This function is called later on, but only if the database is empty.
def getFedData(settings):
    headers = "?series_id=" + settings['series_id'] + "&api_key=" + FED_API_KEY + "&file_type=" + settings['file_type'] + "&sort_order=" + settings['sort_order'] + "&observation_start=" + settings['observation_start'] + "&observation_end=" + settings['observation_end'] + "&limit=" + str(settings['limit'])
    url = "https://api.stlouisfed.org/fred/series/observations" + headers
    df = pd.read_json(url)
    ndf = pd.DataFrame()
    week_no = 61
    for group, row in df.iterrows():
        # Covid data is one day behind Fed data.
        # To make our lives easier, let's add one day to Fed data so we can match records.
        # A one-day delta is immaterial for our overall measurement and analysis.
        ndf.loc[group, 'date'] = pd.to_datetime(df.iloc[group]['observations']['date']) + pd.Timedelta(days=1)
        ndf.loc[group, 'week_no'] = week_no
        week_no = week_no - 1
        ndf.loc[group, 'total_new_claims'] = int(df.iloc[group]['observations']['value'])
    ndf.set_index('date', inplace=True)
    return ndf

display('getFedData() is loaded.')
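
The request and cleaning steps described above can be sketched without touching the network. This is a minimal sketch, not the notebook's implementation: `build_fred_url` and `observations_to_frame` are hypothetical helper names, the API key is a placeholder, and the sample payload holds made-up values shaped like the `observations` list that getFedData() reads.

```python
from urllib.parse import urlencode

import pandas as pd

BASE_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_fred_url(settings, api_key):
    """Return the request URL with all query parameters URL-encoded."""
    params = dict(settings, api_key=api_key)
    return BASE_URL + "?" + urlencode(params)


def observations_to_frame(payload):
    """Turn a FRED-style JSON payload into a date-indexed dataframe."""
    df = pd.DataFrame(payload["observations"])
    # Shift each date forward one day, as getFedData() does.
    df["date"] = pd.to_datetime(df["date"]) + pd.Timedelta(days=1)
    # Coerce non-numeric values to NaN instead of raising.
    df["total_new_claims"] = pd.to_numeric(df["value"], errors="coerce")
    return df.set_index("date")[["total_new_claims"]]


# Placeholder key and made-up observation values, for illustration only.
url = build_fred_url({"series_id": "ICSA", "file_type": "json"}, "YOUR_API_KEY")
sample = {"observations": [
    {"date": "2021-02-27", "value": "700000"},
    {"date": "2021-03-06", "value": "710000"},
]}
frame = observations_to_frame(sample)
print(url)
print(frame)
```

Building the query string with `urlencode` avoids manual string concatenation, and converting the whole column at once avoids the row-by-row loop.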

**4. Store Fed data into the database**

This function creates a table to store Fed data if one does not already exist. The table has a date (timestamp) primary key, a claims count (integer), and a week number (integer). Then, the data is loaded from a dataframe by using the .to_sql() method.

In [None]:
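# Store Fed data in a db table.
def storeFedData(df, engine):
    # The column names must match the dataframe columns returned by getFedData().
    %sql CREATE TABLE IF NOT EXISTS fed_data(date timestamp, total_new_claims integer, week_no integer, PRIMARY KEY(date))
    # Append to the existing table; the default (if_exists='fail') raises if it exists.
    df.to_sql('fed_data', engine, if_exists='append')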

display('storeFedData() is loaded.')

**5. Retrieve COVID-19 data from a public CSV file**

This function requests a public CSV file listed at covidtracking.com. It uses the pandas read_csv method to convert the CSV data into a dataframe. Then, a new dataframe is created which stores a reformatted date and converts it into a pandas datetime.

NaNs are replaced with zeros to preserve the row count, the positive_cases column is cast to an integer type by passing int() to the apply() function, and the date column is set as the dataframe index.

Weekly resampling is left commented out in this function because the weekly pairing is handled later by the database query. The cleaned daily dataframe is then returned to the caller.

In [None]:
# Get cases from a public csv.
# Clean the data a bit by reformatting the date to match the other
# data source, and count the week numbers so we can do a GROUP BY
# week number later on in this program when we do a database query.
# Eventually we only want weekly numbers.
def getCovidData():
    url = 'https://api.covidtracking.com/v1/us/daily.csv'
    # Entire dataframe with everything.
    df = pd.read_csv(url)
    # Empty dataframe will store what we need.
    ndf = pd.DataFrame()
    day_no = 0
    week_no = 61
    ndf['week_no'] = ''
    for group, row in df.iterrows():
        # Format the date like the Fed data (YYYY-MM-DD).
        date = str(row['date'])
        year = date[0:4]
        month = date[4:6]
        day = date[6:8]
        ndf.loc[group, 'week_no'] = week_no
        # Track the weeks.
        if day_no % 7 == 0:
            week_no = week_no - 1
        day_no = day_no + 1
        date = year + '-' + month + '-' + day
        # Make date into a datetime type.
        ndf.loc[group, 'date'] = pd.to_datetime(date)
        positive_cases = row['positiveIncrease']
        ndf.loc[group, 'positive_cases'] = positive_cases
    # Replace empty (NaN) values with zero.
    ndf = ndf.replace(np.nan, 0)
    # Convert positive case column type into an int.
    ndf['positive_cases'] = ndf['positive_cases'].apply(int)
    # Index via datetime.
    ndf.set_index('date', inplace=True)
    # Weekly resampling is left disabled because we rely on database joins
    # to match unemployment claim dates to new COVID numbers.
    # ndf = ndf.resample('W').sum()
    return ndf

display('getCovidData() is loaded.')
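
The date reformatting and cleaning steps can also be done without a row loop. The following is a minimal sketch under stated assumptions: the three rows are made up, and only the two columns that getCovidData() reads (`date` and `positiveIncrease`) are included.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the two columns getCovidData() reads.
raw = pd.DataFrame({
    "date": [20210307, 20210306, 20210305],
    "positiveIncrease": [100.0, np.nan, 250.0],
})

clean = pd.DataFrame({
    # Parse the integer YYYYMMDD dates in one vectorized call.
    "date": pd.to_datetime(raw["date"].astype(str), format="%Y%m%d"),
    # Replace missing counts with zero, then cast to int.
    "positive_cases": raw["positiveIncrease"].fillna(0).astype(int),
}).set_index("date")
print(clean)
```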

**6. Store COVID-19 data into the database**

This function creates a table to store COVID-19 data if one does not already exist. The table has a date (timestamp) primary key, a positive_cases count (integer), and a week number (integer). Then, the data is loaded from a dataframe by using the .to_sql() method.

In [None]:
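# Store COVID data in a db table.
def storeCovidData(df, engine):
    %sql CREATE TABLE IF NOT EXISTS covid_data(date timestamp, positive_cases integer, week_no integer, PRIMARY KEY(date))
    # Append to the existing table; the default (if_exists='fail') raises if it exists.
    df.to_sql('covid_data', engine, if_exists='append')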

display('storeCovidData() is loaded.')

**7. Select everything stored in the database**

This function joins the fed_data and covid_data tables on the week number and date fields. Each resulting record contains a date, the number of new unemployment claims for that week, and the matching COVID-19 case count. The function returns the data as a SQL result set.

In [None]:
# Select final records in a single master query.
# This is similar to a DataFrame "merge": a RIGHT JOIN keeps every unemployment claims
# record, which is the smaller of the two data sets and has the weekly intervals
# we are interested in measuring.
# Note that the subquery groups by date and positive_cases as well as week_no, so SUM()
# covers one daily row per group; grouping by week_no alone would be needed to total a week.
def selectAllData(engine):
    # COVID data is daily but Fed data is weekly; the join pairs each Fed week with the
    # COVID row that has the same week_no and date.
    results = %sql SELECT f.date, f.total_new_claims, f.week_no, c.date, c.positive_cases FROM (SELECT date, week_no, SUM(positive_cases) AS positive_cases FROM covid_data GROUP BY date, week_no, positive_cases) c RIGHT JOIN fed_data f ON f.week_no = c.week_no AND f.date = c.date ORDER BY f.date ASC;
    return results

display('selectAllData() is loaded.')

Below is a much more readable version of the above SQL query.

In [None]:
# %%sql
# SELECT f.date, f.total_new_claims, f.week_no, c.date, c.positive_cases
# FROM (
#     SELECT date, week_no, SUM(positive_cases) AS positive_cases
#     FROM covid_data
#     GROUP BY date, week_no, positive_cases
#     ) c 
# RIGHT JOIN fed_data f 
# ON f.week_no = c.week_no AND f.date = c.date
# ORDER BY f.date ASC;
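
For comparison, here is a minimal sketch of a join in which SUM() covers every daily row in a week, because the subquery groups by week_no alone. It uses an in-memory SQLite database with made-up values rather than the project's PostgreSQL tables, and a LEFT JOIN from fed_data (equivalent here to the RIGHT JOIN above) because older SQLite versions lack RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fed_data(date TEXT PRIMARY KEY, total_new_claims INTEGER, week_no INTEGER);
    CREATE TABLE covid_data(date TEXT PRIMARY KEY, positive_cases INTEGER, week_no INTEGER);
""")
# Made-up values: two Fed weeks and three daily COVID rows.
conn.executemany("INSERT INTO fed_data VALUES (?, ?, ?)",
                 [("2021-02-28", 700000, 60), ("2021-03-07", 710000, 61)])
conn.executemany("INSERT INTO covid_data VALUES (?, ?, ?)",
                 [("2021-02-28", 10, 60), ("2021-03-01", 20, 60), ("2021-03-07", 5, 61)])

# Grouping by week_no alone lets SUM() cover every daily row in a week.
rows = conn.execute("""
    SELECT f.date, f.total_new_claims, c.positive_cases
    FROM fed_data f
    LEFT JOIN (
        SELECT week_no, SUM(positive_cases) AS positive_cases
        FROM covid_data
        GROUP BY week_no
    ) c ON f.week_no = c.week_no
    ORDER BY f.date ASC
""").fetchall()
print(rows)
conn.close()
```

Week 60 in this toy data has two daily rows (10 and 20), so its joined total is 30.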

**8. Create two dataframes, one for each dataset, from the returned database query**

This function takes the SQL result set returned from the above function, converts it into a "master" dataframe, and renames the columns from integer indexes to readable labels. Then, two new dataframes are created from the master dataframe, one for unemployment (Fed) and one for COVID-19, each indexed by its date column. Both dataframes are returned to the caller as a tuple.


Although creating a "master" dataframe could have been accomplished without a database, one was included anyway to demonstrate the concepts.

In [None]:
# This takes the master SQL query result, puts it into a DataFrame
# which is reindexed and columns renamed.
def getFinalDf(all_sql_results):
    # Put it all into a dataframe.
    all_results = pd.DataFrame(all_sql_results)
    all_results = all_results.rename(columns={0: "claim_date", 1: "total_new_claims", 2: "week_no", 3: "covid_date", 4: "new_positive_cases"})
    fedResults = all_results[['claim_date', 'total_new_claims']]
    fedResults = fedResults.set_index('claim_date')
    covidResults = all_results[['covid_date', 'new_positive_cases']]
    covidResults = covidResults.set_index('covid_date')
    return (fedResults, covidResults)

display('getFinalDf() is loaded.')

**9. Combine get/store/select data operations**

This function grabs unemployment (Fed) data and COVID-19 data from their sources using their dedicated functions and stores them in the database, but only if the tables are empty. It then calls selectAllData(), which queries and joins records from the fed_data and covid_data tables, and passes the result to getFinalDf(). Finally, it returns a tuple containing fedResults and covidResults from that database query.

In [None]:
# We only fetch and store data if the database tables are empty.
# Either way, this function then calls getFinalDf() on the joined SQL results
# and returns the two resulting dataframes as a tuple.
def getStoreSelectData(engine):
    # Get Fed data, store it into a db, and select contents.
    fed_data = %sql SELECT * FROM fed_data;
    if not len(fed_data):
        storeFedData(getFedData(fed_api_settings), engine)
    
    # Get Covid data, store it into a db, and select contents.
    covid_data = %sql SELECT * FROM covid_data;
    if not len(covid_data):
        storeCovidData(getCovidData(), engine)
    
    # Get clean and relevant records from db.
    all_sql_results = selectAllData(engine)

    # Unpack both dataframes from the returned tuple (one call instead of two).
    fedResults, covidResults = getFinalDf(all_sql_results)
    
    return (fedResults, covidResults)

display('getStoreSelectData() is loaded.')
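
The "populate only when empty" idea can be sketched in isolation. This is an illustration under stated assumptions, not the notebook's code: it uses an in-memory SQLite connection instead of PostgreSQL, and `load_once`, `fake_fetch`, and the `demo` table are hypothetical names invented for the example.

```python
import sqlite3

import pandas as pd


def load_once(conn, table, fetch):
    """Populate `table` from `fetch()` only if it has no rows yet."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table}(week_no INTEGER, value INTEGER)")
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if count == 0:
        # if_exists="append" inserts into the table created above.
        fetch().to_sql(table, conn, if_exists="append", index=False)
    return pd.read_sql_query(f"SELECT * FROM {table}", conn)


calls = []


def fake_fetch():
    # Stand-in for a remote request; records how often it is called.
    calls.append(1)
    return pd.DataFrame({"week_no": [1, 2], "value": [10, 20]})


conn = sqlite3.connect(":memory:")
first = load_once(conn, "demo", fake_fetch)
second = load_once(conn, "demo", fake_fetch)
print(len(first), len(second), len(calls))
```

The second call finds rows already in the table, so the stand-in for the remote request runs only once.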

**10. Convert a decimal into a float**

This is used for an apply function later in the program. A simpler method exists to do this, but I'd like to demonstrate an apply function that uses its own logic.

In [None]:
def makeFloat(row):
    return float(row)

display('makeFloat() is loaded.')
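
For reference, the simpler route mentioned above is likely `astype(float)`. The sketch below compares it with the apply-based conversion on made-up `Decimal` values, the type a database driver may return for numeric columns.

```python
from decimal import Decimal

import pandas as pd

# Made-up Decimal values, for illustration only.
cases = pd.Series([Decimal("10"), Decimal("25"), Decimal("40")])

via_apply = cases.apply(float)    # same effect as apply(makeFloat)
via_astype = cases.astype(float)  # the shorter built-in route

print(via_apply.dtype, via_astype.dtype)
```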

**11. Visualize the data**

This function grabs the final Fed/unemployment and COVID-19 data from the functions above and sets up a figure with two subplots, each displaying two time series. The first plot displays the raw totals of both initial unemployment claims reported by the Federal Reserve and new COVID-19 cases for each specific week. The second plot displays the percentage change for each specific week. The period covered runs from January 2020 to March 2021.

In [None]:
def visualize(engine):
    
    from matplotlib.ticker import StrMethodFormatter
    import matplotlib.ticker as plticker
    
    # Set up the dataframes from database data, querying the database only once.
    fedResults, covidResults = getStoreSelectData(engine)
    # Skip the first COVID record and copy so new columns can be added safely.
    covidResults = covidResults[1:].copy()

    # Set up the overall figure.
    fig = plt.figure(figsize = (40, 10), facecolor ="lightgrey" )
    fig.suptitle('Unemployment Claims vs. COVID-19 Cases', fontsize=12)
    
    # Set plot one colors and labels.
    red = mpatches.Patch(color='r', label='New total claims')
    orange = mpatches.Patch(color='orange', label='Rolling average')
    blue = mpatches.Patch(color='b', label='New positive cases')
    sky = mpatches.Patch(color='#87ceeb', label='Rolling average')

    # Add total claims and cases totals subplot.
    subplot = fig.add_subplot(121)
    subplot.legend(handles=[red, orange, blue, sky], title='Claims & Cases', bbox_to_anchor=(1, 1), loc='upper left')
    subplot.plot(fedResults.index, fedResults['total_new_claims'], 'r')
    subplot.plot(fedResults.index, fedResults['total_new_claims'].rolling(14).mean(), 'orange')
    subplot.plot(covidResults.index, covidResults['new_positive_cases'], 'b')
    subplot.plot(covidResults.index, covidResults['new_positive_cases'].rolling(14).mean(), '#87ceeb')
    
    # Add labels.
    subplot.set_title(label = "Total Claims and Cases", fontsize=10)
    subplot.set_xlabel(xlabel = "Date", fontsize=10)
    subplot.set_ylabel(ylabel = "Number of Claims", fontsize=10)
    
    # Manipulate tick size and establish a grid.
    plt.xticks(fontsize=6)
    plt.yticks(fontsize=8)
    loc = plticker.MultipleLocator(base=12.5) # this locator puts ticks at regular intervals
    plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
    plt.grid()

    # Calculate Fed total_new_claims percent change.
    fedResults['percent_change'] = fedResults['total_new_claims'].pct_change()
    fedResults = fedResults.replace([np.inf, -np.inf], np.nan)
    fedResults = fedResults.dropna()
    fedResults = fedResults[['percent_change']]
    fedResults['percent_change'] = fedResults['percent_change'].apply(lambda x: x * 100)
    
    # Calculate COVID-19 new_positive_cases percent change.
    covidResults['new_positive_cases'] = covidResults['new_positive_cases'].apply(makeFloat)
    covidResults['percent_change'] = covidResults['new_positive_cases'].pct_change()
    covidResults = covidResults.replace([np.inf, -np.inf], np.nan)
    covidResults = covidResults.dropna()
    covidResults = covidResults[['percent_change']]
    covidResults['percent_change'] = covidResults['percent_change'].apply(lambda x: x * 100)
    
    # Add claims and cases percentage change subplot.
    subplot = fig.add_subplot(122)
    subplot.plot(fedResults.index, fedResults['percent_change'], 'r')
    subplot.plot(fedResults.index, fedResults['percent_change'].rolling(14).mean(), 'orange')
    subplot.plot(covidResults.index, covidResults['percent_change'], 'b')
    subplot.plot(covidResults.index, covidResults['percent_change'].rolling(14).mean(), '#87ceeb')
    
    # Set plot two colors and labels.
    red = mpatches.Patch(color='r', label='Percent change in claims')
    orange = mpatches.Patch(color='orange', label='Rolling average')
    blue = mpatches.Patch(color='b', label='Percent change in positive cases')
    sky = mpatches.Patch(color='#87ceeb', label='Rolling average')
    
    # Add total claims and cases percent change subplot.
    subplot.legend(handles=[red, orange, blue, sky], title='Percent Change', bbox_to_anchor=(1, 1), loc='upper left')
    subplot.set_title(label = "Percent Change in Claims and Cases", fontsize=10)
    subplot.set_xlabel(xlabel = "Date", fontsize=10)
    subplot.set_ylabel(ylabel = "Percent Change", fontsize=10)
    
    # Manipulate tick size and establish a grid.
    plt.xticks(fontsize=6)
    plt.yticks(fontsize=8)
    loc = plticker.MultipleLocator(base=12.5) # this locator puts ticks at regular intervals
    plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
    plt.grid()
    
    # Draw plots.
    plt.draw()

display('visualize() is loaded.')
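
The percent-change and rolling-average calculations inside visualize() can be tried on a small series first. This is a minimal sketch with made-up weekly counts; the 2-period window is chosen only to keep the example short (the function above uses 14).

```python
import numpy as np
import pandas as pd

# Made-up weekly counts; the leading zero forces a division by zero.
weekly = pd.Series([0, 100, 150, 75, 150], dtype=float)

# Fractional change from one week to the next.
pct = weekly.pct_change()
# Drop the undefined first value and the infinite one, then scale to percent.
pct = pct.replace([np.inf, -np.inf], np.nan).dropna() * 100
# Two-period rolling average of the percent changes.
smooth = pct.rolling(2).mean()

print(pct.tolist())
print(smooth.tolist())
```

The first rolling value is NaN because a full window is not yet available, which is why the rolling lines in the plots start later than the raw lines.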

---

## View The Results

**12. Execute the visualization**

This is just a function call that runs EVERYTHING (except whatever is below the next cell), sets labels and colors and tick intervals, and finally displays the plots.

In [None]:
visualize(engine)

**12.1. What can we say about the above plots?**

Comparing the two charts, we see a couple of things. When COVID-19 really entered the picture in March 2020, initial unemployment claims spiked, then dropped off (perhaps because an initial claim can only be filed once), and then stabilized somewhat as the virus spread. Later, in 2021, total positive cases rose higher than in the previous year, yet initial unemployment claims did not spike again and instead more closely followed the pattern of the COVID-19 data.

Also, because initial unemployment claims are a one-time mechanism, the drop-off in initial claims does not show whether continued unemployment claims remained elevated.

**13. Calculate correlation**

This function computes the correlation between percent changes in initial unemployment claims and new positive COVID-19 cases for each week. The closer the correlation is to zero, the weaker the relationship. Positive values indicate a positive correlation, where both variables tend to increase together. Negative values indicate a negative correlation, where the variables tend to move in opposite directions.

In [None]:
def getCorrelation():
    # Query the database once and unpack both dataframes.
    fedResults, covidResults = getStoreSelectData(engine)

    fedResults['percent_change'] = fedResults['total_new_claims'].pct_change()
    fedResults = fedResults.replace([np.inf, -np.inf], np.nan)
    fedResults = fedResults.dropna()
    fedResults = fedResults[['percent_change']]

    covidResults = covidResults[1:].copy()
    covidResults['new_positive_cases'] = covidResults['new_positive_cases'].apply(makeFloat)
    covidResults['percent_change'] = covidResults['new_positive_cases'].pct_change()
    covidResults = covidResults.replace([np.inf, -np.inf], np.nan)
    covidResults = covidResults.dropna()
    covidResults = covidResults[['percent_change']]

    df = pd.DataFrame()
    df['fed'] = fedResults['percent_change'][:-1]
    df['covid'] = covidResults['percent_change']

    correlation = df['fed'].corr(df['covid'])
    
    # https://towardsdatascience.com/eveything-you-need-to-know-about-interpreting-correlations-2c485841c0b8
    if correlation > 0.9:
        print('There is a very high positive correlation of', correlation)
    elif correlation > 0.7:
        print('There is a high positive correlation of', correlation)
    elif correlation > 0.5:
        print('There is a moderate positive correlation of', correlation)
    elif correlation > 0.3:
        print('There is a low positive correlation of', correlation)
    elif correlation >= -0.3:
        print('There is a negligible correlation of', correlation)
    elif correlation >= -0.5:
        print('There is a low negative correlation of', correlation)
    elif correlation >= -0.7:
        print('There is a moderate negative correlation of', correlation)
    elif correlation >= -0.9:
        print('There is a high negative correlation of', correlation)
    else:
        print('There is a very high negative correlation of', correlation)

display('getCorrelation() is loaded.')
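
The threshold logic can also be written so that positive and negative values share one set of cut-offs. The sketch below is one possible arrangement, not the notebook's method: `interpret` is a hypothetical helper, the cut-offs are the same ones used in getCorrelation(), and the two series are made up.

```python
import pandas as pd


def interpret(r):
    """Map a correlation coefficient to a rough strength label."""
    strength = abs(r)
    if strength > 0.9:
        label = "very high"
    elif strength > 0.7:
        label = "high"
    elif strength > 0.5:
        label = "moderate"
    elif strength > 0.3:
        label = "low"
    else:
        return "negligible"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"


# Made-up series that move in exactly opposite directions.
a = pd.Series([1.0, 2.0, 3.0, 4.0])
b = pd.Series([8.0, 6.0, 4.0, 2.0])
r = a.corr(b)  # Pearson correlation by default
print(interpret(r), interpret(0.1), interpret(0.6))
```

Classifying on the absolute value first keeps the positive and negative branches from drifting out of sync.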

**14. Display correlation**

This simply calls the above function, which prints an interpretation along with the correlation as a decimal/float.

In [None]:
getCorrelation()

**14.1. What can we say about the correlation?**

Initial unemployment claims and new positive COVID-19 cases have a low correlation.

## Tests

**15. Functional testing**

This cell simply loads some data and then runs simple tests of all the functions, which are mostly testing for lengths of records. The database "store" functions assume tables are already populated with the expected data within the hard-coded time range.

An extra record exists until the selection phase, where a NaN value is dropped from one of the records.

In [None]:
fedData = getFedData(fed_api_settings)
covidData = getCovidData()
all_sql_results = selectAllData(engine)

def test_getFedData():
    assert len(getFedData(fed_api_settings)) == 61

def test_storeFedData(fedData, engine):
    results = %sql SELECT * FROM fed_data ORDER BY date ASC
    assert len(results) == 61
    
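def test_getCovidData():
    # getCovidData() returns daily records (weekly resampling is disabled),
    # so the expected length matches the 420 rows stored in covid_data.
    assert len(getCovidData()) == 420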

def test_storeCovidData(covidData, engine):
    results = %sql SELECT * FROM covid_data ORDER BY date ASC
    assert len(results) == 420

def test_selectAllData(engine):
    assert len(selectAllData(engine)) == 61

def test_getFinalDf(all_sql_results):
    assert len(getFinalDf(all_sql_results)) == 2

def test_getStoreSelectData(engine):
    fed_data = %sql SELECT * FROM fed_data;
    if len(fed_data) > 0:
        assert len(getStoreSelectData(engine)) == 2

test_getFedData()
test_storeFedData(fedData, engine)
test_storeCovidData(covidData, engine)
test_selectAllData(engine)
test_getFinalDf(all_sql_results)
test_getStoreSelectData(engine)