<div style="color:#00BFFF">

# Nowcasting Consumer Expenditure: 

<div style="color:#5F9EA0">

## Uncovering Reliable Proxies for Consumer Spending Behaviour. 

#### 1.1. Introduction: The Problem (Why)

Quarterly GDP reports are published with a delay, so they are slow to reflect changes in the economy. This affects decision-makers who rely on timely economic data. This project aims to mitigate the issue by identifying high-frequency, regularly updated data series that can act as proxies and give earlier insight into consumer expenditure patterns.

#### 1.2. Project Scope and Objectives (What)

The project's primary objective is to systematically identify, harmonise, and validate high-frequency data sources as proxies for real-time tracking of consumer expenditure in the United States. The goal is to refine these proxies to provide more immediate data on consumer spending habits, thus bridging the gap caused by the delayed reporting of official GDP figures.

#### Key Questions:

- Which high-frequency data sources can serve as accurate proxies for consumer spending?
- How can we validate these proxies against established measures of consumer expenditure?
- What techniques can we employ to ensure these proxies offer immediate and reliable insights into current consumer spending trends?
- How will we address potential discrepancies between different data sources in terms of scale, units, or reporting standards?
- Are there any unforeseen challenges in harmonizing data frequencies (monthly vs. quarterly) that could impact the accuracy of our analysis?
- How can we ensure the economic relevance of our findings, beyond statistical correlations?
- What contingency plans do we have for dealing with data anomalies or irregularities that might skew our analysis?

#### 1.3. Methodology

The methodology is designed to focus on data preparation and validation:

- **Exploratory Data Analysis (EDA)**: To understand the characteristics and quality of the high-frequency monthly indicators and their initial relationships to consumer spending.
- **Data Harmonization**: To transform and align the monthly indicators with the quarterly GDP data, using log transformations and adjustments for seasonality and rate of change.
- **Proxy Validation**: To establish a correlation with established measures of consumer spending through statistical analysis, ensuring that the proxies are reliable and relevant.
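
The proxy-validation step can be illustrated with a small correlation check. This is only a sketch under stated assumptions: the series names (`retail_sales`, `noise`) and all values are hypothetical, and Pearson correlation is just one possible validation statistic.

```python
import pandas as pd

# Hypothetical quarterly data: a target (PCE growth) and two candidate proxies.
idx = pd.period_range("2015Q1", periods=8, freq="Q")
pce_growth = pd.Series([0.5, 0.7, 0.6, 0.9, 0.4, 0.8, 1.0, 0.3], index=idx, name="PCE")
proxies = pd.DataFrame(
    {
        "retail_sales": [0.4, 0.8, 0.5, 1.0, 0.3, 0.9, 1.1, 0.2],  # tracks the target closely
        "noise": [0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.1, 0.0],       # only weakly related
    },
    index=idx,
)

# Pearson correlation of each candidate proxy with the target series.
correlations = proxies.corrwith(pce_growth)

# Rank the candidates by the absolute strength of the correlation.
ranked = correlations.reindex(correlations.abs().sort_values(ascending=False).index)
print(ranked)
```

A high correlation on its own does not establish economic relevance, so a ranking like this is best read as a first screening step.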

#### 1.4. Assumptions

**Data Quality and Relevance:** We operate under the assumption that the high-frequency data from FRED and other sources accurately reflect current economic trends and consumer sentiments. However, there is an inherent risk of data bias or inaccuracy, which could impact the reliability of our findings.

**Predictive Power and Relevance:** While we aim to identify effective proxies for consumer expenditure, there's a risk that these proxies may not fully capture the complexities of consumer behaviour or may not adapt swiftly to sudden economic shifts.

**External Factors:** The project also assumes a stable economic environment. Sudden external shocks (like global events or policy changes) could significantly affect consumer behaviour, potentially reducing the predictive accuracy of our proxies.  



#### 2. Primary Dataset Description

**Short Description:** The primary dataset is "Table 1.1.5. Gross Domestic Product" from the U.S. Bureau of Economic Analysis. It contains quarterly U.S. Gross Domestic Product (GDP) and its components in billions of dollars, seasonally adjusted at annual rates.

**Relevance:** The dataset's detailed information on U.S. GDP over several years is integral to the project's goal of nowcasting consumption. The data's granularity and time-series nature will allow for comprehensive analysis and identification of trends, making it pivotal for the project's success.

**Data frequency:** The data are reported quarterly and broken down by GDP component, which supports the analysis of economic trends and growth patterns.

**Location:** Available at [U.S. Bureau of Economic Analysis](https://apps.bea.gov/iTable/?reqid=19&step=2&isuri=1&categories=survey&_gl=1*j1lvlb*_ga*MTk0MDMyMjk0MC4xNzA1NDk1NTk4*_ga_J4698JNNFT*MTcwNTQ5NTU5OC4xLjEuMTcwNTQ5NzA2MC42MC4wLjA.#eyJhcHBpZCI6MTksInN0ZXBzIjpbMSwyLDMsM10sImRhdGEiOltbImNhdGVnb3JpZXMiLCJTdXJ2ZXkiXSxbIk5JUEFfVGFibGVfTGlzdCIsIjUiXSxbIkZpcnN0X1llYXIiLCIxOTQ3Il0sWyJMYXN0X1llYXIiLCIyMDIzIl0sWyJTY2FsZSIsIi05Il0sWyJTZXJpZXMiLCJRIl1dfQ==).

**Format:** CSV

**Access Method:** The dataset is readily available and can be easily accessed and downloaded directly from the U.S. Bureau of Economic Analysis website.


#### 3. Secondary Datasets

##### Federal Reserve Economic Data (FRED)

**Short Description:** This dataset is sourced from the Federal Reserve Bank of St. Louis's FRED macroeconomic database. It contains a variety of economic data points available at monthly intervals, with a particular focus on US GDP data. The data covers consumer spending indicators, a crucial component of the Gross Domestic Product (GDP).

**Relevance**: Complements the primary dataset with additional economic indicators, useful for cross-referencing and correlation analysis.

**Data frequency:** The monthly frequency of this dataset provides a more detailed temporal resolution than the primary dataset, which may reveal more immediate economic trends. This granularity will be useful in identifying more immediate proxies for nowcasting.

**Estimated Size**: 0.6MB

**Location**: https://research.stlouisfed.org/econ/mccracken/fred-databases/

**Format**: CSV.

**Access Method**: Direct download.

<div style="color:#5F9EA0">

## Setup Environment and import libraries

In [None]:
# Create and activate a virtual environment by running in a terminal:
# python -m venv myenv
# source myenv/bin/activate
# ! source /myenv/bin/activate

# # ------- PIP INSTALLS -------
# ! pip install --upgrade pip
# ! python3.10 -m pip install --upgrade pip
# ! pip install -r requirements.txt
# ! pip install pandas
# ! pip install matplotlib
# ! pip install seaborn
# ! pip install numpy
# ! pip install scikit-learn
# ! pip install scipy
# ! pip install statsmodels

# Render matplotlib figures inline in the notebook
%matplotlib inline

In [None]:
# import nbformat
# import os

# def extract_imports_from_notebook(notebook_path):
#     with open(notebook_path, 'r', encoding='utf-8') as file:
#         nb = nbformat.read(file, as_version=4)

#     imports = set()
#     for cell in nb['cells']:
#         if cell['cell_type'] == 'code':
#             lines = cell['source'].split('\n')
#             for line in lines:
#                 if line.startswith('import ') or line.startswith('from '):
#                     imports.add(line.split()[1].split('.')[0])

#     return imports

# # Path to Jupyter notebook
# notebook_path = './M1_NowCasting_Consumer_Exp.ipynb'
# imports = extract_imports_from_notebook(notebook_path)

# # Print out the unique imports
# print("Libraries to install:")
# for lib in imports:
#     print(lib)


In [None]:
# ------- Standard Library Imports -------
import warnings
from datetime import datetime
from pprint import pprint
from typing import List

# ------- Third-Party Library Imports -------
# Data handling and numerical operations
import pandas as pd
import numpy as np

# Utility and display modules
from IPython.display import display, HTML

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns


# Remove warnings
warnings.filterwarnings('ignore')

# Set the display options
# pd.set_option('display.max_rows', None)  
# pd.set_option('display.max_columns', None)  
# pd.set_option('display.width', None)  
# pd.set_option('display.max_colwidth', None)  

<div style="color:#5F9EA0;">

## Cleaning and Manipulation

<div style="color:#5F9EA0;">

### Load and Preprocess BEA Quarterly GDP Dataset

**Loading and Preprocessing**: Load the GDP data from a CSV file, then process the DataFrame to create a structured description column.

**Handling Missing Values**: Utilize median imputation for missing values, as it's less influenced by outliers and provides a more representative central tendency.

**Outliers and Anomalies**: Apply Interquartile Range (IQR) or Z-score analysis to identify and address outliers. This step ensures the integrity of data by minimizing the impact of extreme values.

**Data Type Standardization**: Use Python's Pandas library to standardise data formats and types across datasets. This step is crucial to ensure consistency, particularly when dealing with various formats like percentages, counts, and currencies.
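
As a minimal sketch of the median imputation described above, the example below fills each column's missing values with that column's median. The DataFrame and its column names are hypothetical and serve only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; the column names are illustrative only.
df = pd.DataFrame(
    {
        "series_a": [1.0, np.nan, 3.0, 100.0],  # 100.0 is an extreme value
        "series_b": [10.0, 20.0, np.nan, 40.0],
    }
)

# Fill each column's missing values with that column's median.
# The median of series_a ignores the NaN and is not pulled up by 100.0.
imputed = df.fillna(df.median())
print(imputed)
```

Because the median of `series_a` is taken over 1.0, 3.0 and 100.0, the gap is filled with 3.0 rather than with the much larger mean.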

In [None]:
def load_and_preprocess_gdp_data(file_path):
    """
    Loads and preprocesses the GDP data from a CSV file.

    Args:
        file_path (str): The path to the CSV file containing GDP data.

    Returns:
        pandas.DataFrame: Preprocessed GDP data.
    """

    # Load the data with specified rows to skip and number of rows to read
    df = pd.read_csv(file_path, skiprows=3, nrows=28)

    # Drop the first column (unnecessary or identifier column)
    df.drop(df.columns[0], axis=1, inplace=True)

    # Rename the first column as 'description'
    df.rename(columns={df.columns[0]: 'description'}, inplace=True)

    # Remove any characters after (and including) the "." in column names
    df.columns = df.columns.str.replace(r'\..*', '', regex=True)

    # Concatenate the column names with the first row values, handling NaNs
    df.columns = df.columns + " " + df.iloc[0].fillna('')

    # Drop the first row as it's now part of the column names
    df.drop(df.index[0], inplace=True)

    # Reset the index of the DataFrame
    df.reset_index(drop=True, inplace=True)

    # Correct any trailing space issues in the 'description' column name
    df.rename(columns=lambda x: x.strip(), inplace=True)

    return df


In [None]:
def create_structured_description(df):
    """
    Process a DataFrame to create a structured description column.

    This function takes a DataFrame with a 'description' column and adds structure to it
    based on indentation levels, indicating hierarchical relationships.

    Parameters:
        df (DataFrame): A pandas DataFrame with a column named 'description'.

    Returns:
        DataFrame: The modified DataFrame with a structured 'description' column.
    """

    # Function to determine the indentation level (number of leading spaces)
    def indentation_level(s):
        """Return the number of leading spaces in a string, indicating the indentation level."""
        return len(s) - len(s.lstrip())

    # Apply the function to find indentation levels
    df['indentation'] = df['description'].apply(indentation_level)

    # Initialize an empty list to store the new structured names
    structured_names = []
    current_parent = ""
    current_subparent = ""

    # Iterate through the DataFrame to construct the hierarchical names
    for index, row in df.iterrows():
        if row['indentation'] == 0:
            name = row['description'].strip()
            current_parent = name
        elif row['indentation'] == 4:
            name = f"{current_parent} : {row['description'].strip()}"
            current_subparent = row['description'].strip()
        elif row['indentation'] == 8:
            name = f"{current_parent} : {current_subparent} : {row['description'].strip()}"
        else:
            name = row['description'].strip()

        structured_names.append(name)

    # Assigning the structured names to the 'description' column
    df['description'] = structured_names

    # Dropping the 'indentation' column as it's no longer needed
    df.drop('indentation', axis=1, inplace=True)

    return df


In [None]:

def create_short_description(df):
    """
    Replace the 'description' column with an abbreviated 'short_description' column.

    Parameters:
        df (DataFrame): A DataFrame containing GDP data with a column 'description'.

    Returns:
        DataFrame: The modified DataFrame, with 'short_description' in place of
        'description' and the last row moved to directly after the first row.
    """
    def abbreviate_description(desc):

        # Define a mapping from full descriptions to their abbreviations
        abbreviations = {
            "Gross domestic product": "GDP",
            "Personal consumption expenditures": "PCE",
            "Gross private domestic investment": "GPDI",
            "Net exports of goods and services": "NXGS",
            "Government consumption expenditures and gross investment": "GCEGI",
        }

        # Split the description into parts and abbreviate each part
        parts = desc.split(" : ")
        abbreviated_parts = [abbreviations.get(part, part) for part in parts]

        # Join the abbreviated parts and replace spaces with underscores
        abrev_descr = "_".join(abbreviated_parts).replace(' ', '_')
        
        # remove all leading "_" characters
        abrev_descr = abrev_descr.lstrip('_')
        
        #remove leading and trailing spaces from the description column
        abrev_descr = abrev_descr.strip()
        
        return abrev_descr

    # Apply the abbreviation function to each description
    df['short_description'] = df['description'].apply(abbreviate_description)
    df['description'] = df['description'].str.lstrip(" :").str.strip()

    # Insert the new column 'short_description' right after the 'description' column
    description_index = df.columns.get_loc("description")
    df.insert(description_index + 1, 'short_description', df.pop('short_description'))
    
    #drop the 'description' column
    df.drop('description', axis=1, inplace=True)
    
    # Move the last row to directly after the first row for readability
    last_row = df.iloc[-1].copy()
    df = df.iloc[:-1]
    df = pd.concat([df.iloc[:1], last_row.to_frame().T, df.iloc[1:]]).reset_index(drop=True)

    return df


In [None]:
def transform_date_formats(df):
    # Step 1: Extract only non-date columns
    non_date_columns = df.columns[:1]  # Assuming first column is non-date column

    # Step 2: Extract and transform date columns
    date_columns = df.columns[1:]  # Date columns start from the 2nd column

    # Function to convert a quarter label (e.g. "2020 Q1") to the last date of that quarter
    def quarter_to_date(q):
        year, quarter = q.split(' Q')
        quarter_ends = {'1': '03-31', '2': '06-30', '3': '09-30', '4': '12-31'}
        # Fail loudly on an unexpected label instead of silently returning None
        if quarter not in quarter_ends:
            raise ValueError(f"Unrecognised quarter label: {q!r}")
        return f"{int(year)}-{quarter_ends[quarter]}"

    # Apply this function to each of the date columns
    transformed_date_columns = [quarter_to_date(col) for col in date_columns]

    # Step 3: Combine the columns back together
    df.columns = list(non_date_columns) + transformed_date_columns

    # Transpose the dataset for easier manipulation (columns become rows)
    df = df.set_index('short_description').transpose()

    # Converting the index to datetime
    df.index = pd.to_datetime(df.index)

    df.index.freq = 'Q'

    # Convert all columns to numeric
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    return df


In [None]:
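def remove_outliers(data):
    """
    Replaces outliers in a DataFrame with NaN using the 1.5 * IQR rule.

    Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are set to NaN.
    The input DataFrame is modified in place.

    Returns:
        tuple: The DataFrame with outliers replaced, and a dict mapping each
        column name to a DataFrame of the values that were removed.
    """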
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Define the mask using the typical IQR criterion
    mask = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))

    # Replace extreme values with NaN and store the corresponding values
    extreme_values = {}
    for column in data.columns:
        extreme_values[column] = data[column][mask[column]].dropna().reset_index()

    data[mask] = np.nan
    return data, extreme_values


In [None]:
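def load_and_preprocess_BEA(file_path):
    """
    Runs the full BEA preprocessing pipeline: loads the GDP CSV file, builds the
    short descriptions, keeps only the PCE rows, converts the quarter labels to
    dates, and replaces outliers with NaN.
    """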
    df = load_and_preprocess_gdp_data(file_path)

    df = create_structured_description(df)
    df = create_short_description(df)

    # extract only PCE data
    df = df[df['short_description'].str.contains('PCE')]

    # Transform the date formats and remove outliers
    df = transform_date_formats(df)
    df, extreme_values = remove_outliers(df)

    # Print the column names and values where outliers were found
    for column, values_df in extreme_values.items():
        if not values_df.empty:
            print(f"Extreme values for {column}:")
            print(values_df)
        else:
            print(f"No extreme values for {column}")
            
    return df, extreme_values

file_path = './data/bea/bea_usgdp.csv'

bea_pce, extreme_values = load_and_preprocess_BEA(file_path)

In [None]:
bea_pce.head()

<div style="color:#5F9EA0;padding: 5px;">

### Load and Preprocess FRED Monthly Dataset

<div style="color:#5F9EA0;padding: 5px;">

##### Loading the FRED data

The `load_fredmd_data` function, below, performs the following actions for the FRED-MD dataset:

1. Based on the `vintage` argument, it downloads a particular vintage of the dataset from the base URL https://files.stlouisfed.org/files/htdocs/fred-md into the `fred_orig` variable.
2. Extracts the first row, which holds the code describing which transformation to apply to each series, into `transform_info`.
3. Extracts the observation date (from the "sasdate" column) and uses it as the index of the dataset.

The function does not apply the transformations or remove outliers itself; both are handled in later steps.

In [None]:
def load_fredmd_data(vintage):
    """
    Loads and processes the FRED-MD data.
    """
    # Define the base URL for the FRED-MD dataset
    base_url = 'https://files.stlouisfed.org/files/htdocs/fred-md'

    # Load the dataset for the specified 'vintage', dropping rows that are entirely NA
    fred_orig = pd.read_csv(f'{base_url}/monthly/{vintage}.csv').dropna(how='all')

    # Extract transformation codes (second column onwards) from the first row
    transform_info = fred_orig.iloc[0, 1:]

    # Drop the first row (containing transformation info) from the dataset
    fred_orig = fred_orig.iloc[1:]

    # Convert 'sasdate' column to a PeriodIndex with monthly frequency for time-series analysis
    fred_orig.index = pd.PeriodIndex(fred_orig.sasdate.tolist(), freq='M')

    # Remove the 'sasdate' column as it's now set as the index
    fred_orig.drop('sasdate', axis=1, inplace=True)

    # Return the processed data and the transformation information
    return fred_orig, transform_info

# Load data for the current vintage and unpack into original data and transformation info
fred_orig, transform_info = load_fredmd_data("current")


<div style="color:#5F9EA0;padding: 5px;">

**Mapping FRED indices to Economic Data groups**

*Explanation*

In this section, we import and organize the definitions of the economic variables.
These definitions are loaded from a CSV file corresponding to the FRED-MD database.
Mapping each series code to its description makes every variable in the dataset easy to identify, which is essential for accurate analysis and interpretation of the data.

In [None]:
# Function for Column Name Mapping
def map_column_names(data, Fredmd_defn):
    """
    Maps FRED-MD column names to their descriptions.
    """

    # Set the 'fred' column as the index of the definitions DataFrame
    Fredmd_defn.index = Fredmd_defn.fred

    # Filter the definitions to include only those variables present in the data columns
    Fredmd_defn = Fredmd_defn.loc[data.columns.intersection(Fredmd_defn.fred), :]

    # Create a dictionary mapping FRED-MD variable names to their descriptions
    map_dict = Fredmd_defn['description'].to_dict()

    # Replace the names of columns in the dataset with the descriptions from the map
    # (select with a list of keys, which is the safest form of column indexing)
    return data[list(map_dict)].rename(columns=map_dict)

# Map column names for fred_original 

column_defn_file = './data/FRED/FRED_Definitions_Mapping/fredmd_definitions.csv'
Fredmd_defn = pd.read_csv(column_defn_file, encoding_errors='ignore')

fred_orig = map_column_names(fred_orig, Fredmd_defn)

In [None]:
fred_orig.head(2)

In [None]:
Fredmd_defn.head(2)

In [None]:
def transform_codes_function(transform_info, Fredmd_defn):
    """
    Transforms the provided information into a DataFrame and renames the columns.
    Returns:DataFrame: Transformed and renamed DataFrame.
    """
    # Convert transform_info to a DataFrame and transpose it
    transformed_df = transform_info.to_frame().T

    # Map the column names using provided definitions
    transformed_df = map_column_names(transformed_df, Fredmd_defn).T
    
    return transformed_df

# Example usage
transformed_codes = transform_codes_function(transform_info, Fredmd_defn)
transformed_codes = transformed_codes.T

Following the methodology outlined in McCracken and Ng (2016), we remove outliers from the dataset.

- Outliers are defined as observations that deviate substantially from the centre of the series, specifically those that lie more than 10 times the interquartile range (IQR) away from the series median.
- During the first half of 2020, however, many series contain extreme observations. These values are likely to carry valuable information about real PCE in 2020, so the intention is to apply the outlier removal only to the period from January 1959 to December 2019.
- The removal is carried out by the `remove_outliers` function defined earlier, which identifies extreme values and replaces them with NaN (missing values). It also records each removed value together with its date. As written, that function uses 1.5 × IQR fences around the quartiles and is applied to the full sample, so it is stricter than the rule described here.

The next cell calls `remove_outliers`; the code that prints the extreme values and their dates is left commented out.
https://playfairdata.com/3-creative-ways-to-visualize-outliers-in-tableau/
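
For comparison, here is a minimal sketch of a 10 × IQR rule restricted to a date window. It is not the notebook's `remove_outliers` function: the name `remove_outliers_mccracken_ng`, its parameters, and the toy data are hypothetical. Centring on the median and computing the statistics inside the window are both assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def remove_outliers_mccracken_ng(data, start="1959-01", end="2019-12", threshold=10.0):
    """Set to NaN values more than `threshold` IQRs from the column median,
    looking only at observations between `start` and `end` (inclusive)."""
    window = data.loc[start:end]
    median = window.median()
    iqr = window.quantile(0.75) - window.quantile(0.25)
    # Flag extreme values inside the window only.
    in_window = (window - median).abs() > threshold * iqr
    # Extend the mask to the full sample; rows outside the window are never flagged.
    mask = in_window.reindex(data.index, fill_value=False)
    return data.mask(mask), mask


# Toy monthly series: one spike inside the window, one after it.
idx = pd.period_range("2019-01", periods=24, freq="M")
values = np.tile([1.0, 2.0], 12)
values[5] = 50.0   # 2019-06, inside the window
values[14] = 80.0  # 2020-03, outside the window
toy = pd.DataFrame({"x": values}, index=idx)

cleaned, mask = remove_outliers_mccracken_ng(toy)
print(cleaned.loc["2019-05":"2019-07"])
print(cleaned.loc["2020-02":"2020-04"])
```

In this toy example the 2019 spike is replaced with NaN, while the 2020 spike is kept because it falls outside the window.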

In [None]:
# Call the function (as defined above, it is applied to the full sample)
fred_orig, extreme_values = remove_outliers(fred_orig)

# Print the column names and values where outliers were found
# for column, values_df in extreme_values.items():
#     if not values_df.empty:
#         print(f"Extreme values for {column}:")
#         print(values_df)

Below, we get the groups for each series from the definition files above, and then show how many of the series that we'll be using fall into each of the groups.

We'll also re-order the series by group, to make it easier to interpret the results.

Since we're including the quarterly real GDP variable in our analysis, we need to assign it to one of the groups in the monthly dataset. It fits best in the "Output and income" group.
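
The grouping step described above could look roughly like the sketch below. It assumes the definitions table has a `group` column alongside `description`; that column name, the group labels, and the toy data are assumptions made for illustration and are not taken from the definitions file.

```python
import pandas as pd

# Hypothetical definitions table; the 'group' column is an assumption.
defn = pd.DataFrame(
    {
        "description": [
            "Real Personal Income",
            "IP Index",
            "Civilian Unemployment Rate",
            "Housing Starts",
        ],
        "group": ["Output and income", "Output and income", "Labor market", "Housing"],
    }
)

# Hypothetical dataset whose columns are already mapped to descriptions.
data = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    columns=["Housing Starts", "IP Index", "Civilian Unemployment Rate", "Real Personal Income"],
)

# Number of series in each group, counting only series present in the data.
group_counts = defn.loc[defn["description"].isin(data.columns), "group"].value_counts()
print(group_counts)

# Reorder the columns so that series from the same group sit next to each other.
ordered = defn.sort_values("group", kind="stable")["description"]
data_by_group = data[[name for name in ordered if name in data.columns]]
print(list(data_by_group.columns))
```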

<div style="color:#5F9EA0">

#### Saving Data

In [None]:
# Saving Data
fred_orig.to_csv("./results/monthly.csv")
bea_pce.to_csv("./results/pce.csv")

<div style="color:#5F9EA0">

### Data Harmonization and Transformation

##### Filter the FRED and BEA PCE data for a set date range


In [None]:
def filter_data_by_year(fred_orig, bea_pce, year=2000):
    """
    Keeps observations from the specified year onwards (inclusive) in both the
    FRED and BEA PCE datasets.
    """
    fred_filtered = fred_orig[fred_orig.index.year >= year]
    bea_pce_filtered = bea_pce[bea_pce.index.year >= year]

    return fred_filtered, bea_pce_filtered


##### Monthly Rate of Change: 

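For indices that are better represented through changes (e.g., stock indices, employment rates), calculate the month-over-month rate of change. This helps to highlight immediate shifts in economic activity. The function below uses simple percentage changes (`pct_change`) on the untransformed levels; the log transformation is handled separately in a later step.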
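
To show how a log-based rate of change relates to a simple percentage change, the sketch below computes both for a hypothetical index (the series name and values are made up). For small movements the two are close, and log differences have the convenient property that they add up over time.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly index levels.
idx = pd.period_range("2023-01", periods=4, freq="M")
level = pd.Series([100.0, 102.0, 101.0, 104.0], index=idx, name="index_level")

# Month-over-month change after a log transformation (log first difference).
log_change = np.log(level).diff()

# Simple month-over-month percentage change, for comparison.
pct_change = level.pct_change()

print(pd.DataFrame({"log_change": log_change, "pct_change": pct_change}))

# Log differences sum to the log growth over the whole span.
print(log_change.sum(), np.log(level.iloc[-1] / level.iloc[0]))
```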


In [None]:
def calculate_rate_of_change(fred_data, bea_pce_data):
    """
    Calculates the rate of change for FRED and BEA PCE datasets.
    """
    # Calculate the month-over-month rate of change for the FRED dataset
    fred_rate_of_change = fred_data.pct_change()

    # Calculate the quarter-over-quarter rate of change for the PCE dataset
    pce_rate_of_change = bea_pce_data.pct_change()

    return fred_rate_of_change, pce_rate_of_change


##### Frequency Alignment: 
- Transform the monthly economic indices from FRED to a quarterly format to align with the BEA’s quarterly GDP data. Calculate the sum or average (as appropriate) of monthly values within each quarter. 
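
The choice between sum and average can be sketched on toy data as follows. The column names `flow` and `rate` and their values are hypothetical, and grouping by the quarter of each month is used here as an alternative to `resample`, chosen only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical monthly data: a flow variable and a rate-of-change variable.
idx = pd.period_range("2023-01", periods=6, freq="M")
monthly = pd.DataFrame(
    {
        "flow": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        "rate": [0.01, 0.02, 0.03, 0.00, 0.01, 0.02],
    },
    index=idx,
)

# Map every month to the quarter it belongs to.
quarter = monthly.index.asfreq("Q")

# Sum the flow within each quarter; average the rate within each quarter.
quarterly = monthly.groupby(quarter).agg({"flow": "sum", "rate": "mean"})
print(quarterly)
```

Flows such as monthly totals are naturally summed, while rates and index levels are usually averaged, which is why the aggregation is chosen per column.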


In [None]:
def frequency_allignment(fred_rate_of_change, pce_rate_of_change):
    """
    Transform the monthly economic indices from FRED to a quarterly format to align with the BEA’s quarterly GDP data.
    """

    # Convert the monthly rate of change data to quarterly using the average as the aggregation method
    fred_alligned = fred_rate_of_change.resample('Q').mean()

    # Convert DateTimeIndex to PeriodIndex with quarterly frequency for PCE dataset
    pce_rate_of_change.index = pce_rate_of_change.index.to_period('Q')
    pce_alligned = pce_rate_of_change

    return fred_alligned, pce_alligned

In [None]:
# First, filter the data
fred_filtered, bea_pce_filtered = filter_data_by_year(fred_orig, bea_pce, 2000)

# Then, calculate the rate of change
fred_rate_of_change, pce_rate_of_change = calculate_rate_of_change(fred_filtered, bea_pce_filtered)

# Then, align the date frequencies
fred_alligned, pce_alligned = frequency_allignment(fred_rate_of_change, pce_rate_of_change)

In [None]:
#pce_rate_of_change.head()

In [None]:
#fred_rate_of_change.head()

##### Inspect PCE and FRED distributions (variance, skewness, kurtosis) for possible transformations

In [None]:
def plot_time_series_with_iqr_and_extended_range(data, column):
    # Ensure the index is in datetime format for matplotlib to plot correctly
    datetime_index = pd.to_datetime(data.index.to_timestamp())

    # Calculate statistics
    median = data[column].median()
    std = data[column].std()
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_whisker = median - 2.698 * std
    upper_whisker = median + 2.698 * std
    
    # Plot the time series line graph
    plt.figure(figsize=(14, 4))
    plt.plot(datetime_index, data[column], marker='o', markersize=3, color='blue', linewidth=1.5, label=column)

    # Shade the IQR
    plt.fill_between(datetime_index, Q1, Q3, color='grey', alpha=0.3, label='IQR')
    
    # Shade the extended range
    plt.fill_between(datetime_index, lower_whisker, upper_whisker, color='lightgrey', alpha=0.2, label='Extended Range')
    
    # Mark potential outliers (convert the PeriodIndex to timestamps to match the x-axis)
    outliers = data[column][(data[column] < lower_whisker) | (data[column] > upper_whisker)]
    plt.scatter(outliers.index.to_timestamp(), outliers, color='red', zorder=5, label='Outliers')

    # Add median line
    plt.axhline(median, color='green', linestyle='--', linewidth=2, label='Median')
    
    # add upper and lower whiskers lines
    plt.axhline(upper_whisker, color='grey', linestyle='--', linewidth=1, label='Upper Whisker')
    plt.axhline(lower_whisker, color='grey', linestyle='--', linewidth=1, label='Lower Whisker')

    # Annotate the median and quartiles
    # plt.text(datetime_index[0], median, ' Median', va='center', ha='right', backgroundcolor='w')
    # plt.text(datetime_index[0], Q1, ' Q1', va='center', ha='right', backgroundcolor='w')
    # plt.text(datetime_index[0], Q3, ' Q3', va='center', ha='right', backgroundcolor='w')
    # plt.text(datetime_index[0], upper_whisker, ' Upper Whisker', va='center', ha='right') #, backgroundcolor='w')
    # plt.text(datetime_index[0], lower_whisker, ' Lower Whisker', va='center', ha='right', backgroundcolor='w')

    # Add labels and legend
    plt.xlabel('Time')
    plt.ylabel(column)
    plt.title(f'Time Series with IQR and Extended Range for {column}')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.grid(False)

    # Show plot
    plt.show()

# Call the function to plot the chart
for col in pce_alligned.columns:
    plot_time_series_with_iqr_and_extended_range(pce_alligned, col)

In [None]:
def calculate_distribution_stats(df):
    """
    Calculates skewness, kurtosis, and variance for each column, and suggests a
    transformation and a visualization based on them.
    """
    
    stats_df = pd.DataFrame(index=df.columns, 
                            columns=['Skewness', 'Kurtosis', 'Variance', 
                                     'Interpretation', 'Transformation', 'Visualization'])

    for column in df.columns:
        stats_df.at[column, 'Skewness'] = df[column].skew()
        stats_df.at[column, 'Kurtosis'] = df[column].kurtosis()
        stats_df.at[column, 'Variance'] = df[column].var()

        # Interpretation of skewness and kurtosis
        skew = stats_df.at[column, 'Skewness']
        kurt = stats_df.at[column, 'Kurtosis']
        transformation = "None"
        visualization = "Histogram or Boxplot"

        if np.abs(skew) < 0.5:
            interpretation = 'Fairly Symmetrical'
            if df[column].min() >= 0:
                transformation = 'Log'
        elif skew >= 0.5:
            interpretation = 'Right Skewed'
            transformation = 'Square Root or Log'
        elif skew <= -0.5:
            interpretation = 'Left Skewed'
            transformation = 'Square or Cube'

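        # Note: pandas' kurtosis() returns excess kurtosis (normal distribution = 0),
        # so this threshold is a conservative test for heavy tails
        if kurt > 3: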
            interpretation += ", Heavy Tails"
            visualization = "Boxplot for Outliers"

        stats_df.at[column, 'Interpretation'] = interpretation
        stats_df.at[column, 'Transformation'] = transformation
        stats_df.at[column, 'Visualization'] = visualization

    # Sort the DataFrame based on the absolute skewness
    stats_df['Absolute Skewness'] = stats_df['Skewness'].abs()
    sorted_stats = stats_df.sort_values(by='Absolute Skewness', ascending=False)

    return sorted_stats.drop('Absolute Skewness', axis=1)

# Compute the distribution statistics for the FRED rate-of-change data
distribution_stats = calculate_distribution_stats(fred_rate_of_change)
distribution_stats


In [None]:
# import matplotlib.pyplot as plt
# import seaborn as sns
# import math

# # Function to plot small multiples of histograms for a selected number of columns from a DataFrame
# def plot_distributions(df, num_columns=10):
#     num_rows = math.ceil(len(df.columns) / num_columns)
#     plt.figure(figsize=(20, 3 * num_rows))
    
#     for i, column in enumerate(df.columns):
#         plt.subplot(num_rows, num_columns, i + 1)
#         sns.histplot(df[column], kde=True, bins=30)
#         plt.title(column,fontsize = 8)
#         plt.ylabel('Frequency', fontsize=8)
#         plt.xlabel('')
    
#     plt.tight_layout()
#     plt.show()

# # Plotting small multiples of histograms for all columns in your DataFrame
# plot_distributions(pce_rate_of_change, num_columns=10)
# plot_distributions(fred_quarterly_rate_of_change)


<div style="color:#DC143C">

##### Log Transformation for Monthly Data: IN PROGRESS
- Implement logarithmic transformations to stabilize the variance of monthly series that exhibit exponential growth or large fluctuations. This step is particularly relevant for the FRED data, since FRED-MD supplies a transformation code for each series.

In [None]:
# def transform(column, transformation_code):
#     """
#     Applies the specified transformation to a Pandas Series.
#     Transformation Codes from FRED suggested Description:
#     1. No Transformation, 2. First Difference, 3. Second Difference,
#     4. Log Transformation, 5. Log First Difference, 6. Log Second Difference,
#     7. Exact Percent Change
#     """
#     # Multiplier for quarterly data; if data is quarterly, multiply by 4, else 1
#     mult = 4 if column.index.freqstr[0] == 'Q' else 1

#     if transformation_code == 1:
#         # No transformation, return the column as is
#         return column
#     if transformation_code == 2:
#         # First Difference: Subtract the previous value from each element
#         # Useful for converting a series of levels into a series of changes
#         return column.diff()
#     if transformation_code == 3:
#         # Second Difference: Apply first difference twice
#         # Useful when first difference is insufficient to achieve stationarity
#         return column.diff().diff()
#     if transformation_code == 4:
#         # Log Transformation: Apply natural logarithm
#         # Useful for data with exponential growth patterns
#         return np.log(column)
#     if transformation_code == 5:
#         # Log First Difference: Apply log transformation, then first difference
#         # Multiplied by 100 to express it as an approximate percentage change (annualised when mult = 4)
#         return np.log(column).diff() * 100 * mult
#     if transformation_code == 6:
#         # Log Second Difference: Apply log transformation, then second difference
#         # Measures the change in the log growth rate (acceleration), scaled as in Code 5
#         return np.log(column).diff().diff() * 100 * mult
#     if transformation_code == 7:
#         # Exact Percent Change: Calculate the percentage change from one period to the next
#         # Useful for directly understanding growth rates
#         return ((column / column.shift(1))**mult - 1.0) * 100

# # Transformation Code 5 for 'PCE', and Code 6 for the rest
# pce_df_transformed = pce_rate_of_change.apply(lambda col: transform(col, 5 if col.name == 'PCE' else 6))
# # Apply log transformations using original column names according to FRED guidelines
# # fred_log_transform = fred_quarterly_rate_of_change.apply(lambda col: transform(col, transform_info[col.name][0]))
# pce_df_transformed

In [None]:
# fred_rate_of_change.head(2)

In [None]:
# # Create an empty DataFrame to store the transformed data
# fred_log_transform = pd.DataFrame(index=fred_rate_of_change.index)

# # Iterate through each column in fred_rate_of_change
# for column_name in fred_rate_of_change.columns:
#     # Fetch the transformation code for the current column from transformed_codes
#     transformation_code = transformed_codes.at[0, column_name]

#     # Apply the transformation using the transform function
#     transformed_column = transform(fred_rate_of_change[column_name], transformation_code)

#     # Store the transformed column in the new DataFrame
#     fred_log_transform[column_name] = transformed_column

# fred_log_transform.head()



##### **Seasonal Adjustments**: 
- Adjust high-frequency data for seasonality, if necessary, to isolate the core economic trends from regular seasonal patterns. This step will make the data more representative of general economic behaviours, irrespective of seasonal influences.

##### **Quarterly Integration**: 
- Integrate the monthly indices into the quarterly GDP data framework. For BEA's GDP data, represent them as absolute figures or calculate the quarter-over-quarter rate of change if it aligns better with our analysis objectives.

##### **Final Aggregation and Comparison**: 
- Ensure that the final format of both datasets (quarterly GDP and monthly indices) is compatible for direct comparison. This could involve representing both datasets as rates of change or absolute figures based on what is most meaningful for the analysis.
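
As a minimal sketch of one way to line up the two frequencies: the monthly series is averaged within each calendar quarter, and both series are then expressed as quarter-over-quarter percentage changes. The names `monthly_indicator` and `quarterly_pce` and all values are made-up placeholders, and averaging is an assumption (a flow variable might be summed instead).

```python
import pandas as pd

# Hypothetical monthly indicator (illustrative values only)
monthly_indicator = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 104.0, 105.0],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
    name="indicator",
)

# Hypothetical quarterly spending series on a quarterly PeriodIndex
quarterly_pce = pd.Series(
    [200.0, 206.0],
    index=pd.period_range("2020Q1", periods=2, freq="Q"),
    name="PCE",
)

# Average the monthly observations within each calendar quarter
indicator_quarterly = monthly_indicator.groupby(
    monthly_indicator.index.to_period("Q")
).mean()

# Align both series on the quarterly index
combined = pd.concat([quarterly_pce, indicator_quarterly], axis=1)

# Quarter-over-quarter percentage change (first row is NaN by construction)
combined_qoq = (combined / combined.shift(1) - 1) * 100
```

Keeping both series on the same quarterly `PeriodIndex` means the later comparison steps can work column by column without further date handling.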

### 4.3 Data Integration and Quality Assurance

**Data Integration**: We will merge various datasets into a unified framework using pandas, ensuring seamless integration and compatibility. This step is vital for consolidating different economic indicators into a single, comprehensive analysis.

**Final Anomaly Detection and Correction**: We will use statistical methods to detect and correct anomalies, so that the analysis rests on accurate and representative data, free from distortions that could lead to erroneous conclusions.

**Consistency Checks**: Conducting thorough checks for data consistency, especially when integrating diverse data sources, is essential to validate the reliability and accuracy of our findings.
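
The integration and consistency steps might look roughly like the sketch below. The frames `pce` and `proxies`, their column names, and their values are hypothetical, and the three checks shown are only an assumed starting set.

```python
import pandas as pd

# Hypothetical quarterly frames (names and values are illustrative only)
quarters = pd.period_range("2021Q1", periods=3, freq="Q")
pce = pd.DataFrame({"PCE": [1.2, 0.8, 1.5]}, index=quarters)
proxies = pd.DataFrame(
    {"retail_sales": [1.0, 0.9, None], "card_spend": [1.4, 0.7, 1.6]},
    index=quarters,
)

# Inner join keeps only the quarters present in both sources
merged = pce.join(proxies, how="inner")

# Basic consistency checks on the merged frame
checks = {
    "unique_index": bool(merged.index.is_unique),
    "sorted_index": bool(merged.index.is_monotonic_increasing),
    "missing_per_column": {k: int(v) for k, v in merged.isna().sum().items()},
}
```

An inner join is the conservative choice here: it never invents quarters, and the missing-value count then shows which proxies still need attention.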

### 4.4 Advanced Data Handling and Analysis

**Standardisation of Growth Rates**: Standardising growth rates lets us compare different economic indicators on a common scale, which makes comparisons across indicators more meaningful.
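
One common reading of "a common scale" is the z-score, sketched below. The column names and numbers are placeholders, and z-scoring is an assumption rather than a method the plan specifies.

```python
import pandas as pd

# Hypothetical growth rates on very different scales (illustrative values)
growth = pd.DataFrame(
    {"PCE": [1.0, 2.0, 3.0], "retail_sales": [10.0, 30.0, 20.0]}
)

# z-score: subtract each column's mean and divide by its standard deviation
standardised = (growth - growth.mean()) / growth.std(ddof=0)
```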

**Stationarity Assessment**: Tests such as the Augmented Dickey-Fuller (ADF) test check whether each time series is stationary and therefore suitable for modelling and forecasting, since many statistical models require stationarity for valid results.

**Addressing Non-Stationarity**: Techniques such as differencing or transformation will be applied to achieve stationarity, which is crucial for the accuracy and reliability of our predictive models and correlation analysis. 



<div style="color:#5F9EA0;padding: 5px;">

## 5. Analysis

### 5.2 Exploratory Data Analysis (EDA)

Performed early in the project to get an overview of the data's characteristics. This step is crucial for identifying the most relevant variables for analysis, understanding the data's basic structure, and ensuring that hypotheses are grounded in both statistical findings and economic logic.

- **Technique**: Using statistical tools to summarise the data, visualising distributions with histograms, identifying correlations with scatter plots, and detecting patterns and outliers with box plots.
- **Objective**: To gain an initial understanding of data trends, outliers, and correlations, and to identify any anomalies or irregularities that may influence further analysis. We will then draw on economic theory to hypothesise potential relationships between variables.
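
The outlier part of this step can be sketched numerically with the same rule a box plot uses for its whiskers. The function name `flag_outliers_iqr`, the sample data, and the multiplier `k=1.5` are illustrative assumptions.

```python
import pandas as pd

def flag_outliers_iqr(df, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], column by column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    below = df.lt(q1 - k * iqr, axis="columns")
    above = df.gt(q3 + k * iqr, axis="columns")
    return below | above

# Hypothetical indicator with one obvious spike
sample = pd.DataFrame({"indicator": [1.0, 1.1, 0.9, 1.2, 1.0, 9.0]})

summary = sample.describe()           # count, mean, std, quartiles
outliers = flag_outliers_iqr(sample)  # boolean mask, True where flagged
```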



### 5.3 Seasonality Adjustment Analysis

Conducted post-EDA to refine the data for more accurate correlation analysis. Seasonality adjustment is essential for preventing seasonal patterns from distorting the true economic trends.

- **Technique**: Applying time-series decomposition methods to separate the data into trend, seasonal, and residual components and then adjusting for these seasonal effects.
- **Objective**: To accurately capture the underlying trends in consumer spending by removing repetitive seasonal patterns, which are regular but not necessarily related to the economic indicators of interest. *While crucial, we have to ensure this doesn't lead to an overly complex focus on time-series analysis techniques unless they are directly relevant to identifying proxies.*
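
To keep the idea concrete without heavy machinery, here is a simplified classical additive decomposition for monthly data in plain pandas. It is a stand-in for dedicated tools such as STL or X-13ARIMA-SEATS, not a replacement for them. The function name and the simulated series are hypothetical.

```python
import numpy as np
import pandas as pd

def seasonally_adjust_monthly(series):
    """Classical additive decomposition of a monthly series with a DatetimeIndex.

    Returns (trend, seasonal, adjusted), where adjusted = series - seasonal.
    """
    n = len(series)
    # Centred 2x12 moving average: half weight on the two end months
    weights = np.r_[0.5, np.ones(11), 0.5] / 12.0
    trend = pd.Series(np.nan, index=series.index)
    trend.iloc[6:n - 6] = np.convolve(series.to_numpy(), weights, mode="valid")

    # Average the detrended values by calendar month, then centre them on zero
    detrended = series - trend
    months = series.index.month.to_numpy()
    monthly_effect = detrended.groupby(months).mean()
    monthly_effect = monthly_effect - monthly_effect.mean()

    seasonal = pd.Series(monthly_effect.reindex(months).to_numpy(), index=series.index)
    return trend, seasonal, series - seasonal

# Simulated series: linear trend plus a fixed monthly pattern (not project data)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
true_trend = 100 + 0.5 * np.arange(48)
true_seasonal = 5 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)
series = pd.Series(true_trend + true_seasonal, index=idx)

trend, seasonal, adjusted = seasonally_adjust_monthly(series)
```

The trend estimate is undefined for the first and last six months, which is one reason to prefer officially adjusted series where they exist.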



### 5.4 Correlation and Proxy Validation Analysis

Implemented after the seasonality adjustments so that the relationships analysed and the proxies identified are not driven by seasonal fluctuations, and to confirm that the identified relationships are economically plausible as well as statistically significant.

- **Technique**: Calculating Pearson or Spearman correlation coefficients to quantify the strength and direction of the relationship between different variables. Scatter plots will be used for a more nuanced view of these relationships.
- **Objective**: To identify which monthly indicators from the high-frequency dataset show a strong and statistically significant correlation with quarterly consumer spending figures. Economic theory will be applied to interpret these correlations, ensuring they align with established economic principles and behaviors.
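
A minimal sketch of the screening step, assuming `scipy` is available and that the indicators have already been aligned to the quarterly frequency. The data are simulated and the proxy names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data: one proxy built to track spending, one pure noise
rng = np.random.default_rng(42)
spending = pd.Series(rng.normal(size=40), name="PCE")
candidates = pd.DataFrame(
    {
        "related_proxy": 0.8 * spending + rng.normal(scale=0.3, size=40),
        "unrelated_proxy": rng.normal(size=40),
    }
)

# Pearson measures linear association; Spearman measures monotonic association
rows = []
for name, proxy in candidates.items():
    pearson_r, pearson_p = stats.pearsonr(proxy, spending)
    spearman_rho, spearman_p = stats.spearmanr(proxy, spending)
    rows.append(
        {"proxy": name, "pearson_r": pearson_r, "pearson_p": pearson_p,
         "spearman_rho": spearman_rho, "spearman_p": spearman_p}
    )

results = pd.DataFrame(rows).set_index("proxy")
```

With many candidate indicators, some will pass a 5% threshold by chance alone, which is where the economic-plausibility check earns its place.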



### 5.5 Comparative and Temporal Analysis

Undertaken after correlation analysis to delve deeper into the dynamics of the relationships between consumer spending and the identified proxies, providing insights into potential causative or predictive trends.

**Lead and Lag Analysis**:

- **Technique**: Analysing the time-shifted relationships between consumer spending and the proxies to identify if any indicators consistently lead or lag behind consumer spending patterns.
- **Objective**: To discover predictive relationships where certain proxies might signal changes in consumer spending ahead of time or respond with a delay. *While relevant, the Lead and Lag Analysis could become complex and time-consuming. We need to ensure that it directly contributes to the goal of identifying proxies.*
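
A compact way to keep the lead and lag analysis contained is to correlate spending with shifted copies of each proxy. The sketch below uses simulated data in which the proxy leads by two periods. The function name `lagged_correlations` and the `max_lag` default are assumptions.

```python
import numpy as np
import pandas as pd

def lagged_correlations(target, proxy, max_lag=4):
    """Correlate target[t] with proxy[t - k] for k in -max_lag..max_lag.

    A positive k means the proxy leads the target by k periods.
    """
    return pd.Series(
        {k: target.corr(proxy.shift(k)) for k in range(-max_lag, max_lag + 1)},
        name="correlation",
    )

# Simulated data: spending follows the proxy two periods later, plus noise
rng = np.random.default_rng(1)
proxy = pd.Series(rng.normal(size=120))
spending = proxy.shift(2) + rng.normal(scale=0.1, size=120)

corr_by_lag = lagged_correlations(spending, proxy)
best_lag = corr_by_lag.idxmax()  # lag with the highest correlation
```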

**Consumer Behaviour Indicators Correlation**:

- **Technique**: Using scatter plots and heatmaps to examine visually how different indicators relate to consumer spending.
- **Objective**: To explore more complex relationships between consumer spending and various high-frequency proxies and to identify patterns not evident in standard correlation analysis.



### 5.6 Proxy Evaluation and Variable Selection

Essential for finalising the selection of proxies, ensuring they are representative of consumer spending trends and robust under different conditions.

**Variable Selection and Reduction**:

- **Technique**: Selecting proxies based on correlation outcomes and economic rationale.
- **Objective**: To focus on a select group of high-frequency proxies that most accurately reflect and predict trends in consumer spending.



### 5.7 Regression Analysis and Uncertainty Assessment

Performed as a concluding analytical step to provide a more nuanced understanding of how each identified proxy affects consumer spending. This step helps quantify the relationships discovered in earlier analyses.

**Model Evaluation and Uncertainty Assessment**:

- **Technique**: Utilizing advanced statistical techniques, such as bootstrapping or Monte Carlo simulations, to evaluate the robustness of the selected proxies.
- **Objective**: To assess the reliability and stability of the chosen proxies under various economic scenarios and conditions. *Techniques like bootstrapping or Monte Carlo simulations might be more advanced than required for this project as the primary aim is to identify proxies rather than to build a predictive model.*
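
If a light-touch robustness check is wanted, a percentile bootstrap of the correlation is one option, sketched here on simulated data. The function name and defaults are hypothetical. Resampling individual observations ignores serial dependence, so a block bootstrap may suit time series better.

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Pearson correlation of (x, y) pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        estimates[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Simulated proxy and spending series (not project data)
rng = np.random.default_rng(3)
proxy = rng.normal(size=60)
spending = 0.7 * proxy + rng.normal(scale=0.5, size=60)

lower, upper = bootstrap_corr_ci(proxy, spending)
```

An interval that stays well away from zero suggests the relationship does not hinge on a handful of observations.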

**Regression Analysis**:

- **Technique**: Conducting linear regression analysis to quantify the impact of each selected proxy on consumer spending and to assess the significance of the regression coefficients.
- **Objective**: To determine the strength and nature of the influence that each proxy has on consumer spending, thereby providing a quantitative measure of their relative importance.
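
The regression step can be sketched with ordinary least squares in plain NumPy, which makes the coefficient and t-statistic calculations explicit. The data are simulated, the function name `ols_summary` is hypothetical, and the standard errors assume homoskedastic, uncorrelated errors.

```python
import numpy as np

def ols_summary(X, y):
    """OLS with an intercept: returns coefficients, standard errors, t-statistics."""
    X = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]                  # observations minus parameters
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)          # coefficient covariance matrix
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Simulated data: the first proxy matters, the second does not
rng = np.random.default_rng(7)
proxies = rng.normal(size=(80, 2))
spending = 1.0 + 2.0 * proxies[:, 0] + rng.normal(scale=0.5, size=80)

beta, se, t_stats = ols_summary(proxies, spending)
```

A large absolute t-statistic on a proxy's coefficient indicates that its estimated effect is distinguishable from zero under these assumptions.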