<div style="color:#00BFFF">

# Nowcasting Consumer Expenditure: 

### Further Analysis for fitting data to model

<div style="color:#00BFFF">

##### Introduction: Uncovering Reliable Proxies for Consumer Spending Behaviour


This further analysis is essential for finalising the selection of proxies: it checks that they are representative of consumer spending trends and robust under different conditions.

When selecting a subset of indicators for the VAR model from the reduced set of variables, the following criteria apply:

1. **Economic Theory and Relevance**: Choose variables that are theoretically and empirically relevant to "PCE". They should have economic justification for inclusion in the model.

2. **Statistical Significance**: Consider variables that have shown significant coefficients in the linear regression analysis and a strong correlation with "PCE".

3. **Avoid Overfitting**: With VAR models, including too many variables can lead to overfitting and model complexity. Choose a subset that captures the essential dynamics without being overly complex.

4. **Dimensionality Considerations**: Given the complexity of VAR models, especially with lagged terms, it is prudent to limit the number of variables. Starting from the initial set of 123 variables, a significantly reduced subset based on the above criteria is appropriate.

5. **Diverse Representation**: Ensure that the chosen indicators cover diverse aspects of the economy and are not too closely related to each other, to provide a comprehensive view.

6. **Iterative Approach**: Model building can be an iterative process. Start with a smaller set of key variables and gradually add or remove variables based on model performance and diagnostics.
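
To make the overfitting and dimensionality points concrete, the sketch below counts the coefficients a VAR has to estimate. It uses the standard count for a VAR(p) with k variables and an intercept, k(kp + 1). The helper name `var_param_count`, the lag order of 4, and the 8-variable subset are illustrative assumptions, not values taken from this analysis.

```python
def var_param_count(n_vars: int, n_lags: int, intercept: bool = True) -> int:
    """Number of regression coefficients in a VAR(n_lags) with n_vars variables.

    Each of the n_vars equations has n_vars * n_lags lag coefficients,
    plus one intercept if requested. Covariance terms are not counted.
    """
    per_equation = n_vars * n_lags + (1 if intercept else 0)
    return n_vars * per_equation


# Illustrative comparison (lag order 4 is an assumption: one year of quarterly lags)
full_set = var_param_count(n_vars=123, n_lags=4)   # all 123 initial indicators
small_set = var_param_count(n_vars=8, n_lags=4)    # a hypothetical reduced subset

print(f"123 variables, 4 lags: {full_set} coefficients")
print(f"  8 variables, 4 lags: {small_set} coefficients")
```

Because the count grows with the square of the number of variables, trimming the indicator set reduces the estimation burden far more than shortening the lag order does.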

<div style="color:#00BFFF">

##### Setup Environment and import libraries

In [1]:
# Render matplotlib figures inline in the notebook
%matplotlib inline

In [2]:
# ------- Standard Library Imports -------
import warnings
from datetime import datetime
from pprint import pprint
from typing import List

# ------- Third-Party Library Imports -------
import pandas as pd
from pandas import NaT
import numpy as np

# Utility and display modules
from IPython.display import display, HTML

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Remove warnings
warnings.filterwarnings('ignore')

# Set the display options
pd.set_option('display.max_rows', None)  
pd.set_option('display.max_columns', None)  
pd.set_option('display.width', None)  
pd.set_option('display.max_colwidth', None)  



In [3]:
# Load data generated by [1]M1_clean_and_preprocess.ipynb

# Open the indicator definitions
defn = pd.read_csv('./results/fred/fred_indicator_mappings.csv', index_col=0)

# Open the transformed joined dataset
joined_dataset = pd.read_csv('./results/merged_data/joined_dataset_transformed.csv', index_col=0, parse_dates=False)

In [4]:
joined_dataset.tail()

Unnamed: 0,PCE,Real Personal Income,IP Index,IP: Fuels,Capacity Utilization: Manufacturing,Help-Wanted Index for United States,Ratio of Help Wanted/No. Unemployed,Civilian Labor Force,Civilian Employment,Civilian Unemployment Rate,Average Duration of Unemployment (Weeks),Initial Claims,Housing Starts: Total New Privately Owned,New Private Housing Permits (SAAR),Total Business: Inventories to Sales Ratio,M1 Money Stock,M2 Money Stock,Real M2 Money Stock,Total Reserves of Depository Institutions,Commercial and Industrial Loans,Real Estate Loans at All Commercial Banks,Nonrevolving consumer credit to Personal Income,Switzerland / U.S. Foreign Exchange Rate,Japan / U.S. Foreign Exchange Rate,U.S. / U.K. Foreign Exchange Rate,Canada / U.S. Foreign Exchange Rate,"Crude Oil, spliced WTI and Cushing",CPI : All Items,Personal Cons. Expend.: Chain Index,Securities in Bank Credit at All Commercial Banks,Primary_Sector_Employment,Secondary_Sector_Employment,Tertiary_Sector_Employment,Public_Sector_Employment,Avg_Hourly_Earnings_Employment,Avg_weekly_hours_Employment,Output: Consumer_Goods_Index,Output_Materials_Index,IP_Prod_Equipment_Index,IP_Final_Products_Index,Short_Term_Rate_Index,Long_Term_Rate_Index,Spread_Index,Credit_Market_Index,Stock_Market_Performance_Index,Stock_Market_Valuation_Index,Consumer_Spending_Index,Producer_Price_Index,Consumer_Credit_Index,Consumer_Demand_Composite_Index,New_Orders_Index
2022-07-01,269.1,194.322,0.842,2.7581,0.32,,,549.0,778.0,-0.1,-2.1,-26250.0,-98.0,-113.0,0.04,,,-92.4,-97.0,79.2115,140.3301,-0.000806,0.0031,9.3224,-0.1,0.0535,-30.58,1.811,0.819,,5.6,244.0,1328.5,127.0,0.363333,0.033333,-0.1185,0.4651,1.090533,0.5747,1.5,0.71,-0.278333,0.022,-63.47,-0.262319,0.272333,-16.4334,-2486.608183,7077.7085,46371.0
2022-10-01,232.8,-3.541,-2.0435,-4.6124,-2.3779,,,349.0,406.0,0.0,-0.7,20100.0,-106.0,-179.0,0.02,,-166.8,-115.3,-24.1,55.5381,175.2785,0.000664,-0.0417,-8.3695,0.086,0.0246,-7.82,2.451,1.034,,15.1,176.0,735.5,94.0,0.446667,-0.4,-0.725433,-3.784933,-2.1136,-0.8347,1.113333,0.316667,-0.8,-0.4,68.81,0.344364,0.226,-6.9772,1820.163743,1013.22175,32036.0
2023-01-01,352.6,186.37,1.1744,1.4965,0.6711,,,1692.0,1524.0,0.0,0.0,31400.0,23.0,28.0,0.01,,-482.3,-226.5,151.1,-11.5693,71.4796,-0.002158,-0.0061,-1.2505,-0.0042,0.0098,-3.16,2.818,1.182,-154.7772,6.4,35.0,1061.5,234.0,0.416667,0.066667,-0.0487,2.562933,0.2443,-0.1088,0.353333,0.033333,-0.365,0.002,105.5,0.472305,0.797,-9.1696,3119.41913,-219.4985,-15212.0
2023-04-01,149.4,26.498,-0.359,0.5308,-0.2416,,-0.133004,310.0,180.0,0.1,1.2,14750.0,38.0,4.0,0.0,,-21.5,-53.4,7.2,-31.5478,27.6099,-0.00093,-0.0251,7.6938,0.0489,-0.0397,-3.03,2.033,0.691,-173.0991,6.7,83.0,542.7,83.0,0.366667,0.1,-0.767,0.2586,0.0252,-0.6142,0.4,0.26,-0.1,-0.05,510.49,1.22771,0.374333,-3.3626,12736.168893,868.65225,96254.0
2023-07-01,260.5,22.863,1.0344,1.9677,0.1685,185.0,-0.055127,897.0,546.0,0.2,0.7,-45950.0,-62.0,30.0,-0.04,,-99.1,-113.5,-26.0,1.3296,54.7611,-0.002597,-0.0008,6.4869,-0.0246,0.0245,19.18,3.64,1.046,-114.5775,2.0,65.0,682.9,205.0,0.373333,-0.066667,1.618067,0.806033,0.393767,0.9606,0.183333,0.456667,0.07,0.306,80.785,,0.656333,9.127,9026.758967,10424.72525,35440.0


<div style="color:#00BFFF">

##### Seasonality Assessment for Joined Dataset

In [None]:
from statsmodels.tsa.stattools import acf

# Function to check seasonality
def check_seasonality(series, max_lag, threshold=0.3, seasonal_lags=(4, 8, 12, 16)):
    """Return True if |ACF| exceeds `threshold` at any seasonal lag (multiples of 4 for quarterly data)."""
    # Note: with acf's default missing='none', NaNs in `series` propagate into the ACF,
    # and NaN never exceeds the threshold, so such columns are reported as non-seasonal.
    acf_values = acf(series, nlags=max_lag, fft=True)
    return any(abs(acf_values[lag]) > threshold for lag in seasonal_lags)

# Check for seasonality in each column
seasonality_presence = {col: check_seasonality(joined_dataset[col], max_lag=40) for col in joined_dataset.columns}

# Print indicators with seasonality
print("Indicators with seasonality:")
for key, value in seasonality_presence.items():
    if value:
        print(key)

Indicators with seasonality:


In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Function to adjust seasonality if there is any
def seasonally_adjust(series, model='additive', period=4):
    # period=4 matches the quarterly frequency and the seasonal lags (4, 8, 12, 16) checked above
    result = seasonal_decompose(series, model=model, period=period)
    # Assuming an additive model. The trend is a centred moving average,
    # so the first and last period // 2 observations come back as NaN.
    return result.trend + result.resid

for indicator, has_seasonality in seasonality_presence.items():
    if has_seasonality:
        # Apply seasonal adjustment to the series
        joined_dataset[indicator] = seasonally_adjust(joined_dataset[indicator])



<div style="color:#00BFFF">

##### Lead and Lag Analysis

**Time Lag Analysis**:

- **Cross-Correlation**: Examine the cross-correlation function (CCF) between 'PCE' and the other indicators to identify potential lead-lag relationships.
- **Technique**: Analyse the time-shifted relationships between consumer spending and the proxies to identify whether any indicators consistently lead or lag consumer spending patterns.
- **Objective**: Discover predictive relationships where certain proxies signal changes in consumer spending ahead of time or respond with a delay. *While relevant, the lead and lag analysis can become complex and time-consuming, so it should be kept only insofar as it directly contributes to the goal of identifying proxies.*
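
A minimal sketch of the idea on synthetic data, assuming a shift-based cross-correlation like the one used in this notebook. The series `x` and `y` and the helper `cross_corr` are hypothetical. `y` is constructed to follow `x` with a two-period delay, so the strongest correlation should appear where `x` is shifted back by two periods.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic indicator x and target y, where y[t] = x[t-2] + small noise,
# i.e. x leads y by two periods.
x = pd.Series(rng.normal(size=n))
y = x.shift(2) + 0.1 * pd.Series(rng.normal(size=n))


def cross_corr(indicator: pd.Series, target: pd.Series, max_lag: int) -> dict:
    """Correlation of indicator.shift(-lag) with target for each lag.

    Negative lag: past indicator values vs. current target (indicator leads).
    Positive lag: future indicator values vs. current target (indicator lags).
    """
    return {
        lag: indicator.shift(-lag).corr(target)
        for lag in range(-max_lag, max_lag + 1)
    }


ccf = cross_corr(x, y, max_lag=4)
best_lag = max(ccf, key=lambda k: abs(ccf[k]))

print(pd.Series(ccf).round(3))
print("Lag with strongest correlation:", best_lag)
```

Under this sign convention a peak at a negative lag marks a leading indicator, which is the kind of proxy that is useful for anticipating the target.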

In [17]:
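def lead_lag_analysis(dataset, target_column, variable_list, max_lag=3):
    """
    Perform lead and lag analysis for specified variables against a target column.

    Sign convention: each variable is shifted by -lag, so a negative lag correlates
    past values of the variable with the current target (the variable leads), and a
    positive lag correlates future values of the variable with the current target
    (the variable lags).

    :param dataset: Pandas DataFrame
    :param target_column: Column name of the target variable
    :param variable_list: List of column names to analyze
    :param max_lag: Maximum number of periods for lead/lag
    :return: DataFrame with one row per (Variable, Lag) pair and its correlation
    """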
    results = []

    for variable in variable_list:
        for lag in range(-max_lag, max_lag + 1):
            if lag == 0:
                # Contemporaneous correlation
                corr = dataset[variable].corr(dataset[target_column])
            else:
                # Lead/Lag correlation
                shifted = dataset[variable].shift(-lag)
                corr = shifted.corr(dataset[target_column])

            results.append({'Variable': variable, 'Lag': lag, 'Correlation': corr})

    return pd.DataFrame(results)


In [18]:
# Assuming 'refined_dataset' is your DataFrame
variables_to_analyze = refined_dataset.columns.drop('PCE')  # Excluding PCE from the variables to analyze
lead_lag_results = lead_lag_analysis(refined_dataset, 'PCE', variables_to_analyze, max_lag=4)


In [19]:
# Filter for significant correlations (you might choose a threshold, e.g., |correlation| > 0.3)
significant_results = lead_lag_results[lead_lag_results['Correlation'].abs() > 0.3]

# Sort by correlation, descending, so the strongest positive relationships appear first
sorted_results = significant_results.sort_values(by='Correlation', ascending=False)

print(sorted_results)


                                              Variable  Lag  Correlation
133    Nonrevolving consumer credit to Personal Income    3     0.629856
291                     Stock_Market_Performance_Index   -1     0.551484
211                           Public_Sector_Employment    0     0.530776
40                 Ratio of Help Wanted/No. Unemployed    0     0.514498
221                     Avg_Hourly_Earnings_Employment    1     0.479889
337                    Consumer_Demand_Composite_Index    0     0.470353
182  Securities in Bank Credit at All Commercial Banks   -2     0.457176
247                             Output_Materials_Index    0     0.441009
223                     Avg_Hourly_Earnings_Employment    3     0.425818
256                            IP_Prod_Equipment_Index    0     0.414959
30                 Help-Wanted Index for United States   -1     0.414469
180  Securities in Bank Credit at All Commercial Banks   -4     0.407900
220                     Avg_Hourly_Earnings_Employm

In [20]:
# Step 2: Filter and Process Results
# Keep significant correlations (|correlation| > 0.3) at lags between -1 and 1
filtered_results = significant_results[significant_results['Lag'].between(-1, 1)].sort_values(by='Correlation', ascending=False)

# Extract variable names for nowcasting
variables_for_nowcasting = filtered_results['Variable'].unique().tolist()
if 'PCE' not in variables_for_nowcasting:
    variables_for_nowcasting.append('PCE')  # Ensure 'PCE' is included

# Step 3: Refine Dataset Based on Selection
filtered_refined_dataset = refined_dataset[variables_for_nowcasting]

For constructing a Vector Autoregression (VAR) model, choosing the right number of variables (proxies in this case) is crucial for the model's performance and interpretability. Using too many variables can lead to overfitting and computational complexity, while too few may miss out on important information.

**Suggested Approach:**

- **Set thresholds for correlation and R-squared**:
  - A correlation threshold (e.g., |Correlation| > 0.5) helps ensure that only variables strongly related to consumer spending (PCE) are included.
  - An R-squared threshold (e.g., R-squared > 0.25) ensures that the variable has reasonable predictive power.
  - The filtering cell below uses looser values (0.3 and 0.2).
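
One way such a comparison table could be built is sketched below on synthetic data. The column names `Correlation` and `R_squared` mirror those used in the filtering cell, but the construction itself (a univariate least-squares regression of the target on each candidate) and the names `build_comparison`, `proxy_a` and `proxy_b` are assumptions rather than the notebook's actual procedure. In a univariate regression, R-squared equals the squared correlation, so |Correlation| > 0.5 and R-squared > 0.25 select the same variables. The two thresholds only differ in effect if R-squared comes from a richer regression, for example one that includes lags.

```python
import numpy as np
import pandas as pd


def build_comparison(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Correlation and univariate-OLS R-squared of each column against `target`."""
    rows = {}
    for col in df.columns.drop(target):
        pair = df[[col, target]].dropna()
        x, y = pair[col].to_numpy(), pair[target].to_numpy()
        # OLS fit y = a + b * x via least squares
        X = np.column_stack([np.ones_like(x), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r_squared = 1.0 - resid.var() / y.var()
        rows[col] = {
            "Correlation": pair[col].corr(pair[target]),
            "R_squared": r_squared,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data standing in for the real indicators
rng = np.random.default_rng(42)
n = 120
pce = rng.normal(size=n)
demo = pd.DataFrame({
    "PCE": pce,
    "proxy_a": pce + rng.normal(scale=0.5, size=n),  # strongly related to the target
    "proxy_b": rng.normal(size=n),                   # unrelated noise
})

comparison_demo = build_comparison(demo, target="PCE")
selected = comparison_demo[
    (comparison_demo["Correlation"].abs() > 0.5)
    & (comparison_demo["R_squared"] > 0.25)
].index.tolist()

print(comparison_demo.round(3))
print("Selected:", selected)
```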

In [None]:
# Set thresholds
corr_threshold = 0.3
r_squared_threshold = 0.2

# Filter based on the thresholds (a variable must meet both)
filtered_proxies = comparison_df[
    (comparison_df['Correlation'].abs() > corr_threshold) &
    (comparison_df['R_squared'] > r_squared_threshold)]

# Now, 'filtered_proxies' contains variables meeting both criteria
selected_variables = filtered_proxies.index.tolist()  # Use 'selected_variables' in VAR model

filtered_proxies

In [None]:
# Keep the selected proxies plus the target column
columns_to_keep = selected_variables + ['PCE']

# Restrict refined_dataset to those columns
final_proxy_dataset = refined_dataset[columns_to_keep]

<div style="color:#00BFFF">

### Further Transformations and considerations for Model fitting


<div style="color:#00BFFF">

##### PCA Analysis


<div style="color:#00BFFF">

##### Log Transformation on joined dataset for comparability



**Rationale:** 

Logarithmic transformation is used to stabilize the variance in data that exhibits exponential growth or large fluctuations. This is especially crucial for datasets like FRED's, where certain indicators can show significant variability over time. Given the information from the FRED database and their suggested transformation types, it seems reasonable to align with their expertise and apply these transformations to the dataset. This approach will save time and ensure that the data is treated consistently with established economic analysis practices.

**Transformation Types (as per FRED):**

1. **No Transformation (1)**: The data is used as is, without any modification.
   
2. **First Difference (∆x_t) (2)**: The change from one period to the next, useful for removing a trend in the level of a series.
   
3. **Second Difference (∆^2x_t) (3)**: The change in the first difference, often used to capture acceleration or deceleration in a series.
   
4. **Natural Log (log(x_t)) (4)**: Useful for stabilizing variance and making exponential growth trends linear.
   
5. **First Difference of Log (∆ log(x_t)) (5)**: Commonly used to convert data into a stationary series, representing percentage change.
   
6. **Second Difference of Log (∆^2 log(x_t)) (6)**: The change in the first difference of the log, similar to the second difference but for logged data.
   
7. **Percentage Change from Prior Period (∆(x_t/x_{t−1} − 1.0)) (7)**: Based on the percentage change from the previous period, x_t/x_{t−1} − 1.0, emphasizing relative changes. As written, the formula takes the first difference of that percentage change.
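
As a small worked example of codes 2, 4 and 5 alongside the simple percentage change, the sketch below applies them to a toy series that grows 10% per period. The series values are made up for illustration. The log first difference (scaled by 100) comes out close to, but slightly below, the exact percentage change.

```python
import numpy as np
import pandas as pd

x = pd.Series([100.0, 110.0, 121.0])  # toy series: +10% each period

first_diff = x.diff()                       # code 2: x_t - x_{t-1}
log_level = np.log(x)                       # code 4: log(x_t)
log_diff_pct = np.log(x).diff() * 100       # code 5, scaled to percent
pct_change = (x / x.shift(1) - 1.0) * 100   # exact percentage change

print(pd.DataFrame({
    "x": x,
    "diff": first_diff,
    "log": log_level.round(4),
    "log_diff_pct": log_diff_pct.round(3),
    "pct_change": pct_change.round(3),
}))
```

The first row of every differenced column is NaN, which is why rows are dropped from the start of a dataset after transformations of this kind.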

**Approach:**

- **Apply Transformations:** Use the transformation codes provided in the `fred_indicator_mappings` dataset to apply the FRED transformations to the corresponding series in `pce_joined_dataset`.
- This approach streamlines the analysis and aligns the methodology with FRED's established practices.
- Additionally, it ensures that the data is treated in a manner that is suitable for economic analysis.
- **FRED Logarithmic Key Mapping:** We map the transformation codes in the FREDmd_defn dataset to our dataset's indicators and then perform the necessary transformations.






In [None]:



# #  transformation function to handle the time column and a special case for PCE
# def modified_log_transform(column, time_column, transformation_code=4, column_name=None):
#     """
#     Applies the specified transformation to a Pandas Series, considering the time column and special cases.
#     """
#     time_column = time_column.astype(str)
#     # Special instruction for the PCE column
#     if column_name == "PCE":
#         transformation_code = 5  #6 # according to FREDs guidelines

#     # Check if the data is quarterly based on the time column
#     mult = 4 if any(time_column.str.endswith(('Q1', 'Q2', 'Q3', 'Q4'))) else 1

#     if transformation_code == 1:
#         # No transformation -> Mathematical Equation: x(t)
#         # It leaves the data in its original form, without any alteration.
#         return column
    
#     elif transformation_code == 2:
#         # First Difference -> Mathematical Equation: x(t) - x(t-1)
#         # It measures the absolute change from one period to the next, helping to detrend the data.
#         return column.diff()
    
#     elif transformation_code == 3:
#         # Second Difference -> Mathematical Equation: (x(t) - x(t-1)) - (x(t-1) - x(t-2))
#         # It measures the change in the first difference, capturing the acceleration or deceleration in the data's movement.
#         return column.diff().diff()
    
#     elif transformation_code == 4:
#         # Log Transformation -> Mathematical Equation: ln(x(t))
#         # It stabilizes the variance across the data series and can help make a skewed distribution more normal.
#         return np.log(column)
    
#     elif transformation_code == 5:
#         # Log First Difference -> Mathematical Equation: 100 * (ln(x(t)) - ln(x(t-1)))
#         # It measures the growth rate from one period to the next and multiplies by 100 for percentage change.
#         # The 'mult' variable allows for scaling the growth rate if necessary.
#         return np.log(column).diff() * 100 * mult
    
#     elif transformation_code == 6:
#         # Log Second Difference -> Mathematical Equation: 100 * ((ln(x(t)) - ln(x(t-1))) - (ln(x(t-1)) - ln(x(t-2))))
#         # It measures the change in the growth rate (change in log first difference), capturing the momentum of change.
#         # The 'mult' variable allows for scaling the change in growth rate if necessary.
#         return np.log(column).diff().diff() * 100 * mult
    
#     elif transformation_code == 7:
#         # Exact Percent Change -> Mathematical Equation: 100 * ((x(t)/x(t-1))^mult - 1)
#         # It measures the percentage change from one period to the next, with an option to compound the change using 'mult'.
#         return ((column / column.shift(1))**mult - 1.0) * 100
    
#     else:
#         raise ValueError("Invalid transformation code")


# # Create a mapping of columns to transformation codes
# transformation_mapping = defn.set_index('description')['tcode'].to_dict()

# # Extracting the time column
# time_column = joined_dataset.index

# # Applying the transformations to the dataframe
# transformed_dataset = joined_dataset.copy()

# for column in transformed_dataset.columns:
#     # Check if the column is in the mapping, else apply special instruction for PCE
#     tcode = transformation_mapping.get(column, None)
#     transformed_dataset[column] = modified_log_transform(transformed_dataset[column], time_column, tcode, column)

# # Drop the first 5 rows containing NaN values resulting from the transformation
# transformed_dataset = transformed_dataset.iloc[5:]

# # Displaying the first few rows of the transformed dataset
# joined_dataset = transformed_dataset
# joined_dataset.head(8)