# Module 5: Regime Prediction with Machine Learning - Part 2

In this part, we prepare the dataset for our recession forecasting problem: we clean the data and then perform feature selection to reduce the number of variables.

## Table of Contents:
&nbsp;&nbsp;1. [Set Up Environment and Read Data](#1)

&nbsp;&nbsp;2. [Data Cleaning](#2)


## 1. Set Up Environment and Read Data <a id="1"></a>

In [13]:
#load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

from statsmodels.tsa.stattools import adfuller #to check unit root in time series 
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

import seaborn as sns #for correlation heatmap

import warnings
warnings.filterwarnings('ignore')

In [14]:
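#bigmacro=pd.read_csv("Macroeconomic_VariablesL.csv")
#read the macroeconomic variables and rename the date column
bigmacro=pd.read_csv("current.csv")
bigmacro=bigmacro.rename(columns={'sasdate':'Date'})
#read the regime labels and attach them by position,
#so both files must contain the same rows in the same order
Recession_periods=pd.read_csv('Recession_PeriodsL2024.csv')
bigmacro.insert(loc=1,column="Regime", value=Recession_periods['Regime'].values)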
bigmacro.head()

| | Date | Regime | RPI | W875RX1 | DPCERA3M086SBEA | CMRMTSPLx | RETAILx | INDPRO | IPFPNSS | IPFINAL | ... | DNDGRG3M086SBEA | DSERRG3M086SBEA | CES0600000008 | CES2000000008 | CES3000000008 | UMCSENTx | DTCOLNVHFNM | DTCTHFNM | INVEST | VIXCLSx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Transform: | Normal | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 2.0 | 6.0 | 6.0 | 6.0 | 1.0 |
| 1 | 1/1/1959 | Normal | 2583.56 | 2426.0 | 15.188 | 276676.8154 | 18235.77392 | 21.9616 | 23.3868 | 22.262 | ... | 18.294 | 10.152 | 2.13 | 2.45 | 2.04 | NaN | 6476.0 | 12298.0 | 84.2043 | NaN |
| 2 | 2/1/1959 | Normal | 2593.596 | 2434.8 | 15.346 | 278713.9773 | 18369.56308 | 22.3917 | 23.7024 | 22.4549 | ... | 18.302 | 10.167 | 2.14 | 2.46 | 2.05 | NaN | 6476.0 | 12298.0 | 83.528 | NaN |
| 3 | 3/1/1959 | Normal | 2610.396 | 2452.7 | 15.491 | 277775.2539 | 18523.05762 | 22.7142 | 23.8459 | 22.5651 | ... | 18.289 | 10.185 | 2.15 | 2.45 | 2.07 | NaN | 6508.0 | 12349.0 | 81.6405 | NaN |
| 4 | 4/1/1959 | Normal | 2627.446 | 2470.0 | 15.435 | 283362.7075 | 18534.466 | 23.1981 | 24.1903 | 22.8957 | ... | 18.3 | 10.221 | 2.16 | 2.47 | 2.08 | NaN | 6620.0 | 12484.0 | 81.8099 | NaN |
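
Row 0 of the output above holds the `Transform:` codes that ship with the file rather than a monthly observation. If that row is not removed elsewhere, it would be treated as data in the steps that follow. The sketch below shows one way to split such a row off; it uses a tiny synthetic stand-in for the file, and the names `csv_text`, `raw`, `tcodes` and `data` are hypothetical.

```python
import io
import pandas as pd

# Tiny synthetic stand-in for the macro file (two variables, two months)
csv_text = """sasdate,RPI,INDPRO
Transform:,5,5
1/1/1959,2583.56,21.9616
2/1/1959,2593.596,22.3917
"""
raw = pd.read_csv(io.StringIO(csv_text)).rename(columns={"sasdate": "Date"})

# Identify the row that carries transformation codes instead of observations
is_code_row = raw["Date"].eq("Transform:")

# Keep the codes in their own Series, indexed by variable name
tcodes = raw.loc[is_code_row].drop(columns="Date").iloc[0].astype(int)

# Keep only real observations and parse the dates
data = raw.loc[~is_code_row].reset_index(drop=True)
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")

print(tcodes)
print(data)
```

Applying the same split to `bigmacro` would change the row counts reported in the later cells, so treat this as an optional check rather than a drop-in replacement.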


In [16]:
bigmacro.tail()

| | Date | Regime | RPI | W875RX1 | DPCERA3M086SBEA | CMRMTSPLx | RETAILx | INDPRO | IPFPNSS | IPFINAL | ... | DNDGRG3M086SBEA | DSERRG3M086SBEA | CES0600000008 | CES2000000008 | CES3000000008 | UMCSENTx | DTCOLNVHFNM | DTCTHFNM | INVEST | VIXCLSx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 787 | 7/1/2024 | Normal | 19988.217 | 16306.3 | 120.966 | 1529744.0 | 710851.0 | 102.534 | 100.5549 | 100.3012 | ... | 119.776 | 127.972 | 31.2 | 35.7 | 27.96 | 66.4 | 548691.61 | 930374.96 | 5278.3525 | 14.4084 |
| 788 | 8/1/2024 | Normal | 20007.209 | 16322.1 | 121.052 | 1530317.0 | 710038.0 | 103.0831 | 101.0481 | 101.0128 | ... | 119.653 | 128.291 | 31.26 | 35.81 | 27.97 | 67.9 | 551667.22 | 933066.9 | 5327.6461 | 19.675 |
| 789 | 9/1/2024 | Normal | 20044.142 | 16333.7 | 121.69 | 1541305.0 | 716388.0 | 102.5283 | 100.1897 | 99.8299 | ... | 119.22 | 128.682 | 31.44 | 36.0 | 28.11 | 70.1 | 553347.06 | 934283.59 | 5368.5924 | 17.6597 |
| 790 | 10/1/2024 | Normal | 20131.843 | 16394.9 | 121.837 | 1539243.0 | 719676.0 | 102.1123 | 99.593 | 99.0546 | ... | 119.073 | 129.148 | 31.57 | 36.22 | 28.18 | 70.5 | 554377.25 | 937299.96 | 5407.412 | 19.9478 |
| 791 | 11/1/2024 | Normal | 20163.284 | 16433.7 | 122.173 | NaN | 724609.0 | 101.9621 | 99.5572 | 99.2239 | ... | 119.124 | 129.363 | 31.63 | 36.22 | 28.3 | 71.8 | NaN | NaN | 5382.4182 | 15.9822 |
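
The `Regime` column is attached by position, which works only when both files list the same months in the same order. As a sketch of a more defensive alternative, the labels could be joined on the date instead. This assumes the recession file also carries a date column, which is not confirmed by the notebook; the frames and the `Recession` label below are synthetic and purely illustrative.

```python
import pandas as pd

# Synthetic macro data (values copied from the first rows shown above)
macro = pd.DataFrame({
    "Date": ["1/1/1959", "2/1/1959", "3/1/1959"],
    "RPI": [2583.56, 2593.596, 2610.396],
})

# Synthetic regime labels, deliberately out of order; labels are illustrative only
regimes = pd.DataFrame({
    "Date": ["3/1/1959", "1/1/1959", "2/1/1959"],
    "Regime": ["Recession", "Normal", "Normal"],
})

# Join on Date; validate raises if a date appears more than once on either side
merged = macro.merge(regimes, on="Date", how="left", validate="one_to_one")

# Fail loudly if any month is left without a label
assert merged["Regime"].notna().all(), "some dates have no regime label"
print(merged)
```

A key-based join keeps each label with its month even if one file is re-sorted or gains a row.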


## 2. Data Cleaning <a id="2"></a>

We will follow the steps below to clean the data and prepare it for the feature selection process.

1. Remove variables with more than 10 missing observations, then drop any rows that still contain missing values
2. Add lags of the variables as additional features
3. Test the stationarity of each time series and difference the non-stationary ones
4. Standardize the dataset

In [17]:
#remove columns with more than 10 missing observations
missing_colnames=[]
for col in bigmacro.drop(['Date','Regime'],axis=1):
    n_missing=bigmacro[col].isna().sum() #number of missing values in this column
    if n_missing>10:
        print(col+':'+str(n_missing))
        missing_colnames.append(col)

bigmacro=bigmacro.drop(labels=missing_colnames, axis=1)

#drop the remaining rows that contain any missing value
bigmacro=bigmacro.dropna(axis=0)

bigmacro.shape

PERMIT:12
PERMITNE:12
PERMITMW:12
PERMITS:12
PERMITW:12
ACOGNO:399
ANDENOx:109
TWEXAFEGSMTHx:168
UMCSENTx:154
VIXCLSx:42


(789, 118)

In [18]:
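# Add lags of 3, 6, 9, 12 and 18 months for every feature
for col in bigmacro.drop(['Date', 'Regime'], axis=1):
    for n in [3,6,9,12,18]:
        bigmacro['{} {}M lag'.format(col, n)] = bigmacro[col].shift(n).ffill().values

# 1 month ahead prediction: pair each row's features with next month's regime
bigmacro["Regime"]=bigmacro["Regime"].shift(-1)

# drop the rows left incomplete by the shifts (the first 18 rows and the last row)
bigmacro=bigmacro.dropna(axis=0)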

In [19]:
bigmacro.shape

(770, 698)
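
The shape follows from the two shifts: the 116 features each gain 5 lags (118 + 580 = 698 columns), while the 18-month lag and the shifted target leave 18 + 1 incomplete rows (789 - 19 = 770). A toy example, with made-up values and a hypothetical frame `toy`, shows how `shift(n)` and `shift(-1)` line up features and target:

```python
import pandas as pd

# Toy data: one feature and a regime label per month (values are made up)
toy = pd.DataFrame({
    "x": [10, 11, 12, 13, 14],
    "Regime": ["Normal", "Normal", "Recession", "Recession", "Normal"],
})

# Positive shift: each row sees the value of x from 2 rows earlier
toy["x 2M lag"] = toy["x"].shift(2)

# Negative shift: each row is labelled with the regime of the NEXT row
toy["Regime"] = toy["Regime"].shift(-1)

# The first 2 rows (no lag available) and the last row (no future label) drop out
toy = toy.dropna()
print(toy)
```

Each remaining row now combines current and past feature values with the following month's label, so no future feature information enters the predictors.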

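The Augmented Dickey-Fuller (ADF) test can be used to check macroeconomic time series for stationarity. Its null hypothesis is that the series has a unit root, so a p-value above the chosen significance level means non-stationarity cannot be rejected; in that case we difference the series. We use the `adfuller` function from the `statsmodels` package. More information about the function can be found __[here](https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html)__.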

In [20]:
#check stationarity: difference every series for which the ADF test
#cannot reject a unit root at the chosen significance level
threshold=0.01 #significance level
for column in bigmacro.drop(['Date','Regime'], axis=1):
    result=adfuller(bigmacro[column])
    if result[1]>threshold: #result[1] is the p-value
        bigmacro[column]=bigmacro[column].diff()
#differencing leaves a missing value in the first row, so drop that row
bigmacro=bigmacro.dropna(axis=0)

In [21]:
#second pass: difference again any series that is still non-stationary
threshold=0.01 #significance level
for column in bigmacro.drop(['Date','Regime'], axis=1):
    result=adfuller(bigmacro[column])
    if result[1]>threshold:
        bigmacro[column]=bigmacro[column].diff()
bigmacro=bigmacro.dropna(axis=0)

In [22]:
#final check: list the series that are still non-stationary after two passes
threshold=0.01 #significance level
for column in bigmacro.drop(['Date','Regime'], axis=1):
    result=adfuller(bigmacro[column])
    if result[1]>threshold:
        print(column)

AWHMAN 6M lag
CPIAUCSL 18M lag
CPIAPPSL 18M lag
CUSR0000SA0L5 18M lag
PCEPI 18M lag
DDURRG3M086SBEA 18M lag


In [23]:
# Standardize each feature to zero mean and unit variance
features=bigmacro.drop(['Date','Regime'],axis=1)
col_names=features.columns

scaler=StandardScaler()
standardized_features=scaler.fit_transform(features)

# rebuild the DataFrame and put Date and Regime back as the first two columns
df=pd.DataFrame(data=standardized_features,columns=col_names)
df.insert(loc=0,column="Date", value=bigmacro['Date'].values)
df.insert(loc=1,column='Regime', value=bigmacro['Regime'].values)
df.shape

(768, 698)
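
The cell above fits the scaler on the full sample. If forecasts are later evaluated out of sample, one option (not what this notebook does) is to fit the scaler on the training window only and reuse its mean and variance for later observations. The sketch below uses synthetic data, and the split point `split` is a hypothetical choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features, rows in time order

split = 80  # hypothetical chronological split: first 80 rows are the training window

# Estimate mean and variance from the training window only
scaler = StandardScaler().fit(X[:split])

# Apply the same training-window statistics to both parts
X_train = scaler.transform(X[:split])
X_test = scaler.transform(X[split:])

print(X_train.mean(axis=0).round(6), X_train.std(axis=0).round(6))
print(X_test.shape)
```

Scaling this way keeps information from the evaluation period out of the training features.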

In [24]:
df.to_csv("Dataset_CleanedL2024.csv", index=False)