# Creating Derived Time Variables While Minimizing Collinearity

This notebook demonstrates how to create meaningful derived variables from datetime features while being mindful of collinearity. We'll focus on creating features that capture different aspects of time while avoiding redundant information.

## 1. Import Libraries and Load Data

First, we'll import the necessary libraries and load our sample data.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load the data
df = pd.read_csv("../Data/kaggle_data.txt")
df['date'] = pd.to_datetime(df['date'])  # Convert date column to datetime
print("Data loaded successfully. Date range:", df['date'].min(), "to", df['date'].max())

## 2. Extract Basic Time Components

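We'll start by extracting basic time components. Many of these components carry overlapping information, so they need to be handled with care. For example:
- Year increases with time, so it is strongly correlated with any overall trend feature; month behaves the same way when the data covers a year or less
- Day of month and day of week are not functionally related, but they can show incidental correlation over short date ranges
- Quarter is a deterministic function of month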

Let's extract these components and examine their relationships.

In [None]:
# Extract basic time components
df_time = pd.DataFrame()
df_time['year'] = df['date'].dt.year
df_time['month'] = df['date'].dt.month
df_time['day'] = df['date'].dt.day
df_time['day_of_week'] = df['date'].dt.dayofweek
df_time['quarter'] = df['date'].dt.quarter

# Display the first few rows
print("Basic time components:")
print(df_time.head())
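
The redundancy between these components can be checked directly. The sketch below uses a synthetic three-year daily date range (an assumption standing in for `df['date']`, not the notebook's data) to confirm that quarter can be rebuilt from month and that year closely tracks elapsed days.

```python
import pandas as pd

# Synthetic daily dates spanning three years (a stand-in for df['date'])
dates = pd.Series(pd.date_range("2020-01-01", "2022-12-31", freq="D"))

month = dates.dt.month
quarter = dates.dt.quarter
year = dates.dt.year
days_since_start = (dates - dates.min()).dt.days

# Quarter can be reconstructed exactly from month, so it adds no new information
quarter_from_month = (month - 1) // 3 + 1
print("quarter fully determined by month:", bool((quarter == quarter_from_month).all()))

# Year is a coarse, step-wise version of the elapsed-days trend
year_trend_corr = year.corr(days_since_start)
print("corr(year, days_since_start):", round(year_trend_corr, 3))
```

On real data the year/trend correlation depends on how many years the dataset covers, so it is worth rerunning the check on `df_time` itself.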

## 3. Create Cyclical Time Features

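Integer encodings of month and day of week treat December (12) and January (1) as far apart, even though they are adjacent in the calendar. Transforming these components into sine and cosine pairs preserves their cyclical nature and lets us drop the raw integer columns, which removes one source of redundancy. This is particularly useful for:
- Months (cycles every 12 months)
- Days of week (cycles every 7 days)

Each sine/cosine pair places the values evenly on a unit circle, so the last value of a cycle sits next to the first. When every value in the cycle is equally represented, the sine and cosine components are also uncorrelated with each other.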

In [None]:
# Create cyclical features for month and day of week
df_time['month_sin'] = np.sin(2 * np.pi * df_time['month'] / 12)
df_time['month_cos'] = np.cos(2 * np.pi * df_time['month'] / 12)
df_time['dow_sin'] = np.sin(2 * np.pi * df_time['day_of_week'] / 7)
df_time['dow_cos'] = np.cos(2 * np.pi * df_time['day_of_week'] / 7)

# Display the cyclical features
print("Cyclical features:")
print(df_time[['month_sin', 'month_cos', 'dow_sin', 'dow_cos']].head())
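
To see what the encoding buys us, the following sketch compares distances between months in (sin, cos) space. It is a small standalone illustration; the helper `encoded_distance` is a hypothetical name introduced here, not part of the notebook.

```python
import numpy as np

# Encode months 1..12 with the same formula used for month_sin / month_cos
months = np.arange(1, 13)
angle = 2 * np.pi * months / 12
encoded = np.column_stack([np.sin(angle), np.cos(angle)])  # shape (12, 2)

def encoded_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(encoded[m1 - 1] - encoded[m2 - 1]))

dec_to_jan = encoded_distance(12, 1)  # wraps around the year boundary
jan_to_feb = encoded_distance(1, 2)   # ordinary adjacent months
jan_to_jul = encoded_distance(1, 7)   # opposite sides of the cycle

print("Dec -> Jan:", round(dec_to_jan, 3))
print("Jan -> Feb:", round(jan_to_feb, 3))
print("Jan -> Jul:", round(jan_to_jul, 3))
```

December to January comes out the same as January to February, while months half a year apart are the farthest from each other. The raw integer encoding would instead put December and January 11 units apart.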

## 4. Create Trend and Indicator Features

We'll create some additional time-based features that might be useful:
- Days since the start of the dataset (captures overall trend)
- Is weekend (binary indicator)
- Is month end (binary indicator)

None of these features is an exact linear combination of the others. Note, however, that `is_weekend` is derived from `day_of_week`, so it can still correlate noticeably with the day-of-week encodings.

In [None]:
# Calculate additional time-based features
df_time['days_since_start'] = (df['date'] - df['date'].min()).dt.days
df_time['is_weekend'] = df_time['day_of_week'].isin([5, 6]).astype(int)
df_time['is_month_end'] = df['date'].dt.is_month_end.astype(int)

print("Additional time-based features:")
print(df_time[['days_since_start', 'is_weekend', 'is_month_end']].head())
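
Because `is_weekend` is computed from `day_of_week`, it is worth checking how strongly it overlaps with the cyclical day-of-week encoding. The sketch below does this on 52 full weeks of synthetic dates (an assumption, chosen so every weekday appears equally often); the name `check` is introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# 52 full weeks of synthetic daily dates, so every weekday appears equally often
dates = pd.Series(pd.date_range("2024-01-01", periods=364, freq="D"))

dow = dates.dt.dayofweek  # Monday=0 ... Sunday=6
check = pd.DataFrame({
    "is_weekend": dow.isin([5, 6]).astype(int),
    "dow_sin": np.sin(2 * np.pi * dow / 7),
    "dow_cos": np.cos(2 * np.pi * dow / 7),
    "is_month_end": dates.dt.is_month_end.astype(int),
})

# Pearson correlation of every other column with the weekend indicator
corr_with_weekend = check.corr()["is_weekend"].drop("is_weekend")
print(corr_with_weekend.round(3))
```

With this encoding, Saturday and Sunday both fall where `dow_sin` is strongly negative, so the weekend indicator and `dow_sin` are far from independent. Whether to keep both is a modeling choice: the indicator gives a model a direct weekend effect, at the cost of some overlap.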

## 5. Analyze Feature Correlations

Let's examine the correlations between our derived features to identify potential collinearity. We'll use a correlation matrix and Variance Inflation Factor (VIF) analysis to make informed decisions about which features to keep.

In [None]:
# Calculate correlation matrix
correlation_matrix = df_time.corr()

# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Time Features')
plt.tight_layout()
plt.show()

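# Calculate VIF for numerical features
# variance_inflation_factor does not add an intercept, so include one explicitly;
# without it, VIFs for uncentered columns such as year can be misleading.
from statsmodels.tools.tools import add_constant

numerical_features = ['year', 'month', 'day', 'day_of_week', 'quarter', 'days_since_start']
X = add_constant(df_time[numerical_features].astype(float), has_constant='add')
vif_data = pd.DataFrame()
vif_data["Feature"] = numerical_features
# Column 0 is the added constant, so start at index 1
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]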

print("\nVariance Inflation Factors:")
print(vif_data.sort_values('VIF', ascending=False))
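
For reference, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features with an intercept. The sketch below computes this by hand with NumPy on synthetic data, which can help when interpreting the numbers above. The function `manual_vif` and the columns `x1`, `x2`, `x3` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

def manual_vif(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    values = features.to_numpy(dtype=float)
    n_rows = values.shape[0]
    vifs = {}
    for j, name in enumerate(features.columns):
        y = values[:, j]
        others = np.delete(values, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - residual.var() / y.var()
        # A perfect fit means exact collinearity, reported here as infinite VIF
        vifs[name] = np.inf if np.isclose(r_squared, 1.0) else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")

# Synthetic example: x2 is nearly a copy of x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=500),
    "x3": rng.normal(size=500),
})
vif_demo = manual_vif(demo)
print(vif_demo.round(2))
```

The two near-duplicate columns receive large VIFs while the independent column stays close to 1, which is the pattern to look for among the time components.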

## 6. Select Non-Collinear Features

Based on our correlation analysis and VIF values, we'll select a subset of features that minimize collinearity while preserving important time information. Here's our recommended set of derived variables:

1. Trend feature:
   - `days_since_start` (captures overall trend)

2. Cyclical features:
   - `month_sin` and `month_cos` (captures seasonal patterns)
   - `dow_sin` and `dow_cos` (captures weekly patterns)

3. Binary indicators:
   - `is_weekend` (captures weekend effect)
   - `is_month_end` (captures month-end effects)

We'll drop the following, mostly because they are redundant with the features above:
- `year` (highly correlated with `days_since_start`)
- `month` (replaced by sine/cosine components)
- `day_of_week` (replaced by sine/cosine components)
- `quarter` (derived from month)
- `day` (dropped for relevance rather than collinearity: less important for trend analysis)

Let's create our final feature set:

In [None]:
# Select final feature set
final_features = [
    'days_since_start',
    'month_sin', 'month_cos',
    'dow_sin', 'dow_cos',
    'is_weekend', 'is_month_end',
]

final_time_features = df_time[final_features]

print("Final time-based features:")
print(final_time_features.head())

# Check correlations in final feature set
final_correlation = final_time_features.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(final_correlation, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Final Time Features')
plt.tight_layout()
plt.show()
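
A heatmap is easy to misread when there are many features, so it can help to list the strongly correlated pairs explicitly. The sketch below is one way to do that; `high_correlation_pairs` is a hypothetical helper, the 0.7 threshold is an arbitrary assumption, and the features are rebuilt from synthetic dates rather than taken from the notebook's data.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(features: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().rename("abs_corr").reset_index()
    pairs.columns = ["feature_a", "feature_b", "abs_corr"]
    return pairs[pairs["abs_corr"] > threshold].sort_values("abs_corr", ascending=False)

# Synthetic stand-in for the final feature set (two years of daily dates)
dates = pd.Series(pd.date_range("2022-01-01", "2023-12-31", freq="D"))
dow = dates.dt.dayofweek
demo_features = pd.DataFrame({
    "days_since_start": (dates - dates.min()).dt.days,
    "month_sin": np.sin(2 * np.pi * dates.dt.month / 12),
    "month_cos": np.cos(2 * np.pi * dates.dt.month / 12),
    "dow_sin": np.sin(2 * np.pi * dow / 7),
    "dow_cos": np.cos(2 * np.pi * dow / 7),
    "is_weekend": dow.isin([5, 6]).astype(int),
    "is_month_end": dates.dt.is_month_end.astype(int),
})
flagged = high_correlation_pairs(demo_features)
print(flagged)
```

To run the same check on the notebook's data, call `high_correlation_pairs(final_time_features)`. Any pair it reports is a candidate for dropping one member, depending on which feature matters more to the model.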