# Feature Engineering for QuickBooks Sales Forecasting

This notebook focuses on transforming raw sales data into model-ready features for our forecasting model.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Set visualization style
plt.style.use('ggplot')
sns.set_theme(style="whitegrid")
%matplotlib inline

In [2]:
# Load the sales data
sales_df = pd.read_csv('../data/raw/sales.csv')

# Convert date to datetime
sales_df['date'] = pd.to_datetime(sales_df['date'])

# Display the first few rows
sales_df.head()

Unnamed: 0,date,category,product,units_sold,revenue,product_price,promo_flag,is_holiday_season,category_popularity_30d
0,2020-05-08,Electronics,Smart Speaker,7,991.63,141.66,0,0,1.03
1,2020-05-08,Groceries,Organic Cereal,307,1556.42,5.07,0,0,1.1
2,2020-05-08,Clothing,Running Shoes,54,3145.9,58.26,0,0,0.95
3,2020-05-08,Books,Business Book,86,1869.15,21.73,0,0,1.1
4,2020-05-08,Furniture,Office Chair,7,822.02,117.43,0,0,0.93


## Time-Based Features
In this section, we extract temporal features from the date column. These features capture important calendar-based patterns like yearly seasonality, monthly cycles, and day-of-week effects that are crucial for time series forecasting. We create features for year, month, day of week, and a binary weekend indicator.


In [3]:
# Extract date components
sales_df['year'] = sales_df['date'].dt.year
sales_df['month'] = sales_df['date'].dt.month
sales_df['day_of_week'] = sales_df['date'].dt.dayofweek
sales_df['is_weekend'] = sales_df['day_of_week'].isin([5, 6]).astype(int)

# Display the enhanced dataframe
sales_df.head()

Unnamed: 0,date,category,product,units_sold,revenue,product_price,promo_flag,is_holiday_season,category_popularity_30d,year,month,day_of_week,is_weekend
0,2020-05-08,Electronics,Smart Speaker,7,991.63,141.66,0,0,1.03,2020,5,4,0
1,2020-05-08,Groceries,Organic Cereal,307,1556.42,5.07,0,0,1.1,2020,5,4,0
2,2020-05-08,Clothing,Running Shoes,54,3145.9,58.26,0,0,0.95,2020,5,4,0
3,2020-05-08,Books,Business Book,86,1869.15,21.73,0,0,1.1,2020,5,4,0
4,2020-05-08,Furniture,Office Chair,7,822.02,117.43,0,0,0.93,2020,5,4,0
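
As a quick check of the conventions used above, the following sketch runs the same accessors on a few hand-picked dates. The toy frame `demo` is an assumption for illustration and is not part of the sales data. pandas numbers weekdays from Monday = 0, so Saturday and Sunday are 5 and 6, which is why 2020-05-08 (a Friday) shows `day_of_week` 4 in the output above.

```python
import pandas as pd

# Hypothetical toy frame: Friday 2020-05-08 through Monday 2020-05-11
demo = pd.DataFrame({'date': pd.date_range('2020-05-08', periods=4, freq='D')})

demo['day_of_week'] = demo['date'].dt.dayofweek  # Monday=0 ... Sunday=6
demo['is_weekend'] = demo['day_of_week'].isin([5, 6]).astype(int)

print(demo)
```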


## Lag and Rolling Features
In this section, we create lag and rolling window features that capture temporal dependencies in the data. Lag features represent past values (here, the previous day's units sold and revenue for each product), which help the model learn from historical patterns. Rolling window features (7-day and 30-day averages of units sold) smooth out short-term fluctuations and highlight longer-term trends. Each rolling window is computed after a one-day shift, so it covers only the days before the current row and does not include the value being predicted. The earliest rows of each product have no history yet, so their lag and rolling values are NaN.


In [4]:
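# Sort by product and date so shift() and rolling() follow chronological order
sales_df = sales_df.sort_values(by=['product', 'date'])

# Previous-day lag of units and revenue, computed per product
sales_df['lag_units_1d'] = sales_df.groupby('product')['units_sold'].shift(1)
sales_df['lag_revenue_1d'] = sales_df.groupby('product')['revenue'].shift(1)

# Rolling means over the 7 and 30 rows before the current one (shift(1) excludes the current day)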
sales_df['rolling_avg_units_7d'] = sales_df.groupby('product')['units_sold'].transform(lambda x: x.shift(1).rolling(7).mean())
sales_df['rolling_avg_units_30d'] = sales_df.groupby('product')['units_sold'].transform(lambda x: x.shift(1).rolling(30).mean())

# Display with lag features
sales_df.head(10)

Unnamed: 0,date,category,product,units_sold,revenue,product_price,promo_flag,is_holiday_season,category_popularity_30d,year,month,day_of_week,is_weekend,lag_units_1d,lag_revenue_1d,rolling_avg_units_7d,rolling_avg_units_30d
5,2020-05-08,Toys,Building Blocks,40,1357.98,33.95,0,0,1.09,2020,5,4,0,,,,
13,2020-05-09,Toys,Building Blocks,64,2116.09,33.06,0,0,0.88,2020,5,5,1,40.0,1357.98,,
21,2020-05-10,Toys,Building Blocks,53,1855.47,35.01,0,0,0.98,2020,5,6,1,64.0,2116.09,,
29,2020-05-11,Toys,Building Blocks,59,2210.09,37.46,1,0,1.14,2020,5,0,0,53.0,1855.47,,
37,2020-05-12,Toys,Building Blocks,50,1775.64,35.51,0,0,0.93,2020,5,1,0,59.0,2210.09,,
45,2020-05-13,Toys,Building Blocks,59,2071.01,35.1,0,0,0.89,2020,5,2,0,50.0,1775.64,,
53,2020-05-14,Toys,Building Blocks,47,1686.54,35.88,0,0,1.09,2020,5,3,0,59.0,2071.01,,
61,2020-05-15,Toys,Building Blocks,50,1806.63,36.13,0,0,1.01,2020,5,4,0,47.0,1686.54,53.142857,
69,2020-05-16,Toys,Building Blocks,57,1967.96,34.53,0,0,0.94,2020,5,5,1,50.0,1806.63,54.571429,
77,2020-05-17,Toys,Building Blocks,46,1555.07,33.81,0,0,1.0,2020,5,6,1,57.0,1967.96,53.571429,
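
To illustrate why the shift comes before the rolling window, here is a minimal sketch that reuses the ten `units_sold` values shown above for Building Blocks. The names `safe_avg` and `leaky_avg` are made up for this comparison. With the shift, the 7-day average on 2020-05-15 is built from 2020-05-08 through 2020-05-14 only; without it, the window would include the current day's own sales.

```python
import pandas as pd

# Units sold for Building Blocks, copied from the output above (2020-05-08 to 2020-05-17)
units = pd.Series([40, 64, 53, 59, 50, 59, 47, 50, 57, 46])

# shift(1) first: the window for a given row ends at the previous row
safe_avg = units.shift(1).rolling(7).mean()

# Without the shift, the window includes the current row's own value
leaky_avg = units.rolling(7).mean()

print(pd.DataFrame({'units': units, 'safe_avg_7d': safe_avg, 'leaky_avg_7d': leaky_avg}))
```

The first seven values of `safe_avg` are NaN because a full window of seven prior rows is not yet available, which matches the empty cells in the output above.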


## Promotional and Price Features
In this section, we engineer features related to pricing strategies and promotional activities. We create a price change flag to identify when product prices change, which can significantly impact sales. We also create an interaction feature between promotions and holiday seasons, as the combined effect of these factors often leads to sales spikes. These features help the model understand how pricing and promotional strategies influence purchasing behavior.


In [5]:
# Price change from the previous day
sales_df['price_change_flag'] = sales_df.groupby('product')['product_price'].diff().fillna(0).ne(0).astype(int)

# Promo and holiday interaction
sales_df['promo_and_holiday'] = sales_df['promo_flag'] & sales_df['is_holiday_season']

sales_df.head(10)

Unnamed: 0,date,category,product,units_sold,revenue,product_price,promo_flag,is_holiday_season,category_popularity_30d,year,month,day_of_week,is_weekend,lag_units_1d,lag_revenue_1d,rolling_avg_units_7d,rolling_avg_units_30d,price_change_flag,promo_and_holiday
5,2020-05-08,Toys,Building Blocks,40,1357.98,33.95,0,0,1.09,2020,5,4,0,,,,,0,0
13,2020-05-09,Toys,Building Blocks,64,2116.09,33.06,0,0,0.88,2020,5,5,1,40.0,1357.98,,,1,0
21,2020-05-10,Toys,Building Blocks,53,1855.47,35.01,0,0,0.98,2020,5,6,1,64.0,2116.09,,,1,0
29,2020-05-11,Toys,Building Blocks,59,2210.09,37.46,1,0,1.14,2020,5,0,0,53.0,1855.47,,,1,0
37,2020-05-12,Toys,Building Blocks,50,1775.64,35.51,0,0,0.93,2020,5,1,0,59.0,2210.09,,,1,0
45,2020-05-13,Toys,Building Blocks,59,2071.01,35.1,0,0,0.89,2020,5,2,0,50.0,1775.64,,,1,0
53,2020-05-14,Toys,Building Blocks,47,1686.54,35.88,0,0,1.09,2020,5,3,0,59.0,2071.01,,,1,0
61,2020-05-15,Toys,Building Blocks,50,1806.63,36.13,0,0,1.01,2020,5,4,0,47.0,1686.54,53.142857,,1,0
69,2020-05-16,Toys,Building Blocks,57,1967.96,34.53,0,0,0.94,2020,5,5,1,50.0,1806.63,54.571429,,1,0
77,2020-05-17,Toys,Building Blocks,46,1555.07,33.81,0,0,1.0,2020,5,6,1,57.0,1967.96,53.571429,,1,0
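
In the output above, the flag is 1 on every row after the first, because the price moves slightly each day. The sketch below reproduces the notebook's logic on the first five prices and, as one possible alternative, flags only larger moves. The 5% threshold and the column name `large_price_change` are assumptions for illustration, not values taken from this project.

```python
import pandas as pd

# Prices for Building Blocks, copied from the first rows of the output above
prices = pd.DataFrame({
    'product': ['Building Blocks'] * 5,
    'product_price': [33.95, 33.06, 35.01, 37.46, 35.51],
})

# Same logic as the notebook: any non-zero difference from the previous row
prices['price_change_flag'] = (
    prices.groupby('product')['product_price'].diff().fillna(0).ne(0).astype(int)
)

# Hypothetical variant: flag only relative moves above an assumed 5% threshold
pct_threshold = 0.05  # assumption, not taken from the notebook
prev_price = prices.groupby('product')['product_price'].shift(1)
rel_change = ((prices['product_price'] - prev_price) / prev_price).fillna(0)
prices['large_price_change'] = rel_change.abs().gt(pct_threshold).astype(int)

print(prices)
```

Whether a thresholded flag carries more signal than the raw one depends on how prices are set in the real data, so it is worth checking both against the target.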


## Volatility Features
In this section, we create features that capture the volatility or variability in sales over time. We calculate the rolling standard deviation of units sold over a 7-day window, which helps identify periods of stable versus unstable sales. Volatility features are important for forecasting as they help the model adjust its predictions based on the historical stability of sales, potentially increasing confidence during stable periods and widening prediction intervals during volatile periods.


In [6]:
sales_df['rolling_std_units_7d'] = sales_df.groupby('product')['units_sold'].transform(lambda x: x.shift(1).rolling(7).std())
sales_df.head(10)

Unnamed: 0,date,category,product,units_sold,revenue,product_price,promo_flag,is_holiday_season,category_popularity_30d,year,month,day_of_week,is_weekend,lag_units_1d,lag_revenue_1d,rolling_avg_units_7d,rolling_avg_units_30d,price_change_flag,promo_and_holiday,rolling_std_units_7d
5,2020-05-08,Toys,Building Blocks,40,1357.98,33.95,0,0,1.09,2020,5,4,0,,,,,0,0,
13,2020-05-09,Toys,Building Blocks,64,2116.09,33.06,0,0,0.88,2020,5,5,1,40.0,1357.98,,,1,0,
21,2020-05-10,Toys,Building Blocks,53,1855.47,35.01,0,0,0.98,2020,5,6,1,64.0,2116.09,,,1,0,
29,2020-05-11,Toys,Building Blocks,59,2210.09,37.46,1,0,1.14,2020,5,0,0,53.0,1855.47,,,1,0,
37,2020-05-12,Toys,Building Blocks,50,1775.64,35.51,0,0,0.93,2020,5,1,0,59.0,2210.09,,,1,0,
45,2020-05-13,Toys,Building Blocks,59,2071.01,35.1,0,0,0.89,2020,5,2,0,50.0,1775.64,,,1,0,
53,2020-05-14,Toys,Building Blocks,47,1686.54,35.88,0,0,1.09,2020,5,3,0,59.0,2071.01,,,1,0,
61,2020-05-15,Toys,Building Blocks,50,1806.63,36.13,0,0,1.01,2020,5,4,0,47.0,1686.54,53.142857,,1,0,8.234654
69,2020-05-16,Toys,Building Blocks,57,1967.96,34.53,0,0,0.94,2020,5,5,1,50.0,1806.63,54.571429,,1,0,6.187545
77,2020-05-17,Toys,Building Blocks,46,1555.07,33.81,0,0,1.0,2020,5,6,1,57.0,1967.96,53.571429,,1,0,4.825527
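
The value 8.234654 shown for 2020-05-15 can be reproduced by hand from the seven preceding `units_sold` values. The sketch below does so, and also shows that pandas uses the sample standard deviation (`ddof=1`) by default, while NumPy defaults to the population version (`ddof=0`). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First seven daily unit counts for Building Blocks, copied from the output above
window = np.array([40, 64, 53, 59, 50, 59, 47], dtype=float)

sample_std = window.std(ddof=1)      # divides by n - 1, matching pandas' default
population_std = window.std(ddof=0)  # divides by n, NumPy's default

# The pandas rolling std on the same values, for comparison
pandas_std = pd.Series(window).rolling(7).std().iloc[-1]

print(round(sample_std, 6), round(population_std, 6), round(pandas_std, 6))
```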


## Daily Contextual Features
In this section, we create aggregated features at the daily level to provide broader context for each row. We calculate total revenue, the number of records, and the number of unique categories sold each day. Note that transaction_count counts the rows of sales_df for a given date (the sample above appears to hold one row per product per day), not individual customer transactions. These daily contextual features help the model understand the overall business environment on a given day. For example, a day with unusually high total sales might indicate a sale event or holiday shopping period.


In [7]:
# Aggregate daily stats
daily_context = sales_df.groupby('date').agg(
    total_sales=('revenue', 'sum'),
    transaction_count=('revenue', 'count'),
    unique_categories=('category', 'nunique')
).reset_index()

# Merge to main df
sales_df = pd.merge(sales_df, daily_context, on='date', how='left')
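
A left merge like the one above should leave the row count of the main frame unchanged, since each date appears once in the aggregate. The following sketch, on made-up toy rows, shows one way to guard that assumption with the `validate` argument of `pd.merge`, which raises an error if the right-hand frame contains duplicate keys.

```python
import pandas as pd

# Hypothetical toy rows: two dates, two records each
rows = pd.DataFrame({
    'date': pd.to_datetime(['2020-05-08', '2020-05-08', '2020-05-09', '2020-05-09']),
    'category': ['Toys', 'Books', 'Toys', 'Books'],
    'revenue': [100.0, 50.0, 120.0, 40.0],
})

daily = rows.groupby('date').agg(
    total_sales=('revenue', 'sum'),
    transaction_count=('revenue', 'count'),
    unique_categories=('category', 'nunique'),
).reset_index()

# validate='many_to_one' raises MergeError if `daily` has duplicate dates
merged = pd.merge(rows, daily, on='date', how='left', validate='many_to_one')

print(merged)
```

Because `total_sales` for a date includes each row's own revenue, these same-day aggregates are only known once the day is over. Whether they can serve as predictors depends on how the forecast is set up.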

## Category-Based Features
In this section, we create features that capture sales patterns at the category level. We aggregate sales by date and category, then pivot the data to create separate columns for each product category. This transformation allows the model to learn category-specific patterns and relationships. Understanding how different product categories perform over time is crucial for accurate forecasting, especially when certain categories have distinct seasonal patterns or growth trends.


In [8]:
# Create category-specific features
category_daily = sales_df.groupby(['date', 'category'])['revenue'].sum().reset_index()

# Pivot to get categories as columns
category_pivot = category_daily.pivot(index='date', columns='category', values='revenue').reset_index()
category_pivot = category_pivot.fillna(0)  # Fill NaN with 0

# Display pivoted data
category_pivot.head()

category,date,Beauty,Books,Clothing,Electronics,Furniture,Groceries,Sports,Toys
0,2020-05-08,3123.75,1869.15,3145.9,991.63,822.02,1556.42,1793.46,1357.98
1,2020-05-09,3522.52,2413.48,3634.32,840.23,883.3,1620.57,1569.33,2116.09
2,2020-05-10,3195.38,2424.01,4321.36,882.06,1215.43,1653.73,2609.5,1855.47
3,2020-05-11,3494.76,2288.23,4054.6,1498.61,1499.52,1874.54,1931.15,2210.09
4,2020-05-12,3110.31,2322.06,4121.08,2432.77,1013.2,1652.64,1989.02,1775.64
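
The group, pivot, and fill steps above can also be written as a single `pivot_table` call. This is a sketch of an alternative on made-up toy data, not a change to the notebook's approach. Unlike `pivot`, `pivot_table` aggregates duplicate date and category pairs instead of raising an error, and `fill_value` handles categories with no sales on a given date.

```python
import pandas as pd

# Hypothetical toy data: Books has no sales on the second date
category_daily = pd.DataFrame({
    'date': pd.to_datetime(['2020-05-08', '2020-05-08', '2020-05-09']),
    'category': ['Books', 'Toys', 'Toys'],
    'revenue': [10.0, 20.0, 30.0],
})

# pivot_table aggregates duplicates and fills missing combinations in one call
wide = category_daily.pivot_table(
    index='date', columns='category', values='revenue',
    aggfunc='sum', fill_value=0,
).reset_index()
wide.columns.name = None  # drop the leftover 'category' label on the column axis

print(wide)
```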


In [9]:
# Rebuild the date-level summary as its own frame (one row per date), now including the mean revenue per row
daily_sales = sales_df.groupby('date').agg(
    total_sales=('revenue', 'sum'),
    avg_transaction=('revenue', 'mean'),
    transaction_count=('revenue', 'count'),
    unique_categories=('category', 'nunique')
).reset_index()

daily_sales.head()

Unnamed: 0,date,total_sales,avg_transaction,transaction_count,unique_categories
0,2020-05-08,14660.31,1832.53875,8,8
1,2020-05-09,16599.84,2074.98,8,8
2,2020-05-10,18156.94,2269.6175,8,8
3,2020-05-11,18851.5,2356.4375,8,8
4,2020-05-12,18416.72,2302.09,8,8


In [10]:
# Merge category features with daily sales
features_df = pd.merge(daily_sales, category_pivot, on='date', how='left')
features_df = features_df.fillna(0)  # Fill any NaN values

# Display final feature dataframe
features_df.head()

Unnamed: 0,date,total_sales,avg_transaction,transaction_count,unique_categories,Beauty,Books,Clothing,Electronics,Furniture,Groceries,Sports,Toys
0,2020-05-08,14660.31,1832.53875,8,8,3123.75,1869.15,3145.9,991.63,822.02,1556.42,1793.46,1357.98
1,2020-05-09,16599.84,2074.98,8,8,3522.52,2413.48,3634.32,840.23,883.3,1620.57,1569.33,2116.09
2,2020-05-10,18156.94,2269.6175,8,8,3195.38,2424.01,4321.36,882.06,1215.43,1653.73,2609.5,1855.47
3,2020-05-11,18851.5,2356.4375,8,8,3494.76,2288.23,4054.6,1498.61,1499.52,1874.54,1931.15,2210.09
4,2020-05-12,18416.72,2302.09,8,8,3110.31,2322.06,4121.08,2432.77,1013.2,1652.64,1989.02,1775.64
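
One simple sanity check after this merge is that the category columns add up to `total_sales` for each date. The sketch below runs that check on the first row of the output above; the variable names are illustrative.

```python
import pandas as pd

# First row of the feature frame, copied from the output above
row = pd.Series({
    'total_sales': 14660.31,
    'Beauty': 3123.75, 'Books': 1869.15, 'Clothing': 3145.90, 'Electronics': 991.63,
    'Furniture': 822.02, 'Groceries': 1556.42, 'Sports': 1793.46, 'Toys': 1357.98,
})

category_cols = ['Beauty', 'Books', 'Clothing', 'Electronics',
                 'Furniture', 'Groceries', 'Sports', 'Toys']

# The category columns should add up to the daily total (up to float rounding)
gap = abs(row[category_cols].sum() - row['total_sales'])
print(f"difference: {gap:.6f}")
```

On the full frame, the same idea could be applied column-wise with `features_df[category_cols].sum(axis=1)` and a small tolerance.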


## Feature Selection and Preparation
In this section, we finalize our feature set for model training. We add calendar features to the date-level feature dataframe, then add 1-, 7-, and 14-day lags of each category's revenue. The lags leave NaN values in the first 14 rows, which we drop before saving so that the model receives clean, consistent data. Note that the final dataframe is built from the daily aggregates and category columns only. The product-level features created earlier on sales_df (per-product lags, rolling statistics, the price change flag, and the promo and holiday interaction) are not part of the saved file.


In [11]:
# Extract date components for the feature dataframe
features_df['year'] = features_df['date'].dt.year
features_df['month'] = features_df['date'].dt.month
features_df['day_of_week'] = features_df['date'].dt.dayofweek
features_df['is_weekend'] = features_df['day_of_week'].isin([5, 6]).astype(int)
features_df['week_of_year'] = features_df['date'].dt.isocalendar().week
features_df['quarter'] = features_df['date'].dt.quarter
features_df['is_month_end'] = features_df['date'].dt.is_month_end.astype(int)
features_df['is_month_start'] = features_df['date'].dt.is_month_start.astype(int)
features_df['is_november'] = (features_df['date'].dt.month == 11).astype(int)  # Black Friday month indicator

# Lag features per category (1, 7, and 14 rows back, which equals days when every date is present)
target_cols = ['Beauty', 'Books', 'Clothing', 'Electronics', 'Furniture', 'Groceries', 'Sports', 'Toys']

for lag in [1, 7, 14]:
    for col in target_cols:
        features_df[f'{col}_lag_{lag}'] = features_df[col].shift(lag)

# Display final feature set
features_df.head()

Unnamed: 0,date,total_sales,avg_transaction,transaction_count,unique_categories,Beauty,Books,Clothing,Electronics,Furniture,...,Sports_lag_7,Toys_lag_7,Beauty_lag_14,Books_lag_14,Clothing_lag_14,Electronics_lag_14,Furniture_lag_14,Groceries_lag_14,Sports_lag_14,Toys_lag_14
0,2020-05-08,14660.31,1832.53875,8,8,3123.75,1869.15,3145.9,991.63,822.02,...,,,,,,,,,,
1,2020-05-09,16599.84,2074.98,8,8,3522.52,2413.48,3634.32,840.23,883.3,...,,,,,,,,,,
2,2020-05-10,18156.94,2269.6175,8,8,3195.38,2424.01,4321.36,882.06,1215.43,...,,,,,,,,,,
3,2020-05-11,18851.5,2356.4375,8,8,3494.76,2288.23,4054.6,1498.61,1499.52,...,,,,,,,,,,
4,2020-05-12,18416.72,2302.09,8,8,3110.31,2322.06,4121.08,2432.77,1013.2,...,,,,,,,,,,
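
`shift(lag)` moves values by rows, not by calendar days, so the lags above equal 1, 7, and 14 days only if every date is present. The displayed rows show no gaps, but as a guard, a sketch on a made-up series with one missing date shows the difference between a row-based and a calendar-based lag.

```python
import pandas as pd

# Hypothetical daily series with 2020-05-10 missing
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-05-08', '2020-05-09', '2020-05-11']),
    'Toys': [10.0, 20.0, 30.0],
})

# Row-based shift: the value on 05-11 is paired with 05-09, two days earlier
toy['Toys_lag_1_rows'] = toy['Toys'].shift(1)

# Calendar-based shift: reindex to a full daily range first, then shift
daily = toy.set_index('date')['Toys'].asfreq('D')
calendar_lag = daily.shift(1)

print(toy)
print(calendar_lag)
```

With the calendar-based version, the lag for 2020-05-11 is NaN because no value exists for 2020-05-10, instead of silently borrowing the value from two days earlier.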


In [12]:
# Drop rows with NaN values (the first 14 rows, which lack a 14-day lag)
features_df = features_df.dropna()

features_df['date'] = pd.to_datetime(features_df['date'])  # ensure datetime type
features_df.set_index('date', inplace=True)

display(features_df.head())
# Save the engineered features
features_df.to_csv('../data/processed/sales_engineered_features.csv', index=True)
print(f"Saved engineered features with shape: {features_df.shape}")

date,total_sales,avg_transaction,transaction_count,unique_categories,Beauty,Books,Clothing,Electronics,Furniture,Groceries,...,Sports_lag_7,Toys_lag_7,Beauty_lag_14,Books_lag_14,Clothing_lag_14,Electronics_lag_14,Furniture_lag_14,Groceries_lag_14,Sports_lag_14,Toys_lag_14
2020-05-22,14692.66,1836.5825,8,8,2882.88,1679.06,3134.6,1721.19,1048.85,1371.86,...,1831.87,1806.63,3123.75,1869.15,3145.9,991.63,822.02,1556.42,1793.46,1357.98
2020-05-23,18454.54,2306.8175,8,8,3803.92,2484.18,3449.02,1891.31,1280.76,1628.88,...,1843.56,1967.96,3522.52,2413.48,3634.32,840.23,883.3,1620.57,1569.33,2116.09
2020-05-24,18929.57,2366.19625,8,8,3381.79,2416.94,3749.46,1830.46,1494.5,1543.91,...,1866.54,1555.07,3195.38,2424.01,4321.36,882.06,1215.43,1653.73,2609.5,1855.47
2020-05-25,17785.61,2223.20125,8,8,3363.05,2151.65,3645.68,2138.67,947.54,1322.72,...,2750.63,2029.09,3494.76,2288.23,4054.6,1498.61,1499.52,1874.54,1931.15,2210.09
2020-05-26,14765.07,1845.63375,8,8,2934.02,2091.65,3059.44,1647.3,502.46,1505.56,...,1860.68,1948.32,3110.31,2322.06,4121.08,2432.77,1013.2,1652.64,1989.02,1775.64


Saved engineered features with shape: (1812, 45)
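
CSV stores the date index as plain text, so the next notebook needs to parse it again when loading the file. The sketch below shows one way to do that round trip. It uses a made-up two-row frame and an in-memory buffer in place of the real file, so the names `features` and `reloaded` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the saved feature frame
features = pd.DataFrame(
    {'total_sales': [14692.66, 18454.54], 'is_weekend': [0, 1]},
    index=pd.to_datetime(['2020-05-22', '2020-05-23']).rename('date'),
)

# Write to an in-memory buffer instead of disk to keep the sketch self-contained
buffer = io.StringIO()
features.to_csv(buffer, index=True)
buffer.seek(0)

# parse_dates + index_col restore the DatetimeIndex that CSV stores as text
reloaded = pd.read_csv(buffer, index_col='date', parse_dates=['date'])

print(reloaded.index.dtype)
```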


## Conclusion

We've created a comprehensive set of features for our sales forecasting model, including:
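- Time-based features (year, month, day of week, week of year, quarter, weekend, month start/end, and November indicators)
- Lagged features (previous-day units and revenue per product; 1-, 7-, and 14-day lags of category revenue)
- Rolling window statistics per product (7-day and 30-day means and a 7-day standard deviation of units sold)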
- Category-specific sales amounts

These features will be used in the next notebook for model training.
