# Feature Engineering

## Definition and a summary

Feature engineering is the process of transforming raw data into input features that better represent the underlying problem, which improves the performance and accuracy of machine learning models. It involves selecting, creating, and transforming variables so that the data carries more predictive signal and is easier for algorithms to learn from.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import multiprocessing as mp
import gc
import datetime
from sklearn.preprocessing import LabelEncoder
import calendar
from scipy.sparse import csr_matrix,hstack
import tensorflow as tf
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from tqdm import tqdm
import pickle

First, let's read in the dataframes.

In [2]:
train=pd.read_csv('final_dataframe.csv')
test=pd.read_csv('final_dataframe_test.csv')
final_test=pd.read_csv('final_future_data.csv')


Reading all the dataframes took more than 10 minutes. To reduce their memory footprint, I convert every categorical variable to integers. The fitted label encoders are also saved so they can be reused to encode future, unseen data.

In [3]:
lbl=LabelEncoder()
train['item_id']=lbl.fit_transform(train['item_id'])
test['item_id']=lbl.transform(test['item_id'])
final_test['item_id']=lbl.transform(final_test['item_id'])
pickle.dump(lbl,open('label_encoder_item_id.sav','wb'))

In [4]:
lbl=LabelEncoder()
train['dept_id']=lbl.fit_transform(train['dept_id'])
test['dept_id']=lbl.transform(test['dept_id'])
final_test['dept_id']=lbl.transform(final_test['dept_id'])
pickle.dump(lbl,open('label_encoder_dept_id.sav','wb'))

In [5]:
lbl=LabelEncoder()
train['cat_id']=lbl.fit_transform(train['cat_id'])
test['cat_id']=lbl.transform(test['cat_id'])
final_test['cat_id']=lbl.transform(final_test['cat_id'])
pickle.dump(lbl,open('label_encoder_cat_id.sav','wb'))

In [6]:
lbl=LabelEncoder()
train['store_id']=lbl.fit_transform(train['store_id'])
test['store_id']=lbl.transform(test['store_id'])
final_test['store_id']=lbl.transform(final_test['store_id'])
pickle.dump(lbl,open('label_encoder_store_id.sav','wb'))

In [7]:
lbl=LabelEncoder()
train['state_id']=lbl.fit_transform(train['state_id'])
test['state_id']=lbl.transform(test['state_id'])
final_test['state_id']=lbl.transform(final_test['state_id'])
pickle.dump(lbl,open('label_encoder_state_id.sav','wb'))

In [8]:
# Handle event_name_1 encoding
train['event_name_1'] = train['event_name_1'].fillna('no_event')
test['event_name_1'] = test['event_name_1'].fillna('no_event')
final_test['event_name_1'] = final_test['event_name_1'].fillna('no_event')

# Ensure all values are strings
train['event_name_1'] = train['event_name_1'].astype(str)
test['event_name_1'] = test['event_name_1'].astype(str)
final_test['event_name_1'] = final_test['event_name_1'].astype(str)

# Combine all values for fitting
all_values = np.concatenate([
    train['event_name_1'].values,
    test['event_name_1'].values,
    final_test['event_name_1'].values
])

# Fit and transform
lbl = LabelEncoder()
lbl.fit(all_values)
train['event_name_1'] = lbl.transform(train['event_name_1'])
test['event_name_1'] = lbl.transform(test['event_name_1'])
final_test['event_name_1'] = lbl.transform(final_test['event_name_1'])

pickle.dump(lbl, open('label_encoder_event_name_1.sav', 'wb'))

In [9]:
# Handle event_name_2 encoding
train['event_name_2'] = train['event_name_2'].fillna('no_event')
test['event_name_2'] = test['event_name_2'].fillna('no_event')
final_test['event_name_2'] = final_test['event_name_2'].fillna('no_event')

# Ensure all values are strings
train['event_name_2'] = train['event_name_2'].astype(str)
test['event_name_2'] = test['event_name_2'].astype(str)
final_test['event_name_2'] = final_test['event_name_2'].astype(str)

# Combine all values for fitting
all_values = np.concatenate([
    train['event_name_2'].values,
    test['event_name_2'].values,
    final_test['event_name_2'].values
])

# Fit and transform
lbl = LabelEncoder()
lbl.fit(all_values)
train['event_name_2'] = lbl.transform(train['event_name_2'])
test['event_name_2'] = lbl.transform(test['event_name_2'])
final_test['event_name_2'] = lbl.transform(final_test['event_name_2'])

pickle.dump(lbl, open('label_encoder_event_name_2.sav', 'wb'))

In [10]:
# Handle event_type_1 encoding
train['event_type_1'] = train['event_type_1'].fillna('no_event')
test['event_type_1'] = test['event_type_1'].fillna('no_event')
final_test['event_type_1'] = final_test['event_type_1'].fillna('no_event')

# Ensure all values are strings
train['event_type_1'] = train['event_type_1'].astype(str)
test['event_type_1'] = test['event_type_1'].astype(str)
final_test['event_type_1'] = final_test['event_type_1'].astype(str)

# Combine all values for fitting
all_values = np.concatenate([
    train['event_type_1'].values,
    test['event_type_1'].values,
    final_test['event_type_1'].values
])

# Fit and transform
lbl = LabelEncoder()
lbl.fit(all_values)
train['event_type_1'] = lbl.transform(train['event_type_1'])
test['event_type_1'] = lbl.transform(test['event_type_1'])
final_test['event_type_1'] = lbl.transform(final_test['event_type_1'])

pickle.dump(lbl, open('label_encoder_event_type_1.sav', 'wb'))

In [11]:
# Handle event_type_2 encoding
train['event_type_2'] = train['event_type_2'].fillna('no_event')
test['event_type_2'] = test['event_type_2'].fillna('no_event')
final_test['event_type_2'] = final_test['event_type_2'].fillna('no_event')

# Ensure all values are strings
train['event_type_2'] = train['event_type_2'].astype(str)
test['event_type_2'] = test['event_type_2'].astype(str)
final_test['event_type_2'] = final_test['event_type_2'].astype(str)

# Combine all values for fitting
all_values = np.concatenate([
    train['event_type_2'].values,
    test['event_type_2'].values,
    final_test['event_type_2'].values
])

# Fit and transform
lbl = LabelEncoder()
lbl.fit(all_values)
train['event_type_2'] = lbl.transform(train['event_type_2'])
test['event_type_2'] = lbl.transform(test['event_type_2'])
final_test['event_type_2'] = lbl.transform(final_test['event_type_2'])

pickle.dump(lbl, open('label_encoder_event_type_2.sav', 'wb'))

In [14]:
lbl=LabelEncoder()
train['year']=lbl.fit_transform(train['year'])
test['year']=lbl.transform(test['year'])
final_test['year']=lbl.transform(final_test['year'])
pickle.dump(lbl,open('label_encoder_year.sav','wb'))
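
To illustrate why the encoders are pickled, here is a minimal sketch that fits a `LabelEncoder` on a toy column, saves it, and reloads it to encode later data. The toy values and the file name are assumptions made for the example and are not part of the dataset used in this notebook.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the values and the file name are illustrative assumptions.
train_toy = pd.DataFrame({'store_id': ['CA_1', 'TX_1', 'WI_1', 'CA_1']})
future_toy = pd.DataFrame({'store_id': ['WI_1', 'CA_1']})

encoder = LabelEncoder()
train_toy['store_id'] = encoder.fit_transform(train_toy['store_id'])

# Persist the fitted encoder so later data gets the same integer codes.
path = os.path.join(tempfile.mkdtemp(), 'label_encoder_store_id_toy.sav')
with open(path, 'wb') as f:
    pickle.dump(encoder, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

future_codes = restored.transform(future_toy['store_id'])
# inverse_transform recovers the original labels from the codes.
future_labels = restored.inverse_transform(future_codes)
```

Note that `transform` raises a `ValueError` for labels that were not seen during fitting, which is why the event columns are fitted on the combined values of all three dataframes.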

With the encoding done, we can remove unnecessary columns. First, let's merge the three state-level SNAP columns into a single feature named `snap`.

In [15]:
%%time
# state_id was label-encoded in cell 7, so look up each state's integer code
# from the saved encoder instead of comparing against the raw strings.
with open('label_encoder_state_id.sav','rb') as f:
    state_lbl=pickle.load(f)
state_codes={state:code for code,state in enumerate(state_lbl.classes_)}

for data in (train,test,final_test):
    for state in ['CA','TX','WI']:
        mask=data['state_id']==state_codes[state]
        data.loc[mask,'snap']=data.loc[mask,'snap_'+state]
    data.drop(['snap_CA','snap_TX','snap_WI'],axis=1,inplace=True)

CPU times: total: 1min 56s
Wall time: 2min 20s
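
The same selection can also be written in one vectorized step. The following is a minimal sketch on a toy frame, assuming `state_id` still holds the raw state labels (with label-encoded values the conditions would compare against the integer codes instead). The toy values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with raw (not yet encoded) state labels; values are illustrative.
toy = pd.DataFrame({
    'state_id': ['CA', 'TX', 'WI', 'CA'],
    'snap_CA': [1, 0, 0, 0],
    'snap_TX': [0, 1, 0, 1],
    'snap_WI': [0, 0, 0, 1],
})

states = ['CA', 'TX', 'WI']
conditions = [toy['state_id'] == s for s in states]
choices = [toy['snap_' + s] for s in states]

# For each row, take the SNAP flag that belongs to that row's own state.
toy['snap'] = np.select(conditions, choices, default=0)
toy = toy.drop(columns=['snap_' + s for s in states])
```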


`weekday` carries the same information as `wday`, so there is no need to keep it. The `wm_yr_wk` feature is dropped for the same reason.

In [16]:
%%time
train.drop('weekday',axis=1,inplace=True)
train.drop('wm_yr_wk',axis=1,inplace=True)
 
test.drop('weekday',axis=1,inplace=True)
test.drop('wm_yr_wk',axis=1,inplace=True)

final_test.drop('weekday',axis=1,inplace=True)
final_test.drop('wm_yr_wk',axis=1,inplace=True)

CPU times: total: 3min 44s
Wall time: 3min 58s


## Features that include time intervals

a) Number of the week - a function that returns the ISO week number of a given date

In [17]:
def get_week_number(x):
    date=datetime.date.fromisoformat(x)
    return date.isocalendar()[1]

In [18]:
train['week_number']=train['date'].apply(lambda x:get_week_number(x))
test['week_number']=test['date'].apply(lambda x:get_week_number(x))
final_test['week_number']=final_test['date'].apply(lambda x:get_week_number(x))

b) Season of the year - a function that maps the month to its season

In [19]:
def get_season(x):
    if x in [12,1,2]:
        return 0      #"Winter"
    elif x in [3,4,5]:
        return 1   #"Spring"
    elif x in [6,7,8]:
        return 2   #"Summer"
    else:
        return 3   #"Autumn"

In [20]:
train['season']=train['month'].apply(lambda x:get_season(x))
test['season']=test['month'].apply(lambda x:get_season(x))
final_test['season']=final_test['month'].apply(lambda x:get_season(x))

c) Start of a quarter - a function that checks whether a day is the first day of a quarter

In [21]:
def check_if_quarter_begin(x):
    date=datetime.date.fromisoformat(x)
    # Quarters start on the first day of January, April, July and October
    return 1 if (date.day==1 and (date.month in [1,4,7,10])) else 0

In [22]:
train['quarter_start']=train['date'].apply(lambda x:check_if_quarter_begin(x))
test['quarter_start']=test['date'].apply(lambda x:check_if_quarter_begin(x))
final_test['quarter_start']=final_test['date'].apply(lambda x:check_if_quarter_begin(x))

d) End of a quarter - a function that checks whether a day is the last day of a quarter

In [23]:
def check_if_quarter_end(x):
    date=datetime.date.fromisoformat(x)
    day,month=date.day,date.month
    if (day==31 and month==3) or (day==30 and month==6) or (day==30 and month==9) or (day==31 and month==12):
        return 1
    else:
        return 0

In [24]:
train['quarter_end']=train['date'].apply(lambda x:check_if_quarter_end(x))
test['quarter_end']=test['date'].apply(lambda x:check_if_quarter_end(x))
final_test['quarter_end']=final_test['date'].apply(lambda x:check_if_quarter_end(x))

e) Start of a month - The function below checks if the day is the beginning of the month

In [25]:
def month_start(x):
    day=datetime.date.fromisoformat(x).day
    return 1 if day==1 else 0

In [26]:
train['month_start']=train['date'].apply(lambda x:month_start(x))
test['month_start']=test['date'].apply(lambda x:month_start(x))
final_test['month_start']=final_test['date'].apply(lambda x:month_start(x))

f) End of a month - The function below checks if the day is the end of the month

In [27]:
def month_end(x):
    date=datetime.date.fromisoformat(x)
    # monthrange returns (weekday of the 1st, number of days in the month), leap years included
    last_day=calendar.monthrange(date.year,date.month)[1]
    return 1 if date.day==last_day else 0

In [28]:

train['month_end']=train['date'].apply(lambda x:month_end(x))
test['month_end']=test['date'].apply(lambda x:month_end(x))
final_test['month_end']=final_test['date'].apply(lambda x:month_end(x))

g) Start of a year - a function that checks whether a given day is the first day of a year

In [29]:
def year_start(x):
    date=datetime.date.fromisoformat(x)
    return 1 if (date.day==1 and date.month==1) else 0

In [30]:
train['year_start']=train['date'].apply(lambda x:year_start(x))
test['year_start']=test['date'].apply(lambda x:year_start(x))
final_test['year_start']=final_test['date'].apply(lambda x:year_start(x))

h) End of a year - a function that checks whether a given day is the last day of a year

In [31]:
def year_end(x):
    date=datetime.date.fromisoformat(x)
    return 1 if (date.day==31 and date.month==12) else 0

In [32]:
train['year_end']=train['date'].apply(lambda x:year_end(x))
test['year_end']=test['date'].apply(lambda x:year_end(x))
final_test['year_end']=final_test['date'].apply(lambda x:year_end(x))
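
The row-wise `apply` calls above can be slow on large frames. As a possible alternative, pandas exposes the same calendar flags through the `.dt` accessor. The sketch below runs on a toy frame with made-up dates and assumes the `date` column holds ISO-formatted strings, as the functions above do.

```python
import pandas as pd

# Toy ISO-formatted date strings (illustrative values only).
toy = pd.DataFrame({'date': ['2016-01-01', '2016-02-29', '2016-03-31',
                             '2016-10-01', '2016-12-31']})
dates = pd.to_datetime(toy['date'])

# ISO week number, matching date.isocalendar()[1]
toy['week_number'] = dates.dt.isocalendar().week.astype('int16')

# Boolean calendar flags converted to 0/1 integers
toy['quarter_start'] = dates.dt.is_quarter_start.astype('int8')
toy['quarter_end'] = dates.dt.is_quarter_end.astype('int8')
toy['month_start'] = dates.dt.is_month_start.astype('int8')
toy['month_end'] = dates.dt.is_month_end.astype('int8')
toy['year_start'] = dates.dt.is_year_start.astype('int8')
toy['year_end'] = dates.dt.is_year_end.astype('int8')
```

Because these accessors work on the whole column at once, they avoid parsing every date string in Python code for each feature.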

We hold out the last 28 days of the training data as a cross-validation set that can be used for later modelling.

In [33]:
cv=train[train['date']>='2016-03-28']
train=train[train['date']<'2016-03-28']
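
The split above compares date strings directly. A small sketch of why this works: ISO-formatted dates sort chronologically as plain strings. The toy date range below, including its end date of 2016-04-24, is an assumption chosen so that the cutoff leaves exactly 28 days in the holdout.

```python
import pandas as pd

# Toy daily series; the date range is an illustrative assumption.
toy = pd.DataFrame({
    'date': pd.date_range('2016-03-01', '2016-04-24').strftime('%Y-%m-%d')
})
toy['sales'] = range(len(toy))

cutoff = '2016-03-28'
# ISO date strings sort chronologically, so plain string comparison is enough.
cv_toy = toy[toy['date'] >= cutoff]
train_toy = toy[toy['date'] < cutoff]
```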

Last but not least, I will create time-series features. These are direct features added to the train and test data. The code below builds one wide table of sales that covers all days.

In [34]:
%%time
gc.collect()
tt=pd.concat([train,cv,test,final_test])
tt.sort_values(['id','date'],inplace=True)
df=tt.pivot_table(index=['item_id','store_id'],columns='date',values='sales')
df.fillna(0,inplace=True)

CPU times: total: 3min 55s
Wall time: 4min 32s


The next step is to calculate a rolling mean and standard deviation. A shift of 28 days is used, as the task requires, which also avoids data leakage.

In [44]:
%%time
# Reduce df memory before heavy operations
if df.values.dtype != 'float16':  # Use float16 instead of float32 to reduce memory
    df = df.astype('float16')

date_cols = list(df.columns)
ncols = len(date_cols)
chunk_size = 500  # Reduced chunk size to prevent memory issues

# Track created features to avoid duplicates
created_features = set()

# Function to safely merge features
def safe_merge(df, features, name):
    # Drop the feature column if it already exists to avoid conflicts
    if name in df.columns:
        df = df.drop(columns=[name])
    return df.merge(features, on=['item_id', 'store_id', 'date'], how='left')

for aggregate in ['mean', 'std']:
    for shif in [28]:
        for r in [7, 14, 30, 60, 360]:
            name = f"roll_{r}_shift_{shif}_{aggregate}"
            if name in created_features:
                print(f"Skipping {name} - already created")
                continue
                
            pad = r - 1 + shif  # the window needs r-1 earlier columns and the shift needs shif more
            feature_created = False
            all_features = []
            
            for start in range(0, ncols, chunk_size):
                # Clear memory at start of each iteration
                gc.collect()
                
                left = max(0, start - pad)
                right = min(ncols, start + chunk_size)
                keep_start = start
                keep_end = min(start + chunk_size, ncols)

                try:
                    # Get subset of columns including padding
                    sub_cols = date_cols[left:right]
                    sub_df = df.loc[:, sub_cols].copy()  # Make an explicit copy

                    # Compute rolling stats
                    roll = sub_df.rolling(r, axis=1).agg(aggregate).shift(shif, axis=1)
                    
                    # Keep only the needed columns
                    keep_cols = date_cols[keep_start:keep_end]
                    roll_sel = roll.loc[:, [c for c in keep_cols if c in roll.columns]]
                    
                    if roll_sel.shape[1] == 0:
                        del sub_df, roll, roll_sel
                        continue

                    # Process in smaller batches for melting
                    batch_size = 50000  # Reduced batch size
                    n_batches = (len(roll_sel) + batch_size - 1) // batch_size
                    
                    for b in range(n_batches):
                        start_idx = b * batch_size
                        end_idx = min((b + 1) * batch_size, len(roll_sel))
                        
                        # Process batch
                        roll_batch = roll_sel.iloc[start_idx:end_idx].reset_index()
                        value_vars = [c for c in roll_batch.columns if c not in ('item_id', 'store_id')]
                        
                        if len(value_vars) == 0:
                            continue
                            
                        roll_melt = pd.melt(roll_batch, 
                                          id_vars=['item_id', 'store_id'],
                                          value_vars=value_vars,
                                          var_name='date',
                                          value_name=name)
                        
                        roll_melt['date'] = roll_melt['date'].astype(str)
                        all_features.append(roll_melt)
                        
                        del roll_batch, roll_melt
                        gc.collect()
                        feature_created = True
                        
                    if feature_created:
                        print(f"Feature created named := {name} (cols {keep_start}:{keep_end})")
                    
                except MemoryError:
                    print(f"Memory error encountered for {name}, chunk {keep_start}:{keep_end}. Skipping...")
                    continue
                finally:
                    # Clean up; rebinding to None is safe even if a name was never assigned or already deleted
                    sub_df = roll = roll_sel = None
                    gc.collect()
            
            if feature_created and all_features:
                try:
                    # Combine all features for this rolling window
                    print(f"Combining features for {name}...")
                    combined_features = pd.concat(all_features, ignore_index=True)
                    combined_features = combined_features.drop_duplicates(['item_id', 'store_id', 'date'])
                    
                    # Process one dataset at a time to manage memory
                    print(f"Merging {name} into train...")
                    train = safe_merge(train, combined_features, name)
                    gc.collect()
                    
                    print(f"Merging {name} into cv...")
                    cv = safe_merge(cv, combined_features, name)
                    gc.collect()
                    
                    print(f"Merging {name} into test...")
                    test = safe_merge(test, combined_features, name)
                    gc.collect()
                    
                    print(f"Merging {name} into final_test...")
                    final_test = safe_merge(final_test, combined_features, name)
                    gc.collect()
                    
                    created_features.add(name)
                    print(f"Successfully added feature {name}")
                    
                except Exception as e:
                    print(f"Error processing {name}: {str(e)}")
                finally:
                    del combined_features
                    gc.collect()
            
            del all_features
            gc.collect()



Feature created named := roll_7_shift_28_mean (cols 0:500)
Feature created named := roll_7_shift_28_mean (cols 500:1000)
Feature created named := roll_7_shift_28_mean (cols 1000:1500)
Feature created named := roll_7_shift_28_mean (cols 1500:1969)
Combining features for roll_7_shift_28_mean...
Merging roll_7_shift_28_mean into train...
Merging roll_7_shift_28_mean into cv...
Merging roll_7_shift_28_mean into test...
Merging roll_7_shift_28_mean into final_test...
Successfully added feature roll_7_shift_28_mean
Feature created named := roll_14_shift_28_mean (cols 0:500)
Feature created named := roll_14_shift_28_mean (cols 500:1000)
Feature created named := roll_14_shift_28_mean (cols 1000:1500)
Feature created named := roll_14_shift_28_mean (cols 1500:1969)
Combining features for roll_14_shift_28_mean...
Merging roll_14_shift_28_mean into train...
Merging roll_14_shift_28_mean into cv...
Merging roll_14_shift_28_mean into test...
Merging roll_14_shift_28_mean into final_test...
Successfully added feature roll_14_shift_28_mean
Feature created named := roll_30_shift_28_mean (cols 0:500)
Feature created named := roll_30_shift_28_mean (cols 500:1000)
Feature created named := roll_30_shift_28_mean (cols 1000:1500)
Feature created named := roll_30_shift_28_mean (cols 1500:1969)
Combining features for roll_30_shift_28_mean...
Merging roll_30_shift_28_mean into train...
Merging roll_30_shift_28_mean into cv...
Merging roll_30_shift_28_mean into test...
Merging roll_30_shift_28_mean into final_test...
Successfully added feature roll_30_shift_28_mean
Feature created named := roll_60_shift_28_mean (cols 0:500)
Feature created named := roll_60_shift_28_mean (cols 500:1000)
Feature created named := roll_60_shift_28_mean (cols 1000:1500)
Feature created named := roll_60_shift_28_mean (cols 1500:1969)
Combining features for roll_60_shift_28_mean...
Merging roll_60_shift_28_mean into train...
Merging roll_60_shift_28_mean into cv...
Merging roll_60_shift_28_mean into test...
Merging roll_60_shift_28_mean into final_test...
Successfully added feature roll_60_shift_28_mean
Feature created named := roll_360_shift_28_mean (cols 0:500)
Feature created named := roll_360_shift_28_mean (cols 500:1000)
Feature created named := roll_360_shift_28_mean (cols 1000:1500)
Feature created named := roll_360_shift_28_mean (cols 1500:1969)
Combining features for roll_360_shift_28_mean...
Merging roll_360_shift_28_mean into train...
Merging roll_360_shift_28_mean into cv...
Merging roll_360_shift_28_mean into test...
Merging roll_360_shift_28_mean into final_test...
Successfully added feature roll_360_shift_28_mean
Feature created named := roll_7_shift_28_std (cols 0:500)
Feature created named := roll_7_shift_28_std (cols 500:1000)
Feature created named := roll_7_shift_28_std (cols 1000:1500)
Feature created named := roll_7_shift_28_std (cols 1500:1969)
Combining features for roll_7_shift_28_std...
Merging roll_7_shift_28_std into train...
Merging roll_7_shift_28_std into cv...
Merging roll_7_shift_28_std into test...
Merging roll_7_shift_28_std into final_test...
Successfully added feature roll_7_shift_28_std
Feature created named := roll_14_shift_28_std (cols 0:500)
Feature created named := roll_14_shift_28_std (cols 500:1000)
Feature created named := roll_14_shift_28_std (cols 1000:1500)
Feature created named := roll_14_shift_28_std (cols 1500:1969)
Combining features for roll_14_shift_28_std...
Merging roll_14_shift_28_std into train...
Merging roll_14_shift_28_std into cv...
Merging roll_14_shift_28_std into test...
Merging roll_14_shift_28_std into final_test...
Successfully added feature roll_14_shift_28_std
Feature created named := roll_30_shift_28_std (cols 0:500)
Feature created named := roll_30_shift_28_std (cols 500:1000)
Feature created named := roll_30_shift_28_std (cols 1000:1500)
Feature created named := roll_30_shift_28_std (cols 1500:1969)
Combining features for roll_30_shift_28_std...
Merging roll_30_shift_28_std into train...
Merging roll_30_shift_28_std into cv...
Merging roll_30_shift_28_std into test...
Merging roll_30_shift_28_std into final_test...
Successfully added feature roll_30_shift_28_std
Feature created named := roll_60_shift_28_std (cols 0:500)
Feature created named := roll_60_shift_28_std (cols 500:1000)
Feature created named := roll_60_shift_28_std (cols 1000:1500)
Feature created named := roll_60_shift_28_std (cols 1500:1969)
Combining features for roll_60_shift_28_std...
Merging roll_60_shift_28_std into train...
Merging roll_60_shift_28_std into cv...
Merging roll_60_shift_28_std into test...
Merging roll_60_shift_28_std into final_test...
Successfully added feature roll_60_shift_28_std
Feature created named := roll_360_shift_28_std (cols 0:500)
Feature created named := roll_360_shift_28_std (cols 500:1000)
Feature created named := roll_360_shift_28_std (cols 1000:1500)
Feature created named := roll_360_shift_28_std (cols 1500:1969)
Combining features for roll_360_shift_28_std...
Merging roll_360_shift_28_std into train...
Merging roll_360_shift_28_std into cv...
Merging roll_360_shift_28_std into test...
Merging roll_360_shift_28_std into final_test...
Successfully added feature roll_360_shift_28_std
CPU times: total: 6min 57s
Wall time: 7min
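
To make the effect of the shift concrete, here is a minimal sketch of a shifted rolling mean on a toy wide table. The window of 3 and shift of 2 are small stand-ins for the 7 to 360 day windows and the 28-day shift, and the values are made up. The sketch transposes the table so that time runs down the rows, which is an assumption of this example rather than the approach used in the cell above.

```python
import pandas as pd

# Toy wide frame: one row per series, one column per day (illustrative values).
days = pd.date_range('2016-01-01', periods=8).strftime('%Y-%m-%d')
wide = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7, 8]],
                    index=['item_0'], columns=days, dtype='float64')

window, horizon = 3, 2  # small stand-ins for the real window and shift

# Transpose so time runs down the rows, roll, shift, then transpose back.
roll_mean = wide.T.rolling(window).mean().shift(horizon).T

# The feature for the last day is the mean of days 4, 5 and 6 (values 4, 5, 6),
# so it never sees the two most recent days.
last_value = roll_mean.iloc[0, 7]
```

The first `window - 1 + horizon` columns are empty because there is not enough history for them yet.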


## Exponentially Weighted Average (EWA)

In [46]:
# Memory optimization - convert to float16
if df.values.dtype != 'float16':
    df = df.astype('float16')

# Different alpha values for EWA
alphas = [0.99, 0.95, 0.9, 0.8, 0.7]
shift = 28

for alpha in alphas:
    # Clear memory
    gc.collect()
    
    try:
        # Calculate EWA with shift
        roll = df.shift(shift, axis=1).ewm(alpha=alpha, axis=1, adjust=False).mean()
        dates = roll.columns
        
        # Convert to float16 to save memory
        roll = roll.astype('float16')
        
        # Reset index and melt
        roll.reset_index(level=[0,1], inplace=True)
        roll = pd.melt(roll,
                      id_vars=['item_id', 'store_id'],
                      value_vars=dates,
                      var_name='date',
                      value_name=f'ewa_alpha_{int(alpha*100)}_shift_{shift}')
        
        # Fill NaN values
        roll.fillna(-1, inplace=True)
        
        # Merge with all datasets
        print(f"Merging alpha={alpha} features...")
        train = train.merge(roll, on=['item_id', 'store_id', 'date'])
        cv = cv.merge(roll, on=['item_id', 'store_id', 'date'])
        test = test.merge(roll, on=['item_id', 'store_id', 'date'])
        final_test = final_test.merge(roll, on=['item_id', 'store_id', 'date'])
        
        print(f"Direct Feature created ewa window of size alpha={alpha}")
        
    except Exception as e:
        print(f"Error processing alpha={alpha}: {str(e)}")
    finally:
        del roll
        gc.collect()

Merging alpha=0.99 features...
Direct Feature created ewa window of size alpha=0.99
Merging alpha=0.95 features...
Direct Feature created ewa window of size alpha=0.95
Merging alpha=0.9 features...
Direct Feature created ewa window of size alpha=0.9
Merging alpha=0.8 features...
Direct Feature created ewa window of size alpha=0.8
Merging alpha=0.7 features...
Direct Feature created ewa window of size alpha=0.7
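
With `adjust=False`, the exponentially weighted average follows a simple recursion: the first value is kept as is, and each later value is `(1 - alpha)` times the previous average plus `alpha` times the new observation. The sketch below checks this on a toy series. The values and `alpha = 0.5` are made up for the example.

```python
import pandas as pd

sales = pd.Series([4.0, 0.0, 2.0, 6.0])  # illustrative values
alpha = 0.5

# pandas result with adjust=False
ewa = sales.ewm(alpha=alpha, adjust=False).mean()

# The same recursion written out by hand:
# y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t
manual = [sales.iloc[0]]
for x in sales.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)
```

A larger `alpha` puts more weight on the most recent observation, so the values 0.99 to 0.7 used above all track recent sales closely.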


The last step is to calculate lag features with lags of 28, 35, 42, 49, 56, 63, 70, 77, 84, 91 and 98 days.

In [48]:
%%time
for lag in range(28,100,7):
    i='direct_lag_'+str(lag)
    lag_i=df.shift(lag,axis=1)
    dates=lag_i.columns
    lag_i.reset_index(level=[0,1],inplace=True)
    lag_i=pd.melt(lag_i,id_vars=['item_id','store_id'],value_vars=dates,var_name='date',value_name=i)
    lag_i.fillna(-1,inplace=True)
    lag_i[i]=lag_i[i].astype('int16')
    train=train.merge(lag_i,on=['item_id','store_id','date'])
    cv=cv.merge(lag_i,on=['item_id','store_id','date'])
    test=test.merge(lag_i,on=['item_id','store_id','date'])
    final_test=final_test.merge(lag_i,on=['item_id','store_id','date'])
    print("Feature created for lag",lag)
    del lag_i
    gc.collect()

Feature created for lag 28
Feature created for lag 35
Feature created for lag 42
Feature created for lag 49
Feature created for lag 56
Feature created for lag 63
Feature created for lag 70
Feature created for lag 77
Feature created for lag 84
Feature created for lag 91
Feature created for lag 98
CPU times: total: 1min 19s
Wall time: 1min 20s
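
For comparison, lag features can also be built directly on long-format data with a grouped shift, which skips the pivot and melt steps. This is a sketch of an alternative, not the method used above. The toy values and the lag of 2 are assumptions made for the example.

```python
import pandas as pd

# Toy long-format data with two series (illustrative values).
toy = pd.DataFrame({
    'item_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'date': ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04'] * 2,
    'sales': [1, 2, 3, 4, 10, 20, 30, 40],
})
toy = toy.sort_values(['item_id', 'date'])

lag = 2  # small stand-in for the 28 to 98 day lags
# Shift within each series so values never cross from one item into another.
toy['direct_lag_2'] = toy.groupby('item_id')['sales'].shift(lag)
# Mark missing history with -1, as in the cell above.
toy['direct_lag_2'] = toy['direct_lag_2'].fillna(-1).astype('int16')
```

The grouped shift assumes every series has one row per day with no gaps. If days are missing, the shift counts rows rather than calendar days.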


Finally, we can save all the features created.

In [49]:
%%time
train.to_csv('train1.csv',index=False)
cv.to_csv('cv1.csv',index=False)
test.to_csv('test1.csv',index=False)
final_test.to_csv('final_test1.csv',index=False)

CPU times: total: 0 ns
Wall time: 27.9 ms


All time-series features were constructed with a shift of at least 28 days to avoid data leakage.