The dataset is from https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting.

### Total Flow
First, the training data is passed through the SVD function, which returns a smoothed version of it. The smoothed training data and the test data are then pre-processed with the same pre-processing function. The resulting data are fed to the sm.OLS linear regression function.

Data pre-processing: The ‘Date’ column is converted to a datetime type. A new categorical column ‘Wk’ holds the week number within the year, and a new column ‘Yr’ holds the year each row comes from.
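
As a small illustration of these derived columns, the sketch below builds ‘Wk’ and ‘Yr’ for three hypothetical dates (the toy frame is made up for illustration and is not taken from the dataset). It assumes, as in the pre-processing code of this notebook, that ‘Wk’ is the ISO week number restricted to the categories 1 to 52. Under that assumption, a date that falls in ISO week 53 ends up with a missing ‘Wk’ value.

```python
import pandas as pd

# Hypothetical toy rows; only the 'Date' column matters here
toy = pd.DataFrame({"Date": ["2010-02-05", "2011-11-25", "2020-12-31"]})

dates = pd.to_datetime(toy["Date"])
toy["Yr"] = dates.dt.year

# ISO week number, cast to plain int before building the categorical
iso_week = dates.dt.isocalendar().week.astype(int)
toy["Wk"] = pd.Categorical(iso_week, categories=list(range(1, 53)))

print(iso_week.tolist())           # [5, 47, 53]
print(toy["Wk"].isna().tolist())   # [False, False, True]
```

Whether week 53 should be kept as its own category or mapped onto a neighbouring week is a modelling choice that this sketch does not settle.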

### Model
**SVD:**    

The SVD function performs dimensionality reduction on the weekly sales data of a single department, taken from the training split. We chose d = 8 as the number of components to keep. The function pivots the data into a matrix whose rows are stores and whose columns are weeks, with missing entries filled with 0. The matrix is centered by subtracting each store's mean weekly sales. SVD is performed on the centered matrix, and the top d components are used to approximate it. If the number of stores is less than or equal to d, SVD is skipped and only the mean is added back, because keeping d components of a matrix with at most d rows reproduces the original data. The resulting matrix is reshaped back to the original long format with the columns ‘Store’, ‘Date’, and ‘Weekly_Sales’. The function returns this smoothed data with the department column added back.  
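
The idea can be sketched on a toy matrix. This is a minimal illustration under stated assumptions, not the notebook's function: `low_rank_smooth` is a hypothetical helper, the data is synthetic (12 "stores" by 20 "weeks" built from 2 shared patterns plus noise), and `numpy.linalg.svd` is used in place of the SciPy routine.

```python
import numpy as np

def low_rank_smooth(X, d):
    """Center each row, keep the top-d singular components, add the row means back."""
    row_mean = X.mean(axis=1, keepdims=True)
    Xc = X - row_mean
    if X.shape[0] <= d:
        # Too few rows for a rank reduction: return the data unchanged
        return Xc + row_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Scaling the columns of U by s is equivalent to U @ diag(s)
    return (U[:, :d] * s[:d]) @ Vt[:d, :] + row_mean

rng = np.random.default_rng(0)
patterns = rng.normal(size=(2, 20))   # hypothetical shared weekly patterns
loadings = rng.normal(size=(12, 2))   # hypothetical per-store weights
X = 100 + loadings @ patterns + 0.1 * rng.normal(size=(12, 20))

X_smooth = low_rank_smooth(X, d=2)
print(np.abs(X - X_smooth).max())                        # size of the removed noise
print(np.allclose(low_rank_smooth(X[:2], d=2), X[:2]))   # True: SVD skipped
```

With only two rows and d = 2 the helper returns the input unchanged, which mirrors the skip branch described above.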

  
**Linear regression:**  

First, the unique pairs of ‘Store’ and ‘Dept’ are extracted from both the train and test datasets, and their intersection is taken so that only the store-department combinations common to both are used. Then, for each department in the training split (train_split), SVD is applied to smooth the data, which reduces noise and keeps the most important factors. The smoothed training data is then pre-processed and split into individual store-department subsets.
The data is transformed into design matrices with ‘patsy’ and grouped by store and department so that a separate model can be trained for each pair. Columns that are entirely zero are dropped, and the remaining features are filtered by a backward pass that removes every column that is an exact linear combination of the columns before it, which eliminates perfect collinearity.
After the coefficients are fitted with the sm.OLS function and predictions are made, the predictions are left-merged with the test data so that the ‘IsHoliday’ column is restored in the result. Missing predictions are filled with zeros, because zero gave the lowest WMAE among the fill values we tried (zero, the mean, and other values).
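
The backward collinearity pass can be illustrated on a small matrix. This is a sketch under assumptions rather than the notebook's exact loop: `exactly_collinear_columns` is a hypothetical helper, it computes the residual explicitly instead of relying on the residual array returned by `np.linalg.lstsq` (which is empty when the left-hand block is rank deficient), and the tolerance is an arbitrary choice.

```python
import numpy as np

def exactly_collinear_columns(X, tol=1e-12):
    """Indices of columns that are exact linear combinations of the columns to their left."""
    flagged = []
    # Walk from the last column backward, regressing each column on those before it
    for i in range(X.shape[1] - 1, 0, -1):
        coef, _, _, _ = np.linalg.lstsq(X[:, :i], X[:, i], rcond=None)
        resid = X[:, i] - X[:, :i] @ coef
        if np.sum(resid ** 2) < tol:
            flagged.append(i)
    return flagged

# Toy design matrix: intercept, two complementary dummies, and a trend
X = np.column_stack([
    np.ones(6),
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],   # equals intercept minus the previous dummy
    [1, 2, 3, 4, 5, 6],
])

print(exactly_collinear_columns(X))   # [2]
```

Dropping the flagged column leaves a full-rank design matrix, which is the condition OLS needs for a unique coefficient vector.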


In [None]:
import numpy as np
import pandas as pd
from scipy.linalg import svd

def reduced_svd(train, dept_id, d=8):
    # Filter data for the specified department
    dept_df = train[train['Dept'] == dept_id]

    # Pivot the data to form matrix X where rows are stores and columns are weeks
    X = dept_df.pivot(index='Store', columns='Date', values='Weekly_Sales').fillna(0)

    # Calculate mean weekly sales for each store (mean of each row)
    store_mean = X.mean(axis=1)

    # Center the data by subtracting the store means
    X_centered = X.sub(store_mean, axis=0)

    # Convert to numpy array for SVD
    X_centered_np = X_centered.to_numpy()

    # Perform SVD only if we have enough rows for rank reduction
    m, n = X_centered_np.shape
    if m > d:
        # SVD decomposition
        U, D, Vt = svd(X_centered_np, full_matrices=False)

        # Select the top d components
        U_d = U[:, :d]
        D_d = np.diag(D[:d])
        Vt_d = Vt[:d, :]

        # Reconstruct the matrix using the top d components
        X_approx_centered = U_d @ D_d @ Vt_d

        # Re-center by adding the store mean back to each row
        X_approx = X_approx_centered + store_mean.values[:, np.newaxis]
    else:
        # If rows <= d, we skip SVD and just add the mean to get the original X
        X_approx = X_centered_np + store_mean.values[:, np.newaxis]

    # Convert the reconstructed matrix back to DataFrame for easier handling
    X_approx_df = pd.DataFrame(X_approx, index=X.index, columns=X.columns)

    # Reshape the data back to long format (Store, Date, Weekly_Sales)
    smoothed_data = X_approx_df.reset_index().melt(id_vars='Store', var_name='Date', value_name='Weekly_Sales')
    smoothed_data['Dept'] = dept_id

    return smoothed_data

In [None]:
import patsy
import pandas as pd
import numpy as np
import statsmodels.api as sm

def preprocess(data):
    tmp = pd.to_datetime(data['Date'])
    data['Wk'] = tmp.dt.isocalendar().week
    data['Yr'] = tmp.dt.year
    data['Wk'] = pd.Categorical(data['Wk'], categories=[i for i in range(1, 53)])  # 52 weeks
    return data

In [None]:
# Load train and test data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Accumulates the predictions of every (Store, Dept) pair
test_pred = pd.DataFrame()

# Keep only the (Store, Dept) pairs that appear in both train and test
train_pairs = train[['Store', 'Dept']].drop_duplicates(ignore_index=True)
test_pairs = test[['Store', 'Dept']].drop_duplicates(ignore_index=True)
unique_pairs = pd.merge(train_pairs, test_pairs, how='inner', on=['Store', 'Dept'])

train_split = unique_pairs.merge(train, on=['Store', 'Dept'], how='left')

# Smooth the training data one department at a time with SVD
train_smoothed = pd.DataFrame()
for dept in unique_pairs['Dept'].unique():
    dept_smoothed = reduced_svd(train_split, dept_id=dept, d=8)  # SVD application
    train_smoothed = pd.concat([train_smoothed, dept_smoothed], ignore_index=True)

# Build the training design matrix and split it by (Store, Dept)
train_split = preprocess(train_smoothed)
X = patsy.dmatrix('Weekly_Sales + Store + Dept + Yr  + Wk',
                  data=train_split,
                  return_type='dataframe')
train_split = dict(tuple(X.groupby(['Store', 'Dept'])))

# Build the test design matrix the same way, keeping 'Date' for the final merge
test_split = unique_pairs.merge(test, on=['Store', 'Dept'], how='left')
test_split = preprocess(test_split)
X = patsy.dmatrix('Store + Dept + Yr  + Wk',
                  data=test_split,
                  return_type='dataframe')
X['Date'] = test_split['Date']
test_split = dict(tuple(X.groupby(['Store', 'Dept'])))

keys = list(train_split)

for key in keys:
    X_train = train_split[key]
    X_test = test_split[key]

    Y = X_train['Weekly_Sales']
    X_train = X_train.drop(['Weekly_Sales', 'Store', 'Dept'], axis=1)

    # Drop columns that are entirely zero for this pair
    cols_to_drop = X_train.columns[(X_train == 0).all()]
    X_train = X_train.drop(columns=cols_to_drop)
    X_test = X_test.drop(columns=cols_to_drop)

    # Drop columns that are exact linear combinations of the columns before them
    cols_to_drop = []
    for i in range(len(X_train.columns) - 1, 1, -1):  # Start from the last column and move backward
        col_name = X_train.columns[i]
        tmp_Y = X_train.iloc[:, i].values
        tmp_X = X_train.iloc[:, :i].values

        coefficients, residuals, rank, s = np.linalg.lstsq(tmp_X, tmp_Y, rcond=None)
        if np.sum(residuals) < 1e-16:
            cols_to_drop.append(col_name)

    X_train = X_train.drop(columns=cols_to_drop)
    X_test = X_test.drop(columns=cols_to_drop)

    model = sm.OLS(Y, X_train).fit()
    mycoef = model.params.fillna(0)

    # Copy the slice so that adding a column does not raise SettingWithCopyWarning
    tmp_pred = X_test[['Store', 'Dept', 'Date']].copy()
    X_test = X_test.drop(['Store', 'Dept', 'Date'], axis=1)

    tmp_pred['Weekly_Pred'] = np.dot(X_test, mycoef)
    test_pred = pd.concat([test_pred, tmp_pred], ignore_index=True)

# Left-merge onto the test rows to restore 'IsHoliday'; unmatched rows get a prediction of 0
test_pred = test[['Store', 'Dept', 'Date', 'IsHoliday']].merge(test_pred,
                                                              on=['Store', 'Dept', 'Date'],
                                                              how='left')
test_pred['Weekly_Pred'] = test_pred['Weekly_Pred'].fillna(0)

# Convert the 'Date' column to datetime
test_pred['Date'] = pd.to_datetime(test_pred['Date'])

output_path = "mypred.csv"
test_pred.to_csv(output_path, index=False)


### Performance Result

Fold 1 - WMAE: 1941.597 / Computing time: 2 min 2 s  
Fold 2 - WMAE: 1363.584 / Computing time: 2 min 6 s  
Fold 3 - WMAE: 1382.565 / Computing time: 2 min 12 s  
Fold 4 - WMAE: 1527.389 / Computing time: 2 min 19 s  
Fold 5 - WMAE: 2310.612 / Computing time: 2 min 16 s  
Fold 6 - WMAE: 1637.269 / Computing time: 2 min 15 s  
Fold 7 - WMAE: 1683.922 / Computing time: 2 min 22 s  
Fold 8 - WMAE: 1399.906 / Computing time: 2 min 22 s  
Fold 9 - WMAE: 1417.880 / Computing time: 2 min 27 s  
Fold 10 - WMAE: 1426.248 / Computing time: 2 min 24 s  

Average WMAE over the 10 folds: 1609.097
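
For reference, the metric can be computed as in the sketch below. It assumes the weighting used by the Kaggle competition, in which holiday weeks count 5 times as much as other weeks. The function name `wmae` and the toy numbers are hypothetical, and the evaluation script that produced the scores reported here is not shown in this notebook.

```python
import numpy as np

def wmae(actual, predicted, is_holiday, holiday_weight=5):
    """Weighted mean absolute error with a larger weight on holiday weeks."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1)
    return np.sum(weights * np.abs(actual - predicted)) / np.sum(weights)

# Toy example: absolute errors 10, 10, 30 with weights 1, 1, 5
score = wmae([100, 200, 300], [110, 190, 330], [False, False, True])
print(round(score, 4))   # 24.2857
```

On real output, the inputs would be the true weekly sales of a fold together with the ‘Weekly_Pred’ and ‘IsHoliday’ columns of the prediction file.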
