# Competition Description

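Note: This is one of two complementary competitions that together make up the M5 forecasting challenge. Can you estimate, as precisely as possible, the point forecasts of the unit sales of various products sold in the USA by Walmart? If you are interested in estimating the uncertainty distribution of the realized values of the same series, be sure to check out its [companion competition](https://www.kaggle.com/c/m5-forecasting-uncertainty).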

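How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may leave you carrying an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods, you are challenged to use machine learning to improve forecast accuracy.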

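The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate the levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s.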

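In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world's largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.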

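If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste, and be able to appreciate uncertainty and its risk implications.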

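Thanks go to the additional partner organizations and prize sponsors: the National Technical University of Athens (NTUA), INSEAD, Google, Uber, and IIF.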

In [1]:
import numpy as np
import pandas as pd



In [2]:
data_path = "/home/zhang/Documents/data_set/m5_forecasting_accuracy/"

df_calendar = pd.read_csv(f'{data_path}/calendar.csv', index_col='date')
df_sell_prices = pd.read_csv(f'{data_path}/sell_prices.csv')
sample_submission = pd.read_csv(f'{data_path}/sample_submission.csv')
df_sales = pd.read_csv(f'{data_path}/sales_train_validation.csv', index_col="item_id")

In [3]:
first_date = 'd_1'
last_date = 'd_1913'


# Feature Selection

In [4]:
from sklearn.preprocessing import LabelEncoder

nonservingcols = ['wm_yr_wk', 'wday']
dates = df_calendar.drop(nonservingcols, axis=1)
dates.index = dates["d"]  # use the day label (d_1, d_2, ...) as the index
dates = dates.fillna(0)


In [5]:
dates[["event_name_1", "event_type_1", "event_name_2", "event_type_2"]]

d,event_name_1,event_type_1,event_name_2,event_type_2
d_1,0,0,0,0
d_2,0,0,0,0
d_3,0,0,0,0
d_4,0,0,0,0
d_5,0,0,0,0
...,...,...,...,...
d_1965,0,0,0,0
d_1966,0,0,0,0
d_1967,0,0,0,0
d_1968,0,0,0,0


In [6]:
categorical_cols = ["event_name_1", "event_type_1", "event_name_2", "event_type_2"]

my_labeler = LabelEncoder()
for i in categorical_cols:
    dates[i] = my_labeler.fit_transform(dates[i].astype("str"))  # label-encode to integers; no one-hot encoding is applied

In [7]:
dates[categorical_cols]

d,event_name_1,event_type_1,event_name_2,event_type_2
d_1,0,0,0,0
d_2,0,0,0,0
d_3,0,0,0,0
d_4,0,0,0,0
d_5,0,0,0,0
...,...,...,...,...
d_1965,0,0,0,0
d_1966,0,0,0,0
d_1967,0,0,0,0
d_1968,0,0,0,0
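
The event columns above are label-encoded rather than one-hot encoded. As a rough sketch of the difference, the snippet below compares the two on a made-up event column. The values are illustrative and not taken from `calendar.csv`, and pandas categorical codes stand in for `LabelEncoder` here (both assign integers to string values in sorted order).

```python
import pandas as pd

# Toy event column; "0" stands in for "no event", as produced by fillna(0).astype("str")
events = pd.Series(["0", "Easter", "0", "SuperBowl", "Easter"], name="event_name_1")

# Label encoding: one integer per distinct value, assigned in sorted order
codes = events.astype("category").cat.codes

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(events, prefix="event")

print(codes.tolist())
print(list(one_hot.columns))
```

Integer codes keep the frame narrow, but they impose an arbitrary ordering on the event names. Tree-based models usually tolerate that, while linear models may be better served by the one-hot form.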


In [8]:
dates[['weekday', 'month', 'year']]

d,weekday,month,year
d_1,Saturday,1,2011
d_2,Sunday,1,2011
d_3,Monday,1,2011
d_4,Tuesday,2,2011
d_5,Wednesday,2,2011
...,...,...,...
d_1965,Wednesday,6,2016
d_1966,Thursday,6,2016
d_1967,Friday,6,2016
d_1968,Saturday,6,2016


In [9]:
df_sales.columns

Index(['id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'd_1', 'd_2', 'd_3',
       'd_4', 'd_5',
       ...
       'd_1904', 'd_1905', 'd_1906', 'd_1907', 'd_1908', 'd_1909', 'd_1910',
       'd_1911', 'd_1912', 'd_1913'],
      dtype='object', length=1918)

In [10]:
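def createFeatures(series):
    '''
    Transpose sales data so that days become rows, then merge it with the calendar data.
    Argument:
      series: a pandas DataFrame with day labels (d_1, d_2, ...) as columns,
              or a named pandas Series indexed by day labels
    '''
    serie = series.transpose()
    df_products = pd.merge(serie, dates, left_index=True, right_index=True)
    # `dates` has no "Date" column (its date index was replaced by the day labels),
    # so look the calendar date up from df_calendar, which is indexed by date
    date_by_day = pd.Series(df_calendar.index.values, index=df_calendar["d"].values)
    df_products["Date"] = pd.to_datetime(df_products["d"].map(date_by_day))
    df_products["quarter"] = df_products["Date"].dt.quarter.astype("uint8")
    df_products["Month"] = df_products["Date"].dt.month.astype("uint8")
    # year (e.g. 2011) and day of year (up to 366) do not fit in uint8
    df_products["Year"] = df_products["Date"].dt.year.astype("uint16")
    df_products["dayofyear"] = df_products["Date"].dt.dayofyear.astype("uint16")
    df_products["dayofweek"] = df_products["Date"].dt.dayofweek.astype("uint8")
    df_products.index = df_products.Date
    df_products = df_products.drop(["Date", "weekday", "month", "d"], axis=1)

    return df_products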
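
To illustrate the calendar features this function derives, here is a minimal sketch on a toy frame (the three dates are arbitrary examples). It also shows why the integer width matters: `uint8` only holds values up to 255, so it suits quarter, month, and day of week but not year or day of year.

```python
import pandas as pd

# Toy frame with three arbitrary calendar dates
toy = pd.DataFrame({"Date": pd.to_datetime(["2011-01-29", "2011-02-01", "2016-12-31"])})

# Small ranges fit in uint8: quarter (1-4), month (1-12), dayofweek (0-6, Monday=0)
toy["quarter"] = toy["Date"].dt.quarter.astype("uint8")
toy["Month"] = toy["Date"].dt.month.astype("uint8")
toy["dayofweek"] = toy["Date"].dt.dayofweek.astype("uint8")

# Year (e.g. 2011) and day of year (up to 366) exceed 255, so use uint16
toy["Year"] = toy["Date"].dt.year.astype("uint16")
toy["dayofyear"] = toy["Date"].dt.dayofyear.astype("uint16")

print(toy.dtypes)
```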

In [11]:
def crearseries(data):
    '''
    Aggregate daily unit sales from df_sales.
    Argument:
      data: list of column names to group by, e.g. ["state_id", "cat_id"];
            a list whose first element is falsy (e.g. [None]) returns the
            total across all series instead
    '''
    df = df_sales.copy()
    first_date = "d_1"
    last_date = "d_1913"  # last day column present in sales_train_validation.csv

    if data[0]:
        # numeric_only skips string columns such as "id"
        final_df = df.groupby(data).sum(numeric_only=True)
        if isinstance(final_df.index, pd.MultiIndex):
            # flatten multi-level group keys into a single "a_b" label
            final_df.index = ["_".join(map(str, key)) for key in final_df.index]
        return final_df
    else:
        df = df.loc[:, first_date:last_date]
        final_df = pd.Series(df.sum(axis=0))
        return final_df
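
The two branches of this function can be sketched on a toy stand-in for `df_sales`. The frame below is invented for illustration, with only two day columns, and shows the grouped sum with flattened labels next to the overall daily total.

```python
import pandas as pd

# Toy stand-in for df_sales: an id, two grouping columns, and two day columns
toy_sales = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "state_id": ["CA", "CA", "TX", "TX"],
    "cat_id": ["FOODS", "HOBBIES", "FOODS", "FOODS"],
    "d_1": [3, 0, 2, 5],
    "d_2": [1, 4, 0, 1],
})

# Sum the day columns per (state, category); numeric_only skips the string "id"
grouped = toy_sales.groupby(["state_id", "cat_id"]).sum(numeric_only=True)

# Flatten the two-level index into labels such as "CA_FOODS"
grouped.index = ["_".join(key) for key in grouped.index]

# Total across all series: one value per day column
total = toy_sales.loc[:, "d_1":"d_2"].sum(axis=0)

print(grouped)
print(total)
```

Note that `str.join` takes a single iterable, so the tuple of group keys is passed as one argument rather than as separate strings.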
        

In [12]:
df_sell_prices.loc[:,["item_id","sell_price"]]

,item_id,sell_price
0,HOBBIES_1_001,9.58
1,HOBBIES_1_001,9.58
2,HOBBIES_1_001,8.26
3,HOBBIES_1_001,8.26
4,HOBBIES_1_001,8.26
...,...,...
6841116,FOODS_3_827,1.00
6841117,FOODS_3_827,1.00
6841118,FOODS_3_827,1.00
6841119,FOODS_3_827,1.00


In [13]:
df_prices_stats = df_sell_prices.loc[:,["item_id", "sell_price"]]



In [14]:
# string aggregation names avoid the warning recent pandas versions emit for the builtin min/max
df_prices_stats = df_prices_stats.groupby("item_id")["sell_price"].agg(["min", "max", "mean"])

In [15]:
df_estados = df_sales.loc[:,"state_id":last_date]
df_estados = df_estados.groupby("state_id").sum()
df_estados_Q = pd.DataFrame(df_estados.sum(axis=1))

In [16]:
df_estados = df_estados.transpose()
df_estados = pd.merge(df_estados, dates, left_index=True, right_index=True)
df_estados.head()

,CA,TX,WI,weekday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI
d_1,14195,9438,8998,Saturday,1,2011,d_1,0,0,0,0,0,0,0
d_2,13805,9630,8314,Sunday,1,2011,d_2,0,0,0,0,0,0,0
d_3,10108,6778,6897,Monday,1,2011,d_3,0,0,0,0,0,0,0
d_4,11047,7381,6984,Tuesday,2,2011,d_4,0,0,0,0,1,1,0
d_5,9925,5912,3309,Wednesday,2,2011,d_5,0,0,0,0,1,0,1
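
The transpose-then-merge step above can be sketched on toy data. The sales figures are the first two days shown in the output, while the three-row calendar is a simplified stand-in. Because `pd.merge` defaults to an inner join, calendar days without a sales column drop out of the result.

```python
import pandas as pd

# Toy state totals: one row per state, one column per day
state_totals = pd.DataFrame(
    {"d_1": [14195, 9438], "d_2": [13805, 9630]},
    index=pd.Index(["CA", "TX"], name="state_id"),
)

# Toy calendar indexed by day label, with one extra day that has no sales column
calendar = pd.DataFrame(
    {"weekday": ["Saturday", "Sunday", "Monday"]},
    index=["d_1", "d_2", "d_3"],
)

# Transpose so that days become rows, then inner-join on the shared day-label index
by_day = pd.merge(state_totals.transpose(), calendar, left_index=True, right_index=True)

print(by_day)
```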


In [17]:
df_estados.index

Index(['d_1', 'd_2', 'd_3', 'd_4', 'd_5', 'd_6', 'd_7', 'd_8', 'd_9', 'd_10',
       ...
       'd_1904', 'd_1905', 'd_1906', 'd_1907', 'd_1908', 'd_1909', 'd_1910',
       'd_1911', 'd_1912', 'd_1913'],
      dtype='object', length=1913)
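
The index printed above holds day labels rather than calendar dates. As a hedged sketch, labels can be turned into dates by counting from the first day. The start date of 2011-01-29 is an assumption, chosen to agree with the weekday and month columns shown earlier; `FIRST_DAY` and `day_label_to_date` are hypothetical names. In the notebook itself, the `date` index of `calendar.csv` is the authoritative mapping.

```python
import pandas as pd

FIRST_DAY = pd.Timestamp("2011-01-29")  # assumed calendar date of d_1


def day_label_to_date(label):
    """Convert a label such as 'd_4' to a Timestamp, counting from FIRST_DAY."""
    offset = int(label.split("_")[1]) - 1
    return FIRST_DAY + pd.Timedelta(days=offset)


labels = pd.Index(["d_1", "d_4", "d_1913"])
dates_from_labels = labels.map(day_label_to_date)

print(dates_from_labels)
```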

In [18]:
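# The original cell raised KeyError: 'Date' because `dates` has no "Date" column:
# the calendar dates were the index of df_calendar and were replaced by the day labels.
# Look the dates up from df_calendar instead.
date_by_day = pd.Series(df_calendar.index.values, index=df_calendar["d"].values)
df_estados["Date"] = pd.to_datetime(df_estados["d"].map(date_by_day))
df_estados.head()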

In [None]:
df_estados.transpose().index