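# Conduct Exploratory Data Analysis (EDA)
## What to do
- Before building a demand forecasting model, perform exploratory data analysis (EDA) to understand the characteristics of the data.
- Run this notebook from top to bottom.
- Run it on a cluster with DBR 15.4 LTS, DBR 15.4 LTS ML, or later.

In this demo, you will learn how to efficiently generate fine-grained forecasts for each store and item using Databricks' distributed processing.  
The training dataset consists of five years of store-item level sales data for 50 items across 10 different stores.  
We will examine sales trends on a yearly, monthly, and weekly basis.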

<!-- %md
### Conduct Exploratory Data Analysis (EDA)

In this demo, you will learn how to efficiently generate fine-grained forecasts for each store and item using Databricks' distributed processing.  
The training dataset consists of five years of store-item level sales data for 50 items across 10 different stores.  
We will analyze sales trends on a yearly, monthly, and weekly basis. -->

In [0]:
%run "./01_config"

## Step 1: Check trends in the data

In [0]:
%sql

SELECT current_catalog(), current_schema();

In [0]:
%sql

SELECT * FROM bronze_train

When performing demand forecasting, we are often interested in general trends and seasonality. Let's start by examining the annual trend in unit sales.

<!-- %md
When performing demand forecasting, we are often interested in general trends and seasonality.  Let's start our exploration by examining the annual trend in unit sales: -->

In [0]:
%sql

SELECT
  CAST(year(order_date) as STRING) as year, 
  sum(sales_quantity) as sales
FROM bronze_train
GROUP BY year(order_date)
ORDER BY year;

Databricks visualization. Run in Databricks to view.
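
To put a number on the yearly trend, the totals returned by the query above can be converted into year-over-year growth rates. The following is a minimal sketch in plain Python. The `yearly_sales` values are placeholders, not results from `bronze_train`, and the helper name `year_over_year_growth` is hypothetical.

```python
# Placeholder yearly totals, not taken from bronze_train. In the notebook,
# these (year, sales) pairs would come from the yearly aggregation query.
yearly_sales = [("2001", 100.0), ("2002", 110.0), ("2003", 121.0)]


def year_over_year_growth(rows):
    """Return [(year, percent change vs. the previous year), ...]."""
    ordered = sorted(rows)  # sort by the year label
    return [
        (year, (curr - prev) / prev * 100.0)
        for (_, prev), (year, curr) in zip(ordered, ordered[1:])
    ]


for year, pct in year_over_year_growth(yearly_sales):
    print(f"{year}: {pct:+.1f}%")
```

A roughly constant percentage from year to year would suggest compounding growth, while a shrinking percentage would suggest growth that is closer to linear.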


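Sales volume is trending upward overall. At a glance, it seems reasonable to assume that growth will continue over the next few days, months, or even a year.  
Next, let's examine seasonality. Aggregating by month reveals a yearly seasonal pattern whose amplitude grows along with overall sales.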

<!-- %md
There is an overall upward trend in sales volume. At a glance, it seems reasonable to assume that growth will continue for the next few days, months, or even up to a year.
Now let's examine seasonality.  If we aggregate the data around the individual months in each year, a distinct yearly seasonal pattern is observed which seems to grow in scale with overall growth in sales: -->

In [0]:
%sql

SELECT 
  TRUNC(order_date, 'MM') as month,
  SUM(sales_quantity) as sales
FROM bronze_train
GROUP BY TRUNC(order_date, 'MM')
ORDER BY month;

Databricks visualization. Run in Databricks to view.
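
One way to check whether the seasonal pattern scales with the overall sales level is to divide each month's total by the mean monthly total of the same year. If the resulting indices are similar across years, the seasonality is approximately multiplicative. The sketch below assumes monthly totals keyed by `(year, month)`. The input values are made up for illustration, and `seasonal_indices` is a hypothetical helper name.

```python
from collections import defaultdict

# Placeholder monthly totals (not taken from bronze_train): the second year
# is the first year scaled by 1.2, i.e. a purely multiplicative pattern.
base = [70, 75, 95, 110, 120, 125, 135, 120, 110, 100, 105, 75]
monthly_sales = {}
for month, value in enumerate(base, start=1):
    monthly_sales[(2001, month)] = float(value)
    monthly_sales[(2002, month)] = value * 1.2


def seasonal_indices(rows):
    """Map (year, month) to sales divided by that year's mean monthly sales."""
    by_year = defaultdict(list)
    for (year, _month), sales in rows.items():
        by_year[year].append(sales)
    year_mean = {year: sum(v) / len(v) for year, v in by_year.items()}
    return {key: sales / year_mean[key[0]] for key, sales in rows.items()}


indices = seasonal_indices(monthly_sales)
for month in (1, 7):
    print(month, round(indices[(2001, month)], 3), round(indices[(2002, month)], 3))
```

With real data the indices will not match exactly. Large differences between years would point to seasonality that does not simply scale with the sales level.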

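When the data is aggregated by day of the week, sales peak on Sunday, drop sharply on Monday, and then recover toward the end of the week. This pattern has remained fairly stable across the five years.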

<!-- %md
When the data is aggregated by day of the week, we see a pattern where sales peak on Sunday, drop sharply on Monday, and then recover towards the end of the week. This pattern has remained fairly stable over the past five years. -->

In [0]:
%sql

SELECT
  YEAR(order_date) as year,
  -- DAYOFWEEK returns 1 (Sunday) through 7 (Saturday); shift to 0 = Sunday ... 6 = Saturday
  DAYOFWEEK(order_date) - 1 as weekday,
  AVG(sales) as sales
FROM (
  -- total units sold per day across all stores and items
  SELECT 
    order_date,
    SUM(sales_quantity) as sales
  FROM bronze_train
  GROUP BY order_date
 ) x
GROUP BY year, weekday
ORDER BY year, weekday;

Databricks visualization. Run in Databricks to view.
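
The weekday numbering used by the query (0 = Sunday through 6 = Saturday) differs from Python's `date.weekday()`, which counts from Monday. The sketch below shows the conversion and the same average-per-weekday calculation in plain Python. The daily totals are placeholders rather than values from `bronze_train`, and both helper names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def sunday_based_weekday(d):
    """Return 0 for Sunday through 6 for Saturday, as in the query."""
    # date.weekday() uses Monday = 0 ... Sunday = 6, so shift by one day.
    return (d.weekday() + 1) % 7


def average_by_weekday(daily_rows):
    """Average (date, sales) pairs per Sunday-based weekday number."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, sales in daily_rows:
        key = sunday_based_weekday(day)
        totals[key] += sales
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in sorted(totals)}


# Placeholder daily totals for two weeks starting on Sunday 2024-01-07;
# the numbers are made up and only illustrate the shape of the calculation.
start = date(2024, 1, 7)
week = [130.0, 80.0, 95.0, 96.0, 102.0, 110.0, 118.0]
daily_sales = [(start + timedelta(days=i), week[i % 7]) for i in range(14)]

print(average_by_weekday(daily_sales))
```

Keeping Sunday as 0 places the weekly peak at the left edge of a chart, which makes the drop on Monday and the later recovery easy to read.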


Now that we have a basic understanding of the patterns in the data, the next step is to explore how to build a forecasting model.  
04_model_training

<!-- %md

Now that we have a basic understanding of the data patterns, let's explore how to build a forecasting model.
04_model_training -->