# Forecasting Stock Volume

In this example, we will be forecasting the volume of different Dow Jones stocks for a given day.  The data used is a public Kaggle dataset consisting of stock market data for the DJIA 30: [DJIA Stock Data](https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231).

We will be using Sparkling Water to ingest the data and add historical lags.

Our machine learning workflow is:

1. Import data into Spark
2. Feature engineering
   * Add time lag columns
3. Train a single DRF model
4. Examine DRF model
5. Run AutoML (from Python)
6. Watch AutoML progress (in the H2O Flow Web UI)

# Step 1 (of 6).  Import data into Spark

In [None]:
# Initiate H2OContext on top of Spark

from pysparkling import H2OContext
hc = H2OContext.getOrCreate(spark)

In [None]:
# Import data

from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType, StringType

schema = StructType([StructField("Date", TimestampType(), True),
                     StructField("Open", DoubleType(), True),
                     StructField("High", DoubleType(), True),
                     StructField("Low", DoubleType(), True),
                     StructField("Close", DoubleType(), True),
                     StructField("Volume", DoubleType(), True),
                     StructField("Name", StringType(), True)])

# https://s3.amazonaws.com/h2o-training/events/h2o_world/TimeSeries/all_stocks_2006-01-01_to_2018-01-01.csv
stock_df = spark.read.csv("data/all_stocks_2006-01-01_to_2018-01-01.csv", header = True, schema = schema)

In [None]:
stock_df.head(5)

# Step 2 (of 6).  Feature Engineering

We will add new features to our data that can help predict the Volume for a given company.  Lagged features can be very predictive in forecasting.  They answer questions such as:
* what was the Volume for a company yesterday, two days ago, three days ago?
* what was the Close price, Open price, High price, Low price for a company yesterday?

To create these features we will use PySpark's window functions.

In [None]:
## Add Volume from the previous day, previous 2 days, and previous 3 days per Company
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# One window per company, ordered by date, so each lag looks back within the same stock
w = Window.partitionBy(col("Name")).orderBy(col("Date"))
# The lag offset is passed positionally because its keyword name differs between Spark versions
ext_stock_df = stock_df.select("*", lag("Volume", 1).over(w).alias("Volume_lag1"),
               lag("Volume", 2).over(w).alias("Volume_lag2"),
               lag("Volume", 3).over(w).alias("Volume_lag3")).na.drop()

In [None]:
ext_stock_df.show()
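
To make the effect of the window concrete, here is a minimal plain-Python sketch of the same idea: group rows by company, order them by date, and look back a fixed number of rows. The helper `add_lags` and the toy tickers `AAA` and `BBB` are made up for illustration and are not part of PySpark or the dataset. Rows without enough history are skipped, which roughly mirrors what `.na.drop()` does to the null lags above.

```python
from collections import defaultdict


def add_lags(rows, key, order, value, lags):
    """Return copies of rows with <value>_lag<k> fields, skipping rows that lack history."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    out = []
    for name in sorted(groups):
        # Order each company's rows by date, like orderBy(col("Date"))
        ordered = sorted(groups[name], key=lambda r: r[order])
        for i, row in enumerate(ordered):
            if i < max(lags):
                continue  # not enough earlier rows for the longest lag
            new_row = dict(row)
            for k in lags:
                new_row[f"{value}_lag{k}"] = ordered[i - k][value]
            out.append(new_row)
    return out


# Made-up toy data, not taken from the Kaggle dataset
rows = [
    {"Name": "AAA", "Date": "2017-01-03", "Volume": 100.0},
    {"Name": "AAA", "Date": "2017-01-04", "Volume": 110.0},
    {"Name": "AAA", "Date": "2017-01-05", "Volume": 120.0},
    {"Name": "AAA", "Date": "2017-01-06", "Volume": 130.0},
    {"Name": "BBB", "Date": "2017-01-03", "Volume": 5.0},
    {"Name": "BBB", "Date": "2017-01-04", "Volume": 6.0},
]

lagged = add_lags(rows, key="Name", order="Date", value="Volume", lags=[1, 2, 3])
for row in lagged:
    print(row)
```

Only the last `AAA` row survives, because it is the only row with three earlier days for the same company. Lags never cross from one company into another.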

In [None]:
## Add Close, Open, Low, and High from the previous day per Company
## A single select adds all four lags, so only one more row per company is dropped for nulls

ext_stock_df = ext_stock_df.select("*",
               lag("Close", 1).over(w).alias("Close_lag1"),
               lag("Low", 1).over(w).alias("Low_lag1"),
               lag("High", 1).over(w).alias("High_lag1"),
               lag("Open", 1).over(w).alias("Open_lag1")).na.drop()

In [None]:
## Convert Spark DataFrame to H2O Frame

import h2o
ext_stock_hf = hc.as_h2o_frame(ext_stock_df, "stockWithLagsTable")

In [None]:
## Convert strings to categoricals

ext_stock_hf["Name"] = ext_stock_hf["Name"].asfactor()

# Step 3 (of 6).  Train a single DRF model

We will train a random forest model with our added lag features as predictors.

In [None]:
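# Set Predictors
# Same-day Open, Close, High, and Low are excluded; only their lagged versions are used
excluded = {"Volume", "Open", "Close", "High", "Low"}
# A list comprehension keeps the column order stable between runs
predictors = [c for c in ext_stock_hf.col_names if c not in excluded]
response = "Volume"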

In [None]:
# Split data into training and testing by time
# Test data is the last day of data

is_test = (ext_stock_hf["Date"].year() == 2017) & \
          (ext_stock_hf["Date"].month() == 12) & \
          (ext_stock_hf["Date"].day() == 29)

train = ext_stock_hf[is_test == 0]
test = ext_stock_hf[is_test == 1]
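
The same time-based holdout can be sketched in plain Python, which also makes it easy to check that no training row is dated after the test day. The helper `split_by_date` and the toy rows are hypothetical and only illustrate the idea; they are not part of H2O.

```python
from datetime import date


def split_by_date(rows, test_day):
    """Hold out every row on test_day; all other rows form the training set."""
    train = [r for r in rows if r["Date"] != test_day]
    test = [r for r in rows if r["Date"] == test_day]
    return train, test


# Made-up toy rows, not taken from the Kaggle dataset
toy_rows = [
    {"Name": "AAA", "Date": date(2017, 12, 27), "Volume": 100.0},
    {"Name": "AAA", "Date": date(2017, 12, 28), "Volume": 110.0},
    {"Name": "AAA", "Date": date(2017, 12, 29), "Volume": 120.0},
    {"Name": "BBB", "Date": date(2017, 12, 28), "Volume": 6.0},
    {"Name": "BBB", "Date": date(2017, 12, 29), "Volume": 7.0},
]

# The test day is the most recent date in the data
test_day = max(r["Date"] for r in toy_rows)
train_rows, test_rows = split_by_date(toy_rows, test_day)

# Sanity check: training rows dated after the test day would leak future information
leaks = [r for r in train_rows if r["Date"] > test_day]
print(len(train_rows), len(test_rows), len(leaks))
```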

In [None]:
# Train Random Forest

from h2o.estimators import H2ORandomForestEstimator
drf_model = H2ORandomForestEstimator(model_id = "drf_model.hex",
                                     seed = 1234,
                                     ntrees = 5)
drf_model.train(x = predictors,
                y = response,
                training_frame = train,
                validation_frame = test)

# Step 4 (of 6).  Examine DRF model

The Mean Absolute Percent Error is about 20% on our test data.

In [None]:
preds = drf_model.predict(test)
mape = ((test["Volume"] - preds).abs()/test["Volume"]).mean()[0]
print("Mean Absolute Percent Error: " + "{0:.0f}%".format(100*mape))
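
As a worked example of the metric, the sketch below computes the same quantity on three made-up values in plain Python. The function name is illustrative, and dividing by `abs(a)` is an assumption that matches the cell above whenever the actual volumes are positive.

```python
def mean_absolute_percent_error(actual, predicted):
    """Average of |actual - predicted| / |actual| over all rows."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("the metric is undefined when an actual value is zero")
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values: the per-row errors are 10%, 10%, and 0%
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

mape = mean_absolute_percent_error(actual, predicted)
print("Mean Absolute Percent Error: " + "{0:.0f}%".format(100 * mape))  # prints 7%
```

Because each error is divided by the actual value, a miss of 10 on a volume of 100 counts as much as a miss of 20 on a volume of 200. This keeps heavily traded stocks from dominating the metric.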

The graph below shows the variable importance for the random forest model.  The most important predictors are the volume lags.  We can use partial dependence plots to see the relationship between these features and the model's prediction.

In [None]:
%matplotlib inline
drf_model.varimp_plot()

In [None]:
# Restrict the data to typical volumes: keep rows whose volume lags are below the 90th percentile of Volume
max_volume = train["Volume"].quantile(prob = [0.9])[0, 1]
pdp_data = train[(train["Volume_lag1"] < max_volume) &
                 (train["Volume_lag2"] < max_volume) &
                 (train["Volume_lag3"] < max_volume)]
# Create partial dependence plots for the three volume lags
pdps = drf_model.partial_plot(data = pdp_data, cols = ["Volume_lag1", "Volume_lag2", "Volume_lag3"])

The partial plots show that the Volume trend tracks the Volume values from the previous days for the company.
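
For intuition about what a partial dependence curve measures, here is a minimal sketch of the general idea in plain Python: fix one feature at a grid value for every row, average the model's predictions, and repeat for each grid value. It is not H2O's implementation. The `toy_model` function is a hypothetical stand-in for a trained model, and its coefficients are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to it in every row and average the predictions."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    # Hypothetical stand-in for a trained model; the weights are made up
    return 0.6 * row["Volume_lag1"] + 0.3 * row["Volume_lag2"] + 0.1 * row["Volume_lag3"]


# Made-up rows with only the three lag features
toy_rows = [
    {"Volume_lag1": 100.0, "Volume_lag2": 100.0, "Volume_lag3": 100.0},
    {"Volume_lag1": 200.0, "Volume_lag2": 200.0, "Volume_lag3": 200.0},
]

curve = partial_dependence(toy_model, toy_rows, "Volume_lag1", [100.0, 200.0, 300.0])
for value, mean_prediction in curve:
    print(value, round(mean_prediction, 2))
```

In this toy case the averaged prediction rises as `Volume_lag1` rises, which is the kind of relationship the partial plots above show for the real model.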

# Step 5 (of 6).  Run AutoML

Now we can try running AutoML to see if we can improve the results even further.

In [None]:
from h2o.automl import H2OAutoML

auto_ml = H2OAutoML(project_name = "stock_forecast",
                    max_runtime_secs = 120, 
                    exclude_algos = ["DRF"],
                    keep_cross_validation_predictions = False,
                    keep_cross_validation_models = False,
                    seed = 1234)

auto_ml.train(x = predictors,
              y = response,
              training_frame = train,
              leaderboard_frame = test)

In [None]:
auto_ml.leaderboard

# Step 6 (of 6). Watch AutoML progress (in the H2O Flow Web UI)

* Open H2O Flow in a browser (port 54321 by default)
* In H2O Flow, go to Admin -> Jobs
* Click on the "Auto Model" job with the "stock_forecast" job name and explore it

# Bonus: GitHub location for this tutorial

* https://github.com/h2oai/h2o-tutorials/tree/master/nyc-workshop-2018/h2o_sw/sparkling-water-hands-on