Finding trends in oil price data
================================

Johannes Graner, Albert Nilsson and Raazesh Sainudiin

2020, Uppsala, Sweden

This project was supported by Combient Mix AB through summer internships
at:

Combient Competence Centre for Data Engineering Sciences, Department of
Mathematics, Uppsala University, Uppsala, Sweden

Resources
---------

This builds on the following library and its antecedents therein:

-   <https://github.com/lamastex/spark-trend-calculus>

This work was inspired by:
--------------------------

-   Antoine Amend's
    [texata-2017](https://github.com/aamend/texata-r2-2017)
-   Andrew Morgan's [Trend Calculus
    Library](https://github.com/ByteSumoLtd/TrendCalculus-lua)

When dealing with time series, it can be difficult to identify and
analyze trends in the data.

One approach is the Trend Calculus algorithm invented by Andrew
Morgan. More information about Trend Calculus can be found on the
[spark-trend-calculus-examples](https://lamastex.github.io/spark-trend-calculus-examples/)
page.

In [None]:
import org.lamastex.spark.trendcalculus._
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import java.sql.Timestamp

  

>     import org.lamastex.spark.trendcalculus._
>     import spark.implicits._
>     import org.apache.spark.sql._
>     import org.apache.spark.sql.functions._
>     import java.sql.Timestamp

In [None]:
ls /dbfs/FileStore/shared_uploads/fabiansi@kth.se/


  

>     DAT_ASCII_BCOUSD_M1_2010_csv.gz
>     DAT_ASCII_BCOUSD_M1_2011_csv.gz
>     DAT_ASCII_BCOUSD_M1_2012_csv.gz
>     DAT_ASCII_BCOUSD_M1_2013_csv.gz
>     DAT_ASCII_BCOUSD_M1_2014_csv.gz
>     DAT_ASCII_BCOUSD_M1_2015_csv.gz
>     DAT_ASCII_BCOUSD_M1_2016_csv.gz
>     DAT_ASCII_BCOUSD_M1_2017_csv.gz
>     DAT_ASCII_BCOUSD_M1_2018_csv.gz
>     DAT_ASCII_BCOUSD_M1_201901_csv.gz
>     DAT_ASCII_BCOUSD_M1_201902_csv.gz
>     DAT_ASCII_BCOUSD_M1_201903_csv.gz
>     DAT_ASCII_BCOUSD_M1_201904_csv.gz
>     DAT_ASCII_BCOUSD_M1_201905_csv.gz
>     DAT_ASCII_BCOUSD_M1_201906_csv.gz
>     joinedDSWithMaxRev

  

The input to the algorithm is data in the format (ticker, time, value).
In this example, ticker is `"BCOUSD"` (Brent Crude Oil), time is given
in minutes and value is the closing price for Brent Crude Oil during
that minute.

This data is historical data from 2010 to 2019 taken from
https://www.histdata.com/ using methods from
[FX-1-Minute-Data](https://github.com/philipperemy/FX-1-Minute-Data) by
Philippe Remy. In this notebook, everything is done on static
dataframes. See **02streamable-trend-calculus** for examples on
streaming dataframes.

There are gaps in the data, notably during the weekends when no trading
takes place, but this does not affect the algorithm, as it makes no
assumptions about the data other than that time is monotonically
increasing.
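The assumption about time can be checked directly. The following is a minimal sketch in plain Python, not part of the library: the sample rows are made up, the helper `is_time_increasing` is a hypothetical name, and the check is the strict version (each timestamp later than the previous one). It shows that a weekend gap leaves the ordering intact.

```python
from datetime import datetime

# Hypothetical sample in the (ticker, time, value) format; the prices are made up.
points = [
    ("BCOUSD", datetime(2019, 6, 7, 16, 58), 63.10),
    ("BCOUSD", datetime(2019, 6, 7, 16, 59), 63.12),
    # Weekend gap: no rows between Friday evening and Sunday evening.
    ("BCOUSD", datetime(2019, 6, 9, 18, 0), 63.40),
    ("BCOUSD", datetime(2019, 6, 9, 18, 1), 63.35),
]


def is_time_increasing(rows):
    """Return True if every timestamp is strictly later than the previous one."""
    times = [time for (_, time, _) in rows]
    return all(earlier < later for earlier, later in zip(times, times[1:]))


print(is_time_increasing(points))  # True: the gap does not break the ordering
```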

The window size is set to 2, which is minimal, because we want to retain
as much information as possible.

In [None]:
dbutils.fs.ls("dbfs:/FileStore/shared_uploads/fabiansi@kth.se")

  

>     res0: Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2010_csv.gz, DAT_ASCII_BCOUSD_M1_2010_csv.gz, 284384), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2011_csv.gz, DAT_ASCII_BCOUSD_M1_2011_csv.gz, 2479115), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2012_csv.gz, DAT_ASCII_BCOUSD_M1_2012_csv.gz, 2327511), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2013_csv.gz, DAT_ASCII_BCOUSD_M1_2013_csv.gz, 2109500), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2014_csv.gz, DAT_ASCII_BCOUSD_M1_2014_csv.gz, 1961172), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2015_csv.gz, DAT_ASCII_BCOUSD_M1_2015_csv.gz, 2205678), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2016_csv.gz, DAT_ASCII_BCOUSD_M1_2016_csv.gz, 2131659), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2017_csv.gz, DAT_ASCII_BCOUSD_M1_2017_csv.gz, 1854793), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_2018_csv.gz, DAT_ASCII_BCOUSD_M1_2018_csv.gz, 2251250), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_201901_csv.gz, DAT_ASCII_BCOUSD_M1_201901_csv.gz, 250411), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_201902_csv.gz, DAT_ASCII_BCOUSD_M1_201902_csv.gz, 213207), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_201903_csv.gz, DAT_ASCII_BCOUSD_M1_201903_csv.gz, 211928), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_201904_csv.gz, DAT_ASCII_BCOUSD_M1_201904_csv.gz, 208552), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_201905_csv.gz, DAT_ASCII_BCOUSD_M1_201905_csv.gz, 241092), 
FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/DAT_ASCII_BCOUSD_M1_201906_csv.gz, DAT_ASCII_BCOUSD_M1_201906_csv.gz, 171191), FileInfo(dbfs:/FileStore/shared_uploads/fabiansi@kth.se/joinedDSWithMaxRev/, joinedDSWithMaxRev/, 0))

In [None]:
val windowSize = 2
val oilDS = spark.read.fx1m("dbfs:/FileStore/shared_uploads/fabiansi@kth.se/*csv.gz").toDF.withColumn("ticker", lit("BCOUSD")).select($"ticker", $"time" as "x", $"close" as "y").as[TickerPoint].orderBy("x")


  

>     windowSize: Int = 2
>     oilDS: org.apache.spark.sql.Dataset[org.lamastex.spark.trendcalculus.TickerPoint] = [ticker: string, x: timestamp ... 1 more field]

  

If we want to look at long term trends, we can use the output time
series as input for another iteration. The output contains the points of
the input where the trend changes (reversals). This can be repeated
several times, resulting in longer term trends.

Here, we look at up to `numReversals` iterations of the algorithm (50 in
the cell below). It is no problem if the output of some iteration is too
small to find a reversal in the next iteration, since the output will
just be an empty dataframe in that case.
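The idea of feeding the reversals back in as input can be sketched in plain Python. This is a simplified stand-in, not the spark-trend-calculus implementation: the helper names `value`, `reversals` and `iterate_reversals`, the rule used to decide the trend between two windows, and the synthetic series are all assumptions made for illustration.

```python
def value(point):
    """Return the y-value of a (time, value) pair."""
    return point[1]


def reversals(points, window_size=2):
    """One simplified pass: return the points where the trend changes sign."""
    # Split into consecutive windows, dropping an incomplete last window.
    windows = [points[i:i + window_size]
               for i in range(0, len(points) - window_size + 1, window_size)]
    found = []
    trend = 0  # +1 rising, -1 falling, 0 not yet known
    for before, after in zip(windows, windows[1:]):
        high_b, low_b = max(before, key=value), min(before, key=value)
        high_a, low_a = max(after, key=value), min(after, key=value)
        if value(high_a) > value(high_b) and value(low_a) > value(low_b):
            new_trend = 1
        elif value(high_a) < value(high_b) and value(low_a) < value(low_b):
            new_trend = -1
        else:
            new_trend = trend  # ambiguous windows keep the current trend
        if trend != 0 and new_trend != trend:
            # A rising trend ends at the previous high, a falling one at the previous low.
            found.append(high_b if trend == 1 else low_b)
        trend = new_trend
    return found


def iterate_reversals(points, max_iterations):
    """Feed each output back in as input; return one list of points per iteration."""
    levels = []
    current = points
    for _ in range(max_iterations):
        current = reversals(current)  # empty once the input is too small
        levels.append(current)
    return levels


# Synthetic (time, value) series: a fast zigzag on top of a slow one.
values = [abs((t % 12) - 6) + 0.5 * abs((t % 60) - 30) for t in range(240)]
series = list(enumerate(values))

for order, level in enumerate(iterate_reversals(series, 5), start=1):
    print("order", order, "reversals:", len(level))
```

Each pass only selects points from its input, so the reversals of a given order are always a subset of those of the previous order, and an input that is too short yields an empty list rather than an error.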

In [None]:
val numReversals = 50
val dfWithReversals = new TrendCalculus2(oilDS, windowSize, spark).nReversalsJoinedWithMaxRev(numReversals)

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("trendCalculus") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")


def logLevel(sc):
    # REF: https://stackoverflow.com/questions/25193488/how-to-turn-off-info-logging-in-spark
    # Silence Spark's own "org" loggers below ERROR via the JVM-side log4j API.
    log4jLogger = sc._jvm.org.apache.log4j
    log4jLogger.Logger.getLogger("org").setLevel(log4jLogger.Level.ERROR)
    log = log4jLogger.LogManager.getLogger(__name__)
    log.warn("Custom WARN message")


# Pass the SparkContext (not the SparkSession) to match the parameter name.
logLevel(sc)

print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())

# sc._jvm.org.lamastex.spark.trendcalculus.TrendCalculus2.FHLS(emptyPoint,emptyPoint,emptyPoint,emptyPoint)
sc._jvm.org.lamastex.spark.trendcalculus.Point(0, 0.0)

# oilDS = spark.read.fx1m("dbfs:/FileStore/shared_uploads/fabiansi@kth.se/*csv.gz")

# sc._jvm.org.lamastex.spark.trendcalculus.TrendCalculus2.nReversalsJoined(42)

In [None]:
display(dfWithReversals)

  

The number of reversals decreases rapidly as more iterations are done.

In [None]:
dfWithReversals.cache.count

In [None]:
(1 to numReversals).foreach( i => println(dfWithReversals.filter(s"reversal$i is not null").count))

  

Writing the dataframe to Parquet so that it can be read from Python.

In [None]:
dfWithReversals.write.mode(SaveMode.Overwrite).parquet("dbfs:/FileStore/shared_uploads/fabiansi@kth.se/joinedDSWithMaxRev")
dfWithReversals.unpersist

  

Visualization
-------------

Plotly in Python is used to make interactive visualizations.

In [None]:
from plotly.offline import plot
from plotly.graph_objs import Scattergl
from datetime import datetime
joinedDS = spark.read.parquet("dbfs:/FileStore/shared_uploads/fabiansi@kth.se/joinedDSWithMaxRev").orderBy("x")

  

The time series has to be thinned out before it can be displayed
locally. The count below shows how many points remain after thinning.

No information about higher order trend reversals is lost since every
higher order reversal is also a lower order reversal.
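Why the thinning is lossless for higher orders can be seen on a toy example. The sketch below uses made-up rows and a hypothetical helper `thin`, and it assumes that `maxRev` holds the highest order at which a point is a reversal.

```python
# Hypothetical rows (x, y, maxRev); the values are made up for illustration.
rows = [
    (0, 10.0, 0),
    (1, 12.0, 3),
    (2, 11.0, 1),
    (3, 9.0, 5),
    (4, 9.5, 0),
    (5, 13.0, 4),
]


def thin(rows, min_order):
    """Keep only rows whose maxRev exceeds min_order, like filter("maxRev > 2")."""
    return [row for row in rows if row[2] > min_order]


thinned = thin(rows, 2)

# Every reversal of order 4 or higher is still present after thinning.
high_order = [row for row in rows if row[2] >= 4]
print(all(row in thinned for row in high_order))  # True
```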

In [None]:
joinedDS.filter("maxRev > 2").count()

In [None]:
fullTS = joinedDS.filter("maxRev > 2").select("x","y","maxRev").collect()

  

Picking an interval to focus on.

Start and end dates are given as (year, month, day, hour, minute,
second). Only year, month and day are required. The interval from 1800
to 2200 ensures all data is selected.
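As a small illustration of narrowing the interval, the sketch below selects a single year from a few made-up rows. The names `start`, `end` and `rows` are hypothetical. Since the omitted time fields default to midnight, an end date given without a time excludes the rest of that day.

```python
from datetime import datetime

# Only year, month and day are required; hour, minute and second default to 0.
start = datetime(2018, 1, 1)
end = datetime(2018, 12, 31, 23, 59, 59)

# Hypothetical rows with a timestamp 'x' and a price 'y' (made-up values).
rows = [
    {"x": datetime(2017, 12, 31, 23, 59), "y": 66.5},
    {"x": datetime(2018, 6, 15, 12, 0), "y": 73.2},
    {"x": datetime(2019, 1, 1, 0, 0), "y": 54.1},
]

# Chained comparison keeps the rows inside the closed interval [start, end].
selected = [row for row in rows if start <= row["x"] <= end]
print(len(selected))  # 1
```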

In [None]:
startDate = datetime(1800,1,1)
endDate = datetime(2200,12,31)
TS = [row for row in fullTS if startDate <= row['x'] <= endDate]

  

Setting up the visualization.

In [None]:
numReversals = 15
startReversal = 7

allData = {'x': [row['x'] for row in TS], 'y': [row['y'] for row in TS], 'maxRev': [row['maxRev'] for row in TS]}
# Keep only points that are reversals of order startReversal or higher.
revTS = [row for row in TS if row['maxRev'] >= startReversal]
# One colour per plotted reversal order, shading from cyan towards red as the order increases.
colorList = ['rgba(' + str(tmp) + ',' + str(255-tmp) + ',' + str(255-tmp) + ',1)' for tmp in [int(i*255/(numReversals-startReversal+1)) for i in range(1,numReversals-startReversal+2)]]

def getRevTS(tsWithRevMax, revMax):
  # x and y values of the points whose maximum reversal order is at least revMax.
  x = [row['x'] for row in tsWithRevMax if row['maxRev'] >= revMax]
  y = [row['y'] for row in tsWithRevMax if row['maxRev'] >= revMax]
  return x,y,revMax

reducedData = [getRevTS(revTS, i) for i in range(startReversal, numReversals+1)]

# One marker trace per reversal order; the marker size grows with the order.
markerPlots = [Scattergl(x=x, y=y, mode='markers', marker=dict(color=colorList[i-startReversal], size=i), name='Reversal ' + str(i)) for (x,y,i) in reducedData]

  

### Plotting result as plotly graph

The graph is interactive: drag to zoom in on an area (double-click to
reset the view) and click on the legend to hide or show individual
series.

In [None]:
p = plot(
  [Scattergl(x=allData['x'], y=allData['y'], mode='lines', name='Oil Price')] + markerPlots
  ,
  output_type='div'
)

displayHTML(p)