# Clustering financial data using DataFrames

### Importing MLlib libraries

In [1]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.SQLContext
%AddJar http://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar --magic
%AddDeps com.databricks spark-csv_2.10 1.3.0 --transitive

Using cached version of spark-csv_2.10-1.3.0.jar
Marking com.databricks:spark-csv_2.10:1.3.0 for download
Preparing to fetch from:
-> file:/tmp/.ivy2/
-> https://repo1.maven.org/maven2
-> New file at /tmp/.ivy2/https/repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar
-> New file at /tmp/.ivy2/https/repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1.jar
-> New file at /tmp/.ivy2/https/repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar


### Read data and pre-processing

The data used in this example is stock market data with the ConnorsRSI indicator. ConnorsRSI is a composite indicator built from RSI_CLOSE_3, PERCENT_RANK_100, and RSI_STREAK_2. We pass these three attributes, together with the ConnorsRSI value itself (CRSI) and RSI2, into the KMeans algorithm. These values are already normalized to a 0 to 100 scale.
The remaining columns (ID, LABEL, RTN5, FIVE_DAY_GL, and CLOSE) are not passed into the KMeans algorithm. We use them for further analysis once the instances have been clustered.
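
The composite relationship can be checked against the sample rows in the output further down. The sketch below assumes the usual ConnorsRSI definition, the arithmetic mean of its three components. The helper name `connorsRsi` is hypothetical and is not part of the notebook.

```scala
// Hypothetical helper: ConnorsRSI taken as the mean of its three components
// (an assumption about how the CRSI column was computed).
def connorsRsi(rsiClose3: Double, percentRank100: Double, rsiStreak2: Double): Double =
  (rsiClose3 + percentRank100 + rsiStreak2) / 3.0

// Component values copied from the 2015-09-16 and 2015-09-14 rows of the data.
val crsi0916 = connorsRsi(81.572, 84, 73.2035)
val crsi0914 = connorsRsi(51.3076, 31, 25.807)

println(crsi0916)
println(crsi0914)
```

For these two rows the results agree with the CRSI column (79.5918 and 36.0382) to four decimals. Because CRSI is then a linear combination of three other inputs, including it in the feature vector gives those components extra weight rather than adding new information.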

In [2]:
// Load the CSV file; the first row is read as the header and column types are inferred
val sqlContext = new SQLContext(sc)
val allDF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data/spykmeans.csv")

allDF.show()
allDF.schema

+--------------+-----+---------+-----------+------+--------+-----------+----------------+------------+-------+
|            ID|LABEL|     RTN5|FIVE_DAY_GL| CLOSE|    RSI2|RSI_CLOSE_3|PERCENT_RANK_100|RSI_STREAK_2|   CRSI|
+--------------+-----+---------+-----------+------+--------+-----------+----------------+------------+-------+
|2015-09-16:SPY|   UP|  2.76708|   -3.28704|200.18| 91.5775|     81.572|              84|     73.2035|79.5918|
|2015-09-15:SPY|   UP| 0.521704|   -2.29265|198.46| 83.4467|    72.9477|              92|     60.6273|75.1917|
|2015-09-14:SPY|   DN|  1.77579|    0.22958|196.01| 47.0239|    51.3076|              31|      25.807|36.0382|
|2015-09-11:SPY|   UP|  0.60854|  -0.655688|196.74| 69.9559|    61.0005|              76|      76.643|71.2145|
|2015-09-10:SPY|   UP| 0.225168|    1.98111|195.85| 57.2462|    53.9258|              79|     65.2266|66.0508|
|2015-09-09:SPY|   DN|   1.5748|    2.76708|194.79| 42.8488|    46.1728|               7|     31.9797|28.3842|

StructType(StructField(ID,StringType,true), StructField(LABEL,StringType,true), StructField(RTN5,DoubleType,true), StructField(FIVE_DAY_GL,DoubleType,true), StructField(CLOSE,DoubleType,true), StructField(RSI2,DoubleType,true), StructField(RSI_CLOSE_3,DoubleType,true), StructField(PERCENT_RANK_100,IntegerType,true), StructField(RSI_STREAK_2,DoubleType,true), StructField(CRSI,DoubleType,true))

### RDD / DataFrame conversions

In [3]:
// convert to rdd and cache the data

val rowsRDD = allDF.rdd.map(r => (r.getString(0), r.getString(1), r.getDouble(2),
                                  r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6),
                                  r.getInt(7), r.getDouble(8), r.getDouble(9) ))

rowsRDD.cache()

// Build the RDD of feature vectors that will be passed to KMeans and cache it, because KMeans
// makes several passes over the data. The features are RSI2, RSI_CLOSE_3, PERCENT_RANK_100,
// RSI_STREAK_2 and CRSI: the attributes used to assign each instance to a cluster.

val vectors = allDF.rdd.map(r => Vectors.dense( r.getDouble(5),r.getDouble(6),
                                               r.getInt(7).toDouble, r.getDouble(8), r.getDouble(9) )).cache()




In [4]:

rowsRDD.take(5).foreach(println)

(2015-09-16:SPY,UP,2.76708,-3.28704,200.18,91.5775,81.572,84,73.2035,79.5918)
(2015-09-15:SPY,UP,0.521704,-2.29265,198.46,83.4467,72.9477,92,60.6273,75.1917)
(2015-09-14:SPY,DN,1.77579,0.22958,196.01,47.0239,51.3076,31,25.807,36.0382)
(2015-09-11:SPY,UP,0.60854,-0.655688,196.74,69.9559,61.0005,76,76.643,71.2145)
(2015-09-10:SPY,UP,0.225168,1.98111,195.85,57.2462,53.9258,79,65.2266,66.0508)
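
Positional tuple fields such as `r._6` are easy to mix up in a ten-element tuple. As a sketch of one alternative, a case class can give each column a name. `SpyRow`, `features`, and `parseLine` are hypothetical names, and the parser assumes a plain comma-separated line with no quoted fields.

```scala
// Hypothetical record type mirroring the ten CSV columns.
case class SpyRow(id: String, label: String, rtn5: Double, fiveDayGl: Double,
                  close: Double, rsi2: Double, rsiClose3: Double,
                  percentRank100: Int, rsiStreak2: Double, crsi: Double) {
  // The five attributes the notebook passes to KMeans, in the same order.
  def features: Array[Double] =
    Array(rsi2, rsiClose3, percentRank100.toDouble, rsiStreak2, crsi)
}

// Assumes a simple comma-separated line without quoted or missing fields.
def parseLine(line: String): SpyRow = {
  val f = line.split(",").map(_.trim)
  SpyRow(f(0), f(1), f(2).toDouble, f(3).toDouble, f(4).toDouble,
         f(5).toDouble, f(6).toDouble, f(7).toInt, f(8).toDouble, f(9).toDouble)
}

// First row of the output above, written back as a CSV line.
val sample = parseLine("2015-09-16:SPY,UP,2.76708,-3.28704,200.18,91.5775,81.572,84,73.2035,79.5918")
println(sample.features.mkString(", "))
```

In the notebook, `Vectors.dense(sample.features)` would build the same kind of vector, since `Vectors.dense` also accepts an `Array[Double]`.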


### Run cluster analysis

In [5]:
// Train a KMeans model with 2 clusters and at most 20 iterations

val kMeansModel = KMeans.train(vectors, 2, 20)

//Print the center of each cluster

kMeansModel.clusterCenters.foreach(println)

// Get the prediction from the model with the ID so we can link them back to other information

val predictions = rowsRDD.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._6, r._7, r._8, r._9, r._10) ))}
predictions.take(5).foreach(println)
// convert the rdd to a dataframe

val predDF = predictions.toDF("ID","CLUSTER")

[25.590238191780806,33.762228675799086,22.24429223744292,25.428050107305936,27.14485747716894]
[79.06854271582732,72.01910089928057,70.35431654676259,72.65608415467624,71.67650269784168]
(2015-09-16:SPY,1)
(2015-09-15:SPY,1)
(2015-09-14:SPY,0)
(2015-09-11:SPY,1)
(2015-09-10:SPY,1)
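
`kMeansModel.predict` assigns a point to the cluster whose center is nearest in Euclidean distance. The plain-Scala sketch below reproduces that rule by hand, using the two centers printed above (rounded to four decimals) and two rows from the data. `squaredDistance` and `nearestCenter` are hypothetical helpers, not MLlib calls.

```scala
// Cluster centers as printed by the notebook, rounded to four decimals.
val centers: Array[Array[Double]] = Array(
  Array(25.5902, 33.7622, 22.2443, 25.4281, 27.1449),
  Array(79.0685, 72.0191, 70.3543, 72.6561, 71.6765)
)

// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to the given point.
def nearestCenter(point: Array[Double]): Int =
  centers.indices.minBy(i => squaredDistance(point, centers(i)))

// Feature order: RSI2, RSI_CLOSE_3, PERCENT_RANK_100, RSI_STREAK_2, CRSI.
val row0916 = Array(91.5775, 81.572, 84.0, 73.2035, 79.5918)
val row0914 = Array(47.0239, 51.3076, 31.0, 25.807, 36.0382)

println(nearestCenter(row0916))
println(nearestCenter(row0914))
```

The two indices agree with the predictions listed above for 2015-09-16 (cluster 1) and 2015-09-14 (cluster 0). Cluster numbering is arbitrary, so a fresh training run may swap the two labels.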


### Join the dataframes on ID

In [6]:
val t = allDF.join(predDF, "ID")
t.printSchema()

root
 |-- ID: string (nullable = true)
 |-- LABEL: string (nullable = true)
 |-- RTN5: double (nullable = true)
 |-- FIVE_DAY_GL: double (nullable = true)
 |-- CLOSE: double (nullable = true)
 |-- RSI2: double (nullable = true)
 |-- RSI_CLOSE_3: double (nullable = true)
 |-- PERCENT_RANK_100: integer (nullable = true)
 |-- RSI_STREAK_2: double (nullable = true)
 |-- CRSI: double (nullable = true)
 |-- CLUSTER: integer (nullable = false)



### Review a subset of each cluster

In [7]:
t.filter("CLUSTER = 0").show()
t.filter("CLUSTER = 1").show()

+--------------+-----+----------+-----------+------+-------+-----------+----------------+------------+-------+-------+
|            ID|LABEL|      RTN5|FIVE_DAY_GL| CLOSE|   RSI2|RSI_CLOSE_3|PERCENT_RANK_100|RSI_STREAK_2|   CRSI|CLUSTER|
+--------------+-----+----------+-----------+------+-------+-----------+----------------+------------+-------+-------+
|2007-11-21:SPY|   DN|  -4.05634|    3.88199|141.68|12.9386|    18.8982|               7|     35.5108|20.4696|      0|
|2008-01-18:SPY|   DN|  -5.77239|      2.408|132.06|2.42493|    6.64805|              18|     9.12246|11.2568|      0|
|2013-03-21:SPY|   DN|  -1.51215|     1.4965|154.36|26.5574|    32.0213|               9|     44.0927|28.3713|      0|
|2014-04-07:SPY|   DN|  -1.42773|  -0.759466|184.34|6.34707|    16.6131|               6|     8.95155|10.5216|      0|
|2014-12-04:SPY|   DN|0.00963206|     -1.671|207.66|63.3008|    62.9819|              34|     28.6038|41.8619|      0|
|2011-08-18:SPY|   DN|  -2.40348|    1.54572|114

### Get descriptive statistics for each cluster

In [8]:
t.filter("CLUSTER = 0").describe("RTN5","FIVE_DAY_GL","CLOSE","RSI2","RSI_CLOSE_3").show()
t.filter("CLUSTER = 0").describe("PERCENT_RANK_100","RSI_STREAK_2","CRSI","CLUSTER").show()
t.filter("CLUSTER = 1").describe("RTN5","FIVE_DAY_GL","CLOSE","RSI2","RSI_CLOSE_3").show()
t.filter("CLUSTER = 1").describe("PERCENT_RANK_100","RSI_STREAK_2","CRSI","CLUSTER").show()

+-------+-------------------+------------------+------------------+------------------+------------------+
|summary|               RTN5|       FIVE_DAY_GL|             CLOSE|              RSI2|       RSI_CLOSE_3|
+-------+-------------------+------------------+------------------+------------------+------------------+
|  count|                876|               876|               876|               876|               876|
|   mean|-1.2283751243607308|0.2847856990867581|141.65286529680358|25.590238191780823| 33.76222867579909|
| stddev| 2.7682920695071767| 3.085360825933131|36.991560549603435|18.784853399492384|18.649455797802382|
|    min|           -19.7934|          -19.7934|             68.11|          0.155807|           1.06545|
|    max|            6.38769|           19.4036|            212.99|           79.0877|           83.5509|
+-------+-------------------+------------------+------------------+------------------+------------------+

+-------+------------------+-----------------
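
Filtering once per cluster and calling `describe` on each subset repeats a lot of work. As a sketch of the underlying idea, the plain-Scala snippet below groups a handful of rows by cluster and computes the count and mean of FIVE_DAY_GL per group. The `Obs` type is hypothetical, and the five rows are copied from the first rows of the data and their predictions shown earlier, purely for illustration.

```scala
// Hypothetical record: one observation with its cluster and five-day gain/loss.
case class Obs(id: String, cluster: Int, fiveDayGl: Double)

// Five rows copied from the notebook output; far too few for real statistics.
val sampleObs = Seq(
  Obs("2015-09-16:SPY", 1, -3.28704),
  Obs("2015-09-15:SPY", 1, -2.29265),
  Obs("2015-09-14:SPY", 0, 0.22958),
  Obs("2015-09-11:SPY", 1, -0.655688),
  Obs("2015-09-10:SPY", 1, 1.98111)
)

// Group by cluster, then compute the count and mean of FIVE_DAY_GL per group.
val summary: Map[Int, (Int, Double)] =
  sampleObs.groupBy(_.cluster).map { case (c, rows) =>
    (c, (rows.size, rows.map(_.fiveDayGl).sum / rows.size))
  }

summary.toSeq.sortBy(_._1).foreach { case (c, (n, mean)) =>
  println(s"cluster=$c count=$n meanFiveDayGl=$mean")
}
```

On the joined DataFrame `t`, a comparable one-pass summary would be `t.groupBy("CLUSTER").agg(count("ID"), avg("FIVE_DAY_GL"))`, using the functions already imported from `org.apache.spark.sql.functions`.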
