# kafkaReceiveDataPy
This notebook receives data from Kafka on the topic 'satori-volatility' and stores it in the 'sent_received' table of the 'cryptocurrency_market_data' keyspace in Cassandra (created by cassandra_init.script in startup_script.sh).

```
CREATE KEYSPACE cryptocurrency_market_data WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};

CREATE TABLE cryptocurrency_market_data.sent_received(
 exchange TEXT,
 cryptocurrency TEXT,
 basecurrency TEXT,
 type TEXT,
 price TEXT,
 size TEXT,
 bid TEXT,
 ask TEXT,
 open TEXT,
 high TEXT,
 low TEXT,
 volume TEXT,
 timestamp TEXT,
PRIMARY KEY (timestamp)
);
```

A message carrying the cryptocurrency market data is received every second.
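
As an illustration, the sketch below parses one such message and fills empty fields the same way the streaming cell further down does. The field names come from the table definition above; the sample values, the `FIELDS` list and the `normalise` helper are invented for this example, and the assumption that empty fields arrive as JSON `null` is inferred from the streaming code, not confirmed by the producer.

```python
import json

# Column names taken from the sent_received table definition.
FIELDS = ["exchange", "cryptocurrency", "basecurrency", "type", "price", "size",
          "bid", "ask", "open", "high", "low", "volume"]

# Hypothetical message: the values are invented for illustration only.
raw = ('{"exchange": "example-exchange", "cryptocurrency": "BTC", '
       '"basecurrency": "USD", "type": "ticker", "price": "100.5", '
       '"size": null, "bid": "100.4", "ask": "100.6", "open": "99.0", '
       '"high": "101.0", "low": "98.5", "volume": "12.3", '
       '"timestamp": "1500000000"}')

def normalise(message):
    """Return one dict per message, with empty values replaced by the string 'null'."""
    record = {f: (message.get(f) if message.get(f) else "null") for f in FIELDS}
    # timestamp is the primary key, so it is copied as-is and must be present.
    record["timestamp"] = message["timestamp"]
    return record

record = normalise(json.loads(raw))
print(record["size"])   # null
print(record["price"])  # 100.5
```

Writing the string 'null' instead of a real null keeps every column populated, which is the approach the streaming cell takes.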

## Add dependencies

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 pyspark-shell'
import time

## Load modules and start SparkContext
Note that the SparkContext must be started for the package dependencies to actually be loaded. Two local cores are used (`local[2]`), since one is occupied by the Kafka receiver.

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
conf = SparkConf() \
    .setAppName("Streaming satori-volatility") \
    .setMaster("local[2]") \
    .set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(conf=conf) 
sqlContext=SQLContext(sc)
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

## SaveToCassandra function
Takes an RDD of rows and, if it is not empty, appends it to the Cassandra table.

In [None]:
def saveToCassandra(rows):
    # Skip empty batches: createDataFrame cannot infer a schema from an empty RDD
    if not rows.isEmpty():
        sqlContext.createDataFrame(rows).write\
            .format("org.apache.spark.sql.cassandra")\
            .mode('append')\
            .options(table="sent_received", keyspace="cryptocurrency_market_data")\
            .save()

## Create streaming task
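* Receive data from the Kafka 'satori-volatility' topic in 15-second batches
* Parse each message as JSON and build a row from it, replacing empty values with the string 'null'
* Compute two volatility estimates from the open, high and low prices of each message
* Save each RDD in the DStream to Cassandra, and print it on screen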

In [None]:
import json
import numpy as np
ssc = StreamingContext(sc, 15)
kvs = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming-consumer", {'satori-volatility': 1})
data = kvs.map(lambda x: json.loads(x[1]))
rows= data.map(lambda x:Row(timestamp=x["timestamp"],
                            exchange=x["exchange"] if x["exchange"] else 'null',
                            cryptocurrency=x["cryptocurrency"] if x["cryptocurrency"] else 'null',
                            basecurrency=x["basecurrency"] if x["basecurrency"] else 'null',
                            type=x["type"] if x["type"] else 'null',
                            price=x["price"] if x["price"] else 'null',
                            size=x["size"] if x["size"] else 'null',
                            bid=x["bid"] if x["bid"] else 'null',
                            ask=x["ask"] if x["ask"] else 'null',
                            open=x["open"] if x["open"] else 'null',
                            high=x["high"] if x["high"] else 'null',
                            low=x["low"] if x["low"] else 'null',
                            volume=x["volume"] if x["volume"] else 'null'))
                            # https://stackoverflow.com/questions/40713693/inserting-null-values-into-cassandra
def formula(price1, price2):
    """Natural log of the ratio price1 / price2."""
    return np.log(float(price1)/float(price2))

def volatilities(x):
    # The messages carry no 'close' price, so 'open' stands in for it here,
    # which makes c = log(open/open) = 0.
    # With a close price, c would be formula(x['close'], x['open']).
    u = formula(x['high'], x['open'])
    d = formula(x['low'], x['open'])
    c = formula(x['open'], x['open'])
    v5 = 1/(2*np.log(4))*(formula(x['high'], x['low'])**2)
    v8 = 0.511*((u-d)**2) - 0.019*(c*(u+d) - 2*u*d) - 0.383*(c**2)
    return ("volatility 5", v5, "volatility 8", v8)

volatility = data.map(volatilities)
                                  
rows.foreachRDD(saveToCassandra)
rows.pprint()
volatility.pprint()
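
The two estimates computed in the cell above can be tried outside Spark. The sketch below is a plain-Python restatement under some assumptions: the formulas appear to match the Parkinson high/low estimator ("volatility 5") and the Garman-Klass open/high/low/close estimator ("volatility 8"), but the notebook does not name them, and the function names and sample prices here are made up for illustration.

```python
import math

def log_ratio(a, b):
    """Natural log of a / b, accepting numbers or numeric strings."""
    return math.log(float(a) / float(b))

def parkinson(high, low):
    # 1 / (4 ln 2) * ln(high/low)^2, the same constant as 1 / (2 ln 4) in the notebook
    return log_ratio(high, low) ** 2 / (4 * math.log(2))

def garman_klass(open_, high, low, close):
    u = log_ratio(high, open_)   # normalised high
    d = log_ratio(low, open_)    # normalised low
    c = log_ratio(close, open_)  # normalised close
    return 0.511 * (u - d) ** 2 - 0.019 * (c * (u + d) - 2 * u * d) - 0.383 * c ** 2

# Invented prices, passed as strings because the table stores TEXT columns.
o, h, l = "99.0", "101.0", "98.5"
v5 = parkinson(h, l)
# The stream has no close price, so the notebook substitutes the open price.
v8 = garman_klass(o, h, l, close=o)
print(v5, v8)
```

With the close replaced by the open, the normalised close is zero and the second estimate reduces to `0.511*(u-d)**2 + 0.038*u*d`, so it no longer uses any information about where the price ended the interval.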

## Start streaming

In [None]:
ssc.start()

## Stop streaming

In [None]:
ssc.stop(stopSparkContext=False,stopGraceFully=True)

## Get Cassandra table content

In [None]:
data=sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="sent_received", keyspace="cryptocurrency_market_data")\
    .load()
data.show()

## Get Cassandra table content using SQL

In [None]:
data.registerTempTable("sent_received")
data.printSchema()
data=sqlContext.sql("select * from sent_received")
data.show()