## Setting Up Kinesis Stream
Before starting this lab, you need to set up a Kinesis stream in your AWS account and push bitcoin data to it from the CoinDesk API. Follow these steps:
1. Log in to your AWS account and launch an EC2 instance. The smallest instance type is fine, as we will only run a very small task. If you are eligible for the free tier, you can use it.
2. Create a Kinesis stream named "bitcoin-exchange-rate".
3. Get an AWS access key and secret key, and make sure they have access to Kinesis.
4. Configure the AWS CLI on your EC2 instance with these keys. To do so, type aws configure in the EC2 terminal and fill in the keys.
5. Launch the push_data_to_kinesis.py script (from this folder) as a background process or via screen (so it keeps running). Install any missing pip dependencies (e.g. boto3).
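
The push_data_to_kinesis.py script itself is not reproduced in this notebook. As a rough sketch of what such a producer might do, the snippet below builds one JSON record in the shape the notebook later parses (timestamp, EUR, USD, GBP) and sends it with the boto3 Kinesis client's put_record call. The function names, the partition key, and the exact payload layout are assumptions made for illustration, not the contents of the real script.

```python
import json
from datetime import datetime, timezone

STREAM_NAME = "bitcoin-exchange-rate"  # stream name from step 2


def build_record(rates, now=None):
    """Serialize one price reading as JSON bytes.

    The field names mirror the schema used later in this notebook;
    the real script's payload may differ (assumption).
    """
    now = now or datetime.now(timezone.utc)
    payload = {
        "timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
        "EUR": float(rates["EUR"]),
        "USD": float(rates["USD"]),
        "GBP": float(rates["GBP"]),
    }
    return json.dumps(payload).encode("utf-8")


def push_record(client, rates, now=None):
    """Send one reading to Kinesis through a boto3 Kinesis client."""
    return client.put_record(
        StreamName=STREAM_NAME,
        Data=build_record(rates, now),
        PartitionKey="bitcoin",  # hypothetical key; any string works for a single shard
    )
```

With boto3 installed and credentials configured, the client would be created with boto3.client("kinesis", region_name="us-west-2") and passed to push_record in a loop that fetches fresh rates. Passing the client in as an argument keeps the record-building logic testable without AWS access.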

In [2]:
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, to_timestamp
from pyspark.sql.functions import window
from pyspark.sql.functions import avg

### Reading From Kinesis Stream

We will start our streaming application by reading the data from Kinesis. Kinesis is a message queue provided by AWS, similar to Kafka. You can read from a Kinesis stream in the following way:

In [5]:
kinesisDF = spark \
  .readStream \
  .format("kinesis") \
  .option("streamName", "bitcoin-exchange-rate") \
  .option("initialPosition", "earliest") \
  .option("region", "us-west-2") \
  .option("awsAccessKey", 'awsAccessKey') \
  .option("awsSecretKey", 'awsSecretKey') \
  .load()

It's a good idea to clear out old files: streaming applications produce a lot of temporary files, and a free Databricks account limits how many temporary files you can have.

In [7]:
dbutils.fs.rm('dbfs:/SOME_CHECKPOINT_DIRECTORY/', True)
dbutils.fs.rm('dbfs:/tmp/', True)

We will define an explicit schema for the JSON payload. The from_json function needs a schema to parse each record, and declaring it up front also avoids any inference work.

In [9]:
pythonSchema = StructType().add("timestamp", StringType()).add("EUR", FloatType()).add("USD", FloatType()).add("GBP", FloatType())

Now we will read from the stream into our streaming dataframe!

In [11]:
bitcoinDF = kinesisDF.selectExpr("cast (data as STRING) jsonData").select(from_json("jsonData", pythonSchema).alias("bitcoin")).select("bitcoin.*")

We will also convert the timestamp column to the timestamp type so that we can query it with datetime objects in Python.

In [13]:
bitcoinDF = bitcoinDF.withColumn('timestamp', to_timestamp(bitcoinDF.timestamp, "yyyy-MM-dd HH:mm:ss"))
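
To make the two steps above concrete, here is a plain-Python sketch of what happens to a single record: the raw bytes are decoded and parsed as JSON (the role of from_json), and the timestamp string is converted to a datetime (the role of to_timestamp with the pattern "yyyy-MM-dd HH:mm:ss"). The sample record and its values are invented for illustration, and parse_record is a hypothetical helper, not part of Spark.

```python
import json
from datetime import datetime


def parse_record(raw):
    """Decode one Kinesis payload into typed Python values."""
    record = json.loads(raw.decode("utf-8"))
    return {
        # "%Y-%m-%d %H:%M:%S" is the strptime equivalent of "yyyy-MM-dd HH:mm:ss"
        "timestamp": datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S"),
        "EUR": float(record["EUR"]),
        "USD": float(record["USD"]),
        "GBP": float(record["GBP"]),
    }


# Made-up sample payload in the assumed record shape
sample = b'{"timestamp": "2021-01-01 12:07:00", "EUR": 10.0, "USD": 12.5, "GBP": 9.0}'
row = parse_record(sample)
```

Spark applies the same kind of transformation to every record in the stream, with the difference that malformed JSON produces null columns in Spark rather than raising an exception.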

In [14]:
display(bitcoinDF)

### Querying!
Now you can use everything you learned previously about Spark SQL! For example, you can group by an attribute and aggregate, filter, or select as you wish! <br/>
We haven't yet introduced the concept of windowing, which we will briefly look at now.

A window function can also be applied to bucketize rows into one or more time windows, given a timestamp column. For that we will use the window function (pyspark.sql.functions.window) inside groupBy. <br/>
You can call the window function in the following way: __window(timeColumn, windowDuration, slideDuration=None, startTime=None)__. The window duration and slide duration are defined as follows:
* Window Duration: the length of each window, i.e. how far back in time the windowed transformation goes
* Slide Duration: how often a new window starts, i.e. how often the windowed transformation is computed
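
When the slide duration is shorter than the window duration, windows overlap and a single row falls into several of them. The plain-Python sketch below illustrates this for a 10-minute window sliding every 5 minutes. It assumes windows are half-open intervals aligned to the Unix epoch (no startTime offset); sliding_windows is a hypothetical helper written for this illustration, not a Spark function.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def sliding_windows(ts, window, slide):
    """Return every [start, end) window that contains the timestamp ts."""
    # Latest window start at or before ts, aligned to a multiple of the slide
    start = ts - (ts - EPOCH) % slide
    windows = []
    # Walk backwards while the window starting at `start` still covers ts
    while start + window > ts:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


ts = datetime(2021, 1, 1, 12, 7)
for start, end in sliding_windows(ts, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), "to", end.time())
# prints:
# 12:00:00 to 12:10:00
# 12:05:00 to 12:15:00
```

A reading taken at 12:07 therefore contributes to two window averages. With a slide equal to the window duration, each row would land in exactly one window.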

In [17]:
windowedCounts = bitcoinDF.groupBy(window(bitcoinDF.timestamp, "10 minutes", "5 minutes").alias('time_window')).agg(avg(bitcoinDF.EUR).alias('window_avg_euro_rate'))
display(windowedCounts)

In [18]:
# to read the stream from memory, we set up an in-memory table called bitcoin_window
query = windowedCounts.writeStream.format("memory").queryName("bitcoin_window").outputMode("complete").start()

In [19]:
# you can then take your stream to a dataframe using SQL queries in the following way
df = spark.sql('select time_window.start, time_window.end, window_avg_euro_rate from bitcoin_window')

In [20]:
# we can query the live data for the average euro rate over the last hour
from datetime import datetime, timedelta
from pyspark.sql.functions import avg
# cutoff timestamp: only windows starting within the last 60 minutes are kept
one_hour_ago = datetime.now() - timedelta(minutes=60)
last_hour_rate_query = df.filter(df.start > one_hour_ago).select('window_avg_euro_rate').agg(avg('window_avg_euro_rate').alias('rate')).collect()
print(last_hour_rate_query[0].rate)

## Exercise
Try writing a query that computes the maximum bitcoin rate per ten-minute window for USD, EUR, and GBP. Show them all in a line chart.

## Challenge
Write a smart algorithm to trade bitcoin automatically. Start with a hypothetical amount (e.g. ten bitcoin), trade it between currencies, and see if you can automatically increase your net worth.