##### Grading Feedback Cell

# IST 718: Big Data Analytics

- Professor: Willard Williamson <wewillia@syr.edu>
- Faculty Assistant: Yash Pasar
## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets are allowed from the internet.  Code from the class text books or class provided code can be copied in its entirety.__
- __Do not change homework file names.__ The FAs and the professor use these names to grade your homework.  Changing file names may result in a point reduction penalty.
- There could be tests in some cells (i.e., `assert` and `np.testing.` statements). These tests (if present) are used to grade your answers. **However, the professor and FAs could use __additional__ tests for your answer. Think about cases your code should handle correctly beyond the tests you can see, even if it passes all of them.**
- Before submitting your work, remember to check for runtime errors with the following procedure:
`Kernel`$\rightarrow$`Restart and Run All`.  All runtime errors will result in a minimum penalty of half off.
- Databricks is the official class runtime environment, so you should test your code on Databricks before submission.  If there is a runtime problem in the grading environment, we will try your code on Databricks before making a final grading decision.
- All plots shall include a title, and axis labels.
- Grading feedback cells are there for graders to provide feedback to students.  Don't change or remove grading feedback cells.

# Part 1: MapReduce with Spark

In [1]:
# Run this code to create the Spark session
from pyspark.sql import SparkSession
import types
import numpy as np
import numpy.testing as testing
spark = SparkSession.builder.appName('map-reduce').getOrCreate()
sc = spark.sparkContext

# Introduction

In this set of questions, we will study 10 stocks during one year (365 days). The data that we have are the log returns of the stocks.

The log return of a stock from time $t-1$ to time $t$ is
$$
\begin{align}
r_t &= \log \left(\frac{P_t}{P_{t-1}}\right)\\
    &= \log (P_t) - \log(P_{t-1})
\end{align}
$$

where $\log$ is the natural log and $P_t$ is the price of the stock at time $t$.

For example, if the price of a stock is 100 on day 0 and 120 on day 1, then the log return at time 1 is $\log(120) - \log(100) = 0.18232$. A positive log return means an increase in price and a negative log return means a decrease in price.
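
As a quick check of the definition, the two equivalent forms of the log return can be evaluated with Python's standard `math` module. This is only an illustrative sketch; the variable names `p_prev` and `p_curr` are not part of the assignment.

```python
import math

# Prices from the example: 100 on day 0 and 120 on day 1.
p_prev, p_curr = 100.0, 120.0

# Two equivalent forms of the log return (natural log).
r_ratio = math.log(p_curr / p_prev)
r_diff = math.log(p_curr) - math.log(p_prev)

print(round(r_ratio, 5))  # 0.18232
print(round(r_diff, 5))   # 0.18232
```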

## Question 1.1 Read data (5 pts)

In this question, you will transform an RDD that contains the contents of a text file with the log returns of 10 stocks over one year. Each element of this RDD is a string with three whitespace-separated fields: "time t", "stock ticker", and "return at time t". For example:

```python
stock_returns_rdd.take(5)
```

```
['1    FAD   0.044',
 '1    DHJ   0.033',
 '1    DFC   0.149',
 '1    EHG  -0.021',
 '1    IIK   0.031']
```

In [2]:
# upload the hw2_log_stock_returns.txt data file to Databricks and read it in here
import os
import pandas as pd

db_env = os.getenv("DATABRICKS_RUNTIME_VERSION")

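def get_training_filename(data_file_name):
    """Return the path of data_file_name for the current runtime environment."""
    grading_env = os.getenv("GRADING_RUNTIME_ENV")

    if db_env is not None:
        # Running on Databricks: uploaded files live under /FileStore/tables
        full_path_name = "/FileStore/tables/%s" % data_file_name
    elif grading_env is not None:
        # Running in the grading environment: the variable holds the data directory
        full_path_name = "%s/%s" % (grading_env, data_file_name)
    else:
        # Running locally: use the file name relative to the working directory
        full_path_name = data_file_name

    return full_path_name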

stock_returns_rdd = sc.textFile(get_training_filename('hw2_log_stock_returns.txt'))

In [3]:
# test
stock_returns_rdd.take(5)

['1    FAD   0.044',
 '1    DHJ   0.033',
 '1    DFC   0.149',
 '1    EHG  -0.021',
 '1    IIK   0.031']

Using Spark, transform `stock_returns_rdd` into another RDD where each element is a list of three elements `[time, ticker, log return]`. `time`, `ticker`, and `log return` must be an integer, a string, and a floating point number, respectively. You should define a map function `map_stock_list` to apply to `stock_returns_rdd`. Call this new RDD `stock_list_rdd`.

```python
stock_list_rdd.take(5)
```

```
[[1, 'FAD', 0.044],
 [1, 'DHJ', 0.033],
 [1, 'DFC', 0.149],
 [1, 'EHG', -0.021],
 [1, 'IIK', 0.031]]
```

In [4]:
# to transform each element of stock_returns_rdd and then define stock_list_rdd
def map_stock_list(e):
    x = e.split()
    
    return [int(x[0]), x[1], float(x[2])]

stock_list_rdd = stock_returns_rdd.map(map_stock_list)

stock_list_rdd.take(5)

[[1, 'FAD', 0.044],
 [1, 'DHJ', 0.033],
 [1, 'DFC', 0.149],
 [1, 'EHG', -0.021],
 [1, 'IIK', 0.031]]

In [5]:
# 5 pts: check that stock_list_rdd was created properly
testing.assert_array_less([stock_returns_rdd.id()], [stock_list_rdd.id()])
testing.assert_equal(stock_list_rdd.first(), [1, 'FAD', 0.044])
testing.assert_equal(stock_list_rdd.count(), 3650)
testing.assert_almost_equal(stock_list_rdd.map(lambda x: x[-1]).sum(), 
                           -9.597000000000007, decimal=2)

## Question 1.2 Maximum, minimum, and total (10 pts)

You will now create three map-reduce jobs that produce RDDs containing 1) the maximum log return per stock (`max_return_rdd`), 2) the minimum log return per stock (`min_return_rdd`), and 3) the total (sum) log return per stock (`total_return_rdd`) for the year. Each map-reduce job starts from `stock_list_rdd` defined in question 1.1. You should define the map and reduce functions for each RDD: `map_maximum`, `reduce_maximum`, `map_minimum`, `reduce_minimum`, `map_total`, and `reduce_total`.

Each of these RDDs should contain key-value pairs of the form `(ticker, value)`, with the ticker as the key.
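
To illustrate the key-value shape and the per-key reduction, here is a minimal plain-Python sketch that needs no Spark session. The helper `reduce_by_key` and the four `sample` rows are hypothetical stand-ins for Spark's `reduceByKey` and for `stock_list_rdd`; they are assumptions for illustration and not part of the homework data.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical rows in the [time, ticker, log return] layout of stock_list_rdd.
sample = [[1, 'FAD', 0.044], [1, 'DHJ', 0.033], [2, 'FAD', -0.010], [2, 'DHJ', 0.050]]

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group pairwise with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# Map step: keep the ticker as the key and the log return as the value.
pairs = [(e[1], e[2]) for e in sample]

# Reduce step: the same pairs serve all three aggregations.
max_by_ticker = reduce_by_key(pairs, max)
min_by_ticker = reduce_by_key(pairs, min)
total_by_ticker = reduce_by_key(pairs, lambda v1, v2: v1 + v2)

print(max_by_ticker)  # {'FAD': 0.044, 'DHJ': 0.05}
```

The reduce function only ever sees two values at a time. `max`, `min`, and `+` are associative and commutative, so they give the same result whatever order the values are combined in.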

In [6]:
# define map, reduce, and rdd for maximum
def map_maximum(e):
    
    return (e[1], e[2])
    
def reduce_maximum(v1, v2):
    
    return max(v1, v2)

# define rdd below
max_return_rdd = stock_list_rdd.map(map_maximum).reduceByKey(reduce_maximum)

# define map, reduce, and rdd for minimum
def map_minimum(e):
    
    return (e[1], e[2])
    
def reduce_minimum(v1, v2):
   
    return min(v1, v2)

# define rdd below
min_return_rdd = stock_list_rdd.map(map_minimum).reduceByKey(reduce_minimum)

# define map, reduce, and rdd for total
def map_total(e):
    
    return (e[1], e[2])
    
def reduce_total(v1, v2):

    return v1 + v2

# define rdd below
total_return_rdd = stock_list_rdd.map(map_total).reduceByKey(reduce_total)

In [7]:
# 10 points
testing.assert_equal(type(map_maximum), types.FunctionType)
testing.assert_equal(type(reduce_maximum), types.FunctionType)
testing.assert_equal(type(map_minimum), types.FunctionType)
testing.assert_equal(type(reduce_minimum), types.FunctionType)
testing.assert_equal(type(map_total), types.FunctionType)
testing.assert_equal(type(reduce_total), types.FunctionType)


testing.assert_array_almost_equal(max_return_rdd.sortByKey().values().collect(), 
                              [0.276, 0.238, 0.262, 0.241, 0.343, 0.317, 0.264, 0.293, 0.38, 0.331], decimal=2)

testing.assert_array_almost_equal(min_return_rdd.sortByKey().values().collect(), 
                              [-0.258,
 -0.305,
 -0.299,
 -0.283,
 -0.274,
 -0.283,
 -0.277,
 -0.281,
 -0.312,
 -0.301], decimal=2)

testing.assert_array_almost_equal(total_return_rdd.sortByKey().values().collect(),
                              [-2.1779999999999995,
 3.1999999999999984,
 -0.36000000000000054,
 -1.2189999999999996,
 0.3419999999999993,
 -2.747,
 -0.515999999999999,
 -4.391000000000002,
 -0.9330000000000009,
 -0.7949999999999992], decimal=2)


## Question 1.4 (20 pts)

We can compute the total log return between times $t_1$ and $t_2$ ($t_1 < t_2$) by using the following derivation
$$
\begin{align}
r_{t_1 : t_2} &= \log \frac{P_{t_2}}{P_{t_1}}\\
              &= \log \frac{P_{t_2}}
                           {P_{t_1}}
                       \frac{P_{t_2-1} P_{t_2-2}\cdots P_{t_1+1}}
                           {P_{t_2-1} P_{t_2-2}\cdots P_{t_1+1}}
\end{align}
$$
then re-arranging
$$
\begin{align}
r_{t_1 : t_2} &= \log \frac{P_{t_2}}
                           {P_{t_2-1}}
                       \frac{P_{t_2-1}}
                           {P_{t_2-2}}
                       \cdots
                       \frac{P_{t_1+2}}
                           {P_{t_1+1}}
                       \frac{P_{t_1+1}}
                           {P_{t_1}}\\
              &= \log \frac{P_{t_2}}
                           {P_{t_2-1}} +
                 \log \frac{P_{t_2-1}}
                           {P_{t_2-2}} +
                       \cdots +
                 \log \frac{P_{t_1+2}}
                           {P_{t_1+1}} +
                 \log \frac{P_{t_1+1}}
                           {P_{t_1}}  \\
              &= r_{t_2} + r_{t_2-1} + \cdots + r_{t_1+2} + r_{t_1+1}
\end{align}
$$

therefore

$$
r_{t_1 : t_2} = \sum_{i=t_1+1}^{t_2} r_i
$$
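
The telescoping identity can be checked numerically. The sketch below uses a made-up price path (the `prices` values are an assumption, not homework data) and compares the sum of daily log returns with the log of the price ratio.

```python
import math

# Hypothetical price path P_0 .. P_4 (illustration only).
prices = [100.0, 120.0, 90.0, 99.0, 110.0]

# Daily log returns, stored so that returns[i] holds r_i; index 0 is unused.
returns = [None] + [math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]

t1, t2 = 1, 4
summed = sum(returns[t1 + 1 : t2 + 1])      # r_{t1+1} + ... + r_{t2}
direct = math.log(prices[t2] / prices[t1])  # log(P_{t2} / P_{t1})

print(abs(summed - direct) < 1e-12)  # True
```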


Generate an RDD whose elements have the form `((ticker, time), cumulative return up to time)`, where the cumulative return up to time $t$ equals $r_{0:t}$. Below, define the functions `map_cumulative` and `reduce_cumulative` and store the resulting RDD in `cumulative_return_rdd2`. 
**Hint: You only need `stock_list_rdd` to generate these results and no join is necessary. The map step in this map-reduce should generate a list of key-value pairs and therefore you need a `flatMap` instead of a `map` operation**. Assume that you know that the maximum time is $t = 365$.

For example, we know that the returns of stock "ADF" are:

```python
stock_list_rdd.filter(lambda x: x[1] == 'ADF').take(10)
```

```
[[1, 'ADF', -0.074],
 [2, 'ADF', -0.198],
 [3, 'ADF', 0.195],
 [4, 'ADF', -0.118],
 [5, 'ADF', -0.173],
 [6, 'ADF', -0.123],
 [7, 'ADF', -0.154],
 [8, 'ADF', 0.098],
 [9, 'ADF', 0.097],
 [10, 'ADF', 0.191]]
```

Therefore, the map-reduce result for "ADF" should contain

```python
(stock_list_rdd.
 filter(lambda x: x[1] == 'ADF').
 flatMap(map_cumulative).
 reduceByKey(reduce_cumulative).sortByKey().take(10)
)
```

```
[(('ADF', 1), -0.074),
 (('ADF', 2), -0.272),
 (('ADF', 3), -0.07700000000000001),
 (('ADF', 4), -0.195),
 (('ADF', 5), -0.368),
 (('ADF', 6), -0.491),
 (('ADF', 7), -0.645),
 (('ADF', 8), -0.547),
 (('ADF', 9), -0.45000000000000007),
 (('ADF', 10), -0.25900000000000006)]
```
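
The expected values can be cross-checked without Spark. The sketch below is an illustration under stated assumptions, not the required solution. It computes a sequential running sum with `itertools.accumulate`, then mimics the fan-out idea behind the hint: the return at time $s$ is emitted once for every time $t \geq s$, and the emitted values are summed per `(ticker, time)` key. The names `running`, `totals`, and `fanned` are hypothetical, and the maximum time is shortened to 10 so that it matches the listing.

```python
from collections import defaultdict
from itertools import accumulate

# First ten ADF log returns, copied from the listing above.
adf = [-0.074, -0.198, 0.195, -0.118, -0.173, -0.123, -0.154, 0.098, 0.097, 0.191]

# Reference answer: a sequential running sum.
running = list(accumulate(adf))

# Fan-out emulation: the return at time s contributes to every time t >= s.
max_time = len(adf)  # 10 here; the assignment assumes 365
totals = defaultdict(float)
for s, r in enumerate(adf, start=1):
    for t in range(s, max_time + 1):
        totals[('ADF', t)] += r

fanned = [totals[('ADF', t)] for t in range(1, max_time + 1)]

print([round(v, 3) for v in running[:3]])  # [-0.074, -0.272, -0.077]
```

The fan-out trades extra intermediate pairs for the absence of any ordering requirement. With 365 days, each stock emits $365 \cdot 366 / 2 = 66{,}795$ pairs, or 667,950 pairs for 10 stocks.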

In [8]:
def map_cumulative(e):
    
    return [((e[1], time), e[2]) for time in range(e[0], 366)]

def reduce_cumulative(v1, v2):
    
    return v1 + v2
    
cumulative_return_rdd2 = stock_list_rdd.flatMap(map_cumulative).reduceByKey(reduce_cumulative)

In [9]:
# try it here
cumulative_return_rdd2.filter(lambda x: x[0][0] == 'ADF').sortByKey().values().take(10)

[-0.074,
 -0.272,
 -0.07700000000000001,
 -0.195,
 -0.368,
 -0.491,
 -0.645,
 -0.547,
 -0.45000000000000007,
 -0.25900000000000006]

In [10]:
# 20 pts
testing.assert_equal(type(map_cumulative), types.FunctionType)
testing.assert_equal(type(reduce_cumulative), types.FunctionType)
testing.assert_equal(cumulative_return_rdd2.count(), 3650)
testing.assert_equal(cumulative_return_rdd2.first(), (('FAD', 1), 0.044))
testing.assert_array_almost_equal(
    cumulative_return_rdd2.filter(lambda x: x[0][0] == 'ADF').sortByKey().values().take(10),
    [-0.074,
 -0.272,
 -0.07700000000000001,
 -0.195,
 -0.368,
 -0.491,
 -0.645,
 -0.547,
 -0.45000000000000007,
 -0.25900000000000006], decimal=2
)

testing.assert_array_almost_equal(
    cumulative_return_rdd2.filter(lambda x: x[0][0] == 'FAD').sortByKey().values().take(10),
    [0.044,
 0.271,
 0.23600000000000002,
 0.18500000000000003,
 0.18200000000000002,
 0.2,
 0.24000000000000002,
 0.389,
 0.425,
 0.308], decimal=2
)