<h1><center>DS420 - PE3: Spark DataFrames Exercise</center></h1>

## Goal:

In this programming exercise, you will practice your new Spark DataFrame skills. You are given Walmart stock market data from the years 2012 to 2017. Some of the questions may not be as meaningful as they would be in a real analysis, but follow along with the questions and complete the tasks below.

## Dataset:

The dataset is available at this URL: https://raw.githubusercontent.com/BlueJayADAL/DS420/master/datasets/dataframe/walmart_stock.csv

## Q1: Start a Spark Session with the application named "stock_xxx", where "xxx" is your last name in lowercase. Then read the given data into a Spark DataFrame.

> Prior to Spark 2.0, SparkContext was the entry point of any Spark application. It was used to access all Spark features and required a SparkConf, holding the cluster configuration and parameters, to create a SparkContext object. SparkContext could primarily create only RDDs, and any other kind of Spark interaction needed its own context: SQLContext for SQL, HiveContext for Hive, and StreamingContext for streaming. In a nutshell, SparkSession combines all of these contexts. Internally, SparkSession creates a new SparkContext for all operations, and the contexts mentioned above can be accessed through the SparkSession object.

#### Configure Spark path with Jupyter notebook.

In [1]:
import findspark

findspark.init('/opt/spark')


#### Create a spark session

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('stock_moynihan').getOrCreate()



#### Load the Walmart Stock CSV File, have Spark infer the data types.

In [4]:
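from pyspark import SparkFiles

url = 'https://raw.githubusercontent.com/BlueJayADAL/DS420/master/datasets/dataframe/walmart_stock.csv'

# Download the CSV and register it with the Spark job
spark.sparkContext.addFile(url)

# Resolve the local path of the downloaded file
fileloc = SparkFiles.get('walmart_stock.csv')

# Read the CSV, treating the first row as the header and inferring column types
stocks = spark.read.csv('file://' + fileloc,
                        inferSchema=True,
                        header=True)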



## Q2: Show the column names, schema and the first five rows of the DataFrame.

#### What are the column names?

In [6]:
stocks.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

#### What does the Schema look like?

In [7]:
stocks.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)



#### Print out the first 5 rows.

In [10]:
stocks.show(n=5)

+----------+------------------+---------+---------+------------------+--------+------------------+
|      Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|
+----------+------------------+---------+---------+------------------+--------+------------------+
|2012-01-03|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|
|2012-01-04|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|
|2012-01-05|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|
|2012-01-06|         59.419998|59.450001|58.869999|              59.0| 8069400|          51.45922|
|2012-01-09|         59.029999|59.549999|58.919998|             59.18| 6679300|51.616215000000004|
+----------+------------------+---------+---------+------------------+--------+------------------+
only showing top 5 rows



## Q3:
#### Study the basic statistics about the DataFrame.

In [11]:

stocks.describe().show()


+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|summary|      Date|              Open|             High|              Low|            Close|           Volume|        Adj Close|
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|  count|      1258|              1258|             1258|             1258|             1258|             1258|             1258|
|   mean|      null| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|
| stddev|      null|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|
|    min|2012-01-03|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|
|    max|2016-12-30|         90.800003|        90.970001|            89.25|        90.4700

#### There are too many decimal places for mean and stddev in the describe() dataframe. Format the numbers to just show up to two decimal places. 
#### Hint1: you probably want to check out the data types that .describe() returns before converting.
#### Hint2: We didn't cover how to do this exact formatting, but we covered something very similar.

In [12]:
import pyspark.sql.functions as F

In [50]:

stocks.describe().printSchema()



root
 |-- summary: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Volume: string (nullable = true)
 |-- Adj Close: string (nullable = true)



#### Re-format the entire DataFrame so that columns "Open", "High", "Low", "Close", and "Adj Close" have exactly two decimal places, and column "Volume" is integer type.

In [38]:
import pyspark.sql.functions as F

In [39]:
# Get the DF from describe()
result = stocks.describe()

# Reformat the DF
result.select( ['summary',
                F.format_number(result['Open'].cast('float'),2).alias('Open'),
                F.format_number(result['High'].cast('float'),2).alias('High'),
                F.format_number(result['Low'].cast('float'),2).alias('Low'),
                F.format_number(result['Close'].cast('float'),2).alias('Close'),
                result['Volume'].cast('int').alias('Volume'),
                F.format_number(result['Adj Close'].cast('float'),2).alias('Adj Close')] 
             ).show()

+-------+--------+--------+--------+--------+--------+---------+
|summary|    Open|    High|     Low|   Close|  Volume|Adj Close|
+-------+--------+--------+--------+--------+--------+---------+
|  count|1,258.00|1,258.00|1,258.00|1,258.00|    1258| 1,258.00|
|   mean|   72.36|   72.84|   71.92|   72.39| 8222093|    67.24|
| stddev|    6.77|    6.77|    6.74|    6.76| 4519780|     6.72|
|    min|   56.39|   57.06|   56.30|   56.42| 2094900|    50.36|
|    max|   90.80|   90.97|   89.25|   90.47|80898100|    84.91|
+-------+--------+--------+--------+--------+--------+---------+



## Q4: Create a new DataFrame `df2` with a column called `HV Ratio` that is the ratio of the High Price versus volume of stock traded for a day.

In [37]:
df2 = stocks.withColumn('HV Ratio', stocks['High']/stocks['Volume'])



# Show the new column from df2.
df2.select('HV Ratio').show()

+--------------------+
|            HV Ratio|
+--------------------+
|4.819714653321546E-6|
|6.290848613094555E-6|
|4.669412994783916E-6|
|7.367338463826307E-6|
|8.915604778943901E-6|
|8.644477436914568E-6|
|9.351828421515645E-6|
| 8.29141562102703E-6|
|7.712212102001476E-6|
|7.071764823529412E-6|
|1.015495466386981E-5|
|6.576354146362592...|
| 5.90145296180676E-6|
|8.547679455011844E-6|
|8.420709512685392E-6|
|1.041448341728929...|
|8.316075414862431E-6|
|9.721183814992126E-6|
|8.029436027707578E-6|
|6.307432259386365E-6|
+--------------------+
only showing top 20 rows



## Q5: Answer the following questions using DataFrame functions:
#### Which day had the Peak High in Price?

In [41]:
stocks.orderBy(F.desc('High')).head(1)



[Row(Date='2015-01-13', Open=90.800003, High=90.970001, Low=88.93, Close=89.309998, Volume=8215400, Adj Close=83.825448)]

#### What is the mean of the Close column?

In [43]:
stocks.agg({'Close':'avg'}).show()



+-----------------+
|       avg(Close)|
+-----------------+
|72.38844998012726|
+-----------------+



#### What is the max and min of the Volume column?

In [44]:
stocks.select(F.max('Volume'), F.min('Volume')).show()



+-----------+-----------+
|max(Volume)|min(Volume)|
+-----------+-----------+
|   80898100|    2094900|
+-----------+-----------+



## Q6: Answer the following questions using boolean selection.

#### On how many days was the Close lower than 60 dollars?

In [46]:
stocks.filter(stocks['Close']<60).count()



81

#### What if we want the result as a DataFrame?

In [69]:
result = stocks.filter(stocks['Close']<60)

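# Alias the count column itself (aliasing the DataFrame would leave the header unchanged)
result.select(F.count('Close').alias('Count of Close<60')).show()



+-----------------+
|Count of Close<60|
+-----------------+
|               81|
+-----------------+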



#### What percentage of the time was the High greater than 80 dollars?
#### In other words, (Number of Days High>80)/(Total Days in the dataset)

In [52]:
stocks.filter(stocks['High']>80).count()/stocks.count()



0.09141494435612083

#### What is the Pearson correlation between High and Volume? [Hint](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameStatFunctions.corr)

In [54]:
stocks.select(F.corr('High', 'Volume')).show()



+-------------------+
| corr(High, Volume)|
+-------------------+
|-0.3384326061737161|
+-------------------+



## Q7: Answer the following questions with Spark Dates and Timestamps
#### What is the max High per year? [Reference](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.functions.year.html)

In [61]:
# Create a new DF with a new column as Year

yeardf = stocks.withColumn('Year', F.year(stocks['Date']))

In [65]:
# Get the answer using yeardf
yeardf.groupBy('Year').max().select(['Year','max(high)']).show()


+----+---------+
|Year|max(high)|
+----+---------+
|2015|90.970001|
|2013|81.370003|
|2014|88.089996|
|2012|77.599998|
|2016|75.190002|
+----+---------+



#### What is the average Close for each Calendar Month?
#### In other words, across all the years, what is the average Close price for Jan,Feb, Mar, etc... Your result will have a value for each of these months. [Reference](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.functions.month.html)

+-----+-----------------+
|Month|       avg(Close)|
+-----+-----------------+
|    1|71.44801958415842|
|    2|  71.306804443299|
|    3|71.77794377570092|
|    4|72.97361900952382|
|    5|72.30971688679247|
|    6| 72.4953774245283|
|    7|74.43971943925233|
|    8|73.02981855454546|
|    9|72.18411785294116|
|   10|71.57854545454543|
|   11| 72.1110893069307|
|   12|72.84792478301885|
+-----+-----------------+



# Great Job!