
<img src="images/cads-logo.png" style="height: 100px;padding-top:5px" align=left> <img src="images/apache_spark.png" style="height: 20%;width:20%; padding-top:0px" align=right>

# Apache Spark Dataframe Exercise

In this exercise, we explore stock market data with Spark dataframes, using the `walmart_stock.csv` file as our dataset.

### 1- Create an Apache Spark Session

In [1]:
!pip install pyspark

Successfully installed py4j-0.10.9 pyspark-3.0.1


In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark

### 2- Load the `walmart_stock.csv` file into a dataframe and infer the data schema

In [3]:
import os
MAIN_DIRECTORY = os.getcwd()

In [4]:
file_path = MAIN_DIRECTORY+"/data/walmart_stock.csv"

In [5]:
df = spark.read.format('csv').option("header","true").option("inferSchema","true").load(file_path)

### 3- Display the column names and print the dataframe schema

In [6]:
df.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

In [7]:
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)



### 4- Print out the first five rows of the data

In [8]:
df.head(5)

[Row(Date='2012-01-03', Open=59.970001, High=61.060001, Low=59.869999, Close=60.330002, Volume=12668800, Adj Close=52.619234999999996),
 Row(Date='2012-01-04', Open=60.209998999999996, High=60.349998, Low=59.470001, Close=59.709998999999996, Volume=9593300, Adj Close=52.078475),
 Row(Date='2012-01-05', Open=59.349998, High=59.619999, Low=58.369999, Close=59.419998, Volume=12768200, Adj Close=51.825539),
 Row(Date='2012-01-06', Open=59.419998, High=59.450001, Low=58.869999, Close=59.0, Volume=8069400, Adj Close=51.45922),
 Row(Date='2012-01-09', Open=59.029999, High=59.549999, Low=58.919998, Close=59.18, Volume=6679300, Adj Close=51.616215000000004)]

### 5- Use the `describe()` method to get summary statistics of the data

In [9]:
df.describe().show()

+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|summary|      Date|              Open|             High|              Low|            Close|           Volume|        Adj Close|
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|  count|      1258|              1258|             1258|             1258|             1258|             1258|             1258|
|   mean|      null| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|
| stddev|      null|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|
|    min|2012-01-03|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|
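|    max|2016-12-30|         90.800003|        90.970001|            89.25|        90.470001|         80898100|84.91421600000001|
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+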

### 6- Use the `format_number` function to show the numbers with two decimal places
[format_number() documentation](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=format_number#pyspark.sql.functions.format_number)

In [10]:
des_result = df.describe()

In [11]:
from pyspark.sql.functions import format_number

In [12]:
des_result.printSchema()

root
 |-- summary: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Volume: string (nullable = true)
 |-- Adj Close: string (nullable = true)



In [14]:
des_result.select(des_result['summary'],
                  format_number(des_result['Open'].cast('float'),2).alias('Open'),
                  format_number(des_result['High'].cast('float'),2).alias('High'),
                  format_number(des_result['Low'].cast('float'),2).alias('Low'),
                  format_number(des_result['Close'].cast('float'),2).alias('Close'),
                  des_result['Volume'].cast('int').alias('Volume'),
                  format_number(des_result['Adj Close'].cast('float'),2).alias('Adj Close')).show()

+-------+--------+--------+--------+--------+--------+---------+
|summary|    Open|    High|     Low|   Close|  Volume|Adj Close|
+-------+--------+--------+--------+--------+--------+---------+
|  count|1,258.00|1,258.00|1,258.00|1,258.00|    1258| 1,258.00|
|   mean|   72.36|   72.84|   71.92|   72.39| 8222093|    67.24|
| stddev|    6.77|    6.77|    6.74|    6.76| 4519780|     6.72|
|    min|   56.39|   57.06|   56.30|   56.42| 2094900|    50.36|
|    max|   90.80|   90.97|   89.25|   90.47|80898100|    84.91|
+-------+--------+--------+--------+--------+--------+---------+
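
To make the formatting step concrete, here is a rough plain-Python analogue of what `format_number(col, 2)` produces for a single value: a string with thousands separators and two decimal places. The helper name `format_two_decimals` is hypothetical, and the handling of exact rounding ties may differ from Spark's, so treat this as an illustration rather than a specification.

```python
def format_two_decimals(value):
    """Plain-Python stand-in for format_number(col, 2) on one value."""
    return format(value, ",.2f")


# Values taken from the describe() output above.
print(format_two_decimals(1258))               # 1,258.00
print(format_two_decimals(72.35785375357709))  # 72.36
print(format_two_decimals(6.76809024470826))   # 6.77

# Volume is cast to int instead, which drops the fractional part.
print(int(float("4519780.8431556")))           # 4519780
```

This also shows why the `count` row reads `1,258.00` for the price columns but `1258` for `Volume`: the formatted columns are strings with separators, while `Volume` is a plain integer.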



### 7- Create a new column called HV Ratio in a new dataframe that holds the ratio of the High price to the volume of stock traded that day.

In [15]:
df.show(5)

+----------+------------------+---------+---------+------------------+--------+------------------+
|      Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|
+----------+------------------+---------+---------+------------------+--------+------------------+
|2012-01-03|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|
|2012-01-04|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|
|2012-01-05|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|
|2012-01-06|         59.419998|59.450001|58.869999|              59.0| 8069400|          51.45922|
|2012-01-09|         59.029999|59.549999|58.919998|             59.18| 6679300|51.616215000000004|
+----------+------------------+---------+---------+------------------+--------+------------------+
only showing top 5 rows



In [16]:
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)



In [17]:
HV_df = df.withColumn("HV Ratio", df['High']/df['Volume'])

In [19]:
HV_df.select('HV Ratio').show()

+--------------------+
|            HV Ratio|
+--------------------+
|4.819714653321546E-6|
|6.290848613094555E-6|
|4.669412994783916E-6|
|7.367338463826307E-6|
|8.915604778943901E-6|
|8.644477436914568E-6|
|9.351828421515645E-6|
| 8.29141562102703E-6|
|7.712212102001476E-6|
|7.071764823529412E-6|
|1.015495466386981E-5|
|6.576354146362592...|
| 5.90145296180676E-6|
|8.547679455011844E-6|
|8.420709512685392E-6|
|1.041448341728929...|
|8.316075414862431E-6|
|9.721183814992126E-6|
|8.029436027707578E-6|
|6.307432259386365E-6|
+--------------------+
only showing top 20 rows
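
As a quick sanity check, the first two ratios can be reproduced in plain Python from the `High` and `Volume` values shown earlier. This is only a cross-check of the arithmetic, not part of the Spark workflow; the names `sample` and `spark_values` are illustrative.

```python
import math

# (High, Volume) for the first two rows of the dataframe shown above.
sample = [(61.060001, 12668800), (60.349998, 9593300)]

# HV Ratio values that Spark printed for the same two rows.
spark_values = [4.819714653321546e-06, 6.290848613094555e-06]

ratios = [high / volume for high, volume in sample]
for ratio, expected in zip(ratios, spark_values):
    # Both sides are double-precision divisions, so they should agree closely.
    print(ratio, math.isclose(ratio, expected, rel_tol=1e-9))
```

The ratios are tiny because a price of roughly 60 USD is divided by a volume in the millions.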



### 8- What day had the Peak High in Price?

In [21]:
df.orderBy(df['High'].desc()).head()[0]

'2015-01-13'
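
The query above sorts by `High` in descending order and reads the `Date` of the first row. The plain-Python sketch below applies the same idea to the five sample rows shown earlier and confirms that it matches a single `max()` over the rows. It uses only that five-row sample, so its answer differs from the full-dataset result above.

```python
# (Date, High) pairs for the five rows shown earlier in this notebook.
rows = [
    ("2012-01-03", 61.060001),
    ("2012-01-04", 60.349998),
    ("2012-01-05", 59.619999),
    ("2012-01-06", 59.450001),
    ("2012-01-09", 59.549999),
]

# Sort by High in descending order and take the date of the first row.
by_high_desc = sorted(rows, key=lambda r: r[1], reverse=True)
peak_date_sorted = by_high_desc[0][0]

# A single pass with max() finds the same row without a full sort.
peak_date_max = max(rows, key=lambda r: r[1])[0]

print(peak_date_sorted, peak_date_max)  # 2012-01-03 2012-01-03
```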

### 9- What is the mean of the Close column?

In [22]:
from pyspark.sql.functions import mean
df.select(mean('Close')).show()

+-----------------+
|       avg(Close)|
+-----------------+
|72.38844998012726|
+-----------------+



In [24]:
df.agg({'Close':'mean'}).show()

+-----------------+
|       avg(Close)|
+-----------------+
|72.38844998012726|
+-----------------+



### 10- How many days was the Close lower than 70 USD?

In [25]:
df.filter('Close < 70').count()

397

### 11- What percentage of the time was the High greater than 80 USD?
#### In other words, (number of days with High > 80) / (total number of days in the dataframe), expressed as a percentage

In [26]:
(df.filter(df['High']>80).count()/df.count())*100

9.141494435612083
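
The arithmetic behind this percentage can be checked by hand. The sketch below recovers the number of qualifying days from the reported percentage and the row count of 1258 given by `describe()`; the variable names are illustrative.

```python
total_days = 1258                  # row count reported by describe()
reported_pct = 9.141494435612083   # value printed above

# Recover the number of days with High > 80 from the reported percentage.
days_above_80 = round(reported_pct / 100 * total_days)
print(days_above_80)               # 115

# In Python 3, / is true division, so the ratio is a float before scaling.
pct = days_above_80 / total_days * 100
print(round(pct, 2))               # 9.14
```

In Python 2, dividing two integer counts with `/` would floor the result to 0, so the same expression would need an explicit float conversion there.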

### 12- What is the correlation between High and Volume?

In [27]:
from pyspark.sql.functions import corr
df.select(corr('High','Volume')).show()

+-------------------+
| corr(High, Volume)|
+-------------------+
|-0.3384326061737161|
+-------------------+
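
`corr` returns the Pearson correlation coefficient. For readers who want to see the formula spelled out, here is a minimal plain-Python sketch applied to the five sample rows shown earlier. The helper `pearson` is hypothetical, and because it uses only five rows, its result will not match the full-dataset value above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# High and Volume for the five rows shown earlier in this notebook.
highs = [61.060001, 60.349998, 59.619999, 59.450001, 59.549999]
volumes = [12668800, 9593300, 12768200, 8069400, 6679300]

r = pearson(highs, volumes)
print(r)
```

A negative coefficient, as in the full dataset, means that days with higher prices tended to have lower trading volume; a value near -0.34 indicates a weak to moderate relationship.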



### 13- What is the max High per year (use GroupBy)?

In [28]:
# Only year() is needed here; importing max would shadow Python's built-in max().
from pyspark.sql.functions import year

In [29]:
year_df = df.withColumn('Year',year(df['Date']))

In [30]:
year_df.show(5)

+----------+------------------+---------+---------+------------------+--------+------------------+----+
|      Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|Year|
+----------+------------------+---------+---------+------------------+--------+------------------+----+
|2012-01-03|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|2012|
|2012-01-04|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|2012|
|2012-01-05|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|2012|
|2012-01-06|         59.419998|59.450001|58.869999|              59.0| 8069400|          51.45922|2012|
|2012-01-09|         59.029999|59.549999|58.919998|             59.18| 6679300|51.616215000000004|2012|
+----------+------------------+---------+---------+------------------+--------+------------------+----+
only showing top 5 rows



In [31]:
max_df = year_df.groupBy('Year').max()

In [32]:
max_df.show()

+----+-----------------+---------+---------+----------+-----------+-----------------+---------+
|Year|        max(Open)|max(High)| max(Low)|max(Close)|max(Volume)|   max(Adj Close)|max(Year)|
+----+-----------------+---------+---------+----------+-----------+-----------------+---------+
|2015|        90.800003|90.970001|    89.25| 90.470001|   80898100|84.91421600000001|     2015|
|2013|        81.209999|81.370003|    80.82| 81.209999|   25683700|        73.929868|     2013|
|2014|87.08000200000001|88.089996|86.480003| 87.540001|   22812400|81.70768000000001|     2014|
|2012|        77.599998|77.599998|76.690002| 77.150002|   38007300|        68.568371|     2012|
|2016|             74.5|75.190002|73.629997| 74.300003|   35076700|        73.233524|     2016|
+----+-----------------+---------+---------+----------+-----------+-----------------+---------+



### 14- What is the average Close for each calendar month (Jan, Feb, Mar, etc.)?


In [33]:
from pyspark.sql.functions import month
month_df = df.withColumn('Month', month('Date'))
month_avg = month_df.select('Month','Close').groupBy('Month').mean()

In [34]:
month_avg.show()

+-----+----------+-----------------+
|Month|avg(Month)|       avg(Close)|
+-----+----------+-----------------+
|   12|      12.0|72.84792478301885|
|    1|       1.0|71.44801958415842|
|    6|       6.0| 72.4953774245283|
|    3|       3.0|71.77794377570092|
|    5|       5.0|72.30971688679247|
|    9|       9.0|72.18411785294116|
|    4|       4.0|72.97361900952382|
|    8|       8.0|73.02981855454546|
|    7|       7.0|74.43971943925233|
|   10|      10.0|71.57854545454543|
|   11|      11.0| 72.1110893069307|
|    2|       2.0|  71.306804443299|
+-----+----------+-----------------+



In [35]:
month_avg.select('Month','avg(Close)').orderBy('Month').show()

+-----+-----------------+
|Month|       avg(Close)|
+-----+-----------------+
|    1|71.44801958415842|
|    2|  71.306804443299|
|    3|71.77794377570092|
|    4|72.97361900952382|
|    5|72.30971688679247|
|    6| 72.4953774245283|
|    7|74.43971943925233|
|    8|73.02981855454546|
|    9|72.18411785294116|
|   10|71.57854545454543|
|   11| 72.1110893069307|
|   12|72.84792478301885|
+-----+-----------------+



#### Well Done!