## PySpark DataFrame Basics on Walmart Stocks Dataset

##### This notebook practices basic PySpark DataFrame commands on the Walmart stocks dataset from 2012-2017.

##### The following basic operations were performed on the Walmart stocks dataset using PySpark:
* Starting a Spark session.
* Loading the Walmart stocks .csv file and inferring the datatypes.
* Displaying column names of the spark dataframe.
* Printing the dataframe schema.
* Printing the first 5 rows.
* Displaying the summary statistics of the stocks data.
* Formatting the mean, standard deviation, min and max to 2 decimal places.
* Printing the summary statistics schema.
* Creating a new dataframe column called HV Ratio which is the ratio of High stock price versus volume of stock traded for that day.
* Finding the day that had the peak high stock price.
* Finding the mean of the Close stock price column.
* Finding the max and min Volume of stocks.
* Finding the number of days on which the closing price was less than 60 dollars.
* Finding the percentage of times High stock price was greater than 80 dollars.
* Finding the Pearson's correlation between High Price and Volume.
* Finding maximum High stock price per year.
* Finding the average Close price for each calendar month.

##### Starting a simple Spark Session

In [1]:
import findspark
findspark.init('C:\\Users\\pradn\\Desktop\\spark\\spark-2.4.3-bin-hadoop2.7')
import pyspark
#starting a pyspark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('walmart').getOrCreate()

##### Loading the Walmart stocks csv file and having Spark infer the datatypes

In [2]:
stocks = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)
stocks.show(5)

+-------------------+------------------+---------+---------+------------------+--------+------------------+
|               Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|
+-------------------+------------------+---------+---------+------------------+--------+------------------+
|2012-01-03 00:00:00|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|
|2012-01-04 00:00:00|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|
|2012-01-05 00:00:00|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|
|2012-01-06 00:00:00|         59.419998|59.450001|58.869999|              59.0| 8069400|          51.45922|
|2012-01-09 00:00:00|         59.029999|59.549999|58.919998|             59.18| 6679300|51.616215000000004|
+-------------------+------------------+---------+---------+------------------+--------+------------------+
only showing top 5 rows



##### Displaying the columns in the spark dataframe

In [3]:
stocks.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

##### Displaying the dataframe schema

In [4]:
stocks.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)



##### Printing the first 5 rows

In [5]:
stocks.head(5)

[Row(Date=datetime.datetime(2012, 1, 3, 0, 0), Open=59.970001, High=61.060001, Low=59.869999, Close=60.330002, Volume=12668800, Adj Close=52.619234999999996),
 Row(Date=datetime.datetime(2012, 1, 4, 0, 0), Open=60.209998999999996, High=60.349998, Low=59.470001, Close=59.709998999999996, Volume=9593300, Adj Close=52.078475),
 Row(Date=datetime.datetime(2012, 1, 5, 0, 0), Open=59.349998, High=59.619999, Low=58.369999, Close=59.419998, Volume=12768200, Adj Close=51.825539),
 Row(Date=datetime.datetime(2012, 1, 6, 0, 0), Open=59.419998, High=59.450001, Low=58.869999, Close=59.0, Volume=8069400, Adj Close=51.45922),
 Row(Date=datetime.datetime(2012, 1, 9, 0, 0), Open=59.029999, High=59.549999, Low=58.919998, Close=59.18, Volume=6679300, Adj Close=51.616215000000004)]

##### Understanding the summary statistics of the data

In [6]:
stocks.describe().show()

+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|summary|              Open|             High|              Low|            Close|           Volume|        Adj Close|
+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|  count|              1258|             1258|             1258|             1258|             1258|             1258|
|   mean| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|
| stddev|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|
|    min|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|
|    max|         90.800003|        90.970001|            89.25|        90.470001|         80898100|84.91421600000001|
+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+

##### Formatting the mean, standard deviation, min and max to 2 decimal places while keeping the column names the same

In [7]:
summary_stat = stocks.describe()

In [8]:
from pyspark.sql.functions import format_number

In [9]:
summary_stat.select(summary_stat['summary'],
                    format_number(summary_stat['Open'].cast('float'),2).alias('Open'),
                    format_number(summary_stat['High'].cast('float'), 2).alias('High'),
                    format_number(summary_stat['Low'].cast('float'),2).alias('Low'),
                    format_number(summary_stat['Close'].cast('float'),2).alias('Close'),
                    format_number(summary_stat['Adj Close'].cast('float'),2).alias('Adj Close'),
                    summary_stat['Volume'].cast('int').alias('Volume')
                   ).show()

+-------+--------+--------+--------+--------+---------+--------+
|summary|    Open|    High|     Low|   Close|Adj Close|  Volume|
+-------+--------+--------+--------+--------+---------+--------+
|  count|1,258.00|1,258.00|1,258.00|1,258.00| 1,258.00|    1258|
|   mean|   72.36|   72.84|   71.92|   72.39|    67.24| 8222093|
| stddev|    6.77|    6.77|    6.74|    6.76|     6.72| 4519780|
|    min|   56.39|   57.06|   56.30|   56.42|    50.36| 2094900|
|    max|   90.80|   90.97|   89.25|   90.47|    84.91|80898100|
+-------+--------+--------+--------+--------+---------+--------+
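The `count` row reads `1,258.00` because `format_number` groups thousands with commas, always prints the requested number of decimals, and returns a string. Python's `,.2f` format spec produces the same style, so it serves as a rough analogy. This is an illustration only, not Spark's implementation, and rounding may differ in edge cases. The variable names are hypothetical.

```python
# Values taken from the describe() output above
count_open = 1258.0
mean_open = 72.35785375357709

# ',.2f' adds thousands separators and keeps exactly two decimals
formatted_count = f"{count_open:,.2f}"
formatted_mean = f"{mean_open:,.2f}"

print(formatted_count)
print(formatted_mean)
```

The first line prints with a comma and two trailing zeros, which matches the look of the `count` row in the formatted table.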



##### Checking the schema for summary statistics

In [10]:
stocks.describe().printSchema()

root
 |-- summary: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Volume: string (nullable = true)
 |-- Adj Close: string (nullable = true)



##### Creating a new dataframe with a column called HV Ratio that is a ratio of High Price versus volume of stock traded for that day

In [11]:
price_vol_ratio = stocks.withColumn('HV Ratio', stocks['High']/stocks['Volume'])
#price_vol_ratio.show()
price_vol_ratio.select('HV Ratio').show()

+--------------------+
|            HV Ratio|
+--------------------+
|4.819714653321546E-6|
|6.290848613094555E-6|
|4.669412994783916E-6|
|7.367338463826307E-6|
|8.915604778943901E-6|
|8.644477436914568E-6|
|9.351828421515645E-6|
| 8.29141562102703E-6|
|7.712212102001476E-6|
|7.071764823529412E-6|
|1.015495466386981E-5|
|6.576354146362592...|
| 5.90145296180676E-6|
|8.547679455011844E-6|
|8.420709512685392E-6|
|1.041448341728929...|
|8.316075414862431E-6|
|9.721183814992126E-6|
|8.029436027707578E-6|
|6.307432259386365E-6|
+--------------------+
only showing top 20 rows
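As a quick sanity check on the arithmetic, the first value can be reproduced in plain Python from the first row of the dataset printed earlier (High 61.060001, Volume 12668800). This is only a hand check, assuming Spark's double division behaves like Python's float division.

```python
# First row of the dataset as printed by stocks.show(5)
high = 61.060001
volume = 12668800

# High price divided by the number of shares traded that day
hv_ratio = high / volume
print(hv_ratio)  # on the order of 4.8e-06, in line with the first HV Ratio row
```

The ratio is tiny because a price in tens of dollars is divided by a volume in the millions, which is why Spark prints it in scientific notation.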



##### Displaying the day that had a peak high in price

In [12]:
stocks.orderBy(stocks['High'].desc()).show(1) #showing the entire row

+-------------------+---------+---------+-----+---------+-------+---------+
|               Date|     Open|     High|  Low|    Close| Volume|Adj Close|
+-------------------+---------+---------+-----+---------+-------+---------+
|2015-01-13 00:00:00|90.800003|90.970001|88.93|89.309998|8215400|83.825448|
+-------------------+---------+---------+-----+---------+-------+---------+
only showing top 1 row



In [13]:
stocks.orderBy(stocks['High'].desc()).head(1)[0][0] #displaying just the date

datetime.datetime(2015, 1, 13, 0, 0)

##### Finding the mean of the Close column

In [14]:
from pyspark.sql.functions import mean
stocks.select(mean('Close')).show()

+-----------------+
|       avg(Close)|
+-----------------+
|72.38844998012726|
+-----------------+



##### Finding the max and min of the Volume column

In [15]:
from pyspark.sql.functions import max, min  # note: these shadow Python's built-in max and min
stocks.select(max('Volume'), min('Volume')).show()

+-----------+-----------+
|max(Volume)|min(Volume)|
+-----------+-----------+
|   80898100|    2094900|
+-----------+-----------+



##### Finding the number of days on which the close was lower than 60 dollars

In [16]:
stocks.filter('Close < 60').count()

81

##### Percentage of times High price was greater than 80 dollars

In [17]:
(stocks.filter(stocks['High'] > 80).count()*1.0/stocks.count()) * 100

9.141494435612083
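The printed percentage can be cross-checked by hand. With 1,258 rows in total (the `count` from `describe()`), a result of about 9.14% corresponds to 115 days with a High above 80 dollars. The figure 115 is inferred from the output here; the notebook does not print it.

```python
total_days = 1258     # row count reported by describe()
days_above_80 = 115   # inferred from the printed percentage, not shown in the notebook

# In Python 3, / already returns a float, so no "* 1.0" is needed
pct = days_above_80 / total_days * 100
print(round(pct, 2))
```

The `* 1.0` in the cell forces float division under Python 2; under Python 3 it is harmless but unnecessary.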

##### Pearson's correlation between High and Volume

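**corr(col1, col2, method=None)** => Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. DataFrame.corr() and DataFrameStatFunctions.corr() are aliases of each other. The signature above is that of the DataFrame method; the cell in this section uses the aggregate function `corr` from `pyspark.sql.functions`, which also computes the Pearson correlation coefficient of two columns.

**Parameters:**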
* col1 – The name of the first column
* col2 – The name of the second column
* method – The correlation method. Currently only supports “pearson”
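For intuition about what the coefficient measures, here is a minimal plain-Python sketch of the Pearson formula. The helper name `pearson` and the toy lists are hypothetical illustrations. They are not part of the notebook, the dataset, or Spark's implementation.

```python
import math

def pearson(xs, ys):
    """Pearson r: co-variation of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: price rises while volume falls, so r is negative
toy_high = [60.0, 61.0, 62.0, 63.0]
toy_volume = [900, 800, 650, 500]
r = pearson(toy_high, toy_volume)
print(r)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.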


In [18]:
from pyspark.sql.functions import corr
stocks.select(corr('High', 'Volume')).show()

+-------------------+
| corr(High, Volume)|
+-------------------+
|-0.3384326061737161|
+-------------------+



##### Maximum High stock price per year

In [19]:
from pyspark.sql.functions import year
years = stocks.withColumn('Year', year(stocks['Date']))
max_high = years.groupBy('Year').max()
max_high.select('Year', 'max(High)').show()

+----+---------+
|Year|max(High)|
+----+---------+
|2015|90.970001|
|2013|81.370003|
|2014|88.089996|
|2012|77.599998|
|2016|75.190002|
+----+---------+
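The grouping step can be pictured in plain Python: extract the year from each date, then keep the largest High seen for that year. The rows below are hypothetical toy values, not taken from the dataset, and the sketch deliberately avoids the built-in `max`, since the notebook shadows it with the PySpark function of the same name.

```python
from datetime import datetime

# Hypothetical toy rows of (date, high); not taken from the dataset
rows = [
    (datetime(2012, 1, 3), 61.0),
    (datetime(2012, 6, 1), 65.5),
    (datetime(2013, 2, 4), 70.25),
    (datetime(2013, 9, 9), 68.0),
]

max_high_by_year = {}
for date, high in rows:
    year = date.year  # same role as year(stocks['Date'])
    # Keep the largest High per year, like groupBy('Year').max()
    if year not in max_high_by_year or high > max_high_by_year[year]:
        max_high_by_year[year] = high

print(max_high_by_year)
```

Spark does not guarantee the order of grouped rows, which is why the years in the table above appear unsorted; adding `.orderBy('Year')` before `.show()` would sort them.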



##### Average close price for each calendar month

In [20]:
from pyspark.sql.functions import month
months = stocks.withColumn('Month', month(stocks['Date']))
avg_close = months.groupBy('Month').mean()
avg_close.select('Month', 'avg(Close)').orderBy('Month').show()

+-----+-----------------+
|Month|       avg(Close)|
+-----+-----------------+
|    1|71.44801958415842|
|    2|  71.306804443299|
|    3|71.77794377570092|
|    4|72.97361900952382|
|    5|72.30971688679247|
|    6| 72.4953774245283|
|    7|74.43971943925233|
|    8|73.02981855454546|
|    9|72.18411785294116|
|   10|71.57854545454543|
|   11| 72.1110893069307|
|   12|72.84792478301885|
+-----+-----------------+

