<a href="https://colab.research.google.com/github/markumreed/colab_pyspark/blob/main/pyspark_in_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preamble

The following three cells must be run, in order, before you can use PySpark in Google Colab.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

In [3]:
import findspark
findspark.init()

# Spark DataFrame Basics

Spark DataFrames allow for easy handling of large datasets. 

* Easy syntax
* Ability to use SQL directly in the dataframe
* Operations are automatically distributed across the cluster (DataFrames are built on top of RDDs)

## Create a DataFrame


In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName("pyspark_basics").getOrCreate()

In [None]:
%%writefile user_simple.json
{"name":"Bob"}
{"name":"Jim", "age":40}
{"name":"Mary", "age": 24}

Writing user_simple.json


In [None]:
df = spark.read.json("user_simple.json")

In [None]:
df

DataFrame[age: bigint, name: string]

## Show DataFrame


In [None]:
df.show()

+----+----+
| age|name|
+----+----+
|null| Bob|
|  40| Jim|
|  24|Mary|
+----+----+



In [None]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [None]:
df.columns

['age', 'name']

In [None]:
df.describe()

DataFrame[summary: string, age: string, name: string]

In [None]:
df.describe().show()

+-------+------------------+----+
|summary|               age|name|
+-------+------------------+----+
|  count|                 2|   3|
|   mean|              32.0|null|
| stddev|11.313708498984761|null|
|    min|                24| Bob|
|    max|                40|Mary|
+-------+------------------+----+



## Specifying Schema Structure

- For some data formats, Spark can infer the schema reliably.

- Often you have to set the schema yourself.

- Spark has tools to help specify the structure.

Next, we create the list of `StructField` objects. Each one takes:
  * `name`: string, the name of the field.
  * `dataType`: the `DataType` of the field.
  * `nullable`: boolean, whether the field can be null (None).

In [None]:
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

In [None]:
data_schema = [StructField("age", IntegerType(), True), StructField("name",StringType(), True)]

In [None]:
final_struc = StructType(fields=data_schema)

In [None]:
df = spark.read.json("user_simple.json", schema=final_struc)

In [None]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



In [None]:
df.show()

+----+----+
| age|name|
+----+----+
|null| Bob|
|  40| Jim|
|  24|Mary|
+----+----+



## Grab Data

In [None]:
df['age']

Column<b'age'>

In [None]:
type(df['age'])

pyspark.sql.column.Column

In [None]:
df.select("age")

DataFrame[age: int]

In [None]:
type(df.select("age"))

pyspark.sql.dataframe.DataFrame

In [None]:
df.select("age").show()

+----+
| age|
+----+
|null|
|  40|
|  24|
+----+



In [None]:
df.head(2)

[Row(age=None, name='Bob'), Row(age=40, name='Jim')]

In [None]:
df.select(["name","age"])

DataFrame[name: string, age: int]

In [None]:
df.select(["name","age"]).show()

+----+----+
|name| age|
+----+----+
| Bob|null|
| Jim|  40|
|Mary|  24|
+----+----+



## Create New Columns

In [None]:
df.withColumn("newAge", df['age']).show()

+----+----+------+
| age|name|newAge|
+----+----+------+
|null| Bob|  null|
|  40| Jim|    40|
|  24|Mary|    24|
+----+----+------+



In [None]:
df.show()

+----+----+
| age|name|
+----+----+
|null| Bob|
|  40| Jim|
|  24|Mary|
+----+----+



In [None]:
df.withColumnRenamed("name","firstName").show()

+----+---------+
| age|firstName|
+----+---------+
|null|      Bob|
|  40|      Jim|
|  24|     Mary|
+----+---------+



In [None]:
df.show()

+----+----+
| age|name|
+----+----+
|null| Bob|
|  40| Jim|
|  24|Mary|
+----+----+



In [None]:
df.withColumn("agePlusTen", df['age']+10).show()

+----+----+----------+
| age|name|agePlusTen|
+----+----+----------+
|null| Bob|      null|
|  40| Jim|        50|
|  24|Mary|        34|
+----+----+----------+



In [None]:
df.withColumn("age_minus_5", df['age']-5).show()

+----+----+-----------+
| age|name|age_minus_5|
+----+----+-----------+
|null| Bob|       null|
|  40| Jim|         35|
|  24|Mary|         19|
+----+----+-----------+



## Using SQL

In [None]:
df.createOrReplaceTempView("customers")

In [None]:
sql_results = spark.sql("SELECT * FROM customers")

In [None]:
sql_results

DataFrame[age: int, name: string]

In [None]:
sql_results.show()

+----+----+
| age|name|
+----+----+
|null| Bob|
|  40| Jim|
|  24|Mary|
+----+----+



In [None]:
spark.sql("SELECT * FROM customers WHERE age=24").show()

+---+----+
|age|name|
+---+----+
| 24|Mary|
+---+----+



## DataFrame Operations

- Cover basic operations with Spark DataFrames.
- Use stock data from Walmart.

In [None]:
!curl https://raw.githubusercontent.com/markumreed/colab_pyspark/main/WMT.csv >> WMT.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 89556  100 89556    0     0   392k      0 --:--:-- --:--:-- --:--:--  392k


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("operations").getOrCreate()
df = spark.read.csv('WMT.csv',inferSchema=True,header=True)

In [None]:
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Adj Close: double (nullable = true)
 |-- Volume: string (nullable = true)



In [None]:
df.head(5)

[Row(Date='2016-01-20', Open=61.799999, High=62.330002, Low=60.200001, Close=60.84, Adj Close=53.990601, Volume='17369100'),
 Row(Date='2016-01-21', Open=60.98, High=62.790001, Low=60.91, Close=61.880001, Adj Close=54.913509, Volume='12089200'),
 Row(Date='2016-01-22', Open=62.439999, High=63.259998, Low=62.130001, Close=62.689999, Adj Close=55.632324, Volume='9197500'),
 Row(Date='2016-01-25', Open=62.779999, High=63.82, Low=62.549999, Close=63.450001, Adj Close=56.306763, Volume='12823400'),
 Row(Date='2016-01-26', Open=63.360001, High=64.470001, Low=63.259998, Close=64.0, Adj Close=56.794834, Volume='9441200')]

## Filtering Data

- DataFrames allow for quick filtering of data based on conditions 


In [None]:
df.filter('Close<62').show()

+----------+---------+---------+---------+---------+---------+--------+
|      Date|     Open|     High|      Low|    Close|Adj Close|  Volume|
+----------+---------+---------+---------+---------+---------+--------+
|2016-01-20|61.799999|62.330002|60.200001|    60.84|53.990601|17369100|
|2016-01-21|    60.98|62.790001|    60.91|61.880001|54.913509|12089200|
|2016-01-20|61.799999|62.330002|60.200001|    60.84|53.990601|17369100|
|2016-01-21|    60.98|62.790001|    60.91|61.880001|54.913509|12089200|
+----------+---------+---------+---------+---------+---------+--------+



In [None]:
df.filter('Close<62').select('Open').show()

+---------+
|     Open|
+---------+
|61.799999|
|    60.98|
|61.799999|
|    60.98|
+---------+



In [None]:
df.filter('Close<62').select(['Date','Open']).show()

+----------+---------+
|      Date|     Open|
+----------+---------+
|2016-01-20|61.799999|
|2016-01-21|    60.98|
|2016-01-20|61.799999|
|2016-01-21|    60.98|
+----------+---------+



## Using Comparison Operators
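- Comparison operators on columns look similar to SQL operators.
- Make sure to reference the column through the DataFrame (for example `df['Close']`), and wrap each condition in parentheses when combining conditions with `&`, `|`, or `~`.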

In [None]:
df.filter(df['Close'] < 62).show()

+----------+---------+---------+---------+---------+---------+--------+
|      Date|     Open|     High|      Low|    Close|Adj Close|  Volume|
+----------+---------+---------+---------+---------+---------+--------+
|2016-01-20|61.799999|62.330002|60.200001|    60.84|53.990601|17369100|
|2016-01-21|    60.98|62.790001|    60.91|61.880001|54.913509|12089200|
|2016-01-20|61.799999|62.330002|60.200001|    60.84|53.990601|17369100|
|2016-01-21|    60.98|62.790001|    60.91|61.880001|54.913509|12089200|
+----------+---------+---------+---------+---------+---------+--------+



In [None]:
df.filter((df['Close'] < 62) & ~(df['Open'] > 60)).show()

+----+----+----+---+-----+---------+------+
|Date|Open|High|Low|Close|Adj Close|Volume|
+----+----+----+---+-----+---------+------+
+----+----+----+---+-----+---------+------+



In [None]:
df.filter(df['Open'] == 60.98).show(1)

+----------+-----+---------+-----+---------+---------+--------+
|      Date| Open|     High|  Low|    Close|Adj Close|  Volume|
+----------+-----+---------+-----+---------+---------+--------+
|2016-01-21|60.98|62.790001|60.91|61.880001|54.913509|12089200|
+----------+-----+---------+-----+---------+---------+--------+
only showing top 1 row



In [None]:
df.filter(df['Open'] == 60.98).collect()

[Row(Date='2016-01-21', Open=60.98, High=62.790001, Low=60.91, Close=61.880001, Adj Close=54.913509, Volume='12089200'),
 Row(Date='2016-01-21', Open=60.98, High=62.790001, Low=60.91, Close=61.880001, Adj Close=54.913509, Volume='12089200')]

In [None]:
res = df.filter(df['Open'] == 60.98).collect()

In [None]:
type(res[0])

pyspark.sql.types.Row

In [None]:
res[0].asDict()

{'Adj Close': 54.913509,
 'Close': 61.880001,
 'Date': '2016-01-21',
 'High': 62.790001,
 'Low': 60.91,
 'Open': 60.98,
 'Volume': '12089200'}

In [None]:
for item in res[0]:
  print(item)

2016-01-21
60.98
62.790001
60.91
61.880001
54.913509
12089200


In [None]:
import pandas as pd

In [None]:
pd.Series(res[0].asDict())

Date         2016-01-21
Open              60.98
High              62.79
Low               60.91
Close             61.88
Adj Close       54.9135
Volume         12089200
dtype: object

# GroupBy and Aggregate Functions
- `groupBy` allows you to group rows together based on the values in a column.
- Once you've performed the `groupBy` operation, you can apply an aggregate function to the grouped data.
- An aggregate function combines multiple rows of data into a single output, such as the sum of the inputs or a count of the inputs.



In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName("groupbyagg").getOrCreate()

## Import Data


In [None]:
!curl https://raw.githubusercontent.com/markumreed/colab_pyspark/main/sales_data.csv >> sales_data.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   202  100   202    0     0   1836      0 --:--:-- --:--:-- --:--:--  1836


In [None]:
df = spark.read.csv("sales_data.csv", inferSchema=True, header=True)

In [None]:
df.printSchema()

root
 |-- company: string (nullable = true)
 |-- representative: string (nullable = true)
 |-- num_sales: double (nullable = true)



In [None]:
df.show()

+-------+--------------+---------+
|company|representative|num_sales|
+-------+--------------+---------+
|    XYZ|           Bob|    200.0|
|    XYZ|           Tom|    120.0|
|    XYZ|         Frank|    340.0|
|   ABCD|         Jerry|    600.0|
|   ABCD|           Amy|    124.0|
|   ABCD|       Vanessa|    243.0|
|     OK|          Carl|    870.0|
|     OK|         Sarah|    350.0|
|   BLAH|          John|    250.0|
|   BLAH|         Linda|    130.0|
|   BLAH|          Mike|    750.0|
|   BLAH|         Chris|    350.0|
+-------+--------------+---------+



## Grouping Data
- Group the data by company

In [None]:
df.groupBy("company")

<pyspark.sql.group.GroupedData at 0x7fe2f0e67ef0>

## Aggregate Functions
- mean, count, max, min, sum...

In [None]:
df.groupBy("company").mean().show()

+-------+-----------------+
|company|   avg(num_sales)|
+-------+-----------------+
|   BLAH|            370.0|
|    XYZ|            220.0|
|     OK|            610.0|
|   ABCD|322.3333333333333|
+-------+-----------------+



In [None]:
df.groupBy("company").count().show()

+-------+-----+
|company|count|
+-------+-----+
|   BLAH|    4|
|    XYZ|    3|
|     OK|    2|
|   ABCD|    3|
+-------+-----+



In [None]:
df.groupBy("company").min().show()

+-------+--------------+
|company|min(num_sales)|
+-------+--------------+
|   BLAH|         130.0|
|    XYZ|         120.0|
|     OK|         350.0|
|   ABCD|         124.0|
+-------+--------------+



In [None]:
df.groupBy("company").max().show()

+-------+--------------+
|company|max(num_sales)|
+-------+--------------+
|   BLAH|         750.0|
|    XYZ|         340.0|
|     OK|         870.0|
|   ABCD|         600.0|
+-------+--------------+



In [None]:
df.groupBy("company").sum().show()

+-------+--------------+
|company|sum(num_sales)|
+-------+--------------+
|   BLAH|        1480.0|
|    XYZ|         660.0|
|     OK|        1220.0|
|   ABCD|         967.0|
+-------+--------------+



## Aggregating

- Not every aggregation needs a `groupBy` call. Calling the general `.agg()` method directly on the DataFrame applies the aggregate across all rows of the specified column.
- It can take a single column, or several aggregates at once using dictionary notation that maps column names to aggregate function names.


In [None]:
df.agg({"num_sales":"max"}).show()

+--------------+
|max(num_sales)|
+--------------+
|         870.0|
+--------------+



In [None]:
df.groupBy("company").agg({"num_sales":"mean"}).show()

+-------+-----------------+
|company|   avg(num_sales)|
+-------+-----------------+
|   BLAH|            370.0|
|    XYZ|            220.0|
|     OK|            610.0|
|   ABCD|322.3333333333333|
+-------+-----------------+



In [None]:
company_groups = df.groupBy("company")

In [None]:
company_groups.min().show()

+-------+--------------+
|company|min(num_sales)|
+-------+--------------+
|   BLAH|         130.0|
|    XYZ|         120.0|
|     OK|         350.0|
|   ABCD|         124.0|
+-------+--------------+



## Functions
There are a variety of functions you can import from pyspark.sql.functions.

In [None]:
from pyspark.sql.functions import countDistinct, avg, stddev

In [None]:
df.select(countDistinct("num_sales")).show()

+-------------------------+
|count(DISTINCT num_sales)|
+-------------------------+
|                       11|
+-------------------------+



In [None]:
df.select(avg("num_sales")).show()

+-----------------+
|   avg(num_sales)|
+-----------------+
|360.5833333333333|
+-----------------+



In [None]:
df.select(stddev("num_sales")).show()

+----------------------+
|stddev_samp(num_sales)|
+----------------------+
|    250.08742410799007|
+----------------------+



### Alias
- Use the `.alias()` method to rename the resulting column:

In [None]:
df.select(countDistinct("num_sales").alias("ANYTHING WE WANT")).show()

+----------------+
|ANYTHING WE WANT|
+----------------+
|              11|
+----------------+



### Precision
- Use `format_number` to control the number of decimal places displayed.


In [None]:
from pyspark.sql.functions import format_number

In [None]:
sales_std = df.select(stddev("num_sales").alias("stddev"))

In [None]:
sales_std.show()

+------------------+
|            stddev|
+------------------+
|250.08742410799007|
+------------------+



In [None]:
sales_std.select(format_number("stddev",2)).show()

+------------------------+
|format_number(stddev, 2)|
+------------------------+
|                  250.09|
+------------------------+



## Order By


In [None]:
df.orderBy("num_sales").show() # Ascending Order

+-------+--------------+---------+
|company|representative|num_sales|
+-------+--------------+---------+
|    XYZ|           Tom|    120.0|
|   ABCD|           Amy|    124.0|
|   BLAH|         Linda|    130.0|
|    XYZ|           Bob|    200.0|
|   ABCD|       Vanessa|    243.0|
|   BLAH|          John|    250.0|
|    XYZ|         Frank|    340.0|
|     OK|         Sarah|    350.0|
|   BLAH|         Chris|    350.0|
|   ABCD|         Jerry|    600.0|
|   BLAH|          Mike|    750.0|
|     OK|          Carl|    870.0|
+-------+--------------+---------+



In [None]:
df.orderBy(df['num_sales'].desc()).show()

+-------+--------------+---------+
|company|representative|num_sales|
+-------+--------------+---------+
|     OK|          Carl|    870.0|
|   BLAH|          Mike|    750.0|
|   ABCD|         Jerry|    600.0|
|     OK|         Sarah|    350.0|
|   BLAH|         Chris|    350.0|
|    XYZ|         Frank|    340.0|
|   BLAH|          John|    250.0|
|   ABCD|       Vanessa|    243.0|
|    XYZ|           Bob|    200.0|
|   BLAH|         Linda|    130.0|
|   ABCD|           Amy|    124.0|
|    XYZ|           Tom|    120.0|
+-------+--------------+---------+



# Missing Data

- Data sources are often incomplete.
- There are three options for handling missing data:
  1. Keep the missing data points.
  1. Drop the missing data points (or the entire row).
  1. Fill them in with some other value.

## Keeping the missing data
A few machine learning algorithms can handle missing data directly. Let's see what the data looks like:

In [None]:
!curl https://raw.githubusercontent.com/markumreed/colab_pyspark/main/missing_data.csv >> missing_data.csv

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("missing_data").getOrCreate()

In [8]:
df = spark.read.csv("missing_data.csv", header=True, inferSchema=True)

In [9]:
df.show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id001|  Bob| null|
|id002| null| null|
|id003| null|585.0|
|id004|Karen|404.0|
+-----+-----+-----+



In [10]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- sales: double (nullable = true)



## Drop the missing data

You can use the `.na` functions for missing data. The `drop` method has the following parameters:

`df.na.drop(how='any', thresh=None, subset=None)`

* `how`: `'any'` or `'all'`. If `'any'`, drop a row if it contains any nulls. If `'all'`, drop a row only if all its values are null.
* `thresh`: int, default `None`. If specified, drop rows that have fewer than `thresh` non-null values. This overrides the `how` parameter.
* `subset`: optional list of column names to consider.


In [11]:
df.na.drop().show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id004|Karen|404.0|
+-----+-----+-----+



In [12]:
df.na.drop(thresh=2).show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id001|  Bob| null|
|id003| null|585.0|
|id004|Karen|404.0|
+-----+-----+-----+



In [13]:
df.na.drop(subset=['sales']).show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id003| null|585.0|
|id004|Karen|404.0|
+-----+-----+-----+



In [14]:
df.na.drop(how='any').show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id004|Karen|404.0|
+-----+-----+-----+



In [15]:
df.na.drop(how='all').show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id001|  Bob| null|
|id002| null| null|
|id003| null|585.0|
|id004|Karen|404.0|
+-----+-----+-----+



## Fill the missing values

We can also fill the missing values with new values. When nulls occur in columns of different data types, Spark applies the fill value only to columns whose type matches it: a string fills string columns, and a number fills numeric columns. For example:


In [16]:
df.na.fill('SOME VALUE').show()

+-----+----------+-----+
|   id|      name|sales|
+-----+----------+-----+
|id001|       Bob| null|
|id002|SOME VALUE| null|
|id003|SOME VALUE|585.0|
|id004|     Karen|404.0|
+-----+----------+-----+



In [17]:
df.na.fill(999).show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id001|  Bob|999.0|
|id002| null|999.0|
|id003| null|585.0|
|id004|Karen|404.0|
+-----+-----+-----+



In [18]:
df.na.fill("Missing Name", subset=["name"]).show()

+-----+------------+-----+
|   id|        name|sales|
+-----+------------+-----+
|id001|         Bob| null|
|id002|Missing Name| null|
|id003|Missing Name|585.0|
|id004|       Karen|404.0|
+-----+------------+-----+



In [19]:
from pyspark.sql.functions import mean

In [20]:
mean_value = df.select(mean(df['sales'])).collect()

In [23]:
mean_sales_value = mean_value[0][0]

In [25]:
df.na.fill(mean_sales_value, ["sales"]).show()

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id001|  Bob|494.5|
|id002| null|494.5|
|id003| null|585.0|
|id004|Karen|404.0|
+-----+-----+-----+



In [27]:
# DON'T DO THIS
df.na.fill(df.select(mean(df['sales'])).collect()[0][0] ,['sales']).show() # NOT EASY TO READ

+-----+-----+-----+
|   id| name|sales|
+-----+-----+-----+
|id001|  Bob|494.5|
|id002| null|494.5|
|id003| null|585.0|
|id004|Karen|404.0|
+-----+-----+-----+



# Dates and Timestamps

You will often find yourself working with date and time information.


In [5]:
!curl https://raw.githubusercontent.com/markumreed/colab_pyspark/main/WMT.csv >> WMT.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 89556  100 89556    0     0   352k      0 --:--:-- --:--:-- --:--:--  351k


In [28]:
spark = SparkSession.builder.appName('walmart_dates').getOrCreate()

In [29]:
df = spark.read.csv('WMT.csv', header=True, inferSchema=True)

In [30]:
df.show()

+----------+---------+---------+---------+---------+---------+--------+
|      Date|     Open|     High|      Low|    Close|Adj Close|  Volume|
+----------+---------+---------+---------+---------+---------+--------+
|2016-01-20|61.799999|62.330002|60.200001|    60.84|53.990601|17369100|
|2016-01-21|    60.98|62.790001|    60.91|61.880001|54.913509|12089200|
|2016-01-22|62.439999|63.259998|62.130001|62.689999|55.632324| 9197500|
|2016-01-25|62.779999|    63.82|62.549999|63.450001|56.306763|12823400|
|2016-01-26|63.360001|64.470001|63.259998|     64.0|56.794834| 9441200|
|2016-01-27|64.099998|    65.18|63.889999|63.950001|56.750477|10214300|
|2016-01-28|64.029999|64.510002|    63.43|64.220001| 56.99007|11278300|
|2016-01-29|    64.75|66.529999|64.739998|66.360001|58.889149|16439100|
|2016-02-01|65.910004|    67.93|65.889999|     67.5| 59.90081|14728400|
|2016-02-02|67.300003|67.839996|66.279999|66.860001|59.332867|13585900|
|2016-02-03|67.309998|     67.5|    65.07|66.269997| 58.80928|12

In [31]:
from pyspark.sql.functions import format_number, dayofmonth, hour, dayofyear, month, year, weekofyear, date_format

In [32]:
df.select(dayofmonth(df['Date'])).show()

+----------------+
|dayofmonth(Date)|
+----------------+
|              20|
|              21|
|              22|
|              25|
|              26|
|              27|
|              28|
|              29|
|               1|
|               2|
|               3|
|               4|
|               5|
|               8|
|               9|
|              10|
|              11|
|              12|
|              16|
|              17|
+----------------+
only showing top 20 rows



In [33]:
df.select(hour(df['Date'])).show()

+----------+
|hour(Date)|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
+----------+
only showing top 20 rows



In [34]:
df.select(dayofyear(df['Date'])).show()

+---------------+
|dayofyear(Date)|
+---------------+
|             20|
|             21|
|             22|
|             25|
|             26|
|             27|
|             28|
|             29|
|             32|
|             33|
|             34|
|             35|
|             36|
|             39|
|             40|
|             41|
|             42|
|             43|
|             47|
|             48|
+---------------+
only showing top 20 rows



In [35]:
df.select(month(df['Date'])).show()

+-----------+
|month(Date)|
+-----------+
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          2|
|          2|
|          2|
|          2|
|          2|
|          2|
|          2|
|          2|
|          2|
|          2|
|          2|
|          2|
+-----------+
only showing top 20 rows



Find the average closing price per month.

In [38]:
df.withColumn("Month", month(df['Date'])).show()

+----------+---------+---------+---------+---------+---------+--------+-----+
|      Date|     Open|     High|      Low|    Close|Adj Close|  Volume|Month|
+----------+---------+---------+---------+---------+---------+--------+-----+
|2016-01-20|61.799999|62.330002|60.200001|    60.84|53.990601|17369100|    1|
|2016-01-21|    60.98|62.790001|    60.91|61.880001|54.913509|12089200|    1|
|2016-01-22|62.439999|63.259998|62.130001|62.689999|55.632324| 9197500|    1|
|2016-01-25|62.779999|    63.82|62.549999|63.450001|56.306763|12823400|    1|
|2016-01-26|63.360001|64.470001|63.259998|     64.0|56.794834| 9441200|    1|
|2016-01-27|64.099998|    65.18|63.889999|63.950001|56.750477|10214300|    1|
|2016-01-28|64.029999|64.510002|    63.43|64.220001| 56.99007|11278300|    1|
|2016-01-29|    64.75|66.529999|64.739998|66.360001|58.889149|16439100|    1|
|2016-02-01|65.910004|    67.93|65.889999|     67.5| 59.90081|14728400|    2|
|2016-02-02|67.300003|67.839996|66.279999|66.860001|59.332867|13

In [40]:
df2 = df.withColumn("Month", month(df['Date']))

In [41]:
df2.groupBy("Month").mean()[['avg(Month)', 'avg(Close)']].show()

+----------+------------------+
|avg(Month)|        avg(Close)|
+----------+------------------+
|      12.0|106.02932022330099|
|       1.0| 98.94980368627448|
|       6.0|  92.2302801401869|
|       3.0| 87.44880724770645|
|       5.0| 90.54859816822429|
|       9.0|100.69396066336634|
|       4.0| 91.55893247572816|
|       8.0| 96.97705391071432|
|       7.0| 96.65647596190469|
|      10.0|102.74810810810811|
|      11.0|105.59009729126215|
|       2.0| 89.16364570833336|
+----------+------------------+



In [42]:
res = df2.groupBy("Month").mean()[['avg(Month)', 'avg(Close)']]
res = res.withColumnRenamed("avg(Month)", "Month")
# show() returns None, so call it without reassigning res
res.select("Month", format_number('avg(Close)',2).alias("Mean Close")).show()

+-----+----------+
|Month|Mean Close|
+-----+----------+
| 12.0|    106.03|
|  1.0|     98.95|
|  6.0|     92.23|
|  3.0|     87.45|
|  5.0|     90.55|
|  9.0|    100.69|
|  4.0|     91.56|
|  8.0|     96.98|
|  7.0|     96.66|
| 10.0|    102.75|
| 11.0|    105.59|
|  2.0|     89.16|
+-----+----------+

