# Dates and Timestamps

You will often find yourself working with date and time information. Let's walk through some ways you can deal with it!

In [2]:
from pyspark.sql import SparkSession
# May take a little while on a local computer
spark = SparkSession.builder.appName("dates").getOrCreate()

In [3]:
df = spark.read.csv("/FileStore/tables/appl_stock.csv",header=True,inferSchema=True)

In [4]:
df.show()

In [5]:
df.head(2)

Let's walk through how to extract parts of the timestamp data.

In [7]:
from pyspark.sql.functions import format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format

In [8]:
df.select(df.Date, dayofmonth(df['Date']), month(df['Date']), year(df['Date']),
          dayofyear(df['Date']).alias('Day of year'), hour(df['Date'])).show()

For example, let's say we want to know the average closing price per year. Easy! Add a Year column with the year() function, then use groupBy():

In [10]:
# .withColumn creates a new column

df.withColumn("Year",year(df['Date'])).select(['Year','Date','Open','Volume']).show()

In [11]:
# Average of Closing price per year

newdf = df.withColumn("Year",year(df['Date']))

newdf.groupBy("Year").mean().select("Year", "avg(Close)").show()

Still not quite presentable! Let's use .withColumnRenamed() and .alias() together with format_number() to clean this up!

In [13]:
# .withColumnRenamed renames the column
# format_number() formats the numeric column

result = newdf.groupBy("Year").mean()[['avg(Year)','avg(Close)']]
result = result.withColumnRenamed("avg(Year)","Year")
result = result.select('Year',format_number('avg(Close)',2).alias("Mean Close"))
result.show()

In [14]:
# An alternative solution for the same case

newdf = df.withColumn("Year",year(df['Date']))

newdf = newdf.groupBy("Year").avg("Close").orderBy(newdf.Year.desc()).withColumnRenamed("avg(Close)","Mean Close")

newdf.select(['Year',format_number('Mean Close',2).alias('Mean Close')]).show()


Perfect! Now you know how to work with Date and Timestamp information!