### 05_Dates_and_ternary_operations

Using `lit`, `to_date`, `when`, and `otherwise`.

In [82]:
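import pandas as pd
import pyspark
from pyspark.sql import SparkSession
# Import only the functions this notebook uses instead of a wildcard.
from pyspark.sql.functions import col, datediff, lit, to_date, when

spark = SparkSession.builder.appName('PySparkdatestern').getOrCreate()
# sparkContext is a property of the session instance, not of the SparkSession class.
sc = spark.sparkContext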

In [83]:
spdfA = spark.read.csv('02_MockDataset.csv', inferSchema=True, header=True)
spdfA.printSchema()

root
 |-- id: integer (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- SPAM: integer (nullable = true)
 |-- numvar: double (nullable = true)



### lit: literal values in a column

`pyspark.sql.functions.lit(val)`

Creates a `Column` holding a literal value.

Parameters: `val` : a literal value.

Use cases:

- Comparing a column against a literal value.
- Assigning a literal value to a column.

In [84]:
spdfA = spdfA.withColumn('accessdate', lit("01/3/2018"))
spdfA.show(3)

+---+---------------+----+----------+----------+
| id|     ip_address|SPAM|    numvar|accessdate|
+---+---------------+----+----------+----------+
|  1|176.121.226.242|   1|   47.8075| 01/3/2018|
|  2| 10.181.138.135|   0| 45.470831| 01/3/2018|
|  3|  87.50.171.217|   1|53.5662722| 01/3/2018|
+---+---------------+----+----------+----------+
only showing top 3 rows



### Ternary Operators (when & otherwise) in PySpark:



`pyspark.sql.functions.when(condition, value)` evaluates a list of conditions and returns one of multiple possible result expressions.

Parameters:

`condition` : a boolean Column expression.

`value` : a literal value, or a Column expression.

`pyspark.sql.Column.otherwise(value)` is a method of the Column returned by `when`, not a standalone function. It supplies the result for rows that match none of the conditions.

If `Column.otherwise()` is not invoked, None is returned for unmatched conditions.
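
As a rough mental model only, the per-row logic of a `when(...).otherwise(...)` expression can be sketched in plain Python. The helper names `newifdate_for` and `newifdate_without_otherwise` are hypothetical and exist only for this illustration; Spark evaluates the expression on whole columns, not through Python functions like these. The values mirror the ones used in the next cell.

```python
# Plain-Python sketch of the per-row logic; this is not PySpark API.
# Both helper names are hypothetical, for illustration only.

def newifdate_for(spam):
    """Mirror when(SPAM == 1, "01/3/2018").otherwise("05/5/2018") for one row."""
    # A null SPAM (None here) does not satisfy the condition,
    # so the row falls through to the otherwise() value.
    return "01/3/2018" if spam == 1 else "05/5/2018"


def newifdate_without_otherwise(spam):
    """Mirror when(SPAM == 1, "01/3/2018") with no otherwise() call."""
    # Unmatched rows get None, which Spark displays as null.
    return "01/3/2018" if spam == 1 else None


spam_values = [1, 0, None]
with_default = [newifdate_for(s) for s in spam_values]
without_default = [newifdate_without_otherwise(s) for s in spam_values]
print(with_default)
print(without_default)
```

The `None` input stands in for a row whose `SPAM` value is null: its condition is not true, so it takes the `otherwise` value when one is given.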

In [85]:
spdfA = spdfA.withColumn('newifdate', when(spdfA.SPAM == 1, "01/3/2018").otherwise("05/5/2018"))
spdfA.show(3)


+---+---------------+----+----------+----------+---------+
| id|     ip_address|SPAM|    numvar|accessdate|newifdate|
+---+---------------+----+----------+----------+---------+
|  1|176.121.226.242|   1|   47.8075| 01/3/2018|01/3/2018|
|  2| 10.181.138.135|   0| 45.470831| 01/3/2018|05/5/2018|
|  3|  87.50.171.217|   1|53.5662722| 01/3/2018|01/3/2018|
+---+---------------+----+----------+----------+---------+
only showing top 3 rows



### Convert date from String to Date format in Dataframe


In [86]:
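# to_date already returns a date column, so no extra cast is needed.
# Single-letter 'M' matches the unpadded month in "01/3/2018"; stricter
# parsers in newer Spark releases may not accept it with 'MM'.
spdfA = spdfA.withColumn('new_accessdate', to_date(col('accessdate'), 'dd/M/yyyy'))
spdfA = spdfA.withColumn('new_ifdate', to_date(col('newifdate'), 'dd/M/yyyy'))
spdfA = spdfA.drop('newifdate', 'accessdate')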
spdfA.show()

+---+---------------+----+-----------+--------------+----------+
| id|     ip_address|SPAM|     numvar|new_accessdate|new_ifdate|
+---+---------------+----+-----------+--------------+----------+
|  1|176.121.226.242|   1|    47.8075|    2018-03-01|2018-03-01|
|  2| 10.181.138.135|   0|  45.470831|    2018-03-01|2018-05-05|
|  3|  87.50.171.217|   1| 53.5662722|    2018-03-01|2018-03-01|
|  4|117.104.191.215|   1| -7.3162208|    2018-03-01|2018-03-01|
|  5|  48.195.56.188|   0|  36.865827|    2018-03-01|2018-05-05|
|  6|234.127.184.207|   0|  57.782249|    2018-03-01|2018-05-05|
|  7|   143.8.89.200|   1| -2.4431287|    2018-03-01|2018-03-01|
|  8|  96.253.105.66|   1|  40.980653|    2018-03-01|2018-03-01|
|  9|207.255.168.139|   1| 49.7030481|    2018-03-01|2018-03-01|
| 10| 177.41.205.178|null|       null|    2018-03-01|2018-05-05|
| 11| 170.31.179.242|   0| 15.6432256|    2018-03-01|2018-05-05|
| 12|   211.97.10.16|   0| 46.2686934|    2018-03-01|2018-05-05|
| 13| 108.153.197.45|   0

In [87]:

spdfA.printSchema()

root
 |-- id: integer (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- SPAM: integer (nullable = true)
 |-- numvar: double (nullable = true)
 |-- new_accessdate: date (nullable = true)
 |-- new_ifdate: date (nullable = true)
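
The conversion above reads the strings day first, so `"01/3/2018"` becomes 1 March 2018 rather than 3 January. A minimal cross-check with Python's standard `datetime` module is sketched below. It assumes that `%d/%m/%Y` is an adequate stand-in for the day/month/year pattern passed to `to_date`; the two libraries use different pattern syntaxes and are not guaranteed to agree on every input.

```python
from datetime import datetime

# Python's strptime directives differ from Spark's pattern letters;
# "%d/%m/%Y" is used here as a day-first stand-in (an assumption).
parsed_access = datetime.strptime("01/3/2018", "%d/%m/%Y").date()
parsed_if = datetime.strptime("05/5/2018", "%d/%m/%Y").date()

print(parsed_access.isoformat())  # 2018-03-01
print(parsed_if.isoformat())      # 2018-05-05
```

Both results match the `new_accessdate` and `new_ifdate` values in the table output above.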


### Date difference
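
`datediff(end, start)` returns the number of days from `start` to `end`, so the argument order decides the sign. The value to expect for the two dates in this notebook can be checked independently in plain Python; the sketch below uses only the standard library and is not part of the Spark pipeline.

```python
from datetime import date

start = date(2018, 3, 1)  # new_accessdate
end = date(2018, 5, 5)    # new_ifdate for rows where SPAM is not 1

# Subtracting two dates gives a timedelta; .days is the whole-day count.
days_between = (end - start).days
print(days_between)  # 65

# Swapping the operands flips the sign.
print((start - end).days)  # -65
```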

In [90]:
spdfA = spdfA.withColumn("datedifference", datediff(col('new_ifdate'), col('new_accessdate')))
spdfA.show()

+---+---------------+----+-----------+--------------+----------+--------------+
| id|     ip_address|SPAM|     numvar|new_accessdate|new_ifdate|datedifference|
+---+---------------+----+-----------+--------------+----------+--------------+
|  1|176.121.226.242|   1|    47.8075|    2018-03-01|2018-03-01|             0|
|  2| 10.181.138.135|   0|  45.470831|    2018-03-01|2018-05-05|            65|
|  3|  87.50.171.217|   1| 53.5662722|    2018-03-01|2018-03-01|             0|
|  4|117.104.191.215|   1| -7.3162208|    2018-03-01|2018-03-01|             0|
|  5|  48.195.56.188|   0|  36.865827|    2018-03-01|2018-05-05|            65|
|  6|234.127.184.207|   0|  57.782249|    2018-03-01|2018-05-05|            65|
|  7|   143.8.89.200|   1| -2.4431287|    2018-03-01|2018-03-01|             0|
|  8|  96.253.105.66|   1|  40.980653|    2018-03-01|2018-03-01|             0|
|  9|207.255.168.139|   1| 49.7030481|    2018-03-01|2018-03-01|             0|
| 10| 177.41.205.178|null|       null|  