## Functions or APIs to process Data Frames.

- Projection - **select** or **withColumn** or **drop** or **selectExpr**
- Filtering - **filter** or **where**
- Grouping data by key and perform aggregations - **groupBy**
- Sorting data - **sort** or **orderBy**
- We can pass column names or literals or expressions to all the Data Frame APIs.
- Expressions include arithmetic operations, transformations using functions from pyspark.sql.functions.
- There are approximately 300 functions under pyspark.sql.functions.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    enableHiveSupport(). \
    appName(' Python - Processing Column Data'). \
    getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
# Reading data
orders = spark.read.csv(
    '/data/retail_db/orders',
    schema='order_id INT, order_date STRING, order_customer_id INT, order_status STRING'
)

#### Importing functions

In [56]:
from pyspark.sql.functions import (
    date_format, current_date, current_timestamp, unix_timestamp,
    to_date, to_timestamp, from_unixtime, date_add, date_sub, datediff,
    months_between, add_months, trunc, date_trunc,
    year, month, weekofyear, dayofmonth, dayofyear, dayofweek, hour, minute, second,
    upper, concat, lower, initcap, length, substring, split, explode, lpad, rpad, trim, round,
    coalesce, lit, col, expr, when
)

In [5]:
orders.show(5)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



In [6]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [7]:
# Function as part of projections

orders.select('*', date_format('order_date', 'yyyyMM').alias('order_month')).show(5)

+--------+--------------------+-----------------+---------------+-----------+
|order_id|          order_date|order_customer_id|   order_status|order_month|
+--------+--------------------+-----------------+---------------+-----------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|     201307|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|     201307|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|     201307|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|     201307|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|     201307|
+--------+--------------------+-----------------+---------------+-----------+
only showing top 5 rows



In [8]:
orders.withColumn('order_month', date_format('order_date', 'yyyyMM')).show(5)

+--------+--------------------+-----------------+---------------+-----------+
|order_id|          order_date|order_customer_id|   order_status|order_month|
+--------+--------------------+-----------------+---------------+-----------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|     201307|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|     201307|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|     201307|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|     201307|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|     201307|
+--------+--------------------+-----------------+---------------+-----------+
only showing top 5 rows



In [9]:
# Function as part of where or filter

orders. \
    filter(date_format('order_date', 'yyyyMM') == 201401). \
    show(5, truncate=False)

+--------+---------------------+-----------------+---------------+
|order_id|order_date           |order_customer_id|order_status   |
+--------+---------------------+-----------------+---------------+
|25876   |2014-01-01 00:00:00.0|3414             |PENDING_PAYMENT|
|25877   |2014-01-01 00:00:00.0|5549             |PENDING_PAYMENT|
|25878   |2014-01-01 00:00:00.0|9084             |PENDING        |
|25879   |2014-01-01 00:00:00.0|5118             |PENDING        |
|25880   |2014-01-01 00:00:00.0|10146            |CANCELED       |
+--------+---------------------+-----------------+---------------+
only showing top 5 rows



In [10]:
# Function as part of groupBy
orders. \
    groupBy(date_format('order_date', 'yyyyMM').alias('order_month')). \
    count(). \
    show(5)

+-----------+-----+
|order_month|count|
+-----------+-----+
|     201401| 5908|
|     201405| 5467|
|     201312| 5892|
|     201310| 5335|
|     201311| 6381|
+-----------+-----+
only showing top 5 rows



Creating a Data Frame using dummy data

In [11]:
# Oracle dual (view)
# dual - dummy CHAR(1)
# "X" - One record
l = [('X', )]
df = spark.createDataFrame(l, "dummy STRING")

In [12]:
df.printSchema()

root
 |-- dummy: string (nullable = true)



In [13]:
df.show()

+-----+
|dummy|
+-----+
|    X|
+-----+



In [14]:
df.select(current_date()).show()

+--------------+
|current_date()|
+--------------+
|    2023-04-15|
+--------------+



In [15]:
df.select(current_date().alias("current_date")). \
    show()

+------------+
|current_date|
+------------+
|  2023-04-15|
+------------+



Creating a Data Frame using a collection of employees

In [16]:
employees = [
    (1, "Scott", "Tiger", 1000.0, 
      "united states", "+1 123 456 7890", "123 45 6789"
    ),
     (2, "Henry", "Ford", 1250.0, 
      "India", "+91 234 567 8901", "456 78 9123"
     ),
     (3, "Nick", "Junior", 750.0, 
      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
     ),
     (4, "Bill", "Gomes", 1500.0, 
      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
     )
]

In [17]:
len(employees)

4

In [18]:
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING,
                    phone_number STRING, ssn STRING"""
                   )

In [19]:
employeesDF.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- nationality: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- ssn: string (nullable = true)



In [20]:
employeesDF.show(truncate=False)

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|nationality   |phone_number    |ssn        |
+-----------+----------+---------+------+--------------+----------------+-----------+
|1          |Scott     |Tiger    |1000.0|united states |+1 123 456 7890 |123 45 6789|
|2          |Henry     |Ford     |1250.0|India         |+91 234 567 8901|456 78 9123|
|3          |Nick      |Junior   |750.0 |united KINGDOM|+44 111 111 1111|222 33 4444|
|4          |Bill      |Gomes    |1500.0|AUSTRALIA     |+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



## Categories of Functions

- ### String Manipulation Functions
    - Case Conversion - lower, upper
    - Getting Length - length
    - Extracting substrings - substring, split
    - Trimming - trim, ltrim, rtrim
    - Padding - lpad, rpad
    - Concatenating string - concat, concat_ws
- ### Date Manipulation Functions
    - Getting current date and time - current_date, current_timestamp
    - Date Arithmetic - date_add, date_sub, datediff, months_between, add_months, next_day
    - Beginning and Ending Date or Time - last_day, trunc, date_trunc
    - Formatting Date - date_format
    - Extracting Information - dayofyear, dayofmonth, dayofweek, year, month
- ### Aggregate Functions
    - count, countDistinct
    - sum, avg
    - min, max
- ### Other Functions - We will explore depending on the use cases.
    - CASE and WHEN
    - CAST for type casting
    - Functions to manage special types such as ARRAY, MAP, STRUCT type columns
- ### Special Functions 
    - col 
    - lit
     



>Special functions such as **col** and **lit** are typically used to convert strings to column type: **col** converts a column name, and **lit** converts a literal value.

>If there are no transformations on a column, we can pass the column name as a string.

>If we want to apply transformations using some of the functions, or call column methods such as **desc**, then passing column names as strings will not suffice. We have to pass them as column type by using the **col** function.


In [21]:
employeesDF. \
    select("first_name", "last_name"). \
    show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Scott|    Tiger|
|     Henry|     Ford|
|      Nick|   Junior|
|      Bill|    Gomes|
+----------+---------+



In [22]:
employeesDF. \
    groupBy("nationality"). \
    count(). \
    show()

+--------------+-----+
|   nationality|count|
+--------------+-----+
|united KINGDOM|    1|
|     AUSTRALIA|    1|
|         India|    1|
| united states|    1|
+--------------+-----+



In [23]:
employeesDF. \
    orderBy("employee_id"). \
    show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [24]:
employeesDF. \
    select(col("first_name"), col("last_name")). \
    show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Scott|    Tiger|
|     Henry|     Ford|
|      Nick|   Junior|
|      Bill|    Gomes|
+----------+---------+



In [25]:
employeesDF. \
    select(upper("first_name").alias("First_name"), upper("last_Name").alias("Last_Name")). \
    show()

+----------+---------+
|First_name|Last_Name|
+----------+---------+
|     SCOTT|    TIGER|
|     HENRY|     FORD|
|      NICK|   JUNIOR|
|      BILL|    GOMES|
+----------+---------+



In [26]:
employeesDF. \
    select(concat(upper("first_name"),lit(" "), upper("last_Name")).alias("Names")). \
    show()

+-----------+
|      Names|
+-----------+
|SCOTT TIGER|
| HENRY FORD|
|NICK JUNIOR|
| BILL GOMES|
+-----------+



In [27]:
employeesDF. \
    groupBy(upper("nationality")). \
    count(). \
    show()

+------------------+-----+
|upper(nationality)|count|
+------------------+-----+
|    UNITED KINGDOM|    1|
|         AUSTRALIA|    1|
|             INDIA|    1|
|     UNITED STATES|    1|
+------------------+-----+



In [28]:
employeesDF. \
    orderBy("employee_id".desc()). \
    show()

AttributeError: 'str' object has no attribute 'desc'

In [None]:
employeesDF. \
    orderBy(col("employee_id").desc()). \
    show()

In [None]:
employeesDF. \
    orderBy(col("first_name").desc()). \
    show()

In [None]:
# Alternative - we can also refer to a column through the Data Frame itself, like this
employeesDF. \
    orderBy(upper(employeesDF['first_name']).alias('first_name')). \
    show()

### Common String Manipulation Functions
- Case Conversion and Length
    - Convert all the alphabetic characters in a string to uppercase - upper
    - Convert all the alphabetic characters in a string to lowercase - lower
    - Convert the first character of each word in a string to uppercase - initcap
    - Get number of characters in a string - length

Concatenating Strings

In [None]:
employeesDF. \
    withColumn("full_name", concat("first_name",lit(", "), "last_name")). \
    show()

Case Conversion and length

In [None]:
employeesDF. \
  select("employee_id", "nationality"). \
  withColumn("nationality_upper", upper(col("nationality"))). \
  withColumn("nationality_lower", lower(col("nationality"))). \
  withColumn("nationality_initcap", initcap(col("nationality"))). \
  withColumn("nationality_length", length(col("nationality"))). \
  show()

### Extracting Strings using substring
- If we are processing ***fixed length*** columns then we use substring to extract the information.

In [None]:
s = "Hello World"
s.index('H')
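
Python indexes strings from 0, as `s.index('H')` shows, whereas **substring** counts positions from 1 and reads a negative position from the end of the string. The sketch below is a plain-Python analogue of the two calls used in this section, so it runs without a Spark session. `substring_like` is a hypothetical helper, not a Spark API, and it does not try to reproduce Spark's behaviour for position 0 or out-of-range positions.

```python
def substring_like(s, pos, length):
    """Plain-Python analogue of substring(col, pos, length) for pos != 0.

    pos is 1-based when positive; a negative pos counts back from the end.
    """
    if pos == 0:
        raise ValueError("pos == 0 is not modelled in this sketch")
    start = pos - 1 if pos > 0 else max(len(s) + pos, 0)
    return s[start:start + length]


phone_last4 = substring_like("+1 123 456 7890", -4, 4)
ssn_last4 = substring_like("123 45 6789", 8, 4)
print(phone_last4, ssn_last4)  # 7890 6789
```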

In [None]:
employeesDF. \
    select("employee_id", "phone_number", "ssn"). \
    withColumn("phone_last4", substring(col("phone_number"), -4, 4).cast("int")). \
    withColumn("ssn_last4", substring(col("ssn"), 8, 4).cast("int")). \
    show()

In [None]:
employeesDF. \
    select("employee_id", "phone_number", "ssn"). \
    withColumn("phone_last4", substring(col("phone_number"), -4, 4).cast("int")). \
    withColumn("ssn_last4", substring(col("ssn"), 8, 4).cast("int")). \
    printSchema()

In [None]:
help(employeesDF.employee_id.substr)

### Extracting Strings using split
- If we are processing variable length columns with a delimiter then we use split to extract the information.
- **split** takes 2 arguments, **column** and **delimiter**. The delimiter is treated as a regular expression pattern.
- **split** converts each string into an array and we can access the elements using **index**.
- We can also use **explode** in conjunction with split to explode the list or array into records in Data Frame.
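
To make the split-then-explode idea concrete, here is a plain-Python analogue that runs without Spark. It is only a sketch: `str.split` takes a literal delimiter rather than a regular expression, and the list of tuples stands in for a Data Frame. The sample values are the phone numbers used later in this section.

```python
rows = [
    (1, "+1 123 456 7890,+1 234 567 8901"),
    (2, "+91 234 567 8901"),
    (3, "+44 111 111 1111,+44 222 222 2222"),
    (4, "+61 987 654 3210,+61 876 543 2109"),
]

# split: one list of phone numbers per employee
split_rows = [(emp_id, phones.split(",")) for emp_id, phones in rows]

# explode: one output row per element of each list
exploded = [(emp_id, phone) for emp_id, phones in split_rows for phone in phones]

for row in exploded:
    print(row)
```

Four input rows become seven output rows, which is why grouping the exploded data by employee_id gives a count of 1 or 2 per employee.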

In [None]:
employees = [(1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890,+1 234 567 8901", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111,+44 222 222 2222", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210,+61 876 543 2109", "789 12 6118"
                     )
                ]

In [None]:
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING,
                    phone_numbers STRING, ssn STRING"""
                   )

In [None]:
employeesDF. \
    select('employee_id', 'phone_numbers'). \
    show(truncate=False)

In [None]:
employeesDF = employeesDF. \
    select('employee_id', 'phone_numbers', 'ssn'). \
    withColumn('phone_number', explode(split('phone_numbers', ',')))

In [None]:
employeesDF.show(truncate=False)

In [None]:
employeesDF. \
    groupBy('employee_id'). \
    count(). \
    show()

### Padding Characters around Strings
- We use lpad to pad a string with a specific character on the leading or left side and rpad to pad on the trailing or right side.
- Both lpad and rpad take 3 arguments - column or expression, desired length and the character to pad with.
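
The following plain-Python sketch shows what a padded fixed length record looks like for the first employee. `lpad_like` and `rpad_like` are hypothetical helpers built on `str.rjust` and `str.ljust`; they assume a single pad character and, as an assumption about lpad and rpad, shorten values that are longer than the desired length.

```python
def lpad_like(value, length, pad):
    """Left-pad str(value) with one character; cut to length if too long."""
    return str(value).rjust(length, pad)[:length]


def rpad_like(value, length, pad):
    """Right-pad str(value) with one character; cut to length if too long."""
    return str(value).ljust(length, pad)[:length]


record = (
    lpad_like(1, 5, "0")
    + rpad_like("Scott", 10, "-")
    + rpad_like("Tiger", 10, "-")
    + lpad_like(1000.0, 10, "0")
    + rpad_like("united states", 15, "-")
    + rpad_like("+1 123 456 7890", 17, "-")
    + "123 45 6789"
)
print(record)
print(len(record))  # 78
```

Every record built this way has the same length (5 + 10 + 10 + 10 + 15 + 17 + 11 = 78), which is what makes fixed length files easy to slice with substring.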

In [None]:
employees = [(1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                ]

In [None]:
employeesDF = spark.createDataFrame(employees). \
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )

In [None]:
employeesDF.show()

In [None]:
employeesDF.printSchema()

Length of the employee_id should be 5 characters and should be padded with zero.

In [None]:
employeesDF.select(
    concat(
        lpad("employee_id", 5, "0")
    ).alias("employee")
).show()

Length of first_name and last_name should be 10 characters and should be padded with - on the right side.

In [None]:
employeesDF.select(
    concat(
        rpad("first_name", 10, "-"),
        rpad("last_name", 10, "-")
    ).alias("employee")
).show()

Length of salary should be 10 characters and should be padded with zero.

In [None]:
employeesDF.select(
    concat(
        lpad("salary", 10, "0")
    ).alias("employee")
).show()

Length of the nationality should be 15 characters and should be padded with - on the right side.

In [None]:
employeesDF.select(
    concat(
        rpad("nationality", 15, "-")
    ).alias("employee")
).show()

Length of the phone_number should be 17 characters and should be padded with - on the right side.

In [None]:
employeesDF.select(
    concat(
        rpad("phone_number", 17, "-")
    ).alias("employee")
).show()

Length of the ssn can be left as is. It is 11 characters.

In [None]:
employeesDF.select(
    concat(
       rpad("phone_number", 17, "-"), 
        "ssn"
    ).alias("employee")
).show()

### Trimming Characters from Strings
- We typically use trimming to remove unnecessary characters from fixed length records.
- As of now Spark trim functions take the column as argument and remove leading or trailing spaces. However, we can use **expr** or **selectExpr** to use Spark SQL based trim functions to remove leading or trailing spaces or any other such characters.
    - Trim spaces towards left - **ltrim**
    - Trim spaces towards right - **rtrim**
    - Trim spaces on both sides - **trim**
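
As a quick reference for what each variant removes, here is a plain-Python analogue using the same sample string as the cells that follow. It is a sketch, not Spark code: `" "` is passed explicitly because Python's default `strip()` removes all whitespace, not only spaces.

```python
raw = "   Hello.    "

ltrimmed = raw.lstrip(" ")               # like ltrim: leading spaces removed
rtrimmed = raw.rstrip(" ").rstrip(".")   # trailing spaces, then the trailing '.'
trimmed = raw.strip(" ")                 # like trim: spaces removed on both sides

print(repr(ltrimmed))  # 'Hello.    '
print(repr(rtrimmed))  # '   Hello'
print(repr(trimmed))   # 'Hello.'
```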



In [None]:
l = [("   Hello.    ",) ]
df = spark.createDataFrame(l).toDF("dummy")
df.show()

In [None]:
spark.sql('DESCRIBE FUNCTION rtrim').show(truncate=False)

In [None]:
# if we do not specify trimStr, it will be defaulted to space
df.withColumn("ltrim", expr("ltrim(dummy)")). \
  withColumn("rtrim", expr("rtrim('.', rtrim(dummy))")). \
  withColumn("trim", trim(col("dummy"))). \
  show()

In [None]:
spark.sql('DESCRIBE FUNCTION trim').show(truncate=False)

In [None]:
df.withColumn("ltrim", expr("trim(LEADING ' ' FROM dummy)")). \
  withColumn("rtrim", expr("trim(TRAILING '.' FROM rtrim(dummy))")). \
  withColumn("trim", expr("trim(BOTH ' ' FROM dummy)")). \
  show()

### Date and Time Manipulation Functions
- We can use ***current_date*** to get today’s server date.
    - Date will be returned using ***yyyy-MM-dd*** format.
- We can use ***current_timestamp*** to get current server time.
    - Timestamp will be returned using ***yyyy-MM-dd HH:mm:ss.SSS*** format.
    - Hours will be by default in 24 hour format.


In [None]:
df.select(current_date()).show() #yyyy-MM-dd

In [None]:
df.select(current_timestamp()).show(truncate=False) #yyyy-MM-dd HH:mm:ss.SSS

Converting a string that contains a date or timestamp in a non-standard format to a standard date or timestamp using the **to_date** or **to_timestamp** function respectively.

In [None]:
df.select(to_date(lit('20210228'), 'yyyyMMdd').alias('to_date')).show()

In [None]:
df.select(to_timestamp(lit('20210228 1725'), 'yyyyMMdd HHmm').alias('to_timestamp')).show()

### Date and Time Arithmetic
- Adding days to a date or timestamp - **date_add**
- Subtracting days from a date or timestamp - **date_sub**
- Getting difference between 2 dates or timestamps - **datediff**
- Getting the number of months between 2 dates or timestamps - **months_between**
- Adding months to a date or timestamp - **add_months**
- Getting the next occurrence of a given day of the week after a date - **next_day**
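
The arithmetic itself can be checked with Python's standard library before running it on a Data Frame. The sketch below is an analogue, not Spark code. `add_months_like` is a hypothetical helper, and its rule of clamping to the last day of the target month is an assumption about how add_months behaves in recent Spark versions. months_between is not modelled.

```python
import calendar
from datetime import date, timedelta


def add_months_like(d, months):
    """Shift d by whole months, clamping the day to the target month's last day."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return d.replace(year=year, month=month, day=min(d.day, last_day))


d = date(2014, 2, 28)
plus_10 = d + timedelta(days=10)              # like date_add(date, 10)
minus_10 = d - timedelta(days=10)             # like date_sub(date, 10)
days_between = (date(2016, 2, 29) - d).days   # like datediff(end, start)
plus_3_months = add_months_like(date(2019, 11, 30), 3)

print(plus_10, minus_10, days_between, plus_3_months)
# 2014-03-10 2014-02-18 731 2020-02-29
```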

In [None]:
datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
                     ("2016-02-29", "2016-02-29 08:08:08.999"),
                     ("2017-10-31", "2017-12-31 11:59:59.123"),
                     ("2019-11-30", "2019-08-31 00:00:00.000")
                ]

In [None]:
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")

In [None]:
datetimesDF.show(truncate=False)

- Add 10 days to both date and time values.
- Subtract 10 days from both date and time values.

In [None]:
datetimesDF. \
    withColumn("date_add_date", date_add("date", 10)). \
    withColumn("date_add_time", date_add("time", 10)). \
    withColumn("date_sub_date", date_sub("date", 10)). \
    withColumn("date_sub_time", date_sub("time", 10)). \
    show()

Getting the difference in days between current_date and date values as well as current_timestamp and time values.

In [None]:
datetimesDF. \
    withColumn("datediff_date", datediff(current_date(), "date")). \
    withColumn("datediff_time", datediff(current_timestamp(), "time")). \
    show()

- Getting the number of months between current_date and date values as well as current_timestamp and time values.
- Adding 3 months to both date values as well as time values.


In [None]:
datetimesDF. \
    withColumn("months_between_date", round(months_between(current_date(), "date"), 2)). \
    withColumn("months_between_time", round(months_between(current_timestamp(), "time"), 2)). \
    withColumn("add_months_date", add_months("date", 3)). \
    withColumn("add_months_time", add_months("time", 3)). \
    show(truncate=False)

### Using Date and Time Trunc Functions
- We can use **trunc** or **date_trunc** to get the beginning date of the **week, month, year etc** by passing a ***date*** or ***timestamp*** to it.
- We can use **trunc** to get the beginning **date of the month or year** by passing a ***date or timestamp*** to it - for example trunc(current_date(), "MM") will give the first of the current month.
- We can use **date_trunc** to get the beginning date of the month or year as well as the beginning time of the day or hour by passing a timestamp to it. Note that the argument order differs: **trunc** takes the column first and then the format, while **date_trunc** takes the format first.
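
Truncation simply resets the lower-order fields to their starting values. A plain-Python analogue (a sketch using `datetime.replace`, not Spark code) for the first sample row:

```python
from datetime import date, datetime

d = date(2014, 2, 28)
ts = datetime(2014, 2, 28, 10, 0, 0, 123000)

month_start = d.replace(day=1)            # like trunc(date, "MM")
year_start = d.replace(month=1, day=1)    # like trunc(date, "yy")

# like date_trunc("HOUR", time) and date_trunc("dd", time)
hour_start = ts.replace(minute=0, second=0, microsecond=0)
day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)

print(month_start, year_start)  # 2014-02-01 2014-01-01
print(hour_start, day_start)    # 2014-02-28 10:00:00 2014-02-28 00:00:00
```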

Creating a Data Frame named datetimesDF with columns date and time.

In [None]:
datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
                     ("2016-02-29", "2016-02-29 08:08:08.999"),
                     ("2017-10-31", "2017-12-31 11:59:59.123"),
                     ("2019-11-30", "2019-08-31 00:00:00.000")
                ]
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")

In [None]:
datetimesDF.show(truncate=False)

Getting the **beginning date of the month** using the date field and the **beginning date of the year** using the time field.

In [None]:
datetimesDF. \
    withColumn("date_trunc", trunc("date", "MM")). \
    withColumn("time_trunc", trunc("time", "yy")). \
    show(truncate=False)

Getting the beginning of the month and year, and then the **beginning hour** and day, using **date_trunc** on the date and time fields.

In [None]:
datetimesDF. \
    withColumn("date_trunc", date_trunc('MM', "date")). \
    withColumn("time_trunc", date_trunc('yy', "time")). \
    show(truncate=False)

In [None]:
datetimesDF. \
    withColumn("date_dt", date_trunc("HOUR", "date")). \
    withColumn("time_dt", date_trunc("HOUR", "time")). \
    withColumn("time_dt1", date_trunc("dd", "time")). \
    show(truncate=False)

### Date and Time Extract Functions
- year
- month
- weekofyear
- dayofyear
- dayofmonth
- dayofweek
- hour
- minute
- second


In [None]:
l = [("X", )]
df = spark.createDataFrame(l).toDF("dummy")
df.show()


In [None]:
df.select(
    current_date().alias('current_date'), 
    year(current_date()).alias('year'),
    month(current_date()).alias('month'),
    weekofyear(current_date()).alias('weekofyear'),
    dayofyear(current_date()).alias('dayofyear'),
    dayofmonth(current_date()).alias('dayofmonth'),
    dayofweek(current_date()).alias('dayofweek')
).show() #yyyy-MM-dd

In [None]:
#from pyspark.sql import functions
#help(functions)

In [None]:
df.select(
    current_timestamp().alias('current_timestamp'), 
    year(current_timestamp()).alias('year'),
    month(current_timestamp()).alias('month'),
    dayofmonth(current_timestamp()).alias('dayofmonth'),
    hour(current_timestamp()).alias('hour'),
    minute(current_timestamp()).alias('minute'),
    second(current_timestamp()).alias('second')
).show(truncate=False) #yyyy-MM-dd HH:mm:ss.SSS

### Using to_date and to_timestamp
- **yyyy-MM-dd** is the standard date format
- **yyyy-MM-dd HH:mm:ss.SSS** is the standard timestamp format
- If data is not in the expected standard format, we can use **to_date** and **to_timestamp** to convert non standard dates and timestamps to standard ones respectively.
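
The same idea, parsing text with an explicit pattern into a standard date, can be tried in plain Python with `datetime.strptime`. This is an analogue rather than Spark code, and the pairing of Spark patterns with strptime codes in the comments is an assumed, non-exhaustive mapping. Month names are parsed with the default English locale.

```python
from datetime import datetime

# (input text, strptime format)           Spark-style pattern
examples = [
    ("20210302", "%Y%m%d"),             # yyyyMMdd
    ("2021061", "%Y%j"),                # yyyyDDD (year and day of year)
    ("02-03-2021", "%d-%m-%Y"),         # dd-MM-yyyy
    ("02-Mar-2021", "%d-%b-%Y"),        # dd-MMM-yyyy
    ("March 2, 2021", "%B %d, %Y"),     # MMMM d, yyyy
]

parsed = [datetime.strptime(text, fmt).date() for text, fmt in examples]
for (text, _), value in zip(examples, parsed):
    print(f"{text!r:>18} -> {value.isoformat()}")
```

Every input describes the same day, so each line ends in 2021-03-02.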


In [None]:
datetimes = [(20140228, "28-Feb-2014 10:00:00.123"),
                     (20160229, "20-Feb-2016 08:08:08.999"),
                     (20171031, "31-Dec-2017 11:59:59.123"),
                     (20191130, "31-Aug-2019 00:00:00.000")
                ]

datetimesDF = spark.createDataFrame(datetimes, schema="date BIGINT, time STRING")

datetimesDF.show(truncate=False)


In [None]:
l = [("X", )]
df = spark.createDataFrame(l).toDF("dummy")
df.show()


In [None]:
df.select(to_date(lit('20210302'), 'yyyyMMdd').alias('to_date')).show()

In [None]:
# year and day of year to standard date
df.select(to_date(lit('2021061'), 'yyyyDDD').alias('to_date')).show()

In [None]:
df.select(to_date(lit('02/03/2021'), 'dd/M/yyyy').alias('to_date')).show()

In [None]:
df.select(to_date(lit('02-03-2021'), 'dd-MM-yyyy').alias('to_date')).show()

In [None]:
df.select(to_date(lit('02-Mar-2021'), 'dd-MMM-yyyy').alias('to_date')).show()

In [None]:
df.select(to_date(lit('02-March-2021'), 'dd-MMMM-yyyy').alias('to_date')).show()

In [None]:
df.select(to_date(lit('March 2, 2021'), 'MMMM d, yyyy').alias('to_date')).show()

In [None]:
df.select(to_timestamp(lit('02-Mar-2021'), 'dd-MMM-yyyy').alias('to_timestamp')).show()

In [None]:
datetimesDF.printSchema()

In [None]:
datetimesDF.show(truncate=False)

In [None]:
datetimesDF. \
    withColumn('to_date', to_date(col('date').cast('string'), 'yyyyMMdd')). \
    withColumn('to_timestamp', to_timestamp(col('time'), 'dd-MMM-yyyy HH:mm:ss.SSS')). \
    show(truncate=False)

### Using date_format Function
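- Use date_format to extract the required information in a desired format from a standard date or timestamp. Commonly used pattern letters:
    - yyyy - four digit year
    - MM - two digit month
    - dd - two digit day of the month
    - DD - day of the year (the examples below use DDD)
    - HH - hour in 24 hour format
    - hh - hour in 12 hour format
    - mm - minutes
    - ss - seconds
    - SSS - milliseconds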

In [None]:
datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
                     ("2016-02-29", "2016-02-29 08:08:08.999"),
                     ("2017-10-31", "2017-12-31 11:59:59.123"),
                     ("2019-11-30", "2019-08-31 00:00:00.000")
                ]
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")

datetimesDF.show(truncate=False)


Get the year and month from both date and time columns using _yyyyMM format_. Also make sure that the data type is converted to integer.

In [None]:
datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM")). \
    withColumn("time_ym", date_format("time", "yyyyMM")). \
    show(truncate=False)

In [None]:
datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM")). \
    withColumn("time_ym", date_format("time", "yyyyMM")). \
    printSchema()

In [None]:
datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM").cast('int')). \
    withColumn("time_ym", date_format("time", "yyyyMM").cast('int')). \
    printSchema()

In [None]:
datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM").cast('int')). \
    withColumn("time_ym", date_format("time", "yyyyMM").cast('int')). \
    show(truncate=False)

Getting the information from both date and time columns using yyyyMMddHHmmss format.

In [None]:
datetimesDF. \
    withColumn("date_dt", date_format("date", "yyyyMMddHHmmss")). \
    withColumn("date_ts", date_format("time", "yyyyMMddHHmmss")). \
    show(truncate=False)

In [None]:
datetimesDF. \
    withColumn("date_dt", date_format("date", "yyyyMMddHHmmss").cast('long')). \
    withColumn("date_ts", date_format("time", "yyyyMMddHHmmss").cast('long')). \
    show(truncate=False)

Getting year and day of year using yyyyDDD format.

In [None]:
datetimesDF. \
    withColumn("date_yd", date_format("date", "yyyyDDD").cast('int')). \
    withColumn("time_yd", date_format("time", "yyyyDDD").cast('int')). \
    show(truncate=False)

Getting complete description of the date.

In [None]:
datetimesDF. \
    withColumn("date_desc", date_format("date", "MMMM d, yyyy")). \
    show(truncate=False)

Getting name of the week day using date.

In [None]:
datetimesDF. \
    withColumn("day_name_abbr", date_format("date", "EE")). \
    show(truncate=False)

In [None]:
datetimesDF. \
    withColumn("day_name_full", date_format("date", "EEEE")). \
    show(truncate=False)

### Dealing with Unix Timestamp
- A Unix timestamp is an integer that counts the seconds elapsed since January 1st 1970, midnight UTC.
- This starting point is known as the epoch, and the value is incremented by 1 every second.
- We can convert Unix Timestamp to regular date or timestamp and vice versa.
- We can use **unix_timestamp** to convert a regular date or timestamp to a unix timestamp value. For example unix_timestamp(lit("2019-11-19 00:00:00"))
- We can use **from_unixtime** to convert a unix timestamp to a regular date or timestamp. For example from_unixtime(lit(1574101800))
- We can also pass a format to both the functions.
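
The conversion in both directions can be sanity-checked with Python's standard library. The sketch below is an analogue, not Spark code. `to_unix` and `from_unix` are hypothetical helpers, and they assume the time zone is UTC, whereas Spark uses the session time zone.

```python
from datetime import datetime, timezone


def to_unix(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse text as a UTC date/time and return seconds since the epoch."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def from_unix(seconds, fmt="%Y-%m-%d %H:%M:%S"):
    """Format seconds since the epoch as a UTC date/time string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


print(to_unix("2014-02-28", "%Y-%m-%d"))  # 1393545600
print(to_unix("2014-02-28 10:00:00"))     # 1393581600
print(from_unix(1393561800))              # 2014-02-28 04:30:00
print(from_unix(1393561800, "%Y%m%d"))    # 20140228
```

These values agree with the outputs shown for the first row in the cells of this section, which suggests those cells were run with a UTC session time zone.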


In [30]:
datetimes = [(20140228, "2014-02-28", "2014-02-28 10:00:00.123"),
                     (20160229, "2016-02-29", "2016-02-29 08:08:08.999"),
                     (20171031, "2017-10-31", "2017-12-31 11:59:59.123"),
                     (20191130, "2019-11-30", "2019-08-31 00:00:00.000")
                ]
datetimesDF = spark.createDataFrame(datetimes).toDF("dateid", "date", "time")

datetimesDF.show(truncate=False)


+--------+----------+-----------------------+
|dateid  |date      |time                   |
+--------+----------+-----------------------+
|20140228|2014-02-28|2014-02-28 10:00:00.123|
|20160229|2016-02-29|2016-02-29 08:08:08.999|
|20171031|2017-10-31|2017-12-31 11:59:59.123|
|20191130|2019-11-30|2019-08-31 00:00:00.000|
+--------+----------+-----------------------+



In [32]:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
datetimesDF. \
    withColumn("unix_date_id", unix_timestamp(col("dateid").cast("string"), "yyyyMMdd")). \
    withColumn("unix_date", unix_timestamp("date", "yyyy-MM-dd")). \
    withColumn("unix_time", unix_timestamp("time")). \
    show()

+--------+----------+--------------------+------------+----------+----------+
|  dateid|      date|                time|unix_date_id| unix_date| unix_time|
+--------+----------+--------------------+------------+----------+----------+
|20140228|2014-02-28|2014-02-28 10:00:...|  1393545600|1393545600|1393581600|
|20160229|2016-02-29|2016-02-29 08:08:...|  1456704000|1456704000|1456733288|
|20171031|2017-10-31|2017-12-31 11:59:...|  1509408000|1509408000|1514721599|
|20191130|2019-11-30|2019-08-31 00:00:...|  1575072000|1575072000|1567209600|
+--------+----------+--------------------+------------+----------+----------+



In [33]:
unixtimes = [(1393561800, ),
             (1456713488, ),
             (1514701799, ),
             (1567189800, )
            ]
unixtimesDF = spark.createDataFrame(unixtimes).toDF("unixtime")
unixtimesDF.show()


+----------+
|  unixtime|
+----------+
|1393561800|
|1456713488|
|1514701799|
|1567189800|
+----------+



In [35]:
unixtimesDF.printSchema()

root
 |-- unixtime: long (nullable = true)



In [39]:
unixtimesDF. \
    withColumn("date", from_unixtime("unixtime", "yyyyMMdd")). \
    withColumn("time", from_unixtime("unixtime")). \
    show()
# The second argument ("yyyyMMdd") is the output format; without it the default is yyyy-MM-dd HH:mm:ss

+----------+--------+-------------------+
|  unixtime|    date|               time|
+----------+--------+-------------------+
|1393561800|20140228|2014-02-28 04:30:00|
|1456713488|20160229|2016-02-29 02:38:08|
|1514701799|20171231|2017-12-31 06:29:59|
|1567189800|20190830|2019-08-30 18:30:00|
+----------+--------+-------------------+
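
The reverse conversion can be checked the same way. This is a plain-Python sketch, again assuming a UTC session time zone; `from_unix_seconds` is a hypothetical helper, not a Spark function, and it uses `strftime` directives rather than Spark patterns.

```python
from datetime import datetime, timezone


def from_unix_seconds(seconds: int, fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Format epoch seconds as a string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


for unixtime in (1393561800, 1456713488, 1514701799, 1567189800):
    # Compact date first, then the default timestamp layout
    print(unixtime, from_unix_seconds(unixtime, "%Y%m%d"), from_unix_seconds(unixtime))
```

If the Spark session ran in a different time zone, the Spark output would shift accordingly while this sketch would stay in UTC.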



### Dealing with Nulls
- We can use coalesce to return the first non-null value among its arguments.
- We also have traditional SQL-style functions such as nvl. In this notebook they are invoked through expr or selectExpr.
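
To make the "first non-null value" rule concrete, here is a small plain-Python model of it. It is only an illustration of the semantics, not Spark's implementation; the local `coalesce` function is a stand-in and `None` plays the role of SQL NULL.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if every argument is None."""
    for value in values:
        if value is not None:
            return value
    return None


bonuses = ["10", None, "", "10"]
filled = [coalesce(bonus, "0") for bonus in bonuses]
print(filled)  # ['10', '0', '', '10']
```

Note that the empty string survives: it is a value, not a null, so it needs separate handling.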


In [40]:
employees = [(1, "Scott", "Tiger", 1000.0, 10,
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, None,
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, '',
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 10,
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                ]
employeesDF = spark.createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, last_name STRING, salary FLOAT,
                    bonus STRING, nationality STRING,phone_number STRING, ssn STRING"""
                   )

In [41]:
employeesDF.show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



In [44]:
employeesDF. \
    withColumn('bonus', coalesce('bonus', lit(0))). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|    0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



In [45]:
employeesDF. \
    withColumn('bonus1', col('bonus').cast('int')). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|bonus1|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|    10|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|  null|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|  null|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|    10|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+



In [46]:
employeesDF. \
    withColumn('bonus1', coalesce(col('bonus').cast('int'), lit(0))). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|bonus1|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|    10|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|     0|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|     0|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|    10|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+



In [47]:
employeesDF. \
    withColumn('bonus', expr("nvl(bonus, 0)")). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|    0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



In [49]:
employeesDF. \
    withColumn('bonus', expr("nvl(nullif(bonus, ''), 0)")). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|    0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|    0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
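
The `nvl(nullif(bonus, ''), 0)` expression works in two steps: nullif turns the empty string into a null, and nvl then replaces the null. A plain-Python model of that logic, with `None` standing in for SQL NULL and both functions being local stand-ins rather than Spark APIs, might look like this:

```python
def nullif(value, other):
    """Return None when value equals other, else value (a None input stays None)."""
    return None if value == other else value


def nvl(value, default):
    """Return default when value is None, else value."""
    return default if value is None else value


bonuses = ["10", None, "", "10"]
cleaned = [nvl(nullif(bonus, ""), "0") for bonus in bonuses]
print(cleaned)  # ['10', '0', '0', '10']
```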



In [50]:
employeesDF. \
    withColumn('payment', col('salary') + (col('salary') * coalesce(col('bonus').cast('int'), lit(0)) / 100)). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+-------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|payment|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+-------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789| 1100.0|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123| 1250.0|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|  750.0|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118| 1650.0|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+-------+
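
The payment arithmetic above can be traced row by row in plain Python. This sketch is a simplification: the hypothetical `cast_int` only models the cases in this dataset (integer strings, empty strings, and nulls becoming `None`), not every rule of Spark's cast.

```python
from typing import Optional


def cast_int(text: Optional[str]) -> Optional[int]:
    """Return the integer value of text, or None when it is missing or not an integer."""
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


def payment(salary: float, bonus: Optional[str]) -> float:
    """salary plus bonus percent of salary, treating a missing or invalid bonus as 0."""
    percent = cast_int(bonus)
    if percent is None:
        percent = 0
    return salary + salary * percent / 100


for salary, bonus in [(1000.0, "10"), (1250.0, None), (750.0, ""), (1500.0, "10")]:
    print(salary, repr(bonus), payment(salary, bonus))
```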



## Using CASE and WHEN
- **CASE** and **WHEN** are typically used to apply transformations based on conditions. We can write CASE and WHEN just as in SQL by using expr or selectExpr.
- If we want to use the DataFrame API instead, Spark provides **when** and **otherwise**. when is available in pyspark.sql.functions and returns a Column; otherwise is a method that we invoke on that Column.
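
The evaluation order of a CASE expression (or a when chain) can be modeled in a few lines of plain Python: branches are tried in order, the first matching condition wins, and an unmatched value falls through to the default, which is `None` when no otherwise is given. The `case_when` helper and the sample branches below are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

Branch = Tuple[Callable[[Any], bool], Any]


def case_when(value: Any, branches: Sequence[Branch], otherwise: Optional[Any] = None) -> Any:
    """Return the result of the first branch whose condition holds, else the default."""
    for condition, result in branches:
        if condition(value):
            return result
    return otherwise


# Overlapping conditions on purpose: order decides which one applies
branches = [
    (lambda n: n > 0, "positive"),
    (lambda n: n > 10, "large"),
]

print(case_when(50, branches))                     # positive
print(case_when(-1, branches))                     # None
print(case_when(-1, branches, otherwise="other"))  # other
```

Because 50 satisfies both conditions, the result shows why the more specific condition should come first in a real chain.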


In [52]:
employees = [(1, "Scott", "Tiger", 1000.0, 10,
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, None,
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, '',
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 10,
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                ]
employeesDF = spark.createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, last_name STRING, salary FLOAT, 
                    bonus STRING, nationality STRING,phone_number STRING, ssn STRING"""
                   )
employeesDF.show()


+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



Transform bonus to 0 when it is null or empty; otherwise return the bonus amount.

In [53]:
employeesDF. \
    withColumn('bonus1', coalesce(col('bonus').cast('int'), lit(0))). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|bonus1|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|    10|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|     0|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|     0|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|    10|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+



In [54]:
employeesDF. \
    withColumn('bonus', 
        expr("""
            CASE WHEN bonus IS NULL OR bonus = '' THEN 0
            ELSE bonus
            END
            """)
    ).show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|    0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|    0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



In [57]:
when?

Signature: when(condition: pyspark.sql.column.Column, value: Any) -> pyspark.sql.column.Column
Docstring:
Evaluates a list of conditions and returns one of multiple possible result expressions.
If :func:`pyspark.sql.Column.otherwise` is not invoked, None is returned for unmatched
conditions.

.. versionadded:: 1.4.0

Parameters
----------
condition : :class:`~pyspark.sql.Column`
    a boolean :class:`~pyspark.sql.Column` expression.
value :
    a literal value, or a :class:`~pyspark.sql.Column` expression.

Examples
--------
>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]

>>> df.select(when(df.age == 2, df.age + 1).alias("age")).collect()
[Row(age=3), Row(age=Non

# TASK

Create a DataFrame from a list called persons and categorize each person based on the following rules. Ages are in months.

| Age range                | Category          |
| ------------------------ | ----------------- |
| 0 to 2 Months            | New Born          |
| 2+ Months to 12 Months   | Infant            |
| 12+ Months to 48 Months  | Toddler           |
| 48+ Months to 144 Months | Kid               |
| 144+ Months              | Teenager or Adult |

In [58]:
persons = [
    (1, 1),
    (2, 13),
    (3, 18),
    (4, 60),
    (5, 120),
    (6, 0),
    (7, 12),
    (8, 160)
]
personsDF = spark.createDataFrame(persons, schema='id INT, age INT')

In [59]:
personsDF. \
    withColumn(
        'category',
        expr("""
            CASE
            WHEN age BETWEEN 0 AND 2 THEN 'New Born'
            WHEN age > 2 AND age <= 12 THEN 'Infant'
            WHEN age > 12 AND age <= 48 THEN 'Toddler'
            WHEN age > 48 AND age <= 144 THEN 'Kid'
            ELSE 'Teenager or Adult'
            END
        """)
    ). \
    show()

+---+---+-----------------+
| id|age|         category|
+---+---+-----------------+
|  1|  1|         New Born|
|  2| 13|          Toddler|
|  3| 18|          Toddler|
|  4| 60|              Kid|
|  5|120|              Kid|
|  6|  0|         New Born|
|  7| 12|           Infant|
|  8|160|Teenager or Adult|
+---+---+-----------------+



In [61]:
# Same categorization using the when/otherwise API instead of a SQL expression
personsDF. \
    withColumn(
        'category',
        when(col('age').between(0, 2), 'New Born').
        when((col('age') > 2) & (col('age') <= 12), 'Infant').
        when((col('age') > 12) & (col('age') <= 48), 'Toddler').
        when((col('age') > 48) & (col('age') <= 144), 'Kid').
        otherwise('Teenager or Adult')
    ). \
    show()

+---+---+-----------------+
| id|age|         category|
+---+---+-----------------+
|  1|  1|         New Born|
|  2| 13|          Toddler|
|  3| 18|          Toddler|
|  4| 60|              Kid|
|  5|120|              Kid|
|  6|  0|         New Born|
|  7| 12|           Infant|
|  8|160|Teenager or Adult|
+---+---+-----------------+
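
The boundary handling of these rules (2, 12, 48, and 144 months belong to the lower category) can be double-checked outside Spark. The sketch below uses `bisect_left` over the inclusive upper bounds; it is an alternative technique for verification, not what the notebook does, and it assumes ages are non-negative integers.

```python
from bisect import bisect_left

# Inclusive upper bound (in months) of every category except the last
UPPER_BOUNDS = [2, 12, 48, 144]
CATEGORIES = ["New Born", "Infant", "Toddler", "Kid", "Teenager or Adult"]


def categorize(age_in_months: int) -> str:
    """Map an age in months to its category; assumes age_in_months >= 0."""
    # bisect_left returns the index of the first bound >= age, so a bound itself stays in the lower category
    return CATEGORIES[bisect_left(UPPER_BOUNDS, age_in_months)]


persons = [(1, 1), (2, 13), (3, 18), (4, 60), (5, 120), (6, 0), (7, 12), (8, 160)]
for person_id, age in persons:
    print(person_id, age, categorize(age))
```

Unlike the CASE expression, this sketch would label a negative age as New Born rather than sending it to the ELSE branch, which is why the non-negative assumption matters.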

