In [0]:
PySpark Aggregate Functions with Examples
PySpark provides built-in standard aggregate functions defined in the DataFrame API. These come in handy when we need to perform aggregate operations on DataFrame columns. Aggregate functions operate on a group of rows and calculate a single return value for every group.

All of these aggregate functions accept a Column or a column name as a string, plus other arguments depending on the function, and return a Column.
When possible, prefer these built-in functions over UDFs: they are safer, handle nulls, and perform better. If your application is performance-critical, avoid custom UDFs wherever you can, because their performance is not guaranteed.

PySpark Aggregate Functions
PySpark SQL aggregate functions are grouped as “agg_funcs” in PySpark. The functions from this group that are covered here are described, with examples, in the sections below.
PySpark Aggregate Functions Examples
First, let’s create a DataFrame to work with PySpark aggregate functions. All examples provided here are also available in the PySpark Examples GitHub project.
simpleData = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)

In [0]:
approx_count_distinct Aggregate Function
In PySpark, the approx_count_distinct() function returns the approximate number of distinct items in a group.
from pyspark.sql.functions import approx_count_distinct

# approx_count_distinct()
print("approx_count_distinct: " + \
      str(df.select(approx_count_distinct("salary")).collect()[0][0]))

# Prints approx_count_distinct: 6

avg (average) Aggregate Function
avg() function returns the average of values in the input column.
from pyspark.sql.functions import avg

# avg
print("avg: " + str(df.select(avg("salary")).collect()[0][0]))

# Prints avg: 3400.0

collect_list Aggregate Function
collect_list() function returns all values from an input column with duplicates.

countDistinct Aggregate Function
countDistinct() function returns the number of distinct elements in the selected columns.
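When countDistinct() is given several columns, as in countDistinct("department", "salary") in the complete example later in this notebook, it counts distinct combinations of values. The sketch below is plain Python, not PySpark code; it only mirrors that counting on the sample rows, and the variable names are illustrative.

```python
# Plain-Python illustration; rows copied from the sample DataFrame.
rows = [
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100),
]

# Distinct values of a single column.
distinct_salaries = {salary for _, _, salary in rows}

# Distinct (department, salary) combinations.
distinct_pairs = {(department, salary) for _, department, salary in rows}

print(len(distinct_salaries))  # 6
print(len(distinct_pairs))     # 8
```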

count function
count() function returns the number of elements in a column.
print("count: " + str(df.select(count("salary")).collect()[0][0]))

# Prints count: 10
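As a quick sanity check, the values printed so far can be reproduced without Spark. The following is a plain-Python sketch over the salary column of the sample data (variable names are mine); it computes an exact distinct count, which happens to match the approximate one on this small dataset.

```python
# Salary column of the sample DataFrame, in row order.
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

row_count = len(salaries)             # what count("salary") reports here
average = sum(salaries) / row_count   # what avg("salary") reports here
exact_distinct = len(set(salaries))   # exact counterpart of approx_count_distinct
with_duplicates = list(salaries)      # collect_list keeps duplicates

print(row_count)       # 10
print(average)         # 3400.0
print(exact_distinct)  # 6
```

Note that count() on a column skips null values; the sample data contains none, so the result equals the number of rows.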

grouping function
grouping() indicates whether a given input column is aggregated or not. It returns 1 for aggregated and 0 for not aggregated in the result. If you try to use grouping() directly on the salary column, you will get the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
  // grouping() can only be used with GroupingSets/Cube/Rollup
first function
first() function returns the first element in a column. When ignorenulls is set to True, it returns the first non-null element.

# first
df.select(first("salary")).show(truncate=False)
last function
last() function returns the last element in a column. When ignorenulls is set to True, it returns the last non-null element.
# last
df.select(last("salary")).show(truncate=False)
max function
max() function returns the maximum value in a column.
df.select(max("salary")).show(truncate=False)
min function
min() function returns the minimum value in a column.
df.select(min("salary")).show(truncate=False)
sum function
sum() function returns the sum of all values in a column.
df.select(sum("salary")).show(truncate=False)
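sumDistinct function
sumDistinct() function returns the sum of all distinct values in a column. Since Spark 3.2, sumDistinct() is deprecated in favor of sum_distinct(), which is what the complete example in the next cell uses.
df.select(sumDistinct("salary")).show(truncate=False)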
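To make the difference between sum() and sum_distinct() concrete, here is a plain-Python sketch (not PySpark code) over the sample salary column; the variable names are illustrative.

```python
# Salary column of the sample DataFrame.
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

total = sum(salaries)                # every row counts, duplicates included
total_distinct = sum(set(salaries))  # each distinct value counts once
lowest = min(salaries)
highest = max(salaries)

print(total)           # 34000
print(total_distinct)  # 20900
print(lowest, highest) # 2000 4600
```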

In [0]:
import pyspark 
from pyspark.sql.functions import col,avg,approx_count_distinct,collect_list,collect_set,countDistinct,count,first,last,min,max,sum,sum_distinct
simpleData = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
schema = ["employee_name", "department", "salary"]
df=spark.createDataFrame(data = simpleData,schema = schema)
df.show()
print("approx_count_distinct: "+\
    str(df.select(approx_count_distinct("salary")).collect()[0][0]
     ))

print("avg: "+str(df.select(avg("salary")).collect()[0][0]))

df.select(collect_list("salary")).show(truncate = False)
df.select(collect_set("salary")).show(truncate = False)
df.select(countDistinct("department","salary")).show(truncate = False)
print("countDistinct:"+str(df.select(countDistinct("department","salary")).collect()[0][0]))
print("count:"+str(df.select(count("salary")).collect()[0][0]))
df.select(first("salary")).show()
df.select(last("salary")).show()
df.select(min("salary")).show()
df.select(max("salary")).show()
df.select(sum("salary")).show()
df.select(sum_distinct("salary")).show()



+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

approx_count_distinct: 6
avg: 3400.0
+------------------------------------------------------------+
|collect_list(salary)                                        |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+

+------------------------------------+
|collect_set(salary)                 |
+------------------------------------+
|[4600, 3000, 3900, 4100, 3300, 2000]|
+-------------

In [0]:
PySpark Window Functions
PySpark window functions calculate results such as the rank or row number over a range of input rows. This article explains the concept of window functions, their syntax, and how to use them with PySpark SQL and the PySpark DataFrame API. They come in handy when we need to perform aggregate operations within a specific window frame on DataFrame columns.

When possible, prefer the built-in functions over UDFs: they are safer, handle nulls, and perform better. If your application is performance-critical, avoid custom UDFs wherever you can, because their performance is not guaranteed.
1. Window Functions
PySpark Window functions operate on a group of rows (like frame, partition) and return a single value for every input row. PySpark SQL supports three kinds of window functions:

ranking functions
analytic functions
aggregate functions
The table below describes the ranking and analytic functions; for aggregate functions, any existing aggregate function can be used as a window function.
To perform an operation on a group, we first need to partition the data using Window.partitionBy(); for the row number and rank functions we additionally need to order the rows within each partition using the orderBy clause.

WINDOW FUNCTIONS USAGE & SYNTAX	PYSPARK WINDOW FUNCTIONS DESCRIPTION
row_number()	Returns a sequential number starting from 1 within a window partition.
rank()	Returns the rank of rows within a window partition, with gaps.
percent_rank()	Returns the relative rank (percentile) of rows within a window partition.
dense_rank()	Returns the rank of rows within a window partition without any gaps, whereas rank() leaves gaps after ties.
ntile(n)	Returns the ntile group id (from 1 to n) in an ordered window partition.
cume_dist()	Returns the cumulative distribution of values within a window partition.
lag(col, offset=1, default=None)	Returns the value that is `offset` rows before the current row, and `default` (null unless specified) if there are fewer than `offset` rows before the current row.
lead(col, offset=1, default=None)	Returns the value that is `offset` rows after the current row, and `default` (null unless specified) if there are fewer than `offset` rows after the current row.

In [0]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sparkbyexample.com").getOrCreate()
simpleData = (("James", "Sales", 3000), \
    ("Michael", "Sales", 4600),  \
    ("Robert", "Sales", 4100),   \
    ("Maria", "Finance", 3000),  \
    ("James", "Sales", 3000),    \
    ("Scott", "Finance", 3300),  \
    ("Jen", "Finance", 3900),    \
    ("Jeff", "Marketing", 3000), \
    ("Kumar", "Marketing", 2000),\
    ("Saif", "Sales", 4100) \
  )
columns1 = ["employee_name","department","salary"]
df10 = spark.createDataFrame(data =simpleData, schema = columns1)
df10.show(truncate = False)


+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+



In [0]:
2. PySpark Window Ranking functions
2.1 row_number Window Function
row_number() window function is used to give the sequential row number starting from 1 to the result of each window partition.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec  = Window.partitionBy("department").orderBy("salary")

df.withColumn("row_number",row_number().over(windowSpec)) \
    .show(truncate=False)
2.2 rank Window Function
rank() window function is used to provide a rank to the result within a window partition. This function leaves gaps in rank when there are ties.
"""rank"""
from pyspark.sql.functions import rank
df.withColumn("rank",rank().over(windowSpec)) \
    .show()
2.3 dense_rank Window Function
dense_rank() window function returns the rank of rows within a window partition without any gaps. It is similar to the rank() function; the difference is that rank() leaves gaps in the ranking when there are ties.
"""dense_rank"""
from pyspark.sql.functions import dense_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec)) \
    .show()
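The difference between row_number(), rank(), and dense_rank() is easiest to see on a partition with ties. The sketch below is plain Python rather than PySpark; ranking_columns is a hypothetical helper that mimics the three functions for one partition that is already sorted by salary, using the Sales salaries from the sample data.

```python
def ranking_columns(sorted_values):
    """Return (value, row_number, rank, dense_rank) for ascending values."""
    result = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = position   # rank jumps to the current position after ties
            dense += 1        # dense_rank only ever moves up by one
            previous = value
        result.append((value, position, rank, dense))
    return result


sales = sorted([3000, 4600, 4100, 3000, 4100])
for row in ranking_columns(sales):
    print(row)
# (3000, 1, 1, 1)
# (3000, 2, 1, 1)
# (4100, 3, 3, 2)
# (4100, 4, 3, 2)
# (4600, 5, 5, 3)
```

Tied rows still get distinct row numbers; which of the tied rows comes first is not determined by the ordering column alone.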
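2.4 percent_rank Window Function
percent_rank() window function returns the relative rank of rows within a window partition, as a value between 0 and 1.
""" percent_rank """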
from pyspark.sql.functions import percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec)) \
    .show()
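A percent rank is computed as (rank - 1) / (rows in partition - 1). The plain-Python sketch below applies that formula to one sorted partition; the helper name percent_rank_values is made up for this illustration, and it assumes a single-row partition yields 0.0.

```python
def percent_rank_values(sorted_values):
    """Relative rank of each value: (rank - 1) / (n - 1)."""
    n = len(sorted_values)
    if n <= 1:
        return [0.0] * n
    # rank - 1 equals the number of values strictly smaller than the current one.
    return [
        sum(1 for other in sorted_values if other < value) / (n - 1)
        for value in sorted_values
    ]


print(percent_rank_values([3000, 3000, 4100, 4100, 4600]))  # Sales
# [0.0, 0.0, 0.5, 0.5, 1.0]
print(percent_rank_values([3000, 3300, 3900]))              # Finance
# [0.0, 0.5, 1.0]
```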
2.5 ntile Window Function
ntile() window function distributes the ordered rows of a window partition into n buckets and returns the bucket number of each row. In the example below we pass 2 as the argument to ntile, so it returns one of two values (1 or 2).
"""ntile"""
from pyspark.sql.functions import ntile
df.withColumn("ntile",ntile(2).over(windowSpec)) \
    .show()
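When the number of rows does not divide evenly by n, the earlier buckets receive one extra row each. The following plain-Python sketch shows that distribution; ntile_buckets is a hypothetical helper that takes only the row count of one ordered partition.

```python
def ntile_buckets(n, row_count):
    """Bucket number (1..n) for each row of an ordered partition."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets hold one more row than the rest.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


print(ntile_buckets(2, 5))  # Sales has 5 rows: [1, 1, 1, 2, 2]
print(ntile_buckets(2, 3))  # Finance has 3 rows: [1, 1, 2]
print(ntile_buckets(3, 5))  # [1, 1, 2, 2, 3]
```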
3. PySpark Window Analytic functions
3.1 cume_dist Window Function
cume_dist() window function is used to get the cumulative distribution of values within a window partition.

This is the same as the CUME_DIST function in SQL.
""" cume_dist """
from pyspark.sql.functions import cume_dist    
df.withColumn("cume_dist",cume_dist().over(windowSpec)) \
   .show()
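The cumulative distribution of a row is the fraction of rows in the partition whose ordering value is less than or equal to the current one. A plain-Python sketch of that definition (the helper name is mine) reproduces the values Spark reports for the sample departments:

```python
def cume_dist_values(sorted_values):
    """Fraction of rows with a value <= the current row's value."""
    n = len(sorted_values)
    return [
        sum(1 for other in sorted_values if other <= value) / n
        for value in sorted_values
    ]


print(cume_dist_values([3000, 3000, 4100, 4100, 4600]))  # Sales
# [0.4, 0.4, 0.8, 0.8, 1.0]
print(cume_dist_values([2000, 3000]))                    # Marketing
# [0.5, 1.0]
```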
3.2 lag Window Function
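lag() returns the value that is `offset` rows before the current row within the window partition, or null when there is no such row. This is the same as the LAG function in SQL.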
"""lag"""
from pyspark.sql.functions import lag    
df.withColumn("lag",lag("salary",2).over(windowSpec)) \
      .show()
3.3 lead Window Function
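lead() returns the value that is `offset` rows after the current row within the window partition, or null when there is no such row. This is the same as the LEAD function in SQL.
"""lead"""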
from pyspark.sql.functions import lead    
df.withColumn("lead",lead("salary",2).over(windowSpec)) \
    .show()
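lag and lead simply shift the ordered partition by a number of rows. Here is a plain-Python sketch of that behaviour for one partition; lag_values and lead_values are illustrative helpers that assume a non-negative offset and use None where Spark would show null.

```python
def lag_values(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` if none exists."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def lead_values(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` if none exists."""
    return [
        values[i + offset] if i + offset < len(values) else default
        for i in range(len(values))
    ]


sales = [3000, 3000, 4100, 4100, 4600]  # Sales partition ordered by salary
print(lag_values(sales, 2))   # [None, None, 3000, 3000, 4100]
print(lead_values(sales, 2))  # [4100, 4100, 4600, None, None]
```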
4. PySpark Window Aggregate Functions
In this section, I will explain how to calculate the avg, sum, min, and max for each department using PySpark SQL aggregate window functions and WindowSpec. When an aggregate function should cover the whole partition, we don’t need an order by clause.
windowSpecAgg  = Window.partitionBy("department")
from pyspark.sql.functions import col,avg,sum,min,max,row_number 
df.withColumn("row",row_number().over(windowSpec)) \
  .withColumn("avg", avg(col("salary")).over(windowSpecAgg)) \
  .withColumn("sum", sum(col("salary")).over(windowSpecAgg)) \
  .withColumn("min", min(col("salary")).over(windowSpecAgg)) \
  .withColumn("max", max(col("salary")).over(windowSpecAgg)) \
  .where(col("row")==1).select("department","avg","sum","min","max") \
  .show()
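The per-department values can be cross-checked in plain Python. The sketch below (not PySpark code, names are mine) also shows what changes if the window is ordered: to my understanding, an ordered window without an explicit frame covers the rows from the start of the partition up to the current row and its ties, so aggregates turn into running values.

```python
from collections import defaultdict

# (department, salary) pairs from the sample DataFrame.
rows = [
    ("Sales", 3000), ("Sales", 4600), ("Sales", 4100), ("Finance", 3000),
    ("Sales", 3000), ("Finance", 3300), ("Finance", 3900),
    ("Marketing", 3000), ("Marketing", 2000), ("Sales", 4100),
]

by_department = defaultdict(list)
for department, salary in rows:
    by_department[department].append(salary)

# Whole-partition aggregates, like Window.partitionBy("department").
summary = {
    department: {
        "avg": sum(values) / len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }
    for department, values in by_department.items()
}
print(summary["Sales"])
# {'avg': 3760.0, 'sum': 18800, 'min': 3000, 'max': 4600}


def running_sum(sorted_values):
    """Running sum where tied values share the same frame."""
    return [
        sum(other for other in sorted_values if other <= value)
        for value in sorted_values
    ]


print(running_sum(sorted(by_department["Sales"])))
# [6000, 6000, 14200, 14200, 18800]
```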

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowspec=Window.partitionBy("department").orderBy("salary")
df10.show()
windowspec=Window.partitionBy("department").orderBy("salary")
df10.withColumn("row_number",row_number().over(windowspec)).show(truncate=False)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<command-3085729503368848> in <module>
      2 from pyspark.sql.functions import row_number
      3 windowspec=Window.partitionBy("department").orderBy("salary")
----> 4 df10.show()
      5 windowspec=Window.partitionBy("department").orderBy("salary")

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,rank
windowsec1 = Window.partitionBy("department").orderBy("salary")
df10.withColumn("rank",rank().over(windowsec1))\
    .withColumn("row_number",row_number().over(windowsec1)).show(truncate = False)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<command-3085729503368849> in <module>
      2 from pyspark.sql.functions import row_number,rank
      3 windowsec1 = Window.partitionBy("department").orderBy("salary")
----> 4 df10.withColumn("rank",rank().over(windowsec1))\
      5     .withColumn(

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,rank,dense_rank
windowspec2 = Window.partitionBy("department").orderBy("salary")
df10.withColumn("dense_rank",dense_rank().over(windowspec2))\
    .withColumn("row_number",row_number().over(windowspec2))\
    .withColumn("rank",rank().over(windowspec2)).show(truncate = False)

+-------------+----------+------+----------+----------+----+
|employee_name|department|salary|dense_rank|row_number|rank|
+-------------+----------+------+----------+----------+----+
|Maria        |Finance   |3000  |1         |1         |1   |
|Scott        |Finance   |3300  |2         |2         |2   |
|Jen          |Finance   |3900  |3         |3         |3   |
|Kumar        |Marketing |2000  |1         |1         |1   |
|Jeff         |Marketing |3000  |2         |2         |2   |
|James        |Sales     |3000  |1         |1         |1   |
|James        |Sales     |3000  |1         |2         |1   |
|Robert       |Sales     |4100  |2         |3         |3   |
|Saif         |Sales     |4100  |2         |4         |3   |
|Michael      |Sales     |4600  |3         |5         |5   |
+-------------+----------+------+----------+----------+----+



In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import ntile,percent_rank,row_number,rank,dense_rank
windowspec = Window.partitionBy("department").orderBy("salary")
df10.withColumn("percent_rank",percent_rank().over(windowspec))\
    .withColumn("dense_rank",dense_rank().over(windowspec))\
    .withColumn("rank",rank().over(windowspec))\
    .withColumn("row_number",row_number().over(windowspec))\
    .withColumn("ntile",ntile(3).over(windowspec)).show(truncate = False)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<command-3085729503368852> in <module>
      2 from pyspark.sql.functions import ntile,percent_rank,row_number,rank,dense_rank
      3 windowspec = Window.partitionBy("department").orderBy("salary")
----> 4 df10.withColumn("percent_rank",percent_rank().over(windowspec))

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import lag,lead,cume_dist,percent_rank,dense_rank,rank,row_number,ntile
windowspec3 =Window.partitionBy("department").orderBy("salary")
df10.withColumn("cume_dist",cume_dist().over(windowspec3))\
    .withColumn("lead",lead("salary").over(windowspec3))\
    .withColumn("lead1",lead("salary",2).over(windowspec3))\
    .withColumn("lag",lag("salary").over(windowspec3))\
    .withColumn("lag1",lag("salary",2).over(windowspec3)).show(truncate = False)


+-------------+----------+------+------------------+----+-----+----+----+
|employee_name|department|salary|cume_dist         |lead|lead1|lag |lag1|
+-------------+----------+------+------------------+----+-----+----+----+
|Maria        |Finance   |3000  |0.3333333333333333|3300|3900 |null|null|
|Scott        |Finance   |3300  |0.6666666666666666|3900|null |3000|null|
|Jen          |Finance   |3900  |1.0               |null|null |3300|3000|
|Kumar        |Marketing |2000  |0.5               |3000|null |null|null|
|Jeff         |Marketing |3000  |1.0               |null|null |2000|null|
|James        |Sales     |3000  |0.4               |3000|4100 |null|null|
|James        |Sales     |3000  |0.4               |4100|4100 |3000|null|
|Robert       |Sales     |4100  |0.8               |4100|4600 |3000|3000|
|Saif         |Sales     |4100  |0.8               |4600|null |4100|3000|
|Michael      |Sales     |4600  |1.0               |null|null |4100|4100|
+-------------+----------+------+-----

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import avg,min,max,sum,row_number,col
windowspwc = Window.partitionBy("department").orderBy("salary")
df10.withColumn("row_number",row_number().over(windowspwc))\
    .withColumn("avg",avg(col("salary")).over(windowspwc))\
    .withColumn("sum",sum(col("salary")).over(windowspwc))\
    .withColumn("min",min(col("salary")).over(windowspwc))\
    .withColumn("max",max(col("salary")).over(windowspwc))\
    .where(col("row_number")==2).select("employee_name","department","salary","row_number").show(truncate = False)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<command-3085729503368854> in <module>
      2 from pyspark.sql.functions import avg,min,max,sum,row_number,col
      3 windowspwc = Window.partitionBy("department").orderBy("salary")
----> 4 df10.withColumn("row_number",row_number().over(windowspwc))

In [0]:
PySpark SQL Date and Timestamp Functions
PySpark Date and Timestamp functions are supported on DataFrames and in SQL queries, and they work similarly to traditional SQL. Dates and times are very important if you are using PySpark for ETL. Most of these functions accept a Date type, Timestamp type, or String as input. If a String is used, it should be in a default format that can be cast to a date.

DateType default format is yyyy-MM-dd 
TimestampType default format is yyyy-MM-dd HH:mm:ss.SSSS
Returns null if the input is a string that can not be cast to Date or Timestamp.
PySpark SQL provides several Date & Timestamp functions, so it is worth getting to know them. Prefer these functions over writing your own (UDF): they are safer, handle nulls, and perform better than PySpark UDFs. If your PySpark application is performance-critical, avoid custom UDFs wherever you can, because their performance is not guaranteed.
For readability, I’ve grouped these functions into the following groups.

Date Functions
Timestamp Functions
Date and Timestamp Window Functions
Before you use any of the examples below, make sure you create a PySpark SparkSession and import the SQL functions.
from pyspark.sql.functions import *
PySpark SQL Date Functions
Below are some of the PySpark SQL Date functions; these functions operate on just the Date.

The default format of the PySpark Date is yyyy-MM-dd.

PYSPARK DATE FUNCTION	DATE FUNCTION DESCRIPTION
current_date()	Returns the current date as a date column.
date_format(dateExpr,format)	Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
to_date()	Converts the column into `DateType` by casting rules to `DateType`.
to_date(column, fmt)	Converts the column into a `DateType` with a specified format
add_months(column, numMonths)	Returns the date that is `numMonths` months after the date in `column`.
date_add(column, days)	Returns the date that is `days` days after the date in `column`.
date_sub(column, days)	Returns the date that is `days` days before the date in `column`.
datediff(end, start)	Returns the number of days from `start` to `end`.
months_between(end, start)	Returns number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
months_between(end, start, roundOff)	Returns number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.
next_day(column, dayOfWeek)	Returns the first date which is later than the value of the `date` column that is on the specified day of the week.
For example, `next_day('2015-07-27', "Sunday")` returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(column, format)	Returns date truncated to the unit specified by the format.
For example, `trunc("2018-11-19 12:01:19", "year")` returns 2018-01-01
format: 'year', 'yyyy', 'yy' to truncate by year,
'month', 'mon', 'mm' to truncate by month
date_trunc(format, timestamp)	Returns timestamp truncated to the unit specified by the format.
For example, `date_trunc("year", "2018-11-19 12:01:19")` returns 2018-01-01 00:00:00
format: 'year', 'yyyy', 'yy' to truncate by year,
'month', 'mon', 'mm' to truncate by month,
'day', 'dd' to truncate by day,
Other options are: 'second', 'minute', 'hour', 'week', 'month', 'quarter'
year(column)	Extracts the year as an integer from a given date/timestamp/string
quarter(column)	Extracts the quarter as an integer from a given date/timestamp/string.
month(column)	Extracts the month as an integer from a given date/timestamp/string
dayofweek(column)	Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday
dayofmonth(column)	Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(column)	Extracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(column)	Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601
last_day(column)	Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
from_unixtime(column)	Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(column, f)	Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp()	Returns the current Unix timestamp (in seconds) as a long
unix_timestamp(column)	Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(column, p)	Converts time string with given pattern to Unix timestamp (in seconds).
PySpark SQL Timestamp Functions
Below are some of the PySpark SQL Timestamp functions; these functions operate on both date and timestamp values.
The default format of the Spark Timestamp is yyyy-MM-dd HH:mm:ss.SSSS

PYSPARK TIMESTAMP FUNCTION SIGNATURE	TIMESTAMP FUNCTION DESCRIPTION
current_timestamp ()	Returns the current timestamp as a timestamp column
hour(column)	Extracts the hours as an integer from a given date/timestamp/string.
minute(column)	Extracts the minutes as an integer from a given date/timestamp/string.
second(column)	Extracts the seconds as an integer from a given date/timestamp/string.
to_timestamp(column)	Converts to a timestamp by casting rules to `TimestampType`.
to_timestamp(column, fmt)	Converts time string with the given pattern to timestamp.
Date and Timestamp Window Functions
Below are the PySpark Date and Timestamp window functions.
DATE & TIME WINDOW FUNCTION SYNTAX	DATE & TIME WINDOW FUNCTION DESCRIPTION
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column	Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column	Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC
window(timeColumn: Column, windowDuration: String): Column	Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
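The inclusive-start, exclusive-end rule for tumbling windows can be illustrated in plain Python. This is only a sketch of the bucketing arithmetic, assuming windows aligned to 1970-01-01 00:00:00 UTC as described above; tumbling_window is a hypothetical helper, not part of PySpark, and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(timestamp, duration):
    """Return the [start, end) window of length `duration` containing `timestamp`."""
    elapsed = timestamp - EPOCH
    start = timestamp - (elapsed % duration)  # round down to a window boundary
    return start, start + duration


five_minutes = timedelta(minutes=5)
at_1205 = datetime(2022, 9, 27, 12, 5, tzinfo=timezone.utc)

start, end = tumbling_window(at_1205, five_minutes)
print(start.time(), end.time())  # 12:05:00 12:10:00

# One second earlier falls into the previous window, [12:00, 12:05).
start_prev, end_prev = tumbling_window(at_1205 - timedelta(seconds=1), five_minutes)
print(start_prev.time(), end_prev.time())  # 12:00:00 12:05:00
```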
PySpark SQL Date and Timestamp Functions Examples
Following are the most used PySpark SQL Date and Timestamp functions with examples; you can use these on DataFrame and SQL expressions.

In [0]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("sparkby examples").getOrCreate()
data99 = ( ("1","2020-02-01"),("2","2019-03-01"),("3","2021-12-31"))
#df11 = spark.createDataFrame(data = data99,schema=["id","date"])
df11 = spark.createDataFrame(data99 ,["id","input"])
df11.show()
df11.withColumn("current_date",current_date().alias("currentdat")).show()
df11.withColumn("date_format",date_format(col("input"),"MM-dd-yyyy")).show()
df11.withColumn("to_date",to_date(col("input"),"yyyy-MM-dd")).show()
df11.withColumn("datediff",datediff(current_date(),col("input"))).show()
df11.withColumn("months_between",months_between(current_date(),col("input"))).show()


+---+----------+
| id|     input|
+---+----------+
|  1|2020-02-01|
|  2|2019-03-01|
|  3|2021-12-31|
+---+----------+

+---+----------+------------+
| id|     input|current_date|
+---+----------+------------+
|  1|2020-02-01|  2022-09-28|
|  2|2019-03-01|  2022-09-28|
|  3|2021-12-31|  2022-09-28|
+---+----------+------------+

+---+----------+-----------+
| id|     input|date_format|
+---+----------+-----------+
|  1|2020-02-01| 02-01-2020|
|  2|2019-03-01| 03-01-2019|
|  3|2021-12-31| 12-31-2021|
+---+----------+-----------+

+---+----------+----------+
| id|     input|   to_date|
+---+----------+----------+
|  1|2020-02-01|2020-02-01|
|  2|2019-03-01|2019-03-01|
|  3|2021-12-31|2021-12-31|
+---+----------+----------+

+---+----------+--------+
| id|     input|datediff|
+---+----------+--------+
|  1|2020-02-01|     970|
|  2|2019-03-01|    1307|
|  3|2021-12-31|     271|
+---+----------+--------+

+---+----------+--------------+
| id|     input|months_between|
+---+----------+-----

In [0]:
df11.select(current_date().alias("current_date")).show()
df11.select(col("input"),date_format(col("input"),"dd-MM-yyyy").alias("date_format")).show()
df11.select(col("input"),datediff(current_date(),col("input")).alias("datediff")).show()
df11.select(col("input"),to_date(col("input"),"yyyy-MM-dd").alias("to_date")).show()
df11.select(col("input"),months_between(current_date(),col("input")).alias("months_between")).show()
df11.select(col("input"),trunc(col("input"),"month").alias("month_trunc"),
            trunc(col("input"),"year").alias("year_trunc"),
            trunc(col("input"),"day").alias("Day_trunc")).show()
df11.select(col("input"),add_months(col("input"),2).alias("add_month"),
            add_months(col("input"),-2).alias("sub_months"),
            date_add(col("input"),2).alias("date_add"),
            date_sub(col("input"),2).alias("date_sub")).show()

+------------+
|current_date|
+------------+
|  2022-09-27|
|  2022-09-27|
|  2022-09-27|
+------------+

+----------+-----------+
|     input|date_format|
+----------+-----------+
|2020-02-01| 01-02-2020|
|2019-03-01| 01-03-2019|
|2021-12-31| 31-12-2021|
+----------+-----------+

+----------+--------+
|     input|datediff|
+----------+--------+
|2020-02-01|     969|
|2019-03-01|    1306|
|2021-12-31|     270|
+----------+--------+

+----------+----------+
|     input|   to_date|
+----------+----------+
|2020-02-01|2020-02-01|
|2019-03-01|2019-03-01|
|2021-12-31|2021-12-31|
+----------+----------+

+----------+--------------+
|     input|months_between|
+----------+--------------+
|2020-02-01|   31.83870968|
|2019-03-01|   42.83870968|
|2021-12-31|    8.87096774|
+----------+--------------+

+----------+-----------+----------+---------+
|     input|month_trunc|year_trunc|Day_trunc|
+----------+-----------+----------+---------+
|2020-02-01| 2020-02-01|2020-01-01|     null|
|2019-03-01| 

In [0]:
df11.select(col("input"),year(col("input")).alias("year"),
            month(col("input")).alias("month"),
            next_day(col("input"),"tuesday").alias("next_day"),
            weekofyear(col("input")).alias("weekofyear")).show(truncate = False)
            

+----------+----+-----+----------+----------+
|input     |year|month|next_day  |weekofyear|
+----------+----+-----+----------+----------+
|2020-02-01|2020|2    |2020-02-04|5         |
|2019-03-01|2019|3    |2019-03-05|9         |
|2021-12-31|2021|12   |2022-01-04|52        |
+----------+----+-----+----------+----------+



In [0]:
df11.select(col("input"),dayofyear(col("input")).alias("dayofyear"),
            dayofmonth(col("input")).alias("dayofmonth"),
            dayofweek(col("input")).alias("dayofweek")
           ).show(truncate = False)

+----------+---------+----------+---------+
|input     |dayofyear|dayofmonth|dayofweek|
+----------+---------+----------+---------+
|2020-02-01|32       |1         |7        |
|2019-03-01|60       |1         |6        |
|2021-12-31|365      |31        |6        |
+----------+---------+----------+---------+



In [0]:
data1=[["1","02-01-2020 11 01 19 06"],["2","03-01-2019 12 01 19 406"],["3","03-01-2021 12 01 19 406"]]
df12 = spark.createDataFrame(data1,["id","input"])
df12.show(truncate = False)
df12.select(current_timestamp().alias("currenttimestamp")).show(1,truncate=False)
df12.select(col("input"),to_timestamp(col("input"),"dd-MM-yyyy HH mm ss SSS").alias("currenttimestamp")).show(truncate = False)

+---+-----------------------+
|id |input                  |
+---+-----------------------+
|1  |02-01-2020 11 01 19 06 |
|2  |03-01-2019 12 01 19 406|
|3  |03-01-2021 12 01 19 406|
+---+-----------------------+

+-----------------------+
|currenttimestamp       |
+-----------------------+
|2022-09-27 14:49:10.679|
+-----------------------+
only showing top 1 row

+-----------------------+-----------------------+
|input                  |currenttimestamp       |
+-----------------------+-----------------------+
|02-01-2020 11 01 19 06 |2020-01-02 11:01:19.06 |
|03-01-2019 12 01 19 406|2019-01-03 12:01:19.406|
|03-01-2021 12 01 19 406|2021-01-03 12:01:19.406|
+-----------------------+-----------------------+



In [0]:
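# "input" is still a plain string in a non-default format here, so hour(), minute() and second() cannot cast it and return null
df12.select(col("input"),hour(col("input")).alias("hour"),
            minute(col("input")).alias("minute"),
            second(col("input")).alias("seconds")
           ).show(truncate = False)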

+-----------------------+----+------+-------+
|input                  |hour|minute|seconds|
+-----------------------+----+------+-------+
|02-01-2020 11 01 19 06 |null|null  |null   |
|03-01-2019 12 01 19 406|null|null  |null   |
|03-01-2021 12 01 19 406|null|null  |null   |
+-----------------------+----+------+-------+



In [0]:
PySpark JSON Functions with Examples
PySpark JSON functions are used to query or extract elements from a JSON string in a DataFrame column by path, or to convert it to a struct, map type, etc. In this article, I will explain the most used JSON SQL functions with Python examples.

1. PySpark JSON Functions
from_json() – Converts a JSON string into a Struct type or Map type.
to_json() – Converts a MapType or Struct type to a JSON string.
json_tuple() – Extracts data from JSON and returns it as new columns.
get_json_object() – Extracts a JSON element from a JSON string based on the specified JSON path.
schema_of_json() – Creates a schema string from a JSON string.
1.1. Create a DataFrame with a Column Containing a JSON String
To explain these JSON functions, let’s first create a DataFrame with a column that contains a JSON string.
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
df=spark.createDataFrame([(1, jsonString)],["id","value"])
df.show(truncate=False)
2. PySpark JSON Functions Examples
2.1. from_json()
The PySpark from_json() function converts a JSON string into a Struct type or Map type. The example below converts the JSON string into Map key-value pairs. I will leave the conversion to struct type to you; see Convert JSON string to Struct type column.


#Convert JSON string column to Map type
from pyspark.sql.types import MapType,StringType
from pyspark.sql.functions import from_json
df2=df.withColumn("value",from_json(df.value,MapType(StringType(),StringType())))
df2.printSchema()
df2.show(truncate=False)
2.2. to_json()
The to_json() function converts a DataFrame column of MapType or Struct type to a JSON string. Here, I am using df2, which was created in the from_json() example above.

from pyspark.sql.functions import to_json,col
df2.withColumn("value",to_json(col("value"))) \
   .show(truncate=False)
2.3. json_tuple()
The json_tuple() function is used to query or extract elements from a JSON column and return them as new columns.
from pyspark.sql.functions import json_tuple
df.select(col("id"),json_tuple(col("value"),"Zipcode","ZipCodeType","City")) \
    .toDF("id","Zipcode","ZipCodeType","City") \
    .show(truncate=False)
2.4. get_json_object()
get_json_object() extracts a JSON element, returned as a string, from the JSON column based on the given path.
from pyspark.sql.functions import get_json_object
df.select(col("id"),get_json_object(col("value"),"$.ZipCodeType").alias("ZipCodeType")) \
    .show(truncate=False)

2.5. schema_of_json()
Use schema_of_json() to create a schema string from a JSON string.
from pyspark.sql.functions import schema_of_json,lit
schemaStr=spark.range(1) \
    .select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))) \
    .collect()[0][0]
print(schemaStr)


https://sparkbyexamples.com/spark/spark-streaming-read-json-files-from-directory/

In [0]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType,MapType,StructType
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("pysparkByexamples.com").getOrCreate()
jsonString = """{"zipcode":704,"zipCodetype":"STANDARD","city":"PARC PARQUE","state":"PR"}"""
df33 = spark.createDataFrame([(1,jsonString)],["id","value"])
df33.printSchema()
df33.show(truncate = False)

df44=df33.withColumn("value",from_json(df33.value,MapType(StringType(),StringType())))
df44.printSchema()
df44.show(truncate = False)

# Struct variant: parse the same JSON string with an explicit schema
schema = (StructType()
          .add("zipcode", StringType(), True)
          .add("zipCodetype", StringType(), True)
          .add("city", StringType(), True)
          .add("state", StringType(), True))
df55 = df33.withColumn("value", from_json(df33.value, schema))
df55.printSchema()
df55.show(truncate = False)

root
 |-- id: long (nullable = true)
 |-- value: string (nullable = true)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"zipcode":704,"zipCodetype":"STANDARD","city":"PARC PARQUE","state":"PR"}|
+---+--------------------------------------------------------------------------+

root
 |-- id: long (nullable = true)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |{zipcode -> 704, zipCodetype -> STANDARD, city -> PARC PARQUE, state -> PR}|
+---+-----------------------------------------

In [0]:
df55 = df44.withColumn("value",to_json(df44.value))
df55.show(truncate = False)

jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
df=spark.createDataFrame([(1, jsonString)],["id","value"])

df.select(col("id"),json_tuple(col("value"),"Zipcode","ZipCodeType","City","State"))\
            .toDF("id","zipcode","zipCodetype","city","state")\
            .show(truncate = False)

df.select(col("id"),get_json_object(col("value"),"$.ZipCodeType").alias("zipCodetype")).show(truncate = False)


+---+----------------------------------------------------------------------------+
|id |value                                                                       |
+---+----------------------------------------------------------------------------+
|1  |{"zipcode":"704","zipCodetype":"STANDARD","city":"PARC PARQUE","state":"PR"}|
+---+----------------------------------------------------------------------------+

+---+-------+-----------+-----------+-----+
|id |zipcode|zipCodetype|city       |state|
+---+-------+-----------+-----------+-----+
|1  |704    |STANDARD   |PARC PARQUE|PR   |
+---+-------+-----------+-----------+-----+

+---+-----------+
|id |zipCodetype|
+---+-----------+
|1  |STANDARD   |
+---+-----------+



In [0]:
1. PySpark UDF Introduction
1.1 What is UDF?
UDFs, a.k.a. User Defined Functions, are nothing new if you are coming from a SQL background: most traditional RDBMS databases support them. These functions need to be registered in the database library before they can be used in SQL like regular functions.

PySpark UDFs are similar to UDFs in traditional databases. In PySpark, you write a function in Python syntax and either wrap it with PySpark SQL udf() to use it on a DataFrame, or register it as a UDF to use it in SQL.

1.2 Why do we need a UDF?
UDFs are used to extend the functions of the framework and to reuse those functions on multiple DataFrames. For example, suppose you want to capitalize the first letter of every word in a name string while leaving the other letters untouched. The built-in initcap() comes close but also lowercases the rest of each word, so you can write a UDF for the exact behavior and reuse it on as many DataFrames and SQL expressions as needed.

Before you create any UDF, check whether a similar function is already available in Spark SQL Functions. PySpark SQL provides many predefined common functions, and more are added with every release, so it is best to check before reinventing the wheel.
When you create UDFs, design them carefully; otherwise you will run into optimization and performance issues.

2. Create PySpark UDF
2.1 Create a DataFrame
Before we jump into creating a UDF, let’s first create a PySpark DataFrame.
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show(truncate=False)
2.2 Create a Python Function
The first step in creating a UDF is writing a Python function. The snippet below creates a function convertCase() that takes a string parameter and converts the first letter of every word to a capital letter. UDFs take parameters of your choice and return a value.


def convertCase(str):
    resStr=""
    arr = str.split(" ")
    for x in arr:
       resStr= resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr 
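
Before wrapping the function as a UDF, it can be checked as plain Python. The sketch below repeats convertCase() so that it runs on its own and shows that the loop leaves a trailing space after the last word. The convert_case_joined() variant is a hypothetical alternative, not part of the original article, that joins the words instead.

```python
def convertCase(str):
    # Same logic as the function above: capitalize the first letter of each
    # space-separated word and append a space after every word.
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr


def convert_case_joined(name):
    # Hypothetical variant: joining the words avoids the trailing space.
    return " ".join(word[0:1].upper() + word[1:] for word in name.split(" "))


original = convertCase("john jones")
joined = convert_case_joined("john jones")
print(repr(original))
print(repr(joined))
```

Either function can be passed to udf() in the next step; the difference only matters if the result is later compared against strings without trailing whitespace.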
2.3 Convert a Python function to PySpark UDF
Now convert the convertCase() function to a UDF by passing it to PySpark SQL udf(), which is available in the pyspark.sql.functions module. Make sure you import it before using it.

The PySpark SQL udf() function returns a user-defined function object that can be called with Column arguments.


""" Converting function to UDF """
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
convertUDF = udf(lambda z: convertCase(z),StringType())
Note: The default return type of udf() is StringType, so you can also write the above statement without the return type.

""" Converting function to UDF 
StringType() is by default hence not required """
convertUDF = udf(lambda z: convertCase(z)) 
3. Using UDF with DataFrame
3.1 Using UDF with PySpark DataFrame select()
Now you can use convertUDF() on a DataFrame column like a regular built-in function.


df.select(col("Seqno"), \
    convertUDF(col("Name")).alias("Name") ) \
   .show(truncate=False)
3.2 Using UDF with PySpark DataFrame withColumn()
You can also use a UDF with the DataFrame withColumn() function. To explain this, I will create another function, upperCase(), which converts the input string to upper case.


def upperCase(str):
    return str.upper()
Let’s convert the upperCase() Python function to a UDF and then use it with DataFrame withColumn(). The example below converts the values of the “Name” column to upper case and creates a new column “Curated Name”.


upperCaseUDF = udf(lambda z:upperCase(z),StringType())   

df.withColumn("Curated Name", upperCaseUDF(col("Name"))) \
  .show(truncate=False)
3.3 Registering PySpark UDF & use it on SQL
In order to use convertCase() function on PySpark SQL, you need to register the function with PySpark by using spark.udf.register().


""" Using UDF on SQL """
spark.udf.register("convertUDF", convertCase,StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE") \
     .show(truncate=False)
This yields the same output as 3.1 example.

4. Creating UDF using annotation
In the previous sections, you learned that creating a UDF is a two-step process: first you create a Python function, then you convert it to a UDF using the SQL udf() function. You can combine both into a single step by using udf as an annotation (a Python decorator) on the function definition.


@udf(returnType=StringType()) 
def upperCase(str):
    return str.upper()

df.withColumn("Curated Name", upperCase(col("Name"))) \
.show(truncate=False)
This produces the same output as section 3.2.
5. Special Handling
5.1 Execution order
One thing to be aware of is that PySpark/Spark does not guarantee the order of evaluation of subexpressions: expressions are not guaranteed to be evaluated left-to-right or in any other fixed order. PySpark reorders execution during query optimization and planning, so the operands of AND and OR and the conditions in WHERE and HAVING clauses may be evaluated in any order, and you cannot rely on one condition short-circuiting another.

So when you design and use a UDF, be very careful, especially with null handling, as mistakes here result in runtime exceptions.


""" 
No guarantee Name is not null will execute first
If convertUDF(Name) like '%John%' execute first then 
you will get runtime error
"""
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE " + \
         "where Name is not null and convertUDF(Name) like '%John%'") \
     .show(truncate=False)
5.2 Handling null check
UDFs are error-prone when not designed carefully, for example when a column contains null on some records:


""" null check """

columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders"),
    ('4',None)]

df2 = spark.createDataFrame(data=data,schema=columns)
df2.show(truncate=False)
df2.createOrReplaceTempView("NAME_TABLE2")

spark.sql("select convertUDF(Name) from NAME_TABLE2") \
     .show(truncate=False)
Note that in the above snippet, the record with “Seqno 4” has the value None for the “Name” column. Since the UDF does not handle null, using it on this DataFrame returns the error below. Note that in Python, None is considered null.


AttributeError: 'NoneType' object has no attribute 'split'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
Points to remember:

It is always best practice to check for null inside the UDF rather than outside it.
If you can’t do a null check inside the UDF, at least use IF or CASE WHEN to check for null and call the UDF conditionally.

spark.udf.register("_nullsafeUDF", lambda str: convertCase(str) if str is not None else "" , StringType())

spark.sql("select _nullsafeUDF(Name) from NAME_TABLE2") \
     .show(truncate=False)

spark.sql("select Seqno, _nullsafeUDF(Name) as Name from NAME_TABLE2 " + \
          " where Name is not null and _nullsafeUDF(Name) like '%John%'") \
     .show(truncate=False)    
This executes successfully without errors because the registered lambda checks for null/None before calling convertCase().
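
The same guard can also live inside the Python function itself, which is the first of the two points above. The sketch below is plain Python so that it runs without a Spark session. The convert_case_nullsafe() wrapper is a hypothetical name, and convertCase() is repeated only to keep the example self-contained.

```python
def convertCase(str):
    # Same logic as the article's function: capitalize each space-separated word.
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr


def convert_case_nullsafe(name):
    # Hypothetical wrapper: pass None through instead of calling split() on it.
    if name is None:
        return None
    return convertCase(name)


rows = ["john jones", None]
converted = [convert_case_nullsafe(r) for r in rows]
print(converted)
```

Returning None keeps a SQL null in the result column, whereas the registered lambda substitutes an empty string; which is preferable depends on how downstream code treats missing names. The wrapper could be passed to udf() or spark.udf.register() in the same way as convertCase().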

5.3 Performance concern using UDF
UDFs are a black box to PySpark, so it cannot optimize them, and you lose the optimizations PySpark applies to DataFrame/Dataset operations. When possible, use Spark SQL built-in functions, as these benefit from optimization. Consider creating a UDF only when no existing built-in SQL function does what you need.


In [0]:
import pyspark 
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sparkbyExample").getOrCreate()
columns=["seqno","name"]
data2 = [("1","john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]
df55 = spark.createDataFrame(data = data2,schema = columns)
df55.show()

+-----+------------+
|seqno|        name|
+-----+------------+
|    1|  john jones|
|    2|tracey smith|
|    3| amy sanders|
+-----+------------+



In [0]:
from pyspark.sql.functions import udf

def convertCase(str):
    resStr=""
    arr = str.split(" ")
    for x in arr:
       resStr= resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr

In [0]:
from pyspark.sql.types import StringType
convertUDF = udf(lambda z: convertCase(z),StringType())

In [0]:
from pyspark.sql.functions import col,udf
# select() returns a new DataFrame, so show() has to be chained onto it to see the converted names
df55.select(col("seqno"),convertUDF(col("name")).alias("name")).show(truncate = False)

+-----+-------------+
|seqno|name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

