# Processing Column Data

As part of this module we will explore the functions available under `org.apache.spark.sql.functions` to derive new values from existing column values with in a Data Frame.

* Pre-defined Functions
* Create Dummy Data Frame
* Categories of Functions
* Special Functions - col and lit
* String Manipulation Functions - 1
* String Manipulation Functions - 2
* Date and Time Overview
* Date and Time Arithmetic
* Date and Time - trunc and date_trunc
* Date and Time - Extracting Information
* Dealing with Unix Timestamp
* Example - Word Count
* Conclusion

## Pre-defined Functions

We typically process data in the columns using functions in `pyspark.sql.functions`. Let us understand details about these functions in detail as part of this module.
* Let us recap about Functions or APIs to process Data Frames.
 * Projection - `select` or `withColumn` or `drop` or `selectExpr`
 * Filtering - `filter` or `where`
 * Grouping data by key and perform aggregations - `groupBy`
 * Sorting data - `sort` or `orderBy` 
* We can pass column names or literals or expressions to all the Data Frame APIs.
* Expressions include arithmetic operations, transformations using functions from `pyspark.sql.functions`.
* There are approximately 300 functions under `pyspark.sql.functions`.
* We will talk about some of the important functions used for String Manipulation, Date Manipulation etc.

#############################
* String Manipulation Functions
  * Case Conversion - `lower`,  `upper`
  * Getting Length -  `length`
  * Extracting substrings - `substring`, `split`
  * Trimming - `trim`, `ltrim`, `rtrim`
  * Padding - `lpad`, `rpad`
  * Concatenating string - `concat`, `concat_ws`
* Date Manipulation Functions
  * Getting current date and time - `current_date`, `current_timestamp`
  * Date Arithmetic - `date_add`, `date_sub`, `datediff`, `months_between`, `add_months`, `next_day`
  * Beginning and Ending Date or Time - `last_day`, `trunc`, `date_trunc`
  * Formatting Date - `date_format`
  * Extracting Information - `dayofyear`, `dayofmonth`, `dayofweek`, `year`, `month`
* Aggregate Functions
  * `count`, `countDistinct`
  * `sum`, `avg`
  * `min`, `max`
* Other Functions - We will explore depending on the use cases.
  * `CASE` and `WHEN`
  * `CAST` for type casting
  * Functions to manage special types such as `ARRAY`, `MAP`, `STRUCT` type columns
  * Many others
  
  * Concatenating strings
  * We can pass a variable number of strings to `concat` function.
  * It will return one string concatenating all the strings.
  * If we have to concatenate literal in between then we have to use `lit` function.
* Case Conversion and Length
  * Convert all the alphabetic characters in a string to **uppercase** - `upper`
  * Convert all the alphabetic characters in a string to **lowercase** - `lower`
  * Convert first character in a string to **uppercase** - `initcap`
  * Get **number of characters in a string** - `length`
  * All the 4 functions take column type argument.

* If we are processing **variable length columns** with **delimiter** then we use `split` to extract the information.
* Here are some of the examples for **variable length columns** and the use cases for which we typically extract information.
* Address where we store House Number, Street Name, City, State and Zip Code comma separated. We might want to extract City and State for demographics reports.
* `split` takes 2 arguments, **column** and **delimiter**.
* `split` convert each string into array and we can access the elements using index.
* We can also use `explode` in conjunction with `split` to explode the list or array into records in Data Frame. It can be used in cases such as word count, phone count etc. 

* We typically pad characters to build fixed length values or records.
* Fixed length values or records are extensively used in Mainframes based systems.
* Length of each and every field in fixed length records is predetermined and if the value of the field is less than the predetermined length then we pad with a standard character.
* In terms of numeric fields we pad with zero on the leading or left side. For non numeric fields, we pad with some standard character on leading or trailing side.
* We use `lpad` to pad a string with a specific character on leading or left side and `rpad` to pad on trailing or right side.
* Both lpad and rpad, take 3 arguments - column or expression, desired length and the character need to be padded.

* We typically use trimming to remove unnecessary characters from fixed length records.
* Fixed length records are extensively used in Mainframes and we might have to process it using Spark.
* As part of processing we might want to remove leading or trailing characters such as 0 in case of numeric types and space or some standard character in case of alphanumeric types.
* As of now Spark trim functions take the column as argument and remove leading or trailing spaces. However, we can use `expr` or `selectExpr` to use Spark SQL based trim functions to remove leading or trailing spaces or any other such characters.
  * Trim spaces towards left - `ltrim`
  * Trim spaces towards right - `rtrim`
  * Trim spaces on both sides - `trim`
  
  * We can use `current_date` to get today’s server date. 
  * Date will be returned using **yyyy-MM-dd** format.
* We can use `current_timestamp` to get current server time. 
  * Timestamp will be returned using **yyyy-MM-dd HH:mm:ss:SSS** format.
  * Hours will be by default in 24 hour format.
  
  * Adding days to a date or timestamp - `date_add`
* Subtracting days from a date or timestamp - `date_sub`
* Getting difference between 2 dates or timestamps - `datediff`
* Getting the number of months between 2 dates or timestamps - `months_between`
* Adding months to a date or timestamp - `add_months`
* Getting next day from a given date - `next_day`
* All the functions are self explanatory. We can apply these on standard date or timestamp. All the functions return date even when applied on timestamp field.
## Date and Time Extract Functions
Let us get an overview about Date and Time extract functions. Here are the extract functions that are useful which are self explanatory.
* `year`
* `month`
* `weekofyear`
* `dayofyear`
* `dayofmonth`
* `dayofweek`
* `hour`
* `minute`
* `second`

* It is an integer and started from January 1st 1970 Midnight UTC.
* Beginning time is also known as epoch and is incremented by 1 every second.
* We can convert Unix Timestamp to regular date or timestamp and vice versa.
* We can use `unix_timestamp` to convert regular date or timestamp to a unix timestamp value. For example `unix_timestamp(lit("2019-11-19 00:00:00"))`
* We can use `from_unixtime` to convert unix timestamp to regular date or timestamp. For example `from_unixtime(lit(1574101800))`
* We can also pass format to both the functions.

## Dealing with Nulls

Let us understand how to deal with nulls using functions that are available in Spark.
* We can use `coalesce` to return first non null value.
* We also have traditional SQL style functions such as `nvl`. However, they can be used either with `expr` or `selectExpr`.
* `CASE` and `WHEN` is typically used to apply transformations based up on conditions. We can use `CASE` and `WHEN` similar to SQL using `expr` or `selectExpr`.
* If we want to use APIs, Spark provides functions such as `when` and `otherwise`. `when` is available as part of `pyspark.sql.functions`. On top of column type that is generated using `when` we should be able to invoke `otherwise`.

## Conclusion

As part of this module we have gone through list of functions that can be applied on top of columns for row level transformations.

* There are approximately 300 pre-defined functions.
* Functions can be broadly categorized into String Manipulation Functions, Date Manipulation Functions, Numeric Functions etc.
* Typically when we read data from source, we get data in the form of strings and we need to apply functions to apply standardization rules, data type conversion, transformation rules etc.
* Most of these functions can be used while projection using `select`, `selectExpr`, `withColumn` etc as well as part of `filter` or `where`, `groupBy`, `orderBy` or `sort` etc.
* For `selectExpr` we need to use the functions using SQL Style syntax.
* There are special functions such as `col` and `lit`. `col` is used to pass column names as column type for some of the functions while `lit` is used to pass literals as values as part of expressions (eg: `concat($"first_name", lit(", "), $"last_name")`).

In [0]:
## creating the dataframe from list 
a=[('x',)]
df = spark.createDataFrame(a, "dummy STRING")
df.show()

spark.createDataFrame(rdd, "a: string, b: int").collect()

+-----+
|dummy|
+-----+
|    x|
+-----+



In [0]:
## creating dataframe from rdd
rdd=[(1,2,3)]
## ddl way of writing the schema ie scala datatype as follows 
# spark.createDataFrame(rdd, "a: string, b: int, c:int").collect()
df=spark.createDataFrame(rdd, "a: string, b: int, c:int")
df.printSchema
df.collect()


Out[13]: [Row(a='1', b=2, c=3)]

In [0]:
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
## creating dataframe from rdd of list of tuple 
rdd=[(1,2,3),(2,4,6),(5,4,8)]
rdd1=sc.parallelize(rdd,5)
print(type(rdd1))
## ddl way of writing the schema ie scala datatype as follows 
# spark.createDataFrame(rdd, "a: string, b: int, c:int").collect()
# note: by giveing the dataTypeschema 
# way 1 : here we used list as rdd is list 
aSchema=StructType([StructField("a",IntegerType(),True),StructField("b",IntegerType(),True),StructField("c",IntegerType(),True)])
df=spark.createDataFrame(rdd1,schema=aSchema)
df.printSchema
df.collect()
df.show()
print(type(rdd1))


<class 'pyspark.rdd.RDD'>
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  4|  6|
|  5|  4|  8|
+---+---+---+

<class 'pyspark.rdd.RDD'>


In [0]:
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
## creating dataframe from rdd
rdd=(1,2,3)
print(type(rdd))
# ## ddl way of writing the schema ie scala datatype as follows 
# # spark.createDataFrame(rdd, "a: string, b: int, c:int").collect()
# # note: by giveing the dataTypeschema 
# # way 1 : here we used list as rdd is list 
# # aSchema=StructType(((StructField("a",IntegerType(),True)),StructField("b",IntegerType(),True),StructField("a",IntegerType(),True))
# df=spark.createDataFrame(rdd,schema=aSchema)
# df.printSchema
# df.collect()

[0;36m  File [0;32m"<command-2510707286765948>"[0;36m, line [0;32m10[0m
[0;31m    df=spark.createDataFrame(rdd,schema=aSchema)[0m
[0m    ^[0m
[0;31mSyntaxError[0m[0;31m:[0m invalid syntax


In [0]:
## creating dataframe from list
rdd=[(1,2,3)]
spark.createDataFrame(rdd, "a: string, b: int, c:int").collect()

In [0]:
## creating dataframe from list 
l = [('Alice', 1)]
df=spark.createDataFrame(l)
df
## here data frame is 

Out[14]: DataFrame[_1: string, _2: bigint]

In [0]:
## creating dataframe from dictionary  key is taken as column and value as rows 
data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True},
        {"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False},
        {"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True}
        ]
df = spark.createDataFrame(data)
df.collect()

Out[16]: [Row(Category='A', ID=1, Truth=True, Value=121.44),
 Row(Category='B', ID=2, Truth=False, Value=300.01),
 Row(Category='C', ID=3, Truth=None, Value=10.99),
 Row(Category='E', ID=4, Truth=True, Value=33.87)]

In [0]:
## create dataframe from rdd 
a=sc.parallelize([1,2,3,4,5],5)
# a.collect()
## collect() will collect the data in rdd to driver
a.glom().collect()
## note: glom will give the data how it partioned always uesd with collect method 
# a.collect()
# a.getNumPartitions()
## note : getNumPartiton give the no of partitions in rdd 
# df=spark.createDataFrame(a)
##  createDataFrame(rdd,"<schema>") will create data frame from rdd by programtically ddl way 
df_fromRdd=a.toDF("a int")
## rdd have toDf(*colms) method to convert to dataFrame 

RDD (Resilient Distributed Dataset)
Terminologies

    RDD stands for Resilient Distributed Dataset, these are the elements that run and operate on multiple nodes to do parallel processing on a cluster.

RDDs are...

    immutable
    fault tolerant / automatic recovery
    can apply multiple ops on RDDs

RDD operation are...

    Transformation
    Action

Basic Operations (Ops)

    count(): Number of elements in the RDD is returned.
    collect(): All the elements in the RDD are returned.
    foreach(f): input callable, and returns only those elements which meet the condition of the function inside foreach.
    filter(f): input callable, and returns new RDDs containing the elements which satisfy the given callable
    map(f, preservesPartitioning = False): A new RDD is returned by applying a function to each element in the RDD
    reduce(f): After performing the specified commutative and associative binary operation, the element in the RDD is returned.
    join(other, numPartitions = None): It returns RDD with a pair of elements with the matching keys and all the values for that particular key.
    cache(): Persist this RDD with the default storage level (MEMORY_ONLY). You can also check if the RDD is cached or not

Narrow transformations

    map()
    filter()
    flatMap(j
    distinct()

Wide (Broad) transformations

    reduce()
    groupby()
    sortBy()
    join()

Actions

    count()
    take()
    takeOrdered()
    top()
    collect()
    saveAsTextFile()
    first()
    reduce()
    fold()
    aggregate()
    foreach()

Dictionary functions

    keys()
    values()
    keyBy()

Functional transformations

    mapValues()
    flatMapValues()

Grouping, sorting and aggregation

    groupByKey()
    reduceByKey()
    foldByKey()
    sortByKey()

Joins

    join()
    leftOuterJoin()
    rightOuterJoin()
    fullOuterJoin()
    cogroup()
    cartesian()

Set operations

    union()
    intersection()
    subtract()
    subtractByKey()

Numeric RDD

    min()
    max()
    sum()
    mean()
    stdev()
    variance()

Lambda Function

    A lambda function is a small anonymous function.
    A lambda function can take any number of arguments, but can only have one expression.

In [0]:
# help(spark.createDataFrame)

Help on method createDataFrame in module pyspark.sql.session:

createDataFrame(data: Union[pyspark.rdd.RDD[Any], Iterable[Any], ForwardRef('PandasDataFrameLike'), ForwardRef('ArrayLike')], schema: Union[pyspark.sql.types.AtomicType, pyspark.sql.types.StructType, str, NoneType] = None, samplingRatio: Optional[float] = None, verifySchema: bool = True) -> pyspark.sql.dataframe.DataFrame method of pyspark.sql.session.SparkSession instance
    Creates a :class:`DataFrame` from an :class:`RDD`, a list, a :class:`pandas.DataFrame`
    or a :class:`numpy.ndarray`.
    
    When ``schema`` is a list of column names, the type of each column
    will be inferred from ``data``.
    
    When ``schema`` is ``None``, it will try to infer the schema (column names and types)
    from ``data``, which should be an RDD of either :class:`Row`,
    :class:`namedtuple`, or :class:`dict`.
    
    When ``schema`` is :class:`pyspark.sql.types.DataType` or a datatype string, it must match
    the real data, or

In [0]:
l = [('X', )]
df = spark.createDataFrame(l, "dummy STRING")
df.printSchema()

In [0]:
## creating dataframe from dictionary  key is taken as column and value as rows 
data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True},
        {"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False},
        {"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True}
        ]
df = spark.createDataFrame(data)
df.columns
# to get columns 

Out[86]: ['Category', 'ID', 'Truth', 'Value']

In [0]:
#  show current date from data frame 
#  * Projection - `select` or `withColumn` or `drop` or `selectExpr`
from pyspark.sql.functions import current_date
from pyspark.sql.functions import col,lit
# note:select(*cols: 'ColumnOrName') -> 'DataFrame'
df.show()
#  sql like manupulation select ,where ,join 
df1=df.select('Category')
df1.show()

+--------+---+-----+------+
|Category| ID|Truth| Value|
+--------+---+-----+------+
|       A|  1| true|121.44|
|       B|  2|false|300.01|
|       C|  3| null| 10.99|
|       E|  4| true| 33.87|
+--------+---+-----+------+

+--------+
|Category|
+--------+
|       A|
|       B|
|       C|
|       E|
+--------+



In [0]:
df.printSchema()

root
 |-- Category: string (nullable = true)
 |-- ID: long (nullable = true)
 |-- Truth: boolean (nullable = true)
 |-- Value: double (nullable = true)



In [0]:
help(df.select)

Help on method select in module pyspark.sql.dataframe:

select(*cols: 'ColumnOrName') -> 'DataFrame' method of pyspark.sql.dataframe.DataFrame instance
    Projects a set of expressions and returns a new :class:`DataFrame`.
    
    .. versionadded:: 1.3.0
    
    Parameters
    ----------
    cols : str, :class:`Column`, or list
        column names (string) or expressions (:class:`Column`).
        If one of the column names is '*', that column is expanded to include all columns
        in the current :class:`DataFrame`.
    
    Examples
    --------
    >>> df.select('*').collect()
    [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    >>> df.select('name', 'age').collect()
    [Row(name='Alice', age=2), Row(name='Bob', age=5)]
    >>> df.select(df.name, (df.age + 10).alias('age')).collect()
    [Row(name='Alice', age=12), Row(name='Bob', age=15)]



In [0]:
## user defined function in pyspark  defining function first 
from pyspark.sql.types import *
from pyspark.sql.functions import udf

#  creating Initcap function 
def Initcap(name):
    name=str(name)
    return name.capitalize()
# registering inticap function with spark and stating its return type 
## udf(<functionName>,<ReturnType>)
Initcap_udf = udf(Initcap, StringType())
## with column is used to replacing existing column with new column 
df.withColumn('Category', Initcap_udf(df['Category'])).show()

+--------+---+-----+------+
|Category| ID|Truth| Value|
+--------+---+-----+------+
|       A|  1| true|121.44|
|       B|  2|false|300.01|
|       C|  3| null| 10.99|
|       E|  4| true| 33.87|
+--------+---+-----+------+



In [0]:
dir(spark.read)
# note spark.read have json,table,txt,jdbc,load,parquet,orc
# note pyspark.sql.function have multiple functions aprox 300 functions 

Out[93]: ['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_df',
 '_jreader',
 '_set_opts',
 '_spark',
 'csv',
 'format',
 'jdbc',
 'json',
 'load',
 'option',
 'options',
 'orc',
 'parquet',
 'schema',
 'table',
 'text']

In [0]:
## create rdd from list,tuple,csv,excel,table , html,text ,reading multiple textFile 
## create dataframe from list,tuple,csv,excel,table,html,textFile
## dataFrame manupulation operations , agg , grouping,joining , merging , filling na ,where 
## rdd manupulatipon operations 
## spark structural streaming 
## spark sql ( just revise, windowing and frames questions)

In [0]:
%fs ls


In [0]:
dbutils.fs.ls 

[0;36m  File [0;32m"<command-711641036960204>"[0;36m, line [0;32m1[0m
[0;31m    dbutils.fs.ls /[0m
[0m                   ^[0m
[0;31mSyntaxError[0m[0;31m:[0m invalid syntax


In [0]:
## read data in spark order csv 
# File location and type
file_location = "/FileStore/tables/orders.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)
df.printSchema
## pandas way 
# df2=df.withColumnRenamed({"_c0":"order_id","_c1":"order_date","_c2":"order_customer_id","_c3":"order_status"})
## renaming multiple columns 
df2=df.withColumnRenamed("_c0","order_id").withColumnRenamed("_c1","order_date").withColumnRenamed("_c2","order_customer_id").withColumnRenamed("_c3","order_status")
display(df2)

_c0,_c1,_c2,_c3
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT


order_id,order_date,order_customer_id,order_status
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT


In [0]:
## or using short cut method but it will not good for all the file 
## scla ddl schema 
ordersdf = spark.read.csv(
    "/FileStore/tables/orders.csv",
    schema='order_id INT, order_date STRING, order_customer_id INT, order_status STRING')
ordersdf.printSchema()
ordersdf.show()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:0

In [0]:
## projection with new column  changing time fame and added
#  function used date_format ,.alias,.select
from pyspark.sql.functions import date_format
ordersdf.select("*",date_format("order_date",'yyyyMM').alias("orderMonth")).show()
# note: select is projection function dont change the dataframe permanently

+--------+--------------------+-----------------+---------------+----------+
|order_id|          order_date|order_customer_id|   order_status|orderMonth|
+--------+--------------------+-----------------+---------------+----------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|    201307|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|    201307|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|    201307|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|    201307|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|    201307|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|    201307|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|    201307|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|    201307|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|    201307|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|    201307|

In [0]:
## filter the order with year 2013  and month 08
from pyspark.sql.functions import filter
ordersdf.printSchema()
ordersdf.filter(date_format("order_date","yyyyMM")==201308).show()
# dont use select with filter

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

Out[32]: DataFrame[order_id: int, order_date: string, order_customer_id: int, order_status: string]

In [0]:
# Function as part of groupBy
# count the no of orders by orderMonth 
ordersdf.groupBy(date_format("order_date","yyyyMM").alias("dateMonth")).count().show()

+---------+-----+
|dateMonth|count|
+---------+-----+
|   201401| 5908|
|   201405| 5467|
|   201312| 5892|
|   201310| 5335|
|   201311| 6381|
|   201307| 1533|
|   201407| 4468|
|   201403| 5778|
|   201404| 5657|
|   201402| 5635|
|   201309| 5841|
|   201406| 5308|
|   201308| 5680|
+---------+-----+



In [0]:
## create data fram from row type list  give schema 
employees = [
    (1, "Scott", "Tiger", 1000.0, 
      "united states", "+1 123 456 7890", "123 45 6789"
    ),
     (2, "Henry", "Ford", 1250.0, 
      "India", "+91 234 567 8901", "456 78 9123"
     ),
     (3, "Nick", "Junior", 750.0, 
      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
     ),
     (4, "Bill", "Gomes", 1500.0, 
      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
     )
]

len(employees)
## here 4 row/tuple 
employeesdf=spark.createDataFrame(employees,schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING,
                    phone_number STRING, ssn STRING""")
employeesdf.printSchema()
employeesdf.show(truncate=False)

root
 |-- employee_id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- nationality: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- ssn: string (nullable = true)

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|nationality   |phone_number    |ssn        |
+-----------+----------+---------+------+--------------+----------------+-----------+
|1          |Scott     |Tiger    |1000.0|united states |+1 123 456 7890 |123 45 6789|
|2          |Henry     |Ford     |1250.0|India         |+91 234 567 8901|456 78 9123|
|3          |Nick      |Junior   |750.0 |united KINGDOM|+44 111 111 1111|222 33 4444|
|4          |Bill      |Gomes    |1500.0|AUSTRALIA     |+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [0]:
#note: For Data Frame APIs such as `select`, `groupBy`, `orderBy` etc we can pass column names as strings.
#  note : when to use col ,lit ?
# * If there are no transformations on any column in any function then we should be able to pass all column names as strings.
# * If not we need to pass all columns as type column by using col function.
# * If we want to apply transformations using some of the functions then passing column names as strings will not suffice. We have to pass them as column type.
# note: when we do transformation on column we need to pass columns with col
# * `col` is the function which will convert column name from string type to **Column** type. We can also refer column names as **Column** type using Data Frame name.
# not that much useful in databricks 

In [0]:
from pyspark.sql.functions import col
employeesdf.select(col("employee_id"),col("first_name"),col("last_name"),col("salary")).show()

+-----------+----------+---------+------+
|employee_id|first_name|last_name|salary|
+-----------+----------+---------+------+
|          1|     Scott|    Tiger|1000.0|
|          2|     Henry|     Ford|1250.0|
|          3|      Nick|   Junior| 750.0|
|          4|      Bill|    Gomes|1500.0|
+-----------+----------+---------+------+



In [0]:
# example for col why use with transformation 
# here we get error 
from pyspark.sql.functions import upper,lower
employeesdf. \
    select(upper("first_name"), upper("last_name")). \
    show()

+-----------------+----------------+
|upper(first_name)|upper(last_name)|
+-----------------+----------------+
|            SCOTT|           TIGER|
|            HENRY|            FORD|
|             NICK|          JUNIOR|
|             BILL|           GOMES|
+-----------------+----------------+



In [0]:
employeesdf. \
    groupBy(upper(col("nationality"))). \
    count(). \
    show()

+------------------+-----+
|upper(nationality)|count|
+------------------+-----+
|     UNITED STATES|    1|
|             INDIA|    1|
|    UNITED KINGDOM|    1|
|         AUSTRALIA|    1|
+------------------+-----+



In [0]:
# This will fail as the function desc is available only on column type.
## note here we can see use of col function 
employeesdf. \
    orderBy("employee_id".desc()). \
    show()

[0;31m---------------------------------------------------------------------------[0m
[0;31mAttributeError[0m                            Traceback (most recent call last)
[0;32m<command-711641036960237>[0m in [0;36m<cell line: 2>[0;34m()[0m
[1;32m      1[0m [0;31m# This will fail as the function desc is available only on column type.[0m[0;34m[0m[0;34m[0m[0;34m[0m[0m
[1;32m      2[0m [0memployeesdf[0m[0;34m.[0m[0;31m [0m[0;31m\[0m[0;34m[0m[0;34m[0m[0m
[0;32m----> 3[0;31m     [0morderBy[0m[0;34m([0m[0;34m"employee_id"[0m[0;34m.[0m[0mdesc[0m[0;34m([0m[0;34m)[0m[0;34m)[0m[0;34m.[0m[0;31m [0m[0;31m\[0m[0;34m[0m[0;34m[0m[0m
[0m[1;32m      4[0m     [0mshow[0m[0;34m([0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m

[0;31mAttributeError[0m: 'str' object has no attribute 'desc'

In [0]:
# This will fail as the function desc is available only on column type.
## note here we can see use of col function 
employeesdf. \
    orderBy(col("employee_id").desc()). \
    show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [0]:
# Alternative - we can also refer column names using Data Frame like this ie pandas style 
employeesdf. \
    orderBy(upper(employeesdf['first_name']).alias('first_name')). \
    show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [0]:
# Alternative - we can also refer column names using Data Frame like this pandas style 
employeesdf. \
    orderBy(upper(employeesdf.first_name).alias('first_name')). \
    show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [0]:
# Q concat first name and last name as full name
# lit is use for literal string pass as columns 
from pyspark.sql.functions import concat ,lit
employeesdf.select("*",concat(col("first_name"),lit(":"),col("last_name")).alias("fullName")).show()

+-----------+----------+---------+------+--------------+----------------+-----------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|   fullName|
+-----------+----------+---------+------+--------------+----------------+-----------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|Scott:Tiger|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123| Henry:Ford|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|Nick:Junior|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118| Bill:Gomes|
+-----------+----------+---------+------+--------------+----------------+-----------+-----------+



In [0]:
## apply case conversion function initcap,lower,upper, 
from pyspark.sql.functions import concat,lit,upper,initcap 
employeesdf.select("*",concat("first_name",lit(":"),"last_name").alias("fullName"))\
.select("*",initcap(col("fullName")).alias("fullNameInitcap"))\
.show()

+-----------+----------+---------+------+--------------+----------------+-----------+-----------+---------------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|   fullName|fullNameInitcap|
+-----------+----------+---------+------+--------------+----------------+-----------+-----------+---------------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|Scott:Tiger|    Scott:tiger|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123| Henry:Ford|     Henry:ford|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|Nick:Junior|    Nick:junior|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118| Bill:Gomes|     Bill:gomes|
+-----------+----------+---------+------+--------------+----------------+-----------+-----------+---------------+



In [0]:
## substring functions 
## for this creating new DataFrame 
## 
employees = [(1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                ]
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING,
                    phone_number STRING, ssn STRING"""
                   )

In [0]:
employeesDF.show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [0]:
#* Extract last 4 digits from the phone number.
# * Extract last 4 digits from SSN.
# substring(<colname>,postion,noofLength)
from pyspark.sql.functions import substring,col
employeesDF.select("*",substring("ssn",-4,4).alias("ssnlast4"),substring("phone_number",-4,4).alias("phone_numberlast4digit")).show()

+-----------+----------+---------+------+--------------+----------------+-----------+--------+----------------------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|ssnlast4|phone_numberlast4digit|
+-----------+----------+---------+------+--------------+----------------+-----------+--------+----------------------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|    6789|                  7890|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|    9123|                  8901|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|    4444|                  1111|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|    6118|                  3210|
+-----------+----------+---------+------+--------------+----------------+-----------+--------+----------------------+



In [0]:
#  note use of split,explode ,creating new dataFrame 
from pyspark.sql.functions import split, explode, lit
## creating new dataFrame 

In [0]:
employees = [(1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890,+1 234 567 8901", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111,+44 222 222 2222", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210,+61 876 543 2109", "789 12 6118"
                     )
                ]
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING,
                    phone_numbers STRING, ssn STRING"""
                   )
employeesDF.columns
employeesDF.show()

+-----------+----------+---------+------+--------------+--------------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|       phone_numbers|        ssn|
+-----------+----------+---------+------+--------------+--------------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|    +91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210,...|789 12 6118|
+-----------+----------+---------+------+--------------+--------------------+-----------+



In [0]:
# q phone _number columns have multiple phone no  show them in proper format 
## note split will create array ie list 
from pyspark.sql.functions import split,substring,explode
employeesDF.select("*",split(col("phone_numbers"),",").alias("phone_no_proper")).show()
employeesDF.select("*",explode(split(col("phone_numbers"),",")).alias("phone_no_proper")).show()

+-----------+----------+---------+------+--------------+--------------------+-----------+--------------------+
|employee_id|first_name|last_name|salary|   nationality|       phone_numbers|        ssn|     phone_no_proper|
+-----------+----------+---------+------+--------------+--------------------+-----------+--------------------+
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789|[+1 123 456 7890,...|
|          2|     Henry|     Ford|1250.0|         India|    +91 234 567 8901|456 78 9123|  [+91 234 567 8901]|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|[+44 111 111 1111...|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210,...|789 12 6118|[+61 987 654 3210...|
+-----------+----------+---------+------+--------------+--------------------+-----------+--------------------+

+-----------+----------+---------+------+--------------+--------------------+-----------+----------------+
|emp

In [0]:
# Q count the distinct phone no by employee_id
employeesDF_phoneProper=employeesDF.withColumn("phone_no_proper",explode(split(col("phone_numbers"),",")))
#  note withcolumn take all values from dataframe and change the datframe temperery like select
#  withColumn("<nametoGivenToDataFrame>",<transformation>)


employeesDF_phoneProper.show()
employeesDF_phoneProper.select("employee_id","phone_no_proper").groupby("phone_no_proper").count().show()

+-----------+----------+---------+------+--------------+--------------------+-----------+----------------+
|employee_id|first_name|last_name|salary|   nationality|       phone_numbers|        ssn| phone_no_proper|
+-----------+----------+---------+------+--------------+--------------------+-----------+----------------+
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789| +1 123 456 7890|
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789| +1 234 567 8901|
|          2|     Henry|     Ford|1250.0|         India|    +91 234 567 8901|456 78 9123|+91 234 567 8901|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|+44 111 111 1111|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|+44 222 222 2222|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210,...|789 12 6118|+61 987 654 3210|
|          4|      Bill|    Gomes|150

In [0]:
# Q pad phone from left side 3 digits , ssn by 4 digits
#  lpad(<col>,<length to be shown >,<charToPad>) === this add padding to data not mask the data 
from pyspark.sql.functions import lpad, rpad, concat
employeesDF_phoneProper.select("*",lpad(col("phone_no_proper"),20,'>').alias("phone_no_proper_padded")).show()

+-----------+----------+---------+------+--------------+--------------------+-----------+----------------+----------------------+
|employee_id|first_name|last_name|salary|   nationality|       phone_numbers|        ssn| phone_no_proper|phone_no_proper_padded|
+-----------+----------+---------+------+--------------+--------------------+-----------+----------------+----------------------+
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789| +1 123 456 7890|  >>>>>+1 123 456 7890|
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789| +1 234 567 8901|  >>>>>+1 234 567 8901|
|          2|     Henry|     Ford|1250.0|         India|    +91 234 567 8901|456 78 9123|+91 234 567 8901|  >>>>+91 234 567 8901|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|+44 111 111 1111|  >>>>+44 111 111 1111|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|+

In [0]:
##trim functions take the column as argument and remove leading or trailing spaces
# ltrim = trim specific char 
# rtrim === trim specific char
# trim  === only white space 
#
# question remove white space from left side in phone_no_proper column

# trim(str) - Removes the leading and trailing space characters from `str`.

#trim(BOTH trimStr FROM str) - Remove the leading and trailing `trimStr` characters from `str`

#trim(LEADING trimStr FROM str) - Remove the leading `trimStr` characters from `str`

#trim(TRAILING trimStr FROM str) - Remove the trailing `trimStr` characters from `str`

from pyspark.sql.functions import ltrim,rtrim 
employeesDF_phoneProper.select("*",ltrim((col("nationality")))).show()

+-----------+----------+---------+------+--------------+--------------------+-----------+----------------+------------------+
|employee_id|first_name|last_name|salary|   nationality|       phone_numbers|        ssn| phone_no_proper|ltrim(nationality)|
+-----------+----------+---------+------+--------------+--------------------+-----------+----------------+------------------+
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789| +1 123 456 7890|     united states|
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789| +1 234 567 8901|     united states|
|          2|     Henry|     Ford|1250.0|         India|    +91 234 567 8901|456 78 9123|+91 234 567 8901|             India|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|+44 111 111 1111|    united KINGDOM|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|+44 222 222 2222|    united K

In [0]:
l = [("   Hello.    ",) ]
df = spark.createDataFrame(l).toDF("dummy")
df.show()

+-------------+
|        dummy|
+-------------+
|   Hello.    |
+-------------+



In [0]:
from pyspark.sql.functions import col, ltrim, rtrim, trim
df.withColumn("ltrimcolumn", ltrim(col("dummy"))). \
  withColumn("rtrimcolumn", rtrim(col("dummy"))). \
  withColumn("trimcolumn", trim(col("dummy"))). \
  show()

+-------------+----------+---------+------+
|        dummy|     ltrim|    rtrim|  trim|
+-------------+----------+---------+------+
|   Hello.    |Hello.    |   Hello.|Hello.|
+-------------+----------+---------+------+



In [0]:
# spark.sql('DESCRIBE FUNCTION trim').show(truncate=False)

# note: PySpark expr() is a SQL function to execute SQL-like expressions 
# and to use an existing DataFrame column value as an expression argument to Pyspark built-in functions. 

In [0]:
# if we do not specify trimStr, it will be defaulted to space
from pyspark.sql.functions import expr
df.withColumn("ltrim", expr("ltrim(dummy)")). \
  withColumn("rtrim", expr("rtrim('.', rtrim(dummy))")). \
  withColumn("trim", trim(col("dummy"))). \
  show()

+-------------+----------+--------+------+
|        dummy|     ltrim|   rtrim|  trim|
+-------------+----------+--------+------+
|   Hello.    |Hello.    |   Hello|Hello.|
+-------------+----------+--------+------+



In [0]:
#* We can convert a string which contain date or timestamp 
# in non-standard format to standard date or time using `to_date` or `to_timestamp` function respectively.
# to_date === convert to date format
#  to_timestamp  == convert to timestamp format 
from pyspark.sql.functions import lit, to_date, to_timestamp
df.select(to_date(lit('20210228'), 'yyyyMMdd').alias('convertedtodateFormat')).show()

+---------------------+
|convertedtodateFormat|
+---------------------+
|           2021-02-28|
+---------------------+



In [0]:
datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
                     ("2016-02-29", "2016-02-29 08:08:08.999"),
                     ("2017-10-31", "2017-12-31 11:59:59.123"),
                     ("2019-11-30", "2019-08-31 00:00:00.000")
                ]
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")

In [0]:
datetimesDF.show(truncate=False)

+----------+-----------------------+
|date      |time                   |
+----------+-----------------------+
|2014-02-28|2014-02-28 10:00:00.123|
|2016-02-29|2016-02-29 08:08:08.999|
|2017-10-31|2017-12-31 11:59:59.123|
|2019-11-30|2019-08-31 00:00:00.000|
+----------+-----------------------+



In [0]:
# * Add 10 days to both date and time values.
# * Subtract 10 days from both date and time values.
from pyspark.sql.functions import date_add,date_sub
datetimesDF.withColumn("datE+10",date_add(col("date"),10)).show()

datetimesDF. \
    withColumn("date_add_date", date_add("date", 10)). \
    withColumn("date_add_time", date_add("time", 10)). \
    withColumn("date_sub_date", date_sub("date", 10)). \
    withColumn("date_sub_time", date_sub("time", 10)). \
    show()

+----------+--------------------+----------+
|      date|                time|   datE+10|
+----------+--------------------+----------+
|2014-02-28|2014-02-28 10:00:...|2014-03-10|
|2016-02-29|2016-02-29 08:08:...|2016-03-10|
|2017-10-31|2017-12-31 11:59:...|2017-11-10|
|2019-11-30|2019-08-31 00:00:...|2019-12-10|
+----------+--------------------+----------+



In [0]:
#curent_date === gives current date
# date_add(col<colName>,<no>)
# date_sub(col<colName>,<no>)
#current_timestap ===gives current timestamp
# month_between === gives month_between 
## add_month --- add months to date
## round ---rounding off 
## datediff-- difference in days 
## trunc --- trunc(<timestamp col>,<Year/Month>) function to truncate Date at Year and Month units not support day  ie give begning of month or year 
#  trunc returen only whats required ie trunc("1993-02-26","year") o/p = ("1993-01-01")
## date_trunc ---date_trunc(<format eg 'yyyy'>,<timestamp col>) ===> begning of month or year
#date_trunc("HOUR", "date") === will give beging of hour for perticular date 
from pyspark.sql.functions import current_date, current_timestamp, datediff,months_between, add_months, round,date_trunc
datetimesDF. \
    withColumn("datediff_date", datediff(current_date(), "date")). \
    withColumn("datediff_time", datediff(current_timestamp(), "time")). \
    show()
datetimesDF. \
    withColumn("date_dt",col("date")).\
    withColumn("date_dt", date_trunc("HOUR", "date")). \
    withColumn("time_dt", date_trunc("HOUR", "time")). \
    withColumn("time_dt1", date_trunc("dd", "time")). \
    show(truncate=False)

+----------+--------------------+-------------+-------------+
|      date|                time|datediff_date|datediff_time|
+----------+--------------------+-------------+-------------+
|2014-02-28|2014-02-28 10:00:...|         3205|         3205|
|2016-02-29|2016-02-29 08:08:...|         2474|         2474|
|2017-10-31|2017-12-31 11:59:...|         1864|         1803|
|2019-11-30|2019-08-31 00:00:...|         1104|         1195|
+----------+--------------------+-------------+-------------+

+----------+-----------------------+-------------------+-------------------+-------------------+
|date      |time                   |date_dt            |time_dt            |time_dt1           |
+----------+-----------------------+-------------------+-------------------+-------------------+
|2014-02-28|2014-02-28 10:00:00.123|2014-02-28 00:00:00|2014-02-28 10:00:00|2014-02-28 00:00:00|
|2016-02-29|2016-02-29 08:08:08.999|2016-02-29 00:00:00|2016-02-29 08:00:00|2016-02-29 00:00:00|
|2017-10-31|2017-1

In [0]:
from pyspark.sql.functions import current_date, current_timestamp, datediff,months_between, add_months, round,date_trunc
datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
                     ("2016-02-29", "2016-02-29 08:08:08.999"),
                     ("2017-10-31", "2017-12-31 11:59:59.123"),
                     ("2019-11-30", "2019-08-31 00:00:00.000")
                ]
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")
datetimesDF. \
    withColumn("date_dt",col("date")).\
    withColumn("date_dt", date_trunc("HOUR", "date")). \
    withColumn("time_dt", date_trunc("HOUR", "time")). \
    withColumn("time_dt1", date_trunc("dd", "time")). \
    show(truncate=False)

+----------+-----------------------+-------------------+-------------------+-------------------+
|date      |time                   |date_dt            |time_dt            |time_dt1           |
+----------+-----------------------+-------------------+-------------------+-------------------+
|2014-02-28|2014-02-28 10:00:00.123|2014-02-28 00:00:00|2014-02-28 10:00:00|2014-02-28 00:00:00|
|2016-02-29|2016-02-29 08:08:08.999|2016-02-29 00:00:00|2016-02-29 08:00:00|2016-02-29 00:00:00|
|2017-10-31|2017-12-31 11:59:59.123|2017-10-31 00:00:00|2017-12-31 11:00:00|2017-12-31 00:00:00|
|2019-11-30|2019-08-31 00:00:00.000|2019-11-30 00:00:00|2019-08-31 00:00:00|2019-08-31 00:00:00|
+----------+-----------------------+-------------------+-------------------+-------------------+



In [0]:
# note:
#     year === gives year
#     month === gives month 
#     weekofyear == gives weekofYear
#     dayofMonth == gives day of month 

In [0]:

from pyspark.sql.functions import year, month, weekofyear, dayofmonth, \
    dayofyear, dayofweek, current_date,current_timestamp, hour, minute, second
l = [("X", )]
df = spark.createDataFrame(l).toDF("dummy")

df.select(
    current_date().alias('current_date'), 
    year(current_date()).alias('year'),
    month(current_date()).alias('month'),
    weekofyear(current_date()).alias('weekofyear'),
    dayofyear(current_date()).alias('dayofyear'),
    dayofmonth(current_date()).alias('dayofmonth'),
    dayofweek(current_date()).alias('dayofweek'),
    current_timestamp().alias('current_timestamp'), 
    year(current_timestamp()).alias('year'),
    month(current_timestamp()).alias('month'),
    hour(current_timestamp()).alias('hour'),
    minute(current_timestamp()).alias('minute'),
    second(current_timestamp()).alias('second')
).show()

+------------+----+-----+----------+---------+----------+---------+--------------------+----+-----+----+------+------+
|current_date|year|month|weekofyear|dayofyear|dayofmonth|dayofweek|   current_timestamp|year|month|hour|minute|second|
+------------+----+-----+----------+---------+----------+---------+--------------------+----+-----+----+------+------+
|  2022-12-08|2022|   12|        49|      342|         8|        5|2022-12-08 08:24:...|2022|   12|   8|    24|    17|
+------------+----+-----+----------+---------+----------+---------+--------------------+----+-----+----+------+------+



In [0]:
## conversion via to date , to timestamp ,casting to string etc ...
df.select(to_timestamp(lit('02-Mar-2021'), 'dd-MMM-yyyy').alias('to_date')).show()
df.select(to_timestamp(lit('02-Mar-2021 17:30:15'), 'dd-MMM-yyyy HH:mm:ss').alias('to_date')).show()



In [0]:
datetimes = [(20140228, "28-Feb-2014 10:00:00.123"),
                     (20160229, "20-Feb-2016 08:08:08.999"),
                     (20171031, "31-Dec-2017 11:59:59.123"),
                     (20191130, "31-Aug-2019 00:00:00.000")
                ]
    


# revisit with column

In [0]:
from pyspark.sql.functions import to_date,to_timestamp
datetimesDf = spark.createDataFrame(datetimes,schema="date BIGINT, time STRING")
datetimesDf.show()                            
datetimesDF. \
    withColumn("converted_to_date",to_date((col("date").cast('string'), 'yyyyMMdd')). \
    withColumn('converted_to_timestamp',to_timestamp(col('time'), 'dd-MMM-yyyy HH:mm:ss.SSS')). \
    show(truncate=False)

+--------+--------------------+
|    date|                time|
+--------+--------------------+
|20140228|28-Feb-2014 10:00...|
|20160229|20-Feb-2016 08:08...|
|20171031|31-Dec-2017 11:59...|
|20191130|31-Aug-2019 00:00...|
+--------+--------------------+

+----------+-----------------------+-----------------+----------------------+
|date      |time                   |converted_to_date|converted_to_timestamp|
+----------+-----------------------+-----------------+----------------------+
|2014-02-28|2014-02-28 10:00:00.123|null             |null                  |
|2016-02-29|2016-02-29 08:08:08.999|null             |null                  |
|2017-10-31|2017-12-31 11:59:59.123|null             |null                  |
|2019-11-30|2019-08-31 00:00:00.000|null             |null                  |
+----------+-----------------------+-----------------+----------------------+



In [0]:
# Note: date_format== convert date to required format 
#  .cast("String /..") === cast to data type 
from pyspark.sql.functions import date_format
datetimesDF. \
    withColumn("date_dt", date_format("date", "yyyyMMddHHmmss").cast('long')). \
    withColumn("date_ts", date_format("time", "yyyyMMddHHmmss").cast('long')). \
    show(truncate=False)

+----------+-----------------------+--------------+--------------+
|date      |time                   |date_dt       |date_ts       |
+----------+-----------------------+--------------+--------------+
|2014-02-28|2014-02-28 10:00:00.123|20140228000000|20140228100000|
|2016-02-29|2016-02-29 08:08:08.999|20160229000000|20160229080808|
|2017-10-31|2017-12-31 11:59:59.123|20171031000000|20171231115959|
|2019-11-30|2019-08-31 00:00:00.000|20191130000000|20190831000000|
+----------+-----------------------+--------------+--------------+



In [0]:
unixtimes = [(1393561800, ),
             (1456713488, ),
             (1514701799, ),
             (1567189800, )
            ]
unixtimesDF = spark.createDataFrame(unixtimes).toDF("unixtime")



In [0]:
from pyspark.sql.functions import from_unixtime
unixtimesDF. \
    withColumn("date", from_unixtime("unixtime", "yyyyMMdd")). \
    withColumn("time", from_unixtime("unixtime")). \
    show()
#yyyyMMdd

# note: expr(str: str) -> pyspark.sql.column.Colume. Parses the expression string into the column that it represents

+----------+--------+-------------------+
|  unixtime|    date|               time|
+----------+--------+-------------------+
|1393561800|20140228|2014-02-28 04:30:00|
|1456713488|20160229|2016-02-29 02:38:08|
|1514701799|20171231|2017-12-31 06:29:59|
|1567189800|20190830|2019-08-30 18:30:00|
+----------+--------+-------------------+



In [0]:
## nullif,null,coalsc

employees = [(1, "Scott", "Tiger", 1000.0, 10,
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, None,
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, '',
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 10,
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                ]

employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, bonus STRING, nationality STRING,
                    phone_number STRING, ssn STRING"""
                   )

from pyspark.sql.functions import lit,col,expr,coalesce
employeesDF. \
    withColumn('bonus1', coalesce('bonus', lit(0))). \
    show()

employeesDF. \
    withColumn('bonus', expr("nvl(nullif(bonus, ''), 0)")). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|bonus1|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|    10|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|     0|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|      |
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|    10|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+------+

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------

In [0]:
## using case when conditions are used with expr 
## use of case , when ,otherwise 
### note Imp : when(condition, value)
## use with withcolumn when((condition | condition),<value>).otherwise()
from pyspark.sql.functions import when
employeesDF. \
    withColumn(
        'bonus',
        when((col('bonus').isNull()) | (col('bonus') == lit('')), 0).otherwise(col('bonus'))
    ). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|    0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|    0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



In [0]:
employeesDF. \
    withColumn(
        'bonus', 
        expr("""
            CASE WHEN bonus IS NULL OR bonus = '' THEN 0
            ELSE bonus
            END
            """)
    ). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|    0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|    0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|   10|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



#####case when then used in expr in string format .
##### otherwise use

In [0]:
persons = [
    (1, 1),
    (2, 13),
    (3, 18),
    (4, 60),
    (5, 120),
    (6, 0),
    (7, 12),
    (8, 160)
]

personsDF = spark.createDataFrame(persons, schema='id INT, age INT')
personsDF. \
    withColumn(
        'category',
        expr("""
            CASE
            WHEN age BETWEEN 0 AND 2 THEN 'New Born'
            WHEN age > 2 AND age <= 12 THEN 'Infant'
            WHEN age > 12 AND age <= 48 THEN 'Toddler'
            WHEN age > 48 AND age <= 144 THEN 'Kid'
            ELSE 'Teenager or Adult'
            END
        """)
    ). \
    show()

+---+---+-----------------+
| id|age|         category|
+---+---+-----------------+
|  1|  1|         New Born|
|  2| 13|          Toddler|
|  3| 18|          Toddler|
|  4| 60|              Kid|
|  5|120|              Kid|
|  6|  0|         New Born|
|  7| 12|           Infant|
|  8|160|Teenager or Adult|
+---+---+-----------------+



In [0]:
help(Selectexpr)
# note: expr(str: str) -> pyspark.sql.column.Colume. Parses the expression string into the column that it represents

[0;31m---------------------------------------------------------------------------[0m
[0;31mNameError[0m                                 Traceback (most recent call last)
[0;32m<command-3463592921936972>[0m in [0;36m<cell line: 1>[0;34m()[0m
[0;32m----> 1[0;31m [0mhelp[0m[0;34m([0m[0mSelectexpr[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[1;32m      2[0m [0;31m# note: expr(str: str) -> pyspark.sql.column.Colume. Parses the expression string into the column that it represents[0m[0;34m[0m[0;34m[0m[0;34m[0m[0m

[0;31mNameError[0m: name 'Selectexpr' is not defined

In [0]:
# help(employeesDF.withColumn)

Help on method withColumn in module pyspark.sql.dataframe:

withColumn(colName: str, col: pyspark.sql.column.Column) -> 'DataFrame' method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` by adding a column or replacing the
    existing column that has the same name.
    
    The column expression must be an expression over this :class:`DataFrame`; attempting to add
    a column from some other :class:`DataFrame` will raise an error.
    
    .. versionadded:: 1.3.0
    
    Parameters
    ----------
    colName : str
        string, name of the new column.
    col : :class:`Column`
        a :class:`Column` expression for the new column.
    
    Notes
    -----
    This method introduces a projection internally. Therefore, calling it multiple
    times, for instance, via loops in order to add multiple columns can generate big
    plans which can cause performance issues and even `StackOverflowException`.
    To avoid this, use :func:`select` with the m