## Column Class | Operators & Functions
The `pyspark.sql.Column` class provides functions for working with DataFrame columns: manipulating column values, evaluating Boolean expressions to filter rows, retrieving a value or part of a value from a column, and working with list, map, and struct columns.

In [93]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import Row


spark = SparkSession.builder.appName("Column").getOrCreate()

### Create Column Class Object
A Column can be accessed from a DataFrame in several ways.

In [94]:
data=[("James", 'Male', 23),("Ann", 'Female', 40),("Mary", 'Female', 24)]
df=spark.createDataFrame(data).toDF("name.fname","gender", "Age")
df.printSchema()

root
 |-- name.fname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- Age: long (nullable = true)



In [95]:
df.select(df.gender).show()
df.select(df["gender"]).show()
#Accessing column name with dot (with backticks)
df.select(df["`name.fname`"]).show()

+------+
|gender|
+------+
|  Male|
|Female|
|Female|
+------+

+------+
|gender|
+------+
|  Male|
|Female|
|Female|
+------+

+----------+
|name.fname|
+----------+
|     James|
|       Ann|
|      Mary|
+----------+



In [96]:
# Using SQL col() function
df.select(col("gender")).show()
# Accessing column name with dot (with backticks)
df.select(col("`name.fname`")).show()

+------+
|gender|
+------+
|  Male|
|Female|
|Female|
+------+

+----------+
|name.fname|
+----------+
|     James|
|       Ann|
|      Mary|
+----------+



In [97]:
data=[Row(name="James",prop=Row(hair="black",eye="blue")),
      Row(name="Ann",prop=Row(hair="grey",eye="black"))]
df=spark.createDataFrame(data)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- prop: struct (nullable = true)
 |    |-- hair: string (nullable = true)
 |    |-- eye: string (nullable = true)

+-----+-------------+
| name|         prop|
+-----+-------------+
|James|{black, blue}|
|  Ann|{grey, black}|
+-----+-------------+



In [98]:
#Access struct column
df.select(df.prop.hair).show()
df.select(df["prop.hair"]).show()
df.select(col("prop.hair")).show()

+---------+
|prop.hair|
+---------+
|    black|
|     grey|
+---------+

+-----+
| hair|
+-----+
|black|
| grey|
+-----+

+-----+
| hair|
+-----+
|black|
| grey|
+-----+



In [99]:
# Access all columns from struct
df.select(col("prop.*")).show()

+-----+-----+
| hair|  eye|
+-----+-----+
|black| blue|
| grey|black|
+-----+-----+



### PySpark Column Operators
Arithmetic and comparison operations can be applied to columns using standard Python operators.

In [100]:
data=[(100,2,1),(200,3,4),(300,4,4)]
df=spark.createDataFrame(data).toDF("col1","col2","col3")
df.show()

#Arithmetic operations
df.select(df.col1 + df.col2).show()
df.select(df.col1 - df.col2).show() 
df.select(df.col1 * df.col2).show()
df.select(df.col1 / df.col2).show()
df.select(df.col1 % df.col2).show()

df.select(df.col2 > df.col3).show()
df.select(df.col2 < df.col3).show()
df.select(df.col2 == df.col3).show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
| 100|   2|   1|
| 200|   3|   4|
| 300|   4|   4|
+----+----+----+

+-------------+
|(col1 + col2)|
+-------------+
|          102|
|          203|
|          304|
+-------------+

+-------------+
|(col1 - col2)|
+-------------+
|           98|
|          197|
|          296|
+-------------+

+-------------+
|(col1 * col2)|
+-------------+
|          200|
|          600|
|         1200|
+-------------+

+-----------------+
|    (col1 / col2)|
+-----------------+
|             50.0|
|66.66666666666667|
|             75.0|
+-----------------+

+-------------+
|(col1 % col2)|
+-------------+
|            0|
|            2|
|            0|
+-------------+

+-------------+
|(col2 > col3)|
+-------------+
|         true|
|        false|
|        false|
+-------------+

+-------------+
|(col2 < col3)|
+-------------+
|        false|
|         true|
|        false|
+-------------+

+-------------+
|(col2 = col3)|
+-------------+
|        false|
|        false|
|         true|
+-------------+

### Column Functions Examples
Create a DataFrame to use in the examples that follow.

In [101]:
data=[("James","Bond","100",None),
      ("Ann","Varsa","200",'F'),
      ("Tom Cruise","XXX","400",''),
      ("Tom Brand",None,"400",'M')] 

columns = ["fname","lname","id","gender"]
df = spark.createDataFrame(data,columns)
df.show()

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|     James| Bond|100|  NULL|
|       Ann|Varsa|200|     F|
|Tom Cruise|  XXX|400|      |
| Tom Brand| NULL|400|     M|
+----------+-----+---+------+



#### alias() – Sets a name for the Column
`df.fname` refers to a Column object, and `alias()` is a Column function that gives the column an alternate name. Here, the `fname` column is renamed to `first_name` and `lname` to `last_name` in the output.

In [102]:
from pyspark.sql.functions import expr
df.select(df.fname.alias("first_name"), df.lname.alias("last_name")).show()

#Another example
df.select(expr(" fname ||','|| lname").alias("fullName")).show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     James|     Bond|
|       Ann|    Varsa|
|Tom Cruise|      XXX|
| Tom Brand|     NULL|
+----------+---------+

+--------------+
|      fullName|
+--------------+
|    James,Bond|
|     Ann,Varsa|
|Tom Cruise,XXX|
|          NULL|
+--------------+



#### asc() & desc() – Sort the DataFrame by a column in ascending or descending order

In [103]:
#asc() and desc() sort in ascending and descending order, respectively.
df.sort(df.fname.asc()).show()
df.sort(df.fname.desc()).show()

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|       Ann|Varsa|200|     F|
|     James| Bond|100|  NULL|
| Tom Brand| NULL|400|     M|
|Tom Cruise|  XXX|400|      |
+----------+-----+---+------+

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|Tom Cruise|  XXX|400|      |
| Tom Brand| NULL|400|     M|
|     James| Bond|100|  NULL|
|       Ann|Varsa|200|     F|
+----------+-----+---+------+



#### cast() & astype() – Convert the column's data type

In [104]:
df.printSchema()
df.select(df.fname,df.id.cast("int")).printSchema()

root
 |-- fname: string (nullable = true)
 |-- lname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)

root
 |-- fname: string (nullable = true)
 |-- id: integer (nullable = true)



#### between() – Returns a Boolean expression that is true when the column value lies between the lower and upper bounds (inclusive)

In [105]:
df.filter(df.id.between(100,300)).show()

+-----+-----+---+------+
|fname|lname| id|gender|
+-----+-----+---+------+
|James| Bond|100|  NULL|
|  Ann|Varsa|200|     F|
+-----+-----+---+------+



#### contains() – Checks whether a DataFrame column value contains the string specified in this function

In [106]:
df.filter(df.fname.contains("Cruise")).show()

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|Tom Cruise|  XXX|400|      |
+----------+-----+---+------+



#### startswith() & endswith() – Check whether the value of the DataFrame column starts or ends with a given string, respectively

In [107]:
df.filter(df.fname.startswith("T")).show()
df.filter(df.fname.endswith("Cruise")).show()

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|Tom Cruise|  XXX|400|      |
| Tom Brand| NULL|400|     M|
+----------+-----+---+------+

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|Tom Cruise|  XXX|400|      |
+----------+-----+---+------+



#### isNull() & isNotNull() – Check whether the DataFrame column has NULL or non-NULL values

In [108]:
df.filter(df.lname.isNull()).show()
df.filter(df.lname.isNotNull()).show()

+---------+-----+---+------+
|    fname|lname| id|gender|
+---------+-----+---+------+
|Tom Brand| NULL|400|     M|
+---------+-----+---+------+

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|     James| Bond|100|  NULL|
|       Ann|Varsa|200|     F|
|Tom Cruise|  XXX|400|      |
+----------+-----+---+------+



#### like() & rlike() – like() matches a SQL LIKE pattern; rlike() matches a regular expression

In [109]:
df.select(df.fname,df.lname,df.id).filter(df.fname.like("%es")).show()

+-----+-----+---+
|fname|lname| id|
+-----+-----+---+
|James| Bond|100|
+-----+-----+---+



#### when() & otherwise() – Similar to the SQL CASE WHEN ... ELSE expression

In [110]:
from pyspark.sql.functions import when
# Use isNull() to test for NULL; `== None` evaluates to NULL and never matches
df.select(df.fname,df.lname,when(df.gender=="M","Male") \
              .when(df.gender=="F","Female") \
              .when(df.gender.isNull() ,"") \
              .otherwise(df.gender).alias("new_gender") \
    ).show()

+----------+-----+----------+
|     fname|lname|new_gender|
+----------+-----+----------+
|     James| Bond|          |
|       Ann|Varsa|    Female|
|Tom Cruise|  XXX|          |
| Tom Brand| NULL|      Male|
+----------+-----+----------+



#### getField() – Gets the value by key from a MapType column and by struct child name from a StructType column

In [111]:
# Create DataFrame with struct, array & map
from pyspark.sql.types import StructType,StructField,StringType,ArrayType,MapType
data=[(("James","Bond"),["Java","C#"],{'hair':'black','eye':'brown'}),
      (("Ann","Varsa"),[".NET","Python"],{'hair':'brown','eye':'black'}),
      (("Tom Cruise",""),["Python","Scala"],{'hair':'red','eye':'grey'}),
      (("Tom Brand",None),["Perl","Ruby"],{'hair':'black','eye':'blue'})]

schema = StructType([
            StructField('name', StructType([
            StructField('fname', StringType(), True),
            StructField('lname', StringType(), True)])),
            StructField('languages', ArrayType(StringType()),True),
            StructField('properties', MapType( StringType(), StringType()),True)
         ])
df=spark.createDataFrame(data,schema)
df.printSchema()
df.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- fname: string (nullable = true)
 |    |-- lname: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+-----------------+---------------+-----------------------------+
|name             |languages      |properties                   |
+-----------------+---------------+-----------------------------+
|{James, Bond}    |[Java, C#]     |{eye -> brown, hair -> black}|
|{Ann, Varsa}     |[.NET, Python] |{eye -> black, hair -> brown}|
|{Tom Cruise, }   |[Python, Scala]|{eye -> grey, hair -> red}   |
|{Tom Brand, NULL}|[Perl, Ruby]   |{eye -> blue, hair -> black} |
+-----------------+---------------+-----------------------------+



In [112]:
#getField from MapType
df.select(df.properties.getField("hair")).show()

+----------------+
|properties[hair]|
+----------------+
|           black|
|           brown|
|             red|
|           black|
+----------------+



In [113]:
#getField from Struct
df.select(df.name.getField("fname")).show()

+----------+
|name.fname|
+----------+
|     James|
|       Ann|
|Tom Cruise|
| Tom Brand|
+----------+



#### getItem() – Gets the value by index from an ArrayType column and by key from a MapType column

In [114]:
#getItem() used with ArrayType
df.select(df.languages.getItem(1)).show()

+------------+
|languages[1]|
+------------+
|          C#|
|      Python|
|       Scala|
|        Ruby|
+------------+



In [115]:
#getItem() used with MapType
df.select(df.properties.getItem("hair")).show()

+----------------+
|properties[hair]|
+----------------+
|           black|
|           brown|
|             red|
|           black|
+----------------+



Combine multiple columns into struct, concatenated, and conditional columns.

In [116]:
data=[( "James", 'Male', 23, "Wheaton", 60187, "Dupage", 80, "20 SE"),
      ( "Ann", 'Female', 40, "Glen Ellyn", 60137, "Dupage", 78, "21 S"),
      ("Mary", 'Female', 24, "Dekalb", 60115, "Dekalb", 82, "30 E")]
df=spark.createDataFrame(data).toDF("name","gender", "Age", "city", "zip", "county", "temperature", "winds")
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- zip: long (nullable = true)
 |-- county: string (nullable = true)
 |-- temperature: long (nullable = true)
 |-- winds: string (nullable = true)

+-----+------+---+----------+-----+------+-----------+-----+
| name|gender|Age|      city|  zip|county|temperature|winds|
+-----+------+---+----------+-----+------+-----------+-----+
|James|  Male| 23|   Wheaton|60187|Dupage|         80|20 SE|
|  Ann|Female| 40|Glen Ellyn|60137|Dupage|         78| 21 S|
| Mary|Female| 24|    Dekalb|60115|Dekalb|         82| 30 E|
+-----+------+---+----------+-----+------+-----------+-----+



In [117]:
from pyspark.sql.functions import col,struct,when, concat, lit


final = df.select("name", 
                    struct(col('temperature'), 
                           col('winds')).alias('conditions'),
                    struct(col('city'), 
                           col('zip'),
                           col('county')).alias('location'),
                    struct(struct(col('gender'),
                                  col('Age')).alias('personType')).alias('person'),
                    concat(col('name'), lit("|"), col('county')).alias('subjectId'),
                    when(df.temperature>79, 'hot').otherwise('cool').alias('hot places')                      
                    )
final.show(truncate=False)

+-----+-----------+---------------------------+--------------+------------+----------+
|name |conditions |location                   |person        |subjectId   |hot places|
+-----+-----------+---------------------------+--------------+------------+----------+
|James|{80, 20 SE}|{Wheaton, 60187, Dupage}   |{{Male, 23}}  |James|Dupage|hot       |
|Ann  |{78, 21 S} |{Glen Ellyn, 60137, Dupage}|{{Female, 40}}|Ann|Dupage  |cool      |
|Mary |{82, 30 E} |{Dekalb, 60115, Dekalb}    |{{Female, 24}}|Mary|Dekalb |hot       |
+-----+-----------+---------------------------+--------------+------------+----------+



`collect_set()` is an aggregate function that gathers the unique values of a column within each group into an array.

In [118]:
from pyspark.sql.functions import collect_set

data_1 = [("James", "hot"), ("Ann", "hot"), ("James", "cool"), ("Ann", "hot"), ("Mary", "cool")]
df_2 = spark.createDataFrame(data_1, ["name", "hot place"])

# Group by 'name' and collect the unique 'hot place' values into an array
result_df_2 = df_2.groupBy("name").agg(collect_set("hot place").alias("unique hot places"))

df_2.show()
result_df_2.show()

+-----+---------+
| name|hot place|
+-----+---------+
|James|      hot|
|  Ann|      hot|
|James|     cool|
|  Ann|      hot|
| Mary|     cool|
+-----+---------+

+-----+-----------------+
| name|unique hot places|
+-----+-----------------+
|James|      [cool, hot]|
|  Ann|            [hot]|
| Mary|           [cool]|
+-----+-----------------+

