## Column Class Object

The `pyspark.sql.Column` class provides functions for working with DataFrame columns: manipulating column values, building Boolean expressions to filter rows, retrieving a value or part of a value from a column, and working with list, map, and struct columns.

Let's see how to create Column objects and access them to perform operations.

In [0]:
dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this command.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType
from pyspark.sql.functions import lit, col, expr, when

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('Column Class Object').getOrCreate()

#### Create Column Class Object

In [0]:
# lit() wraps a literal value in a Column object; it does not reference an existing column.
colObj = lit('column_name')

In [0]:
# You can also access a Column from a DataFrame in several ways.
data=[('Elijah',36),('Gregory',9)]
df=spark.createDataFrame(data).toDF('firstname','age')
df.printSchema()

In [0]:
# Using DataFrame object (df)
df.select(
  df.age,
  df['firstname']
).show()

In [0]:
# Using the SQL col() function
df.select(col('age')).show()
# A column name that contains a dot must be wrapped in backticks, e.g. col('`name.fname`');
# a plain name such as 'firstname' needs no quoting.
df.select(col('firstname')).show()


In [0]:
# Create DataFrame with struct using Row class
data=[
  Row(name='John',prop=Row(hair='black',eye='brown')),
  Row(name='Marie',prop=Row(hair='blond',eye='black'))
]

df=spark.createDataFrame(data)
df.printSchema()

In [0]:
# Access a field of the struct column in three equivalent ways
df.select(
  df.prop.hair,
  df['prop.hair'],
  col('prop.hair')
).show()

In [0]:
# Access all columns from struct
df.select(col('prop.*')).show()

#### PySpark Column Operators

In [0]:
data=[(104,3,2),(201,2,1),(400,5,5)]
df=spark.createDataFrame(data).toDF("col1","col2","col3")

##### Arithmetic and comparison operations

In [0]:
df.select(
  df.col1 + df.col2,
  df.col1 - df.col2,
  df.col1 * df.col2,
  df.col1 / df.col2,
  df.col1 % df.col2, 
  df.col2 > df.col3,
  df.col2 < df.col3,
  df.col2 == df.col3
).show()

#### Column Functions Examples

In [0]:
data=[
  ('James','Bond','100',None),
  ('Ann','Varsa','200','F'),
  ('Tom Cruise','XXX','400',''),
  ('Tom Brand',None,'400','M')
] 
columns=['fname','lname','id','gender']
df=spark.createDataFrame(data,columns)
df.printSchema()

##### alias()
Sets a new name (alias) for the Column.

In [0]:
df.select(
  df.fname.alias("first_name"),
  df.lname.alias("last_name")
).show()

In [0]:
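# expr() parses a SQL expression; || concatenates strings and yields NULL if either side is NULL.
df.select(expr("fname || ',' || lname").alias('fullName')).show()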

##### asc() & desc()
Sort the DataFrame rows by a column in ascending or descending order.

In [0]:
df.sort(df.fname.asc()).show()
df.sort(df.fname.desc()).show()

##### cast() & astype()
Convert a column to another data type.

In [0]:
df.select(
  df.fname,
  df.id.cast('int')
).printSchema()

In [0]:
# astype() is an alias for cast().
df.select(
  df.fname,
  df.id.astype('int')
).printSchema()

##### between()
Returns a Boolean expression that is true when the column value lies between the lower and upper bounds, inclusive.

In [0]:
df.filter(df.id.cast('int').between(100,300)).show()

##### contains()
Checks whether a DataFrame column value contains the string passed to this function.

In [0]:
df.filter(df.fname.contains('Cruise')).show()

##### startswith() & endswith()
Check whether the value of a DataFrame column starts with or ends with a given string, respectively.

In [0]:
df.filter(df.fname.startswith('J')).show()
df.filter(df.fname.endswith('nd')).show()

##### isNull() & isNotNull()
Check whether the DataFrame column value is NULL or non-NULL.

In [0]:
df.filter(df.gender.isNull()).show()
df.filter(df.gender.isNotNull()).show()

##### substr()
Returns a Column holding a substring of the column value. `substr(startPos, length)` takes a 1-based start position and a length.

In [0]:
df.select(df.fname.substr(1,2).alias('substr')).show()

##### when() & otherwise()
Evaluates a sequence of conditions in order and returns the value attached to the first one that matches; `otherwise()` supplies the value used when none match.

In [0]:
df.select(
  df.fname,
  df.lname,
  df.gender.alias('old_gender'),
  when(df.gender=='M','Male')
  .when(df.gender=='F','Female')
  .when((df.gender.isNull()) | (df.gender==''),'Not specified')
  .otherwise(df.gender).alias('new_gender')
).show()

##### isin()
Checks whether the column value is present in a list of values.

In [0]:
li=['100','200']
df.select(
  df.fname,
  df.lname,
  df.id
).filter(df.id.isin(li)).show()

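##### like() & rlike()
`like()` matches a SQL LIKE pattern (`%` and `_` wildcards); `rlike()` matches a regular expression.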

In [0]:
df.select(df.fname,df.lname,df.id).filter(df.fname.like('%n')).show()

##### getField()
Gets a value by key from a MapType column, or a child field by name from a StructType column.

In [0]:
# Create a DataFrame with struct, array, and map columns using an explicit schema
data1=[
  (('John','Smith'),['Python','Scala'],{'hair': 'black','eye': 'brown'}),
  (('Marie','Brand'),['Java','C#'],{'hair': 'blond','eye': 'black'})
]

schema = StructType([
  StructField('name', StructType([
    StructField('first', StringType(), True),
    StructField('last', StringType(), True)
  ])),
  StructField('languages', ArrayType(StringType()),True),
  StructField('properties', MapType(StringType(),StringType()),True)
])

df1=spark.createDataFrame(data1,schema)
df1.printSchema()

In [0]:
# getField() from MapType
df1.select(df1.properties.getField('hair')).show()

In [0]:
# getField() from Struct
df1.select(df1.name.getField('first')).show()

##### getItem()
Gets an element by index from an ArrayType column, or a value by key from a MapType column.

In [0]:
# getItem() with ArrayType
df1.select(
  df1.languages.getItem(0),
  df1.languages.getItem(1),
).show()

In [0]:
# getItem() with MapType
df1.select(
  df1.properties.getItem('hair'),
  df1.properties.getItem('eye')
).show()

##### dropFields()
An expression that drops fields in StructType by name.

In [0]:
df1.withColumn('firstname', df1['name'].dropFields('last')).show(truncate=False)

##### withField()
An expression that adds/replaces a field in StructType by name.

In [0]:
# Adds a field named 'fullname', holding the literal 'M', to a copy of the name struct.
df1.withColumn('fullname', df1['name'].withField('fullname', lit('M'))).show(truncate=False)

#### The end of the notebook