
#PySpark Dataframes - compare various methods to access columns

##PySpark Dataframe: Available methods to access columns

* Method 1: **Dot notation** 
    
    e.g. `df_nation.select(df_nation.n_name, df_nation.n_regionkey)`
* Method 2: **Square bracket notation**
    
    e.g. `df_nation.select(df_nation["n_name"], df_nation["n_regionkey"])`
* Method 3: **Using col() / column() functions**

    Using col(): `df_nation.select(col("n_name"), col("n_regionkey"))`

    Using column(): `df_nation.select(column("n_name"), column("n_regionkey"))`

* Method 4: **Specifying column name as string literals**

    Using double quotes: `df_nation.select("n_name", "n_regionkey")`

    Using single quotes: `df_nation.select('n_name', 'n_regionkey')`

* Method 5: **Using expr() / selectExpr() functions**

    Using expr(): `df_nation.select("n_name", expr("n_regionkey"), expr("concat_ws(':', n_name, n_regionkey) as n_region"))`

    Using selectExpr(): `df_nation.selectExpr("n_name", "n_regionkey", "concat_ws(':', n_name, n_regionkey) as n_region")`

##Sample data - part 1

Note: The sample tables below are part of the `samples` catalog that ships with the Free Edition of Databricks.

In [0]:
#25
df_nation = spark.read.table("samples.tpch.nation")
display(df_nation.limit(100))

#750,000
df_customer = spark.read.table("samples.tpch.customer")
display(df_customer.limit(100))


##Scenarios considered

###Scenario 1: SELECT statements using column names that do not contain special characters.

In [0]:

#Method 1: Dot notation
df_result = df_nation.select(df_nation.n_name, df_nation.n_regionkey)
df_result.display()

#Method 2: Square bracket notation
df_result = df_nation.select(df_nation["n_name"], df_nation["n_regionkey"])
df_result.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col, column
#3.1 Using col()
df_result = df_nation.select(col("n_name"), col("n_regionkey"))
df_result.display()
#3.2 Using column()
df_result = df_nation.select(column("n_name"), column("n_regionkey"))
df_result.display()

#Method 4: Specifying column name as string literals
#4.1 Column name in double quotes
df_result = df_nation.select("n_name", "n_regionkey")
df_result.display()
#4.2 Column name in single quotes
df_result = df_nation.select('n_name', 'n_regionkey')
df_result.display()

#Method 5: Using expr() / selectExpr() function
from pyspark.sql.functions import expr
#5.1 Using expr()
df_result = df_nation.select("n_name", expr("n_regionkey"), expr("concat_ws(':', n_name, n_regionkey) as n_region"))
df_result.display()
#5.2 Using selectExpr()
df_result = df_nation.selectExpr("n_name", "n_regionkey", "concat_ws(':', n_name, n_regionkey) as n_region")
df_result.display()


**Notes**: All five methods work, because the column names don't contain any special characters.

###Scenario 2: SELECT statements using column names that contain special characters

Example special characters: space, hyphen (-), @, etc.

Sample dataframe with column names that contain special characters

In [0]:
sampleData = (
    ("Kevin","","S","NewYork",3100), 
    ("David","R","","California",4300), 
    ("Ben","L","J","NewYork",3000) 
  )

#Notice special characters in column names: space, @, - etc.
myschema = (
    "`first name` string, "
    "`middle@name` string, "
    "`last-name` string, "
    "location string, "
    "salary int"
)

df_specialCharacters = spark.createDataFrame(data = sampleData, schema = myschema)

df_specialCharacters.select("*").display()

In [0]:

#Method 1: Dot notation
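#ATTENTION: FAILS. Attribute access requires a valid Python identifier: `first name` is a SyntaxError, while `middle@name` and `last-name` would be parsed as operator expressions.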
# df_results_sc = df_specialCharacters.select(df_specialCharacters.first name, df_specialCharacters.middle@name)
# df_results_sc.display()

#Method 2: Square bracket notation
df_results_sc = df_specialCharacters.select(df_specialCharacters["first name"], df_specialCharacters["middle@name"])
df_results_sc.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col, column
df_results_sc = df_specialCharacters.select(col("first name"), col("middle@name"))
df_results_sc.display()

#Method 4: Specifying column name as string literals
#4.1 Column name in double quotes
df_results_sc = df_specialCharacters.select("first name", "middle@name")
df_results_sc.display()

#Method 5: Using expr() / selectExpr() function
#IMPORTANT: Use backticks to enclose column names that have special characters in them.
from pyspark.sql.functions import expr
#5.1 Using expr()
df_results_sc = df_specialCharacters.select(expr("`first name`"), expr("`middle@name`"), expr("`last-name`"), expr("concat_ws(':', `first name`, `middle@name`, `last-name`) as full_name"))
df_results_sc.display()
#5.2 Using selectExpr()
df_results_sc = df_specialCharacters.selectExpr("`first name`", "`middle@name`", "`last-name`", "concat_ws(':', `first name`, `middle@name`, `last-name`) as full_name")
df_results_sc.display()

Notes:

| Method | supports special characters? |
|--------|------------------------------|
|Dot notation|No. Doesn't support special characters in column names|
|Square bracket notation|Yes|
|col() function|Yes|
|column name as string literals|Yes|
|expr() / selectExpr()|Yes, provided backticks are used to enclose the column names|
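
Whether a column name can be reached with dot notation comes down to whether it is a valid Python attribute name. The sketch below checks the column names from the sample schema using only the standard library. `supports_dot_notation()` is a hypothetical helper written for this illustration and is not part of PySpark. It does not cover one further caveat: a column whose name matches an existing DataFrame attribute (for example `count`) is also unreachable via dot notation, because Python finds the attribute first.

```python
import keyword


def supports_dot_notation(column_name: str) -> bool:
    """Return True if column_name is a valid Python identifier and not a keyword.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    return column_name.isidentifier() and not keyword.iskeyword(column_name)


# Column names taken from the sample schema in this scenario.
column_names = ["first name", "middle@name", "last-name", "location", "salary"]
dot_friendly = {name: supports_dot_notation(name) for name in column_names}

for name, ok in dot_friendly.items():
    print(f"{name!r}: dot notation {'possible' if ok else 'not possible'}")
```

For names that fail this check, square bracket notation or `col()` remain available, as the table above shows.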


###Scenario 3: Does the method return a column object?

Tip: The benefit of getting a column object back is that we can invoke the various column methods on it.

Example methods available on a column object:

* alias()
* asc()
* between()
* cast()
* contains()
* desc()
* endswith()
* ilike()
* isNull()
* substr() etc.


In [0]:
#Method 1: Dot notation - invoking alias(), cast() methods on column object
df_result = df_nation.select(df_nation.n_name.alias("nation@name"))
df_result.display()
df_result = df_nation.select(df_nation.n_nationkey.cast("string"))
df_result.display()

#Method 2: Square bracket notation - invoking alias(), cast() methods on column object
df_result = df_nation.select(df_nation["n_name"].alias("nation@name"))
df_result.display()
df_result = df_nation.select(df_nation["n_nationkey"].cast("string"))
df_result.display()

#Method 3: Using col()/column() function - invoking alias(), cast() methods on column object
from pyspark.sql.functions import col, column
df_result = df_nation.select(col("n_name").alias("nation@name"))
df_result.display()
df_result = df_nation.select(col("n_nationkey").cast("string"))
df_result.display()

#Method 4: Specifying column name as string literals - invoking alias(), cast() methods on column object
#ATTENTION: Fails with AttributeError, because a string literal is a plain Python str, not a column object, and str has no alias() or cast() method.
#df_result = df_nation.select("n_name".alias("nation@name"))
#df_result.display()
#df_result = df_nation.select("n_nationkey".cast("string"))
#df_result.display()

#Method 5: Using expr() / selectExpr() function - invoking alias(), cast() methods on column object
from pyspark.sql.functions import expr
df_result = df_nation.select(expr("n_name").alias("nation@name"))
df_result.display()
df_result = df_nation.select(expr("n_nationkey").cast("string"))
df_result.display()



Notes:
* With the exception of column name as string literal method, all other methods return a column object. 

| Method | return column object? |
|--------|------------------------------|
|Dot notation|Yes|
|Square bracket notation|Yes|
|col() function|Yes|
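|column name as string literals|No. A string literal is a plain Python string object, not a column object|
|expr() / selectExpr()|expr(): Yes. selectExpr() is a DataFrame method and returns a new DataFrame rather than a column object|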

* So, when you use a method that returns a column object, you could invoke all the available column methods on such column object.

###Scenario 4: Ability to apply column operators (i.e. Column Expressions)

Example column operators: / , * , + , < , >= , == , !=

To be able to apply column operators, the method we use to access the column must return a column object rather than a string object. 

If we are able to apply column operators on existing columns in a dataframe, we will be able to construct new columns based on the existing columns.

For example: `(df_customer.c_acctbal * 2)` is a column expression in which the multiplication operator (`*`) doubles the existing column value. From this column expression, we can construct a new column in the dataframe.

Note: Column expressions aren't the same as `expr()` / `selectExpr()`. A column expression is built in Python from a column object, as returned by dot notation, square bracket notation or `col()`. In contrast, `expr()` / `selectExpr()` take a string literal that contains a SQL-like expression, which Spark parses.


Example operator: using * (i.e. multiplication)

In [0]:
#Method 1: Dot notation
df_result = df_customer.select((df_customer.c_acctbal * 2).alias("acctbalDoubled"))
df_result.display()

#Method 2: Square bracket notation
df_result = df_customer.select((df_customer["c_acctbal"] * 2).alias("acctbalDoubled"))
df_result.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col
df_result = df_customer.select((col("c_acctbal") * 2).alias("acctbalDoubled"))
df_result.display()

#Method 4: Specifying column name as string literals
#ATTENTION: Fails because a string literal is a plain Python str, not a column object. ("c_acctbal") * 2 is Python string repetition ("c_acctbalc_acctbal"), and str has no alias() method.
# df_result = df_customer.select((("c_acctbal") * 2).alias("acctbalDoubled"))
# df_result.display()
#Notes: unlike column functions and methods, string methods work on the column name itself rather than on the column's values.
#For example, the upper() function from pyspark.sql.functions, applied to a column object, returns a column object whose values are upper-cased. (A column object has no upper() method of its own.) On the other hand, calling upper() on a string literal returns the column name in uppercase rather than converting the values to upper case.
#from pyspark.sql.functions import upper
#df_result = df_customer.select(upper(col("c_comment")))
#df_result.display()
#df_result = df_customer.select("c_comment".upper())
#df_result.display()

#Method 5: Using expr() / selectExpr() function
from pyspark.sql.functions import expr
df_result = df_customer.select("c_acctbal", (expr("c_acctbal") * 2).alias("acctbalDoubled"))
df_result.display()


Notes:

* As with column methods, column operators can be applied only to column objects and not to string literals. A string literal is a plain Python string object.
* For example, applying the `upper()` function from `pyspark.sql.functions` to a column object returns a column object whose values are upper-cased. However, calling the string method `upper()` on a string literal upper-cases the column name rather than the values in the c_comment column, e.g. `("c_comment").upper()` returns the column name as `C_COMMENT`.
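
The difference is visible in plain Python, without Spark: operators and methods applied to a string literal act on the column name, not on column values. A minimal sketch (the variable names are illustrative only):

```python
column_name = "c_acctbal"

# '*' on a str is string repetition, not arithmetic on column values.
repeated = column_name * 2

# str.upper() upper-cases the column name itself.
upper_name = "c_comment".upper()

# Column methods such as alias() and cast() do not exist on str.
has_alias = hasattr(column_name, "alias")
has_cast = hasattr(column_name, "cast")

print(repeated)    # c_acctbalc_acctbal
print(upper_name)  # C_COMMENT
print(has_alias, has_cast)  # False False
```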

###Scenario 5: FILTER / WHERE clause using example operator ( <= )

In [0]:
#Method 1: Dot notation
df_result = df_customer.select(df_customer.c_acctbal).where(df_customer.c_acctbal <= 5000)
df_result.display()

#Method 2: Square bracket notation
df_result = df_customer.select(df_customer["c_acctbal"]).where(df_customer["c_acctbal"] <= 5000)
df_result.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col
df_result = df_customer.select(col("c_acctbal")).where(col("c_acctbal") <= 5000)
df_result.display()

#Method 4: Specifying column name as string literals
#ATTENTION: Fails with TypeError, because a string literal is a plain Python str, not a column object, and Python cannot compare a str with an int.
# df_result = df_customer.select("c_acctbal").where(("c_acctbal") <= 5000)
# df_result.display()

#Method 5: Using expr() / selectExpr() function
from pyspark.sql.functions import expr
df_result = df_customer.select("c_acctbal").where(expr("c_acctbal") <= 5000)
df_result.display()


Notes:

* To apply Python operators to a column in a FILTER / WHERE clause, the method used to access the column must return a column object rather than a string object. Only then can operators be applied to it. (`where()` / `filter()` also accept a complete SQL condition passed as a single string, which is closer in spirit to `expr()` than to the string literal method.)
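
The failure in Method 4 happens in Python before Spark is involved: the comparison is evaluated first, and Python cannot order a `str` against an `int`. The sketch below reproduces the error without Spark and, as an alternative, builds the whole condition as one SQL string. The `threshold` and `sql_condition` names are illustrative, and `df_customer` in the final comment refers to the dataframe from the sample data cell.

```python
error_message = None
try:
    # Python evaluates this comparison itself; no Spark call is reached.
    condition = "c_acctbal" <= 5000
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# A complete SQL condition in one string is a valid alternative.
threshold = 5000
sql_condition = f"c_acctbal <= {threshold}"
print(sql_condition)  # c_acctbal <= 5000

# Usage inside the notebook (needs the df_customer dataframe):
# df_customer.select("c_acctbal").where(sql_condition).display()
```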

###Scenario 6: ORDER BY clause

In [0]:
#Method 1: Dot notation
df_result = df_customer.select("*").where(df_customer.c_acctbal <= 5000).orderBy(df_customer.c_acctbal.desc())
df_result.display()

#Method 2: Square bracket notation
df_result = df_customer.select("*").where(df_customer["c_acctbal"] <= 5000).orderBy(df_customer["c_acctbal"].desc())
df_result.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col
df_result = df_customer.select("*").where(col("c_acctbal") <= 5000).orderBy(col("c_acctbal").desc())
df_result.display()

#Method 4: Specifying column name as string literals
#Note: orderBy() accepts a column name as a string literal; the sort direction is then passed through the "ascending" keyword argument.
df_result = df_customer.select("*").where(col("c_acctbal") <= 5000).orderBy("c_acctbal", ascending=False)
df_result.display()

#Method 5: Using expr() / selectExpr() function
from pyspark.sql.functions import expr
df_result = df_customer.select("c_acctbal", (expr("c_acctbal") * 2).alias("acctbalDoubled")).where(expr("c_acctbal") <= 5000).orderBy(expr("c_acctbal").desc())
df_result.display()


Notes:

Unlike the FILTER / WHERE clause, the ORDER BY clause supports all five methods of accessing columns. However, when a column name is passed to `orderBy()` as a string literal, the sort direction must be specified separately through the `ascending` keyword argument.

###Scenario 7: Accessing aliased columns


In [0]:
#Method 1: Dot notation
#ATTENTION: Fails, because acctbalDoubled is not a column of df_customer. The alias exists only in the DataFrame returned by select().
# df_result = df_customer.select((df_customer.c_acctbal * 2).alias("acctbalDoubled")).where(df_customer.acctbalDoubled <= 10000)
# df_result.display()

#Method 2: Square bracket notation
#ATTENTION: Fails, because square bracket notation on df_customer cannot see acctbalDoubled. The alias exists only in the DataFrame returned by select().
# df_result = df_customer.select((df_customer.c_acctbal * 2).alias("acctbalDoubled")).where(df_customer["acctbalDoubled"] <= 10000)
# df_result.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col
df_result = df_customer.select((df_customer.c_acctbal * 2).alias("acctbalDoubled")).where(col("acctbalDoubled") <= 10000)
df_result.display()

#Method 4: Specifying column name as string literals
#ATTENTION: Fails with TypeError, because a string literal is a plain Python str and cannot be compared with an int using <=.
# df_result = df_customer.select((df_customer.c_acctbal * 2).alias("acctbalDoubled")).where("acctbalDoubled" <= 10000)
# df_result.display()

#Method 5: Using expr() / selectExpr() function
from pyspark.sql.functions import expr
df_result = df_customer.select((df_customer.c_acctbal * 2).alias("acctbalDoubled")).where(expr("acctbalDoubled") <= 10000)
df_result.display()


Notes:

* Within a single chained statement, an aliased column can be referenced with `col()` or `expr()`, which resolve the name against the DataFrame they are applied to. Dot notation and square bracket notation on `df_customer` fail because the alias exists only in the DataFrame returned by `select()`, and a bare string literal cannot be compared using Python operators.

###Scenario 8: Adding column / renaming column in a dataframe


####8.1 withColumn() method of Dataframe

Syntax: `DataFrame.withColumn(colName, col)`

Parameters to the `withColumn` method are:
* colName: string, name of the new column.
* col: column object or column expression


In [0]:

#Method 1: Dot notation
df_result = df_nation.select(df_nation.n_name, df_nation.n_regionkey).withColumn("nation@name", df_nation.n_name)
df_result.display()

#Method 2: Square bracket notation
df_result = df_nation.select(df_nation["n_name"], df_nation["n_regionkey"]).withColumn("nation@name", df_nation["n_name"])
df_result.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col, column
#3.1 Using col()
df_result = df_nation.select(col("n_name"), col("n_regionkey")).withColumn("nation@name", col("n_name"))
df_result.display()

#Method 4: Specifying column name as string literals
#ATTENTION: Fails. Because, the 2nd param to the withColumn() is expected to be a column object/expression and not a string literal
#4.1 Column name in double quotes
# df_result = df_nation.select("n_name", "n_regionkey").withColumn("nation@name", "n_name")
# df_result.display()

#Method 5: Using expr() / selectExpr() function
from pyspark.sql.functions import expr
#5.1 Using expr()
df_result = df_nation.select("n_name", "n_regionkey").withColumn("nation@name", expr("n_name"))
df_result.display()


Notes: For the `withColumn` method of Dataframe:
* The first param (i.e. new column name) must always be a string. It can contain special characters.
* The second param must always be a column object or column expression.
* All four methods below return a column object:
  - dot notation
  - square bracket notation
  - col() / column()
  - expr()
* However, a column name as a string literal is not acceptable as the 2nd param, because it is not a column object.


####8.2 withColumns() method of Dataframe

In [0]:

#Method 1: Dot notation
df_result = df_nation.select(df_nation.n_name, df_nation.n_regionkey).withColumns({"nation@name": df_nation.n_name, "region-key": df_nation.n_regionkey})
df_result.display()

#Method 2: Square bracket notation
df_result = df_nation.select(df_nation["n_name"], df_nation["n_regionkey"]).withColumns({"nation@name": df_nation["n_name"], "region-key": df_nation["n_regionkey"]})
df_result.display()

#Method 3: Using col()/column() function
from pyspark.sql.functions import col, column
#3.1 Using col()
df_result = df_nation.select(col("n_name"), col("n_regionkey")).withColumns({"nation@name": col("n_name"), "region-key": col("n_regionkey")})
df_result.display()

#Method 4: Specifying column name as string literals
#ATTENTION: Fails. Because, in each key-value pair in the dictionary as param to the withColumns(), the value is expected to be a column object/expression and not a string literal
#4.1 Column name in double quotes
# df_result = df_nation.select("n_name", "n_regionkey").withColumns({"nation@name": "n_name", "region-key": "n_regionkey"})
# df_result.display()

#Method 5: Using expr() / selectExpr() function
from pyspark.sql.functions import expr
#5.1 Using expr()
df_result = df_nation.select("n_name", "n_regionkey").withColumns({"nation@name": expr("n_name"), "region-key": expr("n_regionkey")})
df_result.display()


Notes: When passing a dictionary to the `withColumns()` method, each key must be a string (the new column name), while the corresponding value must be a column object/expression and not a string literal.

The rest of the observations are the same as for `withColumn`.

####8.3 withColumnRenamed() method of Dataframe

Syntax: `DataFrame.withColumnRenamed(existing, new)`

Parameters to the `withColumnRenamed` method are:
* existing: string, The name of the existing column to be renamed.
* new: string, The new name to be assigned to the column


In [0]:

#Method 1: Dot notation
#ATTENTION: FAILS, because both params of withColumnRenamed() must be strings, not column objects.
# df_result = df_nation.select(df_nation.n_name, df_nation.n_regionkey).withColumnRenamed(df_nation.n_name, "nation@name")
# df_result.display()

#Method 2: Square bracket notation
#ATTENTION: FAILS, because both params of withColumnRenamed() must be strings, not column objects.
# df_result = df_nation.select(df_nation["n_name"], df_nation["n_regionkey"]).withColumnRenamed(df_nation["n_name"], "nation@name")
# df_result.display()

#Method 3: Using col()/column() function
#ATTENTION: FAILS, because both params of withColumnRenamed() must be strings, not column objects.
# from pyspark.sql.functions import col, column
# #3.1 Using col()
# df_result = df_nation.select(col("n_name"), col("n_regionkey")).withColumnRenamed(col("n_name"), "nation@name")
# df_result.display()

#Method 4: Specifying column name as string literals
#4.1 Column name in double quotes
df_result = df_nation.select("n_name", "n_regionkey").withColumnRenamed("n_name", "nation@name")
df_result.display()

#Method 5: Using expr() / selectExpr() function
#ATTENTION: FAILS, because both params of withColumnRenamed() must be strings, not column objects.
# from pyspark.sql.functions import expr
# #5.1 Using expr()
# df_result = df_nation.select("n_name", "n_regionkey").withColumnRenamed(expr("n_name"), "nation@name")
# df_result.display()


Notes:

For the `withColumnRenamed()` method of Dataframe, both column names (existing and new) must be passed as strings; column objects are not accepted.

Tip: In case of `withColumn()`, new column name is the first param. Whereas, in case of `withColumnRenamed()`, new column name is the 2nd param. 

###Scenario 9: CASE WHEN statement

In [0]:
from pyspark.sql.functions import when, col, expr

#Method 1: Dot notation
df_result = df_customer.select(df_customer.c_acctbal, 
  when(df_customer.c_acctbal <= 3000, "Class A")
  .when(df_customer.c_acctbal <= 6000, "Class B")
  .otherwise("Class C")
  .alias("CustomerClassification"))
df_result.display()

#Method 2: Square bracket notation
df_result = df_customer.select(df_customer.c_acctbal, 
  when(df_customer["c_acctbal"] <= 3000, "Class A")
  .when(df_customer["c_acctbal"] <= 6000, "Class B")
  .otherwise("Class C")
  .alias("CustomerClassification"))
df_result.display()

#Method 3: Using col()/column() function
df_result = df_customer.select(df_customer.c_acctbal, 
  when(col("c_acctbal") <= 3000, "Class A")
  .when(col("c_acctbal") <= 6000, "Class B")
  .otherwise("Class C")
  .alias("CustomerClassification"))
df_result.display()

#Method 4: Specifying column name as string literals
#ATTENTION: Fails. Because column name as string literal wouldn't return column object. As a consequence, it cannot be used along with the column operators such as <= etc. TypeError: '<=' not supported between instances of 'str' and 'int'
# df_result = df_customer.select(df_customer.c_acctbal, 
#   when("c_acctbal" <= 3000, "Class A")
#   .when("c_acctbal" <= 6000, "Class B")
#   .otherwise("Class C")
#   .alias("CustomerClassification"))
# df_result.display()

#Method 5: Using expr() / selectExpr() function
df_result = df_customer.select(df_customer.c_acctbal, expr("CASE WHEN c_acctbal <= 3000 THEN 'Class A' " + 
                               "WHEN c_acctbal <= 6000 THEN 'Class B' " + 
                               "ELSE 'Class C' END as CustomerClassification"))
df_result.display()

Notes:

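* With the exception of the string literal method, all methods support CASE WHEN logic: the column-object methods through `when()` / `otherwise()`, and `expr()` through a SQL `CASE` expression.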

###Scenario 10: GROUP BY / AGGREGATIONS

In [0]:

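#Note: importing sum, max and min from pyspark.sql.functions shadows Python's built-in functions of the same names.
from pyspark.sql.functions import col,expr, sum,avg,max,min,count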

#Method 1: Dot notation
df_result = (df_customer.groupBy(df_customer.c_mktsegment)
    .agg(
        sum(df_customer.c_acctbal).alias("total_balance"),
        avg(df_customer.c_acctbal).alias("avg_balance"),
        max(df_customer.c_acctbal).alias("max_balance"),
        min(df_customer.c_acctbal).alias("min_balance"),
        count("*").alias("groupCount")
    )
).where(col("total_balance") > 675000000) #Alternatively, expr() would also work, if you choose to.
df_result.display()

#Method 2: Square bracket notation
df_result = (df_customer.groupBy(df_customer["c_mktsegment"])
    .agg(
        sum(df_customer["c_acctbal"]).alias("total_balance"),
        avg(df_customer["c_acctbal"]).alias("avg_balance"),
        max(df_customer["c_acctbal"]).alias("max_balance"),
        min(df_customer["c_acctbal"]).alias("min_balance"),
        count("*").alias("groupCount")
    )
).where(col("total_balance") > 675000000)
df_result.display()

#Method 3: Using col()/column() function
df_result = (df_customer.groupBy(col("c_mktsegment"))
    .agg(
        sum(col("c_acctbal")).alias("total_balance"),
        avg(col("c_acctbal")).alias("avg_balance"),
        max(col("c_acctbal")).alias("max_balance"),
        min(col("c_acctbal")).alias("min_balance"),
        count("*").alias("groupCount")
    )
).where(col("total_balance") > 675000000)
df_result.display()

#Method 4: Specifying column name as string literals
df_result = (df_customer.groupBy("c_mktsegment")
    .agg(
        sum("c_acctbal").alias("total_balance"),
        avg("c_acctbal").alias("avg_balance"),
        max("c_acctbal").alias("max_balance"),
        min("c_acctbal").alias("min_balance"),
        count("*").alias("groupCount")
    )
).where(col("total_balance") > 675000000)
df_result.display()

#Method 5: Using expr() / selectExpr() function
df_result = (df_customer.groupBy(expr("c_mktsegment"))
    .agg(
        sum(expr("c_acctbal")).alias("total_balance"),
        avg(expr("c_acctbal")).alias("avg_balance"),
        max(expr("c_acctbal")).alias("max_balance"),
        min(expr("c_acctbal")).alias("min_balance"),
        count("*").alias("groupCount")
    )
).where(col("total_balance") > 675000000)
df_result.display()


Notes:

In `groupBy()` and in aggregate functions such as `sum()`, `avg()`, `max()` etc., you can use all five methods to access columns.

Want to reference aggregated columns?

When referencing the aggregated columns in subsequent transformations of the same chained statement, for example in a WHERE clause, use `col()` or `expr()` rather than dot notation or square bracket notation, because the aggregated columns exist only under their alias names in the new DataFrame.

In other words, you cannot write `df_result.total_balance` inside the `.where()` clause, because `df_result` has not been assigned yet at that point.

###Scenario 11: Dataframe joins

####11.1 - Single column join

In [0]:
from pyspark.sql.functions import col,expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_n = df_nation.alias("df_n")
df_c = df_customer.alias("df_c")

df_twoDFs_joined = (
        df_n
        .join(df_c,
            #Method 1: Dot notation
            df_n.n_nationkey == df_c.c_nationkey,

            "inner"
        )
        .select(df_n.n_nationkey.alias("NationKey"),df_n.n_name.alias("NationName"),df_c.c_name.alias("CustomerName"))
        .sort(col("CustomerName").asc()) #With col(), you get to use aliased column name
        #.sort(df_c.c_name.asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
        #.sort(df_c["c_name"].asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
    )

df_twoDFs_joined.display()

In [0]:
from pyspark.sql.functions import col,expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_n = df_nation.alias("df_n")
df_c = df_customer.alias("df_c")

df_twoDFs_joined = (
        df_n
        .join(df_c,

            #Method 2: Square bracket notation
            df_n["n_nationkey"] == df_c["c_nationkey"],

            "inner"
        )
        .select(df_n["n_nationkey"].alias("NationKey"),df_n["n_name"].alias("NationName"),df_c["c_name"].alias("CustomerName"))
        .sort(col("CustomerName").asc()) #With col(), you get to use aliased column name
        #.sort(df_c.c_name.asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
        #.sort(df_c["c_name"].asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
    )

df_twoDFs_joined.display()

In [0]:
from pyspark.sql.functions import col,expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_n = df_nation.alias("df_n")
df_c = df_customer.alias("df_c")

df_twoDFs_joined = (
        df_n
        .join(df_c,

            #Method 3: Using col()/column() function
            col("df_n.n_nationkey") == col("df_c.c_nationkey"),            

            "inner"
        )
        # column aliasing: all the three variations of column aliasing are supported, because dataframes are aliased before the current join command.
        .select(col("df_n.n_nationkey").alias("NationKey"),col("df_n.n_name").alias("NationName"),col("df_c.c_name").alias("CustomerName"))
        .sort(col("CustomerName").asc()) #With col(), you get to use aliased column name
        #.sort(df_c.c_name.asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
        #.sort(df_c["c_name"].asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
    )

df_twoDFs_joined.display()

In [0]:
# from pyspark.sql.functions import col,expr

#Dataframe aliasing
# #IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
# df_n = df_nation.alias("df_n")
# df_c = df_customer.alias("df_c")

# df_twoDFs_joined = (
#         df_n
#         .join(df_c,
       
#             #Method 4: (DOES NOT WORK) Specifying column name as string literals
#             #because, the pre-requisite with this option is the joining column names must be exactly the same in both the dataframes
#             #"n_nationkey",  

#             "inner"
#         )

#         .select(df_n.n_nationkey.alias("NationKey"),df_n.n_name.alias("NationName"),df_c.c_name.alias("CustomerName"))
#         .sort(col("CustomerName").asc()) #With col(), you get to use aliased column name
#         #.sort(df_c.c_name.asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
#         #.sort(df_c["c_name"].asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
#     )

# df_twoDFs_joined.display()

In [0]:
from pyspark.sql.functions import col,expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_n = df_nation.alias("df_n")
df_c = df_customer.alias("df_c")

df_twoDFs_joined = (
        df_n
        .join(df_c,

            #Method 5: Using expr() / selectExpr() function
            #Both variations of the syntax below work
            #Note: expr() works here too, but consider whether it is the clearest choice, since the name expr() doesn't describe a join condition.
            #expr("df_n.n_nationkey") == expr("df_c.c_nationkey"),           
            expr("df_n.n_nationkey == df_c.c_nationkey"),    

            "inner"
        )
        # column aliasing: all the three variations of column aliasing are supported, because dataframes are aliased before the current join command.
        .select(expr("df_n.n_nationkey").alias("NationKey"),expr("df_n.n_name").alias("NationName"),expr("df_c.c_name").alias("CustomerName"))
        .sort(expr("CustomerName").asc()) 
        #.sort(df_c.c_name.asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
        #.sort(df_c["c_name"].asc()) #With dot notation/square bracket notation, you'll need to use original column name rather than aliased name
    )

df_twoDFs_joined.display()

**Assumption**: The observations below assume that the dataframes are aliased and assigned to variables before they are referenced in the joins, as demonstrated in the examples above.

Notes:

* With the exception of the column name as string literal method, all four other methods can be used to reference the dataframe columns in the join condition.
* The string literal method works only when the joining column has the same name in both dataframes.
* To reference the aliased column names, use the col() or expr() method.
* Even though expr() works in dataframe joins, consider whether it is the clearest choice, given that the function name doesn't reflect its purpose in this context.

####11.2 - Multi-column join


In [0]:
#29,999,795
df_lineitem = spark.read.table("samples.tpch.lineitem")
display(df_lineitem)

#4,000,000
df_partsupp = spark.read.table("samples.tpch.partsupp")
display(df_partsupp)


In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_li = df_lineitem.alias("df_li")  
df_ps = df_partsupp.alias("df_ps")

df_multiplejoin2DFs = (
        df_li
        .join(df_ps,
        #Method 1: Dot notation along with aliased dataframe names 
        [(df_li.l_partkey == df_ps.ps_partkey) &
        (df_li.l_suppkey == df_ps.ps_suppkey)],
        
        "inner"
    )
    .select(df_li.l_partkey.alias("PartKey"), df_li.l_suppkey.alias("SuppKey"), df_ps.ps_availqty.alias("AvailableQty"))
    .sort(col("PartKey").asc()) #using aliased column name 
)

df_multiplejoin2DFs.display()

In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_li = df_lineitem.alias("df_li")  
df_ps = df_partsupp.alias("df_ps")

df_multiplejoin2DFs = (
        df_li
        .join(df_ps,

        #Method 2: Square bracket notation
        [(df_li["l_partkey"] == df_ps["ps_partkey"]) &
        (df_li["l_suppkey"] == df_ps["ps_suppkey"])],  
        
        "inner"
    )
    .select(df_li["l_partkey"].alias("PartKey"), df_li["l_suppkey"].alias("SuppKey"), df_ps["ps_availqty"].alias("AvailableQty"))
    .sort(col("PartKey").asc()) #using aliased column name 
)

df_multiplejoin2DFs.display()

In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_li = df_lineitem.alias("df_li")  
df_ps = df_partsupp.alias("df_ps")

df_multiplejoin2DFs = (
        df_li
        .join(df_ps,

        #Method 3: Using col()/column() function + aliased dataframes
        [(col("df_li.l_partkey") == col("df_ps.ps_partkey")) &
        (col("df_li.l_suppkey") == col("df_ps.ps_suppkey"))],
        
        "inner"
    )
    .select(col("df_li.l_partkey").alias("PartKey"), col("df_li.l_suppkey").alias("SuppKey"), col("df_ps.ps_availqty").alias("AvailableQty"))
    .sort(col("PartKey").asc()) #using aliased column name 
)

df_multiplejoin2DFs.display()

In [0]:
# from pyspark.sql.functions import col, expr

# #Dataframe aliasing
# #IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
# df_li = df_lineitem.alias("df_li")  
# df_ps = df_partsupp.alias("df_ps")

# df_multiplejoin2DFs = (
#         df_li
#         .join(df_ps,

#         #Method 4: Specifying column names as string literals (DOES NOT WORK): the joining column names are not the same in both dataframes
#         #["l_partkey","l_suppkey"],
        
#         "inner"
#     )
#     .select(df_li.l_partkey.alias("PartKey"), df_li.l_suppkey.alias("SuppKey"), df_ps.ps_availqty.alias("AvailableQty"))
#     .sort(col("PartKey").asc()) #using aliased column name 
# )

# df_multiplejoin2DFs.display()

In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_li = df_lineitem.alias("df_li")  
df_ps = df_partsupp.alias("df_ps")

df_multiplejoin2DFs = (
        df_li
        .join(df_ps,

        #Method 5: Using expr() / selectExpr() function + aliased dataframes
        #Both variations of the syntax below work
        # [(expr("df_li.l_partkey") == expr("df_ps.ps_partkey")) &
        # (expr("df_li.l_suppkey") == expr("df_ps.ps_suppkey"))],

        [(expr("df_li.l_partkey == df_ps.ps_partkey")) &
        (expr("df_li.l_suppkey == df_ps.ps_suppkey"))],
        
        "inner"
    )
    .select(expr("df_li.l_partkey").alias("PartKey"), expr("df_li.l_suppkey").alias("SuppKey"), expr("df_ps.ps_availqty").alias("AvailableQty"))
    .sort(expr("PartKey").asc()) #using aliased column name 
)

df_multiplejoin2DFs.display()

Notes: The same observations apply as for the single-column joins.
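
With composite keys, the single-string `expr()` condition gets long to write by hand. A minimal sketch, assuming a hypothetical helper named `build_join_expr` (not part of PySpark), that assembles the SQL condition from the dataframe aliases and the key pairs. It is plain string handling, so it runs without a Spark session.

```python
def build_join_expr(left_alias, right_alias, key_pairs):
    """Build a SQL join condition such as 'a.x == b.y AND a.p == b.q'.

    key_pairs is a list of (left_column, right_column) tuples.
    """
    if not key_pairs:
        raise ValueError("at least one key pair is required")
    return " AND ".join(
        f"{left_alias}.{left_col} == {right_alias}.{right_col}"
        for left_col, right_col in key_pairs
    )


condition = build_join_expr(
    "df_li", "df_ps",
    [("l_partkey", "ps_partkey"), ("l_suppkey", "ps_suppkey")],
)
print(condition)
# df_li.l_partkey == df_ps.ps_partkey AND df_li.l_suppkey == df_ps.ps_suppkey
```

The resulting string could then be passed as `expr(condition)` in place of the hand-written condition in the Method 5 cell.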

####11.3 Self-join

Preview of sample data for self-join

| emp_id | emp_name | job_title | manager_id |
|--------|----------|-----------|------------|
|101|David|CEO|None|
|102|Kevin|GM|101|
|103|Ben|EM|102|
|104|Mike|PM|103|
|105|Jack|Data Engineer|104|
|106|Melissa|Data Engineer|104|


In [0]:
employee_data = (
    (101,"David","CEO",None),
    (102,"Kevin","GM",101),
    (103,"Ben","EM",102),
    (104,"Mike","PM",103),
    (105,"Jack","Data Engineer",104),
    (106,"Melissa","Data Engineer",104)
  )
employee_schema = ("emp_id int, emp_name string, job_title string, manager_id int")

df_employee = spark.createDataFrame(data = employee_data, schema = employee_schema)

df_employee.select("*").display()

In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_emp = df_employee.alias("df_emp")
df_mgr = df_employee.alias("df_mgr")

#Method 1: Dot notation
df_result_selfjoin = (
    df_emp
    .join(
        df_mgr,
        df_emp.manager_id == df_mgr.emp_id,
        "left_outer"
    )
    .select(
        df_emp.emp_id.alias("Employee_Id"),
        df_emp.emp_name.alias("Employee_Name"),
        df_emp.job_title.alias("Employee_Job_Title"),
        df_mgr.emp_id.alias("Manager_Id"),
        df_mgr.emp_name.alias("Manager_Name"),
        df_mgr.job_title.alias("Manager_Job_Title")
    )
)

df_result_selfjoin.display()

In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_emp = df_employee.alias("df_emp")
df_mgr = df_employee.alias("df_mgr")

#Method 2: Square bracket notation
df_result_selfjoin = (
    df_emp
    .join(
        df_mgr,
        df_emp["manager_id"] == df_mgr["emp_id"],
        "left_outer"
    )
    .select(
        df_emp["emp_id"].alias("Employee_Id"),
        df_emp["emp_name"].alias("Employee_Name"),
        df_emp["job_title"].alias("Employee_Job_Title"),
        df_mgr["emp_id"].alias("Manager_Id"),
        df_mgr["emp_name"].alias("Manager_Name"),
        df_mgr["job_title"].alias("Manager_Job_Title")
    )
)

df_result_selfjoin.display()

In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_emp = df_employee.alias("df_emp")
df_mgr = df_employee.alias("df_mgr")

#Method 3: Using col()/column() function
df_result_selfjoin = (
    df_emp
    .join(
        df_mgr,
        col("df_emp.manager_id") == col("df_mgr.emp_id"),
        "left_outer"
    )
    .select(
        col("df_emp.emp_id").alias("Employee_Id"),
        col("df_emp.emp_name").alias("Employee_Name"),
        col("df_emp.job_title").alias("Employee_Job_Title"),
        col("df_mgr.emp_id").alias("Manager_Id"),
        col("df_mgr.emp_name").alias("Manager_Name"),
        col("df_mgr.job_title").alias("Manager_Job_Title")
    )
)

df_result_selfjoin.display()

In [0]:
# from pyspark.sql.functions import col, expr

# #Dataframe aliasing
# #IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
# df_emp = df_employee.alias("df_emp")
# df_mgr = df_employee.alias("df_mgr")

# #Method 4: Specifying column name as string literals
# #ATTENTION: Not tried, because the string-literal form needs identical column names on both sides of the join. In this self-join, the joined columns (manager_id and emp_id) have different names.

# df_result_selfjoin.display()

In [0]:
from pyspark.sql.functions import col, expr

#Dataframe aliasing
#IMPORTANT: make sure variable name and string name passed to alias method are exactly the same.
df_emp = df_employee.alias("df_emp")
df_mgr = df_employee.alias("df_mgr")

#Method 5: Using expr() / selectExpr() function
df_result_selfjoin = (
    df_emp
    .join(
        df_mgr,
        expr("df_emp.manager_id == df_mgr.emp_id"),
        "left_outer"
    )
    .select(
        expr("df_emp.emp_id").alias("Employee_Id"),
        expr("df_emp.emp_name").alias("Employee_Name"),
        expr("df_emp.job_title").alias("Employee_Job_Title"),
        expr("df_mgr.emp_id").alias("Manager_Id"),
        expr("df_mgr.emp_name").alias("Manager_Name"),
        expr("df_mgr.job_title").alias("Manager_Job_Title")
    )
)

df_result_selfjoin.display()

Notes: The same observations apply as for the single-column joins.
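
To make the expected result of the `left_outer` self-join concrete, the plain-Python sketch below emulates it on the same six rows without Spark. It illustrates the join semantics only and is not how Spark executes the join; the variable names are assumptions.

```python
employee_data = [
    (101, "David", "CEO", None),
    (102, "Kevin", "GM", 101),
    (103, "Ben", "EM", 102),
    (104, "Mike", "PM", 103),
    (105, "Jack", "Data Engineer", 104),
    (106, "Melissa", "Data Engineer", 104),
]

# Index the "manager" side by emp_id, mirroring df_mgr.emp_id in the join condition.
managers_by_id = {
    emp_id: (emp_id, name, title) for emp_id, name, title, _ in employee_data
}

result = []
for emp_id, name, title, manager_id in employee_data:
    # Left outer join: keep every employee row and fill the manager
    # columns with None when no manager row matches.
    manager = managers_by_id.get(manager_id, (None, None, None))
    result.append((emp_id, name, title) + manager)

for row in result:
    print(row)
```

Every employee appears exactly once, and the CEO row, which has no `manager_id`, carries `None` in all three manager columns.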

##Key Takeaways

###Strengths and limitations - Method-wise

| Method | Strengths | Limitations / When Not to Use |
|--------|-----------|-------------------------------|
|`col("colName")` | - Works in almost all scenarios <br> - Can reference *aliased* and *aggregated* columns (e.g. in `WHERE / ORDER BY`) | - Cannot be used as the new name when *adding* columns → use **string literals**<br> - Cannot be used to *rename* columns → use **string literals** |
|Dot Notation <br>`(df.colName)` | - Concise, intuitive, very readable <br> - Familiar to OOP/Python users <br> - Gives code a **consistent look** | - Fails if column names contain special characters → use **col()** <br> - Cannot access aliased columns → use **col()** <br> - Cannot be used as the new name when *adding* columns → use **string literals**<br> - Cannot be used to *rename* columns → use **string literals** |
|String Literals <br>`("colName")` | - Required for: <br>   • the new column name when adding a column with `withColumn()` <br>   • renaming with `withColumnRenamed()` | - Very limited outside of those cases |
|`expr("sql_expression")`| - Can do everything **col()** can <br> - Allows **SQL-like expressions** directly as strings <br> - Great for analysts moving from SQL | - Best kept for SQL-style expressions (for clarity, avoid using it as a **col()** replacement) |
|Square Brackets <br>`(df["colName"])` | - Same as dot notation <br> - Additionally supports special characters in column names | - Cannot access *aliased* columns → use **col()** <br> - Cannot be used as the new name when *adding* columns → use **string literals**<br> - Cannot be used to *rename* columns → use **string literals** <br> - Slightly more verbose, less elegant |
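
The dot-notation limitation follows from Python itself: `df.colName` is ordinary attribute access, so the column name has to be a valid Python identifier and must not collide with an existing DataFrame attribute. A rough sketch of that check, using a hypothetical helper `dot_notation_safe` and a deliberately small, non-exhaustive sample of DataFrame attribute names:

```python
import keyword

# Illustrative sample only; a real DataFrame has many more attributes and methods.
SAMPLE_DATAFRAME_ATTRIBUTES = {"count", "filter", "select", "columns", "schema"}


def dot_notation_safe(column_name):
    """Return True if df.<column_name> can plausibly be written with dot notation."""
    return (
        column_name.isidentifier()
        and not keyword.iskeyword(column_name)
        and column_name not in SAMPLE_DATAFRAME_ATTRIBUTES
    )


for name in ["n_name", "nation name", "n-name", "count", "class"]:
    print(f"{name!r}: {dot_notation_safe(name)}")
```

Only `n_name` passes here. The other names contain a space or hyphen, shadow a DataFrame method, or are Python keywords, which is where square brackets or col() come in.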

###Best practice recommendations

A **hybrid approach** is recommended, as no single method covers all scenarios for accessing columns in a DataFrame.

**Option A**

* Use **col()** for most transformations, including when working with aliased/aggregated columns or column names containing special characters.
* Use **string literals** only when:
  - Specifying a new column name with `withColumn()`, or
  - Renaming a column with `withColumnRenamed()`.

**Option B**

* Use **dot notation** for concise, readable code. It is familiar to Python users and provides a consistent coding style.
* Use **col()** when working with aliased/aggregated columns or columns with special characters.
* Use **string literals** only when:
  - Specifying a new column name with `withColumn()`, or
  - Renaming a column with `withColumnRenamed()`.

**Either option works well**; choosing between them is a matter of what suits you best.

**Notes**
1. Reserve **expr()** for SQL-style expressions. Avoid using it as a replacement for **col()**, to keep the code clear.
2. **Square bracket notation** can do everything dot notation does, but it is slightly more verbose, so it is not the recommended default. If you prefer it over dot notation, you still need a hybrid approach: col() for aliased/aggregated columns and string literals when creating or renaming columns.