# Joining Spark DFs w/identical column names

## Background

This notebook ports the Scala DataFrame code presented in the [joining spark dataframes with identical column names](https://lestermartin.blog/2020/12/02/joining-spark-dataframes-with-identical-column-names-not-just-in-the-join-condition/) blog post to Python.

Additionally, this notebook uses [PyStarburst](https://docs.starburst.io/clients/python/pystarburst.html) rather than PySpark, but one could easily change that in the config/setup section below.

Oh... this notebook provides an easier way to solve this than the `withColumnRenamed` function, but I'll save that for last.

---
## Config & setup

Be sure to run `pip install pystarburst` if not already installed.

The cell below should return `[Row(Working='Yes')]` if everything is working. If an exception is raised, 
it is most likely due to an incorrect cluster host and/or credential values.

In [None]:
import getpass
import trino
from pystarburst import Session
from pystarburst import functions as F
from pystarburst.functions import *
from pystarburst.window import Window as W
from pystarburst.types import * 

# grab credentials from the notebook user to be used when making a connection
host = input("Host name")
username = input("User name")
password = getpass.getpass("Password")

# PyStarburst setup
session_properties = {
    "host":host,
    "port": 443,
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication(username, password)
}
session = Session.builder.configs(session_properties).create()

# validate PyStarburst working
session.sql("select 'Yes' as Working").collect()

---
## Let's code

In [21]:
# create a customer DF

cust_schema = StructType([StructField("cust_id", IntegerType()), 
                          StructField("name", StringType())])

custDF = session.create_dataframe([[101, "Lester"], [102, "Gretchen"], 
                                   [103, "Zoe"], [104, "Connor"]], cust_schema)
custDF.show()

------------------------
|"cust_id"  |"name"    |
------------------------
|101        |Lester    |
|102        |Gretchen  |
|103        |Zoe       |
|104        |Connor    |
------------------------



In [22]:
# create an order DF

order_schema = StructType([StructField("order_id", IntegerType()), 
                           StructField("cust_id", IntegerType()), 
                           StructField("total_price", FloatType())])

orderDF = session.create_dataframe([[8888, 101, 33.33], 
                                    [8889, 101, 66.66],
                                    [8890, 102, 101.01],
                                    [8891, 101, 99.99]], order_schema)
orderDF.show()

------------------------------------------
|"order_id"  |"cust_id"  |"total_price"  |
------------------------------------------
|8888        |101        |33.33          |
|8889        |101        |66.66          |
|8890        |102        |101.01         |
|8891        |101        |99.99          |
------------------------------------------



In [23]:
# join them

joinedDF1 = orderDF.join(custDF, orderDF.cust_id == custDF.cust_id)
joinedDF1.show()

-------------------------------------------------------------------------------
|"order_id"  |"l_66x3_cust_id"  |"total_price"  |"r_pzf3_cust_id"  |"name"    |
-------------------------------------------------------------------------------
|8891        |101               |99.99          |101               |Lester    |
|8889        |101               |66.66          |101               |Lester    |
|8888        |101               |33.33          |101               |Lester    |
|8890        |102               |101.01         |102               |Gretchen  |
-------------------------------------------------------------------------------



In [24]:
# ^^^ got some "cute" names for the columns whose names were the same

# original blog post suggested just renaming one of the columns first 
#  which makes it super easy to get rid of the duplicate column

tweaked_custDF = custDF.with_column_renamed("cust_id", "id")
joinedDF2 = orderDF.join(tweaked_custDF, orderDF.cust_id == tweaked_custDF.id) \
    .drop("id")

joinedDF2.show()

-----------------------------------------------------
|"order_id"  |"cust_id"  |"total_price"  |"name"    |
-----------------------------------------------------
|8891        |101        |99.99          |Lester    |
|8889        |101        |66.66          |Lester    |
|8888        |101        |33.33          |Lester    |
|8890        |102        |101.01         |Gretchen  |
-----------------------------------------------------



In [25]:
# these "cute" names happen for any name collisions

custDF_with_notes = custDF.with_column("notes", lit("cust note"))

orderDF_with_notes = orderDF.with_column("notes", lit("order note"))

joinedDF3 = orderDF_with_notes.join(custDF_with_notes, 
                                    orderDF_with_notes.cust_id == custDF_with_notes.cust_id)
joinedDF3.show()

-----------------------------------------------------------------------------------------------------------------
|"order_id"  |"l_vxck_cust_id"  |"total_price"  |"l_vxck_notes"  |"r_j1lm_cust_id"  |"name"    |"r_j1lm_notes"  |
-----------------------------------------------------------------------------------------------------------------
|8888        |101               |33.33          |order note      |101               |Lester    |cust note       |
|8889        |101               |66.66          |order note      |101               |Lester    |cust note       |
|8890        |102               |101.01         |order note      |102               |Gretchen  |cust note       |
|8891        |101               |99.99          |order note      |101               |Lester    |cust note       |
-----------------------------------------------------------------------------------------------------------------
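
Before joining, it can help to know which names are going to collide. The following is a minimal sketch, assuming the column names are available as plain Python lists. `colliding_columns` is a hypothetical helper (not part of PyStarburst), and the lists are typed in by hand here rather than read from the DataFrames.

```python
def colliding_columns(left_cols, right_cols):
    """Return the names present in both lists, in left-hand order."""
    right = set(right_cols)
    return [name for name in left_cols if name in right]


# column names copied by hand from orderDF_with_notes and custDF_with_notes
order_cols = ["order_id", "cust_id", "total_price", "notes"]
cust_cols = ["cust_id", "name", "notes"]

collisions = colliding_columns(order_cols, cust_cols)
print(collisions)  # ['cust_id', 'notes']
```

Each colliding name shows up twice in the join result (once per side), which accounts for the four generated names in the output above.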



In [26]:
# ^^^ shows 4 of those "cute" rewritten names
#  you could manually rename everything that can have a collision, but you could also 
#  leverage lsuffix and/or rsuffix (either or both!) as shown below to make the 
#  "cute" names predictable (showing both in play below)

orderDF_with_notes.join(custDF_with_notes, 
                        orderDF_with_notes.cust_id == custDF_with_notes.cust_id,
                        lsuffix = '_o', rsuffix = '_c').show()

----------------------------------------------------------------------------------------------
|"order_id"  |"cust_id_o"  |"total_price"  |"notes_o"   |"cust_id_c"  |"name"    |"notes_c"  |
----------------------------------------------------------------------------------------------
|8888        |101          |33.33          |order note  |101          |Lester    |cust note  |
|8889        |101          |66.66          |order note  |101          |Lester    |cust note  |
|8890        |102          |101.01         |order note  |102          |Gretchen  |cust note  |
|8891        |101          |99.99          |order note  |101          |Lester    |cust note  |
----------------------------------------------------------------------------------------------



---
## Blog post specific examples

These last few cells repeat the examples from above, but they are the ones used in the blog post that links to this notebook.

In [31]:
# reuse the DataFrames that have the notes column, under shorter names

orders = orderDF_with_notes
orders.show()

custs = custDF_with_notes
custs.show()

-------------------------------------------------------
|"order_id"  |"cust_id"  |"total_price"  |"notes"     |
-------------------------------------------------------
|8888        |101        |33.33          |order note  |
|8889        |101        |66.66          |order note  |
|8890        |102        |101.01         |order note  |
|8891        |101        |99.99          |order note  |
-------------------------------------------------------

------------------------------------
|"cust_id"  |"name"    |"notes"    |
------------------------------------
|101        |Lester    |cust note  |
|102        |Gretchen  |cust note  |
|103        |Zoe       |cust note  |
|104        |Connor    |cust note  |
------------------------------------



In [32]:
# generated names on collisions

orders.join(custs, orders.cust_id == custs.cust_id).show()

-----------------------------------------------------------------------------------------------------------------
|"order_id"  |"l_ptl0_cust_id"  |"total_price"  |"l_ptl0_notes"  |"r_9zag_cust_id"  |"name"    |"r_9zag_notes"  |
-----------------------------------------------------------------------------------------------------------------
|8888        |101               |33.33          |order note      |101               |Lester    |cust note       |
|8889        |101               |66.66          |order note      |101               |Lester    |cust note       |
|8890        |102               |101.01         |order note      |102               |Gretchen  |cust note       |
|8891        |101               |99.99          |order note      |101               |Lester    |cust note       |
-----------------------------------------------------------------------------------------------------------------



In [35]:
# tackling the issue by manually renaming columns

ordersMod = orders.with_column_renamed(
    "notes", "order_notes")

custsMod = custs.with_column_renamed(
    "cust_id", "id")

ordersMod.join(custsMod, 
               ordersMod.cust_id == custsMod.id).show()

----------------------------------------------------------------------------------------
|"order_id"  |"cust_id"  |"total_price"  |"order_notes"  |"id"  |"name"    |"notes"    |
----------------------------------------------------------------------------------------
|8888        |101        |33.33          |order note     |101   |Lester    |cust note  |
|8889        |101        |66.66          |order note     |101   |Lester    |cust note  |
|8890        |102        |101.01         |order note     |102   |Gretchen  |cust note  |
|8891        |101        |99.99          |order note     |101   |Lester    |cust note  |
----------------------------------------------------------------------------------------
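
When several columns need renaming, the repeated `with_column_renamed` calls can be wrapped in a small loop. This is only a sketch: `rename_columns` is a hypothetical helper, and it assumes nothing about the DataFrame beyond the `with_column_renamed(old, new)` method already used in this notebook.

```python
def rename_columns(df, renames):
    """Apply each old -> new rename in turn via with_column_renamed."""
    for old_name, new_name in renames.items():
        df = df.with_column_renamed(old_name, new_name)
    return df


# intended use with the DataFrames above (needs a live session, so left as comments):
#   ordersMod = rename_columns(orders, {"notes": "order_notes"})
#   custsMod = rename_columns(custs, {"cust_id": "id"})
```

Keeping the renames in a dict puts every old/new pair in one place, which is handy once more than a couple of names collide.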



In [37]:
# tackling the issue with automagical suffix values

orders.join(custs, 
            orders.cust_id == custs.cust_id, 
            rsuffix = '_c').show()

--------------------------------------------------------------------------------------------
|"order_id"  |"cust_id"  |"total_price"  |"notes"     |"cust_id_c"  |"name"    |"notes_c"  |
--------------------------------------------------------------------------------------------
|8888        |101        |33.33          |order note  |101          |Lester    |cust note  |
|8889        |101        |66.66          |order note  |101          |Lester    |cust note  |
|8890        |102        |101.01         |order note  |102          |Gretchen  |cust note  |
|8891        |101        |99.99          |order note  |101          |Lester    |cust note  |
--------------------------------------------------------------------------------------------
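
The naming pattern in the two suffix outputs in this notebook can be modeled in plain Python. This is a sketch of the observed behavior only, not PyStarburst's actual implementation: `suffixed_columns` is a hypothetical helper, it deliberately skips the no-suffix case (where the generated prefixes are random), and it assumes an lsuffix-only join behaves symmetrically to the rsuffix-only join shown above.

```python
def suffixed_columns(left_cols, right_cols, lsuffix="", rsuffix=""):
    """Model the join's output column names when at least one suffix is given."""
    if not lsuffix and not rsuffix:
        raise ValueError("this model only covers joins with lsuffix and/or rsuffix")
    shared = set(left_cols) & set(right_cols)
    # only names present on both sides get a suffix; an empty suffix leaves the name as-is
    left_out = [c + lsuffix if c in shared else c for c in left_cols]
    right_out = [c + rsuffix if c in shared else c for c in right_cols]
    return left_out + right_out


# column names copied by hand from orders and custs
order_cols = ["order_id", "cust_id", "total_price", "notes"]
cust_cols = ["cust_id", "name", "notes"]

print(suffixed_columns(order_cols, cust_cols, rsuffix="_c"))
# ['order_id', 'cust_id', 'total_price', 'notes', 'cust_id_c', 'name', 'notes_c']

print(suffixed_columns(order_cols, cust_cols, lsuffix="_o", rsuffix="_c"))
# ['order_id', 'cust_id_o', 'total_price', 'notes_o', 'cust_id_c', 'name', 'notes_c']
```

Since the duplicated join key carries the same values on both sides, the suffixed copy (`cust_id_c` here) can be removed afterwards with `.drop(...)`, as was done with `id` earlier in the notebook.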

