### Overview

When working with data, you will often need to rename fields. For example, if you are preparing data so that modelers can recalibrate an existing model, some columns will need to be renamed to match the names the modeler's script expects.

Start by creating a DataFrame:

In [0]:
sdf = spark.createDataFrame(
    data=[
        (1001, 'Chicago', 535),
        (1002, 'Boston', 495),
        (1003, 'Seattle', 318),
    ],  # row values
    schema=['station_id', 'city', 'rainfall'],  # column names
)


# Show up to 5 rows without truncating values.
# .show() prints the table itself and returns None, so it is not wrapped in print().
sdf.show(n=5, truncate=False)

# Print the schema (.printSchema() also prints directly and returns None)
sdf.printSchema()

+----------+-------+--------+
|station_id|city   |rainfall|
+----------+-------+--------+
|1001      |Chicago|535     |
|1002      |Boston |495     |
|1003      |Seattle|318     |
+----------+-------+--------+

root
 |-- station_id: long (nullable = true)
 |-- city: string (nullable = true)
 |-- rainfall: long (nullable = true)



### Renaming Columns

#### .withColumnRenamed() function

This method returns a new Spark DataFrame with the column renamed, without dropping any other columns or data. The original DataFrame is left unchanged.

In [0]:
sdf_renamed_columns = sdf.withColumnRenamed('city','city_US')
print('Renamed Column:')
sdf_renamed_columns.show()
print('Original Column:')
sdf.show()

Renamed Column:
+----------+-------+--------+
|station_id|city_US|rainfall|
+----------+-------+--------+
|      1001|Chicago|     535|
|      1002| Boston|     495|
|      1003|Seattle|     318|
+----------+-------+--------+

Original Column:
+----------+-------+--------+
|station_id|   city|rainfall|
+----------+-------+--------+
|      1001|Chicago|     535|
|      1002| Boston|     495|
|      1003|Seattle|     318|
+----------+-------+--------+
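
When several columns need new names, the same call can be applied once per column. The following is a minimal sketch, not a built-in Spark function: `rename_columns` and `mapping` are hypothetical names introduced here for illustration. It relies only on `.withColumnRenamed()` returning a new DataFrame on each call.

```python
def rename_columns(df, mapping):
    """Apply every old -> new pair in `mapping` to `df` and return the result.

    `rename_columns` and `mapping` are illustrative names, not part of PySpark.
    """
    renamed = df
    for old_name, new_name in mapping.items():
        # Each call returns a new DataFrame; the input DataFrame is never modified.
        renamed = renamed.withColumnRenamed(old_name, new_name)
    return renamed
```

With the DataFrame above, `rename_columns(sdf, {'city': 'city_US', 'station_id': 'station'})` would rename both columns and keep `rainfall` as it is. Note that `.withColumnRenamed()` is a no-op when the old name is not in the schema, so a misspelled key is silently ignored. Spark 3.4 and later also provide `DataFrame.withColumnsRenamed()`, which accepts such a dictionary directly.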



#### .selectExpr() function

This method retains only the columns explicitly named in the call, so any column you do not list is dropped from the result.

In [0]:
sdf_selectExpr = sdf.selectExpr("city as city_US")
print('Renamed Column:')
sdf_selectExpr.show()
print('Original Column:')
sdf.show()

Renamed Column:
+-------+
|city_US|
+-------+
|Chicago|
| Boston|
|Seattle|
+-------+

Original Column:
+----------+-------+--------+
|station_id|   city|rainfall|
+----------+-------+--------+
|      1001|Chicago|     535|
|      1002| Boston|     495|
|      1003|Seattle|     318|
+----------+-------+--------+
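
If you want `.selectExpr()` to rename some columns while keeping the rest, every column has to be listed. One way to avoid typing them all is to generate the expressions from the column list. This is a sketch under that assumption: `build_rename_exprs` is a hypothetical helper, and it backtick-quotes each name so that names containing spaces or other special characters remain valid in Spark SQL.

```python
def build_rename_exprs(columns, mapping):
    """Build one selectExpr() expression per column.

    Columns found in `mapping` become "`old` as `new`"; all others pass through.
    `build_rename_exprs` is an illustrative helper, not part of PySpark.
    """
    exprs = []
    for name in columns:
        if name in mapping:
            exprs.append(f"`{name}` as `{mapping[name]}`")
        else:
            exprs.append(f"`{name}`")
    return exprs


exprs = build_rename_exprs(['station_id', 'city', 'rainfall'], {'city': 'city_US'})
print(exprs)
# ['`station_id`', '`city` as `city_US`', '`rainfall`']
```

The expressions can then be unpacked into the call, for example `sdf.selectExpr(*build_rename_exprs(sdf.columns, {'city': 'city_US'}))`, which keeps all three columns and renames only `city`.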



#### .select() + .alias() functions

Like .selectExpr(), .select() retains only the columns explicitly listed in the call. To rename a column this way, wrap the original column name "city" in the col() function and call .alias() on the result. This method is situational. For example, if you need a subset of columns from a large dataset to train a model, you can use .select() + .alias() to pick the columns you need and rename them to match the model specs in a single command.

In [0]:
import pyspark.sql.functions as f

sdf_selectAlias = sdf.select(f.col("city").alias('city_US'))
print('Renamed Column:')
sdf_selectAlias.show()
print('Original Column:')
sdf.show()

Renamed Column:
+-------+
|city_US|
+-------+
|Chicago|
| Boston|
|Seattle|
+-------+

Original Column:
+----------+-------+--------+
|station_id|   city|rainfall|
+----------+-------+--------+
|      1001|Chicago|     535|
|      1002| Boston|     495|
|      1003|Seattle|     318|
+----------+-------+--------+
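
The subset-and-rename use case described above can be sketched as a small helper. `select_and_rename` and `mapping` are hypothetical names used for illustration. The sketch references each column as `df[old_name]` rather than `f.col(old_name)`; both return a Column, so `.alias()` works the same way on either, and this form keeps the helper free of extra imports.

```python
def select_and_rename(df, mapping):
    """Keep only the columns named in `mapping`, renaming each old -> new.

    `select_and_rename` and `mapping` are illustrative names, not part of PySpark.
    """
    # df[old_name] returns a Column, so .alias(new_name) can be called on it.
    aliased = [df[old_name].alias(new_name) for old_name, new_name in mapping.items()]
    # Columns that are not listed in `mapping` are dropped from the result.
    return df.select(*aliased)
```

For example, `select_and_rename(sdf, {'city': 'city_US', 'rainfall': 'rainfall_total'})` would return a two-column DataFrame and drop `station_id`; the name `rainfall_total` is only a placeholder for whatever the model specs require.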

