### Overview

Formatting data is a critical part of data preparation. Choosing appropriate column data types can make queries run faster, reduce storage costs, and prevent downstream system and modeling errors. <br> <br>
<b> Documentation: </b> https://spark.apache.org/docs/latest/sql-ref-datatypes.html

Start by creating a DataFrame and inspecting the schema that Spark infers for it:

In [0]:
sdf = spark.createDataFrame(
    data=[
        (1001, 'Chicago', 535),
        (1002, 'Boston', 495),
        (1003, 'Seattle', 318),
    ],
    schema=['station_id', 'city', 'rainfall'],
)


sdf.printSchema()

sdf.show()

root
 |-- station_id: long (nullable = true)
 |-- city: string (nullable = true)
 |-- rainfall: long (nullable = true)

+----------+-------+--------+
|station_id|   city|rainfall|
+----------+-------+--------+
|      1001|Chicago|     535|
|      1002| Boston|     495|
|      1003|Seattle|     318|
+----------+-------+--------+



### Reformatting Columns

#### .withColumn() and pyspark.sql.types classes:

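Calling .withColumn() with an existing column name replaces that column with the cast version, changing its data type. No columns are renamed and all other columns in the DataFrame are kept.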

In [0]:
from pyspark.sql.types import FloatType, StringType

# cast station_id from long to string; city and rainfall are unchanged
sdf_cast_columns = sdf.withColumn('station_id', sdf['station_id'].cast(StringType()))
sdf_cast_columns.printSchema()

# cast station_id from long to float, starting again from the original sdf
sdf_cast_columns = sdf.withColumn('station_id', sdf['station_id'].cast(FloatType()))
sdf_cast_columns.printSchema()

root
 |-- station_id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- rainfall: long (nullable = true)

root
 |-- station_id: float (nullable = true)
 |-- city: string (nullable = true)
 |-- rainfall: long (nullable = true)



#### .selectExpr() with SQL cast() expressions:

.selectExpr() returns only the columns explicitly listed in the call. In the example below, city is dropped because it is not listed.

In [0]:
sql_style_casting = sdf.selectExpr('cast(station_id as long)', 'cast(rainfall as float)')

sql_style_casting.printSchema()

root
 |-- station_id: long (nullable = true)
 |-- rainfall: float (nullable = true)

