## Padding Characters around Strings
Let us go through how to pad strings with characters using Spark functions.

* We typically pad strings to build fixed-length values or records.
* Fixed-length values or records are used extensively in mainframe-based systems.
* The length of every field in a fixed-length record is predetermined. If a value is shorter than that length, we pad it with a standard character.
* Numeric fields are padded with zeros on the leading (left) side. Non-numeric fields are padded with a standard character on either the leading or the trailing side.
* We use `lpad` to pad a string with a specific character on the leading (left) side and `rpad` to pad it on the trailing (right) side.
* Both `lpad` and `rpad` take 3 arguments: the column or expression, the desired length, and the character to pad with.
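
The padding rules above can be sketched in plain Python, without a Spark session. The helpers `lpad_py` and `rpad_py` are hypothetical names used only for illustration and are not part of PySpark. The sketch assumes a non-empty pad string and follows the documented behaviour of Spark's `lpad` and `rpad`, which also shorten a value that is longer than the requested length.

```python
def lpad_py(value: str, length: int, pad: str) -> str:
    """Pad `value` on the left with `pad` until it is `length` characters long."""
    if len(value) >= length:
        # A value longer than the target is cut down to `length` characters.
        return value[:length]
    # Repeat the pad string and keep only as many characters as are missing.
    fill = (pad * length)[: length - len(value)]
    return fill + value


def rpad_py(value: str, length: int, pad: str) -> str:
    """Pad `value` on the right with `pad` until it is `length` characters long."""
    if len(value) >= length:
        return value[:length]
    fill = (pad * length)[: length - len(value)]
    return value + fill


print(lpad_py("1", 5, "0"))       # 00001
print(rpad_py("Scott", 10, "-"))  # Scott-----
print(lpad_py("Hello", 3, "-"))   # Hel
```

The last call is the case to watch for in fixed-length records: a value that does not fit its field is silently shortened, so field lengths should be chosen with the longest expected value in mind.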

### Tasks - Padding Strings

Let us perform simple tasks to understand the syntax of `lpad` and `rpad`.
* Create a DataFrame with a single row and a single column.
* Apply `lpad` to left-pad `Hello` with `-` so that the result is 10 characters long.

In [1]:
l = [('X',)]

In [2]:
df = spark.createDataFrame(l).toDF("dummy")

In [3]:
from pyspark.sql.functions import lit, lpad

In [4]:
df.select(lpad(lit("Hello"), 10, "-").alias("dummy")).show()

+----------+
|     dummy|
+----------+
|-----Hello|
+----------+




* Let us create the **employees** DataFrame.

In [5]:
employees = [
    (1, "Scott", "Tiger", 1000.0,
     "united states", "+1 123 456 7890", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0,
     "India", "+91 234 567 8901", "456 78 9123"),
    (3, "Nick", "Junior", 750.0,
     "united KINGDOM", "+44 111 111 1111", "222 33 4444"),
    (4, "Bill", "Gomes", 1500.0,
     "AUSTRALIA", "+61 987 654 3210", "789 12 6118"),
]

In [6]:
employeesDF = spark.createDataFrame(employees). \
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )

In [7]:
employeesDF.show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [8]:
employeesDF.printSchema()

root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- nationality: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- ssn: string (nullable = true)



* Use the **pad** functions to convert each field to a fixed length and concatenate the results. Here are the details for each field.
  * employee_id should be 5 characters long, padded with zeros on the left side.
  * first_name and last_name should each be 10 characters long, padded with - on the right side.
  * salary should be 10 characters long, padded with zeros on the left side.
  * nationality should be 15 characters long, padded with - on the right side.
  * phone_number should be 17 characters long, padded with - on the right side.
  * ssn can be left as is. It is already 11 characters long.
* Create a new DataFrame **empFixedDF** with a single column named **employee**. Preview the data with truncation disabled.
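
Before writing the Spark expression, it can help to write the layout down as data and check the total record width. The following is a minimal plain-Python sketch. `LAYOUT` and `field_offsets` are hypothetical names introduced here for illustration, and the widths are copied from the list above.

```python
# Field widths from the task description; ssn keeps its natural 11 characters.
LAYOUT = [
    ("employee_id", 5),
    ("first_name", 10),
    ("last_name", 10),
    ("salary", 10),
    ("nationality", 15),
    ("phone_number", 17),
    ("ssn", 11),
]


def field_offsets(layout):
    """Return (name, start, end) for each field; `end` is exclusive."""
    offsets = []
    start = 0
    for name, width in layout:
        offsets.append((name, start, start + width))
        start += width
    return offsets


OFFSETS = field_offsets(LAYOUT)
RECORD_WIDTH = OFFSETS[-1][2]  # end offset of the last field

for name, start, end in OFFSETS:
    print(f"{name:<12} {start:>2}-{end:>2}")
print(RECORD_WIDTH)  # 78
```

Every record in the concatenated column should therefore be exactly 78 characters long, which is an easy property to verify on the result.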

In [9]:
from pyspark.sql.functions import lpad, rpad, concat

In [10]:
empFixedDF = employeesDF.select(
    concat(
        lpad("employee_id", 5, "0"), 
        rpad("first_name", 10, "-"), 
        rpad("last_name", 10, "-"),
        lpad("salary", 10, "0"), 
        rpad("nationality", 15, "-"), 
        rpad("phone_number", 17, "-"), 
        "ssn"
    ).alias("employee")
)

In [11]:
empFixedDF.show(truncate=False)

+------------------------------------------------------------------------------+
|employee                                                                      |
+------------------------------------------------------------------------------+
|00001Scott-----Tiger-----00001000.0united states--+1 123 456 7890--123 45 6789|
|00002Henry-----Ford------00001250.0India----------+91 234 567 8901-456 78 9123|
|00003Nick------Junior----00000750.0united KINGDOM-+44 111 111 1111-222 33 4444|
|00004Bill------Gomes-----00001500.0AUSTRALIA------+61 987 654 3210-789 12 6118|
+------------------------------------------------------------------------------+
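
Because every field has a known width, a consumer can split such a record back into fields by position alone. The sketch below does this in plain Python for the first row of the output above. `WIDTHS` and `split_record` are hypothetical names used for illustration. Stripping the pad character is an assumption that only holds when real values never end with `-`.

```python
# First record from the output above.
RECORD = (
    "00001Scott-----Tiger-----00001000.0"
    "united states--+1 123 456 7890--123 45 6789"
)

# Field widths in record order, as defined in the task description.
WIDTHS = [
    ("employee_id", 5),
    ("first_name", 10),
    ("last_name", 10),
    ("salary", 10),
    ("nationality", 15),
    ("phone_number", 17),
    ("ssn", 11),
]


def split_record(record, widths):
    """Slice a fixed-length record into a dict of raw (still padded) fields."""
    fields = {}
    start = 0
    for name, width in widths:
        fields[name] = record[start:start + width]
        start += width
    return fields


raw = split_record(RECORD, WIDTHS)

# Undo the padding: numeric fields are converted, text fields lose trailing "-".
employee = {
    "employee_id": int(raw["employee_id"]),
    "first_name": raw["first_name"].rstrip("-"),
    "last_name": raw["last_name"].rstrip("-"),
    "salary": float(raw["salary"]),
    "nationality": raw["nationality"].rstrip("-"),
    "phone_number": raw["phone_number"].rstrip("-"),
    "ssn": raw["ssn"],
}

print(employee["first_name"], employee["salary"])  # Scott 1000.0
```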

