## Padding and Trimming Strings
Let us go through how to pad strings with characters or trim unwanted characters from strings using Spark functions.

* Padding Characters to strings
  * We typically pad strings to build fixed-length values or records.
  * Fixed-length values or records are used extensively in mainframe-based systems.
  * The length of every field in a fixed-length record is predetermined. If a value is shorter than the predetermined length, we pad it with a standard character.
  * Numeric fields are padded with zeros on the leading (left) side. Non-numeric fields are padded with some standard character on the leading or trailing side.
  * We use `lpad` to pad a string with a specific character on the leading (left) side and `rpad` to pad on the trailing (right) side.
  * Both `lpad` and `rpad` take 3 arguments: the column or expression, the desired length, and the character to pad with.
* Trimming Characters around strings
  * We typically use trimming to remove unnecessary characters from fixed-length records.
  * Fixed-length records are used extensively in mainframes, and we might have to process them using Spark.
  * As part of processing, we might want to remove leading or trailing characters, such as 0 for numeric types and spaces or some standard character for alphanumeric types.
  * As of now, the Spark trim functions take the column as an argument and remove leading or trailing spaces.
  * Trim spaces towards left - `ltrim`
  * Trim spaces towards right - `rtrim`
  * Trim spaces on both sides - `trim`
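
The padding rules above can be sketched in plain Python. This is only an illustrative model of what `lpad` and `rpad` do to a single string value, not Spark code: the helper names `lpad_model` and `rpad_model` are hypothetical, and the sketch assumes a positive length and a non-empty pad string. It also shows one detail that is easy to miss: a value longer than the desired length is shortened to that length.

```python
def lpad_model(value: str, length: int, pad: str) -> str:
    """Hypothetical plain-Python model of lpad (positive length, non-empty pad)."""
    if len(value) >= length:
        # A value longer than the target is shortened to `length` characters.
        return value[:length]
    missing = length - len(value)
    # Repeat the pad string and cut it to exactly the number of missing characters.
    filler = (pad * missing)[:missing]
    return filler + value


def rpad_model(value: str, length: int, pad: str) -> str:
    """Hypothetical plain-Python model of rpad (positive length, non-empty pad)."""
    if len(value) >= length:
        return value[:length]
    missing = length - len(value)
    filler = (pad * missing)[:missing]
    return value + filler


print(lpad_model("7", 5, "0"))        # 00007
print(rpad_model("Scott", 10, "-"))   # Scott-----
print(lpad_model("1234567", 5, "0"))  # 12345
```

In Spark itself the same rules are applied to every row of a column rather than to a single Python string.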

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

### Tasks - Padding Strings

Let us perform simple tasks to understand the syntax of `lpad` or `rpad`.
* Create a Dataframe with single value and single column.
* Apply `lpad` to pad `Hello` with `-` on the left side so that it becomes 10 characters long.

In [None]:
l = [('X',)]

In [None]:
df = spark.createDataFrame(l).toDF("dummy")

In [None]:
from pyspark.sql.functions import lit, lpad

In [None]:
df.select(lpad(lit("Hello"), 10, "-").alias("dummy")).show()

* Let's take the **employees** Dataframe

In [None]:
employees = [(1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                ]

In [None]:
employeesDF = spark.createDataFrame(employees). \
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )

* Use the **pad** functions to convert each of the fields to a fixed length and concatenate them. Here are the details for each of the fields.
  * Length of the employee_id should be 5 characters and it should be padded with zero on the left side.
  * Length of first_name and last_name should be 10 characters each and they should be padded with - on the right side.
  * Length of salary should be 10 characters and it should be padded with zero on the left side.
  * Length of the nationality should be 15 characters and it should be padded with - on the right side.
  * Length of the phone_number should be 17 characters and it should be padded with - on the right side.
  * Length of the ssn can be left as is. It is 11 characters.
* Create a new Dataframe **empFixedDF** with column name **employee**. Preview the data by disabling truncate.

In [None]:
from pyspark.sql.functions import lpad, rpad, concat

In [None]:
empFixedDF = employeesDF.select(
    concat(
        lpad("employee_id", 5, "0"), 
        rpad("first_name", 10, "-"), 
        rpad("last_name", 10, "-"),
        lpad("salary", 10, "0"), 
        rpad("nationality", 15, "-"), 
        rpad("phone_number", 17, "-"), 
        "ssn"
    ).alias("employee")
)

In [None]:
empFixedDF.show(truncate=False)
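
As a sanity check on the output, the layout from the task description can be reproduced for a single record in plain Python. This is a rough sketch, not part of the Spark solution: `LAYOUT` and `to_fixed_width` are hypothetical names, and it assumes the salary is rendered as the text `1000.0` before padding. Every record should be 5 + 10 + 10 + 10 + 15 + 17 + 11 = 78 characters long.

```python
# Padded fields from the task description: (name, width, pad character, side).
LAYOUT = [
    ("employee_id", 5, "0", "left"),
    ("first_name", 10, "-", "right"),
    ("last_name", 10, "-", "right"),
    ("salary", 10, "0", "left"),
    ("nationality", 15, "-", "right"),
    ("phone_number", 17, "-", "right"),
]


def to_fixed_width(record: dict) -> str:
    """Build one fixed-length record; hypothetical helper for illustration only."""
    parts = []
    for name, width, pad_char, side in LAYOUT:
        text = str(record[name])
        # rjust pads on the left, ljust pads on the right.
        # Unlike lpad and rpad, they never shorten values; the sample values all fit.
        if side == "left":
            parts.append(text.rjust(width, pad_char))
        else:
            parts.append(text.ljust(width, pad_char))
    # The ssn is already 11 characters long and is appended as is.
    parts.append(record["ssn"])
    return "".join(parts)


scott = {
    "employee_id": 1, "first_name": "Scott", "last_name": "Tiger",
    "salary": 1000.0, "nationality": "united states",
    "phone_number": "+1 123 456 7890", "ssn": "123 45 6789",
}

fixed = to_fixed_width(scott)
print(fixed)
# 00001Scott-----Tiger-----00001000.0united states--+1 123 456 7890--123 45 6789
print(len(fixed))  # 78
```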

### Tasks - Trimming Strings

Let us understand how to use trim functions to remove spaces on left or right or both.
* Create a Dataframe with one column and one record.
* Apply trim functions to trim spaces.

In [None]:
from pyspark.sql.functions import ltrim, rtrim, trim

In [None]:
l = [("   Hello.    ",) ]

In [None]:
df = spark.createDataFrame(l).toDF("dummy")

In [None]:
from pyspark.sql.functions import col, ltrim, rtrim, trim

In [None]:
df.withColumn("ltrim", ltrim(col("dummy"))). \
  withColumn("rtrim", rtrim(col("dummy"))). \
  withColumn("trim", trim(col("dummy"))). \
  show()
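
For comparison, here is a plain-Python sketch of the results to expect for the sample value. It is not Spark code, and it assumes that only the space character needs to be removed, which is why `" "` is passed explicitly instead of relying on Python's default whitespace handling. The brackets only make the remaining spaces visible. The last two lines illustrate, again in plain Python, the kind of padding removal that fixed-length records often need, such as leading zeros or trailing `-` characters.

```python
value = "   Hello.    "

# Plain-Python equivalents of the three trim variants, removing spaces only.
ltrimmed = value.lstrip(" ")   # leading spaces removed
rtrimmed = value.rstrip(" ")   # trailing spaces removed
trimmed = value.strip(" ")     # spaces removed on both sides

print(f"[{ltrimmed}]")  # [Hello.    ]
print(f"[{rtrimmed}]")  # [   Hello.]
print(f"[{trimmed}]")   # [Hello.]

# Removing other padding characters from fixed-length fields (Python only).
employee_id = "00001".lstrip("0")       # leading zeros removed
first_name = "Scott-----".rstrip("-")   # trailing dashes removed

print(employee_id)  # 1
print(first_name)   # Scott
```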
