## String Manipulation Functions

Let us go through some of the Spark predefined functions which are used for manipulating strings.

### Starting Spark Context

Let us start the Spark session (and with it the Spark context) for this notebook so that we can execute the code provided.

In [None]:
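import org.apache.spark.sql.SparkSession

// Port "0" lets Spark pick any free port for its UI.
// master("yarn") assumes a YARN cluster; use "local[*]" to run on a single machine.
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    appName("Processing Column Data").
    master("yarn").
    getOrCreate

// Needed for toDF on local collections if the notebook kernel does not import it automatically.
import spark.implicits._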

In [None]:
spark

### Case Conversion and Length
Let us check the functions which convert the case of string column values and the function which gets their length.
* Convert all the alphabetic characters in a string to **uppercase** - `upper`
* Convert all the alphabetic characters in a string to **lowercase** - `lower`
* Convert the first character of each word in a string to **uppercase** and the remaining characters to lowercase - `initcap`
* Get **number of characters in a string** - `length`
* All 4 functions take an argument of type `Column`.

#### Tasks

Let us perform tasks to understand the behavior of case conversion functions and length.

* Use employees data and create a Data Frame.
* Apply all 4 functions on **nationality** and see the results.

In [None]:
val employees = List((1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                    )

In [None]:
val employeesDF = employees.
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )
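The task above can be sketched as follows. This is a minimal example, not the notebook's original solution: it assumes a `local[*]` master for standalone runs (inside the notebook `getOrCreate` simply reuses the existing session), and `nationalityDF` is a hypothetical stand-in that holds only the **nationality** values of `employeesDF`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, initcap, length, lower, upper}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("case-demo").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the nationality column of employeesDF.
val nationalityDF = List("united states", "India", "united KINGDOM", "AUSTRALIA").
    toDF("nationality")

// Apply all 4 functions to the same column and keep the original for comparison.
val caseDF = nationalityDF.select(
    col("nationality"),
    upper(col("nationality")).alias("nationality_upper"),
    lower(col("nationality")).alias("nationality_lower"),
    initcap(col("nationality")).alias("nationality_initcap"),
    length(col("nationality")).alias("nationality_length")
)

caseDF.show(false)
```

The same `select` should work unchanged on `employeesDF`, since it only refers to the **nationality** column.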

### Using substring

Let us understand how we can extract substrings using the function `substring`.
* If we are processing **fixed length columns** then we use `substring` to extract the information.
* Here are some examples of **fixed length columns** and the use cases for which we typically extract information.
  * 9 digit Social Security Number. We typically extract the last 4 digits and provide them to tele-verification applications.
  * 16 digit Credit Card Number. We typically use the first 4 digits to identify the Credit Card Provider and the last 4 digits for tele-verification.
  * Data coming from Mainframe systems is quite often fixed length. We might have to extract the information and store it in multiple columns.
* The `substring` function takes 3 arguments: **column**, **position**, and **length**. The position is 1-based, and we can count from the end of the string by passing a negative value.


In [None]:
val s = "Hello World"

In [None]:
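// Scala's String.substring takes (beginIndex, endIndex), 0-based with exclusive end: returns "Hello"
s.substring(0, 5)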

In [None]:
// Characters at indexes 1, 2 and 3: returns "ell"
s.substring(1, 4)

In [None]:
val l = List("X")

In [None]:
val df = l.toDF("dummy")
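Unlike Scala's `String.substring`, Spark's `substring` function takes a 1-based position and a length. The sketch below applies it to a literal using the single-row dummy Data Frame. It assumes a `local[*]` master for standalone runs; the column aliases are illustrative names only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, substring}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("substring-demo").getOrCreate()
import spark.implicits._

// Single-row Data Frame, used only so that we can evaluate expressions on literals.
val df = List("X").toDF("dummy")

val subDF = df.select(
    substring(lit("Hello World"), 1, 5).alias("first_five"),    // Hello
    substring(lit("Hello World"), 7, 5).alias("from_seventh"),  // World
    substring(lit("Hello World"), -5, 5).alias("last_five")     // World
)

subDF.show()
```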

#### Tasks

Let us perform a few tasks to extract information from fixed length strings.
* Create a list for employees with name, ssn and phone_number.
* SSN Format **3 2 4** - Fixed Length with 9 digits
* Phone Number Format - Country Code is variable and the rest of the phone number has 10 digits:
  * Country Code - 1 to 3 digits
  * Area Code - 3 digits
  * Phone Number Prefix - 3 digits
  * Phone Number Remaining - 4 digits
  * All the 4 parts are separated by spaces
* Create a Dataframe with column names name, ssn and phone_number
* Extract last 4 digits from the phone number.
* Extract last 4 digits from SSN.

In [None]:
val employees = List((1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                    )

In [None]:
val employeesDF = employees.
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )
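One possible way to extract the last 4 digits is to pass a negative position to `substring`. This is a sketch under assumptions: `contactsDF` is a hypothetical two-row stand-in with only the **phone_number** and **ssn** columns, and the new column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("substring-tasks").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the phone_number and ssn columns of employeesDF.
val contactsDF = List(
    ("+1 123 456 7890", "123 45 6789"),
    ("+91 234 567 8901", "456 78 9123")
).toDF("phone_number", "ssn")

// Position -4 starts 4 characters before the end, so the country code length does not matter.
val last4DF = contactsDF.
    withColumn("phone_last4", substring(col("phone_number"), -4, 4)).
    withColumn("ssn_last4", substring(col("ssn"), -4, 4))

last4DF.show(false)
```

The extracted values are still strings; they can be cast to a numeric type if the downstream logic needs numbers.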

### Using split
Let us understand how we can extract substrings using `split`.
* If we are processing **variable length columns** with a **delimiter** then we use `split` to extract the information.
* Here are some examples of **variable length columns** and the use cases for which we typically extract information.
  * Address where we store House Number, Street Name, City, State and Zip Code comma separated. We might want to extract City and State for demographics reports.
* `split` takes 2 arguments, **column** and **delimiter**. The delimiter is interpreted as a regular expression pattern.
* `split` converts each string into an array and we can access the elements using a 0-based index.

In [None]:
val l = List("X")

In [None]:
val df = l.toDF("dummy")
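As a minimal sketch, `split` can be tried on a literal with the dummy Data Frame. It assumes a `local[*]` master for standalone runs, and the sample sentence and aliases are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, split}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("split-demo").getOrCreate()
import spark.implicits._

// Single-row Data Frame, used only so that we can evaluate expressions on literals.
val df = List("X").toDF("dummy")

val splitDF = df.select(
    // The whole array: [Hello, World,, how, are, you]
    split(lit("Hello World, how are you"), " ").alias("words"),
    // Array elements are accessed with a 0-based index, so (2) is the third word: how
    split(lit("Hello World, how are you"), " ")(2).alias("third_word")
)

splitDF.show(false)
```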

* Most of the problems can be solved either by using `substring` or `split`.

#### Tasks
Let us perform a few tasks to extract information from fixed length strings as well as delimited variable length strings.
* Create a list for employees with name, ssn and phone_number.
* SSN Format **3 2 4** - Fixed Length with 9 digits
* Phone Number Format - Country Code is variable and the rest of the phone number has 10 digits:
  * Country Code - 1 to 3 digits
  * Area Code - 3 digits
  * Phone Number Prefix - 3 digits
  * Phone Number Remaining - 4 digits
  * All the 4 parts are separated by spaces
* Create a Dataframe with column names name, ssn and phone_number
* Extract area code and last 4 digits from the phone number.
* Extract last 4 digits from SSN.

In [None]:
val employees = List((1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                    )

In [None]:
val employeesDF = employees.
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )
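Because all parts are separated by spaces, these tasks can be sketched with `split` alone. The example assumes a `local[*]` master for standalone runs; `contactsDF` is a hypothetical stand-in with only the **phone_number** and **ssn** columns, and the new column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("split-tasks").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the phone_number and ssn columns of employeesDF.
val contactsDF = List(
    ("+1 123 456 7890", "123 45 6789"),
    ("+91 234 567 8901", "456 78 9123")
).toDF("phone_number", "ssn")

// Phone parts by index: 0 = country code, 1 = area code, 2 = prefix, 3 = last 4 digits.
val phoneParts = split(col("phone_number"), " ")

val partsDF = contactsDF.
    withColumn("area_code", phoneParts(1)).
    withColumn("phone_last4", phoneParts(3)).
    withColumn("ssn_last4", split(col("ssn"), " ")(2))

partsDF.show(false)
```

Splitting handles the variable length country code naturally, which is harder to do with a fixed position in `substring`.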

### Concatenating Strings
Let us understand how to concatenate strings using the `concat` function.
* We can pass a variable number of string columns to the `concat` function.
* It returns one string which concatenates all of them.
* If we have to concatenate a literal in between, we have to wrap it with the `lit` function.

#### Tasks

Let us perform a few tasks to understand more about the `concat` function.
* Let’s create a Data Frame and explore the `concat` function.

In [None]:
val employees = List((1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                    )

In [None]:
val employeesDF = employees.
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )

In [None]:
employeesDF.show

* Create a new column by name **full_name** concatenating **first_name** and **last_name**.

* Improve it by adding a **comma followed by a space** in between **first_name** and **last_name**.
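Both tasks can be sketched as follows. The example assumes a `local[*]` master for standalone runs; `namesDF` is a hypothetical stand-in with only the name columns, and **full_name_comma** is an illustrative column name for the second variant.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("concat-demo").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the first_name and last_name columns of employeesDF.
val namesDF = List(("Scott", "Tiger"), ("Henry", "Ford")).toDF("first_name", "last_name")

val fullNameDF = namesDF.
    // Plain concatenation: ScottTiger
    withColumn("full_name", concat(col("first_name"), col("last_name"))).
    // The separator is a literal, so it has to be wrapped in lit: Scott, Tiger
    withColumn("full_name_comma", concat(col("first_name"), lit(", "), col("last_name")))

fullNameDF.show(false)
```

Note that `concat` returns null for a row if any of its input columns is null.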


### Padding Characters
Let us understand how to pad characters at the beginning or at the end of strings.
* We typically pad characters to build fixed length values or records.
* Fixed length values or records are extensively used in Mainframes based systems.
* The length of every field in a fixed length record is predetermined. If a value is shorter than the predetermined length, we pad it with a standard character.
* For numeric fields we pad with zero on the leading or left side. For non-numeric fields we pad with some standard character on the leading or trailing side.
* We use `lpad` to pad a string with a specific character on the leading or left side and `rpad` to pad on the trailing or right side.
* Both `lpad` and `rpad` take 3 arguments: column or expression, desired length, and the string to pad with. A value longer than the desired length is shortened to that length.

#### Tasks

Let us perform simple tasks to understand the syntax of `lpad` or `rpad`.
* Create a Dataframe with single value and single column.
* Apply `lpad` to pad **Hello** with `-` on the left side to make it 10 characters.

In [None]:
val l = List("X")

In [None]:
val df = l.toDF("dummy")
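A minimal sketch of the padding task, with `rpad` added for comparison. It assumes a `local[*]` master for standalone runs; the aliases are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, lpad, rpad}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("pad-demo").getOrCreate()
import spark.implicits._

// Single-row Data Frame, used only so that we can evaluate expressions on literals.
val df = List("X").toDF("dummy")

val paddedDF = df.select(
    lpad(lit("Hello"), 10, "-").alias("left_padded"),   // -----Hello
    rpad(lit("Hello"), 10, "-").alias("right_padded")   // Hello-----
)

paddedDF.show()
```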

Let us perform the task to understand how to use pad functions to convert our data into fixed length records.

* Let’s take the **employees** Dataframe

In [None]:
val employees = List((1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                    )

In [None]:
val employeesDF = employees.
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )

* Use **pad** functions to convert each of the fields into fixed length and concatenate them. Here are the details for each of the fields.
  * Length of the employee_id should be 5 characters and should be padded with zero on the left side.
  * Length of first_name and last_name should be 10 characters each and should be padded with - on the right side.
  * Length of salary should be 10 characters and should be padded with zero on the left side.
  * Length of the nationality should be 15 characters and should be padded with - on the right side.
  * Length of the phone_number should be 17 characters and should be padded with - on the right side.
  * Length of the ssn can be left as is. It is 11 characters.
* Create a new Dataframe **empFixedDF** with column name **employee**. Preview the data by disabling truncate.
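One way to build the fixed length record is sketched below. It assumes a `local[*]` master for standalone runs, and `employeesSample` is a hypothetical two-row stand-in for `employeesDF`. The numeric columns are cast to string explicitly before padding; that is a choice made here for clarity, not a requirement stated in the task.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lpad, rpad}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("fixed-length").getOrCreate()
import spark.implicits._

// Hypothetical two-row stand-in with the same columns as employeesDF.
val employeesSample = List(
    (1, "Scott", "Tiger", 1000.0, "united states", "+1 123 456 7890", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0, "India", "+91 234 567 8901", "456 78 9123")
).toDF("employee_id", "first_name", "last_name", "salary",
       "nationality", "phone_number", "ssn")

// Pad every field to its fixed width, then concatenate into one 78 character record.
val empFixedDF = employeesSample.select(
    concat(
        lpad(col("employee_id").cast("string"), 5, "0"),
        rpad(col("first_name"), 10, "-"),
        rpad(col("last_name"), 10, "-"),
        lpad(col("salary").cast("string"), 10, "0"),
        rpad(col("nationality"), 15, "-"),
        rpad(col("phone_number"), 17, "-"),
        col("ssn")
    ).alias("employee")
)

// truncate = false shows the full record instead of cutting it at 20 characters.
empFixedDF.show(false)
```

With these widths every record is 5 + 10 + 10 + 10 + 15 + 17 + 11 = 78 characters long.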


### Trimming Unwanted Characters
Let us understand how to trim unwanted leading and trailing characters around a string.
* We typically use trimming to remove unnecessary characters from fixed length records.
* Fixed length records are extensively used in Mainframes and we might have to process them using Spark.
* As part of processing we might want to remove leading or trailing characters such as 0 in case of numeric types and space or some standard character in case of alphanumeric types.
* By default, the Spark trim functions take the column as argument and remove leading or trailing spaces. Since Spark 2.3, the Scala API also has overloads which take the characters to trim as a second argument.
* Trim spaces towards left - `ltrim`
* Trim spaces towards right - `rtrim`
* Trim spaces on both sides - `trim`

#### Tasks

Let us understand how to use trim functions to remove spaces on left or right or both.
* Create a Dataframe with one column and one record.
* Apply trim functions to trim spaces.

In [None]:
import org.apache.spark.sql.functions.{ltrim, rtrim, trim}

In [None]:
val l = List("   Hello.    ")

In [None]:
val df = l.toDF("dummy")
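The three trim functions can be compared side by side as in the sketch below. It assumes a `local[*]` master for standalone runs; the new column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, ltrim, rtrim, trim}

// Assumption: local master for standalone runs; an existing session is reused.
val spark = SparkSession.builder.master("local[*]").appName("trim-demo").getOrCreate()
import spark.implicits._

// One column, one record, with 3 leading and 4 trailing spaces.
val df = List("   Hello.    ").toDF("dummy")

val trimmedDF = df.
    withColumn("ltrim", ltrim(col("dummy"))).   // leading spaces removed
    withColumn("rtrim", rtrim(col("dummy"))).   // trailing spaces removed
    withColumn("trim", trim(col("dummy")))      // spaces removed on both sides

trimmedDF.show(false)
```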