# PySpark training for data engineers
## 05. Data Enriching

### Goal

Add value to the data by:
* Adding new columns
* Using lambda functions
* Using user defined functions

### Highlights
* `df.withColumn('new_col', Function())` returns a new DataFrame with the column `new_col` added
* `len_fun = udf(lambda z: len(z), IntegerType())` defines a user-defined function (UDF) that returns the length of its input as an integer
* `df = df.withColumn('length_col', len_fun('text_col'))` adds a column `length_col` holding the length of each value in `text_col`

### Implementation

In [3]:
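from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
# Run Spark locally instead of on a cluster
config = SparkConf().setMaster('local')
# Note: `spark` is a SparkContext here, not a SparkSession
spark = SparkContext.getOrCreate(conf=config)
# The SQLContext provides the DataFrame reader used below
sqlContext = SQLContext(spark)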

In [4]:
df = sqlContext.read.parquet('notebook-04-parquet/')

In [6]:
df.show()

+----------+---------+------+---+
|first_name|last_name|gender|age|
+----------+---------+------+---+
|      john|      doe|  male| 32|
|      jake|      doe|  male| 16|
|      jane|      doe|female| 31|
|     janet|      doe|female| 13|
+----------+---------+------+---+



In [8]:
df = df.withColumn('double_age', df.age*2)

In [9]:
df.show()

+----------+---------+------+---+----------+
|first_name|last_name|gender|age|double_age|
+----------+---------+------+---+----------+
|      john|      doe|  male| 32|      64.0|
|      jake|      doe|  male| 16|      32.0|
|      jane|      doe|female| 31|      62.0|
|     janet|      doe|female| 13|      26.0|
+----------+---------+------+---+----------+



Define a user-defined function (UDF) with the `@udf` decorator:

In [19]:
from pyspark.sql.functions import udf

@udf('integer')
def calc_name_length(name):
    return len(name)

In [20]:
df = df.withColumn('name_length', calc_name_length(df.first_name))

In [21]:
df.show()

+----------+---------+------+---+----------+-----------+
|first_name|last_name|gender|age|double_age|name_length|
+----------+---------+------+---+----------+-----------+
|      john|      doe|  male| 32|      64.0|          4|
|      jake|      doe|  male| 16|      32.0|          4|
|      jane|      doe|female| 31|      62.0|          4|
|     janet|      doe|female| 13|      26.0|          5|
+----------+---------+------+---+----------+-----------+



Define a UDF from a lambda function with one input parameter:

In [25]:
from pyspark.sql.types import IntegerType
len_udf_int = udf(lambda z: len(z), IntegerType())

In [26]:
df = df.withColumn('last_name_length', len_udf_int('last_name'))

In [27]:
df.show()

+----------+---------+------+---+----------+-----------+----------------+
|first_name|last_name|gender|age|double_age|name_length|last_name_length|
+----------+---------+------+---+----------+-----------+----------------+
|      john|      doe|  male| 32|      64.0|          4|               3|
|      jake|      doe|  male| 16|      32.0|          4|               3|
|      jane|      doe|female| 31|      62.0|          4|               3|
|     janet|      doe|female| 13|      26.0|          5|               3|
+----------+---------+------+---+----------+-----------+----------------+



Define a UDF from a lambda function with two input parameters:

In [28]:
len_udf_two_int = udf(lambda z,y: len(z)+len(y), IntegerType())

In [31]:
df = df.withColumn('full_name_length', len_udf_two_int('first_name', 'last_name'))

In [32]:
df.show()

+----------+---------+------+---+----------+-----------+----------------+----------------+
|first_name|last_name|gender|age|double_age|name_length|last_name_length|full_name_length|
+----------+---------+------+---+----------+-----------+----------------+----------------+
|      john|      doe|  male| 32|      64.0|          4|               3|               7|
|      jake|      doe|  male| 16|      32.0|          4|               3|               7|
|      jane|      doe|female| 31|      62.0|          4|               3|               7|
|     janet|      doe|female| 13|      26.0|          5|               3|               8|
+----------+---------+------+---+----------+-----------+----------------+----------------+



Remove a column from the DataFrame:

In [39]:
df = df.drop('double_age')
df.show()

+----------+---------+------+---+-----------+----------------+----------------+
|first_name|last_name|gender|age|name_length|last_name_length|full_name_length|
+----------+---------+------+---+-----------+----------------+----------------+
|      john|      doe|  male| 32|          4|               3|               7|
|      jake|      doe|  male| 16|          4|               3|               7|
|      jane|      doe|female| 31|          4|               3|               7|
|     janet|      doe|female| 13|          5|               3|               8|
+----------+---------+------+---+-----------+----------------+----------------+

