# Chapter 06: Working with Different Types of Data

This chapter covers building expressions, the bread and butter of Spark's structured operations, and reviews how to work with a variety of data types, including the following:
- Booleans
- Numbers
- Strings
- Dates and timestamps
- Handling null
- Complex types: arrays, maps, and structs
- User-defined types

## Import necessary libraries and initialize Spark session

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, instr

In [2]:
sparkSession = SparkSession.builder.master("local").appName("Chapter06").getOrCreate()
sparkSession.sparkContext.setLogLevel("ERROR")
sparkSession


## Read the retail data into a DataFrame

In [3]:
df = (
    sparkSession.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retails.csv")
)
df.printSchema()

root
 |-- InvoiceNo: integer (nullable = true)
 |-- StockCode: integer (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



## Converting to Spark Types

Convert native Python values to Spark types with the `lit` function, which wraps a literal in a constant `Column`.

In [4]:
df.select(lit(5), lit("five"), lit(5.0), lit(True)).show()

+---+----+---+----+
|  5|five|5.0|true|
+---+----+---+----+
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
+---+----+---+----+
only showing top 20 rows



## Working with Booleans

> Question: Which transactions involve products whose `Description` contains the word 'BAG' and whose revenue (`UnitPrice * Quantity`) is greater than 1.5?

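### Conventional way

The conventional approach writes the whole condition inline. `where` and `filter` are aliases, so the two cells below return identical rows.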

In [5]:
df.where(col("Description").contains("BAG") & (col("UnitPrice") * col("Quantity") > 1.5)).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



In [6]:
df.filter(col("Description").contains("BAG") & (col("UnitPrice") * col("Quantity") > 1.5)).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



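### Define filter conditions separately

Naming each condition makes the filter easier to read and reuse. The second cell swaps `contains` for an equivalent `instr` check.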

In [7]:
descriptionFilter = col("Description").contains("BAG")
revenueFilter = (col("UnitPrice") * col("Quantity")) > 1.5
df.where(descriptionFilter & revenueFilter).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



In [10]:
descriptionFilter = instr(col("Description"), "BAG") >= 1
revenueFilter = (col("UnitPrice") * col("Quantity")) > 1.5
df.filter(descriptionFilter & revenueFilter).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



## Working with Numbers