# Chapter 06: Working with Different Types of Data

This chapter covers building expressions, which are the bread and butter of Spark's structured operations. It also reviews working with a variety of different kinds of data, including the following:
- Booleans
- Numbers
- Strings
- Dates and timestamps
- Handling null
- Complex types: arrays, maps, and structs
- User-defined types

## Import necessary libraries and initialize Spark session

In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, instr, pow, round, bround, corr

In [2]:
sparkSession = SparkSession.builder.master("local").appName("Chapter06").getOrCreate()
sparkSession.sparkContext.setLogLevel("ERROR")
sparkSession

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/07 00:09:02 WARN Utils: Your hostname, alex-mathew, resolves to a loopback address: 127.0.1.1; using 192.168.1.14 instead (on interface wlo1)
25/12/07 00:09:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/07 00:09:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Read the retail data into a DataFrame

In [3]:
df = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/retails.csv")
df.printSchema()

root
 |-- InvoiceNo: integer (nullable = true)
 |-- StockCode: integer (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



## Converting to Spark Types

Convert native Python values to Spark types using the `lit` function.

In [4]:
df.select(lit(5), lit("five"), lit(5.0), lit(True)).show()

+---+----+---+----+
|  5|five|5.0|true|
+---+----+---+----+
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
|  5|five|5.0|true|
+---+----+---+----+
only showing top 20 rows


## Working with Booleans

> Question: Which transactions involve products whose Description contains the word 'BAG' and have revenue greater than 1.5?

### Conventional way

In [5]:
df.where(col("Description").contains("BAG") & (col("UnitPrice") * col("Quantity") > 1.5)).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



In [6]:
df.filter(col("Description").contains("BAG") & (col("UnitPrice") * col("Quantity") > 1.5)).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



### Define filter conditions separately

In [7]:
descriptionFilter = col("Description").contains("BAG")
revenueFilter = (col("UnitPrice") * col("Quantity")) > 1.5
df.where(descriptionFilter & revenueFilter).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



In [8]:
descriptionFilter = instr(col("Description"), "BAG") >= 1
revenueFilter = (col("UnitPrice") * col("Quantity")) > 1.5
df.filter(descriptionFilter & revenueFilter).show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536376|    20725|LUNCH BAG RED SPOTTY|       6|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536376|    20726|LUNCH BAG BLACK S...|       4|2010-12-01 09:25:00|     1.65|     14911|   Netherlands|
|   536388|    23084|  RECYCLING BAG BLUE|       6|2010-12-01 10:12:00|     1.95|     17850|United Kingdom|
|   536389|    23085| RECYCLING BAG GREEN|       6|2010-12-01 10:15:00|     1.95|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
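
The `instr(...) >= 1` condition works because `instr` reports a 1-based position and returns 0 when the substring is absent. The following is a minimal plain-Python sketch of that behaviour, not Spark code: `instr_like` is a hypothetical helper introduced only for illustration, and the third input string is made up.

```python
from typing import Optional


def instr_like(text: Optional[str], substring: str) -> Optional[int]:
    """Hypothetical stand-in that mimics the position semantics of Spark's instr."""
    if text is None:
        # Spark yields null for a null input; None plays that role here.
        return None
    # str.find is 0-based and returns -1 on a miss, so adding 1 gives
    # a 1-based position where 0 means "not found".
    return text.find(substring) + 1


print(instr_like("LUNCH BAG RED SPOTTY", "BAG"))  # 7
print(instr_like("RECYCLING BAG BLUE", "BAG"))    # 11
print(instr_like("NO MATCH HERE", "BAG"))         # 0 (made-up description)
```

Under these semantics, `>= 1` keeps exactly the rows in which the substring occurs, which is why this cell returns the same rows as the `contains` version.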



## Working with Numbers

Let's imagine that we found out that we mis-recorded the quantity in our retail dataset and the true quantity is equal to `(the current quantity * the unit price)^2 + 5`.

In [10]:
correctQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
df.select("CustomerId", correctQuantity.alias("realQuantity")).show(2)

+----------+------------+
|CustomerId|realQuantity|
+----------+------------+
|     14911|       294.0|
|     14911|      552.56|
+----------+------------+
only showing top 2 rows
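
To make the arithmetic concrete, here is a small plain-Python sketch of the same formula applied to a single pair of values. The helper `corrected_quantity` and the input values are hypothetical and are not taken from the dataset.

```python
def corrected_quantity(quantity: int, unit_price: float) -> float:
    """Apply (quantity * unit_price)^2 + 5 to one pair of values."""
    return (quantity * unit_price) ** 2 + 5


# Hypothetical inputs: (2 * 3.0)^2 + 5 = 36.0 + 5 = 41.0
print(corrected_quantity(2, 3.0))  # 41.0
```

The Spark expression performs this calculation for every row, producing a new column instead of a single number.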


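Another common numerical task is **rounding**. `round` rounds a value that sits exactly halfway up to the next number, whereas `bround` rounds it to the nearest even number, which is why 2.5 becomes 3.0 and 2.0 respectively.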

In [12]:
df.select(round(lit(2.5)).alias("Round up"), bround(lit(2.5)).alias("Round down")).show(5)

+--------+----------+
|Round up|Round down|
+--------+----------+
|     3.0|       2.0|
|     3.0|       2.0|
|     3.0|       2.0|
|     3.0|       2.0|
|     3.0|       2.0|
+--------+----------+
only showing top 5 rows
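
The difference between the two functions only shows up on exact halves. As a rough illustration outside Spark, Python's `decimal` module offers the same two rounding modes (half up and half even). The helper names below are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value: str) -> Decimal:
    """Round to a whole number, sending exact halves away from zero."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def round_half_even(value: str) -> Decimal:
    """Round to a whole number, sending exact halves to the nearest even number."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


for v in ["1.5", "2.5", "3.5"]:
    # Prints: 1.5 2 2, then 2.5 3 2, then 3.5 4 4
    print(v, round_half_up(v), round_half_even(v))
```

Only 2.5 produces different results here, because 1.5 and 3.5 already round to an even number under both modes.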


Another numerical task is to compute the **correlation** of two columns. We can use it to see if cheaper things are typically bought in greater quantities. 

In [14]:
df.select(corr(col("Quantity"), col("UnitPrice"))).show()

+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
|      -0.6196211329694905|
+-------------------------+
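
`corr` computes the Pearson correlation coefficient. The sketch below spells out that formula in plain Python so the sign of the result is easier to interpret. The `pearson` helper and the sample lists are hypothetical and unrelated to the retail data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each series.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)


# Made-up data in which larger quantities go with lower prices.
quantities = [1, 2, 4, 8]
unit_prices = [8.0, 4.0, 2.0, 1.0]
print(pearson(quantities, unit_prices))  # a negative value
```

A coefficient near -1 means the two columns move in opposite directions, so the negative value in the output above is consistent with cheaper items being bought in larger quantities.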



Another common task is to compute summary statistics for a column or set of columns. We can use the `describe` method to achieve exactly this. It takes all numeric and string columns and calculates the **count**, **mean**, **standard deviation**, **min**, and **max**; for string columns, the mean and standard deviation are reported as `NULL`.
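
As a rough illustration of what these five statistics mean for a single numeric column, here is a plain-Python sketch using the standard `statistics` module. The values are hypothetical, and the sketch assumes the reported standard deviation is the sample standard deviation.

```python
import statistics

# Hypothetical quantities, not taken from the dataset.
quantities = [6, 4, 6, 6, 1]

summary = {
    "count": len(quantities),
    "mean": statistics.mean(quantities),
    # Assumption: sample standard deviation (divides by n - 1).
    "stddev": statistics.stdev(quantities),
    "min": min(quantities),
    "max": max(quantities),
}

for name, value in summary.items():
    print(name, value)
```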

In [15]:
df.describe().show()

+-------+-----------------+------------------+--------------------+-----------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|         Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+-----------------+------------------+------------------+--------------+
|  count|               35|                35|                  35|               35|                35|                35|            35|
|   mean|         536385.4|           25814.2|                NULL|6.714285714285714|3.6700000000000013|15056.314285714287|          NULL|
| stddev|8.626907291253694|14796.839342158413|                NULL|6.171907077621141| 2.571826999945187|1977.1051739620643|          NULL|
|    min|           536373|             20725|BLUE COAT RACK PA...|                1|              0.55|             12583|        France|
|    max|           536401|