# Assignment 6

Please answer each question in the corresponding cell below. Your final notebook must include both the code and its output. You can use Saint Peter's [Azure Databricks](https://adb-7130196131129306.6.azuredatabricks.net/?o=7130196131129306#) to do this assignment.

## Questions

### Q1

The Spark [catalog](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.html#pyspark-sql-catalog) holds databases, local and external tables, functions, and so on. Show the current database you are connected to in Spark.

In [0]:
spark.catalog.currentDatabase()

### Q2

List all tables under the current database.

In [0]:
spark.catalog.listTables(spark.catalog.currentDatabase())
# Alternate method (assumes the current database is 'default')
# spark.catalog.listTables('default')

### Q3

Show all the databases available in Spark.

In [0]:
spark.sql('show databases').show(truncate=False)
# Alternate method
# spark.catalog.listDatabases()

### Q4

Using the `sales` table that you discovered in Q2, find the total `Quantity` and the sum of `SaleAmount` for each combination of `State` and `City`.

In [0]:
# spark.catalog.listColumns("sales")

In [0]:
%sql
SELECT State, City, SUM(Quantity) as Total_quantity, SUM(SaleAmount) as Total_sale_amt FROM sales GROUP BY State, City ;

State,City,Total_quantity,Total_sale_amt
Connecticut,Glastonbury,5,705.0
Louisiana,Lafayette,216,31369.999908447266
Colorado,Pueblo West,138,20118.19999694824
Maine,Biddeford,194,26125.7001953125
MA,Attleboro,68,9621.0
MA,Dracut,39,5674.5
Illinois,Saint Charles,376,47497.749755859375
MO,Kansas City,184,25127.200439453125
Connecticut,Waterbury,64,7210.9501953125
MA,Shrewsbury,36,6768.0
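
The grouping logic can be tried without a cluster. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in engine and a handful of made-up rows (they are not taken from the real `sales` table). It runs the same `GROUP BY State, City` aggregation, with an added `ORDER BY` so the result order is deterministic.

```python
import sqlite3

# Hypothetical rows (State, City, Quantity, SaleAmount); not from the real sales table
rows = [
    ("MA", "Attleboro", 2, 100.0),
    ("MA", "Attleboro", 3, 50.5),
    ("MA", "Dracut", 1, 20.0),
    ("Maine", "Biddeford", 4, 10.0),
]

# In-memory database that stands in for the Spark table in this sketch
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (State TEXT, City TEXT, Quantity INTEGER, SaleAmount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Same aggregation as the %sql cell, plus ORDER BY for a stable row order
query = """
SELECT State, City, SUM(Quantity) AS Total_quantity, SUM(SaleAmount) AS Total_sale_amt
FROM sales
GROUP BY State, City
ORDER BY State, City
"""
totals = conn.execute(query).fetchall()
conn.close()

for row in totals:
    print(row)
```

Each distinct (`State`, `City`) pair produces one output row. Without an `ORDER BY`, Spark does not guarantee the order in which those rows come back.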


### Q5

Below are some of the data types of the `sales` DataFrame that you have interacted with. Your colleague found out that some of the data types are not set properly.

```
>>> df.printSchema()
root
 |-- RowID: string (nullable = true)
 |-- OrderID: string (nullable = true)
 |-- OrderDate: string (nullable = true)
 ...
 |-- SaleAmount: float (nullable = true)
 |-- CustomerName: string (nullable = true)
 ...
 |-- WageMargin: string (nullable = true)

```

Convert `OrderDate` to DateType, `SaleAmount` to DecimalType, and `WageMargin` to FloatType.

In [0]:
df = spark.sql("SELECT RowID, OrderID, OrderDate, SaleAmount, CustomerName, WageMargin from sales")
# Alternate method
# df = spark.read.table("default.sales").select('RowID', 'OrderID', 'OrderDate', 'SaleAmount', 'CustomerName', 'WageMargin')
display(df)

RowID,OrderID,OrderDate,SaleAmount,CustomerName,WageMargin
1,3,2010-10-13,1152.0,Muhammed MacIntyre,0.7
2,6,2012-02-20,277.2,Ruben Staebel,0.46
3,32,2011-07-15,3022.5,Liz Greer,0.4
4,32,2011-07-15,2730.0,Liz Greer,0.58
5,32,2011-07-15,3312.0,Liz Greer,0.67
6,32,2011-07-15,2160.0,Liz Greer,0.59
7,35,2011-10-22,3492.0,Julie Knight,0.48
8,35,2011-10-22,2079.0,Julie Knight,0.55
9,36,2011-11-02,6210.0,Sample Manning,0.66
10,65,2011-03-17,4704.0,Tamara O'Brill,0.55


In [0]:
df.count()

In [0]:
df.printSchema()

In [0]:
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import DateType, DecimalType, FloatType 

In [0]:
df_fixed = (
    df.withColumn("OrderDate", to_date(col("OrderDate")))
    .withColumn("SaleAmount", col("SaleAmount").cast(DecimalType(10, 2)))
    .withColumn("WageMargin", col("WageMargin").cast(FloatType()))
)

# Alternate method: cast OrderDate with DateType() instead of to_date()
# df_fixed = (
#     df.withColumn("OrderDate", col("OrderDate").cast(DateType()))
#     .withColumn("SaleAmount", col("SaleAmount").cast(DecimalType(10, 2)))
#     .withColumn("WageMargin", col("WageMargin").cast(FloatType()))
# )

display(df_fixed)

RowID,OrderID,OrderDate,SaleAmount,CustomerName,WageMargin
1,3,2010-10-13,1152.0,Muhammed MacIntyre,0.7
2,6,2012-02-20,277.2,Ruben Staebel,0.46
3,32,2011-07-15,3022.5,Liz Greer,0.4
4,32,2011-07-15,2730.0,Liz Greer,0.58
5,32,2011-07-15,3312.0,Liz Greer,0.67
6,32,2011-07-15,2160.0,Liz Greer,0.59
7,35,2011-10-22,3492.0,Julie Knight,0.48
8,35,2011-10-22,2079.0,Julie Knight,0.55
9,36,2011-11-02,6210.0,Sample Manning,0.66
10,65,2011-03-17,4704.0,Tamara O'Brill,0.55


In [0]:
df_fixed.printSchema()
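
To make the three conversions concrete, here is a minimal plain-Python analogue applied to a single row. It is a sketch, not Spark code: `convert_row` is a hypothetical helper, the rounding mode is an assumption, and Spark's own handling of malformed values may differ. In `DecimalType(10, 2)` the first argument is the total number of digits and the second is the number of digits after the decimal point; the sketch mirrors only the two decimal places.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP


def convert_row(order_date: str, sale_amount: float, wage_margin: str):
    """Plain-Python analogue of the three casts for one row (hypothetical helper)."""
    # 'YYYY-MM-DD' string -> date object
    parsed_date = date.fromisoformat(order_date)
    # Keep two decimal places, like scale=2; the rounding mode is an assumption
    amount = Decimal(str(sale_amount)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    # Numeric string -> float
    margin = float(wage_margin)
    return parsed_date, amount, margin


row = convert_row("2010-10-13", 1152.0, "0.7")
print(row)
```

A fixed-scale decimal stores monetary amounts exactly to the cent, whereas a binary float can only approximate most decimal fractions.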

### Q6

We want to add another column to this dataset: the initials of the customer's first name and surname. There is no built-in function for this, so you have to write one yourself. Register this function as a UDF. This serializes the function and sends it to the executors so that they can transform DataFrame records. Then, using this UDF, create a new column named `CustomerNameInitials`. For example, if the name is `John Smith`, the new value should be `JS`.

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

In [0]:
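@udf(returnType=StringType())
def CustomerNameInitials(name):
    # Return None for null names; split() with no argument skips repeated spaces
    if name is None:
        return None
    return "".join(part[0] for part in name.split())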

In [0]:
df_with_ini1 = df_fixed.withColumn("CustomerNameInitials", CustomerNameInitials(col("CustomerName")))
display(df_with_ini1)

RowID,OrderID,OrderDate,SaleAmount,CustomerName,WageMargin,CustomerNameInitials
1,3,2010-10-13,1152.0,Muhammed MacIntyre,0.7,MM
2,6,2012-02-20,277.2,Ruben Staebel,0.46,RS
3,32,2011-07-15,3022.5,Liz Greer,0.4,LG
4,32,2011-07-15,2730.0,Liz Greer,0.58,LG
5,32,2011-07-15,3312.0,Liz Greer,0.67,LG
6,32,2011-07-15,2160.0,Liz Greer,0.59,LG
7,35,2011-10-22,3492.0,Julie Knight,0.48,JK
8,35,2011-10-22,2079.0,Julie Knight,0.55,JK
9,36,2011-11-02,6210.0,Sample Manning,0.66,SM
10,65,2011-03-17,4704.0,Tamara O'Brill,0.55,TO


In [0]:
# Alternate method
def CustomerNameInitials(name):
    # Return None for null names; split() with no argument skips repeated spaces
    if name is None:
        return None
    return "".join(part[0] for part in name.split())
CustomerNameInitialsUDF = udf(CustomerNameInitials, StringType())
df_with_ini2 = df_fixed.withColumn("CustomerNameInitials", CustomerNameInitialsUDF(col("CustomerName")))
display(df_with_ini2)

RowID,OrderID,OrderDate,SaleAmount,CustomerName,WageMargin,CustomerNameInitials
1,3,2010-10-13,1152.0,Muhammed MacIntyre,0.7,MM
2,6,2012-02-20,277.2,Ruben Staebel,0.46,RS
3,32,2011-07-15,3022.5,Liz Greer,0.4,LG
4,32,2011-07-15,2730.0,Liz Greer,0.58,LG
5,32,2011-07-15,3312.0,Liz Greer,0.67,LG
6,32,2011-07-15,2160.0,Liz Greer,0.59,LG
7,35,2011-10-22,3492.0,Julie Knight,0.48,JK
8,35,2011-10-22,2079.0,Julie Knight,0.55,JK
9,36,2011-11-02,6210.0,Sample Manning,0.66,SM
10,65,2011-03-17,4704.0,Tamara O'Brill,0.55,TO
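
The initials logic can be exercised in plain Python before it is wrapped in a UDF. The following is a minimal sketch, assuming that names may be null or contain extra spaces; `initials` is a hypothetical helper name used only for illustration. A Python UDF receives null column values as `None`, so the function checks for that case explicitly.

```python
from typing import Optional


def initials(name: Optional[str]) -> Optional[str]:
    """Hypothetical helper: first letter of each whitespace-separated part."""
    # A null column value arrives in a Python UDF as None
    if name is None:
        return None
    # split() with no argument ignores leading, trailing, and repeated whitespace
    return "".join(part[0] for part in name.split())


print(initials("John Smith"))       # JS
print(initials("Tamara  O'Brill"))  # TO (the double space is handled)
print(initials(None))               # None
```

Testing the function this way keeps the edge cases visible without needing a running cluster.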


## Notes

When you are done with the assignment, please upload your work to Blackboard. Please make sure you upload:

- The outputs of the code in the notebook
- The PDF version of the notebook

Best of luck!