
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

## Manipulating Data

In this notebook, you will be working with the online retail sales data that you worked with in the Module 3 Lab. This time, you will work with the columns of data that contain `NULL` values and a non-standard date format. 

Run the following queries to learn about how to work with and manage null values and timestamps in Spark SQL. In this notebook, you will:

* Sample a table
* Access individual values from an array
* Reformat values using a padding function
* Concatenate values to match a standard format
* Access parts of a `DateType` value like the month, day, or year

### Getting Started

Run the cell below to set up your classroom environment. 

In [0]:
%run ../Includes/Classroom-Setup

### Create table

Our data is stored as a CSV file. The `OPTIONS` clause gives the path to the data and tells Spark to treat the first row as a header. 

In [0]:
%sql
-- create the table
DROP TABLE IF EXISTS outdoorProductsRaw;
CREATE TABLE outdoorProductsRaw USING csv OPTIONS (
  path "/mnt/training/online_retail/data-001/data.csv",
  header "true"
)

### Describe

Recall that the `DESCRIBE` command tells us about the schema of the table. Notice that all of our columns are string values. 

In [0]:
%sql
-- show the table schema
DESCRIBE outdoorProductsRaw

col_name,data_type,comment
InvoiceNo,string,
StockCode,string,
Description,string,
Quantity,string,
InvoiceDate,string,
UnitPrice,string,
CustomerID,string,
Country,string,


### Sample the table
In the previous reading, you accessed a random sample of rows from a table using the `RAND()` function and `LIMIT` keyword. While this is a common way to retrieve a sample with other SQL dialects, Spark SQL includes a built-in function that you may want to use instead. 

`TABLESAMPLE` lets you return either a fixed number of rows or a percentage of the data. The first cell below uses it to return a specific number of rows; the cell after that uses it to return a given percentage of the data. Note that any table display is limited to 1,000 rows: if the percentage you request returns more than 1,000 rows, only the first 1,000 are shown. 

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The query that samples 2 percent of the table also orders the result by `InvoiceDate`. This exposes a formatting issue in the date column that we will fix later on. Take a moment to see if you can predict how we might need to change the way `InvoiceDate` is written. 
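To build intuition for the percentage form, here is a minimal plain-Python sketch (not Spark code). It assumes that percentage sampling keeps each row independently with the given probability, so the number of rows returned is close to, but not exactly, the requested share. The function name `sample_percent`, the stand-in rows, and the seed are hypothetical and used only for illustration.

```python
import random

def sample_percent(rows, percent, seed=None):
    """Keep each row independently with probability percent / 100 (assumed model)."""
    rng = random.Random(seed)
    fraction = percent / 100.0
    return [row for row in rows if rng.random() < fraction]

rows = list(range(10_000))                  # stand-in for the rows of a table
sample = sample_percent(rows, 2, seed=42)   # hypothetical seed, for repeatability
print(len(sample))                          # close to 200, but usually not exactly 200
```

Because each row is kept or dropped on its own, running the sketch with a different seed returns a different number of rows.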

In [0]:
%sql
-- sample 5 rows from the table
SELECT * FROM outdoorProductsRaw TABLESAMPLE (5 ROWS)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/10 8:26,3.39,17850,United Kingdom


In [0]:
%sql
-- sample 2 percent of the rows, ordered by InvoiceDate
SELECT * FROM outdoorProductsRaw TABLESAMPLE (2 PERCENT) ORDER BY InvoiceDate 

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
540566,21977,PACK OF 60 PINK PAISLEY CAKE CASES,2,1/10/11 10:58,0.55,17811.0,United Kingdom
540568,85099B,JUMBO BAG RED RETROSPOT,10,1/10/11 11:22,1.95,15039.0,United Kingdom
540595,85049C,ROMANTIC PINKS RIBBONS,12,1/10/11 11:35,1.25,14321.0,United Kingdom
540604,22113,GREY HEART HOT WATER BOTTLE,4,1/10/11 11:38,3.75,15326.0,United Kingdom
540639,22615,PACK OF 12 CIRCUS PARADE TISSUES,24,1/10/11 12:28,0.29,13107.0,United Kingdom
540642,20750,RED RETROSPOT MINI CASES,2,1/10/11 13:22,7.95,12681.0,France
540646,84596G,SMALL CHOCOLATES PINK BOWL,3,1/10/11 14:32,0.85,,United Kingdom
540646,22210,WOOD STAMP SET BEST WISHES,1,1/10/11 14:32,1.66,,United Kingdom
540646,22342,HOME GARLAND PAINTED ZINC,2,1/10/11 14:32,1.66,,United Kingdom
540646,22492,MINI PAINT SET VINTAGE,3,1/10/11 14:32,1.66,,United Kingdom


### Check for null values

Run this cell to see the number of `NULL` values in the `Description` column of our table. 

In [0]:
%sql
-- count the NULL values in the Description column
SELECT count(*) FROM outdoorProductsRaw WHERE Description IS NULL;

count(1)
166


### Create a temporary view

The next cell creates the temporary view `outdoorProducts`. By now, you should be familiar with how to create (or replace) a temporary view. This statement also introduces a few new functions. 

This is where we will start to work with the problematic date formatting mentioned previously. Did you notice the inconsistency in your displays? 

Our dates do not have a standard number of digits for months and days. For example, `12/1/10` has a two-digit month and a one-digit day, while `1/10/11` has a one-digit month and a two-digit day. It's easy enough to specify a format to convert a string to a date, but the format must be consistent throughout the table. We begin to fix this problem by separating the components of the date and dropping the time value entirely. 

### Code breakdown 

**`COALESCE`** - This function is common across SQL dialects. It returns its first non-`NULL` argument, so we can use it to replace `NULL` values: wherever `Description` is `NULL`, `COALESCE()` returns the fallback value we supply, which in this case is `"Misc"`. For more information about `COALESCE`, check [the documentation](https://spark.apache.org/docs/latest/api/sql/index.html#coalesce).

**`SPLIT`** - This function splits a string value around a specified character and returns an **array**. An array is a list of values that you can access by position. In this case, the forward slash ("/") is the character we use to split the data. The array is **zero-indexed**, meaning the index of the first position is **0**. The first value in the array is the month, so we access it like this: `SPLIT(InvoiceDate, "/")[0]` and rename the column **`month`**. The day is the second value, and its index is 1. 

The third `SPLIT` is different. Remember that our `InvoiceDate` column is a string that includes both a date and a time. The parts of the date are separated by forward slashes, but the date and the time are separated only by a space. Splitting on the forward slash alone would leave the time attached to the year, so the `year` expression uses a **nested** `SPLIT`: the inner call splits the string on a space delimiter. 

`SPLIT(InvoiceDate, " ")[0]` drops the time from the string and leaves the date intact. The outer call then splits that value on the forward slash delimiter, and we access the year at index 2. Learn more about the `SPLIT` function by accessing [the documentation](https://spark.apache.org/docs/latest/api/sql/#split).
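The same logic can be sketched in plain Python (not Spark code) to make the indexing concrete. The helper names `coalesce` and `split_invoice_date` are hypothetical, and the sketch assumes every input string looks like `12/1/10 8:26`.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

def split_invoice_date(invoice_date):
    """Split a string such as '12/1/10 8:26' into (month, day, year) strings."""
    date_part = invoice_date.split(" ")[0]     # drop the time, like SPLIT(InvoiceDate, " ")[0]
    month, day, year = date_part.split("/")    # positions 0, 1, and 2
    return month, day, year

print(coalesce(None, "Misc"))                  # Misc
print(split_invoice_date("12/1/10 8:26"))      # ('12', '1', '10')
print("12/1/10 8:26".split("/")[2])            # 10 8:26 (the time is still attached)
```

The last line shows why the year needs the nested split: splitting only on the forward slash leaves the time in the third element.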

In [0]:
%sql
-- create a temporary view that splits the date into its parts
CREATE
OR REPLACE TEMPORARY VIEW outdoorProducts AS
SELECT
  InvoiceNo,
  StockCode,
  COALESCE(Description, "Misc") AS Description,
  Quantity,
  InvoiceDate,
  SPLIT(InvoiceDate, "/")[0] month,
  SPLIT(InvoiceDate, "/")[1] day,
  SPLIT(SPLIT(InvoiceDate, " ")[0], "/")[2] year,
  UnitPrice,
  Country
FROM
  outdoorProductsRaw

### Check "Misc" values

We perform a quick sanity check here to demonstrate that all of the `NULL` values in Description have been replaced with the string `"Misc"`. 

In [0]:
%sql
-- count the rows where Description is "Misc"
SELECT count(*) FROM outdoorProducts WHERE Description = "Misc" 

count(1)
166


### Create a new table

Now, we can write a new table with a consistently formatted date string. Notice that this table creation statement has a CTE inside of it. Recall that the CTE starts with a `WITH` clause. 

Notice the `LPAD()` functions in the `padStrings` CTE. [This function](https://spark.apache.org/docs/latest/api/sql/#lpad) inserts characters to the left of a string until the string reaches a certain length. In this example, we use `LPAD` to insert a zero to the left of any value in the month or day column that **is not** two digits. For values that are already two digits, `LPAD` does nothing. 

We use the `padStrings` CTE to standardize the length of the individual date components. When we query the CTE, we use the `CONCAT_WS()` function to put the date string back together. [This function](https://spark.apache.org/docs/latest/api/sql/#concat_ws) returns a concatenated string with a specified separator. In this example, we concatenate values from the month, day, and year columns, and specify that each value should be separated by a forward slash ("/"). 
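As a rough plain-Python analogue (not Spark code), the pad-then-join step looks like the sketch below. The function name `standardize_date` is hypothetical, and `str.rjust` stands in for `LPAD` only for inputs of at most two characters, which is all this example needs.

```python
def standardize_date(month, day, year):
    """Pad month and day to two digits, then join the parts with '/'."""
    parts = [month.rjust(2, "0"), day.rjust(2, "0"), year]   # pad on the left with "0"
    return "/".join(parts)                                   # join with a separator

print(standardize_date("12", "1", "10"))   # 12/01/10
print(standardize_date("1", "10", "11"))   # 01/10/11
```

Values that are already two digits long pass through unchanged, so every resulting date string has the same shape.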

In [0]:
%sql
-- create a table: LPAD pads with zeros on the left, concat_ws joins the parts with a "/" separator
DROP TABLE IF EXISTS standardDate;
CREATE TABLE standardDate

WITH padStrings AS
(
SELECT 
  InvoiceNo,
  StockCode,
  Description,
  Quantity, 
  LPAD(month, 2, "0") AS month,
  LPAD(day, 2, "0") AS day,
  year,
  UnitPrice, 
  Country
FROM outdoorProducts
)
SELECT 
  InvoiceNo,
  StockCode,
  Description,
  Quantity, 
  concat_ws("/", month, day, year) sDate,
  UnitPrice,
  Country
FROM padStrings;

num_affected_rows,num_inserted_rows


### Table check
When we view our new table, we can see that the date field shows two digits each for the month, day, and year. 

In [0]:
%sql
-- preview the new table
SELECT * FROM standardDate LIMIT 5;

InvoiceNo,StockCode,Description,Quantity,sDate,UnitPrice,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/01/10,2.55,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/01/10,3.39,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/01/10,2.75,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/01/10,3.39,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/01/10,3.39,United Kingdom


### Check schema

Oops! All of our values are still strings. The date field would be much more useful as a `DateType`. 

In [0]:
%sql
-- show the table schema
DESCRIBE standardDate;

col_name,data_type,comment
InvoiceNo,string,
StockCode,string,
Description,string,
Quantity,string,
sDate,string,
UnitPrice,string,
Country,string,


### Change to DateType

In the next cell, we create a new temporary view that converts the `sDate` string to a date. The format pattern `MM/dd/yy` tells `to_date` how to read each part of the string: a two-digit month, a two-digit day, and a two-digit year. You can find a complete guide to Spark SQL's Datetime Patterns [here](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html).

Also, we cast the `UnitPrice` as a `DOUBLE` so that we can treat it as a number. 
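For comparison, here is a plain-Python sketch (not Spark code) of the same two conversions. The Python format `%m/%d/%y` is used as an assumed counterpart of the Spark pattern `MM/dd/yy`, and the helper name `to_date` merely mirrors the SQL function for readability.

```python
from datetime import datetime

def to_date(s_date, pattern="%m/%d/%y"):
    """Parse a zero-padded 'MM/dd/yy'-style string into a date object."""
    return datetime.strptime(s_date, pattern).date()

print(to_date("12/01/10"))   # 2010-12-01
print(float("2.55"))         # 2.55, now a number rather than a string
```

Once the value is a real date, its parts are available directly, for example `to_date("12/01/10").month`.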

In [0]:
%sql
-- convert column types with to_date and CAST
CREATE
OR REPLACE TEMPORARY VIEW salesDateFormatted AS
SELECT
  InvoiceNo,
  StockCode,
  to_date(sDate, "MM/dd/yy") date,
  Quantity,
  CAST(UnitPrice AS DOUBLE) AS UnitPrice
FROM
  standardDate

### Visualize Data

We can extract the day of the week and figure out the total quantity of items sold on each day. You can create a quick visual by clicking on the chart icon and creating a bar chart where the key is the `day` and the values are the `totalQuantity`. 

We use the `date_format()` function to map the day to a day of the week. [This function](https://spark.apache.org/docs/latest/api/sql/#date_format) converts a `timestamp` to a `string` in the format specified. For this command, the `"E"` specifies that we want the output to be the day of the week. 
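The grouping can be sketched in plain Python (not Spark code) as follows. The function name `total_quantity_by_day` and the three input rows are illustrative only, and `DAY_NAMES` is an assumed stand-in for the abbreviated names that the `"E"` pattern produces.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]   # weekday() 0 is Monday

def total_quantity_by_day(rows):
    """Sum quantities per abbreviated day name, ordered by name like ORDER BY day."""
    totals = defaultdict(int)
    for sale_date, quantity in rows:
        totals[DAY_NAMES[sale_date.weekday()]] += quantity
    return dict(sorted(totals.items()))

# Illustrative rows: (date, quantity)
rows = [(date(2010, 12, 1), 6), (date(2010, 12, 1), 8), (date(2010, 12, 2), 3)]
print(total_quantity_by_day(rows))   # {'Thu': 3, 'Wed': 14}
```

Note that the day names sort alphabetically, not in calendar order, which is also what ordering by a string `day` column does.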

In [0]:
%sql
-- preview the formatted view
SELECT *
FROM
  salesDateFormatted

InvoiceNo,StockCode,date,Quantity,UnitPrice
536365,85123A,2010-12-01,6,2.55
536365,71053,2010-12-01,6,3.39
536365,84406B,2010-12-01,8,2.75
536365,84029G,2010-12-01,6,3.39
536365,84029E,2010-12-01,6,3.39
536365,22752,2010-12-01,2,7.65
536365,21730,2010-12-01,6,4.25
536366,22633,2010-12-01,6,1.85
536366,22632,2010-12-01,6,1.85
536367,84879,2010-12-01,32,1.69


In [0]:
%sql
-- total quantity by day of the week ("E" is the day-of-week pattern)
SELECT
  date_format(date, "E") day,
  SUM(quantity) totalQuantity
FROM
  salesDateFormatted
GROUP BY (day)
ORDER BY day

day,totalQuantity
Fri,90762.0
Mon,76366.0
Sun,43117.0
Thu,114827.0
Tue,106256.0
Wed,116652.0


In [0]:
%run ../Includes/Classroom-Cleanup


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>