
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exploring the Single Source of Truth

**Objective:** Identify late-arriving data and bad data.

## Notebook Configuration

Before you run this cell, make sure to add a unique username to the file
<a href="$./includes/configuration" target="_blank">
includes/configuration</a>, e.g.

```
username = "yourfirstname_yourlastname"
```

In [0]:
%run ./includes/configuration

#### Step 1: Count the Number of Records Per Device
Let’s run a query to count the number of records per device.
Recall that we need to tell Spark that our source is a Delta table,
which we do with the `.format()` method. Additionally, instead of passing in a literal path
as we did in previous notebooks, we build the path from the `health_tracker` variable.
Finally, we apply a `groupby` and a count aggregation on the `p_device_id` column.

In [0]:
# ANSWER
from pyspark.sql.functions import count

display(
  spark.read
  .format("delta")
  .load(health_tracker + "processed")
  .groupby("p_device_id")
  .agg(count("*"))
)
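
To make the aggregation concrete, here is a minimal Spark-free sketch of what the per-device count computes. The toy `device_ids` list and the helper names are hypothetical and are not part of the course dataset. Treating "fewer records than the busiest device" as a sign of missing data is an assumption made only for illustration.

```python
from collections import Counter

# Hypothetical toy data: one p_device_id per reading. Not the course dataset.
device_ids = [0, 0, 0, 1, 1, 1, 4, 4]


def records_per_device(device_ids):
    """Count readings per device, mirroring groupby("p_device_id").agg(count("*"))."""
    return Counter(device_ids)


def devices_below_max(counts):
    """Return the devices whose record count is below the largest count seen."""
    expected = max(counts.values())
    return sorted(device for device, n in counts.items() if n < expected)


counts = records_per_device(device_ids)
print(devices_below_max(counts))
```

A device that appears in this list has fewer records than its peers and is a candidate for the closer look taken in the next step.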

#### Step 2: Plot the Missing Records
Let’s run a query to discover the timing of the missing records. The query keeps only devices 3 and 4 so that the two can be compared, and we use a Databricks visualization to display the number of records per day. It appears that we have no records for device 4 for the last few days of the month.

In [0]:
from pyspark.sql.functions import col

display(
  spark.read
  .format("delta")
  .load(health_tracker + "processed")
  .where(col("p_device_id").isin([3,4]))
)

### Configuring the Visualization
Create a Databricks visualization to view the sensor counts by day.
We have used the following options to configure the visualization:
```
Keys: dte
Series groupings: p_device_id
Values: heartrate
Aggregation: COUNT
Display Type: Bar Chart
```
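
With these options, the chart effectively counts readings per (`dte`, `p_device_id`) pair. The following Spark-free sketch shows that aggregation on toy rows, and how a gap for one device shows up as days with no matching key. The rows, dates, and helper names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical toy rows: (dte, p_device_id, heartrate). Not the course dataset.
rows = [
    (date(2020, 1, 29), 3, 61.2),
    (date(2020, 1, 29), 4, 58.9),
    (date(2020, 1, 30), 3, 60.4),
    (date(2020, 1, 31), 3, 62.0),
]


def count_by_day_and_device(rows):
    """Count readings per (dte, p_device_id), like the chart's COUNT aggregation."""
    return Counter((dte, device) for dte, device, _ in rows)


def missing_days(rows, device):
    """Return the sorted days present in the data that have no reading for `device`."""
    all_days = {dte for dte, _, _ in rows}
    device_days = {dte for dte, dev, _ in rows if dev == device}
    return sorted(all_days - device_days)


day_device_counts = count_by_day_and_device(rows)
print(missing_days(rows, device=4))
```

In the bar chart, the same gap appears as days where only one of the two series has a bar.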

### Broken Readings in the Table
Upon our initial load of data into the `health_tracker_processed` table, we noted that there are broken records in the data. In particular, we made a note of the fact that several negative readings were present even though it is impossible to record a negative heart rate.

Let’s assess the extent of these broken readings in our table.

#### Step 1: Create Temporary View for Broken Readings
First, we create a temporary view for the broken readings in the `health_tracker_processed` table.
Here, we want to find the rows where `heartrate` is less than 0 and count them per day.

In [0]:
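# ANSWER
from pyspark.sql.functions import col, count

broken_readings = (
  spark.read
  .format("delta")
  .load(health_tracker + "processed")
  .select(col("heartrate"), col("dte"))
  # Keep only the physically impossible (negative) readings.
  .where(col("heartrate") < 0)
  .groupby("dte")
  # The resulting column is named `count(heartrate)`.
  .agg(count("heartrate"))
  .orderBy("dte")
)
# Register the result so it can be queried from %sql cells.
broken_readings.createOrReplaceTempView("broken_readings")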

#### Step 2: Display broken_readings
Display the records in the `broken_readings` view, again using a Databricks visualization.
Note that most days have at least one broken reading and that some have more than one.

In [0]:
%sql
SELECT * FROM broken_readings

#### Step 3: Sum the Broken Readings
Next, we sum the per-day counts in the view to get the total number of broken readings.

In [0]:
%sql
SELECT SUM(`count(heartrate)`) FROM broken_readings
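
The filter, per-day count, and final sum can also be traced by hand. The following is a minimal Spark-free sketch on toy data. The `readings` list and the helper names are hypothetical, and the values are not taken from the `health_tracker_processed` table.

```python
from collections import Counter

# Hypothetical toy rows: (dte, heartrate). Not the course dataset.
readings = [
    ("2020-01-01", 61.5),
    ("2020-01-01", -58.2),
    ("2020-01-02", -60.1),
    ("2020-01-02", -59.7),
    ("2020-01-03", 62.3),
]


def broken_per_day(readings):
    """Count readings with a negative heartrate for each day, ordered by day."""
    counts = Counter(dte for dte, heartrate in readings if heartrate < 0)
    return dict(sorted(counts.items()))


def total_broken(per_day):
    """Sum the per-day counts, like SUM over the view's count column."""
    return sum(per_day.values())


per_day = broken_per_day(readings)
print(per_day)
print(total_broken(per_day))
```

Because the filter runs before the grouping, a day with no negative readings does not appear in the result at all, rather than appearing with a count of zero.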


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>