# Apache Spark Project on Databricks

## Description

One dataset contains a **list of health inspections of establishments** (restaurants, supermarkets, etc.), along with the health risk assigned to each. A second dataset provides a **description of these risks**.

**The objective is to load these datasets under specific requirements and manipulate them according to the instructions of each exercise**.

All necessary operations are described in the exercises, although extra tasks initiated by the student will be appreciated. The use of the DataFrame API will also be valued.

## Download Datasets

In [0]:
%sh 
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/food_inspections_lite.csv' --output-dir /databricks/driver
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/risk_description.csv'  --output-dir /databricks/driver

In [0]:
dbutils.fs.cp('file:/databricks/driver/food_inspections_lite.csv','dbfs:/dataset/food_inspections_lite.csv')
dbutils.fs.cp('file:/databricks/driver/risk_description.csv','dbfs:/dataset/risk_description.csv')

In [0]:
dbutils.fs.ls('/dataset/')

In [0]:
KAFKA_BOOSTRAP_SERVER="35.227.18.205:9094"

In [0]:
checkpoint_path = "/tmp/project_spark/_checkpoint"

In [0]:
spark.conf.set("spark.sql.streaming.checkpointLocation", checkpoint_path)
spark.conf.get("spark.sql.streaming.checkpointLocation")

## Exercise 1
---
1. **Create two dataframes, one from the file `food_inspections_lite.csv` and another from `risk_description.csv`**.
2. **Convert these two dataframes into Delta tables**.
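
One possible way to approach both steps is sketched below. It assumes the CSV files have a header row and that `spark` is the session Databricks predefines in every notebook. Delta Lake rejects column names containing spaces (such as `DBA Name`) unless column mapping is enabled, so the sketch renames columns before writing. The helper names and the table names in the usage comments are hypothetical.

```python
import re

# Characters that Delta Lake rejects in column names unless column mapping is enabled.
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]+")


def sanitize_column(name):
    """Replace each run of Delta-invalid characters (spaces, commas, ...) with '_'."""
    return _INVALID_CHARS.sub("_", name.strip())


def load_csv(spark, path):
    """Read a CSV file with a header row, letting Spark infer the column types."""
    return (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv(path)
    )


def save_as_delta(df, table_name):
    """Write df as a Delta table after renaming its columns to Delta-safe names."""
    safe_df = df.toDF(*[sanitize_column(c) for c in df.columns])
    safe_df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return safe_df


# Usage in a Databricks notebook (table names are assumptions):
# inspections_df = load_csv(spark, "dbfs:/dataset/food_inspections_lite.csv")
# risks_df = load_csv(spark, "dbfs:/dataset/risk_description.csv")
# save_as_delta(inspections_df, "food_inspections")
# save_as_delta(risks_df, "risk_description")
```

With this renaming, `DBA Name` is stored as `DBA_Name` in the Delta table, while the in-memory dataframes keep their original headers.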

## Exercise 2
---
**Obtain the number of distinct inspections classified as high risk (`Risk 1 (High)`).**
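
A minimal sketch of this count with the DataFrame API follows. It assumes the inspections dataframe still has its original CSV headers and that each inspection is identified by a column named `Inspection ID`; that column name is an assumption and is not stated in this notebook.

```python
HIGH_RISK = "Risk 1 (High)"


def count_high_risk_inspections(df, id_col="Inspection ID", risk_col="Risk"):
    """Return the number of distinct inspection ids whose risk equals HIGH_RISK.

    The default for id_col is a hypothetical column name; adjust it to the real schema.
    """
    return (
        df.filter(df[risk_col] == HIGH_RISK)
        .select(id_col)
        .distinct()
        .count()
    )


# Usage, with inspections_df being the dataframe loaded in Exercise 1:
# print(count_high_risk_inspections(inspections_df))
```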

## Exercise 3
---
**From the previously loaded dataframes, obtain a table with the following columns:**
1. `DBA Name`
2. `Facility Type`
3. `Risk`
4. `Risk Description`
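
The four columns can be obtained with a join between the two dataframes. The sketch below is only illustrative: the notebook does not state which column of the risk dataset matches `Risk` or which one holds the description, so both are passed in as parameters, and the names in the usage comment are hypothetical.

```python
def join_risk_description(inspections_df, risks_df, risk_key, description_col):
    """Left-join inspections with their risk description.

    risk_key:        column of risks_df that matches inspections_df["Risk"]
    description_col: column of risks_df that holds the description text
    """
    joined = inspections_df.join(
        risks_df,
        on=inspections_df["Risk"] == risks_df[risk_key],
        how="left",  # keep inspections even when no description matches
    )
    return joined.select(
        inspections_df["DBA Name"],
        inspections_df["Facility Type"],
        inspections_df["Risk"],
        risks_df[description_col].alias("Risk Description"),
    )


# Usage (column names of the risk dataset are assumptions):
# result_df = join_risk_description(inspections_df, risks_df, "risk_id", "description")
# display(result_df)
```

A left join is used here so that inspections without a matching description still appear in the output; an inner join would be the choice if such rows should be dropped.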

## Exercise 4
---
**Access the Spark UI to view the execution plan and describe each of the pieces/boxes that make up the execution plan. Add a screenshot of the analyzed execution plan.**

> **Note:** A brief one-line description per box is sufficient.

## Exercise 5
---
1. **Obtain the number of inspections for each establishment (`DBA Name` column) and their result (`Results` column).**
2. **Get the two establishments (`DBA Name`) with the most inspections for each of the results.**
3. **Save the results in a new Delta table named `inspections_results`.**
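
The three steps above can be sketched with a grouped count followed by a ranking window. This is one possible approach, not the only one: it uses `row_number`, which keeps exactly two establishments per result even when counts are tied (`rank` or `dense_rank` would keep ties). The function names are hypothetical, and `DBA Name` is renamed to `DBA_Name` before saving because Delta rejects spaces in column names by default.

```python
def top_establishments_per_result(df, top_n=2):
    """Count inspections per (DBA Name, Results) and keep the top_n names per result."""
    # Imported inside the function so that defining it does not require a Spark runtime.
    from pyspark.sql import Window, functions as F

    counts = df.groupBy("DBA Name", "Results").count()
    by_result = Window.partitionBy("Results").orderBy(F.col("count").desc())
    return (
        counts.withColumn("position", F.row_number().over(by_result))
        .filter(F.col("position") <= top_n)
        .drop("position")
    )


def save_inspection_results(top_df, table_name="inspections_results"):
    """Save the ranking as a Delta table, using a Delta-safe column name."""
    (
        top_df.withColumnRenamed("DBA Name", "DBA_Name")
        .write.format("delta")
        .mode("overwrite")
        .saveAsTable(table_name)
    )


# Usage:
# top_df = top_establishments_per_result(inspections_df)
# save_inspection_results(top_df)
```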

## Exercise 6
---
1. **Update the Delta table created in the previous exercise by setting `DBA_Name = "error"`.**
2. **Restore the table to its original state.**
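
Delta keeps a version history for every table, so the update can be undone by restoring the version that existed before it. The sketch below assumes the table is named `inspections_results` as in Exercise 5 and that the update applies to every row; it records the latest version from `DESCRIBE HISTORY`, runs the update, and then restores. The helper names are hypothetical.

```python
TABLE = "inspections_results"


def update_statement(table=TABLE):
    """SQL that overwrites DBA_Name in every row of the table."""
    return f"UPDATE {table} SET DBA_Name = 'error'"


def restore_statement(version, table=TABLE):
    """SQL that rolls the table back to a given Delta version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {int(version)}"


def corrupt_and_restore(spark, table=TABLE):
    """Run the update, then restore the version that preceded it."""
    history = spark.sql(f"DESCRIBE HISTORY {table}")
    original_version = history.selectExpr("max(version) AS v").first()["v"]
    spark.sql(update_statement(table))
    spark.sql(restore_statement(original_version, table))
    return original_version


# Usage in a Databricks notebook:
# restored_to = corrupt_and_restore(spark)
```

Reading the version number from the history, rather than hard-coding `0`, keeps the sketch valid if the table has been overwritten more than once.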

## Exercise 7
---
**Create an application with Structured Streaming that reads data from the Kafka topic `inspections`.**
> **Note:** The Kafka server URL is defined at the beginning of this notebook.

**The data from this topic is exactly the same as the data being analyzed throughout this notebook, `Food Inspections`, so the schema is the same.**
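
A minimal sketch of the Kafka source follows. The option keys are the ones defined by Spark's Kafka integration; the helper names and the choice of `latest` as the starting offset are assumptions. The notebook does not say how messages are encoded (JSON, CSV, ...), so the sketch only casts the binary payload to a string and leaves parsing into the Food Inspections schema open.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_inspections_stream(spark, bootstrap_servers, topic="inspections"):
    """Return a streaming dataframe with the message text and its Kafka timestamp."""
    raw = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as binary, so cast it to text before parsing.
    return raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")


# Usage, with the server variable defined at the top of this notebook:
# stream_df = read_inspections_stream(spark, KAFKA_BOOSTRAP_SERVER)
```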

## Exercise 8
---
**Based on the data source from the previous exercise, obtain the number of inspections per `Facility Type` every 5 seconds.**

## Exercise 9
---
**Based on the data source from Exercise 7, obtain the number of inspections by `Results` over the last 30 seconds, recomputed every 5 seconds.**
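
One way to read this requirement is as a sliding window: 30 seconds long, advancing every 5 seconds. The sketch below assumes the stream has already been parsed so that it exposes a `Results` column, and that the Kafka `timestamp` column is used as event time; the function name and the watermark threshold are assumptions.

```python
def count_by_sliding_window(
    events,
    group_col="Results",
    window_duration="30 seconds",
    slide_duration="5 seconds",
    time_col="timestamp",
):
    """Count rows per group_col in sliding event-time windows."""
    # Imported inside the function so that defining it does not require a Spark runtime.
    from pyspark.sql import functions as F

    return (
        events
        # Tolerate late events for up to one window length, then drop old state.
        .withWatermark(time_col, window_duration)
        .groupBy(
            F.window(F.col(time_col), window_duration, slide_duration),
            F.col(group_col),
        )
        .count()
    )


# Usage (parsed_stream is a hypothetical parsed version of the Kafka stream):
# counts = count_by_sliding_window(parsed_stream)
# display(counts)
```

Passing the same value for `window_duration` and `slide_duration` turns this into a tumbling window, which is one way to produce a count per `Facility Type` every 5 seconds as asked in Exercise 8.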

## Exercise 10
---
1. **Update the `Results` column in the food inspections Delta table created in Exercise 1 with the value `No result`.**
2. **The Delta table is now corrupted with the value `No result`. Repair it using the data arriving from Kafka, which is treated as the source of truth: the table must be updated in real time as records arrive from Kafka.**
> **Note**: It is recommended to stop all previous streams, as the one in this exercise tends to be resource-intensive.
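
A common pattern for updating a Delta table from a stream is `foreachBatch` combined with a Delta `MERGE`. The sketch below is written under several assumptions: the stream is already parsed into columns, the table and key column names in the usage comment (`food_inspections`, `Inspection_ID`) are hypothetical, and `DataFrame.sparkSession` is available (PySpark 3.3 or later).

```python
def make_results_upsert(table_name, key_col, value_col="Results"):
    """Build a foreachBatch callback that overwrites value_col for matching keys."""

    def upsert_batch(batch_df, batch_id):
        # Imported lazily so that building the callback needs no Spark runtime.
        from delta.tables import DeltaTable

        # MERGE fails when several source rows match one target row, so keep
        # a single (arbitrary) row per key within the micro-batch.
        updates = batch_df.dropDuplicates([key_col])
        target = DeltaTable.forName(batch_df.sparkSession, table_name)
        (
            target.alias("t")
            .merge(updates.alias("s"), f"t.{key_col} = s.{key_col}")
            .whenMatchedUpdate(set={value_col: f"s.{value_col}"})
            .execute()
        )

    return upsert_batch


# Usage (parsed_stream, table name and key column are assumptions):
# query = (
#     parsed_stream.writeStream
#     .foreachBatch(make_results_upsert("food_inspections", "Inspection_ID"))
#     .start()
# )
```

Only matched rows are updated here, since the goal is to repair existing records rather than to insert new ones.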

## Exercise 11
---
**Design a real-time analysis solution using Apache Spark in Databricks to consume flight data transmitted by Kafka. These data should be stored in a Delta table, and the current position of the flights should be visualized on a map.**
* **Flight data is in a topic called `flights`.**
* **Save all the flights in a Delta table, but only one entry per flight code, so if updates on the flight position are received, the corresponding record will be updated. This must happen in real-time.**
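
The "one entry per flight code" requirement can be sketched as an upsert: each micro-batch is reduced to the newest record per flight and merged into the Delta table. The sketch assumes the stream has been parsed into columns named after the OpenSky state vector fields (`icao24` as the flight key, `last_contact` as the ordering column), that a Delta table named `flights` with the same columns already exists, and that the view name is free to use. All of these names are assumptions about data the notebook does not describe.

```python
def merge_statement(target_table, source_view, key_col):
    """Delta MERGE that updates known flights and inserts new ones."""
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key_col} = s.{key_col} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert_flights(batch_df, batch_id, key_col="icao24", order_col="last_contact"):
    """foreachBatch callback: keep the newest row per flight, then merge it."""
    # Imported inside the function so that defining it does not require a Spark runtime.
    from pyspark.sql import Window, functions as F

    newest_first = Window.partitionBy(key_col).orderBy(F.col(order_col).desc())
    latest = (
        batch_df.withColumn("_rn", F.row_number().over(newest_first))
        .filter(F.col("_rn") == 1)
        .drop("_rn")
    )
    latest.createOrReplaceTempView("flight_updates")
    latest.sparkSession.sql(merge_statement("flights", "flight_updates", key_col))


# Usage (flights_stream is a hypothetical parsed stream from the `flights` topic):
# query = flights_stream.writeStream.foreachBatch(upsert_flights).start()
```

Because the table holds a single row per flight, reading it at any moment yields the current positions to plot.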

> **Note**: For more information on the input data, refer to [OpenSky Network](https://openskynetwork.github.io/opensky-api/rest.html#all-state-vectors). A screenshot of the visualization to be achieved is shown below. Keep in mind that this map visualization is available in Databricks, so there will be no need to import any external libraries.

![Flight Map](https://raw.githubusercontent.com/masfworld/datahack_docker/ab487794745499248388b67cf574085c5d86746e/zeppelin/data/image.png)