# Databricks File Areas

In Databricks, if you click "Workspace" in the menu on the left, you'll see a file-explorer-style interface.

There are two main areas:

- "Home" is your personal area (this is the same as "Workspace/Users/<your_user_name>")
- "Workspace/Shared" (where this notebook is) is the shared area that everyone can access.

# Cloning notebooks in Databricks

If you're in Databricks and want to copy a notebook to your personal file area, click "File" in the menu bar at the top, then "Clone...", then "Browse". Then navigate to your user area and clone it there.

# Import PySpark functions

PySpark comes with a library of functions we'll need to use in our code, so we'll import these first:

In [None]:
from pyspark.sql import functions as F

# Download Artificial HES data

We'll use the Artificial NHS Hospital Episode Statistics Accident and Emergency (HES AE) data for these examples (the `202302` sample release, going by the file name in the URL below). The cell below downloads this data from the public website and unzips it.

In [None]:
# These libraries will help us download the file
import zipfile
import io
from pathlib import Path
import requests

zip_file_url = "https://files.digital.nhs.uk/assets/Services/Artificial%20data/Artificial%20HES%20final/artificial_hes_ae_202302_v1_sample.zip"
path_to_downloaded_data = "data_in/artificial_hes_ae_202302_v1_sample.zip/artificial_hes_ae_202302_v1_sample/artificial_hes_ae_2122.csv"

filename = Path(zip_file_url).name
output_path = f"data_in/{filename}"

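# Download the whole archive into memory, stopping with an error if the request failed
response = requests.get(zip_file_url, timeout=3600)
response.raise_for_status()

# Unzip straight from memory into a folder named after the zip file
downloaded_zip = zipfile.ZipFile(io.BytesIO(response.content))
downloaded_zip.extractall(output_path)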
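
The cell above unzips the archive directly from memory rather than saving the zip to disk first. As a rough illustration of that `zipfile.ZipFile(io.BytesIO(...))` pattern without any network access, the sketch below builds a tiny archive in memory and extracts it to a temporary folder. The file name `example/sample.csv` and its contents are made up for the demonstration.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# Build a small zip archive in memory (a stand-in for response.content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example/sample.csv", "id,value\n1,a\n2,b\n")
zip_bytes = buffer.getvalue()

# Open the bytes as a zip file and extract it, as the download cell does
with tempfile.TemporaryDirectory() as tmp_dir:
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as downloaded_zip:
        member_names = downloaded_zip.namelist()
        downloaded_zip.extractall(tmp_dir)
    extracted_text = (Path(tmp_dir) / "example" / "sample.csv").read_text()

print(member_names)
print(extracted_text)
```

Calling `.namelist()` before extracting is a cheap way to check what an archive contains and where the files will end up.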

The `spark` variable here is known as the SparkSession. It's the entry point to Spark's functionality. Above we used `spark.read` to load data stored in Parquet files in Azure Blob storage (that's what the 'abfss' part of the path indicates).

This returns the data as a DataFrame - basically a table.

# Displaying data

Use the `display()` function and pass the DataFrame to it. It can take a while with a lot of data or intensive queries.

In [None]:
display(df_ec_core_snapshot)

# PySpark DataFrame methods

Most functions you use to manipulate data in PySpark belong to the DataFrame itself. They are functions attached to the DataFrame class. When a function belongs to a class it's known as a method.

So every time you create a DataFrame, it has access to all these DataFrame methods. These are what we use to manipulate the data.

There's a PySpark equivalent to all the usual suspects in SQL: SELECT, WHERE (filter), GROUP BY, COUNT, etc.

To use a DataFrame method, you write the name of your DataFrame, then a dot, then the name of the method you want to call, e.g.:

`df.select()`
`df.count()`

You chain these methods together to build your query.
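
Chaining works because methods such as `.select()` and `.where()` return a new DataFrame, which has the same methods again. The toy class below is plain Python, not PySpark. `TinyTable` and its methods are invented purely to show the mechanics of methods that return a new object.

```python
class TinyTable:
    """A made-up stand-in for a DataFrame: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, predicate):
        # Return a NEW table containing only the rows that pass the test
        return TinyTable(row for row in self.rows if predicate(row))

    def select(self, *column_names):
        # Return a NEW table containing only the requested columns
        return TinyTable({name: row[name] for name in column_names} for row in self.rows)

    def count(self):
        # Return a plain number, so this ends the chain
        return len(self.rows)


table = TinyTable([
    {"provider_code": "RK5", "age": 30},
    {"provider_code": "RNS", "age": 70},
    {"provider_code": "RK5", "age": 55},
])

# Each call returns a new TinyTable, so the next method can follow the dot
result = (table
    .where(lambda row: row["provider_code"] == "RK5")
    .select("provider_code")
)

print(result.rows)
print(result.count())
print(table.count())
```

The original `table` still has all three rows afterwards. In the same way, building a new DataFrame from an old one leaves the old one unchanged.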

Here are a few common examples.

# COUNT(*)

To count all rows, use the `.count()` method

In [None]:
df_ec_core_snapshot.count()

# TOP

The equivalent of the T-SQL TOP is `.limit()`

The example below creates a new DataFrame containing at most 1,000 rows. It has a different name from the old one, so the original DataFrame is left untouched.

In [None]:
df_ec_core_snapshot_top_1000 = (df_ec_core_snapshot
    .limit(1000)
)

df_ec_core_snapshot_top_1000.count()

# SELECT

Use `.select()` and pass the names of the columns you want to select.
Here we are creating a new DataFrame based on the old one, so the old one still exists.

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )             
)

display(df_ec_core_snapshot_filtered)

# ORDER BY

`.orderBy()` is pretty straightforward:

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .orderBy("Report_Period_Start_Date")
)

display(df_ec_core_snapshot_filtered)

It's ascending by default. For descending you could use:

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .orderBy(F.desc("Report_Period_Start_Date"))
)

display(df_ec_core_snapshot_filtered)

The `F` in `F.desc()` indicates that it's a function from the `pyspark.sql.functions` module that we imported at the top, so it's not a DataFrame method.

Note that in Spark, row order is not guaranteed unless you use `.orderBy()`. Otherwise it depends on which workers on the cluster return their results first, which can differ for any number of reasons: machines overheating, a clogged network, cats eating cables in the data centre, and so on.

# WHERE

For WHERE clauses you can use `.where()` or `.filter()` - `.where()` is just an alias for `.filter()`, so they behave the same.

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(F.col("Report_Period_Start_Date") > "2023-04-02")
    .orderBy("Report_Period_Start_Date")
)

display(df_ec_core_snapshot_filtered)

## WHERE col IN () - use `.isin()`

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(F.col("provider_code").isin(["RK5", "RNS"]))
)

display(df_ec_core_snapshot_filtered)

## WHERE col NOT IN () - use `~`

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(~F.col("provider_code").isin(["RK5", "RNS"]))
)

display(df_ec_core_snapshot_filtered)

# Multiple where clauses

For AND, you can simply chain two `.where()` method calls, or combine the conditions with `&`:

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(F.col("Report_Period_Start_Date") > "2023-04-02")
    .where(F.col("Report_Period_Start_Date").isNotNull())
    .orderBy("Report_Period_Start_Date")
)

display(df_ec_core_snapshot_filtered)

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(
        (F.col("Report_Period_Start_Date") > "2023-04-02")
        &
        (F.col("Report_Period_Start_Date").isNotNull()) # IS NOT NULL
    )
    .orderBy("Report_Period_Start_Date")
)

display(df_ec_core_snapshot_filtered)
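
The brackets around each condition are needed because Python gives `&` and `|` a higher precedence than comparisons such as `>`. The plain-Python sketch below uses ordinary integers (made-up values, no Spark involved) to show how the same expression is read with and without brackets.

```python
a = 5
b = 4

# Without brackets Python evaluates `1 & b` first, so this is read as
# the chained comparison  a > (1 & b) > 2
without_brackets = a > 1 & b > 2

# With brackets each comparison is evaluated first, then combined with &
with_brackets = (a > 1) & (b > 2)

print(without_brackets)
print(with_brackets)
```

With PySpark columns, leaving the brackets out typically leads to an error or an unintended condition, so it is safest to always bracket each condition.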

For OR, use the pipe `|`:

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(
        (F.col("Report_Period_Start_Date") > "2023-04-02")
        |
        (F.col("Report_Period_Start_Date").isNull()) # IS NULL
    )
    .orderBy("Report_Period_Start_Date")
)

display(df_ec_core_snapshot_filtered)


# GROUP BY / COUNT(*)

Note that the `.select()` is not necessary here: the result of `.groupBy().count()` contains only the grouping column and the count, so the selected columns are overridden.

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(
        (F.col("Report_Period_Start_Date") > "2023-04-02")
        |
        (F.col("Report_Period_Start_Date").isNull()) # IS NULL
    )
    .groupBy(
      "provider_code"
    )
    .count()
)

display(df_ec_core_snapshot_filtered)

## Other aggregations - use `.agg()`

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(
        (F.col("Report_Period_Start_Date") > "2023-04-02")
        |
        (F.col("Report_Period_Start_Date").isNull()) # IS NULL
    )
    .groupBy("provider_code")
    .agg(
        F.countDistinct("EC_Ident"),
        F.min("Report_Period_Start_Date"),
        F.max("Report_Period_Start_Date"),
        F.sum("EC_Ident")
    )
)

display(df_ec_core_snapshot_filtered)

## Aliasing columns - `.alias()`

### In `.groupBy().agg()`

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date"
    )
    .where(
        (F.col("Report_Period_Start_Date") > "2023-04-02")
        |
        (F.col("Report_Period_Start_Date").isNotNull()) 
    )
    .groupBy(
      "provider_code"
    )
    .agg(
        F.count("EC_Ident").alias("EC_Ident_count"),
        F.countDistinct("EC_Ident").alias("EC_Ident_distinct"),
        F.min("Report_Period_Start_Date").alias("Min_Report_Period_Start_Date"),
        F.max("Report_Period_Start_Date").alias("Max_Report_Period_Start_Date"),
        F.sum("EC_Ident").alias("EC_Ident_sum"),
    )
)

display(df_ec_core_snapshot_filtered)

### In `.select()`

To alias in `.select()` use `F.col()` and chain the alias to that.

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      F.col("EC_Ident").alias("EC_Identifier"),
      F.col("provider_code").alias('prov_code'),
    )
)

display(df_ec_core_snapshot_filtered)

# Add a new column - `.withColumn()`

### With the same value for every row - use `F.lit()`

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      F.col("EC_Ident").alias("EC_Identifier"),
      F.col("provider_code").alias('prov_code'),
    )
    .withColumn("some_flag_column", F.lit(True))
)

display(df_ec_core_snapshot_filtered)

### For a CASE WHEN use `F.when().otherwise()`

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
      "Report_Period_Start_Date",
    )
    .withColumn("a_new_col", 
      F.when(F.col("Report_Period_Start_Date").isNull(), F.lit("No date"))
       .when(F.col("Report_Period_Start_Date") > "2023-05-01", F.lit("After May 1st 2023"))
       .otherwise(F.lit("Before May 1st 2023"))
    )
)

display(df_ec_core_snapshot_filtered)

# Creating DataFrames manually

Sometimes you might need to create your own DataFrame manually. Maybe to make some reference data that you want to join.

To do that, you need to use the `SparkSession`. The `SparkSession` has a method called `.createDataFrame()`. We'll pass this method two arguments:

## The data

The data takes the form of a list (square brackets) of tuples (round brackets). Each tuple represents a row, and each item in the tuple is that row's value for one column.

## The schema

It is possible to specify the column types in the schema, and in some cases you might need to do that. But you can also just give a list of column names, and let Spark figure it out. For this simple example we'll just do that.

In [None]:
data = [
    ("RK5", "Sherwood Forest Hospitals NHS Foundation Trust"),
    ("RNS", "NORTHAMPTON GENERAL HOSPITAL NHS TRUST")
]

schema = ["provider_code", "provider_name"]

df_provider_names = spark.createDataFrame(data, schema)

display(df_provider_names)

# Joins

Just like in SQL, we need three things to do a join:

- The name of the **other** DataFrame we're joining to
- The columns we want to join **on**
- **How** we want to join (left, right, inner, etc.)

We just use the `.join()` method and pass these as parameters.

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
    )
    .where(F.col("provider_code").isin(["RK5", "RNS"]))
    .limit(1000)
    .join(other=df_provider_names, on="provider_code", how="left")
)

display(df_ec_core_snapshot_filtered)

**Important:** Notice how the `.select()` method does not contain `provider_name`, but it appears in the results anyway.

This is where PySpark differs from SQL a bit. In PySpark, the order of the method calls matters. The join came after the select, so we can't select a column in the other DataFrame, because Spark doesn't know about it yet.

So how does `provider_name` end up in the output?

It's because when you join, all of the other DataFrame's columns are included by default!

If you don't need all the columns, then you can use `.select()` on the other DataFrame when you define it, to ensure you only select the columns you need. Another way would be to put the `.select()` after the `.join()`.


With `.join()`, you don't need to specify the names of the `other`, `on`, and `how` parameters. As long as you put them in the right order it will still work, and it's conventional to not include the names:

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
    )
    .where(F.col("provider_code").isin(["RK5", "RNS"]))
    .limit(1000)
    .join(df_provider_names, "provider_code", "left")
)

display(df_ec_core_snapshot_filtered)

For an inner join, just change "left" to "inner".

In [None]:
df_ec_core_snapshot_filtered = (df_ec_core_snapshot
    .select(
      "EC_Ident",
      "provider_code",
    )
    .limit(1000)
    .join(df_provider_names, "provider_code", "inner")
)

display(df_ec_core_snapshot_filtered)

# Practice Exercises


A few ideas for practice:

- Try making your own queries using the methods above, e.g.:
  - Could you get the number of rows for provider RWP, grouped by age at arrival, and sorted by age?
  - Which provider has the most records in 2023 based on the report period end date?
- Try taking a simple SQL script you have used on this data in NCDR and recreating it in PySpark here.