This notebook shows you how to create and query a table or DataFrame loaded from data stored in Azure Blob storage.

### Step 1: Set the data location and type

There are two ways to access Azure Blob storage: account keys and shared access signatures (SAS). This notebook uses an account key.

To get started, we need to set the location and type of the file.

In [0]:
storage_account_name = "<Storage Account Name>"
storage_account_access_key = "<Storage Account Access Secret Key>"
container_name = "sparkcontainer"

In [0]:
file_name = "<CSV File name>"
file_location = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/Triage/" + file_name
file_type = "csv"
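
The location string above follows the pattern `wasbs://<container>@<account>.blob.core.windows.net/<path>`. As a minimal sketch, the helper below builds that string from its parts; the name `wasbs_url` and the example account and file names are hypothetical and used only for illustration.

```python
def wasbs_url(container, account, *path_parts):
    """Build a wasbs:// URL for a blob in Azure Blob storage."""
    # Strip stray slashes so the parts join with exactly one "/" between them.
    path = "/".join(part.strip("/") for part in path_parts if part)
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"


# Placeholder values, not real resources.
example_location = wasbs_url("sparkcontainer", "mystorageaccount", "Triage", "airports.csv")
print(example_location)
```

Keeping the container, account, and path separate makes it harder for a hard-coded container name to drift away from the `container_name` variable.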

In [0]:
spark.conf.set(
  "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
  storage_account_access_key)
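
The cell above configures account-key access. For the SAS alternative mentioned in Step 1, the configuration key also includes the container name. The sketch below returns the key/value pair for either method; the function name `blob_auth_conf` is hypothetical, and the SAS key pattern `fs.azure.sas.<container>.<account>.blob.core.windows.net` is our understanding of the WASB driver's convention, so check it against your platform's documentation before relying on it.

```python
def blob_auth_conf(account, container=None, *, access_key=None, sas_token=None):
    """Return the (key, value) Spark conf pair for account-key or SAS access."""
    host = f"{account}.blob.core.windows.net"
    if access_key is not None and sas_token is None:
        # An account key grants access to the whole storage account.
        return (f"fs.azure.account.key.{host}", access_key)
    if sas_token is not None and access_key is None and container:
        # A SAS token is configured per container, so the container is part of the key.
        return (f"fs.azure.sas.{container}.{host}", sas_token)
    raise ValueError("Pass either access_key, or sas_token together with container")


# In the notebook, the pair would be applied with:
#   spark.conf.set(*blob_auth_conf(storage_account_name, access_key=storage_account_access_key))
```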

### Step 2: Read the data

Now that we have specified our file metadata, we can create a DataFrame. Notice that we use an *option* to tell Spark to infer the schema from the file. If we already have a schema, we can supply it explicitly instead of inferring it.

First, let's create a DataFrame in Python.

In [0]:
df = spark.read.format(file_type).options(inferSchema='true', header='true').load(file_location)
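
If we prefer an explicit schema over inference, one option is to pass a DDL-formatted string to the reader's `schema` method. The sketch below assumes column types guessed from the sample rows shown in Step 3; the names `airport_columns`, `airport_schema`, and `read_csv_with_schema` are hypothetical.

```python
# Column types are an assumption based on the sample output in this notebook.
airport_columns = [
    ("AIRPORT_ID", "INT"),
    ("AIRPORT", "STRING"),
    ("DISPLAY_AIRPORT_NAME", "STRING"),
    ("LATITUDE", "DOUBLE"),
    ("LONGITUDE", "DOUBLE"),
]

# DataFrameReader.schema accepts a DDL string such as "col0 INT, col1 DOUBLE".
airport_schema = ", ".join(f"{name} {dtype}" for name, dtype in airport_columns)


def read_csv_with_schema(spark, location, schema=airport_schema):
    """Read a CSV file that has a header row, using an explicit schema."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(schema)
        .load(location)
    )
```

In the notebook this would be called as `df = read_csv_with_schema(spark, file_location)`. Supplying the schema avoids the extra pass over the file that inference needs.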

### Step 3: Query the data

Now that we have created our DataFrame, we can query it. For instance, you can identify particular columns to select and display.

In [0]:
#display(df.select("EXAMPLE_COLUMN"))
display(df)

AIRPORT_ID,AIRPORT,DISPLAY_AIRPORT_NAME,LATITUDE,LONGITUDE
10001,01A,Afognak Lake Airport,58.10944444,-152.9066667
10003,03A,Bear Creek Mining Strip,65.54805556,-161.0716667
10004,04A,Lik Mining Camp,68.08333333,-163.1666667
10005,05A,Little Squaw Airport,67.57,-148.1838889
10006,06A,Kizhuyak Bay,57.74527778,-152.8827778
10007,07A,Klawock Seaplane Base,55.55472222,-133.1016667
10008,08A,Elizabeth Island Airport,59.15694444,-151.8291667
10009,09A,Augustin Island,59.36277778,-153.4305556
10010,1B1,Columbia County,42.29138889,-73.71027778
10011,1G4,Grand Canyon West,35.98611111,-113.8169444
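
Beyond displaying the whole DataFrame, we can select columns, filter rows, and sort. The sketch below uses the column names from the output above; the function name `northern_airports` and the 60-degree threshold are arbitrary choices for illustration.

```python
def northern_airports(df, min_latitude=60.0):
    """Keep three columns, filter by latitude, and sort north to south."""
    return (
        df.select("AIRPORT", "DISPLAY_AIRPORT_NAME", "LATITUDE")
        # filter() accepts a SQL expression string as its condition.
        .filter(f"LATITUDE > {min_latitude}")
        .orderBy("LATITUDE", ascending=False)
    )
```

In the notebook, `display(northern_airports(df))` would render the result.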


In [0]:
df.write.format("delta").save("wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/Bronze/AirportCodeLocationLookupClean")

In [0]:
%sql
DROP TABLE IF EXISTS airport_code_location_lookup_clean;

CREATE TABLE IF NOT EXISTS airport_code_location_lookup_clean
USING DELTA LOCATION 'wasbs://sparkcontainer@asastoremcwaztek.blob.core.windows.net/Bronze/AirportCodeLocationLookupClean'

In [0]:
%sql
SELECT * FROM airport_code_location_lookup_clean

AIRPORT_ID,AIRPORT,DISPLAY_AIRPORT_NAME,LATITUDE,LONGITUDE
10001,01A,Afognak Lake Airport,58.10944444,-152.9066667
10003,03A,Bear Creek Mining Strip,65.54805556,-161.0716667
10004,04A,Lik Mining Camp,68.08333333,-163.1666667
10005,05A,Little Squaw Airport,67.57,-148.1838889
10006,06A,Kizhuyak Bay,57.74527778,-152.8827778
10007,07A,Klawock Seaplane Base,55.55472222,-133.1016667
10008,08A,Elizabeth Island Airport,59.15694444,-151.8291667
10009,09A,Augustin Island,59.36277778,-153.4305556
10010,1B1,Columbia County,42.29138889,-73.71027778
10011,1G4,Grand Canyon West,35.98611111,-113.8169444


### Step 4: (Optional) Create a view or table

If you want to query this data as a table, you can simply register it as a *view* or a table.

In [0]:
df.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")

We can query this view using Spark SQL. For instance, we can perform a simple aggregation. Notice how we can use `%sql` to query the view from SQL.

In [0]:
%sql

SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG) FROM YOUR_TEMP_VIEW_NAME GROUP BY EXAMPLE_GROUP
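
The same aggregation can also be written with the DataFrame API instead of SQL. This is a minimal sketch that reuses the placeholder column names `EXAMPLE_GROUP` and `EXAMPLE_AGG` from the query above; the function name `sum_by_group` is hypothetical.

```python
def sum_by_group(df, group_col="EXAMPLE_GROUP", agg_col="EXAMPLE_AGG"):
    """Group by one column and sum another, like the SQL query above."""
    # GroupedData.sum() names its result column "sum(<agg_col>)".
    return df.groupBy(group_col).sum(agg_col)
```

In the notebook, `display(sum_by_group(df))` would show one row per group.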

Because this data is registered as a temp view, it is available only within this notebook's Spark session. If you'd like other users to be able to query it, you can instead create a table from the DataFrame.

In [0]:
df.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")

This table will persist across cluster restarts and allow various users across different notebooks to query this data.
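
Note that `saveAsTable` uses the default save mode, which raises an error if the table already exists, so rerunning the cell above fails on the second run. A sketch of a rerunnable variant follows; the function name `save_as_table` is hypothetical, and choosing `overwrite` is an assumption about what you want, since it replaces the existing table contents.

```python
def save_as_table(df, table_name, file_format="parquet", mode="overwrite"):
    """Write df as a permanent table, replacing any existing table by default."""
    # mode may also be "append", "ignore", or "error" (the Spark default).
    df.write.format(file_format).mode(mode).saveAsTable(table_name)
```

In the notebook this would be called as `save_as_table(df, "MY_PERMANENT_TABLE_NAME")`.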

In [0]:
%sql
select * from flight_delays_with_airport_codes

Year,Month,DayofMonth,DayOfWeek,Carrier,CRSDepTime,DepDelay,DepDel15,CRSArrTime,ArrDelay,ArrDel15,Cancelled,OriginAirportCode,OriginAirportName,OriginLatitude,OriginLongitude,DestAirportCode,DestAirportName,DestLatitude,DestLongitude
2013,4,19,5,DL,837,-3.0,0.0,1138,1.0,0,0,DTW,Detroit Metro Wayne County,42.2125,-83.35333333,MIA,Miami International,25.79527778,-80.29
2013,4,19,5,DL,1705,0.0,0.0,2336,-8.0,0,0,SLC,Salt Lake City International,40.78833333,-111.9777778,JFK,John F. Kennedy International,40.64,-73.77861111
2013,4,19,5,DL,600,-4.0,0.0,851,-15.0,0,0,PDX,Portland International,45.58861111,-122.5969444,SLC,Salt Lake City International,40.78833333,-111.9777778
2013,4,19,5,DL,1630,28.0,1.0,1903,24.0,1,0,STL,Lambert-St. Louis International,38.74861111,-90.37,DTW,Detroit Metro Wayne County,42.2125,-83.35333333
2013,4,19,5,DL,1615,-6.0,0.0,1805,-11.0,0,0,CVG,Cincinnati/Northern Kentucky International,39.04888889,-84.66777778,LAX,Los Angeles International,33.9425,-118.4080556
2013,4,19,5,DL,1726,-1.0,0.0,1818,-19.0,0,0,ATL,Hartsfield-Jackson Atlanta International,33.63666667,-84.42777778,STL,Lambert-St. Louis International,38.74861111,-90.37
2013,4,19,5,DL,1900,0.0,0.0,2133,-1.0,0,0,STL,Lambert-St. Louis International,38.74861111,-90.37,ATL,Hartsfield-Jackson Atlanta International,33.63666667,-84.42777778
2013,4,19,5,DL,2145,15.0,1.0,2356,24.0,1,0,ATL,Hartsfield-Jackson Atlanta International,33.63666667,-84.42777778,SLC,Salt Lake City International,40.78833333,-111.9777778
2013,4,19,5,DL,2157,33.0,1.0,2333,34.0,1,0,ATL,Hartsfield-Jackson Atlanta International,33.63666667,-84.42777778,AUS,Austin - Bergstrom International,30.19444444,-97.67
2013,4,19,5,DL,1900,323.0,1.0,2055,322.0,1,0,DCA,Ronald Reagan Washington National,38.85138889,-77.03777778,ATL,Hartsfield-Jackson Atlanta International,33.63666667,-84.42777778
