This notebook shows you how to create and query a table or DataFrame loaded from data stored in Azure Blob storage.

### Step 1: Set the data location and type

There are two ways to access Azure Blob storage: account keys and shared access signatures (SAS). This notebook uses an account key.

To get started, we need to set the location and type of the file.

In [0]:
storage_account_name = "mainstorageaccountv2"
storage_account_access_key = "access_token"  # placeholder: substitute your own key and keep it out of shared notebooks

In [0]:
file_location = "wasbs://my-csv@mainstorageaccountv2.blob.core.windows.net/market_data.csv"
file_type = "csv"
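The location above follows the pattern `wasbs://<container>@<account>.blob.core.windows.net/<path>`. As a minimal sketch, a small helper can assemble it from its parts. `wasbs_url` is a hypothetical name used here for illustration, not part of any library.

```python
def wasbs_url(container: str, account: str, path: str) -> str:
    """Assemble a wasbs:// URL from container, storage account, and blob path."""
    # Strip a leading slash so the result never has "//" before the blob path.
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Values copied from the cells above.
example_location = wasbs_url("my-csv", "mainstorageaccountv2", "market_data.csv")
print(example_location)
```

Building the URL from named parts makes it harder to mix up the container and the account, which sit on opposite sides of the `@`.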

In [0]:
spark.conf.set(
  "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
  storage_account_access_key)
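The cell above uses an account key. For the other option, a shared access signature, the WASB driver reads the token from a per-container configuration key. The sketch below assumes a container-level SAS token. `sas_conf_key` and the `"sas_token"` placeholder are illustrative names, and the `spark.conf.set` call is left commented out so the block also runs outside a notebook.

```python
def sas_conf_key(container: str, account: str) -> str:
    """Build the Spark config key that holds a SAS token for one container."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


sas_container = "my-csv"               # container taken from file_location above
sas_account = "mainstorageaccountv2"   # same account as storage_account_name
sas_token = "sas_token"                # placeholder, like "access_token" above

conf_key = sas_conf_key(sas_container, sas_account)
print(conf_key)

# In the notebook, where `spark` is predefined, this would replace the account-key cell:
# spark.conf.set(conf_key, sas_token)
```

Unlike an account key, a SAS token is scoped to a container here, which is why the container name appears in the configuration key.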

### Step 2: Read the data

Now that we have specified the file location and type, we can create a DataFrame. Notice that we use an *option* to ask Spark to infer the schema from the file, and a second option to tell it that the first line holds the column names. If we already know the schema, we can supply it explicitly instead of inferring it.

Let's create the DataFrame in Python.

In [0]:
df = (spark.read.format(file_type)
  .option("inferSchema", "true")
  .option("header", "true")
  .load(file_location))
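To supply a schema explicitly rather than inferring it, one option is to pass a DDL-formatted string to the reader's `schema` method. The sketch below is written under assumptions: the column names come from the output shown later in this notebook, while the types (`STRING` for `Date`, `DOUBLE` for every price column) are guesses based on that output. `market_schema_ddl` and `read_with_schema` are hypothetical helper names.

```python
# Price columns as they appear in the query output later in this notebook.
PRICE_COLUMNS = ["EPAM", "XOM", "VIX", "AAPL", "FB", "AMJ", "GOOG", "ICHGF"]


def market_schema_ddl(price_columns):
    """Build a DDL schema string: Date as STRING, each price column as DOUBLE (assumed types)."""
    fields = ["Date STRING"] + [f"{name} DOUBLE" for name in price_columns]
    return ", ".join(fields)


def read_with_schema(spark, file_type, file_location, schema_ddl):
    """Read the file with an explicit schema instead of inferSchema."""
    return (spark.read.format(file_type)
            .schema(schema_ddl)
            .option("header", "true")  # the first line still holds column names
            .load(file_location))


schema_ddl = market_schema_ddl(PRICE_COLUMNS)
print(schema_ddl)

# In the notebook, where `spark` is predefined:
# df = read_with_schema(spark, file_type, file_location, schema_ddl)
```

An explicit schema spares Spark the extra pass over the file that inference requires, and it keeps column types stable from run to run.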

### Step 3: Query the data

Now that we have created our DataFrame, we can query it. For instance, we can select particular columns and display them.

In [0]:
display(df.select("Date", "EPAM"))

Date,EPAM
16-Feb-16,60.44
12-Feb-16,57.0
11-Feb-16,58.05
10-Feb-16,58.06
9-Feb-16,58.69
8-Feb-16,59.77
5-Feb-16,60.92
4-Feb-16,68.86
3-Feb-16,71.71
2-Feb-16,75.2
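Selecting columns can be combined with a row filter. As a hedged sketch, the helper below keeps only rows whose `EPAM` value exceeds a threshold. `epam_above` is a hypothetical name, and the threshold of 60 in the usage line is an arbitrary example value.

```python
def epam_above(df, threshold: float):
    """Return Date and EPAM for rows whose EPAM value exceeds threshold."""
    # filter() accepts a SQL expression string, so no extra imports are needed.
    return df.filter(f"EPAM > {threshold}").select("Date", "EPAM")


# In the notebook, using the DataFrame created in Step 2:
# display(epam_above(df, 60))
```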


### Step 4: (Optional) Create a view or table

If you want to query this data as a table, you can simply register it as a *view* or a table.

In [0]:
df.createOrReplaceTempView("test_market_data")

We can query this view using Spark SQL. For instance, we can select every row and column. Notice how the `%sql` magic command lets us query the view directly in SQL.

In [0]:
%sql

SELECT * FROM test_market_data

Date,EPAM,XOM,VIX,AAPL,FB,AMJ,GOOG,ICHGF
16-Feb-16,60.44,81.22,24.11,96.639999,101.610001,23.44,717.640015,33.0
12-Feb-16,57.0,81.03,25.4,93.989998,102.010002,22.05,706.890015,33.0
11-Feb-16,58.05,79.6,28.14,93.699997,101.910004,20.88,706.359985,33.0
10-Feb-16,58.06,79.35,26.29,94.269997,101.0,21.84,706.849976,33.0
9-Feb-16,58.69,80.08,26.54,94.989998,99.540001,21.68,701.02002,36.790001
8-Feb-16,59.77,81.16,26.0,95.010002,99.75,22.55,704.159973,36.790001
5-Feb-16,60.92,80.08,23.38,94.019997,104.07,24.53,703.76001,36.790001
4-Feb-16,68.86,79.83,21.84,96.599998,110.489998,25.24,730.030029,36.790001
3-Feb-16,71.71,78.48,21.65,96.349998,112.690002,24.8,749.380005,36.790001
2-Feb-16,75.2,74.59,21.98,94.480003,114.610001,24.74,780.909973,36.790001
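The same view also supports aggregations. As a minimal sketch, the query below summarizes the `EPAM` column and runs through `spark.sql` from Python rather than a `%sql` cell. `SUMMARY_QUERY`, `summarize_epam`, and the output column aliases are names chosen here for illustration.

```python
SUMMARY_QUERY = """
SELECT
  COUNT(*)  AS row_count,
  MIN(EPAM) AS epam_low,
  MAX(EPAM) AS epam_high,
  AVG(EPAM) AS epam_mean
FROM test_market_data
"""


def summarize_epam(spark):
    """Run the summary query through the given SparkSession and return a DataFrame."""
    return spark.sql(SUMMARY_QUERY)


# In the notebook, where `spark` is predefined and the temp view already exists:
# display(summarize_epam(spark))
```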


Because the data is registered as a temp view, it is available only to this notebook. If you'd like other users to be able to query it, you can also create a table from the DataFrame.

In [0]:
df.write.format("parquet").saveAsTable("test_market_data_tbl")

This table persists across cluster restarts, so users working in other notebooks can query the data.
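One caveat worth sketching: by default, `saveAsTable` raises an error if the table already exists, so rerunning the write cell fails on the second run. Setting a save mode avoids that. `save_market_table` is a hypothetical helper, and defaulting to `"overwrite"` assumes that replacing the table's contents is acceptable.

```python
def save_market_table(df, table_name: str, mode: str = "overwrite"):
    """Write df as a Parquet-backed table, replacing any existing table by default."""
    # mode accepts "append", "overwrite", "ignore", or "error" / "errorifexists".
    df.write.format("parquet").mode(mode).saveAsTable(table_name)


# In the notebook, using the DataFrame created in Step 2:
# save_market_table(df, "test_market_data_tbl")
```

Passing `mode="append"` instead would add rows to the existing table rather than replace them.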