# Introduction to Data in Azure 

This notebook uses weather data from NOAA to explore how elevation affects weather conditions.


[Be sure to follow the instructions in the repository](https://github.com/paladique/Workshop-DataInAzure/blob/master/README.md) before continuing here.


## Querying from the Database

Now that the data is in Azure SQL, let's query it. Be sure to add `myconfig.cfg` and update it with the following:

  ```python
[my_db]
server: [your Azure SQL server name]
database: [your Azure SQL database name]
username: [your Azure SQL username]
password: [your Azure SQL password]
  ```
  
  **Note: This file will contain sensitive credentials. Avoid making your notebook public until they are removed.**
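
As a minimal sketch of how the notebook might read these settings back, the standard-library `configparser` module can parse the file. The helper name `load_db_settings` is hypothetical and not part of the workshop code; the file name and section name are taken from the snippet above.

```python
import configparser


def load_db_settings(path="myconfig.cfg", section="my_db"):
    """Return the settings in `section` of the config file as a plain dict.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    # interpolation=None keeps characters such as '%' in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    # read() silently skips missing files and returns the list it did parse.
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read config file: {path}")
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] not found in {path}")
    return dict(parser[section])


# Usage, once myconfig.cfg exists next to the notebook:
# settings = load_db_settings()
# settings["server"], settings["database"], settings["username"]
```

Reading the values at run time keeps credentials out of the notebook cells themselves, which is the point of the note above.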

## What's Happening?! Using Azure Data Factory

Looks like this data has some parsing and formatting issues! Let's clean it up: we'll grab the JSON version of this file and put it in Azure SQL.

Here's a preview of the structure:

{
  "id": 338995,
  "updated": "2020-01-21",
  "confirmed": 262,
  "deaths": 0,
  "country_region": "Worldwide",
  "load_time": "2020-06-16 00:05:27"
}

How do we convert this semi-structured data into relational data? Let's use Azure Data Factory to achieve this.

1. From the Azure Portal, open your Data Factory and select Author and Monitor, which will open a new tab.
2. In the Data Factory home page, select Copy Data to set up the manual task.
3. After clicking Next on Properties, let's create our data connections.

   3a. Select + Create New Connection

   3b. Search for Blob Storage, select Azure Blob Storage > Next

   3c. Select your Azure Subscription and your Storage Account Name, select Create

   3d. Repeat this process for Azure SQL Database and use SQL authentication

   Optional: Test your connection

4. Select the Azure Blob Storage connection as the source > Next
5. Select Browse on the right-hand side, select your container, and click Choose on bing_covid-19_data.json > Next
6. Confirm the file format is JSON; select it if not, and click Next
7. Select the Azure SQL connection as the destination target > Next
8. Select the CovidData database as the destination target > Next
9. We're only interested in the id, updated, and confirmed columns; deleting the others is optional. Click Next until you reach the Deployment complete window.
10. Loading this data will take a few minutes.
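
To make the idea behind the copy task concrete, here is a small local sketch of the same projection: take JSON records, keep only `id`, `updated`, and `confirmed`, and load them into a relational table. It is not what Data Factory runs internally. An in-memory SQLite database stands in for Azure SQL, the table name `covid_data` and the helper `to_rows` are hypothetical, and the file is assumed to be a JSON array of objects shaped like the preview above.

```python
import json
import sqlite3

# One record copied from the preview above, wrapped in an array (assumed layout).
raw = """[
  {"id": 338995, "updated": "2020-01-21", "confirmed": 262, "deaths": 0,
   "country_region": "Worldwide", "load_time": "2020-06-16 00:05:27"}
]"""

KEEP = ("id", "updated", "confirmed")  # the columns we are interested in


def to_rows(records, columns=KEEP):
    """Project each JSON object onto the chosen columns, in a fixed order."""
    return [tuple(rec.get(col) for col in columns) for rec in records]


conn = sqlite3.connect(":memory:")  # local stand-in for the Azure SQL database
conn.execute(
    "CREATE TABLE covid_data (id INTEGER PRIMARY KEY, updated TEXT, confirmed INTEGER)"
)
conn.executemany(
    "INSERT INTO covid_data (id, updated, confirmed) VALUES (?, ?, ?)",
    to_rows(json.loads(raw)),
)
conn.commit()

rows = conn.execute("SELECT id, updated, confirmed FROM covid_data").fetchall()
print(rows)  # [(338995, '2020-01-21', 262)]
```

The fields that are not listed in `KEEP` (`deaths`, `country_region`, `load_time`) are simply dropped, which mirrors removing them from the column mapping in the Copy Data tool.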

Too much clicking?
You can build data pipelines in Data Factory from the command line.

In [10]:
# Pip install packages
import os, sys

!{sys.executable} -m pip install azure-storage-blob
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install pandas



In [None]:
Let's grab the COVID data from the Open Datasets Catalog. Be sure to download the JSON file and upload it to your blob container if you haven't already!

In [12]:
# Azure storage access info
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = r""
container_name = "isdweatherdatacontainer"
folder_name = "ISDWeather/"

In [14]:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

if azure_storage_account_name is None or azure_storage_sas_token is None:
    raise Exception(
        "Provide your specific name and key for your Azure Storage account--see the Prerequisites section earlier.")

print('Looking for the first parquet under the folder ' +
      folder_name + ' in container "' + container_name + '"...')
container_url = f"https://{azure_storage_account_name}.blob.core.windows.net/"
blob_service_client = BlobServiceClient(
    container_url, azure_storage_sas_token if azure_storage_sas_token else None)

container_client = blob_service_client.get_container_client(container_name)
blobs = container_client.list_blobs(folder_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
targetBlobName = ''
for blob in sorted_blobs:
    if blob.name.startswith(folder_name) and blob.name.endswith('.parquet'):
        targetBlobName = blob.name
        break

print('Target blob to download: ' + targetBlobName)
_, filename = os.path.split(targetBlobName)
blob_client = container_client.get_blob_client(targetBlobName)
with open(filename, 'wb') as local_file:
    # readinto() streams the blob into the open file; it supersedes the deprecated download_to_stream()
    blob_client.download_blob().readinto(local_file)

Looking for the first parquet under the folder ISDWeather/ in container "isdweatherdatacontainer"...
Target blob to download: ISDWeather/year=2020/month=9/part-00007-tid-8228874701277795085-7b8726ea-d43e-4602-80de-455ce4f066a5-2406-9.c000.snappy.parquet
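
The selection logic in the cell above (sort the blob names in reverse and take the first `.parquet` under the folder) can be isolated into a small function that is easy to check without touching Azure. This is a sketch only: `pick_latest_parquet` is a hypothetical name, and the blob names in the example are shortened stand-ins rather than real listing output.

```python
def pick_latest_parquet(blob_names, folder):
    """Return the .parquet name under `folder` that sorts last, or None if there is none."""
    candidates = [
        name for name in blob_names
        if name.startswith(folder) and name.endswith(".parquet")
    ]
    # max() over strings picks the same name as sorting in reverse and taking the first.
    return max(candidates, default=None)


# Hypothetical, shortened blob names for illustration.
names = [
    "ISDWeather/year=2020/month=10/part-00000.snappy.parquet",
    "ISDWeather/year=2020/month=9/part-00007.snappy.parquet",
    "ISDWeather/year=2020/month=9/_SUCCESS",
    "Other/part-00000.snappy.parquet",
]
print(pick_latest_parquet(names, "ISDWeather/"))
```

Note that the comparison is lexicographic, so `month=9` sorts after `month=10`. The blob chosen this way is therefore not necessarily the most recent month, and returning `None` for an empty listing avoids attempting a download with an empty blob name.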


In [19]:
# Read the parquet file into Pandas data frame
import pandas as pd

print('Reading the parquet file into Pandas data frame')
df = pd.read_parquet(filename, columns=['datetime', 'latitude', 'longitude', 'elevation', 'stationName', 'temperature', 'windSpeed'])

Reading the parquet file into Pandas data frame


In [45]:
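# You can add your own filter below
print('Loaded as a Pandas data frame: ')
# df

# Keep only the stations inside a latitude/longitude bounding box
df.query('(38.503 <= latitude <= 39.887) & (-108.294 <= longitude <= -105.943)')
# Optional narrower latitude band (lower bound of 39.375 assumed):
# df.query('(39.375 <= latitude <= 39.887) & (-108.294 <= longitude <= -105.943)')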



Loaded as a Pandas data frame: 


Unnamed: 0,datetime,latitude,longitude,elevation,stationName,temperature,windSpeed
40695,2020-09-07 02:55:00,38.698,-106.070,2422.0,CENTRAL COLORADO REGIONAL AP,19.1,4.6
40724,2020-09-05 11:35:00,38.698,-106.070,2422.0,CENTRAL COLORADO REGIONAL AP,11.1,3.1
40766,2020-09-03 12:55:00,38.698,-106.070,2422.0,CENTRAL COLORADO REGIONAL AP,9.2,3.1
40767,2020-09-05 01:55:00,38.698,-106.070,2422.0,CENTRAL COLORADO REGIONAL AP,23.7,4.1
40774,2020-09-01 22:15:00,38.698,-106.070,2422.0,CENTRAL COLORADO REGIONAL AP,23.0,9.8
...,...,...,...,...,...,...,...
515734,2020-09-11 04:31:00,39.433,-107.383,3232.0,SUNLIGHT,-1.0,1.5
515742,2020-09-13 07:29:00,39.433,-107.383,3232.0,SUNLIGHT,7.0,0.0
515754,2020-09-05 14:12:00,39.433,-107.383,3232.0,SUNLIGHT,15.0,2.1
515790,2020-09-03 03:52:00,39.433,-107.383,3232.0,SUNLIGHT,12.0,1.5
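
As a possible next step toward the elevation question, the filtered rows can be grouped by station so that mean temperature sits next to elevation. The sketch below rebuilds a tiny frame from the nine rows visible in the output above (a subset of columns), so it runs without downloading the parquet file; the variable names `sample`, `in_box`, and `by_station` are illustrative only.

```python
import pandas as pd

# Rows copied from the output above: latitude, longitude, elevation, stationName, temperature.
sample = pd.DataFrame(
    [
        (38.698, -106.070, 2422.0, "CENTRAL COLORADO REGIONAL AP", 19.1),
        (38.698, -106.070, 2422.0, "CENTRAL COLORADO REGIONAL AP", 11.1),
        (38.698, -106.070, 2422.0, "CENTRAL COLORADO REGIONAL AP", 9.2),
        (38.698, -106.070, 2422.0, "CENTRAL COLORADO REGIONAL AP", 23.7),
        (38.698, -106.070, 2422.0, "CENTRAL COLORADO REGIONAL AP", 23.0),
        (39.433, -107.383, 3232.0, "SUNLIGHT", -1.0),
        (39.433, -107.383, 3232.0, "SUNLIGHT", 7.0),
        (39.433, -107.383, 3232.0, "SUNLIGHT", 15.0),
        (39.433, -107.383, 3232.0, "SUNLIGHT", 12.0),
    ],
    columns=["latitude", "longitude", "elevation", "stationName", "temperature"],
)

# Same bounding-box filter as in the cell above.
in_box = sample.query(
    "(38.503 <= latitude <= 39.887) & (-108.294 <= longitude <= -105.943)"
)

# Mean temperature per station, ordered from lowest to highest elevation.
by_station = (
    in_box.groupby(["stationName", "elevation"], as_index=False)["temperature"]
    .mean()
    .sort_values("elevation")
)
print(by_station)
```

With only nine readings taken at different times of day, this says nothing conclusive about elevation and temperature; running the same grouping on the full `df` would be the meaningful comparison.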
