# Introduction to Data Storage in Azure 

We'll use some existing data to track weekly confirmed COVID-19 cases in the United States.
[Be sure to follow the instructions in the repository](https://github.com/paladique/Workshop-DataInAzure/blob/master/README.md) before continuing here.


In [2]:
import sys  # needed so {sys.executable} resolves to this kernel's Python
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install pandas



Let's grab the COVID data from the [Open Datasets Catalog](https://azure.microsoft.com/en-us/services/open-datasets/catalog/bing-covid-19-data/).
_**Be sure to download the JSON file and upload it to your blob container if you haven't already!**_

In [23]:
# Import the packages used in this notebook
import os, sys
import pyarrow.parquet as pq
import pandas as pd
import pyodbc
import numpy as np
# %matplotlib inline
# import matplotlib.pyplot as plt


from datetime import datetime
from dateutil import parser
from dateutil.relativedelta import relativedelta

# Load the CSV data
df_covid = pd.read_csv("bing_covid-19_data.csv")

In [115]:
us_cases = df_covid.query('iso2 == "US"')
us_cases[['confirmed','updated']].groupby(pd.Grouper(key='updated',freq='W')).sum()


| updated | confirmed |
|---|---|
| 2020-01-05 | 4031065.0 |
| 2020-01-12 | 5183035.0 |
| 2020-01-19 | |
| 2020-01-26 | 23.0 |
| 2020-02-02 | 63.0 |
| 2020-02-09 | 9440153.0 |
| 2020-02-16 | 59.0 |
| 2020-02-23 | 205.0 |
| 2020-03-01 | 301.0 |
| 2020-03-08 | 9671943.0 |
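
`pd.Grouper(freq='W')` only buckets rows when the key column is datetime-like, so it can help to check the weekly grouping on a tiny frame first. A minimal sketch with made-up values (the rows below are illustrative, not taken from the Bing dataset):

```python
import pandas as pd

# Illustrative rows only; these are not real figures from the dataset.
sample = pd.DataFrame({
    "updated": ["2020-01-21", "2020-01-22", "2020-01-27", "2020-01-28"],
    "confirmed": [1, 2, 5, 6],
})

# Strings must become datetimes before a frequency-based Grouper can bucket them.
sample["updated"] = pd.to_datetime(sample["updated"])

# freq="W" produces weekly bins labelled by the Sunday that ends each week.
weekly = sample.groupby(pd.Grouper(key="updated", freq="W"))["confirmed"].sum()
print(weekly)
```

The first two rows fall in the week ending 2020-01-26 and the last two in the week ending 2020-02-02, which matches the Sunday labels in the output above.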


## Querying from the Database

Now that the data is in Azure SQL, let's query it. Be sure to create `myconfig.cfg` and fill it in with the following:

  ```ini
[my_db]
server: [your Azure SQL server name]
database: [your Azure SQL database name]
username: [your Azure SQL username]
password: [your Azure SQL password]
  ```
  
  **Note: This file will contain sensitive credentials. Avoid making your notebook public until they are removed.**
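
Before connecting, it can be useful to confirm that the config file has every key the notebook reads. The following is a minimal sketch, not part of the workshop code: `missing_keys`, `SAMPLE_CFG`, and all sample values are hypothetical, and the text is parsed from a string so nothing touches your real `myconfig.cfg`.

```python
from configparser import ConfigParser

# Hypothetical sample values; a real myconfig.cfg holds your own credentials.
SAMPLE_CFG = """
[my_db]
server: example-server.database.windows.net
database: CovidData
username: workshop_user
password: p%ss:word
"""

REQUIRED_KEYS = ("server", "database", "username", "password")

def missing_keys(cfg_text, section="my_db"):
    """Return the required keys that are absent or empty in the given section."""
    # interpolation=None keeps '%' in passwords literal instead of raising an error.
    config = ConfigParser(interpolation=None)
    config.read_string(cfg_text)
    if not config.has_section(section):
        return list(REQUIRED_KEYS)
    return [k for k in REQUIRED_KEYS
            if not config.get(section, k, fallback="").strip()]

print(missing_keys(SAMPLE_CFG))
```

Passing `interpolation=None` is a design choice worth considering for credentials: with the default interpolation, a `%` in a password is treated as a substitution marker and reading the value fails.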

## What's Happening?! Using Azure Data Factory

Looks like this data has some parsing and formatting issues! Let's clean it up: we'll grab the JSON version of this file and load it into Azure SQL.

Here's a preview of the structure:

```json
{
  "id": 338995,
  "updated": "2020-01-21",
  "confirmed": 262,
  "deaths": 0,
  "country_region": "Worldwide",
  "load_time": "2020-06-16 00:05:27"
}
```
How do we convert this semi-structured data into relational data? Let's use Azure Data Factory to achieve this.

1. From the Azure Portal, open your Data Factory and select **Author and Monitor**, which will open a new tab.
2. On the Data Factory home page, select **Copy Data** to set up the manual task.
3. After clicking **Next** on Properties, let's create our data connections.

    3a. Select **+ Create New Connection**
    
    3b. Search for Blob Storage, select **Azure Blob Storage** > **Next**
    
    3c. Select your Azure Subscription and your Storage Account Name, select **Create**
    
    3d. Repeat this process for Azure SQL Database and use SQL authentication
    
    Optional: Test your connection

4. Select the Azure Blob Storage connection as the source > **Next**
5. Select **Browse** on the right-hand side, select your container and click **Choose** on `bing_covid-19_data.json` > **Next**
6. Confirm that the file format is JSON; select it if not, then click **Next**
7. Select the Azure SQL connection as the destination target > **Next**
8. Select the CovidData database as the destination target > **Next**
9. We're only interested in the `id`, `updated`, and `confirmed` columns; deleting the other columns is optional (the query later in this notebook also filters on `country_region`, so keep that column too) > click **Next** until you reach the `Deployment complete` window
10. Loading this data will take a few minutes.
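
Conceptually, the Copy Data task maps each JSON record onto a table row and keeps only the chosen columns. Here is a rough local analogue, assuming the file is a JSON array of records shaped like the preview above and using an in-memory SQLite table as a stand-in for Azure SQL. The second record and the `load_records` helper are made up for illustration; this is not how Data Factory works internally.

```python
import json
import sqlite3

# First record mirrors the preview above; the second is invented for illustration.
raw = """[
  {"id": 338995, "updated": "2020-01-21", "confirmed": 262, "deaths": 0,
   "country_region": "Worldwide", "load_time": "2020-06-16 00:05:27"},
  {"id": 338996, "updated": "2020-01-22", "confirmed": 313, "deaths": 0,
   "country_region": "Worldwide", "load_time": "2020-06-16 00:05:27"}
]"""

KEEP = ("id", "updated", "confirmed")

def load_records(conn, records):
    """Insert only the KEEP columns of each JSON record into CovidData."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS CovidData "
        "(id INTEGER PRIMARY KEY, updated TEXT, confirmed INTEGER)"
    )
    rows = [tuple(rec.get(col) for col in KEEP) for rec in records]
    conn.executemany(
        "INSERT INTO CovidData (id, updated, confirmed) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")  # stand-in for the Azure SQL destination
inserted = load_records(conn, json.loads(raw))
stored = conn.execute(
    "SELECT id, updated, confirmed FROM CovidData ORDER BY id"
).fetchall()
print(inserted, stored)
```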

### Too much clicking? 
You can also build data pipelines in Data Factory from the command line.

In [116]:
from configparser import ConfigParser

# Read the connection settings from myconfig.cfg (named `config` to avoid
# shadowing the `dateutil.parser` import above).
config = ConfigParser()
_ = config.read('myconfig.cfg')

server = config.get('my_db', 'server')
database = config.get('my_db', 'database')
username = config.get('my_db', 'username')
password = config.get('my_db', 'password')

# Build the ODBC connection string and connect to Azure SQL.
cnxn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    f'SERVER={server};DATABASE={database};UID={username};PWD={password}'
)

In [117]:
query = "SELECT * FROM [dbo].[CovidData] WHERE country_region='United States'"
df_covid_us_sql = pd.read_sql(query, cnxn)
# Strip stray whitespace from the country name column.
df_covid_us_sql['country_region'] = df_covid_us_sql['country_region'].str.strip()
df_covid_us_sql.head(10)

# Sum confirmed cases per week, using the same grouping as the CSV query.
df_covid_us_sql[['confirmed','updated']].groupby(pd.Grouper(key='updated',freq='W')).sum()

| updated | confirmed |
|---|---|
| 2020-01-26 | 23 |
| 2020-02-02 | 79 |
| 2020-02-09 | 94 |
| 2020-02-16 | 97 |
| 2020-02-23 | 205 |
| 2020-03-01 | 402 |
| 2020-03-08 | 3528 |
| 2020-03-15 | 27931 |
| 2020-03-22 | 293590 |
| 2020-03-29 | 1760926 |


When compared to the following [Bing Visualization](https://bing.com/covid/local/unitedstates), we can see that we're still _very_ off, but in better shape than with the original query.
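
One possible contributor to the gap, stated here as an assumption rather than a diagnosis: if `confirmed` is a cumulative running total, summing the rows within a week counts the same cases several times. Under that assumption, taking the last value in each week is closer to a weekly figure. A small sketch with made-up numbers for a single region:

```python
import pandas as pd

# Made-up cumulative counts for a single region; not real data.
daily = pd.DataFrame({
    "updated": pd.to_datetime(
        ["2020-03-02", "2020-03-04", "2020-03-06", "2020-03-09", "2020-03-11"]
    ),
    "confirmed": [10, 15, 20, 30, 45],
})

by_week = (
    daily.sort_values("updated")
    .groupby(pd.Grouper(key="updated", freq="W"))["confirmed"]
)

comparison = pd.DataFrame({
    "sum_of_rows": by_week.sum(),   # inflates a running total
    "last_value": by_week.last(),   # running total at the end of each week
})
print(comparison)
```

In this toy frame, the week ending 2020-03-08 sums to 45 even though the running total only reached 20 by then.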