# Tutorial: Upload, access and explore your data in Azure Machine Learning

In this tutorial you learn how to:

> * Upload your data to cloud storage
> * Create an Azure Machine Learning data asset
> * Access your data in a notebook for interactive development
> * Create new versions of data assets

The start of a machine learning project typically involves exploratory data analysis (EDA), data preprocessing (cleaning, feature engineering), and building machine learning model prototypes to validate hypotheses. This _prototyping_ phase is highly interactive, and it lends itself to development in an IDE or a Jupyter notebook with a _Python interactive console_. This tutorial walks through that workflow.

> [!NOTE]
> This tutorial depends on data placed in an Azure Machine Learning resource folder location. For this tutorial, 'local' means a folder location in that Azure Machine Learning resource. 

1. Select **Open terminal** below the three dots, as shown in this image:

    ![Open terminal](./media/open-terminal.png)

1. The terminal window opens in a new tab. 
1. Make sure you `cd` to the same folder where this notebook is located.  For example, if the notebook is in a folder named **get-started-notebooks**:

    ```
    cd get-started-notebooks    #  modify this to the path where your notebook is located
    ```

1. Enter these commands in the terminal window to copy the data to your compute instance:

    ```
    mkdir data
    cd data                     # the sub-folder where you'll store the data
    wget https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv
    ```
1. You can now close the terminal window.


[Learn more about this data on the UCI Machine Learning Repository.](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)

## Create handle to workspace

Before we dive into the code, you need a way to reference your workspace. You'll create `ml_client` as a handle to the workspace, and then use `ml_client` to manage resources and jobs.

In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace name.
1. Copy the value for workspace, resource group and subscription ID into the code.
1. You'll need to copy one value, close the area and paste, then come back for the next one.

![image of workspace credentials](./media/find-credentials.png)

```
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential
from dotenv import dotenv_values
import os

# Load the workspace details and service principal credentials from a local .env file
config = dotenv_values(".env")
subscription_id = config["SUBSCRIPTION_ID"]
resource_group = config["RESOURCE_GP"]
workspace = config["WORKSPACE"]

# DefaultAzureCredential picks up the service principal from these environment variables
os.environ["AZURE_CLIENT_ID"] = config["AZURE_CLIENT_ID"]
os.environ["AZURE_CLIENT_SECRET"] = config["AZURE_CLIENT_SECRET"]
os.environ["AZURE_TENANT_ID"] = config["AZURE_TENANT_ID"]

credential = DefaultAzureCredential()
# Check if given credential can get token successfully.
credential.get_token("https://management.azure.com/.default")

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace,
)
```

An **Azure service principal** is a security identity used by user-created apps, services, and automation tools to access specific Azure resources. Think of it as a 'user identity' (login and password or certificate) with a specific role, and tightly controlled permissions to access your resources. It only needs to be able to do specific things, unlike a general user identity. It improves security if you only grant it the minimum permissions level needed to perform its management tasks.

[How to set up authentication using a service principal](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?view=azureml-api-2&tabs=sdk)
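
The cell above reads six settings from a `.env` file, and a missing entry only surfaces as a `KeyError` partway through. As a minimal sketch, the check below lists absent or empty settings up front. The helper name `missing_settings` and the placeholder values are assumptions made for illustration; only the key names come from the cell above.

```python
# Keys that the workspace-handle cell expects to find in the .env file.
REQUIRED_KEYS = (
    "SUBSCRIPTION_ID",
    "RESOURCE_GP",
    "WORKSPACE",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_TENANT_ID",
)


def missing_settings(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]


# Stand-in dictionary with placeholder values; in the notebook you would
# pass the result of dotenv_values(".env") instead.
example_config = {
    "SUBSCRIPTION_ID": "<subscription-id>",
    "RESOURCE_GP": "<resource-group>",
    "WORKSPACE": "<workspace-name>",
    "AZURE_CLIENT_ID": "<client-id>",
    "AZURE_TENANT_ID": "",
}
print(missing_settings(example_config))
```

An empty list means every expected setting is present, so the credential and `MLClient` can be created.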

> [!NOTE]
> Creating `MLClient` doesn't connect to the workspace. Client initialization is lazy: it waits until the first time it needs to make a call, which happens when you create the data asset later in this tutorial.


## Upload data to cloud storage

Azure Machine Learning uses Uniform Resource Identifiers (URIs), which point to storage locations in the cloud. A URI makes it easy to access data in notebooks and jobs. Data URI formats look similar to the web URLs that you use in your web browser to access web pages. For example:

* Access data from public https server: `https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>`
* Access data from local computer: `./home/username/data/my_data`
* Blob storage: `wasbs://<containername>@<accountname>.blob.core.windows.net/<folder>/`
* Azure Data Lake (gen2): `abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>.csv`
* Azure Data Lake (gen1): `adl://<accountname>.azuredatalakestore.net/<folder1>/<folder2>`
* Azure Machine Learning Datastore: `azureml://datastores/<data_store_name>/paths/<folder1>/<folder2>/<folder3>/<file>.parquet`
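
The URI scheme (the part before `://`) is what distinguishes these storage types. The sketch below uses the standard library's `urlparse` to label a URI by its scheme. The helper name `storage_kind` and the label strings are assumptions made for illustration and aren't part of the Azure Machine Learning SDK.

```python
from urllib.parse import urlparse

# Scheme-to-storage mapping taken from the URI formats listed above.
STORAGE_BY_SCHEME = {
    "https": "Public HTTPS server",
    "wasbs": "Azure Blob storage",
    "abfss": "Azure Data Lake (gen2)",
    "adl": "Azure Data Lake (gen1)",
    "azureml": "Azure Machine Learning datastore",
}


def storage_kind(uri):
    """Return a label for the storage type that a data URI points to."""
    scheme = urlparse(uri).scheme
    if not scheme:
        # A path without a scheme (for example ./data/file.csv) is local.
        return "Local path"
    return STORAGE_BY_SCHEME.get(scheme, "Unknown")


print(storage_kind("azureml://datastores/workspaceblobstore/paths/data/file.parquet"))
print(storage_kind("./data/my_data.csv"))
```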

An Azure Machine Learning data asset is similar to web browser bookmarks (favorites). Instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.

Data asset creation also creates a *reference* to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and don't risk data source integrity. You can create Data assets from Azure Machine Learning datastores, Azure Storage, public URLs, and local files.

> [!TIP]
> For smaller-size data uploads, Azure Machine Learning data asset creation works well for data uploads from local machine resources to cloud storage. This approach avoids the need for extra tools or utilities. However, a larger-size data upload might require a dedicated tool or utility - for example, **azcopy**. The azcopy command-line tool moves data to and from Azure Storage. Learn more about [azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).

The next notebook cell creates the data asset. The code sample uploads the raw data file to the designated cloud storage resource.  

Each time you create a data asset, you need a unique version for it.  If the version already exists, you'll get an error.  In this code, we're using time to generate a unique version each time the cell is run.

You can also omit the **version** parameter, and a version number is generated for you, starting with 1 and then incrementing from there. In this tutorial, we want to refer to specific version numbers, so we create a version number instead.
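
To make the time-based versioning concrete, the sketch below applies the same `strftime` pattern used in this tutorial to two fixed timestamps. The helper name `make_version` is an assumption made for illustration. Note that the pattern has one-second resolution, so two runs within the same second would produce the same version string.

```python
import time

VERSION_FORMAT = "%Y.%m.%d.%H%M%S"


def make_version(epoch_seconds=None):
    """Format a UTC timestamp (default: now) as a version string."""
    return time.strftime(VERSION_FORMAT, time.gmtime(epoch_seconds))


# Fixed timestamps make the format easy to inspect.
first = make_version(0)   # the Unix epoch
second = make_version(1)  # one second later
print(first, second)
```

Because every field is zero-padded and ordered from year down to second, later versions also sort after earlier ones as plain strings.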

```
import pandas as pd

# adjust the file name if your downloaded file is named differently
df = pd.read_csv("./data/UCI_Credit_Card.csv")
df.columns
```

```
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default.payment.next.month'],
      dtype='object')
```

```
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import time

# update the 'my_path' variable to match the location of where you downloaded the data on your
# local filesystem
my_path = "./data/UCI_Credit_Card.csv"
v1 = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
my_data = Data(
    name="credit-card",
    version=v1,
    description="Credit card data",
    path=my_path,
    type=AssetTypes.URI_FILE,
)

# create data asset
ml_client.data.create_or_update(my_data)

print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")
```

```
Data asset created. Name: credit-card, version: 2023.06.16.105913
```


You can see the uploaded data by selecting **Data** on the left. You'll see the data is uploaded and a data asset is created:

![Image of data section of studio shows uploaded data](./media/access-and-explore-data.png)

This data is named **credit-card**, and you can see it in the **Name** column of the **Data assets** tab. The data was uploaded to your workspace's default datastore, named **workspaceblobstore**, which appears in the **Data source** column.

An Azure Machine Learning datastore is a *reference* to an *existing* storage account on Azure. A datastore offers these benefits:

1. A common and easy-to-use API, to interact with different storage types (Blob/Files/Azure Data Lake Storage) and authentication methods.
1. An easier way to discover useful datastores, when working as a team.
1. In your scripts, a way to hide connection information for credential-based data access (service principal/SAS/key).


## Access your data in a notebook

Pandas supports URIs directly. This example shows how to read a CSV file from an Azure Machine Learning datastore:

```
import pandas as pd

df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
```

However, as mentioned previously, it can become hard to remember these URIs. Additionally, you must manually substitute all **<_substring_>** values in the **pd.read_csv** command with the real values for your resources. 
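
One way to reduce the manual substitution is to keep the URI pattern in a template and fill it in code. The sketch below does this with `str.format`. The helper name `datastore_uri` and the example path are assumptions made for illustration; the URI layout is the one shown in the `pd.read_csv` example above.

```python
URI_TEMPLATE = (
    "azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}"
    "/workspaces/{workspace}/datastores/{datastore}/paths/{path}"
)


def datastore_uri(subscription_id, resource_group, workspace, datastore, path):
    """Fill the datastore URI template with concrete resource names."""
    return URI_TEMPLATE.format(
        subscription_id=subscription_id,
        resource_group=resource_group,
        workspace=workspace,
        datastore=datastore,
        path=path.lstrip("/"),  # avoid a double slash after "paths/"
    )


# Placeholder values; substitute the real names of your own resources.
uri = datastore_uri(
    "<subid>", "<rgname>", "<workspace_name>", "workspaceblobstore", "data/credit.csv"
)
print(uri)
```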

You'll want to create data assets for frequently accessed data. Here's an easier way to access the CSV file in Pandas:

> [!IMPORTANT]
> Reading `azureml://` URIs with Pandas requires the `azureml-fsspec` Python library. If it isn't already installed in your Jupyter kernel, run `%pip install -U azureml-fsspec` in a notebook cell.

```
import pandas as pd

# get a handle of the data asset and print the URI
data_asset = ml_client.data.get(name="credit-card", version=v1)
print(f"Data asset URI: {data_asset.path}")

# read into pandas
df = pd.read_csv(data_asset.path)
df.head()
```

```
Data asset URI: azureml://subscriptions/9017d57d-c4df-480d-b92d-7aea2266b0f0/resourcegroups/azure-mlops-demo/workspaces/demo-ws/datastores/workspaceblobstore/paths/LocalUpload/61c541f311e1bda7f0bb68f645ad64b7/UCI_Credit_Card.csv
```


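| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |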


You can also download the data asset to your local filesystem:

```
import pandas as pd
import azure.ai.ml._artifacts._artifact_utilities as artifact_utils

# get a handle of the data asset and print the URI
data = ml_client.data.get(name="credit-card", version=v1)
print(f"Data asset URI: {data.path}")

# Download the dataset to the current folder.
# Note: artifact_utils is an internal (underscore-prefixed) module of azure-ai-ml,
# so its interface may change between SDK releases.
artifact_utils.download_artifact_from_aml_uri(
    uri=data.path,
    destination=".",
    datastore_operation=ml_client.datastores,
)
# df = pd.read_csv(data.path)
# df.head()
```

Read [Access data from Azure cloud storage during interactive development](how-to-access-data-interactive.md) to learn more about data access in a notebook.

## Create a new version of the data asset


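The data needs some light cleaning before it's fit for training a machine learning model. It has:

* a client ID column; we wouldn't use this feature in Machine Learning
* a long response variable name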

Also, compared to the CSV format, the Parquet file format is a better way to store this data. Parquet offers compression, and it maintains schema. Therefore, to clean the data and store it in Parquet, use:

```
# read in the data again from the data asset
df = pd.read_csv(data_asset.path)
# rename the response column
df.rename(columns={"default.payment.next.month": "default"}, inplace=True)
# remove ID column
df.drop("ID", axis=1, inplace=True)

# write file to filesystem
df.to_parquet("./data/cleaned-credit-card.parquet")
```
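
To see the effect of the rename and drop steps without touching the full file, the sketch below applies them to a tiny in-memory CSV. It assumes only that Pandas is installed. The sample holds the first two rows shown earlier, restricted to four columns for brevity, and it uses the non-`inplace` form so the original frame stays available for comparison.

```python
import io

import pandas as pd

# First two rows of the data shown earlier, restricted to four columns.
sample_csv = io.StringIO(
    "ID,LIMIT_BAL,AGE,default.payment.next.month\n"
    "1,20000.0,24,1\n"
    "2,120000.0,26,1\n"
)
raw = pd.read_csv(sample_csv)

# rename() and drop() return new DataFrames, so `raw` stays unchanged.
cleaned = raw.rename(columns={"default.payment.next.month": "default"}).drop(columns="ID")

print(list(raw.columns))
print(list(cleaned.columns))
```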

This table shows the structure of the data in the original **default_of_credit_card_clients.csv** file downloaded in an earlier step. The uploaded data contains 23 explanatory variables and 1 response variable, as shown here:

|Column Name(s) | Variable Type  |Description  |
|---------|---------|---------|
|X1     |   Explanatory      |    Amount of the given credit (NT dollar): it includes both the individual consumer credit and their family (supplementary) credit.    |
|X2     |   Explanatory      |   Gender (1 = male; 2 = female).      |
|X3     |   Explanatory      |   Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).      |
|X4     |   Explanatory      |    Marital status (1 = married; 2 = single; 3 = others).     |
|X5     |   Explanatory      |    Age (years).     |
|X6-X11     | Explanatory        |  History of past payment. We tracked the past monthly payment records (from April to September 2005). -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.      |
|X12-X17     | Explanatory        |  Amount of bill statement (NT dollar) from April to September 2005.      |
|X18-X23     | Explanatory        |  Amount of previous payment (NT dollar) from April to September 2005.      |
|Y     | Response        |    Default payment (Yes = 1, No = 0)     |
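
The generic names in this table can be lined up with the descriptive column names printed by `df.columns` earlier. The sketch below builds that lookup. It assumes the two naming schemes correspond position by position (with the `ID` column left out), which matches the descriptions in the table but isn't stated explicitly in this tutorial.

```python
# Descriptive names in the order printed by df.columns earlier (ID omitted).
explanatory = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(1, 7)]
)

# X1..X23 map to the explanatory variables; Y maps to the response variable.
column_names = {f"X{i}": name for i, name in enumerate(explanatory, start=1)}
column_names["Y"] = "default.payment.next.month"

print(column_names["X1"], column_names["X6"], column_names["X23"], column_names["Y"])
```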

Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage):

> [!NOTE]
>
> This Python code cell sets **name** and **version** values for the data asset it creates. Here the version is derived from the current UTC time, so each run creates a new version. If you replace it with a fixed version string, the code in this cell will fail if executed more than once without a change to that value. Fixed **name** and **version** values offer a way to pass values that work for specific situations, without concern for auto-generated or randomly-generated values.


```
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import time

# Path to the cleaned file and a new time-based version string
my_path = "./data/cleaned-credit-card.parquet"
v2 = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

# Define the data asset, and use tags to make it clear the asset can be used in training
my_data = Data(
    name="credit-card",
    version=v2,
    description="Default of credit card clients data.",
    tags={"training_data": "true", "format": "parquet"},
    path=my_path,
    type=AssetTypes.URI_FILE,
)

# create the data asset
my_data = ml_client.data.create_or_update(my_data)

print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")
```

```
Data asset created. Name: credit-card, version: 2023.06.16.110114
```
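
Because the version strings in this tutorial are UTC timestamps, they can be parsed back into `datetime` values and compared. The sketch below picks the most recent of the two versions printed by the cells above. The helper name `parse_version` is an assumption made for illustration, and the approach only applies to versions that follow this timestamp pattern.

```python
from datetime import datetime

VERSION_FORMAT = "%Y.%m.%d.%H%M%S"


def parse_version(version):
    """Convert a timestamp-style version string back to a datetime."""
    return datetime.strptime(version, VERSION_FORMAT)


# The two version strings printed by the cells above, deliberately unordered.
versions = ["2023.06.16.110114", "2023.06.16.105913"]
latest = max(versions, key=parse_version)
print(latest, parse_version(latest))
```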


The cleaned Parquet file is now the latest version of the data asset. This code shows the CSV version result set first, then the Parquet version:

```
import pandas as pd

# get a handle of the data asset and print the URI
data_asset_v1 = ml_client.data.get(name="credit-card", version=v1)
data_asset_v2 = ml_client.data.get(name="credit-card", version=v2)

# print the v1 data
print(f"V1 Data asset URI: {data_asset_v1.path}")
v1df = pd.read_csv(data_asset_v1.path)
print(v1df.head(5))

# print the v2 data
print(
    "_____________________________________________________________________________________________________________\n"
)
print(f"V2 Data asset URI: {data_asset_v2.path}")
v2df = pd.read_parquet(data_asset_v2.path)
print(v2df.head(5))
```

V1 Data asset URI: azureml://subscriptions/9017d57d-c4df-480d-b92d-7aea2266b0f0/resourcegroups/azure-mlops-demo/workspaces/demo-ws/datastores/workspaceblobstore/paths/LocalUpload/61c541f311e1bda7f0bb68f645ad64b7/UCI_Credit_Card.csv
   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4   
0   1    20000.0    2          2         1   24      2      2     -1     -1  \
1   2   120000.0    2          2         2   26     -1      2      0      0   
2   3    90000.0    2          2         2   34      0      0      0      0   
3   4    50000.0    2          2         1   37      0      0      0      0   
4   5    50000.0    1          2         1   57     -1      0     -1      0   

   ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3   
0  ...        0.0        0.0        0.0       0.0     689.0       0.0  \
1  ...     3272.0     3455.0     3261.0       0.0    1000.0    1000.0   
2  ...    14331.0    14948.0    15549.0    1518.0    1500.0    1000.0   
3