# Lab 3 - Creating a dataset in Azure ML

In this prelude to the first "real" lab using the Azure ML Python SDK, we're going to upload a dataset to Azure ML for future use. Registering a dataset in this way ensures that it is accessible for training not only by a user but later also by a pipeline for automated, unattended re-training.

In [None]:
from azureml.core import Workspace, Dataset, Datastore

## Connecting to the workspace

As a first step, we must establish a connection to the workspace in which we are going to register our dataset and, later, log our model metrics.

Note that in the following code cell we create this connection by referring to a configuration file that is available within the Compute Instance we're working with. To use this method outside of the Compute Instance, you will first need to download the config file from Azure ML Studio.

**When you run this cell for the first time, you will need to copy the code provided and navigate to the page linked to complete authentication.**

In [None]:
ws = Workspace.from_config()
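
Outside a Compute Instance, the config file mentioned above is a small JSON document. As a minimal sketch, assuming the three keys found in the file downloaded from Azure ML Studio, the hypothetical helper `write_workspace_config` below writes such a file into a `.azureml` folder, which is one of the locations `Workspace.from_config()` searches. The helper name and its arguments are illustrative only; downloading the file from the Studio remains the simpler route.

```python
import json
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json that Workspace.from_config() can pick up.

    Hypothetical helper for illustration; the key names follow the
    config file that Azure ML Studio offers for download.
    """
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    # from_config() looks for config.json in a .azureml subfolder, among other places
    config_dir = Path(directory) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config_path.write_text(json.dumps(config, indent=4))
    return config_path
```

With the file in place, running `Workspace.from_config()` from that folder should locate it; an explicit location can also be passed via `Workspace.from_config(path=...)`.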

## Uploading CSV to datastore

In this case we are uploading our CSV file to Azure ML's default blob storage. Please note that this is not recommended practice outside of this tutorial. Typically, your organization will first register a separate datastore that hosts its training data, e.g. data sitting in an Azure Data Lake or a relational database.

Please note that in a production environment you will rarely be uploading individual files but will want to rely on a clear process to ingest data into Azure ML. Have a look at the data ingestion options outlined in [this article](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data-ingestion).

In [None]:
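# Use the workspace's default datastore (blob storage) as the upload target
datastore = ws.get_default_datastore()

# Upload the local CSV; overwrite any copy left over from a previous run
datastore.upload_files(files=['data/german_credit_dataset.csv'], overwrite=True, show_progress=True)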
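
For the recommended setup mentioned above, where a separate datastore hosts the training data, the following is a minimal sketch using `Datastore.register_azure_blob_container` from the SDK. The function name `register_training_datastore` and all argument values are hypothetical; the storage account and container are assumed to exist already.

```python
def register_training_datastore(ws, datastore_name, container_name, account_name, account_key):
    """Register an existing Azure Blob container as a datastore in the workspace.

    Hypothetical wrapper for illustration; all names passed in are placeholders.
    """
    # Imported inside the function so the sketch can be defined
    # even where the azureml SDK is not installed.
    from azureml.core import Datastore

    return Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=datastore_name,
        container_name=container_name,
        account_name=account_name,
        account_key=account_key,
    )
```

In practice, the account key would be read from a secret store or an environment variable rather than typed into the notebook.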

## Creating a dataset

Next, we will create a dataset object of type *Tabular* and point it to the CSV file we previously uploaded to our datastore. Note that the Dataset class also supports file datasets, which may be used, for example, when handling image data or other file types.

In [None]:
dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'german_credit_dataset.csv')])
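
For comparison, a file dataset is created with `Dataset.File.from_files` instead. The sketch below assumes a folder named `images` exists in the datastore; that folder and the function name `create_image_dataset` are hypothetical and not part of this lab.

```python
def create_image_dataset(datastore, folder="images"):
    """Create a file dataset that references every file under a datastore folder.

    Hypothetical example; the 'images' folder is an assumption.
    """
    # Imported inside the function so the sketch can be defined
    # even where the azureml SDK is not installed.
    from azureml.core import Dataset

    # Same (datastore, relative_path) tuple form as the tabular dataset above
    return Dataset.File.from_files(path=[(datastore, folder)])
```

Unlike a tabular dataset, a file dataset does not parse its contents; it only references the files so they can be downloaded or mounted later.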

As a final step, we are going to register this dataset in our Azure ML workspace and give it some metadata tags for easier identification in the future, when the list of datasets may have grown considerably.

In [None]:
dataset.register(ws, name = 'german_credit_dataset', tags = {'purpose': 'demo'}, create_new_version = True)

When the command above has completed, navigate to your [Azure ML workspace](https://ml.azure.com) and open the datasets section to verify that your dataset has been successfully created.
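
The registration can also be checked from code, which is how a later notebook or pipeline would pick the dataset up. The following is a minimal sketch using `Dataset.get_by_name`; the function name `preview_registered_dataset` is hypothetical, and the default dataset name matches the one registered above.

```python
def preview_registered_dataset(ws, name="german_credit_dataset", rows=5):
    """Fetch a registered tabular dataset by name and return its first rows.

    Hypothetical helper for illustration.
    """
    # Imported inside the function so the sketch can be defined
    # even where the azureml SDK is not installed.
    from azureml.core import Dataset

    # Without an explicit version, the latest registered version is returned
    dataset = Dataset.get_by_name(ws, name=name)

    # take() limits the records read, so only a small sample is loaded into pandas
    return dataset.take(rows).to_pandas_dataframe()
```

Calling `preview_registered_dataset(ws)` with the workspace object from earlier should return a small pandas DataFrame if the registration succeeded.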

You may now close this notebook and proceed to the next one. 

## Further information

- [Example notebooks for working with datasets and datastores in Azure ML](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/work-with-data)

## Disclaimer

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.