# Loading Data Tutorial

MLDB operates on data via [Datasets](/doc/#builtin/datasets/Datasets.md.html), which can be created and populated in three different ways:

1. You can create a mutable Dataset and insert data row by row via REST.
1. You can create a Dataset from an existing file.
1. You can create a Dataset by running a [Procedure](/doc/#builtin/procedures/Procedure.md.html).

## Creating Datasets via REST

Creating a Dataset takes a single REST call. The most important decision is choosing the right [Dataset type](/doc/#builtin/datasets/Datasets.md.html). Here we will use a type from the `beh` family: the [`beh.mutable` Dataset](/doc/#builtin/datasets/MutableBehaviourDataset.md.html), which lets us append data before committing.

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](/doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](/doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [1]:
from pymldb import Connection
mldb = Connection()

Let's create a dataset called `example` which will be persisted to the local disk at `/mldb_data/datasets/example.beh`.

In [2]:
ds = mldb.v1.datasets("example")
ds.put({
    "type":"beh.mutable",
    "params": {
        "dataFileUrl":"file:///mldb_data/datasets/example.beh"
    }
})

That's all there is to it. Now we can add some rows and commit the dataset.

In [3]:
ds.rows.post({
    "rowName": "first row",
    "columns": [
        ["first column", 1, 0],
        ["second column", 2, 0]
    ]
})

ds.rows.post({
    "rowName": "second row",
    "columns": [
        ["first column", 3, 0],
        ["second column", 4, 0]
    ]
})

ds.commit.post({})

So now we have a little bit of data in our dataset. Let's check.

In [4]:
mldb.query("select * from example")

| _rowName | first column | second column |
|---|---|---|
| first row | 1 | 2 |
| second row | 3 | 4 |
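
Each entry in the `columns` lists posted above is a triple. The first two elements are clearly the column name and the cell value; the third (always `0` here) is read below as a timestamp, which is an assumption the tutorial itself does not spell out. As a minimal sketch, the same payloads can be built from plain dictionaries. The helpers `to_columns` and `make_row` are hypothetical names introduced for illustration, not part of `pymldb`.

```python
def to_columns(values, timestamp=0):
    """Turn {"column name": value} into [[name, value, timestamp], ...].

    The third element is assumed to be a timestamp; the tutorial uses 0.
    """
    return [[name, value, timestamp] for name, value in values.items()]


def make_row(row_name, values, timestamp=0):
    """Build a payload shaped like the ones posted to ds.rows above."""
    return {"rowName": row_name, "columns": to_columns(values, timestamp)}


rows = [
    make_row("first row", {"first column": 1, "second column": 2}),
    make_row("second row", {"first column": 3, "second column": 4}),
]

# With a live MLDB connection, each payload could then be posted:
# for row in rows:
#     ds.rows.post(row)
# ds.commit.post({})
```

Building the payloads separately from posting them keeps the row format easy to inspect before anything is sent to the server.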


## Creating a Dataset from a file

In the example above, our dataset was persisted to the local disk at `/mldb_data/datasets/example.beh`. The immutable [`beh` Dataset](/doc/#builtin/datasets/BehaviourDataset.md.html) type can load data from `.beh` files. This means that we could have persisted our file to shared storage, such as Amazon S3, and another instance of MLDB could then load it directly.

In [5]:
ds2 = mldb.v1.datasets("example2")
ds2.put({
    "type":"beh",
    "params": {
        "dataFileUrl":"file:///mldb_data/datasets/example.beh"
    }
})

In [6]:
mldb.query("select * from example2")

| _rowName | first column | second column |
|---|---|---|
| first row | 1 | 2 |
| second row | 3 | 4 |
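
The shared-storage scenario mentioned above can be sketched by parameterizing the `dataFileUrl`. This is only an illustration under stated assumptions: the helper `beh_config` and the bucket `my-bucket` are hypothetical, and the sketch assumes the MLDB instance accepts `s3://` URLs and has credentials configured separately.

```python
def beh_config(data_file_url, mutable=False):
    """Return a dataset config dict in the shape used by the PUT calls above."""
    return {
        "type": "beh.mutable" if mutable else "beh",
        "params": {"dataFileUrl": data_file_url},
    }


# Hypothetical shared location; replace with a real bucket and key.
SHARED_URL = "s3://my-bucket/datasets/example.beh"

# Writer side: a mutable dataset persisted to shared storage on commit.
writer_config = beh_config(SHARED_URL, mutable=True)

# Reader side: another MLDB instance loads the same file as immutable.
reader_config = beh_config(SHARED_URL)

# With a live connection on the reading instance:
# mldb.v1.datasets("example2").put(reader_config)
```

The only difference between the two configurations is the `type`, which mirrors the `beh.mutable` and `beh` calls used in this tutorial.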


## Creating a Dataset by running a Procedure on another Dataset

Procedures take Datasets as inputs and can create Datasets as outputs. This is how you can do data cleanup/transformation in MLDB. Here's a simple example with the [`transform` Procedure](/doc/#builtin/procedures/TransformDataset.md.html):

In [7]:
proc = mldb.v1.procedures("example")
proc.put({
    "type": "transform",
    "params": {
        "inputDataset": {"id": "example"},
        "outputDataset": {
            "id": "example3",
            "type": "beh.mutable",
            "params":{
                "dataFileUrl": "file:///mldb_data/datasets/example3.beh"
            }
        },
        "select": '"first column" + "second column" as "transformed column"'
    }
})
proc.runs.post({})

In [8]:
mldb.query("select * from example3")

| _rowName | transformed column |
|---|---|
| first row | 3 |
| second row | 7 |
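
As a quick sanity check, the arithmetic of the `select` expression can be reproduced in plain Python, without MLDB. This sketch only mirrors the per-row sum for the two rows inserted earlier; it is not how the `transform` Procedure is implemented.

```python
# The two rows inserted earlier in this tutorial.
rows = {
    "first row": {"first column": 1, "second column": 2},
    "second row": {"first column": 3, "second column": 4},
}

# Mirror of: "first column" + "second column" as "transformed column"
transformed = {
    name: {"transformed column": cols["first column"] + cols["second column"]}
    for name, cols in rows.items()
}

for name, cols in transformed.items():
    print(name, cols["transformed column"])
```

The resulting values, 3 and 7, match the `transformed column` returned by the query on `example3`.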


## Where to next?

Check out the other [Tutorials and Demos](/doc/#builtin/Demos.md.html).