# Introduction to YAML

YAML is a cross-language data format meant to store data structures in a human-readable way. It is similar to JSON, but it has fewer boilerplate characters and is easier for a person to read. YAML has many features, including references and multi-document files. In this class, however, we will only use basic data structures such as lists and key-value mappings ("dictionaries" in Python), so let's look at how to write those in YAML.

In YAML, just like in Python, the layout of the document is an integral part of the definition of the data structures. Whitespace is therefore significant, and indentation is used to open and close data structures. Note that tabs are not allowed for indentation in YAML documents; only spaces are.
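
To make the rule about tabs concrete, here is a minimal sketch using PyYAML (the `pyyaml` package, assumed to be installed). The two sample strings are made up for illustration: the same mapping parses when indented with spaces and is rejected when indented with a tab.

```python
import yaml

spaces_doc = "parent:\n  child: 1\n"  # nested mapping indented with two spaces
tabs_doc = "parent:\n\tchild: 1\n"    # same mapping indented with a tab character

# Indentation with spaces opens the nested mapping
parsed = yaml.safe_load(spaces_doc)
print(parsed)

# A tab used for indentation makes the parser raise an error
tab_error = None
try:
    yaml.safe_load(tabs_doc)
except yaml.YAMLError as exc:
    tab_error = exc
    print("Could not parse:", type(exc).__name__)
```

If your editor inserts tabs automatically, configuring it to use spaces for `.yml` and `.yaml` files avoids this class of error.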

Let's look at how to write simple data structures in YAML.

## Lists and dictionaries


**A simple list**

This is how you define a list in Python:

In [1]:
my_list = ['a word', 'b', 1, 3.5]

And this is how the same is represented in a YAML file:

```
- a word
- b
- 1
- 3.5
```
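
As a quick check, the YAML list can be parsed back into Python. This is a minimal sketch that assumes PyYAML (`pyyaml`) is installed and keeps the YAML text in a string instead of a file. It also shows that unquoted scalars are converted to the matching Python types.

```python
import yaml

yaml_text = """
- a word
- b
- 1
- 3.5
"""

# safe_load accepts a string as well as an open file
my_list = yaml.safe_load(yaml_text)
print(my_list)

# Each item gets a Python type inferred from how it is written
item_types = [type(item).__name__ for item in my_list]
print(item_types)
```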

**A nested list**

Of course, in Python you can define lists containing other lists:

In [3]:
my_list = ['a word', [1, 2, 'a'], 1, 3.5]

This is how that data structure is represented in YAML:

```
- a word
- - 1
  - 2
  - a
- 1
- 3.5
```

Note how the indentation at line 3 signifies the continuation of the list that started at line 2, and how going back to the upper indentation level signifies the end of the inner list and the continuation of the outer list.
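
The nesting can be verified in the same way. The sketch below assumes PyYAML is installed and parses the block-style document from above, then a second version that writes the inner list in YAML's inline ("flow") style, which looks like a Python list. Both should give the same Python object.

```python
import yaml

block_style = """
- a word
- - 1
  - 2
  - a
- 1
- 3.5
"""

# Same data, with the inner list written inline
flow_style = """
- a word
- [1, 2, a]
- 1
- 3.5
"""

nested_block = yaml.safe_load(block_style)
nested_flow = yaml.safe_load(flow_style)

print(nested_block)
print(nested_block[1])  # the inner list
```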

**A key-value mapping (a dictionary)**

This is how you define a dictionary in Python:



In [4]:
d = {'key_1': 1, 'key_2': "a string", 'another_key': 2.5}

In YAML, this is represented simply as:

```
key_1: 1
key_2: a string
another_key: 2.5
```

(Note that a YAML mapping is formally unordered, so the order of the keys might not be preserved when reading the file.)
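
Here is a minimal sketch of reading that mapping with PyYAML (assumed to be installed). Because key order is not something to rely on, the example looks values up by key and compares dictionaries as a whole, which does not depend on order.

```python
import yaml

yaml_text = """
key_1: 1
key_2: a string
another_key: 2.5
"""

d = yaml.safe_load(yaml_text)

# Access values by key, not by position
print(d["key_2"])

# Dictionary equality ignores the order of the keys
same_content = d == {'another_key': 2.5, 'key_2': 'a string', 'key_1': 1}
print(same_content)
```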

**A nested dictionary**

In Python you can of course define nested dictionaries, i.e., dictionaries containing other dictionaries:


In [5]:
d = {
  "a": "a value",
  "b": {
    "c": 1.2,
    "d": 1,
    "e": "a string"
  },
  "c": 5
}


In YAML this is represented as:

```
a: a value
b:
  c: 1.2
  d: 1
  e: a string
c: 5
```
Note that quotes around strings are optional in YAML, although they are usually a good idea if you are using characters beyond letters, numbers and spaces.
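
The effect of quoting can be illustrated with a short sketch (PyYAML assumed to be installed; the key names are made up for this example). An unquoted `1.2` is read as a number, the quoted version stays a string, and quotes let a string contain a colon followed by a space, which would otherwise be read as a key-value separator.

```python
import yaml

yaml_text = """
unquoted_number: 1.2
quoted_number: "1.2"
with_colon: "key: value inside a string"
"""

data = yaml.safe_load(yaml_text)

# Print each value together with the Python type it was parsed into
for key, value in data.items():
    print(key, repr(value), type(value).__name__)
```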



**Mixing dictionaries and lists**

You can of course mix lists and dictionaries in Python:

In [6]:
d = {
  "a": "a value",
  "b": {
    "c": 1.2,
    "d": 1,
    "e": "a string"
  },
  "c": [1, 2, "another string"]
}

Such a data structure is represented in YAML as:

```
a: a value
b:
  c: 1.2
  d: 1
  e: a string
c:
  - 1
  - 2
  - another string
```
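
The conversion also works in the other direction. As a sketch, assuming PyYAML 5.1 or later (the `sort_keys` argument is not available in older versions), `yaml.safe_dump` turns the Python dictionary into YAML text, and loading that text back should give an equal dictionary.

```python
import yaml

d = {
    "a": "a value",
    "b": {
        "c": 1.2,
        "d": 1,
        "e": "a string",
    },
    "c": [1, 2, "another string"],
}

# default_flow_style=False asks for the indented block style used in this lesson;
# sort_keys=False keeps the keys in the order of the dictionary
yaml_text = yaml.safe_dump(d, default_flow_style=False, sort_keys=False)
print(yaml_text)

# Parsing the generated text gives back the same data
round_trip = yaml.safe_load(yaml_text)
```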

## The YAML of conda.yml and MLproject



Let's now look back at the `conda.yml` file and the `MLproject` file. Both are YAML documents, so we can read them using PyYAML (`pyyaml`), a Python parser for YAML that can be installed with pip. For example, this `conda.yml` file:

```
name: download_data
channels:
  - conda-forge
  - defaults
dependencies:
  - requests=2.24.0
  - pip=20.3.3
  - pip:
      - wandb==0.12.6
```


In [9]:
%%bash 
echo "name: download_data
channels:
  - conda-forge
  - defaults
dependencies:
  - requests=2.24.0
  - pip=20.3.3
  - pip:
      - wandb==0.12.6" > conda.yml

can be read in Python:


In [11]:
import yaml

with open("conda.yml") as fp:
    d = yaml.safe_load(fp)

d


{'channels': ['conda-forge', 'defaults'],
 'dependencies': ['requests=2.24.0', 'pip=20.3.3', {'pip': ['wandb==0.12.6']}],
 'name': 'download_data'}

which gives

```
{
    "name": "download_data",
    "channels": [
        "conda-forge",
        "defaults"
    ],
    "dependencies": [
        "requests=2.24.0",
        "pip=20.3.3",
        {
            "pip": [
                "wandb==0.12.6"
            ]
        }
    ]
}
```
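
Once loaded, the environment is just nested Python lists and dictionaries, so it can be inspected with ordinary code. The sketch below (PyYAML assumed to be installed; the variable names are chosen for this example) separates the conda dependencies, which are plain strings, from the pip dependencies, which sit inside a one-key dictionary in the same list.

```python
import yaml

conda_text = """
name: download_data
channels:
  - conda-forge
  - defaults
dependencies:
  - requests=2.24.0
  - pip=20.3.3
  - pip:
      - wandb==0.12.6
"""

env = yaml.safe_load(conda_text)

# Plain strings in the list are conda packages
conda_deps = [dep for dep in env["dependencies"] if isinstance(dep, str)]

# A dictionary with the key "pip" holds the list of pip packages
pip_deps = []
for dep in env["dependencies"]:
    if isinstance(dep, dict) and "pip" in dep:
        pip_deps.extend(dep["pip"])

print("conda:", conda_deps)
print("pip:", pip_deps)
```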

Similarly, this `MLproject` file:

```
name: download_data
conda_env: conda.yml

entry_points:
  main:
    parameters:
      file_url:
        description: URL of the file to download
        type: uri
      artifact_name:
        description: Name for the W&B artifact that will be created
        type: str
      artifact_type:
        description: Type of the artifact to create
        type: str
        default: raw_data
      artifact_description:
        description: Description for the artifact
        type: str

    command: >-
      python download_data.py --file_url {file_url} \
                              --artifact_name {artifact_name} \
                              --artifact_type {artifact_type} \
                              --artifact_description {artifact_description}
```

In [12]:
%%bash 
echo "name: download_data
conda_env: conda.yml

entry_points:
  main:
    parameters:
      file_url:
        description: URL of the file to download
        type: uri
      artifact_name:
        description: Name for the W&B artifact that will be created
        type: str
      artifact_type:
        description: Type of the artifact to create
        type: str
        default: raw_data
      artifact_description:
        description: Description for the artifact
        type: str

    command: >-
      python download_data.py --file_url {file_url} \
                              --artifact_name {artifact_name} \
                              --artifact_type {artifact_type} \
                              --artifact_description {artifact_description}" > MLproject

can be read as:

In [15]:
import yaml
with open("MLproject") as fp:
  d = yaml.safe_load(fp)
d

{'conda_env': 'conda.yml',
 'entry_points': {'main': {'command': 'python download_data.py --file_url {file_url}                               --artifact_name {artifact_name}                               --artifact_type {artifact_type}                               --artifact_description {artifact_description}',
   'parameters': {'artifact_description': {'description': 'Description for the artifact',
     'type': 'str'},
    'artifact_name': {'description': 'Name for the W&B artifact that will be created',
     'type': 'str'},
    'artifact_type': {'default': 'raw_data',
     'description': 'Type of the artifact to create',
     'type': 'str'},
    'file_url': {'description': 'URL of the file to download',
     'type': 'uri'}}}},
 'name': 'download_data'}

which gives:

```
{
    "name": "download_data",
    "conda_env": "conda.yml",
    "entry_points": {
        "main": {
            "parameters": {
                "file_url": {
                    "description": "URL of the file to download",
                    "type": "uri"
                },
                "artifact_name": {
                    "description": "Name for the W&B artifact that will be created",
                    "type": "str"
                },
                "artifact_type": {
                    "description": "Type of the artifact to create",
                    "type": "str",
                    "default": "raw_data"
                },
                "artifact_description": {
                    "description": "Description for the artifact",
                    "type": "str"
                }
            },
            "command": "python download_data.py --file_url {file_url} 
                                                --artifact_name {artifact_name}
                                                --artifact_type {artifact_type} 
                                                --artifact_description {artifact_description}"
        }
    }
}
```
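
To show how this structure can be used, here is a hedged sketch that works on a shortened, hypothetical version of the file (two parameters, and a command without the backslash line continuations). It collects the parameters that have a default, lists those that must be supplied, and fills the `{...}` placeholders with Python's `str.format`. This only imitates the idea of parameter substitution and is not how MLflow itself is implemented. PyYAML is assumed to be installed, and the URL is a placeholder.

```python
import yaml

mlproject_text = """
name: download_data
entry_points:
  main:
    parameters:
      file_url:
        description: URL of the file to download
        type: uri
      artifact_type:
        description: Type of the artifact to create
        type: str
        default: raw_data
    command: >-
      python download_data.py --file_url {file_url}
      --artifact_type {artifact_type}
"""

project = yaml.safe_load(mlproject_text)
main = project["entry_points"]["main"]
params = main["parameters"]

# Parameters without a "default" key must be provided by the user
required = [name for name, spec in params.items() if "default" not in spec]
defaults = {name: spec["default"] for name, spec in params.items() if "default" in spec}

# Hypothetical user-supplied value, merged over the defaults
values = {**defaults, "file_url": "https://example.com/data.csv"}

# ">-" folds the two command lines into one line and drops the final newline
command = main["command"].format(**values)
print(command)
```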

In [18]:
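import pathlib

# URL of the file to download
file_url = "https://raw.githubusercontent.com/scikit-learn/scikit-learn/4dfdfb4e1bb3719628753a4ece995a1b2fa5312a/sklearn/datasets/data/iris.csv"

# Keep the last path component and drop any query string ("?...") or fragment ("#...")
basename = pathlib.Path(file_url).name.split("?")[0].split("#")[0]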
basename

'iris.csv'

In [23]:
pathlib.Path(file_url).name.split("?")

['iris.csv']