# The Big Picture
Ref: https://docs.wandb.ai/guides/artifacts/artifacts-core-concepts

W&B Artifacts was designed to make it effortless to version your datasets and models, regardless of whether you want to store your files with W&B or already have a bucket (Amazon S3, Google Cloud Storage, or another external location) that you want W&B to track.

# What is an artifact?
Conceptually, an artifact is simply a directory in which you can store whatever you want, be it images, HTML, code, audio, or raw binary data. You can use it the same way you would an S3 or Google Cloud Storage bucket. Every time you change the contents of this directory, W&B will create a new version of your artifact instead of simply overwriting the previous contents.

Assume we have the following directory structure:

    images
    |-- cat.png (2MB)
    |-- dog.png (1MB)

Let's log it as the first version of a new artifact, animals:

In [None]:
import wandb

def execute(alias=None):
    run = wandb.init()
    artifact = wandb.Artifact('animals', type='dataset',
                              description="animals dataset description")
    # Optional: metadata=<dict>, a dictionary that can contain any structured
    # data. You'll be able to use this data for querying and making plots, e.g.
    # you may store the class distribution of a dataset artifact as metadata.
    
    artifact.add_dir('images')  # the images directory must exist
    # Optional: name=<str>, the path within the artifact to use for the
    # directory being added. By default, files are added under the root of
    # the artifact.
    # Ref: https://docs.wandb.ai/guides/artifacts/api#adding-files-and-directories

    # Any wandb object
    table = wandb.Table(columns=["a", "b", "c"], data=[[10, 20, 30]])
    artifact.add(table, name="my-wandb-table")  # name for the object
    # allowed_types = [
    #         data_types.Bokeh,
    #         data_types.JoinedTable,
    #         data_types.PartitionedTable,
    #         data_types.Table,
    #         data_types.Classes,
    #         data_types.ImageMask,
    #         data_types.BoundingBoxes2D,
    #         data_types.Audio,
    #         data_types.Image,
    #         data_types.Video,
    #         data_types.Html,
    #         data_types.Object3D,
    #         data_types.Molecule,
    #         data_types._SavedModel,
    #   ]
    
    # log_artifact expects a list of aliases (None falls back to the default)
    run.log_artifact(artifact, aliases=[alias] if alias else None)  # Creates `animals:v0`
    
    # If the artifact already exists in W&B:
    # --------------------------------------
    # log_artifact compares the content (hash) of the local 'images' folder with
    # what is already stored for the artifact (animals);
    # if it has changed, a new version of the artifact is written,
    # if not, nothing happens.

    run.finish()

execute()

![Adding file and directories](../imgs/artifact_add.PNG)
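
The rule that a new version appears only when the directory contents change can be pictured with a small standard-library model. This is a sketch under stated assumptions, not W&B's implementation: `directory_digest` and `ToyArtifactStore` are hypothetical names, MD5 over relative paths and file bytes is an assumed stand-in for whatever hashing W&B performs, and the toy store only compares against the most recent version.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_digest(root: Path) -> str:
    """Hash every file's relative path and bytes, in sorted order."""
    digest = hashlib.md5()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


class ToyArtifactStore:
    """Hypothetical in-memory store: one digest per version of an artifact."""

    def __init__(self):
        self.versions = []  # index i holds the digest of version v<i>

    def log(self, root: Path) -> str:
        digest = directory_digest(root)
        if self.versions and self.versions[-1] == digest:
            return f"v{len(self.versions) - 1}"  # unchanged: no new version
        self.versions.append(digest)
        return f"v{len(self.versions) - 1}"


with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()
    (images / "cat.png").write_bytes(b"cat")
    (images / "dog.png").write_bytes(b"dog")

    store = ToyArtifactStore()
    first = store.log(images)       # first log creates v0
    unchanged = store.log(images)   # same contents, still v0
    (images / "rat.png").write_bytes(b"rat")
    second = store.log(images)      # a new file was added, so v1

print(first, unchanged, second)
```

Logging the same folder twice returns the same version label; only the added file produces the next index.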

## Adding content to artifacts

    # Add a single file
    artifact.add_file(path, name='optional-name')

    # Recursively add a directory
    artifact.add_dir(path, name='optional-prefix')

    # Return a writeable file-like object, stored as <name> in the artifact
    with artifact.new_file(name) as f:
        ...  # Write contents into the file 

    # Add a URI reference
    artifact.add_reference(uri, name='optional-name')

## `log_artifact(...)`

<span style="color:#ff7171">NOTE: Calls to log_artifact are performed asynchronously for performant uploads. This can cause surprising behavior when logging artifacts in a loop. </span>

For example:

    for i in range(10):
        a = wandb.Artifact('race', type='dataset', metadata={
            "index": i,
        })
        # ... add files to artifact a ...
        run.log_artifact(a)

<span style="color:#ff7171">The artifact version v0 is NOT guaranteed to have an index of 0 in its metadata, as the artifacts may be logged in an arbitrary order.</span>
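
If the order of versions matters, one option is to block on each upload before logging the next artifact. The sketch below assumes that `run.log_artifact` returns the logged artifact and that this object has a `wait()` method that blocks until logging has finished; check this against your installed `wandb` version. `log_in_order` is a hypothetical helper, and the `Fake*` classes are stand-ins so the example runs without a W&B account.

```python
def log_in_order(run, artifacts):
    """Log artifacts one by one, waiting for each before starting the next."""
    logged = []
    for artifact in artifacts:
        handle = run.log_artifact(artifact)
        handle.wait()  # assumed: blocks until this version is committed
        logged.append(handle)
    return logged


class FakeArtifact:
    """Stand-in for wandb.Artifact that records whether wait() was called."""

    def __init__(self, index):
        self.metadata = {"index": index}
        self.committed = False

    def wait(self):
        self.committed = True
        return self


class FakeRun:
    """Stand-in for a run; it refuses to log while an upload is pending."""

    def __init__(self):
        self.order = []
        self._pending = None

    def log_artifact(self, artifact):
        if self._pending is not None and not self._pending.committed:
            raise RuntimeError("previous artifact was not waited on")
        self._pending = artifact
        self.order.append(artifact.metadata["index"])
        return artifact


fake_run = FakeRun()
logged = log_in_order(fake_run, [FakeArtifact(i) for i in range(3)])
print(fake_run.order)
```

With a real run, the call would be `log_in_order(run, artifacts)`, where `artifacts` holds `wandb.Artifact` objects. Waiting trades upload throughput for a predictable version order.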

Ref: https://docs.wandb.ai/guides/artifacts/api

### Versioning
In W&B parlance, this version has the index `v0`. <span style="color:#00ffff">Every new version of an artifact bumps the index by one</span>. Once you have hundreds of versions, referring to a specific version by its index becomes confusing and error-prone. <span style="color:#00ffff">This is where aliases come in handy: an alias lets you attach a human-readable name to a given version.</span>

To make this more concrete, let's say we want to update our dataset with a new image and mark the new version as our latest image. Here's our new directory structure:

    images
    |-- cat.png (2MB)
    |-- dog.png (1MB)
    |-- rat.png (1MB)

In [None]:
execute()  # will produce animals:v1

### Alias
W&B automatically assigns the alias `latest` to the newest version, so instead of the version index we can also refer to it as `animals:latest`. You can customize the aliases applied to a version by passing `aliases=['my-cool-alias']` to `log_artifact`. Several aliases can be assigned at once, e.g. `aliases=['latest', 'my-cool-alias']`.

In [None]:
execute('my-cool-alias')  # Aliases to apply to this artifact, defaults to `["latest"]`

# If the content of the images folder has changed and
# execute('my-cool-alias') is run,
#   the newest version will carry 'my-cool-alias'.

# If execute('my-cool-alias2') is then run,
#   the newest version will carry 'my-cool-alias2', and the version below it
#   keeps the previous alias 'my-cool-alias'.

# Note: aliases can also be changed later, outside a run, through the wandb public API.
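
The way aliases move between versions can be mimicked with a toy lookup table. This is only an illustration: `ToyAliasTable` is a hypothetical class, and it assumes, as the text above states, that the newest version always carries `latest` and that an alias points at exactly one version at a time.

```python
class ToyAliasTable:
    """Hypothetical model: each alias points at exactly one version."""

    def __init__(self):
        self.alias_to_version = {}
        self.next_index = 0

    def log_version(self, aliases=None):
        """Create the next version and move the given aliases onto it."""
        version = f"v{self.next_index}"
        self.next_index += 1
        for alias in ["latest", *(aliases or [])]:
            self.alias_to_version[alias] = version  # an existing alias moves
        return version

    def aliases_of(self, version):
        """Return the sorted aliases that currently point at `version`."""
        return sorted(a for a, v in self.alias_to_version.items() if v == version)


table = ToyAliasTable()
table.log_version()                    # v0
table.log_version(["my-cool-alias"])   # v1
table.log_version(["my-cool-alias2"])  # v2
print(table.aliases_of("v1"), table.aliases_of("v2"))
```

In this model the second-newest version keeps `my-cool-alias` while `latest` and `my-cool-alias2` sit on the newest one.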

# Referring to Artifacts

Referring to artifacts is easy. In our training script, this is all we need to do to pull in a version of our dataset by its alias:

In [None]:
import wandb

run = wandb.init()
animals = run.use_artifact('animals:my-cool-alias', use_as="Just download")
# use_as: (string, optional) Optional string indicating what purpose the artifact was used with.
#                                        Will be shown in UI.

directory = animals.download()  # optional: root=<target directory>
# NOTE: Any existing files at `root` are left untouched. Explicitly delete
# root before calling `download` if you want the contents of `root` to exactly
# match the artifact.

# Alternatively, 
# directory = animals.checkout()
# Replaces the specified root directory with the contents of the artifact.
# WARNING: This will DELETE all files in root that are not included in the artifact.

print(directory)
# Train on our image dataset...
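
The difference between `download()` and `checkout()` described in the comments above can be tried out on a temporary folder. The functions below are hypothetical stand-ins written with the standard library; they imitate the documented behaviour (leave extra files alone versus delete them) and do not call W&B.

```python
import shutil
import tempfile
from pathlib import Path


def toy_download(artifact_files: dict, root: Path) -> None:
    """Write artifact files into root; files already there are left alone."""
    root.mkdir(parents=True, exist_ok=True)
    for name, data in artifact_files.items():
        (root / name).write_bytes(data)


def toy_checkout(artifact_files: dict, root: Path) -> None:
    """Make root match the artifact exactly, deleting anything extra."""
    if root.exists():
        shutil.rmtree(root)
    toy_download(artifact_files, root)


artifact_files = {"cat.png": b"cat", "dog.png": b"dog"}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "animals"
    root.mkdir()
    (root / "stale.txt").write_text("left over from an earlier run")

    toy_download(artifact_files, root)
    after_download = sorted(p.name for p in root.iterdir())

    toy_checkout(artifact_files, root)
    after_checkout = sorted(p.name for p in root.iterdir())

print(after_download)
print(after_checkout)
```

The stale file survives the download but not the checkout, which is why `checkout()` carries the deletion warning.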

## Using an artifact from the current project

    # Query W&B for an artifact and mark it as input to this run
    artifact = run.use_artifact('bike-dataset:v0')

    # Download the artifact's contents
    artifact_dir = artifact.download()

## Using an artifact from a different project

    # Query W&B for an artifact from another project and mark it
    # as an input to this run.
    artifact = run.use_artifact('my-project/bike-model:v0')

    # Use an artifact from another entity and mark it as an input
    # to this run.
    artifact = run.use_artifact('my-entity/my-project/bike-model:v0')
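
The three name forms used above (`name:alias`, `project/name:alias`, `entity/project/name:alias`) follow one pattern. The helper below is a hypothetical parser written for illustration; it is not part of `wandb` and makes no claim about how W&B resolves names internally.

```python
from typing import NamedTuple, Optional


class ArtifactPath(NamedTuple):
    entity: Optional[str]
    project: Optional[str]
    name: str
    alias: Optional[str]


def parse_artifact_path(path: str) -> ArtifactPath:
    """Split '[entity/][project/]name[:alias]' into its parts."""
    *scope, last = path.split("/")
    if len(scope) > 2:
        raise ValueError(f"too many '/'-separated parts in {path!r}")
    name, _, alias = last.partition(":")
    entity = scope[0] if len(scope) == 2 else None
    project = scope[-1] if scope else None
    return ArtifactPath(entity, project, name, alias or None)


for path in (
    "bike-dataset:v0",
    "my-project/bike-model:v0",
    "my-entity/my-project/bike-model:v0",
):
    print(parse_artifact_path(path))
```

Missing parts come back as `None`, which corresponds to "use the entity or project of the current run" in the examples above.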

## Using an artifact that has not been logged

    artifact = wandb.Artifact('bike-model', type='model')
    artifact.add_file('model.h5')
    run.use_artifact(artifact)

# Storage Layout

Ref: https://docs.wandb.ai/guides/artifacts/artifacts-core-concepts#storage-layout

# Adding a reference to an artifact

* Ref: https://docs.wandb.ai/guides/artifacts/artifacts-core-concepts#data-privacy-and-compliance
* Ref: https://docs.wandb.ai/guides/artifacts/api#adding-references
* Ref: https://docs.wandb.ai/guides/artifacts/references

## Trackers available in `wandb`

```python
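    # Storage handlers referenced by WandbStoragePolicy (excerpt, not runnable
    # on its own; `self` is the policy object)
    from wandb.sdk.wandb_artifacts import WandbStoragePolicy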

    s3 = S3Handler()
    gcs = GCSHandler()
    http = HTTPHandler(self._session)
    https = HTTPHandler(self._session, scheme="https")
    artifact = WBArtifactHandler()
    local_artifact = WBLocalArtifactHandler()
    file_handler = LocalFileHandler()

    default_handler=TrackingHandler()
```

The handler is chosen according to the scheme of the reference path (the scheme is lower-cased when the path is parsed).

For example:
* `File:C:\folder\1.jpg -> scheme='file', LocalFileHandler()` will be used
* `File:C:\folder -> scheme='file', LocalFileHandler()` will be used
* `C:\folder\1.jpg -> scheme not recognized, TrackingHandler()` (default) will be used
* `s3://bucket/folder/1.jpg -> scheme='s3', S3Handler()` will be used
* `gs://bucket/folder/1.jpg -> scheme='gs', GCSHandler()` will be used, etc.
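
The selection rule can be sketched as a dictionary lookup on the parsed scheme. This is a simplified model, not the `wandb` source: handlers are represented by plain strings, the W&B-internal artifact handlers are left out, and `gs` as the Google Cloud Storage scheme is an assumption based on `gs://` bucket URLs. `pick_handler` is a hypothetical name.

```python
from urllib.parse import urlparse

# Assumed scheme-to-handler table, written as plain strings for illustration.
HANDLERS = {
    "s3": "S3Handler",
    "gs": "GCSHandler",
    "http": "HTTPHandler",
    "https": "HTTPHandler",
    "file": "LocalFileHandler",
}
DEFAULT_HANDLER = "TrackingHandler"


def pick_handler(uri: str) -> str:
    """Return the handler name for a URI, falling back to the default."""
    scheme = urlparse(uri).scheme  # urlparse lower-cases the scheme
    return HANDLERS.get(scheme, DEFAULT_HANDLER)


examples = [
    r"File:C:\folder\1.jpg",
    r"C:\folder\1.jpg",
    "s3://bucket/folder/1.jpg",
    "gs://bucket/folder/1.jpg",
    "/data/folder/1.jpg",
]
for uri in examples:
    print(f"{uri!r:30} -> {pick_handler(uri)}")
```

Note that a bare Windows path such as `C:\folder\1.jpg` is parsed with the drive letter as its scheme, which is not in the table, so it falls through to the default handler.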

### `LocalFileHandler(StorageHandler)`

Tracks files or directories on a local filesystem. Directories
are expanded to create an entry for each file contained within.

* Handles `file://` references (for NFS mounts); see `LocalFileHandler.store_path(...)`
* <span style="color:#00ffff">Checksum (content-based) for local files will be 
calculated for versioning</span>
* UI will show filesize

#### `file://` references

`LocalFileHandler.store_path(...)` uses `urlparse()` to parse `file://...` references.

`urlparse(...)`
Parse a URL into 6 components: `<scheme>://<netloc>/<path>;<params>?<query>#<fragment>`

i.e.
* `file://192.168.0.1/share/file.jpg` ->
* `<scheme>: file, <netloc>: 192.168.0.1, <path>: /share/file.jpg`

* `file:///share/file.jpg` -> 
* `<scheme>: file, <netloc>:, <path>: /share/file.jpg`

for local disk files:

* `file:c:/share/file.jpg` -> 
* `<scheme>: file, <netloc>:, <path>: c:/share/file.jpg`

<span style="color:#ff7171">DO NOT use:</span>

* `file:c:/#folder_share/file.jpg`  <span style="color:#ff7171"><= path that has `'#'` symbol</span>
* `<scheme>: file, <netloc>:, <path>: c:/, <fragment>: folder_share/file.jpg`
  
internally,

```python 
local_path = f"{str(url.netloc)}{str(url.path)}"
```
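
How `urlparse()` splits such references can be checked directly with the standard library. The helper `local_path_from_reference` is a hypothetical name that repeats the `netloc + path` concatenation shown above; the rest is plain `urllib.parse`.

```python
from urllib.parse import urlparse


def local_path_from_reference(uri: str) -> str:
    """Rebuild a local path as netloc + path, dropping scheme and fragment."""
    url = urlparse(uri)
    return f"{url.netloc}{url.path}"


for uri in (
    "file://192.168.0.1/share/file.jpg",
    "file:///share/file.jpg",
    "file:c:/share/file.jpg",
    "file:c:/#folder_share/file.jpg",  # '#' starts the fragment
):
    url = urlparse(uri)
    print(uri, "->", repr(url.netloc), repr(url.path), repr(url.fragment))
    print("   local_path:", local_path_from_reference(uri))
```

Everything after a `#` ends up in the fragment and is lost from the rebuilt path, which is why paths containing `#` should be avoided.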

#### Create an artifact and add a reference handled by `LocalFileHandler`

In [None]:
import wandb
from pathlib import Path
from imind.globals.ipaths import PCRelativeEnvConfig

run = wandb.init()
artifact = wandb.Artifact('pets', type='dataset')

# A single file can be tracked:
# p = Path('File:' + str(PCRelativeEnvConfig.get_path('DLN_PATH')))/ \
#             'tutorials'/'03_wandb'/'artifacts'/'images'/'cat.jpg'
# # scheme: 'file', so LocalFileHandler() will be used.
# artifact.add_reference(str(p))  # name=<remote filename for cat.jpg>
# artifact.add_reference(str(p), name="c.jpg")

# A folder can be tracked as well:
p = Path('File:' + str(PCRelativeEnvConfig.get_path('DLN_PATH')))/ \
            'tutorials'/'03_wandb'/'artifacts'/'images'
# scheme: 'file', so LocalFileHandler() will be used.
artifact.add_reference(str(p))  # name=<remote folder for content of images folder>
# artifact.add_reference(str(p), name='img_folder')

run.log_artifact(artifact)
wandb.finish()

#### Access a reference (handled by `LocalFileHandler`) in an artifact

In [None]:
import wandb

run = wandb.init()
animals = run.use_artifact('pets:latest')
directory = animals.download()  # will work if reference is pointing to S3, Google Cloud or NFS Mount

# Or access a subset of files:
# ----------------------------
# entry = animals.get_path("c.jpg")  # if name='c.jpg' was provided in add_reference
# entry = animals.get_path("cat.jpg")  # etc.
# entry = animals.get_path("img_folder")  # if name='img_folder' was provided in add_reference
# entry.download()

print(directory)

### `TrackingHandler(StorageHandler)`

Tracks paths as is, with no modification or special processing. Useful
when paths being tracked are on file systems mounted at a standardized
location.

For example, if the data to track is located on an NFS share mounted on
`/data`, then it is sufficient to just track the paths.

* Handles `c:/share/file.jpg` or `c:/share/` references or any other type
* The file or folder at the path may or may not exist, since the handler only cares about the path itself
* <span style="color:#00ffff">versioning is done based on the path (not the file content)</span>
* hashing is path based
* UI will not show filesize (filesize=0)
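
The practical difference between content-based and path-based hashing can be demonstrated with two tiny functions. This is an illustration only: `content_digest` and `path_digest` are hypothetical names, and MD5 is an assumed stand-in, not a statement about the digests W&B stores.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """Digest of the file's bytes: changes whenever the content changes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def path_digest(path: Path) -> str:
    """Digest of the path string only: the file is never opened."""
    return hashlib.md5(str(path).encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    cat = Path(tmp) / "cat.jpg"

    cat.write_bytes(b"first photo")
    content_before, path_before = content_digest(cat), path_digest(cat)

    cat.write_bytes(b"retouched photo")
    content_after, path_after = content_digest(cat), path_digest(cat)

    missing = Path(tmp) / "not_there.jpg"
    missing_digest = path_digest(missing)  # works although the file is absent

print("content digest changed:", content_before != content_after)
print("path digest changed:", path_before != path_after)
```

A path-based scheme therefore cannot notice that a tracked file was edited in place; only a changed path would produce a new version.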

#### Create an artifact and add a reference handled by `TrackingHandler`

In [10]:
import wandb
from pathlib import Path
from imind.globals.ipaths import PCRelativeEnvConfig

run = wandb.init()
artifact = wandb.Artifact('pets', type='dataset')

# A single file can be tracked:
p = Path(PCRelativeEnvConfig.get_path('DLN_PATH'))/ \
            'tutorials'/'03_wandb'/'artifacts'/'images'/'cat.jpg'
# scheme: not recognized, so TrackingHandler() (the default handler) is used
artifact.add_reference(str(p), name='NAME_CAT.jpg')
# name is mandatory because the scheme is not recognized
           
# can track folder
p = Path(PCRelativeEnvConfig.get_path('DLN_PATH'))/ \
            'tutorials'/'03_wandb'/'artifacts'/'images' 
artifact.add_reference(str(p), name='NAME_CAT_FOLDER')

run.log_artifact(artifact)
wandb.finish()



#### Access a reference (handled by `TrackingHandler`) in an artifact

In [11]:
import wandb

run = wandb.init()
animals = run.use_artifact('pets:latest')

print(animals.id)
print(animals.size)
print(animals.aliases)
print(animals.manifest.entries)  # remote root folder structure
print(animals.digest)
print(animals.version)
print(animals.created_at)
print(animals.updated_at)
print(animals.description)

# directory = animals.download()  # will not work for TrackingHandler
# * TrackingHandler.load_path() is not supported.
# * Calling it is likely a user error: the tracking handler is
# * oblivious to the underlying paths, so it has
# * no way of actually loading them.

entry = animals.manifest.entries['NAME_CAT.jpg']
print(entry.ref)

QXJ0aWZhY3Q6MTM5OTU3OTA5
0
['latest']
{'NAME_CAT.jpg': <ManifestEntry ref: C:\Users\chath\OneDrive - Curtin\Research\dev\python\DLN\tutorials\03_wandb\artifacts\images\cat.jpg/NAME_CAT.jpg>, 'NAME_CAT_FOLDER': <ManifestEntry ref: C:\Users\chath\OneDrive - Curtin\Research\dev\python\DLN\tutorials\03_wandb\artifacts\images/NAME_CAT_FOLDER>}
23270932d8ae124cb54dda63c94253f2
v1
2022-06-08T05:29:04
2022-06-08T05:29:07

C:\Users\chath\OneDrive - Curtin\Research\dev\python\DLN\tutorials\03_wandb\artifacts\images\cat.jpg


# Other operations

## Access a subset of files in an artifact

If you're only interested in a subset of files, use the get_path method.

`entry = artifact.get_path(name)`

This fetches only the file at the path name. It returns an Entry object with the following methods:

* Entry.download: Downloads file from the artifact at path name
* Entry.ref: If the entry was stored as a reference using add_reference, returns the URI

References with schemes that W&B knows how to handle can be downloaded just like artifact files; the consumer API is the same.

### `artifact.get(name)` usage

    artifact = wandb.Artifact('my_table', 'dataset')
    table = wandb.Table(columns=["a", "b", "c"], data=[[i, i*2, 2**i] for i in range(10)])
    artifact.add(table, "my_table")

    wandb.log_artifact(artifact)


Retrieving an object:

    artifact = wandb.use_artifact('my_table:latest')
    table = artifact.get("my_table")

## Downloading an artifact outside a run (using the public API)
Ref: https://docs.wandb.ai/guides/artifacts/api#download-an-artifact-outside-of-a-run

    api = wandb.Api()
    artifact = api.artifact('entity/project/artifact:alias')
    artifact.download()

## Updating Artifacts

You can update the description, metadata, and aliases of an artifact by just setting them to the desired values and then calling save().

    api = wandb.Api()
    artifact = api.artifact('bike-dataset:latest')

    # Update the description
    artifact.description = "My new description"

    # Selectively update metadata keys
    artifact.metadata["oldKey"] = "new value"

    # Replace the metadata entirely
    artifact.metadata = {"newKey": "new value"}

    # Add an alias
    artifact.aliases.append('best')

    # Remove an alias
    artifact.aliases.remove('latest')

    # Completely replace the aliases
    artifact.aliases = ['replaced']

    # Persist all artifact modifications
    artifact.save()

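<span style="color:#ff7171">The manifest (the tracked files) cannot be changed this way. For example, the following attempt does not update the files tracked by the artifact:</span>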

    from wandb.sdk.wandb_artifacts import ArtifactManifestEntry

    api = wandb.Api()
    artifact = api.artifact('uncategorized/pets:latest')

    artifact.manifest.entries['NAME_CAT.jpg'] = ArtifactManifestEntry('NAME_CAT.jpg',
                        'C:\\Users\\chath\\OneDrive - Curtin\\Research\\dev\\python\\DLN\\tutorials\\03_wandb\\artifacts\\images\\cat.jpg',
                        'C:\\Users\\chath\\OneDrive - Curtin\\Research\\dev\\python\\DLN\\tutorials\\03_wandb\\artifacts\\images\\cat.jpg')

    artifact.save()

## Traversing the Artifact Graph

W&B automatically tracks the artifacts a given run has logged as well as the artifacts a given run has used. You can walk this graph by using the following APIs:

    api = wandb.Api()
    artifact = api.artifact('data:v0')

    # Walk up and down the graph from an artifact:
    producer_run = artifact.logged_by()
    consumer_runs = artifact.used_by()

    # Walk up and down the graph from a run:
    logged_artifacts = run.logged_artifacts()
    used_artifacts = run.used_artifacts()

## Cleaning up unused versions

As an artifact evolves over time, you might end up with a large number of versions that clutter the UI. This is especially true if you are using artifacts for model checkpoints, where only the most recent version (the version tagged latest) of your artifact is useful. W&B makes it easy to clean up these unneeded versions:

    api = wandb.Api()

    artifact_type, artifact_name = ... # fill in the desired type + name
    for version in api.artifact_versions(artifact_type, artifact_name):
        # Clean up all versions that don't have an alias such as 'latest'.
        if len(version.aliases) == 0:
            version.delete()
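
Because deletion cannot be undone, it can be useful to preview what would be removed first. The helper below is a hypothetical wrapper around the loop above; it assumes each version object exposes `name`, `aliases` and `delete()`, and it is exercised here with stand-in objects rather than real W&B versions.

```python
def delete_unaliased(versions, dry_run=True):
    """Return names of versions without aliases; delete them unless dry_run."""
    selected = []
    for version in versions:
        if len(version.aliases) == 0:
            selected.append(version.name)
            if not dry_run:
                version.delete()
    return selected


class FakeVersion:
    """Stand-in exposing only the attributes the helper relies on."""

    def __init__(self, name, aliases):
        self.name = name
        self.aliases = aliases
        self.deleted = False

    def delete(self):
        self.deleted = True


versions = [
    FakeVersion("animals:v0", []),
    FakeVersion("animals:v1", ["my-cool-alias"]),
    FakeVersion("animals:v2", ["latest"]),
]
preview = delete_unaliased(versions)                 # nothing is deleted yet
removed = delete_unaliased(versions, dry_run=False)  # now v0 is deleted
print(preview, removed, [v.deleted for v in versions])
```

With the real API, the same helper could be called as `delete_unaliased(api.artifact_versions(artifact_type, artifact_name), dry_run=False)` once the preview looks right.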