# Metasmith Starter Notebook
Welcome! This is a starter notebook provided for you to interact with Metasmith.
If you get stuck, you can read [the docs](https://metasmith.readthedocs.io/en/latest/), or check out [the repository on GitHub](https://github.com/hallamlab/Metasmith)!

To begin, import the tools and structures needed from the Metasmith Python API. We will also load some resources from Metasmith's standard library using `Std()`.

In [11]:
import re

tests = """
Escherichia coli str. K-12 substr. MG1655, complete genome.
Escherichia coli str. K-12 substr. DH10B, complete sequence.
Escherichia coli strain EPI300 chromosome, complete genome.
"""
tests = [x for x in tests.split("\n") if x]
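# Capture the token that follows "substr." or "strain".
pattern = re.compile(r"substr\.?\s?(\w+)|strain\s?(\w+)")
for x in tests:
    # Use the first match; fall back to the full description if there is none.
    m = pattern.search(x)
    name = (m.group(1) or m.group(2)) if m else x
    print(name)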

MG1655
DH10B
EPI300


In [None]:
from pathlib import Path
from metasmith.python_api import Agent, Source, Std, DataInstanceLibrary
from metasmith.python_api import Resources, Size, Duration

dtypes, containers, transforms = Std()

Metasmith executes workflows through agents on your behalf. Each agent is given a workspace which may be on a remote machine. Let's make one called "smith".

In [None]:
smith = Agent(
    home = Source.FromLocal(Path("./local_home").resolve()),
)
# smith.Deploy()

For this demo, we will use long reads from the model organism *Escherichia coli* EPI300, but we only have its SRA accession, "SRR35110061". For now, all inputs must be files, so let's create one containing the EPI300 accession number.

In [None]:
inputs_folder = Path("./std_assembly_data")
inputs_accession_file = Path("./epi300.acc")
with open(inputs_accession_file, "w") as f:
    f.write("SRR35110061")
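
Since the accession is written as free text, a typo would only surface later, when the download step runs. As a minimal sketch (not part of Metasmith), the format can be checked first. The helper name `looks_like_run_accession` is hypothetical, and the pattern assumes the usual run accession layout: an `SRR`, `ERR`, or `DRR` prefix followed by at least six digits. It checks the shape only, not whether the run exists.

```python
import re

# Assumption: run accessions are SRR/ERR/DRR followed by six or more digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d{6,}")

def looks_like_run_accession(text: str) -> bool:
    """Return True if text has the shape of a sequencing run accession."""
    return RUN_ACCESSION.fullmatch(text.strip()) is not None

print(looks_like_run_accession("SRR35110061"))
print(looks_like_run_accession("SRR3511OO61"))  # letter O typed instead of zero
```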

We need to register the input in Metasmith's ecosystem by giving it a data type. The cell below lists all data types with "accession" in their names.

In [None]:
for k in dtypes.types:
    if "accession" not in k: continue
    print(k)
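
The same substring filter is used again further down for annotation types, so it can be convenient to wrap it in a small helper. The sketch below runs on a plain list as a stand-in for `dtypes.types`; `find_types` is a hypothetical helper, not part of the Metasmith API, and `example_reads` is an invented name included only to show a non-match.

```python
from typing import Iterable

def find_types(names: Iterable[str], keyword: str) -> list[str]:
    """Return the names that contain keyword, sorted for stable display."""
    return sorted(k for k in names if keyword in k)

# Stand-in for dtypes.types; this list is illustrative only.
example_names = ["long_reads_accession", "busco_annotations", "example_reads"]
print(find_types(example_names, "accession"))
```

In the notebook itself, one would presumably pass `dtypes.types` in place of `example_names`.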

Using the `long_reads_accession` datatype, create a `DataInstanceLibrary`. This structure keeps track of multiple files and is essentially a filesystem folder managed by Metasmith.

In [None]:
inputs = DataInstanceLibrary(inputs_folder)
inputs.AddItem(inputs_accession_file.resolve(), "std::long_reads_accession")
inputs.Save()

Let's see what annotations are available. We will use "busco_annotations" since it will be the fastest to process.

In [None]:
for k in dtypes.types:
    if "annotations" not in k: continue
    print(k)

We can now ask the Metasmith agent to generate a workflow that produces "busco_annotations" from a "long_reads_accession" using the available resources and transform steps. The generated workflow, along with references to its required inputs, is stored in `task`.

In [None]:
task = smith.GenerateWorkflow(
    samples=[inputs],
    resources=[containers],
    transforms=[transforms],
    targets=[
        dtypes["busco_annotations"],
    ],
)

The generated plan can be rendered with Graphviz. Inspect it to make sure it is sensible.

In [None]:
from IPython.display import Image
dagf = Path("dag")
task.plans[0][0].RenderDAG(dagf, format="png")
Image(filename=f"{dagf}.png")

Staging the task transfers required files over to the agent's workspace.

In [None]:
smith.StageWorkflow(task, on_exist="clear")

Before starting the run, let's alter some resource constraints and specify the local executor. First, list the transform names so we know which ones can be given overrides.

In [None]:
for path, tr in transforms.IterateTransforms():
    print(tr.name)

In [None]:
smith.RunWorkflow(
    task,
    config_file=smith.GetNxfConfigPresets()["local"],
    resource_overrides={
        "all": Resources(
            cpus=8,
        ),
        transforms["busco_ref"]: Resources(
            cpus=2,
        ),
        transforms["fasterq_long"]: Resources(
            cpus=2,
        ),
    }
)
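
The `cpus=8` request above assumes the machine has at least eight cores. As a hedged sketch, the request can be capped with the standard library's `os.cpu_count()` before it is passed to `Resources`. `capped_cpus` is a hypothetical helper; whether Metasmith or the executor already clamps this value is not covered here.

```python
import os

def capped_cpus(requested: int) -> int:
    """Cap a CPU request at the core count reported by the OS (minimum 1)."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))

print(capped_cpus(8))
```

The result could then be used as `Resources(cpus=capped_cpus(8))` in the overrides.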

The task will execute asynchronously and its progress can be checked with:

In [None]:
smith.CheckWorkflow(task)
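
Because the run is asynchronous, a notebook that needs the results has to wait for it. The return value of `CheckWorkflow` is not described here, so the sketch below polls a generic, hypothetical `is_done` callable instead of calling Metasmith directly; `wait_until` and its parameters are likewise assumptions made for illustration.

```python
import time
from typing import Callable

def wait_until(is_done: Callable[[], bool],
               interval: float = 30.0,
               timeout: float = 3600.0) -> bool:
    """Poll is_done() every `interval` seconds; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(interval)
    return False

# Demo with a stand-in check that reports completion on the third call.
calls = iter([False, False, True])
print(wait_until(lambda: next(calls), interval=0.01, timeout=1.0))
```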