In [None]:
#!pip install gosling==0.0.8
import gosling as gos

# Loading data with `gos`

This notebook illustrates a key (optional) feature of `gos` which makes hosting data for your Gosling visualizations a breeze. 

Normally a Gosling visualization requires administering a web server to host both the client and the genomics datasets for the visualization. In `gos`, we provide further integration with Python to hide this complexity and allow remote, local, and in-memory data to be visualized seamlessly through an identical API.

In this notebook, we will visualize the same [BED file](https://samtools.github.io/hts-specs/BEDv1.pdf) containing hg38 cytoband information as a:

- remote dataset (via URL) 
- local dataset (via local path)
- in memory (from a `pd.DataFrame`)


## The visualization

The `ideogram` function generates an ideogram visualization for a given Gosling data source. You do not need to understand the details of this block to follow along in this notebook. The important bit is that `ideogram` takes `data` as input and returns a Gosling visualization created with the `gos` API.

We will show how this function can be _reused_ for various `data` definitions (genomic data sources).

In [None]:
def ideogram(data):
    track = gos.Track(data) # bind data to track
    
    arms = track.mark_rect().encode(
        color=gos.Color("stain:N",
            domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
            range=["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"],
        ),
        x=gos.X("chromStart:G", axis="none"),
        xe="chromEnd:G",
        stroke=gos.value("black"),
        strokeWidth=gos.value(0.5),
    ).transform_filter_not(
        field="stain",
        oneOf=["acen"],
    )

    labels = track.mark_text().encode(
        text="name:N",
        color=gos.Color("stain:N",
            domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
            range=["black", "#636363", "black", "#D9D9D9", "white", "black"],
        ),
        strokeWidth=gos.value(0)
    ).visibility_lt(
        target='mark',
        measure='width',
        threshold='|xe-x|',
        transitionPadding=10
    )

    centromere = track.encode(
        x=gos.X("chromStart:G"),
        xe="chromEnd:G",
        color=gos.value('red'),
    ).transform_filter(
        "stain", oneOf=["acen"]
    )

    centromere_left = centromere.mark_triangleLeft().transform_filter(
        "name", include="p"
    )

    centromere_right = centromere.mark_triangleRight().transform_filter(
        "name", include="q"
    )

    return gos.overlay(arms, labels, centromere_left, centromere_right).properties(height=20)


## The dataset

The `url` below links to a [BED4+1](https://samtools.github.io/hts-specs/BEDv1.pdf) file containing UCSC hg38 cytoband information. This dataset is hosted on GitHub and is available via URL.

In [None]:
url = "https://raw.githubusercontent.com/manzt/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed"

# preview the file contents
!curl -s {url}  | head | column -t
# chrom  chromStart  chromEnd  name  stain
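
To make the BED4+1 layout concrete, here is a minimal sketch that parses tab-separated rows into named columns using only the Python standard library. The two sample rows are made-up values in the same five-column layout, not rows copied from the file, and `parse_bed` is a hypothetical helper that is not part of `gos`.

```python
import csv
import io

# Column names for the BED4+1 layout used in this notebook
COLUMNS = ["chrom", "chromStart", "chromEnd", "name", "stain"]

# Illustrative rows only: made-up values, same layout as the real file
sample = "chr1\t0\t1000\tp1.2\tgneg\nchr1\t1000\t2500\tp1.1\tgpos50\n"


def parse_bed(text):
    """Parse headerless, tab-separated BED4+1 text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS, delimiter="\t")
    records = []
    for row in reader:
        # Genomic coordinates are integers; the other fields stay as strings
        row["chromStart"] = int(row["chromStart"])
        row["chromEnd"] = int(row["chromEnd"])
        records.append(row)
    return records


records = parse_bed(sample)
print(records[0])
```

The file has no header row, which is why the column names have to be supplied separately, both here and in the `headerNames` argument used later in this notebook.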

## Remote dataset (via URL)

We can reference this URL directly in `gos` by creating a CSV data source via `gos.csv(...)`. This function returns a Python dictionary that describes our dataset to Gosling. We use the `gos.csv` utility since the resource is a delimited, columnar text dataset (here tab-separated, hence `separator="\t"`).

In [None]:
# specify BED4+1 format
data = gos.csv(
    url=url,
    headerNames=['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], # the +1 field is stain
    chromosomeField="chrom", # the column containing chrom names
    genomicFields=["chromStart", "chromEnd"], # fields with genomic coordinates
    separator="\t",
)

data
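
For orientation, the dictionary describing a CSV data source might look roughly like the plain-Python sketch below. This is an assumption based on the fields passed to `gos.csv` above and on the Gosling CSV data specification; the exact keys and ordering returned by `gos.csv` may differ, and the URL here is a hypothetical placeholder.

```python
import json

# A hand-written sketch of a Gosling CSV data description (not produced by gos)
data_sketch = {
    "type": "csv",
    "url": "https://example.org/cytoband.bed",  # hypothetical placeholder URL
    "headerNames": ["chrom", "chromStart", "chromEnd", "name", "stain"],
    "chromosomeField": "chrom",
    "genomicFields": ["chromStart", "chromEnd"],
    "separator": "\t",
}

# The description is plain data, so it serializes directly to JSON
print(json.dumps(data_sketch, indent=2))
```

Because the data source is just a description rather than the data itself, the same visualization code can be pointed at a different source by swapping this one object.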

We can now pass this dataset directly to the `ideogram` function which binds `data` to `gos.Track` and creates our custom visualization.

In [None]:
ideogram(data)

This visualization is a bit crowded since we are viewing the data genome-wide. We can set the initial genomic domain for the visualization to Chromosome 2 by specifying `xDomain` as a property.

In [None]:
ideogram(data).properties(xDomain=gos.GenomicDomain(chromosome="chr2"))

## Local Dataset (via local filepath)

Data are not always publicly available via URL like above, and often we'd like to visualize local data files. To visualize local data, **simply change the URL to a local file path**.

```diff
data = gos.csv(
-  url=url,
+  url="./UCSC.HG38.Human.CytoBandIdeogram.bed",
   ... 
)
```

Below we download the file from GitHub and load the visualization from our local filesystem.

In [None]:
!curl {url} -o UCSC.HG38.Human.CytoBandIdeogram.bed # download file

In [None]:
!head UCSC.HG38.Human.CytoBandIdeogram.bed | column -t # print the first lines of the local file

In [None]:
data = gos.csv(
    url="./UCSC.HG38.Human.CytoBandIdeogram.bed",
    # url=url
    headerNames=['chrom', 'chromStart', 'chromEnd', 'name', 'stain'],
    chromosomeField="chrom",
    genomicFields=["chromStart", "chromEnd"],
    separator="\t",
)

# reuse the same visualization
ideogram(data).properties(xDomain=gos.GenomicDomain(chromosome="chr2"))
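
To give a sense of the web-server complexity that is hidden here, the sketch below serves a local file over HTTP from a background thread and fetches it back by URL, using only the standard library. It illustrates the general idea and is not a description of how `gos` implements local file support; `serve_directory`, `QuietHandler`, and the `example.bed` file with its single made-up row are all hypothetical.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log each request to stderr."""

    def log_message(self, format, *args):
        pass


def serve_directory(directory):
    """Serve `directory` on a free localhost port from a daemon thread."""
    handler = functools.partial(QuietHandler, directory=str(directory))
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


with tempfile.TemporaryDirectory() as tmp:
    # Write a one-row example file (made-up values) to a temporary directory
    path = Path(tmp) / "example.bed"
    path.write_text("chr2\t0\t100\tp1\tgneg\n")

    server = serve_directory(tmp)
    host, port = server.server_address
    file_url = f"http://{host}:{port}/{path.name}"

    # Bypass any configured proxy so the request goes straight to localhost
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(file_url) as response:
        fetched = response.read().decode()

    server.shutdown()
    server.server_close()

print(file_url)
print(fetched)
```

A browser-based client can only read data by URL, so a local path has to be exposed through something like this before the visualization can load it.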

## In memory (via `pd.DataFrame`)

While loading remote and local genomics data files is useful, often we want to visualize intermediate or derived information during analysis. Rather than writing these results to disk, `gos` supports visualizing in-memory data directly from pandas dataframes (`pd.DataFrame`).

In order to use this feature, we first load our dataset as a `pd.DataFrame`.

In [None]:
import pandas as pd

df = pd.read_csv(
    './UCSC.HG38.Human.CytoBandIdeogram.bed', 
    names=['chrom', 'chromStart', 'chromEnd', 'name', 'stain'],
    sep="\t"
)

df.head()

Let's filter `df` in Python so that our dataset only contains entries for Chromosome 2.

In [None]:
df = df[df.chrom == "chr2"]
df.head()

We can now create a `data` source for our visualization using the `df.gos.csv(...)` method, and visualize directly! Note how the resulting visualization only renders for chromosome 2.

In [None]:
data = df.gos.csv(
    # we only need to specify these fields since the rest are inferred from dataframe
    chromosomeField="chrom",
    genomicFields=["chromStart", "chromEnd"], 
)

ideogram(data) # view in context of full assembly

In [None]:
ideogram(data).properties(xDomain=gos.GenomicDomain(chromosome="chr2")) # view just chrom 2

We hope that you found this tutorial useful in getting started with `gos`! 

You can read more about [Gosling](http://gosling-lang.org/) to learn about the grammar features available in **gos**, and check out the **gos** [documentation](https://gosling-lang.github.io/gos/gallery/index.html) for more complex examples.