# Quick start

Welcome to the AlphaGenome quick start guide! This tutorial notebook gets you up and running with the model and walks through making your first predictions.

```{tip}
Open this tutorial in Google Colab for interactive viewing.
```

In [None]:
# @title Install AlphaGenome

# @markdown Run this cell to install AlphaGenome.
from IPython.display import clear_output
! pip install alphagenome
clear_output()

## Imports

In [None]:
# @title Import libraries
from alphagenome import colab_utils
from alphagenome.data import gene_annotation
from alphagenome.data import genome
from alphagenome.data import transcript as transcript_utils
from alphagenome.interpretation import ism
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers
from alphagenome.visualization import plot_components
import matplotlib.pyplot as plt
import pandas as pd

Three founder variants account for more than 90% of BRCA1 and BRCA2 variants in individuals of Ashkenazi Jewish heritage:

- BRCA1: c.68_69del (p.Glu23fs), also known as BRCA1_185delAG
- BRCA1: c.5266dup (p.Gln1756fs), also known as BRCA1_5382insC
- BRCA2: c.5946del (p.Ser1982fs), also known as BRCA2_6174delT

## Predict outputs for a DNA sequence

AlphaGenome is a model that makes predictions from DNA sequences. Let's load it up:




```{tip}
If using Google Colab, store your key in "Secrets" for persistent access across sessions (see [installation](https://www.alphagenomedocs.com/installation.html#google-colab)). Otherwise, `dna_client.create` can take the API key directly.
```

In [None]:
dna_model = dna_client.create(colab_utils.get_api_key())
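
Outside Colab, one option is to read the key from an environment variable and pass it directly to `dna_client.create`. The following is a minimal sketch under that assumption. The helper `read_api_key` and the variable name `ALPHA_GENOME_API_KEY` are illustrative choices, not names confirmed by this guide.

```python
import os


def read_api_key(env_var: str = 'ALPHA_GENOME_API_KEY') -> str:
    """Returns the API key stored in `env_var` (a hypothetical variable name)."""
    api_key = os.environ.get(env_var, '').strip()
    if not api_key:
        raise ValueError(f'Set the {env_var} environment variable to your API key.')
    return api_key


# Usage, assuming the key was exported in the shell beforehand:
# dna_model = dna_client.create(read_api_key())
```

Failing early with a clear message avoids sending a request with an empty key.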

The model can make predictions for the following [output types](https://www.alphagenomedocs.com/exploring_model_metadata.html):

In [None]:
[output.name for output in dna_client.OutputType]

['ATAC',
 'CAGE',
 'DNASE',
 'RNA_SEQ',
 'CHIP_HISTONE',
 'CHIP_TF',
 'SPLICE_SITES',
 'SPLICE_SITE_USAGE',
 'SPLICE_JUNCTIONS',
 'CONTACT_MAPS',
 'PROCAP']

AlphaGenome predicts multiple 'tracks' per output type, covering a wide variety of tissues and cell types. However, you can efficiently request predictions for only the subset of tracks you are interested in.

Here is how to make DNase-seq predictions (as specified by `OutputType`) in a subset of tracks corresponding to lung tissue (as specified by `ontology_terms`) for a short DNA sequence of length 2048:

*Note: We use ontology terms from standardized biological sources like UBERON (for anatomy) and the Cell Ontology (CL) to provide consistent and widely recognized classifications for tissue and cell types.*
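
Ontology terms such as `UBERON:0002048` are written as a prefix and an identifier joined by a colon. As a rough sketch, a term can be split into those two parts before it is passed to the model, which catches typos such as a missing prefix. The helper name `split_curie` is hypothetical and not part of AlphaGenome.

```python
def split_curie(term: str) -> tuple[str, str]:
    """Splits a term such as 'UBERON:0002048' into (prefix, local_id)."""
    prefix, sep, local_id = term.partition(':')
    if not sep or not prefix or not local_id:
        raise ValueError(f'Expected "PREFIX:ID", got {term!r}')
    return prefix, local_id


print(split_curie('UBERON:0002048'))  # ('UBERON', '0002048')
```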


In [None]:
output = dna_model.predict_sequence(
    sequence='GATTACA'.center(2048, 'N'),  # Pad to valid sequence length.
    requested_outputs=[dna_client.OutputType.DNASE],
    ontology_terms=['UBERON:0002048'],  # Lung.
)
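
The padding in the call above uses Python's built-in `str.center`, which returns the original string unchanged when it is already longer than the requested width. A small wrapper can make that case an explicit error. This is only a sketch, and `pad_sequence` is a hypothetical helper name, not an AlphaGenome function.

```python
def pad_sequence(sequence: str, target_length: int, pad_char: str = 'N') -> str:
    """Centers `sequence` in a string of `target_length`, filling with `pad_char`."""
    if len(sequence) > target_length:
        raise ValueError(
            f'Sequence length {len(sequence)} exceeds target {target_length}.'
        )
    return sequence.center(target_length, pad_char)


padded = pad_sequence('GATTACA', 2048)
print(len(padded))        # 2048
print(padded.count('N'))  # 2041
```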

The `output` object contains predictions for all the different requested output types (in this case, only output type `DNASE`). Predictions for genomic tracks are stored inside a `TrackData` object:

In [None]:
dnase = output.dnase
type(dnase)

alphagenome.data.track_data.TrackData

`TrackData` objects have the following components:

<a href="https://services.google.com/fh/files/misc/trackdata.png"><img src="https://services.google.com/fh/files/misc/trackdata.png" alt="trackdata" border="0" height=500></a>

The predictions of shape `(sequence_length, num_tracks)` are stored in `.values`:

In [None]:
print(dnase.values.shape)

dnase.values

(2048, 1)


array([[0.00138092],
       [0.00121307],
       [0.00121307],
       ...,
       [0.00138092],
       [0.00213623],
       [0.00292969]], shape=(2048, 1), dtype=float32)
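
Because `.values` is a NumPy array laid out as `(sequence_length, num_tracks)`, standard NumPy reductions and slicing apply. The sketch below uses a randomly generated stand-in array rather than real predictions, so only the shapes are meaningful.

```python
import numpy as np

# Stand-in for `dnase.values`: random non-negative numbers with the same
# (sequence_length, num_tracks) layout. These are NOT model predictions.
rng = np.random.default_rng(seed=0)
values = rng.random((2048, 1), dtype=np.float32)

# Reduce over axis 0 (sequence positions) to get one number per track.
mean_per_track = values.mean(axis=0)
peak_position_per_track = values.argmax(axis=0)

# Slice a window of positions; the track axis is left untouched.
window = values[1000:1100, :]

print(mean_per_track.shape)           # (1,)
print(peak_position_per_track.shape)  # (1,)
print(window.shape)                   # (100, 1)
```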

The metadata describing each track is stored in `.metadata`:


In [None]:
dnase.metadata

| | name | strand | Assay title | ontology_curie | biosample_name | biosample_type | biosample_life_stage | data_source | endedness | genetically_modified |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UBERON:0002048 DNase-seq | . | DNase-seq | UBERON:0002048 | lung | tissue | embryonic | encode | paired | False |


In this case, there is only one output track, so the track metadata contains a single row.

The track metadata is especially useful when requesting predictions for multiple tissues or cell types, and when dealing with stranded assays (assays with separate readouts for the two DNA strands, such as CAGE and RNA-seq):


In [None]:
output = dna_model.predict_sequence(
    sequence='GATTACA'.center(2048, 'N'),  # Pad to valid sequence length.
    requested_outputs=[
        dna_client.OutputType.CAGE,
        dna_client.OutputType.DNASE,
    ],
    ontology_terms=[
        'UBERON:0002048',  # Lung.
        'UBERON:0000955',  # Brain.
    ],
)

print(f'DNASE predictions shape: {output.dnase.values.shape}')
print(f'CAGE predictions shape: {output.cage.values.shape}')

DNASE predictions shape: (2048, 2)
CAGE predictions shape: (2048, 4)


Notice that in this example, we requested predictions for 2 assays and 2 ontology terms simultaneously.
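
The second dimension of each printed shape can be worked out by hand. DNase-seq is unstranded, so each ontology term contributes one track, while CAGE is stranded and contributes one track per strand. The sketch below assumes exactly one track per ontology term per strand, which matches this example but may not hold for every term or assay. `expected_num_tracks` is a hypothetical helper.

```python
def expected_num_tracks(num_ontology_terms: int, stranded: bool) -> int:
    """Track count, assuming one track per ontology term per strand."""
    strands = 2 if stranded else 1
    return num_ontology_terms * strands


print(expected_num_tracks(2, stranded=False))  # 2 (DNASE)
print(expected_num_tracks(2, stranded=True))   # 4 (CAGE)
```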

The CAGE track metadata describes the strand and tissue of each of the 4 predicted tracks (2 per DNA strand):

In [None]:
output.cage.metadata

| | name | strand | Assay title | ontology_curie | biosample_name | biosample_type | data_source |
|---|---|---|---|---|---|---|---|
| 0 | hCAGE UBERON:0000955 | + | hCAGE | UBERON:0000955 | brain | tissue | fantom |
| 1 | hCAGE UBERON:0002048 | + | hCAGE | UBERON:0002048 | lung | tissue | fantom |
| 2 | hCAGE UBERON:0000955 | - | hCAGE | UBERON:0000955 | brain | tissue | fantom |
| 3 | hCAGE UBERON:0002048 | - | hCAGE | UBERON:0002048 | lung | tissue | fantom |
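
Since each metadata row describes one column of `.values`, the metadata can be used to pick out specific tracks. The following sketch selects the positive-strand lung track using plain pandas and NumPy indexing. It runs on stand-in data (metadata copied from the output above, with an array of zeros in place of predictions), so it works without an API key.

```python
import numpy as np
import pandas as pd

# Stand-in for `output.cage.metadata`, copied from the metadata shown above.
metadata = pd.DataFrame({
    'name': ['hCAGE UBERON:0000955', 'hCAGE UBERON:0002048'] * 2,
    'strand': ['+', '+', '-', '-'],
    'biosample_name': ['brain', 'lung', 'brain', 'lung'],
})
# Stand-in for `output.cage.values`; zeros, NOT model predictions.
values = np.zeros((2048, 4), dtype=np.float32)

# Row i of the metadata describes column i of the values array, so a boolean
# mask built from the metadata can index the track axis directly.
is_lung_plus = (metadata['strand'] == '+') & (metadata['biosample_name'] == 'lung')
lung_plus_values = values[:, is_lung_plus.to_numpy()]

print(metadata.loc[is_lung_plus, 'name'].tolist())  # ['hCAGE UBERON:0002048']
print(lung_plus_values.shape)                       # (2048, 1)
```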
