# Getting started: initializing, adding data, and saving your SwanGraph

First, if you haven't already, make sure to [install Swan](https://github.com/fairliereese/swan_vis/wiki#installation).
After installing, you'll be able to run Swan from Python.

Then, download the data and the reference transcriptome annotation from [here](https://hpc.oit.uci.edu/~freese/swan_files/). The bash commands to do so are given below.

Swan offers two main ways to load transcriptomes. You can load models either from [a properly formatted GTF](getting_started.md#adding-transcript-models-gtf-and-abundance-information-at-the-same-time) or from a [TALON db](getting_started.md#adding-transcript-models-talon-db-and-abundance-information).
Please see the [input file format documentation](../faqs/file_formats.md) for specifics on how these files should be formatted.

This tutorial provides three examples of how to add data to your SwanGraph. You only need to run one!
1. [Using a GTF and abundance table together](#gtf_ab)
2. [Using a GTF and abundance table separately](#gtf_ab_sep)
3. [Using a TALON database and abundance table together](#db_ab)

Other sections: 
* [Example data download](#data_download)
* [Starting and initializing your SwanGraph](#init)
* [Saving and loading your SwanGraph](#save_load)

This page can also be read from top to bottom; just know that you will end up running some steps more than once!

## <a name="data_download"></a> Download example data

In [13]:
# run this block in your bash terminal
mkdir data
mkdir figures
cd data/

# download files
wget http://crick.bio.uci.edu/freese/swan_files.tgz

# expand files
tar xzf swan_files.tgz
mv swan_files/* .
rm -r swan_files/

# download reference annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz
gunzip gencode.v29.annotation.gtf.gz

cd ../

## <a name="init"></a>Starting up Swan and initializing your SwanGraph

Run the rest of the code in a Python shell or from a `.py` file.

In [4]:
import swan_vis as swan

In [3]:
annot_gtf = 'data/gencode.v29.annotation.gtf'
hep_1_gtf = 'data/hepg2_1_talon.gtf'
hep_2_gtf = 'data/hepg2_2_talon.gtf'
hff_1_gtf = 'data/hffc6_1_talon.gtf'
hff_2_gtf = 'data/hffc6_2_talon.gtf'
hff_3_gtf = 'data/hffc6_3_talon.gtf'
ab_file = 'data/all_talon_abundance_filtered.tsv'
talon_db = 'data/talon.db'
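
Before loading anything, it can help to confirm that the download step put every file where the paths above expect it. The following is a minimal sketch using only the Python standard library; `find_missing` is a hypothetical helper written for this page and is not part of Swan.

```python
import os

def find_missing(paths):
    """Return the paths that do not exist on disk, in the order given.

    Hypothetical helper for this tutorial, not a Swan function.
    """
    return [p for p in paths if not os.path.exists(p)]

# Same paths as in the cell above; adjust them if you saved the data elsewhere.
expected = [
    'data/gencode.v29.annotation.gtf',
    'data/hepg2_1_talon.gtf',
    'data/hepg2_2_talon.gtf',
    'data/hffc6_1_talon.gtf',
    'data/hffc6_2_talon.gtf',
    'data/hffc6_3_talon.gtf',
    'data/all_talon_abundance_filtered.tsv',
    'data/talon.db',
]

missing = find_missing(expected)
if missing:
    print('Missing input files: ' + ', '.join(missing))
else:
    print('All input files found.')
```

If any file is reported missing, rerun the download commands from the directory that contains `data/`.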

Initialize an empty SwanGraph and add the transcriptome annotation to it.

In [10]:
# initialize a new SwanGraph
sg = swan.SwanGraph() 

# add an annotation transcriptome 
sg.add_annotation(annot_gtf)

Adding dataset annotation to the SwanGraph.


## <a name="gtf_ab"></a>Adding transcript models (GTF) and abundance information at the same time

Add each dataset to the SwanGraph along with its abundance information from the abundance matrix. The `count_cols` argument is the name of the column in the abundance file that holds the counts for that dataset.

In [13]:
# add a dataset's transcriptome and abundance information to
# the SwanGraph
sg.add_dataset('HepG2_1', hep_1_gtf,
	counts_file=ab_file,
	count_cols='hepg2_1')
sg.add_dataset('HepG2_2', hep_2_gtf,
	counts_file=ab_file,
	count_cols='hepg2_2')
sg.add_dataset('HFFc6_1', hff_1_gtf,
	counts_file=ab_file,
	count_cols='hffc6_1')
sg.add_dataset('HFFc6_2', hff_2_gtf,
	counts_file=ab_file,
	count_cols='hffc6_2')
sg.add_dataset('HFFc6_3', hff_3_gtf,
	counts_file=ab_file,
	count_cols='hffc6_3')

Adding dataset HepG2_1 to the SwanGraph.
Adding dataset HepG2_2 to the SwanGraph.
Adding dataset HFFc6_1 to the SwanGraph.
Adding dataset HFFc6_2 to the SwanGraph.
Adding dataset HFFc6_3 to the SwanGraph.
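
Because `count_cols` has to match a column name in the abundance file, a quick look at the file's header can catch typos before you call `add_dataset()`. The sketch below assumes the abundance file is tab-separated with a single header row; `missing_count_cols` is a hypothetical helper for illustration, not a Swan function.

```python
import csv
import os

def missing_count_cols(ab_path, count_cols):
    """Return the requested column names that are absent from the header row.

    Assumes a tab-separated file whose first row holds the column names.
    Hypothetical helper for this tutorial, not a Swan function.
    """
    with open(ab_path, newline='') as f:
        header = next(csv.reader(f, delimiter='\t'), [])
    return [c for c in count_cols if c not in header]

# Check the columns used in this tutorial, but only if the file is present.
ab_file = 'data/all_talon_abundance_filtered.tsv'
wanted = ['hepg2_1', 'hepg2_2', 'hffc6_1', 'hffc6_2', 'hffc6_3']
if os.path.exists(ab_file):
    print('Columns not found:', missing_count_cols(ab_file, wanted))
```

An empty list means every requested column is present in the header.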


##  <a name="save_load"></a>Saving and loading your SwanGraph

After adding your data, you can save your SwanGraph so that you can work with it again without re-adding everything.

In [11]:
# save the SwanGraph as a Python pickle file
sg.save_graph('swan')

Saving graph as swan.p


You can then reload the saved graph.

In [12]:
# load up a saved SwanGraph from a pickle file
sg = swan.SwanGraph('swan.p')

Graph from swan.p loaded
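
In a script that you run repeatedly, one way to combine saving and loading is to reuse the pickle when it already exists and rebuild the graph only when it does not. This is a sketch of that pattern under stated assumptions: `load_or_build` is a hypothetical helper, not part of Swan, and the commented usage relies only on the `swan.SwanGraph()`, `add_annotation()`, and `save_graph()` calls shown on this page.

```python
import os

def load_or_build(pickle_path, load, build):
    """Return load(pickle_path) if the file exists, otherwise build().

    Hypothetical helper for this tutorial, not a Swan function.
    `load` takes the pickle path; `build` takes no arguments.
    """
    if os.path.exists(pickle_path):
        return load(pickle_path)
    return build()

# Example usage with Swan (commented out so the block runs without the data):
#
# import swan_vis as swan
#
# def build():
#     sg = swan.SwanGraph()
#     sg.add_annotation(annot_gtf)
#     # ... add your datasets here ...
#     sg.save_graph('swan')   # writes swan.p
#     return sg
#
# sg = load_or_build('swan.p', swan.SwanGraph, build)
```

Delete `swan.p` whenever your input files change, so that the graph is rebuilt from the new data.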


##  <a name="gtf_ab_sep"></a>Adding transcript models (GTF) and abundance information separately

Swan can also run without abundance information, although many of its analysis functions depend on it. To load just the transcript models, leave out the `counts_file` and `count_cols` arguments to `add_dataset()`, as shown below.

In [None]:
# for this new example, create a new empty SwanGraph
sg = swan.SwanGraph()
# and add the annotation transcriptome to it
sg.add_annotation(annot_gtf)

In [16]:
# add transcriptome datasets from GTF files without
# corresponding abundance information
sg.add_dataset('HepG2_1', hep_1_gtf)
sg.add_dataset('HepG2_2', hep_2_gtf)
sg.add_dataset('HFFc6_1', hff_1_gtf)
sg.add_dataset('HFFc6_2', hff_2_gtf)
sg.add_dataset('HFFc6_3', hff_3_gtf)

Adding dataset annotation to the SwanGraph.
Adding dataset HepG2_1 to the SwanGraph.
Adding dataset HepG2_2 to the SwanGraph.
Adding dataset HFFc6_1 to the SwanGraph.
Adding dataset HFFc6_2 to the SwanGraph.
Adding dataset HFFc6_3 to the SwanGraph.


If you have added transcript models to the graph with `add_dataset()` and now want to add abundance information, use the `add_abundance()` function as shown below. Here, the string passed to `count_cols` is the column in the abundance file that corresponds to the dataset, and the argument passed to `dataset_name` is the name of a dataset that was already added to the SwanGraph in the previous code block.

In [8]:
# add abundance information corresponding to each of the datasets
# we've already added to the SwanGraph
# dataset_name must be a dataset that is already present in the SwanGraph
sg.add_abundance(ab_file, count_cols='hepg2_1', dataset_name='HepG2_1')
sg.add_abundance(ab_file, count_cols='hepg2_2', dataset_name='HepG2_2')
sg.add_abundance(ab_file, count_cols='hffc6_1', dataset_name='HFFc6_1')
sg.add_abundance(ab_file, count_cols='hffc6_2', dataset_name='HFFc6_2')
sg.add_abundance(ab_file, count_cols='hffc6_3', dataset_name='HFFc6_3')
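
With many datasets, the same calls can be driven from a mapping of dataset name to abundance column, which keeps the pairing in one place. The sketch below is one possible way to write this; `add_abundances` is a hypothetical wrapper, and it calls `add_abundance()` with exactly the arguments used in the cell above.

```python
def add_abundances(sg, ab_file, col_by_dataset):
    """Call sg.add_abundance() once per (dataset name, count column) pair.

    Hypothetical wrapper for this tutorial, not a Swan function. Every
    dataset name must already be present in the SwanGraph.
    """
    for dataset_name, col in col_by_dataset.items():
        sg.add_abundance(ab_file, count_cols=col, dataset_name=dataset_name)

# Dataset names as added to the SwanGraph, mapped to abundance file columns.
col_by_dataset = {
    'HepG2_1': 'hepg2_1',
    'HepG2_2': 'hepg2_2',
    'HFFc6_1': 'hffc6_1',
    'HFFc6_2': 'hffc6_2',
    'HFFc6_3': 'hffc6_3',
}

# With sg and ab_file defined as above:
# add_abundances(sg, ab_file, col_by_dataset)
```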

## <a name="db_ab"></a> Adding transcript models (TALON db) and abundance information

Swan is also compatible with TALON databases and can pull transcript models directly from them.

In [None]:
# for this new example, create a new empty SwanGraph
sg = swan.SwanGraph()
# and add the annotation transcriptome to it
sg.add_annotation(annot_gtf)

In [8]:
hepg2_whitelist='data/hepg2_whitelist.csv'
hffc6_whitelist='data/hffc6_whitelist.csv'

In [11]:
# add datasets directly from a TALON database and abundance
# information from an abundance table
# the whitelist is the output of the talon_filter_transcripts
# step, which filters novel isoforms based on their reproducibility
# and on evidence of internal priming
sg.add_dataset('HepG2_1', talon_db,
    dataset_name='hepg2_1',
    whitelist=hepg2_whitelist,
    counts_file=ab_file,
    count_cols='hepg2_1')
sg.add_dataset('HepG2_2', talon_db,
    dataset_name='hepg2_2',
    whitelist=hepg2_whitelist,
    counts_file=ab_file,
    count_cols='hepg2_2')

sg.add_dataset('Hffc6_1', talon_db,
    dataset_name='hffc6_1',
    whitelist=hffc6_whitelist,
    counts_file=ab_file,
    count_cols='hffc6_1')
sg.add_dataset('Hffc6_2', talon_db,
    dataset_name='hffc6_2',
    whitelist=hffc6_whitelist,
    counts_file=ab_file,
    count_cols='hffc6_2')
sg.add_dataset('Hffc6_3', talon_db,
    dataset_name='hffc6_3',
    whitelist=hffc6_whitelist,
    counts_file=ab_file,
    count_cols='hffc6_3')

Adding dataset HepG2_1 to the SwanGraph.
Adding dataset HepG2_2 to the SwanGraph.
Adding dataset Hffc6_1 to the SwanGraph.
Adding dataset Hffc6_2 to the SwanGraph.
Adding dataset Hffc6_3 to the SwanGraph.
