# Goals
**[Script]** This script downloads reference files (FASTA and GTF files), creates other important files (annotation, BED, and index files), and adds them to a master file containing all references for all species.

**[Standardizing Genomes](https://www.biostars.org/p/342482/) // [Advantage of T2T](https://www.biostars.org/p/9560818/#9560866) // [Advantage of Pangenome (HPRC)](https://www.biostars.org/p/9563810/)**
- T2T-CHM13: More accurate (esp. SV) vs. GRCh38: More annotations (ex. UCSC tracks)
    - T2T-CHM13 (vs. GRCh38): No gaps, adds nearly 200M bps (+4.5%), corrects thousands of structural errors and spurious SNVs, and increases the number of annotated genes from 60,090 to 63,494

**Contig->Scaffold->Chr ([Ensembl](https://useast.ensembl.org/info/genome/genebuild/chromosomes_scaffolds_contigs.html), [RefSeq](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/glossary/))**
- **YES:** unlocalized/unplaced scaffolds (UNL/UNP; no coordinates)
    - [STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) says to include primaries, MT, UNP, and UNL (no patches or ALT)
        - UNP/UNL scaffolds only add 1-10M bp and can accurately map reads (ex. to rRNA repeats on UNPs)
- **NO:** patches (sequence updates), placed scaffolds (already in chr.), *alternate loci (maybe)*
    - Patches affect alignment and variant calling due to sequence redundancy ([RefSeq](https://www.ncbi.nlm.nih.gov/grc/help/patches/))

**[Additional Genome Databases]**
1. **Need a unique genome?** [UCSC Public Hubs / GenArk](https://genome.ucsc.edu/cgi-bin/hgHubConnect) hosts external sources containing exotic genomes
2. **Need a different file?** See [UCSC's Table Browser](https://genome.ucsc.edu/cgi-bin/hgTables?command=start)

**[To Do]**
1. Download pangenomes and compare to canonical references
2. Add md5 checks for reference files (see `check_sums` in References.py, ex. [T2T](https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/README.txt))
3. Add option to create annotation and BED files with all sequences (MT, UNP, UNL, ALT, PATCH), not just primaries
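
A minimal sketch of the md5 check from item 2, assuming the source publishes checksums as `md5sum`-style lines (`<digest>  <file name>`). The helper names `md5_of_file`, `parse_md5_listing`, and `verify_md5` are hypothetical and are not part of References.py; other sources may publish checksums in a different layout.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines into {file name: digest} (assumed layout)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:  # Skip blank or malformed lines
            digest, name = parts
            # Drop a leading "*" (binary-mode marker) and any directory prefix
            expected[Path(name.strip().lstrip("*")).name] = digest.lower()
    return expected


def verify_md5(path, expected):
    """Return True/False if the file is listed, or None if it has no checksum."""
    name = Path(path).name
    if name not in expected:
        return None
    return md5_of_file(path) == expected[name]
```

Returning `None` for unlisted files keeps "no checksum published" distinct from "checksum mismatch", so a download step could warn on the former and fail on the latter.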

# Packages

In [2]:
#########################
### Standard Library ####
#########################
import os
import json
from glob import glob
import warnings
import subprocess
from configparser import ConfigParser, ExtendedInterpolation

#####################
### Data Cleaning ###
#####################
import requests
import gtfparse
import numpy as np
import pandas as pd
import janitor as jn
import VinlandPy as vp
from IPython.display import Markdown, display

####################
### Session Info ###
####################
import session_info

## Options

In [None]:
warnings.simplefilter(action="once", category=Warning)

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 60)

## Functions

In [None]:
# Run References.py in the notebook namespace so its functions are defined here
with open("./References.py") as functions:
    exec(functions.read())

# Parameters

## Inputs

In [None]:
cfg = vp.load_config("RefSeq_GRCh38.cfg", show_variables=False)
globals().update(cfg)  # Expose each config key as a top-level variable

n_threads = vp.set_threads()
n_threads

## Outputs

In [None]:
vp.create_dir(PATH_TO_NEW_REFERENCE, show_print=False)  # Path formatted as {species}/{source}/{release}/

# Get URLs

In [None]:
url_dict = get_urls(source, species, release, file_type, readme_file, release_type="older")

# Download files
- **Ensembl and RefSeq:** GTF, genome, CDS, and protein FASTA files
- **Ensembl only:** cDNA, ncRNA, map files (ex. Ensembl-RefSeq IDs)
- **RefSeq only:** mRNA, pRNA, and coding proteins

In [None]:
file_dict = download_files(source, url_dict)

## [RefSeq only] Download primaries
- **Why?** RefSeq genomes contain patches and alternate loci, which duplicate sequences already present in the primary chromosomes. These should be removed, since the redundancy causes multi-mapping and degrades alignment.
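
As an illustration of the idea only (not the method used by `download_primary_assemblies`, whose code lives in References.py), the sketch below keeps FASTA records whose ID is in an explicit allow-list. The function name `filter_fasta` and the toy records are hypothetical; `NC_000001.11` is the RefSeq accession for GRCh38 chr1, and `NW_EXAMPLE.1` is a made-up placeholder standing in for a patch scaffold.

```python
def filter_fasta(lines, keep_ids):
    """Yield only the FASTA lines of records whose ID is in `keep_ids`.

    The record ID is the first whitespace-delimited token after ">".
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


# Toy input: one primary chromosome and one placeholder patch record
records = [">NC_000001.11 chr1\n", "ACGT\n", ">NW_EXAMPLE.1 patch\n", "TTTT\n"]
primaries = {"NC_000001.11"}
filtered = list(filter_fasta(records, primaries))
```

An allow-list is used here because accession prefixes alone may not separate patches and alternate loci from the unlocalized/unplaced scaffolds that should be kept.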

In [None]:
download_primary_assemblies(url_dict, file_dict, delete_temp_folder=True, overwrite_file=False)

# Create annotation files

In [None]:
ann_dict = create_anno_files(file_dict, ann_type, overwrite_files=False)

# Create BED files

In [None]:
bed_dict = create_bed_files(file_dict, ann_dict, bed_type, overwrite_files=False)

## Create FASTA from BED file

In [None]:
fasta_dict = get_fasta_from_bed(file_dict, bed_dict, "gene", n_threads, overwrite_file=False)

# Get FASTA stats

In [None]:
stats_path = get_fasta_stats({**file_dict, **fasta_dict}, n_threads, overwrite_file=False)

# Create indexes
- **[Slow steps]** Indexing for STAR and bowtie
- **[Large files]** STAR index files (ex. Ensembl's GRCh38 is 27G)

In [None]:
idx_dict = create_indexes({**file_dict, **fasta_dict}, idx_type, overwrite_files=False)

# Create references.json
- If `new_file_url` and `new_ref_key` are provided, this function will download a single, specified file and add it to the master reference file
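
A minimal sketch of the merge step, assuming the master file is a JSON object keyed by species, with each species mapping file keys to paths. The real layout of references.json and the behavior of `add_ref_to_master_file` are defined in References.py and may differ; `add_to_master` is a hypothetical name.

```python
import json
from pathlib import Path


def add_to_master(master_path, species, new_entries):
    """Merge `new_entries` under `species` in a JSON file and return the full mapping."""
    master_path = Path(master_path)
    # Start from the existing file if present, otherwise from an empty mapping
    master = json.loads(master_path.read_text()) if master_path.exists() else {}
    # New keys are added; existing keys for this species are overwritten
    master.setdefault(species, {}).update(new_entries)
    master_path.write_text(json.dumps(master, indent=4, sort_keys=True))
    return master
```

Merging with `setdefault(...).update(...)` leaves other species untouched, so the notebook could be rerun per species without rebuilding the whole file.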

In [None]:
ref = add_ref_to_master_file({**file_dict,**fasta_dict,**ann_dict,**bed_dict,**idx_dict}, species)
ref[species]

# Session info

In [3]:
session_info.show(os=True, std_lib=False, dependencies=False)

In [4]:
print(("\n").join(["-----", "bedtools    2.31.1", "bedparse    0.2.3", "seqkit      2.9.0", "samtools    1.20", "bowtie      1.3.1", "STAR        2.7.11b", "RSEM        1.3.3", "salmon      0.14.1", "-----"]))

-----
bedtools    2.31.1
bedparse    0.2.3
seqkit      2.9.0
samtools    1.20
bowtie      1.3.1
STAR        2.7.11b
RSEM        1.3.3
salmon      0.14.1
-----
