In [None]:
import re
import io
import json
import tempfile

from copy import copy
from collections import Counter
from itertools import groupby, product, combinations
from zipfile import ZipFile

import numpy as np
import matplotlib.pyplot as plt

from Bio import SeqIO, AlignIO, codonalign
from Bio.Align.Applications import MuscleCommandline
from Bio.Seq import Seq, MutableSeq

# NCBI Datasets (07/21/2022)

### Table of contents <a class="anchor" id="top"></a>
1. [Jupyter Notebooks](#notebooks)
1. [Natural selection and dN/dS](#dnds)
1. [Introduction to NCBI Datasets](#datasets) 
1. [Getting a list of gene-ids per species](#geneids)
1. [Getting ortholog data packages from a list of gene-ids](#orthologs)
1. [Closer look at the ortholog metadata files](#metadata)
1. [Reading sequence data](#sequence)
1. [Making a sequence alignment](#alignment)
1. [Counting substitutions](#substitutions)
1. [Adjusting for codon usage](#codons)
1. [Further learning](#references)

## Jupyter Notebooks<a class="anchor" id="notebooks"></a>

### What is a Jupyter notebook and what is it for?
1. A Jupyter notebook is a document that allows you to combine code, formatted text,
and images.
2. Notebooks are displayed and edited in a web browser.
3. You can edit and run the code in place and display the output.
4. They are useful for:
    - Exploration: you can quickly test out ideas and see the results
    - Documentation: a Jupyter notebook constitutes a record of precisely what you
did. (Think of it as a "lab notebook" for your computational "experiments.")
    - Communication: Jupyter notebooks make it easy to share what you did with
colleagues (e.g. reports for your PI, interactive examples to accompany publications)

### Creating, editing, and running cells

(Demo)

See also, [Jupyter Notebook Cheat Sheet](https://www.datacamp.com/cheat-sheet/jupyter-notebook-cheat-sheet)

### Troubleshooting the notebook

#### Undoing a change

If you get an error or would like to undo a change, select the cell and
use `CTRL-Z` (`CMD-Z` on a Mac) to undo the most recent change.

#### Running cells in order

Cells in a Jupyter notebook can be run in any order, but this notebook is designed to be run from top to bottom.
If you're getting errors, it could be because you forgot to run all the cells above the one
you're working on.

To rerun all the cells above the one you're working on,
select the "Cell" drop-down menu and then click "Run all above".

#### Interrupting a cell that's taking too long

The code in a Jupyter notebook is run by a program called the kernel.
Most of the time, we can ignore it, but if you get stuck, it can help to know
how to stop or restart the kernel.

Sometimes a cell is taking too long to run, either because
you made a mistake or because the task is bigger than you expected.
If you'd like to stop a cell from running, you can interrupt the kernel by
hitting the square "stop" button next to "Run" at the top of the notebook.

#### Restarting the kernel
Sometimes, you will want Jupyter to "forget" the
results of the cells you've run and start fresh.
To restart the kernel:
1. Click the "Kernel" drop-down menu at the top and select:
"Restart and clear output".
2. Go to the place in the notebook where you left off and use the "Cell"
drop-down menu to "Run all above" to run the previous cells and get back on track.

#### If you would like to start over:

If you change things in the notebook, can't get it to run, and want to start
over:
1. Select the name of the notebook "workshop" at the top of the page.
2. Change the name to anything else like "broken-notebook".
3. Press the save button.
4. Click the original link that you followed to get to the notebook.
5. Go to the place in the notebook where you left off and use the "Cell"
drop-down menu to "Run all above" to run the previous cells and get back on track.

⬆︎[back to top](#top)

## Natural Selection and dN/dS<a class="anchor" id="dnds"></a>

**Our question**: Can we detect natural selection by comparing ortholog sequences between species?

**The idea**: We can compare the rates of synonymous and non-synonymous substitutions to look for signals of purifying (or positive) selection.

### The genetic code and single-basepair substitutions
Recall that the genetic code is redundant: multiple codons encode the same amino acid.
As a result, not all single-basepair substitutions change the protein sequence.

Example:

```
Sequence1: ACG TTG GCT
Protein1:   T   L   A
Sequence2: CCG TTG GCA
Protein2:   P   L   A
```

The `A->C` mutation in the first codon is *nonsynonymous* (also called missense),
while the `T->A` mutation in the third codon is *synonymous*.
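
As a quick check of the example above, here is a minimal sketch in plain Python that labels each differing codon. The codon table is the standard genetic code written out by hand, and the names `CODON_TABLE` and `classify_codon_changes` are ours, not part of the workshop code.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(seq1, seq2):
    """Compare two equal-length, in-frame sequences codon by codon."""
    changes = []
    for i in range(0, len(seq1), 3):
        codon1, codon2 = seq1[i:i + 3], seq2[i:i + 3]
        if codon1 == codon2:
            continue
        same_aa = CODON_TABLE[codon1] == CODON_TABLE[codon2]
        kind = "synonymous" if same_aa else "nonsynonymous"
        # Report 1-based codon positions, as in the text above.
        changes.append((i // 3 + 1, codon1, codon2, kind))
    return changes


example_changes = classify_codon_changes("ACGTTGGCT", "CCGTTGGCA")
for position, codon1, codon2, kind in example_changes:
    print(f"codon {position}: {codon1} -> {codon2} is {kind}")
```

Later in the notebook, BioPython's `translate()` plays the role of the hand-written table.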

### Selection and rates of molecular evolution:
- The molecular clock: synonymous mutations accumulate at a constant rate
- Purifying selection: most non-synonymous mutations are harmful and eliminated by natural selection
- Positive selection: some non-synonymous mutations may improve fitness. These will fix at a faster-than-neutral rate

We call the amount of nonsynonymous divergence between species $dN$ and synonymous divergence $dS$.
The ratio $dN/dS$ (adjusted for codon usage) contains a signal of selection:
- $dN/dS < 1$: strong purifying selection (meaning the gene is important and well-adapted).
- $dN/dS \approx 1$: relaxed purifying selection.
- $dN/dS > 1$: strong positive selection, adaptation

We will be comparing Drosophila species with different levels of divergence across a large number of ortholog families to categorize the orthologs by dN/dS.

### Workflow
1. Download ortholog sequences with NCBI Datasets
2. Read the sequences with BioPython
3. Make a codon-aware sequence alignment with Muscle and BioPython
4. Count substitutions between species for each ortholog
5. Account for codon usage to compute sequence-adjusted $dN/dS$.

⬆︎[back to top](#top)

## Introduction to NCBI Datasets<a class="anchor" id="datasets"></a>

### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/) is a command line tool that allows users to download data packages (data + metadata) or look at metadata summaries for genomes, RefSeq annotated genes, curated ortholog sets and SARS-CoV-2 virus sequences and proteins. The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v1/datasets_schema_complete.svg" alt="datasets" style="width: 800px;"/>

In addition to `datasets`, we will be using `jq` (a JSON parser) and `dataformat`, a `datasets` companion tool, to take a look at the metadata information. Our metadata reports are almost all in JSON or [JSON Lines](https://jsonlines.org/) format. We put together a [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract information from those files.  

For example: if I want to download the monarch butterfly (<i>Danaus plexippus</i>) reference genome, I would use the command below:

In [None]:
%%bash
#Download the monarch butterfly (taxid 13037) reference genome, with associated annotation files and metadata

datasets download genome taxon 13037 --reference --no-progressbar


Instead of downloading a data package, I could look at the metadata by using the `summary` command. Here, I'm piping it to [`jq`](https://stedolan.github.io/jq/) so it's easier to read:

In [None]:
%%bash
#Check the metadata information for the monarch butterfly reference genome

datasets summary genome taxon 13037 --reference | jq .

### Data packages

NCBI Datasets delivers data as <u>data packages</u>, which are zip archives containing both data (FASTA, GFF3, GTF, GBFF) and metadata files (JSON, JSON-Lines). The image below shows the contents of all data packages. Files are included depending on their availability. For example: for an annotated genome, the data package would include FASTA files (genomic, transcript, protein and CDS sequences) and annotation files (GFF3, GTF and GBFF).


<img src="./images/datapackages.png" alt="data_package" style="width: 800px;"/>

Currently, the `virus` option only includes SARS-CoV-2 genomes and proteins. 

### How to get help when using the command line

Since `datasets` is a very hierarchical program, we can use that characteristic to our advantage to get very specific help.   For example: if we type `datasets --help`, we will see the first level of commands available.


In [None]:
%%bash
datasets --help

Notice the difference from when we type `datasets summary genome taxon formicidae --help`  


In [None]:
%%bash
datasets summary genome taxon formicidae --help

### Exercises:

Now let's use `datasets` and `jq` to take a look at the available genomes for your taxon of interest. Use our [jq cheatsheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to look at the available fields and look for the information you need. Some ideas: 
- how many genomes are available?
- What's the contig N50?


In [None]:
%%bash



In [None]:
%%bash



In [None]:
%%bash



⬆︎[back to top](#top)

## Getting a list of gene-ids per species <a class="anchor" id="geneids"></a>

Independent of choosing `datasets download` or `datasets summary`, there are three options for retrieving gene information: <i>accession</i>, <i>gene-id</i>, and <i>symbol</i>.  

When choosing any of those three options, you will retrieve the gene information for the **reference** taxon. Like this:

`datasets download gene accession XR_002738142.1`  
`datasets download gene gene-id 101081937`  
`datasets download gene symbol BRCA1 --taxon cat`  

All three commands will download the same gene from the cat (<i>Felis catus</i>) <u>reference genome</u>. 

- **accession**: Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, taxon is implied (aka there will never be two sequences from different taxa with the same accession number).  

- **gene-id**: Also a unique identifier. Every RefSeq genome annotated has a unique set of identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat is 101081937.  

- **symbol**: Unlike accession and gene-id, a gene symbol is not unique and means different things in different taxonomic groups. If using the symbol option, you should specify the species. The default option is human.

**Remember**: both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download a curated set of the same gene for multiple taxa, you should use the option `ortholog`. We'll talk more about it later. For reference, here's the JSON organization of the gene summary metadata. 

### Now let's take a look at a gene example:

<img src="./images/gene_json_gene-id.png" />

We want to extract a list of gene-ids for *Drosophila melanogaster* (tax-id 7227). We will use that list later to download ortholog sets for each gene-id. Here, we will invoke the command `summary` and pipe the JSON output to `jq` to extract the complete list of gene-ids for that species.   
If you look in the figure above and follow the numbers in the command below, you'll understand better how to build a `jq` filtering command.

In [None]:
%%bash
# Get list of gene-ids for D. melanogaster and save as a txt file

                                             #1     #2    #3
datasets summary gene taxon 7227 | jq -r '.genes[].gene.gene_id' > dmel_gene-ids.txt
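
For readers more comfortable in Python than in `jq`, the same path (`.genes[]`, then `.gene`, then `.gene_id`) can be followed with the `json` module. This is only a sketch: the JSON fragment is a made-up, heavily trimmed stand-in for the real summary output, and `extract_gene_ids` is a name we introduce here.

```python
import json

# Hypothetical, trimmed stand-in for the `datasets summary gene` output.
# Only the keys used by the jq filter are shown.
summary_text = """
{"genes": [
  {"gene": {"gene_id": "672"}},
  {"gene": {"gene_id": "101081937"}}
]}
"""


def extract_gene_ids(summary):
    """Mirror the jq path .genes[].gene.gene_id on a parsed summary."""
    return [entry["gene"]["gene_id"] for entry in summary["genes"]]


gene_id_list = extract_gene_ids(json.loads(summary_text))
print("\n".join(gene_id_list))
```

Each list comprehension step corresponds to one numbered step in the `jq` filter, which can make longer filters easier to reason about.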


Let's take a quick look at how many genes are annotated in <i>D. melanogaster</i>.

In [None]:
%%bash
# Count the number of lines (genes) in the list

wc -l dmel_gene-ids.txt

### Exercises

Choose another species and look for the number of genes annotated in it. Hint: you can pipe the `wc -l` command (count number of lines) after the `jq` command that extracts the gene-ids. 

In [None]:
%%bash




⬆︎[back to top](#top)

## Accessing orthologs <a class="anchor" id="orthologs"></a>

### Orthologs

The options to retrieve ortholog sets are the same as those for genes. We'll go over the differences when using each option: <i>accession</i>, <i>gene-id</i>, and <i>symbol</i>.   

When choosing any of those three options, you will download the **full ortholog set** to which they belong (unless you apply additional filtering, which we'll cover below). Like this:

`datasets download ortholog accession XR_002738142.1`  
`datasets download ortholog gene-id 101081937`  
`datasets download ortholog symbol BRCA1 --taxon cat`  

All three commands will download the **same** ortholog set. 

---

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of sequences that have been identified by the NCBI genome annotation team as homologous genes related to each other by speciation events. They are identified by a combination of protein similarity + local synteny information. 
Currently, NCBI has ortholog sets calculated for vertebrates and some insects. 
---

- **accession**: Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, taxon is implied (aka there will never be two sequences from different taxa with the same accession number).

- **gene-id**:  Also a unique identifier. Every RefSeq genome annotated has a unique set of identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat is 101081937. You can use either one (672 or 101081937) to get the same vertebrate BRCA1 ortholog set.

- **symbol**: Unlike accession and gene-id, a gene symbol is not unique and means different things in different taxonomic groups. For example: the P53 ortholog set in vertebrates is different from the insect set. If using the symbol option, you should specify the taxonomic group. The default option is human. Note that if you want ortholog sets from multiple vertebrate species, you might end up downloading the same ortholog set multiple times. Like this: 

`datasets download ortholog symbol brca1 --taxon cat`  
`datasets download ortholog symbol brca1 --taxon chicken`  
`datasets download ortholog symbol brca1 --taxon "chelonia mydas"`  

If that's the case, how do you filter the ortholog set to include *only* your taxonomic group of interest?

### Applying a taxonomic filter to the ortholog set

For the orthologs, `datasets` provides the flag `--taxon-filter`, which allows the user to restrict the summary or download to one or multiple taxonomic groups.  `--taxon` and `--taxon-filter` have different effects on the data package/summary output. A few examples:

- `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`  
Prints a JSON metadata summary of the gene brca1 for the domestic cat. 
We did not specify a `--taxon` because the default is human, and Felidae and human are part of the same brca1 ortholog set.   

  

- `datasets summary ortholog symbol brca1 --taxon "felis catus"`  
Even though this option looks almost the same as the one above, the result is *very different*. Here, we're asking `datasets` to find the ortholog set to which the gene brca1 in the domestic cat belongs. And `datasets` will download the <u>entire</u> ortholog set, not only the sequences for the domestic cat.


- `datasets summary ortholog symbol brca1 --taxon "felis catus" --taxon-filter "felis catus"`  
gives you the same result as `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`


The summary metadata for orthologs is presented in JSON Lines, which means that each gene entry is in a different line. Here's the diagram to help you create queries.
  
<img src="./images/ortholog_jsonl.drawio.png" />

#### We are going to do the following steps:
- Use a loop to download an ortholog data package for each gene-id in the list. Since the list is too long, let's try to download only the first 20 gene-ids from the list.
- Unzip it to the folder ortholog
- Look at the files

In [None]:
%%bash
# download an ortholog data package per gene-id. We will be using a reduced set (`head -n20`) as example:

rm -rf orthologs; #if the folder orthologs exist, remove it.
mkdir orthologs; # create the folder orthologs.

head -n 20 dmel_gene-ids.txt | while read GENEID; do
    echo ${GENEID}; 
    datasets download ortholog gene-id "${GENEID}" \
    --include-cds \
    --filename ./orthologs/$GENEID.zip \
    --taxon-filter 7215 --no-progressbar \
    2> /dev/null \
    || echo "No orthologs found."
done
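
If you would rather drive the downloads from Python, the loop above can be sketched with `subprocess`. The flags are the ones used in the shell cell, while the helper names (`build_download_command`, `download_orthologs`) are ours. Like the shell version, this assumes that `datasets` exits with a non-zero status when no ortholog set is found.

```python
import subprocess
from pathlib import Path


def build_download_command(gene_id, outdir="orthologs", taxon_filter="7215"):
    """Return the argument list for one ortholog download (same flags as the shell loop)."""
    return [
        "datasets", "download", "ortholog", "gene-id", str(gene_id),
        "--include-cds",
        "--filename", str(Path(outdir) / f"{gene_id}.zip"),
        "--taxon-filter", taxon_filter,
        "--no-progressbar",
    ]


def download_orthologs(gene_ids, outdir="orthologs"):
    """Run one download per gene-id and return the gene-ids that failed."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    failed = []
    for gene_id in gene_ids:
        result = subprocess.run(
            build_download_command(gene_id, outdir),
            stderr=subprocess.DEVNULL,  # same effect as `2> /dev/null`
        )
        if result.returncode != 0:
            failed.append(gene_id)
    return failed


# Only print the command here; call download_orthologs([...]) to actually run it.
print(" ".join(build_download_command(10178777)))
```

Collecting the failed gene-ids in a list makes it easy to count how many genes had no ortholog set.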

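If you look at the output above, you will notice that not all gene-ids were part of ortholog sets: for those, the loop prints `No orthologs found.` (the underlying `datasets` message, `Error: no valid NCBI gene identifiers, exiting`, is hidden by `2> /dev/null`). Let's list the ortholog sets that were downloaded.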

In [None]:
%%bash

ls orthologs

Now, let's unzip one of those data packages and check its content:

In [None]:
%%bash

unzip orthologs/10178777.zip -d 10178777

Let's use the command `tree` to check the folder hierarchy more easily.

In [None]:
%%bash
tree 10178777/

In the `data` folder, you will find all the data and metadata files:
- FASTA sequences (CDS, gene, protein and RNA); 
- `data_report.jsonl`: metadata file describing the genes included in the data package. 
- `dataset_catalog.json`: lists all files included in the data package
- `data_table`: TSV file with a subset of the metadata from the `data_report.jsonl`
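
You can also inspect a package without unzipping it, using Python's `zipfile` module. The sketch below builds a tiny stand-in package in memory so that it runs anywhere. With a real download you would pass a path such as `orthologs/10178777.zip` instead. `list_package_files` is a helper name we introduce here.

```python
import io
from zipfile import ZipFile


def list_package_files(package):
    """Return the sorted member names of a data package (path or file-like object)."""
    with ZipFile(package) as zip_file:
        return sorted(zip_file.namelist())


# Build a tiny stand-in package in memory; the contents are placeholders.
demo_package = io.BytesIO()
with ZipFile(demo_package, "w") as zip_file:
    zip_file.writestr("ncbi_dataset/data/cds.fna", ">placeholder\nATG\n")
    zip_file.writestr("ncbi_dataset/data/data_report.jsonl", "{}\n")
demo_package.seek(0)

package_files = list_package_files(demo_package)
print("\n".join(package_files))
```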

In the next section, we will take a closer look at the metadata files. 

⬆︎[back to top](#top)

## Accessing metadata<a class="anchor" id="metadata"></a>

For the JSON-Lines metadata files included with the data packages, you have the option of using `dataformat` instead of `jq` to extract metadata information. `dataformat` is a `datasets` companion tool that allows you to export metadata information to tabular or Excel format. We are working to make `dataformat` compatible with all `datasets summary` command options (currently, it works only with `datasets summary virus`).

Here, we will use `dataformat` to extract the following info from each ortholog data package: organism, accession, genomic range start and genomic range end. You can find more info about the `dataformat` fields for gene in [our documentation page](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/command-line/dataformat/tsv/dataformat_tsv_gene/)

In [None]:
!dataformat tsv gene \
--package orthologs/10178777.zip \
--fields tax-name,genomic-range-accession,genomic-range-range-start,genomic-range-range-stop |\
column -t -s$'\t' #added to align all the columns and make the visualization easier. But not really needed.

If we prefer to use `jq`, here's how to achieve the same result.

In [None]:
%%bash

cat 10178777/ncbi_dataset/data/data_report.jsonl | jq -r '[.taxname,
.genomicRanges[].accessionVersion,
.genomicRanges[].range[].begin,
.genomicRanges[].range[].end] | @tsv'
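
Because a JSON Lines file holds one JSON object per line, it is also easy to read in Python. The sketch below follows the same fields as the `jq` filter (`taxname`, `genomicRanges`, `accessionVersion`, `range`, `begin`, `end`). The two report lines, including the accessions and coordinates, are invented placeholders, and `report_rows` is our own helper name. Unlike the `jq` filter, it emits one row per range.

```python
import json

# Hypothetical two-line data report; values are placeholders, and only the
# fields used by the jq filter are kept.
report_text = (
    '{"taxname": "Drosophila melanogaster", "genomicRanges": '
    '[{"accessionVersion": "ACC_1.1", "range": [{"begin": "100", "end": "900"}]}]}\n'
    '{"taxname": "Drosophila simulans", "genomicRanges": '
    '[{"accessionVersion": "ACC_2.1", "range": [{"begin": "250", "end": "1200"}]}]}\n'
)


def report_rows(jsonl_text):
    """Parse JSON Lines text into (taxname, accession, begin, end) tuples."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        gene = json.loads(line)  # each line is a complete JSON document
        for genomic_range in gene["genomicRanges"]:
            for span in genomic_range["range"]:
                rows.append((
                    gene["taxname"],
                    genomic_range["accessionVersion"],
                    span["begin"],
                    span["end"],
                ))
    return rows


rows = report_rows(report_text)
for row in rows:
    print("\t".join(row))
```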

Useful tip: if you want to extract a list of all fields from any JSON file, you can use the code in the cell below. Notice that the field names are missing a period (".") in the beginning, so don't forget to add that when you use the info from this list.


Modified from here: https://www.fabian-keller.de/blog/5-useful-jq-commands-parse-json-cli/

In [None]:
%%bash

cat 10178777/ncbi_dataset/data/data_report.jsonl | jq 'select(objects)|=[.] 
        | map( paths(scalars) ) 
        | map( map(select(numbers)="[]") 
        | join("."))' | sort | uniq | sed 's/\.\[/\[/g'

### Exercises

Use `dataformat` to extract metadata from another data package. Remember that `dataformat` accepts both zip packages (`--package`) and data reports (`--inputfile`) as inputs, so no need to unzip anything.  

Also, instead of using `%%bash`, please use an exclamation mark `!` in front of the command.

In [None]:
!dataformat 



⬆︎[back to top](#top)

## Reading sequence data<a class="anchor" id="sequence"></a>

We will now download coding region sequences for orthologs in 37 Drosophila species.
We will save the data packages to `datadir`:

In [None]:
datadir = "../data/orthologs"

We won't have time to download all 13,000 orthologs, so we'll limit ourselves to a subset. The parameter `num_gids` sets the number of gene IDs to attempt to download (only about half will have orthologs). Downloading 1000 packages should take about 5 minutes.

In [None]:
num_gids = 1000

In [None]:
%%time
%%bash -s "$datadir" "$num_gids"
mkdir -p $1
head -n $2 dmel_gene-ids.txt | while read GENEID; do 
        echo GID: ${GENEID};
        datasets download ortholog gene-id "${GENEID}" \
            --filename $1/$GENEID.zip \
            --taxon-filter 7215 \
            --include-cds \
            --exclude-gene \
            --exclude-protein \
            --exclude-rna \
            --no-progressbar \
            2> /dev/null \
            || echo "No orthologs found."
        done

In [None]:
!tree $datadir

In [None]:
!du -sh $datadir

In [None]:
!ls $datadir | wc -l

We read the fasta files using BioPython's SeqIO module:

In [None]:
def import_fasta(gene_id, datadir):
    """Read the CDS records for one gene ID directly from its zipped data package."""
    dataset = f"{datadir}/{gene_id}.zip"
    fasta_path = "ncbi_dataset/data/cds.fna"
    with ZipFile(dataset) as zip_file:
        with zip_file.open(fasta_path, "r") as fasta_file:
            # ZipFile.open returns a binary stream; SeqIO expects text
            records = list(SeqIO.parse(io.TextIOWrapper(fasta_file), "fasta"))
    return records

In [None]:
gene_id = 10178781
records = import_fasta(gene_id, datadir)

Let's take a look at the records we've imported:

In [None]:
for rec in records:
    print(rec)

Notice that we have multiple records for some species. We want to take just one per species, so we'll choose the longest.

First we need to extract the species names from the records:

In [None]:
def get_species(record):
    pattern = re.compile(r"\[organism=([A-Za-z\s]+)\]")
    match = re.search(pattern, record.description)
    if match:
        return match.groups()[0]
    else:
        return None

In [None]:
for record in records:
    print(get_species(record))
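
To see what the regular expression does and does not match, here is a small stand-alone sketch that applies the same pattern to plain strings. The headers are made up in the bracketed `key=value` style the pattern expects, and `species_from_description` is a name we introduce for the illustration.

```python
import re

# Same pattern as in get_species: letters and whitespace inside [organism=...].
ORGANISM_PATTERN = re.compile(r"\[organism=([A-Za-z\s]+)\]")


def species_from_description(description):
    """Return the organism name from a FASTA description, or None if absent."""
    match = ORGANISM_PATTERN.search(description)
    return match.group(1) if match else None


# Made-up descriptions for illustration only.
with_tag = "PLACEHOLDER_ID [organism=Drosophila melanogaster] [isoform=A]"
without_tag = "PLACEHOLDER_ID no organism tag here"
odd_name = "PLACEHOLDER_ID [organism=Drosophila sp. 14030]"

for description in (with_tag, without_tag, odd_name):
    print(repr(species_from_description(description)))
```

Note that the character class only allows letters and whitespace, so an organism name containing a period or digits is not matched and yields `None`.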

Now we can group the records by species and take the longest:

In [None]:
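def longest_record_per_species(records):
    # groupby only merges adjacent items, so sort by species first
    by_species = sorted(records, key=lambda r: get_species(r) or "")
    return {
        species: max(recs, key=lambda r: len(r.seq))
        for species, recs in groupby(by_species, key=get_species)
    }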
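
One detail worth knowing about `itertools.groupby`: it only merges *adjacent* items with the same key, so the input should be sorted by that key before grouping. The toy sketch below uses `(species, sequence)` tuples in place of `SeqRecord` objects. The sequences are invented and `longest_per_species` is our own name.

```python
from itertools import groupby

# (species, sequence) pairs standing in for SeqRecord objects; values are made up.
toy_records = [
    ("Drosophila simulans", "ATGCCC"),
    ("Drosophila melanogaster", "ATGAAATTT"),
    ("Drosophila simulans", "ATGCCCGGGTTT"),
]


def longest_per_species(records):
    """Keep the longest sequence for each species."""
    ordered = sorted(records, key=lambda rec: rec[0])  # make equal keys adjacent
    return {
        species: max(group, key=lambda rec: len(rec[1]))[1]
        for species, group in groupby(ordered, key=lambda rec: rec[0])
    }


# Without sorting, the same species shows up as two separate groups.
unsorted_groups = [species for species, _ in groupby(toy_records, key=lambda rec: rec[0])]
print(unsorted_groups)

longest = longest_per_species(toy_records)
print(longest)
```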

In [None]:
dna_records = longest_record_per_species(records)

In [None]:
for spec, rec in dna_records.items():
    print(spec)
    print(rec)
    print()

⬆︎[back to top](#top)

## Making a sequence alignment<a class="anchor" id="alignment"></a>

Now we need to make a sequence alignment. We need it to be codon-aware, so we'll align the protein sequences and then apply that alignment to the DNA sequences, so that gaps always fall between codons.
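
To make the idea concrete, here is a minimal sketch of threading a DNA sequence onto a gapped protein sequence. It only illustrates the principle and is not how BioPython's `codonalign.build` is implemented. The function name `thread_codons` and the toy sequences are ours.

```python
def thread_codons(aligned_protein, dna):
    """Insert a three-base gap into `dna` wherever the aligned protein has '-'.

    Assumes `dna` is in frame and has one codon per non-gap residue.
    """
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    gapped = []
    for residue in aligned_protein:
        # A protein gap becomes a whole-codon gap; otherwise consume the next codon.
        gapped.append("---" if residue == "-" else next(codons))
    return "".join(gapped)


aligned_dna = thread_codons("M-KL", "ATGAAACTG")
print(aligned_dna)  # ATG---AAACTG
```

Because gaps are always three bases wide, every column triplet in the DNA alignment still corresponds to one codon.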

In [None]:
def translate_record(record):
    new_record = copy(record)
    new_record.seq = record.seq.translate()
    return new_record

In [None]:
protein_records = {spec: translate_record(rec) for spec, rec in dna_records.items()}

In [None]:
for spec, rec in protein_records.items():
    print(spec)
    print(rec)
    print()

In [None]:
def align_proteins(protein_records):
    muscle_exe = "../bin/muscle3.8.31_i86linux64"
    with tempfile.NamedTemporaryFile(mode="w+t") as f:
        SeqIO.write(protein_records.values(), f, "fasta")
        f.seek(0)
        muscle_cline = MuscleCommandline(muscle_exe, input=f.name)
        stdout, stderr = muscle_cline()
    protein_aln = AlignIO.read(io.StringIO(stdout), "fasta")
    protein_aln.sort()
    return protein_aln

In [None]:
protein_aln = align_proteins(protein_records)
print(protein_aln[:,50:60])

In [None]:
codon_aln = codonalign.build(protein_aln, sorted(dna_records.values(), key=lambda x: x.id))

In [None]:
print(codon_aln[:,150:180])

⬆︎[back to top](#top)

## Counting substitutions<a class="anchor" id="substitutions"></a>

Now we can use our alignments to count substitutions. In a real application we'd use a sophisticated model to take multiple mutations at the same site or codon into account. Here we'll do a quick version where we count:
- Amino acid substitutions as non-synonymous mutations
- All other single-basepair substitutions as synonymous mutations
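
The arithmetic behind this quick count can be sketched for a single pair of aligned sequences. The example reuses the two short sequences from the genetic code section. `count_mismatches` is our own helper, and it is not a claim about how BioPython computes its substitution matrix.

```python
def count_mismatches(aligned1, aligned2, gap="-"):
    """Count columns where two aligned sequences differ, ignoring gapped columns."""
    return sum(
        a != b
        for a, b in zip(aligned1, aligned2)
        if a != gap and b != gap
    )


# DNA and protein sequences from the earlier example.
dna_diffs = count_mismatches("ACGTTGGCT", "CCGTTGGCA")
protein_diffs = count_mismatches("TLA", "PLA")

nonsyn = protein_diffs            # amino acid changes
syn = dna_diffs - protein_diffs   # remaining base changes
print(nonsyn, syn, nonsyn / syn)
```

A codon with two base changes is counted as one non-synonymous and one synonymous substitution under this scheme, which is one reason real analyses use a more careful model.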

In [None]:
def number_of_substitutions(alignment) -> float:
    sub_matrix = alignment.substitutions
    return sub_matrix.sum() - sub_matrix.diagonal().sum()

In [None]:
total_subs = number_of_substitutions(codon_aln)
nonsyn_subs = number_of_substitutions(protein_aln)
syn_subs = total_subs - nonsyn_subs
dnds = nonsyn_subs / syn_subs

In [None]:
print(total_subs)
print(nonsyn_subs)
print(syn_subs)
print(dnds)

Now we can scale up our analysis and do the same for all the orthologs for several species.

In [None]:
files = !ls {datadir}
gene_ids = [f.split(".")[0] for f in files]

In [None]:
def count_substitutions(gene_id, datadir, species1, species2):
    records = import_fasta(gene_id, datadir)
    longest_records = longest_record_per_species(records)
    if species1 in longest_records and species2 in longest_records:
        dna_records = {
            species1: longest_records[species1],
            species2: longest_records[species2],
        }
    else:
        return None
    protein_records = {spec: translate_record(rec) for spec, rec in dna_records.items()}
    protein_aln = align_proteins(protein_records)
    try:
        codon_aln = codonalign.build(protein_aln,
                                     sorted(dna_records.values(),
                                            key=lambda x: x.id))
    except RuntimeError as e:
        print(e)
        return None
    total_subs = number_of_substitutions(codon_aln)
    nonsyn_subs = number_of_substitutions(protein_aln)
    syn_subs = total_subs - nonsyn_subs
    return nonsyn_subs, syn_subs

In [None]:
substitutions = {}
focal_species = "Drosophila melanogaster"
comparison_species = ["Drosophila pseudoobscura", "Drosophila serrata", "Drosophila simulans", "Drosophila arizonae"]
for comp in comparison_species:
    print(comp)
    substitutions[comp] = {}
    for i, gene_id in enumerate(gene_ids):
        print(i, gene_id)
        subs = count_substitutions(gene_id, datadir, focal_species, comp)
        if subs:
            substitutions[comp][gene_id] = subs

In [None]:
for comp in comparison_species:
    for nonsyn_subs, syn_subs in substitutions[comp].values():
        plt.loglog(syn_subs, nonsyn_subs, '.b', alpha=0.25)
    plt.loglog([1,1000],[1,1000], '--k')
    plt.title(comp)
    plt.show()

⬆︎[back to top](#top)

## Adjusting for codon usage<a class="anchor" id="codons"></a>

We haven't yet adjusted for codon usage. In this section, we'll use the genetic code to figure out how many synonymous and nonsynonymous substitutions we'd expect to see for each gene.

In [None]:
bases = ["A", "C", "G", "T"]  # a list keeps the printed codon order the same on every run
for comb in product(bases, repeat=3):
    s = Seq("".join(comb))
    print(s, "->", s.translate())

In [None]:
codons = (Seq("".join(b)) for b in product(bases, repeat=3))
genetic_code = {
    codon: codon.translate()
    for codon in codons
    if codon.translate() != Seq("*")
}

In [None]:
def count_differences(codon1, codon2):
    return sum(b1 != b2 for b1, b2 in zip(codon1, codon2))

nonsyn_counts = Counter()
syn_counts = Counter()
for codon1, codon2 in combinations(genetic_code, 2):
    if count_differences(codon1, codon2) == 1:
        if genetic_code[codon1] == genetic_code[codon2]:
            syn_counts[codon1] += 1
            syn_counts[codon2] += 1
        else:
            nonsyn_counts[codon1] += 1
            nonsyn_counts[codon2] += 1

In [None]:
for codon in genetic_code:
    print(codon, nonsyn_counts[codon], syn_counts[codon])
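
To check a few of these numbers by hand, the sketch below enumerates the nine single-base neighbors of one codon and classifies them. Like the cell above, it skips neighbors that are stop codons. It uses a hand-written standard genetic code instead of BioPython, and `neighbor_counts` is a name we introduce here.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def neighbor_counts(codon):
    """Return (nonsynonymous, synonymous) counts over single-base neighbors.

    Neighbors that are stop codons are skipped.
    """
    amino_acid = CODON_TABLE[codon]
    nonsyn = syn = 0
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            neighbor = codon[:position] + base + codon[position + 1:]
            neighbor_aa = CODON_TABLE[neighbor]
            if neighbor_aa == "*":
                continue
            if neighbor_aa == amino_acid:
                syn += 1
            else:
                nonsyn += 1
    return nonsyn, syn


for codon in ("GCT", "ATG", "TGG"):
    print(codon, neighbor_counts(codon))
```

Alanine's `GCT` has three synonymous neighbors (all third-position changes), while methionine's `ATG` has none, so sequences rich in such codons have fewer opportunities for synonymous change.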

In [None]:
nonsyn_total = sum(nonsyn_counts.values())
syn_total = sum(syn_counts.values())
print(nonsyn_total / syn_total)

In [None]:
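def expected_dnds(seq, nonsyn_counts, syn_counts):
    nonsyn = 0
    syn = 0
    for i in range(0, len(seq), 3):
        codon = seq[i:i+3]
        # A Counter returns 0 for missing keys, so stop codons and codons
        # with ambiguous bases add nothing to either total.
        nonsyn += nonsyn_counts[codon]
        syn += syn_counts[codon]
    if syn == 0:
        # no synonymous opportunities, so the ratio is undefined
        return None
    return nonsyn / syn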

In [None]:
focal_species = "Drosophila melanogaster"
expectations = {}
for gene_id in gene_ids:
    records = import_fasta(gene_id, datadir)
    longest_records = longest_record_per_species(records)
    if focal_species not in longest_records:
        continue
    seq = longest_records[focal_species].seq
    expectations[gene_id] = expected_dnds(seq, nonsyn_counts, syn_counts)

In [None]:
plt.hist(expectations.values())

In [None]:
omega = {comp: dict() for comp in comparison_species}
for comp in comparison_species:
    for gene_id, (nonsyn_subs, syn_subs) in substitutions[comp].items():
        dnds_exp = expectations[gene_id]
        if dnds_exp is None:
            # no expectation could be computed for this gene
            continue
        dnds_obs = nonsyn_subs / syn_subs
        omega[comp][gene_id] = dnds_obs / dnds_exp
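
The adjustment itself is a ratio of ratios: the observed non-synonymous to synonymous count, divided by the ratio expected from the codon composition. A small worked sketch with made-up numbers follows. `adjusted_omega` is our own helper, and returning `None` for undefined cases is a design choice of this sketch.

```python
def adjusted_omega(nonsyn_subs, syn_subs, expected_ratio):
    """Observed nonsyn/syn ratio divided by the expected ratio.

    Returns None when either ratio is undefined (zero or missing denominator).
    """
    if not syn_subs or not expected_ratio:
        return None
    return (nonsyn_subs / syn_subs) / expected_ratio


# Made-up counts: 10 non-synonymous and 40 synonymous substitutions,
# with 2.5 times as many non-synonymous as synonymous opportunities.
omega_example = adjusted_omega(10, 40, 2.5)
print(omega_example)
```

Here the observed ratio is 0.25, but because non-synonymous changes had 2.5 times as many opportunities, the adjusted value is lower still, pointing more strongly toward purifying selection.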

In [None]:
for comp in comparison_species:
    plt.hist(omega[comp].values(), bins=np.arange(0,2.0,0.1))
    plt.title(comp)
    plt.show()

## Free time

Now you have some time to try things out on your own. Some ideas:
1. Look for the highest/lowest dN/dS orthologs and look them up for function.
2. Compare different species with different levels of divergence.
3. Explore `datasets`

⬆︎[back to top](#top)

## Where to go now<a class="anchor" id="references"></a>
If we've piqued your interest about learning to program for
molecular evolution, here are a few resources to keep learning:

### This workshop on GitHub

To find the materials for this workshop, including (eventually) a runnable version of the notebook,
go to: https://github.com/ncbi/workshop-mol-evol-datasets

### NCBI Datasets

To learn more about the NCBI Datasets tools, you can visit the
[Datasets homepage](https://www.ncbi.nlm.nih.gov/datasets/).
There is lots of information about the web interface,
command line tools, and Python and R libraries.

### Drosophila protein evolution

[Evolution of genes and genomes on the Drosophila phylogeny, *Nature*, 2007.](https://www.nature.com/articles/nature06341)

### Installing Jupyter on your computer

If you're ready to take the plunge and install Jupyter on your
own computer, there are installation instructions on the
[Jupyter project website](https://jupyter.org).


### The BioPython package
To learn what else you can do with BioPython, you can see their
[documentation here](https://biopython.org).
Particularly useful are their
[tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
and [cookbook](https://biopython.org/wiki/Category%3ACookbook).

### Learning Python

There are lots of resources online to help you learn Python in more depth.
One good place to start is the
[Python beginner's guide](https://www.python.org/about/gettingstarted/).


⬆︎[back to top](#top)