# Retrieving Frequency Data by Gene ID

This tutorial shows how to retrieve frequency data by gene ID using the ALFA (ALlele Frequency Aggregator) frequency API:
* [`/interval/{seq_id}:{position}:{length}/overlapping_frequency_records`](https://api.ncbi.nlm.nih.gov/variation/v0/).

This API takes a sequence interval (i.e., sequence ID, interval start position and length), finds all variants overlapping with that interval, and returns their frequency data.
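
To make the shape of the request concrete, here is a minimal sketch that assembles the URL for one interval. `build_interval_url` is a helper name introduced here for illustration, and the interval values are made up; only the URL template comes from the API path above.

```python
def build_interval_url(seq_id: str, position: int, length: int) -> str:
    """Build the overlapping_frequency_records URL for one sequence interval."""
    base = 'https://api.ncbi.nlm.nih.gov/variation/v0'
    return (f'{base}/interval/{seq_id}:{position}:{length}'
            f'/overlapping_frequency_records')


# Illustrative interval only: 100 bases starting at position 1000.
print(build_interval_url('NC_000017.11', 1000, 100))
```

The three path components map directly onto the sequence ID, start position, and length described above.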

To get frequency data for a gene, we therefore break the process into two steps:

1. Determine the chromosome range of this gene.
2. Call the ALFA frequency API using that range to retrieve frequency data.

For this tutorial we use the [TP53 (human tumor protein p53)](https://www.ncbi.nlm.nih.gov/gene/7157) gene as an example. Its gene ID is 7157, which can be found using the [NCBI Gene website](https://www.ncbi.nlm.nih.gov/gene/).

Before writing any code, we need to install the Python modules used in this tutorial.

In [None]:
%pip install -q requests
%pip install -q ratelimit

First, we use the `esummary` service of NCBI's E-utilities to get the gene's location, as shown in the function `get_gene_loc` below:

In [None]:
import requests
from ratelimit import limits
import time
from typing import Tuple

@limits(calls=1, period=1)  # Only one call per second
def get_gene_loc(gene_id: str) -> Tuple[str, int, int]:
    '''
    Return chromosome id, start and stop positions for gene_id
    '''
    esum_url=(f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
              f'esummary.fcgi?db=gene&id={gene_id}&format=json')
    print (f'esummary url: {esum_url}')
    res = requests.get(esum_url)

    if res.status_code != 200:
        raise RuntimeError(
            f'Failed to get gene information (HTTP {res.status_code})')

    data = res.json()

    # First, verify that the result contains (non-empty) location data
    if ('result' not in data or gene_id not in data['result'] or
        not data['result'][gene_id].get('genomicinfo')):
        raise RuntimeError("Genomic information is not available for this gene")

    # Extract and return location data
    loc = data['result'][gene_id]['genomicinfo'][0]
    chraccver = loc['chraccver']
    chrstart = int(loc['chrstart'])
    chrstop = int(loc['chrstop'])
    # If the gene is on the opposite strand of the reference
    # sequence (e.g. TP53), chrstart is larger than chrstop. 
    # We need to swap them to make sure chrstart < chrstop.
    if chrstart > chrstop:
        chrstart, chrstop = chrstop, chrstart
    
    return (chraccver, chrstart, chrstop)

Then, we call `get_gene_loc` to get the gene's chromosome location:

In [None]:
# TP53
gene_id = '7157'

chraccver, chrstart, chrstop = get_gene_loc(gene_id)

print (f'gene id: {gene_id}, chr: {chraccver}, '
       f'start: {chrstart}, stop: {chrstop}.')

Now we are ready to call the ALFA frequency service: 
[`/interval/{seq_id}:{position}:{length}/overlapping_frequency_records`](https://api.ncbi.nlm.nih.gov/variation/v0/)

Due to resource limitations, this API service returns only the first 250 variants per request. If the interval contains more than that, the service includes a warning in the reply and responds with HTTP status code `206`. Thus, for genes with a large number of variants, we have to call the API service repeatedly until we have all the results.

What interval should the subsequent calls use? We only need to move the start position of the range forward, keeping the same stop position. Because the API service returns the first 250 variants *by position*, the new start position is the one right after the largest stop position among the variants returned so far. The function `get_next_interval_start` below computes the new start position for the next interval:

In [None]:
def get_next_interval_start(result: dict) -> int:
    '''
    Return the start position of the next search interval
    '''
    # Collect stop positions of all variants in the result.
    # Each key has the form 'length@start'.
    stops = []
    for k in result.keys():
        length, start = k.split('@')
        stops.append(int(length) + int(start))
    # The next search interval starts just after the last variant's stop position.
    return max(stops) + 1
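
As a worked example of this arithmetic, the sketch below applies the same rule to a few made-up keys in the `length@start` form that `get_next_interval_start` expects. `parse_interval_key` is a hypothetical helper introduced here, and the positions are illustrative rather than real API output.

```python
from typing import Dict, Tuple


def parse_interval_key(key: str) -> Tuple[int, int]:
    """Split a 'length@start' key into (start, length). Hypothetical helper."""
    length, start = key.split('@')
    return int(start), int(length)


# Made-up keys in the 'length@start' form assumed by the tutorial's function.
mock_result: Dict[str, dict] = {'1@100': {}, '3@105': {}, '1@120': {}}

# Stop position of each variant is start + length.
stops = [start + length
         for start, length in map(parse_interval_key, mock_result)]
next_start = max(stops) + 1

print(stops)       # [101, 108, 121]
print(next_start)  # 122
```

The variant with the largest stop position (121) determines where the next request begins, regardless of how long the other variants are.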

Now we have done all the preparation and can go ahead to call the API service. A broad description of the process is as follows:

1. Call the service with the chromosome location of the gene.
2. If the response's status code is `200` (meaning, there are 250 or fewer variants found, and we have all of them in the response), then we are done.
3. If the response's status code is `206` (meaning, there are too many results and only the first 250 variants are returned):
    * first, we save those 250 variants;
    * then, we call the API again using the next interval.
4. Repeat this process until the response's status code is `200`.

The above steps are implemented in the function `get_freq_by_interval` below (which does error checking):

In [None]:
@limits(calls=1, period=1)  # Only one call per second
def get_freq_by_interval(seq_id: str, start: int, stop: int) -> None:
    '''
    Recursively retrieve frequency data from the overlapping_frequency_records
    API service for a given sequence interval.
    '''
    
    api_url = (f'https://api.ncbi.nlm.nih.gov/variation/v0/interval/'
               f'{seq_id}:{start}:{stop - start + 1}'
               f'/overlapping_frequency_records')
    print (api_url)
    res = requests.get(api_url)

    # A global variable that allows for accumulating results from 
    # recursive calls. It must be reset before each external call
    # of get_freq_by_interval
    global coll
    
    # Check status_code to decide what to do next
    if res.status_code == 200:
        # We got all we asked for. Save the result and return.
        coll.update(res.json()['results'])
        return
    elif res.status_code == 206:
        # There are more data than the service can return.
        # We should save the result, and call the service again with
        # the next interval.
        coll.update(res.json()['results'])
        print (f'Accumulated result size: {len(coll)}')
        
        # Delay the call for 1 second to not exceed the rate limit.
        time.sleep(1)
        get_freq_by_interval(seq_id, get_next_interval_start(coll), stop)
    elif res.status_code >= 400:
        raise RuntimeError(
            f'API request returned with error code {res.status_code}\n'
            f'Request: {api_url}\n'
            f'Response: {res.text}')
    else:
        raise RuntimeError(f'Unexpected return code: {res.status_code}')
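
The same steps can also be written as a loop. The sketch below is one possible iterative variant, not the tutorial's implementation: it returns the collected results instead of updating a global, and it avoids deep recursion for genes that need many requests. `collect_freq`, `fetch_page`, and the `fetch` and `delay` parameters are names introduced here for illustration; the next-start rule mirrors `get_next_interval_start`, and the response is assumed to carry a `results` dictionary as in the function above.

```python
import time
from typing import Callable, Dict, Tuple

API_BASE = 'https://api.ncbi.nlm.nih.gov/variation/v0'


def fetch_page(seq_id: str, start: int, length: int) -> Tuple[int, dict]:
    """Request one interval and return (status code, 'results' dict)."""
    import requests  # imported lazily so the loop below can be tried offline
    url = (f'{API_BASE}/interval/{seq_id}:{start}:{length}'
           f'/overlapping_frequency_records')
    res = requests.get(url)
    if res.status_code not in (200, 206):
        raise RuntimeError(f'{url} returned status {res.status_code}')
    return res.status_code, res.json()['results']


def collect_freq(seq_id: str, start: int, stop: int,
                 fetch: Callable[[str, int, int], Tuple[int, dict]] = fetch_page,
                 delay: float = 1.0) -> Dict[str, dict]:
    """Request successive intervals in [start, stop] until the status is 200."""
    collected: Dict[str, dict] = {}
    while start <= stop:
        status, results = fetch(seq_id, start, stop - start + 1)
        collected.update(results)
        if status == 200 or not results:
            break  # complete answer, or nothing left to advance past
        # Same rule as get_next_interval_start, applied to this page:
        # keys look like 'length@start', so stop = start + length.
        ends = [int(length) + int(pos)
                for length, pos in (key.split('@') for key in results)]
        start = max(ends) + 1
        time.sleep(delay)  # stay under one request per second
    return collected
```

Passing a stub as `fetch` lets the paging logic be checked without network access. With the default, a call would look like `coll = collect_freq(chraccver, chrstart, chrstop)`.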


Finally, we call the function `get_freq_by_interval` to get *all* the frequency data of this gene:

In [None]:
# Collect results from get_freq_by_interval
coll = {}
get_freq_by_interval(chraccver, chrstart, chrstop)
    
print (f'Final result: {len(coll)}')
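
As a quick sanity check on the final collection, one could confirm that every returned variant actually touches the gene's range. The sketch below assumes the `length@start` key format used by `get_next_interval_start`. `keys_outside_range` is a hypothetical helper, the comparison is deliberately lenient at the boundaries because the exact coordinate convention is not stated here, and the data is made up.

```python
from typing import Dict, List


def keys_outside_range(results: Dict[str, dict],
                       start: int, stop: int) -> List[str]:
    """Return keys whose interval does not touch [start, stop]."""
    outside = []
    for key in results:
        length, pos = key.split('@')
        first, end = int(pos), int(pos) + int(length)
        # Lenient test: only flag intervals entirely before or after the range.
        if end < start or first > stop:
            outside.append(key)
    return outside


# Made-up collection; the last key lies beyond the range 90..200.
mock_coll = {'1@100': {}, '3@105': {}, '1@300': {}}
print(keys_outside_range(mock_coll, 90, 200))  # ['1@300']
```

With real data the call would be `keys_outside_range(coll, chrstart, chrstop)`, which one would expect to return an empty list if the assumption about the key format holds.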