# International Mouse Phenotyping Consortium (IMPC) Data API Workshop Cheat Sheet
This is the cheat sheet for the IMPC Data API Workshop.

For more information about IMPC visit our [website](https://www.mousephenotype.org/).
Other useful links:
- Workshop [repository](https://github.com/mpi2/impc-data-api-workshop/tree/main) with all materials
- IMPC Solr cores [documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/)
- International Mouse Phenotyping Resource of Standardised Screens | [IMPReSS](https://www.mousephenotype.org/impress/index)
- The Genome Targeting Repository | [GenTaR](https://www.gentar.org/tracker/#/)

Useful Phenodigm links:
- [IMPC disease models summary](https://www.mousephenotype.org/help/data-visualization/gene-pages/disease-models/)
- [IMPC disease associations](https://www.mousephenotype.org/help/data-analysis/disease-associations/)
- [Disease Models Portal](https://diseasemodels.research.its.qmul.ac.uk)
- [OMIM](https://www.omim.org/)
- [Orphanet](https://www.orpha.net)
- [SOLR Phenodigm core documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/phenodigm.html)
# Set up
Let's start! First, we need to import the Python libraries and set up the helper functions.

Execute the cell below:
1. Select the cell by clicking into it.
2. Run the code by pressing the ▷ play button above.
3. Alternatively, use the Ctrl + Enter hotkey to run the code.

### Helper functions
1. `solr_request`: performs a single Solr request.
2. `batch_request`: calls `solr_request` multiple times with `params` to retrieve the results in chunks of `batch_size` rows.
3. `facet_request`: performs a faceting Solr request.
4. `entity_iterator`: generator function that fetches results from the Solr server in chunks using pagination.
5. `iterator_solr_request`: fetches results in batches from the Solr API and writes them to a file.

In [1]:
import csv
import json
from urllib.parse import unquote

import pandas as pd
import requests
from IPython.display import display
from tqdm import tqdm

# Display at most 15 rows and all columns of a dataframe
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', None)

# Create helper function
def solr_request(core, params, silent=False):
    """Performs a single Solr request.

    Args:
        core: Name of the Solr core to query, for example 'statistical-result'.
        params: Dictionary of Solr query parameters (`q`, `fl`, `rows`, ...).
        silent: Suppress displaying the df and number of results (useful for batch requests).

    Returns:
        num_found: How many rows in total did the request match.
        df: A Pandas dataframe with a portion of the request matching `start` and `rows` parameters.
    """
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    response = requests.get(solr_url, params=params)
    if not silent:
        print(f"\nYour request:\n{response.request.url}\n")
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        num_found = data["response"]["numFound"]
        if not silent:
            print(f'Number of found documents: {num_found}\n')
        # Extract and add search results to the list
        search_results = []
        for doc in data["response"]["docs"]:
            search_results.append(doc)
    
        # Convert the list of dictionaries into a DataFrame and print the DataFrame
        df = pd.DataFrame(search_results)
        if not silent:
            display(df)
        return num_found, df
    
    else:
        print("Error:", response.status_code, response.text)

def batch_request(core, params, batch_size):
    """Calls `solr_request` multiple times with `params` to retrieve results in chunk `batch_size` rows at a time."""
    if "rows" in params:
        print("WARN: You have specified the `params` -> `rows` value. It will be ignored, because the data is retrieved `batch_size` rows at a time.")
    # Determine the total number of rows. Note that we do not request any data (rows = 0).
    num_results, _ = solr_request(core=core, params={**params, "start": 0, "rows": 0}, silent=True)
    # Initialise everything for data retrieval.
    start = 0
    chunks = []
    # Request chunks until we have complete data.
    with tqdm(total=num_results) as pbar:  # Initialize tqdm progress bar.
        while start < num_results:
            # Request chunk. We don't need num_results anymore because it does not change.
            _, df_chunk = solr_request(core=core, params={**params, "start": start, "rows": batch_size}, silent=True)
            # Record chunk.
            chunks.append(df_chunk)
            # Update progress bar with the number of rows actually received,
            # so that the bar never runs past the total.
            pbar.update(len(df_chunk))
            # Increment start.
            start += batch_size
    # Prepare final dataframe.
    return pd.concat(chunks, ignore_index=True)

def facet_request(core, params, silent=False):
    """Performs a faceting Solr request.

    Args:
        core: Name of the Solr core to query, for example 'statistical-result'.
        params: Dictionary of Solr query parameters. Must include `facet.field`.
        silent: Suppress displaying the df and number of results.

    Returns:
        num_found: How many rows in total did the request match.
        df: A Pandas dataframe with one row per facet value and its count.
    """
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    response = requests.get(solr_url, params=params)
    if not silent:
        print(f"\nYour request:\n{unquote(response.request.url)}\n")
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        num_found = data["response"]["numFound"]
        if not silent:
            print(f'Number of found documents: {num_found}\n')
        # Extract and add faceting query results to the list
        facet_counts = data["facet_counts"]["facet_fields"][params["facet.field"]]
        # Initialize an empty dictionary
        faceting_dict = {}
        # Iterate over the list, taking pairs of elements
        for i in range(0, len(facet_counts), 2):
            # Assign label as key and count as value
            label = facet_counts[i]
            count = facet_counts[i + 1]
            faceting_dict[label] = [count]
        
        # Convert the dictionary into a DataFrame with one row per facet value
        df = pd.DataFrame.from_dict(faceting_dict, orient='index', columns=['counts']).reset_index()

        # Rename the columns
        df.columns = [params["facet.field"], 'count_per_category']
        if not silent:
            display(df)
        return num_found, df
    
    else:
        print("Error:", response.status_code, response.text)

# Helper function to fetch results. This function is used by the 'iterator_solr_request' function.
def entity_iterator(base_url, params):
    """Generator function to fetch results from the SOLR server in chunks using pagination

    Args:
        base_url (str): The base URL of the Solr server to fetch documents from.
        params (dict): A dictionary of parameters to include in the GET request. Must include
                       'start' and 'rows' keys, which represent the index of the first document
                       to fetch and the number of documents to fetch per request, respectively.

    Yields:
        dict: The next document in the response from the Solr server.
    """
    # Initialise variable to check the first request
    first_request = True

    # Call the API in chunks and yield the documents in each chunk
    while True:
        response = requests.get(base_url, params=params)
        data = response.json()
        docs = data["response"]["docs"]

        # Print the first request only
        if first_request:
            print(f'Your first request: {response.url}')
            first_request = False

        # Yield the documents in the current chunk
        for doc in docs:
            yield doc

        # Check if there are more results to fetch
        start = params["start"] + params["rows"]
        num_found = data["response"]["numFound"]
        if start >= num_found:
            break

        # Update the start parameter for the next request
        params["start"] = start

    # Print last request and total number of documents retrieved
    print(f'Your last request: {response.url}')
    print(f'Number of found documents: {data["response"]["numFound"]}\n')

# Function to iterate over field list and write results to a file.
def iterator_solr_request(core, params, filename='iteration_solr_request', format='json'):
    """Function to fetch results in batches from the Solr API and write them to a file
        Defaults to fetching 5000 rows at a time.

    Args:
        core (str): The name of the Solr core to fetch results from.
        params (dict): A dictionary of parameters to use in the filter query. Must include
                       'field_list' and 'field_type' keys, which represent the list of field values
                       to fetch (e.g. a list of MGI model identifiers) and the field to filter on
                       (e.g. model_id), respectively.
        filename (str): The name of the file to write the results to. Defaults to 'iteration_solr_request'.
        format (str): The format of the output file. Can be 'csv' or 'json'. Defaults to 'json'.
    """

    # Validate format
    if format not in ['json','csv']:
        raise ValueError("Invalid format. Please use 'json' or 'csv'")
    
    # Base URL
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    # Extract field_list and field_type from params
    field_list = params.pop("field_list")
    field_type = params.pop("field_type")

    # Construct the filter query with grouped model IDs
    fq = "{}:({})".format(
        field_type, " OR ".join(['"{}"'.format(id) for id in field_list])
    )

    # Show users the field and field values they passed to the function
    print("Queried field:",fq)
    # Set internal params the users should not change
    params["fq"] = fq
    params["wt"] = 'json'
    params["start"]=0 # Start at the first result
    params["rows"]=5000 # Fetch results in chunks of 5000


    try:
        # Fetch results using a generator function
        results_generator = entity_iterator(solr_url, params)
    except Exception as e:
        raise Exception("An error occurred while downloading the data: " + str(e))

    # Append extension to the filename
    filename = f"{filename}.{format}"

    try:
        # Open the file in write mode
        with open(filename, "w", newline="") as f:
            if format == 'csv':
                writer = None
                for item in results_generator:
                    # Initialize the CSV writer with the keys of the first item as the field names
                    if writer is None:
                        writer = csv.DictWriter(f, fieldnames=item.keys())
                        writer.writeheader()
                    # Write the item to the CSV file
                    writer.writerow(item)
            elif format == 'json':
                # Write to JSON without loading everything into memory.
                f.write('[')
                for i, item in enumerate(results_generator):
                    if i != 0:
                        f.write(',')
                    json.dump(item, f)
                f.write(']')
    except Exception as e:
        raise Exception("An error occurred while writing the file: " + str(e))

    print(f"File {filename} was created.")

### 1. How to request a limited number of documents?
Specify the `rows` parameter. In this example we request 3 documents.

In [2]:
num_found, df = solr_request(
    core='statistical-result',
    params={
        'q': '*:*', # Request all records.
        'rows': 3   # Request the first three rows
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=%2A%3A%2A&rows=3

Number of found documents: 4070676



Unnamed: 0,allele_accession_id,allele_name,allele_symbol,colony_id,data_type,doc_id,female_control_count,female_mutant_count,genetic_background,intermediate_mp_term_id,intermediate_mp_term_name,life_stage_acc,life_stage_name,male_control_count,male_mutant_count,marker_accession_id,marker_symbol,metadata,metadata_group,mp_term_id_options,mp_term_name_options,parameter_name,parameter_stable_id,parameter_stable_key,phenotyping_center,pipeline_name,pipeline_stable_id,pipeline_stable_key,procedure_name,procedure_stable_id,procedure_stable_key,production_center,project_name,resource_fullname,resource_name,significant,status,strain_accession_id,strain_name,top_level_mp_term_id,top_level_mp_term_name,zygosity,_version_,effect_size,p_value,statistical_method,classification_tag,female_control_mean,female_effect_size_low_normal_vs_high,female_effect_size_low_vs_normal_high,female_mutant_mean,female_pvalue_low_normal_vs_high,female_pvalue_low_vs_normal_high,genotype_effect_size_low_normal_vs_high,genotype_effect_size_low_vs_normal_high,genotype_pvalue_low_normal_vs_high,genotype_pvalue_low_vs_normal_high,male_control_mean,male_effect_size_low_normal_vs_high,male_effect_size_low_vs_normal_high,male_mutant_mean,male_pvalue_low_normal_vs_high,male_pvalue_low_vs_normal_high,phenotype_sex
0,MGI:5609345,"targeted mutation 1b, Helmholtz Zentrum Muench...",Lypla1<tm1b(EUCOMM)Hmgu>,LYPAB,categorical,45497a3b9c8e1012c85154e64a630ea3,932.0,5.0,involves: C57BL/6N,"[MP:0004508, MP:0005508, MP:0009250]","[abnormal pectoral girdle bone morphology, abn...",IMPCLS:0005,Early adult,917.0,5.0,MGI:1344588,Lypla1,[Date Scanner equipment last calibrated = 2012...,de98e6e5ea27114612ddffb277851f6b,[MP:0000149],[abnormal scapula morphology],Scapulae,IMPC_XRY_006_001,[2082],BCM,BCM Pipeline,BCM_001,16,X-ray,[IMPC_XRY_001],[91],BCM,[BaSH],IMPC,IMPC,False,NotProcessed,MGI:2159965,C57BL/6N,[MP:0005390],[skeleton phenotype],homozygote,1797159908006690842,,,,,,,,,,,,,,,,,,,,,
1,MGI:6399913,"endonuclease-mediated mutation 1, The Centre f...",Osbpl8<em1(IMPC)Tcp>,TCPR1519_AEKV,adult-gross-path,9a533ea2c0b3f78ca3ede1d6ed61f2d9,,,involves: C57BL/6NCrl,,,IMPCLS:0005,Early adult,,,MGI:2443807,Osbpl8,[Date of sacrifice = 2021-05-11T00:00:00Z|Equi...,d41d8cd98f00b204e9800998ecf8427e,,,Epididymis,IMPC_PAT_025_002,[37087],TCP,TCP Pipeline,TCP_001,9,Gross Pathology and Tissue Collection,[IMPC_PAT_002],[777],TCP,[DTCC],IMPC,IMPC,False,Successful,MGI:2683688,C57BL/6NCrl,,,homozygote,1797159908006690820,0.0,1.0,Supplied as data,,,,,,,,,,,,,,,,,,
2,MGI:5637089,"targeted mutation 1b, Wellcome Trust Sanger In...",Raph1<tm1b(EUCOMM)Wtsi>,MUFZ,unidimensional,eb6c509bb88f29d6327bccaa1b3a3bd7,397.0,5.0,involves: C57BL/6N,"[MP:0001963, MP:0003878, MP:0004738, MP:0006335]","[abnormal hearing electrophysiology, abnormal ...",IMPCLS:0005,Early adult,409.0,1.0,MGI:1924550,Raph1,[Anesthetic administration route = Intraperito...,59ce9671ec21ce59375623c5c2626a94,"[MP:0004738, MP:0011967, MP:0011968]","[abnormal auditory brainstem response, increas...",12kHz-evoked ABR Threshold,IMPC_ABR_006_001,[18930],WTSI,MGP Select Pipeline,MGP_001,15,Auditory Brain Stem Response,[IMPC_ABR_001],[547],WTSI,[BaSH],IMPC,IMPC,False,Successful,MGI:2159965,C57BL/6N,[MP:0005377],[hearing/vestibular/ear phenotype],homozygote,1797159908924194829,0.154034,0.066499,Reference Range Plus Test framework; quantile ...,Not significant,17.732997,0.066499,0.128463,20.0,0.066499,0.128463,0.027709,0.141439,0.594283,1.0,17.432763,0.026895,0.154034,15.0,1.0,1.0,"[male, female]"


### 2. How to request specific fields?
Specify the `fl` parameter. In this example we request the following fields:
- `marker_symbol`
- `top_level_mp_term_name`
- `effect_size`
- `p_value`
<br>

**Warning:** if you misspell a field name, no error is raised and the field is silently omitted from the result.

In [3]:
num_found, df = solr_request(
    core='statistical-result',
    params={
        'q': '*:*', # Request all records.
        'fl': 'marker_symbol,top_level_mp_term_name,effect_size,p_value',
        'rows': 3   # Request the first three rows
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=%2A%3A%2A&fl=marker_symbol%2Ctop_level_mp_term_name%2Ceffect_size%2Cp_value&rows=3

Number of found documents: 4070676



Unnamed: 0,marker_symbol,top_level_mp_term_name,effect_size,p_value
0,Lypla1,[skeleton phenotype],,
1,Osbpl8,,0.0,1.0
2,Raph1,[hearing/vestibular/ear phenotype],0.154034,0.066499
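
Because a misspelled field fails silently, it can help to compare the requested field list with the columns that came back. The sketch below uses a hypothetical helper, `missing_fields`, which is not part of the workshop code. It only inspects column names, so it also reports fields that are spelled correctly but have no value in any returned row, and it assumes a plain comma-separated `fl` without function aliases.

```python
def missing_fields(fl, columns):
    """Return requested fields that are absent from the returned columns.

    `fl` is the comma-separated string passed as the Solr `fl` parameter;
    `columns` is any iterable of column names, for example `df.columns`.
    """
    requested = [name.strip() for name in fl.split(",") if name.strip()]
    present = set(columns)
    return [name for name in requested if name not in present]


# 'marker_simbol' is a deliberate misspelling used for illustration.
fl = "marker_simbol,top_level_mp_term_name,effect_size,p_value"
returned_columns = ["top_level_mp_term_name", "effect_size", "p_value"]
print(missing_fields(fl, returned_columns))  # ['marker_simbol']
```

With a dataframe returned by `solr_request`, the call would be `missing_fields(params['fl'], df.columns)`.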


### 3. How to filter by a specific field?
Specify *field:value* in the `q` parameter. For example, let's select only the records for *Gprc6a*.

In [4]:
num_found, df = solr_request(
    core='statistical-result',
    params={
        'q': 'marker_symbol:Gprc6a', # Request Gprc6a records.
        'fl': 'marker_symbol,top_level_mp_term_name,effect_size,p_value',
        'rows': 3   # Request the first three rows
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=marker_symbol%3AGprc6a&fl=marker_symbol%2Ctop_level_mp_term_name%2Ceffect_size%2Cp_value&rows=3

Number of found documents: 702



Unnamed: 0,marker_symbol,top_level_mp_term_name
0,Gprc6a,[behavior/neurological phenotype]
1,Gprc6a,
2,Gprc6a,[homeostasis/metabolism phenotype]
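
Values that contain spaces, or characters with a special meaning in the query syntax such as the colon in `MGI:109331`, are wrapped in double quotes in the later examples of this cheat sheet. A minimal sketch of a helper that builds such a filter follows; the name `field_query` is hypothetical and not part of the workshop code.

```python
def field_query(field, value):
    """Build a `field:"value"` filter, quoting the value as a phrase."""
    # Escape backslashes and double quotes so they stay inside the phrase.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


print(field_query("marker_symbol", "Gprc6a"))
print(field_query("marker_id", "MGI:109331"))
print(field_query("life_stage_name", "Late adult"))
```

The returned string can be passed as the `q` value of `solr_request`.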


### 4. How to search in a range of numbers?
Use the following syntax:
- `field:[* TO 100]`: finds all field values less than or equal to 100.
- `field:[100 TO *]`: finds all field values greater than or equal to 100.
- `field:[* TO *]`: finds any document with a value for the field (between the effective values of -Infinity and +Infinity for that field type).
- `field:[0 TO 100]`: finds all field values greater than or equal to 0 and less than or equal to 100.
<br>

An asterisk * may be used for either or both endpoints to specify an open-ended range query.

In [5]:
num_found, df = solr_request(
    core='statistical-result',
    params={
        'q': 'p_value:[0 TO 1e-4]', # Request p-value from 0 to 1e-4.
        'fl': 'marker_symbol,top_level_mp_term_name,effect_size,p_value',
        'rows': 3   # Request the first three rows
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=p_value%3A%5B0+TO+1e-4%5D&fl=marker_symbol%2Ctop_level_mp_term_name%2Ceffect_size%2Cp_value&rows=3

Number of found documents: 45387



Unnamed: 0,effect_size,marker_symbol,p_value,top_level_mp_term_name
0,2.715391,Tbx22,7.228019e-08,[behavior/neurological phenotype]
1,-0.973679,Avpr1a,2.229254e-06,[behavior/neurological phenotype]
2,1.962857,Trarg1,1.889197e-09,[behavior/neurological phenotype]
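
The range syntax can be generated from Python values. The following is a small sketch with a hypothetical helper, `range_query`, that is not part of the workshop code; it treats `None` as an open end and always produces inclusive square-bracket ranges. Bounds are inserted as given, so a string such as `"1e-4"` is passed through unchanged.

```python
def range_query(field, low=None, high=None):
    """Build an inclusive range filter; None means an open end (*)."""
    low_text = "*" if low is None else str(low)
    high_text = "*" if high is None else str(high)
    return f"{field}:[{low_text} TO {high_text}]"


print(range_query("p_value", 0, "1e-4"))  # p_value:[0 TO 1e-4]
print(range_query("p_value", high=100))   # p_value:[* TO 100]
print(range_query("p_value"))             # p_value:[* TO *]
```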


### 5. How to apply multiple filters together?
Search filters can be combined using logical operators.
- To match both conditions, specify `filter1 AND filter2`.
- To match either condition, specify `filter1 OR filter2`.

For example, to find documents with marker symbol Gprc6a and a p-value of at most 1e-4:

In [6]:
num_found, df = solr_request(
    core='statistical-result',
    params={
        'q': 'marker_symbol:Gprc6a AND p_value:[0 TO 1e-4]', # Request Gprc6a records with a p-value from 0 to 1e-4.
        'fl': 'marker_symbol,top_level_mp_term_name,effect_size,p_value',
        'rows': 3   # Request the first three rows
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=marker_symbol%3AGprc6a+AND+p_value%3A%5B0+TO+1e-4%5D&fl=marker_symbol%2Ctop_level_mp_term_name%2Ceffect_size%2Cp_value&rows=3

Number of found documents: 3



Unnamed: 0,effect_size,marker_symbol,p_value,top_level_mp_term_name
0,1.0,Gprc6a,0.0,"[cardiovascular system phenotype, growth/size/..."
1,2.307854,Gprc6a,2.4e-05,"[immune system phenotype, endocrine/exocrine g..."
2,1.253108,Gprc6a,8.8e-05,[hematopoietic system phenotype]
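
When `AND` and `OR` are mixed in one query, parentheses make the grouping explicit. The sketch below uses a hypothetical helper, `combine`, that is not part of the workshop code; it wraps every filter in parentheses before joining, so nested combinations keep their intended grouping. The second gene symbol, Nxn, is only used here as an illustration.

```python
def combine(operator, *filters):
    """Join filters with AND or OR, parenthesising each one."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f"({item})" for item in filters)


gene = "marker_symbol:Gprc6a"
significant = "p_value:[0 TO 1e-4]"
print(combine("AND", gene, significant))
# (marker_symbol:Gprc6a) AND (p_value:[0 TO 1e-4])

# Either of two genes, restricted to small p-values.
either_gene = combine("OR", "marker_symbol:Gprc6a", "marker_symbol:Nxn")
print(combine("AND", either_gene, significant))
```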


### 6. How to exclude records with a filter?
A filter can be inverted with the `NOT` operator. For example, to request all records except those for Gprc6a:

In [7]:
num_found, df = solr_request(
    core='statistical-result',
    params={
        'q': 'NOT marker_symbol:Gprc6a', # Request all records except Gprc6a.
        'fl': 'marker_symbol,top_level_mp_term_name,effect_size,p_value',
        'rows': 3   # Request the first three rows
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=NOT+marker_symbol%3AGprc6a&fl=marker_symbol%2Ctop_level_mp_term_name%2Ceffect_size%2Cp_value&rows=3

Number of found documents: 4069974



Unnamed: 0,marker_symbol,top_level_mp_term_name,effect_size,p_value
0,Lypla1,[skeleton phenotype],,
1,Osbpl8,,0.0,1.0
2,Raph1,[hearing/vestibular/ear phenotype],0.154034,0.066499


### 7. How to deal with null values?
You can exclude documents where a field is null by applying the open-ended range filter we have seen before: `field:[* TO *]`.

In [8]:
num_found, df = solr_request(
    core='statistical-result',
    params={
        'q': 'p_value:[* TO *]', # Request records with a non-null p-value.
        'fl': 'marker_symbol,top_level_mp_term_name,effect_size,p_value',
        'rows': 3   # Request the first three rows
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=p_value%3A%5B%2A+TO+%2A%5D&fl=marker_symbol%2Ctop_level_mp_term_name%2Ceffect_size%2Cp_value&rows=3

Number of found documents: 2068440



Unnamed: 0,effect_size,marker_symbol,p_value,top_level_mp_term_name
0,0.0,Osbpl8,1.0,
1,0.154034,Raph1,0.066499,[hearing/vestibular/ear phenotype]
2,0.000929,Ccdc50,1.0,[behavior/neurological phenotype]
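
The same separation can be done after the data has been retrieved. The sketch below assumes that a document with no value for a field simply lacks that key, which is how such gaps would end up as empty cells in the dataframe. The three documents are abridged from the first rows shown earlier in this cheat sheet, and `split_by_field` is a hypothetical helper.

```python
# Abridged documents; Lypla1 has no p_value in the output shown earlier.
docs = [
    {"marker_symbol": "Lypla1"},
    {"marker_symbol": "Osbpl8", "effect_size": 0.0, "p_value": 1.0},
    {"marker_symbol": "Raph1", "effect_size": 0.154034, "p_value": 0.066499},
]


def split_by_field(docs, field):
    """Split documents into those with and those without a value for `field`."""
    with_value = [doc for doc in docs if doc.get(field) is not None]
    without_value = [doc for doc in docs if doc.get(field) is None]
    return with_value, without_value


with_p, without_p = split_by_field(docs, "p_value")
print([doc["marker_symbol"] for doc in with_p])     # ['Osbpl8', 'Raph1']
print([doc["marker_symbol"] for doc in without_p])  # ['Lypla1']
```

With a dataframe returned by `solr_request`, `df.dropna(subset=['p_value'])` keeps only the rows that have a p-value.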


### 8. How to download the data?
Use the `batch_request` function to download the data. It retrieves the results in several chunks; the `batch_size` parameter sets how many rows each chunk contains. Then call either `to_json` or `to_csv` on the returned dataframe to save the data as JSON or CSV.
<br>
You can pass any query to the `batch_request` function.
<br>
In this example we download late adult data for the cardiovascular system.

In [9]:
df = batch_request(
    core='statistical-result',
    params={
        'q': 'top_level_mp_term_name:"cardiovascular system phenotype" AND life_stage_name:"Late adult"',
        'fl': 'allele_accession_id,life_stage_name,marker_symbol,p_value,parameter_name,parameter_stable_id,phenotyping_center,statistical_method,top_level_mp_term_name',
    },
    batch_size=1000
)

10000it [00:09, 1025.75it/s]                                                                                                                     


In [10]:
# Save dataframe to JSON for subsequent work. With orient="records" this writes a single JSON array with one object per row.
df.to_json("impc_data.json", orient="records")

In [11]:
# We can also save as CSV, but note that fine structure such as lists and nested data will be lost.
df.to_csv("impc_data.csv", index=False)
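
To illustrate what "lost" means for list-valued fields, here is a small round trip that works in memory and does not contact the API. The one-row dataframe is made up from values shown earlier in this cheat sheet. JSON keeps the list, while CSV stores its text representation, which comes back as a plain string.

```python
import ast
import io
import json

import pandas as pd

# One-row example with a list-valued column, as returned for multi-valued fields.
df = pd.DataFrame({
    "marker_symbol": ["Lypla1"],
    "top_level_mp_term_name": [["skeleton phenotype"]],
})

# JSON round trip: the list survives.
json_text = df.to_json(orient="records")
from_json = json.loads(json_text)
print(type(from_json[0]["top_level_mp_term_name"]))  # <class 'list'>

# CSV round trip: the list is written as text and read back as a string.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))
cell = from_csv.loc[0, "top_level_mp_term_name"]
print(type(cell))  # <class 'str'>

# The string can be parsed back into a list if needed.
recovered = ast.literal_eval(cell)
print(recovered)  # ['skeleton phenotype']
```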

### 9. How to run a faceting query?
Use the `facet_request` function to count how many documents fall into each category of a field. Parameters:
- `rows`: set to `0`, because only the counts are needed, not the documents.
- `facet`: set to `on` to enable faceting.
- `facet.field`: specifies the field to facet on.
- `facet.limit`: specifies the maximum number of facet values returned for the field.
- `facet.mincount`: specifies the minimum count a facet value needs to be included in the response.

In [12]:
num_found, df = facet_request(
    core='statistical-result',
    params={
        'q': '*:*',
        'rows': 0,
        'facet': 'on',
        'facet.field': 'zygosity',
        'facet.limit': 15,
        'facet.mincount': 1
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=*:*&rows=0&facet=on&facet.field=zygosity&facet.limit=15&facet.mincount=1

Number of found documents: 4070676



Unnamed: 0,zygosity,count_per_category
0,homozygote,2520286
1,heterozygote,1435388
2,hemizygote,101488
3,wildtype,13514
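
The facet counts arrive as a flat list that alternates label and count, which is why `facet_request` walks the list two items at a time. The sketch below shows an equivalent, more compact way to pair them up, using the zygosity values from the output above rather than a live request.

```python
# Flat label/count list, values copied from the zygosity output above.
facet_counts = [
    "homozygote", 2520286,
    "heterozygote", 1435388,
    "hemizygote", 101488,
    "wildtype", 13514,
]

labels = facet_counts[0::2]  # every second item, starting at the first label
counts = facet_counts[1::2]  # every second item, starting at the first count
per_category = dict(zip(labels, counts))

print(per_category["homozygote"])  # 2520286
print(sum(per_category.values()))  # 4070676, the number of found documents
```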


### 10. How to iterate over a list of genes?
`iterator_solr_request` iterates over a list of values for one field. Here the field is `marker_symbol` and the values are Zfp580, Firrm, Gpld1 and Prkdc.

You can download this data in `csv` or `json`.

In [13]:
# Genes example
genes = ["Zfp580","Firrm","Gpld1","Prkdc"]

# Initial query parameters
params = {
    'q': "*:*",
    'fl': 'marker_symbol,allele_symbol,parameter_stable_id',
    'field_list': genes,
    'field_type': "marker_symbol"
}

iterator_solr_request(
    core='statistical-result',
    params=params,
    filename='marker_symbol',
    format='json')

Queried field: marker_symbol:("Zfp580" OR "Firrm" OR "Gpld1" OR "Prkdc")
Your first request: https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=%2A%3A%2A&fl=marker_symbol%2Callele_symbol%2Cparameter_stable_id&fq=marker_symbol%3A%28%22Zfp580%22+OR+%22Firrm%22+OR+%22Gpld1%22+OR+%22Prkdc%22%29&wt=json&start=0&rows=5000
Your last request: https://www.ebi.ac.uk/mi/impc/solr/statistical-result/select?q=%2A%3A%2A&fl=marker_symbol%2Callele_symbol%2Cparameter_stable_id&fq=marker_symbol%3A%28%22Zfp580%22+OR+%22Firrm%22+OR+%22Gpld1%22+OR+%22Prkdc%22%29&wt=json&start=0&rows=5000
Number of found documents: 2052

File marker_symbol.json was created.
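
The "Queried field" line above is the filter query that `iterator_solr_request` builds from `field_type` and `field_list`. The standalone sketch below reproduces that step so the resulting string can be inspected without sending a request; the function name `build_filter_query` is hypothetical.

```python
def build_filter_query(field_type, field_list):
    """Quote each value and join the values with OR inside one field filter."""
    quoted = ['"{}"'.format(value) for value in field_list]
    return "{}:({})".format(field_type, " OR ".join(quoted))


genes = ["Zfp580", "Firrm", "Gpld1", "Prkdc"]
print(build_filter_query("marker_symbol", genes))
# marker_symbol:("Zfp580" OR "Firrm" OR "Gpld1" OR "Prkdc")
```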


### 11. How to iterate over a list of models?

In [14]:
# List of model IDs.
models = ["MGI:3587188","MGI:3587185","MGI:3605874","MGI:2668213"]

# Call iterator function
iterator_solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary',
        'fl': 'model_id,marker_id,disease_id',
        'field_list': models,
        'field_type': 'model_id'
    },
    filename='model_ids',
    format='csv')

Queried field: model_id:("MGI:3587188" OR "MGI:3587185" OR "MGI:3605874" OR "MGI:2668213")
Your first request: https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary&fl=model_id%2Cmarker_id%2Cdisease_id&fq=model_id%3A%28%22MGI%3A3587188%22+OR+%22MGI%3A3587185%22+OR+%22MGI%3A3605874%22+OR+%22MGI%3A2668213%22%29&wt=json&start=0&rows=5000
Your last request: https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary&fl=model_id%2Cmarker_id%2Cdisease_id&fq=model_id%3A%28%22MGI%3A3587188%22+OR+%22MGI%3A3587185%22+OR+%22MGI%3A3605874%22+OR+%22MGI%3A2668213%22%29&wt=json&start=5000&rows=5000
Number of found documents: 7675

File model_ids.csv was created.
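
`entity_iterator` keeps requesting pages of `rows` documents until `start` reaches the number of found documents. The sketch below computes the `start` values that this loop goes through (the helper name `page_starts` is hypothetical). It shows why the last request for the models has `start=5000`, while the gene example needed a single request.

```python
def page_starts(num_found, rows):
    """Return the `start` offsets requested for `num_found` documents."""
    starts = list(range(0, num_found, rows))
    # The iterator always sends at least one request, even for zero matches.
    return starts or [0]


print(page_starts(7675, 5000))  # [0, 5000]  (model example)
print(page_starts(2052, 5000))  # [0]        (gene example)
```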


### 12. How to calculate the Phenodigm score?
1. Retrieve all diseases related to the mouse gene *[Nxn](https://www.mousephenotype.org/data/genes/MGI:109331)* (MGI:109331)
2. Select the following fields:
    - `marker_id`
    - `model_id`
    - `disease_id`
    - `disease_term`
    - `disease_model_avg_norm`
    - `disease_model_max_norm`
    - `association_curated`
3. [**Optional**] To keep only curated gene-disease associations, add a filter that sets `association_curated` to `true`; to keep only predictions, set it to `false`.
4. Calculate the **Phenodigm score** of each association by adding the following as the last entry of the `fl` parameter:
    -  `phenodigm_score:div(sum(disease_model_avg_norm,disease_model_max_norm),2)`
5. [**Optional**] Sort the results in descending order of the calculated Phenodigm score by passing the following to the `sort` parameter:
    - `div(sum(disease_model_avg_norm,disease_model_max_norm),2) desc`

**HINT**: do not include the text `phenodigm_score:` in the `sort` parameter, as this will produce an error.

In [15]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary AND marker_id:"MGI:109331" AND association_curated:true',
        'fl': 'marker_id,model_id,disease_id,disease_term,disease_model_avg_norm,disease_model_max_norm,association_curated,phenodigm_score:div(sum(disease_model_avg_norm,disease_model_max_norm),2)',
        'sort':'div(sum(disease_model_avg_norm,disease_model_max_norm),2) desc',
        'rows': 3
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary+AND+marker_id%3A%22MGI%3A109331%22+AND+association_curated%3Atrue&fl=marker_id%2Cmodel_id%2Cdisease_id%2Cdisease_term%2Cdisease_model_avg_norm%2Cdisease_model_max_norm%2Cassociation_curated%2Cphenodigm_score%3Adiv%28sum%28disease_model_avg_norm%2Cdisease_model_max_norm%29%2C2%29&sort=div%28sum%28disease_model_avg_norm%2Cdisease_model_max_norm%29%2C2%29+desc&rows=3

Number of found documents: 48



Unnamed: 0,disease_id,disease_term,model_id,marker_id,disease_model_avg_norm,disease_model_max_norm,association_curated,phenodigm_score
0,OMIM:618529,"Robinow Syndrome, Autosomal Recessive 2",MGI:5548389#hom#embryo,MGI:109331,35.03,64.44,True,49.735
1,ORPHA:1507,Autosomal Recessive Robinow Syndrome,MGI:5548389#hom#embryo,MGI:109331,27.69,70.72,True,49.205
2,ORPHA:1507,Autosomal Recessive Robinow Syndrome,MGI:4881804,MGI:109331,21.44,70.56,True,46.0
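
The expression `div(sum(disease_model_avg_norm,disease_model_max_norm),2)` is the mean of the two normalised scores. The sketch below recomputes it locally for the three rows returned above, which is a quick way to check the `phenodigm_score` column.

```python
def phenodigm_score(avg_norm, max_norm):
    """Mean of the average and maximum normalised disease-model scores."""
    return (avg_norm + max_norm) / 2


# (disease_id, disease_model_avg_norm, disease_model_max_norm) from the output above.
rows = [
    ("OMIM:618529", 35.03, 64.44),
    ("ORPHA:1507", 27.69, 70.72),
    ("ORPHA:1507", 21.44, 70.56),
]

for disease_id, avg_norm, max_norm in rows:
    # Rounded to three decimals for comparison with the phenodigm_score column.
    print(disease_id, round(phenodigm_score(avg_norm, max_norm), 3))
```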
