# International Mouse Phenotyping Consortium (IMPC) Data API Workshop
Welcome to our workshop! In this session, we'll guide you through using the Apache Solr API to access IMPC data, and then we will focus on the `phenodigm` core. By the end, you should be able to construct Solr queries to extract IMPC datasets. The session includes hands-on exercises and real-world examples to reinforce your skills.

For more information about IMPC visit our [website](https://www.mousephenotype.org/).
Other useful links:
- Workshop [repository](https://github.com/mpi2/impc-data-api-workshop/tree/main) with all materials
- IMPC Solr cores [documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/)
- International Mouse Phenotyping Resource of Standardised Screens | [IMPReSS](https://www.mousephenotype.org/impress/index)
- The Genome Targeting Repository | [GenTaR](https://www.gentar.org/tracker/#/)

# Set up
Let's start! First of all, we need to import the Python libraries and set up a helper function.
### Helper functions
Execute the cell below by following these steps:
1. Select the cell by clicking into it.
2. Execute the code by pressing the ▷ play button above.
3. Alternatively, use the hotkey Ctrl + Enter to execute the code.

In [1]:
from IPython.display import display
from tqdm import tqdm
from urllib.parse import unquote

import csv
import pandas as pd
import requests

# Display at most 15 rows of a dataframe, and all of its columns
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', None)

# Create helper function
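def solr_request(core, params, silent=False):
    """Performs a single Solr request.

    Args:
        core: Name of the Solr core to query.
        params: Dictionary of Solr query parameters (e.g. `q`, `rows`, `fl`).
        silent: Suppress displaying the df and number of results (useful for batch requests).

    Returns:
        num_found: How many rows in total did the request match.
        df: A Pandas dataframe with a portion of the request matching `start` and `rows` parameters.
        If the request fails, an error is printed and `None` is returned instead.
    """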
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    response = requests.get(solr_url, params=params)
    if not silent:
        print(f"\nYour request:\n{response.request.url}\n")
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        num_found = data["response"]["numFound"]
        if not silent:
            print(f'Number of found documents: {num_found}\n')
        # Extract and add search results to the list
        search_results = []
        for doc in data["response"]["docs"]:
            search_results.append(doc)
    
        # Convert the list of dictionaries into a DataFrame and print the DataFrame
        df = pd.DataFrame(search_results)
        if not silent:
            display(df)
        return num_found, df
    
    else:
        print("Error:", response.status_code, response.text)

### Example query
We will use the `solr_request` function to access IMPC data through the Solr API. Let's run the cell below and investigate the result.

In [2]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': '*:*',  # Your query, '*:*' retrieves all documents
        'rows': 10,  # Number of rows to retrieve
        'fl': 'marker_symbol,allele_symbol,parameter_stable_id',  # Fields to retrieve
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=%2A%3A%2A&rows=10&fl=marker_symbol%2Callele_symbol%2Cparameter_stable_id

Number of found documents: 67660



Unnamed: 0,allele_symbol,marker_symbol,parameter_stable_id
0,Zfp580<tm1b(EUCOMM)Hmgu>,Zfp580,IMPC_EYE_063_001
1,Firrm<tm1b(EUCOMM)Hmgu>,Firrm,IMPC_XRY_012_001
2,Gpld1<tm1.1(KOMP)Vlcg>,Gpld1,IMPC_PAT_017_002
3,Mbip<em1(IMPC)J>,Mbip,IMPC_HEM_001_001
4,Dnmt3l<tm1b(EUCOMM)Hmgu>,Dnmt3l,IMPC_DXA_004_001
5,Cd300lg<tm1a(KOMP)Wtsi>,Cd300lg,IMPC_CAL_008_001
6,Gbgt1<tm1.1(KOMP)Vlcg>,Gbgt1,IMPC_GRS_011_001
7,Kcne1<tm1b(EUCOMM)Hmgu>,Kcne1,IMPC_ABR_012_001
8,Recql4<tm1.1(KOMP)Vlcg>,Recql4,IMPC_VIA_067_001
9,Miga2<tm1a(KOMP)Wtsi>,Miga2,M-G-P_016_001_002


Let's take a look at the output of the helper function. You can see the following:
1. The submitted request, which you can open in a browser by clicking the link.
2. The total number of documents that match the query. This can be larger than the number of rows returned.
3. A table with the results of your query. At most 15 rows are displayed.
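
To see how the `params` dictionary turns into the request URL printed above, here is a minimal sketch that rebuilds the URL with the standard library's `urlencode`. It makes no network call. It assumes the same base URL and core as `solr_request`, and that this encoding matches what `requests` produces for these particular parameters.

```python
from urllib.parse import urlencode

base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
core = "genotype-phenotype"
params = {
    "q": "*:*",  # match all documents
    "rows": 10,  # number of rows to retrieve
    "fl": "marker_symbol,allele_symbol,parameter_stable_id",  # fields to retrieve
}

# Each key/value pair becomes `key=value`; special characters are percent-encoded.
url = f"{base_url}{core}/select?{urlencode(params)}"
print(url)
```

The printed URL should be identical to the one shown under "Your request" in the example output, which is a quick way to check which parameters a request actually sent.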

Let's get started with the exercises!

# Exercise block A

### Exercise 1: Getting Familiar with the Core
We will be working with the `genotype-phenotype` core. To familiarise yourself with the data, request 3 rows and all fields from this core.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **67,660**.

In [3]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': '*:*',
        'rows': 3,
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=%2A%3A%2A&rows=3

Number of found documents: 67660



Unnamed: 0,allele_accession_id,allele_name,allele_symbol,assertion_type,assertion_type_id,colony_id,doc_id,effect_size,intermediate_mp_term_id,intermediate_mp_term_name,life_stage_acc,life_stage_name,marker_accession_id,marker_symbol,mp_term_id,mp_term_name,p_value,parameter_name,parameter_stable_id,parameter_stable_key,phenotyping_center,pipeline_name,pipeline_stable_id,pipeline_stable_key,procedure_name,procedure_stable_id,procedure_stable_key,project_name,resource_name,sex,statistical_method,strain_accession_id,strain_name,top_level_mp_term_id,top_level_mp_term_name,zygosity,_version_
0,MGI:5694173,"targeted mutation 1b, Helmholtz Zentrum Muench...",Zfp580<tm1b(EUCOMM)Hmgu>,automatic,ECO:0000203,ZFOFB,1743756722287,-1.766374,"[MP:0006069, MP:0002864, MP:0003727, MP:000209...","[abnormal retina neuronal layer morphology, ab...",IMPCLS:0005,Early adult,MGI:1916242,Zfp580,MP:0003733,abnormal retina inner nuclear layer morphology,2.643851e-06,Right inner nuclear layer,IMPC_EYE_063_001,[86016],BCM,BCM Pipeline,BCM_001,16,Eye Morphology,[IMPC_EYE_003],[1390],[BaSH],IMPC,not_considered,"Linear Mixed Model framework, LME, not includi...",MGI:2159965,C57BL/6N,[MP:0005391],[vision/eye phenotype],homozygote,1797159224090820608
1,MGI:5548899,"targeted mutation 1b, Helmholtz Zentrum Muench...",Firrm<tm1b(EUCOMM)Hmgu>,automatic,ECO:0000203,H-BC055324-A04-TM1B,2276332667035,,"[MP:0005508, MP:0009250]","[abnormal skeleton morphology, abnormal append...",IMPCLS:0005,Early adult,MGI:3590554,Firrm,MP:0004509,abnormal pelvic girdle bone morphology,5.306344e-08,Pelvis,IMPC_XRY_012_001,[19374],MRC Harwell,Harwell,HRWL_001,8,X-ray,[IMPC_XRY_001],[557],[BaSH],IMPC,male,Fisher Exact Test framework,MGI:2164831,C57BL/6NTac,[MP:0005390],[skeleton phenotype],heterozygote,1797159224118083584
2,MGI:5695879,"targeted mutation 1.1, Velocigene",Gpld1<tm1.1(KOMP)Vlcg>,manual,ECO:0000218,BL3802,2336462209120,1.0,"[MP:0031094, MP:0002706, MP:0000516, MP:0002135]","[organomegaly, abnormal kidney size, abnormal ...",IMPCLS:0005,Early adult,MGI:106604,Gpld1,MP:0003068,enlarged kidney,0.0,Kidney,IMPC_PAT_017_002,[37205],UC Davis,UCD Pipeline,UCD_001,13,Gross Pathology and Tissue Collection,[IMPC_PAT_002],[779],[DTCC],IMPC,male,Supplied as data,MGI:2683688,C57BL/6NCrl,"[MP:0005367, MP:0005378]","[renal/urinary system phenotype, growth/size/b...",homozygote,1797159224119132160


### Exercise 2: Selecting Specific Fields
As you can see, there are a lot of fields. To focus on the fields we need, request only the following ones:
- marker_symbol
- marker_accession_id
- zygosity
- parameter_name
- parameter_stable_id
- p_value
<br>

Modify the query from Exercise 1 to request only the fields listed above.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **67,660**.

In [4]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': '*:*',
        'rows': 3,
        'fl': 'marker_symbol,marker_accession_id,zygosity,parameter_name,parameter_stable_id,p_value'
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=%2A%3A%2A&rows=3&fl=marker_symbol%2Cmarker_accession_id%2Czygosity%2Cparameter_name%2Cparameter_stable_id%2Cp_value

Number of found documents: 67660



Unnamed: 0,marker_accession_id,marker_symbol,p_value,parameter_name,parameter_stable_id,zygosity
0,MGI:1916242,Zfp580,2.643851e-06,Right inner nuclear layer,IMPC_EYE_063_001,homozygote
1,MGI:3590554,Firrm,5.306344e-08,Pelvis,IMPC_XRY_012_001,heterozygote
2,MGI:106604,Gpld1,0.0,Kidney,IMPC_PAT_017_002,homozygote
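
The `fl` parameter is a single comma-separated string, which is easy to mistype. As a small sketch (the `fields` variable is introduced here for illustration only), the string can be built from a Python list instead:

```python
# Fields requested in this exercise, kept as a list so they are easy to edit.
fields = [
    "marker_symbol",
    "marker_accession_id",
    "zygosity",
    "parameter_name",
    "parameter_stable_id",
    "p_value",
]

# Solr expects the field list as one comma-separated string.
params = {"q": "*:*", "rows": 3, "fl": ",".join(fields)}
print(params["fl"])
```

The resulting `params` dictionary can be passed to `solr_request` exactly like the hand-written one.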


# Exercise block B
### Exercise 3: Filtering by Single Field
Let's now focus on a particular gene. In this example we will be using *Dclk1*. Modify the query from Exercise 2 to filter the results so that only documents for this gene are displayed.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **13**.

In [5]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': 'marker_symbol:Dclk1',
        'rows': 3,
        'fl': 'marker_symbol,marker_accession_id,zygosity,parameter_name,parameter_stable_id,p_value'
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=marker_symbol%3ADclk1&rows=3&fl=marker_symbol%2Cmarker_accession_id%2Czygosity%2Cparameter_name%2Cparameter_stable_id%2Cp_value

Number of found documents: 13



Unnamed: 0,marker_accession_id,marker_symbol,p_value,parameter_name,parameter_stable_id,zygosity
0,MGI:1330861,Dclk1,1.0,ANA classification,MGP_ANA_002_001,homozygote
1,MGI:1330861,Dclk1,6e-06,Forelimb grip strength normalised against body...,IMPC_GRS_010_001,heterozygote
2,MGI:1330861,Dclk1,1.0,ANA classification,MGP_ANA_002_001,heterozygote


### Exercise 4: Filtering Numerical Values and Applying Multiple Filters
In addition to the `marker_symbol` filter, let's also apply a stricter p-value threshold, keeping only results with a p-value below 1e-4.
<br>
Modify the query from Exercise 3 and display **10 rows** instead of 3.
<br>
Note: The name of a field may be spelled differently from the everyday term.
<br>e.g. **`p_value`** is the name of the field in Solr, whereas **"p-value"** is the term used in everyday language.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **5**.

In [6]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': 'marker_symbol:Dclk1 AND p_value:[* TO 1e-4]',
        'rows': 10,
        'fl': 'marker_symbol,marker_accession_id,zygosity,parameter_name,parameter_stable_id,p_value'
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=marker_symbol%3ADclk1+AND+p_value%3A%5B%2A+TO+1e-4%5D&rows=10&fl=marker_symbol%2Cmarker_accession_id%2Czygosity%2Cparameter_name%2Cparameter_stable_id%2Cp_value

Number of found documents: 5



Unnamed: 0,marker_accession_id,marker_symbol,p_value,parameter_name,parameter_stable_id,zygosity
0,MGI:1330861,Dclk1,6e-06,Forelimb grip strength normalised against body...,IMPC_GRS_010_001,heterozygote
1,MGI:1330861,Dclk1,6.8e-05,Body length,IMPC_DXA_006_001,homozygote
2,MGI:1330861,Dclk1,1e-06,Magnesium,IMPC_CBC_054_001,homozygote
3,MGI:1330861,Dclk1,2.3e-05,Aspartate aminotransferase,IMPC_CBC_012_001,homozygote
4,MGI:1330861,Dclk1,7e-06,Forelimb grip strength normalised against body...,IMPC_GRS_010_001,homozygote
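
The range filter in the query above uses square brackets, which in Solr include the bounds, so a p-value of exactly 1e-4 would also match. Curly braces exclude the bounds. The sketch below composes both variants as plain strings; `range_clause` and `and_query` are hypothetical helpers written for this illustration, not part of the workshop code.

```python
def range_clause(field, lower="*", upper="*", inclusive=True):
    """Return a Solr range clause; [] includes the bounds, {} excludes them."""
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{opening}{lower} TO {upper}{closing}"


def and_query(*clauses):
    """Join individual clauses so that a document must satisfy all of them."""
    return " AND ".join(clauses)


gene = "marker_symbol:Dclk1"
inclusive_q = and_query(gene, range_clause("p_value", upper="1e-4"))
exclusive_q = and_query(gene, range_clause("p_value", upper="1e-4", inclusive=False))

print(inclusive_q)
print(exclusive_q)
```

Either string can be used as the `q` value. For the data shown here the difference only matters if a document has a p-value of exactly 1e-4.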


### Exercise 5: Search for parameter stable ID in the IMPReSS and answer questions
Follow the steps below and answer the question:
1. Navigate to the [IMPReSS website](https://www.mousephenotype.org/impress/index).
2. Search for the parameter_stable_id `IMPC_GRS_010_001`, which appears in the results of Exercise 4.
3. Examine the procedure associated with the pipeline key `IMPC_001` by clicking on the "View procedure" button.
<br>
- In which week was the specimen tested?

# Exercise block C

### Exercise 6: Downloading data in chunks
We will use the `batch_request` function to download data in chunks. Let's execute the cell below.

In [7]:
def batch_request(core, params, batch_size):
    """Calls `solr_request` multiple times with `params` to retrieve results in chunks of `batch_size` rows at a time."""
    if "rows" in params:
        print("WARN: You have specified the `params` -> `rows` value. It will be ignored, because the data is retrieved `batch_size` rows at a time.")
    # Determine the total number of rows. Note that we do not request any data (rows = 0).
    num_results, _ = solr_request(core=core, params={**params, "start": 0, "rows": 0}, silent=True)
    # Initialise everything for data retrieval.
    start = 0
    chunks = []
    # Request chunks until we have complete data.
    with tqdm(total=num_results) as pbar:  # Initialize tqdm progress bar.
        while start < num_results:
            # Update progress bar with the number of rows requested.
            pbar.update(batch_size) 
            # Request chunk. We don't need num_results anymore because it does not change.
            _, df_chunk = solr_request(core=core, params={**params, "start": start, "rows": batch_size}, silent=True)
            # Record chunk.
            chunks.append(df_chunk)
            # Increment start.
            start += batch_size
    # Prepare final dataframe.
    return pd.concat(chunks, ignore_index=True)

First of all, let's construct a query for the cardiovascular system. In the example below we request data that meets the following conditions:
- The `top_level_mp_term_name` field contains `cardiovascular system phenotype`.
- `effect_size` is not null.
- `life_stage_name` is `Late adult`.
<br>
<br>

Run the `solr_request` function below and look at the results.

In [8]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': 'top_level_mp_term_name:"cardiovascular system phenotype" AND effect_size:[* TO *] AND life_stage_name:"Late adult"',
        'fl': 'allele_accession_id,life_stage_name,marker_symbol,mp_term_name,p_value,parameter_name,parameter_stable_id,phenotyping_center,statistical_method,top_level_mp_term_name,effect_size'
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=top_level_mp_term_name%3A%22cardiovascular+system+phenotype%22+AND+effect_size%3A%5B%2A+TO+%2A%5D+AND+life_stage_name%3A%22Late+adult%22&fl=allele_accession_id%2Clife_stage_name%2Cmarker_symbol%2Cmp_term_name%2Cp_value%2Cparameter_name%2Cparameter_stable_id%2Cphenotyping_center%2Cstatistical_method%2Ctop_level_mp_term_name%2Ceffect_size

Number of found documents: 413



Unnamed: 0,allele_accession_id,effect_size,life_stage_name,marker_symbol,mp_term_name,p_value,parameter_name,parameter_stable_id,phenotyping_center,statistical_method,top_level_mp_term_name
0,MGI:6276919,1.0,Late adult,Sstr2,enlarged heart,0.0,Heart,UCDLA_PAT_006_002,UC Davis,Supplied as data,"[cardiovascular system phenotype, growth/size/..."
1,MGI:6257614,1.0,Late adult,Etv6,enlarged heart,0.0,Heart,RBRCLA_PAT_006_002,RBRC,Supplied as data,"[cardiovascular system phenotype, growth/size/..."
2,MGI:6341961,0.254275,Late adult,Ik,abnormal retina blood vessel morphology,3.8e-05,Retinal Blood Vessels Structure,KMPCLA_EYE_025_001,KMPC,Fisher Exact Test framework,"[vision/eye phenotype, cardiovascular system p..."
3,MGI:6152521,1.0,Late adult,Naa10,abnormal heart morphology,0.0,Heart,UCDLA_PAT_006_002,UC Davis,Supplied as data,[cardiovascular system phenotype]
4,MGI:5755068,1.0,Late adult,Ptp4a1,abnormal heart morphology,0.0,Heart,TCPLA_PAT_006_002,TCP,Supplied as data,[cardiovascular system phenotype]
5,MGI:6277003,1.0,Late adult,Slc9b1,abnormal heart morphology,0.0,Heart,UCDLA_PAT_006_002,UC Davis,Supplied as data,[cardiovascular system phenotype]
6,MGI:5614615,-1.177116,Late adult,Thra,abnormal heart left ventricle morphology,4.8e-05,LVIDs,ICSLA_ECH_011_001,ICS,Linear Model Using Generalized Least Squares f...,[cardiovascular system phenotype]
7,MGI:6277006,1.0,Late adult,Crlf2,small heart,0.0,Heart,UCDLA_PAT_006_002,UC Davis,Supplied as data,[cardiovascular system phenotype]
8,MGI:6277048,1.0,Late adult,Lrrtm2,abnormal heart morphology,0.0,Heart,UCDLA_PAT_006_002,UC Davis,Supplied as data,[cardiovascular system phenotype]
9,MGI:6336212,1.0,Late adult,Gnptab,enlarged heart,0.0,Heart,UCDLA_PAT_006_002,UC Davis,Supplied as data,"[cardiovascular system phenotype, growth/size/..."


Download the data for the request above by modifying the `batch_request` call below. Set the `batch_size` parameter to 100.

In [9]:
# Request dataframe in chunks.
df = batch_request(
    core="genotype-phenotype",
    params={
        'q': 'top_level_mp_term_name:"cardiovascular system phenotype" AND effect_size:[* TO *] AND life_stage_name:"Late adult"',
        'fl': 'allele_accession_id,life_stage_name,marker_symbol,mp_term_name,p_value,parameter_name,parameter_stable_id,phenotyping_center,statistical_method,top_level_mp_term_name,effect_size'
    },
    batch_size=100
)

500it [00:00, 695.94it/s]                                                                                                                            
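
The progress bar reports 500 iterations although the query matched 413 documents. A minimal sketch of the paging arithmetic shows why: `batch_request` advances the bar by a full `batch_size` for every request, including the last, partially filled one. The helper `batch_starts` is hypothetical and only mirrors the `start` values used in the loop.

```python
def batch_starts(num_results, batch_size):
    """Return the `start` offsets needed to page through `num_results` rows."""
    return list(range(0, num_results, batch_size))


num_results = 413  # documents matched by the cardiovascular query above
batch_size = 100

starts = batch_starts(num_results, batch_size)
print(starts)                    # [0, 100, 200, 300, 400]
print(len(starts) * batch_size)  # 500, the value shown by the progress bar

# The final request asks for 100 rows but only 13 remain.
print(num_results - starts[-1])  # 13
```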


In [10]:
# Save dataframe to JSON (lines) format for subsequent work. This will contain a single self-contained JSON record per line.
df.to_json("impc_data.json", orient="records", lines=True)

In [11]:
# We can also save as CSV, but note that fine structure such as lists and nested data will be lost.
df.to_csv("impc_data.csv", index=False)
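
To illustrate what "fine structure will be lost" means, here is a small sketch using only the standard library rather than pandas, so it runs without the downloaded data. The record is a cut-down example based on the *Ik* row in the output above. A list-valued field survives a JSON round trip but comes back from CSV as plain text.

```python
import csv
import io
import json

record = {
    "marker_symbol": "Ik",
    "top_level_mp_term_name": ["vision/eye phenotype", "cardiovascular system phenotype"],
}

# JSON: the list is stored as a JSON array and restored as a Python list.
from_json = json.loads(json.dumps(record))

# CSV: the writer stores the text form of the list, so structure is lost.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_json["top_level_mp_term_name"]).__name__)  # list
print(type(from_csv["top_level_mp_term_name"]).__name__)   # str
```

If you need to filter on individual terms later, the JSON file is therefore the more convenient starting point.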

# Exercise block D

### Exercise 7: Faceting Query
We will use the `facet_request` function to run a faceting query. Let's execute the cell below.

In [12]:
def facet_request(core, params, silent=False):
    """Performs a single Solr faceting request.

    Args:
        core: Name of the Solr core to query.
        params: Dictionary of Solr query parameters. Must include `facet.field`.
        silent: Suppress displaying the df and number of results.

    Returns:
        num_found: How many rows in total did the request match.
        df: A Pandas dataframe with one row per facet value and its document count.
    """
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    response = requests.get(solr_url, params=params)
    if not silent:
        print(f"\nYour request:\n{unquote(response.request.url)}\n")
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        num_found = data["response"]["numFound"]
        if not silent:
            print(f'Number of found documents: {num_found}\n')
        # Extract and add faceting query results to the list
        facet_counts = data["facet_counts"]["facet_fields"][params["facet.field"]]
        # Initialize an empty dictionary
        faceting_dict = {}
        # Iterate over the list, taking pairs of elements
        for i in range(0, len(facet_counts), 2):
            # Assign label as key and count as value
            label = facet_counts[i]
            count = facet_counts[i + 1]
            faceting_dict[label] = [count]
        
        # Convert the dictionary into a DataFrame with one row per facet value
        df = pd.DataFrame.from_dict(faceting_dict, orient='index', columns=['counts']).reset_index()

        # Rename the columns
        df.columns = [params["facet.field"], 'count_per_category']
        if not silent:
            display(df)
        return num_found, df
    
    else:
        print("Error:", response.status_code, response.text)

In this exercise we will again query the whole core. We want to count how many documents there are for each value of the `zygosity` field. Modify the query below to get this information.
<br>
<br>
If you complete the exercise successfully, the number of documents in the homozygote category will be **52,606**.

In [13]:
num_found, df = facet_request(
    core='genotype-phenotype',
    params={
        'q': '*:*',
        'rows': 0,
        'facet': 'on',
        'facet.field': 'zygosity',
        'facet.limit': 15,
        'facet.mincount': 1
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=*:*&rows=0&facet=on&facet.field=zygosity&facet.limit=15&facet.mincount=1

Number of found documents: 67660



Unnamed: 0,zygosity,count_per_category
0,homozygote,52606
1,heterozygote,14312
2,hemizygote,742
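
`facet_request` relies on the facet counts arriving as one flat list that alternates between a value and its count. The following sketch reproduces that conversion offline, using the numbers from the output above, and checks that the categories add up to the total number of documents.

```python
# Flat list in the shape handled by `facet_request`: value, count, value, count, ...
facet_counts = ["homozygote", 52606, "heterozygote", 14312, "hemizygote", 742]

labels = facet_counts[0::2]  # every even position holds a facet value
counts = facet_counts[1::2]  # every odd position holds its count
by_zygosity = dict(zip(labels, counts))

print(by_zygosity)
print(sum(by_zygosity.values()))  # 67660, the total number of documents in the core
```

Because every document here falls into exactly one of the three zygosity categories, the counts sum to the `numFound` value of the unfiltered query.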


# Exercises: Phenodigm Core

Now we will go over the disease section of the IMPC website. The `phenodigm` core contains all of the information regarding human-mouse disease associations.

Some useful links are:
- [IMPC disease models summary](https://www.mousephenotype.org/help/data-visualization/gene-pages/disease-models/)
- [IMPC disease associations](https://www.mousephenotype.org/help/data-analysis/disease-associations/)
- [Disease Models Portal](https://diseasemodels.research.its.qmul.ac.uk)
- [OMIM](https://www.omim.org/)
- [Orphanet](https://www.orpha.net)
- [SOLR Phenodigm core documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/phenodigm.html)

### Exercise 1: Getting Familiar with the Phenodigm Core
Get familiar with the structure of the phenodigm core queries. Write a query to retrieve 5 rows of the `disease_model_summary` type.

If you complete the exercise successfully, your total number of documents will be **8,444,376**.


In [14]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary',
        'rows': 5
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary&rows=5

Number of found documents: 8444376



Unnamed: 0,type,disease_id,disease_term,model_id,model_source,model_description,model_genetic_background,marker_id,marker_symbol,marker_locus,marker_num_models,disease_model_avg_raw,disease_model_avg_norm,disease_model_max_raw,disease_model_max_norm,association_curated,disease_matched_phenotypes,model_matched_phenotypes
0,disease_model_summary,OMIM:227645,"Fanconi Anemia, Complementation Group C",MGI:3696039,MGI,Fign<fi>/Fign<fi>,mixed,MGI:1890647,Fign,2:63801852-63928382,5,0.6,26.53,2.45,77.45,False,"[HP:0000568 Microphthalmia, HP:0003214 Prolong...","[MP:0001347 absent lacrimal glands, MP:0013387..."
1,disease_model_summary,OMIM:227645,"Fanconi Anemia, Complementation Group C",MGI:3696114,MGI,C1galt1<ptl1>/C1galt1<ptl1>,C57BL/6-C1galt1<ptl1>,MGI:2151071,C1galt1,6:7845224-7872042,6,0.64,28.23,2.32,73.31,False,"[HP:0001903 Anemia, HP:0000086 Ectopic kidney,...","[MP:0003179 thrombocytopenia, MP:0000228 abnor..."
2,disease_model_summary,OMIM:227645,"Fanconi Anemia, Complementation Group C",MGI:3696354,MGI,Ercc2<tm3Jhjh>/Ercc2<tm3Jhjh>,involves: 129P2/OlaHsd * C57BL/6 * FVB,MGI:95413,Ercc2,7:19115942-19129619,15,0.53,23.38,2.62,82.88,False,[HP:0003213 Deficient excision of UV-induced p...,"[MP:0010308 decreased tumor latency, MP:000515..."
3,disease_model_summary,OMIM:227645,"Fanconi Anemia, Complementation Group C",MGI:3696370,MGI,Khdrbs1<tm1Rchd>/Khdrbs1<tm1Rchd>,involves: 129S1/Sv * 129X1/SvJ * C57BL/6,MGI:893579,Khdrbs1,4:129596957-129636096,2,0.37,16.44,2.18,69.07,False,"[HP:0011940 Anterior wedging of T12, HP:000081...","[MP:0000168 abnormal bone marrow development, ..."
4,disease_model_summary,OMIM:227645,"Fanconi Anemia, Complementation Group C",MGI:3696775,MGI,Kat6a<tm1Iki>/Kat6a<tm1Iki>,Not Specified,MGI:2442415,Kat6a,8:23349551-23433275,11,0.42,18.49,2.4,75.9,False,"[HP:0001903 Anemia, HP:0001876 Pancytopenia, H...",[MP:0000239 absent common myeloid progenitor c...


### Exercise 2: Filtering by Disease

When filtering by disease, we can use the following fields:
- `disease_term`
- `disease_id`


Refer to the [documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/phenodigm.html) and select the most appropriate `field` to filter for the disease: *[Robinow Syndrome](https://www.omim.org/entry/618529?search=618529&highlight=618529)*

How many documents are there on *Robinow Syndrome*?

If you complete the exercise successfully, your total number of documents will be **11,655**.

In [15]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary AND disease_term:"Robinow Syndrome"',
        'rows': 5,
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary+AND+disease_term%3A%22Robinow+Syndrome%22&rows=5

Number of found documents: 11655



Unnamed: 0,type,disease_id,disease_term,model_id,model_source,model_description,model_genetic_background,marker_id,marker_symbol,marker_locus,marker_num_models,disease_model_avg_raw,disease_model_avg_norm,disease_model_max_raw,disease_model_max_norm,association_curated,disease_matched_phenotypes,model_matched_phenotypes
0,disease_model_summary,ORPHA:97360,Robinow Syndrome,MGI:2166570,MGI,Ednra<tm1Ywa>/Ednra<tm1Ywa>,129S/SvEv-Ednra<tm1Ywa>,MGI:105923,Ednra,8:78389658-78451081,21,0.78,35.47,2.2,68.15,False,"[HP:0011800 Midface retrusion, HP:0000410 Mixe...","[MP:0010543 aorta tubular hypoplasia, MP:00302..."
1,disease_model_summary,ORPHA:97360,Robinow Syndrome,MGI:2166944,MGI,Gli3<Xt-J>/Gli3<Xt-J>,involves: C3H * CD-1,MGI:95729,Gli3,13:15638308-15904611,50,0.84,38.27,2.32,72.06,False,"[HP:0002751 Kyphoscoliosis, HP:0000410 Mixed h...","[MP:0000455 abnormal maxilla morphology, MP:00..."
2,disease_model_summary,ORPHA:97360,Robinow Syndrome,MGI:2167575,MGI,Ptk2<tm1Imeg>/Ptk2<tm1Imeg>,involves: C57BL/6 * CBA,MGI:95481,Ptk2,15:73076951-73295129,11,0.42,18.91,2.21,68.69,False,"[HP:0010882 Pulmonary valve atresia, HP:001166...",[MP:0011201 abnormal visceral yolk sac cavity ...
3,disease_model_summary,ORPHA:97360,Robinow Syndrome,MGI:2167786,MGI,Hoxa2<tm1Ipc>/Hoxa2<tm1Ipc>,involves: 129S2/SvPas,MGI:96174,Hoxa2,6:52139397-52141811,19,0.54,24.36,3.1,96.32,False,"[HP:0000202 Orofacial cleft, HP:0000410 Mixed ...","[MP:0008383 enlarged gonial bone, MP:0012786 i..."
4,disease_model_summary,ORPHA:97360,Robinow Syndrome,MGI:2168259,MGI,Myf5<tm1Jae>/Myf5<tm1Jae>,involves: 129S4/SvJae,MGI:97252,Myf5,10:107318769-107321995,19,0.31,13.88,2.5,77.71,False,"[HP:0005011 Mesomelic arm shortening, HP:00009...","[MP:0004672 short ribs, MP:0008148 abnormal st..."
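
Values that contain spaces or colons, such as disease names and identifiers like `ORPHA:97360`, need to be wrapped in double quotes. Otherwise the query parser may split them into separate terms or misread the colon as a field separator. Below is a minimal sketch of a quoting helper; `quote_value` and `field_clause` are hypothetical names introduced for illustration.

```python
def quote_value(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def field_clause(field, value):
    """Build a `field:"value"` clause for use inside the `q` parameter."""
    return f"{field}:{quote_value(value)}"


by_term = field_clause("disease_term", "Robinow Syndrome")
by_id = field_clause("disease_id", "ORPHA:97360")

print(f"type:disease_model_summary AND {by_term}")
print(f"type:disease_model_summary AND {by_id}")
```

The two filters are not guaranteed to match the same set of documents, so compare the reported number of found documents when switching between them.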


### Exercise 3: Filtering Mouse Models Related to a Disease
Retrieve the first 5 rows related to mouse models associated with *Robinow Syndrome*.

Use the following fields to obtain the model information:
- `model_id`
- `disease_term`
- `disease_id`
- `marker_id`
- `marker_symbol`
- `model_source`
- `model_genetic_background`

If you complete the exercise successfully, your total number of documents will be **11,655** and they will only display **7 fields**.


In [16]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary AND disease_term:"Robinow Syndrome"',
        'rows': 5,
        'fl': 'model_id,disease_term,disease_id,marker_id,marker_symbol,model_source,model_genetic_background',
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary+AND+disease_term%3A%22Robinow+Syndrome%22&rows=5&fl=model_id%2Cdisease_term%2Cdisease_id%2Cmarker_id%2Cmarker_symbol%2Cmodel_source%2Cmodel_genetic_background

Number of found documents: 11655



Unnamed: 0,disease_id,disease_term,model_id,model_source,model_genetic_background,marker_id,marker_symbol
0,ORPHA:97360,Robinow Syndrome,MGI:2166570,MGI,129S/SvEv-Ednra<tm1Ywa>,MGI:105923,Ednra
1,ORPHA:97360,Robinow Syndrome,MGI:2166944,MGI,involves: C3H * CD-1,MGI:95729,Gli3
2,ORPHA:97360,Robinow Syndrome,MGI:2167575,MGI,involves: C57BL/6 * CBA,MGI:95481,Ptk2
3,ORPHA:97360,Robinow Syndrome,MGI:2167786,MGI,involves: 129S2/SvPas,MGI:96174,Hoxa2
4,ORPHA:97360,Robinow Syndrome,MGI:2168259,MGI,involves: 129S4/SvJae,MGI:97252,Myf5


## Advanced Exercises


### Exercise 4: Calculate the Phenodigm score

In this exercise, we will calculate the Phenodigm score, sort the results by this score, and filter the documents based on gene-disease curation.

1. Retrieve all diseases related to the mouse gene *[Nxn](https://www.mousephenotype.org/data/genes/MGI:109331)* (MGI:109331)
2. Select the following fields:
    - `marker_id`
    - `model_id`
    - `disease_id`
    - `disease_term`
    - `disease_model_avg_norm`
    - `disease_model_max_norm`
    - `association_curated`
3. Filter results to keep only those where `association_curated` is `true`.
4. Calculate their **Phenodigm scores** by adding the following to your last field:
    -  `phenodigm_score:div(sum(disease_model_avg_norm,disease_model_max_norm),2)`
5. Sort the results in descending order by the calculated Phenodigm score by passing the following to the `sort` parameter:
    - `div(sum(disease_model_avg_norm,disease_model_max_norm),2)`

        **HINT**: do not include the text `phenodigm_score:` in the `sort` parameter, as this will produce an error.

If you complete the exercise successfully, your total number of documents will be **48**.

In [17]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary AND marker_id:"MGI:109331" AND association_curated:true',
        'fl': 'marker_id,model_id,disease_id,disease_term,disease_model_avg_norm,disease_model_max_norm,association_curated,phenodigm_score:div(sum(disease_model_avg_norm,disease_model_max_norm),2)',
        'sort':'div(sum(disease_model_avg_norm,disease_model_max_norm),2) desc'
    }
)


Your request:
https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary+AND+marker_id%3A%22MGI%3A109331%22+AND+association_curated%3Atrue&fl=marker_id%2Cmodel_id%2Cdisease_id%2Cdisease_term%2Cdisease_model_avg_norm%2Cdisease_model_max_norm%2Cassociation_curated%2Cphenodigm_score%3Adiv%28sum%28disease_model_avg_norm%2Cdisease_model_max_norm%29%2C2%29&sort=div%28sum%28disease_model_avg_norm%2Cdisease_model_max_norm%29%2C2%29+desc

Number of found documents: 48



Unnamed: 0,disease_id,disease_term,model_id,marker_id,disease_model_avg_norm,disease_model_max_norm,association_curated,phenodigm_score
0,OMIM:618529,"Robinow Syndrome, Autosomal Recessive 2",MGI:5548389#hom#embryo,MGI:109331,35.03,64.44,True,49.735
1,ORPHA:1507,Autosomal Recessive Robinow Syndrome,MGI:5548389#hom#embryo,MGI:109331,27.69,70.72,True,49.205
2,ORPHA:1507,Autosomal Recessive Robinow Syndrome,MGI:4881804,MGI:109331,21.44,70.56,True,46.0
3,ORPHA:1507,Autosomal Recessive Robinow Syndrome,MGI:5548389#het#embryo,MGI:109331,14.62,70.72,True,42.670002
4,OMIM:618529,"Robinow Syndrome, Autosomal Recessive 2",MGI:5620177#hom#embryo,MGI:109331,15.99,64.44,True,40.215
5,OMIM:618529,"Robinow Syndrome, Autosomal Recessive 2",MGI:5548389#het#early,MGI:109331,24.02,56.29,True,40.155
6,ORPHA:1507,Autosomal Recessive Robinow Syndrome,MGI:5548389#het#early,MGI:109331,21.49,56.65,True,39.07
7,OMIM:618529,"Robinow Syndrome, Autosomal Recessive 2",MGI:5548389#het#embryo,MGI:109331,22.27,54.93,True,38.6
8,ORPHA:1507,Autosomal Recessive Robinow Syndrome,MGI:5620177#hom#embryo,MGI:109331,13.76,57.27,True,35.515
9,OMIM:618529,"Robinow Syndrome, Autosomal Recessive 2",MGI:6257797#hom#embryo,MGI:109331,7.76,53.32,True,30.54
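
The function expression `div(sum(disease_model_avg_norm,disease_model_max_norm),2)` is simply the mean of the two normalised scores. As a quick sanity check, this sketch recomputes the score in Python for three rows copied from the output above. The tiny difference in the third row (42.670002 versus 42.67) is presumably a floating-point rounding artefact on the server side, so the comparison uses a small tolerance.

```python
import math


def phenodigm_score(avg_norm, max_norm):
    """Mean of the normalised average and maximum disease-model scores."""
    return (avg_norm + max_norm) / 2


# (disease_id, model_id, avg_norm, max_norm, score reported by Solr)
rows = [
    ("OMIM:618529", "MGI:5548389#hom#embryo", 35.03, 64.44, 49.735),
    ("ORPHA:1507", "MGI:4881804", 21.44, 70.56, 46.0),
    ("ORPHA:1507", "MGI:5548389#het#embryo", 14.62, 70.72, 42.670002),
]

for disease_id, model_id, avg_norm, max_norm, reported in rows:
    score = phenodigm_score(avg_norm, max_norm)
    matches = math.isclose(score, reported, abs_tol=1e-4)
    print(disease_id, model_id, round(score, 3), matches)
```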



### Exercise 5: Iterate over a list of diseases or models/genes
Now, we will define another helper function, `iterator_solr_request`. This function will request information for a list of values of a given `field`. This is particularly useful when we want to request data for specific models or genes.

Execute the helper functions below:

In [18]:
import json

# Helper function to fetch results. This function is used by the 'iterator_solr_request' function.
def entity_iterator(base_url, params):
    """Generator function to fetch results from the SOLR server in chunks using pagination

    Args:
        base_url (str): The base URL of the Solr server to fetch documents from.
        params (dict): A dictionary of parameters to include in the GET request. Must include
                       'start' and 'rows' keys, which represent the index of the first document
                       to fetch and the number of documents to fetch per request, respectively.

    Yields:
        dict: The next document in the response from the Solr server.
    """
    # Initialise variable to check the first request
    first_request = True

    # Call the API in chunks and yield the documents in each chunk
    while True:
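        response = requests.get(base_url, params=params)
        # Stop early with a clear error if the server returns an HTTP error status
        response.raise_for_status()
        data = response.json()
        docs = data["response"]["docs"]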

        # Print the first request only
        if first_request:
            print(f'Your first request: {response.url}')
            first_request = False

        # Yield the documents in the current chunk
        for doc in docs:
            yield doc

        # Check if there are more results to fetch
        start = params["start"] + params["rows"]
        num_found = data["response"]["numFound"]
        if start >= num_found:
            break

        # Update the start parameter for the next request
        params["start"] = start

    # Print last request and total number of documents retrieved
    print(f'Your last request: {response.url}')
    print(f'Number of found documents: {data["response"]["numFound"]}\n')

# Function to iterate over field list and write results to a file.
def iterator_solr_request(core, params, filename='iteration_solr_request', format='json'):
    """Fetch results in batches from the Solr API and write them to a file.

    Results are always fetched 5000 rows at a time.

    Args:
        core (str): The name of the Solr core to fetch results from.
        params (dict): A dictionary of parameters to use in the filter query. Must include
                       'field_list' and 'field_type' keys, which represent the list of values to match
                       (e.g., a list of MGI model identifiers) and the field to filter on
                       (e.g., model_id), respectively.
        filename (str): The name of the file to write the results to. Defaults to 'iteration_solr_request'.
        format (str): The format of the output file. Can be 'csv' or 'json'. Defaults to 'json'.
    """

    # Validate format
    if format not in ['json','csv']:
        raise ValueError("Invalid format. Please use 'json' or 'csv'")
    
    # Base URL
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    # Work on a copy so the caller's dictionary is not modified
    params = dict(params)

    # Extract field_list and field_type from params
    field_list = params.pop("field_list")
    field_type = params.pop("field_type")

    # Construct the filter query by joining the quoted values with OR
    fq = "{}:({})".format(
        field_type, " OR ".join(['"{}"'.format(value) for value in field_list])
    )

    # Show users the field and field values they passed to the function
    print("Queried field:",fq)
    # Set internal params the users should not change
    params["fq"] = fq
    params["wt"] = 'json'
    params["start"]=0 # Start at the first result
    params["rows"]=5000 # Fetch results in chunks of 5000


    # Create the generator. No request is sent at this point: the generator is lazy,
    # so the data is downloaded while the file is being written below.
    results_generator = entity_iterator(solr_url, params)

    # Append extension to the filename
    filename = f"{filename}.{format}"

    try:
        # Open the file in write mode
        with open(filename, "w", newline="") as f:
            if format == 'csv':
                writer = None
                for item in results_generator:
                    # Initialize the CSV writer with the keys of the first item as the field names
                    if writer is None:
                        writer = csv.DictWriter(f, fieldnames=item.keys())
                        writer.writeheader()
                    # Write the item to the CSV file
                    writer.writerow(item)
            elif format == 'json':
                # Stream the documents into a JSON array without loading them all into memory
                f.write('[')
                for i, item in enumerate(results_generator):
                    if i != 0:
                        f.write(',')
                    json.dump(item, f)
                f.write(']')
    except Exception as e:
        raise Exception("An error occurred while downloading the data or writing the file: " + str(e)) from e

    print(f"File {filename} was created.")

Here is how to use it:
- New keys required in `params`:
    - `field_list`: A list of values to look up in the API (e.g., MGI ids, OMIM ids, gene ids).
    - `field_type`: The name of the field these values belong to (e.g., `model_id`, `disease_id`, `marker_id`).
- Additional function arguments:
    - `filename`: The name of the output file, without extension. The extension is added automatically.
    - `format`: The format in which you want to save the data (`json` or `csv`).
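
To see how `field_list` and `field_type` become a Solr filter query, here is a minimal offline sketch of the string that the helper builds. The function name `build_filter_query` is hypothetical and introduced only for illustration. It mirrors the formatting used inside `iterator_solr_request` but sends no request.

```python
def build_filter_query(field_type, field_list):
    """Join values into a Solr filter query of the form field:("a" OR "b")."""
    # Quote each value so that characters such as ':' in MGI ids are treated literally
    quoted_values = ['"{}"'.format(value) for value in field_list]
    return "{}:({})".format(field_type, " OR ".join(quoted_values))


print(build_filter_query("model_id", ["MGI:3587188", "MGI:3587185"]))
# model_id:("MGI:3587188" OR "MGI:3587185")
```

A document matches the filter if its `field_type` value equals any of the listed values.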

In [19]:
# List of model IDs.
models = ["MGI:3587188","MGI:3587185","MGI:3605874","MGI:2668213"]

# Call iterator function
iterator_solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary',
        'fl': 'model_id,marker_id,disease_id',
        'field_list': models,
        'field_type': 'model_id'
    },
    filename='model_ids',
    format='csv'
)

Queried field: model_id:("MGI:3587188" OR "MGI:3587185" OR "MGI:3605874" OR "MGI:2668213")
Your first request: https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary&fl=model_id%2Cmarker_id%2Cdisease_id&fq=model_id%3A%28%22MGI%3A3587188%22+OR+%22MGI%3A3587185%22+OR+%22MGI%3A3605874%22+OR+%22MGI%3A2668213%22%29&wt=json&start=0&rows=5000
Your last request: https://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=type%3Adisease_model_summary&fl=model_id%2Cmarker_id%2Cdisease_id&fq=model_id%3A%28%22MGI%3A3587188%22+OR+%22MGI%3A3587185%22+OR+%22MGI%3A3605874%22+OR+%22MGI%3A2668213%22%29&wt=json&start=5000&rows=5000
Number of found documents: 7675

File model_ids.csv was created.
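
The two logged URLs differ only in their `start` value (0 and 5000): with 7675 matching documents and 5000 rows per request, two requests are needed. The sketch below reproduces that pagination arithmetic offline. The function name `page_starts` is hypothetical and is not part of the workshop helpers.

```python
def page_starts(num_found, rows):
    """Return the `start` offset of every request needed to fetch num_found documents."""
    # At least one request is always sent, even when nothing matches
    return list(range(0, max(num_found, 1), rows))


print(page_starts(7675, 5000))  # [0, 5000]
print(page_starts(46, 5000))    # [0]
```

When the number of matches is smaller than `rows`, the first and last requests are identical because a single request fetches everything.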


### Bonus exercise
Now, build your own request based on the example above, incorporating the following changes:
1. Use the `genotype-phenotype` core
2. Iterate over genes (the name of the field is `marker_symbol`)
3. Use the following list of genes: _Zfp580, Firrm, Gpld1, Mbip_
4. Modify the list of fields you request (`fl`) to include:
    - `marker_symbol`
    - `allele_symbol`
    - `parameter_stable_id`

In [20]:
# Genes example
genes = ["Zfp580","Firrm","Gpld1","Mbip"]

# Initial query parameters
params = {
    'q': "*:*",
    'fl': 'marker_symbol,allele_symbol,parameter_stable_id',
    'field_list': genes,
    'field_type': "marker_symbol"
}
iterator_solr_request(core='genotype-phenotype', params=params, filename='marker_symbol', format='csv')

Queried field: marker_symbol:("Zfp580" OR "Firrm" OR "Gpld1" OR "Mbip")
Your first request: https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=%2A%3A%2A&fl=marker_symbol%2Callele_symbol%2Cparameter_stable_id&fq=marker_symbol%3A%28%22Zfp580%22+OR+%22Firrm%22+OR+%22Gpld1%22+OR+%22Mbip%22%29&wt=json&start=0&rows=5000
Your last request: https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select?q=%2A%3A%2A&fl=marker_symbol%2Callele_symbol%2Cparameter_stable_id&fq=marker_symbol%3A%28%22Zfp580%22+OR+%22Firrm%22+OR+%22Gpld1%22+OR+%22Mbip%22%29&wt=json&start=0&rows=5000
Number of found documents: 46

File marker_symbol.csv was created.
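
As a quick check of the downloaded file, the sketch below counts how many rows were returned for each gene. It uses only the standard library. The function name `count_rows_by_field` is hypothetical, and the example assumes that `marker_symbol.csv` from the cell above is in the working directory.

```python
import csv
from collections import Counter
from pathlib import Path


def count_rows_by_field(path, field):
    """Count how many rows of a CSV file share each value of `field`."""
    with open(path, newline="") as f:
        return Counter(row[field] for row in csv.DictReader(f))


# Only print the summary if the file created above exists
csv_path = Path("marker_symbol.csv")
if csv_path.exists():
    for symbol, n_rows in count_rows_by_field(csv_path, "marker_symbol").most_common():
        print(symbol, n_rows)
```

The counts should add up to the number of found documents reported in the output above.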
