# International Mouse Phenotyping Consortium (IMPC) Data API Workshop
Welcome to our workshop! In this session, we'll guide you through using the Apache Solr API to access IMPC data. After that, we will focus on the `phenodigm` core. By the end, you should be able to construct Solr queries to extract IMPC datasets with confidence. Get ready for hands-on exercises and real-world examples to reinforce your skills!

For more information about the IMPC, visit our [website](https://www.mousephenotype.org/).
Other useful links:
- Workshop [repository](https://github.com/mpi2/impc-data-api-workshop/tree/main) with all materials
- IMPC Solr cores [documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/)
- International Mouse Phenotyping Resource of Standardised Screens | [IMPReSS](https://www.mousephenotype.org/impress/index)
- The Genome Targeting Repository | [GenTaR](https://www.gentar.org/tracker/#/)

# Set up
Let's start! First, we need to import the Python libraries and set up a helper function.
### Helper functions
Execute the cell below by following these steps:
1. Select the cell by clicking into it.
2. Execute the code by pressing the ▷ play button above.
3. Alternatively, use the Ctrl + Enter hotkey to execute the code.

In [None]:
from IPython.display import display
from tqdm import tqdm
from urllib.parse import unquote

import csv
import pandas as pd
import requests

# Display at most 15 rows and all columns of a dataframe
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', None)

# Create helper function
def solr_request(core, params, silent=False):
    """Performs a single Solr request.

    Args:
        core: Name of the Solr core to query.
        params: Solr query parameters, such as `q`, `fl`, `start` and `rows`.
        silent: Suppress displaying the df and number of results (useful for batch requests).

    Returns:
        num_found: How many rows in total did the request match.
        df: A Pandas dataframe with a portion of the request matching `start` and `rows` parameters.
    """
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    response = requests.get(solr_url, params=params)
    if not silent:
        print(f"\nYour request:\n{response.request.url}\n")
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        num_found = data["response"]["numFound"]
        if not silent:
            print(f'Number of found documents: {num_found}\n')
        # Extract and add search results to the list
        search_results = []
        for doc in data["response"]["docs"]:
            search_results.append(doc)
    
        # Convert the list of dictionaries into a DataFrame and print the DataFrame
        df = pd.DataFrame(search_results)
        if not silent:
            display(df)
        return num_found, df
    
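    else:
        print("Error:", response.status_code, response.text)
        # Return a pair so that `num_found, df = solr_request(...)` does not fail on unpacking.
        return None, None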

### Example query
We will use the `solr_request` function to access IMPC data through the Solr API. Run the cell below and inspect the result.

In [None]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': '*:*',  # Your query, '*:*' retrieves all documents
        'rows': 10,  # Number of rows to retrieve
        'fl': 'marker_symbol,allele_symbol,parameter_stable_id',  # Fields to retrieve
    }
)

Let's take a look at the output of the helper function. You can see the following:
1. The submitted request, which you can open in a browser by clicking the link.
2. The total number of documents that match the query (this can be larger than the number of rows returned).
3. A table with the results of your query. At most 15 rows are displayed.
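
To make the first item more concrete, here is a minimal sketch of how a `params` dictionary becomes the query string of the request link. It uses only the standard library and does not send a request. `build_select_url` is a hypothetical name introduced for illustration; the helper above lets `requests` do the encoding, so the exact percent-escaping in its printed link may differ slightly.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr/"  # same base URL as in solr_request


def build_select_url(core, params):
    """Return the /select URL for a core with params encoded as a query string.

    Hypothetical helper for illustration only; it does not contact the server.
    """
    return f"{BASE_URL}{core}/select?{urlencode(params)}"


url = build_select_url(
    "genotype-phenotype",
    {"q": "*:*", "rows": 10, "fl": "marker_symbol,allele_symbol"},
)
print(url)

# Decoding the query string recovers the original parameter values as strings.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"], decoded["rows"], decoded["fl"])
```

Special characters such as `*`, `:` and `,` are percent-encoded in the link, which is why the printed request can look different from the values you typed.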

Let's get started with the exercises!

# Exercise block A

### Exercise 1: Getting Familiar with the Core
We will be working with the `genotype-phenotype` core. To familiarise yourself with the data, request 3 rows and all fields from this core.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **67,660**.

In [None]:
num_found, df = solr_request(
    core=...,
    params={
        'q': ...,
        'rows': ...,
    }
)

### Exercise 2: Selecting Specific Fields
As you can see, there are a lot of fields. To focus on the fields we need, request only the following ones:
- marker_symbol
- marker_accession_id
- zygosity
- parameter_name
- parameter_stable_id
- p_value
<br>

Modify the query from Exercise 1 to request only the fields listed above.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **67,660**.

In [None]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': ...,
        'rows': ...,
        'fl': ...
    }
)

# Exercise block B
### Exercise 3: Filtering by Single Field
Let's now focus on a particular gene. In this example we will be using *Dclk1*. Modify the query from Exercise 2 so that only documents for this gene are displayed.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **13**.

In [None]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': ...,
        'rows': ...,
        'fl': 'marker_symbol,marker_accession_id,zygosity,parameter_name,parameter_stable_id,p_value'
    }
)

### Exercise 4: Filtering Numerical Values and Applying Multiple Filters
In addition to the `marker_symbol` filter, let's apply a stricter p-value threshold, so that the p-value is less than 1e-4.
<br>
Modify the query from Exercise 3 and display **10 rows** instead of 3.
<br>
Note: Field names may be spelled differently from the everyday term.
<br>For example, **`p_value`** is the name of the field in Solr, whereas **"p-value"** is the term used in everyday language.
<br>
<br>
If you complete the exercise successfully, your total number of documents will be **5**.
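
As a reminder of the query syntax involved, the sketch below assembles a Solr query string from an exact-match clause and a numeric range clause joined with `AND`. The helper names (`field_equals`, `field_range`, `all_of`) are hypothetical and are not part of the workshop helpers, and the field values are chosen for illustration rather than being the answer to this exercise. In Solr's standard query syntax, `[` and `]` include a range endpoint and `{` and `}` exclude it.

```python
def field_equals(field, value):
    """Exact-match clause; the value is quoted so spaces and colons are safe."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def field_range(field, low="*", high="*", include_high=True):
    """Range clause; `*` means unbounded and `}` excludes the upper bound."""
    closing = "]" if include_high else "}"
    return f"{field}:[{low} TO {high}{closing}"


def all_of(*clauses):
    """Join clauses so that a document must satisfy every one of them."""
    return " AND ".join(clauses)


# Illustrative values only: homozygote documents with a p-value below 0.05.
query = all_of(
    field_equals("zygosity", "homozygote"),
    field_range("p_value", high=0.05, include_high=False),
)
print(query)
```

The resulting string can be passed as the `q` parameter of `solr_request`.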

In [None]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': ...,
        'rows': ...,
        'fl': 'marker_symbol,marker_accession_id,zygosity,parameter_name,parameter_stable_id,p_value'
    }
)

### Exercise 5: Search for parameter stable ID in the IMPReSS and answer questions
Follow the steps below and answer the question:
1. Navigate to the [IMPReSS website](https://www.mousephenotype.org/impress/index).
2. Search for the parameter_stable_id `IMPC_GRS_010_001`, which is referenced in Exercise 4.
3. Examine the procedure associated with the pipeline key `IMPC_001` by clicking on the "View procedure" button.
<br>
- In which week was the specimen tested?

# Exercise block C

### Exercise 6: Downloading data in chunks
We will use the `batch_request` function to download data in chunks. Execute the cell below to define it.

In [None]:
def batch_request(core, params, batch_size):
    """Calls `solr_request` multiple times with `params` to retrieve results in chunks of `batch_size` rows at a time."""
    if "rows" in params:
        print("WARN: You have specified the `params` -> `rows` value. It will be ignored, because the data is retrieved `batch_size` rows at a time.")
    # Determine the total number of rows. Note that we do not request any data (rows = 0).
    num_results, _ = solr_request(core=core, params={**params, "start": 0, "rows": 0}, silent=True)
    # Initialise everything for data retrieval.
    start = 0
    chunks = []
    # Request chunks until we have complete data.
    with tqdm(total=num_results) as pbar:  # Initialize tqdm progress bar.
        while start < num_results:
            # Request chunk. We don't need num_results anymore because it does not change.
            _, df_chunk = solr_request(core=core, params={**params, "start": start, "rows": batch_size}, silent=True)
            # Record chunk.
            chunks.append(df_chunk)
            # Update progress bar with the number of rows actually received.
            pbar.update(len(df_chunk))
            # Increment start.
            start += batch_size
    # Prepare final dataframe.
    return pd.concat(chunks, ignore_index=True)
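
The paging logic in `batch_request` can be sketched without contacting the server. The hypothetical helper `batch_windows` below lists the `start` offsets that would be requested and how many rows each request is expected to return. The totals used here (250 documents, batches of 100) are made-up numbers for illustration.

```python
def batch_windows(num_results, batch_size):
    """Yield (start, expected_rows) pairs that together cover num_results documents."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, num_results, batch_size):
        # The last window may be shorter than batch_size.
        yield start, min(batch_size, num_results - start)


windows = list(batch_windows(num_results=250, batch_size=100))
print(windows)
```

Note that `batch_request` always asks for `batch_size` rows; on the last request Solr simply returns the documents that remain.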

First, let's construct a query for the cardiovascular system. The example below requests data that meets the following conditions:
- The `top_level_mp_term_name` field is `cardiovascular system phenotype`.
- The `effect_size` field is not null.
- The `life_stage_name` field is `Late adult`.
<br>
<br>

Run the `solr_request` call below and look at the results.

In [None]:
num_found, df = solr_request(
    core='genotype-phenotype',
    params={
        'q': 'top_level_mp_term_name:"cardiovascular system phenotype" AND effect_size:[* TO *] AND life_stage_name:"Late adult"',
        'fl': 'allele_accession_id,life_stage_name,marker_symbol,mp_term_name,p_value,parameter_name,parameter_stable_id,phenotyping_center,statistical_method,top_level_mp_term_name,effect_size'
    }
)

Download the data for the request above by completing the `batch_request` call below. Set the `batch_size` parameter to 100.

In [None]:
# Request dataframe in chunks.
df = batch_request(
    core="genotype-phenotype",
    params={
        'q': ...,
        'fl': ...
    },
    batch_size=...
)

In [None]:
# Save dataframe to JSON (lines) format for subsequent work. This will contain a single self-contained JSON record per line.
df.to_json("impc_data.json", orient="records", lines=True)

In [None]:
# We can also save as CSV, but note that fine structure such as lists and nested data will be lost.
df.to_csv("impc_data.csv", index=False)
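
To illustrate why a line-per-record JSON file keeps list-valued fields intact, here is a small standard-library sketch. The two records are invented for illustration (they are not IMPC data), and an in-memory buffer stands in for a file on disk.

```python
import io
import json

# Invented records; the second field holds a list, as multi-valued Solr fields do.
records = [
    {"marker_symbol": "GeneA", "top_level_mp_term_name": ["term one", "term two"]},
    {"marker_symbol": "GeneB", "top_level_mp_term_name": ["term three"]},
]

# Write one JSON object per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the records back line by line; lists come back as lists.
lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]
print(len(lines), restored == records)
```

Because each line is a complete JSON document, such a file can also be processed one record at a time without loading everything into memory.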

# Exercise block D

### Exercise 7: Faceting Query
We will use the `facet_request` function to run a faceting query. Execute the cell below to define it.

In [None]:
def facet_request(core, params, silent=False):
    """Performs a single Solr faceting request.

    Args:
        core: Name of the Solr core to query.
        params: Solr query parameters. Must include `facet.field`.
        silent: Suppress displaying the df and number of results.

    Returns:
        num_found: How many rows in total did the request match.
        df: A Pandas dataframe with one row per value of `facet.field` and the number of documents for that value.
    """
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    response = requests.get(solr_url, params=params)
    if not silent:
        print(f"\nYour request:\n{unquote(response.request.url)}\n")
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()
        num_found = data["response"]["numFound"]
        if not silent:
            print(f'Number of found documents: {num_found}\n')
        # Extract and add faceting query results to the list
        facet_counts = data["facet_counts"]["facet_fields"][params["facet.field"]]
        # Initialize an empty dictionary
        faceting_dict = {}
        # Iterate over the list, taking pairs of elements
        for i in range(0, len(facet_counts), 2):
            # Assign label as key and count as value
            label = facet_counts[i]
            count = facet_counts[i + 1]
            faceting_dict[label] = [count]
        
        # Convert the dictionary into a DataFrame with one row per category
        df = pd.DataFrame.from_dict(faceting_dict, orient='index', columns=['counts']).reset_index()

        # Rename the columns
        df.columns = [params["facet.field"], 'count_per_category']
        if not silent:
            display(df)
        return num_found, df
    
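    else:
        print("Error:", response.status_code, response.text)
        # Return a pair so that `num_found, df = facet_request(...)` does not fail on unpacking.
        return None, None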

In this exercise we will again query the whole core. We want to count how many documents there are for each value of the `zygosity` field. Modify the query below to get this information.
<br>
<br>
If you complete the exercise successfully, the number of documents in the `homozygote` category will be **52,606**.
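
Solr returns the counts for a facet field as one flat list that alternates labels and counts, which is why `facet_request` walks the list two items at a time. The sketch below shows the same pairing on a mock response. `pair_facets`, the field name `example_field` and the counts are all made up for illustration.

```python
def pair_facets(flat):
    """Turn [label, count, label, count, ...] into {label: count}."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items (label/count pairs)")
    # Even positions hold labels, odd positions hold the matching counts.
    return dict(zip(flat[0::2], flat[1::2]))


# Mock of the relevant part of a Solr response; values are invented.
mock_response = {
    "facet_counts": {
        "facet_fields": {"example_field": ["a", 3, "b", 2, "c", 1]}
    }
}

counts = pair_facets(mock_response["facet_counts"]["facet_fields"]["example_field"])
print(counts)
```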

In [None]:
num_found, df = facet_request(
    core='genotype-phenotype',
    params={
        'q': '*:*',
        'rows': 0,
        'facet': 'on',
        'facet.field': ...,
        'facet.limit': 15,
        'facet.mincount': 1
    }
)

# Exercises: Phenodigm Core

Now we will go over the disease section of the IMPC website. The `phenodigm` core contains all of the information regarding human-mouse disease associations.

Some useful links are:
- [IMPC disease models summary](https://www.mousephenotype.org/help/data-visualization/gene-pages/disease-models/)
- [IMPC disease associations](https://www.mousephenotype.org/help/data-analysis/disease-associations/)
- [Disease Models Portal](https://diseasemodels.research.its.qmul.ac.uk)
- [OMIM](https://www.omim.org/)
- [Orphanet](https://www.orpha.net)
- [SOLR Phenodigm core documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/phenodigm.html)

### Exercise 1: Getting Familiar with the Phenodigm Core
Get familiar with the structure of the phenodigm core queries. Write a query to retrieve 5 rows of the `disease_model_summary` type.

If you complete the exercise successfully, your total number of documents will be **8,444,376**.


In [None]:
num_found, df = solr_request(
    core=...,
    params={
        'q': ...,
        'rows': ...
    }
)

### Exercise 2: Filtering by Disease

When filtering by disease, we can use the following fields:
- `disease_term`
- `disease_id`


Refer to the [documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/phenodigm.html) and select the most appropriate field to filter for the disease *[Robinow Syndrome](https://www.omim.org/entry/618529?search=618529&highlight=618529)*.

How many documents are there on *Robinow Syndrome*?

If you complete the exercise successfully, your total number of documents will be **11,655**.

In [None]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': ...,
        'rows': 5,
    }
)

### Exercise 3: Filtering Mouse Models Related to a Disease
Retrieve the first 5 rows related to mouse models associated with *Robinow Syndrome*.

Use the following fields to obtain the model information:
- `model_id`
- `disease_term`
- `disease_id`
- `marker_id`
- `marker_symbol`
- `model_source`
- `model_genetic_background`

If you complete the exercise successfully, your total number of documents will be **11,655** and they will only display **7 fields**.


In [None]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary AND disease_term:"Robinow Syndrome"',
        'rows': 5,
        'fl': ...,
    }
)

## Advanced Exercises

### Exercise 4: Calculate the Phenodigm score

In this exercise, we will calculate the Phenodigm score, sort the results by this score, and filter the documents based on gene-disease curation.

1. Retrieve all diseases related to the mouse gene *[Nxn](https://www.mousephenotype.org/data/genes/MGI:109331)* (MGI:109331)
2. Select the following fields:
    - `marker_id`
    - `model_id`
    - `disease_id`
    - `disease_term`
    - `disease_model_avg_norm`
    - `disease_model_max_norm`
    - `association_curated`
3. Filter results to keep only those where `association_curated` is `true`.
4. Calculate their **Phenodigm scores** by adding the following to your last field:
    -  `phenodigm_score:div(sum(disease_model_avg_norm,disease_model_max_norm),2)`
5. Sort the results in descending order by the calculated Phenodigm score by passing the following to the `sort` parameter:
    - `div(sum(disease_model_avg_norm,disease_model_max_norm),2)`

        **HINT**: do not include the text `phenodigm_score:` in the `sort` parameter, as this will produce an error.

If you complete the exercise successfully, your total number of documents will be **48**.
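
The expression `div(sum(disease_model_avg_norm,disease_model_max_norm),2)` is simply the mean of the two normalised scores. The sketch below reproduces the calculation and the descending sort locally. The model identifiers and score values are invented for illustration and are not real Phenodigm results.

```python
def phenodigm_score(avg_norm, max_norm):
    """Mean of the two normalised scores, mirroring div(sum(avg, max), 2)."""
    return (avg_norm + max_norm) / 2


# Invented rows; real values come from the phenodigm core.
rows = [
    {"model_id": "model-1", "disease_model_avg_norm": 60.0, "disease_model_max_norm": 80.0},
    {"model_id": "model-2", "disease_model_avg_norm": 90.0, "disease_model_max_norm": 100.0},
    {"model_id": "model-3", "disease_model_avg_norm": 40.0, "disease_model_max_norm": 50.0},
]

for row in rows:
    row["phenodigm_score"] = phenodigm_score(
        row["disease_model_avg_norm"], row["disease_model_max_norm"]
    )

# Highest score first, as with `sort: '... desc'` in the Solr request.
ranked = sorted(rows, key=lambda row: row["phenodigm_score"], reverse=True)
print([(row["model_id"], row["phenodigm_score"]) for row in ranked])
```

Letting Solr compute and sort by the score, as in the exercise, avoids downloading all documents before ranking them.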

In [None]:
num_found, df = solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary AND marker_id:"MGI:109331" AND association_curated:true',
        'fl': 'marker_id,model_id,disease_id,disease_term,disease_model_avg_norm,disease_model_max_norm,association_curated,phenodigm_score:div(sum(disease_model_avg_norm,disease_model_max_norm),2)',
        'sort':'div(sum(disease_model_avg_norm,disease_model_max_norm),2) desc'
    }
)


### Exercise 5: Iterate over a list of diseases or models/genes
Now, we will define another helper function, `iterator_solr_request`. This function will request information for a list of values of a given `field`. This is particularly useful when we want to request data for specific models or genes.

Execute the helper functions below:

In [None]:
import json

# Helper function to fetch results. This function is used by the 'iterator_solr_request' function.
def entity_iterator(base_url, params):
    """Generator function to fetch results from the SOLR server in chunks using pagination

    Args:
        base_url (str): The base URL of the Solr server to fetch documents from.
        params (dict): A dictionary of parameters to include in the GET request. Must include
                       'start' and 'rows' keys, which represent the index of the first document
                       to fetch and the number of documents to fetch per request, respectively.

    Yields:
        dict: The next document in the response from the Solr server.
    """
    # Initialise variable to check the first request
    first_request = True

    # Call the API in chunks and yield the documents in each chunk
    while True:
        response = requests.get(base_url, params=params)
        data = response.json()
        docs = data["response"]["docs"]

        # Print the first request only
        if first_request:
            print(f'Your first request: {response.url}')
            first_request = False

        # Yield the documents in the current chunk
        for doc in docs:
            yield doc

        # Check if there are more results to fetch
        start = params["start"] + params["rows"]
        num_found = data["response"]["numFound"]
        if start >= num_found:
            break

        # Update the start parameter for the next request
        params["start"] = start

    # Print last request and total number of documents retrieved
    print(f'Your last request: {response.url}')
    print(f'Number of found documents: {data["response"]["numFound"]}\n')

# Function to iterate over field list and write results to a file.
def iterator_solr_request(core, params, filename='iteration_solr_request', format='json'):
    """Function to fetch results in batches from the Solr API and write them to a file
        Defaults to fetching 5000 rows at a time.

    Args:
        core (str): The name of the Solr core to fetch results from.
        params (dict): A dictionary of parameters to use in the request. Must include
                       'field_list' and 'field_type' keys, which represent the list of values to fetch
                       (e.g., a list of MGI model identifiers) and the name of the field to filter on
                       (e.g., model_id), respectively.
        filename (str): The name of the file to write the results to. Defaults to 'iteration_solr_request'.
        format (str): The format of the output file. Can be 'csv' or 'json'. Defaults to 'json'.
    """

    # Validate format
    if format not in ['json','csv']:
        raise ValueError("Invalid format. Please use 'json' or 'csv'")
    
    # Base URL
    base_url = "https://www.ebi.ac.uk/mi/impc/solr/"
    solr_url = base_url + core + "/select"

    # Extract field_list and field_type from params
    field_list = params.pop("field_list")
    field_type = params.pop("field_type")

    # Construct the filter query with the grouped field values
    fq = "{}:({})".format(
        field_type, " OR ".join(['"{}"'.format(value) for value in field_list])
    )

    # Show users the field and field values they passed to the function
    print("Queried field:",fq)
    # Set internal params the users should not change
    params["fq"] = fq
    params["wt"] = 'json'
    params["start"]=0 # Start at the first result
    params["rows"]=5000 # Fetch results in chunks of 5000


    try:
        # Fetch results using a generator function
        results_generator = entity_iterator(solr_url, params)
    except Exception as e:
        raise Exception("An error occurred while downloading the data: " + str(e))

    # Append extension to the filename
    filename = f"{filename}.{format}"

    try:
        # Open the file in write mode
        with open(filename, "w", newline="") as f:
            if format == 'csv':
                writer = None
                for item in results_generator:
                    # Initialize the CSV writer with the keys of the first item as the field names
                    if writer is None:
                        writer = csv.DictWriter(f, fieldnames=item.keys())
                        writer.writeheader()
                    # Write the item to the CSV file
                    writer.writerow(item)
            elif format == 'json':
                # Write to json without loading to memory.
                f.write('[')
                for i, item in enumerate(results_generator):
                    if i != 0:
                        f.write(',')
                    json.dump(item, f)
                f.write(']')
    except Exception as e:
        raise Exception("An error occurred while writing the file: " + str(e))

    print(f"File {filename} was created.")

Here is how to use it:
- New information required from the user:
    - `field_list`: A list of values to look for in the chosen field (e.g., MGI ids, OMIM ids, gene ids).
    - `field_type`: The name of the field to filter on (e.g., `model_id`, `disease_id`, `marker_id`).
    - `filename`: The name of the file you wish to save the data in.
    - `format`: The format in which you want to save the data (e.g., `json`, `csv`).
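
Internally, `field_list` and `field_type` are combined into a single Solr filter query (`fq`) in which the quoted values are joined with `OR`. The standalone sketch below mirrors that step so you can preview the filter before sending a request. `build_filter_query` is a hypothetical name used only here.

```python
def build_filter_query(field_type, field_list):
    """Return a filter query matching any of the given values in one field."""
    if not field_list:
        raise ValueError("field_list must contain at least one value")
    # Quote each value so that characters such as ':' are treated literally.
    quoted_values = " OR ".join(f'"{value}"' for value in field_list)
    return f"{field_type}:({quoted_values})"


fq = build_filter_query("model_id", ["MGI:3587188", "MGI:3587185"])
print(fq)
```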

In [None]:
# List of model IDs.
models = ["MGI:3587188","MGI:3587185","MGI:3605874","MGI:2668213"]

# Call iterator function
iterator_solr_request(
    core='phenodigm',
    params={
        'q': 'type:disease_model_summary',
        'fl': 'model_id,marker_id,disease_id',
        'field_list': models,
        'field_type': 'model_id'
    },
    filename='model_ids',
    format='json'
)

### Bonus exercise
Now, build your own request based on the example above, incorporating the following changes:
1. Use the `genotype-phenotype` core
2. Iterate over genes (the name of the field is `marker_symbol`)
3. Use the following list of genes: _Zfp580, Firrm, Gpld1, Mbip_
4. Modify the list of fields you request (`fl`) to include:
    - `marker_symbol`
    - `allele_symbol`
    - `parameter_stable_id`

In [None]:
# Genes example
genes = ["Zfp580","Firrm","Gpld1","Mbip"]

# Initial query parameters
params = {
    'q': "*:*",
    'fl': 'marker_symbol,allele_symbol,parameter_stable_id',
    'field_list': ...,
    'field_type': ...
}
iterator_solr_request(core=..., params=params, filename=..., format=...)