# Autism Paper Data
This notebook starts, right after the preparation step, with an example of how to customise a query. The example covers:

- Filtering the data, for example by center.
- Choosing available fields.
- Selecting the output format (CSV or JSON).

To download the data, follow Steps 0-2. In Step 3, choose one of two options:

- Option 3a. Download all the data for each `parameter_stable_id` in a single file.
- Option 3b. Download the data in separate files, with one file for each `parameter_stable_id`.

## 0. Preparation

In [1]:
# Import libraries.
import os
import pandas as pd 
from impc_api import solr_request, batch_solr_request

In [2]:
# Create directory for output files.
output_dir = "output_files/"

try:
    os.mkdir(output_dir)
    print(f"Directory '{output_dir}' created successfully.")
except FileExistsError:
    print(f"Directory '{output_dir}' already exists.")
except PermissionError:
    print(f"Permission denied: Unable to create '{output_dir}'.")
except Exception as e:
    print(f"An error occurred: {e}")

Directory 'output_files/' already exists.
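As a more compact alternative to the `try`/`except` cell above, the directory can be created idempotently with `os.makedirs(..., exist_ok=True)`. The sketch below is an illustration only: it works in a temporary directory (the `base_dir` and `demo_dir` names are not part of the notebook) so that it does not touch real output files. A `PermissionError` would still be raised if the location is not writable.

```python
import os
import tempfile

# Use a throwaway parent directory so the sketch has no side effects on real data.
base_dir = tempfile.mkdtemp()
demo_dir = os.path.join(base_dir, "output_files")

# exist_ok=True means no FileExistsError is raised if the directory is already there.
os.makedirs(demo_dir, exist_ok=True)
os.makedirs(demo_dir, exist_ok=True)  # Second call is a no-op rather than an error.

print(os.path.isdir(demo_dir))
```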


# Example of Customising a Query
1. **Define the Dataset Requirements:** For this example, let's use a small dataset with specific filters:
- `parameter_stable_id:IMPC_OFD_020_001`
- `production_center:HMGU`
- `colony_id:H5183-HEPD0503_2_F12-1`
  
2. **Select Fields to Download:** Choose specific fields from the `experiment` core to download. For instance:
- `observation_id`
- `data_point`
- `metadata`

You can view all available fields in the [documentation](https://www.ebi.ac.uk/mi/impc/solrdoc/). 

3. **Set the Output Format:** We recommend using JSON for the output format (remove `wt` from the `params` dictionary in the example to default to JSON). CSV is an option too, but note that it flattens structured data like lists and nested fields, which you’ll find in the `metadata` field (labeled as an Array type in the documentation).
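To illustrate why CSV flattens structured fields, here is a minimal sketch using only the standard library. The record is hypothetical (its values are made up and only the field names come from this example), and joining the list with a comma is an assumption chosen for illustration. It shows that a list survives a JSON round trip but comes back from CSV as a single string.

```python
import csv
import io
import json

# Hypothetical record shaped like an `experiment` document; the values are made up.
record = {
    "observation_id": 1,
    "data_point": 12.5,
    "metadata": ["Key A = value 1", "Key B = value 2"],
}

# JSON round trip: the list is preserved as a list.
from_json = json.loads(json.dumps(record))

# CSV round trip: the list must be collapsed into one cell (comma-joined here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow({**record, "metadata": ",".join(record["metadata"])})
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_json["metadata"]).__name__)  # list
print(type(from_csv["metadata"]).__name__)   # str
```

After the CSV round trip, the individual `metadata` entries can only be recovered by splitting the string again, which is ambiguous if an entry itself contains the separator.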

In [3]:
# Save as CSV.
df = batch_solr_request(
    core='experiment',
    params={
        'q': 'parameter_stable_id: IMPC_OFD_020_001 AND production_center:HMGU AND colony_id:H5183-HEPD0503_2_F12-1',
        'fl': 'observation_id,data_point,metadata',
        'wt': 'csv'
    },
    download=True,
    filename=output_dir + 'test_example'
)

# Save as JSON (recommended).
df = batch_solr_request(
    core='experiment',
    params={
        'q': 'parameter_stable_id: IMPC_OFD_020_001 AND production_center:HMGU AND colony_id:H5183-HEPD0503_2_F12-1',
        'fl': 'observation_id,data_point,metadata'
    },
    download=True,
    filename=output_dir + 'test_example'
)

Number of found documents: 20
Downloading file...

5000it [00:00, 46509.94it/s]

Your request URL after the last call:https://www.ebi.ac.uk/mi/impc/solr/experiment/select?q=parameter_stable_id%3A+IMPC_OFD_020_001+AND+production_center%3AHMGU+AND+colony_id%3AH5183-HEPD0503_2_F12-1&fl=observation_id%2Cdata_point%2Cmetadata&wt=csv&start=0&rows=5000
File saved as: output_files/test_example.csv
Reading downloaded file...

Number of found documents: 20
Downloading file...

5000it [00:00, 45760.37it/s]

Your request URL after the last call:https://www.ebi.ac.uk/mi/impc/solr/experiment/select?q=parameter_stable_id%3A+IMPC_OFD_020_001+AND+production_center%3AHMGU+AND+colony_id%3AH5183-HEPD0503_2_F12-1&fl=observation_id%2Cdata_point%2Cmetadata&start=0&rows=5000&wt=json
File saved as: output_files/test_example.json
Reading downloaded file...

In [4]:
# Reload the downloaded files as pandas DataFrames.
df1 = pd.read_csv("output_files/test_example.csv")
df2 = pd.read_json("output_files/test_example.json")

## 1. Preprocessing the File with `parameter_stable_id`s

In [5]:
# Read the Excel file into a DataFrame.
excel_table = "Parameters_IMPC_request_ASD_October2024.xlsx"
df = pd.read_excel(excel_table)

# Remove rows where 'ID' column has NaN values.
df_cleaned = df.dropna(subset=["ID"])

# Keep only parameter rows: rows with a value in "Test" hold a procedure_stable_id in "ID".
df_parameters = df_cleaned[df_cleaned['Test'].isna()]

# Add the body weight parameter_stable_id ("Test" and "Parameter" are dropped below).
weight_data = {"ID": "IMPC_BWT_001_001", "Test": "NaN", "Parameter": "Body weight"}
df_parameters = pd.concat([df_parameters, pd.DataFrame([weight_data])], ignore_index=True)

# Drop unneeded columns.
df_parameters = df_parameters.drop(columns=["Test", "Parameter"])

# Show the dataframe.
display(df_parameters)

| | ID |
|---|---|
| 0 | IMPC_OFD_020_001 |
| 1 | IMPC_OFD_021_001 |
| 2 | IMPC_OFD_022_001 |
| 3 | IMPC_OFD_019_001 |
| 4 | IMPC_GRS_008_001 |
| ... | ... |
| 45 | IMPC_LDT_015_001 |
| 46 | IMPC_LDT_006_001 |
| 47 | IMPC_LDT_007_001 |
| 48 | IMPC_LDT_008_001 |
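The filtering logic in this step can be checked on a toy table. The sketch below assumes a layout inferred from the code above: procedure rows carry a value in `Test`, parameter rows leave `Test` empty, and some rows have no `ID` at all. The rows, the `toy*` names, and the labels are invented for illustration and are not taken from the real spreadsheet.

```python
import pandas as pd

# Hypothetical rows mimicking the assumed spreadsheet layout.
toy = pd.DataFrame(
    {
        "ID": ["IMPC_OFD_001", "IMPC_OFD_020_001", None, "IMPC_OFD_021_001"],
        "Test": ["Open Field", None, None, None],
        "Parameter": [None, "Example parameter A", None, "Example parameter B"],
    }
)

# Same two steps as in the notebook cell: drop rows without an ID,
# then keep only the rows whose "Test" cell is empty.
toy_cleaned = toy.dropna(subset=["ID"])
toy_parameters = toy_cleaned[toy_cleaned["Test"].isna()]

toy_ids = toy_parameters["ID"].tolist()
print(toy_ids)
```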


## 2. Get `parameter_stable_id`s as a List

In [6]:
parameters = df_parameters["ID"].values.tolist()
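Spreadsheet columns sometimes contain stray whitespace or repeated entries. As an optional sanity step, the list could be normalised before querying. The sketch below uses a made-up `raw_ids` list rather than the real `parameters` list, and whether the real file needs this cleanup is an assumption.

```python
# Hypothetical input with stray whitespace and one duplicate.
raw_ids = [" IMPC_OFD_020_001", "IMPC_OFD_021_001 ", "IMPC_OFD_020_001"]

# Strip whitespace, then de-duplicate while keeping first-seen order
# (dict keys preserve insertion order).
clean_ids = list(dict.fromkeys(pid.strip() for pid in raw_ids))

print(clean_ids)
```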

# 3. Download the Data
## 3a. Download All the Data for Each `parameter_stable_id` in a Single File

In [None]:
# Download data for each parameter_stable_id in one file.
df = batch_solr_request(
    core='experiment',
    params={
        'q': '*:*',
        'field_list': parameters,
        'field_type': 'parameter_stable_id'
    },
    download=True,
    filename=output_dir + 'all'
)
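How `impc_api` turns `field_list` and `field_type` into a request is not shown in this notebook. For intuition only, the same selection can be written in plain Solr query syntax as a single field with several alternatives joined by `OR`. The helper `build_or_query` below is hypothetical and is not part of `impc_api`.

```python
def build_or_query(field, values):
    """Return a Solr/Lucene clause that matches any of the given values."""
    quoted = " OR ".join(f'"{value}"' for value in values)
    return f"{field}:({quoted})"


example_query = build_or_query(
    "parameter_stable_id", ["IMPC_OFD_020_001", "IMPC_BWT_001_001"]
)
print(example_query)
```

A string of this form could in principle be passed as `q`, although for long lists the `field_list` approach above keeps the request easier to read.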

## 3b. Download the Data in Separate Files, with One File for Each `parameter_stable_id`

In [None]:
# Download data for each parameter_stable_id in a separate file.
for parameter_stable_id in df_parameters["ID"]:
    parameter = parameter_stable_id.strip()
    df = batch_solr_request(
        core='experiment',
        params={
            'q': 'parameter_stable_id:' + parameter
        },
        download=True,
        filename=output_dir + parameter
    )
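With one file per parameter, an interrupted run can be resumed by skipping parameters that already have a file on disk. The sketch below is one possible approach, not part of the notebook: `pending_parameters` is a hypothetical helper, and the `.json` extension assumes the default JSON output seen in the example above. It runs against a temporary directory so no real downloads are needed.

```python
import os
import tempfile


def pending_parameters(parameter_ids, directory, extension=".json"):
    """Return the IDs that do not yet have a downloaded file in `directory`."""
    return [
        pid
        for pid in parameter_ids
        if not os.path.exists(os.path.join(directory, pid + extension))
    ]


# Simulate a partially completed run: one file exists, one does not.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "IMPC_OFD_020_001.json"), "w").close()

todo = pending_parameters(["IMPC_OFD_020_001", "IMPC_OFD_021_001"], demo_dir)
print(todo)
```

The loop in 3b could then iterate over the returned list instead of the full set of IDs.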