# Counting KOMP2 generated colonies

We retrieved the list of KOMP2 generated colonies from iMits. Using this list, count how many of the following are included in the current live data release:
* Lines
* Genes
* Datapoints
* Images

In [73]:
import json
import requests

BASE_URL = """https://www.ebi.ac.uk/mi/impc/solr"""
MUTANT_DATA_URL = BASE_URL + """/experiment/select?q=colony_id:"{}"&rows=1&fq=biological_sample_group:experimental"""
MUTANT_IMAGES_URL = BASE_URL + """/impc_images/select?q=colony_id:"{}"&rows=0"""

CONTROL_DATA_URL = BASE_URL + """/experiment/select?q=(project_name:BaSH OR project_name:DTCC OR project_name:JAX) AND biological_sample_group:control AND datasource_name:IMPC AND -pipeline_name:"MGP Select Pipeline"&rows=0"""
CONTROL_IMAGES_URL = BASE_URL + """/impc_images/select?q=(project_name:BaSH OR project_name:DTCC OR project_name:JAX) AND biological_sample_group:control AND datasource_name:IMPC AND -pipeline_name:"MGP Select Pipeline"&rows=0"""
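
The URL templates above place the colony ID directly in the query string, so reserved characters such as `&` have to be escaped by hand. As a rough sketch of an alternative, the standard library's `urllib.parse.urlencode` can percent-encode the whole query. The helper name `build_select_url` and the colony ID `A&B` are hypothetical; only the base URL, the `experiment` core, and the field names come from the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr"


def build_select_url(core, query, rows=0, filter_query=None):
    """Return a Solr select URL with every parameter percent-encoded.

    Hypothetical helper, not part of the original notebook.
    """
    params = [("q", query), ("rows", rows)]
    if filter_query is not None:
        params.append(("fq", filter_query))
    return "{}/{}/select?{}".format(BASE_URL, core, urlencode(params))


# "A&B" is a made-up colony ID used only to show the escaping
url = build_select_url(
    "experiment",
    'colony_id:"A&B"',
    rows=1,
    filter_query="biological_sample_group:experimental",
)
print(url)
```

With this approach the quotes, colon, and ampersand inside the query are all encoded, so no per-character `replace` call is needed.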


In [74]:
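with open("KOMP2_colonies.tsv") as colony_file:
    # The colony name is in the second tab-separated column; [1:] drops the header row
    colonies = [line.split("\t")[1].strip().upper() for line in colony_file][1:]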
print("File has {} colonies".format(len(colonies)))

File has 4659 colonies
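
Splitting each line on tabs works for a clean export. A slightly more defensive sketch, assuming the colony name sits in the second column as in the cell above, uses the `csv` module and drops blank or repeated entries. The function name `read_colonies` and the sample rows are hypothetical, and an in-memory buffer stands in for `KOMP2_colonies.tsv`.

```python
import csv
import io


def read_colonies(handle, column=1):
    """Return unique, upper-cased colony names from a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    seen = set()
    colonies = []
    for row in reader:
        if len(row) <= column:
            continue  # skip short or empty rows
        name = row[column].strip().upper()
        if name and name not in seen:
            seen.add(name)
            colonies.append(name)
    return colonies


# Made-up sample data standing in for KOMP2_colonies.tsv
sample = io.StringIO(
    "gene\tcolony\n"
    "GeneA\tbl4255 \n"
    "GeneB\tJR27689\n"
    "\n"
    "GeneA\tBL4255\n"
)
print(read_colonies(sample))
```

Note that removing duplicates would lower the reported colony count if the real file contains repeated rows, so the figure may differ from the one printed above.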


In [79]:

missing_colonies = set()
lines = set()
genes = set()
data_points = 0
images = 0

for colony in colonies:

    # Escape "&" so it is not read as a query-parameter separator
    data_url = MUTANT_DATA_URL.format(colony.replace("&", "%26"))

    # Try the request up to 5 times before giving up on this colony
    retries = 0
    while retries < 5:
        data = requests.get(data_url)
        if data.status_code == 200:
            break
        print("Error retrieving data for colony: {}, URL: {}".format(colony, data_url))
        retries = retries + 1

    if data.status_code != 200:
        print("Giving up on colony: {} after {} attempts".format(colony, retries))
        continue

    # Parse the response once and reuse it
    response = data.json()['response']
    num_found = response['numFound']
    data_points = data_points + num_found

    if num_found > 0:
        genes.add(response['docs'][0]['gene_accession_id'])
        lines.add(colony)

        image_url = MUTANT_IMAGES_URL.format(colony.replace("&", "%26"))
        image_data = requests.get(image_url)
        if image_data.status_code != 200:
            print("Error retrieving image data for colony: {}, URL: {}".format(colony, image_url))
        else:
            num_images_found = image_data.json()['response']['numFound']
            images = images + num_images_found

        if len(lines)%50 == 0:
            print("So far, found {} colonies with data in DR".format(len(lines)))

    else:

        missing_colonies.add(colony)
        if len(missing_colonies)%50 == 0:
            print("So far, found {} colonies missing from DR".format(len(missing_colonies)))
    




So far, found 50 colonies with data in DR
So far, found 100 colonies with data in DR
So far, found 150 colonies with data in DR
So far, found 200 colonies with data in DR
So far, found 50 colonies missing from DR
So far, found 250 colonies with data in DR
So far, found 300 colonies with data in DR
So far, found 100 colonies missing from DR
So far, found 150 colonies missing from DR
So far, found 350 colonies with data in DR
So far, found 200 colonies missing from DR
So far, found 250 colonies missing from DR
So far, found 300 colonies missing from DR
So far, found 350 colonies missing from DR
So far, found 400 colonies missing from DR
So far, found 450 colonies missing from DR
So far, found 500 colonies missing from DR
So far, found 550 colonies missing from DR
So far, found 600 colonies missing from DR
So far, found 400 colonies with data in DR
So far, found 650 colonies missing from DR
So far, found 700 colonies missing from DR
So far, found 750 colonies missing from DR
So far, found
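
The retry logic in the loop above can be sketched as a small reusable function. This is an illustration under assumptions, not the notebook's own code: `fetch_num_found`, its `get` parameter, and the optional pause between attempts are all hypothetical. The `get` argument is any callable shaped like `requests.get`, which lets the example run against a stand-in rather than the live Solr service.

```python
import time


def fetch_num_found(url, get, max_attempts=5, pause_seconds=0.0):
    """Return Solr's numFound for url, or None if every attempt fails.

    `get` is any callable shaped like requests.get: it takes a URL and
    returns an object with a `status_code` attribute and a `json()` method.
    """
    for _ in range(max_attempts):
        response = get(url)
        if response.status_code == 200:
            return response.json()["response"]["numFound"]
        time.sleep(pause_seconds)  # optional pause before the next attempt
    return None


class FakeResponse:
    """Stand-in for a requests.Response, used only in this example."""

    def __init__(self, status_code, num_found=0):
        self.status_code = status_code
        self._num_found = num_found

    def json(self):
        return {"response": {"numFound": self._num_found, "docs": []}}


# Simulated service: fails twice with HTTP 503, then succeeds
replies = iter([FakeResponse(503), FakeResponse(503), FakeResponse(200, 42)])
print(fetch_num_found("http://example.invalid/select", lambda url: next(replies)))
```

In the notebook itself, `requests.get` could be passed as `get`. Returning `None` on failure keeps a failed request distinguishable from a colony that genuinely has zero data points.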

In [80]:

data = requests.get(CONTROL_DATA_URL)
num_found = data.json()['response']['numFound']
total_data_points = data_points + num_found


image_data = requests.get(CONTROL_IMAGES_URL)
num_images_found = image_data.json()['response']['numFound']
total_images = images + num_images_found


print("* Lines: {}".format(len(lines)))
print("* Genes: {}".format(len(genes)))
print("* Datapoints: {}".format(total_data_points))
print("* Images: {}".format(total_images))

print("*"*80)
print("There are {} KOMP2 colonies ({} missing) in the DR. List of colonies in DR: \n{}".format(len(lines), len(missing_colonies), "\n".join(lines)))
print("List of missing colonies in DR: \n{}".format("\n".join(missing_colonies)))


* Lines: 1635
* Genes: 1621
* Datapoints: 11258304
* Images: 81744
********************************************************************************
There are 1635 KOMP2 colonies (3024 missing) in the DR. List of colonies in DR: 
CR10037
BL4255
JR27689
CR1478
BL3674
JR30377
BL2461
JR26034
H-VPS35-DEL611INS4-EM1-B6N
CR1335
CR1636
BL3289
BL5466
TCPR0755_ADIJ
JR26914
BL3229
BL4203
JR30530
JR31319
TCPR0502_ACZQ
JR31401
JR31318
JR31130
JR29148
JR31037
BL3800
JR31784
JR32024
BL5300
JR29395
BL2529
JR30003
CR1635
CR1567
BL5322
JR31310
JR31384
CR1280
BL3001
JR29244
JR30916
JR28540
BL3256
BL4047
JR30083
CR10032
CR10090
JR30430
JR30138
BL5419
BL1512
JR28893
BL4206
CR10031
CR1656
TCPR0708_ADGZ
TCPR1038_ADTR
TCPR1025_ADRX
CR1539
CR1074
TCPR0398_ACTG
BL5138
JR30071
JR30521
TCPR0692_ADHH
JR29158
H-UNC13C-DEL209-EM1-B6N
JR29384
CR1767
TCPR1051_ADSL
JR31215
TCPR0758_ADIP
BL4479
JR31930
TCPR0443_ACXK
JR27202
CR1100
CR10136
CR10064
TCPR0366_ACSF
BL2790
CR1629
BL5417
JR31068
BL2259
TCPR0946_ADPY
H-C4B-DEL5