# Counting KOMP2 generated colonies

We retrieved the list of KOMP2-generated colonies from iMits. Using this list, count how many of the following are included in the current live data release:
* Lines
* Genes
* Datapoints
* Images

In [87]:
import json
import requests

BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr"
MUTANT_DATA_URL = BASE_URL + '/experiment/select?q=colony_id:"{}"&rows=1&fq=biological_sample_group:experimental'
MUTANT_IMAGES_URL = BASE_URL + '/impc_images/select?q=colony_id:"{}"&rows=0'

# Control query shared by the experiment and impc_images cores.
CONTROL_QUERY = '(project_name:BaSH OR project_name:DTCC OR project_name:JAX) AND biological_sample_group:control AND datasource_name:IMPC AND -pipeline_name:"MGP Select Pipeline"'
CONTROL_DATA_URL = BASE_URL + "/experiment/select?q=" + CONTROL_QUERY + "&rows=0"
CONTROL_IMAGES_URL = BASE_URL + "/impc_images/select?q=" + CONTROL_QUERY + "&rows=0"

# SOURCE_FILE = "KOMP2_colonies.tsv"
SOURCE_FILE = "DCC_colonies.tsv"
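
The loop further down escapes only `&` by hand when it formats a colony name into these URL templates. As a minimal sketch, assuming the same Solr endpoint, the standard library's `urllib.parse.quote` can percent-encode the whole query value instead. `build_colony_url` and its parameters are hypothetical names for illustration and are not part of the original notebook.

```python
from urllib.parse import quote

# Assumption: same Solr base URL as in the cell above.
BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr"


def build_colony_url(core, colony_id, rows=0, filter_query=None):
    """Return a select URL for one colony with the query value percent-encoded."""
    # Wrap the colony ID in a Solr phrase, escaping embedded backslashes and quotes.
    phrase = '"{}"'.format(colony_id.replace("\\", "\\\\").replace('"', '\\"'))
    query = "colony_id:" + phrase
    # safe="" makes quote() encode every reserved character, including "&" and ":".
    url = "{}/{}/select?q={}&rows={}".format(BASE_URL, core, quote(query, safe=""), rows)
    if filter_query:
        url += "&fq=" + quote(filter_query, safe="")
    return url


print(build_colony_url("experiment", "A&B", rows=1,
                       filter_query="biological_sample_group:experimental"))
```

The encoded URL looks different from the hand-built one (`:` and `"` are escaped too), but the server should decode it to the same query string.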


In [88]:
# Read the colony name from the second column, skipping the header row.
with open(SOURCE_FILE) as source:
    colonies = [line.split("\t")[1].strip().upper() for line in source][1:]
print("File has {} colonies".format(len(colonies)))

File has 3890 colonies
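
The one-line list comprehension above fails with an `IndexError` if the file contains a blank or short row. A possible alternative, sketched here with the standard `csv` module, skips such rows. The function name `read_colonies` and the sample data are hypothetical, and the column position is assumed to match the cell above (second column, header on the first row).

```python
import csv
import io


def read_colonies(handle, column=1):
    """Return upper-cased colony names from a tab-separated file handle."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    colonies = []
    for row in reader:
        # Ignore short or blank rows instead of raising IndexError.
        if len(row) > column and row[column].strip():
            colonies.append(row[column].strip().upper())
    return colonies


# Hypothetical sample standing in for DCC_colonies.tsv.
sample = io.StringIO("gene\tcolony\nAbc1\tjr12345 \nXyz2\tBL0001\n\n")
print(read_colonies(sample))
```

With a real file, the call would be `read_colonies(open(SOURCE_FILE, newline=""))` inside a `with` block.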


In [91]:

missing_colonies = set()
lines = set()
genes = set()
data_points = 0
images = 0

for colony in colonies:

    # "&" must be percent-encoded, otherwise it is read as a query-parameter separator.
    data_url = MUTANT_DATA_URL.format(colony.replace("&", "%26"))

    # Try the request up to 5 times before giving up on this colony.
    retries = 0
    while retries < 5:
        data = requests.get(data_url)
        if data.status_code == 200:
            break
        print("Error retrieving data for colony: {}, URL: {}".format(colony, data_url))
        retries = retries + 1

    if data.status_code != 200:
        print("Skipping colony {} after {} failed attempts".format(colony, retries))
        continue
        
    # Parse the response body once and reuse it.
    response = data.json()['response']
    num_found = response['numFound']
    data_points = data_points + num_found

    if num_found > 0:
        genes.add(response['docs'][0]['gene_accession_id'])
        lines.add(colony)

        image_url = MUTANT_IMAGES_URL.format(colony.replace("&", "%26"))
        image_data = requests.get(image_url)
        if image_data.status_code != 200:
            print("Error retrieving image data for colony: {}, URL: {}".format(colony, image_url))
        else:
            num_images_found = image_data.json()['response']['numFound']
            images = images + num_images_found

        if len(lines)%100 == 0:
            print("So far, found {} colonies with data in DR".format(len(lines)))

    else:

        missing_colonies.add(colony)
        if len(missing_colonies)%100 == 0:
            print("So far, found {} colonies missing from DR".format(len(missing_colonies)))
    




So far, found 100 colonies with data in DR
So far, found 200 colonies with data in DR
So far, found 300 colonies with data in DR
So far, found 400 colonies with data in DR
So far, found 500 colonies with data in DR
So far, found 600 colonies with data in DR
So far, found 700 colonies with data in DR
So far, found 100 colonies missing from DR
So far, found 800 colonies with data in DR
So far, found 900 colonies with data in DR
So far, found 1000 colonies with data in DR
So far, found 1100 colonies with data in DR
So far, found 200 colonies missing from DR
So far, found 1200 colonies with data in DR
So far, found 1300 colonies with data in DR
So far, found 1400 colonies with data in DR
So far, found 1500 colonies with data in DR
So far, found 1600 colonies with data in DR
So far, found 1700 colonies with data in DR
So far, found 1800 colonies with data in DR
So far, found 1900 colonies with data in DR
So far, found 2000 colonies with data in DR
So far, found 2100 colonies with data in DR
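
The retry loop in the cell above could be factored into a small helper. The following is only a sketch: `fetch_with_retries`, `FakeResponse`, and `make_flaky_fetch` are hypothetical names, and the fake response exists solely so the helper can be exercised without network access. The sketch checks only the status code and does not catch connection exceptions.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=5, backoff_seconds=0.0):
    """Call fetch(url) until the response has status_code 200.

    Returns the successful response, or None after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts and backoff_seconds:
            time.sleep(backoff_seconds * attempt)  # linear backoff between attempts
    return None


class FakeResponse:
    """Hypothetical stand-in for a requests response, used only for the demo."""

    def __init__(self, status_code):
        self.status_code = status_code


def make_flaky_fetch(failures):
    """Return a fetch function that fails `failures` times, then succeeds."""
    calls = {"count": 0}

    def fetch(url):
        calls["count"] += 1
        return FakeResponse(500 if calls["count"] <= failures else 200)

    return fetch, calls


fetch, calls = make_flaky_fetch(2)
result = fetch_with_retries(fetch, "http://example.invalid/select")
print(result.status_code, calls["count"])
```

In the notebook itself, `requests.get` could be passed as `fetch`, and a `None` result would mean the colony is skipped.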

In [93]:

data = requests.get(CONTROL_DATA_URL)
num_found = data.json()['response']['numFound']
total_data_points = data_points + num_found


image_data = requests.get(CONTROL_IMAGES_URL)
num_images_found = image_data.json()['response']['numFound']
total_images = images + num_images_found


print("* Lines: {}".format(len(lines)))
print("* Genes: {}".format(len(genes)))
print("* Datapoints: {}".format(total_data_points))
print("* Images: {}".format(total_images))

print("*"*80)
print("There are {} KOMP2 colonies ({} missing) in the DR. List of colonies in DR: \n{}".format(len(lines), len(missing_colonies), "\n".join(lines)))
print("List of missing colonies in DR: \n{}".format("\n".join(missing_colonies)))


* Lines: 3389
* Genes: 3286
* Datapoints: 21120806
* Images: 181869
********************************************************************************
There are 3389 KOMP2 colonies (501 missing) in the DR. List of colonies in DR: 
MUAN
JR27295
CR10037
BL4255
JR28457
PH21753
JR27689
MRTTB
JR24897
CR1478
MAWLB
MTGOB
MUEX
BL3674
JR28474
JR30377
BL2461
JR22095
JR26034
JR28475
H-VPS35-DEL611INS4-EM1-B6N
JR27207
JR24946
JR28125
POLAB
CR1335
JR27556
CR1636
BL3289
BL5466
H-PKD2L2-A07-TM1B
MUDN
JR18655
JR18616
TCPR0755_ADIJ
BL3229
JR26914
BL4203
JR30530
JR31319
TCPR0502_ACZQ
JR31401
JR31130
JR31318
UBEBB
JR29148
ET8329
ET5625
JR31037
JR27625
JR26541
BL3800
JR31784
JR32024
MUHG
BL5300
SLTOB
JR29395
BL2529
JR27656
JR30003
BL5322
CR1567
CR1635
MUFH
MUDF
BL3001
ANTOB
CR1280
H-SNRNP200-E01-TM1B
JR31310
JR31384
FABMB
MUAQ
JR29244
H-LRRK1-C01-TM1B
H-GRIK1-FG-G07-TM1B
JR30916
BL3256
JR28540
JR25573
BL4047
JR23604
JR30083
CR10032
CR10090
JR28601
BL5419
JR30138
JR30430
JR21752
BL1512
FOXJB
JR28893
BL4206
M