# Get Samples information from MOLGENIS

This notebooks is used to develop a framework for getting sample information for processing from MOLGENIS. 

Depending on processing-type, different strategies are required. 

## scRNA-seq (cellranger count)

Cellranger expects read-files organized in the following structure: 

`[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz`

Depending on the naming scheme, `[Sample Name]` consists one or both of the MOLGENIS/OTP attributes 
`PATIENT_ID` and `Sample Type`. 

## Retrieving information


To get from `library` labels to the read files data has to be aggregated from: 

- `library` ( -> `readfile_set`)
- `readfileset`
- `readsfile` ( -> `readfile_set`, `file`)
- `file`

Based on user input, libraries are selected by label, or individual (*xref*) and the resulting `readfile_set` IDs are returned. 

In a reverse lookup, `readsfile` items are selected using `readfile_set` IDs, and then a list of files is selected from `file`. 




In [286]:
import yaml
import getpass
from snakemake.io import _load_configfile
from workflow.scripts import db_molgenis
import pandas as pd

In [287]:
config = _load_configfile("../config/config.yaml")

molgenis_config = config["molgenis"]

In [288]:
molgenis_password = getpass.getpass()

········


In [289]:
session = db_molgenis.init_session(molgenis_config["host"], molgenis_config["user"], molgenis_password)

In [400]:
entities = ["library", "readfileset", "readsfile", "file"]
expansion_lookup = {
    "library": ["individual","sample_type"],
    "readsfile": ["file_id"],
}
data = dict(
    (
        entity_label,
        db_molgenis.get_table_data(
            session=session,
            table_name=f"{molgenis_config['project']}_{entity_label}",
            num=100, 
            expand=",".join(expansion_lookup.get(entity_label,[]))
        )
    ) for entity_label in entities
)

dfs = dict(map(lambda entity_label: (entity_label, pd.json_normalize(data[entity_label])),
              entities))

#libraries = db_molgenis.get_table_data(session, f"{molgenis_config['project']}_library")
# = db_molgenis.get_table_data(session, f"{molgenis_config['project']}_library")




In [402]:
data["library"]

[{'_href': '/api/v2/OE0538_DO0008_library/aaaadawvgbgfwascvqkqabqaae',
  'id': 'aaaadawvgbgfwascvqkqabqaae',
  'label': 'OE0538_DO-0008_mmus_young1-bonemarrow-01',
  'individual': {'_href': '/api/v2/OE0538_DO0008_individual/aaaadavwimq34ascvqkqabqaae',
   'id': 'aaaadavwimq34ascvqkqabqaae',
   'label': 'OE0538_DO-0008_mmus_young1',
   'otp_id': 'OTPIND-',
   'organism': {'_href': '/api/v2/ontologies_organism/obo:NCBITaxon_10090',
    'id': 'obo:NCBITaxon_10090',
    'label': 'Mus musculus'},
   'strain': 'C57BL/6Ly5.1',
   'sex': {'_href': '/api/v2/ontologies_sex/obo:PATO_0000383',
    'id': 'obo:PATO_0000383',
    'label': 'female'},
   'individual_odom': '25706',
   'date_of_birth': '2021-05-05',
   'date_of_death': '2021-07-14',
   'phenotype': 'healthy',
   'treatment': 'none'},
  'sample_type': {'_href': '/api/v2/OE0538_DO0008_sampletype/53203222',
   'otp_id_raw': '53203222',
   'otp_id': 'OTPSAT-53203222',
   'label': 'bonemarrow-01',
   'tissue': 'bonemarrow'},
  'otp_id_raw': 

In [370]:


def _convert_dicts_to_records(dicts, dedup=False):
    _records = defaultdict(list)
    for nested_dict in dicts:
        if len(nested_dict) > 0:
            for attr, val in nested_dic.items():
                _records[f"{label}.{attr}"].append(val)
    if dedup:
        records = {key:_deduplicate_list_items(value) for key, value
                  in _records.items()}
    else:
        records = _records
    return records

def _deduplicate_list_items(items):
    if items:
        if all(map(lambda elem: type(elem) in [str, int], items)):
            deduped = list(set(items))
            if len(deduped) == 1:
                return deduped[0]
            else: 
                return deduped
        else:
            return(items)
    
                
def unpack_lists(dat):
    from collections import defaultdict
    unpacked = []
    for item in array:
        nested_item_types = {key: type(val) for key, val in item.items()}
        for n_item_name, n_item_type in nested_item_types.items():
            
            n_item = item[n_item_name]
            
            if n_item_type == dict:
                
                n_item_unpacked = {f"{label}.{key}":value for key, value in n_item.items()}
                dic.update(n_item_unpacked)
                dic.pop(n_item)
            
            if n_item_type == list:
                for n_list_item in n_item:
                    records = _convert_dicts_to_records(n_list_item, dedup=True)
                    dic.update(records)
                    dic.pop(label)
            
                unpacked.append(dic)
    return unpacked


In [371]:
dfs = dict(map(lambda entity_label:
               (
                   entity_label, pd.DataFrame.from_records(
                       unpack_lists(data[entity_label])
                   )
               ),
               entities
              )
          )

In [399]:
data["library"]

[{'_href': '/api/v2/OE0538_DO0008_library/aaaadawvgbgfwascvqkqabqaae',
  'id': 'aaaadawvgbgfwascvqkqabqaae',
  'label': 'OE0538_DO-0008_mmus_young1-bonemarrow-01',
  'otp_id_raw': 'SAM63371637-STY36774488-LPK57832687-ABT00000',
  'otp_id': 'OTP-SAM63371637-STY36774488-LPK57832687-ABT00000',
  'antibody_target_otp_id': '00000',
  'sequencing_type_otp_id': '36774488',
  'library_preparation_kit': 'Chromium Next GEM Single Cell 3 Reagent Kits v3.1',
  'library_preparation_kit_otp_id': '57832687',
  'tissue_prep_method': 'single cell',
  'individual._href': '/api/v2/OE0538_DO0008_individual/aaaadavwimq34ascvqkqabqaae',
  'individual.id': 'aaaadavwimq34ascvqkqabqaae',
  'individual.label': 'OE0538_DO-0008_mmus_young1',
  'individual.otp_id': 'OTPIND-',
  'individual.strain': 'C57BL/6Ly5.1',
  'individual.individual_odom': '25706',
  'individual.date_of_birth': '2021-05-05',
  'individual.date_of_death': '2021-07-14',
  'individual.phenotype': 'healthy',
  'individual.treatment': 'none',
  '

In [398]:
dfs["library"]#["readfile_set.id"].tolist()


Unnamed: 0,_href,id,label,otp_id_raw,otp_id,antibody_target_otp_id,sequencing_type_otp_id,library_preparation_kit,library_preparation_kit_otp_id,tissue_prep_method,...,individual.sex.id,individual.sex.label,readfile_set._href,readfile_set.id,readfile_set.otp_id,readfile_set.label,readfile_set.lane,readfile_set.indexing_barcode,readfile_set.alignment_file,readfile_set.phred_encoding
0,/api/v2/OE0538_DO0008_library/aaaadawvgbgfwasc...,aaaadawvgbgfwascvqkqabqaae,OE0538_DO-0008_mmus_young1-bonemarrow-01,SAM63371637-STY36774488-LPK57832687-ABT00000,OTP-SAM63371637-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
1,/api/v2/OE0538_DO0008_library/aaaadawvgbgyoasc...,aaaadawvgbgyoascvqkqabqaae,OE0538_DO-0008_mmus_young1-bonemarrow-02,SAM63372406-STY36774488-LPK57832687-ABT00000,OTP-SAM63372406-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
2,/api/v2/OE0538_DO0008_library/aaaadawvgbhamasc...,aaaadawvgbhamascvqkqabqaae,OE0538_DO-0008_mmus_young1-bonemarrow-03,SAM63373175-STY36774488-LPK57832687-ABT00000,OTP-SAM63373175-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
3,/api/v2/OE0538_DO0008_library/aaaadawvgbhfaasc...,aaaadawvgbhfaascvqkqabqaae,OE0538_DO-0008_mmus_old2-bonemarrow-01,SAM63369236-STY36774488-LPK57832687-ABT00000,OTP-SAM63369236-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
4,/api/v2/OE0538_DO0008_library/aaaadawvgbhmqasc...,aaaadawvgbhmqascvqkqabqaae,OE0538_DO-0008_mmus_old2-bonemarrow-02,SAM63370005-STY36774488-LPK57832687-ABT00000,OTP-SAM63370005-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
5,/api/v2/OE0538_DO0008_library/aaaadawvgbhrwasc...,aaaadawvgbhrwascvqkqabqaae,OE0538_DO-0008_mmus_old2-bonemarrow-03,SAM63370774-STY36774488-LPK57832687-ABT00000,OTP-SAM63370774-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
6,/api/v2/OE0538_DO0008_library/aaaadawvgbhvaasc...,aaaadawvgbhvaascvqkqabqaai,OE0538_DO-0008_mmus_old1-bonemarrow-01,SAM63366922-STY36774488-LPK57832687-ABT00000,OTP-SAM63366922-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
7,/api/v2/OE0538_DO0008_library/aaaadawvgbi2masc...,aaaadawvgbi2mascvqkqabqaae,OE0538_DO-0008_mmus_OLD_1-bonemarrow-02,SAM60521756-STY36774488-LPK57832687-ABT00000,OTP-SAM60521756-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
8,/api/v2/OE0538_DO0008_library/aaaadawvgbibkasc...,aaaadawvgbibkascvqkqabqaae,OE0538_DO-0008_mmus_old1-bonemarrow-02,SAM63367697-STY36774488-LPK57832687-ABT00000,OTP-SAM63367697-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33
9,/api/v2/OE0538_DO0008_library/aaaadawvgbifaasc...,aaaadawvgbifaascvqkqabqaae,OE0538_DO-0008_mmus_old1-bonemarrow-03,SAM63368466-STY36774488-LPK57832687-ABT00000,OTP-SAM63368466-STY36774488-LPK57832687-ABT00000,0,36774488,Chromium Next GEM Single Cell 3 Reagent Kits v3.1,57832687,single cell,...,obo:PATO_0000383,female,/api/v2/OE0538_DO0008_readfileset/aaaadawvftmc...,aaaadawvftmcgascvqkqabqaai,60530465,LR-56383_1-TCTCAGTG,[{'_href': '/api/v2/OE0538_DO0008_lane/aaaadav...,TCTCAGTG,[[]],Phred+33


In [397]:
dfs["readfileset"].id #[dfs["readfileset"].id.isin(dfs["library"]["readfile_set.id"].tolist())]

0     aaaadawvfp32cascvqkqabqaae
1     aaaadawvfp4iaascvqkqabqaae
2     aaaadawvfp4u6ascvqkqabqaae
3     aaaadawvfp57mascvqkqabqaae
4     aaaadawvfp5dsascvqkqabqaae
                 ...            
95    aaaadawvfrfoiascvqkqabqaae
96    aaaadawvfrgdsascvqkqabqaai
97    aaaadawvfrgv4ascvqkqabqaae
98    aaaadawvfrh6qascvqkqabqaae
99    aaaadawvfrhb2ascvqkqabqaae
Name: id, Length: 100, dtype: object

In [60]:
df = pd.concat(dfs["library"])
dfs["readsfile"]

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"