# Introduction
See discussion at https://github.com/malariagen/fits/issues/62

Here I am trying to determine whether any of the files I identify through mlwh queries are missing from subtrack. If so, does Magnus see these in "production FITS" but not in "Subtrack FITS"?


In [1]:
%run ../setup.ipynb

python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
numpy 1.13.3
scipy 0.19.1
pandas 0.22.0
numexpr 2.6.4
pysam 0.8.4
pysamstats 0.24.3
petlx 1.0.3
vcf 0.6.8
h5py 2.7.0
tables 3.4.2
zarr 2.2.0
scikit-allel 1.1.10


In [2]:
# Inputs
mlwh_missing_exceptions_fn = '/nfs/team112_internal/rp7/src/github/malariagen/SIMS/meta/mlwh/mlwh_missing_exceptions.txt'
mlwh_study_exceptions_fn = '/nfs/team112_internal/rp7/src/github/malariagen/SIMS/meta/mlwh/mlwh_study_exceptions.txt'
mlwh_sample_exceptions_fn = '/nfs/team112_internal/rp7/src/github/malariagen/SIMS/meta/mlwh/mlwh_sample_exceptions.txt'
mlwh_taxon_exceptions_fn = '/nfs/team112_internal/rp7/src/github/malariagen/SIMS/meta/mlwh/mlwh_taxon_exceptions.txt'
sequencescape_alfresco_study_mappings_fn='/nfs/team112_internal/rp7/src/github/malariagen/SIMS/meta/mlwh/sequencescape_alfresco_study_mappings.txt'
input_dir = 'dummy' # only using this so I can reuse the script I wrote for Pf 6.2, i.e. https://github.com/malariagen/parasite-ops/blob/master/work/18_Pf_6_2_irods_manifest/20181023_%2318_Pf_6_2_irods_manifest.ipynb

# Create iRODS manifest for all MalariaGEN Pf samples

In [3]:
def set_irods_name(row):
    # First calculate prefix using nonhuman dependent on id_run and species (taxon)
    # Then calculate tag string (empty string if no tag)
    # Then calculate file extension
    # Then concat these three strings
    if (
        (row['taxon_id'] in [7165, 7173, 30066, 62324]) and # Anopheles gambiae, arabiensis, merus and funestus respectively
        (row['id_run'] <= 6750)
    ):
        prefix = "%d_%d_nonhuman" % (row['id_run'], row['position'])
    elif (
        (row['taxon_id'] not in [7165, 7173, 30066, 62324]) and # i.e. not Anopheles gambiae, arabiensis, merus or funestus
        (row['id_run'] <= 7100)
    ):
        prefix = "%d_%d_nonhuman" % (row['id_run'], row['position'])
    else:
        prefix = "%d_%d" % (row['id_run'], row['position'])
        
    if np.isnan(row['tag_index']):
        tag_string = ''
    else:
        tag_string = '#%d' % row['tag_index']
    
    if row['id_run'] <= 10300:
        file_extension = '.bam'
    else:
        file_extension = '.cram'
    
    irods_filename = prefix + tag_string + file_extension
    return(irods_filename)

def set_sample_id(row):
    if row['sample'] is None:
        return('DS_%s' % row['name'])
    else:
        return('DS_%s' % row['sample'])

def is_valid_alfresco_code(s):
    try: 
        int(s[0:4])
        return True
    except ValueError:
        return False

def create_build_manifest(
#     taxon_ids=[5833, 5855, 5858, 5821, 7165, 7173, 30066], # Pf, Pv, Pm, Pb, Ag, A. arabiensis, A. merus
    taxon_ids=[5833, 36329, 5847, 57267, 137071], # Pf, 3D7, V1, Dd2, HB3
    # 1089 is excluded as this is R&D samples. 1204 and 1176 are excluded as these are CP1 samples. 1175 is excluded as some Pf R&D samples were incorrectly tagged with this study. 1157 is a Pv study - there are two suspected Pf samples in this study, but we need further investigation, and possible study change both in ROMA and in Sequencescape/mlwh/ENA/iRODS before we can include
    studies_to_exclude = ['1089-R&D-GB-ALCOCK', '1204-PF-GM-CP1', '1176-PF-KE-CP1', '1175-VO-KH-STLAURENT', '1157-PV-MULTI-PRICE'],
    miseq_runs_to_allow = [13809, 13810], # Two MiSeq runs with 24 samples on each from study 1095-PF-TZ-ISHENGOMA that were included in Pf 6.0
    input_dir = input_dir,
    mlwh_missing_exceptions_fn = mlwh_missing_exceptions_fn,
    mlwh_study_exceptions_fn = mlwh_study_exceptions_fn,
    mlwh_sample_exceptions_fn = mlwh_sample_exceptions_fn,
    mlwh_taxon_exceptions_fn = mlwh_taxon_exceptions_fn,
    sequencescape_alfresco_study_mappings_fn = sequencescape_alfresco_study_mappings_fn,
    output_columns = [
        'path',
        'study',
        'sample',
        'lane',
        'reads',
        'paired',
        'irods_path',
        'sanger_sample_id',
        'taxon_id',
        'study_lims',
        'study_name',
        'id_run',
        'position',
        'tag_index',
        'qc_complete',
        'manual_qc',
        'description',
        'instrument_name',
        'instrument_model',
        'forward_read_length',
        'requested_insert_size_from',
        'requested_insert_size_to',
        'human_percent_mapped',
        'subtrack_filename',
        'subtrack_files_bytes',
        'ebi_run_acc'
    ]
):
    """Create a DataFrame to be used as a build manifest.
    
    A build manifest here is a list of all the "lanelets" that need to be
    included in a build. The output will typically be written to a
    tab-delimited file, either to use as input to vr-pipe, or perhaps as
    input to a vr-track DB.

    Args:
        taxon_ids (list of int): The taxon IDs of the build (P. falciparum is 5833,
            P. vivax is 5855).
        sequencescape_alfresco_study_mappings_fn (str): filename of
            mappings from Sequencescape to Alfresco studies

    Returns:
        pd.DataFrame: The build manifest.

    """
    
    # Read in taxon exceptions file
    df_mlwh_taxon_exceptions = pd.read_csv(mlwh_taxon_exceptions_fn, sep='\t', dtype={'tag_index': str, 'taxon_id': int}, index_col='irods_filename')
    df_mlwh_taxon_exceptions['tag_index'].fillna('', inplace=True)
    # Identify which samples in exceptions file match the taxon of current build, and which don't
    df_mlwh_exceptions_this_taxon = df_mlwh_taxon_exceptions.loc[df_mlwh_taxon_exceptions['taxon_id'].isin(taxon_ids)]
    df_mlwh_exceptions_other_taxa = df_mlwh_taxon_exceptions.loc[~(df_mlwh_taxon_exceptions['taxon_id'].isin(taxon_ids))]

    # Read in sample exceptions file
    df_mlwh_sample_exceptions = pd.read_csv(mlwh_sample_exceptions_fn, sep='\t', index_col='irods_filename')

    # Read in study exceptions file
    df_mlwh_study_exceptions = pd.read_csv(mlwh_study_exceptions_fn, sep='\t', index_col='irods_filename')

    # Read in missing exceptions file
    df_mlwh_missing_exceptions = pd.read_csv(mlwh_missing_exceptions_fn, sep='\t', index_col='irods_filename')
    df_mlwh_missing_exceptions['derivative_sample_id'] = 'DS_' + df_mlwh_missing_exceptions['sample_id']
    df_mlwh_missing_exceptions = df_mlwh_missing_exceptions.loc[
        df_mlwh_missing_exceptions['taxon_id'].isin(taxon_ids)
    ]

    # Read in SequenceScape-Alfresco study mappings
    df_sequencescape_alfresco_study_mappings = pd.read_csv(sequencescape_alfresco_study_mappings_fn, sep='\t', index_col='seqscape_study_id')
    sequencescape_alfresco_study_mappings_dict = df_sequencescape_alfresco_study_mappings.loc[
        :,
#         df_sequencescape_alfresco_study_mappings['build_flag']==1,
        'alfresco_study_code'
    ].to_dict()
    
    # Read in data from mlwh matching this taxon, plus samples from exceptions file matching this taxon
    conn = pymysql.connect(
        host='mlwh-db',
        user='mlwh_malaria',
        password='Solaris&2015',
        db='mlwarehouse',
        port=3435
    )
    
    sql_query = 'SELECT \
        study.name as study_name, \
        study.id_study_lims as study_lims, \
        sample.supplier_name as sample, \
        sample.name, \
        sample.sanger_sample_id, \
        sample.taxon_id, \
        iseq_product_metrics.id_run, \
        iseq_product_metrics.position, \
        iseq_product_metrics.tag_index, \
        iseq_product_metrics.num_reads, \
        iseq_product_metrics.human_percent_mapped, \
        iseq_run_lane_metrics.instrument_name, \
        iseq_run_lane_metrics.instrument_model, \
        iseq_run_lane_metrics.paired_read, \
        iseq_run_lane_metrics.qc_complete, \
        iseq_flowcell.manual_qc, \
        iseq_flowcell.requested_insert_size_from, \
        iseq_flowcell.requested_insert_size_to, \
        iseq_flowcell.forward_read_length, \
        iseq_run_status_dict.description \
    FROM \
        study, \
        iseq_flowcell, \
        sample, \
        iseq_product_metrics, \
        iseq_run_status, \
        iseq_run_lane_metrics, \
        iseq_run_status_dict \
    WHERE \
        study.id_study_tmp = iseq_flowcell.id_study_tmp and \
        iseq_flowcell.id_sample_tmp = sample.id_sample_tmp and \
        iseq_flowcell.manual_qc = 1 and \
        iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp and \
        iseq_run_status.id_run = iseq_product_metrics.id_run and \
        iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run and \
        iseq_product_metrics.position = iseq_run_lane_metrics.position and \
        iseq_run_status.iscurrent = 1 and \
        ( ( iseq_run_lane_metrics.instrument_model != "MiSeq" ) or (iseq_product_metrics.id_run in (%s)) ) and \
        iseq_run_status.id_run_status_dict = iseq_run_status_dict.id_run_status_dict and \
        study.faculty_sponsor = "Dominic Kwiatkowski" and \
        ( ( sample.taxon_id in (%s) ) or ' % (
            ', '.join([str(x) for x in miseq_runs_to_allow]),
            ', '.join([str(x) for x in taxon_ids])
        )
    sql_query = sql_query + ' or '.join(
        df_mlwh_taxon_exceptions.loc[df_mlwh_taxon_exceptions['taxon_id'].isin(taxon_ids)].apply(
            lambda x: '(iseq_product_metrics.id_run="%s" and iseq_product_metrics.position="%s" and iseq_product_metrics.tag_index="%s")' % (x['id_run'], x['position'], x['tag_index']),
            1
        )
    )
    sql_query = sql_query + ')'
#         (iseq_flowcell.manual_qc = 1 or iseq_flowcell.manual_qc is null) and \
    
    df_return = pd.read_sql(sql_query, conn)

    # Replace missing taxon_id with -1 (can't have missing int values in pandas Series)
    df_return['taxon_id'] = df_return['taxon_id'].fillna(-1).astype('int32')

    # Determine file name in iRods
    df_return['irods_filename'] = df_return.apply(set_irods_name, 1)
    df_return.set_index('irods_filename', inplace=True)

    # Determine alfresco study and change any incorrect studies
    df_return['alfresco_study_code'] = df_return['study_lims'].apply(
        lambda x: sequencescape_alfresco_study_mappings_dict[int(x)] if int(x) in sequencescape_alfresco_study_mappings_dict else ''
    )
    empty_alfresco_study_code = (df_return['alfresco_study_code'] == '')
    df_return.loc[empty_alfresco_study_code, 'alfresco_study_code'] = df_return.loc[empty_alfresco_study_code, 'study_name']
    
    study_exception_indexes = df_mlwh_study_exceptions.index[
        df_mlwh_study_exceptions.index.isin(df_return.index)
    ]
    df_return.loc[study_exception_indexes, 'alfresco_study_code'] = df_mlwh_study_exceptions.loc[study_exception_indexes, 'alfresco_study_code']

    # Change any incorrect taxon IDs
    taxon_exception_indexes = df_mlwh_exceptions_this_taxon.index[
        df_mlwh_exceptions_this_taxon.index.to_series().isin(df_return.index)
    ]
    df_return.loc[taxon_exception_indexes, 'taxon_id'] = df_mlwh_exceptions_this_taxon.loc[taxon_exception_indexes, 'taxon_id']

    # Determine sample ID and change any incorrect sample IDs
    df_return['derivative_sample_id'] = df_return.apply(set_sample_id, 1)
    sample_exception_indexes = df_mlwh_sample_exceptions.index[
        df_mlwh_sample_exceptions.index.isin(df_return.index)
    ]    
    df_return.loc[sample_exception_indexes, 'derivative_sample_id'] = 'DS_' + df_mlwh_sample_exceptions.loc[sample_exception_indexes, 'sample_id']
    df_return.drop(['sample', 'name'], axis=1, inplace=True)
    
    # Remove any samples that are not in this taxon
    df_return = df_return.loc[~df_return.index.isin(df_mlwh_exceptions_other_taxa.index)]
    
    # Merge in missing exceptions
    df_return = df_return.append(df_mlwh_missing_exceptions.drop('study_group', 1))
    
    # Remove any samples that don't have an alfresco study
    invalid_codes = ~df_return['alfresco_study_code'].apply(is_valid_alfresco_code)
    if(np.count_nonzero(invalid_codes) > 0):
        print('Removing %d files with invalid alfresco_study_code:' % np.count_nonzero(invalid_codes))
        print(df_return['alfresco_study_code'].loc[invalid_codes].value_counts())
        df_return = df_return.loc[~invalid_codes]

    # Determine filename in subtrack (for very old samples, there is only a .srf file in subtrack)
    # Note we determined the id_run cutoff (5750) for bam/srf after looking at previous runs. There seems
    # to be a grey area between runs 5500 and 5750 where some files are srf and some are bam
    df_return['subtrack_filename'] = df_return.index
    df_return.loc[df_return['id_run'] < 5750, 'subtrack_filename'] = df_return.loc[df_return['id_run'] < 5750, 'subtrack_filename'].apply(lambda x: x.replace('.bam', '.srf'))
    
    # Read in data from subtrack, most importantly to get run accessions
    sql_query = "\
    SELECT \
        files.file_name as irods_filename, \
        files.bytes as subtrack_files_bytes, \
        files.timestamp as subtrack_files_timestamp, \
        files.public_date as subtrack_files_public_date, \
        submission.id as subtrack_submission_id, \
        submission.created as subtrack_submission_created, \
        submission.release_date as subtrack_submission_release_date, \
        submission.timestamp as subtrack_submission_timestamp, \
        submission.ext_db as subtrack_submission_ext_db, \
        submission.ebi_sample_acc, \
        submission.ebi_exp_acc, \
        submission.ebi_study_acc, \
        submission.ebi_sub_acc, \
        submission.ebi_run_acc \
    FROM \
        submission, \
        files \
    WHERE \
        files.sub_id = submission.id AND \
        files.file_name in (%s)\
    " % ("'" + "', '".join(df_return['subtrack_filename']) + "'")

    conn= pymysql.connect(
        host='shap',
        user='ega_dataset',
        password='ega_dataset',
        db='subtrack',
        port=3303
    )

    df_run_accessions = pd.read_sql(sql_query, conn).set_index('irods_filename')
    
    # Merge in subtrack data
    df_return = df_return.join(df_run_accessions, on='subtrack_filename', how='left')

    # Remove unwanted studies
    df_return = df_return.loc[~df_return['alfresco_study_code'].isin(studies_to_exclude)]

    # Create other columns
    df_return.rename(columns={'alfresco_study_code': 'study', 'paired_read': 'paired'}, inplace=True)
    df_return['lane'] = df_return.index.to_series().apply(lambda x: x.split('.')[0])
    df_return['path'] = df_return.apply(lambda x: "%s/%s" % (input_dir, x.name), 1)
    df_return['reads'] = df_return['num_reads'].fillna(-1).astype(int)
    df_return['irods_path'] = df_return.apply(lambda x: "/seq/%d/%s" % (x['id_run'], x.name), 1)
    df_return['sample'] = df_return['derivative_sample_id'].str.replace('DS_', '')

    # Remove lanes with zero reads
    df_return = df_return.loc[df_return['reads'] != 0]
    
    # Remove control samples (see emails from Sonia 03/05/2018 12:20 and from Richard 09/10/2018 13:26)
#     df_return = df_return.loc[df_return['sample'].str.slice(0, 1) != 'C']
    df_return = df_return.loc[(df_return['sample'].str.slice(0, 1) != 'C') & (df_return['sample'] != 'control')]
    
    # Restrict to certain columns
    if output_columns is not None:
        df_return = (
            df_return[output_columns]
            .sort_values(['study', 'sample', 'id_run', 'position', 'tag_index'])
        )
    
    print(df_return.shape)

    return(df_return) 
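
The file-naming rule in `set_irods_name` is easier to check against the tables below when it is written as a plain function of scalars. The following is a minimal sketch of the same rule, not the code used for the manifest: `irods_name` and `ANOPHELES_TAXON_IDS` are hypothetical names introduced here, and the cutoffs (6750, 7100, 10300) are copied from the cell above.

```python
import math

# Taxon IDs the notebook treats as Anopheles (gambiae, arabiensis, merus, funestus).
ANOPHELES_TAXON_IDS = {7165, 7173, 30066, 62324}


def irods_name(taxon_id, id_run, position, tag_index=None):
    """Hypothetical scalar version of set_irods_name, for illustration only."""
    # Older runs carry a "_nonhuman" suffix; the run cutoff depends on the taxon.
    cutoff = 6750 if taxon_id in ANOPHELES_TAXON_IDS else 7100
    prefix = "%d_%d" % (id_run, position)
    if id_run <= cutoff:
        prefix += "_nonhuman"
    # Untagged lanes have no "#<tag>" part (NULL in mlwh, NaN once in pandas).
    untagged = tag_index is None or (
        isinstance(tag_index, float) and math.isnan(tag_index)
    )
    tag_string = "" if untagged else "#%d" % tag_index
    # Runs up to 10300 are named .bam, later runs .cram.
    extension = ".bam" if id_run <= 10300 else ".cram"
    return prefix + tag_string + extension


print(irods_name(5833, 5554, 1))        # 5554_1_nonhuman.bam
print(irods_name(5833, 25552, 1, 7.0))  # 25552_1#7.cram
```

Both printed names correspond to `irods_filename` values that appear in the output tables further down.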


In [4]:
df_assay_data = create_build_manifest()

Removing 63 files with invalid alfresco_study_code:
AT-rich amplification                                 53
Test sequencing of high level human contamination.     6
Plasmodium falciparum genome variation 1               4
Name: alfresco_study_code, dtype: int64


  result = self._query(query)


(18130, 26)
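
Both SQL queries in `create_build_manifest` splice values directly into the query text with `%`, which breaks if a file name ever contains a quote. A possible alternative, sketched here under assumptions, is to generate one `%s` placeholder per value and let the driver do the quoting. `in_clause_query` is a hypothetical helper, not part of the notebook; the `files` table and its `file_name` and `bytes` columns are taken from the subtrack query above.

```python
def in_clause_query(template, values):
    """Expand a "{placeholders}" marker into "%s, %s, ..." and return (sql, params)."""
    values = list(values)
    if not values:
        # "IN ()" is not valid SQL, so refuse an empty list up front.
        raise ValueError("an IN () list needs at least one value")
    placeholders = ", ".join(["%s"] * len(values))
    return template.format(placeholders=placeholders), tuple(values)


sql, params = in_clause_query(
    "SELECT file_name, bytes FROM files WHERE file_name IN ({placeholders})",
    ["5554_1_nonhuman.srf", "25552_1#7.cram"],
)
print(sql)
print(params)

# With a live connection this would be passed on as (not executed here):
# df = pd.read_sql(sql, conn, params=params)
```

The `%s` placeholder style is the one pymysql uses, and `pd.read_sql` forwards `params` to the driver.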


In [5]:
pd.options.display.max_columns = 50
df_assay_data[0:3]

irods_filename,path,study,sample,lane,reads,paired,irods_path,sanger_sample_id,taxon_id,study_lims,study_name,id_run,position,tag_index,qc_complete,manual_qc,description,instrument_name,instrument_model,forward_read_length,requested_insert_size_from,requested_insert_size_to,human_percent_mapped,subtrack_filename,subtrack_files_bytes,ebi_run_acc
1276_5_nonhuman.bam,dummy/1276_5_nonhuman.bam,1001-PF-ML-DJIMDE,PM0001-C,1276_5_nonhuman,-1,,/seq/1276/1276_5_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,1276,5,,2008-09-24,,,,HiSeq,100,,,,1276_5_nonhuman.srf,5464048000.0,ERR012350
1276_6_nonhuman.bam,dummy/1276_6_nonhuman.bam,1001-PF-ML-DJIMDE,PM0001-C,1276_6_nonhuman,-1,,/seq/1276/1276_6_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,1276,6,,2008-09-24,,,,HiSeq,100,,,,1276_6_nonhuman.srf,5479595000.0,ERR012360
1333_8_nonhuman.bam,dummy/1333_8_nonhuman.bam,1001-PF-ML-DJIMDE,PM0001-C,1333_8_nonhuman,-1,,/seq/1333/1333_8_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,1333,8,,2008-09-24,,,,HiSeq,100,,,,1333_8_nonhuman.srf,5136273000.0,ERR012305


# Find files found using mlwh but not in subtrack

In [8]:
df_assay_data.loc[df_assay_data['ebi_run_acc'].isnull(), ['id_run', 'study']].groupby(['id_run', 'study']).size()

id_run  study                  
245     1032-PF-BRHN-SMITHEE        2
368     1032-PF-BRHN-SMITHEE        2
2128    1004-PF-BF-OUEDRAOGO        5
2217    1013-PF-PEGB-BRANCH         1
2575    1015-PF-KE-NZILA            1
4060    1004-PF-BF-OUEDRAOGO        3
        1016-PF-TH-NOSTEN           1
        1032-PF-BRHN-SMITHEE        1
5534    1043-PF-GB-RAYNER           1
5554    1017-PF-GH-AMENGA-ETEGO     1
5557    1017-PF-GH-AMENGA-ETEGO     1
5597    1015-PF-KE-NZILA            2
5696    1004-PF-BF-OUEDRAOGO        1
        1012-PF-KH-WHITE            1
5717    1044-PF-KH-FAIRHURST        1
10901   1155-PF-ID-PRICE            6
11019   1155-PF-ID-PRICE            2
12299   1155-PF-ID-PRICE           14
14323   1155-PF-ID-PRICE            1
14324   1155-PF-ID-PRICE            3
16166   1155-PF-ID-PRICE           36
16189   1155-PF-ID-PRICE           11
16229   1155-PF-ID-PRICE           19
25237   1155-PF-ID-PRICE           37
25552   1155-PF-ID-PRICE            3
dtype: int64

In [10]:
df_assay_data.loc[df_assay_data['ebi_run_acc'].isnull() & (df_assay_data['study'] == '1017-PF-GH-AMENGA-ETEGO')]

irods_filename,path,study,sample,lane,reads,paired,irods_path,sanger_sample_id,taxon_id,study_lims,study_name,id_run,position,tag_index,qc_complete,manual_qc,description,instrument_name,instrument_model,forward_read_length,requested_insert_size_from,requested_insert_size_to,human_percent_mapped,subtrack_filename,subtrack_files_bytes,ebi_run_acc
5554_1_nonhuman.bam,dummy/5554_1_nonhuman.bam,1017-PF-GH-AMENGA-ETEGO,PF0176-C,5554_1_nonhuman,62084346,1.0,/seq/5554/5554_1_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,5554,1,,2010-12-21 23:55:44,1.0,qc complete,IL27,HK,76,200.0,300.0,87.64,5554_1_nonhuman.srf,,
5557_1_nonhuman.bam,dummy/5557_1_nonhuman.bam,1017-PF-GH-AMENGA-ETEGO,PF0264-C,5557_1_nonhuman,60035853,1.0,/seq/5557/5557_1_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,5557,1,,2010-12-20 16:00:42,1.0,qc complete,IL9,HK,76,200.0,300.0,97.99,5557_1_nonhuman.srf,,


In [11]:
df_assay_data.loc[df_assay_data['ebi_run_acc'].isnull() & (df_assay_data['study'] == '1004-PF-BF-OUEDRAOGO')]

irods_filename,path,study,sample,lane,reads,paired,irods_path,sanger_sample_id,taxon_id,study_lims,study_name,id_run,position,tag_index,qc_complete,manual_qc,description,instrument_name,instrument_model,forward_read_length,requested_insert_size_from,requested_insert_size_to,human_percent_mapped,subtrack_filename,subtrack_files_bytes,ebi_run_acc
2128_7_nonhuman.bam,dummy/2128_7_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0061-C,2128_7_nonhuman,-1,,/seq/2128/2128_7_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,2128,7,,2009-02-20,,,,HiSeq,100,,,,2128_7_nonhuman.srf,,
2128_8_nonhuman.bam,dummy/2128_8_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0061-C,2128_8_nonhuman,-1,,/seq/2128/2128_8_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,2128,8,,2009-02-20,,,,HiSeq,100,,,,2128_8_nonhuman.srf,,
2128_2_nonhuman.bam,dummy/2128_2_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0067-C,2128_2_nonhuman,15813522,1.0,/seq/2128/2128_2_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,2128,2,,2009-02-20 15:24:50,1.0,qc complete,IL8,HK,54,300.0,600.0,98.33,2128_2_nonhuman.srf,,
2128_3_nonhuman.bam,dummy/2128_3_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0067-C,2128_3_nonhuman,15907048,1.0,/seq/2128/2128_3_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,2128,3,,2009-02-20 15:24:50,1.0,qc complete,IL8,HK,54,300.0,600.0,98.28,2128_3_nonhuman.srf,,
5696_2_nonhuman.bam,dummy/5696_2_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0067-C,5696_2_nonhuman,153418043,1.0,/seq/5696/5696_2_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,5696,2,,2011-02-01 03:44:19,1.0,qc complete,HS6,HiSeq,75,300.0,300.0,98.07,5696_2_nonhuman.srf,,
2128_6_nonhuman.bam,dummy/2128_6_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0068-C,2128_6_nonhuman,-1,,/seq/2128/2128_6_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,2128,6,,2009-02-20,,,,HiSeq,100,,,,2128_6_nonhuman.srf,,
4060_6_nonhuman.bam,dummy/4060_6_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0068-C,4060_6_nonhuman,39410448,1.0,/seq/4060/4060_6_nonhuman.bam,,5833,578,Plasmodium falciparum genome variation 1,4060,6,,2009-11-25 14:39:42,1.0,qc complete,IL12,HK,76,300.0,300.0,93.11,4060_6_nonhuman.srf,,
4060_7_nonhuman.bam,dummy/4060_7_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0068-C,4060_7_nonhuman,40244830,1.0,/seq/4060/4060_7_nonhuman.bam,,5833,546,Plasmodium falciparum Illumina sequencing R&D 1,4060,7,,2009-11-25 14:39:42,1.0,qc complete,IL12,HK,76,300.0,300.0,86.54,4060_7_nonhuman.srf,,
4060_8_nonhuman.bam,dummy/4060_8_nonhuman.bam,1004-PF-BF-OUEDRAOGO,PK0068-C,4060_8_nonhuman,36472202,1.0,/seq/4060/4060_8_nonhuman.bam,,5833,546,Plasmodium falciparum Illumina sequencing R&D 1,4060,8,,2009-11-25 14:39:42,1.0,qc complete,IL12,HK,76,300.0,300.0,88.4,4060_8_nonhuman.srf,,


In [12]:
df_assay_data.loc[df_assay_data['ebi_run_acc'].isnull() & (df_assay_data['id_run'] == 25552)]

irods_filename,path,study,sample,lane,reads,paired,irods_path,sanger_sample_id,taxon_id,study_lims,study_name,id_run,position,tag_index,qc_complete,manual_qc,description,instrument_name,instrument_model,forward_read_length,requested_insert_size_from,requested_insert_size_to,human_percent_mapped,subtrack_filename,subtrack_files_bytes,ebi_run_acc
25552_1#7.cram,dummy/25552_1#7.cram,1155-PF-ID-PRICE,SPT24658,25552_1#7,41337656,1.0,/seq/25552/25552_1#7.cram,5101STDY7247317,5833,5101,1155-PF-ID-PRICE,25552,1,7.0,2018-04-05 12:42:30,1.0,qc complete,HX7,HiSeqX,150,450.0,450.0,96.29,25552_1#7.cram,,
25552_1#10.cram,dummy/25552_1#10.cram,1155-PF-ID-PRICE,SPT24662,25552_1#10,38311728,1.0,/seq/25552/25552_1#10.cram,5101STDY7247321,5833,5101,1155-PF-ID-PRICE,25552,1,10.0,2018-04-05 12:42:30,1.0,qc complete,HX7,HiSeqX,150,450.0,450.0,95.05,25552_1#10.cram,,
25552_1#21.cram,dummy/25552_1#21.cram,1155-PF-ID-PRICE,SPT24686,25552_1#21,33226936,1.0,/seq/25552/25552_1#21.cram,5101STDY7247345,5833,5101,1155-PF-ID-PRICE,25552,1,21.0,2018-04-05 12:42:30,1.0,qc complete,HX7,HiSeqX,150,450.0,450.0,97.55,25552_1#21.cram,,


In [13]:
df_assay_data.loc[df_assay_data['ebi_run_acc'].isnull() & (df_assay_data['id_run'] == 245)]

irods_filename,path,study,sample,lane,reads,paired,irods_path,sanger_sample_id,taxon_id,study_lims,study_name,id_run,position,tag_index,qc_complete,manual_qc,description,instrument_name,instrument_model,forward_read_length,requested_insert_size_from,requested_insert_size_to,human_percent_mapped,subtrack_filename,subtrack_files_bytes,ebi_run_acc
245_7_nonhuman.bam,dummy/245_7_nonhuman.bam,1032-PF-BRHN-SMITHEE,PG0002-C,245_7_nonhuman,-1,,/seq/245/245_7_nonhuman.bam,,5833,546,Plasmodium falciparum Illumina sequencing R&D 1,245,7,,2008-03-21,,,,1G,37,,,,245_7_nonhuman.srf,,
245_8_nonhuman.bam,dummy/245_8_nonhuman.bam,1032-PF-BRHN-SMITHEE,PG0002-C,245_8_nonhuman,-1,,/seq/245/245_8_nonhuman.bam,,5833,546,Plasmodium falciparum Illumina sequencing R&D 1,245,8,,2008-03-21,,,,1G,37,,,,245_8_nonhuman.srf,,


# Interim conclusions
I had previously thought that subtrack contained only a subset of the files that can be found using mlwh queries. However, having spot-checked a few of the cases above, it seems that files lack subtrack accessions for one of two reasons:

1. We are looking for a .srf file for an older sample when subtrack actually has a .bam file.
2. The data are in subtrack, but run accession and bytes are empty because the files were not submitted to ENA (as expected for study 1155-PF-ID-PRICE).

I will now modify the query to match on the first part of file_name (i.e. the basename, which ignores the file extension) to check whether this holds universally, i.e. whether ALL files can be found in subtrack.
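
Before rewriting the query, the basename idea can be sketched in plain Python. This is only an illustration of the matching rule: `basename_without_extension` and `match_on_basename` are hypothetical helpers, and the `subtrack` list is made up to show the .srf/.bam case rather than taken from the database.

```python
import os


def basename_without_extension(file_name):
    """Drop only the final extension, e.g. '5554_1_nonhuman.srf' -> '5554_1_nonhuman'."""
    return os.path.splitext(file_name)[0]


def match_on_basename(manifest_names, subtrack_names):
    """Map each manifest file name to the subtrack file names sharing its basename."""
    by_basename = {}
    for name in subtrack_names:
        by_basename.setdefault(basename_without_extension(name), []).append(name)
    return {
        name: by_basename.get(basename_without_extension(name), [])
        for name in manifest_names
    }


# File names as they appear in the manifest above.
manifest = ["5554_1_nonhuman.bam", "25552_1#7.cram", "245_7_nonhuman.bam"]
# Made-up subtrack contents, for illustration only.
subtrack = ["5554_1_nonhuman.bam", "25552_1#7.cram", "245_7_nonhuman.srf"]

matches = match_on_basename(manifest, subtrack)
unmatched = [name for name, hits in matches.items() if not hits]
print(matches)
print(unmatched)
```

Matching this way removes the need to guess, per run, whether subtrack holds a .srf or a .bam file; any name left in `unmatched` would be a file that is genuinely absent.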

In [26]:
def set_irods_name(row):
    # First calculate prefix using nonhuman dependent on id_run and species (taxon)
    # Then calculate tag string (empty string if no tag)
    # Then calculate file extension
    # Then concat these three strings
    if (
        (row['taxon_id'] in [7165, 7173, 30066, 62324]) and # Anopheles gambiae, arabiensis, merus and funestus respectively
        (row['id_run'] <= 6750)
    ):
        prefix = "%d_%d_nonhuman" % (row['id_run'], row['position'])
    elif (
        (row['taxon_id'] not in [7165, 7173, 30066, 62324]) and # i.e. not Anopheles gambiae, arabiensis, merus or funestus
        (row['id_run'] <= 7100)
    ):
        prefix = "%d_%d_nonhuman" % (row['id_run'], row['position'])
    else:
        prefix = "%d_%d" % (row['id_run'], row['position'])
        
    if np.isnan(row['tag_index']):
        tag_string = ''
    else:
        tag_string = '#%d' % row['tag_index']
    
    if row['id_run'] <= 10300:
        file_extension = '.bam'
    else:
        file_extension = '.cram'
    
    irods_filename = prefix + tag_string + file_extension
    return(irods_filename)

def set_sample_id(row):
    if row['sample'] is None:
        return('DS_%s' % row['name'])
    else:
        return('DS_%s' % row['sample'])

def is_valid_alfresco_code(s):
    try: 
        int(s[0:4])
        return True
    except ValueError:
        return False

def create_build_manifest(
#     taxon_ids=[5833, 5855, 5858, 5821, 7165, 7173, 30066], # Pf, Pv, Pm, Pb, Ag, A. arabiensis, A. merus
    taxon_ids=[5833, 36329, 5847, 57267, 137071], # Pf, 3D7, V1, Dd2, HB3
    # 1089 is excluded as this is R&D samples. 1204 and 1176 are excluded as these are CP1 samples. 1175 is excluded as some Pf R&D samples were incorrectly tagged with this study. 1157 is a Pv study - there are two suspected Pf samples in this study, but we need further investigation, and possible study change both in ROMA and in Sequencescape/mlwh/ENA/iRODS before we can include
    studies_to_exclude = ['1089-R&D-GB-ALCOCK', '1204-PF-GM-CP1', '1176-PF-KE-CP1', '1175-VO-KH-STLAURENT', '1157-PV-MULTI-PRICE'],
    miseq_runs_to_allow = [13809, 13810], # Two MiSeq runs with 24 samples on each from study 1095-PF-TZ-ISHENGOMA that were included in Pf 6.0
    input_dir = input_dir,
    mlwh_missing_exceptions_fn = mlwh_missing_exceptions_fn,
    mlwh_study_exceptions_fn = mlwh_study_exceptions_fn,
    mlwh_sample_exceptions_fn = mlwh_sample_exceptions_fn,
    mlwh_taxon_exceptions_fn = mlwh_taxon_exceptions_fn,
    sequencescape_alfresco_study_mappings_fn = sequencescape_alfresco_study_mappings_fn,
    output_columns = [
        'path',
        'study',
        'sample',
        'lane',
        'reads',
        'paired',
        'irods_path',
        'sanger_sample_id',
        'taxon_id',
        'study_lims',
        'study_name',
        'id_run',
        'position',
        'tag_index',
        'qc_complete',
        'manual_qc',
        'description',
        'instrument_name',
        'instrument_model',
        'forward_read_length',
        'requested_insert_size_from',
        'requested_insert_size_to',
        'human_percent_mapped',
        'subtrack_files_sub_id',
        'subtrack_files_file_name',
        'subtrack_basename',
        'subtrack_files_timestamp',
        'subtrack_files_bytes',
        'ebi_run_acc'
    ]
):
    """Create a DataFrame to be used as a build manifest.
    
    A build manifest here is a list of all the "lanelets" that need to be
    included in a build. The output will typically be written to a
    tab-delimited file, either to use as input to vr-pipe, or perhaps as
    input to a vr-track DB.

    Args:
        taxon_ids (list of int): The taxon IDs of the build (P. falciparum is 5833,
            P. vivax is 5855).
        sequencescape_alfresco_study_mappings_fn (str): filename of
            mappings from Sequencescape to Alfresco studies

    Returns:
        pd.DataFrame: The build manifest.

    """
    
    # Read in taxon exceptions file
    df_mlwh_taxon_exceptions = pd.read_csv(mlwh_taxon_exceptions_fn, sep='\t', dtype={'tag_index': str, 'taxon_id': int}, index_col='irods_filename')
    df_mlwh_taxon_exceptions['tag_index'] = df_mlwh_taxon_exceptions['tag_index'].fillna('')
    # Identify which samples in exceptions file match the taxon of current build, and which don't
    df_mlwh_exceptions_this_taxon = df_mlwh_taxon_exceptions.loc[df_mlwh_taxon_exceptions['taxon_id'].isin(taxon_ids)]
    df_mlwh_exceptions_other_taxa = df_mlwh_taxon_exceptions.loc[~(df_mlwh_taxon_exceptions['taxon_id'].isin(taxon_ids))]

    # Read in sample exceptions file
    df_mlwh_sample_exceptions = pd.read_csv(mlwh_sample_exceptions_fn, sep='\t', index_col='irods_filename')

    # Read in study exceptions file
    df_mlwh_study_exceptions = pd.read_csv(mlwh_study_exceptions_fn, sep='\t', index_col='irods_filename')

    # Read in missing exceptions file
    df_mlwh_missing_exceptions = pd.read_csv(mlwh_missing_exceptions_fn, sep='\t', index_col='irods_filename')
    df_mlwh_missing_exceptions['derivative_sample_id'] = 'DS_' + df_mlwh_missing_exceptions['sample_id']
    df_mlwh_missing_exceptions = df_mlwh_missing_exceptions.loc[
        df_mlwh_missing_exceptions['taxon_id'].isin(taxon_ids)
    ]

    # Read in SequenceScape-Alfresco study mappings
    df_sequencescape_alfresco_study_mappings = pd.read_csv(sequencescape_alfresco_study_mappings_fn, sep='\t', index_col='seqscape_study_id')
    sequencescape_alfresco_study_mappings_dict = df_sequencescape_alfresco_study_mappings.loc[
        :,
#         df_sequencescape_alfresco_study_mappings['build_flag']==1,
        'alfresco_study_code'
    ].to_dict()
    
    # Read in data from mlwh matching this taxon, plus samples from exceptions file matching this taxon
    conn = pymysql.connect(
        host='mlwh-db',
        user='mlwh_malaria',
        password='Solaris&2015',
        db='mlwarehouse',
        port=3435
    )
    
    sql_query = 'SELECT \
        study.name as study_name, \
        study.id_study_lims as study_lims, \
        sample.supplier_name as sample, \
        sample.name, \
        sample.sanger_sample_id, \
        sample.taxon_id, \
        iseq_product_metrics.id_run, \
        iseq_product_metrics.position, \
        iseq_product_metrics.tag_index, \
        iseq_product_metrics.num_reads, \
        iseq_product_metrics.human_percent_mapped, \
        iseq_run_lane_metrics.instrument_name, \
        iseq_run_lane_metrics.instrument_model, \
        iseq_run_lane_metrics.paired_read, \
        iseq_run_lane_metrics.qc_complete, \
        iseq_flowcell.manual_qc, \
        iseq_flowcell.requested_insert_size_from, \
        iseq_flowcell.requested_insert_size_to, \
        iseq_flowcell.forward_read_length, \
        iseq_run_status_dict.description \
    FROM \
        study, \
        iseq_flowcell, \
        sample, \
        iseq_product_metrics, \
        iseq_run_status, \
        iseq_run_lane_metrics, \
        iseq_run_status_dict \
    WHERE \
        study.id_study_tmp = iseq_flowcell.id_study_tmp and \
        iseq_flowcell.id_sample_tmp = sample.id_sample_tmp and \
        iseq_flowcell.manual_qc = 1 and \
        iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp and \
        iseq_run_status.id_run = iseq_product_metrics.id_run and \
        iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run and \
        iseq_product_metrics.position = iseq_run_lane_metrics.position and \
        iseq_run_status.iscurrent = 1 and \
        ( ( iseq_run_lane_metrics.instrument_model != "MiSeq" ) or (iseq_product_metrics.id_run in (%s)) ) and \
        iseq_run_status.id_run_status_dict = iseq_run_status_dict.id_run_status_dict and \
        study.faculty_sponsor = "Dominic Kwiatkowski" and \
        ( ( sample.taxon_id in (%s) ) or ' % (
            ', '.join([str(x) for x in miseq_runs_to_allow]),
            ', '.join([str(x) for x in taxon_ids])
        )
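    # Add one clause per taxon exception for this build.
    # If there are no such exceptions, '1 = 0' keeps the SQL valid (avoids a trailing "or )").
    taxon_exception_clauses = [
        '(iseq_product_metrics.id_run="%s" and iseq_product_metrics.position="%s" and iseq_product_metrics.tag_index="%s")' % (
            row['id_run'], row['position'], row['tag_index']
        )
        for _, row in df_mlwh_exceptions_this_taxon.iterrows()
    ]
    sql_query = sql_query + (' or '.join(taxon_exception_clauses) or '1 = 0')
    sql_query = sql_query + ')'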
#         (iseq_flowcell.manual_qc = 1 or iseq_flowcell.manual_qc is null) and \
    
    df_return = pd.read_sql(sql_query, conn)

    # Replace missing taxon_id with -1 (can't have missing int values in pandas Series)
    df_return['taxon_id'] = df_return['taxon_id'].fillna(-1).astype('int32')

    # Determine file name in iRods
    df_return['irods_filename'] = df_return.apply(set_irods_name, 1)
    df_return.set_index('irods_filename', inplace=True)

    # Determine alfresco study and change any incorrect studies
    df_return['alfresco_study_code'] = df_return['study_lims'].apply(
        lambda x: sequencescape_alfresco_study_mappings_dict[int(x)] if int(x) in sequencescape_alfresco_study_mappings_dict else ''
    )
    empty_alfresco_study_code = (df_return['alfresco_study_code'] == '')
    df_return.loc[empty_alfresco_study_code, 'alfresco_study_code'] = df_return.loc[empty_alfresco_study_code, 'study_name']
    
    study_exception_indexes = df_mlwh_study_exceptions.index[
        df_mlwh_study_exceptions.index.isin(df_return.index)
    ]
    df_return.loc[study_exception_indexes, 'alfresco_study_code'] = df_mlwh_study_exceptions.loc[study_exception_indexes, 'alfresco_study_code']

    # Change any incorrect taxon IDs
    taxon_exception_indexes = df_mlwh_exceptions_this_taxon.index[
        df_mlwh_exceptions_this_taxon.index.to_series().isin(df_return.index)
    ]
    df_return.loc[taxon_exception_indexes, 'taxon_id'] = df_mlwh_exceptions_this_taxon.loc[taxon_exception_indexes, 'taxon_id']

    # Determine sample ID and change any incorrect sample IDs
    df_return['derivative_sample_id'] = df_return.apply(set_sample_id, 1)
    sample_exception_indexes = df_mlwh_sample_exceptions.index[
        df_mlwh_sample_exceptions.index.isin(df_return.index)
    ]    
    df_return.loc[sample_exception_indexes, 'derivative_sample_id'] = 'DS_' + df_mlwh_sample_exceptions.loc[sample_exception_indexes, 'sample_id']
    df_return.drop(['sample', 'name'], axis=1, inplace=True)
    
    # Remove any samples that are not in this taxon
    df_return = df_return.loc[~df_return.index.isin(df_mlwh_exceptions_other_taxa.index)]
    
    # Merge in missing exceptions
    df_return = df_return.append(df_mlwh_missing_exceptions.drop('study_group', 1))
    
    # Remove any samples that don't have an alfresco study
    invalid_codes = ~df_return['alfresco_study_code'].apply(is_valid_alfresco_code)
    if(np.count_nonzero(invalid_codes) > 0):
        print('Removing %d files with invalid alfresco_study_code:' % np.count_nonzero(invalid_codes))
        print(df_return['alfresco_study_code'].loc[invalid_codes].value_counts())
        df_return = df_return.loc[~invalid_codes]

    # Determine basename (i.e. filename without extension) in subtrack
    # We use the basename because for many older samples, the file has a .srf extension in subtrack
    # but this is not done consistently, e.g. some very old samples such as 245_7_nonhuman have a .bam extension
    df_return['subtrack_basename'] = df_return.index.to_series().apply(lambda x: x.split('.')[0])
    
    # Read in data from subtrack, most importantly to get run accessions
    sql_query = "\
    SELECT \
        files.sub_id as subtrack_files_sub_id, \
        files.file_name as subtrack_files_file_name, \
        SUBSTRING_INDEX(files.file_name, '.', 1) as subtrack_basename, \
        files.bytes as subtrack_files_bytes, \
        files.timestamp as subtrack_files_timestamp, \
        files.public_date as subtrack_files_public_date, \
        submission.id as subtrack_submission_id, \
        submission.created as subtrack_submission_created, \
        submission.release_date as subtrack_submission_release_date, \
        submission.timestamp as subtrack_submission_timestamp, \
        submission.ext_db as subtrack_submission_ext_db, \
        submission.ebi_sample_acc, \
        submission.ebi_exp_acc, \
        submission.ebi_study_acc, \
        submission.ebi_sub_acc, \
        submission.ebi_run_acc \
    FROM \
        submission, \
        files \
    WHERE \
        files.sub_id = submission.id AND \
        SUBSTRING_INDEX(files.file_name, '.', 1) in (%s)\
    " % ("'" + "', '".join(df_return['subtrack_basename']) + "'")

    conn = pymysql.connect(
        host='shap',
        user='ega_dataset',
        password='ega_dataset',
        db='subtrack',
        port=3303
    )

    df_run_accessions = pd.read_sql(sql_query, conn).set_index('subtrack_basename')
    
    # Merge in subtrack data
    df_return = df_return.join(df_run_accessions, on='subtrack_basename', how='left')

    # Remove unwanted studies
    df_return = df_return.loc[~df_return['alfresco_study_code'].isin(studies_to_exclude)]

    # Create other columns
    df_return.rename(columns={'alfresco_study_code': 'study', 'paired_read': 'paired'}, inplace=True)
    df_return['lane'] = df_return.index.to_series().apply(lambda x: x.split('.')[0])
    df_return['path'] = df_return.apply(lambda x: "%s/%s" % (input_dir, x.name), 1)
    df_return['reads'] = df_return['num_reads'].fillna(-1).astype(int)
    df_return['irods_path'] = df_return.apply(lambda x: "/seq/%d/%s" % (x['id_run'], x.name), 1)
    df_return['sample'] = df_return['derivative_sample_id'].str.replace('DS_', '')

    # Remove lanes with zero reads
    df_return = df_return.loc[df_return['reads'] != 0]
    
    # Remove control samples (see emails from Sonia 03/05/2018 12:20 and from Richard 09/10/2018 13:26)
#     df_return = df_return.loc[df_return['sample'].str.slice(0, 1) != 'C']
    df_return = df_return.loc[(df_return['sample'].str.slice(0, 1) != 'C') & (df_return['sample'] != 'control')]
    
    # Restrict to certain columns
    if output_columns is not None:
        df_return = (
            df_return[output_columns]
            .sort_values(['study', 'sample', 'id_run', 'position', 'tag_index'])
        )
    
    print(df_return.shape)

    return(df_return) 
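
The queries above are assembled with `%` string interpolation, which works here because the run and taxon IDs are integers and the basenames come from trusted files. As a hedged alternative, the `IN (...)` lists could be passed as query parameters instead. The sketch below only builds the SQL fragment and its parameter list; the helper name `in_clause` is hypothetical and not part of this notebook.

```python
def in_clause(column, values):
    """Return an SQL fragment like 'col IN (%s, %s)' plus its parameter list.

    Hypothetical helper: uses the '%s' placeholder style accepted by pymysql.
    """
    # De-duplicate while preserving order so the IN list stays small.
    params = list(dict.fromkeys(values))
    if not params:
        # An empty IN () is invalid SQL, so emit a clause that matches nothing.
        return "1 = 0", []
    placeholders = ", ".join(["%s"] * len(params))
    return "%s IN (%s)" % (column, placeholders), params


fragment, params = in_clause("sample.taxon_id", [5833, 5833])
print(fragment)
print(params)
```

The fragment and parameters could then be passed on via `pd.read_sql(sql, conn, params=params)`, so that the database driver handles quoting of the values.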


In [27]:
df_new_assay_data = create_build_manifest()

Removing 63 files with invalid alfresco_study_code:
AT-rich amplification                                 53
Test sequencing of high level human contamination.     6
Plasmodium falciparum genome variation 1               4
Name: alfresco_study_code, dtype: int64
(18130, 29)


# Check all files found in subtrack

In [28]:
df_new_assay_data.loc[df_new_assay_data['ebi_run_acc'].isnull(), ['id_run', 'study']].groupby(['id_run', 'study']).size()

id_run  study           
10901   1155-PF-ID-PRICE     6
11019   1155-PF-ID-PRICE     2
12299   1155-PF-ID-PRICE    14
14323   1155-PF-ID-PRICE     1
14324   1155-PF-ID-PRICE     3
16166   1155-PF-ID-PRICE    36
16189   1155-PF-ID-PRICE    11
16229   1155-PF-ID-PRICE    19
25237   1155-PF-ID-PRICE    37
25552   1155-PF-ID-PRICE     3
dtype: int64

In [30]:
df_new_assay_data.loc[df_new_assay_data['ebi_run_acc'].isnull() & (df_new_assay_data['id_run'] == 25552)]

irods_filename,path,study,sample,lane,reads,paired,irods_path,sanger_sample_id,taxon_id,study_lims,study_name,id_run,position,tag_index,qc_complete,manual_qc,description,instrument_name,instrument_model,forward_read_length,requested_insert_size_from,requested_insert_size_to,human_percent_mapped,subtrack_files_sub_id,subtrack_files_file_name,subtrack_basename,subtrack_files_timestamp,subtrack_files_bytes,ebi_run_acc
25552_1#7.cram,dummy/25552_1#7.cram,1155-PF-ID-PRICE,SPT24658,25552_1#7,41337656,1.0,/seq/25552/25552_1#7.cram,5101STDY7247317,5833,5101,1155-PF-ID-PRICE,25552,1,7.0,2018-04-05 12:42:30,1.0,qc complete,HX7,HiSeqX,150,450.0,450.0,96.29,1963391,25552_1#7.cram,25552_1#7,2018-04-06 01:20:52,,
25552_1#10.cram,dummy/25552_1#10.cram,1155-PF-ID-PRICE,SPT24662,25552_1#10,38311728,1.0,/seq/25552/25552_1#10.cram,5101STDY7247321,5833,5101,1155-PF-ID-PRICE,25552,1,10.0,2018-04-05 12:42:30,1.0,qc complete,HX7,HiSeqX,150,450.0,450.0,95.05,1963394,25552_1#10.cram,25552_1#10,2018-04-06 01:20:52,,
25552_1#21.cram,dummy/25552_1#21.cram,1155-PF-ID-PRICE,SPT24686,25552_1#21,33226936,1.0,/seq/25552/25552_1#21.cram,5101STDY7247345,5833,5101,1155-PF-ID-PRICE,25552,1,21.0,2018-04-05 12:42:30,1.0,qc complete,HX7,HiSeqX,150,450.0,450.0,97.55,1963405,25552_1#21.cram,25552_1#21,2018-04-06 01:20:52,,


In [32]:
df_new_assay_data.loc[df_new_assay_data['subtrack_files_sub_id'].isnull(), ['id_run', 'study']].groupby(['id_run', 'study']).size()

Series([], dtype: int64)

In [33]:
df_new_assay_data.loc[df_new_assay_data['subtrack_files_file_name'].isnull(), ['id_run', 'study']].groupby(['id_run', 'study']).size()

Series([], dtype: int64)

In [34]:
df_new_assay_data.loc[df_new_assay_data['subtrack_files_timestamp'].isnull(), ['id_run', 'study']].groupby(['id_run', 'study']).size()

Series([], dtype: int64)

# Write out file
This might be useful for future comparisons of my approach to creating build manifests vs FITS. Rob has said he might find this file useful.

In [37]:
df_new_assay_data.to_csv('Pf_irods_manifest_20190117.txt.gz', sep='\t', compression='gzip')
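
For a later comparison, a gzip-compressed, tab-delimited file like this one can be read back with the standard library alone. This is a minimal sketch, assuming missing values were written as empty strings (the pandas default); the file it reads is a made-up two-row stand-in, and `read_manifest` is a hypothetical helper name.

```python
import csv
import gzip
import os
import tempfile


def read_manifest(path):
    """Read a gzip-compressed, tab-delimited manifest into a list of dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Toy two-row file standing in for the real manifest (all values are made up).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_manifest.txt.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["irods_filename", "study", "ebi_run_acc"])
    writer.writerow(["1234_1#1.cram", "0000-PF-XX-DEMO", "ERR000001"])
    writer.writerow(["1234_1#2.cram", "0000-PF-XX-DEMO", ""])

rows = read_manifest(demo_path)
# Files that are listed in the manifest but have no run accession.
missing_acc = [r["irods_filename"] for r in rows if not r["ebi_run_acc"]]
print(missing_acc)
```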

# Conclusions
I had previously assumed that some files identifiable through mlwh queries could not be found in subtrack. This turns out not to be the case. The apparent gaps arose because, in some cases, I expected the file in subtrack to have a .srf extension when it actually had a .bam extension. After modifying my subtrack query to match on the file basename only (i.e. ignoring the file extension), every file found by my mlwh query could also be found in subtrack, at least for the 18,130 Pf files currently returned by the query. Note that some of these files, all from study 1155-PF-ID-PRICE, are present in subtrack but have no EBI run accession recorded.
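
The effect of matching on basename rather than full file name can be sketched on toy data. The file names below are illustrative only (245_7_nonhuman is the .bam example mentioned in the code comments above), and `basename` is a hypothetical helper that mirrors `x.split('.')[0]` in the manifest code and `SUBSTRING_INDEX(files.file_name, '.', 1)` in the subtrack query.

```python
def basename(file_name):
    """Return the part of a file name before the first '.'."""
    return file_name.split(".")[0]


# Names guessed under a fixed "old files are .srf" rule vs names held in subtrack.
expected_files = ["245_7_nonhuman.srf", "25552_1#7.cram"]
subtrack_files = ["245_7_nonhuman.bam", "25552_1#7.cram"]

subtrack_exact = set(subtrack_files)
subtrack_by_basename = {basename(f) for f in subtrack_files}

# Exact matching reports a file as missing when only the extension differs.
unmatched_exact = [f for f in expected_files if f not in subtrack_exact]
# Basename matching ignores the extension, so the same file is found.
unmatched_basename = [
    f for f in expected_files if basename(f) not in subtrack_by_basename
]

print(unmatched_exact)
print(unmatched_basename)
```

With these toy inputs, exact matching flags the .srf name as missing, while basename matching finds every file.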