## Build a complete table of metadata and experimental details

As originally defined, using only isolation source and collection date, the sample table contains duplicate rows.

These duplicates were intended to record repeat extractions from the same sample (technical replicates), made because of problems with earlier extractions.

However, source and date are identical across these rows, so the additional extractions only take on meaning once the table is joined to the sequencing run table.

Below we join the tables and then add a replicate_number column to disambiguate.

Mock and negative control records complicate matters.

A unique index (sample_name) is constructed systematically from isolation_source, collection_date and replicate_number.
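The naming scheme described above can be sketched on toy data. This is a minimal illustration, not the notebook's own implementation: the frame `toy` and its values are made up, and it assumes every row has a collection date (the real table also has rows without one). It uses `cumcount()` to number the rows that share a source and date.

```python
import pandas as pd

# Hypothetical stand-in for the joined table; the values are illustrative only.
toy = pd.DataFrame({
    'isolation_source': ['14160', '14160', '29645'],
    'collection_date': pd.to_datetime(['2017-02-01', '2017-02-01', '2017-01-31']),
})

# Base name: isolation source plus the date formatted as -YYMMDD.
base = toy['isolation_source'] + toy['collection_date'].dt.strftime('-%y%m%d')

# cumcount() numbers the rows within each group from 0, so add 1
# to obtain a 1-based replicate number.
toy['replicate_number'] = base.groupby(base).cumcount() + 1

# Final unique name: base name plus a zero-padded replicate number.
toy['sample_name'] = base + '-' + toy['replicate_number'].map('{:02d}'.format)

print(toy['sample_name'].tolist())
# ['14160-170201-01', '14160-170201-02', '29645-170131-01']
```

Because the replicate number is part of the name, rows that share a source and date still end up with distinct sample names.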


In [4]:
import pandas as pd
import numpy as np 

In [6]:
# read in the sample metadata table as it currently exists.

samples = pd.read_excel('../source_data/test_pigs_samples_IDs_for_NCBI.xlsx', sheet_name=0, skiprows=14, skipfooter=0, index_col=None, dtype={'*sample_name': str})
# for any sample name that ends in a plate number, zero-pad single digits (e.g. P03 rather than P3; P10 is unchanged)
# so that rows sort in the correct order
# regex=True is needed because recent pandas versions no longer treat the pattern as a regex by default
samples['*sample_name'] = samples['*sample_name'].str.replace(r'P([1-9](?:.2)?)$', lambda m: 'P0{}'.format(m.group(1)), regex=True)
print('There were {} rows in sample table'.format(len(samples)))
samples.groupby('chem_administration').size()

There were 885 rows in sample table


chem_administration
ColiGuard Day 10                    8
ColiGuard Day 14                   18
ColiGuard Day 3                     8
ColiGuard Day 7                    18
ColiGuard Day 8                     1
Control - Ex Placebo               52
Control - Placebo Day 10            8
Control - Placebo Day 14           18
Control - Placebo Day 3             8
Control - Placebo Day 6             6
Control - Placebo Day 7            18
D-scour Day 10                      9
D-scour Day 14                     18
D-scour Day 3                       8
D-scour Day 7                      18
D-scour Day 8                       2
Ex ColiGuard                       46
Ex ColiGuard                       12
Ex D-scour                         58
Ex Neomycin                       103
Ex Neomycin - ColiGuard Day 12      6
Ex Neomycin - ColiGuard Day 2      17
Ex Neomycin - ColiGuard Day 5       8
Ex Neomycin - ColiGuard Day 9      18
Ex Neomycin - D-scour Day 12        6
Ex Neomycin - D-scour Day 2   
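
The effect of zero-padding on sort order can be seen with a small standalone sketch. The helper `pad_plate` and the example names are hypothetical, and the pattern is a simplified version of the one in the cell above (it omits the optional suffix group).

```python
import re

def pad_plate(name):
    """Zero-pad a trailing single-digit plate number, e.g. 'P3' -> 'P03'."""
    # 'P' followed by exactly one digit 1-9 at the end of the string.
    return re.sub(r'P([1-9])$', r'P0\1', name)

names = ['Mock_P10', 'Mock_P3']   # made-up sample names

# As plain strings, 'P10' sorts before 'P3' because '1' < '3'.
print(sorted(names))

# After padding, lexical order matches numeric plate order.
padded = [pad_plate(n) for n in names]
print(sorted(padded))
```

Two-digit plate numbers are left untouched because the pattern only matches a single digit immediately before the end of the string.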

In [7]:
pd.options.display.max_rows = 10
seqruns = pd.read_excel('../source_data/DNA_plates.xlsx', sheet_name=0, index_col=None, dtype={'date and pig ID': str})
seqruns.drop(seqruns.columns[[3,5]], axis=1, inplace=True)
print('There were {} rows in the DNA plate table'.format(len(seqruns)))
seqruns.head()

There were 928 rows in the DNA plate table


Unnamed: 0,elution plate number,well position,date of sample,pig ID,date and pig ID
0,plate_1,A1,Fe21,14194,Fe21/14194
1,plate_1,A2,Fe28,14286,Fe28/14286
2,plate_1,A3,Fe21,29644,Fe21/29644
3,plate_1,A4,Fe14,29898,Fe14/29898
4,plate_1,A5,Fe21,29679,Fe21/29679


In [8]:
def demux(df):
    # replace the current sample_name with that made from isolation_source and collection_date
    df['alt_sample_name'] = df['isolation_source'].astype(str) + df['*collection_date'].map(lambda v: v.strftime('-%y%m%d') if not pd.isnull(v) else '') 

    # + "-" + ncbi.replicate_number.astype(str)

    def f(n):
        """
        Create an incrementing counter of replicates when encountering duplicated rows.
        """
        if len(n) > 1:
            print('{} was duplicated {} times'.format(n.name, len(n)))
        return pd.Series(np.arange(1, len(n)+1), index=n.index)

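    # add a new column which numbers the replicates (1, 2, ...) of a given isolation source and collection date
    df['replicate'] = 0
    # group_keys=False keeps the original row index on the result, so it aligns with df on assignment
    df.replicate = df.groupby('alt_sample_name', group_keys=False)['alt_sample_name'].apply(f)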
    df['alt_sample_name'] = df['isolation_source'].astype(str) + df['*collection_date'].map(lambda v: v.strftime('-%y%m%d') if not pd.isnull(v) else '') + "-" + df.replicate.map('{:02d}'.format)

In [9]:
def demux2(df):
    # replace the current sample_name with that made from isolation_source and collection_date
    df['sn'] =  df['*collection_date'].map(lambda v: v.strftime('%b%d/') if not pd.isnull(v) else 'NA/') + df['isolation_source'].astype(str)

    def f(n):
        """
        Create an incrementing counter of replicates when encountering duplicated rows.
        """
        if len(n) > 1:
            print('{} was duplicated {} times'.format(n.name, len(n)))
        return pd.Series(np.arange(1, len(n)+1), index=n.index)

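    # add a new column which numbers the replicates (1, 2, ...) of a given isolation source and collection date
    df['replicate'] = 0
    # group_keys=False keeps the original row index on the result, so it aligns with df on assignment
    df.replicate = df.groupby('sn', group_keys=False)['sn'].apply(f)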
    df['sn'] = df['*collection_date'].map(lambda v: v.strftime('%b%d/') if not pd.isnull(v) else 'NA/') + df['isolation_source'].astype(str)  + "-" + df.replicate.map('{:02d}'.format)

In [23]:
# Using the existing sample_name as an index, join samples and plates. 
# rebuild an integer index
cmb = samples.set_index('*sample_name',).join(seqruns.set_index('date and pig ID'), how='inner').reset_index()
# give the old sample name column a better name than 'index'
cmb.rename(columns={'index': 'old_sample_name'}, inplace=True)
print('There were {} rows in the joined table'.format(len(cmb)))

# Only one record should exist for a given plate and well.
cmb.drop_duplicates(['elution plate number', 'well position'], inplace=True)
print('There were {} rows after removal of plate/well duplicates'.format(len(cmb)))

demux(cmb)
#cmb.to_csv('test.csv')


There were 920 rows in the joined table
There were 911 rows after removal of plate/well duplicates
14160-170201 was duplicated 2 times
14162-170201 was duplicated 2 times
14172-170216 was duplicated 2 times
14174-170131 was duplicated 2 times
14190-170131 was duplicated 3 times
14192-170131 was duplicated 2 times
14193-170131 was duplicated 2 times
14194-170207 was duplicated 2 times
14262-170207 was duplicated 2 times
14263-170131 was duplicated 2 times
14265-170131 was duplicated 2 times
14271-170206 was duplicated 2 times
14274-170214 was duplicated 3 times
14284-170131 was duplicated 2 times
14298-170131 was duplicated 2 times
14317-170207 was duplicated 2 times
14320-170131 was duplicated 2 times
29645-170131 was duplicated 2 times
29652-170131 was duplicated 2 times
29653-170131 was duplicated 2 times
29667-170131 was duplicated 2 times
29668-170131 was duplicated 2 times
29694-170131 was duplicated 2 times
29718-170131 was duplicated 2 times
29743-170131 was duplicated 2 times
2
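
The row counts above differ between the sample table, the plate table and the inner join, which means some keys have no partner on the other side. One way to list them is an outer merge with `indicator=True`. The sketch below uses made-up keys and hypothetical frame names (`left`, `right`, `chk`); it is not part of the original workflow.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
left = pd.DataFrame({'sample_name': ['Fe21/14194', 'Fe28/14286', 'Fe14/99999']})
right = pd.DataFrame({
    'date and pig ID': ['Fe21/14194', 'Fe28/14286', 'Fe21/29644'],
    'well position': ['A1', 'A2', 'A3'],
})

# An outer merge keeps every key from both sides; the '_merge' column
# records whether a row was found in 'both', 'left_only' or 'right_only'.
chk = left.merge(right, left_on='sample_name', right_on='date and pig ID',
                 how='outer', indicator=True)

# Rows that an inner join would have dropped.
unmatched = chk[chk['_merge'] != 'both']
print(unmatched[['sample_name', 'date and pig ID', '_merge']])
```

Inspecting `unmatched` before switching to `how='inner'` shows whether the dropped rows are expected (for example controls) or the result of mismatched naming.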

In [12]:
pd.options.display.max_rows = 5

In [18]:
files.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,filename
DNA_plate,DNA_well,Unnamed: 2_level_1
P10\tA10\tcombined_new/plate_10_A10.r1.fastq.gz,,
P10\tA10\tcombined_new/plate_10_A10.r2.fastq.gz,,
P10\tA11\tcombined_new/plate_10_A11.r1.fastq.gz,,
P10\tA11\tcombined_new/plate_10_A11.r2.fastq.gz,,
P10\tA12\tcombined_new/plate_10_A12.r1.fastq.gz,,


In [33]:
files = pd.read_csv('../source_data/fq_well_and_plate_new.txt', sep=' ', names=['DNA_plate','DNA_well','filename'], index_col=['DNA_plate', 'DNA_well'])
# we will convert this to one row per read-pair, rather than one row for R1 and one for R2
# (taking 'first' and 'last' relies on the R1 file being listed before the R2 file for each well)
files = files.groupby(['DNA_plate', 'DNA_well']).agg(['first', 'last'])
# convert the pandas group to a new dataframe
files = pd.DataFrame(files.reset_index().values, columns=['DNA_plate', 'DNA_well', 'r1_filename', 'r2_filename'])
print('There were {} file pairs'.format(len(files)))
files.head()

There were 960 file pairs


Unnamed: 0,DNA_plate,DNA_well,r1_filename,r2_filename
0,plate_1,A1,combined_new/plate_1_A1.r1.fastq.gz,combined_new/plate_1_A1.r2.fastq.gz
1,plate_1,A10,combined_new/plate_1_A10.r1.fastq.gz,combined_new/plate_1_A10.r2.fastq.gz
2,plate_1,A11,combined_new/plate_1_A11.r1.fastq.gz,combined_new/plate_1_A11.r2.fastq.gz
3,plate_1,A12,combined_new/plate_1_A12.r1.fastq.gz,combined_new/plate_1_A12.r2.fastq.gz
4,plate_1,A2,combined_new/plate_1_A2.r1.fastq.gz,combined_new/plate_1_A2.r2.fastq.gz
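
Pairing by `first` and `last` depends on the order of rows in the input file. A possible alternative, sketched here on made-up rows, reads the read number from the filename itself and reshapes with `unstack`, so row order no longer matters. The frame `files_toy` and the `read` column are assumptions for illustration; the filename pattern follows the examples shown above.

```python
import pandas as pd

# Hypothetical input with the A1 files deliberately listed R2 first.
files_toy = pd.DataFrame({
    'DNA_plate': ['plate_1'] * 4,
    'DNA_well': ['A1', 'A1', 'A2', 'A2'],
    'filename': [
        'combined_new/plate_1_A1.r2.fastq.gz',
        'combined_new/plate_1_A1.r1.fastq.gz',
        'combined_new/plate_1_A2.r1.fastq.gz',
        'combined_new/plate_1_A2.r2.fastq.gz',
    ],
})

# Pull 'r1' or 'r2' out of the filename suffix.
files_toy['read'] = files_toy['filename'].str.extract(r'\.(r[12])\.fastq\.gz$', expand=False)

# One row per plate/well, one column per read.
pairs = (files_toy.set_index(['DNA_plate', 'DNA_well', 'read'])['filename']
         .unstack('read')
         .rename(columns={'r1': 'r1_filename', 'r2': 'r2_filename'})
         .reset_index())
pairs.columns.name = None

print(pairs)
```

A well with only one of its two files would show up as a missing value in the corresponding column instead of being silently paired with itself.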


In [24]:
# remove extraneous text from the chem_administration column
# this loop assumes any chem_administration value containing the word "Day" should be truncated at that point
# iterrows yields a copy of each row, so the changed value must be assigned back into the dataframe
for row in cmb.iterrows():
    chem = row[1].chem_administration 
    ix = chem.find('Day')
    if ix != -1:
        cmb.loc[row[0],'chem_administration'] = chem[:ix].rstrip()
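
The same truncation can be written without a loop. The sketch below is one possible vectorised form using `Series.str.replace` with a regular expression; the series `chem` holds example values taken from the listing earlier in this notebook, and the name `trimmed` is hypothetical.

```python
import pandas as pd

chem = pd.Series([
    'ColiGuard Day 10',
    'Control - Ex Placebo',
    'Ex Neomycin - D-scour Day 2',
    'Ex D-scour',
])

# Remove the first occurrence of "Day", everything after it, and any
# whitespace immediately before it. Values without "Day" are unchanged.
trimmed = chem.str.replace(r'\s*Day.*$', '', regex=True)

print(trimmed.tolist())
# ['ColiGuard', 'Control - Ex Placebo', 'Ex Neomycin - D-scour', 'Ex D-scour']
```

Unlike the loop, this form passes missing values through untouched instead of raising on them.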

In [34]:
cmb.rename(columns={'elution plate number': 'DNA_plate', 'well position': 'DNA_well'}, inplace=True)
out = cmb.set_index(['DNA_plate', 'DNA_well']).join(files.set_index(['DNA_plate', 'DNA_well']))

In [40]:
out.to_csv('../complete_new.tsv', sep='\t', encoding='utf-8')

isolation_source
9128       1
14159     10
          ..
Y08843     1
Y09733     1
Length: 172, dtype: int64