# Submit study data to GEO
This Python Jupyter notebook submits the processed and raw study data to the [GEO database](https://www.ncbi.nlm.nih.gov/geo/).

## Import Python modules

In [4]:
import glob

import openpyxl

## Read the Excel metadata template
We have already created an Excel metadata workbook by manually filling the [GEO template](https://www.ncbi.nlm.nih.gov/geo/info/examples/seq_template_v2.1.xls) with information appropriate for our experiment.

We read this workbook and get its single active worksheet.
This sheet has all the relevant information **except** the MD5 checksums.

In [2]:
# read metadata template Excel workbook
wb = openpyxl.load_workbook('metadata_template.xlsx')

# get active worksheet, which should be the first and only one
assert len(wb.sheetnames) == 1, "workbook has multiple sheets"
ws = wb.active

## Get MD5 checksums and files to upload
We want to parse the worksheet to identify all files that need to be submitted to GEO and then do the following:
 1. Add the MD5 checksum to the worksheet
 2. Add the file to the list to upload to GEO
 
This needs to be done for two titled sections of the worksheet:
 - a section titled *PROCESSED DATA FILES*
 - a section titled *RAW FILES*
 
Each of these sections should begin with a header row whose first three columns are *file name*, *file type*, and *file checksum*; the last of these is the column where we add the checksum.
Each section ends at the first row whose first cell is empty, contains only whitespace, or is a comment (begins with `#`).

So we go through the worksheet and create a dict keyed by the names of all processed and raw data files that we need to upload, with each value being the worksheet cell where the MD5 checksum for that file needs to be placed:

In [3]:
rows = list(ws.rows)

files_to_upload = {}

for section in ['PROCESSED DATA FILES', 'RAW FILES']:
    
    print(f"\nIdentifying {section} files:")
    
    # identify row with section title
    titlecell = [row[0] for row in rows if row[0].value == section]
    if len(titlecell) != 1:
        raise ValueError(f"not exactly one title row for {section}")
    titlerow = titlecell[0].row
    
    # check correctness of header following section title
    header = [cell.value for cell in rows[titlerow][ : 3]]
    if header != ['file name', 'file type', 'file checksum']:
        raise ValueError(f"header incorrect for {section}")
        
    # get all rows with files to upload
    i = 1 + titlerow
    filename = rows[i][0].value
    while ((filename is not None) and (filename[0] != '#') and not filename.isspace()):
        print(filename)
        checksumcell = rows[i][2]
        if checksumcell.value is not None:
            raise ValueError("checksum cell already contains a value")
        files_to_upload[filename] = checksumcell
        i += 1
        filename = rows[i][0].value


Identifying PROCESSED DATA FILES files:
merged_canine_cells.tsv
merged_canine_genes.tsv
merged_canine_matrix.mtx
merged_humanplusflu_cells.tsv
merged_humanplusflu_genes.tsv
merged_humanplusflu_matrix.mtx
PacBio_annotated_merged_humanplusflu_cells.tsv

Identifying RAW FILES files:
2017-06-08_ccs.bam
2017-06-08_report.csv
2017-12-07_ccs.bam
2017-12-07_report.csv
2018-06-22_nonPol_ccs.bam
2018-06-22_nonPol_report.csv
2018-06-22_Pol-1_ccs.bam
2018-06-22_Pol-1_report.csv
2018-06-22_Pol-2_ccs.bam
2018-06-22_Pol-2_report.csv
2018-06-22_Pol_open_ccs.bam
2018-06-22_Pol_open_report.csv
2018-08-08_Pol_circ_ccs.bam
2018-08-08_Pol_circ_report.csv
IFN_enriched_S1_L002_R1_001.fastq.gz
IFN_enriched_S1_L002_R2_001.fastq.gz
IFN_enriched_S1_L002_I1_001.fastq.gz
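
The loop above only records where each checksum belongs; it does not compute one. A minimal sketch of computing an MD5 digest with the standard-library `hashlib` follows. The helper name `md5_checksum` and the 1 MiB chunk size are assumptions for illustration, not part of the original notebook; reading in chunks keeps memory use bounded for large BAM and FASTQ files.

```python
import hashlib


def md5_checksum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`.

    Hypothetical helper: the file is read in `chunk_size`-byte pieces
    so that large files are never loaded into memory all at once.
    """
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```

Under these assumptions, once the path to each file is known, the digest could be stored with `files_to_upload[filename].value = md5_checksum(path)`, and the completed workbook written out with `wb.save(...)` under a new file name.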


Now we look for each of these files.
We specify directories where they are stored, and make sure each file is uniquely located in one of these directories:

In [6]:
# directories where files may be found
search_dirs = [
        # location of Illumina FASTQ files for transcriptomics
        '../results/demultiplexed_reads/2017-07-21/fastq/IFN_enriched/',
        # location of PacBio CCS files
        '../results/pacbio/ccs/',
        # location of cell-gene matrix files
        '../results/cellgenecounts/'
        ]

search_files = []
for search_dir in search_dirs:
    search_files += glob.glob(f'{search_dir}')
    
print(search_files)

['../results/pacbio/ccs/', '../results/cellgenecounts/']
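
The requirement that each file be uniquely located in one of the search directories could be checked along the following lines. This is a sketch only: the function name `locate_unique` is hypothetical, and it assumes each file sits directly inside a search directory rather than in a subdirectory.

```python
import os


def locate_unique(filename, search_dirs):
    """Return the one path to `filename` among `search_dirs`.

    Hypothetical helper: raises ValueError if the file is missing or is
    present in more than one directory, so ambiguous uploads fail early.
    """
    hits = [os.path.join(d, filename) for d in search_dirs
            if os.path.isfile(os.path.join(d, filename))]
    if len(hits) != 1:
        raise ValueError(
            f"expected exactly one copy of {filename}, found {len(hits)}")
    return hits[0]
```

With the names used in this notebook, a mapping of file names to paths could then be built as `{f: locate_unique(f, search_dirs) for f in files_to_upload}`.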
