

# DSP Illumina-WTA workflow

This workflow project is a work in progress and currently combines several different approaches. Ongoing work aims to unify these notebooks into a single coherent workflow. Some notebooks may not currently be relevant.

The current quick-and-dirty workflow is to use the Nanostring to StandR notebook to prepare files for StandR, then run the StandR, EdgeR and EdgeR_Output notebooks.

Work is ongoing to add documentation, make a robust workflow and functions, and add detailed QC options.

# 1. (This Notebook) Contents
[1. How to use this notebook](#howTo)

[2. Overview of the workflow in this notebook](#overview)


#  2. Prelim view, cleanup, revision and grouping experimental sets.

[2.A.i. View and clean annotations](#clean)

[2.A.ii. View and QC Raw data](#qcRaw)

[2.A.iii. Normalise data](#norm)


#  3.

[3. View and QC Normalised data](#qcNorm)

[4. Set up comparisons](#compare)

[5. Run DGE](#dge)

[6. Convert EdgeR plots to volcano plots](#convert)


# 1. How to use this notebook <a class="anchor" id="howTo"></a>

<!-- Cell from Paul Watmore -->

This is a [Jupyter notebook](https://jupyter.org/) for DSP data exploration, normalisation and analysis. 

Jupyter notebooks are interactive documents that contain 'live code', which allows the user to complete an analysis by running code 'cells', which can be modified, updated or added to by the user.

Individual Jupyter notebooks are based on a specific 'kernel', or analysis environment (usually a programming language). This particular notebook is based on R. To see which version of R this notebook is based on, and as an example of running a code cell, click on the cell below and press the 'Run' button (top of the page).

This notebook is designed for use on the QUT computer system. The notebook files and associated config files and scripts are downloaded from GitHub to your home directory on the HPC. Data files and any outputs from the scripts and notebooks should be saved in a folder on the work directory so they can easily be shared with other users.

NOTE: changes you make to the notebook will only be in your home directory. Bug fixes should be pushed to the master GitHub repo, or logged as an issue on GitHub. GitHub should contain the most recent working version of all notebooks and scripts and should be checked for changes.



# 2. Workflow Overview<a class="anchor" id="overview"></a>

This notebook links to a number of auxiliary notebooks that walk through a series of steps for data exploration, data cleaning and data analysis.

To help with reproducibility, input and output files have been standardised as much as possible (this is a work in progress). <span style="color:red">The folder structure is shown in section ###, and descriptions of the key files are shown in section ###.</span>

This workflow is designed to be run collaboratively on QUT compute facilities. As such, it uses a number of compute resources that each have different access restrictions and security protocols. Every effort has been made to ensure that sensitive data is protected; however, this may not be suitable for every project. It is every user's responsibility to ensure that data is stored and secured properly. No encryption is currently implemented in this workflow; it relies on proper data use and storage according to QUT policies and procedures.





The high level overview of the workflow is as follows:

1. Decide on a working directory on the HPC in which to set up the file structure.
NOTE: HPC storage is preferred to facilitate collaboration and help with data security.

2. View and clean annotations


1. View and QC RAW Data

...

1. Merge or exclude AOIs


1. Normalise Data


1. Set up comparisons

...

1. Run DGE


1. Convert EdgeR plots to volcano plots

### Some basic setup

In [1]:
import os
os.getcwd()


In [None]:
# input the root directory (shared work folder)
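
The empty cell above is where the root directory is set. A minimal sketch of one way to fill it in is shown here; the helper name `check_root` is hypothetical, and the current working directory is used only as a stand-in for the real shared work folder.

```python
from pathlib import Path

# Placeholder only: replace Path.cwd() with the shared work folder for your project.
root_dir = Path.cwd()


def check_root(path):
    """Return the resolved path, or raise if it is not an existing directory."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise NotADirectoryError(f"Root directory not found: {resolved}")
    return resolved


root_dir = check_root(root_dir)
print(root_dir)
```

Failing early on a missing or mistyped folder avoids writing outputs to an unexpected location later in the workflow.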

### File structure
```

## Raw data files that are kept secure and immutable (read only access to researchers):


 ├─ R.......
 │ ├─ A...........
 │ │ ├─ i..
 │ │ │ ├─ c..._......._.....
 │ │ │ │ ├─ c.._..._2023XXXX
 │ │ │ │ │ ├─ I.....
 │ │ │ │ │ │ ├ Slide_1.png
 │ │ │ │ │ │ ├ Slide-1_clean.png
 │ │ │ │ │ │ ├ Slide-1.zip
 │ │ │ │ │ │ ├ Slide-1.ome.tiff
 │ │ │ │ │ │ ├ Slide-2.png
 │ │ │ │ │ │ ├ Slide-2_clean.png
 │ │ │ │ │ │ ├ Slide-2.zip
 │ │ │ │ │ │ ├ Slide-2.ome.tiff
 │ │ │ │ │ ├─ W.........
 │ │ │ │ │ │ ├ 
 │ │ │ │ │ │ ├ 

FASTQ files?

DCC files?

 │ │ │ │ │ ├─ D...
 │ │ │ │ │ │ ├ slide-1 A to H
 │ │ │ │ │ │ ├ slide-2 A to H


## Processed data files that should be kept secure, but can be edited and shared by researchers:

 ├─ HPCFS
 │ ├─ Project_Folder                    (read/write access for multiple researchers)
 │ │ ├─ DSP_Data_Analysis
                                        # Files initially downloaded from DSP with preliminary processed and QC processed data
 │ │ │ ├ Initial Dataset.xlsx           (may be over-written after changes to ROI/AOI annotations)
 │ │ │ ├ Default_QC.xlsx                (may be over-written after changes to ROI/AOI annotations)
                                        # Files output from data QC script to identify ROI/AOIs and probes that fail data QC.
 │ │ │ ├ failAOIs.csv                                                      # Check location exportPath
 │ │ │ ├ FailProbes.csv                                                    # Check location exportPath
                                        # Files output from normalisation (normalisation may run through multiple iterations. Different normalisation methods may be needed for some comparisons).
 │ │ │ ├─ Normalisation
 <!-- │ │ │ │ ├ QC_#Researcher#_#Project#_#Run#_NormInput.csv
 │ │ │ │ ├ 
 │ │ │ │ ├ #Run#
 │ │ │ │ │ ├ NS_Norm_1-84
 │ │ │ │ ├ #Run#
 │ │ │ │ │ ├ NS_Norm_1-84
                                        # Output from edgeR analysis 
 │ │ │ ├─ EdgeR
 │ │ │ │ ├ EdgeR_#Run#_Norm25
 │ │ │ │ │ ├ 
 │ │ │ │ │ ├ 
 │ │ │ │ │ ├ 
 │ │ │ │ │ ├ 
 -->

## Files on Github
                                        ## Should not contain any hard links to QUT servers

 ├─ GitHub                              ## https://github.com/kyleupton/DSP_EDA_WTA
 │ ├ Index.ipynb
 │ ├ DSP_nCounter_Protein_QC_Git.ipynb
 │ ├ 240202_DSP_nCounter_Protein_Post-Norm_EDA.ipynb

 │ ├ NSNorm.R

 │ ├ EdgeR_Config.txt
 │ ├ EdgeR.R

 │ ├ README.md
 │ ├ Config.txt
 │ ├ LICENSE

 │ ├ 







<i>NOTE: the phrase "kept secure" above indicates that the files should be saved to a secure location with backup. Primary data should be kept in read-only locations with backup.
</i>

```
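
The processed-data part of the layout above could be created with a short script. The following is a sketch only: the sub-folder names are taken from the tree above, while the function name `make_project_folders` is hypothetical and the demo runs in a temporary directory rather than a real project folder.

```python
import tempfile
from pathlib import Path

# Sub-folder names taken from the tree above; adjust if your project differs.
SUBFOLDERS = ["DSP_Data_Analysis/Normalisation", "DSP_Data_Analysis/EdgeR"]


def make_project_folders(project_dir):
    """Create the analysis sub-folders under project_dir and return their paths."""
    created = []
    for sub in SUBFOLDERS:
        folder = Path(project_dir) / sub
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


# Demo in a throw-away directory so nothing is written to a real project.
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_folders(tmp)
    demo_names = [p.relative_to(tmp).as_posix() for p in created]
    demo_exists = all(p.is_dir() for p in created)

print(demo_names)
```

Using `exist_ok=True` means the script can be re-run on an existing project without overwriting or failing.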
ToDo: Add write output functionality to all files




```

- Root
 ├─ DSP_Protein_Data
 │ ├ 
 │ ├─ Initial Dataset.xlsx
 │ ├─ Default_QC.xlsx
 │ ├─ 
 │ ├─ Lab_Worksheet_P10016600#####1.txt
 │ ├─ Lab_Worksheet_P10016600#####2.txt
 │ ├─ 
 │ ├─ 
 │
 ├─ DSP_Rwa_QC
 │ ├─ AOI_Well_Mappings_Plate2.csv |- files
 │ ├─ AOI_Well_Mappings_Plate2.csv
 │
 │
 ├─ DSP_Rwa_QC
 │
 ├─ DSP_Rwa_QC
 │
 ├─ DSP_Rwa_QC
 │
 ├─ DSP_Rwa_QC
 │
 ├─ DSP_Rwa_QC
 │
 ├─ DSP EDA
 │ ├─ AOI_Well_Mappings_Plate2.csv |- files
 │
 │
 ├─ Data_Normalisation
 │ ├─ RUVIII_Subramaniam_HCC_grouped_NSNorm.R
 │ ├─ ERCC_Subramaniam_HCC_RUV_Grouped_Expressed.csv
 │ ├─ HCC_Nanostring_SampleInformation_grouped.csv
 │ ├─
 │ ├─ RUVIII_NSNorm_Grouped_Expressed
 │ ├─├─ NanoString_mRNA_norm...
 │ │ ├─ NanoStringNorm_28_none_mean_housekeeping.geo.mean.csv
 │
 │
 │
 │ │ EdgeR_Grouped_28
 │ │ ├─
 │ │ ├─
 │ │ ├─
 │ │ ├─
 │
 │
 ├─ EdgeR
 │ ├─ files
 │
 │
 │
 │
 │

```
DataNorm Output
/data/bak/QUT/upton6/Documents/Nanostring/projects/NS_Liver_HCC_DSP/Data_Normalisation/RUVIII_NSNorm_Grouped_Expressed/

2. Prepare configuration file

In [None]:
# input the root directory (shared work folder)

# 3. View and clean annotations <a class="anchor" id="clean"></a>

First, confirm that all data entered into the DSP files is correct and clean. The following steps are easiest to perform using the DSP data analysis suite to view AOIs and to download and upload sample annotations. Sample annotations are easiest to edit in Microsoft Excel on any computer.


### Steps:

  1. Download "Annotation template file" from DSP

  2. Add in factors for AOI annotation

  3. Manually review all AOIs

  4. Ensure the comment line has been deleted (row 1 in downloaded file). The header row should be row 1.

  5. Upload file to DSP and select replace tags and factors
  
<i>Note: tags and factors are case sensitive. No additional characters should be present. All tags must be comma separated.</i>

<i>Note: the "Initial Dataset" and "Default QC" files must be re-generated if AOI annotations are updated.</i>

<i>Note: Correlating AOIs to plates and wells was done using the lab worksheet documents and by matching the surface areas in those sheets with the surface areas in the DSP output excel files. 231206_DSP_nCounter_Protein_QC_Subramaniam_HCC_TMA_01 contains code for this but may not be completely up to date.</i>

Use the following notebook to review AOI annotations.

This will help find:
 - non-conforming annotations (wrong case or punctuation)
 - empty annotations / missing annotations
 - empty cells
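
As an illustration of the checks listed above, the sketch below flags empty cells and values that differ only by case, surrounding whitespace or trailing punctuation. It is not the code from the review notebook; the function name `review_annotations`, the column name `Tissue` and the example rows are all made up for illustration.

```python
from collections import defaultdict


def review_annotations(rows, factor_columns):
    """Report empty cells and values that differ only by case or stray punctuation.

    rows is a list of dicts (one per AOI) with string values.
    """
    empty = []  # (row index, column) pairs with no value
    variants = {col: defaultdict(set) for col in factor_columns}
    for i, row in enumerate(rows):
        for col in factor_columns:
            value = row.get(col)
            if value is None or not value.strip():
                empty.append((i, col))
                continue
            # Normalise so that "Tumour", "tumour" and "Tumour." share one key.
            key = value.strip().strip(".,;:").lower()
            variants[col][key].add(value)
    inconsistent = {
        col: [sorted(values) for values in groups.values() if len(values) > 1]
        for col, groups in variants.items()
    }
    return empty, inconsistent


# Made-up example rows, for illustration only.
rows = [
    {"AOI": "001", "Tissue": "Tumour"},
    {"AOI": "002", "Tissue": "tumour"},
    {"AOI": "003", "Tissue": ""},
    {"AOI": "004", "Tissue": "Stroma"},
]
empty, inconsistent = review_annotations(rows, ["Tissue"])
print(empty)
print(inconsistent)
```

Because tags and factors are case sensitive in DSP, any group reported in `inconsistent` would be treated as separate levels until the annotation file is corrected.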



<!-- [Click here for QC notebook](DSP_nCounter_Annotation_QC_Git.ipynb) -->

# 4. View and QC data <a class="anchor" id="qc"></a>

The next step is to check the quality of the data that has been obtained from the DSP run.

This is done for both the probes and the AOIs to determine if any probes or AOIs should be excluded from analysis.
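
To make the idea of excluding AOIs and probes concrete, here is a deliberately simple sketch. The criteria (a minimum total count per AOI and a minimum peak count per probe), the thresholds and the example counts are placeholders chosen for illustration; they are not the criteria used by the QC notebook.

```python
def flag_low_counts(counts, min_aoi_total, min_probe_max):
    """Flag AOIs and probes with low counts.

    counts maps AOI name -> {probe name: count}.
    Returns (failed AOIs, failed probes), both sorted by name.
    """
    # An AOI fails if its counts summed over all probes are below the threshold.
    fail_aois = sorted(
        aoi for aoi, probes in counts.items()
        if sum(probes.values()) < min_aoi_total
    )
    # A probe fails if it never reaches the threshold in any AOI.
    probe_names = sorted({p for probes in counts.values() for p in probes})
    fail_probes = [
        p for p in probe_names
        if max(probes.get(p, 0) for probes in counts.values()) < min_probe_max
    ]
    return fail_aois, fail_probes


# Made-up counts, for illustration only.
counts = {
    "AOI_1": {"GeneA": 120, "GeneB": 2},
    "AOI_2": {"GeneA": 3, "GeneB": 1},
    "AOI_3": {"GeneA": 95, "GeneB": 4},
}
fail_aois, fail_probes = flag_low_counts(counts, min_aoi_total=50, min_probe_max=10)
print(fail_aois)
print(fail_probes)
```

Lists like these correspond to the kind of content expected in `failAOIs.csv` and `FailProbes.csv` in the file structure section.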


[link to QC notebook](DSP_NGS_QC_Git.ipynb)

<i>Note: The above notebook also contains the code for cleaning and filling out data annotations. It may be worth splitting this into separate notebooks for clarity.
</i>

# 5. Initial Data Normalisation <a class="anchor" id="norm"></a>

# 6. Normalised Data EDA <a class="anchor" id="Norm_EDA"></a>

[Click here for Normalised Data EDA notebook](240202_DSP_nCounter_Protein_Post-Norm_EDA.ipynb)

# Set up Experimental sets

# Revise normalisation

# !!Multiple!! Normalised Data EDA

# 7. Set up comparisons <a class="anchor" id="compare"></a>

#### ToDo:
```
return AOIs to ignore in a flat text file

Return AOI groups (with whole annotations of all AOIS in each group)
What is the best file format to use for this? 
```

ToDo: Complete script for compiling comparisons in python and converting to R compatible format
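
One possible answer to the file-format question above, offered as a sketch rather than a decision: write the AOIs to ignore as one name per line in a text file, and write the AOI groups (with their full annotations) as JSON. Both formats are straightforward to read from R, for example with `readLines()` and, assuming the `jsonlite` package is installed, `jsonlite::fromJSON()`. The function name, file names and example values below are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_comparison_files(out_dir, ignore_aois, aoi_groups):
    """Write the ignore list (text) and AOI groups (JSON); return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # One AOI name per line.
    ignore_path = out_dir / "ignore_AOIs.txt"
    ignore_path.write_text("".join(f"{aoi}\n" for aoi in ignore_aois))
    # Group name -> list of annotation records for each AOI in the group.
    groups_path = out_dir / "AOI_groups.json"
    groups_path.write_text(json.dumps(aoi_groups, indent=2))
    return ignore_path, groups_path


# Demo in a throw-away directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    ignore_path, groups_path = write_comparison_files(
        tmp,
        ignore_aois=["AOI_007"],
        aoi_groups={"Tumour": [{"AOI": "AOI_001", "Tissue": "Tumour"}]},
    )
    ignore_text = ignore_path.read_text()
    groups_back = json.loads(groups_path.read_text())

print(ignore_text)
print(groups_back)
```

JSON keeps the nested structure of groups and annotations in a single file; a flat CSV with a group column would be an alternative if every AOI has the same annotation fields.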



# 8. Run DGE (EdgeR or DESeq) <a class="anchor" id="dge"></a>

[EDA notebook](231206_DSP_nCounter_Protein_Norm_EDA_01_Subramaniam_HCC.ipynb)
