ToDo: DSP => Digital Spatial Profiler

# DSP Protein nCounter workflow

# How to use this notebook <a class="anchor" id="howTo"></a>

<!-- Cell adapted from Paul Watmore -->

This is a [Jupyter notebook](https://jupyter.org/) for Digital Spatial Profiler (DSP) data exploration, normalisation and analysis.

Jupyter notebooks are interactive documents that contain 'live code', which allows the user to complete an analysis by running code 'cells', which can be modified, updated or added to by the user.

Individual Jupyter notebooks are based on a specific 'kernel', or analysis environment (usually a programming language). This particular notebook is based on R. To see which version of R this notebook is based on, and as an example of running a code cell, click on the cell below and press the 'Run' button (top of the page).

This notebook is designed for use on the QUT computer system. The notebook files and associated config files and scripts are downloaded from GitHub to your home directory on the HPC. Data files and any outputs from the scripts and notebooks should be saved in a folder on the work directory so they can easily be shared with other users.

NOTE: changes you make to the notebook will only be in your home directory. Bug fixes should be pushed to the master GitHub repo, or logged as an issue on GitHub. GitHub should contain the most recent working version of all notebooks and scripts and should be checked for changes.

# Contents
[How to use this notebook](#howTo)

[1. Workflow Overview](#overview)

[2. Prepare Configuration File](#config)

[3. View and clean annotations](#clean)

[4. View and QC data](#qc)

[5. Normalise Data](#norm)

[6. Normalised Data EDA](#Norm_EDA)

[7. Set up comparisons](#compare)

[8. Run DGE](#dge)

[9. Convert EdgeR plots to volcano plots](#convert)


# 1. Workflow Overview<a class="anchor" id="overview"></a>



This notebook links to a number of auxiliary notebooks that walk through a series of steps for data exploration, data cleaning and data analysis.

To help with reproducibility, input and output files have been standardised as much as possible (this is a work in progress). <span style="color:red">The folder structure is shown in section ###, and descriptions of the key files are shown in section ###.</span>

This workflow is designed to be run collaboratively on QUT compute facilities. It therefore uses a number of different compute resources, each with different access restrictions and security protocols. Every effort has been made to ensure that sensitive data is protected; however, this workflow may not be suitable for every project. It is every user's responsibility to ensure that data is stored and secured properly. No encryption is currently implemented in this workflow; it relies on proper data use and storage according to QUT policies and procedures.



The high level overview of the workflow is as follows:



-- Prepare config file for running analysis
        -- Decide on a working directory on the HPC to set up file structure.
        NOTE: HPC storage is preferred to facilitate collaboration and help with data security.


-- View and clean annotations


-- View and QC raw Data


-- Merge or exclude AOIs and/or exclude probes


-- Normalise Data


-- Normalised Data EDA


-- Set up comparisons


-- Run DGE


-- Convert EdgeR plots to volcano plots

## 1.a. Some basic setup


### Directory structure

The workflow uses two base directories, with all other directories accessed relative to them. The first base directory is where the GitHub repository is kept. Project-specific configuration files are kept in a sister directory that shares the same parent directory. By default this parent will usually be the user's home directory.

The second base directory is the working directory where input and output files are stored. This can be either the user's home directory, or a folder on the "work" directory for easier sharing within your group.

### Scripts, config and variables files

```
## Workflow notebooks and R scripts downloaded from GitHub (like this file).

 ├─ home (HPCFS directory)
 │ ├─ qutUserID            (replace with actual folder name, read/write access for multiple researchers)
 │ │ ├─ DSP_EDA_Protein                 (replace with actual folder name, read/write access for multiple researchers)
 │ │ │ ├─ DSP_nCounter_Protein_Post-Norm_EDA.ipynb
 │ │ │ ├─ DSP_nCounter_Protein_QC_Git.ipynb
 │ │ │ ├─ Index.ipynb
 │ │ │ ├─ EdgeR.R
 │ │ │ ├─ LICENSE
 │ │ │ ├─ NSNorm.R
 │ │ │ ├─ README.md
 │ │ │ |
 │ │ │ ├─ functions
 │ │ │ │ ├─ eda.py
 │ │ │ │ ├─ masterdata.py
 │ │ │ │ ├─ plotting.py
 │ │ │ |
 │ │ │ ├─ helpers
 │ │ │ │ ├─ EdgeR_Config_Helper.ipynb
 │ │ │
 │ │ │
 │ │ ├─ Project_Folder        (This folder is at the same level as the "DSP_EDA_Protein" containing the files from GitHub. The name of this folder is requested at the start of QC and EDA notebooks, and must be on the same directory level as "DSP_EDA_Protein" folder.)
 │ │ │ ├─ config.txt
 │ │ │ ├─ edgeR_Config.txt
 │ │ │ ├─ factor_lookup.tsv
 │
 │  ## 
 │
```
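The sibling-folder layout above can be resolved programmatically. The following is a minimal sketch, assuming the repository folder and the project folder share a parent directory as shown in the tree; `project_config_paths` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path


def project_config_paths(repo_dir, project_folder):
    """Return paths to the project config files in the folder beside the repo."""
    repo_dir = Path(repo_dir)
    # The project folder sits at the same level as the GitHub repository folder.
    project_dir = repo_dir.parent / project_folder
    return {
        "config": project_dir / "config.txt",
        "edgeR_config": project_dir / "edgeR_Config.txt",
        "factor_lookup": project_dir / "factor_lookup.tsv",
    }


# Folder names below are the placeholders used in the tree above.
paths = project_config_paths("/home/qutUserID/DSP_EDA_Protein", "Project_Folder")
print(paths["config"].as_posix())
```

Keeping the lookup relative to the repository folder means only the project folder name needs to be supplied when a notebook starts.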

### Working directory

```
## Processed data files that should be kept secure, but can be edited and shared by researchers:
## These are the working files and folders. The folder structure below is an example structure for working on the QUT HPC with eResearch Jupyter Labs/Notebooks


 ├─ work (HPCFS directory)
 │ ├─ Research_Group_Folder            (replace with actual folder name, read/write access for multiple researchers)
 │ │ ├─ Project_Folder                 (replace with actual folder name, read/write access for multiple researchers)
 │ │ │ ├─ DSP_Protein_Data
# Files initially downloaded from DSP with preliminary processed and QC processed data
 │ │ │ │ ├ Initial Dataset.xlsx        (may be over-written after changes to ROI/AOI annotations)
 │ │ │ │ ├ Default_QC.xlsx             (may be over-written after changes to ROI/AOI annotations)
 │ │ │ │ ├ lab_worksheet_P1001##########.txt
# Files output from data QC script to identify ROI/AOIs and probes that fail data QC.
 │ │ │ │ ├ failAOIs.csv                                                      # Check location exportPath
 │ │ │ │ ├ FailProbes.csv                                                    # Check location exportPath
 │ │ │ │ ├ sampleInfo_with_wells.csv
# Files output from normalisation (normalisation may run through multiple iterations. Different normalisation methods may be needed for some comparisons).
 │ │ │ │ |
 │ │ │ │ ├─ Normalisation
 │ │ │ │ │ ├─ NSNorm
 │ │ │ │ │ │ ├─ NanoStringNorm_01...
 │ │ │ │ │ │         ...
 │ │ │ │ │ │ ├─ NanoStringNorm_84...
 │ │ │ │ │ │
 │ │ │ │ │ ├─ NSNormDropped
 │ │ │ │ │ │ ├─ NanoStringNorm_01...
 │ │ │ │ │ │         ...
 │ │ │ │ │ │ ├─ NanoStringNorm_84...
 │ │ │ │ │
 │ │ │ │ ├─ EdgeR
 │ │ │ │ │ ├─ NSNorm_##
 │ │ │ │ │ │ ├─ EdgeR_Results_Files


```

<i>NOTE: the phrase "kept secure" above indicates that the files should be saved to a secure location with backup. Primary data should be kept in read-only locations with backup.</i>
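Before running the QC notebook it can help to confirm that the expected input files are present in the working directory. The following is a minimal sketch, assuming the files sit directly inside the `DSP_Protein_Data` folder; `missing_inputs` is a hypothetical helper, not a function from the repository, and a temporary directory stands in for the real folder.

```python
from pathlib import Path
import tempfile


def missing_inputs(root_dir, file_names):
    """Return the names in file_names that are not files directly under root_dir."""
    root = Path(root_dir)
    return [name for name in file_names if not (root / name).is_file()]


# Demonstration: a temporary directory stands in for DSP_Protein_Data.
expected = ["Initial Dataset.xlsx", "Default_QC.xlsx"]
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Initial Dataset.xlsx").touch()
    missing = missing_inputs(tmp, expected)

print(missing)
```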

### Archived files

```

## Raw data files that are kept secure and immutable (read only access to researchers):
## These files are stored in the archives folder on the QUT research drive

 ├─ R.......
 │ ├─ A...........
 │ │ ├─ i..
 │ │ │ ├─ c..._......._.....
 │ │ │ │ ├─ c.._..._2024XXXX
 │ │ │ │ │ ├─ Images
 │ │ │ │ │ │ ├ Slide_1.png
 │ │ │ │ │ │ ├ Slide-1_clean.png
 │ │ │ │ │ │ ├ Slide-1.zip
 │ │ │ │ │ │ ├ Slide-1.ome.tiff
 │ │ │ │ │ │ ├ Slide-2.png
 │ │ │ │ │ │ ├ Slide-2_clean.png
 │ │ │ │ │ │ ├ Slide-2.zip
 │ │ │ │ │ │ ├ Slide-2.ome.tiff
 │ │ │ │ │ ├─ Worksheets
 │ │ │ │ │ │ ├ lab_worksheet_P1001##########.txt
 │ │ │ │ │ ├─ Data
 │ │ │ │ │ │ ├ ####.RCC

```

# 2. Prepare configuration file<a class="anchor" id="config"></a>

Each project requires a configuration file to store important details required to run the project. A template file is provided as config_example.txt (this should be renamed config.txt after entering project specific details).

Additional variables will be added to the config file as the analysis proceeds. These values can be updated manually in the config file if required.

A complete sample config file is provided below (essential fields are contained in the sample file; additional fields are added during analysis).

### Example config.txt file

<code>
projectName : 'Adams_Bray'
rootDir : /work/researchGroup/projectName/DSP_Protein_Data/
initialDataPath : Initial Dataset.xlsx
QCDataPath : Default_QC.xlsx
labWorksheet01Path : Lab_Worksheet_P1001660017100A.txt
sampleInfoFile : sampleInfo_with_Wells.csv
selectedData : Factor1, Factor2, Factor3, Factor4
probeThresholdIdx : 41
dropSamples : TMA_001_002_Segment_1, TMA_001_014_Segment_1, TMA_001_026_Segment_1, TMA_001_038_Segment_1, TMA_001_050_Segment_1, TMA_002_004_Segment_1, TMA_002_016_Segment_1, TMA_002_028_Segment_1<br>
</code>
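The `key : value` layout of the example file is simple to read with a few lines of Python. The following is a minimal sketch, assuming one field per line with the key and value separated by the first colon; `parse_config` and `as_list` are hypothetical helpers and may differ from how the notebooks actually read `config.txt`.

```python
def parse_config(text):
    """Parse 'key : value' lines into a dict of strings."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and anything without a key/value separator.
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Strip surrounding whitespace and optional quotes from the value.
        config[key.strip()] = value.strip().strip("'\"")
    return config


def as_list(value):
    """Split a comma-separated config value into a list of trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


# A shortened copy of the example config above.
example = """\
projectName : 'Adams_Bray'
rootDir : /work/researchGroup/projectName/DSP_Protein_Data/
selectedData : Factor1, Factor2, Factor3, Factor4
probeThresholdIdx : 41
"""

config = parse_config(example)
selected = as_list(config["selectedData"])
threshold_idx = int(config["probeThresholdIdx"])
print(config["projectName"], selected, threshold_idx)
```

All values are kept as strings by the parser; conversion to lists or integers is done explicitly at the point of use so that a malformed field fails loudly.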

# 3. Data QC notebook

## 3.1. Import data

## 3.2. Infer sample locations

## 3.3. Choose factors of interest

## 3.4. Basic QC and data overview

## 3.5. View and clean annotations <a class="anchor" id="clean"></a>

## 3.6. Plot distribution of AOI surface areas

## 3.7. Plot Binding Density histograms

## 3.8. Visualise raw probe values

## 3.9. Select threshold

## 3.10. Identify outlier AOIs and probes

## 3.11. ERCC correct data

## 3.12. Drop outlier AOIs and probes 

## 3.13. Plot negative controls and housekeeping controls from raw data

## 3.14. Export data

## 3.15. Nanostring Norm

# 4. Data EDA notebook

## 4.1. Import data

 - Config file
 - Normalised data
 - Sample info
 - Factor variable names and structure

## 4.2. Visualise normalised data options

## 4.3. Threshold data

## 4.4. Re-run NS norm with dropped data

## 4.5. View normalised-dropped data

## 4.6. Generate groups for EdgeR analysis
See section 5, EdgeR Helper notebook

## 4.7. Run EdgeR analysis

## 4.8. Convert MD Plots to Volcano Plots

## 4.9. Plot heatmaps and dendrograms

# 5. EdgeR Helper notebook

# 6. Work in progress (not functional):

PCA viewing by factors


# 7. Feature requests and feedback:

kyle.upton.is@gmail.com

First, we want to confirm that all data entered into the DSP files is correct and clean. The following steps are easiest to perform using the DSP data analysis suite to view AOIs, and to download and upload sample annotations. Sample annotations are easiest to edit in Microsoft Excel on any computer.


### Steps:

  1. Download "Annotation template file" from DSP

  2. Add in factors for AOI annotation

  3. Manually review all AOIs

  4. Ensure the comment line has been deleted (row 1 in downloaded file). The header row should be row 1.

  5. Upload file to DSP and select replace tags and factors
  
<i>Note: tags and factors are case sensitive. No additional characters should be present. All tags must be comma separated.</i>

<i>Note: the "Initial Dataset" and "Default QC" files must be re-generated if AOI annotations are updated.</i>

<i>Note: Correlating AOIs to plates and wells was done using the lab worksheet documents and by matching the surface areas in those sheets with the surface areas in the DSP output excel files. 231206_DSP_nCounter_Protein_QC_Subramaniam_HCC_TMA_01 contains code for this but may not be completely up to date.</i>
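The surface-area matching described in the note above can be sketched as follows. This is only an illustration, assuming each AOI and each well record carries a numeric surface area and that a small absolute tolerance is acceptable; `match_by_surface_area`, the tolerance value and the example identifiers are all hypothetical and are not taken from the referenced notebook.

```python
def match_by_surface_area(aoi_areas, well_areas, tolerance=0.5):
    """Match each AOI to the single well whose area lies within the tolerance.

    AOIs with no candidate, or with more than one candidate, map to None so
    that they can be reviewed manually.
    """
    matches = {}
    for aoi, area in aoi_areas.items():
        candidates = [
            well for well, well_area in well_areas.items()
            if abs(well_area - area) <= tolerance
        ]
        matches[aoi] = candidates[0] if len(candidates) == 1 else None
    return matches


# Made-up surface areas for illustration only.
aoi_areas = {"AOI_001": 12345.6, "AOI_002": 20000.0, "AOI_003": 555.0}
well_areas = {"A01": 12345.7, "A02": 20000.2, "A03": 19999.9}

matches = match_by_surface_area(aoi_areas, well_areas)
print(matches)
```

Returning `None` for ambiguous or missing matches, rather than guessing, keeps uncertain AOIs visible for manual checking against the lab worksheet.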

Use the following notebook to review AOI annotations.

This will help find:
 - non-conforming annotations (wrong case or punctuation)
 - empty annotations / missing annotations
 - empty cells
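The checks listed above can be sketched in plain Python. This is a minimal illustration, assuming the annotation sheet has been read into a list of dictionaries (one per AOI); `find_annotation_issues` and the example column names are hypothetical and are not the code used in the QC notebook.

```python
from collections import defaultdict


def find_annotation_issues(rows, factor_columns):
    """Report empty cells, stray whitespace and values that differ only by case."""
    issues = []
    variants = defaultdict(set)
    for row_number, row in enumerate(rows, start=1):
        for column in factor_columns:
            value = row.get(column)
            if value is None or value.strip() == "":
                issues.append((row_number, column, "empty"))
                continue
            if value != value.strip():
                issues.append((row_number, column, "surrounding whitespace"))
            # Group spellings that are identical once case is ignored.
            variants[(column, value.strip().lower())].add(value.strip())
    for (column, _), spellings in sorted(variants.items()):
        if len(spellings) > 1:
            detail = "inconsistent case: " + ", ".join(sorted(spellings))
            issues.append((None, column, detail))
    return issues


# Made-up annotations for illustration only.
rows = [
    {"Tissue": "Tumour", "Region": "Core"},
    {"Tissue": "tumour", "Region": ""},
    {"Tissue": "Normal", "Region": "Edge "},
]
issues = find_annotation_issues(rows, ["Tissue", "Region"])
for issue in issues:
    print(issue)
```

Because tags and factors are case sensitive in DSP, values such as `Tumour` and `tumour` would otherwise be treated as two separate groups downstream.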



<!-- [Click here for QC notebook](DSP_nCounter_Annotation_QC_Git.ipynb) 

ToDo: Separate AOI QC into its own workbook??

-->

# 4. View and QC Data <a class="anchor" id="qc"></a>

The next step is to check the quality of the data that has been obtained from the DSP run.

This is done for both the probes and the AOIs to determine if any probes or AOIs should be excluded from analysis.


[Click here for QC notebook](DSP_nCounter_Protein_QC_Git.ipynb)

<i>Note: The above notebook also contains some code for cleaning and completing data annotations.
</i>

# 5. Initial Data Normalisation <a class="anchor" id="norm"></a>

After basic cleaning of data and removal of AOIs or probes if required, data normalisation is performed using many different parameters and thresholds. These should be reviewed in the next notebook to check for consistency in results, and any outliers. Some methods will reveal low expressing probes. Usually only a few normalisations are likely to be appropriate, and a single method should be chosen.
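To illustrate what a single normalisation option does, the sketch below scales each AOI by the geometric mean of its housekeeping probes. This is not the NanoStringNorm implementation used by the workflow; it is a simplified stand-in, and the function names, probe names and counts are all hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed on the log scale."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def housekeeping_normalise(counts, housekeeping_probes):
    """Scale each AOI so its housekeeping geometric mean equals the cross-AOI mean.

    counts maps AOI name -> {probe name: raw count}.
    """
    hk_geomeans = {
        aoi: geometric_mean([probes[p] for p in housekeeping_probes])
        for aoi, probes in counts.items()
    }
    target = sum(hk_geomeans.values()) / len(hk_geomeans)
    return {
        aoi: {probe: value * target / hk_geomeans[aoi] for probe, value in probes.items()}
        for aoi, probes in counts.items()
    }


# Made-up counts: AOI_2 has twice the signal of AOI_1 across all probes.
counts = {
    "AOI_1": {"HK_1": 100, "HK_2": 400, "Target": 50},
    "AOI_2": {"HK_1": 200, "HK_2": 800, "Target": 100},
}
normalised = housekeeping_normalise(counts, ["HK_1", "HK_2"])
print(normalised["AOI_1"]["Target"], normalised["AOI_2"]["Target"])
```

In this toy example the two AOIs differ only by a constant factor, so the target probe ends up with approximately the same value in both after scaling. Comparing several such options side by side is what the next notebook is for.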

Initial data normalisation is run at the end of the QC notebook.

# 6. Normalised Data EDA <a class="anchor" id="Norm_EDA"></a>

Next we want to view the normalised data and finalise thresholds for excluding probes and samples.

[Click here for Normalised Data EDA notebook](DSP_nCounter_Protein_Post-Norm_EDA.ipynb)

# 7. Set up comparisons <a class="anchor" id="compare"></a>

#### ToDo:
```
return AOIs to ignore in a flat text file

Return AOI groups (with whole annotations of all AOIS in each group)
What is the best file format to use for this? 
```

ToDo: Complete script for compiling comparisons in python and converting to R compatible format



# 8. Run DGE (EdgeR) <a class="anchor" id="dge"></a>

[EdgeR R script for grouped samples](/Users/upton6/Documents/Nanostring/projects/NS_Liver_HCC_DSP/EdgeR/NS_HCC_GLM_Grouped_02.R)

# 9. Convert EdgeR plots to volcano plots <a class="anchor" id="convert"></a>

[EdgeR to Volcano plot notebook](231130_EdgeR_to_Volcano_plots_NS_msWTA.ipynb)
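A volcano plot places the log fold change on the x-axis and the negative log10 p-value on the y-axis. The following is a minimal sketch of that conversion, assuming the EdgeR results have been exported with `logFC`, `PValue` and `FDR` columns; `volcano_points`, the cutoffs and the example rows are hypothetical and are not taken from the linked notebook.

```python
import math


def volcano_points(results, fdr_cutoff=0.05, logfc_cutoff=1.0):
    """Convert differential expression rows into volcano plot coordinates."""
    points = []
    for row in results:
        points.append({
            "probe": row["probe"],
            "x": row["logFC"],
            "y": -math.log10(row["PValue"]),
            # Flag probes that pass both the FDR and fold-change cutoffs.
            "significant": row["FDR"] < fdr_cutoff and abs(row["logFC"]) >= logfc_cutoff,
        })
    return points


# Made-up results for illustration only.
results = [
    {"probe": "Probe_A", "logFC": 2.0, "PValue": 0.001, "FDR": 0.01},
    {"probe": "Probe_B", "logFC": 0.2, "PValue": 0.5, "FDR": 0.6},
]
points = volcano_points(results)
for point in points:
    print(point)
```

The resulting `x` and `y` values can be passed to any plotting library; the cutoffs are illustrative defaults and should be chosen per project.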

