# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

To investigate the effects of oxycodone withdrawal in the absence or presence of chronic pain, with additional treatment by the HDAC1/2 inhibitor RBC1HI. The broader goal is to explore novel treatments that lessen withdrawal symptoms in patients with opioid dependence and aid their transition to non-opioid drugs.

What do the conditions mean?

oxy: oxycodone administered

sal: saline administered in place of oxycodone

What do the genotypes mean?

SNI: spared nerve injury; these mice underwent surgery to induce chronic pain

Sham: no chronic pain; a sham surgery was performed to control for surgical stress

**Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.**

What would you do? Organize the files by treatment condition into the four possible combinations given in the matrix below:

|          | Oxy        | Sal        |
|----------|------------|------------|
| **Sham** | Sham - Oxy | Sham - Sal |
| **SNI**  | SNI - Oxy  | SNI - Sal  |

**Which groups would you compare to each other?**

Compare groups that share one factor level, i.e. horizontally and vertically across the treatment matrix above, so that one of the two variables is always held constant and controlled for.
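The rule above can be sketched in a few lines of plain Python. This is only an illustration of the "change one factor at a time" idea; the tuple representation of the groups and the helper name `differs_in_one_factor` are assumptions made for this example, not something taken from the paper.

```python
from itertools import combinations, product

surgeries = ("Sham", "SNI")
treatments = ("Sal", "Oxy")

# The four groups of the 2x2 design as (surgery, treatment) tuples.
groups = list(product(surgeries, treatments))


def differs_in_one_factor(a, b):
    """True if two groups differ in exactly one of the two factors."""
    return sum(x != y for x, y in zip(a, b)) == 1


# Keep only the pairs in which one factor is held constant.
controlled_pairs = [
    (a, b) for a, b in combinations(groups, 2) if differs_in_one_factor(a, b)
]

for a, b in controlled_pairs:
    print(" - ".join(a), "vs", " - ".join(b))
```

The two diagonal pairs (for example Sham - Sal vs SNI - Oxy) are dropped because both factors change at once, so an effect could not be attributed to either one.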

**Please also mention which outcome you would expect to see from each comparison.**

Sham-operated mice should show lessened addiction because no chronic pain is present; saline-injected mice should not be addicted because they received no opioids.

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) to get the information you need for each run they uploaded to the SRA.<br>
So, instead of diving directly into downloading the data and starting the analysis, you first need to sort the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results:
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [16]:
import pandas as pd
conditions = pd.read_excel('conditions_runs_oxy_project.xlsx', index_col='Run').drop(columns='Patient').notna()
conditions.columns = conditions.columns.str.lower() # make column names uniform

In [17]:

# Outer double quotes keep the f-strings valid on Python versions before 3.12
print(f"Oxycodone True count: {conditions['condition: oxy'].sum()}")
print(f"Saline True count: {conditions['condition: sal'].sum()}")

Oxycodone True count: 8
Saline True count: 8


In [21]:
# Filter rows by condition, then count the True values in the genotype column
print(f"SNI-Oxy Count: {conditions[conditions['condition: oxy']]['genotype: sni'].sum()}")
print(f"SNI-Saline Count: {conditions[conditions['condition: sal']]['genotype: sni'].sum()}")
print(f"SHAM-Oxy Count: {conditions[conditions['condition: oxy']]['genotype: sham'].sum()}")
print(f"SHAM-Saline Count: {conditions[conditions['condition: sal']]['genotype: sham'].sum()}")

SNI-Oxy Count: 4
SNI-Saline Count: 4
SHAM-Oxy Count: 4
SHAM-Saline Count: 4
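The per-genotype count and the full condition-by-genotype table can also be obtained in one step. The following is a minimal sketch on a toy table, not the real sheet: the run IDs are hypothetical, the helper `tidy_labels` is introduced only for this example, and it assumes that exactly one column per factor is True in every row.

```python
import pandas as pd

# Toy stand-in for the boolean `conditions` table (hypothetical run IDs).
toy = pd.DataFrame(
    {
        "condition: sal": [True, False, True, False],
        "condition: oxy": [False, True, False, True],
        "genotype: sni": [True, True, False, False],
        "genotype: sham": [False, False, True, True],
    },
    index=pd.Index(["run_a", "run_b", "run_c", "run_d"], name="Run"),
)


def tidy_labels(table: pd.DataFrame) -> pd.DataFrame:
    """Collapse the one-hot boolean columns into one label column per factor."""
    tidy = pd.DataFrame(index=table.index)
    for factor in ("condition", "genotype"):
        prefix = f"{factor}: "
        cols = [c for c in table.columns if c.startswith(prefix)]
        # idxmax returns, per row, the name of the column that holds True
        labels = table[cols].astype(int).idxmax(axis=1)
        tidy[factor] = labels.str.replace(prefix, "", regex=False)
    return tidy


tidy = tidy_labels(toy)
per_genotype = tidy["genotype"].value_counts()  # samples per genotype
per_group = pd.crosstab(tidy["genotype"], tidy["condition"])  # 2x2 count table
print(per_genotype)
print(per_group)
```

If matplotlib is installed, `per_group.plot(kind="bar")` would draw the counts as a grouped bar chart.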


They were kind enough to also provide the number of bases per run, so that you can estimate how much space the data will take up on your cluster.<br>
Add a new column with this information (base_counts.csv) to your fancy table and sort your dataframe by this column and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [22]:
conditions['base_counts'] = pd.read_csv('base_counts.csv', index_col='Run')

In [24]:
print(conditions.sort_values('base_counts', ascending=True).head(2))

             rna-seq  dna-seq  condition: sal  condition: oxy  genotype: sni  \
Run                                                                            
SRR23195516     True    False           False            True           True   
SRR23195511     True    False           False            True          False   

             genotype: sham  base_counts  
Run                                       
SRR23195516           False   6203117700  
SRR23195511            True   6456390900  
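The base counts give a rough idea of the disk space needed. The sketch below is a back-of-the-envelope estimate only: it assumes about 2 bytes per base in uncompressed FASTQ (one sequence character plus one quality character) and ignores read headers and newlines. The helper name `fastq_size_gb` is hypothetical, and compressed `.fastq.gz` files will be considerably smaller by a factor that depends on the data.

```python
def fastq_size_gb(n_bases: int, bytes_per_base: float = 2.0) -> float:
    """Rough uncompressed FASTQ size in GB (assumed 2 bytes per base)."""
    return n_bases * bytes_per_base / 1e9


# Base counts of the two smallest runs, taken from the table above.
base_counts = {"SRR23195516": 6_203_117_700, "SRR23195511": 6_456_390_900}

for run, n_bases in base_counts.items():
    print(f"{run}: ~{fastq_size_gb(n_bases):.1f} GB uncompressed")

total_gb = sum(fastq_size_gb(n) for n in base_counts.values())
print(f"Total: ~{total_gb:.1f} GB uncompressed")
```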


In [28]:
conditions.sort_values('base_counts', ascending=True).head(2).index.to_series().to_csv('fetch_runs.csv', index=False, header=False)

In [30]:
!nextflow run nf-core/fetchngs \
   -profile docker \
   --input fetch_runs.csv \
   --outdir fetch_out


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/fetchngs` [0;2m[[0;1;36mhappy_jepsen[0;2m] DSL2 - [36mrevision: [0;36m8ec2d934f9 [master][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/fetchngs v1.12.0-g8ec2d93[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision       : [0;32mmaster[0m
  [0;34mrunName        : [0;

While your files are downloading, go back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout so we can discuss the different ideas.

Use the tools given in the methods section with identical parameters and the same 2x2 factorial design.
Read alignment: HISAT2; read counting: HTSeq; differential analysis: DESeq2, with a p-value cutoff < 0.05 and an absolute log2(fold change) > 0.5.
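To make the two cutoffs concrete, here is a minimal sketch of how they would be applied to a results table. The gene names and values are made up for illustration, and the helper `is_significant` is hypothetical. The keys `log2FoldChange` and `pvalue` mirror the column names of a DESeq2 results table; whether the raw or the adjusted p-value should be used has to be checked against the paper's methods section.

```python
def is_significant(gene, p_cutoff=0.05, lfc_cutoff=0.5):
    """Apply both thresholds: p-value below cutoff AND |log2FC| above cutoff."""
    return gene["pvalue"] < p_cutoff and abs(gene["log2FoldChange"]) > lfc_cutoff


# Made-up example rows, NOT results from the study.
toy_results = [
    {"gene": "gene_a", "log2FoldChange": 1.2, "pvalue": 0.001},   # passes both
    {"gene": "gene_b", "log2FoldChange": -0.8, "pvalue": 0.03},   # passes both (down)
    {"gene": "gene_c", "log2FoldChange": 0.3, "pvalue": 0.001},   # effect too small
    {"gene": "gene_d", "log2FoldChange": 2.0, "pvalue": 0.2},     # not significant
]

hits = [g["gene"] for g in toy_results if is_significant(g)]
print(hits)
```

Taking the absolute value keeps both up- and down-regulated genes, which is why `gene_b` passes despite its negative fold change.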

Chain the nf-core pipelines rnaseq and differentialabundance together for the analysis.