# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

The objective of the study was to investigate how gene expression in the brain is affected by chronic oxycodone exposure and spontaneous withdrawal, in both the control (Sham) and SNI groups. The researchers wanted to understand the genetic and molecular mechanisms that underlie the use of opioids such as oxycodone. Multiple brain regions (NAc, mPFC, VTA) were profiled using RNA-seq to identify affected pathways and upstream regulators.

What do the conditions mean?

oxy:    treatment group of mice receiving chronic oxycodone injections, followed by spontaneous drug withdrawal


sal:    control group receiving saline injections

-> control for treatment

What do the genotypes mean?

SNI:    group of mice where a Spared Nerve Injury was performed (1-2 mm of common peroneal and sural nerves were removed)


Sham:   control group with surgery without nerve injury

-> control for pain

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Which groups would you compare to each other?

Please also mention which outcome you would expect to see from each comparison.

- Run a differential expression analysis pipeline for RNA-seq data, including quality control steps.
- Compare conditions within one genotype and genotypes within one condition:
    - Genotype Sham, Oxy vs. Sal: effect of oxycodone withdrawal in non-pain background.
    - Genotype SNI, Oxy vs. Sal: effect of oxycodone withdrawal in chronic pain background. 
    - Condition Oxy, SNI vs. Sham: effect of the injury under the oxycodone withdrawal condition
    - Condition Sal, SNI vs. Sham: effect of the injury under the control condition
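
The four comparisons listed above can be written down as a small contrast table before any pipeline is run. This is only a sketch: the column names (`stratum`, `factor`, `test`, `reference`) are made up for illustration and do not correspond to the input format of any particular tool.

```python
import pandas as pd

# Factor levels as used in the study design; the reference level comes first.
condition_levels = ["Sal", "Oxy"]   # Sal is the treatment control
genotype_levels = ["Sham", "SNI"]   # Sham is the pain control

contrasts = []

# Oxy vs. Sal, separately within each genotype
for genotype in genotype_levels:
    contrasts.append(
        {"stratum": f"genotype={genotype}", "factor": "condition",
         "test": "Oxy", "reference": "Sal"}
    )

# SNI vs. Sham, separately within each condition
for condition in condition_levels:
    contrasts.append(
        {"stratum": f"condition={condition}", "factor": "genotype",
         "test": "SNI", "reference": "Sham"}
    )

contrast_table = pd.DataFrame(contrasts)
print(contrast_table)
```

Writing the reference level down explicitly helps later on, because the sign of a log fold change depends on which group is treated as the baseline.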



Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) containing the information you need for each run they uploaded to the SRA.<br>
So, instead of directly diving into downloading the data and starting the analysis, you first need to sort out the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results:
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [34]:
import pandas as pd
import numpy as np

# Read in metadata from files
conditions = pd.read_excel('conditions_runs_oxy_project.xlsx', sheet_name='Sheet1', index_col=1)
counts = pd.read_csv('base_counts.csv', index_col=0)

In [23]:
conditions = conditions.fillna(False)
conditions = conditions.replace('x', True)
conditions.drop(columns=['Patient', 'RNA-seq', 'DNA-seq'], inplace=True)



In [24]:
conditions

| Run | condition: Sal | Condition: Oxy | Genotype: SNI | Genotype: Sham |
|---|---|---|---|---|
| SRR23195505 | True | False | True | False |
| SRR23195506 | False | True | False | True |
| SRR23195507 | True | False | False | True |
| SRR23195508 | False | True | True | False |
| SRR23195509 | False | True | True | False |
| SRR23195510 | True | False | True | False |
| SRR23195511 | False | True | False | True |
| SRR23195512 | True | False | False | True |
| SRR23195513 | True | False | True | False |
| SRR23195514 | False | True | False | True |


In [28]:
# samples per condition
conditions.filter(regex='condition|Condition').sum()

condition: Sal    8
Condition: Oxy    8
dtype: int64

In [26]:
# samples per genotype
conditions.filter(regex='Genotype').sum()

Genotype: SNI     8
Genotype: Sham    8
dtype: int64

In [33]:
# samples per combined condition and genotype
combination_counts = pd.crosstab(
    index=[conditions["condition: Sal"], conditions["Condition: Oxy"]],
    columns=[conditions["Genotype: SNI"], conditions["Genotype: Sham"]]
)
print(combination_counts)

Genotype: SNI                 False True 
Genotype: Sham                True  False
condition: Sal Condition: Oxy            
False          True               4     4
True           False              4     4
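
The one-hot layout makes the crosstab above hard to read. One possible alternative is to collapse the boolean columns into two categorical columns first. The sketch below uses a toy table with placeholder run IDs rather than the real sheet, and the helper `collapse` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Toy stand-in for the one-hot table (placeholder run IDs, not real data)
onehot = pd.DataFrame(
    {
        "condition: Sal": [True, False, True, False],
        "Condition: Oxy": [False, True, False, True],
        "Genotype: SNI": [True, False, False, True],
        "Genotype: Sham": [False, True, True, False],
    },
    index=pd.Index(["run_1", "run_2", "run_3", "run_4"], name="Run"),
)


def collapse(df, prefix):
    """Per row, return the label of the True column whose name starts with prefix."""
    cols = [c for c in df.columns if c.lower().startswith(prefix)]
    # idxmax picks the column holding the 1; assumes exactly one True per row
    labels = df[cols].astype(int).idxmax(axis=1)
    # "Condition: Oxy" -> "Oxy"
    return labels.str.split(": ").str[1]


tidy = pd.DataFrame(
    {
        "condition": collapse(onehot, "condition"),
        "genotype": collapse(onehot, "genotype"),
    }
)

# Samples per genotype/condition combination
per_group = pd.crosstab(tidy["genotype"], tidy["condition"])
print(per_group)
```

With the tidy table, `tidy["condition"].value_counts()` answers the first question directly, and `per_group.plot(kind="bar")` would draw the combination counts, assuming matplotlib is installed.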


They were kind enough to also provide the number of bases per run, so that you know how much space the data will take up on your cluster.<br>
Add a new column to your fancy table with this information (base_counts.csv) and sort your dataframe according to this information and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [44]:
# Join dataframes, sort by 'Bases' and pick the two smallest runs
smallest_runs = pd.DataFrame(conditions.join(counts, how='inner').sort_values(by='Bases', ascending=True)[:2].index)
smallest_runs.to_csv('ids.csv', index=False, header=None)
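
The cell above sorts by `Bases` only. The task also asks for sorting by condition, which could look like the sketch below. It assumes a table that already has a single `condition` column and uses made-up run IDs and base counts, not the project data.

```python
import pandas as pd

# Toy metadata and base counts (invented values for illustration)
run_index = pd.Index(["run_a", "run_b", "run_c", "run_d"], name="Run")
meta = pd.DataFrame({"condition": ["Sal", "Oxy", "Sal", "Oxy"]}, index=run_index)
bases = pd.DataFrame({"Bases": [500, 300, 200, 400]}, index=run_index)

# Add the base counts as a new column; an inner join keeps runs present in both tables
table = meta.join(bases, how="inner")

# Sort by condition first, then by size within each condition
table = table.sort_values(by=["condition", "Bases"])

# Pick the two smallest runs overall, regardless of condition
smallest = table.nsmallest(2, "Bases")

# One accession per line, no header; passing a path to to_csv would write a file instead
ids_text = smallest.index.to_series().to_csv(index=False, header=False)
print(ids_text)
```

`nsmallest` is independent of the current sort order, so the table can stay sorted by condition for inspection while the selection is still based on size alone.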

In [45]:
# run nf-core/fetchngs with the following command:
!nextflow run nf-core/fetchngs -profile docker -r 1.12.0 --input ids.csv --outdir fetchngs --max_memory "12GB" -resume


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/fetchngs` [0;2m[[0;1;36mstupefied_sammet[0;2m] DSL2 - [36mrevision: [0;36m8ec2d934f9 [1.12.0][m
[K
ERROR ~ Unable to acquire lock on session with ID f3cdac15-2d36-4419-b233-94d5ac0a6748

Common reasons for this error are:
 - You are trying to resume the execution of an already running pipeline
 - A previous execution was abruptly interrupted, leaving the session open

You can see which process is holding the lock file by using the following command:
 - lsof /mnt/c/Users/NicolaiOswald/OneDrive - UT Cloud/Dokumente/Studium Tübingen/Computational Workflows/computational-workflows-2025/notebooks/day_02/.nextflow/cache/f3cdac15-2d36-4419-b233-94d5ac0a6748/db/LOCK

 -- Check '.nextflow.log' file for details


While your files are downloading, get back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done with this, shout so we can discuss the different ideas.

Optimally, find an existing nf-core pipeline for RNA-seq analysis that uses the tools mentioned in the paper (HISAT2, HTSeq, DESeq2). Proposal: use nf-core/rnaseq.
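
nf-core/rnaseq takes a CSV samplesheet with the columns `sample`, `fastq_1`, `fastq_2` and `strandedness`. A minimal sketch of building one with pandas follows. Treat the details as assumptions: the two run IDs are picked arbitrarily from the table above, the FASTQ paths are hypothetical, and paired-end reads are assumed.

```python
import pandas as pd

# Arbitrary example runs; replace with the runs that were actually downloaded
run_ids = ["SRR23195505", "SRR23195506"]

# Hypothetical location and naming of the downloaded FASTQ files
fastq_dir = "fetchngs/fastq"

rows = [
    {
        "sample": run,
        "fastq_1": f"{fastq_dir}/{run}_1.fastq.gz",
        "fastq_2": f"{fastq_dir}/{run}_2.fastq.gz",  # assumes paired-end data
        "strandedness": "auto",  # let the pipeline infer strandedness
    }
    for run in run_ids
]

samplesheet = pd.DataFrame(
    rows, columns=["sample", "fastq_1", "fastq_2", "strandedness"]
)

# CSV text; pass a file name to to_csv to write the samplesheet to disk
samplesheet_csv = samplesheet.to_csv(index=False)
print(samplesheet_csv)
```

fetchngs also writes a samplesheet into its output directory, so building one by hand may not be needed. Check the actual file names under the `--outdir` folder before relying on the paths used here.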