## **Project Introduction**

In living cells, genes are expressed in two main steps: first, DNA is transcribed into messenger RNA (mRNA), and then mRNA is translated into proteins. Traditional RNA sequencing (RNA-seq) measures how much mRNA is present, but that abundance does not always reflect how much protein is produced, because not all mRNAs are translated with the same efficiency.

This project focuses on understanding **how gene expression is controlled at the translation level** in *Escherichia coli* (E. coli), especially under different nutrient limitation conditions (like low carbon, nitrogen, or phosphate).

To do this, the study uses two powerful techniques:

* **RNA-seq**: This method sequences all the mRNAs in a cell to measure gene expression.
* **Ribo-seq (Ribosome Profiling)**: This newer method captures only the parts of mRNA that are being read by ribosomes — the cellular machines that build proteins. By sequencing these ribosome-protected fragments, we can see which mRNAs are actually being used to make proteins, and how efficiently.

By comparing RNA-seq and Ribo-seq data across different stress conditions and genetic backgrounds, this project aims to uncover patterns of **translational regulation** — revealing how cells fine-tune protein production beyond just making RNA.

## **Strains Used in the Study**

The study uses **three types of E. coli strains**:

| Strain Name        | Description                             | Type                       |
| ------------------ | --------------------------------------- | -------------------------- |
| **WT (Wild Type)** | Normal E. coli strain with no mutations | Control                    |
| **ΔrplA**          | Knockout of the **rplA** gene           | Translation-related mutant |
| **ΔleuB**          | Knockout of the **leuB** gene           | Metabolism-related mutant  |


### 1. **WT – Wild Type**

* This is the **baseline strain** with no gene deletions.
* Used to observe **natural responses** of E. coli to nutrient limitations.
* Serves as a **control group** to compare against mutants.

### 2. **ΔrplA – Knockout of rplA**

* **rplA** encodes **ribosomal protein L1**, a component of the **50S large ribosomal subunit**.
* Function:
  * Involved in **ribosome assembly and translation elongation**.
  * Also plays a role in **feedback regulation of its own synthesis** by binding to its own mRNA.
* Why use it?
  * Knocking out **rplA** allows researchers to study **how disrupting the translation machinery** affects global gene expression and translational regulation.
  * Also helps to reveal **feedback loops** in translation control — a key theme of the study.

### 3. **ΔleuB – Knockout of leuB**

* **leuB** encodes **3-isopropylmalate dehydrogenase**, an enzyme in the **leucine biosynthesis pathway**.
* Function:
  * Catalyzes a step in the synthesis of **leucine**, an amino acid that wild-type E. coli makes for itself.
* Why use it?
  * ΔleuB cannot make leucine on its own → becomes **leucine-limited** in minimal medium.
  * This simulates **amino acid starvation**, allowing the study of **translation under nutrient stress**.
  * Helps investigate how cells **reprogram translation** when key biosynthetic pathways are impaired.

The full analysis workflow, including the optional **Ribosome Profiling Analysis** and **Functional Analysis** steps:

```plaintext
1. Fetch Data from NCBI
   └── sra-tools (`prefetch`)

2. Convert SRA Files to FASTQ
   └── sra-tools (`fasterq-dump`)

3. Quality Check (Pre-Trimming)
   └── FastQC + MultiQC (Assess raw read quality; MultiQC aggregates the per-sample reports)

4. Adapter Removal and Trimming
   └── Trimmomatic (Trim adapters, low-quality bases)

5. Quality Check (Post-Trimming)
   └── FastQC + MultiQC (Assess quality after trimming)

6. Read Alignment
   └── STAR / Bowtie (Map reads to the reference genome)

7. Counting Reads (Feature Counting)
   └── featureCounts (Count number of reads mapping to genes)

8. Differential Gene Expression Analysis
   └── R (e.g., DESeq2, edgeR, or limma)

9. (Optional) Ribosome Profiling Analysis
   └── Additional steps for Ribo-seq-specific analysis
   └── Analysis of ribosome footprints, translation efficiency, etc.
   └── Tools: riboSeqR, RiboTaper

10. (Optional) Functional Analysis
    └── Gene Ontology (GO) Enrichment using tools like DAVID, clusterProfiler
    └── Pathway Analysis: Reactome, KEGG, GSEA
```
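Steps 1 and 2 can be sketched as a small helper that builds the download and conversion commands for one run. This is a minimal sketch, not the project's actual script: the function name `download_commands` and the directory names are hypothetical, it assumes `prefetch` and `fasterq-dump` from sra-tools are on the `PATH`, and the flags should be checked against the installed sra-tools version. The commands are only built and printed, not executed.

```python
import shlex
from typing import List


def download_commands(run_id: str, sra_dir: str, fastq_dir: str,
                      threads: int = 4) -> List[List[str]]:
    """Build (but do not run) the sra-tools commands for one SRA run.

    The function name and directory layout are illustrative assumptions.
    """
    return [
        # Step 1: download the run archive into sra_dir.
        ["prefetch", run_id, "--output-directory", sra_dir],
        # Step 2: convert the downloaded run to FASTQ files in fastq_dir.
        ["fasterq-dump", f"{sra_dir}/{run_id}",
         "--outdir", fastq_dir, "--threads", str(threads)],
    ]


if __name__ == "__main__":
    # SRR15446740 is the first run listed in the sample sheet.
    for command in download_commands("SRR15446740", "data/sra", "data/fastq"):
        print(shlex.join(command))
```

Keeping each command as a list of arguments means it can later be passed to `subprocess.run` without shell quoting problems.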

### Breakdown of the Additional Steps:

#### 9. **Ribosome Profiling Analysis**:

* **riboSeqR**: A Bioconductor package for parsing and analyzing ribosome footprint data, including reading-frame periodicity, which underpins any estimate of translation efficiency.
* **RiboTaper**: Identifies actively translated regions from the three-nucleotide periodicity of ribosome footprints, which helps separate genuine translation from background reads.
* **Translation Efficiency**: Compare ribosome reads to mRNA abundance (from RNA-Seq) to calculate translation efficiency for each gene.
* **Examine Ribosome Footprints**: Visualize the distribution of ribosome footprints across mRNA sequences to identify regions of high ribosome occupancy, possible pauses, and codon usage.
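The translation efficiency comparison above can be sketched as a per-gene log2 ratio of normalized Ribo-seq counts to normalized RNA-seq counts. This is one common simplification rather than the method of any specific tool: the function names, the pseudocount, and the toy counts are assumptions made for illustration. Counts-per-million scaling is used because gene length cancels in the ratio when both counts are taken over the same region.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw read counts to counts per million mapped reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no counted reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_translation_efficiency(ribo: Dict[str, int], rna: Dict[str, int],
                                pseudocount: float = 1.0) -> Dict[str, float]:
    """Return log2(Ribo-seq CPM / RNA-seq CPM) for genes present in both.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no RNA-seq reads.
    """
    ribo_cpm = counts_per_million(ribo)
    rna_cpm = counts_per_million(rna)
    return {
        gene: math.log2((ribo_cpm[gene] + pseudocount) /
                        (rna_cpm[gene] + pseudocount))
        for gene in ribo_cpm
        if gene in rna_cpm
    }


if __name__ == "__main__":
    # Toy counts for two hypothetical genes, not real data.
    ribo_counts = {"geneA": 300, "geneB": 100}
    rna_counts = {"geneA": 100, "geneB": 300}
    for gene, te in log2_translation_efficiency(ribo_counts, rna_counts).items():
        print(gene, round(te, 3))
```

A positive value means the gene carries more ribosome footprints than its mRNA level alone would suggest; a negative value means fewer.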

#### 10. **Functional Analysis**:

* **Gene Ontology (GO) Enrichment**: Tools like **DAVID** or **ClusterProfiler** can help identify enriched biological processes, molecular functions, or cellular components in differentially expressed or translated genes.
* **Pathway Analysis**: Further enrich your findings with pathway analysis tools such as **Reactome**, **KEGG**, or **GSEA** (Gene Set Enrichment Analysis) to interpret the functional implications of the genes involved in specific pathways.
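The idea behind GO over-representation can be sketched with a hypergeometric tail probability: given how many genes in the genome carry a term, how surprising is the number found among the selected genes? This is a minimal sketch of the general test only. DAVID and clusterProfiler apply their own variants and multiple-testing corrections, and the function name and example numbers here are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           selected: int, overlap: int) -> float:
    """P(X >= overlap) for a hypergeometric draw.

    population: genes in the background set
    annotated:  background genes carrying the GO term
    selected:   genes in the list of interest
    overlap:    selected genes that carry the GO term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 4000 background genes, 50 annotated with a term,
    # 200 genes of interest, 10 of which carry the term.
    print(hypergeom_enrichment_p(4000, 50, 200, 10))
```

In practice the p-values for all tested terms would still need correction for multiple testing before any term is reported as enriched.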

Together, the optional **ribosome profiling** and **functional analysis** steps extend the core RNA-seq workflow to cover translation-level analysis of the Ribo-seq data.


In [None]:
riboSeq_pipeline/
│
├── assets/                     # Directory for static resources (e.g., reference genomes, annotations)
│   ├── reference/              # Reference genome files (FASTA, GTF, etc.)
│   └── annotations/            # Gene annotations (GTF, BED files)
│
├── bin/                        # Custom scripts or functions (e.g., helper scripts, utility functions)
│   ├── align.sh                # Shell script for alignment (STAR/Bowtie)
│   ├── trim.sh                 # Shell script for trimming (Trimmomatic)
│   └── count.sh                # Shell script for feature counting (FeatureCounts)
│
├── data/                       # Raw data and intermediate results
│   ├── fastq/                  # FASTQ files (raw reads)
│   ├── sra/                    # SRA files (if fetched from NCBI)
│   ├── trimmed/                # Trimmed reads (post-Trimmomatic)
│   ├── aligned/                # Aligned BAM files
│   └── counts/                 # FeatureCounts output (gene-level counts)
│
├── nf/                         # Nextflow-specific files
│   ├── main.nf                 # Main Nextflow workflow script
│   ├── config/                 # Configuration files (for Nextflow parameters)
│   │   ├── process.config      # Custom process configurations (e.g., CPU, memory)
│   │   └── params.config       # Parameters for the pipeline (e.g., file paths, sample names)
│   ├── processes/              # Directory for all Nextflow process scripts
│   │   ├── align.nf            # Process for read alignment (STAR/Bowtie)
│   │   ├── trim.nf             # Process for trimming (Trimmomatic)
│   │   ├── count.nf            # Process for feature counting (FeatureCounts)
│   │   └── go_analysis.nf      # Process for GO analysis (if separate)
│   └── profiles/               # Different execution profiles (e.g., local, cloud)
│       └── local.config        # Local configuration for running Nextflow locally
│
├── results/                    # Final output and visualizations
│   ├── differential_expression/ # Differential gene expression results (DESeq2, edgeR output)
│   ├── riboSeq_analysis/       # RiboSeq-specific results (footprint analysis, translation efficiency)
│   ├── functional_analysis/    # Gene Ontology (GO) and pathway analysis results
│   └── figures/                # Generated plots and figures (e.g., heatmaps, metagene plots)
│
├── scripts/                    # General scripts (e.g., helper scripts for data processing)
│   ├── go_analysis.R           # R script for GO enrichment
│   ├── riboSeq_analysis.R      # R script for ribosome profiling (e.g., translation efficiency)
│   └── differential_expression.R # R script for differential expression analysis
│
├── Dockerfile                  # Dockerfile for containerizing the pipeline (if applicable)
├── README.md                   # Project description and instructions for using the pipeline
├── nextflow.config             # Main Nextflow configuration file
└── environment.yaml            # Conda environment dependencies


## Groovy
1. Variables, data types, and collections

2. String interpolation

3. Basic control structures (if-else, loops)

4. Groovy closures

5. How to define processes and work with channels in Nextflow

---

### **1. `Channel` API Methods**

Nextflow provides a wide variety of methods that can be used to **create**, **transform**, and **manipulate channels**. Channels are the main way to pass data between processes.

Here’s a list of the most common **`Channel` API** methods, categorized by their purposes.

---

### **1.1 Channel Creation Methods**

These methods are used to **create channels** from various sources:

* **`fromPath()`**: Creates a channel from file paths. You can pass file paths (wildcards allowed) to create a channel of files.

  Example:

  ```groovy
  Channel.fromPath('*.txt')
  ```

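* **`fromFilePairs()`**: Creates a channel of file pairs grouped by a shared name prefix. The glob pattern must cover both mates, and each emitted item is a tuple of the shared prefix and the list of matching files.

  Example:

  ```groovy
  Channel.fromFilePairs('*_{1,2}.fastq')
  ```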

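* **`from()`**: Creates a channel that emits each value of a **list** or **array** as a separate item. Current Nextflow releases deprecate it in favour of `Channel.of()` and `Channel.fromList()`.

  Example:

  ```groovy
  Channel.from([1, 2, 3, 4, 5])
  ```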

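* **`fromList()`**: Creates a channel from a list. Nextflow has no `fromString()` factory; to emit the characters of a **string** one at a time, convert the string to a list first.

  Example:

  ```groovy
  Channel.fromList('Hello'.toList())
  ```

* **`value()`**: Creates a value channel that holds a single item, such as a **single file**. Nextflow has no `fromFile()` factory; `fromPath()` also accepts a single path.

  Example:

  ```groovy
  Channel.value(file('file.txt'))
  ```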

* **`empty()`**: Creates an **empty channel**.

  Example:

  ```groovy
  Channel.empty()
  ```
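To make the shape of paired-file items concrete, here is a rough Python analogy of grouping FASTQ mates by their shared prefix. It is not Nextflow code and does not reproduce Nextflow's glob handling: the function name `group_file_pairs`, the `_1`/`_2` naming scheme, and the choice to drop incomplete pairs are all assumptions of this sketch.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def group_file_pairs(filenames: List[str]) -> List[Tuple[str, List[str]]]:
    """Group names like sample_1.fastq / sample_2.fastq by their prefix.

    Returns (prefix, [mate1, mate2]) tuples; names that do not match the
    pattern, and samples with only one mate, are left out.
    """
    pattern = re.compile(r"^(?P<sample>.+)_[12]\.fastq$")
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        match = pattern.match(name)
        if match:
            groups[match.group("sample")].append(name)
    return [
        (sample, sorted(files))
        for sample, files in sorted(groups.items())
        if len(files) == 2
    ]


if __name__ == "__main__":
    names = ["s1_2.fastq", "s1_1.fastq", "s2_1.fastq", "notes.txt"]
    print(group_file_pairs(names))
```

Each result pairs a sample identifier with its two files, which is the kind of item a downstream alignment step for paired-end reads would consume.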

---

### **1.2 Channel Transformation Methods**

These methods are used to **transform the data** flowing through channels:

* **`map()`**: Transforms the data in the channel using a **Groovy closure**.

  Example:

  ```groovy
  Channel.from([1, 2, 3, 4, 5]).map { it * 2 }  // Emits 2, 4, 6, 8, 10 as separate items
  ```

* **`filter()`**: Filters the data based on a condition.

  Example:

  ```groovy
  Channel.from([1, 2, 3, 4, 5]).filter { it % 2 == 0 }  // Emits 2 and 4
  ```

* **`collect()`**: Collects and transforms the data into a list.

  Example:

  ```groovy
  Channel.from([1, 2, 3, 4, 5]).collect()  // Results in [1, 2, 3, 4, 5]
  ```

* **`splitText()`**: Splits text data into multiple lines.

  Example:

  ```groovy
  Channel.fromPath('file.txt').splitText()
  ```

* **`flatten()`**: Flattens nested channels or collections into a single channel.

  Example:

  ```groovy
  Channel.from([1, 2, [3, 4], 5]).flatten()  // Emits 1, 2, 3, 4, 5 as separate items
  ```

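* **`merge()`**: Pairs the items of two or more channels by position and emits each pairing as a tuple. To pool items from several channels into one stream instead, use `concat()` or `mix()`.

  Example:

  ```groovy
  Channel.from([1, 2, 3]).merge(Channel.from([4, 5, 6]))  // Emits [1, 4], [2, 5], [3, 6]
  ```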

* **`concat()`**: Concatenates multiple channels.

  Example:

  ```groovy
  Channel.from([1, 2, 3]).concat(Channel.from([4, 5, 6]))  // Results in [1, 2, 3, 4, 5, 6]
  ```

* **`join()`**: Joins the items of two channels that share a matching key, by default the first element of each tuple.

  Example:

  ```groovy
  Channel.from([['X', 1], ['Y', 2]]).join(Channel.from([['Y', 5], ['X', 4]]))  // Emits [X, 1, 4] and [Y, 2, 5]
  ```
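The three ways of combining channels are easy to confuse, so here is a rough Python analogy using plain lists. It only mirrors the shape of the results, not Nextflow's asynchronous behaviour: in Nextflow, `concat()` appends one channel after another, `merge()` pairs items by position, and `join()` matches tuples on a shared key. The Python function names are chosen for illustration.

```python
from itertools import chain
from typing import Any, List, Sequence, Tuple


def concat(*channels: Sequence[Any]) -> List[Any]:
    """All items of the first channel, then the second, and so on."""
    return list(chain(*channels))


def merge(left: Sequence[Any], right: Sequence[Any]) -> List[Tuple[Any, Any]]:
    """Pair items by position, like zip."""
    return [(a, b) for a, b in zip(left, right)]


def join(left: Sequence[tuple], right: Sequence[tuple]) -> List[tuple]:
    """Match tuples whose first element (the key) is equal."""
    right_by_key = {item[0]: item[1:] for item in right}
    return [
        tuple(item) + tuple(right_by_key[item[0]])
        for item in left
        if item[0] in right_by_key
    ]


if __name__ == "__main__":
    print(concat([1, 2, 3], [4, 5, 6]))
    print(merge([1, 2, 3], [4, 5, 6]))
    print(join([("X", 1), ("Y", 2), ("Z", 3)], [("Y", 5), ("X", 4)]))
```

In a Ribo-seq pipeline, the `join()` pattern is the one that pairs each sample's Ribo-seq result with its RNA-seq result by sample key.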

---

### **1.3 Channel Output Methods**

These methods define where the data from a channel should be sent:

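* **`set()`**: Assigns the channel to a variable name. It takes the place of the DSL1 operator `into()`, which is not available in DSL2, where one channel can feed several consumers without being copied.

  Example:

  ```groovy
  Channel.from([1, 2, 3]).set { outputChannel }
  ```

* **`collectFile()`**: Collects the items emitted by a channel and saves them to one or more files. Nextflow has no `to()` or `toPath()` operator.

  Example:

  ```groovy
  Channel.fromPath('*.txt').collectFile(name: 'output.txt')
  ```

* **`collectFile()` with `storeDir`**: Writes the collected file into a chosen directory. Nextflow has no `toFile()` operator.

  Example:

  ```groovy
  Channel.from([1, 2, 3]).map { it.toString() }.collectFile(name: 'output.txt', newLine: true, storeDir: 'results')
  ```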

* **`view()`**: Displays the data flowing through the channel. Useful for debugging.

  Example:

  ```groovy
  Channel.from([1, 2, 3]).view()
  ```

* **`subscribe()`**: Subscribes to a channel to trigger an action for each item.

  Example:

  ```groovy
  Channel.from([1, 2, 3]).subscribe { println it }
  ```

---

### **1.4 Channel Operations for Synchronization**

These methods control how many items are read from a channel or run an action for each item:

* **`take(n)`**: Emits only the first `n` items of the channel and ignores the rest. The count is required.

  Example:

  ```groovy
  def channel = Channel.from([1, 2, 3, 4])
  channel.take(1).view()  // Emits only the first item (1)
  ```

* **`subscribe()`** (in place of `consume()`): Nextflow has no `consume()` operator. To act on every item, pass a closure to `subscribe()`.

  Example:

  ```groovy
  def channel = Channel.from([1, 2, 3, 4])
  channel.subscribe { println it }
  ```

* **`of()`** (in place of `send()`): `send()` is not a documented channel method, and `Channel.create()` is a DSL1 feature that DSL2 removed. In DSL2, declare the values when the channel is created.

  Example:

  ```groovy
  def channel = Channel.of(10)
  channel.view()
  ```

---

### **Summary: Common Channel Methods**

* **Creation**: `fromPath()`, `from()`, `fromFilePairs()`, `empty()`
* **Transformation**: `map()`, `filter()`, `flatten()`, `merge()`, `collect()`, `splitText()`
* **Output**: `set()`, `collectFile()`, `view()`, `subscribe()`
* **Synchronization**: `take()`

---

### **When to Use These Methods in Nextflow**:

1. **Creating Channels**: Use `fromPath()`, `fromFilePairs()`, `from()`, and other `from*` methods to create channels from files, lists, or strings.
2. **Manipulating Channel Data**: Use `map()`, `filter()`, and `flatten()` to transform or filter data flowing through channels.
3. **Combining Data**: Use `concat()` to append one channel to another, `merge()` to pair items by position, and `join()` to match items by key.
4. **Sending Data to Processes**: Use `set()` to name a channel that is passed to a process, and `collectFile()` to write items to a file.
5. **Debugging**: Use `view()` or `subscribe()` to print or monitor the flow of data for debugging purposes.

---

### **Example Usage**:

Here’s an example of how you might use several **channel methods** in a Nextflow pipeline:

```groovy
// Define a channel with some input data
Channel.from([1, 2, 3, 4, 5])
    .map { it * 2 }         // Multiply each number by 2
    .filter { it % 4 == 0 } // Keep only multiples of 4 (4 and 8)
    .set { outputChannel }  // Name the resulting channel outputChannel

outputChannel.view()  // Display the content of the output channel
```

---

### **Next Steps**:

* Experiment with the **`Channel` methods** and try combining them in different ways to manipulate and transform data.
* Understand **how channels pass data between processes** in **Nextflow**, and use methods like **`concat()`** and **`map()`** to streamline your data flow.

These methods will allow you to create powerful, data-driven workflows in **Nextflow**.


In [46]:
import pandas as pd

Data = pd.read_csv("../data/sample_info.csv")

In [47]:
Data['Assay Type']

Unnamed: 0,Run,Assay Type,genotype,treatment,Nutrient_Concentration,replicate
0,SRR15446740,RNA-Seq,wild type,carbon limitation,0.1,rep1
1,SRR15446741,RNA-Seq,wild type,carbon limitation,0.1,rep2
2,SRR15446742,RNA-Seq,wild type,carbon limitation,0.1,rep3
3,SRR15446743,RNA-Seq,wild type,carbon limitation,0.6,rep1
4,SRR15446744,RNA-Seq,wild type,carbon limitation,0.6,rep2
...,...,...,...,...,...,...
67,SRR15446807,Ribo-Seq,delta_leuB,Leucine limitation,0.1,rep2
68,SRR15446808,Ribo-Seq,delta_leuB,Leucine limitation,0.1,rep3
69,SRR15446809,Ribo-Seq,delta_leuB,Leucine limitation,0.6,rep1
70,SRR15446810,Ribo-Seq,delta_leuB,Leucine limitation,0.6,rep2


In [49]:
Data.columns = ['run', 'data_type', 'genotype', 'treatment', 'nutrient_concentration',
       'replicate']

In [51]:
Data.to_csv("../data/sample_info.csv", index=False)
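The displayed table has seven columns, including `Unnamed: 0`, while the rename above supplies six names, so a positional rename only works once that extra column is gone. A name-based cleanup avoids depending on column order. The following is a minimal standard-library sketch, assuming the file on disk has the header shown in the output; the `RENAMES` mapping and the function name are illustrative, and the excerpt contains two rows copied from the output above.

```python
import csv
import io
from typing import Dict, List

# Header names taken from the displayed table; the mapping is an assumption
# about which old name corresponds to which new name.
RENAMES = {
    "Run": "run",
    "Assay Type": "data_type",
    "Nutrient_Concentration": "nutrient_concentration",
}

EXCERPT = (
    "Unnamed: 0,Run,Assay Type,genotype,treatment,Nutrient_Concentration,replicate\n"
    "0,SRR15446740,RNA-Seq,wild type,carbon limitation,0.1,rep1\n"
    "67,SRR15446807,Ribo-Seq,delta_leuB,Leucine limitation,0.1,rep2\n"
)


def clean_sample_sheet(text: str) -> List[Dict[str, str]]:
    """Drop saved-index columns and rename the remaining headers by name."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            RENAMES.get(key, key): value
            for key, value in row.items()
            if not key.startswith("Unnamed")
        })
    return rows


if __name__ == "__main__":
    for record in clean_sample_sheet(EXCERPT):
        print(record)
```

The same idea carries over to pandas by dropping the unnamed column and renaming with a mapping instead of assigning a list of names.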