# Transcriptomics checkpoint

**Due**: Mar 21 by 11:59 pm

**Points**: 100

As outlined in the [project introduction](https://pitt-biosc1540-2024s.oasci.org/assessments/checkpoints/transcriptomics/), this project aims to identify genes in *E. coli* that could be engineered for enhanced stress tolerance.
This will be accomplished by analyzing gene expression data from control and stress-resistant strains under conditions of cellular stress.
Identifying differences in gene expression between these strains may highlight genes integral to stress resistance.
It is highly recommended that you familiarize yourself with [the foundational paper](https://www.nature.com/articles/s41598-017-14335-7) that discusses the dataset you will be working with.


## Instructions

Throughout your analysis, you will come across various blocks of code accompanied by comments such as:

```python
# DO NOT MODIFY CODE BELOW THIS LINE.
```

These comments protect certain code segments so that the evaluation process remains consistent across all student submissions.
You must adhere to these directives and refrain from modifying the code following these comments.
Should you mistakenly alter any of this pre-written code, or if the teaching staff recommends an adjustment, you must revert the changes.
The correct code version can always be found and copied [directly from the course website](https://pitt-biosc1540-2024s.oasci.org/assessments/checkpoints/genomics/genomics-checkpoint/).

This structured approach ensures that the focus remains on your project's analytical and interpretative aspects rather than on the underlying code infrastructure.
Following these guidelines will help maintain the academic integrity of the project and ensure a fair and standardized assessment for all participants.

## Project Grading Options

For this project, you are provided with two distinct options for completing your analysis.
Each path caters to different skill sets and preferences, ensuring you can choose the approach that best suits your strengths and learning goals.

### Option 1: Spreadsheet

-   **Approach**: Download the provided CSV file from the [link](https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/large-files/ecoli-transcriptome-cell.stress.csv) and conduct your analysis using Google Sheets.
-   **Deliverables**: Submit your completed spreadsheet along with a PDF of your written portion of the project (a total of two files).
-   **Support**: The teaching team is equipped to provide conceptual guidance for your analysis.
    However, specific assistance with spreadsheet functionalities or techniques will be limited.

### Option 2: Python (Preferred)

-   **Approach**: Complete your analysis within this Jupyter notebook using Python.
    This method is preferred and encourages you to utilize the powerful data analysis libraries available in Python.
-   **Deliverables**: Submit the Jupyter notebook (`.ipynb`) with your writing in markdown cells (a total of one file).
-   **Support**: Opting for Python analysis enables you to receive comprehensive support from the teaching team, including both conceptual and implementation-related queries.
-   **Bonus**: Choosing this option entitles your assignment to an additional 25 points as a bonus, recognizing the technical complexity of Python.

### Choosing Your Path

Select the option that aligns with your learning objectives and comfort level with the tools at your disposal.
Whether you opt for the flexibility and accessibility of spreadsheet software or the robust and dynamic capabilities of Python, your choice should reflect your learning journey and goals for this project.

## Gene Expression Data Overview

### Data Source

Our investigation utilizes gene expression data from the BioProject [PRJNA353081](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA353081), which explores the transcriptome of *E. coli* under a variety of stress conditions.
The aim is to pinpoint genes instrumental in the organism's adaptive evolution to stress. 

### Data Accessibility and Preparation

The teaching team has procured, processed, and cleansed the gene expression data from this study.
The refined dataset is made available in the form of a CSV file, which you can access and analyze in the subsequent sections of this project.
This data preparation step ensures that you can focus on analyzing and interpreting gene expression patterns without the initial burden of data cleaning and preprocessing.

### Utilizing the Data

Below, you will find instructions on how to load the provided CSV file containing the gene expression data.
This dataset is crucial for your analysis and will be the foundation for identifying potential genes linked to stress tolerance in *E. coli*.
It's crucial to handle this data carefully and follow any additional instructions in the forthcoming sections to ensure a thorough and accurate analysis.

---

**Reminder:** As you work with this dataset, remember to consider the broader context of the research and the specific objectives of our study. This approach will enhance the relevance and impact of your findings.

First, we'll import the libraries [NumPy](https://numpy.org/) and [Pandas](https://pandas.pydata.org/) into our notebook.
These libraries are essential data manipulation and analysis tools, allowing us to work efficiently with complex datasets.
NumPy supports large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
Pandas offers data structures and operations for manipulating numerical tables and time series, making it indispensable for data munging and preparation tasks.
Together, these libraries form the backbone of our data analysis toolkit in Python.

```python
# DO NOT MODIFY CODE BELOW THIS LINE.
import numpy as np
import pandas as pd
```

Before executing the code block below, it helps to understand what it does and the data it processes.
The dataset, which contains transcriptome information for *E. coli* under various stress conditions, is crucial for identifying genes involved in adaptive evolution.

To proceed, we will perform the following steps:

1. **Set the Data Source**: The variable `GE_CSV_PATH` is initialized with the URL of the CSV file containing the gene expression data.    
    This ensures we access the most recent and relevant data for our analysis.
2. **Load the Data**: Using Pandas' `read_csv` function, we load the data from the specified URL into a DataFrame named `df_ge`.
    This DataFrame will be used to manipulate and analyze the gene expression data.
3. **Determine the Number of Genes**: We calculate the total number of genes in the dataset by determining the length of `df_ge`.
    This gives us an overview of the dataset's size and scope.
4. **Preview the Data**: Finally, we display the first five entries in the dataset using `df_ge.head(n=5)`.
    This initial peek into the data helps us understand its structure and contents before diving into a more detailed analysis.

Please note that the code below this instruction is crucial for the initial data setup and should be kept the same to ensure consistency and reliability in the analysis process.

```python
# DO NOT MODIFY CODE BELOW THIS LINE.
GE_CSV_PATH = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/large-files/ecoli-transcriptome-cell.stress.csv"
df_ge = pd.read_csv(GE_CSV_PATH)
gene_names = df_ge["gene_id"].to_numpy()
n_genes = len(df_ge)
print(f"There are {n_genes} genes in the dataset.")
print(gene_names)
print(df_ge.head(n=5))
```

```text
There are 4466 genes in the dataset.
['aaeA' 'aaeB' 'aaeR' ... 'zupT' 'zur' 'zwf']
  gene_id  Parent_replicate1  Parent_replicate2  Parent_replicate3    P_NaCl  \
0    aaeA           1.600048           1.580573           1.596062  1.505947   
1    aaeB           2.384388           2.490061           2.490061  2.712741   
2    aaeR           2.313178           2.259222           2.262170  2.486852   
3    aaeX           1.675798           1.691446           1.548673  1.616997   
4     aas           2.355012           2.326936           2.338066  2.187426   

      P_KCl      P_Co   P_SoCar     P_Lac     P_Mal  ...   BuOH1_S   BuOH2_S  \
0  1.571618  1.657503  1.794461  1.636620  1.733818  ...  1.722702  1.734668   
1  2.608243  2.252648  2.700831  2.531299  2.638515  ...  2.119643  2.103084   
2  2.434341  2.464022  2.664427  2.400416  2.435800  ...  2.404603  2.406878   
3  1.605543  1.729344  1.906402  1.804496  1.834359  ...  1.761060  1.756508   
4  2.261457  2.419453  2.271215  2.2
```

Upon successfully loading and displaying the initial portion of the gene expression dataset, we now observe that it encompasses 4,466 genes, each with multiple measurements under different conditions.
The dataset provides a comprehensive view of gene expression levels across various replicates and stress conditions.
This rich dataset is the foundation for our analysis, aiming to identify genes associated with stress tolerance in *E. coli*.

Note that these data are the $\log_{10} (x)$ of the normalized open reading frame (ORF) level expression.
ORF-level expression was calculated as the median value of all probes corresponding to each ORF.
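Because the values are on a $\log_{10}$ scale, subtracting two values gives the $\log_{10}$ of their ratio (the fold change). The short sketch below illustrates this with two `aaeB` values copied from the preview above; the variable names are illustrative and are not part of the provided notebook code.

```python
import numpy as np

# log10 expression values for aaeB, copied from the data preview.
log_parent = 2.384388  # Parent_replicate1
log_stress = 2.712741  # P_NaCl

# A difference of log10 values is the log10 of the expression ratio.
log10_fold_change = log_stress - log_parent

# Undo the log to express the change as a plain ratio.
fold_change = np.power(10.0, log10_fold_change)

print(f"log10 fold change: {log10_fold_change:.3f}")
print(f"fold change: {fold_change:.2f}")
```

A positive difference means higher expression in the second condition, and a negative difference means lower expression.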

## How to use `select_stress` function

We introduce a crucial function designed to streamline gene expression analysis under specific stress conditions.
This function, `select_stress`, allows you to filter the comprehensive dataset to focus on gene expression data related to a particular cell stressor.
Understanding how gene expression varies across different stress conditions is key to identifying genes contributing to stress tolerance.

The `select_stress` function is designed for simplicity and efficiency, enabling you to specify the stress condition of interest and obtain relevant subsets of the data for analysis.
Here's how it works:

1. **Parameters**: The function takes two parameters:
   - `df`: The full DataFrame containing gene expression data.
   - `stress`: A string indicating the specific cell stressor to focus on.
        Valid options are: `NaCl`, `KCl`, `Co`, `SoCar`, `Lac`, `Mal`, `Metha`, `Croton`, `MG`, `BuOH`, and `CPC`.
2. **Functionality**:
   - The function first identifies columns in the dataset that correspond to control conditions (replicates) and those that are specific to the chosen stress condition, distinguishing between non-stressed (`_NS`) and stressed (`_S`) states.
   - It then creates four subsets of the DataFrame:
     - `columns_control_ns`: Includes gene IDs and control replicate measurements under no-stress conditions.
     - `columns_control_s`: Includes gene IDs and control replicate measurements under stress conditions.
     - `columns_resistant_ns`: Comprises gene IDs and measurements for strains resistant to the specified stressor, grown without stress.
     - `columns_resistant_s`: Contains gene IDs and measurements for strains resistant to the specified stressor, grown under that stress.
3. **Return Values**:
   - The function returns four DataFrames, one per group: the control strain without stress, the control strain under the chosen stress, the resistant strains without stress, and the resistant strains under stress.


```python
# DO NOT MODIFY CODE BELOW THIS LINE.
def select_stress(df, stress):
    columns_control_ns = [
        "gene_id",
        "Parent_replicate1",
        "Parent_replicate2",
        "Parent_replicate3",
    ]
    columns_control_s = ["gene_id", "P_" + stress]
    columns_resistant_ns = ["gene_id"]
    columns_resistant_ns.extend([c for c in df.columns if stress in c and "_NS" in c])
    columns_resistant_s = ["gene_id"]
    columns_resistant_s.extend([c for c in df.columns if stress in c and "_S" in c])
    return (
        df[columns_control_ns],
        df[columns_control_s],
        df[columns_resistant_ns],
        df[columns_resistant_s],
    )
```

For example, we can get data for the `NaCl` stressor.
We will use the `select_stress` function to isolate the relevant subsets of our dataset.
This function will allow us to compare gene expression levels across four distinct groups:

1. **Control group under no stress (Parent_replicate)**: Gene expression data for the parent strain under normal conditions without any added stressors.
2. **Control group under stress (P_NaCl)**: Gene expression data for the parent strain under stress conditions induced by NaCl.
3. **NaCl-resistant strains under no stress (NaCl_NS)**: Gene expression data for NaCl-resistant *E. coli* strains grown without NaCl stress.
4. **NaCl-resistant strains under stress (NaCl_S)**: Gene expression data for NaCl-resistant *E. coli* strains under stress conditions induced by NaCl.

The following code block employs the `select_stress` function to extract these four subsets specifically for the NaCl stress condition.
After extracting the data, we will display the first three entries from each subgroup to examine the structure of the data and to observe initial gene expression levels in these different conditions.

```python
df_control_ns, df_control_s, df_nacl_ns, df_nacl_s = select_stress(df_ge, "NaCl")
print(df_control_ns.head(n=3))
print(df_control_s.head(n=3))
print(df_nacl_ns.head(n=3))
print(df_nacl_s.head(n=3))
```

```text
  gene_id  Parent_replicate1  Parent_replicate2  Parent_replicate3
0    aaeA           1.600048           1.580573           1.596062
1    aaeB           2.384388           2.490061           2.490061
2    aaeR           2.313178           2.259222           2.262170
  gene_id    P_NaCl
0    aaeA  1.505947
1    aaeB  2.712741
2    aaeR  2.486852
  gene_id  NaCl1_NS  NaCl2_NS  NaCl3_NS  NaCl4_NS  NaCl5_NS
0    aaeA  1.399577  1.613108  1.562037  1.586437  1.535762
1    aaeB  2.232785  2.274567  2.283701  2.286662  2.396796
2    aaeR  2.300639  2.385886  2.362458  2.383017  2.420135
  gene_id   NaCl1_S   NaCl2_S   NaCl3_S   NaCl4_S   NaCl5_S
0    aaeA  1.589559  1.443630  1.537268  1.538087  1.484430
1    aaeB  2.561555  2.748582  2.591026  2.639602  2.695565
2    aaeR  2.257621  2.203123  2.287483  2.359584  2.326936
```


```python
df_control_ns, df_control_s, df_kcl_ns, df_kcl_s = select_stress(df_ge, "KCl")
print(df_control_ns.head(n=3))
print(df_control_s.head(n=3))
print(df_kcl_ns.head(n=3))
print(df_kcl_s.head(n=3))
```

```text
  gene_id  Parent_replicate1  Parent_replicate2  Parent_replicate3
0    aaeA           1.600048           1.580573           1.596062
1    aaeB           2.384388           2.490061           2.490061
2    aaeR           2.313178           2.259222           2.262170
  gene_id     P_KCl
0    aaeA  1.571618
1    aaeB  2.608243
2    aaeR  2.434341
  gene_id   KCl1_NS   KCl2_NS   KCl3_NS   KCl4_NS   KCl5_NS
0    aaeA  1.625741  1.422273  1.513973  1.540602  1.527510
1    aaeB  2.240655  2.348305  2.332424  2.375653  2.313878
2    aaeR  2.456952  2.383017  2.493289  2.435114  2.426256
  gene_id    KCl1_S    KCl2_S    KCl3_S    KCl4_S    KCl5_S
0    aaeA  1.585657  1.565324  1.581476  1.612395  1.530056
1    aaeB  2.542533  2.598445  2.611495  2.598939  2.547291
2    aaeR  2.427043  2.337362  2.414993  2.355012  2.350735
```


The `Parent_replicate` columns hold gene expression data for the control (parent) *E. coli* strain across three replicates, serving as a baseline for comparison.
The `P_NaCl` column presents the gene expression levels of the control strain when subjected to NaCl stress (likewise `P_KCl` for KCl stress).
Columns ending in `_NS` (e.g., `NaCl1_NS`) hold gene expression data for resistant strains grown in standard media, indicating their baseline expression without stress.
Conversely, columns ending in `_S` (e.g., `NaCl1_S`) capture the expression data of these resistant strains under stress conditions, providing insight into their adaptability and resilience.
Each numerical suffix (e.g., 1, 2, 3, 4, and 5) denotes a distinct resistant strain, allowing for a comparative analysis between different adaptations to stress conditions.
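To make the column layout concrete, here is a minimal sketch that summarizes each gene across the five resistant strains. It uses a small hand-built stand-in for `df_nacl_s` (the first three rows printed above) so that it runs without downloading the CSV; `df_example`, `values`, and `summary` are illustrative names, not part of the provided notebook code.

```python
import pandas as pd

# Small stand-in for df_nacl_s, using the first three rows printed above.
df_example = pd.DataFrame(
    {
        "gene_id": ["aaeA", "aaeB", "aaeR"],
        "NaCl1_S": [1.589559, 2.561555, 2.257621],
        "NaCl2_S": [1.443630, 2.748582, 2.203123],
        "NaCl3_S": [1.537268, 2.591026, 2.287483],
        "NaCl4_S": [1.538087, 2.639602, 2.359584],
        "NaCl5_S": [1.484430, 2.695565, 2.326936],
    }
)

# Index by gene so that row-wise statistics only see numeric columns.
values = df_example.set_index("gene_id")

# One row per gene, summarizing the five strains.
summary = pd.DataFrame(
    {
        "mean": values.mean(axis=1),
        "median": values.median(axis=1),
        "std": values.std(axis=1),  # sample standard deviation (ddof=1)
    }
)
print(summary)
```

The same pattern should apply to any of the four DataFrames returned by `select_stress`, although a single-column group such as `P_NaCl` has no spread to measure.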

## Project

In this project, you will comprehensively analyze this gene expression data to uncover insights into *E. coli* tolerance of **NaCl** and **KCl** stress only.

1. **Data Exploration**: Familiarize yourself with the dataset's structure and contents.
    Note the conditions under which gene expression levels were recorded, such as `Parent_replicate1`, `P_NaCl` (sodium chloride stress), and `P_KCl` (potassium chloride stress).
    Each column represents a different experimental condition, with the first column (`gene_id`) uniquely identifying each gene.
2. **Statistical Analysis**: Begin with basic statistical analyses to understand gene expression levels' distribution, variance, and central tendencies across different conditions.
    This may involve calculating each gene's means, medians, and standard deviations across various conditions.
3. **Data Visualization**: Employ data visualization techniques to represent the data graphically.
    This could include plotting the expression levels of select genes under different stress conditions or creating heatmaps to visualize the expression levels of all genes across conditions.
4. **Identification of Key Genes**: Utilize analytical techniques to identify genes with significant differences in expression between control and stress conditions.
    This may involve conducting hypothesis tests or applying machine learning models to discern patterns that indicate a gene's role in stress tolerance.
5. **Interpretation and Conclusion**: Interpret the results of your analysis to draw conclusions about which genes are potential candidates for engineering NaCl and KCl stress tolerance in *E. coli*.
    Consider how these genes' expression levels change under stress compared to normal conditions.
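As one possible starting point for steps 2 and 4 (not the required method), the sketch below ranks genes by how consistently their expression shifts between the non-stressed and stressed states of the same resistant strain. It assumes that matching numerical suffixes (e.g., `NaCl1_NS` and `NaCl1_S`) refer to the same strain, and it uses hand-built stand-ins for `df_nacl_ns` and `df_nacl_s` (the first three rows printed earlier). All names introduced here are illustrative.

```python
import numpy as np
import pandas as pd

genes = ["aaeA", "aaeB", "aaeR"]
strains = [1, 2, 3, 4, 5]

# Stand-ins for df_nacl_ns and df_nacl_s (first three rows printed earlier).
df_ns = pd.DataFrame(
    [
        [1.399577, 1.613108, 1.562037, 1.586437, 1.535762],
        [2.232785, 2.274567, 2.283701, 2.286662, 2.396796],
        [2.300639, 2.385886, 2.362458, 2.383017, 2.420135],
    ],
    index=genes,
    columns=[f"NaCl{i}_NS" for i in strains],
)
df_s = pd.DataFrame(
    [
        [1.589559, 1.443630, 1.537268, 1.538087, 1.484430],
        [2.561555, 2.748582, 2.591026, 2.639602, 2.695565],
        [2.257621, 2.203123, 2.287483, 2.359584, 2.326936],
    ],
    index=genes,
    columns=[f"NaCl{i}_S" for i in strains],
)

# Pair each strain's stressed value with its own non-stressed value.
# Values are log10, so each difference is a log10 fold change.
diff = pd.DataFrame(
    df_s.to_numpy() - df_ns.to_numpy(),
    index=genes,
    columns=[f"strain{i}" for i in strains],
)

mean_diff = diff.mean(axis=1)
sem = diff.std(axis=1) / np.sqrt(diff.shape[1])  # standard error of the mean
t_stat = mean_diff / sem  # paired t-statistic, one per gene

# Rank genes by the magnitude of the paired t-statistic.
ranking = pd.DataFrame({"mean_log10_change": mean_diff, "paired_t": t_stat})
order = ranking["paired_t"].abs().sort_values(ascending=False).index
ranking = ranking.loc[order]
print(ranking)
```

A t-statistic on its own is not a p-value; converting it requires a t distribution with four degrees of freedom here (for example via SciPy), and testing thousands of genes at once raises multiple-comparison concerns that you should address in your rationale.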

## Rubric

The originating paper used *E. coli* growth rates as an output to a linear regression model to identify which genes are crucial for imparting survivability.
The authors then engineered strains by successively mutating candidate genes for experimental confirmation.
The article does not provide growth data, so we cannot reproduce their methods and are unlikely to reach an outcome such as: "This specific gene mentioned in the paper was found in my analysis."

Instead, we will take the approach that the only data we have to work with is the gene expression data.
This also means there is no correct answer: this project is entirely based on exploration and logical insights.
This rubric encourages thoroughness in completing the project while rewarding the depth of analysis, innovative approaches, and clear communication of findings.

### Completion (80 Points)

- **Statistical Analysis (20 Points)**
  - Calculated means, medians, and standard deviations for each condition, considering replicates.
  - Assessed and described the distribution of gene expression levels, including within-replicate variability.
  - Analyzed variance within and between conditions to understand expression variability.
- **Data Visualization (20 Points)**
  - Created plots comparing expression levels of selected genes under different conditions.
  - Employed visual comparisons of control vs. stress conditions, highlighting significant differences.
- **Identification of Key Genes (20 Points)**
  - Applied numerical analysis to identify genes with significant expression differences.
  - Reviewed and listed genes identified as significant, focusing on consistency across replicates.
- **Interpretation (20 Points)**
  - Discussed potential functions of identified genes in stress tolerance, providing insights into biological mechanisms.

### Quality (20 Points)

- **Accuracy and Logic (5 Points)**
  - Analysis methods and conclusions are logically sound, accurately reflecting the dataset's complexity.
- **Insightfulness (5 Points)**
  - Demonstrated an understanding of gene function and stress tolerance mechanisms based on data analysis.
- **Creativity (2 Points)**
  - Employed innovative analysis, visualization, or interpretation approaches, showcasing advanced problem-solving skills.
- **Presentation and Clarity (8 Points)**
  - The project is well-organized, clearly communicating findings and methodologies, making complex data understandable.

### Python (25 points)

Performed all analyses in Python, from data preprocessing to final interpretation, utilizing Python's libraries (Pandas for data manipulation, Matplotlib/Seaborn for visualization, SciPy for statistical tests, etc.).


## Notes

-   This project is intentionally open-ended and designed to give you enough information to figure out what you need to do without just telling you what to do.
    Thus, grading will not be "Did you do what the instructor would have done?" but "Did the student critically think about the project and make a logical attempt with reasonable rationale?"
    The teaching team will not tell you, "Here is how you should do it."
    Instead, they will give you advice on what to consider.
-   Visit [From data to Viz](https://www.data-to-viz.com/) or [Python Graph Gallery](https://python-graph-gallery.com/) to see different types of data visualization plots you can make.
-   [Lecture 15](https://pitt-biosc1540-2024s.oasci.org/lectures/15/l15/) gives you some boilerplate code for basic NumPy and Pandas operations.
    Only looking at which genes have changed does not account for any [statistics](https://realpython.com/python-statistics/).
-   You should write any text in a Markdown cell below as you see fit.