# mRNA Expression in Stem Cell Differentiation of Zebrafish  
**Darren Llewellyn:** u0720798@utah.edu | u0720798  
**Matt Poulsen:**  Matthew.poulsen@utah.edu | u0546346  
**Noah Pearson:** noah.pearson@utah.edu | u1020047  

## Background and Motivation: 
Stem cells have great potential for novel therapeutics. Niche germline stem cells can generate a substantial number of differentiated cells, so it is essential that researchers better understand the dynamics involved in maintaining them. Characterizing the genes responsible for both the maintenance of the germline and the differentiation of these cells will inform efforts to tailor stem cell behavior.  
  
Many labs are actively investigating changes to gene expression in stem cells during cell differentiation to distinguish between the stem cells that maintain homeostasis and those that successfully differentiate. Understanding the genetic precursors to cell differentiation may lead to a wide array of stem cell therapeutics. Part of the research focus in the Gagnon lab is investigating how stem cells maintain tissue homeostasis. The testis is a good model for stem cell differentiation because one can simultaneously capture multiple states of spermatogenesis given the abundant number of cells in each differentiation state.  

Single-cell RNA sequencing is a ubiquitous tool for studying gene expression. The method assigns a unique tag to each sequenced cell and individually labels each mRNA transcript within that cell. These data, when aggregated across a population of cells at different stages of differentiation, provide a snapshot of how gene expression differs with differentiation state.  

Our previous genetic investigations have exclusively used the Seurat package in R. With this package, we create two-dimensional UMAP plots, as well as heat maps, to visualize the transcriptomic state of the cells. Recently, we have turned to three-dimensional UMAP scatter plots, as these are more interactive, represent the scale of information better, and better show the separation of cells. While these UMAP plots are useful, we believe that we can further improve our visualization of the relationship between genes and differentiation type to become more interactive and intuitive.


## Project Objectives:  
We intend to identify the differences in gene expression across cells at different stages of differentiation in the testis. Our narrow objective for this project is to create an interactive cell atlas that visualizes single-cell RNA sequencing data in cells at different stages of differentiation. This objective can be further broken down into three aims. **Aim 1** is to generate an interactive atlas that is based on a three-dimensional UMAP scatter plot. **Aim 2** is to use a previously explored dataset to verify and validate our tool. In **Aim 3** we will characterize the unique expression profile for each cell type in the zebrafish testis.  
  
**Aim 1:** Tools have been published where genetic expression is displayed in an interactive two-dimensional plot using UMAP reduction, e.g. [this tool.](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-6946/results/tsne) While helpful, these tools lack the ability for the researcher to explore individual cells and the relationships between clusters in more depth. We anticipate that our tool’s three-dimensional visualization will help the researcher compare different clusters with more precision. Furthermore, we believe that adding more interactive elements to our tool will allow the researcher to explore where the genetic similarities between adjacent clusters lie. Genetic data are notoriously complex, and any improvement in the ability to visualize relationships is useful for identifying trends in genetic pathways.  
  
We plan on using Altair as our primary visualization tool. To be useful for the scientist, our tool will need to take a typical normalized gene count file and format the data to be visualized. To do this, we will need to map each cell's gene expression profile to the three dimensions used for plotting the data. We anticipate using UMAP to reduce the gene expression data to an interpretable number of dimensions. The user will be able to dynamically choose the parameters necessary for this dimension reduction. With our cells plotted along the UMAP axes, the user will be able to select regions of interest on the plot. In-depth information about these regions will populate adjacent cells. This additional information will include subplots such as full-spectrum gene expression bar graphs and scalar values such as statistical similarities between groups.  
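
The region-of-interest step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data layout (a list of dicts holding three embedding coordinates and an expression dict), the function name `summarize_region`, the box-shaped selection, and the gene names are all hypothetical and not part of our pipeline yet.

```python
from statistics import mean

def summarize_region(cells, bounds, genes):
    """Return mean expression per gene for cells inside a 3D bounding box.

    cells  : list of dicts with "xyz" (three embedding coordinates) and
             "expr" (gene name -> normalized expression); assumed layout
    bounds : ((xmin, xmax), (ymin, ymax), (zmin, zmax))
    genes  : gene names to summarize
    """
    selected = [
        c for c in cells
        if all(lo <= v <= hi for v, (lo, hi) in zip(c["xyz"], bounds))
    ]
    if not selected:
        return {}
    # Genes missing from a cell's dict are treated as zero expression.
    return {g: mean(c["expr"].get(g, 0.0) for c in selected) for g in genes}

# Toy data with hypothetical gene names and coordinates.
cells = [
    {"xyz": (0.1, 0.2, 0.3), "expr": {"geneA": 2.0, "geneB": 0.0}},
    {"xyz": (0.2, 0.1, 0.4), "expr": {"geneA": 4.0, "geneB": 1.0}},
    {"xyz": (5.0, 5.0, 5.0), "expr": {"geneA": 0.0, "geneB": 9.0}},
]
roi = ((0.0, 1.0), (0.0, 1.0), (0.0, 1.0))
summary = summarize_region(cells, roi, ["geneA", "geneB"])
print(summary)
```

The per-gene means returned here are the kind of values a gene expression bar graph for the selected region could display.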
  
**Aim 2:** In Aim 2, we will investigate the utility of our tool using the dataset found [here.](https://www.ebi.ac.uk/gxa/sc/experiments/E-GEOD-100911/downloads) Using a previously analyzed dataset will allow us to prioritize developing our visualization tool rather than analyzing novel data. Furthermore, we will be able to directly compare our tool with existing visualization tools. We will build our data pipeline in Aim 1 to work with this dataset, so we will be able to troubleshoot and iterate on our tool using consistent data.  
  
**Aim 3:** In Aim 3, we will apply our tool to a novel dataset, specifically the zebrafish testis data discussed above. This aim will be entirely exploratory: we will be using our tool to analyze fresh data. As our tool is intended for this deep exploration, we believe that Aim 3 will provide the best insights into the utility of our tool. We hope that the user will find the tool intuitive to use, and that they will be able to quickly identify trends that would otherwise be challenging to find in a more traditional visualization.

## Data:  
**Aim 2** data will be sourced from the [Single Cell Expression Atlas.](https://www.ebi.ac.uk/gxa/sc/home) Data will be downloaded as a Matrix Market file prior to being processed in the code.
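
For reference, a coordinate-format Matrix Market file is plain text: a `%%MatrixMarket` header, optional `%` comment lines, a size line (`rows cols nonzeros`), and then one 1-based `row col value` triple per line. The sketch below parses such text with the standard library only. In practice a library reader such as `scipy.io.mmread` would likely be used instead; the function name `parse_matrix_market` and the toy values are hypothetical, and the genes-by-cells orientation is an assumption about the downloaded file.

```python
def parse_matrix_market(text):
    """Parse coordinate-format Matrix Market text into a sparse dict.

    Returns (n_rows, n_cols, entries), where entries maps a zero-based
    (row, col) pair to a float value.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket header")
    if "coordinate" not in lines[0].split():
        raise ValueError("only the coordinate (sparse) format is handled")
    # Drop blank lines and '%' comments after the header.
    body = [ln for ln in lines[1:] if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    entries = {}
    for ln in body[1:]:
        i, j, value = ln.split()
        entries[(int(i) - 1, int(j) - 1)] = float(value)  # file is 1-based
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries

# Toy file: 3 genes x 2 cells (orientation assumed), 3 nonzero values.
sample = """%%MatrixMarket matrix coordinate real general
% rows = genes, columns = cells (assumed)
3 2 3
1 1 1.5
3 1 2.0
2 2 0.5
"""
n_rows, n_cols, entries = parse_matrix_market(sample)
print(n_rows, n_cols, entries)
```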

We will also obtain data about the genes and biological pathways from [KEGG.](https://www.genome.jp/kegg)

The yet-unexplored data for **Aim 3** has been collected in the Gagnon lab.


## Ethical Considerations:  
Genetic manipulation is at the center of many ethics conversations. With a greater understanding of the relationship between gene expression and cell phenotypes comes a greater ability to change cell phenotypes. While this ability is often beneficial (e.g., in the treatment of disease), it can also be abused. Weighing the benefits against the potential for abuse is not a question that can be addressed by the scientific community alone; it should be addressed by society as a whole. Scientists studying genetics should be cautious not to progress too far ahead of society’s ability to reason through the ethical implications of their research.

When working with scRNA-seq data, it is important to keep in mind the bias that comes with these methods. For example, sequencing read errors can cause transcripts to be dropped from the dataset, meaning the dataset will not reflect the full composition of the cell. Additionally, gene expression in cells is a dynamic process, and scRNA sequencing captures only a snapshot of the activity.

## Data Processing:  
We will pull data from KEGG to annotate the gene expression data from the scRNA-seq dataset with gene information. We will need to use either the KEGG API or web scraping to determine which genes are involved in which pathways, and then map that information to our dataset. 
The data we obtain from the scRNA-seq dataset should already be fairly clean and should require only initial manipulation to get it into a format that we can work with. From these data we plan to derive which genes are differentially expressed across cells, based on the mRNA concentration in each cell.  
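
As a rough sketch of the API route, the KEGG REST service offers a `link` operation that returns tab-separated gene and pathway identifiers, one pair per line. The example below assumes the endpoint `https://rest.kegg.jp/link/pathway/dre` (with `dre` as the KEGG organism code for zebrafish) and uses made-up gene IDs in the sample text; the function names are hypothetical. The download function is defined but not called, so the example runs without network access.

```python
from collections import defaultdict
from urllib.request import urlopen

# Assumed endpoint: KEGG REST "link" operation, zebrafish genes -> pathways.
KEGG_LINK_URL = "https://rest.kegg.jp/link/pathway/dre"

def fetch_kegg_links(url=KEGG_LINK_URL):
    """Download the raw gene-to-pathway table (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")

def parse_kegg_links(text):
    """Turn 'gene<TAB>pathway' lines into a gene -> set of pathways dict."""
    gene_to_pathways = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        gene, pathway = line.split("\t")
        gene_to_pathways[gene].add(pathway)
    return dict(gene_to_pathways)

# Sample text in the expected layout; the gene IDs are made up.
sample = (
    "dre:100000\tpath:dre00010\n"
    "dre:100000\tpath:dre01100\n"
    "dre:100001\tpath:dre00010\n"
)
mapping = parse_kegg_links(sample)
print(mapping)
```

Joining this mapping to the expression matrix would still require matching KEGG gene identifiers to the identifiers used in the scRNA-seq dataset, which is not shown here.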


## Exploratory Analysis:  
As discussed previously, we plan to use UMAP reduction to visualize the data in three dimensions. We will apply a k-means clustering algorithm to the data to group similar cells. For more in-depth exploration, we will use relative bar charts for the selected region of interest (ROI). 
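
To make the clustering step concrete, here is a minimal pure-Python sketch of k-means on three-dimensional points. It is an illustration only: a real analysis would more likely rely on an established library implementation, and the function name, the random initialization, the iteration cap, and the toy coordinates are all assumptions.

```python
import math
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means on a list of coordinate tuples.

    Returns (labels, centroids). The seed only fixes the random choice
    of starting centroids so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids

# Two well-separated toy groups standing in for embedded cells.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.2, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 4.9), (4.9, 5.2, 5.0),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 or 1 depends on the starting centroids, so only the grouping itself is meaningful.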

## Analysis Methodology:  
We plan on using a clustering method to distinguish the cell types based on mRNA concentration. We also plan to explore the relationships between cells based on the biological pathways in which each cell's differentially expressed genes may be involved. The analysis method used for this will depend on the results of our exploratory analysis.
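
One simple way to quantify differential expression between two clusters is a per-gene log2 fold change of mean expression. The sketch below is an assumption about how this could be done, not a settled method: the function name, the pseudocount of 1 (added to avoid dividing by zero), and the toy values are hypothetical, and a real analysis would also need a statistical test.

```python
import math
from statistics import mean

def log2_fold_change(expr_a, expr_b, pseudocount=1.0):
    """Per-gene log2 ratio of mean expression in cluster A over cluster B.

    expr_a, expr_b : dicts mapping gene name -> list of expression values,
                     one value per cell in that cluster (assumed layout)
    pseudocount    : added to both means to avoid division by zero
    """
    shared_genes = sorted(set(expr_a) & set(expr_b))
    return {
        g: math.log2(
            (mean(expr_a[g]) + pseudocount) / (mean(expr_b[g]) + pseudocount)
        )
        for g in shared_genes
    }

# Toy clusters with hypothetical gene names.
cluster_a = {"geneA": [3.0, 3.0], "geneB": [0.0, 0.0]}
cluster_b = {"geneA": [1.0, 1.0], "geneB": [3.0, 3.0]}
lfc = log2_fold_change(cluster_a, cluster_b)
print(lfc)
```

Positive values mark genes with higher mean expression in cluster A, negative values mark genes with higher mean expression in cluster B.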

## Project Schedule:  
As seen in **Figure 1**, we will begin our project by downloading and formatting the practice dataset. We will then move forward with developing our visualization tool, first by creating the UMAP reduction, and then by adding the accessory visualizations. The completed visualization tool will be presented at the project review meeting. This leaves the following week for exploring our novel dataset. The final week will be devoted to summarizing our data in the final report, with an emphasis on creating figures.
  
![Gantt_Chart.png](attachment:Gantt_Chart.png)
**Figure 1:** Gantt chart showing interim deadlines for project completion.