# Module 2 Problem Set: numpy, pandas, and plotnine
*Name:* <Your name here>


In [None]:
# Import libraries
import numpy as np
import pandas as pd
import plotnine as pn

## Problem 1

1.1) Create a 2D numpy array with 500 rows and 10 columns. Populate this array with random numbers drawn from a uniform distribution over the range `[0, 10]`. Use the seed `2023` with the `np.random.default_rng` function to ensure your output is reproducible. Print this array.
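As a minimal sketch of how a seeded generator gives reproducible draws, the snippet below uses an arbitrary seed (`42`) and a small `(4, 3)` shape, both chosen purely for illustration and not the values the problem asks for. Note that `Generator.uniform` samples from the half-open interval `[low, high)`.

```python
import numpy as np

# Illustrative seed and shape only; the problem specifies its own values.
rng = np.random.default_rng(seed=42)
sample = rng.uniform(low=0, high=10, size=(4, 3))

# A second generator built from the same seed reproduces the same draws.
rng_again = np.random.default_rng(seed=42)
sample_again = rng_again.uniform(low=0, high=10, size=(4, 3))

print(sample.shape)
print(np.array_equal(sample, sample_again))
```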

1.2) Now let's dig deeper into the data. Calculate the mean and standard deviation for each row in your array and print the results.
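The `axis` argument controls whether a reduction runs per row or per column. The sketch below uses a tiny hand-written array (an assumption for illustration, not the array from 1.1) to show that `axis=1` collapses the columns and returns one value per row.

```python
import numpy as np

# Toy 2x3 array, used only to illustrate the axis argument.
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 6.0, 8.0]])

row_means = toy.mean(axis=1)  # one mean per row
row_stds = toy.std(axis=1)    # population standard deviation (ddof=0) per row

print(row_means)
print(row_stds)
```

`np.std` defaults to `ddof=0`; pass `ddof=1` if you want the sample standard deviation instead.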

1.3) Compute and print the standard deviation of the means you just calculated. You'll probably notice that this value is significantly smaller than the standard deviation of the original array. Can you explain why there's such a striking difference?

Lastly, think about where this characteristic - that is, the reduction in variability when considering means rather than individual data points - is exploited in scientific research. Provide an example of such an application. Consider fields like bioinformatics, environmental science, physics, or any other discipline that relies heavily on statistical analysis.

1.4) Create a 1D array of numbers from 1 to 12. Reshape it to form a 3x4 matrix.

1.5) Resize the above matrix to shape (5,5). What values are added during this process?

1.6) Replace all odd numbers in the array [2,3,4,5,6,7,8,9,10] with -1 without changing the original array.
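One way to do a conditional replacement that leaves the source array untouched is `np.where`, which returns a new array. The sketch below applies it to a different toy task (replacing negative values with 0), so the array and condition here are illustrative assumptions rather than the ones in the problem.

```python
import numpy as np

# Toy data: replace negative entries with 0, leaving the input untouched.
original = np.array([-2, 5, -7, 3])

# np.where(condition, value_if_true, value_if_false) builds a new array.
cleaned = np.where(original < 0, 0, original)

print(cleaned)
print(original)  # unchanged
```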

## Problem 2 - Pandas

In this problem, we'll use an existing dataset from a study that investigated gene expression in distinct populations of hippocampal neurons. The data come from this research paper:

* Cembrowski MS, Wang L, Sugino K, Shields BC et al. Hipposeq: a comprehensive RNA-seq database of gene expression in hippocampal principal neurons. Elife 2016 Apr 26;5:e14997. PMID: 27113915

We'll be working with one of the summarized data files which is given as a part of this task. Load the file named `GSE74985_genes.read_group_tracking.txt` into a pandas dataframe. This specific file comprises RNA-seq-based gene expression estimates derived from a variety of neuron populations found in the hippocampus. Here is a brief rundown of the columns:

* *tracking_id:* The Ensembl gene ID 
* *condition:* Refers to the treatment condition
* *replicate:* Stands for the replicate number
* *raw_frags:* Represents the number of reads mapping to this particular gene
* *internal_scaled_frags:* Signifies the number of reads mapping to this gene after normalization for library size
* *external_scaled_frags:* Indicates the number of reads mapping to this gene, normalized considering both library size and gene length
* *FPKM:* Number of reads mapping to the gene, normalized for library size and gene length. This is expressed as Fragments Per Kilobase of transcript per Million mapped reads (FPKM).
* *effective_length:* The effective length of the gene
* *status:* Status of the estimate, either OK or NOTEST


In [None]:
data = pd.read_csv("GSE74985_genes.read_group_tracking.txt", sep="\t")

2.1) How many unique conditions are there in the dataset?

In [None]:
len(data["condition"].unique())

2.2) The goal is to reorganize the present dataset to facilitate comparability across various sample conditions and replicates for each gene.

Design a new pandas DataFrame in which each row corresponds to a unique tracking_id (representing individual genes).

Columns should be formed based on unique combinations of the sample conditions and their corresponding replicate numbers. In effect, each column title should represent a specific pairing of a condition with its replicate (for instance: condition1_rep1, condition1_rep2, condition2_rep1, etc.).

The cells in this restructured DataFrame should carry values of the FPKM gene expression levels corresponding to that gene for each specific condition-replicate pairing.

For clarity, here's how the beginning of your transformed DataFrame should look:

| tracking_id | condition1_rep1 | condition1_rep2 | condition2_rep1 | condition2_rep2 |
|-------------|-----------------|-----------------|-----------------|-----------------|
| gene1       | 1               | 2               | 3               | 4               |
| gene2       | 5               | 6               | 7               | 8               |
| gene3       | 9               | 10              | 11              | 12              |

Upon completion of the transformation, validate the dimensions of your new DataFrame. It should have a shape of (37699, 24). Print out the shape and inspect the first few entries (using the head method) of your new DataFrame.
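The long-to-wide reshaping described above can be sketched on a small made-up table. The gene names, condition labels, values, and the helper column name `sample` below are all hypothetical; the idea is to build one label per condition-replicate pair and then spread those labels into columns with `DataFrame.pivot`.

```python
import pandas as pd

# Hypothetical long-format table: one row per gene, condition, and replicate.
toy = pd.DataFrame({
    "tracking_id": ["geneA"] * 4 + ["geneB"] * 4,
    "condition": ["ctrl", "ctrl", "treat", "treat"] * 2,
    "replicate": [0, 1, 0, 1] * 2,
    "FPKM": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Combine condition and replicate into a single column label.
toy["sample"] = toy["condition"] + "_rep" + toy["replicate"].astype(str)

# One row per gene, one column per condition-replicate pair.
wide = toy.pivot(index="tracking_id", columns="sample", values="FPKM")

print(wide.shape)
print(wide.head())
```

`pivot` raises an error if any (index, column) pair occurs more than once, which doubles as a check that each gene has exactly one value per sample.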

2.3) Calculate the average FPKM for each gene across all samples. Print the `head` of the resulting `Series`.

2.4) Calculate the standard deviation of the FPKM for each gene across all samples. Print the `head` of the resulting `Series`.

2.5) Which gene (tracking_id) has the highest average FPKM across all samples? What is the average FPKM for that gene? What is this gene known to do? Is it what you would expect to be highly expressed in neurons?

2.6) With the help of the pivot_table method, compute the mean FPKM for each gene across the unique conditions, thereby aggregating the replicate samples to derive a per-condition mean. Store this in a new `DataFrame`. Display the top of the resulting DataFrame to observe the summarized information.
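Unlike `pivot`, `pivot_table` aggregates duplicate entries, which is what collapses replicates into a single value per condition. A minimal sketch on made-up data (gene names, conditions, and values are hypothetical):

```python
import pandas as pd

# Hypothetical long-format table with two replicates per condition.
toy = pd.DataFrame({
    "tracking_id": ["geneA"] * 4 + ["geneB"] * 4,
    "condition": ["ctrl", "ctrl", "treat", "treat"] * 2,
    "replicate": [0, 1, 0, 1] * 2,
    "FPKM": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Replicates that share a gene and condition are averaged together.
per_condition = toy.pivot_table(
    index="tracking_id", columns="condition", values="FPKM", aggfunc="mean"
)

print(per_condition)
```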

2.7) Your PI wants to know the mean expression of `Pax6` (`tracking_id` = '`ENSMUSG00000027168`') across all conditions. Please create a `Series` that answers this question.

## Problem 3 - Plotnine

3.1) Using the `DataFrame` from Problem 2, create a scatter plot of the mean FPKM for each gene across all conditions (x-axis) vs. the standard deviation of the FPKM for each gene across all conditions (y-axis). Label the axes and give the plot a title.

3.2) Add a smoothed line to the above plot to show the relationship between mean and standard deviation of FPKM across genes. 

3.3) FPKM values are often log-transformed to make them more normally distributed. However, since FPKM values can be zero, we often add a small constant (1) called a 'pseudocount' to the FPKM values before log-transforming them to avoid taking the log of zero.

For the FPKM values in your melted (long-format) `DataFrame`, calculate the log10(`FPKM`+1) and add these values as a new column in your `DataFrame`.
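The pseudocount transform can be sketched on a few made-up values. The column name `log10_FPKM` and the gene labels are hypothetical choices for illustration; the point is that a zero FPKM maps to 0 rather than to negative infinity.

```python
import numpy as np
import pandas as pd

# Made-up FPKM values, including a zero.
toy = pd.DataFrame({
    "tracking_id": ["geneA", "geneB", "geneC"],
    "FPKM": [0.0, 9.0, 99.0],
})

# Add a pseudocount of 1 before taking log10 so that zeros stay finite.
toy["log10_FPKM"] = np.log10(toy["FPKM"] + 1)

print(toy)
```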

3.4) Plot a box and whisker plot (boxplot) of the log10(`FPKM`+1) values for each condition. Label the axes and give the plot a title.

3.5) Create a function that takes a `DataFrame` and a `tracking_id` as arguments and returns a plot of the log10(`FPKM`+1) values for that gene across all conditions. Use `geom_point` to plot the individual replicate values and color the points by `replicate`. Make sure to give the plot a title.

Use the `Pax6` gene (`tracking_id` = '`ENSMUSG00000027168`') as a test case for your function.  

3.6) Looking back at Problem 1, create two histograms:

- one for the first row of the array you created in Problem 1.1.
- another for the row means you calculated in Problem 1.2.

Display these histograms. How do the distributions of individual values and means differ? What do you think is the distribution of the means?

Bonus: if you are aware of or have used the $t$-test before, explain how this property of the distribution of means contributes to the robustness of the $t$-test, even when the original data are not normally distributed.