# Probabilities and distributions
This Jupyter notebook introduces the use and interpretation of statistical descriptors as well as basic concepts of probabilities and probability distributions.


## Installation of libraries and necessary software


This notebook requires an R kernel to run the R scripts. We recommend installing the latest R version (https://www.r-project.org/), opening an R console and then following the instructions at https://irkernel.github.io/installation.

More features and a user-friendly environment to run R scripts outside Jupyter are available through __[RStudio](https://www.rstudio.com/products/rstudio/download/)__.

Install the necessary libraries (only needed once) by executing (shift-enter) the following cell:




In [None]:
install.packages("MASS", repos='http://cran.us.r-project.org')
install.packages("perm", repos='http://cran.us.r-project.org')
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("qvalue")


## Loading data and libraries
This requires that the installation above has finished without errors.


In [None]:

library("MASS")
library("perm")
library("qvalue")

The following commands provide important statistical descriptors of data sets:

__Examples:__  
`mean(x)` Returns average/mean of a data set  
`median(x)` Returns median of a data set  
`min(x)` Returns minimum of a data set  
`max(x)` Returns maximum of a data set  
`var(x)` Returns variance of a data set  
`sd(x)` Returns standard deviation of a data set  
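
As a quick illustration, the descriptors above can be tried on a small vector whose values are easy to check by hand. The vector `v` is a hypothetical example and not part of the course data.

```r
# Small hand-checkable data set (hypothetical values, not from the notebook)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(v)    # arithmetic mean
median(v)  # middle value of the sorted data
min(v)     # smallest value
max(v)     # largest value
var(v)     # sample variance (R divides by n - 1)
sd(v)      # sample standard deviation
```

Note that `var()` and `sd()` use the sample definition with `n - 1` in the denominator.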

#### Example for some random data _x_:

In [None]:
# 100 random values taken from a normal distribution with mean 0 and standard deviation 1 
# (default of rnorm function)
x <- rnorm(100)
paste("First 6 values:")
head(x)
paste("mean(x) =",mean(x))
# add your code here ...


### Exercise 1
Create a data set containing the numbers 1 to 100 (`dset <- 1:100`) and calculate _mean, median, minimum, maximum, variance_ and _standard deviation_. 




#### Add your answers here
(double-click here to edit the cell)

##### Question: <ins>Why are _mean_ and _median_ the same?</ins>



##### Question: <ins>What is the relationship between _variance_ and _standard deviation_?</ins>





### Exercise 2
Calculate `sum(dset)/length(dset)` and compare the returned value to the ones from the previous exercise. 




#### Add your answer here
(double-click here to open the cell)



### Exercise 3
Try to understand what happens in the script below.

Vary the number of random values (e.g. from 5000 to 1000, 100 and 10000). 


In [None]:
# 5000 random variables from a normal distribution with mean 1 and standard deviation 1
data <- rnorm(5000,1,1)
# standard histogram
hist(data, 50, col="#666666", border="white")
# adding statistical descriptors
dmean <- mean(data)
abline(v=dmean,col="red")
dmedian <- median(data)
abline(v=dmedian,col="green")
dmin <- min(data)
abline(v=dmin,col="blue")
dmax <- max(data)
abline(v=dmax,col="blue")
dsd <- sd(data)
abline(v=dmean+dsd)
abline(v=dmean-dsd)

#### Add your answer here
(double-click here to edit the cell)

#### Question: <ins>Are minimum and maximum good descriptors of this data set?</ins>




### Exercise 4
Apply the same analysis on `data <- rlnorm(5000)`



In [None]:
# Add your code here ...

#### Add your answer here
(double-click here to edit the cell)

#### Question: <ins>What is different?</ins>


#### Question: <ins>Do the statistical descriptors provide useful information?</ins>



### Exercise 5
Calculate the statistical descriptors for flow cytometry data of RNA molecules (from paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575505/#s2). 

You can get the processed table from http://computproteomics.bmb.sdu.dk/BMB539Data/FlowCytoData.csv (pass the URL to `read.csv()` and store the result in the object `FlowCyt`).

The columns _X18s.RNA_, _abl_ and _bcr_ denote the different molecules. Rows correspond to quantifications in different cells. Apply the statistical descriptors on the values in the columns with the additional argument `na.rm=T` (e.g. `mean(FlowCyt$bcr, na.rm=T)`). 
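
The effect of `na.rm` can be seen on a tiny vector with one missing value. This is only a sketch: `vals` is a hypothetical vector, not the flow cytometry data.

```r
# Hypothetical measurements with one missing value (not the flow cytometry data)
vals <- c(1.2, 3.4, NA, 2.8)

mean(vals)                # NA: the missing value propagates through the calculation
mean(vals, na.rm = TRUE)  # mean of the three observed values
sd(vals, na.rm = TRUE)    # standard deviation of the three observed values
```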





In [None]:
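# Read the table directly from the web and store it as FlowCyt (the name used in the text above)
FlowCyt <- read.csv(url("http://computproteomics.bmb.sdu.dk/BMB539Data/FlowCytoData.csv"))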
# add your code here ...

#### Add your answer here
(double-click here to edit the cell)

#### Question: <ins>Do the statistical descriptors describe all molecules accurately?</ins>


#### Question: <ins>Would you trust a report that provides only mean and standard deviation when you compare them to the distributions?</ins>



### Exercise 6
There are several functions to create data sets that contain random numbers: `rnorm, runif, rexp` (drawn from a normal, a uniform and an exponential distribution, respectively). Apply the functions to create 10, 20 and 50 random numbers.
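
Because these functions return different numbers on every call, it can help to fix the random seed while experimenting. The sketch below assumes an arbitrary seed (42) and arbitrary non-default parameters, chosen only for illustration.

```r
set.seed(42)                          # arbitrary seed, chosen only for illustration
a <- rnorm(10)
set.seed(42)
b <- rnorm(10)
identical(a, b)                       # same seed, same random numbers

u <- runif(10, min = -1, max = 1)     # uniform on the interval [-1, 1]
e <- rexp(10, rate = 2)               # exponential with rate 2
range(u)
range(e)
```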






In [None]:
# add your code here ...

#### Add your answer here
(double-click here to open the cell)

#### Question: <ins>Do you see a basic pattern in the values for each of the distributions?</ins>

#### Question: <ins>What are specific properties of the distributions? You can check the help pages (e.g. by writing `?rnorm`)</ins>


### Exercise 7
Create 1000 random numbers that are uniformly distributed:

The plot shows a histogram. Discuss the meaning of the different parts of the plot. Change the number of bins (_breaks_ in R). Estimate which number of bins describes the data best.  
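
To see which bins R actually uses, `hist()` can be called with `plot = FALSE`, which returns the bin boundaries and counts without drawing. A single number given to `breaks` is treated as a suggestion, and R rounds the boundaries to "pretty" values. The object `demo` below is a hypothetical stand-in for the exercise data.

```r
set.seed(1)                                  # arbitrary seed for reproducibility
demo <- runif(1000, -10, 10)                 # hypothetical data, same range as in the exercise

h <- hist(demo, breaks = 10, plot = FALSE)   # compute the bins without drawing
h$breaks                                     # bin boundaries actually used
h$counts                                     # number of values per bin
sum(h$counts)                                # every value falls into exactly one bin
```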

<!-- Create a boxplot of the data (`boxplot(unif_dat)`) and understand what the figure shows. You can find a simple explanation of boxplots here:    
 http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots -->



In [None]:
unif_dat <- runif(1000, -10,10) 
hist(unif_dat, breaks = 10) 
 

#### Add your answer here
(double-click here to open the cell)

#### Question: <ins>How many bins do you consider suitable?</ins>



<!-- ### Exercise 8 -->
<!-- Create an object _norm_dat_ containing 1000 normally distributed random numbers (`rnorm()` function). Visualize them by plotting the data as histogram and boxplot.  -->

<!-- What is different from the plots of uniformly distributed numbers? -->



### Exercise 8
Sort and plot the random numbers as *rank plot*:

You can compare two distributions by plotting their sorted values versus each other: `plot(sort(unif_dat), sorted_norm_dat)`. 

 

In [None]:
norm_dat <- rnorm(1000)
unif_dat <- runif(1000, -10,10) 
sorted_norm_dat <- sort(norm_dat)
sorted_unif_dat <- sort(unif_dat)
plot(sorted_norm_dat)
# add the other plot commands here ...


#### Add your answer here
(double-click here to open the cell)

#### Question: <ins>Why are the rank plot of the normal distribution and the plot comparing the two distributions similar?</ins>



### Exercise 9

Download _Supplementary Dataset 1_ (first sheet *Cell extract (CE)*) from a study investigating bladder cancer cells T24 to detect changes between the cancer cells and their metastatic subtype. 
Link to paper: http://www.nature.com/articles/srep25619

Open the table with Excel/LibreOffice and save it as a CSV file. 

Import the table into R 

`prot_dat <- read.csv("TableName",row.names=1, skip=1)`
(you need to upload the file to your jupyter notebook folder)

If you get an error, try 

`prot_dat <- read.csv("TableName",row.names=1, sep=";", dec=",", skip=1)`

And if you still cannot read the CSV file, you can download the data from http://computproteomics.bmb.sdu.dk/BMB539Data/ProtTable.csv and apply 

`prot_dat <- read.csv("ProtTable.csv",row.names=1, skip=1)`
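
Spreadsheet programs with certain regional settings write CSV files with a semicolon as field separator and a comma as decimal mark. In `read.csv()` these are controlled by the arguments `sep` and `dec`. The following sketch reads a hypothetical two-row table from a string (via the `text` argument) to show the effect; the column names and values are made up.

```r
# Hypothetical table using ";" as field separator and "," as decimal mark
csv_text <- "protein;intensity\nP1;1,5\nP2;2,25"

tab <- read.csv(text = csv_text, sep = ";", dec = ",")
str(tab)             # intensity is read as a numeric column
tab$intensity * 2    # arithmetic works because the values are numbers
```

The function `read.csv2()` uses these two settings as its defaults.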

View the first 6 lines of the table and understand its content: `head(prot_dat)`.

Convert the expression values to numerical values (ignore the warnings):  
`for (i in 13:22) prot_dat[,i] <- as.numeric(as.character(prot_dat[,i]))`

Plot the column `prot_dat$Area.T24_T1.normalized` as boxplot (`boxplot(...)`) and histogram (`hist(...)`). 

Plot all columns with intensities (access them by `prot_dat[,13:22]`) as boxplot.

Calculate _mean, median, sum_ and _standard deviation_ of one of the columns. Missing values are ignored when you add the argument _na.rm=T_, e.g. `mean(prot_dat[,13], na.rm=T)` 

Transform all protein abundances to their logarithm  
`lprot_dat <- as.matrix(log2(prot_dat[,13:22]))`  
`lprot_dat[!is.finite(lprot_dat)] <- NA`  
and plot the boxplots and histogram again: `boxplot(lprot_dat)`. 
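
The second line of the transformation is needed because the logarithm of zero is `-Inf`, which would distort plots and summaries. A minimal sketch on a hypothetical 2 x 2 matrix (`m` and `lg` are made-up names) shows what happens:

```r
# Hypothetical intensities, including a zero and a missing value
m <- matrix(c(8, 0, 4, NA), nrow = 2)

lg <- log2(m)                 # log2(0) gives -Inf, log2(NA) stays NA
lg
lg[!is.finite(lg)] <- NA      # replace all non-finite entries by NA
lg
```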

Calculate _mean, median_ and _standard deviation_ of one of the columns and try to locate the values in the boxplot.




In [None]:
# Add your code here:


#### Add your answer here
(double-click here to open the cell)

#### Question: <ins>What information does the file contain?</ins>


#### Question: <ins>What is strange when you plot the boxplot and the histogram of the data before taking its logarithm?</ins>

#### Question: <ins>Does the transformed data make more sense? Why could that be?</ins>





### Exercise 10
_Probabilities_
- Read the description of ```dnorm()```: ```help(dnorm)```
- Plot the density (```dnorm()```) and the cumulative (```pnorm()```) probability distribution of a normal distribution with mean 2.5 and standard deviation 1.5.
- Read the probability of having a number between 0.5 and 4 from the cumulative distribution. Verify this number with its calculation ```pnorm(4, 2.5, 1.5) - pnorm(0.5, 2.5, 1.5)```
- Repeat the same for the intervals (-1, 2) and (1, 2)

_Frequencies_
- The relative number of observations per unit interval around $x=2$ (between 1.5 and 2.5) is given by ```dnorm(x=2, 2.5, 1.5)```. Hence
  - In a sample of 100 the expected number of observations per unit interval in the immediate vicinity of $x=2$ is 25.16
  - In a sample of 1000 the expected number of observations per unit interval in the immediate vicinity of $x=2$ is 251.6
  - The expected number of values from a sample of 1000, between 1.9 and 2.1, is approximately $0.2 \cdot 251.6 = 50.32$, or, more precisely,  
```1000 * (pnorm(2.1, 2.5, 1.5) - pnorm(1.9, 2.5, 1.5))```

- Repeat the calculation for the intervals (-1,2) and (1,2). 
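
The numbers quoted in the _Frequencies_ part can be reproduced as follows. This is a sketch of the calculation described above; the variable names `d2`, `approx_n` and `exact_n` are chosen here for illustration.

```r
d2 <- dnorm(2, mean = 2.5, sd = 1.5)    # density at x = 2

100 * d2                                 # expected observations per unit interval, sample of 100
1000 * d2                                # same for a sample of 1000

approx_n <- 0.2 * 1000 * d2              # rectangle approximation for the interval (1.9, 2.1)
exact_n  <- 1000 * (pnorm(2.1, 2.5, 1.5) - pnorm(1.9, 2.5, 1.5))
c(approx = approx_n, exact = exact_n)    # the two values are close but not identical
```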


In [None]:
x <- seq(-5,10,0.01)
density <- dnorm(x, mean=2.5, sd=1.5)
cumulative <- pnorm(x, mean=2.5, sd=1.5)

## plot the functions:


# This code is related to a question below and the sample with 1000 observations above
plot(x, 1000*dnorm(x, mean=2.5, sd=1.5), type="l",ylab="frequency")
interval <- seq(1.5,2.5,0.01)
polygon(c(1.5,interval,2.5), c(0,1000*dnorm(interval, 2.5,1.5),0), col = "#FF000055")
polygon(c(1.5,2.5,2.5,1.5), 1000*c(dnorm(2, 2.5,1.5),dnorm(2, 2.5,1.5),0,0), col = "#00FF0055")
points(2,1000*dnorm(2,2.5,1.5),pch=15,col=2)
text(2,1000*dnorm(2,2.5,1.5),pch=15,col=2,labels =1000*dnorm(2,2.5,1.5), pos=1)


#### Add your answer here
(double-click here to open the cell)

##### Question I:  <ins>What are the 3 different arguments of these functions? How are they related to the Gaussian function?</ins>

_Answer_

##### Question II:  <ins>What is the difference between the first argument of ```dnorm``` and ```rnorm```?</ins>

_Answer_

##### Question III:  <ins>How would you estimate the probability of having a number between 0.5 and 4 from the density distribution?</ins>

_Answer_

##### Question IV:  <ins>What is the probability of obtaining exactly the number 2?</ins>

_Answer_

##### Question V:  <ins>What is the difference between probability and frequency?</ins>

_Answer_

##### Question VI:  <ins>How would you calculate the area of the rectangle and the area under the curve in the figure given above?</ins>

_Answer_


### Exercise 11
We now examine the behavior of the t-distribution, which is an integral part of the t-test, and of the exponential distribution.
- Plot the density and cumulative probability distribution (```dt()``` and ```pt()``` with argument ```df=3```) for a t-distribution with 3 degrees of freedom. Plot the normal distribution over it with ```lines()```. 
- Plot the density and cumulative probability distribution for an exponential distribution (```dexp()```) with a rate parameter equal to 1 (the default). Repeat with a rate parameter equal to 2. What happens when you do the plot on logarithmic (y-coordinate) and double-logarithmic scale?
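
The `d...` and `p...` functions of a distribution belong together: the cumulative function at a value is the area under the density up to that value. The sketch below checks this numerically with `integrate()` for the t-distribution and compares `pexp()` with its closed form; the evaluation point 1 is an arbitrary choice.

```r
upper <- 1                                                  # arbitrary evaluation point

# t-distribution with 3 degrees of freedom: integrate the density up to `upper`
area_t <- integrate(dt, lower = -Inf, upper = upper, df = 3)$value
c(integrated = area_t, pt = pt(upper, df = 3))

# exponential distribution with rate 2: closed form 1 - exp(-rate * x)
c(closed_form = 1 - exp(-2 * upper), pexp = pexp(upper, rate = 2))
```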


In [None]:
x <- seq(-5,5,0.01)
# density function
dens_t <- dt(x, df=3)

dens_exp <- dexp(x, rate = 1)
# continue ...


#### Add your answer here
(double-click here to open the cell)


##### Question I:  <ins>What happens to the t-distribution for high degrees of freedom?</ins>

_Answer_

##### Question II:  <ins>Which is a good visual way to check whether data is exponentially distributed?</ins>

_Answer_


### Exercise 12
Use the function ```rnorm()``` to draw a random sample of 25 values from a normal distribution with a mean of 0 and a standard deviation equal to 1.0. Use a histogram, with ```probability=TRUE``` to display the values. Overlay the histogram with: (a) an estimated density curve; (b) the theoretical density curve for a normal distribution with mean 0 and standard deviation equal to 1.0. Repeat with samples of 100, 500 and 1000 values, showing the different displays in different panels on the same graphics page (```par(mfrow=...)```)


In [None]:
rand <- rnorm(25)
hist(rand, probability = TRUE,ylim=c(0,0.5), border="#FFFFFF", col="#333333")
lines(density(rand))
x <- seq(-5,5,0.01)
lines(x, dnorm(x), col=2)


#### Add your answer here
(double-click here to open the cell)

##### Question I:  <ins>What are the black and the red lines?</ins>

_Answer_

##### Question II:  <ins>What improves when you increase the number of values?</ins>

_Answer_

##### Question III:  <ins>What does ```#333333``` mean?</ins>

_Answer_

### Exercise 13
Data with a distribution close to lognormal are common. Size measurements of biological organisms often have this character. As an example, consider the measurements of body weight (```body```) in the data frame ```Animals``` (```MASS``` package). Begin by drawing a histogram of the untransformed values, and overlay a density curve. Then

- Draw an estimated density curve for the logarithms of the values. 
- Determine the mean and standard deviation of ```log(Animals$body)```. Overlay the estimated density with the theoretical density for a normal distribution with the mean and standard deviation just obtained.



In [None]:
# Add your code here:

#### Add your answer here
(double-click here to open the cell)

##### Question I:  <ins>Does the distribution look like a normal distribution after transformation to a logarithmic scale?</ins>

_Answer_

### Exercise 14
Write a script that plots an estimated density curve for a random sample of 50 values from a normal distribution:

- Plot estimated density curves (```plot(density(...))```) for random samples containing 50 values
  - the normal distribution
  - the uniform distribution (```runif(50)```)
  - the $t$-distribution with 3 degrees of freedom. 
-  Overlay the three plots and use different colors.
- Repeat the same but now taking random samples of 500 and 5000 values



In [None]:
# Add your code here:

#### Add your answer here
(double-click here to open the cell)

##### Question I:  <ins>Why is the estimated density curve of the uniformly distributed values much higher?</ins>

_Answer_

### Exercise 15
There are two ways to make the estimated density smoother:

- One is to increase the number of samples
- The other one is to increase the bandwidth. For example
```
plot(density(rnorm(50), bw=0.2), type="l")
plot(density(rnorm(50), bw=0.6), type="l")
```

Repeat each of these with bandwidths of 0.15, with default choice of bandwidth, and with the bandwidth set to 0.75
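
The bandwidth that `density()` actually used is stored in the `bw` element of its result. When no bandwidth is given, the default rule `bw.nrd0()` is applied. The following sketch, with an arbitrary seed and made-up object names, prints the three bandwidths side by side:

```r
set.seed(3)                            # arbitrary seed for reproducibility
s <- rnorm(50)

d_default <- density(s)                # bandwidth chosen by the default rule
d_narrow  <- density(s, bw = 0.15)
d_wide    <- density(s, bw = 0.75)

c(default = d_default$bw, narrow = d_narrow$bw, wide = d_wide$bw)
isTRUE(all.equal(d_default$bw, bw.nrd0(s)))   # the default rule is bw.nrd0()
```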

In [None]:
# Add your code here:


#### Add your answer here
(double-click here to open the cell)

##### Question I:  <ins>What is the function that underlies the smoothing of the ```density``` function?</ins>

_Answer_

##### Question II:  <ins>Would you get the same result for data that is not normally distributed?</ins>

_Answer_

### Exercise 16

Density estimation depends strongly on the bandwidth and the choice of kernel, which sometimes makes it not very useful for judging normality. A much better tool is the quantile-quantile plot, which is closely related to cumulative probability distributions. Try the following script and assess how the plot characterizes normally distributed data.
- See how the plot deviates when comparing the normal distribution with random variables from other distributions.
- Increase the number of data points
- Substitute the ```rnorm()``` function by random variables from other distributions (e.g. ```rexp()``` and ```rlnorm()```)
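
To understand what is drawn, it helps to look at the coordinates that `qqnorm()` computes. With `plot.it = FALSE` the function returns them without drawing. The sketch below, using an arbitrary seed and a hypothetical sample `y`, confirms that the x-values are the theoretical standard normal quantiles `qnorm(ppoints(n))` and the y-values are the sample itself.

```r
set.seed(7)                               # arbitrary seed for reproducibility
y <- rnorm(20)

qq   <- qqnorm(y, plot.it = FALSE)        # compute the coordinates without drawing
theo <- qnorm(ppoints(length(y)))         # theoretical quantiles of the standard normal

isTRUE(all.equal(sort(qq$x), theo))       # x-axis: theoretical quantiles
isTRUE(all.equal(sort(qq$y), sort(y)))    # y-axis: the sample values
```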


In [None]:
qqnorm(rnorm(10))
qqnorm(rnorm(15))
qqnorm(rnorm(200))


#### Add your answer here
(double-click here to open the cell)

##### Question I:  <ins>How does the ```qqnorm()``` function show that the data is normally distributed?</ins>

_Answer_

##### Question II:  <ins>Which is the limiting function when increasing the number of values to infinity?</ins>

_Answer_

##### Question III:  <ins>How do the other tested distributions show their difference to a normal distribution when using the ```qqnorm()``` function?</ins>

_Answer_


### Exercise 17
Take the data sets ```lh``` and ```Animals``` and check for normality using ```qqnorm```. Do the same on their logarithmic values. Additionally, use ```boxplot()``` to get an idea about how the boxplot of a normal distribution looks.


In [None]:
library(MASS)
data("Animals")
# add your code here


#### Add your answer here
(double-click here to open the cell)

##### Question I:  <ins>Which data set is (approximately) normally distributed?</ins>

_Answer_

##### Question II:  <ins>Which data set is (approximately) log-normally distributed?</ins>

_Answer_