# Biomeccanica Multiscala
## Laboratory 2
***Molecular Driving Forces - Principles of Probability***

Authors:
    
- Prof. Marco A. Deriu (marco.deriu@polito.it)
- Lorenzo Pallante (lorenzo.pallante@polito.it)
- Eric A. Zizzi (eric.zizzi@polito.it)
- Marcello Miceli (marcello.miceli@polito.it)
- Marco Cannariato (marco.cannariato@polito.it)

# Import necessary packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
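from itertools import product   # Cartesian product, used to enumerate die rolls
from math import factorial      # factorial, used in the combinatorics exercises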
from random import shuffle
import scipy.stats as stats

%matplotlib inline
mpl.rcParams['font.size'] = 20
mpl.rcParams['lines.linewidth'] = 3
mpl.rcParams['legend.frameon'] = False
mpl.rcParams['legend.fontsize'] = 20

In [None]:
# copy over data repository
!if [ -n "$COLAB_GPU" ]; then git clone https://github.com/lorenzopallante/BiomeccanicaMultiscala.git; fi
!if [ -n "$COLAB_GPU" ]; then mv BiomeccanicaMultiscala/LAB/02-Probability/data .; fi

# Basic Probability

## Exercise 1
**Definition of probability**

What is the probability of throwing one die and getting a number greater than 3?

The probability of an event A is defined as the number of outcomes $n_A$ falling into the category of A divided by the total number of possible outcomes $N$: $$p_A = \frac{n_{A}}{N}$$

Outcomes in category A: 4, 5, 6 $\rightarrow$ $n_A = 3$

All possible outcomes: 1, 2, 3, 4, 5, 6 $\rightarrow$ $N = 6$
$$p_A = \frac{3}{6} = 0.5$$
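
The same count can be reproduced by enumerating the outcomes explicitly. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original notebook.

```python
from fractions import Fraction

# All equally likely outcomes of one fair six-sided die
outcomes = range(1, 7)

# Outcomes that fall into category A (a number greater than 3)
favourable = [o for o in outcomes if o > 3]

# Probability as favourable outcomes over total outcomes
p_A = Fraction(len(favourable), len(outcomes))
print(f"n_A = {len(favourable)}, N = {len(outcomes)}, p_A = {float(p_A):.2f}")
```

Using `Fraction` keeps the result exact, which makes it easy to compare with the hand calculation.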

## Exercise 2
**Independent events**

What is the probability of a 1 on the first roll of a die **AND** a 4 on the second roll of a die?

Two consecutive rolls of a die are independent events, therefore the rules of probability tell us that:
$$p(A \cap B) = p_A \bullet p_B$$

In this case, $p_A = p_B = 1/6$, therefore: $$p(A \cap B) = \left(\frac{1}{6}\right)^2 = 0.028$$
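
As a quick check of the product rule, one can enumerate all pairs of rolls and count the single favourable pair. The sketch below assumes a fair die and uses illustrative variable names.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered pairs (first roll, second roll)
rolls = list(product(range(1, 7), repeat=2))

# Count the pairs with a 1 on the first roll AND a 4 on the second roll
n_AB = sum(1 for first, second in rolls if first == 1 and second == 4)

p_AB = Fraction(n_AB, len(rolls))
print(f"p(A and B) = {p_AB} = {float(p_AB):.3f}")
```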

## Exercise 3
**Mutually exclusive events**

What is the probability of obtaining a 4 or a 6 in one die roll?

Since we consider one die roll, the events "outcome 4" and "outcome 6" are mutually exclusive. Therefore, the probability of their union is simply the sum of the single probabilities:

$$p\left(A\cup B \right) = p\left(A\right) + p\left(B \right) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3} $$
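
The result can also be estimated empirically by simulating many rolls. The following is only a sketch: the seed and the number of rolls are arbitrary choices made here for reproducibility, and the estimate fluctuates around the exact value.

```python
import random

random.seed(0)       # arbitrary fixed seed, so the run is reproducible
n_rolls = 100_000    # arbitrary number of simulated rolls

# Count the rolls whose outcome is a 4 or a 6
hits = sum(1 for _ in range(n_rolls) if random.randint(1, 6) in (4, 6))

estimate = hits / n_rolls
print(f"Estimated p = {estimate:.3f} (exact value: {1/3:.3f})")
```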

## Exercise 4
**Events that are not mutually exclusive**

A die is rolled two times. What is the probability of a 1 on the first roll **OR** a number greater than 4 on the second roll?

It is possible to compute the probability through the addition rule in the case of not mutually exclusive events:

$$p\left(A \cup B \right) = p\left(A\right) + p\left(B\right) - p\left(A \cap B \right)$$

$$p\left(A\right) = \frac{6}{36} \text{ ; } p\left(B\right) = \frac{12}{36} \text{ ; } p\left(A \cap B\right) = \frac{2}{36}$$

$$p\left(A \cup B \right) = \frac{16}{36}$$


<center><img src=https://miro.medium.com/max/939/1*BgZcoMvx9g_ZPUtM0SjrDA.png width=700 height=500 /></center>

Alternatively, we can enumerate all the outcomes and count the favourable ones:

In [None]:
outcomes = []
for i in range(1,7):      # outcome of the first roll
    for j in range(1,7):  # outcome of the second roll
        outcomes.append([i,j]) # store the current outcome

# now, transform the list in a numpy array
outcomes = np.vstack(outcomes) 
print(outcomes)

In [None]:
N = outcomes.shape[0]
print(f"Total number of outcomes: {N}")
n_AB = np.sum((outcomes[:,0]==1) | (outcomes[:,1]>4))
print(f"Number of favourable outcomes: {n_AB}")
p = n_AB/N ; print(f"Probability: {p:.2f}")

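# the result is different if the two events are counted
# separately, because the overlap is counted twice
n_A = np.sum(outcomes[:,0]==1)   # 1 on the first roll
n_B = np.sum(outcomes[:,1]>4)    # number greater than 4 on the second roll
print(f"Favourable outcomes considering separated events: {n_A+n_B}")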

## Exercise 5
**Events that are not independent**

What is the probability of getting a club, a diamond, and a club, in this order, in three consecutive extractions? How does this probability change if each card is returned to the deck after its extraction?

<div class="alert alert-info"> <b>Hint</b>: Remember the definition of conditional probability!</div>

***Try to do it by yourself first!***

### Solution

Let us first consider the case of extraction without replacement. The probability of getting a card of clubs in the first extraction is $$p(C)=\frac{13}{52}$$
The probability of getting a diamond in the second extraction is $$p(D|C)=\frac{13}{51}$$
The probability of getting a club in the third extraction is $$p(C|D,C)=\frac{12}{50}$$

The probability of the composite event is
$$p(C,D,C)=p(C)\bullet p(D|C) \bullet p(C|D,C) = \frac{13}{52} \bullet \frac{13}{51} \bullet \frac{12}{50} = 0.0153$$

If instead each card is returned to the deck, the extractions become independent and each suit has the same probability at every extraction. Thus,
$$p(C,D,C)=p(C)^3 = \left(\frac{13}{52}\right)^3 = 0.0156$$

This corresponds to a relative increase of about 2.16% with respect to the case without replacement.
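
The two probabilities and their relative difference can be verified with exact fractions. This is a minimal sketch with illustrative variable names, assuming a standard 52-card deck with 13 cards per suit.

```python
from fractions import Fraction

# Without replacement: club, then diamond, then club
p_without = Fraction(13, 52) * Fraction(13, 51) * Fraction(12, 50)

# With replacement: three independent draws, each with probability 13/52
p_with = Fraction(13, 52) ** 3

# Relative increase of the second probability with respect to the first
rel_increase = (p_with - p_without) / p_without

print(f"Without replacement: {float(p_without):.4f}")
print(f"With replacement:    {float(p_with):.4f}")
print(f"Relative increase:   {100 * float(rel_increase):.2f} %")
```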

## Exercise 6
A die is rolled 3 times. What is the probability that at least one of these events occurs:
- a 2 is obtained in the first roll
- a 6 is obtained in the second roll
- a 5 is obtained in the third

<div class="alert alert-info"> <b>Hint</b>: When computing the probability of the union of two or more events, make sure you are treating the intersection of the events in the right way!</div>

***Try to do it by yourself first!***

### Solution

In this case, we have to consider the union of three events that are NOT mutually exclusive. Therefore, we have to apply the general addition rule:

$$p(A \cup B \cup C)=p(A) + p(B) + p(C) - p(A \cap B)- p(A \cap C) - p(B \cap C) + p(A \cap B \cap C)$$

<center><img src=https://i.pinimg.com/originals/89/a1/17/89a1177e7e1f988eb92878d39c95175e.png width=500 height=500 /></center>

In [None]:
# Create a matrix containing all the possible outcomes of the three rolls of the dice
# This is the same as we have done before, but in a more compact form using list comprehension
tot_out = np.array([x for x in product(np.arange(1,7),np.arange(1,7),np.arange(1,7))])

# compute the single probabilities
p_A = np.sum(tot_out[:,0]==2)/tot_out.shape[0]
p_B = np.sum(tot_out[:,1]==6)/tot_out.shape[0]
p_C = np.sum(tot_out[:,2]==5)/tot_out.shape[0]
p_A_B = np.sum((tot_out[:,0]==2)&(tot_out[:,1]==6))/tot_out.shape[0]
p_A_C = np.sum((tot_out[:,0]==2)&(tot_out[:,2]==5))/tot_out.shape[0]
p_B_C = np.sum((tot_out[:,2]==5)&(tot_out[:,1]==6))/tot_out.shape[0]
p_A_B_C = np.sum((tot_out[:,0]==2)&(tot_out[:,1]==6)&(tot_out[:,2]==5))/tot_out.shape[0]

# now apply the formula
p = p_A + p_B + p_C - p_A_B - p_A_C - p_B_C + p_A_B_C
print(f"The probability is {100*p:.2f} %")

# We can check if the formula gives the same result as counting the favourable events
# (np.isclose is used because floating-point values should not be compared with ==)
p2 = np.sum((tot_out[:,0]==2)|(tot_out[:,1]==6)|(tot_out[:,2]==5))/tot_out.shape[0]
print(f"Are the two probabilities the same? {np.isclose(p, p2)}")
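
A further cross-check, not used in the solution above, is the complement rule: at least one of the three events occurs unless all three rolls miss their target number. The sketch below assumes a fair die, so each roll misses with probability 5/6, and the rolls are independent.

```python
from fractions import Fraction

# Probability that a single roll does NOT show its target number
p_miss = Fraction(5, 6)

# The three rolls are independent, so all three miss with probability p_miss**3;
# the union of the three events is the complement of that
p_union = 1 - p_miss ** 3

print(f"The probability is {100 * float(p_union):.2f} %")
```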

# Combinatorics

## Exercise 7
**Permutations and Combinations**

You have a set of 10 marbles. Compute:
1. the number of permutations of the marbles, if they are all different;
2. the number of permutations of the marbles, if 3 of them are blue, 4 are green, and 3 are red;
3. the possible number of subsets composed by 6 marbles, considering them all different and ignoring their order.

In the case of **different marbles**, the number of possible permutations is given by the factorial:

$$N = 10! = 10 \bullet 9 \bullet 8 \bullet ... \bullet 2 \bullet 1 = 3628800$$

If some marbles are identical, we have to account for the repetitions. If there are $n$ elements divided into groups of $k_1, k_2, \dots, k_m$ identical elements, the total number of distinguishable permutations is:

$$N = \frac{n!}{k_1! \, k_2! \cdots k_m!} $$

Therefore:

$$N = \frac{10!}{3! 4! 3!} = 4200$$

If we want to compute the possible number of subsets composed of 6 marbles, ignoring their order and considering that the marbles are all different, we should compute the binomial coefficient.

The binomial coefficient ${n \choose k}$ is usually read as *n choose k* because there are ${n \choose k}$ ways to choose an (unordered) subset of $k$ elements from a fixed set of $n$ elements.

$${n \choose k} = \frac{n!}{k! (n-k)!}$$

In our case:

$$N = {10 \choose 6} = \frac{10!}{6! 4!} = 210$$
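
The three results of this exercise can be reproduced with the standard library. This is a minimal sketch; `math.comb` requires Python 3.8 or later, and the variable names are illustrative.

```python
from math import comb, factorial

# 1. Permutations of 10 distinct marbles
n_all_different = factorial(10)

# 2. Permutations with 3 blue, 4 green and 3 red marbles
n_with_repeats = factorial(10) // (factorial(3) * factorial(4) * factorial(3))

# 3. Unordered subsets of 6 marbles chosen among 10 distinct marbles
n_subsets = comb(10, 6)

print(n_all_different, n_with_repeats, n_subsets)
```

Integer division (`//`) is used so that the results stay exact integers.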

## Exercise 8
**Combinatorics and Probability**

A system is composed of 15 particles of type A and 5 particles of type B that can be arranged in a total of 23 sites. Considering the particles of the same type as indistinguishable, how many arrangements are possible? If the total number of sites is 20 or 26, how does the number of arrangements change?

We can solve this problem by considering a number of objects equal to the number of sites and three types of objects (particles A, particles B, and empty sites E):
$$n_A = 15; n_B = 5; n_E = n_S - n_A - n_B $$
where $n_S$ is the number of sites

In [None]:
# define the number of sites
nA, nB, nS = 15,5,23
nE = nS-nA-nB

# create a list containing the objects and print 4 example arrangements
seq = list('A'*nA + 'B'*nB + '_'*nE)
for i in range(4):
    # random permutation of the objects in the list
    shuffle(seq)   
    print(''.join(seq)+'\n')

The `str.join()` method is called on a separator string and accepts a list of strings (or any iterable of strings) as input: the separator is placed between the elements of the list to create a single string.

Calling `''.join(seq)` therefore joins the elements of the list with nothing between them.

In [None]:
for nS in [23,20,26]:
    nE = nS-nA-nB
    N = factorial(nS)//(factorial(nA)*
                        factorial(nB)*factorial(nE))  # integer division keeps the count exact
    print(f"{nS} sites: {N:.0f} arrangements\n")
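
The same count can be obtained in two steps: first choose the sites for the A particles, then choose the sites for the B particles among the remaining ones. The sketch below checks that this agrees with the factorial formula; the function `count_arrangements` is a hypothetical helper introduced here for illustration.

```python
from math import comb, factorial

def count_arrangements(n_sites, n_a=15, n_b=5):
    """Arrangements of n_a A particles and n_b B particles on n_sites sites."""
    n_empty = n_sites - n_a - n_b
    # Direct formula: permutations of n_sites objects with repeated elements
    direct = factorial(n_sites) // (
        factorial(n_a) * factorial(n_b) * factorial(n_empty)
    )
    # Stepwise count: place the A particles, then the B particles
    stepwise = comb(n_sites, n_a) * comb(n_sites - n_a, n_b)
    assert direct == stepwise
    return direct

for n_sites in (20, 23, 26):
    print(f"{n_sites} sites: {count_arrangements(n_sites)} arrangements")
```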

In [None]:
nS = np.arange(20,41,dtype=int)
N = list()
for n in nS:
    N.append(factorial(n)/(factorial(nA)* \
                           factorial(nB)* \
                           factorial(n-nA-nB)))
N = np.array(N)
fig,ax = plt.subplots(figsize=(10,6))
ax.plot(nS,N,color='k')
ax.set_xlim(nS[0]-1,nS[-1])
ax.set_ylim(bottom=0); ax.set_xlabel(r"$n_S$")
ax.set_ylabel('# combinations')
plt.tight_layout()

## Exercise 9
**Combinatorics and Probability**

A test contains 10 questions, each with four possible answers, only one of which is correct. To pass the test, at least 5 questions must be answered correctly. What is the probability that a completely unprepared student will pass the test?

<div class="alert alert-info"> <b>Hint</b>: When computing the grade, each question can have only two possible outcomes, correct (1) or wrong (0). Moreover, the answers of two different questions are independent.</div>

***Try to do it by yourself first!***

### Solution

Let us consider the test as one event. The probability for the student to pass the test with a score of exactly 5 is
$$p = {10 \choose 5} p_{correct}^5 \bullet p_{wrong}^5$$
where the binomial coefficient accounts for the fact that the order of the correct answers has no effect on the final grade.

The probability for the student to pass the test with a score of 6 is
$$p = {10 \choose 6} p_{correct}^6 \bullet p_{wrong}^4$$
and so on...

Since the test outcomes are mutually exclusive, the total probability is the sum of the individual ones.

In [None]:
pq = 1/4
p = 0
for i in range(5,11):
    p += factorial(10)/(factorial(i)*factorial(10-i))*(pq**(i))*((1-pq)**(10-i))
print(f"The probability is {100*p:.2f} %")
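
As a cross-check, the same probability can be computed exactly through the complement, i.e. one minus the probability of answering at most 4 questions correctly. This is a sketch with illustrative variable names, assuming that each answer is guessed uniformly at random.

```python
from fractions import Fraction
from math import comb

p_correct = Fraction(1, 4)   # one correct answer out of four
n_questions = 10

# Probability of failing: 0 to 4 correct answers
p_fail = sum(
    comb(n_questions, k) * p_correct**k * (1 - p_correct)**(n_questions - k)
    for k in range(5)
)

# Passing is the complementary event
p_pass = 1 - p_fail
print(f"The probability is {100 * float(p_pass):.2f} %")
```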

# Probability Distributions

## Exercise 10
**Computing the probability from a distribution**

The bond angle of the water molecule has an average value of 104.5° and a standard deviation of 5°. What is the probability that a water molecule has an angle between 90° and 100°?

The probability can be derived from the probability density function as:
$$P(a,b)=\int_a^b \! p(x) \, \mathrm{d}x$$

where we assume the probability density to be a Gaussian with mean $\mu$ and variance $\sigma^2$:
$$p(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\left(x-\mu\right)^2/2\sigma^2}$$

In [None]:
x = np.linspace(80,130,500)       # create the x points where the function will be evaluated
y = stats.norm.pdf(x, 104.5, 5)   # values of a normal distribution of mean 104.5 and standard deviation 5
i = (x>=90) & (x<=100)            # indices of points we are interested in
fig,ax = plt.subplots(figsize=(8,4.5))
ax.plot(x,y,color='k')
ax.fill_between(x[i],0,y[i],color='r',alpha=.5)
ax.set_xlim(80,130)
ax.set_ylim(bottom=0); ax.set_xlabel(r"Angle (°)")
ax.set_ylabel('Density')
plt.tight_layout()
plt.show()

p = np.trapz(y[i],x[i])   # numerical integration of the sampled density with the trapezoidal rule
print(f"Probability: {p:.2f}")
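
For a Gaussian density the integral can also be written in closed form through the error function, which gives a reference value for the numerical integration. The sketch below uses only the standard library; `normal_cdf` is a hypothetical helper defined here, not a function from the notebook.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian with mean mu and std sigma."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

mu, sigma = 104.5, 5.0

# P(90 <= angle <= 100) as a difference of two CDF values
p_exact = normal_cdf(100.0, mu, sigma) - normal_cdf(90.0, mu, sigma)
print(f"Probability: {p_exact:.4f}")
```

The trapezoidal estimate is expected to be close to this value, with small differences due to the finite grid spacing.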

## Exercise 11

Consider the Boltzmann distribution:
$$ p(x) = \frac{1}{\beta} e^{-x/\beta}$$

- If $\beta=10$, what is the ratio between the probability densities at $x=10$ and $x=20$?
- Consider the same ratio with $\beta=5$ and $\beta=20$. How does it change?
- Plot the Boltzmann distribution at the three values of $\beta$. What differences do you notice? If $\beta \propto T$, where $T$ is the temperature, and $x$ is the energy of a particle when it is in a certain position, what can you deduce from the plot of the distributions?

***Try to do it by yourself first!***

### Solution

In [None]:
beta = 10
p1 = np.exp(-10/beta)/beta
p2 = np.exp(-20/beta)/beta
print(f"{p1/p2:.2f}")

In [None]:
for beta in [5,10,20]:
    p1 = np.exp(-10/beta)/beta
    p2 = np.exp(-20/beta)/beta
    print(f"At beta = {beta}:\t{p1/p2:.2f}")

As expected from the Boltzmann distribution, probabilities at low $x$ values are greater than at higher values. As the parameter $\beta$ increases, this difference is reduced, i.e., the different $x$ values become more and more equiprobable and, in the limit $\beta \rightarrow \infty$, the distribution becomes flat.
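
The ratio can also be derived analytically, since the normalization factor $1/\beta$ cancels out:

$$\frac{p(x_1)}{p(x_2)} = \frac{\frac{1}{\beta} e^{-x_1/\beta}}{\frac{1}{\beta} e^{-x_2/\beta}} = e^{(x_2-x_1)/\beta}$$

With $x_1 = 10$ and $x_2 = 20$, the ratio is $e^{10/\beta}$:

$$\beta = 5: \; e^{2} \approx 7.39 \qquad \beta = 10: \; e^{1} \approx 2.72 \qquad \beta = 20: \; e^{1/2} \approx 1.65$$

The ratio depends only on the difference $x_2 - x_1$ relative to $\beta$, and it tends to 1 as $\beta \rightarrow \infty$.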

In [None]:
x2 = 50
# create a set of 100 equidistant points between 0.01 and x2
x = np.linspace(0.01,x2,100)
fig,ax = plt.subplots(figsize=(10,6))
for beta in [5,10,20]:
    y = np.exp(-x/beta)/beta
    ax.plot(x,y,label=fr"$\beta$={beta}")
ax.set_xlim(0,x2)
ax.set_ylim(bottom=0); ax.set_xlabel(r"$x$")
ax.set_ylabel(r'$P(x)$')
ax.legend()
plt.tight_layout()
plt.show()


The steepness of the curve at low $x$ increases as $\beta$ decreases, and at the same time higher probability densities are reached. At higher $\beta$, the distribution becomes flatter and flatter.

<div class="alert alert-success">Considering the case in which $\beta \propto T$, and $x$ is the energy of a particle when it is in a certain position, the Boltzmann distribution describes a situation where positions of the particle corresponding to a lower energy are favoured in general. However, as the temperature increases, positions at higher energy become more probable. In the limit $T \rightarrow \infty$, all positions have the same probability.</div>


## Exercise 12
**Advanced Exercise**

In the folder *data* there is a file named *samples.txt* containing 500 samples obtained from the distribution of two variables, $x$ and $y$. Read the samples from the file into two variables using a numpy built-in function, then
1. plot a histogram of each of the two distributions using the matplotlib built-in function **plt.hist**
2. draw a scatter plot of the samples with the matplotlib built-in function **plt.scatter**. Is there a correlation between the two variables?
3. estimate the mean and variance of the variable x.
4. supposing that x follows a Gaussian distribution with the estimated mean and variance, compute the probability that x is greater than 11. Based on the histogram of x plotted before, is the assumption of a Gaussian distribution reasonable?

***Complete the missing parts of the following code***


<div class="alert alert-danger"> <b>Hint</b>: This exercise may contain functions and pieces of code that might be useful for the project!</div>


In [None]:
# Read the data from the file
# With this function, each variable is a column of a matrix
data = np.loadtxt('data/samples.txt',comments='#')

# plot the histograms
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.hist(data[:,0])
ax1.set_ylabel('Count')
ax1.set_xlabel('x')
ax2 = fig.add_subplot(122)
ax2.hist(data[:,1])
ax2.set_ylabel('Count')
ax2.set_xlabel('y')
plt.tight_layout()

In [None]:
# scatter plot of the two variables
# the scatter plot function accepts separately the
# x and y data
fig,ax = plt.subplots(figsize=(5,5))
ax.scatter(data[:,0],data[:,1],15)
ax.set_xlabel('x'); ax.set_ylabel('y')
plt.tight_layout()

From the scatter plot, we can observe that the two variables x and y seem to be positively correlated. To quantify the linear correlation, we can compute the correlation coefficient between them.

The correlation coefficient is computed with the following formula:


$$r=\frac{\sum_i{(x_i-x_m)(y_i-y_m)}}{\sqrt{\sum_i{(x_i-x_m)^2} \sum_i{(y_i-y_m)^2}}}$$

where $x_m$ and $y_m$ are the sample means of the two variables.

In [None]:
# this function to compute the correlation accepts
# separately the x and y data
r,_ = stats.pearsonr(data[:,0],data[:,1])
print(f'Correlation Coefficient: {r:.2f}')
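
To make the definition concrete, the Pearson correlation coefficient can be computed by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below uses a small hypothetical data set, not the samples in the file, and `pearson_r` is a helper defined here for illustration.

```python
from math import sqrt

# Hypothetical toy data (NOT the samples from data/samples.txt)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x_m = sum(x) / len(x)
    y_m = sum(y) / len(y)
    # Numerator: sum of the products of the deviations from the means
    num = sum((xi - x_m) * (yi - y_m) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations
    den = sqrt(sum((xi - x_m) ** 2 for xi in x) * sum((yi - y_m) ** 2 for yi in y))
    return num / den

r = pearson_r(x, y)
print(f"Correlation Coefficient: {r:.2f}")
```

For real data sets, `stats.pearsonr` is preferable because it also returns the p-value of the test.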

In [None]:
# since the data is stored in a numpy array, the mean and variance can be easily computed with the built-in functions
x_m, y_m = data.mean(axis=0)
x_v, y_v = data.var(axis=0)
print(f"x: mean = {x_m:.2f}; variance = {x_v:.2f}\n")
print(f"y: mean = {y_m:.2f}; variance = {y_v:.2f}")

***Remember:***

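In a two-dimensional numpy array, axis 0 runs along the rows (down each column) and axis 1 runs along the columns (across each row). Applying a function with `axis=0`, as done above, therefore returns one value per column.

Many numpy functions can be applied along a direction, which can be specified with the `axis` option.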

In [None]:
# Compute the probability as for Exercise 10
# The grid extends to 5 standard deviations so that the upper tail is not truncated
x_sd = np.sqrt(x_v)
xx = np.linspace(x_m-5*x_sd,x_m+5*x_sd,1000)
yy = stats.norm.pdf(xx, x_m,x_sd)
i = (xx>=11)
fig,ax = plt.subplots(figsize=(8,4.5))
ax.plot(xx,yy,color='k')
ax.fill_between(xx[i],0,yy[i],color='r',alpha=.5)
ax.set_xlim(x_m-2.5*x_sd,x_m+2.5*x_sd)
ax.set_ylim(bottom=0); ax.set_xlabel(r"x"); ax.set_ylabel('Density')
plt.tight_layout(); plt.show()

p = np.trapz(yy[i],xx[i])
print(f"Probability: {p:.2f}")
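
The tail probability of a Gaussian also has a closed form through the complementary error function, which avoids choosing an integration grid. The sketch below uses hypothetical values for the mean and standard deviation (the estimates from the data should be used instead), and `normal_sf` is a helper defined here for illustration.

```python
from math import erfc, sqrt

def normal_sf(x, mu, sigma):
    """Survival function P(X > x) of a Gaussian with mean mu and std sigma."""
    return 0.5 * erfc((x - mu) / (sigma * sqrt(2.0)))

# Hypothetical values, used only for illustration:
# replace them with the mean and standard deviation estimated from the samples
mu_example, sd_example = 10.0, 1.0

p_tail = normal_sf(11.0, mu_example, sd_example)
print(f"Probability: {p_tail:.4f}")
```

With `scipy`, the equivalent call is `stats.norm.sf(11, x_m, x_sd)`.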