# Linear Algebra & Probability


## What you will learn in this course 🧐🧐

* Doing matrix calculations
* Understanding probabilities and how they are computed

## Algebra


### Vector

For the rest of the course, it is important to understand the difference between a scalar (a single number) and a vector. Let's start with what a vector is.


#### Definition

A vector is nothing more than a list of numbers. The size of this list can vary. For example, we can have :

### $\vec{u} = \begin{bmatrix}5\\2\end{bmatrix}$


This is a two-dimensional vector, denoted $\vec{u}$, and it can be drawn like this:

![](https://drive.google.com/uc?export=view&id=1upNvSWhsE04GiPbxt_udfb2wsFn1VKC_)


#### Operations on vectors

We can add two vectors together, component by component. Let $\vec{u}$ and $\vec{v}$ be two vectors with these coordinates:

### $\vec{u} = \begin{bmatrix}5\\2\end{bmatrix}$

### $\vec{v} = \begin{bmatrix}3\\4\end{bmatrix}$


### $\vec{u} + \vec{v} = \begin{bmatrix}5\\2\end{bmatrix} + \begin{bmatrix} 3\\4 \end{bmatrix} = \begin{bmatrix}8\\6\end{bmatrix}$



We can also multiply a vector by a number. Now let's take $\vec{w}$  with these coordinates :

$\vec{w} = \begin{bmatrix}3\\2\end{bmatrix}$ 

$4 \times \begin{bmatrix}3\\2\end{bmatrix} = \begin{bmatrix} 4 \times 3\\ 4 \times 2 \end{bmatrix} = \begin{bmatrix}12\\8\end{bmatrix}$
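
The two operations above can be sketched in a few lines of Python. This is only an illustration: storing vectors as plain lists and the helper names `add` and `scale` are assumptions made for this example, not something the course imposes.

```python
# Vectors are stored as plain Python lists (an illustrative choice).
def add(a, b):
    """Add two vectors of the same size, component by component."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [x + y for x, y in zip(a, b)]


def scale(k, a):
    """Multiply every component of the vector a by the number k."""
    return [k * x for x in a]


u = [5, 2]
v = [3, 4]
w = [3, 2]

print(add(u, v))    # [8, 6]
print(scale(4, w))  # [12, 8]
```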


#### Vectors collinearity

Let's take three vectors $\vec{u}$, $\vec{v}$, $\vec{w}$ whose values are respectively:

## $\vec{u} = \begin{bmatrix}4\\2\end{bmatrix}$ 

## $\vec{v} = \begin{bmatrix}2\\6\end{bmatrix}$

## $\vec{w} = \begin{bmatrix}9\\7\end{bmatrix}$


We can express:

## $\vec{w} = 2\vec{u} + \frac{1}{2}\vec{v}$



That is to say, there is a linear relationship that lets us express $\vec{w}$ as a combination of $\vec{u}$ and $\vec{v}$.

In general, when you have two two-dimensional vectors, you can reach any point in a two-dimensional plane by combining them, EXCEPT if the vectors are collinear, that is, if $\vec{u}$ can be written as a multiple of $\vec{v}$ or vice versa.

Let's take an example with two other vectors $\vec{i}$ and $\vec{j}$, respectively :

$\vec{i} = \begin{bmatrix}2\\4\end{bmatrix}$ 

$\vec{j} = \begin{bmatrix}1\\2\end{bmatrix}$


Here, we can see:

## $\vec{i} = 2\vec{j}$


This is a collinearity relationship between the two vectors, and it can be problematic: if these two vectors represented explanatory variables in a Machine Learning algorithm, using both variables at the same time would increase the errors made in the predictions (you'll learn more about that during module M05 about supervised machine learning). In practice, when we detect collinearity between two or more vectors, we usually drop one of them to avoid problems and to simplify the set of variables used in the analysis.
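
To make the idea concrete, here is a small Python sketch using the vectors of this section. The collinearity test relies on an assumption that is not spelled out above: two 2D vectors are collinear exactly when the quantity $a_x b_y - a_y b_x$ (their determinant) equals zero. The function names are hypothetical.

```python
def are_collinear(a, b):
    """Return True when the 2D vectors a and b are collinear.

    Uses the determinant a_x * b_y - a_y * b_x, which is zero
    exactly when one vector is a multiple of the other.
    """
    return a[0] * b[1] - a[1] * b[0] == 0


def combine(alpha, a, beta, b):
    """Return the linear combination alpha * a + beta * b."""
    return [alpha * x + beta * y for x, y in zip(a, b)]


u, v, w = [4, 2], [2, 6], [9, 7]
i, j = [2, 4], [1, 2]

print(combine(2, u, 0.5, v) == w)  # True: w = 2u + (1/2)v
print(are_collinear(u, v))         # False
print(are_collinear(i, j))         # True: i = 2j
```

The exact comparison with zero works here because the coordinates are integers. With decimal values, a small tolerance would be a safer choice.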


### Matrices


#### Definition

We talked about vectors above. A matrix is simply a table of numbers arranged in rows and columns; you can think of it as several vectors placed side by side. Here's an example:

### $X = \begin{bmatrix} 3 & 0 & 2\\ 2 & 1 & -1 \\ 1 & 0 & -2\end{bmatrix}$



#### Matrix product

Very often we have to do matrix multiplications to be able to move in the space of the vectors defined by the variables of our dataset. This is why it is good to know how matrix multiplication is performed.

Let's take two matrices $X$ and $Y$ with the following values:


### $X = \begin{bmatrix} 3 & 1 & 2\\ -1 & 3 & 2 \\ 3 & 1 & 4\end{bmatrix}$

### $Y = \begin{bmatrix} 2 & 1 & 0\\ -1 & 3 & 2 \\ 1 & 0 & 1\end{bmatrix}$



The product of the two matrices is computed as follows:

### $XY = \begin{bmatrix} 3 & 1 & 2\\ -1 & 3 & 2 \\ 3 & 1 & 4\end{bmatrix} \begin{bmatrix} 2 & 1 & 0\\ -1 & 3 & 2 \\ 1 & 0 & 1\end{bmatrix} = \begin{bmatrix} 6-1+2 & 3+3+0 & 0+2+2\\ -2-3+2 & -1+9+0 & 0+6+2 \\ 6-1+4 & 3+3+0 & 0+2+4\end{bmatrix} = \begin{bmatrix} 7 & 6 & 4 \\ -3 & 8 & 8 \\ 9 & 6 & 6 \end{bmatrix}$



More generally, matrix multiplication is done as follows:

## $\begin{bmatrix} a & b & c\\ d & e & f \\ g & h & i\end{bmatrix}$ $\begin{bmatrix} a' & b' & c'\\ d' & e' & f' \\ g' & h' & i'\end{bmatrix}$ $= \begin{bmatrix} aa'+ bd'+ cg' & ab'+be'+ch' & ac'+bf'+ci'\\ da'+ed'+fg' & db'+ee'+fh' & dc'+ef'+fi' \\ ga'+hd'+ig' & gb'+he'+ih' & gc'+hf'+ii'\end{bmatrix}$



Take a minute to analyze how the multiplication is done on the first column of the matrix and you will easily understand the rest of the columns.
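
The general rule above (each entry is a row of the first matrix multiplied term by term with a column of the second, then summed) can be sketched in Python as follows. Storing a matrix as a list of rows and the name `matmul` are assumptions made for this example.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    if len(x[0]) != len(y):
        raise ValueError("the columns of x must match the rows of y")
    return [
        # Entry (r, c): row r of x times column c of y, summed term by term.
        [sum(x[r][k] * y[k][c] for k in range(len(y))) for c in range(len(y[0]))]
        for r in range(len(x))
    ]


X = [[3, 1, 2], [-1, 3, 2], [3, 1, 4]]
Y = [[2, 1, 0], [-1, 3, 2], [1, 0, 1]]

for row in matmul(X, Y):
    print(row)
# [7, 6, 4]
# [-3, 8, 8]
# [9, 6, 6]
```

Note that the order matters: `matmul(Y, X)` gives a different matrix from `matmul(X, Y)`.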


#### Matrix inversion

Knowing how to invert matrices will allow us to solve linear equations with several unknowns. In particular, this will be the basis for linear regressions in Machine Learning.

Consider the following equations:


## $2x + y + 4z = -1$


## $x + 2y - z = 0$


## $3x + y + 2z = 2$



This could be represented as follows:

### $\begin{bmatrix} 2&1&4\\1&2&-1 \\3&1&2 \end{bmatrix}\begin{bmatrix} x\\ y\\ z\end{bmatrix} = \begin{bmatrix} -1\\ 0\\ 2\end{bmatrix}$


Let's call the first matrix $A$, the column of unknowns $\vec{u}$ and the right-hand side $\vec{v}$, such that:

### $A = \begin{bmatrix} 2&1&4\\ 1&2&-1 \\ 3&1&2 \end{bmatrix}$

### $\vec{u} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$

### $\vec{v} = \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix}$



As we had seen above, we can write: 

## $A\vec{u} = \vec{v}$

To solve the equation, we would like to compute the inverse of $A$, so that:

## $\vec{u} = \frac{1}{A}\vec{v}$



With matrices, we do not write $\frac{1}{A}$ but $A^{-1}$.

This gives:

### $\vec{u} = A^{-1}\vec{v}$

This $A^{-1}$ represents the **inverse matrix** of $A$


We will not see in this course how to calculate the inverse of a matrix but it is important to understand this simple concept of solving equations which will be used later on to talk about Machine Learning.
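
For the curious, here is a minimal Python sketch that solves the system above by computing $A^{-1}$ and then $A^{-1}\vec{v}$. The inversion method used here, Gauss-Jordan elimination, is one possible technique chosen for this illustration and is not covered in this course; the function name `inverse` is hypothetical. Exact fractions are used to avoid rounding errors.

```python
from fractions import Fraction


def inverse(matrix):
    """Invert a square matrix with Gauss-Jordan elimination (exact arithmetic)."""
    n = len(matrix)
    # Augment each row of the matrix with the matching row of the identity.
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row with a non-zero pivot in this column and move it up.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot equals 1.
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


A = [[2, 1, 4], [1, 2, -1], [3, 1, 2]]
v = [-1, 0, 2]

A_inv = inverse(A)
u = [sum(a * b for a, b in zip(row, v)) for row in A_inv]
print([str(x) for x in u])  # ['23/15', '-17/15', '-11/15']
```

Plugging $x = \frac{23}{15}$, $y = -\frac{17}{15}$, $z = -\frac{11}{15}$ back into the three equations confirms the result.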


## Probabilities ##

Let's move on to the probabilities. For some machine learning algorithms, we will need some theoretical knowledge of probability.


### Definition

A probability is simply the "percentage chance" that an event will occur. This is not a rigorous definition but an intuitive one. A probability is therefore a number between 0 and 1.

For example, if we flip a coin, the probability of getting "tails" is 50%, or 0.5. Let A be the event "getting tails". We write:

### $P(A) = \frac{1}{2}$


### Union and intersection of events

Let A and B be two events. There are two ways of "combining" these two events : union and intersection

#### Intersection of events

The probability $P(A \cap B)$, where $A \cap B$ is the intersection of A and B, corresponds to the probability that both A **and** B occur. 

Example :
* We flip two coins. Let A be the event "getting Heads on the first coin" and B "getting Heads on the second coin". Then, the event "getting two Heads" is the intersection of A and B, and $P(\text{getting two Heads}) = P(A \cap B)$

#### Union of events

The probability $P(A \cup B)$, where $A \cup B$ is the union of A and B, corresponds to the probability that A **or** B occurs. 
*Here, "or" is the logical "or", which means that the union $A \cup B$ actually covers three cases: "A happened but B didn't happen", "B happened but A didn't happen", "both A and B happened". It's interesting to notice that the last case is the intersection of A and B.*

Example :
* We throw a die. Let A, B, C be the events "getting 2", "getting 4", "getting 6". Then, the event "getting an even number" is the union of A, B and C, and $P(\text{getting an even number}) = P(A \cup B \cup C)$


#### Fundamental relationship between probability of union and probability of intersection
The following formula is always true :

### $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
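
The formula can be checked by counting outcomes. The sketch below uses one fair die and two events chosen only for this illustration ("getting an even number" and "getting more than 3"); representing events as Python sets is also an assumption of the example.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # sample space of one fair die


def prob(event):
    """Probability of an event (a set of outcomes), all outcomes being equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))


A = {2, 4, 6}  # hypothetical event: "getting an even number"
B = {4, 5, 6}  # hypothetical event: "getting more than 3"

print(prob(A | B))                      # 2/3
print(prob(A) + prob(B) - prob(A & B))  # 2/3
```

Subtracting $P(A \cap B)$ avoids counting the outcomes 4 and 6 twice, since they belong to both events.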

### Conditional probability
The conditional probability of A given B, written $P_B(A)$, is the probability of A given that B has happened. Provided that $P(B) > 0$, the conditional probability can be computed from the probability of the intersection of the events:

### $P_B(A) = \frac{P(A \cap B)}{P(B)}$


### Mutual exclusivity
**Mutually exclusive events** (sometimes also called **disjoint events**) are events that can't occur at the same time. Let A and B be two disjoint events, then :

### $P(A \cap B) = 0$

#### Fundamental properties related to mutual exclusivity
* Let A and B be two disjoint events. Then $P(A \cup B) = P(A) + P(B)$
* Let $e_1, e_2, \dots, e_n$ be all the possible outcomes of an experiment. If these outcomes are mutually exclusive, then $\sum_{i=1}^{n} P(e_i) = 1$

#### Example
We flip a coin. There are only two possible outcomes: A "getting Heads" and B "getting Tails". A and B are mutually exclusive events, so we can write $P(A \cup B) = P(A) + P(B) = 1$. 

### Independent events

We say that two events A and B are **independent** if the probability $P(A)$ doesn't change whether B has occurred or not (and conversely, $P(B)$ is the same whether A has occurred or not). In this case :

### $P(A \cap B) = P(A) \times P(B)$

Example
* We flip two fair coins. The events A "getting Heads on first coin" and B "getting Heads on second coin" are independent, so the probability of the intersection "getting 2 Heads" is : $P(\text{getting 2 Heads}) = P(A) \times P(B) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$
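
The two-coin example can be verified by listing every outcome. This is a sketch that assumes both coins are fair, so that the four outcomes are equally likely; the helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of flipping two fair coins: ("H", "H"), ("H", "T"), ...
outcomes = list(product("HT", repeat=2))


def prob(condition):
    """Probability that a condition holds, by counting the outcomes that satisfy it."""
    favourable = [o for o in outcomes if condition(o)]
    return Fraction(len(favourable), len(outcomes))


p_a = prob(lambda o: o[0] == "H")                     # Heads on first coin
p_b = prob(lambda o: o[1] == "H")                     # Heads on second coin
p_both = prob(lambda o: o[0] == "H" and o[1] == "H")  # two Heads

print(p_a, p_b, p_both)     # 1/2 1/2 1/4
print(p_both == p_a * p_b)  # True
```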



### Let's practice probabilities !

Imagine we're in a class of students. In this class we have 10 students divided as follows:


<table>
  <tr>
   <td>Student
   </td>
   <td>Strong school subjects
   </td>
  </tr>
  <tr>
   <td>1
   </td>
   <td>French, Maths
   </td>
  </tr>
  <tr>
   <td>2
   </td>
   <td>Maths, French
   </td>
  </tr>
  <tr>
   <td>3
   </td>
   <td>English, French
   </td>
  </tr>
  <tr>
   <td>4
   </td>
   <td>Maths, English
   </td>
  </tr>
  <tr>
   <td>5
   </td>
   <td>Spanish, French
   </td>
  </tr>
  <tr>
   <td>6
   </td>
   <td>Physics, Maths
   </td>
  </tr>
  <tr>
   <td>7
   </td>
   <td>Sports, English
   </td>
  </tr>
  <tr>
   <td>8
   </td>
   <td>English, French
   </td>
  </tr>
  <tr>
   <td>9
   </td>
   <td>French, Sports
   </td>
  </tr>
  <tr>
   <td>10
   </td>
   <td>French, Maths
   </td>
  </tr>
</table>


Let $A$ be the event "the student is strong in French" and $B$ the event "the student is strong in Maths".

We can say that:

## $P(A) = \frac{7}{10}$

## $P(B) = \frac{5}{10}$



And we can also say:

## $P(A \cap B) = \frac{3}{10}$


Now, imagine we only look at the students who are strong in French. Among them, how many are also strong in Maths? To answer that, we can make a subgroup:


<table>
  <tr>
   <td>Student
   </td>
   <td>Strong school subjects
   </td>
  </tr>
  <tr>
   <td>1
   </td>
   <td>French, Maths
   </td>
  </tr>
  <tr>
   <td>2
   </td>
   <td>Maths, French
   </td>
  </tr>
  <tr>
   <td>3
   </td>
   <td>English, French
   </td>
  </tr>
  <tr>
   <td>5
   </td>
   <td>Spanish, French
   </td>
  </tr>
  <tr>
   <td>8
   </td>
   <td>English, French
   </td>
  </tr>
  <tr>
   <td>9
   </td>
   <td>French, Sports
   </td>
  </tr>
  <tr>
   <td>10
   </td>
   <td>French, Maths
   </td>
  </tr>
</table>


Here you can see that 3 of these 7 students are strong in Maths, so the answer is $\frac{3}{7}$. We just verified the formula of conditional probability :

## $P_{A}(B) = \frac{P(A \cap B)}{P(A)} = \frac{3/10}{7/10} = \frac{3}{7}$ 
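
The same exercise can be reproduced by counting in Python. The dictionary below copies the class table; the data structure and the helper `prob` are choices made for this sketch.

```python
from fractions import Fraction

# Strong subjects of each student, copied from the class table.
students = {
    1: {"French", "Maths"},
    2: {"Maths", "French"},
    3: {"English", "French"},
    4: {"Maths", "English"},
    5: {"Spanish", "French"},
    6: {"Physics", "Maths"},
    7: {"Sports", "English"},
    8: {"English", "French"},
    9: {"French", "Sports"},
    10: {"French", "Maths"},
}


def prob(*subjects):
    """Share of students who are strong in all the given subjects."""
    matching = [s for s in students.values() if set(subjects) <= s]
    return Fraction(len(matching), len(students))


p_a = prob("French")            # P(A)
p_b = prob("Maths")             # P(B)
p_ab = prob("French", "Maths")  # P(A and B)

print(p_a, p_b, p_ab)  # 7/10 1/2 3/10
print(p_ab / p_a)      # 3/7, the conditional probability of B given A
```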



### Probability distributions


#### Random variable

To understand probability distributions, it is important to understand what a random variable is: it is a variable that can take all the values that correspond to the possible outcomes of a given experiment. Let's take an example:

We throw a die. Every time the value is even, the player wins 3€. If the value is odd, the player loses 3€.

Let $X$ be the random variable that represents the player's possible earnings. So what are the possible values for $X$?


<table>
  <tr>
   <td>Result of the die throw
   </td>
   <td>1
   </td>
   <td>2
   </td>
   <td>3
   </td>
   <td>4
   </td>
   <td>5
   </td>
   <td>6
   </td>
  </tr>
  <tr>
   <td>X values
   </td>
   <td>-3€
   </td>
   <td>+3€
   </td>
   <td>-3€
   </td>
   <td>+3€
   </td>
   <td>-3€
   </td>
   <td>+3€
   </td>
  </tr>
</table>


In this case, the random variable $X$ can only take two values : -3 and 3.


The probability distribution of $X$ gives, for each possible value of $X$, the probability that $X$ takes that value. If we take the example from above:


<table>
  <tr>
   <td>Possible values for <b>$X$</b>

   </td>
   <td>-3
   </td>
   <td>+3
   </td>
  </tr>
  <tr>
   <td>Probability of obtaining each value of <b>$X$</b>

   </td>
   <td>
3 / 6
   </td>
   <td>
3 / 6
   </td>
  </tr>
</table>


Generally, we will write:


<table>
  <tr>
   <td>Possible values for <b>$X$</b>
   </td>
   <td>
<b>$x_1$</b>
   </td>
   <td>
<b>$x_2$</b>
   </td>
   <td>...
   </td>
   <td>
<b>$x_n$</b>
   </td>
  </tr>
  <tr>
   <td>Probability of obtaining each value of <b>$X$</b>

   </td>
   <td>
$P(X = x_1)$
   </td>
   <td>
$P(X = x_2)$
   </td>
   <td>...
   </td>
   <td>
$P(X = x_n)$
   </td>
  </tr>
</table>


In our example above, the random variable was discrete, as it took only two possible values (-3 and 3). But keep in mind that we can also work with continuous random variables. In this case, the probability distribution is described by a function $f(x)$, usually called the "probability density".

Some standard probability distributions come up very often in statistics and machine learning, for example the Binomial distribution or the Normal distribution.


#### Expected Value

Before we talk about the most famous probability distribution, we need to explain what the expectation and the variance are.


##### If the random variable is discrete

If the random variable is discrete, then the expectation is simply the mean of its possible values, weighted by their probabilities:

## $E(X) = \sum_{i=1}^{n} P(X = x_i) \, x_{i}$


##### If the random variable is continuous

If our variable follows a continuous distribution with probability density $f$, then the expected value is: 

## $E(X) = \int_{-\infty}^{+\infty} x f(x) \, dx$



#### Variance

The variance is the expected "squared deviation" between the random variable and its expected value. It measures how spread out the values are. The formula is written as follows:

## $V(X) = E((X - E(X))^2)$
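
As an illustration, here is a short Python sketch that computes the expectation and the variance of the die game (win 3€ on an even value, lose 3€ on an odd one). The expectation is computed as the sum of each value multiplied by its probability. Storing the distribution as a dictionary and the function names are assumptions made for this example.

```python
from fractions import Fraction

# Distribution of the die game: value of X -> probability.
distribution = {-3: Fraction(3, 6), 3: Fraction(3, 6)}


def expectation(dist):
    """Probability-weighted sum of the values of a discrete random variable."""
    return sum(p * x for x, p in dist.items())


def variance(dist):
    """Expected squared deviation from the expectation."""
    mean = expectation(dist)
    return sum(p * (x - mean) ** 2 for x, p in dist.items())


print(expectation(distribution))  # 0
print(variance(distribution))     # 9
```

An expectation of 0 means that, on average, the player neither wins nor loses money, while a variance of 9 reflects that every single game moves the player 3€ away from that average.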



#### Normal distribution


##### What is that?

There is a lot of talk about the normal distribution, but it is often difficult to understand why. Let's explain what it is first.

The probability density of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is a Gaussian function:


## $f(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma })^{2}}$


The function is quite complex but just imagine graphically that you get a bell curve.
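
To get a feeling for the bell shape, the density can be evaluated at a few points. This is a minimal sketch that reads $\mu$ as the mean and $\sigma$ as the standard deviation (the usual meaning of these symbols); the function name `normal_pdf` and the default values 0 and 1 are choices made for this example.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# The curve peaks at the mean and is symmetric around it.
for x in (-2, -1, 0, 1, 2):
    print(x, round(normal_pdf(x), 4))
```

The printed values rise up to $x = \mu$ and then fall symmetrically, which is exactly the bell curve.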


##### The Central Limit Theorem

The reason why the Normal distribution is so famous is the central limit theorem. It says that, whatever the probability distribution that a random variable $X$ follows (as long as its variance is finite), if we take a large number of independent samples, the mean of these samples approximately follows a normal distribution.  

So this means that it doesn't matter much which initial law our random variable follows: if we have enough measurements and consider their mean value, that mean will approximately follow a normal distribution.
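
A small simulation can make this tangible. The sketch below repeatedly averages rolls of a fair die (whose own distribution is flat, not bell-shaped) and looks at how the averages are spread. The numbers of rolls and experiments, as well as the seed, are arbitrary choices made for this illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is reproducible


def sample_mean(n_rolls):
    """Mean of n_rolls independent rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls


# 2000 experiments, each averaging 50 rolls (both numbers are arbitrary).
means = [sample_mean(50) for _ in range(2000)]

centre = statistics.mean(means)
spread = statistics.stdev(means)
# Share of the sample means lying within one standard deviation of the centre.
# For an exact bell curve this share would be about 0.68.
within_one_sd = sum(abs(m - centre) <= spread for m in means) / len(means)

print(f"mean of the sample means: {centre:.3f}")
print(f"share within one standard deviation: {within_one_sd:.2f}")
```

The sample means cluster around 3.5, the expected value of one die roll, and a histogram of `means` would show the familiar bell shape even though a single roll is uniformly distributed.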

## Resources 📚📚

What is a probability law - [https://www.youtube.com/watch?v=SiG-hFEN_iQ](https://www.youtube.com/watch?v=SiG-hFEN_iQ)

Conditional probability - [https://www.youtube.com/watch?v=5oBnmZVrOXE](https://www.youtube.com/watch?v=5oBnmZVrOXE)

Vectors explained - [https://www.youtube.com/watch?v=fNk_zzaMoSs](https://www.youtube.com/watch?v=fNk_zzaMoSs)

Vector collinearity - [https://www.youtube.com/watch?v=k7RM-ot2NWY&t=2s](https://www.youtube.com/watch?v=k7RM-ot2NWY&t=2s)