
![LaTeX Logo](https://www.latex-project.org/img/latex-project-logo.svg)


## LaTeX

[LaTeX](https://www.latex-project.org/) is a typesetting language for producing scientific documents. In this notebook we show a small part of the language for writing mathematical notation. Jupyter notebook recognizes LaTeX code written in markdown cells and renders the symbols in the browser using the [MathJax](https://www.mathjax.org/) JavaScript library.


[Jupyter Math Reference](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Typesetting%20Equations.html)



### The Lorenz Equations
\begin{align}
\dot{x} & = \sigma(y-x) \\
\dot{y} & = \rho x - y - xz \\
\dot{z} & = -\beta z + xy
\end{align}

## How to Show Mathematics Inline and Display

Enclose LaTeX code in dollar signs `$ ... $` to display math inline. For example, the code `$f(x) = x^2$` renders inline as $ f(x) = x^2 $.

Enclose LaTeX code in double dollar signs `$$ ... $$` to display expressions in a centered paragraph. For example:

```LaTeX
$$f(x) = x^2$$
```

renders as

$$f(x) = x^2$$

You can also enclose LaTeX code with `\begin{equation*}` and close with `\end{equation*}`. 

For instance,

```LaTeX
\begin{equation*}
n^{22}\\
\end{equation*}
```
will render as

\begin{equation*}
n^{22}\\
\end{equation*}

You can also use the `align` environment for more advanced, multi-line expressions.

For instance,

```LaTeX
\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\   \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{align}
```

will render as:

\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\   \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{align}

Here's another example, using the `align` and `aligned` environments:

```LaTeX
 \begin{align}
 \begin{aligned}
    2x+3 &= 7 & 2x+3-3 &= 7-3 \\
    2x &= 4 & \frac{2x}2 &= \frac42\\
    x &= 2
 \end{aligned}
 \end{align}
```

and will render as:

 \begin{align}
 \begin{aligned}
    2x+3 &= 7 & 2x+3-3 &= 7-3 \\
    2x &= 4 & \frac{2x}2 &= \frac42\\
    x &= 2
 \end{aligned}
 \end{align}
 


See the [LaTeX WikiBook](https://en.wikibooks.org/wiki/LaTeX), especially the section on [mathematics](https://en.wikibooks.org/wiki/LaTeX/Mathematics).

In [1]:
# and yet another way to render Math using code in a Jupyter notebook cell is
from IPython.display import display, Math

display(Math(r'\int_0^1\sin x \;dx'))

<IPython.core.display.Math object>

### The Cauchy-Schwarz Inequality
\begin{equation*}
\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)
\end{equation*}

### A Cross Product Formula
\begin{equation*}
\mathbf{V}_1 \times \mathbf{V}_2 =  \begin{vmatrix}
\mathbf{i} & \mathbf{j} & \mathbf{k} \\
\frac{\partial X}{\partial u} &  \frac{\partial Y}{\partial u} & 0 \\
\frac{\partial X}{\partial v} &  \frac{\partial Y}{\partial v} & 0
\end{vmatrix}
\end{equation*}

### The probability of getting (k) heads when flipping (n) coins is
\begin{equation*}
P(E)   = {n \choose k} p^k (1-p)^{ n-k}
\end{equation*}
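
As a rough sketch, the formula above can be evaluated with Python's standard library. The function name `binomial_probability` and the sample values are illustrative assumptions, not part of the original notebook; `math.comb` computes the binomial coefficient.

```python
from math import comb

def binomial_probability(n, k, p):
    """Probability of exactly k heads in n flips when each flip lands heads with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# hypothetical example: 3 heads in 10 flips of a fair coin
print(binomial_probability(10, 3, 0.5))
```

Summing the function over every k from 0 to n should give 1, which is a quick sanity check on any implementation.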

### An Identity of Ramanujan
\begin{equation*}
\frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} =
1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}}
{1+\frac{e^{-8\pi}} {1+\ldots} } } }
\end{equation*}

### A Rogers-Ramanujan Identity
\begin{equation*}
1 +  \frac{q^2}{(1-q)}+\frac{q^6}{(1-q)(1-q^2)}+\cdots =
\prod_{j=0}^{\infty}\frac{1}{(1-q^{5j+2})(1-q^{5j+3})},
\quad\quad \text{for $|q|<1$}.
\end{equation*}

### Maxwell’s Equations
\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\   \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{align}

### Sourcing
This expression $\sqrt{3x-1}+(1+x)^2$ is an example of a TeX inline equation in a [Markdown-formatted](https://daringfireball.net/projects/markdown/) sentence.

### Simple Greek Letter Equation
\begin{equation*}
\forall x \in X, \quad \exists y \leq \epsilon
\end{equation*}

### Greek Letters
\begin{equation*}
\alpha, A, \beta, B, \gamma, \Gamma, \pi, \Pi, \phi, \varphi, \mu, \Phi
\end{equation*}

### Limits
\begin{equation*}
\lim\limits_{x \to \infty} \exp(-x) = 0
\end{equation*}

### Modular arithmetic
\begin{equation*}
a \bmod b \\
a \pmod b
\end{equation*}

### Powers and indices
\begin{equation*}
k_{n+1} = n^2 + k_n^2 - k_{n-1} \\
\end{equation*}

For powers with more than one digit, surround the power with { }

\begin{equation*}
n^{22}\\
\end{equation*}

An underscore ( _ ) can be used with a vertical bar ( | ) to denote evaluation using subscript notation in mathematics

\begin{equation*}
f(n) = n^5 + 4n^2 + 2 |_{n=17}
\end{equation*}
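
To make the evaluation notation concrete, here is a small sketch that computes the expression at n = 17, reading the bar as applying to the whole right-hand side. The function name `f` simply mirrors the equation.

```python
def f(n):
    """The polynomial n^5 + 4n^2 + 2 from the equation above."""
    return n**5 + 4 * n**2 + 2

# evaluate at n = 17, as the subscript on the vertical bar indicates
print(f(17))
```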


### Aligning your equations
\begin{equation}
\begin{aligned}
F ={} & \{F_{x} \in  F_{c} : (|S| > |C|) \\
      & \cap (\mathrm{minPixels}  < |S| < \mathrm{maxPixels}) \\
      & \cap (|S_{\mathrm{connected}}| > |S| - \epsilon)\}
\end{aligned}
\end{equation}

### Fractions and Binomials
A fraction is created using the \frac{numerator}{denominator} command. Likewise, the binomial coefficient (aka the Choose function) may be written using the \binom command:

\begin{equation*}
\frac{n!}{k!(n-k)!} = \binom{n}{k}
\end{equation*}


You can embed fractions within fractions:

\begin{equation*}
\frac{\frac{1}{x}+\frac{1}{y}}{y-z}
\end{equation*}

### Fractional exponents
\begin{equation*}
x^\frac{1}{2}
\end{equation*}

### Continued Fractions
\begin{equation}
  x = a_0 + \cfrac{1}{a_1 
          + \cfrac{1}{a_2 
          + \cfrac{1}{a_3 + \cfrac{1}{a_4} } } }
\end{equation}
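
A continued fraction like the one above can be evaluated from the innermost term outward. The sketch below is one possible way to do it; the helper name `continued_fraction` and the sample coefficients are assumptions chosen for illustration, and `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def continued_fraction(coefficients):
    """Evaluate a_0 + 1/(a_1 + 1/(a_2 + ...)) exactly, starting from the last coefficient."""
    value = Fraction(coefficients[-1])
    for a in reversed(coefficients[:-1]):
        value = a + 1 / value
    return value

# hypothetical coefficients a_0..a_4; [1, 2, 2, 2, 2] approximates the square root of 2
x = continued_fraction([1, 2, 2, 2, 2])
print(x, float(x))
```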

### Multiplication of two numbers
To make multiplication visually similar to a fraction, a nested array can be used, for example multiplication of numbers written one below the other.

\begin{equation}
\frac{
    \begin{array}[b]{r}
      \left( x_1 x_2 \right)\\
      \times \left( x'_1 x'_2 \right)
    \end{array}
  }{
    \left( y_1y_2y_3y_4 \right)
  }
\end{equation}

### Roots
The \sqrt command creates a square root surrounding an expression.

\begin{equation}
\sqrt{\frac{a}{b}}
\end{equation}

It accepts an optional argument specified in square brackets ([ and ]) to change the degree of the root.

\begin{equation}
\sqrt[n]{1+x+x^2+x^3+\dots+x^n}
\end{equation}


### Sums and integrals

The \sum and \int commands insert the sum and integral symbols respectively, with limits specified using the caret (^) and underscore (_). The typical notation for sums is:

\begin{equation}
\sum_{i=1}^{10} t_i
\end{equation}

or

\begin{equation}
\displaystyle \sum_{i=1}^{10} t_i
\end{equation}

The limits for the integrals follow the same notation. It's also important to represent the integration variables with an upright d, which in math mode is obtained through the \mathrm{} command, and with a small space separating it from the integrand, which is attained with the \, command.

\begin{equation}
\int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x
\end{equation}

### Summation and Product

[Summation](https://en.wikipedia.org/wiki/Summation) is one of the most useful and commonly used symbols in iterative mathematics and data science. Although it looks complex, it's quite simple.

Take for instance:

\begin{equation}
\sum_{i}^{n} t_i
\end{equation}

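Here $i$ is the index variable and $n$ is the upper limit (the stopping value). The index starts at the value given below the symbol (if $i=1$, you start at 1) and increases by one until it reaches $n$; each term $t_i$ is added to the running total.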

\begin{equation}
\sum_{i=1}^{5} t_i
\end{equation}

This expression is simply the sum of the five values $t_1$ through $t_5$.

The symbol represents a for loop that runs from the number on the bottom to the number on top. The variable on the bottom is the index variable, and the result of each iteration is added to a running total.

Here is simple code for it:

In [2]:
x = [1, 2, 3, 4, 5]
result = 0
for i in range(5):
    result += x[i]
print(result)

15


The product operator works in the same manner, but instead of being added, the results are multiplied.

\begin{equation}
\prod_{i}^{n} t_i
\end{equation}

\begin{equation}
\prod_{i=1}^{5} t_i
\end{equation}

In [1]:
x = [1, 2, 3, 4, 5]
result = 1
for i in range(5):
    result *= x[i]
print(result)

120
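
The two loops above can also be written with built-in functions. This is only a sketch of an equivalent, more compact form; the list name `t` mirrors the $t_i$ in the equations and is otherwise arbitrary.

```python
from math import prod

t = [1, 2, 3, 4, 5]

total = sum(t)       # same result as the summation loop
product = prod(t)    # same result as the product loop
print(total, product)
```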


### Factorial

In mathematics, the [factorial](https://en.wikipedia.org/wiki/Factorial) of a positive integer n, denoted by n!, is the product of all positive integers less than or equal to n:

- $n! = 1 \times 2 \times 3 \times \dots \times n$

for instance

- $5! = 1 \times 2 \times 3 \times 4 \times 5 = 120$

Negative numbers don't have a factorial, and 0! = 1.

Some useful identities:

- $n! = (n-1)! \times n$
- $(n+1)! = n! \times (n+1)$
- $(n+k)! = n! \times (n+1) \times (n+2) \times \dots \times (n+k)$
- $(n-k)! = \frac{n!}{(n-k+1) \times (n-k+2) \times \dots \times (n-k+k)}$
- $(n-k)! = \frac{n!}{(n-k+1) \times (n-k+2) \times \dots \times n}$

if n > k, then

- $\frac{n!}{k!} = (k+1) \times (k+2) \times \dots \times n$

so if $n = 7$ and $k = 4$, then

- $\frac{7!}{4!} = 5 \times 6 \times 7$
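
The ratio of two factorials can be checked numerically. In this sketch the helper name `factorial_ratio` is an assumption made for illustration; it multiplies only the terms from k+1 up to n and compares the result with the definition.

```python
from math import factorial, prod

def factorial_ratio(n, k):
    """Return n!/k! for n > k as the product (k+1) * (k+2) * ... * n."""
    return prod(range(k + 1, n + 1))

n, k = 7, 4
print(factorial_ratio(n, k))            # product 5 * 6 * 7
print(factorial(n) // factorial(k))     # same value computed from the definition
```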

5! coded would look like:

In [4]:
result = 1
for i in range(1,6):
    result *= i
print(result)

120


In [5]:
# several libraries provide a way to compute a factorial
import math
import numpy
from scipy import special

print("using scipy special library", special.factorial(5, exact=True))
print("using numpy prod", numpy.prod(numpy.arange(1, 6)))
print("using python math library", math.factorial(5))


using scipy special library 120
using numpy prod 120
using python math library 120


We are just touching on the SciPy ecosystem here; go to https://www.scipy.org/about.html to read more.

### Brackets, braces and delimiters
The use of delimiters such as brackets soon becomes important when dealing with anything but the most trivial equations. Without them, formulas can become ambiguous. Also, special types of mathematical structures, such as matrices, typically rely on delimiters to enclose them.

There are a variety of delimiters available for use in LaTeX:
\begin{equation}
( a ), [ b ], \{ c \}, | d |, \| e \|,
\langle f \rangle, \lfloor g \rfloor,
\lceil h \rceil, \ulcorner i \urcorner
\end{equation}

where \lbrack and \rbrack may be used in place of [ and ].

### Automatic sizing
Very often mathematical features will differ in size, in which case the delimiters surrounding the expression should vary accordingly. This can be done automatically using the \left, \right, and \middle commands. Any of the previous delimiters may be used in combination with these:

\begin{equation}
\left(\frac{x^2}{y^3}\right)
\end{equation}

\begin{equation}
P\left(A=2\middle|\frac{A^2}{B}>4\right)
\end{equation}

Curly braces are defined differently by using \left\{ and \right\},

\begin{equation}
\left\{\frac{x^2}{y^3}\right\}
\end{equation}

If a delimiter on only one side of an expression is required, then an invisible delimiter on the other side may be denoted using a period (.).

\begin{equation}
\left.\frac{x^3}{3}\right|_0^1
\end{equation}

### Manual sizing
In certain cases, the sizing produced by the \left and \right commands may not be desirable, or you may simply want finer control over the delimiter sizes. In this case, the \big, \Big, \bigg and \Bigg modifier commands may be used:

\begin{equation}
( \big( \Big( \bigg( \Bigg(
\end{equation}

These commands are primarily useful when dealing with nested delimiters. For example, when typesetting

\begin{equation}
\frac{\mathrm d}{\mathrm d x} \left( k g(x) \right)
\end{equation}

we notice that the \left and \right commands produce the same size delimiters as those nested within it. This can be difficult to read. To fix this, we write

\begin{equation}
\frac{\mathrm d}{\mathrm d x} \big( k g(x) \big)
\end{equation}

Manual sizing can also be useful when an equation is too large, trails off the end of the page, and must be separated into two lines using an align command. Although the commands \left. and \right. can be used to balance the delimiters on each line, this may lead to wrong delimiter sizes. Furthermore, manual sizing can be used to avoid overly large delimiters if an \underbrace or a similar command appears between the delimiters.

### Matrices and arrays
A basic matrix may be created using the matrix environment: in common with other table-like structures, entries are specified by row, with columns separated using an ampersand (&) and rows separated with a double backslash (\\)

\begin{equation}
 \begin{matrix}
  a & b & c \\
  d & e & f \\
  g & h & i
 \end{matrix}
\end{equation}

When writing down arbitrary sized matrices, it is common to use horizontal, vertical and diagonal triplets of dots (known as ellipses) to fill in certain columns and rows. These can be specified using the \cdots, \vdots and \ddots respectively:

\begin{equation}
A_{m,n} = 
 \begin{pmatrix}
  a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
  a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
  \vdots  & \vdots  & \ddots & \vdots  \\
  a_{m,1} & a_{m,2} & \cdots & a_{m,n} 
 \end{pmatrix}
\end{equation}

In some cases you may want to have finer control of the alignment within each column, or want to insert lines between columns or rows. This can be achieved using the array environment, which is essentially a math-mode version of the tabular environment, which requires that the columns be pre-specified:

\begin{equation}
 \begin{array}{c|c}
  1 & 2 \\ 
  \hline
  3 & 4
 \end{array}
\end{equation}

You may see that the AMS matrix class of environments doesn't leave enough space when used together with fractions. The optional argument to the row separator, as in `\\[0.3em]` below, adds extra space between rows to counteract this:

\begin{equation}
 M = \begin{bmatrix}
       \frac{5}{6} & \frac{1}{6} & 0           \\[0.3em]
       \frac{5}{6} & 0           & \frac{1}{6} \\[0.3em]
       0           & \frac{5}{6} & \frac{1}{6}
     \end{bmatrix}
\end{equation}

### Matrices in running text
To insert a small matrix, and not increase leading in the line containing it, use the smallmatrix environment:

A matrix in text must be set smaller:
$\bigl(\begin{smallmatrix}
a&b \\ c&d
\end{smallmatrix} \bigr)$
to not increase leading in a portion of text.

### Adding text to equations
The math environment differs from the text environment in the representation of text. Here is an example of trying to represent text within the math environment:
\begin{equation}
50 apples \times 100 apples = lots of apples^2
\end{equation}

There are two noticeable problems: there are no spaces between words or numbers, and the letters are italicized and more spaced out than normal. Both issues are simply artifacts of the maths mode, in that it treats it as a mathematical expression: spaces are ignored (LaTeX spaces mathematics according to its own rules), and each character is a separate element (so are not positioned as closely as normal text).

There are a number of ways that text can be added properly. The typical way is to wrap the text with the \text{...} command (a similar command is \mbox{...}, though this causes problems with subscripts, and has a less descriptive name). Let's see what happens when the above equation code is adapted:

\begin{equation}
50 \text{apples} \times 100 \text{apples} = \text{lots of apples}^2
\end{equation}

The text looks better. However, there are no gaps between the numbers and the words. Unfortunately, you are required to explicitly add these. There are many ways to add spaces between maths elements, but for the sake of simplicity we may simply insert space characters into the \text commands.
 
\begin{equation}
50 \text{ apples} \times 100 \text{ apples} = \text{lots of apples}^2
\end{equation}

### Formatted text
Using \text is fine and gets the basic result. Yet, there is an alternative that offers a little more flexibility. You may recall the introduction of font formatting commands, such as \textrm, \textit, \textbf, etc. These commands format the argument accordingly, e.g., \textbf{bold text} gives bold text. These commands are equally valid within a maths environment to include text. The added benefit here is that you can have better control over the font formatting, rather than the standard text achieved with \text.
\begin{equation}
50 \textrm{ apples} \times 100
 \textbf{ apples} = \textit{lots of apples}^2
\end{equation}

### Formatting mathematics symbols

These formatting commands can be wrapped around the entire equation, and not just on the textual elements: they only format letters, numbers, and uppercase Greek, and other math commands are unaffected.

To bold lowercase Greek or other symbols use the \boldsymbol command; this will only work if there exists a bold version of the symbol in the current font. As a last resort there is the \pmb command (poor man's bold): this prints multiple versions of the character slightly offset against each other.

\begin{equation}
\boldsymbol{\beta} = (\beta_1,\beta_2,\dotsc,\beta_n)
\end{equation}

To change the size of the fonts in math mode, see Changing font size.

### Accents
So what to do when you run out of symbols and fonts? Well, the next step is to use accents:

\begin{equation}
a'  a'' \hat{a} \bar{a} \grave{a} \acute{a} \dot{a}	\ddot{a} \not{a} \mathring{a}	
\overrightarrow{AB}	 \overleftarrow{AB}	a''' a''''	\overline{aaa}	\check{a} \breve{a}	\vec{a}	\dddot{a}		\ddddot{a}	\widehat{AAA} \widetilde{AAA} \stackrel\frown{AAA}	\tilde{a}	\underline{a} 
\end{equation}

### Color
The package xcolor, described in Colors, allows us to add color to our equations. 

\begin{equation}
k = {\color{red}x} \mathbin{\color{blue}-} 2
\end{equation}

### Plus and minus signs

LaTeX deals with the + and − signs in two possible ways. The most common is as a binary operator. When two maths elements appear on either side of the sign, it is assumed to be a binary operator, and as such, allocates some space to either side of the sign. The alternative way is a sign designation. This is when you state whether a mathematical quantity is either positive or negative. This is common for the latter, as in math, such elements are assumed to be positive unless a − is prefixed to it. In this instance, you want the sign to appear close to the appropriate element to show their association. If you put a + or a − with nothing before it but you want it to be handled like a binary operator you can add an invisible character before the operator using {}. This can be useful if you are writing multiple-line formulas, and a new line could start with a − or +, for example, then you can fix some strange alignments adding the invisible character where necessary.

A plus-minus sign is written as:
\begin{equation}
\pm
\end{equation}


Similarly, there exists also a minus-plus sign:

\begin{equation}
\mp
\end{equation}

### Controlling horizontal spacing
LaTeX is obviously pretty good at typesetting maths—it was one of the chief aims of the core TeX system that LaTeX extends. However, it can't always be relied upon to accurately interpret formulas in the way you did. It has to make certain assumptions when there are ambiguous expressions. The result tends to be slightly incorrect horizontal spacing. In these events, the output is still satisfactory, yet any perfectionists will no doubt wish to fine-tune their formulas to ensure spacing is correct. These are generally very subtle adjustments.

There are other occasions where LaTeX has done its job correctly, but you just want to add some space, maybe to add a comment of some kind. For example, in the following equation, it is preferable to ensure there is a decent amount of space between the maths and the text.

\begin{equation}
 f(n) =
  \begin{cases}
    n/2       & \quad \text{if } n \text{ is even}\\
    -(n+1)/2  & \quad \text{if } n \text{ is odd}
  \end{cases}
\end{equation}
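
The piecewise definition above translates directly into a conditional. This is a minimal sketch assuming integer input; integer division is used because both branches divide an even number by 2.

```python
def f(n):
    """Piecewise function from the cases environment above."""
    if n % 2 == 0:
        return n // 2            # n is even
    return -((n + 1) // 2)       # n is odd

print([f(n) for n in range(6)])
```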


OK, so back to the fine tuning as mentioned at the beginning of the document. A good example would be displaying the simple equation for the indefinite integral of y with respect to x:
if you were to try this, you may write:

\begin{equation}
\int y \mathrm{d}x
\end{equation}

However, this doesn't give the correct result. LaTeX doesn't respect the white-space left in the code to signify that the y and the dx are independent entities. Instead, it lumps them all together. A \quad would clearly be overkill in this situation; what is needed are some small spaces, and that's what LaTeX provides:

\begin{equation}
\int y\, \mathrm{d}x
\end{equation}

\begin{equation}
\int y\: \mathrm{d}x
\end{equation}

\begin{equation}
\int y\; \mathrm{d}x
\end{equation}

The negative thin space, written `\!`, may seem like an odd thing to use; however, it wouldn't be there if it didn't have some use! Take the following example:

\begin{equation}
\left(
    \begin{array}{c}
      n \\
      r
    \end{array}
  \right) = \frac{n!}{r!(n-r)!}
\end{equation}

The matrix-like expression for representing binomial coefficients is too padded. There is too much space between the brackets and the actual contents within. This can easily be corrected by adding a few negative spaces after the left bracket and before the right bracket.

\begin{equation}
\left(\!
    \begin{array}{c}
      n \\
      r
    \end{array}
  \!\right) = \frac{n!}{r!(n-r)!}
\end{equation}

In any case, adding some spaces manually should be avoided whenever possible: it makes the source code more complex and it's against the basic principles of a What You See is What You Mean approach. The best thing to do is to define some commands using all the spaces you want and then, when you use your command, you don't have to add any other space. Later, if you change your mind about the length of the horizontal space, you can easily change it modifying only the command you defined before. Let us use an example: you want the d of a dx in an integral to be in roman font and a small space away from the rest. If you want to type an integral like \int x \, \mathrm{d} x, you can define a command like this:


\newcommand{\dd}{\mathop{}\,\mathrm{d}}


in the preamble of your document. We have chosen \dd just because it is reminiscent of the "d" it replaces and it is fast to type. Doing so, the code for your integral becomes \int x \dd x. Now, whenever you write an integral, you just have to use the \dd instead of the "d", and all your integrals will have the same style. If you change your mind, you just have to change the definition in the preamble, and all your integrals will be changed accordingly.

[Source](https://en.wikibooks.org/wiki/LaTeX/Mathematics)

## Mean, Median, Mode

Mean, median, and mode are different measures of center in a numerical data set. They each try to summarize a dataset with a single number to represent a "typical" data point from the dataset.

__Mean:__ The "average" number; found by adding all data points and dividing by the number of data points.


\begin{equation}
\bar{y} = \frac{\sum y}{N}
\end{equation}

- $\sum y$ Take all the values and add them up
- divide by the total number of observations you have

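__Median:__ The middle number; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).


The median of $y$ is commonly written with a tilde:

\begin{equation}
\tilde{y} = \operatorname{median}(y)
\end{equation}

$ y = [1,2,3,4,4,3,2,1] $

Sorted, the values are $[1,1,2,2,3,3,4,4]$. The two middle numbers are 2 and 3, so

$ \tilde{y} = \frac{2+3}{2} = 2.5 $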


The __mode__ is the most commonly occurring data point in a dataset. The mode is useful when there are a lot of repeated values in a dataset. There can be no mode, one mode, or multiple modes in a dataset.
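
All three measures of center are available in Python's standard `statistics` module. The sketch below applies them to the example list `y` used in this section; `multimode` is used rather than `mode` because a dataset can have several modes.

```python
from statistics import mean, median, multimode

y = [1, 2, 3, 4, 4, 3, 2, 1]

print(mean(y))       # sum of the values divided by how many there are
print(median(y))     # average of the two middle values after sorting
print(multimode(y))  # every value tied for most frequent
```

In this list every value appears exactly twice, so all four distinct values are reported as modes.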

## Standard Deviation

__Standard deviation__ is a number used to tell how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out.

$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}$$


## Variance

The __Variance__ is defined as the average of the squared differences from the Mean. To calculate the variance, follow these steps: find the __Mean__; then for each number, subtract the __Mean__ and square the result (the squared difference); finally, take the average of those squared differences.

$$\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}$$
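
Note that the standard deviation formula above divides by $N-1$ (the sample form), while the variance formula divides by $n$ (the population form). The sketch below shows both conventions using the standard `statistics` module; the data in `x` is made up for illustration.

```python
from statistics import pstdev, pvariance, stdev

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5

print(pvariance(x))  # population variance: squared differences averaged over n
print(pstdev(x))     # population standard deviation: square root of the line above
print(stdev(x))      # sample standard deviation: divides by N - 1 before the square root
```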

## Root mean squared error (RMSE)
RMSE is a quadratic scoring rule that measures the average magnitude of the error. It’s the square root of the average squared difference between prediction and actual observation:

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(d_i - f_i\Big)^2}}$

where $d_i - f_i$ is the difference between the actual observation and the prediction for data point $i$, and $n$ is the number of data points.

We can compute this with sklearn:

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error

# example data: observed values and the model's predictions
expected = np.array([0.0, 0.5, 0.0, 0.5, 0.0])
predictions = np.array([0.2, 0.4, 0.1, 0.6, 0.2])

# one way: square root of sklearn's mean squared error
mse = mean_squared_error(expected, predictions)
rmse = sqrt(mse)
print('RMSE: %f' % rmse)

# another way: Euclidean norm of the differences divided by sqrt(n)
n = len(predictions)
rmse = np.linalg.norm(predictions - expected) / np.sqrt(n)
print('RMSE: %f' % rmse)
```
Both the mean squared error and its square root will be useful.


In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
expected = [0.0, 0.5, 0.0, 0.5, 0.0]
predictions = [0.2, 0.4, 0.1, 0.6, 0.2]
mse = mean_squared_error(expected, predictions)
rmse = sqrt(mse)
print('RMSE: %f' % rmse)

## What is regression?

__Regressions__ are one of the most commonly used tools in a data scientist's tool kit. We can plug our data back into a regression equation to see if the predicted output matches the corresponding observed value seen in the data.

The quality of a regression model is how well its predictions match up against actual values, but how do we actually evaluate quality? Luckily for us, some really smart mathematicians have developed error metrics to judge the quality of the model and enable us to compare regressions against other regressions with different parameters. These metrics are short and useful summaries of the quality of our data. Let's discuss __Linear Regression__.

## What is linear regression?
Linear regression is the most commonly used model in research and business and is the simplest to understand, so it makes sense to start developing our intuition for how models are assessed here. The intuition behind many of these metrics extends to other types of models and their respective metrics.

In the context of regression, models refer to mathematical equations used to describe the relationship between two variables. In general, these models deal with prediction and estimation of values of interest in our data called __outputs__. Think about our bearing example. Models will look at other aspects of the data called __inputs__ that we believe to affect the outputs, and use them to generate estimated outputs.

These inputs and outputs have many names that you may have heard before. Inputs can also be called independent variables or predictors, while outputs are also known as responses or dependent variables. Simply speaking, models are just functions where the outputs are some function of the inputs. The __linear__ part of linear regression refers to the fact that a linear regression model is described mathematically in the following form, where:


- $\hat{y}$ is your predicted output
- $\beta_0$ and $\beta_1$ are your coefficients
- $x$ is your input
- $\epsilon$ is your error

Linear Regression: Single Variable
\begin{equation}
\hat{y} = \beta_0 + \beta_1 x + \epsilon
\end{equation}

Linear Regression: Multiple Variable
\begin{equation}
\hat{y} = \beta_0 + \beta_1 x_1 + \dotsc + \beta_p x_p + \epsilon
\end{equation}

If that looks too mathematical, know that linear thinking is intuitive. The __regression__ part of linear regression does not refer to some return to a lesser state. Regression here simply refers to the act of estimating the relationship between our inputs and outputs. In particular, regression deals with the modelling of __continuous values__ _(think bearing or exhaust temperatures)_.

Taken together, a linear regression creates a model that assumes a linear relationship between the inputs and outputs. The higher the inputs are, the higher (or lower, if the relationship is negative) the outputs are. So if the exhaust temps are high or the reactive power load is high, then bearing temps might also be high (or that's what our thesis is). What adjusts how strong the relationship is and what the direction of the relationship is between the inputs and outputs are called __coefficients__. The first coefficient without an input is called the intercept, and it adjusts what the model predicts when all your inputs are 0. There exists a method to calculate the optimal coefficients, given which inputs we want to use to predict the output.
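
The text does not name the method used to calculate the coefficients. As a sketch, assuming ordinary least squares and a single input, the intercept and slope have a closed form that can be computed in plain Python. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # slope = covariance of x and y divided by the variance of x
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept, slope

# hypothetical data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta_0, beta_1 = fit_line(xs, ys)
print(beta_0, beta_1)
```

With several inputs the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.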

Given the coefficients, if we plug in values for the inputs, the linear regression will give us an __estimate__ for what the output should be. These outputs won't always be perfect: unless our data ends up in a straight line, our model will not precisely hit all of our data points. One of the reasons for this is the $\epsilon$ (named epsilon) term. $\epsilon$ represents error that comes from sources out of our control, causing the data to deviate slightly from their true position. Our error metrics will be able to judge the differences between predicted and actual values, but we cannot know how much the error has contributed to the discrepancy. While we cannot ever completely eliminate $\epsilon$, it is useful to retain a term for it in a linear model.

## Comparing model predictions against reality

Since our model will produce an output given any input or set of inputs, we can check these estimated outputs against the actual values that we tried to predict. We call the difference between the actual value and the model's estimate a __residual__. We can calculate the residual for every point in our data set, and each of these residuals will be of use in assessment. These residuals play a significant role in judging the usefulness of our model selections.
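
As a small illustration, residuals are just element-wise differences between actual and predicted values. The numbers below reuse the sample values from the RMSE example; the variable names are assumptions.

```python
actual = [0.0, 0.5, 0.0, 0.5, 0.0]
predicted = [0.2, 0.4, 0.1, 0.6, 0.2]

# one residual per data point: actual value minus the model's estimate
residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)
```

Some residuals are negative and some positive, which is why summary metrics take absolute values or squares before averaging.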

If our collection of residuals is small, it implies that the model that produced them does a good job at predicting our output of interest. Conversely, if these residuals are generally large, it implies that the model is a poor estimator. We technically can inspect all of the residuals to judge the model's accuracy, but unsurprisingly, this doesn't scale if we have thousands or millions of data points. Thus statisticians have developed summary measurements that take our collection of residuals and condense them into a single value that represents the predictive ability of our model. There are many of these summary statistics, each with its own advantages and pitfalls. For each, we'll discuss what the statistic represents, its intuition and typical use case.

- Mean Absolute Error (MAE)
- Mean Square Error (MSE)
- Mean Absolute Percentage Error (MAPE)
- Mean Percentage Error (MPE)

_Note: Even though you see error in each statistic above, it doesn't refer to the $\epsilon$ term from above! The error described in these metrics refers to the **residuals!**_

## Keeping it Real

In discussing these error metrics, it's easy to get bogged down by the various acronyms and equations used to describe them. To keep ourselves grounded, say we want to predict an output such as bearing1 temperature, and we think it depends partly on power needed, partly on exhaust gas temperature and partly on lube oil temperature. We're not sure if our model below is accurate or not, so we'll need to calculate error metrics to check if we should include more inputs or if the model is even any good.

- $\hat{y}$ is your predicted bearing temperature (the output variable)
- $\beta_0$ is zero because a zero in our data means the temp is equal to zero
- $\beta_1$ is our exhaust gas temp coefficient
- $x_1$ is exhaust gas temp
- $\beta_2$ is our power coefficient
- $x_2$ is our power
- $\beta_3$ is our lube oil temp coefficient
- $x_3$ is our lube oil temperature
- $\epsilon$ is your error; since our $\beta_0$ is zero, we cancel both this term and $\beta_0$ out
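
The terms listed above are not written out as an equation in this notebook. As a sketch, assuming the usual multiple linear regression form (the equation is reconstructed from the listed terms and is an assumption, not the author's stated model), the full relationship would be:

\begin{equation}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{equation}

With $\beta_0$ and $\epsilon$ dropped as described, the prediction reduces to:

\begin{equation}
\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
\end{equation}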

## Rationale behind this model

Let's say that losing a bearing is a major casualty, so we want to watch out for any anomalies with respect to each of the main diesel generator bearings. If each of the x values goes up, then it makes sense that our bearing temp will go up in a linear fashion.

## Mean Absolute Error (MAE)

The __mean absolute error__ (MAE) is the simplest regression error metric to understand. We'll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals. Remember the __mean__ is a descriptive statistic that looks at the average value of a data set. Most people understand it as the average. The mean is calculated below:

To get the mean:
\begin{equation}
\bar{y} = \frac{\sum y}{N}
\end{equation}

- $\sum y$ Take all the values and add them up
- divide by the total number of observations you have

In the case of the mean, the "middle" of the data set refers to this typical value. The mean represents a typical observation in our data set. If we were to pick one of our observations at random, then we're likely to get a value that's close to the mean. The calculation of the mean is a simple task in Python and Pandas. We can put the normal values for an operational profile/phase in a column of the dataframe and then run `df.mean()`.

The formal equation for MAE is:
\begin{equation}
MAE = \frac{1}{n}{\sum}| y - \hat{y} |
\end{equation}

- Divide by the total number of data points $\frac{1}{n}$
- Actual output value $y$
- Predicted output value $\hat{y}$
- Sum of the absolute value of the residual ${\sum}|y - \hat{y}|$


The MAE is also the most intuitive of the metrics since we’re just looking at the absolute difference between the data and the model’s predictions. Because we use the absolute value of the residual, the MAE does not indicate underperformance or overperformance of the model (whether or not the model under or overshoots actual data). Each residual contributes proportionally to the total amount of error, meaning that larger errors will contribute linearly to the overall error. Like we’ve said above, a small MAE suggests the model is great at prediction, while a large MAE suggests that your model may have trouble in certain areas. A MAE of 0 means that your model is a perfect predictor of the outputs (but this will almost never happen).
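
As a minimal sketch of the MAE formula above, the snippet below uses NumPy on a handful of made-up temperature values. The numbers and the names `y_actual` and `y_pred` are hypothetical and only illustrate the calculation.

```python
import numpy as np

# Hypothetical actual bearing temperatures and model predictions.
y_actual = np.array([150.0, 152.0, 149.0, 155.0])
y_pred = np.array([148.0, 153.0, 150.0, 151.0])

residuals = y_actual - y_pred        # signed residuals: 2, -1, -1, 4
mae = np.mean(np.abs(residuals))     # average magnitude of the residuals
print(mae)                           # 2.0
```

Taking the absolute value before averaging is what keeps the positive and negative residuals from cancelling out.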

While the MAE is easily interpretable, using the absolute value of the residual often is not as desirable as squaring this difference. Depending on how you want your model to treat outliers, or extreme values, in your data, you may want to bring more attention to these outliers or downplay them. The issue of outliers can play a major role in which error metric you use.

## Mean square error

The mean square error (MSE) is just like MAE, but squares the differences before summing them instead of using the absolute value.

The formal equation for MSE is:
\begin{equation}
MSE = \frac{1}{n}{\sum}( y - \hat{y} )^2
\end{equation}

- The square of the difference between actual and predicted $( y - \hat{y} )^2$

Because we are squaring the difference, the MSE will almost always be bigger than MAE. For this reason, we cannot directly compare the MAE to the MSE. We can only compare our model's error metrics to those of a competing model. The effect of the square term in the MSE equation is most apparent with the presence of outliers in our data. While each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE. This ultimately means that outliers in our data will contribute to a much higher total error in the MSE than they would in the MAE. Similarly, our model will be penalized more for making predictions that differ greatly from the corresponding actual value. This is to say that large differences between actual and predicted are punished more in MSE than in MAE. The following picture graphically demonstrates what an individual residual in the MSE might look like.

Outliers will produce quadratically larger differences, and it is our job to judge how we should approach them.


## The problem of outliers

Outliers in our data are a constant source of discussion for data scientists who try to create models. Do we include the outliers in our model creation or do we ignore them? The answer to this question depends on the field of study and the data set on hand. We would want to use the MSE to ensure that the model takes these outliers into account more.

## MSE and RMSE

Another error metric we have discussed in the ConOps is the root mean squared error (RMSE). As the name suggests, it is the square root of the MSE. Because the MSE is squared, its units do not match those of the original output. Researchers will often use RMSE to convert the error metric back into similar units, making interpretation easier. Since the MSE and RMSE both square the residual, they are similarly affected by outliers. The RMSE is analogous to the standard deviation (MSE to variance) and is a measure of how widely your residuals are spread out. Both MAE and MSE can range from 0 to positive infinity, so as both of these measures get higher, it becomes harder to interpret how well your model is performing. Another way we can summarize our collection of residuals is by using percentages so that each prediction is scaled against the value it’s supposed to estimate.

## Root mean squared error (RMSE)
RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average squared differences between prediction and actual observation. 

$RMSE = \sqrt{\frac{1}{n}{\sum}( y - \hat{y} )^2}$
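
To make the effect of squaring concrete, here is a small sketch that computes MAE, MSE and RMSE (taken, as described in the text, as the square root of the MSE) with and without a single outlier. The helper names and the data are hypothetical.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    # Square root of the MSE, which brings the units back to those of y.
    return np.sqrt(mse(y, y_hat))

y = np.array([10.0, 12.0, 11.0, 13.0])
y_hat = np.array([11.0, 11.0, 12.0, 12.0])          # every residual has magnitude 1
y_hat_outlier = np.array([11.0, 11.0, 12.0, 3.0])   # last residual is 10

print(mae(y, y_hat), mse(y, y_hat), rmse(y, y_hat))
print(mae(y, y_hat_outlier), mse(y, y_hat_outlier), rmse(y, y_hat_outlier))
```

In this toy data the single outlier raises the MAE from 1 to 3.25 but raises the MSE from 1 to 25.75, which is the quadratic penalty described above.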
 
## Calculate MSE against our Model

Like MAE, we'll calculate the MSE for our Model. With MSE, we would expect it to be a much larger error than MAE due to the influence of outliers. We find that this is the case: the MSE is an order of magnitude higher than MAE. The corresponding RMSE would be a bit higher.

## Mean Absolute Percentage Error (MAPE)

The __mean absolute percentage error__ (MAPE) is the percentage equivalent of MAE. The equation looks just like that of MAE, but with adjustments to convert everything into percentages.

The formal equation for MAPE is:
\begin{equation}
MAPE = \frac{100\%}{n}{\sum}\left| \frac{y - \hat{y}}{y} \right|
\end{equation}

- Divide by the total number of data points and express as a percentage $\frac{100\%}{n}$
- Actual output value $y$
- Predicted output value $\hat{y}$
- Sum of the absolute value of the residual ${\sum}| \frac{y - \hat{y}}{y} |$
- Each residual is scaled against the actual value y
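
A minimal sketch of the MAPE formula, again with made-up numbers (the arrays `y` and `y_hat` are hypothetical). Note that the division by `y` means this calculation is undefined whenever an actual value is zero.

```python
import numpy as np

# Hypothetical actual values and predictions; no actual value is zero.
y = np.array([100.0, 200.0, 50.0])
y_hat = np.array([110.0, 180.0, 50.0])

# Each residual is scaled by the actual value before averaging.
mape = 100.0 * np.mean(np.abs((y - y_hat) / y))
print(f"MAPE = {mape:.2f}%")
```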


## What is Probability?
Definition: the likelihood of an event occurring

Definition  

- $A \rightarrow \text{event}$
- $P(A) \rightarrow \text{probability}$

- $ P(A) = \frac{\text{preferred}}{\text{all}}$
- preferred means outcomes we want to have happen (favorable)
- all means entire sample space

Let's take a coin flip as an example

If you want HEADS as your outcome and there are only two sides of a coin, then you can represent it as:

- $ P(A) = \frac{1}{2} = 0.5$

Now imagine you have a six-sided die. We could write it as:

- $ P(A) = \frac{1}{6} = 0.167$  
- you can set A to any of the 6 sides of the die to get your probability

What if you wanted to roll a side of the die that was divisible by 3? You would now have 2 preferred outcomes (3 and 6) out of the 6 possible outcomes. So,

- $ P(A) = \frac{2}{6} = 0.33$ 

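Note that the probability of two independent events both occurring is the product of their individual probabilities.

- $ P(A \text{ and } B) = P(A) \cdot P(B)$

Let's use cards to enhance your understanding

- $P(\text{Ace of Spades}) = P(\text{Ace}) \cdot P(\text{Spade}) = \frac{1}{13} \cdot \frac{1}{4} = \frac{1}{52}$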

Let's take winning the lottery: you buy one ticket and there are 100 million tickets sold.

- $P(A) = \frac{1}{100{,}000{,}000}$

## Expected Values

Expected Values Definition: the average outcome we expect if we run an experiment many times.

Imagine we don't know the probability of getting heads. We would flip a coin several times and record the outcomes. Each flip with its recorded outcome is called a trial. When we do multiple trials we call that an experiment.

- trial = a single coin flip and its recorded outcome
- experiment = several trials

So, if we toss a coin 20 times and record 20 outcomes, that is called a single experiment with 20 trials. We'll do the same in machine learning: we'll conduct experiments made up of multiple trials.

The probabilities we come up with during these experiments are called Experimental Probabilities.

Basically, if we don't know the theoretical (true) probabilities, we conduct experiments to create experimental probabilities to use in our applications.

Take, for instance, going to the grocery store. If I record my visits and find that I have to wait in line on 8 out of 10 trips, then I can say, as a good approximation, that 80% of the time I have to wait in line at the grocery store.

## Experimental Probabilities
Experimental Probabilities are easy to compute

- $P(A) = \frac{\text{successful trials}}{\text{all trials}}$
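
The ratio above can be sketched as a small simulation. This is only an illustration: it assumes a fair coin simulated with Python's `random` module, and the seed and the number of trials are arbitrary choices.

```python
import random

random.seed(0)        # fixed seed so the experiment is repeatable
n_trials = 10_000     # one experiment made up of 10,000 trials (coin flips)

# Count a "successful" trial whenever the simulated flip comes up heads.
heads = sum(random.random() < 0.5 for _ in range(n_trials))

p_experimental = heads / n_trials   # successful trials / all trials
print(p_experimental)
```

With this many trials the experimental probability should land close to the theoretical value of 0.5, but it will rarely match it exactly.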


## Expected Values

$E(A) \rightarrow$ the outcome we expect to occur when we run an experiment

### Categorical Outcomes

- $ A \rightarrow \text{SPADE} $: the probability of drawing a specific suit is $ P(A) = \frac{1}{4} $

If we ran an experiment of 20 selection trials, then

- $ E(A) = P(A) \cdot n = 0.25 \cdot 20 = 5 $

So, we would expect to select a SPADE 5 times out of the 20 times we selected. There is no guarantee that we'll get a SPADE exactly 5 times during that experiment.

### Numerical Outcomes

We use a slightly different formula

Sample Space = {A,B,C}

- $E(X) = A \cdot P(A) + B \cdot P(B) + C \cdot P(C)$

to get the expected value

Say we're shooting arrows at a target and after a few trials we get our probabilities,

- $A = 10, P(A) = 0.5$
- $B = 20, P(B) = 0.4$
- $C = 100, P(C) = 0.1$

so, $E(X) = (0.5)10 + (0.4)20 + (0.1)100 = 23$ 
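
The same weighted sum can be written in a few lines of Python. The dictionary name `outcomes` is just an illustrative choice; the values and probabilities are the ones from the arrow example.

```python
import math

# Map each numerical outcome to its probability (from the arrow example).
outcomes = {10: 0.5, 20: 0.4, 100: 0.1}

# The probabilities of a full sample space should add up to 1.
assert math.isclose(sum(outcomes.values()), 1.0)

# Expected value: each outcome weighted by its probability.
expected_value = sum(value * p for value, p in outcomes.items())
print(expected_value)
```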

## Probability Frequency Distribution

Notice below that when we add our two dice vectors, 7 shows up 6 times in the matrix,

so,

- $ P(7) = \frac{6}{36} $ or just $\frac{1}{6}$


In [3]:
# Load library
import numpy as np
#create a vector as a row (one die)
vector_row = np.array([1,2,3,4,5,6])



#create a vector as a column (second die)
vector_column = np.array([[1],
                          [2],
                          [3],
                          [4],
                          [5],
                          [6]])

<p style="text-align: left; font-size:100px;">
  &#9856;&#9857;&#9858;&#9859;&#9860;&#9861;
</p>
<p style="text-align: left; font-size:100px;">
  &#9856;
  &#9857;
  &#9858;
  &#9859;
  &#9860;
  &#9861;
</p>



&#9857;&#9858;&#9859;&#9860;&#9861;



<p style="text-align: left; font-size:100px;">
  &#9856;
</p>
<p style="text-align: left; font-size:100px;">
  &#9857;
</p>
<p style="text-align: left; font-size:100px;">
  &#9858;
</p>
<p style="text-align: left; font-size:100px;">
  &#9859;
</p>
<p style="text-align: left; font-size:100px;">
  &#9860;
</p>
<p style="text-align: left; font-size:100px;">
  &#9861;
</p>

<h1>
   <span style="display: inline-block; vertical-align: middle">DICE</span>
   <span class="dice" style="display: inline-block; vertical-align: left">&#9857;&#9858;&#9859;&#9860;&#9861;</span>
</h1>

In [2]:
matrix = vector_row + vector_column
matrix

array([[ 2,  3,  4,  5,  6,  7],
       [ 3,  4,  5,  6,  7,  8],
       [ 4,  5,  6,  7,  8,  9],
       [ 5,  6,  7,  8,  9, 10],
       [ 6,  7,  8,  9, 10, 11],
       [ 7,  8,  9, 10, 11, 12]])

The probability frequency distribution is a collection of the probabilities for each possible outcome. This is how we know that 7 is the most probable sum of the two dice. This is usually expressed as a graph or matrix as in the above example.

We next create a sum table of all the numbers from our matrix

sum_table = (2,3,4,5,6,7,8,9,10,11,12)

Then we create a Frequency Table from the matrix

freq_table = (1,2,3,4,5,6,5,4,3,2,1)

Next we compute the probability associated with each sum from 2 to 12

prob_table = (1/36, 1/18, 1/12, 1/9, 5/36, 1/6, 5/36, 1/9, 1/12, 1/18, 1/36)

We simply divide each frequency by the size of the sample space (36)

So the prob_table is called the __"Probability Frequency Distribution"__
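
The three tables can also be derived programmatically rather than typed by hand. The sketch below rebuilds the 6x6 matrix of sums with NumPy broadcasting and counts the frequencies; the names `freq` and `prob` are illustrative, and `Fraction` is used only to keep the probabilities exact.

```python
from collections import Counter
from fractions import Fraction

import numpy as np

die = np.arange(1, 7)
sums = die[:, None] + die[None, :]     # 6x6 table of sums via broadcasting

# Frequency of each sum, then frequency divided by the sample-space size (36).
freq = Counter(sums.ravel().tolist())
prob = {total: Fraction(count, sums.size) for total, count in sorted(freq.items())}

for total, p in prob.items():
    print(total, freq[total], p)
```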


In [4]:
import pandas as pd
df = pd.DataFrame(matrix)
df

Unnamed: 0,0,1,2,3,4,5
0,2,3,4,5,6,7
1,3,4,5,6,7,8
2,4,5,6,7,8,9
3,5,6,7,8,9,10
4,6,7,8,9,10,11
5,7,8,9,10,11,12


In [5]:
df.columns

RangeIndex(start=0, stop=6, step=1)

In [6]:
# change your index and column to start with 1 instead of zero 
df.columns += 1

In [7]:
df.columns

RangeIndex(start=1, stop=7, step=1)

In [8]:
df

Unnamed: 0,1,2,3,4,5,6
0,2,3,4,5,6,7
1,3,4,5,6,7,8
2,4,5,6,7,8,9
3,5,6,7,8,9,10
4,6,7,8,9,10,11
5,7,8,9,10,11,12


In [9]:
df.index

RangeIndex(start=0, stop=6, step=1)

In [10]:
df.index += 1

In [11]:
df

Unnamed: 0,1,2,3,4,5,6
1,2,3,4,5,6,7
2,3,4,5,6,7,8
3,4,5,6,7,8,9
4,5,6,7,8,9,10
5,6,7,8,9,10,11
6,7,8,9,10,11,12


## Complements

The complement of an event is everything that an event is not.

If we toss a coin and account for all possible outcomes, with A being heads and B being tails, we have exhausted the sample space and have

P(A) + P(B) = 1

We are 100% certain we will get either heads or tails.

All events have a complement, and we denote it with an apostrophe, such as A'

Also, (A')' equals A

Rolling a die: let A be the event of rolling a 3. Its complement A' is rolling anything else:

- $P(A') = P(1) + P(2) + P(4) + P(5) + P(6) = $
- $ \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} =\frac{5}{6}$

You can also say "The probability of not getting a 3 is", 

- $ P(A') = 1 - P(A) = 1 - \frac{1}{6} $
- $ = \frac{5}{6} $


## Fundamentals of Combinatorics

Permutations are the number of different ways we can arrange a set of elements; these can be digits, letters, objects or even people.

The number of permutations of 3 elements is written $P_3$

For n-many elements we have:

- n options for the first slot
- n-1 options for the second slot
- n-2 options for the third slot

The further down the rankings we go, the more options we've exhausted and the fewer options we have left.

- $ P_n = n(n-1)(n-2)\dots1 = n! $
or n factorial


## Variations

Variations are the total number of ways you can pick and arrange some elements of a given set

Say you had a two-letter code where each letter can be one of three options: A, B or C.

There are 3 options for each of the 2 positions, so the total number of variations is 3 x 3 = 9

$\bar{V}^n_p = n^p$

where n is the total number of elements we have available and p is the number of positions we need to fill. The number of variations with repetition when picking p-many elements out of n elements is equal to n to the power of p.

so if n= 3 and p = 2 then,

$\bar{V}^3_2 = 3^2 = 9$

Nine different variations of 2-letter passcodes consisting of A, B, or C only

What if we could use any of the 26 letters? Then,

$\bar{V}^{26}_2 = 26^2 = 676$
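
The count $n^p$ can be checked by listing the codes explicitly. This sketch uses `itertools.product` from the standard library; the variable names are illustrative.

```python
from itertools import product
from string import ascii_uppercase

# All 2-letter codes over A, B, C: repetition allowed, order matters.
codes = ["".join(pair) for pair in product("ABC", repeat=2)]
print(codes)
print(len(codes))   # 3**2

# With the full alphabet we only count the codes instead of listing them.
n_codes_26 = sum(1 for _ in product(ascii_uppercase, repeat=2))
print(n_codes_26)   # 26**2
```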

## Variations without Repetitions


Say we have a team of 5 and there are only 4 posts to fill.

- For the first post we can choose any of the 5 people.
- For the second post we can choose from the 4 people left.
- For the third post we can choose from the 3 people left.
- For the last post we can choose from the 2 people left, and one person is left over.

The further down the order we go the fewer options we are left with.

5, 4, 3, 2, etc. This is what variations without repetition are about.

So here we have 5 x 4 x 3 x 2 = 120 different options for how we arrange the positions of our team.


$ V^n_p = \frac{n!}{(n-p)!} = \frac{5!}{1!} = 5! = 120$

where p = 4 and n = 5.

This is the number of variations without repetition when arranging p elements out of a total of n.
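
As a quick check of the team example, the sketch below compares the factorial formula $\frac{n!}{(n-p)!}$ with a brute-force enumeration. The team member names are made up for illustration.

```python
import math
from itertools import permutations

team = ["Ann", "Ben", "Cam", "Dee", "Eli"]   # hypothetical team of 5
n, p = len(team), 4                          # 4 posts to fill

by_formula = math.factorial(n) // math.factorial(n - p)

# itertools.permutations(team, p) yields every ordered pick of p distinct people.
by_enumeration = sum(1 for _ in permutations(team, p))

print(by_formula, by_enumeration)   # 120 120
```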


## Combinations

Combinations are all the different ways you can pick elements when the order does not matter.

As you may have noticed above, variations count the same group of elements more than once, one time for each ordering (double counting).

All the different permutations of a single combination are different variations.

Say we pick 3 people out of 10, where the order we pick them in is not relevant.

Any of the six permutations of those 3 people is a different variation but NOT a different combination.

- $ P_3= 3! = 6$ Permutations
- $ C^{10}_3 = 120$ Combinations
- $ V^{10}_3 = 720$ Variations


What's the number of combinations for choosing p-many elements out of a sample space of n elements?
The number of combinations equals the number of variations, over the number of permutations.

$ C^{n}_{p} = \frac{V^{n}_{p}}{P_p}$

$ C^{n}_{p} = \frac{n!}{p!(n-p)!}$

if n = 10 and p = 3 then,

$ C^{10}_{3} = \frac{10!}{3!7!} = \frac{8 \times 9 \times 10}{1 \times 2 \times 3} = \frac{720}{6} = 120$


Let's look at another example where we have to select 4 out of 10 people to go to a conference.

$ C^{10}_{4} = \frac{10!}{4!6!} = \frac{7 \times 8 \times 9 \times 10}{1 \times 2 \times 3 \times 4} = \frac{5040}{24} = 210$

Another example: imagine you're on a trip and the bakery you go to offers variety packs in different sizes, where you can choose 3, 5 or 8 macaroons. The only requirement is that they must all be different flavors.

How many different 3-macaroon packs can you get, considering 8 distinct flavors?

$ C^{8}_{3} = \frac{8!}{3!5!} = \frac{40320}{720} = 56$

Now imagine you want to get the medium pack, which contains 5 macaroons instead of 3. How many different possible packs can you make?

$ C^{8}_{5} = \frac{8!}{5!3!} = 56$ 

Now imagine you want to get the large pack of 8. Note that 0! = 1.

$ C^{8}_{8} = \frac{8!}{8!\,0!} = 1$ because we only have 8 distinct flavors.

Now imagine you have 6 different fruits with room for only 4 in your basket.

$ C^{6}_{4} = 15$ possible choices. If you increased your basket size to six, how many different fruit combinations could you fit? Just 1, so once we pick more than half of the elements, picking more leads to fewer combinations.

In the general case, we can pick p-many elements in as many ways as we can pick n minus p many elements. thus,

$ C^{n}_{p} = C^{n}_{(n-p)}$ 
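
The relationship "combinations equal variations over permutations" and the symmetry property can both be checked numerically. In this sketch the helper names `variations` and `combinations` are our own; Python's built-in `math.comb` computes the same quantity directly.

```python
import math

def variations(n, p):
    """Ordered picks of p out of n without repetition: n! / (n - p)!"""
    return math.factorial(n) // math.factorial(n - p)

def combinations(n, p):
    """Unordered picks: variations divided by the p! orderings of each pick."""
    return variations(n, p) // math.factorial(p)

print(combinations(10, 3), combinations(10, 4))   # 120 210
print(combinations(8, 3), combinations(8, 5))     # 56 56 (symmetry)
print(combinations(6, 4), combinations(6, 6))     # 15 1
```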


## Symmetry

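$ p > \frac{n}{2} > n - p $

When p is more than half of n, n - p is less than half of n, so we can apply symmetry and compute $ C^{n}_{n-p} $ instead of $ C^{n}_{p} $. This simplifies the calculation and avoids computing factorials of large numbers.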


New types of combinations

A combination can be a mixture of different smaller individual events

Take, for instance, the place you go to lunch, and think about all the combinations of lunch specials you could try if they had 3 types of sandwiches, 2 types of sides and 2 types of drinks.

Think about each part of the menu as a separate position,

so 3 x 2 x 2 = 12 different lunch menus

Now let's think about online marketing ad combinations where we have

- 3 post descriptions
- 5 thumbnails
- 3 headings
- 2 buttons

How many different ad combinations would you have to try before you tried all possibilities?

3 x 5 x 3 x 2 = 90 different ads

This helps to determine the appropriate amount of time it would take for such a task to be completed. It can also help to remove several of the options to decrease the workload.

The way to calculate these is simply to multiply the number of options available for each individual event: a x b x c x d ...

## Independent Events

The likelihood of two independent events occurring simultaneously equals the product of their individual probabilities.

In the case of Powerball, there are two events:

1. event to select the correct 5 numbers
2. event to select the correct Powerball number

Let's start with selecting the Powerball number, which has only 1 favorable outcome

- $P(\text{Powerball}) = \frac{1}{26}$ 

Next we'll select the 5 numbers

Order does not matter and we cannot have the same value twice. This means we're dealing with combinations without repetition.

So let's apply the relevant formula:

$ C^{n}_{p} = \frac{n!}{p!(n-p)!}$ 

where n = 69 and p = 5

$ C^{69}_{5} = \frac{69!}{5!(69-5)!}$ 

$ = \frac{69!}{5!64!} = 11{,}238{,}513 \approx 11{,}000{,}000 $

so,

$P(\text{5 numbers}) = \frac{1}{11{,}238{,}513} < \frac{1}{11{,}000{,}000}$

Next we would have to correctly guess the Powerball number, which makes winning 26 times less likely. That gives $11{,}238{,}513 \times 26 = 292{,}201{,}338$ outcomes, or roughly 300,000,000. So to get the probability of winning we need two pieces of information.

1. number of favorable outcomes
2. number of all possible outcomes

We've already created the number of all possible outcomes.

So for the number of favorable outcomes those will be equal to the number of tickets we buy.

If we buy one ticket, then

$ P(\text{lottery}) = \frac{1}{292{,}201{,}338} \approx \frac{1}{300{,}000{,}000} $
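
The exact counts behind these rounded figures can be computed with the standard library. The sketch assumes the format described above (5 numbers out of 69, plus 1 Powerball number out of 26); the variable names are illustrative.

```python
import math
from fractions import Fraction

five_number_combos = math.comb(69, 5)       # order does not matter, no repeats
total_outcomes = five_number_combos * 26    # independent Powerball pick

p_one_ticket = Fraction(1, total_outcomes)  # one favorable outcome per ticket

print(five_number_combos)    # 11238513
print(total_outcomes)        # 292201338
print(float(p_one_ticket))
```

Buying k different tickets would simply change the numerator from 1 to k.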


## Summary

We use permutations and variations when we must arrange a set of objects and the order is crucial.

The main difference between permutations and variations is that with permutations you always arrange the entire set of elements in the sample space. For instance, for a relay race with 4 runners and 4 positions we rely on permutations. If we had 6 runners with only 4 spots to fill, then we would use variations. If we were only dealing with which 4 out of the 6 team members made the team, then we would be dealing with combinations. In that instance we don't care about order.

There are also two types of Variations and Combinations

### No repetition

There is a clear relationship between permutations (P), variations (V) and combinations (C)

$ C = \frac{V}{P} $

This is because we count all the permutations of a given set of numbers as a single combination, but as separate variations.

Here we use the formulas:

 $ P_n = n! $
 
 $ V^{n}_{p} = \frac{n!}{(n-p)!} $
 
 $ C^{n}_{p} = \frac{n!}{p!(n-p)!} $

### With repetition

 $\bar{V}^{n}_{p} = n^p $

 $\bar{C}^{n}_{p} = \frac{(n+p-1)!}{p!(n-1)!}$


### Symmetry

Combinations are symmetric

$ C^{n}_{p} = C^{n}_{n-p} $

because we can reverse the problem and choose the elements to be omitted.



## Bayesian Inference

### Sets

Event => Set of favorable outcomes

for instance we want even numbers 2,4,6,8...

Values of a set don't always have to be numerical such as,

Event -> being one of the 50 United States

We use UPPER-CASE to denote the set and lower-case to denote the elements

$ A = \{2, 4, 6, 8\} $

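Any set can be either empty or have values in it. When a set has no values we call it the empty set or the null set, written $\{\}$, $\varnothing$ or $\emptyset$. A non-empty set S can therefore be described in any of these ways:

$S \not= \{\}$ 

$S \not= \varnothing$ 

$S \not= \emptyset$ 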

When a set is non-empty it can be either finite or infinite.

We write that x is an element of the set A as

$ x \in A $

which reads "x is an element of set A",

or we can say x is not an element of A, $ x \notin A $

or A contains x, or A owns x, which we show as $ A \owns x $

or we can say A does not contain or own x, $ A \not\ni x $

Next we represent generalized statements about multiple elements using a variety of symbols

for instance, we use, $ \forall $ for all or any

$ \forall x $ for all elements x

so if we want to write a mathematical statement to describe for all x in A we write, $ \forall x \in A $

We also use the colon which can be incredibly useful when we want to make statements about a specific group of elements within a set. $ \forall : $

So to describe all even numbers in a set we would write, $ \forall x \in A : x \text{ is even} $

The above states: for all x in A, such that, x is even


### Subset

A subset is a set that is fully contained in another set.

If every element of A is also an element of B, then A is a subset of B. We denote that as $ A \subseteq B $

Note that every non-empty set has at least 2 subsets: itself and the null set.

$ A \subseteq A $ and $\emptyset \subseteq A $

if an outcome is NOT part of a SET, it CANNOT be part of any of its SUBSETS

An outcome NOT being part of SOME subset, does NOT EXCLUDE it from the entirety of the greater set.


### Intersections

The intersection includes all the outcomes that are favorable for both event A and event B simultaneously. We denote it by "A intersect B" or $ A \cap B $

Examples of intersection

The intersection of all hearts and all diamonds is the EMPTY SET, so there are no outcomes which satisfy both events simultaneously, and we would write: $ A \cap B = \emptyset$

The intersection of all diamonds and all queens contains only the queen of diamonds, the one card that is a queen and a diamond at the same time. We would write this as $ A \cap B = Q_{diamond}$

For all red cards and all diamonds, any diamond is simultaneously RED and a Diamond, so the intersection is the set of all Diamonds.

We use intersections to denote instances where both events A and B happen simultaneously.


### Unions

What if we only require one of them to occur? It's the same as asking either A or B to happen.

A union is the combination of all outcomes preferred for either A or B, written $ A \cup B $

If the sets A and B do not intersect at all, so that $ A \cap B = \emptyset $, their union is simply their sum: $ A \cup B = A + B $

The union of hearts and diamonds is all red cards, since no single card can have multiple suits.

In general, the union of two sets is given by $ A \cup B = A + B - A \cap B$, because if we were to just add the two sets we would be double-counting every element that is part of the intersection.

Union - Intersection relationship
$ A \cup B = A + B - A \cap B$
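
Python's built-in `set` type supports these operations directly, which makes the card examples easy to check by counting elements. The card labels below (for example `"QH"` for the queen of hearts) are an illustrative encoding, not something defined in this notebook.

```python
ranks = "A23456789TJQK"
hearts = {r + "H" for r in ranks}
diamonds = {r + "D" for r in ranks}
queens = {"Q" + s for s in "SHDC"}
red = hearts | diamonds                      # union of two disjoint sets

print(hearts & diamonds)                     # set(): mutually exclusive
print(len(red))                              # 13 + 13 = 26
print(red & diamonds == diamonds)            # True: every diamond is red

# Union - intersection relationship, in terms of element counts.
print(len(red | queens))                     # 28
print(len(red) + len(queens) - len(red & queens))   # 26 + 4 - 2 = 28
```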

### Mutually Exclusive

Mutually exclusive sets have the empty set as their intersection.

If the intersection of any number of sets is the empty set, then they must be mutually exclusive.

if A and B are mutually exclusive then $ A \cup B = A + B $

### Complements

Sets have complements too

All values that are part of the sample space, but not part of the set

Imagine the set of all odd numbers; its complement would be the set of all even numbers.

Complements are ALWAYS mutually exclusive, but NOT all mutually exclusive sets are complements

### Independent and Dependent Events


Independent Events = The theoretical probability remains unaffected by other events

When you flip a coin you get a 50% chance of getting tails each time you flip, with no regard to the last flip; thus the flips are independent events.

Dependent Events = Probabilities of dependent events vary as conditions change

$P(Q_{spades}) = \frac{1}{52}$ where 1 is our favorable outcome, 52 our element space

Imagine now that we know the card we drew is a spade; then,

$P(Q_{spades}) = \frac{1}{13}$ the cards are now all spades in the element space

Now imagine that, instead of a spade, we know our card is a queen. Now our sample/element space is 4.

$P(Q_{spades}) = \frac{1}{4}$

The insight here is that the probability of an event changes depending on the information we have.

Two events: A and B 

The probability of getting A, if we are given that B has occurred -> $P(A|B)$ means "A given B"  

Back to the card example A --> $Q_{spades}$    and    B --> $spades$


$P(A|B)= \frac{1}{13}$ --> The probability of drawing the Queen of Spades if we know the card is a spade

Similarly, if we have A --> $Q_{spades}$    and    C --> $Queens$

$P(A|C)= \frac{1}{4}$

We call this conditional probability and we use this to distinguish dependent from independent events.

Two Coin Flips 

Knowing the result of flip B tells us nothing about flip A, so $P(A) = P(A|B)$ and the two events are independent.

If any two events are independent then,

$P(A \cap B) = P(A) \times P(B)$

Let's now discuss our Queen of Spades example again where,

$ A = Q_{spades} $

$ B = spades $

$ C = queens $

Normally the probability of drawing a queen of spades is $ P(A) = \frac {1}{52} $

However, it increases if we know it's a spade $ P(A|B) = \frac {1}{13} $ 

We can now say that the two events A and B are dependent. Similarly,

$ P(A|C) \neq P(A) $ so we can also say that A and C are dependent.

Let's formalize this relationship with a formula:

\begin{equation}
 P(A|B) = \frac {P(A \cap B)}{P(B)} \quad \text{if }  P(B) > 0
\end{equation}

which has many similarities to our:

\begin{equation}
 P(A|B) = \frac {\text{favorable}}{\text{all}}  
\end{equation}

Example: in a college's class of 2018, 5% of economics majors graduated with honors:

$ P(H|E) = \frac{4}{80} = 0.05 $

According to the conditional probability formula, P(A|B) = P(Intersection) / P(B), rather than P(A|B) = P(B) + P(Intersection).

For example, if P(Intersection) = 0.15 and P(B) = 0.6, the formula gives P(A|B) = 0.15/0.6 = 0.25.
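
For equally likely outcomes, the conditional probability formula reduces to counting: the size of the intersection divided by the size of B. The sketch below applies that to the Queen of Spades example; the helper `conditional` and the tuple encoding of cards are illustrative choices.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))            # 52 equally likely cards

def conditional(a, b):
    """P(A|B) = |A and B| / |B|, valid when all outcomes are equally likely."""
    if not b:
        raise ValueError("P(B) must be greater than 0")
    return Fraction(len(a & b), len(b))

queen_of_spades = {("Q", "spades")}
spades = {card for card in deck if card[1] == "spades"}
queens = {card for card in deck if card[0] == "Q"}

print(conditional(queen_of_spades, deck))     # 1/52
print(conditional(queen_of_spades, spades))   # 1/13
print(conditional(queen_of_spades, queens))   # 1/4
```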

### Law of Total Probability

$ A = B_1\cup B_2 \dots \cup B_n $

Survey/trial

- A = outcome
- $B_1$ = coefficient 1
- $B_2$ = coefficient 2

$ P(A) = P(A|B_1) \times P(B_1)+P(A|B_2) \times P(B_2) + \dots $

### Additive Law

Recall the Union - Intersection relationship
$ A \cup B = A + B - A \cap B$

Calculate $P(B \cup A)$, given $P(A) = 0.75$, $P(B)= 0.6$ and $P(B \cap A) = 0.55$

$ A \cup B = A + B - A \cap B$

$ P(B \cup A) = .75 + .6 - .55 = .8 $

According to the additive law, P(Union) = P(A) + P(B) - P(Intersection). Plugging in the values, gives us P(Union) = 0.75 + 0.6 - 0.55 = 0.8.

We can also rearrange the law to find the intersection:

$ A \cap B = A + B - A \cup B$

Given $ P(A \cup B) = .90$, $P(A) = .65$ and $P(B) = .44 $,

$ P(A \cap B) = (.65 + .44) - .90 = .19 $


### Multiplication Rule

$ P(A|B) \times P(B) = P(A \cap B) $  is the multiplication rule

Example:

$P(B) = .5$

$P(A|B) = .8$

Event A also appears in 80 percent of the 50 percent of cases in which B occurred, so

$P(A \cap B) = 0.8 \times 0.5=0.4$


What if we drew a card looking for a spade as an outcome, but we didn't draw a spade? We then shuffle the deck without that one card, so now we have 51 cards, and we need to rewrite our equation. Here B is not drawing a spade on the first draw and A is drawing a spade on the second draw.

$ P(A|B) = \frac{\text{favorable}}{\text{all}} = \frac{13}{51} = 0.255 $

So far: $ P(B) = \frac{39}{52} = 0.75 $

and $ P(A|B) = 0.255 $

Now: what is the probability of drawing a spade on the second draw and not drawing a spade on the first draw? (We need to apply the multiplication rule.)

$P(A \cap B) = P(A|B) \times P(B)$

$P(A \cap B) = 0.255 \times 0.75 = .191$ for the intersection of A and B
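
The same two-draw calculation can be done with exact fractions to avoid rounding along the way. This sketch assumes a standard 52-card deck with 13 spades; the variable names are illustrative.

```python
from fractions import Fraction

p_b = Fraction(39, 52)            # first draw is not a spade
p_a_given_b = Fraction(13, 51)    # second draw is a spade, 51 cards remain

# Multiplication rule: P(A and B) = P(A|B) * P(B)
p_a_and_b = p_a_given_b * p_b

print(p_a_and_b)                  # 13/68
print(round(float(p_a_and_b), 3))
```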

### Bayes' Rule = Bayes' Theorem = Bayes' Law

Conditional Probability Formula = $P(A|B) = \frac{P(A \cap B)}{P(B)}$

Multiplication Rule = $P(A \cap B) = P(B|A) \times P(A) $


Bayes' Theorem = 

\begin{equation}
P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}
\end{equation}

Bayes' Rule in Real-Life

out of 200 successful candidates

experience(EXP) vs. good grades(A+)

- $P(EXP) = 45\%$
- $P(A+) = 60\%$
- $P(A+|EXP) = 50\%$
- $P(EXP|A+) = ?$

If we use Bayes' rule we get:

$P(EXP|A+) = \frac{P(A+|EXP) \times P(EXP)}{P(A+)} = \frac{0.5 \times 0.45}{0.6} = 0.375$

so,

$P(EXP|A+) = 0.375  < P(A+|EXP) = 0.5$

This shows that a candidate with experience is more likely to also have high grades than a candidate with high grades is to have the right experience. Also, candidates who had internships are more likely to also have a high GPA. So, the ideal candidate is someone who has experience, rather than somebody who thrived academically. 

- $P(B|A) = 0.6$
- $P(A) = 0.4$
- $P(B) = 0.3$
- $P(A|B) = ?$

$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} = \frac{0.6 \times 0.4}{0.3} = 0.8$
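
Bayes' theorem is a one-line function. The sketch below (the function name `bayes` and its argument names are our own) reproduces both worked examples, using the listed probabilities of 45%, 60% and 50% for the candidate example.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be greater than 0")
    return p_b_given_a * p_a / p_b

# Candidate example: A = experience (EXP), B = good grades (A+).
p_exp_given_grades = bayes(p_b_given_a=0.5, p_a=0.45, p_b=0.6)

# Second example: P(B|A) = 0.6, P(A) = 0.4, P(B) = 0.3.
p_a_given_b = bayes(0.6, 0.4, 0.3)

print(round(p_exp_given_grades, 3), round(p_a_given_b, 3))
```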


### Probability Distributions

A distribution shows the possible values a variable can take and how frequently they occur.

Uppercase Y represents the actual outcome of an event and lowercase y represents one of the possible outcomes.

One way to denote the likelihood of reaching a particular outcome y is $ P(Y = y) $; we can also express it as $p(y)$.

We call this the probability function.

mean --> $ \mu $

variance --> $ \sigma^{2} $


### Population vs Sample Data

- Population includes all the data
- Sample data includes only a subset of the data

so sample mean would look like: $ \bar{x} $ and sample variance would look like: $ s^2 $

### Standard Deviation

Standard deviation is the square root of the variance, $ \sqrt{\sigma^2} $

which we write as $ \sigma $ for a population and $ s $ for a sample.

Any value that falls within $ \mu - \sigma$ and $ \mu + \sigma $ is within one standard deviation of the mean.

The more the data is concentrated in the middle of the distribution, the more data falls in that interval; conversely, the less the data falls within the middle of the distribution, the more dispersed that data is.

There is a constant relationship between mean and variance:

$ \sigma^2 = E((Y - \mu)^2) = E(Y^2)-\mu^2 $
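
The identity can be verified on a small example. The sketch below assumes a fair six-sided die (our choice of example) and uses exact fractions so that both sides can be compared without rounding error.

```python
from fractions import Fraction

values = range(1, 7)          # faces of a fair die
p = Fraction(1, 6)            # each face is equally likely

mu = sum(v * p for v in values)                           # E(Y)
var_definition = sum((v - mu) ** 2 * p for v in values)   # E((Y - mu)^2)
var_shortcut = sum(v ** 2 * p for v in values) - mu ** 2  # E(Y^2) - mu^2

print(mu, var_definition, var_shortcut)   # 7/2 35/12 35/12
```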


### Uniform Distribution

$ U(a,b) $

$ X \sim U(3,7) $

This says that variable X follows a uniform distribution from 3 to 7.

Events that follow a uniform distribution are events where all outcomes have equal probability.

The main takeaway for a uniform distribution is that each outcome is equally likely, and both the mean and the variance are uninterpretable.

This means that the uniform distribution provides no predictive power whatsoever. Another distribution, the Bernoulli distribution, likewise offers little predictive power.
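As an illustration, the sketch below builds the probability function for $ U(3,7) $. It assumes the discrete case, where the outcomes are the integers 3 through 7; the text does not say whether the distribution is discrete or continuous, so treat that as our assumption. The function name `discrete_uniform_pmf` is hypothetical.

```python
from fractions import Fraction

def discrete_uniform_pmf(a, b):
    """PMF of a discrete uniform distribution on the integers a..b (inclusive)."""
    outcomes = range(a, b + 1)
    return {x: Fraction(1, len(outcomes)) for x in outcomes}

pmf = discrete_uniform_pmf(3, 7)
for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```

Every outcome gets the same probability, $ 1/5 $, which is why knowing the distribution does not help us predict the next value.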

### Bernoulli Distribution

$ \text{Bern}(p) $

$ X \sim \text{Bern}(p) $

A variable X follows a Bernoulli distribution with probability of success p.

It describes events with one trial and two outcomes, such as flipping a coin, answering a true/false question, or voting Democrat or Republican.

We assign one outcome the value 0 and the other the value 1.

Conventionally, the outcome assigned 1 has probability $ p $ and the outcome assigned 0 has probability $ 1 - p $, with the more likely outcome chosen as 1,

so, $ p > 1 - p $

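With $ x_0 = 0 $ and $ x_1 = 1 $, the mean is $ \mu = 0 \cdot (1 - p) + 1 \cdot p = p $, and the variance is:

$ \sigma^2 = (x_0 - \mu)^2 \cdot P(x_0) + (x_1 - \mu)^2 \cdot P(x_1) = (0 - p)^2 (1 - p) + (1 - p)^2 p = p(1 - p)(p + 1 - p) = p(1 - p) $

We provide the variance as $ \sigma^2 = p(1-p) $

or the standard deviation as $ \sigma = \sqrt{p(1-p)} $

but that brings us very little value...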
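The variance formula can be verified numerically from the definition. This is a minimal sketch; the helper `bernoulli_moments` and the example value $ p = 3/5 $ are our own choices for illustration.

```python
from fractions import Fraction

def bernoulli_moments(p):
    """Return (mean, variance) of Bern(p), computed from the definition."""
    pmf = {0: 1 - p, 1: p}  # outcome 0 has probability 1 - p, outcome 1 has p
    mu = sum(x * prob for x, prob in pmf.items())
    var = sum((x - mu) ** 2 * prob for x, prob in pmf.items())
    return mu, var

p = Fraction(3, 5)  # example success probability, chosen so that p > 1 - p
mu, var = bernoulli_moments(p)
print(mu, var, p * (1 - p))
```

The variance computed from the definition matches $ p(1-p) $, and the mean equals $ p $.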


### Binomial Distribution

A binomial distribution describes the number of successes in a sequence of identical, independent Bernoulli events.

We represent a binomial distribution by the letter B followed by the number of events and the probability of success for each one.

$ B(n,p) $

We read the following expression as: variable X follows a binomial distribution with 10 trials and a likelihood of 0.6 to succeed on each individual trial.

$ X \sim B(10, 0.6) $

A Bernoulli distribution is simply a binomial distribution with a single trial:

$ \text{Bern}(p) = B(1, p) $

Another way to look at the difference between Bernoulli and binomial is to consider a quiz with several true/false questions. Each question follows a Bernoulli distribution, whereas the number of correct answers on the entire quiz follows a binomial distribution.
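To see the binomial distribution in numbers, the sketch below evaluates the standard binomial probability function $ p(y) = \binom{n}{y} p^y (1-p)^{n-y} $ for $ B(10, 0.6) $. The formula is not stated above, and the function name `binomial_pmf` is our own; treat this as an illustration rather than the notebook's method.

```python
from math import comb

def binomial_pmf(y, n, p):
    """P(Y = y) for Y ~ B(n, p): comb(n, y) * p**y * (1 - p)**(n - y)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

n, p = 10, 0.6
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]

print(f"P(X = 6) = {pmf[6]:.4f}")
print(f"sum of all probabilities = {sum(pmf):.4f}")

# With a single trial the binomial reduces to Bernoulli: B(1, p) = Bern(p)
print(binomial_pmf(1, 1, p), binomial_pmf(0, 1, p))
```

With `n = 1` the function returns $ p $ for a success and $ 1 - p $ for a failure, which is exactly the Bernoulli case.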

In [2]:
from sympy import *
x = Symbol('x')

In [3]:
limit(sin(x)/x, x, 0)

1
