# Welcome to the Dark Art of Coding:
## Introduction to Data Science Fundamentals
Probability distributions and Combinations, Permutations, Factorials


<img src='images/logos.3.600.wide.png' height='250' width='300' style="float:right">

# Main objectives
---

You will be able to:

* Understand discrete probability distributions
* Understand permutations and combinations, including:
    * factorials
    * calculating the number of arrangements of items
    * examining permutations
    * examining combinations

# Discrete Probability Distributions
---

Often, it is useful to simply understand the probability of an event. But sometimes it just isn't enough.

For example, sometimes we need to know the consequences of a fairly likely event OR the results if an unlikely event occurs.

To start this conversation, let's consider a classic example: a three-wheel slot machine that costs \$0.20 to play. The symbols we care about are:

* A = Apple
* O = Orange
* Q = Quarter

|Combination|Payout|Gain/Loss|
|:---|:---|:---|
|A . A . A|\$4|\$3.80|
|A . A . O (in any order)|\$3|\$2.80|
|O . O . O|\$2|\$1.80|
|Q . Q . Q|\$1|\$0.80|
|All other combinations|\$0|-\$0.20|

The probability of getting any single symbol on one wheel is as follows:

* A = 0.1
* O = 0.2
* Q = 0.2
* all other symbols = 0.5

|Combination|Payout|Gain/Loss|Calculation|Probability|
|:---|:---|:---|:---|:---|
|A . A . A|\$4|\$3.80|$0.1 \times 0.1 \times 0.1 =$|0.001|
|A . A . O (in any order)|\$3|\$2.80|$(0.1 \times 0.1 \times 0.2) + (0.1 \times 0.2 \times 0.1) + (0.2 \times 0.1 \times 0.1) =$|0.006|
|O . O . O|\$2|\$1.80|$0.2 \times 0.2 \times 0.2 =$|0.008|
|Q . Q . Q|\$1|\$0.80|$0.2 \times 0.2 \times 0.2 =$|0.008|
|All other combinations|\$0|-\$0.20|$1 - 0.001 - 0.006 - 0.008 - 0.008 =$|0.977|
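As a cross-check on the table above, the following minimal sketch enumerates the distinct orderings of each winning combination and sums their probabilities. The names `symbol_probability` and `probability_any_order` are illustrative and not part of the original material, and `math.prod` assumes Python 3.8 or later.

```python
from itertools import permutations
from math import prod

# Probability of each named symbol appearing on a single wheel
symbol_probability = {'A': 0.1, 'O': 0.2, 'Q': 0.2}


def probability_any_order(symbols):
    '''Return the probability of seeing the given symbols in any order.'''
    # set() collapses orderings that look identical because a symbol repeats
    distinct_orderings = set(permutations(symbols))
    return sum(prod(symbol_probability[s] for s in ordering)
               for ordering in distinct_orderings)


winning = ['AAA', 'AAO', 'OOO', 'QQQ']
probabilities = {combo: probability_any_order(combo) for combo in winning}

# Everything that is not a winning combination
probabilities['other'] = 1 - sum(probabilities.values())

for combo, p in probabilities.items():
    print(combo, round(p, 3))
```

Only `'AAO'` has more than one distinct ordering, which is why it is the only row in the table that needs a sum of three products.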

If we treat each gain/loss as a result (labeled x), we can assign a probability to each specific result.

|Combination|P(X = x)|x|
|:---|:---|:---|
|A . A . A|0.001|\$3.80|
|A . A . O|0.006|\$2.80|
|O . O . O|0.008|\$1.80|
|Q . Q . Q|0.008|\$0.80|
|All other combinations|0.977|-\$0.20|

This means that the probability of winning \$3.80 on a single play is 1 in 1,000, while the probability of losing \$0.20 is 977 in 1,000.

### Expectation:

The question that often comes to mind is: on any given trial I might win or I might lose, but is there a way to average out my wins and losses? Yes: we can calculate the expected win or loss per trial over a large number of trials.

This is called **Expectation**. Expectation of a random variable **X** is much like the mean we explored earlier AND you use a similar process to calculate it:

Multiply the value **x** by the **probability of x** and sum the results:

$E(X) = \sum \ x \ P(X = x)$

**Note**: random variables, such as **X**, are usually identified with capital letters. The individual values that the variable can take are generally denoted with a lower case **x**.

**Note**: in later cells, we will refer to $E(X)$ as $\mu$



|Combination|P(X = x)|x|x $\times$ P(X = x)|
|:---|:---|:---|:---|
|A . A . A|0.001|\$3.80|0.0038|
|A . A . O|0.006|\$2.80|0.0168|
|O . O . O|0.008|\$1.80|0.0144|
|Q . Q . Q|0.008|\$0.80|0.0064|
|All other combinations|0.977|-\$0.20|-0.1954|
|||**$\mu$**|**-0.1540**|

Which means that, over many games, you can expect to **lose an average of about \$0.15** per game.

When we discussed means earlier, we also noted that it is often useful to understand how widely the results might deviate from the expected value. Two measures that we looked at for a typical dataset were the variance and the standard deviation.

### Variance

The variance of a random variable **X** is the expectation of $(X - \mu)^2$:

$Var(X) = E[(X - \mu)^2] = \sum (x - \mu)^2 \ P(X = x)$

First, subtract $\mu$ from each value **x**:

|Combination|P(X = x)|x|x $\times$ P(X = x)|x - $\mu$|
|:---|:---|:---|:---|:---|
|A . A . A|0.001|\$3.80|0.0038|3.8 - (-0.154) = 3.954|
|A . A . O|0.006|\$2.80|0.0168|2.8 - (-0.154) = 2.954|
|O . O . O|0.008|\$1.80|0.0144|1.8 - (-0.154) = 1.954|
|Q . Q . Q|0.008|\$0.80|0.0064|0.8 - (-0.154) = 0.954|
|All other combinations|0.977|-\$0.20|-0.1954|-0.2 - (-0.154) = -0.046|
|||**$\mu$**|**-0.1540**||

Then square each value of $x - \mu$, multiply it by the probability of that result, and sum the products:

|Combination|P(X = x)|x - $\mu$|(x - $\mu$)$^2$|(x - $\mu$)$^2$ $\times$ P(X = x)|
|:---|:---|:---|:---|:---|
|A . A . A|0.001|3.954|15.6341|0.001 * 15.6341 = 0.0156|
|A . A . O|0.006|2.954|8.7261|0.006 * 8.7261 = 0.0524|
|O . O . O|0.008|1.954|3.8181|0.008 * 3.8181 = 0.0305|
|Q . Q . Q|0.008|0.954|0.9101|0.008 * 0.9101 = 0.0073|
|All other combinations|0.977|-0.046|0.002116|0.977 * 0.002116 = 0.0021|
||||**Variance:**|0.1079|

### Standard Deviation

The standard deviation is simply the square root of the variance:
    
$\sigma = \sqrt {Var(X)}$   

The standard deviation is: 0.328

## Wait... how do we code all that...

There is a lot going on in this example. We leave it to you to create a snippet of code to calculate the Expectation, the Variance and Standard Deviation for a small number of input values.
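One possible starting point, offered as a minimal sketch rather than a definitive implementation: the snippet below applies the formulas for Expectation, Variance and Standard Deviation directly to the (x, P(X = x)) pairs of the slot machine. The names `distribution`, `expectation`, `variance` and `std_dev` are illustrative choices, not part of any library.

```python
from math import sqrt

# (gain or loss x, probability P(X = x)) pairs for the slot machine
distribution = [(3.80, 0.001), (2.80, 0.006), (1.80, 0.008),
                (0.80, 0.008), (-0.20, 0.977)]


def expectation(dist):
    '''E(X): sum of each value times its probability.'''
    return sum(x * p for x, p in dist)


def variance(dist):
    '''Var(X): expectation of the squared distance from the mean.'''
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist)


def std_dev(dist):
    '''Standard deviation: square root of the variance.'''
    return sqrt(variance(dist))


print('Expectation:       ', round(expectation(distribution), 4))
print('Variance:          ', round(variance(distribution), 4))
print('Standard deviation:', round(std_dev(distribution), 4))
```

Keeping the distribution as a list of pairs makes it easy to swap in a different set of values and probabilities, as long as the probabilities sum to 1.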

## scipy.stats field trip

Instead, let's take a quick look at a source of professionally developed tools that can help you... let's go on a field trip:

A useful resource to turn to is the collection of statistical methods found in the [scipy.stats library](https://docs.scipy.org/doc/scipy-0.19.1/reference/stats.html).


## More to study!

As you grow more familiar with the idea of discrete probability distributions, a concept you will want to follow-up on is the idea of linear transforms...

What does that mean?

Say the gains and losses change: we double or triple the cost, or we multiply the payouts by 3, 4 or 5. That will change our formula outputs and affect the mean (the Expectation) and thus the Variance and Standard Deviation. Understanding linear transforms will help you deal with this simply and easily, without having to recalculate all the values.
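The two standard identities behind this are $E(aX + b) = aE(X) + b$ and $Var(aX + b) = a^2 Var(X)$. As a rough illustration, the sketch below assumes a hypothetical change in which every payout is multiplied by 5 while the cost stays at \$0.20, so each gain/loss becomes $5(x + 0.20) - 0.20 = 5x + 0.80$. The helper names are illustrative only.

```python
from math import isclose

# (gain or loss x, probability P(X = x)) pairs for the original slot machine
distribution = [(3.80, 0.001), (2.80, 0.006), (1.80, 0.008),
                (0.80, 0.008), (-0.20, 0.977)]


def expectation(dist):
    return sum(x * p for x, p in dist)


def variance(dist):
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist)


# Hypothetical change: payouts are 5 times larger, cost is unchanged,
# so every gain/loss x becomes a * x + b
a, b = 5, 0.80
transformed = [(a * x + b, p) for x, p in distribution]

# The long way: recalculate from the transformed values
recalculated_mean = expectation(transformed)
recalculated_variance = variance(transformed)

# The shortcut: apply the linear transform identities to the original results
shortcut_mean = a * expectation(distribution) + b
shortcut_variance = a ** 2 * variance(distribution)

print(isclose(recalculated_mean, shortcut_mean))
print(isclose(recalculated_variance, shortcut_variance))
```

Note that the constant `b` shifts the mean but has no effect on the variance.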

Now... let's change topics for a bit...

# Combinations & Permutations with a sprinkling of Factorials

When examining a sample space, a primary concern is being able to accurately identify or calculate the number of arrangements of items. In the previous example, we glossed over this when we noted that every arrangement of Apple, Apple, Orange contributed to the probability of that event.

In the case of the Apple, Apple, Orange slot machine result, it was fairly easy to find a quick solution.

Other cases may not be so easy. Two difficulties that we run into are:

* dealing with duplicates
* arranging by individual items versus types of items

To help us with calculating the number of arrangements in a sample space, we will look at factorials, permutations and combinations (with and without replacement).

# Factorials

A **factorial** is the product of every whole number from `n` down to 1. By convention, $0! = 1$. Factorials are written in the following manner:

$\Large n!$

For example:

$\Large 2! = 2 \times 1 = 2$

$\Large 3! = 3 \times 2 \times 1 = 6$

$\Large 4! = 4 \times 3 \times 2 \times 1 = 24$

$\Large 10! = 10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 3628800$

Factorials present an easy way to write certain large numbers and are often used in statistics, especially in terms of calculating permutations and combinations.

A great attribute of factorials is that, when performing math with them, you can often simplify your work greatly if you need to do things by hand OR want to get a rough order of magnitude. In these examples, the factors shared by the numerator and the denominator cancel out:

$\Large \frac{5!}{3!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = \frac{5 \times 4}{1}  = 20$

$\Large \frac{5! \times 4!}{3! \times 2!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} \times \frac{4 \times 3 \times 2 \times 1}{2 \times 1} = \frac{5 \times 4}{1} \times \frac{4 \times 3}{1} = 20 \times 12 = 240$
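The same cancellation can be checked in code. This is a small sketch, assuming Python 3.8 or later for `math.prod`; the variable names are illustrative.

```python
import math

# 5! / 3!: the factors 3 x 2 x 1 cancel, leaving only 4 x 5
shortcut_1 = math.prod(range(4, 6))
full_1 = math.factorial(5) // math.factorial(3)

# (5! x 4!) / (3! x 2!): what remains is (4 x 5) x (3 x 4)
shortcut_2 = math.prod(range(4, 6)) * math.prod(range(3, 5))
full_2 = (math.factorial(5) * math.factorial(4)) // (math.factorial(3) * math.factorial(2))

print(shortcut_1, full_1)
print(shortcut_2, full_2)
```

For large values, multiplying only the factors that survive the cancellation avoids building the full factorials at all.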

In [None]:
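def factorial(n):
    '''Return the factorial of a non-negative integer.'''

    if n < 0:
        # Without this check, a negative input would silently return None
        raise ValueError('factorial() is not defined for negative values')
    if n <= 1:
        # 0! and 1! are both 1
        return 1
    return n * factorial(n - 1)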

In [None]:
factorial(4)

In [None]:
factorial(10)

**Real world solution**

Warning: as noted above, the code snippets here (using a roll-your-own solution) are not robust and should not be relied upon where speed or performance is the goal, but they can certainly be useful for learning new things.

Some issues that affect factorials:

Where integers have a fixed size, factorials overflow quickly. Python's own `int` type is arbitrary-precision, so this mainly matters for fixed-width integer types:

* 32-bit integers max out around 12!
* 64-bit integers max out around 20!

Floating point representations/approximations of values and other techniques can sometimes be used to get past these limits.

To improve speeds, some implementations will simply do lookups for fairly small factorials, versus computing them.

Formulas such as Stirling's formula can approximate large values.

Some implementations use a divide and conquer approach.
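To get a feel for how close an approximation can be, here is a minimal sketch of the basic form of Stirling's formula, $n! \approx \sqrt{2 \pi n} \left(\frac{n}{e}\right)^n$, compared against `math.factorial`. The function name `stirling` is an illustrative choice, and this is the simplest version of the formula, without any correction terms.

```python
import math


def stirling(n):
    '''Approximate n! as sqrt(2 * pi * n) * (n / e) ** n.'''
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n


for n in (5, 10, 20):
    exact = math.factorial(n)
    approx = stirling(n)
    # Fraction by which the approximation falls short of the exact value
    relative_error = (exact - approx) / exact
    print(n, exact, f'{approx:.4e}', f'{relative_error:.2%}')
```

The relative error shrinks as `n` grows, which is why the formula is mostly used for large values.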

Let's take a quick look at my roll-your-own versus the implementation in Python

In [None]:
from math import factorial as f

f(4)        # returns 24, as expected

Let's cut both functions loose on a mid-size factorial and see how they do, performance-wise

In [None]:
%timeit factorial(40)    # my version

# Performance when writing this tutorial (your mileage may vary)
# 7.28 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [None]:
%timeit f(40)            # standard library version (math.factorial)

# Performance when writing this tutorial (your mileage may vary)
# 363 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# Permutations and Combinations

## Definitions:

**permutations (w/o replacement)**: number of ways to arrange a subset of objects from a finite set of distinct objects. In permutations, order is important and once an object has been used, it is not used again. Examples include: how many ways can you arrange four out of five books on a shelf, how many ways can three out of 10 horses cross the finish line? 

$\Large\frac{n!}{(n - k)!}$

**combinations**: number of subsets of objects from a finite set of distinct objects. In combinations, order is not important and objects are not replaced. Examples include: items in a salad (lettuce, tomatoes, dressing, croutons), picking players to be on the floor in a basketball game.

$\Large\frac{n!}{k!(n - k)!}$

**product**: much like permutations, in that order matters, BUT each object is replaced back into the pool after it is used. Examples include: all possible sequences for a lock where the numbers can be reused, all possible selections of balls from a lottery cage, where the balls are replaced.

$\Large n^k$

**combinations with replacement**: is the same as combinations, but includes the replacement of the objects back in the pool.

$\Large\frac{(k + n - 1)!}{k!(n - 1)!}$

||Without replacement|With replacement|
|:-|:-|:-|
|**Order matters**|permutations|product|
|**Order doesn't matter**|combinations|combinations with repl.|

Let's use the above formulas to calculate how many arrangements fall into each of these categories when we choose $k = 2$ items from a set of $n = 3$.

In [None]:
# Permutations

factorial(3) // factorial(3 - 2)    # // keeps the count an integer

In [None]:
# Combinations

factorial(3) // (factorial(2) * factorial(3 - 2))    # // keeps the count an integer

In [None]:
# Product

3 ** 2

In [None]:
# Combinations with replacement

factorial(2 + 3 - 1) // (factorial(2) * factorial(3 - 1))    # // keeps the count an integer
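If you are on Python 3.8 or later, the standard library's `math.perm` and `math.comb` functions compute the same counts directly. The sketch below repeats the four calculations for $n = 3$ and $k = 2$; the `counts` dictionary is just an illustrative way to collect the results.

```python
import math

n, k = 3, 2

counts = {
    'permutations': math.perm(n, k),                              # n! / (n - k)!
    'combinations': math.comb(n, k),                              # n! / (k! (n - k)!)
    'product': n ** k,                                            # n^k
    'combinations with replacement': math.comb(n + k - 1, k),     # (k + n - 1)! / (k! (n - 1)!)
}

for name, count in counts.items():
    print(name, count)
```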

This worked fine to give us the counts, but what if we needed to see OR use the actual arrangements?

Python has a module in the Standard Library, called `itertools`, that can help create each of the arrangements in any category.

In [None]:
import itertools

items = [1, 2, 3]    # a descriptive name avoids confusing `l` with the digit 1

print('Permutations:', list(itertools.permutations(items, 2)))
print('Combinations:', list(itertools.combinations(items, 2)))

print('Product w/ R:', list(itertools.product(items, repeat=2)))
print('Combin. w/ R:', list(itertools.combinations_with_replacement(items, 2)))

# Some practical application

What does all this mean, anyway... thus far in our earlier examples we generally just focused on a specific event OR events in a sample space:

* probability of drawing an Ace of Hearts from a standard 52-card deck
* probability of rolling an even number on a d20

Sadly, much of probability is not that simple...

For example, asking how many different ways the superheroes in a foot race can cross the finish line is a problem where order matters:

||First, Second, Third|
|:--|:--|
|Order 1|iron man, black widow, black panther|
|Order 2|iron man, black panther, black widow|
|Order 3|black widow, iron man, black panther|
|Order 4|black widow, black panther, iron man|
|Order 5|black panther, iron man, black widow|
|Order 6|black panther, black widow, iron man|

Let's use our newfound skills to calculate something similar:

In [None]:
heroes = ['iron man', 'black widow', 'black panther', 'captain america']

First we will look at all the possible outcomes and count those outcomes:

* What are the possible orderings (permutations)?
* How many possible orderings are there (i.e. what is the size of the sample space)?

In this case, we can find out how many possible orderings of all four heroes there are by using our function `factorial()`.

In [None]:
print(factorial(len(heroes)))

A count alone is somewhat limiting: it tells us how many orderings exist, but not what they are.

To actually get the orderings (here, every way to fill first and second place from our four heroes):

In [None]:
# NOTE: pretty printing as provided by the module pprint is useful to align some data types when they are printed to screen. 

import pprint as pp

S = list(itertools.permutations(heroes, 2))
pp.pprint(S)
print()
print('Total number of permutations:', len(S))

In [None]:
S = list(itertools.combinations(heroes, 2))
pp.pprint(S)
print()
print('Total number of combinations:', len(S))

In [None]:
S = list(itertools.product(heroes, repeat=2))
pp.pprint(S)
print()
print('Total number of product arrangements:', len(S))

In [None]:
S = list(itertools.combinations_with_replacement(heroes, 2))
pp.pprint(S)
print()
print('Total number of combinations_with_replacement:', len(S))
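As a sanity check, the sizes of the lists produced by `itertools` should agree with the counting formulas from the definitions. The following sketch compares the two for the four heroes taken two at a time. It assumes Python 3.8 or later for `math.perm` and `math.comb`, and the `checks` dictionary is an illustrative name.

```python
import itertools
import math

heroes = ['iron man', 'black widow', 'black panther', 'captain america']
n, k = len(heroes), 2

# Each entry pairs the number of enumerated arrangements with the formula's count
checks = {
    'permutations': (
        len(list(itertools.permutations(heroes, k))),
        math.perm(n, k)),
    'combinations': (
        len(list(itertools.combinations(heroes, k))),
        math.comb(n, k)),
    'product': (
        len(list(itertools.product(heroes, repeat=k))),
        n ** k),
    'combinations_with_replacement': (
        len(list(itertools.combinations_with_replacement(heroes, k))),
        math.comb(n + k - 1, k)),
}

for name, (enumerated, formula) in checks.items():
    print(name, enumerated, formula, enumerated == formula)
```

Counting with a formula is far cheaper than building the full list, so enumerate only when you actually need the arrangements themselves.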

## Experience Points
---

### Complete the following exercises:

Display the following arrangements of objects, taken 3 at a time, using a list of five animals as an input:

`animals = ['fox', 'cat', 'dog', 'owl', 'alligator']`

* Permutations
* Combinations
* Product
* Combinations w/ Replacement
    

If you (and your partner, if you're working in pairs) are done, then you can put your green sticky up! This is how we know you're done.

<img src='images/green_sticky.300px.png' width='200' style='float:left'>

# Resources

* [Python Statistics Module](https://docs.python.org/3/library/statistics.html)
* [numpy library](https://www.numpy.org/devdocs/reference/index.html)
* [scipy library](https://docs.scipy.org/doc/scipy/reference/index.html)
* [pandas library](http://pandas.pydata.org/pandas-docs/stable/)