$
\newcommand{\real}[1]{\mathbb{#1}}
\newcommand{\expect}{\mathrm{E}}
\newcommand{\prob}{\mathrm{P}}
\newcommand{\v}{\mathrm{var}}
\newcommand{\Comb}[2]{{}^{#1}C_{#2}}
$

# Discrete Random Variables

## PMF; Expected Values and their Properties 
*Lecture 05*

In this section we go over what Discrete Random Variables are, and how we calculate their Probability Mass Functions. Then we look at a few common types of Discrete Random Variables, specifically Uniform RV, Bernoulli RV, Binomial RV and Geometric RV. Finally we learn about Expected Values and their properties (most importantly Linearity).

A Random Variable (RV) associates a value (a number) with every possible outcome of an experiment. Mathematically, it is a function from the sample space $\Omega$ to the real numbers $\real{R}$. It can take discrete or continuous values. We can have several random variables defined on the same sample space: there can be multiple ways of mapping the outcomes of *the same experiment* to the number line, and each such mapping is a different random variable. A function of one or several random variables, such as $X + Y$, is also a random variable.

The Probability Mass Function (PMF) is the “probability law” or “probability distribution” of a discrete random variable: it gives the probability of each value that the RV can take. <br>
$$ p_X(x) = \prob(X=x) = \prob(\{ \omega \in \Omega \text{ s.t. } X(\omega) = x\})$$  
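
The definition can be made concrete with a small sketch. The example below is an illustration chosen here, not one from the lecture: it assumes two fair dice and takes $X$ to be their sum, then builds $p_X$ by collecting the probability of all outcomes $\omega$ with $X(\omega) = x$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def pmf_of_sum_of_two_dice():
    """Return p_X as a dict {x: P(X = x)} for X = sum of two fair dice."""
    # Sample space: all 36 equally likely ordered pairs (d1, d2).
    omega = list(product(range(1, 7), repeat=2))
    p_outcome = Fraction(1, len(omega))

    pmf = defaultdict(Fraction)
    for outcome in omega:
        x = sum(outcome)      # X(omega): the random variable is a function
        pmf[x] += p_outcome   # add the probability of this outcome to p_X(x)
    return dict(pmf)


p_X = pmf_of_sum_of_two_dice()
```

Exact fractions are used so that the probabilities can be compared without rounding error.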

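*Bernoulli RV* models a single trial that results in success ($X = 1$, with probability $p$) or failure ($X = 0$, with probability $1-p$). Equivalently, it is the indicator RV of an event $A$ with $\prob(A) = p$.  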

*Binomial RV* represents the total number of heads when a coin is flipped $n$ times independently, with probability $p$ of getting heads in each toss. If $X$ is a Binomial RV, then for $k = 0, 1, \dots, n$  
$$p_X(k) = \Comb{n}{k} \,p^k (1-p)^{n-k}$$  

*Geometric RV* represents the number of independent tosses up to and including the first head. If $X$ is a Geometric RV, then  
$$p_X(k) = (1-p)^{k-1} p, \quad k = 1, 2, \dots$$
What is the probability that $X > k$? For this, the first $k$ tosses must all be tails. Thus, $\prob(X > k) = (1-p)^k$.  
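
As a quick numerical sanity check of the Binomial and Geometric PMFs, the sketch below verifies that the Binomial PMF sums to 1 and that the Geometric tail matches $(1-p)^k$. The values of `n`, `p` and `k` are arbitrary choices made for illustration.

```python
from math import comb, isclose


def binomial_pmf(k, n, p):
    """P(X = k) for a Binomial RV with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def geometric_pmf(k, p):
    """P(X = k) for a Geometric RV: first head on toss k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p


n, p, k = 10, 0.3, 4  # illustrative values only

# A PMF must sum to 1 over all possible values.
binomial_total = sum(binomial_pmf(j, n, p) for j in range(n + 1))

# P(X > k) = 1 - P(X <= k), computed from the PMF.
geometric_tail = 1 - sum(geometric_pmf(j, p) for j in range(1, k + 1))
```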

### Expected Value
>Expectation of a Random Variable is the weighted average of the possible values of the RV, where the weights are the probabilities of each value.  

Another way of defining Expected Value is as follows:  
>Expectation of a Random Variable is the average value of a random variable in large number of independent repetitions of the experiment.  

How do both the definitions lead to the same measure? Is it necessary that the frequencies in a large population will follow the probability distribution? Is the probability (likelihood) of an observation the same as its frequency when the experiment is conducted a large number of times?  The intuition is easy to catch if we look at probabilities as frequencies. The mathematical proof is given by the **Weak Law of Large Numbers**.  
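
The agreement between the two definitions can be illustrated (not proved) with a simulation. The sketch below assumes a fair six-sided die, a choice made here for illustration, and compares the weighted average of the values with the sample mean over many independent rolls. A fixed seed is used only to make the run reproducible.

```python
import random


def average_die_roll(num_rolls, seed=0):
    """Sample mean of num_rolls independent rolls of a fair die."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    total = sum(rng.randint(1, 6) for _ in range(num_rolls))
    return total / num_rolls


# Definition 1: weighted average of the values, weights = probabilities.
expected_value = sum(x * (1 / 6) for x in range(1, 7))

# Definition 2: average over a large number of independent repetitions.
sample_mean = average_die_roll(100_000)
```

With this many rolls the sample mean should land very close to 3.5, and it tends to get closer as the number of rolls grows.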

Consider an experiment where we randomly pick a student from a classroom of $n$ students, each equally likely. The weight of the $i$-th student is $x_i$. If the RV $X$ is defined as the weight of the selected student, then (assuming for simplicity that all weights are distinct) $p_X(x_i) = \frac{1}{n}$ and the expected value is
$$ \expect[X] = \sum_i x_i p_X(x_i) = \frac{1}{n} \sum_i x_i $$  

Thus, expected value (which is the average of observations over multiple runs of the experiment) is equal to the population average (mean weight of a student in the class).

**Expected Value Rule**:  
Suppose we have two random variables $X$ and $Y$ over the same sample space such that $Y = g(X)$. Then,
$$\expect[Y] = \expect[g(X)] = \sum_x g(x)p_X(x)$$
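
A small sketch of the Expected Value Rule, under assumptions made here for illustration: $X$ is uniform on $\{-2, \dots, 2\}$ and $g(x) = x^2$. It computes $\expect[g(X)]$ directly from $p_X$, checks that this matches the expectation computed from the PMF of $Y = g(X)$, and shows that in general $\expect[g(X)] \neq g(\expect[X])$.

```python
from fractions import Fraction

# Assumed PMF of X: uniform on {-2, -1, 0, 1, 2}.
p_X = {x: Fraction(1, 5) for x in range(-2, 3)}


def g(x):
    return x * x


def expectation(pmf, func=lambda x: x):
    """Compute sum over x of func(x) * pmf(x)."""
    return sum(func(x) * prob for x, prob in pmf.items())


e_X = expectation(p_X)       # E[X]
e_gX = expectation(p_X, g)   # E[g(X)] via the Expected Value Rule

# The long way: first derive the PMF of Y = g(X), then take E[Y].
p_Y = {}
for x, prob in p_X.items():
    p_Y[g(x)] = p_Y.get(g(x), 0) + prob
e_Y = expectation(p_Y)
```

The rule saves the intermediate step of deriving $p_Y$.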

## Variance, Conditioning (on events); Joint PMFs for Multiple RVs 
*Lecture 06*  

Expected Value gives an average (central tendency) of the probability distribution. But how do you measure the spread of the distribution?  

One way to measure the spread is the range (highest possible value minus lowest possible value), but this is severely affected by outliers. Other measures such as Quartiles and Percentiles are therefore often used; Percentiles give us the flexibility of discounting as many outliers as required. However, these measures still don't tell us how the observations are distributed around the mean value. How do we measure how close the actual observations are to the Expected Value? The answer is given by **Variance**.  

Variance measures the average *squared* distance of a Random Variable from its Expected Value (over multiple trials). It quantifies the amount of randomness that is present. Together with the Expected Value, the variance crisply summarizes the properties of a PMF.  
\begin{align}
\v(X) &= \expect[(X-\mu)^2], \quad \text{where } \mu = \expect[X] \\
\sigma_X &= \sqrt{\v(X)} \quad \text{(standard deviation)}
\end{align}  

**Properties of Variance**:
* $\v(aX+b) = a^2\v(X)$
* $\v(X) = \expect[X^2] - (\expect[X])^2$
* If $X$ is a discrete Uniform RV taking the integer values from $a$ to $b$ (inclusive), then $\expect[X] = \frac{a+b}{2}$ and $\v(X) = \frac{1}{12}(b-a)(b-a+2)$
* If $X$ is a Bernoulli RV with probability $p$ of success, then $\expect[X] = p$ and $\v(X) = p(1-p)$  
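
These properties can be checked exactly on small PMFs. The sketch below uses arbitrary illustrative values: a discrete Uniform RV on the integers $3, \dots, 9$, a Bernoulli RV with $p = \frac{1}{4}$, and the transformation $Y = cX + d$ (named `c`, `d` here to avoid clashing with the Uniform endpoints `a`, `b`).

```python
from fractions import Fraction


def mean(pmf):
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """var(X) = E[(X - mu)^2], computed directly from the definition."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def second_moment(pmf):
    return sum(x * x * p for x, p in pmf.items())


# Discrete Uniform RV on the integers a, a+1, ..., b.
a, b = 3, 9
uniform = {x: Fraction(1, b - a + 1) for x in range(a, b + 1)}

# Y = cX + d: each value moves, its probability stays the same.
c, d = -2, 5
scaled = {c * x + d: p for x, p in uniform.items()}

# Bernoulli RV with success probability p.
p = Fraction(1, 4)
bernoulli = {1: p, 0: 1 - p}
```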

After this we discuss conditional PMF, conditional Expected Value and conditional Variance (all conditioned on any event $A$). All the formulas remain the same, but now we use conditional probabilities instead of absolute probabilities.

**Total Expectation Theorem**  
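Total Expectation Theorem helps us divide and conquer EV calculations. It states that if the events $A_1, \dots, A_n$ partition the sample space (with $\prob(A_i) > 0$ for each $i$), then
$$ \expect[X] = \prob(A_1)\expect[X \mid A_1] + \dots + \prob(A_n)\expect[X \mid A_n] $$  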
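
A minimal sketch of the theorem on a fair die, with a partition chosen here purely for illustration ($A_1 = \{1, 2\}$, $A_2 = \{3, 4, 5, 6\}$). It computes each conditional expectation, recombines them with the weights $\prob(A_i)$, and compares the result with the direct calculation.

```python
from fractions import Fraction

# X = result of a fair die roll.
p_X = {x: Fraction(1, 6) for x in range(1, 7)}

# Assumed partition of the possible values into two events.
partition = [{1, 2}, {3, 4, 5, 6}]


def prob(event):
    return sum(p_X[x] for x in event)


def conditional_expectation(event):
    """E[X | A] = sum over x in A of x * P(X = x) / P(A)."""
    return sum(x * p_X[x] for x in event) / prob(event)


total = sum(prob(A) * conditional_expectation(A) for A in partition)
direct = sum(x * p for x, p in p_X.items())
```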

### Memorylessness of Geometric Random Variable
The number of *remaining* coin tosses, conditioned on tails in the first toss, is Geometric with the same parameter $p$. For example, suppose the probability of getting the first head on the third toss is $q$. If we get a tail on the first toss, then (given this fact) the probability of getting the first head on the overall fourth toss (the third remaining toss) is also $q$.  

After the first coin toss, the number of *remaining* coin tosses is given by the random variable $X-1$. Given that the first toss was a tail (i.e. $X > 1$), this random variable has the same probability distribution as the original random variable $X$:  
$$ p_{X-1 \mid X > 1}(a) = \frac{p_X(a+1)}{\prob(X > 1)} = \frac{(1-p)^{a}\, p}{1-p} = p_X(a), \quad a = 1, 2, \dots $$  
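
Memorylessness can be checked numerically. The sketch below computes the conditional PMF of the remaining tosses $X - 1$ given $X > 1$ from the definition of conditional probability and compares it with the unconditional Geometric PMF. The value `p = 0.3` is an arbitrary choice for illustration.

```python
from math import isclose


def geometric_pmf(k, p):
    """P(X = k): first head on toss k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p


def pmf_remaining_given_first_tail(a, p):
    """P(X - 1 = a | X > 1) = P(X = a + 1) / P(X > 1), with P(X > 1) = 1 - p."""
    return geometric_pmf(a + 1, p) / (1 - p)


p = 0.3  # illustrative value only
matches = all(
    isclose(pmf_remaining_given_first_tail(a, p), geometric_pmf(a, p))
    for a in range(1, 50)
)
```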

Using this property we can derive the Expected Value of a Geometric Random Variable $X$.
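\begin{align}
\expect[X] &= 1 + \expect[X-1] \\
&= 1 + p\,\expect[X-1 \mid X=1] + (1-p)\,\expect[X-1 \mid X>1] \\
&= 1 + 0 + (1-p)\,\expect[X]
\end{align}
Solving $\expect[X] = 1 + (1-p)\expect[X]$ for $\expect[X]$ gives $p\,\expect[X] = 1$, i.e. $\expect[X] = \frac{1}{p}$.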
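
The result $\expect[X] = \frac{1}{p}$ can also be checked by brute force. The sketch below sums $k \, p_X(k)$ over a large but finite number of terms. The truncation point (`terms`) is an assumption of this sketch, chosen so that the neglected tail is negligible for the values of $p$ tried.

```python
def geometric_mean_partial_sum(p, terms=10_000):
    """Approximate E[X] = sum over k >= 1 of k * (1 - p)^(k - 1) * p."""
    return sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms + 1))


approx_quarter = geometric_mean_partial_sum(0.25)  # should be close to 1/0.25
approx_half = geometric_mean_partial_sum(0.5)      # should be close to 1/0.5
```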

### Joint PMF
Joint PMF of two different random variables $X$ and $Y$ is given by:
$$p_{X,Y}(x,y) = \prob(X=x \textrm{ and } Y=y)$$   

The following properties follow directly from the definition:
\begin{align}
&\sum_{x}\sum_{y}p_{X,Y}(x,y) = 1 \\
&p_X(x) = \sum_y p_{X,Y}(x,y) \\
&p_Y(y) = \sum_x p_{X,Y}(x,y) \\
&\expect[g(X,Y)] = \sum_x \sum_y g(x,y)p_{X,Y}(x,y) \\
\end{align}
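
A small sketch of these properties, using an example assumed here for illustration: two fair coin tosses, with $X$ the number of heads and $Y$ the indicator that the first toss is a head. The joint PMF is stored as a dict keyed by $(x, y)$.

```python
from collections import defaultdict
from fractions import Fraction

q = Fraction(1, 4)
# Outcomes HH, HT, TH, TT mapped to (X, Y) = (number of heads, first toss is H).
joint = {(2, 1): q, (1, 1): q, (1, 0): q, (0, 0): q}


def marginals(joint_pmf):
    """Sum the joint PMF over the other variable to get p_X and p_Y."""
    p_X, p_Y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint_pmf.items():
        p_X[x] += p
        p_Y[y] += p
    return dict(p_X), dict(p_Y)


def expectation(joint_pmf, g):
    """E[g(X, Y)] = sum over (x, y) of g(x, y) * p_{X,Y}(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())


p_X, p_Y = marginals(joint)
e_X = expectation(joint, lambda x, y: x)
e_Y = expectation(joint, lambda x, y: y)
e_sum = expectation(joint, lambda x, y: x + y)
```

Note that $X$ and $Y$ are dependent in this example, yet $\expect[X+Y]$ still equals $\expect[X] + \expect[Y]$.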

Using the above rules of Joint PMFs, we can prove the linearity of expectation over multiple random variables.
$$\expect[X+Y] = \expect[X] + \expect[Y]$$  
Here $X$ and $Y$ don't need to be independent.  

The above property can be used to derive the Expected Value of a Binomial Random Variable $X$ with $n$ trials and probability $p$ of success. Let $X_i$ be the indicator of success on the $i$-th trial ($1$ for success, $0$ for failure). Then
$$X = X_1 + X_2 + \dots + X_n$$  
Now, $\expect[X_i] = p$. Therefore, $\expect[X] = p + p + \dots + p = np$.
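
As a cross-check of $\expect[X] = np$, the sketch below computes the mean directly from the Binomial PMF, without using linearity. The values `n = 12` and `p = 0.3` are arbitrary.

```python
from math import comb, isclose


def binomial_mean_from_pmf(n, p):
    """E[X] = sum over k of k * C(n, k) * p^k * (1 - p)^(n - k)."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))


n, p = 12, 0.3  # illustrative values only
mean_from_pmf = binomial_mean_from_pmf(n, p)
```

The linearity argument reaches the same number without evaluating any sum over binomial coefficients.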

## Conditioning on other RVs; Independence of RVs; Hat Problem 
*Lecture 07*  

In this part, we study conditioning of random variables on other random variables (as opposed to events, which we studied in the last lecture). This is mainly new notation and a straightforward extension of the concepts of joint PMFs and conditioning. Then we introduce independence of random variables, which builds on the joint PMFs from the previous lecture and on conditioning on other random variables from this lecture. We study how the expectations and variances of independent RVs can be manipulated. Finally, we study the Hat Problem, where the random variables involved are *not* independent, and solve it using linearity of expectation.

**Conditional PMF**  

We have studied the conditional probability of a random variable given an event (in the previous lecture). We also studied the Joint PMF of two different random variables. Now we study the conditional probability of a random variable given the value of another random variable (defined for $p_Y(y) > 0$).  
$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}$$  
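
A minimal sketch of this formula, on an example assumed here for illustration: two fair coin tosses, with $X$ the number of heads and $Y$ the indicator that the first toss is a head. The conditional PMF is obtained by picking out the entries of the joint PMF with the given $y$ and renormalizing by $p_Y(y)$.

```python
from fractions import Fraction

q = Fraction(1, 4)
# (X, Y) = (number of heads in two tosses, first toss is a head).
joint = {(2, 1): q, (1, 1): q, (1, 0): q, (0, 0): q}


def conditional_pmf_X_given_Y(joint_pmf, y):
    """Return {x: p_{X|Y}(x | y)}; assumes p_Y(y) > 0."""
    p_y = sum(p for (_, y2), p in joint_pmf.items() if y2 == y)
    return {x: p / p_y for (x, y2), p in joint_pmf.items() if y2 == y}


given_first_head = conditional_pmf_X_given_Y(joint, 1)
given_first_tail = conditional_pmf_X_given_Y(joint, 0)
```

Each conditional PMF is itself a valid PMF in $x$, so its values sum to 1.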

**Independence**

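Linearity, $\expect[X + Y] = \expect[X] + \expect[Y]$, holds for all random variables. Two random variables are **independent** if $p_{X,Y}(x,y) = p_X(x)\,p_Y(y)$ for all $x$ and $y$. The following two properties are guaranteed only for **independent random variables**:
\begin{align}
\expect[XY] &= \expect[X]\,\expect[Y]\\
\v(X + Y) &= \v(X) + \v(Y)
\end{align}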
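
The role of independence can be seen by comparing two joint PMFs with the same marginals. The sketch below is an illustration assumed here: in one case $X$ and $Y$ are two independent fair dice, in the other $Y = X$ (fully dependent). The product and variance rules hold in the first case and fail in the second.

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}

# Independent dice: the joint PMF is the product of the marginals.
independent = {(x, y): die[x] * die[y] for x, y in product(die, die)}
# Fully dependent: Y is always equal to X.
dependent = {(x, x): die[x] for x in die}


def expect(joint, g):
    return sum(g(x, y) * p for (x, y), p in joint.items())


def var_of_sum(joint):
    mu = expect(joint, lambda x, y: x + y)
    return expect(joint, lambda x, y: (x + y - mu) ** 2)


e_X = expect(independent, lambda x, y: x)                   # same in both cases
var_X = expect(independent, lambda x, y: (x - e_X) ** 2)    # same in both cases

e_XY_independent = expect(independent, lambda x, y: x * y)
e_XY_dependent = expect(dependent, lambda x, y: x * y)
```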

The above property can be used to derive the variance of a Binomial Random Variable $X$ with $n$ trials and probability $p$ of success. Let $X_i$ be the indicator of success on the $i$-th trial. Then
$$X = X_1 + X_2 + \dots + X_n$$  
Now, $\v(X_i) = p(1-p)$. Therefore, given that the $X_i$ are independent of each other, $$\v(X) = \v(X_1) + \dots + \v(X_n) = np(1-p)$$  
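
As a cross-check of $\v(X) = np(1-p)$, the sketch below computes the variance directly from the Binomial PMF using $\expect[X^2] - (\expect[X])^2$. The values `n = 12` and `p = 0.3` are arbitrary.

```python
from math import comb, isclose


def binomial_variance_from_pmf(n, p):
    """var(X) = E[X^2] - (E[X])^2, with both moments taken from the PMF."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    second_moment = sum(k * k * pk for k, pk in enumerate(pmf))
    return second_moment - mean**2


n, p = 12, 0.3  # illustrative values only
variance_from_pmf = binomial_variance_from_pmf(n, p)
```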

### Hat Problem

In the hat problem, $n$ people each hand in a hat, and every person then receives one hat, with all assignments equally likely. Let $X_j$ denote the indicator variable for the event that person $j$ gets his own hat. There are $n!$ permutations in which the hats may be distributed, and $(n-1)!$ of them give person $j$ his own hat. So person $j$ gets his own hat with probability $\frac{(n-1)!}{n!} = \frac{1}{n}$.

There is some subtlety involved here. First, the random variable $X$ (the total number of people who receive their own hat) and the random variables $X_1, X_2, \dots, X_n$ all map the outcomes of the *same experiment* to the number line. The experiment is that each person is assigned a hat. $X_i$ takes each hat permutation and assigns it $1$ if the $i$-th person received his own hat and $0$ otherwise. $X$ takes each hat permutation and assigns it $k$ if exactly $k$ people got their own hat back.

Secondly, $X_1, X_2, \dots, X_n$ are dependent on each other. For example, if persons $1, \dots, n-1$ all get their own hats, then person $n$ automatically gets his own hat as well, so $X_1 = \dots = X_{n-1} = 1$ forces $X_n = 1$.

Now, how do we find $\expect[X]$? We use the Linearity of Expected Value theorem.
\begin{align}
X &= X_1 + X_2 + \dots + X_n \\
\expect[X] &= \expect[X_1] + \expect[X_2] + \dots + \expect[X_n] \\
\expect[X] &= \frac{1}{n} + \frac{1}{n} + \dots + \frac{1}{n} &\textrm{as each }X_i\textrm{ is a Bernoulli RV with }p=\frac{1}{n} \\
\expect[X] &= 1 \\
\end{align}  

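How do we find the variance of $X$? Here we can't simply add the variances of the $X_i$, because they are not independent. Instead, we use the formula $\v(X) = \expect[X^2] - (\expect[X])^2$. Expanding $X^2 = \sum_i X_i^2 + \sum_{i \neq j} X_i X_j$ and using $\expect[X_i^2] = \frac{1}{n}$ and $\expect[X_i X_j] = \frac{1}{n(n-1)}$ for $i \neq j$ gives $\expect[X^2] = 1 + 1 = 2$, so $\v(X) = 2 - 1 = 1$ (for $n \geq 2$). Refer to Lec 07.8 (starting at 9:44) for the exact calculation.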
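
For small $n$, both $\expect[X] = 1$ and $\v(X) = 1$ can be confirmed by brute force. The sketch below enumerates all $n!$ equally likely hat assignments and counts, for each one, how many people received their own hat. It is a check by enumeration, not the derivation used in the lecture.

```python
from fractions import Fraction
from itertools import permutations


def hat_problem_moments(n):
    """Return (E[X], var(X)), X = number of people who get their own hat."""
    counts = []
    for perm in permutations(range(n)):
        # perm[j] is the hat received by person j; j == perm[j] means own hat.
        counts.append(sum(1 for j, hat in enumerate(perm) if j == hat))
    total = len(counts)  # n! equally likely assignments
    mean = Fraction(sum(counts), total)
    second_moment = Fraction(sum(c * c for c in counts), total)
    return mean, second_moment - mean**2


moments_by_n = {n: hat_problem_moments(n) for n in range(2, 7)}
```

Enumeration grows as $n!$, so this is only practical for small $n$.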