# Lecture 5.1: Review

## Outline

* Discrete probability distributions
    * Bernoulli distribution
    * Binomial distribution
    * Geometric distribution
    * Negative Binomial distribution
    * Hypergeometric distribution
    * Multinomial distribution
    * Poisson distribution

* Continuous probability distributions
    * Exponential
    * Gamma
    * Uniform
    * Normal

* Relationship between two random variables
    * The joint distribution function
    * Marginal distributions
    * Independent random variables
    * Conditional distributions
    * Conditional expectation
    * Combining random variables
        * Covariance
        * Correlation
        * Mean
        * Variance

* Point estimation
    * Maximum likelihood estimation (MLE)
    * Method of moments (MOM)
    * Mean squared error (MSE)

* Central Limit Theorem (CLT)

* Confidence intervals
    * One sample - mean and proportion
    * Two samples - difference in means and proportions

## Discrete Probability Distributions

### Bernoulli Distribution (Bernoulli Trials)

The **Bernoulli distribution** is the probability distribution of a random variable which takes the value 1 with success probability of $p$ and the value 0 with failure probability of $q = 1 - p$.

The random variable $X$ is called a **Bernoulli random variable** if it takes only 2 values, 0 and 1.


The probability mass function is,

$$f_X(x)=
\begin{cases}
    p, & \text{if } x = 1\\
    1 - p, & \text{if } x = 0
\end{cases}$$

### Binomial Distribution

Let $X$ be the number of successes in $n$ independent Bernoulli trials, each with probability of success $p$. Then $X$ has the **Binomial distribution** with parameters $n$ and $p$. We write $X \sim Bin(n, p)$, or $X \sim Binomial(n, p)$.

**Probability Mass Function (PMF)**:
$$f_X(x) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \text{ for } x = 0, 1, \dots, n$$

**Mean**: $\mu = np$

**Variance**: $\sigma^2 = np(1 - p) = npq$
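As a quick numerical check of the PMF, mean and variance formulas above, here is a minimal Python sketch. The parameter values `n = 10` and `p = 0.3` and the helper name `binomial_pmf` are illustrative assumptions, not part of the lecture.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Illustrative parameters (assumed for this sketch).
n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# The PMF should sum to 1 over x = 0, 1, ..., n.
total = sum(pmf)

# Mean and variance computed directly from the PMF.
mean = sum(x * f for x, f in enumerate(pmf))
var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))

print(total, mean, var)  # compare with 1, n * p and n * p * (1 - p)
```

Summing over the whole support is only practical for small $n$, but it makes the connection between the PMF and the closed-form moments explicit.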

### Geometric Distribution

**Geometric distribution**: When independent Bernoulli trials are repeated, each with probability $p$ of success, the number of trials $X$ it takes to get the first success has a Geometric distribution. We write $X \sim Geometric(p)$.

**Probability Mass Function (PMF)**:

$$ f_X(x) = (1 - p)^{x - 1} p = q^{x - 1} p \text{, for } x = 1, 2, \dots $$

**Mean**: $\mu = \frac{1}{p}$

**Variance**: $\sigma^2 = \frac{1 - p}{p^2}$

### Negative Binomial Distribution

When independent Bernoulli trials are repeated, each with probability $p$ of success, and $X$ is the trial number when $r$ successes are first achieved, then $X$ has a Negative Binomial distribution. We write $X \sim NegBin(r, p)$.

**Probability Mass Function**:

$$ f_X(x) = \binom{x - 1}{r - 1} p^r (1 - p)^{x - r} \text{, for } x = r, r + 1, \dots $$

**Mean**: $\mu = \frac{r}{p}$

**Variance**: $\sigma^2 = \frac{r(1 - p)}{p^2}$
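The Geometric distribution is the special case $r = 1$ of the Negative Binomial distribution. The sketch below checks this numerically and approximates the Negative Binomial mean by truncating the infinite sum. The values `p = 0.2` and `r = 3`, the truncation point and the function names are assumptions made for illustration.

```python
from math import comb


def negbin_pmf(x: int, r: int, p: float) -> float:
    """P(X = x): the r-th success occurs on trial x."""
    if x < r:
        return 0.0  # fewer than r trials cannot contain r successes
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)


def geometric_pmf(x: int, p: float) -> float:
    """P(X = x): the first success occurs on trial x."""
    return (1 - p) ** (x - 1) * p


p = 0.2  # assumed success probability

# With r = 1 the Negative Binomial PMF reduces to the Geometric PMF.
same = all(
    abs(negbin_pmf(x, 1, p) - geometric_pmf(x, p)) < 1e-15
    for x in range(1, 50)
)

# Approximate E(X) for r = 3 by truncating the sum; the tail is negligible.
r = 3
approx_mean = sum(x * negbin_pmf(x, r, p) for x in range(r, 2000))

print(same, approx_mean)  # compare approx_mean with r / p
```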

### Hypergeometric Distribution

The Hypergeometric distribution is used when we are sampling without replacement from a *finite* population.  

In a Bernoulli process, given that there are $M$ successes among $N$ trials, the number $X$ of successes among the first $n$ trials has a Hypergeometric distribution. We write $X \sim Hypergeometric(N, M, n)$.

**Probability Mass Function**:

$$ f_X(x) = \frac{\binom{M}{x} \binom{N - M}{n - x}}{\binom{N}{n}} \text{, for } x = 0, 1, \dots, n $$

**Mean**: $\mu = np$, where $p = \frac{M}{N}$

**Variance**: $\sigma^2 = \frac{N - n}{N - 1} np(1-p)$
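The variance differs from the Binomial variance by the finite population correction factor $\frac{N - n}{N - 1}$, which is an ordinary fraction. A minimal Python sketch, assuming illustrative values $N = 50$, $M = 20$, $n = 10$ and a hypothetical helper `hypergeom_pmf`, confirms the mean and variance by summing over the support:

```python
from math import comb


def hypergeom_pmf(x: int, N: int, M: int, n: int) -> float:
    """P(X = x): x successes in n draws without replacement."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


# Illustrative population (assumed): 50 items, 20 successes, 10 draws.
N, M, n = 50, 20, 10
p = M / N

# x cannot exceed M, and n - x cannot exceed N - M.
support = range(max(0, n - (N - M)), min(n, M) + 1)
pmf = {x: hypergeom_pmf(x, N, M, n) for x in support}

mean = sum(x * f for x, f in pmf.items())
var = sum((x - mean) ** 2 * f for x, f in pmf.items())

# Finite population correction applied to the Binomial variance.
fpc = (N - n) / (N - 1)
print(mean, n * p)
print(var, fpc * n * p * (1 - p))
```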

### Multinomial Distribution

The Multinomial distribution is a generalization of the Binomial distribution. For each independent trial, instead of having only two possible outcomes, success and failure, we can have $k$ possible outcomes, with probabilities $p_1, \dots, p_k$, where $p_i \geq 0$ for $i = 1, \dots, k$ and $\sum_{i = 1}^{k} p_i = 1$. For $n$ independent trials, if random variable $X_i$ indicates the number of times outcome number $i$ is observed over the $n$ trials, the vector $X = (X_1, \dots, X_k)$ follows a Multinomial distribution with parameters $n$ and $p = (p_1, \dots, p_k)$.

**Probability Mass Function**:

$$\begin{align}f(x_1, \dots, x_k) &= P(X_1 = x_1, \dots, X_k = x_k) \\&= \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k} \text{, for } \sum_{i = 1}^k x_i = n \end{align}$$

**Mean**: $E(X_i) = n p_i$

**Variance**: $Var(X_i) = n p_i (1 - p_i)$  

**Covariance**: $Cov(X_i, X_j) = -n p_i p_j$ for $i \neq j$

### Poisson Distribution

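The **Poisson distribution** models the number of random events that occur in a fixed interval of time, when events occur independently at a constant average rate. Its parameter $\lambda$ is the expected number of events in the interval. We write $X \sim Poisson(\lambda)$.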

**Probability Mass Function**:

$$ f_X(x) = P(X = x) = \frac{\lambda^x}{x!} e^{-\lambda} \text{ for } x = 0, 1, 2, \dots $$

**Mean**: $\mu = \lambda$

**Variance**: $\sigma^2 = \lambda$
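Tail probabilities such as $P(X > c)$ are usually computed through the complement, $1 - P(X \leq c)$. Below is a small Python sketch with an assumed rate `lam = 4.0`; the PMF is evaluated on the log scale so that large values of $x$ do not overflow $\lambda^x$ or $x!$.

```python
from math import exp, lgamma, log


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for X ~ Poisson(lam), computed on the log scale."""
    # lgamma(x + 1) equals log(x!)
    return exp(x * log(lam) - lam - lgamma(x + 1))


lam = 4.0  # assumed average number of events per interval

# P(X <= 2) is a finite sum of PMF values.
p_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))

# P(X > 2) via the complement rule.
p_more_than_2 = 1 - p_at_most_2

print(p_at_most_2, p_more_than_2)
```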

### Exercise 1:

A husband and wife both have brown eyes but carry genes that make it possible for their children to have brown eyes (probability 0.75), blue eyes (0.125), or green eyes (0.125).  

(a) What is the probability the first blue-eyed child they have is their third child? Assume that the eye colors of the children are independent of each other.  

(b) On average, how many children would such a pair of parents have until having a blue-eyed child? What is the standard deviation of the number of children they have until the first blue-eyed child?  

(c) What is the probability that their first child will have green eyes and the second will not?  

(d) What is the probability that exactly one of their two children will have green eyes?  

(e) If they have six children, what is the probability that exactly two will have green eyes?  

(f) If they have six children, what is the probability that at least one will have green eyes?  

(g) What is the probability that the first green eyed child will be the 4th child?  

(h) Would it be considered unusual if only 2 out of their 6 children had brown eyes?  


### Exercise 2:

A coffee shop serves an average of 75 customers per hour during the morning rush.  

(a) Which distribution would be the most appropriate for calculating the probability of a given number of customers arriving within one hour during this time of day?  

(b) What are the mean and the standard deviation of the number of customers this coffee shop
serves in one hour during this time of day?  

(c) Calculate the probability that this coffee shop serves more than 70 customers in one hour during this
time of day?  



## Continuous Probability Distributions

### $P(X = x) = 0$

* When $X$ is continuous, $P(X = x) = 0$ for **ALL** $x$. The probability mass function (PMF) is meaningless.


### The CDF

* $F_X(x) = P(X \leq x)$
* $F_X(x)$ is a *continuous* function: probability accumulates *continuously*
* $P(a < X \leq b) = P(X \in (a, b]) = F(b) - F(a)$

### The PDF

$$ f_X(x) = \frac{dF_X}{dx} = F_X^{'}(x)$$

<img src="images/total_pdf.png" width="500">


$P(a \leq X \leq b) = F(b) - F(a) = \int_a^b f_X(x) dx$ 

<img src="images/int_pdf.png" width="500">

### Mean and Variance

$$ \mu_X = E(X) = \int_{-\infty}^{\infty} x f_X(x) dx $$

$$ Var(X) = E \left((X - \mu_X)^2 \right) = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x) dx $$

or

$$ Var(X) = E(X^2) - (E(X))^2 = \int_{-\infty}^{\infty} x^2 f_X(x) dx - (E(X))^2 $$

### Exponential Distribution

We define the Exponential($\lambda$) distribution to be the distribution of the waiting time (time between events) in a Poisson process with rate $\lambda$.

We write $X \sim Exponential(\lambda)$, or $X \sim Exp(\lambda)$.

**CDF**: 
$$ F_X(x) = P(X \leq x) =
\begin{cases}
    1 - e^{-\lambda x} & \text{for } x \geq 0 \\
    0 & \text{for } x < 0
\end{cases}$$

**PDF**:
$$ f_X(x) = F_X^{'}(x) =
\begin{cases}
    \lambda e^{-\lambda x} & \text{for } x \geq 0 \\
    0 & \text{for } x < 0
\end{cases}$$

**Mean**: $\mu_X = E(X) = \frac{1}{\lambda}$

**Variance**: $\sigma^2 = Var(X) = \frac{1}{\lambda^2}$
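The CDF gives interval probabilities directly, and it ties back to the Poisson process: the waiting time exceeds $t$ exactly when no events occur in $[0, t]$. The sketch below illustrates both facts for an assumed rate `lam = 2.0`; the function name is hypothetical.

```python
from math import exp


def exponential_cdf(x: float, lam: float) -> float:
    """F_X(x) = P(X <= x) for X ~ Exp(lam)."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


lam = 2.0  # assumed rate: 2 events per unit of time

# P(0.5 < X <= 1.5) = F(1.5) - F(0.5)
prob = exponential_cdf(1.5, lam) - exponential_cdf(0.5, lam)

# P(X > t) equals the Poisson probability of zero events in [0, t],
# where the count in [0, t] is Poisson with mean lam * t.
t = 0.5
survival = 1 - exponential_cdf(t, lam)
poisson_zero = exp(-lam * t)

print(prob, survival, poisson_zero)
```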

### Gamma Distribution

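For a positive integer $k$, the **Gamma distribution** with parameters $k$ and $\lambda$ is the distribution of the sum of $k$ independent Exponential($\lambda$) random variables. More generally, $k$ can be any positive real number, which makes this a very flexible family of distributions.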

For $X \sim Gamma(k, \lambda)$

**Probability Density Function (PDF)**:

$$ f_X(x) = 
\begin{cases}
\frac{\lambda^k}{\Gamma(k)} x^{k - 1} e^{-\lambda x} & \text{if } x \geq 0 \\
0 & \text{otherwise}
\end{cases} $$

where $\Gamma(k) = \int_0^{\infty} t^{k - 1} e^{-t} dt$ (when $k$ is a positive integer, $\Gamma(k) = (k - 1)!$).


**Mean and Variance**: 
$$E(X) = \frac{k}{\lambda} \text{ and } Var(X) = \frac{k}{\lambda^2} $$

### Uniform Distribution

$X$ has a Uniform distribution on the interval $[a, b]$ if $X$ is equally likely to fall anywhere in the interval $[a, b]$.  

We write $X \sim Uniform(a, b)$, or $X \sim U(a, b)$.  

**PDF**:

$$f_X(x) = 
\begin{cases}
\frac{1}{b - a} & \text{if } a \leq x \leq b \\
0 & \text{otherwise}
\end{cases} $$

**Mean and Variance**:

$$ E(X) = \frac{a + b}{2} \text{ and } Var(X) = \frac{(b - a)^2}{12} $$

### Normal Distribution

The **Normal distribution**, also called the **Gaussian distribution**, is probably the most important distribution in statistics. It has two parameters, the mean $\mu$ and the variance $\sigma^2$.  

We write $X \sim Normal(\mu, \sigma^2)$, or $X \sim N(\mu, \sigma^2)$.  

**Probability Density Function (PDF)**:

$$ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left(- \frac{(x - \mu)^2}{2 \sigma^2} \right) \text{, for } x \in \mathbb{R} $$

**Mean**: $E(X) = \mu$  

**Variance**: $Var(X) = \sigma^2$

#### Linear Transformations

If $X \sim N(\mu, \sigma^2)$, then for any constants $a$ and $b$,

$$ aX + b \sim N(a \mu + b, a^2 \sigma^2) $$

In particular,

$$ X \sim N(\mu, \sigma^2) \Rightarrow \left(\frac{X - \mu}{\sigma} \right) \sim N(0, 1) $$

Note: $N(0, 1)$ is called the **standard Normal distribution**.
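Standardizing lets every Normal probability or percentile be read off the standard Normal distribution. Here is a minimal sketch using `statistics.NormalDist` from the Python standard library; the values $\mu = 100$, $\sigma = 15$ and the cutoffs are assumptions chosen only for illustration. Note that `NormalDist` takes the standard deviation, not the variance.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # assumed mean and standard deviation
X = NormalDist(mu, sigma)
Z = NormalDist()  # standard Normal, N(0, 1)

# P(X <= x) computed directly and via the z-score (x - mu) / sigma.
x = 120.0
z = (x - mu) / sigma
p_direct = X.cdf(x)
p_via_z = Z.cdf(z)

# The 90th percentile of X is mu + sigma * (90th percentile of Z).
q_direct = X.inv_cdf(0.90)
q_via_z = mu + sigma * Z.inv_cdf(0.90)

print(p_direct, p_via_z)
print(q_direct, q_via_z)
```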

#### Sum of Normal Random Variables

If $X$ and $Y$ are independent, and $X \sim N(\mu_1, \sigma_1^2)$, $Y \sim N(\mu_2, \sigma_2^2)$, then

$$ X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) $$
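Python's `statistics.NormalDist` implements these rules: adding two instances treats them as independent and adds means and variances, while multiplying by or adding a constant applies the linear transformation. The parameter values below are assumed for illustration.

```python
from statistics import NormalDist

# Assumed independent Normal random variables (sigma is the standard deviation).
X = NormalDist(mu=10.0, sigma=3.0)
Y = NormalDist(mu=5.0, sigma=4.0)

# Sum of independent Normals: means add, variances add.
S = X + Y
print(S.mean, S.variance)  # 15 and 3^2 + 4^2 = 25

# Linear transformation aX + b with a = 2, b = 1.
T = X * 2 + 1
print(T.mean, T.variance)  # 2 * 10 + 1 = 21 and 2^2 * 3^2 = 36
```

The standard deviations do not add: the sum has standard deviation 5, not 7.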

### Exercise 3:

The daily high temperature in September in San Francisco is approximately Normally distributed with an average high of 70°F and a standard deviation of 8°F.  

(a) What is the probability that a day in September would have a high of 68°F or colder?  

(b) What high temperature needs to be achieved in order to be in the top 1% of daily high temps in September?

(c) Say the temperature right now is 72°F. What is the probability that today's high temperature is higher than 75°F?


### Exercise 4:

Customers arrive at a coffee shop at an average rate of 75 per hour during the morning rush.

(a) The coffee shop opens at 7am, on average, how long do they have to wait until the first customer arrives?

(b) What is the probability that the first customer arrives within the first 3 minutes?


### Exercise 5:

Let $X_i$ denote the weight of a randomly selected prepackaged one-pound bag of carrots. Of course, one-pound bags of carrots won't weigh exactly one pound. In fact, history suggests that $X_i$ is Normally distributed with a mean of 1.18 pounds and a standard deviation of 0.07 pound. Now, let $W$ denote the weight of a randomly selected prepackaged three-pound bag of carrots. Three-pound bags of carrots won't weigh exactly three pounds either. In fact, history suggests that W is Normally distributed with a mean of 3.22 pounds and a standard deviation of 0.09 pound. Selecting bags at random, what is the probability that the sum of three one-pound bags exceeds the weight of one three-pound bag?

## Relationship Between Two Random Variables

### The Joint Distribution Function

* When we deal with two discrete random variables, $X$ and $Y$, it is convenient to work with joint probabilities. We define the joint probability distribution to be    


$$ f_{X, Y} (x, y) = P(X = x \text{ and } Y = y) $$

* As usual, we require that  


$$f_{X, Y} (x, y) \geq 0 \text{ for any pairs } x, y$$  
    
$$\sum_{\text{all } x, y} f_{X, Y} (x, y) = 1$$

### Marginal Distributions

* Suppose we are interested only in $X$, yet have to work with the joint distribution of $X$ and $Y$. We can obtain the marginal distribution of $X$ as follows.  

* The marginal probabilities of $X$ and $Y$ are given by  
    
    $$ f_X(x) = \sum_y f_{X, Y} (x, y) $$
    $$ f_Y(y) = \sum_x f_{X, Y} (x, y) $$

### Independence

* Two random variables $X$ and $Y$ are called independent if the events ($X = x$) and ($Y = y$) are independent for all values of $x$ and $y$. That is, $X$ and $Y$ are independent if, for all $x$ and $y$:

$$ f_{X, Y} (x, y) = f_X (x) f_Y(y) $$

### Conditional Distributions

* Let $X$ and $Y$ be jointly distributed random variables. Then the conditional distribution of $X$ given $Y$ is given by

$$ f_{X|Y} (x | y) = P(X = x | Y = y) = \frac{f_{X, Y} (x, y)}{f_Y (y)} $$

* Note that for a given $y$ value, $P(X = x | Y = y)$ is a probability distribution. That is, for any value $y$

$$ \sum_{\text{all } x \text{ values}} P(X = x | Y = y) = 1 $$

### Conditional Expectation

* One useful application of conditional distributions is in calculating conditional expectations. You will see a lot more of this when we get to regression analysis.

* The basic idea is that given a conditional distribution, we can also calculate a conditional expectation:

$$ E(X | Y = y) = \sum_{\text{all } x \text{ values}} x P(X = x | Y = y) $$
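The definitions of marginal distributions, independence, conditional distributions and conditional expectation can be traced on a small joint table. The joint probabilities below are hypothetical, made up for this sketch, and stored as a dictionary keyed by $(x, y)$.

```python
# Hypothetical joint PMF f_{X,Y}(x, y); the four values sum to 1.
joint = {
    (0, 0): 0.20, (0, 1): 0.10,
    (1, 0): 0.30, (1, 1): 0.40,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginals: sum the joint PMF over the other variable.
f_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}
f_Y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Independence requires f_{X,Y}(x, y) = f_X(x) f_Y(y) for every pair.
independent = all(
    abs(joint[(x, y)] - f_X[x] * f_Y[y]) < 1e-12 for x in xs for y in ys
)

# Conditional distribution of X given Y = 1, and its expectation.
y0 = 1
cond_X = {x: joint[(x, y0)] / f_Y[y0] for x in xs}
E_X_given_y0 = sum(x * p for x, p in cond_X.items())

print(f_X, f_Y, independent)
print(cond_X, E_X_given_y0)
```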

### Covariance

* The covariance is a measure of the linear association of two random variables. Its sign reflects the direction of the association; if the variables tend to move in the same direction the covariance is positive. If the variables tend to move in opposite directions the covariance is negative.


$$ \begin{align}Cov(X, Y) = \sigma_{XY} &= E[(X - E(X))(Y - E(Y))] \\ &= \sum_{\text{all } x, y} (x - E(X))(y - E(Y)) f_{X, Y}(x, y)\end{align} $$

* The covariance can also be expressed as

$$ Cov(X, Y) = E(XY) - E(X)E(Y) $$

* Two interesting facts
    * $Cov(X, X) = Var(X)$    
    * if $X$ and $Y$ are **independent**, $Cov(X, Y) = 0$

### Covariance and Independence

* If $X$ and $Y$ are independent, then $Cov(X, Y) = 0$.
* But the reverse is not always true!
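A standard counterexample, not taken from the lecture: let $X$ be uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $Y$ is completely determined by $X$, yet the covariance is zero. The sketch uses exact fractions so that the result is not blurred by rounding.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2, so only three (x, y) pairs occur.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

E_X = sum(x * p for (x, y), p in joint.items())
E_Y = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())

# Cov(X, Y) = E(XY) - E(X)E(Y)
cov = E_XY - E_X * E_Y

# Independence would require P(X = 0, Y = 0) = P(X = 0) P(Y = 0).
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_x0_y0 = joint[(0, 0)]

print(cov)                   # 0
print(p_x0_y0, p_x0 * p_y0)  # 1/3 versus 1/9
```

Covariance only detects linear association, and the relationship here is purely quadratic.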


### Correlation: Covariance Rescaled


$$ \rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} $$

* This is a unitless measure of association.

* The correlation is always between -1 and 1, with 1 indicating a perfect positive linear relationship, -1 a perfect negative linear relationship and 0 no linear relationship between $X$ and $Y$.

### Combination of Random Variables

$$ \begin{align}
E((a + bX) + (c + dY)) &= a + bE(X) + c + dE(Y) \\
Var((a + bX) + (c + dY)) &= b^2Var(X) + d^2Var(Y) + 2bd \times Cov(X, Y)
\end{align}$$  
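These rules can be verified by computing the mean and variance of the combination directly from a joint distribution. The joint table and the constants $a, b, c, d$ below are hypothetical, and `expect` is a helper introduced only for this sketch.

```python
from math import sqrt

# Hypothetical joint PMF f_{X,Y}(x, y).
joint = {
    (0, 0): 0.20, (0, 1): 0.10,
    (1, 0): 0.30, (1, 1): 0.40,
}


def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


E_X = expect(lambda x, y: x)
E_Y = expect(lambda x, y: y)
var_X = expect(lambda x, y: x**2) - E_X**2
var_Y = expect(lambda x, y: y**2) - E_Y**2
cov = expect(lambda x, y: x * y) - E_X * E_Y
rho = cov / sqrt(var_X * var_Y)  # correlation: covariance rescaled

# Assumed constants for W = (a + bX) + (c + dY).
a, b, c, d = 1.0, 2.0, 3.0, -1.0

# Direct computation from the joint PMF.
E_W = expect(lambda x, y: a + b * x + c + d * y)
var_W = expect(lambda x, y: (a + b * x + c + d * y) ** 2) - E_W**2

# The same quantities from the combination rules.
mean_formula = a + b * E_X + c + d * E_Y
var_formula = b**2 * var_X + d**2 * var_Y + 2 * b * d * cov

print(E_W, mean_formula)
print(var_W, var_formula)
print(rho)
```

The constants $a$ and $c$ shift the mean but drop out of the variance.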

### Exercise 6:

A researcher suspected that the number of between-meal snacks eaten by students in a day during final examinations might depend on the number of tests a student had to take on that day. The accompanying table shows joint probabilities, estimated from a survey.  

|     |     |     |  X  |     |
|:---:|:---:|:---:|:---:|:---:|
|     |     | 0   | 1   |  2  |
|     | 0 | 0.10 | 0.09 | 0.08 |
|  Y  | 1 | 0.11 | 0.12 | 0.14 |
|     | 2 | 0.06 | 0.13 | 0.17 |


(a) What are the mean and standard deviation of $X$?  

(b) What are the mean and standard deviation of $Y$?  

(c) What is the covariance between $X$ and $Y$?  

(d) What is the correlation between $X$ and $Y$?  

(e) Are $X$ and $Y$ independent?  Justify your answer.  

(f) What is the conditional probability distribution of $Y$ given $X = 2$?  

(g) What is the expected value of $Y$ given $X = 2$?  

(h) Suppose your tummy stress level $S$, on a scale of 0 to 45 (no pain to extreme pain), is given by $S = 5X + 10Y$. Find the mean and variance of $S$.  


## Point Estimation  

**Example**  

Victor owns a coffee shop, and he wants to model the number of customers arriving at the coffee shop during morning rush hours, in order to schedule his staff more efficiently and provide better customer service. He recorded the number of customers who arrived per hour for a few mornings, and here is the data he collected: 74, 94, 93, 79, 74, 80, 73, 88, 63, 97.  

1) How should Victor model the data? What model should he use?  

2) What do we need to estimate in order to model the data?  

3) How can we estimate it/them?

### Maximum Likelihood Estimation (MLE)  

We want to find the value of the parameter that maximizes the likelihood function.

#### The likelihood function  

Probability mass function (discrete) / probability density function (continuous): $P(X = x | \text{ fixed known parameters})$  


The likelihood function: looks almost the same as above, **EXCEPT**:
* $x$ is fixed and known now  

* parameter values are unknown

#### The log-likelihood function  

To maximize the likelihood function, we usually take the log of it and maximize the log-likelihood function instead for the following reasons:  

* Maximizing the log-likelihood function is equivalent to maximizing the likelihood function  


* It is often easier to take the derivative of a log-likelihood function and solve for the parameter being maximized  
    * i.e. it is easier to work with sums than products

* It makes computation easier
    * one issue with calculating small probabilities with a computer is numerical underflow - once the values get sufficiently small, they will be rounded to 0 and we will lose all information

#### Maximization  

To maximize the log-likelihood, we introduced two methods:  

* Analytical method: take the derivative of the function with respect to the parameter(s), set it/them to 0 and solve for the parameter value(s)  


* Numerical method: for a range of values that the parameter(s) can take, calculate the corresponding log-likelihood. The value(s) that give the maximum log-likelihood are the MLE.  
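Both methods can be sketched on Victor's hourly counts, assuming a Poisson($\lambda$) model for the data. The grid range (60 to 100) and step (0.1) are arbitrary choices for this illustration. For the Poisson model, setting the derivative $\sum_i x_i / \lambda - n$ of the log-likelihood to 0 gives the sample mean as the analytical MLE.

```python
from math import lgamma, log

# Hourly customer counts from the example above.
counts = [74, 94, 93, 79, 74, 80, 73, 88, 63, 97]


def poisson_log_likelihood(lam: float, data: list) -> float:
    """Sum of log P(X = x) over the data for a Poisson(lam) model."""
    # lgamma(x + 1) equals log(x!)
    return sum(x * log(lam) - lam - lgamma(x + 1) for x in data)


# Numerical method: evaluate the log-likelihood on a grid of candidates.
grid = [60 + i / 10 for i in range(401)]  # 60.0, 60.1, ..., 100.0
lam_numeric = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))

# Analytical method: the derivative is zero at the sample mean.
lam_analytic = sum(counts) / len(counts)

print(lam_numeric, lam_analytic)
```

The grid answer can only be as precise as the grid step, so in practice the grid is refined around the best value or replaced by a numerical optimizer.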

### Method of Moments (MOM)

We make method of moments estimates by equating the population moments to the sample moments, and solve for the parameters we are trying to estimate.  

First, the distinction between population moments and sample moments:  

* Population moments are like attributes of the population or distribution, $E(X), E(X^2), E(X^3), \dots$
    * population moments are functions of population parameters


* Sample moments are the estimates of these attributes based on the sample, $\frac{1}{n} \sum_{i = 1}^n X_i, \frac{1}{n} \sum_{i = 1}^n X_i^2, \frac{1}{n} \sum_{i = 1}^n X_i^3,  \dots$  

* Similar to the idea of population parameters vs sample statistics, population moments are unknown but fixed numbers, while sample moments can be calculated from the sample but vary from sample to sample.  


* Population moments are not actually equal to sample moments, but in order to estimate the population parameters expressed in the population moments, we set them equal and solve for the parameters
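As an illustration with two parameters, take the Gamma($k$, $\lambda$) distribution, for which $E(X) = \frac{k}{\lambda}$ and $E(X^2) = Var(X) + (E(X))^2 = \frac{k(k + 1)}{\lambda^2}$. Equating these to the first two sample moments $m_1$ and $m_2$ and solving gives $\hat{\lambda} = \frac{m_1}{m_2 - m_1^2}$ and $\hat{k} = \frac{m_1^2}{m_2 - m_1^2}$. The data in the sketch below are hypothetical.

```python
# Hypothetical sample, assumed to come from a Gamma(k, lam) distribution.
data = [2.1, 3.4, 1.7, 4.0, 2.8, 3.1, 2.2, 3.7]
n = len(data)

# First and second sample moments.
m1 = sum(data) / n
m2 = sum(x**2 for x in data) / n

# Solve  k / lam = m1  and  k (k + 1) / lam^2 = m2  for k and lam.
lam_hat = m1 / (m2 - m1**2)
k_hat = m1**2 / (m2 - m1**2)

print(k_hat, lam_hat)

# The fitted population moments reproduce the sample moments.
print(k_hat / lam_hat, m1)
print(k_hat * (k_hat + 1) / lam_hat**2, m2)
```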

### Mean Squared Error (MSE)  

One way to evaluate the estimators is to use the mean squared error (MSE).  

$$MSE = E[(\hat{\theta} - \theta)^2] = Var(\hat{\theta}) + (Bias(\hat{\theta}))^2 $$  

where $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$
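The decomposition can be checked by simulation. The sketch below uses a deliberately biased estimator of a Normal mean, $0.9 \bar{X}$, which is a hypothetical choice made only for illustration, as are the values of $\mu$, $\sigma$, $n$ and the number of replications.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the run is reproducible

mu, sigma, n = 5.0, 2.0, 10  # assumed true mean, sd and sample size
reps = 20000                 # number of simulated samples


def shrunk_mean(sample):
    """A biased estimator of mu (hypothetical): shrink the sample mean."""
    return 0.9 * fmean(sample)


estimates = [
    shrunk_mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
]

# Simulated MSE, variance and bias of the estimator.
mse = fmean((e - mu) ** 2 for e in estimates)
center = fmean(estimates)
var = fmean((e - center) ** 2 for e in estimates)
bias = center - mu

print(mse, var + bias**2)  # the two sides of the decomposition
print(bias, var)           # theory: bias = -0.5, var = 0.81 * 4 / 10
```

The identity holds exactly for the simulated values, while the bias and variance themselves only approximate their theoretical values.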

**Example**   

Victor doesn't like our way of estimating $\lambda$, and he came up with his own estimator for $\lambda$,  

$$ \hat{\lambda} = \frac{1}{4} X_1 + \frac{1}{2} X_2 + \frac{1}{4} X_3 $$  

What is the MSE of his estimator?  

## Central Limit Theorem

* The CLT states that if random samples of size $n$ are repeatedly drawn from any population with mean $\mu$ and variance $\sigma^2$, then when $n$ is large, the distribution of the sample means will be approximately Normal:  


$$ \bar{X} \dot{\sim} N(\mu, \frac{\sigma^2}{n}) $$  

**How Large is Large Enough?**  

* For **most** distributions, $n > 30$ will give a sampling distribution that is nearly Normal   


* For **fairly symmetric** distributions, $n > 15$  


* For **Normal** population distributions, the sampling distribution of the mean is always Normally distributed
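A small simulation makes the theorem concrete. The sketch draws repeated samples from an Exponential(1) population, which is strongly skewed with $\mu = 1$ and $\sigma^2 = 1$, and looks at the distribution of the sample means. The sample size, number of replications and seed are assumptions for this illustration.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(1)  # fixed seed so the run is reproducible

lam = 1.0    # Exponential(1): mean 1, variance 1
n = 40       # size of each sample
reps = 5000  # number of samples drawn

sample_means = [
    fmean(random.expovariate(lam) for _ in range(n)) for _ in range(reps)
]

center = fmean(sample_means)   # should be close to mu = 1
spread = pstdev(sample_means)  # should be close to sigma / sqrt(n)
se = 1.0 / sqrt(n)

# If the sample means are roughly Normal, about 95% of them
# fall within 1.96 standard errors of mu.
coverage = fmean(
    1.0 if abs(m - 1.0) <= 1.96 * se else 0.0 for m in sample_means
)

print(center, spread, se, coverage)
```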

**Example**  

The service times for customers coming to the coffee shop have a mean of 1 minute and a variance of 1 at each counter. What is the probability that 80 customers can be served in less than 1 hour by one counter?

## Confidence Intervals

### For one population mean

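* With known population standard deviation $\sigma$, the 95% confidence interval is **z-based**: 

$$ \bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}} $$  

* With unknown population standard deviation, the interval is **t-based**, using the sample standard deviation $s$ and the critical value $t$ from the $t$ distribution with $n - 1$ degrees of freedom: 

$$ \bar{X} \pm t \left( \frac{s}{\sqrt{n}} \right) $$  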

**Example**  

New York is known as "the city that never sleeps". A random sample of 25 New Yorkers were asked how much sleep they get per night. Statistical summaries of these data are shown below. Construct a 95% confidence interval for the average amount of sleep New Yorkers get per night.  

<img src="images/ny_sleep.png">

### For one population proportion  

The 95% confidence interval is given by:  

$$ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $$  
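A minimal Python sketch of this interval, using hypothetical survey counts (120 successes out of 400). The multiplier 1.96 is the 97.5th percentile of the standard Normal distribution, which leaves 2.5% in each tail; here it is obtained from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data: 120 successes in a sample of 400.
successes, n = 120, 400
p_hat = successes / n

# Critical value for 95% confidence.
z = NormalDist().inv_cdf(0.975)

# Standard error of the sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

lower, upper = p_hat - z * se, p_hat + z * se
print(round(z, 2), lower, upper)
```

Changing `0.975` to `0.995` would give the multiplier for a 99% interval.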

We always construct **z-based** confidence intervals for proportions.

### Difference in means  

* For large sample sizes, the 95% confidence interval for the difference in two population means is given by:  

$$ \bar{X} - \bar{Y} \pm 1.96 \sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}} $$  

* For small sample sizes, replace 1.96 with a critical value $t$ from the $t$ distribution:  

$$ \bar{X} - \bar{Y} \pm t \sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}} $$

### Difference in proportions  

The 95% confidence interval for the difference in two population proportions:  

$$ (\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} $$