<table align="left" style="border-style: hidden" class="table"> <tr><td class="col-md-2"><img style="float" src="../icon.png" alt="Prob140 Logo" style="width: 120px;"/></td><td><div align="left"><h3 style="margin-top: 0;">Probability for Data Science</h3><h4 style="margin-top: 20px;">UC Berkeley, Spring 2023</h4><p>Ani Adhikari</p>CC BY-NC-SA 4.0</div></td></tr></table><!-- not in pdf -->

This content is protected and may not be shared, uploaded, or distributed.

In [None]:
# Run this cell to set up your notebook

# These lines make warnings go away
import warnings
warnings.filterwarnings('ignore')

import numpy as np
from scipy import stats
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Homework 14

### Instructions

Your homework will generally have two components: a written portion and a portion that also involves code. Written work should be completed on paper, and coding questions should be done in the notebook. Start the work for the written portions of each section on a new page. You are welcome to $\LaTeX$ your answers to the written portions, but staff will not be able to assist you with $\LaTeX$-related issues.

It is your responsibility to ensure that both components of the homework are submitted completely and properly to Gradescope. **Make sure to assign each page of your PDF to the correct question. Refer to the bottom of the notebook for submission instructions.**

Every answer should contain a calculation or reasoning. For example, a calculation such as $(1/3)(0.8) + (2/3)(0.7)$ or `sum([(1/3)*0.8, (2/3)*0.7])` is fine without further explanation or simplification. If we want you to simplify, we'll ask you to. But just ${5 \choose 2}$ by itself is not fine; write "we want any 2 out of the 5 frogs and they can appear in any order" or whatever reasoning you used. Reasoning can be brief and abbreviated, e.g. "product rule" or "not mutually exclusive."

## 1. Predicting Scores ##

[Your answers to this question should be decimal values or equations with numerical coefficients. For the arithmetic, you are welcome to use the code cell at the end of the question. It's just there for your convenience – we won't read it.]

Grades in a class are based on a linear combination of a final exam (worth 50% of the grade), a midterm (worth 30%), and homework (worth 20%). Let the random vector $[F ~~ M ~~ H]^T$ consist of the final, midterm, and homework scores of a randomly picked student.

Suppose the mean vector of $[F ~~ M ~~ H]^T$ is $[60 ~~ 55 ~~ 80]^T$ and the covariance matrix is

$$
\begin{bmatrix}
121 & 80 & 10 \\
80 & 144 & 15 \\
10 & 15 & 9
\end{bmatrix}
$$

Suppose the distribution of $[F ~~ M ~~ H]^T$ is multivariate normal. 

**a)** Find the distribution of the student's overall score $S = 0.5F + 0.3M + 0.2H$.

**b)** The instructor wonders whether the final exam score $F$ can just be predicted by a linear function of the random variable $X = 0.3M + 0.2H$. Explain whether the least squares predictor of $F$ based on $X$ is linear and why. Then find the least squares predictor of $F$ based on $X$.

**c)** Find the root mean squared error of the predictor in Part **b**.
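
As a warm-up for the arithmetic, here is a minimal sketch of how the mean and variance of a linear combination $\mathbf{a}^T\mathbf{Z}$ of a random vector $\mathbf{Z}$ might be computed with NumPy, using $E(\mathbf{a}^T\mathbf{Z}) = \mathbf{a}^T\boldsymbol{\mu}$ and $Var(\mathbf{a}^T\mathbf{Z}) = \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a}$. The two-component vector and all the numbers in it are hypothetical and are not the numbers in this question. The names `a`, `mu_z`, and `cov_z` are illustrative, and plain `np.array` objects are an assumption made for this sketch only.

```python
import numpy as np

# Hypothetical two-component example (NOT the numbers in this question)
a = np.array([0.6, 0.4])            # weights of the linear combination
mu_z = np.array([70, 65])           # mean vector of Z
cov_z = np.array([[100, 30],
                  [30, 64]])        # covariance matrix of Z

# Mean of a^T Z is a^T mu
combo_mean = a @ mu_z

# Variance of a^T Z is a^T Sigma a
combo_var = a @ cov_z @ a

print(combo_mean, combo_var, np.sqrt(combo_var))
```

The same pattern extends to longer vectors: only the lengths of `a` and `mu_z` and the dimensions of `cov_z` change.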

In [None]:
# Weights, mean vector, and covariance matrix for [F, M, H]
weight = np.matrix([0.5, 0.3, 0.2])
mu = np.matrix([60, 55, 80])
cov = np.matrix([[121, 80, 10],
                 [80, 144, 15],
                 [10, 15, 9]])

...

\newpage

## 2. Normal Sample Mean and Sample Variance, Part 1 ##

Let $X_1, X_2, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Let

$$
\bar{X} ~ = ~ \frac{1}{n} \sum_{i=1}^n X_i
$$

denote the sample mean. In [Homework 8](http://prob140.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https://github.com/prob140/materials-sp23&branch=main&subPath=hw/Homework_08.ipynb), you constructed a random variable

$$S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2$$ 

called the sample variance. Before proceeding, please review Homework 8, Question 1 in its entirety.
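
To make the definition concrete, the following sketch computes $S^2$ directly from the formula and compares it with NumPy's built-in calculation. The sample is simulated and hypothetical (its size, mean, and SD are arbitrary choices), and the variable names are illustrative. The only library fact relied on is that `np.var` with `ddof=1` divides by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 25 draws; the parameters are arbitrary
x = rng.normal(loc=10, scale=2, size=25)
n = len(x)

# Sample mean from the definition
x_bar = x.sum() / n

# Sample variance from the definition, dividing by n - 1
s2_manual = ((x - x_bar) ** 2).sum() / (n - 1)

# NumPy's version: ddof=1 makes the divisor n - 1 instead of n
s2_numpy = np.var(x, ddof=1)

print(s2_manual, s2_numpy)
```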

**a)** For $1 \le i \le n$ let $D_i = X_i - \bar{X}$. Find $Cov(D_i, \bar{X})$.

**b)** Now assume in addition that $X_1, X_2, \ldots, X_n$ are i.i.d. normal $(\mu, \sigma^2)$. What is the joint distribution of $\bar{X}, D_1, D_2, \ldots, D_{n-1}$? Explain why $D_n$ isn't on the list.

**c)** True or false (justify your answer): The sample mean and sample variance of an i.i.d. normal sample are independent of each other.

\newpage

## 3. Normal Sample Mean and Sample Variance, Part 2 ##

**a)** Let $R$ have the chi-squared distribution with $n$ degrees of freedom. What is the mgf of $R$?

**b)** For $R$ as in Part (a), suppose $R = V + W$ where $V$ and $W$ are independent and $V$ has the chi-squared distribution with $m < n$ degrees of freedom. Can you identify the distribution of $W$? Justify your answer.

**c)** Let $X_1, X_2, \ldots , X_n$ be any sequence of random variables and let $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$. Let $\alpha$ be any constant. Prove the *sum of squares decomposition*

$$
\sum_{i=1}^n (X_i - \alpha)^2 ~=~ \sum_{i=1}^n (X_i - \bar{X})^2 ~+~ n(\bar{X} - \alpha)^2
$$
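
A numerical check is not a proof, but it can be a useful sanity check on the identity above before you start the algebra. The sketch below evaluates both sides for one hypothetical list of numbers and one arbitrary value of $\alpha$. The data, the constant, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values; the identity does not depend on their distribution
x = rng.exponential(scale=3.0, size=12)
alpha = 1.7                      # arbitrary constant
n = len(x)
x_bar = x.mean()

# Left side: sum of squared deviations from alpha
lhs = np.sum((x - alpha) ** 2)

# Right side: squared deviations from the mean, plus n times
# the squared distance between the mean and alpha
rhs = np.sum((x - x_bar) ** 2) + n * (x_bar - alpha) ** 2

print(lhs, rhs, np.isclose(lhs, rhs))
```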

**d)** Now let $X_1, X_2, \ldots , X_n$ be i.i.d. normal with mean $\mu$ and variance $\sigma^2 > 0$. Let $S^2$ be the "sample variance" defined by 

$$
S^2 ~=~ \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2
$$

Find a constant $c$ such that $cS^2$ has a chi-squared distribution. Provide the degrees of freedom.

[Use Parts (b) and (c) as well as the result of the previous exercise (Question 2).]

\newpage

## 4. Multiple Regression: Residuals ##

This exercise assumes the multiple regression model of [Section 25.4](http://prob140.org/textbook/content/Chapter_25/04_Multiple_Regression.html) of the textbook and uses the same notation as in that section. 

The regression estimate $\hat{\mathbf{Y}}$ can be written as $\mathbf{HY}$ for an $n \times n$ matrix $\mathbf{H}$. This matrix is called the *hat matrix*, probably because it "puts the hat on $\mathbf{Y}$."
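
To see numerically that the fitted values are a fixed matrix times $\mathbf{Y}$, the sketch below builds such a matrix column by column without using any closed-form expression for it: column $j$ is the least squares fit to the $j$th standard basis vector. The design matrix (with a column of 1's), the response, the dimensions, and the helper `fitted_values` are all hypothetical. The fit uses `np.linalg.lstsq`, which is one of several ways to compute least squares estimates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design matrix: n = 6 rows, an intercept column and 2 predictors
n, p = 6, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)           # hypothetical response vector

def fitted_values(y):
    """Least squares fitted values for response y regressed on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat

# Column j of H is the fitted vector for the j-th standard basis vector
H = np.column_stack([fitted_values(e) for e in np.eye(n)])

# Multiplying Y by H reproduces the fitted values computed directly
Y_hat = fitted_values(Y)
print(np.allclose(H @ Y, Y_hat))
```

Because the fit is a linear function of the response, the matrix built this way does not depend on the particular $\mathbf{Y}$ used in the final comparison.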

**a)** Write $\mathbf{H}$ in terms of $\mathbf{X}$. Is $\mathbf{H}$ symmetric?

**b)** Show that $\mathbf{H}$ is idempotent. (If you haven't seen that term before, look it up.)

**c)** Find the distribution of the residual vector $\mathbf{e}$.

**d)** Show that the covariance matrix of $\mathbf{e}$ is $\sigma^2(\mathbf{I} - \mathbf{H})$.

## Submission Instructions ##

Many assignments throughout the course will have a written portion and a code portion. Please follow the directions below to properly submit both portions.

### Written Portion ###
* Scan all the pages into a PDF. You can use any scanner or a phone scanning app such as CamScanner. Please **DO NOT** simply take pictures using your phone.
* Please start a new page for each question. If you have already written multiple questions on the same page, you can crop the image in CamScanner or fold your page over (the old-fashioned way). This helps expedite grading.
* It is your responsibility to check that all the work on all the scanned pages is legible.
* If you used $\LaTeX$ to do the written portions, you do not need to do any scanning; you can just download the whole notebook as a PDF via LaTeX.

### Code Portion ###
* Save your notebook using `File > Save and Checkpoint`.
* Generate a PDF file using `File > Download As > PDF via LaTeX`. This might take a few seconds and will automatically download a PDF version of this notebook.
    * If you have issues, please post a follow-up on the general Homework 14 Ed thread.
    
### Submitting ###
* Combine the PDFs from the written and code portions into one PDF. [Here](https://smallpdf.com/merge-pdf) is a useful tool for doing so. 
* Submit the assignment to Homework 14 on Gradescope. 
* **Make sure to assign each page of your PDF to the correct question.**
* **It is your responsibility to verify that all of your work shows up in your final PDF submission.**

If you are having difficulties scanning, uploading, or submitting your work, please read the [Ed Thread](https://edstem.org/us/courses/35049/discussion/2398718) on this topic and post a follow-up on the general Homework 14 Ed thread.

## **We will not grade assignments that do not have pages selected for each question.** ##