## Assignments

### Statistics 240, spring 2023

There are two term projects, one involving open-source code contributions and a longer one involving data analysis, and several shorter assignments.

### Term project 1: Contributing to an open-source package and joining the academic publishing ecosystem

+ If you do not already have an ORCID, sign up for one at https://orcid.org/. 

+ See https://github.com/statlab/permute and https://statlab.github.io/permute/

+ Read https://statlab.github.io/permute/dev/index.html

+ Clone https://github.com/statlab/permute to your own device or to datahub. Set upstream to be a repository within your Berkeley GitHub repo for Stat 240.


**Part 1.** Unit tests.

+ Look at the code coverage for the latest build, https://app.codecov.io/gh/statlab/permute

+ Identify at least one function that does not have complete test coverage.

+ Create a testing branch from your `permute` main branch

+ Write a unit test that exercises functionality that was not previously tested. Document your unit test
using an appropriate docstring that follows PEP 8 (https://peps.python.org/pep-0008/) and PEP 257 (https://peps.python.org/pep-0257/).

+ Verify that your test increases the coverage, using codecov. GitHub automation can be configured to
run codecov whenever you push your branch and/or make a pull request.
That's how it's set up at https://github.com/statlab/permute

+ When everything is verified, make a pull request at https://github.com/statlab/permute to have your test included
in the package. Congratulations! You've made a pull request for an open source project! If the moderator approves
your pull request, you will be listed as a contributor.

+ Bonus: find one or more existing unit tests in `permute` that do not exercise the code well and write better unit tests to replace them. Verify that you have not decreased the coverage. Make a pull request to the project once you have verified everything.
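
To give a rough idea of the shape of a documented unit test, here is a minimal sketch. The function `two_sample_mean_diff` is a hypothetical stand-in defined in the block itself; it is not part of `permute`, and your test should target a real function in the package whose coverage is incomplete. The sketch shows one test of the main code path and one test of an error branch, since error branches are often the untested lines.

```python
import math


def two_sample_mean_diff(x, y):
    """Return mean(x) - mean(y) for two non-empty sequences of numbers.

    Hypothetical stand-in for a package function; NOT part of `permute`.
    """
    if len(x) == 0 or len(y) == 0:
        raise ValueError("both samples must be non-empty")
    return sum(x) / len(x) - sum(y) / len(y)


def test_two_sample_mean_diff_value():
    """The statistic equals the difference of the two sample means."""
    # mean([1, 2, 3]) = 2 and mean([0, 1]) = 0.5, so the difference is 1.5
    assert math.isclose(two_sample_mean_diff([1, 2, 3], [0, 1]), 1.5)


def test_two_sample_mean_diff_empty_sample():
    """An empty sample raises ValueError (a branch that is easy to leave untested)."""
    try:
        two_sample_mean_diff([], [1.0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an empty sample")
```

Assuming `pytest` and the `pytest-cov` plugin are installed locally, running `pytest --cov` before you push is one way to preview whether the new tests touch lines that were previously uncovered.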

**Part 2.** New functionality.

There are many nonparametric tests and confidence procedures not yet in `permute`, for instance:

+ tests and confidence intervals/sequences for bounded means using betting martingales, for a variety of betting strategies, and using ALPHA

+ tests and confidence intervals for the treatment effect for binary treatment and binary outcomes

+ tests and confidence sets for percentiles of the treatment effect for bounded outcomes

+ exact tests and confidence intervals for the mean from stratified samples, including methods based on greedy discrete optimization and on supermartingales, $E$-values, and union-intersection tests.

+ Gaffke's bound

+ Romano's projection of the empirical approach to testing nonparametric hypotheses, including symmetry, exchangeability, and independence

Code up one of those, or some other nonparametric method from the course that is not yet in `permute`.
Make sure the calling signature of your function is parallel to that of similar functions in the package.
Make sure the function has the right level of abstraction, and that you provide implementations of
a number of sub-functions if that is warranted. 
For instance, for methods based on betting martingales,
you should implement a number of betting strategies; for the ALPHA approach, you should implement
a number of "estimators." 
If the method can be computed exactly in some circumstances but needs to be approximated/simulated/randomized 
in others, provide both options.
Document your function. Write unit tests that cover its functionality completely. 
When you're satisfied that it's correct, complete, tidy, and well documented, 
make a pull request to the project.
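
As an illustration of the kind of abstraction intended, here is a minimal sketch of a betting-martingale test in which the betting strategy is passed in as a function. It assumes observations in $[0, 1]$ and the one-sided null that the conditional mean of each observation given the past is at most `mu0`. All names (`fixed_bet`, `running_mean_bet`, `betting_p_value`) are hypothetical, not part of `permute`, and the calling signature would need to be adapted to match similar functions in the package.

```python
def fixed_bet(lam):
    """Strategy that always bets the same amount lam (hypothetical helper)."""
    def bet(past, mu0):
        return lam
    return bet


def running_mean_bet(scale=0.5):
    """Strategy that bets more when the mean of past data is further above mu0.

    The bet uses only past observations, so it is predictable.
    """
    def bet(past, mu0):
        if len(past) == 0:
            return 0.0
        excess = sum(past) / len(past) - mu0
        return scale * max(excess, 0.0) / mu0
    return bet


def betting_p_value(x, mu0, bet):
    """Anytime-valid P-value for H0: E[X_i | past] <= mu0, with X_i in [0, 1].

    The wealth process is prod_i (1 + lam_i * (x_i - mu0)), where lam_i is
    chosen by `bet` from x[:i] and clipped to [0, 1/mu0] so that the wealth
    stays nonnegative. By Ville's inequality, 1 / (running maximum of the
    wealth) is a valid P-value under H0.
    """
    if not 0 < mu0 < 1:
        raise ValueError("mu0 must be in (0, 1)")
    wealth = 1.0
    max_wealth = 1.0
    for i, xi in enumerate(x):
        if not 0 <= xi <= 1:
            raise ValueError("observations must be in [0, 1]")
        lam = min(max(bet(x[:i], mu0), 0.0), 1.0 / mu0)
        wealth *= 1.0 + lam * (xi - mu0)
        max_wealth = max(max_wealth, wealth)
    return min(1.0, 1.0 / max_wealth)


data = [0.9, 0.7, 1.0, 0.8, 0.6, 0.9, 1.0, 0.7]
for name, strategy in [("fixed", fixed_bet(1.0)), ("running mean", running_mean_bet())]:
    print(name, betting_p_value(data, 0.5, strategy))
```

Separating the strategy from the wealth computation means that adding a new betting strategy does not require touching the test itself, which is one way to get "the right level of abstraction."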

### Term project 2: Nonparametric data analysis

Identify a paper that interests you in which the data were analyzed parametrically, but for which a nonparametric
analysis is justified by the way the data were collected, e.g., by randomization.

[MORE TO COME]

### Assignment 1.
Due 1/24/2023, 11:59pm.

1. Identify the question and source of data for your term project.

1. Let $A$, $B$ and $C$ be sets.
Show that $A \cup (B \cap C ) = (A \cup B ) \cap (A \cup C)$, and
$A \cap (B \cup C) = (A \cap B ) \cup (A \cap C )$.

1. Let $A$ and $B$ be sets.
Show that $A-B = \emptyset$ implies $A \subset B$.

1. Show that for any sets $A$, $B$, $C$, $D$,
$(A \times B) \cap (C \times D) = (A \cap C) \times (B \cap D)$.

1. Show that for any function $f$ with domain $\mathcal{X}$, if $A, B \subset \mathcal{X}$,
then $f(A \cap B) \subset fA \cap fB$, and that
$f(A \cup B ) = fA \cup fB$.

1. Let $f$ be a function with co-domain $\mathcal{Y}$, and $A, B \subset \mathcal{Y}$.
Does $f^{-1} (A \cap B) = f^{-1}A \cap f^{-1}B$?
Does $f^{-1} (A \cup B ) = f^{-1}A \cup f^{-1} B$? 

1. Let $f$ have domain $\mathcal{X}$ and co-domain $\mathcal{Y}$, and suppose that $A \subset \mathcal{X}$
and $B \subset \mathcal{Y}$.
Does $f^{-1}(f(A)) = A$?
Does $f(f^{-1}B) = B$?

1. Let $\mathcal{G}$ be a group with identity $e$.
Show that $ae = (a^{-1})^{-1} = a$. (That is, show that $e$ is not only the identity from the left, it
is also the identity from the right, and that if $a^{-1}a = e$, then
$aa^{-1} = e$.)

1. Let $a, b, c, d \in F$, where $F$ is a field.
Show that if $b,d \ne 0$, then $a/b+c/d = (ad+bc)/(bd)$.

1. Show that $A= \{0,1,2, \ldots, p-1 \}$ with $p$ prime is a field, if
addition and multiplication are defined modulo $p$.
What breaks down if $p$ is not prime?
For $p=7$, show that the multiplicative inverse of 2 is 4.

1. Suppose $\{ M_i \}_{i \in \mathbb{N}}$ is a countable collection of supermartingales with respect to the
same filtration. Show that every positive linear combination of any finite subset of them is a supermartingale with
respect to the same filtration, and that the expected value of the first term in that supermartingale is
the same positive linear combination of the expected values of the first terms.

1. In light of the previous result, a convex combination of a finite collection of nonnegative supermartingales (on the same filtration) starting at one is a nonnegative supermartingale starting at one. How can that be used to construct $E$-values?

1. Is there an analogous result when the supermartingales are not defined with respect to a common filtration?
How might you construct a common filtration from the "marginal" filtrations?
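
For the problem on arithmetic modulo $p$, a brute-force numerical check can help you see what to prove, although it is not a proof. The sketch below uses a hypothetical helper, `multiplicative_inverses`, that lists for each nonzero element every element that multiplies with it to give 1 modulo $p$.

```python
def multiplicative_inverses(p):
    """Map each a in {1, ..., p-1} to the list of b in {1, ..., p-1} with a*b = 1 (mod p)."""
    return {a: [b for b in range(1, p) if (a * b) % p == 1] for a in range(1, p)}


inv7 = multiplicative_inverses(7)
print(inv7[2])  # [4], since 2 * 4 = 8 = 1 (mod 7)

inv6 = multiplicative_inverses(6)
# Elements with no multiplicative inverse modulo 6
print([a for a, bs in inv6.items() if not bs])  # [2, 3, 4]
```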




### Assignment 2.

1. Write a brief (1–2 page) research proposal for your term project.  The proposal
should include the scientific question you intend to address, the source of
the data, and the primary reference you plan to use.
If you need help picking a topic, see me.

1. Explain the "randomization model" and some of its advantages and limitations.
State the strong null hypothesis for the randomization model and the typical weak null.

1. For the randomization model, state some pros and cons of the Wilcoxon
rank-sum test versus a permutation test based on the sample mean or on the
$t$ statistic, and of both compared
with the parametric two-sample Student $t$-test.
State the null hypothesis for each of the tests.

1. Consider an experiment involving 9 subjects, 5 assigned at random to treatment
and 4 to control.
We want to test the null hypothesis that "treatment makes no difference"
at significance level 10%.
For each individual, we measure a quantitative response.
Think about the omnibus alternative, about the one-sided shift alternative
that treatment increases the response, and about the two-sided shift
alternative that treatment increases or decreases the response.
Consider four tests: the Wilcoxon rank-sum test (using mid-ranks for ties), a
permutation test based on the difference in the sample means for the control and
the treatment groups, the Smirnov test, and a 2-sample $t$-test based on the
difference in the sample means for the control and treatment groups.

    1. Explain the assumptions of each test, including a precise statement of the null hypotheses.

    1. State strong and weak versions of each null hypothesis. Note whether the nominal significance level of each test is for the strong or weak null hypothesis.

    1. Find or estimate by simulation the power of each test against the one-sided shift alternative that treatment increases each individual's response by 1 unit. If the power depends on additional unspecified features of the treatment effect or on features of the baseline responses of the subjects, explain what those features are, and find the power for a few different values of those features.

    1. For one-sided versions of the Wilcoxon rank-sum test, the permutation test using the sample mean, and the $t$-test, and for the Smirnov test, find the $P$-values for the following hypothetical data, and the power against a shift alternative that treatment increases the response by one unit:

        | treatment | 1 | 2   | 3   | 3   | 4 |
        |-----------|--:|----:|----:|----:|--:|
        | control   | 0 | 1.5 | 2.5 | 3.5 |   |

    1. Which test do you think is best in this situation, absent any additional information about the nature of the experiment? Why?

5. Consider the Smirnov test for an experiment involving 5 subjects, 2 assigned at random to treatment and 3 to control.

    1. Enumerate all possible values of the test statistic and their probabilities under the strong null hypothesis of no treatment effect, assuming no ties among the data.
    
    1. Now estimate the probabilities by simulation, using 10,000 replications.

    1. Calculate the true (theoretical) standard error of the simulated probabilities.

    1. What is the joint distribution of the number of times the test statistic takes each of its possible values?

    1. Sketch how you would test the hypothesis that the true probabilities are equal to the values you calculated in the first part of this question using the empirical frequencies you observed in the second part.

    1. What software package are you using to do the calculations? What is its default algorithm for calculating pseudo-random numbers? What is the period of that pseudo-random number generator? What is the largest number of objects for which that generator can give you all permutations? Is there an option in the package to use a better pseudo-random number generator?  If so, which one?
    
6. Give statistical interpretations and theoretical justifications for using $\mbox{hits}/\mbox{reps}$ and $(\mbox{hits}+1)/(\mbox{reps}+1)$ as the simulation $P$-value. Which do you prefer? Why?

7. Give five real-world examples of a scientific null hypothesis that can be expressed as the invariance of
a probability model for the data under the action of some group. In each case, identify the "scientific" invariance and the corresponding group invariance for the data. Explain how you could use that invariance to test the scientific
hypotheses without knowing anything else about the probability model for the data.
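
To fix ideas about exact versus simulated permutation $P$-values, here is a minimal sketch for the one-sided permutation test based on the difference in sample means, applied to the hypothetical treatment and control data in the table above. It enumerates all assignments of 5 of the 9 subjects to treatment, then estimates the same $P$-value by simulation using $(\mbox{hits}+1)/(\mbox{reps}+1)$. The function names are hypothetical, and the simulation uses Python's standard `random` module (a Mersenne Twister generator), which is itself something to think about in light of the question on pseudo-random number generators.

```python
import itertools
import random

treatment = [1, 2, 3, 3, 4]
control = [0, 1.5, 2.5, 3.5]


def mean_diff(pooled, treated_idx):
    """Mean of the responses labeled treated minus the mean of the rest."""
    treated = [pooled[i] for i in treated_idx]
    n_t = len(treated)
    n_c = len(pooled) - n_t
    return sum(treated) / n_t - (sum(pooled) - sum(treated)) / n_c


def exact_p_value(treatment, control):
    """One-sided P-value from enumerating every assignment to treatment."""
    pooled = list(treatment) + list(control)
    n, n_t = len(pooled), len(treatment)
    observed = mean_diff(pooled, range(n_t))
    stats = [mean_diff(pooled, idx) for idx in itertools.combinations(range(n), n_t)]
    # small tolerance so that ties with the observed value count as hits
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)


def simulated_p_value(treatment, control, reps=10_000, seed=12345):
    """One-sided simulation P-value, (hits + 1) / (reps + 1)."""
    rng = random.Random(seed)
    pooled = list(treatment) + list(control)
    n, n_t = len(pooled), len(treatment)
    observed = mean_diff(pooled, range(n_t))
    hits = 0
    for _ in range(reps):
        idx = rng.sample(range(n), n_t)  # random assignment of n_t subjects to treatment
        if mean_diff(pooled, idx) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (reps + 1)


print("exact:", exact_p_value(treatment, control))
print("simulated:", simulated_p_value(treatment, control))
```

With only $\binom{9}{5} = 126$ possible assignments, enumeration is cheap here; simulation becomes necessary when the number of assignments is too large to enumerate.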



### Assignment 3.

1. Show that the collection of all sets of the form $(-\infty, x] \times (-\infty, y]$
is a Vapnik-Chervonenkis class (V-C class) over the plane.

1. Show that intersections and finite unions of V-C classes are V-C classes. Show that countable unions of V-C classes need not be V-C classes.

1. The file https://statistics.berkeley.edu/~stark/Java/Data/lomaPrieta.dat contains 221 observations of the times of putative aftershocks of the 17 October 1989 earthquake in Loma Prieta, California.
There are 222 lines in the file.
The first is 0, the main shock, which occurred at 4:15:43pm.
The other lines are the times in days from the main event to the aftershocks,
defined as earthquakes determined to have magnitude 3.0 and above, focal depth
of 0–20 km, and epicenter within 40 km of the epicenter of the Loma Prieta earthquake.
The data are from the UC Berkeley Seismographic Stations, courtesy of
Dr. Bob Uhrhammer.
(Hint: see pp. 109–116 and Labs 10 and 11 in Freedman (2005), _Statistical Models: Theory and Practice_.)
                
    1. According to one theory, aftershocks follow the modified Omori Law. If so, the data are an iid sample of size 221, sorted into increasing order, from the density $C/(a+t)^b$, where $C$ is a normalizing constant, $a$ and $b$ are parameters,  and $t$ is time in the interval 0 to 805 days.
                                
        1. There are natural restrictions on $a$ and $b$ for the density to peak at 0 and decrease monotonically. (Aftershocks generally are most frequent immediately after the main shock, then decrease in frequency.) What are those restrictions?
        
        1. Find the MLE for $a$ and $b$ from the Loma Prieta data.
        
        1. Compute the observed information matrix of the parameter estimates.
        
        1. Is the MLE biased in this application? Is the inverse of the observed information matrix a good approximation to the variance-covariance matrix of the MLEs of $a$ and $b$?
    
    1. Estimate the probability density of aftershocks of the Loma Prieta earthquake for the time interval 0 to 805 days, conditional on the event that there were 221 aftershocks during that period. (Assume that, conditional on the total number of aftershocks in the interval, the times of the aftershocks are iid random variables with common density $f$, whose functional form is unknown; the data are those times sorted into increasing order.) Use the estimators listed below. Compare and contrast the estimates. You might want to plot the density estimates on a semi-logarithmic scale (linear in time, but logarithmic in the density), because the rate of aftershocks is extremely high in the first few days. Please implement the algorithms yourself; do not use "canned" routines from a package. You will get a better feel for the methods if you have to code them. But feel free to use a canned package to check your results.
                                
        1. Histogram (use several choices of bin width).
                                        
        1. The naive estimator (use several bin widths).
                                        
        1. A kernel density estimate (use several bandwidths). The kernel you use is up to you, but say why you chose it.
        1. The nearest neighbor estimate (use several neighborhood sizes).
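
To show the level of "from scratch" intended for the density estimates, here is a minimal sketch of the naive estimator and of a kernel estimate with a Gaussian kernel, evaluated at a single point. It is run on synthetic exponential data, not on the Loma Prieta times, and it ignores the fact that the aftershock density is supported on the interval from 0 to 805 days, so how to treat the boundary at 0 remains for you to consider. The function names are hypothetical.

```python
import math
import random


def naive_density(x, data, h):
    """Naive estimate at x: fraction of the data within h of x, divided by the window width 2h."""
    count = sum(1 for xi in data if abs(x - xi) < h)
    return count / (2 * h * len(data))


def gaussian_kde(x, data, h):
    """Kernel estimate at x with a standard normal kernel and bandwidth h."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (len(data) * h * math.sqrt(2 * math.pi))


# Synthetic data for illustration only: 200 draws from an exponential distribution
rng = random.Random(0)
sample = sorted(rng.expovariate(1.0) for _ in range(200))

for h in (0.1, 0.3, 1.0):
    print(h, naive_density(1.0, sample, h), gaussian_kde(1.0, sample, h))
```

Evaluating the estimates on a grid of time points for several values of `h` shows how the choice of bin width or bandwidth trades roughness against bias.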

                                