### ***Names: [Insert Your Names Here]***

In [None]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

# Lab 3 - Statistics in Python

## Lab 3 Contents

1. Statistical Distributions
  * Random Draws 
  * Functional Forms
  * Probability Density Functions
2. Descriptive Statistics
3. Apply it! Is the universe isotropic?
4. Convolution

In this lab, you will continue to explore Python plotting, while also learning some of the basic Python statistical functions and distributions. 

## 1. Statistical Distributions 
### 1.1 Random Draws
In data analysis, it is often useful to draw a random sample of numbers from a statistical distribution, where the relative probability of drawing any given number $x_1$ is proportional to the probability density function evaluated at that location, $P(x_1)$. For example, to draw 100 random numbers from a normal distribution with mean 0 and standard deviation 1, you could do the following:

In [None]:
#here rvs stands for random variates
norm_sample = stats.norm.rvs(size = 100, loc = 0, scale = 1)

Although the normal distribution is a continuous function, we have drawn 100 discrete random numbers from it, so your best tool to visualize how these randomly drawn numbers are distributed is to make a histogram. 

In [None]:
fig = plt.hist(norm_sample, bins=10)
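
As a quick sanity check on a draw like the one above, the following minimal sketch compares the sample mean and standard deviation with the requested `loc` and `scale`. The `random_state` seed and the variable names are illustrative choices, not part of the lab; the seed only makes the draw repeatable.

```python
import numpy as np
import scipy.stats as stats

# draw 100 values from a standard normal; random_state fixes the seed (illustrative choice)
sample = stats.norm.rvs(size=100, loc=0, scale=1, random_state=42)

# the sample statistics should be close to, but not exactly equal to, loc and scale
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)
print(sample_mean, sample_std)

# np.histogram computes the counts and bin edges that plt.hist would draw
counts, edges = np.histogram(sample, bins=10)
print(counts.sum(), len(edges))
```

Note that 10 bins are bounded by 11 edges, and that the counts add up to the sample size.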

---  
### Exercise 1
    
-----------------------
Making good histograms is an art, and we'll do more work with them soon, but for now, it will be sufficient to familiarize yourselves with the basics of the "tunable parameters" that control the appearance of a histogram. 

Copy the two lines of code in the cells above this exercise into the "testing" cells below. Modify the values of each of the optional inputs (size, loc, scale, and bins) one at a time until you are confident that you know what they control, and then describe the effect of changing each parameter in words below. In each case, describe the range of values you tried for the keyword and the visible effect that modifying it had on the plot. 

In [None]:
#test of effect of "size" keyword

*insert explanation of size keyword*

In [None]:
#test of effect of "loc" keyword

*insert explanation of loc keyword*

In [None]:
#test of effect of "scale" keyword

*insert explanation of scale keyword*

In [None]:
#test of effect of "bins" keyword

*insert explanation of bins keyword*

---

### 1.2 Functional forms

Of course, most statistical distributions have functional forms, and we do not always need to rely on random draws to visualize or use them. The normal distribution, for example, has the functional form
$$f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation. 

These functional forms can be useful for building intuition, but luckily in Python we rarely need to code one from scratch, because a large library of statistical distributions is built into `scipy.stats`. For a full list of available distributions, see [this link](https://docs.scipy.org/doc/scipy/reference/stats.html). 
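
To see that the built-in function and the functional form agree, here is a minimal sketch that codes the normal PDF by hand and compares it with `stats.norm.pdf`. The helper name `normal_pdf` and the test values of $\mu$ and $\sigma$ are illustrative. Note the negative sign in the exponent, which is what makes the curve fall off away from the mean.

```python
import numpy as np
import scipy.stats as stats

def normal_pdf(x, mu, sigma):
    """Normal PDF written out from its functional form (illustrative helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-5, 5, 201)
by_hand = normal_pdf(x, mu=0.5, sigma=1.5)
from_scipy = stats.norm.pdf(x, loc=0.5, scale=1.5)

# the two should agree to floating-point precision
print(np.allclose(by_hand, from_scipy))
```

In `scipy.stats.norm`, `loc` plays the role of $\mu$ and `scale` the role of $\sigma$.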

--- 

### Exercise 2 

----------------------------------
Set a timer for 10 minutes. How many statistical distributions can you put on the same plot in this time? Make sure that you use a legend describing the relevant input parameters for that distribution. 

In [None]:
## plotting code goes here

---

### 1.3 - Probability Density Functions

Since we already used the normal distribution in Section 1.1, let's use a new distribution here: the Poisson distribution. In astronomy, it is perhaps most important in its application to the collection of light from astronomical objects, where the number of photons collected per unit time should follow a Poisson distribution. 

$$P(n)=e^{-\lambda}\frac{\lambda^n}{n!}$$

where $\lambda$ is the mean of the distribution. We will need to define a range of values for $n$ over which to calculate the PDF, and it can be difficult to choose these intelligently. One trick is to use the "percent point function" (`ppf`) to evaluate the $n$ value corresponding to a certain percentile of the full distribution. For example, `ppf(0.01)` returns the smallest $n$ at which the cumulative probability reaches 1%, so only about 1% of the distribution lies at or below it. 

Let's create an appropriate range of $n$ values for an arbitrary choice of $\lambda$:

In [None]:
lam = 75
#return the n value corresponding to the 0.1 percentile
minn = stats.poisson.ppf(0.001, lam)
#return the n value corresponding to the 99.9 percentile
maxn = stats.poisson.ppf(0.999, lam)
print(minn,maxn)
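
The percent point function is the inverse of the cumulative distribution function. The short sketch below (variable names are illustrative) checks this for the lower bound: for a discrete distribution such as the Poisson, `ppf(q)` returns the smallest $n$ whose CDF is at least $q$.

```python
import scipy.stats as stats

lam = 75
n_low = stats.poisson.ppf(0.001, lam)

# the CDF at n_low has reached 0.001, but the CDF one step below has not
cdf_at = stats.poisson.cdf(n_low, lam)
cdf_below = stats.poisson.cdf(n_low - 1, lam)
print(n_low, cdf_at, cdf_below)
```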

In [None]:
#create a range of x values over which to compute the PDF, ranging from minn to maxn
#(np.arange excludes its stop value, so add 1 to include maxn itself)
x = np.arange(minn, maxn + 1)
len(x)

Now we're ready to compute the PDF. Because the Poisson distribution is discrete, this is more properly called a PMF (Probability Mass Function).

In [None]:
#compute pmf for given lam and range of x
poisson_pmf = stats.poisson.pmf(x, lam)

We can also just as easily compute the cumulative distribution function

In [None]:
poisson_cdf = stats.poisson.cdf(x,lam)
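
The PMF and CDF are tied together: the CDF at $n$ is the sum of the PMF over all values up to and including $n$. The sketch below is a minimal check of that relationship; it rebuilds the range of $n$ values so that it runs on its own, and its variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

lam = 75
minn = stats.poisson.ppf(0.001, lam)
maxn = stats.poisson.ppf(0.999, lam)
n = np.arange(minn, maxn + 1)  # include maxn itself

pmf = stats.poisson.pmf(n, lam)
cdf = stats.poisson.cdf(n, lam)

# probability contained in the chosen range: slightly less than 1 because the tails are cut off
print(pmf.sum())

# a running sum of the PMF, offset by the probability below minn, reproduces the CDF
below = stats.poisson.cdf(minn - 1, lam)
print(np.allclose(below + np.cumsum(pmf), cdf))
```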

--- 
### Exercise 3
--------------------------------------
    
(a) Using the cells above as a reference, write a ***for loop*** that plots the Poisson PMF for a range of $\lambda$ values from 10 to 100 (by tens is fine) all on the same plot.  
(b) Once you have created your graphic, which should have a proper legend and axis labels, write a one-sentence summary of the most important feature(s) of the Poisson distribution demonstrated by the graphic.  
(c) Now write a function that will plot a Poisson distribution with an arbitrary $\lambda$ value (the required input) and then overplot a normal distribution with the same mean and standard deviation as the Poisson over the same range of x values. The output plot should have an appropriate legend and axis labels.  
(d) Use your code from (c) to compare the Poisson and normal distributions for a range of $\lambda$ values, then describe in words (i) the ways in which the two distributions are different from one another, and (ii) how this difference changes as $\lambda$ changes.  

*Note that you could just as easily have done this exercise with cumulative distribution functions (CDFs) rather than PMFs. If you're feeling ambitious, you might consider plotting a few and thinking about the types of questions that each is best suited to answer.*

In [None]:
## your for loop for exercise (a) goes here

***Your explanation for (b) goes here***

In [None]:
#your function for (c) goes here

In [None]:
#test statement 1

In [None]:
#test statement 2

In [None]:
#test statement 3

***Your explanation for (d) goes here***

## 2. Descriptive Statistics

Most of the statistical distributions in `scipy.stats` also have a built-in `stats` method that reports the moments of the distribution. For example:

In [None]:
#let's try a non-integer mean this time
lam = 5.7

In [None]:
#the stats method
mean, var, skew, kurt = stats.poisson.stats(lam, moments = 'mvsk')

In [None]:
print(mean, var, skew, kurt)

There are also built-in functions for other statistical quantities, for example

In [None]:
stats.poisson.median(lam)

In [None]:
stats.poisson.mean(lam)

In [None]:
stats.poisson.std(lam)
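
These theoretical values can be compared with estimates from a large random sample. The sketch below is one way to do that; the sample size and `random_state` seed are illustrative choices. `stats.skew` and `stats.kurtosis` compute the sample versions, and `stats.kurtosis` returns excess kurtosis by default, which matches the `'k'` moment returned by the `stats` method.

```python
import numpy as np
import scipy.stats as stats

lam = 5.7
# large sample so that the estimates settle close to the theoretical values
draws = stats.poisson.rvs(lam, size=100000, random_state=0)

mean, var, skew, kurt = stats.poisson.stats(lam, moments='mvsk')

# theoretical value first, sample estimate second
print(mean, np.mean(draws))
print(var, np.var(draws, ddof=1))
print(skew, stats.skew(draws))
print(kurt, stats.kurtosis(draws))
```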

---
  
### Exercise 4
    
--------------------------------
Using your knowledge of the Poisson distribution from the reading and from any experiments that you can do with the functions and plots that you've designed above, identify each of the following statements as True or False, and insert an explanation of WHY. **Connect your answers to the plots that you made for Exercise 3 and to specific statistics wherever possible**

1. The mean and variance values will always be the same for the Poisson distribution.   
***explanation here***
2. A skew of 0.1 means that there is slightly more power (area) to the left of the peak than to the right for the Poisson distribution.   
***explanation here***
3. A positive kurtosis means that the distribution is "peakier" than a normal distribution.   
***explanation here***
4. The mean and median of a Poisson distribution are always different.  
***explanation here***
5. The Poisson distribution gets less symmetrical as $\lambda$ increases.  
***explanation here***

In [None]:
#tests and supporting plots in this cell and any others that you choose to insert

---
## Exercise 5 - Apply it! Is the Universe Isotropic?
---

Now that you've explored the use of statistical distributions in Python, let's try applying them to an interesting question, namely:


> Is the universe isotropic?

Here, isotropic means that the universe looks the same in every direction. To answer this question, you and your classmates compiled some data on galaxy counts in different regions of the Hubble Ultra Deep Field (HUDF). The next two cells will read in that data and set it up for you to manipulate. 

We will learn much more about dealing with tabular data in the next unit, so for now, just execute the cells and don't worry about the computational methods too much except to familiarize yourself a little with the syntax and verify that the contents of the final object are the same as the ones in the spreadsheet.  

In [None]:
#install some packages to interface with google sheets
!pip install gspread gspread-dataframe

#import spreadsheet modules
import pandas as pd
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe

#import modules for google authentication and set up
#(google.auth replaces the deprecated oauth2client package)
from google.colab import auth
from google.auth import default
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

In [None]:
#read in the first tab of the google sheet 
worksheet = gc.open('HUDF_Counts').sheet1
#pull out the data
rows = worksheet.get_all_values()
#stitch it together into a pandas dataframe, skipping the first row
df=pd.DataFrame.from_records(rows[1:])
#make the second row the header row
df.rename(columns=df.iloc[0],inplace=True)
#drop the second row that has been pulled into the header
df.drop(df.index[0], inplace=True)
#extract the numbers in the "AVG" column
#(get_all_values returns strings, so convert to floats before doing any math)
avgcounts = df["AVG"].astype(float).values

The Python object ```avgcounts``` is now a numpy array storing the average of the class' galaxy counts for each of the twenty HUDF regions. Use this array to make some statistical calculations and/or visualizations to inform the question. Then, weave them together into a data-driven argument about whether or not your class' data are consistent with the universe being isotropic. If you're not sure where to start, ask for help!

*Hint: Many of the skills and tools that you've used so far in this lab will be useful in making your argument, namely: the histogram, the .ppf method, descriptive statistics, random draws from a statistical distribution, etc. There are many possible ways to approach this problem, and I suggest using more than one of the above in your argument as independent lines of evidence.*


In [None]:
#code here (feel free to add more code cells)

***Explanation here. Should be at least two full paragraphs with SPECIFIC references to the data and the statistical calculations. What did you expect the properties of the data would be if the universe IS isotropic and why? What calculations or visualizations did you do to test this and why? Describe the results of each test, as well as your interpretation of their meaning.***

----

*NOTE: The remaining part of this lab is optional. I leave it here for your reference, as it may come in handy later.*

## 4. Convolution in Python

"Convolution" is a mathematical operation that is useful in statistics in that it allows us to derive a probability distribution for a quantity that is the sum of two other (independent, random) variables that are themselves distributed following their own PDFs. It has a mathematical definition, but here we will try to develop some intuition for it graphically. 

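For reference, the standard definition is as follows. If $X$ and $Y$ are independent continuous random variables with PDFs $f_X$ and $f_Y$, then the PDF of their sum $Z = X + Y$ is the convolution

$$f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$$

For discrete variables, the integral becomes a sum over $x$. The symbols $X$, $Y$, $Z$, $f_X$, $f_Y$, and $f_Z$ are introduced here only for illustration.
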
Let's start by visualizing this for two normal distributions with different means but the same standard deviation. 

In [None]:
x = np.arange(0,20,0.1)
norm1 = stats.norm.pdf(x,loc=5, scale=1)
norm2 = stats.norm.pdf(x, loc=15, scale=1)

In [None]:
#labels match the loc values used to build norm1 and norm2
plt.plot(x, norm1, label=r"$\mu$=5")
plt.plot(x, norm2, color="red", label=r"$\mu$=15")
plt.legend()

In [None]:
conv = np.convolve(norm1,norm2, mode="same")

In [None]:
plt.plot(x, norm1, label=r"$\mu$=5")
plt.plot(x, norm2, color="cyan", label=r"$\mu$=15")
plt.plot(x, conv, color="magenta", label="convolution" )
plt.legend()

In [None]:
#rescale so the convolved curve has roughly the same sum as norm1 and can share its y-axis
conv /= sum(norm2)
print(sum(conv))

In [None]:
plt.plot(x, norm1, label=r"$\mu$=5")
plt.plot(x, norm2, color="cyan", label=r"$\mu$=15")
plt.plot(x, conv, color="magenta", label="convolution" )
plt.legend()
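
One caveat when reading these plots: with `mode="same"`, `np.convolve` returns only the central samples of the full result, so plotting it against the original `x` puts the peak near the middle of the axis rather than at its true location. The sketch below is one possible way to recover the physical axis, assuming an evenly spaced grid; the names `dx`, `conv_full`, and `x_full` are illustrative. It uses `mode="full"`, scales by the grid spacing so that the discrete sum approximates the convolution integral, and then measures the area, mean, and standard deviation of the result.

```python
import numpy as np
import scipy.stats as stats

dx = 0.1
x = np.arange(0, 20, dx)
norm1 = stats.norm.pdf(x, loc=5, scale=1)
norm2 = stats.norm.pdf(x, loc=15, scale=1)

# mode="full" keeps every overlap; multiplying by dx turns the discrete sum into an integral
conv_full = np.convolve(norm1, norm2, mode="full") * dx

# the first output sample corresponds to x[0] + x[0], and the spacing is still dx
x_full = 2 * x[0] + dx * np.arange(len(conv_full))

# numerical area, mean, and standard deviation of the convolved curve
area = conv_full.sum() * dx
mean = (x_full * conv_full).sum() * dx
std = np.sqrt((((x_full - mean) ** 2) * conv_full).sum() * dx)
print(area, mean, std)
```

Plotting `conv_full` against `x_full` then shows the distribution of the sum on its own axis.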

---
    
### Exercise 6 (Optional)

---------------------------
    
Experiment with the convolution of normal distributions with different values of $\mu$ and $\sigma$. Write down at least three observations about what is happening, with plots interspersed to demonstrate/support your arguments. 

**Challenge Exercise**
If you finish early, do the same for another statistical distribution. 

In [None]:
#convolved normal distribution testing

***explanations (don't forget to weave in the plots) go here***

In [None]:
#extra space for challenge exercise

---

# Submitting Prelabs and Labs for Grading

Before submitting any Google Colab notebook for grading, please complete the following steps:

**1) Try running everything in one go (Runtime menu -> Restart and run all)**

Make sure the entire notebook runs from start to finish. If necessary, comment out any un-executable cells from the instructions portion of the lab so the whole notebook will execute in one go. 

**2) Restart the kernel (Runtime menu -> Restart Runtime).**

**3) Clear all output (Edit -> clear all outputs).**

**4) Make sure the names of all group members are in a markdown cell at the top of the file and submit the notebook through the Moodle link for this Lab**