# Bayes, Bootstrap, MLE & All That
-----

In this post I want to go back to the basics of statistics, but from an advanced point of view, both theoretically and technically. The goal is to estimate a single parameter value and then quantify the uncertainty in that estimate using various methods. I will take three approaches, two of which are [frequentist](https://en.wikipedia.org/wiki/Frequentist_inference) and one of which is [Bayesian](https://en.wikipedia.org/wiki/Bayesian_statistics). I admit I'm not as familiar with Bayesian methods, which is why I am sticking to the simple example of estimating a single parameter. I don't believe one approach to statistics is inherently better than the other, but I have found the so-called frequentist approach easier to understand and easier to implement, and I find Fisherian approaches satisfying from a theoretical point of view.

The title for this post is inspired by [Div, Grad, Curl & All That](https://www.google.com/books/edition/_/sembQgAACAAJ?hl=en), which I used as an undergraduate to help learn vector calculus.


## 1. Using A Poisson Distribution Written In Scala with Py4J
---------------

The first thing we need is data, and for that purpose I wrote a [Poisson distribution in Scala](https://github.com/mdh266/PoissonDistributionInScala). The Poisson distribution is a probability distribution for a random variable $y$ that represents some count phenomenon, i.e. a non-negative integer number of occurrences in some fixed time frame. For example, the number of trains passing through a station per day or the number of customers that visit a website per hour can be modeled with a Poisson distribution. The distribution is,

$$ p(y \, = \, k)  \; = \; \frac{\lambda^{k} e^{-\lambda} }{k!} $$

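The parameter $\lambda$ is the rate parameter, i.e. the expected number of occurrences in the time frame, such as the true average number of customers that visit the website per hour.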
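
To make the formula concrete, here is a minimal pure-Python sketch of the probability mass function. The name `poisson_pmf` is my own for illustration and is not part of the Scala project; the sketch only checks that the probabilities sum to (approximately) one and that the mean is (approximately) $\lambda$.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a Poisson(lam) random variable equals k."""
    if k < 0:
        return 0.0  # counts cannot be negative
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3.0
# Truncate the infinite sum at k = 49; the tail is negligible for lam = 3.
probs = [poisson_pmf(k, lam) for k in range(50)]
total = sum(probs)                              # should be close to 1
mean = sum(k * p for k, p in enumerate(probs))  # should be close to lam

print(f"total = {total:.6f}, mean = {mean:.6f}")
```

The value of `poisson_pmf(1, 3.0)` should agree, up to floating point error, with what the Scala class returns for `prob(1)` later in this post.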

But why did I write my distribution in Scala? Well, I like Scala and enjoyed the challenge of writing a Poisson distribution using a functional approach. I also wanted to learn more about how to use [Py4J](https://www.py4j.org/), which can be used to work with functions and objects in the JVM from Python. [Apache Spark](https://spark.apache.org/) actually uses Py4J in PySpark to write Python wrappers for Scala code. I've used both PySpark and Spark in Scala, and this gave me an opportunity to better understand how PySpark works.

The first step was to create the Poisson class as I did in this [project](https://github.com/mdh266/PoissonDistributionInScala). One key difference, however, is that the return value of any public method needs to be a Java object. Specifically, the [sample](https://github.com/mdh266/BayesBootstrapMLE/blob/main/src/main/scala/Poisson.scala) method needs to return a Java List ([java.util.List[Int]](https://www.javatpoint.com/java-list)) of integers. I originally tried returning a [Scala List](https://www.scala-lang.org/api/current/scala/collection/immutable/List.html), which worked fine in pure Scala, but Py4J was not able to convert this object and the Python type of the returned list was just "Java Object".

In order to use this class from Python we need to do three things from the Scala point of view:

1. Create a [Gateway Server](https://www.py4j.org/_static/javadoc/index.html?py4j/GatewayServer.html)
2. Create a class entrypoint to allow for setting object attributes outside of the constructor
3. Package the code as a jar using a build tool such as [Maven](https://maven.apache.org/) or [SBT](https://www.scala-sbt.org/).


The first step is pretty straightforward to do from the [Py4J Documentation](https://www.py4j.org/getting_started.html) and is in the [Main.scala](https://github.com/mdh266/BayesBootstrapMLE/blob/main/src/main/scala/Main.scala) object:

    import py4j.GatewayServer

    object Main {
        def main(args: Array[String]) = {
            val server = new GatewayServer(new PoissonEntryPoint())
            server.start()
            System.out.println("Gateway Server Started")
        }
    }
    
In the author's own words, the GatewayServer *allows Python programs to communicate with the JVM through a local network socket.*  The GatewayServer takes an *entrypoint* as a parameter, which can be any object, see [here](https://www.py4j.org/getting_started.html#writing-the-python-program).  However, the entrypoint doesn't really offer a way for us to pass the $\lambda$ value from [Python](https://www.py4j.org/getting_started.html#writing-the-python-program) to the Poisson constructor. To get around this I created a [PoissonEntryPoint](https://github.com/mdh266/BayesBootstrapMLE/blob/main/src/main/scala/PoissonEntryPoint.scala) case class:

    case class PoissonEntryPoint() {

        // Build a fresh distribution on every call so instances never share state
        def Poisson(lambda : Double) : PoissonDistribution = {
            val p = new PoissonDistribution()
            p.setLambda(lambda)
            p
        }

    }

This case class really just acts like a [Singleton](https://docs.scala-lang.org/tour/singleton-objects.html), but is a class instead of an object. The point of this class is simply to be able to create a Poisson class instance with a given $\lambda$ after starting the GatewayServer.

Now let's talk about the project structure and how to package it for use.  The project structure is:

    src/
       main/
           scala/
               Main.scala
               PoissonDistribution.scala
               PoissonEntryPoint.scala
    pom.xml
    

The [pom.xml](https://maven.apache.org/guides/introduction/introduction-to-the-pom.html#:~:text=Available%20Variables-,What%20is%20a%20POM%3F,default%20values%20for%20most%20projects.) file is the project object model, a file which contains all the instructions for [Maven](https://maven.apache.org/). I won't go into the details here, but I will say that Maven is a Java tool to compile and package our code, while SBT is the Scala equivalent. Since Scala is a [JVM language](https://en.wikipedia.org/wiki/List_of_JVM_languages) we can use either; I went with Maven since I'm more familiar with it and because examples with Py4J were much easier to find.

We package our code into an [uber jar](https://stackoverflow.com/questions/11947037/what-is-an-uber-jar) with the command:

    mvn package 
    
Then we can start our Py4J gateway server with the command:


    java -jar target/poisson-1.0-jar-with-dependencies.jar

Now we can start up our Jupyter notebook and connect Python to the JVM with the following code taken directly from [Py4J's](https://www.py4j.org/index.html#) home page

    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()

    app = gateway.entry_point

The app is now the instantiated PoissonEntryPoint class.  We can see the class type in Python:

    type(app)

    py4j.java_gateway.JavaObject

We can also see the methods of the class:

    dir(app)

    ['Poisson',
     'apply',
     'canEqual',
     'copy',
     'equals',
     'getClass',
     'hashCode',
     'notify',
     'notifyAll',
     'productArity',
     'productElement',
     'productIterator',
     'productPrefix',
     'toString',
     'unapply',
     'wait']

We can see the `Poisson` method! Since PoissonEntryPoint is a [case class](https://docs.scala-lang.org/tour/case-classes.html) it comes with a number of default methods, just like a [data class](https://realpython.com/python-data-classes/) in Python.
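
For readers more at home in Python, the comparison with data classes can be made concrete with a small sketch. The class `EntryPoint` and its `name` field are hypothetical stand-ins of mine and have nothing to do with the Scala code; the point is only that the decorator generates methods we never wrote, much like `copy`, `equals` and `toString` show up on the case class.

```python
from dataclasses import dataclass


@dataclass
class EntryPoint:
    """Hypothetical stand-in; the field exists only for illustration."""
    name: str = "poisson"


# Methods that @dataclass generated on our behalf.
generated = [m for m in ("__init__", "__repr__", "__eq__")
             if m in EntryPoint.__dict__]

print(generated)
print(EntryPoint() == EntryPoint())  # structural equality, like a case class
```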

We can then create a Poisson class instance and see that the value of $\lambda$ is 3.0:

    p1 = app.Poisson(3.0)
    p1.getLambda()

    3.0

We can then instantiate another Poisson object:

    p2 = app.Poisson(4.0)

Note that in the PoissonEntryPoint class the PoissonDistribution object is initialized within the `Poisson` function and not as an attribute of the class. If it were an attribute of the class, the last command would have changed the $\lambda$ of p1 as well.  We can see the separate values of $\lambda$:

    p1.getLambda()

    3.0

    p2.getLambda()

    4.0
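
To illustrate why it matters where the object is created, here is a pure-Python analogy that does not need the JVM. All class names (`Distribution`, `FactoryEntryPoint`, `SharedEntryPoint`) are hypothetical and only mimic the two designs; this is a sketch of the idea, not the Scala implementation.

```python
class Distribution:
    """Hypothetical stand-in for PoissonDistribution."""

    def __init__(self) -> None:
        self._lam = 1.0

    def set_lambda(self, lam: float) -> None:
        self._lam = lam

    def get_lambda(self) -> float:
        return self._lam


class FactoryEntryPoint:
    """Creates a fresh object per call, as PoissonEntryPoint.Poisson does."""

    def poisson(self, lam: float) -> Distribution:
        dist = Distribution()
        dist.set_lambda(lam)
        return dist


class SharedEntryPoint:
    """Alternative design: one distribution kept as an attribute and reused."""

    def __init__(self) -> None:
        self._dist = Distribution()

    def poisson(self, lam: float) -> Distribution:
        self._dist.set_lambda(lam)  # mutates the object handed out earlier
        return self._dist


factory = FactoryEntryPoint()
a, b = factory.poisson(3.0), factory.poisson(4.0)

shared = SharedEntryPoint()
c, d = shared.poisson(3.0), shared.poisson(4.0)

print(a.get_lambda(), b.get_lambda())  # independent objects
print(c.get_lambda(), d.get_lambda())  # both names point at the same object
```

With the shared design the second call silently overwrites the rate seen through the first handle, which is exactly the behaviour the factory method avoids.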

The really nice thing about Py4J is that you can treat objects in the JVM as if they are Python objects. For instance we can see the methods in the object:

    dir(p1)

    ['$anonfun$cdf$1',
     '$anonfun$getSum$1',
     '$anonfun$invCDF$1',
     '$anonfun$invCDF$2',
     '$anonfun$invCDF$3',
     '$anonfun$invCDF$4',
     '$anonfun$sample$1',
     '$anonfun$uniform$1',
     '$lessinit$greater$default$1',
     'cdf',
     'equals',
     'getClass',
     'getLambda',
     'getSum',
     'hashCode',
     'invCDF',
     'notify',
     'notifyAll',
     'prob',
     'sample',
     'setLambda',
     'toString',
     'uniform',
     'wait']

We can then use the methods of the class just as they would be used directly in Scala. For instance, we can get the probability of $y=1$ when $\lambda = 3$:

    p1.prob(1)

    0.14936120510359183

Now let's generate a random sample from the Poisson object:

    sample = p1.sample(1000)

    sample[:3]

    [3, 5, 2]

It looks like Py4J gives us back a Python list, while the [PoissonDistribution class](https://github.com/mdh266/BayesBootstrapMLE/blob/main/src/main/scala/PoissonDistribution.scala) returns a `java.util.List[Int]`.  Py4J can only convert specific Java objects into their Python counterparts, and when it can, it is awesome!  This is also why I needed to convert from a Scala `List[Int]` to a `java.util.List[Int]`; without the conversion the returned object would just be a generic `Java Object`.
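
For readers without a JVM at hand, a comparable sample can be drawn in pure Python. The method names `uniform`, `cdf` and `invCDF` in the `dir(p1)` listing suggest that the Scala class uses inverse-transform sampling, but that is my assumption; the sketch below is a stand-in, not a port of the Scala code, and the names `poisson_inv_cdf` and `poisson_sample` are hypothetical.

```python
import math
import random


def poisson_inv_cdf(u: float, lam: float) -> int:
    """Smallest k whose cumulative probability reaches u."""
    k = 0
    p = math.exp(-lam)  # P(Y = 0)
    cdf = p
    # The p > 0 guard stops the loop if rounding keeps cdf just below u.
    while cdf < u and p > 0.0:
        k += 1
        p *= lam / k    # P(Y = k) computed from P(Y = k - 1)
        cdf += p
    return k


def poisson_sample(n: int, lam: float, seed=None) -> list:
    """Draw n Poisson(lam) counts by pushing uniforms through the inverse CDF."""
    rng = random.Random(seed)
    return [poisson_inv_cdf(rng.random(), lam) for _ in range(n)]


draws = poisson_sample(1000, 3.0, seed=0)
print(draws[:3], sum(draws) / len(draws))
```

The sample mean should land close to the $\lambda$ that was passed in, which is the property the estimators in the following sections rely on.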


Now that we have our sample, let's get into the Maximum Likelihood Estimator to estimate the $\lambda$ for the distribution that generated our sample.

## 2. The Maximum Likelihood Estimator
----------

    lam = sum(x) / len(x)
    print(f"lambda = {lam}")

    lambda = 2.9927
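
The sample mean is used here because it maximizes the Poisson log-likelihood $\ell(\lambda) = \sum_i \left( y_i \log \lambda - \lambda - \log y_i! \right)$: setting $d\ell / d\lambda = \sum_i y_i / \lambda - n = 0$ gives $\hat{\lambda} = \frac{1}{n} \sum_i y_i$. Below is a small numerical sanity check of that claim. The counts in `data` are made up for illustration and the function name `poisson_log_likelihood` is my own.

```python
import math


def poisson_log_likelihood(lam: float, data: list) -> float:
    """Log-likelihood of i.i.d. Poisson(lam) counts; lgamma(k + 1) = log(k!)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in data)


# Hypothetical counts, not the sample generated by the Scala class.
data = [3, 5, 2, 4, 1, 3, 0, 2, 6, 3]

# Closed-form maximum likelihood estimate: the sample mean.
mle = sum(data) / len(data)

# Brute-force check: scan a grid of candidate rates and keep the best one.
grid = [i / 100 for i in range(1, 1001)]
best = max(grid, key=lambda lam: poisson_log_likelihood(lam, data))

print(f"closed form = {mle}, grid search = {best}")
```

The grid search is only there to confirm the closed form; in practice the sample mean is all that is needed.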


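Why does Poisson(3).prob(2) == Poisson(3).prob(3)? Because $p(y = k) / p(y = k - 1) = \lambda / k$, which equals 1 when $k = \lambda = 3$, so the two probabilities coincide, as the outputs below show (here `p` has $\lambda = 3$):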

    p.prob(2)

    0.22404180765538775

    p.prob(1)

    0.14936120510359183

    p.prob(3)

    0.22404180765538775

    p.setLambda(5.0)

    p.getLambda()

    5.0

## 3. Confidence Intervals From The Fisher Information
-------------------

## 4. The Bootstrap
----------------