# Chapter 6: A first look at probabilistic models

In [Chapter 6](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#) of the book, we worked with a dataset consisting of listing prices of a collection of Airbnbs in Austin, Texas, and we explored the possibilities of modeling this dataset probabilistically. But it turns out that the dataset contains more variables than just price. In fact, it also contains the _number of reviews_ for each listing over the observation period of 12 months (discrete count data!). In this programming assignment, we will construct two candidate probabilistic models for this new variable, and we will use the _model checking_ and _goodness of fit_ plots and visualization techniques from [Chapter 6](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#) to decide which of the two models provides the better fit to the data.

While we used the powerful Seaborn and Statsmodels libraries in the book to produce these plots, I will have you generate these plots _from scratch_ for the first probabilistic model. This will give you the opportunity to do a deeper dive into the theory and to make sure that you truly understand what's going on. However, for the second model, you'll be allowed to use Seaborn and Statsmodels.

This might be the most important programming assignment during this first semester since it ties together so much of the material and also shows you how the theory is applied in practice. If you like model building, just wait until [Chapter 11](https://mml.johnmyersmath.com/stats-book/chapters/models.html), when things get really fun!

## Directions

1. The programming assignment is organized into sequences of short problems. You can see the structure of the programming assignment by opening the "Table of Contents" along the left side of the notebook (if you are using Google Colab or Jupyter Lab).

2. Each problem contains a blank cell containing the following comment: `# ENTER YOUR CODE IN THIS CELL`. Enter your code in these cells below the comment, being sure to not erase the comment. There are usually directions on the precise syntax that you will use to enter your solution properly. Please pay very careful attention to these directions.

3. Below most of the solution cells are "autograder" cells. Do not alter the autograder cells in any way.

4. Do not add any cells of your own to the notebook, or delete any existing cells (either code or markdown).

## Submission instructions

1. Once you have finished entering all your solutions, you will want to rerun all cells from scratch to ensure that everything works OK. To do this in Google Colab, click "Runtime -> Restart and run all" along the top of the notebook.

2. Now scroll back through your notebook and make sure that all code cells ran properly.

3. If everything looks OK, save your assignment and upload the `.ipynb` file at the provided link on the course <a href="https://github.com/jmyers7/stats-book-materials">GitHub repo</a>. Late submissions are not accepted.

4. You may submit multiple times, but I will only grade your last submission.

## Importing the data and initial analysis

We begin, as usual, by importing all the standard libraries for this assignment.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp
from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
# set matplotlib to output pretty .svg's, rather than ugly .png's
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

Now, we read in the data from our course's [GitHub repository](https://github.com/jmyers7/stats-book-materials/tree/main):

In [None]:
url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-4-1.csv'
srs = pd.read_csv(url).drop(columns=['Unnamed: 0']).squeeze()
srs

Since we are only dealing with one variable in this assignment---the number of reviews for each Airbnb listing---we have saved the data into a Pandas Series, rather than a DataFrame. For this reason, we saved the data into the variable `srs` instead of our usual `df`.

Notice that there are $14{,}694$ data points in our series. The first listing (at index $0$) has $648$ reviews, while the last one (at index $14{,}693$) has no reviews at all.

Let's print out the summary statistics of our variable:

In [None]:
srs.describe()

In particular, we see the mean is about $37$ while the median is $9$, and that the maximum number of reviews is a _huge_ $1{,}124$. This tells us that the data is _extremely_ right-skewed (i.e., it has a long tail stretching out toward the right).

We also see that the $0.25$-quantile (or the _first quartile_) is $1$, while the $0.75$-quantile (or the _third quartile_) is $36$. Thus, the interquartile range (IQR) is $36 - 1 = 35$, and the upper threshold for outliers is

$$
(\text{third quartile}) + 1.5 \times (\text{IQR}) = 36 + 1.5 \times 35 = 88.5.
$$
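
The arithmetic above can also be done in code rather than by hand. Here is a minimal sketch on a small made-up series; the names `toy`, `q1`, `q3`, `iqr`, and `upper` are hypothetical and are not used anywhere else in the assignment.

```python
import pandas as pd

# Made-up data for illustration only (this is NOT the Airbnb dataset)
toy = pd.Series([1, 2, 3, 4, 5, 100])

q1 = toy.quantile(0.25)   # first quartile
q3 = toy.quantile(0.75)   # third quartile
iqr = q3 - q1             # interquartile range

# Upper threshold for outliers: (third quartile) + 1.5 * (IQR)
upper = q3 + 1.5 * iqr
print(q1, q3, iqr, upper)
```

Note that Pandas interpolates between data points when computing quantiles, so the quartiles need not be values that actually occur in the data.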

Because this value will be useful later, let's save it into the variable `threshold`:

In [None]:
threshold = 88.5

### Problem 1 --- First glimpse of the data distribution

The summary statistics that we just found suggest that our data is right-skewed. Let's confirm this visually by producing a box plot using the Seaborn library. In this problem, your goal is to produce this:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/8b144c56622a67e553c027b7cee394d76579edbf/img/boxplot-with-outliers.svg?raw=true" width="900" align="center">
</center>

Hints/directions:

* I suggest taking a look at the code in the [section](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#box-plots-and-violin-plots) of the book where we discussed box plots.

* Set the size of the plot to be $10$ inches wide and $2$ inches high.

* Be sure to notice the label along the horizontal axis and the title of the plot.

* Make sure to call `plt.tight_layout()` at the end of your code block.


In [None]:
# ENTER YOUR CODE IN THIS CELL



This box plot confirms that we have a very large number of outliers in our dataset.

### Problem 2 --- Removing outliers

So, what are we to do with all these outliers? We're going to remove them!

Remember that we saved the upper threshold for outliers into the variable `threshold`. In the next code block, create a boolean mask to **keep** all values in `srs` that are **less than or equal** to `threshold`. Thus, your mask should contain `True` for those values that we want to **keep**. Make sure you understand this!

Save your mask into the variable `mask`. You might also print it out to make sure it looks right.
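
If boolean masks are still fuzzy, here is a small sketch of the general idea on made-up data. The names `toy`, `cutoff`, `keep`, and `kept` are hypothetical; your solution should use the variable names given in the directions.

```python
import pandas as pd

# Made-up data and cutoff for illustration only
toy = pd.Series([3, 12, 7, 25, 1])
cutoff = 10

# Comparing a Series to a number gives a boolean Series of the same length;
# True marks the values we want to keep
keep = toy <= cutoff
print(keep.tolist())

# Indexing with the boolean Series keeps only the entries marked True
kept = toy[keep]
print(kept.tolist())
```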

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now, in the next code block, index into `srs` using `mask` to produce a Pandas Series that contains all non-outliers. Save your answer into the variable `srs`, saving over the old `srs`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


If you did everything right, you should see that `srs` now contains $13{,}023$ data points. Thus, by removing outliers, we removed almost $1{,}700$ listings.

### Problem 3 --- Looking at the data again

Having removed the outliers, let's produce another box plot to get a look at the data distribution. Your goal is to produce this:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/9279e12deda58577897c9113fa4fea05669b93ee/img/boxplot-wo-outliers.svg?raw=true" width="900" align="center">
</center>

Again, set the dimensions of the plot to $10$ inches wide and $2$ inches high, and be sure to call `plt.tight_layout()` at the end of your code.

In [None]:
# ENTER YOUR CODE IN THIS CELL



After removing the initial batch of outliers, the upper threshold for outliers changed (decreased). Thus, there are _new_ outliers in our new (smaller) dataset, as shown by the last box plot. But there is a much more manageable number of them---not nearly as many as $1{,}700$.

## The empirical mass function

### Problem 4 --- Probability histograms

Let's now generate the mass function of the empirical distribution of the number of reviews. Do this in the next code block, saving your answer into the variable `epmf`. Be sure to sort the levels in the support of the mass function by calling `.sort_index()`, just as you did back in the [third programming assignment](https://github.com/jmyers7/stats-book-materials/blob/main/programming-assignments/assignment_03.ipynb).
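
As a reminder of what an empirical mass function looks like in Pandas, here is a sketch on a tiny made-up series. It shows one common way to get relative frequencies; I am assuming it matches the approach from the earlier assignment, and the names `toy` and `toy_pmf` are hypothetical.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.Series([2, 0, 1, 0, 2, 2, 0, 0])

# Relative frequency of each distinct value, with the values sorted in
# increasing order along the index
toy_pmf = toy.value_counts(normalize=True).sort_index()
print(toy_pmf)
```

The index of the resulting series holds the distinct values (the support), and the entries are the probabilities, which sum to $1$.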


In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now, let's generate a probability histogram of the mass function. Your goal is to produce this:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/699a543ffcf7db1152ee55f9cc532983501fce07/img/epmf-reviews.svg?raw=true" width="900" align="center">
</center>

Hints/directions:

* Set the size of the figure to 10 inches wide, 4 inches high.

* Set the ticks along the $x$-axis by calling `plt.xticks(range(0, 45, 5))`. This produces ticks from $0$ to $40$ in steps of $5$. (The `range` function always _excludes_ the end point, so we call `range(0, 45, 5)` rather than `range(0, 40, 5)`.)

* Set the limits on the $x$-axis by calling `plt.xlim(-1, 40.5)`.

* Be sure to notice the labels on the axes, as well as the title of the plot!

* Call `plt.tight_layout()` at the end of your code.
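
To see the point about `range` excluding its end point, you can run this quick check (the names `ticks` and `too_short` are just for illustration):

```python
# range(start, stop, step) stops BEFORE reaching `stop`
ticks = list(range(0, 45, 5))
print(ticks)       # [0, 5, 10, 15, 20, 25, 30, 35, 40]

too_short = list(range(0, 40, 5))
print(too_short)   # [0, 5, 10, 15, 20, 25, 30, 35]
```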

In [None]:
# ENTER YOUR CODE IN THIS CELL



## The Poisson model

Having completed some initial exploratory data analysis, our goal now is to cook up a probabilistic model for the data:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/8740bf8b382c69d5d69800972d0fabc543918420/img/whichone.svg?raw=true" width="400" align="center">
</center>

As a first step in this direction, we remind ourselves that our data is _count_ data. Indeed, each data point _counts_ the number of reviews of an Airbnb listing in Austin. Remember that I mentioned numerous times in [Chapter 5](https://mml.johnmyersmath.com/stats-book/chapters/examples-of-rvs.html#) that Poisson random variables are often used to model _count_ data. So, perhaps a Poisson model might work well...? Our goal in this section of the programming assignment is to assess the fit of a Poisson model using the goodness-of-fit plots that we learned in [Chapter 6](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#).

Since Poisson random variables are parametrized just by a single parameter $\mu$, we would _visualize_ a Poisson model graphically as follows:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/c080c681323f5ccc50fa2b43ad000d78305dfa10/img/poisson-model.svg?raw=true" width="400" align="center">
</center>

Here, $m$ is the size of the dataset, $m=13{,}023$. Since our model is so simple, there's not really any insight gained by depicting the model graphically, but I think it will be helpful to get into the habit of drawing out our models in anticipation of the more complicated models in [Chapter 11](https://mml.johnmyersmath.com/stats-book/chapters/models.html) of the book.


Before moving on to the next problem, it will be helpful to have the size of the dataset available later, so let's save it into the variable `m`:

In [None]:
m = len(srs)
m

### Problem 5 --- Comparison of PMFs

As a first test for goodness of fit, we will compare the empirical mass function of the dataset to the mass function of the Poisson model. But how should we choose the parameter $\mu$ for the model $Pois(\mu)$?

Well, remember that we chose the Greek letter $\mu$ for the parameter because it actually _is_ the mean of the model. This suggests that a good choice for $\mu$ should be the empirical mean $\bar{x}$ of the dataset. In the next code block, define the variable `mu` as the empirical mean of the dataset. Then, in the same code block, create a Poisson random variable in SciPy using this value of `mu`. Call the random variable `X`.
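
As a quick sanity check of the claim that $\mu$ _is_ the mean, here is a sketch with a made-up parameter value. The names `toy_mu` and `toy_X` are hypothetical, and I import `stats` directly from SciPy here just to keep the snippet self-contained.

```python
from scipy import stats

# Made-up parameter value for illustration only
toy_mu = 2.5

# A "frozen" Poisson random variable with parameter toy_mu
toy_X = stats.poisson(toy_mu)

# The mean of the model equals the parameter
print(toy_X.mean())

# The mass function can be evaluated at any list of points
print(toy_X.pmf([0, 1, 2]))
```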

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


We now want to generate the mass function of the Poisson random variable `X`. To do this, we will pass the support of the empirical mass function into `X.pmf()`. The support of the empirical mass function is contained in the indices of the Pandas Series `epmf` that you defined above. You learned how to extract the indices of a Pandas Series back in the [third programming assignment](https://github.com/jmyers7/stats-book-materials/blob/main/programming-assignments/assignment_03.ipynb). Using this knowledge, in the next code block create the mass function of `X` and save it into the variable `poisson_pmf`. As always, you might print it out to check that it looks correct.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now, to help facilitate comparison between the empirical and model PMFs, let's load them both into the columns of a Pandas DataFrame. I will do this for you. Run the next code block:

In [None]:
df = pd.DataFrame({'empirical': epmf, 'poisson': poisson_pmf})
df

Notice that I defined the dataframe by passing the dictionary

``` python
{'empirical': epmf, 'poisson': poisson_pmf}
```

into the dataframe constructor. The _keys_ in the dictionary (i.e., the strings `'empirical'` and `'poisson'`) become the names of the columns, while the _values_ in the dictionary (i.e., the series `epmf` and `poisson_pmf`) become the columns.

With the dataframe `df` in hand, your goal now is to produce the following plot:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/9c4d69177360d3e71883ac4d731b9f299438a477/img/poisson-compare.svg?raw=true" width="900" align="center">
</center>

Use the same specifications for this plot that I gave you for the probability histogram in Problem 4. And, as always, be sure to call `plt.tight_layout()` at the end of your code.

In [None]:
# ENTER YOUR CODE IN THIS CELL



This comparison of PMFs suggests (strongly) that the Poisson model fits the data poorly. You probably knew this all along, if you remembered the shapes of the mass functions of Poisson random variables that we saw back in [Chapter 5](https://mml.johnmyersmath.com/stats-book/chapters/examples-of-rvs.html#poisson-distributions).

### Problem 6 --- Comparison of CDFs

From the single plot above, we know that the Poisson model is a bad one. However, just for practice, over the next two problems we will generate more comparison plots for model checking and goodness of fit.

In this problem, we will compare the ECDF of the data to the CDF of the Poisson model. And, rather than use the convenient `ecdfplot` method in the Seaborn library, I will have you build and plot the ECDF from _scratch_.

To generate the ECDF, first look [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.cumsum.html) at the documentation for the "cumulative sum" method in Pandas. Use this method in the next code block to generate the ECDF from the series `epmf` that you defined above. Save the ECDF into the variable `ecdf`.
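
To see what a cumulative sum does to a mass function, here is a sketch on a tiny made-up PMF (the names `toy_pmf` and `toy_cdf` are hypothetical):

```python
import pandas as pd

# Made-up mass function: the index is the support, the entries are probabilities
toy_pmf = pd.Series([0.5, 0.125, 0.375], index=[0, 1, 2])

# Each entry of the cumulative sum is the total probability at or below
# the corresponding index value
toy_cdf = toy_pmf.cumsum()
print(toy_cdf)
```

The last entry of a cumulative sum of probabilities should always be $1$, which is a handy check that nothing went wrong.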

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now that you have the ECDF, we want to plot it against the CDF of the Poisson model for comparison. Your goal is to produce the following plot:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/4ba2a0b35325af1aa86a85234d3619f99aa48eff/img/poisson-cdf.svg?raw=true" width="600" align="center">
</center>

Hints/directions:

* To plot the ECDF, see the documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.step.html) for the `step` method in Matplotlib.

* To produce the CDF of the Poisson variable, you will use the `step` method one more time. Pass the indices of the empirical distribution, `epmf.index`, into `X.cdf()`.

* Take care to notice the axis labels, plot title, and legend.

* Call `plt.tight_layout()` at the end of your code.

In [None]:
# ENTER YOUR CODE IN THIS CELL



This plot confirms what we already knew: the Poisson model is a bad fit. But at least we got practice creating ECDFs from scratch!

### Problem 7 --- QQ-plots

We now turn toward creating a QQ-plot to compare the empirical distribution to the Poisson model distribution. Even though these plots may be created easily using the `qqplot` method in the Statsmodels library, in this problem we will create the QQ-plot from scratch.

To do this problem, it is crucial that you understand the theory behind QQ-plots. I suggest that you either review your notes from class, or re-read the [section](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#qq-plots) in the book where I discuss QQ-plots.

To begin, we first need to generate a list of $q$-values according to the formula

$$
q_i = \frac{i-1/2}{m},
$$

for $i=1,2,\ldots,m$. Remember, $m$ is the size of the dataset (which was saved into the variable `m` above). Again, if you don't understand the relevance of this formula, you _need_ to review the theory of QQ-plots.

In the next code block, create a list of these $q$-values. To do this, you will use a technique called _list comprehension_; see the link [here]() for a description. Once you see what list comprehension is, hopefully you'll recall that I did something very similar back in the [third programming assignment](https://github.com/jmyers7/stats-book-materials/blob/main/programming-assignments/assignment_03.ipynb); look there for more hints. Save your list into the variable `q`. (And do not write garbage code! Make sure your spacing looks all nice and pretty.)
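
In case list comprehension is new to you, here is a minimal sketch of the pattern on a tiny made-up dataset size. The names `m_toy` and `q_toy` are hypothetical; your solution should use `m` and `q`.

```python
# Made-up dataset size for illustration only
m_toy = 4

# A list comprehension builds a list from an expression and a loop, all in
# one line. Here i runs over 1, 2, ..., m_toy (range excludes its end point).
q_toy = [(i - 0.5) / m_toy for i in range(1, m_toy + 1)]
print(q_toy)   # [0.125, 0.375, 0.625, 0.875]
```

Notice that the values are evenly spaced and sit strictly between $0$ and $1$.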

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now that we have our list of $q$-values, we need to pass these into the quantile function of the Poisson variable `X`. Do this in the next code block, saving the NumPy array into the variable `poisson_quantiles`.
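
In SciPy, the quantile function of a random variable is called `ppf` (for "percent point function"). Here is a sketch with a made-up Poisson variable; the names `toy_X`, `toy_q`, and `toy_quantiles` are hypothetical.

```python
import numpy as np
from scipy import stats

# Made-up model and q-values for illustration only
toy_X = stats.poisson(3.0)
toy_q = [0.125, 0.375, 0.625, 0.875]

# ppf is SciPy's name for the quantile function; it returns a NumPy array
toy_quantiles = toy_X.ppf(toy_q)
print(toy_quantiles)

# For a discrete variable, each quantile x satisfies cdf(x) >= q
print(toy_X.cdf(toy_quantiles) >= np.array(toy_q))
```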

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


To produce the QQ-plot, we will first take our array of Poisson quantiles and the (sorted) dataset and load them into the columns of a dataframe. Again, I will do this for you:

In [None]:
df = pd.DataFrame({'empirical': srs.sort_values(), 'poisson': poisson_quantiles})
df

We now have everything we need to produce the following QQ-plot:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/a2e743d4730f0c0683ecac4b37377f90429e1129/img/poisson-qq.svg?raw=true" width="600" align="center">
</center>

Hints/directions:

* Call the `plot` method on the dataframe `df` to produce the scatter plot. Pass in the parameter `alpha=0.1` to set the opacity.

* Call `plt.plot([0, threshold], [0, threshold], color='red')` to produce the red diagonal line.

* To ensure that the aspect ratio of the plot is fixed at `1`, call `plt.gca().set_aspect('equal')` in your code.

* As always, take care to notice the axis labels and the plot title.

* Call `plt.tight_layout()` at the end of your code.

In [None]:
# ENTER YOUR CODE IN THIS CELL



As if we needed _even more_ evidence that the Poisson model is a bad fit, the QQ-plot provides further confirmation.

## The geometric model

So, the Poisson model is not a good choice for a probabilistic model of the data. What might we choose next?

How about a _geometric_ random variable, $Y \sim Geo(\theta)$? Though we didn't discuss these in class, if you [jump over](https://mml.johnmyersmath.com/stats-book/chapters/examples-of-rvs.html#geometric-distributions) to the relevant section in the book and take a peek at the histograms of the mass functions, you'll notice that they resemble our empirical PMF in many respects. This suggests that a _geometric model_ for the data might be a good fit.

Such a model would be visualized graphically as:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/d5c2334460264a878009abf9eace1c3477c0bfae/img/geom-model.svg?raw=true" width="400" align="center">
</center>

Again, there's not much insight to be gained from this graphical representation of the model; drawing these figures is merely in anticipation of [Chapter 11](https://mml.johnmyersmath.com/stats-book/chapters/models.html).



### Problem 8 --- Comparison of PMFs

Notice that a geometric variable $Y$, like a Poisson variable, is parametrized by a single parameter $\theta$, which must lie in the interval $(0,1]$ since it represents a (nonzero) probability. How might we pick this parameter?

If you look deeper into the section of the book on geometric variables, you'll notice that the mean value of $Y \sim Geo(\theta)$ is equal to the reciprocal of the parameter, $1/\theta$. So, this suggests that we should choose $\theta$ to be the reciprocal of the empirical mean, $1/\bar{y}$. In the next code block, save this parameter value into the variable `theta`; then look at the [SciPy docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html) for geometric random variables and (in the same code cell) use this parameter to create a geometric random variable called `Y`.
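
Here is a quick check of the relationship between the parameter and the mean, using a made-up parameter value. The names `toy_theta` and `toy_Y` are hypothetical.

```python
from scipy import stats

# Made-up parameter value for illustration only
toy_theta = 0.25

# A "frozen" geometric random variable with parameter toy_theta
toy_Y = stats.geom(toy_theta)

# The mean is the reciprocal of the parameter
print(toy_Y.mean(), 1 / toy_theta)

# The support starts at 1, so there is no probability mass at 0
print(toy_Y.pmf(0), toy_Y.pmf(1))
```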

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now that we have the geometric variable `Y`, we need to get its PMF. Do this in the next code block exactly as you did for the Poisson variable `X` in Problem 5. Save your answer into the variable `geometric_pmf`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Again, to help facilitate comparison, let's take the empirical PMF of the data and the PMF of the geometric variable and load them into the columns of a dataframe. This time, I'll have you create the dataframe on your own. Call it `df`, and use the strings `'empirical'` and `'geometric'` for the keys in your dictionary.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Your goal now is to compare the PMFs by producing this plot:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/fa9142291e7e8dc8b239f40d62f178a2ad9ad202/img/geom-compare-pmf.svg?raw=true" width="900" align="center">
</center>

Use the same specifications that I gave you for the plot in Problem 4. Remember to call `plt.tight_layout()` at the end of your code.

In [None]:
# ENTER YOUR CODE IN THIS CELL



Ok, so what does this plot tell us about the fit of the geometric model? It's not _amazing_, but at least it's better than the Poisson model. One big difference between the two distributions is that the support of the geometric model consists of all positive integers $y\geq 1$; however, the dataset contains lots of observations at $y=0$.

What we need, then, is a so-called _[zero-inflated](https://en.wikipedia.org/wiki/Zero-inflated_model)_ model. While there are lots of these things out there, we're going to create a naive version that simply shifts the geometric distribution to the _left_ by one unit. This won't provide a perfect fit at $y=0$, but it's better than the current model.
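
To make the shift concrete: assuming the convention that the mass function of $Y \sim Geo(\theta)$ is supported on the positive integers (as noted above), we have

$$
P(Y = y) = \theta (1-\theta)^{y-1}, \quad y = 1, 2, \ldots,
$$

so the variable $Y - 1$, shifted to the left by one unit, has mass function

$$
P(Y - 1 = y) = \theta (1-\theta)^{y}, \quad y = 0, 1, 2, \ldots.
$$

In particular, the shifted model puts probability $\theta$ at $y=0$, where the unshifted model puts none.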

So, in the next code block, your goal is to do three things:

1. Redefine the geometric random variable `Y` so that it is shifted to the left by one unit. You'll use the same `theta` parameter that you used above, but you'll need to search [the docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html) on your own to learn how to shift the distribution.

2. Redefine `geometric_pmf` using the new `Y`.

3. Redefine the dataframe `df` using the new `geometric_pmf`. Use the keys `'empirical'` and `'(shifted) geometric'`, the latter in place of the original key `'geometric'`.

Do all of these things in the following block:

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Now, your goal is to produce the following (new) plot of the mass functions:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/0ccefb68bd0ad3414a69a1a9c50e12ee98691cf4/img/geom-shifted-compare-pmf.svg?raw=true" width="900" align="center">
</center>

As long as you produced the previous version of this plot correctly, you should be able to just copy and paste your code:

In [None]:
# ENTER YOUR CODE IN THIS CELL



Again, the fit at $y=0$ and $y=1$ is not great, but it's better than the unshifted geometric model, and certainly _much_ better than the Poisson model.

### Problem 9 --- Comparison of CDFs

In the next two problems, we will visually compare the CDFs of the dataset and the (shifted) geometric model, as well as draw a QQ-plot. However, having paid our dues previously by generating these plots from _scratch_, we have earned the right to use the Seaborn and Statsmodels libraries. Yay!

To compare the CDFs, your goal is to produce this plot:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/4ba2a0b35325af1aa86a85234d3619f99aa48eff/img/geometric-cdf.svg?raw=true" width="600" align="center">
</center>

Hints/directions:

* First, I would suggest looking at the [section](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#probabilistic-models-and-empirical-distributions) in the book where I produced ECDF plots using Seaborn for inspiration. You will not need _all_ of that code, however, so don't just thoughtlessly and sloppily copy and paste. Make sure your brain is turned on for this one, ok?

* To produce the CDF of the geometric variable `Y`, use the `step` method of `plt` as you did in Problem 6.

* Make sure to notice the axis labels, the legend, and the plot title.

* Call `plt.tight_layout()` at the end of your code.

In [None]:
# ENTER YOUR CODE IN THIS CELL



This plot shows what we already knew: the geometric model is an _ok_ fit, but not great.

### Problem 10 --- QQ-plots

Finally, let's produce a QQ-plot comparing empirical to model quantiles. You will produce this:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/3be69abe4584319842b750fdfc27a47dcdb569b6/img/geometric-qq.svg?raw=true" width="600" align="center">
</center>

Hints/directions:

* Once again, I recommend looking at the [section](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#qq-plots) in the book where I produced QQ-plots using the `qqplot` method from the Statsmodels library.

* You do _not_ need to import `qqplot` since we already did this at the beginning of the assignment. Pay attention!

* Pass in `alpha=0.1` to set the opacity.

* Notice the axis labels and the plot title.

* Call `plt.tight_layout()` at the end of your code.

In [None]:
# ENTER YOUR CODE IN THIS CELL

