<table align="left" style="border-style: hidden" class="table"> <tr><td class="col-md-2"><img style="float" src="http://prob140.org/assets/icon256.png" alt="Prob140 Logo" style="width: 120px;"/></td><td><div align="left"><h3 style="margin-top: 0;">Probability for Data Science</h3><h4 style="margin-top: 20px;">UC Berkeley, Fall 2018</h4><p>Ani Adhikari and Jim Pitman</p>CC BY-NC 4.0</div></td></tr></table><!-- not in pdf -->

In [None]:
from datascience import *
from prob140 import *
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import numpy as np
from scipy import stats
import warnings
warnings.simplefilter('ignore')
from matplotlib.ticker import FormatStrFormatter

In [None]:
def override_hist(*args, **kwargs):
    """
    This cleans up some unfortunate floating point precision
    bugs in the datascience library
    """
    #kwargs['edgecolor'] = 'w'
    Table.hist2(*args, **kwargs)
    ax = plt.gca()
    ticks = ax.get_xticks()
    if np.any(np.array(ticks) != np.rint(ticks)):
        ax.xaxis.set_major_formatter(FormatStrFormatter('%.2f'))

if not hasattr(Table, 'hist2'):
    Table.hist2 = Table.hist
    
Table.hist = override_hist

# Lab 11: Chinese Restaurant Process, Part II #
This is a continuation of Lab 10. Please review the description of the Chinese Restaurant process provided at the beginning of that lab.

In Lab 10 you studied the distribution of the number of tables formed by $N$ people in a Chinese Restaurant process with parameter $\theta$. That is, you studied the distribution of the number of clusters.

You noticed that with high probability there are not many tables compared to $N$, and you showed that the expected number of tables is roughly $\theta \log(N)$. You also saw by simulation that the distribution of people across tables was typically quite uneven, with a few tables proving to be more popular than others.
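
To get a feel for how good the $\theta \log(N)$ approximation is, here is a minimal sketch that compares it with the exact expectation. It relies on the seating rule of the process: when $i$ people are already seated, the next person starts a new table with chance $\theta/(i + \theta)$, so the expected number of tables is the sum of these chances. The function name `expected_tables` is made up for this illustration and is not part of Lab 10.

```python
import math

def expected_tables(N, theta):
    """Exact expected number of tables after N people are seated:
    the sum of theta / (theta + i) for i = 0, 1, ..., N-1."""
    return sum(theta / (theta + i) for i in range(N))

# Compare the exact expectation with the approximation theta * log(N)
theta = 1
for N in (10, 100, 1000, 10000):
    exact = expected_tables(N, theta)
    approx = theta * math.log(N)
    print(N, round(exact, 3), round(approx, 3))
```

For $\theta = 1$ the exact value is a harmonic sum, which exceeds $\log(N)$ by a roughly constant amount, so the ratio of the two approaches 1 slowly as $N$ grows.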

In this lab you will study the long run behavior of the proportion of people at Table 1. This helps answer questions such as:

- About what proportion of the animals are of the same species as the first one you saw?
- What is the chance that more than half the retweets are the same as the first one?

The proportion of people at the first table has some surprising and beautiful properties. You will learn:

- Why the behavior of the proportion supports the idea that "the rich get richer"
- The connection between the long run distribution of the proportion and the beta family

### From Lab 10: Simulating the Process ###
For ease of reference, here is the definition of the function `cr` from Lab 10. You defined it to take `N` and `theta` as its arguments, run the Chinese Restaurant process with parameter `theta` until `N` people have been seated, and return an array of the counts of people at the tables in the order of table formation.

In [None]:
def cr(N, theta):
    tables = make_array()
    people = make_array()
    
    for i in range(N):
        n = sum(people)
        new_table = len(tables) + 1
        
        tbl_choices = np.append(tables, new_table)
        tbl_probs = np.append(people, theta)/(n+theta)
    
        choice = int(np.random.choice(tbl_choices, p = tbl_probs))
    
        if choice == new_table:
            tables = tbl_choices
            people = np.append(people, 1)
        else:
            people[choice-1] = people[choice-1]+1 
        
    
    return people
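
As a worked example of the seating rule inside `cr`, the sketch below computes the chances faced by the next arrival for one particular configuration. It mirrors the line that builds `tbl_probs`, but uses plain NumPy instead of `make_array`; the helper name `seating_probabilities` and the configuration of 3 people at Table 1 and 1 person at Table 2 are assumptions chosen only for illustration.

```python
import numpy as np

def seating_probabilities(people, theta):
    """Chances that the next person joins each existing table
    (in order of table formation) or starts a new table (last entry)."""
    people = np.asarray(people, dtype=float)
    n = people.sum()                       # number of people already seated
    return np.append(people, theta) / (n + theta)

# 3 people at Table 1, 1 person at Table 2, theta = 1
probs = seating_probabilities([3, 1], theta=1)
print(probs)   # chances for Table 1, Table 2, and a new table
```

With 4 people seated and $\theta = 1$, the three chances are $3/5$, $1/5$, and $1/5$: the fuller table is the more likely choice.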

#newpage

## Part 1. The Rich Get Richer ##

In our analysis of the Chinese Restaurant stochastic process, we will say that Person $n$ enters the system at time $n$. Thus time is always equal to the total number of people in the system at that time.

In this Part you will follow the **number** of people at Table 1 as the process evolves.

- At time 1 the number of people at Table 1 is 1 because there is only one person and that person sits at Table 1. 
- At time 2 the number is either 2 (if Person 2 chooses Table 1) or 1 (if Person 2 starts a new table). 
- And so on.
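
The "and so on" can be made precise. Write $W_n$ for the number of people at Table 1 at time $n$ (the notation used later in this Part). The seating rule in `cr` gives each existing table a chance proportional to its current count, which suggests the following one-step rule. It is read off from the code as a sketch, not derived in this lab:

$$
P(W_{n+1} = W_n + 1 \mid W_n) = \frac{W_n}{n + \theta}, \qquad P(W_{n+1} = W_n \mid W_n) = \frac{n + \theta - W_n}{n + \theta}
$$

For example, with $n = 1$ and $W_1 = 1$, Person 2 joins Table 1 with chance $\frac{1}{1 + \theta}$ and starts a new table with chance $\frac{\theta}{1 + \theta}$.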

### a) ###

Modify the definition of `cr` to define a function `t1_counts` that takes `N` and `theta` as arguments and does the following:

- Runs the Chinese Restaurant process with parameter `theta` till time `N`.
- Then returns an array of length `N` such that the $i$th element of the array is the number of people at Table 1 at time $i$. 

Thus, the first element of the returned array should always be 1. Recall that if the second person joins Table 1, then the second element should be 2. If the third person then joins Table 2, then the third element should still be 2, and so on.
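
To make the required output format concrete, here is a small sketch that turns one made-up seating record into the corresponding array of Table 1 counts. It does not run the Chinese Restaurant process: the array `seating` is a hypothetical record of which table each arriving person chose, matching the example just described (Person 2 joins Table 1, Person 3 starts Table 2).

```python
import numpy as np

# Hypothetical record: the table chosen by each person, in order of arrival
seating = np.array([1, 1, 2, 1, 3])

# Running count of people at Table 1 after each arrival
t1_running = np.cumsum(seating == 1)
print(t1_running)
```

The result has one entry per person, starts at 1, and stays at 2 when the third person sits elsewhere.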

In [None]:
def t1_counts(N, theta):
    
    t1 = make_array()    # array of Table 1 counts
    
    ...
    ...
    
    return t1

Run the following cell several times and check that the output makes sense. By now you have probably discovered that this goes much quicker if you run the cell by using Control-Return instead of Shift-Return.

Change the value of $\theta$ to 0.5 and also to 2, run the cell a few times with each $\theta$, and make sure that the output still makes sense.

In [None]:
t1_counts(2, 1)

Choose from the three options below to complete the sentence:

For every number of people `N` and positive parameter `theta`, the sequence of entries in the array `t1_counts(N, theta)` should be

(i) non-increasing.

(ii) non-decreasing.

(iii) not necessarily either non-increasing or non-decreasing.


**Your answer here**

### b) ###

Now define a function `plot_t1_counts` that takes `N`, `theta`, and `repetitions` as its arguments and for each repetition does the following:

- Runs the Chinese Restaurant process with parameter `theta` till time `N` and keeps track of the number of people at Table 1 at each time 1 through $N$.
- Displays a graph of the counts at Table 1 versus time.

We will call each graph a *path* of the number of people at Table 1 as people enter the system.

Don't forget that we start with 1 person.

In [None]:
def plot_t1_counts(N, theta, repetitions):
    n = ...
    for i in range(repetitions):
        plt.plot(n, ..., lw=2)  

### c) ###

Run the following cell several times. Then change $N$ and $\theta$ and run the cell again. Make sure you include $\theta = 0.5$ and $\theta = 2$. If you change the number of paths, keep it fairly small so that you can see the individual paths clearly.

In [None]:
N = 100
theta = 1
plot_t1_counts(N, theta, 10)
plt.xlabel('Time')
plt.title('Number of People at Table 1');

### d) ###
Each path is the graph of a function of time. What kind of function do you see? Briefly summarize what the paths are likely to look like, based on your observations in (c).


**Your answer here**

### e) ###
Let $W_n$ be the number of people at Table 1 at time $n$, and consider the rate of change of $W_n$ as a function of $n$. Based on your observations in (c), for paths that have a high rate of change when $n$ is small, is the rate of change typically high or typically low as $n$ gets larger? Explain why it is possible to do a pretty good job of predicting $W_n$ for a large $n$, based on observing the early stages of the process.


**Your answer here**

### f) ###
The Chinese Restaurant process is said to have the property that "the rich get richer." Briefly explain this in light of your answer to (e). 

Then explain how you would describe this property to a natural scientist in a context where instead of people there are animals sitting at "tables", where tables are species. It might help to imagine that the data consists of video recorded by a [hidden camera](https://gizmodo.com/how-dslr-camera-traps-are-capturing-stunning-wildlife-p-1730499208) in a nature reserve. 


**Your answer here**

#newpage

## Part 2. Long Run Proportion at Table 1 ##
Now track the **proportion** of people at Table 1, instead of the number of people at the table.

### a) ###
Define a function `plot_t1_proportions` that's just like `plot_t1_counts` except that it plots the proportions of people at Table 1 instead of the counts.

In [None]:
def plot_t1_proportions(N, theta, repetitions):
    ...
    ...

### b) ###
Run the cell below several times, then change $N$ and $\theta$ as in Part 1(c) and run it again several times.

In [None]:
N = 100
theta = 1
plot_t1_proportions(N, theta, 10)

### c) ###
What feature of the paths helps confirm the following result?

If $W_n$ is the number of people at Table 1 at time $n$, then the proportion $\frac{W_n}{n}$ converges with probability 1 as $n \to \infty$.


**Your answer here**

#newpage

## Part 3. Limit Distribution of the Proportion ##

Let $W = \lim_{n \to \infty} \frac{W_n}{n}$ be the limit of the proportion of people at Table 1.

Clearly the possible values of $W$ lie in the interval $(0, 1)$. In this part you will use simulation to identify the distribution of $W$ over the unit interval.

To do this, you will simulate the distribution of $\frac{W_N}{N}$ for a large $N$, and compare it with a known distribution.

### a) ###
Define a function `t1_prop_at_fixed_time` that takes `N`, `theta` and `repetitions` as its arguments. In each repetition, the function runs the Chinese Restaurant process with parameter `theta` till time `N`, and computes the proportion of people at Table 1 at time `N`. The function returns an array of the simulated proportions.

In [None]:
def t1_prop_at_fixed_time(N, theta, repetitions):
    
    t1_proportion = make_array()
    for i in np.arange(repetitions):
        ...
        ...
    return t1_proportion

Run the cell below to check that the output is an array of proportions and that the array has the correct length.

In [None]:
t1_prop_at_fixed_time(100, 1, 5)

### b) ###
Complete the function definition in the cell below. The function should display the empirical histogram of $\frac{W_N}{N}$ with the beta $(1, \theta)$ density overlaid.

In [None]:

# Empirical distribution of W_N/N
# with beta (1, theta) density overlaid

def plot_limit_t1_proportion(N, theta, repetitions):
    t1_props = ...
    Table().with_column('Proportion at Table 1', t1_props).hist(bins=20)
    x = np.arange(0, 1.01, 0.01)
    plt.plot(x, ..., color='red', lw=2)
    plt.title('Overlaid Density: Beta'+ r'$(1, \theta)$');

### c) ###
Use `plot_limit_t1_proportion` to display the empirical histogram of $\frac{W_{100}}{100}$ based on 2000 repetitions of the following:

- Run the Chinese Restaurant process with $\theta = 1$ till time 100, and compute $\frac{W_{100}}{100}$

How does the empirical distribution compare with the overlaid beta $(1, \theta)$ density? 

[Note: If you want to experiment with making $N$ larger than 100, be prepared to be patient as the code chugs along.]

### d) ###
Repeat Part (c) keeping everything the same but changing $\theta$ from 1 to 0.5.

### e) ###
Give a brief intuitive explanation of why the long run proportion of people at Table 1 is more likely to be high than low when $\theta = 0.5$. It will help to think about what happens when the first few people enter the system, and also to keep in mind Part 1 of this lab.


**Your answer here**

### f) ###
Repeat Part (c) keeping everything the same but changing $\theta$ from 1 to 2.

What you have observed by simulation is the distribution of the long run proportion of people at Table 1. 

Why is it beta $(1, \theta)$? Do this week's homework and you'll find out. The Chinese Restaurant process is closely related to the beta-binomial process that you have studied in class.
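
As a small illustration of what the beta $(1, \theta)$ limit gives you, consider the retweet question from the introduction: the chance that the long run proportion at Table 1 exceeds one half. The sketch below assumes the beta $(1, \theta)$ limit observed above and uses the standard fact that a beta $(1, \theta)$ random variable $W$ satisfies $P(W > x) = (1 - x)^\theta$. It checks that formula against samples from NumPy's beta generator, not against the Chinese Restaurant process itself. The function name `chance_above` and the seed are chosen only for this illustration.

```python
import numpy as np

def chance_above(x, theta):
    """P(W > x) when W has the beta (1, theta) distribution."""
    return (1 - x) ** theta

rng = np.random.default_rng(140)   # arbitrary seed, for reproducibility only
for theta in (0.5, 1, 2):
    samples = rng.beta(1, theta, size=200_000)
    simulated = np.mean(samples > 0.5)   # empirical chance of exceeding 1/2
    print(theta, round(chance_above(0.5, theta), 4), round(float(simulated), 4))
```

The chance of exceeding one half is larger for smaller $\theta$, in line with the histograms for $\theta = 0.5$, $1$, and $2$.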

## Conclusion ##
What you have learned:

- The number of people at Table 1 grows in a predictable way after some randomness in the early stages.
- If Table 1 gets rich, then it gets richer.
- The proportion of people at Table 1 has a limit, and the distribution of the limit is a member of the beta family.
- There's got to be a reason why the beta density appears; the cliff-hanger is resolved in this week's homework, via the beta-binomial process.
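
The first three points can be seen in one compact sketch. It is not the `cr`-based approach used in this lab: as a simplifying assumption it tracks only Table 1, using the rule that when $n$ people are seated and $w$ of them are at Table 1, the next person joins Table 1 with chance $w/(n + \theta)$. The function name `table1_proportion_path` and the seed are hypothetical choices for this illustration.

```python
import numpy as np

def table1_proportion_path(N, theta, rng):
    """Proportion of people at Table 1 at times 1 through N,
    simulated by tracking the Table 1 count only."""
    w = 1                      # Person 1 always sits at Table 1
    props = np.empty(N)
    props[0] = 1.0
    for n in range(1, N):
        # n people are seated; the next one joins Table 1 with chance w/(n+theta)
        if rng.random() < w / (n + theta):
            w += 1
        props[n] = w / (n + 1)
    return props

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
path = table1_proportion_path(2000, 1, rng)

# Spread of the proportion over an early stretch and over a late stretch
print(np.ptp(path[:100]), np.ptp(path[1000:]))
```

Each arrival changes the proportion by at most $\frac{1}{n+1}$ at time $n + 1$, which is one reason the paths flatten out as time grows.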

More advanced analyses of the Chinese Restaurant process include descriptions of the joint distribution of the proportions at all the tables. These are related to the *Dirichlet-multinomial* process, which is the multivariate version of the beta-binomial process. 

It's late in the semester and we don't have time to go into all that. But you can skim [a description](https://en.wikipedia.org/wiki/Chinese_restaurant_process) of the Chinese Restaurant process, in which you will spot a lot that is familiar and a lot that isn't. For an illuminating exposition, read the first few pages of [a much-cited paper](https://cocosci.berkeley.edu/tom/papers/ncrp.pdf) by Blei, Griffiths, and Jordan. 

In [None]:
import gsExport
gsExport.generateSubmission("Lab11.ipynb")