# German Tank Problem

"In the statistical theory of estimation, the German tank problem consists of estimating the maximum of a discrete uniform distribution from sampling without replacement. In simple terms, suppose there exists an unknown number of items which are sequentially numbered from 1 to $N$. A random sample of these items is taken and their sequence numbers observed; the problem is to estimate $N$ from these observed numbers" - Wikipedia

The problem is motivated by a real challenge that statisticians faced during World War II: the Allies wanted to estimate how many tanks Germany was producing, and they used the serial numbers on captured tanks to do so.

Read more at https://en.wikipedia.org/wiki/German_tank_problem

Below are three methods for estimating the maximum number $N$ from a sample, where $m$ is the largest serial number in the sample and $n$ is the sample size:
- Frequentist prediction: $N \approx m + \frac{m}{n} - 1$
- Bayesian median: $N \approx m + \frac{m \ln 2}{n-1}$
- Bayesian mean: $N \approx (m-1)\frac{n-1}{n-2}$

The Bayesian median is only defined for $n \ge 2$ and the Bayesian mean for $n \ge 3$.
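
To make the arithmetic concrete, here is a minimal sketch that applies the three formulas above to one fixed sample, using the natural logarithm of 2 for the median as the notebook's code does. The function name `estimate_total` and the serial numbers are illustrative assumptions, not part of the notebook's cells.

```python
import math

def estimate_total(sample):
    """Return the three point estimates of N for a list of observed serial numbers."""
    m = max(sample)   # largest serial number observed
    n = len(sample)   # sample size
    if n < 3:
        raise ValueError("the Bayesian mean formula needs at least 3 observations")
    return {
        "frequentist": m + m / n - 1,
        "bayesian_median": m + m * math.log(2) / (n - 1),
        "bayesian_mean": (m - 1) * (n - 1) / (n - 2),
    }

# Hypothetical captured serial numbers, chosen only for illustration
estimates = estimate_total([19, 40, 42, 60])
for name, value in estimates.items():
    print(f"{name}: {value:.2f}")
```

With $m = 60$ and $n = 4$, the frequentist estimate works out to $60 + 60/4 - 1 = 74$ and the Bayesian mean to $59 \cdot 3/2 = 88.5$.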

Try it out!

In [None]:
# Libraries!
import random
import math
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [None]:
n = 5                                    # number of captured tanks
tanks = random.randint(n,5000)           # true total; at least n so sampling cannot fail
sample = random.sample(range(1,tanks+1),n)
m = max(sample)                          # largest serial number observed
Nf = m + m/n - 1                         # frequentist prediction
Nd = m + m*math.log(2)/(n - 1)           # Bayesian median
Nm = (m - 1) * (n - 1)/(n - 2)           # Bayesian mean
print(f"There are {tanks} tanks... but we don't know this!")
print(f"Our captured tanks have numbers {sample}")
print(f"Using the frequentist prediction, our estimate is {Nf}")
print(f"Using the Bayesian median, our estimate is {Nd}")
print(f"Using the Bayesian mean, our estimate is {Nm}")

## Aggregate Data

Running the cell above sometimes gives a good estimate... but sometimes it doesn't! How can we improve the estimates?

Well, we know that the total number of tanks must be _at least_ as big as the maximum number in the sample.

We can create _many_ random samples using **any** numbers up to this max value in order to generate _many_ more estimates. We can then use these aggregated estimates to try to create a better estimate of the true number of tanks.

`createEstimates(maxTanks,n)` does the following:
1. Randomly picks a true number of tanks, at most `maxTanks`
2. Generates an initial random sample of `n` tanks
3. Generates 100 random samples based on the maximum value in the initial sample
4. Calculates three estimates of the true number of tanks from each sample, one for each of the three methods above
5. Returns a pandas dataframe (a table) of all the estimates, with one column per method, along with the true number of tanks it randomly chose

`graph(df,trueTanks)` plots the data from the previous function, as well as the true number of tanks

Run the cells below to see it in action.

In [None]:
def createEstimates(maxTanks,n):
    """Picks a random true number of tanks, draws an initial sample of n,
    then builds 100 resamples and estimates the max from each one with
    the three methods. Returns (dataframe of estimates, true tank count)."""
    tanks = random.randint(n,maxTanks)   # at least n tanks, so sampling n cannot fail
    sampleInit = random.sample(range(1,tanks+1),n)
    certainMax = max(sampleInit)         # the true total is at least this large

    rows = []

    for _ in range(100):
        # resample n serial numbers from 1..certainMax (inclusive)
        sample = random.sample(range(1,certainMax+1),n)
        m = max(sample)
        Nf = m + m/n - 1                      # frequentist prediction
        Nd = m + m*math.log(2)/(n - 1)        # Bayesian median
        Nm = (m - 1) * (n - 1)/(n - 2)        # Bayesian mean
        rows.append([Nf,Nd,Nm])

    columns = ['Frequentist','BMedian','BMean']
    panda = pd.DataFrame(data=rows,columns=columns)
    # report the initial captured sample, not the last resample
    print(f"Our captured tanks have numbers {sampleInit}")
    print(f"There must be at least {certainMax} tanks")
    return panda, tanks

In [None]:
def graph(df,trueTanks):
    """Plots a density curve for each estimation method, marks the true
    number of tanks, and prints where each curve peaks"""
    fig, ax = plt.subplots(figsize=(6,6))
    methods = ['Frequentist','BMedian','BMean']

    # each kdeplot call adds one line to the same axes, in order
    for method in methods:
        sns.kdeplot(x=df[method],ax=ax)

    # vertical red line at the true number of tanks
    ax.axvline(trueTanks,color='red')

    print(f"The true number of tanks is {trueTanks}... but we don't know this!")
    # the first three lines on the axes are the density curves
    for method, line in zip(methods, ax.lines):
        x = line.get_xdata()
        y = line.get_ydata()
        print(f"Peak of {method} is {x[np.argmax(y)]}")

    ax.set(title='Estimating Tanks',xlabel='Max Estimates',ylabel='Density')
    ax.legend(['Frequentist','BMedian','BMean','Actual'])

In [None]:
# +++ The Important Cell +++
data,nVal = createEstimates(1000,7)
graph(data,nVal)
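
As a complementary check, one can see how much a single estimate varies by fixing a known total and drawing fresh samples from it many times. The sketch below does this under stated assumptions: the function name `simulate_estimates`, the fixed total of 1000 tanks, the sample size of 7, the number of trials, and the seed are all illustrative choices, not part of the cells above.

```python
import math
import random
import statistics

def simulate_estimates(true_total, n, trials=2000, seed=0):
    """Draw `trials` fresh samples of size n from 1..true_total and
    return the list of estimates produced by each of the three formulas."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    results = {"Frequentist": [], "BMedian": [], "BMean": []}
    for _ in range(trials):
        sample = rng.sample(range(1, true_total + 1), n)
        m = max(sample)
        results["Frequentist"].append(m + m / n - 1)
        results["BMedian"].append(m + m * math.log(2) / (n - 1))
        results["BMean"].append((m - 1) * (n - 1) / (n - 2))
    return results

# Hypothetical setup: 1000 tanks in total, 7 captured per sample
results = simulate_estimates(true_total=1000, n=7)
for method, values in results.items():
    print(f"{method}: mean = {statistics.mean(values):.1f}, "
          f"std dev = {statistics.stdev(values):.1f}")
```

Because the frequentist formula is an unbiased estimator of $N$, its average over many fresh samples should land close to the true total, even though any individual estimate can miss by a wide margin.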