# Plausibility of Lottery Luck--Group Computations

#### [Dylan D. Daniels](http://statistics.berkeley.edu/people/dylan-david-daniels) and [Philip B. Stark](http://www.stat.berkeley.edu/~stark), Department of Statistics, University of California, Berkeley
#### Based on MATLAB code by [Skip Garibaldi](http://www.garibaldibros.com/)

This tool appraises whether it is plausible that a list of individuals each won a set of lottery prizes honestly. 

User Inputs:

   + A comma-separated values (CSV) file of individuals, wins, odds, ticket costs, and game type (scratcher or draw).
   + An upper bound on the potential number of players (for instance, one might assume that the number of people playing the lottery isn't greater than the number of residents of the state), MAX_PLAYERS
   + A tiny "threshold" probability, CHANCE_THRESHOLD
   + The total lottery revenue during the period in question (optional), TOT_REVENUE.

The code outputs, for each individual, a lower bound on the amount every potential player
would have had to spend for _any_ of them to have a tiny chance of winning so often, where "tiny" is the threshold number chosen by the user.

If the required spending amount is, for example, several times the median house price in the state, it may call into question whether the winner won honestly.
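For a gambler who won only one game, the bound can be sketched without an optimizer. The example below is an illustration, not the notebook's method: the values of `p`, `wins`, `cost`, and `MAX_PLAYERS` are made up, and `chance_of_at_least` is a hypothetical helper. It searches for the smallest number of tickets at which the chance of winning at least `wins` times reaches the per-player cutoff `CHANCE_THRESHOLD / MAX_PLAYERS`.

```python
from scipy.stats import binom

# Hypothetical inputs, chosen only for illustration.
MAX_PLAYERS = 1_000_000      # assumed upper bound on the number of players
CHANCE_THRESHOLD = 1e-7      # the "tiny" probability
p = 1e-4                     # assumed chance of winning with one ticket
wins = 10                    # assumed number of wins claimed
cost = 5.0                   # assumed price per ticket, in dollars

# Per-player cutoff: if each player's chance is below this, the chance that
# any of MAX_PLAYERS players wins this often is below CHANCE_THRESHOLD.
cut = CHANCE_THRESHOLD / MAX_PLAYERS

def chance_of_at_least(wins, tickets, p):
    """P(X >= wins) for X ~ Binomial(tickets, p)."""
    return binom.sf(wins - 1, tickets, p)

# The chance grows with the number of tickets, so first double an upper
# limit until it is feasible, then bisect for the smallest feasible count.
lo, hi = wins, wins
while chance_of_at_least(wins, hi, p) < cut:
    hi *= 2
while lo < hi:
    mid = (lo + hi) // 2
    if chance_of_at_least(wins, mid, p) >= cut:
        hi = mid
    else:
        lo = mid + 1

min_tickets = lo
min_spend = min_tickets * cost
print("minimum tickets:", min_tickets, "minimum spend:", min_spend)
```

With several games, the ticket counts trade off against one another through their costs, which is why the notebook uses a constrained optimizer instead of a one-dimensional search.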

This version can analyze data for a group of players. 

The code implements the mathematics described in the first link below. The third link is to a public lecture about the method and results for reported lottery winners in Florida.
The fourth through seventh links are news stories that relied on such calculations.

See:
+ Arratia, R., S. Garibaldi, L. Mower, and P.B. Stark, 2015. Some people have all the luck. _Mathematics Magazine_, _88_ 196–211. doi:10.4169/math.mag.88.3.196.c, Reprint: http://www.stat.berkeley.edu/~stark/Preprints/luck15.pdf http://www.jstor.org/stable/10.4169/math.mag.88.3.196
+ Arratia, R., S. Garibaldi, L. Mower, and P.B. Stark, 2015. Some people have all the luck &hellip; or do they? _MAA Focus_, August/September, 37–38. http://www.maa.org/sites/default/files/pdf/MAAFocus/Focus_AugustSeptember_2015.pdf
+ https://www.youtube.com/watch?v=s8cHHWNblA4
+  Lottery odds: To win, you’d have to be a loser. Lawrence Mower, _Palm Beach Post_, 28 March 2014. http://www.mypalmbeachpost.com/news/news/lottery-odds-to-win-youd-have-to-be-a-loser/nfL57
+ Against all Odds, Gavin Off and Adam Bell, _The Charlotte Observer_, 29 September 2016.
http://www.charlotteobserver.com/news/special-reports/against-all-odds/
+ How did PennLive investigate America's 'luckiest' lottery players?, Daniel Simmons-Ritchie and Jeff Kelly Lowenstein, _Penn Live_, 13 September 2017.
http://www.pennlive.com/watchdog/2017/09/defying_the_odds_methodology.html
+ The math behind PennLive's analysis of frequent lottery winners, Daniel Simmons-Ritchie, _Penn Live_, 13 September 2017. http://www.pennlive.com/watchdog/2017/09/defying_the_odds_math.html


## Instructions:
1. Create a CSV file with results for all gamblers. The CSV file should contain at least the first five of these columns; the sixth, "type", is optional:  

> "name", "game", "probability", "wins", "cost", and "type" 

Each row corresponds to one type of wager for one gambler. 

+ "name" is an identifier for each gambler.
+ "game" is the name of the wager (it's OK to leave this blank).
+ "probability" is the chance of winning that wager. 
+ "wins" is the number of times the gambler collected on that wager.
+ "cost" is the cost per ticket or play on that wager. 
+ "type" is a letter identifying whether the game is a scratch-off game ('s') or involves picking numbers for a drawing ('d')

If type is not specified, the software assumes the game is a scratch-off game.

The computations assume that the gambler did not win any dependent bets, for instance, two bets on the same drawing.

2. Put the filename of your CSV file in the box below, along with the values of MAX_PLAYERS and CHANCE_THRESHOLD.

3. On the toolbar of this browser window (under the jupyter logo), click "Cell" --> "Run All". Wait a bit for your results to appear at the bottom of this page. 
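The file layout described in step 1 can be sketched as follows. The gamblers, games, and numbers are invented for illustration, and `example_gamblers.csv` is a hypothetical filename; only the column names come from the instructions above.

```python
import csv

# Hypothetical rows, invented for illustration only.
fieldnames = ["name", "game", "probability", "wins", "cost", "type"]
rows = [
    {"name": "Player A", "game": "Example Scratcher", "probability": 1e-4,
     "wins": 3, "cost": 5, "type": "s"},
    {"name": "Player A", "game": "Example Draw", "probability": 2e-5,
     "wins": 1, "cost": 1, "type": "d"},
    {"name": "Player B", "game": "", "probability": 5e-4,
     "wins": 7, "cost": 10, "type": "s"},
]

# One row per (gambler, wager) pair; a gambler with several games gets several rows.
with open("example_gamblers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header and row count.
with open("example_gamblers.csv", newline="") as f:
    loaded = list(csv.DictReader(f))
print(len(loaded), "rows with columns", list(loaded[0].keys()))
```

Setting `CSV_FILENAME = 'example_gamblers.csv'` in the first code cell would then point the notebook at this file.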

In [1]:
from __future__ import print_function, division

# Put the name of your CSV file here:
# CSV_FILENAME = 'FILL_ME_IN.csv'
# CSV_FILENAME = 'miller.csv'
# CSV_FILENAME = 'pa-20-recompiled.csv'
# CSV_FILENAME = 'jafaar-max-odds.csv'
# CSV_FILENAME = 'messier-vt.csv'
# CSV_FILENAME = 'steele-vt.csv'

CSV_FILENAME = 'va-18.csv'
COL_NAMES = ['name', 'game', 'probability', 'wins', 'cost', 'type']

# set the overall cutoff probability and the maximum number of players
CHANCE_THRESHOLD =  10**(-7) # one in ten million threshold

# MAX_PLAYERS = 1  # if there were only one person buying tickets...
# MAX_PLAYERS = 623657      # Vermont estimated 2017 per census bureau
MAX_PLAYERS = 8470020   # Virginia, estimated 2017 per census bureau, https://www.census.gov/quickfacts/VA
# MAX_PLAYERS = 5795483   # Wisconsin, estimated 2017
# MAX_PLAYERS = 27862600  # Texas population, 2016
# MAX_PLAYERS = 6859819  # Massachusetts population, 2017, per census
# MAX_PLAYERS = 12784000  # Pennsylvania population, 2016

# set the revenue the lottery took in in the relevant period
# TOT_REVENUE = 658324233 # Messier  # 400864529 # Vermont
TOT_REVENUE = 16624690000 # Virginia lottery revenue, 2008-2017
# TOT_REVENUE = 3517783 # instant games in MA for 2017, 
     # http://www.masslottery.com/lib/downloads/leadership/pdfs/Financial-Statement-History/June2017financialYTD.pdf



In [2]:
print('revenue per player', TOT_REVENUE/MAX_PLAYERS)

revenue per player 1962.7686829547038


In [3]:
import csv
import numpy as np
import math
from scipy.special import betainc
from scipy.optimize import minimize

def binTail(p, n, t): # P(X >= n) for X ~ Binomial(t, p), elementwise: chance of at least n wins in t tries
    return betainc(n, t - n + 1, p)

def binTailln(p, n, t): # logarithm of binTail(p, n, t)
    return np.log(binTail(p, n, t))

def constraintFn(p, n, tot_log_mults): 
    """
    constraint function: probability of vector of wins must be at least CUT
    <mults> accommodates draw and scratcher games.
    Scratchers have mult=1; draws have mult=2.
    
    Parameters:
    -----------
    p : list of floats
        probabilities of winning
    n : list of ints
        number of wins on each game
    tot_log_mults : double
        log of the product of probability multipliers for draw games
    """
    return lambda x: tot_log_mults + np.sum(binTailln(p, n, x)) - np.log(CUT)

def objectiveFn(c):  # construct function that gives cost of vector x of bets, for cost-per-bet vector c
    return lambda x: np.dot(x, c)

def solve(x0, upperBoundVec, p, n, c, eps, debugMode, maxiter, method='SLSQP'):  
    # invoke the constrained optimizer
    # 
    #    x0:     starting guess
    #    upperBoundVec: vector of upper bounds on the number of bets on each game
    #    p:      vector of game probabilities
    #    n:      vector of number of wins of each game
    #    c:      vector of game costs
    #    eps:    step size for finite-difference derivative approximation
    #    debugMode: True for verbose output
    #    maxiter: maximum iterations in optimizer
    #    method: underlying minimization algorithm
    #       
    cons = ({'type': 'ineq', 'fun': constraintFn(p, n, tot_log_mults)})   # overall probability constraint
    bnds = tuple((n[i], upperBoundVec[i]) for i in range(len(n)))  # must bet at least n times to win n times
    return minimize(objectiveFn(c), x0, method=method, jac=(lambda x: c),
                    constraints=cons, bounds=bnds,
                    options={'disp': debugMode, 'maxiter': maxiter, 'eps': eps})

def readCsv(filename):  # read the csv file of data for all players
    gamblers=[]; games=[]; pValues=[]; nValues=[]; cValues=[]; mValues=[]
    skipped = 0  # rows with no 'type' key, treated as scratchers
    read = 0     # rows whose 'type' key was present
    with open(filename, 'r') as f:
        for row in csv.DictReader(f):
            try:
                # convert all numeric fields before appending, so that a bad row
                # cannot leave the lists with different lengths
                pValue = float(row[COL_NAMES[2]])
                nValue = float(row[COL_NAMES[3]])
                cValue = float(row[COL_NAMES[4]])
            except ValueError:
                print('Skipping row:\n', row)
                continue
            gamblers.append(row[COL_NAMES[0]])
            games.append(row[COL_NAMES[1]])
            pValues.append(pValue)
            nValues.append(nValue)
            cValues.append(cValue)
            try:
                mValues.append(2 if row[COL_NAMES[5]] in ['d', 'draw'] else 1)
                read += 1
            except KeyError:
                skipped += 1
                mValues.append(1)
    print("read types for {} rows\nimputed scratcher for {} rows with missing key".format(
            read, skipped))
    return np.array(gamblers), np.array(games), np.array(pValues),\
           np.array(nValues), np.array(cValues), np.array(mValues)

def solveProblem(tries=5, debugMode=False, epsilon = 1e-7, epsFac=8, maxiter=10**4):
    # Try epsFac finite-difference step sizes, each 10 times the previous, starting from epsilon
    optimalValues = []     # candidate optima
    optimalProbs = []      # probabilities associated with those optima
    optimalSolutions = []  # detailed optimization output for candidate optima
    if debugMode:
        print("n: {} \np: {}".format(n,p))
    for meth in methods:   # try different optimization methods
        for epsIndex in range(epsFac):  # try different finite-difference step sizes
            x0 = np.array(that/divisor) # starting guess
            for i in range(tries):
                while (tot_log_mults + np.sum(np.log(binTail(p, n, x0))) - np.log(CUT)) < 0:  # ensure x0 is a feasible point
                    x0 = np.add(x0,np.ones_like(x0))  # increment every element of x
                if (debugMode):
                    print("method: {} try: {} \nx0: {} \nprobability {}:".format(meth,i,x0,\
                                np.prod(binTail(p, n, x0))))
                optimOutput = solve(x0, that, p, n, c, epsilon*10**epsIndex, debugMode, maxiter, method=meth)
                if optimOutput['success']:
                    attainedProb = np.prod(binTail(p, n, optimOutput['x']))
                    if attainedProb <= CUT:
                        optimalValues.append(optimOutput['fun'])
                        optimalProbs.append(attainedProb)
                        optimalSolutions.append(optimOutput)
                    if debugMode:
                        print(optimOutput)
                        print("attained probability: {}".format(attainedProb))
                x0 = [np.random.randint(low=n[i], high=that[i]) for i in range(len(n))] # update x0 randomly
    if len(optimalValues) == 0:
        raise Exception('No candidate optimal solution found.')
    bestValue = np.min(optimalValues)
    largestProb = np.max(tuple(optimalProbs))
    if debugMode:
        print("\nFound {} candidate minima: {}".format(len(optimalValues), optimalValues))
        print("Best value: {}".format(bestValue))
    print("{} \t {} \t {} \t ${:,.0f} \t {} \t {}".format(g, int(np.sum(n)), 
                                                    len(n), int(bestValue), 
                                                    MAX_PLAYERS*attainedProb,
                                                    MAX_PLAYERS*bestValue/TOT_REVENUE))
    return optimalValues, optimalProbs
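As a quick check on `binTail`, the regularized incomplete beta function used there can be compared with the binomial survival function in `scipy.stats`. The arrays below are hypothetical values chosen for illustration; the identity itself is the standard relation P(X >= n) = I_p(n, t - n + 1) for X ~ Binomial(t, p).

```python
import numpy as np
from scipy.special import betainc
from scipy.stats import binom

# Hypothetical values, for illustration only.
p = np.array([0.01, 0.001])   # chance of winning each game
n = np.array([3, 2])          # number of wins on each game
t = np.array([200, 1500])     # number of tickets bought for each game

# Chance of at least n wins in t tries, computed two ways.
via_beta = betainc(n, t - n + 1, p)    # same expression as binTail(p, n, t)
via_binom = binom.sf(n - 1, t, p)      # P(X > n - 1) = P(X >= n)

print(via_beta)
print(via_binom)
```

The beta-function form also accepts non-integer `t`, which matters because the optimizer treats the number of bets as a continuous variable.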

In [4]:
# parameters common to the calculations for all players

np.random.seed(957334456) # seed NumPy's global generator explicitly, for reproducibility

debugMode = False  # verbose output if True; set to False for less output

CUT = CHANCE_THRESHOLD / MAX_PLAYERS # Bonferroni cutoff probability

divisor = 5 # initial value for optimizer is expected number divided by divisor (modified to ensure feasibility)

tries = 8 # number of times to run the optimization code from different starting points

methods = ['SLSQP','COBYLA']  # COBYLA will ignore the individual bounds, but should honor the probability constraint
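To see how these parameters enter the optimization, here is a toy two-game version of the problem, written as a self-contained sketch. The probabilities, win counts, costs, and player count are hypothetical, and `log_prob` is an illustrative helper and not part of the notebook. The sketch uses a single SLSQP run from one starting point, without the restarts and step-size sweep that `solveProblem` performs.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betainc

# Hypothetical two-game record for one gambler (illustrative values only).
p = np.array([1e-3, 5e-4])   # chance of winning each game
n = np.array([5.0, 3.0])     # wins claimed on each game
c = np.array([2.0, 10.0])    # cost per ticket for each game

cut = 1e-7 / 1_000_000       # assumed CHANCE_THRESHOLD / MAX_PLAYERS

def log_prob(x):
    """Log of the chance of at least n[i] wins in x[i] independent tries of each game."""
    return np.sum(np.log(betainc(n, x - n + 1, p)))

upper = n / p                # number of bets at which n wins would be expected
x0 = upper / 5               # starting guess (divisor = 5), feasible for these inputs

# Minimize total spend subject to the joint probability being at least cut.
result = minimize(
    lambda x: np.dot(x, c), x0, jac=lambda x: c, method="SLSQP",
    bounds=list(zip(n, upper)),   # must bet at least n times to win n times
    constraints={"type": "ineq", "fun": lambda x: log_prob(x) - np.log(cut)},
)
print("bets:", result.x)
print("minimum spend:", result.fun)
print("attained probability:", np.exp(log_prob(result.x)))
```

At a solution the attained probability should sit at, or very close to, the cutoff, since spending less would push the probability below it.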

In [5]:
gg, mm, pp, nn, cc, mults = readCsv(CSV_FILENAME)  # read the data for all players

read types for 0 rows
imputed scratcher for 726 rows with missing key
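The message above is what `readCsv` prints when looking up the "type" key fails for every row, which is consistent with a file that has no "type" column. The sketch below reproduces that situation with a hypothetical two-row file held in memory; the rows are invented for illustration.

```python
import csv
import io

# Hypothetical CSV text with no "type" column.
text = (
    "name,game,probability,wins,cost\n"
    "Player A,Example,0.001,2,5\n"
    "Player B,Example,0.002,1,5\n"
)

read = 0      # rows whose "type" key exists
imputed = 0   # rows treated as scratchers because the key is missing
for row in csv.DictReader(io.StringIO(text)):
    try:
        row["type"]
        read += 1
    except KeyError:
        imputed += 1

print("read types for {} rows".format(read))
print("imputed scratcher for {} rows with missing key".format(imputed))
```

In that case every game gets a multiplier of 1, so the "product of draw adjustments" printed for each gambler is 1.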




In [6]:
print("Found", len(np.unique(gg)), "gamblers:\n", np.unique(gg))
print(("Assumptions:\n If {:,} people bet the amount in column 4, " +\
      "chance would be no larger than {} that any of them would win " + \
      "as much as this person").format(MAX_PLAYERS, CHANCE_THRESHOLD))
print("Name\t wins \t games \t minimum spend \t attained probability \t multiple of revenue")

for g in np.unique(gg):
    gambler = gg==g
    p = pp[gambler]
    n = nn[gambler]
    c = cc[gambler]
    m = mults[gambler]
    all_mults = np.prod(m)  # multipliers for draw v scratcher, times multiplicity
    tot_log_mults = np.log(all_mults)
    that = n/p  # expected number of wagers on each bet required to win that bet n times
    print('product of draw adjustments {}'.format(all_mults))
    if debugMode:
        print("initial t_hat: {} \ninitialprobability: {}".format(that,all_mults*np.prod(binTail(p, n, that))))


    # 'that' is used as an upper bound; ensure that it's compatible with the probability constraint
    while all_mults*np.prod(binTail(p, n, that)) < CUT:
        that = 2*that
    
    if debugMode:
        print ("adjusted t_hat: {} \nadjusted probability: {}".format(that,all_mults*np.prod(binTail(p, n, that))))

    optimalValues, optimalProbs = solveProblem(tries = tries, debugMode=debugMode, epsilon = 1e-7, epsFac=8, maxiter=10**4)

Found 40 gamblers:
 ['Albert Brown Waverly, Va' 'Alemayehu Yonas Alexandria, Va'
 'Arthur Mowbray Chesapeake, Va'
 'Carlin Walton Nash Jr. Providence Forge, Va'
 'Charles Hargrove Woodbridge, Va' 'Crystal Smith Winchester, Va'
 'Donald Jackson Hampton, Va' 'Eddie Bellett Jr Winchester, Va'
 'Ellen Davis Petersburg, Va' 'Eugene Vernon Hunt Chesapeake, Va'
 'Franklin Myers Mechanicsville, Va' 'Galen Tang Centreville, Va'
 'Gloridine Lambert Dewitt, Va' 'Gregory Minnick Ruther Glen, Va'
 'Henry Hart Spotsylvania, Va' 'James Butts Chesapeake, Va'
 'James King, Jr. Dry Fork, Va' 'Jane Watts Chesapeake, Va'
 'John Matthews Hopewell, Va' 'Kawaljit Singh Fredericksburg, Va'
 'Keith Moore South Hill, Va' 'Kenneth Good Glen Allen, Va'
 'Kimberly Beamon Suffolk, Va' 'Lana Neibert Suffolk, Va'
 'Md Hossain Henrico, Va' 'Orlando Harrison Emporia, Va'
 'Philip Redfearn Alexandria, Va' 'Reginald Long Roanoke, Va'
 'Rico Walker Salem, Va' 'Rolando Toledo Nave Va Beach, Va'
 'Ronald Lee Roanoke, Va' 'S



Albert Brown Waverly, Va 	 90 	 21 	 $1,325,743 	 1.000000000333255e-07 	 675.4454150425976
product of draw adjustments 1
Alemayehu Yonas Alexandria, Va 	 79 	 27 	 $3,884,304 	 9.999999999374798e-08 	 1978.9927422248272
product of draw adjustments 1
Arthur Mowbray Chesapeake, Va 	 73 	 20 	 $3,692,753 	 9.999999872514452e-08 	 1881.4003896746033
product of draw adjustments 1
Carlin Walton Nash Jr. Providence Forge, Va 	 72 	 39 	 $6,925,037 	 9.999999978074069e-08 	 3528.1985544932295
product of draw adjustments 1
Charles Hargrove Woodbridge, Va 	 63 	 11 	 $557,165 	 1.0000000002312115e-07 	 283.86730395252846
product of draw adjustments 1
Crystal Smith Winchester, Va 	 72 	 17 	 $1,874,306 	 9.999998855488775e-08 	 954.9301084642483
product of draw adjustments 1
Donald Jackson Hampton, Va 	 98 	 35 	 $3,936,827 	 9.999999957083597e-08 	 2005.7520485410573
product of draw adjustments 1
Eddie Bellett Jr Winchester, Va 	 32 	 15 	 $481,379 	 9.9999999871583e-08 	 245.2553513373592
prod
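The last column of each output row is `MAX_PLAYERS*bestValue/TOT_REVENUE`, as printed by `solveProblem`. As a rough check, the sketch below recomputes it for the first row, using the rounded minimum spend shown in that row and the Virginia settings from the first code cell. The result differs slightly from the printed value because the printed spend is rounded to whole dollars.

```python
MAX_PLAYERS = 8470020        # Virginia population estimate from the first code cell
TOT_REVENUE = 16624690000    # Virginia lottery revenue, 2008-2017, from the first code cell
min_spend = 1325743          # minimum spend from the first output row, rounded to dollars

# Total spending if every potential player spent the minimum amount.
spend_by_all = MAX_PLAYERS * min_spend

# How many times the lottery's actual revenue that total would be.
multiple_of_revenue = spend_by_all / TOT_REVENUE
print("multiple of revenue: {:.2f}".format(multiple_of_revenue))
```

A multiple far above 1 means that not every potential player could have spent that much, since their combined spending would exceed everything the lottery took in.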

In [8]:
# version information
%load_ext version_information
%version_information scipy, numpy, csv, pandas, matplotlib, notebook

Loading extensions from ~/.ipython/extensions is deprecated. We recommend managing extensions like any other Python packages, in site-packages.




| Software | Version |
|---|---|
| Python | 3.7.3 64bit [Clang 4.0.1 (tags/RELEASE_401/final)] |
| IPython | 7.6.1 |
| OS | Darwin 19.3.0 x86_64 i386 64bit |
| scipy | 1.2.1 |
| numpy | 1.16.3 |
| csv | 1.0 |
| pandas | 0.24.2 |
| matplotlib | 2.0.0b3 |
| notebook | 6.0.0 |

Wed Jan 08 05:07:47 2020 PST
