In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from fraud_simulation import ElectionSimulator, FraudAnalyzer, is_nice, bayes_inversion

In [3]:
voter_iteration_tuples = [(300, 30000), (1000, 100000), (3000, 1000000), (10000, 10000000), (30000, 30000000)]
p_res_fair, p_fair, p_fraud, p_res_fraud, p_fair_res = [], [], [], [], []

for voters, iterations in voter_iteration_tuples:
    nea_dimokratia_election = ElectionSimulator(num_candidates=4,
                                                names=["Meimarakis",
                                                       "Mitsotakis",
                                                       "Tzitzikostas",
                                                       "Georgiadis"])

    nea_dimokratia_election.update_distribution({"Meimarakis": [0.335, 0.41, 0.46],
                                                 "Mitsotakis": [0.313, 0.2, 0.251],
                                                 "Tzitzikostas": [0.227, 0.18, 0.154],
                                                 "Georgiadis": [0.111, 0.17, 0.101]}, 
                                                voters)  # the real election had 404078 votes in total
    simulation = nea_dimokratia_election.election_run(num_iterations=iterations)
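    sim_fraud = FraudAnalyzer(simulation, is_nice)
    sim_fraud.find_nice()
    # Fraction of simulated fair elections flagged as "nice", used as p(result|fair)
    p_res_fair.append(sim_fraud.fraction_nice())
    # Priors p(fair), p(fraud) and the assumed likelihood p(result|fraud)
    p_fair.append(0.99)
    p_fraud.append(0.01)
    p_res_fraud.append(0.01)
    p_fair_res.append(bayes_inversion(p_res_fair[-1], p_fair[-1], p_res_fraud[-1]))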
    print("%d voters -> p(result|fair) = %4.3e, p(fair|result) = %4.3f" % (voters, p_res_fair[-1], p_fair_res[-1]))

300 voters -> p(result|fair) = 5.300e-03, p(fair|result) = 0.981
1000 voters -> p(result|fair) = 4.460e-03, p(fair|result) = 0.978
3000 voters -> p(result|fair) = 5.700e-05, p(fair|result) = 0.361
10000 voters -> p(result|fair) = 1.200e-06, p(fair|result) = 0.012
30000 voters -> p(result|fair) = 1.000e-07, p(fair|result) = 0.001
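
The posteriors printed above are consistent with a direct application of Bayes' rule. The implementation of `bayes_inversion` is not shown in this notebook, so the function below is only a minimal sketch under the assumption that it takes p(result|fair), p(fair) and p(result|fraud), with p(fraud) = 1 - p(fair). The name `posterior_fair` is hypothetical.

```python
def posterior_fair(p_res_fair, p_fair, p_res_fraud):
    """Return p(fair|result) from Bayes' rule.

    Assumes only two hypotheses, so p(fraud) = 1 - p(fair).
    """
    p_fraud = 1.0 - p_fair
    numerator = p_res_fair * p_fair
    evidence = numerator + p_res_fraud * p_fraud
    return numerator / evidence


# Inputs taken from the 300-voter and 3000-voter rows of the output above.
post_300 = posterior_fair(5.3e-3, 0.99, 0.01)
post_3000 = posterior_fair(5.7e-5, 0.99, 0.01)
print("%.3f %.3f" % (post_300, post_3000))
```

With these inputs the sketch reproduces the 0.981 and 0.361 reported for 300 and 3000 voters.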


In [4]:
df = pd.DataFrame({"p(result|fair)": p_res_fair,
                   "p(result|fraud)": p_res_fraud,
                   "p(fair)": p_fair,
                   "p(fraud)": p_fraud,
                   "p(fair|result)": p_fair_res})
df.head()

Unnamed: 0,p(fair),p(fair|result),p(fraud),p(result|fair),p(result|fraud)
0,0.99,0.981298,0.01,0.0053,0.01
1,0.99,0.977854,0.01,0.00446,0.01
2,0.99,0.360736,0.01,5.7e-05,0.01
3,0.99,0.011741,0.01,1.2e-06,0.01
4,0.99,0.000989,0.01,1e-07,0.01
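
The `p(result|fair)` column is a Monte Carlo estimate: the fraction of simulated fair elections whose percentages look "nice". `ElectionSimulator` and `is_nice` are not shown here, so the following is only a rough stand-in written for Python 3 with the standard library. It assumes that "nice" means every percentage, rounded to three decimals, has zeros in its 2nd and 3rd decimal digits, and it draws votes from a single set of shares (the first poll column used above). The names `looks_round` and `estimate_p_round` are hypothetical.

```python
import random

# First poll column from the update_distribution call above (an assumption:
# the real simulator may combine all three polls differently).
SHARES = [0.335, 0.313, 0.227, 0.111]


def looks_round(votes):
    """True if every percentage, rounded to 3 decimals, ends in '00'."""
    total = sum(votes)
    for v in votes:
        # Percentage times 1000, rounded half up with integer arithmetic.
        milli_pct = (200000 * v + total) // (2 * total)
        if milli_pct % 100 != 0:
            return False
    return True


def estimate_p_round(shares, voters, iterations, seed=0):
    """Fraction of simulated elections whose percentages look round."""
    rng = random.Random(seed)
    candidates = range(len(shares))
    hits = 0
    for _ in range(iterations):
        counts = [0] * len(shares)
        # random.choices normalises the weights, so they need not sum to 1.
        for c in rng.choices(candidates, weights=shares, k=voters):
            counts[c] += 1
        if looks_round(counts):
            hits += 1
    return hits / iterations


p_round_300 = estimate_p_round(SHARES, voters=300, iterations=2000)
print(p_round_300)
```

With 300 voters this particular criterion holds exactly when every vote count is divisible by 3, so the estimate should land near 1/27. That differs from the 5.3e-03 in the table, which suggests the notebook's `is_nice` or its sampling scheme is defined differently.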


In [3]:
fig = plt.figure()
ax = plt.gca()
ax.plot([300, 1000, 3000, 10000, 30000], [5.3e-3, 4.46e-3, 5.7e-5, 1.2e-6, 5e-8])
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('number of voters')
ax.set_ylabel('p(result|fair)')
plt.show()

In [4]:
5e-8*5e-8/1e-3

2.4999999999999994e-12

In [8]:
"""
Suppose an election is held for the leadership position in a major political party. Four candidates are running. After the election, the following results are announced:

        Candidate A: 160823 votes
        Candidate B: 115162 votes
        Candidate C: 82028 votes
        Candidate D: 46065 votes

Candidate A is declared the winner by a clear margin. However, at that point people notice that the percentages of the four candidates, when rounded to three decimal digits, look suspiciously round:

        Candidate A: 39.800 %
        Candidate B: 28.500 %
        Candidate C: 20.300 %
        Candidate D: 11.400 %

All four rounded percentages have zeros in their 2nd and 3rd decimal digits! This leads to a firestorm of speculation about election fraud on social media and in the press.

How can one determine the probability of fraud in such a situation? One approach would be to say:

$p(fraud|result) = 1 - p(fair|result)$

where

$p(fair|result) = \frac{p(result|fair)p(fair)}{p(result|fair)p(fair)+p(result|fraud)p(fraud)}$

Then, the priors can be set to reasonable values (e.g. $p(fair)=0.99$), while the conditionals $p(result|fair)$ and $p(result|fraud)$ are trickier.

One could argue that $p(result|fraud) \sim 0.01$ on the grounds that a fraudster would not consider more than a couple of hundred ways to cheat, and that setting the percentages to round values would be the most obvious choice for a naive fraudster (the "inexperienced fraudster" argument). However, this is quite hand-wavy.

Estimating $p(result|fair)$ is also problematic. On the one hand, getting four percentages that round so nicely to the third digit looks very suspicious, almost fabricated. On the other hand, the probability of getting percentages that round (in the 2nd & 3rd decimal digit) to 00, 00, 00, 00 is not going to be very different from the probability of getting percentages that round (in the 2nd & 3rd decimal digit) to 01, 23, 45, 31, or any other allowed combination, so why single out the former?

How can one formalize the layman's intuition that four percentages which round so nicely look suspicious? I thought of using this criterion: find the probability that the rounded percentages can be generated from a simple rule which repeats the same pattern in their 2nd and 3rd decimal digits. For example, the patterns 00, 00, 00, 00 and 25, 25, 25, 25 are the two simplest possible patterns, and will be grouped together when estimating the probability. The probability of getting one of the patterns in the simplest class is very low ($<10^{-12}$ by a Monte Carlo estimate). This is not entirely satisfactory, as it maintains the distinction between "suspicious" and "non-suspicious" numbers but hides it behind the idea of "simple" vs "non-simple" patterns.

Does anyone have an idea for a different way to arrive at a reasonable estimate of $p(fair|result)$ and/or $p(result|fraud)$, or even an entirely different approach? (You can assume that you do not have access to election results broken down by voting center, area, etc.)

*Note: The above is not a hypothetical scenario; it happened in real life. The quoted numbers are the numbers of votes received by the four candidates for the presidency of the Greek party ["New Democracy"][1] on December 20th, 2015.*


  [1]: https://en.wikipedia.org/wiki/New_Democracy_(Greece)
"""

0.339756592292
0.317444219067
0.230223123732
0.112576064909
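
The rounded percentages quoted in the text can be checked directly from the announced vote counts. The snippet below is a small sketch that formats each share to three decimals and tests whether the last two digits are zeros. The variable names are illustrative.

```python
# Vote counts for candidates A-D, as announced in the text above.
votes = [160823, 115162, 82028, 46065]
total = sum(votes)

# Percentages formatted to three decimal digits.
rounded = ["%.3f" % (100.0 * v / total) for v in votes]

# A percentage "looks round" here if its 2nd and 3rd decimals are both 0.
all_round = all(p.endswith("00") for p in rounded)

print(total)
print(rounded)
print(all_round)
```

The total comes to 404078 votes, and all four formatted percentages end in "00", matching the figures quoted in the question.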


In [None]:
"""
A level-headed statistician is called in to determine the probability of election fraud. The statistician is allowed to use the results of opinion polls from the week leading up to the election (if she wants):

          | Poll 1 | Poll 2 | Poll 3
        -----------------------------
        A |  33.7  |  42.0  |  47.6 
        B |  31.5  |  22.0  |  26.1
        C |  22.7  |  18.5  |  15.2
        D |  12.1  |  17.5  |  11.1 
"""