### Dawid-Skene Model


The Dawid-Skene model is one of the most common models in crowdsourcing problems. Developed by Alexander Philip Dawid and Allan Skene in 1979, it is designed to recover the true value from multiple noisy measurements. It relies heavily on Bayes' theorem. First, we take an initial guess (usually majority vote) of the true value. Next, we estimate the prior probabilities from that guess: how often each label occurs and, for each source, how often it reports each label given the true one (its confusion matrix). Using Bayes' theorem, we then derive the posterior probability of the true value and update our guess. We repeat these two steps until convergence. Sounds familiar? That's right. It's the EM algorithm.
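To make the Bayes step concrete, here is a minimal sketch for a single item with two labels and three sources. The prior, the confusion matrices and the votes are made-up numbers purely for illustration; none of them come from the data used later in this notebook.

```python
import numpy as np

# hypothetical class prior: P(truth=0), P(truth=1)
prior = np.array([0.4, 0.6])

# hypothetical confusion matrices, one per source
# row = true label, column = reported label, each row sums to 1
confusion = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.6, 0.4], [0.1, 0.9]],
])

# labels reported by the three sources for one item
votes = [1, 0, 1]

# bayes theorem: the posterior is proportional to
# prior * product over sources of P(vote | truth)
likelihood = np.ones(2)
for source, vote in enumerate(votes):
    likelihood *= confusion[source][:, vote]

posterior = prior * likelihood
posterior /= posterior.sum()  # normalize so the two entries sum to 1
print(posterior)
```

The functions below keep only the label with the largest posterior instead of the full probability vector, which is sometimes called hard EM.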

Details of the EM algorithm can be found at the following link

https://github.com/je-suis-tm/machine-learning/blob/master/naive%20bayes%20mixture%20model.ipynb

The original paper can be found at the link below

https://www.semanticscholar.org/paper/Maximum-Likelihood-Estimation-of-Observer-Using-the-Dawid-Skene/c80c7ab615b2fad5148a7848dbdd26a2dc50dd3d

If you don't want to get lost in the original paper, the example below is a comprehensive illustration

https://sukrutrao.github.io/project/fast-dawid-skene/Fast-Dawid-Skene.pdf

Apart from the Dawid-Skene model, minimax entropy can be used as an alternative method for the discrete case

https://dennyzhou.github.io/talks/MinimaxEnt.pdf

For the continuous case, please check the Platt-Burges model

https://github.com/je-suis-tm/machine-learning/blob/master/Wisdom%20of%20Crowds%20project/platt%20burges.ipynb

In [1]:
import os
os.chdir('k:/ecole/github/televerser/wisdom of crowds')
import numpy as np
import pandas as pd
import scipy.stats
import sklearn.metrics

### Functions

In [2]:
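#e step
#update prior: estimate each bank's confusion matrix from the current truth
#note: the original dawid-skene paper calls this parameter update the m step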
def e_step(matrix,prior_truth):

    confusion_matrices={}

    for bank_id in range(matrix.shape[0]):

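        #utilize sklearn to compute prior
        #fix the label set so every matrix has the same shape
        #even when a label is missing from both the truth and a bank's forecasts
        confusion_matrices[bank_id]=sklearn.metrics.confusion_matrix(
            y_true=prior_truth,
            y_pred=matrix[bank_id].ravel().tolist()[0],
            labels=np.unique(np.asarray(matrix)),
            normalize='true')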
    
    return confusion_matrices
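
With `normalize='true'`, sklearn divides each row of the count matrix by its row total, so entry (j, l) estimates the probability that a bank reports label l when the truth is j. A minimal numpy sketch of the same computation follows; the helper name `row_normalized_confusion` and the labels are made up for illustration.

```python
import numpy as np

def row_normalized_confusion(truth, votes, n_classes):
    # count how often each (true label, reported label) pair occurs
    counts = np.zeros((n_classes, n_classes))
    for t, v in zip(truth, votes):
        counts[int(t), int(v)] += 1

    # divide each row by its total; rows with no samples stay at zero
    row_totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts),
                     where=row_totals > 0)

# made-up current truth and one bank's reported labels
truth = [1, 1, 0, 0, 1]
votes = [1, 0, 0, 0, 1]
cm = row_normalized_confusion(truth, votes, n_classes=2)
print(cm)
```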

In [3]:
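#m step
#update posterior: pick the most probable label for each item via bayes theorem
#note: the original dawid-skene paper calls this posterior update the e step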
def m_step(matrix,prior_truth,confusion_matrices):

    cases,_=np.unique(matrix.ravel().tolist()[0],
                      return_inverse=True)
    post_truth=[]

    for commodity_id,_ in enumerate(prior_truth):
        commodity_forecast=matrix[:,commodity_id]

        #bayes theorem
        prob_comparison={}
        for case in cases:
            uncond_prob=prior_truth.count(case)/len(prior_truth)
            cond_prob=1
            for bank_id in range(matrix.shape[0]):
                prob=confusion_matrices[bank_id][int(case),
                      int(matrix[bank_id,commodity_id])]
                cond_prob*=prob
            
            #compute posterior
            post=cond_prob*uncond_prob
            prob_comparison[case]=post

        #update the initial guess
        #keep the label with the largest posterior
        post_truth.append(max(prob_comparison,key=prob_comparison.get))
        
    return post_truth
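
The function above multiplies one probability per bank. With a handful of banks that is fine, but with many sources the product can underflow to zero for every label. A common workaround, which this notebook does not use, is to add log probabilities instead. The sketch below uses made-up numbers (3000 identical sources that all vote 1) to show the difference.

```python
import numpy as np

n_sources = 3000

# made-up likelihoods: P(vote=1 | truth=0), P(vote=1 | truth=1)
p_vote_given_truth = np.array([0.4, 0.7])
prior = np.array([0.5, 0.5])

# direct product: both entries underflow to 0.0 in double precision
direct = prior * p_vote_given_truth ** n_sources

# log space: sums stay finite, so the labels can still be compared
log_post = np.log(prior) + n_sources * np.log(p_vote_given_truth)
best = int(np.argmax(log_post))

print(direct, log_post, best)
```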

In [4]:
#em algorithm wrapped up in one
def dawid_skene(matrix,init_truth):

    #initialize
    truth=init_truth
    iterations=[truth]
    stop=False    
    counter=0

    while not stop:
        
        counter+=1

        #em algorithm
        confusion_matrices=e_step(matrix,truth)
        truth=m_step(matrix,truth,confusion_matrices)

        iterations.append(truth)

        #convergence check
        if iterations[-2]==iterations[-1]:
            stop=True
            print(f'converged after {counter} iterations')
        
    return truth
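
The loop above stops only when two consecutive label lists are identical. Because the labels are updated as hard assignments, they could in principle keep flipping back and forth, so an iteration cap is a cheap safeguard. The sketch below is one way to add it; `iterate_until_stable` and `max_iter` are hypothetical names, not part of the functions above.

```python
def iterate_until_stable(update, init, max_iter=100):
    # apply `update` repeatedly until the labels stop changing
    # or the iteration cap is reached
    current = list(init)
    for counter in range(1, max_iter + 1):
        new = list(update(current))
        if new == current:
            return new, counter, True
        current = new
    return current, max_iter, False

# toy update that flips every label, so it never settles
flip = lambda labels: [1 - x for x in labels]
labels, n_iter, converged = iterate_until_stable(flip, [0, 1], max_iter=5)
print(labels, n_iter, converged)
```

With the functions in this notebook, `update` could be something like `lambda t: m_step(matrix,t,e_step(matrix,t))`.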

### ETL

In [5]:
#read data
y0matrix2019=pd.read_csv('y0matrix2019.csv')

y1matrix2020=pd.read_csv('y1matrix2020.csv')

monthly=pd.read_csv('monthly.csv')

annual=pd.read_csv('annual.csv')

In [6]:
#set index
y0matrix2019.set_index('Source Name',inplace=True)

y1matrix2020.set_index('Source Name',inplace=True)

monthly.set_index('Date',inplace=True)
monthly.index=pd.to_datetime(monthly.index)
monthly.columns=y0matrix2019.columns

annual=annual.pivot(index='Date',columns='Name',values='Value')
annual.index=pd.to_datetime(annual.index)

In [7]:
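#generate forecast data
#subtract the august 2019 prices as a plain array
#so the single row broadcasts across banks regardless of index labels
y0matrix2019_direction_est=np.sign(
    y0matrix2019-monthly['2019-08':'2019-08'].to_numpy())

y1matrix2020_direction_est=np.sign(
    y1matrix2020-monthly['2019-08':'2019-08'].to_numpy())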
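
As a quick illustration of the encoding, `np.sign` turns each forecast-minus-current-price difference into -1 (down), 0 (flat) or 1 (up). The prices below are made up. Note that once -1 is recoded to 0 for indexing, as done in the next section, flat and down forecasts share the same label.

```python
import numpy as np

# made-up forecasts: 2 banks x 3 commodities
forecasts = np.array([[55.0, 1500.0, 2.5],
                      [48.0, 1500.0, 3.1]])

# made-up current prices for the same 3 commodities
spot = np.array([50.0, 1500.0, 2.8])

# the spot row broadcasts across banks
direction = np.sign(forecasts - spot)

# recode -1 as 0 so labels can be used as matrix indices
binary = np.where(direction == -1, 0.0, direction)
print(direction)
print(binary)
```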

### Ensemble learning

In [8]:
#construct matrix for y+0 forecast
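#np.asmatrix replaces np.mat, which was removed in numpy 2.0
matrix=np.asmatrix(y0matrix2019_direction_est)

#use positive number for better indexing
matrix[matrix==-1]=0.

#initialize with majority voting
#np.atleast_1d copes with scipy versions
#that return either a scalar or a length-1 array
init_truth=[np.atleast_1d(scipy.stats.mode(i.tolist()[0])[0])[0] for i in matrix.T]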

#run
truth=dawid_skene(matrix,init_truth)

#as most of the predictions are pretty much aligned
#no adjustment is needed
#the result is consistent with citigroup, commonwealth bank
#deutsche bank and goldman sachs
#10/11 accuracy
print(init_truth,truth)

converged after 1 iterations
[1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0] [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]


In [9]:
#construct matrix for y+1 forecast
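#np.asmatrix replaces np.mat, which was removed in numpy 2.0
matrix=np.asmatrix(y1matrix2020_direction_est)

#use positive number for better indexing
matrix[matrix==-1]=0.

#initialize with majority voting
#np.atleast_1d copes with scipy versions
#that return either a scalar or a length-1 array
init_truth=[np.atleast_1d(scipy.stats.mode(i.tolist()[0])[0])[0] for i in matrix.T]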

#run
truth=dawid_skene(matrix,init_truth)

#as most of the predictions are pretty much aligned
#no adjustment is needed
#the result is consistent with jp morgan
#4/11 accuracy, worse than tossing a coin lol
print(init_truth,truth)

converged after 1 iterations
[1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0] [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]


### Diversified Case

In [10]:
#a diversified case from Sukrut Rao
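#np.asmatrix replaces np.mat, which was removed in numpy 2.0
matrix=np.asmatrix([[1,1,0,1,1],[0,0,0,1,1],[1,0,1,0,0]])

#initialize with majority voting
#np.atleast_1d copes with scipy versions
#that return either a scalar or a length-1 array
init_truth=[np.atleast_1d(scipy.stats.mode(i.tolist()[0])[0])[0] for i in matrix.T]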

#run
truth=dawid_skene(matrix,init_truth)

#a case in which bayes theorem works well
print(init_truth,truth)

converged after 2 iterations
[1, 0, 0, 1, 1] [0, 1, 0, 1, 1]
