In [1]:
from statswag.estimators import (Agreement, Spectral, MajorityVote, IWMV,
                                 MLEOneParameterPerLabeler)
import numpy as np

### Using Your Own Data

The purpose of this package is to estimate the accuracy of each labeler (and the true labels) for a dataset in which the ground-truth label of each data instance is unknown. To use the package with your own data, you only need to create the "label matrix" that represents your dataset: an array-like object holding the label that each labeler predicted for each data instance.

Labels can be coded as either strings or integers. For example, with two classes, "malicious" and "benign", you can pass the labels as strings, or code malicious as 1 and benign as 0 (or vice versa).
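If you prefer integer codes but your raw labels are strings, one way to convert them is with `np.unique`. This is a minimal sketch on a small made-up matrix (the values in `X_str` are illustrative and not part of the example dataset); it assigns codes in sorted order of the class names.

```python
import numpy as np

# Hypothetical string labels: 3 data instances (rows) x 2 labelers (columns).
X_str = np.asarray([["malicious", "benign"],
                    ["benign", "benign"],
                    ["malicious", "malicious"]])

# np.unique returns the sorted class names; return_inverse gives, for each
# entry, the index of its class in that sorted list.
class_names, codes = np.unique(X_str, return_inverse=True)
X_int = codes.reshape(X_str.shape)

print(class_names)  # ['benign' 'malicious']
print(X_int)
```

Here "benign" becomes 0 and "malicious" becomes 1 because the class names are sorted alphabetically; `class_names[X_int]` recovers the original strings.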

_Note: The possible classes are inferred automatically from the input, so a typo or stray value in the input is treated as an additional class and may produce invalid results._
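Because the classes are inferred from the data, it can help to check the label matrix against the classes you expect before fitting. The helper below is a hypothetical sketch (`check_classes` is not part of the package) and assumes a fully labeled matrix with no missing entries.

```python
import numpy as np

def check_classes(X, expected):
    """Raise ValueError if X contains labels outside the expected set."""
    found = set(np.unique(np.asarray(X)).tolist())
    unexpected = found - set(expected)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    return sorted(found)

X = [["malicious", "benign"],
     ["benign", "Benign"]]  # "Benign" is a typo that would become a third class

try:
    check_classes(X, expected={"malicious", "benign"})
except ValueError as err:
    print(err)  # unexpected labels: ['Benign']
```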

Suppose we have a dataset of 6 data instances, each labeled by 3 labelers.  Their output is as follows (assuming the ordering of data instances is the same for all labelers):
1. Labeler 1: malicious, benign, benign, malicious, benign, malicious
2. Labeler 2: malicious, benign, benign, malicious, malicious, malicious
3. Labeler 3: malicious, benign, benign, malicious, malicious, benign

The label matrix should be constructed so that the rows are the data instances (6 here) and the columns are the labelers (3 here). If we use 0 to code "benign" and 1 to code "malicious", the label matrix corresponding to this dataset is shown below.

In [2]:
# Data as integers
X = [[1,1,1],
    [0,0,0],
    [0,0,0],
    [1,1,1],
    [0,1,1],
    [1,1,0]]
X = np.asarray(X)
agree = Agreement()
agree.fit(X)

{'accuracies': array([0.83848851, 0.83333333, 0.83848851]),
 'labels': array([1, 0, 0, 1, 1, 1]),
 'probs': None,
 'class_names': array([0, 1])}
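The same integer matrix can be assembled from the per-labeler lists given earlier instead of being typed row by row. A minimal sketch, using only NumPy (the variable names are illustrative):

```python
import numpy as np

# Per-labeler outputs from the example, one list per labeler.
labeler_1 = ["malicious", "benign", "benign", "malicious", "benign", "malicious"]
labeler_2 = ["malicious", "benign", "benign", "malicious", "malicious", "malicious"]
labeler_3 = ["malicious", "benign", "benign", "malicious", "malicious", "benign"]

# Stack the lists as columns: rows are data instances, columns are labelers.
X_str = np.column_stack([labeler_1, labeler_2, labeler_3])

# Code "benign" as 0 and "malicious" as 1.
X = (X_str == "malicious").astype(int)
print(X.shape)  # (6, 3)
print(X)
```

Stacking as columns matters: passing the three lists directly to `np.asarray` would give a 3 x 6 array with labelers as rows, which is the transpose of the layout described above.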

In [3]:
# Data as strings
X = [['malicious','malicious','malicious'],
    ['benign','benign','benign'],
    ['benign','benign','benign'],
    ['malicious','malicious','malicious'],
    ['benign','malicious','malicious'],
    ['malicious','malicious','benign']]
X = np.asarray(X)
spectral = Spectral()
spectral.fit(X)

{'accuracies': array([0.81068241, 1.        , 0.81068241]),
 'labels': array(['malicious', 'benign', 'benign', 'malicious', 'malicious',
        'malicious'], dtype='<U9'),
 'probs': None,
 'class_names': array(['benign', 'malicious'], dtype='<U9')}

### Handling Missing Data
Some of the estimators (MajorityVote, IWMV, and MLE) can handle missing data, that is, cases where not every labeler labels every data instance. Missing labels must be represented by ``nan``. To hold both integers (or strings) and ``nan`` in the same array, you must declare it as an object array (``dtype=object``).
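If your raw data marks missing labels some other way, it needs to be converted first. The sketch below assumes missing labels are recorded as `None` (an assumption for illustration; `raw` is made-up data) and also counts the missing labels per labeler as a quick sanity check.

```python
import numpy as np

# Hypothetical raw labels where a missing label is recorded as None.
raw = [["malicious", None, "malicious"],
       ["benign", "benign", "benign"]]

# Replace None with nan and store everything in an object array so that
# strings and nan can coexist.
X = np.asarray([[np.nan if v is None else v for v in row] for row in raw],
               dtype=object)

# Elementwise missing-value mask: nan is the only float entry here.
missing = np.asarray([[isinstance(v, float) and np.isnan(v) for v in row]
                      for row in X])
print(missing.sum(axis=0))  # missing labels per labeler: [0 1 0]
```

Without ``dtype=object``, NumPy coerces ``nan`` to the string ``'nan'`` when the other entries are strings, so it would no longer be recognizable as a missing value.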

In [4]:
X = [['malicious',np.nan,'malicious'],
    ['benign','benign','benign'],
    ['benign','benign','benign'],
    ['malicious','malicious','malicious'],
    ['benign','malicious','malicious'],
    ['malicious','malicious','malicious']]
X = np.asarray(X,dtype=object)
X

array([['malicious', nan, 'malicious'],
       ['benign', 'benign', 'benign'],
       ['benign', 'benign', 'benign'],
       ['malicious', 'malicious', 'malicious'],
       ['benign', 'malicious', 'malicious'],
       ['malicious', 'malicious', 'malicious']], dtype=object)

In [5]:
MLEOneParameterPerLabeler().fit(X)

{'accuracies': array([0.83333333, 1.        , 1.        ]),
 'labels': array(['malicious', 'benign', 'benign', 'malicious', 'malicious',
        'malicious'], dtype='<U9'),
 'probs': array([[2.40442973e-10, 1.00000000e+00],
        [1.00000000e+00, 0.00000000e+00],
        [1.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 1.00000000e+00],
        [0.00000000e+00, 1.00000000e+00],
        [0.00000000e+00, 1.00000000e+00]]),
 'class_names': array(['benign', 'malicious'], dtype=object)}
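When an estimator returns `probs`, each row can be read as a distribution over the classes. The sketch below works on values copied from the output above rather than calling the package, and it assumes that column j of `probs` corresponds to `class_names[j]`; that ordering is consistent with the labels shown but is not stated explicitly in this notebook.

```python
import numpy as np

# Values copied from the output above (first two rows only).
result = {
    "probs": np.array([[2.40442973e-10, 1.0],
                       [1.0, 0.0]]),
    "class_names": np.array(["benign", "malicious"], dtype=object),
}

# Assumption: column j of 'probs' is the probability of class_names[j].
best = result["probs"].argmax(axis=1)
labels = result["class_names"][best]
confidence = result["probs"].max(axis=1)

for label, p in zip(labels, confidence):
    print(f"{label}: {p:.3f}")
```

Under that assumption, the most probable class in each row matches the corresponding entry of `labels` in the output.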

In [6]:
MajorityVote().fit(X)

{'accuracies': [0.8333333333333334, 0.8, 0.8333333333333334],
 'labels': ['malicious',
  'benign',
  'benign',
  'malicious',
  'malicious',
  'malicious'],
 'probs': None,
 'class_names': array(['benign', 'malicious'], dtype=object)}

In [7]:
IWMV().fit(X)

{'accuracies': array([0.83333333, 1.        , 1.        ]),
 'labels': array(['malicious', 'benign', 'benign', 'malicious', 'malicious',
        'malicious'], dtype='<U9'),
 'probs': None,
 'class_names': array(['benign', 'malicious'], dtype=object)}