In [1]:
from NUTMEG.nutmeg import NUTMEG
import pandas as pd

### 1. Format your input data

The input DataFrame must contain the columns `task`, `worker`, `subpopulation`, and `label` where each row is a single annotation from one annotator for one task and ...

`task` is the index of the item being annotated.

`worker` is the annotator who is annotating that item.

`subpopulation` is the group the annotator belongs to.

- Every annotator must have exactly one subpopulation indicator.
- NUTMEG produces one truth estimate per item per subpopulation. So, if you have three subpopulations, every item will have three estimated truths.

`label` is the label that the annotator gave an item. 
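The column and subpopulation requirements above can be checked before fitting. The following is a minimal sketch using only pandas; `check_annotations` and the two small frames are hypothetical illustrations and are not part of NUTMEG.

```python
import pandas as pd

REQUIRED_COLUMNS = ["task", "worker", "subpopulation", "label"]


def check_annotations(df):
    """Return a list of problems found; an empty list means the frame looks fine."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    # Count distinct subpopulations per worker; more than one breaks the rule above.
    per_worker = df.groupby("worker")["subpopulation"].nunique()
    return [
        f"worker {worker} appears in more than one subpopulation"
        for worker in per_worker[per_worker > 1].index
    ]


# Hypothetical data: every worker belongs to exactly one subpopulation.
good = pd.DataFrame({
    "task": ["T_1", "T_1", "T_2"],
    "worker": ["W_1", "W_2", "W_1"],
    "subpopulation": ["S_1", "S_2", "S_1"],
    "label": [0, 1, 0],
})

# Hypothetical data: W_1 is listed under two subpopulations.
bad = good.assign(subpopulation=["S_1", "S_2", "S_2"])

print(check_annotations(good))
print(check_annotations(bad))
```

Returning a list of messages rather than raising on the first problem makes it easier to report every offending annotator at once.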

In [8]:
df = pd.DataFrame({'task':['T_1', 'T_1', 'T_1', 'T_1', 'T_2', 'T_2', 'T_2'],
                   'worker':['W_1', 'W_2', 'W_3', 'W_4', 'W_1', 'W_2', 'W_3'],
                   'subpopulation':['S_2', 'S_1', 'S_1', 'S_1', 'S_1', 'S_1', 'S_1'],
                   'label':[0, 0, 1, 1, 0, 0, 0]})
df.head()

  task worker subpopulation  label
0  T_1    W_1           S_2      0
1  T_1    W_2           S_1      0
2  T_1    W_3           S_1      1
3  T_1    W_4           S_1      1
4  T_2    W_1           S_1      0


### 2. Instantiate NUTMEG and specify parameters

`n_restarts`
The number of optimization runs of the algorithm.
The final parameters are those that gave the best log likelihood.
If one run takes too long, this parameter can be set to 1.

`n_iter`
The maximum number of iterations for each optimization run.

`smoothing`
The smoothing parameter for the normalization.

`default_noise`
The default noise parameter for the initialization.

`alpha`
The prior parameter for the Beta distribution of the competence measure.

`beta`
The prior parameter for the Beta distribution of the competence measure.

`random_state`
The state of the random number generator.

`verbose`
Controls whether progress is printed:
0 shows no progress bar, 1 shows one for restarts only, and 2 shows one for both restarts and optimization.


In [3]:
# Instantiate NUTMEG with its default parameters
nutmeg = NUTMEG()

### 3. Fit the model to the data

Use the `fit` method to train the model. Alternatively, use `fit_predict` or `fit_predict_proba` to return the predicted labels or probabilities directly.


**Note:** The `return_unobserved` parameter determines how the model outputs predictions for items that have no annotations from certain subpopulations. If it is set to `False`, the probabilities and labels returned by the model will be `np.nan` for every subpopulation-item combination that has no annotations. If it is set to `True`, the labels and probabilities will be estimated based on other observed instances (as detailed in the paper).
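To see which subpopulation-item combinations this setting affects, the annotations can be counted per pair before fitting. This is a sketch with plain pandas on the example frame from step 1; `unobserved_pairs` is a hypothetical helper, not a NUTMEG function.

```python
import pandas as pd

df = pd.DataFrame({'task': ['T_1', 'T_1', 'T_1', 'T_1', 'T_2', 'T_2', 'T_2'],
                   'worker': ['W_1', 'W_2', 'W_3', 'W_4', 'W_1', 'W_2', 'W_3'],
                   'subpopulation': ['S_2', 'S_1', 'S_1', 'S_1', 'S_1', 'S_1', 'S_1'],
                   'label': [0, 0, 1, 1, 0, 0, 0]})


def unobserved_pairs(df):
    """List the (task, subpopulation) pairs that have no annotations."""
    # Rows are tasks, columns are subpopulations, cells are annotation counts.
    counts = pd.crosstab(df["task"], df["subpopulation"])
    return [
        (task, sub)
        for task in counts.index
        for sub in counts.columns
        if counts.loc[task, sub] == 0
    ]


print(unobserved_pairs(df))
```

In this frame, `T_2` has no annotation from `S_2`, so that is the one pair whose output depends on `return_unobserved`.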

In [None]:
nutmeg.fit(df, return_unobserved=True)

### 4. Obtain results

`.labels_` The predicted labels for each item and subpopulation, as a NumPy array of shape `(n_items, n_subpopulations)`. Items and subpopulations are ordered by their first appearance in the DataFrame. For example, if the first row is "item1" with "subpopulation2", then the first row of the array is item1, and the first entry along the second dimension is the estimated label for subpopulation2.

`.probas_` The predicted label probabilities for each item and subpopulation, as an array of shape `(n_items, n_subpopulations, n_labels)`, using the same ordering as `.labels_`. The probabilities along the last dimension follow the order in which the labels are first observed in the DataFrame.

`.spamming_` The estimated competence of each annotator, in the order the annotators first appear in the DataFrame. The last dimension of the array has size 2: the first entry is the probability that the annotator gives the correct label, and the second entry is the probability that the annotator is spamming.


**Note:** To determine the order of the tasks, annotators, subpopulations, or labels in the output, use `df[column_name].unique()`.
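As an illustration of this ordering rule, the label array can be wrapped in a DataFrame whose index and columns come from `unique()`. The sketch below does not call NUTMEG; the `labels` array is a stand-in copied from the example `labels_` output in this walkthrough, and `labels_table` is a hypothetical name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'task': ['T_1', 'T_1', 'T_1', 'T_1', 'T_2', 'T_2', 'T_2'],
                   'worker': ['W_1', 'W_2', 'W_3', 'W_4', 'W_1', 'W_2', 'W_3'],
                   'subpopulation': ['S_2', 'S_1', 'S_1', 'S_1', 'S_1', 'S_1', 'S_1'],
                   'label': [0, 0, 1, 1, 0, 0, 0]})

# Stand-in for nutmeg.labels_, shape (n_items, n_subpopulations).
labels = np.array([['0', '1'],
                   ['0', '0']], dtype=object)

# unique() preserves order of first appearance, which matches the output order.
tasks = df['task'].unique()
subpopulations = df['subpopulation'].unique()

labels_table = pd.DataFrame(labels, index=tasks, columns=subpopulations)
print(labels_table)
```

Because `S_2` appears first in the frame, it is the first column, so a lookup such as `labels_table.loc['T_1', 'S_1']` reads from the second column of the raw array.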

In [5]:
nutmeg.labels_

array([['0', '1'],
       ['0', '0']], dtype=object)

In [6]:
nutmeg.probas_

array([[[0.99706369, 0.00293631],
        [0.0107271 , 0.9892729 ]],

       [[0.99706369, 0.00293631],
        [0.91612266, 0.08387734]]])
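The last axis of this array can be mapped back to label values by taking the most probable position and indexing into the labels in order of first appearance. This is a sketch with NumPy and pandas only; `probas` is copied from the output above, and `label_order` and `hard_labels` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Stand-in for nutmeg.probas_, shape (n_items, n_subpopulations, n_labels).
probas = np.array([[[0.99706369, 0.00293631],
                    [0.0107271, 0.9892729]],
                   [[0.99706369, 0.00293631],
                    [0.91612266, 0.08387734]]])

# Labels in order of first appearance in the example frame's `label` column.
label_order = pd.Series([0, 0, 1, 1, 0, 0, 0]).unique()

# Position of the most probable label for each (item, subpopulation) pair.
best = probas.argmax(axis=-1)

# Translate positions into label values.
hard_labels = label_order[best]
print(hard_labels)
```

For this example, the result agrees with the `labels_` output shown earlier, apart from `labels_` holding strings.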

In [7]:
nutmeg.spamming_

array([[0.08896221, 0.76026716],
       [0.08896221, 0.76026716],
       [0.14328313, 0.69704067],
       [0.15978965, 0.57639631]])
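For easier reading, the rows can be labeled with the annotators in order of first appearance. The sketch below uses the array above as a stand-in for `nutmeg.spamming_`; the column names `p_correct` and `p_spamming` are assumptions based on the description of `.spamming_` and are not names defined by NUTMEG.

```python
import numpy as np
import pandas as pd

# Annotators in order of first appearance in the example frame's `worker` column.
workers = pd.Series(['W_1', 'W_2', 'W_3', 'W_4', 'W_1', 'W_2', 'W_3']).unique()

# Stand-in for nutmeg.spamming_, one row per annotator.
spamming = np.array([[0.08896221, 0.76026716],
                     [0.08896221, 0.76026716],
                     [0.14328313, 0.69704067],
                     [0.15978965, 0.57639631]])

competence = pd.DataFrame(spamming, index=workers,
                          columns=['p_correct', 'p_spamming'])

# Sort so the annotators with the lowest estimated spamming probability come first.
ranked = competence.sort_values('p_spamming', kind='stable')
print(ranked)
```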