<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250 style="padding: 10px" alt="Vera C. Rubin Observatory Logo"> 
<h1 style="margin-top: 10px">Retrieve and Aggregate Zooniverse Output</h1>
Authors: Becky Nevin, Clare Higgs, and Eric Rosas <br>
Contact author: Clare Higgs <br>
Last verified to run: 2024-06-20 <br>
LSST Science Pipelines version: Weekly 2024_04 <br>
Container size: small or medium <br>
Targeted learning level: intermediate

<b>Description:</b> This notebook guides a PI through the process of retrieving classification data from Zooniverse and builds upon Hayley Robert's Aggregation notebook example. <br><br>
<b>Skills:</b> 
<br><br>
<b>LSST Data Products:</b> n/a<br><br>
<b>Packages:</b> rubin.citsci, utils (citsci plotting and display utilities),  <br><br>
<b>Credit:</b> Hayley Roberts' aggregation code https://github.com/astrohayley/SLSN-Aggregation-Example/blob/main/SLSN_batch_aggregation.py<br><br>
<b>Get Support: </b>PIs new to DP0 are encouraged to find documentation and resources at <a href="https://dp0-2.lsst.io/">dp0-2.lsst.io</a>. Support for this notebook is available and questions are welcome at cscience@lsst.org.

## 1. Introduction <a class="anchor" id="first-bullet"></a>
This notebook provides an introduction to how to use the Zooniverse panoptes client and rubin.citsci package to retrieve classifications from Zooniverse and aggregate the results. Data aggregation in this context is collecting classifications across all citizen scientists and summarizing them by subject in terms of classifier majority.

### 1.1 Package imports <a class="anchor" id="second-bullet"></a>

#### Install Pipeline Package

First, install the Rubin Citizen Science Pipeline package by doing the following:

1. Open up a New Launcher tab
2. In the "Other" section of the New Launcher tab, click "Terminal"
3. Use `pip` to install the `rubin.citsci` package by entering the following command:
```
pip install rubin.citsci
```
Note that this package will soon be installed directly on RSP.

If this package is already installed, make sure it is updated:
```
pip install --u rubin.citsci
```

4. Confirm the next cell containing `from rubin.citsci import pipeline` works as expected and does not throw an error

5. Install panoptes_client:
```
pip install panoptes_client
pip install panoptes_aggregation
```

6. If the pip install doesn't work for `panoptes_aggregation`:
```
pip install -U git+git://github.com/zooniverse/aggregation-for-caesar.git
```
(https://www.zooniverse.org/talk/1322/2415041?comment=3969837&page=1)

In [9]:
import numpy as np
import pandas as pd
import getpass
import json
import os

# Zooniverse tools
from panoptes_client import Panoptes, Workflow
from panoptes_aggregation.extractors.utilities import annotation_by_task
from panoptes_aggregation.extractors import question_extractor
from panoptes_aggregation.reducers import question_consensus_reducer

from tqdm import tqdm

from rubin.citsci import pipeline

### 1.2 Define functions and parameters <a class="anchor" id="third-bullet"></a>
Credit for these functions goes to Hayley Roberts at Zooniverse. This includes:
- `download_classifications`: A function to download the classifications given a workflow ID in, which returns a dataframe
- `extract_data`: A function to extract user annotations by task and sort by when they were classified??? This can be modified for other classification tasks such as drawing, please see the Zooniverse documentation.
- `aggregate_data`: A function that groups by task and user, selects the most recent classification from each user, and uses the Zooniverse `question_consesus_reducer` function to determine the consensus for each subject ID amongst all users

In [1]:
def download_classifications(WORKFLOW_ID, client):
    """
    Downloads data from Zooniverse

    Args:
        WORKFLOW_ID (int): Workflow ID of workflow being aggregated
        client: Logged in Zooniverse client

    Returns:
        classification_data (DataFrame): Raw classifications from Zooniverse
    """
    print('beginning function')
    workflow = Workflow(WORKFLOW_ID)
    print('workflow class loaded?', workflow)
    print('all attributes of this workflow', dir(workflow))

    print('this is the client were working with', client)
    
    # generate the classifications 
    # if generate=True, it generates a new classification report,
    # which can take a long time because they’re queued in the Zooniverse system.
    # It’s the same as going to the project builder and clicking the “request new report”.
    # If you just want to download the existing report, you can set generate to False.
    with client:
        classification_export = workflow.get_export('classifications',
                                                    generate=False,
                                                    wait=False)
        print('export', classification_export)
        print('type', type(classification_export))
        # since it's a partial class, call it to get the DictReader object
        csv_dictreader_instance = classification_export.csv_dictreader()
        print('csv_dictreader_instance', csv_dictreader_instance)

        # the below line is taking forever to run
        classification_rows = [row for row in tqdm(csv_dictreader_instance, file=sys.stdout)]

        #classification_rows = [row for row in tqdm(classification_export.csv_dictreader())]

    # convert to pandas dataframe
    print('pre conversion to df', classification_rows)
    classification_data = pd.DataFrame.from_dict(classification_rows)

    return classification_data



def extract_data(classification_data):
    """
    Extracts annotations from the classification data

    Args:
        classification_data (DataFrame): Raw classifications from Zooniverse

    Returns:
        extracted_data (DataFrame): Extracted annotations from raw classification data
    """
    # set up our list where we will store the extracted data temporarily
    extracted_rows = []

    # iterate through our classification data
    for i in range(len(classification_data)):

        # access the specific row and extract the annotations
        row = classification_data.iloc[i]
        for annotation in json.loads(row.annotations):

            row_annotation = annotation_by_task({'annotations': [annotation]})
            extract = question_extractor(row_annotation)

            # add the extracted annotations to our temporary list along with some other additional data
            extracted_rows.append({
                'classification_id': row.classification_id,
                'subject_id':        row.subject_ids,
                'user_name':         row.user_name,
                'user_id':           row.user_id,
                'created_at':        row.created_at,
                'data':              json.dumps(extract),
                'task':              annotation['task']
            })


    # convert the extracted data to a pandas dataframe and sort
    extracted_data = pd.DataFrame.from_dict(extracted_rows)
    extracted_data.sort_values(['subject_id', 'created_at'], inplace=True)

    return extracted_data



def last_filter(data):
    """
    Determines the most recently submitted classifications
    """
    last_time = data.created_at.max()
    ldx = data.created_at == last_time
    return data[ldx]


def aggregate_data(extracted_data):
    """
    Aggregates question data from extracted annotations

    Args:
        extracted_data (DataFrame): Extracted annotations from raw classifications

    Returns:
        aggregated_data (DataFrame): Aggregated data for the given workflow
    """
    # generate an array of unique subject ids - these are the ones that we will iterate over
    subject_ids_unique = np.unique(extracted_data.subject_id)

    # set up a temporary list to store reduced data
    aggregated_rows = []

    # determine the total number of tasks
    tasks = np.unique(extracted_data.task)

    # iterating over each unique subject id
    for i in range(len(subject_ids_unique)):

        # determine the subject_id to work on
        subject_id = subject_ids_unique[i]

        # filter the extract_data dataframe for only the subject_id that is being worked on
        extract_data_subject = extracted_data[extracted_data.subject_id==subject_id].drop_duplicates()

        for task in tasks:

            extract_data_filtered = extract_data_subject[extract_data_subject.task == task]

            # if there are less unique user submissions than classifications, filter for the most recently updated classification
            if (len(extract_data_filtered.user_name.unique()) < len(extract_data_filtered)):
                extract_data_filtered = extract_data_filtered.groupby(['user_name'], group_keys=False).apply(last_filter)

            # iterate through the filtered extract data to prepare for the reducer
            classifications_to_reduce = [json.loads(extract_data_filtered.iloc[j].data) for j in range(len(extract_data_filtered))]

            # use the Zooniverse question_consesus_reducer to get the final consensus
            # WHAT ARE THE ARGUMENTS THAT ARE OPTIONAL HERE?
            reduction = question_consensus_reducer(classifications_to_reduce)

            # add the subject id to our reduction data
            reduction['subject_id'] = subject_id
            reduction['task'] = task

            # add the data to our temporary list
            aggregated_rows.append(reduction)


    # converting the result to a dataframe
    aggregated_data = pd.DataFrame.from_dict(aggregated_rows)

    # drop rows that are nan
    aggregated_data.dropna(inplace=True)

    return aggregated_data





def batch_aggregation(generate_new_classifications=False, WORKFLOW_ID=13193): 
    """
    Downloads raw classifications, extracts annotations, and aggregates data

    Args:
        WORKFLOW_ID (int): Workflow ID of workflow being aggregated
        client: Logged in Zooniverse client

    Returns:
        aggregated_data (DataFrame): Aggregated data for the given workflow
    """

    if generate_new_classifications:
        # connect to client and download data
        print('Sign in to zooniverse.org:')
        client = Panoptes.client(username=getpass.getpass('username: '), password=getpass.getpass('password: '))
        print('Generating classification data - could take some time')
        classification_data = download_classifications(WORKFLOW_ID=WORKFLOW_ID, client=client)
        print('Saving classifications')
        classification_data.to_csv('galaxy-classifications.csv', index=False)
    else:
        # or just open the file
        print('Loading classifications')
        classification_data = pd.read_csv('galaxy-classifications.csv')

    # limit classifications to those for the relevant workflow
    classification_data = classification_data[classification_data.workflow_id==WORKFLOW_ID]

    # extract annotations
    print('Extracting annotations')
    extracted_data = extract_data(classification_data=classification_data)

    # aggregate annotations
    print('Aggregating data')
    final_data = aggregate_data(extracted_data=extracted_data)

    return final_data

## 2. Log into Zooniverse and link
If you're running this notebook, you should already have a Zooniverse account with a project with classifications. If you do not yet have an account, please return to notebook `01_Introduction_to_Citsci_Pipeline.ipynb`.

IMPORTANT: Your Zooniverse project must be set to "public", a "private" project will not work. Select this setting under the "Visibility" tab, (it does not need to be set to live). 

Supply the email associated with your Zooniverse account, and then follow the instructions in the prompt to log in and select your project by slug name. 

A "slug" is the string of your Zooniverse username and your project name without the leading forward slash, for instance: "username/project-name". [Click here for more details](https://www.zooniverse.org/talk/18/967061?comment=1898157&page=1).

**The `rubin.citsci` package includes a method that creates a Zooniverse project from template. If you wish to use this feature, do not provide a slug_name and run the subsequent cell.**

In [None]:
email = "beckynevin@gmail.com"
cit_sci_pipeline = pipeline.CitSciPipeline()
cit_sci_pipeline.login_to_zooniverse(email)

## 3. Download the classifications
These will still be in the raw format. This function reads from the output csv and puts all rows into a dataframe format.

In [19]:
#project_id = 19539
WORKFLOW_ID = 23254
client = cit_sci_pipeline.client
# how long should this take?
classification_data = download_classifications(WORKFLOW_ID, client)

beginning function
workflow class loaded? <Workflow 23254>
all attributes of this workflow ['RESERVED_ATTRIBUTES', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api_slug', '_edit_attributes', '_export_path', '_link_slug', '_loaded', '_original_configuration', '_original_retirement', '_original_tasks', '_savable_dict', 'add_alice_extractors', 'add_alice_reducers', 'add_alice_rules_and_effects', 'add_caesar_extractor', 'add_caesar_reducer', 'add_caesar_rule', 'add_caesar_rule_effect', 'add_subject_sets', 'caesar_effects', 'caesar_extractors', 'caesar_reducers', 'caesar_rules', 'caesar_subject_extracts', 'caesar_subject_reductions', 'configure_for_alice', 'delete'

In [17]:
classification_data

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,metadata,annotations,subject_data,subject_ids
0,460251424,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:18 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428507"":{""retired"":null}}",83428507
1,460251464,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:27 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428528"":{""retired"":null}}",83428528
2,460251470,sreevani,1672374.0,adb6c943f6a02f9f48b3,23254,Classification,9.7,2023-01-05 17:09:29 UTC,,,"{""source"":""api"",""session"":""faf6ac09286ac159d2d...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428673"":{""retired"":null}}",83428673
3,460251475,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:30 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428522"":{""retired"":null}}",83428522
4,460251484,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:32 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428624"":{""retired"":null}}",83428624
5,460251498,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:34 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428488"":{""retired"":null}}",83428488
6,460251507,sreevani,1672374.0,adb6c943f6a02f9f48b3,23254,Classification,9.7,2023-01-05 17:09:36 UTC,,,"{""source"":""api"",""session"":""faf6ac09286ac159d2d...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428610"":{""retired"":null}}",83428610
7,460251508,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:37 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428619"":{""retired"":null}}",83428619
8,460251529,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:40 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428583"":{""retired"":null}}",83428583
9,460251542,rebecca.nevin,1946584.0,93db3159e6449bc8cf78,23254,Classification,9.7,2023-01-05 17:09:42 UTC,,,"{""source"":""api"",""session"":""17680efb53f0c9ec4a9...","[{""task"":""T0"",""task_label"":""Is this a galaxy?""...","{""83428592"":{""retired"":null}}",83428592


## 4. Extract annotations by task and sort by subject ID
There could be multiple tasks per item.

In [20]:
extracted_data = extract_data(classification_data)
extracted_data

Unnamed: 0,classification_id,subject_id,user_name,user_id,created_at,data,task
5,460251498,83428488,rebecca.nevin,1946584.0,2023-01-05 17:09:34 UTC,"{""yes"": 1, ""aggregation_version"": ""4.1.0""}",T0
29,460252388,83428489,rebecca.nevin,1946584.0,2023-01-05 17:14:05 UTC,"{""yes"": 1, ""aggregation_version"": ""4.1.0""}",T0
30,460252838,83428494,rebecca.nevin,1946584.0,2023-01-05 17:16:35 UTC,"{""yes"": 1, ""aggregation_version"": ""4.1.0""}",T0
21,460252218,83428504,rebecca.nevin,1946584.0,2023-01-05 17:13:14 UTC,"{""yes"": 1, ""aggregation_version"": ""4.1.0""}",T0
27,460252349,83428504,rebecca.nevin,1946584.0,2023-01-05 17:13:55 UTC,"{""yes"": 1, ""aggregation_version"": ""4.1.0""}",T0
0,460251424,83428507,rebecca.nevin,1946584.0,2023-01-05 17:09:18 UTC,"{""yes"": 1, ""aggregation_version"": ""4.1.0""}",T0
16,460251628,83428515,sreevani,1672374.0,2023-01-05 17:10:03 UTC,"{""no"": 1, ""aggregation_version"": ""4.1.0""}",T0
13,460251578,83428516,rebecca.nevin,1946584.0,2023-01-05 17:09:51 UTC,"{""no"": 1, ""aggregation_version"": ""4.1.0""}",T0
3,460251475,83428522,rebecca.nevin,1946584.0,2023-01-05 17:09:30 UTC,"{""yes"": 1, ""aggregation_version"": ""4.1.0""}",T0
1,460251464,83428528,rebecca.nevin,1946584.0,2023-01-05 17:09:27 UTC,"{""no"": 1, ""aggregation_version"": ""4.1.0""}",T0


## 5. Aggregate the annotations
Sort by unique subject ID and then unique tasks. Find the most recent classification for each user ID, and uses the Zooniverse consensus builder to look through all user classifications and build consensus.

In [21]:
aggregated_data = aggregate_data(extracted_data)

In [22]:
aggregated_data

Unnamed: 0,most_likely,num_votes,agreement,aggregation_version,subject_id,task
0,yes,1,1.0,4.1.0,83428488,T0
1,yes,1,1.0,4.1.0,83428489,T0
2,yes,1,1.0,4.1.0,83428494,T0
3,yes,1,1.0,4.1.0,83428504,T0
4,yes,1,1.0,4.1.0,83428507,T0
5,no,1,1.0,4.1.0,83428515,T0
6,no,1,1.0,4.1.0,83428516,T0
7,yes,1,1.0,4.1.0,83428522,T0
8,no,1,1.0,4.1.0,83428528,T0
9,no,1,1.0,4.1.0,83428529,T0


Note in txt about how you could join stuff now with another table.

What I haven't tested is the batch aggregation, which downloads classifications in a slightly different way?:

In [28]:
out = batch_aggregation(generate_new_classifications=True,WORKFLOW_ID=WORKFLOW_ID)

Sign in to zooniverse.org:


username:  ········
password:  ········


Generating classification data - could take some time
beginning function
workflow class loaded? <Workflow 23254>
all attributes of this workflow ['RESERVED_ATTRIBUTES', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api_slug', '_edit_attributes', '_export_path', '_link_slug', '_loaded', '_original_configuration', '_original_retirement', '_original_tasks', '_savable_dict', 'add_alice_extractors', 'add_alice_reducers', 'add_alice_rules_and_effects', 'add_caesar_extractor', 'add_caesar_reducer', 'add_caesar_rule', 'add_caesar_rule_effect', 'add_subject_sets', 'caesar_effects', 'caesar_extractors', 'caesar_reducers', 'caesar_rules', 'caesar_subject_extracts', 'caesa

KeyError: 'subject_id'

In [None]:
# issues:
# get_data_from_zooniverse is undefined
# the pip install does not work for panoptes_aggregation
# download_classifications takes upwards of 43 minutes to run,
# is this because I haven't completed the workflow? Because I have

In [5]:
WORKFLOW_ID = 23254
batch_aggregation(WORKFLOW_ID)

Sign in to zooniverse.org:


username:  ········
password:  ········


Generating classification data - could take some time


NameError: name 'get_data_from_zooniverse' is not defined