<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250 style="padding: 10px" alt="Vera C. Rubin Observatory Logo"> 
<h1 style="margin-top: 10px">Retrieve and Aggregate Zooniverse Output</h1>
Authors: Becky Nevin, Clare Higgs, and Eric Rosas <br>
Contact author: Clare Higgs <br>
Last verified to run: 2024-10-07 <br>
LSST Science Pipelines version: Weekly 2024_37 <br>
Container size: small or medium <br>
Targeted learning level: intermediate

<b>Description:</b> This notebook guides a PI through the process of retrieving classification data from Zooniverse and builds upon Hayley Roberts' Aggregation notebook example. <br><br>
<b>Skills:</b> Query for Zooniverse classification data via the panoptes client; retrieve and aggregate user classifications and retrieve original objectIds or diaobjectIds.
<br><br>
<b>LSST Data Products:</b> n/a<br><br>
<b>Packages:</b> panoptes_client, panoptes_aggregation, rubin.citsci, utils (citsci plotting and display utilities) <br><br>
<b>Credit:</b> Hayley Roberts' aggregation code https://github.com/astrohayley/SLSN-Aggregation-Example/blob/main/SLSN_batch_aggregation.py<br><br>
<b>Get Support: </b>PIs new to DP0 are encouraged to find documentation and resources at <a href="https://dp0-2.lsst.io/">dp0-2.lsst.io</a>. Support for this notebook is available and questions are welcome at cscience@lsst.org.

## 1. Introduction <a class="anchor" id="first-bullet"></a>
This notebook introduces how to use the Zooniverse panoptes client and the `rubin.citsci` package to retrieve classifications from Zooniverse and aggregate the results. In this context, data aggregation means collecting the classifications from all citizen scientists and summarizing them per subject by the majority answer.
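To make the idea of majority-based aggregation concrete, here is a minimal sketch that uses only pandas. The subject IDs, user names, answers, and output column names are made up for illustration; the notebook itself relies on the Zooniverse reducer rather than this hand-rolled version.

```python
import pandas as pd

# Toy classifications: three volunteers answer a yes/no question on two subjects.
# All values here are invented for illustration.
votes = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 2],
    "user_name": ["a", "b", "c", "a", "b", "c"],
    "answer": ["yes", "yes", "no", "no", "no", "no"],
})

rows = []
for subject_id, group in votes.groupby("subject_id"):
    # value_counts sorts answers from most to least common
    counts = group["answer"].value_counts()
    rows.append({
        "subject_id": subject_id,
        "most_likely": counts.index[0],
        "num_votes": int(counts.iloc[0]),
        "agreement": float(counts.iloc[0] / counts.sum()),
    })

# one row per subject, summarizing the majority answer
consensus = pd.DataFrame(rows)
print(consensus)
```

Each subject ends up with a single row holding the majority answer, the number of votes for it, and the fraction of classifiers who agreed.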

### 1.1 Package imports <a class="anchor" id="second-bullet"></a>

#### Install Pipeline Package

First, install the Rubin Citizen Science Pipeline package by doing the following:

1. Open up a New Launcher tab
2. In the "Other" section of the New Launcher tab, click "Terminal"
3. Use `pip` to install the `rubin.citsci` package by entering the following command:
```
pip install rubin.citsci
```
Note that this package will soon be installed directly on RSP.

If this package is already installed, make sure it is updated:
```
pip install --upgrade rubin.citsci
```

4. Confirm the next cell containing `from rubin.citsci import pipeline` works as expected and does not throw an error

5. Install `panoptes_client` and `panoptes_aggregation`:
```
pip install panoptes_client
pip install panoptes_aggregation
```

6. If the pip install doesn't work for `panoptes_aggregation`:
```
pip install -U git+https://github.com/zooniverse/aggregation-for-caesar.git
```
(https://www.zooniverse.org/talk/1322/2415041?comment=3969837&page=1)
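Before running the import cell, it can help to check which of the packages above are visible to the notebook kernel. The sketch below uses only the standard library; `is_installed` is a hypothetical helper written for this example, not part of `rubin.citsci`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found by the current kernel."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # raised for dotted names when the parent package is missing
        return False


for name in ["panoptes_client", "panoptes_aggregation", "rubin.citsci"]:
    print(name, "installed" if is_installed(name) else "missing")
```

If a package is reported as missing right after a `pip install`, restarting the kernel is usually worth trying before reinstalling.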

In [1]:
import numpy as np
import pandas as pd
import getpass
import json
import os
import sys

# Zooniverse tools
from panoptes_client import Panoptes, Workflow
from panoptes_aggregation.extractors.utilities import annotation_by_task
from panoptes_aggregation.extractors import question_extractor
from panoptes_aggregation.reducers import question_consensus_reducer

from tqdm import tqdm

from rubin.citsci import pipeline

### 1.2 Define functions and parameters <a class="anchor" id="third-bullet"></a>
Credit for these functions goes to Hayley Roberts at Zooniverse. This includes:
- `download_classifications`: A function that downloads the classifications for a given workflow ID and returns them as a dataframe.
- `extract_data`: A function that extracts user annotations by task and sorts them by subject ID and classification time. It can be modified for other classification tasks such as drawing; please see the Zooniverse documentation.
- `aggregate_data`: A function that, for each subject and task, keeps only the most recent classification from each user and then uses the Zooniverse `question_consensus_reducer` function to determine the consensus for each subject ID among all users.
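The "most recent classification from each user" step can be illustrated on a toy table. This is a sketch of the idea using `sort_values` and `drop_duplicates`, not the `groupby`/`last_filter` approach used in the functions below; the user names and answers are invented.

```python
import pandas as pd

# Toy extracted data: alice classified subject 10 twice, bob once.
extracted = pd.DataFrame({
    "subject_id": [10, 10, 10],
    "user_name": ["alice", "alice", "bob"],
    "created_at": [
        "2024-10-07 16:08:01 UTC",
        "2024-10-07 16:13:13 UTC",
        "2024-10-07 16:09:17 UTC",
    ],
    "answer": ["no", "yes", "yes"],
})

# Strip the trailing " UTC" label and parse the timestamps as UTC datetimes
extracted["created_at"] = pd.to_datetime(
    extracted["created_at"].str.replace(" UTC", "", regex=False), utc=True
)

# Sort by time, then keep the last row for each (subject, user) pair
latest = extracted.sort_values("created_at").drop_duplicates(
    subset=["subject_id", "user_name"], keep="last"
)
print(latest)
```

Only alice's later answer survives, so each user contributes a single vote per subject.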

In [19]:
def download_classifications(WORKFLOW_ID, client):
    """
    Downloads data from Zooniverse

    Args:
        WORKFLOW_ID (int): Workflow ID of workflow being aggregated
        client: Logged in Zooniverse client

    Returns:
        classification_data (DataFrame): Raw classifications from Zooniverse
    """
    print('beginning function')
    workflow = Workflow(WORKFLOW_ID)
    # Fetch the classification export.
    # If generate=True, Zooniverse generates a new classification report, which
    # can take a long time because reports are queued in the Zooniverse system.
    # This is the same as clicking "request new report" in the project builder.
    # If you don't need the newest classifications and are okay with downloading
    # a report that you already generated, leave this flag set to False.
    with client:
        classification_export = workflow.get_export('classifications',
                                                    generate=False,
                                                    wait=False)
        # since it's a partial class, call it to get the DictReader object
        csv_dictreader_instance = classification_export.csv_dictreader()
        classification_rows = [row for row in tqdm(csv_dictreader_instance, file=sys.stdout)]
    # convert to pandas dataframe
    classification_data = pd.DataFrame.from_dict(classification_rows)
    return classification_data



def extract_data(classification_data, id_type='objectId'):
    """
    Extracts annotations from the classification data

    Args:
        classification_data (DataFrame): Raw classifications from Zooniverse
        id_type (str): Metadata key holding the Rubin ID, 'objectId' (default)
            or 'diaobjectId'

    Returns:
        extracted_data (DataFrame): Extracted annotations from raw classification data
    """
    # set up our list where we will store the extracted data temporarily
    extracted_rows = []
    # iterate through our classification data
    for i in range(len(classification_data)):
        # access the specific row and parse its subject metadata once
        row = classification_data.iloc[i]
        subject_id_str = str(row.subject_ids)
        subject_metadata = json.loads(row.subject_data).get(subject_id_str)
        # look up the Rubin ID; fall back to None if the subject or key is missing
        if isinstance(subject_metadata, dict):
            rubin_id = subject_metadata.get(id_type)
            if rubin_id is None:
                print(f"Key '{id_type}' not found in metadata for subject {subject_id_str}.")
        else:
            rubin_id = None
            print(f"Key '{subject_id_str}' not found in subject_data or it is not a dictionary.")
        # extract the annotations
        for annotation in json.loads(row.annotations):
            row_annotation = annotation_by_task({'annotations': [annotation]})
            extract = question_extractor(row_annotation)
            # add the extracted annotations to our temporary list along with some other additional data
            extracted_rows.append({
                'classification_id': row.classification_id,
                'subject_id':        row.subject_ids,
                'user_name':         row.user_name,
                'user_id':           row.user_id,
                'created_at':        row.created_at,
                'rubin_id':          rubin_id,
                'data':              json.dumps(extract),
                'task':              annotation['task']
            })
    # convert the extracted data to a pandas dataframe and sort
    extracted_data = pd.DataFrame.from_dict(extracted_rows)
    extracted_data.sort_values(['subject_id', 'created_at'], inplace=True)
    return extracted_data

def last_filter(data):
    """
    Determines the most recently submitted classifications
    """
    last_time = data.created_at.max()
    ldx = data.created_at == last_time
    return data[ldx]

def aggregate_data(extracted_data):
    """
    Aggregates question data from extracted annotations

    Args:
        extracted_data (DataFrame): Extracted annotations from raw classifications

    Returns:
        aggregated_data (DataFrame): Aggregated data for the given workflow
    """
    # generate an array of unique subject ids - these are the ones that we will iterate over
    subject_ids_unique = np.unique(extracted_data.subject_id)

    # map each subject ID to the unique Rubin IDs found in its metadata;
    # groupby sorts by subject_id, matching the order of subject_ids_unique
    rubin_ids_unique = extracted_data.groupby('subject_id')['rubin_id'].unique()

    # set up a temporary list to store reduced data
    aggregated_rows = []

    # determine the total number of tasks
    tasks = np.unique(extracted_data.task)

    # iterating over each unique subject id
    for i in range(len(subject_ids_unique)):

        # determine the subject_id to work on
        subject_id = subject_ids_unique[i]
        rubin_id = rubin_ids_unique.iloc[i][0]

        # filter the extract_data dataframe for only the subject_id that is being worked on
        extract_data_subject = extracted_data[extracted_data.subject_id==subject_id].drop_duplicates()

        for task in tasks:

            extract_data_filtered = extract_data_subject[extract_data_subject.task == task]

            # if there are fewer unique users than classifications, keep only each user's most recent classification
            if (len(extract_data_filtered.user_name.unique()) < len(extract_data_filtered)):
                extract_data_filtered = extract_data_filtered.groupby(['user_name'], group_keys=False).apply(last_filter)

            # iterate through the filtered extract data to prepare for the reducer
            classifications_to_reduce = [json.loads(extract_data_filtered.iloc[j].data) for j in range(len(extract_data_filtered))]

            # use the Zooniverse question_consensus_reducer to get the final consensus
            reduction = question_consensus_reducer(classifications_to_reduce)

            # add the subject id to our reduction data
            reduction['subject_id'] = subject_id
            reduction['task'] = task
            reduction['rubin_id'] = rubin_id

            # add the data to our temporary list
            aggregated_rows.append(reduction)


    # converting the result to a dataframe
    aggregated_data = pd.DataFrame.from_dict(aggregated_rows)

    # drop rows that are nan
    aggregated_data.dropna(inplace=True)

    return aggregated_data

## 2. Log into Zooniverse and find the workflow to download classifications from
If you're running this notebook, you should already have a Zooniverse account with a project with classifications. If you do not yet have an account, please return to notebook `01_Introduction_to_Citsci_Pipeline.ipynb`.

IMPORTANT: Your Zooniverse project must be set to "public"; a "private" project will not work. Select this setting under the "Visibility" tab (the project does not need to be set to live). 

Supply the email associated with your Zooniverse account, and then follow the instructions in the prompt to log in and select your project by slug name. 

A "slug" is the string of your Zooniverse username and your project name without the leading forward slash, for instance: "username/project-name". [Click here for more details](https://www.zooniverse.org/talk/18/967061?comment=1898157&page=1).
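As a quick sanity check before pasting a slug into the prompt, the sketch below splits a slug into its owner and project parts and tolerates a stray leading slash. `split_slug` is a hypothetical helper written for this example; it is not part of `rubin.citsci` or the panoptes client.

```python
def split_slug(slug):
    """Split 'owner/project-name' into (owner, project); hypothetical helper."""
    # drop surrounding whitespace and any leading forward slash
    cleaned = slug.strip().lstrip("/")
    owner, sep, project = cleaned.partition("/")
    if not sep or not owner or not project or "/" in project:
        raise ValueError(f"Expected 'owner/project-name', got {slug!r}")
    return owner, project


print(split_slug("username/project-name"))
print(split_slug("/username/project-name"))
```

A value with no slash, or with more than one, raises a `ValueError` instead of being passed along silently.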

**The `rubin.citsci` package includes a method that creates a Zooniverse project from template. If you wish to use this feature, do not provide a slug_name and run the subsequent cell.**

In [3]:
email = "beckynevin@gmail.com"
cit_sci_pipeline = pipeline.CitSciPipeline()
cit_sci_pipeline.login_to_zooniverse(email)

Loading and running utilities to establish a link with Zooniverse
Enter your Zooniverse username followed by password below
Enter your Zooniverse credentials...


Username:  rebecca.nevin
 ········


You now are logged in to the Zooniverse platform.

*==* Your Project Slugs *==*

rebecca-dot-nevin/template-test-copy-2024-07-09-21-49-53
rebecca-dot-nevin/template-test-copy-2024-07-09-19-02-02
rebecca-dot-nevin/template-test-copy-2024-07-09-18-53-39
rebecca-dot-nevin/template-test-copy-2024-07-03-21-54-30
rebecca-dot-nevin/template-test-copy-2024-07-02-22-11-52
rebecca-dot-nevin/pcw-2023-awesome-citsci-project
rebecca-dot-nevin/test-project
rebecca-dot-nevin/galaxy-rotation-fields




Which project would you like to connect to? (copy & paste the slug name here)? rebecca-dot-nevin/test-project


Current project set to: rebecca-dot-nevin/test-project


Use the `list_workflows` method to find the workflow ID.

In [4]:
cit_sci_pipeline.list_workflows()


*==* Your Workflows *==*

Workflow ID: 27546 - Display Name: Classify with hidden metadata
Workflow ID: 27509 - Display Name: Classify with metadata
Workflow ID: 23254 - Display Name: Classification




Copy and paste the above ID into the cell below.

In [16]:
WORKFLOW_ID = 27509

## 3. Download the classifications
The downloaded classifications are still in raw format. `download_classifications` reads the rows of the exported CSV file into a dataframe.
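The CSV-to-dataframe step can be tried offline with a small in-memory CSV. This sketch uses the standard `csv.DictReader` on made-up rows; the column names mirror a few of those in the real export, and no Zooniverse connection is needed.

```python
import csv
import io

import pandas as pd

# A tiny stand-in for a classification export (values are invented)
csv_text = (
    "classification_id,user_name,subject_ids\n"
    "1,alice,100\n"
    "2,bob,101\n"
)

# DictReader yields one dictionary per CSV row, keyed by the header names
reader = csv.DictReader(io.StringIO(csv_text))
rows = [row for row in reader]

# Every value is read as a string, including the numeric-looking IDs
df = pd.DataFrame(rows)
print(df.dtypes)
```

Because every column arrives as a string, ID columns may need an explicit cast before they are compared or joined with numeric data.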

In [17]:
client = cit_sci_pipeline.client
# with generate=False this reads a previously generated export
classification_data = download_classifications(WORKFLOW_ID, client)

beginning function
76it [00:00, 34231.86it/s]


In [18]:
classification_data

Unnamed: 0,classification_id,user_name,user_id,user_ip,workflow_id,workflow_name,workflow_version,created_at,gold_standard,expert,metadata,annotations,subject_data,subject_ids
0,589019044,rebecca.nevin,1946584,cee5ed56424764950d55,27509,Classify with metadata,3.2,2024-10-07 16:08:01 UTC,,,"{""source"":""api"",""session"":""f1580a7bdd196dc45e8...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247085"":{""retired"":null,""objectId"":""16514...",103247085
1,589019067,rebecca.nevin,1946584,cee5ed56424764950d55,27509,Classify with metadata,3.2,2024-10-07 16:08:07 UTC,,,"{""source"":""api"",""session"":""f1580a7bdd196dc45e8...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247082"":{""retired"":null,""objectId"":""15679...",103247082
2,589019090,rebecca.nevin,1946584,cee5ed56424764950d55,27509,Classify with metadata,3.2,2024-10-07 16:08:12 UTC,,,"{""source"":""api"",""session"":""f1580a7bdd196dc45e8...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247086"":{""retired"":null,""objectId"":""16515...",103247086
3,589019097,rebecca.nevin,1946584,cee5ed56424764950d55,27509,Classify with metadata,3.2,2024-10-07 16:08:14 UTC,,,"{""source"":""api"",""session"":""f1580a7bdd196dc45e8...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247083"":{""retired"":null,""objectId"":""16509...",103247083
4,589019337,rebecca.nevin,1946584,cee5ed56424764950d55,27509,Classify with metadata,5.6,2024-10-07 16:09:17 UTC,,,"{""source"":""api"",""session"":""f1580a7bdd196dc45e8...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247087"":{""retired"":null,""objectId"":""16513...",103247087
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,589020345,rebecca.nevin,1946584,77dde073c5ec44c24799,27509,Classify with metadata,5.6,2024-10-07 16:13:13 UTC,,,"{""source"":""api"",""session"":""60d31e3b15da057584e...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247082"":{""retired"":null,""objectId"":""15679...",103247082
72,589020360,rebecca.nevin,1946584,77dde073c5ec44c24799,27509,Classify with metadata,5.6,2024-10-07 16:13:18 UTC,,,"{""source"":""api"",""session"":""60d31e3b15da057584e...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247085"":{""retired"":null,""objectId"":""16514...",103247085
73,589020372,rebecca.nevin,1946584,77dde073c5ec44c24799,27509,Classify with metadata,5.6,2024-10-07 16:13:21 UTC,,,"{""source"":""api"",""session"":""60d31e3b15da057584e...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247083"":{""retired"":null,""objectId"":""16509...",103247083
74,589020385,rebecca.nevin,1946584,77dde073c5ec44c24799,27509,Classify with metadata,5.6,2024-10-07 16:13:23 UTC,,,"{""source"":""api"",""session"":""60d31e3b15da057584e...","[{""task"":""T0"",""task_label"":""Is there a galaxy ...","{""103247086"":{""retired"":null,""objectId"":""16515...",103247086


Select either `objectId` or `diaObjectId`; this should match the ID type of the data that was first sent to Zooniverse.

## 4. Extract annotations by task and sort by subject ID
The `id_type` argument should be set to either 'objectId' (default) or 'diaobjectId'. This function returns all annotations, so some `subject_id` entries appear in repeated rows, either from different users or from the same user re-classifying the same subject. The returned table also includes the Rubin IDs.
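The Rubin ID lives inside the `subject_data` column, which is a JSON string keyed by the Zooniverse subject ID. The sketch below shows the lookup on a single value shaped like the rows in the table above; `get_rubin_id` is a hypothetical helper written for this example.

```python
import json

# One subject_data value, shaped like the entries in the export
subject_data = '{"103247085": {"retired": null, "objectId": "1651448872733547971"}}'


def get_rubin_id(subject_data, subject_id, id_type="objectId"):
    """Return the Rubin ID stored under id_type, or None if it is absent."""
    metadata = json.loads(subject_data).get(str(subject_id))
    if not isinstance(metadata, dict):
        return None
    return metadata.get(id_type)


print(get_rubin_id(subject_data, 103247085))
print(get_rubin_id(subject_data, 103247085, id_type="diaobjectId"))
```

Returning `None` for a missing key makes it easy to spot subjects that were uploaded with a different ID type than the one requested.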

In [20]:
extracted_data = extract_data(classification_data, id_type='objectId')
extracted_data


## 5. Aggregate the annotations
`aggregate_data` iterates over each unique subject ID and each unique task. For each combination, it keeps the most recent classification from each user and then uses the Zooniverse consensus reducer to build a consensus across all user classifications.

In [None]:
aggregated_data = aggregate_data(extracted_data)

In [None]:
aggregated_data

## 6. Next steps and additional resources
You are now done! Congratulations!
Next steps could include joining the above table with other LSST data using the `rubin_id` column, which is either objectId or diaobjectId.
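A join on `rubin_id` might look like the sketch below. Both tables are invented stand-ins: the `consensus` column, the `coord_ra` and `coord_dec` columns, and all ID values are assumptions for illustration, and in practice the catalog would come from a query on the Rubin Science Platform.

```python
import pandas as pd

# Stand-in for the aggregated table; rubin_id is a string because it was read from JSON
aggregated = pd.DataFrame({
    "subject_id": ["201", "202"],
    "rubin_id": ["1001", "1002"],
    "consensus": ["yes", "no"],  # hypothetical column name
})

# Stand-in for a catalog keyed by an integer objectId
catalog = pd.DataFrame({
    "objectId": [1001, 1002, 1003],
    "coord_ra": [62.0, 62.1, 62.2],
    "coord_dec": [-37.0, -37.1, -37.2],
})

# Cast the ID to the catalog's integer type so the join keys match
aggregated["rubin_id"] = aggregated["rubin_id"].astype("int64")

# A left join keeps every aggregated subject, matched or not
joined = aggregated.merge(catalog, left_on="rubin_id", right_on="objectId", how="left")
print(joined)
```

The cast matters because merging a string column against an integer column either fails or silently matches nothing, depending on the pandas version.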

Additional resources include the Zooniverse team's Python client for Panoptes (https://github.com/zooniverse/panoptes-python-client/tree/master), which provides high-level access to the Zooniverse API for managing projects via Python.

For examples of how to work with the data exports, see the Zooniverse Data Digging code repository (https://github.com/zooniverse/Data-digging) or the Panoptes Aggregation Python package (https://github.com/zooniverse/aggregation-for-caesar).