<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250 style="padding: 10px" alt="Vera C. Rubin Observatory Logo"> 
<h1 style="margin-top: 10px">Retrieve and Aggregate Zooniverse Output</h1>
Authors: Becky Nevin, Clare Higgs, and Eric Rosas <br>
Contact author: Clare Higgs <br>
Last verified to run: 2024-06-20 <br>
LSST Science Pipelines version: Weekly 2024_04 <br>
Container size: small or medium <br>
Targeted learning level: intermediate

<b>Description:</b> This notebook guides a PI through the process of retrieving classification data from Zooniverse and builds upon Hayley Robert's Aggregation notebook example. <br><br>
<b>Skills:</b> 
<br><br>
<b>LSST Data Products:</b> n/a<br><br>
<b>Packages:</b> rubin.citsci, utils (citsci plotting and display utilities),  <br><br>
<b>Credit:</b> Hayley Roberts<br><br>
<b>Get Support: </b>PIs new to DP0 are encouraged to find documentation and resources at <a href="https://dp0-2.lsst.io/">dp0-2.lsst.io</a>. Support for this notebook is available and questions are welcome at cscience@lsst.org.

## 1. Introduction <a class="anchor" id="first-bullet"></a>
This notebook provides an introduction to how to use the ???? and rubin.citsci package to retrieve classifications from Zooniverse and aggregate the results.

### 1.1 Package imports <a class="anchor" id="second-bullet"></a>

#### Install Pipeline Package

First, install the Rubin Citizen Science Pipeline package by doing the following:

1. Open up a New Launcher tab
2. In the "Other" section of the New Launcher tab, click "Terminal"
3. Use `pip` to install the `rubin.citsci` package by entering the following command:
```
pip install rubin.citsci
```
Note that this package will soon be installed directly on RSP.

If this package is already installed, make sure it is updated:
```
pip install --u rubin.citsci
```

4. Confirm the next cell containing `from rubin.citsci import pipeline` works as expected and does not throw an error

5. Install panoptes_client:
```
pip install panoptes_client
pip install panoptes_aggregation
```

6. If the pip install doesn't work for `panoptes_aggregation`:
```
pip install -U git+git://github.com/zooniverse/aggregation-for-caesar.git
```
(https://www.zooniverse.org/talk/1322/2415041?comment=3969837&page=1)

In [None]:
# this is all from SLSN_batch_aggregation.py
# found here - 
# https://github.com/astrohayley/SLSN-Aggregation-Example/blob/main/SLSN_batch_aggregation.py

In [12]:
# basics
import numpy as np
import pandas as pd
import getpass
import json
import os

# Zooniverse tools
from panoptes_client import Panoptes, Workflow
from panoptes_aggregation.extractors.utilities import annotation_by_task
from panoptes_aggregation.extractors import question_extractor
from panoptes_aggregation.reducers import question_consensus_reducer

import tqdm

In [24]:
def download_classifications(WORKFLOW_ID, client):
    """
    Downloads data from Zooniverse

    Args:
        WORKFLOW_ID (int): Workflow ID of workflow being aggregated
        client: Logged in Zooniverse client

    Returns:
        classification_data (DataFrame): Raw classifications from Zooniverse
    """

    workflow = Workflow(WORKFLOW_ID)

    # generate the classifications 
    with client:
        classification_export = workflow.get_export('classifications', generate=True, wait=False)
        print('export', classification_export)
        classification_rows = [row for row in tqdm(classification_export.csv_dictreader())]

    # convert to pandas dataframe
    classification_data = pd.DataFrame.from_dict(classification_rows)

    return classification_data



def extract_data(classification_data):
    """
    Extracts annotations from the classification data

    Args:
        classification_data (DataFrame): Raw classifications from Zooniverse

    Returns:
        extracted_data (DataFrame): Extracted annotations from raw classification data
    """
    # set up our list where we will store the extracted data temporarily
    extracted_rows = []

    # iterate through our classification data
    for i in range(len(classification_data)):

        # access the specific row and extract the annotations
        row = classification_data.iloc[i]
        for annotation in json.loads(row.annotations):

            row_annotation = annotation_by_task({'annotations': [annotation]})
            extract = question_extractor(row_annotation)

            # add the extracted annotations to our temporary list along with some other additional data
            extracted_rows.append({
                'classification_id': row.classification_id,
                'subject_id':        row.subject_ids,
                'user_name':         row.user_name,
                'user_id':           row.user_id,
                'created_at':        row.created_at,
                'data':              json.dumps(extract),
                'task':              annotation['task']
            })


    # convert the extracted data to a pandas dataframe and sort
    extracted_data = pd.DataFrame.from_dict(extracted_rows)
    extracted_data.sort_values(['subject_id', 'created_at'], inplace=True)

    return extracted_data



def last_filter(data):
    """
    Determines the most recently submitted classifications
    """
    last_time = data.created_at.max()
    ldx = data.created_at == last_time
    return data[ldx]


def aggregate_data(extracted_data):
    """
    Aggregates question data from extracted annotations

    Args:
        extracted_data (DataFrame): Extracted annotations from raw classifications

    Returns:
        aggregated_data (DataFrame): Aggregated data for the given workflow
    """
    # generate an array of unique subject ids - these are the ones that we will iterate over
    subject_ids_unique = np.unique(extracted_data.subject_id)

    # set up a temporary list to store reduced data
    aggregated_rows = []

    # determine the total number of tasks
    tasks = np.unique(extracted_data.task)

    # iterating over each unique subject id
    for i in range(len(subject_ids_unique)):

        # determine the subject_id to work on
        subject_id = subject_ids_unique[i]

        # filter the extract_data dataframe for only the subject_id that is being worked on
        extract_data_subject = extracted_data[extracted_data.subject_id==subject_id].drop_duplicates()

        for task in tasks:

            extract_data_filtered = extract_data_subject[extract_data_subject.task == task]

            # if there are less unique user submissions than classifications, filter for the most recently updated classification
            if (len(extract_data_filtered.user_name.unique()) < len(extract_data_filtered)):
                extract_data_filtered = extract_data_filtered.groupby(['user_name'], group_keys=False).apply(last_filter)

            # iterate through the filtered extract data to prepare for the reducer
            classifications_to_reduce = [json.loads(extract_data_filtered.iloc[j].data) for j in range(len(extract_data_filtered))]

            # use the Zooniverse question_consesus_reducer to get the final consensus
            reduction = question_consensus_reducer(classifications_to_reduce)

            # add the subject id to our reduction data
            reduction['subject_id'] = subject_id
            reduction['task'] = task

            # add the data to our temporary list
            aggregated_rows.append(reduction)


    # converting the result to a dataframe
    aggregated_data = pd.DataFrame.from_dict(aggregated_rows)

    # drop rows that are nan
    aggregated_data.dropna(inplace=True)

    return aggregated_data





def batch_aggregation(generate_new_classifications=False, WORKFLOW_ID=13193): 
    """
    Downloads raw classifications, extracts annotations, and aggregates data

    Args:
        WORKFLOW_ID (int): Workflow ID of workflow being aggregated
        client: Logged in Zooniverse client

    Returns:
        aggregated_data (DataFrame): Aggregated data for the given workflow
    """

    if generate_new_classifications:
        # connect to client and download data
        print('Sign in to zooniverse.org:')
        client = Panoptes.client(username=getpass.getpass('username: '), password=getpass.getpass('password: '))
        print('Generating classification data - could take some time')
        classification_data = get_data_from_zooniverse(WORKFLOW_ID=WORKFLOW_ID, client=client)
        print('Saving classifications')
        classification_data.to_csv('superluminous-supernovae-classifications.csv', index=False)
    else:
        # or just open the file
        print('Loading classifications')
        classification_data = pd.read_csv('superluminous-supernovae-classifications.csv')

    # limit classifications to those for the relevant workflow
    classification_data = classification_data[classification_data.workflow_id==WORKFLOW_ID]

    # extract annotations
    print('Extracting annotations')
    extracted_data = extract_data(classification_data=classification_data)

    # aggregate annotations
    print('Aggregating data')
    final_data = aggregate_data(extracted_data=extracted_data)

    return final_data

In [3]:
from rubin.citsci import pipeline

In [7]:
email = "beckynevin@gmail.com"
slug_name = "rebecca-dot-nevin/test-project"
print("Loading and running utilities to establish a link with Zooniverse")
print("Enter your Zooniverse username followed by password below")
cit_sci_pipeline = pipeline.CitSciPipeline()
cit_sci_pipeline.login_to_zooniverse(slug_name, email)

Loading and running utilities to establish a link with Zooniverse
Enter your Zooniverse username followed by password below
To install the latest version, open up a terminal tab and run the following command:
    pip install --upgrade --force-reinstall rubin.citsci
After the upgrade installation has finished, please restart the kernel for the changes to take effect.
Enter your Zooniverse credentials...


Username:  rebecca.nevin
 ········


You now are logged in to the Zooniverse platform.


## Download the classifications
These will still be in the raw format. This function reads from the output csv and puts all rows into a dataframe format.

In [None]:
#project_id = 19539
WORKFLOW_ID = 23254
client = cit_sci_pipeline.client
# how long should this take?
classification_data = download_classifications(WORKFLOW_ID, client)

In [None]:
classification_data

Test to see if the above is the same as what we already have (below):

In [17]:
# again, this is from the citsci package
def retrieve_data(self, project_id):
    """
        Given a project ID of a project that contains a completed workflow with 
        data that has been classified, this method will request the classified/
        completed data and download it if it is available.
    """

    classification_export = panoptes_client.Project(project_id).get_export(
        "classifications"
    )
    list_rows = []

    for row in classification_export.csv_reader():
        list_rows.append(row)

    return list_rows


In [22]:
project_id = 19539
raw_clas_data = cit_sci_pipeline.retrieve_data(project_id)

In [23]:
raw_clas_data

[['classification_id',
  'user_name',
  'user_id',
  'user_ip',
  'workflow_id',
  'workflow_name',
  'workflow_version',
  'created_at',
  'gold_standard',
  'expert',
  'metadata',
  'annotations',
  'subject_data',
  'subject_ids'],
 ['460251424',
  'rebecca.nevin',
  '1946584',
  'cbfbf6eb78413a32bf60',
  '23254',
  'Classification',
  '9.7',
  '2023-01-05 17:09:18 UTC',
  '',
  '',
  '{"source":"api","session":"17680efb53f0c9ec4a9c49c9f4b8d66f3dc7482b7a8f6e9f7234ef3245d404ed","viewport":{"width":1191,"height":776},"started_at":"2023-01-05T17:09:13.704Z","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36","utc_offset":"25200","finished_at":"2023-01-05T17:09:18.588Z","live_project":false,"interventions":{"opt_in":true,"messageShown":false},"user_language":"en","user_group_ids":[],"subject_dimensions":[{"clientWidth":702,"clientHeight":562,"naturalWidth":1000,"naturalHeight":800}],"subject_selection_stat

## Extract annotations by task and sort by subject ID
There could be multiple tasks per item.

In [None]:
extracted_data = extract_data(classification_data)
extracted_data

## Aggregate the annotations
Sort by unique subject ID and then unique tasks. Find the most recent classification for each user ID, and uses the Zooniverse consensus builder to look through all user classifications and build consensus.

In [None]:
aggregated_data = aggregate_data(extracted_data)

In [None]:
# issues:
# get_data_from_zooniverse is undefined
# the pip install does not work for panoptes_aggregation
# download_classifications takes upwards of 43 minutes to run,
# is this because I haven't completed the workflow? Because I have

In [5]:
WORKFLOW_ID = 23254
batch_aggregation(WORKFLOW_ID)

Sign in to zooniverse.org:


username:  ········
password:  ········


Generating classification data - could take some time


NameError: name 'get_data_from_zooniverse' is not defined