<a href="https://colab.research.google.com/github/jimccasey1/jimccasey1.github.io/blob/master/Zooniverse_Notebook_2024_11_08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zooniverse Caesar Aggregation - Google Colab Version

## Prerequisites

Before you begin, you need:

1. **Classification Export CSV** from your Zooniverse project
   - Go to your project's Lab page
   - Click on Data Exports
   - Request a new classification export
   - Download when ready

2. **Workflow CSV** from your Zooniverse project
   - Same location as classification export
   - Download the workflow export

3. **Workflow ID**
   - Found in the Project Builder URL when editing your workflow
   - Example: in `https://www.zooniverse.org/lab/12345/workflows/67890`, the workflow ID is `67890`

4. **Workflow Version**
   - Found in your workflow export CSV
   - Use the latest version number listed for your workflow ID
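
Items 3 and 4 can also be looked up programmatically. The following is a minimal sketch, not part of the original notebook: the helper names are hypothetical, and it assumes the workflow export has `workflow_id`, `version`, and `minor_version` columns, which you should confirm against your own file. The sample rows are made up.

```python
import csv
import io
from urllib.parse import urlparse


def workflow_id_from_url(url):
    """Return the path segment that follows 'workflows' in a Project Builder URL."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[parts.index("workflows") + 1]


def latest_version(workflow_csv, workflow_id):
    """Return the highest (version, minor_version) pair listed for workflow_id.

    Assumes the export has 'workflow_id', 'version' and 'minor_version' columns.
    """
    versions = [
        (int(row["version"]), int(row["minor_version"]))
        for row in csv.DictReader(workflow_csv)
        if row["workflow_id"] == str(workflow_id)
    ]
    if not versions:
        raise ValueError(f"workflow {workflow_id} not found in export")
    return max(versions)


# Tiny stand-in for a real workflow export (made-up values).
sample = io.StringIO(
    "workflow_id,version,minor_version\n"
    "67890,11,3\n"
    "67890,12,7\n"
    "11111,40,1\n"
)
url = "https://www.zooniverse.org/lab/12345/workflows/67890"
wid = workflow_id_from_url(url)
print(wid, latest_version(sample, wid))
```

To use it on a real export, pass an open file object, for example `open('workflow.csv', newline='')`, in place of `sample`.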

## Initial Setup

⚠️ **IMPORTANT**: This notebook must be run in a specific order to work correctly in Google Colab. Follow these steps exactly:

### Step 1: Initial Package Cleanup

In [None]:
# Remove existing packages
!pip uninstall -y pandas
!pip uninstall -y mizani
!pip uninstall -y plotnine
!pip cache purge


Found existing installation: pandas 2.2.2
Uninstalling pandas-2.2.2:
  Successfully uninstalled pandas-2.2.2
Found existing installation: mizani 0.13.0
Uninstalling mizani-0.13.0:
  Successfully uninstalled mizani-0.13.0
Found existing installation: plotnine 0.14.1
Uninstalling plotnine-0.14.1:
  Successfully uninstalled plotnine-0.14.1
Files removed: 163


### Step 2: Install Correct Pandas Version

In [None]:
# Force-reinstall the pinned pandas version without touching its dependencies
!pip install pandas==2.2.2 --force-reinstall --no-deps
# -I (--ignore-installed) reinstalls pandas and its dependencies over whatever is present
!pip install -I pandas==2.2.2

Collecting pandas==2.2.2
  Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.1.3
    Uninstalling pandas-2.1.3:
      Successfully uninstalled pandas-2.1.3
Successfully installed pandas-2.2.2
Collecting pandas==2.2.2
  Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting numpy>=1.22.4 (from pandas==2.2.2)
  Using cached numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting python-dateutil>=2.8.2 (from pandas==2.2.2)
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas==2.2.2)
  Using cached pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7

### Step 3: First Version Check

In [None]:
# Verify pandas version
import pandas as pd
print(f"Pandas version: {pd.__version__}")

# If pandas version is not 2.2.2:
# 1. Restart runtime (Runtime > Restart runtime)
# 2. Run Step 2 again
# 3. Run this version check again

Pandas version: 2.2.2
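
`pd.__version__` reports the copy of pandas already loaded in the running kernel, which is why a restart can be needed after reinstalling. As a complementary check, the sketch below reads the version recorded in the installed package metadata using the standard-library `importlib.metadata`. The helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def needs_reinstall(package, wanted):
    """Return True when package is missing or installed at a different version."""
    return installed_version(package) != wanted


print("pandas on disk:", installed_version("pandas"))
print("reinstall needed:", needs_reinstall("pandas", "2.2.2"))
```

If the version on disk is 2.2.2 but `pd.__version__` prints something else, the kernel is still holding the old module and a runtime restart should resolve it.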


### Step 4: Install Required Packages

In [None]:
# Only proceed if pandas version shows 2.2.2

# Install build dependencies first
!apt-get update
!apt-get install -y build-essential python3-dev

# Install scientific computing dependencies
!pip install numpy
!pip install scipy
!pip install scikit-learn

# Uninstall existing panoptes installation if any
!pip uninstall -y panoptes-aggregation

# Install fresh from GitHub
!pip install -U git+https://github.com/zooniverse/aggregation-for-caesar.git

# Install remaining packages
!pip install tqdm

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Hit:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Fetched 264 kB in 1s (202 kB/s)

### Step 5: Final Setup Verification
⚠️ RESTART RUNTIME AGAIN BEFORE RUNNING THIS CELL

In [None]:
import pandas as pd
import panoptes_aggregation
print(f"Pandas version: {pd.__version__}")
print(f"Panoptes aggregation version: {panoptes_aggregation.__version__}")

ModuleNotFoundError: No module named 'panoptes_aggregation'

## Configuration

In [None]:
# Import required libraries
import os
import json
import re
from tqdm import tqdm
from panoptes_aggregation import extractors, reducers
from panoptes_aggregation.running import setup_csv

ModuleNotFoundError: No module named 'panoptes_aggregation.running'
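
When an import fails like this, it can help to list which submodules the installed package actually provides before guessing at names. The sketch below uses only the standard library (`importlib` and `pkgutil`). The helper name is hypothetical, and the demonstration runs on the built-in `json` package so it works even when `panoptes_aggregation` is not installed.

```python
import importlib
import pkgutil


def list_submodules(package_name):
    """Return sorted names of the direct submodules of an installed package.

    Returns an empty list if the package cannot be imported.
    """
    try:
        package = importlib.import_module(package_name)
    except ImportError:
        return []
    # Plain modules (not packages) have no __path__ and therefore no submodules.
    search_path = getattr(package, "__path__", [])
    return sorted(info.name for info in pkgutil.iter_modules(search_path))


# Demonstrated on the standard-library 'json' package; swap in
# 'panoptes_aggregation' once it is installed.
print(list_submodules("json"))
```

An empty list for `panoptes_aggregation` means the package itself is not importable, which points back to Step 4 rather than to a wrong submodule name.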

In [None]:
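# Define default character replacements.
# Keys are raw strings, so each one matches the literal escape sequence
# (for example the six characters \u2019), not the decoded character.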
DEFAULT_REPLACEMENTS = {
    r'\u2019': "'",    # smart apostrophe
    r'\u201c': '"',    # opening smart quote
    r'\u201d': '"',    # closing smart quote
    r'\u00e9': 'é',    # e acute
    r'\u00f1': 'ñ',    # n tilde
    r'\n': ' ',        # newline to space
    r'\t': ' ',        # tab to space
    r'\"': '"',        # escaped quote
    r'\\': '',         # escaped backslash (two characters)
    r'\r': ' ',        # carriage return
    r'\u00a0': ' ',    # non-breaking space
    r'\u2013': '-',    # en dash
    r'\u2014': '--',   # em dash
}

# Add your custom replacements here
CUSTOM_REPLACEMENTS = {
    # Example:
    # r'\u00fc': 'ü',  # u umlaut
}

# Combine both dictionaries
ALL_REPLACEMENTS = {**DEFAULT_REPLACEMENTS, **CUSTOM_REPLACEMENTS}
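
The `{**a, **b}` merge gives the second dictionary priority when a key appears in both, so an entry in `CUSTOM_REPLACEMENTS` overrides the default for the same sequence. A small illustration with made-up entries (the names `defaults`, `custom`, and `merged` are only for this example):

```python
defaults = {r"\n": " ", r"\u2014": "--"}
custom = {r"\u2014": "-", r"\u00fc": "ü"}

# Later dictionaries win on duplicate keys; a duplicated key keeps the
# position it had in the first dictionary.
merged = {**defaults, **custom}

print(merged[r"\u2014"])   # the custom value replaces the default
print(list(merged))        # insertion order is preserved
print(len(r"\u2019"))      # a raw-string key is six literal characters
```

Order matters because the replacements are applied one after another in dictionary order.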

In [None]:
# Set your workflow parameters
# Replace these with your values
WORKFLOW_ID = "ENTER_YOUR_WORKFLOW_ID"
WORKFLOW_VERSION = "ENTER_YOUR_WORKFLOW_VERSION"

# Directory for output files
OUTPUT_DIR = "aggregated_output"
os.makedirs(OUTPUT_DIR, exist_ok=True)
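
A quick sanity check can catch placeholders that were never replaced before any long-running step starts. This is a sketch with a hypothetical helper. It assumes the workflow ID is all digits and that the version is either a whole number or `major.minor`; adjust the patterns if your project uses a different form.

```python
import re


def check_workflow_params(workflow_id, workflow_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Assumption: workflow IDs are plain integers such as 67890.
    if not re.fullmatch(r"\d+", str(workflow_id)):
        problems.append(f"WORKFLOW_ID looks wrong: {workflow_id!r}")
    # Assumption: versions look like 12 or 12.7.
    if not re.fullmatch(r"\d+(\.\d+)?", str(workflow_version)):
        problems.append(f"WORKFLOW_VERSION looks wrong: {workflow_version!r}")
    return problems


for problem in check_workflow_params("ENTER_YOUR_WORKFLOW_ID", "12.7"):
    print(problem)
```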

In [None]:
# Generate configuration from workflow CSV
try:
    # Replace 'workflow.csv' with your workflow CSV filename
    config = setup_csv.Config('workflow.csv', workflow_id=WORKFLOW_ID, version=WORKFLOW_VERSION)
    print(f"Configuration generated for workflow {WORKFLOW_ID} version {WORKFLOW_VERSION}")
except Exception as e:
    print(f"Error generating configuration: {str(e)}")
    print("Check your workflow CSV file and workflow ID/version")


## Extract

In [None]:
try:
    # Load the extractor configuration generated in the Configuration step
    with open(config.extractor_config, 'r') as conf_file:
        extractor_config = json.load(conf_file)

    # Run the extractor
    # Replace 'classifications.csv' with your classifications CSV filename
    print("Extracting classifications...")
    extracted_data = extractors.extract_csv(
        'classifications.csv',
        extractor_config,
        output_dir=OUTPUT_DIR,
        show_progress=True
    )
    print("Extraction complete!")
except Exception as e:
    print(f"Error during extraction: {str(e)}")
    print("Check your classifications CSV file")

## Reduce

In [None]:
try:
    print("Reducing extracted data...")
    reduced_data = reducers.reduce_csv(
        extracted_data['data_file'],
        extractor_config['reducer'],
        output_dir=OUTPUT_DIR,
        show_progress=True
    )
    print("Reduction complete!")
except Exception as e:
    print(f"Error during reduction: {str(e)}")

## Clean and Export Results

In [None]:
def clean_text(text):
    """Replace each literal escape sequence in ALL_REPLACEMENTS with its substitute."""
    cleaned = text
    # str.replace does literal (non-regex) matching, applied in dictionary order
    for sequence, replacement in ALL_REPLACEMENTS.items():
        cleaned = cleaned.replace(sequence, replacement)
    return cleaned

def scan_for_special_chars(text):
    """Scan for any remaining escaped unicode or special characters."""
    # Look for \u followed by exactly 4 hex digits
    unicode_chars = set(re.findall(r'\\u[0-9a-fA-F]{4}', text))
    # Look for a backslash followed by any character other than 'u'
    escaped_chars = set(re.findall(r'\\[^u]', text))
    return unicode_chars, escaped_chars
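
To see how the two helpers interact, here is a self-contained sketch on a made-up answer. It redefines both functions locally, with the replacement table passed as a parameter and deliberately kept short, so that one escape sequence (`\t`) and one control character (`\u0007`) are left for the scanner to report.

```python
import json
import re

# Deliberately incomplete table so the scanner has something to find.
REPLACEMENTS = {r"\n": " ", r"\"": '"'}


def clean_text(text, replacements):
    """Apply each literal replacement in order."""
    for sequence, replacement in replacements.items():
        text = text.replace(sequence, replacement)
    return text


def scan_for_special_chars(text):
    """Return leftover \\uXXXX sequences and other backslash escapes."""
    unicode_chars = set(re.findall(r"\\u[0-9a-fA-F]{4}", text))
    escaped_chars = set(re.findall(r"\\[^u]", text))
    return unicode_chars, escaped_chars


# json.dumps writes the newline, tab, quotes and control character as escapes.
raw = json.dumps({"answer": 'She said "hi"\nthen\tleft\x07'}, ensure_ascii=False)
cleaned = clean_text(raw, REPLACEMENTS)
unicode_chars, escaped_chars = scan_for_special_chars(cleaned)

print(cleaned)
print(unicode_chars, escaped_chars)
```

Anything the scanner reports is a candidate for `CUSTOM_REPLACEMENTS`.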

In [None]:
def export_results():
    try:
        # Load reduced data
        with open(reduced_data['data_file'], 'r') as f:
            data = json.load(f)

        print(f"Exporting {len(data)} subjects to text files...")
        special_chars_found = set()
        escaped_chars_found = set()

        # Process each subject
        for subject in tqdm(data):
            subject_id = subject['subject_id']
            answers = subject['data']  # Adjust this based on your data structure

            # Clean the text
            cleaned_answers = clean_text(json.dumps(answers, ensure_ascii=False))

            # Scan for any remaining special characters
            unicode_chars, escaped_chars = scan_for_special_chars(cleaned_answers)
            special_chars_found.update(unicode_chars)
            escaped_chars_found.update(escaped_chars)

            # Write to file
            output_file = os.path.join(OUTPUT_DIR, f"subject_{subject_id}.txt")
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(cleaned_answers)

        print("\nExport complete!")

        # Report any found special characters
        if special_chars_found or escaped_chars_found:
            print("\nFound additional characters that might need replacement:")
            if special_chars_found:
                print("\nUnicode characters:")
                for char in sorted(special_chars_found):
                    print(f"    {char}")
            if escaped_chars_found:
                print("\nEscaped characters:")
                for char in sorted(escaped_chars_found):
                    print(f"    {char}")

            print("\nTo handle these, add them to CUSTOM_REPLACEMENTS above.")

    except Exception as e:
        print(f"Error during export: {str(e)}")

# Run the export
export_results()

## Verify Results

In [None]:
def verify_results():
    try:
        # Load reduced data to get expected subject count
        with open(reduced_data['data_file'], 'r') as f:
            data = json.load(f)
        expected_count = len(data)

        # Count actual files
        actual_files = len([f for f in os.listdir(OUTPUT_DIR)
                          if f.startswith('subject_') and f.endswith('.txt')])

        print(f"Expected files: {expected_count}")
        print(f"Created files: {actual_files}")

        if expected_count == actual_files:
            print("✅ All files successfully created!")
        else:
            print("⚠️ Warning: Number of files doesn't match expected count")

    except Exception as e:
        print(f"Error during verification: {str(e)}")

# Run verification
verify_results()
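
A count mismatch does not say which subjects are missing. The sketch below, using a hypothetical helper and made-up subject IDs, checks for each expected `subject_<id>.txt` file by name. It runs in a temporary directory so it can be tried without touching `OUTPUT_DIR`.

```python
import tempfile
from pathlib import Path


def missing_subjects(directory, subject_ids):
    """Return the subject IDs that have no subject_<id>.txt file in directory."""
    folder = Path(directory)
    return sorted(
        sid for sid in subject_ids
        if not (folder / f"subject_{sid}.txt").is_file()
    )


# Demonstration in a throwaway directory with made-up subject IDs.
with tempfile.TemporaryDirectory() as tmp:
    for sid in (101, 102):
        (Path(tmp) / f"subject_{sid}.txt").write_text("{}", encoding="utf-8")
    print(missing_subjects(tmp, [101, 102, 103]))
```

Checking by name also avoids a false match when files left over from an earlier run happen to make the totals agree.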

## Troubleshooting

If you encounter issues:

1. **Package Version Conflicts**
   - Return to Step 1 of Initial Setup
   - Follow all steps in order
   - Make sure to restart runtime when indicated

2. **Import Errors**
   - Make sure you've restarted the runtime after installing packages
   - Verify pandas version is 2.2.2
   - Try running the setup steps again from the beginning

3. **File Upload Issues**
   - Make sure your CSV files are uploaded to Colab
   - Verify the filenames match what's in your code
   - Check file permissions

4. **Memory Issues**
   - Try restarting the runtime
   - Consider using a smaller dataset for testing
   - Clear output of previous cells

Remember: Always restart the runtime when indicated, and run cells in order!
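
For the file upload checks in item 3, a short script can confirm that each expected CSV is present and non-empty before the pipeline starts. This is a sketch with a hypothetical helper; the file names are the placeholders used earlier in this notebook, and the demonstration runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def missing_or_empty(filenames, directory="."):
    """Return the names that are absent from directory or have zero size."""
    folder = Path(directory)
    problems = []
    for name in filenames:
        path = folder / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems


# Demonstration: one populated file, one empty file, one never uploaded.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "workflow.csv").write_text("workflow_id\n67890\n", encoding="utf-8")
    (Path(tmp) / "classifications.csv").write_text("", encoding="utf-8")
    print(missing_or_empty(["workflow.csv", "classifications.csv", "other.csv"], tmp))
```

In Colab, call it with the default `directory="."` after uploading, since uploaded files land in the current working directory.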