# Setup

In [None]:
%%capture
!pip install --progress-bar off poetry
!pip install --progress-bar off git+https://github.com/oughtinc/ergo.git@87c7bc2b38c3007aab38da46c441cef548217e31

In [None]:
import warnings
warnings.filterwarnings(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="plotnine")

In [None]:
import pandas as pd
from datetime import datetime

import ergo
from ergo.platforms.metaculus.question import MetaculusQuestion, LinearQuestion, LogQuestion, ContinuousQuestion

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Get questions

Get all *open* questions on the *main* subdomain.

NOTE: log date questions are excluded because Ergo cannot currently handle them.

In [None]:
metaculus = ergo.Metaculus(username="oughtpublic", password="123456", api_domain="www")

In [None]:
qs = metaculus.get_questions(question_status="open", pages=99999, load_detail=False)

In [None]:
# The summary (non-detailed) version of a question with open boundaries
# lacks the probability above and below the question bounds,
# which we'd like to have.
# So, fetch the full question data for those questions.
for q in qs:
    if getattr(q, "low_open", False) or getattr(q, "high_open", False):
        q.refresh_question()

In [None]:
exemplar_q_id = 3530

In [None]:
exemplar_q = metaculus.get_question(exemplar_q_id)

In [None]:
qs_df = exemplar_q.to_dataframe(qs)

# Get question metadata

## Get all of the metadata already on the question

Get the field names from the question JSON from Metaculus:

In [None]:
metaculus_json_fields = list(exemplar_q.data.keys())

Get the property names from the MetaculusQuestion and ContinuousQuestion classes:

In [None]:
def properties(some_class):
    """
    Get the names of all @properties defined directly on a class
    (inherited properties are not included)
    """
    class_items = some_class.__dict__.items()
    return [name for (name, value) in class_items if isinstance(value, property)]
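
Because `some_class.__dict__` holds only the attributes defined directly on that class, inherited properties are not listed, which is presumably why both `MetaculusQuestion` and `ContinuousQuestion` are inspected here. A minimal sketch of that behavior, using two hypothetical classes (`Base` and `Child` are illustrative and not part of Ergo):

```python
def properties(some_class):
    """Get the names of all @properties defined directly on a class."""
    class_items = some_class.__dict__.items()
    return [name for (name, value) in class_items if isinstance(value, property)]


class Base:
    @property
    def title(self):
        return "base title"


class Child(Base):
    @property
    def low_open(self):
        return True

    def refresh(self):  # a plain method, so it is not reported
        pass


base_props = properties(Base)    # properties defined on Base only
child_props = properties(Child)  # the inherited "title" is absent here
combined = base_props + child_props
print(combined)
```

Concatenating the two lists, as the notebook does, recovers the properties of the whole hierarchy.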

In [None]:
question_properties = properties(MetaculusQuestion)

In [None]:
continuous_question_properties = properties(ContinuousQuestion)

In [None]:
%%capture
q_fields = metaculus_json_fields + question_properties + continuous_question_properties

simple_fields = [field for field in q_fields if type(getattr(exemplar_q, field)) in [bool, int, float, str, datetime]]

# This property causes an exception for some reason.
# Didn't seem worth investigating
simple_fields.remove("question_range_width")

for field in simple_fields:
    qs_df[field] = [getattr(q, field, None) for q in qs]
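
One possible explanation for the exception (an assumption, not verified against Ergo): `getattr(obj, name, default)` falls back to the default only when the lookup raises `AttributeError`, so any other exception raised inside a property still propagates. The hypothetical `ToyQuestion` class below sketches that behavior:

```python
class ToyQuestion:
    """Hypothetical stand-in for a question object; not an Ergo class."""

    def __init__(self, scale):
        self.scale = scale

    @property
    def range_width(self):
        # Raises TypeError (not AttributeError) when scale is None
        return self.scale["high"] - self.scale["low"]


good = ToyQuestion({"low": 0, "high": 10})
bad = ToyQuestion(None)

width = getattr(good, "range_width", None)      # property evaluates normally
missing = getattr(good, "no_such_field", None)  # AttributeError, so the default is used

try:
    getattr(bad, "range_width", None)
    raised = False
except TypeError:
    # The default does not help: getattr does not swallow TypeError
    raised = True

print(width, missing, raised)
```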

## Generate and add more metadata

In [None]:
def get_p_outside(q):
    """
    Get the probability mass of the community prediction
    that lies outside the question range, or None if it isn't available
    """
    if not hasattr(q, "latest_community_percentiles"):
        return None

    # q.latest_community_percentiles is a float for binary questions:
    # https://github.com/oughtinc/ergo/pull/378
    if isinstance(q.latest_community_percentiles, float):
        return None

    return q.latest_community_percentiles["low"] + (1 - q.latest_community_percentiles["high"])
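
As a worked example of the formula above, assume `"low"` is the probability mass below the lower bound and `"high"` is the cumulative probability up to the upper bound, which is the reading the return expression implies. The percentile values are made up for illustration:

```python
import math

# Hypothetical community percentiles; real values come from Metaculus
percentiles = {"low": 0.02, "high": 0.95}

p_below = percentiles["low"]       # mass below the lower bound
p_above = 1 - percentiles["high"]  # mass above the upper bound
p_outside = p_below + p_above

# Compare with a tolerance, since these are floating-point values
print(math.isclose(p_outside, 0.07))
```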

In [None]:
metadata_columns = {
    "type": lambda q: type(q).__name__,
    "num_boundaries_open": lambda q: int(q.low_open) + int(q.high_open) if hasattr(q, "low_open") else None,
    "question_scale_low": lambda q: q.scale.low if hasattr(q, "scale") else None,
    "question_scale_high": lambda q: q.scale.high if hasattr(q, "scale") else None,
}

In [None]:
for (name, fn) in metadata_columns.items():
    qs_df[name] = [fn(q) for q in qs]
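
The pattern above maps each column name to a function of a single question, then applies every function to every question. A self-contained sketch with hypothetical toy classes (not Ergo classes), using plain lists in place of DataFrame columns:

```python
class ToyBinary:
    """Hypothetical question without bounds; not an Ergo class."""


class ToyContinuous:
    """Hypothetical question with boundary flags; not an Ergo class."""

    def __init__(self, low_open, high_open):
        self.low_open = low_open
        self.high_open = high_open


toy_columns = {
    "type": lambda q: type(q).__name__,
    "num_boundaries_open": lambda q: (
        int(q.low_open) + int(q.high_open) if hasattr(q, "low_open") else None
    ),
}

toy_qs = [ToyBinary(), ToyContinuous(True, False), ToyContinuous(True, True)]

# One list per column, in question order; the notebook assigns each list to qs_df[name]
toy_data = {name: [fn(q) for q in toy_qs] for (name, fn) in toy_columns.items()}
print(toy_data)
```

The `hasattr` guard is what lets one function handle question types that lack the attribute, yielding `None` for them.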

## Select and reorder columns

We have these columns:

In [None]:
qs_df[qs_df["id"] == exemplar_q_id]

Select all the ones that might plausibly be useful, and put them in a reasonable order:

In [None]:
qs_df = qs_df[[
    "id",
    "title",
    "question_url",
    "type",
    "num_boundaries_open",
    "low_open",
    "high_open",
    "p_outside",
    "question_scale_low",
    "question_scale_high",
    "anon_prediction_count",
    "last_activity_time",
    "votes",
    "comment_count",
    "created_time",
    "publish_time",
    "close_time",
    "resolve_time",
    "author_name",
    "last_read",
    "has_predictions",
    "activity",
    "title_short"
]]

In [None]:
qs_df[qs_df["id"] == exemplar_q_id]

## Explanations of some important fields
- `low_open`: Is the lower boundary of the question open? (only applies to ContinuousQuestions)
- `high_open`: Is the upper boundary of the question open? (only applies to ContinuousQuestions)
- `p_outside`: How much of the total probability mass of the community prediction is outside the question range? (only applies to ContinuousQuestions)
- `anon_prediction_count`: Seems to be a proxy for the number of predictions. See "Data notes" below.
- `last_activity_time`: Seems to be a quick proxy for the time of the last prediction on the question. See "Data notes" below.
- `comment_count`: How many comments have been left on this question?
- `created_time`: When did the author of the question create it? (I think)
- `publish_time`: When was the question published to all Metaculus users?
- `close_time`: After what time are predictions on this question no longer allowed?
- `resolve_time`: When can the question be resolved, i.e. when will the answer be known?

## Data notes
1. `anon_prediction_count` is the closest thing I could find to a count of predictions, but I'm not sure how it relates to the actual number of predictions. In my testing:
    1. It seems to always be the same as the length of `prediction_timeseries`
    2. It seems to be correlated with something about the number of predictions shown. E.g.
        1. it's 101 for this question where the community prediction is shown: https://www.metaculus.com/questions/3530/how-many-people-will-die-as-a-result-of-the-2019-novel-coronavirus-covid-19-before-2021/.
        2. While it's 0 for this question where the community prediction is not shown yet: https://www.metaculus.com/questions/4614/when-will-directly-removing-carbon-dioxide-from-the-atmosphere-be-economically-feasible/ 
    3. I couldn't get it to increment. I tried:
        1. making a new prediction with an account that had already predicted on the question
        2. making a prediction with an account that had never predicted on that question before.
2. `last_activity_time` seems like the most obvious easy proxy for when the most recent prediction was made. However, I'm not sure how reliable it is.
    1. It did not update when I made a new prediction from an account that had already predicted on the question previously
    2. It may update when people leave comments or at other times
    3. Alternatively, we could use the last time from the `prediction_timeseries`, but that also doesn't seem to update every time someone makes a prediction
3. To get the datetime of the last posted comment, I think we'd need to retrieve it from a separate API (probably at least 30 minutes of work, maybe more like hours), so I haven't tried.
4. Log date questions are excluded here because Ergo can't handle them yet.
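
The alternative mentioned in note 2, taking the last time from `prediction_timeseries`, could be sketched roughly as follows. The helper name `last_timeseries_time` is hypothetical, and the sketch assumes each timeseries entry is a dict whose `"t"` key holds a Unix timestamp in seconds; that layout is an assumption and should be checked against the actual question JSON. The same caveat applies as above: the timeseries does not seem to update on every prediction.

```python
from datetime import datetime, timezone


def last_timeseries_time(prediction_timeseries):
    """
    Return the time of the last entry as a UTC datetime, or None if empty.

    ASSUMPTION: each entry is a dict whose "t" key holds a Unix timestamp
    in seconds. Verify this against the real question data before use.
    """
    if not prediction_timeseries:
        return None
    return datetime.fromtimestamp(prediction_timeseries[-1]["t"], tz=timezone.utc)


# Made-up data for illustration
toy_timeseries = [{"t": 1580000000.0}, {"t": 1590000000.0}]
print(last_timeseries_time(toy_timeseries))
```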

# View data

## Export as csv

(For use when running locally in `ergo/notebooks`)

In [None]:
# qs_df.to_csv("../ergo/contrib/metac_qs_data/metac_qs_data.csv", index=False, float_format='%.20f')

A version of this CSV is uploaded as a [Google Sheet](https://docs.google.com/spreadsheets/d/1Aii_IkUTiJH6t14n2lhwhu4PJJjlTz6X5vdEi5gPGa0/edit#gid=1305569144).

## View all questions

In [None]:
qs_df