# JCS Query

## Instructions

 1. After executing "Initialization", change the parameters under "User Options" to suit your needs. "output_s3_subkey" must be unique for each new query.

 2. Execute the "Prepare Input" and "Launch Query" cells. If there is something wrong with the input csv or selector, it will raise an error before the query takes place, and you'll get a chance to see what it was and fix it.

 3. Wait for the query to finish. You cannot tell when a batch query is finished using just this notebook, although it usually finishes in around 15 minutes regardless of size.

 4. After the query is finished, execute the rest of the cells, starting from "Collect Query Results". Unlike the query itself, the time and memory this step takes depend on the number of points and the amount of requested data.
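
Step 1 asks for a unique "output_s3_subkey" per query. One way to get that, sketched below, is to append a UTC timestamp and a short random suffix to a base name. The helper `make_unique_subkey` and the key layout are illustrative assumptions, not part of the notebook's libraries.

```python
import uuid
from datetime import datetime, timezone


def make_unique_subkey(base: str) -> str:
    """Return `base` plus a UTC timestamp and a short random suffix.

    Hypothetical helper: the notebook only requires that the subkey be
    unique per query; this naming scheme is just one way to achieve that.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]
    # Strip stray slashes so the key composes cleanly into an S3 prefix.
    return f"{base.strip('/')}/{stamp}-{suffix}"


output_s3_subkey = make_unique_subkey("test/my_query")
print(output_s3_subkey)
```

Keep a record of the generated value if you may want to reload the results later, since the raw file is stored under this subkey.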

#### Selector Info:

The selector chooses what data is queried. The notebook supports CSG>=2.1. Notes:
 - `None` will return data from the final step of the JCS pipeline
 - If you want a subset, provide a *list* of values, even if the list has only one thing in it
 - You cannot ask for a subset of metrics unless there is only one peril
 - "baseline"/"baseline_average" are for year 1995
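
The list rules above can be checked before building a selector. The following is a minimal sketch; `check_selector_options` is a hypothetical helper (it is not part of `releases.selector_jcs`), and the peril and metric names in the usage lines are placeholders rather than real JCS values.

```python
from typing import List, Optional


def check_selector_options(
    perils: Optional[List[str]] = None,
    metrics: Optional[List[str]] = None,
) -> None:
    """Raise ValueError if the options break the selector rules.

    Hypothetical pre-check; the real selector may validate differently.
    """
    for name, value in (("perils", perils), ("metrics", metrics)):
        # A subset must be given as a list, even for a single value.
        if value is not None and not isinstance(value, list):
            raise ValueError(
                f"{name} must be None or a list, got {type(value).__name__}"
            )
    # A metric subset is only allowed when exactly one peril is selected.
    if metrics is not None and (perils is None or len(perils) != 1):
        raise ValueError("a subset of metrics requires exactly one peril")


check_selector_options(perils=None, metrics=None)  # defaults are fine
check_selector_options(perils=["peril_a"], metrics=["metric_x"])  # placeholders
```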

#### Important
 - If you've previously performed a query and want to reload an existing raw file, then after the "Prepare Input" section skip to the "Optional - load existing raw file" section and uncomment that code
 - This notebook is not intended to run hundreds of thousands of points. If that is what you want to do, you'll need to use separate code in conjunction with this notebook. Instructions on how to do that aren't listed here. The recommended upper limit is 50,000 points.
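
A simple guard can make the recommended limit explicit before a query is launched. This is only a sketch: `check_point_count` is a hypothetical helper, and the notebook itself does not enforce the limit.

```python
RECOMMENDED_MAX_POINTS = 50_000  # recommended upper limit stated above


def check_point_count(n_points: int, limit: int = RECOMMENDED_MAX_POINTS) -> None:
    """Raise ValueError when the input exceeds the recommended size."""
    if n_points > limit:
        raise ValueError(
            f"{n_points} points exceeds the recommended limit of {limit}; "
            "use separate code for inputs of this size"
        )


check_point_count(1_200)  # passes silently
```

In the notebook this could be called as `check_point_count(len(input_df))` once the "Prepare Input" cell has run.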

## Initialization

In [None]:
import logging
import shutil

# StreamHandler takes a stream, not a name; with no argument it logs to stderr
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logging.getLogger().addHandler(consoleHandler)

import time
import boto3
import numpy as np
import pandas as pd
import dask.dataframe as dd
from typing import Optional, List
from dask.distributed import Client
from pandas.api.types import CategoricalDtype

from jupiter_csg_query.core import URL, Selector, Result, Location
from jupiter_csg_query.util import generator
from jupiter_csg_query.source import jcs
from jupiter_csg_query.runtime.batch import BatchProvider
from jupiter_csg_query.runtime.parallel.provider import ParallelProvider
from jupiter_csg_query.source.jcs.query_source import PartitionJcs  # Not used?

import postprocess.postprocessing_jcs as postprocessing_jcs
import preprocess.geocoding as geocoding
from releases.selector_jcs import jcs_selector
from releases.selector_jcs_dem import jcs_dem_selector
from validate.validate import validate_query_input
from postprocess import collect, util

import warnings

warnings.filterwarnings('ignore')

## User Options

In [None]:
# current user
stage = "production"
output_s3_subkey = "test/unique_subkey"
input_url = "infile.csv"
output_url = "outfile.csv"

#
# Create a selector for JCS query
#

# Required values
# Note: the same thresholds_version is used for all queries at present
thresholds_version = 1
# Defines which version of source files to use
csg_version = "2.2"
# Defines how source data is treated, e.g. treat the data as if it were
# from the 2.2 release in terms of peril/metric availability
metric_set = "2.2"

# Optional arguments; set these to None to use the default value
# Otherwise, should be specified with lists. If specific metrics are
# desired, a single peril must be specified.
check_sets = None  # default: ["combined_checkset"]
perils = None  # default: all perils
metrics = None  # default: ["combined_metric"]
epochs = None  # default: all epochs
scenarios = None  # default: ["combined_scenario"]
uq_bands = None  # uncertainty bands; default: ["combined_uq"]

# provider_type options are "batch" or "parallel";
# "parallel" should only be used if you anticipate
# having 15 or fewer points
provider_type = "batch"

# enable this only if the input dataframe is missing lat-lons
# and we need to geocode the addresses.
# there are additional options in the geocode cell but you usually
# won't need to change them.
do_geocode = False

validate_input = True
check_for_missing_data = False  # TODO
include_debug_messages = False
dry_run = True
ignore_missing_record_errors = True
missing_record_retries = 5

# currently, if `region` isn't None, all it does is create a column called
# "jupiterRegion" with every row set to the given value
region = None

# if you want to manually define or edit the selector(s) (NOT RECOMMENDED), do so here
selector = jcs_selector(
    metric_set=metric_set,
    csg_ver=csg_version,
    thresholds_version=thresholds_version,
    perils=perils,
    check_sets=check_sets,
    epochs=epochs,
    scenarios=scenarios,
    metrics=metrics,
    bands=uq_bands,
)
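
The guidance on `provider_type` in the options above (use "parallel" only for 15 or fewer points) can be written as a one-line rule. This is a sketch; `suggest_provider_type` is a hypothetical helper, not something the query library provides.

```python
PARALLEL_MAX_POINTS = 15  # threshold from the provider_type comment


def suggest_provider_type(n_points: int) -> str:
    """Suggest "parallel" for very small inputs and "batch" otherwise."""
    return "parallel" if n_points <= PARALLEL_MAX_POINTS else "batch"


print(suggest_provider_type(10))     # parallel
print(suggest_provider_type(2_000))  # batch
```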

### If querying JCS DEM results, use the following selector

In [None]:
selector_dem = jcs_dem_selector(
    dem_version=8,
    thresholds_version=1
)

In [None]:
# --------------------------------------------------------------------------------
# please do not change this cell unless you know what you're doing
# --------------------------------------------------------------------------------
csg_bucket = "jupiter-climatescoreglobal-eos"
output_bucket = "jupiter-climatescoreglobal-eos" # alternatively, this might be a customer-specific bucket
source_loc = f's3://{csg_bucket}/{stage}/'
work_prefix = f"reports/query/{output_s3_subkey}/"
work_loc = f's3://{output_bucket}/{work_prefix}'

# location for source, work-area and output files
source_url = URL(source_loc)
work_area_url = URL(work_loc)

# define current wind units so we don't convert twice
current_wind_units = 'mps'

batch_settings = dict(
    work_area_url=work_area_url,
    parallelism=250,
    job_queue='csg-global-prod-validation',
    job_definition='csg-query-dev:13',
    region='us-east-1'
)

if include_debug_messages:
    logging.getLogger('jupiter.csg.query').setLevel(logging.DEBUG)
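
To see where a query's files end up, the path composition in the cell above can be reproduced in isolation. This is a minimal sketch: `build_work_location` is a hypothetical helper, and the slash stripping is an added convenience that the original cell does not perform.

```python
from typing import Tuple


def build_work_location(output_bucket: str, output_s3_subkey: str) -> Tuple[str, str]:
    """Return (work_prefix, work_loc) using the same layout as the cell above."""
    # Stripping slashes is an assumption added here to avoid "//" in the key.
    work_prefix = f"reports/query/{output_s3_subkey.strip('/')}/"
    work_loc = f"s3://{output_bucket}/{work_prefix}"
    return work_prefix, work_loc


prefix, loc = build_work_location("jupiter-climatescoreglobal-eos", "test/unique_subkey")
print(prefix)  # reports/query/test/unique_subkey/
print(loc)     # s3://jupiter-climatescoreglobal-eos/reports/query/test/unique_subkey/
```

The raw results file referenced in "Optional - load existing raw file" sits at this location with the name `all.csv`.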

## Prepare Input

#### Geocoding (optional)
You can pass in your own `address_cols` and `loc_id_col` if your dataframe doesn't match the expected defaults:

 - Default `address_cols` to be geocoded: `("streetAddress", "streetAddress2", "cityName", "admin1Code", "postalCode", "countryCodeISO2A")`
 - Default `loc_id_col`: `locationId`

You can also set `geocode_log_file` to `None` if you don't need it.
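
Before geocoding, it can help to see which of the default columns your input actually has. The sketch below uses a hypothetical helper, `missing_columns`. Whether the geocoder requires every address column is not stated here, so treat the result as a hint about which parameters to override rather than as a validation rule.

```python
from typing import Iterable, List, Sequence

DEFAULT_ADDRESS_COLS = (
    "streetAddress", "streetAddress2", "cityName",
    "admin1Code", "postalCode", "countryCodeISO2A",
)
DEFAULT_LOC_ID_COL = "locationId"


def missing_columns(
    columns: Iterable[str],
    address_cols: Sequence[str] = DEFAULT_ADDRESS_COLS,
    loc_id_col: str = DEFAULT_LOC_ID_COL,
) -> List[str]:
    """List expected columns that are absent from `columns`."""
    # Strip whitespace, mirroring the column clean-up in "Prepare Input".
    present = {str(c).strip() for c in columns}
    return [c for c in (loc_id_col, *address_cols) if c not in present]


print(missing_columns(["locationId", " cityName ", "postalCode"]))
```

With a dataframe, pass `input_df.columns` as the first argument.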

In [None]:
input_df = pd.read_csv(input_url)
input_df.columns = input_df.columns.str.strip()
if do_geocode:
    geocode_log_file = output_url.replace(".csv", "_geocode_status.csv")
    input_df = geocoding.geocode_dataframe(input_df, log_file=geocode_log_file) # insert address_cols, loc_id_col params if needed
if validate_input:
    validate_query_input(input_df)
input_df = input_df.sort_values(by="locationId")
print(input_df.info())
print(f"Provider type: {provider_type}")

## Launch JCS Query

In [None]:
# runtime query executor
# this will block until batch is done
util._launch_query_jcs(
    input_df=input_df,
    selector=selector,
    dry_run=dry_run,
    source_url=source_url, 
    batch_settings=batch_settings,
)

## Launch JCS DEM Query

In [None]:
# TODO: Future work

## Wait for batch

Do not execute this cell unless you are using "Run All Cells". All it does is wait 25 minutes, which is a safe amount of time for batch to finish *if* everything runs smoothly.

In [None]:
# this shouldn't be needed anymore since the query has its own batch waiter

# if provider_type == 'batch':
#     time.sleep(60 * 25)
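
As an alternative to a fixed sleep, the AWS Batch queue can be inspected directly. The sketch below counts jobs that are not yet in a terminal state using the boto3 Batch client's `list_jobs` call. It rests on assumptions: that the query's jobs are visible in the queue named in `batch_settings`, and that jobs from other users in the same queue (which would also be counted) are acceptable noise. `count_active_jobs` is a hypothetical helper.

```python
# Non-terminal AWS Batch job states (terminal ones are SUCCEEDED and FAILED).
ACTIVE_STATUSES = ("SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING")


def count_active_jobs(batch_client, job_queue: str) -> int:
    """Count jobs in `job_queue` that have not yet succeeded or failed."""
    total = 0
    for status in ACTIVE_STATUSES:
        kwargs = {"jobQueue": job_queue, "jobStatus": status}
        while True:
            page = batch_client.list_jobs(**kwargs)
            total += len(page.get("jobSummaryList", []))
            token = page.get("nextToken")
            if not token:
                break
            kwargs["nextToken"] = token  # fetch the next page of results
    return total


# Usage (requires AWS credentials, so left commented out):
# import boto3
# batch = boto3.client("batch", region_name="us-east-1")
# print(count_active_jobs(batch, "csg-global-prod-validation"))
```

A count of zero suggests, but does not prove, that the query's jobs have finished; failed jobs also leave the active states.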

## Collect query results

Wait for the query to finish before executing this cell. You cannot tell when a batch query is finished using just this notebook, although it usually finishes in less than 15 minutes regardless of size.

In [None]:
df = util._launch_collect(
    work_bucket=output_bucket,
    work_prefix=work_prefix,
    selector=selector,
    csg_version=metric_set,
    data_source="jcs",
)

# TODO: Future work - collect JCS DEM query results

## Optional - load existing raw file

If the collect query section was previously successful and you'd like to load an existing raw file, uncomment and use this section after having executed the "Prepare Input" section. Do NOT uncomment this code if you are performing a query under a unique `output_s3_subkey` for the first time.

In [None]:
# df = pd.read_csv(f"s3://{output_bucket}/{work_prefix}all.csv")
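
Before reloading, you may want to confirm that the raw file exists under the current `output_s3_subkey`. A minimal sketch using the boto3 S3 client's `list_objects_v2` call follows; `raw_file_exists` is a hypothetical helper, and it assumes your credentials can list the output bucket.

```python
def raw_file_exists(s3_client, bucket: str, key: str) -> bool:
    """Return True if an object with exactly this key exists in the bucket."""
    # An exact-match key sorts first among keys sharing it as a prefix,
    # so a single result is enough to decide.
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in resp.get("Contents", []))


# Usage (requires AWS credentials, so left commented out):
# import boto3
# s3 = boto3.client("s3")
# print(raw_file_exists(s3, output_bucket, f"{work_prefix}all.csv"))
```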

## Optional - check for missing records

If there are no points missing, this is fast.

Otherwise, it may take some time depending on how much has to be checked. Better optimization is a TODO.

In [None]:
# TODO: Future work
if check_for_missing_data:
    util._check_missing(
        df, input_df, selector, work_area_url, output_bucket, work_loc,
        work_prefix, missing_record_retries, dry_run,
        ignore_missing_record_errors, source_url, batch_settings, csg_version,
    )
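
For a quick manual look at which points are missing, the location IDs in the input can be compared with those in the results. This sketch is not what `util._check_missing` does internally; `find_missing_ids` is a hypothetical helper, and it assumes the collected results keep a `locationId` column.

```python
from typing import Hashable, Iterable, List


def find_missing_ids(
    input_ids: Iterable[Hashable],
    result_ids: Iterable[Hashable],
) -> List[Hashable]:
    """Return input IDs absent from the results, in input order, once each."""
    seen = set(result_ids)
    missing = []
    for loc_id in input_ids:
        if loc_id not in seen:
            missing.append(loc_id)
            seen.add(loc_id)  # avoid reporting a duplicated input ID twice
    return missing


print(find_missing_ids([1, 2, 3, 4], [2, 4, 4]))  # [1, 3]
```

With dataframes, this could be called as `find_missing_ids(input_df["locationId"], df["locationId"])`, under the column assumption above.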

## Postprocess

In [None]:
df = postprocessing_jcs.postprocess_query(df, input_df)

## Extra User Postprocessing

This blank cell is for any additional code you'd like to execute.

## Save

Saving can take a while if the dataframe is large.

In [None]:
df.to_csv(output_url, index=False)