
# The new viewser and views_stepshift workflow

Hey!

We are rewriting ViEWS to make it easier and more pleasant to work with. We
now have a near-complete system in place for serving data from a remote
server, and we have re-packaged the modelling code from ViEWS 2.

Each component of ViEWS 3 will be accessible through its own package, to
encourage modular thinking. The packages containing the above-mentioned
functionality are `viewser` and `views-stepshift`, which are both published on
PyPI and can be installed with pip.

To install them, simply run

```
pip install viewser views-stepshift
```


# Getting data

In ViEWS 3, data is stored and processed centrally, in the views cloud. This
means that you don't have to download and process data yourself; it is
processed on a server. The views cloud is accessible through an idiomatic REST
API. The `viewser` Python package provides wrappers for this API, and data
models that make it easier to post new definitions.

Data is retrieved by referencing a named `Queryset`, which defines a set of
database columns and transformations at a level of analysis. Querysets can
easily be defined in Python code:


In [8]:

import json
from viewser.models import Queryset,Database,Transformed

gte_25 = Transformed(name="ops.gte",arguments=[25])

my_queryset = Queryset(
        name = "country_month_ged_tlag",
        loa = "country_month",
        themes = ["testing","conflict_history"],
        operations = [
                [
                    gte_25,
                    Database(name="priogrid_month.ged_best_ns",arguments=["sum"]),
                ],

                [
                    Transformed(name="lags.tlag",arguments=["1"]),
                    gte_25,
                    Database(name="priogrid_month.ged_best_ns",arguments=["sum"]),
                ],

                [
                    Transformed(name="lags.tlag",arguments=["2"]),
                    gte_25,
                    Database(name="priogrid_month.ged_best_ns",arguments=["sum"]),
                ],

                [
                    Transformed(name="lags.tlag",arguments=["3"]),
                    gte_25,
                    Database(name="priogrid_month.ged_best_ns",arguments=["sum"]),
                ],
            ]
    )
print(json.dumps(my_queryset.dict(),indent=4))

{
    "loa": "country_month",
    "name": "country_month_ged_tlag",
    "themes": [
        "testing",
        "conflict_history"
    ],
    "operations": [
        [
            {
                "namespace": "trf",
                "name": "ops.gte",
                "arguments": [
                    "25"
                ]
            },
            {
                "namespace": "base",
                "name": "priogrid_month.ged_best_ns",
                "arguments": [
                    "sum"
                ]
            }
        ],
        [
            {
                "namespace": "trf",
                "name": "lags.tlag",
                "arguments": [
                    "1"
                ]
            },
            {
                "namespace": "trf",
                "name": "ops.gte",
                "arguments": [
                    "25"
                ]
            },
            {
                "namespace": "base",
                "name": "priogrid_month.ged_


As you can see from the above definition, the queryset has a level of analysis
(loa), "country_month", a name, "country_month_ged_tlag", a list of themes, and
a list of operations. Each operation is a chain of one or more steps: fetching
raw data from the database and, optionally, transforming it in various ways. In
this simple example, the queryset contains an indicator derived from raw GED
data (via `ops.gte` with the argument 25), and time-lags of that indicator by
1, 2 and 3 time-units (months).
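
To make the structure of an operation chain concrete, here is a toy, pure-Python sketch. It assumes that each chain is listed outermost-first, so that the `Database` entry at the end is the source column and the transformations before it are applied from right to left; this reading is inferred from the definition above, not taken from the viewser documentation. The functions `gte`, `tlag` and `apply_chain` are hypothetical stand-ins that are not part of viewser, and the lag here works on a single series, ignoring the per-unit grouping that a real panel would need.

```python
def gte(values, threshold):
    """Return 1.0 where a value is >= threshold, else 0.0; None stays None."""
    return [None if v is None else float(v >= threshold) for v in values]


def tlag(values, lag):
    """Shift a single time series forward by `lag` steps, padding with None."""
    return ([None] * lag + list(values))[:len(values)]


def apply_chain(chain, source):
    """Apply a chain of (function, arguments) pairs listed outermost-first.

    Assumption: the last listed step is applied first, mirroring how the
    Database entry sits at the end of each chain in the queryset.
    """
    result = source
    for func, args in reversed(chain):
        result = func(result, *args)
    return result


# Made-up monthly sums for one unit, purely for illustration
monthly_sum = [0, 30, 10, 25, 40]

indicator = apply_chain([(gte, [25])], monthly_sum)
lagged = apply_chain([(tlag, [1]), (gte, [25])], monthly_sum)

print(indicator)  # [0.0, 1.0, 0.0, 1.0, 1.0]
print(lagged)     # [None, 0.0, 1.0, 0.0, 1.0]
```

The leading `None` in the lagged series is the kind of missing value that a lag introduces at the start of a series.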

To get this data, the queryset must first be _published_ to the ViEWS cloud.
This makes it available to other viewsers, and enables the system to cache its
retrieval operations, resulting in faster retrieval times:

In [9]:
from viewser.operations import publish
publish(my_queryset)

INFO:viewser.operations:Queryset named "country_month_ged_tlag" exists, updating



Now that the queryset is registered in the cloud, you can retrieve it using the fetch operation:

In [10]:

from viewser.operations import fetch

df = fetch("country_month_ged_tlag") # Notice that we are referring to the queryset by name here, not by object.
df.fillna(0, inplace=True) 
# NA handling is not yet implemented, so I'm just filling the DF here for this demonstration
df.describe()

| | `priogrid_month_ged_best_ns_sum_views_2_greater_or_equal_25` | `priogrid_month_ged_best_ns_sum_views_2_greater_or_equal_25_views_2_tlag_1` | `priogrid_month_ged_best_ns_sum_views_2_greater_or_equal_25_views_2_tlag_2` | `priogrid_month_ged_best_ns_sum_views_2_greater_or_equal_25_views_2_tlag_3` |
|---|---|---|---|---|
| count | 32340.0 | 32340.0 | 32340.0 | 32340.0 |
| mean | 0.025448 | 0.025448 | 0.025448 | 0.025448 |
| std | 0.157485 | 0.157485 | 0.157485 | 0.157485 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |



# Modelling

The data you retrieved follows the same schema as the data used in the previous
iteration of ViEWS. There is no change in the indexing scheme, which means that
it can be plugged into the modelling pipelines and procedures that you are
used to. For convenience, we have packaged the modelling code from ViEWS 2 in a
separate package, called `views_stepshift`. `views_stepshift` exposes the exact
same API as the model-app from ViEWS 2.

Note that, by convention, the name published to PyPI uses a dash, while the
name you import in your scripts must use an underscore. The package is
therefore installed under the name `views-stepshift` and imported under the
name `views_stepshift`.
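
The underscore is needed because a dash is not valid in a Python identifier, so it cannot appear in an `import` statement. A quick check using only the standard library (the variable names are just for illustration):

```python
dist_name = "views-stepshift"               # the name used with pip
import_name = dist_name.replace("-", "_")   # the name used with import

print(import_name)                  # views_stepshift
print(dist_name.isidentifier())     # False: a dash cannot be part of an identifier
print(import_name.isidentifier())   # True
```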

Centralized storage and retrieval of Periods is still missing, so Periods must
currently be defined in code. We will eventually implement publishing and
retrieval of Period-sets through the `viewser` publish function, as shown above
for Querysets.

The definitions below are the ones we will use for modelling. As you can see, they use the old ViEWS 2 API.

In [11]:
# Definitions

from views_stepshift import Period,Downsampling

periods = [
      Period(name="A",train_start=121,train_end=396,predict_start=397,predict_end=432),
      Period(name="B",train_start=121,train_end=432,predict_start=433,predict_end=468),
      Period(name="C",train_start=121,train_end=480,predict_start=483,predict_end=520),
   ]

steps = [1,12,24,36]

downsampling_10_pst_negative = Downsampling(share_positive = 1.0, share_negative = 0.1)

# We'll be using the time-lags of GED to predict GED
outcome = "priogrid_month_ged_best_ns_sum_views_2_greater_or_equal_25"
columns_features = set(df.columns).difference({outcome})

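Since Periods are defined by hand for now, it can be worth sanity-checking them before fitting. The sketch below uses a hypothetical `PeriodSpec` stand-in that only mirrors the four month fields passed to `Period` above; it is not part of `views_stepshift`. It checks that training precedes prediction and reports the distance from the last training month to the first predicted month, along with the length of the prediction window.

```python
from typing import NamedTuple


class PeriodSpec(NamedTuple):
    """Hypothetical stand-in mirroring the fields passed to Period above."""
    name: str
    train_start: int
    train_end: int
    predict_start: int
    predict_end: int


def summarize(p):
    """Return (gap, n_predict) for one period.

    gap is the number of months from train_end to predict_start,
    n_predict is the number of months in the prediction window.
    """
    assert p.train_start <= p.train_end < p.predict_start <= p.predict_end, p.name
    return p.predict_start - p.train_end, p.predict_end - p.predict_start + 1


specs = [
    PeriodSpec("A", 121, 396, 397, 432),
    PeriodSpec("B", 121, 432, 433, 468),
    PeriodSpec("C", 121, 480, 483, 520),
]

summary = {p.name: summarize(p) for p in specs}
for name, (gap, n_predict) in summary.items():
    print(name, gap, n_predict)
```

With the values above, periods A and B start predicting the month after training ends and cover 36 months, while period C starts three months after its training window and covers 38 months, which may be worth double-checking.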

We define our models, just like before, fit them, produce predictions from them, and get evaluation scores:

In [12]:
from views_stepshift import Model,Ensemble
from sklearn.ensemble import RandomForestClassifier

model = Model(
    name = "priogrid_month_conflict_history_nonstate",
    col_outcome = outcome,
    cols_features = columns_features,
    steps = steps,
    periods = periods,
    outcome_type = "prob",
    estimator = RandomForestClassifier(n_estimators = 100)
    )

onset_model = Model(
    name = "priogrid_month_conflict_history_nonstate_onset",
    col_outcome = outcome,
    cols_features = columns_features,
    steps = steps,
    periods = periods,
    outcome_type = "prob",
    estimator = RandomForestClassifier(n_estimators = 100),
    onset_outcome = True,
    onset_window = 24,
    )

downsampled_model = Model(
    name = "priogrid_month_conflict_history_nonstate_downsampled",
    col_outcome = outcome,
    cols_features = columns_features,
    steps = steps,
    periods = periods,
    outcome_type = "prob",
    downsampling = downsampling_10_pst_negative,
    estimator = RandomForestClassifier(n_estimators = 100)
    )

models = [model,onset_model,downsampled_model]

average_ensemble = Ensemble(
    models = models,
    name = "my_ensemble",
    outcome_type = "prob",
    col_outcome = outcome,
    method = "average",
    periods = periods
    )

In [13]:

for model in models:
    model.fit_estimators(df)

INFO:views_stepshift.api:Fitting estimators for priogrid_month_conflict_history_nonstate
INFO:views_stepshift.api:Fitting estimators for priogrid_month_conflict_history_nonstate_onset
INFO:views_stepshift.api:Fitting estimators for priogrid_month_conflict_history_nonstate_downsampled

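The `downsampled_model` is defined with `Downsampling(share_positive = 1.0, share_negative = 0.1)`. Assuming the two shares are the fractions of positive and negative rows kept for training (an interpretation of the parameter names, not confirmed against the package documentation), the idea can be sketched in plain Python. The `downsample` helper and the toy rows are hypothetical and do not reflect how `views_stepshift` implements it.

```python
import random


def downsample(rows, share_positive, share_negative, seed=0):
    """Keep a random share of positive rows and a random share of negative rows.

    rows is a list of (features, outcome) pairs where outcome is 0 or 1.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    kept = (
        rng.sample(positives, round(len(positives) * share_positive))
        + rng.sample(negatives, round(len(negatives) * share_negative))
    )
    rng.shuffle(kept)  # avoid leaving all positives at the front
    return kept


# Toy data: 50 positive rows and 1000 negative rows
rows = [((i,), 1) for i in range(50)] + [((i,), 0) for i in range(50, 1050)]

sample = downsample(rows, share_positive=1.0, share_negative=0.1)
n_pos = sum(outcome for _, outcome in sample)
print(n_pos, len(sample) - n_pos)  # 50 100
```

Under this reading, all positive rows are kept while only a tenth of the negative rows are, which raises the share of positives in the training data from about 5% to about 33% in this toy case.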

In [None]:
from views_stepshift.datautils import assign_into_df

for model in models:
    df = assign_into_df(df_from=model.predict(df),df_to=df)

In [None]:
for model in models:
    model.evaluate(df)
    model.extras._get_feature_importances()
    model.extras._get_permutation_importances(df)

In [None]:
import pandas as pd

def only_dicts(d):
    """Keep only the entries of d whose values are dicts."""
    return {k: v for k, v in d.items() if isinstance(v, dict)}

# Flatten the nested scores (period -> step -> score type) into one row per score set
data = []
for model in models:
    for period_name, period_scores in model.scores.items():
        for step_name, step_scores in only_dicts(period_scores).items():
            for score_type, scores in only_dicts(step_scores).items():
                # Build a new dict so that model.scores is not modified in place
                data.append({
                    **scores,
                    "score_type": score_type,
                    "step_name": step_name,
                    "period_name": period_name,
                    "model_name": model.name,
                })
                
score_data = pd.DataFrame(data)
score_data.describe()