# Introduction to Praxis

Praxis is a tool that helps machine learning teams reason about data that evolves over time. 📈

### Features

* Intuitive interface that quickly visualizes time-series (as pandas DataFrames) and their decompositions
* Easily adds covariates to your target series
* Performs train-test split and exports corresponding DataFrames
* Trains baseline statistical and ML models, investigates residuals and performs benchmarking
* Finds leading indicators


<p style="text-align: center;"><b>To ensure that the DataFrame represents a time-series, we currently require the DataFrame you load to include both an `id` and a `date` column.</b>
</p>

### Reporting Bugs / Providing Feedback

The product is still in alpha, so there may be bugs. We'd love to hear your bug reports and feedback :)

Feel free to join our Community Slack channel at https://join.slack.com/t/praxiscommunity/shared_invite/zt-1ef4vfje9-VCdEThDKIrYd0Z5ErVGo9A, or email engineering@praxispioneering.com.

![overview](https://drive.google.com/uc?id=1Y7iXbh7x6ffYgu6M9rp1gwxjohH49TB5)


## Installing via Pip

You may save the wheel for installation on your own platforms (hosted Jupyter notebooks, Vertex AI, localhost, etc.) 

Run the command below to install the package. This typically takes 3 to 5 minutes.

In [None]:
!pip install https://storage.googleapis.com/praxis-public/wheels/praxis_interface-0.0.8-py3-none-any.whl

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting praxis-interface==0.0.8
  Downloading https://storage.googleapis.com/praxis-public/wheels/praxis_interface-0.0.8-py3-none-any.whl (52 kB)
Collecting pytz==2022.1
  Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
Collecting pystan==2.19.1.1
  Downloading pystan-2.19.1.1-cp37-cp37m-manylinux1_x86_64.whl (67.3 MB)
Collecting pangres==4.1.1
  Downloading pangres-4.1.1.tar.gz (54 kB)
Collecting dash==2.4.1
  Downloading dash-2.4.1-py3-none-any.whl (9.8 MB)
Collecting torch==1.9.0
  Downloading torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)

* After seeing:
```
WARNING: The following packages were previously imported in this runtime:
  [pkg1, pkg2, ..]
```
you may want to restart the runtime; otherwise importing Praxis may not work.
* Pip may report an installation warning like:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
```
This has no effect on the success of the installation. You may proceed as usual.
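
If you are unsure whether the install succeeded (or whether a runtime restart is still pending), a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `is_installed` is an illustrative assumption and not part of Praxis.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    return find_spec(module_name) is not None


# `praxis_interface` is the import name used in the next section.
if is_installed("praxis_interface"):
    print("praxis_interface found; restart the runtime if pip asked you to.")
else:
    print("praxis_interface not found; re-run the pip install cell above.")
```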

## Importing the Interface

In [None]:
from praxis_interface import Praxis
import pandas as pd

  warn(f"Failed to load image Python extension: {e}")


Occasional errors related to YAML, torchvision, or Darts may appear. You may ignore them.


## Loading Data

The `Praxis` class accepts a pandas DataFrame as input and visualizes it. **To ensure that the DataFrame represents a time-series, we currently *require* it to include both an `id` and a `date` column.** All other columns are treated as either a target series or covariates.

In [None]:
df = pd.read_parquet('https://storage.googleapis.com/praxis-public/assets/covid-19-us-demo.parquet')
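
Before handing your own data to Praxis, it may be worth checking the column requirement yourself. The sketch below is one possible pre-flight check; the helper `check_time_series_frame` is hypothetical, and parsing `date` into a datetime dtype is an assumption about what is convenient rather than a documented Praxis requirement.

```python
import pandas as pd

REQUIRED_COLUMNS = ("id", "date")


def check_time_series_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Raise if required columns are missing; return a copy with `date` parsed."""
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"DataFrame is missing required column(s): {missing}")
    out = frame.copy()
    # Assumption: a datetime dtype is the most convenient form for `date`.
    out["date"] = pd.to_datetime(out["date"])
    return out


# Tiny made-up frame, for illustration only.
toy = pd.DataFrame(
    {
        "id": ["A", "A", "B"],
        "date": ["2021-01-01", "2021-01-02", "2021-01-01"],
        "target": [1.0, 2.0, 3.0],
    }
)
checked = check_time_series_frame(toy)
print(checked.dtypes)
```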

## Instantiating a Praxis Interface
After a DataFrame is loaded, you may launch a Praxis instance to help with your exploratory data analysis (EDA) and investigatory workflow. 

In [None]:
praxis = Praxis(df)

Overhead is kept low by DuckDB, a lightweight online analytical processing (OLAP) database that runs in memory. You can create any number of Praxis interfaces over any number of DataFrames.

There are two main ways to start a Praxis interface: inline or in an external window.

In [None]:
praxis.run(mode="external") # Running externally

Dash app running on:


<IPython.core.display.Javascript object>

In [None]:
praxis.run() # Running inline

If you are in a hosted environment, the proxy setup may be unreliable and prevent the interface from showing up. Your browser may also be configured to block the interface, for example through Enhanced Tracking Protection.

### Views

Within a single invocation of `.run`, Praxis records the visualization configurations you've tried, the baseline models you've trained on them, and their performance. These are represented as "views," uniquely characterized by the following components:

* ID of target series, or the regex to filter multi-series models
* Name of target series
* All past / future covariates
* Cutoff time-stamp
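
One way to picture the components listed above is as an immutable record that can serve as a dictionary key. The sketch below is purely illustrative: the `View` class and its field names are hypothetical and do not reflect how Praxis stores views internally.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class View:
    """Hypothetical record of the components that characterize a view."""

    series_id: str                     # ID of the target series, or a filter regex
    target: str                        # name of the target series (column)
    cutoff: datetime                   # cutoff time-stamp
    past_covs: Tuple[str, ...] = ()    # past covariates
    future_covs: Tuple[str, ...] = ()  # future covariates


a = View("US_CA", "new_deceased", datetime(2021, 1, 1),
         ("population_age_70_79",), ("new_persons_vaccinated",))
b = View("US_CA", "new_deceased", datetime(2021, 1, 1),
         ("population_age_70_79",), ("new_persons_vaccinated",))
c = View("US_CA", "new_deceased", datetime(2021, 3, 1),
         ("population_age_70_79",), ("new_persons_vaccinated",))

# Frozen dataclasses are hashable, so identical views collapse to one key,
# while changing any component (here, the cutoff) yields a distinct view.
results = {a: "first run", b: "second run", c: "later cutoff"}
print(len(results))
```

Tuples are used instead of lists so that every field is hashable.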

## Working with the Interface

Praxis is designed to be a flexible, performant way to investigate your time-series data, join and identify the most important covariates, and easily train and benchmark baseline models. Here are some examples of how you could work with the tool:



### Selecting Time-series and Columns

In [None]:
from IPython.display import IFrame

IFrame(src="https://www.loom.com/embed/fc2616d1c28a4fcc9e552064276673ca",width=800, height=400)

### Add Covariates and Change Cutoff

In [None]:
IFrame(src="https://www.loom.com/embed/12af08aaf7eb4feebb85b318a16b35a3",width=800, height=400)

### Find Leading Indicators

In [None]:
IFrame(src="https://www.loom.com/embed/2b7257d2c59b4fe3b32c52b4940bacc0",width=800, height=400)

### Train, Investigate and Compare Single-series Models (Statistical & ML)

In [None]:
IFrame(src="https://www.loom.com/embed/9da6846324dc4c719ae14e1bd7cc90ea",width=800, height=400)

### Train, Investigate and Compare Multi-series Models (Statistical & ML)


In [None]:
IFrame(src="https://www.loom.com/embed/985fc3e2f2fa4e27a8645bd8e213c3e8",width=800, height=400)

## Getting Results from the Interface

Praxis complements your existing workflow by visualizing inputs from a DataFrame, then returning preliminary results from your investigative interactions with the interface. 

Here are different types of information you could get from the interface: 

### Models
Note that the `['model']` field contains a Python Darts object that you can use elsewhere.

In [None]:
praxis.get_model() # Getting models trained on last view accessed

{'Prophet (singleseries)': {'mode': 'forecast',
  'start_time': Timestamp('2021-01-01 00:00:00'),
  'target_column': 'new_deceased',
  'model': <darts.models.forecasting.prophet_model.Prophet at 0x7f063a366310>,
  'model_selection': 'Prophet (singleseries)',
  'components': ['population_age_70_79', 'new_persons_vaccinated'],
  'forecast_length': 20,
  'forecast_samples': 1,
  'disable_future_cov': False}}
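
Because the result is a plain nested dictionary, it can be walked with ordinary Python. The following sketch summarizes each entry; `summarize_models` is a hypothetical helper, and the `example` dictionary is a stand-in that copies a few keys from the output above (with a placeholder `object()` where the Darts model would be).

```python
from typing import Any, Dict, List


def summarize_models(models: Dict[str, Dict[str, Any]]) -> List[str]:
    """Build one human-readable line per trained model."""
    lines = []
    for name, info in models.items():
        covariates = ", ".join(info.get("components", [])) or "none"
        lines.append(
            f"{name}: target={info.get('target_column')}, "
            f"horizon={info.get('forecast_length')}, covariates={covariates}"
        )
    return lines


# Stand-in for the dictionary returned by `praxis.get_model()`.
example = {
    "Prophet (singleseries)": {
        "target_column": "new_deceased",
        "model": object(),  # placeholder for the Darts model object
        "components": ["population_age_70_79", "new_persons_vaccinated"],
        "forecast_length": 20,
    }
}
summary = summarize_models(example)
for line in summary:
    print(line)
```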

In [None]:
praxis.get_model(return_all=True) # Getting models trained on all views

{'US_IL@new_deceased@new_persons_fully_vaccinated/current_intensive_care_patients/search_trends_diarrhea@@@2021-01-01T00:00:00.000000000': {'Prophet (singleseries)': {'mode': 'forecast',
   'start_time': Timestamp('2021-01-01 00:00:00'),
   'target_column': 'new_deceased',
   'model': <darts.models.forecasting.prophet_model.Prophet at 0x7f064261bb90>,
   'model_selection': 'Prophet (singleseries)',
   'components': ['new_persons_fully_vaccinated',
    'current_intensive_care_patients',
    'search_trends_diarrhea'],
   'forecast_length': 20,
   'forecast_samples': 1,
   'disable_future_cov': False}},
 'US_IL@new_deceased@new_persons_fully_vaccinated/current_intensive_care_patients/search_trends_diarrhea@@@2021-03-01T00:00:00.000000000': {'Prophet (singleseries)': {'mode': 'forecast',
   'start_time': Timestamp('2021-03-01 00:00:00'),
   'target_column': 'new_deceased',
   'model': <darts.models.forecasting.prophet_model.Prophet at 0x7f063e93df90>,
   'model_selection': 'Prophet (single

### Forecasts

Forecasts are broken down by model, by percentile (for probabilistic inference), and by whether the values are predictions or residuals (actual - prediction). They are immediately available as pandas DataFrames, which you may incorporate into your training pipeline or even feed back into the interface for further analysis.

In [None]:
forecast = praxis.get_forecast()
forecast['prediction']['Prophet (singleseries)']['0.05']

Unnamed: 0,prediction
2021-01-01,198.128034
2021-01-02,196.214713
2021-01-03,163.387324
2021-01-04,154.366083
2021-01-05,194.928211
2021-01-06,219.353732
2021-01-07,221.2586
2021-01-08,208.887665
2021-01-09,206.974343
2021-01-10,174.146955
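
To make the residual definition (actual - prediction) concrete, here is a small sketch in pandas. The first three predictions are copied from the output above; the actual values are made up for illustration and do not come from the demo dataset.

```python
import pandas as pd

# First three predictions copied from the output above.
prediction = pd.DataFrame(
    {"prediction": [198.128034, 196.214713, 163.387324]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

# Made-up actual values, for illustration only.
actual = pd.Series([210.0, 190.0, 170.0], index=prediction.index, name="actual")

# residual = actual - prediction, aligned on the shared date index
residual = (actual - prediction["prediction"]).rename("residual")
print(residual.round(2))
```

A positive residual means the model under-predicted on that date; a negative one means it over-predicted.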


### Metrics

Currently, we support calculating symmetric mean absolute percentage error (SMAPE) and root mean squared error (RMSE).

In [None]:
praxis.get_metrics()

{'Prophet (singleseries)': {'smape': 69.81, 'rmse': 250.29}}
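
For reference, the two metrics can be written out in a few lines. This is a sketch of one common definition of each (SMAPE as a percentage with `|a| + |p|` in the denominator); it is an assumption that Praxis uses exactly this variant, and the input numbers are made up.

```python
import math
from typing import Sequence


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Symmetric MAPE in percent: 100 * mean(2|a - p| / (|a| + |p|))."""
    terms = []
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        # Treat a 0/0 term as zero error (an assumption; conventions differ).
        terms.append(0.0 if denom == 0 else 2.0 * abs(a - p) / denom)
    return 100.0 * sum(terms) / len(terms)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: sqrt(mean((a - p)^2))."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values, for illustration only.
actual_values = [100.0, 200.0, 300.0]
predicted_values = [110.0, 180.0, 300.0]
smape_value = smape(actual_values, predicted_values)
rmse_value = rmse(actual_values, predicted_values)
print(round(smape_value, 2), round(rmse_value, 2))
```

SMAPE is scale-free and bounded, which makes it easier to compare across series, while RMSE is in the units of the target and penalizes large misses more heavily.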

### Last-run View

You can also export the last view you were working on for your records:

In [None]:
praxis.get_last_view()

{'id': 'US_CA',
 'target': 'new_deceased',
 'past_covs': ['population_age_70_79'],
 'future_covs': ['new_persons_vaccinated'],
 'discrete': [],
 'cutoff': Timestamp('2021-01-01 00:00:00'),
 'filter_regex': '.*'}
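
If you want to keep such a record on disk, the view needs a serializable form for the timestamp. The sketch below shows one way to do this with the standard library; `view_to_json` is a hypothetical helper, and the `view` dictionary is a stand-in that mirrors the output above with a `datetime` in place of the pandas `Timestamp`.

```python
import json
from datetime import datetime

# Stand-in for the dictionary returned by `praxis.get_last_view()`.
view = {
    "id": "US_CA",
    "target": "new_deceased",
    "past_covs": ["population_age_70_79"],
    "future_covs": ["new_persons_vaccinated"],
    "discrete": [],
    "cutoff": datetime(2021, 1, 1),
    "filter_regex": ".*",
}


def view_to_json(view_dict: dict) -> str:
    """Serialize a view; values with `isoformat` become ISO 8601 strings."""

    def encode(value):
        # datetime (and pandas Timestamp, a datetime subclass) has isoformat().
        if hasattr(value, "isoformat"):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(view_dict, default=encode, indent=2, sort_keys=True)


serialized = view_to_json(view)
print(serialized)
```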

### Train/test Split

To speed up prototyping, you can export the train/test split from the interface into separate DataFrames, which makes it easier to run them through your more complex training pipeline.

In [None]:
train, test = praxis.to_dataframe()
train

Unnamed: 0,id,date,new_deceased,population_age_70_79,new_persons_vaccinated
0,US_AK,2020-01-01,0.000000,35021.0,0.0
1,US_AK,2020-01-02,0.000000,35021.0,0.0
2,US_AK,2020-01-03,0.000000,35021.0,0.0
3,US_AK,2020-01-04,0.000000,35021.0,0.0
4,US_AK,2020-01-05,0.000000,35021.0,0.0
...,...,...,...,...,...
20547,US_WY,2020-12-28,21.353542,44614.0,0.0
20548,US_WY,2020-12-29,7.117847,44614.0,0.0
20549,US_WY,2020-12-30,2.372616,44614.0,0.0
20550,US_WY,2020-12-31,22.790872,44614.0,0.0


In [None]:
test

Unnamed: 0,id,date,new_deceased,population_age_70_79,new_persons_vaccinated
0,US_AK,2021-01-02,2.171726,35021.0,0.000000
1,US_AK,2021-01-03,1.390575,35021.0,0.000000
2,US_AK,2021-01-04,0.463525,35021.0,0.000000
3,US_AK,2021-01-05,2.154508,35021.0,0.000000
4,US_AK,2021-01-06,1.384836,35021.0,0.000000
...,...,...,...,...,...
32979,US_WY,2022-08-09,6.000001,44614.0,0.000669
32980,US_WY,2022-08-10,2.000000,44614.0,0.000223
32981,US_WY,2022-08-11,0.666667,44614.0,152.666741
32982,US_WY,2022-08-12,0.222222,44614.0,50.888914
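
Conceptually, a cutoff-based split partitions rows by their `date`. The sketch below illustrates the idea in plain pandas; `split_at_cutoff` is a hypothetical helper, the rows are made up, and the boundary handling (strictly before the cutoff goes to train, everything else to test) is an assumption that may differ from what `praxis.to_dataframe()` does.

```python
import pandas as pd


def split_at_cutoff(frame: pd.DataFrame, cutoff: str):
    """Split rows into (train, test) around a cutoff on the `date` column."""
    dates = pd.to_datetime(frame["date"])
    ts = pd.Timestamp(cutoff)
    # Assumption: rows strictly before the cutoff are training data.
    train_part = frame[dates < ts].reset_index(drop=True)
    test_part = frame[dates >= ts].reset_index(drop=True)
    return train_part, test_part


# Made-up rows, for illustration only.
toy = pd.DataFrame(
    {
        "id": ["A", "A", "A", "B", "B", "B"],
        "date": ["2020-12-30", "2020-12-31", "2021-01-01"] * 2,
        "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
train_toy, test_toy = split_at_cutoff(toy, "2021-01-01")
print(len(train_toy), len(test_toy))
```

Splitting on time rather than at random keeps future observations out of the training set, which matters for honest forecasting benchmarks.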


## Tutorial: Finding Your First Data Driver in the Demo Dataset

Now that you are familiar with the basic interactive components of the interface, this tutorial will demonstrate how to use the tool to discover important drivers on our demo dataset.


### The Data

The demo dataset we are using is a subset of `bigquery-public-data.covid19_open_data.covid19_open_data` available on Google BigQuery, cleaned and formatted for use inside the Praxis interface. It contains daily time-series data related to COVID-19. You can find the list of sources available here: https://github.com/open-covid-19/data.

In [None]:
df # get a glimpse of the dataset.

Unnamed: 0,date,id,new_confirmed,new_deceased,cumulative_confirmed,cumulative_deceased,new_persons_vaccinated,cumulative_persons_vaccinated,new_persons_fully_vaccinated,cumulative_persons_fully_vaccinated,...,new_vaccine_doses_administered_pfizer,cumulative_vaccine_doses_administered_pfizer,new_persons_fully_vaccinated_moderna,cumulative_persons_fully_vaccinated_moderna,new_vaccine_doses_administered_moderna,cumulative_vaccine_doses_administered_moderna,new_persons_fully_vaccinated_janssen,cumulative_persons_fully_vaccinated_janssen,new_vaccine_doses_administered_janssen,cumulative_vaccine_doses_administered_janssen
0,2020-01-01,US_AK,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,2020-01-02,US_AK,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,2020-01-03,US_AK,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,2020-01-04,US_AK,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,2020-01-05,US_AK,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53531,2022-08-09,US_WY,767.727060,6.000001,172089.136470,1862.000000,0.000669,341423.018870,1.004531,298931.497735,...,2.457173,401370.771414,0.502280,121411.748860,2.972672,341851.513664,0.049405,26924.975297,0.079584,28970.960208
53532,2022-08-10,US_WY,255.909020,2.000000,172345.045490,1864.000000,0.000223,341423.006290,0.334844,298931.832578,...,0.819058,401371.590471,0.167427,121411.916287,0.990891,341852.504555,0.016468,26924.991766,0.026528,28970.986736
53533,2022-08-11,US_WY,85.303007,0.666667,172430.348497,1864.666667,152.666741,341575.668763,306.111615,299237.944193,...,314.939686,401686.530157,149.389142,121561.305429,258.330297,342110.834852,5.338823,26930.330589,4.675509,28975.662245
53534,2022-08-12,US_WY,28.434336,0.222222,57476.782832,621.555556,50.888914,341626.556254,102.037205,299339.981398,...,104.979895,401791.510052,49.796381,121611.101810,86.110099,342196.944951,1.779608,26932.110196,1.558503,28977.220748


In [None]:
praxis.run(mode="external") # run the interface again.

Let's say we'd like to predict the `new_deceased` column for the `US_CA` time-series, starting on 03/01/2021 and extending 20 days into the future. We'll use [CatBoost](https://catboost.ai/) as a baseline model:


![baseline](https://drive.google.com/uc?id=1C2idiBK1HIKc_9i01wFePfuhhD9wif8N)

We now have a 64.26% SMAPE, which is not great. We will now use the leading indicator search to find potential data drivers: 

![baseline](https://drive.google.com/uc?id=1_i3pV2wfb2GZzyrS6pUsbHdN1t5ftU54)

We've now found some interesting candidate leading indicators that may improve our results. We will pick the covariate that has the lowest f-value, `search_trends_hypoxia`, and retrain the model.
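
To give a feel for what a leading indicator is, the sketch below scans lags and reports the one at which a shifted candidate series correlates most strongly with the target. This is a generic lagged-correlation illustration on synthetic data, not the search or scoring that Praxis performs; `best_lead` is a hypothetical helper.

```python
import pandas as pd


def best_lead(target: pd.Series, candidate: pd.Series, max_lag: int = 7):
    """Return (lag, correlation) for the lag with the strongest correlation."""
    scores = {}
    for lag in range(1, max_lag + 1):
        # shift(lag) aligns candidate[t - lag] with target[t]
        score = target.corr(candidate.shift(lag))
        if pd.notna(score):
            scores[lag] = score
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]


# Synthetic data: the target repeats the candidate three steps later.
candidate = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 9.0, 6.0, 2.0, 4.0, 1.0])
target = candidate.shift(3).fillna(0.0)
lag, corr = best_lead(target, candidate, max_lag=5)
print(lag, round(corr, 3))
```

A covariate that leads the target by several steps is useful for forecasting because its recent values are already known when the prediction is made.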


![new-model](https://drive.google.com/uc?id=1zIRsYX7-Rh4xR4-tBQ2rWaBYLqvrYTPW)

As shown above, SMAPE drops by about 30 percentage points (64.26% -> 34.56%). Praxis makes this possible by surfacing core data drivers in your time-series dataset and by letting you iterate quickly with investigative baseline models to measure the quality of your features.
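
The size of the improvement can be expressed either in percentage points or relative to the baseline; the short calculation below works out both from the two SMAPE values quoted above.

```python
baseline_smape = 64.26  # SMAPE (%) of the baseline model
improved_smape = 34.56  # SMAPE (%) after adding the leading indicator

absolute_drop = baseline_smape - improved_smape         # in percentage points
relative_drop = 100.0 * absolute_drop / baseline_smape  # as a percent of baseline

print(f"absolute: {absolute_drop:.2f} points, relative: {relative_drop:.1f}%")
```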