
## Debugging Errors

Local development is generally the easiest way to test and exercise functions. Still, things can go wrong in the hosted model for any number of reasons. The logs from the function invocations for feature transformation and model training are recorded and available for inspection. Errors at prediction time are also reported, but a little differently than for long-running jobs.

In this section we will review error handling and reporting.


### Abacus.AI Setup

1. Install the Abacus.AI library.

In [None]:
!pip install abacusai

Collecting abacusai
  Downloading abacusai-0.34.3.tar.gz (3.3 MB)
Collecting fastavro
  Downloading fastavro-1.4.9-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
Building wheels for collected packages: abacusai
  Building wheel for abacusai (setup.py) ... done
  Created wheel for abacusai: filename=abacusai-0.34.3-py3-none-any.whl size=118020 sha256=53f60e69ac59cb97696c4f85c0bbd3b5851bc4cd28d2e8724499417d11e16717
  Stored in directory: /root/.cache/pip/wheels/19/af/9c/961d3284bf3ccc4513c7150c6da17c38edb6300b81c05df8d3
Successfully built abacusai
Installing collected packages: fastavro, abacusai
Successfully installed abacusai-0.34.3 fastavro-1.4.9


2. Add your Abacus.AI [API Key](https://abacus.ai/app/profile/apikey) generated using the API dashboard as follows:

In [None]:
#@title Abacus.AI API Key

api_key = 'YOUR_API_KEY'  #@param {type: "string"}

3. Import the Abacus.AI library and instantiate a client.

In [None]:
from abacusai import ApiClient, ApiException
client = ApiClient(api_key)

## 1. Create a Project



In this notebook, we're going to see how to use Python to customize models in Abacus.AI. We will cover custom data transforms, model training and prediction handling. Projects that will host a custom model need to be created with the `PYTHON_MODEL` use case. Note that custom Python data transforms can be used in any kind of project and, like any other feature group, can be shared across projects. However, custom training algorithms and prediction functions are enabled by this use case.

In [None]:
project = client.create_project(name='Debugging Python Models', use_case='PYTHON_MODEL')

### Add the datasets to Abacus.AI


Using the Create Dataset API, we can tell Abacus.AI the public S3 URI of where to find the datasets.



In [None]:
# if the dataset already exists, skip creation
try: 
  concrete_dataset = client.describe_dataset(client.describe_feature_group_by_table_name('concrete_strength').dataset_id)
except ApiException: # dataset not found
  concrete_dataset = client.create_dataset_from_file_connector(
      name='Concrete Strength',
      table_name='concrete_strength',
      location='s3://abacusai.exampledatasets/predicting/concrete_measurements.csv')
  concrete_dataset.wait_for_inspection()

### Load the dataset so we can build and test the transform.

Most of the time it is easiest to develop custom transformations on your local machine. It makes iteration, inspection and debugging easier, and often you can do it directly in a notebook environment. To enable simple local development, you can use the Abacus.AI client to load your dataset as a pandas dataframe. This tends to work well if your dataset is under `100MB`, but for much larger datasets you will likely want to construct a sampled feature group for development.

Here we are working with a fairly small dataset, so we can easily load it into memory. The first block fetches the feature group corresponding to the dataset (datasets are used to move data into Abacus.AI; feature groups are used to consume data for various operations). It initiates a materialization of the feature group to generate a snapshot, waits for it to be ready, and then loads it as a pandas dataframe.

In [None]:
concrete_feature_group = concrete_dataset.describe_feature_group()
if not concrete_feature_group.list_versions():
  concrete_feature_group.create_version()
concrete_feature_group.wait_for_materialization()

concrete_df = concrete_feature_group.load_as_pandas()
concrete_df[:10]

Unnamed: 0,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28.0,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28.0,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270.0,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365.0,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360.0,44.3
5,266.0,114.0,0.0,228.0,0.0,932.0,670.0,90.0,47.03
6,380.0,95.0,0.0,228.0,0.0,932.0,594.0,365.0,43.7
7,380.0,95.0,0.0,228.0,0.0,932.0,594.0,28.0,36.45
8,266.0,114.0,0.0,228.0,0.0,932.0,670.0,28.0,45.85
9,475.0,0.0,0.0,228.0,0.0,932.0,594.0,28.0,39.29


#### Custom Data Transform

Now we will set up a transformation that can fail depending on the contents of the input dataset.

In [None]:
def transform_concrete(concrete_dataset):
  import pandas as pd
  import numpy as np
  import logging, sys
  logging.basicConfig(stream=sys.stdout, level=logging.INFO)
  feature_df = concrete_dataset.drop(['flyash'], axis=1)
  no_flyash = feature_df[concrete_dataset.flyash == 0.0]
  flyash = feature_df[concrete_dataset.flyash > 0.0]
  mean_df = no_flyash.mean()
  logging.info(mean_df)
  if np.any(mean_df < 10):  # compare element-wise first, then reduce
    raise ValueError('Was expecting all means to be greater than 10.')
  return pd.concat([no_flyash - no_flyash.assign(age=0).mean(), flyash - flyash.assign(age=0).mean()])
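
A threshold check like the one in `transform_concrete` is sensitive to where the comparison is placed. The following is a minimal sketch with made-up numbers (not the dataset's actual means) showing how `np.any(values) < 10` and `np.any(values < 10)` differ; the variable names are only illustrative.

```python
import numpy as np

# Illustrative values only; one of them is below the threshold of 10.
means = np.array([314.0, 100.1, 4.06, 973.4])

# np.any(means) collapses the array to a single bool first. True behaves
# like 1 in a comparison, so "True < 10" holds no matter what the data is.
always_true = np.any(means) < 10

# Comparing element-wise first asks the intended question:
# "is any individual mean below 10?"
any_below = np.any(means < 10)

# With every value above the threshold, only the element-wise form passes.
large_means = np.array([314.0, 100.1, 40.6, 973.4])
misplaced = np.any(large_means) < 10      # still True
elementwise = np.any(large_means < 10)    # False

print(bool(always_true), bool(any_below), bool(misplaced), bool(elementwise))
```

Running a check like this locally on a small array is usually quicker than waiting for a remote materialization to fail.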

### Running on Abacus.AI

Let's see if we can use this to generate a feature group for training.

In [None]:
try:
  concrete_flyash = client.create_feature_group_from_function(
      table_name='concrete_with_positive_means',
      function_source_code=transform_concrete,
      function_name='transform_concrete',
      input_feature_groups=['concrete_strength'])
except ApiException:  # feature group already exists
  concrete_flyash = client.describe_feature_group_by_table_name('concrete_with_positive_means')
if not concrete_flyash.latest_feature_group_version:
  concrete_flyash.create_version()
concrete_flyash.wait_for_materialization()
concrete_by_flyash_df = concrete_flyash.load_as_pandas()

ApiException: ignored

In [None]:
concrete_flyash.latest_feature_group_version.get_materialization_logs(stdout=True, stderr=True)

[FunctionLogs(function='transform_concrete',
   stats={'start': '2021-11-16T21:36:30.033503', 'end': '2021-11-16T21:36:30.037220'},
   stdout='''
       cement              314.037809
       slag                100.110247
       water               186.616784
       superplasticizer      4.055654
       coarseaggregate     973.357420
       fineaggregate       764.853004
       age                  55.040636
       csMPa                36.771784
       dtype: float64
 ''',
   stderr='',
   exception=UserException(type='ValueError',
   value='Was expecting all means to be greater than 10.',
   traceback='''
       Traceback (most recent call last):
       line 10, in transform_concrete
           raise ValueError('Was expecting all means to be greater than 10.')
       ValueError: Was expecting all means to be greater than 10.
 '''))]
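
When several functions are involved, it can help to condense log entries like the one above into one line per function. The sketch below is not part of the Abacus.AI client: `FakeFunctionLogs` and `FakeUserException` are hypothetical stand-ins that only mirror the field names visible in the printed output, and `summarize_logs` is an illustrative helper that assumes the real objects expose those fields as attributes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeUserException:
    # Hypothetical stand-in mirroring the printed UserException fields.
    type: str
    value: str
    traceback: str


@dataclass
class FakeFunctionLogs:
    # Hypothetical stand-in mirroring the printed FunctionLogs fields.
    function: str
    stdout: str = ''
    stderr: str = ''
    exception: Optional[FakeUserException] = None


def summarize_logs(logs) -> List[str]:
    """Return one summary line per log entry: 'name: ok' or 'name: Type: message'."""
    lines = []
    for entry in logs:
        exc = getattr(entry, 'exception', None)
        if exc is None:
            lines.append(f'{entry.function}: ok')
        else:
            lines.append(f'{entry.function}: {exc.type}: {exc.value}')
    return lines


example_logs = [
    FakeFunctionLogs(
        function='transform_concrete',
        exception=FakeUserException(
            type='ValueError',
            value='Was expecting all means to be greater than 10.',
            traceback='Traceback (most recent call last): ...')),
    FakeFunctionLogs(function='another_function'),
]
summary = summarize_logs(example_logs)
print('\n'.join(summary))
```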

### Model Training Logs

The same thing applies to model training. In fact, for training runs it can be useful to look at the logs even when the job is successful.

In [None]:
def train(training_dataset):
  # set the seed for reproducible results
  import numpy as np
  import logging, sys
  logging.basicConfig(stream=sys.stdout, level=logging.INFO)
  np.random.seed(5)

  X = training_dataset.drop(['csMPa'], axis=1)
  logging.info(X.mean())
  y = training_dataset.csMPa
  from sklearn.preprocessing import QuantileTransformer
  from sklearn.linear_model import LinearRegression
  qt = QuantileTransformer(n_quantiles=20)
  recent_model = LinearRegression()
  _ = recent_model.fit(qt.fit_transform(X.values), y)
  logging.info(qt.quantiles_)
  model_r2 = recent_model.score(X.values, y)
  logging.info(f'Linear model R^2 = {model_r2}')
  if model_r2 < 0.50:
    raise RuntimeError('Could not get a model with sufficient accuracy')
  return (X.columns, qt, recent_model)

### Prediction Function

We will stick with a working predict function but one that is not very robust.

In [None]:
def predict(model, query):
  columns, qt, recent_model = model
  import pandas as pd
  X = pd.DataFrame({c: [query[c]] for c in columns})
  y = recent_model.predict(qt.transform(X.values))[0]
  return {'csMPa': y}

### Training Errors and Logs

In [None]:
model = client.create_model_from_functions(project_id=project, 
                                   train_function=train, 
                                   predict_function=predict, 
                                   training_input_tables=['concrete_strength'])
model.wait_for_training()
print(model.latest_model_version.get_training_logs(stdout=True, stderr=True)[0])

FunctionLogs(function='train',
  stats={'start': '2022-02-24T23:31:21.789026', 'end': '2022-02-24T23:31:22.415849'},
  stdout='''
      INFO:root:cement              281.167864
      slag                 73.895825
      flyash               54.188350
      water               181.567282
      superplasticizer      6.204660
      coarseaggregate     972.918932
      fineaggregate       773.580485
      age                  45.662136
      dtype: float64
      INFO:root:[[1.02000000e+02 0.00000000e+00 0.00000000e+00 1.21800000e+02
        0.00000000e+00 8.01000000e+02 5.94000000e+02 1.00000000e+00]
       [1.44000000e+02 0.00000000e+00 0.00000000e+00 1.46131579e+02
        0.00000000e+00 8.45284211e+02 6.13000000e+02 3.00000000e+00]
       [1.54800000e+02 0.00000000e+00 0.00000000e+00 1.55052632e+02
        0.00000000e+00 8.54315789e+02 6.70000000e+02 3.00000000e+00]
       [1.66100000e+02 0.00000000e+00 0.00000000e+00 1.59000000e+02
        0.00000000e+00 8.84900000e+02 6.93310526e

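The `train` function fits the linear model on the quantile-transformed inputs but scores it on the raw `X.values`. Let's fix up the error by scoring on `qt.transform(X.values)` and try again.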

In [None]:
def train_fixed(training_dataset):
  # set the seed for reproducible results
  import numpy as np
  import logging, sys
  logging.basicConfig(stream=sys.stdout, level=logging.INFO)
  np.random.seed(5)

  X = training_dataset.drop(['csMPa'], axis=1)
  logging.info(X.mean())
  y = training_dataset.csMPa
  from sklearn.preprocessing import QuantileTransformer
  from sklearn.linear_model import LinearRegression
  qt = QuantileTransformer(n_quantiles=20)
  recent_model = LinearRegression()
  _ = recent_model.fit(qt.fit_transform(X.values), y)
  logging.info(qt.quantiles_)
  model_r2 = recent_model.score(qt.transform(X.values), y)
  logging.info(f'Linear model R^2 = {model_r2}')
  if model_r2 < 0.50:
    raise RuntimeError('Could not get a model with sufficient accuracy')
  return (X.columns, qt, recent_model)
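
To see why the representation used for scoring matters, here is a small NumPy-only sketch. It does not use the notebook's data or scikit-learn: `scale` is a hypothetical stand-in for a fitted transformer, and `fit_linear` and `r2_score` are illustrative helpers written for this example. A model fit on transformed features scores well on transformed features and very poorly on raw ones.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; the intercept is the last coefficient."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r2_score(coef, X, y):
    """Coefficient of determination of the fitted line on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    residual = y - A @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)


def scale(X):
    # Hypothetical stand-in for a fitted transformer (e.g. a quantile transform).
    return X / 1000.0


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1000.0, size=(200, 3))            # synthetic raw features
y = scale(X) @ np.array([1.0, 2.0, 3.0]) + 0.5 + rng.normal(0.0, 0.01, size=200)

coef = fit_linear(scale(X), y)                 # trained on transformed features
r2_transformed = r2_score(coef, scale(X), y)   # scored in the same representation
r2_raw = r2_score(coef, X, y)                  # scored on mismatched raw features

print(f'R^2 on transformed inputs: {r2_transformed:.4f}')
print(f'R^2 on raw inputs:         {r2_raw:.1f}')
```

The same rule applies at prediction time: whatever transformation was applied before fitting has to be applied before calling the model.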

In [None]:
model = client.create_model_from_functions(project_id=project, 
                                   train_function=train_fixed, 
                                   predict_function=predict, 
                                   training_input_tables=['concrete_strength'])
model.wait_for_training()
print(model.latest_model_version.get_training_logs(stdout=True, stderr=True)[0])

FunctionLogs(function='train_fixed',
  stats={'start': '2022-02-24T23:37:23.094949', 'end': '2022-02-24T23:37:23.706260'},
  stdout='''
      INFO:root:cement              281.167864
      slag                 73.895825
      flyash               54.188350
      water               181.567282
      superplasticizer      6.204660
      coarseaggregate     972.918932
      fineaggregate       773.580485
      age                  45.662136
      dtype: float64
      INFO:root:[[1.02000000e+02 0.00000000e+00 0.00000000e+00 1.21800000e+02
        0.00000000e+00 8.01000000e+02 5.94000000e+02 1.00000000e+00]
       [1.44000000e+02 0.00000000e+00 0.00000000e+00 1.46131579e+02
        0.00000000e+00 8.45284211e+02 6.13000000e+02 3.00000000e+00]
       [1.54800000e+02 0.00000000e+00 0.00000000e+00 1.55052632e+02
        0.00000000e+00 8.54315789e+02 6.70000000e+02 3.00000000e+00]
       [1.66100000e+02 0.00000000e+00 0.00000000e+00 1.59000000e+02
        0.00000000e+00 8.84900000e+02 6.933

In [None]:
deployment_token = client.create_deployment_token(project).deployment_token
deployment = client.create_deployment(model_id=model)
deployment.wait_for_deployment()

Now we can run predictions on Abacus.AI and compare them against the measured `csMPa` values in the dataset.

In [None]:
# remotely trained
for _, r in concrete_df[:5].iterrows():
  print(client.predict(deployment_token, deployment.deployment_id, r.to_dict()), r['csMPa'])

{'csMPa': 51.50365071691952} 79.99
{'csMPa': 51.47965282255592} 61.89
{'csMPa': 56.04384149462997} 40.27
{'csMPa': 57.02361874408969} 41.05
{'csMPa': 43.71257572519326} 44.3


### Prediction Errors

As mentioned, prediction errors are handled a little differently. Rather than having to fetch logs for errors generated during prediction, the exception is returned in the response. The response will contain an `error` key and a `traceback` of the exception that occurred.

The prediction function is not resilient to missing inputs, so passing in a request without one of the inputs (here, `age`) causes an error.

In [None]:
# remotely trained
for _, r in concrete_df.drop('age', axis=1)[:5].iterrows():
  print(client.predict(deployment_token, deployment.deployment_id, r.to_dict()), r['csMPa'])

{'error': "'age'", 'traceback': 'Traceback (most recent call last):\nline 25, in predict\n    X = pd.DataFrame({c: [query[c]] for c in columns})\n  File "/usercode/__source_cddb6379cc.py", line 25, in <dictcomp>\n    X = pd.DataFrame({c: [query[c]] for c in columns})\nKeyError: \'age\'\n'} 79.99
{'error': "'age'", 'traceback': 'Traceback (most recent call last):\nline 25, in predict\n    X = pd.DataFrame({c: [query[c]] for c in columns})\n  File "/usercode/__source_cddb6379cc.py", line 25, in <dictcomp>\n    X = pd.DataFrame({c: [query[c]] for c in columns})\nKeyError: \'age\'\n'} 61.89
{'error': "'age'", 'traceback': 'Traceback (most recent call last):\nline 25, in predict\n    X = pd.DataFrame({c: [query[c]] for c in columns})\n  File "/usercode/__source_cddb6379cc.py", line 25, in <dictcomp>\n    X = pd.DataFrame({c: [query[c]] for c in columns})\nKeyError: \'age\'\n'} 40.27
{'error': "'age'", 'traceback': 'Traceback (most recent call last):\nline 25, in predict\n    X = pd.DataFram
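
Because failures come back inside the response rather than as raised exceptions, callers may want to check for them explicitly. The following is a minimal sketch, not part of the Abacus.AI client: `missing_inputs`, `unwrap_prediction` and `PredictionError` are hypothetical helpers, and the only assumption about the response is the `error` and `traceback` keys shown in the output above.

```python
def missing_inputs(columns, query):
    """Return the required columns that are absent from the query dict."""
    return [c for c in columns if c not in query]


class PredictionError(RuntimeError):
    """Raised when a prediction response carries an `error` key."""


def unwrap_prediction(response):
    """Return the response unchanged, or raise if it reports an error."""
    if 'error' in response:
        detail = response.get('traceback', '')
        raise PredictionError(f"{response['error']}\n{detail}")
    return response


# Check a query before sending it (column names here are a shortened example).
columns = ['cement', 'water', 'age']
query = {'cement': 540.0, 'water': 162.0}
absent = missing_inputs(columns, query)
print('missing:', absent)

# A successful response passes through untouched.
ok = unwrap_prediction({'csMPa': 51.5})

# A response shaped like the error output above is turned into an exception.
try:
    unwrap_prediction({'error': "'age'", 'traceback': "KeyError: 'age'"})
    failed = False
except PredictionError as exc:
    failed = True
    print('prediction failed:', exc)
```

Checking the query up front avoids a round trip, while unwrapping the response catches errors that only surface inside the hosted function.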

## More Complex Scripts

When the model training algorithm becomes more complicated, it is unrealistic to fit it all into a single function. In that case you can supply a script file as the code resource. Below is a simple recipe for shipping code from GitHub to Abacus.AI for training.

In [None]:
!pip install PyGithub

In [None]:
from github import Github
gh = Github()
repo = gh.get_repo('abacusai/api-python')
print(repo.get_contents('examples/fullscript.py').decoded_content.decode())

from sys import stderr
import pandas as pd
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression


def transform_concrete(concrete_dataset):
  feature_df = concrete_dataset.drop(['flyash'], axis=1)
  no_flyash = feature_df[concrete_dataset.flyash == 0.0]
  flyash = feature_df[concrete_dataset.flyash > 0.0]
  mean_df = no_flyash.mean()
  print(mean_df)
  return pd.concat([no_flyash - no_flyash.assign(age=0).mean(), flyash - flyash.assign(age=0).mean()])


def to_quantiles(X):
  qt = QuantileTransformer(n_quantiles=20)
  X_q = qt.fit_transform(X.values)
  print(qt.quantiles_)
  return qt, X_q

def train_model(training_dataset):
  np.random.seed(5)

  X = training_dataset.drop(['csMPa'], axis=1)
  print(X.mean())
  y = training_dataset.csMPa
  qt, X_q = to_quantiles(X)

  recent_model = LinearRegression()
  fit_result = recent_model.fit(X_q, y)
  print(fit_result)
  model_r2 = recent_model.score(qt.transform(X.values), y
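
Before shipping a script, it can be worth confirming locally that it parses and defines the functions you intend to name as entry points. The sketch below uses only the standard-library `ast` module; `top_level_functions` and `check_entry_points` are hypothetical helpers, and `example_source` is a made-up script rather than the contents of `fullscript.py`.

```python
import ast


def top_level_functions(source):
    """Return the names of functions defined at module level in `source`."""
    tree = ast.parse(source)  # raises SyntaxError if the script does not parse
    return {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}


def check_entry_points(source, required):
    """Return the required function names that the script does not define."""
    return sorted(set(required) - top_level_functions(source))


# Made-up script for illustration: it defines a train function but no predict.
example_source = '''
def train_model(training_dataset):
    return None

def helper(x):
    return x
'''

missing = check_entry_points(example_source, ['train_model', 'predict'])
print('missing entry points:', missing)
```

An empty result means every named entry point exists. A `SyntaxError` from `ast.parse` would flag a truncated or otherwise broken script before it reaches a training job.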

In [None]:
model = client.create_model_from_python(
    project_id=project,
    function_source_code=repo.get_contents('examples/fullscript.py').decoded_content.decode(),
    train_function_name='train_model',
    predict_function_name='predict',
    training_input_tables=['concrete_strength'])
model.wait_for_training()
print(model.latest_model_version.get_training_logs(stdout=True, stderr=True)[0])

FunctionLogs(function='train_model',
  stats={'start': '2021-11-16T20:32:27.173756', 'end': '2021-11-16T20:32:27.566631'},
  stdout='''
      cement              281.167864
      slag                 73.895825
      flyash               54.188350
      water               181.567282
      superplasticizer      6.204660
      coarseaggregate     972.918932
      fineaggregate       773.580485
      age                  45.662136
      dtype: float64
      [[1.02000000e+02 0.00000000e+00 0.00000000e+00 1.21800000e+02
        0.00000000e+00 8.01000000e+02 5.94000000e+02 1.00000000e+00]
       [1.44000000e+02 0.00000000e+00 0.00000000e+00 1.46131579e+02
        0.00000000e+00 8.45284211e+02 6.13000000e+02 3.00000000e+00]
       [1.54800000e+02 0.00000000e+00 0.00000000e+00 1.55052632e+02
        0.00000000e+00 8.54315789e+02 6.70000000e+02 3.00000000e+00]
       [1.66100000e+02 0.00000000e+00 0.00000000e+00 1.59000000e+02
        0.00000000e+00 8.84900000e+02 6.93310526e+02 7.00000000