# The API: Connecting to BigML

The BigML API offers an endpoint to create, get, update and delete every Machine Learning resource. <br/>
This API is accessible via HTTP and its general public domain is [bigml.io](https://bigml.io). <br/>
You will need some credentials that will be used to authenticate every request. <br/>
We recommend setting them as environment variables for your convenience.

In [None]:
# set your credentials as environment variables
%env BIGML_USERNAME=[your username]
%env BIGML_API_KEY=[your api key]

The API calls that you issue carry these credentials as an authentication token. For instance,

In [None]:
url = "https://bigml.io/source?username=[your username];api_key=[your api key];limit=1"
print(url)

In [None]:
!curl "https://bigml.io/source?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;limit=1"

lists the most recently uploaded file in your account.<br/>
Managing resources with these raw HTTP calls is possible, of course, but not optimal.<br/>
Bindings for several languages make resource management easier.
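
As a minimal sketch of how such a request URL is put together, the helper below rebuilds the string used in the curl call from its parts. The function name `build_list_url` is an illustrative assumption, not part of any BigML library; only the domain and the `;`-separated query format are taken from the example above.

```python
import os

BIGML_URL = "https://bigml.io"  # public API domain mentioned above


def build_list_url(resource_type, username, api_key, limit=1):
    """Builds a listing URL with the same ';'-separated query as the curl call."""
    query = ";".join([
        "username=%s" % username,
        "api_key=%s" % api_key,
        "limit=%d" % limit,
    ])
    return "%s/%s?%s" % (BIGML_URL, resource_type, query)


# Credentials are read from the environment; placeholders are used if unset.
username = os.environ.get("BIGML_USERNAME", "[your username]")
api_key = os.environ.get("BIGML_API_KEY", "[your api key]")
print(build_list_url("source", username, api_key))
```

Since the printed URL contains your credentials, avoid sharing the output of this cell.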

## The bindings: connecting to BigML

From now on, this notebook uses the **Python bindings library** and **BigMLer**, a command line utility, to access the BigML API. 
Please check the **quick start** section of [BigMLer's documentation](http://bigmler.readthedocs.org/en/latest/#quick-start) to learn how to **install** it, and remember to [set your **credentials**](http://bigml.readthedocs.org/en/latest/#authentication) before using **BigMLer** or the bindings.

In [None]:
from bigml.api import BigML
api = BigML() # Credentials are imported from environment variables
              # BIGML_USERNAME and BIGML_API_KEY.
              # They can also be set explicitly: api = BigML([username], [api_key])
print(api.url + api.auth)

## Data dictionary operations

The example uses a **Diabetes dataset**, which contains information about several features that might or might not be related to the user's diagnosis. The goal is to find out which features can influence the diagnosis and to predict whether a new user will have diabetes. The data looks like this:

In [None]:
import pandas as pd
import sys
from pprint import pprint, pformat
from IPython.display import display
# DIABETES_FILE = 'https://static.bigml.com/csv/ext_diabetes.csv' # you can also import from a remote file
DIABETES_FILE = 'data/ext_diabetes.csv'
DIABETES_REMOTE = 'https://static.bigml.com/csv/ext_diabetes.csv'
display(pd.read_csv(DIABETES_FILE, nrows=5))

Let's upload this file and see how this content is interpreted from the Machine Learning point of view.<br/>
We'll first create a **Project**, an organizational unit to store every resource generated in this session.

#### CREATING PROJECT

In [None]:
PROJECT_NAME = "VSSML18 Python bindings example"
project = api.create_project({'name': PROJECT_NAME})

**Projects** and **predictions** are the only resources that are **synchronous** in **BigML**, meaning that when you issue the create call the response you get is never a work in progress, but the final resource.

In [None]:
pprint(project)

The first level attributes of this dictionary contain:

- code: the HTTP response status code
- error: the error information (when an HTTP error occurs)
- location: the location to access the resource
- object: the API's response
- resource: the **resource ID**
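
A small sketch of how these first-level attributes might be inspected is shown below. The helper `summarize_response` and the example dictionary are hypothetical; they only mirror the keys listed above and are not real API output.

```python
def summarize_response(response):
    """Returns (ok, resource_id, error) from a bindings-style response dict."""
    error = response.get("error")
    resource_id = response.get("resource")
    ok = error is None and resource_id is not None
    return ok, resource_id, error


# Hypothetical response, shaped after the attributes listed above.
example_response = {
    "code": 201,
    "error": None,
    "location": "https://bigml.io/project/0123456789abcdef01234567",
    "object": {"name": "VSSML18 Python bindings example"},
    "resource": "project/0123456789abcdef01234567",
}
print(summarize_response(example_response))
```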

In [None]:
PROJECT = project['resource']

The rest of the resources in **BigML** are **asynchronous**, so you will need to poll the resource until it is either finished or faulty. We'll see the first example now, when we upload our data to the platform and create a **source**.

#### CREATING SOURCE
When data is uploaded to the platform, a **source** is created.

In [None]:
# CSV, ARFF, Excel and JSON files, either local or remote, can be uploaded.
# For instance, you could use a remote diabetes file:
# DIABETES_FILE = "https://static.bigml.com/csv/diabetes.csv"
source = api.create_source(DIABETES_REMOTE,
                           {'name': 'diabetes source',
                            'tags': ['bindings example', 'diabetes'],
                            'project': PROJECT})
pprint(source)

As you can see, this response does not contain any of the uploaded information yet. The **status** of the resource shows that the source creation request is in progress. We'll have to wait for this process to finish, which is what **api.ok** does.

In [None]:
api.ok(source)
pprint(source["object"])

**api.ok** waits for the source creation to finish and updates the contents of the **source** variable with the current remote version of the source. Thus, we can now see that the **source** variable contains the description of the fields inferred from the uploaded file. We'll write two auxiliary functions that use **api.ok** to show the resources once they are finished or to warn us about any errors.

In [None]:
def check(resource):
    """
        Checks whether the resource status is *finished* or
        prints an error if something fails.
    """
    # api.ok uses api.get_[resource_type] to retrieve the status of the resource
    # till it reaches a final state (either FINISHED or FAULTY)
    # as defined in 
    if not api.ok(resource):
        print("Error!!!: Failed to create resource %s" % \
            resource.get(
                'resource',
                resource.get('object', {}).get('name')))

def check_and_show(resource):
    """
        Checks whether the resource status is *finished*
        and shows its contents or prints an error if something failed.
    """
    check(resource)
    pprint(resource)
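
The waiting that **api.ok** performs can be pictured as a polling loop. The sketch below is not the bindings' implementation: `wait_until_final` is a hypothetical helper, the status sequence is simulated instead of fetched over HTTP, and the numeric codes for FINISHED and FAULTY are assumed to match the constants used by the bindings.

```python
import time

# Status codes assumed to match the constants exposed by the bindings.
FINISHED = 5
FAULTY = -1


def wait_until_final(fetch_status, wait_time=1.0, max_retries=10,
                     sleep=time.sleep):
    """Polls fetch_status() until it returns FINISHED or FAULTY.

    Returns True when the resource finished, and False when it is faulty
    or the retries are exhausted.
    """
    for _ in range(max_retries):
        status = fetch_status()
        if status == FINISHED:
            return True
        if status == FAULTY:
            return False
        sleep(wait_time)
    return False


# Simulated status sequence standing in for repeated GET requests.
statuses = iter([1, 2, 3, FINISHED])
print(wait_until_final(lambda: next(statuses), sleep=lambda seconds: None))
```

Passing the `sleep` function as a parameter keeps the loop easy to try out without real waiting.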


#### View source in BigML's web site
As all **BigML**'s applications work on top of the same **API**, the source we've just created appears immediately in the source listings of our web dashboard.

In [None]:
BIGML_DASHBOARD_URL = 'https://bigml.com/dashboard'
sources_list_url = "%s/sources" % BIGML_DASHBOARD_URL
print(sources_list_url)

Now that our data is uploaded, we need to check that the **field** characteristics inferred in our **source** are really the expected ones.

#### Fields class: working with fields
What's the field structure that was inferred from the first lines of the file?

In [None]:
from bigml.fields import Fields
fields = Fields(source) # retrieves the field structure from the source object

The **fields** attribute in **Fields** contains the complete fields structure information as a dictionary.<br/>
It also has auxiliary functions to retrieve field attributes, like the ID associated with each field.

In [None]:
pprint(fields.field_id("medication"))

#### UPDATING SOURCE: changing fields type
If you need to change any of the inferred types, just update your source.

In [None]:
# medication was inferred to be a categorical field, but it really is an items field.
# To update field attributes, the expected format of the update body is
# {"fields": {[field_id1]: {[field_attribute1]: [new_field_attribute1_value],
#                          [field_attribute2]: [new_field_attribute2_value]},
#             [field_id2]: {[field_attribute1]: [new_field_attribute1_value],
#                           [field_attribute2]: [new_field_attribute2_value]}}}

fields_change  = {
    fields.field_id('medication'): {'optype': 'items',
                                    'item_analysis': {'separator': ';'}}}
# updating the source structure
source = api.update_source(source,
                           {"fields": fields_change,
                            "name": "modified diabetes"})
check(source)


The source has been updated, and the field contents will now be analyzed as a list of **items**.

In [None]:
fields = Fields(source)
fields.fields[fields.field_id("medication")]["optype"]


Similarly, a **categorical** field could be turned into a **text** field, and its associated properties:
```
{
    "enabled": true,
    "use_stopwords": true,
    "stem_words": true,
    "case_sensitive": false,
    "language": "en",
    "token_mode": "tokens_only"
}
```
could be changed.
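
As a hedged sketch, an update body for such a change could be assembled like this. The helper `build_field_update` and the field ID `"000003"` are placeholders, and `term_analysis` is assumed to be the attribute name that groups the text properties shown above; check the API documentation before relying on it.

```python
def build_field_update(field_id, optype, **attributes):
    """Builds the body expected by an update call that changes one field."""
    field_changes = {"optype": optype}
    field_changes.update(attributes)
    return {"fields": {field_id: field_changes}}


# "000003" is a placeholder field ID; "term_analysis" is an assumed
# attribute name for the text properties listed above.
text_update = build_field_update(
    "000003", "text",
    term_analysis={"enabled": True,
                   "case_sensitive": False,
                   "language": "en",
                   "token_mode": "tokens_only"})
print(text_update)
```

The resulting dictionary has the same shape as the `{"fields": ...}` body passed to `api.update_source` earlier.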


## Missings

Missing values can be a source of information in your data. To handle them correctly, we need to identify the strings that should be interpreted as missing. These are called **missing tokens**.<br/>
In the example data, the field **bmi** contains a "no data" string which should be interpreted as a missing value.<br/>
We can extend the list of missing tokens used by default in the **Source** by adding this string.

In [None]:
# updating the source structure

missing_tokens = fields.missing_tokens
missing_tokens.append("no data")
source = api.update_source( \
    source,
    {'source_parser': {"missing_tokens": missing_tokens,
                       "locale": "es-ES"}})
check(source)
fields = Fields(source)

pprint(fields.missing_tokens[0:10])
pprint(fields.missing_tokens[10:20])
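
To illustrate the effect of a missing token locally, the sketch below parses a column of raw strings and maps every missing token to `None`. The token list and the sample values are made up for illustration; the real default list comes from `fields.missing_tokens`.

```python
def parse_numeric(raw_value, missing_tokens):
    """Returns None for missing tokens and a float for anything else."""
    cleaned = raw_value.strip()
    if cleaned in missing_tokens:
        return None
    return float(cleaned)


# Illustrative token list; the real defaults come from fields.missing_tokens.
tokens = ["", "NaN", "N/A", "-", "no data"]
bmi_column = ["33.6", "no data", "26.6", "N/A"]
print([parse_numeric(value, tokens) for value in bmi_column])
```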

Once your **missings** have been correctly identified, it's time to analyze the full contents of the file and create a **Dataset** that summarizes it.

The dataset will provide information about **errors**, **missing** values in fields and **histograms**.

In [None]:
dataset = api.create_dataset(source)
check(dataset)

In [None]:
fields = Fields(dataset)
print("The diabetes field contains: ", pformat(fields.fields[fields.field_id("diabetes")]["summary"]))
print("The bmi field has", pformat(fields.fields[fields.field_id("bmi")]["summary"]["missing_count"]) , "missing values")
print("The pregnancies field has", pformat(fields.fields[fields.field_id("pregnancies")]["summary"]["missing_count"]) , "missing values")

Summaries show the number of **missings** and **errors**, and we can decide what to do with them.<br/>
Some models can treat a missing value as a value of its own. In the field **pregnancies**, for instance, a missing value could be associated with the patient being a man.<br/>
In this case, the model can be built to use them.

In [None]:
model = api.create_model(dataset, {"missing_splits": True})
check(model)

Other models, like **Clusters**, cannot use the rows that have missing values. In this case, these rows are discarded when building the model unless the missing values are replaced with a sensible default.

In [None]:
cluster_args = {"default_numeric_value": "mean"}
cluster = api.create_cluster(dataset, cluster_args)
check(cluster)
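
Conceptually, replacing missing numeric values with the mean works as in the following local sketch. It is an illustration of the idea only, not the platform's implementation, and `fill_with_mean` is a hypothetical helper.

```python
def fill_with_mean(values):
    """Replaces None entries with the mean of the values that are present."""
    present = [value for value in values if value is not None]
    if not present:
        return list(values)  # nothing to average, leave the column as-is
    mean = sum(present) / len(present)
    return [mean if value is None else value for value in values]


print(fill_with_mean([30.0, None, 20.0, None]))
```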

## Errors

Introducing a new example, let's inspect this damaged **churn telecom** dataset, where other datasets have been merged by mistake.


In [None]:
# CHURN_FILE = 'https://static.bigml.com/csv/churn-telecom.csv' # you can also import from a remote file
CHURN_FILE = 'data/churn-telecom.csv'
CHURN_REMOTE = 'https://static.bigml.com/csv/churn-telecom.csv'
display(pd.read_csv(CHURN_FILE, nrows=5))

The dataset should contain information about telecom accounts, where the last field **Churn** should be **True** if the user has churned and **False** otherwise.

In [None]:
dirty_churn_source = api.create_source(CHURN_REMOTE, \
                                       {"name": "Dirty churn",
                                        "tags": ["bindings example", "dirty churn"], \
                                        "project": project["resource"]})
api.ok(dirty_churn_source)
dirty_churn_dataset = api.create_dataset(dirty_churn_source)
api.ok(dirty_churn_dataset)

Inspecting the contents of the **Churn** field, we see that it has some unexpected values. We should remove the affected rows.

In [None]:
fields = Fields(dirty_churn_dataset)
print(dirty_churn_dataset['object']['rows'])
print("The Churn field contains: ", pformat(fields.fields[fields.field_id("Churn")]["summary"]))


In [None]:
clean_churn_dataset = api.create_dataset(dirty_churn_dataset, { \
    "lisp_filter": "(or (= (f \"Churn\") \"True\") (= (f \"Churn\") \"False\"))", \
    "name": "Clean Churn"})
check(clean_churn_dataset)
print(clean_churn_dataset['object']['rows'])
fields = Fields(clean_churn_dataset)
print("The Churn field contains: ", pformat(fields.fields[fields.field_id("Churn")]["summary"]))
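
Writing the filter expression by hand gets tedious when more values are allowed. As a sketch, a small helper (hypothetical, named `build_allowed_values_filter` here) can generate the same kind of expression used in the `lisp_filter` argument above:

```python
def build_allowed_values_filter(field_name, allowed_values):
    """Builds an expression keeping rows whose field equals an allowed value."""
    clauses = ['(= (f "%s") "%s")' % (field_name, value)
               for value in allowed_values]
    return "(or %s)" % " ".join(clauses)


churn_filter = build_allowed_values_filter("Churn", ["True", "False"])
print(churn_filter)
```

The helper assumes that field names and values contain no double quotes, since it does no escaping.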

## Feature engineering
New fields can help improve the performance of models. The **Churn** dataset has several fields that can be combined to generate new features, like ratios of charge per call.

In [None]:
names = [fields.field_name(col) for col in range(0, len(fields.fields))]
prefix_fields = [name[0: -7] for name in names if name.endswith("charge")]
field_expressions = " ".join(["(/ (f \"%s charge\") (f \"%s calls\"))" % ( \
    prefix_fields[index], prefix_fields[index]) \
    for index in range(0, len(prefix_fields))])
fields_generator = [{"names": ["%s charge per call" % name for name in prefix_fields],
                     "fields": "(list %s)" % field_expressions}]
extended_dataset = api.create_dataset(clean_churn_dataset, {"new_fields": fields_generator})
api.ok(extended_dataset)
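
The new fields are plain ratios, so their meaning can be checked locally on a single row. The sketch below uses a hypothetical row and a hypothetical helper `charge_per_call`; it returns `None` when the ratio cannot be computed, which is a choice made for this illustration and says nothing about how the platform treats such rows.

```python
def charge_per_call(row, prefix):
    """Returns charge / calls for one prefix, or None if not computable."""
    charge = row.get("%s charge" % prefix)
    calls = row.get("%s calls" % prefix)
    if charge is None or not calls:
        return None  # missing value or zero calls
    return charge / calls


# Hypothetical row using the "<prefix> charge" / "<prefix> calls" naming.
row = {"Total day charge": 45.0, "Total day calls": 100,
       "Total eve charge": 16.5, "Total eve calls": 0}
for prefix in ["Total day", "Total eve"]:
    print(prefix, charge_per_call(row, prefix))
```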

To check the performance, we split the dataset into train and test datasets.

In [None]:
# To ensure deterministic results, you must set the seed value
SEED = "BigML"
train_dataset = api.create_dataset(extended_dataset,
                                   {'name': 'Churn train dataset (80%)',
                                    'sample_rate': 0.8,
                                    'seed': SEED})
check(train_dataset)
# The out_of_bag flag selects the instances left out in the previous dataset
test_dataset = api.create_dataset(extended_dataset,
                                  {'name': 'Churn test dataset (20%)',
                                   'sample_rate': 0.8,
                                   'out_of_bag': True,
                                   'seed': SEED})
check(test_dataset)
print("train dataset: %s instances" % train_dataset["object"]["rows"])
print("test dataset: %s instances" % test_dataset["object"]["rows"])
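
The reason the same seed and sample rate are used in both calls is that the test dataset must be the exact complement of the train dataset. The idea can be sketched locally with Python's `random` module. This is not the sampling algorithm used by the platform, and `seeded_split` is a hypothetical helper.

```python
import random


def seeded_split(row_ids, sample_rate, seed):
    """Returns (in_bag, out_of_bag) row ids for a deterministic sample."""
    rng = random.Random(seed)
    in_bag, out_of_bag = [], []
    for row_id in row_ids:
        # The same seed yields the same sequence of draws on every run.
        if rng.random() < sample_rate:
            in_bag.append(row_id)
        else:
            out_of_bag.append(row_id)
    return in_bag, out_of_bag


train_ids, test_ids = seeded_split(range(1000), 0.8, "BigML")
print(len(train_ids), len(test_ids))
```

Every row lands in exactly one of the two lists, and repeating the call with the same seed reproduces the same split.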

In [None]:
# In our example, we will exclude the new fields first
excluded_fields = ["%s charge per call" % prefix for prefix in prefix_fields]

# alternatively, you could write the list of fields to be included
# using "input_fields"
original_model = api.create_model( \
    train_dataset,
    {'name': "Churn original fields",
     'objective_field': "Churn",
     'excluded_fields': excluded_fields})

check(original_model)
used_fields = Fields(original_model["object"]["model"]["model_fields"])
used_fields.list_fields()

In [None]:
original_evaluation = api.create_evaluation(original_model, test_dataset, {"name": "Churn original fields"})
api.ok(original_evaluation)
print(original_evaluation['object']['result']['model']['accuracy'])

In [None]:
# We will now exclude the original charge fields
excluded_fields = ["%s charge" % prefix for prefix in prefix_fields]

# alternatively, you could write the list of fields to be included
# using "input_fields"
ratio_model = api.create_model(
    train_dataset,
    {'name': "Churn ratio fields",
     'objective_field': "Churn",
     'excluded_fields': excluded_fields})

check(ratio_model)
used_fields = Fields(ratio_model["object"]["model"]["model_fields"])
used_fields.list_fields()

In [None]:
ratio_evaluation = api.create_evaluation(ratio_model, test_dataset, {"name": "Churn ratio fields"})
api.ok(ratio_evaluation)
print(ratio_evaluation['object']['result']['model']['accuracy'])

## Model tuning
Depending on your data, some configuration choices can produce better adapted models. As an example, using weights to balance the instances that end in churn can help the model detect the patterns for this especially interesting class.

In [None]:
balanced_model = api.create_model(
    train_dataset,
    {'name': "Churn balanced ratio fields",
     'objective_field': "Churn",
     'balance_objective': True,
     'excluded_fields': excluded_fields})
check(balanced_model)
balanced_evaluation = api.create_evaluation(
    balanced_model, test_dataset, {"name": "Churn balanced ratio fields"})
api.ok(balanced_evaluation)

In [None]:
def get_metrics(evaluation, class_name):
    """ Returns the evaluation metrics corresponding to a particular class
    
    """
    for class_info in evaluation['object']['result']['model']['per_class_statistics']:
        if class_info["class_name"] == class_name:
            return class_info
        
print("Ratio model recall:", get_metrics(ratio_evaluation, "True")["recall"])
print("Balanced model recall:", get_metrics(balanced_evaluation, "True")["recall"])
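
Recall is the fraction of actual churners that the model identifies, which is why it is the metric compared here. A minimal sketch of the formula, using made-up counts rather than results from these evaluations:

```python
def recall(true_positives, false_negatives):
    """Fraction of actual positive instances that the model identified."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0
    return true_positives / actual_positives


# Made-up counts: 60 churners detected, 40 churners missed.
print(recall(60, 40))
```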

## PREDICTIONS INTEGRATION
Eventually, the goal of our models will usually be creating predictions. Predictions can be created remotely by providing the new input data.

In [None]:
input_data = {'Total day minutes': 320, 'Number vmail messages': 2}
prediction = api.create_prediction(ratio_model,
                                   input_data=input_data)
check(prediction)
print("prediction: %s" % prediction["object"]["output"])
print("confidence: %s" % prediction["object"]["confidence"])
print("path: %s" % prediction["object"]["prediction_path"]["path"])

Of course, this method involves network latency every time you make a prediction. If your predictions don't need to be immediate, you can store the input data in a dataset and create a batch prediction for all of its rows at once. We can use our test dataset to do that.

In [None]:
batch_prediction = api.create_batch_prediction(\
    ratio_model, test_dataset, \
    {"all_fields": True,
     "output_dataset": True})
check(batch_prediction)
# we could download the results as a CSV using
# api.download_batch_prediction(batch_prediction,
#     filename='my_dir/my_predictions.csv')

## Model class: using the model locally to predict
The JSON model that can be downloaded via the API has all the information needed to predict.

In [None]:
LOCAL_MODEL_FILE = "data/churn_model.json"
api.export(ratio_model, LOCAL_MODEL_FILE)

The local **Model** object adds a **predict** method that can be used locally.

In [None]:
from bigml.model import Model
"""
    The **Model** object can use the contents of a Model
    previously stored in a file or
    internally download the model JSON structure once and
    store it in a local directory for further use.
"""
local_model = Model(LOCAL_MODEL_FILE)
pprint(local_model.predict(input_data, full=True))
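
To give an intuition of what predicting locally with a decision tree involves, the sketch below walks a toy tree stored as nested dictionaries. The structure, the threshold and the helper `predict_with_tree` are simplified and hypothetical; they do not reproduce the JSON schema of a BigML model or the logic of the **Model** class.

```python
def predict_with_tree(node, input_data, path=None):
    """Walks a nested-dict decision tree and returns (prediction, path)."""
    path = [] if path is None else path
    for child in node.get("children", []):
        field, operator, threshold = (
            child["field"], child["operator"], child["value"])
        value = input_data.get(field)
        if value is None:
            continue  # field not provided: this branch cannot be followed
        matches = value > threshold if operator == ">" else value <= threshold
        if matches:
            step = "%s %s %s" % (field, operator, threshold)
            return predict_with_tree(child, input_data, path + [step])
    # No child matched: the current node's output is the prediction.
    return node["output"], path


# Toy tree with a made-up split; real models are much deeper.
toy_tree = {
    "output": "False",
    "children": [
        {"field": "Total day minutes", "operator": ">", "value": 264,
         "output": "True", "children": []},
        {"field": "Total day minutes", "operator": "<=", "value": 264,
         "output": "False", "children": []},
    ],
}
print(predict_with_tree(toy_tree, {"Total day minutes": 320}))
```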

If you need to predict many rows at once, you can use the **BigMLer** command line, which uses this local **Model** object to create the predictions and store them in a file.

In [None]:
MODEL_ID = ratio_model['resource']
!bigmler --test $CHURN_REMOTE --model $MODEL_ID --output-dir predictions

Or, if you prefer the predictions to be computed remotely:

In [None]:
CHURN_TEST_REMOTE = 'https://static.bigml.com/csv/churn-test.csv'
!bigmler --test $CHURN_TEST_REMOTE --model $MODEL_ID --output-dir remote-predictions --remote

## Workflows

#### Basic prediction workflow

To sum up, the basic prediction workflow consists of the following steps:

- Upload the data to create a Source
- Summarize all data in a Dataset
- Create a Model from the Dataset
- Use the Model to produce a prediction for the new data

Using the diabetes example, this workflow can be written with the bindings as follows:

In [None]:
import csv
from bigml.api import BigML
from bigml.model import Model

api = BigML()
source = api.create_source(DIABETES_REMOTE, {"project": PROJECT})
api.ok(source)
dataset = api.create_dataset(source)
api.ok(dataset)
model = api.create_model(dataset)
api.ok(model)

local_model = Model(model)
with open("data/diabetes_test.csv") as test_handler:
    reader = csv.DictReader(test_handler)
    # predicting for all rows
    for input_data in reader:
        print(local_model.predict(input_data))

The same could be achieved with a single command:

In [None]:
DIABETES_TEST_REMOTE = "https://static.bigml.com/csv/diabetes_test.csv"
!bigmler --train $DIABETES_REMOTE --test $DIABETES_TEST_REMOTE \
         --output-dir diabetes-prediction --project-id $PROJECT \
         --name "Diabetes with bigmler"

And if we want to evaluate this model, we can add the **--evaluate** flag:

In [None]:
!bigmler --train $DIABETES_REMOTE --output-dir diabetes-eval --evaluate \
         --project-id $PROJECT --name "Diabetes evaluated with BigMLer"

#### Outliers removal workflow
We can try to improve that performance by removing the top outliers from the dataset before modeling.

In [None]:
DATASET_ID = dataset['resource']
!bigmler anomaly --dataset $DATASET_ID \
                 --anomaly-fields "insulin,pregnancies,plasma glucose,diabetes" \
                 --top-n 2 --anomalies-dataset out --output-dir diabetes_anomaly \
                 --project-id $PROJECT --name "Clean diabetes"

And then we evaluate the model built on the clean dataset:

In [None]:
!bigmler --datasets diabetes_anomaly/dataset_gen --output-dir diabetes-clean-eval \
         --project-id $PROJECT --evaluate --name "Clean diabetes"

#### Retrain with cumulative data

Usually, you start your project by uploading a sample of data and playing with it until you discover the workflow that gives you acceptable results. Then the rest of the data is uploaded, and you'd like to repeat the same process on the accumulated data. **BigMLer** can help you do that. In this example, we start with a regular model creation workflow.

In [None]:
# DIABETES_REMOTE = "https://static.bigml.com/csv/ext_diabetes_1.csv" 
# download this file and save it as a local file in data/ext_diabetes_1.csv
DIABETES_1 = "data/ext_diabetes_1.csv"
!bigmler --train $DIABETES_1 \
         --tag cumulative_diabetes \
         --name "Cumulative diabetes data" \
         --project-id $PROJECT \
         --output-dir ./initial_model

And after that, new data is uploaded and the same process is reproduced on the accumulated data.

In [None]:
# DIABETES_REMOTE_2 = "https://static.bigml.com/csv/ext_diabetes_2.csv"
# download this file and store it in data/ext_diabetes_2.csv
DIABETES_2 = "data/ext_diabetes_2.csv"
!bigmler retrain --add $DIABETES_2 \
                 --model-tag cumulative_diabetes \
                 --output-dir accumulative_retrain