# The API: Connecting to BigML

The BigML API offers an endpoint to create, get, update and delete every Machine Learning resource. <br/>
This API is accessible via HTTP and its general public domain is [bigml.io](https://bigml.io). <br/>
You will need some credentials that will be used to authenticate every request. <br/>
We recommend setting them as environment variables for convenience.

In [None]:
# set your credentials as environment variables
%env BIGML_USERNAME=merce_demo
%env BIGML_API_KEY=***************

The API calls that you issue carry these credentials as an authentication token. For instance:

In [None]:
url = "https://bigml.io/source?username=merce_demo;api_key=********;limit=1"
print(url)

In [None]:
!curl "https://bigml.io/source?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;limit=1"

This call lists the most recently uploaded source in your account.<br/>
Managing resources with raw HTTP calls like these is possible, of course, but not optimal.<br/>
Bindings for several languages make resource management easier.
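
As a minimal sketch of how such a URL is put together, the helper below joins the credentials and a `limit` filter using the same `;` separator as the call above. The name `build_list_url` is hypothetical and not part of any BigML library, and the credentials in the usage line are placeholders. The function only builds the string and does not contact the API.

```python
BIGML_URL = "https://bigml.io"  # public API domain mentioned above


def build_list_url(resource_type, username, api_key, limit=1):
    """Builds a listing URL with the credentials as query-string parameters.

    Hypothetical helper for illustration: it mirrors the format of the
    curl call above and performs no HTTP request.
    """
    if not username or not api_key:
        raise ValueError("both username and api_key are required")
    return "%s/%s?username=%s;api_key=%s;limit=%d" % (
        BIGML_URL, resource_type, username, api_key, limit)


# Placeholder credentials; in the notebook they would come from
# the BIGML_USERNAME and BIGML_API_KEY environment variables.
print(build_list_url("source", "my_user", "my_key"))
```

Failing early on empty credentials is a design choice of this sketch: it turns a confusing authentication error from the server into a local one.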

## The bindings: connecting to BigML

From now on, this notebook uses the **Python bindings library** and **BigMLer**, a command line utility, to access the BigML API. 
Please check the **quick start** section of [BigMLer's documentation](http://bigmler.readthedocs.org/en/latest/#quick-start) to learn how to **install** it, and remember to [set your **credentials**](http://bigml.readthedocs.org/en/latest/#authentication) before using **BigMLer** or the bindings.

In [None]:
from bigml.api import BigML
api = BigML() # Credentials are imported from environment variables
              # BIGML_USERNAME and BIGML_API_KEY.
              # They can also be set explicitly: api = BigML([username], [api_key])
print(api.url + api.auth)

The next instruction should show the most recently uploaded source in your account. If it does, your credentials are set correctly and you are ready to start creating resources in BigML.

In [None]:
api.list_sources("limit=1")

### PREDICTION WORKFLOW

In [None]:
import pandas as pd
import sys
from pprint import pprint, pformat
from IPython.display import display
# DIABETES_FILE = 'https://static.bigml.com/csv/diabetes.csv' # you can also import from a remote file
DIABETES_FILE = 'data/diabetes.csv'
display(pd.read_csv(DIABETES_FILE, nrows=5))

Let's upload this file and see how this content is interpreted from the Machine Learning point of view.<br/>
We'll first create a **Project**, an organizational unit to store every resource generated in this session.

#### CREATING PROJECT

In [None]:
PROJECT_NAME = "MLSEV Python bindings example"
project = api.create_project({'name': PROJECT_NAME})

**Projects** and **predictions** are the only **synchronous** resources in **BigML**: when you issue the create call, the response you get is never a work in progress but the final resource.

In [None]:
pprint(project)

The first-level attributes of this dictionary are:

- code: the HTTP response status code
- error: the error information (when an HTTP error occurs)
- location: the location to access the resource
- object: the API's response
- resource: the **resource ID**

In [None]:
PROJECT = project['resource']
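
The envelope described above can be read with a few dictionary lookups. The sketch below uses a hand-written sample response and a hypothetical helper, `describe_response`. The resource ID, location and status code in the sample are invented for illustration, not copied from a real account.

```python
def describe_response(response):
    """Returns a one-line summary of a bindings response dictionary.

    Hypothetical helper: it only reads the first-level keys listed above.
    """
    if response.get("error") is not None:
        return "HTTP %s, error: %s" % (response.get("code"), response["error"])
    return "HTTP %s, resource ID: %s" % (response.get("code"),
                                         response.get("resource"))


# Hand-written sample; every value below is invented for illustration.
sample_project = {
    "code": 201,
    "error": None,
    "location": "https://bigml.io/project/000000000000000000000000",
    "object": {"name": "MLSEV Python bindings example"},
    "resource": "project/000000000000000000000000",
}
print(describe_response(sample_project))
```

With a real response you would call `describe_response(project)` instead of passing the sample.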

The rest of the resources in **BigML** are **asynchronous**, so you will need to poll each resource until it is either finished or faulty. We'll see the first example now, when we upload our data to the platform and create a **source**.

#### CREATING SOURCE
When data is uploaded to the platform, a **source** is created.

In [None]:
source = api.create_source(DIABETES_FILE,
                           {'name': 'diabetes source',
                            'tags': ['bindings example', 'diabetes'],
                            'project': PROJECT})
"""
    CSV, ARFF, Excel and JSON files, either local or remote, can be uploaded.
    For instance, you could use a remote diabetes:
    DIABETES_FILE = "https://static.bigml.com/csv/diabetes.csv"
    
"""
pprint(source)

As you can see, this response does not contain any of the uploaded information yet. The **status** of the resource shows that the source creation request is still in progress, so we have to wait for it to finish. This is what **api.ok** does.

In [None]:
api.ok(source)
pprint(source["object"])

**api.ok** waits for the source creation to finish and updates the contents of the **source** variable with the current remote version of the source. The **source** variable now contains the description of the fields inferred from the uploaded file. We'll write two auxiliary functions that use **api.ok** to show resources once they are finished or to warn us about any errors.

In [None]:
def check(resource):
    """
        Checks whether the resource status is *finished* or
        prints an error if something fails.
    """
    # api.ok uses api.get_[resource_type] to retrieve the status of the resource
    # till it reaches a final state (either FINISHED or FAULTY)
    # as defined in 
    if not api.ok(resource):
        print("Error!!!: Failed to create resource %s" % \
            resource.get(
                'resource',
                resource.get('object', {}).get('name')))

def check_and_show(resource):
    """
        Checks whether the resource status is *finished*
        and shows its contents or prints an error if something failed.
    """
    check(resource)
    pprint(resource)
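
To make the polling idea concrete, here is a minimal sketch of a wait loop. It is not the implementation of **api.ok**. It assumes the status lives under `object -> status -> code`, and that `5` means finished and `-1` means faulty, as in BigML's status code table. The names `wait_until_final` and `make_fake_fetch` are hypothetical, and the remote call is replaced by a stand-in that replays a list of codes.

```python
import time

# Assumed status codes: 5 is FINISHED and -1 is FAULTY.
FINISHED = 5
FAULTY = -1


def wait_until_final(fetch, resource_id, interval=0.5, max_retries=20):
    """Polls fetch(resource_id) until the status code is final.

    `fetch` is any callable returning a dictionary shaped like the
    bindings' responses. Returns True when FINISHED, and False when
    FAULTY or when the retries are exhausted.
    """
    for _ in range(max_retries):
        resource = fetch(resource_id)
        code = resource["object"]["status"]["code"]
        if code == FINISHED:
            return True
        if code == FAULTY:
            return False
        time.sleep(interval)
    return False


def make_fake_fetch(codes):
    """Builds a stand-in for the remote call that replays `codes` in order.

    Once only one code is left, it keeps returning that code.
    """
    remaining = list(codes)

    def fetch(resource_id):
        code = remaining.pop(0) if len(remaining) > 1 else remaining[0]
        return {"resource": resource_id, "object": {"status": {"code": code}}}

    return fetch


# Simulated resource that passes through two non-final codes and then finishes.
print(wait_until_final(make_fake_fetch([1, 3, 5]), "source/fake", interval=0))
```

Capping the number of retries keeps a stuck resource from blocking the notebook forever, at the cost of reporting a slow resource as a failure.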


#### View source in BigML's web site
As all **BigML**'s applications work on top of the same **API**, the source we've just created appears immediately in the source listings of our web dashboard.

In [None]:
BIGML_DASHBOARD_URL = 'https://bigml.com/dashboard'
sources_list_url = "%s/sources" % BIGML_DASHBOARD_URL
print(sources_list_url)

Now that our data is uploaded, we need to check that the **field** properties inferred for our **source** are the expected ones. They are in this case, so the next step is to create a **dataset** from it.

#### CREATING DATASET

The dataset will provide information about **errors**, **missing** values in fields and **histograms**.

In [None]:
dataset = api.create_dataset(source)
check(dataset)

Summaries show the number of **missing** values and **errors**, so we can decide what to do with them.
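
As a rough sketch of how those counts could be read programmatically, the helper below walks the fields of a dataset description and collects the missing-value count of each one. It assumes that each field carries a `summary` dictionary with a `missing_count` entry. The helper name `missing_report` is hypothetical, and the sample fragment (field IDs and counts) is invented for illustration.

```python
def missing_report(dataset_object):
    """Maps each field name to its missing-value count.

    Assumes each field has a 'summary' dictionary with a
    'missing_count' entry; fields without one are reported as 0.
    """
    report = {}
    for field in dataset_object.get("fields", {}).values():
        summary = field.get("summary", {})
        report[field["name"]] = summary.get("missing_count", 0)
    return report


# Hand-written fragment; the field IDs and counts are invented.
sample_dataset_object = {
    "fields": {
        "000001": {"name": "plasma glucose", "summary": {"missing_count": 5}},
        "000005": {"name": "bmi", "summary": {"missing_count": 11}},
        "000008": {"name": "diabetes", "summary": {"missing_count": 0}},
    }
}
for name, count in sorted(missing_report(sample_dataset_object).items()):
    print("%s: %d missing" % (name, count))
```

If the assumption about the structure holds, `missing_report(dataset["object"])` would give the same report for the finished dataset.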

#### CREATING MODEL
This is the real training part, where the algorithm learns the patterns in your data. In supervised learning models, the **objective field** is the label that the model will predict. BigML takes the last field as the objective field unless told otherwise.

In [None]:
model = api.create_model(
    dataset,
    {'name': "Diabetes decision tree",
     'objective_field': "diabetes"})
check(model)

## PREDICTIONS INTEGRATION
Ultimately, the goal of our models is usually to create predictions. We can create **remote predictions** by providing the new input data.

In [None]:
input_data = {'plasma glucose': 180, 'bmi': 30}
prediction = api.create_prediction(model,
                                   input_data=input_data)
check(prediction)
print("prediction: %s" % prediction["object"]["output"])
print("confidence: %s" % prediction["object"]["confidence"])
print("path: %s" % prediction["object"]["prediction_path"]["path"])

Of course, this method adds network latency every time you make a prediction. If your predictions don't need to be immediate, you can store the input data in a file and create a **remote batch prediction** for an entire test dataset.

## Model class: using the model locally to predict
The JSON model that can be downloaded via the API has all the information needed to predict.

In [None]:
LOCAL_MODEL_FILE = "data/diabetes_model.json"
api.export(model, LOCAL_MODEL_FILE)

The local **Model** object adds a **predict** method that can be used locally.

In [None]:
from bigml.model import Model
"""
    The **Model** object can use the contents of a Model
    previously stored in a file or
    internally download the model JSON structure once and
    store it in a local directory for further use.
"""
local_model = Model(LOCAL_MODEL_FILE)
pprint(local_model.predict(input_data, full=True))
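
To give an intuition of why a downloaded tree is enough to predict offline, here is a toy sketch of walking a decision tree stored as nested dictionaries. The node layout (`field`, `threshold`, `left`, `right`, `output`) and the thresholds are invented for this example. They are not BigML's model JSON schema, and this is not how the **Model** class is implemented.

```python
def predict_with_tree(node, input_data):
    """Walks a nested-dictionary tree until it reaches a leaf.

    The node layout is a toy format invented for this sketch.
    Raises KeyError if input_data lacks a field the tree tests.
    """
    path = []
    while "output" not in node:
        field, threshold = node["field"], node["threshold"]
        value = input_data[field]
        if value <= threshold:
            path.append("%s <= %s" % (field, threshold))
            node = node["left"]
        else:
            path.append("%s > %s" % (field, threshold))
            node = node["right"]
    return {"prediction": node["output"], "path": path}


# Invented thresholds, for illustration only.
toy_tree = {
    "field": "plasma glucose", "threshold": 123,
    "left": {"output": "false"},
    "right": {
        "field": "bmi", "threshold": 29,
        "left": {"output": "false"},
        "right": {"output": "true"},
    },
}
print(predict_with_tree(toy_tree, {"plasma glucose": 180, "bmi": 30}))
```

No network call is involved at any point, which is what removes the latency of remote predictions.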

If you need to predict many rows at once, you can use the **BigMLer** command line, which uses this local **Model** object to create the predictions and store them in a file.
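
If you would rather stay in Python, the same idea can be sketched with the standard `csv` module: read each row, apply a predictor and write the row back with an extra column. The names `predict_rows` and `toy_rule` are hypothetical, and the threshold in `toy_rule` is invented. In practice you would pass something like a local model's predict method as the `predict` argument.

```python
import csv
import io


def predict_rows(predict, in_handle, out_handle, output_name="prediction"):
    """Reads CSV rows, applies `predict` to each and writes the results.

    `predict` is any callable that takes a row dictionary and returns
    a value. Returns the number of rows written.
    """
    reader = csv.DictReader(in_handle)
    writer = csv.DictWriter(out_handle,
                            fieldnames=reader.fieldnames + [output_name],
                            lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        row[output_name] = predict(row)
        writer.writerow(row)
        count += 1
    return count


def toy_rule(row):
    """Stand-in predictor with an invented threshold."""
    return "true" if float(row["plasma glucose"]) > 140 else "false"


# In-memory files keep the example self-contained.
test_rows = io.StringIO("plasma glucose,bmi\n180,30\n95,22\n")
results = io.StringIO()
print(predict_rows(toy_rule, test_rows, results))
print(results.getvalue())
```

Passing file handles rather than paths lets the same function work with real files opened with `open(...)` and with in-memory buffers.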

## Workflows

#### Basic prediction workflow

To sum up, the basic prediction workflow consists of these steps:

- Upload the data to create a Source
- Summarize all data in a Dataset
- Create a Model from the Dataset
- Use the Model to produce a prediction for the new data

Using the diabetes example, this workflow can be written with the bindings as follows:

In [None]:
import csv
from bigml.api import BigML
from bigml.model import Model

api = BigML()
source = api.create_source(DIABETES_FILE, {"project": PROJECT})
api.ok(source)
dataset = api.create_dataset(source)
api.ok(dataset)
model = api.create_model(dataset)
api.ok(model)

local_model = Model(model)
with open("data/diabetes_test.csv") as test_handler:
    reader = csv.DictReader(test_handler)
    for input_data in reader:
        # predicting for all rows
        print(local_model.predict(input_data))

The whole workflow could also be run with a single BigMLer command:

In [None]:
DIABETES = "data/diabetes.csv"
DIABETES_TEST = "data/diabetes_test.csv"
!bigmler --train $DIABETES --test $DIABETES_TEST \
         --output-dir diabetes-prediction \
         --project-id $PROJECT \
         --name "Diabetes with BigMLer"

And if we want to evaluate this model, we can hold out part of the data with **--test-split** and add the **--evaluate** flag:

In [None]:
!bigmler --train $DIABETES \
         --test-split 0.2 \
         --output-dir diabetes-eval \
         --project-id $PROJECT \
         --name "Diabetes evaluated with BigMLer" \
         --seed "bigml" \
         --evaluate
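
Conceptually, a test split of `0.2` with a fixed seed means that 20% of the rows are held out for evaluation and that the same rows are picked on every run. The sketch below illustrates that idea with the standard `random` module. It is not the sampling that BigML or BigMLer actually applies, and `split_rows` is a hypothetical helper.

```python
import random


def split_rows(rows, test_split=0.2, seed="bigml"):
    """Deterministically splits rows into (train, test) lists.

    Conceptual sketch only: the same seed always yields the same split.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_split))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, held_out_rows = split_rows(range(10))
print(len(train_rows), len(held_out_rows))
```

A fixed seed is what makes two evaluations comparable: any change in the metrics then comes from the model settings and not from a different held-out sample.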
