# The API: Connecting to BigML

The BigML API offers an endpoint to create, get, update and delete every Machine Learning resource. <br/>
This API is accessible via HTTP and its general public domain is [bigml.io](https://bigml.io). <br/>
You will need some credentials that will be used to authenticate every request. <br/>
We recommend setting them as environment variables for your convenience.

In [67]:
# set your credentials as environment variables
%env BIGML_USERNAME=merce_demo
%env BIGML_API_KEY=dc0d33828c638840934f8bc004099b41d664b421

env: BIGML_USERNAME=merce_demo
env: BIGML_API_KEY=dc0d33828c638840934f8bc004099b41d664b421


The API calls that you issue carry these credentials as an authentication token. For instance:

In [47]:
url = "https://bigml.io/source?username=merce_demo;api_key=dc0d33828c638840934f8bc004099b41d664b421;limit=1"
print(url)

https://bigml.io/source?username=merce_demo;api_key=dc0d33828c638840934f8bc004099b41d664b421;limit=1


In [48]:
!curl "https://bigml.io/source?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;limit=1"

{"meta": {"limit": 1, "next": "/andromeda/source?username=merce_demo&api_key=dc0d33828c638840934f8bc004099b41d664b421&limit=1&offset=1", "offset": 0, "previous": null, "total_count": 179}, "objects": [{"category": 12, "charset": "UTF-8", "code": 200, "configuration": null, "configuration_status": false, "content_type": "text/csv", "created": "2018-09-10T15:17:09.644000", "creator": "merce_demo", "credits": 0, "description": "Created using BigMLer", "disable_datetime": false, "field_types": {"categorical": 2, "datetime": 1, "items": 0, "numeric": 15, "text": 1, "total": 19}, "fields_meta": {"count": 19, "limit": 1000, "offset": 0, "query_total": 19, "total": 19}, "file_name": "ext_diabetes.csv", "item_analysis": {}, "md5": "6d6af57ac5fe8484e9ddede6a18a8e57", "name": "BigMLer_MonSep1018_171708", "name_options": "19 fields (2 categorical, 15 numeric, 1 text, 6 auto-generated datetime)", "number_of_anomalies": 0, "number_of_anomalyscores": 0, "number_of_associations": 0, "number_of_associa

This call lists the most recently uploaded file in your account.<br/>
Managing resources with raw HTTP calls like these is, of course, possible but not optimal.<br/>
Bindings for several languages make resource management easier.
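
As a small illustration of what such a raw call involves, the listing URL can be assembled from the environment variables instead of being hard-coded. This is only a sketch: `build_listing_url` is a hypothetical helper, not part of any BigML library, and it simply mirrors the semicolon-separated layout of the URL shown above.

```python
import os
from urllib.parse import quote


def build_listing_url(resource_type, limit=1, domain="bigml.io"):
    """Builds a listing URL from BIGML_USERNAME and BIGML_API_KEY.

    Hypothetical helper: it only reproduces the URL layout used in the
    example above and performs no request.
    """
    username = os.environ.get("BIGML_USERNAME", "")
    api_key = os.environ.get("BIGML_API_KEY", "")
    # Parameters are separated by semicolons, as in the example URL.
    query = "username=%s;api_key=%s;limit=%d" % (
        quote(username, safe=""), quote(api_key, safe=""), limit)
    return "https://%s/%s?%s" % (domain, resource_type, query)
```

Calling `build_listing_url("source")` after setting both variables returns a URL with the same shape as the one passed to `curl` above.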

## The bindings: connecting to BigML

From now on, this notebook uses the **Python bindings library** and **BigMLer**, a command line utility, to access the BigML API. 
Please check the **quick start** section of [BigMLer's documentation](http://bigmler.readthedocs.org/en/latest/#quick-start) to learn how to **install** it, and remember to [set your **credentials**](http://bigml.readthedocs.org/en/latest/#authentication) before using **BigMLer** or the bindings.

In [3]:
from bigml.api import BigML
api = BigML() # Credentials are imported from environment variables
              # BIGML_USERNAME and BIGML_API_KEY.
              # They can also be set explicitly: api = BigML([username], [api_key])
print(api.url + api.auth)

https://bigml.io/andromeda/?username=merce_demo;api_key=dc0d33828c638840934f8bc004099b41d664b421;


## Data dictionary operations

The example uses a **Diabetes dataset**, which contains several features that might or might not be related to the patient's diagnosis. The goal is to find which features influence the diagnosis and to predict whether a new patient will have diabetes. The data looks like this:

In [68]:
import pandas as pd
import sys
from pprint import pprint, pformat
from IPython.display import display
# DIABETES_FILE = 'https://static.bigml.com/csv/ext_diabetes.csv' # you can also import from a remote file
DIABETES_FILE = 'data/ext_diabetes.csv'
display(pd.read_csv(DIABETES_FILE, nrows=5))

Unnamed: 0,patient id,timestamp,pregnancies,plasma glucose,blood pressure,triceps skin thickness,insulin,bmi,diabetes pedigree,age,diabetes,medication,observations
0,1,04/01/16 18:43,6.0,148,72,35,0,33.6,627,50,True,diazepam;enalapril,hepatItis antecents
1,2,05/01/16 00:20,1.0,85,66,29,0,26.6,351,31,False,,
2,3,06/01/16 17:17,8.0,183,64,0,0,23.3,672,32,True,simvastatin,
3,4,11/01/16 15:07,1.0,89,66,23,94,28.1,167,21,False,,
4,5,12/01/16 03:12,,137,40,35,168,43.1,2288,33,True,,


Let's upload this file and see how this content is interpreted from the Machine Learning point of view.<br/>
We'll first create a **Project**, an organizational unit to store every resource generated in this session.

#### CREATING PROJECT

In [69]:
PROJECT_NAME = "VSSML18 Python bindings example"
project = api.create_project({'name': PROJECT_NAME})

**Projects** and **predictions** are the only resources that are **synchronous** in **BigML**, meaning that when you issue the create call the response you get is never a work in progress, but the final resource.

In [70]:
pprint(project)

{'code': 201,
 'error': None,
 'location': 'http://bigml.io/andromeda/project/5b96b41e92527314e1001ef9',
 'object': {'category': 0,
            'code': 201,
            'configuration': None,
            'configuration_status': False,
            'created': '2018-09-10T18:12:46.166550',
            'creator': 'merce_demo',
            'description': '',
            'execution_id': None,
            'execution_status': None,
            'manage_permission': [],
            'name': 'VSSML18 Python bindings example',
            'name_options': '',
            'private': True,
            'resource': 'project/5b96b41e92527314e1001ef9',
            'stats': {'anomalies': {'count': 0},
                      'anomalyscores': {'count': 0},
                      'associations': {'count': 0},
                      'associationsets': {'count': 0},
                      'batchanomalyscores': {'count': 0},
                      'batchcentroids': {'count': 0},
                      'batchprediction

The first-level attributes of this dictionary are:

- code: the HTTP response status code
- error: the error information (when an HTTP error occurs)
- location: the location to access the resource
- object: the API's response
- resource: the **resource ID**
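
To make the role of these attributes concrete, here is a minimal sketch that reads them from a response dictionary. `summarize_response` is a hypothetical helper written for this example, and the sample dictionary is a trimmed copy of the project response printed above.

```python
def summarize_response(response):
    """Returns (ok, resource_id, error) for a bindings-style response dict."""
    code = response.get("code")
    error = response.get("error")
    # A 2xx code with no error information means the HTTP call succeeded.
    ok = code is not None and 200 <= code < 300 and error is None
    return ok, response.get("resource"), error


# Trimmed copy of the project response shown above.
sample = {
    "code": 201,
    "error": None,
    "location": "http://bigml.io/andromeda/project/5b96b41e92527314e1001ef9",
    "object": {"name": "VSSML18 Python bindings example"},
    "resource": "project/5b96b41e92527314e1001ef9",
}
print(summarize_response(sample))
```

The resource ID returned here is the value that later calls use to refer to the project.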

In [71]:
PROJECT = project['resource']

The rest of the resources in **BigML** are **asynchronous**, so you will need to poll the resource until it is either finished or faulty. We'll see the first example now, as we upload our data to the platform and create a **source**.
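
The polling described above can be pictured as a simple loop. The sketch below is not the bindings' implementation: `fetch` is a hypothetical callable standing in for the remote GET call, the status layout (`object` → `status` → `code`) and the numeric codes 5 (finished) and -1 (faulty) are assumptions, and the remote resource is simulated locally.

```python
import time

FINISHED = 5   # assumed status code for a finished resource
FAULTY = -1    # assumed status code for a failed resource


def wait_until_final(fetch, resource_id, wait_time=0.01, retries=50):
    """Polls fetch(resource_id) until the resource reaches a final state.

    Returns True if the resource finished, False if it is faulty or the
    number of retries is exhausted.
    """
    for _ in range(retries):
        resource = fetch(resource_id)
        code = resource["object"]["status"]["code"]
        if code == FINISHED:
            return True
        if code == FAULTY:
            return False
        time.sleep(wait_time)
    return False


def make_fake_fetch(codes):
    """Simulates a remote resource that walks through the given codes."""
    remaining = list(codes)

    def fetch(resource_id):
        # Consume codes one by one, then keep repeating the last one.
        code = remaining.pop(0) if len(remaining) > 1 else remaining[0]
        return {"resource": resource_id,
                "object": {"status": {"code": code}}}

    return fetch


print(wait_until_final(make_fake_fetch([1, 2, 3, 5]), "source/example"))
```

In the bindings, this waiting is done for you, so the loop is only meant to show why a freshly created resource may still be incomplete.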

#### CREATING SOURCE
When data is uploaded to the platform, a **source** is created.

In [72]:
source = api.create_source(DIABETES_FILE,
                           {'name': 'diabetes source',
                            'tags': ['bindings example', 'diabetes'],
                            'project': PROJECT})
"""
    CSV, ARFF, Excel and JSON files, either local or remote, can be uploaded.
    For instance, you could use a remote diabetes:
    DIABETES_FILE = "https://static.bigml.com/csv/diabetes.csv"
    
"""
pprint(source)

{'code': 201,
 'error': None,
 'location': 'http://bigml.io/andromeda/source/5b96b50a2774cb43d1001f65',
 'object': {'category': 0,
            'code': 201,
            'configuration': None,
            'configuration_status': False,
            'content_type': 'text/csv',
            'created': '2018-09-10T18:16:42.836014',
            'creator': 'merce_demo',
            'credits': 0.0,
            'description': '',
            'disable_datetime': False,
            'field_types': {'categorical': 0,
                            'datetime': 0,
                            'items': 0,
                            'numeric': 0,
                            'text': 0,
                            'total': 0},
            'fields_meta': {'count': 0, 'limit': 1000, 'offset': 0, 'total': 0},
            'file_name': 'ext_diabetes.csv',
            'item_analysis': {},
            'md5': '6d6af57ac5fe8484e9ddede6a18a8e57',
            'name': 'diabetes source',
            'name_options': '',
  

As you can see, this response does not contain any of the uploaded information yet. The **status** of the resource shows that the source creation request is still in progress. We'll have to wait for this process to finish. This is what **api.ok** does:

In [73]:
api.ok(source)
pprint(source["object"])

{'category': 0,
 'charset': 'UTF-8',
 'code': 200,
 'configuration': None,
 'configuration_status': False,
 'content_type': 'text/csv',
 'created': '2018-09-10T18:16:42.836000',
 'creator': 'merce_demo',
 'credits': 0,
 'description': '',
 'disable_datetime': False,
 'field_types': {'categorical': 2,
                 'datetime': 1,
                 'items': 0,
                 'numeric': 15,
                 'text': 1,
                 'total': 19},
 'fields': {'000000': {'column_number': 0,
                       'name': 'patient id',
                       'optype': 'numeric',
                       'order': 0},
            '000001': {'child_ids': ['000001-0',
                                     '000001-1',
                                     '000001-2',
                                     '000001-3',
                                     '000001-4',
                                     '000001-5'],
                       'column_number': 1,
                       'name': 'timestam

**api.ok** waits for the source creation to finish and updates the contents of the **source** variable with the current remote version of the source. Now we can see that the **source** variable contains the description of the fields inferred from the uploaded file. We'll write two auxiliary functions that use **api.ok** to show the resources once they are finished or to warn us about any errors.

In [9]:
def check(resource):
    """
        Checks whether the resource status is *finished* or
        prints an error if something fails.
    """
    # api.ok uses api.get_[resource_type] to retrieve the status of the resource
    # till it reaches a final state (either FINISHED or FAULTY)
    # as defined in 
    if not api.ok(resource):
        print("Error!!!: Failed to create resource %s" % \
            resource.get(
                'resource',
                resource.get('object', {}).get('name')))

def check_and_show(resource):
    """
        Checks whether the resource status is *finished*
        and shows its contents or prints an error if something failed.
    """
    check(resource)
    pprint(resource)


#### View source in BigML's web site
Since all of **BigML**'s applications work on top of the same **API**, the source we've just created appears immediately in the source listings of our web dashboard.

In [10]:
BIGML_DASHBOARD_URL = 'https://bigml.com/dashboard'
sources_list_url = "%s/sources" % BIGML_DASHBOARD_URL
print(sources_list_url)

https://bigml.com/dashboard/sources


Now that our data is uploaded, we need to check that the **field** characteristics inferred in our **source** are really the expected ones.

#### Fields class: working with fields
What's the field structure that was inferred from the first lines of the file?

In [75]:
from bigml.fields import Fields
fields = Fields(source) # retrieves the field structure from the source object

The **fields** attribute in **Fields** contains the complete field structure as a dictionary.<br/>
It also has auxiliary functions to retrieve field attributes, like the ID associated with each field.

In [76]:
pprint(fields.field_id("medication"))

'00000b'


### UPDATING SOURCE: changing fields type
If you need to change any of the inferred types, just update your source.

In [77]:
# medication was inferred to be a categorical field, but it really is an items field.
# To update field attributes, the expected format of the update body is
# {"fields": {[field_id1]: {[field_attribute1]: [new_field_attribute1_value],
#                          [field_attribute2]: [new_field_attribute2_value]},
#             [field_id2]: {[field_attribute1]: [new_field_attribute1_value],
#                           [field_attribute2]: [new_field_attribute2_value]}}}

fields_change  = {
    fields.field_id('medication'): {'optype': 'items',
                                    'item_analysis': {'separator': ';'}}}
# updating the source structure
source = api.update_source(source,
                           {"fields": fields_change,
                            "name": "modified diabetes"})
check(source)


The source has been updated, and the field contents will now be analyzed as a list of **items**.

In [78]:
fields = Fields(source)
fields.fields[fields.field_id("medication")]["optype"]


'items'

Similarly, a **categorical** field could be turned into a **text** field, and its associated properties:
```
{
    "enabled": true,
    "use_stopwords": true,
    "stem_words": true,
    "case_sensitive": false,
    "language": "en",
    "token_mode": "tokens_only"
}
```
could be changed.
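
Whatever attribute is being changed, the update body keeps the same shape: a `fields` dictionary keyed by field ID. The following sketch builds that body from field names. `build_fields_update` is a hypothetical helper, and the name-to-ID mapping is written by hand here, using the ID that `fields.field_id("medication")` returned above.

```python
def build_fields_update(field_ids, changes):
    """Builds a {"fields": {...}} update body from field names.

    field_ids: dict mapping field names to field IDs.
    changes: dict mapping field names to the attributes to change.
    Raises KeyError if a field name is unknown.
    """
    return {"fields": {field_ids[name]: dict(attributes)
                       for name, attributes in changes.items()}}


# ID reported above for the medication field.
field_ids = {"medication": "00000b"}
body = build_fields_update(
    field_ids,
    {"medication": {"optype": "items",
                    "item_analysis": {"separator": ";"}}})
print(body)
```

The resulting dictionary has the same structure as the one passed to `api.update_source` in the cell above.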


## Missings

Missing values can be a source of information in your data. To manage them correctly, we need to identify the strings that usually stand for a missing value. These are called **missing tokens**.<br/>
In the example data, the field **bmi** contains a "no data" string, which should be interpreted as a missing value.<br/>
We can extend the default list of missing tokens used by the **Source** by adding this string.
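
To see what a missing token changes in practice, here is a small local sketch that is independent of BigML. The rows are made up for illustration and `count_missing` is a hypothetical helper; it only counts how many values in a column match one of the tokens.

```python
import csv
import io


def count_missing(csv_text, column, missing_tokens):
    """Counts the values in `column` that match one of the missing tokens."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for row in reader
               if row[column].strip() in missing_tokens)


# Made-up rows for illustration; not taken from the diabetes file.
sample_csv = "patient id,bmi\n1,33.6\n2,no data\n3,\n4,N/A\n"

default_tokens = ["", "NaN", "NULL", "N/A", "null", "-"]
print(count_missing(sample_csv, "bmi", default_tokens))
print(count_missing(sample_csv, "bmi", default_tokens + ["no data"]))
```

With the shortened default list, two values count as missing; adding "no data" raises the count to three, which is the effect the source update is after.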

In [79]:
# updating the source structure

missing_tokens = fields.missing_tokens
missing_tokens.append("no data")
source = api.update_source( \
    source,
    {'source_parser': {"missing_tokens": missing_tokens,
                       "locale": "es-ES"}})
check(source)
fields = Fields(source)

pprint(fields.missing_tokens[0: 9])
pprint(fields.missing_tokens[10: 20])

['', 'NaN', 'NULL', 'N/A', 'no data', 'null', '-', '#REF!', '#VALUE!']
['#NULL!', '#NUM!', '#DIV/0', 'n/a', '#NAME?', 'NIL', 'nil', 'na', '#N/A', 'NA']


Once your **missings** have been correctly identified, it's time to analyze the full contents of the file and create a **Dataset** that summarizes it.

The dataset will provide information about **errors**, **missing** values in fields and **histograms**.

In [80]:
dataset = api.create_dataset(source)
check(dataset)

In [81]:
fields = Fields(dataset)
print("The diabetes field contains: ", pformat(fields.fields[fields.field_id("diabetes")]["summary"]))
print("The bmi field has", pformat(fields.fields[fields.field_id("bmi")]["summary"]["missing_count"]) , "missing values")
print("The pregnancies field has", pformat(fields.fields[fields.field_id("pregnancies")]["summary"]["missing_count"]) , "missing values")

The diabetes field contains:  {'categories': [['false', 119], ['true', 81]], 'missing_count': 0}
The bmi field has 3 missing values
The pregnancies field has 11 missing values


Summaries show the number of **missings** and **errors**, and we can decide what to do with them.<br/>
Some models can treat a missing value as a value in its own right. In the field **pregnancies**, for example, a missing value could be associated with the patient being a man.<br/>
In this case, the model can be built to use missing values.

In [82]:
model = api.create_model(dataset, {"missing_splits": True})
check(model)

Other models, like **Clusters**, cannot use rows that have missing values. In this case, these rows are discarded when building the model unless the
missing values are replaced with a sensible default.
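
What replacing missing values with the mean amounts to can be sketched locally. This is only an illustration of the idea: BigML applies the default remotely over the whole dataset, and `fill_with_mean` is a hypothetical helper. The values are the five **pregnancies** entries from the preview above, the last of which is missing.

```python
def fill_with_mean(values):
    """Replaces None entries with the mean of the known values."""
    known = [value for value in values if value is not None]
    if not known:
        # Nothing to average, so the list is returned unchanged.
        return list(values)
    mean = sum(known) / len(known)
    return [mean if value is None else value for value in values]


# First five pregnancies values from the preview; the fifth is missing.
pregnancies = [6.0, 1.0, 8.0, 1.0, None]
print(fill_with_mean(pregnancies))
```

The missing entry becomes 4.0, the mean of the four known values, so the row can take part in the clustering instead of being discarded.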

In [83]:
cluster_args = {"default_numeric_value": "mean"}
cluster = api.create_cluster(dataset, cluster_args)
check(cluster)

## Errors

Introducing a new example, let's inspect this damaged **churn telecom** dataset, into which other datasets have been merged by mistake.


In [20]:
# CHURN_FILE = 'https://static.bigml.com/csv/churn-telecom.csv' # you can also import from a remote file
CHURN_FILE = 'data/churn-telecom.csv'
display(pd.read_csv(CHURN_FILE, nrows=5))

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


The dataset should contain information about telecom accounts, where the last field **Churn** should be **True** if the user has churned and **False** otherwise.

In [84]:
dirty_churn_source = api.create_source(CHURN_FILE, \
                                       {"name": "Dirty churn",
                                        "tags": ["bindings example", "dirty churn"], \
                                        "project": project["resource"]})
api.ok(dirty_churn_source)
dirty_churn_dataset = api.create_dataset(dirty_churn_source)
api.ok(dirty_churn_dataset)

True

Inspecting the contents of the **Churn** field, we see that it has some unexpected values. We should remove the affected rows.

In [22]:
fields = Fields(dirty_churn_dataset)
print(dirty_churn_dataset['object']['rows'])
print("The Churn field contains: ", pformat(fields.fields[fields.field_id("Churn")]["summary"]))


4335
The Churn field contains:  {'categories': [['False', 2849],
                ['yes', 963],
                ['True', 483],
                ['no', 37],
                ['Falsecsv/groceries2.csv\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x000000644\x000001753\x000001753\x0000001757705\x0013250330501\x00013742\x00 '
                 '0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x

In [85]:
clean_churn_dataset = api.create_dataset(dirty_churn_dataset, { \
    "lisp_filter": "(or (= (f \"Churn\") \"True\") (= (f \"Churn\") \"False\"))", \
    "name": "Clean Churn"})
check(clean_churn_dataset)
print(clean_churn_dataset['object']['rows'])
fields = Fields(clean_churn_dataset)
print("The Churn field contains: ", pformat(fields.fields[fields.field_id("Churn")]["summary"]))

3332
The Churn field contains:  {'categories': [['False', 2849], ['True', 483]], 'missing_count': 0}
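
The filter used above keeps only the rows whose **Churn** value is "True" or "False". A local equivalent can be sketched as follows; the rows are made up for illustration and `keep_valid_churn` is a hypothetical helper. The second part checks the expected row count using the category counts reported in the dirty dataset summary.

```python
def keep_valid_churn(rows, valid_values=("True", "False")):
    """Keeps only the rows whose Churn value is one of the valid labels."""
    return [row for row in rows if row.get("Churn") in valid_values]


# Made-up rows for illustration only.
rows = [
    {"State": "KS", "Churn": "False"},
    {"State": "OH", "Churn": "yes"},
    {"State": "NJ", "Churn": "True"},
    {"State": "OK", "Churn": "no"},
]
print(keep_valid_churn(rows))

# Category counts from the dirty dataset summary above
# (the corrupted entries are left out).
dirty_counts = [["False", 2849], ["yes", 963], ["True", 483], ["no", 37]]
expected_rows = sum(count for label, count in dirty_counts
                    if label in ("True", "False"))
print(expected_rows)
```

The sum of the two valid categories is 3332, which matches the number of rows reported for the clean dataset.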


## Feature engineering
New fields can help improve the performance of models. The **Churn** dataset has several fields that can be combined to generate new features, like ratios of charge per call.

In [88]:
names = [fields.field_name(col) for col in range(0, len(fields.fields))]
prefix_fields = [name[0: -7] for name in names if name.endswith("charge")]
field_expressions = " ".join(["(/ (f \"%s charge\") (f \"%s calls\"))" % ( \
    prefix_fields[index], prefix_fields[index]) \
    for index in range(0, len(prefix_fields))])
fields_generator = [{"names": ["%s charge per call" % name for name in prefix_fields],
                     "fields": "(list %s)" % field_expressions}]
extended_dataset = api.create_dataset(clean_churn_dataset, {"new_fields": fields_generator})
api.ok(extended_dataset)

True
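
The cell above builds its expressions in a fairly compact way. The same construction can be spelled out step by step, as in this sketch. The helper names are hypothetical and the short list of field names is a subset chosen for the example; nothing is sent to BigML here.

```python
from pprint import pprint


def ratio_expression(prefix):
    """Expression dividing '<prefix> charge' by '<prefix> calls'."""
    return '(/ (f "%s charge") (f "%s calls"))' % (prefix, prefix)


def ratio_fields_generator(field_names):
    """Builds a new_fields entry with one ratio per '... charge' field."""
    suffix = " charge"
    prefixes = [name[:-len(suffix)] for name in field_names
                if name.endswith(suffix)]
    expressions = " ".join(ratio_expression(prefix) for prefix in prefixes)
    return [{"names": ["%s charge per call" % prefix
                       for prefix in prefixes],
             "fields": "(list %s)" % expressions}]


# Subset of the churn field names, chosen for the example.
names = ["Total day minutes", "Total day calls", "Total day charge",
         "Total eve calls", "Total eve charge"]
pprint(ratio_fields_generator(names))
```

Each generated name is paired, in order, with one expression in the list, which is how the new "charge per call" fields get their values.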

To check the performance, we split the dataset into train and test datasets:

In [90]:
# To ensure deterministic results, you must set the seed value
SEED = "BigML"
train_dataset = api.create_dataset(extended_dataset,
                                   {'name': 'Churn train dataset (80%)',
                                    'sample_rate': 0.8,
                                    'seed': SEED})
check(train_dataset)
# The out_of_bag flag selects the instances left out in the previous dataset
test_dataset = api.create_dataset(extended_dataset,
                                  {'name': 'Churn test dataset (20%)',
                                   'sample_rate': 0.8,
                                   'out_of_bag': True,
                                   'seed': SEED})
check(test_dataset)
print("train dataset: %s instances" % train_dataset["object"]["rows"])
print("test dataset: %s instances" % test_dataset["object"]["rows"])

train dataset: 2665 instances
test dataset: 667 instances
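
The reason the same seed and sample rate are used in both calls is that the test dataset must be the exact complement of the train dataset. The sketch below illustrates that idea locally with Python's `random` module. It is not BigML's sampling algorithm, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, sample_rate, seed):
    """Splits row indices into a seeded sample and its complement."""
    rng = random.Random(seed)
    sample_size = int(n_rows * sample_rate)
    sampled = set(rng.sample(range(n_rows), sample_size))
    in_bag = sorted(sampled)
    # The out-of-bag rows are exactly the ones the sample left out.
    out_of_bag = [index for index in range(n_rows) if index not in sampled]
    return in_bag, out_of_bag


train_rows, test_rows = split_indices(10, 0.8, "BigML")
print(len(train_rows), len(test_rows))
```

Repeating the call with the same seed yields the same two groups, and no row ever appears in both, which is what makes the evaluation on the test dataset fair.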


In [91]:
# In our example, we will exclude the new fields first,
excluded_fields = ["%s charge per call" % prefix for prefix in prefix_fields]

# alternatively, you could write the list of fields to be included
# using "input_fields"
original_model = api.create_model( \
    train_dataset,
    {'name': "Churn original fields",
     'objective_field': "Churn",
     'excluded_fields': excluded_fields})

check(original_model)
used_fields = Fields(original_model["object"]["model"]["model_fields"])
used_fields.list_fields()

[Account length                  : numeric         : 1       ]
[International plan              : categorical     : 3       ]
[Number vmail messages           : numeric         : 5       ]
[Total day calls                 : numeric         : 7       ]
[Total day charge                : numeric         : 8       ]
[Total eve calls                 : numeric         : 10      ]
[Total eve charge                : numeric         : 11      ]
[Total night minutes             : numeric         : 12      ]
[Total night calls               : numeric         : 13      ]
[Total night charge              : numeric         : 14      ]
[Total intl minutes              : numeric         : 15      ]
[Total intl calls                : numeric         : 16      ]
[Customer service calls          : numeric         : 18      ]
[Churn                           : categorical     : 19      ]


In [92]:
original_evaluation = api.create_evaluation(original_model, test_dataset, {"name": "Churn original fields"})
api.ok(original_evaluation)
print(original_evaluation['object']['result']['model']['accuracy'])

0.91454


In [93]:
# We will exclude now the original charge fields
excluded_fields = ["%s charge" % prefix for prefix in prefix_fields]

# alternatively, you could write the list of fields to be included
# using "input_fields"
ratio_model = api.create_model( \
    train_dataset,
    {'name': "Churn ratio fields",
     'objective_field': "Churn",
     'excluded_fields': excluded_fields})

check(ratio_model)
used_fields = Fields(ratio_model["object"]["model"]["model_fields"])
used_fields.list_fields()

[Account length                  : numeric         : 1       ]
[International plan              : categorical     : 3       ]
[Number vmail messages           : numeric         : 5       ]
[Total day minutes               : numeric         : 6       ]
[Total day calls                 : numeric         : 7       ]
[Total eve minutes               : numeric         : 9       ]
[Total eve calls                 : numeric         : 10      ]
[Total night minutes             : numeric         : 12      ]
[Total night calls               : numeric         : 13      ]
[Total intl minutes              : numeric         : 15      ]
[Total intl calls                : numeric         : 16      ]
[Customer service calls          : numeric         : 18      ]
[Churn                           : categorical     : 19      ]
[Total day charge per call       : numeric         : 20      ]
[Total eve charge per call       : numeric         : 21      ]
[Total night charge per call     : numeric         : 22

In [94]:
ratio_evaluation = api.create_evaluation(ratio_model, test_dataset, {"name": "Churn ratio fields"})
api.ok(ratio_evaluation)
print(ratio_evaluation['object']['result']['model']['accuracy'])

0.91754


## Model tuning
Depending on your data, some configuration choices can produce better-adapted models. As an example, using weights to balance the instances that end in churn can help the model detect the patterns for this especially interesting class.
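
One common way to balance classes is to weight each instance inversely to the frequency of its class. The sketch below shows that scheme with the class counts from the clean dataset summary (the train dataset counts are not shown in this notebook). Whether the platform computes its weights exactly this way is an assumption, and `balancing_weights` is a hypothetical helper.

```python
def balancing_weights(class_counts):
    """Weights each class inversely to its count (largest class gets 1.0)."""
    largest = max(class_counts.values())
    return {label: largest / count for label, count in class_counts.items()}


# Class counts from the clean churn dataset summary above.
counts = {"False": 2849, "True": 483}
weights = balancing_weights(counts)
print(weights)
```

With these weights, the total weight of the "True" instances equals that of the "False" instances, so the minority class is no longer outvoted during training.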

In [95]:
balanced_model = api.create_model( \
    train_dataset,
    {'name': "Churn ratio fields",
     'objective_field': "Churn",
     'balance_objective': True,
     'excluded_fields': excluded_fields})
balanced_evaluation = api.create_evaluation(balanced_model, test_dataset, {"name": "Churn ratio fields"})
api.ok(balanced_evaluation)

True

In [96]:
def get_metrics(evaluation, class_name):
    """ Returns the evaluation metrics corresponding to a particular class
    
    """
    for class_info in evaluation['object']['result']['model']['per_class_statistics']:
        if class_info["class_name"] == class_name:
            return class_info
        
print("Ratio model recall:", get_metrics(ratio_evaluation, "True")["recall"])
print("Balanced model recall:", get_metrics(balanced_evaluation, "True")["recall"])

Ratio model recall: 0.66279
Balanced model recall: 0.67442
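
Recall is the fraction of actual positives that the model identifies. As a quick check of the definition, the sketch below computes it from counts. The counts used (57 and 58 hits out of 86 churned customers) are an inference: they are consistent with the recall values printed above but are not shown in the evaluation output.

```python
def recall(true_positives, false_negatives):
    """Fraction of actual positives that were predicted as positive."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


# Inferred counts, consistent with the recall values printed above.
print(round(recall(57, 29), 5))
print(round(recall(58, 28), 5))
```

Under that reading, balancing the objective recovered one additional churned customer in the test dataset.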


## PREDICTIONS INTEGRATION
Ultimately, the goal of our models will usually be to create predictions. Predictions can be created remotely by providing the new input data.

In [32]:
input_data = {'Total day minutes': 320, 'Number vmail messages': 2}
prediction = api.create_prediction(ratio_model,
                                   input_data=input_data)
check(prediction)
print("prediction: %s" % prediction["object"]["output"])
print("confidence: %s" % prediction["object"]["confidence"])
print("path: %s" % prediction["object"]["prediction_path"]["path"])

prediction: True
confidence: 0.52627
path: [{'operator': '>', 'field': '000006', 'value': 244.67063}, {'operator': '<=', 'field': '000005', 'value': 5}]
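
The prediction path is a list of predicates that the input data satisfied on its way down the tree. A minimal sketch of how such a path can be checked locally is shown below. `follows_path` is a hypothetical helper, the ID-to-name mapping is written by hand from the field listing above, and fields missing from the input are simply treated as not matching, which is a simplification.

```python
import operator

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
}


def follows_path(path, input_data, field_names):
    """Returns True if input_data satisfies every predicate in the path."""
    for predicate in path:
        name = field_names[predicate["field"]]
        if name not in input_data:
            # Simplification: a missing input never matches a predicate.
            return False
        compare = OPERATORS[predicate["operator"]]
        if not compare(input_data[name], predicate["value"]):
            return False
    return True


# Path and input data copied from the prediction above.
path = [{"operator": ">", "field": "000006", "value": 244.67063},
        {"operator": "<=", "field": "000005", "value": 5}]
field_names = {"000006": "Total day minutes",
               "000005": "Number vmail messages"}
input_data = {"Total day minutes": 320, "Number vmail messages": 2}
print(follows_path(path, input_data, field_names))
```

The input satisfies both predicates (320 is above 244.67063 and 2 is at most 5), so the check returns `True`.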


Of course, this method involves some latency every time you make a prediction. If your predictions don't need to be immediate, you can store the input data in a file and create a batch prediction for the entire dataset built from it. We can use our test dataset to do that.

In [33]:
batch_prediction = api.create_batch_prediction(\
    ratio_model, test_dataset, \
    {"all_fields": True,
     "output_dataset": True})
check(batch_prediction)
# we could download the results as a CSV using
# api.download_batch_prediction(batch_prediction,
#     filename='my_dir/my_predictions.csv')

## Model class: using the model locally to predict
The JSON model that can be downloaded via the API has all the information needed to predict.

In [34]:
LOCAL_MODEL_FILE = "data/churn_model.json"
api.export(ratio_model, LOCAL_MODEL_FILE)

'data/churn_model.json'

The local **Model** object adds a **predict** method that can be used locally.

In [35]:
from bigml.model import Model
"""
    The **Model** object can use the contents of a Model
    previously stored in a file or
    internally download the model JSON structure once and
    store it in a local directory for further use.
"""
local_model = Model(LOCAL_MODEL_FILE)
pprint(local_model.predict(input_data, full=True))

{'confidence': 0.52627,
 'count': 232,
 'distribution': [['True', 137], ['False', 95]],
 'distribution_unit': 'categories',
 'next': 'Total eve minutes',
 'path': ['Total day minutes > 244.67063', 'Number vmail messages <= 5'],
 'prediction': 'True',
 'probability': 0.5886221807084364}
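
The confidence in this output appears to be the lower bound of a Wilson score interval computed from the distribution at the predicted node (137 "True" instances out of 232). The sketch below is a plain implementation of that formula under the assumption of a 95% interval (z = 1.96); it is not code taken from the bindings.

```python
import math


def wilson_lower_bound(successes, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion."""
    if total == 0:
        return 0.0
    proportion = successes / total
    z_squared = z * z
    centre = proportion + z_squared / (2 * total)
    margin = z * math.sqrt(
        proportion * (1 - proportion) / total
        + z_squared / (4 * total * total))
    return (centre - margin) / (1 + z_squared / total)


# Distribution at the predicted node: 137 "True" out of 232 instances.
print(wilson_lower_bound(137, 232))
```

The result is roughly 0.5263, in line with the confidence reported above, and it is lower than the raw proportion 137/232 because the bound penalizes small sample sizes.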


If you need to predict many rows at once, you can use the **BigMLer** command line, which uses this local **Model** object to create the predictions and store them in a file.

In [97]:
MODEL_ID = ratio_model['resource']
!bigmler --test data/churn-test.csv --model $MODEL_ID --output-dir predictions

[2018-09-10 23:25:48] Retrieving model. https://bigml.com/dashboard/model/5b96d2532774cb43d1002046
[2018-09-10 23:25:48] Creating local predictions.

Generated files:

 predictions
  ├─bigmler_sessions
  └─predictions.csv



Or, if you prefer the predictions to be computed remotely:

In [98]:
!bigmler --test data/churn-test.csv --model $MODEL_ID --output-dir remote-predictions --remote

[2018-09-10 23:26:46] Retrieving model. https://bigml.com/dashboard/model/5b96d2532774cb43d1002046
[2018-09-10 23:26:47] Creating test source.
[2018-09-10 23:26:51] Source created: https://bigml.com/dashboard/source/5b96e197c7736e657b001ce4
[2018-09-10 23:26:51] Creating dataset.
[2018-09-10 23:26:54] Dataset created: https://bigml.com/dashboard/dataset/5b96e19cc7736e6580000324
[2018-09-10 23:26:54] Creating batch prediction.
[2018-09-10 23:26:57] Batch prediction created: https://bigml.com/dashboard/batchprediction/5b96e19fc7736e65830002f3

Generated files:

 remote-predictions
  ├─bigmler_sessions
  ├─source_test
  ├─predictions.csv
  ├─dataset_test
  └─batch_prediction



## Workflows

#### Basic prediction workflow

To sum up, the basic prediction workflow consists of these steps:

- Upload the data to create a Source
- Summarize all data in a Dataset
- Create a Model from the Dataset
- Use the Model to produce a prediction for the new data

Using the diabetes example, this workflow can be written with the bindings as follows:

In [105]:
import csv
from bigml.api import BigML
from bigml.model import Model

api = BigML()
source = api.create_source(DIABETES_FILE, {"project": PROJECT})
api.ok(source)
dataset = api.create_dataset(source)
api.ok(dataset)
model = api.create_model(dataset)
api.ok(model)

local_model = Model(model)
with open("data/diabetes_test.csv") as test_handler:
    reader = csv.DictReader(test_handler)
    # predicting for all rows
    for input_data in reader:
        print(local_model.predict(input_data))

true
true
false
false
false
true
false
true
false
true
false
false
false
false
false
false
false
true
false
false
false
false
false
true
false
true
true
true


The same can be achieved with a single command:

In [109]:
!bigmler --train data/ext_diabetes.csv --test data/diabetes_test.csv \
         --output-dir diabetes-prediction --project-id $PROJECT \
         --name "Diabetes with bigmler"

[2018-09-10 23:44:40] Creating source.
[2018-09-10 23:44:43] Source created: https://bigml.com/dashboard/source/5b96e5c9c7736e657b001d16
[2018-09-10 23:44:43] Creating dataset.
[2018-09-10 23:44:46] Dataset created: https://bigml.com/dashboard/dataset/5b96e5ccc7736e657b001d19
[2018-09-10 23:44:46] Creating model.
[2018-09-10 23:44:50] Model created: https://bigml.com/dashboard/model/5b96e5ce92527314e100207d
[2018-09-10 23:44:50] Retrieving model. https://bigml.com/dashboard/model/5b96e5ce92527314e100207d
[2018-09-10 23:44:50] Creating local predictions.

Generated files:

 diabetes-prediction
  ├─bigmler_sessions
  ├─predictions.csv
  ├─models
  ├─dataset
  └─source



And if we want to evaluate this model, we can add the **--evaluate** flag:

In [110]:
!bigmler --train data/ext_diabetes.csv --output-dir diabetes-eval --evaluate \
         --project-id $PROJECT --name "Diabetes evaluated with BigMLer"

[2018-09-10 23:45:06] Creating source.
[2018-09-10 23:45:11] Source created: https://bigml.com/dashboard/source/5b96e5e3c7736e657b001d1c
[2018-09-10 23:45:11] Creating dataset.
[2018-09-10 23:45:14] Dataset created: https://bigml.com/dashboard/dataset/5b96e5e892527314e1002080
[2018-09-10 23:45:14] Creating model.
[2018-09-10 23:45:19] Model created: https://bigml.com/dashboard/model/5b96e5ea2774cb43c8000262
[2018-09-10 23:45:19] Creating evaluations.
[2018-09-10 23:45:22] Evaluation created: https://bigml.com/dashboard/evaluation/5b96e5f0c7736e657e000317
[2018-09-10 23:45:22] Retrieving evaluation. https://bigml.com/dashboard/evaluation/5b96e5f0c7736e657e000317

Generated files:

 diabetes-eval
  ├─bigmler_sessions
  ├─evaluation.txt
  ├─evaluation.json
  ├─models
  ├─evaluations
  ├─dataset
  └─source



#### Outliers removal workflow 
We can try to improve that performance by removing the top outliers from the dataset before modeling.
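
Conceptually, removing the top outliers means ranking the rows by anomaly score and dropping the highest ones. The sketch below shows that step locally. The scores are made up for illustration, and `drop_top_outliers` is a hypothetical helper; in the workflow itself the scores come from the anomaly detector.

```python
def drop_top_outliers(rows, scores, top_n):
    """Returns the rows left after removing the top_n highest scores."""
    ranked = sorted(range(len(rows)),
                    key=lambda index: scores[index], reverse=True)
    excluded = set(ranked[:top_n])
    return [row for index, row in enumerate(rows) if index not in excluded]


# Made-up rows and anomaly scores for illustration.
rows = ["row a", "row b", "row c", "row d", "row e"]
scores = [0.41, 0.78, 0.39, 0.66, 0.45]
print(drop_top_outliers(rows, scores, top_n=2))
```

The two rows with the highest scores ("row b" and "row d") are dropped, and the remaining rows keep their original order.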

In [116]:
DATASET_ID = dataset['resource']
!bigmler anomaly --dataset $DATASET_ID \
                 --anomaly-fields "insulin,pregnancies,plasma glucose,diabetes" \
                 --top-n 2 --anomalies-dataset out --output-dir diabetes_anomaly \
                 --project-id $PROJECT --name "Clean diabetes"

[2018-09-10 23:52:59] Retrieving dataset. https://bigml.com/dashboard/dataset/5b96e53492527314d4000246
[2018-09-10 23:53:00] Creating anomaly detector.
[2018-09-10 23:53:13] Anomaly created: https://bigml.com/dashboard/anomaly/5b96e7bc2774cb43c8000265
[2018-09-10 23:53:14] Creating dataset.
[2018-09-10 23:53:17] Dataset created: https://bigml.com/dashboard/dataset/5b96e7cb92527314e500026a

Generated files:

 diabetes_anomaly
  ├─bigmler_sessions
  ├─dataset_gen
  └─anomalies



And then we evaluate the model built on the clean dataset:

In [118]:
!bigmler --datasets diabetes_anomaly/dataset_gen --output-dir diabetes-clean-eval \
         --project-id $PROJECT --evaluate --name "Clean diabetes"

[2018-09-10 23:54:14] Retrieving dataset. https://bigml.com/dashboard/dataset/5b96e7cb92527314e500026a
[2018-09-10 23:54:14] Creating model.
[2018-09-10 23:54:22] Model created: https://bigml.com/dashboard/model/5b96e80bc7736e657b001d2b
[2018-09-10 23:54:22] Creating evaluations.
[2018-09-10 23:54:31] Evaluation created: https://bigml.com/dashboard/evaluation/5b96e815c7736e657b001d2e
[2018-09-10 23:54:31] Retrieving evaluation. https://bigml.com/dashboard/evaluation/5b96e815c7736e657b001d2e

Generated files:

 diabetes-clean-eval
  ├─bigmler_sessions
  ├─evaluation.txt
  ├─evaluation.json
  ├─models
  └─evaluations



#### Retrain with cumulative data

Usually, you start your project by uploading a sample of data and playing with it until you discover the workflow that gives you acceptable results. Then the rest of the data is uploaded, and you'd like to repeat the same process on the accumulated data. **BigMLer** can help you do that. In this example, we start with a regular model creation workflow.

In [86]:
!bigmler --train data/ext_diabetes_1.csv \
         --tag best_diabetes \
         --name "Cumulative diabetes data" \
         --project-id $PROJECT \
         --output-dir ./initial_model

[2018-09-10 21:48:31] Creating source.
[2018-09-10 21:48:34] Source created: https://bigml.com/dashboard/source/5b96ca90c7736e65830002dc
[2018-09-10 21:48:34] Creating dataset.
[2018-09-10 21:48:37] Dataset created: https://bigml.com/dashboard/dataset/5b96ca9392527314dd0002b1
[2018-09-10 21:48:37] Creating model.
[2018-09-10 21:48:40] Model created: https://bigml.com/dashboard/model/5b96ca952774cb43d100201d

Generated files:

 initial_model
  ├─bigmler_sessions
  ├─models
  ├─dataset
  └─source



And after that, new data is uploaded and the same process is reproduced on the accumulated data.

In [87]:
!bigmler retrain --add data/ext_diabetes_2.csv \
                 --model-tag best_diabetes \
                 --output-dir accumulative_retrain

[2018-09-10 21:48:45] Creating execution.
[2018-09-10 21:49:01] Execution created: https://bigml.com/dashboard/execution/5b96ca9ec7736e657b001c60

Generated files:

 accumulative_retrain
  ├─bigmler_sessions
  ├─execution
  ├─whizzml_results.txt
  └─whizzml_results.json

[2018-09-10 21:49:01] Retrieving execution. https://bigml.com/dashboard/execution/5b96ca9ec7736e657b001c60
[2018-09-10 21:49:02] Creating source.
[2018-09-10 21:49:04] Source created: https://bigml.com/dashboard/source/5b96caae2774cb43d1002020

Generated files:

 accumulative_retrain
  ├─bigmler_sessions
  ├─execution
  ├─whizzml_results.txt
  ├─whizzml_results.json
  └─source

[2018-09-10 21:49:04] Creating execution.
[2018-09-10 21:49:16] Execution created: https://bigml.com/dashboard/execution/5b96cab192527314e1001fc0

Generated files:

 accumulative_retrain
  ├─bigmler_sessions
  ├─execution
  ├─whizzml_results.txt
  ├─whizzml_results.json
  └─source

The new retrained model is: model/5b96cab73980b563dd018409.
You 