In [None]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting started with AutoML Tables

<table align="left">
  <td>
    <a href="https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-keras">
      <img src="https://cloud.google.com/_static/images/cloud/icons/favicons/onecloud/super_cloud.png"
           alt="Google Cloud logo" width="32px"> Read on cloud.google.com
    </a>
  </td>
  <td>
    <a href="">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Overview

[Google’s AutoML](https://cloud.google.com/automl-tables/) lets software engineers build high-quality models without needing to know how to build, train, or deploy and serve models in the cloud. Instead, you only need to know how to curate a dataset, evaluate results, and follow the how-to steps.

<img src="https://cloud.google.com/images/automl-tables/automl-table.svg" alt="AutoML tables" width="600px">

AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. AutoML Tables uses tabular (structured) data to train a machine learning model to make predictions on new data. One column from your dataset, called the target, is what your model will learn to predict. Some number of the other data columns are inputs (called features) that the model will learn patterns from. 
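
The split between target and features can be illustrated with a minimal sketch in plain Python. The three-row sample and the helper name `split_features_and_target` are made up for illustration; AutoML Tables performs this separation for you once a target column is set.

```python
import csv
import io

# Hypothetical three-row sample; column names mirror the census data used later.
SAMPLE_CSV = """age,education,hours_per_week,income
39,Bachelors,40,<=50K
50,Masters,60,>50K
28,HS-grad,35,<=50K
"""

TARGET_COLUMN = "income"  # the column the model learns to predict


def split_features_and_target(rows, target_column):
    """Separate each row into a features dict and a target value."""
    features, targets = [], []
    for row in rows:
        row = dict(row)  # copy so the caller's row is not mutated
        targets.append(row.pop(target_column))
        features.append(row)
    return features, targets


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features, targets = split_features_and_target(rows, TARGET_COLUMN)
print(features[0])
print(targets)
```

Every column other than the target stays in the feature dictionaries, which is the role the remaining columns play during training.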

In this notebook, we will use the [Google Cloud SDK AutoML Python API](https://cloud.google.com/automl-tables/docs/client-libraries) to create a binary classification model using real data from the [Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).

We will provide the training and evaluation dataset. Once the dataset is created, we will use the AutoML API to create the model and then predict whether a given individual has an income above or below 50k, given information such as the person's age, education level, marital status, and occupation.

For setting up a Google Cloud Platform (GCP) account for using AutoML, please see the online documentation for [Getting Started](https://cloud.google.com/automl-tables/docs/quickstart).


### Dataset

This tutorial uses the [United States Census Income
Dataset](https://archive.ics.uci.edu/ml/datasets/census+income) provided by the
[UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/index.php), containing information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year. The dataset consists of over 30k rows, where each row corresponds to a different person. For a given row, there are 14 features that the model conditions on to predict the income of the person. A few of the features are named above, and the exhaustive list can be found at the dataset link above.

## Before you begin

Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:
* Create a Google Cloud Platform (GCP) project.
* Enable billing.
* Apply to whitelist your project.
* Enable AutoML API.

You also need to upload your data into [Google Cloud Storage](https://cloud.google.com/storage/) (GCS) or [BigQuery](https://cloud.google.com/bigquery/). 
For example, to use GCS as your data source:

* [Create a GCS bucket](https://cloud.google.com/storage/docs/creating-buckets).
* Upload the training and batch prediction files.



---



## Instructions

You must do several things before you can train and deploy a model in
AutoML:


 * Set up your local development environment (optional).
 * Set the project ID and compute region.
 * Authenticate your GCP account.
 * Import the Python API SDK and create a client instance.
 * Create a dataset instance and import the data.
 * Create a model instance and train the model.
 * Evaluate the trained model.
 * Deploy the model on the cloud for online predictions.
 * Make online predictions.
 * Undeploy the model.


### Set up your local development environment

**If you are using Colab or AI Platform Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

### Set up your GCP project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager)

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the AutoML API ("AutoML API")](https://console.cloud.google.com/flows/enableapi?apiid=automl.googleapis.com)

4. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.
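
Because `$PROJECT_ID` is interpolated into the shell command verbatim, a placeholder that was never replaced is passed straight to `gcloud`. A minimal sketch of a sanity check follows. It assumes the standard project ID format (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen), and the helper name `looks_like_project_id` is hypothetical.

```python
import re

# Standard GCP project ID format: 6-30 characters, lowercase letters, digits,
# and hyphens; starts with a letter and does not end with a hyphen.
# Legacy domain-scoped IDs (for example "example.com:my-project") are not covered.
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def looks_like_project_id(project_id):
    """Return True if the string matches the standard project ID format."""
    return bool(_PROJECT_ID_RE.match(project_id))


print(looks_like_project_id("<your-project-id>"))  # False: placeholder left unchanged
print(looks_like_project_id("my-automl-project"))  # True
```

The check only inspects the format; it does not confirm that the project exists or that you have access to it.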

In [None]:
PROJECT_ID = "<your-project-id>" # @param {type:"string"}
COMPUTE_REGION = "us-central1" # Currently only supported region.
! gcloud config set project $PROJECT_ID

### Authenticate your GCP account

**If you are using AI Platform Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the GCP Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. From the **Service account** drop-down list, select **New service account**.

3. In the **Service account name** field, enter a name.

4. From the **Role** drop-down list, select
   **AutoML > AutoML Admin** and
   **Storage > Storage Object Admin**.

5. Click *Create*. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.
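
After the variable is set, a quick local check can catch a wrong path or a wrong file before any API call fails. The sketch below relies on the fields a downloaded service account key normally contains (`type`, `project_id`, `client_email`, `private_key`); the helper names are hypothetical and the check does not contact GCP.

```python
import json
import os


def key_problems(key):
    """Return a list of problems found in a parsed service account key."""
    if not isinstance(key, dict):
        return ["key file does not contain a JSON object"]
    problems = []
    if key.get("type") != "service_account":
        problems.append('"type" is not "service_account"')
    for field in ("project_id", "client_email", "private_key"):
        if not key.get(field):
            problems.append("missing field {!r}".format(field))
    return problems


def check_service_account_key(path):
    """Return a list of problems with the key file at `path` (empty if none)."""
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return ["no file at {!r}".format(path)]
    try:
        with open(path) as f:
            key = json.load(f)
    except ValueError:  # raised for invalid JSON
        return ["{!r} is not valid JSON".format(path)]
    return key_problems(key)


print(check_service_account_key(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")))
```

An empty list means the file looks like a service account key; it says nothing about whether the key is still valid or has the required roles.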

In [None]:
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

if 'google.colab' in sys.modules:    
  from google.colab import files
  keyfile_upload = files.upload()
  keyfile = list(keyfile_upload.keys())[0]
  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
  %env GOOGLE_APPLICATION_CREDENTIALS /path/to/service_account.json

### Install the client library
Run the following cell.

In [None]:
%pip install google-cloud-automl

### Import libraries and define constants

First, import the Python libraries required for training.
The code example below imports the AutoML Python API module into a Python script.

In [None]:
# AutoML library
from google.cloud import automl_v1beta1 as automl

import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
import matplotlib.pyplot as plt

## Quickstart for AutoML tables

This section of the tutorial walks you through creating an AutoML Tables client.

You first need to create an instance of `TablesClient`.
This client instance is the HTTP request/response interface between the Python script and the GCP AutoML service.

### Create API Client to AutoML Service

**If you are using AI Platform Notebooks or Colab** and your environment is already
authenticated through `GOOGLE_APPLICATION_CREDENTIALS`, run this step.

In [None]:
client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)

**If you are using Colab or Jupyter** and you have defined a service account,
you can instead create the AutoML client from the service account key file.

The following cell shows this alternative way to create the API client; uncomment it to use it.

In [None]:
# from google.oauth2 import service_account
# credentials = service_account.Credentials.from_service_account_file('/path/to/service_account.json')
# client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION, credentials=credentials)

---

List datasets in your project:

In [None]:
# List datasets in Project
list_datasets = client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets

You can also print the list of your models by running the following cell.

In [None]:
list_models = client.list_models()
models = { model.display_name: model.name for model in list_models }
models



---



### Create a dataset

Now we are ready to create a dataset instance (on GCP) using the client method `create_dataset()`. This method has one required parameter, the human-readable display name `dataset_display_name`.

Select a dataset display name and pass your table source information to create a new dataset.

In [None]:
# Create dataset

dataset_display_name = 'census' 
dataset = client.create_dataset(dataset_display_name)
dataset

### Import data

You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [census_income dataset](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv)
as your training data. The code below automatically copies the data into a bucket you own. You are free to adjust the value of `GCS_STORAGE_BUCKET` as needed.
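
If you change `GCS_STORAGE_BUCKET`, it is easy to end up with a doubled `gs://` prefix or a stray slash in the derived URIs. The following is a minimal sketch of a helper that normalizes the pieces before joining them; `gcs_uri` is a hypothetical name and is not part of any Google client library.

```python
def gcs_uri(bucket, *parts):
    """Join a bucket (with or without the gs:// prefix) and object path parts."""
    name = bucket[len("gs://"):] if bucket.startswith("gs://") else bucket
    name = name.strip("/")
    if not name or "/" in name:
        raise ValueError("expected a bare bucket name, got {!r}".format(bucket))
    # Drop empty parts and surrounding slashes so the result has single separators.
    path = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return "gs://{}/{}".format(name, path) if path else "gs://{}".format(name)


print(gcs_uri("gs://my-project-codelab-data-storage/", "census_income.csv"))
print(gcs_uri("my-project-codelab-data-storage"))
```

Note that the helper strips trailing slashes, so append one yourself where a directory-style prefix is expected.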

In [None]:
GCS_STORAGE_BUCKET = 'gs://{}-codelab-data-storage'.format(PROJECT_ID)
GCS_DATASET_URI = '{}/census_income.csv'.format(GCS_STORAGE_BUCKET)
! gsutil ls $GCS_STORAGE_BUCKET || gsutil mb -l $COMPUTE_REGION $GCS_STORAGE_BUCKET
! gsutil cp gs://cloud-ml-data-tables/notebooks/census_income.csv $GCS_DATASET_URI

Import the data into the dataset. This process may take a while, depending on your data. Once it completes, you can verify the status by printing the dataset object. This time, pay attention to the `example_count` field, which should show 32561 records.
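
To cross-check the reported `example_count`, you can count the data rows in a local copy of the CSV. This is a minimal sketch that assumes the file has a single header row; the two-row sample stands in for the real file, and `count_data_rows` is a hypothetical helper.

```python
import csv
import io


def count_data_rows(csv_file):
    """Count non-empty rows after the header in an open CSV file object."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


# Hypothetical two-row sample standing in for a local copy of census_income.csv.
sample = io.StringIO("age,income\n39,<=50K\n50,>50K\n")
print(count_data_rows(sample))  # 2
```

With a downloaded copy, open it with `open('census_income.csv', newline='')` and compare the result with `dataset.example_count`.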

In [None]:
import_data_operation = client.import_data(
    dataset=dataset,
    gcs_input_uris=GCS_DATASET_URI
)
print('Dataset import operation: {}'.format(import_data_operation))

# Synchronous check of operation status. Wait until import is done.
import_data_operation.result()
dataset = client.get_dataset(dataset_name=dataset.name)
dataset

### Review the data specs

In [None]:
# List table specs
list_table_specs_response = client.list_table_specs(dataset=dataset)
table_specs = [s for s in list_table_specs_response]

# List column specs
list_column_specs_response = client.list_column_specs(dataset=dataset)
column_specs = {s.display_name: s for s in list_column_specs_response}

# Print Features and data_type:

features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) for key, value in column_specs.items()]
print('Feature list:\n')
for feature in features:
    print(feature[0],':', feature[1])

In [None]:
# Table schema pie chart.

type_counts = {}
for column_spec in column_specs.values():
  type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
  type_counts[type_name] = type_counts.get(type_name, 0) + 1
    
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()

___

### Update dataset: assign a label column and enable nullable columns

This section is important, as it is where you specify which column (meaning which feature) you will use as your label. This label feature will then be predicted using all other features in the row.

AutoML Tables automatically detects your data column type. For example, for the [census_income](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) dataset, it detects `income` to be categorical (as it is just either over or under 50k) and `age` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.
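
To spot numeric columns that probably encode categories, a rough heuristic is to look for columns whose values are all whole numbers with few distinct values. The sketch below is only an illustration of that idea; `looks_categorical` and its `max_distinct` threshold are hypothetical and unrelated to how AutoML Tables itself infers types.

```python
def looks_categorical(values, max_distinct=10):
    """Heuristic: every value is a whole number and few distinct values occur."""
    distinct = set(values)
    try:
        all_whole = all(float(v).is_integer() for v in distinct)
    except ValueError:
        return False  # non-numeric text, so this numeric check does not apply
    return all_whole and len(distinct) <= max_distinct


print(looks_categorical(["0", "1", "1", "0"]))  # True
print(looks_categorical(["39.5", "50.2"]))      # False
```

A column flagged this way is only a candidate; whether to change its type to `CATEGORY` depends on what the numbers mean.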

#### Update a column: Set the type and nullability

In [None]:
update_column_response = client.update_column_spec(
    dataset=dataset,
    column_spec_display_name='income',
    type_code='CATEGORY',
    nullable=False,
)
update_column_response

**Tip:** Passing `type_code='CATEGORY'` to `update_column_spec`, as in the preceding cell, converts the column data type from `FLOAT64` to `CATEGORY`.

#### Update dataset: Assign a label

In [None]:
update_dataset_response = client.set_target_column(
    dataset=dataset,
    column_spec_display_name='income',
)
update_dataset_response

___

### Creating a model

Once we have defined our dataset and features, we will create a model.

Specify the duration of the training. For example, `train_budget_milli_node_hours=1000` runs the training for one hour. 
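
The unit is thousandths of a node hour, so the conversion is a multiplication by 1000. The sketch below makes that explicit. The helper name is hypothetical, and the default bounds (1,000 to 72,000) are an assumption about the accepted range that you should confirm against the current API reference.

```python
def to_milli_node_hours(node_hours, minimum=1000, maximum=72000):
    """Convert node hours to the milli node hours unit used by create_model.

    The default bounds are an assumption; check the API reference.
    """
    budget = int(round(node_hours * 1000))
    if not minimum <= budget <= maximum:
        raise ValueError(
            "budget {} is outside [{}, {}] milli node hours".format(
                budget, minimum, maximum
            )
        )
    return budget


print(to_milli_node_hours(1))    # 1000, the one-hour budget used in this notebook
print(to_milli_node_hours(2.5))  # 2500
```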

If your Colab times out, use `client.list_models()` to check whether your model has been created. Then use the model name to continue to the next steps. Run the following command to retrieve your model.

```python
    model = client.get_model(model_display_name=model_display_name) 
```
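
When checking the listing after a timeout, it can help to filter by display name rather than scan the output by eye. This is a minimal sketch using stand-in objects with the same `display_name` and `name` attributes the listing cells above rely on; the resource names are made up and `find_by_display_name` is a hypothetical helper.

```python
from types import SimpleNamespace


def find_by_display_name(resources, display_name):
    """Return every resource whose display_name matches, in listing order."""
    return [r for r in resources if r.display_name == display_name]


# Stand-ins for the objects returned by client.list_models(); names are made up.
listed = [
    SimpleNamespace(display_name="census_income_model", name="models/111"),
    SimpleNamespace(display_name="other_model", name="models/222"),
]

matches = find_by_display_name(listed, "census_income_model")
print([m.name for m in matches])
```

In the notebook you would pass `client.list_models()` as `resources`. An empty result suggests the model was not created before the timeout.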

In [None]:
model_display_name = 'census_income_model'

create_model_response = client.create_model(
    model_display_name,
    dataset=dataset,
    train_budget_milli_node_hours=1000,
)
print('Create model operation: {}'.format(create_model_response.operation))
# Wait until model training is done.
model = create_model_response.result()
model

___

### Model deployment

**Important**: Deploy the model, then wait until the model finishes deployment.

The model takes a while to deploy online. When the deployment call `client.deploy_model(model=model).result()` finishes, the deployment is also visible in the UI. To confirm, open the [UI](https://console.cloud.google.com/automl-tables?_ga=2.255483016.-1079099924.1550856636), navigate to the predict tab of your model, and then to the online prediction portion before running the prediction cell. You should see "online prediction" text near the top; click it to open your online prediction interface. On the far right of the screen you should see "model deployed" if the model is deployed, or a "deploying model" message if it is still deploying.

In [None]:
client.deploy_model(model=model).result()

To verify that the model has been deployed, check the `deployment_state` field. It should show `DEPLOYED`.
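
If you want to wait for that state programmatically, one option is a small polling loop. The sketch below is generic: `get_state` is a hypothetical callable that you supply and that returns the current state as a string. How you derive that string from `model.deployment_state` depends on the client library version, so that part is left out.

```python
import time


def wait_for_state(get_state, wanted="DEPLOYED", timeout_s=1800, poll_s=30,
                   sleep=time.sleep):
    """Poll get_state() until it returns `wanted` or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == wanted:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "state is still {!r} after {} seconds".format(state, timeout_s)
            )
        sleep(poll_s)


# Stand-in for a real status lookup: reports two states in sequence.
fake_states = iter(["UNDEPLOYED", "DEPLOYED"])
print(wait_for_state(lambda: next(fake_states), poll_s=0))
```

The `sleep` parameter exists so the loop can be exercised without real delays; in normal use, leave it at its default.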

In [None]:
model = client.get_model(model_name=model.name)
model

Run the prediction only after the model finishes deployment.

### Make an Online prediction

You can toggle exactly which values you want for all of the numeric features, and choose from the drop down windows which values you want for the categorical features.

Note: If the model has not finished deployment, the prediction will NOT work.
The following cells show you how to make an online prediction. 

In [None]:
#@title Make an online prediction: set the categorical variables{ vertical-output: true }
# Import display explicitly so the cell does not rely on notebook built-ins.
from IPython.display import display
import ipywidgets as widgets

workclass_ids = ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']
education_ids = ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']
marital_status_ids = ['Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse']
occupation_ids = ['Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces']
relationship_ids = ['Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried']
race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']
sex_ids = ['Female', 'Male']
native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands']

workclass = widgets.Dropdown(
    options=workclass_ids, 
    value=workclass_ids[0],
    description='workclass:'
)

education = widgets.Dropdown(
    options=education_ids, 
    value=education_ids[0],
    description='education:', 
    width='500px'
)
 
marital_status = widgets.Dropdown(
    options=marital_status_ids, 
    value=marital_status_ids[0],
    description='marital status:', 
    width='500px'
)

occupation = widgets.Dropdown(
    options=occupation_ids, 
    value=occupation_ids[0],
    description='occupation:', 
    width='500px'
)

relationship = widgets.Dropdown(
    options=relationship_ids, 
    value=relationship_ids[0],
    description='relationship:', 
    width='500px'
)

race = widgets.Dropdown(
    options=race_ids, 
    value=race_ids[0],                           
    description='race:', 
    width='500px'
)

sex = widgets.Dropdown(
    options=sex_ids, 
    value=sex_ids[0],
    description='sex:', 
    width='500px'
)

native_country = widgets.Dropdown(
    options=native_country_ids, 
    value=native_country_ids[0],
    description='native_country:', 
    width='500px'
)

display(workclass)
display(education)
display(marital_status)
display(occupation)
display(relationship)
display(race)
display(sex)
display(native_country)

Adjust the sliders on the right to the desired test values for your online prediction.

In [None]:
#@title Make an online prediction: set the numeric variables{ vertical-output: true }

age = 34 #@param {type:'slider', min:1, max:100, step:1}
capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}
capital_loss = 3.8 #@param {type:'slider', min:0, max:4000, step:0.1}
fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}
education_num = 9 #@param {type:'slider', min:1, max:16, step:1}
hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}

After choosing the desired test values above, run the following cell to request an online prediction.
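
If you build the `inputs` dictionary by hand instead of from the dropdowns, a misspelled categorical value is easy to miss. The following is a minimal sketch of a local check against the allowed values; `invalid_categories` is a hypothetical helper and only a subset of the columns is shown.

```python
def invalid_categories(inputs, allowed):
    """Return {column: value} for inputs whose value is not in the allowed list."""
    return {
        column: inputs[column]
        for column, values in allowed.items()
        if column in inputs and inputs[column] not in values
    }


# Subset of the categorical value lists defined in the widget cell above.
allowed = {
    "sex": ["Female", "Male"],
    "race": ["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"],
}

print(invalid_categories({"age": 34, "sex": "male", "race": "White"}, allowed))
```

Here `"male"` is reported because the allowed value is capitalized; numeric columns such as `age` are ignored because they have no entry in `allowed`.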

In [None]:
inputs = {
    'age': age,
    'workclass': workclass.value,
    'fnlwgt': fnlwgt,
    'education': education.value,
    'education_num': education_num,
    'marital_status': marital_status.value,
    'occupation': occupation.value,
    'relationship': relationship.value,
    'race': race.value,
    'sex': sex.value,
    'capital_gain': capital_gain,
    'capital_loss': capital_loss,
    'hours_per_week': hours_per_week,
    'native_country': native_country.value,
}

prediction_result = client.predict(model=model, inputs=inputs)
prediction_result

#### Get Prediction 

The `prediction_result` object is a `google.cloud.automl_v1beta1.types.PredictResponse`. We iterate over its payload to create a list of (score, label) tuples, sort the list by descending score, and display the top entry.

In [None]:
predictions = [(prediction.tables.score, prediction.tables.value.string_value) for prediction in prediction_result.payload]
predictions = sorted(predictions, key=lambda tup: (tup[0],tup[1]), reverse=True)
print('Prediction is: ', predictions[0])
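
When only the top label is needed, `max` avoids sorting the whole list. The sketch below uses stand-in objects that mimic the attribute path read above (`tables.score` and `tables.value.string_value`); the labels and scores are made up for illustration.

```python
from types import SimpleNamespace


def make_entry(label, score):
    """Build a stand-in with the same attribute path as a payload entry."""
    return SimpleNamespace(
        tables=SimpleNamespace(
            score=score,
            value=SimpleNamespace(string_value=label),
        )
    )


# Made-up payload with one entry per class.
payload = [make_entry("<=50K", 0.27), make_entry(">50K", 0.73)]

best = max(payload, key=lambda entry: entry.tables.score)
print(best.tables.value.string_value, best.tables.score)
```

With a real response, pass `prediction_result.payload` in place of the stand-in list.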

Undeploy the model

In [None]:
undeploy_model_response = client.undeploy_model(model=model)

### Batch prediction

#### Initialize prediction

Your data source for batch prediction can be GCS or BigQuery. 

For this tutorial, you can use: 

- [census_income_batch_prediction_input.csv](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv) as input source. 

Create a GCS bucket and upload the file into your bucket.

Some of the lines in the batch prediction input file are intentionally missing some values.
AutoML Tables logs the resulting errors in the `errors.csv` file.
Also, open the UI and create the bucket into which your predictions will be written.

The bucket's default name here is `automl-tables-pred`; replace it with your own.

**NOTE:** The client library has a bug. If the following cell raises a `TypeError: Could not convert Any to BatchPredictResult` error, ignore it.

The batch prediction output file(s) will be written to the GCS bucket that you set in the preceding cells.
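
Because some input rows are missing values, it can be useful to see which ones before launching the job. This is a minimal sketch that flags empty fields in a local copy of the input file. The sample data and the helper name are hypothetical, and which rows AutoML Tables actually reports in `errors.csv` depends on the dataset schema.

```python
import csv
import io


def rows_with_missing_values(csv_file):
    """Yield (line_number, missing_columns) for rows that have empty fields."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        missing = [
            name
            for name, value in row.items()
            if name is not None and (value is None or not value.strip())
        ]
        if missing:
            yield reader.line_num, missing


# Hypothetical sample: the second data row has no workclass value.
sample = io.StringIO("age,workclass,education\n39,Private,Bachelors\n50,,Masters\n")
print(list(rows_with_missing_values(sample)))
```

Line numbers count the header as line 1, so they match what a text editor shows for a file without multi-line fields.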

In [None]:
GCS_BATCH_PREDICT_URI = '{}/census_income_batch_prediction_input.csv'.format(GCS_STORAGE_BUCKET)
GCS_BATCH_PREDICT_OUTPUT = '{}/census_income_predictions/'.format(GCS_STORAGE_BUCKET)
! gsutil cp gs://cloud-ml-data-tables/notebooks/census_income_batch_prediction_input.csv $GCS_BATCH_PREDICT_URI

Launch the batch prediction.

In [None]:
batch_predict_response = client.batch_predict(
    model=model, 
    gcs_input_uris=GCS_BATCH_PREDICT_URI,
    gcs_output_uri_prefix=GCS_BATCH_PREDICT_OUTPUT,
)
print('Batch prediction operation: {}'.format(batch_predict_response.operation))
# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata

### Next steps

Follow the latest updates on AutoML [here](https://cloud.google.com/automl/docs/).
If you have any questions, contact us at [cloud-automl-tables-discuss](https://groups.google.com/forum/#!forum/cloud-automl-tables-discuss).