# Train a model using AutoML

* This notebook uses [AutoML](https://cloud.google.com/natural-language/automl/docs/tutorial#step_1_create_a_dataset) to train a multi-label text classification model that predicts labels for Kubeflow GitHub issues

## Environment setup

In [1]:
import logging
import os
from pathlib import Path
from importlib import reload
import sys
import notebook_setup

notebook_setup.setup()

Adding /home/jovyan/git_kubeflow-code-intelligence/py to python path


In [2]:
import subprocess 
# TODO(jlewi): Get the project using fairing?
# PROJECT = subprocess.check_output(["gcloud", "config", "get-value", "project"]).strip().decode()
PROJECT = "issue-label-bot-dev"

In [3]:
!pip install --user google-cloud-automl

You should consider upgrading via the 'pip install --upgrade pip' command.


## Create the AutoML dataset

In [4]:
# TODO(jlewi): How do we check if the dataset already exists and whether it already has data
from google.cloud import automl
import logging

display_name = "kubeflow_issues_with_repo"

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(PROJECT, "us-central1")
# Specify the classification type
# Types:
# MultiLabel: Multiple labels are allowed for one example.
# MultiClass: At most one label is allowed per example.
metadata = automl.types.TextClassificationDatasetMetadata(
    classification_type=automl.enums.ClassificationType.MULTILABEL
)
dataset = automl.types.Dataset(
    display_name=display_name,
    text_classification_dataset_metadata=metadata,
)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(project_location, dataset)

created_dataset = response.result()

# Display the dataset information
logging.info("Dataset name: {}".format(created_dataset.name))
dataset_id = created_dataset.name.split("/")[-1]
logging.info(f"Dataset id: {dataset_id}")

Dataset name: projects/976279526634/locations/us-central1/datasets/TCN7927110499869655040
Dataset id: TCN7927110499869655040
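
The TODO in the cell above asks how to check whether the dataset already exists. One possible approach, sketched here under assumptions, is to scan the project's datasets for a matching `display_name` before calling `create_dataset`. The helper `find_dataset` is hypothetical and not part of this notebook; it only needs each item to expose a `display_name` attribute, and the stand-in objects below replace the real client so the sketch runs without credentials.

```python
from types import SimpleNamespace


def find_dataset(datasets, display_name):
    """Return the first dataset whose display_name matches, else None."""
    for dataset in datasets:
        if dataset.display_name == display_name:
            return dataset
    return None


# Stand-in objects for illustration only. With the real client the iterable
# would presumably come from client.list_datasets(project_location); that
# wiring is an assumption and has not been run against this project.
fake_datasets = [
    SimpleNamespace(name="projects/p/locations/us-central1/datasets/TCN1",
                    display_name="other_dataset"),
    SimpleNamespace(name="projects/p/locations/us-central1/datasets/TCN2",
                    display_name="kubeflow_issues_with_repo"),
]

existing = find_dataset(fake_datasets, "kubeflow_issues_with_repo")
```

If `existing` is not `None`, the notebook could reuse `existing.name` instead of creating a second dataset with the same display name. This does not answer the second half of the TODO (whether the dataset already has data).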


## Prepare the dataset

* [Docs](https://cloud.google.com/natural-language/automl/docs/prepare) for preparing the dataset
* We need to create a CSV file that lists all the data files
* We need to upload each document as a text file to GCS

In [5]:
from code_intelligence import github_bigquery
recent_issues = github_bigquery.get_issues("kubeflow", PROJECT)

  Elapsed 6.22 s. Waiting...
  Elapsed 7.37 s. Waiting...
  Elapsed 8.52 s. Waiting...
  Elapsed 9.67 s. Waiting...
  Elapsed 10.8 s. Waiting...
  Elapsed 11.93 s. Waiting...
Downloading: 100%|██████████| 143075/143075 [00:38<00:00, 2852.43rows/s]
Total time taken 50.67 s.
Finished at 2020-04-30 16:13:44.


## Write the files to GCS

In [6]:
# Need to use a bucket in the same region and type as automl
data_dir = f"gs://issue-label-bot-dev_automl/automl_{dataset_id}"
issues_dir = os.path.join(data_dir, "issues")

In [7]:
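from code_intelligence import gcs_util
# github_util is used below to build the issue document but was never imported
from code_intelligence import github_util
from code_intelligence import util
from google.cloud import storage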

In [8]:
import pandas as pd
info = pd.DataFrame(columns=["url", "set", "labels"], index=range(recent_issues.shape[0]))

# Make the set an empty string because we will let AutoML assign points to the train, eval and test sets
info["set"] = ""

In [9]:
storage_client = storage.Client()

bucket_name, _ = gcs_util.split_gcs_uri(data_dir)
bucket = storage_client.get_bucket(bucket_name)

for i in range(recent_issues.shape[0]):
    issue = recent_issues.iloc[i]
    owner, repo, number = util.parse_issue_url(issue["html_url"])
    name = f"{owner}_{repo}_{number}.txt"
    target = os.path.join(issues_dir, name)

    if gcs_util.check_gcs_object(target, storage_client=storage_client):
        logging.info(f"{target} already exists")
    else:
        _, obj_path = gcs_util.split_gcs_uri(target)
        blob = bucket.blob(obj_path)

        # Include the owner and repo in the text body because it is predictive
        doc = github_util.build_issue_doc(owner, repo, issue["title"], [issue["body"]])
        blob.upload_from_string(doc)
        logging.info(f"Created {target}")

    # Use .loc rather than chained indexing (info.iloc[i]["url"] = ...),
    # which can assign to a temporary copy and leave `info` unchanged.
    info.loc[i, "url"] = target

Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_Code-Intelligence_2.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_Issue-Label-Bot_2.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_Issue-Label-Bot_3.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_arena_101.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_arena_102.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_arena_103.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_arena_104.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_arena_108.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issues/kubeflow_arena_112.txt
Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/issu
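
The object names in the log above follow the pattern `<owner>_<repo>_<number>.txt`. As a minimal sketch of that mapping, the hypothetical helper below parses an issue URL with a regular expression. It is a stand-in for `util.parse_issue_url`, whose implementation is not shown in this notebook, and it assumes URLs of the form `https://github.com/<owner>/<repo>/issues/<number>`.

```python
import re

# Assumed URL shape: https://github.com/<owner>/<repo>/issues/<number>
_ISSUE_URL = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/issues/(\d+)/?$")


def issue_object_name(html_url):
    """Map an issue URL to the '<owner>_<repo>_<number>.txt' file name."""
    match = _ISSUE_URL.match(html_url)
    if match is None:
        raise ValueError(f"Not a GitHub issue URL: {html_url}")
    owner, repo, number = match.groups()
    return f"{owner}_{repo}_{number}.txt"


name = issue_object_name("https://github.com/kubeflow/arena/issues/101")
```

Because the name is derived only from the URL, rerunning the upload loop is idempotent: an issue that was already written resolves to the same object and is skipped.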

* Create the CSV file with the data
* We don't use pandas `to_csv` because it ends up quoting the string containing the labels, e.g.

  ```
  ,gs://issue-label-bot-dev/automl_2020_0429/issues/kubeflow_website_997.txt,"area/docs, kind/feature, lifecycle/stale, priority/p2"
  ```
* But that isn't the format AutoML expects
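
One way to avoid the quoted label string is to write each label as its own CSV column. The sketch below uses the standard `csv` module and assumes AutoML accepts rows of the form `set,uri,label1,label2,...` with an empty `set` field; the helper name `write_automl_rows` and the bucket path are illustrative, not taken from this notebook.

```python
import csv
import io


def write_automl_rows(rows):
    """Serialize (set, url, labels) tuples with one CSV column per label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for ml_set, url, labels in rows:
        # Spreading the labels gives each one its own unquoted field.
        writer.writerow([ml_set, url, *labels])
    return buffer.getvalue()


csv_text = write_automl_rows([
    ("", "gs://example-bucket/issues/kubeflow_website_997.txt",
     ["area-docs", "platform-gcp"]),
])
```

Here `csv_text` holds a single line that starts with an empty field, followed by the URI and then one field per label, with no quotes around the labels.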

## Compute Target Labels

### Compute a histogram of label frequency

* AutoML requires a minimum number of examples for each label (8 for training, 1 for validation, 1 for test), so filter out labels that don't appear very often


In [10]:
from collections import Counter
label_counts = Counter()

for r in range(recent_issues.shape[0]):
    label_counts.update(recent_issues.iloc[r]["parsed_labels"])


In [11]:
#label_counts_df = pd.DataFrame({"label": label_counts.keys(), "count": label_counts.values()})
label_counts_df = pd.DataFrame(label_counts.items(), columns=["label", "count"])

In [12]:
label_counts_df.sort_values("count", ascending=False, inplace=True)


In [13]:
cutoff = 50
target_labels = label_counts_df.loc[label_counts_df["count"] > cutoff]
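
To make the counting and cutoff steps concrete, here is a small self-contained version on toy data. The function name `frequent_labels` and the toy issues are made up for illustration; the real notebook applies the same idea to `recent_issues["parsed_labels"]` with a cutoff of 50.

```python
from collections import Counter


def frequent_labels(label_lists, cutoff):
    """Return {label: count} for labels whose total count exceeds `cutoff`."""
    counts = Counter()
    for labels in label_lists:
        counts.update(labels)
    return {label: n for label, n in counts.items() if n > cutoff}


toy_issues = [
    ["area/docs", "kind/bug"],
    ["area/docs"],
    ["platform/gcp", "area/docs"],
    ["kind/bug"],
]
kept = frequent_labels(toy_issues, cutoff=1)
```

With `cutoff=1`, `platform/gcp` is dropped because it appears only once, while `area/docs` (3) and `kind/bug` (2) are kept.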

## Distinguish unlabeled vs. negative examples

* We need to distinguish between unlabeled examples and negative examples
* For example, if an issue doesn't have label "platform/gcp" that could be for one of two reasons
  1. The issue was never labeled
  1. The label platform/gcp doesn't apply
  
* A quick hack to distinguish the two is to only include area and platform labels

  * For now, at least, if one of these labels exists on an issue, it was probably applied by a human
  * This is in contrast to kind labels which could be applied by the bot or by a GitHub issue template
  
* Longer term we could look at GitHub events to infer whether data was labeled by a human

In [14]:
target_labels = target_labels[target_labels["label"].apply(lambda x: x.startswith("area") or x.startswith("platform"))]
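
The same prefix heuristic can be written without pandas, which makes it easy to check on a handful of labels. This is only a sketch of the "area and platform" hack described above; the name `likely_human_applied` is hypothetical.

```python
# Prefixes the text treats as probably applied by a human.
HUMAN_APPLIED_PREFIXES = ("area", "platform")


def likely_human_applied(label):
    """Heuristic: keep labels that start with one of the trusted prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return label.startswith(HUMAN_APPLIED_PREFIXES)


labels = ["area/jupyter", "kind/bug", "platform/gcp", "priority/p1"]
selected = [label for label in labels if likely_human_applied(label)]
```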

* Filter labels to target labels

In [15]:
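# Build the set of target label names once; testing membership against
# target_labels.values would also compare against the count column.
target_label_names = set(target_labels["label"])

def label_filter(labels):
    return [l for l in labels if l in target_label_names]

info["labels"] = recent_issues["parsed_labels"].apply(label_filter)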

In [16]:
# Compute string for automl

# AutoML doesn't allow "/" in labels; only letters, dashes, and underscores are allowed
# We need a comma-separated string, with "/" replaced by "-"
info["automl_labels"] = info["labels"].apply(lambda l: ", ".join(l).replace("/", "-"))
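
Replacing "/" with "-" is lossy: given only an AutoML name such as `area-example-code_search`, it is not obvious which dashes were originally slashes. A possible way to keep the conversion reversible is to record the mapping when the labels are rewritten. The helper `build_automl_label_map` below is a hypothetical sketch, not something this notebook defines.

```python
def build_automl_label_map(labels):
    """Map each AutoML-safe name back to the original GitHub label."""
    mapping = {}
    for label in labels:
        safe = label.replace("/", "-")
        # Two different labels collapsing to one safe name would be ambiguous.
        if safe in mapping and mapping[safe] != label:
            raise ValueError(
                f"{label!r} and {mapping[safe]!r} both map to {safe!r}")
        mapping[safe] = label
    return mapping


label_map = build_automl_label_map(["area/jupyter", "platform/gcp"])
# Joining the keys gives the same style of string as the cell above.
automl_string = ", ".join(label_map)
```

Keeping `label_map` around would let predicted class names be translated back to GitHub labels later without guessing.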

In [17]:
import datetime
import io
import csv
buffer = io.StringIO()

# AutoML seems to require at least 1 label for every issue
#labeled_rows = info.loc[info["labels"] != ""]
#labeled_rows.to_csv(buffer, columns=["set", "url", "labels"], header=False, index=False)

info.to_csv(buffer, columns=["set", "url", "automl_labels"], header=False, index=False, doublequote=False)

# for i in range(labeled_rows.shape[0]):
#     row = labeled_rows.iloc[i]    
#     buffer.write(f"{row['set']}, {row['url']}, {row['labels']}\n")
    
now = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
dataset_path = os.path.join(data_dir, f"dataset_{now}.csv")
_, obj_path = gcs_util.split_gcs_uri(dataset_path)
blob = bucket.blob(obj_path)

blob.upload_from_string(buffer.getvalue())

logging.info(f"Created {dataset_path}")

Created gs://issue-label-bot-dev_automl/automl_TCN7927110499869655040/dataset_200430_163955.csv


* Import the data to AutoML

In [18]:
from google.cloud import automl

dataset_full_id = client.dataset_path(
    PROJECT, "us-central1", dataset_id
)

# Get the multiple Google Cloud Storage URIs
input_uris = [dataset_path]
gcs_source = automl.types.GcsSource(input_uris=input_uris)
input_config = automl.types.InputConfig(gcs_source=gcs_source)
# Import data from the input URI
response = client.import_data(dataset_full_id, input_config)

logging.info(f"Processing import: operation: {response.operation.name}")

# This appears to be a blocking call
logging.info("Data imported. {}".format(response.result()))

Processing import: operation: projects/976279526634/locations/us-central1/operations/TCN1328244131213869056
Data imported. 


## Train a model

In [19]:
# A resource that represents Google Cloud Platform location.
project_location = client.location_path(PROJECT, "us-central1")
# Leave model unset to use the default base model provided by Google
metadata = automl.types.TextClassificationModelMetadata()
model = automl.types.Model(
    display_name=display_name,
    dataset_id=dataset_id,
    text_classification_model_metadata=metadata,
)

# Create a model with the model metadata in the region.
response = client.create_model(project_location, model)

print(u"Training operation name: {}".format(response.operation.name))
print("Training started...")

Training operation name: projects/976279526634/locations/us-central1/operations/TCN6900041295201304576
Training started...


In [35]:
# This is blocking
result = response.result()
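
Since `response.result()` blocks until training finishes, a polling loop with a timeout can be friendlier in a notebook. The sketch below assumes the operation object exposes `done()` and `result()` like a standard future; `wait_for_operation` and `FakeOperation` are hypothetical names, and the fake stands in for the real training operation so the example runs offline.

```python
import time


def wait_for_operation(operation, poll_seconds=30.0, timeout_seconds=6 * 3600,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll operation.done() until it finishes or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while not operation.done():
        if clock() >= deadline:
            raise TimeoutError("operation did not finish in time")
        sleep(poll_seconds)
    return operation.result()


class FakeOperation:
    """Stand-in that reports done after a fixed number of polls."""

    def __init__(self, polls_until_done, value):
        self._remaining = polls_until_done
        self._value = value

    def done(self):
        self._remaining -= 1
        return self._remaining < 0

    def result(self):
        return self._value


outcome = wait_for_operation(FakeOperation(2, "models/TCN-example"),
                             poll_seconds=0, sleep=lambda seconds: None)
```

The default poll interval and timeout are arbitrary placeholders; suitable values depend on how long training takes for this dataset.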

In [37]:
result.name

'projects/976279526634/locations/us-central1/models/TCN654213816573231104'

## Deploy a model

* We need to deploy the model before we can send prediction requests.

In [30]:
# r=client.list_models(project_location)

# for i in r:
#     logging.info({})

In [38]:
# Should be a value like "projects/976279526634/locations/us-central1/models/TCN654213816573231104"
model_name = result.name

In [55]:
model_name

'projects/976279526634/locations/us-central1/models/TCN654213816573231104'

In [39]:
deploy_response = client.deploy_model(model_name)

In [41]:
final_response = deploy_response.result()

## Send some predictions

In [45]:
prediction_client = automl.PredictionServiceClient()

In [54]:
text_snippet = automl.types.TextSnippet(
    content="tfjob isn't working. I can't run my training jobs", mime_type="text/plain"
)
payload = automl.types.ExamplePayload(text_snippet=text_snippet)

response = prediction_client.predict(model_name, payload)

for annotation_payload in response.payload:
    print(
        u"Predicted class name: {}".format(annotation_payload.display_name)
    )
    print(
        u"Predicted class score: {}".format(
            annotation_payload.classification.score
        )
    )

Predicted class name: area-operator
Predicted class score: 0.34206485748291016
Predicted class name: area-tfjob
Predicted class score: 0.050978899002075195
Predicted class name: area-example-code_search
Predicted class score: 0.018343418836593628
Predicted class name: area-testing
Predicted class score: 0.0008122622966766357
Predicted class name: area-api
Predicted class score: 0.0005573247326537967
Predicted class name: area-components
Predicted class score: 0.0003605782985687256
Predicted class name: area-sdk-dsl-compiler
Predicted class score: 0.00031942129135131836
Predicted class name: area-front-end
Predicted class score: 0.00014272332191467285
Predicted class name: area-katib
Predicted class score: 0.00014117948012426496
Predicted class name: platform-gcp
Predicted class score: 9.798812970984727e-05
Predicted class name: platform-aws
Predicted class score: 2.3524933567387052e-05
Predicted class name: area-backend
Predicted class score: 2.1026946342317387e-05
Predicted class name
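
Most of the scores above are close to zero, so a consumer of these predictions would likely keep only labels above some threshold. The sketch below shows that filtering step on stand-in objects that mimic the `display_name` and `classification.score` fields printed above. The helper names and the threshold of 0.3 are assumptions chosen for illustration, not values from this notebook.

```python
from types import SimpleNamespace


def labels_above_threshold(payload, threshold):
    """Return {display_name: score} for annotations scoring >= threshold."""
    return {
        item.display_name: item.classification.score
        for item in payload
        if item.classification.score >= threshold
    }


def fake_annotation(name, score):
    """Build a stand-in with the same attribute layout as an annotation."""
    return SimpleNamespace(display_name=name,
                           classification=SimpleNamespace(score=score))


# Scores copied from the first few lines of the output above.
fake_payload = [
    fake_annotation("area-operator", 0.34206485748291016),
    fake_annotation("area-tfjob", 0.050978899002075195),
    fake_annotation("area-testing", 0.0008122622966766357),
]
confident = labels_above_threshold(fake_payload, threshold=0.3)
```

With this threshold only `area-operator` survives; a lower threshold would also admit `area-tfjob`.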

In [57]:
response.payload.__class__

google.protobuf.pyext._message.RepeatedCompositeContainer

In [None]:
automl.types

In [61]:
from google.cloud.automl import types as automl_types

In [74]:
predict_response = automl_types.PredictResponse()

In [77]:
# `annotation` is created in cell In [70], shown at the end of this notebook
predict_response.payload.append(annotation)

In [78]:
predict_response.payload

[classification {
  score: 0.8999999761581421
}
display_name: "area-jupyter"
]

In [67]:
annotation_payload.__class__

google.cloud.automl_v1.types.AnnotationPayload

In [70]:
annotation = automl_types.AnnotationPayload()
annotation.display_name = "area-jupyter"
annotation.classification.score = .9