<table align="left">

  <td>
  <a href="https://colab.research.google.com/github/gogitguhan/hands-on-lab-neo4j-and-vertex-ai/blob/main/Lab%206%20-%20Vertex%20AI/vertex_ai_raw.ipynb" target="_blank">
    <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
  </a>
  </td>
  <td>
    <a href="https://github.com/gogitguhan/hands-on-lab-neo4j-and-vertex-ai/blob/main/Lab%206%20-%20Vertex%20AI/vertex_ai_raw.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/gogitguhan/hands-on-lab-neo4j-and-vertex-ai/main/Lab%206%20-%20Vertex%20AI/vertex_ai_raw.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

# Install Additional Packages
First, install the additional packages this notebook needs.

In [None]:
%pip install --quiet google-cloud-storage
%pip install --quiet google-cloud-aiplatform

# Restart the Kernel
After installing the packages, restart the notebook kernel so it can find them. Running this cell may trigger a notification that the kernel crashed. That is expected, and you can disregard it.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

# Download and Split the Data
Now let's download the data set and split it into training, validation, and test sets by reporting quarter.

In [None]:
!wget https://storage.googleapis.com/neo4j-datasets/form13/2021.csv

In [None]:
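import pandas

df = pandas.read_csv('2021.csv')

# Use the reporting quarter to assign each row to a split:
# Q1 -> TRAIN, Q2 -> VALIDATE, Q3 -> TEST
df['split'] = df['reportCalendarOrQuarter']
df['split'] = df['split'].replace(
    ['03-31-2021', '06-30-2021', '09-30-2021'],
    ['TRAIN', 'VALIDATE', 'TEST'],
)

# The date column is now encoded in 'split', so drop it from the features
df = df.drop(columns=['reportCalendarOrQuarter'])

df.to_csv('raw.csv', index=False)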
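
Any report date outside the three listed above would pass through `replace` unchanged and end up in the `split` column as a raw date string. The following is a minimal sketch of a stricter variant, run on a small made-up DataFrame rather than the real file. The helper name `assign_split` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Same three quarter-end dates as the cell above.
SPLIT_BY_QUARTER = {
    "03-31-2021": "TRAIN",
    "06-30-2021": "VALIDATE",
    "09-30-2021": "TEST",
}


def assign_split(frame, date_column="reportCalendarOrQuarter"):
    """Return a copy of `frame` with a 'split' column in place of `date_column`.

    Raises ValueError if any row has a date that is not in SPLIT_BY_QUARTER.
    """
    split = frame[date_column].map(SPLIT_BY_QUARTER)  # unmapped dates become NaN
    unmapped = frame.loc[split.isna(), date_column].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unexpected report dates: {sorted(unmapped)}")
    return frame.drop(columns=[date_column]).assign(split=split)


# Toy data, not taken from 2021.csv.
toy = pd.DataFrame({
    "reportCalendarOrQuarter": ["03-31-2021", "06-30-2021", "09-30-2021", "03-31-2021"],
    "target": [True, False, True, False],
})
toy_split = assign_split(toy)

# Count rows per split to confirm every split is populated.
split_counts = toy_split["split"].value_counts().to_dict()
print(split_counts)
```

Counting rows per split before uploading is a cheap way to catch an empty validation or test set.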

# Authenticate your Google Cloud Account
These steps will authenticate the notebook using your Google Cloud credentials.

In [None]:
# Prompt until a non-empty project ID is entered
PROJECT_ID = ''
while PROJECT_ID == '':
  PROJECT_ID = input('Enter your GCP Project ID: ')

# You can leave these defaults
STORAGE_BUCKET = PROJECT_ID + '-form13'
REGION = 'us-east1'

In [None]:
import os
os.environ['GCLOUD_PROJECT'] = PROJECT_ID

In [None]:
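try:
    # Only available when running in Colab
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
except ImportError:
    # Outside Colab (for example, Vertex AI Workbench), the environment's
    # own credentials are used instead
    pass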

# Upload to a GCP Cloud Storage Bucket

To get the data into Vertex AI, we must first put it in a bucket as a CSV.

In [None]:
from google.cloud import storage
client = storage.Client()

In [None]:
bucket = client.bucket(STORAGE_BUCKET)
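
Note that `client.bucket(...)` only builds a local reference and does not check that the bucket exists. If the bucket has not been created yet, the upload below will fail. A minimal sketch of a guard follows, assuming the bucket should live in `REGION`. The helper name `ensure_bucket` is hypothetical, while `lookup_bucket` and `create_bucket` are methods of `google.cloud.storage.Client`.

```python
def ensure_bucket(client, bucket_name, location):
    """Return the named bucket, creating it in `location` if it is missing.

    `client` is expected to behave like google.cloud.storage.Client.
    """
    bucket = client.lookup_bucket(bucket_name)  # None when the bucket does not exist
    if bucket is None:
        bucket = client.create_bucket(bucket_name, location=location)
    return bucket


# In this notebook it could be called as (assumption, not part of the original lab):
# bucket = ensure_bucket(client, STORAGE_BUCKET, REGION)
```

Bucket names are globally unique across Google Cloud, which is why the default name above is prefixed with the project ID.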

In [None]:
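filename = 'raw.csv'
# Object names in Cloud Storage always use forward slashes,
# so build the path explicitly rather than with os.path.join
upload_path = f'form13/{filename}'
blob = bucket.blob(upload_path)
blob.upload_from_filename(filename)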
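
The object path used for the upload has to match the `gs://` URI passed to Vertex AI later. One way to keep the two in sync is to build both from the same pieces, as in this sketch. The helper names and the `my-project` project ID are placeholders for illustration.

```python
import posixpath


def gcs_object_name(*parts):
    """Join path segments with forward slashes, regardless of the local OS."""
    return posixpath.join(*parts)


def gcs_uri(bucket_name, object_name):
    """Build the gs:// URI for an object in a bucket."""
    return f"gs://{bucket_name}/{object_name}"


# "my-project" is a placeholder project ID used only for illustration.
example_bucket = "my-project" + "-form13"
example_object = gcs_object_name("form13", "raw.csv")
example_uri = gcs_uri(example_bucket, example_object)
print(example_uri)  # gs://my-project-form13/form13/raw.csv
```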

# Train a Model on GCP
We'll use the original features to train an AutoML model.

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

dataset = aiplatform.TabularDataset.create(
    display_name="form13-raw",
    gcs_source=f"gs://{STORAGE_BUCKET}/form13/raw.csv",
)
dataset.wait()

print(f'\tDataset: "{dataset.display_name}"')
print(f'\tname: "{dataset.resource_name}"')

In [None]:
job = aiplatform.AutoMLTabularTrainingJob(
    display_name='form13-raw',
    optimization_prediction_type='classification'
)

In [None]:
model = job.run(
    dataset=dataset,
    target_column='target',
    predefined_split_column_name='split',
    model_display_name='form13-raw',
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
)

1000 milli node hours, or one node hour, is the minimum budget that Vertex AI allows. However, Vertex AI does not currently hold the job to that budget, so expect this job to run for about two and a half hours.
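
The budget is given in thousandths of a node hour, so converting is a matter of multiplying by 1000. A small sketch of that arithmetic, with the one node hour minimum stated above enforced (the helper name is hypothetical):

```python
MILLI_PER_NODE_HOUR = 1000
MIN_BUDGET_MILLI_NODE_HOURS = 1000  # one node hour, the minimum noted above


def node_hours_to_milli(node_hours):
    """Convert node hours to the milli node hours used by budget_milli_node_hours."""
    milli = round(node_hours * MILLI_PER_NODE_HOUR)
    if milli < MIN_BUDGET_MILLI_NODE_HOURS:
        raise ValueError("Budget must be at least one node hour")
    return milli


print(node_hours_to_milli(1))    # 1000
print(node_hours_to_milli(2.5))  # 2500
```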

We'll move on while the job runs. You can check on it later in the [Google Cloud Console](https://console.cloud.google.com/) to see the results. The output of the cell above includes a link to the specific job.