<a href="https://colab.research.google.com/github/neo4j-partners/hands-on-lab-neo4j-and-vertex-ai/blob/main/Lab%206%20-%20Vertex%20AI/vertex_ai_raw.ipynb" target="_blank">
  <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
</a>

# Install Additional Packages
First, install the Google Cloud client libraries this notebook uses.

In [None]:
!pip install --quiet google-cloud-storage
!pip install --quiet google-cloud-aiplatform

# Restart the Kernel
After installing the packages, restart the notebook kernel so it can find them. When you run this cell, you may see a notification that the kernel crashed. You can disregard it.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

# Split the Data
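Now let's download the dataset and split it into a training set and a test set. The first two quarters of 2021 form the training set, and the third quarter is held out as the test set.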

In [None]:
!wget https://storage.googleapis.com/neo4j-datasets/form13/form13.csv

In [None]:
import pandas
df = pandas.read_csv('form13.csv')

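# Train on the filings reported for Q1 and Q2 of 2021.
# DataFrame.append was removed in pandas 2.0, so use pandas.concat instead.
train = pandas.concat([
    df.loc[df['reportCalendarOrQuarter'] == '03-31-2021'],
    df.loc[df['reportCalendarOrQuarter'] == '06-30-2021'],
])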
train.to_csv('train.csv', index=False)

test = df.loc[df['reportCalendarOrQuarter'] == '09-30-2021']
test.to_csv('test.csv', index=False)
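
Because the split is by reporting quarter, no quarter should end up in both files. The sketch below runs the same kind of filter on a toy DataFrame and checks for overlap. The toy rows and the `train_quarters` / `test_quarters` names are illustrative assumptions, not part of the lab data.

```python
import pandas

# Toy stand-in for form13.csv; only the quarter column matters here.
toy = pandas.DataFrame({
    'reportCalendarOrQuarter': ['03-31-2021', '06-30-2021', '09-30-2021', '06-30-2021'],
    'target': [0, 1, 0, 1],
})

# Hypothetical names for the quarters on each side of the split.
train_quarters = ['03-31-2021', '06-30-2021']
test_quarters = ['09-30-2021']

toy_train = toy.loc[toy['reportCalendarOrQuarter'].isin(train_quarters)]
toy_test = toy.loc[toy['reportCalendarOrQuarter'].isin(test_quarters)]

# No quarter should appear on both sides of the split.
overlap = set(toy_train['reportCalendarOrQuarter']) & set(toy_test['reportCalendarOrQuarter'])
print(len(toy_train), len(toy_test), overlap)
```

The same overlap check can be run against the real `train` and `test` frames before writing the CSV files.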

# Authenticate your Google Cloud Account

In [None]:
# Edit these variables!
PROJECT_ID = 'YOUR-PROJECT-ID'
STORAGE_BUCKET = 'NAME-OF-BUCKET-FROM-PREVIOUS-LAB'

# You can leave these defaults
REGION = 'us-central1'

In [None]:
import os
os.environ['GCLOUD_PROJECT'] = PROJECT_ID

In [None]:
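try:
    # This module only exists inside Colab.
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
except ImportError:
    # Outside Colab, the client libraries fall back to Application Default Credentials.
    pass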

# Upload to a GCP Cloud Storage Bucket

To get the data into Vertex AI, we must first put it in a bucket as a CSV.

In [None]:
from google.cloud import storage
client = storage.Client()

In [None]:
bucket = client.bucket(STORAGE_BUCKET)

In [None]:
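# Upload our files to the 'raw/' prefix of that bucket.
# GCS object names always use '/', so build them explicitly rather than with os.path.join.
for filename in ['train.csv', 'test.csv']:
    upload_path = f'raw/{filename}'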
    blob = bucket.blob(upload_path)
    blob.upload_from_filename(filename)
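
The uploaded objects are later referenced by their `gs://` URIs. As a minimal sketch, a small helper can build those URIs with forward slashes on any operating system. `gcs_uri` is a hypothetical helper for illustration, not part of `google-cloud-storage`, and `my-bucket` is a placeholder bucket name.

```python
def gcs_uri(bucket_name, *parts):
    """Build a gs:// URI from a bucket name and object path segments.

    Hypothetical helper for illustration; not part of google-cloud-storage.
    """
    # GCS object names use '/' as the separator on every platform.
    return f"gs://{bucket_name}/{'/'.join(parts)}"

print(gcs_uri('my-bucket', 'raw', 'train.csv'))  # gs://my-bucket/raw/train.csv
```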

# Train and Deploy a Model on GCP
We'll use the original features to train an AutoML model.

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

dataset = aiplatform.TabularDataset.create(
    display_name="form13-raw",
    gcs_source=f'gs://{STORAGE_BUCKET}/raw/train.csv',
)
dataset.wait()

print(f'\tDataset: "{dataset.display_name}"')
print(f'\tname: "{dataset.resource_name}"')

In [None]:
job = aiplatform.AutoMLTabularTrainingJob(
    display_name='form13-raw',
    optimization_prediction_type='classification'
)

In [None]:
model = job.run(
    dataset=dataset,
    target_column='target',
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    model_display_name='form13-raw',
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
)
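
The `budget_milli_node_hours` value is expressed in thousandths of a node hour, and the three split fractions are meant to cover the whole dataset. The sketch below shows that arithmetic. `node_hours` and `check_split` are hypothetical helpers written for this illustration, not functions from the Vertex AI SDK.

```python
def node_hours(budget_milli_node_hours):
    """Convert an AutoML budget from milli node hours to node hours."""
    return budget_milli_node_hours / 1000

def check_split(training, validation, test, tolerance=1e-9):
    """Return True when the three split fractions add up to 1."""
    return abs(training + validation + test - 1.0) <= tolerance

print(node_hours(1000))            # 1.0
print(check_split(0.8, 0.1, 0.1))  # True
```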

In [None]:
endpoint = model.deploy(machine_type='n1-standard-4')

The training job runs for about an hour: its budget of 1,000 milli node hours (one node hour) is the minimum for an AutoML tabular job. We're going to move on while it runs. You can check on the job later in the [Google Cloud Console](https://console.cloud.google.com/) to see the results. The output of the `job.run` cell includes a link to the specific job.


Once training finishes, the `model.deploy` command creates an endpoint that serves online predictions.
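
Once the endpoint is up, it can be queried through its `predict` method, which takes a list of instances. The sketch below shows one possible way to prepare rows for it. `to_instance` and `predict_rows` are hypothetical helpers, and sending every feature value as a string is an assumption about what the deployed AutoML model expects, so check the model's schema before relying on it.

```python
def to_instance(row):
    """Turn a dict of column values into a prediction instance.

    Assumption: the model accepts feature values as strings. The label
    column ('target') is dropped because it is what the model predicts.
    """
    return {name: str(value) for name, value in row.items() if name != 'target'}

def predict_rows(endpoint, rows):
    """Send rows to an object exposing predict(instances=...) and return its predictions."""
    instances = [to_instance(row) for row in rows]
    response = endpoint.predict(instances=instances)
    return response.predictions

print(to_instance({'reportCalendarOrQuarter': '09-30-2021', 'target': 1}))
```

With the objects from this notebook, a call might look like `predict_rows(endpoint, test.head(3).to_dict('records'))`.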