<a href="https://colab.research.google.com/github/neo4j-partners/hands-on-lab-neo4j-and-vertex-ai/blob/main/Lab%206%20-%20Vertex%20AI/vertex_ai_raw.ipynb" target="_blank">
  <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
</a>

# Install Additional Packages
First, install the packages this notebook needs.

In [1]:
%pip install --quiet google-cloud-storage
%pip install --quiet google-cloud-aiplatform


# Restart the Kernel
After installing the packages, restart the notebook kernel so it can find them. When you run this cell, you may see a notification that the kernel crashed. You can disregard it.

In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'restart': True, 'status': 'ok'}

# Download and Split the Data
Now let's download the data set and split it into training, validation, and test sets by reporting quarter.

In [1]:
!wget https://storage.googleapis.com/neo4j-datasets/form13/2021.csv

--2022-06-04 19:53:55--  https://storage.googleapis.com/neo4j-datasets/form13/2021.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.141.128, 74.125.137.128, 2607:f8b0:4023:c03::80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.141.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51951699 (50M) [text/csv]
Saving to: ‘2021.csv’


2022-06-04 19:53:56 (81.2 MB/s) - ‘2021.csv’ saved [51951699/51951699]



In [2]:
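import pandas
df = pandas.read_csv('2021.csv')

# Label each row by reporting quarter: Q1 is TRAIN, Q2 is VALIDATE, Q3 is TEST
df['split'] = df['reportCalendarOrQuarter']
df['split'] = df['split'].replace(['03-31-2021', '06-30-2021', '09-30-2021'], ['TRAIN', 'VALIDATE', 'TEST'])

# The quarter is now encoded in 'split', so drop the original column
df = df.drop(columns=['reportCalendarOrQuarter'])

# Write the file that will be uploaded to Cloud Storage
df.to_csv('raw.csv', index=False)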
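
Before uploading, it can be worth checking that every row received one of the three split labels. The sketch below is an illustration on a made-up toy frame, not the lab's data. It uses `Series.map` with a dictionary instead of `replace`, so any quarter missing from the mapping shows up as `NaN` rather than keeping its date string. The names `toy`, `quarter_to_split`, `unmapped`, and `split_counts` are hypothetical.

```python
import pandas as pd

# Toy stand-in for 2021.csv; these rows are invented for illustration only.
toy = pd.DataFrame({
    'reportCalendarOrQuarter': ['03-31-2021', '06-30-2021', '09-30-2021', '03-31-2021'],
    'target': [True, False, True, False],
})

# Same quarter-to-label mapping as the cell above, written as a dictionary.
quarter_to_split = {
    '03-31-2021': 'TRAIN',
    '06-30-2021': 'VALIDATE',
    '09-30-2021': 'TEST',
}

# map() yields NaN for any quarter that is not a key in the dictionary.
toy['split'] = toy['reportCalendarOrQuarter'].map(quarter_to_split)

# Count rows that did not get a label, and rows per label.
unmapped = int(toy['split'].isna().sum())
split_counts = toy['split'].value_counts().to_dict()

print(split_counts)
print('rows without a split label:', unmapped)
```

On the real data, the same two checks could be run on `df['split']` after the cell above. A nonzero count of unlabeled rows would suggest the file contains a quarter the mapping does not cover.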

# Authenticate your Google Cloud Account
These steps will authenticate the notebook using your Google Cloud credentials.

In [3]:
# Edit these variables!
PROJECT_ID = 'useful-patrol-352218'
STORAGE_BUCKET = 'form13sdjfoiergeoirj'

#PROJECT_ID = 'YOUR-PROJECT-ID'
#STORAGE_BUCKET = 'NAME-OF-BUCKET-FROM-PREVIOUS-LAB'

# You can leave this default
REGION = 'us-central1'

In [4]:
import os
os.environ['GCLOUD_PROJECT'] = PROJECT_ID

In [5]:
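try:
    # Only available when running in Colab
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
except ImportError:
    # Not in Colab; rely on the environment's default credentials
    pass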

# Upload to a GCP Cloud Storage Bucket

To get the data into Vertex AI, we must first put it in a bucket as a CSV.

In [6]:
from google.cloud import storage
client = storage.Client()

In [7]:
bucket = client.bucket(STORAGE_BUCKET)

In [8]:
filename = 'raw.csv'
upload_path = os.path.join('form13', filename)
blob = bucket.blob(upload_path)
blob.upload_from_filename(filename)
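
The cell above builds the object name with `os.path.join`, which uses the host operating system's path separator. That works on Colab, which runs Linux, but Cloud Storage object names and `gs://` URIs always use forward slashes. A minimal sketch of an OS-independent alternative follows. The helper `gcs_uri` is a hypothetical name, not part of the Cloud Storage client library, and the bucket name is the placeholder from earlier in this notebook.

```python
import posixpath


def gcs_uri(bucket_name, *parts):
    """Build a gs:// URI using forward slashes regardless of the host OS."""
    return 'gs://' + posixpath.join(bucket_name, *parts)


# Object name inside the bucket, always joined with '/'.
object_name = posixpath.join('form13', 'raw.csv')

# Full URI of the same object, as a dataset source would reference it.
uri = gcs_uri('NAME-OF-BUCKET-FROM-PREVIOUS-LAB', 'form13', 'raw.csv')

print(object_name)
print(uri)
```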

# Train a Model on GCP
We'll use the original features to train an AutoML model.

In [9]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

dataset = aiplatform.TabularDataset.create(
    display_name="form13-raw",
    gcs_source=os.path.join("gs://", STORAGE_BUCKET, 'form13', 'raw.csv'),
)
dataset.wait()

print(f'\tDataset: "{dataset.display_name}"')
print(f'\tname: "{dataset.resource_name}"')

Creating TabularDataset
Create TabularDataset backing LRO: projects/701617915854/locations/us-central1/datasets/6276007973298896896/operations/8050753399911612416
TabularDataset created. Resource name: projects/701617915854/locations/us-central1/datasets/6276007973298896896
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/701617915854/locations/us-central1/datasets/6276007973298896896')
	Dataset: "form13-raw"
	name: "projects/701617915854/locations/us-central1/datasets/6276007973298896896"


In [10]:
job = aiplatform.AutoMLTabularTrainingJob(
    display_name='form13-raw',
    optimization_prediction_type='classification'
)

In [None]:
model = job.run(
    dataset=dataset,
    target_column='target',
    predefined_split_column_name='split',
    model_display_name='form13-raw',
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
)

No column transformations provided, so now retrieving columns from dataset in order to set default column transformations.
The column transformation of type 'auto' was set for the following columns: ['nameOfIssuer', 'cusip', 'split', 'filingManager', 'value', 'shares'].
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1869090052626186240?project=701617915854
AutoMLTabularTrainingJob projects/701617915854/locations/us-central1/trainingPipelines/1869090052626186240 current state:
PipelineState.PIPELINE_STATE_PENDING
AutoMLTabularTrainingJob projects/701617915854/locations/us-central1/trainingPipelines/1869090052626186240 current state:
PipelineState.PIPELINE_STATE_PENDING
AutoMLTabularTrainingJob projects/701617915854/locations/us-central1/trainingPipelines/1869090052626186240 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/701617915854/locations/us-central1/trainingPipelines/1869090052626186240 current state

A budget of 1000 milli node hours, or one node hour, is the minimum that Vertex AI allows. However, at the time of writing Vertex AI does not appear to hold the job to that budget. Expect this job to run for about two and a half hours.
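
The budget is expressed in thousandths of a node hour, so the arithmetic is a division by 1000. The sketch below spells that out. The helper names are hypothetical and are not part of the Vertex AI SDK.

```python
def milli_node_hours_to_node_hours(milli_node_hours):
    """Convert a budget in milli node hours to node hours."""
    return milli_node_hours / 1000


def node_hours_to_milli_node_hours(node_hours):
    """Convert node hours to the integer milli node hour value the job expects."""
    return int(round(node_hours * 1000))


# The budget passed to job.run() above.
budget_milli_node_hours = 1000
budget_node_hours = milli_node_hours_to_node_hours(budget_milli_node_hours)

# Going the other way: a hypothetical two and a half node hour budget.
larger_budget = node_hours_to_milli_node_hours(2.5)

print(budget_node_hours)
print(larger_budget)
```

Note that a node hour measures compute consumed, not wall-clock time, so the budget figure and the elapsed time of the job need not match.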

We're going to move on while that runs.  You can check on the job later in the [Google Cloud Console](https://console.cloud.google.com/) to see the results.  There's a link to the specific job in the output of the cell above.