# Vertex AI Raw
First, we'll work with the raw data set. We'll do the following:
* Pull it from a bucket
* Split it into training, validation and test sets
* Train a classifier

## Download and Split the Data
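Let's download the data set and split it into training, validation and test sets. The split follows the reporting quarter: rows from the quarter ending 03-31-2021 go to training, 06-30-2021 to validation, and 09-30-2021 to test.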

In [None]:
!wget https://storage.googleapis.com/neo4j-datasets/form13/2021.csv

In [None]:
import pandas
df = pandas.read_csv('2021.csv')

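# Use the reporting quarter to assign each row to a split
df['split'] = df['reportCalendarOrQuarter']
df['split'] = df['split'].replace(['03-31-2021', '06-30-2021', '09-30-2021'], ['TRAIN', 'VALIDATE', 'TEST'])

# The quarter is now encoded in 'split', so drop the original column
df = df.drop(columns=['reportCalendarOrQuarter'])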

df.to_csv('raw.csv', index=False)
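
One thing to watch: `replace` leaves any quarter outside the three listed values unchanged, so an unexpected date would land in the `split` column as-is. As a minimal sketch (the helper name `assign_split` and the sample values are illustrative assumptions, not part of the original notebook), the same mapping can be written as a dictionary lookup that fails loudly on an unmapped quarter:

```python
# Quarter-end date -> split label, mirroring the replace() call above.
SPLIT_BY_QUARTER = {
    '03-31-2021': 'TRAIN',
    '06-30-2021': 'VALIDATE',
    '09-30-2021': 'TEST',
}


def assign_split(quarter):
    """Return the split label for a quarter, or raise if it is unmapped."""
    try:
        return SPLIT_BY_QUARTER[quarter]
    except KeyError:
        raise ValueError(f'No split defined for quarter {quarter!r}') from None


# Hypothetical sample values, for illustration only.
sample_quarters = ['03-31-2021', '06-30-2021', '09-30-2021', '03-31-2021']
sample_splits = [assign_split(q) for q in sample_quarters]
print(sample_splits)
```

If you prefer this stricter behaviour, something like `df['reportCalendarOrQuarter'].map(assign_split)` would apply the same check to every row, as long as it runs before the column is dropped.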

## Setup Variables
Now we need to set a few variables: the region, the project ID and the name of the storage bucket.

In [None]:
# Edit this variable!
REGION = 'us-west1'

# You can leave this as is
shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = shell_output[0]

STORAGE_BUCKET = PROJECT_ID + '-form13'
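
Because the bucket name is derived from the project ID, it may be worth a quick sanity check before using it. The sketch below is a simplified check under stated assumptions: it only tests length and allowed characters, the full Cloud Storage naming rules have more cases, and both the helper name `looks_like_bucket_name` and the sample names are hypothetical.

```python
import re

# Simplified rule: 3-63 characters; lowercase letters, digits, '-', '_' or '.';
# starting and ending with a letter or digit. Not the complete GCS rule set.
_BUCKET_NAME = re.compile(r'[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]')


def looks_like_bucket_name(name):
    """Return True if the name passes the simplified bucket-name check."""
    return _BUCKET_NAME.fullmatch(name) is not None


# Placeholder names, for illustration only.
print(looks_like_bucket_name('my-project-form13'))              # True
print(looks_like_bucket_name('example.com:my-project-form13'))  # False
```

The second name fails because of the colon, which can appear in a domain-scoped project ID.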

## Upload to a Google Cloud Storage Bucket

To get the data into Vertex AI, we must first put it in a bucket as a CSV.

In [None]:
from google.cloud import storage
client = storage.Client()

In [None]:
bucket = client.bucket(STORAGE_BUCKET)

In [None]:
import os

filename = 'raw.csv'

# Upload the local file to the 'form13' folder of the bucket
upload_path = os.path.join('form13', filename)
blob = bucket.blob(upload_path)
blob.upload_from_filename(filename)
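
Cloud Storage object names use forward slashes regardless of the operating system the notebook runs on. As a minimal sketch of how the object name and its `gs://` URI fit together (the helper names `gcs_object_name` and `gcs_uri` and the bucket name are assumptions for illustration, not part of any Google library):

```python
import posixpath


def gcs_object_name(*parts):
    """Join path segments with '/', the separator GCS object names use."""
    return posixpath.join(*parts)


def gcs_uri(bucket_name, object_name):
    """Build the gs:// URI for an object in a bucket."""
    return f'gs://{bucket_name}/{object_name}'


# 'my-project-form13' is a placeholder bucket name, not a real bucket.
object_name = gcs_object_name('form13', 'raw.csv')
uri = gcs_uri('my-project-form13', object_name)
print(object_name)  # form13/raw.csv
print(uri)          # gs://my-project-form13/form13/raw.csv
```

Building the name with `posixpath` rather than `os.path` keeps the result the same on Windows, where `os.path.join` would insert backslashes.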

## Train a Model with Vertex AI AutoML
We'll use the original features to train an AutoML model.

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# Build the URI directly: GCS paths always use forward slashes
dataset = aiplatform.TabularDataset.create(
    display_name="form13-raw",
    gcs_source=f"gs://{STORAGE_BUCKET}/form13/raw.csv",
)
dataset.wait()

print(f'\tDataset: "{dataset.display_name}"')
print(f'\tname: "{dataset.resource_name}"')

In [None]:
job = aiplatform.AutoMLTabularTrainingJob(
    display_name='form13-raw',
    optimization_prediction_type='classification'
)

In [None]:
model = job.run(
    dataset=dataset,
    target_column='target',
    predefined_split_column_name='split',
    model_display_name='form13-raw',
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
)

A budget of 1,000 milli node hours, or one node hour, is the minimum that Vertex AI allows. However, Vertex AI isn't currently respecting that budget, so expect this job to run for about two and a half hours.
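
The unit conversion is simple but easy to get wrong by a factor of 1,000. A minimal sketch, assuming only the one-node-hour minimum stated above (the helper name `to_milli_node_hours` is hypothetical and not part of the Vertex AI SDK):

```python
MIN_BUDGET_MILLI_NODE_HOURS = 1000  # the minimum stated above: one node hour


def to_milli_node_hours(node_hours):
    """Convert node hours to the milli node hours passed to job.run()."""
    milli = round(node_hours * 1000)
    if milli < MIN_BUDGET_MILLI_NODE_HOURS:
        raise ValueError(
            f'Budget of {milli} milli node hours is below the minimum of '
            f'{MIN_BUDGET_MILLI_NODE_HOURS}'
        )
    return milli


print(to_milli_node_hours(1))    # 1000
print(to_milli_node_hours(2.5))  # 2500
```

The result could then be passed as `budget_milli_node_hours` in the `job.run` call.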

We're going to move on while that runs.  You can check on the job later in the [Google Cloud Console](https://console.cloud.google.com/) to see the results.  There's a link to the specific job in the output of the cell above.