-
Notifications
You must be signed in to change notification settings - Fork 2
/
googleCloudAutoMLTablesR.Rmd
309 lines (207 loc) · 9.08 KB
/
googleCloudAutoMLTablesR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
---
title: "googleCloudAutoMLTablesR"
author: "Justin Marciszewski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{googleCloudAutoMLTablesR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Summary
This is a quick start tutorial for training a model using GCP AutoML Tables.
# Setup
## 1. Configure your own Google Cloud Platform project
1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.
2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)
3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)
4. [Enable AutoML API.](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl)
## 2. Create a service account and grant permissions
1. In the GCP Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).
2. From the **Service account** drop-down list, select **New service account**.
3. In the **Service account name** field, enter a name.
4. From the **Role** drop-down list, select **AutoML > AutoML Admin**, **Storage > Storage Admin** and **BigQuery > BigQuery Admin**.
5. Click _Create_. A JSON file that contains your key downloads to your local environment.
6. Save the json file and copy the pathname for an upcoming step
## 3. Create Google Cloud Storage Bucket
Create a [Google Cloud Storage Bucket](https://console.cloud.google.com/storage) and save the name for use in an upcoming step.
## 4. Configure R environment
### Install R packages
Run the following chunk to install the required R packages to complete this tutorial (checking to see if they are installed first and only install if not already):
```{r install_packages, eval=FALSE}
list.of.packages <- c("remotes", "googleAuthR")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
remotes::install_github("justinjm/googleCloudAutoMLTablesR")
```
### .Renviron
Create a file called `.Renviron` in your project's working directory and use the following environemtn arguments:
* `GAR_SERVICE_JSON` - path to service account (JSON) keyfile downloaded before and copied
* `GCAT_DEFAULT_PROJECT_ID` - string of your GCP project you configured before
* `GCAT_DEFAULT_REGION` - region of GCP resorces that can be one of: `"us-central1"` or `"eu"`
e.g. your `.Renviron` should look like:
```
# .Renviron
GAR_SERVICE_JSON="/Users/me/auth/auth.json"
GCAT_DEFAULT_PROJECT_ID="my-project"
GCAT_DEFAULT_REGION="us-central1"
```
Save the `.Renviron` file and restart your R session to load the environment variables
# Authentication
Load the 2 R packages we need and then authenticate using the Service account we just created:
```{r auth, warning=FALSE}
library(googleAuthR)
library(googleCloudAutoMLTablesR)
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")
gar_auth_service(json_file = Sys.getenv("GAR_SERVICE_JSON"))
```
## Set global arguments
To make things easier later on, we set some global arguements for our GCP project and the region of the GCP resources within our project:
```{r gcat-global-arguements}
projectId <- Sys.getenv("GCAT_DEFAULT_PROJECT_ID")
location <- "us-central1"
gcat_region_set("us-central1")
gcat_project_set(projectId)
```
# Viewing AutoML Tables datasets
## Get list of datasets
```{r}
datasets_list <- gcat_list_datasets()
datasets_list
```
### Optional - setting and getting global dataset name
To simplify code for other API calls, you can set a dataset name to a global environment variable:
```{r}
gcat_global_dataset("test_01_bq")
```
Then you can retrieve it like so:
```{r}
gcat_get_global_dataset()
```
The example workflow in the rest of this vignette uses this function to keep the code as concise as possible.
# Importing data into AutoML Tables
## Options and Workflow
There are 2 options for loading data into AutoML Tables
1. Google Cloud Storage
2. BigQuery
At a high-level, the workflow is:
1. Locate your data in GCP (csv file in GCS bucket or csv file loaded into a BQ table)
1. Create AutoML Tables dataset
2. Import data from GCS or BQ into AutoML Tables dataset
## From Google Cloud Storage
To load data into AutoML Tables from Google Cloud storage, first do the following:
1. Setup a GCS bucket that is [regional](https://cloud.google.com/automl-tables/docs/prepare#bucket_requirements) and in `us-central1` region
2. Upload your data file manually or via `googleCloudStorageR`
Now we are ready to create an AutoML tables dataset.
### Create AutoML Tables dataset
Create a AutoML Tables dataset (with a unique `displayName`):
```r
gcat_dataset <- gcat_create_dataset(displayName = "test_01_bq")
gcat_dataset
```
### Import data
Execute import from Google Cloud storage:
```r
# set url as seperate parameter to keep line length under 80
gs_url <- "gs://gcatr-dev/bank_marketing.csv"
gcat_import_job <- gcat_import_data(displayName = "test_03_gcs",
input_source = "gcs",
input_url = gs_url)
gcat_import_job
```
## From BigQuery
To load data into AutoML Tables from BigQuery, first do the following:
1. Setup a BQ dataset
2. Upload your data file manually or via `bigQueryR` or `bigrquery`
### Create AutoML Tables dataset
Create a AutoML Tables dataset (with a unique `displayName`):
```r
gcat_dataset <- gcat_create_dataset(displayName = "test_01_bq")
gcat_dataset
```
### Import data
Execute import from BigQuery:
```r
# set url as seperate parameter to keep line length under 80
bq_url <- "bq://gc-automl-tables-r.gcatr_dev.bank_marketing"
gcat_dataset_import <- gcat_import_data(dataset_display_name = "test_01_bq",
input_source = "bq",
input_url = bq_url)
gcat_dataset_import
```
## View dataset
Once the import - from GCS or BQ - is completed, you'll recieve an email notification. Locate the `datasetId` in the GCP console or via `gcat_list_datasets` function. (e.g. - will be in the form `TBL123456789`) Once the data import is completed, we can then sanity check the results.
Now we're ready to view the results and get the `tableSpecId` we need for later functions:
```{r}
dataset <- gcat_get_dataset()
dataset
```
# Updating dataset in AutoML Tables
AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.
Lets first view the table schema and the columns
## Get table specifications (tableSpec)
```{r}
table_spec <- gcat_get_table_specs()
table_spec
```
## List column specs
```{r}
column_specs_list <- gcat_list_column_specs()
column_specs_list
```
### Get info about our target column
```{r}
column_spec <- gcat_get_column_spec(columnDisplayName = "V16")
column_spec
```
## Set label or target column
Set V16 or outcome [source](https://datahub.io/machine-learning/bank-marketing) as label
```{r}
dataset <- gcat_set_target_column(columnDisplayName = "V16")
dataset
```
# Modeling
## Create model
Training or creating a model is done with `gcat_create_model`.
This will be a long-lasting operation. You will recieve an email when the model training completes
```r
gcat_model <- gcat_create_model(
datasetDisplayName = "test_01_bq",
columnDisplayName = "V16",
modelDisplayName = "test_01_bq_02",
optimizationObjective = "MINIMIZE_LOG_LOSS",
trainBudgetMilliNodeHours = 1000)
gcat_model
```
More details here: [Training models|AutoML Tables Documentation](https://cloud.google.com/automl-tables/docs/train)
## Viewing Models
While you wait - or after your latest model training completes - you can view the list trained models:
```{r}
models_list <- gcat_list_models()
models_list
```
After the model has trained, you can get the trainied model with the following:
```{r}
gcat_model <- gcat_get_model(modelDisplayName = "test_01_bq_01")
gcat_model
```
## Evalute the model
After training has been completed, you can review various performance statistics on the model, such as the accuracy, precision, recall, and so on. The metrics are returned in a nested data structure:
```{r}
model_evaluation <- gcat_list_model_evaluations(
projectId = projectId,
locationId = location,
modelDisplayName = "test_01_bq_01"
)
model_evaluation
```
### Make batch prediction
There are two different prediction modes: online and batch. The following cell shows you how to make a batch prediction.
```{r}
batch_predictions <- gcat_batch_predict(
modelDisplayName = "test_01_bq_01",
inputSource = "gs://gcatr-dev/bank_marketing_batch_01.csv",
outputTarget = "gs://gcatr-dev/predictions/"
)
batch_predictions
```