## AI-platform training with sklearn

I previously wrote a Keras training script in [colab](https://colab.research.google.com/drive/1mWbS2cjBSQ5X7iAyoowJviO_dZcYYFYY#scrollTo=2R7eCG89y3qW). That notebook mainly covers the prediction logic, and its training step runs on the Colab server. This notebook instead tries to train the model in the cloud.

That script is based on TensorFlow; here I try to train a model with sklearn. I will just use the sample data that ships with sklearn; a real project should be similar.

In [1]:
# first install (upgrade) sklearn and pandas
!pip install --upgrade scikit-learn pandas

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/d9/3a/eb8d7bbe28f4787d140bb9df685b7d5bf6115c0e2a969def4027144e98b6/scikit_learn-0.23.1-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)
Requirement already up-to-date: pandas in /usr/local/lib/python3.6/dist-packages (1.0.3)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/db/09/cab2f398e28e9f183714afde872b2ce23629f5833e467b151f18e1e08908/threadpoolctl-2.0.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.23.1 threadpoolctl-2.0.0


In [0]:
# import libraries
import numpy as np
import pandas as pd
import os

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

import warnings

warnings.simplefilter('ignore')


In [7]:
# As this is a sample script, I just build the data in the current script, save it to a file and upload it to the bucket
iris = load_iris()

x, y = iris.data, iris.target

columns_name = ['a', 'b', 'c', 'd', 'label']

df = pd.DataFrame(np.concatenate([x, y[:, np.newaxis]], axis=1), columns=columns_name)

df.head()

     a    b    c    d  label
0  5.1  3.5  1.4  0.2    0.0
1  4.9  3.0  1.4  0.2    0.0
2  4.7  3.2  1.3  0.2    0.0
3  4.6  3.1  1.5  0.2    0.0
4  5.0  3.6  1.4  0.2    0.0


In [10]:
# then save the dataframe to a csv file on the notebook server
df.to_csv(os.path.join('.', 'data.csv'), index=False)

print(os.listdir("."))

['.config', 'data.csv', 'sample_data']


## Upload file into bucket

In [0]:
# first install the Cloud Storage client library
! pip install google-cloud-storage --quiet

In [0]:
# authenticate the notebook
from google.colab import auth
auth.authenticate_user()


## Upload sample file into bucket

In [0]:
from google.cloud import storage

os.environ['GCLOUD_PROJECT'] = 'cloudtutorial-278306'

client = storage.Client()

bucket_name = 'first_bucket_lugq'
bucket = client.get_bucket(bucket_name)

# in fact, we don't need to create the folder first, we can just define the full object path
blob = bucket.blob('sklearn_tutorial/data.csv')

# upload file into bucket
blob.upload_from_filename('data.csv')


In [104]:
# list the files in the bucket to confirm that the upload worked
for file in client.list_blobs(bucket_name, prefix='sklearn_tutorial/'):
  print(str(file))

<Blob: first_bucket_lugq, sklearn_tutorial/, 1590565285687456>
<Blob: first_bucket_lugq, sklearn_tutorial/data.csv, 1590572504653769>


## Training logic file

The main training logic happens here. The steps are:
* Import modules
* Set up project environment
* Create storage client and download files from storage
* Load data into memory
* Process data
* Split data into train and validation
* Do model training with sklearn
* Dump our model into server
* Upload trained model file into bucket

That's it. While writing the sample code, one question came to mind: why do we need to store the data in a bucket and download it from a remote server? My guess is that cloud training currently runs in containers, and each container has its own resources such as disk and memory. The easiest approach is therefore to download the remote files into the container's file system, so that the training code can use the container's resources.

In [93]:
%%writefile train.py

# the training file runs on its own, so it has to import every module it uses.
import numpy as np
import pandas as pd
import os

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

from google.cloud import storage

import warnings
warnings.simplefilter('ignore')

os.environ['GCLOUD_PROJECT'] = 'cloudtutorial-278306'

client = storage.Client()

bucket_name = 'first_bucket_lugq'
bucket = client.get_bucket(bucket_name)

# we can download the file from the bucket directly, without the earlier upload step being rerun.
blob_download = bucket.blob('sklearn_tutorial/data.csv')
blob_download.download_to_filename('data_new.csv')

# load data from server
df = pd.read_csv('data_new.csv')

data = df.drop(['label'], axis=1).values
label = df['label'].values

print("data shape: {}, label shape: {}".format(str(data.shape), str(label.shape)))

pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# split train and validation data
xtrain, xtest, ytrain, ytest = train_test_split(data, label, test_size=.2, random_state=1234)

pipeline.fit(xtrain, ytrain)

score = pipeline.score(xtest, ytest)

print("validation score: ", score)

# save the trained model to the local file system of the server
import joblib

model_name = "pipeline.pkl"
joblib.dump(pipeline, model_name)

upload_model_name = "{}/{}".format('sklearn_tutorial', model_name)

upload_blob = bucket.blob(upload_model_name)
upload_blob.upload_from_filename(model_name)
print('model has been uploaded into bucket: {}'.format(bucket_name))

Overwriting train.py


In [53]:
# let's check our bucket: great, the trained model is there.
! gsutil ls gs://$bucket_name/sklearn*


gs://first_bucket_lugq/sklearn_tutorial/
gs://first_bucket_lugq/sklearn_tutorial/data.csv
gs://first_bucket_lugq/sklearn_tutorial/pipeline.pkl
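
With `pipeline.pkl` in the bucket, a later script can download the file and load it back with `joblib`. The block below is only a minimal local sketch of that round trip: it retrains the same pipeline as `train.py`, dumps it to a temporary folder and reloads it. It assumes no bucket access, so the download step is skipped, and the helper name `roundtrip_pipeline` is made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def roundtrip_pipeline(model_path):
    """Train the same pipeline as train.py, dump it, and load it back."""
    x, y = load_iris(return_X_y=True)
    xtrain, xtest, ytrain, ytest = train_test_split(
        x, y, test_size=.2, random_state=1234)

    pipeline = make_pipeline(StandardScaler(), LogisticRegression())
    pipeline.fit(xtrain, ytrain)
    joblib.dump(pipeline, model_path)

    # in the real flow this file would first be downloaded from the bucket
    loaded = joblib.load(model_path)
    return pipeline, loaded, xtest, ytest


with tempfile.TemporaryDirectory() as tmp_dir:
    original, loaded, xtest, ytest = roundtrip_pipeline(
        os.path.join(tmp_dir, "pipeline.pkl"))
    # the reloaded pipeline should predict exactly like the original one
    same = bool((original.predict(xtest) == loaded.predict(xtest)).all())
    print("loaded model matches original:", same)
    print("validation score of loaded model:", loaded.score(xtest, ytest))
```

Because the scaler and the classifier are stored together in one pipeline object, the loaded model can be applied to raw feature values directly.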


## Make application project

One of the easiest ways to set up the training step is to put the code into a project, package it, and upload the package to cloud storage. Later steps can then use the project to do the training.

In [94]:
# First make the application folder and file
print("Current folder file list: ", os.listdir('.'))

folder_name = 'iris_tutorial'
# exist_ok avoids an error when the folder is already there
os.makedirs(folder_name, exist_ok=True)

# a project (package) needs an __init__.py file
init_file = os.path.join(folder_name, '__init__.py')
if not os.path.exists(init_file):
  open(init_file, 'a').close()

# we could copy the train.py into the project folder
import shutil

train_file_name = 'train.py'
shutil.copy(train_file_name, os.path.join(folder_name, train_file_name))

print("project folder file list:", os.listdir(folder_name))

Current folder file list:  ['.config', 'adc.json', 'pipeline.pkl', 'train.py', 'data.csv', 'data_new.csv', '__init__.py', 'iris_tutorial', 'sample_data']
project folder file list: ['train.py', '__init__.py']


## Submit training job

This is the sample submit command with gcloud. A more detailed explanation of each parameter is in [training parameters](https://cloud.google.com/ai-platform/training/docs/training-scikit-learn#gcloud), and it is recommended to package the code into a project, see [package project](https://cloud.google.com/ai-platform/training/docs/packaging-trainer#using_gcloud_to_package_and_upload_your_application_recommended).
``` shell
gcloud ai-platform jobs submit training $JOB_NAME \
    --staging-bucket $PACKAGE_STAGING_PATH \
    --job-dir $JOB_DIR  \
    --package-path $TRAINER_PACKAGE_PATH \
    --module-name $MAIN_TRAINER_MODULE \
    --region $REGION \
    -- \
    --user_first_arg=first_arg_value \
    --user_second_arg=second_arg_value
```
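
Everything after the bare `--` in the template is forwarded to the training module, and as far as I understand the linked documentation, `--job-dir` is forwarded as well. The `train.py` in this notebook ignores its arguments, so the block below is only a sketch of how a training module could read them with `argparse`. The two user arguments are the placeholders from the template, not real options, and `parse_args` is a hypothetical helper.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that the training service forwards to the module."""
    parser = argparse.ArgumentParser(description="iris training (sketch)")
    # assumed to be forwarded when --job-dir is set on the gcloud command
    parser.add_argument("--job-dir", default=None, help="GCS path for job output")
    # hypothetical user arguments, matching the placeholders in the template
    parser.add_argument("--user_first_arg", default="first_arg_value")
    parser.add_argument("--user_second_arg", default="second_arg_value")
    # parse_known_args skips unrecognized flags instead of exiting with an error
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--job-dir", "gs://first_bucket_lugq/sklearn_tutorial/output",
    "--user_first_arg=abc",
])
print(args.job_dir, args.user_first_arg, args.user_second_arg)
```

Calling `parse_args()` without a list reads `sys.argv`, which is what a module started by the training service would do.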

In [106]:
# set project
! gcloud config set project cloudtutorial-278306


Updated property [core/project].


In [0]:
# config of the job
JOB_NAME = 'iris_training_04'
# where the packaged source code archive would be staged
# (the submit command below passes the bucket root instead)
PACKAGE_STAGING_PATH = "gs://{}/sklearn_tutorial/package".format(bucket_name)
JOB_DIR = "gs://{}/sklearn_tutorial/output".format(bucket_name)
TRAINING_PACKAGE_PATH = folder_name
REGION = 'us-central1'
MAIN_TRAINER_MODULE = "{}.train".format(folder_name)
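
The job name above ends in a hand-written counter (`_04`), which has to be bumped on every submission. A small sketch of an alternative is to append a timestamp instead. The helper `make_job_name` is made up for illustration, and the naming rule it checks (letters, digits and underscores only, starting with a letter) is my assumption about what the service accepts.

```python
import re
from datetime import datetime, timezone


def make_job_name(prefix="iris_training", now=None):
    """Build a job name from a prefix plus a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    name = "{}_{}".format(prefix, now.strftime("%Y%m%d_%H%M%S"))
    # assumption: only letters, digits and underscores, starting with a letter
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError("invalid job name: {}".format(name))
    return name


# name based on the current time, different on every run
print(make_job_name())
# fixed time, to show the format
print(make_job_name(now=datetime(2020, 5, 27, 9, 42, 26, tzinfo=timezone.utc)))
```

With this, `JOB_NAME = make_job_name()` would replace the manual counter.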


In [108]:
!gcloud ai-platform jobs submit training $JOB_NAME \
--staging-bucket gs://first_bucket_lugq \
--job-dir $JOB_DIR \
--package-path $TRAINING_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version 2.1 \
--python-version 3.7

Job [iris_training_04] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe iris_training_04

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs iris_training_04
jobId: iris_training_04
state: QUEUED


In [110]:
! gcloud ai-platform jobs stream-logs iris_training_04

INFO	2020-05-27 09:42:26 +0000	service		Validating job requirements...
INFO	2020-05-27 09:42:26 +0000	service		Job creation request has been successfully validated.
INFO	2020-05-27 09:42:27 +0000	service		Job iris_training_04 is queued.
INFO	2020-05-27 09:42:27 +0000	service		Waiting for job to be provisioned.
INFO	2020-05-27 09:42:29 +0000	service		Waiting for training program to start.
INFO	2020-05-27 09:43:09 +0000	master-replica-0		Running task with arguments: --cluster={"chief": ["127.0.0.1:2222"]} --task={"type": "chief", "index": 0} --job={  "package_uris": ["gs://first_bucket_lugq/iris_training_04/f99e915a32b8d9eadfdf532e1210923d3eadd38a1ac1ef2dc5e03a758aa7f375/iris_tutorial-0.0.0.tar.gz"],  "python_module": "iris_tutorial.train",  "region": "us-central1",  "runtime_version": "2.1",  "job_dir": "gs://first_bucket_lugq/sklearn_tutorial/output",  "run_on_raw_vm": true,  "python_version": "3.7"}
INFO	2020-05-27 09:43:44 +0000	master-replica-0		Running module iris_tutorial.train.
I

## Well done

We have deployed our training logic to the server and trained in the cloud :). So great!

So far the model is trained with fixed parameters, without any parameter tuning, so tuning is the next step. Another benefit of cloud training is distributed training with TensorFlow, XGBoost and even sklearn. I will make that happen in a later step.