# Cloud interoperability

> This sample connects services across AzureML, Amazon Web Services (AWS) and Google Cloud Platform (GCP).

Our goal is to train a machine learning model on credit card data. Everything we need is in our GitHub repository; we recommend cloning the data_connector repository and following the instructions below.

First, you need an active account on GCP, AWS and AzureML. For each account, you also need a few configuration values and credentials to connect to that cloud.
To learn more, see the [GCP Readme](GCP_README.md), [AWS Readme](AWS_README.md) and [AzureML readme](AZUREML_README.md).
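
As a quick pre-flight check, a minimal sketch like the following reports which connection settings are still missing from the environment. The variable names in `REQUIRED` are assumptions for illustration: `GOOGLE_APPLICATION_CREDENTIALS` is set later in this notebook and the two AWS names are the standard AWS credential variables, but whether data_connector reads exactly these is not confirmed here. The per-cloud READMEs list the real requirements.

```python
from typing import Iterable, Mapping

# Assumed names, for illustration only (see the per-cloud READMEs).
REQUIRED = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_settings(names: Iterable[str], env: Mapping[str, str]) -> list[str]:
    """Return the names that are absent or empty in env."""
    return [name for name in names if not env.get(name)]
```

After `load_dotenv()`, calling `missing_settings(REQUIRED, os.environ)` returns an empty list when every assumed setting is present.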

**Change your kernel to _intel_sample_env_ at the top right of this document**


## Experiment

Our experiment is a simple interconnection between three cloud providers: GCP, AWS and Azure.
This use case simulates you, a data scientist, with a local file of credit card behavior data that you want to store in GCP BigQuery so you can extract it later as many times as you need.
We will also run a machine learning process on our local machine to get a model that predicts who can receive a credit card.

This work is meant to run beyond a local machine, so we will then configure and run our training code on cloud compute clusters in AzureML.

Finally, we need to version our work: we will store the results of each experiment in a safe place, an AWS bucket. We can then return to a previous state of a model just by loading it.


> Before running, use ```gcloud auth login```


## GCP load data

In [None]:
# To get environment variables
import os
from dotenv import load_dotenv
# GCP data connector
from data_connector.gcp import Connector as GCPConnector
from data_connector.gcp import Downloader as GCPDownloader 
from data_connector.gcp  import Query as GCPQuery

load_dotenv()

# Pandas 
import pandas as pd
import numpy as np
from google.cloud import bigquery


sql_types = {np.dtype('int32'): "INTEGER", np.dtype('int64'): "INT64"}

credit_card_name = "credit_card_clients"
credit_card_file_name = credit_card_name + ".xls"
credit_card_data_df: pd.DataFrame = pd.read_excel(credit_card_file_name, skiprows=[0])
schema = []
for name in credit_card_data_df.columns:
    schema.append(bigquery.SchemaField(name, sql_types[credit_card_data_df.dtypes[name]], mode='REQUIRED'))

rows_to_insert = list(credit_card_data_df.itertuples(index=False, name=None))
increment = 1000
chunks = [rows_to_insert[x:x+increment] for x in range(0, len(rows_to_insert), increment)] #Avoids error 413

# Connecting to GCP BigQuery
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./credentials.json"
gcp_connector = GCPConnector("bigquery")
# Create a client 
bigquery_client = gcp_connector.connect(connection_string="<your project>")
gcp_query = GCPQuery(bigquery_client)
# Create a data set
data_set_name = "CreditCardClients"
table_name = "test_query"

gcp_query.create_dataset(data_set_name)
gcp_query.create_table(data_set_name, table_name, schema)

for elem in chunks:
    gcp_query.export_items_to_bq(data_set_name, table_name, elem)



## Azure ML
Now we will use our data to train a model locally and then run it in the AzureML cloud.

* Remember, you need a config.json file from AzureML
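
A workspace `config.json` downloaded from Azure ML holds the `subscription_id`, `resource_group` and `workspace_name` of your workspace. Before connecting, it can help to fail early if the file is absent or incomplete. The sketch below assumes the file sits in the working directory; where `Connector().connect()` actually looks for it is not shown in this notebook, and `check_workspace_config` is a hypothetical helper.

```python
import json
from pathlib import Path

# Keys expected in an Azure ML workspace config.json.
EXPECTED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path: str = "config.json") -> dict:
    """Load the workspace config and fail early if a key is missing or empty."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(
            f"{config_path} not found; download it from your Azure ML workspace"
        )
    config = json.loads(config_path.read_text(encoding="utf-8"))
    missing = [key for key in EXPECTED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return config
```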

In [None]:
! az login

In [None]:
import mlflow
mlflow.end_run()
# Python
from pathlib import Path
# AZURE
# Azure ML
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    AmlCompute,
    Environment
)

# Data connector
from data_connector.azure import Connector, MLUploader
# Connection to Azure ML
ml_client: MLClient = Connector().connect()
ml_uploader: MLUploader = MLUploader(ml_client)
# Training the model locally
path = Path().resolve()
training_script = f"{path}/src/main.py"
training_data = f"{path}/credit_card_clients.xls"
_test_train_ratio=0.2
_learning_rate=0.25
_registered_model_name = "credit_defaults_model"
# Here is our training code ./src/main.py
from src.main import main
arguments = [
        "--data", f"{training_data}",
        "--test_train_ratio", f"{_test_train_ratio}",
        "--learning_rate", f"{_learning_rate}",
        "--registered_model_name", f"{_registered_model_name}"
        ]
main(arguments)


# Moving to the cloud
# AmlCompute 
# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"
_type = "amlcompute"
_size = "STANDARD_DS3_V2"
_min_instances = 0
_max_intances = 2
_idle_time_before_scale_down = 180
_tier = "Dedicated"
try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type=_type,
        # VM Family
        size=_size,
        # Minimum running nodes when there is no job running
        min_instances=_min_instances,
        # Nodes in cluster
        max_instances=_max_intances,
        # How many seconds a node keeps running after the job finishes
        idle_time_before_scale_down=_idle_time_before_scale_down,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier=_tier,
    )
    print(
        f"AMLCompute with name {cpu_cluster.name} will be created, with compute size {cpu_cluster.size}"
    )
    # Now we upload the compute object through the data connector's uploader
    cpu_cluster = ml_uploader.upload(cpu_cluster)

#Environment
# 
dependencies_dir = f"{path}/dependencies"
custom_env_name =  "aml-scikit-learn"
_environment_description = (
                            "Custom environment for Credit Card " 
                            "Defaults pipeline"
                        )
_environment_tags = {"scikit-learn": "0.24.2"}
_environment_conda_file = f"{dependencies_dir}/conda.yml"
_environment_docker_image = "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest"

pipeline_job_env: Environment = Environment(
    name=custom_env_name,
    description=_environment_description,
    tags=_environment_tags,
    conda_file=_environment_conda_file,
    image=_environment_docker_image,
)
# Upload environment using data connector
pipeline_job_env = ml_uploader.upload(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)

# Job
# Now we set up a job to run in the cloud. At this point we have
# the training script and a data set to train on.
#

from azure.ai.ml import command
from azure.ai.ml import Input

# We should define some extra values to run on cloud systems
_code = "./src/"
_command = """python main.py --data ${{inputs.data}}
        --test_train_ratio ${{inputs.test_train_ratio}} 
        --learning_rate ${{inputs.learning_rate}} 
        --registered_model_name ${{inputs.registered_model_name}}
        """
_job_environment="aml-scikit-learn@latest"
_job_compute="cpu-cluster"
_job_experiment_name="train_model_credit_default_prediction"
_job_display_name="credit_default_prediction_from_data_connector"

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path=training_data,
        ),
        test_train_ratio=_test_train_ratio,
        learning_rate=_learning_rate,
        registered_model_name=_registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --registered_model_name ${{inputs.registered_model_name}}",
    environment=_job_environment,
    compute=_job_compute,
    experiment_name=_job_experiment_name,
    display_name=_job_display_name,
)

ml_uploader.upload(job)
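
Both the local call `main(arguments)` and the cloud job command pass the same four flags to `src/main.py`. That script is not shown in this notebook, so the following is only a sketch of an argument parser that would accept those flags. The default values and help text are assumptions, not the script's actual settings.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the four flags that the notebook passes to the training script."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI for src/main.py")
    parser.add_argument("--data", type=str, required=True, help="path to the input data file")
    # Defaults below are illustrative assumptions.
    parser.add_argument("--test_train_ratio", type=float, default=0.25)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--registered_model_name", type=str, required=True)
    return parser.parse_args(argv)


args = parse_args([
    "--data", "credit_card_clients.xls",
    "--test_train_ratio", "0.2",
    "--learning_rate", "0.25",
    "--registered_model_name", "credit_defaults_model",
])
print(args.test_train_ratio, args.learning_rate)  # 0.2 0.25
```

Accepting an explicit `argv` list is what lets the same entry point be called from a notebook with `main(arguments)` and from the command line inside the job.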

## AWS
Now we should save the local results of our experiments into a bucket in AWS.

To do so, you first need to create a bucket to store your data.

In [None]:
import os
from data_connector.aws.connector import Connector
from data_connector.aws.uploader import Uploader
from datetime import datetime
from dotenv import load_dotenv

load_dotenv()

# specify an S3 bucket name
bucket_name = '<your bucket name>'
# create a connector
connector = Connector()
# connect to AWS using the default AWS access keys
connection_object = connector.connect()
# Upload a file
# create an uploader object using the connection object
uploader = Uploader(connection_object)
# upload all files from a folder and subfolders
now = datetime.now()
upload_dir = 'mlruns'
for subdir, dirs, files in os.walk(upload_dir):
    for file in files:
        full_path = os.path.join(subdir, file)
        destination_path = full_path.replace(upload_dir, f'{upload_dir}_{now}')
        uploader.upload(bucket_name, full_path, destination_path)
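
In the loop above, `now` is interpolated in its default string form, which contains a space and colons (for example `2024-01-31 09:30:00.123456`), and `str.replace` rewrites every occurrence of `mlruns` in the path. One possible alternative, sketched below, builds the object key from the path relative to the upload folder and a compact timestamp. `versioned_key` is a hypothetical helper and the timestamp format is an assumption; neither is part of data_connector.

```python
from datetime import datetime
from pathlib import PurePath


def versioned_key(full_path: str, upload_dir: str, stamp: datetime) -> str:
    """Map a local path under upload_dir to a key under '<upload_dir>_<timestamp>/'."""
    relative = PurePath(full_path).relative_to(upload_dir)
    prefix = f"{upload_dir}_{stamp.strftime('%Y%m%dT%H%M%S')}"
    # as_posix() yields forward slashes regardless of the local OS.
    return (PurePath(prefix) / relative).as_posix()


key = versioned_key("mlruns/0/meta.yaml", "mlruns", datetime(2024, 1, 31, 9, 30, 0))
print(key)  # mlruns_20240131T093000/0/meta.yaml
```

Because every file from one run shares the same prefix, a previous experiment state can be restored by downloading everything under that prefix.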