In [None]:
# preamble to be able to run notebooks in Jupyter and Colab
try:
    from google.colab import drive
    import sys
    
    drive.mount('/content/drive')
    notes_home = "/content/drive/Shared drives/CSC310/ds/notes/"
    user_home = "/content/drive/My Drive/"
    
    sys.path.insert(1,notes_home) # let the notebook access the notes folder
    
except ModuleNotFoundError:
    notes_home = "" # running native Jupyter environment -- notes home is the same as the notebook
    user_home = ""  # under Jupyter we assume the user directory is the same as the notebook

# Cloud Computing

**Definition**: Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. Cloud computing relies on sharing of resources to achieve coherence and [economies of scale](https://en.wikipedia.org/wiki/Economies_of_scale).

[-Wikipedia](https://en.wikipedia.org/wiki/Cloud_computing)

## Architecture

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Cloud_computing.svg/1280px-Cloud_computing.svg.png" height="350" width="500">

**Cloud computing metaphor**: the group of networked elements providing services need not be individually addressed or managed by users; instead, the entire provider-managed suite of hardware and software can be thought of as an **amorphous cloud**.

Google's Colab Notebooks is an example of application-based cloud computing; for that matter, Google's whole suite of applications falls into that category.

## Service Models

* [Software as a service (SaaS)](https://en.wikipedia.org/wiki/Cloud_computing#Software_as_a_service_(SaaS))
    
    **Definition**: The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. 
    <!-- The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings. -->
    
    **Examples**: Google Docs and Colab Notebooks
 

   
* [Platform as a service (PaaS)](https://en.wikipedia.org/wiki/Cloud_computing#Platform_as_a_service_(PaaS))

    **Definition**: The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. 
    <!-- The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.-->
    
    **Examples**: AWS Sagemaker and S3
 

   
* [Infrastructure as a service (IaaS)](https://en.wikipedia.org/wiki/Cloud_computing#Infrastructure_as_a_service_(IaaS))

    **Definition**: The consumer is able to deploy and run arbitrary software, which can include operating systems and applications, and has control over operating systems, storage, and deployed applications.
    
    **Examples**: AWS and Azure

## Data Science in the Cloud

<img src="https://dmhnzl5mp9mj6.cloudfront.net/bigdata_awsblog/images/White_paper_image1.PNG" width="600" height="200">

A cloud-based architecture of a data science processing pipeline that takes advantage of AWS's IaaS. All the components can be provisioned and configured in the AWS console or through the AWS DevOps API. ([Source](https://aws.amazon.com/blogs/big-data/big-data-analytics-options-on-aws-updated-white-paper/))

We will take a look at two components in the above diagram:

* Cloud-based Storage: [S3](https://aws.amazon.com/s3/) (component #3)
    
    Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance.
  

  
* Cloud-based Machine Learning: [Sagemaker](https://aws.amazon.com/sagemaker/) (component #5)
    
    Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. 

## Experiments


If you have an AWS account, here are two nice tutorials:

* [S3 tutorial](https://aws.amazon.com/getting-started/hands-on/backup-files-to-amazon-s3/)

* [Jupyter/Sagemaker](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/)

We will be doing work in AWS Classrooms.

### Exercise 1

1. Go to the S3 console.
1. Create a bucket.
1. Upload the 'tennis-numeric.csv' file into your bucket.
1. Using the SQL interface, figure out what the average temperature is for the week.
    * Make sure you have ticked the 'File has header row' box.
    * Query: `select avg(cast(temperature as integer)) from s3object`
    * Note that you have to explicitly cast the temperature column to integer, because values read from a CSV file are treated as strings.
 
1. Using the SQL interface, figure out how often we play or do not play tennis in that week.
    * Query: `select count(play) from s3object where cast(play as string) like '%yes%'`
    * Note that you have to use 'like' with the wildcard character '%' in order to match 'yes' in case there are hidden characters. Replace 'yes' with 'no' to count the other case.
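
The two queries above can be cross-checked locally. The following is a minimal sketch using only the Python standard library. The inline CSV text is made-up sample data (it is not the contents of 'tennis-numeric.csv'), and the helper names `load_rows`, `average_temperature`, and `count_play` are hypothetical. The substring test `answer in row['play']` mirrors the `like '%yes%'` pattern, so it also tolerates stray characters around the value.

```Python
import csv
from io import StringIO
from statistics import mean

# Made-up sample data for illustration only; two values carry a trailing space
SAMPLE_CSV = "temperature,play\n85,no\n80,no \n83,yes\n70,yes \n"

def load_rows(text):
    # Parse CSV text with a header row into a list of dictionaries
    return list(csv.DictReader(StringIO(text)))

def average_temperature(rows):
    # Mirrors: select avg(cast(temperature as integer)) from s3object
    return mean(int(row['temperature']) for row in rows)

def count_play(rows, answer):
    # Mirrors: select count(play) from s3object where ... like '%<answer>%'
    return sum(1 for row in rows if answer in row['play'])

rows = load_rows(SAMPLE_CSV)
print(average_temperature(rows))
print(count_play(rows, 'yes'), count_play(rows, 'no'))
```

Running the same functions on a local copy of the uploaded file should give the same numbers as the S3 console, which makes it easy to spot a query that silently matched nothing.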



### Exercise 2

1. Go to the Sagemaker notebooks console.
1. Create a notebook instance and open it in the Jupyter console.
1. Create a notebook to do the following:
    1. Access your play tennis data set stored in S3.
    1. Build a decision tree.
    1. Print the tree and its accuracy.

### The following code snippets will be useful for the above exercises

Copy and paste them into your Sagemaker notebook.

```Python
# Accessing buckets for Machine Learning in Sagemaker

import s3fs  # pandas relies on s3fs to read 's3://' paths
import pandas as pd

df = pd.read_csv('s3://<bucket-name>/<filename>.csv')
df.head()
```
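
The `s3://` path passed to `pd.read_csv` combines the bucket name and the object key. As a small illustration, the sketch below splits such a path into its two parts with the standard library. `split_s3_uri` is a hypothetical helper (it is not part of pandas, s3fs, or boto3), and the bucket and file names in the usage lines are placeholders.

```Python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {}'.format(uri))
    # The bucket is the network location; the key is the path without the leading '/'
    # (this simple version assumes the key contains no '?' or '#' characters)
    return parsed.netloc, parsed.path.lstrip('/')

bucket, key = split_s3_uri('s3://my-example-bucket/data/tennis-numeric.csv')
print(bucket)
print(key)
```

The same two values are what boto3 client calls expect as their separate `Bucket` and `Key` arguments.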

```Python
# print decision tree

import operator

def tree_print(clf, X):
    tlevel = _tree_rprint('', clf, X.columns, clf.classes_)
    print('<',end='')
    for i in range(3*tlevel - 2):
        print('-',end='')
    print('>')
    print('Tree Depth: ',tlevel)

def _tree_rprint(kword, clf, features, labels, node_index=0, tlevel_index=0):
    for i in range(tlevel_index):
        print('  |',end='')
    if clf.tree_.children_left[node_index] == -1:  # indicates leaf
        print(kword, end=' ' if kword else '')
        # per-class values (or the predicted value for regression) stored at this leaf
        count_list = clf.tree_.value[node_index, 0]

        if len(count_list) == 1:
            # regression problem
            print(count_list[0])
        else:
            # get the majority label
            max_index, max_value = max(enumerate(count_list), key=operator.itemgetter(1))
            max_label = labels[max_index]
            print(max_label)
        return tlevel_index
    
    else:
        # compute and print node label
        feature = features[clf.tree_.feature[node_index]]
        threshold = clf.tree_.threshold[node_index]
        print(kword, end=' ' if kword else '')
        print('if {} <= {}: '.format(feature, threshold))
        # recurse down the children
        left_index = clf.tree_.children_left[node_index]
        right_index = clf.tree_.children_right[node_index]
        ltlevel_index = _tree_rprint('then', clf, features, labels, left_index, tlevel_index+1)
        rtlevel_index = _tree_rprint('else', clf, features, labels, right_index, tlevel_index+1)
        # return the maximum depth of either one of the children
        return max(ltlevel_index,rtlevel_index)
```


```Python
from sklearn import tree
from sklearn.metrics import accuracy_score

# set up data
X  = df.drop(['play'],axis=1)
y = df['play']

# set up the tree model object - max_depth=None places no limit on the depth of the tree
model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=None)

# fit the model on the full data set
model.fit(X, y)

# evaluate the model on the same data it was trained on
tree_print(model,X)
y_model = model.predict(X)
acc = accuracy_score(y, y_model)
print("Accuracy: {:3.2f}".format(acc))
```
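
The classifier above is built with `criterion='entropy'`, which scores candidate splits by how much they reduce the entropy of the class labels. The following is a minimal sketch of that calculation in plain Python. It is not scikit-learn's implementation, and the function names and the four-row label list are made up for illustration.

```Python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Made-up labels: an evenly mixed node that one split separates perfectly
labels = ['yes', 'yes', 'no', 'no']
print(entropy(labels))
print(information_gain(labels, ['yes', 'yes'], ['no', 'no']))
```

An evenly mixed node has the highest entropy for two classes, and a split that leaves each child with a single class removes all of it. The tree builder prefers splits with the largest such reduction.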

### Exercise 3

Query your bucket from your Sagemaker notebook and answer the same questions as in Exercise 1.

The following snippets will be helpful.

```Python
import pandas as pd
import boto3
from io import StringIO

def query_bucket(sql, bucket, key):
    '''
    Query a CSV object in an S3 bucket using S3 Select SQL syntax.
    Return the result as a dataframe, or None if no data was found.
    '''
    s3 = boto3.client('s3')

    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=sql,
        InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
        OutputSerialization = {'CSV': {}},
    )

    # the result can arrive in several 'Records' events, so collect all of them
    chunks = [event['Records']['Payload']
              for event in resp['Payload']
              if 'Records' in event]
    if not chunks:
        return None

    # join the raw bytes before decoding, then parse the CSV text
    data_in = StringIO(b''.join(chunks).decode("utf-8"))
    return pd.read_csv(data_in, header=None)
```

```Python
# launch a query
sql = "select temperature,play from s3object"
df = query_bucket(sql, "<bucket name>", "<file key>")
print(df)
```
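
The `Payload` returned by `select_object_content` is a stream of events, and the result rows can arrive spread over more than one `Records` event. The sketch below shows one way to join such chunks before parsing. The event list is simulated by hand (no AWS call is made), and `records_to_rows` is a hypothetical helper. It uses the `csv` module rather than pandas only to keep the example free of dependencies.

```Python
import csv
from io import StringIO

def records_to_rows(event_stream):
    """Join the payload bytes of all 'Records' events and parse them as CSV rows."""
    chunks = [event['Records']['Payload']
              for event in event_stream
              if 'Records' in event]
    if not chunks:
        return None
    # Join the raw bytes before decoding so that a row (or a multi-byte
    # character) split across two events is reassembled correctly
    text = b''.join(chunks).decode('utf-8')
    return list(csv.reader(StringIO(text)))

# Simulated events: the second row is split across two 'Records' events
fake_events = [
    {'Records': {'Payload': b'85,no\n80,'}},
    {'Records': {'Payload': b'no\n83,yes\n'}},
    {'Stats': {}},
    {'End': {}},
]
print(records_to_rows(fake_events))
```

Parsing each event on its own would break the second row into two incomplete pieces, which is why the chunks are joined first.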