# Setup

## Get AWS setup

1. Install the AWS CLI and boto3, following https://boto3.readthedocs.io/en/latest/guide/quickstart.html (Boto is the Amazon Web Services (AWS) SDK for Python):

        sudo apt-get install awscli
        pip install boto3
2. Visit AWS -> IAM -> Add user -> Security Credentials -> Create Access Key
3. Run `aws configure` and enter the access key ID, the secret access key, the region (`eu-west-1`, i.e. Ireland) and the output format (leave blank to keep the default, JSON).
4. Test with:

        import boto3
        s3 = boto3.resource('s3')
        for b in s3.buckets.all():
            print(b.name)
        
5. Install dask-ec2, following http://distributed.readthedocs.io/en/latest/ec2.html:

        pip install dask-ec2

6. Visit AWS -> EC2 -> Key pairs -> Create key pair. I called mine "research". Save the key file in `~/.ssh` and `chmod 600` it.
7. Get the AMI we want to use (e.g. Ubuntu 14.04). Check https://cloud-images.ubuntu.com/locator/ec2/ and search for e.g. `14.04 LTS eu-west-1 hvm ebs`.
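
Before moving on, it can save time to check that the steps above left things where the tools expect them. A minimal sketch, assuming `aws configure` wrote the default profile to `~/.aws/credentials` and `~/.aws/config` (its usual locations) and that the key file is called `research.pem` as in my case; `check_aws_setup` is just an illustrative helper name, not part of boto3 or the AWS CLI:

```python
import configparser
import stat
from pathlib import Path


def check_aws_setup(aws_dir, key_file):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # `aws configure` stores the access key pair in <aws_dir>/credentials.
    credentials = configparser.ConfigParser()
    credentials.read(Path(aws_dir) / "credentials")
    for option in ("aws_access_key_id", "aws_secret_access_key"):
        if not credentials.has_option("default", option):
            problems.append(f"{option} missing from credentials [default]")

    # The region is stored separately, in <aws_dir>/config.
    config = configparser.ConfigParser()
    config.read(Path(aws_dir) / "config")
    if not config.has_option("default", "region"):
        problems.append("region missing from config [default]")

    # ssh refuses private keys that group/others can access, hence chmod 600.
    key_path = Path(key_file)
    if not key_path.is_file():
        problems.append(f"key file not found: {key_path}")
    elif stat.S_IMODE(key_path.stat().st_mode) & 0o077:
        problems.append(f"key file is accessible by group/others: run chmod 600 {key_path}")

    return problems
```

Calling `check_aws_setup(Path.home() / ".aws", Path.home() / ".ssh" / "research.pem")` should return an empty list when everything is in place.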

## Running DASK

1. Run `dask-ec2 up --keyname YOUR-AWS-KEY --keypair ~/.ssh/YOUR-AWS-SSH-KEY.pem`. I found I also had to specify `--region-name`, `--ami` and `--tags`: the first two have the wrong defaults, and the tool seems to fail if `--tags` isn't set.
I also found that using Ubuntu 16.04 gave an "SSL wrong version number" error (see https://github.com/dask/dask-ec2/issues/38).
E.g.

        dask-ec2 up --keyname research --keypair .ssh/research.pem --region-name eu-west-1 --ami ami-d37961b5 --tags research:dp

Or a smaller cluster (2 instances x 2 vCPUs, \$0.22/hour in total):

        dask-ec2 up --keyname research --keypair .ssh/research.pem --region-name eu-west-1 --ami ami-d37961b5 --tags research:dp --count 2 --volume-size 30 --type m4.large
        
Or greedy (8 instances x 36 vCPUs, \$14.5/hour in total):

        dask-ec2 up --keyname research --keypair .ssh/research.pem --region-name eu-west-1 --ami ami-d37961b5 --tags research:dp --count 8 --volume-size 30 --type c4.8xlarge
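
Since the three commands above differ only in their sizing flags, a small helper can build them. This is only a sketch: `dask_ec2_up_command` is a hypothetical name, it uses just the flags that appear in the commands above, and the values are the ones from my setup:

```python
import shlex


def dask_ec2_up_command(keyname, keypair, region_name, ami, tags,
                        count=None, volume_size=None, instance_type=None):
    """Build the argument list for `dask-ec2 up` from the options used in this document."""
    args = [
        "dask-ec2", "up",
        "--keyname", keyname,
        "--keypair", keypair,
        "--region-name", region_name,
        "--ami", ami,
        "--tags", tags,
    ]
    # Sizing flags are only added when a value is given,
    # so dask-ec2 falls back to its own defaults otherwise.
    optional = [("--count", count), ("--volume-size", volume_size), ("--type", instance_type)]
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    return args


small = dask_ec2_up_command(
    keyname="research",
    keypair=".ssh/research.pem",
    region_name="eu-west-1",
    ami="ami-d37961b5",
    tags="research:dp",
    count=2,
    volume_size=30,
    instance_type="m4.large",
)
# Print a copy-pasteable shell line; pass `small` to subprocess.run to launch for real.
print(" ".join(shlex.quote(arg) for arg in small))
```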
        
After a long time, this eventually finishes with:

        Dask.Distributed Installation succeeded

        Addresses
        ---------
        Web Interface:    http://54.246.253.159:8787/status
        TCP Interface:           54.246.253.159:8786

        To connect from the cluster
        ---------------------------

        dask-ec2 ssh  # ssh into head node
        ipython  # start ipython shell

        from dask.distributed import Client, progress
        c = Client('127.0.0.1:8786')  # Connect to scheduler running on the head node

        To connect locally
        ------------------

        Note: this requires you to have identical environments on your local machine and cluster.

        ipython  # start ipython shell

        from dask.distributed import Client, progress
        e = Client('54.246.253.159:8786')  # Connect to scheduler running on the head node

        To destroy
        ----------

        dask-ec2 destroy
        Installing Jupyter notebook on the head node
        DEBUG: Uploading file /tmp/tmp1GOH7d to /tmp/.__tmp_copy
        DEBUG: Running command sudo -S bash -c 'cp -rf /tmp/.__tmp_copy /srv/pillar/jupyter.sls' on '54.246.253.159'
        DEBUG: Running command sudo -S bash -c 'rm -rf /tmp/.__tmp_copy' on '54.246.253.159'
        +---------+----------------------+-----------------+
        | Node ID | # Successful actions | # Failed action |
        +=========+======================+=================+
        | node-0  | 17                   | 0               |
        +---------+----------------------+-----------------+
        Jupyter notebook available at http://54.246.253.159:8888/ 
        Login with password: jupyter
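
If you want to script the next step, the addresses can be pulled out of that output. A rough sketch, assuming the output looks like the transcript above (other dask-ec2 versions may print something different); `parse_dask_ec2_output` is a made-up helper:

```python
import re


def parse_dask_ec2_output(text):
    """Extract the dashboard, scheduler and notebook addresses from `dask-ec2 up` output."""
    patterns = {
        "dashboard": r"Web Interface:\s+(\S+)",
        "scheduler": r"TCP Interface:\s+(\S+)",
        "notebook": r"Jupyter notebook available at\s+(\S+)",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


# Lines copied from the transcript above.
sample = """
Web Interface:    http://54.246.253.159:8787/status
TCP Interface:           54.246.253.159:8786
Jupyter notebook available at http://54.246.253.159:8888/
"""
addresses = parse_dask_ec2_output(sample)
print(addresses)
```

`addresses["scheduler"]` can then be passed straight to `Client(...)`.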

#### Finding modules is a problem

I found these didn't work out of the box. Critically, it failed with "`distributed.utils - ERROR - No module named dask_searchcv.methods`". I found I had to install the module on each worker:


        local$ dask-ec2 ssh 1
        dask1$ conda install dask-searchcv -c conda-forge -y

I guess if I use GPy, a similar process is needed.

Possible solutions:

1. Make an image with the required modules already installed (a bit awkward if you want to add a module later...?)
2. Find out how to load modules on all workers via dask-ec2...
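
For option 2, one approach that does not depend on dask-ec2 at all is `Client.run`, which calls a function once on every worker. A minimal sketch, assuming `conda` is on the `PATH` of the worker processes (I have not checked that this holds on the dask-ec2 images); `conda_install` and `install_on_all_workers` are hypothetical helper names:

```python
import subprocess


def conda_install(package, channel="conda-forge"):
    """Run on a worker: install `package` with conda and return the exit code."""
    # Same command as the manual per-worker install above.
    command = ["conda", "install", package, "-c", channel, "-y"]
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode


def install_on_all_workers(client, package):
    """Ask every worker to install `package`; returns {worker_address: exit_code}."""
    # Client.run executes a plain function once on each worker process,
    # outside the normal task scheduling system.
    return client.run(conda_install, package)


# Usage (needs a live cluster, so it is left commented out):
# from dask.distributed import Client
# client = Client('54.246.253.159:8786')
# print(install_on_all_workers(client, 'dask-searchcv'))
```

Workers that have already tried and failed to import the module may need restarting, and several workers on one machine would run conda at the same time, so treat this as a starting point rather than a tested recipe.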

Local:

In [6]:

        from sklearn.datasets import load_digits
        from sklearn.svm import SVC

        # Fit with scikit-learn
        from sklearn.model_selection import GridSearchCV

        param_space = {'C': [1e-4, 1, 1e4],
                       'gamma': [1e-3, 1, 1e3],
                       'class_weight': [None, 'balanced']}

        model = SVC(kernel='rbf')

        digits = load_digits()

        search = GridSearchCV(model, param_space, cv=3)
        %time search.fit(digits.data, digits.target)

        CPU times: user 20.9 s, sys: 0 ns, total: 20.9 s
        Wall time: 20.6 s

        GridSearchCV(cv=3, error_score='raise',
               estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
          decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
          max_iter=-1, probability=False, random_state=None, shrinking=True,
          tol=0.001, verbose=False),
               fit_params={}, iid=True, n_jobs=1,
               param_grid={'C': [0.0001, 1, 10000.0], 'gamma': [0.001, 1, 1000.0], 'class_weight': [None, 'balanced']},
               pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
               scoring=None, verbose=0)
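
To get a feel for how much work this search is, the number of fits can be counted from the parameter grid. A small sketch using only the standard library; the `+ 1` assumes `refit=True`, as shown in the output above:

```python
from itertools import product

param_space = {'C': [1e-4, 1, 1e4],
               'gamma': [1e-3, 1, 1e3],
               'class_weight': [None, 'balanced']}
cv_folds = 3

# Every combination of parameter values is one candidate model.
candidates = [dict(zip(param_space, values))
              for values in product(*param_space.values())]

# Each candidate is fitted once per cross-validation fold; with refit=True
# the best candidate is then fitted once more on the full dataset.
cv_fits = len(candidates) * cv_folds
total_fits = cv_fits + 1

print(len(candidates), cv_fits, total_fits)
```

That gives 18 candidates and 54 cross-validation fits, which are independent of each other and so are the part that can be spread over workers.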

Via Dask (note it takes longer, as the two small servers on AWS are basically less powerful than my laptop!):

In [5]:

        from sklearn.datasets import load_digits
        from sklearn.svm import SVC

        from dask.distributed import Client
        e = Client('54.194.173.226:8786')  # different to example output above as I've restarted DASK

        # Fit with dask-searchcv
        from dask_searchcv import GridSearchCV

        param_space = {'C': [1e-4, 1, 1e4],
                       'gamma': [1e-3, 1, 1e3],
                       'class_weight': [None, 'balanced']}

        model = SVC(kernel='rbf')

        digits = load_digits()

        search = GridSearchCV(model, param_space, cv=3)
        %time search.fit(digits.data, digits.target)

        CPU times: user 84 ms, sys: 20 ms, total: 104 ms
        Wall time: 25.6 s

        GridSearchCV(cache_cv=True, cv=3, error_score='raise',
               estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
          decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
          max_iter=-1, probability=False, random_state=None, shrinking=True,
          tol=0.001, verbose=False),
               iid=True, n_jobs=-1,
               param_grid={'C': [0.0001, 1, 10000.0], 'gamma': [0.001, 1, 1000.0], 'class_weight': [None, 'balanced']},
               refit=True, return_train_score=True, scheduler=None, scoring=None)

In [None]:

        # list of modules I'll need (incomplete)
        #!conda install scikit-learn -y
        #!conda install dask distributed -y
        #!conda install dask-searchcv -c conda-forge -y
        #!pip install paramz
        #!pip install matplotlib
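
To avoid finding out about a missing module halfway through a job, the list can be checked up front. A sketch, assuming the import names below correspond to the packages listed (`scikit-learn` imports as `sklearn`, `dask-searchcv` as `dask_searchcv`); `missing_modules` is an illustrative helper:

```python
import importlib.util

# Import names for the packages in the list above (assumed mapping).
REQUIRED_MODULES = ["sklearn", "dask", "distributed", "dask_searchcv", "paramz", "matplotlib"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec looks a module up without importing it and returns None if it is absent.
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing locally:", missing_modules())
```

With a connected `Client`, `client.run(missing_modules)` should give the same report for every worker.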