Frameworks for ML scaling and production
----

# A simple API with Flask and Heroku

Create and pickle a model as `model.pkl`, then just create an app that accepts `POST` requests to the root path:

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

# load the pickled model once at startup
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# app
app = Flask(__name__)


# routes
@app.route('/', methods=['POST'])
def predict():
    # get data: a JSON object mapping feature names to values
    data = request.get_json(force=True)

    # convert data into a single-row dataframe
    data_df = pd.DataFrame({key: [value] for key, value in data.items()})

    # predictions
    result = model.predict(data_df)

    # send the prediction back to the client as JSON
    output = {'results': int(result[0])}
    return jsonify(output)


if __name__ == '__main__':
    app.run(port=5000, debug=True)
```
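
To exercise the endpoint above, a client has to send the features as a JSON body in a `POST` request. The following is a minimal sketch using only the standard library; the feature names (`age`, `income`) and the local URL are placeholders, not part of the original app, and should be replaced with the columns the model was trained on.

```python
import json
from urllib import request


def build_prediction_request(features, url="http://localhost:5000/"):
    """Build (but do not send) a JSON POST request for the prediction API."""
    body = json.dumps(features).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature names; use the columns the model was trained on.
req = build_prediction_request({"age": 42, "income": 50000})
print(req.get_method(), req.full_url)

# With the Flask app running, the request could be sent like this:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it makes the payload easy to inspect before the server is available.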

Save the required packages to `requirements.txt`, where each line is of the format `package==version`. If using a clean environment, this can be done with:
>`pip freeze > requirements.txt`

To deploy to Heroku, create `Procfile` with the following contents:
>`web: gunicorn app:app`

From within the Heroku web interface, GitHub repos can be deployed with a few clicks.

# Hadoop

https://hadoop.apache.org/

## Introduction and Use Case

Hadoop is an open-source distributed data management system. It combines tools to store, analyze, and process large-scale pools of data on clusters of servers, without requiring specialized hardware. The "vanilla" version maintained by the Apache Foundation is quite intricate and not entirely stable, so there are many commercial distributions offered by third parties (such as Cloudera, Hortonworks). The major cloud services ([Google](https://cloud.google.com/dataproc?hl=en), Amazon, Microsoft) can also host Hadoop, either with their own out-of-the-box solutions or provided by commercial distributions.

This [table](https://hadoopecosystemtable.github.io/) summarizes libraries and applications within the Hadoop "ecosystem," including those produced by Apache itself and many others.

Cloud data systems like Hadoop represent an alternative to relational databases, designed to provide greater scalability and speed at large scales. It is often said that a distributed data store can guarantee only 2 of 3 goals (the CAP theorem): consistency, availability, and partition tolerance (which is what allows scaling across machines). SQL databases prioritize C and A, while Hadoop prioritizes A and P. Because Hadoop lacks the transaction control of relational databases, it is better suited to "behavioral" data than to "line of business" data (such as customer accounts, supply chains, etc). Behavioral data is collected *in aggregate* as a side effect of user activity. Rather than being tracked and queried on the level of individuals, this data is primarily useful for the general patterns that can be seen in it, hence it is acceptable to deprioritize consistency in a way that would not be workable for business-critical data.


## Alternatives for Running Hadoop

1. Apache Hadoop open source versus vendor services
1. Docker images versus virtual machines
1. Local file system, pseudo-distributed, fully distributed on own servers, versus on the cloud
1. Versioning: Apache Hadoop updates frequently, and there are incompatibilities between some versions. MapReduce in particular went through a major 1.0 to 2.0 transition.

## Elements of the Hadoop Ecosystem

### Hadoop File System (HDFS)

Developed out of a system published by Google, HDFS promises scalability on "commodity" hardware. By default, it employs 3x data redundancy (each block is stored on three nodes) and uses larger block sizes than conventional file systems. It is also possible to use the native file systems of cloud services.
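
The effect of block size and replication on storage can be worked through with a little arithmetic. This is only a sketch: the 128 MB block size is an assumption (it is the default in recent Hadoop versions but is configurable), and the function name is made up for illustration.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block count, replica count, and raw storage for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies as much space as the data it holds,
    # so raw storage is simply the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```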

HDFS follows a write-once model: operations on data are saved as new files in the system rather than altering existing data. This includes re-executing operations: a re-run does not overwrite earlier results, so each run needs its own output location (by default, a MapReduce job will refuse to write into an output directory that already exists).

The HDFS command-line interface syntax is `hadoop fs -command` (or sometimes `hdfs dfs -command`), where `command` includes many Linux shell commands like `cat`, `mkdir`, `ls` etc., plus distinctive commands like `put` and `get` to move files between HDFS and other storage (local/cloud). HDFS locations are written as URLs: `hdfs://...`

### MapReduce

The distributed processing framework for Hadoop. Implemented in Java, MapReduce processes (and anything else executing on a Hadoop server) are executed in Java virtual machines (JVM). Each process is a distinct VM that does not share state. The quirk this introduces is that although the syntax is object-oriented (being Java, everything is a class, in this case Static classes), the paradigm is much closer to functional programming, as each process can only take in data and output results without being able to reference the results of other parallel instances.

There are now also APIs for languages more commonly used in data science like Python and R, as well as interfaces for other systems programming languages like C# and C++.

The basic unit of a MapReduce routine is the **Job**, which is instantiated to carry out Map and Reduce operations on data. The **Map** functionality applies some set of operations *on each node* in the Hadoop cluster. It returns a set of key/value pairs. The **Reduce** functionality aggregates key/value pairs (on some subset of nodes) and returns a combined list, which is stored as a new file in the system. In between these two steps, the data (duplicated across nodes) is "shuffled and sorted" to processing nodes. For efficiency, it is possible to do a preliminary **Combine** stage on the original node, which pre-aggregates the map output locally and so reduces the volume of data that needs to be transferred across nodes for sorting and later reduction.

So, for example, in a basic word count operation (producing a list of the unique words in a text and their corresponding counts), the map function would turn the text into a list of words (each with 1 instance) and the reduce function would look at each word and sum up the instances.
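
The word count just described can be imitated on a single machine in plain Python. This is a sketch of the map, shuffle/sort, and reduce stages only, not the Hadoop API; the function names and sample lines are invented for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(word, counts):
    """Reduce: sum the instances of one word."""
    return word, sum(counts)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_words(line) for line in lines)
word_counts = dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

In a real cluster each call to `map_words` would run on the node holding that part of the data, and only the grouped pairs would travel over the network.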

It is considered good practice to subdivide tasks so that each routine performs only a single operation, and more complex operations are the result of chains of jobs. Pre-processing, for example, can be run as a "map only" job.

Jobs are run by submitting them to the scheduler: this takes the form of indicating a `.jar` file and the class name to run as main, plus needed arguments like source and output locations. From the command line, the syntax is `hadoop jar filename.jar MainClass input output`.

MapReduce 1.0 was limited because it could only process in batch and was not easy to customize. The 2.0 update allows more "on-time" operations and more intricate controls of how operations are carried out.

### YARN

Yet Another Resource Negotiator: an abstraction layer added along with MapReduce 2.0 that allows a wider range of data processing on top of HDFS.


### Apache Spark

An alternative distributed data processing engine, which primarily operates in memory. See the notes on PySpark below.

### HBase

A wide-column, schema-on-read (NoSQL) database format that acts as a relatively accessible front-end to data stored in a Hadoop cluster.

### Hive

A SQL-like query interface for data stored in Hadoop (often in HBase), which acts as a MapReduce front-end. Its query language is known as HiveQL or HQL. The syntax is similar to SQL, but the backend is fundamentally different. For one thing, because it is a front-end for MapReduce (often via HBase), it executes batch jobs on the cluster, which can take substantial time.

`CREATE TABLE` commands pull requested fields from data into a wide table, then `SELECT...WHERE` commands can pull out specific records. NB, since the underlying data is not relational, `JOIN` statements are often impractical.

### Pig

A scripting tool for Hadoop, used especially for data input and cleaning (ETL: extract, transform, load). Its native language is called Pig Latin.

### Oozie

A workflow manager used to coordinate scripts from multiple libraries. Jobs are scripted using XML, so commercial GUIs are often used in practice.

### Sqoop

Command-line utility for transferring data between relational databases and Hadoop clusters.

### ZooKeeper

Centralized service for Hadoop configuration information, to create ensembles of programs. It performs computation in-memory for more real-time operations.

# PySpark


## Introduction

PySpark is the Python API for Apache Spark. From the website:
> Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

The Spark API is used to define a graph of operations to be performed in parallel on a large, distributed dataset. The framework will try to optimize this for maximum parallelized efficiency. Accordingly, the methods of the Spark API are lazily evaluated.
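
Lazy evaluation can be illustrated without Spark at all, since Python's built-in `map` and `filter` are also lazy. The sketch below is only an analogy for how transformations are recorded first and executed when a result is requested; it does not use the Spark API, and `traced_double` is a made-up helper.

```python
log = []


def traced_double(x):
    """Double a value, recording when the function actually runs."""
    log.append(x)
    return x * 2


data = range(5)

# "Transformations": building the chain of operations runs nothing yet.
doubled = map(traced_double, data)
large = filter(lambda x: x > 4, doubled)
calls_before_action = len(log)

# "Action": asking for the results forces the whole chain to execute.
result = list(large)
print(calls_before_action, result, log)
```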

Spark enables interactive programming through shells, in different language flavors: Scala, R, and Python. The PySpark API is also usable through a Python library. The connection with a Spark cluster through PySpark is managed by instances of the `SparkContext` class.

## SparkContext

With a SparkContext instance `sc`:
- `sc.version` prints the version of Spark
- `sc.pythonVer` prints the version of Python being used by Spark
- `sc.master` identifies the server to which the shell is connected (`local[*]` for a local connection)

Loading data into a Spark instance:
- `sc.parallelize(array_like)` converts the given data to an RDD
- `sc.textFile(file_path)` reads the given file

## PySpark data structures

### RDD

Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that lets Spark work its magic by splitting data with redundancy across multiple nodes in the cluster. The partitioning can be handled automatically by Spark or managed in certain ways when creating RDDs (with the methods above). RDD objects have a `getNumPartitions()` method to see how many partitions are used.

An RDD is structured as an array. It is common for the elements to consist of key-value pairs. These are known as pair RDDs, though they are not a distinct data type in the API. RDD methods interpret a series of two-item tuples as key-value pairs. Thus, a map that generates two-value tuples will produce a paired RDD.

Operations performed on RDDs are divided into two broad categories: transformations that generate a new RDD and actions that have some other result (and often trigger evaluation of the transformation graph). Elementary transformation methods include:
- `map(f)`
- `filter(f)`
- `flatMap(f)`: applies a function that returns multiple values and then flattens all of the results into a single array.
- `union(other)`: combines two RDDs
- `coalesce(n)`: reparallelizes the RDD into `n` partitions

Additional transformations that operate only on pair RDDs (calling these methods on RDDs that are not comprised of 2-value tuples will raise an error **when evaluated**):
- `reduceByKey(f)`: sequentially applies a 2->1 function to the set of values sharing each key and generates a new pair RDD with the results
- `groupByKey()`: generates a paired RDD where the values are a special iterable class containing all values for the key in the original data
- `sortByKey(ascending=True)`
- `join()`: by default, carries out an inner join, where the values are tuples of the values for shared keys in the component RDDs
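
What the pair transformations above compute can be shown with ordinary Python lists of 2-tuples. This is a single-machine sketch of the semantics only: the helper names are invented, nothing here is distributed or lazy, and it is not the PySpark implementation.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Collect all values that share a key, like groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, f):
    """Fold the values of each key with a 2->1 function, like reduceByKey(f)."""
    return {key: reduce(f, values) for key, values in group_by_key(pairs).items()}


def inner_join(left, right):
    """Pair up values for keys present on both sides, like join()."""
    right_groups = group_by_key(right)
    return [(key, (lv, rv)) for key, lv in left for rv in right_groups.get(key, [])]


sales = [("apple", 3), ("pear", 1), ("apple", 2)]
prices = [("apple", 0.5), ("kiwi", 0.2)]

grouped = group_by_key(sales)
totals = reduce_by_key(sales, lambda a, b: a + b)
joined = inner_join(sales, prices)
print(grouped, totals, joined)
```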

Action methods include:
- `collect()`: executes the graph and returns the result *as a list* (at least in PySpark). Key-value pairs are expressed as tuples.
- `take(N)`: returns a list of the first N elements of the RDD
- `first()`: equivalent to `take(1)`
- `count()`: returns number of entries
- `reduce(f)`: as in `functools`, the function must take 2 values and return 1
- `saveAsTextFile(dir)`: writes each partition to a separate text file in the given directory

Pair-value-only actions:
- `countByKey()`: returns the counts in a dictionary
- `collectAsMap()`: executes the graph and returns all results as a dictionary

Note that for the collect and ByKey actions, care must be taken to not request more data than can fit in memory.

In general, though, RDDs are hard to work with directly, so Spark provides a DataFrame abstraction built on top of RDDs.

### DataFrames

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized for complicated operations than RDDs. They flatten out the nested complexity of the RDD structure, but unlike reading the data out directly into a Python object, they keep the operations within the Spark API.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in.

#### SparkSession

Within PySpark, the DataFrames interface is encapsulated in `pyspark.sql.SparkSession` objects. To prevent conflicting coexisting sessions, the class method `builder.getOrCreate()` returns an existing session if it exists and only opens a new one if not. The idiomatic name for the SparkSession is `spark`.

DataFrames stored in the database are accessible through the `catalog` (which returns a `Catalog` instance). Thus `spark.catalog.listTables()` will list the available tables in the current database. The `table("name")` method returns a **PySpark DataFrame** of the requested table from the catalog. Alternatively, SQL queries can be made with `spark.sql()`: these treat the catalog like a relational database, allowing the creation of custom DataFrames with `SELECT ... FROM ...` statements.

#### Creating DataFrames

One key difference between an RDD and a DataFrame is that the latter requires a schema. Thus, creating a DataFrame table from an RDD entails, at a minimum, providing column names: `df = spark.createDataFrame(rdd, schema=col_names)`. Spark will infer the data types. The schema of a DataFrame can be seen with the `printSchema()` method, and a list of column names is available through the `columns` attribute.

DataFrames can also be created directly from local data. To convert a Pandas DataFrame to a Spark DataFrame, use `spark.createDataFrame(df)`. Or data can be read directly from csv with `spark.read.csv(file_name, header=True, inferSchema=True)`. Spark DataFrames can also be converted to Pandas DataFrame with the `toPandas()` method.

A newly created DataFrame exists only in local memory. To add it to the database, use the DataFrame's `createTempView('name')` or `createOrReplaceTempView('name')` method (NB the former will throw an exception if a view with the given name already exists).

#### Manipulating DataFrames

Note that to print the contents of a DataFrame, call the `show()` method (i.e. printing does not work); limit to `n` rows with `show(n)`.

PySpark DataFrames are immutable, so any mutating operation is actually creating a copy, which can then be assigned over the previous variable name.

The columns of the DataFrame are accessible as attributes or by indexing. These are Column objects, and they have overloaded operators as with Pandas Series. An alias for display can be defined for a Column object with the `alias()` method (used for tables produced by `select()`). Note that Column transformations, as with RDDs, are lazily evaluated, so e.g. exceptions based on types are only raised when a column is joined to a DataFrame. Moreover, Columns created by operations on Columns are linked to specific names and IDs, so they can be used with `select()` or `withColumn()` only if the calling DataFrame has a matching column (NB the ID is changed by column overwriting).

To create a new DataFrame with an added or transformed column: `df.withColumn("column_name", column)`, where `column` is a Column object. Rename a column with the `withColumnRenamed('oldName', 'newName')` method.

Column data types can be changed with the Column object's `cast('type')` method.

DataFrames can be filtered using the `filter()` method, which accepts either a query string (akin to what follows a SQL `WHERE`) or a boolean operation on a column (e.g. `df.col > 0`). Similarly, the `select(*cols)` method returns a DataFrame with the columns specified as positional arguments, which can be either the column names as strings or Column objects, allowing transformed columns. To use SQL syntax to transform columns, use `selectExpr()`, where each positional argument is a SQL-style column expression (i.e. something separated by a comma in a `SELECT` statement).

DataFrames can be sorted with the method `orderBy(*cols)`. Duplicates can be removed by `dropDuplicates(*cols)` (an empty parameter set will drop only complete duplicates).

The `describe()` method creates a DataFrame of summary statistics for all numerical columns (or a subset specified by positional arguments). In addition, operations like `min`, `max`, and `mean` can be performed in `selectExpr()` or as methods of a `GroupedData` object. To do the latter, it is necessary to call `df.groupBy()`, even with no argument (in which case the whole DataFrame is treated as a single group). Functions on columns are available in the `pyspark.sql.functions` module, e.g. `functions.stddev('colname')`, returning a Column that can be passed to the `agg()` method of a GroupedData object. NB, the aggregation methods of grouped objects return DataFrames, not Columns, so inside `agg()` it is necessary to use these functions instead. Note also the lazy evaluation of the function: the column name is only resolved when the Column is passed to `agg()`.

Joins can be done with the method `join(other, on, how)`.

### Visualization

There are different ways of getting visualizations out of PySpark data.

#### Converting to Pandas

By converting a PySpark DataFrame to Pandas using `toPandas()`, any Matplotlib or Seaborn plotting method can be used. Recall that this requires fitting all of the data into one machine's memory.

#### Pyspark_dist_explore

This package implements some basic distribution visualization plots for Spark DataFrames:
- `hist(df)`
- `distplot(df)`
- `pandas_histogram(df)`

#### HandySpark visualization methods

HandySpark DataFrames add some additional functionality while preserving the distributed character of Spark DataFrames, including visualization methods like `hist()`. They can be created with the Spark DataFrame's `toHandy()` method.

## Modeling

### MLlib

Apache Spark includes a machine learning library MLlib, which implements common feature engineering and modeling techniques for distributed data, including classification, regression, clustering, and collaborative filtering (recommendation). Note that these each have their own submodules, which contain classes for models and common transformations.

**NOTE: this library seems to be semi-deprecated in favor of the DataFrame-based `ml` module covered below.**

#### Utilities and basic syntax
- RDDs have a `randomSplit([training_share, testing_share])` method that takes a list of two fractions and returns two RDDs with randomly selected values in the given proportions.
- Models are simultaneously initialized and trained with the class method `MODEL.train(data, **params)`.

#### Feature engineering
`pyspark.mllib.feature`
- Tokenization by hashing values: `HashingTF(numFeatures=d)` returns sparse term-frequency vectors of length `d`
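
The idea behind hashing tokens into a fixed-length vector can be sketched in plain Python. This is an illustration of the technique only: it uses MD5 from `hashlib` as a stand-in hash (an assumption; Spark uses a different hash function, so the indices will not match Spark's output), and `hashing_tf` is a hypothetical helper name.

```python
import hashlib


def hashing_tf(tokens, num_features):
    """Map tokens to a sparse term-frequency vector of fixed length."""
    counts = {}
    for token in tokens:
        # A stable hash keeps the token -> index mapping reproducible.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    # Return (length, {index: count}); absent indices are implicitly zero.
    return num_features, dict(sorted(counts.items()))


size, vector = hashing_tf(["spark", "hadoop", "spark"], num_features=16)
print(size, vector)
```

Because different tokens can land on the same index, a larger `num_features` trades memory for fewer collisions.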

#### Recommendation

`pyspark.mllib.recommendation`
- `Rating(user, product, rating)`: a class for capturing product rating observations
- `ALS`: alternating least squares algorithm. Its `train()` method accepts parameters `rank` and `iterations`. The `predictAll()` method on the trained model takes data tuples of user and product and predicts ratings, returning Rating objects. Its accuracy can be tested by manually implementing a mean squared error measure:
```python
# Prepare ratings data
rates = rtest_data.map(lambda r: ((r[0], r[1]), r[2]))

# Prepare predictions data
preds = predictions.map(lambda r: ((r[0], r[1]), r[2]))

# Join the ratings data with predictions data
rates_and_preds = rates.join(preds)

# Calculate and print MSE
MSE = rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
```

#### Classification
`pyspark.mllib.classification`
- `Vectors` (imported from `pyspark.mllib.linalg`): can be created as either `dense(data)` or `sparse(n, {i: value,...})`
- `LabeledPoint(label, feature_vector)` (imported from `pyspark.mllib.regression`)
- `LogisticRegressionWithLBFGS`: accepts data of `LabeledPoint`s. The trained model's `predict()` method accepts a feature vector and returns the predicted label.
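
The relationship between the dense and sparse vector forms listed above can be sketched in plain Python. The helper below is hypothetical and only mirrors the `sparse(n, {i: value,...})` argument layout; it is not part of the MLlib API.

```python
def sparse_to_dense(size, entries):
    """Expand a (size, {index: value}) description into a full list."""
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = float(value)
    return dense


# Only the non-zero positions are stored in the sparse form.
dense = sparse_to_dense(6, {1: 2.0, 4: 7.5})
print(dense)
```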

#### Clustering

`pyspark.mllib.clustering`
- `KMeans`: `train()` takes floating point vectors, and named arguments `k` and `maxIterations`. The model object's `clusterCenters` attribute returns a list of arrays.

### ML on DataFrames

Machine Learning models on DataFrames are implemented in the `pyspark.ml` module. Different models have different APIs:
1. Transformer models, which have a `transform()` method that takes and returns a DataFrame, performing transformations on the column(s) identified in the constructor with `inputCol=''` or `inputCols=[]` and creating `outputCol` or `outputCols`.
1. Estimator models that perform fitted transformations. The constructor takes `inputCol` (etc), and the object's `fit()` method takes a DataFrame and returns a Transformer.
1. Predictor models that carry out machine learning. The constructor takes a *single* `featuresCol` (a feature vector column, which can be created with the `VectorAssembler()` transformer) and `labelCol`. Note that `featuresCol` defaults to `'features'`. The `fit()` method takes a DataFrame and returns a Transformer. NB these objects do have a `predict()` method, but it takes a single feature vector.

For example, to apply one-hot encoding to string values:
```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Map each distinct string to a numeric category index
indexer = StringIndexer(inputCol='string_column', outputCol='cat_column')
fit_indexer = indexer.fit(df)
indexed_df = fit_indexer.transform(df)

# Encode the category index as a one-hot vector
encoder = OneHotEncoder(inputCol='cat_column', outputCol='encoded_col')
fit_encoder = encoder.fit(indexed_df)
encoded_df = fit_encoder.transform(indexed_df)
```
Note that this does not generate one column per value but instead creates a single column containing a sparse vector (displayed much like a tuple) that can be interpreted by the models.
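
The shape of that encoded value can be illustrated with a plain-Python imitation of the two steps. This is a sketch under assumptions, not the Spark implementation: it assumes indices are assigned by descending frequency with alphabetical tie-breaking and that the last category is dropped (encoded as all zeros), and both function names are invented.

```python
from collections import Counter


def fit_string_indexer(values):
    """Assign indices by descending frequency (ties broken alphabetically)."""
    ordered = sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))
    return {label: float(i) for i, (label, _) in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Return a (size, indices, values) triple describing a sparse vector."""
    size = n_categories - 1 if drop_last else n_categories
    position = int(index)
    if position < size:
        return (size, [position], [1.0])
    return (size, [], [])  # the dropped last category is all zeros


colors = ["red", "blue", "red", "green", "red", "blue"]
index_of = fit_string_indexer(colors)
encoded = [one_hot(index_of[c], len(index_of)) for c in colors]
print(index_of)
print(encoded[0], encoded[1], encoded[3])
```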

Estimators and transformers can be combined into a `pyspark.ml.Pipeline(stages=[])`, where the parameter is a list of model objects.

#### Model evaluation and tuning

A train-test split can be created with the DataFrame's `randomSplit()` method, which takes a list of $n$ proportions and returns a list of $n$ DataFrames.

Predictor models have built-in `evaluate()` methods (though it's not clear from the docs what the metric is), but custom evaluations are defined in the `pyspark.ml.evaluation` module.

To tune a logistic regression model with 5-fold cross-validation:
```python
import numpy as np
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# `train` and `test` are assumed to be DataFrames with
# `features` and `label` columns

# Define the estimator first: the grid refers to its parameters
lr = LogisticRegression()

# Create the parameter grid
grid = ParamGridBuilder()
grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01))
grid = grid.build()

evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')

# numFolds defaults to 3, so request 5 folds explicitly
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    numFolds=5
)

cv_results = cv.fit(train)
best_lr = cv_results.bestModel

best_predictions = best_lr.transform(test)
print(evaluator.evaluate(best_predictions))
```
I'm not sure how well this would work with a pipeline, since it is not clear how to access underlying parameters within a pipeline. The DataCamp course recommended doing all transformations in a pipeline, then splitting the data, then doing cross-validation on the training data, but this would contaminate the fit. The Cloudera presentation splits beforehand and applies the pipeline to each part separately.
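
The mechanics of k-fold cross-validation can be sketched in plain Python to show why anything fitted on the full dataset before splitting has already seen the validation rows. This is an illustration only: folds are assigned round-robin here for reproducibility (an assumption; Spark assigns rows to folds randomly), and `k_fold_indices` is a made-up helper.

```python
def k_fold_indices(n_rows, k):
    """Yield (training, validation) row indices for each of k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield training, validation


splits = list(k_fold_indices(n_rows=10, k=5))
for training, validation in splits:
    print("train:", training, "validate:", validation)
```

Every row serves as validation data exactly once, so a transformation fitted before the split would have learned from each validation fold.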

# Containerization

In order to reproduce the environment for applications, it is often helpful to encapsulate them through virtualization. This can be done with various virtual environment tools, but *containers* add an additional step of portability. **Docker** is one popular containerization tool, providing functionality to reproduce and run containers, as well as hosting DockerHub as a repository for Docker images.

## Docker

It is important to distinguish between:
- **Dockerfile**: A Dockerfile is a text file that specifies how an image will be created.
- **Docker Images**: Images are created by building a Dockerfile.
- **Docker Containers**: A Docker container is a running instance of an image.

### The Dockerfile

```
+------------+-----------------------------------------------------+
| Command    | Description                                         |
+------------+-----------------------------------------------------+
| FROM       | The base Docker image for the Dockerfile.           |
| LABEL      | Key-value pair for specifying image metadata.       |
| RUN        | Executes commands on top of the current image as    |
|            | new layers.                                         |
| COPY       | Copies files from the local machine to the          |
|            | container filesystem.                               |
| EXPOSE     | Exposes runtime ports for the Docker container.     |
| CMD        | Specifies the command to execute when running the   |
|            | container. This command is overridden if another    |
|            | command is specified at runtime.                    |
| ENTRYPOINT | Specifies the command to execute when running the   |
|            | container. Entrypoint commands are not overridden   |
|            | by a command specified at runtime.                  |
| WORKDIR    | Sets the working directory of the container.        |
| VOLUME     | Mount a volume from the local machine filesystem to |
|            | the Docker container.                               |
| ARG        | Sets a build-time variable as a key-value pair,     |
|            | available only while building the image.            |
| ENV        | Sets an environment variable as a key-value pair    |
|            | that will be available in the container after       |
|            | building.                                           |
+------------+-----------------------------------------------------+
```

### Docker Images

The command `docker build -t <image-name> .` builds an image from the `Dockerfile` in the current directory (the trailing `.` is the build context). Docker keeps a record of local images, which can be managed with commands:
```
+---------------------------------+--------------------------------+
| Command                         | Description                    |
+---------------------------------+--------------------------------+
| docker images                   | List all images on the         |
|                                 | machine.                       |
| docker rmi [IMAGE_NAME]         | Remove the image with name     |
|                                 | IMAGE_NAME on the machine.     |
| docker rmi $(docker images -q)  | Remove all images from the     |
|                                 | machine.                       |
+---------------------------------+--------------------------------+
```

### Running Containers

The syntax for running a container from an image is as follows:
```bash
docker run [-d -it --rm --name <CONTAINER_NAME> -p <host:container> -v <source:target>] <IMAGE_NAME>
```

- `-d`: run the container in detached mode. This mode runs the container in the background.
- `-it`: run in interactive mode, with a terminal session attached.
- `--rm`: remove the container when it exits.
- `--name`: specify a name for the container.
- `-p`: port forwarding from host to the container (i.e. `host:container`).
- `-v`: mount a local directory into the indicated directory within the container. Any changes made on the drive will be reflected in the running container (as opposed to when a file is copied).
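
When containers are launched from scripts, the flags above can be assembled programmatically. The following is a minimal sketch with an invented helper, `docker_run_command`; it only builds the argument list and does not invoke Docker, and the image and container names are placeholders.

```python
import shlex


def docker_run_command(image, name=None, ports=None, volumes=None,
                       detach=False, interactive=False, remove=False):
    """Assemble the argument list for a `docker run` invocation."""
    args = ["docker", "run"]
    if detach:
        args.append("-d")
    if interactive:
        args.append("-it")
    if remove:
        args.append("--rm")
    if name:
        args += ["--name", name]
    for host_port, container_port in (ports or {}).items():
        args += ["-p", f"{host_port}:{container_port}"]
    for source, target in (volumes or {}).items():
        args += ["-v", f"{source}:{target}"]
    args.append(image)
    return args


cmd = docker_run_command("nginx-server", name="web", ports={8081: 80}, detach=True)
print(shlex.join(cmd))

# With Docker installed, the list could be executed with:
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.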

```
+-------------------------------+----------------------------------+
| Command                       | Description                      |
+-------------------------------+----------------------------------+
| docker ps                     | List running containers. Append  |
|                               | -a to also list containers not   |
|                               | running.                         |
| docker stop [CONTAINER_ID]    | Gracefully stop the container    |
|                               | with [CONTAINER_ID] on the       |
|                               | machine.                         |
| docker kill [CONTAINER_ID]    | Forcefully stop the container    |
|                               | with [CONTAINER_ID] on the       |
|                               | machine.                         |
| docker rm [CONTAINER_ID]      | Remove the container with        |
|                               | [CONTAINER_ID] from the machine. |
| docker rm $(docker ps -a -q)  | Remove all containers from the   |
|                               | machine.                         |
+-------------------------------+----------------------------------+
```

### Using DockerHub

- To connect to DockerHub: `docker login`
- To upload a local image: `docker push <image>`
- To pull an image from DockerHub: `docker pull <image>`

Note: image names on DockerHub have format `user/name`. It is good practice to mimic this format in local names for easy syncing. 

### Examples

#### A simple script

We create a simple script `date-script.sh`:
```bash
#!/bin/sh
DATE="$(date)"
echo "Today's date is $DATE"
```

And a `Dockerfile`:
```dockerfile
# base image for building container
FROM docker.io/alpine
# add maintainer label
LABEL maintainer="mark.simon.cohen@gmail.com"
# copy script from local machine to container filesystem
COPY date-script.sh /date-script.sh
# execute script
CMD sh date-script.sh
```

The Docker image will be built on top of the Alpine Linux base image. See https://hub.docker.com/_/alpine

`docker build -t simple .` followed by `docker run simple` will print the date.

#### Serve a Webpage on an nginx Web Server with Docker

Create an `index.html` file inside an `html` directory, and then a `Dockerfile`:
```dockerfile
# base image for building container
FROM docker.io/nginx
# add maintainer label
LABEL maintainer="mark.simon.cohen@gmail.com"
# copy html file from local machine to container filesystem
COPY html/index.html /usr/share/nginx/html
# port on which the container listens
EXPOSE 80
```

Note that 80 is the default port for receiving HTTP requests. So, as a Web server, this container will be listening on port 80.

Now, `docker build -t nginx-server .` and:
```
    docker run -d -it -p 8081:80 nginx-server
```

Two points: this runs in the background, and instructs Docker to capture local port 8081 and forward it to port 80 inside the container. Run `docker ps` to see the status of the container, and navigate a browser to `localhost:8081` to access the server.

Then, to stop the server, run `docker stop <ID>`, using the ID that is printed when the container is run.

#### Downloading and running Jupyter's tensorflow container

```bash
docker pull jupyter/tensorflow-notebook
docker run --rm -p 8888:8888 jupyter/tensorflow-notebook
```

## Kubernetes

Kubernetes is a software system, originally developed by Google, that addresses the concerns of deploying, scaling and monitoring containers. Hence, it is called a container orchestrator. Examples of other container orchestrators in the wild are Docker Swarm, Mesos Marathon and HashiCorp Nomad.

Google offers its own service for running Kubernetes, but other vendors (e.g. Amazon) offer alternatives.

### Features of Kubernetes
- Horizontal auto-scaling: dynamically scales the number of running pods based on resource demands.
- Self-healing: restarts or replaces failed containers, and reschedules pods away from failed nodes, in response to health checks.
- Load balancing: distributes requests across the pods behind a service.
- Rollbacks and updates: easily update or revert to a previous container deployment without causing application downtime.
- DNS service discovery: uses the Domain Name System (DNS) to expose a group of pods as a named Kubernetes service.
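
As an illustration of horizontal auto-scaling, the Kubernetes documentation describes the Horizontal Pod Autoscaler as aiming for roughly `ceil(currentReplicas * currentMetric / targetMetric)` replicas. The sketch below is a simplified version of that rule: the function name, the metric values and the min/max defaults are assumptions made for the example, and the real controller also applies a tolerance band and stabilization behavior that are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified form of the Horizontal Pod Autoscaler scaling rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (illustrative defaults).
    return max(min_replicas, min(max_replicas, raw))


# 3 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(3, 90.0, 60.0))  # 5
# 4 pods averaging 20% CPU against a 60% target: scale in.
print(desired_replicas(4, 20.0, 60.0))  # 2
```
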

### Components of Kubernetes

The main components of the Kubernetes engine are:
- Master node(s): manage the Kubernetes cluster. There may be more than one master node in High Availability mode for fault tolerance. In this case, only one acts as the leader, and the others follow. Master nodes can contain the following components:
    - etcd (distributed key-value store): stores the Kubernetes cluster state. It can be part of the master node or external to it. Either way, all master nodes connect to it.
    - api server: manages all administrative tasks. The api server receives commands from the user (`kubectl` CLI, REST or GUI); these commands are executed and the new cluster state is stored in the distributed key-value store.
    - scheduler: schedules work to worker nodes by allocating pods. It is responsible for resource allocation.
    - controller: ensures that the desired state of the Kubernetes cluster is maintained. The desired state is what is contained in a JSON or YAML deployment file.
- Worker node(s): machine(s) that run containerized applications scheduled as pod(s). Each worker node comprises the following:
    - kubelet: an agent that runs on each worker node. It connects the worker node to the api server on the master node and receives instructions from it. It also ensures the pods on the node are healthy.
    - kube-proxy: the Kubernetes network proxy that runs on each worker node. It listens to the api server and forwards requests to the appropriate pod. Important for load balancing.
    - pod(s): consist of one or more containers that share network and storage resources as well as container runtime instructions. Pods are the smallest deployable unit in Kubernetes.
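
The controller's job of keeping the observed state in line with the desired state can be pictured as a reconciliation step. The following is a toy sketch of that idea in plain Python, not Kubernetes code: the function `reconcile`, the pod names and the action strings are all hypothetical.

```python
from typing import List


def reconcile(desired_replicas: int, running_pods: List[str]) -> List[str]:
    """Return the actions needed to move the observed state to the desired state."""
    actions: List[str] = []
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: request new ones to be scheduled.
        actions += ["create pod"] * diff
    elif diff < 0:
        # Too many pods: remove the surplus (here, the last ones listed).
        actions += [f"delete {name}" for name in running_pods[diff:]]
    return actions


# One of three desired pods has failed, so one replacement is requested.
print(reconcile(3, ["web-a", "web-b"]))  # ['create pod']
```

A real controller runs a step like this continuously, comparing the state stored in etcd with what the worker nodes report.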

### Configuring and Deploying a Kubernetes cluster

Kubernetes is controlled by a deployment file in `yaml` format, which specifies the objects that should be deployed and their specifications. `kubectl` provides a command-line interface; common commands are:

```
+---------------------------------------------+-----------------------+
| Command                                     | Description           |
+---------------------------------------------+-----------------------+
| kubectl get all                             | list all resources.   |
| kubectl get pods                            | list pods.            |
| kubectl get service                         | list services.        |
| kubectl get deployments --all-namespaces    | list deployments for  |
|                                             | all namespaces.       |
| kubectl create -f [DEPLOYMENT_FILE.yaml]    | create a new resource |
|                                             | based on the desired  |
|                                             | state in the yaml     |
|                                             | file.                 |
| kubectl apply -f [DEPLOYMENT_FILE.yaml]     | if the resource       |
|                                             | already exists,       |
|                                             | refresh the resource  |
|                                             | based on the yaml     |
|                                             | file.                 |
| kubectl delete -f [DEPLOYMENT_FILE.yaml]    | remove all resources  |
|                                             | from the yaml file.   |
| kubectl get nodes                           | get the nodes of the  |
|                                             | Kubernetes cluster.   |
| kubectl delete deployment [DEPLOYMENT_NAME] | delete the            |
|                                             | deployment with       |
|                                             | [DEPLOYMENT_NAME].    |
| kubectl delete svc [SERVICE_NAME]           | delete the service    |
|                                             | with [SERVICE_NAME].  |
| kubectl delete pod [POD_NAME]               | delete the pod with   |
|                                             | [POD_NAME].           |
+---------------------------------------------+-----------------------+
```
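
Since the desired state can be given as JSON as well as YAML, a deployment file can also be generated from a script. The sketch below builds a minimal `apps/v1` Deployment as a Python dictionary and prints it as JSON. The function name `deployment_manifest`, the deployment name `web` and the replica count are hypothetical choices for illustration; `nginx` refers to the public nginx image.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 2, port: int = 80) -> dict:
    """Build a minimal apps/v1 Deployment as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical deployment name; prints the manifest as JSON.
    print(json.dumps(deployment_manifest("web", "nginx"), indent=2))
```

Saving the output to a file such as `deployment.json` and running `kubectl apply -f deployment.json` would then ask the cluster to run two replicas of the image.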

### Running Kubernetes locally with Minikube

https://kubernetes.io/docs/tasks/tools/install-minikube/

```
+---------------------+--------------------------------------------+
| Command             | Description                                |
+---------------------+--------------------------------------------+
| minikube status     | Check if Minikube is running.              |
| minikube start      | Create local kubernetes cluster.           |
| minikube stop       | Stop a running local kubernetes cluster.   |
| minikube dashboard  | Open Minikube GUI for interacting with the |
|                     | Kubernetes cluster. Append & to open in    |
|                     | background mode minikube dashboard &.      |
| minikube ip         | get ip address of Kubernetes cluster.      |
+---------------------+--------------------------------------------+
```

After starting the cluster, use `kubectl` to deploy a Docker image. When done, delete the service and then stop Minikube.
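
One possible version of this start, deploy and clean-up cycle is sketched below as a Python script that assembles the commands and, by default, only prints them. The helper names, the deployment name `web`, the public `nginx` image and the `NodePort` service type are assumptions made for the example; `kubectl create deployment` and `kubectl expose` are used here in place of a deployment file.

```python
import subprocess
from typing import List


def minikube_workflow(name: str, image: str, port: int = 80) -> List[List[str]]:
    """Return the commands for a start / deploy / clean-up cycle, in order."""
    return [
        ["minikube", "start"],
        ["kubectl", "create", "deployment", name, f"--image={image}"],
        ["kubectl", "expose", "deployment", name, "--type=NodePort", f"--port={port}"],
        ["kubectl", "get", "pods"],
        ["kubectl", "delete", "svc", name],
        ["kubectl", "delete", "deployment", name],
        ["minikube", "stop"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it too when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: only prints the commands. Pass dry_run=False to execute them.
run_all(minikube_workflow("web", "nginx"))
```

In practice the deployment would be left running between the `get pods` and `delete` steps; they are listed back to back here only to show the order.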

### Kubeflow

A set of tools for managing the deployment of machine learning workflows on Kubernetes.