# Virtual Environments and Containers

```{contents} Table of Contents
:depth: 4
```

## Virtual Environments

### What exactly is an environment?

"The environment" means all of the software and data stored on your computer. When you launch software, there is a specific version of that software. It may call other background programs or data that are also stored on your computer. All of this together forms your computer's environment. In general, when we refer to everything that is installed and running on a computer, we call this the **global environment**.

### What is a virtual environment?

A virtual environment is an isolated space on your computer in which different software/data, or different versions of already installed software/data, can be installed. Inside the virtual environment, only this software is available, and it is not available outside the isolated space that comprises the virtual environment. Virtual environments are useful because they allow us to use specific software for specific projects. For example, if a model was built using Python 2, but won't work in Python 3, we can create a virtual environment to run Python 2 and run the model in there without replacing the Python version on our whole machine. Many experienced Python users do nearly all of their work in virtual environments.

A **kernel** is a special case of a virtual environment that also includes the variables/objects you create in a particular Python session.

### Installing different Python versions on your computer

One of the primary uses of a virtual environment is to run code on a specific (usually older) version of Python without changing the version installed in the global environment on your computer. A widely used tool for downloading and installing different versions of Python is called `pyenv`. To install `pyenv`, follow the instructions here: https://github.com/pyenv/pyenv. (On Windows, those instructions point to a separate project called `pyenv-win`.)

To install a new version of Python using `pyenv`, first take a look at the list of available Python versions for your computer by typing (in either the Mac terminal or in the Windows command line):

```
pyenv install -l
```

Choose the version you need (say for example version 3.6.10), and type

```
pyenv install 3.6.10
```

Now the version of Python you need for your virtual environment should be installed and ready to use in the next step. Note that although this last command downloaded and installed this version of Python, it did not set this version as the default on your computer. If you want to make this version your global default, you can next type
```
pyenv global 3.6.10
```
However, that's neither required nor recommended unless you want to change the Python version in your global environment.

### The `pipenv` package and projects

There are many ways to use virtual environments in Python, but the `pipenv` package is the most straightforward and is one of the most widely used tools.

In the terminal, type 
```
pip install pipenv
```
The idea of a **project** is that everything you need should be in one folder on your computer. That includes all scripts and notebooks, all local data files, and all supporting files (like pipfiles and .env files, which we will discuss a bit later). In addition, a project sets a virtual environment in the same folder to control the Python version and the associated packages.

First we create the project folder. In the terminal, type `pwd`. This shows the directory the terminal is currently pointing to. If you are using GitHub, you created a new folder when you cloned a repository, and this folder can be considered your project folder.

Before we can install any packages, we must choose and install the version of Python we want. (If we install packages first, pipenv will default to the latest version of Python available, and we won't be able to change it without having to start a new project from scratch.)

Let's go for a version of Python 3.11. In the terminal, type:
```
pipenv --python 3.11
```
Check out the output. `pipenv` also automatically created a file called a Pipfile, which we will look at in a minute.

### Working outside and inside the virtual environment

Although we've created the virtual environment, we are still working in the computer's default global environment until we activate the virtual environment. For starters, in the terminal, type `python`. Notice the version number of your global version of Python. Then to exit Python and return to the terminal prompt, type `quit()`.

Next let's activate the virtual environment, and then run Python again to see what happens.

The first way to activate a virtual environment is with the following command, typed in the terminal (making sure the prompt says you are in the project's folder):
```
pipenv shell
```
Notice the prompt has changed: it now has the name of your project folder in parentheses (if you use Anaconda, it previously said `(base)`). That indicates you are now working in a virtual environment. Type `python` again. Note the version number now! You have one version of Python just for this virtual environment, and an entirely different one for your global system. Type `quit()` to return to the prompt.

When you are in the terminal and you've used `pipenv shell` to enter the project folder's virtual environment, type `exit` to return to the global environment.

The second way to run a virtual environment is to type `pipenv run` followed by the command you would like to run inside the virtual environment. This approach is more efficient when you only want to run one command, or a small number of commands one at a time, inside a virtual environment. Type `pipenv run python`. You'll see the same Python 3.11 instance as before. Now type `quit()`. In this case, because we did not use `pipenv shell`, we are already back in the global environment and we do not have to use `exit` to return to it.
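To confirm which Python you are actually running, you can ask Python itself. The following is a minimal sketch, assuming the environment was created by a tool that follows the standard `venv`/`virtualenv` layout (as pipenv does); the helper name `in_virtual_environment` is made up for this illustration. Environments managed by other tools, such as conda, may not be detected by this check.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv-style environment, sys.prefix points at the
    # environment's folder, while sys.base_prefix still points at the Python
    # installation the environment was created from.
    return sys.prefix != sys.base_prefix


# Report the version and location of the interpreter that is running this code.
print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Interpreter:", sys.executable)
print("In a virtual environment:", in_virtual_environment())
```

Save this as a script (for example `check_env.py`, a hypothetical name) and compare the output of `python check_env.py` with the output of `pipenv run python check_env.py`.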

### Installing packages in the virtual environment

To install Python packages in a virtual environment only, make sure you are in the project folder and use the terminal to type `pipenv install` followed by the name of the package you want to install. It works like `pip install`, but it only installs the package for a specific virtual environment instead of the global one, and it records the package in the project's Pipfile.

Try these:
```
pipenv install numpy
pipenv install requests
```

Let's test this: type `pipenv run python`. Then on the Python prompt type `import numpy`. Then type `import requests`. These two should run without error. Then type `import pandas`. You'll get an error because we haven't installed pandas in this virtual environment!
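If you would rather check for a package without triggering an error, the standard library can look a module up without importing it. This is a small sketch, not part of pipenv; the helper name `is_installed` is an assumption made for this example, and it is only meant for top-level module names such as `numpy`.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    # find_spec returns None when no module with this name is on sys.path.
    return find_spec(module_name) is not None


# Check the three packages discussed in this section.
for name in ["numpy", "requests", "pandas"]:
    status = "available" if is_installed(name) else "not installed here"
    print(f"{name}: {status}")
```

The result depends on which environment runs the code, so try it with `pipenv run python` and again in the global environment.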

Packages installed in the global environment live in a different folder. To see where the packages live in the virtual environment, type `import sys`, then `sys.path` at the Python prompt. Find the path that ends in `site-packages`. Copy this path, then exit Python, and take a look at this folder. You'll see the packages that we just installed here and a few additional packages that requests and numpy depend on, but no other packages yet.
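Scanning `sys.path` by eye works, but you can also filter it. Here is a minimal sketch; the function name `site_packages_dirs` is hypothetical, and the check simply looks for entries whose last folder is named `site-packages`.

```python
import sys


def site_packages_dirs(paths=None):
    """Return the entries of a search path that end in 'site-packages'."""
    # Default to the search path of the interpreter running this code.
    paths = sys.path if paths is None else paths
    # Strip trailing slashes so that '.../site-packages/' also matches.
    return [p for p in paths if p.rstrip("/\\").endswith("site-packages")]


# Print the folder(s) where this interpreter looks for installed packages.
for folder in site_packages_dirs():
    print(folder)
```

Run inside the virtual environment, the printed folder should sit inside the environment's own directory rather than your global Python installation.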

By the way, to uninstall a package in a virtual environment, you can type `uninstall` instead of `install`. Try: `pipenv uninstall requests`

### Look at the Pipfile and Pipfile.lock

Creating the virtual environment with Python 3.11 and installing packages with pipenv produces two files in the project folder: Pipfile and Pipfile.lock. These two text files contain information about the entire contents of the virtual environment, including the Python version and the installed packages. Open the Pipfile in any text editor. It should look like this:
```
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
numpy = "*"

[dev-packages]

[requires]
python_version = "3.11"
```
Under `[[source]]` this file describes the repository where pipenv will look by default for new packages. PyPI.org is the Python Package Index, the repository that stores any package downloadable with pip. Under `[packages]` you see a list of all of the packages you installed (requests isn't here because we uninstalled it). The `*` means any version is acceptable, which in practice installs the latest version available, but if we want a specific version we can include that in the `pipenv install` command, like this: `pipenv install requests==2.22.0`. After running this command, take another look at the Pipfile and you will see the version number listed instead of a `*`. That means that any time we rebuild this virtual environment in the future, the specific 2.22.0 version of requests will be installed instead of the latest version.

There is also a file called Pipfile.lock. Take a look. This file contains the same information as the Pipfile, and additional information like hashes and markers. That's important for installation instructions, but it's all generated and handled automatically and we don't have to ever worry about the Pipfile.lock file. Best to just leave it alone.
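Although you should not edit Pipfile.lock, reading it does no harm: it is a JSON file, so Python's `json` module can open it. The sketch below uses a heavily shortened, made-up sample rather than a real lock file (real files also carry hashes and other fields for every package), and the function name `pinned_versions` and the version numbers shown are assumptions for illustration only.

```python
import json

# A shortened, illustrative stand-in for the contents of a Pipfile.lock.
# Real lock files include hashes and more metadata for each package.
SAMPLE_LOCK = """
{
  "_meta": {"requires": {"python_version": "3.11"}},
  "default": {
    "numpy": {"version": "==1.25.0"},
    "requests": {"version": "==2.22.0"}
  },
  "develop": {}
}
"""


def pinned_versions(lock_text, section="default"):
    """Map each package name in one section of the lock file to its pinned version."""
    lock = json.loads(lock_text)
    return {name: info.get("version", "") for name, info in lock.get(section, {}).items()}


print(pinned_versions(SAMPLE_LOCK))
```

To read your own file instead of the sample, pass in the text of the Pipfile.lock in your project folder, for example with `open("Pipfile.lock").read()`.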
 
### Run Jupyter Lab in a virtual environment

It's one thing to run Python from the command line in a virtual environment, but most data science work happens in notebooks rather than at the Python prompt. Here's how to use Jupyter Lab inside a virtual environment:

First, Jupyter Lab is itself a package we must download into the environment, so type: `pipenv install jupyterlab`

In addition, Jupyter Lab uses a package to manage its kernels called ipykernel, so install that too: `pipenv install ipykernel`

Finally, we need to get Jupyter Lab to recognize the virtual environment as another kernel to choose from as an alternative to the global Python 3 environment. Register the virtual environment in Jupyter Lab by typing: `pipenv run python -m ipykernel install --user --name=my-venv`. Here we've elected to name the virtual environment's kernel "my-venv" but we could have named it anything we like.

Next, open Jupyter Lab inside the virtual environment by typing: `pipenv run jupyter lab`. You will see my-venv (or whatever you named your virtual environment's kernel) as an option for the kernels you can use to run a notebook. When you select this option, the notebook will run on the version of Python you installed in the virtual environment and will only be able to import the packages you installed there.

### The .env file and environmental variables

A **.env file** (a "dot E-N-V" file) is a plain text file that is used for storing sensitive pieces of information such as passwords and access keys. We will use these files a lot later in this course when we talk about APIs and again when we talk about databases. A .env file contains environmental variables, which are pieces of data that load into the virtual environment used by a script or notebook without having to be defined in the code itself. Think of it this way: say we are trying to use Python to access a system that requires a password. If we are trying to keep a password secret, it's not good to explicitly type something like `password = 'idancetobieberlikenooneswatching'` into the code. But if we define this password as an environmental variable that is loaded when we start the virtual environment, then the virtual environment tells Python what the password is without it ever having to be typed out in the code.

Because the .env file is just a plain text file, you can create a .env file in a number of different ways. One way that should work for both Mac and Windows machines uses Jupyter Lab. Within Jupyter Lab, make sure the left-hand file navigator window is showing your project folder. Then select File, New, Text File. Then select File and Save Text As. Replace 'untitled.txt' with '.env' and save it. If you use another method to create this file, just make sure the file is named .env and not accidentally .env.txt.

Once the file is saved, you can define the environmental variable by writing the word you will write in the code, an equal sign, and the secret value associated with the word. For example, type the following into the .env file and save the file: password=idancetobieberlikenooneswatching. You can define as many environmental variables as you want, just put each one on a new line inside the .env file.
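To make the format concrete, here is a rough sketch of how text in this `name=value` form can be turned into a Python dictionary. It is a deliberate simplification written for this illustration (the function name `parse_env_text` and the `api_key` entry are made up), not the loader that pipenv actually uses, which handles quoting and other cases this sketch ignores.

```python
def parse_env_text(text):
    """Parse simple name=value lines into a dictionary, skipping blanks and comments."""
    variables = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ignore empty lines and lines that start with a comment marker.
        if not line or line.startswith("#"):
            continue
        # Split at the first equal sign only, so values may contain '='.
        name, separator, value = line.partition("=")
        if not separator:
            continue  # not a name=value line
        variables[name.strip()] = value.strip()
    return variables


# Two example variables, each on its own line, as they would appear in a .env file.
example = "password=idancetobieberlikenooneswatching\napi_key=placeholder-value\n"
print(parse_env_text(example))
```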

Please note that on most computers .env is considered to be a hidden file, so it won't show up by default when looking for files unless you set an option on your computer to display hidden files (Shift + Command + . on a Mac, View > Show > Hidden items within File Explorer on Windows). Also, it is very important that you never include a .env file in the files you post to GitHub or to Docker Hub. Malicious actors know to look for files called .env and will try to exploit the information contained in these files. If you initialized your GitHub repo with GitHub's Python .gitignore template, .env is already listed in this file; otherwise, add a line containing `.env` to your .gitignore yourself.

Now return to the terminal. If it is still showing the log for Jupyter Lab, press Control + C and confirm to shut down the server and bring back the terminal prompt. Now if you rerun `pipenv run jupyter lab` (or any `pipenv run` command) you should see a message along the lines of "Loading .env environment variables".

Start a notebook using the virtual environment as a kernel, and type `import os` and `password = os.environ['password']` into the first cell. Note that os is part of base Python, so it doesn't have to be separately installed. These two commands create a Python variable called password that contains your secret password, but do not require you to ever type out your password in the notebook.
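If the variable is missing (for example, because Jupyter Lab was started without `pipenv run`), `os.environ['password']` raises a bare `KeyError`. One possible way to get a clearer message is a small wrapper like the sketch below; the function name `get_secret` is an assumption for this example, not part of any library.

```python
import os


def get_secret(name):
    """Return an environmental variable's value, or raise a readable error if it is unset."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(
            f"Environmental variable {name!r} is not set. "
            "Did you start Python with `pipenv run` from the project folder?"
        )
    return value


try:
    password = get_secret("password")
    # Confirm success without ever displaying the secret itself.
    print("Password loaded.")
except RuntimeError as err:
    print(err)
```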

## Understanding and Using Containers
### Four levels of isolation: the global environment, virtual environments, containers, and virtual machines

One of the most important concepts in software development is the level of [isolation](https://en.wikipedia.org/wiki/Process_isolation) of the system you use to develop the software and host an application. According to Wikipedia, isolation is "a set of different hardware and software technologies designed to protect each process from other processes on the operating system. It does so by preventing process A from writing to process B." In other words, an isolated environment does not save any files or install any programs outside that environment.
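One small, everyday instance of this idea can be seen from Python itself: a child process gets its own copy of the environmental variables, so changes it makes do not leak back into the process that launched it. The sketch below illustrates only that one aspect of isolation; the variable name `ISOLATION_DEMO` is made up for the example.

```python
import os
import subprocess
import sys

VAR = "ISOLATION_DEMO"  # hypothetical variable name used only for this demo
os.environ.pop(VAR, None)  # make sure the parent process starts without it

# Code for the child process: set the variable in its own environment and print it.
child_code = f"import os; os.environ['{VAR}'] = 'set by child'; print(os.environ['{VAR}'])"

# Launch a separate Python process and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

print("Child saw:", result.stdout.strip())
print("Parent sees:", os.environ.get(VAR))
```

The child reports the value it set, while the parent still has no such variable.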

There are four approaches with regard to isolation, and as a data engineer you need to choose the appropriate approach from the outset of your project:

* The global environment, the default setup of your computer, doesn't isolate anything;
* a virtual environment isolates the version of Python and the packages installed on that version of Python from everything else on the computer;
* a container isolates Python and packages as well, but also isolates the operating system software (usually a Linux distribution) and additional applications from everything else on the computer;
* and a virtual machine isolates everything.

Here is more information about each approach:

The **global environment** refers to the main computation and storage on your computer that every piece of software shares by default, whether that's your web browser, email client, video games, etc. Unless you take explicit steps to use a virtual environment, container, or virtual machine, you are running your code and downloading packages in the global environment. The global environment is the least isolated option. When you develop software in this location, all kinds of crazy things can happen as your code can accidentally interact with other packages and with other applications installed on your computer. In addition, any changes you make to the system for a project in the global environment can change and potentially break everything else on your computer: for example, if a project requires you to use an older version of Python, changing it here changes it for all the packages you've ever installed to the global environment, and some of those packages might not work anymore.

* **Pros**: When everyone first learns Python or any programming language, they work in the global environment. It is straightforward and takes the least amount of time to use.  
* **Cons**: It is hard to tell how your code will interact with everything else installed on your computer. The way that usually plays out is that you hit some sort of error as you code, so you investigate and debug as you go, but while the adaptations you make resolve the specific errors you encountered, they also make your code much too specific to your particular computer and it won't run anymore on another system. That's a big problem if you are trying to share code that runs and works as intended on someone else's computer. It also means that the software you write in the global environment depends on the versions of the packages you've installed. If you wrote the software using Python 3.8 and numpy 1.18, there's no guarantee the software still works on a system using Python 3.11 and numpy 1.25.
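Because results can depend on versions, it helps to record which versions a piece of code actually ran on. Below is a minimal sketch using only the standard library (Python 3.8 or later); the function name `environment_report` is hypothetical, and packages that are not installed are reported as `None`.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def environment_report(packages):
    """Collect the Python version, operating system, and versions of the named packages."""
    report = {"python": platform.python_version(), "os": platform.system()}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the environment running this code.
            report[name] = None
    return report


print(environment_report(["numpy", "pandas"]))
```

Saving this output alongside your results gives a collaborator something concrete to compare against when code behaves differently on their system.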
  
A **virtual environment** (not to be confused with a virtual machine) begins with a new, empty folder on your computer (called the project folder). Once you register this folder as a virtual environment you can install any version of Python you want in this folder and you install the minimum number of packages with specific versions in this folder as well. If you are running Python 3.10 in your global environment, you can run Python 3.8 in this virtual environment. Some packages wreak havoc with other packages (the package for connecting to Google's Colab service also likes to downgrade several dozen other packages to versions at least two years old), and if you need to use a package like this it would be wise to use a virtual environment instead of a global one.

* **Pros**: Using a dedicated, empty folder for a project helps keep everything organized. Virtual environments allow for easy control of the versions of Python and packages, and allow you to choose the minimum number of packages to install to keep the environment "clean". Creating a virtual environment with pipenv also creates a Pipfile that documents the specifications of the environment in a straightforward way.
* **Cons**: It takes a good deal of technological sophistication to move from using the global environment to using virtual environments or any other isolated system. That adds to Python's learning curve, which is already steep enough, so virtual environments are best taught to people who want to move quickly beyond the beginner stages. Also, while virtual environments allow for the control of Python and packages, they still operate under a particular computer's operating system. If you wrote software in a virtual environment on a Windows machine, there's no guarantee that it will work on a Mac, for example.
  
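A **container** is like a virtual environment that also packages its own operating system software and allows for the installation of other software external to Python, such as database management systems. Most containers run a Linux distribution, so a developer on a Mac or Windows computer can use a container to work in Linux. (Containers share the kernel of the machine they run on, so on Mac and Windows Docker runs Linux containers inside a lightweight Linux virtual machine; a container cannot run macOS, and Windows containers require a Windows host.) The major client for building and managing containers is called Docker, and containers can be stored and shared for free via a website called Docker Hub (https://hub.docker.com).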

* **Pros**: A Docker container makes software very portable. A developer with a Windows computer, for example, can use a container to develop in Linux. Docker and Docker Hub make heavy software that is difficult and disruptive to install worlds easier to use on any computer. PostgreSQL and other database management systems are notoriously difficult to install in the global environment but are relatively straightforward to use once you've gotten used to Docker. Docker Hub is free to use and very popular, and most important software for data science is available as an image on Docker Hub that you can run locally with minimal code, and you can upload and share your own images there too.
* **Cons**: Docker can be confusing to learn and deploy correctly, and it is another system on top of Python and virtual environments to learn. In addition, depending on the software and files contained within a container, they can get very big very quickly, and big containers can take a long time to deploy on a computer. Some containers are too big to fit on to your computer at all. Finally, although containers use their own operating systems and software, they still depend on your computer's storage and computation hardware. A container can't run more quickly than your computer's processor will allow.

A **virtual machine** (not to be confused with a virtual environment) behaves like an entirely separate computer from your own. Typically (though not necessarily always) virtual machines (VMs) can be accessed through a cloud computing network like Amazon Web Services or Microsoft Azure and physically exist in a data center somewhere: in Charlottesville, most VMs someone will access through AWS are stored in Amazon's massive data center in Ashburn, Virginia. VMs often come pre-loaded with particular sets of software, and cloud compute companies charge you based on the amount of memory a VM uses and on the hardware connected to the VM. You can request a graphics processing unit (GPU) for your VM, which can be far faster than the standard central processing unit (CPU) you have on your laptop for highly parallel work such as training machine learning models, but be prepared to pay through the nose for that privilege.

* **Pros**: A VM is similar to a container in that it allows you to fully control the Python version, the packages, and the operating system and software installed on the VM. Unlike a container, it also allows you to choose different physical resources for the system to increase the storage and computational speed of the system. That's crucial for big data applications in which the data are much too big to fit on your laptop, and a machine learning model will run far too slowly.
* **Cons**: You have to be extremely careful to protect the credentials you use to log in to the VM. There are so many scammers out there running web-scrapers to find files on public websites like GitHub where people saved their cloud compute keys. With these keys the scammer can install things like Bitcoin miners on your VM that drive up the VM's resource usage, sending the cryptocurrency to the scammer's wallet and leaving you with the bill. At any rate, cloud systems are yet another system to master on top of everything else, but unlike virtual environments and containers, these are not free and mistakes can end up costing a lot of money.

### When should you use a global environment, virtual environment, container, or virtual machine?

The main considerations over a choice of environmental management are whether there are conflicts within your own global environment, whether or not you are able to install all the needed software and packages in your global environment, whether and with whom you need to share your code, and whether you need access to more computation and hard disk storage than you have access to on your own computer.

Conflicts within your global environment occur when some packages only work on a version of Python you do not have, or if packages conflict with one another in a way that breaks a package you need for other projects. Many packages, especially the popular and well-maintained ones, try to stay current to avoid these sorts of conflicts. But sometimes the constant stream of updates leads to one package in the chain of dependencies breaking.

<div style= "float:left;position: relative; padding: 20px">
<a href="https://xkcd.com/2347/"><img src="https://imgs.xkcd.com/comics/dependency_2x.png" width="300"></a>
</div>

Source: XKCD Comics #2347

When this happens, you will usually see a cryptic error with the import command. Do some digging and you will usually see posts by other people discussing the conflict of versions, often sharing advice to downgrade some package to an earlier version. If you see issues like this, it is best to use a virtual environment or a container to manage the package versions for a particular project without forcing your global environment to fall back to out-of-date versions of Python and of important packages.

You might not be able to install all the software and packages you need in your global environment. Package repositories like CRAN for R require submissions to work on Mac, Windows, and Linux, but PyPI, the Python package repository, allows packages that only work with some or even one operating system. If you need a package but have the wrong operating system, you can use a container with the operating system you need for the package installed.

There are many reasons why you might want to share your code. In academia and science, it is important to make your work reproducible. Reproducibility means allowing someone else to exactly replicate your findings by running the same code you used on the same data. If there is a question of whether the code would produce the same results using different versions of Python and packages, then it is important to supply a Pipfile for a virtual environment along with the code and data. If there is a concern that the code would either not work or generate different results on a different operating system as well, then provide a container with the code, data, and appropriate software included. You might consider a container in either case because that would allow you to use Docker Hub to store the container online, which makes it much easier to share.

If you are working with big data, you probably need more resources for computation and storage than you have access to on your own computer. In this case you will need to use a virtual machine on a cloud computing service. You can rigidly control the VM's environment, and you can share the VM by either sharing access keys (which you should only do with your direct collaborators) or by creating a container image from the VM and sharing it via Docker Hub or GitHub. But if you don't need access to the physical resources available on the cloud, spinning up a VM might be overkill as it creates more work to manage access and it can cost a significant amount of money. There is another common use of VMs: VMs can be connected to a public IP address and hosted on the internet. If you are writing software such as a dashboard that you want to make accessible on the internet through a URL, using a VM is a good approach.

### requirements.txt files

Before we discuss containers, it is helpful to define a requirements.txt file because we will use this file to define and launch a container.

A requirements.txt file is a plain text file that lists every package installed (or to be installed) in a particular environment, along with the version numbers of the packages. Every package is listed on a new line. To see this list for your global environment, open a terminal window and type `pip freeze`. You can save this list in a text file directly by typing `pip freeze > requirements.txt`. You can also go from a Pipfile created by pipenv to a requirements.txt file by listing the packages under `[packages]` and `[dev-packages]` in a new text file, one per line in the form `package==version`, and saving the file as requirements.txt. Recent versions of pipenv can also generate this list for you with `pipenv requirements > requirements.txt`.

If you have a requirements.txt file, but you do not yet have any of the packages listed there installed, you can install all of them at once by typing pip install -r requirements.txt, where -r tells pip to read from the file you provided. This is a good step to take when you update your global Python installation and want to bring all of your packages forward into the new Python version.

Let's create a simple requirements.txt file. First create a new, empty folder for this project. Then open a new file in this folder using any plain text editor you feel most comfortable using. You can use JupyterLab by clicking File -> New -> Text File. Before anything else, save the file as requirements.txt. Let's put the following versions of JupyterLab (so we can use JupyterLab within the Docker container), numpy, and pandas in the file:

`jupyterlab==3.4.5`

`numpy==1.23.2`

`pandas==1.4.3`

Then save the file.
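To see how these lines break down into package names and versions, here is a short sketch that parses pinned entries of the form `name==version`. It is a simplification for illustration (the function name `parse_pinned` is made up); the full requirements format accepted by pip supports more than exact pins, and this sketch simply skips anything else.

```python
def parse_pinned(text):
    """Return a dictionary of package name -> version for lines of the form name==version."""
    pins = {}
    for raw_line in text.splitlines():
        # Drop anything after a comment marker, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if "==" not in line:
            continue  # blank line, or a requirement that is not an exact pin
        name, _, pinned_version = line.partition("==")
        pins[name.strip().lower()] = pinned_version.strip()
    return pins


# The same three pinned packages that go into requirements.txt in this section.
REQUIREMENTS = """
jupyterlab==3.4.5
numpy==1.23.2
pandas==1.4.3
"""

print(parse_pinned(REQUIREMENTS))
```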

### Terminology of Docker

Every new system we learn has its own set of terminologies to learn. With Docker, the most important words to learn are Dockerfile, Docker image, Docker container, and Docker compose.

A Dockerfile is a plain text file with a lightweight programming syntax that provides instructions for what the container will eventually have installed. In the Dockerfile we can specify the version of Python, the operating system, whether additional software should be installed, and we can provide additional commands such as pip install -r requirements.txt to install all the Python packages we need from a requirements.txt file.

A Docker image is a collection of all the files that are needed to create the container. It is similar to a zipped directory or to an installed Python package in that all the necessary files are present, but in its current state it does nothing until it is unzipped, imported, or activated. A Docker image gets built when we process a Dockerfile. If we say in the Dockerfile that we want an installation of Linux with Python 3.9 installed, PostgreSQL, and pandas, numpy, and matplotlib, then Docker reads this file, downloads all of these software packages, and stores them in the image. The image then waits for another command to run all of this software as a container. Building the Docker image can take a while depending on the size of the software packages we instruct it to install.

A Docker container is the deployment of a Docker image. When we issue the command to run a Docker image, Docker allocates space on our computer and creates an isolated environment in that space, then it runs all of the software in that isolated environment. Like virtual environments, containers can be accessed either interactively or in the background.

A Docker compose file is a text file with another lightweight coding language that can be used to manage multiple Docker containers. It is important for organizing different containers and efficiently allocating enough space on the system to run all of them. We will need to use Docker compose for running multiple databases, for example, where one Docker container runs a relational database, one container runs a document database, and one runs a graph database.

### Docker Hub

Docker Hub is a web-based repository of Docker images. It has two primary functions:

* It provides free space to any registered user to upload their own Docker images. That allows someone to define a set of containers that are important for their work and to have a fixed and external location for the images of those containers. It also allows someone to easily share a Docker image with someone else by uploading the image to Docker Hub and sending the sharing URL link.
* It contains a massive repository of Docker images containing different software and specifications that are free and accessible to anyone. You can search through these images on Docker Hub (https://hub.docker.com/search?q=). Once you are comfortable with Docker and Docker Hub, downloading images containing the software you need from Docker Hub is probably the easiest way to get the software. That's especially true for complicated software such as database management systems.

Take a few minutes to register for a Docker Hub account and to search around the images to see what is available.

### Writing a Dockerfile

The first step in creating a container is to write a Dockerfile. First, make sure you have Docker Desktop installed in the global environment of your local machine.

As a Dockerfile is a plain text file, start by opening a new file in any plain text editor you feel most comfortable using. Before anything else, save the file as Dockerfile. Make sure you capitalize the D but not the f, and make sure there is no file extension (so make sure it isn't accidentally saved as Dockerfile.txt). In JupyterLab click File -> Save Text As, then type Dockerfile in the window.

On the top line of the text file, type

`# syntax=docker/dockerfile:1`

This tells Docker's image builder to use the latest stable release of version 1 of the Dockerfile syntax, rather than an older version or an experimental one. This line is one that will probably appear at the top of every Dockerfile you write without you having to change it or think too much about it.

Next we will take an image from Docker Hub as a starting point for the image we want to build. First, find an image on Docker Hub that installs the software you need. For this example, let's use the python:3.8-slim-buster image, which installs Linux and Python 3.8. All you need to do to install this image is to write the following line in your Dockerfile:

FROM python:3.8-slim-buster

We can now add to this image with additional lines of code. First, remember that the main concept behind a container is isolating the container from the rest of the computer. That means that the files in our project folder are not automatically included in the container. Let's first copy the requirements.txt file into the container by typing

COPY requirements.txt requirements.txt

The COPY command tells Docker to create a file in the container named requirements.txt, and to create that file by copying the file in the project folder named requirements.txt. The first occurrence of this name is the source: what the file is actually called in the project folder. The second occurrence is the destination: what we want to name the file inside the container.

Next, let's install the packages in our requirements.txt file by adding this line to the Dockerfile:

RUN pip install -r requirements.txt

Dockerfiles can also be used in the same way as a .env file to define environmental variables with sensitive data such as passwords and API keys. Using environmental variables means that you don't have to type out your passwords and keys in the Python notebooks and scripts you write. To define an environmental variable named secretpassword, type:

ENV secretpassword=whenyourehereyourefamily

But be careful: if you use a Dockerfile to define environmental variables, then those variables are visible in the Dockerfile, are stored in the image, and can be called and viewed in the container. Don't share your Docker image on Docker Hub if you've defined environmental variables in the Dockerfile. A better approach would be to share a Docker image with no environmental variables, then write a Dockerfile that you store locally that adds environmental variables to this image. (We'll talk more about how to do that later.)

Docker is especially useful for running applications that run continuously, such as dashboards. In this example, we will run JupyterLab as a continuously running application within the container. Running JupyterLab requires three lines of code within the Dockerfile.

First, we need to define a folder that will exist inside the container that will store the files (such as .ipynb notebook files) that we will create using JupyterLab. We create this folder with this line:

WORKDIR /notebooks

Note that we could name the folder anything we want, but "notebooks" makes sense for us in this situation.

Second, we need to define a port number inside the container for Jupyter Lab to run on. According to a website called Cloudflare: "Ports allow computers to easily differentiate between different kinds of traffic: emails go to a different port than webpages, for instance, even though both reach a computer over the same Internet connection." Jupyter Lab, by default, runs on port 8888 on a computer. We declare that the container listens on this port so that Jupyter Lab can use it (the port is connected to your own computer later, with the -p option of docker run), and we do that with this command:

EXPOSE 8888

Finally, we need to specify a command for Docker to run when it launches the container. To launch Jupyter Lab from the command line in our global environment, we type jupyter lab. But within the container, we need to add two arguments: --ip=0.0.0.0 tells Jupyter Lab to listen on all of the container's network interfaces rather than only on the container's own localhost, which is what lets a browser outside the container reach it, and --allow-root allows Jupyter Lab to run as the root user, which is the default user inside this container, so we don't have to define any specific user accounts.

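Docker prefers the command to be written as a list in which every word is a separate string (this is called the "exec form" of CMD). Written this way, Docker runs the program directly instead of passing the line through a shell. So we have to type the command like this: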

CMD ["jupyter", "lab","--ip=0.0.0.0","--allow-root"]

Then save the Dockerfile.

There are many more Dockerfile commands we can and will use. But for introductory purposes let's leave it here and return to the construction of Dockerfiles later.
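
For reference, the lines introduced in this section can be assembled into a single file. The sketch below is one possible way to do that from Python, which also sidesteps the risk of a text editor silently appending `.txt`. The function name `write_dockerfile` and its `folder` argument are illustrative and not part of Docker. The `ENV` line is included only because this walkthrough uses it; the later section on `.env` files explains why to remove it before sharing.

```python
from pathlib import Path

# The Dockerfile lines from this section, in the order they were introduced.
DOCKERFILE_LINES = [
    "# syntax=docker/dockerfile:1",
    "FROM python:3.8-slim-buster",
    "COPY requirements.txt requirements.txt",
    "RUN pip install -r requirements.txt",
    "ENV secretpassword=whenyourehereyourefamily",
    "WORKDIR /notebooks",
    "EXPOSE 8888",
    'CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root"]',
]


def write_dockerfile(folder="."):
    """Write the lines above to <folder>/Dockerfile (hypothetical helper)."""
    path = Path(folder) / "Dockerfile"  # no file extension
    path.write_text("\n".join(DOCKERFILE_LINES) + "\n")
    return path
```

Calling `write_dockerfile()` from inside the project folder should leave a file named exactly `Dockerfile` next to `requirements.txt`.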

7. Building a Docker image

To build the Docker image from the Dockerfile we just wrote, open the terminal and type

docker build . -t myfirstdocker

docker build is the command that constructs an image from a Dockerfile. The . that appears third refers to the current project directory on your computer. By default Docker looks for a file named Dockerfile, and it is best practice to always have one project folder for every container, and to have one file named exactly Dockerfile in each of these folders. The -t option (short for tag) tells Docker you want to give a name to the image it builds, and the last word in this command is the name we select.

When you run this, look at what appears on the screen. Notice that Docker pulls the base image from Docker Hub, copies requirements.txt into the image, and then runs the pip install line we specified with RUN, which installs the packages listed in requirements.txt (Jupyter, numpy, and pandas in this example). The CMD line is stored in the image but does not run until the container launches.

Now the image is built, and the container is ready to launch.

Before we do so, open the Docker Desktop App (see the little ship icon). Click on images. You should see an image named myfirstdocker.

8. Activating the container

Now that the image is built and named "myfirstdocker", we can deploy the image to start the container by typing:

docker run -p 8888:8888 myfirstdocker

Let's break down what this command does. docker run is the core command that reads an existing Docker image and attempts to launch it as a container. If this command can't find the image locally, it will look on Docker Hub for the image. The -p flag tells Docker that you are about to define a mapping between a port on your computer and a port inside the container. The first number is your computer's port and the second number is the container port. In this case, Docker will take the program that runs on port 8888 in the container and make it reachable on port 8888 on your computer. These numbers don't have to be the same. If port 8888 is already being used (maybe by a local Jupyter Lab instance) and you wanted Jupyter Lab to run on port 90 on your computer, for example, you can type -p 90:8888. Finally you provide the name of the image you want to use to create the container.
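
To make the order of the two port numbers concrete, here is a small Python sketch that does a rough check of whether a port on your computer is free and then assembles the `docker run` command. The helper names `port_is_free` and `docker_run_command` are hypothetical, and the fallback to 8889 is just one possible choice; only the `docker run -p HOST:CONTAINER IMAGE` form comes from Docker itself.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing on this computer is already bound to `port`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def docker_run_command(image, container_port=8888, host_port=8888):
    """Build the argument list for `docker run -p HOST:CONTAINER IMAGE`."""
    return ["docker", "run", "-p", f"{host_port}:{container_port}", image]


# Fall back to 8889 on the host if a local Jupyter Lab already holds 8888.
host_port = 8888 if port_is_free(8888) else 8889
command = docker_run_command("myfirstdocker", host_port=host_port)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command)` to launch the container. If the host port is not 8888, the port in the URL that Jupyter Lab prints needs to be changed to match before you paste it into a browser.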

Run this command, and your first Docker container is alive! Jupyter Lab displays the same text you see when you type jupyter lab into the command line. You will see a URL towards the bottom: copy this URL into a web browser and it will take you to the Jupyter Lab running inside the Docker container.

A few things to notice:

Open a new notebook and save it. Type !python --version into the first code cell. This version will always be the same version of Python we defined in the FROM command in the Dockerfile.

Try to import some packages. You can import pandas and numpy, but you can't import other packages like seaborn that you did not include in your requirements.txt file.

Type

import os

password = os.environ['secretpassword']

then type password alone in a code cell. You will see your password that you saved as an environmental variable display in the output. Now you can supply passwords and secret keys to anything that requires them in your code without having to actually type them out. Just don't publicly share your Dockerfile or Docker image if you use this method of defining environmental variables.

In Jupyter Lab, open a new terminal window. The command prompt looks different from what you are used to. That's because you are now running Linux. Mac users are running Linux. Windows users are running Linux. We've conquered all our problems with different hardware (for now)!

9. Copying files from the container back to your global environment

Without closing the terminal window in which Jupyter Lab is running, open a new terminal window and navigate to your container's project folder (the same place you saved your Dockerfile). If you type

docker ps

you will see a list of all your active containers that tells you the container ID (a unique combination of letters and numbers) and the (randomly generated) container name. You can use either the container ID or the container name to execute commands that refer to this container.

In Jupyter Lab, save your notebook as dockernotebook.ipynb. Right now, this notebook only lives inside the container. If you close Jupyter Lab then the container will stop, the file will be stuck inside the stopped container, and it will be deleted for good once that container is removed.

To copy this file to your container's project folder, type the following (just replace "elastic_hertz" with whatever random name your container received):

docker cp elastic_hertz:/notebooks/dockernotebook.ipynb ./dockernotebook.ipynb

Now the dockernotebook.ipynb file is saved in your project folder and will remain accessible in your global environment even when the container closes.
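
If you would rather script this step, the sketch below shows one way to look up the names of running containers and build the same `docker cp` command from Python. The function names are hypothetical. `docker ps --format "{{.Names}}"` is a real Docker option that prints only the container names, and the lookup simply returns an empty list when the Docker command line tool is not installed.

```python
import shutil
import subprocess


def docker_cp_command(container, container_path, local_path):
    """Build `docker cp CONTAINER:PATH LOCAL_PATH` as an argument list."""
    return ["docker", "cp", f"{container}:{container_path}", local_path]


def running_container_names():
    """Return the names of running containers, or [] if Docker is unavailable."""
    if shutil.which("docker") is None:
        return []
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


# "elastic_hertz" stands in for whatever random name your container received.
command = docker_cp_command(
    "elastic_hertz", "/notebooks/dockernotebook.ipynb", "./dockernotebook.ipynb"
)
print(" ".join(command))
```

Passing `command` to `subprocess.run` performs the copy, provided the container name matches one of the names returned by `running_container_names()`.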

10. Best practice for setting environmental variables if you plan to post your Dockerfile publicly to Docker Hub

Although the Dockerfile's ENV command makes it pretty easy to store passwords and keys as environmental variables, it also creates a privacy and security risk if you want to store your Docker image on Docker Hub, or if you want to store your code on GitHub, because anyone who sees the Dockerfile or runs the image can see your environmental variables, which include sensitive passwords and keys. Some bad actors run web-scraping scripts on Docker Hub and GitHub to find and exploit exposed environmental variables.

A better practice is to always save your environmental variables locally, on your own computer, and never on any public web-based repository. Here's how to accomplish that:

First: create a new text file using any text editor you feel comfortable with, and save the file as .env. This file will contain all of your environmental variables, and naming it .env is a standard convention. Following this convention makes it easier for your code to work on other people's computers: if everyone's passwords are saved in a file named .env, then the code can always load from whatever .env file is present, even if the .env contains different passwords. Remember to make sure the file is named exactly .env and not accidentally .env.txt. Also remember that files that begin with a . generally are treated as hidden files on both Mac and Windows systems because these files are considered to be more essential or sensitive to the operation of a computer system and hiding them is good practice to discourage users from changing or deleting these files.

Inside the .env file, type all the variables you want to save. Type different environmental variables on different lines. For example:

secretpassword=mydndcharacterisanightelf

secretapikey=123456789

Then save the .env file.
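
To check that a .env file has the shape Docker expects before using it, a few lines of Python are enough. This is a minimal sketch, not Docker's own parser: the function name `parse_env_file` is hypothetical, it treats lines starting with `#` as comments (as Docker does), and it is deliberately stricter than Docker in rejecting any line without an `=`. Note that every value is read as a string, including the numeric-looking API key.

```python
def parse_env_file(text):
    """Parse NAME=value lines, skipping blank lines and # comments."""
    variables = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError(f"Expected NAME=value, got: {line!r}")
        # Split on the first "=" only, so values may themselves contain "=".
        name, value = line.split("=", 1)
        variables[name] = value
    return variables


example = """
secretpassword=mydndcharacterisanightelf
secretapikey=123456789
"""
print(parse_env_file(example))
```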

Second: Delete all ENV commands in your Dockerfile if any are present.

Third: Use docker build to create the Docker image. Then use docker run to launch the container for this image, but add the following option just before the name of your image: --env-file=.env. For example:

docker build . -t imagename

docker run -p 8889:8888 --env-file=.env imagename

The --env-file option loads all the environmental variables from your local .env file into the container, but it does not include the .env file or the environmental variables in the Dockerfile or in the Docker image. If someone else has your image, they can use the --env-file option of docker run to load their own environmental variables from their own local .env file.

Fourth: If you are using the container to run Python (or Python within Jupyter Lab), type the following Python code to use your local environmental variables:

import os
password = os.getenv('secretpassword')
apikey = os.getenv('secretapikey')

Now the password and apikey variables contain your credentials, and you can use these in your subsequent code.
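
One caveat: `os.getenv` returns `None` when a variable is missing, for example if the container was started without `--env-file=.env`, and the mistake may only surface much later in your code. A small wrapper can make that failure obvious. This is a sketch; `require_env` is a hypothetical helper name, not part of the `os` module.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail loudly."""
    value = os.getenv(name)
    if value is None:
        raise RuntimeError(
            f"{name} is not set; was the container started with --env-file=.env?"
        )
    return value


# Only for trying this outside a container: supply a stand-in value.
os.environ.setdefault("secretpassword", "placeholder-for-demo")
password = require_env("secretpassword")
```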

11. Uploading a built Docker image to Docker Hub

Docker Hub provides free, online storage for Docker images. Once you've registered for an account on Docker Hub, start by logging into Docker Hub on the command line, which you can do by simply typing

docker login

You might be prompted to enter your Docker Hub username and password, or it might remember the credentials you supplied previously. Either way, you need to see the response message "Login Succeeded".

The goal is to upload an image you've created locally to Docker Hub. The image you created has a name, and it will have a name on Docker Hub as well, but this name does not need to be the same as the local name. To set the image's name on Docker Hub, use the docker tag command, such as

docker tag myfirstdocker jkropko/python3_8_jupyterlab

Here, after typing docker tag we write the name of the local image, which was myfirstdocker in the example we worked through above. Then we write the name the image will have on Docker Hub. A couple of points about how to name an image: first, the name must begin with your own account name, then a slash. Second, the name should be concise but also descriptive of the actual content of the image, which is why I wrote python3_8_jupyterlab instead of something vague like myfirstdocker.

Finally to upload the image to Docker Hub, use the docker push command:

docker push jkropko/python3_8_jupyterlab

The system takes a few minutes to upload all of the files, but once it completes, you can see your repository by going to https://hub.docker.com and signing in. You should see your new image listed on the home page. If you click the image you just uploaded, you can edit the description and see the version history (if new versions have been uploaded) along with the download statistics.

When you create an account on Docker Hub, the system creates a "top level" repository for you that has your username. I just saved my image to my top level repository. Also, the top level repository is by default public, which means anyone can access the image I just saved. You can load my image onto your computer by typing

docker run -p 8888:8888 jkropko/python3_8_jupyterlab

(Note: if you have a local instance of Jupyter Lab running, change the first 8888 to 8889, and also replace 8888 with 8889 in the web address the container lists when you run it.)

Sharing images via Docker Hub is an awesome and super-evolved way to share environments with collaborators, reviewers, or anyone else who might want to take a close look at your code and data.

You might want to save some images on Docker Hub for your own use and access, but you might want these images to be private. Docker Hub allows you to create one private repository before charging you for a premium service, but allows you to save many images inside this one private repository. To create the private repository, go to the Docker Hub home page and click "Create Repository". Next to your user name, I recommend naming the repository "private" so that you can easily remember that this folder is for your private Docker images. Add a description if you want, and make sure private is selected. Then click "Create".

Now that the private repository is created, you can tag and push local images to the private repository by adding private: before the name of the image on Docker Hub. For example, I can create a private version of the same image I pushed publicly by typing:

docker tag myfirstdocker jkropko/private:python3_8_jupyterlab

docker push jkropko/private:python3_8_jupyterlab
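
The public and private names follow the same pattern: an account name, a slash, a repository name, and optionally a colon followed by a tag (Docker uses the tag `latest` when none is given). The Python sketch below composes such names and the matching pair of `docker tag` and `docker push` commands. The function names are hypothetical and exist only for illustration.

```python
def hub_reference(account, repository, tag=None):
    """Compose ACCOUNT/REPOSITORY[:TAG] as used by docker tag, push, and run."""
    reference = f"{account}/{repository}"
    return f"{reference}:{tag}" if tag else reference


def tag_and_push_commands(local_image, reference):
    """Return the two commands that publish a local image under `reference`."""
    return [
        ["docker", "tag", local_image, reference],
        ["docker", "push", reference],
    ]


public_name = hub_reference("jkropko", "python3_8_jupyterlab")
private_name = hub_reference("jkropko", "private", tag="python3_8_jupyterlab")

for command in tag_and_push_commands("myfirstdocker", private_name):
    print(" ".join(command))
```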