
Getting Started with EasyData Environments


Let's say that you want to use EasyData to manage your environments in a way that helps stave off dependency hell. Here's a tutorial outlining how to get up and running managing your environments the EasyData way. Here's what we'll cover.

  1. Install requirements
    • Installing Anaconda or Miniconda
    • Installing a base environment
  2. Create your repo (also called a project) based on an EasyData project template
    • Create a project using an EasyData cookiecutter template
    • Initialize the project as a GitHub repo
  3. Create and explore the default environment
    • Create and explore the default project conda environment
    • Explore the default paths
  4. Customize your conda environment
    • Updating the environment
    • Checking your updates back into the project repo
    • Deleting and recreating the environment
  5. Customize your local settings (things you shouldn't check in to a repo)
    • Customize your local paths configuration
    • Customize your environment variables and secret storage

1. Install Requirements

This is a setup step that you only need to do once per machine. Occasionally you may need to update your requirements, but otherwise you shouldn't need to repeat it.

  1. Install Anaconda: if you don't already have Anaconda or Miniconda, you'll need to install it following the instructions for your platform (macOS/Windows/Linux)
  2. Open a terminal window
  3. Configure Anaconda/Miniconda
    • Set your channel priority to strict: conda config --set channel_priority strict
    • On a JupyterHub instance you will need to store your conda environments in your home directory so that they persist across JupyterHub sessions: conda config --prepend envs_dirs ~/.conda/envs # Store environments in local dir for JupyterHub
  4. Install the remaining requirements:
conda create -n easydata python=3 cookiecutter make
conda activate easydata
pip install ruamel.yaml

We've created a conda environment named easydata that we'll use to create EasyData projects. Once this environment exists as created above, we won't need to create it again.
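
To confirm the setup worked, you can list your conda environments and check that the tools are on your path while easydata is still active (a quick sanity check; exact version output will vary):

conda env list            # easydata should appear in the list
cookiecutter --version
make --version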

2. Create your EasyData repo

The best time to use an EasyData template is when you first create your project/repo. We will assume that you are starting your project from scratch.

Note: We recommend using EasyData to create every project you work with so that there is a 1:1 correspondence between your conda environments and your projects. There are many issues with having more than one repo use the same conda environment, so whatever you do, please don't use one monolithic environment to rule them all. For more on this idea see Tip #2 of Kjell's talk on building reproducible workflows.

Create a project using an EasyData cookiecutter template

  1. Open a terminal window
  2. Activate the easydata environment created above: conda activate easydata
  3. Navigate to the location where you'd like your project to live (without creating the project directory; that happens automagically in the next step). For example, if I want my project to be located in /home/<my-repo-name>, I would navigate to /home in this step.
  4. Create your project. Run cookiecutter https://github.com/hackalog/easydata and fill in the prompts. Note that the repo name that you enter will be the name of the directory that your project lives in.

We've now created a project filled with the EasyData template files in <my-repo-name>.

Check configuration

  1. Navigate into the project: cd <my-repo-name>, using the repo name you entered in the prompts of the previous step
  2. Check that the conda binary is specified correctly: the output of which conda should match the CONDA_EXE line in Makefile.include. If not, update Makefile.include accordingly (see the sketch after this list).
  3. Check that all requirements were installed: make check_requirements
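
For step 2, one way to compare the two values from the terminal:

which conda                        # the conda binary you're actually using
grep CONDA_EXE Makefile.include    # the binary the Makefile expects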

Initialize the project as a GitHub repo

We'd like to use git to keep track of changes that are made to our project. Now is the best time to initialize the git repo and create the GitHub connection. (Alternatively, use GitLab or Bitbucket with morally the same instructions.)

  1. Navigate into the project: cd <my-repo-name>, using the repo name you entered in the prompts of the previous step
  2. Initialize the repo:
git init
git add .
git commit -m "initial import"
git branch easydata   # tag for future easydata upgrades
  3. Create a GitHub version of your repo. Follow the GitHub instructions and name your GitHub repo <my-repo-name>.
  4. Import your code to GitHub: follow the instructions at the bottom of the resulting Quick Setup page, under "Import code from an old repository", to import this project to your new repository. To do so, click Import code.
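
If you prefer the terminal to the Import code button, the Quick Setup page also offers equivalent push commands along these lines (the remote URL below is a placeholder; copy the exact one GitHub shows you, and substitute your default branch name if it isn't main):

git remote add origin git@github.com:<your-username>/<my-repo-name>.git
git push -u origin main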

3. Create and explore the default environment

The EasyData template project <my-repo-name> is populated with recommended default settings: defaults that we've developed over time and that work as a nice base for most folks creating a data science related project. It's a great place to start, and until you're familiar with how and why these defaults work in most cases, we don't recommend messing with them.

Create and explore the default project conda environment

First off, the EasyData template comes with everything that you need to create a conda environment with the same name as your project base directory. That is, a conda environment with the name <my-repo-name>. To create this environment:

  1. Navigate into the project repo: e.g. cd <my-repo-name>
  2. Create the environment: make create_environment
  3. Activate the environment: conda activate <my-repo-name>
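
Before moving on, a quick sanity check that the environment exists and is active:

conda env list                                  # <my-repo-name> should be listed (the active environment is starred)
python -c "import sys; print(sys.executable)"   # should point inside the <my-repo-name> environment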

Using your conda environment in a Jupyter notebook

  1. From the project base run jupyter notebook or jupyter lab
  2. Create a new notebook named easydata-test.ipynb
  3. Select the <my-repo-name> environment from within the notebook: select Kernel -> Change kernel -> Python [conda env:<my-repo-name>].
  4. Check that it's finding the right environment by running conda list in a cell. The output should include # packages in environment at <path-to-environment>: where the <path-to-environment> ends in <my-repo-name>.

Troubleshooting: If you don't seem to have the option to select the correct environment, make sure that you ran jupyter notebook with the <my-repo-name> conda environment activated, and that which jupyter points to the correct (<my-repo-name>) version of jupyter. Furthermore, if you're on a JupyterHub instance, check that you followed the JupyterHub instructions in the Install Requirements section.
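
A quick way to run that check from the terminal:

conda activate <my-repo-name>
which jupyter   # should resolve to a path inside the <my-repo-name> environment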

Exploring the environment

The conda environment is now ready. To see what's in it, open up the environment.yml file. In it is a list of the packages that are installed in the environment. The section near the top under - pip: lists pip-based dependencies; the rest of the dependencies in the default environment.yml are conda-based. These packages are now all installed in the virtual conda environment <my-repo-name>.

There is also a file of the form environment.<your-architecture>.lock.yml. This file gets created when you run make create_environment the first time. If you open this up, you will see all of the packages that are installed in your conda environment <my-repo-name>. This is the list of all the dependencies that were installed to make the packages in environment.yml run.

We like to think of these two files as representing two different perspectives:

  • environment.yml: The packages that you want. This file is manually maintained.
  • environment.<your-architecture>.lock.yml: The packages that you need (to run what you want). This file is autogenerated and updated.

Explore the default paths

As hardcoded paths are a notorious source of reproducibility issues, EasyData attempts to avoid path-related issues by introducing a mechanism for handling paths. These paths are part of the local configuration and point to the locations of this project's directories in code. By default, they depend on the out-of-the-box project organization that is described at the bottom of the README file in the project.

One of the features we love about originally basing EasyData off of cookiecutter-datascience is the thoughtful project organization structure and description that it includes as part of the template README file. While we've made our own tweaks, we haven't strayed far from the original.

The goal of the paths mechanism is to help ensure that hardcoded path data is never checked-in to the git repository.

The default paths are all relative to the catalog_path, the location of the local config file catalog/config.ini (don't move this file!):

[Paths]
cache_path = ${data_path}/interim/cache
data_path = ${project_path}/data
figures_path = ${output_path}/figures
interim_data_path = ${data_path}/interim
notebook_path = ${project_path}/notebooks
output_path = ${project_path}/reports
processed_data_path = ${data_path}/processed
project_path = ${catalog_path}/..
raw_data_path = ${data_path}/raw
template_path = ${project_path}/reference/templates
abfs_cache = ${interim_data_path}/abfs_cache

Note that, for chicken-and-egg reasons, catalog_path (the location of the config.ini file used to specify the paths) is not specified in this file. It is set upon module instantiation (when <my-module-name> is imported) and is write-protected.

Accessing paths from Python

You can access any of the default paths in Python code, once paths has been imported, via the name of the path. For example, to access the data_path:

from <my-module-name> import paths

paths['data_path']

where <my-module-name> is the module name specified when answering the project prompts. If you've forgotten the module name, look it up in .easydata.yml. Typically <my-module-name> is set to src.

Exercise: Check that the paths mechanism resolves to the correct paths as laid out in the default project.

Notice that paths are automatically resolved to absolute filenames (as pathlib objects) when accessed.
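
For example, a minimal check from the project root (assuming the default module name src; substitute your own):

python -c "from src import paths; p = paths['data_path']; print(type(p), p)"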

4. Customize your conda environment

The <my-repo-name> conda environment is managed by a workflow that's made up of make commands.

Exercise: Run make at the command line (in a terminal window) in your base repo directory. This will display all of the available make commands.

There are several conda environment related make commands. Here are the three key ones:

create_environment  Set up virtual (conda) environment for this project
delete_environment  Delete the virtual (conda) environment for this project
update_environment  Install or update Python Dependencies in the virtual (conda) environment

Updating the environment

We've already created the environment and now we'll customize it. When adding packages to your Python environment, do not pip install or conda install directly. Always edit environment.yml and run make update_environment instead.

Your environment.yml file will look something like this:

name: <my-repo-name>
dependencies:
  - pip
  - pip:
    - -e .  # conda >= 4.4 only
    - python-dotenv>=0.5.1
    - nbval
    - nbdime
    - gdown
  - setuptools
  - wheel
  - git>=2.5  # for git worktree template updating
  - sphinx
  - bokeh
...

To add any package available from conda, add it to the end of the list. If you have a PyPI dependency that's not available via conda, add it to the list of pip-installable dependencies under - pip:.

You can include any python-based project in the pip section via git+https://<url-to-the-package-git-repo>. To point to a specific branch, use git+https://<url-to-the-package-git-repo>@<my-branch>.

Once you've made your edits, run make update_environment and voila, you're updated.
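
To confirm a new package landed in the environment (the package name below is a placeholder):

conda activate <my-repo-name>
conda list <new-package>   # shows the installed version and the channel (or pypi) it came from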

If you want your environment changes to be immediately available in your running notebooks, make sure to add and run a notebook cell containing

%load_ext autoreload
%autoreload 2

We like to do this at the beginning of our notebooks.

Checking your updates back into the project repo

To share your updated environment, check in your environment.yml file so others can use it. We'll follow the EasyData git workflow:

  1. Checkout a new branch: git checkout -b update-environment
  2. Stage the changes: git add -p environment.yml and include the desired changes, discarding the rest
  3. Commit the changes: git commit -m "update the environment file"
  4. Push to your origin: git push origin update-environment
  5. Create a pull request (PR) to the main branch
  6. Merge this PR: If you look at the PR, the only changes should be to the environment.yml file. If all looks well, merge this PR.
  7. Incorporate changes back into your local setup:
git checkout main
git fetch origin
git merge origin/main

The local main branch will now have the updated environment.yml. While it seems a bit roundabout, we recommend always using GitHub PRs to keep track of changes. This allows a multi-user workflow with an upstream remote to be adopted seamlessly.

Deleting and recreating the environment

Maybe you don't need a package that you thought you did. Maybe you need to make space on your machine. At any point you can nuke your environment from orbit.

Test this out:

conda deactivate
make delete_environment

The environment is now gone. However, it can be recreated at any time.
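
If you like, confirm the deletion:

conda env list   # <my-repo-name> should no longer appear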

Now recreate the environment:

make create_environment
conda activate <my-repo-name>
touch environment.yml
make update_environment

While the environment may not resolve exactly as before (the environment.<your-architecture>.lock.yml file may differ), everything you asked for in environment.yml will be there. The touch updates the timestamp on environment.yml so that make update_environment sees it as changed and re-applies your specification.

5. Customize your local settings

Customize your local paths

Recall that one of the EasyData design goals is to ensure that hardcoded paths are not checked into your git repository. To this end, paths should never be set from within notebooks or source code that is checked in to git. If you wish to modify a path on your local system, we recommend using Python from the command line. Let's try it.

  1. Try re-setting the default data_path from the command line:
python -c "import <my-module-name>; <my-module-name>.paths['data_path'] = '/alternate/bigdata/path'"
  2. Open the easydata-test.ipynb notebook in the environment (as explained above)
  3. Run the following in a cell:
from <my-module-name> import paths
paths['data_path']

This should output /alternate/bigdata/path.

  4. Reset to the default data_path: navigate to the command line and run
python -c 'import <my-module-name>; <my-module-name>.paths["data_path"] = "${project_path}/data"'
(Note the single outer quotes, which keep your shell from expanding ${project_path}; that interpolation is resolved by the paths mechanism, not the shell.)

For more information on paths see

>>> from <my-module-name> import paths
>>> help(paths)

Customize your environment variables and secret storage

We recommend using a .env file together with python-dotenv to manage your secrets, credentials and other environment variables that need to be set but should not be checked in to source control (e.g. git). This is the place to put your access credentials to AWS, Azure, or Google Cloud and other remote services.

By default, we recommend putting a .env file in your project path:

paths['project_path'] / ".env"

On the initial project creation there is a dummy .env file. Go ahead and open it up; it should look similar to this:

# Environment variables go here, can be read by `python-dotenv` package:
#
#   `src/script.py`
#   ----------------------------------------------------------------
#    import dotenv
#
#    project_dir = os.path.join(os.path.dirname(__file__), os.pardir)
#    dotenv_path = os.path.join(project_dir, '.env')
#    dotenv.load_dotenv(dotenv_path)
#   ----------------------------------------------------------------
#
# DO NOT ADD THIS FILE TO VERSION CONTROL!

Now, let's add an environment variable to the .env file and see what happens. The format of the .env file is the same as that used to set bash environment variables, e.g.:

TEST_SECRET="test_secret"

Open your .env file in an editor, append the previous line, and save the file.

The default EasyData .gitignore file includes .env, which automatically makes git...wait for it...ignore the file. This means that for any other checkout of your repo, a .env file will need to be created from scratch. This is the point: what goes in here should never be shared via git; share it via other, safer means.

To see this, run git status and note that the .env file does not appear even though it has been edited.
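
You can also ask git directly which rule is ignoring the file:

git check-ignore -v .env   # prints the .gitignore line that matches .env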

Let's look at how to access the contents of the .env file now.

To access environment variables, python-dotenv traverses directories upward until it finds a file called .env, and then creates environment variables out of the contents of this file.

Open a Jupyter notebook and use IPython magics to load the dotenv extension as follows:

%load_ext dotenv
%dotenv

Note: In a .py file, you can add this:

from dotenv import load_dotenv

load_dotenv()

Voila, the environment variables have been loaded, and can be accessed using the usual Python os.environ dictionary. To check that this worked, let's try to access the TEST_SECRET.

Now try:

import os

os.environ['TEST_SECRET']

If you've done it correctly, you'll have your answer: test_secret.
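
The same check works from the terminal (a sketch using python-dotenv directly; run it from the project root so the .env file is found):

python -c "from dotenv import load_dotenv; import os; load_dotenv(); print(os.environ['TEST_SECRET'])"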

In this way you can use environment variables to store credentials at a project (EasyData repo) level without ever having to check them in.

Conclusion

At this point you've set up your environment reproducibly the EasyData way, and you're set to start using and updating your environments as above. This is only the tip of the iceberg; here are some other references that may be helpful.

  • Customized EasyData docs: in any EasyData repo, there are reference documents under references/easydata that cover:
    • more on conda environments
    • more on paths
    • git configuration (including setting up ssh with GitHub)
    • git workflows
    • tricks for using Jupyter notebooks in an EasyData environment
    • troubleshooting
    • recommendations for how to share your work
  • The EasyData documentation on read the docs
  • The EasyData wiki