chore(hooks): add pre-commit hook to strip notebook outputs (#185)
* chore(hooks): add pre-commit hook to strip notebook outputs
* docs: auto update docs for pre-commit hooks
sqr00t committed Feb 2, 2024
1 parent 48a0035 commit 6291176
Showing 13 changed files with 82 additions and 76 deletions.
4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -65,3 +65,7 @@ repos:
      - id: prettier
        name: Prettier (except markdown)
        exclude: ".md"
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
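
The effect of the new `nbstripout` hook can be sketched in a few lines of Python, with plain dicts standing in for a parsed `.ipynb` file (the real hook also handles cell metadata, attachments, and a range of CLI options):

```python
import json


def strip_outputs(nb):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    stripped = json.loads(json.dumps(nb))  # cheap deep copy via JSON round-trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop run counters
    return stripped


nb = {
    "cells": [
        {
            "cell_type": "code",
            "source": "1 + 1",
            "outputs": [{"data": {"text/plain": "2"}}],
            "execution_count": 3,
        },
        {"cell_type": "markdown", "source": "# Notes"},
    ]
}

clean = strip_outputs(nb)
print(clean["cells"][0]["outputs"])  # → []
```

Committing only the stripped form keeps notebook diffs small and avoids leaking rendered outputs into version control, which is why the hook runs before every commit.
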
4 changes: 2 additions & 2 deletions DEVELOPERS.md
@@ -14,8 +14,8 @@ There are several workflows in `.github/workflows`:
- Each PR should have a `major`/`minor`/`patch` label assigned based on the desired version increment, e.g. `minor` will go from `x.y.z -> x.(y+1).z`
- After a PR is merged then draft release notes will be generated/updated [here](https://github.com/nestauk/ds-cookiecutter/releases) (see `release.yml` above)
- In the Github UI: rewrite the drafts into something informative to a user and then click release
- :warning: Releases should be made little and often - commits on `master` are immediately visible to cookiecutter users

## Documentation

Lives under `docs/`, see [`docs/README.md`](docs/README.md).
1 change: 0 additions & 1 deletion docs/README.md
@@ -15,4 +15,3 @@ Use [mkdocs](http://www.mkdocs.org/) and [mkdocs material](https://squidfunk.git
:note: mkdocs material uses a superset of Markdown, see [the reference](https://squidfunk.github.io/mkdocs-material/reference/admonitions/)

Docs are automatically published to the `gh-pages` branch via GitHub Actions after a PR is merged into `master`.

8 changes: 4 additions & 4 deletions docs/docs/css/extra.css
@@ -1,10 +1,10 @@
@import url("https://fonts.cdnfonts.com/css/century-gothic");
html,
body,
[class*="css"] {
  font-family: "Century Gothic";
}
:root {
  --md-primary-fg-color: #18a48c;
  --md-accent-fg-color: #eb003b;
}
8 changes: 4 additions & 4 deletions docs/docs/guidelines.md
@@ -6,23 +6,23 @@ What is `conda`? Conda is an open-source, cross-platform package management syst

When you run `make install`, a `conda` environment will be created for you with a name that matches the repo name. To activate it, you can run:

`conda activate <repo_name>`

For more context on how the cookiecutter creates a conda environment, click [here](https://nestauk.github.io/ds-cookiecutter/structure/). For more information on Python environments (from our Python guidelines), click [here](https://nestauk.github.io/dap_python_guidelines/python_environments.html).

Your conda environment is an encompassing Python environment for your project: you can install/uninstall packages as necessary, and these will only exist within the context of your project's environment.

To check which packages are installed in your active Conda environment, use the following command:

`conda list`

## Installing packages with pip

`pip` is the standard package manager for Python. It allows you to install and manage additional libraries and dependencies that are not distributed as part of the standard library.

To install a package using pip, you can use the following command:

`pip install package_name`

## The Role of a requirements.txt

@@ -38,7 +38,7 @@ scikit-learn

If you have a `requirements.txt` file, you can install all required packages with the following command:

`pip install -r requirements.txt`

This will install the specific versions of all the packages listed in the `requirements.txt` file. This is useful for ensuring consistent environments across different systems, or when deploying a Python application.
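
As an illustrative companion to `pip install -r requirements.txt`, the sketch below compares installed package versions against pinned requirements using only the standard library; the helper names and package names are invented for this example:

```python
from importlib import metadata


def parse_pin(line):
    """Split a 'name==version' pin into (name, version); unpinned -> (name, None)."""
    line = line.strip()
    if "==" in line:
        name, version = line.split("==", 1)
        return name, version
    return line, None


def check_pins(lines):
    """Report requirement lines whose installed version differs or is missing."""
    problems = []
    for line in lines:
        name, wanted = parse_pin(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if wanted is not None and installed != wanted:
            problems.append(f"{name}: have {installed}, want {wanted}")
    return problems


for problem in check_pins(["pip==0.0.1", "not-a-real-package"]):
    print(problem)
```

A check like this can catch environment drift before it causes hard-to-reproduce bugs.
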

40 changes: 20 additions & 20 deletions docs/docs/quickstart.md
@@ -8,27 +8,28 @@ In this page you will learn how to set up a project using the cookiecutter. The

1. **Request repository setup**: First things first, you need a repo created for you. Submit a [Create a repo in the Nesta GitHub org](https://github.com/nestauk/github_support/issues/new/choose) issue from the [github_support](https://github.com/nestauk/github_support) issue page. You will need to provide a project name, suggested repo name, whether public/private, github teams involved, team leading the project, short and long description of the project. An empty repo will be set up for you and you will receive a notification when this is done.

2. **Set up your project locally**: It is important that you _do not clone the repo yet!_ Instead, follow these steps:

- Open the terminal and `cd` to a folder where you eventually want your repo to be
- Run `cookiecutter https://github.com/nestauk/ds-cookiecutter`. This will automatically install the latest version. If you want to install a different version run `cookiecutter https://github.com/nestauk/ds-cookiecutter -c <VERSION TAG>`
- You will be presented with the following:

- `You've downloaded ~.cookiecutters/ds-cookiecutter before. Is it okay to delete and re-download it?[yes]` press Enter to confirm yes, it's always best to use the latest version.
- `project_name [Project_name]`: add_a_name_here
- `repo_name [add_a_name_here]`: add_a_name_here
- `author_name [Nesta]`: add_author or press Enter to confirm Nesta
- `description [A short description of the project.]`: add short description
- `Select openness: 1 - public 2 - private Choose from 1, 2 [1]`: regardless of the choice you can always change it in the future

- `cd` to project directory and run `make install` to:
- Create a conda environment with a name corresponding to the repo_name prompt and install the project package and its dependencies
- Configure and install Git pre-commit hooks

3. **Connect your local project to github**: You have set up your project locally and now you have to connect it to the remote repo. When you change directory to your created project folder, you will see that you are in a git repository and the generated cookiecutter has committed itself to the `0_setup_cookiecutter` branch. Connect to the git repo by running `git remote add origin git@github.com:nestauk/<REPONAME>` to point your local project to the configured repository.

4. **Merging your new branch**: You are on `0_setup_cookiecutter`, whilst `dev` is empty. They have diverging histories, so you won't be able to push any work to `dev`. For this reason you need to merge `0_setup_cookiecutter` into `dev` by running:

```bash
git checkout 0_setup_cookiecutter
git branch dev 0_setup_cookiecutter -f
@@ -38,9 +39,8 @@ In this page you will learn how to set up a project using the cookiecutter. The

5. **You are all set!** You can delete the `0_setup_cookiecutter` branch and enjoy coding!


### Team Members

- Open the terminal and `cd` into a folder where you want the project set up.
- Clone the repository by running `git clone <REPONAME>` and `cd` into the repository.
- Run `make install` to configure the development environment.
40 changes: 20 additions & 20 deletions docs/docs/structure.md
@@ -6,14 +6,13 @@ A direct [tree](#tree) representation of the folder hierarchy is also given at t

Here are a few examples from projects:

- [AHL - Out of Home Analysis](https://github.com/nestauk/ahl_out_of_home_analysis)
- [AFS - Birmingham Early Years Data](https://github.com/nestauk/afs_birmingham_ey_data)
- [ASF - Heat Pump Readiness](https://github.com/nestauk/asf_heat_pump_readiness)
- [DS - Green Jobs](https://github.com/nestauk/dap_prinz_green_jobs)

_**Note:** In the following sections we use `src/` to denote the project name to avoid awkward `<project_name>` placeholders._

@@ -124,9 +123,9 @@ config["patstat_companies_house"]["match_threshold"]

This centralisation provides a clearer log of decisions and decreases the chance that a different match threshold gets incorrectly used somewhere else in the codebase.

Config files are also useful for storing model parameters. Storing model parameters in a config makes it much easier to test different model configurations and document and reproduce your model once it’s been trained. You can easily reference your config file to make changes and write your final documentation rather than having to dig through code. Depending on the complexity of your repository, it may make sense to create separate config files for each of your models.

For example, if training an SVM classifier you may want to test different values of the regularisation parameter ‘C’. You could create a file called
`src/config/svm_classifier.yaml` to store the parameter values in the same way as before.
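
To make the pattern concrete, here is a sketch of how such a config would be used; the nested dict stands in for the parsed contents of the hypothetical `src/config/svm_classifier.yaml`, and the parameter values are purely illustrative:

```python
# In the real project this dict would come from a YAML parser reading
# src/config/svm_classifier.yaml; names and values here are illustrative only.
config = {
    "svm_classifier": {
        "C": 10,          # regularisation strength to test
        "kernel": "rbf",  # kernel choice
    },
    "patstat_companies_house": {
        "match_threshold": 0.9,
    },
}

# Model code reads parameters from one central place instead of hard-coding them:
params = config["svm_classifier"]
print(params["C"], params["kernel"])  # → 10 rbf
```

Changing `C` for a new experiment then means editing one config file rather than hunting through the codebase.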

---
@@ -175,6 +174,7 @@ Following this approach means:
- If we want to see what data is available, we have a folder in the project to go to and we let the code speak for itself as much as possible - e.g. the following is a lot more informative than an inline call to `pd.read_csv` like we had above

Here are two examples:

```python
# File: getters/companies_house.py
"""Data getters for the companies house data.
@@ -191,7 +191,9 @@
"""
return pd.read_csv("path/to/file", sep="\t", dtype={"sic_code": str})
```

or using ds-utils:

```python
# File: getters/asq_data.py
"""Data getters for the ASQ data.
@@ -214,7 +216,6 @@

```


## Pipeline components - `src/pipeline`

This folder contains pipeline components. Put as much data science as possible here.
@@ -233,7 +234,6 @@ This is a place to put utility functions needed across different parts of the co

For example, this could be functions shared across different pieces of analysis or different pipelines.


## Analysis - `src/analysis`

Functionality in this folder takes the pipeline components (possibly combining them) and generates the plots/statistics to feed into reports.
@@ -246,18 +246,17 @@ It is important that plots are saved in `outputs/` rather than in different area

Notebook packages like [Jupyter notebook](http://jupyter.org/) are effective tools for exploratory data analysis, fast prototyping, and communicating results; however, between prototyping and communicating results code should be factored out into proper python modules.

We have a notebooks folder for all your notebook needs! For example, if you are prototyping a "sentence transformer" you can place the notebooks for prototyping this feature in notebooks, e.g. `notebooks/sentence_transformer/` or `notebooks/pipeline/sentence_transformer/`.

Please try to keep all notebooks within this folder and primarily off GitHub, especially for code refactoring, as the code will be elsewhere, e.g. in the pipeline. However, for collaborating, sharing and QA of analysis, you are welcome to push those to GitHub.

### Refactoring

Everybody likes to work differently. Some like to eagerly refactor, keeping as little in notebooks as possible (or even eschewing notebooks entirely); whereas others prefer to keep everything in notebooks until the last minute.

You are welcome to work in whatever way you’d like, but try to always submit a pull request (PR) for your feature with everything refactored into python modules.

We often find it easiest to refactor frequently, otherwise you might get duplicates of functions across the codebase, e.g. if it's a data preprocessing task, put it in the pipeline at `src/pipelines/<descriptive name for task>`; if it's useful utility code, refactor it to `src/utils/`; if it's loading data, refactor it to `src/getters`.

#### Tips

@@ -330,17 +329,18 @@ You can write reports in markdown and put them in `outputs/reports` and referenc
├── .envrc | SHARED PROJECT CONFIGURATION VARIABLES
├── .cookiecutter | COOKIECUTTER SETUP & CONFIGURATION (user can safely ignore)
```

## The Makefile

A Makefile is a build automation tool that is commonly used in software development projects. It is a text file that contains a set of rules and instructions for building, compiling, and managing the project. The primary role of a Makefile is to automate the build process and make it easier for developers to compile and run their code.

Here are some key points to understand about the role of a Makefile in a codebase:

- Build Automation: A Makefile defines a set of rules that specify how to build the project. It includes instructions for compiling source code, linking libraries, and generating executable files or other artifacts. By using a Makefile, developers can automate the build process and ensure that all necessary steps are executed in the correct order.
- Dependency Management: Makefiles allow developers to define dependencies between different files or components of the project. This ensures that only the necessary parts of the code are rebuilt when changes are made, saving time and resources. Makefiles can track dependencies based on file timestamps or by explicitly specifying the relationships between files.
- Consistency and Reproducibility: With a Makefile, the build process becomes standardised and reproducible across different environments. Developers can share the Makefile with others, ensuring that everyone follows the same build steps and settings. This helps maintain consistency and reduces the chances of errors or inconsistencies in the build process.
- Customization and Extensibility: Makefiles are highly customizable and allow developers to define their own build targets and actions. This flexibility enables the integration of additional tools, such as code formatters, linters, or test runners, into the build process. Developers can easily extend the functionality of the Makefile to suit the specific needs of their project.
- Integration with Version Control: Makefiles are often included in the codebase and tracked by version control systems. This ensures that the build process is documented and can be easily reproduced by other team members. Makefiles can also be integrated into continuous integration (CI) pipelines, allowing for automated builds and tests whenever changes are pushed to the repository.

As part of the cookiecutter, we have a Makefile that can perform some useful administrative tasks for us:

@@ -358,4 +358,4 @@ install Install a project: create conda env; install local package;
pip-install Install our package and requirements in editable mode (including development dependencies)
```

By far the most commonly used command is `make install`, so don't worry too much about the rest!
1 change: 0 additions & 1 deletion docs/mkdocs.yml
@@ -62,4 +62,3 @@ nav:
- Quickstart: quickstart.md
- Structure: structure.md
- Python Environments: guidelines.md
