Ethics Blog & blog docs #1849

Merged
merged 16 commits into from
Jan 23, 2024
10 changes: 10 additions & 0 deletions docs/blog/.authors.yml
@@ -8,3 +8,13 @@ authors:
name: Ross Kennedy
description: Maintainer
avatar: https://github.com/rossken.png

zoe-s:
name: Zoë Slade
description: Contributor
avatar: https://github.com/zslade.png

alice-o:
name: Alice O'Leary
description: Contributor
avatar: https://github.com/aliceoleary0.png
68 changes: 68 additions & 0 deletions docs/blog/posts/2024-01-25-ethics.md
@@ -0,0 +1,68 @@
---
date: 2024-01-23
authors:
- zoe-s
- alice-o
categories:
- Ethics
---

# Ethics in Data Linking

Welcome to the next installment of the Splink Blog where we’re talking about Data Ethics!

## :question: Why should we care about ethics?

Splink was developed in-house at the UK Government’s Ministry of Justice. As data scientists in government, we are accountable to the public and have a duty to maintain public trust. This includes upholding high standards of data ethics in our work.

<!-- more -->

Furthermore, data linkage is generally used at the start of analytical projects, so any design decisions that are made, or biases introduced, will have consequences for all downstream use cases of that data. With this in mind, it is important to try to address any potential ethical issues at the linking stage.

## :link: Ethics and Splink

### What do we already have in place?

Data ethics has been a foundational consideration throughout Splink’s development. For example, the decision to make Splink open-source was motivated by an ambition to make our data linking software fully transparent, accessible and auditable to users both inside and outside of government. The fact that this also empowers external users to expand and improve upon Splink’s functionality is another [huge benefit](https://www.robinlinacre.com/open_source_dividend/)!

Another core principle guiding the development of Splink has been explainability. Under the hood we use the [Fellegi-Sunter model](../../topic_guides/theory/fellegi_sunter.md), which is an industry-standard, well-researched, explainable methodology. This, in combination with interactive charts such as the [waterfall chart](../../charts/waterfall_chart.ipynb), where model results can be easily broken down and visualised for individual record pairs, makes Splink predictions easy to interrogate and explain. Being able to interrogate predictions is especially valuable when things go wrong - if an incorrect link has been made, you can trace it back to see exactly why the model made the decision.

### What else should we be considering?

To continue our exploration of ethical issues, we recently had a team away day focused on data ethics. We aimed to better understand where ethical concerns (e.g. bias) could arise in our own Splink linkage pipelines, and what further steps we could take to help users better understand, and possibly mitigate, these issues within their own projects.

We discussed a typical data linking pipeline, as used in the Ministry of Justice, from data collection at source through to the generation of Splink cluster IDs. It became clear that there are considerations to make at each stage of a pipeline that can have ethical implications, such as:

!["Diagram of data linkage process and ethical considerations at each stage"](./img/linkage_process.drawio.png)

For example, a higher occurrence of misspellings for names of non-UK origin during data collection can impact the accuracy of links for certain groups.

As you can see, the entire data linking process has many stages with lots of moving parts, resulting in numerous opportunities for ethical issues to arise.

### What are we going to do about it?

Summarised below are the key areas of ethical concern we identified and how we plan to address them.

#### :material-thumbs-up-down: Evaluation

Splink is not plug and play. As a piece of software, it provides many configuration options to support its users, from [blocking rules](../../topic_guides/blocking/blocking_rules.md) to [term frequency adjustments](../../topic_guides/comparisons/term-frequency.md). However, with greater flexibility comes greater variation in model design. From an explainability and quality assurance perspective, it is important to understand how different model-building choices interact and influence results.

It isn’t trivial to unpick the interplaying factors that affect Splink’s outputs, but as a first step we are building a framework and guidance to demonstrate how changes to a model's settings can impact predictions. We hope this will give users a better understanding of model sensitivity and more confidence in explaining and justifying the results of their models. We also hope this will serve as a stepping stone to tools that help evaluate models in a production setting (e.g. model drift).

#### :scales: Bias

Bias is a key area of ethical concern within data linking and one that crops up at many stages of a typical linking pipeline, from data collection to downstream linking. It is important to identify, quantify and, where possible, mitigate bias in input sources, model building and outputs. However, sources of bias are specific to a given use case, and therefore finding general solutions for mitigating bias is challenging.

This year we are embarking on a collaboration with the [Alan Turing Institute](https://www.turing.ac.uk/) to get expert support on assessing bias in our linking pipelines. The long-term goal is to create general tooling to help Splink users gain a better understanding of how bias may be introduced into their models. Improved model evaluation (see above) is a first step in the development of these tools.

#### :loudspeaker: Communication

Sharing both our current knowledge and future discoveries on the ethics of data linking with Splink is important to help support our users and the data linking community more broadly. This blog is the first step on that journey for us.

As already mentioned, Splink comes with a variety of tools that support explainability. We will be updating the Splink documentation to convey the significance of these resources from a data ethics perspective, to help give existing users, potential adopters and their customers greater confidence in building Splink models and in the predictions those models make.

Please visit [this discussion](https://github.com/moj-analytical-services/splink/discussions/1878) on Splink's GitHub repo to get involved in the conversation and share your thoughts - we'd love to hear them!

<hr>

If you want to stay up to date with the latest Splink blogs, subscribe to our new [:simple-rss: RSS feed](https://moj-analytical-services.github.io/splink/feed_rss_created.xml)!
Binary file added docs/blog/posts/img/linkage_process.drawio.png
25 changes: 25 additions & 0 deletions docs/dev_guides/changing_splink/blog_posts.md
@@ -0,0 +1,25 @@
# Contributing to the Splink Blog

Thanks for considering making a contribution to the [Splink Blog](../../blog/index.md)! We are keen to use this blog as a forum for all things data linking and Splink!

This blog, and the docs as a whole, are built using the fantastic [mkdocs-material](https://squidfunk.github.io/mkdocs-material/). To understand more about how the blog works under the hood, check out the mkdocs-material [blog documentation](https://squidfunk.github.io/mkdocs-material/blog/2022/09/12/blog-support-just-landed/).

For more general guidance for contributing to Splink, check out our [Contributor Guide](../../../CONTRIBUTING.md).

## Adding a blog post

The easiest way to get started with a blog post is to make a copy of one of the [pre-existing blog posts](https://github.com/moj-analytical-services/splink/tree/master/docs/blog/posts) and make edits from there. There is a metadata section at the top of each post which should be updated with the post date, authors and the category of the post (this is a tag system to make posts easier to find).

Blog posts are ordered by date, so rename your post's markdown file so that it starts with a recent date (YYYY-MM-DD format) to make sure it appears at the top of the blog.
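As a rough illustration, the metadata section at the top of a post could look something like the sketch below. It follows the layout of the existing ethics post; the date, the author key `jane-d` and the category `Tutorials` are placeholders made up for this example, not real entries.

```yaml
---
# Post date in YYYY-MM-DD format
date: 2024-02-01
# Author keys must match entries in docs/blog/.authors.yml
authors:
  - jane-d        # hypothetical author key, for illustration only
# Categories act as tags that make posts easier to find
categories:
  - Tutorials     # hypothetical category, for illustration only
---
```

Under these assumptions, the post file itself would be named along the lines of `2024-02-01-my-post.md`, mirroring the existing `2024-01-25-ethics.md`.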

!!! note

    In this blog we want to make content as easily digestible as possible. We encourage breaking up any big blocks of text into sections and using visuals/emojis/gifs to bring your post to life!

## Adding a new author to the blogs

If you are a new author, you will need to add yourself to the [.authors.yml file](https://github.com/moj-analytical-services/splink/blob/master/docs/blog/.authors.yml).
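A new entry can mirror the ones already in that file. The sketch below assumes a hypothetical contributor; the key `jane-d`, the name and the avatar URL are placeholders, and the key is what you would then list under `authors` in your post's metadata.

```yaml
authors:
  # ... existing authors stay as they are ...

  jane-d:                                   # hypothetical key, referenced from a post's `authors` list
    name: Jane Doe                          # display name shown on the post (placeholder)
    description: Contributor                # short role description
    avatar: https://github.com/janedoe.png  # placeholder GitHub avatar URL
```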

## Testing your changes

Once you have made a first draft, check out how the deployed blog will look by [building the docs locally](./build_docs_locally.md).
1 change: 1 addition & 0 deletions docs/dev_guides/index.md
@@ -18,6 +18,7 @@ When making changes to Splink, there are a number of common operations that deve
* [Testing](./changing_splink/testing.md) - to ensure all of the codebase is performing as intended.
* [Building the Documentation locally](./changing_splink/build_docs_locally.md) - to test any changes to the docs site render correctly.
* [Releasing a new package version](./changing_splink/releases.md) - to walk through the release process for new versions of Splink. This generally happens every 2 weeks, or in the case of an urgent bug fix.
* [Contributing to the Splink Blog](./changing_splink/blog_posts.md) - to walk through the process of adding a post to the Splink blog.

## How Splink works

3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -232,6 +232,7 @@ nav:
- Testing: "dev_guides/changing_splink/testing.md"
- Building Docs: "dev_guides/changing_splink/build_docs_locally.md"
- Releasing a Package Version: "dev_guides/changing_splink/releases.md"
- Contributing to the Splink Blog: "dev_guides/changing_splink/blog_posts.md"
- How Splink works:
- Understanding and debugging Splink: "dev_guides/debug_modes.md"
- Transpilation using sqlglot: "dev_guides/transpilation.md"
@@ -271,4 +272,6 @@ extra:
link: https://pypi.org/project/splink/
- icon: fontawesome/solid/chevron-right
link: https://www.robinlinacre.com/
- icon: fontawesome/solid/rss
link: https://moj-analytical-services.github.io/splink/feed_rss_created.xml
new: Recently added