Add documentation for spellchecker and spellcheck docs #2025

Merged: zslade merged 50 commits into master from spellchecker_docs on Mar 28, 2024


@zslade zslade commented Mar 4, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Docs spellchecker update: #2000
Original spellchecker PR: #1588

Give a brief description for the solution you have provided

I have updated the documentation with guidance on how to use the docs spellchecker, following on from this issue #2000.

All docs have been spellchecked to set a baseline for 'spellchecked' docs. This involves:

  1. adding legitimate spellings to custom_dictonary.txt which aren't being picked up by LibreOffice dictionaries, e.g. words like Splink
  2. adding regex to pyspelling.yml to bypass certain patterns or sections in the docs, e.g. code blocks
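
To illustrate the idea behind the two steps above, here is a minimal Python sketch of "ignore some patterns, then check what is left against a word list". It is not the contents of pyspelling.yml and does not use pyspelling itself; the names `IGNORE_PATTERNS`, `strip_ignored` and `unknown_words` are hypothetical, and the patterns are only rough stand-ins for rules such as skipping code blocks, inline code, `$$` maths and icon shortcodes.

```python
import re

FENCE = "`" * 3  # three backticks, built here to keep this example easy to embed

# Hypothetical ignore patterns; NOT the actual rules in pyspelling.yml.
IGNORE_PATTERNS = [
    re.compile(FENCE + r".*?" + FENCE, re.DOTALL),  # fenced code blocks
    re.compile(r"\$\$.*?\$\$", re.DOTALL),          # display maths between $$
    re.compile(r"`[^`\n]*`"),                       # inline code
    re.compile(r":[a-z0-9-]+:"),                    # icon shortcodes, e.g. :simple-postgresql:
]


def strip_ignored(text):
    """Replace every ignored span with a space so it is never spellchecked."""
    for pattern in IGNORE_PATTERNS:
        text = pattern.sub(" ", text)
    return text


def unknown_words(text, known_words):
    """Return the sorted, de-duplicated words that are not in known_words.

    known_words stands in for the language dictionaries plus the custom
    dictionary; it is expected to contain lowercase entries.
    """
    words = re.findall(r"[A-Za-z]+", strip_ignored(text))
    return sorted({w for w in words if w.lower() not in known_words})
```

With a word list that contains "splink", a sentence mentioning Splink passes, while a typo outside any ignored span is still reported. The real tool applies its rules through its own filter configuration, so treat this only as a picture of the approach.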

@ThomasHepworth and I decided not to make the spellchecker part of CI/CD with a GitHub Action at this stage, as it would require some non-trivial configuration, and a new check box on the PR template will probably suffice.

Code for testing ignore rules in pyspelling.yml

General text

This is some text that does not contain a mistake

startoffilemistake

colons

With a massive thanks to external contributor @hanslemm, Splink now supports :simple-postgresql: Postgres. To get started, check out the Postgres Topic Guide.

[:colon-mistake: RSS feed]
:octicons-duplicate-24-mistake:

Code block ignore test


Code block mssstake

python code

  pycodeblockmistakeee

alt python blocks

import jellyfish
altpythonblockmistake
jellyfish.jaro_similarity("MARTHA", "AMRTHA")

inline code

inlinecodemistkakeeee

l.first_name = r.first_name and substr(l.surname,1,1) = substr(r.surname,1,1).

blog post headers


date: 2024-01-23
authors:

  • zoe-s
  • alice-o
    categories:
  • Ethics

anchor tags

anchormistake

Test text between diff blocks

??? info "Example of convergence output"

```diff
@@ -1,9 +1,8 @@
{
diffblockmistake
-  
```

links

create_function function

text between $$

$$\text{}
mathcodeblockmistake
\something $$

Almost 100%, say 98% $inlinemathmistake m \approx 0.98$

compound words

Equi-join

Card content ignore test

::cards::
[
{
cardblockmistakee
},
]
::/cards::

auto generated code

::: splink.linker.Linker
handler: python
selection:
members:
- cluster_pairwise_predictions_at_threshold
- compare_two_records
- compute_tf_table
- deterministic_link
- find_matches_to_new_records
- load_settings
- load_model
- load_settings_from_json
- predict
rendering:
show_root_heading: false
show_source: true

BibTeX blocks

@article{Linacre_Lindsay_Manassis_Slade_Hepworth_2022,
	title        = {Splink: Free software for probabilistic record linkage at scale.},
	author       = {Linacre, Robin and Lindsay, Sam and Manassis, Theodore and Slade, Zoe and Hepworth, Tom and Kennedy, Ross and Bond, Andrew},
	year         = 2022,
	month        = {Aug.},
	journal      = {International Journal of Population Data Science},
	volume       = 7,
	number       = 3,
	doi          = {10.23889/ijpds.v7i3.1794},
	url          = {https://ijpds.org/article/view/1794},
}

error logger

To enable the logging of multiple errors in a singular check, or across multiple checks, an ErrorLogger class is available for use.

The ErrorLogger operates in a similar way to working with a list, allowing you to add additional errors using the append method. Once you've logged all of your errors, you can raise them with the raise_and_log_all_errors method.

??? note "ErrorLogger in practice"
```py
from splink.exceptions import ErrorLogger

# Create an error logger instance
e = ErrorLogger()

# Log your errors
e.append(SyntaxError("The syntax is wrong"))
e.append(NameError("Invalid name entered"))

# Raise your errors
e.raise_and_log_all_errors()
```

endoffilemistake

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@zmbc
Contributor

zmbc commented Mar 6, 2024

Note: I briefly messed around with the spellchecker, and wasn't immediately able to get it working on Linux. It looks like the script as it currently stands is Mac-only, but even manually installing Aspell, I couldn't get it to use .aff and .dic files.

@zslade
Contributor Author

zslade commented Mar 11, 2024

Note: I briefly messed around with the spellchecker, and wasn't immediately able to get it working on Linux. It looks like the script as it currently stands is Mac-only, but even manually installing Aspell, I couldn't get it to use .aff and .dic files.

Thanks @zmbc for drawing this to our attention! Will have a look into it :)

@zslade
Contributor Author

zslade commented Mar 18, 2024

Note: I briefly messed around with the spellchecker, and wasn't immediately able to get it working on Linux. It looks like the script as it currently stands is Mac-only, but even manually installing Aspell, I couldn't get it to use .aff and .dic files.

Thanks again @zmbc. Can I just check which Linux distribution and version you're running? I'm assuming you tried to run the spellchecker following the instructions in this PR - is that correct? If so, it is probably not working because extra dependencies are required on Linux. For example, we've been able to run the spellchecker from a Dockerfile using Alpine 3.14, but this required installing some additional dependencies: aspell-libs and aspell-en (as well as aspell). If this is the case, we'll raise an issue and a corresponding PR to solve it.

@zmbc
Contributor

zmbc commented Mar 18, 2024

@zslade Sorry -- I should have given a bit more information. I'm on Ubuntu 20.04 via Windows Subsystem for Linux. I installed aspell via conda. I can run the spellchecker script (if I comment out the Mac-specific part) but I have no reason to believe that it is using the LibreOffice dictionaries. ./scripts/pyspelling/spellchecker.sh | wc -l outputs 797 -- so it is finding a lot of "typos" in the docs as they currently are on the splink4_dev branch.

@zmbc
Contributor

zmbc commented Mar 18, 2024

Running on master, the same command outputs 896 lines.

@zslade
Contributor Author

zslade commented Mar 25, 2024

@zslade Sorry -- I should have given a bit more information. I'm on Ubuntu 20.04 via Windows Subsystem for Linux. I installed aspell via conda. I can run the spellchecker script (if I comment out the Mac-specific part) but I have no reason to believe that it is using the LibreOffice dictionaries. ./scripts/pyspelling/spellchecker.sh | wc -l outputs 797 -- so it is finding a lot of "typos" in the docs as they currently are on the splink4_dev branch.

Thanks @zmbc. Just so I'm really clear - the spellchecker runs for you (spellchecks the docs) but doesn't pass (finds spelling errors)? We are still in the process of updating custom_dictonary.txt with legitimate spellings which might not be found in the LibreOffice dictionaries e.g. Splink so it might be the case that everything is working as intended but there is still a fair amount of spelling correcting/dictionary updating to do! 🙏 This is being done as part of this PR - I will update the description to reflect that.

As a rough check, could you please git checkout spellchecker_docs and run ./scripts/pyspelling/spellchecker.sh | wc -l again? The resulting line count should be significantly lower than that for splink4_dev or master as exception patterns have been added to pyspelling.yml and words have been added to custom_dictonary.txt.

Would you also be able to tell us which Mac-specific parts you commented out? This will help us if we decide to build a system-agnostic version :)

Many thanks!

@zmbc
Contributor

zmbc commented Mar 25, 2024

As a rough check, could you please git checkout spellchecker_docs and run ./scripts/pyspelling/spellchecker.sh | wc -l again?

I get 282. The full output:

Path to spellcheck: docs
============== Running pyspelling spellchecker on docs ==============
docs/blog/index.md docs/charts/index.md docs/dev_guides/caching.md docs/dev_guides/debug_modes.md docs/dev_guides/dependency_management.md docs/dev_guides/index.md docs/dev_guides/spark_pipelining_and_caching.md docs/dev_guides/transpilation.md docs/dev_guides/udfs.md docs/includes/tags.md docs/settingseditor/editor.md docs/topic_guides/topic_guides_index.md
Misspelled words:
<context> docs/blocking_rule_composition.md
--------------------------------------------------------------------------------
splink
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/includes/generated_files/comparison_library_dialect_table.md
--------------------------------------------------------------------------------
PostgreSql
damerau
jaccard
jaro
levenshtein
winkler
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/includes/generated_files/comparison_composition_library_dialect_table.md
--------------------------------------------------------------------------------
PostgreSql
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/includes/generated_files/comparison_level_library_dialect_table.md
--------------------------------------------------------------------------------
PostgreSql
damerau
jaccard
jaro
levenshtein
winkler
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/includes/generated_files/comparison_template_library_dialect_table.md
--------------------------------------------------------------------------------
PostgreSql
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/includes/generated_files/datasets_table.md
--------------------------------------------------------------------------------
splink
wikidata
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/linkerpred.md
--------------------------------------------------------------------------------
tf
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/getting_started.md
--------------------------------------------------------------------------------
PostgreSql
conda
duckdbless
github
splink
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/dev_guides/changing_splink/contributing_to_docs.md
--------------------------------------------------------------------------------
pyspelling
yml
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/spelling_test_file.md
--------------------------------------------------------------------------------
Longrightarrow
endoffilemistake
inlinecodemista
pymistake
startoffilemistake
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/documentation_index.md
--------------------------------------------------------------------------------
autocomplete
autoformatting
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/data_preparation/feature_engineering.md
--------------------------------------------------------------------------------
ATSN
Bedfordshire
ComparisonLevels
DF
DuckDBLinker
FTSN
HU
Harpenden
JL
NG
QE
QF
TLR
TQ
acos
chudleigh
clifford
devon
dm
dmetaphone
geonames
geospatial
jaro
nan
oNah
splink
subdistricts
substrings
thomas
winkler
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/theory/fellegi_sunter.md
--------------------------------------------------------------------------------
Longrightarrow
frac
textsf
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/evaluation/model.md
--------------------------------------------------------------------------------
frac
textsf
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/blocking/blocking_rules.md
--------------------------------------------------------------------------------
frac
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/performance/optimising_duckdb.md
--------------------------------------------------------------------------------
frac
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/comparisons/phonetic.md
--------------------------------------------------------------------------------
AE
ComparisonLevels
GN
Metaphone
NER
PN
Soundex
WH
WR
XMT
damerau
jaro
levenshtein
winkler
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/comparisons/regular_expressions.md
--------------------------------------------------------------------------------
ComparisonLevels
PQ
RL
UZ
jaro
levenshtein
standardized
substring
substrings
winkler
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/topic_guides/comparisons/comparators.md
--------------------------------------------------------------------------------
frac
jaro
levenshtein
lfloor
rfloor
textsf
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/comparison_helpers.md
--------------------------------------------------------------------------------
splink
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/settings_dict_guide.md
--------------------------------------------------------------------------------
bayes
jaro
probabalistic
splink
sql
sqlglot
tf
winkler
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/linker.md
--------------------------------------------------------------------------------
num
roc
sql
tf
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/comparison_level_composition.md
--------------------------------------------------------------------------------
splink
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/linkereval.md
--------------------------------------------------------------------------------
roc
tf
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/index.md
--------------------------------------------------------------------------------
ADR
BibTeX
DASD
Doidge
Gateshead
Harron
Hepworth
Lewisham
Manassis
MoJ
Slade
Sunter's
TQyNLJRjnhsfJy
UQ
Zd
customizations
doi
lowercased
scalable
sdt
standardized
url
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/CONTRIBUTING.md
--------------------------------------------------------------------------------
PRs
dev
docstrings
repos
splink
--------------------------------------------------------------------------------

Misspelled words:
<context> docs/linkerbloc.md
--------------------------------------------------------------------------------
num
--------------------------------------------------------------------------------

!!!Spelling check failed!!!
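
A line count from `wc -l` includes separators and headings, so it only roughly tracks the number of flagged words. As a hedged sketch, the report above could be grouped per file with a few lines of Python. The function name `parse_report` is hypothetical, and the parser assumes the output keeps exactly the shape pasted here (a `<context>` line naming the file, dashed separator lines, one flagged word per line); it is not part of the spellchecker script.

```python
from collections import defaultdict
from typing import Dict, List


def parse_report(report: str) -> Dict[str, List[str]]:
    """Group flagged words by file, assuming the report layout shown above."""
    flagged = defaultdict(list)
    current = None  # file named by the most recent <context> line
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("<context>"):
            current = line[len("<context>"):].strip()
        elif not line or set(line) == {"-"} or line == "Misspelled words:":
            continue  # blank lines, dashed separators and block headings
        elif line.startswith("!!!"):
            current = None  # final status banner, not a flagged word
        elif current is not None:
            flagged[current].append(line)
    return dict(flagged)
```

Summing `len(words)` over the returned dictionary gives the number of flagged words, and sorting the items by that length shows which files account for most of them.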

the spellchecker runs for you (spellchecks the docs) but doesn't pass (finds spelling errors)?

Correct, and a lot of them, so I figured that wasn't working correctly. And, I'm pretty sure it's not using the .dic and .aff files on my computer -- how sure are you that it is using them on yours?

Would you also be able to tell us which Mac-specific parts you commented out?

I commented out these lines, since they use Homebrew, which is Mac-specific:

# Function to check if necessary packages are installed
source scripts/utils/ensure_packages_installed.sh
ensure_homebrew_packages_installed aspell yq

I replaced them with conda package installation (aspell and go-yq), which is cross-platform. Happy to contribute this if you are interested. Would you be open to switching to that, or would you want to have both?
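
For illustration only, a package-manager-agnostic check could look something like the sketch below: rather than asking Homebrew or conda what is installed, it looks for the executables on `PATH`. The function name `missing_tools` is hypothetical, and the sketch assumes the installed binaries are named `aspell` and `yq`; it is not what the existing script does.

```python
import shutil


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Binary names are an assumption; adjust if your install names them differently.
    missing = missing_tools(["aspell", "yq"])
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

A check like this works the same way whether the tools came from Homebrew, conda or a system package manager, at the cost of not installing anything itself.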

This is being done as part of this PR - I will update the description to reflect that.

Ah okay, I think this was the crux of my misunderstanding. I was working on doc updates in #2083 and thought I needed to get this spellcheck passing. The name of this PR indicated to me that the feature was done, just not documented.

@ThomasHepworth ThomasHepworth mentioned this pull request Mar 27, 2024
4 tasks
@zslade
Contributor Author

zslade commented Mar 27, 2024

@ThomasHepworth, I think all mistakes have been corrected now 🙌 . See the latest diff for the decisions I made related to your PR #2102.

  • Please read through the guidance to check that it's all clear. I have updated with your British English dictionary suggestion 👍
  • I have also added some more comments to pyspelling.yml so it is easier to understand what the rules mean.
  • Please could you also check the docs are displaying correctly where we have made significant changes (mainly the tables and output text)? 🙏

@zslade zslade marked this pull request as ready for review March 27, 2024 15:57
@zslade zslade closed this Mar 28, 2024
@zslade zslade reopened this Mar 28, 2024
Contributor

@ThomasHepworth ThomasHepworth left a comment

Two minor comments and then you can merge this.

Contributor

@ThomasHepworth ThomasHepworth left a comment

This is really great, thank you so much for pulling it together! Sorry it was so much work.

@zslade zslade merged commit d1acc1d into master Mar 28, 2024
5 checks passed
@zslade zslade deleted the spellchecker_docs branch March 28, 2024 15:46
@zslade zslade mentioned this pull request Mar 28, 2024
10 tasks