Update performance.md #1865

Merged
merged 2 commits into from
Jan 22, 2024
4 changes: 2 additions & 2 deletions docs/topic_guides/blocking/performance.md
@@ -11,7 +11,7 @@ Below we run through an example of how to address each of these drivers.

One way to reduce the number of comparisons being considered within a model is to apply strict blocking rules. However, this can have a significant impact on the how well the Splink model works.

-In reality, we recommend getting a model up and running with strict Blocking Rules and incrementally loosening them to see the impact on the runtime and quality of the results. By starting with strict blocking rules, the linking process will run faster which will means you can iterate through model versions more quickly.
+In reality, we recommend getting a model up and running with strict Blocking Rules and incrementally loosening them to see the impact on the runtime and quality of the results. By starting with strict blocking rules, the linking process will run faster which means you can iterate through model versions more quickly.
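
To illustrate the trade-off described in this paragraph, here is a minimal sketch in plain Python (it does not use the Splink API). The records, field names, and the `count_comparisons` helper are all hypothetical; the point is only that each loosening of the blocking rule lets more record pairs through to be scored.

```python
from itertools import combinations

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"id": 1, "first_name": "amy", "surname": "smith", "dob": "1990-01-01"},
    {"id": 2, "first_name": "amy", "surname": "smith", "dob": "1990-01-01"},
    {"id": 3, "first_name": "amy", "surname": "smyth", "dob": "1990-01-01"},
    {"id": 4, "first_name": "bob", "surname": "smith", "dob": "1985-06-30"},
    {"id": 5, "first_name": "amy", "surname": "jones", "dob": "1972-11-12"},
]


def count_comparisons(records, blocking_keys):
    """Count record pairs that agree on every field in blocking_keys."""
    return sum(
        all(a[key] == b[key] for key in blocking_keys)
        for a, b in combinations(records, 2)
    )


# Rules ordered from strictest to loosest.
rules = [
    ("first_name", "surname", "dob"),
    ("first_name", "dob"),
    ("first_name",),
]

counts = [count_comparisons(records, keys) for keys in rules]
for keys, n in zip(rules, counts):
    print(" AND ".join(keys), "->", n, "comparisons")
```

On real data the counts grow far faster than in this toy example, which is why starting strict and loosening step by step keeps iteration quick.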

??? example "Example - Incrementally loosening Prediction Blocking Rules"

@@ -84,4 +84,4 @@ In most SQL engines, an `OR` condition within a blocking rule will result in all

Given the ability to parallelise operations in Spark, there are some additional configuration options which can improve performance of blocking. Please refer to the Spark Performance Topic Guides for more information.

-Note: In Spark Equi-joins are implemented using hash partitioning, which facilitates splitting the workload across multiple machines.
+Note: In Spark Equi-joins are implemented using hash partitioning, which facilitates splitting the workload across multiple machines.
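
The idea behind the note above can be sketched in plain Python. This is a conceptual illustration only, not Spark's actual implementation: the function names, the `surname` join key, and the partition count are assumptions made for the example. Rows are routed to partitions by the hash of the join key, so rows with equal keys always land in the same partition and each partition can be joined independently (in a cluster, on a different machine).

```python
from collections import defaultdict


def hash_partition(rows, key, n_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions


def partitioned_equi_join(left, right, key, n_partitions=4):
    """Equi-join two row lists by joining matching partitions one at a time."""
    left_parts = hash_partition(left, key, n_partitions)
    right_parts = hash_partition(right, key, n_partitions)
    pairs = []
    for p in range(n_partitions):
        # Each iteration only touches one partition, so in a distributed
        # engine these could run on separate workers.
        index = defaultdict(list)
        for l_row in left_parts.get(p, []):
            index[l_row[key]].append(l_row)
        for r_row in right_parts.get(p, []):
            for l_row in index.get(r_row[key], []):
                pairs.append((l_row["id"], r_row["id"]))
    return sorted(pairs)


# Hypothetical input tables.
left = [
    {"id": "a1", "surname": "smith"},
    {"id": "a2", "surname": "jones"},
    {"id": "a3", "surname": "smith"},
]
right = [
    {"id": "b1", "surname": "smith"},
    {"id": "b2", "surname": "patel"},
]

joined = partitioned_equi_join(left, right, key="surname")
print(joined)
```

A condition that is not a simple equality gives no such key to hash on, so the work cannot be split this way.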
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@
"\n",
"A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field is different than for a first name field.\n",
"\n",
-"By tailoring the definitions of similarity, linking models are more effectively able to distinguish beteween different gradations of similarity, leading to more accurate data linking models.\n",
+"By tailoring the definitions of similarity, linking models are more effectively able to distinguish between different gradations of similarity, leading to more accurate data linking models.\n",
"\n",
"Note that for performance reasons, Splink requires the user to define `n` discrete levels (gradations) of similarity.\n",
"\n",
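
As a rough illustration of what `n` discrete levels of similarity can look like, the sketch below maps a pair of first names onto three levels in plain Python. It does not use Splink's comparison API; the function names, the thresholds, and the choice of Levenshtein distance are assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def first_name_level(a, b):
    """Map a pair of first names to one of three discrete similarity levels.

    Assumed convention for this sketch: a higher number means more similar.
    """
    if a == b:
        return 2  # exact match
    if levenshtein(a, b) <= 1:
        return 1  # probable typo
    return 0  # anything else


print(first_name_level("robert", "robert"))
print(first_name_level("robert", "robart"))
print(first_name_level("robert", "alice"))
```

A date of birth field would typically use different level definitions (for example, exact match versus same year), which is the kind of tailoring the text describes.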