Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle non-ascii chars in search #7378

Merged
merged 9 commits into from Mar 11, 2021
4 changes: 3 additions & 1 deletion conf/solr/8.8.1/schema.xml
Expand Up @@ -38,7 +38,7 @@
catchall "text" field, and use that for searching.
-->

<schema name="default-config" version="1.6">
<schema name="default-config" version="1.7">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
version="x.y" is Solr's version number for the schema syntax and
semantics. It should not normally be changed by applications.
Expand Down Expand Up @@ -566,6 +566,7 @@
<filter class="solr.KeywordRepeatFilterFactory" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
Expand Down Expand Up @@ -616,6 +617,7 @@
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
Expand Down
10 changes: 10 additions & 0 deletions doc/release-notes/820-non-ascii-chars-in-search.md
@@ -0,0 +1,10 @@
(review these notes if this gets into the same release as #7645 as the steps are included there - we expect to include this in the same release)

### Search with non-ascii characters

Many languages include characters that have close analogs in ascii, e.g. (á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü…). This release changes the default Solr configuration to allow search to match words based on these associations, e.g. a search for Mercè would match the word Merce in a Dataset, and vice versa. This should generally be helpful, but can result in false positives.,e.g. "canon" will be found searching for "cañon".

## Upgrade Instructions

1. You will need to replace or modify your `schema.xml` and restart solr. Re-indexing is required to get full-functionality from this change - the standard instructions for an incremental reindex could be added here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When PR #7645 gets merged (in QA), this PR will need to be updated, as these release notes should be merged with the release notes included in that Solr update PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the release notes from the Solr upgrade PR.

Solr Update

With this release we upgrade to the latest available stable release in the Solr 8.x branch. We recommend a fresh installation of Solr 8.8.1 (the index will be empty)
followed by an "index all".

Before you start the "index all", Dataverse will appear to be empty because
the search results come from Solr. As indexing progresses, partial results will
appear until indexing is complete.

See http://guides.dataverse.org/installation/prerequisites.html#installing-solr

[for the additional upgrade steps section]

Run the script updateSchemaMDB.sh to generate updated solr schema files and preserve any other custom fields in your Solr configuration.
For example: (modify the path names as needed)
cd /usr/local/solr-8.8.1/server/solr/collection1/conf
wget https://github.com/IQSS/dataverse/releases/download/v5.4/updateSchemaMDB.sh
chmod +x updateSchemaMDB.sh
./updateSchemaMDB.sh -t .

See http://guides.dataverse.org/en/5.4/admin/metadatacustomization.html?highlight=updateschemamdb for more information.

Copy link
Contributor

@mheppler mheppler Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djbrooke in tech hours, @scolapasta suggested moving this PR forward with the note to REMOVE this "Upload Instructions" section, as a duplicate of the steps already included in the Solr upgrade PR #7645.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mheppler great, thanks, I'll leave a note in this PR at the top of the notes so I don't forget, in case something slips and we don't get this into the release we expect to.