Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MBS-5733: Allow to validate otherdatabases URL relationships #354

Merged
merged 8 commits into from Sep 30, 2016

Conversation

yvanzo
Copy link
Contributor

@yvanzo yvanzo commented Sep 4, 2016

Previously, it was not handy to define validation rules for otherdatabases URLs, because validation rules had to be defined separately for each entity type whereas there were much more other databases.

Changes caused by this improvement:

  • LINK_TYPES no longer contain fictive duplicate IDs
  • CLEANUPS are website match-related before relationship type-related
  • New validation rules can be defined in CLEANUPS.*.validate and are stricter than old ones:

Old validation rules are still supported. Ultimately, all of them should be converted.

Additionally, other restricted URL relationships (lyrics, score, youtube) can be validated the same way.

Two examples are provided: Soundtrack Collector (converted rules) and The Session (created rules).

y-van-z added 2 commits September 6, 2016 00:42
Prior to this patch, LINK_TYPES contained duplicate entries for
websites that could match more than one relationship type.  For
example, (fictive) image.work and (real) score.work for Wikimedia
Commons.  CLEANUPS used to match at most one relationship type.

- LINK_TYPES do not contain fictive duplicate relationships anymore.
- For Wikimedia Commons, Vimeo, and YouTube, the `type` property
  contains a custom mapping of LINK_TYPES based on entity types.
  Side-effect: `streamingmusic` gets auto-selected for YouTube links
  from `release` entities.  Note: this spots an issue with Vimeo and
  YouTube URLs type, it should be guessed as `videochannel` instead.

This starts a shift of CLEANUPS from link types towards web sites.
Link types now have a unique path through LINK_TYPES, this helps to
distinguish restricted link types from others.  This will help to
validate restricted link types such as other databases (MBS-5733).
'Image' URL-Work relationship does not exist, it was a conveniency
alias defined in `LINK_TYPES` for 'Score' URL-Work relationship.

This alias has been removed in the preceding commit.

This patch updates Wikimedia and CPDL tests accordingly.
Copy link
Member

@mwiencek mwiencek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me a while to understand what's going on, but looks good with a couple minor changes.

@@ -232,6 +226,35 @@ const LINK_TYPES = {
}
};

const RESTRICTED_LINK_TYPES = _.reduce([
Copy link
Member

@mwiencek mwiencek Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining what this does? Looks like a list of link types that only accept whitelisted URLs. So if someone selects one of these, it must have a corresponding cleanup entry and pass its validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does what it looks like to do. I picked the term restricted from the doc Style / Relationships / URLs / Restricted relationships. Shall I replace it by whitelisted instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name is fine, I think just linking to that page in a comment should be explanatory enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment for both LINK_TYPES and RESTRICTED_LINK_TYPES and improved the commit message, see 8514695.

},
validate: function (url, id) {
if (id === LINK_TYPES.otherdatabases.artist)
return /^http:\/\/soundtrackcollector\.com\/composer\/[0-9]+\/$/.test(url);
Copy link
Member

@mwiencek mwiencek Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer braces around all the if statements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, see 280521a and b61b16c.

y-van-z added 6 commits September 29, 2016 15:50
Previously, validation rules were grouped by link type.

This patch adds support for validation rules grouped by website.
Existing validation rules are untouched and still supported.

It generates validation rules for each link type, by trying to find
any matching CLEANUPS entry and then using its `validate` property.
This will be used to allow other databases validation (MBS-5733).

It adds a new list of link types that only accept whitelisted URLs,
such as `otherdatabases`, `lyrics`, `wikidata`, `bookbrainz`...

Important change: matched URLs are no more allowed in other link types
than the ones specified by the `type` property if any.  For example,
Vimeo URLs are now blocked but for their auto-selected link types.
Previously, limited validation rules applied to `otherdatabases` URLs.

With this patch, validation rules for any specific other database can
be added through a `validate` property in a dedicated CLEANUPS entry.

This patch moves the validation of (the only validated other database)
Soundtrack Collector links to a specific CLEANUPS entry for this
website.  It removes all hard-coded `otherdatabases` validation rules
to rely on the validation process introduced in the previous commit.

Note: It does neither fix nor complete validation rules for Soundtrack
Collector URLs.  This will be fixed as an example in the next commit.
Prior to this patch, SC soundtrack URLs were blocked for artist
entities, SC composer URLs were blocked for release_group entities,
and both soundtrack and composer were blocked for release entities.

With this patch, the above cases are still blocked and additionally:
- only clean SC composer URLs are allowed for artist entities
- only clean SC soundtrack URLs are allowed for release_group entities
  (all other SC URLs are blocked for these two entity types)
- all SC URLs are blocked for all other entity types
- all SC URLs are blocked for all other link types

This comes as an example for `otherdatabases` validation (MBS-5733).
Prior to this patch, The Session (TS) URLs were only auto-detected as
`otherdatabases` relationships and cleaned up.

With this patch, they are also validated as follows:
- only clean TS artists URLs are allowed for artist entities
- only clean TS events URLs are allowed for event entities
- only clean TS recordings URLs are allowed for release_group entities
- only clean TS tunes URLs are allowed for work entities
  (all other TS URLs are blocked for these four entity types)
- all TS URLs are blocked for other entity types
- all TS URLs are blocked for other link types

This comes as a full example of other databases validation (MBS-5733).
It includes a 'bad' URL test (example for MBS-5736).
@mwiencek
Copy link
Member

👍 Thanks. :) Sorry again that it took so long to review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants