Unable to load Jaro-Winkler Similarity in AWS Glue #1377

richard-a-lott · 2023-06-29T10:00:01Z

richard-a-lott
Jun 29, 2023

Hi,

I've had a bit of an issue using Splink with AWS Glue (similar to this one: #636 (comment)) where Spark is unable to find and load the Splink UDF jar.

I confirmed the jar was present in the path identified by similarity_jar_location, and I confirmed the jar was included in multiple spark config parameters (spark.jars, spark.driver.extraCLassPath, spark.executor.extraClassPath), however my Glue job refused to recognise the jar.

I have found a work-around, which I'll share here, but it would be good to know if anyone has found a better way?

My work around is to specify, in the Glue Job Details, a location for "Dependend Jars Path" (has to be full S3 URI to the jar file, including the filename), and then to upload the jar file into this S3 location, at the start of the Glue job, prior to creating the Spark session:

(apologies for the image, I'm unable to paste the code in directly)

Maybe a bit messy but it seems to be working. Anyone have a better method?

richard-a-lott · 2023-06-29T15:33:34Z

richard-a-lott
Jun 29, 2023
Author

Actually this doesn't (quite) work! One has to copy the jar file into S3 before running the job. The rest of the method works though.

In my case I'd already copied the jar, so it existed before the job ran.

So a revised method to the above is:

Copy the jar file to a target S3 location - manually or via another process prior to your Glue job
Reference the S3 location in the Glue job parameter '--extra-jars'

0 replies

93Akkord · 2023-06-29T17:10:52Z

93Akkord
Jun 29, 2023

Hmmm.. I've been able to use and load the jar file just fine. Are you including splink into your job in the job parameters section like below?

1 reply

richard-a-lott Jun 30, 2023
Author

Yes I'm doing this (version 3.9.2 though), and the jar is definitely present in the Glue environment, but spark wasn't loading it, even after specifying it as per the example in here.

Let me check again in case I've done something stupid.

richard-a-lott · 2023-06-30T08:12:55Z

richard-a-lott
Jun 30, 2023
Author

No it's still not working. I can't see what I've done wrong. Here's an overview of my spark config:

Here is the log showing it's being picked up by spark:
[Edit: Image removed]

And here's the susequent log showing it's failed to load the udf:
[Edit: Image removed]

I'm baffled. Any ideas?

8 replies

richard-a-lott Jul 5, 2023
Author

My first line of support is the team that manages our AWS environment, so I will raise a question with them, however whether they are able to answer (and willing to make any required access changes) is another question!

If you wanted to raise this separately with AWS then you might get a more comprehensive answer. I'm happy to provide what info I can, if you want to go down this route.

richard-a-lott Aug 3, 2023
Author

@RobinL I've been following up this issue with AWS support (via my internal support team) and after a bit of investigation they have come back with the following response. It seems that the 'best practice' solution suggested above should not work, and the jar will need to be in S3 before running, in order for the job to work properly.

I'm happy with my current working method. I just wanted to share this with you for info.

AWS response:

[...]
The team confirmed that Glue does not make use of the spark.jars parameter. The Glue job builds a separate classpath used to start the individual executors before connecting to the driver. Once started, this classpath cannot be changed. In other words, even though we can access the .jar file from the script on the driver, changing the classpath already on the executors themselves to point to the jar is not supported.

In order to make sure the latest jar is always on the UDF, it is instead recommended to regularly update the jar in S3 to the latest/desired version; replacing the jar with a file of the same name would allow the Glue job to still use the latest jar without making changes to the job itself. Alternatively, the path in the ‘--extra-jars’ parameter can be updated to point to the latest desired version of the jar in S3[1].

References:

[1] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html

RobinL Aug 3, 2023
Maintainer

Thanks - that's super useful and seems to explain the annoying behaviour we've been observing!

So basically means there's no way for us to write aws glue compatible code that auto registers the jar

@samnlindsay worth you being aware

richard-a-lott Aug 3, 2023
Author

The AWS response states that it's not possible to alter the classpath. If there were a way to copy the jar into a location already on the classpath that might work (but I don't think it's so easy to move files around on the executors is it?)

RobinL Aug 3, 2023
Maintainer

That's a good idea, hadn't thought of that. Worth a try

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to load Jaro-Winkler Similarity in AWS Glue #1377

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 9 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

References:

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Unable to load Jaro-Winkler Similarity in AWS Glue #1377

richard-a-lott Jun 29, 2023

Replies: 3 comments · 9 replies

richard-a-lott Jun 29, 2023 Author

93Akkord Jun 29, 2023

richard-a-lott Jun 30, 2023 Author

richard-a-lott Jun 30, 2023 Author

richard-a-lott Jul 5, 2023 Author

richard-a-lott Aug 3, 2023 Author

References:

RobinL Aug 3, 2023 Maintainer

richard-a-lott Aug 3, 2023 Author

RobinL Aug 3, 2023 Maintainer

richard-a-lott
Jun 29, 2023

Replies: 3 comments 9 replies

richard-a-lott
Jun 29, 2023
Author

93Akkord
Jun 29, 2023

richard-a-lott Jun 30, 2023
Author

richard-a-lott
Jun 30, 2023
Author

richard-a-lott Jul 5, 2023
Author

richard-a-lott Aug 3, 2023
Author

RobinL Aug 3, 2023
Maintainer

richard-a-lott Aug 3, 2023
Author

RobinL Aug 3, 2023
Maintainer