Upgrade dependency versions #5

Merged
merged 28 commits into from
May 16, 2022

Conversation

riley-harper
Contributor

This PR upgrades many of hlink's dependencies to newer versions. This should make it much easier to maintain and keep it from getting stuck using old, broken versions of different packages.

  • Upgraded from Python 3.6 to Python 3.10.
  • Upgraded from Java 8 to Java 11.
  • Upgraded from Scala 2.11 to Scala 2.12.
  • Upgraded from pyspark 2 to pyspark 3. 🎉
  • Upgraded the library that hlink uses for computing Jaro-Winkler similarity scores. The new version includes some bug fixes, so computed Jaro-Winkler scores may change slightly in some cases. The old scores were incorrect in those cases.
  • Removed a workaround for avoiding logging errors in tests. The workaround isn't needed anymore in pyspark 3.
  • Upgraded to Jinja2 3, which slightly changed how Jinja's PackageLoader works. I added some empty templates/ subdirectories in link task packages, which fixed the issue.
  • Upgraded to newer versions of black and flake8, which only caused a few minor formatting changes.
  • Started using JaroWinklerSimilarity instead of JaroWinklerDistance in the Scala code. This should save us a headache the next time we upgrade Apache Commons Text, and the two classes have exactly the same logic right now.
  • Changed from using pandas.DataFrame.append() to pandas.concat(), following deprecation messages.
  • Updated some documentation to keep it current, and made the Sphinx docs automatically track the hlink version.
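
As an illustration of the pandas.DataFrame.append() to pandas.concat() change listed above, here is a minimal sketch. The DataFrame names and columns are hypothetical and are not taken from hlink's code; the point is only the shape of the replacement call.

```python
import pandas as pd

# Hypothetical data for illustration only; not hlink's real tables.
results = pd.DataFrame({"model": ["random_forest"], "precision": [0.91]})
new_row = pd.DataFrame({"model": ["logistic_regression"], "precision": [0.88]})

# Deprecated form (emits a FutureWarning in pandas 1.4, removed in pandas 2.0):
#     results = results.append(new_row, ignore_index=True)

# Replacement: pass every frame to pd.concat() in a single list.
# ignore_index=True renumbers the rows 0..n-1, as append() did with the same flag.
combined = pd.concat([results, new_row], ignore_index=True)

print(combined)
```

When rows are accumulated in a loop, it is usually cheaper to collect the frames in a list and call pd.concat() once at the end than to concatenate on every iteration.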

- This requires renaming OneHotEncoderEstimator -> OneHotEncoder
- Some tests are failing now
- I also updated the Scala dependency versions to keep them up to date
- This is a temporary fix that we should come back to check on later
- This doesn't affect anything right now, since we're not using Scala in
  the Dockerfile yet.
- When we upgraded from Apache Commons Text 1.4 to 1.9, we got a couple
  of Jaro-Winkler bugfixes which slightly changed the similarity scores
  returned here.
- These tests seem to have changed due to a change in randomSplit()'s
  internals from Spark v2 to Spark v3. So some of the training results
  are different, but similar.
- These tests are very finicky and I think that they were failing for
  the same reason as the previous ones -- differences in how training
  data was split.
- This required an upgrade to Spark 3, so now we can take care of it
- This introduced an error that I had some trouble tracking down. The
  PackageLoader couldn't load some of the packages because they didn't
  have a templates/ subdirectory. So I've added some empty templates/
  directories to fix the issue.
- This is a dependency of pytest, so let's just let pytest install it
- No other changes seem to be needed
- The biggest change here is that black now cares about whitespace in
  single-line doc comments.
- This will keep us out of trouble when the deprecated
  JaroWinklerDistance changes in Apache Commons Text 2.0:
  https://issues.apache.org/jira/browse/TEXT-104
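
To make the similarity/distance naming issue concrete, here is a small pure-Python sketch of the textbook Jaro-Winkler definition. It is not the Commons Text implementation and it is not hlink's Scala code; the function names are made up for this example, and it assumes the usual convention that distance is 1 minus similarity.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most this far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order, then halve.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1.0 - jaro)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    """Distance under the assumed convention: 1 - similarity."""
    return 1.0 - jaro_winkler_similarity(s1, s2)


print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler_distance("MARTHA", "MARHTA"), 4))
```

Under this convention a high similarity corresponds to a low distance, so code that thresholds on one value would need its comparison flipped if a library switched to returning the other. Calling the class that is explicitly named for similarity avoids depending on that.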

- I also removed some imports which were causing compiler warnings
- Some pandas type-inference code seems to have changed, so we need to change these queries to use integers instead of strings
- I haven't been able to come up with a surefire explanation for why this has
  changed. My theory is that something internal changed in the RandomForestClassifier
  with the upgrade from Java 8 to Java 11. The new value seems reasonable, and I
  don't think that this will cause any issues.
@jacwellington jacwellington merged commit 465d7ce into main May 16, 2022
@jacwellington jacwellington deleted the upgrade_versions branch May 16, 2022 20:49
jacwellington pushed a commit that referenced this pull request May 23, 2022
2 participants