
AttributeError: 'NoneType' object has no attribute 'sparkContext' #1872

Closed · Fixed by #1873

w2o-hbrashear (Contributor) opened this issue on Jan 22, 2024 · 1 comment
What happens?

When running the spark example notebook "deduplicate_1k_synthetic", constructing the linker fails with:

AttributeError: 'NoneType' object has no attribute 'sparkContext'

AttributeError                            Traceback (most recent call last)
File <command-3287110354252329>:2
      1 from splink.spark.linker import SparkLinker
----> 2 linker = SparkLinker(df, settings)
      3 deterministic_rules = [
      4     "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
      5     "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
      6     "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
      7     "l.email = r.email"
      8 ]
     10 linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)


File {redacted}/splink/spark/linker.py:192, in SparkLinker.__init__(self, input_table_or_tables, settings_dict, break_lineage_method, set_up_basic_logging, input_table_aliases, spark, validate_settings, catalog, database, repartition_after_blocking, num_partitions_on_repartition, register_udfs_automatically)
    190 self.in_databricks = "DATABRICKS_RUNTIME_VERSION" in os.environ
    191 if self.in_databricks:
--> 192     enable_splink(spark)
    194 self._set_default_break_lineage_method()
    196 if register_udfs_automatically:

File /Workspace/Repos/hbrashear@w2ogroup.com/hbrashear-splink-issue-param-spark/splink/databricks/enable_splink.py:15, in enable_splink(spark)
      4 def enable_splink(spark):
      5     """
      6     Enable Splink functions.
      7     Use this function at the start of your workflow to ensure Splink is registered on
   (...)
     13         None
     14     """
---> 15     sc = spark.sparkContext
     16     _jar_path = similarity_jar_location()
     17     JavaURI = sc._jvm.java.net.URI

AttributeError: 'NoneType' object has no attribute 'sparkContext'
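
For context, the traceback shows the failure path: SparkLinker's constructor takes an optional spark argument and, when it detects a Databricks runtime, immediately calls enable_splink(spark). A minimal sketch of that path, with the argument list abbreviated and the spark=None default assumed from the error (the actual defaults are not shown in the traceback):

import os

def enable_splink(spark):
    # splink/databricks/enable_splink.py:15 -- raises AttributeError
    # when spark is None, since None has no sparkContext attribute.
    sc = spark.sparkContext
    ...

class SparkLinker:
    # Signature abbreviated; spark=None is an assumption based on the error.
    def __init__(self, input_table_or_tables, settings_dict, spark=None, **kwargs):
        self.in_databricks = "DATABRICKS_RUNTIME_VERSION" in os.environ
        if self.in_databricks:
            enable_splink(spark)  # still None if the caller omitted it

So on Databricks the constructor needs an explicit session, which the notebook cell does not supply.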

To Reproduce

Run spark example notebook "deduplicate_1k_synthetic" on Databricks 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12)

OS:

Databricks 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12)

Splink version:

splink==3.9.8

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
w2o-hbrashear (Contributor, Author) commented:

It's a really quick fix: in cell 5, change

linker = SparkLinker(df, settings)

to

linker = SparkLinker(df, settings, spark=spark)
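
For reference, the corrected cell as a whole would read as follows (assembled from the traceback above; spark is the SparkSession that Databricks provides in every notebook):

from splink.spark.linker import SparkLinker

# Pass the Databricks-provided SparkSession explicitly, so that
# enable_splink receives a live session rather than None.
linker = SparkLinker(df, settings, spark=spark)

deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)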

RobinL added a commit that referenced this issue Jan 22, 2024
Fixes #1872 Update deduplicate_1k_synthetic.ipynb to fix spark error