This repository has been archived by the owner on May 18, 2023. It is now read-only.

Update arrays.py
RobinL committed Oct 6, 2020
1 parent 6e57f0e commit a7087be
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions splink_data_normalisation/arrays.py
@@ -1,9 +1,10 @@
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import expr, regexp_replace, col

# One of the original motivations for this was problems with Athena handling arrays containing only a null (i.e. [None,], as opposed to None)
# This fixes a problem where Athena can't handle a parquet file with a zero length array
# so [None] is fine, and so is None, but [] is not
# See here: https://forums.aws.amazon.com/thread.jspa?messageID=874178&tstart=0
# This no longer seems to be a problem: https://gist.github.com/RobinL/0692e2cd266483b3088646206aa8be62
# A reprex is here https://gist.github.com/RobinL/0692e2cd266483b3088646206aa8be62
def fix_zero_length_arrays(df:DataFrame):
    """For every field of type array, turn zero length arrays into true nulls
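
The body of fix_zero_length_arrays is collapsed in this view, so the following is only a minimal sketch of the behaviour the docstring describes, not the code in the commit. It assumes array columns can be recognised from the type strings in df.dtypes, and that a Spark SQL CASE expression on size() is an acceptable way to null out empty arrays. The helper names array_column_names, null_if_empty_sql and empty_arrays_to_null are hypothetical.

```python
from typing import List, Tuple


def array_column_names(dtypes: List[Tuple[str, str]]) -> List[str]:
    """Return the names of array-typed columns from (name, type) pairs.

    The pairs are expected in the shape of DataFrame.dtypes,
    e.g. [("id", "bigint"), ("names", "array<string>")].
    """
    return [name for name, dtype in dtypes if dtype.startswith("array")]


def null_if_empty_sql(column_name: str) -> str:
    """Build a Spark SQL expression mapping [] to NULL for one column.

    Non-empty arrays pass through unchanged, and so do existing NULLs,
    because size(NULL) never equals 0.
    """
    return (
        f"CASE WHEN size(`{column_name}`) = 0 "
        f"THEN NULL ELSE `{column_name}` END"
    )


def empty_arrays_to_null(df):
    """Hypothetical stand-in for fix_zero_length_arrays (assumption, see above)."""
    # Imported here so the two pure helpers above can be used without Spark.
    from pyspark.sql.functions import expr

    for name in array_column_names(df.dtypes):
        df = df.withColumn(name, expr(null_if_empty_sql(name)))
    return df
```

Under these assumptions, a column holding [None] or None is left as it is and only [] becomes a null, which matches the three cases listed in the comments above.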
