In [2]:
from pyspark.sql import SparkSession
import os

In [3]:
local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("SparkDoubleTypeTest") \
        .getOrCreate()
else:
    spark = SparkSession.builder \
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("SparkDoubleTypeTest") \
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "8g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()

22/03/09 10:58:10 WARN Utils: Your hostname, pliu-SATELLITE-P850 resolves to a loopback address: 127.0.1.1; using 172.22.0.33 instead (on interface wlp3s0)
22/03/09 10:58:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/03/09 10:58:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [12]:
data = [("James", "Sales", "NY", 900.01, 34, 10000),
            ("Michael", "Sales", "NY", 860.02, 56, 20000),
            ("Robert", "Sales", "CA", 810.03, 30, 23000),
            ("Maria", "Finance", "CA", 900.04, 24, 23000),
            ("Raman", "Finance", "CA", 990.05, 40, 24000),
            ("Scott", "Finance", "NY", 830.06, 36, 19000),
            ("Jen", "Finance", "NY", 790.07, 53, 15000),
            ("Jeff", "Marketing", "CA", 800.08, 25, 18000),
            ("Kumar", "Marketing", "NY", 910.09, 50, 21000)
            ]

schema = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=data, schema=schema)

In [13]:
df.show(5)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY|900.01| 34|10000|
|      Michael|     Sales|   NY|860.02| 56|20000|
|       Robert|     Sales|   CA|810.03| 30|23000|
|        Maria|   Finance|   CA|900.04| 24|23000|
|        Raman|   Finance|   CA|990.05| 40|24000|
+-------------+----------+-----+------+---+-----+
only showing top 5 rows



In [14]:
df.printSchema()

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)



In [15]:
from pyspark.sql.functions import col
df.withColumn("raised_salary",col("salary")+0.1)\
    .withColumn("reduced_salary",col("salary")-0.1).show()

+-------------+----------+-----+------+---+-----+-----------------+-----------------+
|employee_name|department|state|salary|age|bonus|    raised_salary|   reduced_salary|
+-------------+----------+-----+------+---+-----+-----------------+-----------------+
|        James|     Sales|   NY|900.01| 34|10000|           900.11|           899.91|
|      Michael|     Sales|   NY|860.02| 56|20000|           860.12|           859.92|
|       Robert|     Sales|   CA|810.03| 30|23000|           810.13|           809.93|
|        Maria|   Finance|   CA|900.04| 24|23000|           900.14|899.9399999999999|
|        Raman|   Finance|   CA|990.05| 40|24000|           990.15|989.9499999999999|
|        Scott|   Finance|   NY|830.06| 36|19000|           830.16|829.9599999999999|
|          Jen|   Finance|   NY|790.07| 53|15000|790.1700000000001|           789.97|
|         Jeff| Marketing|   CA|800.08| 25|18000|800.1800000000001|           799.98|
|        Kumar| Marketing|   NY|910.09| 50|21000|     

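After adding or subtracting 0.1, several results are slightly off (for example, 899.9399999999999 instead of 899.94). This is not a bug in Spark: the Spark engine is implemented in Scala and runs on the JVM, so a Spark double column is a JVM double, which is a 64-bit IEEE 754 binary floating-point number.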

**In Scala, a double is stored internally as a binary fraction, that is, a finite sum of terms like 1/4=0.25 + 1/8=0.125 + 1/16=0.0625 + ...**

As a result, a value such as 899.94 or 829.96 cannot be stored as an exact fraction in binary: a double cannot hold exactly 0.94 or 0.96 (or 0.1 itself), so it stores the nearest representable value, and the result of the subtraction is slightly off.
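The same effect can be reproduced without Spark, because a Python float is also an IEEE 754 double. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the notebook above.

```python
from decimal import Decimal
from fractions import Fraction

salary = 900.04
reduced = salary - 0.1  # same double arithmetic as the Spark column expression

# The double closest to 0.1 is not exactly the fraction 1/10.
tenth_is_exact = Fraction(0.1) == Fraction(1, 10)

# The computed result is not the double closest to 899.94.
matches_expected = reduced == 899.94

# Decimal values built from strings keep their base-10 digits exactly.
exact = Decimal("900.04") - Decimal("0.1")

print(tenth_is_exact, matches_expected, exact)
```

Building the Decimal values from strings matters here: Decimal(0.1) would start from the already-inexact double.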

If you need exact decimal arithmetic, convert the column type from double/float to DecimalType. For all column types supported in Spark, see: https://spark.apache.org/docs/latest/sql-ref-datatypes.html

In fact, this is a general problem in many programming languages. For more detail about floating-point arithmetic, see https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html