## Set up Spark

In [1]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Check this site for the latest download link https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"


import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

spark = SparkSession \
       .builder \
       .appName("data processing experiment") \
       .getOrCreate()



In [2]:
from pyspark.sql.functions import col

seed = 4

## The example

Taken from the PySpark source code, this example creates a simple DataFrame from a range and then samples it with a specified fraction for each key. Then it's a simple query to count the rows grouped by key.

The actual outputs don't match the expected outputs (as documented in the code, or as you might think from maths).

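In the first case the data only has 99 rows. Each key (0, 1, 2) occurs 33 times. Key 2 isn't listed in the fractions, so its fraction is treated as zero and none of its rows are sampled.

So
- for a fraction of 0.1 we'd expect about 3 (33 × 0.1 = 3.3).
- for a fraction of 0.2 we'd expect 6 or 7 (33 × 0.2 = 6.6).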
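The expected counts above are only averages. As a rough sketch, assuming each row is kept independently with a probability equal to its key's fraction (my understanding of how `sampleBy` behaves, not something this notebook verifies), the count per key follows a binomial distribution, and we can work out how much spread to expect. The helper name `expected_count` is made up for this illustration.

```python
import math
from typing import Tuple

def expected_count(n_rows: int, fraction: float) -> Tuple[float, float]:
    """Mean and standard deviation of the number of rows kept.

    Assumption: every row is kept independently with probability
    `fraction`, so the count is binomial(n_rows, fraction).
    """
    mean = n_rows * fraction
    std = math.sqrt(n_rows * fraction * (1 - fraction))
    return mean, std

rows_per_key = 33  # 99 rows spread evenly over keys 0, 1 and 2
for key, fraction in {0: 0.1, 1: 0.2}.items():
    mean, std = expected_count(rows_per_key, fraction)
    print(f"key={key}: expected {mean:.1f} rows, give or take about {std:.1f}")
```

Under that assumption the spread is roughly 1.7 rows around 3.3 for key 0 and 2.3 rows around 6.6 for key 1, so with only 33 rows per key a result that is a few rows off the expected value would not be unusual.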


## Raw data

In [10]:
# from https://spark.apache.org/docs/3.3.2/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.sample

# create a dataset using a range - each row is the id modulo 3 - so we end up with 99 rows containing keys 0, 1, or 2
dataset = spark.range(0, 99).select((col("id") % 3).alias("key"))
# and summarise the data - counts by key
dataset.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|   33|
|  1|   33|
|  2|   33|
+---+-----+



## Sample with seed=4

In [11]:
# now sample the data using defined fractions for each key
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=4)
# and summarise the results - counts by key
sampled.groupBy("key").count().orderBy("key").show()

# Expected:
# +---+-----+
# |key|count|
# +---+-----+
# |  0|    3|
# |  1|    6|
# +---+-----+

+---+-----+
|key|count|
+---+-----+
|  0|    2|
|  1|    7|
+---+-----+



## Sample with seed=8

In [14]:
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=8)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|    8|
|  1|    4|
+---+-----+



## Sample with seed=12

In [15]:
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=12)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|    4|
|  1|    8|
+---+-----+



This isn't what I expected - quite a bit of variability and not very accurate proportions - so let's increase the data size to see if it gets closer...
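To get a feel for how much the seed alone can move the counts at this size, here is a small pure-Python simulation. It is only a sketch: it assumes per-row independent sampling with the key's fraction, uses Python's `random` module rather than Spark's generator, and the function name `simulate_sample_by` is hypothetical. It will not reproduce the exact Spark counts above, only the same kind of variability.

```python
import random
from collections import Counter

def simulate_sample_by(n_rows, fractions, seed):
    """Keep each row independently with probability fractions[key].

    Keys missing from `fractions` get probability 0, so they are never kept.
    Rows are ids 0..n_rows-1 and the key is id % 3, as in the notebook.
    """
    rng = random.Random(seed)
    counts = Counter()
    for row_id in range(n_rows):
        key = row_id % 3
        if rng.random() < fractions.get(key, 0.0):
            counts[key] += 1
    return dict(sorted(counts.items()))

fractions = {0: 0.1, 1: 0.2}
for seed in (4, 8, 12):
    print(seed, simulate_sample_by(99, fractions, seed))
```

Running it with a handful of seeds at 99 rows and again at a much larger `n_rows` is a cheap way to see the counts settle down without starting a Spark session.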

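Increasing the size of the data (to 1000) gets a closer result for key 0 (38 against an expected 33.4), although key 1 is still some way off (87 against an expected 66.6)...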

In [5]:
dataset = spark.range(0, 1000).select((col("id") % 3).alias("key"))
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=seed)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|   38|
|  1|   87|
+---+-----+



Increasing the size of the data (to 10,000) gets closer still...

In [6]:
dataset = spark.range(0, 10000).select((col("id") % 3).alias("key"))
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=seed)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|  336|
|  1|  697|
+---+-----+



With 100,000 rows it's pretty much there...

In [7]:
dataset = spark.range(0, 100000).select((col("id") % 3).alias("key"))
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=seed)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0| 3314|
|  1| 6750|
+---+-----+



And with 1,000,000 it's spot on (within about 0.2% of the expected 33,333 and 66,667):

In [8]:
dataset = spark.range(0, 1000000).select((col("id") % 3).alias("key"))
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=seed)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|33312|
|  1|66800|
+---+-----+



And using different seeds gives similar results this time:

In [16]:
dataset = spark.range(0, 1000000).select((col("id") % 3).alias("key"))
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=8)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|33134|
|  1|67107|
+---+-----+



In [18]:
dataset = spark.range(0, 1000000).select((col("id") % 3).alias("key"))
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=12)
sampled.groupBy("key").count().orderBy("key").show()

+---+-----+
|key|count|
+---+-----+
|  0|33294|
|  1|66564|
+---+-----+



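So as we increase the data size the proportions become more accurate - I guess nothing surprising there. As far as I can tell, sampleBy doesn't pick an exact number of rows per key; it keeps each row independently with a probability equal to that key's fraction, so the counts are random and only settle around the expected proportions once there are plenty of rows. That's also why different seeds give significantly different outputs when the data is small.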
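To put numbers on "more accurate", the snippet below takes the counts reported by the seed=4 runs in this notebook and computes how far each one is from its expected value, as a fraction of that expected value. The helper names `rows_with_key` and `relative_errors` are made up for this sketch; the observed counts are copied from the outputs above.

```python
# Counts reported by the seed=4 runs above: {n_rows: {key: observed count}}
observed = {
    99: {0: 2, 1: 7},
    1_000: {0: 38, 1: 87},
    10_000: {0: 336, 1: 697},
    100_000: {0: 3314, 1: 6750},
    1_000_000: {0: 33312, 1: 66800},
}
fractions = {0: 0.1, 1: 0.2}

def rows_with_key(n_rows, key, modulus=3):
    """Number of ids in range(n_rows) whose id % modulus equals key."""
    return len(range(key, n_rows, modulus))

def relative_errors(observed, fractions):
    """Relative error |observed - expected| / expected per data size and key."""
    result = {}
    for n_rows, counts in observed.items():
        result[n_rows] = {}
        for key, count in counts.items():
            expected = rows_with_key(n_rows, key) * fractions[key]
            result[n_rows][key] = abs(count - expected) / expected
    return result

for n_rows, errs in relative_errors(observed, fractions).items():
    print(n_rows, {key: f"{err:.1%}" for key, err in errs.items()})
```

The errors fall from tens of percent at the small sizes to a fraction of a percent at 1,000,000 rows, but not in a straight line: for this seed, key 1 is further off in relative terms at 1,000 rows than at 99.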

In the next episode I'll look at sampling using multiple columns...