In [None]:
PySpark provides the pyspark.sql.DataFrame.sample(), pyspark.sql.DataFrame.sampleBy(), RDD.sample(), and RDD.takeSample() methods
for drawing a random sample from a large dataset.

Syntax:
    sample(withReplacement, fraction, seed=None)

In [None]:
Sampling works on both RDDs and DataFrames. Specify how much data to retrieve with fraction, for example 0.1 for roughly 10% of the rows. The result size is approximate: sampling 100 rows with fraction 0.1 returns about 10 rows, not necessarily exactly 10.
Use seed to reproduce the same sample across runs.
Use withReplacement=True if the same record may appear more than once in the sample.
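
The behaviour of fraction and seed can be illustrated outside Spark. The following is a minimal pure-Python sketch, not Spark's implementation: it assumes that sampling without replacement keeps each row independently with probability fraction. The helper name bernoulli_sample is hypothetical.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))
first = bernoulli_sample(rows, 0.1, seed=123)
second = bernoulli_sample(rows, 0.1, seed=123)
other = bernoulli_sample(rows, 0.1, seed=456)

print(first == second)         # True: the same seed reproduces the same sample
print(len(first), len(other))  # sizes are close to 10 but need not equal 10
```

Because every row is an independent coin flip, the sample size varies from seed to seed, which is why fraction controls the expected size rather than the exact size.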

In [2]:
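from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()

# Using fraction only: roughly 6% of the rows, different on every run
df = spark.range(100)
print(df.sample(fraction=0.06).collect())

# Using seed: the same seed reproduces the same sample
print(df.sample(fraction=0.1, seed=123).collect())

# A different seed gives a different sample
print(df.sample(fraction=0.1, seed=456).collect())

# Using replacement: the sample may contain duplicates
print(df.sample(withReplacement=True, fraction=0.3, seed=123).collect())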



[Row(id=49), Row(id=56), Row(id=57), Row(id=90)]
[Row(id=36), Row(id=37), Row(id=41), Row(id=43), Row(id=56), Row(id=66), Row(id=69), Row(id=75), Row(id=83)]
[Row(id=19), Row(id=21), Row(id=42), Row(id=48), Row(id=49), Row(id=50), Row(id=75), Row(id=80)]
[Row(id=0), Row(id=5), Row(id=9), Row(id=11), Row(id=14), Row(id=14), Row(id=16), Row(id=17), Row(id=21), Row(id=29), Row(id=33), Row(id=41), Row(id=42), Row(id=52), Row(id=52), Row(id=54), Row(id=58), Row(id=65), Row(id=65), Row(id=71), Row(id=76), Row(id=79), Row(id=85), Row(id=96)]


In [None]:
Use the sampleBy() method for stratified sampling without replacement. It returns a sample of the DataFrame in which each stratum (each distinct value of col) is sampled at the fraction given for it in fractions.
If a stratum is not listed in fractions, its fraction is treated as zero, so none of its rows are returned.
Syntax: sampleBy(col, fractions, seed=None)
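
The per-stratum rule can be sketched in plain Python. This is an illustration of the idea only, not Spark's implementation; the helper name sample_by and its key_fn parameter are hypothetical.

```python
import random

def sample_by(rows, key_fn, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum (hypothetical helper)."""
    rng = random.Random(seed)
    # A stratum missing from `fractions` gets probability 0.0, so it is never kept.
    return [row for row in rows if rng.random() < fractions.get(key_fn(row), 0.0)]

keys = [i % 3 for i in range(100)]  # three strata: 0, 1, and 2
sampled = sample_by(keys, lambda k: k, {0: 0.1, 1: 0.2}, seed=0)

print(2 in sampled)  # False: stratum 2 has no fraction, so it is never sampled
```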


In [3]:
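# Build a key column with three strata: 0, 1, and 2
df2 = df.select((df.id % 3).alias("key"))
# Sample about 10% of key 0 and 20% of key 1; key 2 is omitted, so it is never sampled
print(df2.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0).collect())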

[Row(key=0), Row(key=1), Row(key=1), Row(key=1), Row(key=0), Row(key=1), Row(key=1), Row(key=0), Row(key=1), Row(key=1), Row(key=1)]


In [None]:
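RDD.sample() is a transformation that returns a new RDD containing the sample. Unlike DataFrame.sample(), it requires withReplacement as its first argument.
Syntax: sample(withReplacement, fraction, seed=None)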


In [4]:
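rdd = spark.sparkContext.range(0, 100)
# Without replacement: about 10% of the elements, no duplicates
print(rdd.sample(False, 0.1, 0).collect())
# With replacement: duplicates are possible
print(rdd.sample(True, 0.3, 123).collect())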

[24, 29, 41, 64, 86]
[0, 11, 13, 14, 16, 18, 21, 23, 27, 31, 32, 32, 48, 49, 49, 53, 54, 72, 74, 77, 77, 83, 88, 91, 93, 98, 99]


In [None]:
RDD.takeSample() is an action, so be careful when you use it: it returns the sampled records to the driver's memory.
As with collect(), returning too much data can cause an out-of-memory error. Unlike sample(), it takes the number of records to return (num) rather than a fraction.
Syntax: takeSample(withReplacement, num, seed=None)
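
The difference between a fixed-size sample and a fraction-based one can be sketched in plain Python. This is a local analogy under stated assumptions, not Spark's implementation: the helper name take_sample is hypothetical, and the sketch assumes num does not exceed the data size when sampling without replacement.

```python
import random

def take_sample(data, with_replacement, num, seed=None):
    """Return exactly `num` items as a local list (hypothetical helper)."""
    rng = random.Random(seed)
    if with_replacement:
        return rng.choices(data, k=num)  # items may repeat
    return rng.sample(data, num)         # items are distinct; assumes num <= len(data)

data = list(range(100))
no_repeat = take_sample(data, False, 10, seed=0)
with_repeat = take_sample(data, True, 30, seed=123)

print(len(no_repeat), len(with_repeat))  # 10 30: the size is exact, unlike a fraction
```

Since the whole result is materialized as one list, the requested num directly bounds how much memory the caller needs.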

In [5]:
print(rdd.takeSample(False,10,0))
print(rdd.takeSample(True,30,123))

[58, 1, 96, 74, 29, 24, 32, 37, 94, 91]
[43, 65, 39, 18, 84, 86, 25, 13, 40, 21, 79, 63, 7, 32, 26, 71, 23, 61, 83, 60, 22, 35, 84, 22, 0, 88, 16, 40, 65, 84]
