## PySpark Random Sample

PySpark provides the `pyspark.sql.DataFrame.sample()`, `pyspark.sql.DataFrame.sampleBy()`, `RDD.sample()`, and `RDD.takeSample()` methods for drawing a random sample from a large dataset.

In [0]:
dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this command.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType, DoubleType
# Note: sum, max, and min imported here shadow the Python built-ins of the same name.
from pyspark.sql.functions import lit, col, expr, when, sum, avg, max, min, mean, count, udf, explode, concat_ws

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark Random Sample').getOrCreate()

#### Using sample

With a `fraction` between 0 and 1, `sample()` returns approximately that fraction of the rows. For example, 0.1 returns about 10% of the rows, but the result is not guaranteed to contain exactly 10% of the records.

In [0]:
df=spark.range(100)
print(df.sample(0.06).collect())
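
To see why the row count is only approximate, here is a minimal plain-Python sketch of Bernoulli sampling, in which each row is kept independently with probability `fraction`. It illustrates the idea only and is not Spark's implementation; `bernoulli_sample` is a hypothetical helper introduced for this example.

```python
import random
import statistics

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = range(100)
# Draw 2000 samples with different seeds and record how many rows each one kept.
sizes = sorted(len(bernoulli_sample(rows, 0.1, seed=s)) for s in range(2000))
# Smallest size, largest size, and average size across all draws.
print(sizes[0], sizes[-1], statistics.mean(sizes))
```

Individual sample sizes differ from draw to draw; only their average settles near `fraction * len(rows)`.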

#### Using seed to reproduce the same samples

Each call to `sample()` returns a different set of records. During development and testing, however, you may need to regenerate the same sample on every run so that you can compare the results with a previous run. Passing a fixed `seed` makes the sample repeatable for the same DataFrame.

In [0]:
print(df.sample(0.1,123).collect())
print(df.sample(0.1,123).collect())

#### Sample withReplacement (May contain duplicates)

Sometimes you need a random sample that can contain repeated rows. Passing `True` as the first argument (`withReplacement`) allows the same row to be selected more than once.

In [0]:
print(df.sample(True,0.3,123).collect()) # with duplicates
print(df.sample(False,0.3,123).collect()) # without duplicates
print(df.sample(0.3,123).collect()) # without duplicates
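
One way to picture sampling with replacement is to give every row an independent Poisson-distributed copy count whose mean is `fraction`, so a row can appear zero, one, or several times. The plain-Python sketch below works under that assumption; it is not Spark's code, and `poisson_draw` and `sample_with_replacement` are hypothetical helpers.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw from a Poisson(lam) distribution using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_with_replacement(rows, fraction, seed=None):
    """Repeat each row a Poisson(fraction)-distributed number of times (hypothetical helper)."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.extend([row] * poisson_draw(rng, fraction))
    return out

sample = sample_with_replacement(range(100), 0.3, seed=123)
# Total rows drawn versus distinct rows; a gap between the two means duplicates.
print(len(sample), len(set(sample)))
```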

#### Stratified sampling

You can perform stratified sampling without replacement in PySpark by using the `sampleBy()` method. It takes a sampling fraction for each stratum; if a stratum is not specified, its fraction is treated as zero.

In [0]:
df2=df.select((df.id % 3).alias('key'))
print(df2.sampleBy('key', {0: 0.1, 1: 0.2},0).collect())
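
The effect of the per-stratum fractions can be sketched in plain Python: each row is kept with the probability listed for its key, and keys missing from the dictionary fall back to 0. This is an illustration of the idea rather than Spark's implementation, and `sample_by` is a hypothetical helper.

```python
import random

def sample_by(rows, key_of, fractions, seed=None):
    """Keep each row with the probability listed for its stratum; unlisted strata get 0."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key_of(row), 0.0)]

keys = [i % 3 for i in range(100)]  # same strata as df2: 0, 1, 2
sample = sample_by(keys, key_of=lambda k: k, fractions={0: 0.1, 1: 0.2}, seed=0)
# Stratum 2 never appears because its fraction defaults to 0.
print(sorted(set(sample)))
```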

#### RDD Sample

PySpark RDDs also provide a `sample()` function for random sampling, as well as a related method, `takeSample()`, that returns the sampled elements to the driver as a Python list.

In [0]:
rdd = spark.sparkContext.range(0,100)
print(rdd.sample(False,0.1,0).collect())
print(rdd.sample(True,0.3,123).collect())

In [0]:
# RDD takeSample() is an action, so use it with care: it returns the selected sample records to driver memory.
# Returning too much data results in an out-of-memory error, similar to collect().
print(rdd.takeSample(False,10,0))

In [0]:
print(rdd.takeSample(True,30,123))
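
Unlike `sample()`, which takes a fraction and returns roughly that share of the data, `takeSample()` takes a count and returns a fixed-size list. The plain-Python sketch below mimics that behavior for illustration only; `take_sample` is a hypothetical stand-in and not Spark's implementation.

```python
import random

def take_sample(rows, with_replacement, num, seed=None):
    """Return a fixed-size list of rows (hypothetical stand-in for the idea behind takeSample)."""
    rng = random.Random(seed)
    rows = list(rows)
    if with_replacement:
        return rng.choices(rows, k=num)  # rows may repeat
    # No repeats, so the result cannot be larger than the population.
    return rng.sample(rows, num if num < len(rows) else len(rows))

print(len(take_sample(range(100), False, 10, seed=0)))   # 10
print(len(take_sample(range(100), True, 30, seed=123)))  # 30
```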

#### The end of the notebook