# Sampling and Descriptive Statistics Using the DataFrame API

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [1]:
data_path = '../Data'
file_path = data_path + '/location_temp.csv'

In [14]:
df = spark.read.format('csv').options(header=True).load(file_path)

# Sampling

In [17]:
df.count()

500000

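Sampling without replacement means each row can appear in the sample at most once. Note that `fraction` is the probability of keeping each row, not an exact size, so the sample count will be close to, but usually not exactly, 10% of the rows.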

In [24]:
df_sample = df.sample(withReplacement=False, fraction=0.1)

In [25]:
df_sample.count()

50195
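
The count above is 50195 rather than exactly 50000. A minimal plain-Python sketch of why, assuming each row is kept independently with probability `fraction`; the helper `bernoulli_sample` and the stand-in row IDs are hypothetical and only illustrate the idea, they are not Spark's implementation:

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (no replacement)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

# Stand-in row IDs, not the real dataset
rows = list(range(500_000))
sample = bernoulli_sample(rows, fraction=0.1, seed=42)

print(len(sample))                      # close to 50,000, but not necessarily equal
print(len(sample) == len(set(sample)))  # True: no row is selected twice
```

In PySpark, `df.sample` also accepts a `seed` argument, which is worth setting when the sample needs to be reproducible across runs.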

# Descriptive Statistics

In [29]:
# mean of temp for every location
df_sample.groupBy('location_id').agg({'temp_celcius': 'mean'}).show()

+-----------+------------------+
|location_id| avg(temp_celcius)|
+-----------+------------------+
|     loc196| 29.25263157894737|
|     loc226|25.152380952380952|
|     loc463|23.401960784313726|
|     loc150| 32.22857142857143|
|     loc292|            29.625|
|     loc311|24.485148514851485|
|      loc22|28.172043010752688|
|     loc351|28.292035398230087|
|     loc370|29.093457943925234|
|     loc419|29.108695652173914|
|      loc31| 25.07766990291262|
|     loc305|27.477272727272727|
|      loc82|27.119565217391305|
|      loc90|23.051546391752577|
|     loc118|23.670212765957448|
|     loc195|26.990384615384617|
|     loc208|25.737373737373737|
|      loc39|25.454545454545453|
|      loc75| 23.35042735042735|
|     loc228|             27.27|
+-----------+------------------+
only showing top 20 rows



In [31]:
# mean of temp for every location, ordered by location_id ascending (string order)
df_sample.groupBy('location_id').agg({'temp_celcius': 'mean'}).orderBy('location_id').show(10)

+-----------+------------------+
|location_id| avg(temp_celcius)|
+-----------+------------------+
|       loc0|29.641509433962263|
|       loc1|28.654205607476637|
|      loc10|             25.47|
|     loc100| 27.12121212121212|
|     loc101|24.755555555555556|
|     loc102|30.372340425531913|
|     loc103| 25.53846153846154|
|     loc104| 25.96551724137931|
|     loc105|26.586206896551722|
|     loc106|27.582417582417584|
+-----------+------------------+
only showing top 10 rows
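
The output above goes loc0, loc1, loc10, loc100 because `location_id` is a string and is sorted character by character. A small plain-Python sketch of the difference between string order and numeric order, using a made-up list of IDs and assuming every ID has the form `loc<number>`:

```python
# Made-up IDs for illustration, assumed to follow the pattern "loc<number>"
ids = ["loc10", "loc2", "loc100", "loc0", "loc1"]

# Default string sort compares character by character
lexicographic = sorted(ids)

# Sorting on the integer suffix gives the numeric order
numeric = sorted(ids, key=lambda s: int(s[len("loc"):]))

print(lexicographic)  # ['loc0', 'loc1', 'loc10', 'loc100', 'loc2']
print(numeric)        # ['loc0', 'loc1', 'loc2', 'loc10', 'loc100']
```

The string order is harmless for the comparison in this notebook, as long as both tables are ordered the same way.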



Now we can compare the per-location means of the original dataset with those of the sample.

In [32]:
# mean of original dataset
df.groupBy('location_id').agg({'temp_celcius':'mean'}).orderBy('location_id').show(10)

+-----------+-----------------+
|location_id|avg(temp_celcius)|
+-----------+-----------------+
|       loc0|           29.176|
|       loc1|           28.246|
|      loc10|           25.337|
|     loc100|           27.297|
|     loc101|           25.317|
|     loc102|           30.327|
|     loc103|           25.341|
|     loc104|           26.204|
|     loc105|           26.217|
|     loc106|           27.201|
+-----------+-----------------+
only showing top 10 rows



From the comparison, the average temperature of the original dataset for location ID loc0 is 29.176 °C.

The average temperature of the sample for location ID loc0 is about 29.642 °C, a difference of roughly 0.47 °C, so the two are fairly close.

However, we need to take note of the sample size. As the sample gets smaller, its mean can deviate more from the mean of the full dataset.
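
The effect of sample size can be illustrated with a small plain-Python simulation. This is only a sketch on synthetic data: the population values, the helper `spread_of_sample_means`, and the chosen sample sizes are assumptions for illustration, not taken from `location_temp.csv`.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic "population" of 10,000 temperatures (assumed: mean 27, std dev 5)
population = [rng.gauss(27.0, 5.0) for _ in range(10_000)]

def spread_of_sample_means(sample_size, trials=200):
    """Draw `trials` samples without replacement and return the
    standard deviation of their means."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Smaller samples give sample means that scatter more widely
for n in (10, 100, 1000):
    print(n, round(spread_of_sample_means(n), 3))
```

The spread of the sample mean shrinks roughly in proportion to 1/sqrt(n), so a sample ten times smaller gives means that scatter about three times more widely around the true value.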