<img src="uva_seal.png">  

## Machine Learning with MLlib - Basic Statistics


### University of Virginia
### DS 5559: Big Data Analytics
### Last Updated: Dec 12, 2019

---  


### SOURCE:

1. Learning Spark  
2. Spark Documentation  
   - https://spark.apache.org/docs/latest/mllib-statistics.html
   - https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs


### OBJECTIVES
-  Introduction to basic statistics capabilities  

### CONCEPTS

- Summary statistics
- Correlations
- Stratified sampling
- Hypothesis testing
- Random data generation
- Kernel density estimation


---

In this notebook, we provide a short summary of the basic statistics functionality provided for RDDs.  Much of this is supported by the `pyspark.mllib.stat` library.  

**Summary Statistics**  

`colStats()` computes column-wise summary statistics for an RDD of vectors. It is similar to calling `summary()` on a data frame in R.

#### Summary using `colStats()`

In [1]:
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([Vectors.dense([2, 0, 0, -2]),
                      Vectors.dense([4, 5, 0,  3]),
                      Vectors.dense([6, 7, 0,  8])])

cStats = Statistics.colStats(rdd)

In [2]:
cStats.mean()

array([4., 4., 0., 3.])

In [3]:
cStats.variance()

array([ 4., 13.,  0., 25.])

In [4]:
cStats.count()

3

In [5]:
cStats.numNonzeros()

array([3., 2., 0., 3.])

In [6]:
cStats.max()

array([6., 7., 0., 8.])

In [7]:
cStats.min()

array([ 2.,  0.,  0., -2.])
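
As a sanity check on what `colStats()` reports, the same columns can be summarized in plain Python. This is only a cross-check with the standard library, not how Spark computes the values; it assumes the same three rows as above. Note that `statistics.variance` is the sample variance (it divides by n - 1).

```python
from statistics import mean, variance

# Same three rows as the RDD above, as plain Python lists.
rows = [
    [2.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
]

# Transpose rows into columns so each statistic is computed per column.
columns = list(zip(*rows))

col_means = [mean(col) for col in columns]
# Sample variance: sum of squared deviations divided by (n - 1).
col_variances = [variance(col) for col in columns]
# Count of entries that are not zero in each column.
col_nonzeros = [sum(1 for v in col if v != 0) for col in columns]

print(col_means)
print(col_variances)
print(col_nonzeros)
```

The three lists agree with the `mean()`, `variance()`, and `numNonzeros()` arrays shown above, which indicates that `variance()` reports the sample variance.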

#### Correlation  
`Statistics.corr()` currently supports Pearson and Spearman correlation. Its `method` parameter takes either `'pearson'` (the default) or `'spearman'`.

In [8]:
from pyspark.mllib.stat import Statistics

seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.0, 5.0])
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])

Statistics.corr(seriesX,seriesY)

0.8500286768773001
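
To make the two methods concrete, here is a minimal plain-Python sketch of both coefficients on the same two series. The helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical, and giving tied values the average of their ranks is an assumption of this sketch. Spearman correlation is simply Pearson correlation applied to the ranks.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

series_x = [1.0, 2.0, 3.0, 3.0, 5.0]
series_y = [11.0, 22.0, 33.0, 33.0, 555.0]

print(pearson(series_x, series_y))
print(spearman(series_x, series_y))
```

The outlier 555.0 pulls the Pearson value below 1, while the Spearman value is unaffected because the two series have identical rank orderings.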

#### Stratified Sampling

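In stratified sampling, each key acts as a stratum label and each value as a record belonging to that stratum. `sampleByKey()` takes a per-key sampling fraction and returns an approximate sample, so the number of records drawn per key varies from run to run.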

In [31]:
# an RDD of any key value pairs
data = sc.parallelize([(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')])

# specify the desired sampling fraction for each key as a dictionary
fractions = {1: 0.1, 2: 0.6, 3: 0.3}

approxSample = data.sampleByKey(False, fractions)

approxSample.collect()

[(2, 'c'), (2, 'd'), (3, 'f')]
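
Conceptually, sampling by key without replacement resembles an independent coin flip per record, weighted by the fraction for that record's key. The function below is a hypothetical plain-Python stand-in written under that assumption; it is not Spark's implementation, and `sample_by_key` and its `seed` parameter are names invented for this sketch.

```python
import random

def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair with probability fractions[key]."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so a pair survives with
    # probability equal to its key's fraction.
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

pairs = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
fractions = {1: 0.1, 2: 0.6, 3: 0.3}

sample = sample_by_key(pairs, fractions, seed=7)
print(sample)
```

Because every record is decided independently, the realized share for a key only matches its fraction on average, which is why a small input like this one can return more or fewer records than the fractions suggest.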

#### Hypothesis testing

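- Pearson’s Goodness of Fit (GoF) Test

When given a single vector of observed frequencies, `Statistics.chiSqTest()` runs a goodness-of-fit test. If a second vector of expected frequencies is not supplied as a parameter, the test runs against a uniform distribution. The frequencies below are illustrative.

In [32]:
vec = Vectors.dense([0.1, 0.15, 0.2, 0.3, 0.25])  # observed frequencies
goodnessOfFitTestResult = Statistics.chiSqTest(vec)

- Pearson’s Independence Test

When given a contingency matrix, `Statistics.chiSqTest()` conducts Pearson's independence test on it. The 3 x 2 matrix below is illustrative.

In [12]:
from pyspark.mllib.linalg import Matrices
mat = Matrices.dense(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])  # column-major contingency matrix
independenceTestResult = Statistics.chiSqTest(mat)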
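
The two chi-squared statistics can be worked by hand, which helps in reading the test results. The sketch below computes only the statistic and degrees of freedom in plain Python (the p-value needs a chi-squared CDF, which is left out). The function names and the sample inputs are illustrative assumptions, and rescaling the expected counts to the observed total is a choice made for this sketch.

```python
def gof_statistic(observed, expected=None):
    """Chi-squared goodness-of-fit statistic and degrees of freedom."""
    total = sum(observed)
    if expected is None:
        # No expected vector: compare against a uniform distribution.
        expected = [total / len(observed)] * len(observed)
    else:
        # Rescale expected values so they sum to the observed total.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

def independence_statistic(table):
    """Chi-squared independence statistic for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

gof_stat, gof_dof = gof_statistic([0.1, 0.15, 0.2, 0.3, 0.25])
ind_stat, ind_dof = independence_statistic([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

print(gof_stat, gof_dof)
print(ind_stat, ind_dof)
```

A small statistic relative to the degrees of freedom means the observed values sit close to what the null hypothesis predicts.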

- Kolmogorov-Smirnov Test  

`spark.mllib` provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test for equality of probability distributions.

In [None]:
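parallelData = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])  # illustrative sample
Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)  # test against N(0, 1)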
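
The KS statistic is the largest gap between the empirical CDF of the sample and the theoretical CDF. A minimal plain-Python sketch of the 1-sample statistic against a normal distribution follows; `normal_cdf` and `ks_statistic` are hypothetical helper names, the sample values are illustrative, and the p-value is not computed.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and the given CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so check the gap on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

sample = [0.1, 0.15, 0.2, 0.3, 0.25]
d_stat = ks_statistic(sample, normal_cdf)
print(d_stat)
```

For this sample the largest gap occurs at the smallest observation: just under half of a standard normal's mass lies below 0.1, yet no sample points do.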

#### Random data generation

`spark.mllib` supports generating random RDDs with i.i.d. values drawn from a given distribution: 
- uniform
- standard normal
- Poisson

In [None]:
from pyspark.mllib.random import RandomRDDs
RandomRDDs.normalRDD(sc, 1000000, 10)  # 1,000,000 standard normal values in 10 partitions
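
Since `normalRDD` draws from the standard normal distribution, values from N(mu, sigma^2) are obtained by scaling and shifting each draw. The sketch below shows that transformation in plain Python, with the standard library's `random` module standing in for the RDD; the seed, sample size, `mu`, and `sigma` are illustrative assumptions.

```python
import random
from statistics import fmean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 20_000

# i.i.d. draws from the standard normal distribution N(0, 1).
standard = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Scale by sigma and shift by mu to obtain draws from N(mu, sigma^2).
mu, sigma = 1.0, 2.0
shifted = [mu + sigma * z for z in standard]

print(fmean(shifted), stdev(shifted))
```

On an RDD the same transformation would be a `map` over the generated values.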

#### Kernel density estimation

KDE is a technique useful for visualizing empirical probability distributions without requiring assumptions about the particular distribution that the observed samples are drawn from. 


In [13]:
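from pyspark.mllib.stat import KernelDensity

# KernelDensity expects an RDD of numeric samples, not the key-value RDD `data` above
sample = sc.parallelize([1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0])  # illustrative values

kd = KernelDensity()
kd.setSample(sample)
kd.setBandwidth(3.0)

# estimate the density at a few evaluation points
densities = kd.estimate([-1.0, 2.0, 5.0])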
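
To see what a kernel density estimate computes, here is a minimal plain-Python sketch using a Gaussian kernel, where the bandwidth plays the role of the kernel's standard deviation. The function name `gaussian_kde`, the sample values, and the evaluation points are illustrative assumptions; this is a conceptual sketch rather than Spark's implementation.

```python
from math import exp, pi, sqrt

def gaussian_kde(sample, bandwidth, points):
    """Average of Gaussian bumps centered on each sample value."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * sqrt(2.0 * pi))
    return [
        norm * sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        for x in points
    ]

sample = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0]
points = [-1.0, 2.0, 5.0]

densities = gaussian_kde(sample, 3.0, points)
print(densities)
```

A larger bandwidth widens each bump and smooths the estimate; a smaller one follows the individual samples more closely.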

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) Create an RDD containing vectors with 3 columns.  Using `colStats()`, compute the mean and number of nonzero values in each column.

2) Compute the Spearman correlation between the vectors from (1).

3) Generate 10,000 independent samples from a Poisson distribution with mean 2.