# Chapter 5
## Powerful exploratory data analysis with MLlib

Exploratory data analysis is usually the first step before any modeling. In this section, we use Spark's MLlib to compute summary statistics, measure correlations between features with the Pearson and Spearman coefficients, and run a chi-squared hypothesis test.

# Computing summary statistics with MLlib

In [9]:
raw_data = sc.textFile("./kddcup.data.gz")

In [10]:
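# Split each CSV record into its fields.
csv = raw_data.map(lambda x: x.split(","))
# Keep only the first field (the duration) as a one-element numeric row.
duration = csv.map(lambda x: [int(x[0])])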

In [12]:
from pyspark.mllib.stat import Statistics
summary = Statistics.colStats(duration)
summary.mean()[0]

47.97930249928637

In [14]:
from math import sqrt 
sqrt(summary.variance()[0])  # std. dev.

707.746472305374

In [15]:
summary.max()

array([58329.])

In [16]:
summary.min()

array([0.])
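
To make explicit what these summary calls report, here is a minimal plain-Python sketch on a hypothetical five-value column. The values are illustrative and are not taken from the KDD Cup data. It assumes that `variance()` is the sample variance (denominator n - 1), which is how the MLlib documentation describes it.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy durations; not taken from the KDD Cup data.
durations = [0, 0, 2, 5, 13]

col_mean = mean(durations)           # arithmetic mean
col_std = sqrt(variance(durations))  # sample variance uses the n - 1 denominator
col_max = max(durations)
col_min = min(durations)

print(col_mean, col_std, col_max, col_min)
```

On a real data set these quantities are computed by `colStats` in a distributed fashion; the sketch only illustrates the definitions.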

# Using Pearson and Spearman to discover correlations

In [18]:
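# Columns 0, 4, and 5 hold duration, src_bytes, and dst_bytes; cast them to floats.
metrics = csv.map(lambda x: [float(x[0]), float(x[4]), float(x[5])])
Statistics.corr(metrics, method="spearman")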

array([[ 1.        ,  0.01419628,  0.29918926],
       [ 0.01419628,  1.        , -0.16793059],
       [ 0.29918926, -0.16793059,  1.        ]])

In [None]:
Statistics.corr(metrics, method="pearson")
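
The difference between the two methods can be sketched in plain Python: Spearman's coefficient is Pearson's coefficient applied to the ranks of the values, so it captures any monotonic relationship rather than only a linear one. The helper names (`pearson`, `average_ranks`, `spearman`) and the toy data below are hypothetical and are not part of MLlib.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical toy data: ys grows monotonically but not linearly with xs.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]
print(pearson(xs, ys), spearman(xs, ys))
```

For this toy data the Spearman coefficient is 1 (up to floating-point rounding) because the relationship is perfectly monotonic, while the Pearson coefficient is lower because the relationship is not linear.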

# Testing your hypotheses on large datasets

In [19]:
from pyspark.mllib.linalg import Vectors

In [21]:
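# With no expected vector supplied, the test compares the observed values against a uniform distribution.
visitors_freq = Vectors.dense(0.13, 0.61, 0.8, 0.5, 0.3)
print(Statistics.chiSqTest(visitors_freq))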

Chi squared test summary:
method: pearson
degrees of freedom = 4 
statistic = 0.5852136752136753 
pValue = 0.9646925263439344 
No presumption against null hypothesis: observed follows the same distribution as expected..
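
The reported statistic can be reproduced by hand. The following is a minimal plain-Python sketch, assuming the goodness-of-fit test compares the observed vector against a uniform expectation with the same total. The helper `chi2_sf_even_dof` is hypothetical; it uses a closed form for the chi-squared survival function that holds only for an even number of degrees of freedom, which happens to apply here.

```python
from math import exp, factorial

observed = [0.13, 0.61, 0.8, 0.5, 0.3]

# Uniform expectation: every category gets the same share of the total.
expected = [sum(observed) / len(observed)] * len(observed)

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1

def chi2_sf_even_dof(x, dof):
    """P(X >= x) for a chi-squared variable; valid only for even dof."""
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(dof // 2))

p_value = chi2_sf_even_dof(statistic, dof)
print(statistic, dof, p_value)
```

The large p-value means the data give no reason to reject the null hypothesis that the observed values follow the uniform distribution, which is what the summary above states.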
