<DIV ALIGN=CENTER>

# Introduction to Spark: Machine Learning
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

In this IPython Notebook, we explore using Spark to perform basic statistical analysis and machine learning. For part of this analysis, we will use the airline data, which has been stored in files that are accessible from within our Spark cluster.

-----

### Initialization

In this class, we have a dedicated Spark cluster running to allow
students to explore Spark from within our IPython Notebook environment.
Because our Spark cluster has limited resources, we need to manage them
carefully. In particular, we need to ensure that any SparkContext
previously used by this Jupyter Server is properly released before
starting a new one. After this, we initialize a new SparkContext so that
this dockerized IPython Notebook can interact with the Spark cluster.

----- 

In [1]:
# Release the existing SparkContext, if one is defined.
try:
    sc
except NameError:
    # No SparkContext has been created in this session yet.
    pass
else:
    sc.stop()

# Now handle initial import statements
from pyspark import SparkConf, SparkContext

# Create new Spark Configuration (port numbers might need to be adjusted from defaults.)
myconf = SparkConf()
myconf.setMaster('local[*]')
myconf.setAppName("INFO490 SP16 W14-NB3: Professor Brunner")
myconf.set('spark.executor.memory', '1g')

# Create and initialize a new Spark Context
sc = SparkContext(conf=myconf)

# Display Spark version information, which also verifies SparkContext is active
print("\nSpark version: {0}".format(sc.version))


Spark version: 1.6.0


-----

### Data Processing

Previously in this Notebook, we have used Spark to create simple RDDs
that demonstrated Spark transformations and actions on small data. Now
we will change approaches and analyze the airline data, starting with
the single 2001 flight data file. We create a new RDD by reading in the
data as a text file, split each line into fields, keep the columns we
need, and drop the header row and any row with a missing (`NA`) value.
Because transformations are lazy, the RDD is only computed when we
count its rows. We then display the first few rows of data by using the
`take` method. Finally, the built-in `help` function can be used to see
the list of supported RDD methods.

-----



In [2]:
filename = '/home/data_scientist/data/2001/2001-1.csv'

text_file = sc.textFile(filename)

col_data = text_file.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])) \
            .filter(lambda line: 'Year' not in line)

cols = col_data.filter(lambda line: 'NA' not in line)

fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), int(p[3]),
                          int(p[4]), int(p[5]), p[6], p[7], int(p[8])))
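
The two `filter` calls above run on tuples of fields, not on raw strings, so `'Year' not in line` tests whether any whole field equals `'Year'`. The plain-Python sketch below mirrors the same chain without Spark. The input lines are hypothetical: the header names, the `"x"` placeholders, and the row with `NA` values are assumptions made for illustration; only the selected column positions and the one complete row come from this Notebook.

```python
# Plain-Python sketch of the map/filter chain above; no Spark required.
# Hypothetical rows: 19 comma-separated fields each, with "x" as a
# placeholder in the positions the Notebook does not select.
KEEP = (0, 1, 2, 4, 14, 15, 16, 17, 18)

def make_line(values):
    """Build a 19-field CSV line with the given values at the KEEP positions."""
    row = ["x"] * 19
    for pos, val in zip(KEEP, values):
        row[pos] = str(val)
    return ",".join(row)

lines = [
    # Header row (column names are assumed for this sketch)
    make_line(["Year", "Month", "DayofMonth", "DepTime", "ArrDelay",
               "DepDelay", "Origin", "Dest", "Distance"]),
    # A complete row, taken from the fields.take(5) output
    make_line([2001, 1, 17, 1806, -3, -4, "BWI", "CLT", 361]),
    # A made-up row with missing values
    make_line([2001, 1, 18, "NA", "NA", "NA", "BWI", "CLT", 361]),
]

def select(line):
    """Split a line on commas and keep only the KEEP positions."""
    p = line.split(",")
    return tuple(p[i] for i in KEEP)

selected = [select(l) for l in lines]

# Tuple membership compares whole elements, so these two steps drop the
# header row and any row with a field exactly equal to 'NA'.
no_header = [t for t in selected if "Year" not in t]
complete = [t for t in no_header if "NA" not in t]

# Convert the numeric fields, leaving the two airport codes as strings.
parsed = [
    (int(t[0]), int(t[1]), int(t[2]), int(t[3]), int(t[4]), int(t[5]),
     t[6], t[7], int(t[8]))
    for t in complete
]
print(parsed)
```

Filtering out `NA` rows before calling `int` matters: converting first would raise a `ValueError` on the missing values.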

In [3]:
# Should be 480106 if everything works correctly
fields.count()

480106

In [4]:
fields.take(5)

[(2001, 1, 17, 1806, -3, -4, 'BWI', 'CLT', 361),
 (2001, 1, 18, 1805, 4, -5, 'BWI', 'CLT', 361),
 (2001, 1, 19, 1821, 23, 11, 'BWI', 'CLT', 361),
 (2001, 1, 20, 1807, 10, -3, 'BWI', 'CLT', 361),
 (2001, 1, 21, 1810, 20, 0, 'BWI', 'CLT', 361)]

-----

## Spark Statistics


-----

In [5]:
from pyspark.mllib.stat import Statistics

sdt = fields.map(lambda p: (p[2], p[3], p[4], p[5], p[8]))

summary = Statistics.colStats(sdt)

mus = summary.mean()
mns = summary.min()
mxs = summary.max()

vrs = summary.variance()
nnzs = summary.numNonzeros()

In [6]:
# Column labels (named col_names to avoid shadowing the cols RDD defined earlier)
col_names = ['Day', 'Dep. Time', 'Arr. Delay', 'Dep. Delay', 'Distance']

# Print out header
print('{0:>20s}{1:>12s}{2:>8s}{3:>10s}{4:>12s}'\
      .format('Mean', 'Variance', 'Min', 'Max', 'Non Zeroes'))
print(65*'-')

# Print out summary statistics
for idx, (m, v, mn, mx, n) in enumerate(zip(mus, vrs, mns, mxs, nnzs)):
    print('{5:10s}{0:10.2f}{1:12.2f}{2:8.2f}{3:10.2f}{4:12d}'\
          .format(m, v, mn, mx, int(n), col_names[idx]))

                Mean    Variance     Min       Max  Non Zeroes
-----------------------------------------------------------------
Day            16.01       79.87    1.00     31.00      480106
Dep. Time    1359.66   237399.85    1.00   2400.00      480106
Arr. Delay      6.38      964.02  -80.00   1688.00      461157
Dep. Delay      8.78      782.11  -59.00   1692.00      393503
Distance      716.99   323369.33   21.00   4962.00      480106
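
For reference, the five quantities in this table can be computed column by column in plain Python. The sketch below is not Spark's implementation. It assumes that `colStats` reports the unbiased sample variance (divisor n - 1), and it uses only the five rows shown by `fields.take(5)`, so its numbers will differ from the full-data table above.

```python
from statistics import mean, variance

# (Day, Dep. Time, Arr. Delay, Dep. Delay, Distance) for the five rows
# shown by fields.take(5); the same column selection as the sdt RDD.
rows = [
    (17, 1806, -3, -4, 361),
    (18, 1805, 4, -5, 361),
    (19, 1821, 23, 11, 361),
    (20, 1807, 10, -3, 361),
    (21, 1810, 20, 0, 361),
]
names = ["Day", "Dep. Time", "Arr. Delay", "Dep. Delay", "Distance"]

# Transpose the rows into columns, then summarize each column.
columns = list(zip(*rows))
summary = {}
for name, col in zip(names, columns):
    summary[name] = {
        "mean": mean(col),
        "variance": variance(col),  # sample variance, divisor n - 1 (assumed)
        "min": min(col),
        "max": max(col),
        "nonzeros": sum(1 for v in col if v != 0),
    }

for name in names:
    s = summary[name]
    print("{0:12s}{1:10.2f}{2:12.2f}{3:8.2f}{4:10.2f}{5:6d}".format(
        name, s["mean"], s["variance"], s["min"], s["max"], s["nonzeros"]))
```

The non-zero count explains why the `Arr. Delay` and `Dep. Delay` rows in the table report fewer than 480106 entries: flights with a delay of exactly zero are not counted.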


-----

### Correlations


-----

In [7]:
# Demonstrate Correlation Measurements

# Sample Data
x = sc.parallelize([0, 1, 2])
y = sc.parallelize([1, 2, 4])
z = sc.parallelize([2, 1, 0])

print('x = ', x.collect())
print('y = ', y.collect())
print('z = ', z.collect())

print('\nPearson Correlation Tests')
print(25*'-')
print('x corr x = {0:+5.3f}'\
      .format(Statistics.corr(x, x, method='pearson')))

print('x corr y = {0:+5.3f}'\
      .format(Statistics.corr(x, y, method='pearson')))

print('x corr z = {0:+5.3f}'\
      .format(Statistics.corr(x, z, method='pearson')))

x =  [0, 1, 2]
y =  [1, 2, 4]
z =  [2, 1, 0]

Pearson Correlation Tests
-------------------------
x corr x = +1.000
x corr y = +0.982
x corr z = -1.000
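
The Pearson coefficient is the covariance of the two series divided by the product of their standard deviations. A minimal plain-Python sketch, using the same `x`, `y`, and `z` values as the cell above (as ordinary lists rather than RDDs), makes the definition concrete:

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # The 1/n factors cancel between numerator and denominator,
    # so plain sums of deviations are sufficient.
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

x = [0, 1, 2]
y = [1, 2, 4]
z = [2, 1, 0]

for label, other in (("x", x), ("y", y), ("z", z)):
    print("x corr {0} = {1:+5.3f}".format(label, pearson(x, other)))
```

Here `y` increases with `x` but not along a straight line, which is why its coefficient is slightly below one, while `z` is an exact decreasing linear function of `x`.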


In [8]:
# Set print precision of matrices
import numpy as np
np.set_printoptions(precision=3)

# Compute correlation of three columns in RDD
cd = sdt.map(lambda p: (p[1], p[2], p[3]))

print('Departure Time, Arrival Delay, Departure Delay')

print('\nPearson Correlation Matrix:')
print(Statistics.corr(cd, method='pearson'))

print('\nSpearman Correlation Matrix:')
print(Statistics.corr(cd, method='spearman'))

Departure Time, Arrival Delay, Departure Delay

Pearson Correlation Matrix:
[[ 1.     0.134  0.167]
 [ 0.134  1.     0.904]
 [ 0.167  0.904  1.   ]]

Spearman Correlation Matrix:
[[ 1.     0.109  0.173]
 [ 0.109  1.     0.616]
 [ 0.173  0.616  1.   ]]
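
The Spearman coefficient is the Pearson coefficient computed on ranks instead of raw values, so it measures monotonic rather than strictly linear association. The sketch below is a plain-Python illustration, not Spark's implementation; giving tied values their average rank is an assumption (a common convention), and the helper names are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

x = [0, 1, 2]
y = [1, 2, 4]
print("pearson  = {0:+5.3f}".format(pearson(x, y)))
print("spearman = {0:+5.3f}".format(spearman(x, y)))
```

For these small lists, `y` always increases with `x`, so the ranks agree perfectly even though the relationship is not linear. In the airline matrices above the opposite happens for the two delay columns: the Spearman value (0.616) is lower than the Pearson value (0.904), which suggests that a small number of very large delays contributes strongly to the linear correlation.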


-----

### Random Data and Sampling

-----

In [9]:
from pyspark.mllib.random import RandomRDDs

ud = RandomRDDs.uniformRDD(sc, 1000, seed=23)

nd = RandomRDDs.normalRDD(sc, 1000, seed=23)

pd = RandomRDDs.poissonRDD(sc, mean=2.0, size=1000, seed=23)

In [10]:
print('Uniform Distribution Statistics\n', ud.stats())

Uniform Distribution Statistics
 (count: 1000, mean: 0.495907509202282, stdev: 0.298581265498, max: 0.99957542053, min: 0.000220626980565)


In [11]:
print('Normal Distribution Statistics\n', nd.stats())

Normal Distribution Statistics
 (count: 1000, mean: -0.01951879687296531, stdev: 0.936332160006, max: 2.76048478382, min: -3.10768336984)


In [12]:
print('Poisson Distribution Statistics\n', pd.stats())

Poisson Distribution Statistics
 (count: 1000, mean: 2.0089999999999995, stdev: 1.45771019068, max: 9.0, min: 0.0)
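
As a sanity check, the sample statistics printed above can be compared with the theoretical moments of each distribution: a uniform distribution on [0, 1) has mean 1/2 and standard deviation sqrt(1/12), a standard normal has mean 0 and standard deviation 1, and a Poisson distribution with mean 2 has standard deviation sqrt(2). The short sketch below does this comparison; the observed values are copied from the three outputs above.

```python
from math import sqrt

# Theoretical (mean, stdev) for each distribution.
theory = {
    "uniform": (0.5, sqrt(1.0 / 12.0)),
    "normal": (0.0, 1.0),
    "poisson": (2.0, sqrt(2.0)),
}

# Observed (mean, stdev), copied from the stats() outputs above.
observed = {
    "uniform": (0.495907509202282, 0.298581265498),
    "normal": (-0.01951879687296531, 0.936332160006),
    "poisson": (2.0089999999999995, 1.45771019068),
}

# Absolute differences between observed and theoretical values.
differences = {}
for name in theory:
    t_mean, t_std = theory[name]
    o_mean, o_std = observed[name]
    differences[name] = (abs(o_mean - t_mean), abs(o_std - t_std))
    print("{0:8s} mean diff = {1:.4f}  stdev diff = {2:.4f}".format(
        name, differences[name][0], differences[name][1]))
```

With only 1000 draws per distribution, differences of a few hundredths are expected.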


In [13]:
# Sample without replacement

frac = 0.25

ds = nd.sample(False, frac)
print(ds.stats())

(count: 280, mean: 0.015385485912264357, stdev: 1.00131906706, max: 2.76048478382, min: -2.98699042684)


In [14]:
# Sample with replacement
ds = nd.sample(True, frac)
print(ds.stats())

(count: 254, mean: -0.11978375730478891, stdev: 0.884613297735, max: 2.76048478382, min: -3.10768336984)
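
Neither sample contains exactly 250 elements (the counts above are 280 and 254), because `frac` is an expected proportion rather than an exact sample size. A rough mental model, stated here as an assumption about the mechanism rather than a description of Spark's code, is that sampling without replacement keeps each element independently with probability `frac`, while sampling with replacement emits a Poisson-distributed number of copies of each element with mean `frac`. The plain-Python sketch below follows that model; all function names are hypothetical.

```python
import math
import random

def sample_without_replacement(items, frac, rng):
    """Keep each item independently with probability frac."""
    return [v for v in items if rng.random() < frac]

def poisson_draw(mean, rng):
    """Draw one Poisson-distributed integer using Knuth's method."""
    limit = math.exp(-mean)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_with_replacement(items, frac, rng):
    """Emit each item a Poisson(frac) number of times (possibly zero)."""
    out = []
    for v in items:
        out.extend([v] * poisson_draw(frac, rng))
    return out

rng = random.Random(23)        # fixed seed so the sketch is repeatable
items = list(range(1000))
frac = 0.25

without = sample_without_replacement(items, frac, rng)
with_repl = sample_with_replacement(items, frac, rng)

print("without replacement:", len(without), "items")
print("with replacement:   ", len(with_repl), "items,",
      len(set(with_repl)), "distinct")
```

Under this model both sample sizes scatter around `frac * len(items)`, and only the with-replacement sample can contain the same element more than once.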


In [15]:
a = [0, 1, 2, 3, 4, 5]
print(a[0])
print(a[1:])

0
[1, 2, 3, 4, 5]


-----

## Machine Learning

-----

### Linear Modeling

-----

In [16]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.regression import LinearRegressionModel

# Minimum departure delay (in minutes)
min_delay = 5.
# Label: arrival delay (p[4]); single feature: departure delay (p[5])
data = fields.filter(lambda p: p[5] > min_delay).map(lambda p: LabeledPoint(p[4], [p[5]]))

In [17]:
data.take(5)

[LabeledPoint(23.0, [11.0]),
 LabeledPoint(18.0, [20.0]),
 LabeledPoint(96.0, [100.0]),
 LabeledPoint(20.0, [17.0]),
 LabeledPoint(87.0, [97.0])]

In [18]:
ln_model = LinearRegressionWithSGD.train(data, intercept=False, iterations=100, step=0.001)

In [19]:
vnp = data.map(lambda lp: (lp.label, float(ln_model.predict(lp.features))))

In [20]:
vnp.take(5)

[(23.0, 10.730386646572505),
 (18.0, 19.5097939028591),
 (96.0, 97.54896951429551),
 (20.0, 16.583324817430235),
 (87.0, 94.62250042886664)]

In [21]:
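from pyspark.mllib.evaluation import RegressionMetrics

# Note: RegressionMetrics expects (prediction, observation) pairs, but vnp
# holds (label, prediction). RMSE, MSE, and MAE are symmetric in the two
# values; r2 and explained variance are not, so read those two with care.
tm = RegressionMetrics(vnp)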

print('RMSE = {0:5.1f}'.format(tm.rootMeanSquaredError))
print('MSE = {0:5.1f}'.format(tm.meanSquaredError))
print('MAE = {0:5.1f}'.format(tm.meanAbsoluteError))
print('r2 = {0:5.1f}'.format(tm.r2))
print('EV = {0:5.1f}'.format(tm.explainedVariance))

RMSE =  15.4
MSE = 238.1
MAE =  10.8
r2 =   0.9
EV = 2014.8
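
These metrics are simple functions of the (prediction, observation) pairs. The sketch below recomputes MSE, RMSE, MAE, and r2 in plain Python for the five pairs shown by `vnp.take(5)`, using the standard textbook definitions (assumed, not verified, to match MLlib's), with r2 = 1 - SS_res / SS_tot. It also shows that swapping the roles of prediction and observation leaves the error metrics unchanged but alters r2. Because only five pairs are used, the values differ from the full-data output above.

```python
from math import sqrt

# (label, prediction) pairs copied from the vnp.take(5) output.
pairs = [
    (23.0, 10.730386646572505),
    (18.0, 19.5097939028591),
    (96.0, 97.54896951429551),
    (20.0, 16.583324817430235),
    (87.0, 94.62250042886664),
]
labels = [p[0] for p in pairs]
preds = [p[1] for p in pairs]

def regression_metrics(predictions, observations):
    """Standard regression metrics for paired predictions and observations."""
    n = len(observations)
    errors = [o - p for p, o in zip(predictions, observations)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_obs = sum(observations) / n
    # Total sum of squares is taken around the mean of the observations.
    ss_tot = sum((o - mean_obs) ** 2 for o in observations)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MSE": mse, "RMSE": sqrt(mse), "MAE": mae, "r2": r2}

correct = regression_metrics(preds, labels)   # intended roles
swapped = regression_metrics(labels, preds)   # roles reversed

for name in ("RMSE", "MSE", "MAE", "r2"):
    print("{0:5s} correct = {1:8.3f}   swapped = {2:8.3f}".format(
        name, correct[name], swapped[name]))
```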


In [22]:
print(ln_model)

(weights=[0.975489695143], intercept=0.0)
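
With `intercept=False` the model is a line through the origin, and the least-squares weight has a closed form: w = sum(x*y) / sum(x*x). The sketch below applies that formula to the five labeled points shown by `data.take(5)` and checks that plain full-batch gradient descent converges to the same value. This is an illustration only: MLlib's SGD uses its own update schedule, the step size here was chosen for these five points, and five points will not reproduce the weight fitted on the full data set.

```python
# (label, feature) pairs copied from the data.take(5) output:
# label = arrival delay, feature = departure delay.
points = [(23.0, 11.0), (18.0, 20.0), (96.0, 100.0), (20.0, 17.0), (87.0, 97.0)]

# Closed-form least-squares weight for a line through the origin.
sum_xy = sum(y * x for y, x in points)
sum_xx = sum(x * x for _, x in points)
w_closed = sum_xy / sum_xx

# Plain full-batch gradient descent on the mean squared error.
w = 0.0
step = 1e-4          # small, because the features are as large as 100
n = len(points)
for _ in range(100):
    grad = (2.0 / n) * sum(x * (w * x - y) for y, x in points)
    w -= step * grad

print("closed form      w = {0:.6f}".format(w_closed))
print("gradient descent w = {0:.6f}".format(w))
```

A step size that is too large makes the iteration diverge when the feature values are large, which is one reason the training call above uses a small `step`.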


-----

### Random Forest

-----

In [23]:
from pyspark.mllib.tree import RandomForest

# Train a single-tree forest; the remaining parameters keep their defaults.
rf_model = RandomForest.trainRegressor(data, categoricalFeaturesInfo={}, numTrees=1)

# Options to explore later:
#   numTrees=25, featureSubsetStrategy="auto",
#   impurity='variance', maxDepth=10, maxBins=32


In [24]:
print(rf_model.toDebugString())

TreeEnsembleModel regressor with 1 trees

  Tree 0:
    If (feature 0 <= 74.0)
     If (feature 0 <= 30.0)
      If (feature 0 <= 16.0)
       If (feature 0 <= 11.0)
        Predict: 5.726113904806455
       Else (feature 0 > 11.0)
        Predict: 11.548860895202358
      Else (feature 0 > 16.0)
       If (feature 0 <= 23.0)
        Predict: 17.536144237834172
       Else (feature 0 > 23.0)
        Predict: 24.667233253496097
     Else (feature 0 > 30.0)
      If (feature 0 <= 50.0)
       If (feature 0 <= 40.0)
        Predict: 32.78938220887976
       Else (feature 0 > 40.0)
        Predict: 42.70787648243455
      Else (feature 0 > 50.0)
       If (feature 0 <= 64.0)
        Predict: 54.16448238547553
       Else (feature 0 > 64.0)
        Predict: 66.7016348773842
    Else (feature 0 > 74.0)
     If (feature 0 <= 135.0)
      If (feature 0 <= 103.0)
       If (feature 0 <= 86.0)
        Predict: 77.87184661957619
       Else (feature 0 > 86.0)
        Predict: 93.02047162477325
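
Each path through the debug string ends in a constant prediction, so a regression tree on a single feature is a step function of that feature. The sketch below encodes only the `feature 0 <= 74.0` branch, which is printed in full above. The output for larger values is cut off, so the hypothetical `tree_predict` helper returns `None` there rather than guessing.

```python
from bisect import bisect_left

# Upper bounds of the eight leaves under "feature 0 <= 74.0", and the
# matching leaf predictions, copied from the debug string above.
thresholds = [11.0, 16.0, 23.0, 30.0, 40.0, 50.0, 64.0, 74.0]
leaves = [
    5.726113904806455,
    11.548860895202358,
    17.536144237834172,
    24.667233253496097,
    32.78938220887976,
    42.70787648243455,
    54.16448238547553,
    66.7016348773842,
]

def tree_predict(dep_delay):
    """Predicted arrival delay for a departure delay, or None beyond 74."""
    # bisect_left finds the first threshold that is >= dep_delay, which
    # matches the tree's "feature 0 <= threshold" tests.
    idx = bisect_left(thresholds, dep_delay)
    if idx == len(thresholds):
        return None  # this part of the tree is not fully shown above
    return leaves[idx]

for delay in (8.0, 16.0, 20.0, 74.0, 100.0):
    print(delay, "->", tree_predict(delay))
```

Every departure delay inside one interval receives the same predicted arrival delay. That coarseness is consistent with the single tree having larger errors than the linear model on these data, and averaging more or deeper trees gives a finer step function.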
  

In [25]:
pr = rf_model.predict(data.map(lambda x: x.features))
pnl = data.map(lambda lp: lp.label).zip(pr)

In [26]:
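# Note: pnl holds (label, prediction) pairs, while RegressionMetrics expects
# (prediction, observation). RMSE, MSE, and MAE are unaffected by the order;
# r2 and explained variance are not, so read those two with care.
tm = RegressionMetrics(pnl)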

print('RMSE = {0:5.1f}'.format(tm.rootMeanSquaredError))
print('MSE = {0:5.1f}'.format(tm.meanSquaredError))
print('MAE = {0:5.1f}'.format(tm.meanAbsoluteError))
print('r2 = {0:5.1f}'.format(tm.r2))
print('EV = {0:5.1f}'.format(tm.explainedVariance))

RMSE =  22.1
MSE = 487.8
MAE =  12.0
r2 =   0.7
EV = 2012.6


-----
### Student Activity

In the preceding cells, we introduced basic statistical analysis and machine learning with Spark. Now that you have run the Notebook, go back and make the
following changes to see how the results change.

1. Change the DataFrame to ...

2. Use an intercept in the linear regression. Try more feature columns (extract them and add them in after `p[5]`).

3. Use more trees with the random forest.

4. Try doing a logistic regression (LR), as in the Week 2 Logistic Regression material.

-----

-----

### Ending the Spark Session

We must stop the `SparkContext` in order to release resources on the
instructional cluster before exiting this Notebook.

-----

In [27]:
sc.stop()