# Classification with Decision Tree and Naive Bayes example

### Importing MLlib libraries

In [19]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
import pyspark.mllib.linalg
from pyspark.mllib.linalg import Vectors
from pyspark import SparkConf, SparkContext

### Read data and pre-processing
The **Iris flower data set**, or Fisher's Iris data set, is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

In [9]:
conf = SparkConf()
sc = SparkContext(conf=conf)

In [10]:
rawData = sc.textFile("data/iris.csv")

In [11]:
'''
val splitlines = rawData.map(lines => {
    lines.split(',')
  })
splitlines.first()
'''
splitlines = rawData.map(lambda lines: lines.split(','))
splitlines.first()

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']

In [13]:
'''
val Data = splitlines.map { col =>   
     val species = col(col.size - 1)                       
     val label = if (species == "Iris-versicolor") 0.toInt else if (species == "Iris-setosa") 1.toInt else 2.toInt
     val features = col.slice(0, col.size - 1).map(_.toDouble)
     LabeledPoint(label, Vectors.dense(features))
}
Data.take(5).foreach(println)
'''
def iris_label(argument):
    iris = {
        "Iris-versicolor": 0,
        "Iris-setosa": 1,
    }
    return iris.get(argument, 2)

def extractor(col):
    # The last column is the species name; the remaining columns are numeric features.
    species = col[-1]
    label = iris_label(species)
    features = [float(x) for x in col[:-1]]
    return LabeledPoint(label, Vectors.dense(features))

Data = splitlines.map(extractor)

for x in Data.take(5):
    print(x)

(1.0,[5.1,3.5,1.4,0.2])
(1.0,[4.9,3.0,1.4,0.2])
(1.0,[4.7,3.2,1.3,0.2])
(1.0,[4.6,3.1,1.5,0.2])
(1.0,[5.0,3.6,1.4,0.2])


### Split the data into training and test sets (40% held out for testing)

In [18]:
'''
val splits = Data.randomSplit(Array(0.6, 0.4), seed = 11L)
val trainingData = splits(0).cache()
val testData = splits(1)
println("Training Data")
trainingData.take(5).foreach(println)
println("Test Data")
testData.take(5).foreach(println)
'''
splits = Data.randomSplit([0.6, 0.4], seed = 11)
trainingData = splits[0].cache()
testData = splits[1]
print("Training Data")
for x in trainingData.take(5):
    print(x)
print("Test Data")
for x in testData.take(5):
    print(x)

Training Data
(1.0,[5.1,3.5,1.4,0.2])
(1.0,[4.9,3.0,1.4,0.2])
(1.0,[5.0,3.6,1.4,0.2])
(1.0,[5.4,3.9,1.7,0.4])
(1.0,[4.6,3.4,1.4,0.3])
Test Data
(1.0,[4.7,3.2,1.3,0.2])
(1.0,[4.6,3.1,1.5,0.2])
(1.0,[5.0,3.4,1.5,0.2])
(1.0,[4.4,2.9,1.4,0.2])
(1.0,[4.9,3.1,1.5,0.1])
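A weighted random split assigns each record to a part at random, so the 60/40 proportions hold only approximately, while a fixed seed makes the assignment repeatable. The sketch below illustrates that idea in plain Python. It is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(items, weights, seed):
    """Assign each item to one part, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.6, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for item in items:
        r = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also absorbs any floating-point rounding in the bounds.
            if r < bound or i == len(bounds) - 1:
                parts[i].append(item)
                break
    return parts


train, test = random_split(list(range(150)), [0.6, 0.4], seed=11)
print(len(train), len(test))  # sizes are close to 90 and 60, not necessarily exact
```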


### Train a Decision Tree
Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks. MLlib supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.
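The training call in this section uses the "entropy" impurity. As a rough illustration of what that criterion measures, the plain-Python sketch below computes Shannon entropy and the information gain of a candidate split. The helper names and the toy split are assumptions for illustration only; this is not MLlib code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


# Toy split: 150 balanced labels, where one class is separated perfectly.
parent = [0] * 50 + [1] * 50 + [2] * 50
left = [1] * 50
right = [0] * 50 + [2] * 50
gain = information_gain(parent, left, right)
```

A tree learner of this kind picks, at each node, the split with the largest gain among the candidates it considers.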

In [24]:
'''
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "entropy"
val maxDepth = 3
val maxBins = 10
val dtModel = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
println(dtModel.toDebugString)
'''
numClasses = 3
categoricalFeaturesInfo = {}  # empty dict: all features are treated as continuous
impurity = "entropy"
maxDepth = 3
maxBins = 10
dtModel = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
print(dtModel.toDebugString())

DecisionTreeModel classifier of depth 3 with 13 nodes
  If (feature 2 <= 3.5)
   If (feature 1 <= 2.6)
    If (feature 0 <= 4.8)
     Predict: 1.0
    Else (feature 0 > 4.8)
     Predict: 0.0
   Else (feature 1 > 2.6)
    Predict: 1.0
  Else (feature 2 > 3.5)
   If (feature 3 <= 1.7)
    If (feature 2 <= 5.3)
     Predict: 0.0
    Else (feature 2 > 5.3)
     Predict: 2.0
   Else (feature 3 > 1.7)
    If (feature 2 <= 5.0)
     Predict: 2.0
    Else (feature 2 > 5.0)
     Predict: 2.0
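To read the debug string, it can help to transcribe it by hand into nested conditionals. The sketch below does that for the tree printed above; `predict_from_printed_tree` is a hypothetical helper, and the feature names in the comments assume the usual Iris column order (sepal length, sepal width, petal length, petal width).

```python
def predict_from_printed_tree(features):
    """Hand transcription of the printed tree; features are indexed 0 to 3."""
    f0, f1, f2, f3 = features
    if f2 <= 3.5:            # feature 2 (assumed: petal length)
        if f1 <= 2.6:        # feature 1 (assumed: sepal width)
            return 1.0 if f0 <= 4.8 else 0.0
        return 1.0
    if f3 <= 1.7:            # feature 3 (assumed: petal width)
        return 0.0 if f2 <= 5.3 else 2.0
    return 2.0               # both leaves under feature 3 > 1.7 predict 2.0


print(predict_from_printed_tree([5.1, 3.5, 1.4, 0.2]))  # first sample in the data
```

The last split (feature 2 <= 5.0) leads to the same prediction on both sides, so it does not change the predicted class.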



In [34]:
'''
val dtTotalCorrect = trainingData.map { point =>
  if (dtModel.predict(point.features) == point.label) 1 else 0
  }.sum

println(dtTotalCorrect)
println(trainingData.count)
'''
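# In PySpark, DecisionTreeModel.predict() calls into the JVM through the
# SparkContext, so the model can neither be broadcast nor used inside map().
# Both attempts fail with:
#   Exception: It appears that you are attempting to reference SparkContext from a
#   broadcast variable, action, or transforamtion. SparkContext can only be used on
#   the driver, not in code that it run on workers. For more information, see SPARK-5063.
# Instead, call predict() on the whole RDD of feature vectors and zip the result
# with the labels.
predictions = dtModel.predict(trainingData.map(lambda point: point.features))
labelsAndPredictions = trainingData.map(lambda point: point.label).zip(predictions)
dtTotalCorrect = labelsAndPredictions.filter(lambda lp: lp[0] == lp[1]).count()

# For a small data set, predicting on the driver also works:
# for x in trainingData.collect():
#     print(dtModel.predict(x.features))

print(dtTotalCorrect)
print(trainingData.count())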