This notebook works with the Landsat dataset. Upload the small version of the training data (landsat_train_small.csv) to a 'landsat' directory in your home directory on the Hadoop cluster's file system (HDFS) with the following commands:

* hadoop fs -mkdir landsat
* hadoop fs -put landsat_train_small.csv landsat/

Next, follow the instructions provided below ...

In [None]:
training = sc.textFile("hdfs:///user/lsda/landsat/landsat_train_small.csv")
print("First line of the training RDD: {}".format(training.take(1)))
print("Number of elements in the training RDD: {}".format(training.count()))

Let us define a Python function to extract, for each line, the label and the associated features.

In [None]:
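def parse(line):
    """Split a CSV line into (label, features); return None if it cannot be parsed."""
    try:
        fields = line.split(',')
        label = int(fields[0])
        features = [float(f) for f in fields[1:]]
        return (label, features)
    except ValueError:
        # malformed line (e.g., a header row or a non-numeric field)
        return None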

training = training.map(parse)
training = training.filter(lambda line: line is not None)

print("First line of modified RDD: {}".format(training.take(1)))

Next, use the 'map' and 'reduceByKey' transformations to count how often each class occurs in the training RDD.
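As a hint, the following plain-Python sketch mimics what the two transformations compute on a small, made-up list of (label, features) pairs. It does not use Spark, and the names `records` and `reduce_by_key` are illustrative assumptions rather than part of the notebook. On an RDD, the analogous steps would be a `map` that emits `(label, 1)` pairs followed by `reduceByKey` with an addition function.

```python
from operator import add

# Made-up records in the same (label, features) shape that parse() returns.
records = [(2, [0.1, 0.2]), (4, [0.3, 0.4]), (4, [0.5, 0.6]), (7, [0.7, 0.8])]

def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge values that share a key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# 'map' step: emit one (label, 1) pair per record.
pairs = [(label, 1) for label, _ in records]

# 'reduceByKey' step: add up the ones for each label.
label_counts = reduce_by_key(pairs, add)
print(sorted(label_counts.items()))
```

In Spark the merge happens per partition first and then across partitions, which is why the function passed to `reduceByKey` should be associative and commutative, as addition is.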

In [None]:
# YOUR CODE HERE
#
# Make use of 'map' and 'reduceByKey' to count how
# often a label is given in the training RDD

# counts = ...

In [None]:
# to plot the class distribution, we need to get the 
# statistics back to the driver
classes_local, counts_local = zip(*counts.collect())

# plot the class histogram
%matplotlib inline
import matplotlib.pyplot as plt
plt.bar(classes_local, counts_local)
plt.show()

The class distribution is very skewed. Next, we will generate a new version of the training RDD that is more balanced w.r.t. the labels. Given this modified training set, we will be able to efficiently train a tree ensemble on the driver.

Make use of the 'sampleByKey' transformation to generate a more balanced dataset. Have a look at the section *Stratified sampling* of the [documentation](https://spark.apache.org/docs/latest/mllib-statistics.html).
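To build some intuition, the sketch below imitates sampling by key without replacement in plain Python: every (key, value) pair is kept independently with the probability given for its key. This is a local illustration under that assumption, not Spark code; `sample_by_key_local`, `toy_pairs`, and `toy_fractions` are made-up names. In PySpark, the corresponding call has the form `rdd.sampleByKey(withReplacement, fractions, seed)`.

```python
import random

def sample_by_key_local(pairs, fractions, seed=0):
    """Keep each (key, value) pair independently with probability fractions[key]."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

# Toy data: a rare class (2) and a frequent class (4).
toy_pairs = [(2, "a")] * 5 + [(4, "b")] * 1000
toy_fractions = {2: 1.0, 4: 0.1}

sampled = sample_by_key_local(toy_pairs, toy_fractions, seed=0)

kept_2 = sum(1 for k, _ in sampled if k == 2)
kept_4 = sum(1 for k, _ in sampled if k == 4)
print(kept_2, kept_4)
```

A fraction of 1.0 keeps every instance of a class, while a smaller fraction keeps roughly that share, so the resulting class sizes are approximate rather than exact.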

In [None]:
fractions = {2: 1.0, 3: 1.0, 4: 0.001, 5: 1.0, 6: 0.1, 7: 1.0, 8: 1.0}

# YOUR CODE HERE (use the above fractions and sample without replacement)
# USE seed=0 for reproducibility!
# training_balanced = ... 

Let us count again and plot the outcome:

In [None]:
# YOUR CODE HERE

# generate another plot, this time based on 
# the training_balanced RDD

# counts = ...

classes_local, counts_local = zip(*counts.collect())

import matplotlib.pyplot as plt
plt.bar(classes_local, counts_local)
plt.show()

Let's copy the small subset of the training instances back to the driver. Since we only have about 10K instances left, we can simply build the ensemble on the driver.
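Calling 'collect' on an RDD of (label, features) pairs returns an ordinary Python list of tuples on the driver. The sketch below shows one possible way to split such a list into a feature matrix and a label vector; the list `collected` is a made-up stand-in for the real result, and the names `Xtrain` and `ytrain` are chosen to match the evaluation cell further down.

```python
# Made-up stand-in for the list returned by training_balanced.collect().
collected = [(2, [0.1, 0.2, 0.3]), (4, [0.4, 0.5, 0.6]), (7, [0.7, 0.8, 0.9])]

# Split the pairs into a label list and a feature matrix (one row per instance).
ytrain = [label for label, _ in collected]
Xtrain = [features for _, features in collected]

print(len(Xtrain), len(Xtrain[0]), ytrain)
```

Nested lists like these are array-like, so they can be passed to the `fit` method of a scikit-learn estimator such as `ExtraTreesClassifier`.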

In [None]:
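# YOUR CODE HERE

# (1) copy the balanced training subset to the driver via 'collect'
#     and store the features in Xtrain and the labels in ytrain
#     (both names are used in the next cell)
# (2) create an extra trees classifier via sklearn using
#     n_estimators=50, max_depth=10, random_state=0
#     and fit it on (Xtrain, ytrain)

# model = ...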


Finally, let's check the accuracy of the ensemble on the training set and save the model to a file for later use ...

In [None]:
from sklearn.metrics import accuracy_score

# compute accuracy on training set
preds = model.predict(Xtrain)
print("Training accuracy is: {}".format(accuracy_score(ytrain, preds)))

# save model via pickle
import pickle
with open("model.save", 'wb') as f:
    pickle.dump(model, f)
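For later use, the saved file can be read back with `pickle.load`. The following minimal sketch demonstrates the round trip with a stand-in object and a temporary directory, so it does not touch the real model.save; `stand_in_model` and `path` are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; any picklable object behaves the same way.
stand_in_model = {"n_estimators": 50, "max_depth": 10}

# Write to a temporary location so the real model file is left untouched.
path = os.path.join(tempfile.mkdtemp(), "model.save")
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read the object back; note the binary read mode 'rb'.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

When loading a real scikit-learn model this way, it is safest to use the same scikit-learn version that created the file and to unpickle only files from a trusted source.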