### Predict the tree cover type using Random Forest

#### Task description
The dataset (https://archive.ics.uci.edu/ml/datasets/covertype) describes forest cover in the US and contains information about 500000 trees. Your aim is to build a Random Forest ensemble that predicts the cover type of the trees. To complete this assignment successfully, follow these steps:

1. Load the training data.
2. Transform categorical features into vector representations.
3. Split the dataset into train and validation parts.
4. Fit the Random Forest ensemble on the training set.
5. Compare the accuracy of the fitted model with that of a Logistic Regression model, which is about 0.67 on this dataset.
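
The split step above can be sketched with the DataFrame `randomSplit` method. The helper name `split_train_validation`, the 80/20 ratio, and the seed are assumptions made for illustration; the assignment does not prescribe them.

```python
def split_train_validation(df, validation_fraction=0.2, seed=42):
    """Randomly split a Spark DataFrame into train and validation parts.

    The default fraction and seed are illustrative assumptions.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    weights = [1.0 - validation_fraction, validation_fraction]
    # DataFrame.randomSplit returns one DataFrame per weight.
    train_part, validation_part = df.randomSplit(weights, seed=seed)
    return train_part, validation_part

# Hypothetical usage once the training DataFrame has been loaded:
# train_part, validation_part = split_train_validation(train)
```

Note that `randomSplit` assigns rows at random, so the resulting sizes only approximate the requested proportions.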

If you have enough time, it is worth digging further with these steps:

- Determine which features are valuable for your model (calculate its feature importances).
- Try reducing the number of trees and see how the results change.
- Understand why linear models perform poorly on this dataset.
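
For the feature-importance step, a fitted `RandomForestClassificationModel` exposes a `featureImportances` vector. Because each one-hot column expands into several slots of the feature vector, the sketch below maps slot indices back to names through the ML attribute metadata that `VectorAssembler` usually attaches to the `features` column. The helper name `rank_features` and the toy numbers are hypothetical, and the metadata layout is an assumption that should be checked on your Spark version.

```python
def rank_features(importances, ml_attrs, top_k=10):
    """Pair each feature-vector slot with a name and sort by importance.

    importances: sequence of floats, one per slot of the feature vector.
    ml_attrs: dict such as {"numeric": [{"idx": 0, "name": "Elevation"}], ...}
              (assumed layout of schema["features"].metadata["ml_attr"]["attrs"]).
    """
    names = {}
    for attr_group in ml_attrs.values():
        for attr in attr_group:
            names[attr["idx"]] = attr["name"]
    # Slots without metadata get a placeholder name.
    ranked = [(names.get(i, "slot_%d" % i), float(score))
              for i, score in enumerate(importances)]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Hypothetical usage once a pipeline model and its predictions exist:
# rf_model = model.stages[-1]
# attrs = predict.schema["features"].metadata["ml_attr"]["attrs"]
# for name, score in rank_features(rf_model.featureImportances.toArray(), attrs):
#     print(name, score)

# Toy demonstration with made-up numbers (not results from the dataset).
toy_attrs = {
    "numeric": [{"idx": 0, "name": "Elevation"}, {"idx": 1, "name": "Slope"}],
    "binary": [{"idx": 2, "name": "Soil_Type_ohe_A"}],
}
toy_ranking = rank_features([0.6, 0.1, 0.3], toy_attrs)
print(toy_ranking)
```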

#### Notes
The dataset is located at /data/covertype.

The test set will be replaced during the testing phase. Do not forget to clean up your code and choose the appropriate option in the dropdown menu (Week 4, Random Forest). If you have any questions, feel free to report problems on the course forum.

The metric for this assignment is multiclass accuracy. You have to achieve a score higher than 71% on the test dataset to get the full score for the assignment.
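
Multiclass accuracy is the fraction of rows whose predicted class equals the true class. A minimal pure-Python sketch follows; the function name and the sample labels are made up for illustration.

```python
def multiclass_accuracy(labels, predictions):
    """Return the share of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))

# Made-up labels: three of the five predictions match.
sample_labels = [1, 2, 2, 3, 7]
sample_predictions = [1, 2, 3, 3, 1]
sample_accuracy = multiclass_accuracy(sample_labels, sample_predictions)
print(sample_accuracy)
```

In Spark ML the same quantity is reported by `MulticlassClassificationEvaluator` when it is created with `metricName="accuracy"`.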

In [1]:
from __future__ import division, print_function, unicode_literals 

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

In [2]:
# Local Spark session running on 4 worker threads.
spark_session = (
    SparkSession.builder
    .enableHiveSupport()
    .appName("spark sql")
    .master("local[4]")
    .getOrCreate()
)

In [3]:
train = spark_session.read.csv("/data/covertype2/train.csv", header=True, inferSchema=True)
test = spark_session.read.csv("/data/covertype2/test.csv", header=True, inferSchema=True)

In [4]:
# Columns assembled into the feature vector: ten numeric columns
# plus the two one-hot encoded categorical columns.
names = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
         'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
         'Horizontal_Distance_To_Fire_Points', 'Wild_Type_ohe', 'Soil_Type_ohe']

In [5]:
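# Map each category to a numeric index.
wildTypeTransformer = StringIndexer(inputCol="Wild_Type", outputCol="Wild_Type_int")
soilTypeTransformer = StringIndexer(inputCol="Soil_Type", outputCol="Soil_Type_int")

# One-hot encode the indices into sparse vectors.
wildTypeEncoder = OneHotEncoder(inputCol="Wild_Type_int", outputCol="Wild_Type_ohe")
soilTypeEncoder = OneHotEncoder(inputCol="Soil_Type_int", outputCol="Soil_Type_ohe")

# Concatenate the numeric columns and one-hot vectors into one feature vector.
assembler = VectorAssembler(inputCols=names, outputCol="features")

# Random forest with 50 trees, each at most 12 levels deep.
rf = RandomForestClassifier(numTrees=50, maxDepth=12, labelCol="Target", predictionCol="prediction")

# Note: without metricName this evaluator reports F1 by default;
# pass metricName="accuracy" to report the assignment metric.
evaluator = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")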

pipeline = Pipeline(stages=[wildTypeTransformer, soilTypeTransformer, wildTypeEncoder, soilTypeEncoder, assembler, rf])
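
To make the two encoding stages concrete, here is a pure-Python sketch of roughly what `StringIndexer` followed by `OneHotEncoder` computes with default settings: labels are indexed by descending frequency, and the last index is dropped so that it maps to an all-zero vector. The category values are invented, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on."""
    counts = Counter(values)
    # Ties are broken alphabetically here (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# Invented category values, not taken from the dataset.
categories = ["B", "A", "B", "C", "B", "A"]
index_map = fit_string_index(categories)
encoded = {label: one_hot(index, len(index_map)) for label, index in index_map.items()}
print(index_map)
print(encoded)
```

Spark stores the encoded result as a sparse vector rather than a Python list, but the positions of the non-zero entries follow the same idea.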

In [6]:
model = pipeline.fit(train)
predict = model.transform(test)

In [7]:
print(evaluator.evaluate(predict))

0.7550895936553261
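
One of the optional steps is to reduce the number of trees. A hedged sketch of such a sweep follows; `score_by_num_trees` and the `build_pipeline` callable are hypothetical helpers, and the tree counts in the usage comment are arbitrary.

```python
def score_by_num_trees(build_pipeline, train_df, validation_df, evaluator, tree_counts):
    """Fit one pipeline per tree count and collect the validation scores.

    build_pipeline: callable taking num_trees and returning an unfitted
                    pipeline (hypothetical helper supplied by the caller).
    """
    scores = {}
    for num_trees in tree_counts:
        fitted = build_pipeline(num_trees).fit(train_df)
        scores[num_trees] = evaluator.evaluate(fitted.transform(validation_df))
    return scores

# Hypothetical usage, reusing the stages defined in cell 5:
# def build_pipeline(num_trees):
#     forest = RandomForestClassifier(numTrees=num_trees, maxDepth=12, labelCol="Target")
#     return Pipeline(stages=[wildTypeTransformer, soilTypeTransformer,
#                             wildTypeEncoder, soilTypeEncoder, assembler, forest])
# train_part, validation_part = train.randomSplit([0.8, 0.2], seed=42)
# print(score_by_num_trees(build_pipeline, train_part, validation_part,
#                          evaluator, [5, 10, 25, 50]))
```

Scoring on a held-out validation part rather than on the test set keeps the final test score an honest estimate.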
