# Kaggle Competition

In this lab session, we are going to compete in a Kaggle competition.

First, we upload the `train` and `test` datasets to Databricks using the following route:

*Data -> Add Data -> Upload File*

**Note:** You have the option to select the location to store the files within DBFS.

Once the files are uploaded, we can use them in our environment.

In the code below, replace paths such as `/FileStore/tables/train.csv` with the path(s) and file names you chose when uploading.

**Note 1:** When the upload is complete, you will get a confirmation along with the path and name assigned. Filenames might be slightly modified by Databricks.

**Note 2:** If you missed the path and filename message, you can browse DBFS via *Data -> Add Data -> Upload File -> DBFS*, or list the contents of the path with `display(dbutils.fs.ls("dbfs:/FileStore/some_path"))`.

### 1 - Upload Data

We start by loading the uploaded training and testing data sets into DataFrames.

In [4]:
train_data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/FileStore/tables/train_set-51e11.csv')

test_data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/FileStore/tables/test_set-b5f57.csv')

display(train_data)

Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
1,2611,326,20,120,27,1597,168,214,184,2913,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
2,2772,324,17,42,7,1814,175,220,183,2879,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
3,2764,4,14,480,-21,700,201,212,148,700,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
4,3032,342,9,60,8,4050,202,227,164,2376,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
5,2488,23,11,117,21,1117,209,218,151,1136,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
6,2968,83,8,390,19,4253,232,226,127,4570,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
7,3027,11,6,534,47,1248,214,228,151,2388,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2
8,3216,277,9,67,23,5430,212,236,169,2373,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
9,3242,262,5,849,169,1672,207,242,173,691,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
10,3315,61,15,120,-6,3042,231,208,106,1832,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,7


In [5]:
print('Train data size: {} rows, {} columns'.format(train_data.count(), len(train_data.columns)))
print('Test data size: {} rows, {} columns'.format(test_data.count(), len(test_data.columns)))

To evaluate our model, we randomly split the training data set into two groups: about 70% for training and 30% for validation.

In [7]:
splits = train_data.randomSplit([0.7, 0.3])
test_data_to_evaluate = splits[1].cache()
train_data_to_evaluate = splits[0].cache()
display(train_data_to_evaluate)

Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
1,2611,326,20,120,27,1597,168,214,184,2913,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
2,2772,324,17,42,7,1814,175,220,183,2879,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
3,2764,4,14,480,-21,700,201,212,148,700,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
4,3032,342,9,60,8,4050,202,227,164,2376,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
6,2968,83,8,390,19,4253,232,226,127,4570,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
7,3027,11,6,534,47,1248,214,228,151,2388,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2
8,3216,277,9,67,23,5430,212,236,169,2373,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
11,3221,165,3,520,33,5695,218,241,154,2529,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
12,2366,17,34,30,19,474,172,150,96,642,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
13,2852,329,14,275,6,3314,185,226,182,2695,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
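`randomSplit` decides row by row which group a record lands in, so the 70/30 proportions hold only approximately, and without a `seed` argument the split changes on every run. The plain-Python sketch below imitates that behaviour on a list of ids. The helper name `random_split` and the seed value are illustrative assumptions, not part of the notebook; in Spark itself, passing `seed=` to `randomSplit` gives the same kind of reproducibility.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or validation with an independent random draw."""
    rng = random.Random(seed)  # fixed seed -> same split on every run
    train, validation = [], []
    for row in rows:
        (train if rng.random() < train_fraction else validation).append(row)
    return train, validation

ids = list(range(1, 1001))
train_ids, validation_ids = random_split(ids)
# Sizes are close to 700/300 but usually not exact.
print(len(train_ids), len(validation_ids))
```

Fixing the seed matters when comparing models: otherwise two runs are scored on different validation rows.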


### 2 - Data Exploration

To explore this data, we first look at the features, which fall into 2 groups:

###### 1 - Continuous:
`a) Distances`
  - Elevation: elevation in meters
  - Horizontal_Distance_To_Hydrology: horizontal distance to nearest surface water
  - Vertical_Distance_To_Hydrology: vertical distance to nearest surface water
  - Horizontal_Distance_To_Roadways: horizontal distance to nearest roadway
  - Horizontal_Distance_To_Fire_Points: horizontal distance to nearest wildfire ignition points

`b) Other features`
  - Aspect: in degrees azimuth
  - Slope: in degrees
  - Hillshade_9am: hillshade index at 9am, summer solstice
  - Hillshade_Noon: hillshade index at noon, summer solstice
  - Hillshade_3pm: hillshade index at 3pm, summer solstice

###### 2 - Categorical:
  - Wilderness_Area: (4 binary columns) 0 (absence) or 1 (presence), wilderness area designation
  - Soil_Type: (40 binary columns) 0 (absence) or 1 (presence), soil type designation

#### a) Continuous features:
Before focusing on the correlation between features, we try to center and rescale the continuous variables.

In [11]:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import stddev, mean, col

continuousCols = ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points"]


def standardize(data):
  # Replace each continuous column by its z-score: (value - mean) / standard deviation
  for column in continuousCols:
    mean_column, stddev_column = data.select(mean(column), stddev(column)).first()
    # The standardized column is appended at the end, so the column order changes
    data = data.withColumn("modifiedColumn", (col(column) - mean_column) / stddev_column).drop(column)
    data = data.withColumnRenamed("modifiedColumn", column)
  return data

train_data_std = standardize(train_data)
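
As a sanity check on what the function above computes, here is the same z-score transformation in plain Python on a small list. It uses the sample standard deviation (dividing by n - 1), which is also what Spark's `stddev` returns. The function name `z_scores` is illustrative.

```python
from statistics import mean, stdev

def z_scores(values):
    """Center values on their mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - mu) / sigma for v in values]

# Elevation of the first five training rows shown above
print(z_scores([2611, 2772, 2764, 3032, 2488]))
```

After the transformation every column has mean 0 and standard deviation 1, so features measured on very different scales become comparable.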


In [12]:
display(train_data_std)

Id,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points
1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,-1.2441532193082974,1.5198661367230608,0.7871331372222709,-0.7027383499354302,-0.3332775507779793,-0.4832537688105368,-1.6478855563296315,-0.471596653549139,1.081077644874232,0.7041645701503906
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,-0.6698746896417298,1.502024202522768,0.3872319508819058,-1.0688167306310696,-0.6755444298899668,-0.3444677151047788,-1.386630768043917,-0.1687724987965541,1.0549926237893064,0.6785211772229301
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,-0.6984102687555964,-1.3526852695240543,-0.0126692354584592,0.9868541763521368,-1.154718060646749,-1.0569454285988544,-0.4162558401255471,-0.5725380384666673,0.1420168858169124,-0.9649186224516684
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0.2575316315589384,1.662601610325402,-0.679171212692401,-0.9843371043166914,-0.6584310859343674,1.08560424842552,-0.3789337275133021,0.1845223484147948,0.5593772231757211,0.2991498053843238
5,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,-1.682887748183998,-1.1831868946212742,-0.4125704217988242,-0.7168182876544933,-0.4359576145115755,-0.7902459613394024,-0.1176789392275872,-0.2697138837140824,0.2202719490716891,-0.6360798190289401
6,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0.0292469986480046,-0.647928868612495,-0.8124716081391893,0.5644560447802451,-0.4701843024227743,1.2154363631825194,0.7407296508540475,0.1340516559560307,-0.4057685569665239,1.9539028666445664
7,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,0.2396968946127716,-1.29023849982303,-1.079072399032766,1.240293055295272,0.008989328334008,-0.7064626754617328,0.0689316238336377,0.234993040873559,0.2202719490716891,0.3082004146528393
8,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.9138499511778728,1.082738748815891,-0.679171212692401,-0.9514839163055442,-0.4017309266003768,1.9682068019460088,-0.0057126013908522,0.6387585805436721,0.6898023286003488,0.296887153067195
9,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0065905832979396,0.9489242423136964,-1.2123727944795544,2.718686515796893,2.0968172909171314,-0.4352862387279015,-0.1923231644520772,0.941582735296257,0.794142412940051,-0.9717065794030548
10,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,7,1.2669777427119733,-0.8441901448157141,0.1206311599883291,-0.7027383499354302,-0.8980179013127586,0.4409206441149023,0.7034075382418026,-0.7744208083017239,-0.9535539997499602,-0.1111444814550435


Let's now check whether the distance features are correlated.

Let's start with the horizontal ones.

In [15]:
display(train_data_std.select("Horizontal_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Horizontal_Distance_To_Fire_Points", "Cover_Type"))

Horizontal_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Horizontal_Distance_To_Fire_Points,Cover_Type
-0.7027383499354302,-0.4832537688105368,0.7041645701503906,6
-1.0688167306310696,-0.3444677151047788,0.6785211772229301,2
0.9868541763521368,-1.0569454285988544,-0.9649186224516684,2
-0.9843371043166914,1.08560424842552,0.2991498053843238,2
-0.7168182876544933,-0.7902459613394024,-0.6360798190289401,2
0.5644560447802451,1.2154363631825194,1.9539028666445664,2
1.240293055295272,-0.7064626754617328,0.3082004146528393,2
-0.9514839163055442,1.9682068019460088,0.296887153067195,1
2.718686515796893,-0.4352862387279015,-0.9717065794030548,1
-0.7027383499354302,0.4409206441149023,-0.1111444814550435,7


We see some correlations here that can be approximated by linear relationships.
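
A plot gives only a visual impression; the Pearson coefficient puts a number on how linear a relationship is. The sketch below computes it in plain Python (the helper name `pearson` is illustrative). On the Spark side, `train_data_std.stat.corr(col1, col2)` returns the same quantity for two columns.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Values close to +1 or -1 indicate a nearly linear relationship, values near 0 a weak one.
hydrology = [120, 42, 480, 60, 117]        # first five rows of the training data
roadways = [1597, 1814, 700, 4050, 1117]
print(round(pearson(hydrology, roadways), 3))
```

Five rows are far too few to draw conclusions; the snippet only shows the calculation.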

Let's now focus on the vertical distances.

In [18]:
display(train_data_std.select("Cover_Type", "Elevation", "Vertical_Distance_To_Hydrology"))

Cover_Type,Elevation,Vertical_Distance_To_Hydrology
6,-1.2441532193082974,-0.3332775507779793
2,-0.6698746896417298,-0.6755444298899668
2,-0.6984102687555964,-1.154718060646749
2,0.2575316315589384,-0.6584310859343674
2,-1.682887748183998,-0.4359576145115755
2,0.0292469986480046,-0.4701843024227743
2,0.2396968946127716,0.008989328334008
1,0.9138499511778728,-0.4017309266003768
1,1.0065905832979396,2.0968172909171314
7,1.2669777427119733,-0.8980179013127586


The correlation plot shows an almost horizontal line, which can still be approximated by a linear relationship.

#### b) Categorical features:
Let's now explore the categorical features:

We can see that the categorical features `Wilderness_Area` and `Soil_Type` are each spread over several binary columns (`Wilderness_Area1` to `Wilderness_Area4` and `Soil_Type1` to `Soil_Type40`).

We can check whether any of these columns is constant.

In [23]:
categoricalCols = ["Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3", "Wilderness_Area4", "Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4", "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8", "Soil_Type9", "Soil_Type10", "Soil_Type11", "Soil_Type12", "Soil_Type13", "Soil_Type14", "Soil_Type15", "Soil_Type16", "Soil_Type17", "Soil_Type18", "Soil_Type19", "Soil_Type20", "Soil_Type21", "Soil_Type22", "Soil_Type23", "Soil_Type24", "Soil_Type25", "Soil_Type26", "Soil_Type27", "Soil_Type28", "Soil_Type29", "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34", "Soil_Type35", "Soil_Type36", "Soil_Type37", "Soil_Type38", "Soil_Type39", "Soil_Type40"]

from pyspark.sql import functions
for column in categoricalCols:
  train_data.agg(functions.min(train_data[column]),functions.max(train_data[column])).show()

We find that there are no constant columns among the categorical features.
Next, we check whether any two features hold identical values, especially among the soil types.

In [25]:
soil_type_columns = ["Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4", "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8", "Soil_Type9", "Soil_Type10", "Soil_Type11", "Soil_Type12", "Soil_Type13", "Soil_Type14", "Soil_Type15", "Soil_Type16", "Soil_Type17", "Soil_Type18", "Soil_Type19", "Soil_Type20", "Soil_Type21", "Soil_Type22", "Soil_Type23", "Soil_Type24", "Soil_Type25", "Soil_Type26", "Soil_Type27", "Soil_Type28", "Soil_Type29", "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34", "Soil_Type35", "Soil_Type36", "Soil_Type37", "Soil_Type38", "Soil_Type39", "Soil_Type40"]

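from itertools import combinations

# Two columns are equal when no row holds different values in them.
# Comparing two DataFrames with == does not compare their contents, so we count differing rows.
# Note: this launches one Spark job per pair of columns.
for column1, column2 in combinations(soil_type_columns, 2):
  if train_data.filter(train_data[column1] != train_data[column2]).count() == 0:
    print(column1, column2)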

There are no equal columns for soil types.
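
The two checks above reduce to simple rules: a column is constant when its minimum equals its maximum, and two columns are equal when no row holds different values in them. A minimal plain-Python sketch on a toy table (column names and values are made up for illustration):

```python
from itertools import combinations

# Toy table: each key is a column, each list holds that column's values per row.
table = {
    "Soil_TypeA": [0, 1, 0, 0],
    "Soil_TypeB": [0, 0, 0, 0],   # constant column
    "Soil_TypeC": [0, 1, 0, 0],   # same values as Soil_TypeA
}

def constant_columns(columns):
    """Columns whose minimum equals their maximum."""
    return [name for name, values in columns.items() if min(values) == max(values)]

def equal_column_pairs(columns):
    """Pairs of distinct columns that hold identical values in every row."""
    return [(a, b) for a, b in combinations(columns, 2) if columns[a] == columns[b]]

print(constant_columns(table))     # ['Soil_TypeB']
print(equal_column_pairs(table))   # [('Soil_TypeA', 'Soil_TypeC')]
```

Constant or duplicated columns carry no extra information for the model, which is why they are worth looking for before building features.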

### 3 - Feature Engineering

Now, based on the previous exploration, we:

1) Combine categorical features

2) Explore the linearity between distances

3) Combine the distances to hydrology (vertical and horizontal)

In [28]:
from pyspark.ml.feature import VectorAssembler



Wilderness_Area_columns = ["Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3", "Wilderness_Area4"]
def scaleData(data):
  data = data.withColumn("Distance_to_Hydrology", (data["Horizontal_Distance_To_Hydrology"]**2 + data["Vertical_Distance_To_Hydrology"]**2)**0.5 )

  data = data.withColumn("Elevation_hydrology", data["Elevation"]-data["Vertical_Distance_To_Hydrology"])

  data = data.withColumn("H_F", data["Horizontal_Distance_To_Hydrology"] - data["Horizontal_Distance_To_Fire_Points"])

  data = data.withColumn("H_R",data["Horizontal_Distance_To_Hydrology"] - data["Horizontal_Distance_To_Roadways"])

  data = data.withColumn("F_R", data["Horizontal_Distance_To_Fire_Points"] - data["Horizontal_Distance_To_Roadways"])
  
  assembler1 = VectorAssembler(
    inputCols=soil_type_columns,
    outputCol="soil_type")

  data = assembler1.transform(data)
  
  assembler2 = VectorAssembler(
    inputCols=Wilderness_Area_columns,
    outputCol="Wilderness_Area")

  data  = assembler2.transform(data)
  
  return data
  
    
train_data = scaleData(train_data)
test_data = scaleData(test_data)
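
To see what these derived columns contain, the sketch below applies the same formulas to the first training row in plain Python. The function name `engineer` and the dictionary layout are illustrative; only the formulas come from `scaleData`.

```python
from math import hypot

def engineer(row):
    """Derived distance features for one row, using the same formulas as scaleData."""
    h = row["Horizontal_Distance_To_Hydrology"]
    v = row["Vertical_Distance_To_Hydrology"]
    f = row["Horizontal_Distance_To_Fire_Points"]
    r = row["Horizontal_Distance_To_Roadways"]
    return {
        "Distance_to_Hydrology": hypot(h, v),         # straight-line distance to water
        "Elevation_hydrology": row["Elevation"] - v,  # elevation minus vertical distance
        "H_F": h - f,
        "H_R": h - r,
        "F_R": f - r,
    }

first_row = {
    "Elevation": 2611,
    "Horizontal_Distance_To_Hydrology": 120,
    "Vertical_Distance_To_Hydrology": 27,
    "Horizontal_Distance_To_Roadways": 1597,
    "Horizontal_Distance_To_Fire_Points": 2913,
}
print(engineer(first_row))
```

For this row the straight-line distance to water is 123, since 120² + 27² = 123².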

In [29]:
scaled_columns = ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Distance_to_Hydrology", "Elevation_hydrology", "H_F", "H_R", "F_R", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points", "Wilderness_Area","soil_type"]

### 4 - Model Construction

Now, we combine all the features into a single vector and build a model to train on our dataset.

In [32]:
from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(inputCols=scaled_columns, outputCol="features")

We begin with a decision tree classifier, since logistic regression gave poor results.
A decision tree also handles a heterogeneous dataset like ours (both continuous and categorical features) and is a good fit for multiclass classification.

In [34]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Setup classifier
classifier = DecisionTreeClassifier(labelCol="Cover_Type", featuresCol="features")

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.impurity, ['entropy'])
             .addGrid(classifier.maxDepth, [5, 10, 30])
             .addGrid(classifier.maxBins, [100, 130])
             .build())

evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol=classifier.getLabelCol())


Now, we train on `train_data_to_evaluate` and evaluate on `test_data_to_evaluate`, after applying the same feature engineering to both.

In [36]:
# Rescale Data 
train_data_to_evaluate = scaleData(train_data_to_evaluate)
test_data_to_evaluate = scaleData(test_data_to_evaluate)

from pyspark.ml import Pipeline

# Chain the vector assembler and the classification model
pipeline = Pipeline(stages=[vector_assembler, classifier])

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3)

# Run stages in pipeline with the train data
model = cv.fit(train_data_to_evaluate)
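
It helps to know how much work this cross-validation triggers. The grid is the Cartesian product of the listed values, each combination is fitted once per fold, and the best combination is then refitted once on the full input. The sketch below counts this in plain Python; the dictionary simply mirrors the values passed to `ParamGridBuilder`, and the variable names are illustrative.

```python
from itertools import product

grid = {
    "impurity": ["entropy"],
    "maxDepth": [5, 10, 30],
    "maxBins": [100, 130],
}
num_folds = 3

# Every combination of one value per parameter
param_combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
fits = len(param_combinations) * num_folds + 1  # +1 for the final refit of the best combination

print(len(param_combinations), fits)  # 6 19
```

Each extra value in the grid multiplies the number of fits, which is why the grid here is kept small.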

Now we run the model on the validation split and evaluate it.

In [38]:
# Make predictions on testData
predictions = model.transform(test_data_to_evaluate)

evaluator.evaluate(predictions)
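
`MulticlassClassificationEvaluator` reports the weighted F1 score unless `metricName` is set to something else, so the value above is not directly an accuracy. To illustrate the difference, the sketch below computes both metrics in plain Python on made-up labels (function names are illustrative).

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights equal to each class's share of the labels."""
    total = 0.0
    for cls in set(labels):
        tp = sum(y == cls and p == cls for y, p in zip(labels, predictions))
        fp = sum(y != cls and p == cls for y, p in zip(labels, predictions))
        fn = sum(y == cls and p != cls for y, p in zip(labels, predictions))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * labels.count(cls) / len(labels)
    return total

labels = [1, 1, 2, 2]
predictions = [1, 2, 2, 2]
print(accuracy(labels, predictions))               # 0.75
print(round(weighted_f1(labels, predictions), 4))  # 0.7333
```

When comparing a local score with a leaderboard score, it is worth checking that both use the same metric.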

Now we retrain on the whole training data set with the best hyperparameters found and generate predictions for the testing set.

In [40]:
best_model = model.bestModel.stages[1]

#get maxDepth
maxDepth = best_model._java_obj.getMaxDepth()

#get maxBins
maxBins = best_model._java_obj.getMaxBins()

classifier_bestModel = DecisionTreeClassifier(labelCol="Cover_Type", featuresCol="features").setMaxDepth(maxDepth).setImpurity('entropy').setMaxBins(maxBins)

# Chain the vector assembler and the classification model
pipeline = Pipeline(stages=[vector_assembler, classifier_bestModel])

# Run stages in pipeline with all the train data
model = pipeline.fit(train_data)




In [41]:
predictions = model.transform(test_data)

predictions = predictions.withColumn("Cover_Type", predictions["prediction"].cast("int")) 

# Select columns Id and prediction
(predictions
 .repartition(1)
 .select('Id', 'Cover_Type')
 .write
 .format('com.databricks.spark.csv')
 .options(header='true')
 .mode('overwrite')
 .save('/FileStore/kaggle-submission'))

display(dbutils.fs.ls("dbfs:/FileStore/kaggle-submission"))

path,name,size
dbfs:/FileStore/kaggle-submission/_committed_1371024487777017041,_committed_1371024487777017041,201
dbfs:/FileStore/kaggle-submission/_committed_2251705317318529424,_committed_2251705317318529424,203
dbfs:/FileStore/kaggle-submission/_committed_3231335014039124151,_committed_3231335014039124151,214
dbfs:/FileStore/kaggle-submission/_committed_3719788540446126626,_committed_3719788540446126626,203
dbfs:/FileStore/kaggle-submission/_committed_5768524907528976268,_committed_5768524907528976268,201
dbfs:/FileStore/kaggle-submission/_committed_5941212678253780017,_committed_5941212678253780017,203
dbfs:/FileStore/kaggle-submission/_committed_695739710580495773,_committed_695739710580495773,201
dbfs:/FileStore/kaggle-submission/_committed_vacuum4375122112081905232,_committed_vacuum4375122112081905232,162
dbfs:/FileStore/kaggle-submission/_started_5768524907528976268,_started_5768524907528976268,0
dbfs:/FileStore/kaggle-submission/part-00000-tid-5768524907528976268-ea7cb287-8f5e-4b60-909c-e7c0f3ef2d0f-3795-c000.csv,part-00000-tid-5768524907528976268-ea7cb287-8f5e-4b60-909c-e7c0f3ef2d0f-3795-c000.csv,2039369


On Kaggle, we get a score of `0.91453`, which we want to improve.

To get a better result, a good idea is to use a gradient-boosted tree classifier: an additive model built from decision trees that optimizes a loss function stage by stage.
In PySpark, this classifier is named `GBTClassifier`, but it is only implemented for binary classification. Therefore, we wrap it in a `OneVsRest` model, which scores each class against all other classes and thereby reduces the multiclass problem to several binary ones.

For implementation reasons, we extend `GBTClassifier` with `HasRawPredictionCol`.
Since the cluster is not powerful enough, we cannot use cross-validation; instead, we chose the hyperparameters by trying different values manually.

In [44]:
from pyspark.ml.classification import OneVsRest, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.param.shared import HasRawPredictionCol

class newClassifier (GBTClassifier, HasRawPredictionCol):
  def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                 maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
                 maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="logistic",
                 maxIter=20, stepSize=0.1, seed=None, rawPredictionCol="rawPrediction"):
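    # Forward the constructor arguments instead of re-stating the defaults
    GBTClassifier.__init__(self, featuresCol=featuresCol, labelCol=labelCol, predictionCol=predictionCol,
                 maxDepth=maxDepth, maxBins=maxBins, minInstancesPerNode=minInstancesPerNode, minInfoGain=minInfoGain,
                 maxMemoryInMB=maxMemoryInMB, cacheNodeIds=cacheNodeIds, checkpointInterval=checkpointInterval, lossType=lossType,
                 maxIter=maxIter, stepSize=stepSize, seed=seed)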
  def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                 maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
                 maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="logistic",
                 maxIter=20, stepSize=0.1, seed=None, rawPredictionCol="rawPrediction"):
        return self

classifier = newClassifier().setLabelCol("Cover_Type").setMaxDepth(30).setMaxIter(10)

ovr = OneVsRest(classifier = classifier).setLabelCol("Cover_Type")

evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol=ovr.getLabelCol())
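
The one-vs-rest reduction can be sketched in a few lines of plain Python: each class gets its own 0/1 relabelling of the data for a binary model, and at prediction time the class whose model is most confident wins. The scores below are made up and the helper names are illustrative; Spark's `OneVsRest` handles both steps internally.

```python
def binarize(labels, positive_class):
    """Relabel a multiclass target as 1 for one class and 0 for all others."""
    return [1 if y == positive_class else 0 for y in labels]

def one_vs_rest_predict(scores_by_class):
    """Pick the class whose binary model gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)

cover_types = [6, 2, 2, 1, 7]
# One binary training target per distinct class
targets = {cls: binarize(cover_types, cls) for cls in sorted(set(cover_types))}
print(targets[2])  # [0, 1, 1, 0, 0]

# Hypothetical confidence of each per-class model for a single row
print(one_vs_rest_predict({1: 0.10, 2: 0.75, 6: 0.05, 7: 0.20}))  # 2
```

The price of this reduction is one binary model per class, so training time grows with the number of cover types.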

Now, we are going to create a pipeline that will chain the vector assembler and the classifier stages.

In [46]:
from pyspark.ml import Pipeline

# Chain the vector assembler and the classification model
pipeline = Pipeline(stages=[vector_assembler, ovr])


# Run stages in pipeline with the train data
model = pipeline.fit(train_data_to_evaluate)

Once we have trained the classifier, we can use it to make predictions on the validation split.

In [48]:
# Make predictions on testData
predictions = model.transform(test_data_to_evaluate)


evaluator.evaluate(predictions)

We see that this model improves the result. We now train it on the whole training dataset and use it for the final submission.

In [50]:
# Run stages in pipeline with all the train data
model = pipeline.fit(train_data)

In [51]:
# Make predictions on testData
predictions = model.transform(test_data)

predictions = predictions.withColumn("Cover_Type", predictions["prediction"].cast("int")) 

Finally, we can create a file with the predictions.

In [53]:
# Select columns Id and prediction
(predictions
 .repartition(1)
 .select('Id', 'Cover_Type')
 .write
 .format('com.databricks.spark.csv')
 .options(header='true')
 .mode('overwrite')
 .save('/FileStore/kaggle-submission'))

To be able to download the predictions file, we need its name (`part-*.csv`):

In [55]:
display(dbutils.fs.ls("dbfs:/FileStore/kaggle-submission"))

path,name,size
dbfs:/FileStore/kaggle-submission/_committed_1371024487777017041,_committed_1371024487777017041,201
dbfs:/FileStore/kaggle-submission/_committed_2251705317318529424,_committed_2251705317318529424,203
dbfs:/FileStore/kaggle-submission/_committed_4502625633671811995,_committed_4502625633671811995,200
dbfs:/FileStore/kaggle-submission/_committed_5768524907528976268,_committed_5768524907528976268,201
dbfs:/FileStore/kaggle-submission/_committed_695739710580495773,_committed_695739710580495773,201
dbfs:/FileStore/kaggle-submission/_committed_vacuum8768991349236609623,_committed_vacuum8768991349236609623,95
dbfs:/FileStore/kaggle-submission/_started_4502625633671811995,_started_4502625633671811995,0
dbfs:/FileStore/kaggle-submission/part-00000-tid-4502625633671811995-7e7e6d43-8ee6-41f3-bfef-26e582c1f496-39327-c000.csv,part-00000-tid-4502625633671811995-7e7e6d43-8ee6-41f3-bfef-26e582c1f496-39327-c000.csv,2039369


Files stored in /FileStore are accessible in your web browser via `https://<databricks-instance-name>.cloud.databricks.com/files/`.
  
For this example:

https://community.cloud.databricks.com/files/kaggle-submission/part-*.csv?o=######

where `part-*.csv` should be replaced by the name displayed in your system, and the number after `o=` is the same as in your Community Edition URL.
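
Picking out the file name and assembling the link can also be done programmatically. The sketch below works on a plain list of names such as the `name` column shown above; the helper names, and passing the `o=` value as a string, are illustrative assumptions.

```python
from fnmatch import fnmatch

def find_part_file(names):
    """Return the single part-*.csv entry from a directory listing."""
    matches = [name for name in names if fnmatch(name, "part-*.csv")]
    if len(matches) != 1:
        raise ValueError("expected exactly one part-*.csv file, found {}".format(len(matches)))
    return matches[0]

def download_url(file_name, org_id, folder="kaggle-submission"):
    """Build the Community Edition download link for a file under /FileStore."""
    return "https://community.cloud.databricks.com/files/{}/{}?o={}".format(folder, file_name, org_id)

names = [
    "_committed_4502625633671811995",
    "_started_4502625633671811995",
    "part-00000-tid-4502625633671811995-7e7e6d43-8ee6-41f3-bfef-26e582c1f496-39327-c000.csv",
]
# "######" stands for the number after o= in your own Community Edition URL
print(download_url(find_part_file(names), "######"))
```

In the notebook, the list of names could be obtained with `[f.name for f in dbutils.fs.ls("dbfs:/FileStore/kaggle-submission")]`.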


Finally, we can upload the predictions to Kaggle and check the performance.

We get a score of `0.93950` on Kaggle (better than the one previously submitted, because the model is now retrained on the whole training dataset).

We are limited by the models provided by the PySpark libraries and by the computing time the cluster can handle.
But we can improve the models by reimplementing them (as with `GBTClassifier`) and by using parallelized cross-validation.