# Observed whale species predictor

## Introduction

### About
The majority of whales spotted by American offshore whalers between 1820 and 1860 were recorded with the species given simply as "whale". This likely meant a baleen whale, as opposed to a toothed whale such as a dolphin or a killer whale.

This notebook develops an ML model that attempts to label these baleen whales of unknown species, based on the very limited data for whales whose species was identified.

### Source data
American Offshore Whaling Logbook - obtained from https://whalinghistory.org/av/logs/ in October 2024

### Environment and method
The ML tools available in Spark, used from Scala in a notebook running an Almond kernel.


In [1]:
// creating the spark session
import $ivy.`org.apache.spark::spark-sql:2.4.0`
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .appName("Spark Scala Application")
    .master("local")
    .getOrCreate()

// loading the source data
val df = spark.read
    .format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("AmericanOffshoreWhalingLogbookData/aowl_20240403.txt")

import $ivy.$
import org.apache.spark.sql.SparkSession
spark: SparkSession = org.apache.spark.sql.SparkSession@5734309d
df: org.apache.spark.sql.package.DataFrame = [sequence: string, VoyageID: string ... 12 more fields]

## Preprocessing

### Data range
As determined by visual exploration, the majority of observations were made between 1820 and 1860.

We will limit the data to the period from 01 January 1820 to 31 December 1860.
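
The cells shown in this notebook do not apply this limit (the previews further down still contain records from 1875). A minimal sketch of such a filter follows, assuming Spark is on the classpath as in the first cell. The four-row sample and the names `session`, `sample` and `inRange` are hypothetical; `Year` is cast from string because the CSV is loaded without a schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the running session if one exists, otherwise starts a local one.
val session = SparkSession.builder()
  .appName("year-filter-sketch")
  .master("local[1]")
  .getOrCreate()
import session.implicits._

// Hypothetical miniature of the logbook; Year is a string, as in the loaded CSV.
val sample = Seq(
  ("1819", "Sperm"),
  ("1820", "Right"),
  ("1860", "Whale"),
  ("1875", "Sperm")
).toDF("Year", "Species")

// between() is inclusive on both ends, so whole years 1820..1860 are kept.
val inRange = sample.where(col("Year").cast("int").between(1820, 1860))

val keptYears = inRange.select("Year").as[String].collect().sorted
```

Filtering on whole years is equivalent to the 01 January 1820 to 31 December 1860 interval, so the `Day` and `Month` columns do not need to be consulted.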

### Variables
Column descriptions are provided at https://whalinghistory.org/av/logs/aowl/columns/

Note that the column descriptions might not be completely accurate, as discovered during initial data exploration.

#### Dependent - Y
The dependent variable is Species.

#### Independent - X1, X2, ...
The remaining variables are treated as independent, but the set is narrowed down further below.

Variables initially eliminated as holding no information relevant for training the model:
- sequence - metadata, not relevant
- VoyageID - ID of the voyage that made the observation; should not be related to the type of whale spotted beyond its correlation with location and date
- Encounter - type of encounter; on the assumption that all whales not classified as "whale" were properly identified, this holds no information
- NStruck - number of specimens struck; not relevant to identification, as the likelihood of a hit on each species is too hard to quantify on such a small sample
- NTried - number of specimens processed; as with NStruck, not relevant
- Place - name of the geographical location; not needed, as longitude and latitude are already given
- Source - source database, not relevant
- Remarks - too hard to quantify and not even described in the documentation; likely holds no useful information


In [2]:
val df_after_col_drop = df.drop("sequence", "VoyageID", "Encounter", "NStruck", "NTried", "Place", "Source", "Remarks")
df_after_col_drop.show()

+------+-------+---+-----+----+-------+
|   Lat|    Lon|Day|Month|Year|Species|
+------+-------+---+-----+----+-------+
|41.640|-70.930|  9|    8|1875|   NULL|
|40.750|-70.750| 10|    8|1875|   NULL|
|39.660|-70.330| 11|    8|1875|   NULL|
|39.260|-69.410| 12|    8|1875|   NULL|
|38.800|-67.830| 13|    8|1875|   NULL|
|37.630|-63.660| 14|    8|1875|   NULL|
|35.460|-60.750| 15|    8|1875|  Sperm|
|35.960|-60.080| 16|    8|1875|   NULL|
|36.060|-57.500| 17|    8|1875|   NULL|
|35.580|-55.410| 18|    8|1875|   NULL|
|35.700|-53.660| 19|    8|1875|   NULL|
|35.960|-51.580| 20|    8|1875|   NULL|
|36.580|-51.500| 21|    8|1875|   NULL|
|37.280|-51.410| 22|    8|1875|   NULL|
|36.630|-51.330| 23|    8|1875|   NULL|
|36.460|-61.250| 24|    8|1875|   NULL|
|36.960|-51.530| 25|    8|1875|   NULL|
|37.000|-51.330| 26|    8|1875|   NULL|
|36.300|-49.830| 27|    8|1875|   NULL|
|36.020|-54.510| 28|    8|1875|   NULL|
+------+-------+---+-----+----+-------+
only showing top 20 rows



df_after_col_drop: org.apache.spark.sql.package.DataFrame = [Lat: string, Lon: string ... 4 more fields]

### Learning data
The model can only be trained on records where Species is specified, so only such records are kept as learning data.\
The labels "Killers" and "Killer" are unified as "Killer".

In [3]:
import org.apache.spark.sql.functions.{col, when}
val df_learning = df_after_col_drop
    .where("UPPER(Species) != 'WHALE'")
    .where("Species IS NOT NULL")
    .where("UPPER(Species) != 'NULL'")
    .select(
        col("Lat").cast("float").alias("Lat"),
        col("Lon").cast("float").alias("Lon"),
        col("Day").cast("int").alias("Day"),
        col("Month").cast("int").alias("Month"),
        col("Year").cast("int").alias("Year"),
        when(col("Species") === "Killers", "Killer").otherwise(col("Species")).alias("Species")
        )
df_learning.show()

+------+------+---+-----+----+--------+
|   Lat|   Lon|Day|Month|Year| Species|
+------+------+---+-----+----+--------+
| 35.46|-60.75| 15|    8|1875|   Sperm|
| 35.56|-49.71|  1|    9|1875|   Sperm|
| 35.18|-49.62|  2|    9|1875|   Sperm|
| 35.18|-49.62|  2|    9|1875|   Sperm|
| 35.53|-49.53|  3|    9|1875|   Sperm|
| 36.43|-50.11|  6|    9|1875|   Sperm|
| 37.13| -42.9| 15|    9|1875|   Sperm|
| 20.95|-21.45| 31|   10|1875|   Sperm|
|  9.46|-22.33| 14|   11|1875|   Pilot|
| -5.74| -30.0| 24|   11|1875|   Sperm|
|  -7.0|-30.93| 25|   11|1875|   Sperm|
| -16.4|-34.75|  3|   12|1875|   Sperm|
| -18.5|-33.83|  6|   12|1875|   Sperm|
|-24.37| -36.0| 11|   12|1875|   Sperm|
|-30.87|-44.25| 18|   12|1875|   Sperm|
|-31.43|-44.83| 25|    2|1876|   Pilot|
| -7.33| 12.56| 13|    6|1876|Humpback|
| -7.26| 12.43| 15|    6|1876|Humpback|
| -7.23| 12.37| 16|    6|1876|Humpback|
| -7.19|  12.3| 17|    6|1876|Humpback|
+------+------+---+-----+----+--------+
only showing top 20 rows



import org.apache.spark.sql.functions.{col, when}
df_learning: org.apache.spark.sql.package.DataFrame = [Lat: float, Lon: float ... 4 more fields]

Comparing the number of records in the full loaded dataset with the number of identified-species records kept for learning (neither count is restricted to 1820-1860 in the cells shown):

In [4]:
val all_count = df.count()
val identified_count = df_learning.count()
printf("%d records total", all_count)
println()
printf("%d identified species records", identified_count)

467053 records total
53744 identified species records

all_count: Long = 467053L
identified_count: Long = 53744L
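
As a quick worked check of the two counts printed above, the identified-species records make up roughly 11.5% of the loaded file. A minimal sketch in plain Scala; the variable names are illustrative and the counts are copied from the output above.

```scala
// Counts copied from the notebook output above.
val totalRecords = 467053L
val identifiedRecords = 53744L

// Fraction of records with an identified species (about 0.115).
val identifiedShare = identifiedRecords.toDouble / totalRecords

// Everything else: unidentified "whale" entries plus records with no species at all.
val remainingRecords = totalRecords - identifiedRecords
```

Note that the remainder mixes "whale" entries with rows where no species was recorded, so it is an upper bound on the number of records the model would eventually be asked to label.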

Importing necessary ML tools:

In [5]:
import $ivy.`org.apache.spark::spark-mllib:2.4.0`

import $ivy.$

Further data preprocessing:

In [6]:
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}
val assembler = new VectorAssembler()
  .setInputCols(Array("Lat", "Lon", "Day", "Month", "Year"))
  .setOutputCol("features")
val label_indexer = new StringIndexer()
  .setInputCol("Species")
  .setOutputCol("indexed_label")
  .fit(df_learning)

import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}
assembler: VectorAssembler = vecAssembler_9f68eb6913d3
label_indexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_89e3789febc0

In [7]:
val Array(train_data, test_data) = df_learning.randomSplit(Array(0.8, 0.2), seed = 2137)

train_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Lat: float, Lon: float ... 4 more fields]
test_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Lat: float, Lon: float ... 4 more fields]

### Decision Tree
The first model is a simple decision tree classifier:

In [53]:
import org.apache.spark.ml.classification.DecisionTreeClassifier
val simple_tree_model = new DecisionTreeClassifier()
    .setLabelCol("indexed_label")
    .setFeaturesCol("features")

import org.apache.spark.ml.classification.DecisionTreeClassifier
simple_tree_model: DecisionTreeClassifier = dtc_ad384cd11f1e

In [54]:
import org.apache.spark.ml.Pipeline
val simple_tree_pipeline = new Pipeline()
    .setStages(Array(label_indexer, assembler, simple_tree_model))

import org.apache.spark.ml.Pipeline
simple_tree_pipeline: Pipeline = pipeline_d586632c2d7d

In [55]:
val trained_simple_tree = simple_tree_pipeline.fit(train_data)

trained_simple_tree: org.apache.spark.ml.PipelineModel = pipeline_d586632c2d7d

In [56]:
val predictions_simple_tree = trained_simple_tree.transform(test_data)
predictions_simple_tree.show()

+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|   Lat|    Lon|Day|Month|Year| Species|indexed_label|            features|       rawPrediction|         probability|prediction|
+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|-61.08|  52.91|  3|    1|1856|    Blue|          9.0|[-61.080001831054...|[157.0,387.0,109....|[0.21988795518207...|       1.0|
|-58.53|  -68.8|  9|   12|1856|Humpback|          5.0|[-58.529998779296...|[157.0,387.0,109....|[0.21988795518207...|       1.0|
|-57.36| -67.23| 31|    1|1844|   Sperm|          0.0|[-57.360000610351...|[358.0,1430.0,78....|[0.18832193582325...|       1.0|
|-57.35| -67.41| 13|    5|1844|   Sperm|          0.0|[-57.349998474121...|[358.0,1430.0,78....|[0.18832193582325...|       1.0|
|-57.28| -71.58| 25|    9|1834|   Right|          1.0|[-57.279998779296...|[443.0,4254.0,259...|[

predictions_simple_tree: org.apache.spark.sql.package.DataFrame = [Lat: float, Lon: float ... 9 more fields]

In [57]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexed_label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions_simple_tree)
printf("Test Accuracy = %f", accuracy)

Test Accuracy = 0,734957

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
evaluator: MulticlassClassificationEvaluator = mcEval_c496c674f20b
accuracy: Double = 0.7349567506581421

### Logistic Regression

In [58]:
import org.apache.spark.ml.classification.LogisticRegression
val logistic_regression_model = new LogisticRegression()
    .setLabelCol("indexed_label")
    .setFeaturesCol("features")
    .setMaxIter(10)
    .setRegParam(0.3)
    .setElasticNetParam(0.8)
val logistic_regression_pipeline = new Pipeline()
    .setStages(Array(label_indexer, assembler, logistic_regression_model))

val trained_logistic_regression = logistic_regression_pipeline.fit(train_data)
val predictions_logistic_regression = trained_logistic_regression.transform(test_data)
predictions_logistic_regression.show()

printf("Test Accuracy = %f", evaluator.evaluate(predictions_logistic_regression))

+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|   Lat|    Lon|Day|Month|Year| Species|indexed_label|            features|       rawPrediction|         probability|prediction|
+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|-61.08|  52.91|  3|    1|1856|    Blue|          9.0|[-61.080001831054...|[3.59257498254686...|[0.45487831646387...|       0.0|
|-58.53|  -68.8|  9|   12|1856|Humpback|          5.0|[-58.529998779296...|[3.59257498254686...|[0.45487831646387...|       0.0|
|-57.36| -67.23| 31|    1|1844|   Sperm|          0.0|[-57.360000610351...|[3.59257498254686...|[0.45487831646387...|       0.0|
|-57.35| -67.41| 13|    5|1844|   Sperm|          0.0|[-57.349998474121...|[3.59257498254686...|[0.45487831646387...|       0.0|
|-57.28| -71.58| 25|    9|1834|   Right|          1.0|[-57.279998779296...|[3.59257498254686...|[

Test Accuracy = 0,457503

import org.apache.spark.ml.classification.LogisticRegression
logistic_regression_model: LogisticRegression = logreg_0ebfc0959fb9
logistic_regression_pipeline: Pipeline = pipeline_56d545732032
trained_logistic_regression: org.apache.spark.ml.PipelineModel = pipeline_56d545732032
predictions_logistic_regression: org.apache.spark.sql.package.DataFrame = [Lat: float, Lon: float ... 9 more fields]
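
Every row displayed above receives prediction 0.0 with the same probability vector, and `StringIndexer` by default assigns index 0.0 to the most frequent label. That pattern is what a model predicting only the most frequent class would produce, although the truncated preview cannot confirm this for the whole test set. One way to check is to compare the test accuracy with the share of the most frequent label. The sketch below illustrates the comparison in plain Scala on a hypothetical eight-row label list; `accuracyOf` and `majorityShare` are illustrative helpers, not part of Spark.

```scala
// Hypothetical indexed labels for a tiny test set; 0.0 is the most frequent class.
val toyLabels: Seq[Double] = Seq(0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0)

// Fraction of predictions that match their label.
def accuracyOf(labels: Seq[Double], predictions: Seq[Double]): Double = {
  require(labels.nonEmpty && labels.size == predictions.size)
  labels.zip(predictions).count { case (l, p) => l == p }.toDouble / labels.size
}

// Share of the most frequent label: the accuracy of a model that ignores its features.
def majorityShare(labels: Seq[Double]): Double =
  labels.groupBy(identity).values.map(_.size).max.toDouble / labels.size

// A classifier that always answers 0.0 scores exactly the majority share.
val constantPredictions = Seq.fill(toyLabels.size)(0.0)
val constantAccuracy = accuracyOf(toyLabels, constantPredictions)
val baseline = majorityShare(toyLabels)
```

If the share of the most frequent species in `test_data` turned out to equal the reported accuracy, the model would be adding nothing beyond the class frequencies.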

### Random Forest

In [59]:
import org.apache.spark.ml.classification.RandomForestClassifier
val rf_model = new RandomForestClassifier()
  .setLabelCol("indexed_label")
  .setFeaturesCol("features")

val rf_pipeline = new Pipeline()
  .setStages(Array(label_indexer, assembler, rf_model))

val trained_rf = rf_pipeline.fit(train_data)
val predictions_rf = trained_rf.transform(test_data)
predictions_rf.show()

printf("Test Accuracy = %f", evaluator.evaluate(predictions_rf))

+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|   Lat|    Lon|Day|Month|Year| Species|indexed_label|            features|       rawPrediction|         probability|prediction|
+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|-61.08|  52.91|  3|    1|1856|    Blue|          9.0|[-61.080001831054...|[4.09687564125372...|[0.20484378206268...|       1.0|
|-58.53|  -68.8|  9|   12|1856|Humpback|          5.0|[-58.529998779296...|[5.06068995784005...|[0.25303449789200...|       1.0|
|-57.36| -67.23| 31|    1|1844|   Sperm|          0.0|[-57.360000610351...|[4.34246631828857...|[0.21712331591442...|       1.0|
|-57.35| -67.41| 13|    5|1844|   Sperm|          0.0|[-57.349998474121...|[4.34246631828857...|[0.21712331591442...|       1.0|
|-57.28| -71.58| 25|    9|1834|   Right|          1.0|[-57.279998779296...|[3.23384173472190...|[

Test Accuracy = 0,741162

import org.apache.spark.ml.classification.RandomForestClassifier
rf_model: RandomForestClassifier = rfc_79f56c96fd22
rf_pipeline: Pipeline = pipeline_820263d28758
trained_rf: org.apache.spark.ml.PipelineModel = pipeline_820263d28758
predictions_rf: org.apache.spark.sql.package.DataFrame = [Lat: float, Lon: float ... 9 more fields]

### Neural network

In [63]:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
val nn_model = new MultilayerPerceptronClassifier()
    .setLabelCol("indexed_label")
    .setFeaturesCol("features")
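    .setLayers(
        // input layer, two hidden layers, output layer (one node per species label)
        Array[Int](
            assembler.getInputCols.length,
            assembler.getInputCols.length * 2,
            // integer division already truncates, so no rounding call is needed
            (assembler.getInputCols.length * 2 + label_indexer.labels.length) / 2,
            label_indexer.labels.length
            )
        )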
    .setBlockSize(256)
    .setSeed(2137)
    .setMaxIter(100)

val nn_pipeline = new Pipeline()
  .setStages(Array(label_indexer, assembler, nn_model))

val trained_nn = nn_pipeline.fit(train_data)
val predictions_nn = trained_nn.transform(test_data)
predictions_nn.show()

printf("Test Accuracy = %f", evaluator.evaluate(predictions_nn))

+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|   Lat|    Lon|Day|Month|Year| Species|indexed_label|            features|       rawPrediction|         probability|prediction|
+------+-------+---+-----+----+--------+-------------+--------------------+--------------------+--------------------+----------+
|-61.08|  52.91|  3|    1|1856|    Blue|          9.0|[-61.080001831054...|[3.65341071087298...|[0.45320246628877...|       0.0|
|-58.53|  -68.8|  9|   12|1856|Humpback|          5.0|[-58.529998779296...|[3.65341071087298...|[0.45320246628877...|       0.0|
|-57.36| -67.23| 31|    1|1844|   Sperm|          0.0|[-57.360000610351...|[3.65341071087298...|[0.45320246628877...|       0.0|
|-57.35| -67.41| 13|    5|1844|   Sperm|          0.0|[-57.349998474121...|[3.65341071087298...|[0.45320246628877...|       0.0|
|-57.28| -71.58| 25|    9|1834|   Right|          1.0|[-57.279998779296...|[3.65341071087298...|[

Test Accuracy = 0,457503

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
nn_model: MultilayerPerceptronClassifier = mlpc_d668b1db4a1e
nn_pipeline: Pipeline = pipeline_4d6550424a67
trained_nn: org.apache.spark.ml.PipelineModel = pipeline_4d6550424a67
predictions_nn: org.apache.spark.sql.package.DataFrame = [Lat: float, Lon: float ... 9 more fields]
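
The layer sizes passed to `setLayers` follow a simple rule: one input node per feature, a first hidden layer twice that size, a second hidden layer halfway between the first hidden layer and the output, and one output node per label. The sketch below restates that rule in plain Scala. The helper `layerSizes` is illustrative, and the label count of 13 is a hypothetical value, since the output above only shows that there are at least 10 labels (index 9.0 appears).

```scala
// Mirrors the layer-size rule of the cell above without needing Spark.
def layerSizes(nFeatures: Int, nLabels: Int): Array[Int] = {
  val firstHidden = nFeatures * 2
  // Int / Int truncates toward zero, so an odd sum is rounded down.
  val secondHidden = (firstHidden + nLabels) / 2
  Array(nFeatures, firstHidden, secondHidden, nLabels)
}

// Five features (Lat, Lon, Day, Month, Year) and a hypothetical 13 labels.
val sizes = layerSizes(nFeatures = 5, nLabels = 13)
```

With these inputs the second hidden layer gets 11 nodes, because (10 + 13) / 2 truncates 11.5 down to 11.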

### Preliminary results
In terms of test accuracy, the tree-based models clearly outperform the other two: the decision tree reaches 0.735 and the random forest 0.741, while logistic regression and the neural network both score 0.458.\
As a random forest is an ensemble of decision trees, further fine-tuning and tests will be performed on the random forest model.
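
One possible shape for that fine-tuning is a small grid search with cross-validation, sketched below using Spark ML's `ParamGridBuilder` and `CrossValidator`. This is an assumption about how tuning could be done, not the method used in this notebook. The synthetic two-species data, the grid values, the fold count and all `toy`-prefixed names are hypothetical, and Spark is assumed to be on the classpath as in the earlier cells.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val session = SparkSession.builder()
  .appName("rf-tuning-sketch")
  .master("local[1]")
  .getOrCreate()
import session.implicits._

// Synthetic, trivially separable data: one species in the north-west, one in the south-east.
val toy = (0 until 200).map { i =>
  val north = i % 2 == 0
  val lat = if (north) 40.0 + i % 7 else -40.0 - i % 7
  val lon = if (north) -60.0 - i % 5 else 60.0 + i % 5
  (lat, lon, if (north) "Right" else "Sperm")
}.toDF("Lat", "Lon", "Species")

val toyIndexer = new StringIndexer().setInputCol("Species").setOutputCol("indexed_label")
val toyAssembler = new VectorAssembler().setInputCols(Array("Lat", "Lon")).setOutputCol("features")
val toyForest = new RandomForestClassifier()
  .setLabelCol("indexed_label")
  .setFeaturesCol("features")
  .setSeed(2137)
val toyPipeline = new Pipeline().setStages(Array(toyIndexer, toyAssembler, toyForest))

// 2 x 2 grid over tree count and depth; the values are placeholders.
val toyGrid = new ParamGridBuilder()
  .addGrid(toyForest.numTrees, Array(10, 30))
  .addGrid(toyForest.maxDepth, Array(3, 6))
  .build()

val toyEvaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexed_label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

// Each grid point is scored by its mean accuracy over 3 folds.
val toyCv = new CrossValidator()
  .setEstimator(toyPipeline)
  .setEvaluator(toyEvaluator)
  .setEstimatorParamMaps(toyGrid)
  .setNumFolds(3)
  .setSeed(2137)

val toyCvModel = toyCv.fit(toy)
val bestAvgAccuracy = toyCvModel.avgMetrics.max
```

On the real data the same structure could be fitted on `train_data` only, keeping `test_data` aside for a final accuracy check of the selected model.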