# Vessel Type Classification

The goal of this project is to classify vessel types using AIS data and to become familiar with the PySpark and Plotly packages.

The data was acquired from [MarineCadastre.gov](https://marinecadastre.gov/ais/).

The GitHub repository for this project can be found [here](https://github.com/nicksento/Vessel-Type-Classification).

# Contents
[1.Data manipulation](#manipulation)    
&nbsp;&nbsp;&nbsp;[1.1.Import Libraries](#import)        
&nbsp;&nbsp;&nbsp;[1.2.Static Dataset](#static)       
&nbsp;&nbsp;&nbsp;[1.3.Null Values](#null)  
&nbsp;&nbsp;&nbsp;[1.4.AIS Type Summary Dataset](#type)    
&nbsp;&nbsp;&nbsp;[1.5.Dynamic Dataset](#dynamic)       
&nbsp;&nbsp;&nbsp;[1.6.Merge Datasets](#merge)    
&nbsp;&nbsp;&nbsp;[1.7.Feature Engineering](#feat)    
[2.Data visualization](#visual)       
&nbsp;&nbsp;&nbsp;[2.1.Outliers](#out)     
[3.Prepare the Data for Machine Learning Algorithms](#prepare)     
&nbsp;&nbsp;&nbsp;[3.1.Label Encoder](#enco)      
&nbsp;&nbsp;&nbsp;[3.2.Stratified Shuffle Split](#strat)  
[4.Train and Evaluate Machine Learning Models](#train)     
&nbsp;&nbsp;&nbsp;[4.1.Decision Tree Classifier](#dtc)      
&nbsp;&nbsp;&nbsp;[4.2.Support Vector Classifier](#scv)      
&nbsp;&nbsp;&nbsp;[4.3.K Neighbors Classifier](#knc)      
&nbsp;&nbsp;&nbsp;[4.4.Extra Tree Classifier](#exc)      
&nbsp;&nbsp;&nbsp;[4.5.Extra Trees Classifier](#etc)      
&nbsp;&nbsp;&nbsp;[4.6.Label Propagation Classifier](#lpc)       
&nbsp;&nbsp;&nbsp;[4.7.Multi-layer Perceptron classifier](#mlp)       
&nbsp;&nbsp;&nbsp;[4.8.Gradient Boosting Classifier](#gbc)            
[5.Conclusion](#end)

# Data manipulation <a id='manipulation'></a>

### Import Libraries <a id='import'></a>

In [1]:
import pandas as pd
import numpy as np

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import *

In [3]:
spark = SparkSession \
    .builder \
    .appName("Maritime") \
    .getOrCreate()

We will combine three separate datasets into the final dataset used for model training and prediction.
- The **first** dataset contains the ship type, which is our target value, and several dimensional features that will be used to train the model.
- The **second** dataset maps the 'shiptype' codes of the first dataset to an 'ais_type_summary', which gives a better understanding of the vessels we are examining and lets us group the different ship types.
- The **third** dataset contains the speed, position and direction of the vessels. We will use only the speed feature from this dataset.

### Static dataset <a id='static'></a>

In [4]:
# load the data
static_df = spark.read.options(header=True,
                               inferSchema=True) \
                               .csv('Data/nari_static.csv')

In [5]:
static_df.limit(5).toPandas().head()

Unnamed: 0,sourcemmsi,imonumber,callsign,shipname,shiptype,tobow,tostern,tostarboard,toport,eta,draught,destination,mothershipmmsi,t
0,304091000,9509255,V2GU5,HC JETTE-MARIT,70,130,30,18,6,04-09 20:00,10.1,BREST,,1443650423
1,228037600,0,FIHX,AEROUANT BREIZH,30,6,9,5,2,00-00 24:60,0.0,,,1443650457
2,228064900,8304816,FITO,VN SAPEUR,51,21,54,10,6,29-09 12:00,5.9,RADE DE BREST,,1443650471
3,227705102,262144,FGD5860,BINDY,60,9,26,5,4,00-00 24:60,0.0,,,1443650474
4,227415000,0,FHAF,F/V JEREMI SIMON,90,11,9,3,3,00-00 24:60,0.0,,,1443650479


Attribute | Data type | Description
--- | --- | ---
sourcemmsi | integer | MMSI identifier for vessel
imo | integer | IMO ship identification number (7 digits)
callsign | text | International radio call sign (max 7 characters), assigned to the vessel by its country of registry
shipname | text | Name of the vessel (max 20 characters)
shiptype | integer | Code for the type of the vessel (see enumeration)
to_bow | integer | Distance (meters) to bow
to_stern | integer | Distance (meters) to stern; to_bow + to_stern = LENGTH of the vessel
to_starboard | integer | Distance (meters) to starboard, i.e., right side of the vessel; to_port + to_starboard = BEAM at the vessel's nominal waterline
to_port | integer | Distance (meters) to port, i.e., left side of the vessel
eta | text | ETA (estimated time of arrival) in format dd-mm hh:mm (day, month, hour, minute), UTC time zone
draught | double precision | Draught of the vessel (allowed values: 0.1-25.5 meters)
destination | text | Destination of this trip (manually entered)
mothershipmmsi | integer | MMSI of the mothership (for auxiliary craft)
t | bigint | Timestamp in UNIX epochs

Our target value is 'shiptype'. We are going to use 'sourcemmsi' as an id and the columns 'tobow', 'tostern', 'tostarboard', 'toport' and 'draught' as features to train our model and predict the ship type.

In [6]:
# dataframe shape
print((static_df.count(), len(static_df.columns)))

(1078617, 14)


We are going to keep only one record per ship and the columns 'sourcemmsi', 'shiptype', 'tobow', 'tostern', 'tostarboard', 'toport' and 'draught'.

In [7]:
# drop duplicates
dropDF = static_df.dropDuplicates(["sourcemmsi"])

In [8]:
static_df1 = dropDF.select(['sourcemmsi', 'shiptype', 'tobow',
                     'tostern', 'tostarboard', 'toport', 'draught'])
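To our knowledge, `dropDuplicates` does not guarantee which of a vessel's records is kept. If a deterministic choice matters, one option is to keep the most recent report per vessel. The following is a minimal sketch in pandas on made-up rows (the values and the `latest` name are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Toy static messages: two reports for the first vessel, one for the second.
static = pd.DataFrame({
    "sourcemmsi": [304091000, 304091000, 228037600],
    "shiptype":   [70, 70, 30],
    "draught":    [10.1, 9.8, 0.0],
    "t":          [1443650423, 1443650999, 1443650457],
})

# Sort by timestamp and keep the last (most recent) report per vessel.
latest = (static.sort_values("t")
                .drop_duplicates(subset="sourcemmsi", keep="last")
                .reset_index(drop=True))
print(latest)
```

The same idea could be expressed in PySpark with a window function ordered by `t`, at the cost of an extra shuffle.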

### Null values  <a id='null'></a>

We are going to check for null values.

In [9]:
static_df1.toPandas().isna().sum()

sourcemmsi       0
shiptype       212
tobow          212
tostern        212
tostarboard    212
toport         212
draught        401
dtype: int64

We will drop the instances for which the target value has null values.

In [10]:
static_df1 = static_df1.filter(static_df1.shiptype.isNotNull())

In [11]:
static_df1.toPandas().isna().sum()

sourcemmsi       0
shiptype         0
tobow            0
tostern          0
tostarboard      0
toport           0
draught        189
dtype: int64

There are still 189 values missing from feature 'draught'. These are going to be imputed later. 

In [12]:
uni = static_df1.toPandas().shiptype.unique()
uni.sort()
print(uni)

[ 0 20 30 31 32 33 34 35 36 37 40 50 51 52 54 55 57 59 60 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 89 90 91 92 94 95 96 99]


Records with a 'shiptype' of 0 will be assigned the value 15, which corresponds to the Unspecified type, as we will see below.

In [13]:
from pyspark.sql.functions import when
static_df2 = static_df1.withColumn("shiptype", \
              when(static_df1["shiptype"] == 0, 15).otherwise(static_df1["shiptype"]))

In [14]:
static_df2.limit(3).toPandas().head()

Unnamed: 0,sourcemmsi,shiptype,tobow,tostern,tostarboard,toport,draught
0,357851000,80,120,26,7,17,9.8
1,227312180,30,7,8,3,3,2.0
2,250001396,70,80,10,7,7,5.7


In [15]:
static_df2.toPandas().shiptype.nunique()

43

There are 43 different values for the target. To reduce that number we use another dataset, which maps ranges of 'shiptype' values to an AIS type summary.

### AIS Type Summary Dataset  <a id='type'></a>

In [16]:
ship_types_df = pd.read_csv("Data/Ship Types List.csv")

In [17]:
ship_types_df.head(38)

Unnamed: 0,id_shiptype,shiptype_min,shiptype_max,type_name,ais_type_summary
0,1,10,19,Reserved,Unspecified
1,2,20,28,Wing In Grnd,Wing in Grnd
2,3,29,29,SAR Aircraft,Search and Rescue
3,4,30,30,Fishing,Fishing
4,5,31,31,Tug,Tug
5,6,32,32,Tug,Tug
6,7,33,33,Dredger,Special Craft
7,8,34,34,Dive Vessel,Special Craft
8,9,35,35,Military Ops,Special Craft
9,10,36,36,Sailing Vessel,Sailing Vessel


In [18]:
ship_types_df.ais_type_summary.unique()

array(['Unspecified', 'Wing in Grnd', 'Search and Rescue', 'Fishing',
       'Tug', 'Special Craft', 'Sailing Vessel', 'Pleasure Craft',
       'High-Speed Craft', 'Passenger', 'Cargo', 'Tanker', 'Other'],
      dtype=object)

In [19]:
ship_types_df.ais_type_summary.nunique()

13

There are 13 distinct types of ships. We will add a new column to the static dataset that holds the 'ais_type_summary' value corresponding to each 'shiptype' value.

In [20]:
# Add an empty column
static_df2 = static_df2.withColumn("type_summary", lit(None).cast(StringType()))

In [21]:
#static_pandas_df = static_df1.select('shiptype', 'type_summary').toPandas()
static_pandas_df = static_df2.toPandas()

In [22]:
for i,j in enumerate(static_pandas_df.shiptype):
    for k,(l,m) in enumerate(zip(ship_types_df.shiptype_min, ship_types_df.shiptype_max)):        
        if ((l<=j<=m)):
            static_pandas_df.loc[i,'type_summary']=ship_types_df.ais_type_summary[k]
            break        
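The nested loop above checks every vessel against every range. The same lookup can be sketched as a vectorized interval match with `pd.IntervalIndex`. This is a minimal example on a shortened, made-up copy of the ship type list (the `ranges` and `vessels` frames are assumptions for illustration), and it relies on the ranges not overlapping:

```python
import pandas as pd

# Shortened copy of the ship type ranges, for illustration only.
ranges = pd.DataFrame({
    "shiptype_min": [10, 20, 29, 30, 31],
    "shiptype_max": [19, 28, 29, 30, 32],
    "ais_type_summary": ["Unspecified", "Wing in Grnd", "Search and Rescue",
                         "Fishing", "Tug"],
})
vessels = pd.DataFrame({"shiptype": [15, 30, 32, 29, 99]})

# Closed intervals [min, max]; they must not overlap for get_indexer to work.
intervals = pd.IntervalIndex.from_arrays(ranges["shiptype_min"],
                                         ranges["shiptype_max"],
                                         closed="both")

# Position of the matching interval for each code, or -1 when none matches.
pos = intervals.get_indexer(vessels["shiptype"].to_numpy())
summary = ranges["ais_type_summary"].to_numpy()
vessels["type_summary"] = [summary[p] if p >= 0 else None for p in pos]
print(vessels)
```

Codes that fall outside every range (99 in this toy list) are left as missing rather than silently assigned to a category.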

In [23]:
static_pandas_df.head()

Unnamed: 0,sourcemmsi,shiptype,tobow,tostern,tostarboard,toport,draught,type_summary
0,357851000,80,120,26,7,17,9.8,Tanker
1,227312180,30,7,8,3,3,2.0,Fishing
2,250001396,70,80,10,7,7,5.7,Cargo
3,305737000,79,124,14,8,13,6.2,Cargo
4,205554000,79,258,34,17,28,8.6,Cargo


In [24]:
static_pandas_df.isna().sum()

sourcemmsi        0
shiptype          0
tobow             0
tostern           0
tostarboard       0
toport            0
draught         189
type_summary      0
dtype: int64

In [25]:
static_pandas_df.type_summary.unique()

array(['Tanker', 'Fishing', 'Cargo', 'Other', 'Passenger', 'Unspecified',
       'Sailing Vessel', 'Pleasure Craft', 'Special Craft',
       'Search and Rescue', 'Tug', 'High-Speed Craft', 'Wing in Grnd'],
      dtype=object)

### Dynamic Dataset <a id='dynamic'></a>

In [26]:
dynamic_df = spark.read.options(header=True, inferSchema=True).csv('Data/nari_dynamic.csv')

In [27]:
dynamic_df.limit(3).toPandas().head()

Unnamed: 0,sourcemmsi,navigationalstatus,rateofturn,speedoverground,courseoverground,trueheading,lon,lat,t
0,245257000,0,0,0.1,13.1,36,-4.465718,48.38249,1443650402
1,227705102,15,-127,0.0,262.7,511,-4.496571,48.38242,1443650403
2,228131600,15,-127,8.5,263.7,511,-4.644325,48.092247,1443650404


Attribute | Data type | Description
--- | --- | ---
mmsi | integer | MMSI identifier for vessel
status | integer | Navigational status
turn | double precision | Rate of turn, right or left, 0 to 720 degrees per minute
speed | double precision | Speed over ground in knots (allowed values: 0-102.2 knots)
course | double precision | Course over ground (allowed values: 0-359.9 degrees)
heading | integer | True heading in degrees (0-359), relative to true north
lon | double precision | Longitude (georeference: WGS 1984)
lat | double precision | Latitude (georeference: WGS 1984)
t | bigint | Timestamp in UNIX epochs

We will use feature 'speedoverground' to build our model and 'sourcemmsi' to merge the two datasets.

### Merge Datasets <a id='merge'></a>

In [28]:
dynamic_df1 = dynamic_df.select('sourcemmsi','speedoverground')

In [29]:
dynamic_df1 = dynamic_df1.groupby('sourcemmsi').avg('speedoverground')
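For readers more familiar with pandas, the per-vessel averaging above can be sketched as a named aggregation, which also gives the result column a readable name right away. The rows below are made up for illustration:

```python
import pandas as pd

# Toy dynamic messages: two speed reports for one vessel, one for another.
dynamic = pd.DataFrame({
    "sourcemmsi": [245257000, 245257000, 227705102],
    "speedoverground": [0.1, 0.3, 8.5],
})

# One row per vessel with its mean speed over ground.
speed = (dynamic.groupby("sourcemmsi", as_index=False)
                .agg(speed_avg=("speedoverground", "mean")))
print(speed)
```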

In [30]:
static_spark_df = spark.createDataFrame(static_pandas_df)

In [31]:
# note: join() defaults to an inner join, so only vessels present in both datasets are kept
left_join = static_spark_df.join(dynamic_df1, on='sourcemmsi')

In [32]:
left_join.limit(3).toPandas().head()

Unnamed: 0,sourcemmsi,shiptype,tobow,tostern,tostarboard,toport,draught,type_summary,avg(speedoverground)
0,211202460,15,29,89,13,13,11.2,Unspecified,8.275
1,227315190,30,5,9,3,3,,Fishing,6.963392
2,227416000,30,12,7,1,5,0.0,Fishing,2.832258


In [33]:
# dataframe shape
print(left_join.count())

3566
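The join above relies on the default join type, which in PySpark is an inner join, so vessels without any dynamic records are dropped despite the `left_join` name. The difference between the two join types can be illustrated with a small pandas sketch (toy frames, assumed for illustration), where `indicator=True` marks the rows that have no match:

```python
import pandas as pd

static = pd.DataFrame({"sourcemmsi": [1, 2, 3], "shiptype": [70, 30, 80]})
speed = pd.DataFrame({"sourcemmsi": [1, 2], "speed_avg": [8.3, 2.8]})

# Inner join: only vessels that appear in both frames survive.
inner = static.merge(speed, on="sourcemmsi", how="inner")

# Left join: every static vessel is kept; unmatched ones get NaN speed.
left = static.merge(speed, on="sourcemmsi", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), len(left))
print(unmatched)
```

With a left join the unmatched vessels would carry a missing average speed that has to be imputed later, so keeping the inner join is a reasonable choice here.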


In [34]:
df = left_join.toPandas()

### Feature Engineering <a id='feat'></a>

In [35]:
df.head()

Unnamed: 0,sourcemmsi,shiptype,tobow,tostern,tostarboard,toport,draught,type_summary,avg(speedoverground)
0,211202460,15,29,89,13,13,11.2,Unspecified,8.275
1,227315190,30,5,9,3,3,,Fishing,6.963392
2,227416000,30,12,7,1,5,0.0,Fishing,2.832258
3,228186700,51,10,60,10,10,6.4,Search and Rescue,78.763053
4,228281000,90,0,0,0,0,0.0,Other,2.56183


We will create two new variables, 'length' and 'beam'. According to the variable descriptions:  
"*to_bow + to_stern = LENGTH of the vessel*"  <br>
"*to_starboard + to_port = BEAM at the vessel's nominal waterline*"



In [36]:
df['length'] = df['tobow'] + df['tostern']

In [37]:
df['beam'] = df['tostarboard'] + df['toport']

In [38]:
df.head()

Unnamed: 0,sourcemmsi,shiptype,tobow,tostern,tostarboard,toport,draught,type_summary,avg(speedoverground),length,beam
0,211202460,15,29,89,13,13,11.2,Unspecified,8.275,118,26
1,227315190,30,5,9,3,3,,Fishing,6.963392,14,6
2,227416000,30,12,7,1,5,0.0,Fishing,2.832258,19,6
3,228186700,51,10,60,10,10,6.4,Search and Rescue,78.763053,70,20
4,228281000,90,0,0,0,0,0.0,Other,2.56183,0,0


We no longer need the 'sourcemmsi' and 'shiptype' columns. We are going to drop them, rearrange the order of the columns and rename the avg(speedoverground) column.

In [39]:
df.drop(labels=['sourcemmsi', 'shiptype'], axis=1, inplace=True) 

In [40]:
# rearrange column order
df = df[['tobow', 'tostern', 'length', 'tostarboard', 'toport', 'beam', 'draught', 'avg(speedoverground)',
                 'type_summary']]

In [41]:
# rename column
df=df.rename(columns = {'avg(speedoverground)':'speed_avg'})

In [42]:
df.head()

Unnamed: 0,tobow,tostern,length,tostarboard,toport,beam,draught,speed_avg,type_summary
0,29,89,118,13,13,26,11.2,8.275,Unspecified
1,5,9,14,3,3,6,,6.963392,Fishing
2,12,7,19,1,5,6,0.0,2.832258,Fishing
3,10,60,70,10,10,20,6.4,78.763053,Search and Rescue
4,0,0,0,0,0,0,0.0,2.56183,Other


In [43]:
spark.sparkContext.stop()
spark.stop()

# Data visualization <a id='visual'></a>

In [44]:
import chart_studio.plotly as py
import cufflinks as cf
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()

We will take a look at some visualizations to get better insight into our data.

In [45]:
px.pie(df, names='type_summary',
       title='Vessel Type Percentages',
       color_discrete_sequence=px.colors.sequential.RdBu)

In [46]:
px.histogram(df, x='type_summary', labels={'type_summary': 'Vessel Type'})

As we can see from the graphs above, there are only two instances of High-Speed Craft and Wing in Grnd, so it is better to merge them into the Other type.

In [47]:
df.replace({'High-Speed Craft': 'Other','Wing in Grnd':'Other'}, inplace=True)

In [48]:
df.type_summary.value_counts()

Cargo                2062
Tanker                710
Fishing               318
Sailing Vessel        135
Other                 101
Passenger              56
Unspecified            48
Tug                    45
Special Craft          39
Pleasure Craft         32
Search and Rescue      20
Name: type_summary, dtype: int64

In [49]:
px.bar(df.groupby(['type_summary'],as_index=False).mean().sort_values('speed_avg'), 
       x='type_summary', y='speed_avg', color='type_summary', 
       title='Average Speed by Vessel Type',
       labels={'type_summary': 'Vessel Type', 'speed_avg':'Average Speed'})

Search and Rescue ships have the fastest average speed.

In [50]:
px.bar(df.groupby(['type_summary'],as_index=False).mean().sort_values('length'), 
       x='type_summary', y='length', color='type_summary', 
       title='Average Length by Vessel Type',
      labels={'type_summary': 'Vessel Type', 'length':'Length'})

Cargo, Passenger and Tanker vessels are the longest on average.

In [51]:
px.bar(df.groupby(['type_summary'],as_index=False).mean().sort_values('beam'), 
       x='type_summary', y='beam', color='type_summary', 
       title='Average Beam by Vessel Type',
      labels={'type_summary': 'Vessel Type', 'beam':'Beam'})

The same applies to the beam, but here Tankers have the largest average beam.

In [52]:
px.scatter(df, x='tobow', y='tostern',
          color='type_summary', size='length',
          hover_data=['type_summary'])

In [53]:
fig = px.scatter_3d(df.groupby(['type_summary'],as_index=False).mean(),
                   x='tobow', y='tostern', z='length', color='type_summary',
                   opacity=0.7, width=700, height=400)
fig

In [54]:
px.bar(df.groupby(['type_summary'],as_index=False).mean().sort_values('draught'), 
       x='type_summary', y='draught', color='type_summary',
       title='Average Draught by Vessel Type',
      labels={'type_summary': 'Vessel Type', 'draught':'Draught'})

Tanker and Cargo vessels have the deepest average draught.

### Outliers  <a id='out'></a>

We are going to use visualization tools to find outliers. For the scope of this project outliers will not be removed or modified.

In [55]:
px.box(df, x='type_summary', y='speed_avg',
       color='type_summary', title='Average Speed Box Plot',
      labels={'type_summary': 'Vessel Type', 'speed_avg':'Speed'})

We can clearly see an outlier with speed 78.76 for Search and Rescue vessels, 3 outliers for Cargo vessels with speed over 40 and possibly some more for Tanker and Pleasure Craft vessels.

In [56]:
px.box(df, x='type_summary', y='length', 
       color='type_summary', title='Length Box Plot',
      labels={'type_summary': 'Vessel Type', 'length':'Length'})

There are some outliers that stand out (e.g. the Unspecified instance with value 396), but the picture is less clear for Fishing and Sailing Vessel, which have many values above their upper fence. We don't want to remove all of these because we would lose valuable data.
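The upper fence mentioned above is the usual box plot rule: a value is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule in pandas follows (toy lengths and the `iqr_fences` helper are assumptions for illustration; Plotly may compute quartiles slightly differently, so the fences need not match the plot exactly):

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of a numeric Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy lengths for a single vessel type; 90 is far from the rest.
toy = pd.DataFrame({"type_summary": ["Fishing"] * 6,
                    "length": [10, 12, 14, 15, 16, 90]})

lower, upper = iqr_fences(toy["length"])
outliers = toy[(toy["length"] < lower) | (toy["length"] > upper)]
print(lower, upper)
print(outliers)
```

Applied per vessel type (for example with `groupby('type_summary')`), this gives a count of flagged points per class, which helps judge how much data would be lost by removing them.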

In [57]:
px.box(df, x='type_summary', y='beam',
       color='type_summary', title='Beam Box Plot',
      labels={'type_summary': 'Vessel Type', 'beam':'Beam'})

The Unspecified and Other types have similar values. We will merge the Unspecified type into the Other type.

In [58]:
df.replace({'Unspecified': 'Other'}, inplace=True)

# Prepare the Data for Machine Learning Algorithms <a id='prepare'></a>

### Label Encoder <a id='enco'></a>

We will encode the target values because machine learning algorithms can't handle strings.

In [59]:
from sklearn.preprocessing import LabelEncoder

In [60]:
le = LabelEncoder()
le.fit(df.type_summary)

LabelEncoder()

In [61]:
le.classes_

array(['Cargo', 'Fishing', 'Other', 'Passenger', 'Pleasure Craft',
       'Sailing Vessel', 'Search and Rescue', 'Special Craft', 'Tanker',
       'Tug'], dtype=object)

In [62]:
le.transform(df.type_summary)

array([2, 1, 1, ..., 0, 0, 0])

In [63]:
df['type_enc'] = pd.Series(le.transform(df.type_summary))

In [64]:
df.head()

Unnamed: 0,tobow,tostern,length,tostarboard,toport,beam,draught,speed_avg,type_summary,type_enc
0,29,89,118,13,13,26,11.2,8.275,Other,2
1,5,9,14,3,3,6,,6.963392,Fishing,1
2,12,7,19,1,5,6,0.0,2.832258,Fishing,1
3,10,60,70,10,10,20,6.4,78.763053,Search and Rescue,6
4,0,0,0,0,0,0,0.0,2.56183,Other,2


In [65]:
df.type_enc.unique()

array([2, 1, 6, 8, 0, 7, 5, 4, 3, 9])
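When reading predictions later, the integer codes can be mapped back to vessel type names with `inverse_transform`. A small sketch on a made-up list of labels, showing that `LabelEncoder` assigns codes in alphabetical order of the class names:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Classes are sorted alphabetically: Cargo -> 0, Fishing -> 1, Tanker -> 2.
codes = le.fit_transform(["Tanker", "Cargo", "Fishing", "Cargo"])

# Map the integer codes back to the original names.
names = le.inverse_transform(codes)

print(codes)
print(names)
```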

### Stratified Shuffle and Split <a id='strat'></a>

As we saw above the data is not evenly distributed so it's better to perform a stratified shuffle and split.

In [66]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(df, df['type_summary']):
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]

In [67]:
def type_proportions(data):
    return data["type_summary"].value_counts() / len(data)

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "type_summary": type_proportions(df),
    "Stratified": type_proportions(strat_test_set),
    "Random": type_proportions(test_set),
}).sort_index()
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["type_summary"] - 100
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["type_summary"] - 100


In [68]:
compare_props

Unnamed: 0,type_summary,Stratified,Random,Strat. %error,Rand. %error
Cargo,0.578239,0.578431,0.617647,0.033282,6.815199
Fishing,0.089176,0.089636,0.082633,0.516181,-7.336645
Other,0.041784,0.042017,0.043417,0.558344,3.910289
Passenger,0.015704,0.015406,0.014006,-1.895758,-10.814326
Pleasure Craft,0.008974,0.008403,0.005602,-6.355042,-37.570028
Sailing Vessel,0.037858,0.037815,0.033613,-0.112045,-11.210707
Search and Rescue,0.005609,0.005602,0.002801,-0.112045,-50.056022
Special Craft,0.010937,0.011204,0.009804,2.449185,-10.356963
Tanker,0.199103,0.19888,0.177871,-0.112045,-10.663589
Tug,0.012619,0.012605,0.012605,-0.112045,-0.112045
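For a single split, a stratified test set can also be obtained directly from `train_test_split` through its `stratify` parameter. A minimal sketch on an imbalanced toy frame (the data is made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 Cargo rows and 20 Tug rows.
toy = pd.DataFrame({
    "length": range(100),
    "type_summary": ["Cargo"] * 80 + ["Tug"] * 20,
})

# stratify keeps the class proportions equal in the train and test sets.
train, test = train_test_split(toy, test_size=0.2,
                               stratify=toy["type_summary"],
                               random_state=42)

print(test["type_summary"].value_counts(normalize=True))
```

`StratifiedShuffleSplit` remains the more flexible choice when several resampled splits are needed.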


### Fill null values of the training set

In [69]:
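cols = ['tobow', 'tostern', 'length', 'tostarboard', 'toport', 'beam',
       'draught', 'speed_avg']
# fill missing values with the mean of the corresponding vessel type
# (note: the group means are computed on the full dataset, not only on the training set)
for i in cols:
    strat_train_set[i] = df.groupby('type_summary')[i].transform(lambda x: x.fillna(x.mean()))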
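The cell above derives the group means from the full dataset, so information from the test rows leaks into the training features. One way to avoid this, sketched here under the assumption that a per-type mean is still the desired fill value, is to compute the means on the training set only and reuse them for the test set (toy frames, made up for illustration):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"type_summary": ["Cargo", "Cargo", "Tug", "Tug"],
                      "draught": [8.0, np.nan, 3.0, 5.0]})
test = pd.DataFrame({"type_summary": ["Cargo", "Tug"],
                     "draught": [np.nan, np.nan]})

# Per-type means learned from the training rows only.
group_means = train.groupby("type_summary")["draught"].mean()

# Fill each missing value with the training mean of its vessel type.
train["draught"] = train["draught"].fillna(train["type_summary"].map(group_means))
test["draught"] = test["draught"].fillna(test["type_summary"].map(group_means))

print(train)
print(test)
```

Note that filling the test set by vessel type uses the target label, which would not be available at prediction time; the label-free median imputer in a pipeline is the safer choice for the test set.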

### Split train and test set

In [70]:
X_train = strat_train_set.drop(['type_summary','type_enc'], axis=1)
y_train = strat_train_set.type_enc.copy()

X_test = strat_test_set.drop(['type_summary','type_enc'], axis=1)
y_test = strat_test_set.type_enc.copy()

# Train and Evaluate Machine Learning Models <a id='train'></a>

In [71]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import model_selection
from sklearn.model_selection import cross_val_score

In [72]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
    ])

In [73]:
numcol = list(X_train.columns)
preprocessor = ColumnTransformer([
    ("num", num_pipeline, numcol),
])

In [74]:
labels = ['Cargo', 'Fishing', 'Other', 'Passenger', 'Pleasure Craft', 'Sailing Vessel',
          'Search and Rescue', 'Special Craft', 'Tanker', 'Tug']

In [75]:
def draw_conf_mat(y_test, preds, model):
    z = confusion_matrix(y_test, preds)
    # reverse the row order for display
    z = z[::-1]
    x = labels
    y = x[::-1]  # reversed class names, matching the reversed rows
    
    # convert each count to a string for the cell annotations
    z_text = [[str(count) for count in row] for row in z]
    # set up figure 
    fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z_text,
                                      colorscale='Blues')
    # add title
    fig.update_layout(title_text='<i><b>Confusion matrix</b></i>',
                  #xaxis = dict(title='x'),
                  #yaxis = dict(title='x')
                 )

    # add custom xaxis title
    fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=0.5,
                        y=-0.15,
                        showarrow=False,
                        text="Predicted value",
                        xref="paper",
                        yref="paper"))

    # add custom yaxis title
    fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=-0.35,
                        y=0.5,
                        showarrow=False,
                        text="Real value",
                        textangle=-90,
                        xref="paper",
                        yref="paper"))

    # adjust margins to make room for yaxis title
    fig.update_layout(margin=dict(t=50, l=200))

    # add colorbar
    fig['data'][0]['showscale'] = True
    fig.show()

In [76]:
def ml(X_train, y_train, X_test, y_test, model):
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
    my_pipeline.fit(X_train, y_train)
    preds = my_pipeline.predict(X_test)
    accuracy = my_pipeline.score(X_test, y_test)
    print('Accuracy:', accuracy)
    error = mean_absolute_error(y_test, preds)
    print('MAE:', error)
    draw_conf_mat(y_test, preds, model)        
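The helper above scores each model on a single train/test split. `cross_val_score` is imported earlier but not used. A sketch of how a comparable pipeline could be cross-validated is shown below, on synthetic data from `make_classification` rather than the vessel dataset (the data, fold count and `n_estimators` value are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the vessel features: 8 numeric columns, 3 classes.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
    ("model", ExtraTreesClassifier(n_estimators=50, random_state=42)),
])

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

Reporting the mean and spread over folds makes model comparisons less dependent on one particular split.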

### Decision Tree Classifier <a id='dtc'></a>

In [77]:
from sklearn.tree import DecisionTreeClassifier 
model = DecisionTreeClassifier(max_depth = 9)
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.757703081232493
MAE: 1.361344537815126


### Support Vector Classifier <a id='scv'></a>

In [78]:
from sklearn.svm import SVC 
model = SVC(gamma='auto')
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.7296918767507002
MAE: 1.6372549019607843


### K Neighbors Classifier <a id='knc'></a>

In [79]:
from sklearn.neighbors import KNeighborsClassifier 
model = KNeighborsClassifier(n_neighbors = 7)
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.7464985994397759
MAE: 1.4411764705882353


### Extra Tree Classifier <a id='exc'></a>

In [80]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import ExtraTreeClassifier
extra = ExtraTreeClassifier(random_state=42)
model = BaggingClassifier(extra, random_state=0)
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.7773109243697479
MAE: 1.2871148459383754


### Extra Trees Classifier <a id='etc'></a>

In [81]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators=100, random_state=42)
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.8291316526610645
MAE: 0.9005602240896359


### Label Propagation Classifier <a id='lpc'></a>

In [82]:
from sklearn.semi_supervised import LabelPropagation
model = LabelPropagation()
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.7450980392156863
MAE: 1.3711484593837535


### Multi-layer Perceptron classifier <a id='mlp'></a>

In [83]:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(random_state=42, max_iter=1000)
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.7689075630252101
MAE: 1.3053221288515406


### Gradient Boosting Classifier <a id='gbc'></a>

In [84]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100,
                                   learning_rate=0.1,
                                   max_depth=10,
                                   random_state=42)
ml(X_train, y_train, X_test, y_test, model)

Accuracy: 0.7983193277310925
MAE: 1.050420168067227


The **best model** for this task is the **Extra Trees Classifier**, with 82.91% *accuracy* and a *mean absolute error* of 0.90.

From the confusion matrices we can see that many Tankers are misclassified as Cargo vessels and vice versa.
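Confusion between two classes can be quantified with per-class recall and precision, which follow directly from the confusion matrix. The sketch below uses a made-up 2x2 matrix (not results from this notebook) in scikit-learn's convention of true classes in rows and predicted classes in columns:

```python
import numpy as np

labels = ["Cargo", "Tanker"]

# Hypothetical counts: rows are true classes, columns are predicted classes.
cm = np.array([[50, 10],
               [ 8, 32]])

# Recall: share of each true class that was predicted correctly.
recall = cm.diagonal() / cm.sum(axis=1)

# Precision: share of each predicted class that was actually correct.
precision = cm.diagonal() / cm.sum(axis=0)

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.3f}, precision={p:.3f}")
```

Unlike the mean absolute error, these values do not depend on the arbitrary integer codes assigned to the classes.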

# Conclusion <a id='end'></a>

We combined three maritime datasets using *pyspark* and *pandas*. We visualized them with *plotly* to get a better grasp of the data. We prepared the data for the machine learning algorithms by building a pipeline and, finally, we trained and evaluated several models, of which the **Extra Trees Classifier** had the best performance.