# Vessel Data Analysis

Each day, many vessels arrive in this port and are served by one or more stevedores. Four cargo types have been identified (ore, coal, oil, and petroleum), and vessels often carry a mixture of cargo types. For each unique vessel arrival (i.e. each row in the data), we would like to predict how much it transships (the total of load and discharge activities) per cargo type.
The variables of interest are therefore: discharge1, load1, discharge2, load2, discharge3, load3, discharge4 and load4.
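
As a minimal sketch of the target described above, the transshipment per cargo type can be obtained by adding each discharge/load pair. The toy rows and the total1 to total4 column names are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "discharge1": [0.0, 100.0], "load1": [0.0, 50.0],
    "discharge2": [0.0, 0.0], "load2": [20.0, 0.0],
    "discharge3": [90173.0, 0.0], "load3": [0.0, 0.0],
    "discharge4": [0.0, 3537.0], "load4": [0.0, 3051.0],
})

# Transshipment per cargo type = discharge + load for that type.
# The "total<i>" column names are hypothetical.
for i in range(1, 5):
    toy[f"total{i}"] = toy[f"discharge{i}"] + toy[f"load{i}"]

print(toy[[f"total{i}" for i in range(1, 5)]])
```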

In [48]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings(action='ignore')

In [7]:
df = pd.read_excel("./VesselData.xlsx")
df.head()

Unnamed: 0,eta,ata,atd,vesseldwt,vesseltype,discharge1,load1,discharge2,load2,discharge3,...,load4,stevedorenames,hasnohamis,earliesteta,latesteta,traveltype,previousportid,nextportid,isremarkable,vesselid
0,2017-09-19 00:00:00+00,2017-09-19 00:00:00+00,2017-09-22 00:00:00+00,109290.0,5.0,0.0,0.0,0.0,0.0,90173.0,...,0.0,Stevedore_104,,2017-09-19 00:00:00+00,2017-09-19 00:00:00+00,ARRIVAL,981.0,731.0,f,2242.0
1,2017-10-02 00:00:00+00,2017-10-02 00:00:00+00,2017-10-03 00:00:00+00,67170.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,Stevedore_109,,2017-10-02 00:00:00+00,2017-10-02 00:00:00+00,ARRIVAL,19.0,15.0,f,5462.0
2,2017-09-30 00:00:00+00,2017-09-30 00:00:00+00,2017-10-01 00:00:00+00,67737.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,Stevedore_57,,2017-09-30 00:00:00+00,2017-09-30 00:00:00+00,ARRIVAL,19.0,19.0,f,5251.0
3,2017-10-02 00:00:00+00,2017-10-02 00:00:00+00,2017-10-03 00:00:00+00,43600.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,Stevedore_57,,2017-10-02 00:00:00+00,2017-10-02 00:00:00+00,ARRIVAL,15.0,18.0,f,5268.0
4,2017-10-02 00:00:00+00,2017-10-02 00:00:00+00,2017-10-02 00:00:00+00,9231.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,Stevedore_98,,2017-10-02 00:00:00+00,2017-10-02 00:00:00+00,ARRIVAL,74.0,27.0,f,5504.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   eta             8208 non-null   object 
 1   ata             8208 non-null   object 
 2   atd             8208 non-null   object 
 3   vesseldwt       8206 non-null   float64
 4   vesseltype      8208 non-null   float64
 5   discharge1      8208 non-null   float64
 6   load1           8208 non-null   float64
 7   discharge2      8208 non-null   float64
 8   load2           8208 non-null   float64
 9   discharge3      8208 non-null   float64
 10  load3           8208 non-null   float64
 11  discharge4      8208 non-null   float64
 12  load4           8208 non-null   float64
 13  stevedorenames  8206 non-null   object 
 14  hasnohamis      0 non-null      float64
 15  earliesteta     8208 non-null   object 
 16  latesteta       8208 non-null   object 
 17  traveltype      8208 non-null   o

One feature, hasnohamis, contains only null values, so we drop it.

In [8]:
# drop null features
df = df.drop(["hasnohamis"], axis = 1)
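
Rather than reading the df.info() output by eye, the all-null columns can also be found programmatically. The following is a small sketch on a toy frame (the values are made up), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy frame; "hasnohamis" mimics the all-null column seen in df.info().
toy = pd.DataFrame({
    "vesseldwt": [109290.0, np.nan, 67737.0],
    "hasnohamis": [np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing.
all_null_cols = toy.columns[toy.isna().all()].tolist()
print(all_null_cols)

# Drop them in one call.
cleaned = toy.drop(columns=all_null_cols)
```

A column with only some missing values, such as vesseldwt here, is kept.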

In [14]:
# Convert the timestamp columns to pandas datetime
for col in ['eta', 'ata', 'atd', 'earliesteta', 'latesteta']:
    df[col] = pd.to_datetime(df[col])

In [25]:
df["isremarkable"].unique()

array(['f'], dtype=object)

This feature (a Boolean flag indicating whether there is anything remarkable about the vessel) contains only false values, so it carries no information and we can drop it as well.

In [26]:
df = df.drop(["isremarkable"], axis = 1)
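
The same check can be run over all columns at once by counting distinct values. This is a sketch on a toy frame; the "SHIFT" travel type is a made-up value used only so that one column is not constant.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "isremarkable": ["f", "f", "f"],
    "traveltype": ["ARRIVAL", "ARRIVAL", "SHIFT"],
    "vesseltype": [5.0, 3.0, 3.0],
})

# nunique() counts the distinct non-null values in each column.
n_unique = toy.nunique()

# Columns with at most one distinct value cannot help a model.
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```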

In [35]:
# Drop the remaining rows that contain NaN values
df.dropna(inplace=True)

In [36]:
df.head()

Unnamed: 0,eta,ata,atd,vesseldwt,vesseltype,discharge1,load1,discharge2,load2,discharge3,load3,discharge4,load4,stevedorenames,earliesteta,latesteta,traveltype,previousportid,nextportid,vesselid
0,2017-09-19 00:00:00+00:00,2017-09-19 00:00:00+00:00,2017-09-22 00:00:00+00:00,109290.0,5.0,0.0,0.0,0.0,0.0,90173.0,0.0,0.0,0.0,Stevedore_104,2017-09-19 00:00:00+00:00,2017-09-19 00:00:00+00:00,ARRIVAL,981.0,731.0,2242.0
1,2017-10-02 00:00:00+00:00,2017-10-02 00:00:00+00:00,2017-10-03 00:00:00+00:00,67170.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Stevedore_109,2017-10-02 00:00:00+00:00,2017-10-02 00:00:00+00:00,ARRIVAL,19.0,15.0,5462.0
2,2017-09-30 00:00:00+00:00,2017-09-30 00:00:00+00:00,2017-10-01 00:00:00+00:00,67737.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Stevedore_57,2017-09-30 00:00:00+00:00,2017-09-30 00:00:00+00:00,ARRIVAL,19.0,19.0,5251.0
3,2017-10-02 00:00:00+00:00,2017-10-02 00:00:00+00:00,2017-10-03 00:00:00+00:00,43600.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Stevedore_57,2017-10-02 00:00:00+00:00,2017-10-02 00:00:00+00:00,ARRIVAL,15.0,18.0,5268.0
4,2017-10-02 00:00:00+00:00,2017-10-02 00:00:00+00:00,2017-10-02 00:00:00+00:00,9231.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Stevedore_98,2017-10-02 00:00:00+00:00,2017-10-02 00:00:00+00:00,ARRIVAL,74.0,27.0,5504.0


For the predictions, we do not take the date values into account, only the vessel features (deadweight tonnage, vessel type, etc.).

In [37]:
def preprocess_inputs(df):
    df = df.copy()
    
    # Drop the date columns and the non-numeric columns
    df = df.drop(['eta', 'ata', 'atd', 'earliesteta', 'latesteta',
                  'stevedorenames', 'traveltype'], axis=1)
    
    # Split df into X and y
    y = df['discharge1']
    X = df.drop('discharge1', axis=1)
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    
    # Scale X
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
    
    return X_train, X_test, y_train, y_test

In [38]:
X_train, X_test, y_train, y_test = preprocess_inputs(df)

In [39]:
X_train

Unnamed: 0,vesseldwt,vesseltype,load1,discharge2,load2,discharge3,load3,discharge4,load4,previousportid,nextportid,vesselid
6647,-0.666071,-0.613147,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,-0.799367,1.434406,-1.529206
6011,-0.557933,-0.613147,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,-0.773005,-0.823311,0.688573
5597,-0.684123,-0.613147,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,-0.801764,-0.863883,-1.519187
4915,-0.151662,-0.613147,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,0.106518,-0.861497,-1.385990
5775,-0.654011,1.409077,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,1.079505,0.198151,-0.164827
...,...,...,...,...,...,...,...,...,...,...,...,...
2899,-0.532874,-0.613147,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,-0.581284,2.088333,1.386970
7817,-0.597252,-0.613147,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,0.708045,0.682629,0.175826
906,-0.257676,1.409077,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,1.424936,-0.128593,0.516323,-0.863883,-1.057125
5196,1.271920,-0.613147,-0.047757,-0.100638,-0.034955,-0.194479,-0.0165,-0.186215,-0.128593,-0.713092,-0.863883,1.556118


In [40]:
y_train

6647    0.0
6011    0.0
5597    0.0
4915    0.0
5775    0.0
       ... 
2899    0.0
7817    0.0
906     0.0
5196    0.0
235     0.0
Name: discharge1, Length: 5742, dtype: float64

We can test several models and evaluate their performance on this dataset.

In [49]:
models = {
    "                      LinearRegression": LinearRegression(),
    "                   Logistic Regression": LogisticRegression(),
    "                   K-Nearest Neighbors": KNeighborsClassifier(),
    "                         Decision Tree": DecisionTreeClassifier(),
    "Support Vector Machine (Linear Kernel)": LinearSVC(),
    "   Support Vector Machine (RBF Kernel)": SVC(),
    "                        Neural Network": MLPClassifier(),
    "                         Random Forest": RandomForestClassifier(),
    "                     Gradient Boosting": GradientBoostingClassifier(),
    "                               XGBoost": XGBClassifier(eval_metric='mlogloss')
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                      LinearRegression trained.
                   Logistic Regression trained.
                   K-Nearest Neighbors trained.
                         Decision Tree trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                        Neural Network trained.
                         Random Forest trained.
                     Gradient Boosting trained.
                               XGBoost trained.


In [50]:
for name, model in models.items():
    print(name + ": {:.2f}%".format(model.score(X_test, y_test) * 100))

                      LinearRegression: 14.89%
                   Logistic Regression: 97.97%
                   K-Nearest Neighbors: 98.01%
                         Decision Tree: 97.40%
Support Vector Machine (Linear Kernel): 98.01%
   Support Vector Machine (RBF Kernel): 98.01%
                        Neural Network: 97.89%
                         Random Forest: 97.97%
                     Gradient Boosting: 97.36%
                               XGBoost: 98.01%


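The accuracy scores are suspiciously high, except for the linear regression model, which performs poorly. The two kinds of score are not directly comparable, though: the score method of LinearRegression returns the coefficient of determination R², whereas the score method of the classifiers returns accuracy. Overfitting is one possible explanation, but these scores are computed on the test set, and the y_train sample above suggests that discharge1 is 0 for most vessels, so a model that always predicts 0 would probably score about as well. We can look at the predictions we get with logistic regression.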
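
To make the point about the metrics concrete, here is a small sketch on synthetic data (98 zeros and 2 made-up non-zero tonnages, not the vessel dataset). It uses scikit-learn's DummyClassifier, which is not part of the original notebook, as a model that always predicts the most frequent value.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, r2_score

# Synthetic target: 98 zeros and 2 non-zero tonnages (made-up numbers).
y = np.array([0.0] * 98 + [50000.0, 120000.0])
X = np.zeros((len(y), 1))  # the features are irrelevant for this baseline

# A "model" that ignores X and always predicts the most frequent value.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)  # share of exact matches
r2 = r2_score(y, y_pred)         # share of variance explained
print(f"accuracy = {acc:.2f}, R^2 = {r2:.3f}")
```

On this toy target the baseline reaches 98% accuracy while its R² is below zero, so a high accuracy on a mostly-zero target says little about how well the non-zero volumes are predicted.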

In [None]:
# Drop the date columns and the non-numeric columns from df itself
df = df.drop(['eta', 'ata', 'atd', 'earliesteta', 'latesteta',
              'stevedorenames', 'traveltype'], axis=1)

In [59]:
clf = LogisticRegression().fit(X_train, y_train)
df["predicted_discharge1"] = clf.predict(df.drop('discharge1', axis=1))

In [61]:
df

Unnamed: 0,vesseldwt,vesseltype,discharge1,load1,discharge2,load2,discharge3,load3,discharge4,load4,previousportid,nextportid,vesselid,predicted_discharge1
0,109290.0,5.0,0.0,0.0,0.0,0.0,90173.0,0.0,0.0,0.0,981.0,731.0,2242.0,186955.0
1,67170.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,15.0,5462.0,186955.0
2,67737.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,19.0,5251.0,186955.0
3,43600.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,18.0,5268.0,186955.0
4,9231.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,74.0,27.0,5504.0,186955.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8203,9587.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,19.0,5681.0,186955.0
8204,9654.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,391.0,102.0,4843.0,186955.0
8205,4726.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,3537.0,3051.0,1043.0,19.0,3115.0,172045.0
8206,13320.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,54.0,71.0,4623.0,186955.0


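The predicted values are not coherent: nearly every row shown receives the same prediction, 186955.0, although the actual discharge1 in those rows is 0. Note that the model was fitted on standardized features but is applied here to the raw, unscaled columns, which by itself is enough to produce meaningless predictions. In principle, once we have predicted discharge1, we would apply the same method to predict discharge2, discharge3 and discharge4, and then add the four values together to obtain the total discharge activity.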
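
One way to avoid a mismatch between scaled training data and unscaled prediction data is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below assumes synthetic features and a hypothetical target (a fixed share of deadweight tonnage plus noise), and it uses LinearRegression only to keep the example short; it is not the notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the vessel features (made-up numbers).
X = pd.DataFrame({
    "vesseldwt": rng.uniform(5_000, 120_000, size=200),
    "vesseltype": rng.integers(1, 6, size=200).astype(float),
})
# Hypothetical target: a fixed share of deadweight tonnage plus noise.
y = 0.5 * X["vesseldwt"] + rng.normal(0, 1_000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, shuffle=True, random_state=1
)

# The pipeline fits the scaler on X_train and reapplies it inside predict().
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Raw, unscaled rows can be passed directly: scaling happens in the pipeline.
predictions = pd.Series(model.predict(X), index=X.index, name="predicted")
print(predictions.head())
```

Because the fitted scaler travels with the model, the same object can be used for the train split, the test split and the full frame without any manual transform step.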

In [68]:
def preprocess_inputs(df):
    df = df.copy()
    
    # Split df into X and y
    y = df['load1']
    X = df.drop('load1', axis=1)
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    
    # Scale X
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
    
    return X_train, X_test, y_train, y_test

In [69]:
X_train, X_test, y_train, y_test = preprocess_inputs(df)

In [70]:
models = {
    "                      LinearRegression": LinearRegression(),
    "                   Logistic Regression": LogisticRegression(),
    "                   K-Nearest Neighbors": KNeighborsClassifier(),
    "                         Decision Tree": DecisionTreeClassifier(),
    "Support Vector Machine (Linear Kernel)": LinearSVC(),
    "   Support Vector Machine (RBF Kernel)": SVC(),
    "                        Neural Network": MLPClassifier(),
    "                         Random Forest": RandomForestClassifier(),
    "                     Gradient Boosting": GradientBoostingClassifier(),
    "                               XGBoost": XGBClassifier(eval_metric='mlogloss')
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                      LinearRegression trained.
                   Logistic Regression trained.
                   K-Nearest Neighbors trained.
                         Decision Tree trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                        Neural Network trained.
                         Random Forest trained.
                     Gradient Boosting trained.
                               XGBoost trained.


In [71]:
for name, model in models.items():
    print(name + ": {:.2f}%".format(model.score(X_test, y_test) * 100))

                      LinearRegression: 0.36%
                   Logistic Regression: 99.63%
                   K-Nearest Neighbors: 99.59%
                         Decision Tree: 99.31%
Support Vector Machine (Linear Kernel): 99.63%
   Support Vector Machine (RBF Kernel): 99.63%
                        Neural Network: 99.63%
                         Random Forest: 99.63%
                     Gradient Boosting: 99.15%
                               XGBoost: 99.59%


These accuracy scores are again implausibly high and suggest that we need to review our choice of models, training parameters and dataset preparation.

We mostly tested classification models in this notebook but, judging by these results, they are not well suited to our dataset: the targets are continuous quantities, so regression models are a more natural choice.
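
As one possible direction for that review, the sketch below treats the target as a continuous quantity and compares a regressor against a trivial mean predictor using the mean absolute error. Everything in it is an assumption made for illustration: the data are synthetic, the rule that only type-5 vessels discharge cargo is invented, and RandomForestRegressor is just one candidate model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic features (made-up): deadweight tonnage and a vessel type code.
dwt = rng.uniform(5_000, 120_000, size=n)
vtype = rng.integers(1, 6, size=n)
X = np.column_stack([dwt, vtype])

# Hypothetical zero-inflated target: only type-5 vessels discharge cargo.
y = np.where(vtype == 5, 0.8 * dwt, 0.0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=1
)

# Trivial reference model: always predict the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

# Mean absolute error is expressed in the same unit as the target.
mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
mae_forest = mean_absolute_error(y_test, forest.predict(X_test))
print(f"MAE baseline: {mae_baseline:.0f}, MAE forest: {mae_forest:.0f}")
```

Comparing against a trivial baseline makes it harder to be misled by a target that is zero most of the time.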