### **Feature Engineering**

- **Handle missing data**:
  - Remove rows or columns with missing values.
  - Impute missing values using forward fill, mean, median, or mode.

- **Encode categorical variables**:
  - Use one-hot encoding for non-ordinal categorical data.
  - Use label (integer) encoding for ordinal categorical data, making sure the integer order matches the category order.

- **Extract features from datetime columns**:
  - Extract components like year, month, day, hour, and day of the week.

- **Scale and normalize features**:
  - Standardize features to zero mean and unit variance (useful for scale-sensitive models such as logistic regression and KNN).
  - Normalize features to a [0,1] range if necessary.

- **Create new features**:
  - Generate interaction features by combining existing ones.

- **Remove irrelevant or redundant features**:
  - Drop unnecessary columns, and use a correlation matrix to find highly correlated features to drop.
  - Drop unique identifiers if they do not add value.
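
The cells below cover encoding and datetime extraction. The other checklist items (imputation, interaction features, correlation pruning, scaling) can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the hurricane data: the column names, the values, and the 0.95 correlation threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns, not taken from the merged hurricane/vessel file)
toy = pd.DataFrame({
    "speed":    [10.0, np.nan, 14.0, 12.0],
    "length_m": [100.0, 200.0, 300.0, 400.0],
    "draft":    [8.0, 5.0, 9.0, 6.0],
})
toy["length_ft"] = toy["length_m"] * 3.28084  # redundant copy in another unit

# 1. Impute missing values with the column median
toy["speed"] = toy["speed"].fillna(toy["speed"].median())

# 2. Create an interaction feature from two existing columns
toy["length_x_draft"] = toy["length_m"] * toy["draft"]

# 3. Drop one feature from each highly correlated pair.
#    Only the upper triangle of |corr| is inspected so each pair is seen once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)

# 4a. Standardize: zero mean, unit variance per column
standardized = (reduced - reduced.mean()) / reduced.std(ddof=0)

# 4b. Normalize: rescale each column to the [0, 1] range
normalized = (reduced - reduced.min()) / (reduced.max() - reduced.min())
```

On real data the scaling statistics would normally be computed on the training split only and then reused on the test split, so the test set does not leak into the features.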

In [65]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [64]:
filePath = "../data/merged_hurricane_vessel.csv"
df = pd.read_csv(filePath)

In [66]:
df.head()

Unnamed: 0,MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading,VesselName,IMO,CallSign,...,50-knot Wind Radii SW,50-knot Wind Radii NW,64-knot Wind Radii NE,64-knot Wind Radii SE,64-knot Wind Radii SW,64-knot Wind Radii NW,Speed mph,hurricane_datetime,impacted,PathChange
0,563135300,2022-08-31 12:00:00,32.11997,-79.93341,0.0,270.9,179.0,WAN HAI 625,IMO9298997,9V7324,...,0.0,0.0,0.0,0.0,0.0,0.0,11.784165,2022-08-31 12:00:00,False,stayed on course
1,636092896,2022-08-31 12:00:00,40.4947,-73.66555,0.1,299.4,299.0,CCNI ANGOL,IMO9683867,D5GZ4,...,0.0,0.0,0.0,0.0,0.0,0.0,11.784165,2022-08-31 12:00:00,False,stayed on course
2,373457000,2022-08-31 12:00:00,36.62089,-75.54031,11.2,336.6,337.0,MOL MAESTRO,IMO9415727,3EKT9,...,0.0,0.0,0.0,0.0,0.0,0.0,11.784165,2022-08-31 12:00:00,False,stayed on course
3,211779000,2022-08-31 12:00:00,27.72167,-78.49475,22.2,255.3,253.0,NORTHERN MAJESTIC,IMO9252565,DCPP2,...,0.0,0.0,0.0,0.0,0.0,0.0,11.784165,2022-08-31 12:00:00,False,stayed on course
4,636021760,2022-08-31 12:00:00,36.14663,-74.65842,17.5,12.7,12.0,MSC TAMPA,IMO9317925,5LFN8,...,0.0,0.0,0.0,0.0,0.0,0.0,11.784165,2022-08-31 12:00:00,False,stayed on course


In [67]:
# Label-encode the string-valued columns.
# The integer codes follow the sorted order of the class labels and carry no real ranking.
# A separate encoder is kept per column so each mapping can be inverted later
# with encoders[col].inverse_transform(...).
label_cols = ['VesselName', 'IMO', 'CallSign', 'Record Identifier', 'Name', 'PathChange']
encoders = {}
for col in label_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
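
`LabelEncoder` assigns integers by the sorted order of the class labels, which is fine for identifiers but can scramble a genuinely ordinal column. A minimal sketch of the difference, using a hypothetical `severity` series with made-up levels (no such column appears in this dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal column; the intended order is low < medium < high
severity = pd.Series(["low", "high", "medium", "low"])

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(severity)

# An ordered categorical keeps the intended ranking: low=0, medium=1, high=2
ordered = pd.Categorical(severity, categories=["low", "medium", "high"], ordered=True)
ranked = pd.Series(ordered.codes, index=severity.index)
```

Here `alphabetical` places `high` below `low`, while `ranked` preserves the stated order.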

In [68]:
# One-hot encode the string-valued columns that have more than two classes
df = pd.get_dummies(df, columns=['TransceiverClass', 'Status of System'], drop_first=True)
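
The effect of `drop_first=True` in the cell above is easiest to see on a small frame. The sketch below uses made-up values for a single column; it only illustrates how `pd.get_dummies` names and drops columns, not the actual class distribution in the data.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({"TransceiverClass": ["A", "B", "A"]})

# Without drop_first: one indicator column per class
full = pd.get_dummies(demo, columns=["TransceiverClass"])

# With drop_first: the first class (in sorted order) is dropped and becomes
# the implicit baseline, which avoids perfectly collinear indicator columns
reduced = pd.get_dummies(demo, columns=["TransceiverClass"], drop_first=True)

print(list(full.columns))     # ['TransceiverClass_A', 'TransceiverClass_B']
print(list(reduced.columns))  # ['TransceiverClass_B']
```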

In [69]:
# Convert columns to datetime
df['BaseDateTime'] = pd.to_datetime(df['BaseDateTime'], errors='coerce')
df['hurricane_datetime'] = pd.to_datetime(df['hurricane_datetime'], errors='coerce')

# Extract year, month, day, and hour from the datetime
df['vessel_year'] = df['BaseDateTime'].dt.year
df['vessel_month'] = df['BaseDateTime'].dt.month
df['vessel_day'] = df['BaseDateTime'].dt.day
df['vessel_hour'] = df['BaseDateTime'].dt.hour

df['hurricane_year'] = df['hurricane_datetime'].dt.year
df['hurricane_month'] = df['hurricane_datetime'].dt.month
df['hurricane_day'] = df['hurricane_datetime'].dt.day
df['hurricane_hour'] = df['hurricane_datetime'].dt.hour
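
The checklist also mentions the day of the week, which the cell above does not extract. A minimal sketch on a standalone series (the second value is deliberately invalid to show what `errors='coerce'` does; the variable names are illustrative):

```python
import pandas as pd

# One valid timestamp and one unparseable string
stamps = pd.to_datetime(
    pd.Series(["2022-08-31 12:00:00", "not a date"]), errors="coerce"
)

# Monday=0 ... Sunday=6; rows that failed to parse become NaN
day_of_week = stamps.dt.dayofweek

# errors='coerce' silently turns bad values into NaT, so count them
n_unparsed = stamps.isna().sum()
```

In this notebook the same accessor could be applied as `df['BaseDateTime'].dt.dayofweek`, provided it runs before the datetime columns are dropped; the name of the resulting column is up to the author.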

In [70]:
# Drop the raw datetime columns (their components were extracted above) and the 'Year' column
df.drop(['BaseDateTime', 'hurricane_datetime', 'Year'], axis=1, inplace=True)

In [71]:
df.dtypes

MMSI                        int64
LAT                       float64
LON                       float64
SOG                       float64
COG                       float64
Heading                   float64
VesselName                  int64
IMO                         int64
CallSign                    int64
VesselType                float64
Status                    float64
Length                    float64
Width                     float64
Draft                     float64
Cargo                     float64
Name                        int64
Num Entries                 int64
Time                        int64
Record Identifier           int64
Latitude                  float64
Longitude                 float64
Maximum Sustained Wind    float64
Minimum Pressure          float64
34-knot Wind Radii NE     float64
34-knot Wind Radii SE     float64
34-knot Wind Radii SW     float64
34-knot Wind Radii NW     float64
50-knot Wind Radii NE     float64
50-knot Wind Radii SE     float64
50-knot Wind R

In [72]:
df.head()

Unnamed: 0,MMSI,LAT,LON,SOG,COG,Heading,VesselName,IMO,CallSign,VesselType,...,Status of System_ TD,Status of System_ TS,vessel_year,vessel_month,vessel_day,vessel_hour,hurricane_year,hurricane_month,hurricane_day,hurricane_hour
0,563135300,32.11997,-79.93341,0.0,270.9,179.0,1348,281,418,70.0,...,False,False,2022,8,31,12,2022,8,31,12
1,636092896,40.4947,-73.66555,0.1,299.4,299.0,218,953,715,79.0,...,False,False,2022,8,31,12,2022,8,31,12
2,373457000,36.62089,-75.54031,11.2,336.6,337.0,849,600,38,70.0,...,False,False,2022,8,31,12,2022,8,31,12
3,211779000,27.72167,-78.49475,22.2,255.3,253.0,1027,184,852,70.0,...,False,False,2022,8,31,12,2022,8,31,12
4,636021760,36.14663,-74.65842,17.5,12.7,12.0,965,359,190,70.0,...,False,False,2022,8,31,12,2022,8,31,12


In [73]:
df.isna().sum()

MMSI                      0
LAT                       0
LON                       0
SOG                       0
COG                       0
Heading                   0
VesselName                0
IMO                       0
CallSign                  0
VesselType                0
Status                    0
Length                    0
Width                     0
Draft                     0
Cargo                     0
Name                      0
Num Entries               0
Time                      0
Record Identifier         0
Latitude                  0
Longitude                 0
Maximum Sustained Wind    0
Minimum Pressure          0
34-knot Wind Radii NE     0
34-knot Wind Radii SE     0
34-knot Wind Radii SW     0
34-knot Wind Radii NW     0
50-knot Wind Radii NE     0
50-knot Wind Radii SE     0
50-knot Wind Radii SW     0
50-knot Wind Radii NW     0
64-knot Wind Radii NE     0
64-knot Wind Radii SE     0
64-knot Wind Radii SW     0
64-knot Wind Radii NW     0
Speed mph           

In [74]:
df.to_csv('modeling_first_iteration.csv', index=False)