# **MULTI UAV CONFLICT RISK ANALYSIS - CLASSIFICATION**



---


# **IMPORT**
Importing the required packages.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# data visualization
import seaborn as sns

# data processing 
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler, RobustScaler

# training
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier


# evaluation
from sklearn.metrics import confusion_matrix, f1_score, balanced_accuracy_score, accuracy_score, ConfusionMatrixDisplay



---


# **MOUNT DRIVE**
Mount Google Drive so that the dataset can be loaded from it.

In [None]:
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')



---


# **LOAD THE DATASET**
To load the dataset, I first set its location in my Drive (to avoid errors, check your own path and replace it). Since this is a tab-separated values (TSV) file, I read it with *pandas.read_csv()* using a tab separator, which loads it into a DataFrame.

The query can be modified to work with only some of the classes and to test different configurations. I then split the columns into inputs and output. The inputs are stored in the first 35 columns, while the output for classification is stored in the 36th column (in the code, these are all columns except the last two and the second-to-last column, respectively).

## Data Extraction

In [None]:
# importing the file from drive and reading it into DataFrame
filename = '/content/drive/MyDrive/Project_ML/Data/train_set.tsv'
dataframe = pd.read_csv(filename, sep = "\t", header = 0)
print('File loaded: %d samples.' %(len(dataframe)))

# the query can be changed to work with only some samples belonging to the specified class
query = '(num_collisions ==0) or (num_collisions==1) or (num_collisions==2) or (num_collisions==3) or (num_collisions ==4)'
df = dataframe.query(query)

# dividing input and output columns
X = df.iloc[:, :-2]
y = df.iloc[:, -2]

## Correlations and missing values

It is important to know whether the dataset has missing values and, if so, to replace them. In our case, there are no missing values.

In [None]:
print("Number of null cells in df: %d" %(df.isnull().sum().sum()))
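
If the count above were non-zero, the missing cells would have to be filled before training. A minimal sketch, assuming that per-column median imputation is acceptable for these features; the `toy` DataFrame and its column names are hypothetical and only stand in for `df`:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with two missing cells (stand-in for df)
toy = pd.DataFrame({"f0": [1.0, np.nan, 3.0], "f1": [4.0, 5.0, np.nan]})
n_missing = int(toy.isnull().sum().sum())

# replace each missing cell with the median of its own column
filled = toy.fillna(toy.median())
n_missing_after = int(filled.isnull().sum().sum())

print("missing before: %d, after: %d" % (n_missing, n_missing_after))
```

Other strategies (mean, most frequent value, dropping rows) may fit better depending on the feature; the median is used here only because it is robust to outliers.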

I can also display the correlations between the features in the dataset. The resulting matrix shows that the features are almost uncorrelated.

In [None]:
X.corr()
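
With 35 features the full correlation matrix is hard to read at a glance. A possible way to summarize it is to list only the most correlated pairs. The sketch below is an illustration on hypothetical toy data; `top_correlated_pairs` is a helper introduced here, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(features, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = features.corr().abs()
    cols = corr.columns
    # indices of the strict upper triangle, so each pair appears only once
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[i, j],
        index=pd.MultiIndex.from_arrays([cols[i], cols[j]]),
    )
    return pairs.sort_values(ascending=False).head(n)

# hypothetical toy data: 'b' is almost a multiple of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

On the real data, `top_correlated_pairs(X)` would return the pairs worth a closer look.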



---


# **DATA VISUALIZATION**
Before processing the dataset, it's important to visualize it to get an overview of the data and to decide how to proceed.

## Histogram
This first histogram shows how the samples are distributed across the classes. As you can see, the dataset is heavily imbalanced: more than 500 samples belong to the first class, while the last class has only three samples.

In [None]:
# Histogram plot of the class distribution
distr_plot = sns.histplot(
    data = df, 
    x='num_collisions', 
    hue = 'num_collisions',
    discrete = True,  
    shrink=0.8)
Counter(y)

## Pie Plot
This imbalance can be visualized better using a pie chart and plotting the percentage of examples for each class.

In [None]:
# Pie plot of the class distribution
colors = sns.color_palette('pastel')[0:5]

print("CLASS DISTRIBUTION:")
plot = df['num_collisions'].value_counts().plot.pie(
    autopct='%.2f',
    colors = colors, 
    figsize=(8, 8))

## Two features scatter plot
In this section, two features can be plotted against each other to see how they relate to one another and to the class.

*Obviously, this is just a preliminary analysis.*

In [None]:
def two_features_plot(x, feature1, feature2):
  fig, ax = plt.subplots(figsize=(8,8))
  scatter_plot= sns.scatterplot(
      data=df,
      x=x.iloc[:,feature1], y=x.iloc[:,feature2],
      hue="num_collisions",
      size = "num_collisions",
      ax=ax)

In [None]:
# features from 0 to 34
two_features_plot(X,1,2)

## One feature density plot
In this section, a single feature can be plotted to see whether some features are, on their own, more relevant for predicting the class.

*Obviously, this is just a preliminary analysis.*

In [None]:
def feature_plot(x, feature):
  fig, ax = plt.subplots(figsize=(8,8))
  # use the x argument (not the global X) and draw on the axes created above
  kde_plot = sns.kdeplot(data=df, x = x.iloc[:, feature], hue = 'num_collisions', ax = ax)

In [None]:
# features from 0 to 34
feature_plot(X, 1)



---


# **FEATURE SCALING**

Here, the features can be scaled column-wise with one of four methods, or left unscaled.

The first method is **Maximum Absolute Scaling**, which divides each value by the maximum absolute value of its column, so the scaled values lie between -1 and 1.

The second method is **Min-Max Feature Scaling**, also called normalization, which scales each feature between 0 and 1. It is computed by subtracting the minimum value of the column from the input and then dividing by the difference between the maximum and minimum values.

The third method is the **Standard Scaler**, which rescales each feature to zero mean and unit variance.

The last method is **Robust Scaling**, which removes the median and scales the data according to a quantile range (by default the interquartile range), making it less sensitive to outliers.

I will use StandardScaler.
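
The four formulas above can be checked by hand against the scikit-learn scalers. A small sketch on a hypothetical single-column array (`col` is made up for illustration), assuming the scalers' default settings:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

# hypothetical single feature column
col = np.array([[-4.0], [0.0], [2.0], [6.0], [16.0]])

# Maximum Absolute Scaling: divide by the largest absolute value
max_abs = col / np.abs(col).max(axis=0)
# Min-Max Scaling: shift by the minimum, divide by the range
min_max = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))
# Standard Scaler: subtract the mean, divide by the (population) standard deviation
standard = (col - col.mean(axis=0)) / col.std(axis=0)
# Robust Scaling: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(col, [25, 50, 75], axis=0)
robust = (col - median) / (q3 - q1)

# compare the manual results with the library implementations
checks = {
    "MAS": np.allclose(max_abs, MaxAbsScaler().fit_transform(col)),
    "MMS": np.allclose(min_max, MinMaxScaler().fit_transform(col)),
    "SS": np.allclose(standard, StandardScaler().fit_transform(col)),
    "RS": np.allclose(robust, RobustScaler().fit_transform(col)),
}
print(checks)
```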

In [None]:
# "MAS": Maximum Absolute Scaling, "MMS": Min-Max Scaling or "SS": Standard Scaler, "RS": Robust Scaling
s = "SS"
def scaling(series, scaling_type):
  if scaling_type == "MAS":
    scaler = MaxAbsScaler()
  elif scaling_type == "MMS":
    scaler = MinMaxScaler()
  elif scaling_type == "SS":
    scaler = StandardScaler()
  elif scaling_type == "RS":
    scaler = RobustScaler()
  else:
    return series
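  # fit_transform fits the scaler and transforms the data in one step
  scaled = scaler.fit_transform(series)
  # keep the original row index so that X stays aligned with df and y
  series = pd.DataFrame(scaled, columns = series.columns, index = series.index)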
  return series

X = scaling(X, s)

We can see the effect of the scaling on two chosen features:

In [None]:
two_features_plot(X, 1, 2)
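
One caveat: a scaler fitted on the whole dataset also uses statistics from rows that later end up in the test set. A possible alternative, shown here only as a sketch on hypothetical toy data, is to split first and fit the scaler on the training rows alone (`X_toy`, `y_toy` and the derived names are assumptions, not variables of this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: 100 samples, 3 features, imbalanced labels
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=24, stratify=y_toy)

scaler = StandardScaler()
# mean and variance are estimated from the training rows only
X_tr_scaled = scaler.fit_transform(X_tr)
# the test rows are transformed with the training statistics
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training columns end up with zero mean, while the test columns are only approximately centered, which is the expected behavior.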



---


# **DATA SPLITTING AND RESAMPLING**

## Splitting the dataset
I split the dataset into a training set and a test set. The model is trained on the training data and then tested on data it has not seen. Comparing its predictions with the true labels of the test set can be seen as an unbiased evaluation of the model's performance. Since the dataset is imbalanced, passing the class label array y to the stratify argument ensures that both the training and test sets have the same class proportions as the original dataset.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
          test_size=0.3, random_state=24, stratify = y)

# This below is just a check on the classes count in the test set. 
y_test.value_counts()
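
The effect of `stratify` can be illustrated on a small example. This is a sketch on hypothetical labels (`y_toy` with an 80% / 15% / 5% class mix), using the same `test_size` and `random_state` as above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical imbalanced labels: 80% / 15% / 5%
y_toy = pd.Series([0] * 160 + [1] * 30 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=24, stratify=y_toy)

# class proportions in the full set and in each split, side by side
proportions = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(proportions)
```

Without `stratify`, the rarest class could end up with very few samples, or none, in the test set.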

## Resampling
Here, you can resample the training set. A resampling method repeatedly draws samples from the dataset. Oversampling can be useful in this case since the dataset is severely imbalanced. I decided to try three different oversampling techniques:

**SMOTE**: the Synthetic Minority Oversampling TEchnique works by selecting examples that are close in the feature space, drawing a line between them, and generating a new sample at a point along that line.

**ADASYN**: this method is similar to SMOTE, but it generates a different number of samples depending on an estimate of the local distribution of the class to be oversampled.

**RandomOverSampler**: the most naive strategy, which generates new samples by randomly sampling, with replacement, from the currently available samples.
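
The interpolation step behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration with a single nearest neighbor, not the imbalanced-learn implementation; `smote_like_sample` and the `minority` array are hypothetical:

```python
import numpy as np

def smote_like_sample(minority, rng):
    """Create one synthetic point between a random minority sample and its nearest neighbor."""
    i = rng.integers(len(minority))
    # distances from sample i to every minority sample; exclude the sample itself
    dists = np.linalg.norm(minority - minority[i], axis=1)
    dists[i] = np.inf
    j = np.argmin(dists)
    # pick a random point on the segment between sample i and its neighbor j
    lam = rng.random()
    return minority[i] + lam * (minority[j] - minority[i])

# hypothetical minority class with three samples in two dimensions
minority = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
rng = np.random.default_rng(0)
synthetic = np.array([smote_like_sample(minority, rng) for _ in range(10)])
print(synthetic)
```

Every synthetic point lies on a segment joining two existing minority samples, whereas random oversampling would only duplicate existing rows.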

In [None]:
# "SMOTE", "ADASYN", "ROV": RandomOverSampler, 
res = "ADASYN"

def resample(X, y, res):
  if res == "SMOTE":
    oversample = SMOTE(k_neighbors = 1)
  elif res == "ADASYN":
    oversample = ADASYN(n_neighbors = 1)
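  elif res == "ROV":
    oversample = RandomOverSampler(sampling_strategy="all")
  else:
    # fail early instead of hitting a NameError on an unknown option
    raise ValueError("Unknown resampling method: %s" % res)
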
  X_res, y_res = oversample.fit_resample(X, y)
  print('Resampled dataset shape %s' % Counter(y_res))

  return X_res, y_res

X_res, y_res = resample(X_train, y_train, res)



---


# **MODEL DEFINITION**
Here, I define a few models to evaluate and compare: **SVC**, **DecisionTreeClassifier**, and **RandomForestClassifier**.

In [None]:
# models
rand_forest = RandomForestClassifier(class_weight='balanced')
svc = SVC()
dt = DecisionTreeClassifier()

# list of models to evaluate
models = [rand_forest, svc, dt]



---


# **EVALUATION OF NOT OPTIMIZED MODELS**



In this section, I analyze the main metrics for each model from the previous section, trained on the training set. I decided to use f1_macro, f1_weighted and balanced accuracy (plain accuracy is also reported for reference). The function below computes these metrics and displays the performance of the models on the train and test sets in a compact table. Then, I display the confusion matrices.

In [None]:
def models_scores(models, X_train, y_train, X_test, y_test): 
  # train metrics lists
  f1_macro_train_list = []
  f1_weighted_train_list = []
  balanced_accuracy_train_list = []
  accuracy_train_list=[]

  #test metrics lists 
  f1_macro_test_list = []
  f1_weighted_test_list = []
  balanced_accuracy_test_list = []
  accuracy_test_list=[]

  names = []

  for model in models:
      #append the name of the model to the names list
      names.append(type(model).__name__)

      # fit the model and predict
      model.fit(X_train,y_train)
      y_pred_train = model.predict(X_train)
      y_pred_test = model.predict(X_test)
      

      # compute the metrics for training set
      f1_macro_train = f1_score(y_train, y_pred_train, average = 'macro')
      f1_weighted_train = f1_score(y_train, y_pred_train, average = 'weighted')
      balanced_accuracy_train = balanced_accuracy_score(y_train, y_pred_train)
      accuracy_train = accuracy_score(y_train, y_pred_train)

      # compute the metrics for test set
      f1_macro_test = f1_score(y_test, y_pred_test, average ='macro')
      f1_weighted_test = f1_score(y_test, y_pred_test, average = 'weighted')
      balanced_accuracy_test = balanced_accuracy_score(y_test, y_pred_test)
      accuracy_test = accuracy_score(y_test, y_pred_test)

      # add train metrics to the lists
      f1_macro_train_list.append(f1_macro_train)
      f1_weighted_train_list.append(f1_weighted_train)
      balanced_accuracy_train_list.append(balanced_accuracy_train)
      accuracy_train_list.append(accuracy_train)

      # add test metrics to the lists
      f1_macro_test_list.append(f1_macro_test)
      f1_weighted_test_list.append(f1_weighted_test)
      balanced_accuracy_test_list.append(balanced_accuracy_test)
      accuracy_test_list.append(accuracy_test)
  
  d = {
      'Model': names, 
      'balanced_accuracy_train':  balanced_accuracy_train_list, 
      'balanced_accuracy_test': balanced_accuracy_test_list,
      'F1_macro_train': f1_macro_train_list,
      'F1_macro_test':  f1_macro_test_list, 
      'F1_weighted_train': f1_weighted_train_list, 
      'F1_weighted_test': f1_weighted_test_list, 
      'accuracy_train': accuracy_train_list, 
      'accuracy_test': accuracy_test_list}
      
  scores = pd.DataFrame(d)
  return scores

## Metrics

In [None]:
# metrics of models trained with the original training set
models_scores(models, X_train, y_train, X_test, y_test)

In [None]:
# metrics of models trained with the resampled training set
models_scores(models, X_res, y_res, X_test, y_test)
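
To see how the macro and weighted F1 averages differ on imbalanced data, consider a small worked example. The labels below are hypothetical and chosen so that the minority class is predicted worse than the majority class:

```python
from sklearn.metrics import f1_score

# hypothetical labels: 8 samples of class 0, 2 samples of class 1
y_true_toy = [0] * 8 + [1] * 2
# one of the two minority samples is misclassified as class 0
y_pred_toy = [0] * 8 + [0, 1]

# macro: unweighted mean of the per-class F1 scores
f1_macro_toy = f1_score(y_true_toy, y_pred_toy, average="macro")
# weighted: mean of the per-class F1 scores weighted by class support
f1_weighted_toy = f1_score(y_true_toy, y_pred_toy, average="weighted")

print("macro: %.3f, weighted: %.3f" % (f1_macro_toy, f1_weighted_toy))
```

The per-class F1 scores are 16/17 for class 0 and 2/3 for class 1, so the macro average is about 0.804 while the weighted average is about 0.886: the weighted score hides the weak minority class, the macro score does not.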

## Confusion Matrices
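A confusion matrix is a square matrix with one row per true class and one column per predicted class; each cell counts the samples with that combination of true and predicted label. In the binary case, its cells are the true positive, true negative, false positive and false negative predictions of a classifier. The matrices below are normalized over the true labels (normalize='true'), so each row sums to 1 and the diagonal shows the recall of each class. Here are the confusion matrices of the models: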
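
A small numeric example may help when reading the plots. The sketch below uses hypothetical labels for three classes and computes both the raw counts and the matrix normalized over the true labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels for three classes
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 1, 1, 0, 2]

# rows are true classes, columns are predicted classes
cm_counts = confusion_matrix(y_true_toy, y_pred_toy)
# normalize='true' divides each row by the number of samples of that true class
cm_norm = confusion_matrix(y_true_toy, y_pred_toy, normalize="true")

print(cm_counts)
print(cm_norm)
```

In the normalized matrix the first row reads 0.75, 0.25, 0: three quarters of the class-0 samples are predicted correctly and one quarter is mistaken for class 1.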

In [None]:
def display_confusion_matrices(models, X, y):
  if len(models)>1:
    fig, axes = plt.subplots(nrows=1, ncols=len(models), figsize=(35,10))
    for model, ax in zip(models, axes.flatten()):
      ax.set_title(type(model).__name__)
      ax.grid(False)
      display = ConfusionMatrixDisplay.from_estimator(
        model,
        X,
        y,
        cmap=plt.cm.Blues,
        normalize='true',
        ax = ax)
  else:
    display = ConfusionMatrixDisplay.from_estimator(
    models[0],
    X,
    y,
    cmap=plt.cm.Blues,
    normalize='true')


In [None]:
display_confusion_matrices(models, X_test, y_test)



---


# **HYPERPARAMETER TUNING OF SVC, DECISION TREE, RANDOM FOREST**
Since the performance was poor, I analyzed these classifiers to see whether they can be improved using grid search. As the scoring metric I used balanced accuracy, which I consider one of the most representative performance metrics for this task. Using plain accuracy would probably have led to a set of hyperparameters that classifies all the samples as belonging to class zero, the most populated one.
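
The difference between the two metrics can be shown on a degenerate classifier that always predicts the majority class. The class counts below are hypothetical and only mimic a strong imbalance:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# hypothetical imbalanced labels: 90 / 7 / 3 samples per class
y_true_toy = [0] * 90 + [1] * 7 + [2] * 3
# a classifier that always predicts the majority class
y_pred_toy = [0] * len(y_true_toy)

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
# balanced accuracy is the mean of the per-class recalls: (1 + 0 + 0) / 3
bal_acc_toy = balanced_accuracy_score(y_true_toy, y_pred_toy)

print("accuracy: %.3f, balanced accuracy: %.3f" % (acc_toy, bal_acc_toy))
```

Accuracy rewards this useless classifier with 0.9, while balanced accuracy gives it 1/3, the value expected from ignoring two of the three classes.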

Note: the grid search may take some time, you can decide not to run it. The best parameters found for each model are already saved below.



## Grid search

Choose the model on which to perform the grid search; each hyperparameter is given several candidate values. These values were chosen after reading the documentation of the classifiers and after several trials.

In [None]:
# Choose your classifier: "SVC": Support Vector Machine Classifier, "DT": Decision Tree Classifier, "RFC": Random Forest Classifier
clf = 'SVC'
def define_grid(classifier):
  if classifier == 'SVC':
    estimator  = SVC()
    param_grid = {
      'C': np.arange(0.1, 1, 0.1), 
      'gamma': [1, 0.1, 0.001, 'scale', 'auto'],
      'degree': [2, 3, 4],
      'kernel': ['rbf', 'poly', 'sigmoid'],
      'class_weight': ['balanced', None],
      }
  elif classifier == 'DT':
    estimator = DecisionTreeClassifier()
    param_grid ={
      'max_depth':np.arange(2, 10, 1),
      'class_weight':['balanced', None],
      'splitter': ['best', 'random']
  }
  elif classifier == 'RFC':
    estimator = RandomForestClassifier(n_jobs = -1)
    param_grid ={
        'n_estimators':[2, 4, 6, 10, 15, 20, 30, 50, 60, 90, 100, 150],
        'max_depth':[2,3,4,5, 6,7],
        'class_weight':['balanced', None],
        'max_features':['sqrt', 'log2', None],
  }
  return estimator, param_grid


In [None]:
estimator, param_grid = define_grid(clf)
grid_search = GridSearchCV(estimator=estimator, param_grid = param_grid, cv=3, n_jobs =-1, scoring='balanced_accuracy')

If you want to avoid the grid search, skip this part: the best combinations of parameters have already been found and saved. You can load and evaluate the models in the next section.

In [None]:
grid_search.fit(X_train, y_train)

If you ran the grid search, you can display the best parameters found for the chosen model and dataset:

In [None]:
#Best parameters for the classifier
print("Best classification hyper-parameters for the chosen classifier: %r" %grid_search.best_params_)
print("Best balanced accuracy: %.4f" %grid_search.best_score_)
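
The fitted search object also exposes the refitted best model through `best_estimator_`. A minimal end-to-end sketch on hypothetical synthetic data (`make_classification` and the small grid below are assumptions made for illustration, not the grids used above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# hypothetical imbalanced binary problem
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=24, stratify=y_toy)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "class_weight": ["balanced", None]},
    cv=3,
    scoring="balanced_accuracy",
)
search.fit(X_tr, y_tr)

# with the default refit=True, best_estimator_ is retrained on the whole training set
best_model = search.best_estimator_
test_score = balanced_accuracy_score(y_te, best_model.predict(X_te))
print(search.best_params_, "test balanced accuracy: %.3f" % test_score)
```

The cross-validated `best_score_` and the score on the held-out test set usually differ; only the latter was never seen during the search.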



---


# **EVALUATION AFTER HYPERPARAMETER OPTIMIZATION**
After the grid search, the new models must be trained on the training set and evaluated on both the training and test sets. This makes it possible to visualize the performance and to tell whether the models overfit the data.

## F1 Scores and Accuracies
If you haven't run the grid search, which takes quite long, use the cells below, which have the best hyperparameters already set.
In the first cell, I evaluate the models trained on the original training set, while in the second cell, I evaluate the models trained on the resampled training set.

In [None]:
# best parameters found for the models with GridSearch, evaluation of models trained with original training set
# you can even extract the best model found by the GridSearch using grid_search.best_estimator_
svc = SVC(C=0.7, class_weight ='balanced', degree = 2, kernel = 'poly', gamma = 'scale')
decision_tree = DecisionTreeClassifier(max_depth = 5, class_weight = 'balanced', splitter = 'random', random_state=2)
random_forest = RandomForestClassifier(n_estimators = 50, class_weight='balanced',  max_depth = 2, random_state = 4)



best_models = [svc, decision_tree, random_forest]
df_metrics = models_scores(best_models, X_train, y_train, X_test, y_test)
df_metrics

In [None]:
# best parameters found for the models with GridSearch, evaluation of models trained with resampled dataset
svc_res = SVC(C=0.0094, class_weight = 'balanced', degree = 2, kernel = 'poly', gamma = 0.2)
decision_tree_res = DecisionTreeClassifier(max_depth = 10, splitter = 'best', random_state=4)
random_forest_res = RandomForestClassifier(n_estimators = 12, max_depth = 4, random_state =4)


best_models_res = [svc_res, decision_tree_res, random_forest_res]
df_metrics_res = models_scores(best_models_res, X_res, y_res, X_test, y_test)
df_metrics_res

## Confusion matrices
It is also useful to display the confusion matrices both on the test and train set. 

### Test set Confusion Matrices

In [None]:
print("confusion matrices of predictions on the test set with clf trained on original training set")
display_confusion_matrices(best_models, X_test, y_test)

In [None]:
print("confusion matrices of predictions on the test set with clf trained on resampled training set")
display_confusion_matrices(best_models_res, X_test, y_test)

### Train set Confusion Matrices
Useful to visualize overfitting.




In [None]:
print("confusion matrices of predictions on original training set with clf trained on original training set")
display_confusion_matrices(best_models, X_train, y_train)

In [None]:
print("confusion matrices of predictions on the resampled training set with clf trained on resampled training set")
display_confusion_matrices(best_models_res, X_res, y_res)



---


# **BAGGING**
In this section, I try bagging. Bagging is an ensemble learning technique in which each classifier is trained on a random subset of examples drawn from the training dataset (a bootstrap sample). Once the individual classifiers are fit to their bootstrap samples, their predictions are combined.
I use BaggingClassifier from the scikit-learn library with an SVC with polynomial kernel as the base estimator.
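
The two steps described above, bootstrap sampling and combining the predictions, can be written out by hand. The sketch below is an illustration on hypothetical synthetic data, with shallow decision trees instead of SVCs to keep it fast, and with majority voting as the combination rule (an assumption of this sketch):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# hypothetical binary classification data
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0)
rng = np.random.default_rng(0)

n_estimators = 15
votes = []
for _ in range(n_estimators):
    # bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_toy[idx], y_toy[idx])
    votes.append(tree.predict(X_toy))

# votes has one row per estimator and one column per sample
votes = np.array(votes)
n_classes = len(np.unique(y_toy))
# majority vote: the most frequent predicted class for each sample
y_vote = np.array([np.bincount(col, minlength=n_classes).argmax() for col in votes.T])
train_accuracy = float((y_vote == y_toy).mean())
print("training accuracy of the voted ensemble: %.3f" % train_accuracy)
```

BaggingClassifier wraps the same idea behind a single estimator interface, so it can be passed to the evaluation functions defined earlier.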

In [None]:
base_estimator = SVC(C=0.85, class_weight ='balanced', degree = 2, kernel = 'poly', gamma = 'scale')
base_estimator_res = SVC(C=0.0094, class_weight = 'balanced', degree = 2, kernel = 'poly', gamma = 0.2)

# the base estimator is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2
bagging = BaggingClassifier(
    base_estimator,
    n_estimators = 100,
    random_state=0
    )

bagging_res = BaggingClassifier(
    base_estimator_res,
    n_estimators = 50,
    random_state = 0
    )


In [None]:
model = [bagging]
model_res = [bagging_res]

## Metrics
Here, the main metrics are reported:

In [None]:
# metrics for the model trained with the original dataset
models_scores(model, X_train, y_train, X_test, y_test)

In [None]:
# metrics for the model trained with the resampled dataset 
models_scores(model_res, X_res, y_res, X_test, y_test)

## Confusion Matrices

It is useful to display the confusion matrices both on the test and train set. 

### Test set

In [None]:
print("confusion matrices of predictions on the original test set with clf trained on original training set")
display_confusion_matrices(model, X_test, y_test)

In [None]:
print("confusion matrices of predictions on the original test set with clf trained on resampled training set")
display_confusion_matrices(model_res, X_test, y_test)

### Training set

The following matrices are useful to visualize overfitting.

In [None]:
print("confusion matrices of predictions on the original training set with clf trained on original training set")
display_confusion_matrices(model, X_train, y_train)

In [None]:
print("confusion matrices of predictions on the resampled training set with clf trained on resampled training set")
display_confusion_matrices(model_res, X_res, y_res)