# **MULTI UAV CONFLICT RISK ANALYSIS - REGRESSION**



---


# **IMPORT**
Import the required packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# data visualization
import seaborn as sns

# data processing 
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler, RobustScaler

# training
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression


# evaluation
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error



---


# **MOUNT DRIVE**
Mount Google Drive so that the dataset can be loaded from it.

In [None]:
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')



---


# **LOAD THE DATASET AND FEATURE SCALING**
To load the dataset, I first set its location in my Drive (to avoid errors, check your own path and replace it). Then, since this is a tab-separated values file, I read it using *pandas.read_csv()*, which loads it into a DataFrame.

You can change *feature_number* to plot the relation between a single feature and min_CPA and to test different configurations. There's also a function that loads the whole dataset and splits it into input and output columns. The inputs are stored in the first 35 columns of df, while the regression target is stored in the last column.

## Feature Scaling
While extracting the dataset, I also perform feature scaling.
The features can be scaled column-wise with one of four methods, or left unscaled.

The first method is **Maximum Absolute Scaling**, which maps the input data to values between -1 and 1. It divides each value by the maximum absolute value of its column.

The second method is **Min-Max Feature Scaling**, also called normalization, which scales each feature to the range between 0 and 1. It subtracts the column minimum from each value and then divides by the difference between the column maximum and minimum.

The third method is the **Standard Scaler**, which rescales each column to zero mean and unit variance.

The last method is **Robust Scaling**, which removes the median and scales the data according to a quantile range (by default, the interquartile range).

I've used StandardScaler.
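
To make the four formulas concrete, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the dataset; the variable names (`col`, `mas`, `mms`, `ss`, `rs`) are only placeholders.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, RobustScaler

# toy single-feature column (hypothetical values, not from the dataset)
col = np.array([[-4.0], [0.0], [2.0], [6.0]])

# Maximum Absolute Scaling: x / max(|x|)
mas = MaxAbsScaler().fit_transform(col)
# Min-Max Feature Scaling: (x - min) / (max - min)
mms = MinMaxScaler().fit_transform(col)
# Standard Scaler: (x - mean) / std
ss = StandardScaler().fit_transform(col)
# Robust Scaling: (x - median) / interquartile range
rs = RobustScaler().fit_transform(col)

for name, values in [("MAS", mas), ("MMS", mms), ("SS", ss), ("RS", rs)]:
    print(name, values.ravel().round(3))
```

Printing the four results side by side shows how differently the same column ends up being ranged, which is worth keeping in mind when comparing models that are sensitive to feature scale, such as SVR.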

In [None]:
# "MAS": Maximum Absolute Scaling, "MMS": Min-Max Feature Scaling, "SS": Standard Scaler, "RS": Robust Scaling (any other value: no scaling)
s = "SS"
def scaling(series, scaling_type):
  if scaling_type == "MAS":
    scaler = MaxAbsScaler()
  elif scaling_type == "MMS":
    scaler = MinMaxScaler()
  elif scaling_type == "SS":
    scaler = StandardScaler()
  elif scaling_type == "RS":
    scaler = RobustScaler()
  else:
    return series
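  # fit_transform fits the scaler and transforms the data in one step
  scaled = scaler.fit_transform(series)
  # rebuild the DataFrame, keeping the original column names and row index
  series = pd.DataFrame(scaled, columns = series.columns, index = series.index)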
  return series

## Extract the data

In [None]:
# extract all the features
def load_all(dataframe):
  print('Loaded all the features and min_CPA of %d samples into X and y.' %(len(dataframe)))
  #dividing input and targets
  X = dataframe.iloc[:, :-2] 
  y = dataframe.iloc[:, -1]
  # feature scaling
  X = scaling(X, s)
  return X, y


# extract just one feature
def load_one(feature_number, dataframe):
  print('Loaded feature number %d and min_CPA of %d samples into X1 and y1.' %(feature_number, len(dataframe)))
  #extracting the dataset with the required feature
  dataframe1 = dataframe.loc[:, [list(dataframe.columns)[feature_number], 'min_CPA']]
  # dividing input and targets
  X1 = dataframe1.iloc[:, 0]
  y1 = dataframe1.iloc[:, 1] 
  return dataframe1, X1, y1

In [None]:
# importing the file from drive and reading it into DataFrame
filename = '/content/drive/MyDrive/Project_ML/Data/train_set.tsv'
df = pd.read_csv(filename, sep = "\t", header = 0)

# load the full dataset
X, y = load_all(df)

# load just the feature specified in feature_number and the targets
feature_number = 0
df1, X1, y1 = load_one(feature_number, df)

## Correlations and missing values

It is important to know whether the dataset contains missing values and, if so, to replace them. In our case, there are no missing values.

In [None]:
print("Number of null cells in df: %d" %(df.isnull().sum().sum()))
print("Number of null cells in df1: %d" %(df1.isnull().sum().sum()))
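
This dataset has nothing to replace, but if the counts above were non-zero, one common option would be to fill each gap with the median of its column. The sketch below uses a made-up frame (`toy`, with hypothetical columns `f0` and `f1`) rather than the real data.

```python
import numpy as np
import pandas as pd

# hypothetical frame with two gaps (column names are made up for illustration)
toy = pd.DataFrame({"f0": [1.0, np.nan, 3.0], "f1": [np.nan, 2.0, 4.0]})

# same check as above: total number of null cells
n_missing = int(toy.isnull().sum().sum())

# replace every missing cell with the median of its own column
filled = toy.fillna(toy.median())

print("missing before:", n_missing)
print("missing after:", int(filled.isnull().sum().sum()))
```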

I can also compute the correlations between the columns of the dataset. The result shows that there is almost no correlation.

In [None]:
# df.corr()
df1.corr()
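
With 35 features, the full correlation matrix is hard to read at a glance. A possible shortcut, sketched here on synthetic data, is to keep only the column of correlations with the target and sort it. The features `f0` and `f1` and the way the target is generated are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic stand-in for the dataset: two features and a target
toy = pd.DataFrame({
    "f0": rng.normal(size=200),
    "f1": rng.normal(size=200),
})
# the synthetic target depends on f0 only, plus a little noise
toy["min_CPA"] = 2.0 * toy["f0"] + rng.normal(scale=0.1, size=200)

# absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["min_CPA"]
    .drop("min_CPA")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data the same expression would be applied to `df` instead of `toy`.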



---


# **DATA VISUALIZATION**



## Regplot
Here, I visualize the relation between the two columns of df1. To plot a different feature, take x as `df.iloc[:, feature_number]` and y as `df.iloc[:, -1]`.

In [None]:
sns.regplot(x = df1.iloc[:, 0], y = df1.iloc[:, 1], fit_reg=False)




---


# **DATA SPLITTING**

## **Splitting the dataset**
I've split the dataset into a training set and a test set. The model is trained on the training data and then tested on data it has not seen. Comparing the predictions to the targets in the test set gives an unbiased evaluation of the model's performance.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
          test_size=0.3, random_state=24)
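
As a quick sanity check of what `test_size=0.3` does, here is a small sketch on stand-in arrays (`X_toy` and `y_toy` are placeholders with 10 samples, not the real dataset). It confirms the 70/30 proportion and that no sample ends up in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in arrays: 10 samples, 2 features (not the real dataset)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=24)

print("train:", X_tr.shape, "test:", X_te.shape)
# targets are unique here, so an empty intersection means no shared samples
print("shared samples:", set(y_tr) & set(y_te))
```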



---


# **FIRST MODELS**
Here, I've defined a few models with default parameters. I've decided to compare **SVR**, **DecisionTreeRegressor** and **RandomForestRegressor**.

In [None]:
# models
rand_forest = RandomForestRegressor()
svr = SVR()
dtr = DecisionTreeRegressor()

# list of models for evaluation purposes
models = [rand_forest, svr, dtr]



---


# **MODELS EVALUATION**
Next, I've computed the main metrics for each of the models defined above, after training it on the training set. I've decided to use the mean squared error and the r2 score. The function below does exactly this and displays the performance of the models on the training and test sets in a compact table.

In [None]:
def models_scores(models, X_train, y_train, X_test, y_test): 
  # metrics lists for train
  r2_train_list = []
  mean_squared_train_list = []
  
  # metrics lists for test
  r2_test_list = []
  mean_squared_test_list = []
  
  names = []

  for model in models:
      #append the name of the model to the names list
      names.append(type(model).__name__)

      # fit the model and predict
      model.fit(X_train,y_train)
      y_pred_train = model.predict(X_train)
      y_pred_test = model.predict(X_test)
      

      # compute the metrics for training set
      mse_train = mean_squared_error(y_train, y_pred_train)
      r2_train = r2_score(y_train, y_pred_train)

      # compute the metrics for test set
      mse_test = mean_squared_error(y_test, y_pred_test)
      r2_test = r2_score(y_test, y_pred_test)

      # add train metrics to the list 
      r2_train_list.append(r2_train)
      mean_squared_train_list.append(mse_train)

      # add test metrics to the list
      r2_test_list.append(r2_test)
      mean_squared_test_list.append(mse_test)
      
  d = {
      'Model': names, 
      'R2_train': r2_train_list,
      'R2_test': r2_test_list, 
      'MSE_train': mean_squared_train_list, 
      'MSE_test': mean_squared_test_list}
  scores = pd.DataFrame(d)
  return scores

In [None]:
models_scores(models, X_train,y_train, X_test, y_test)
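
For reference, the two metrics in the table can be computed by hand. The sketch below uses made-up targets and predictions (`y_true` and `y_hat` are illustrative only) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# made-up targets and predictions, only to illustrate the formulas
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# r2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An r2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the targets, and negative values mean it does worse than that.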



---


# **MODELS OPTIMIZATION**
As you can see, the models with default parameters performed poorly. Here, I've tried to improve their performance by setting some parameters by hand.


In [None]:
# polynomial svr
poly_svr = SVR(kernel='poly',C= 1, degree=6,  gamma=0.2)

# decision tree regressor
dt = DecisionTreeRegressor(criterion='poisson', 
                           splitter='best', 
                           max_depth=None, 
                           max_leaf_nodes=2)
# random forest regressor
ran_forest = RandomForestRegressor(n_estimators = 90, max_depth =3, criterion ='poisson')

best_models = [poly_svr, dt, ran_forest]

## **Metrics**
Below are the metrics of the hand-tuned models. The performance is slightly better. The SVR with a polynomial kernel still has a poor r2 score, but it is visibly better than the others.

In [None]:
models_scores(best_models, X_train, y_train, X_test, y_test)



---


# **HYPERPARAMETER TUNING**
Until now, I've tuned the parameters by hand. Here, I've tried to improve the performance using GridSearchCV. As scoring I used r2, which I considered one of the most representative performance metrics for this task.



## **Grid search**

I've searched for the optimal combination of parameters for the chosen model, given several candidate values for each parameter. These values were chosen after reading the documentation of the regressors and after several trials.

In [None]:
# Choose your regressor: "SVR": Support Vector Regressor, "RFR": Random Forest Regressor, "DTR": Decision Tree Regressor
regr = 'DTR'
def define_grid(regressor):
  if regressor == 'SVR':
    estimator  = SVR()
    param_grid = {
        'kernel': ['rbf', 'sigmoid', 'poly'],
        'degree': np.arange(2, 10, 1),
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'gamma':[0.001, 0.01, 0.1, 0.014, 1, 10, 100]
      }
  elif regressor == 'RFR':
    estimator = RandomForestRegressor()
    param_grid = {
        'n_estimators':[1, 5, 10, 15, 20, 50, 100, 200, 500, 1000],
        'max_depth': np.arange(2, 7, 1),
        'criterion': ['poisson', 'squared_error', 'absolute_error'],
      }

  elif regressor == 'DTR':
    estimator = DecisionTreeRegressor()
    param_grid = {
        'criterion': ['poisson', 'squared_error', 'absolute_error'],
        'splitter': ['best', 'random'],
        'max_depth': [None, 1, 2, 3, 4, 5, 6, 8, 9, 10]
      }
  return estimator, param_grid
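
Before launching a search, it can help to count how many fits a grid implies. The sketch below copies the SVR grid from above and counts its combinations with scikit-learn's `ParameterGrid`; the names `svr_grid`, `n_candidates` and `n_fits` are introduced here only for illustration, and `cv = 3` matches the setting used in the next cell.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# copy of the SVR grid defined above
svr_grid = {
    'kernel': ['rbf', 'sigmoid', 'poly'],
    'degree': np.arange(2, 10, 1),
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 0.014, 1, 10, 100]
}

cv = 3  # number of cross-validation folds
n_candidates = len(ParameterGrid(svr_grid))  # 3 * 8 * 6 * 7 combinations
n_fits = n_candidates * cv                   # one fit per candidate and fold

print("candidates:", n_candidates, "fits:", n_fits)
```

Note that `degree` only affects the polynomial kernel, so many of these candidates are effectively duplicates for the `rbf` and `sigmoid` kernels.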


In [None]:
estimator, param_grid = define_grid(regr)
# grid search using r2
grid_search = GridSearchCV(estimator=estimator, param_grid = param_grid, cv=3, n_jobs =-1, scoring='r2')

If you want to skip the grid search, do not run the next two cells.

In [None]:
# run this cell only if you want to perform Grid Search. It will take some time.
grid_search.fit(X_train, y_train)

If you have run the grid search, you can display the best parameters found for the chosen model on this dataset.

In [None]:
# best parameters for the regressor
print("Best regression hyper-parameters for the chosen regressor: %r" %grid_search.best_params_)
print("Best r2: %.4f" %grid_search.best_score_)
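
Besides the single best result, GridSearchCV keeps the cross-validated score of every candidate in `cv_results_`. Here is a minimal sketch on synthetic data; the arrays `X_toy` and `y_toy`, the noisy sine target and the three-value grid are assumptions made for this example and are not the grid used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# synthetic regression problem: noisy sine of a single feature
X_toy = rng.uniform(-3, 3, size=(150, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=150)

# deliberately tiny grid so the search finishes quickly
small_grid = {"max_depth": [1, 3, 5]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      small_grid, cv=3, scoring="r2")
search.fit(X_toy, y_toy)

# mean cross-validated r2 of each candidate, followed by the winner
results = zip(search.cv_results_["params"],
              search.cv_results_["mean_test_score"])
for params, score in results:
    print(params, round(float(score), 3))
print("best:", search.best_params_, round(float(search.best_score_), 3))
```

Looking at all the candidates, not only the best one, shows whether the winner is clearly ahead or whether several settings perform about the same.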



---


# **BAGGING**
In this section, I've tried bagging. Bagging is an ensemble learning technique in which each estimator is trained on a random sample of the training examples, drawn with replacement (a bootstrap sample). Once the individual estimators have been fit to their bootstrap samples, their predictions are combined; for regression, they are averaged.
I've used BaggingRegressor from the scikit-learn library.

In [None]:
base_estimator = SVR(kernel='poly',C= 500, degree=2,  gamma=0.075)
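# the estimator is passed positionally: the keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases
bagging = BaggingRegressor(base_estimator, n_estimators=21, random_state=0)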
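
To show what happens inside a bagging ensemble, here is a hand-written sketch of the two steps, bootstrap sampling and averaging. It is not how BaggingRegressor is implemented internally. For speed it uses a DecisionTreeRegressor instead of the SVR above, and the data (`X_toy`, `y_toy`) and the query points (`X_new`) are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# synthetic regression problem: noisy sine of a single feature
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.2, size=200)

n_estimators = 21
estimators = []
for _ in range(n_estimators):
    # bootstrap sample: draw as many indices as samples, with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X_toy[idx], y_toy[idx])
    estimators.append(tree)

# aggregation: average the predictions of the individual estimators
X_new = np.array([[0.0], [1.5]])
all_preds = np.stack([est.predict(X_new) for est in estimators])
bagged_pred = all_preds.mean(axis=0)

print("per-estimator spread:", all_preds.std(axis=0).round(3))
print("bagged prediction:", bagged_pred.round(3))
```

The spread across the individual estimators illustrates the variance that the averaging step smooths out.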

## Evaluation
As you can see, bagging performs reasonably well, although the scores are still not very good. Its main benefit here is that it reduces overfitting.

In [None]:
model = [bagging]
models_scores(model, X_train, y_train, X_test, y_test)