# Enhancing LoRa-based Outdoor Localization Accuracy Using Machine Learning

### Supplementary Jupyter Notebook 

## Introduction

This notebook presents a machine learning (ML)-based localization framework for LoRaWAN networks, aiming to improve positioning accuracy in smart city applications.  
We evaluate six ML models (`k-NN`, `ANN`, `XGBoost`, `LightGBM`, `SVR`, and `CNN`) on an open-source dataset.  
Performance is compared using multiple error metrics and benchmarked against existing studies.
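
The individual metrics are not listed in this notebook. As a minimal sketch, assuming predictions have already been mapped back to latitude and longitude in degrees, a per-sample great-circle (haversine) error can be summarized by its mean and median. The helper name `haversine_m` and the sample coordinates are illustrative and not taken from the study.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, a common approximation


def haversine_m(true_deg, pred_deg):
    """Great-circle distance in metres between rows of [lat, lon] given in degrees."""
    true_rad = np.radians(np.asarray(true_deg, dtype=float))
    pred_rad = np.radians(np.asarray(pred_deg, dtype=float))
    dlat = pred_rad[:, 0] - true_rad[:, 0]
    dlon = pred_rad[:, 1] - true_rad[:, 1]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(true_rad[:, 0]) * np.cos(pred_rad[:, 0]) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Illustrative coordinates only (not from the dataset)
y_true_deg = np.array([[51.2194, 4.4025], [51.2200, 4.4100]])
y_pred_deg = np.array([[51.2199, 4.4030], [51.2190, 4.4080]])

errors_m = haversine_m(y_true_deg, y_pred_deg)
print(f"mean error:   {errors_m.mean():.1f} m")
print(f"median error: {np.median(errors_m):.1f} m")
```

The median is less sensitive than the mean to a few very large errors, which is why localization results are often reported with both.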

#### Environment Setup


In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras

warnings.filterwarnings("ignore")

# Libraries for ML models
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor, early_stopping
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Libraries for data preprocessing
from sklearn.model_selection import train_test_split


#### Dataset Description & Preprocessing

The complete data preparation process, including feature selection, scaling, and signal processing (e.g., logarithmic transformation of RSSI), has been thoroughly described in the main body of the associated research article.

In [2]:
# Example: loading data
X = pd.read_csv('X_scaled_1_dataset.csv')
y = pd.read_csv('y_scaled.csv')

**Note:** This notebook uses only preprocessed data (e.g., features derived from log-transformed RSSI values) to demonstrate the architecture and implementation of the machine learning models. It is **not** intended as a fully replicable pipeline for reproducing the final reported results.

The input features (`X`, and the `X_train` split derived from it) are the final processed form of the LoRaWAN signal attributes, and the target output (`y`) holds the scaled true positions (e.g., x and y coordinates).
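
Because the targets are scaled, model outputs are also in scaled units and have to be mapped back before distances can be interpreted. The sketch below assumes a min-max scaler was used for the targets. The scaler type, the variable name `target_scaler`, and the sample coordinates are assumptions made for illustration, since the notebook loads data that is already scaled.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative raw targets [latitude, longitude] in degrees (not from the dataset)
y_raw = np.array([[51.20, 4.40],
                  [51.22, 4.42],
                  [51.24, 4.46]])

# Assumption: a min-max scaler fitted on the targets
target_scaler = MinMaxScaler()
y_scaled_demo = target_scaler.fit_transform(y_raw)

# A model trained on scaled targets predicts in scaled units ...
y_pred_scaled = np.array([[0.5, 0.25]])

# ... so predictions are mapped back to degrees with the same scaler
y_pred_deg = target_scaler.inverse_transform(y_pred_scaled)
print(y_pred_deg)
```

In a full pipeline the scaler would normally be fitted on the training targets only and then reused for the validation and test sets.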




In [3]:
# Split Data: Train (70%), Validation (15%), Test (15%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

## Model Architectures

In this section, we briefly describe the key models used in our study. Below each description, we include the full architecture or implementation used for experimentation.


## 1. k-Nearest Neighbors (k-NN)

The k-Nearest Neighbors (k-NN) model is a distance-based, non-parametric learning method used as a baseline in our comparison. Unlike neural networks, k-NN does not learn internal parameters or feature hierarchies. Instead, it predicts a device's position by averaging the target values of the `k` closest samples in the training set based on the chosen distance metric.


#### Hyperparameters:
- `k` — number of neighbors: **5**, **7**, **9**, **11**
- `p` — distance metric (Minkowski exponent):  
  - `p = 2`: Euclidean distance  
  - `p = 1`: Manhattan distance  

- Output: latitude and longitude

In [None]:
# Hyperparameter settings
k = 5  # Number of neighbors
p = 2  # Minkowski exponent (2 = Euclidean, 1 = Manhattan)

# Model training on the full training data
model_knn = KNeighborsRegressor(n_neighbors=k, p=p)
model_knn.fit(X_train, y_train)
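
One way to compare the listed values of `k` and `p` is to score every combination on a held-out validation set. The sketch below does this on synthetic data. The data, the variable names, and the use of validation MSE as the selection criterion are assumptions, since the notebook does not state how the final setting was chosen.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the scaled features and two-column targets
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 6))
y_demo = np.column_stack([X_demo[:, 0] + 0.1 * rng.normal(size=300),
                          X_demo[:, 1] - X_demo[:, 2]])
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Score every (k, p) combination from the hyperparameter list on the validation split
results = {}
for k in (5, 7, 9, 11):
    for p in (1, 2):
        model = KNeighborsRegressor(n_neighbors=k, p=p).fit(X_tr, y_tr)
        results[(k, p)] = mean_squared_error(y_va, model.predict(X_va))

best_k, best_p = min(results, key=results.get)
print(f"best k={best_k}, p={best_p}, validation MSE={results[(best_k, best_p)]:.4f}")
```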



## 2. Convolutional Neural Network (CNN)


The Convolutional Neural Network (CNN) model is designed to learn local patterns from 1D sequences of RSSI features. By applying convolutional filters, the model captures signal variations and feature interactions across neighboring input values. These extracted representations are then processed by dense layers to predict the device's coordinates `[x, y]`.

#### Hyperparameters:

- Convolutional layers:
  - Layer 1: 32 filters, kernel size = 3
  - Layer 2: 64 filters, kernel size = 3
- Activation function (hidden layers): `ReLU`
- Fully connected (dense) layer: 128 neurons
- Dropout rate: `0.2`
- Output layer: 2 neurons (latitude and longitude), `linear` activation
- Training epochs: **10, 50, 100, 200**
- Learning rate: **0.001, 0.005**
- Optimizer: Adam

In [9]:
# Create CNN model
model_CNN = keras.Sequential([
    keras.layers.Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(X_train.shape[1], 1)),
    keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(2, activation='linear')  # Predicting [latitude, longitude]
])

In [None]:
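# Conv1D expects 3D input (samples, features, 1), so add a channel axis to X_train
X_train_cnn = np.asarray(X_train, dtype="float32")[..., np.newaxis]
model_CNN.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss='mse')
model_CNN.fit(X_train_cnn, y_train, epochs=100, batch_size=32)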

In [None]:
model_CNN.summary()
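
The layer shapes and parameter counts that `summary()` reports can be worked out by hand. The sketch below does the arithmetic for the architecture above, assuming the Keras `Conv1D` defaults (stride 1, `'valid'` padding). The feature count of 10 is a hypothetical value, since the real input width depends on the dataset.

```python
def conv1d_output_length(length, kernel_size):
    """Output length of a Conv1D layer with stride 1 and 'valid' padding."""
    return length - kernel_size + 1


def cnn_parameter_counts(n_features):
    """Trainable parameters per layer for the Conv1D(32) -> Conv1D(64) -> Dense(128) -> Dense(2) stack."""
    len1 = conv1d_output_length(n_features, 3)   # after first convolution
    len2 = conv1d_output_length(len1, 3)         # after second convolution
    flat = len2 * 64                             # Flatten: positions x filters
    return {
        "conv1": 3 * 1 * 32 + 32,     # kernel * in_channels * filters + biases
        "conv2": 3 * 32 * 64 + 64,
        "dense": flat * 128 + 128,
        "output": 128 * 2 + 2,
    }


n_features = 10  # hypothetical input width
counts = cnn_parameter_counts(n_features)
print(counts)
print("total:", sum(counts.values()))
```

Each convolution with kernel size 3 shortens the sequence by 2, so the input needs at least 5 features for both layers to produce an output.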

## 3. Extreme Gradient Boosting (XGBoost)

XGBoost is a high-performance implementation of gradient-boosted decision trees. It builds an ensemble of weak learners (trees) iteratively, each one correcting the errors of its predecessors. It is well-suited for structured data like RSSI-based features, offering both speed and predictive power.



#### Hyperparameters:

- `Learning rate`: **0.01, 0.05, 0.1, 0.2**

- Number of estimators (`n_estimators`): **200, 400, 600, 800**

- Max depth of trees: **6, 8, 10, 12**

- Output: latitude and longitude

In [None]:
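# Create XGB model
# Note: in xgboost >= 2.0, eval_metric and early_stopping_rounds are constructor
# arguments; older releases accepted them in fit() instead.
model_XGB = XGBRegressor(n_estimators=400, max_depth=12, learning_rate=0.05,
                         eval_metric="rmse", early_stopping_rounds=10)

In [None]:
model_XGB.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)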
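
The idea of each tree correcting the errors of its predecessors can be illustrated with plain gradient boosting under squared loss, where every new tree is fitted to the current residuals. This is a simplified sketch on synthetic data using scikit-learn trees. It is not XGBoost's regularized, second-order algorithm, and all names and settings in it are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for features and two-column targets
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 4))
y_demo = np.column_stack([np.sin(3 * X_demo[:, 0]),
                          X_demo[:, 1] * X_demo[:, 2]])

learning_rate = 0.1
# Start from the mean of each target column
prediction = np.tile(y_demo.mean(axis=0), (len(y_demo), 1))
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction                      # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)   # shrunken correction step
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The training error can only go down in this loop, which is why a validation set with early stopping, as in the cell above, is used to decide when to stop adding trees.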

## 4. LightGBM (Light Gradient Boosting Machine)

LightGBM is a gradient boosting framework based on decision trees that is designed to be efficient and scalable. It is well suited to large datasets and is designed for faster training and lower memory usage than other gradient boosting implementations such as XGBoost.


In this setup, two separate LightGBM models were used to predict geographical coordinates:

- `model_LightGBM_lat` for latitude
- `model_LightGBM_lon` for longitude

#### Hyperparameters:

- Number of leaves (`num_leaves`): **200, 400, 600**

- Learning rate (`learning_rate`): **0.01, 0.02, 0.05, 0.1**

- Number of estimators (`n_estimators`): **600, 800, 1000**

- Row subsampling (`subsample`): **0.8**

- Feature subsampling per tree (`colsample_bytree`): **0.8**

- Regularization:
  - `reg_alpha` = 0.1 (L1)
  - `reg_lambda` = 0.1 (L2)
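
The three tuned hyperparameters listed above span 3 × 4 × 3 = 36 combinations. As a minimal sketch, the candidates can be enumerated with `itertools.product` and merged with the fixed settings. Whether the study searched this grid exhaustively is an assumption, and the names `search_space`, `fixed`, and `candidates` are illustrative.

```python
from itertools import product

# Values taken from the hyperparameter list above
search_space = {
    "num_leaves": [200, 400, 600],
    "learning_rate": [0.01, 0.02, 0.05, 0.1],
    "n_estimators": [600, 800, 1000],
}
# Settings that stay the same for every candidate
fixed = {"subsample": 0.8, "colsample_bytree": 0.8, "reg_alpha": 0.1, "reg_lambda": 0.1}

# One parameter dictionary per combination of the tuned values
candidates = [
    {**dict(zip(search_space, values)), **fixed}
    for values in product(*search_space.values())
]

print(len(candidates))
print(candidates[0])
```

Each dictionary can be passed to `LGBMRegressor(**candidate)` and scored on the validation split.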

In [None]:
from lightgbm import LGBMRegressor, early_stopping

params = {
    'num_leaves': 400,
    'learning_rate': 0.02,
    'n_estimators': 1000,
    'subsample': 0.8,  # row sampling; LightGBM applies it only when subsample_freq > 0
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
    'random_state': 42
}

# Targets as NumPy arrays so that columns can be selected by position
y_train_arr, y_val_arr = np.asarray(y_train), np.asarray(y_val)

# Create two LGBMRegressor models, one per coordinate
model_LightGBM_lat = LGBMRegressor(**params)
model_LightGBM_lat.fit(X_train, y_train_arr[:, 0],
                       eval_set=[(X_val, y_val_arr[:, 0])],
                       eval_metric='rmse',
                       callbacks=[early_stopping(stopping_rounds=10, verbose=True)])

model_LightGBM_lon = LGBMRegressor(**params)
model_LightGBM_lon.fit(X_train, y_train_arr[:, 1],
                       eval_set=[(X_val, y_val_arr[:, 1])],
                       eval_metric='rmse',
                       callbacks=[early_stopping(stopping_rounds=10, verbose=True)])

## 5. Support Vector Regression (SVR) 

Support Vector Regression (SVR) is a kernel-based machine learning algorithm that extends the principles of Support Vector Machines (SVM) to regression tasks. It aims to find a function that approximates the target outputs within a specified margin of tolerance ($\epsilon$), while also minimizing model complexity.

In this setup, two separate SVR models were used to predict geographical coordinates:

- `model_SVR_lat` for latitude
- `model_SVR_lon` for longitude

#### Hyperparameters:

- Kernel type: `linear`, `polynomial`

- Regularization parameter `C`: **0.05, 0.1, 0.2, 0.3**

- `Epsilon`: **0.01, 0.04, 0.1, 0.2, 0.3**

In [None]:
# Create two SVR models, one per coordinate (column 0 = latitude, column 1 = longitude)
y_train_arr = np.asarray(y_train)

model_SVR_lat = SVR(kernel='rbf', C=0.3, epsilon=0.01)
model_SVR_lat.fit(X_train, y_train_arr[:, 0])

model_SVR_lon = SVR(kernel='rbf', C=0.3, epsilon=0.01)
model_SVR_lon.fit(X_train, y_train_arr[:, 1])
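
Since `SVR` predicts a single output, the two-model setup can also be written with scikit-learn's `MultiOutputRegressor`, which fits one copy of the estimator per target column. The sketch below shows this alternative on synthetic data. It is not the implementation used in the study, and the data and variable names are illustrative.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic stand-in for scaled features and two-column targets
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 5))
y_demo = np.column_stack([0.5 * X_demo[:, 0],
                          X_demo[:, 1] - X_demo[:, 2]])

# One SVR is cloned and fitted for each target column
multi_svr = MultiOutputRegressor(SVR(kernel='rbf', C=0.3, epsilon=0.01))
multi_svr.fit(X_demo, y_demo)

pred = multi_svr.predict(X_demo[:3])
print(pred.shape)                  # one row per sample, one column per coordinate
print(len(multi_svr.estimators_))  # number of fitted per-column models
```

The wrapper returns both coordinates from a single `predict` call, which avoids stacking the two outputs by hand before computing position errors.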

## 6. Artificial Neural Network (ANN)


The Artificial Neural Network (ANN) model is a fully connected feedforward network designed to learn complex nonlinear relationships between RSSI-based features and device position. It is composed of multiple dense layers with decreasing neuron counts and dropout regularization to prevent overfitting.


#### Hyperparameters:

- Dense layers:
  - `Layer 1`: 512 neurons
  - `Layer 2`: 256 neurons
  - `Layer 3`: 256 neurons
  - `Layer 4`: 128 neurons
- Activation function (hidden layers): `ReLU`
- Dropout rate: `0.1` after each hidden layer
- Training epochs: max. **150**
- Learning rate: **0.0001**
- Optimizer: Adam
- Output layer: 2 neurons (latitude and longitude), `linear` activation

In [None]:
# Create ANN model
ANN_model = keras.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.1),

    keras.layers.Dense(2, activation='linear')  # Predicting [latitude, longitude]
])

ANN_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001), loss='mse', metrics=['mae'])


In [19]:
ANN_model.summary()