Vanilla Autoencoder - Detecting SQL Injection and XSS (Cross-Site Scripting)


📝Description

This study aims to improve website defense systems by using unsupervised machine learning to detect SQL Injection (SQLi) and Cross-Site Scripting (XSS) attacks. Using Python and artificial intelligence libraries, an autoencoder model was developed and trained to identify anomalous patterns in web traffic. Publicly available datasets were used for training and validation. The results indicated high efficiency in detecting SQL Injection, with strong recall and few false negatives, although with a high false positive rate for XSS detection. The proposed approach is promising but requires further optimization before being used in production environments.

🌟 Highlights

  • 👾 SQL Injection and XSS (Cross-Site Scripting)
  • 📂 Training with public datasets
  • 🛠️ Implementation in Python with TensorFlow
  • 🧠 Unsupervised machine learning with an autoencoder
  • ❗ Detection tests and results

Getting Started

🥣 Prerequisites

All of the following experiments were conducted in the Google Colab environment.

📚 Libraries Used

  • numpy - For numerical operations and array manipulation
  • pandas - For data loading and preprocessing
  • scikit-learn - For model evaluation, data splitting and scaling
  • tensorflow - For building and training the Autoencoder model
  • keras - High-level API for defining neural network layers
  • classification_report & confusion_matrix (from scikit-learn) - For performance metrics
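
For reference, the imports assumed throughout the snippets below would look roughly like this (a minimal sketch; the original notebook may organize them differently):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense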

🗃️ Data Acquisition and Preprocessing

📑 Datasets

Uploading Datasets in Google Colab:

  • Drag and drop the files into the Files pane on the left sidebar; OR
  • Upload them programmatically using the following code:

from google.colab import files
uploaded = files.upload()
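
A quick way to confirm the upload succeeded (illustrative, not part of the original notebook):

print(list(uploaded.keys()))  # Filenames of the files just uploaded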

Preprocessing data

📝 Explanation:

In this section, the CSV file containing the SQL Injection dataset is loaded. The target variable (y) is extracted from the 'Label' column, while the feature matrix (X) is composed of all numerical columns except 'Label'. The dataset is then split into training and testing sets using train_test_split, with 80% of the data allocated for training and 20% for testing. The stratify=y parameter ensures that the class distribution remains consistent across both sets. Next, data normalization is performed using StandardScaler. Only the normal samples (where y_treino == 0) are used to train the autoencoder, so the scaler is fitted exclusively on these. The same scaling transformation is then applied to the entire test set to ensure consistency.

Note

For SQL Injection


#1 Load the SQL Injection training dataset
dfxss = pd.read_csv('SLQ Injection Attack for training (D1) (1).csv')
y = dfxss['Label']
X = dfxss.select_dtypes(include=['number']).drop('Label', axis=1)
X_treino, X_teste, y_treino, y_teste = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

#2 Data normalization
escala = StandardScaler()
X_treino_normal = X_treino[y_treino == 0]  # Keep only the normal samples to train the autoencoder
X_treino_normal_escala = escala.fit_transform(X_treino_normal)  # Fit and transform on the normal samples
X_teste_escala = escala.transform(X_teste)  # Apply the same transformation to the test set
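
As a quick sanity check (not in the original notebook), you can verify that stratification kept the class balance consistent across the splits:

# Illustrative check: class proportions should closely match between the two splits
print(y_treino.value_counts(normalize=True))
print(y_teste.value_counts(normalize=True))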

📝 Explanation:

The XSS dataset underwent the same preprocessing steps used for the SQL Injection data. The only difference is how the labels and features are extracted: the XSS data stores its labels in a 'Class' column with string values, which are converted to binary (1 = 'Malicious', 0 = normal), and the feature matrix (X) consists of every column except 'Class'.

Note

For XSS

#1 Load the training dataset
df_treino = pd.read_csv("XSSTraining.csv")

# Split the data into independent variables (X) and the target variable (y):
X = df_treino.drop("Class", axis=1)  # Drop the "Class" column to use the rest as input
y = df_treino["Class"].apply(lambda x: 1 if x == "Malicious" else 0)  # Convert the label to binary: 1 = attack, 0 = normal

#2 Split into training and test sets, preserving the class proportions
X_treino, X_teste, y_treino, y_teste = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # 20% for testing, stratified by class
)

#3 Normalize using only the normal samples from the training set
escala = StandardScaler()  # Initialize the scaler
X_treino_normal = X_treino[y_treino == 0]  # Keep only the normal samples
X_treino_normal_escala = escala.fit_transform(X_treino_normal)  # Fit and transform on the normal samples
X_teste_escala = escala.transform(X_teste)  # Apply the same transformation to the test set

🛠️ Implementing the Autoencoder


Important

Model for SQL Injection

📝 Explanation:

1. 📐 Define input dimensions + compression and decompression

  • Determine the number of input features
  • Initialize the input layer of the neural network
  • The encoder compresses the input data into a lower-dimensional representation (encoded)
  • This helps the model learn the most relevant patterns from normal data (encoded)
  • The decoder reconstructs the original input from the compressed representation (decoded)
  • The final layer matches the original input shape (decoded)

#3 Build the Autoencoder architecture
input_dim = X_treino.shape[1]  # Number of input features
input_layer = Input(shape=(input_dim,))  # Input layer

# Encoding layers (reduce the dimensionality)
encoded = Dense(16, activation='relu')(input_layer)
encoded = Dense(8, activation='relu')(encoded)

# Decoding layers (reconstruct the data)
decoded = Dense(16, activation='relu')(encoded)
decoded = Dense(input_dim, activation='relu')(decoded)  # Final layer with the same dimension as the input

2. 🏋️ Model training

  • Build and compile the model
  • Only normal samples are fed into the autoencoder for training
  • The Adam optimizer and MSE (Mean Squared Error) loss are used
  • 90 cycles (epochs) are run so the model learns the important patterns

# Create the Autoencoder model and compile it with the Adam optimizer and MSE loss
autoencoder = Model(inputs=input_layer, outputs=decoded)
autoencoder.compile(optimizer='adam', loss='mse')

#4 Train the Autoencoder on the normal data only
autoencoder.fit(
  X_treino_normal_escala, X_treino_normal_escala,  # Input and target are the same
  epochs=90,  # Number of cycles
  batch_size=32,  # Batch size
  shuffle=True,  # Shuffle the data every epoch
  validation_split=0.1,  # 10% of the data for validation
  verbose=1  # Show training progress
)

3. 🧩 Reconstruction + reference error

  • Reconstruct the test data
  • Calculate the reconstruction error for each sample
  • Measure the baseline error on the normal training data
  • Use that baseline to define the detection threshold

#5 Reconstruct the test data and compute the reconstruction error
X_recalculo = autoencoder.predict(X_teste_escala)  # Reconstructs the test data
error = np.mean(np.square(X_teste_escala - X_recalculo), axis=1)  # Error per sample

# Reconstruct the normal training data and compute its error
reco_treino = autoencoder.predict(X_treino_normal_escala)
mse_train = np.mean(np.square(X_treino_normal_escala - reco_treino), axis=1)

4. 🔬 Threshold optimization loop + classification report + recall optimization

  • A loop searches for the threshold that yields the best recall
  • A classification report is generated at each step -> it compares the model's reconstruction-based predictions against the true labels

# 6. Loop to find the best threshold (detection limit)
porcentagens = range(70, 100, 2)  # Tests percentiles from 70 to 98
melhor_recall = 0  # Best recall so far
melhor_threshold = 0  # Best threshold so far
melhor_resultado = {}  # Dictionary holding the best result

y_teste_numeric = y_teste.values  # Converts the labels to a NumPy array

# Try each candidate threshold:
for x in porcentagens:
    threshold = np.percentile(mse_train, x)  # Threshold based on the error of the normal data
    prev_loop = [1 if i > threshold else 0 for i in error]  # Flag as attack if error > threshold

    # Classification report for the current threshold:
    relatorio = classification_report(
        y_teste_numeric, prev_loop,
        target_names=["Normal", "Ataque"],
        output_dict=True
    )
    recall_ataque = relatorio["Ataque"]["recall"]  # Recall of the "Ataque" (attack) class

    # Keep this result if the recall improved:
    if recall_ataque > melhor_recall:
        melhor_recall = recall_ataque
        melhor_threshold = threshold
        melhor_resultado = {
            "percentil": x,
            "precision": relatorio["Ataque"]["precision"],
            "recall": recall_ataque,
            "f1": relatorio["Ataque"]["f1-score"],
            "matriz_confusao": confusion_matrix(y_teste_numeric, prev_loop)
        }
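
Once the loop finishes, the winning threshold can be applied and inspected; a minimal sketch (the print statements are illustrative and not part of the original notebook):

# Illustrative: apply the best threshold found by the loop and report the results
final_pred = [1 if e > melhor_threshold else 0 for e in error]
print(f"Best percentile: {melhor_resultado['percentil']}, threshold: {melhor_threshold:.4f}")
print(classification_report(y_teste_numeric, final_pred, target_names=["Normal", "Ataque"]))
print(melhor_resultado["matriz_confusao"])  # Confusion matrix at the best threshold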

Important

🏋️‍♂️Model training for XSS

📝 Explanation:

The same training procedure used for the SQL Injection detection model was applied to the XSS (Cross-Site Scripting) detection model.
This includes the autoencoder architecture, preprocessing steps, training strategy, and threshold optimization for anomaly classification.


#4 Define the Autoencoder architecture
input_dim = X.shape[1]  # Number of input features
input_layer = Input(shape=(input_dim,))  # Input layer

# Encoding layers
encoded = Dense(16, activation='relu')(input_layer)
encoded = Dense(8, activation='relu')(encoded)

# Decoding layers
decoded = Dense(16, activation='relu')(encoded)
decoded = Dense(input_dim, activation='relu')(decoded)  # Reconstructs the input

# Create and compile the Autoencoder model
autoencoder = Model(inputs=input_layer, outputs=decoded)
autoencoder.compile(optimizer='adam', loss='mse')  # Uses the reconstruction error (MSE) as the loss

#5 Train the Autoencoder on the normal data only
autoencoder.fit(
  X_treino_normal_escala, X_treino_normal_escala,  # Input and target are the same
  epochs=90, batch_size=32, shuffle=True,  # Training for 90 epochs/cycles
  validation_split=0.1, verbose=1  # Uses 10% of the normal data for validation
)

#6 Compute the reconstruction error on the test set
X_recalculo = autoencoder.predict(X_teste_escala)  # Reconstructs the test data
error = np.mean(np.square(X_teste_escala - X_recalculo), axis=1)  # Reconstruction error per sample

# Compute the reconstruction error on the normal training data
reco_treino = autoencoder.predict(X_treino_normal_escala)
mse_train = np.mean(np.square(X_treino_normal_escala - reco_treino), axis=1)

#7 Search for the best threshold for detecting attacks
porcentagens = range(70, 100, 2)  # Tests percentiles from 70 to 98
melhor_recall = 0
melhor_threshold = 0
melhor_resultado = {}

y_teste_numeric = y_teste.values  # Converts to a NumPy array

# Try each candidate threshold
for x in porcentagens:
  threshold = np.percentile(mse_train, x)  # Threshold based on the error of the normal data
  prev_loop = [1 if i > threshold else 0 for i in error]  # Flag as attack if error > threshold

  # Generate the classification report
  relatorio = classification_report(
      y_teste_numeric, prev_loop,
      target_names=["Normal", "Ataque"],
      output_dict=True
  )
  recall_ataque = relatorio["Ataque"]["recall"]  # Recall of the "Ataque" (attack) class

  # Keep this result if the recall improved
  if recall_ataque > melhor_recall:
      melhor_recall = recall_ataque
      melhor_threshold = threshold
      melhor_resultado = {
          "percentil": x,
          "precision": relatorio["Ataque"]["precision"],
          "recall": recall_ataque,
          "f1": relatorio["Ataque"]["f1-score"],
          "matriz_confusao": confusion_matrix(y_teste_numeric, prev_loop)
      }

Important

🏋️‍♂️Model testing for XSS

📝 Explanation: Now the dataset containing the test data is loaded to evaluate the model's ability to detect XSS patterns.

  • Loads XSS Dataset into a DataFrame
  • Converts the "Class" column into binary labels: 1 for malicious, 0 for normal
  • Separates the feature set by removing the label column
  • Applies the same scaler used during training to ensure consistent data distribution
  • Uses the trained autoencoder to reconstruct the input data
  • Computes the mean squared error for each sample to quantify how well the model reconstructed it

df_xss = pd.read_csv("XSSTesting.csv")
y_xss = df_xss["Class"].apply(lambda x: 1 if x == "Malicious" else 0)  # Converts the label to binary
X_xss = df_xss.drop("Class", axis=1)  # Removes the class column

X_xss_escala = escala.transform(X_xss)  # Normalizes the data with the same scaler
X_xss_reconstruido = autoencoder.predict(X_xss_escala)  # Reconstructs the data
erro_xss = np.mean(np.square(X_xss_escala - X_xss_reconstruido), axis=1)  # Reconstruction error per sample

prev_xss = [1 if e > melhor_threshold else 0 for e in erro_xss]  # Classifies using the best threshold
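
These predictions can then be scored against the true labels; presumably this is how the tables below were produced (the exact call is an assumption):

# Assumed evaluation step, consistent with the results reported below
print(classification_report(y_xss, prev_xss, target_names=["Normal", "Ataque"]))
print(confusion_matrix(y_xss, prev_xss))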

🧪 Running the tests

👾 Model trained for SQL INJECTION:

Class          Precision   Recall   F1-Score
NORMAL         0.92        0.81     0.86
ATTACK         0.60        0.80     0.69
------------   ---------   ------   --------
ACCURACY                            0.91
MACRO AVG      0.91        0.92     0.91
WEIGHTED AVG   0.92        0.92     0.91

👾 Model trained for CROSS-SITE SCRIPTING:

Class          Precision   Recall   F1-Score
NORMAL         1.00        0.84     0.91
ATTACK         0.82        1.00     0.90
------------   ---------   ------   --------
ACCURACY                            0.91
MACRO AVG      0.91        0.92     0.91
WEIGHTED AVG   0.92        0.91     0.91

⚖️ Comparison:

Attack Type     Class    Precision   Recall   F1-Score   Accuracy
SQL Injection   Normal   0.92        0.81     0.86       0.91
                Attack   0.60        0.80     0.69
XSS             Normal   1.00        0.84     0.91       0.91
                Attack   0.82        1.00     0.90

📊 Graphics

(Figures: SQLINjection_graphic and XSS_graphic - result plots for the SQL Injection and XSS models.)
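
To reproduce this kind of figure, a minimal sketch with matplotlib (an extra dependency not listed above), assuming the plots show the per-sample reconstruction error split by class against the chosen threshold:

import matplotlib.pyplot as plt

# Illustrative sketch: distribution of reconstruction errors with the detection threshold
plt.hist(error[y_teste_numeric == 0], bins=50, alpha=0.6, label='Normal')
plt.hist(error[y_teste_numeric == 1], bins=50, alpha=0.6, label='Attack')
plt.axvline(melhor_threshold, color='red', linestyle='--', label='Threshold')
plt.xlabel('Reconstruction error (MSE)')
plt.ylabel('Number of samples')
plt.legend()
plt.show()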

✍️ Final Considerations

This project demonstrates the potential of unsupervised anomaly detection with Autoencoders in cybersecurity applications. The method showed strong recall for attack detection, particularly for SQL Injection, but further improvements are needed to reduce false positives in XSS detection.

🔮 Future Work May Include:

  • Enhancing the pre-processing method to improve data quality and model performance

  • Experimenting with different neuron configurations to obtain more robust and interpretable results

  • Validating the model using real traffic data to observe its behavior, assess accuracy, and measure execution time

  • Applying ROC curve analysis to visually identify the optimal threshold, replacing the current percentile sweep with a more time-efficient approach (see the sketch after this list)

  • Implementing adaptive monitoring techniques to continuously track performance metrics and adjust parameters as needed
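
As an illustration of the ROC idea (a sketch only, using scikit-learn's roc_curve on the reconstruction errors, with Youden's J statistic as one possible selection criterion):

from sklearn.metrics import roc_curve, auc

# Sketch: treat the reconstruction error as an anomaly score and derive a threshold from the ROC curve
fpr, tpr, thresholds = roc_curve(y_teste_numeric, error)
print(f"AUC: {auc(fpr, tpr):.3f}")

# One common choice: the threshold maximizing Youden's J = TPR - FPR
best = np.argmax(tpr - fpr)
print(f"Suggested threshold: {thresholds[best]:.4f} (TPR={tpr[best]:.2f}, FPR={fpr[best]:.2f})")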

📩 Contact

👥 Contributor

  • Prof. Dr. Rodrigo Cardoso Silva - Academic advisor and project supervisor
  • GitHub: @profdsrodrigo


🚨 Issue report

  • If you encounter a problem related to this project, you can open an issue using the provided issue template.

About

The experiments were conducted in Google Colab, as part of a research project developed during my undergraduate scientific initiation at Universidade Presbiteriana Mackenzie (UPM).
