# Confidence-aware LSTM-based Intelligent Caching

**This project explores the use of a Long Short-Term Memory (LSTM) network combined with Confidence Intervals (CIs) for an intelligent caching system implemented in Redis. The aim is to improve prefetching, TTL assignment, and eviction policies, providing a framework that outperforms traditional baseline strategies.**

## 1. Data Generation

Before training, evaluating, or running experiments, we need to **generate synthetic data** to work with. The data should reflect the access patterns of real-world systems, which are often characterized by:
- **Skewed popularity**: Some objects are more popular than others.
- **Locality**: Some objects are correlated and are often accessed in sequence.
- **Periodic access patterns**: Objects are accessed in recurring time intervals.
- **Bursty periods**: Sudden spikes in access frequency occur over short time spans.

To simulate this behavior, we generate two types of synthetic access patterns:

- **Spatial accesses**: Governed by a Zipf distribution (the first keys are more likely to be requested than later ones) and by locality (after a key is accessed, its neighboring keys are more likely to be accessed next). The first access always targets a popular key. Each subsequent access targets a popular key with probability 30% and a key neighboring the previous one with probability 70%.
- **Temporal accesses**: Modeled as a combination of periodic and bursty patterns: some time periods follow predictable, recurring intervals, while others show sudden, high-frequency access.
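The two patterns above can be sketched as follows. This is a minimal illustration, not the project's `data_generation` code: the function names, the default parameters, the wrap-around definition of a "neighboring" key, and the sinusoidal shape of the periodic intervals are all assumptions made for the example.

```python
import math
import random


def zipf_weights(num_keys, s):
    """Unnormalized Zipf weights: the key of rank r (1-based) gets 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]


def spatial_accesses(n, num_keys=50, s=1.2, p_neighbor=0.7, rng=None):
    """Generate n key IDs mixing Zipf popularity and locality (assumed scheme)."""
    rng = rng or random.Random(0)
    keys = list(range(num_keys))
    weights = zipf_weights(num_keys, s)
    # The first access always targets a popular key (a Zipf draw).
    current = rng.choices(keys, weights=weights)[0]
    accesses = [current]
    while len(accesses) < n:
        if rng.random() < p_neighbor:
            # Locality: move to a key adjacent to the previous one (wraps around).
            current = (current + rng.choice([-1, 1])) % num_keys
        else:
            # Popularity: draw a fresh key from the Zipf distribution.
            current = rng.choices(keys, weights=weights)[0]
        accesses.append(current)
    return accesses


def temporal_deltas(n, base_interval=1.0, period=20, burst_prob=0.05,
                    burst_len=10, burst_interval=0.01, rng=None):
    """Generate n inter-arrival times: a periodic baseline plus random bursts."""
    rng = rng or random.Random(0)
    deltas = []
    burst_remaining = 0
    for i in range(n):
        if burst_remaining == 0 and rng.random() < burst_prob:
            burst_remaining = burst_len  # a burst starts at this request
        if burst_remaining > 0:
            deltas.append(burst_interval)  # very short gaps during a burst
            burst_remaining -= 1
        else:
            # Periodic pattern: the gap oscillates around base_interval.
            deltas.append(base_interval * (1.0 + 0.5 * math.sin(2 * math.pi * i / period)))
    return deltas


requests = spatial_accesses(1000)
delta_times = temporal_deltas(1000)
```

With `p_neighbor=0.7`, the remaining 30% of accesses are Zipf draws, which matches the probabilities stated above.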

To make the experiments more realistic, we generate both a **static** and a **dynamic** dataset. The former assumes that key popularities are fixed over time (i.e., the Zipf parameter remains constant), whereas the latter simulates changes in key popularities over time (i.e., the Zipf parameter varies).
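As a rough illustration of the difference between the two datasets, the sketch below builds a per-request schedule for the Zipf parameter. How the parameter actually varies in the dynamic dataset is not specified here, so the linear drift, the function names, and the numeric values are assumptions.

```python
def zipf_probabilities(num_keys, s):
    """Normalized Zipf probabilities for keys ranked 1..num_keys."""
    weights = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def zipf_schedule(num_requests, s_start, s_end):
    """Zipf parameter for each request, linearly interpolated (assumed drift)."""
    if num_requests == 1:
        return [s_start]
    step = (s_end - s_start) / (num_requests - 1)
    return [s_start + i * step for i in range(num_requests)]


# Static dataset: the parameter never changes.
static_schedule = zipf_schedule(1000, 1.2, 1.2)
# Dynamic dataset: popularity becomes more concentrated over time.
dynamic_schedule = zipf_schedule(1000, 0.8, 1.6)

# Probability of the most popular key at the start and at the end of the run.
p_first = zipf_probabilities(50, dynamic_schedule[0])[0]
p_last = zipf_probabilities(50, dynamic_schedule[-1])[0]
```

A larger Zipf parameter concentrates more probability mass on the top-ranked keys, so `p_last` is larger than `p_first` in this example.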

The final datasets consist of the following columns:
- `id`: The ID of the current request.
- `delta_time`: The time elapsed between the previous request and the current one.
- `freq_last_10`: The relative frequency of the currently requested key in the last 10 accesses.
- `freq_last_100`: The relative frequency of the currently requested key in the last 100 accesses.
- `freq_last_1000`: The relative frequency of the currently requested key in the last 1000 accesses.
- `request`: The ID of the requested key.
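The columns listed above could be derived from a raw trace of timestamps and key IDs roughly as follows. This is a sketch under assumptions: `build_rows` is a hypothetical helper, the frequency windows exclude the current request, and windows that are not yet full are normalized by the number of accesses seen so far.

```python
from collections import deque


def build_rows(timestamps, requests, windows=(10, 100, 1000)):
    """Turn a raw access trace into rows with the dataset's columns."""
    history = {w: deque(maxlen=w) for w in windows}  # sliding windows of past keys
    rows = []
    prev_time = None
    for request_id, (t, key) in enumerate(zip(timestamps, requests)):
        row = {
            "id": request_id,
            # No previous request exists for the first row, so use 0.0.
            "delta_time": 0.0 if prev_time is None else t - prev_time,
        }
        for w in windows:
            past = history[w]
            # Share of the previous (up to w) accesses that hit the same key.
            row[f"freq_last_{w}"] = past.count(key) / len(past) if past else 0.0
            past.append(key)
        row["request"] = key
        rows.append(row)
        prev_time = t
    return rows


rows = build_rows(timestamps=[0.0, 1.0, 3.0, 3.5], requests=[5, 5, 7, 5])
```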

In [None]:
from main import config_settings
from data_generation import data_generation

data_generation(config_settings)

## 2. Data Preprocessing

**Data preprocessing** prepares the data for the following steps. Three operations are performed:
- **Duplicates removal**: Removes duplicate rows based on one or more columns. In this case, rows with duplicate `id` values are removed from the dataset.
- **Missing values removal**: Removes all rows that contain missing values.
- **Standardization**: Standardizes one or more columns so that their values lie on comparable scales. In this case, we standardize `id`, `delta_time`, `freq_last_10`, `freq_last_100`, and `freq_last_1000`.
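The three operations can be sketched on a list of row dictionaries as shown below. This is not the project's `data_preprocessing` implementation: the `preprocess` helper is hypothetical, duplicates are resolved by keeping the first occurrence, and standardization is assumed to be a z-score (subtract the column mean, divide by the population standard deviation).

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preprocess(rows, key_column, numeric_columns):
    # 1. Duplicates removal: keep the first row seen for each key value.
    seen, unique_rows = set(), []
    for row in rows:
        if row[key_column] not in seen:
            seen.add(row[key_column])
            unique_rows.append(row)

    # 2. Missing values removal: drop rows with any missing field.
    complete_rows = [
        dict(row) for row in unique_rows
        if not any(is_missing(v) for v in row.values())
    ]
    if not complete_rows:
        return []

    # 3. Standardization: z-score each selected column.
    for column in numeric_columns:
        values = [row[column] for row in complete_rows]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        for row in complete_rows:
            row[column] = (row[column] - mean) / std if std > 0 else 0.0
    return complete_rows


raw = [
    {"id": 0, "delta_time": 1.0, "request": 5},
    {"id": 0, "delta_time": 9.0, "request": 5},   # duplicate id, dropped
    {"id": 1, "delta_time": None, "request": 3},  # missing value, dropped
    {"id": 2, "delta_time": 3.0, "request": 7},
]
clean = preprocess(raw, key_column="id", numeric_columns=["id", "delta_time"])
```

The `request` column is left untouched because it is the prediction target rather than a numeric feature.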

Some rows of the final (static) dataset are shown below.

In [None]:
from data_preprocessing import data_preprocessing

data_preprocessing(config_settings)

| index | id | delta_time | freq_last_10 | freq_last_100 | freq_last_1000 | request |
|---|---|---|---|---|---|---|
| 0 | -1.732020 | 3.275748 | -0.754229 | -1.037309 | -1.175211 | 5 |
| 1 | -1.731958 | 0.103608 | -0.754229 | -1.037309 | -1.175211 | 13 |
| 2 | -1.731896 | 0.600755 | -0.754229 | -1.037309 | -1.175211 | 12 |
| 3 | -1.731834 | 1.289181 | 2.515228 | 6.262462 | 7.637087 | 13 |
| 4 | -1.731772 | 1.343766 | -0.754229 | -1.037309 | -1.175211 | 14 |
| ... | ... | ... | ... | ... | ... | ... |
| 79995 | 3.216387 | -0.085736 | -0.754229 | -0.599323 | 0.093760 | 6 |
| 79996 | 3.216449 | -0.085769 | -0.754229 | 0.495643 | 0.120197 | 7 |
| 79997 | 3.216511 | -0.085783 | -0.754229 | 0.714636 | -0.144172 | 8 |
| 79998 | 3.216573 | -0.085428 | -0.754229 | -0.818316 | -1.043026 | 42 |


## 3. Validation

**Validation** aims at finding the **best hyperparameters** for training the final model. After defining the hyperparameter search space, we run a **Grid Search** to explore all possible combinations. For each combination we perform a **Time Series Cross-Validation**, which avoids data leakage by preserving the temporal order of events: each fold is validated only on data that comes after its training data. **Early Stopping** is applied while training on each fold, halting training when the validation loss stops improving. Whenever a new hyperparameter combination achieves the lowest average validation loss seen so far, it is saved as the new best. In the end, we obtain the best hyperparameters (i.e., those yielding the lowest average validation loss).
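The procedure described above can be sketched as follows. This is an illustration, not the project's `validation` code: the expanding-window split, the `patience` rule, the hyperparameter names, and the `toy_fit_fold` stand-in (which produces a synthetic loss curve instead of training an LSTM) are all assumptions.

```python
import itertools


def time_series_splits(n_samples, n_folds):
    """Expanding-window splits: train on the past, validate on the next block."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        val_end = train_end + fold_size if k < n_folds else n_samples
        yield range(0, train_end), range(train_end, val_end)


def best_loss_with_early_stopping(epoch_losses, patience):
    """Consume per-epoch validation losses; stop after `patience` stale epochs."""
    best, stale = float("inf"), 0
    for loss in epoch_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation loss stopped improving
    return best


def grid_search(search_space, n_samples, n_folds, fit_fold, patience=3):
    """Return the combination with the lowest average validation loss."""
    names = list(search_space)
    best_params, best_avg = None, float("inf")
    for values in itertools.product(*(search_space[name] for name in names)):
        params = dict(zip(names, values))
        fold_losses = [
            best_loss_with_early_stopping(fit_fold(params, train_idx, val_idx), patience)
            for train_idx, val_idx in time_series_splits(n_samples, n_folds)
        ]
        avg = sum(fold_losses) / len(fold_losses)
        if avg < best_avg:  # new best combination seen so far
            best_params, best_avg = params, avg
    return best_params, best_avg


def toy_fit_fold(params, train_idx, val_idx):
    """Hypothetical stand-in for training: yields one validation loss per epoch."""
    floor = abs(params["hidden_size"] - 64) / 64 + abs(params["learning_rate"] - 0.01)
    for epoch in range(20):
        yield floor + 0.1 * abs(epoch - 5)  # improves until epoch 5, then worsens


space = {"hidden_size": [32, 64, 128], "learning_rate": [0.001, 0.01]}
best_params, best_avg = grid_search(space, n_samples=100, n_folds=4, fit_fold=toy_fit_fold)
```

In a real run, `fit_fold` would train the model on `train_idx` and yield the validation loss on `val_idx` after each epoch; writing it as a generator lets early stopping skip the remaining epochs.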

In [None]:
from validation import validation

config_settings = validation(config_settings)

## 4. Training

After identifying the best hyperparameters, the **final model** is obtained by **training** on the whole training set using those optimal values. Once the training is completed, the model is saved.

In [None]:
from training import training

training(config_settings)

## 5. Testing

The trained model is **evaluated** on the testing set. The evaluation **metrics** computed are:
- **Average loss**: Cross Entropy.
- **Average loss per class**: Cross Entropy.
- **Class report**: Precision, Recall, and F1 for each class.
- **Confusion matrix**: Summarizes the number of correct and incorrect predictions for each class.
- **Top-k accuracy**: The fraction of requests for which the target key is among the k most probable predicted keys.
- **Kappa statistic**: Measures how much better the model agrees with the targets than a random classifier would by chance.
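Two of the metrics above are easy to state in code. The sketch below assumes the model outputs one probability per key for each request and that the kappa statistic is Cohen's kappa; the function names are illustrative and not taken from the project's `testing` module.

```python
from collections import Counter


def top_k_accuracy(probabilities, targets, k):
    """Fraction of samples whose target is among the k most probable classes."""
    hits = 0
    for probs, target in zip(probabilities, targets):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)


def cohen_kappa(predictions, targets):
    """Agreement between predictions and targets, corrected for chance."""
    n = len(targets)
    observed = sum(p == t for p, t in zip(predictions, targets)) / n
    pred_counts, target_counts = Counter(predictions), Counter(targets)
    # Agreement expected if predictions were drawn at random with the same
    # class frequencies as the model's actual predictions.
    expected = sum(pred_counts[c] * target_counts[c] for c in target_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one class ever occurs
    return (observed - expected) / (1.0 - expected)


probabilities = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
targets = [1, 1, 0]
top1 = top_k_accuracy(probabilities, targets, k=1)
top2 = top_k_accuracy(probabilities, targets, k=2)
```

A kappa of 1 means perfect agreement, while a kappa of 0 means the model does no better than chance.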

Additionally, the **costs** associated with the two types of errors (false positives and false negatives) are taken into account. The total cost is computed by multiplying the number of errors of each type by its corresponding cost, and is reported as a percentage.

In [None]:
from testing import testing

avg_loss, avg_loss_per_class, metrics, cost_perc = testing(config_settings)