In [2]:
use_gpu = True

if use_gpu and torch.cuda.is_available():
    device = torch.device('cuda')
    torch.backends.cudnn.benchmark = True
else:
    device = torch.device('cpu')
print(device)

cuda


### **ATTENTION:**
- Watch out for **overfitting**, which happens when a neural network essentially “memorizes” the training data. Overfitting means you get great performance on training data, but the network’s model is useless for out-of-sample prediction.
- Regularization helps: methods include L1, L2, and dropout, among others.
- Keep a separate test set that the network never trains on.
- The larger the network, the more powerful it is, but also the easier it is to overfit. Avoid trying to learn a million parameters from 10,000 examples – parameters > examples = trouble.
- **More data** is almost always better, because it helps fight overfitting.
- Train over multiple epochs (complete passes through the dataset).
- Evaluate test set performance at each epoch to know when to stop (**early stopping**).
- In general, stacking layers can help.
- For LSTMs, prefer the softsign (not softmax) activation function over tanh: it is faster and less prone to saturation (gradients near 0).
- Updaters: RMSProp, AdaGrad or Nesterov momentum are usually good choices. AdaGrad also decays the learning rate, which can help sometimes.
- Finally, remember **data normalization**, MSE loss function + identity activation function for regression, and Xavier weight initialization.
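
The early-stopping item above can be sketched as a small helper. This is a minimal illustration and not part of the notebook's training loop; the class name `EarlyStopper`, its `patience` and `min_delta` parameters, and the loss values are assumptions chosen for the example.

```python
class EarlyStopper:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, valid_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if valid_loss < self.best - self.min_delta:
            self.best = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical per-epoch validation losses, for illustration only.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch  # zero-based index of the epoch that triggered the stop
        break
```

With `patience=3`, training halts after three consecutive epochs without improvement, so the later (better) loss in the list is never reached; a larger `patience` trades extra compute for a lower chance of stopping too early.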

In [4]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
torch.backends.cudnn.benchmark = False  # benchmark autotuning may select different kernels per run, which works against reproducibility
torch.backends.cudnn.deterministic = True

torch.set_printoptions(precision=6,sci_mode=False)
pd.set_option('display.float_format',lambda x : '%.6f' % x)

SEED = 2024

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

### **Step1: Dataset Load**

Because predictions are made per reference sequence, each batch must contain all the different `Outcomes` of one `Reference` sequence. We therefore use a custom `BatchSampler`, built on `SubsetRandomSampler`, to generate the batches of row indices, which are then used to draw items from `gRNADataset`.

[torch.utils.data](https://pytorch.org/docs/stable/data.html)

`class torch.utils.data.Sampler(data_source=None)`: Base class for all Samplers. Every Sampler subclass has to provide an `__iter__()` method, providing a way to iterate over indices or lists of indices (batches) of dataset elements, and may provide a `__len__()` method that returns the length of the returned iterators.

**Parameters**: data_source (Dataset) – This argument is not used and will be removed in 2.2.0. You may still have custom implementation that utilizes it.

`DataLoader`(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, *, prefetch_factor=2, persistent_workers=False)


**A minimal class that defines `__iter__` and `__len__`:**
```python
class Book:
    def __init__(self, title, author):
        self.title = title
        self.author = author
        self.pages = []

    def __iter__(self):
        return iter(self.pages)
        
    def __len__(self):
        return len(self.pages)
        
    def add_pages(self, content):
        self.pages.append(content)

my_book = Book('python', 'Martina')
my_book.add_pages('Chapter 1')
my_book.add_pages('Chapter 2')

for page in my_book:
    print(page)

print(my_book, len(my_book))
```
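
The same two methods are all a sampler needs. As a rough sketch in plain Python (no PyTorch dependency), the hypothetical `GroupedIndexSampler` below yields one list of row indices per reference sequence, in shuffled order. An iterable of index lists like this is the kind of object `DataLoader` accepts as `batch_sampler`. The class name and the toy `references` list are assumptions made for illustration.

```python
import random
from collections import defaultdict


class GroupedIndexSampler:
    """Yield one batch (a list of row indices) per distinct key, in random order."""

    def __init__(self, keys, seed=None):
        groups = defaultdict(list)
        for row, key in enumerate(keys):
            groups[key].append(row)
        self.batches = list(groups.values())
        self.rng = random.Random(seed)

    def __iter__(self):
        order = list(range(len(self.batches)))
        self.rng.shuffle(order)  # a new batch order on every pass
        for i in order:
            yield self.batches[i]

    def __len__(self):
        return len(self.batches)


# Toy data: the reference each row belongs to (hypothetical values).
references = ["refA", "refA", "refB", "refA", "refB", "refC"]
sampler = GroupedIndexSampler(references, seed=0)
batches = list(sampler)
```

Each batch keeps the rows of one reference together, while the order of the batches changes between passes.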

In [5]:
class BatchSampler(Sampler):
    def __init__(self, data):
        idx_list = []
        for ref, sub_df in data.groupby('Reference'):
            idx_list.append(sub_df.index.tolist())
        self.sampler = SubsetRandomSampler(idx_list)
    
    def __len__(self):
        return len(self.sampler)

    def __iter__(self):
        for idx in self.sampler:
            yield idx

class gRNADataset:
    def __init__(self, df):
        self.df = df
        self.indices = self.df['index'].tolist()
        self.offsets = self.df['offsets'].tolist()
        self.cnt = self.df['Count'].tolist()

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        indices = self.indices[idx]
        offsets = self.offsets[idx]
        cnt = self.cnt[idx]
        return indices, offsets, cnt

class testDataset:
    def __init__(self, df):
        self.df = df
        self.indices = self.df['index'].tolist()
        self.offsets = self.df['offsets'].tolist()
        self.cnt = self.df['True_Proportion'].tolist()

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        indices = self.indices[idx]
        offsets = self.offsets[idx]
        cnt = self.cnt[idx]
        return indices, offsets, cnt

### **Step2: Generate Encoding and Batches**

1. `generate_encoding`: encodes the bases `A`, `C`, `G`, `T` as the integers `0` to `3`, then walks through the reference and outcome sequences in parallel and interleaves each pair of nucleotides into one list. It also returns an offset list pointing at every even position of the encoded list, i.e. the start of each (reference, outcome) pair.
2. `generate_batch`: prepares a batch of data for input into our model. It first flattens the encoded sequences of all items in the batch into a single list, then shifts the offsets of each item by that item's position in the flat list, so that `indices` and `offsets` each become one list for the whole batch.

```python
def generate_encoding(ref, out):
    STOI = str.maketrans('ACGT', '0123') ## string to integer
    idx = [int(nuc) for pair in zip(ref.translate(STOI), out.translate(STOI)) for nuc in pair]
    ofs = list(range(0, len(idx), 2))
    return idx, ofs

idx, ofs = generate_encoding('TTTAACCCGG', 'TTTCCCTTTAA')
print(idx, ofs)
# zip() stops at the shorter sequence, so only the first 10 positions are encoded:
# idx = [3, 3, 3, 3, 3, 3, 0, 1, 0, 1, 1, 1, 1, 3, 1, 3, 2, 3, 2, 0]
# ofs = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

def generate_batch(batch):
    indices = [idx for item in batch for idx in item[0]]
    offsets = [ofs + i * len(item[0]) for i, item in enumerate(batch) for ofs in item[1]]
    cnts = [item[2] for item in batch]
    return indices, offsets, cnts
```
**Sample Usage:**

```python
sample_data = [("ACGTACGTGTCCTGGCTGCC", "ACGTGCGTGTCCTGGCTGCC", 10), ("CATATCTACGCGTCGCACTT", "CATGTCTACGCGTCGCACTT", 20)]

df = pd.DataFrame(sample_data, columns = ['Reference', 'Outcomes', 'Count'])
df['index'] = None
df['offsets'] = None
df[['index', 'offsets']] = df.apply(lambda x: generate_encoding(x['Reference'], x['Outcomes']), axis=1, result_type='expand')
batch = list(zip(df['index'], df['offsets'], df['Count']))
indices, offsets, cnts = generate_batch(batch)
```
**Output:**
```
         Reference              Outcomes  Count index offsets
0  ACGTACGTGTCCTGGCTGCC  ACGTGCGTGTCCTGGCTGCC     10  None    None
1  CATATCTACGCGTCGCACTT  CATGTCTACGCGTCGCACTT     20  None    None
              Reference              Outcomes  Count  \
0  ACGTACGTGTCCTGGCTGCC  ACGTGCGTGTCCTGGCTGCC     10   
1  CATATCTACGCGTCGCACTT  CATGTCTACGCGTCGCACTT     20   

                                               index  \
0  [0, 0, 1, 1, 2, 2, 3, 3, 0, 2, 1, 1, 2, 2, 3, ...   
1  [1, 1, 0, 0, 3, 3, 0, 2, 3, 3, 1, 1, 3, 3, 0, ...   

                                             offsets  
0  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24...  
1  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24...  
indices:
tensor([0, 0, 1, 1, 2, 2, 3, 3, 0, 2, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 1, 1, 1, 1,
        3, 3, 2, 2, 2, 2, 1, 1, 3, 3, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 3, 3, 0, 2,
        3, 3, 1, 1, 3, 3, 0, 0, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 1, 1,
        0, 0, 1, 1, 3, 3, 3, 3]) 
 offsets:
tensor([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
        36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70,
        72, 74, 76, 78]) 
 cnts:
tensor([10., 20.])
```
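
`generate_batch` shifts each item's offsets by `i * len(item[0])`, which is only correct when every item in the batch has the same encoded length (as in the sample above, where all sequences have 20 bases). If lengths could differ, a running total would be needed instead. The sketch below shows that variant; the name `generate_batch_variable` and the toy sequences and counts are assumptions for illustration, not the notebook's function.

```python
def generate_encoding(ref, out):
    """Interleave two base sequences as integers; one offset per (ref, out) pair."""
    stoi = str.maketrans("ACGT", "0123")
    idx = [int(n) for pair in zip(ref.translate(stoi), out.translate(stoi)) for n in pair]
    ofs = list(range(0, len(idx), 2))
    return idx, ofs


def generate_batch_variable(batch):
    """Flatten a batch, shifting offsets by the running length of earlier items."""
    indices, offsets, counts = [], [], []
    for idx, ofs, cnt in batch:
        start = len(indices)  # where this item begins in the flat list
        offsets.extend(start + o for o in ofs)
        indices.extend(idx)
        counts.append(cnt)
    return indices, offsets, counts


# Two items of different length (4 and 2 base pairs); the counts are made up.
batch = [generate_encoding("ACGT", "ACGA") + (10,),
         generate_encoding("TT", "TC") + (20,)]
indices, offsets, counts = generate_batch_variable(batch)
```

For equal-length items both approaches produce the same offsets, so the original formula is fine as long as that assumption holds.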

In [6]:
def generate_encoding(ref, out):
    STOI = str.maketrans('ACGT', '0123')
    idx = [int(nuc) for pair in zip(ref.translate(STOI), out.translate(STOI)) for nuc in pair]
    ofs = list(range(0, len(idx), 2))
    return idx, ofs

def generate_batch(batch):
    indices = [idx for item in batch for idx in item[0]]
    offsets = [ofs + i * len(item[0]) for i, item in enumerate(batch) for ofs in item[1]]
    counts = [item[2] for item in batch]
    ref = [item[0] for item in batch]
    otm = [item[1] for item in batch]
    
    return torch.LongTensor(indices), torch.LongTensor(offsets), torch.FloatTensor(counts), torch.LongTensor(ref), torch.LongTensor(otm)

## Formulas

- $f$: fully connected layer
- $\mathbf{h}$: hidden state
- $s$: output scores
- $k$: the total number of certain reference's outcomes

**Step1: From biLSTM get output context vector**

$\mathbf{h}$ denotes the hidden state. The output score is computed from the last hidden state $\mathbf{h}_{last}$:

$$
s = \mathrm{LeakyReLU}(f(\mathbf{h}_{last}))
$$

**Step2: Generate Predicted Outcomes**

We denote a batch of output scores as $\mathbf{s} = [s_1, s_2, s_3, \dots, s_i, \dots, s_k]$, $i \in \{1, \dots, k\}$. Applying the $softmax$ function to $\mathbf{s}$ gives the prediction:
$$
q_i = pred_i = softmax(\mathbf{s})_i = \frac{e^{s_i}}{\sum_{j = 1}^{k}e^{s_j}}
$$
$$
\sum_{j = 1}^{k}q_j = 1
$$

**Step3: True Scores**

Here, $true_i$ is the corresponding true score, and the set is denoted as $p = [p_1, p_2, p_3, \dots, p_i, \dots, p_k]$, $i \in \{1, \dots, k\}$, with $p_i = true_i$.

**Step4: Loss Function**

To measure the distance between $pred_i$ and $true_i$, we use the **KL divergence** and weight each outcome in a batch by $w_i$, computed as:
$$
w_i = \frac{c_i}{\sum_{j=1}^{k}c_j}, \quad i \in \{1, \dots, k\}
$$
$c_i$ is the true count of outcome $i$. We then calculate **loss1**, $L_1$, as a weighted form of $\mathbf{D}_{KL}(p \Vert q)$:
$$
L_1 = \sum_{i = 1}^{k}w_i\, p_i \log\left(\frac{p_i}{q_i}\right)
$$
Since the mean squared error (MSE) also shrinks the difference between $pred_i$ and $true_i$, we add it as **loss2**, $L_2$:
$$
L_2 = \frac{1}{2k} \sum_{i=1}^{k} w_i (p_i - q_i)^2
$$
Finally, the total **Loss** is:
$$
L = \frac{1}{2}L_1 + \frac{1}{2}L_2
$$
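
The formulas above can be checked numerically with plain Python. The sketch below handles a single reference with $k$ outcomes; the function names and the sample scores and counts are made up, and the true proportions $p_i$ are assumed to equal $c_i / \sum_j c_j$ (the same values as the weights $w_i$), which is how the training loop builds `y_true`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combined_loss(scores, counts):
    """Return (L1, L2, L) for one reference, following the formulas above."""
    k = len(scores)
    q = softmax(scores)              # predicted proportions
    total = sum(counts)
    p = [c / total for c in counts]  # true proportions (assumption: counts / sum)
    w = p                            # weights c_i / sum(c); same values as p here
    # Terms with p_i == 0 are skipped: they contribute 0 to the KL sum.
    l1 = sum(wi * pi * math.log(pi / qi) for wi, pi, qi in zip(w, p, q) if pi > 0)
    l2 = sum(wi * (pi - qi) ** 2 for wi, pi, qi in zip(w, p, q)) / (2 * k)
    return l1, l2, 0.5 * l1 + 0.5 * l2


# Made-up scores and counts for k = 3 outcomes.
l1, l2, loss = combined_loss([0.0, 0.0, 0.0], [1, 1, 2])
```

Equal scores give a uniform prediction, so both terms are positive here; scores equal to $\log p_i$ would reproduce the true proportions exactly and drive both terms to zero.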

****

### **Step3: BiLSTM with Attention Model**
<center><img class="image" id="me" src="attachment:0f4eabb1-1ec4-40c5-a73a-3b3b4cfec21e.png" width = "500"></center>

1. `EmbeddingBag`: embedding layer that pools the embeddings of variable-length bags of indices, reducing each bag to a fixed-size vector.
2. `nn.LSTM`: bidirectional LSTM layer, which learns temporal dependencies in both the forward and backward directions.
3. `attention_net`: computes dot-product attention between a query and the sequence `x`. `scores` is the attention score matrix, which measures the relevance of each sequence element to the query; `alpha_n` holds the attention weights, computed by applying softmax to the scores; `context` is the sum of the sequence `x` weighted by the attention weights, which summarizes the sequence.
4. `fc`: fully connected layer that produces output.
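
The definition of `BiLSTM_Attention` is not shown in this section, so the block below is only a minimal sketch of how the four components listed above could fit together. Everything specific in it is an assumption: the class name `BiLSTMAttentionSketch`, the default sizes, `mode="mean"` pooling, a fixed `seq_len` of 20 base pairs per sequence (consistent with `len(indices) / 40` in the training loop), and the use of the final forward and backward hidden states as the attention query.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMAttentionSketch(nn.Module):
    def __init__(self, input_dim=4, emb_dim=16, hid_dim=32, n_layers=1, dropout=0.0, seq_len=20):
        super().__init__()
        self.seq_len = seq_len
        # Each bag holds one (reference, outcome) nucleotide pair.
        self.embedding = nn.EmbeddingBag(input_dim, emb_dim, mode="mean")
        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, bidirectional=True,
                           batch_first=True, dropout=dropout if n_layers > 1 else 0.0)
        self.fc = nn.Linear(2 * hid_dim, 1)

    def attention_net(self, x, query):
        # x: (batch, seq_len, 2*hid), query: (batch, 2*hid)
        scores = torch.bmm(x, query.unsqueeze(2)).squeeze(2)     # (batch, seq_len)
        alpha_n = F.softmax(scores, dim=1)                       # attention weights
        context = torch.bmm(alpha_n.unsqueeze(1), x).squeeze(1)  # (batch, 2*hid)
        return context, alpha_n

    def forward(self, indices, offsets, batch_size):
        emb = self.embedding(indices, offsets)            # (batch*seq_len, emb_dim)
        emb = emb.view(batch_size, self.seq_len, -1)      # (batch, seq_len, emb_dim)
        out, (h_n, _) = self.rnn(emb)                     # out: (batch, seq_len, 2*hid)
        query = torch.cat([h_n[-2], h_n[-1]], dim=1)      # last layer, forward + backward
        context, _ = self.attention_net(out, query)
        return F.leaky_relu(self.fc(context))             # (batch, 1) output scores


# Random toy input: 3 sequences of 20 base pairs (40 indices each).
torch.manual_seed(0)
model = BiLSTMAttentionSketch()
batch_size = 3
indices = torch.randint(0, 4, (batch_size * 40,))
offsets = torch.arange(0, batch_size * 40, 2)
outputs = model(indices, offsets, batch_size)
```

The output has one score per outcome in the batch, which is the shape the softmax over `dim=0` in the training loop expects.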

In [12]:
def init_weights(m):
    '''
    Initialize weights to reduce early-stage vanishing gradients in the LSTM, so the model can learn long-term dependencies better.
    Assumes the LSTM is registered on the model under the attribute name `rnn`.
    '''
    for name, param in m.named_parameters():
        if 'rnn.weight_' in name:
            nn.init.orthogonal_(param.data) ## orthogonal initialization of LSTM weights
        elif 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01) ## small-variance normal initialization (not Xavier) for non-recurrent weights
        elif 'bias' in name:
            if 'rnn.bias_' in name:
                nn.init.constant_(param.data, 1) ## sets every LSTM gate bias to 1, not only the forget gate
            else:
                nn.init.constant_(param.data, 0) ## set other biases to 0

In [13]:
print(device)

cuda:0


In [14]:
torch.cuda.empty_cache()

In [15]:
model = BiLSTM_Attention(my_input_dim, my_emb_dim, my_hid_dim, my_layers, my_dropout).to(device)

### **Step4: Define Training and Testing Function**

#### **4.1 Choose which one of `PyTorch loss functions`:**

For [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html) ---> 3 `reduction` options: none, mean, sum

`reduction (str, optional)` – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

1.	`none: criterion = nn.MSELoss(reduction='none')`:
    - Behavior: No reduction is applied. This means the loss is computed element-wise and returned as a tensor with the same shape as the input.
    - Use Case: Useful when you need the loss for each element separately, e.g., for per-sample loss tracking or custom aggregation.
2.	`mean: criterion = nn.MSELoss(reduction='mean')`:
    - Behavior: The loss is averaged across all elements of the input tensor.
    - Use Case: This is the most common setting, as it gives a single scalar representing the average loss across all elements. Use this if you want the average loss per element.
3.	`sum: criterion = nn.MSELoss(reduction='sum')`:
    - Behavior: The loss is summed across all elements of the input tensor.
    - Use Case: Useful when you want the total loss across all elements, rather than an average. Typically used when you need the overall magnitude of the loss or when dealing with batch loss where the total matters (e.g., when normalizing by the total number of elements later).

- `none`: No reduction; per-element loss.
- `mean`: Average loss per element (most common).
- `sum`: Sum of losses across all elements (useful if you need the total loss or will normalize by batch size later).
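
A short check of the three `reduction` settings on a toy pair of tensors. The values of `pred` and `true` are made up, and the weighting in the last line is only one example of the custom aggregation that `reduction='none'` makes possible.

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.2, 0.3, 0.5])
true = torch.tensor([0.1, 0.3, 0.6])

per_element = nn.MSELoss(reduction="none")(pred, true)  # tensor of shape (3,)
mean_loss = nn.MSELoss(reduction="mean")(pred, true)    # scalar: per_element.mean()
sum_loss = nn.MSELoss(reduction="sum")(pred, true)      # scalar: per_element.sum()

# Custom aggregation: weight each squared error by its true proportion.
weighted = (per_element * true).sum()
```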

For [nn.KLDivLoss](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) ---> 4 `reduction` options: none, mean, batchmean, sum

This function measures the divergence between two **probability distributions**. Since the KLD loss takes log probabilities as input, it’s common to use 'batchmean' or 'sum' to compute the overall divergence for a batch.
The reductions are computed as follows:

```python
if reduction == "mean":  # default
    loss = loss_pointwise.mean()
elif reduction == "batchmean":  # mathematically correct
    loss = loss_pointwise.sum() / input.size(0)
elif reduction == "sum":
    loss = loss_pointwise.sum()
else:  # reduction == "none"
    loss = loss_pointwise
```

Here we choose `none` for both losses, which means the loss of each element in a batch is calculated separately. The result is a tensor of the same shape as the input, where each element is the loss for that particular element. Because we need to perform a custom reduction later (e.g., **weighting each element’s loss differently**, or computing per-sample loss), `reduction='none'` is useful.
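
A minimal sketch of the `reduction='none'` pattern for the KLD loss, using made-up scores and counts. `nn.KLDivLoss` expects log-probabilities as its input and probabilities as its target, and with `reduction='none'` it returns the pointwise terms $p_i(\log p_i - \log q_i)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[0.5], [0.1], [-0.2]])  # (k, 1) made-up model outputs
counts = torch.tensor([10.0, 30.0, 60.0])      # made-up outcome counts
y_true = counts / counts.sum()                 # true proportions p

log_q = F.log_softmax(scores, dim=0).view(-1)  # log-probabilities, shape (k,)
pointwise = nn.KLDivLoss(reduction="none")(log_q, y_true)  # one term per outcome

weighted = pointwise * y_true                  # custom per-outcome weighting
loss = weighted.sum()
```

Individual pointwise terms can be negative even though their unweighted sum, the KL divergence, is never negative. Taking an absolute value per element, as the training loop does, therefore changes the quantity being minimized.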

****

- `softmax`: values range from 0 to 1; `logsoftmax`: values range from $-\infty$ to 0

****

#### **4.2 Debug: The output dimensions of each line**

As mentioned before, the output looks like this:
```
y_pred_1 --->: (softmax from outputs)   
tensor([0.032258, 0.032229, 0.032282, 0.032281, 0.032286, 0.032240, 0.032271,
        0.032273, 0.032271, 0.032263, 0.032207, 0.032257, 0.032261, 0.032250,
        0.032254, 0.032273, 0.032274, 0.032233, 0.032268, 0.032244, 0.032247,
        0.032256, 0.032227, 0.032283, 0.032283, 0.032243, 0.032284, 0.032250,
        0.032269, 0.032272, 0.032213], device='cuda:0', grad_fn=<ViewBackward>)
y_pred_2: (logsoftmax from outputs)
tensor([-3.433977, -3.434890, -3.433241, -3.433290, -3.433125, -3.434560,
        -3.433588, -3.433515, -3.433597, -3.433846, -3.435585, -3.434012,
        -3.433898, -3.434241, -3.434123, -3.433510, -3.433484, -3.434758,
        -3.433667, -3.434437, -3.434321, -3.434052, -3.434946, -3.433228,
        -3.433214, -3.434468, -3.433172, -3.434247, -3.433659, -3.433563,
        -3.435398], device='cuda:0', grad_fn=<ViewBackward>)
y_true:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0')
```

The problem is that the outputs are all very similar, which means the model **cannot perform well**. I then found that the cause was the `weight`; after adding it:
```
y predicted 1: tensor([-1.939422, -1.944494, -1.945731, -1.946127, -1.946525, -1.947677,
        -1.951434], device='cuda:0', grad_fn=<ViewBackward>)
y predicted 2: tensor([0.143787, 0.143060, 0.142883, 0.142826, 0.142769, 0.142605, 0.142070],
       device='cuda:0', grad_fn=<ViewBackward>)
outputs:tensor([[-0.000286],
        [-0.005358],
        [-0.006595],
        [-0.006991],
        [-0.007389],
        [-0.008541],
        [-0.012298]], device='cuda:0', grad_fn=<LeakyReluBackward0>)
true scores: tensor([0.007529, 0.019860, 0.360981, 0.011812, 0.049844, 0.087098, 0.462876], ...
loss 1 KLD: tensor([0.005233, 0.013803, 0.250890, 0.008210, 0.034643, 0.060535, 0.321709], ...
loss 2 MSE: tensor([0.000229, 0.000605, 0.010990, 0.000360, 0.001517, 0.002652, 0.014092], ...
```
The loss behaves much better now, but the predicted outputs are still nearly identical.
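
The near-identical predictions are what a softmax returns when the raw scores barely differ: $k$ almost-equal scores give probabilities close to $1/k$ and log-probabilities close to $-\ln k$. In the first dump $k = 31$, so $1/31 \approx 0.032258$ and $-\ln 31 \approx -3.434$. The sketch below reproduces the second dump from its printed `outputs` values; the helper name `softmax` is just for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Raw scores copied from the `outputs` tensor printed above (k = 7).
outputs = [-0.000286, -0.005358, -0.006595, -0.006991, -0.007389, -0.008541, -0.012298]

probs = softmax(outputs)
log_probs = [math.log(p) for p in probs]

uniform = 1 / len(outputs)             # what k identical scores would give
spread = max(probs) - min(probs)       # how far the prediction is from flat
log_uniform_31 = math.log(1 / 31)      # the value seen in the first dump
```

The spread between the largest and smallest predicted probability is below 0.002, so the softmax output is almost exactly uniform; the model has to produce scores that differ by much more than about 0.01 before the predictions separate.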

In [16]:
def train_model(model, epochs, clips, fold, trained_model_path, train_dataloader, valid_dataloader, criterion_KLD, criterion_mse, device, basename, optimizer):
    
    train_loss_list = []
    valid_loss_list = []
    
    # avg_valid_losses = [] 
    best_valid_loss = np.inf

    for epoch in range(epochs):
        model.train()
        train_per_epoch = len(train_dataloader)
        kbar = pkbar.Kbar(target=train_per_epoch, epoch=epoch, num_epochs=epochs, width=8, always_stateful=False)
        train_loss = 0
        for i, batch in enumerate(train_dataloader):
            indices, offsets, counts, ref, otm = batch
            indices = indices.to(device)
            offsets = offsets.to(device)
            counts = counts.to(device)
            all_counts = counts.sum() + 1e-6
            y_true = counts / all_counts
            batch_size = int(len(indices) / 40)
            
            optimizer.zero_grad()
            outputs = model(indices, offsets, batch_size)
            
            y_pred_1 = F.log_softmax(outputs, 0).view(-1)
            loss_temp_1 = criterion_KLD(y_pred_1, y_true)
            loss_temp_1 = torch.abs(loss_temp_1 * y_true) * 100
            
            y_pred_2 = F.softmax(outputs, 0).view(-1)
            loss_temp_2 = criterion_mse(y_pred_2, y_true)
            loss_temp_2 = loss_temp_2 * y_true * 100
            
            print(f'outputs:\n{outputs}\npred_y_1: \n{y_pred_1}\npred_y_1.shape:\n{y_pred_1.shape}\npred_y_2: \n{y_pred_2}\npred_y_2.shape:\n{y_pred_2.shape}\ny_true: {y_true}\n y_true.shape:\n {y_true.shape}')

            loss1 = loss_temp_1.mean()
            loss2 = loss_temp_2.mean()
            
            loss = 0.3 * loss1 + 0.7 * loss2
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            
            train_loss += loss.item()
            kbar.update(i, values=[("training loss", train_loss/(i + 1)),("loss1",loss1.item()),("loss2",loss2.item()),("lr", optimizer.param_groups[0]['lr'])])  # param_groups holds the current lr; defaults['lr'] is only the initial one
            
        train_loss_ = train_loss / len(train_dataloader)
        train_loss_list.append(train_loss_)

        
        valid_loss, _ = test_model(valid_dataloader, model, criterion_KLD, criterion_mse, device)  # test_model returns (loss, predictions DataFrame)
        valid_loss_list.append(valid_loss)

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), f'{trained_model_path}/{basename}_{fold+1}.pth')
            print(f"New best model saved for fold {fold+1} with validation loss {valid_loss:.6f}")

        ## logging
        print(f'Fold {fold + 1}, Epoch {epoch + 1}/{epochs}, Training Loss: {train_loss_:.6f}, Validation Loss: {valid_loss:.6f}')
        
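        scheduler.step(valid_loss)  # note: `scheduler` is not an argument of train_model; it must exist in the enclosing (global) scope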
        for param_group in optimizer.param_groups:
            print(f"Learning Rate: {param_group['lr']}")
    return train_loss_list, valid_loss_list

def test_model(test_dataloader, model, criterion_KLD, criterion_mse, device):
    model.eval()
    test_loss = 0
    lst_pred = []
    
    with torch.no_grad():
        for batch in test_dataloader:
            indices, offsets, counts, ref, otm = batch
            indices = indices.to(device)
            offsets = offsets.to(device)
            counts = counts.to(device)
            all_counts = counts.sum() + 1e-6
            y_true = counts / all_counts
            batch_size = int(len(indices) / 40)            

            outputs = model(indices, offsets, batch_size)
            outs = outputs.view(-1)
            df_on = pd.DataFrame({'Reference': ref,'Outcomes': otm})
            df_on['true'] = y_true.cpu().detach().numpy()
            df_on['pred'] = outs.cpu().detach().numpy()
            lst_pred.append(df_on)
            
            pred_y_1 = F.log_softmax(outputs, dim=0).view(-1)
            loss_temp_1 = criterion_KLD(pred_y_1, y_true)
            loss_temp_1 = torch.abs(loss_temp_1 * y_true)
            pred_y_2 = F.softmax(outputs, dim=0).view(-1)
            loss_temp_2 = criterion_mse(pred_y_2, y_true)
            loss_temp_2 = loss_temp_2 * y_true
            
            loss1 = loss_temp_1.mean()
            loss2 = loss_temp_2.mean()
            loss = loss1 + loss2
            test_loss += loss.item()
            
    test_loss /= len(test_dataloader)
    df = pd.concat(lst_pred)
    return test_loss,df

In [21]:
train_dataset, list_gRNA, grp_df = process_data(train_data)
test_dataset, list_gRNA_test, grp_test = process_data_test(test_data)

In [22]:
criterion_KLD = nn.KLDivLoss(reduction='none')
criterion_mse = nn.MSELoss(reduction='none')

In [23]:
use_cuda = torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')
print(device)

cuda:0


In [24]:
best_valid_loss = float('inf')

### **Step5: Testing and Evaluating Performance**
****

**Tensor Score Similar Problem:**

Finally, I changed the model to a `BiLSTM` without the `Attention Mechanism`, still using `nn.LSTM` rather than `GRU`, with `hid_dim = 256` and `n_layer = 2`. As training progressed, the `outputs` became more reasonable, since the model had "learned" the information. **You can scroll down through the outputs below to see the changes.**

Next, I plan to plot the correlation coefficients for each test set of each type of base editor; some further statistical analysis also remains to be done.
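
As a starting point for that analysis, here is a rough sketch of a per-reference correlation between true and predicted proportions. The Pearson coefficient is one possible choice, not something fixed by this notebook, and the function names and toy `rows` are assumptions for illustration.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def per_reference_correlation(rows):
    """rows: iterable of (reference, true, pred); returns {reference: r}."""
    groups = defaultdict(lambda: ([], []))
    for ref, t, p in rows:
        groups[ref][0].append(t)
        groups[ref][1].append(p)
    return {ref: pearson(t, p) for ref, (t, p) in groups.items()}


# Hypothetical (reference, true proportion, predicted proportion) rows.
rows = [("refA", 0.1, 0.2), ("refA", 0.3, 0.3), ("refA", 0.6, 0.5),
        ("refB", 0.5, 0.5), ("refB", 0.5, 0.4)]
result = per_reference_correlation(rows)
```

A reference whose true (or predicted) values are all identical has no defined correlation, which the sketch reports as NaN rather than raising an error.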

In [None]:
loss_plot(fold_list, epoch_list, train_list, valid_list, file_path, basename)

### **Step6: Tuning Suggestions**

1. Batch handling: Make sure the reference sequences are handled correctly for each batch. Since you mention that the reference sequence remains constant per batch, confirm that the batching process preserves this.
    - From the predicted values, it seems that **all predictions are very similar**, which could indicate that the model isn’t learning to distinguish between sequences effectively. This could be due to several factors, such as **vanishing gradients**, **improper scaling of the input**, or **issues in how the batch is constructed**.
    - Since your model takes reference sequences and outcomes (for both ABE and CBE), it’s crucial that these sequences are well balanced within each batch.
    - **y_pred_1 and y_pred_2** show very similar values across the batch. This uniformity could indicate that the input sequences in the batch are too similar, or that there is an issue in how the batch data is processed and fed into the model.

2. Loss: You are using two loss functions:
    - **KLD (Kullback-Leibler Divergence)**: The KLD loss between the predicted y_pred_1 and the true target might not be changing much because all predictions are very close in magnitude (-3.0916). This could imply that the model is stuck in a local minimum or that the gradients are not backpropagating properly.
    - **MSE (Mean Squared Error)**: The MSE loss on the second set of predictions (y_pred_2) is very small, which could indicate that the model is not being penalized strongly enough, potentially leading to little learning in early epochs.
3. Model initialization and learning rate: An unstable loss at the beginning could also stem from improper initialization or learning rate settings.
4. Attention mechanism integration: Since the model combines LSTM and attention, proper integration and normalization of the attention weights could affect early training stability.
5. Parameter fine-tuning.
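
One way to test the vanishing-gradient hypothesis from the list above is to look at per-parameter gradient norms right after `loss.backward()`. The helper name `grad_norms` and the toy model and data are assumptions; the toy network only stands in for the real model so that the helper can be exercised.

```python
import torch
import torch.nn as nn


def grad_norms(model):
    """Return {parameter name: L2 norm of its gradient} for parameters that have one."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}


# Toy stand-in model and random data, only to exercise the helper.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.functional.mse_loss(toy(x), y)
loss.backward()
norms = grad_norms(toy)
```

In the training loop this would be called after `loss.backward()` and before `clip_grad_norm_`; norms that stay close to zero in the embedding and recurrent layers would support the vanishing-gradient explanation, while healthy norms would point back at batching or loss scaling.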

In [None]:
model.load_state_dict(torch.load())