In [1]:
# boolean network packages
from sad2_final_project.boolean_bn import BN
from sad2_final_project.boolean_bn import save_trajectories_to_csv
from boolean import BooleanAlgebra
# system packages
import os
from pathlib import Path


# paths
## set global dir
cwd=Path.cwd()
if cwd.name == "notebooks":
    os.chdir(cwd.parent) 
print(os.getcwd())
## create paths
DATA_PATH = Path('data/test')
DATASET_PATH = DATA_PATH / 'datasets'
BN_TRUE_PATH = DATA_PATH / 'bn_ground_truth'
BN_DATASET_BNFINDER_FORMAT_PATH = DATA_PATH / 'datasets_bnfinder_format'
BN_RESULT_PATH = DATA_PATH / "result"
## create directories
for path in (DATASET_PATH, BN_TRUE_PATH, BN_DATASET_BNFINDER_FORMAT_PATH, BN_RESULT_PATH):
    path.mkdir(parents=True, exist_ok=True)




/home/maxi7524/repositories/SAD2_final_project


## Part 1 

### 1. Construct boolean networks (way to do it ?)
#### Task
Construct several Boolean networks with sizes (measured by the number of nodes or variables) ranging from 5 to 16. Each node should have no more than three parent nodes, and the Boolean functions governing individual nodes should be generated at random.
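As an illustration of the task, here is a minimal sketch of one way to draw such a network using random truth tables. It does not use the project's `BN` class; the names `random_boolean_network` and `next_state` are hypothetical helpers introduced only for this example.

```python
import itertools
import random


def random_boolean_network(num_nodes, max_parents=3, seed=None):
    """Return {node: (parents, truth_table)} for a random Boolean network."""
    rng = random.Random(seed)
    network = {}
    for node in range(num_nodes):
        # Draw between 1 and max_parents distinct parents for this node.
        k = rng.randint(1, min(max_parents, num_nodes))
        parents = tuple(sorted(rng.sample(range(num_nodes), k)))
        # One random output bit for each of the 2**k parent configurations.
        table = {
            inputs: rng.randint(0, 1)
            for inputs in itertools.product((0, 1), repeat=k)
        }
        network[node] = (parents, table)
    return network


def next_state(network, state):
    """Synchronous update: every node reads the same current state."""
    return tuple(
        table[tuple(state[p] for p in parents)]
        for _, (parents, table) in sorted(network.items())
    )


net = random_boolean_network(num_nodes=5, seed=0)
print(next_state(net, (0, 1, 0, 1, 1)))
```

A random truth table may ignore some of its listed parents (for example, it can be constant), so the effective number of parents can be lower than the number drawn. If that matters for the ground truth, degenerate tables would have to be rejected and redrawn.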

#### Goal
The goal of this study is to determine how the type and amount of time-series data generated from Boolean network dynamics affect the accuracy of dynamic Bayesian network (DBN) structure inference. In particular, we aim to identify:
- how the presence of attractor states in trajectories influences reconstruction accuracy,
- how trajectory length, sampling frequency, and the number of trajectories affect model metrics,
- how these effects depend on network size and update dynamics,
- which data-generation techniques yield stable and informative reconstructions. 

Because synchronous and asynchronous Boolean network updates correspond to different classes of stochastic processes, all results are analyzed separately for these two update modes. Furthermore, since the scoring functions (MDL and BDe) are not directly comparable in scale, they are treated as separate experimental groups evaluated on the same Boolean networks.

#### Methodology

##### Relation Between Trajectory Length and Entering Attractors

**Objective.**
To characterize how trajectory length is related to the probability of entering attractors as a function of network size and dynamics.

**Experimental design.**

- The target attractor proportion is **not controlled**; trajectories evolve naturally.
- Trajectory lengths are varied in increments proportional to network size:
    - from 5 to 50 steps in increments of 5,
    - from 50 to 200 steps in increments of 10.
- Networks are grouped by size (from 4 to 16 nodes, in steps of two).
- The number of parents per node is chosen at random from the set $\{1,2,3\}$ to avoid conditioning results on a fixed connectivity pattern.

**Measured quantities.**

- Probability of reaching an attractor as a function of trajectory length.
- How this probability differs between groups (TODO: add an internal link to the Groups section below).
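The measured probability can be sketched for the synchronous case, where the dynamics are a deterministic map and the attractor is entered at the first visit of the first state that repeats. This is a minimal sketch, not the project's implementation: `toy_step` is a hypothetical 3-node network, and "length" is assumed to mean the number of recorded states including the initial one.

```python
import itertools


def attractor_entry_time(step, state):
    """Number of updates before a deterministic trajectory enters its attractor."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state)
        t += 1
    # `state` is the first repeated state, so it lies on the attractor;
    # the time of its first visit is when the attractor was entered.
    return seen[state]


def toy_step(state):
    """Hypothetical network: x0' = x1 AND x2, x1' = x0, x2' = x0 OR x2."""
    x0, x1, x2 = state
    return (x1 & x2, x0, x0 | x2)


# Trajectory-length grid from the experimental design above.
lengths = list(range(5, 50, 5)) + list(range(50, 201, 10))

# Entry times for every possible initial state of the toy network.
entry_times = [
    attractor_entry_time(toy_step, s)
    for s in itertools.product((0, 1), repeat=3)
]


def reach_probability(length):
    """Fraction of initial states whose trajectory of `length` states reaches an attractor."""
    return sum(t < length for t in entry_times) / len(entry_times)


for length in lengths[:3]:
    print(length, reach_probability(length))
```

For asynchronous updates this shortcut does not apply, because a repeated state need not belong to an attractor; there, attractor membership has to be read from the state transition system instead.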

**Rationale.**
Attractor entry is an emergent property of the dynamics. Controlling it directly is undesirable, as it would introduce selection bias. This experiment instead characterizes the **natural scaling behavior** of Boolean network dynamics.

---

##### Impact of Sampling Frequency

**Objective.**
To determine how temporal subsampling affects autocorrelation, effective sample size, and reconstruction accuracy.

Dynamic Bayesian network inference assumes conditional independence of observations given parent states in the previous time slice. Excessive temporal dependence violates this assumption in practice by introducing redundant observations.

**Experimental design.**

- For fixed networks and trajectory lengths, datasets are generated using multiple sampling frequencies (1, 2, 3, 4, 5).
- For each dataset:
    - ACF and ESS are computed,
    - MDL and BDe scores are extracted from BNFinder2 logs.
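The ACF and ESS computations for a single node can be sketched as follows. This is a minimal sketch under stated assumptions: the ESS uses the common estimate $N / (1 + 2\sum_k \rho(k))$, the sum is truncated at the first non-positive lag (one heuristic among several), and a constant series is treated as perfectly autocorrelated, which is a convention chosen here because the ACF is undefined in that case.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        # Assumption: a constant series (e.g. a fixed-point attractor)
        # is treated as fully redundant; the ACF is undefined here.
        return 1.0
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return cov / var


def effective_sample_size(series, max_lag=None):
    """ESS = N / (1 + 2 * sum of ACF), truncated at the first non-positive lag."""
    n = len(series)
    max_lag = n - 1 if max_lag is None else max_lag
    acf_sum = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorrelation(series, lag)
        if rho <= 0:
            break
        acf_sum += rho
    return n / (1 + 2 * acf_sum)


def subsample(trajectory, frequency):
    """Keep every `frequency`-th state of a trajectory."""
    return trajectory[::frequency]


blocky = [0] * 50 + [1] * 50  # long runs of equal values
print(effective_sample_size(blocky))
print(effective_sample_size(subsample(blocky, 5)))
```

Per-node values would still need to be aggregated across nodes (for example by mean or minimum) before being related to reconstruction accuracy.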

**Analysis.**
Accuracy is analyzed jointly as a function of: 
<!-- What does it mean jointly ?? -->

* sampling frequency,
* ESS,
* scoring function (MDL or BDe).

**Rationale.**
This experiment identifies sampling regimes that balance reduced autocorrelation against loss of dynamic information due to over-subsampling.

---

##### Amount of Trajectories Required for Stable Inference

**Objective.**
To determine how many independent trajectories are required to obtain statistically stable reconstructions.

**Experimental design.**

- Sampling frequency and trajectory length are fixed to values identified as near-optimal in previous experiments.
- The number of trajectories per dataset is increased from 10 to 100 in steps of 10.
- For each setting, 30 independent repetitions are performed to estimate the distribution of the accuracy metrics.

**Evaluation.**

- Reconstruction accuracy is summarized using distributions (score functions).
- Stability is assessed by observing convergence of accuracy metrics as the number of trajectories increases.
- No classical parametric hypothesis test is assumed; instead, convergence trends are reported.
<!-- I did not find any sensible one -->
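One way to report convergence trends without a formal test is sketched below. The helper names and the tolerance-based stability rule are assumptions made for illustration, and the numbers in `example_scores` are made-up placeholders, not experimental results.

```python
from statistics import mean, stdev


def summarize_convergence(scores_by_count):
    """Return (count, mean, std) per trajectory count, in ascending order."""
    return [
        (count, mean(scores), stdev(scores))
        for count, scores in sorted(scores_by_count.items())
    ]


def first_stable_count(summary, tolerance=0.01):
    """Smallest count from which consecutive means differ by at most `tolerance`."""
    for i in range(len(summary) - 1):
        tail = summary[i:]
        if all(abs(b[1] - a[1]) <= tolerance for a, b in zip(tail, tail[1:])):
            return summary[i][0]
    return None


# Placeholder values only: accuracy per repetition for each trajectory count.
example_scores = {
    10: [0.50, 0.70],
    20: [0.68, 0.72],
    30: [0.70, 0.71],
}
summary = summarize_convergence(example_scores)
print(summary)
print(first_stable_count(summary, tolerance=0.01))
```

The standard deviation column gives a second view of stability: a mean that has levelled off while the spread is still shrinking suggests that more repetitions, rather than more trajectories, are needed.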

**Rationale.**
Due to the randomness of Boolean functions and initial states, averaging over multiple networks is necessary to separate systematic effects from instance-specific variability.

---

#### Sets

To ensure reproducibility and enable comparisons, experiments are organized into predefined sets.
##### Groups
1. **Network size groups**
   Networks are grouped by number of nodes (5–16) to analyze scaling behavior.
2. **Update mode groups**
   Synchronous and asynchronous updates are treated separately, as they correspond to different stochastic processes (deterministic map vs stochastic transition system).
3. **Scoring function groups**
   MDL and BDe are analyzed independently. Absolute score values are not compared across scoring functions; only trends with respect to accuracy are considered.
4. **Random function sets**
   For each experimental condition, multiple independently generated Boolean networks are used.
   Random seeds are fixed per network/experiment (TODO ?) instance so that different data-generation parameters (sampling frequency, trajectory length, number of trajectories) are evaluated on identical underlying networks.
   <!-- This is not possible; we just need to use seeds so that the experiments are reproducible -->
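A possible seeding scheme is sketched below, assuming the network generator and the simulator can each accept their own random number generator. Whether the project's `BN` class supports this is not established here; `network_rng` and `dataset_rng` are hypothetical helpers. The idea is that the network stream depends only on the network identifier, so changing data-generation parameters never changes the network.

```python
import random


def network_rng(base_seed, network_id):
    """RNG for drawing a network; independent of data-generation parameters."""
    return random.Random(f"{base_seed}/network/{network_id}")


def dataset_rng(base_seed, network_id, sampling_frequency,
                trajectory_length, num_trajectories, repetition):
    """RNG for the initial states and update order of one dataset."""
    key = (
        f"{base_seed}/network/{network_id}/data/"
        f"{sampling_frequency}/{trajectory_length}/"
        f"{num_trajectories}/{repetition}"
    )
    return random.Random(key)


# The same network stream is obtained regardless of the dataset settings.
print(network_rng(42, 3).random() == network_rng(42, 3).random())
```

String seeds are hashed deterministically by `random.Random`, so the same key reproduces the same stream across runs.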

---

##### Averaging and Distributions

All reported results are based on **distributions**, not single values.
Depending on the experiment, averaging is performed over:

- trajectories (within a dataset),
- independent datasets,
- independently generated networks.

The aggregation strategy is explicitly chosen for each experiment to match the source of variability under investigation.


<!-- ### Impact of Proportion of Attractor States in Trajectories

**Objective.**
To quantify how an increasing proportion of attractor states in trajectories affects the sensitivity of network structure reconstruction.

The central hypothesis is that a high proportion of attractor states leads to strong temporal autocorrelation, which **reduces the effective number of independent observations**, thereby decreasing the sensitivity (recall) of detected edges in the inferred network.

**Experimental design.**

- For each network, datasets are generated with controlled proportions of attractor states:
  $0\%, 10\%, \ldots, 100\%$.
- Trajectories are made sufficiently long to ensure that attractors are reached whenever the target proportion is nonzero.
- Sampling frequency is varied to contrast regimes of strong versus weak temporal dependence.

**Autocorrelation analysis.**
For each dataset, temporal dependence is quantified using the **autocorrelation function (ACF)**:
$$
\rho(k) = \frac{\mathrm{Cov}(X_t, X_{t+k})}{\mathrm{Var}(X_t)}
$$
computed separately for each node and then aggregated (mean or maximum across nodes).

From the ACF, the **effective sample size (ESS)** is estimated:
$$
\mathrm{ESS} \approx \frac{N}{1 + 2\sum_{k=1}^{K} \rho(k)},
$$
where $N$ is the nominal number of observations and $K$ is the truncation lag where autocorrelation becomes negligible.

Reconstruction accuracy is then analyzed as a function of ESS rather than nominal sample size.

**Rationale.**
This isolates the effect of attractor-induced redundancy and avoids conflating “amount of data” with “amount of information”. -->




In [10]:
# TODO: network construction

## TODO: create a generator that returns the appropriate networks

## TODO: sample our networks under the given conditions

## Store the generated functions in the BN class (can be done by default)

## TODO: save our class as ground truth; a method for doing this needs to be written

# Create a random BN
bn_random = BN(num_nodes=9, mode="asynchronous", trajectory_length=50, n_parents_per_node=[2, 3])
# Save the ground truth for this network
bn_random.save_ground_truth(BN_TRUE_PATH / 'condition_2.csv')

bn_random.functions

2026-01-19 22:38:18,884 [INFO] sad2_final_project.boolean_bn.bn: Initializing Boolean Network with 9 nodes in asynchronous mode.
2026-01-19 22:38:18,890 [INFO] sad2_final_project.boolean_bn.bn: Generating random Boolean functions.
2026-01-19 22:38:18,894 [INFO] sad2_final_project.boolean_bn.bn: Generated 9 Boolean functions.
2026-01-19 22:38:18,897 [INFO] sad2_final_project.boolean_bn.bn: Boolean Network initialized successfully.
2026-01-19 22:38:18,898 [INFO] sad2_final_project.boolean_bn.bn: Computing attractors.
2026-01-19 22:38:18,899 [INFO] sad2_final_project.boolean_bn.bn: Generating state transition system (STS).
2026-01-19 22:38:19,093 [INFO] sad2_final_project.boolean_bn.bn: State transition system generated with 512 states and 3132 transitions.
2026-01-19 22:38:19,102 [INFO] sad2_final_project.boolean_bn.bn: Found 1 attractors.
2026-01-19 22:38:19,103 [INFO] sad2_final_project.boolean_bn.bn: Saving ground truth edges to: data/test/bn_ground_truth/condition_2.csv
2026-01-19 22

[{(1, 1, 0, 0, 1, 1, 1, 1, 1), (0, 1, 0, 0, 0, 1, 0, 0, 1), (1, 1, 0, 0, 1, 1, 0, 0, 0), (0, 1, 0, 0, 1, 1, 0, 1, 0), (0, 1, 0, 1, 0, 1, 0, 1, 1), (1, 1, 0, 1, 0, 1, 1, 0, 0), (1, 1, 0, 0, 0, 1, 0, 1, 0), (0, 1, 0, 1, 1, 1, 1, 0, 1), (0, 1, 0, 0, 1, 1, 0, 0, 1), (0, 1, 0, 1, 0, 1, 1, 0, 1), (0, 1, 0, 1, 0, 1, 1, 1, 0), (0, 1, 0, 0, 0, 1, 1, 1, 1), (1, 1, 0, 0, 0, 1, 1, 1, 0), (0, 1, 0, 0, 1, 1, 1, 0, 1), (1, 1, 0, 1, 0, 1, 0, 0, 1), (0, 1, 0, 0, 1, 1, 1, 1, 0), (0, 1, 0, 1, 1, 1, 0, 1, 0), (1, 1, 0, 0, 0, 1, 0, 0, 1), (1, 1, 0, 1, 1, 1, 0, 1, 0), (1, 1, 0, 1, 0, 1, 1, 1, 1), (1, 1, 0, 1, 1, 1, 1, 1, 0), (1, 1, 0, 0, 0, 1, 1, 0, 1), (0, 1, 0, 1, 1, 1, 0, 0, 1), (1, 1, 0, 0, 1, 1, 1, 1, 0), (0, 1, 0, 0, 0, 1, 0, 0, 0), (0, 1, 0, 1, 0, 1, 0, 1, 0), (0, 1, 0, 1, 1, 1, 1, 0, 0), (0, 1, 0, 0, 1, 1, 0, 0, 0), (1, 1, 0, 0, 1, 1, 0, 1, 1), (1, 1, 0, 1, 1, 1, 0, 0, 1), (0, 1, 0, 0, 1, 1, 1, 0, 0), (0, 1, 0, 1, 0, 1, 1, 0, 0), (0, 1, 0, 1, 1, 1, 1, 1, 1), (1, 1, 0, 1, 1, 1, 1, 0, 1), (0, 1, 0, 0,

[OR(OR(Symbol('x7'), NOT(Symbol('x4'))), NOT(Symbol('x0'))),
 OR(Symbol('x5'), NOT(Symbol('x3'))),
 AND(NOT(Symbol('x1')), NOT(Symbol('x2'))),
 OR(Symbol('x8'), Symbol('x7')),
 AND(Symbol('x8'), Symbol('x1')),
 OR(AND(Symbol('x0'), NOT(Symbol('x2'))), Symbol('x1')),
 AND(Symbol('x4'), NOT(Symbol('x6'))),
 AND(Symbol('x5'), NOT(Symbol('x8'))),
 OR(Symbol('x2'), Symbol('x7'))]

### 2. Simulate boolean networks 
#### Task
Simulate trajectories of the generated networks in both synchronous and asynchronous modes to create datasets. The datasets should vary in 
- the proportion of transient and attractor states,
- the trajectory sampling frequency (i.e. the number of time steps between consecutive sampled states), and 
- their overall size (i.e. the number and length of trajectories used to construct each dataset).
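The difference between the two update modes, and the role of the sampling frequency, can be sketched as follows. This is independent of the project's `BN` class and `save_trajectories_to_csv`; `TOY_FUNCTIONS` is a hypothetical 3-node network, and the asynchronous mode is assumed to update one uniformly chosen node per step.

```python
import random

# Hypothetical network: x0' = x1 AND x2, x1' = x0, x2' = x0 OR x2.
TOY_FUNCTIONS = [
    lambda s: s[1] & s[2],
    lambda s: s[0],
    lambda s: s[0] | s[2],
]


def sync_step(functions, state):
    """Update all nodes at once from the same current state."""
    return tuple(f(state) for f in functions)


def async_step(functions, state, rng):
    """Update a single, uniformly chosen node."""
    i = rng.randrange(len(functions))
    new_state = list(state)
    new_state[i] = functions[i](state)
    return tuple(new_state)


def simulate(functions, state, num_steps, mode, sampling_frequency=1, rng=None):
    """Run `num_steps` updates and record every `sampling_frequency`-th state."""
    if rng is None:
        rng = random.Random()
    trajectory = [state]
    for step in range(1, num_steps + 1):
        if mode == "synchronous":
            state = sync_step(functions, state)
        else:
            state = async_step(functions, state, rng)
        if step % sampling_frequency == 0:
            trajectory.append(state)
    return trajectory


print(simulate(TOY_FUNCTIONS, (1, 0, 0), 4, "synchronous"))
print(simulate(TOY_FUNCTIONS, (1, 0, 0), 4, "synchronous", sampling_frequency=2))
print(simulate(TOY_FUNCTIONS, (1, 0, 0), 4, "asynchronous", rng=random.Random(0)))
```

In asynchronous mode consecutive states differ in at most one node, so unsampled trajectories contain many near-duplicate rows; this is one reason the sampling frequency is varied.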


In [7]:
# TODO: network simulation

## Use Michał's function for the simulation
# save_trajectories_to_csv(bn_instance=bn, num_trajectories=100, output_file= DATASET_PATH / 'condition_1.csv' )

save_trajectories_to_csv(bn_instance=bn_random, num_trajectories=2, output_file= DATASET_PATH / 'condition_2.csv' )

## TODO: generate several datasets for each network, i.e. network_1_condition_x with x conditions per case, to see how recovery of the network depends on the dataset

2026-01-19 21:22:36,068 [INFO] sad2_final_project.boolean_bn.bn: Simulating trajectory with target attractor ratio 0.40, tolerance 0.10, max_iter 50
2026-01-19 21:22:36,073 [INFO] sad2_final_project.boolean_bn.bn: Trajectory completed. Length: 50, Attractors: 64, Transients: 83
2026-01-19 21:22:36,074 [INFO] sad2_final_project.boolean_bn.bn: Simulating trajectory with target attractor ratio 0.40, tolerance 0.10, max_iter 50
2026-01-19 21:22:36,080 [INFO] sad2_final_project.boolean_bn.bn: Trajectory completed. Length: 50, Attractors: 67, Transients: 80
2026-01-19 21:22:36,081 [INFO] sad2_final_project.boolean_bn.bn: Simulating trajectory with target attractor ratio 0.40, tolerance 0.10, max_iter 50
2026-01-19 21:22:36,094 [INFO] sad2_final_project.boolean_bn.bn: Trajectory completed. Length: 50, Attractors: 65, Transients: 82
2026-01-19 21:22:36,095 [INFO] sad2_final_project.boolean_bn.bn: Simulating trajectory with target attractor ratio 0.40, tolerance 0.10, max_iter 50
2026-01-19 21:

Done.
Done.


### 3. Build dynamic Bayesian networks with BNFinder2
#### Task
Use the datasets to infer dynamic Bayesian networks with the BNFinder2 software tool. Unlike classical Bayesian networks, dynamic Bayesian networks can contain cycles, making them more suitable for this task. Consider two scoring functions (referred to as scoring criteria in the BNFinder2 terminology): Minimal Description Length (MDL) and BDe (Bayesian–Dirichlet equivalence).

In [8]:
# # TODO temp - problem with path
# p = BN_TRUE_PATH / 'condition_1.csv'
# print("cwd:", Path.cwd())
# print("path:", p)
# print("exists:", p.exists())
# print("is_file:", p.is_file())
# print("parent exists:", p.parent.exists())
# print("files in parent:", list(p.parent.iterdir()) if p.parent.exists() else None)
# # TODO temp - problem with path
p = BN_TRUE_PATH / 'condition_2.csv'
print("cwd:", Path.cwd())
print("path:", p)
print("exists:", p.exists())
print("is_file:", p.is_file())
print("parent exists:", p.parent.exists())
print("files in parent:", list(p.parent.iterdir()) if p.parent.exists() else None)


cwd: /home/maxi7524/repositories/SAD2_final_project
path: data/test/bn_ground_truth/condition_1.csv
exists: True
is_file: True
parent exists: True
files in parent: [PosixPath('data/test/bn_ground_truth/condition_2.csv'), PosixPath('data/test/bn_ground_truth/condition_1.csv')]
cwd: /home/maxi7524/repositories/SAD2_final_project
path: data/test/bn_ground_truth/condition_2.csv
exists: True
is_file: True
parent exists: True
files in parent: [PosixPath('data/test/bn_ground_truth/condition_2.csv'), PosixPath('data/test/bn_ground_truth/condition_1.csv')]


In [12]:
# TODO: network reconstruction

## TODO: take the datasets from the previous exercise and reconstruct the networks here with this script, so we should be able to simply run it
from sad2_final_project.bnfinder import run_bnfinder

datasets = [DATASET_PATH / 'condition_1.csv', DATASET_PATH / 'condition_2.csv']
datasets = [DATASET_PATH / 'condition_2.csv']
# IDEA - conditions 
for d in datasets:
    model_name = str(d.stem)
    run_bnfinder(
        dataset_path=d,
        ground_truth_path=BN_TRUE_PATH / (model_name + '.csv'),
        bnf_file_path=BN_DATASET_BNFINDER_FORMAT_PATH / (model_name + '.txt'),
        trained_model_name=BN_RESULT_PATH / model_name,
        metrics_file=BN_RESULT_PATH / (model_name + '_metric.csv'),
    )
## TODO: compute the metrics (this needs a separate script, since I would like to combine them into a single table later)

-> Loading external data from: data/test/datasets/condition_2.csv
   Loaded dataset with shape: (100, 9)
-> Data written to data/test/datasets_bnfinder_format/condition_2.txt
-> Loading ground truth from: data/test/bn_ground_truth/condition_2.csv

=== STARTING INFERENCE ===
-> Running BNFinder with score: MDL...


  edges.add((str(row[0]), str(row[1]))) # Parent, Child


   Success! Output saved to data/test/result/condition_2_MDL.sif
[MDL] Inferred 14 edges.
   (Ground truth file found, evaluating metrics)
-> Running BNFinder with score: BDE...
   Success! Output saved to data/test/result/condition_2_BDE.sif
[BDE] Inferred 15 edges.
   (Ground truth file found, evaluating metrics)

=== DONE ===


### 4. Assess quality
#### Task
Evaluate the accuracy of the reconstructed network structures with respect to the characteristics of the datasets and the scoring functions used. Consider the original Boolean network (i.e. the one used to generate the dataset) as the ground truth. The evaluation should employ at least two structure-based graph distance measures of your choice, with a brief justification for the selected measures.
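Structure-based measures of this kind can be computed directly on sets of directed edges. The sketch below is only an illustration, not the metric code used by `run_bnfinder`; `edge_metrics` and the example edges are hypothetical. It reports precision, recall and F1 together with two distances: the size of the symmetric difference of the edge sets (a Hamming-style distance) and the Jaccard distance.

```python
def edge_metrics(true_edges, inferred_edges):
    """Compare two sets of directed (parent, child) edges."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    tp = len(true_edges & inferred_edges)   # correctly recovered edges
    fp = len(inferred_edges - true_edges)   # spurious edges
    fn = len(true_edges - inferred_edges)   # missed edges
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    union = tp + fp + fn
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hamming": fp + fn,  # size of the symmetric difference
        "jaccard_distance": 1 - tp / union if union else 0.0,
    }


# Hypothetical example edges, not taken from the experiments.
truth = {("x0", "x1"), ("x1", "x2"), ("x2", "x0")}
inferred = {("x0", "x1"), ("x2", "x1"), ("x2", "x0"), ("x0", "x2")}
print(edge_metrics(truth, inferred))
```

With this definition a reversed edge counts twice in the Hamming-style distance (one missed edge plus one spurious edge), which is appropriate for DBNs where edge direction encodes the temporal dependency.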

In [None]:
# TODO: only explain the conclusions here, since the preceding part is the methodology

## Part 2

### 1. Choose validated Boolean network
#### Task
Your task is to consider a validated Boolean network model of a real-life biological mechanism. To this end, select a Boolean network model of your choice from the ‘models’ subfolder of the Biodivine repository, available at https://github.com/sybila/biodivine-boolean-models, with the number of nodes (variables) not exceeding 16.

In [None]:
# TODO: choose a network

# TODO - 

### 2. Generate dataset
#### Task
Using the insights gained from the first part of the project,
generate an appropriate dataset for the network inference task. 




In [None]:
# TODO: create the dataset (based on the conclusions from the previous part)
# TODO: take all the datasets here and state which conclusions we managed to establish for them

### 3. Reconstruct the network with BNFinder2 
#### Task
Reconstruct the network structure with BNFinder2, applying a scoring function chosen based on your previous
experience. Evaluate the accuracy of the reconstruction.

In [None]:
# TODO: compute the quality of the fit

# Notes

The idea behind the structure is:

# Report

## Part I
## Part II

# Bibliography