# Bayesian Network for Modeling Dependencies in Cost Estimation
First attempt at a Bayesian network:
This is example code that models dependencies in cost estimation for engineering projects using a Bayesian Network. The network includes variables such as Labor Costs, Material Costs, and Project Duration, and it is used to perform probabilistic queries.



In [1]:
# Import necessary modules
# pgmpy is a Python library for probabilistic graphical models
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination


INFO:numexpr.utils:NumExpr defaulting to 8 threads.


### Step 1: Define Structure
We define the structure of the Bayesian Network, which includes the cost-significant items (CSIs) and the dependencies between them. 


In [2]:
# Define the Bayesian Network Structure (CSIs and Dependencies)
model = BayesianNetwork([
    ('Risk_and_Contingency', 'Project_Duration'),
    ('Project_Duration', 'Labor_Costs'),
    ('Inflation', 'Material_Costs'),
    ('Inflation', 'Equipment_Costs'),
    ('Permits', 'Project_Duration'),
    ('Project_Duration', 'Subcontractor_Costs'),
    ('Project_Duration', 'Overhead_and_Management')
])


![Bayesian Network Structure](bayesian_network_structure_resized.png)


The figure above shows the structure of the network. Each node represents a CSI, and each directed edge represents a cause-and-effect relationship rather than a loose influence. For example, project duration has a strong effect on labor costs.

### Step 2: Define the Conditional Probability Distributions (CPDs)
Each node (variable) in the network requires a Conditional Probability Distribution (CPD), which defines the likelihood of each state of the variable, depending on the states of its parent nodes.


In [3]:
# Define CPDs for each node in the network
cpd_risk = TabularCPD(variable='Risk_and_Contingency', variable_card=2, 
                      values=[[0.8], [0.2]])

cpd_duration = TabularCPD(variable='Project_Duration', variable_card=2,
                          values=[[0.9, 0.7, 0.5, 0.3],
                                  [0.1, 0.3, 0.5, 0.7]],
                          evidence=['Risk_and_Contingency', 'Permits'],
                          evidence_card=[2, 2])

cpd_labor = TabularCPD(variable='Labor_Costs', variable_card=2,
                       values=[[0.8, 0.4],
                               [0.2, 0.6]],
                       evidence=['Project_Duration'],
                       evidence_card=[2])

cpd_material = TabularCPD(variable='Material_Costs', variable_card=2,
                          values=[[0.7, 0.4],
                                  [0.3, 0.6]],
                          evidence=['Inflation'],
                          evidence_card=[2])

cpd_equipment = TabularCPD(variable='Equipment_Costs', variable_card=2,
                           values=[[0.75, 0.5],
                                   [0.25, 0.5]],
                           evidence=['Inflation'],
                           evidence_card=[2])

cpd_subcontractor = TabularCPD(variable='Subcontractor_Costs', variable_card=2,
                               values=[[0.85, 0.45],
                                       [0.15, 0.55]],
                               evidence=['Project_Duration'],
                               evidence_card=[2])

cpd_overhead = TabularCPD(variable='Overhead_and_Management', variable_card=2,
                          values=[[0.7, 0.3],
                                  [0.3, 0.7]],
                          evidence=['Project_Duration'],
                          evidence_card=[2])

cpd_permits = TabularCPD(variable='Permits', variable_card=2, 
                         values=[[0.6], [0.4]])

cpd_inflation = TabularCPD(variable='Inflation', variable_card=2, 
                           values=[[0.7], [0.3]])
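The column order of a CPD with several parents is easy to misread. The sketch below uses plain Python rather than pgmpy and assumes pgmpy's convention that the last variable in `evidence` varies fastest across the columns of `values`. It lists which parent states each column of `cpd_duration` refers to and checks that every column sums to 1.

```python
from itertools import product

# Values copied from cpd_duration above: row 0 is Project_Duration=0, row 1 is Project_Duration=1.
values = [[0.9, 0.7, 0.5, 0.3],
          [0.1, 0.3, 0.5, 0.7]]
evidence = ['Risk_and_Contingency', 'Permits']
evidence_card = [2, 2]

# Enumerate the parent-state combinations; the last evidence variable varies fastest.
columns = list(product(*(range(card) for card in evidence_card)))

for col, states in enumerate(columns):
    parents = ", ".join(f"{name}={state}" for name, state in zip(evidence, states))
    p0, p1 = values[0][col], values[1][col]
    # Each column is a distribution over the child's states, so it must sum to 1.
    assert abs(p0 + p1 - 1.0) < 1e-9
    print(f"P(Project_Duration=1 | {parents}) = {p1}")
```

Read this way, the third column (0.5 / 0.5) applies when `Risk_and_Contingency` is in state 1 and `Permits` is in state 0.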


### Step 3: Add CPDs to the Model and Check Validity
We now add the defined CPDs to the Bayesian Network and verify that the model is valid. `check_model()` checks that every node has a CPD, that the probabilities in each CPD sum to 1, and that each CPD is consistent with the parents defined in the structure.

In [4]:
# Add CPDs to the Model
model.add_cpds(cpd_risk, cpd_duration, cpd_labor, cpd_material, cpd_equipment, 
               cpd_subcontractor, cpd_overhead, cpd_permits, cpd_inflation)

# Check if the model is valid
assert model.check_model()
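`VariableElimination` is imported in the first cell but not used in the cells shown here. As a hand check on what such a query should return, the sketch below computes two marginals by direct enumeration in plain Python, using the CPD values defined above. It does not call pgmpy; the variable names are illustrative.

```python
from itertools import product

p_risk = [0.8, 0.2]      # P(Risk_and_Contingency = 0), P(Risk_and_Contingency = 1)
p_permits = [0.6, 0.4]   # P(Permits = 0), P(Permits = 1)
# P(Project_Duration = 1 | Risk_and_Contingency, Permits), keyed by (risk, permits)
p_dur1_given = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.7}
# P(Labor_Costs = 1 | Project_Duration = 0), P(Labor_Costs = 1 | Project_Duration = 1)
p_labor1_given = [0.2, 0.6]

# Marginalize the two root parents out of Project_Duration.
p_duration_1 = sum(
    p_risk[r] * p_permits[p] * p_dur1_given[(r, p)]
    for r, p in product(range(2), repeat=2)
)

# Marginalize Project_Duration out of Labor_Costs.
p_labor_1 = (1 - p_duration_1) * p_labor1_given[0] + p_duration_1 * p_labor1_given[1]

print(round(p_duration_1, 3), round(p_labor_1, 3))
```

With these CPDs the marginals come out as P(Project_Duration = 1) = 0.26 and P(Labor_Costs = 1) = 0.304. A pgmpy query such as `VariableElimination(model).query(variables=['Labor_Costs'])` should agree with these figures.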


### Step 4: Define the Cost Components and Total Cost Function

Here, we define the cost components of the project (labor, material, equipment, subcontractor, and overhead costs) based on the values sampled from the Bayesian Network. Each sample will either have a high or low cost for each component, and we will calculate the total cost for the project by summing these individual costs.


In [5]:
import numpy as np
import plotly.graph_objects as go
from pgmpy.sampling import BayesianModelSampling


# Map each sampled state (0 = low cost, otherwise high cost) to an amount and sum the components
def calculate_total_cost(sample):
    labor_cost = 100000 if sample['Labor_Costs'] == 0 else 150000
    material_cost = 50000 if sample['Material_Costs'] == 0 else 80000
    equipment_cost = 70000 if sample['Equipment_Costs'] == 0 else 90000
    subcontractor_cost = 30000 if sample['Subcontractor_Costs'] == 0 else 50000
    overhead_cost = 20000 if sample['Overhead_and_Management'] == 0 else 40000
    total_cost = labor_cost + material_cost + equipment_cost + subcontractor_cost + overhead_cost
    return total_cost
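To see the range this function can produce, the sketch below restates the same low/high amounts as a lookup table and evaluates the two extreme cases. The names `COSTS` and `total_cost_from_states` are illustrative and not part of the notebook; the amounts are copied from `calculate_total_cost()` above.

```python
# (low, high) cost per component, copied from calculate_total_cost above
COSTS = {
    'Labor_Costs': (100000, 150000),
    'Material_Costs': (50000, 80000),
    'Equipment_Costs': (70000, 90000),
    'Subcontractor_Costs': (30000, 50000),
    'Overhead_and_Management': (20000, 40000),
}

def total_cost_from_states(sample):
    """Sum the component costs; state 0 selects the low amount, any other state the high amount."""
    return sum(low if sample[name] == 0 else high
               for name, (low, high) in COSTS.items())

all_low = {name: 0 for name in COSTS}
all_high = {name: 1 for name in COSTS}

print(total_cost_from_states(all_low), total_cost_from_states(all_high))
```

Every sampled total therefore lies between 270,000 (all components low) and 410,000 (all components high).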


### Step 5: Sample from the Bayesian Network

We generate 1,000 samples from the Bayesian Network using the `forward_sample()` method. Each sample assigns a state to every node in the network, including the project's cost components (labor, material, equipment, subcontractor, and overhead costs).


In [6]:
# Sample from the Bayesian Network
inference = BayesianModelSampling(model)
samples = inference.forward_sample(size=1000)
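Forward sampling draws each node after its parents, so every child is sampled from the CPD column selected by the parent states already drawn. The sketch below illustrates that idea in plain Python for one branch of the network (`Risk_and_Contingency` and `Permits` into `Project_Duration` into `Labor_Costs`). It is not pgmpy's implementation, and `sample_once` is a hypothetical helper.

```python
import random

def sample_once(rng):
    """Draw one joint sample for a four-node branch, parents before children."""
    # Root nodes are drawn from their priors.
    risk = int(rng.random() < 0.2)      # P(Risk_and_Contingency = 1) = 0.2
    permits = int(rng.random() < 0.4)   # P(Permits = 1) = 0.4
    # Project_Duration is drawn given the parent states sampled above.
    p_dur1 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.7}[(risk, permits)]
    duration = int(rng.random() < p_dur1)
    # Labor_Costs is drawn given the sampled Project_Duration.
    labor = int(rng.random() < (0.6 if duration else 0.2))
    return {
        'Risk_and_Contingency': risk,
        'Permits': permits,
        'Project_Duration': duration,
        'Labor_Costs': labor,
    }

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_once(rng) for _ in range(20000)]
freq_duration_1 = sum(d['Project_Duration'] for d in draws) / len(draws)
freq_labor_1 = sum(d['Labor_Costs'] for d in draws) / len(draws)
print(freq_duration_1, freq_labor_1)
```

The sampled frequencies should land close to the exact marginals implied by the CPDs, about 0.26 for `Project_Duration` = 1 and about 0.304 for `Labor_Costs` = 1.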



### Step 6: Calculate Total Project Cost

We now calculate the total project cost for each sample by applying the `calculate_total_cost()` function. This results in a total project cost for each sample, taking into account all the cost components.


In [7]:
# Calculate the total cost for each sample
total_costs = samples.apply(calculate_total_cost, axis=1)


### Step 7: Generate the Cumulative Distribution Function (CDF)

Once we have the total project costs for all the samples, we calculate the CDF. The CDF tells us the cumulative probability that the total project cost will be less than or equal to a given amount. To generate the CDF, we first sort the total costs and then calculate the cumulative probabilities.


In [8]:
# Generate the CDF
sorted_costs = np.sort(total_costs)
cdf = np.arange(1, len(sorted_costs) + 1) / len(sorted_costs)
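One common use of the empirical CDF is reading off the cost at a chosen confidence level, for example the budget that covers 80% of sampled outcomes. The sketch below shows one way to do that with `np.searchsorted`. The `example_costs` array is made up for illustration, and `cost_at_confidence` is a hypothetical helper; in the notebook, `sorted_costs` and `cdf` from the cell above would be used instead.

```python
import numpy as np

# Hypothetical total costs, for illustration only.
example_costs = np.array([270000, 290000, 290000, 320000, 340000,
                          340000, 370000, 390000, 410000, 410000])

sorted_costs = np.sort(example_costs)
cdf = np.arange(1, len(sorted_costs) + 1) / len(sorted_costs)

def cost_at_confidence(level):
    """Return the smallest cost whose empirical CDF value is at least `level` (0 < level <= 1)."""
    idx = np.searchsorted(cdf, level, side='left')
    return sorted_costs[idx]

p50 = cost_at_confidence(0.5)
p80 = cost_at_confidence(0.8)
print(p50, p80)
```

For this made-up data, half of the outcomes are at or below 340,000 and 80% are at or below 390,000.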


### Step 8: Visualize the Cumulative Distribution Function (CDF) with Plotly
We use Plotly to visualize the CDF of the total project costs. The CDF plot shows the probability that the project cost will be at or below a given value, which helps in assessing risk and the likelihood of cost overruns.


![CDF Project Costs](cdf_project_costs.png)
