# Oil and Gas Production and Emissions Data on the Norwegian Continental Shelf

## Part 4

This is the fourth notebook in a series analyzing oil and gas production and emissions data on the Norwegian Continental Shelf. The data is provided by the Norwegian Petroleum Directorate (NPD, renamed the Norwegian Offshore Directorate in 2024) and covers the period from 2001 to 2020. It can be downloaded from the directorate's website: [Production and Emissions Data](https://factpages.sodir.no/).

You can find the other parts of the series here:

#### Part 1: [Data Collection](https://github.com/percw/Norwegian_oil_gas_decarbonization/blob/main/notebooks/01_data_building/01_production_and_emission_data_building.ipynb)

#### Part 2: [Data Cleaning](https://github.com/percw/Norwegian_oil_gas_decarbonization/blob/main/notebooks/02_data_cleaning/02_production_and_emission_data_cleaning.ipynb)

#### Part 3: [Data Processing](https://github.com/percw/Norwegian_oil_gas_decarbonization/blob/main/notebooks/03_data_processing/03_production_and_emission_data_processing.ipynb)


# Table of contents

1. [Imports](#Imports)
2. [Causal Discovery](#Causal-Discovery)
3. [Regression](#Regression)
4. [Double Machine Learning](#Double-Machine-Learning)


## Imports


In [5]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [28]:
# Importing the dataset from the csv file
filepath = "https://raw.githubusercontent.com/percw/Norwegian_oil_gas_decarbonization/main/data/output/emissions_and_production/cleaned/fields_prod_emissions_intensities_share_1997_2023.csv"

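# Report whether the import succeeded; re-raise on failure so later cells
# do not run against an undefined `data`
try:
    data = pd.read_csv(filepath, sep=",")
    print("Data import successful")
except Exception as exc:
    print(f"Data import failed: {exc}")
    raise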

Data import successful


## Causal Discovery


I will use the ALCM-Hybrid method described in the ["Autonomous LLM-Augmented Causal Discovery Framework (ALCM)" paper](https://arxiv.org/pdf/2405.01744v1). It combines traditional causal discovery algorithms with Large Language Models (LLMs) to enhance the robustness and accuracy of causal graphs, leveraging the strengths of both to generate, refine, and validate causal structures.

Key components of the ALCM-Hybrid method:

- Causal Structure Learning:

  - Utilizes traditional causal discovery algorithms (e.g., Peter-Clark (PC) and Linear Non-Gaussian Acyclic Model (LiNGAM)).
  - Generates initial causal graphs from observational data.

- Causal Wrapper:

  - Translates the initial causal graphs into a series of contextual, causal-aware prompts for the LLM-driven refiner.

- LLM-Driven Refiner:

  - Uses an LLM to refine and validate the initial causal graph.
  - Addresses and alleviates limitations in both the causal discovery algorithms and the datasets by uncovering hidden causal relationships.

- Evaluation:
  - Evaluates the refined causal graph using metrics like precision, recall, F1-score, accuracy, and Normalized Hamming Distance (NHD).
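
The evaluation step can be sketched with plain NumPy. The block below is a minimal illustration, not the ALCM implementation: it assumes both graphs are available as 0/1 adjacency matrices, and it assumes the common NHD convention of dividing the number of mismatched entries by the squared number of nodes. The function name `graph_metrics` and the three-node graphs are hypothetical.

```python
import numpy as np


def graph_metrics(true_adj, pred_adj):
    """Edge-level metrics for two directed graphs given as 0/1 adjacency matrices.

    Entry [i, j] == 1 means an edge i -> j; self-loops on the diagonal are ignored.
    """
    true_adj = np.asarray(true_adj, dtype=bool)
    pred_adj = np.asarray(pred_adj, dtype=bool)
    m = true_adj.shape[0]
    off_diag = ~np.eye(m, dtype=bool)

    tp = int(np.sum(true_adj & pred_adj & off_diag))
    fp = int(np.sum(~true_adj & pred_adj & off_diag))
    fn = int(np.sum(true_adj & ~pred_adj & off_diag))
    tn = int(np.sum(~true_adj & ~pred_adj & off_diag))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumed NHD convention: mismatched entries divided by m^2.
    nhd = (fp + fn) / m**2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "nhd": nhd,
    }


# Hypothetical reference graph: 0 -> 1 -> 2.
reference = np.array([[0, 1, 0],
                      [0, 0, 1],
                      [0, 0, 0]])
# Hypothetical refined graph: keeps 0 -> 1, reverses 1 -> 2, adds 0 -> 2.
refined = np.array([[0, 1, 1],
                    [0, 0, 0],
                    [0, 1, 0]])

metrics = graph_metrics(reference, refined)
print(metrics)
```

A reversed edge counts as one false positive and one false negative here, so orientation errors are penalized twice. These metrics require a reference graph, which for real field data would have to come from domain knowledge.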


## Regression


In this notebook, I will use the data that I have processed in the previous notebooks to perform an Ordinary Least Squares (OLS) regression analysis. The goal is to understand the relationship between the oil and gas production and the CO2 emissions on the Norwegian Continental Shelf. I will also analyze the relationship between the oil and gas production and the methane emissions.

The degree of complexity will increase as the series progresses. I will start with a simple model and then add more variables to the regression analysis. I will also analyze the residuals to check the assumptions of the OLS regression.
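
The residual checks can be illustrated on synthetic data before applying them to the field data. The sketch below is an assumption-laden example, not the analysis used in this notebook: the data is simulated with noise that grows with the regressor, and the two helper functions (`durbin_watson_stat`, `breusch_pagan_lm`) are hypothetical NumPy implementations of the Durbin-Watson statistic and the studentized (Koenker) Breusch-Pagan LM statistic. statsmodels ships equivalent diagnostics (`durbin_watson`, `het_breuschpagan`).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (illustrative only): the noise scale grows with x,
# so the homoskedasticity assumption is violated by construction.
x = rng.uniform(0, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2 + 2.0 * x, size=n)

# OLS with an intercept via least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta


def durbin_watson_stat(resid):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid**2)


def breusch_pagan_lm(resid, X):
    """Studentized Breusch-Pagan LM statistic: n * R^2 from regressing
    the squared residuals on X (X must contain a constant column)."""
    u2 = resid**2
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ gamma
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(resid) * r2


dw = durbin_watson_stat(resid)
lm = breusch_pagan_lm(resid, X)
print(f"Durbin-Watson: {dw:.2f}")
print(f"Breusch-Pagan LM: {lm:.1f} (5% critical value with 1 df: 3.84)")
```

An LM statistic above the critical value points to heteroskedasticity, in which case robust standard errors are a common remedy.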


### Static Variables Regression

#### The $\beta$'s to explore (independent variables):

1. Share of production of peak production
2. Gas/Oil Ratio
3. Years in production
4. Water depth
5. Original reserve
6. Remaining reserve
7. Carbon Price (inflation and currency adjusted)
8. Oil price (inflation and currency adjusted)
9.

++ Controls


| Variable                                    | Description/Unit                                | Name in the dataframe                  |
| ------------------------------------------- | ----------------------------------------------- | -------------------------------------- |
| Y1: Distributed emission intensity          | tCO2e/toe                                       | share_intensity_tco2e/toe_gwp100       |
| Y2: Emission intensity                      | kgCO2e/toe                                      | kgco2e/toe_int_gwp100                  |
| Share of peak production                    | Production as a share of peak production        | share_peak_prod                        |
| Gas/oil ratio                               | Gas/oil ratio                                   | gas_oil_ratio                          |
| Well water depth                            | Mean water depth at the wells                   | well_water_depth_mean                  |
| Well depth                                  | Mean final vertical depth of the wells          | well_final_vertical_depth_mean         |
| Original reserve                            | Original recoverable reserve (oil equivalents)  | original_recoverable_oe                |
| Share remaining reserve of original reserve | %                                               | share_reserve_of_original_reserve      |
| Oil eq. production volatility               | Volatility of monthly oil-equivalent production | net_oil_eq_prod_monthly_sm3_volatility |


In [25]:
# Show all columns

data.head()
data.columns.tolist()

['field',
 'year',
 'net_oil_prod_yearly_mill_sm3',
 'net_gas_prod_yearly_bill_sm3',
 'net_ngl_prod_yearly_mill_sm3',
 'net_condensate_prod_yearly_mill_sm3',
 'net_oil_eq_prod_yearly_mill_sm3',
 'produced_water_yearly_mill_sm3',
 'field_id',
 'net_oil_prod_monthly_sm3_volatility',
 'net_gas_prod_monthly_sm3_volatility',
 'net_ngl_prod_monthly_sm3_volatility',
 'net_condensate_prod_monthly_sm3_volatility',
 'net_oil_eq_prod_monthly_sm3_volatility',
 'produced_water_in_field_volatility',
 'status',
 'current_status',
 'field_owner',
 'processing_field',
 'field_in_emissions',
 'facilities_lifetime_mean',
 'facilities_lifetime_std',
 'facilities_water_depth_mean',
 'facilities_water_depth_std',
 'subsea_facilites_shut_down',
 'surface_facilites_shut_down',
 'subsea_facilites_in_service',
 'surface_facilites_in_service',
 'facility_kind_multi well template',
 'facility_kind_single well template',
 'facility_kind_offshore wind turbine',
 'facility_kind_subsea structure',
 'facility_kind_fps

In [27]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Preprocess the Data: Remove np.nan and inf values
ols_data = data.replace([np.inf, -np.inf], np.nan).dropna()

# Define the dependent and independent variables
Y1 = ols_data["share_intensity_tco2e/toe_gwp100"]
Y2 = ols_data["kgco2e/toe_int_gwp100"]

X = ols_data[
    [
        "share_peak_prod",
        "well_water_depth_mean",
        "well_final_vertical_depth_mean",
        "original_recoverable_oe",
        "share_reserve_of_original_reserve",
        "net_oil_eq_prod_monthly_sm3_volatility",
        "gas_reserve_ratio",
        "oil_gas_reserve_ratio",
    ]
]

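# Add an intercept; the squared-term loop below skips the "const" column
X = sm.add_constant(X)

# Perform OLS regression using statsmodels for detailed results
ols_Y1 = sm.OLS(Y1, X).fit()
ols_Y2 = sm.OLS(Y2, X).fit()

# Fit the same model with sklearn as a cross-check; LinearRegression adds
# its own intercept, so the constant column is dropped here
reg = LinearRegression().fit(X.drop(columns=["const"]), Y2)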

# Second degree polynomial regression
X_poly2 = X.copy()
for col in X.columns:
    if col != "const":
        X_poly2[f"{col}^2"] = X[col] ** 2

ols_Y1_poly2 = sm.OLS(Y1, X_poly2).fit()
ols_Y2_poly2 = sm.OLS(Y2, X_poly2).fit()


# Define a function to display the results in a pretty format
def pretty_print_results(results, degree):
    print(f"OLS Regression Results (Degree {degree}):")
    print("========================================")
    print(results.summary())
    print("\n\n")


# Print the results
pretty_print_results(ols_Y1, 1)
pretty_print_results(ols_Y2, 1)
pretty_print_results(ols_Y1_poly2, 2)
pretty_print_results(ols_Y2_poly2, 2)

OLS Regression Results (Degree 1):
                                   OLS Regression Results                                   
Dep. Variable:     share_intensity_tco2e/toe_gwp100   R-squared:                       0.282
Model:                                          OLS   Adj. R-squared:                  0.273
Method:                               Least Squares   F-statistic:                     35.12
Date:                              Wed, 26 Jun 2024   Prob (F-statistic):           5.97e-47
Time:                                      00:06:16   Log-Likelihood:                -4655.7
No. Observations:                               726   AIC:                             9329.
Df Residuals:                                   717   BIC:                             9371.
Df Model:                                         8                                         
Covariance Type:                          nonrobust                                         
                                   

In [54]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
from statsmodels.tools.sm_exceptions import MissingDataError

data = pd.read_csv(filepath, sep=",")

# Ensure the DataFrame has a MultiIndex for panel data
data.set_index(["field_id", "year"], inplace=True)

# Define the dependent variable
Y1 = data["share_intensity_tco2e/toe_gwp100"]

# Define the independent variables
X = data[
    [
        "share_peak_prod",
        "well_water_depth_mean",
        "well_final_vertical_depth_mean",
        "original_recoverable_oe",
        "share_reserve_of_original_reserve",
        "net_oil_eq_prod_monthly_sm3_volatility",
        "gas_reserve_ratio",
        # "oil_gas_reserve_ratio",
    ]
]

# Add a constant term for the intercept
X = sm.add_constant(X)

# Ensure Y1 aligns with X after dropping NaNs
X = X.dropna()
Y1 = Y1.loc[X.index]

# Perform OLS regression using statsmodels for detailed results
try:
    ols_Y1 = sm.OLS(Y1, X).fit()
except MissingDataError:
    print("Missing data found in the independent variables")

# Fixed effects model with year (time) effects only; field (entity) effects are not included
try:
    X_fe = X.drop(columns=["const"])
    fixed_effects_Y1 = PanelOLS(Y1, X_fe, time_effects=True).fit()
except MissingDataError:
    print("Missing data found in the fixed effects model")


# Define a function to display the results in a pretty format
def pretty_print_results(results, degree):
    print(f"OLS Regression Results (Degree {degree}):")
    print("========================================")
    print(results.summary())
    print("\n\n")


def pretty_print_fe_results(results):
    print("Fixed Effects Regression Results:")
    print("=================================")
    print(results)
    print("\n\n")


# Print the results
pretty_print_results(ols_Y1, 1)
pretty_print_fe_results(fixed_effects_Y1)

OLS Regression Results (Degree 1):
                                   OLS Regression Results                                   
Dep. Variable:     share_intensity_tco2e/toe_gwp100   R-squared:                       0.226
Model:                                          OLS   Adj. R-squared:                  0.222
Method:                               Least Squares   F-statistic:                     59.90
Date:                              Wed, 26 Jun 2024   Prob (F-statistic):           1.34e-75
Time:                                      00:30:57   Log-Likelihood:                -9373.1
No. Observations:                              1445   AIC:                         1.876e+04
Df Residuals:                                  1437   BIC:                         1.880e+04
Df Model:                                         7                                         
Covariance Type:                          nonrobust                                         
                                   

### Dynamic (operational) Variables Regression

#### The $\beta$'s to explore (independent variables):

1. Operator
2. Owner
3. Electrified
4. Production Volatility
5. Well Status : Drilling
6. Well Status : Open/Producing
7. Well Status : Plugged
8. Well Status : Closed
9. Well Final Vertical Depth Mean
10.
11.
12.

++ Controls


In [None]:
data.head(5)

## Double Machine Learning


### Introduction to Double Machine Learning (DML) Regression

The Double Machine Learning (DML) regression framework aims to estimate the causal effects of production levels and the quantity left in the field on pollution levels while accounting for confounding factors that could influence these relationships. The model decomposes the problem into two stages.

In the first stage, machine learning algorithms predict the treatment variables (production level and quantity left in the field) from a set of control variables. This isolates the variation in the treatment variables that the controls do not explain. The resulting residuals are the unexplained parts of the treatment variables, free from the confounding influence of the observed controls.

In the second stage, these residuals enter a regression model for the outcome variable (pollution), together with their interaction and squared terms to capture potential nonlinear effects and interactions.

Provided the controls capture the relevant confounders, this approach reduces the bias in the estimated effects of the treatment variables on pollution. The goal is a clearer understanding of how changes in production and field practices affect pollution, separated from the distortions caused by other factors.


### Double Machine Learning (DML) Regression of Carbon Intensity

Double Machine Learning (DML) is an econometric framework for causal inference with high-dimensional data and complex confounding structures. It uses machine learning models to control for confounders, which helps isolate the causal effects of treatment variables on an outcome. The approach is particularly useful for complex datasets where conventional regression may not adequately account for the many confounding factors.

### Model Specification

Consider the following DML regression model where the outcome variable $ Y $ denotes pollution levels. The treatment variables are $ X_1 $ (production level) and $ X_2 $ (quantity left in the field), while $ Z $ represents additional control variables. Interaction terms between production and the quantity left in the field are included to capture potential synergistic effects.

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2^2 + \beta_4 (X_1 \cdot X_2) + \beta_5 (X_1 \cdot X_2^2) + G(Z) + u$

### Components of the Model

- **Outcome Variable $ Y $**: Represents the pollution levels.
- **Treatment Variables $ X_1 $ and $ X_2 $**:
  - $ X_1 $: Production level.
  - $ X_2 $: Quantity left in the field.
- **Interaction Terms**:
  - $ X_1 \cdot X_2 $: Interaction between production and quantity left in the field.
  - $ X_1 \cdot X_2^2 $: Interaction between production and the squared quantity left in the field.
- **Control Function $ G(Z) $**: A non-parametric function representing the influence of other covariates $ Z $ on $ Y $.
- **Error Term $ u $**: Captures unobserved factors affecting $ Y $.
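
To make the role of the interaction terms concrete, the sketch below builds the treatment regressors in the order of $\beta_1, \dots, \beta_5$ and evaluates the marginal effect of production implied by the specification, $\partial Y / \partial X_1 = \beta_1 + \beta_4 X_2 + \beta_5 X_2^2$. The function names and the coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np


def treatment_terms(x1, x2):
    """Stack the regressors X1, X2, X2^2, X1*X2, X1*X2^2 (order of beta_1..beta_5)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([x1, x2, x2**2, x1 * x2, x1 * x2**2])


def marginal_effect_x1(beta, x2):
    """dY/dX1 = beta_1 + beta_4 * X2 + beta_5 * X2^2, with beta = (beta_0, ..., beta_5)."""
    x2 = np.asarray(x2, dtype=float)
    return beta[1] + beta[4] * x2 + beta[5] * x2**2


# Two hypothetical observations of (X1, X2).
terms = treatment_terms([1.0, 2.0], [3.0, 4.0])
print(terms)

# Hypothetical coefficients (beta_0, ..., beta_5).
beta = np.array([0.0, 1.0, 0.2, 0.0, 0.5, -0.1])
effect = marginal_effect_x1(beta, 2.0)
print(effect)
```

Because of the interaction terms, the effect of production on pollution is not a single number: it varies with the quantity left in the field.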

### Estimation Process Using DML

1. **First Stage**:

   1. **Modeling $ X_1 $ and $ X_2 $**: Use machine learning methods (e.g., Lasso, Random Forest, Neural Networks) to predict the treatment variables $ X_1 $ and $ X_2 $ based on $ Z $. This helps control for the confounding effect of $ Z $.
   2. **Residuals**: Calculate the residuals of $ X_1 $ and $ X_2 $ after accounting for $ Z $.

2. **Second Stage**:
   1. **Predicting $ Y $**: Use the residuals obtained from the first stage to predict $ Y $, including the interaction terms and their non-linear components.
   2. **Control Function**: Include $ G(Z) $ to control for the influence of $ Z $ on $ Y $.

### Mathematical Formulation

The DML estimation process involves the following steps:

1. **Estimate the nuisance parameters**: $\hat{m}_1(Z)$ and $\hat{m}_2(Z)$ for $ X_1 $ and $ X_2 $ respectively.
2. **Compute residuals**:
   $\hat{u}_1 = X_1 - \hat{m}_1(Z), \quad \hat{u}_2 = X_2 - \hat{m}_2(Z)$
3. **Estimate the reduced form**:
   $Y = \beta_0 + \beta_1 \hat{u}_1 + \beta_2 \hat{u}_2 + \beta_3 \hat{u}_2^2 + \beta_4 (\hat{u}_1 \cdot \hat{u}_2) + \beta_5 (\hat{u}_1 \cdot \hat{u}_2^2) + G(Z) + u$
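
The residualization idea can be tried on synthetic data. The sketch below is a simplified variant and makes several assumptions that differ from the formulation above: it keeps only the linear treatment terms ($\beta_1 X_1 + \beta_2 X_2$), it also residualizes $Y$ on $Z$ instead of carrying $G(Z)$ into the second stage (the standard partialling-out estimator), it uses 5-fold cross-fitting, and it uses a polynomial least-squares fit as a stand-in for learners such as Lasso or Random Forest. All data, function names, and coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic data (illustrative only): Z confounds both treatments and the outcome.
Z = rng.uniform(-2, 2, size=n)
X1 = Z**2 + rng.normal(size=n)        # m_1(Z) = Z^2
X2 = np.sin(Z) + rng.normal(size=n)   # m_2(Z) = sin(Z)
beta_true = np.array([1.5, -0.8])
G = 2 * np.sin(Z) + Z**2              # control function G(Z)
Y = beta_true[0] * X1 + beta_true[1] * X2 + G + rng.normal(size=n)


def poly_features(z, degree=5):
    """Columns 1, z, ..., z^degree; a simple stand-in for an ML learner."""
    return np.vander(z, N=degree + 1, increasing=True)


def cross_fit_residuals(target, z, n_folds=5, degree=5, seed=0):
    """Out-of-fold residuals target - m_hat(z), with m_hat fitted on the other folds."""
    order = np.random.default_rng(seed).permutation(len(z))
    resid = np.empty(len(z))
    for test_idx in np.array_split(order, n_folds):
        train = np.ones(len(z), dtype=bool)
        train[test_idx] = False
        coef, *_ = np.linalg.lstsq(
            poly_features(z[train], degree), target[train], rcond=None
        )
        resid[test_idx] = target[test_idx] - poly_features(z[test_idx], degree) @ coef
    return resid


# First stage: residualize the treatments and the outcome on Z.
u1 = cross_fit_residuals(X1, Z)
u2 = cross_fit_residuals(X2, Z)
y_res = cross_fit_residuals(Y, Z)

# Second stage: regress the outcome residuals on the treatment residuals.
beta_dml, *_ = np.linalg.lstsq(np.column_stack([u1, u2]), y_res, rcond=None)

# Naive OLS that ignores Z, for comparison (columns: const, X1, X2).
beta_naive, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X1, X2]), Y, rcond=None)

print("true: ", beta_true)
print("DML:  ", beta_dml)
print("naive:", beta_naive[1:])
```

In this setup the naive estimates absorb part of $G(Z)$, while the cross-fitted estimates should land close to the true coefficients. Extending the second stage to the squared and interaction terms would require deciding how those terms are residualized, which this sketch leaves open.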


