In [1]:
# General imports

import pandas as pd

# Oil and Gas Production and Emissions Data on the Norwegian Continental Shelf

## Part 4

This is the fourth notebook in a series I am creating to analyze oil and gas production and emissions data from the Norwegian Continental Shelf. The data is provided by the Norwegian Petroleum Directorate (NPD), covers the period from 2001 to 2020, and can be downloaded from the NPD website: [Production and Emissions Data](https://factpages.sodir.no/).

You can find the other parts of the series here:

#### Part 1: [Data Collection](https://github.com/percw/Norwegian_oil_gas_decarbonization/blob/main/notebooks/01_data_building/01_production_and_emission_data_building.ipynb)

#### Part 2: [Data Cleaning](https://github.com/percw/Norwegian_oil_gas_decarbonization/blob/main/notebooks/02_data_cleaning/02_production_and_emission_data_cleaning.ipynb)

#### Part 3: [Data Processing](https://github.com/percw/Norwegian_oil_gas_decarbonization/blob/main/notebooks/03_data_processing/03_production_and_emission_data_processing.ipynb)


### Introduction to Double Machine Learning (DML) Regression

The Double Machine Learning (DML) regression framework aims to estimate the causal effects of production levels and the quantity left in the field on pollution levels while accounting for confounding factors that could influence these relationships. The model works in two stages.

In the first stage, machine learning algorithms predict the treatment variables (production level and quantity left in the field) from a set of control variables. This isolates the variation in the treatment variables that the control variables do not explain. The resulting residuals are the unexplained parts of the treatment variables, purged of the confounding influence of the controls.

In the second stage, these residuals enter a regression model for the outcome variable (pollution), together with their interaction and squared terms, which capture potential nonlinear effects and interactions. Provided the controls capture the relevant confounders, this reduces the bias in the estimated effects of the treatment variables on pollution. The goal is a clearer understanding of how changes in production and field practices affect pollution, separate from the distortions caused by other factors.


### Double Machine Learning (DML) Regression of Carbon Intensity

Double Machine Learning (DML) is an econometric framework for causal inference with high-dimensional data and complex confounding structures. It uses machine learning methods to control flexibly for confounders, which helps isolate the causal effects of treatment variables on an outcome. The approach is particularly useful for complex datasets where conventional regression may not adequately account for the many confounding factors.

### Model Specification

Consider the following DML regression model where the outcome variable $ Y $ denotes pollution levels. The treatment variables are $ X_1 $ (production level) and $ X_2 $ (quantity left in the field), while $ Z $ represents additional control variables. Interaction terms between production and the quantity left in the field are included to capture potential synergistic effects.

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2^2 + \beta_4 (X_1 \cdot X_2) + \beta_5 (X_1 \cdot X_2^2) + G(Z) + u
$$

### Components of the Model

- **Outcome Variable $ Y $**: Represents the pollution levels.
- **Treatment Variables $ X_1 $ and $ X_2 $**:
  - $ X_1 $: Production level.
  - $ X_2 $: Quantity left in the field.
- **Interaction Terms**:
  - $ X_1 \cdot X_2 $: Interaction between production and quantity left in the field.
  - $ X_1 \cdot X_2^2 $: Interaction between production and the squared quantity left in the field.
- **Control Function $ G(Z) $**: A non-parametric function representing the influence of other covariates $ Z $ on $ Y $.
- **Error Term $ u $**: Captures unobserved factors affecting $ Y $.
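
To make the role of the interaction terms concrete, the specification can be differentiated with respect to each treatment while holding $ Z $ fixed. The expressions below follow directly from the model equation and add no further assumption:

$$
\frac{\partial Y}{\partial X_1} = \beta_1 + \beta_4 X_2 + \beta_5 X_2^2,
\qquad
\frac{\partial Y}{\partial X_2} = \beta_2 + 2 \beta_3 X_2 + \beta_4 X_1 + 2 \beta_5 X_1 X_2
$$

The marginal effect of production on pollution is therefore allowed to vary with the quantity left in the field, and it is constant only when $ \beta_4 = \beta_5 = 0 $.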

### Estimation Process Using DML

1. **First Stage**:

   1. **Modeling $ X_1 $ and $ X_2 $**: Use machine learning methods (e.g., Lasso, Random Forest, Neural Networks) to predict the treatment variables $ X_1 $ and $ X_2 $ based on $ Z $. This helps control for the confounding effect of $ Z $.
   2. **Residuals**: Calculate the residuals of $ X_1 $ and $ X_2 $ after accounting for $ Z $.

2. **Second Stage**:
   1. **Predicting $ Y $**: Use the residuals obtained from the first stage to predict $ Y $, including the interaction terms and their non-linear components.
   2. **Control Function**: Include $ G(Z) $ to control for the influence of $ Z $ on $ Y $.
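
The first stage above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual implementation: the function name `cross_fit_residuals` and its parameters are hypothetical, ordinary least squares stands in for the learners named above (Lasso, Random Forest, Neural Networks), and the fold-wise out-of-sample prediction (cross-fitting) is an assumption, since the text does not say how the first-stage predictions are produced.

```python
import numpy as np


def cross_fit_residuals(x, Z, n_folds=5, seed=0):
    """Residualize a treatment x on controls Z with out-of-fold predictions.

    Hypothetical helper: a linear least-squares fit is used as a stand-in
    for whichever machine learning model is chosen in practice.
    """
    x = np.asarray(x, dtype=float)
    Z = np.asarray(Z, dtype=float)
    n = x.shape[0]

    # Add an intercept column to the controls.
    design = np.column_stack([np.ones(n), Z])

    # Randomly partition the observations into folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)

    residuals = np.empty(n)
    for held_out in folds:
        train = np.ones(n, dtype=bool)
        train[held_out] = False
        # Fit on the training folds only ...
        coef, *_ = np.linalg.lstsq(design[train], x[train], rcond=None)
        # ... and residualize the held-out fold with that fit.
        residuals[held_out] = x[held_out] - design[held_out] @ coef
    return residuals
```

Calling the function once per treatment, for example `u1_hat = cross_fit_residuals(x1, Z)` and `u2_hat = cross_fit_residuals(x2, Z)`, would give the two residual series used in the second stage. Predicting each observation from a model that never saw it is a common way to keep overfitting in the first stage from leaking into the second.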

### Mathematical Formulation

The DML estimation process involves the following steps:

1. **Estimate the nuisance parameters**: $ \hat{m}_1(Z) $ and $ \hat{m}_2(Z) $ for $ X_1 $ and $ X_2 $ respectively.
2. **Compute residuals**:
   $$
   \hat{u}_1 = X_1 - \hat{m}_1(Z), \quad \hat{u}_2 = X_2 - \hat{m}_2(Z)
   $$
3. **Estimate the reduced form**:
   $$
   Y = \beta_0 + \beta_1 \hat{u}_1 + \beta_2 \hat{u}_2 + \beta_3 \hat{u}_2^2 + \beta_4 (\hat{u}_1 \cdot \hat{u}_2) + \beta_5 (\hat{u}_1 \cdot \hat{u}_2^2) + G(Z) + u
   $$
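
Step 3 can be sketched as an ordinary least-squares fit on the residual terms. The function name `second_stage` and the argument `g_hat` are hypothetical. The text does not say how $ G(Z) $ is estimated, so this sketch assumes a fitted value of $ G(Z) $ is already available and subtracts it from $ Y $ before regressing; other ways of including the control function are possible.

```python
import numpy as np


def second_stage(y, u1, u2, g_hat):
    """Estimate (beta_0, ..., beta_5) from first-stage residuals.

    Hypothetical helper: g_hat is an assumed, externally supplied
    estimate of G(Z); it is subtracted from y before the fit.
    """
    y, u1, u2, g_hat = (np.asarray(a, dtype=float) for a in (y, u1, u2, g_hat))

    # Columns follow the order of the coefficients in the equation:
    # intercept, u1, u2, u2^2, u1*u2, u1*u2^2.
    design = np.column_stack(
        [np.ones_like(u1), u1, u2, u2**2, u1 * u2, u1 * u2**2]
    )

    # Least-squares fit of the G(Z)-adjusted outcome on the residual terms.
    beta, *_ = np.linalg.lstsq(design, y - g_hat, rcond=None)
    return beta


# Usage on synthetic data (all values below are made up for illustration).
rng = np.random.default_rng(0)
u1_demo = rng.normal(size=300)
u2_demo = rng.normal(size=300)
g_demo = np.sin(rng.normal(size=300))
y_demo = 1.0 + 2.0 * u1_demo - 1.0 * u2_demo + g_demo + rng.normal(scale=0.1, size=300)
beta_demo = second_stage(y_demo, u1_demo, u2_demo, g_demo)
```

The returned array holds the coefficients in the same order as in the equation, so `beta_demo[1]` corresponds to $ \beta_1 $ and `beta_demo[4]` to $ \beta_4 $.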

### Conclusion

By implementing DML, researchers can estimate the effects of production levels and remaining quantities in the field on pollution levels while flexibly controlling for other covariates. Provided the controls capture the relevant confounders, this reduces the bias that conventional regression can suffer from in high-dimensional settings, which makes DML a useful tool for policy analysis and decision-making in environmental economics and beyond.

