# How can machine learning be applied to predict site viability scores for offshore basins?

In my second subquestion, I use a new dataset of potential offshore injection sites across the Gulf of Mexico from Nassabeh et al. (2024). 

# Import Necessary Packages

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import ibis
from ibis import _
import ibis.selectors as s
from plotnine import *
from sklearn.preprocessing import OneHotEncoder

# In-memory DuckDB connection used by ibis for all reads below
con = ibis.duckdb.connect()

# Read in Data from Wendt et al. (2022)

For this dataset, I use ibis with DuckDB instead of pandas to read in the data. The data are complex, which ibis handles better internally, and ibis also accommodates the larger size of the data I am working with.

In [2]:
# skip=3 skips the first 3 lines of the file before the header row
wendt = con.read_csv('wendt_data.csv', skip=3)

In [3]:
wendt.head().execute()

| | Grid FID | Grid ID | Longitude\n(degrees) | Latitude\n(degrees) | CO2 Source ID (tied to NATCARB dataset)* | Distance (km) | CO2 Source ID (tied to NATCARB dataset) | Distance (km)_1 | CO2 Source ID (tied to NATCARB dataset)_1 | Distance (km)_2 | ... | Sum of Reservoir Quality W/O Depth (feet) | New Injectivity | S1 | S2 | S3 | S4 | S1_1 | S2_1 | S3_1 | S4_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36663 | -96.964053 | 26.000342 | 1671 | 56.265652 | 860 | 39.710421 | 286 | 127.205047 | ... | 9.1271 | 12.3958 | 57.746243 | NaN | 38.5721 | NaN | 25-50% | None | 25-50% | None |
| 1 | 1 | 36668 | -97.035537 | 26.071826 | 1671 | 50.940346 | 860 | 34.179635 | 286 | 117.880692 | ... | 1.5259 | 1.7557 | 30.571945 | NaN | 16.120983 | NaN | 0-25% | None | 0-25% | None |
| 2 | 2 | 36669 | -96.964053 | 26.071826 | 1671 | 57.783507 | 860 | 41.034622 | 286 | 124.718795 | ... | 9.1271 | 12.3958 | 57.670181 | NaN | 38.559063 | NaN | 25-50% | None | 25-50% | None |
| 3 | 3 | 36670 | -97.035537 | 26.14331 | 1671 | 53.574517 | 860 | 37.428918 | 286 | 115.740856 | ... | 9.1271 | 12.3958 | 32.312887 | NaN | 14.222778 | NaN | 0-25% | None | 0-25% | None |
| 4 | 4 | 36671 | -96.964053 | 26.14331 | 1671 | 60.118581 | 860 | 43.77447 | 286 | 122.694148 | ... | 9.1271 | 12.3958 | 57.596115 | NaN | 38.506005 | NaN | 25-50% | None | 25-50% | None |


In [4]:
len(wendt.execute())

2559

In [5]:
wendt.schema()

ibis.Schema {
  Grid FID                                     int64
  Grid ID                                      int64
  Longitude\n(degrees)                         float64
  Latitude\n(degrees)                          float64
  CO2 Source ID (tied to NATCARB dataset)*     int64
  Distance (km)                                float64
  CO2 Source ID (tied to NATCARB dataset)      int64
  Distance (km)_1                              float64
  CO2 Source ID (tied to NATCARB dataset)_1    int64
  Distance (km)_2

# Data Cleaning
## Drop Grid Identifier Numbers

In [6]:
wendt = wendt.drop('Grid ID', 'Grid FID')

## Modify Data Type

The last four columns hold the quartile bin of the site score, currently stored as strings such as '0-25%'. I convert them to floats (0.25, 0.5, 0.75, 1.0) that represent the upper edge of the quartile bin, and I map missing (None) entries to NaN.

In [7]:
wendt['S1_1'].execute()[1]

'0-25%'

trial = wendt.mutate(
    q1 = (
        ibis.case()
        .when(wendt['S1_1'] == '0-25%', 0.25)
        .when(wendt['S1_1'] == '25-50%', 0.5)
        .when(wendt['S1_1'] == '50-75%', 0.75)
        .when(wendt['S1_1'] == '75-100%', 1.0)
        .else_(None)
        .end()
    ))
trial.execute()
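
The prose for this step describes converting all four quartile columns, while the cell above handles only `S1_1`. Below is a minimal sketch of the same mapping applied to every quartile column. It uses pandas on a small made-up frame rather than the ibis table; the names `quartile_map`, `toy`, and the `_q` suffix are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Quartile labels as they appear in the S*_1 columns, mapped to the upper bin edge.
quartile_map = {"0-25%": 0.25, "25-50%": 0.5, "50-75%": 0.75, "75-100%": 1.0}

# Toy stand-in for the last four columns of the table (values are made up).
toy = pd.DataFrame({
    "S1_1": ["25-50%", "0-25%", "75-100%"],
    "S2_1": [None, None, None],
    "S3_1": ["25-50%", "0-25%", "50-75%"],
    "S4_1": [None, None, None],
})

quartile_cols = ["S1_1", "S2_1", "S3_1", "S4_1"]

# Labels missing from the dict (including None) become NaN after .map().
converted = toy[quartile_cols].apply(lambda col: col.map(quartile_map)).astype(float)
converted = converted.add_suffix("_q")
```

The same dictionary could drive a loop of `.when(...)` clauses in ibis, so the label-to-value pairs are defined once instead of once per column.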

# Principal Component Analysis

To perform PCA, I need to convert the data to a 2559 x 58 NumPy array. I include only the input features, not the site scores and quartiles for a given scenario.

In [8]:
wendt.columns[-8:]

('S1', 'S2', 'S3', 'S4', 'S1_1', 'S2_1', 'S3_1', 'S4_1')

In [9]:
wendt_pca = wendt.drop(wendt.columns[-8:])
wendt_pca_arr = wendt_pca.execute().to_numpy()
wendt_pca_arr.shape

(2559, 58)

In [10]:
wendt_scores_df = wendt.select(wendt.columns[-8:]).execute()
wendt_pca_scores = wendt_scores_df.to_numpy()  # to_numpy must be called, not just referenced
wendt_scores_df

             S1  S2         S3  S4    S1_1  S2_1     S3_1  S4_1
0     57.746243 NaN  38.572100 NaN  25-50%  None   25-50%  None
1     30.571945 NaN  16.120983 NaN   0-25%  None    0-25%  None
2     57.670181 NaN  38.559063 NaN  25-50%  None   25-50%  None
3     32.312887 NaN  14.222778 NaN   0-25%  None    0-25%  None
4     57.596115 NaN  38.506005 NaN  25-50%  None   25-50%  None
...         ...  ..        ...  ..     ...   ...      ...   ...
2554  30.814319 NaN  16.993804 NaN   0-25%  None    0-25%  None
2555  30.510465 NaN  15.706714 NaN   0-25%  None    0-25%  None
2556  53.280620 NaN  33.239677 NaN  25-50%  None    0-25%  None
2557  53.660471 NaN  34.577485 NaN  25-50%  None    0-25%  None
2558  48.344369 NaN  53.137015 NaN   0-25%  None  75-100%  None

[2559 rows x 8 columns]

In [11]:
wendt_mean = np.mean(wendt_pca_arr, axis=0)
wendt_std = np.std(wendt_pca_arr, axis=0)
wendt_standardized = (wendt_pca_arr - wendt_mean) / wendt_std
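
Dividing by the per-column standard deviation produces NaN values if any column is constant, because its standard deviation is zero. The sketch below shows one possible guard on toy data. The helper name `standardize` and the choice to return zeros for constant columns are assumptions made for illustration, not the method used in this notebook.

```python
import numpy as np

def standardize(arr):
    """Z-score each column; constant columns are returned as all zeros."""
    arr = np.asarray(arr, dtype=float)
    mean = arr.mean(axis=0)
    std = arr.std(axis=0)
    # Replace zero standard deviations with 1 so constant columns give 0/1 = 0.
    safe_std = np.where(std == 0, 1.0, std)
    return (arr - mean) / safe_std

# Toy data: the middle column is constant.
toy = np.array([[1.0, 5.0, 10.0],
                [2.0, 5.0, 20.0],
                [3.0, 5.0, 30.0]])
toy_standardized = standardize(toy)
```

Casting with `dtype=float` also surfaces any non-numeric column early, since the conversion raises an error instead of silently producing an object array.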

In [12]:
u, s, vt = np.linalg.svd(wendt_standardized, full_matrices=False)
print(f"Dimensions of U: {u.shape}")
print(f"1D List of diagonal elements of Sigma: {s}")
print(f"Dimensions of V Transpose: {vt.shape}")

Dimensions of U: (2559, 58)
1D List of diagonal elements of Sigma: [1.81479333e+02 1.40890587e+02 1.29904609e+02 9.08362762e+01
 9.02046059e+01 8.18919813e+01 7.90134409e+01 7.53663594e+01
 7.45665152e+01 6.92069343e+01 6.39427653e+01 5.92650405e+01
 5.41589899e+01 5.09088129e+01 5.01114413e+01 4.81300428e+01
 4.53491611e+01 4.40457679e+01 3.87414291e+01 3.66575773e+01
 3.56645572e+01 3.39847048e+01 3.28065057e+01 3.18979044e+01
 2.88514390e+01 2.58544655e+01 2.13379882e+01 2.04692452e+01
 1.99979313e+01 1.57800300e+01 1.56201962e+01 1.45043747e+01
 1.38395423e+01 1.27763389e+01 1.02554571e+01 9.11214394e+00
 6.34243279e+00 4.14822636e+00 3.91675162e-01 1.49938369e-04
 1.13685255e-04 7.85939065e-05 7.56138791e-05 7.03359953e-05
 6.73046042e-05 4.95069539e-05 4.74193190e-05 4.29965866e-05
 3.33158382e-05 1.92643262e-05 2.88201312e-14 1.45332609e-14
 1.23189140e-14 1.22658215e-14 1.19308481e-14 1.19308481e-14
 1.19308481e-14 1.19308481e-14]
Dimensions of V Transpose: (58, 58)
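
The last eight singular values printed above are on the order of 1e-14, which is at the level of floating-point round-off. This suggests that some columns may be exact linear combinations of others. A small sketch of how the numerical rank can be counted follows. It uses toy data with one deliberately redundant column, and the tolerance is the conventional `s.max() * max(shape) * eps` rule, chosen here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))

# The fourth column is an exact linear combination of the first two,
# so the matrix has rank 3 even though it has 4 columns.
toy = np.column_stack([base, base[:, 0] + base[:, 1]])
toy = (toy - toy.mean(axis=0)) / toy.std(axis=0)

_, s_toy, _ = np.linalg.svd(toy, full_matrices=False)

# Singular values below this tolerance are treated as zero.
tol = s_toy.max() * max(toy.shape) * np.finfo(float).eps
numerical_rank = int(np.sum(s_toy > tol))
```

Applying the same count to the 58 singular values of `wendt_standardized` would indicate how many independent directions the features actually span.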


In [13]:
wendt_total_variance = (sum(s**2))/u.shape[0]

print("wendt_total_variance: {:.3f} should approximately equal the sum of the feature variances: {:.3f}"
      .format(wendt_total_variance, np.sum(np.var(wendt_standardized, axis=0))))

wendt_total_variance: 58.000 should approximately equal the sum of the feature variances: 58.000
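
Because the variance along component i is s_i**2 divided by the number of rows, the printed values are enough to estimate how much of the total variance the leading components capture. The sketch below copies the first two singular values, the row count, and the total variance from the outputs above. The variable names are illustrative.

```python
import numpy as np

# Values copied from the printed outputs above.
n_rows = 2559
total_variance = 58.0
s_first_two = np.array([1.81479333e+02, 1.40890587e+02])

# Variance along each component is s_i**2 / n_rows;
# dividing by the total variance gives that component's share.
share = s_first_two**2 / n_rows / total_variance
cumulative_share = float(share.sum())
```

With these numbers, PC1 and PC2 account for roughly 22% and 13% of the total variance, so a two-dimensional projection retains a little over a third of it.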


In [14]:
wendt_2d = wendt_standardized @ vt.transpose()[:, :2]
wendt_2d.shape

(2559, 2)
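
Projecting the standardized data onto the first two rows of `vt` is equivalent to scaling the first two columns of `u` by their singular values. The following sketch checks that identity on random toy data (not the Wendt dataset); all names ending in `_toy` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
toy = rng.normal(size=(50, 5))
toy = (toy - toy.mean(axis=0)) / toy.std(axis=0)

u_toy, s_toy, vt_toy = np.linalg.svd(toy, full_matrices=False)

# Route 1: project the data onto the first two principal directions.
projected = toy @ vt_toy.T[:, :2]

# Route 2: scale the first two left singular vectors by their singular values.
from_u = u_toy[:, :2] * s_toy[:2]

# Variance of each projected coordinate equals s_i**2 / n_rows.
projected_variance = projected.var(axis=0)
```

The second route avoids a matrix product, which can matter for larger arrays, though for a 2559 x 58 array either form is inexpensive.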

plt.figure(figsize=(7, 7))
plt.title("PC2 vs. PC1 for Wendt et al. (2022)")
plt.xlabel("Wendt PC1")
plt.ylabel("Wendt PC2")
# Scatter the first two principal component scores
sns.scatterplot(x=wendt_2d[:, 0], y=wendt_2d[:, 1]);