## Data Wrangling

### Introduction

This notebook is part of a capstone project for the Springboard Data Science Career Track.

The goal of this project is to develop a machine learning model to rank and predict the likelihood that an oil company will initiate a frac job in a county within the Permian Basin in the first quarter of 2024.

In [1]:
# Import statements
import concurrent.futures as cf
import random
import re
import tempfile
import warnings
from functools import lru_cache
from pathlib import Path
from typing import Optional
from urllib.request import urlopen
from zipfile import ZipFile
import missingno as msno
import cartopy.crs as ccrs
from shapely import wkt
import geopandas as gpd
import geoviews as gv
import geoviews.tile_sources as gts
import colorcet as cc
import holoviews as hv
import hvplot.pandas  # noqa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyproj
from fiona.io import ZipMemoryFile
from pyvis.network import Network
from sqlalchemy import create_engine
from sqlalchemy.exc import SQLAlchemyError
from tqdm import tqdm
from scipy.stats import hypergeom


hv.extension("bokeh")
gv.extension("bokeh")

In [155]:
# ignore all warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 40)

In [3]:
# Test initial print statement
print("CapstoneJourney begins!")

CapstoneJourney begins!


#### Constants
Let's start by defining some constants that will be used throughout this notebook.

Most of the data was first downloaded from external websites and then uploaded to a cloud storage bucket, so that the same files stay available throughout the project. A brief description of each dataset and a link to its original source are given below.

#### Data Sources

The following table provides an overview of the data sources used in this project:

| Dataset Name | Source URL | Original Source | Description | Date Downloaded |
|--------------|------------|-----------------|-------------|-----------------|
| RegistryUpload Table | [link](https://fracfocus.org/data-download) | FracFocus | This table contains each disclosure’s header information such as the job date, API number, location, base water volume, and total vertical depth. | 2023-11-11 |
| RBDMSWells | [link](https://gisdata-occokc.opendata.arcgis.com/datasets/OCCOKC::rbdms-wells/about) | Oklahoma Corporation Commission | This table contains Oklahoma RBDMS statewide well data. | 2023-11-23 |
| Wolfcamp Delaware Play Boundary | [link](https://www.eia.gov/maps/maps.htm) | EIA | Permian Basin, Delaware Sub-Basin: Wolfcamp play boundary (9/4/2018) | 2023-11-19 |
| Wolfcamp Midland Play Boundaries | [link](https://www.eia.gov/maps/maps.htm) | EIA | Wolfcamp A, B, C, and D play boundaries, Midland Basin (6/4/2020) | 2023-11-21 |
| ShalePlay Delaware | [link](https://www.eia.gov/maps/maps.htm) | EIA | Delaware play boundary (10/8/2019) | 2023-11-21 |
| AboYeso GlorietaYeso Spraberry | [link](https://www.eia.gov/maps/maps.htm) | EIA | Abo-Yeso, Glorieta-Yeso, and Spraberry play boundaries (3/11/2016) | 2023-11-21 |
| NM SLO OilGas Leases | [link](https://www.nmstatelands.org/maps-gis/gis-data-download/) | New Mexico State Land Office | Active Oil and Gas Leases (11/07/2023) | 2023-11-21 |
| NM SLO Geologic Regions | [link](https://www.nmstatelands.org/maps-gis/gis-data-download/) | New Mexico State Land Office | Geologic Regions (01/04/2010) | 2023-11-21 |
| NM SLO STL Status Combined | [link](https://www.nmstatelands.org/maps-gis/gis-data-download/) | New Mexico State Land Office | New Mexico State Trust Lands By Subdivision (04/14/2022) | 2023-11-21 |
| All Layers By County | [link](https://rrc.texas.gov/resource-center/research/data-sets-available-for-download/)  | Railroad Commission of Texas | Map & Associated Data: Base Map, Wells, Surveys & Pipelines layers | 2023-11-17 |
| Oil & Gas Leases | [link](https://www.glo.texas.gov/land/land-management/gis/index.html) | Texas General Land Office | Active Leases (11/17/2023) | 2023-11-17 |
| Oil & Gas Units | [link](https://www.glo.texas.gov/land/land-management/gis/index.html) | Texas General Land Office | Active Units (11/17/2023) | 2023-11-17 |
| U.S. County Boundaries | [link](https://www2.census.gov/geo/tiger/TIGER2022/COUNTY/tl_2022_us_county.zip) | United States Census Bureau | County (2022-10-31). Data is downloaded directly in the code. | N/A |
| U.S. County FIPS Codes | [link](https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county) | Wikipedia | List of United States FIPS codes by county. Data is downloaded directly in the code. | N/A |

Each row in the table represents a different dataset. The columns are:

- **Dataset Name**: The name of the dataset.
- **Source URL**: The URL where the dataset can be downloaded. Click on "link" to access the webpage.
- **Original Source**: The original source of the data.
- **Description**: A brief description of the dataset.
- **Date Downloaded**: The date when the dataset was downloaded.

In [4]:
# Constants
# This cell generates lists of URLs to CSV files stored in a Google Cloud Storage bucket.
# The CSV files contain data from the FracFocus Chemical Disclosure Registry.

# Generate a list of URLs to the FracFocusRegistry CSV files.
# There are 24 files in total, named FracFocusRegistry_i.csv where i ranges from 1 to 24.
DATA_URLS1 = [
    f"https://storage.googleapis.com/mrprime_dataset/fracfocus/FracFocusRegistry_{i}.csv"
    for i in range(1, 25)
]

# Generate a list of URLs to the registryupload CSV files.
# There are 3 files in total, named registryupload_i.csv where i ranges from 1 to 3.
DATA_URLS2 = [
    f"https://storage.googleapis.com/mrprime_dataset/fracfocus/registryupload_{j}.csv"
    for j in range(1, 4)
]

# URL to the readme.txt file in the bucket.
DATA_README_URL = [
    "https://storage.googleapis.com/mrprime_dataset/fracfocus/readme.txt"
]

# URL to the OCC (Oklahoma) well data in the bucket
OCC_PARQUET_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/occ/rbdms_wells.parquet"

In [5]:
# Url for the shapefile for US counties from the Census Bureau's website.
CENSUS_COUNTY_MAP_URL = (
    "https://www2.census.gov/geo/tiger/TIGER2022/COUNTY/tl_2022_us_county.zip"
)
# Url for a Wikipedia page containing a table of FIPS codes for US counties.
FIPS_WIKI_URL = (
    "https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county"
)
# Bounds of the continental US in longitude and latitude.
USA_BOUNDS = (-124.77, 24.52, -66.95, 49.38)
# bounds of the continental US in Web Mercator coordinates.
USA_BOUNDS_MERCATOR = (-13874905.0, 2870341.0, -7453304.0, 6338219.0)

In [6]:
# URLs for the EIA play-boundary shapefiles in the Permian Basin:
# Wolfcamp (Delaware Sub-Basin), Wolfcamp (Midland Basin), Delaware, and Abo-Yeso/Glorieta-Yeso/Spraberry
WOLFCAMP_ZIP_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/eia/Wolfcamp_Delaware_Play_Boundary.zip"
MIDLAND_ZIP_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/eia/Wolfcamp_Midland_Play_Boundaries_EIA.zip"
DELAWARE_ZIP_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/eia/ShalePlay_Delaware_EIA.zip"
ABOYESO_ZIP_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/eia/ShalePlays_AboYeso_GlorietaYeso_Spraberry_EIA.zip"
# PB_ZIP_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/eia/PermianBasin_Boundary_Structural_Tectonic.zip"

basins_url_list = [
    WOLFCAMP_ZIP_URL,
    MIDLAND_ZIP_URL,
    DELAWARE_ZIP_URL,
    ABOYESO_ZIP_URL,
    # PB_ZIP_URL,
]


# url for shapefiles of Polygon data set intended to delineate active oil and gas leases on New Mexico State Trust Lands.
NM_SLO_OIL_LEASE_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/nm_slo/OilGas_Leases.zip"

# url for shapefiles of Polygon layer created to highlight general boundaries of subsurface geologic basins and uplifts of New Mexico
NM_SLO_GEO_REGION_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/nm_slo/slo_GeologicRegions.zip"
# url for shapefiles of Polygons of New Mexico State Trust Lands by PLSS subdivision (quarter-quarter, lot, tract, or partial).
NM_SLO_STL_PLSS_URL = "https://storage.googleapis.com/mrprime_dataset/capstone_journey/nm_slo/slo_STLStatusCombined.zip"

nm_slo_url_list = [
    NM_SLO_OIL_LEASE_URL,
    NM_SLO_GEO_REGION_URL,
]  # , NM_SLO_STL_PLSS_URL]

In [7]:
# Define a list of county numbers that we want to test. These numbers correspond to counties
# that we did not include in the data folder, but they do not cover all 254 counties.

# county_nums = [
#     "003",
#     "103",
#     "135",
#     "173",
#     "301",
#     "317",
#     "329",
#     "371",
#     "389",
#     "475",
#     "495",
# ] + [str(i).zfill(3) for i in range(7, 103, 4)]
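# Texas county FIPS codes are the odd numbers 001-507 (254 counties);
# stepping by 4 selects every other county, i.e. 127 of the 254.
county_nums = [str(i).zfill(3) for i in range(1, 508, 4)]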

# Generate a list of URLs to the shapefile zip files stored in a Google Cloud Storage bucket.
# The zip files are named Shp{num}.zip, where {num} is a county number from the county_nums list.
SHP_ZIP_URLS = [
    f"https://storage.googleapis.com/mrprime_dataset/capstone_journey/rrc/all_layers_rrc_20231117/Shp{num}.zip"
    for num in county_nums
]

In [8]:
# URLs for the Texas GLO geodatabases (gdb) of active leases and active units on state land
GDB_ZIP_URLS = [
    "https://storage.googleapis.com/mrprime_dataset/capstone_journey/glo/GDB_ActiveLeases.zip",
    "https://storage.googleapis.com/mrprime_dataset/capstone_journey/glo/GDB_ActiveUnits.zip",
    # "https://storage.googleapis.com/mrprime_dataset/capstone_journey/glo/GDB_InactiveLeases.zip",
]

#### Function definitions
Next, let's define some functions that will be used throughout this notebook.

In [9]:
def read_csv_concurrent(urls_list):
    """Reads a list of CSV files concurrently"""
    # Create a thread pool
    with cf.ThreadPoolExecutor() as executor:
        # Use map to apply pd.read_csv to each URL
        results = list(tqdm(executor.map(pd.read_csv, urls_list), total=len(urls_list)))
    # Return the results
    return results

In [10]:
def extract_specific_gdf_from_local_zip(
    zip_paths: list[str], regex_patterns: list[str]
) -> dict[str, gpd.GeoDataFrame]:
    """
    Reads shapefiles from a list of zip files and returns a dictionary
    where the keys are the names of the shapefiles and the values are GeoDataFrames.
    """
    # Initialize an empty dictionary to store the GeoDataFrames
    shp_dict = {}
    # compile the regex patterns
    patterns = [re.compile(pattern) for pattern in regex_patterns]

    # Loop over the list of zip file paths
    for zip_path in zip_paths:
        # Open the zip file
        with ZipFile(zip_path) as z:
            # Get the list of files in the zip file
            zip_contents = z.namelist()
            # Filter the list to get only the shapefiles that match any of the patterns
            shp_files = [
                f
                for f in zip_contents
                # any() lists each file once, even if several patterns match it
                if f.endswith(".shp") and any(p.search(f) for p in patterns)
            ]
            # read the shapefiles into GeoDataFrames
            for shp_file in shp_files:
                # Get the name of the shapefile
                shp_name = Path(shp_file).stem
                # Read the shapefile into a GeoDataFrame and add it to the dictionary
                shp_dict[shp_name] = gpd.read_file(f"zip://{zip_path}!{shp_file}")
    # Return the dictionary of GeoDataFrames
    return shp_dict

In [11]:
def extract_matching_shp_files_from_zip_urls(
    zip_urls: list[str], regex_patterns: list[str]
) -> dict[str, gpd.GeoDataFrame]:
    """
    Reads shapefiles from a list of zip file urls and returns a dictionary
    where the keys are the names of the shapefiles and the values are GeoDataFrames.
    """
    # Initialize an empty dictionary to store the GeoDataFrames
    shp_dict = {}
    # compile the regex patterns
    patterns = [re.compile(pattern) for pattern in regex_patterns]

    # Loop over the list of zip file urls
    for zip_url in tqdm(zip_urls, desc="Processing zip files"):
        # download the zip file
        with urlopen(zip_url) as u:
            zip_data = u.read()
        # create a ZipMemoryFile from the zip data
        with ZipMemoryFile(zip_data) as z:
            # get the list of files in the zip file
            zip_files = z.listdir()
            # filter for shapefiles that match any of the patterns
            shp_files = [
                f
                for f in zip_files
                # any() lists each file once, even if several patterns match it
                if f.endswith(".shp") and any(p.search(f) for p in patterns)
            ]
            # read the shapefiles into GeoDataFrames
            for shp_file in shp_files:
                with z.open(shp_file) as f:
                    shp_dict[Path(shp_file).stem] = gpd.GeoDataFrame.from_features(
                        f, crs=f.crs
                    )
    # Return the dictionary of GeoDataFrames
    return shp_dict

In [12]:
def process_zip_url(
    zip_url: str, patterns: list[re.Pattern]
) -> dict[str, gpd.GeoDataFrame]:
    """Downloads a zip file url and returns a dictionary of GeoDataFrames for shapefiles that match the patterns"""
    shp_dict = {}
    with urlopen(zip_url) as u:
        zip_data = u.read()
    with ZipMemoryFile(zip_data) as z:
        zip_files = z.listdir()
        shp_files = [
            f
            for f in zip_files
            # any() lists each file once, even if several patterns match it
            if f.endswith(".shp") and any(p.search(f) for p in patterns)
        ]
        for shp_file in shp_files:
            with z.open(shp_file) as f:
                shp_dict[Path(shp_file).stem] = gpd.GeoDataFrame.from_features(
                    f, crs=f.crs
                )
    return shp_dict


def extract_matching_shp_files_from_zip_urls_concurrent(
    zip_urls: list[str], regex_patterns: list[str]
) -> dict[str, gpd.GeoDataFrame]:
    """Reads shapefiles from a list of zip file urls and returns a dictionary
    where the keys are the names of the shapefiles and the values are GeoDataFrames."""
    shp_dict = {}
    patterns = [re.compile(pattern) for pattern in regex_patterns]
    with cf.ThreadPoolExecutor() as executor:
        future_to_url = {
            executor.submit(process_zip_url, url, patterns): url for url in zip_urls
        }
        futures = tqdm(
            cf.as_completed(future_to_url),
            total=len(future_to_url),
            desc="Processing URLs",
            dynamic_ncols=True,
        )
        for future in futures:
            shp_dict.update(future.result())
    return shp_dict

In [13]:
def concat_gdf_from_dict(gdf_dict: dict[str, gpd.GeoDataFrame]) -> gpd.GeoDataFrame:
    """
    Given a dictionary of GeoDataFrames, returns a single GeoDataFrame
    with a new column indicating the source of the data.
    """
    # use a dictionary comprehension to create a new dictionary
    gdf_data = {k: gdf.assign(source_file=k) for k, gdf in gdf_dict.items()}
    # return the concatenated GeoDataFrame
    return pd.concat(gdf_data.values(), ignore_index=True)

In [14]:
def extract_gdfs_from_zip(zip_path: str) -> Optional[dict[str, gpd.GeoDataFrame]]:
    """
    Reads shapefiles from a zip file and returns a dictionary of GeoDataFrames.
    """
    gdfs = {}
    # Open the zip file
    with ZipFile(zip_path) as z:
        # Get the list of files in the zip file
        zip_contents = z.namelist()
        # Find the shapefiles
        shp_files = [f for f in zip_contents if f.endswith(".shp")]
        for shp_file in shp_files:
            # Read the shapefile into a GeoDataFrame
            gdf = gpd.read_file(f"zip://{zip_path}!{shp_file}")
            gdfs[shp_file] = gdf

    # If no shapefile was found, return None
    return gdfs if gdfs else None

In [15]:
def extract_gdfs_from_zip_url(zip_url: str) -> Optional[dict[str, gpd.GeoDataFrame]]:
    """
    Downloads a ZIP file from a URL, reads shapefiles from the ZIP file, and returns a dictionary of GeoDataFrames.
    """
    gdfs = {}
    # Open the URL
    with urlopen(zip_url) as u:
        # Read the content of the response into a byte stream
        zip_data = u.read()
        # Open the ZIP file from the byte stream
        with ZipMemoryFile(zip_data) as z:
            # Get the list of files in the ZIP file
            zip_contents = z.listdir()
            # Find the shapefiles
            shp_files = [f for f in zip_contents if f.endswith(".shp")]
            for shp_file in shp_files:
                # Read the shapefile into a GeoDataFrame
                with z.open(shp_file) as f:
                    gdf = gpd.GeoDataFrame.from_features(f, crs=f.crs)
                gdfs[Path(shp_file).stem] = gdf

    # If no shapefile was found, return None
    return gdfs if gdfs else None

In [16]:
def process_shp_url(zip_url: str) -> dict[str, gpd.GeoDataFrame]:
    """Downloads a zip file url and returns a dictionary of GeoDataFrames, one per shapefile in the archive"""
    shp_dict = {}
    with urlopen(zip_url) as u:
        zip_data = u.read()
    with ZipMemoryFile(zip_data) as z:
        zip_files = z.listdir()
        shp_files = [f for f in zip_files if f.endswith(".shp")]
        for shp_file in shp_files:
            with z.open(shp_file) as f:
                shp_dict[Path(shp_file).stem] = gpd.GeoDataFrame.from_features(
                    f, crs=f.crs
                )
    return shp_dict


def extract_gdfs_from_zip_url_concurrent(
    zip_urls: list[str],
) -> dict[str, gpd.GeoDataFrame]:
    """Reads shapefiles from a list of zip file urls and returns a dictionary
    where the keys are the names of the shapefiles and the values are GeoDataFrames."""
    shp_dict = {}
    with cf.ThreadPoolExecutor() as executor:
        future_to_url = {executor.submit(process_shp_url, url): url for url in zip_urls}
        futures = tqdm(
            cf.as_completed(future_to_url),
            total=len(future_to_url),
            desc="Processing URLs",
            dynamic_ncols=True,
        )
        for future in futures:
            shp_dict.update(future.result())
    return shp_dict

In [17]:
def read_gdb_from_zip(gdb_zips_list: list[str]):
    """Reads a list of zip files containing geodatabases and returns a dictionary of GeoDataFrames"""
    # initialize an empty dictionary
    gdb_dict = {}
    # loop through each zip file
    for gdb_zip in gdb_zips_list:
        with ZipFile(gdb_zip, "r") as z:
            # get list of files in zip
            files = z.namelist()
            # filter for gdb folders
            gdb_folders = [f for f in files if f.endswith(".gdb/")]
            # if there is a gdb folder in the zip file
            if gdb_folders:
                # get it and read it into a GeoDataFrame
                gdb_folder = gdb_folders[0]
                gdb_dict[Path(gdb_folder).stem] = gpd.read_file(
                    f"zip://{gdb_zip}!{gdb_folder}"
                ).to_crs("EPSG:4269")
    # return the dictionary of GeoDataFrames
    return gdb_dict

In [18]:
def read_gdb_from_zip_url(gdb_urls_list: list[str]):
    """Reads a list of zip file urls containing geodatabases and returns a dictionary of GeoDataFrames"""
    # initialize an empty dictionary
    gdb_dict = {}
    # loop through each zip file
    for gdb_url in gdb_urls_list:
        # create a temporary directory
        with tempfile.TemporaryDirectory() as tmp_dir:
            # download the zip file
            with urlopen(gdb_url) as u, open(f"{tmp_dir}/data.zip", "wb") as f_out:
                f_out.write(u.read())
            # extract the zip file
            with ZipFile(f"{tmp_dir}/data.zip", "r") as zip_ref:
                zip_ref.extractall(tmp_dir)
            # get the list of extracted files
            extracted_files = list(Path(tmp_dir).iterdir())
            # filter for gdb folders
            gdb_folders = [f for f in extracted_files if f.suffix == ".gdb"]
            # if there is a gdb folder in the extracted files
            if gdb_folders:
                # get it and read it into a GeoDataFrame
                gdb_folder = gdb_folders[0]
                gdb_dict[Path(gdb_folder).stem] = gpd.read_file(gdb_folder).to_crs(
                    "EPSG:4269"
                )
    # return the dictionary of GeoDataFrames
    return gdb_dict

In [19]:
def process_gdb_url(gdb_url):
    """Downloads a zip file url containing a geodatabase and returns a GeoDataFrame"""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # download the zip file
        with urlopen(gdb_url) as u, open(f"{tmp_dir}/data.zip", "wb") as f_out:
            f_out.write(u.read())
        # extract the zip file
        with ZipFile(f"{tmp_dir}/data.zip", "r") as zip_ref:
            zip_ref.extractall(tmp_dir)
        # get the list of extracted files
        extracted_files = list(Path(tmp_dir).iterdir())
        # filter for gdb folders
        gdb_folders = [f for f in extracted_files if f.suffix == ".gdb"]
        # if there is a gdb folder in the extracted files
        if gdb_folders:
            # get it and read it into a GeoDataFrame
            gdb_folder = gdb_folders[0]
            return Path(gdb_folder).stem, gpd.read_file(gdb_folder)


def read_gdb_from_zip_url_concurrent(gdb_urls_list: list[str]):
    """Reads a list of zip file urls containing geodatabases and returns a dictionary of GeoDataFrames"""
    # initialize an empty dictionary
    gdb_dict = {}
    # create a ThreadPoolExecutor
    with cf.ThreadPoolExecutor() as executor:
        # submit the process_gdb_url function for each url and gather the results
        future_to_url = {
            executor.submit(process_gdb_url, url): url for url in gdb_urls_list
        }
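        for future in cf.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                result = future.result()
            except Exception as exc:
                print(f"{url} generated an exception: {exc}")
                continue
            # process_gdb_url returns None when the zip holds no .gdb folder
            if result is None:
                print(f"{url} contains no geodatabase; skipping")
                continue
            key, data = result
            gdb_dict[key] = data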
    # return the dictionary of GeoDataFrames
    return gdb_dict

In [20]:
# Function definitions
def pascal_to_snake(name: str):
    """Converts a string from PascalCase to snake_case"""
    # (?<=[A-Za-z0-9]) - positive lookbehind for any alphanumeric character
    # (?=[A-Z][a-z]) - positive lookahead for any uppercase followed by lowercase
    pattern = re.compile(r"(?<=[A-Za-z0-9])(?=[A-Z][a-z])")
    name = pattern.sub("_", name).lower()
    return name
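
As a quick sanity check of the regex, the sketch below repeats the function in compact form so that it runs on its own, then applies it to a few column names taken from the RegistryUpload table. The `columns` list and the `renamed` dictionary are illustrative names only and are not used elsewhere in the notebook.

```python
import re


def pascal_to_snake(name: str) -> str:
    """Same logic as the function above, repeated so this block is standalone."""
    return re.sub(r"(?<=[A-Za-z0-9])(?=[A-Z][a-z])", "_", name).lower()


# A handful of RegistryUpload column names (illustrative selection)
columns = ["JobStartDate", "APINumber", "TotalBaseWaterVolume", "TVD", "pKey"]
renamed = {col: pascal_to_snake(col) for col in columns}
print(renamed)
```

Note that runs of capitals such as `API` and `TVD` stay together, because an underscore is only inserted before an uppercase letter that is followed by a lowercase one.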

In [21]:
def unify_crs(
    dataframe: pd.DataFrame,
    lon_col: str = "longitude",
    lat_col: str = "latitude",
    crs_col: str = "crs",
    final_crs: str = "EPSG:4269",
):
    """
    Given a DataFrame with lon/lat or x/y coordinates,
    converts the coordinates to a unified crs and combines
    into a single GeoDataframe with a geometry column.
    """

    # Define the main columns that will be used for the conversion
    main_cols = [lon_col, lat_col, crs_col]

    # Get the other columns in the dataframe
    other_cols = list(set(dataframe.columns) - set(main_cols))

    # Create a subframe with only the main columns
    subframe = dataframe[main_cols]

    # Create a list of GeoDataFrames, each with a different CRS
    geo_dfs = [
        gpd.GeoDataFrame(
            # Use the data for this CRS
            data=data,
            # Create a geometry column from the lon/lat columns
            geometry=gpd.points_from_xy(x=data[lon_col].values, y=data[lat_col].values),
            # Set the CRS for this GeoDataFrame
            crs=pyproj.CRS(crs_val),
            # Convert the GeoDataFrame to the final CRS
        ).to_crs(final_crs)
        # Do this for each unique CRS in the subframe
        for crs_val, data in subframe.groupby(crs_col)
    ]

    # Merge the GeoDataFrames back together and return the result
    return pd.merge(
        # Concatenate the GeoDataFrames
        pd.concat(geo_dfs, sort=True),
        # Add the other columns back in
        dataframe[other_cols],
        # Merge on the index
        left_index=True,
        right_index=True,
    )

In [22]:
# @lru_cache(maxsize=3)
def get_background_map(bgcolor="black", alpha=0.5):
    """Returns a GeoViews background map"""
    return gts.CartoLight().opts(bgcolor=bgcolor, alpha=alpha)


def platecaree_to_mercator_vectorised(x, y):
    """Use Cartopy to convert PlateCarree coordinates to Mercator"""
    return ccrs.GOOGLE_MERCATOR.transform_points(ccrs.PlateCarree(), x, y)[:, :2]

In [23]:
def format_in_000(num):
    """Formats a number in thousands"""
    for unit in ["", "thousand", "million", "billion", "trillion"]:
        if abs(num) < 1000.0:
            return f"{num:3.2f} {unit}"
        num /= 1000.0
    return f"{num:.2f} quadrillion"

In [24]:
def split_datetime(df, column):
    """Splits a datetime column into year, month, and day columns"""
    # remove '_date' from the column name
    column_stem = column.replace("_date", "") if "_date" in column else column
    try:
        datetime_series = pd.to_datetime(df[column], errors="coerce")
        if datetime_series.isna().any():
            print(f"Errors occurred during conversion of column {column}.")
        df[column_stem + "_year"] = datetime_series.dt.year
        df[column_stem + "_month"] = datetime_series.dt.month
        df[column_stem + "_day"] = datetime_series.dt.day
    except KeyError:
        print(f"Column {column} not found in the DataFrame.")
    except Exception as e:
        print(f"An error occurred: {e}")

### Load data

Readme file with data dictionary

In [25]:
# get readme data
readme = urlopen(DATA_README_URL[0]).read().decode("windows-1252")
display(readme)

'FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017\r\n--------------------------------------------------------\r\nThis data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures \r\nlocatable through the FracFocus ‘Find a Well’ search.\r\n\r\n\r\nTable Name: RegistryUpload\r\n--------------------------\r\npKey - Key index for the table\r\n\r\nJobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.\r\n\r\nJobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.\r\n\r\nAPINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits \r\nrepresent the state, second three digits represent the county, third 5 digits represent the well.\r\n\r\nStateNumber - The first two digits of the API number.  Range is from 01-50.\r\n\r\nCountyNumber - The 

In [26]:
# print() interprets the escape characters (\r\n), so the readme is easier to read
print(readme)

FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017
--------------------------------------------------------
This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures 
locatable through the FracFocus ‘Find a Well’ search.


Table Name: RegistryUpload
--------------------------
pKey - Key index for the table

JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.

JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.

APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits 
represent the state, second three digits represent the county, third 5 digits represent the well.

StateNumber - The first two digits of the API number.  Range is from 01-50.

CountyNumber - The 3 digit county code.

OperatorName - The name of the opera

In [27]:
# Alternatively, split the readme into a list of stripped, non-empty lines for a more compact view
readme_as_list = readme.replace("\r", "").split("\n")
readme_as_list = [line.strip() for line in readme_as_list if line.strip()]
display(readme_as_list)

['FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017',
 '--------------------------------------------------------',
 'This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures',
 'locatable through the FracFocus ‘Find a Well’ search.',
 'Table Name: RegistryUpload',
 '--------------------------',
 'pKey - Key index for the table',
 'JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.',
 'JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.',
 'APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits',
 'represent the state, second three digits represent the county, third 5 digits represent the well.',
 'StateNumber - The first two digits of the API number.  Range is from 01-50.',
 'CountyNumber - The 3 digit county co
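
The readme describes the `APINumber` layout as `xx-xxx-xxxxx0000`: two digits for the state, three for the county, and five for the well. Here is a minimal sketch of how such a number could be split, assuming the value is a 14-digit string. The helper `split_api_number` is a hypothetical name introduced for illustration and is not part of FracFocus or of this notebook.

```python
def split_api_number(api_number: str) -> dict[str, str]:
    """Split a 14-digit API number into state, county, well, and trailing digits."""
    digits = api_number.replace("-", "")
    if len(digits) != 14 or not digits.isdigit():
        raise ValueError(f"expected 14 digits, got {api_number!r}")
    return {
        "state": digits[:2],     # first two digits: state
        "county": digits[2:5],   # next three digits: county
        "well": digits[5:10],    # next five digits: well
        "suffix": digits[10:],   # trailing four digits (the '0000' in the readme)
    }


# An API number taken from the RegistryUpload data
parts = split_api_number("42317372620000")
print(parts)
```

For this well the state and county parts agree with the `StateNumber` (42) and `CountyNumber` (317) columns of the same record.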

In [28]:
# Collect all the dataframes into a list and then concatenate them
df_list = read_csv_concurrent(DATA_URLS2)


dfs = pd.concat(df_list).reset_index(drop=True)

100%|██████████| 3/3 [00:23<00:00,  7.89s/it]


In [29]:
# Work on a copy so the concatenated frame stays untouched
registry_df = dfs.copy()
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   pKey                     213883 non-null  object 
 1   JobStartDate             213868 non-null  object 
 2   JobEndDate               213883 non-null  object 
 3   APINumber                213883 non-null  object 
 4   StateNumber              213883 non-null  int64  
 5   CountyNumber             213883 non-null  int64  
 6   OperatorName             213883 non-null  object 
 7   WellName                 213883 non-null  object 
 8   Latitude                 213883 non-null  float64
 9   Longitude                213883 non-null  float64
 10  Projection               213883 non-null  object 
 11  TVD                      183743 non-null  float64
 12  TotalBaseWaterVolume     183714 non-null  float64
 13  TotalBaseNonWaterVolume  163574 non-null  float64
 14  Stat

Most of the missing values are concentrated in `TVD`, `TotalBaseWaterVolume`, and `TotalBaseNonWaterVolume`. One possible reason appears in the data limitations listed in the terms of use on the FracFocus website, which state:
- Disclosures submitted using the FracFocus 1.0 format (January, 2011 to May 31, 2013) will contain only header data.
- Disclosures submitted using the FracFocus 2.0 format (November 2012 to present) will contain both header and chemical data. NOTE: Between November, 2012 and May 31, 2013 disclosures in both 1.0 and 2.0 formats were submitted to the system.
- After May 31, 2013 only disclosures submitted in the 2.0 format were accepted.
- Data submitted appears as it was submitted by the operator or operator’s authorized agent. FracFocus does not warrant the data in any way.
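
One way to probe this explanation is to compare the share of missing values across `FFVersion`. The sketch below uses a small made-up frame (`toy`, with hypothetical values) in place of `registry_df`, so the numbers it prints say nothing about the real data. The same expression could be applied to `registry_df` directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for registry_df; all values are made up.
toy = pd.DataFrame(
    {
        "FFVersion": [1, 1, 1, 2, 2, 3],
        "TVD": [np.nan, np.nan, 795.0, 9777.0, np.nan, 11184.0],
        "TotalBaseWaterVolume": [np.nan, np.nan, 16000.0, 7705941.0, 5000.0, 9638591.0],
    }
)

cols = ["TVD", "TotalBaseWaterVolume"]

# Percentage of missing values per column within each FFVersion group
missing_by_version = toy[cols].isna().groupby(toy["FFVersion"]).mean() * 100
print(missing_by_version.round(1))
```

If the FracFocus 1.0 format is the main driver, the rows for `FFVersion` 1 should show a much higher missing percentage than later versions.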

In [30]:
# Calculate the percentage of non-missing values in each column
missing_data_percent = (registry_df.notna().mean() * 100).rename("Percent")

# Create a Series of the counts of non-missing values
non_missing_count = registry_df.notna().sum().rename("Count")

# Concatenate the two Series along the columns into one DataFrame
non_missing_data = pd.concat([missing_data_percent, non_missing_count], axis=1)

# Create a horizontal bar plot of the percentage of non-missing data
hbar_plot = non_missing_data.hvplot.barh(
    y="Percent",
    width=800,
    height=600,
    title="Percentage of Non-Missing Data in Each Column",
    ylabel="",
    xlabel="",
    xaxis="bare",
    hover_cols="all",
).opts(
    active_tools=["box_zoom"],
    toolbar="above",
)

hbar_plot

In [31]:
# Look at some of the rows of the dataframe
display(registry_df.head(3))
display(registry_df.sample(5, random_state=628))
display(registry_df.tail(3))

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,Projection,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
0,448c1dab-c7fd-4e07-9d6f-e3b1cf64b708,5/1/1955 12:00:00 AM,5/1/1955 12:00:00 AM,42317372620000,42,317,Pioneer Natural Resources,Rogers 42 #5,32.283431,-101.906575,NAD27,,,,Texas,Martin,1,False,False,,
1,f66add2e-8ea8-4843-9388-24725b5d37c1,5/19/1982 12:00:00 AM,5/19/1982 12:00:00 AM,49009219470000,49,9,"Chesapeake Operating, Inc.",WILLIAM VALENTINE 1,42.97281,-105.95384,NAD27,,,,WYOMING,CONVERSE,1,False,False,,
2,95f0904c-2556-4912-9f5a-34913ba57625,2/7/1995 12:00:00 AM,2/7/1995 12:00:00 AM,49009228850000,49,9,"Chesapeake Operating, Inc.",LIZARD HEAD 1-8H RE,42.85147,-105.41151,NAD27,,,,WYOMING,CONVERSE,1,False,False,,


Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,Projection,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
192605,376bd0a4-8e43-4894-a7d6-8268ddeacd6f,1/29/2022 6:00:00 AM,2/7/2022 6:00:00 AM,42255372980000,42,255,Marathon Oil,Rodriguez-Trial Unit 503H,28.894024,-97.972811,NAD27,11184.0,9638591.0,0.0,Texas,Karnes,3,False,False,,
286,8081a495-1bbf-47a0-bafd-c8766c022654,1/4/2011 12:00:00 AM,1/4/2011 12:00:00 AM,42383369750000,42,383,Apache Corporation,Holley 1208 1,31.32203,-101.63464,NAD27,,,,Texas,Reagan,1,False,False,,
47360,2a887203-c601-4587-b8f6-72e133dea786,4/17/2013 12:00:00 AM,4/17/2013 12:00:00 AM,42331346250000,42,331,"Omni Oil & Gas, Inc.",M 686,30.763809,-97.002506,NAD83,795.0,16000.0,,Texas,Milam,1,False,False,,
86105,94ba4924-cd37-4492-be61-d9407f83e50a,8/22/2014 12:00:00 AM,8/27/2014 12:00:00 AM,42109326860000,42,109,Cimarex Energy Co.,Citation 32 Fee #3H,31.92331,-104.19476,NAD83,9777.0,7705941.0,0.0,Texas,Culberson,2,False,False,,
168687,5e40d630-9e7b-4e90-863f-ef815467e50a,9/16/2019 6:05:00 PM,9/26/2019 10:28:00 PM,33105052420000,33,105,Kraken Operating LLC,Blue 26-35 #1TFH,48.574161,-103.452184,NAD83,9848.7,8781819.0,0.0,North Dakota,Williams,3,False,False,,


Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,Projection,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213880,361bd982-58d6-437d-9592-08aeb80fd738,10/11/2023 7:21:00 AM,11/5/2023 6:07:00 PM,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,NAD83,10936.115961,23218520.0,0.0,Texas,Harrison,3,False,False,,
213881,f9fdc139-0f1e-4943-8a16-adb5152d862c,9/28/2023 9:43:00 PM,11/6/2023 7:29:00 AM,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,NAD83,11022.313802,40457386.0,0.0,Texas,Harrison,3,False,False,,
213882,2241ec7e-f113-4f8e-8b61-8a74c9e03dc2,4/1/3012 12:00:00 AM,4/1/3012 12:00:00 AM,42227368950000,42,227,"Meritage Energy Company, LLC",Patterson #2713,32.175028,-101.505275,NAD27,,,,Texas,Howard,1,False,False,,


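A first look at these sample rows raises two points immediately.
1. The dataset appears to be in chronological order, and the `JobStartDate`/`JobEndDate` values at both extremes may be incorrect (for example, 1955 in the first row and 3012 in the last).
2. `StateNumber` `42` may be heavily over-represented: 4 of the 5 rows drawn at random from the 200k+ rows have a `StateNumber` of `42`. If the wells were spread evenly across 50 states (4,000 of 200,000 wells each), the probability of one given state appearing exactly 4 times in 5 draws would be: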

$$ P(X=4) = \frac{{4,000 \choose 4} {200,000-4,000 \choose 5-4}}{{200,000 \choose 5}} $$

> The probability mass function of the hypergeometric distribution is given by:
>
> $$ P(X=k) = \frac{{K \choose k} {N-K \choose n-k}}{{N \choose n}} $$
>
> This can be rewritten using factorials as:
>
> $$ P(X=k) = \frac{(K!/(k!(K-k)!)) ((N-K)!/((n-k)!(N-K-n+k)!))}{(N!/(n!(N-n)!))} $$
>
> where:
> - `N` is the population size.
> - `K` is the number of successes in the population.
> - `n` is the number of draws.
> - `k` is the number of observed successes.
> - `!` denotes the factorial operation, which is the product of all positive integers up to that number.
>
> The formula for the hypergeometric distribution was provided by Microsoft Bing.


In [32]:
# Total number of oil wells (population size)
N = 200000

# Number of oil wells per state
# This assumes that each state (from 1 to 50 inclusive) has exactly 4000 wells in the 200,000 wells.
K = 4000

# Number of oil wells sampled (number of draws)
n = 5

# Number of times a specific state is drawn in the sample
k = 4

# Calculate the probability of drawing a well from a specific state 'k' times in 'n' total draws
p = hypergeom.pmf(k, N, K, n)

# N.B.: This calculation assumes that the wells are evenly distributed across the states
print(f"The probability is {p:.3e}.")

print(
    f"This is a 1 in {format_in_000(1/p)} chance, assuming an even distribution, that 4 out of the 5 wells drawn are from the same state."
)

The probability is 7.829e-07.
This is a 1 in 1.28 million chance, assuming an even distribution, that 4 out of the 5 wells drawn are from the same state.
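
As a cross-check, the same value can be reproduced without SciPy. The sketch below evaluates the hypergeometric formula directly with `math.comb` and compares it with a binomial approximation, under the same assumption of 4,000 wells per state. The names `p_exact`, `p_binomial`, and `p_any_state` are illustrative and not part of the notebook.

```python
from math import comb

N, K, n, k = 200_000, 4_000, 5, 4

# Exact hypergeometric probability for one specific state
p_exact = comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Binomial approximation: with N much larger than n, draws are nearly independent
p_binomial = comb(n, k) * (K / N) ** k * (1 - K / N) ** (n - k)

# Probability that any one of the 50 states appears exactly 4 times.
# The events are mutually exclusive (two states cannot both take 4 of 5 draws),
# so the per-state probabilities add.
p_any_state = 50 * p_exact

print(f"exact: {p_exact:.3e}, binomial: {p_binomial:.3e}, any state: {p_any_state:.3e}")
```

Note that the figure printed by the notebook is for one specific state. The chance that some state, whichever it is, fills 4 of the 5 draws is 50 times larger, which is still very small under the even-distribution assumption.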


### Data Cleaning

Before we jump into cleaning the data in the columns, let's make the columns look more pythonic by changing the column names to snake_case.
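
The `pascal_to_snake` helper used for this step is defined elsewhere in the notebook and is not shown in this section. As a rough idea of how such a conversion can work, here is a minimal sketch. The function name `to_snake_case_sketch` and its regular expression are assumptions for illustration and may differ from the notebook's own helper.

```python
import re

# Insert "_" at a lower-to-upper boundary (pKey -> p_Key) and before the last
# capital of an acronym that is followed by a lowercase letter
# (APINumber -> API_Number).
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def to_snake_case_sketch(name: str) -> str:
    """Hypothetical helper: convert PascalCase/camelCase to snake_case."""
    return _BOUNDARY.sub("_", name).lower()


for col in ["pKey", "JobStartDate", "APINumber", "TVD", "FFVersion", "DTMOD"]:
    print(col, "->", to_snake_case_sketch(col))
```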


In [33]:
registry_df.columns = [pascal_to_snake(col) for col in registry_df.columns]
registry_df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 21 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   p_key                        213883 non-null  object 
 1   job_start_date               213868 non-null  object 
 2   job_end_date                 213883 non-null  object 
 3   api_number                   213883 non-null  object 
 4   state_number                 213883 non-null  int64  
 5   county_number                213883 non-null  int64  
 6   operator_name                213883 non-null  object 
 7   well_name                    213883 non-null  object 
 8   latitude                     213883 non-null  float64
 9   longitude                    213883 non-null  float64
 10  projection                   213883 non-null  object 
 11  tvd                          183743 non-null  float64
 12  total_base_water_volume      183714 non-null  float64
 13 

Next, we can remove the columns that contain only null values: the last two columns in the dataframe, `source` and `dtmod`. We can also drop the `total_base_non_water_volume` column, since we are unlikely to need it.


In [34]:
registry_df = registry_df.drop(
    columns=["source", "dtmod", "total_base_non_water_volume"]
)
registry_df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   p_key                    213883 non-null  object 
 1   job_start_date           213868 non-null  object 
 2   job_end_date             213883 non-null  object 
 3   api_number               213883 non-null  object 
 4   state_number             213883 non-null  int64  
 5   county_number            213883 non-null  int64  
 6   operator_name            213883 non-null  object 
 7   well_name                213883 non-null  object 
 8   latitude                 213883 non-null  float64
 9   longitude                213883 non-null  float64
 10  projection               213883 non-null  object 
 11  tvd                      183743 non-null  float64
 12  total_base_water_volume  183714 non-null  float64
 13  state_name               213881 non-null  object 
 14  coun

Next, we will fix some of the column dtypes and names.
- Both the `job_start_date` and `job_end_date` columns are object dtypes, so we will parse them as dates and drop the time component.
- We can also split each date into its year, month, and day components. This may come in handy for feature engineering later on.
- The `projection` column is an object dtype. It can be converted to a string dtype and renamed to `crs`, since it holds the Coordinate Reference System (CRS) used by the `latitude` and `longitude` values. We can dig into what a CRS is later on.
- The `federal_well` and `indian_well` columns are both boolean columns. They are more aptly named `is_federal_well` and `is_indian_well`, respectively.
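
To illustrate the date handling described above, here is a minimal sketch on a toy Series. It assumes the raw values follow the `M/D/YYYY H:MM:SS AM/PM` layout seen in the sample rows. The `toy` Series and its invalid entry are made up for illustration, and this is not the notebook's `split_datetime` helper.

```python
import pandas as pd

toy = pd.Series(
    ["1/29/2022 6:00:00 AM", "8/22/2014 12:00:00 AM", "not a date", None],
    name="job_start_date",
)

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(toy, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Drop the time of day but keep a datetime dtype
dates = parsed.dt.normalize()

# Split into components; the nullable Int64 dtype keeps missing values
# without turning the years into floats
parts = pd.DataFrame(
    {
        "job_start_year": dates.dt.year.astype("Int64"),
        "job_start_month": dates.dt.month.astype("Int64"),
        "job_start_day": dates.dt.day.astype("Int64"),
    }
)
print(parts)
```

The nullable `Int64` dtype is one way to avoid `float64` year, month, and day columns when missing values are present.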

In [35]:
# Split 'job_start_date' and 'job_end_date' into year, month, and day columns
split_datetime(registry_df, "job_start_date")
split_datetime(registry_df, "job_end_date")
registry_df[[col for col in registry_df.columns if re.search("start|end", col)]].info(
    memory_usage="deep"
)
# Show the rows that still have a null value in any of the date columns
registry_df[
    registry_df[[col for col in registry_df.columns if re.search("start|end", col)]]
    .isna()
    .any(axis=1)
]

Errors occurred during conversion of column job_start_date.
Errors occurred during conversion of column job_end_date.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   job_start_date   213868 non-null  object 
 1   job_end_date     213883 non-null  object 
 2   job_start_year   213866 non-null  float64
 3   job_start_month  213866 non-null  float64
 4   job_start_day    213866 non-null  float64
 5   job_end_year     213882 non-null  float64
 6   job_end_month    213882 non-null  float64
 7   job_end_day      213882 non-null  float64
dtypes: float64(6), object(2)
memory usage: 41.4 MB


Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,projection,tvd,total_base_water_volume,state_name,county_name,ff_version,federal_well,indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day
106821,d88e17b4-069d-488d-83bd-b61743c8956a,,7/17/2015 5:00:00 AM,42235359160000,42,235,"Foreland Operating, LLC",Pea Eye Parker 7H,31.133399,-101.09679,NAD27,6500.0,14605167.0,Texas,Irion,3,False,False,,,,2015.0,7.0,17.0
116103,68f9baae-9970-4edc-b926-9ddc2abfac04,,3/31/2016 5:00:00 AM,30015432740000,30,15,Devon Energy Production Company L. P.,CDU 242H,32.144287,-103.730842,NAD27,10409.0,3875748.0,New Mexico,Eddy,3,False,False,,,,2016.0,3.0,31.0
120313,a2babd3d-48de-47e4-96d1-9eb7c0aadbf7,,9/7/2016 6:00:00 AM,49037294230000,49,37,Anadarko Petroleum Corporation,BLACK WATCH / 1696-3-41H,41.383397,-108.186969,NAD27,9171.0,5420161.0,Wyoming,Sweetwater,3,False,False,,,,2016.0,9.0,7.0
121488,77c6253c-8950-4764-919d-3d8e65d6daca,,10/17/2016 3:40:00 PM,35073253510000,35,73,"Chesapeake Operating, Inc.",HILL 2-18-6 1H,36.058634,-97.812685,NAD27,6343.1,2395680.0,Oklahoma,Kingfisher,3,False,False,,,,2016.0,10.0,17.0
122026,3ac92e3a-90f5-467c-912c-fffc12c8832d,,11/3/2016 6:00:00 AM,33053068810000,33,53,Whiting Petroleum,KOALA 44-5-2H,47.921674,-103.369945,NAD27,11248.0,8130123.0,North Dakota,McKenzie,3,False,False,,,,2016.0,11.0,3.0
124195,57e661a8-47ba-452c-826c-747e732235c6,,1/18/2017 7:00:00 AM,42255350040000,42,255,Encana Oil & Gas (USA) Inc.,Hons 19H,28.94259,-97.98925,NAD27,10854.0,3172451.0,Texas,Karnes,3,False,False,,,,2017.0,1.0,18.0
124266,eb2690aa-9a75-47ab-8766-76f5496b8b19,,1/20/2017 7:00:00 AM,42311366530000,42,311,Sundance Energy,Woodward EFS 4HB,28.45577,-98.37957,NAD27,12108.0,12653303.0,Texas,McMullen,3,False,False,,,,2017.0,1.0,20.0
124457,6fd58568-d142-4ca0-8530-c5f20644138b,,1/26/2017 7:00:00 AM,49035300390000,49,35,Jonah Energy LLC,SHB 10-08,42.496681,-109.741876,NAD27,13186.0,1333967.0,Wyoming,Sublette,3,True,False,,,,2017.0,1.0,26.0
136363,972d5032-fcde-4999-8527-90afe4ca9497,,11/18/2017 3:03:00 PM,35073257850000,35,73,"Chesapeake Operating, Inc.",LIBERTY 28-18-6 2H,36.014118,-97.84306,NAD27,6501.4,1972362.0,Oklahoma,Kingfisher,3,False,False,,,,2017.0,11.0,18.0
141057,27619da3-2e88-4297-9261-c80007059844,,3/7/2018 12:00:00 AM,42317406660000,42,317,Endeavor Energy Resources,DICKENSON 18-7ESL 2ub,32.200971,-101.989404,NAD27,11000.0,18981941.0,Texas,Martin,3,False,False,,,,2018.0,3.0,7.0


In [36]:
# Parse 'job_start_date' and store it as a 'YYYY-MM-DD' string (values that cannot be parsed become NaN)
registry_df["job_start_date"] = pd.to_datetime(
    registry_df["job_start_date"], errors="coerce"
).dt.strftime("%Y-%m-%d")

# Parse 'job_end_date' and store it as a 'YYYY-MM-DD' string (values that cannot be parsed become NaN)
registry_df["job_end_date"] = pd.to_datetime(
    registry_df["job_end_date"], errors="coerce"
).dt.strftime("%Y-%m-%d")

# drop rows with null values in 'job_start_date' and 'job_end_date'
# registry_df = registry_df.dropna(subset=["job_start_date", "job_end_date"])

# create columns for year, month, and day
# registry_df["start_year"] = pd.to_datetime(registry_df["job_start_date"]).dt.year
# registry_df["start_month"] = pd.to_datetime(registry_df["job_start_date"]).dt.month
# registry_df["start_day"] = pd.to_datetime(registry_df["job_start_date"]).dt.day
# registry_df["end_year"] = pd.to_datetime(registry_df["job_end_date"]).dt.year
# registry_df["end_month"] = pd.to_datetime(registry_df["job_end_date"]).dt.month
# registry_df["end_day"] = pd.to_datetime(registry_df["job_end_date"]).dt.day


# Rename some columns for clarity
registry_df.rename(
    columns={
        "federal_well": "is_federal_well",
        "indian_well": "is_indian_well",
        "projection": "crs",
    },
    inplace=True,
)

# Display the information of the DataFrame
registry_df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 24 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   p_key                    213883 non-null  object 
 1   job_start_date           213866 non-null  object 
 2   job_end_date             213882 non-null  object 
 3   api_number               213883 non-null  object 
 4   state_number             213883 non-null  int64  
 5   county_number            213883 non-null  int64  
 6   operator_name            213883 non-null  object 
 7   well_name                213883 non-null  object 
 8   latitude                 213883 non-null  float64
 9   longitude                213883 non-null  float64
 10  crs                      213883 non-null  object 
 11  tvd                      183743 non-null  float64
 12  total_base_water_volume  183714 non-null  float64
 13  state_name               213881 non-null  object 
 14  coun

Next, we will look at the `api_number` column.
We learned from the README that
> APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: 
> - First two digits represent the state, 
> - second three digits represent the county, 
> - third 5 digits represent the well.

In theory, we could take the first two characters of the `APINumber` and use them as the state number, following the definition above. In practice, that would not be reliable, and here is why.<br>
Although the column is called `APINumber`, it is an identifier rather than a number, so a leading `0` is part of the value and cannot be omitted. Let's look at some of the rows with single-digit state numbers.

Right now,
- the `api_number` column is an `object` dtype, but a `string` dtype would be a better option, since an `object` column can hold mixed types. We can also shorten the column name to `api`.
- the `state_number` and `county_number` columns are both `int64` dtypes. A `string` dtype may be a better fit, because it preserves leading zeros.
- `state_code` and `county_code` may be better names for the `state_number` and `county_number` columns, respectively.
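
The layout quoted from the README can be sketched as a small parser. The function name `split_api` is hypothetical, and the example value is the API number of one of the Colorado rows in this dataset with its leading zero restored.

```python
def split_api(api: str) -> dict:
    """Hypothetical helper: split a 14-character API number into its parts."""
    if len(api) != 14 or not api.isdigit():
        raise ValueError(f"expected 14 digits, got {api!r}")
    return {
        "state_code": api[0:2],    # first two digits
        "county_code": api[2:5],   # next three digits
        "well_code": api[5:10],    # next five digits
        "suffix": api[10:14],      # trailing four digits
    }


# Stored as an integer, the leading zero is lost and every slice shifts by one
as_int = 5123373360000
print(str(as_int)[0:2])  # "51", which is not the state code

# Restoring the zero first gives the intended pieces
print(split_api(str(as_int).zfill(14)))
```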


In [37]:
# Sample rows where the state_number is a single digit (here 3 or 5)
registry_df[registry_df["state_number"].isin([3, 5])].sample(5, random_state=628)

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day
105811,da3d8794-3642-4f4e-989b-835deacf2085,2015-06-24,2015-06-25,5123373360000,5,123,PDC Energy,Alm 33U-234,40.53547,-104.66002,NAD83,7026.0,3245896.0,Colorado,Weld,2,False,False,2015.0,6.0,24.0,2015.0,6.0,25.0
18508,18244515-8e6b-4e97-9ae6-4b9046ccfe3c,2012-03-05,2012-03-05,5045145670000,5,45,Encana Oil & Gas (USA) Inc.,N Parachute WF10A-25 I25A 596,39.583918,-108.110775,NAD83,,,Colorado,Garfield,1,False,False,2012.0,3.0,5.0,2012.0,3.0,5.0
188852,1fd5668c-4bf4-43a5-b12d-165028c20758,2021-10-01,2021-10-08,5123449100000,5,123,"Bonanza Creek Energy, Inc.",LATHAM 31-34-14HNC,40.31893,-104.39749,NAD83,6396.0,8927539.0,Colorado,Weld,3,False,False,2021.0,10.0,1.0,2021.0,10.0,8.0
209958,92cda205-c905-48c5-830d-2b77cb61320e,2023-05-12,2023-05-28,5039066850000,5,39,GMT EXPLORATION,Vulcan 6-64 10-8 3HN,39.545648,-104.542233,NAD83,8014.566351,13778575.0,Colorado,Elbert,3,False,False,2023.0,5.0,12.0,2023.0,5.0,28.0
6671,14783fbc-bebb-46ed-956e-577e37940096,2011-07-22,2011-07-22,5123334900000,5,123,Anadarko Petroleum Corporation,REI 39-9,40.241188,-104.66153,NAD83,,,CO,WELD,1,False,False,2011.0,7.0,22.0,2011.0,7.0,22.0


Some `api_number` values have the leading `0`, which is correct, but others do not. The values without the leading `0` are 13 characters long instead of 14. We may be able to simply left-pad with `0` until every API number is 14 characters long.

In [38]:
# Check the number of characters in the api_number column
registry_df["api_number"].astype("string").str.len().value_counts()

api_number
14    190830
13     23051
10         1
12         1
Name: count, dtype: Int64

Most are 14 characters long, but some are 13 characters long, like the ones we saw above without the leading `0`. Let's assume the ones with 13 characters are missing the leading `0` and not something else.

In [39]:
# Convert 'api_number' to string, left-pad it with zeros to 14 characters, and store it as 'api'
registry_df["api"] = registry_df["api_number"].astype("string").str.zfill(14)

# Convert 'state_number' to string and pad it with zeros to make it 2 characters long
registry_df["state_code"] = registry_df["state_number"].astype("string").str.zfill(2)

# Convert 'county_number' to string and pad it with zeros to make it 3 characters long
registry_df["county_code"] = registry_df["county_number"].astype("string").str.zfill(3)

In [40]:
# check which rows may have the api with the first two digits not matching the state number
api_state_mismatch_mask = registry_df["state_code"] != registry_df["api"].str[0:2]
# api_state_mismatch_mask

In [41]:
# check which rows may have the api with the first two digits not matching the state number
registry_df[api_state_mismatch_mask][
    ["api_number", "api", "state_code", "state_name", "county_code", "county_name"]
]

Unnamed: 0,api_number,api,state_code,state_name,county_code,county_name
50509,4226932868,4226932868,42,Texas,269,King
197331,423714037500,423714037500,42,Texas,371,Pecos


We expected two rows here: when we checked the length of the `api_number` values above, one row had 10 characters and another had 12. Since only two rows are affected, this should be an easy fix.

In [42]:
# On the mismatched rows, strip the zeros added by zfill and right-pad with zeros to 14 characters
registry_df.loc[api_state_mismatch_mask, "api"] = (
    registry_df.loc[api_state_mismatch_mask, "api"].str.lstrip("0").str.ljust(14, "0")
)
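
Why the two short values need different treatment can be seen on plain strings. This sketch uses the two API numbers from the mismatch table above; the variable names are illustrative.

```python
short_apis = ["4226932868", "423714037500"]  # 10 and 12 characters

# What the blanket zfill produced: zeros added on the left
left_padded = [raw.zfill(14) for raw in short_apis]

# Strip those zeros again, then pad on the right instead
repaired = [value.lstrip("0").ljust(14, "0") for value in left_padded]

for raw, bad, good in zip(short_apis, left_padded, repaired):
    print(f"{raw} -> {bad} -> {good}")
```

Here the missing characters are assumed to be trailing zeros of the `xx-xxx-xxxxx0000` layout rather than leading digits. Note that `lstrip("0")` would also remove a genuine leading zero, so this repair is only safe on rows whose state code does not begin with `0`, as is the case for these two Texas rows.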

In [43]:
# re-check the previously mismatched rows after the fix
registry_df[api_state_mismatch_mask][
    ["api_number", "api", "state_code", "state_name", "county_code", "county_name"]
]

Unnamed: 0,api_number,api,state_code,state_name,county_code,county_name
50509,4226932868,42269328680000,42,Texas,269,King
197331,423714037500,42371403750000,42,Texas,371,Pecos


In [44]:
# check which rows have an api whose 3rd to 5th digits do not match the county code
api_county_mismatch_mask = registry_df["county_code"] != registry_df["api"].str[2:5]
registry_df[api_county_mismatch_mask]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code


The `state_name` column should not have more than 50 distinct values, given that there are only 50 states in the US. Checking the number of unique values in the `state_name` column, however, gives 95. This is due to variation in how the `state_name` value was entered. Although it is less obvious, we can assume the same for the `county_name` column. Luckily, the `api` includes both the `state_number` and the `county_number`. With this we can:
1. Validate the data by ensuring that these corresponding columns match.
2. Ensure that the `state_name` and `county_name` columns are correct.

It is important to note that:
> The state codes used in an API number are DIFFERENT from another standard which is the Federal Information Processing Standard (FIPS) state code established in 1987 by NIST. ([source](https://en.wikipedia.org/wiki/API_well_number#State_code))

In [45]:
print(
    f'Number of different values in state_name column: {registry_df["state_name"].nunique()}'
)
print(
    f'Number of different values in state_number column: {registry_df["state_number"].nunique()}'
)

Number of different values in state_name column: 95
Number of different values in state_number column: 28


In [46]:
# group by state_code and find the mode of the state_name
state_code_mode = (
    registry_df.groupby("state_code")["state_name"]
    .apply(lambda x: x.mode().iloc[0])
    .reset_index()
)
state_code_mode

Unnamed: 0,state_code,state_name
0,1,Alabama
1,3,Arkansas
2,4,California
3,5,Colorado
4,11,Idaho
5,12,Illinois
6,13,Indiana
7,15,Kansas
8,16,Kentucky
9,17,Louisiana
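
To see how taking the mode collapses spelling variants, here is a toy version of this step. The `toy` frame is made up for illustration, using name variants like those seen in the sample rows (`Colorado`/`CO`, `Wyoming`/`WYOMING`).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "state_code": ["05", "05", "05", "49", "49", "49"],
        "state_name": ["Colorado", "CO", "Colorado", "Wyoming", "WYOMING", "Wyoming"],
    }
)

# Most frequent spelling per code; mode() returns ties in sorted order,
# so .iloc[0] picks the first of them
canonical = (
    toy.groupby("state_code")["state_name"]
    .apply(lambda names: names.mode().iloc[0])
    .rename("state")
    .reset_index()
)

# Attach the canonical name to every row
toy = toy.merge(canonical, on="state_code", how="left", validate="many_to_one")
print(toy)
```

The `validate` argument is optional. It makes the merge fail loudly if a code ever maps to more than one canonical name.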


In [47]:
# Attach the most common spelling of each state's name as a new `state` column
registry_df = registry_df.merge(
    state_code_mode.rename(columns={"state_name": "state"}), on="state_code"
)
registry_df.sample(3, random_state=628)

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state
192605,d45ffa9c-25cc-42a3-92ab-09fcc74d032f,2018-11-13,2018-11-18,35083244260000,35,83,"chisholm Oil and Gas Operating, LLC",King Ranch #18-4-32 1H,36.000705,-97.653121,NAD27,6569.0,12481014.0,Oklahoma,Logan,3,False,False,2018.0,11.0,13.0,2018.0,11.0,18.0,35083244260000,35,83,Oklahoma
286,e3a3d6d1-d409-4e7c-ad57-703cbff32eef,2011-02-14,2011-02-14,42097341930000,42,97,"EOG Resources, Inc.",Herbert Unit #3H,33.536889,-97.474433,NAD27,,,Texas,Cooke,1,False,False,2011.0,2.0,14.0,2011.0,2.0,14.0,42097341930000,42,97,Texas
47360,702c7c38-053f-4b28-930d-b75fd0d1623d,2014-12-07,2015-01-07,42255337360000,42,255,Encana Oil & Gas (USA) Inc.,Charger 9H,29.037556,-97.934511,NAD27,10395.0,14672682.0,Texas,Karnes,2,False,False,2014.0,12.0,7.0,2015.0,1.0,7.0,42255337360000,42,255,Texas


We will focus our efforts on roughly the most recent 10 years (2013 onwards). Although more data is usually better, data from too far in the past may mislead whatever model we build, since unconventional drilling practices have since taken over the industry. We will also focus on one specific area, the Permian Basin. The Permian Basin has been instrumental in the shale boom and is currently the most active area of exploration and production in the US.

In [48]:
# create a mask for jobs starting in 2013 or later ('YYYY-MM-DD' strings compare chronologically)
post_2012_mask = registry_df["job_start_date"] >= "2013-01-01"
registry_df_post_2012 = registry_df[post_2012_mask].copy()

# find all the rows with null values
null_mask = registry_df_post_2012.isna().any(axis=1)
registry_df_post_2012[null_mask]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state
61454,7534cb0b-9f85-4ca2-a21c-31be328456f8,2017-04-10,2017-04-10,42173374020000,42,173,"Cinnabar Energy, LTD.",Thomas 4101HD,31.996472,-101.568598,NAD27,9310.0,,Texas,Glasscock,3,False,False,2017.0,4.0,10.0,2017.0,4.0,10.0,42173374020000,42,173,Texas
61965,c57e59c6-d132-4a13-bb0b-8fb3e58ce918,2017-05-08,2017-05-08,42077352710000,42,77,Lane Operating Company,Dillard A Unit No. 1,33.591273,-98.138934,NAD27,4600.0,,Texas,Clay,3,False,False,2017.0,5.0,8.0,2017.0,5.0,8.0,42077352710000,42,77,Texas
62626,a2ad142e-6a38-44ed-8ebc-8572e43236b7,2017-06-13,2017-06-13,42237401310000,42,237,"Blakenergy Operating, LLC",Garner #2,33.434316,-98.227896,NAD27,6350.0,,Texas,Jack,3,False,False,2017.0,6.0,13.0,2017.0,6.0,13.0,42237401310000,42,237,Texas
84840,fcbf3967-867e-4e4e-af18-d83e7a82c604,2020-04-19,2020-04-19,42461378800000,42,461,COG Operating LLC,Powell 36 7,31.523085,-102.118329,NAD27,,,Texas,Upton,1,False,False,2020.0,4.0,19.0,2020.0,4.0,19.0,42461378800000,42,461,Texas
89052,ead37751-a53f-4ed9-99df-8cac1ebf1e15,2021-05-23,2021-05-23,42173346750000,42,173,Berry Petroleum,Talon #4,31.950785,-101.775302,NAD83,,,Texas,Glasscock,1,False,False,2021.0,5.0,23.0,2021.0,5.0,23.0,42173346750000,42,173,Texas
89305,4753a32e-39cb-4994-b69b-8acb36597463,2021-06-08,2021-06-08,42115334560000,42,115,Pioneer Natural Resources,Echols 10 #1,32.531683,-102.096109,NAD27,,,Texas,Dawson,1,False,False,2021.0,6.0,8.0,2021.0,6.0,8.0,42115334560000,42,115,Texas
98083,b76c7cc4-f1c1-4711-9168-ba819f41d070,2022-10-16,2022-10-27,42479446890000,42,479,Lewis Energy Group,HAMILTON NO. 34H,27.95851,-99.567909,WGS84,10456.0,,Texas,Webb,3,False,False,2022.0,10.0,16.0,2022.0,10.0,27.0,42479446890000,42,479,Texas
98087,99463560-ccb5-4c07-9537-8abdc731208b,2022-10-16,2022-10-27,42479446880000,42,479,Lewis Energy Group,HAMILTON NO. 33H,27.95851,-99.567956,WGS84,10437.0,,Texas,Webb,3,False,False,2022.0,10.0,16.0,2022.0,10.0,27.0,42479446880000,42,479,Texas
139253,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-05-28,2022-06-08,33610338000000,33,610,Hunt Oil Company,TRULSON 156-90-11-14H-3,48.355506,-102.212124,NAD83,8872.4,13496802.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,8.0,33610338000000,33,610,North Dakota
139256,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-05-28,2022-06-09,33610338100000,33,610,Hunt Oil Company,PALERMO 156-90-2-31H-5,48.355506,-102.21233,NAD83,8782.65,14820967.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,9.0,33610338100000,33,610,North Dakota


Let's see how many NaNs remain in each column.

In [49]:
registry_df_post_2012.info(memory_usage="deep")
registry_df_post_2012.isna().sum()

<class 'pandas.core.frame.DataFrame'>
Index: 174264 entries, 18356 to 213882
Data columns (total 28 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   p_key                    174264 non-null  object 
 1   job_start_date           174264 non-null  object 
 2   job_end_date             174264 non-null  object 
 3   api_number               174264 non-null  object 
 4   state_number             174264 non-null  int64  
 5   county_number            174264 non-null  int64  
 6   operator_name            174264 non-null  object 
 7   well_name                174264 non-null  object 
 8   latitude                 174264 non-null  float64
 9   longitude                174264 non-null  float64
 10  crs                      174264 non-null  object 
 11  tvd                      174261 non-null  float64
 12  total_base_water_volume  174254 non-null  float64
 13  state_name               174264 non-null  object 
 14  count

p_key                       0
job_start_date              0
job_end_date                0
api_number                  0
state_number                0
county_number               0
operator_name               0
well_name                   0
latitude                    0
longitude                   0
crs                         0
tvd                         3
total_base_water_volume    10
state_name                  0
county_name                 4
ff_version                  0
is_federal_well             0
is_indian_well              0
job_start_year              0
job_start_month             0
job_start_day               0
job_end_year                0
job_end_month               0
job_end_day                 0
api                         0
state_code                  0
county_code                 0
state                       0
dtype: int64

Looking at the `county_number` for the rows with a NaN in the `county_name` column, we can see why the `county_name` is missing. Those numbers are most likely incorrect, as states like `North Dakota` and `Arkansas` do not have `county_number` values that large. However, we can still try to impute the correct values by cross-referencing other sources or by using the `latitude` and `longitude` values.


In [50]:
# the rows with a null value for the county_name column
registry_df_post_2012[registry_df_post_2012["county_name"].isna()]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state
139253,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-05-28,2022-06-08,33610338000000,33,610,Hunt Oil Company,TRULSON 156-90-11-14H-3,48.355506,-102.212124,NAD83,8872.4,13496802.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,8.0,33610338000000,33,610,North Dakota
139256,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-05-28,2022-06-09,33610338100000,33,610,Hunt Oil Company,PALERMO 156-90-2-31H-5,48.355506,-102.21233,NAD83,8782.65,14820967.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,9.0,33610338100000,33,610,North Dakota
178445,692d9381-748e-4e5f-b83f-30f868f18882,2019-11-19,2019-11-19,3729439000000,3,729,WFD Oil Corporation,Vanorsdale,0.123455,-0.12345,NAD27,2442.0,22134.0,Arkansas,,3,False,False,2019.0,11.0,19.0,2019.0,11.0,19.0,3729439000000,3,729,Arkansas
206589,87ea1a50-ab40-4956-8051-d39bf139ae53,2020-11-11,2020-12-01,43317428660000,43,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #133,32.409144,-101.808518,NAD83,8262.0,17519292.0,Utah,,3,False,False,2020.0,11.0,11.0,2020.0,12.0,1.0,43317428660000,43,317,Utah


In [51]:
# get the index of one of the rows with a null county_name (the 3rd one down)
# the zero-padded identifier lives in the `api` column
index_vanorsdale = registry_df_post_2012.query("api == '03729439000000'").index

#### Oklahoma Corporation Commission (OCC)

With some search-engine investigating, we can learn that WFD Oil Corporation is a PC in Oklahoma. We also learn that the well name is `VANORSDOL`, the well number is `#1-29`, and the API number is `3503729439` `0000`. With this we can correct some of the data that was entered incorrectly in FracFocus.

The data we are looking for is shown in the table below, so we could enter it manually, but we will do it through code instead. We will query the API number in the data we have on the wells in Oklahoma.

| Column Name | Value |
| --- | --- |
| API | 3503729439 |
| WELL_NAME | VANORSDOL |
| WELL_NUM | #1-29 |
| OPERATOR | WFD OIL CORPORATION |
| WELLSTATUS | AC |
| WELLTYPE | OIL |
| SH_LAT | 35.749381 |
| SH_LON | -96.370355 |
| COUNTY | CREEK |


In [52]:
# Read the OCC wells Parquet file
occ_wells = pd.read_parquet(OCC_PARQUET_URL)
# Convert the WKT column back to a geometry column
occ_wells["geometry"] = occ_wells["geometry"].apply(wkt.loads)


# Convert the DataFrame to a GeoDataFrame, specifying the CRS
occ_wells = gpd.GeoDataFrame(
    occ_wells, geometry="geometry", crs=occ_wells["crs"].iloc[0]
)
occ_wells.info()
# look at 1 sample row of the dataframe
occ_wells.sample()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 445575 entries, 0 to 445574
Data columns (total 27 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   objectid           445575 non-null  int64   
 1   api                445575 non-null  float64 
 2   well_browse_link   445575 non-null  object  
 3   well_records_docs  445575 non-null  object  
 4   well_name          445568 non-null  object  
 5   well_num           445570 non-null  object  
 6   operator           445575 non-null  object  
 7   wellstatus         443813 non-null  object  
 8   welltype           365021 non-null  object  
 9   symbol_class       445575 non-null  object  
 10  sh_lat             437073 non-null  float64 
 11  sh_lon             437073 non-null  float64 
 12  county             445575 non-null  object  
 13  section            445454 non-null  float64 
 14  township           445450 non-null  object  
 15  range              445575 

Unnamed: 0,objectid,api,well_browse_link,well_records_docs,well_name,well_num,operator,wellstatus,welltype,symbol_class,sh_lat,sh_lon,county,section,township,range,qtr4,qtr3,qtr2,qtr1,pm,footage_ew,ew,footage_ns,ns,geometry,crs
102057,102058,3504320000.0,http://wellbrowse.occ.ok.gov/Webforms/WellInfo...,https://public.occ.ok.gov/OGCDWebLink/Search.a...,STEERS,#1-15,KAISER-FRANCIS OIL COMPANY,PA,OIL,PLUGGED,35.948053,-98.99353,DEWEY,15.0,17N,17W,S2,N2,NE,SE,IM,660.0,E,2130.0,S,POINT (-98.99353 35.94805),EPSG:4326


In [53]:
# Convert the 'api' column to string
occ_wells["api"] = occ_wells["api"].astype("int64").astype("string")
occ_wells.sample()

Unnamed: 0,objectid,api,well_browse_link,well_records_docs,well_name,well_num,operator,wellstatus,welltype,symbol_class,sh_lat,sh_lon,county,section,township,range,qtr4,qtr3,qtr2,qtr1,pm,footage_ew,ew,footage_ns,ns,geometry,crs
3059,3060,3500323012,http://wellbrowse.occ.ok.gov/Webforms/WellInfo...,https://public.occ.ok.gov/OGCDWebLink/Search.a...,BURLESON 2611,#2-19H,SANDRIDGE EXPLORATION & PRODUCTION LLC,AC,OIL,OIL,36.724125,-98.418716,ALFALFA,19.0,26N,11W,NW,NW,NE,NE,IM,1265.0,E,225.0,N,POINT (-98.41872 36.72413),EPSG:4326


In [54]:
# make a copy of the well_name column
registry_df_post_2012["well"] = registry_df_post_2012["well_name"].copy()

In [55]:
# query the well_name column for 'vanors' and confirm the match with the API number
# occ_wells[occ_wells["well_name"].str.contains("vanors", case=False, na=False)]
vanorsdol_row = occ_wells.query(
    'well_name.fillna("").str.contains("vanors", case=False) & (api == "3503729439")',
    engine="python",
)[
    [
        "api",
        "well_name",
        "well_num",
        "operator",
        "sh_lat",
        "sh_lon",
        "county",
    ]
].rename(
    columns={"sh_lat": "latitude", "sh_lon": "longitude"}
)
vanorsdol_row["well"] = (
    vanorsdol_row["well_name"].str.title()
    + " "
    + vanorsdol_row["well_num"].astype("string")
)
index_vanorsdol = vanorsdol_row.index

columns_to_replace = ["api", "well", "latitude", "longitude"]
for col in columns_to_replace:
    registry_df_post_2012.loc[index_vanorsdale, col] = vanorsdol_row.loc[
        index_vanorsdol, col
    ].values

# check that the values have been replaced
registry_df_post_2012.loc[index_vanorsdale, columns_to_replace]

Unnamed: 0,api,well,latitude,longitude
178445,3503729439,Vanorsdol #1-29,35.749381,-96.370355


In [56]:
# adjust the api column to 14 characters again
registry_df_post_2012["api"] = registry_df_post_2012["api"].str.ljust(14, "0")

registry_df_post_2012[registry_df_post_2012["county_name"].isna()]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state,well
139253,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-05-28,2022-06-08,33610338000000,33,610,Hunt Oil Company,TRULSON 156-90-11-14H-3,48.355506,-102.212124,NAD83,8872.4,13496802.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,8.0,33610338000000,33,610,North Dakota,TRULSON 156-90-11-14H-3
139256,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-05-28,2022-06-09,33610338100000,33,610,Hunt Oil Company,PALERMO 156-90-2-31H-5,48.355506,-102.21233,NAD83,8782.65,14820967.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,9.0,33610338100000,33,610,North Dakota,PALERMO 156-90-2-31H-5
178445,692d9381-748e-4e5f-b83f-30f868f18882,2019-11-19,2019-11-19,3729439000000,3,729,WFD Oil Corporation,Vanorsdale,35.749381,-96.370355,NAD27,2442.0,22134.0,Arkansas,,3,False,False,2019.0,11.0,19.0,2019.0,11.0,19.0,35037294390000,3,729,Arkansas,Vanorsdol #1-29
206589,87ea1a50-ab40-4956-8051-d39bf139ae53,2020-11-11,2020-12-01,43317428660000,43,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #133,32.409144,-101.808518,NAD83,8262.0,17519292.0,Utah,,3,False,False,2020.0,11.0,11.0,2020.0,12.0,1.0,43317428660000,43,317,Utah,Rhea 1-6 Unit 1 #133


In [57]:
# columns_for_occ = [
#     "api",
#     "well_name",
#     "well_num",
#     "operator",
#     "wellstatus",
#     "welltype",
#     "symbol_class",
#     "sh_lat",
#     "sh_lon",
#     "county",
#     "geometry",
# ]

# occ_wells[columns_for_occ].info(memory_usage="deep")

In [58]:
# occ_wells = occ_wells[columns_for_occ].copy()
# occ_wells["api"] = occ_wells["api"].astype("int64").astype("string")
# occ_wells["api"] = occ_wells["api"].str.ljust(14, "0")
# occ_wells["wellstatus"] = occ_wells["wellstatus"].astype("category")
# occ_wells["welltype"] = occ_wells["welltype"].astype("category")
# occ_wells["symbol_class"] = occ_wells["symbol_class"].astype("category")
# for col in occ_wells.columns:
#     if occ_wells[col].dtype.name == "object":
#         occ_wells[col] = occ_wells[col].astype("string")
# occ_wells.info(memory_usage="deep")
# occ_wells.sample(3)

Now, with `latitude` and `longitude` coordinates for all 4 rows with missing `county_name`, let's find out which counties they belong to spatially.



In [59]:
# Arkansas well with null county_name and incorrect lat/long values
# display(registry_df_post_2012[registry_df_post_2012["api"] == "03729439000000"].T)

In [60]:
# # drop row with api 03729439000000 with incorrect lat/long values and null county_name
# index_to_drop = registry_df_post_2012[
#     registry_df_post_2012["api"] == "03729439000000"
# ].index
# # drop the row
# registry_df_post_2012.drop(index_to_drop, inplace=True)

### Geodataframe

We can get the boundary coordinates for all the counties in the US from the [census.gov](https://www.census.gov/) website. We saved the URL for this as `CENSUS_COUNTY_MAP_URL` at the top of the notebook.

In [145]:
# read in the US counties map data using geopandas
county = gpd.read_file(CENSUS_COUNTY_MAP_URL)[
    ["GEOID", "STATEFP", "COUNTYFP", "NAME", "geometry"]
]
county.columns = county.columns.str.lower()
county.sample(3)

Unnamed: 0,geoid,statefp,countyfp,name,geometry
761,51137,51,137,Orange,"POLYGON ((-77.77209 38.28220, -77.77196 38.282..."
954,28135,28,135,Tallahatchie,"POLYGON ((-90.04210 33.81008, -90.04477 33.810..."
2509,26097,26,97,Mackinac,"POLYGON ((-84.11429 45.97824, -84.11445 45.973..."


We will scrape the FIPS table from Wikipedia, since the county dataframe does not include the state name, and merge the two tables for convenience.

In [146]:
fips_df = pd.read_html(FIPS_WIKI_URL)[1]
fips_df.columns = ["geoid", "county", "state"]
fips_df["geoid"] = fips_df["geoid"].astype("string").str.zfill(5)
fips_df.sample(3)

Unnamed: 0,geoid,county,state
2647,48071,Chambers County,Texas
1445,28075,Lauderdale County,Mississippi
40,1081,Lee County,Alabama


In [147]:
county_fips_gdf = county.merge(fips_df, on="geoid")
county_fips_gdf.sample(5, random_state=628)

Unnamed: 0,geoid,statefp,countyfp,name,geometry,county,state
824,31045,31,45,Dawes,"POLYGON ((-102.77315 42.52564, -102.77298 42.5...",Dawes County,Nebraska
3058,51149,51,149,Prince George,"POLYGON ((-77.13939 37.12645, -77.14097 37.125...",Prince George County,Virginia
2315,22015,22,15,Bossier,"POLYGON ((-93.84522 32.95043, -93.84486 32.951...",Bossier Parish,Louisiana
300,29109,29,109,Lawrence,"POLYGON ((-93.74705 37.28455, -93.74601 37.284...",Lawrence County,Missouri
449,48367,48,367,Parker,"POLYGON ((-98.06034 32.81070, -98.06034 32.810...",Parker County,Texas


Quick note about `GeoDataFrames`: each one has an active geometry column (named `geometry` by default) that contains the geometric objects. This column is what enables `geopandas` to perform spatial operations, and it also carries attributes such as `.crs`, the coordinate reference system.


Commonly used datums in North America are NAD27, NAD83, and WGS84. More info [here](https://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Projection_basics_the_GIS_professional_needs_to_know).<br>

The county geodataframe uses `EPSG:4269`, which is the EPSG code for the NAD83 geographic coordinate system. Let's create a geodataframe from the `latitude` and `longitude` values that we have and reproject all of the points to the same CRS.

In [148]:
county_fips_gdf.crs

<Geographic 2D CRS: EPSG:4269>
Name: NAD83
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands

In [149]:
# ensures each row of the geodataframe is in the same CRS
registry_gdf = unify_crs(registry_df_post_2012, crs_col="crs")

In [159]:
# We can now perform a spatial join on the 2 GeoDataFrames.
# filter for rows with null county_name

joined_gdf = (
    registry_gdf[registry_gdf["county_name"].isna()]
    .sjoin(county_fips_gdf.drop(columns=["county"]), how="left", predicate="intersects")
    .drop(columns=["index_right"])
)
joined_gdf

Unnamed: 0,crs,geometry,latitude,longitude,job_start_day,tvd,job_end_month,api,is_indian_well,state_code,job_start_year,state_number,operator_name,well,total_base_water_volume,job_start_month,well_name,p_key,job_end_date,state_name,state_left,job_start_date,job_end_day,county_name,county_number,ff_version,county_code,api_number,job_end_year,is_federal_well,county,geoid,statefp,countyfp,name,state_right
178445,NAD27,POINT (-96.37064 35.74946),35.749381,-96.370355,19.0,2442.0,11.0,35037294390000,False,35,2019.0,3,WFD Oil Corporation,Vanorsdol #1-29,22134.0,11.0,Vanorsdale,692d9381-748e-4e5f-b83f-30f868f18882,2019-11-19,Arkansas,Arkansas,2019-11-19,19.0,,729,3,37,3729439000000,2019.0,False,Creek,40037,40,37,Creek,Oklahoma
139253,NAD83,POINT (-102.21212 48.35551),48.355506,-102.212124,28.0,8872.4,6.0,33610338000000,False,33,2022.0,33,Hunt Oil Company,TRULSON 156-90-11-14H-3,13496802.0,5.0,TRULSON 156-90-11-14H-3,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-06-08,North Dakota,North Dakota,2022-05-28,8.0,,610,3,61,33610338000000,2022.0,False,Mountrail,38061,38,61,Mountrail,North Dakota
139256,NAD83,POINT (-102.21233 48.35551),48.355506,-102.21233,28.0,8782.65,6.0,33610338100000,False,33,2022.0,33,Hunt Oil Company,PALERMO 156-90-2-31H-5,14820967.0,5.0,PALERMO 156-90-2-31H-5,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-06-09,North Dakota,North Dakota,2022-05-28,9.0,,610,3,61,33610338100000,2022.0,False,Mountrail,38061,38,61,Mountrail,North Dakota
206589,NAD83,POINT (-101.80852 32.40914),32.409144,-101.808518,11.0,8262.0,12.0,42317428660000,False,42,2020.0,43,Endeavor Energy Resources,Rhea 1-6 Unit 1 #133,17519292.0,11.0,Rhea 1-6 Unit 1 #133,87ea1a50-ab40-4956-8051-d39bf139ae53,2020-12-01,Utah,Utah,2020-11-11,1.0,,317,3,317,43317428660000,2020.0,False,Martin,48317,48,317,Martin,Texas


- The `North Dakota` `county_number` should be `061`, which is `Mountrail` county, not `610`.
- The `Utah` `county_number` was actually correct. The error was in the state number, which should have been `42` (Texas), not `43` (Utah). This error is significant because, according to the data dictionary:
> APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits represent the state, second three digits represent the county, third 5 digits represent the well.<br>

This means the `api` number is also incorrect. It should be `42317428660000` (<u><b>42</b></u>-317-42866-0000) instead of `43317428660000` (<u><b>43</b></u>-317-42866-0000).<br>
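To make the data-dictionary layout concrete, here is a minimal sketch that splits a 14-digit API number into its parts. The `ApiParts` type and `split_api` helper are hypothetical names introduced only for illustration; they are not part of the notebook or of any library.

```python
from typing import NamedTuple


class ApiParts(NamedTuple):
    """Pieces of a 14-digit API number (hypothetical helper type)."""

    state: str   # first two digits
    county: str  # next three digits
    well: str    # next five digits
    suffix: str  # trailing four digits


def split_api(api: str) -> ApiParts:
    """Split an API number, with or without dashes, into its components."""
    digits = api.replace("-", "")
    if len(digits) != 14 or not digits.isdigit():
        raise ValueError(f"expected 14 digits, got {api!r}")
    return ApiParts(digits[:2], digits[2:5], digits[5:10], digits[10:])


wrong = split_api("43317428660000")
fixed = split_api("42-317-42866-0000")
print(wrong.state, fixed.state)  # 43 42
print(fixed.county)  # 317
```

Splitting the number this way makes it easy to see that only the state prefix differs between the recorded and the corrected value, while the county and well segments agree.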


In [161]:
# Let's correct those values by writing the fixed county and county_code into registry_gdf
registry_gdf["county"] = registry_gdf["county_name"].copy()

# replace the county_code and county columns with the values from the joined_gdf
registry_gdf.loc[joined_gdf.index, "county"] = joined_gdf["name"]
registry_gdf.loc[joined_gdf.index, "county_code"] = joined_gdf["countyfp"]

# change the api column of the last row in joined_gdf to 42317428660000 instead of 43317428660000
# registry_gdf.loc[joined_gdf.index[-1], "api"] = "42317428660000"
registry_gdf["api"] = registry_gdf["api"].replace("43317428660000", "42317428660000")
# correct the state_code values for where the api was changed
registry_gdf["state_code"] = registry_gdf["api"].str[0:2]

# check that the values have been replaced
registry_gdf[registry_gdf["county_name"].isna()]

Unnamed: 0,crs,geometry,latitude,longitude,job_start_day,tvd,job_end_month,api,is_indian_well,state_code,job_start_year,state_number,operator_name,well,total_base_water_volume,job_start_month,well_name,p_key,job_end_date,state_name,state,job_start_date,job_end_day,county_name,county_number,ff_version,county_code,api_number,job_end_year,is_federal_well,county
178445,NAD27,POINT (-96.37064 35.74946),35.749381,-96.370355,19.0,2442.0,11.0,35037294390000,False,35,2019.0,3,WFD Oil Corporation,Vanorsdol #1-29,22134.0,11.0,Vanorsdale,692d9381-748e-4e5f-b83f-30f868f18882,2019-11-19,Arkansas,Arkansas,2019-11-19,19.0,,729,3,37,3729439000000,2019.0,False,Creek
139253,NAD83,POINT (-102.21212 48.35551),48.355506,-102.212124,28.0,8872.4,6.0,33610338000000,False,33,2022.0,33,Hunt Oil Company,TRULSON 156-90-11-14H-3,13496802.0,5.0,TRULSON 156-90-11-14H-3,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-06-08,North Dakota,North Dakota,2022-05-28,8.0,,610,3,61,33610338000000,2022.0,False,Mountrail
139256,NAD83,POINT (-102.21233 48.35551),48.355506,-102.21233,28.0,8782.65,6.0,33610338100000,False,33,2022.0,33,Hunt Oil Company,PALERMO 156-90-2-31H-5,14820967.0,5.0,PALERMO 156-90-2-31H-5,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-06-09,North Dakota,North Dakota,2022-05-28,9.0,,610,3,61,33610338100000,2022.0,False,Mountrail
206589,NAD83,POINT (-101.80852 32.40914),32.409144,-101.808518,11.0,8262.0,12.0,42317428660000,False,42,2020.0,43,Endeavor Energy Resources,Rhea 1-6 Unit 1 #133,17519292.0,11.0,Rhea 1-6 Unit 1 #133,87ea1a50-ab40-4956-8051-d39bf139ae53,2020-12-01,Utah,Utah,2020-11-11,1.0,,317,3,317,43317428660000,2020.0,False,Martin


Another way to impute the missing `county_name` values would have been to use the `well_name` in the `registry_df_post_2012` dataframe, assuming the other wells on the same pad have the correct `state_code` and `state` values. However, this would not have worked for `Vanorsdol`, as there is only one well with that name.
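The pad-based idea can be sketched without pandas as follows. This is only an illustration under stated assumptions: `pad_key` and `impute_county` are hypothetical helpers, and treating the first two tokens of the well name as the pad identifier is a guess that would need checking against real naming conventions.

```python
from collections import Counter, defaultdict


def pad_key(well_name: str, n_tokens: int = 2) -> str:
    """Hypothetical pad key: the first n whitespace-separated tokens, upper-cased."""
    return " ".join(well_name.upper().split()[:n_tokens])


def impute_county(rows, n_tokens: int = 2):
    """Fill a missing 'county' with the most common county on the same pad."""
    votes = defaultdict(Counter)
    for row in rows:
        if row["county"] is not None:
            votes[pad_key(row["well"], n_tokens)][row["county"]] += 1

    filled = []
    for row in rows:
        county = row["county"]
        key = pad_key(row["well"], n_tokens)
        # Only impute when at least one neighbour on the pad has a county.
        if county is None and votes[key]:
            county = votes[key].most_common(1)[0][0]
        filled.append({**row, "county": county})
    return filled


rows = [
    {"well": "Rhea 1-6 Unit 1 #112", "county": "Martin"},
    {"well": "Rhea 1-6 Unit 1 #123", "county": "Martin"},
    {"well": "Rhea 1-6 Unit 1 #133", "county": None},
    {"well": "Vanorsdol #1-29", "county": None},
]
print([r["county"] for r in impute_county(rows)])
# ['Martin', 'Martin', 'Martin', None]
```

The single `Vanorsdol` row stays empty because no other well shares its key, which is the limitation noted above.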

In [88]:
registry_df_post_2012[
    registry_df_post_2012["well_name"].str.contains("vanors", case=False, na=False)
]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state,well
178445,692d9381-748e-4e5f-b83f-30f868f18882,2019-11-19,2019-11-19,3729439000000,3,729,WFD Oil Corporation,Vanorsdale,35.749381,-96.370355,NAD27,2442.0,22134.0,Arkansas,,3,False,False,2019.0,11.0,19.0,2019.0,11.0,19.0,35037294390000,3,729,Arkansas,Vanorsdol #1-29


In [163]:
# show other wells with a similar name to rhea 1-6
registry_df_post_2012[
    registry_df_post_2012["well"].str.contains("Rhea 1-6", case=False, na=False)
]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state,well
86492,f2fc89b2-43bc-4a2d-a302-d003c9dab182,2020-11-07,2020-11-25,42317428600000,42,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #112,32.409086,-101.803853,NAD83,8242.0,17632700.0,Texas,Martin,3,False,False,2020.0,11.0,7.0,2020.0,11.0,25.0,42317428600000,42,317,Texas,Rhea 1-6 Unit 1 #112
86493,ee8e1a88-fd13-4ee1-9ac9-3cf04bfb73c1,2020-11-07,2020-11-25,42317428630000,42,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #123,32.409031,-101.804087,NAD83,8247.0,19430193.0,Texas,Martin,3,False,False,2020.0,11.0,7.0,2020.0,11.0,25.0,42317428630000,42,317,Texas,Rhea 1-6 Unit 1 #123
86504,b1fa42d1-7bcd-4910-be96-23f1f93ab58d,2020-11-07,2020-11-26,42317428590000,42,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #211,32.409068,-101.803931,NAD83,8977.0,16920683.0,Texas,Martin,3,False,False,2020.0,11.0,7.0,2020.0,11.0,26.0,42317428590000,42,317,Texas,Rhea 1-6 Unit 1 #211
86505,7924a3b1-3eda-4452-b0c0-769017cb2830,2020-11-07,2020-11-26,42317428610000,42,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #221,32.409049,-101.804009,NAD83,8722.0,17988065.0,Texas,Martin,3,False,False,2020.0,11.0,7.0,2020.0,11.0,26.0,42317428610000,42,317,Texas,Rhea 1-6 Unit 1 #221
86550,fd5b8aa8-2727-42be-8dfb-e91bd30a2906,2020-11-11,2020-12-01,42317428640000,42,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #231,32.409162,-101.80844,NAD83,8739.0,17739048.0,Texas,Martin,3,False,False,2020.0,11.0,11.0,2020.0,12.0,1.0,42317428640000,42,317,Texas,Rhea 1-6 Unit 1 #231
86551,00937ae7-0cc4-446f-9c31-befe787ee09b,2020-11-11,2020-12-01,42317428650000,42,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #232,32.409126,-101.808596,NAD83,8884.0,17523065.0,Texas,Martin,3,False,False,2020.0,11.0,11.0,2020.0,12.0,1.0,42317428650000,42,317,Texas,Rhea 1-6 Unit 1 #232
86552,52ff3cb8-be62-456e-91be-3edab3f13ab2,2020-11-11,2020-12-01,42317428670000,42,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #241,32.409108,-101.808674,NAD83,8729.0,17466997.0,Texas,Martin,3,False,False,2020.0,11.0,11.0,2020.0,12.0,1.0,42317428670000,42,317,Texas,Rhea 1-6 Unit 1 #241
206589,87ea1a50-ab40-4956-8051-d39bf139ae53,2020-11-11,2020-12-01,43317428660000,43,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #133,32.409144,-101.808518,NAD83,8262.0,17519292.0,Utah,,3,False,False,2020.0,11.0,11.0,2020.0,12.0,1.0,43317428660000,43,317,Utah,Rhea 1-6 Unit 1 #133


In [164]:
registry_df_post_2012[registry_df_post_2012["well"].str.contains("Trulson", case=False)]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state,well
136875,5f5b1711-c657-4e85-b0d1-238585a4adaa,2019-06-01,2019-06-03,33061042180000,33,61,Hunt Oil Company,Trulson 156-90-11-14H-4,48.3552,-102.219,NAD27,8926.77,4923847.0,North Dakota,Mountrail,3,False,False,2019.0,6.0,1.0,2019.0,6.0,3.0,33061042180000,33,61,North Dakota,Trulson 156-90-11-14H-4
139253,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-05-28,2022-06-08,33610338000000,33,610,Hunt Oil Company,TRULSON 156-90-11-14H-3,48.355506,-102.212124,NAD83,8872.4,13496802.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,8.0,33610338000000,33,610,North Dakota,TRULSON 156-90-11-14H-3


In [165]:
# county_fips_gdf[county_fips_gdf["state"].isin(permian_states)]
registry_df_post_2012[registry_df_post_2012["well"].str.contains("palermo", case=False)]

Unnamed: 0,p_key,job_start_date,job_end_date,api_number,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,job_start_year,job_start_month,job_start_day,job_end_year,job_end_month,job_end_day,api,state_code,county_code,state,well
138760,07a3ddb7-7b49-4b78-b784-dcd7d4ece9c3,2021-09-11,2021-09-21,33061018280000,33,61,Hunt Oil Company,Palermo 2-6-34H,48.356826,-102.297271,NAD83,9034.68,10726600.0,North Dakota,Mountrail,3,False,False,2021.0,9.0,11.0,2021.0,9.0,21.0,33061018280000,33,61,North Dakota,Palermo 2-6-34H
139256,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-05-28,2022-06-09,33610338100000,33,610,Hunt Oil Company,PALERMO 156-90-2-31H-5,48.355506,-102.21233,NAD83,8782.65,14820967.0,North Dakota,,3,False,False,2022.0,5.0,28.0,2022.0,6.0,9.0,33610338100000,33,610,North Dakota,PALERMO 156-90-2-31H-5
139473,a1f7da60-c02d-4094-b4d5-0d61487c1bbb,2022-08-05,2022-08-20,33061049910000,33,61,Hunt Oil Company,PALERMO-KING 156-90-5-34H 1,48.356878,-102.270326,NAD83,8928.9,13760852.0,North Dakota,Mountrail,3,False,False,2022.0,8.0,5.0,2022.0,8.0,20.0,33061049910000,33,61,North Dakota,PALERMO-KING 156-90-5-34H 1
139505,8584cf47-a792-4412-9877-85106ebdcf90,2022-08-22,2022-08-30,33061049870000,33,61,Hunt Oil Company,PALERMO 156-90-6-34H 3,48.357256,-102.286621,NAD83,8967.15,13389298.0,North Dakota,Mountrail,3,False,False,2022.0,8.0,22.0,2022.0,8.0,30.0,33061049870000,33,61,North Dakota,PALERMO 156-90-6-34H 3
139506,c3ee8ca1-258a-46d5-b4e6-846cf6a34e1a,2022-08-22,2022-08-31,33061049880000,33,61,Hunt Oil Company,PALERMO 156-90-5-34H 4,48.357256,-102.286497,NAD83,8955.3,14201058.0,North Dakota,Mountrail,3,False,False,2022.0,8.0,22.0,2022.0,8.0,31.0,33061049880000,33,61,North Dakota,PALERMO 156-90-5-34H 4


In [162]:
# make the well column uppercase to reduce variation among wells on the same pad
registry_gdf["well"] = registry_gdf["well"].str.upper()
# make the operator column uppercase to reduce variation among entries for the same operator
registry_gdf["operator"] = registry_gdf["operator_name"].str.upper()

Just as we found the `county` and `state` that those 4 wells belong to, we can do the same for all the wells.
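Conceptually, a spatial join assigns each point to the polygon that contains it. The sketch below illustrates that idea with a plain ray-casting test; it is not how `geopandas` implements `sjoin` (which relies on a spatial index and proper geometry predicates), and the rectangle standing in for a county is a made-up box, not the real Martin County boundary.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: True if (x, y) lies strictly inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate(point, polygons):
    """Return the name of the first polygon containing the point, else None."""
    x, y = point
    for name, ring in polygons.items():
        if point_in_polygon(x, y, ring):
            return name
    return None


# Hypothetical rectangle in (longitude, latitude) order, for illustration only.
toy_counties = {
    "Martin (toy box)": [(-102.2, 32.0), (-101.6, 32.0), (-101.6, 32.6), (-102.2, 32.6)],
}

print(locate((-101.808518, 32.409144), toy_counties))  # Martin (toy box)
print(locate((32.409144, -101.808518), toy_counties))  # None
```

The second call passes the same coordinates in swapped order and finds no polygon, which is the kind of row that ends up with an empty `state_right` after a left join.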

In [75]:
# Check which other wells may have a state value which did not agree with the spatial join result
registry_county_gdf = registry_gdf.sjoin(
    county_fips_gdf, how="left", predicate="intersects"
).drop(columns=["index_right"])

trimmed_column_set = [
    "api",
    "well",
    "state_left",
    "state_right",
    "county_name",
    "county_code",
    "operator_name",
    "longitude",
    "latitude",
    "geometry",
]

# rows with no match in the spatial join (the point fell outside every county polygon)
mismatch_geo = registry_county_gdf[registry_county_gdf["state_right"].isna()]
# get the rows with Oklahoma in the state_left column
mismatch_geo_ok = mismatch_geo[trimmed_column_set].query(
    'state_left.str.contains("Oklahoma")'
)
# alter api to match the format in the occ_wells dataframe
mismatch_geo_ok["ok_api"] = mismatch_geo_ok["api"].str[:10]  # ok for Oklahoma

# look for the api in the occ_wells dataframe and merge that row to mismatch_geo_ok
mismatch_geo_ok.merge(
    occ_wells[
        ["api", "well_name", "well_num", "operator", "sh_lat", "sh_lon", "county"]
    ],
    how="left",
    left_on="ok_api",
    right_on="api",
)

# occ_wells[occ_wells["api"].isin(mismatch_geo_ok["ok_api"])][
#     ["api", "well_name", "well_num", "operator", "sh_lat", "sh_lon", "county"]
# ]

Unnamed: 0,api_x,well,state_left,state_right,county_name,county_code,operator_name,longitude,latitude,geometry,ok_api,api_y,well_name,well_num,operator,sh_lat,sh_lon,county
0,35093250270000,MCCONNELL #19-1H,Oklahoma,,Major,93,Comanche Resources Company,-98732100000.0,36.201866,POINT (-98732102099.00000 36.20187),3509325027,3509325027.0,MCCONNELL,#19-1H,COMANCHE RESOURCES COMPANY,,,MAJOR
1,35073254930000,DALWHINNIE 1605 1-31MH,Oklahoma,,Kingfisher,73,"Alta Mesa Services, LP",35.82785,-97.776229,POINT (35.82785 -97.77623),3507325493,3507325493.0,DALWHINNIE 1605,#1-31MH,BCE-MACH III LLC,35.827851,-97.776229,KINGFISHER
2,35047250860000,GUNGOLL 20-1H,Oklahoma,,Garfield,47,Gastar Exploration Inc.,36.18891,-98.070053,POINT (36.18891 -98.07005),3504725086,3504725086.0,GUNGOLL 20,#1H,CHISHOLM OIL AND GAS OPERATING LLC,36.188967,-98.070053,GARFIELD
3,35073255130000,BENNIE RACER 14-1H,Oklahoma,,Kingfisher,73,Gastar Exploration Inc.,36.117,-98.017878,POINT (36.11700 -98.01788),3507325513,3507325513.0,BENNIE RACER 14,#1H,CHISHOLM OIL AND GAS OPERATING LLC,36.117041,-98.018222,KINGFISHER
4,35073256050000,HAROLD DOROTHY 1907 6-1H,Oklahoma,,Kingfisher,73,Gastar Exploration Inc.,36.14611,-97.988311,POINT (36.14611 -97.98831),3507325605,3507325605.0,HAROLD AND DOROTHY 1907,#6-1H,CHISHOLM OIL AND GAS OPERATING LLC,36.146174,-97.89667,KINGFISHER
5,35063247190000,ANGUS 9-1,Oklahoma,,Hughes,63,"Silver Creek Oil & Gas, LLC",34.92702,-96.255972,POINT (34.92702 -96.25597),3506324719,3506324719.0,ANGUS,#9-1H,SILVER CREEK OIL & GAS LLC,34.927665,-96.263397,HUGHES
6,35117236960000,J L MILLER 36,Oklahoma,,Pawnee,117,"Contango Resources, Inc.",-96.48668,-96.284071,POINT (-96.48668 -96.28407),3511723696,3511723696.0,JL MILLER,#36,CONTANGO RESOURCES LLC,36.284071,-96.486678,PAWNEE
7,35065202650001,EDDIE #2-31H,Oklahoma,,Jackson,65,"GLB Exploration, Inc.",0.0,34.420379,POINT (0.00000 34.42038),3506520265,3506520265.0,EDDIE,#2-31H,GLB EXPLORATION INC,34.42044,-99.64525,JACKSON
8,35003233370000,ALVA 5-25-11 1H,Oklahoma,,Alfalfa,3,Mach Resources,-10.12346,10.123456,POINT (-10.12346 10.12346),3500323337,3500323337.0,ALVA 5-28-11,#1H,BCE-MACH LLC,36.92756,-98.40674,ALFALFA
9,35063247180000,WALT 07-01H,Oklahoma,,Hughes,63,"Silver Creek Oil & Gas, LLC",35.01488,-96.389653,POINT (35.01488 -96.38965),3506324718,3506324718.0,WALT,#07-1H,SILVER CREEK OIL & GAS LLC,35.01488,-96.389653,HUGHES


In [78]:
mismatch_geo.tail(50)

Unnamed: 0,crs,geometry,latitude,longitude,job_start_day,tvd,job_end_month,api,is_indian_well,state_code,job_start_year,state_number,operator_name,well,total_base_water_volume,...,job_start_date,job_end_day,county_name,county_number,ff_version,county_code,api_number,job_end_year,is_federal_well,geoid,statefp,countyfp,name,county,state_right
85499,NAD27,POINT (33.18122 -103.04381),-103.043806,33.18122,28.0,5320.0,8.0,42501372640000,False,42,2020.0,42,"Steward Energy II, LLC",BRASS MONKEY 414 A 1H,3708600.0,...,2020-08-28,30.0,Yoakum,501,3,501,42501372640000,2020.0,False,,,,,,
90528,NAD27,POINT (-10.00000 31.74061),31.740613,-10.0,15.0,9293.0,8.0,42329434900000,False,42,2021.0,42,Chevron USA Inc.,CMC BLUEBONNET E 0054WA,841263.0,...,2021-07-15,26.0,Midland,329,3,329,42329434900000,2021.0,False,,,,,,
90548,NAD27,POINT (28.84272 -98.11911),-98.119115,28.84272,11.0,10910.0,8.0,42255371560000,False,42,2021.0,42,EP Energy,TRENCH FOOT UNIT C 3H,25110960.0,...,2021-08-11,28.0,Karnes,255,3,255,42255371560000,2021.0,False,,,,,,
99482,NAD27,POINT (-101.97097 21.41620),21.416198,-101.971,28.0,9412.0,1.0,42317445230000,False,42,2022.0,42,Black Swan Oil & Gas,CLAIRE 252 C 4WA,25825943.0,...,2022-12-28,10.0,Martin,317,3,317,42317445230000,2023.0,False,,,,,,
102531,NAD27,POINT (28.76321 -98.61394),-98.613942,28.76321,6.0,8015.0,6.0,42013359480000,False,42,2023.0,42,Texas American Resources Company,HARRIS NORTH B UNIT 105H,16308774.0,...,2023-06-06,24.0,Atascosa,13,3,13,42013359480000,2023.0,False,,,,,,
112507,NAD27,POINT (39.85945 -81.20438),-81.204377,39.85945,12.0,8957.0,1.0,34111247300000,False,34,2018.0,34,Antero Resources Corporation,HARPER 2H,28191768.0,...,2018-12-12,27.0,Monroe,111,3,111,34111247300000,2019.0,False,,,,,,
112696,NAD27,POINT (40.52965 -81.12848),-81.128479,40.52965,18.0,7540.0,8.0,34019221650000,False,34,2019.0,34,"Chesapeake Operating, Inc.",RUTLEDGE 10-14-6 5H,21188499.0,...,2019-07-18,6.0,Carroll,19,3,19,34019221650000,2019.0,False,,,,,,
112697,NAD27,POINT (40.52973 -81.12857),-81.128575,40.52973,17.0,7529.0,8.0,34019222380000,False,34,2019.0,34,"Chesapeake Operating, Inc.",RUTLEDGE 10-14-6 3H,21383065.0,...,2019-07-17,6.0,Carroll,19,3,19,34019222380000,2019.0,False,,,,,,
112698,NAD27,POINT (40.52969 -81.12853),-81.128529,40.52969,17.0,7519.0,8.0,34019227570000,False,34,2019.0,34,"Chesapeake Operating, Inc.",RUTLEDGE 10-14-6 4H,21189513.0,...,2019-07-17,6.0,Carroll,19,3,19,34019227570000,2019.0,False,,,,,,
134811,NAD27,POINT (103.26030 47.93409),47.934093,103.2603,8.0,11281.0,10.0,33053077330000,False,33,2017.0,33,Oasis Petroleum,PATSY 5198 11-5 2BX,9839634.0,...,2017-10-08,24.0,McKenzie,53,3,53,33053077330000,2017.0,False,,,,,,


In [None]:

# state mismatch
mismatch_geo = registry_county_gdf[
    registry_county_gdf["state_left"] != registry_county_gdf["state_right"]
]
print(f"Number of rows with state mismatch: {mismatch_geo.shape[0]}")


tx_query_mismatches = mismatch_geo.query('state_left.str.contains("Texas")')[
    trimmed_column_set
]
print(f"Number of rows with state mismatch in Texas: {tx_query_mismatches.shape[0]}")

# find mismatches that fall within USA_BOUNDS
within_bounds = tx_query_mismatches.cx[
    USA_BOUNDS[0] : USA_BOUNDS[2], USA_BOUNDS[1] : USA_BOUNDS[3]
]
print(
    f"Number of rows with state mismatch in Texas within bounds: {within_bounds.shape[0]}"
)
# get those not in bounds
outside_bounds = tx_query_mismatches[
    ~tx_query_mismatches.index.isin(within_bounds.index)
].copy()

# points with geometry outside the bounds
print(
    f"Number of rows with state mismatch in Texas outside bounds: {outside_bounds.shape[0]}"
)
tx_query_mismatches.shape

It looks like, for most of the rows that are out of bounds, the `latitude` and `longitude` values were accidentally swapped, or the negative sign in front of the `longitude` value was accidentally omitted.
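One way to act on this observation is to try the likely fixes in turn and keep the first one that lands inside a bounding box. The sketch below is an assumption-laden illustration: `BOUNDS` is a rough, hypothetical box for the contiguous US (the notebook's own `USA_BOUNDS` constant may differ), and `repair_lon_lat` is a made-up helper, not part of the notebook.

```python
# Hypothetical (min_lon, min_lat, max_lon, max_lat) box; values are an assumption.
BOUNDS = (-125.0, 24.0, -66.0, 50.0)


def in_bounds(lon, lat, bounds=BOUNDS):
    """True if the point lies inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bounds
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def repair_lon_lat(lon, lat, bounds=BOUNDS):
    """Return (lon, lat, label) for the first candidate fix inside bounds, else None."""
    candidates = [
        ("unchanged", lon, lat),
        ("swapped", lat, lon),
        ("negated longitude", -abs(lon), lat),
        ("swapped and negated", -abs(lat), lon),
    ]
    for label, cand_lon, cand_lat in candidates:
        if in_bounds(cand_lon, cand_lat, bounds):
            return cand_lon, cand_lat, label
    return None


# Values taken from rows shown in the mismatch output above.
print(repair_lon_lat(33.18122, -103.043806))  # (-103.043806, 33.18122, 'swapped')
print(repair_lon_lat(103.2603, 47.934093))    # (-103.2603, 47.934093, 'negated longitude')
print(repair_lon_lat(0.0, 34.420379))         # None
```

A repaired point should still be checked against a trusted source, such as the state regulator's well records, because a point can fall inside the box and still be in the wrong county.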

In [None]:
# mer_points is created in the next cell, so run that cell first
mer_points.shape

In [None]:
# filter out the points outside the bounds of +- 180 longitude and +- 90 latitude
tx_query_mismatches = tx_query_mismatches.loc[
    (tx_query_mismatches["longitude"] <= 180)
    & (tx_query_mismatches["longitude"] >= -180)
    & (tx_query_mismatches["latitude"] <= 90)
    & (tx_query_mismatches["latitude"] >= -90)
]

# Define bg_map variable here
bg_map = get_background_map()

mer_points = platecaree_to_mercator_vectorised(
    tx_query_mismatches["geometry"].x, tx_query_mismatches["geometry"].y
)
mer_coords = pd.DataFrame(mer_points, columns=["x", "y"])

bg_map * gv.Points(
    mer_coords.reset_index(), ["x", "y"], ["index"], crs=ccrs.GOOGLE_MERCATOR
).opts(
    color="red",
    size=5,
    height=600,
    width=800,
    tools=["hover"],
)

In [None]:
# Define a list of states that are in the Permian Basin.
nm_tx = ["New Mexico", "Texas"]

# Filter the county_fips_gdf DataFrame to include only the counties in the Permian states.
counties_nm_tx_gdf = county_fips_gdf[county_fips_gdf["state"].isin(nm_tx)]

# Perform a spatial join between the registry_gdf and counties_nm_tx_gdf DataFrames.
# This will add the data from counties_nm_tx_gdf to registry_gdf for matching locations.
# After the join, drop the 'index_right' column as it's not needed.
registry_nm_tx_gdf = registry_gdf.sjoin(counties_nm_tx_gdf).drop(
    columns=["index_right"]
)
registry_nm_tx_gdf.sample(3)

In [None]:
registry_nm_tx_gdf.info()

In [None]:
registry_nm_tx_gdf["operator_name"].nunique()

In [None]:
# create a pivot table with the number of jobs per operator for each year
operator_year_count = registry_nm_tx_gdf.pivot_table(
    index="operator_name", columns="job_start_year", values="api", aggfunc="count"
)
# See which operators were active every year
# operator_year_count[operator_year_count.count(axis=1) == 11]

# See who has been active for the last 5 years
operator_active_5y = operator_year_count.loc[
    ~operator_year_count.iloc[:, -5:].isna().any(axis=1)
].fillna(0)

# do some styler table formatting from pandas
style = operator_active_5y.style.background_gradient(
    cmap="cet_CET_L2_r", axis=1, vmin=0, vmax=operator_year_count.max().max()
)
# Format the numbers in the table as integers
style = style.format("{:.0f}")
print(f"Number of operators active for the last 5 years: {len(operator_active_5y)}")
style
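The "active for the last 5 years" filter keeps an operator only if it has at least one job in every one of the five most recent years present in the data. The sketch below restates that logic in plain Python on toy records, purely as an illustration; the operator names and years are made up, and `operators_active_in_last_n_years` is a hypothetical helper rather than part of the notebook.

```python
from collections import defaultdict


def operators_active_in_last_n_years(jobs, n_years=5):
    """Return operators with at least one job in each of the last n_years.

    jobs is an iterable of (operator_name, job_start_year) pairs. "Last"
    means the most recent years that appear anywhere in the data, which
    mirrors selecting the last columns of a year-indexed pivot table.
    """
    jobs = list(jobs)
    if not jobs:
        return []
    years_by_operator = defaultdict(set)
    for operator, year in jobs:
        years_by_operator[operator].add(year)
    all_years = sorted({year for _, year in jobs})
    recent_years = set(all_years[-n_years:])
    # Keep operators whose set of active years covers every recent year
    return sorted(
        op for op, years in years_by_operator.items() if recent_years <= years
    )


# Toy records: A and C work every year from 2019 to 2023, B skips several years
toy_jobs = (
    [("A", y) for y in range(2019, 2024)]
    + [("B", 2019), ("B", 2023)]
    + [("C", y) for y in range(2015, 2024)]
)
active = operators_active_in_last_n_years(toy_jobs)
```

In this toy data, operator B appears in the first and last recent year but is still excluded, which is the same effect the `isna().any(axis=1)` test has on the pivot table.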

In [None]:
# Create a mask for rows where the first 2 characters of the api number are not 42 nor 30.
# This is done to filter out rows that do not belong to the states we are interested in (Texas and New Mexico).
api_mask = ~registry_nm_tx_gdf["api"].str[0:2].isin(["42", "30"])

# Apply the mask to the registry_nm_tx_gdf DataFrame to get the rows that match the condition.
mismatch_state = registry_nm_tx_gdf[api_mask]

# Display selected columns from the mismatch_state DataFrame.
# These columns provide information about the well, its location, and the job start date.
display(
    mismatch_state[
        [
            "api",
            "api_number",
            "state_right",
            "state_name",
            "state_number",
            "well_name",
            "operator_name",
            "county_name",
            "latitude",
            "longitude",
            "geometry",
            "crs",
            "job_start_date",
            "county",
            "countyfp",
        ]
    ]
)
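The state check above relies on the layout of an API well number: the first two digits are the state code (42 for Texas, 30 for New Mexico), the next three are the county code, and the following five identify the well. The sketch below splits a number into those parts. It assumes the identifier holds at least ten digits once separators are removed, and `split_api` is an illustrative helper that does not exist in the notebook.

```python
import re

# State codes used in this notebook's filter
STATE_CODES = {"42": "Texas", "30": "New Mexico"}


def split_api(api: str) -> dict:
    """Split an API well number into state, county, and well components."""
    digits = re.sub(r"\D", "", str(api))  # drop dashes or other separators
    if len(digits) < 10:
        raise ValueError(f"expected at least 10 digits, got {len(digits)}: {api!r}")
    return {
        "state_code": digits[:2],
        "county_code": digits[2:5],
        "well_code": digits[5:10],
        # county + well digits, the same slice as api[2:10] on an undashed number
        "api_short": digits[2:10],
        "state": STATE_CODES.get(digits[:2], "other"),
    }
```

For example, `split_api("42-003-12345-00-00")` reports state code `42`, county code `003`, and an `api_short` of `00312345`. Rows flagged by `api_mask` are the ones whose `state` would come back as `other` here.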

In [None]:
# bg_map * gv.Points(mismatch_state["geometry"]).opts(height=500, width=500)

In [None]:
# get a background map
bg_map = get_background_map()

In [None]:
# bg_map * gv.Points(mismatch_state["geometry"]) * gv.Path(
#     counties_nm_tx_gdf, vdims=["county", "countyfp"]
# ).opts(
#     height=500,
#     width=500,
#     tools=["hover"],
#     # color="county",
# )

In [None]:
# bg_map * gv.Points(registry_nm_tx_gdf["geometry"]).opts(
#     color="skyblue", size=1, tools=["hover"], width=800, height=600, alpha=0.5
# )

In [None]:
# Convert the coordinates to Mercator
mercator_coords = platecaree_to_mercator_vectorised(
    registry_nm_tx_gdf["geometry"].x, registry_nm_tx_gdf["geometry"].y
)

# Round the coordinates and create a DataFrame
mer_points = pd.DataFrame(np.round(mercator_coords), columns=["x", "y"])

# Create a Points object for plotting
gpoints = gv.Points(
    mer_points.reset_index(), ["x", "y"], ["index"], crs=ccrs.GOOGLE_MERCATOR
).opts(height=600, width=800, color="skyblue", size=1, tools=["hover"])

# Create a layout with the background map and the points
layout = bg_map * gpoints
layout

### Map files taken from the EIA website.


### Formations

In [None]:
# This is just the county map. We pulled it in straight from the URL above
# extract_gdfs_from_zip_url(CENSUS_COUNTY_MAP_URL)

In [None]:
# Use the function 'extract_gdfs_from_zip_url_concurrent' to get GeoDataFrames from the URLs in 'basins_url_list'
# This function concurrently downloads and extracts GeoDataFrames from the given URLs
basins_dict = extract_gdfs_from_zip_url_concurrent(basins_url_list)

# Display the keys of 'basins_dict' to see the names of the basins
display(basins_dict.keys())

# Concatenate the GeoDataFrames in 'basins_dict' into a single GeoDataFrame using the function 'concat_gdf_from_dict'
basins_gdf = concat_gdf_from_dict(basins_dict)

# Convert the column names of 'basins_gdf' to snake case for consistency
# The function 'pascal_to_snake' is used to convert PascalCase or camelCase to snake_case
basins_gdf.columns = [pascal_to_snake(col) for col in basins_gdf.columns]
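The `pascal_to_snake` helper is defined elsewhere in the notebook. For readers without that cell at hand, the following is one common way such a converter can be written with two regular-expression passes. It is a sketch under that assumption, named `pascal_to_snake_sketch` to make clear it is not the notebook's actual implementation.

```python
import re


def pascal_to_snake_sketch(name: str) -> str:
    """Convert PascalCase or camelCase to snake_case.

    Names that are already all upper case or already contain underscores
    (for example shapefile column names) are simply lower-cased.
    """
    # Put an underscore before a capitalised word that follows any character
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Put an underscore between a lower-case letter or digit and a capital
    partial = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial)
    return partial.lower()
```

With this version, `ShalePlay` becomes `shale_play`, `jobStartDate` becomes `job_start_date`, and an all-caps shapefile column such as `OGRID_NAM` becomes `ogrid_nam`.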

In [None]:
print(f"Number of sub basins/ geodataframes: {len(basins_dict)}\n")

for k, gdf in basins_dict.items():
    print(f"{k}| Shape:{gdf.shape}| CRS:{gdf.crs.to_string()}")
    display(gdf.sample())
    print()

In [None]:
# Plot shale plays and basin boundaries of the different formations in the Permian Basin
bg_map * basins_gdf.hvplot(
    geo=True,
    alpha=0.5,
    title="Shale Plays in the Permian Basin",
    legend=True,
    by="shale_play",
    muted_alpha=0.01,
).opts(
    tools=["hover", "tap"],
    legend_position="right",
    height=600,
    width=800,
)

In [None]:
from holoviews import opts

# Dissolve the geometries of the basins_gdf GeoDataFrame into a single geometry
shale_plays_gdf = basins_gdf[["geometry"]].dissolve()

# Get the counties in the Permian Basin. Each county is compared against the
# single dissolved geometry; passing the GeoDataFrame itself would align the
# two objects on their index and only compare rows with matching labels.
permian_counties = county_fips_gdf.intersects(shale_plays_gdf.geometry.iloc[0])

# plot the counties in the Permian Basin
bg_map * gv.Polygons(county_fips_gdf[permian_counties]).opts(
    title="Counties in the Permian Basin",
    tools=["hover"],
    height=600,
    width=800,
    alpha=0.5,
    color="skyblue",
)
# plot an outline of the Permian Basin over the counties
overlay = (
    bg_map
    * gv.Polygons(county_fips_gdf[permian_counties])
    * gv.Path(shale_plays_gdf)
    * gpoints
)

overlay.opts(
    opts.Polygons(alpha=0.5, cmap=["#73d2ff"], line_color="gray"),
    opts.Path(alpha=0.5, color="black"),
    opts.Overlay(tools=["hover"], height=600, width=800),
    opts.Points(color="crimson"),
)

# plot to confirm that the geometries have been dissolved
# bg_map * gv.Polygons(shale_plays_gdf).opts(
#     # geo=True,
#     title="Dissolved Shale play in the Permian Basin",
#     height=600,
#     width=800,
# )

#### State land leases

#### New Mexico:


In [None]:
# Use the extract_gdfs_from_zip_url_concurrent function to download and extract GeoDataFrames
# from the shapefile zip files at the URLs in nm_slo_url_list. The function returns a dictionary
# where the keys are the names of the shapefiles and the values are the corresponding GeoDataFrames.
nm_slo_dict = extract_gdfs_from_zip_url_concurrent(nm_slo_url_list)

# Display the keys of nm_slo_dict. These are the names of the shapefiles
# that were downloaded and extracted.
nm_slo_dict.keys()

In [None]:
# land_map_dict = extract_gdfs_from_zip_url_concurrent(shp_url_list)


# land_map_dict.keys()

In [None]:
# sample the gdfs in the dictionary
for k, gdf in nm_slo_dict.items():
    print(f"{k}| Shape:{gdf.shape}| CRS:{gdf.crs.to_string()}")
    display(gdf.sample(3))
    gdf.info()  # info() prints its summary directly and returns None

In [None]:
# Create 2 separate gdfs instead of concatenating them as they have distinct columns
nm_slo_gdfs = list(nm_slo_dict.values())
# first one is the geologic regions
nm_slo_geo = nm_slo_gdfs[0]
# scrub the columns
nm_slo_geo.columns = [pascal_to_snake(col) for col in nm_slo_geo.columns]

# Define a dictionary of opts to pass to the plot functions
poly_opts = dict(
    alpha=0.8,
    height=600,
    width=800,
    line_width=0,
    line_color="lightgray",
    tools=["hover"],
)


# Adjust opts for this plot
poly_opts_copy = poly_opts.copy()
poly_opts_copy["line_width"] = 1

# plot the geologic regions gdf
bg_map * gv.Polygons(nm_slo_geo.to_crs("EPSG:4269"), vdims=["label"]).opts(
    **poly_opts_copy, cmap=["#73d2ff"] * 256, title="New Mexico Geologic Regions"
)

In [None]:
# Second one is the oil and gas leases on New Mexico State Trust Lands
nm_slo_lease = nm_slo_gdfs[1]
# scrub the columns
nm_slo_lease.columns = [pascal_to_snake(col) for col in nm_slo_lease.columns]
# create a new column for the area of the lease
nm_slo_lease["area"] = nm_slo_lease["geometry"].area
# groupby the ogrid_nam and sum the area
# add the transformed area to the gdf
nm_slo_lease["ogrid_area"] = (
    nm_slo_lease.groupby("ogrid_nam")["area"].transform("sum") / 1e6
)

# plot the oil and gas leases gdf for New Mexico State Trust Lands
bg_map * gv.Polygons(
    nm_slo_lease.to_crs("EPSG:4269"), vdims=["ogrid_nam", "ogrid_area"]
).opts(
    **poly_opts,
    cnorm="eq_hist",
    colorbar=True,
    title="Oil and Gas Leases on New Mexico State Trust Lands"
)

In [None]:
# nm_slo_plss = nm_slo_gdfs[2]
# nm_slo_plss.columns = [pascal_to_snake(col) for col in nm_slo_plss.columns]
# nm_slo_plss.drop(["benef_surf", "benef_subs", "meridian"], axis=1, inplace=True)

In [None]:
# # get the county boundaries for New Mexico
# lea_county_poly = county_fips_gdf[
#     (county_fips_gdf["state"] == "New Mexico")
#     & (county_fips_gdf["county"].str.contains("Lea"))
# ]
# # get the plss within the county boundaries
# # lea_plss = nm_slo_plss.sjoin(lea_county_poly, how="inner", predicate="intersects")

# lea_plss = lea_county_poly.sjoin(
#     nm_slo_plss.to_crs(lea_county_poly.crs)
# ).drop(columns=["index_right"])

# lea_plss[["geometry"]]

In [None]:
# This has too many polygons to plot quickly
# nm_slo_plss

In [None]:
# list(land_map_dict.values())
# [gdf['geometry'] for gdf in land_map_dict.values()]
random_color_list = [
    "#" + "".join([random.choice("0123456789ABCDEF") for j in range(6)])
    for i in range(len(nm_slo_dict))
]
plots = []
new_map = get_background_map()
plots.append(new_map)
for color, (name, gdf) in zip(random_color_list, nm_slo_dict.items()):
    # Add new column with the name of the shapefile for the hover tool
    gdf["label"] = name
    # Give each shapefile its own random fill color
    plot = gv.Polygons(gdf.to_crs("EPSG:4269"), vdims=["label"]).opts(
        tools=["hover"], height=600, width=800, alpha=0.5, title="", color=color
    )
    plots.append(plot)

overlay = hv.Overlay(plots)
overlay

In [None]:
nm_slo_lease.info()

In [None]:
# # url for the New Mexico State Land Office zip file
# nm_slo_dict = extract_gdfs_from_zip_url(NM_SLO_OIL_LEASE_URL)
# if nm_slo_dict is not None:
#     nm_slo_gdf = concat_gdf_from_dict(nm_slo_dict)
#     nm_slo_gdf.info()
#     display(nm_slo_gdf.sample(3))

In [None]:
# nm_slo_gdf.explore()



#### Texas:
> Land survey data and surface well data were taken from the Railroad Commission of Texas (RRC) website and then uploaded to GCP for more reliable access. There are 254 zip files, one for each county in the state of Texas. Each zip file contains shapefiles, along with their accompanying file extensions, for categories ranging from airport lines to offshore survey polygons.

In [None]:
# # make a list of the county numbers
# county_nums = ["003", "135", "317"]

# # For each county number, define a path to the zip file
# shp_zips = [f"../data/Shp{num}.zip" for num in county_nums]
# # Look at the survey lines polygons and the surface wells points
# # data from local disk

# patterns = [r"surv\d{3}p", r"well\d{3}s"]
# shp_dict = extract_specific_gdf_from_local_zip(shp_zips, patterns)
# # for k, v in shp_dict.items():
# # print(f"{k}: {v.shape}: {v.columns}: {v.crs}")
# # print()

In [None]:
# regex patterns to identify which shapefiles to extract
patterns = [r"surv\d{3}p", r"well\d{3}s"]


# shp_dict = extract_specific_gdf_from_zip_url(shp_zip_urls, patterns)

# Look at the survey lines polygons and the surface wells points. Data saved from RRC website
shp_dict = extract_matching_shp_files_from_zip_urls_concurrent(SHP_ZIP_URLS, patterns)
shp_dict.keys()

In [None]:
# use the patterns to separate the gdf in the dict based on the pattern
surv_dict = {k: v for k, v in shp_dict.items() if re.search(patterns[0], k)}
well_dict = {k: v for k, v in shp_dict.items() if re.search(patterns[1], k)}
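The split relies on the shapefile naming convention captured by the two patterns: a prefix, a three-digit county code, and a one-letter suffix. The sketch below shows the same idea on toy keys. The key names and placeholder values are hypothetical, and `split_by_pattern` and `county_code` are illustrative helpers that are not defined in the notebook.

```python
import re

patterns = [r"surv\d{3}p", r"well\d{3}s"]


def split_by_pattern(items: dict, patterns: list) -> dict:
    """Group dictionary entries by the first-class pattern their key matches."""
    return {
        pattern: {k: v for k, v in items.items() if re.search(pattern, k)}
        for pattern in patterns
    }


def county_code(name: str):
    """Return the first three-digit run in a file name, or None if absent."""
    match = re.search(r"(\d{3})", name)
    return match.group(1) if match else None


# Toy stand-in for shp_dict; the values would be GeoDataFrames in the notebook
toy_files = {
    "surv003p": "survey polygons, county 003",
    "well003s": "surface wells, county 003",
    "surv135p": "survey polygons, county 135",
    "well135s": "surface wells, county 135",
    "airp003l": "a layer that matches neither pattern",
}
groups = split_by_pattern(toy_files, patterns)
```

Keys that match neither pattern are left out of every group, and `county_code` pulls out the same three digits that the `str.extract(r"(\d{3})")` calls use later in the notebook.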

In [None]:
# Concatenate the GeoDataFrames in surv_dict into a single GeoDataFrame
surv_data_gdf = concat_gdf_from_dict(surv_dict)

# Convert the column names to snake case for consistency
surv_data_gdf.columns = [pascal_to_snake(col) for col in surv_data_gdf.columns]
# add a column for the county_code
surv_data_gdf["county_code"] = surv_data_gdf["source_file"].str.extract(r"(\d{3})")

# Display a sample of 3 rows from the DataFrame
display(surv_data_gdf.sample(3))

# Display information about the DataFrame, including the number of non-null entries in each column
surv_data_gdf.info()

In [None]:
# Concatenate the GeoDataFrames in well_dict into a single GeoDataFrame
well_data_gdf = concat_gdf_from_dict(well_dict)

# Convert the column names to snake case for consistency
well_data_gdf.columns = [pascal_to_snake(col) for col in well_data_gdf.columns]
# get the county code from the source_file column
well_data_gdf["county_code"] = well_data_gdf["source_file"].str.extract(r"(\d{3})")

# Display a sample of 3 rows from the DataFrame
display(well_data_gdf.sample(3))

# Display information about the DataFrame, including the number of non-null entries in each column
well_data_gdf.info()

In [None]:
# add a column to the surv_data_gdf with the county number
# the county_number will be the three digits in the source_file column
surv_data_gdf["county_number"] = surv_data_gdf["source_file"].str.extract(r"(\d{3})")

# using just the geometry and the county_number columns, intersect with the permian basin gdf
surv_permian_gdf = surv_data_gdf[["geometry", "county_number"]].sjoin(
    shale_plays_gdf[["geometry"]], how="inner", predicate="intersects"
)
# see which counties are in the permian basin
pb_county_numbers = surv_permian_gdf["county_number"].unique().tolist()

# plot the survey lines in the permian basin
bg_map * gv.Polygons(
    surv_permian_gdf.to_crs("EPSG:4269"), vdims=["county_number"]
).opts(
    tools=["hover"],
    height=600,
    width=800,
    alpha=0.5,
    line_width=0,
    title="Permian Basin Survey Lines",
)

# surv_data_gdf.sample(3)

In [None]:
# see how land survey polygon data looks on map
pb_plot = shale_plays_gdf.hvplot(geo=True, color="red", alpha=0.5, line_width=0).opts(
    height=600, width=800
)
survey_plot = surv_data_gdf.hvplot(geo=True, color="blue", alpha=0.5, line_width=0)

# bg_map * survey_plot * pb_plot

In [None]:
# Get the intersection of the survey polygons and the Permian Basin polygon
survey_pb_gdf = gpd.overlay(surv_data_gdf, shale_plays_gdf, how="intersection")
survey_pb_gdf.info()

In [None]:
# survey_pb_gdf.explore()

In [None]:
# List of symnum codes, in the same order as the well types below
numbers = [
    2,
    3,
    4,
    5,
    6,
    7,
    8,
    9,
    10,
    11,
    17,
    18,
    19,
    20,
    21,
    22,
    23,
    73,
    74,
    75,
    76,
    77,
    86,
    87,
]

# List of well types. Some values were changed to allow for easier grouping
well_types = [
    "Permitted Location",
    "Dry Hole",
    "Oil/Gas",  # oil
    "Oil/Gas",  # gas
    "Oil/Gas",  # oil/gas
    "Plugged/Shut-in",  # oil
    "Plugged/Shut-in",  # gas
    "Canceled Location",
    "Plugged",  # Oil/Gas
    "Injection/Disposal",
    "Storage",  # oil
    "Storage",  # gas
    "Plugged/Shut-in",  # oil
    "Plugged/Shut-in",  # gas
    "Injection/Disposal",  # oil
    "Injection/Disposal",  # gas
    "Injection/Disposal",  # oil/gas
    "Brine Mining",
    "Water Supply",
    "Water Supply",  # oil
    "Water Supply",  # gas
    "Water Supply",  # oil/gas
    "Horizontal Well Surface Location",
    "Directional/Sidetrack Well Surface Location",
]

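# Create a dictionary from the two lists. It gets its own name so that the
# dictionary of well GeoDataFrames (well_dict) is not overwritten.
symnum_to_well_type = dict(zip(numbers, well_types))
# map the dictionary to the symnum column and fill the rare values with 'Other'
well_data_gdf["welltype"] = (
    well_data_gdf["symnum"].map(symnum_to_well_type).fillna("Other")
)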
# use pascal_to_snake function to convert the column names to snake case
well_data_gdf.columns = [pascal_to_snake(col) for col in well_data_gdf.columns]
well_data_gdf.rename(
    columns={"api": "api_short", "welltype": "well_type"}, inplace=True
)
well_data_gdf.sample(3)
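Pairing two parallel lists with `zip` is fragile: `zip` silently stops at the shorter list, and a missing comma between two adjacent string literals makes Python join them into one element, which shifts every later label onto the wrong code. The sketch below adds a length check before building the mapping. `build_code_map` and the three-code example are illustrative and not part of the notebook.

```python
def build_code_map(codes: list, labels: list) -> dict:
    """Pair codes with labels, refusing to continue if the lengths differ."""
    if len(codes) != len(labels):
        raise ValueError(f"{len(codes)} codes but {len(labels)} labels")
    return dict(zip(codes, labels))


codes = [9, 10, 11]

# A missing comma after "Plugged" merges two literals into a single element
labels_missing_comma = [
    "Canceled Location",
    "Plugged"  # no comma here, so this joins with the next string
    "Injection/Disposal",
]

labels_fixed = ["Canceled Location", "Plugged", "Injection/Disposal"]

code_map = build_code_map(codes, labels_fixed)
```

Calling `build_code_map(codes, labels_missing_comma)` raises a `ValueError` because the list holds only two elements. Writing the mapping as a single dictionary literal, with each code next to its label, avoids the alignment problem altogether.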

In [None]:
# spatial join of the registry_gdf(from fracfocus) and surv_data_gdf
registry_join_gdf = gpd.sjoin(
    registry_gdf[
        [
            "geometry",
            "api",
            "operator_name",
            "well_name",
            "state",
            "county_name",
            "county_number",
        ]
    ],
    surv_data_gdf,
).drop(columns=["index_right"])


registry_join_gdf.sort_values(by="api")

registry_join_gdf.county_name.value_counts()

# create an api_short column (county code + well number) from the api column

registry_join_gdf["api_short"] = registry_join_gdf["api"].str[2:10]


# merge the welltype column from well_data_gdf to registry_join_gdf on the api_short column

registry_join_gdf = (
    registry_join_gdf.merge(well_data_gdf[["api_short", "well_type"]], on="api_short")
    .drop(columns=["scrap_file", "level4_sur"])
    .rename(columns={"level2_blo": "block"})
)
# registry_join_gdf.explore()
# plot polygons using geoviews
bg_map * gv.Polygons(registry_join_gdf.to_crs("EPSG:4269"), vdims=["well_type"]).opts(
    **poly_opts, color="well_type", title="Well Types in the Permian Basin"
)

bg_map * registry_join_gdf.hvplot(
    geo=True,
    by="well_type",
    alpha=0.8,
    legend="right",
    width=800,
    height=600,
    size=1,
    muted_alpha=0.1,
)

### Geodatabase files taken from the Texas GLO (General Land Office)

These files contain both the active oil and gas leases managed by the Texas GLO and the active oil and gas units, which are oil and gas pooling agreements managed by the Texas GLO.

In [None]:
# # get the geodataframe of the active leases
# gdb_zips = ["../data/GDB_ActiveUnits.zip", "../data/GDB_ActiveLeases.zip"]

# active_gdb_dict = read_gdb_from_zip(gdb_zips)

In [None]:
# get the geodataframe of the active leases
# active_gdb_dict = read_gdb_from_zip_url(gdb_zip_urls)


# get the geodataframe of the active leases using concurrent futures
active_gdb_dict = read_gdb_from_zip_url_concurrent(GDB_ZIP_URLS)

In [None]:
active_gdb_dict

In [None]:
active_gdb_dict.keys()

In [None]:
# Read in the active lease geodatabase
active_leases_gdf = active_gdb_dict["OAG_Leases_Active"]
# clean column names
active_leases_gdf.columns = [pascal_to_snake(col) for col in active_leases_gdf.columns]
active_leases_gdf.columns

In [None]:
# get the columns with the date in it using regex
date_cols = [col for col in active_leases_gdf.columns if re.search(r"date", col)]
# add any other columns that should be dates
date_cols.extend(["lease_input"])

date_cols

In [None]:
# convert the date columns to datetime
active_leases_gdf[date_cols] = pd.concat(
    [pd.to_datetime(active_leases_gdf[col]) for col in date_cols], axis=1
)
# active_leases_gdf[date_cols] = active_leases_gdf[date_cols].fillna(
#     pd.Timestamp("1900-06-28")
# )

In [None]:
# select the columns of interest
columns_of_interest = date_cols + [
    "county",
    "geometry",
    "land_type",
    "primary_term_year",
    "original_lessee",
    "lessor",
    "field_name",
    "lease_type",
    "lease_status",
]

In [None]:
active_leases_gdf[columns_of_interest].info()
active_leases_gdf[active_leases_gdf[columns_of_interest].isna().any(axis=1)][
    columns_of_interest
].sort_values(by="effective_date", ascending=False)

In [None]:
# keep the original column order (a set difference would shuffle it)
non_date_cols = [col for col in columns_of_interest if col not in date_cols]
pd.concat(
    [
        active_leases_gdf[non_date_cols],
        active_leases_gdf[date_cols].astype(
            str
        ),  # the .explore() does not work with NaT in datetime columns
    ],
    axis=1,
).explore()

### Network

In [None]:
edge_cols = ["operator_name", "api_short", "county_number", "well_type", "block"]
# ensure that the api_short column is unique
network_df = (
    registry_join_gdf[edge_cols]
    .drop_duplicates(subset=["api_short"], keep="last")
    .reset_index(drop=True)
)


# Fill in the missing block values with "NSEW"
network_df["block"] = network_df["block"].fillna("NSEW")
# create a new column with the county_number-block
network_df["county_block"] = network_df["county_number"] + "-" + network_df["block"]

network_df.info()

# for col in network_df.columns:
#     print()
#     print(f"{col}: {network_df[col].nunique()}")

display(
    pd.DataFrame.from_records(
        [(col, network_df[col].nunique()) for col in network_df.columns],
        columns=["Column", "Unique Count"],
    )
)

# check if the same block value exist in 2 different counties
network_df.groupby("block")["county_number"].nunique().sort_values(
    ascending=False
).head(20)

In [None]:
import igraph as ig
from igraph import plot


# Create a graph from the well data
g = ig.Graph(directed=False)
network_df.info()
network_df.sample(3)

In [None]:
# Get unique operators and wells
operators = network_df["operator_name"].unique().tolist()
wells = network_df["api_short"].unique().tolist()
blocks = network_df["block"].unique().tolist()
county_numbers = network_df["county_number"].unique().tolist()

# Add all vertices to the graph
g.add_vertices(
    len(operators)
    # + len(blocks)
    + len(county_numbers)
)

# Assign attributes to operator vertices
g.vs[: len(operators)]["operator"] = operators

# Assign attributes to well vertices
# g.vs[len(operators) :]["well"] = wells

# Assign attributes to well vertices
# g.vs[len(operators) :]["block"] = blocks

# Assign attributes to county vertices
g.vs[
    len(operators)
    # + len(blocks)
    :
]["county"] = county_numbers

# Create a dictionary to map operators and counties to their vertex indices
vertex_indices = {operator: index for index, operator in enumerate(operators)}
index_to_operator = {index: operator for index, operator in enumerate(operators)}
# vertex_indices.update(
#     {block: index for index, block in enumerate(blocks, start=len(operators))}
# )
vertex_indices.update(
    {county: index for index, county in enumerate(county_numbers, start=len(operators))}
)

# Add the edges to the graph
edges = [
    (
        vertex_indices[operator],
        # vertex_indices[block],
        vertex_indices[county],
    )
    for operator, county in network_df[["operator_name", "county_number"]]
    .to_records(index=False)
    .tolist()
]
g.add_edges(edges)
g.es["well_type"] = network_df["well_type"].tolist()

# g.es["county_number"] = network_df["county_number"].tolist()
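The graph built above is bipartite: operators occupy the first block of vertex indices, counties the block after it, and each well contributes one operator-to-county edge. The sketch below reproduces that index bookkeeping in plain Python on toy data. It keys the lookup on a `(kind, name)` tuple so that an operator and a county with the same label cannot overwrite each other. `build_bipartite_edges` and the sample pairs are hypothetical.

```python
def build_bipartite_edges(pairs):
    """Return operators, counties, and index-based edges for a bipartite graph.

    pairs is a list of (operator_name, county_number) tuples, one per well.
    Operators get indices 0..n-1 and counties continue from n.
    """
    operators = sorted({operator for operator, _ in pairs})
    counties = sorted({county for _, county in pairs})
    index = {("operator", name): i for i, name in enumerate(operators)}
    index.update(
        {("county", name): i for i, name in enumerate(counties, start=len(operators))}
    )
    edges = [
        (index[("operator", operator)], index[("county", county)])
        for operator, county in pairs
    ]
    return operators, counties, edges


toy_pairs = [("Acme", "003"), ("Acme", "135"), ("Beta", "003")]
operators, counties, edges = build_bipartite_edges(toy_pairs)
```

The resulting edge list can be handed to `Graph.add_edges` once `len(operators) + len(counties)` vertices have been added, which is the same order of operations the cell above follows.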

In [None]:
# get the number of vertices and edges
print(f"Number of vertices: {g.vcount()}")
print(f"Number of edges: {g.ecount()}")

In [None]:
plot(g, layout="kk", edge_color="gray")

In [None]:
# compute the best partition for the graph
communities = g.as_undirected().community_multilevel()
print(f"Number of communities: {len(communities)}")
# print the number of vertices in each community
print([len(c) for c in communities])

# create a dictionary of the communities
community_dict = {i: c for i, c in enumerate(communities)}

node_community_dict = {
    g.vs[node]["operator"]
    if g.vs[node]["operator"]
    else g.vs[node]["county"]: community
    for community, nodes in community_dict.items()
    for node in nodes
}
# node_community_dict

In [None]:
# Create a color list for the communities
community_color_list = [
    "#" + "".join([random.choice("0123456789ABCDEF") for j in range(6)])
    for i in range(len(set(node_community_dict.values())))
]


net = Network(notebook=True, cdn_resources="remote")

# Add nodes to the network
for i, community in enumerate(communities):
    for node in community:
        vertex = g.vs[node]
        if vertex["operator"]:
            label = vertex["operator"]
            color = community_color_list[i]
            size = 10  # size for operators
        else:
            label = vertex["county"]
            color = community_color_list[i] + "80"  # half alpha
            size = 20  # size for counties
        # Label each node with its operator name or county number, not its index
        net.add_node(node, label=str(label), color=color, size=size)

for edge in g.es:
    net.add_edge(edge.source, edge.target)

net.show("node_community.html")

In [None]:
net = Network(notebook=True, cdn_resources="remote")
sources = network_df["operator_name"].values
targets = network_df["county_number"].values


edge_data = zip(sources, targets)
for e in edge_data:
    src = e[0]
    dst = e[1]
    net.add_node(src, label=src)
    net.add_node(dst, label=dst)
    net.add_edge(src, dst, value=1)

net.show("network.html")

In [None]:
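# Start from a fresh network so nodes from the previous cell are not reused
net = Network(notebook=True, cdn_resources="remote")

# Add the vertices to the network
for v in g.vs:
    # Every igraph vertex exposes every attribute key (unset values are None),
    # so test the value rather than the presence of the key
    is_operator = v["operator"] is not None
    label = title = v["operator"] if is_operator else v["county"]
    color = "#1f78b4" if is_operator else "#33a02c"
    size = 10 if is_operator else 5
    net.add_node(v.index, label=label, title=title, color=color, size=size)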

# Add the edges to the network
net.add_edges([(e.source, e.target) for e in g.es])

# net.show("network.html")
net.save_graph("../images/network.html")

In [None]:
# visualize the graph
layout = g.layout("kk")
ig.plot(
    g,
    layout=layout,
    vertex_size=5,
    vertex_label_size=5,
    vertex_label_dist=1,
    vertex_label_angle=3.14 / 2,
    edge_arrow_size=0.1,
    bbox=(1000, 1000),
    margin=100,
)

In [None]:
network_df["block"].value_counts()[50:100]

In [None]:
well_count = network_df.groupby("operator_name")["api_short"].count()
well_count

In [None]:
from pyvis.network import Network

# Create a new Pyvis network
net = Network(notebook=True, cdn_resources="remote")

# Add a node for each operator, block, and county. Operator nodes are sized by
# their well count (well_count, computed in the previous cell).
for index, row in network_df.iterrows():
    net.add_node(
        row["operator_name"],
        label=row["operator_name"],
        title=row["operator_name"],
        color="#1f78b4",
        size=well_count[row["operator_name"]] / 10,
    )
    net.add_node(
        row["block"],
        label=row["block"],
        title=row["block"],
        color="#33a02c",
        size=5,
    )
    net.add_node(
        row["county_number"],
        label=row["county_number"],
        title=row["county_number"],
        color="#FF0000",
        size=5,
    )
    net.add_edge(
        row["operator_name"],
        row["block"],
        value=0.5,
        size=0.5,
        color="rgba(200, 200, 200, 0.5)",
    )
    net.add_edge(
        row["operator_name"],
        row["county_number"],
        value=0.5,
        size=0.5,
        color="rgba(200, 0, 0, 0.5)",
    )

# Show the network
# net.show("network.html", local=False)

In [None]:
import networkx as nx

# Convert the Pyvis network to a NetworkX graph
G = nx.Graph(net.get_adj_list())

# Compute the best partition using the Louvain method
partition = nx.community.louvain_communities(G)

# Convert the communities to a dictionary
partition_dict = {node: i for i, comm in enumerate(partition) for node in comm}

# Add the community of each node as an attribute to the nodes in the Pyvis network
for node in net.nodes:
    node["title"] += f" - Community: {partition_dict[node['id']]}"

# Show the network
net.show_buttons(filter_=["physics"])
net.show("network.html")