## Data Wrangling

### Introduction

This project is part of the Capstone for the Springboard Data Science Career Track. The goal is to develop a machine learning model that ranks and predicts the likelihood that an oil company will initiate a frac job in a given county of the Permian Basin in the first quarter of 2024.

In [54]:
# initial imports
from functools import lru_cache
import re
import warnings
import pandas as pd
import geopandas as gpd
import pyproj
import cartopy.crs as ccrs
import numpy as np
from tqdm import tqdm
from urllib.request import urlopen
from sqlalchemy import create_engine
import geoviews.tile_sources as gts
import geoviews as gv
import holoviews as hv

hv.extension("bokeh")
gv.extension("bokeh")

In [2]:
# ignore all warnings
warnings.filterwarnings("ignore")

In [3]:
# Test initial print statement
print("CapstoneJourney begins!")

CapstoneJourney begins!


### Load data

In [4]:
# the bucket contains FracFocusRegistry_i.csv files for i from 1 to 24
# the bucket contains registryupload_i.csv files for i from 1 to 3
# the bucket also contains a readme.txt file

# First list of urls
data_urls1 = []
for i in range(1, 25):
    url_frame = f"https://storage.googleapis.com/mrprime_dataset/fracfocus/FracFocusRegistry_{i}.csv"
    data_urls1.append(url_frame)

# Second list of urls
data_urls2 = []
for j in range(1, 4):
    url_frame2 = f"https://storage.googleapis.com/mrprime_dataset/fracfocus/registryupload_{j}.csv"
    data_urls2.append(url_frame2)

data_url3 = ["https://storage.googleapis.com/mrprime_dataset/fracfocus/readme.txt"]

In [5]:
# get readme data
readme = urlopen(data_url3[0]).read().decode("windows-1252")
display(readme)

'FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017\r\n--------------------------------------------------------\r\nThis data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures \r\nlocatable through the FracFocus ‘Find a Well’ search.\r\n\r\n\r\nTable Name: RegistryUpload\r\n--------------------------\r\npKey - Key index for the table\r\n\r\nJobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.\r\n\r\nJobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.\r\n\r\nAPINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits \r\nrepresent the state, second three digits represent the county, third 5 digits represent the well.\r\n\r\nStateNumber - The first two digits of the API number.  Range is from 01-50.\r\n\r\nCountyNumber - The 

In [6]:
# print() writes the string itself rather than its repr, so the \r\n sequences become real line breaks
print(readme)

FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017
--------------------------------------------------------
This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures 
locatable through the FracFocus ‘Find a Well’ search.


Table Name: RegistryUpload
--------------------------
pKey - Key index for the table

JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.

JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.

APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits 
represent the state, second three digits represent the county, third 5 digits represent the well.

StateNumber - The first two digits of the API number.  Range is from 01-50.

CountyNumber - The 3 digit county code.

OperatorName - The name of the opera

In [7]:
# you can also neaten up the readme data yourself for it to be more compact
readme_as_list = readme.replace("\r", "").split("\n")
readme_as_list = [line.strip() for line in readme_as_list if line != ""]
display(readme_as_list)

['FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017',
 '--------------------------------------------------------',
 'This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures',
 'locatable through the FracFocus ‘Find a Well’ search.',
 'Table Name: RegistryUpload',
 '--------------------------',
 'pKey - Key index for the table',
 'JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.',
 'JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.',
 'APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits',
 'represent the state, second three digits represent the county, third 5 digits represent the well.',
 'StateNumber - The first two digits of the API number.  Range is from 01-50.',
 'CountyNumber - The 3 digit county co

In [8]:
# Collect all the dataframes into a list and then concatenate them
df_list = [pd.read_csv(url, low_memory=False) for url in tqdm(data_urls2)]


dfs = pd.concat(df_list).reset_index(drop=True)

100%|██████████| 3/3 [00:24<00:00,  8.24s/it]


In [9]:
registry_df = dfs.copy()
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   pKey                     213883 non-null  object 
 1   JobStartDate             213868 non-null  object 
 2   JobEndDate               213883 non-null  object 
 3   APINumber                213883 non-null  object 
 4   StateNumber              213883 non-null  int64  
 5   CountyNumber             213883 non-null  int64  
 6   OperatorName             213883 non-null  object 
 7   WellName                 213883 non-null  object 
 8   Latitude                 213883 non-null  float64
 9   Longitude                213883 non-null  float64
 10  Projection               213883 non-null  object 
 11  TVD                      183743 non-null  float64
 12  TotalBaseWaterVolume     183714 non-null  float64
 13  TotalBaseNonWaterVolume  163574 non-null  float64
 14  Stat

In [10]:
# Look at some of the rows of the dataframe
display(registry_df.tail(3))

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213880,361bd982-58d6-437d-9592-08aeb80fd738,10/11/2023 7:21:00 AM,11/5/2023 6:07:00 PM,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,...,10936.115961,23218520.0,0.0,Texas,Harrison,3,False,False,,
213881,f9fdc139-0f1e-4943-8a16-adb5152d862c,9/28/2023 9:43:00 PM,11/6/2023 7:29:00 AM,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,...,11022.313802,40457386.0,0.0,Texas,Harrison,3,False,False,,
213882,2241ec7e-f113-4f8e-8b61-8a74c9e03dc2,4/1/3012 12:00:00 AM,4/1/3012 12:00:00 AM,42227368950000,42,227,"Meritage Energy Company, LLC",Patterson #2713,32.175028,-101.505275,...,,,,Texas,Howard,1,False,False,,


We use Windows Authentication instead of the usual username and password to connect to the SQL Server. When connecting to a SQL Server database with Windows Authentication, you don't need to provide a username and password in your connection string. Instead, the system uses the credentials of the currently logged-in Windows user.
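
As a hedged aside, the same kind of connection can also be described with a raw ODBC string that is URL-encoded and passed through as `odbc_connect`. The sketch below only builds the URL with the standard library and does not connect to anything. The server and database names are the ones used in this notebook, and it assumes that "ODBC Driver 17 for SQL Server" is installed on the machine.

```python
from urllib.parse import quote_plus

# Raw ODBC connection string. Trusted_Connection=yes requests Windows
# Authentication, so no UID/PWD entries are included.
odbc_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    r"SERVER=ANDIE\SQLEXPRESS;"
    "DATABASE=FracFocusRegistry;"
    "Trusted_Connection=yes;"
)

# URL-encode the whole string so characters such as ';', '{' and '\' survive
# inside a SQLAlchemy URL.
conn_url = "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_str)
print(conn_url)
```

The resulting `conn_url` could then be handed to `create_engine` in place of a hand-assembled URL; whether that is preferable depends on the environment.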

In [11]:
# Define the server and database names
server_name = r"ANDIE\SQLEXPRESS"  # raw string keeps the backslash literal
database_name = "FracFocusRegistry"
table_name = "RegistryUpload"

# Create the connection
conn_str = f"mssql+pyodbc://@{server_name}/{database_name}?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"

# Create the engine
engine = create_engine(conn_str, echo=True)

df = pd.read_sql(f"SELECT * FROM {table_name}", engine)

df.info()

2023-11-16 13:14:40,772 INFO sqlalchemy.engine.Engine SELECT CAST(SERVERPROPERTY('ProductVersion') AS VARCHAR)
2023-11-16 13:14:40,773 INFO sqlalchemy.engine.Engine [raw sql] ()
2023-11-16 13:14:40,775 INFO sqlalchemy.engine.Engine SELECT schema_name()
2023-11-16 13:14:40,776 INFO sqlalchemy.engine.Engine [generated in 0.00085s] ()
2023-11-16 13:14:40,797 INFO sqlalchemy.engine.Engine SELECT CAST('test max support' AS NVARCHAR(max))
2023-11-16 13:14:40,798 INFO sqlalchemy.engine.Engine [generated in 0.00098s] ()
2023-11-16 13:14:40,799 INFO sqlalchemy.engine.Engine SELECT 1 FROM fn_listextendedproperty(default, default, default, default, default, default, default)
2023-11-16 13:14:40,800 INFO sqlalchemy.engine.Engine [generated in 0.00070s] ()
2023-11-16 13:14:40,884 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2023-11-16 13:14:40,885 INFO sqlalchemy.engine.Engine SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE ([INFORMATION_SCHEMA].[TABLE

In [12]:
df.tail(3)

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213847,361BD982-58D6-437D-9592-08AEB80FD738,2023-10-11 07:21:00,2023-11-05 18:07:00,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,...,10936.115961,23218520.0,0.0,Texas,Harrison,3.0,False,False,,
213848,F9FDC139-0F1E-4943-8A16-ADB5152D862C,2023-09-28 21:43:00,2023-11-06 07:29:00,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,...,11022.313802,40457386.0,0.0,Texas,Harrison,3.0,False,False,,
213849,2241EC7E-F113-4F8E-8B61-8A74C9E03DC2,3012-04-01 00:00:00,3012-04-01 00:00:00,42227368950000,42,227,"Meritage Energy Company, LLC",Patterson #2713,32.175028,-101.505275,...,,,,Texas,Howard,1.0,False,False,,



For some reason, the CSV data has more rows than the SQL Server table, judging by the last index values (213882 versus 213849). We will use the dataframe built from the CSV files, on the chance that it contains more data points. The last 3 rows of both dataframes hold the same values, so the extra rows were not simply appended at the end.
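
One detail matters when comparing the two sources: the `pKey` values are lowercase in the CSV preview and uppercase in the SQL preview. A minimal sketch of why the keys need to be normalized before taking a difference is shown below. The first two keys are copied from the previews above, and the SQL set is deliberately reduced to one key for illustration.

```python
# pKey values as they appear in the CSV preview (lowercase).
csv_keys = {
    "361bd982-58d6-437d-9592-08aeb80fd738",
    "f9fdc139-0f1e-4943-8a16-adb5152d862c",
}
# Hypothetical subset of the SQL keys (uppercase), reduced to one entry.
sql_keys = {"361BD982-58D6-437D-9592-08AEB80FD738"}

# A case-sensitive difference treats every key as missing.
naive_diff = csv_keys - sql_keys

# Lowercasing both sides first leaves only the key that is truly absent.
normalized_diff = {k.lower() for k in csv_keys} - {k.lower() for k in sql_keys}

print(len(naive_diff), len(normalized_diff))  # 2 1
```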

In [13]:
# Convert the columns to lowercase and create Index objects
registry_index = pd.Index(registry_df["pKey"].str.lower())
df_index = pd.Index(df["pKey"].str.lower())

# Use the difference method to find values in registry_index but not in df_index
diff = registry_index.difference(df_index)
# show the rows in registry_df that are not in df
print(f"Number of rows in registry_df that are not in df: {len(diff)}")
registry_df[registry_index.isin(diff)].sample(3)

Number of rows in registry_df that are not in df: 33


Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213433,f37b7cd5-b62d-4848-9649-b94176687677,9/27/2023 6:00:00 AM,9/28/2023 6:00:00 AM,49037329890000,49,37,Wexpro Company,Trail Unit 112,41.105143,-108.643109,...,6563.0,381097.0,0.0,Wyoming,Sweetwater,3,True,False,,
213732,017dd486-d993-416a-9796-ea3c91cb31d7,9/27/2023 12:00:00 AM,10/18/2023 12:00:00 AM,30025504760000,30,25,"EOG Resources, Inc.",AMAZING 19 FED #712H,32.381904,-103.70866,...,12004.66,19428864.0,0.0,New Mexico,Lea,3,False,False,,
213460,9454a919-4fb0-44c6-af08-e42090ddcc1c,9/26/2023 6:00:00 AM,9/29/2023 6:00:00 AM,49037329880000,49,37,Wexpro Company,Trail Unit 111,41.105153,-108.64316,...,7514.0,885488.0,0.0,Wyoming,Sweetwater,3,True,False,,



We can also see some obvious errors in the `JobStartDate` column (for example, a job dated in the year 3012), but before we start cleaning the columns, let's make the dataframe more Pythonic by changing the column names to snake_case.

### Data Cleaning

In [14]:
def pascal_to_snake(string):
    """Converts a string from PascalCase to snake_case"""
    # (?<=[A-Za-z0-9]) - positive lookbehind for any alphanumeric character
    # (?=[A-Z][a-z]) - positive lookahead for any uppercase followed by lowercase
    pattern = re.compile(r"(?<=[A-Za-z0-9])(?=[A-Z][a-z])")
    return pattern.sub("_", string).lower()


# create test cases for the function
test_cases = ["PascalCase", "camelCase", "snake_case", "kebab-case", "UPPERCASE"]
print([pascal_to_snake(case) for case in test_cases])

['pascal_case', 'camel_case', 'snake_case', 'kebab-case', 'uppercase']
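
The generic test cases do not include acronyms, which are common in this dataset's column names. As a small sketch, the same regular expression can be run against a few of the real column names to show how runs of capitals are handled. The function is repeated here so the snippet is self-contained.

```python
import re

# Same pattern as above: split before an uppercase letter that is followed
# by a lowercase letter, as long as something alphanumeric precedes it.
PATTERN = re.compile(r"(?<=[A-Za-z0-9])(?=[A-Z][a-z])")


def pascal_to_snake(name: str) -> str:
    return PATTERN.sub("_", name).lower()


columns = [
    "pKey",
    "JobStartDate",
    "APINumber",
    "TVD",
    "FFVersion",
    "TotalBaseNonWaterVolume",
]
converted = {col: pascal_to_snake(col) for col in columns}
for original, snake in converted.items():
    print(f"{original} -> {snake}")
```

Because the lookahead requires a lowercase letter after the capital, an acronym at the end of a name is not split off: a hypothetical name such as `userID` would become `userid`, not `user_id`.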


In [15]:
registry_df.columns = [pascal_to_snake(col) for col in registry_df.columns]
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 21 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   p_key                        213883 non-null  object 
 1   job_start_date               213868 non-null  object 
 2   job_end_date                 213883 non-null  object 
 3   api_number                   213883 non-null  object 
 4   state_number                 213883 non-null  int64  
 5   county_number                213883 non-null  int64  
 6   operator_name                213883 non-null  object 
 7   well_name                    213883 non-null  object 
 8   latitude                     213883 non-null  float64
 9   longitude                    213883 non-null  float64
 10  projection                   213883 non-null  object 
 11  tvd                          183743 non-null  float64
 12  total_base_water_volume      183714 non-null  float64
 13 

Next, we can remove the columns that contain only null values. These are the last 2 columns in the dataframe, `source` and `dtmod`. We can also drop the `total_base_non_water_volume` column, since we are unlikely to need it.


In [16]:
registry_df = registry_df.drop(
    columns=["source", "dtmod", "total_base_non_water_volume"]
)
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   p_key                    213883 non-null  object 
 1   job_start_date           213868 non-null  object 
 2   job_end_date             213883 non-null  object 
 3   api_number               213883 non-null  object 
 4   state_number             213883 non-null  int64  
 5   county_number            213883 non-null  int64  
 6   operator_name            213883 non-null  object 
 7   well_name                213883 non-null  object 
 8   latitude                 213883 non-null  float64
 9   longitude                213883 non-null  float64
 10  projection               213883 non-null  object 
 11  tvd                      183743 non-null  float64
 12  total_base_water_volume  183714 non-null  float64
 13  state_name               213881 non-null  object 
 14  coun

Next, we will fix the dtypes and names of some of the columns.
- Both the `job_start_date` and the `job_end_date` columns are object dtypes, so we will parse them as datetimes and keep only the date part, stored as `YYYY-MM-DD` strings.
- The `api_number` column is an object dtype, but it should be a string dtype, zero-padded to 14 characters. We can also shorten that column name to `api`.
- The `state_number` and `county_number` columns are both `int64` dtypes right now, but they are codes rather than quantities, so we will store them as zero-padded strings (2 and 3 characters respectively) that line up with the corresponding digits of the `api`.
- The `projection` column holds the Coordinate Reference System (CRS) used by the `latitude` and `longitude` values, so we will rename it to `crs`.
- The `federal_well` and `indian_well` columns are both boolean columns. They are more aptly named `is_federal_well` and `is_indian_well` respectively.
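
To make the date and padding steps concrete, here is a minimal standard-library sketch of the same ideas. The helper name `to_iso_date` is hypothetical, and the timestamp format is an assumption inferred from the values shown in the CSV preview. Returning `None` for unparseable input loosely mirrors what `errors="coerce"` does in pandas.

```python
from datetime import datetime
from typing import Optional


def to_iso_date(raw: str, fmt: str = "%m/%d/%Y %I:%M:%S %p") -> Optional[str]:
    """Parse a timestamp string and keep only the date part.

    Returns None when the value does not match the assumed format.
    """
    try:
        return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        return None


print(to_iso_date("10/11/2023 7:21:00 AM"))  # 2023-10-11
print(to_iso_date("not a date"))             # None

# Codes read as integers lose their leading zeros; zfill restores them.
state_code = str(30).zfill(2)   # '30'
county_code = str(25).zfill(3)  # '025'
print(state_code, county_code)
```

With the padding in place, `state_code + county_code` gives `30025`, which matches the first five digits of the Lea County API number `30025504760000` shown earlier.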

In [17]:
registry_df["job_start_date"] = pd.to_datetime(
    registry_df["job_start_date"], errors="coerce"
).dt.strftime("%Y-%m-%d")
registry_df["job_end_date"] = pd.to_datetime(
    registry_df["job_end_date"], errors="coerce"
).dt.strftime("%Y-%m-%d")
registry_df["api_number"] = registry_df["api_number"].astype("string").str.zfill(14)

registry_df["state_number"] = registry_df["state_number"].astype("string").str.zfill(2)
registry_df["county_number"] = (
    registry_df["county_number"].astype("string").str.zfill(3)
)

registry_df.rename(
    columns={
        "federal_well": "is_federal_well",
        "indian_well": "is_indian_well",
        "api_number": "api",
        "projection": "crs",
    },
    inplace=True,
)

registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   p_key                    213883 non-null  object 
 1   job_start_date           213866 non-null  object 
 2   job_end_date             213882 non-null  object 
 3   api                      213883 non-null  string 
 4   state_number             213883 non-null  string 
 5   county_number            213883 non-null  string 
 6   operator_name            213883 non-null  object 
 7   well_name                213883 non-null  object 
 8   latitude                 213883 non-null  float64
 9   longitude                213883 non-null  float64
 10  crs                      213883 non-null  object 
 11  tvd                      183743 non-null  float64
 12  total_base_water_volume  183714 non-null  float64
 13  state_name               213881 non-null  object 
 14  coun

The `state_name` column should not have more than 50 distinct values, given that there are only 50 states in the US. Yet the column contains 95 unique values, because the same state is entered in several different ways. Although it is less obvious, we can assume the same for the `county_name` column. Luckily, the `api` includes both the `state_number` and the `county_number`. With this we can:
1. Validate the data by ensuring that these corresponding columns match.
2. Ensure that the `state_name` and the `county_name` columns have no variations and are standardized with the official FIPS (Federal Information Processing Standard) codes.
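
The validation idea in step 1 can be sketched without pandas. The names `ApiParts`, `split_api` and `matches` are hypothetical helpers written for this illustration. The slicing follows the data dictionary: two digits for the state, three for the county and five for the well.

```python
from typing import NamedTuple


class ApiParts(NamedTuple):
    state: str
    county: str
    well: str


def split_api(api: str) -> ApiParts:
    """Split an API number into its state, county and well segments."""
    digits = api.replace("-", "")
    return ApiParts(digits[0:2], digits[2:5], digits[5:10])


def matches(api: str, state_number: int, county_number: int) -> bool:
    """Check that the API prefix agrees with the separate code columns."""
    parts = split_api(api)
    return (
        parts.state == str(state_number).zfill(2)
        and parts.county == str(county_number).zfill(3)
    )


print(split_api("42203355450000"))
print(matches("42203355450000", 42, 203))  # True
print(matches("00004226932868", 42, 269))  # False
```

The second call fails because a short API number that has been left-padded with zeros no longer starts with its state code.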

In [18]:
print(
    f'Number of different values in state_name column: {registry_df["state_name"].nunique()}'
)
print(
    f'Number of different values in state_number column: {registry_df["state_number"].nunique()}'
)

Number of different values in state_name column: 95
Number of different values in state_number column: 28


In [19]:
# check which rows may have the api with the first two digits not matching the state number
api_state_mismatch_mask = (
    registry_df["state_number"].astype("string") != registry_df["api"].str[0:2]
)
registry_df[api_state_mismatch_mask]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well
50509,4e73a0eb-a744-46d0-a2e0-d6048f2dbb46,2013-05-25,2013-05-25,4226932868,42,269,"Medders Oil Company, Inc.",Pitchfork IIII #5,33.56366,-100.49796,WGS84,3794.0,10794.0,Texas,King,2,False,False
197331,5ce7e732-e28d-4451-a679-2e9485f9539a,2022-05-22,2022-06-23,423714037500,42,371,Diamondback E&P LLC,MOORE SHARK 10 9 Unit 5WA,31.2352,-103.136,NAD27,10880.0,30288566.0,Texas,Pecos,3,False,False


In [20]:
# Remove leading zeros and pad to 14 digits on mismatches
registry_df.loc[api_state_mismatch_mask, "api"] = (
    registry_df.loc[api_state_mismatch_mask, "api"].str.lstrip("0").str.ljust(14, "0")
)
# check again for mismatches
registry_df[registry_df["state_number"].astype("string") != registry_df["api"].str[0:2]]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well
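
The two mismatches come from API numbers that were shorter than 14 digits, so the earlier `zfill(14)` padded them on the wrong side. The repair can be traced step by step with plain strings, using the two values from the mismatch table:

```python
short_apis = ["4226932868", "423714037500"]

repaired = {}
for api in short_apis:
    left_padded = api.zfill(14)                             # zeros added in front
    fixed = left_padded.lstrip("0").ljust(14, "0")          # zeros moved to the end
    repaired[api] = fixed
    print(f"{api:>14} -> {left_padded} -> {fixed}")
```

Note that `lstrip("0")` would also remove a genuine leading zero from a state code between `01` and `09`. That is harmless here only because the repair is restricted to the mismatched rows, both of which belong to state `42`.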


In [21]:
# check which rows may have the api with the 3-5 digits not matching the county number
api_county_mismatch_mask = (
    registry_df["county_number"].astype("string") != registry_df["api"].str[2:5]
)
registry_df[api_county_mismatch_mask]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well


In [22]:
# group by state_number and find the mode of the state_name
state_number_mode = (
    registry_df.groupby("state_number")["state_name"]
    .apply(lambda x: x.mode().iloc[0])
    .reset_index()
)
registry_df = registry_df.merge(
    state_number_mode.rename(columns={"state_name": "state"}),
    on="state_number",
)
registry_df.sample(3)

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,state
189445,1e76da07-02d9-4956-9fbe-0a97b0177e90,2017-02-27,2017-03-07,35063246220000,35,63,Bravo Arkoma LLC,Bruner 1-35/26H,35.142223,-96.113377,NAD27,4691.0,9987091.0,Oklahoma,Hughes,3,False,False,Oklahoma
57651,62f7ba90-258c-4a36-a06c-e60df0ee0e00,2016-07-13,2016-07-18,42461401730000,42,461,"Bold Operating, LLC",Bold Hamman 30 #1HM,31.414639,-101.931306,NAD27,9368.0,10760652.0,Texas,Upton,2,False,False,Texas
37557,266c3739-78d2-444f-8890-693c962e5bf9,2014-05-01,2014-05-01,42479426120000,42,479,Rosetta Resources,Gates 09 Rose B GU 9 7-17,28.10765,-99.70281,NAD27,8747.0,4074294.0,Texas,Webb,2,False,False,Texas
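
The idea behind the new `state` column is to pick the most common spelling per state number. A minimal sketch of that technique with the standard library is shown below. The spelling variants are made up for illustration and are not taken from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (state_number, state_name) pairs with inconsistent spellings.
rows = [
    ("42", "Texas"),
    ("42", "TEXAS"),
    ("42", "Texas"),
    ("42", "Tx"),
    ("35", "Oklahoma"),
    ("35", "oklahoma"),
    ("35", "Oklahoma"),
]

# Count how often each spelling occurs for each state number.
names_by_code = defaultdict(Counter)
for code, name in rows:
    names_by_code[code][name] += 1

# Keep the most frequent spelling as the canonical name.
canonical = {
    code: counts.most_common(1)[0][0] for code, counts in names_by_code.items()
}
print(canonical)  # {'42': 'Texas', '35': 'Oklahoma'}
```

Tie-breaking differs between the two approaches: `Counter.most_common` keeps the spelling seen first, while pandas returns modes in sorted order, so `mode().iloc[0]` picks the smallest value among ties.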


We will focus on the most recent 10 years. Although more data is usually better, data from too far in the past may mislead whatever model we build, since unconventional drilling practices have taken over the industry in the meantime. We will also focus on one specific area, the Permian Basin, which has been instrumental in the shale boom and is currently the most active area of exploration and production in the US.

In [23]:
# create a mask for 2013 onwards; YYYY-MM-DD strings sort chronologically, so a string comparison works
post_2012_mask = registry_df["job_start_date"] >= "2013-01-01"
registry_df_post_2012 = registry_df[post_2012_mask].copy()

# find all the rows with null values
null_mask = registry_df_post_2012.isna().any(axis=1)
registry_df_post_2012[null_mask]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,state
61454,7534cb0b-9f85-4ca2-a21c-31be328456f8,2017-04-10,2017-04-10,42173374020000,42,173,"Cinnabar Energy, LTD.",Thomas 4101HD,31.996472,-101.568598,NAD27,9310.0,,Texas,Glasscock,3,False,False,Texas
61965,c57e59c6-d132-4a13-bb0b-8fb3e58ce918,2017-05-08,2017-05-08,42077352710000,42,77,Lane Operating Company,Dillard A Unit No. 1,33.591273,-98.138934,NAD27,4600.0,,Texas,Clay,3,False,False,Texas
62626,a2ad142e-6a38-44ed-8ebc-8572e43236b7,2017-06-13,2017-06-13,42237401310000,42,237,"Blakenergy Operating, LLC",Garner #2,33.434316,-98.227896,NAD27,6350.0,,Texas,Jack,3,False,False,Texas
84840,fcbf3967-867e-4e4e-af18-d83e7a82c604,2020-04-19,2020-04-19,42461378800000,42,461,COG Operating LLC,Powell 36 7,31.523085,-102.118329,NAD27,,,Texas,Upton,1,False,False,Texas
89052,ead37751-a53f-4ed9-99df-8cac1ebf1e15,2021-05-23,2021-05-23,42173346750000,42,173,Berry Petroleum,Talon #4,31.950785,-101.775302,NAD83,,,Texas,Glasscock,1,False,False,Texas
89305,4753a32e-39cb-4994-b69b-8acb36597463,2021-06-08,2021-06-08,42115334560000,42,115,Pioneer Natural Resources,Echols 10 #1,32.531683,-102.096109,NAD27,,,Texas,Dawson,1,False,False,Texas
98083,b76c7cc4-f1c1-4711-9168-ba819f41d070,2022-10-16,2022-10-27,42479446890000,42,479,Lewis Energy Group,HAMILTON NO. 34H,27.95851,-99.567909,WGS84,10456.0,,Texas,Webb,3,False,False,Texas
98087,99463560-ccb5-4c07-9537-8abdc731208b,2022-10-16,2022-10-27,42479446880000,42,479,Lewis Energy Group,HAMILTON NO. 33H,27.95851,-99.567956,WGS84,10437.0,,Texas,Webb,3,False,False,Texas
139253,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-05-28,2022-06-08,33610338000000,33,610,Hunt Oil Company,TRULSON 156-90-11-14H-3,48.355506,-102.212124,NAD83,8872.4,13496802.0,North Dakota,,3,False,False,North Dakota
139256,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-05-28,2022-06-09,33610338100000,33,610,Hunt Oil Company,PALERMO 156-90-2-31H-5,48.355506,-102.21233,NAD83,8782.65,14820967.0,North Dakota,,3,False,False,North Dakota


In [24]:
# the rows with a null value for the county_name column
registry_df_post_2012[registry_df_post_2012["county_name"].isna()]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,crs,tvd,total_base_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well,state
139253,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-05-28,2022-06-08,33610338000000,33,610,Hunt Oil Company,TRULSON 156-90-11-14H-3,48.355506,-102.212124,NAD83,8872.4,13496802.0,North Dakota,,3,False,False,North Dakota
139256,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-05-28,2022-06-09,33610338100000,33,610,Hunt Oil Company,PALERMO 156-90-2-31H-5,48.355506,-102.21233,NAD83,8782.65,14820967.0,North Dakota,,3,False,False,North Dakota
178445,692d9381-748e-4e5f-b83f-30f868f18882,2019-11-19,2019-11-19,3729439000000,3,729,WFD Oil Corporation,Vanorsdale,0.123455,-0.12345,NAD27,2442.0,22134.0,Arkansas,,3,False,False,Arkansas
206589,87ea1a50-ab40-4956-8051-d39bf139ae53,2020-11-11,2020-12-01,43317428660000,43,317,Endeavor Energy Resources,Rhea 1-6 Unit 1 #133,32.409144,-101.808518,NAD83,8262.0,17519292.0,Utah,,3,False,False,Utah


### Geodataframe

Looking at the `latitude` and `longitude` values for index `178445` (the third row above, in Arkansas), we can see that they are not valid coordinates for a US well. We can drop that row.<br>
The other three rows have state or county numbers that do not resolve to a county, but we may be able to recover the county from the `latitude` and `longitude` coordinates. Let's find out which counties these wells, listed under `North Dakota` and `Utah`, are located in.
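
A quick plausibility check makes the coordinate problem explicit. The sketch below uses a rough bounding box for the contiguous US. The limits are an assumption chosen for illustration (they would wrongly reject Alaska, for example), and `looks_like_conus` is a hypothetical helper.

```python
# Approximate bounding box of the contiguous US (assumed values).
LAT_RANGE = (24.0, 50.0)
LON_RANGE = (-125.0, -66.0)


def looks_like_conus(lat: float, lon: float) -> bool:
    """Return True when the point falls inside the rough bounding box."""
    return (
        LAT_RANGE[0] <= lat <= LAT_RANGE[1]
        and LON_RANGE[0] <= lon <= LON_RANGE[1]
    )


# (latitude, longitude) pairs taken from the rows above.
wells = {
    "Vanorsdale": (0.123455, -0.12345),
    "TRULSON 156-90-11-14H-3": (48.355506, -102.212124),
    "Rhea 1-6 Unit 1 #133": (32.409144, -101.808518),
}
checks = {name: looks_like_conus(lat, lon) for name, (lat, lon) in wells.items()}
for name, ok in checks.items():
    print(f"{name}: {'plausible' if ok else 'implausible'}")
```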



In [25]:
# Arkansas well with null county_name and incorrect lat/long values
display(registry_df_post_2012.loc[178445])

p_key                      692d9381-748e-4e5f-b83f-30f868f18882
job_start_date                                       2019-11-19
job_end_date                                         2019-11-19
api                                              03729439000000
state_number                                                 03
county_number                                               729
operator_name                               WFD Oil Corporation
well_name                                            Vanorsdale
latitude                                               0.123455
longitude                                              -0.12345
crs                                                       NAD27
tvd                                                      2442.0
total_base_water_volume                                 22134.0
state_name                                             Arkansas
county_name                                                 NaN
ff_version                              

In [26]:
# drop row at index value 178445
registry_df_post_2012 = registry_df_post_2012.drop(index=178445)

We can get the boundary coordinates for all of the counties in the US on the census.gov website.

In [27]:
census_county_map_url = (
    "https://www2.census.gov/geo/tiger/TIGER2022/COUNTY/tl_2022_us_county.zip"
)


county = gpd.read_file(census_county_map_url)[
    ["GEOID", "STATEFP", "COUNTYFP", "NAME", "geometry"]
]
county.columns = county.columns.str.lower()
county.sample(3)

Unnamed: 0,geoid,statefp,countyfp,name,geometry
3234,54099,54,99,Wayne,"POLYGON ((-82.30872 38.28106, -82.30874 38.280..."
1449,37161,37,161,Rutherford,"POLYGON ((-82.15520 35.52049, -82.15503 35.520..."
563,17023,17,23,Clark,"POLYGON ((-88.01394 39.48079, -88.01318 39.480..."


The county dataframe does not include the state name, so for convenience we will scrape the FIPS table from Wikipedia and merge the two tables.
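
The merge key is the 5-digit `geoid`, and it only works if both sides store it as a zero-padded string. A small dictionary-based sketch of the join is shown below. The rows are illustrative: Franklin County, Ohio appears in this notebook's samples, while Autauga County, Alabama (`01001`) is added to show a code with a leading zero.

```python
# County rows as they come from the shapefile: GEOID is already a string.
county_rows = [
    {"geoid": "39049", "name": "Franklin"},
    {"geoid": "01001", "name": "Autauga"},
]

# FIPS rows as they might arrive from an HTML table: codes parsed as integers.
fips_rows = [
    {"geoid": 39049, "county": "Franklin County", "state": "Ohio"},
    {"geoid": 1001, "county": "Autauga County", "state": "Alabama"},
]

# Normalize the key to a 5-character string before joining.
fips_by_geoid = {str(row["geoid"]).zfill(5): row for row in fips_rows}

merged = [
    {
        **county,
        "county": fips_by_geoid[county["geoid"]]["county"],
        "state": fips_by_geoid[county["geoid"]]["state"],
    }
    for county in county_rows
    if county["geoid"] in fips_by_geoid
]
for row in merged:
    print(row)
```

Without the `zfill(5)` step, the Autauga row would be keyed as `1001` and silently drop out of the join.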

In [28]:
fips_wiki_url = (
    "https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county"
)
fips_df = pd.read_html(fips_wiki_url)[1]
fips_df.columns = ["geoid", "county", "state"]
fips_df["geoid"] = fips_df["geoid"].astype("string").str.zfill(5)
fips_df.sample(3)

Unnamed: 0,geoid,county,state
2078,39049,Franklin County,Ohio
929,20069,Gray County,Kansas
1773,33005,Cheshire County,New Hampshire


In [29]:
county_fips_gdf = county.merge(fips_df, on="geoid")
county_fips_gdf.sample(3)

Unnamed: 0,geoid,statefp,countyfp,name,geometry,county,state
265,17151,17,151,Pope,"POLYGON ((-88.48289 37.35345, -88.48291 37.353...",Pope County,Illinois
313,47165,47,165,Sumner,"POLYGON ((-86.56583 36.30515, -86.56587 36.304...",Sumner County,Tennessee
1102,72005,72,5,Aguadilla,"POLYGON ((-67.08742 18.44141, -67.08731 18.440...",Aguadilla Municipality,Puerto Rico


In [30]:
county_fips_gdf.crs

<Geographic 2D CRS: EPSG:4269>
Name: NAD83
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands

Commonly used datums in North America are NAD27, NAD83, and WGS84. More info [here](https://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Projection_basics_the_GIS_professional_needs_to_know).<br>

The county geodataframe uses `EPSG:4269`, which is the EPSG code for the NAD83 coordinate system. Let's create a geodataframe from the `latitude` and `longitude` values that we have and reproject all of the points to that same CRS.
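
Before any reprojection, the points have to be grouped by the datum they were recorded in, because each group needs its own transformation. The sketch below shows only that grouping step with the standard library, leaving the actual reprojection out. The EPSG codes are the standard geographic codes for these three datums, and the sample points are taken from rows shown earlier in this notebook.

```python
from collections import defaultdict

# Geographic EPSG codes for the three datums found in the `crs` column.
DATUM_TO_EPSG = {"NAD27": "EPSG:4267", "NAD83": "EPSG:4269", "WGS84": "EPSG:4326"}
TARGET_CRS = "EPSG:4269"

points = [
    {"crs": "NAD27", "longitude": -101.931306, "latitude": 31.414639},
    {"crs": "NAD83", "longitude": -102.212124, "latitude": 48.355506},
    {"crs": "WGS84", "longitude": -99.567909, "latitude": 27.95851},
    {"crs": "NAD27", "longitude": -103.136, "latitude": 31.2352},
]

# Bucket the points by source CRS so each bucket can be handled in one go.
groups = defaultdict(list)
for point in points:
    groups[DATUM_TO_EPSG[point["crs"]]].append(point)

for epsg, members in groups.items():
    action = "keep as-is" if epsg == TARGET_CRS else f"reproject to {TARGET_CRS}"
    print(f"{epsg}: {len(members)} point(s), {action}")
```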

In [31]:
def unify_crs(
    dataframe: pd.DataFrame,
    lon_col: str = "longitude",
    lat_col: str = "latitude",
    crs_col: str = "crs",
    final_crs: str = "EPSG:4269",
):
    """
    Given a DataFrame with lon/lat or x/y coordinates,
    converts the coordinates to a unified crs and combines
    into a single GeoDataframe with a geometry column.
    """

    # Define the main columns that will be used for the conversion
    main_cols = [lon_col, lat_col, crs_col]

    # Get the other columns in the dataframe
    other_cols = list(set(dataframe.columns) - set(main_cols))

    # Create a subframe with only the main columns
    subframe = dataframe[main_cols]

    # Create a list of GeoDataFrames, each with a different CRS
    geo_dfs = [
        gpd.GeoDataFrame(
            # Use the data for this CRS
            data=data,
            # Create a geometry column from the lon/lat columns
            geometry=gpd.points_from_xy(x=data[lon_col].values, y=data[lat_col].values),
            # Set the CRS for this GeoDataFrame
            crs=pyproj.CRS(crs_val),
            # Convert the GeoDataFrame to the final CRS
        ).to_crs(final_crs)
        # Do this for each unique CRS in the subframe
        for crs_val, data in subframe.groupby(crs_col)
    ]

    # Merge the GeoDataFrames back together and return the result
    return pd.merge(
        # Concatenate the GeoDataFrames
        pd.concat(geo_dfs, sort=True),
        # Add the other columns back in
        dataframe[other_cols],
        # Merge on the index
        left_index=True,
        right_index=True,
    )

In [32]:
# build a GeoDataFrame whose geometry column is derived from the lat/long columns
registry_gdf = unify_crs(registry_df_post_2012, crs_col="crs")

In [33]:
registry_gdf[registry_gdf["county_name"].isna()].sjoin(
    county_fips_gdf, how="left", predicate="intersects"
).drop(columns=["index_right"])

Unnamed: 0,crs,geometry,latitude,longitude,state_number,job_end_date,operator_name,is_indian_well,county_number,total_base_water_volume,...,p_key,job_start_date,county_name,state_name,geoid,statefp,countyfp,name,county,state_right
139253,NAD83,POINT (-102.21212 48.35551),48.355506,-102.212124,33,2022-06-08,Hunt Oil Company,False,610,13496802.0,...,0f8e944c-87d7-4d84-8d56-4b8a5f1cba94,2022-05-28,,North Dakota,38061,38,61,Mountrail,Mountrail County,North Dakota
139256,NAD83,POINT (-102.21233 48.35551),48.355506,-102.21233,33,2022-06-09,Hunt Oil Company,False,610,14820967.0,...,6e5d8284-29d1-4b3d-a72c-729d95c327de,2022-05-28,,North Dakota,38061,38,61,Mountrail,Mountrail County,North Dakota
206589,NAD83,POINT (-101.80852 32.40914),32.409144,-101.808518,43,2020-12-01,Endeavor Energy Resources,False,317,17519292.0,...,87ea1a50-ab40-4956-8051-d39bf139ae53,2020-11-11,,Utah,48317,48,317,Martin,Martin County,Texas


For the two `North Dakota` rows, `county_number` should have been `061` (`Mountrail` County), not `610`. For the row labeled `Utah`, the `county_number` was actually correct; the error was in the state number, which should have been `42` (Texas), not `43`. This error matters because, according to the data dictionary:
> APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits represent the state, second three digits represent the county, third 5 digits represent the well.<br>

This means the `api` number for that row is also incorrect: it should be `42317428660000` (<u>42</u>-317-42866-0000) instead of `43317428660000` (<u>43</u>-317-42866-0000).<br>
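The API number layout quoted from the data dictionary can be checked with a few lines of plain Python. The sketch below is illustrative only: the helper names `split_api_number` and `replace_state_code` are hypothetical and not part of the notebook, and it assumes the API number is a 14-digit string, optionally written with dashes.

```python
def split_api_number(api: str) -> dict:
    """Split a 14-digit API number into the parts named in the data dictionary."""
    digits = api.replace("-", "")
    if len(digits) != 14 or not digits.isdigit():
        raise ValueError(f"expected 14 digits, got {api!r}")
    return {
        "state": digits[:2],     # first two digits: state
        "county": digits[2:5],   # next three digits: county
        "well": digits[5:10],    # next five digits: well
        "suffix": digits[10:],   # trailing four digits
    }


def replace_state_code(api: str, state_code: str) -> str:
    """Return the API number with its two-digit state code swapped out."""
    parts = split_api_number(api)
    return state_code + parts["county"] + parts["well"] + parts["suffix"]


# The Martin County row: state code 43 should have been 42
wrong_api = "43317428660000"
fixed_api = replace_state_code(wrong_api, "42")
print(split_api_number(wrong_api))
print(fixed_api)
```

Keeping the codes as strings preserves leading zeros, which matters for county codes such as `061`.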

In [42]:
permian_states = ["New Mexico", "Texas"]
counties_nm_tx_gdf = county_fips_gdf[county_fips_gdf["state"].isin(permian_states)]
registry_nm_tx_gdf = registry_gdf.sjoin(counties_nm_tx_gdf).drop(
    columns=["index_right"]
)

In [43]:
registry_nm_tx_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 95969 entries, 18356 to 146745
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   crs                      95969 non-null  object  
 1   geometry                 95969 non-null  geometry
 2   latitude                 95969 non-null  float64 
 3   longitude                95969 non-null  float64 
 4   state_number             95969 non-null  string  
 5   job_end_date             95969 non-null  object  
 6   operator_name            95969 non-null  object  
 7   is_indian_well           95969 non-null  bool    
 8   county_number            95969 non-null  string  
 9   total_base_water_volume  95960 non-null  float64 
 10  is_federal_well          95969 non-null  bool    
 11  state_left               95969 non-null  object  
 12  ff_version               95969 non-null  int64   
 13  tvd                      95966 non-null  float64 
 14

In [55]:
@lru_cache(maxsize=3)
def get_background_map(bgcolor="black", alpha=0.5):
    """Returns a GeoViews background map"""
    return gts.CartoLight().opts(bgcolor=bgcolor, alpha=alpha)


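def platecaree_to_mercator_vectorised(x, y):
    """Use Cartopy to convert PlateCarree (lon/lat) coordinates to Web Mercator"""
    # transform_points is documented for NumPy arrays, so convert pandas Series first
    x, y = np.asarray(x), np.asarray(y)
    return ccrs.GOOGLE_MERCATOR.transform_points(ccrs.PlateCarree(), x, y)[:, :2]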


bg_map = get_background_map()
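For intuition about what `platecaree_to_mercator_vectorised` computes, here is a minimal sketch of the standard spherical Web Mercator formulas in plain Python. It is not the Cartopy implementation; the function name `lonlat_to_web_mercator` is hypothetical, and the results should agree closely with `ccrs.GOOGLE_MERCATOR` only under the assumption that the inputs are longitude/latitude in degrees.

```python
import math

# Radius of the sphere used by Web Mercator (EPSG:3857), in metres
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple:
    """Project longitude/latitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# One of the Martin County, Texas points from the registry
x, y = lonlat_to_web_mercator(-101.808518, 32.409144)
print(x, y)
```

Because `y` depends on `tan(pi/4 + lat/2)`, the projection is undefined at the poles and for any latitude outside the -90 to 90 range, so invalid coordinates cannot be placed on the map.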

In [56]:
registry_nm_tx_gdf["geometry"].x

18356     -98.186278
18387     -98.221913
18460     -98.660630
18480     -98.329383
18677     -98.214256
             ...    
143411   -104.948518
143412   -104.905974
144269   -103.595539
146121   -103.371180
146745   -103.815562
Length: 95969, dtype: float64

In [62]:
bg_map * gv.Points(registry_nm_tx_gdf["geometry"]).opts(
    color="red", size=1, tools=["hover"], width=800, height=600, alpha=0.5
)

In [66]:
# Convert the coordinates to Mercator
mercator_coords = platecaree_to_mercator_vectorised(
    registry_nm_tx_gdf["geometry"].x, registry_nm_tx_gdf["geometry"].y
)

# Round the coordinates and create a DataFrame
mer_points = pd.DataFrame(np.round(mercator_coords), columns=["x", "y"])

# Create a Points object for plotting
gpoints = gv.Points(
    mer_points.reset_index(), ["x", "y"], ["index"], crs=ccrs.GOOGLE_MERCATOR
).opts(height=600, width=800, color="skyblue", size=1, tools=["hover"])

# Create a layout with the background map and the points
layout = bg_map * gpoints
layout

We can see that some `latitude` values are > 90 or < -90 and some `longitude` values are > 180 or < -180. Those are out of this world! Let's check them out.

In [None]:
# get all the rows with latitude > 90 or < -90, or longitude > 180 or < -180
bad_lat = registry_df["latitude"].abs() > 90
bad_lon = registry_df["longitude"].abs() > 180
registry_df[bad_lat | bad_lon]

In [77]:
registry_df["operator_name"].str.upper().str.strip().value_counts().sort_index()

operator_name
1776 ENERGY OPERATORS                    3
1849 ENERGY PARTNERS OPERATING, LLC      9
1859 OPERATING LLC                      98
2015 OIL, LLC                            1
3-M ENERGY CORPORATION                   4
                                      ... 
ZARGON OIL (ND) INC.                     1
ZARVONA ENERGY                          39
ZAVANNA, LLC                           138
ZENERGY, INC.                           50
ZTC PETRO INVESTMENTS LP                 5
Name: count, Length: 1830, dtype: int64