## Data Wrangling

### Introduction

This notebook is part of a Capstone project for the Springboard Data Science Career Track. The goal is to develop a machine learning model that predicts, and ranks by likelihood, whether an oil company will initiate a frac job in a given county within the Permian Basin in the first quarter of 2024.

In [3]:
# initial imports

import re
import warnings
import pandas as pd
import numpy as np
from tqdm import tqdm
from urllib.request import urlopen
from sqlalchemy import create_engine

In [4]:
# ignore all warnings
warnings.filterwarnings("ignore")

In [5]:
# Test initial print statement
print("CapstoneJourney begins!")

CapstoneJourney begins!


In [6]:
# the bucket holds FracFocusRegistry_i.csv files for i in the range 1-24
# the bucket holds registryupload_i.csv files for i in the range 1-3
# the bucket also holds a readme.txt file

# First list of urls
data_urls1 = []
for i in range(1, 25):
    url_frame = f"https://storage.googleapis.com/mrprime_dataset/fracfocus/FracFocusRegistry_{i}.csv"
    data_urls1.append(url_frame)

# Second list of urls
data_urls2 = []
for j in range(1, 4):
    url_frame2 = f"https://storage.googleapis.com/mrprime_dataset/fracfocus/registryupload_{j}.csv"
    data_urls2.append(url_frame2)

data_url3 = ["https://storage.googleapis.com/mrprime_dataset/fracfocus/readme.txt"]

In [7]:
# get readme data
readme = urlopen(data_url3[0]).read().decode("windows-1252")
display(readme)

'FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017\r\n--------------------------------------------------------\r\nThis data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures \r\nlocatable through the FracFocus ‘Find a Well’ search.\r\n\r\n\r\nTable Name: RegistryUpload\r\n--------------------------\r\npKey - Key index for the table\r\n\r\nJobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.\r\n\r\nJobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.\r\n\r\nAPINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits \r\nrepresent the state, second three digits represent the county, third 5 digits represent the well.\r\n\r\nStateNumber - The first two digits of the API number.  Range is from 01-50.\r\n\r\nCountyNumber - The 

In [8]:
# the print function goes beyond 'hello world': it renders escape characters such as \r\n as line breaks
print(readme)

FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017
--------------------------------------------------------
This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures 
locatable through the FracFocus ‘Find a Well’ search.


Table Name: RegistryUpload
--------------------------
pKey - Key index for the table

JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.

JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.

APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits 
represent the state, second three digits represent the county, third 5 digits represent the well.

StateNumber - The first two digits of the API number.  Range is from 01-50.

CountyNumber - The 3 digit county code.

OperatorName - The name of the opera

In [9]:
# you can also neaten up the readme data yourself for it to be more compact
readme_as_list = readme.replace("\r", "").split("\n")
readme_as_list = [line.strip() for line in readme_as_list if line != ""]
display(readme_as_list)

['FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017',
 '--------------------------------------------------------',
 'This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures',
 'locatable through the FracFocus ‘Find a Well’ search.',
 'Table Name: RegistryUpload',
 '--------------------------',
 'pKey - Key index for the table',
 'JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.',
 'JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.',
 'APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits',
 'represent the state, second three digits represent the county, third 5 digits represent the well.',
 'StateNumber - The first two digits of the API number.  Range is from 01-50.',
 'CountyNumber - The 3 digit county co
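
The compact list still mixes headings, dashed separators, and wrapped description lines. As a minimal sketch, assuming every attribute line has the form `Name - description` and that a wrapped line continues the previous entry, the list could be turned into a lookup dictionary. The helper `parse_data_dictionary` and the `sample_lines` excerpt are illustrative and not part of the original notebook.

```python
import re

# Attribute lines look like "JobStartDate - The date on which ..."
ENTRY_PATTERN = re.compile(r"^(\w+) - (.+)$")


def parse_data_dictionary(lines):
    """Map attribute names to their descriptions (hypothetical helper)."""
    entries = {}
    current = None
    for line in lines:
        # Table headings and dashed separators end the current entry
        if line.startswith("Table Name:") or set(line) == {"-"}:
            current = None
            continue
        match = ENTRY_PATTERN.match(line)
        if match:
            current = match.group(1)
            entries[current] = match.group(2)
        elif current is not None:
            # A wrapped line continues the previous description
            entries[current] += " " + line
    return entries


# A short excerpt of the cleaned readme lines shown above
sample_lines = [
    "FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017",
    "--------------------------------------------------------",
    "Table Name: RegistryUpload",
    "--------------------------",
    "pKey - Key index for the table",
    "APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits",
    "represent the state, second three digits represent the county, third 5 digits represent the well.",
    "CountyNumber - The 3 digit county code.",
]

data_dictionary = parse_data_dictionary(sample_lines)
print(data_dictionary["APINumber"])
```

With the real `readme_as_list`, the same call would give a quick way to look up what a column means while cleaning it, provided the rest of the file follows the same layout.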

In [10]:
# We can collect all the dataframes into a list and then concatenate them
df_list = [pd.read_csv(url, low_memory=False) for url in tqdm(data_urls2)]


dfs = pd.concat(df_list).reset_index(drop=True)

100%|██████████| 3/3 [00:28<00:00,  9.50s/it]


In [11]:
registry_df = dfs.copy()
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   pKey                     213883 non-null  object 
 1   JobStartDate             213868 non-null  object 
 2   JobEndDate               213883 non-null  object 
 3   APINumber                213883 non-null  object 
 4   StateNumber              213883 non-null  int64  
 5   CountyNumber             213883 non-null  int64  
 6   OperatorName             213883 non-null  object 
 7   WellName                 213883 non-null  object 
 8   Latitude                 213883 non-null  float64
 9   Longitude                213883 non-null  float64
 10  Projection               213883 non-null  object 
 11  TVD                      183743 non-null  float64
 12  TotalBaseWaterVolume     183714 non-null  float64
 13  TotalBaseNonWaterVolume  163574 non-null  float64
 14  Stat

In [12]:
# Look at some of the rows of the dataframe
display(registry_df.tail(3))

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213880,361bd982-58d6-437d-9592-08aeb80fd738,10/11/2023 7:21:00 AM,11/5/2023 6:07:00 PM,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,...,10936.115961,23218520.0,0.0,Texas,Harrison,3,False,False,,
213881,f9fdc139-0f1e-4943-8a16-adb5152d862c,9/28/2023 9:43:00 PM,11/6/2023 7:29:00 AM,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,...,11022.313802,40457386.0,0.0,Texas,Harrison,3,False,False,,
213882,2241ec7e-f113-4f8e-8b61-8a74c9e03dc2,4/1/3012 12:00:00 AM,4/1/3012 12:00:00 AM,42227368950000,42,227,"Meritage Energy Company, LLC",Patterson #2713,32.175028,-101.505275,...,,,,Texas,Howard,1,False,False,,


We use Windows Authentication instead of the usual username and password to connect to the SQL Server. With Windows Authentication, the connection string carries no username or password. Instead, the system uses the credentials of the currently logged-in Windows user.
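
To make the difference concrete, here is a small sketch of how the two kinds of connection URL could be assembled. The helper `build_mssql_url`, the host names, and the credentials are made up for illustration. The URL layout follows the `mssql+pyodbc` form used in this notebook, and only the standard library is needed to build the strings.

```python
from urllib.parse import quote_plus


def build_mssql_url(server, database, driver, username=None, password=None):
    """Build an mssql+pyodbc URL string (hypothetical helper)."""
    query = f"driver={quote_plus(driver)}"
    if username is None:
        # Windows Authentication: no credentials in the URL,
        # the ODBC driver uses the logged-in Windows user
        return f"mssql+pyodbc://@{server}/{database}?trusted_connection=yes&{query}"
    # SQL Server Authentication: URL-encoded credentials are embedded in the URL
    credentials = f"{quote_plus(username)}:{quote_plus(password)}"
    return f"mssql+pyodbc://{credentials}@{server}/{database}?{query}"


driver = "ODBC Driver 17 for SQL Server"
# Placeholder server names and credentials, not real ones
windows_url = build_mssql_url(r"MYHOST\SQLEXPRESS", "FracFocusRegistry", driver)
sql_auth_url = build_mssql_url("myhost", "FracFocusRegistry", driver, "analyst", "s3cret!")

print(windows_url)
print(sql_auth_url)
```

Either string could then be passed to `create_engine`. The Windows variant keeps secrets out of the notebook entirely, which is one reason to prefer it on a local machine.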

In [44]:
# Define the server and database names
server_name = r"ANDIE\SQLEXPRESS"  # raw string so the backslash is not read as an escape
database_name = "FracFocusRegistry"
table_name = "RegistryUpload"

# Create the connection
conn_str = f"mssql+pyodbc://@{server_name}/{database_name}?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"

# Create the engine
engine = create_engine(conn_str, echo=True)

df = pd.read_sql(f"SELECT * FROM {table_name}", engine)

df.info()

2023-11-15 11:35:07,191 INFO sqlalchemy.engine.Engine SELECT CAST(SERVERPROPERTY('ProductVersion') AS VARCHAR)
2023-11-15 11:35:07,191 INFO sqlalchemy.engine.Engine [raw sql] ()
2023-11-15 11:35:07,198 INFO sqlalchemy.engine.Engine SELECT schema_name()
2023-11-15 11:35:07,199 INFO sqlalchemy.engine.Engine [generated in 0.00108s] ()
2023-11-15 11:35:07,213 INFO sqlalchemy.engine.Engine SELECT CAST('test max support' AS NVARCHAR(max))
2023-11-15 11:35:07,213 INFO sqlalchemy.engine.Engine [generated in 0.00080s] ()
2023-11-15 11:35:07,215 INFO sqlalchemy.engine.Engine SELECT 1 FROM fn_listextendedproperty(default, default, default, default, default, default, default)
2023-11-15 11:35:07,216 INFO sqlalchemy.engine.Engine [generated in 0.00090s] ()


2023-11-15 11:35:07,262 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2023-11-15 11:35:07,263 INFO sqlalchemy.engine.Engine SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE ([INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) OR [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max))) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max))
2023-11-15 11:35:07,263 INFO sqlalchemy.engine.Engine [generated in 0.00199s] ('BASE TABLE', 'VIEW', 'SELECT * FROM RegistryUpload', 'dbo')
2023-11-15 11:35:07,275 INFO sqlalchemy.engine.Engine SELECT * FROM RegistryUpload
2023-11-15 11:35:07,276 INFO sqlalchemy.engine.Engine [raw sql] ()
2023-11-15 11:35:11,672 INFO sqlalchemy.engine.Engine ROLLBACK
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213850 entries, 0 to 213849
Data columns (total 21 columns):
 #   Column            

In [45]:
df.tail(3)

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213847,361BD982-58D6-437D-9592-08AEB80FD738,2023-10-11 07:21:00,2023-11-05 18:07:00,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,...,10936.115961,23218520.0,0.0,Texas,Harrison,3.0,False,False,,
213848,F9FDC139-0F1E-4943-8A16-ADB5152D862C,2023-09-28 21:43:00,2023-11-06 07:29:00,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,...,11022.313802,40457386.0,0.0,Texas,Harrison,3.0,False,False,,
213849,2241EC7E-F113-4F8E-8B61-8A74C9E03DC2,3012-04-01 00:00:00,3012-04-01 00:00:00,42227368950000,42,227,"Meritage Energy Company, LLC",Patterson #2713,32.175028,-101.505275,...,,,,Texas,Howard,1.0,False,False,,


The data from the CSV files has 33 more rows than the SQL Server table (213,883 versus 213,850), for reasons that are not yet clear. We will use the dataframe built from the CSV files on the chance that it contains more data points. The last 3 rows of both dataframes hold the same values, so any extra rows were not simply appended at the end.
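
One way to track down the extra rows would be to compare the key columns of the two dataframes. The sketch below uses toy stand-ins with made-up keys. It assumes `pKey` identifies a disclosure and that the only systematic difference is letter case (lowercase in the CSV files, uppercase in SQL Server, as the `tail` outputs suggest).

```python
import pandas as pd

# Toy stand-ins for the CSV and SQL Server dataframes (values are made up)
csv_df = pd.DataFrame({"pKey": ["aa-01", "bb-02", "cc-03", "bb-02"]})
sql_df = pd.DataFrame({"pKey": ["AA-01", "BB-02"]})

# Normalize case so the lowercase CSV keys line up with the uppercase SQL keys
csv_keys = csv_df["pKey"].str.upper()
sql_keys = sql_df["pKey"].str.upper()

# Rows whose key never appears in the SQL table
only_in_csv = csv_df[~csv_keys.isin(sql_keys)]
# Rows whose key is repeated within the CSV data
duplicated_in_csv = csv_df[csv_keys.duplicated(keep="first")]

print(only_in_csv)
print(duplicated_in_csv)
```

Run against `registry_df` and `df`, the two results would show whether the surplus comes from rows missing in the database or from repeats across the CSV files.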

We can also see some obvious errors with the `JobStartDate` column, but before we jump into cleaning the data columns, let's make it look more pythonic by changing the column names to snake_case.

In [13]:
def pascal_to_snake(string):
    """Converts a string from PascalCase to snake_case"""
    # (?<=[A-Za-z0-9]) - positive lookbehind for any alphanumeric character
    # (?=[A-Z][a-z]) - positive lookahead for any uppercase followed by lowercase
    pattern = re.compile(r"(?<=[A-Za-z0-9])(?=[A-Z][a-z])")
    return pattern.sub("_", string).lower()


# create test cases for the function
test_cases = ["PascalCase", "camelCase", "snake_case", "kebab-case", "UPPERCASE"]
print([pascal_to_snake(case) for case in test_cases])

['pascal_case', 'camel_case', 'snake_case', 'kebab-case', 'uppercase']
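
The generic test cases do not include acronyms, which several of the real column names contain. The sketch below repeats the function so it runs on its own and applies it to a handful of column names taken from `registry_df.info()`.

```python
import re


def pascal_to_snake(string):
    """Same conversion as in the cell above, repeated so this block is self-contained."""
    pattern = re.compile(r"(?<=[A-Za-z0-9])(?=[A-Z][a-z])")
    return pattern.sub("_", string).lower()


# Column names from the registry data, including acronym-heavy ones
column_samples = ["pKey", "JobStartDate", "APINumber", "TVD", "FFVersion", "DTMOD"]
converted = {name: pascal_to_snake(name) for name in column_samples}
print(converted)
```

Because the lookahead requires an uppercase letter followed by a lowercase one, a run of capitals such as `API` or `FF` stays together and is only split off where the next word starts. The flip side is that a name ending in an acronym, for example a hypothetical `wellID`, would not be split at all.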


In [14]:
registry_df.columns = [pascal_to_snake(col) for col in registry_df.columns]
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 21 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   p_key                        213883 non-null  object 
 1   job_start_date               213868 non-null  object 
 2   job_end_date                 213883 non-null  object 
 3   api_number                   213883 non-null  object 
 4   state_number                 213883 non-null  int64  
 5   county_number                213883 non-null  int64  
 6   operator_name                213883 non-null  object 
 7   well_name                    213883 non-null  object 
 8   latitude                     213883 non-null  float64
 9   longitude                    213883 non-null  float64
 10  projection                   213883 non-null  object 
 11  tvd                          183743 non-null  float64
 12  total_base_water_volume      183714 non-null  float64
 13 

Next, we can remove the columns with only null values. These are the last 2 columns in the dataframe, `source` and `dtmod`.
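
Rather than reading the all-null columns off the `info()` output, they can also be found programmatically. This is a small sketch on a toy dataframe with made-up values; the column names mirror the ones in `registry_df`.

```python
import numpy as np
import pandas as pd

# Toy dataframe: two columns contain nothing but missing values
toy_df = pd.DataFrame(
    {
        "well_name": ["A 1H", "B 2H", "C 3H"],
        "tvd": [10936.1, np.nan, 11022.3],
        "source": [np.nan, np.nan, np.nan],
        "dtmod": [np.nan, np.nan, np.nan],
    }
)

# Columns where every single value is null
all_null_columns = toy_df.columns[toy_df.isna().all()].tolist()
print(all_null_columns)

# Drop them in one step without naming them
trimmed_df = toy_df.dropna(axis="columns", how="all")
print(trimmed_df.columns.tolist())
```

Naming the columns explicitly, as the next cell does, is easier to read. The `dropna` variant is handy when the set of empty columns might change between data refreshes.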


In [15]:
registry_df = registry_df.drop(columns=["source", "dtmod"])
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 19 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   p_key                        213883 non-null  object 
 1   job_start_date               213868 non-null  object 
 2   job_end_date                 213883 non-null  object 
 3   api_number                   213883 non-null  object 
 4   state_number                 213883 non-null  int64  
 5   county_number                213883 non-null  int64  
 6   operator_name                213883 non-null  object 
 7   well_name                    213883 non-null  object 
 8   latitude                     213883 non-null  float64
 9   longitude                    213883 non-null  float64
 10  projection                   213883 non-null  object 
 11  tvd                          183743 non-null  float64
 12  total_base_water_volume      183714 non-null  float64
 13 

Next, we will fix some of the dtypes of the columns.
- Both the `job_start_date` and the `job_end_date` columns are object dtypes, so we will parse them as datetimes and drop the time component, keeping each date as a `YYYY-MM-DD` string.
- The `api_number` column is an object dtype, but it should be a string dtype. We can also shorten the column name to `api`.
- The `state_number` and `county_number` columns are both `int64` dtypes right now, but they should be `CategoricalDtype`.
- The `projection` column is an object dtype. It can be converted to a string dtype and shortened to `crs`, as it represents the Coordinate Reference System used by the `latitude` and `longitude` values.
- The `federal_well` and `indian_well` columns are both boolean columns. They may be more aptly named `is_federal_well` and `is_indian_well` respectively.
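
For the date columns there are two ways to drop the time component, and they leave different dtypes behind. The sketch below uses a few sample strings, with `"not a date"` as a deliberately invalid value, and assumes the `month/day/year hour:minute:second AM/PM` layout seen in the CSV output.

```python
import pandas as pd

raw_dates = pd.Series(["10/11/2023 7:21:00 AM", "9/28/2023 9:43:00 PM", "not a date"])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Option 1: keep a datetime dtype, with the time set to midnight
as_midnight = parsed.dt.normalize()

# Option 2: format as text, which gives up the datetime dtype
as_text = parsed.dt.strftime("%Y-%m-%d")

print(as_midnight.dtype)
print(as_text.tolist())
```

The text form is compact and sorts correctly, while the normalized form keeps datetime arithmetic and the `.dt` accessor available. Which one fits better depends on what later steps need.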

In [16]:
registry_df["job_start_date"] = pd.to_datetime(
    registry_df["job_start_date"], errors="coerce"
).dt.strftime("%Y-%m-%d")
registry_df["job_end_date"] = pd.to_datetime(
    registry_df["job_end_date"], errors="coerce"
).dt.strftime("%Y-%m-%d")
registry_df["api_number"] = registry_df["api_number"].astype("string").str.zfill(14)

registry_df["state_number"] = (
    registry_df["state_number"].astype("string").str.zfill(2).astype("category")
)
registry_df["county_number"] = (
    registry_df["county_number"].astype("string").str.zfill(3).astype("category")
)

registry_df.rename(
    columns={
        "federal_well": "is_federal_well",
        "indian_well": "is_indian_well",
        "api_number": "api",
    },
    inplace=True,
)

registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 19 columns):
 #   Column                       Non-Null Count   Dtype   
---  ------                       --------------   -----   
 0   p_key                        213883 non-null  object  
 1   job_start_date               213866 non-null  object  
 2   job_end_date                 213882 non-null  object  
 3   api                          213883 non-null  string  
 4   state_number                 213883 non-null  category
 5   county_number                213883 non-null  category
 6   operator_name                213883 non-null  object  
 7   well_name                    213883 non-null  object  
 8   latitude                     213883 non-null  float64 
 9   longitude                    213883 non-null  float64 
 10  projection                   213883 non-null  object  
 11  tvd                          183743 non-null  float64 
 12  total_base_water_volume      183714 non-null

In [55]:
registry_df[["latitude", "longitude"]].describe()

| | latitude | longitude |
|---|---|---|
| count | 213883.0 | 213883.0 |
| mean | 46.62528 | -466549.8 |
| std | 3881.864 | 213498200.0 |
| min | -103.6188 | -98732100000.0 |
| 25% | 31.69815 | -103.5524 |
| 50% | 32.8447 | -101.7849 |
| 75% | 40.01219 | -97.85731 |
| max | 1731278.0 | 3810848.0 |


We can see that some `latitude` values are > 90 or < -90 and some `longitude` values are > 180 or < -180. Those are out of this world! Let's check them out.

In [81]:
# as a first look, get the rows with latitude > 90
registry_df[registry_df["latitude"] > 90]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,projection,tvd,total_base_water_volume,total_base_non_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well
81356,d4358031-8937-4cd4-9227-801045baf5c0,2014-06-28,2014-06-28,,42,373,Unit Petroleum,"BP ""D"" # 2",376223.0,3810848.0,NAD27,11551.0,134836.0,0.0,Texas,Polk,2,False,False
104918,35646d31-be1b-4755-9e98-80e15a2665b7,2015-06-05,2015-06-05,35043233890000.0,35,43,Reeder Energy,Clark 1-7,98.8364,-36.05387,NAD83,9750.0,205547.0,0.0,Oklahoma,Dewey,3,False,False
136909,aa1fb665-7e35-4917-9ed7-724d63fa45fc,2017-11-30,2017-11-30,42505357230000.0,42,505,Merit Energy Company LLC,Haynes #171,99.19106,27.11004,NAD83,10250.0,45558.0,0.0,Texas,Zapata,3,False,False
152652,8d5e2126-1bea-43b3-9b04-494ba96d0a34,2018-10-15,2018-10-19,42255359940000.0,42,255,Magnolia Oil & Gas LLC,Crowder 1H,290315.0,-97.85775,NAD27,10163.0,5685834.0,0.0,Texas,Karnes,3,False,False
169559,a21be3ac-937d-46d1-b1bb-cfabbceb3032,2019-10-14,2019-10-18,35081243130000.0,35,81,"Roberson Oil Company, Inc.",Potter 1-12MH,96.93887,35.88457,NAD27,8800.0,6752298.0,0.0,Oklahoma,Lincoln,3,False,False
171522,b0efcdca-b742-4ffc-8077-890fb6217b2f,2019-12-04,2019-12-11,42431335240000.0,42,431,"Atoka Operating Permian, LLC",Reed Ranch 8,1731278.0,845137.6,NAD27,7825.0,386890.0,0.0,Texas,Sterling,3,False,False
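
The check above only looks at `latitude > 90`. A fuller version covering both bounds of both columns could look like the sketch below, shown on a toy dataframe whose values are chosen for illustration.

```python
import pandas as pd

# Toy coordinates: the first row is valid, the others each break one bound
toy_coords = pd.DataFrame(
    {
        "latitude": [32.524418, 98.8364, 31.2352, -103.6188],
        "longitude": [-94.493567, -36.05387, 3810848.0, 31.9],
    }
)

# between() is inclusive on both ends by default
valid_latitude = toy_coords["latitude"].between(-90, 90)
valid_longitude = toy_coords["longitude"].between(-180, 180)

# Keep every row where at least one coordinate is out of range
out_of_range = toy_coords[~(valid_latitude & valid_longitude)]
print(out_of_range)
```

Note that `between` returns `False` for missing values, so rows with a null coordinate would be flagged as well. The last toy row hints at one likely cause of such errors, namely latitude and longitude entered in swapped order.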


The `state_name` column should not have more than 50 possible values, given that there are only 50 states in the US. Yet if we check the number of unique values in the `state_name` column, we see 95. This is due to variation in the way the `state_name` value is entered. Although not as obvious, we can assume the same for the `county_name` column. Luckily, the `api` includes both the `state_number` and the `county_number`. With this we can:
1. Validate the data by ensuring that these corresponding columns match.
2. Ensure that the `state_name` and the `county_name` columns have no variations and are standardized with the official FIPS (Federal Information Processing Standard) codes.
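
The validation relies on the API number layout described in the readme (`xx-xxx-xxxxx0000`). As a minimal sketch, a hypothetical helper `split_api` shows which slice of the 14-digit string corresponds to which part; the key names in the returned dictionary are chosen here for illustration.

```python
def split_api(api):
    """Split a 14-digit API number into its parts (hypothetical helper)."""
    if len(api) != 14 or not api.isdigit():
        raise ValueError(f"expected a 14-digit string, got {api!r}")
    return {
        "state_number": api[0:2],    # first two digits
        "county_number": api[2:5],   # next three digits
        "well_number": api[5:10],    # next five digits
        "suffix": api[10:14],        # trailing four digits
    }


# API number of one of the Harrison County, Texas rows shown earlier
parts = split_api("42203355450000")
print(parts)
```

These are the same slices, `str[0:2]` and `str[2:5]`, that the mismatch checks in the following cells compare against `state_number` and `county_number`.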

In [17]:
print(
    f'Number of different values in state_name column: {registry_df["state_name"].nunique()}'
)
print(
    f'Number of different values in state_number column: {registry_df["state_number"].nunique()}'
)

Number of different values in state_name column: 95
Number of different values in state_number column: 28


In [18]:
# check which rows may have the api with the first two digits not matching the state number
api_state_mismatch_mask = (
    registry_df["state_number"].astype("string") != registry_df["api"].str[0:2]
)
registry_df[api_state_mismatch_mask]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,projection,tvd,total_base_water_volume,total_base_non_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well
50509,4e73a0eb-a744-46d0-a2e0-d6048f2dbb46,2013-05-25,2013-05-25,4226932868,42,269,"Medders Oil Company, Inc.",Pitchfork IIII #5,33.56366,-100.49796,WGS84,3794.0,10794.0,,Texas,King,2,False,False
197331,5ce7e732-e28d-4451-a679-2e9485f9539a,2022-05-22,2022-06-23,423714037500,42,371,Diamondback E&P LLC,MOORE SHARK 10 9 Unit 5WA,31.2352,-103.136,NAD27,10880.0,30288566.0,0.0,Texas,Pecos,3,False,False


In [19]:
# On the mismatched rows, strip the leading zeros and right-pad with zeros to 14 digits
registry_df.loc[api_state_mismatch_mask, "api"] = (
    registry_df.loc[api_state_mismatch_mask, "api"].str.lstrip("0").str.ljust(14, "0")
)
# check again for mismatches
registry_df[registry_df["state_number"].astype("string") != registry_df["api"].str[0:2]]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,projection,tvd,total_base_water_volume,total_base_non_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well
50509,4e73a0eb-a744-46d0-a2e0-d6048f2dbb46,2013-05-25,2013-05-25,42269328680000,42,269,"Medders Oil Company, Inc.",Pitchfork IIII #5,33.56366,-100.49796,WGS84,3794.0,10794.0,,Texas,King,2,False,False
197331,5ce7e732-e28d-4451-a679-2e9485f9539a,2022-05-22,2022-06-23,42371403750000,42,371,Diamondback E&P LLC,MOORE SHARK 10 9 Unit 5WA,31.2352,-103.136,NAD27,10880.0,30288566.0,0.0,Texas,Pecos,3,False,False
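
Why these two rows needed a different kind of padding can be traced with plain strings. The two raw values below are the short API numbers from the mismatch table. The sketch assumes they are short because their trailing zeros were missing, which is what the repaired values suggest.

```python
short_apis = ["4226932868", "423714037500"]

repaired_apis = []
for raw in short_apis:
    # zfill pads on the left, which pushes the state digits out of position
    left_padded = raw.zfill(14)
    # undo the left padding, then pad on the right instead
    repaired = left_padded.lstrip("0").ljust(14, "0")
    repaired_apis.append(repaired)
    print(f"{raw} -> {left_padded} -> {repaired}")
```

Stripping leading zeros is only safe on rows already flagged as mismatched. An API number from a state whose code starts with `0` begins with a zero legitimately, so applying `lstrip("0")` to the whole column would corrupt it.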


In [21]:
# check which rows have an api whose digits 3-5 do not match the county number
api_county_mismatch_mask = (
    registry_df["county_number"].astype("string") != registry_df["api"].str[2:5]
)
registry_df[api_county_mismatch_mask]

Unnamed: 0,p_key,job_start_date,job_end_date,api,state_number,county_number,operator_name,well_name,latitude,longitude,projection,tvd,total_base_water_volume,total_base_non_water_volume,state_name,county_name,ff_version,is_federal_well,is_indian_well


In [24]:
# group by state_number and find the mode of the state_name
state_number_mode = (
    registry_df.groupby("state_number")["state_name"]
    .apply(lambda x: x.mode().iloc[0])
    .reset_index()
)
registry_df = registry_df.merge(
    state_number_mode.rename(columns={"state_name": "state"}),
    on="state_number",  # make the join key explicit
)

In [25]:
registry_df["state"].value_counts()

state
Texas             103915
Colorado           20036
Oklahoma           18707
North Dakota       16259
New Mexico         11689
Pennsylvania       10758
Wyoming             6309
Utah                5637
Louisiana           4300
California          3829
West Virginia       3312
Ohio                3297
Arkansas            2871
Kansas               867
Montana              824
Virginia             602
Alaska               254
Mississippi          168
Alabama              155
Kentucky              34
Michigan              31
Nebraska              13
Nevada                 6
Illinois               3
New York               3
Indiana                2
Idaho                  1
North Carolina         1
Name: count, dtype: int64
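
To see how the mode-based lookup collapses spelling variants, here is a toy version of the same idea with made-up rows. It uses `map` instead of `merge`, which is an alternative way to attach the result and not the approach taken in the cell above.

```python
import pandas as pd

# Toy data: each state number appears with more than one spelling of the name
toy_states = pd.DataFrame(
    {
        "state_number": ["42", "42", "42", "35", "35", "35"],
        "state_name": ["Texas", "TEXAS", "Texas", "Oklahoma", "OK", "Oklahoma"],
    }
)

# Most frequent spelling per state number
most_common_name = toy_states.groupby("state_number")["state_name"].agg(
    lambda names: names.mode().iloc[0]
)

# Attach the standardized name to every row
toy_states["state"] = toy_states["state_number"].map(most_common_name)
print(toy_states)
```

The approach assumes the correct spelling is also the most common one for each state number. When two spellings tie, `mode()` returns them sorted and `iloc[0]` picks the first, so a spot check of the resulting names is still worthwhile.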