## Data Wrangling

### Introduction

This notebook is part of a capstone project for the Springboard Data Science Career Track. The goal is to develop a machine learning model that predicts, and ranks, the likelihood that an oil company will initiate a hydraulic fracturing (frac) job in a given Permian Basin county during the first quarter of 2024.

In [1]:
# initial imports

import warnings
import pandas as pd
import numpy as np
from tqdm import tqdm
from urllib.request import urlopen
from sqlalchemy import create_engine

In [2]:
# ignore all warnings
warnings.filterwarnings("ignore")

In [3]:
# Test initial print statement
print("CapstoneJourney begins!")

CapstoneJourney begins!


In [15]:
# The bucket contains:
#   - FracFocusRegistry_i.csv for i in 1..24
#   - registryupload_i.csv for i in 1..3
#   - readme.txt

base_url = "https://storage.googleapis.com/mrprime_dataset/fracfocus"

# First list of URLs: FracFocusRegistry_1.csv ... FracFocusRegistry_24.csv
data_urls1 = [f"{base_url}/FracFocusRegistry_{i}.csv" for i in range(1, 25)]

# Second list of URLs: registryupload_1.csv ... registryupload_3.csv
data_urls2 = [f"{base_url}/registryupload_{j}.csv" for j in range(1, 4)]

# Single-item list holding the readme URL
data_url3 = [f"{base_url}/readme.txt"]

In [16]:
# get readme data
readme = urlopen(data_url3[0]).read().decode("windows-1252")
display(readme)

'FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017\r\n--------------------------------------------------------\r\nThis data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures \r\nlocatable through the FracFocus ‘Find a Well’ search.\r\n\r\n\r\nTable Name: RegistryUpload\r\n--------------------------\r\npKey - Key index for the table\r\n\r\nJobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.\r\n\r\nJobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.\r\n\r\nAPINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits \r\nrepresent the state, second three digits represent the county, third 5 digits represent the well.\r\n\r\nStateNumber - The first two digits of the API number.  Range is from 01-50.\r\n\r\nCountyNumber - The 

In [17]:
# print() interprets the escape characters (\r\n), so the readme renders as readable lines
print(readme)

FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017
--------------------------------------------------------
This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures 
locatable through the FracFocus ‘Find a Well’ search.


Table Name: RegistryUpload
--------------------------
pKey - Key index for the table

JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.

JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.

APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits 
represent the state, second three digits represent the county, third 5 digits represent the well.

StateNumber - The first two digits of the API number.  Range is from 01-50.

CountyNumber - The 3 digit county code.

OperatorName - The name of the opera

In [18]:
# You can also tidy the readme yourself to make it more compact:
# split it into lines, strip surrounding whitespace, and drop empty lines
readme_as_list = readme.replace("\r", "").split("\n")
readme_as_list = [line.strip() for line in readme_as_list if line.strip()]
display(readme_as_list)

['FRACFOCUS DATA DICTIONARY - Last updated: July 19th, 2017',
 '--------------------------------------------------------',
 'This data dictionary defines each attribute found in the FracFocusRegistry database backup which includes all disclosures',
 'locatable through the FracFocus ‘Find a Well’ search.',
 'Table Name: RegistryUpload',
 '--------------------------',
 'pKey - Key index for the table',
 'JobStartDate - The date on which the hydraulic fracturing job was initiated.  Does not include site preparation or setup.',
 'JobEndDate - The date on which the hydraulic fracturing job was completed.  Does not include site teardown.',
 'APINumber - The American Petroleum Institute well identification number formatted as follows xx-xxx-xxxxx0000 Where: First two digits',
 'represent the state, second three digits represent the county, third 5 digits represent the well.',
 'StateNumber - The first two digits of the API number.  Range is from 01-50.',
 'CountyNumber - The 3 digit county co
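
The APINumber layout described in the data dictionary (two state digits, three county digits, five well digits, then a trailing four digits) can be unpacked with simple slicing. The sketch below uses a hypothetical helper, `split_api_number`, that is not part of the original notebook. It assumes the value is a 14-digit string, optionally hyphenated, and left-pads with zeros in case a leading zero was lost on import.

```python
def split_api_number(api_number):
    """Split an API number into its parts (hypothetical helper, for illustration)."""
    # Remove optional hyphens and restore a leading zero if one was dropped (assumption)
    digits = str(api_number).replace("-", "").zfill(14)
    return {
        "state": digits[:2],     # first two digits: state code
        "county": digits[2:5],   # next three digits: county code
        "well": digits[5:10],    # next five digits: well identifier
        "suffix": digits[10:],   # trailing four digits, shown as 0000 in the readme
    }


# APINumber taken from the last row of registryupload_3.csv shown in this notebook
parts = split_api_number("42227368950000")
print(parts)
```

For this row, the state and county parts should agree with the StateNumber (42) and CountyNumber (227) columns, which gives a quick consistency check.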

In [19]:
pd.read_csv(
    "https://storage.googleapis.com/mrprime_dataset/fracfocus/registryupload_3.csv"
)

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
0,4048c091-97c3-4a73-8680-f9c0122ce240,8/9/2022 5:00:00 AM,9/1/2022 5:00:00 AM,35017258000000,35,17,Ovintiv Mid-Continent Inc.,Frank 1406 2H-15X,35.666269,-97.823698,...,8230.880000,20086962.0,55554.0,Oklahoma,Canadian,3,False,False,,
1,a8584ce5-222f-40d9-ade5-7eb0e654783c,8/9/2022 5:00:00 AM,9/1/2022 5:00:00 AM,35017258010000,35,17,Ovintiv Mid-Continent Inc.,Frank 1406 3H-15X,35.666270,-97.823749,...,8049.340000,18904032.0,19793.0,Oklahoma,Canadian,3,False,False,,
2,2ea6e80f-b143-4260-9433-c2c1f3d38d64,8/9/2022 5:00:00 AM,9/1/2022 5:00:00 AM,35017258020000,35,17,Ovintiv Mid-Continent Inc.,Frank 1406 4H-15X,35.666270,-97.823799,...,8230.880000,20518134.0,45242.0,Oklahoma,Canadian,3,False,False,,
3,54d40666-1489-400e-a979-f28821c3c465,8/9/2022 5:00:00 AM,9/1/2022 5:00:00 AM,35017258080000,35,17,Ovintiv Mid-Continent Inc.,Stephen 1406 2H-27X,35.666270,-97.823850,...,8337.130000,14028882.0,32109.0,Oklahoma,Canadian,3,False,False,,
4,f56bb3a3-bf68-4264-851e-eb77f0da9750,8/9/2022 5:00:00 AM,9/1/2022 5:00:00 AM,35017258090000,35,17,Ovintiv Mid-Continent Inc.,Stephen 1406 3H-27X,35.666270,-97.823910,...,8197.820000,14128926.0,16561.0,Oklahoma,Canadian,3,False,False,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13878,85808d0e-0c2d-4c26-b5b7-08be3118f457,10/20/2023 10:40:00 AM,11/5/2023 5:08:00 PM,42311374120000,42,311,Castlerock Exploration,CRX Y Bar 9H,28.498265,-98.762438,...,9797.000000,19717740.0,0.0,Texas,McMullen,3,False,False,,
13879,c01d658a-4373-4a0b-b048-9fe98043bad1,10/20/2023 10:40:00 AM,11/5/2023 5:08:00 PM,42311374130000,42,311,Castlerock Exploration,CRX Y Bar A10H,28.498265,-98.762376,...,9836.000000,19970412.0,0.0,Texas,McMullen,3,False,False,,
13880,361bd982-58d6-437d-9592-08aeb80fd738,10/11/2023 7:21:00 AM,11/5/2023 6:07:00 PM,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,...,10936.115961,23218520.0,0.0,Texas,Harrison,3,False,False,,
13881,f9fdc139-0f1e-4943-8a16-adb5152d862c,9/28/2023 9:43:00 PM,11/6/2023 7:29:00 AM,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,...,11022.313802,40457386.0,0.0,Texas,Harrison,3,False,False,,


In [20]:
# Collect all the dataframes into a list, then concatenate them
df_list = [pd.read_csv(url, low_memory=False) for url in tqdm(data_urls2)]

registry_df = pd.concat(df_list).reset_index(drop=True)

100%|██████████| 3/3 [00:19<00:00,  6.43s/it]


In [21]:
registry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213883 entries, 0 to 213882
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   pKey                     213883 non-null  object 
 1   JobStartDate             213868 non-null  object 
 2   JobEndDate               213883 non-null  object 
 3   APINumber                213883 non-null  object 
 4   StateNumber              213883 non-null  int64  
 5   CountyNumber             213883 non-null  int64  
 6   OperatorName             213883 non-null  object 
 7   WellName                 213883 non-null  object 
 8   Latitude                 213883 non-null  float64
 9   Longitude                213883 non-null  float64
 10  Projection               213883 non-null  object 
 11  TVD                      183743 non-null  float64
 12  TotalBaseWaterVolume     183714 non-null  float64
 13  TotalBaseNonWaterVolume  163574 non-null  float64
 14  Stat

In [22]:
# Look at some of the rows of the dataframe
display(registry_df.tail(3))

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213880,361bd982-58d6-437d-9592-08aeb80fd738,10/11/2023 7:21:00 AM,11/5/2023 6:07:00 PM,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,...,10936.115961,23218520.0,0.0,Texas,Harrison,3,False,False,,
213881,f9fdc139-0f1e-4943-8a16-adb5152d862c,9/28/2023 9:43:00 PM,11/6/2023 7:29:00 AM,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,...,11022.313802,40457386.0,0.0,Texas,Harrison,3,False,False,,
213882,2241ec7e-f113-4f8e-8b61-8a74c9e03dc2,4/1/3012 12:00:00 AM,4/1/3012 12:00:00 AM,42227368950000,42,227,"Meritage Energy Company, LLC",Patterson #2713,32.175028,-101.505275,...,,,,Texas,Howard,1,False,False,,
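
The last row above has a JobStartDate in the year 3012, which is clearly a data-entry error. A minimal sketch of how such rows could be flagged follows; it works on a small hand-copied sample rather than the full `registry_df`, and the cutoff date is an assumption based on when the files were downloaded (November 2023), not something the data dictionary specifies.

```python
import pandas as pd

# Sample values copied from the JobStartDate column shown above
sample = pd.DataFrame(
    {
        "JobStartDate": [
            "10/11/2023 7:21:00 AM",
            "9/28/2023 9:43:00 PM",
            "4/1/3012 12:00:00 AM",
        ]
    }
)

# Parse the text timestamps; values pandas cannot represent become NaT
parsed = pd.to_datetime(
    sample["JobStartDate"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Assumed cutoff: no job in this download should start after the end of 2023
cutoff = pd.Timestamp("2023-12-31")

# Flag rows that are missing, unparseable, or later than the cutoff
sample["suspect_date"] = parsed.isna() | (parsed > cutoff)
print(sample)
```

Checking both conditions keeps the flag stable whether the out-of-range date is coerced to NaT or parsed as a valid far-future timestamp.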


We connect to SQL Server with Windows Authentication rather than the usual username and password. With Windows Authentication, the connection string carries no credentials; instead, the driver uses the credentials of the currently logged-in Windows user.
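
To make the difference concrete, the sketch below builds both kinds of connection URL side by side. `build_mssql_url`, `demo_user`, and the sample password are hypothetical and exist only for illustration; the server, database, and driver names are the ones used in this notebook.

```python
from urllib.parse import quote_plus


def build_mssql_url(server, database, username=None, password=None,
                    driver="ODBC Driver 17 for SQL Server"):
    """Build an mssql+pyodbc URL (hypothetical helper, not part of the notebook)."""
    query = f"driver={quote_plus(driver)}"
    if username is None:
        # Windows Authentication: no credentials, rely on the logged-in Windows user
        return f"mssql+pyodbc://@{server}/{database}?trusted_connection=yes&{query}"
    # SQL Server Authentication: credentials are embedded and must be URL-escaped
    credentials = f"{quote_plus(username)}:{quote_plus(password)}"
    return f"mssql+pyodbc://{credentials}@{server}/{database}?{query}"


windows_url = build_mssql_url(r"ANDIE\SQLEXPRESS", "FracFocusRegistry")
sql_url = build_mssql_url(r"ANDIE\SQLEXPRESS", "FracFocusRegistry",
                          username="demo_user", password="p@ss/word")
print(windows_url)
print(sql_url)
```

The Windows Authentication variant keeps secrets out of the notebook entirely, which is a practical reason to prefer it when the database runs on the same machine or domain.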

In [23]:
# Define the server and database names
server_name = r"ANDIE\SQLEXPRESS"  # raw string: "\S" is not a valid escape sequence
database_name = "FracFocusRegistry"
table_name = "RegistryUpload"

# Create the connection
conn_str = f"mssql+pyodbc://@{server_name}/{database_name}?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"

# Create the engine
engine = create_engine(conn_str, echo=True)

df = pd.read_sql(f"SELECT * FROM {table_name}", engine)

df.info()

2023-11-15 09:52:24,248 INFO sqlalchemy.engine.Engine SELECT CAST(SERVERPROPERTY('ProductVersion') AS VARCHAR)
2023-11-15 09:52:24,248 INFO sqlalchemy.engine.Engine [raw sql] ()
2023-11-15 09:52:24,251 INFO sqlalchemy.engine.Engine SELECT schema_name()
2023-11-15 09:52:24,252 INFO sqlalchemy.engine.Engine [generated in 0.00071s] ()
2023-11-15 09:52:24,261 INFO sqlalchemy.engine.Engine SELECT CAST('test max support' AS NVARCHAR(max))
2023-11-15 09:52:24,262 INFO sqlalchemy.engine.Engine [generated in 0.00102s] ()
2023-11-15 09:52:24,265 INFO sqlalchemy.engine.Engine SELECT 1 FROM fn_listextendedproperty(default, default, default, default, default, default, default)
2023-11-15 09:52:24,266 INFO sqlalchemy.engine.Engine [generated in 0.00108s] ()
2023-11-15 09:52:24,305 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2023-11-15 09:52:24,306 INFO sqlalchemy.engine.Engine SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE ([INFORMATION_SCHEMA].[TABLE

In [24]:
df.tail(3)

Unnamed: 0,pKey,JobStartDate,JobEndDate,APINumber,StateNumber,CountyNumber,OperatorName,WellName,Latitude,Longitude,...,TVD,TotalBaseWaterVolume,TotalBaseNonWaterVolume,StateName,CountyName,FFVersion,FederalWell,IndianWell,Source,DTMOD
213847,361BD982-58D6-437D-9592-08AEB80FD738,2023-10-11 07:21:00,2023-11-05 18:07:00,42203355450000,42,203,"Silver Hill Operating, LLC",BOOKOUT D ALLOC 5H,32.524418,-94.493567,...,10936.115961,23218520.0,0.0,Texas,Harrison,3.0,False,False,,
213848,F9FDC139-0F1E-4943-8A16-ADB5152D862C,2023-09-28 21:43:00,2023-11-06 07:29:00,42203355270000,42,203,"Silver Hill Operating, LLC",BOOKOUT C ALLOC 4H,32.524414,-94.494218,...,11022.313802,40457386.0,0.0,Texas,Harrison,3.0,False,False,,
213849,2241EC7E-F113-4F8E-8B61-8A74C9E03DC2,3012-04-01 00:00:00,3012-04-01 00:00:00,42227368950000,42,227,"Meritage Energy Company, LLC",Patterson #2713,32.175028,-101.505275,...,,,,Texas,Howard,1.0,False,False,,
