<H4>Section 16.4 of Machine Learning Bootcamp: Capstone Step 2, Data Collection</H4>

Here I select three candidate datasets for the capstone project, describe each one's size and attributes, and explain what makes it a good potential candidate from my perspective. I'm looking forward to discussing them with a mentor, along with the pros and cons of each project roadmap.

<H4>Candidate 1: Idle wells at risk of fees or penalties in California</H4>
<p>In 2021, California passed new legislation that escalates the fines oil & gas operators must pay if they have not tested their idle wells as the law requires. Fees now rise steeply on a schedule based on the age of the well and how long it has been idle.<br></p>

<p><b>Problem statement:</b> Because not all the data needed to determine the exact status of a well is available, I think we can use machine learning to assess each company's exposure to this risk: have they been addressing their idle well burden by plugging & abandoning their idle well fleets, or are they "kicking the can down the road" and postponing until max pain arrives?<br></p>

<p><b>Hypothesis:</b> I think the larger companies (Chevron, Aera, SPR, CRC) are likely addressing their own idle well situations in a way that avoids onerous fees and the potentially damaging optics of not following the law. However, I suspect there are a large number of smaller, privately owned companies that have not kept up with their liabilities, and either don't have the cash flow to take care of them or are hoping to fly under the radar for a bit longer.</p>

<p><b>Data set:</b> the state publishes its list of all wells in the state (241,912 data points), with the currently known status and original drill (SPUD) date of each well. That gives us the age of the well and the state's view of its status, which the state has likely calculated from the monthly production reports the operators are legally bound to send. I have heard anecdotally that it can take the state up to 2 years to update a well's status, so there's some room here for machine learning to predict a trend by operator.</p>
 
<p><b>File:</b> downloaded from state's website (filesize 55MB): <a href='https://gis.conservation.ca.gov/portal/home/item.html?id=0d30c4d9ac8f4f84a53a145e7d68eb6b'>linked here</a> and added to GIT here: <a href='source_data/CALGEM AllWells_20241113.csv'>CALGEM AllWells_20241113.csv</a></p>

In [5]:
import numpy as np
import pandas as pd

# Load candidate data file 1 and take a peek.
# TODO: it will take some work to load the data with proper datatypes and correctly parsed dates.
well_list_df = pd.read_csv('source_data/CALGEM AllWells_20241113.csv')
well_list_df.head()

Unnamed: 0,OID,API,LeaseName,WellNumber,WellDesign,WellStatus,WellType,WellTypeLa,OperatorCo,OperatorNa,...,Range,BaseMeridi,Latitude,Longitude,GISSource,isConfiden,isDirectio,SpudDate,inHPZ,WellSymbol
0,-1,403300003,Lease by W.G. Young,1,1,Idle,DG,Dry Gas,11838,W.G. Young,...,09W,MD,38.976693,-122.833093,Notice of Intent to Drill,N,N,,Verified HPZ,IdleDG
1,-1,402120723,Kauai,1-Mar,Kauai 3-1,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.44064,-121.951889,GPS,N,N,12/7/1999,Not Within HPZ,IdleDG
2,-1,402120734,Lanai,3-Mar,Lanai 3-3,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.447174,-121.961273,GPS,N,N,5/13/2000,Uncertainty Area,IdleDG
3,-1,402120815,Anacapa,4-Mar,Anacapa 3-4,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.44323,-121.951042,GPS,N,N,11/6/2002,Not Within HPZ,IdleDG
4,-1,402120521,Angel Slough,1-Feb,Angel Slough 2-1,Active,DG,Dry Gas,C8720,"Crain Orchards, Inc.",...,01W,MD,39.449158,-121.946289,GPS,N,N,2/17/1988,Not Within HPZ,ActiveDG
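As a first pass at the datatype and date work noted in the loading cell, here is a minimal sketch. It assumes that `API`, `WellNumber`, and `OperatorCo` are better treated as text identifiers than numbers, and that `SpudDate` is always month/day/year as in the preview rows; both assumptions need checking against the full file. The helper name `load_wells` is hypothetical, and the inline sample uses only a few columns from the preview so the sketch runs without the 55MB file.

```python
import io
import pandas as pd

# Assumption: these columns are identifiers or codes, so keep them as text.
ID_COLUMNS = {"API": str, "WellNumber": str, "OperatorCo": str}

def load_wells(source):
    """Read the wells CSV with identifier columns as text and SpudDate as a date."""
    df = pd.read_csv(source, dtype=ID_COLUMNS)
    # Assumption: dates are month/day/year; missing or unparseable values become NaT.
    df["SpudDate"] = pd.to_datetime(df["SpudDate"], format="%m/%d/%Y", errors="coerce")
    return df

# Tiny inline sample built from the preview rows (subset of columns).
sample_csv = io.StringIO(
    "API,WellNumber,WellStatus,OperatorCo,OperatorNa,SpudDate\n"
    "403300003,1,Idle,11838,W.G. Young,\n"
    '402120723,1-Mar,Idle,R4085,"Royale Energy, Inc.",12/7/1999\n'
    '402120521,1-Feb,Active,C8720,"Crain Orchards, Inc.",2/17/1988\n'
)
wells = load_wells(sample_csv)
print(wells.dtypes)
```

For the real file, the same helper could be called as `load_wells('source_data/CALGEM AllWells_20241113.csv')`. Values such as `1-Mar` in `WellNumber` suggest the column may have been date-mangled by a spreadsheet at some point, which is one more reason to read it as text and compare it against `WellDesign`.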


In [4]:
well_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241912 entries, 0 to 241911
Data columns (total 27 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   OID         241912 non-null  int64  
 1   API         241912 non-null  int64  
 2   LeaseName   241912 non-null  object 
 3   WellNumber  241912 non-null  object 
 4   WellDesign  241912 non-null  object 
 5   WellStatus  241912 non-null  object 
 6   WellType    241912 non-null  object 
 7   WellTypeLa  241912 non-null  object 
 8   OperatorCo  241912 non-null  object 
 9   OperatorNa  241912 non-null  object 
 10  FieldName   241912 non-null  object 
 11  AreaName    241912 non-null  object 
 12  Place       241912 non-null  object 
 13  District    241912 non-null  object 
 14  CountyName  241912 non-null  object 
 15  Section     241912 non-null  int64  
 16  Township    241912 non-null  object 
 17  Range       241912 non-null  object 
 18  BaseMeridi  241912 non-null  object 
 19  La

In [3]:
well_list_df.describe()

Unnamed: 0,OID,API,Section,Latitude,Longitude
count,241912.0,241912.0,241912.0,241912.0,241912.0
mean,-1.0,404136500.0,18.972134,35.314957,-119.46211
std,0.0,3268516.0,11.030148,1.003615,0.851529
min,-1.0,400100000.0,1.0,32.537811,-124.36367
25%,-1.0,402945200.0,9.0,35.077979,-119.735008
50%,-1.0,403017600.0,19.0,35.366459,-119.566372
75%,-1.0,403707300.0,29.0,35.485517,-118.990685
max,-1.0,429520100.0,36.0,41.812958,-114.572904
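To start probing the hypothesis about operator behavior, one simple option is a per-operator summary of idle wells. The sketch below is a rough first cut, not the final method: it assumes `WellStatus == "Idle"` is the right label for idle wells, measures well age from `SpudDate` to an assumed as-of date taken from the file name (2024-11-13), and cannot see how long a well has been idle, which the fee schedule also depends on. The function name `idle_summary_by_operator` is hypothetical, and the sample rows are the five wells from the `head()` preview.

```python
import pandas as pd

def idle_summary_by_operator(wells, as_of):
    """Per operator: well count, idle count, idle share, median age of idle wells (years)."""
    df = wells.copy()
    df["is_idle"] = df["WellStatus"].eq("Idle")
    # Age in years from spud date; wells with no spud date get NaN.
    df["age_years"] = (as_of - df["SpudDate"]).dt.days / 365.25
    # Keep the age only for idle wells so the median ignores active ones.
    df["idle_age_years"] = df["age_years"].where(df["is_idle"])
    summary = df.groupby("OperatorNa").agg(
        total_wells=("API", "size"),
        idle_wells=("is_idle", "sum"),
        median_idle_age_years=("idle_age_years", "median"),
    )
    summary["idle_share"] = summary["idle_wells"] / summary["total_wells"]
    return summary.sort_values("idle_share", ascending=False)

# The five wells from the head() preview, reduced to the columns used here.
preview = pd.DataFrame(
    {
        "API": ["403300003", "402120723", "402120734", "402120815", "402120521"],
        "WellStatus": ["Idle", "Idle", "Idle", "Idle", "Active"],
        "OperatorNa": [
            "W.G. Young",
            "Royale Energy, Inc.",
            "Royale Energy, Inc.",
            "Royale Energy, Inc.",
            "Crain Orchards, Inc.",
        ],
        "SpudDate": pd.to_datetime(
            [None, "12/7/1999", "5/13/2000", "11/6/2002", "2/17/1988"],
            format="%m/%d/%Y",
        ),
    }
)
summary = idle_summary_by_operator(preview, as_of=pd.Timestamp("2024-11-13"))
print(summary)
```

On the full dataset, operators with a high `idle_share` and an old `median_idle_age_years` would be the first ones to look at more closely, though total well count should be considered alongside the share so that single-well operators do not dominate the ranking.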
