<H4>Section 16.4 of Machine Learning Bootcamp: Capstone Step 2, Data Collection</H4>

Here I select 3 candidate datasets for the capstone project, describe their size and attributes, and explain what makes each a good potential candidate from my perspective. I'm looking forward to discussing them with a mentor, along with the pros and cons of each project roadmap.  

In [3]:
import numpy as np
import pandas as pd

<H4>Candidate 1: Idle wells at risk of fees or penalties in California</H4>
<p>In 2021, California passed new legislation that escalates the fines oil & gas operators must pay if they have not tested their idle wells as the law requires. Fees now rise steeply on a schedule based on the age of the well and how long it has been idle.<br></p>

<p><b>Problem statement:</b> Given that not all the data is available to accurately determine the specific status of a well, I think we can use machine learning to assess each company's exposure to the risk: i.e. have they been addressing their idle well burden by plugging & abandoning their idle well fleets, or are they "kicking the can down the road" and postponing until max pain arrives?<br></p> 

<p><b>Hypothesis:</b> I think the larger companies (Chevron, Aera, SPR, CRC) are likely addressing their own idle well situations in a way that avoids onerous fees and the potentially damaging optics of not following the law. However, I suspect there are a large number of smaller privately owned companies that have not kept up with their liabilities, and either don't have the cash flow to take care of them or are hoping to fly under the radar for a bit longer.</p>

<p><b>Data set:</b> The state publishes its list of all wells in California (241,912 data points, which exceeds the recommended minimum of 15k data samples), with the currently known status and the original drill (SPUD) date. That gives us the age of each well and the state's view of its status, which they have likely calculated from the monthly production reports the operators are legally bound to send them. I have heard anecdotally that it can take the state up to 2 years to update a well's status, so there's some room here for machine learning to predict a trend by operator.</p>
 
<p><b>File:</b> downloaded from state's website (filesize 55MB): <a href='https://gis.conservation.ca.gov/portal/home/item.html?id=0d30c4d9ac8f4f84a53a145e7d68eb6b'>linked here</a> and added to my git repo here: <a href='source_data/CALGEM AllWells_20241113.csv'>CALGEM AllWells_20241113.csv</a></p>

In [4]:
# load candidate data file 1 and take a peek
# will take some work to get the data loaded with proper datatypes, and dates set correctly 
well_list_df = pd.read_csv('source_data/CALGEM AllWells_20241113.csv')
well_list_df.head()

Unnamed: 0,OID,API,LeaseName,WellNumber,WellDesign,WellStatus,WellType,WellTypeLa,OperatorCo,OperatorNa,...,Range,BaseMeridi,Latitude,Longitude,GISSource,isConfiden,isDirectio,SpudDate,inHPZ,WellSymbol
0,-1,403300003,Lease by W.G. Young,1,1,Idle,DG,Dry Gas,11838,W.G. Young,...,09W,MD,38.976693,-122.833093,Notice of Intent to Drill,N,N,,Verified HPZ,IdleDG
1,-1,402120723,Kauai,1-Mar,Kauai 3-1,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.44064,-121.951889,GPS,N,N,12/7/1999,Not Within HPZ,IdleDG
2,-1,402120734,Lanai,3-Mar,Lanai 3-3,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.447174,-121.961273,GPS,N,N,5/13/2000,Uncertainty Area,IdleDG
3,-1,402120815,Anacapa,4-Mar,Anacapa 3-4,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.44323,-121.951042,GPS,N,N,11/6/2002,Not Within HPZ,IdleDG
4,-1,402120521,Angel Slough,1-Feb,Angel Slough 2-1,Active,DG,Dry Gas,C8720,"Crain Orchards, Inc.",...,01W,MD,39.449158,-121.946289,GPS,N,N,2/17/1988,Not Within HPZ,ActiveDG
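<p>The comment in the cell above notes that the datatypes and dates still need work. A minimal sketch of one way to do that follows, using three rows copied from the <code>head()</code> output and trimmed to four columns. The choice of which columns to read as strings, the month/day/year date format, and the use of the file name's date (2024-11-13) as the snapshot date are my assumptions. Separately, values such as <code>1-Mar</code> sitting next to a WellDesign of <code>Kauai 3-1</code> look like they may have been turned into dates by a spreadsheet at some point; reading the column as text does not undo that, it only stops further conversion.</p>

```python
import io
import pandas as pd

# Three rows copied from the head() output, trimmed to four columns.
sample_csv = io.StringIO(
    "API,WellNumber,WellStatus,SpudDate\n"
    "403300003,1,Idle,\n"
    "402120723,1-Mar,Idle,12/7/1999\n"
    "402120521,1-Feb,Active,2/17/1988\n"
)

# Read identifier-like columns as text so pandas does not treat them as numbers.
wells = pd.read_csv(
    sample_csv,
    dtype={"API": str, "WellNumber": str, "WellStatus": "category"},
)

# Parse month/day/year spud dates; blank or unparseable values become NaT.
wells["SpudDate"] = pd.to_datetime(
    wells["SpudDate"], format="%m/%d/%Y", errors="coerce"
)

# Assumption: use the date in the file name as the reference date for well age.
snapshot_date = pd.Timestamp("2024-11-13")
wells["AgeYears"] = (snapshot_date - wells["SpudDate"]).dt.days / 365.25

print(wells.dtypes)
print(wells)
```

<p>For the full file, the same <code>dtype</code> mapping and date parsing could be applied to <code>pd.read_csv('source_data/CALGEM AllWells_20241113.csv', ...)</code>.</p>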


In [5]:
well_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241912 entries, 0 to 241911
Data columns (total 27 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   OID         241912 non-null  int64  
 1   API         241912 non-null  int64  
 2   LeaseName   241912 non-null  object 
 3   WellNumber  241912 non-null  object 
 4   WellDesign  241912 non-null  object 
 5   WellStatus  241912 non-null  object 
 6   WellType    241912 non-null  object 
 7   WellTypeLa  241912 non-null  object 
 8   OperatorCo  241912 non-null  object 
 9   OperatorNa  241912 non-null  object 
 10  FieldName   241912 non-null  object 
 11  AreaName    241912 non-null  object 
 12  Place       241912 non-null  object 
 13  District    241912 non-null  object 
 14  CountyName  241912 non-null  object 
 15  Section     241912 non-null  int64  
 16  Township    241912 non-null  object 
 17  Range       241912 non-null  object 
 18  BaseMeridi  241912 non-null  object 
 19  La

In [6]:
well_list_df.describe()

Unnamed: 0,OID,API,Section,Latitude,Longitude
count,241912.0,241912.0,241912.0,241912.0,241912.0
mean,-1.0,404136500.0,18.972134,35.314957,-119.46211
std,0.0,3268516.0,11.030148,1.003615,0.851529
min,-1.0,400100000.0,1.0,32.537811,-124.36367
25%,-1.0,402945200.0,9.0,35.077979,-119.735008
50%,-1.0,403017600.0,19.0,35.366459,-119.566372
75%,-1.0,403707300.0,29.0,35.485517,-118.990685
max,-1.0,429520100.0,36.0,41.812958,-114.572904
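<p>As a first look at the hypothesis about operators, a per-operator idle share can be computed from the <code>OperatorNa</code> and <code>WellStatus</code> columns. The sketch below uses only the five operator/status pairs visible in the <code>head()</code> output, so the numbers mean nothing yet; the summary column names (<code>total_wells</code>, <code>idle_wells</code>, <code>idle_share</code>) are my own.</p>

```python
import pandas as pd

# Operator and status pairs taken from the five rows shown by head().
wells = pd.DataFrame(
    {
        "OperatorNa": [
            "W.G. Young",
            "Royale Energy, Inc.",
            "Royale Energy, Inc.",
            "Royale Energy, Inc.",
            "Crain Orchards, Inc.",
        ],
        "WellStatus": ["Idle", "Idle", "Idle", "Idle", "Active"],
    }
)

# Flag idle wells, then count and average the flag per operator.
operator_summary = (
    wells.assign(is_idle=wells["WellStatus"].eq("Idle"))
    .groupby("OperatorNa")["is_idle"]
    .agg(total_wells="size", idle_wells="sum", idle_share="mean")
    .sort_values(["idle_share", "total_wells"], ascending=False)
)

print(operator_summary)
```

<p>On the full dataset, the same table could be joined with well age to see which operators hold the oldest idle wells.</p>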


<H4>Candidate 2: Image recognition for pest control efforts</H4>
<p>In my city we have a lot of issues with Norwegian roof rats in house attics. Using a 24/7 video surveillance system, it should be possible to use motion detection to grab still pictures for pest identification. I would use something like a VGG framework with transfer learning to differentiate pictures with and without pests. For example, local birds and squirrels would not be classified as pests, but mice and rats would be. In a real world application, the users would only be notified when a real pest is identified, not other animals (or shadows from trees/clouds/etc.)<br></p>

<p><b>Problem statement: </b> Many current services like Google Nest and Ring camera offer cloud-based services for motion detection and sending notifications if people are detected, but I haven't seen any popular services for pest control.<br></p> 

<p><b>Hypothesis: </b> Using pre-trained model frameworks, such as VGG16, it should be fairly straightforward to do some image analysis for a class of small animals such as rats, mice, raccoons, and squirrels, determine frequency of visitation, and get pretty decent accuracy on pest vs. not-pest classification.</p>

<p><b>Data set: </b> Self-provided still images from Google Nest camera captures, classified with an ImageNet-trained VGG16 or similar framework.</p>
 
<p><b>Files: </b>VGG built in with TensorFlow (filesize TBD): <a href='https://keras.io/api/applications/vgg/'>VGG source network here</a>, source testing pictures from my cameras: <a href='source_pics/'>added to git here</a></p>

In [7]:
# No code yet for candidate 2, though it would probably be very similar to MiniProject "MLE_MiniProject_Fine_Tuning"
# Can add some code to load the model and the pictures later
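
<p>One piece that does not depend on the model is the motion-detection step that picks which frames to keep as stills. Below is a minimal sketch using plain frame differencing on grayscale frames with NumPy. This is a simple stand-in technique I chose for illustration, not how Nest or Ring detect motion, and the function names and both thresholds are hypothetical values that would need tuning on real footage.</p>

```python
import numpy as np


def motion_score(prev_frame, frame, pixel_threshold=25):
    """Fraction of pixels whose brightness changed by more than pixel_threshold."""
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float((diff > pixel_threshold).mean())


def frames_with_motion(frames, min_changed_fraction=0.01):
    """Indices of frames that differ enough from the frame just before them."""
    hits = []
    for i in range(1, len(frames)):
        if motion_score(frames[i - 1], frames[i]) >= min_changed_fraction:
            hits.append(i)
    return hits


# Synthetic three-frame grayscale clip: a bright 10x10 patch appears in the last frame.
background = np.full((48, 64), 40, dtype=np.uint8)
with_object = background.copy()
with_object[10:20, 10:20] = 200

clip = [background, background.copy(), with_object]
hits = frames_with_motion(clip)
print(hits)
```

<p>Only the frames flagged here would be passed on to the classifier, which keeps shadows and static scenes from being scored at all.</p>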

<H4>Candidate 3: Impact of government subsidies on renewable energy deployment</H4>
<p>In the US, and California in particular, there have been several rounds of subsidies over the years, at both the federal and state levels. With consumer solar deployment numbers, it should be possible to forecast the rate of solar (and maybe other renewable energy) adoption over time as a function of subsidies.<br></p>

<p><b>Problem statement: </b>Government subsidies (in the form of tax breaks in the US) promote solar (and maybe other renewable) energy construction and adoption.<br></p> 

<p><b>Proposal: </b>Using publicly available data, build a forecasting model that predicts the adoption of solar when government subsidies/incentives are in effect. This would show the rate of adoption in the future with and without subsidies.</p>

<p><b>Data set: </b>EIA data for 24k solar projects in the US (so I think it meets the recommended 15k samples), time built and generation capacity. Subsidy years would have to be manually constructed in a target column as 0=No, 1=Yes. I expect this information to be readily available either on the EIA website, or Wikipedia.</p>
 
<p><b>Files: </b>Solar data (filesize 12 MB) <a href='source_data/eia_september_generator2024.xlsx'>Solar data over time</a> and the source is <a href='https://www.eia.gov/electricity/data/eia860m/'>EIA website with solar data link</a>. And for my notes, a relevant <a href='https://www.nrel.gov/docs/fy24osti/90042.pdf'>2024 paper from NREL</a> on the state of things.</p> 
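
<p>The 0/1 subsidy target column described above could be built from the <code>Operating Year</code> column once the subsidy years are known. A minimal sketch follows; the years in <code>PLACEHOLDER_SUBSIDY_YEARS</code> are made up for illustration and are not real policy dates, and the column name <code>SubsidyActive</code> is my own.</p>

```python
import pandas as pd

# Placeholder years only: these are NOT real subsidy dates and must be
# replaced with years researched from EIA or other sources.
PLACEHOLDER_SUBSIDY_YEARS = {2010, 2011, 2012, 2016}

# Toy stand-in for the project table; only the year column is needed here.
projects = pd.DataFrame({"Operating Year": [1986, 2010, 2012, 2015, 2016]})

# 1 = a subsidy was in effect in the operating year, 0 = it was not.
projects["SubsidyActive"] = (
    projects["Operating Year"].isin(PLACEHOLDER_SUBSIDY_YEARS).astype(int)
)

print(projects)
```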

In [8]:
# I exported the first tab of the EIA solar data to a CSV file for consuming with pandas:
solar_ops_df = pd.read_csv('source_data/eia_september_generator2024.csv')
solar_ops_df

Unnamed: 0,Entity ID,Entity Name,Plant ID,Plant Name,Google Map,Bing Map,Plant State,County,Balancing Authority Code,Sector,...,Nameplate Energy Capacity (MWh),DC Net Capacity (MW),Planned Derate Year,Planned Derate Month,Planned Derate of Summer Capacity (MW),Planned Uprate Year,Planned Uprate Month,Planned Uprate of Summer Capacity (MW),Latitude,Longitude
0,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
1,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
2,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
3,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
4,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26402,63359,"REA Investments, LLC",67947,"Ward Solar 1, LLC",Map,Map,MA,Plymouth,ISNE,IPP Non-CHP,...,,,,,,,,,41.887790,-70.74620
26403,4363,Corn Belt Power Coop,67948,Hampton Substation Energy Storage,Map,Map,IA,Franklin,SWPP,Electric Utility,...,,,,,,,,,42.687200,-93.23210
26404,803,Arizona Public Service Co,67964,Agave,Map,Map,AZ,Maricopa,AZPS,Electric Utility,...,,,,,,,,,33.323890,-112.83970
26405,66518,180th Fighter Wing,67967,Toledo Air National Guard,Map,Map,OH,Lucas,PJM,Commercial Non-CHP,...,,,,,,,,,41.586376,-83.78781


In [10]:
solar_ops_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26407 entries, 0 to 26406
Data columns (total 33 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Entity ID                               26407 non-null  int64  
 1   Entity Name                             26407 non-null  object 
 2   Plant ID                                26407 non-null  int64  
 3   Plant Name                              26407 non-null  object 
 4   Google Map                              26407 non-null  object 
 5   Bing Map                                26407 non-null  object 
 6   Plant State                             26407 non-null  object 
 7   County                                  26407 non-null  object 
 8   Balancing Authority Code                25655 non-null  object 
 9   Sector                                  26407 non-null  object 
 10  Generator ID                            26407 non-null  ob

In [11]:
solar_ops_df.describe()

Unnamed: 0,Entity ID,Plant ID,Operating Month,Operating Year,Latitude,Longitude
count,26407.0,26407.0,26407.0,26407.0,26407.0,26407.0
mean,37436.746544,40361.264854,6.78642,1996.883364,38.956246,-94.016901
std,24749.585047,26775.123317,3.612453,27.598854,6.169855,18.742644
min,7.0,1.0,1.0,1891.0,18.9742,-171.7124
25%,12989.0,6388.0,4.0,1986.0,34.899445,-104.6712
50%,54842.0,56226.0,7.0,2007.0,39.648565,-90.4211
75%,61038.0,61347.5,10.0,2017.0,42.579489,-78.89158
max,66541.0,67994.0,12.0,2024.0,71.292,-67.4012
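
<p>The summary above shows operating years going back to 1891, and the first rows do not look like solar plants, so the file appears to cover more than solar and would need filtering before any forecasting. A minimal sketch of that filter and a yearly additions series is below. The <code>Technology</code> column and its values are assumptions on my part, since the <code>info()</code> output above is cut off before that column would appear; the toy rows are illustrative and not taken from the file.</p>

```python
import pandas as pd

# Toy rows for illustration; "Technology" and its values are assumed, not
# confirmed from the truncated info() output.
generators = pd.DataFrame(
    {
        "Technology": [
            "Solar Photovoltaic",
            "Solar Photovoltaic",
            "Petroleum Liquids",
            "Solar Photovoltaic",
            "Batteries",
        ],
        "Operating Year": [2016, 2017, 1986, 2017, 2024],
    }
)

# Keep only rows whose technology label mentions solar.
solar = generators[generators["Technology"].str.contains("Solar", na=False)]

# New solar generators per year, and the running total over time.
additions = solar.groupby("Operating Year").size().rename("new_generators")
cumulative = additions.cumsum().rename("cumulative_generators")

adoption = pd.concat([additions, cumulative], axis=1)
print(adoption)
```

<p>A series like <code>adoption</code> is the kind of input a forecasting model would take, with the subsidy flag added as an explanatory column.</p>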
