# Section 20.6.5 of Machine Learning Bootcamp : Capstone Step 5, Data Wrangling and EDA

Here I will focus on my chosen dataset for the capstone project, using Exploratory Data Analysis methods and Data Wrangling techniques to prepare and better understand the data. The prior step of the capstone had me find a similar project, but there it appears all the data was already wrangled and cleaned (for the most part) by whoever is responsible for managing the city's data. Also, the writers of the paper seem to have received quite a bit of guidance about how to create the features used in their model, so here I will have to determine all that myself.  

In [3]:
import numpy as np
import pandas as pd

## Dataset: Idle wells at risk of fees or penalties in California
<p>To recap, in 2021, new legislation was passed in California that escalates the fines oil & gas operators must pay if they have not tested their idle wells according to the law. There is now a dramatic escalation of fees based on a schedule of the age of the well, and the duration of idleness.<br></p>

<p><b>Problem statement:</b> Given that not all the data is available to accurately determine the specific status of a well, I think we can use machine learning to assess each company's exposure to the risk: i.e. have they been addressing their idle well burden by plugging & abandoning their idle well fleets, or are they "kicking the can down the road" and postponing until max pain arrives?<br></p> 

<p><b>Hypothesis:</b> I think the larger companies (Chevron, Aera, SPR, CRC) are likely addressing their own idle well situations in a way that avoids onerous fees and the potentially damaging optics of not following the law. However, I suspect there are a large number of smaller privately owned companies that have not kept up with their liabilities, and either don't have the cash flow to take care of it, or are hoping to fly under the radar for a bit longer.</p>

<p><b>Data set:</b> the state publishes its view of the list of all wells in the state (241,912 data points, each of which represents a physical well), with the currently known status and the original drill (SPUD) date. This gives us the age of each well and the state's view of its status, which they have likely calculated based on the monthly production reports the operators are legally bound to send them. I have heard anecdotally that it can take the state up to 2 years to update the well status, so there's some room here for machine learning to predict a trend by operator.</p>
 
<p><b>File:</b> downloaded from state's website (filesize 55MB): <a href='https://gis.conservation.ca.gov/portal/home/item.html?id=0d30c4d9ac8f4f84a53a145e7d68eb6b'>linked here</a> and added to my git repo here: <a href='source_data/CALGEM AllWells_20241113.csv'>CALGEM AllWells_20241113.csv</a></p>
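
As a first rough look at the hypothesis above, one option is a per-operator summary of how much of each operator's unplugged fleet is idle. This is only a sketch on made-up rows: the operator names are placeholders, the `"Plugged"` label is an assumption about how that status is spelled in `WellStatus`, and `idle_share_by_operator` is a hypothetical helper, not part of the state's data or any library.

```python
import pandas as pd

# Toy rows only; operator names and statuses here are made up for illustration.
toy = pd.DataFrame({
    "OperatorNa": ["Operator A", "Operator A", "Operator A", "Operator B", "Operator B"],
    "WellStatus": ["Active", "Idle", "Plugged", "Idle", "Idle"],
})

def idle_share_by_operator(df):
    """Fraction of each operator's unplugged (Active or Idle) wells that are Idle."""
    open_wells = df[df["WellStatus"].isin(["Active", "Idle"])]
    is_idle = open_wells["WellStatus"].eq("Idle")
    # Mean of a boolean column per group is the idle fraction for that operator.
    return is_idle.groupby(open_wells["OperatorNa"]).mean().sort_values(ascending=False)

idle_share = idle_share_by_operator(toy)
print(idle_share)
```

A share close to 1 would suggest an operator whose remaining wells are mostly idle, though on its own it says nothing about how long they have been idle.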

In [4]:
# Load candidate data file 1 and take a peek.
# Getting proper datatypes and correctly parsed dates will take some work; for now, read with pandas defaults.
# The API number should be unique and is the well's identifier if other data gets joined later, so it is the natural choice of index.
well_list_df = pd.read_csv('source_data/CALGEM AllWells_20241113.csv')
well_list_df.head()

Unnamed: 0,OID,API,LeaseName,WellNumber,WellDesign,WellStatus,WellType,WellTypeLa,OperatorCo,OperatorNa,...,Range,BaseMeridi,Latitude,Longitude,GISSource,isConfiden,isDirectio,SpudDate,inHPZ,WellSymbol
0,-1,403300003,Lease by W.G. Young,1,1,Idle,DG,Dry Gas,11838,W.G. Young,...,09W,MD,38.976693,-122.833093,Notice of Intent to Drill,N,N,,Verified HPZ,IdleDG
1,-1,402120723,Kauai,1-Mar,Kauai 3-1,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.44064,-121.951889,GPS,N,N,12/7/1999,Not Within HPZ,IdleDG
2,-1,402120734,Lanai,3-Mar,Lanai 3-3,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.447174,-121.961273,GPS,N,N,5/13/2000,Uncertainty Area,IdleDG
3,-1,402120815,Anacapa,4-Mar,Anacapa 3-4,Idle,DG,Dry Gas,R4085,"Royale Energy, Inc.",...,01W,MD,39.44323,-121.951042,GPS,N,N,11/6/2002,Not Within HPZ,IdleDG
4,-1,402120521,Angel Slough,1-Feb,Angel Slough 2-1,Active,DG,Dry Gas,C8720,"Crain Orchards, Inc.",...,01W,MD,39.449158,-121.946289,GPS,N,N,2/17/1988,Not Within HPZ,ActiveDG
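
The loading comments above mention proper datatypes, dates, and using API as the index. Here is a minimal sketch of one way that could look, run on three rows copied from the preview rather than the real file. `load_wells` is a hypothetical helper, and padding API to 10 characters is an assumption based on the width given in the data dictionary.

```python
import io
import pandas as pd

def load_wells(source):
    """Read the well list with explicit dtypes, parsed spud dates, and API as the index."""
    df = pd.read_csv(
        source,
        # Identifiers are codes, not quantities, so keep them as strings.
        dtype={"API": "string", "OperatorCo": "string"},
    )
    # Assumption: 9-digit values are 10-character API numbers that lost a leading zero.
    df["API"] = df["API"].str.zfill(10)
    # Dates in the preview look like 12/7/1999; blanks and unparseable values become NaT.
    df["SpudDate"] = pd.to_datetime(df["SpudDate"], format="%m/%d/%Y", errors="coerce")
    # verify_integrity raises if any API value is duplicated.
    return df.set_index("API", verify_integrity=True)

# Stand-in for the real file: three rows from the preview above, trimmed to a few columns.
sample = io.StringIO(
    "OID,API,WellStatus,OperatorCo,SpudDate\n"
    "-1,403300003,Idle,11838,\n"
    "-1,402120723,Idle,R4085,12/7/1999\n"
    "-1,402120734,Idle,R4085,5/13/2000\n"
)
wells = load_wells(sample)
print(wells.dtypes)
```

On the real data the same function could be pointed at `'source_data/CALGEM AllWells_20241113.csv'`; if the API column turns out not to be unique, `set_index` will raise and the duplicates can be inspected first.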


In [5]:
well_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241912 entries, 0 to 241911
Data columns (total 27 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   OID         241912 non-null  int64  
 1   API         241912 non-null  int64  
 2   LeaseName   241912 non-null  object 
 3   WellNumber  241912 non-null  object 
 4   WellDesign  241912 non-null  object 
 5   WellStatus  241912 non-null  object 
 6   WellType    241912 non-null  object 
 7   WellTypeLa  241912 non-null  object 
 8   OperatorCo  241912 non-null  object 
 9   OperatorNa  241912 non-null  object 
 10  FieldName   241912 non-null  object 
 11  AreaName    241912 non-null  object 
 12  Place       241912 non-null  object 
 13  District    241912 non-null  object 
 14  CountyName  241912 non-null  object 
 15  Section     241912 non-null  int64  
 16  Township    241912 non-null  object 
 17  Range       241912 non-null  object 
 18  BaseMeridi  241912 non-null  object 
 19  La

In [6]:
well_list_df.describe()

Unnamed: 0,OID,API,Section,Latitude,Longitude
count,241912.0,241912.0,241912.0,241912.0,241912.0
mean,-1.0,404136500.0,18.972134,35.314957,-119.46211
std,0.0,3268516.0,11.030148,1.003615,0.851529
min,-1.0,400100000.0,1.0,32.537811,-124.36367
25%,-1.0,402945200.0,9.0,35.077979,-119.735008
50%,-1.0,403017600.0,19.0,35.366459,-119.566372
75%,-1.0,403707300.0,29.0,35.485517,-118.990685
max,-1.0,429520100.0,36.0,41.812958,-114.572904
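
In the summary above, OID has a min, max, and mean of -1 with a std of 0, so every row carries the same value and the column cannot act as an identifier. A quick check like the following sketch flags constant columns and confirms whether API is unique. The toy frame reuses API values from the preview; the Section values are placeholders.

```python
import pandas as pd

# Toy frame: API values come from the preview above, Section values are placeholders.
toy = pd.DataFrame({
    "OID": [-1, -1, -1],
    "API": [403300003, 402120723, 402120734],
    "Section": [1, 3, 3],
})

# A column with a single distinct value (counting NaN as a value) carries no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# API is meant to be the well identifier, so it should have no repeats.
api_is_unique = toy["API"].is_unique

print(constant_cols, api_is_unique)
```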


### Metadata & Data Dictionary

This is my current understanding of the fields in the CSV data file:

|Field Name| Data type | Width | description of record                                                                                                                                                                                |
|---|-----------|-------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|OID| 	Integer  | 	4    | 	Internal feature number OBJECTID: Sequential unique whole numbers that are automatically generated.                                                                                                 |
|API| 	String   | 	10   | 	API Number. Unique and permanent identifier assigned to each well as standardized by the American Petroleum Institute                                                                               |
|LeaseName| 	String   | 	109  | 	Lease name of oil and gas wells. When the lease name of a well is unknown, the location of the well (i.e. section, township, range, longitude, latitude) is recorded instead. |
|WellNumber| 	String   | 	50   | 	Operator-assigned alpha numeric designation for well.                                                                                                                                               |
|WellDesign| 	String   | 	     | 	???                                                                                                                                                                                                 |
|WellStatus| 	String   | 	     | 	Well Status - single code identifying current status of well  (see below)                                                                                                                           |
|WellType| 	String   | 	     | 	Codes indicating well type  (see below)                                                                                                                                                             |
|WellTypeLa| 	String   | 	     | 	Likely well type label                                                                                                                                                                              |
|OperatorCo| 	String   | 	12   | 	Operator code. Unique identifier number assigned to each operator                                                                                                                                   |
|OperatorNa| 	String   | 	100  | 	Operator Name. Name of individual or company or organization responsible for management and operations of wells.                                                                                    |
|FieldName| 	String   | 	255  | 	Name of the oil and gas field in which the well is located                                                                                                                                          |
|AreaName| 	String   | 	255  | 	Name of area in which well is located                                                                                                                                                               |
|Place| 	String   | 	     | 	Keyword                                                                                                                                                                                             |
|District| 	String   | 	255  | 	DOGGR district with jurisdiction over the location in which well is located                                                                                                                         |
|CountyName| 	String   | 	50   | 	County with jurisdiction over the location in which well is located                                                                                                                                 |
|Section| 	Integer  | 	50   | 	Public Land Survey System section number in which well is located                                                                                                                                   |
|Township| 	String   | 	100  | 	Public Land Survey System township in which well is located                                                                                                                                         |
|Range| 	String   | 	100  | 	Public Land Survey System range in which well is located                                                                                                                                            |
|BaseMeridi| 	String   | 	50   | 	Base Meridian. Principal meridians required for all California surveys; defines PLSS base; H = Humboldt; MD = Mount Diablo; SB = San Bernardino                                                     |
|Latitude| 	Double   | 	8    | 	Latitude of well in the NAD83 coordinate system; decimal degree format                                                                                                                              |
|Longitude| 	Double   | 	8    | 	Longitude of well in the NAD83 coordinate system; decimal degree format                                                                                                                             |
|GISSource| 	String   | 	3    | 	3-character code describing the method by which the well location was established (see below)                                                                                                       |
|isConfiden| 	String   | 	1    | 	Confidential Well. Subsurface information for well is held confidential for a period of two years pursuant to Public Resources Code section 3234                                                    |
|isDirectio| 	String   | 	1    | 	Directionally Drilled. Indicator of whether well was directionally drilled; NULL for confidential wells                                                                                             |
|SpudDate| 	String   | 	8    | 	Date on which well drilling commenced                                                                                                                                                               |
|inHPZ| 	String   | 	26   | 	Well Intersection with Health Protection Zone (HPZ) (see below)                                                                                                                                     |
|WellSymbol| 	String   | 	15   | 	Code for GIS symbology (see below)|

### Additional Codes:

**Well Status**: single code identifying the current status of a well. The codes stand for:
A = Active (well has been drilled and completed and in-use)
B = Buried (well is buried and idle)
C = Cancelled (well permit has been cancelled prior to drilling)
I = Idle (well is idle, not produced or injected for 6 consecutive months for two years)
N = New (recently permitted well; planned or in the process of being drilled)
P = Plugged & Abandoned (well has been plugged and abandoned to current standards)
U = Unknown (well status not known; mostly older wells dating from before 1976)

**Well Type Codes** 
AI = Air Injector
CH = Core Hole
DG = Dry Gas
DH = Dry Hole
GAS = Gas
GD = Gas Disposal
GI = Gas Injection
GS = Gas Storage
INJ = Injection
LG = Liquid Petroleum Gas
Multi = Multiple Types
OB = Observation
OG = Oil & Gas
PM = Pressure Maintenance
SC = Cyclic Steam
SF = Steam Flood
UNK = Unknown well type; often a historic or legacy (pre-1976) well
WD = Water Disposal
WF = Water Flood
WS = Water Source

**GIS Source**: 3-character code describing the method by which the well location was established
(Ranked from most accurate to least accurate)
GPS = Global Positioning System (Coordinates derived from Division staff and Trimble GPS unit)
OPR = Operator (Coordinates provided by Operator via electronic format; ex. Excel, db, etc.)
SUM = Well Summary Report (Coordinates provided by Operator, post-drilling, on SUM)
NOI = Notice of Intent to Drill (Coordinates provided by Operator, pre-drilling, on NOI)
DOQ = Digital Ortho Quad (Coordinates derived from aerial imagery)
MIP = MapInfo Plotted (Coordinates generated from tool in MapInfo using corner call locations)
HUD = Heads Up Digitized (Coordinates derived from scanned, georeferenced Mylar maps)
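
Because the GIS Source codes above are ranked from most to least accurate, they can be stored as an ordered categorical so that comparisons and sorting respect that ranking. This is a sketch only: `to_ordered_gis_source` is a hypothetical helper, and `LABEL_TO_CODE` is an assumed normalization map, included because the preview shows at least one spelled-out label ("Notice of Intent to Drill") alongside the short code "GPS".

```python
import pandas as pd

# Codes in the order given above, most accurate first.
GIS_ACCURACY_ORDER = ["GPS", "OPR", "SUM", "NOI", "DOQ", "MIP", "HUD"]

# Assumed mapping from spelled-out labels to codes; extend after inspecting the real values.
LABEL_TO_CODE = {"Notice of Intent to Drill": "NOI"}

def to_ordered_gis_source(series):
    """Return GISSource as an ordered categorical; unrecognized values become NaN."""
    codes = series.replace(LABEL_TO_CODE)
    cat = pd.Categorical(codes, categories=GIS_ACCURACY_ORDER, ordered=True)
    return pd.Series(cat, index=series.index)

toy = pd.Series(["Notice of Intent to Drill", "GPS", "HUD"])
gis = to_ordered_gis_source(toy)

# With "most accurate" first, <= "NOI" means "at least as accurate as NOI".
at_least_noi = gis <= "NOI"
print(gis.cat.codes.tolist(), at_least_noi.tolist())
```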

**InHPZ**
Potentially HPZ = well intersects with HPZs created algorithmically from source inputs representing sensitive receptors.
Verified HPZ = well intersects with HPZs created from verified sensitive receptors that have been quality checked by CalGEM.
Uncertainty Area = well falls within 32ft outside or inside a potential HPZ or verified HPZ.
Not Within HPZ = well intersects with areas quality checked to be outside known HPZ areas.
Potentially Not Within HPZ = well is outside Potential HPZ, Verified HPZ, Not Within HPZ, and Uncertainty Area.

Another source for data glossary and dictionary: https://www.conservation.ca.gov/calgem/Documents/Glossary%20UA.pdf

### Well Symbol Definition
Pulled from XML document here: https://gis.conservation.ca.gov/server/rest/services/WellSTAR/Wells/MapServer/0
Total Count = 241,912 wells

|Code|	Count|	Desc|
|---|---|---|
|ActiveCH|	0|	Active Core Hole|
|ActiveDG|	887|	Active Gas: Dry Gas; Liquid Gas|
|ActiveDH|	0|	Active Dry Hole|
|ActiveGD|	60|	Active Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|ActiveGS|	255|	Active Gas Storage|
|ActiveINJ|	3|	Active Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|ActiveMulti|	45|	Active Multipurpose|
|ActiveOB|	3127|	Active Observation|
|ActiveOG|	37998|	Active Oil and Gas|
|ActivePM|	96|	Active Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|ActiveSC|	7674|	Active Cyclic Steam|
|ActiveSF|	3757|	Active Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|ActiveSTR|	1|	???|
|ActiveUNK|	2|	Active well of Unknown type|
|ActiveWD|	776|	Active Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|ActiveWF|	3986|	Active Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|ActiveWS|	16|	Active Water Source|
|CanceledCH|	12|	Canceled Core Hole|
|CanceledDG|	97|	Canceled Gas: Dry Gas; Liquid Gas|
|CanceledDH|	6|	Canceled Dry Hole|
|CanceledGAS|	143|	Canceled Gas: Dry Gas; Liquid Gas|
|CanceledGS|	5|	Canceled Gas Storage|
|CanceledINJ|	669|	Canceled Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|CanceledMulti|	96|	Canceled Multipurpose|
|CanceledOB|	123|	Canceled Observation|
|CanceledOG|	7786|	Canceled Oil and Gas|
|CanceledSC|	306|	Canceled Cyclic Steam|
|CanceledSF|	186|	Canceled Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|CanceledUNK|	58|	Canceled well of Unknown type|
|CanceledWD|	23|	Canceled Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|CanceledWF|	165|	Canceled Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|CanceledWS|	10|	Canceled Water Source|
|IdleAI|	8|	Idle Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|IdleCH|	127|	Idle Core Hole|
|IdleDG|	1320|	Idle Gas: Dry Gas; Liquid Gas|
|IdleDH|	173|	Idle Dry Hole|
|IdleGAS|	5|	Idle Gas: Dry Gas; Liquid Gas|
|IdleGD|	37|	Idle Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|IdleGS|	27|	Idle Gas Storage|
|IdleINJ|	35|	Idle Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|IdleLG|	1|	Idle Gas: Dry Gas; Liquid Gas|
|IdleMulti|	109|	Idle Multipurpose|
|IdleOB|	676|	Idle Observation|
|IdleOG|	28372|	Idle Oil and Gas|
|IdlePM|	28|	Idle Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|IdleSC|	1854|	Idle Cyclic Steam|
|IdleSF|	3836|	Idle Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|IdleUNK|	56|	Idle well of Unknown type|
|IdleWD|	848|	Idle Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|IdleWF|	2169|	Idle Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|IdleWS|	105|	Idle Water Source|
|NewCH|	109|	Permitted Core Hole|
|NewDG|	1|	Permitted Gas: Dry Gas; Liquid Gas|
|NewDH|	6|	Permitted Dry Hole|
|NewGS|	12|	Permitted Gas Storage|
|NewINJ|	7|	Permitted Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|NewOB|	347|	Permitted Observation|
|NewOG|	1011|	Permitted Oil and Gas|
|NewSC|	227|	Permitted Cyclic Steam|
|NewSF|	108|	Permitted Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|NewSTR|	1|	???|
|NewWD|	17|	Permitted Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|NewWF|	8|	Permitted Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedAI|	83|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedCH|	1319|	Plugged Core Hole|
|PluggedDG|	2809|	Plugged Gas: Dry Gas; Liquid Gas|
|PluggedDH|	16587|	Plugged Dry Hole|
|PluggedGAS|	1268|	Plugged Gas: Dry Gas; Liquid Gas|
|PluggedGD|	21|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedGS|	168|	Plugged Gas Storage|
|PluggedINJ|	868|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedLG|	5|	Plugged Gas: Dry Gas; Liquid Gas|
|PluggedMulti|	2302|	Plugged Multipurpose|
|PluggedOB|	728|	Plugged Observation|
|PluggedOG|	91850|	Plugged Oil and Gas|
|PluggedOnlyOB|	1|	Plugged Observation|
|PluggedOnlyOG|	48|	Plugged Oil and Gas|
|PluggedOnlySC|	1|	Plugged Cyclic Steam|
|PluggedOnlySF|	3|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedOnlyWF|	17|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedPM|	18|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedSC|	2127|	Plugged Cyclic Steam|
|PluggedSF|	5056|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedUNK|	672|	Plugged well of Unknown type|
|PluggedWD|	790|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedWF|	5006|	Plugged Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|PluggedWS|	125|	Plugged Water Source|
|UnknownDH|	1|	Unknown status Dry Hole|
|UnknownINJ|	9|	Unknown status Injectors: Air Injector; Gas Disposal; Pressure Maintenance; Steam Flood; Water Disposal; Water Flood|
|UnknownOG|	20|	Unknown status Oil and Gas|
|UnknownUNK|	2|	Unknown status well of Unknown type|
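
Every code in the table above looks like a status prefix followed by a well type code, so WellSymbol can be split back into its two parts and compared against the WellStatus and WellType columns as a consistency check. A minimal sketch, assuming the prefixes seen in the table are the complete set; the column names `SymbolStatus` and `SymbolType` are made up for this example.

```python
import pandas as pd

# "PluggedOnly" must come before "Plugged" so the longer prefix is tried first.
STATUS_PREFIXES = ["PluggedOnly", "Plugged", "Active", "Canceled", "Idle", "New", "Unknown"]
pattern = r"^(?P<SymbolStatus>" + "|".join(STATUS_PREFIXES) + r")(?P<SymbolType>.+)$"

# A few codes taken from the table above.
symbols = pd.Series(["IdleDG", "ActiveOG", "PluggedOnlyWF", "UnknownUNK", "CanceledMulti", "PluggedOB"])

# str.extract with named groups returns one column per group.
parts = symbols.str.extract(pattern)
print(parts)
```

Rows where the extracted status or type disagrees with the dedicated columns, or where the pattern does not match at all (both parts NaN), would be worth a closer look.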



# Data Wrangling

In [8]:
# I exported the first tab of the EIA solar data to a CSV file for consuming with pandas:
solar_ops_df = pd.read_csv('source_data/eia_september_generator2024.csv')
solar_ops_df

Unnamed: 0,Entity ID,Entity Name,Plant ID,Plant Name,Google Map,Bing Map,Plant State,County,Balancing Authority Code,Sector,...,Nameplate Energy Capacity (MWh),DC Net Capacity (MW),Planned Derate Year,Planned Derate Month,Planned Derate of Summer Capacity (MW),Planned Uprate Year,Planned Uprate Month,Planned Uprate of Summer Capacity (MW),Latitude,Longitude
0,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
1,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
2,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
3,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
4,63560,"Sand Point Generating, LLC",1,Sand Point,Map,Map,AK,Aleutians East,,Electric Utility,...,,,,,,,,,55.339722,-160.49720
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26402,63359,"REA Investments, LLC",67947,"Ward Solar 1, LLC",Map,Map,MA,Plymouth,ISNE,IPP Non-CHP,...,,,,,,,,,41.887790,-70.74620
26403,4363,Corn Belt Power Coop,67948,Hampton Substation Energy Storage,Map,Map,IA,Franklin,SWPP,Electric Utility,...,,,,,,,,,42.687200,-93.23210
26404,803,Arizona Public Service Co,67964,Agave,Map,Map,AZ,Maricopa,AZPS,Electric Utility,...,,,,,,,,,33.323890,-112.83970
26405,66518,180th Fighter Wing,67967,Toledo Air National Guard,Map,Map,OH,Lucas,PJM,Commercial Non-CHP,...,,,,,,,,,41.586376,-83.78781


In [10]:
solar_ops_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26407 entries, 0 to 26406
Data columns (total 33 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Entity ID                               26407 non-null  int64  
 1   Entity Name                             26407 non-null  object 
 2   Plant ID                                26407 non-null  int64  
 3   Plant Name                              26407 non-null  object 
 4   Google Map                              26407 non-null  object 
 5   Bing Map                                26407 non-null  object 
 6   Plant State                             26407 non-null  object 
 7   County                                  26407 non-null  object 
 8   Balancing Authority Code                25655 non-null  object 
 9   Sector                                  26407 non-null  object 
 10  Generator ID                            26407 non-null  ob

In [11]:
solar_ops_df.describe()

Unnamed: 0,Entity ID,Plant ID,Operating Month,Operating Year,Latitude,Longitude
count,26407.0,26407.0,26407.0,26407.0,26407.0,26407.0
mean,37436.746544,40361.264854,6.78642,1996.883364,38.956246,-94.016901
std,24749.585047,26775.123317,3.612453,27.598854,6.169855,18.742644
min,7.0,1.0,1.0,1891.0,18.9742,-171.7124
25%,12989.0,6388.0,4.0,1986.0,34.899445,-104.6712
50%,54842.0,56226.0,7.0,2007.0,39.648565,-90.4211
75%,61038.0,61347.5,10.0,2017.0,42.579489,-78.89158
max,66541.0,67994.0,12.0,2024.0,71.292,-67.4012
