# Making a data set for Montana production

## Task
Combine and reformat two large files: one of well information and one of well production.

## Technical issues encountered
- Loading large files into Colab
- Handling "tab" delimited data
- Handling some formatting errors
- Having the appropriate data for the task
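
The notebook does not record how the large-file issue was resolved. One common option, sketched here as an assumption rather than the method actually used, is to read a tab-delimited file in chunks and aggregate as you go. The tiny in-memory file and its values are hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large tab-delimited production file.
raw = (
    "API_WELLNO\tBBLS_OIL_COND\n"
    "25003050000000\t10\n"
    "25003050000000\t5\n"
    "25003050010000\t7\n"
)

# Read two rows at a time; keep API numbers as strings, not numbers.
chunks = pd.read_csv(
    io.StringIO(raw), sep="\t", dtype={"API_WELLNO": "str"}, chunksize=2
)

# Sum within each chunk, then combine the partial sums per well.
partial = [c.groupby("API_WELLNO")["BBLS_OIL_COND"].sum() for c in chunks]
totals = pd.concat(partial).groupby(level=0).sum()
print(totals)
```

This keeps only one chunk in memory at a time, which can help when the full file does not fit comfortably in a hosted notebook.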



## Reading files


In [1]:
import pandas as pd

In [2]:
# the two files below come from the site: http://www.bogc.dnrc.mt.gov/production/
#  in the historical zip file.

# Note that in these read_csv calls, we set sep to '\t', which is a TAB character.
# These data are all tab delimited.

prod = pd.read_csv(r"C:\MyDocs\OpenFF\data\non-FF\montana\histprodwell.tab",sep='\t',
                   low_memory=False,
                  dtype={'API_WELLNO':'str'}) # we need to treat as a string not a number

prod['date'] = pd.to_datetime(prod.rpt_date) # get the date into a pandas datetime format

# Note that encoding is given explicitly in the next call.  Without it, read_csv raised a
#  UnicodeDecodeError.  This solution was found at:
#  https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python

well = pd.read_csv(r"C:\MyDocs\OpenFF\data\non-FF\montana\histWellData.tab",sep='\t',low_memory=False,
                  encoding = "ISO-8859-1")
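
To illustrate why the explicit encoding matters, here is a minimal sketch using a hypothetical two-column file held in memory. The company name is invented; the point is only that a byte valid in ISO-8859-1 is not valid UTF-8, which is the default that `read_csv` assumes.

```python
import io
import pandas as pd

# Hypothetical file content; "é" is the single byte 0xE9 in ISO-8859-1.
raw_bytes = "API_WELLNO\tCoName\n25003050000000\tSociété Oil\n".encode("ISO-8859-1")

# With the default UTF-8 decoding, the read fails.
try:
    pd.read_csv(io.BytesIO(raw_bytes), sep="\t")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Naming the encoding explicitly lets the same bytes load cleanly.
df = pd.read_csv(
    io.BytesIO(raw_bytes),
    sep="\t",
    encoding="ISO-8859-1",
    dtype={"API_WELLNO": "str"},
)
print(utf8_failed, df.loc[0, "CoName"])
```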


In [3]:
# The location data come directly from the commission's public website.
# Note that we exported the data using the site's "text" button and saved the file with a CSV extension,
#   but the data are TAB delimited.
loc = pd.read_csv(r"C:\MyDocs\OpenFF\data\non-FF\montana\location.csv",sep = '\t')

# Show how long each data frame is:
print(f'Len: prod: {len(prod)}, well: {len(well)}, loc: {len(loc)}')


Len: prod: 5110254, well: 19281, loc: 44748


## Clean up names and APINumber values

In [4]:
prod['APINumber'] = prod.API_WELLNO
loc['APINumber'] = loc['API #'].str.replace('-','')
lapis = loc['APINumber'].unique().tolist()
papis = prod['APINumber'].unique().tolist()
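
Once both tables share an `APINumber` column, it can be useful to check how well the two sets of identifiers overlap before merging. The sketch below uses small hypothetical frames; the hyphenated layout of `API #` shown here is an assumption about how the location file formats its values.

```python
import pandas as pd

# Hypothetical samples; the hyphen positions in "API #" are assumed.
loc = pd.DataFrame({"API #": ["25-003-05000-00-00", "25-003-05001-00-00"]})
prod = pd.DataFrame({"API_WELLNO": ["25003050000000", "25025213270000"]})

# Strip hyphens literally (regex=False) so both columns share one format.
loc["APINumber"] = loc["API #"].str.replace("-", "", regex=False)
prod["APINumber"] = prod["API_WELLNO"]

# Set arithmetic shows which producing wells have no location record.
loc_apis = set(loc["APINumber"])
prod_apis = set(prod["APINumber"])
missing_location = sorted(prod_apis - loc_apis)
print(missing_location)
```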

In [5]:
#  let's look at the production data for ONE well (say, the one at index 1000 in our list)
prod[prod.APINumber==papis[1000]]


Unnamed: 0,rpt_date,API_WELLNO,ST_FMTN_CD,Name_,Lease_Unit,OPNO,CoName,BBLS_OIL_COND,MCF_GAS,BBLS_WTR,DAYS_PROD,AMND_RPT,STATUS,dt_mod,date,APINumber
1005,01/31/1986,25025213270000,SO,Siluro-Ordovician,1825.0,,,1571.0,151.0,4838,31.0,False,P,,1986-01-31,25025213270000
7776,02/28/1986,25025213270000,SO,Siluro-Ordovician,1825.0,,,1320.0,151.0,5880,28.0,False,P,,1986-02-28,25025213270000
14419,03/31/1986,25025213270000,SO,Siluro-Ordovician,1825.0,,,2985.0,168.0,1220,31.0,False,P,,1986-03-31,25025213270000
20970,04/30/1986,25025213270000,SO,Siluro-Ordovician,1825.0,,,1406.0,149.0,6169,30.0,False,P,,1986-04-30,25025213270000
27444,05/31/1986,25025213270000,SO,Siluro-Ordovician,1825.0,,,448.0,34.0,6000,31.0,False,P,,1986-05-31,25025213270000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5050109,11/30/2021,25025213270000,SO,Siluro-Ordovician,1825.0,664.0,"Denbury Onshore, LLC",0.0,0.0,0,0.0,False,,02/01/2022,2021-11-30,25025213270000
5063869,12/31/2021,25025213270000,SO,Siluro-Ordovician,1825.0,664.0,"Denbury Onshore, LLC",0.0,0.0,0,0.0,False,,02/01/2022,2021-12-31,25025213270000
5077377,01/31/2022,25025213270000,SO,Siluro-Ordovician,1825.0,664.0,"Denbury Onshore, LLC",0.0,0.0,0,0.0,False,,03/01/2022,2022-01-31,25025213270000
5090236,02/28/2022,25025213270000,SO,Siluro-Ordovician,1825.0,664.0,"Denbury Onshore, LLC",0.0,0.0,0,0.0,False,,03/31/2022,2022-02-28,25025213270000


This shows that there are more than 300 rows of data for this well. The data appear to be reported monthly: the rows shown carry month-end dates running from January 1986 to February 2022.
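
The monthly-reporting impression can be checked rather than eyeballed. This is a minimal sketch on three hypothetical rows shaped like the output above; it assumes `rpt_date` is always in `MM/DD/YYYY` form.

```python
import pandas as pd

# Hypothetical slice of one well's records, mirroring the columns above.
prod = pd.DataFrame({
    "APINumber": ["25025213270000"] * 3,
    "rpt_date": ["01/31/1986", "02/28/1986", "03/31/1986"],
})
prod["date"] = pd.to_datetime(prod["rpt_date"], format="%m/%d/%Y")

n_rows = len(prod)
# True when every report date falls on the last day of its month.
all_month_end = bool(prod["date"].dt.is_month_end.all())
# Number of distinct calendar months covered by the reports.
n_months = prod["date"].dt.to_period("M").nunique()
print(n_rows, all_month_end, n_months)
```

If `n_rows` equals `n_months`, the well has exactly one report per month in the covered period.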

## Summarize to a single value per well

In [6]:
gb = prod.groupby('APINumber',as_index=False)[['BBLS_OIL_COND', 'MCF_GAS', 'BBLS_WTR', 'DAYS_PROD']].sum()
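
As a small illustration of what this aggregation does, the sketch below runs the same pattern on a hypothetical three-row frame. Note that `sum` skips missing values, so a well with no reported oil ends up with a total of 0.0 rather than NaN.

```python
import pandas as pd

# Hypothetical monthly records for two wells, "A" and "B".
prod = pd.DataFrame({
    "APINumber": ["A", "A", "B"],
    "BBLS_OIL_COND": [1571.0, 1320.0, None],
    "DAYS_PROD": [31.0, 28.0, 30.0],
})

# One row per well, with lifetime totals for each numeric column.
gb = prod.groupby("APINumber", as_index=False)[["BBLS_OIL_COND", "DAYS_PROD"]].sum()
print(gb)
```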

### Add lat/lon

In [7]:
mg = pd.merge(gb,loc[['APINumber', 'Wh_Long', 'Wh_Lat']],on='APINumber',how = 'left')
mg.head()

Unnamed: 0,APINumber,BBLS_OIL_COND,MCF_GAS,BBLS_WTR,DAYS_PROD,Wh_Long,Wh_Lat
0,25003050000000,0.0,0.0,0,0.0,-107.039824,44.997318
1,25003050010000,0.0,0.0,0,31.0,-107.03352,44.997228
2,25003050040000,0.0,0.0,19889,303.0,-107.04491,44.99741
3,25003050050000,0.0,0.0,0,0.0,-107.036697,45.000285
4,25003050070000,33702.0,0.0,337882,2393.0,-107.050015,44.997324
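
A left merge silently duplicates rows if the location table has repeated API numbers, and leaves NaN coordinates for wells it cannot match. The sketch below, on hypothetical frames, shows one way to guard against both; whether the real location file contains duplicates is not known from this notebook.

```python
import pandas as pd

# Hypothetical summary and location tables.
gb = pd.DataFrame({"APINumber": ["A", "B"], "BBLS_OIL_COND": [10.0, 20.0]})
loc = pd.DataFrame({
    "APINumber": ["A", "A", "C"],
    "Wh_Long": [-107.0, -107.0, -108.0],
    "Wh_Lat": [45.0, 45.0, 46.0],
})

# Keep one location row per well so the merge cannot multiply rows.
loc_unique = loc[["APINumber", "Wh_Long", "Wh_Lat"]].drop_duplicates(subset="APINumber")

# validate raises if keys are not unique; indicator records the match status.
mg = pd.merge(
    gb, loc_unique, on="APINumber", how="left",
    validate="one_to_one", indicator=True,
)
n_unmatched = int((mg["_merge"] == "left_only").sum())
print(len(mg), n_unmatched)
```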


In [8]:
mg.to_csv('./tmp/MT_prod_summary.csv')
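
When the summary is read back later, the API numbers need the same string treatment used at load time. The sketch below is a round trip on a hypothetical one-row frame, written to a temporary directory. Passing `index=False` is an optional variation, not what the cell above does; it avoids an extra unnamed index column in the file.

```python
import os
import tempfile
import pandas as pd

# Hypothetical one-row summary.
mg = pd.DataFrame({"APINumber": ["25003050000000"], "BBLS_OIL_COND": [0.0]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "MT_prod_summary.csv")
    # index=False leaves the row index out of the file.
    mg.to_csv(path, index=False)
    # Read API numbers back as strings so they are not parsed as integers.
    back = pd.read_csv(path, dtype={"APINumber": "str"})

print(back.dtypes)
```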