# Making a data set for Montana production

## Task
Combine and reformat two large files: one of well information and one of well production.

## Technical issues encountered
- Loading large files into Colab
- Handling "tab" delimited data
- Handling some formatting errors
- Having the appropriate data for the task
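
The tab-delimiter and formatting issues listed above can be illustrated on a small in-memory sample. This is only a minimal sketch: the sample row and the `Operator` column are invented for illustration, while `API_WELLNO` and `BBLS_OIL_COND` are column names from the production file used later. It suggests how `sep` splits on tabs, how `dtype` keeps the API number as text, and how an explicit `encoding` avoids a `UnicodeDecodeError` on bytes that are not valid UTF-8.

```python
import io
import pandas as pd

# Invented sample: tab-delimited text with one non-ASCII character,
# encoded as ISO-8859-1 (Latin-1) rather than UTF-8.
raw = "API_WELLNO\tOperator\tBBLS_OIL_COND\n25001000010000\tCafé Oil\t12\n".encode("ISO-8859-1")

sample = pd.read_csv(
    io.BytesIO(raw),
    sep="\t",                      # TAB delimited
    dtype={"API_WELLNO": "str"},   # keep the identifier as text, not a number
    encoding="ISO-8859-1",         # the default UTF-8 would fail on the é byte
)
print(sample.dtypes)
```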



## Reading files
Two files are needed: 
- the first holds the raw production values and is stored in a zip file on [Montana servers](http://www.bogc.dnrc.mt.gov/production/). The code below downloads that zip file and extracts the needed data.
- the second is the well information and **you** will need to gather it from the 

In [None]:
import pandas as pd
import zipfile 
import requests
import shutil

### Download production zip file from Montana server
Downloading this zip file (over 140 MB) can take **two minutes or more** due to an apparently slow server in Montana.
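
Once an archive is on disk, a member can be read directly from it without unpacking. The sketch below builds a tiny zip in memory so that it runs without the slow download; the member name `demo.tab` and its contents are made up for illustration and are not part of the Montana data.

```python
import io
import zipfile
import pandas as pd

# Build a small stand-in archive in memory (hypothetical member and data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("demo.tab", "API_WELLNO\tMCF_GAS\n25001000010000\t5\n25001000020000\t7\n")

# Read the member straight out of the archive, without extracting it to disk.
buf.seek(0)
with zipfile.ZipFile(buf) as z:
    print(z.namelist())            # list the members before choosing one
    with z.open("demo.tab") as f:
        demo = pd.read_csv(f, sep="\t", dtype={"API_WELLNO": "str"})

print(demo)
```

Listing the members with `namelist()` first is a cheap way to confirm the exact file name inside the archive.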

In [None]:
# define an efficient routine to fetch a large file from a web address and save it
# https://stackoverflow.com/a/39217788/6736072

def download_file(url):
    local_filename = url.split('/')[-1] # name the local file after the last part of the link
    with requests.get(url, stream=True) as r:
        r.raise_for_status() # stop on an HTTP error instead of saving an error page
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f) # stream to disk without holding the file in memory
    print(f'Downloaded {local_filename}')

In [None]:
zfn = 'http://www.bogc.dnrc.mt.gov/production/historical.zip'
download_file(zfn)

In [None]:
# now pull the well production file into a dataframe
# Note that in these "read_csv" calls, we set "sep" to '\t', which is a TAB character.

with zipfile.ZipFile(zfn.split('/')[-1]) as z:
    with z.open('histprodwell.tab') as f:
        prod = pd.read_csv(f,sep='\t',  # it is TAB delimited
                           low_memory=False,
                           dtype={'API_WELLNO':'str'}) # we need to treat as a string not a number

prod['year'] = pd.to_datetime(prod.rpt_date).dt.year # keep just the year

# Keep only fields that we need and rename them to something more useful
prod = prod[['API_WELLNO','BBLS_OIL_COND', 'MCF_GAS', 'BBLS_WTR', 'DAYS_PROD', 'year']]
prod.columns = ['APINumber','Oil', 'Gas', 'Water','Days','year']


In [None]:
gb = prod.groupby(['APINumber','year'],as_index=False)[['Oil', 'Gas', 
                                                        'Water','Days']].sum()
gb.tail(10)

In [None]:
colnames = ['Oil','Gas','Water','Days']
concat_list = []
for col in colnames:
    # one wide table per measure: rows are wells, columns are years
    piv = gb.pivot(index='APINumber',columns='year',values=col).fillna(0)
    piv.columns = [f'{col}_{year}' for year in piv.columns] # prefix each year with the measure name
    concat_list.append(piv)

whole = pd.concat(concat_list,axis=1)

whole.to_csv('piv.csv')

In [None]:
# the two files below come from the site: http://www.bogc.dnrc.mt.gov/production/
#  in the historical zip file.

prod = pd.read_csv(r"C:\MyDocs\OpenFF\data\non-FF\montana\histprodwell.tab",sep='\t',
                   low_memory=False,
                  dtype={'API_WELLNO':'str'}) # we need to treat as a string not a number

prod['date'] = pd.to_datetime(prod.rpt_date) # get the date into a pandas datetime format

# Note that in the next line, encoding is explicitly given.  This is because without that, an
#  error was thrown.  This solution was found at:
#  https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python

well = pd.read_csv(r"C:\MyDocs\OpenFF\data\non-FF\montana\histWellData.tab",sep='\t',low_memory=False,
                  encoding = "ISO-8859-1")


In [None]:
well.columns

In [None]:
# the location data comes directly from the commission's public website:
# but note that we have saved the data using their "text" button and saved the file with a CSV extension.
#   The data are TAB delimited.
loc = pd.read_csv(r"C:\MyDocs\OpenFF\data\non-FF\montana\location.csv",sep = '\t')

# Show how long each data frame is:
print(f'Len: prod: {len(prod)}, well: {len(well)}, loc: {len(loc)}')


## Clean up names and APINumbers values

In [None]:
prod['APINumber'] = prod.API_WELLNO
loc['APINumber'] = loc['API #'].str.replace('-','')
lapis = loc['APINumber'].unique().tolist()
papis = prod['APINumber'].unique().tolist()

In [None]:
#  let's look at the production data for ONE well (say #1000 in our list)
prod[prod.APINumber==papis[1000]]


This shows that there are more than 300 rows of data for this well.  The data appear to be reported monthly.
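
One way to check the impression of monthly reporting is to count reports per well per calendar month. This is a sketch on invented rows, not the real data; it borrows only the `rpt_date` and `APINumber` column names from the notebook. A maximum count of 1 would be consistent with one report per well per month.

```python
import pandas as pd

# Invented rows for two wells (dates are illustrative only).
sample = pd.DataFrame({
    "APINumber": ["A", "A", "A", "B", "B"],
    "rpt_date": ["2001-01-31", "2001-02-28", "2001-03-31", "2001-01-31", "2001-02-28"],
})
sample["date"] = pd.to_datetime(sample["rpt_date"])

# Number of reports for each well in each calendar month.
per_month = sample.groupby(["APINumber", sample["date"].dt.to_period("M")]).size()
print(per_month.max())   # 1 means no well reports twice in the same month
```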

## Summarize to single value for all wells

In [None]:
gb = prod.groupby('APINumber',as_index=False)[['BBLS_OIL_COND', 'MCF_GAS', 'BBLS_WTR', 'DAYS_PROD']].sum()

### and add lat/lon
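
A left merge keeps every production row, but it silently duplicates rows if the location table lists an API number more than once, and it leaves coordinates missing where there is no match. The sketch below shows one way to surface both cases on invented data, using the same column names as this notebook. `validate` and `indicator` are standard `pd.merge` options; the `*_demo` frames and their values are made up.

```python
import pandas as pd

# Invented stand-ins for the summed production and the location table.
gb_demo = pd.DataFrame({"APINumber": ["A", "B", "C"], "BBLS_OIL_COND": [10, 20, 30]})
loc_demo = pd.DataFrame({"APINumber": ["A", "B"],
                         "Wh_Long": [-104.1, -104.2],
                         "Wh_Lat": [48.1, 48.2]})

mg_demo = pd.merge(
    gb_demo, loc_demo, on="APINumber", how="left",
    validate="many_to_one",   # raises MergeError if loc_demo repeats an APINumber
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)
unmatched = mg_demo[mg_demo["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(mg_demo)} wells have no location")
```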

In [None]:
mg = pd.merge(gb,loc[['APINumber', 'Wh_Long', 'Wh_Lat']],on='APINumber',how = 'left')
mg.head()

In [None]:
mg.to_csv('./tmp/MT_prod_summary.csv')