# Making a data set for Montana production

## Task
Combine and reformat two large files: one of well information and one of well production.

## Technical issues encountered
- Loading large files into Colab
- Handling "tab" delimited data
- Renaming columns
- Summarizing monthly data by year
- Pivoting data into yearly columns
- Merging two data sets

## Result
The product is a CSV file in the format needed at FracTracker.



## Reading files
Two files are needed:
- The first holds the raw production values and is stored in a zip file on [Montana servers](http://www.bogc.dnrc.mt.gov/production/). The code below will download that zip file and extract the needed data.
- The second holds the well information, and **you** will need to gather it by hand from the Montana website, following the steps under "File 2" further down.

In [1]:
import pandas as pd
import zipfile 
import requests
import shutil

### File 1: Download production zip file from Montana server
Downloading this zip file (over 140 MB) can take **two minutes or more** due to an apparently slow server in Montana.

In [2]:
# Define a routine that streams a large file from a web address to disk
# without holding it all in memory.
# https://stackoverflow.com/a/39217788/6736072

def download_file(url):
    local_filename = url.split('/')[-1]  # name the local file after the last part of the link
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # stop here on an HTTP error instead of saving an error page
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print(f'Downloaded {local_filename}')

In [3]:
zfn = 'http://www.bogc.dnrc.mt.gov/production/historical.zip'
download_file(zfn)

Downloaded historical.zip
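
Because the download is slow, it can be worth skipping it when the zip file is already on disk. The sketch below is one way to do that. It uses the standard library's `urllib.request` rather than `requests`, and the names `download_if_missing` and `dest_dir` are hypothetical, not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest_dir="."):
    """Fetch url into dest_dir unless a file with that name is already there."""
    local_path = Path(dest_dir) / url.split("/")[-1]
    if local_path.exists():
        print(f"Using existing {local_path}")
        return local_path
    # Copy the response to disk in chunks instead of reading it all into memory.
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as f:
        shutil.copyfileobj(response, f)
    print(f"Downloaded {local_path}")
    return local_path
```

Calling `download_if_missing(zfn)` in place of `download_file(zfn)` would then reuse an existing `historical.zip`. An interrupted download leaves a partial file behind, so delete it by hand before retrying.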


In [4]:
# Now pull the well production file into a dataframe.
# Note that in these "read_csv" calls we set "sep" to '\t', which is the TAB character.

with zipfile.ZipFile(zfn.split('/')[-1]) as z:
    with z.open('histprodwell.tab') as f:
        prod = pd.read_csv(f,sep='\t',  # it is TAB delimited, not comma delimited
                           low_memory=False,
                           dtype={'API_WELLNO':'str'}) # we need to treat as a string not a number

prod['year'] = pd.to_datetime(prod.rpt_date).dt.year # keep just the year

# Keep only fields that we need and rename them to something more useful
prod = prod[['API_WELLNO','BBLS_OIL_COND', 'MCF_GAS', 'BBLS_WTR', 'DAYS_PROD', 'year']]
prod.columns = ['APINumber','Oil', 'Gas', 'Water','Days','year']
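
To see what the `sep` and `dtype` arguments do, here is a small self-contained sketch that reads a TAB-delimited member straight out of a zip archive. The archive is built in memory, and the member name `toy_prod.tab` and its values are made up for illustration; only the column names are taken from the cell above.

```python
import io
import zipfile

import pandas as pd

# A tiny TAB-delimited table standing in for the real production file.
tab_text = (
    "API_WELLNO\trpt_date\tBBLS_OIL_COND\n"
    "25003050070000\t1986-01-31\t210\n"
    "25003050080000\t1986-02-28\t198\n"
)

# Write it into an in-memory zip archive (hypothetical member name).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("toy_prod.tab", tab_text)

# Read the member twice: once forcing the API number to a string, once not.
with zipfile.ZipFile(buffer) as z:
    with z.open("toy_prod.tab") as f:
        as_str = pd.read_csv(f, sep="\t", dtype={"API_WELLNO": "str"})
    with z.open("toy_prod.tab") as f:
        as_num = pd.read_csv(f, sep="\t")

print(as_str["API_WELLNO"].tolist())  # values stay as text
print(as_num["API_WELLNO"].dtype)     # without dtype, pandas infers an integer type
```

Keeping the API number as text matters later, because the location file stores it as a hyphenated string and the two keys have to match exactly when merging.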


In [5]:
#  Here is where we summarize by well and year
gb = prod.groupby(['APINumber','year'],as_index=False)[['Oil', 'Gas', 
                                                        'Water','Days']].sum()
gb.tail(10)

Unnamed: 0,APINumber,year,Oil,Gas,Water,Days
447387,25111600110000,2014,0.0,0.0,0,0.0
447388,25111600110000,2015,0.0,0.0,0,0.0
447389,25111600110000,2016,0.0,0.0,0,0.0
447390,25111600110000,2017,0.0,0.0,0,0.0
447391,25111600110000,2018,0.0,0.0,0,61.0
447392,25111600110000,2019,0.0,0.0,0,0.0
447393,25111600110000,2020,0.0,0.0,0,0.0
447394,25111600110000,2021,0.0,0.0,0,0.0
447395,25111600110000,2022,0.0,0.0,0,0.0
447396,25111916470000,1986,945.0,0.0,0,91.0
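
The monthly-to-yearly summary is easier to check on a toy table where the totals can be added up by hand. This is a minimal sketch with made-up well IDs and values; it follows the same `groupby(...).sum()` pattern as the cell above.

```python
import pandas as pd

# Made-up monthly records for two wells.
monthly = pd.DataFrame({
    "APINumber": ["A", "A", "A", "B"],
    "rpt_date": ["1986-01-31", "1986-02-28", "1987-01-31", "1986-01-31"],
    "Oil": [100.0, 50.0, 70.0, 10.0],
    "Days": [31, 28, 31, 15],
})

# Keep just the year of each report date.
monthly["year"] = pd.to_datetime(monthly["rpt_date"]).dt.year

# One row per well and year, with the monthly values summed.
yearly = monthly.groupby(["APINumber", "year"], as_index=False)[["Oil", "Days"]].sum()
print(yearly)
```

Well `A` has two records in 1986, so its 1986 row should hold 150 barrels of oil over 59 days.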


In [6]:
# The dataframe created above is "long": each record has only one year.
# The code below makes it "wide": there is only one record per well ('APINumber').

colnames = ['Oil','Gas','Water','Days']
concat_list = []
for col in colnames:
    piv = gb.pivot(index='APINumber',columns='year',values=col).fillna(0)
    # prefix each year with the measure name, e.g. 'Oil_1986'
    piv.columns = [f'{col}_{year}' for year in piv.columns]
    concat_list.append(piv)

whole = pd.concat(concat_list,axis=1)
whole.head()
# whole.to_csv('piv.csv')

APINumber,Oil_1899,Oil_1986,Oil_1987,Oil_1988,Oil_1989,Oil_1990,Oil_1991,Oil_1992,Oil_1993,Oil_1994,...,Days_2013,Days_2014,Days_2015,Days_2016,Days_2017,Days_2018,Days_2019,Days_2020,Days_2021,Days_2022
25003050000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25003050010000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25003050040000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25003050050000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25003050070000,0.0,2638.0,2319.0,2567.0,2921.0,3659.0,4054.0,3562.0,3315.0,2292.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
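
The loop above pivots one measure at a time. As an alternative sketch (not the method used in this notebook), `DataFrame.pivot` also accepts a list for `values`, which pivots all measures in one call and returns two-level column labels that can then be flattened. The toy data below is made up.

```python
import pandas as pd

# Made-up "long" table: one row per well and year.
yearly = pd.DataFrame({
    "APINumber": ["A", "A", "B"],
    "year": [1986, 1987, 1986],
    "Oil": [150.0, 70.0, 10.0],
    "Days": [59, 31, 15],
})

# Pivot both measures at once; the columns become (measure, year) pairs.
wide = yearly.pivot(index="APINumber", columns="year", values=["Oil", "Days"]).fillna(0)

# Flatten the pairs into single names such as 'Oil_1986'.
wide.columns = [f"{measure}_{year}" for measure, year in wide.columns]
print(wide)
```

Well `B` has no 1987 record, so `Oil_1987` and `Days_1987` are filled with 0 for that row, just as `fillna(0)` does in the loop version.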


### File 2: Getting the well location data and uploading it to Colab
This process takes a few manual steps.

#### Step 1: Navigate to the Montana website
Go to [this site](http://www.bogc.dnrc.mt.gov/webapps/dataminer/Wells/WellSurfaceLongLat.aspx).
It should look something like:



<img src="images/montana_1.png" height=100 />

#### Step 2: Set up search for all wells
1. Select "API #" in the first dropdown menu
2. Type in "25" in the text box
3. Click the search button

<img src="images/montana_2.png" height=100 />


**The results should look something like this:**


<img src="images/montana_3.png" height=100 />



#### Step 3: Save the results to your computer
1. Click on the "Text" button in the upper right of the screen

This will cause the website to display another page with LOTS of text.  It is actually a text file that is "tab" delimited (that is, the tab character separates the values in each row).

2. Save the file to your computer. Usually that is something simple like **Ctrl-s**. Save it with the file name "Location.csv". (The code below uses that name. Change the code below if you want to name it something else.)

#### Step 4: Move the file to Colab
1. Back in the Colab window, open the "Files" panel on the far left by clicking on the folder icon (it may already be open).
2. Click on the "upload" icon and follow the prompts to upload the file you saved in the last step.

<img src="images/montana_4.png" height=10 />

**We can now clean up this file and merge it with the production file**
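
Before reading the uploaded file, a quick sanity check can confirm that it really is TAB delimited, since a browser can sometimes save a page as HTML rather than as plain text. The helper below is a hypothetical sketch that only inspects the first line; `looks_tab_delimited` and `min_columns` are names invented here.

```python
def looks_tab_delimited(path, min_columns=2):
    """Return True if the first line of the file splits into several TAB-separated fields."""
    with open(path, encoding="utf-8", errors="replace") as f:
        header = f.readline().rstrip("\r\n")
    # A saved web page would start with a tag such as <!DOCTYPE html> or <html>.
    if header.lstrip().startswith("<"):
        return False
    return len(header.split("\t")) >= min_columns
```

For the file saved in Step 3, `looks_tab_delimited("Location.csv")` should return `True`; if it does not, repeat the save step and make sure the file is saved as text.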

## Read the Location file into a dataframe

In [7]:
# The location data comes directly from the commission's public website,
# but note that we saved it using their "Text" button and gave the file a CSV extension.
#   The data are TAB delimited.
loc = pd.read_csv("Location.csv",sep = '\t')
loc.head()

Unnamed: 0,API #,Wh_Long,Wh_Lat,CoName,Well_Nm,Well_Type,Wl_Status,Dt_Cmp,Field,County,Wh_TR,Wh_Sec,Wh_Qtr,Wh_FtNS,Wh_NS,Wh_FtEW,Wh_EW,DTD,Unnamed: 18
0,25-009-05295-00-00,-108.78621,45.534055,Yellowstone Oil and Gas Company,# 1,Dry Hole,P&A - Approved,6/1/1951 12:00:00 AM,Wildcat Carbon,Carbon,3S-24E,32,NE,,,,,1780.0,
1,25-065-05391-00-00,-108.672079,46.707982,Roundup Oil & Gas Company,#1,Dry Hole,P&A - Approved,12/1/1921 12:00:00 AM,Devil's Basin,Musselshell,11N-24E,14,W2 SW SW,725.0,S,370.0,W,1232.0,
2,25-049-05010-00-00,-112.540326,47.512402,Lewis & Clark Co.,#1,Dry Hole,Unknown,12/1/1910 12:00:00 AM,Wildcat Lewis & Clark,Lewis and Clark,20N-7W,6,SE,,,,,700.0,
3,25-025-06880-00-00,-104.301744,46.409877,"WBI Energy Transmission, Inc.",#100,Gas Storage,Completed,9/15/1928 12:00:00 AM,Cedar Creek,Fallon,8N-59E,35,,660.0,N,360.0,E,905.0,
4,25-025-06898-00-00,-104.4019,46.514588,"WBI Energy Transmission, Inc.",#103,Gas Storage,Completed,3/28/1931 12:00:00 AM,Cedar Creek,Fallon,9N-58E,22,C SE,1320.0,S,1310.0,E,1120.0,
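
The last column in the output above, `Unnamed: 18`, is empty. The most likely cause is an extra TAB at the end of every line in the saved file, though that is an assumption about the file rather than something verified here. The merge below does not use that column, but if a tidy table is wanted, all-empty columns can be dropped as in this sketch on a made-up three-column sample.

```python
import io

import pandas as pd

# Two lines that each end with an extra TAB, mimicking the suspected layout.
text = (
    "API #\tWh_Long\tWh_Lat\t\n"
    "25-009-05295-00-00\t-108.78621\t45.534055\t\n"
)

raw = pd.read_csv(io.StringIO(text), sep="\t")
print(raw.columns.tolist())    # includes an extra 'Unnamed: ...' column

# Drop every column in which all values are missing.
tidy = raw.dropna(axis=1, how="all")
print(tidy.columns.tolist())
```

Note that `dropna(axis=1, how="all")` would also remove any real column that happens to be completely empty, so check the column list afterwards.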


## Clean up names, merge, and save!

In [8]:
loc['APINumber'] = loc['API #'].str.replace('-','')
loc.rename({'Wh_Long':'Longitude','Wh_Lat':'Latitude'},axis=1,inplace=True)
out = pd.merge(loc[['APINumber','Longitude', 'Latitude']],whole,
               on='APINumber',how='right',validate='1:1')

out.to_csv('MT_prod_summary.csv',index=False)

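PermissionError: [Errno 13] Permission denied: 'MT_prod_summary.csv'

**Note:** this error usually means that `MT_prod_summary.csv` is open in another program or that the current folder is not writable. Close the file (or choose a different output name) and run the cell again.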
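
The cleanup and merge can be tried on a toy example to see how the keys line up. The sketch below reuses two location rows and one production well from the outputs shown earlier; the production values for the first well are made up. It also adds `indicator=True`, which is not in the cell above, as one way to list wells that have production but no location.

```python
import pandas as pd

# Two rows in the layout of the location file.
loc = pd.DataFrame({
    "API #": ["25-009-05295-00-00", "25-065-05391-00-00"],
    "Wh_Long": [-108.78621, -108.672079],
    "Wh_Lat": [45.534055, 46.707982],
})

# Two wells in the layout of the wide production table (first row is made up).
whole = pd.DataFrame({
    "APINumber": ["25009052950000", "25111916470000"],
    "Oil_1986": [0.0, 945.0],
})

# Strip the hyphens so the key matches the 14-digit production API number.
loc["APINumber"] = loc["API #"].str.replace("-", "", regex=False)
loc = loc.rename(columns={"Wh_Long": "Longitude", "Wh_Lat": "Latitude"})

# Keep every production well; validate='1:1' raises if a key repeats on either side.
out = pd.merge(loc[["APINumber", "Longitude", "Latitude"]], whole,
               on="APINumber", how="right", validate="1:1", indicator=True)

# Wells found only in the production table have no coordinates.
missing_location = out.loc[out["_merge"] == "right_only", "APINumber"].tolist()
print(missing_location)
```

With `how='right'`, wells without a matching location stay in the output with empty `Longitude` and `Latitude`, so a list like `missing_location` is a quick way to see how many there are before the CSV is handed off.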