# Results and Conclusion

---

This notebook discusses the results and conclusions of the overall project. 

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Results-and-Conclusion" data-toc-modified-id="Results-and-Conclusion-1">Results and Conclusion</a></span><ul class="toc-item"><li><span><a href="#Results" data-toc-modified-id="Results-1.1">Results</a></span><ul class="toc-item"><li><span><a href="#Extremely-Randomized-Trees-(Extra-Trees)-Regression" data-toc-modified-id="Extremely-Randomized-Trees-(Extra-Trees)-Regression-1.1.1">Extremely Randomized Trees (Extra Trees) Regression</a></span></li></ul></li><li><span><a href="#Recommendations" data-toc-modified-id="Recommendations-1.2">Recommendations</a></span><ul class="toc-item"><li><span><a href="#Presentation" data-toc-modified-id="Presentation-1.2.1"><a href="https://docs.google.com/presentation/d/1blz2lrQwKBJgtudyriewEqyXFqCvffJsTLtiVCprDs4/edit?usp=sharing" target="_blank">Presentation</a></a></span></li></ul></li><li><span><a href="#Imports" data-toc-modified-id="Imports-1.3">Imports</a></span></li><li><span><a href="#Interactive-Plot-html-Pages-using-Plotly" data-toc-modified-id="Interactive-Plot-html-Pages-using-Plotly-1.4">Interactive Plot html Pages using Plotly</a></span></li></ul></li><li><span><a href="#Back-to-Project-Repository" data-toc-modified-id="Back-to-Project-Repository-2">Back to Project Repository</a></span></li></ul></div>

## Results 

While the Time-Series model was an informative exercise, the project was evaluated on its Regression models. A second aim was to assess whether Basin Production can be measured using Nightfire data, and the results suggest that it can. 

### Extremely Randomized Trees (Extra Trees) Regression

A **test R-squared score of 0.9982** with a **10-fold Cross Validation score of 0.9697** is interpreted as satisfactory for using Nightfire data aggregated to EIA Basin Region to predict Oil (bbl/d) Total production per month. Because these results are very promising, it did not seem necessary to run a grid search or carry the analysis much further here. 
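
For readers unfamiliar with the metric, the following is a minimal sketch of how an R-squared score is computed from observed and predicted values. It is not the notebook's own scoring code; the function name `r_squared` and the four sample values are made up for illustration and are not project data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (hypothetical, not taken from the project data).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
score = r_squared(observed, predicted)
print(round(score, 4))
```

A cross-validation score is typically the average of such a score over held-out folds, so it tends to sit below the score reported on a single test split.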

![predictions](https://raw.git.generalassemb.ly/danielmartinsheehan/capstone/master/images/extra_trees_predictions_all_regions.png)

Another aim of this project was to create a uniform model across all Basin Regions. The Residuals plot below suggests that the model performs reliably across the Basin Regions. 

![residuals](https://raw.git.generalassemb.ly/danielmartinsheehan/capstone/master/images/extra_trees_residual_predicted_vs_observed.png)
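
One way to check the same reliability numerically is to summarize residuals (observed minus predicted) per region. The sketch below assumes rows shaped like dictionaries with `region`, `observed`, and `predicted` keys; those key names and the sample numbers are hypothetical and are not read from the project's output file.

```python
from collections import defaultdict


def mean_residual_by_region(rows):
    """Return the mean of (observed - predicted) for each region."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["observed"] - row["predicted"]
        counts[row["region"]] += 1
    return {region: totals[region] / counts[region] for region in totals}


# Hypothetical toy rows for illustration only.
rows = [
    {"region": "Permian Region", "observed": 1000.0, "predicted": 990.0},
    {"region": "Permian Region", "observed": 1100.0, "predicted": 1120.0},
    {"region": "Bakken Region", "observed": 500.0, "predicted": 495.0},
    {"region": "Bakken Region", "observed": 520.0, "predicted": 515.0},
]
result = mean_residual_by_region(rows)
print(result)
```

A mean residual far from zero for one region would indicate that the model is systematically over- or under-predicting there.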

## Recommendations

Due to time, organizational, and reporting restrictions, the scope of this project was limited by the timeline, oversight, and instructions set by General Assembly. Below are some recommendations for improvement. 

* Include Temperature Data 
    * `df_complete_rows` in [Feature Engineering and Exploratory Data Analysis of Processed Data](https://git.generalassemb.ly/danielmartinsheehan/capstone/blob/master/notebooks/04_feature_engineernig_and_exploratory_data_analysis_processed_data.ipynb)

* Run a grid search to find optimal model parameters. 

* Try Neural Networks as an alternative model. 

* Consider additional metrics or aggregation techniques, such as `rolling()`. 

* Include more data sources for potential improvement or additional insights. 

    * Baker Hughes North American Rig Counts - this data tracks monthly rig counts and would add another perspective on rigs and rig types. 
      * https://rigcount.bakerhughes.com/na-rig-count
    * Other data sources from US Energy Information Administration (EIA) - no specific data identified but worth exploring. 
      * https://www.eia.gov/maps/maps.htm

* Allow the library to accept custom geographies, so one could get overall production for Basin Regions outside of the United States, or track smaller geographies or assets. If the topic is something other than Oil Production, it may also be worthwhile to allow setting the Prediction column as well as the X Features (Predictors). 

* Create a custom prediction that votes between the Time-Series model and the Trees model. 
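
As a rough illustration of the last recommendation, a voting prediction for regression can be as simple as a weighted average of the two models' outputs. This is only a sketch under that assumption. The function name `vote`, the `tree_weight` parameter, and the sample values are hypothetical and not part of the project library.

```python
def vote(ts_preds, tree_preds, tree_weight=0.5):
    """Blend time-series and tree predictions with a weighted average."""
    if len(ts_preds) != len(tree_preds):
        raise ValueError("prediction lists must have the same length")
    if not 0.0 <= tree_weight <= 1.0:
        raise ValueError("tree_weight must be between 0 and 1")
    ts_weight = 1.0 - tree_weight
    return [
        tree_weight * tree + ts_weight * ts
        for ts, tree in zip(ts_preds, tree_preds)
    ]


# Hypothetical monthly predictions from each model (illustration only).
ts_preds = [100.0, 200.0]
tree_preds = [110.0, 190.0]
blended = vote(ts_preds, tree_preds)
weighted = vote(ts_preds, tree_preds, tree_weight=0.75)
print(blended, weighted)
```

The weight could be tuned on a validation period, for example by favoring whichever model has the lower error in recent months.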

### [Presentation](https://docs.google.com/presentation/d/1blz2lrQwKBJgtudyriewEqyXFqCvffJsTLtiVCprDs4/edit?usp=sharing)

[![presentation](https://raw.git.generalassemb.ly/danielmartinsheehan/capstone/master/images/presentation.png)](https://docs.google.com/presentation/d/1blz2lrQwKBJgtudyriewEqyXFqCvffJsTLtiVCprDs4/edit?usp=sharing)



## Imports

In [21]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [22]:
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly
import pandas as pd

In [23]:
from tools.tools import read_json, get_current_time

In [24]:
config = read_json('../capstone/config.json')

current_date = get_current_time('yyyymmdd')

wd = f"{config['workspace_directory']}/data"

In [25]:
basin_colors_hex = {  # manually defined dictionary of EIA basin-level standardized colors 
    "Anadarko Region":    "#2BA2CF", 
    "Appalachia Region":  "#769F5D",
    "Bakken Region":      "#F6C432", 
    "Eagle Ford Region":  "#48366B", 
    "Haynesville Region": "#807B8F",
    "Niobrara Region":    "#9D3341",
    "Permian Region":     "#6F4B27",
}

In [26]:
df = pd.read_csv(f"{wd}/output/basin_extra_trees_predictions.csv")
#df = df[df['region'] == 'Permian Region']
df.head()

Unnamed: 0,year_month,region,latest_day_in_month,obs_day_cnt_avg,obs_day_cnt_med,obs_day_cnt_sum,obs_day_cnt_min,obs_day_cnt_max,qf_fit_day_avg_avg,qf_fit_day_avg_med,...,obs_day_cnt_sum_over_pct_month_completed,obs_day_cnt_min_over_pct_month_completed,obs_day_cnt_max_over_pct_month_completed,obs_day_cnt_avg_per_squaremeters_over_pct_month_completed,obs_day_cnt_med_per_squaremeters_over_pct_month_completed,obs_day_cnt_sum_per_squaremeters_over_pct_month_completed,obs_day_cnt_min_per_squaremeters_over_pct_month_completed,obs_day_cnt_max_per_squaremeters_over_pct_month_completed,colors,predicted
0,2012-03-01,Anadarko Region,2012-03-31,13.041667,12.5,313,3,33,4.786164,4.291667,...,313.0,3.0,33.0,1.728202e-10,1.656423e-10,4.147684e-09,3.975416e-11,4.372958e-10,#2BA2CF,254203.561
1,2012-03-01,Appalachia Region,2012-03-31,31.071429,32.0,870,1,86,9.502222,8.966184,...,870.0,1.0,86.0,1.616772e-10,1.665089e-10,4.526961e-09,5.203404e-12,4.474927e-10,#769F5D,32000.0
2,2012-03-01,Bakken Region,2012-03-31,260.321429,249.0,7289,2,571,1.497039,1.500962,...,7289.0,2.0,571.0,2.852491e-09,2.728436e-09,7.986976e-08,2.191515e-11,6.256775e-09,#F6C432,636428.686712
3,2012-03-01,Eagle Ford Region,2012-03-31,48.538462,30.0,1262,1,209,1.046799,0.374157,...,1262.0,1.0,209.0,7.517042e-10,4.646032e-10,1.954431e-08,1.548677e-11,3.236736e-09,#48366B,506707.612
4,2012-03-01,Haynesville Region,2012-03-31,4.47619,2.0,94,1,15,8.188339,0.0,...,94.0,1.0,15.0,9.172062e-11,4.098155e-11,1.926133e-09,2.049078e-11,3.073617e-10,#807B8F,52698.2258


In [27]:
region_list = list(df['region'].unique())

## Interactive Plot html Pages using Plotly

In [29]:
data = []

for region in region_list:
    for val, label, linestyle in zip(
        ['predicted', 'oil_bbl_d_total_production'], 
        ['Prediction', 'Actual'],
        ['dash', 'solid']
    ):
        data.append(
            go.Scatter(
                x=df[df['region'] == region]['year_month'],
                y=df[df['region'] == region][val],
                name=f"{region} {label}",
                line=dict(
                    color=basin_colors_hex[region],
                    dash=linestyle, 
                    width=5,
                ),
                opacity=0.75,
            )
        )

layout = dict(
    title='Basin Production (Oil (bbl/d) Total production) Predictions with Satellite-derived Nightfire data',
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label='1m',
                     step='month',
                     stepmode='backward'),
                dict(count=6,
                     label='6m',
                     step='month',
                     stepmode='backward'),
                dict(step='all')
            ])
        ),
        rangeslider=dict(
            visible = True
        ),
        type='date'
    )
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename = "Time Series with Rangeslider")
plotly.offline.plot(fig, filename = '../html/index.html', auto_open=True);
plotly.offline.plot(fig, filename = '/Users/danielmsheehan/Documents/GitHub/basinpredictions.github.io/index.html', auto_open=True);

[![plotly](https://raw.git.generalassemb.ly/danielmartinsheehan/capstone/master/images/plotly.png)](https://nygeog.github.io/basinpredictions.github.io/)

# Back to Project Repository 

[Project Repository](https://git.generalassemb.ly/danielmartinsheehan/capstone)