### Project Summary

#### Background and Goals
A processing facility receives its raw materials from a large geographical area. This area is split into 6 major regions containing 49 active zones, each with an average of 51 extraction points (2511 extraction points in total were analyzed in this project). The raw materials contain 1 compound (pA) that the processing facility concentrates into a salable product, along with 3 compounds (pB, pC, pD) that the facility needs to remove. Over time, the amount of pA has been decreasing while pB, pC, and pD have been increasing. The first theory is that certain variables at the extraction points change over time as the points age and more material is drawn from them, and that these variables can be defined, measured, and predicted in order to correlate them with the compounds seen at the processing facility.

During processing, 3 chemicals are used to help concentrate pA while removing pB, pC, and pD. Chemical cA is beneficial to pA, while cB removes pB and cCD removes both pC and pD. The second theory and goal of this project is to use models to predict feed pA, pB, pC, and pD and, using those predictions along with the extraction point variables, to determine and predict how much of each chemical the processing facility requires. Predicting the required chemical additions can both reduce costs by cutting excess chemical usage and allow more accurate forecasting of operating costs and of the quantity of chemicals needed per month.

Data from the extraction points and the processing facility was gathered from June 2001 through March 2020. For the regression analysis, the first 6 months of data (June - December 2001) were removed due to unreliable initial exhaustion data. pA was available for the entire analysis period; however, pB and pD were not available until August 2009. pC became available in 2016 but did not have the correct resolution until ~2018. Because so little monthly pC data was available, final pC results were not produced.

***
This notebook acts as the summary of the project. It loads the CSV files generated by the other notebooks and describes the methods and conclusions. To generate all of the files yourself, run the following notebooks in the order listed.
1. sqlite_app - creates the master sqlite DB file from the original raw CSVs (extraction_data and exhist_query)
2. exhist - generates monthly and daily composite CSV files with process features and outputs. Note that it generates two sets of daily composites: one with the normal daily values and another using rolling averages to help reduce variability
3. Regression_test_daily-ra and Regression_testing_monthly - these notebooks can be run independently of each other. OLS linear regressions with forward selection were used first to see whether any correlations could be found using simple methods. Forward selection also made it easier to identify which features yielded the best correlations
4. hyperparameters_daily-ra and hyperparameters_monthly - linear, RBF, and polynomial SVR kernels were tested on each of the process outputs to determine whether Support Vector Regression was worth the effort over OLS linear regression. After this testing, daily data was no longer used. The daily regressions worked, but in practice it is extremely difficult to predict daily values of the process features well enough to yield accurate output predictions, whereas this is possible with monthly forecasts
5. pB_high_error_investigation - pB was found to be unusually high and did not fit the models at the end of 2019; this notebook investigates further and finds that the process was not in normal operation, so a model is unlikely to perform well for that period. The data was kept in the regression models but is noted as allowed to have a high variance
6. daily_monthly_prediction_comparisons - quick summary of the model type, hyperparameters, scores, and RMSE of all daily and monthly regression models. This further supported the use of monthly rather than daily predictions
7. zone_visuals_setup - generates the final CSVs used in this notebook to create interactive plots, in place of the matplotlib plots used to check model performance in the previous notebooks
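
As a rough illustration of the forward selection mentioned in step 3, the sketch below greedily adds whichever feature raises the R² of an OLS fit the most and stops when the gain becomes small. It is not the code from the regression notebooks: the data is synthetic, the stopping rule (`min_gain`) and the use of R² as the selection score are assumptions, and the helper names `r_squared` and `forward_select` are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_select(X, y, names, min_gain=0.01):
    """Greedily add the feature that raises R^2 the most.

    Stops when the best remaining feature improves R^2 by less than min_gain
    (an assumed stopping rule, chosen for illustration only).
    """
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_r2 < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_r2 = scores[j_best]
    return [names[j] for j in selected], best_r2

# Synthetic stand-in data; the real features come from the monthly CSVs
rng = np.random.default_rng(0)
n = 200
exhaustion = rng.uniform(20, 120, n)
sA = rng.uniform(0.1, 0.25, n)
n_points = rng.integers(50, 100, n).astype(float)
X = np.column_stack([exhaustion, sA, n_points])
# By construction, the synthetic output depends on sA and exhaustion only
y = 0.9 * sA - 0.0005 * exhaustion + rng.normal(0, 0.002, n)

names = ['average_exhaustion', 'average_sA', 'number_extract_points']
chosen, r2 = forward_select(X, y, names)
print(chosen, round(r2, 3))
```

Because the synthetic output ignores the number of extraction points, that feature should add almost nothing to R² and be left out, which is the behavior that makes forward selection useful for ranking features.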

In [1]:
# Load dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as sts
import plotly.express as px
import plotly.graph_objects as go
import datetime

Heatmaps were created to help visualize how the extraction points have been utilized throughout the years and how this may affect the process variables and compounds (from this point forward, variables are referred to as Features and compounds as Outputs).

In [2]:
# Load the monthly zone csv into a dataframe
zone_df = pd.read_csv('Resources/zone_months.csv',index_col=False)
zone_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2809 entries, 0 to 2808
Data columns (total 8 columns):
region_zone                     2809 non-null object
year_month                      2809 non-null int64
zone_total_extracted            2809 non-null float64
zone_reserves                   2809 non-null float64
zone_percent_extracted_total    2809 non-null float64
ep_sA_reserves                  2809 non-null float64
zone_exhaustion                 2809 non-null float64
zone_avg_sA                     2809 non-null float64
dtypes: float64(6), int64(1), object(1)
memory usage: 175.7+ KB


In [3]:
# Get a datetime column by parsing the YYYYMM strings in one vectorized call
zone_df['year_month'] = zone_df['year_month'].astype(str)
datetime_list = pd.to_datetime(zone_df['year_month'],format='%Y%m')

In [4]:
zone_df['date'] = datetime_list
zone_df.head()

Unnamed: 0,region_zone,year_month,zone_total_extracted,zone_reserves,zone_percent_extracted_total,ep_sA_reserves,zone_exhaustion,zone_avg_sA,date
0,rUP71,201811,366.0,32922.0,1.067803,4137.198011,1.111719,0.125667,2018-11-01
1,rUP71,201812,1600.0,65844.0,3.243872,8274.396022,2.429986,0.125667,2018-12-01
2,rUP71,201902,9241.0,724570.0,11.740762,82382.978088,1.275377,0.113699,2019-02-01
3,rUP71,201903,5515.0,241582.0,3.506571,28276.678022,2.282869,0.117048,2019-03-01
4,rUP71,201910,12648.0,1163056.0,21.17373,147816.560022,1.08748,0.127093,2019-10-01


In [5]:
zone_df.sort_values('zone_percent_extracted_total').head()

Unnamed: 0,region_zone,year_month,zone_total_extracted,zone_reserves,zone_percent_extracted_total,ep_sA_reserves,zone_exhaustion,zone_avg_sA,date
1556,rVP93,201401,109622.0,98883.0,0.027913,16985.902022,110.86031,0.171778,2014-01-01
1554,rVP93,201311,438150.0,395532.0,0.139555,67943.608088,110.774855,0.171778,2013-11-01
214,rVP71,200807,4850.0,98829.0,0.149271,13232.105011,4.907466,0.133889,2008-07-01
209,rVP71,200802,355.0,98829.0,0.173027,13232.105011,0.359206,0.133889,2008-02-01
1555,rVP93,201312,219201.0,197766.0,0.177463,33971.804044,110.838567,0.171778,2013-12-01


In [6]:
# Load monthly run days from another csv file, noting that the processing facility does not run 24/7
month_data = pd.read_csv('Resources/month_days.csv',index_col=False)
month_data.head()

Unnamed: 0,year_month,run_days
0,200106,17
1,200107,16
2,200108,16
3,200109,19
4,200110,23


In [7]:
# Due to data types, the year_month columns needed to be changed into strings to properly join the dataframes
zone_df.year_month = zone_df.year_month.astype(str)
month_data.year_month = month_data.year_month.astype(str)

In [8]:
zones = zone_df.join(month_data.set_index('year_month'),on='year_month').dropna().reset_index(drop=True)
zones.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2770 entries, 0 to 2769
Data columns (total 10 columns):
region_zone                     2770 non-null object
year_month                      2770 non-null object
zone_total_extracted            2770 non-null float64
zone_reserves                   2770 non-null float64
zone_percent_extracted_total    2770 non-null float64
ep_sA_reserves                  2770 non-null float64
zone_exhaustion                 2770 non-null float64
zone_avg_sA                     2770 non-null float64
date                            2770 non-null datetime64[ns]
run_days                        2770 non-null float64
dtypes: datetime64[ns](1), float64(7), object(2)
memory usage: 216.5+ KB
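
The join above reduces the zone table from 2809 to 2770 rows. Since every column was non-null before the join, the 39 dropped rows are months with no matching entry in `month_data`. A minimal sketch of how such rows could be inspected before dropping them is shown below, using `DataFrame.merge` with `indicator=True`. The miniature frames `zone_demo` and `month_demo` and their values are hypothetical and are not taken from the project files.

```python
import pandas as pd

# Hypothetical miniature versions of zone_df and month_data
zone_demo = pd.DataFrame({
    'region_zone': ['rUP71', 'rUP71', 'rVP93'],
    'year_month': [201811, 201812, 202004],   # read from CSV as int64
})
month_demo = pd.DataFrame({
    'year_month': ['201811', '201812'],       # already strings
    'run_days': [17, 16],                     # made-up values
})

# Both key columns must share a dtype, otherwise the keys cannot match
zone_demo['year_month'] = zone_demo['year_month'].astype(str)

# indicator=True adds a _merge column saying where each row was found
merged = zone_demo.merge(month_demo, on='year_month', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['region_zone', 'year_month']])

# Equivalent of the join + dropna used above, once the unmatched rows are understood
joined = merged[merged['_merge'] == 'both'].drop(columns='_merge').reset_index(drop=True)
print(joined)
```

Listing the unmatched keys first makes it clear whether rows are being lost to genuinely missing run-day records or to a key formatting problem.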


In [9]:
# Percent extracted per run day: the monthly total divided by the number of run days (this follows how the original data was generated)
zones['zone_percent_extracted'] = zones['zone_percent_extracted_total']/zones['run_days']
zones['region_zone'] = zones['region_zone'].astype(str)
zones.sort_values('year_month')

Unnamed: 0,region_zone,year_month,zone_total_extracted,zone_reserves,zone_percent_extracted_total,ep_sA_reserves,zone_exhaustion,zone_avg_sA,date,run_days,zone_percent_extracted
1627,rX641,200106,282100.0,11373253.0,374.925690,2.546491e+06,2.480381,0.223902,2001-06-01,17.0,22.054452
1606,rX637,200106,234544.0,7981622.0,230.846784,1.736090e+06,2.938551,0.217511,2001-06-01,17.0,13.579223
1536,rX631,200106,61216.0,3353962.0,107.902085,5.651914e+05,1.825185,0.168515,2001-06-01,17.0,6.347181
1553,rX633,200106,124941.0,7005415.0,215.679302,1.245665e+06,1.783492,0.177815,2001-06-01,17.0,12.687018
1749,rX647,200106,6864.0,1204704.0,57.931902,1.753647e+05,0.569767,0.145567,2001-06-01,17.0,3.407759
...,...,...,...,...,...,...,...,...,...,...,...
46,rUP81,202003,54034.0,2897544.0,20.878357,5.556742e+05,1.864821,0.191774,2020-03-01,16.0,1.304897
77,rUP93,202003,42477.0,1702655.0,4.925182,3.022332e+05,2.494751,0.177507,2020-03-01,16.0,0.307824
758,rVP77,202003,3759576.0,3766088.0,7.825589,6.786627e+05,99.827088,0.180204,2020-03-01,16.0,0.489099
1056,rVP83,202003,5378374.0,4622052.0,31.491856,1.060684e+06,116.363338,0.229483,2020-03-01,16.0,1.968241
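
As a quick check of the calculation, the sketch below recomputes `zone_percent_extracted` for three rows copied from the table above. The reading that the monthly total is an accumulation over run days is an assumption based on the comment in the cell. It is consistent with the region table later in this notebook, where rX shows a total of 1700 over 17 run days, or 100 per run day.

```python
import pandas as pd

# Three rows copied from the output table above
sample = pd.DataFrame({
    'region_zone': ['rX641', 'rX637', 'rUP81'],
    'year_month': ['200106', '200106', '202003'],
    'zone_percent_extracted_total': [374.925690, 230.846784, 20.878357],
    'run_days': [17.0, 17.0, 16.0],
})

# Same arithmetic as the cell above: monthly total divided by run days
sample['zone_percent_extracted'] = (
    sample['zone_percent_extracted_total'] / sample['run_days']
)
print(sample[['region_zone', 'year_month', 'zone_percent_extracted']])
```

The recomputed values agree with the `zone_percent_extracted` column shown in the table to the displayed precision.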


In [11]:
# Repeat the previous steps for data by region rather than zone
region_df = pd.read_csv('Resources/region_months.csv',index_col=False)
region_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356 entries, 0 to 355
Data columns (total 8 columns):
region                          356 non-null object
year_month                      356 non-null int64
zone_total_extracted            356 non-null float64
zone_reserves                   356 non-null float64
zone_percent_extracted_total    356 non-null float64
ep_sA_reserves                  356 non-null float64
zone_exhaustion                 356 non-null float64
zone_avg_sA                     356 non-null float64
dtypes: float64(6), int64(1), object(1)
memory usage: 22.4+ KB


In [12]:
# Get a datetime column by parsing the YYYYMM strings in one vectorized call
region_df['year_month'] = region_df['year_month'].astype(str)
datetime_list = pd.to_datetime(region_df['year_month'],format='%Y%m')

In [13]:
region_df['date'] = datetime_list
region_df.head()

Unnamed: 0,region,year_month,zone_total_extracted,zone_reserves,zone_percent_extracted_total,ep_sA_reserves,zone_exhaustion,zone_avg_sA,date
0,rU,201810,1001.0,329230.0,2.531069,48068.2,0.304043,0.146002,2018-10-01
1,rU,201811,1720.0,582142.0,5.113422,79630.537011,0.295461,0.136789,2018-11-01
2,rU,201812,9842.0,1053994.0,19.64705,154796.485022,0.933781,0.146867,2018-12-01
3,rU,201901,32907.0,2196210.0,26.560157,305035.101,1.498354,0.138892,2019-01-01
4,rU,201902,29638.0,2592000.0,25.801039,354199.399088,1.143441,0.136651,2019-02-01


In [14]:
region_df.sort_values('zone_percent_extracted_total').head()

Unnamed: 0,region,year_month,zone_total_extracted,zone_reserves,zone_percent_extracted_total,ep_sA_reserves,zone_exhaustion,zone_avg_sA,date
279,rZ,201311,190.0,193248.0,0.624774,34415.711982,0.098319,0.178091,2013-11-01
271,rX,200905,388.0,93940.0,1.505037,24082.041964,0.41303,0.256356,2009-05-01
18,rV,200412,446.0,98892.0,1.633939,16690.772022,0.450997,0.168778,2004-12-01
272,rX,200906,1144.0,276958.0,2.110687,68028.471935,0.413059,0.245627,2009-06-01
270,rX,200904,526.0,492412.0,2.431733,102637.166824,0.106821,0.208438,2009-04-01


In [15]:
region_df.year_month = region_df.year_month.astype(str)
regions = region_df.join(month_data.set_index('year_month'),on='year_month').dropna().reset_index(drop=True)
regions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 352 entries, 0 to 351
Data columns (total 10 columns):
region                          352 non-null object
year_month                      352 non-null object
zone_total_extracted            352 non-null float64
zone_reserves                   352 non-null float64
zone_percent_extracted_total    352 non-null float64
ep_sA_reserves                  352 non-null float64
zone_exhaustion                 352 non-null float64
zone_avg_sA                     352 non-null float64
date                            352 non-null datetime64[ns]
run_days                        352 non-null float64
dtypes: datetime64[ns](1), float64(7), object(2)
memory usage: 27.6+ KB


In [16]:
regions['region_percent_extracted'] = regions['zone_percent_extracted_total']/regions['run_days']
regions.region = regions.region.astype(str)
regions.sort_values('year_month')

Unnamed: 0,region,year_month,zone_total_extracted,zone_reserves,zone_percent_extracted_total,ep_sA_reserves,zone_exhaustion,zone_avg_sA,date,run_days,region_percent_extracted
190,rX,200106,1111571.0,52190402.0,1700.000000,1.061207e+07,2.129838,0.203334,2001-06-01,17.0,100.000000
191,rX,200107,2162918.0,43837350.0,1600.000000,8.816538e+06,4.933962,0.201119,2001-07-01,16.0,100.000000
192,rX,200108,6111322.0,83904101.0,1600.000000,1.657007e+07,7.283699,0.197488,2001-08-01,16.0,100.000000
193,rX,200109,11958649.0,127808556.0,1900.000000,2.631207e+07,9.356689,0.205871,2001-09-01,19.0,100.000000
194,rX,200110,17799854.0,152227774.0,2300.000000,3.218448e+07,11.692908,0.211423,2001-10-01,23.0,100.000000
...,...,...,...,...,...,...,...,...,...,...,...
14,rU,202002,143914.0,8180414.0,120.635580,1.469370e+06,1.759251,0.179620,2020-02-01,14.0,8.616827
350,rZ,202002,108829730.0,119323549.0,839.700248,2.242485e+07,91.205576,0.187933,2020-02-01,14.0,59.978589
15,rU,202003,646193.0,29458899.0,132.765467,5.594907e+06,2.193541,0.189922,2020-03-01,16.0,8.297842
189,rV,202003,87056772.0,76078907.0,392.265214,1.457638e+07,114.429578,0.191596,2020-03-01,16.0,24.516576


In [17]:
# Region and Zone Extraction Heatmap
from plotly.subplots import make_subplots
# 2 rows, 1 column, utilizing subplots
fig = make_subplots(rows=2,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.075,
                   subplot_titles=('By Region','By Zone')
                   )

region_map = go.Heatmap(
        z=regions['region_percent_extracted'],
        x=regions['date'],
        y=regions['region'],
        colorscale='thermal',
        connectgaps=False,
        zsmooth=False,
        name='Region Extraction',
        colorbar=dict(
            len=0.4,y=.75)
         )
zone_map = go.Heatmap(
        z=zones['zone_percent_extracted'],
        x=zones['date'],
        y=zones['region_zone'],
        colorscale='thermal',
        connectgaps=False,
        zsmooth='best',
        name='Zone Extraction',
        colorbar=dict(
            len=0.4,y=0.25)
        )
fig.add_trace(region_map,row=1,col=1)
fig.add_trace(zone_map,row=2,col=1)
fig.update_layout(
    title_text='Percent Extracted per Region/Zone by Month',
    plot_bgcolor='black', # filling the NaN gaps produced misleading visuals, so the gaps are left empty over a black background instead
    height=750
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
# Adding range slider
fig.update_layout(
    xaxis1=dict( #adds the range selector above the first plot, xaxis1
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date',
    ),
    xaxis2=dict( #put the range slider below the bottom plot, xaxis2
        rangeslider=dict(
             visible=True,
            thickness=0.05
         ),type='date')
)


fig.write_html('Visualizations/heatmap_extracted.html')
fig.show()

Percent Extracted per Region/Zone by month was selected as a way to show how much was extracted from each region or zone. The heatmaps above show clear changes in where and how the extraction points have been used to feed the processing facility. Of the 6 regions identified, only 4 have been used in the last 20 years; the top plot groups the extraction points by region.
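
The gaps in the heatmaps come from region/month combinations that have no rows in the data. A minimal sketch of this, assuming long-form data shaped like `regions`, is to pivot it into a region-by-month grid where missing combinations become NaN. The frame `long_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-form data in the same shape as `regions`
long_df = pd.DataFrame({
    'region': ['rX', 'rX', 'rV', 'rU'],
    'date': pd.to_datetime(['2001-06-01', '2001-07-01', '2001-07-01', '2001-08-01']),
    'region_percent_extracted': [100.0, 60.0, 40.0, 100.0],
})

# Rows = regions, columns = months; combinations with no data become NaN
grid = long_df.pivot(index='region', columns='date', values='region_percent_extracted')
print(grid)

# Number of regions feeding the facility in each month
active_per_month = grid.notna().sum(axis=0)
print(active_per_month)

# Number of distinct regions that appear at all
print(long_df['region'].nunique())
```

The same kind of count applied to the real `regions` frame is one way to confirm how many regions were actually used over the analysis period.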
***
Below, the process features and outputs are plotted and then compared with the heatmaps above to build a picture of how the feed has changed over time.

In [18]:
# load monthly process feature/output data
monthly_pA = pd.read_csv('Resources/monthly_pA.csv',index_col=False)
monthly_pB = pd.read_csv('Resources/monthly_pB.csv',index_col=False)
monthly_pD = pd.read_csv('Resources/monthly_pD.csv',index_col=False)

In [19]:
monthly_pA.head()

Unnamed: 0,year_month,average_exhaustion,average_sA,number_extract_points,average_pA,Predicted_pA,Error,datetime
0,200201,25.46,0.204,74.0,0.214,0.217189,0.003189,2002-01-01
1,200202,29.85,0.204,61.0,0.206,0.215986,0.009986,2002-02-01
2,200203,30.1,0.204,67.0,0.23,0.215918,-0.014082,2002-03-01
3,200204,31.33,0.212,75.0,0.215,0.223111,0.008111,2002-04-01
4,200205,37.39,0.212,79.0,0.205,0.221451,0.016451,2002-05-01
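
The `Error` column in this table appears to be the predicted value minus the actual value. The sketch below checks that on the five rows shown above and computes an RMSE from them. The five rows are only a small sample, so the resulting number is illustrative and is not the model's reported RMSE.

```python
import numpy as np
import pandas as pd

# The five rows shown in the output above
sample = pd.DataFrame({
    'year_month': [200201, 200202, 200203, 200204, 200205],
    'average_pA': [0.214, 0.206, 0.230, 0.215, 0.205],
    'Predicted_pA': [0.217189, 0.215986, 0.215918, 0.223111, 0.221451],
    'Error': [0.003189, 0.009986, -0.014082, 0.008111, 0.016451],
})

# Error = predicted - actual (checked here, not assumed)
recomputed = sample['Predicted_pA'] - sample['average_pA']
print(np.allclose(recomputed, sample['Error'], atol=1e-6))

# Root mean squared error over these five months only
rmse = float(np.sqrt((sample['Error'] ** 2).mean()))
print(rmse)
```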


In [20]:
# Plot the Outputs
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0 #This graph will plot without spacing as a concise way to view all the datasets
                   )
# Set parameters for each plot
pA = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_pA,
    name="Actual pA",
    yaxis='y1')
pB = go.Scatter(
    x = monthly_pB.datetime,
    y = monthly_pB.average_pB,
    name="Actual pB",
    yaxis='y2')
pD = go.Scatter(
    x = monthly_pD.datetime,
    y = monthly_pD.average_pD,
    name="Actual pD",
    yaxis='y3')

# Add traces to the plot
fig.add_trace(pD,row=3,col=1)
fig.add_trace(pB,row=2,col=1)
fig.add_trace(pA,row=1,col=1)

# Style the traces
fig.update_traces(
    line={'width':1.5},
    mode='lines',
    showlegend=False)

# Update axes
fig.update_layout(
    xaxis3=dict(
        autorange=True,
        range=[monthly_pA.datetime.min(),monthly_pA.datetime.max()],
        rangeslider=dict(
            autorange=True,
            range=[monthly_pA.datetime.min(),monthly_pA.datetime.max()],
            thickness=0.03
        ),
        type='date'
    ),
    yaxis=dict(
        anchor='x',
        autorange=True,
        domain=[0.66667,1],
        linecolor='#673ab7',
        mirror=True,
        range=[monthly_pA.average_pA.min()*.9,monthly_pA.average_pA.max()*1.1],
        showline=True,
        side='right',
        title='Average pA',
        titlefont={'color':'Green'},
        type='linear',
        zeroline=False
    ),
    yaxis2=dict(
        anchor='x',
        autorange=True,
        domain=[0.33333,0.66667],
        linecolor='#E91E63',
        mirror=True,
        range=[monthly_pB.average_pB.min()*.9,monthly_pB.average_pB.max()*1.1],
        showline=True,
        side='right',
        title='Average pB',
        titlefont={'color':'Red'},
        type='linear',
        zeroline=False
    ),
    yaxis3=dict(
        anchor='x',
        autorange=True,
        domain=[0,0.3333],
        linecolor='#795548',
        mirror=True,
        range=[monthly_pD.average_pD.min()*.9,monthly_pD.average_pD.max()*1.1],
        showline=True,
        side='right',
        title='Average pD',
        titlefont={'color':'Blue'},
        type='linear',
        zeroline=False
    )
)

# Update layout
fig.update_layout(
    title_text='Process Outputs',
    dragmode="zoom",
    hovermode="x",
    legend=dict(traceorder="reversed"),
    height=800,
    margin=dict(
        t=100,
        b=100
    ),
)
fig.write_html('Visualizations/process_outputs.html')
fig.show()

In [49]:
# Plot the Features
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0
                   )
# Set parameters for each plot
exhaustion = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y1')
sA = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_sA,
    name="Actual sA",
    yaxis='y2')
number_ep = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.number_extract_points,
    name="Number of Extract Points",
    yaxis='y3')

# Add traces to the plot
fig.add_trace(number_ep,row=3,col=1)
fig.add_trace(sA,row=2,col=1)
fig.add_trace(exhaustion,row=1,col=1)

# Style the traces
fig.update_traces(
    line={'width':1.5},
    mode='lines',
    showlegend=False)

# Update axes
fig.update_layout(
    xaxis3=dict(
        autorange=True,
        range=[monthly_pA.datetime.min(),monthly_pA.datetime.max()],
        rangeslider=dict(
            autorange=True,
            range=[monthly_pA.datetime.min(),monthly_pA.datetime.max()],
            thickness=0.03
        ),
        type='date'
    ),
    yaxis=dict(
        anchor='x',
        autorange=True,
        domain=[0.66667,1],
        linecolor='#673ab7',
        mirror=True,
        range=[monthly_pA.average_exhaustion.min()*.9,monthly_pA.average_exhaustion.max()*1.1],
        showline=True,
        side='right',
        title='Average Exhaustion',
        titlefont={'color':'Green'},
        type='linear',
        zeroline=False
    ),
    yaxis2=dict(
        anchor='x',
        autorange=True,
        domain=[0.33333,0.66667],
        linecolor='#E91E63',
        mirror=True,
        range=[monthly_pA.average_sA.min()*.9,monthly_pA.average_sA.max()*1.1],
        showline=True,
        side='right',
        title='Average sA',
        titlefont={'color':'Red'},
        type='linear',
        zeroline=False
    ),
    yaxis3=dict(
        anchor='x',
        autorange=True,
        domain=[0,0.3333],
        linecolor='#795548',
        mirror=True,
        range=[monthly_pA.number_extract_points.min()*.9,monthly_pA.number_extract_points.max()*1.1],
        showline=True,
        side='right',
        title='Number Extract Points',
        titlefont={'color':'Blue'},
        type='linear',
        zeroline=False
    )
)

# Update layout
fig.update_layout(
    title_text='Process Features',
    dragmode="zoom",
    hovermode="x",
    legend=dict(traceorder="reversed"),
    height=800,
    margin=dict(
        t=100,
        b=100
    ),
)
fig.write_html('Visualizations/process_features.html')
fig.show()

Viewing the process outputs and features apart from each other is only partly helpful, but they were still plotted separately in order to visualize how the outputs and features may trend with each other.
***
From here, comparison plots show the zone heatmap, process outputs, and process features together.
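
Alongside the visual comparison, a simple numeric check of how each feature trends with an output is a Pearson correlation. The sketch below uses the five `monthly_pA` rows shown earlier purely to demonstrate the call. Five points are far too few to draw any conclusion, and the full monthly table would be needed for a meaningful result.

```python
import pandas as pd

# The five monthly_pA rows shown earlier in this notebook
sample = pd.DataFrame({
    'average_exhaustion': [25.46, 29.85, 30.10, 31.33, 37.39],
    'average_sA': [0.204, 0.204, 0.204, 0.212, 0.212],
    'number_extract_points': [74.0, 61.0, 67.0, 75.0, 79.0],
    'average_pA': [0.214, 0.206, 0.230, 0.215, 0.205],
})

features = ['average_exhaustion', 'average_sA', 'number_extract_points']

# Pearson correlation of each feature column with the output column
corr = sample[features].corrwith(sample['average_pA'])
print(corr)
```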

In [22]:
# pA Comparison Plots - zone heatmap, plots of pA, exhaustion, sA, number of extract points
# 5 subplots
fig = make_subplots(rows=5,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.03,
                    row_heights=[0.2,0.2,0.2,0.2,0.2],
                   subplot_titles=('Percent Extraction by Zone','Actual pA','Average Exhaustion','Average sA','Number of Extraction Points')
                   )
# set parameters for each plot
zone_map = go.Heatmap(
    z=zones['zone_percent_extracted'],
    x=zones['date'],
    y=zones['region_zone'],
    colorscale='thermal',
    connectgaps=False,
    zsmooth='best',
    name='Zone Extraction',
    colorbar=dict(
        len=0.2,y=0.905),
    showlegend=False
    )
pA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_pA,
    name="Actual pA",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True)
sA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_sA,
    name="Average sA",
    yaxis='y4',
    showlegend=True)
number_points_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y5',
    showlegend=True
    )
# Add traces to the plot
fig.add_trace(zone_map,row=1,col=1)
fig.add_trace(pA_plot,row=2,col=1)
fig.add_trace(exhaustion_plot,row=3,col=1)
fig.add_trace(sA_plot,row=4,col=1)
fig.add_trace(number_points_plot,row=5,col=1)

fig.update_layout(
    title_text='pA Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)

# Add black background to heatmap
fig.update_layout(
    shapes=[
        dict(
            fillcolor='black',
            layer='below',
            line={'width':4,'color':'black'},
            type='rect',
            x0=zones.date.min(),
            x1=zones.date.max(),
            xref='x',
            y0=0.83,
            y1=1,
            yref='paper'
        )
    ]
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
#         rangeslider=dict(
#         visible=True
#         ),
        type='date',
        showgrid=False
    ),
    yaxis1=dict(showgrid=False),
    xaxis5=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/pA_plots.html')
fig.show()

In [23]:
# pB Comparison Plots - zone heatmap, plots of pB, exhaustion, sA, number of extract points
# 5 subplots
fig = make_subplots(rows=5,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.03,
                    row_heights=[0.2,0.2,0.2,0.2,0.2],
                   subplot_titles=('Percent Extraction by Zone','Actual pB','Average Exhaustion','Average sA','Number of Extraction Points')
                   )
# set parameters for each plot
zone_map = go.Heatmap(
    z=zones['zone_percent_extracted'],
    x=zones['date'],
    y=zones['region_zone'],
    colorscale='thermal',
    connectgaps=False,
    zsmooth='best',
    name='Zone Extraction',
    colorbar=dict(
        len=0.2,y=0.905),
    showlegend=False
    )
pB_plot = go.Scatter(
    x = monthly_pB.datetime,
    y = monthly_pB.average_pB,
    name="Actual pB",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True)
sA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_sA,
    name="Average sA",
    yaxis='y4',
    showlegend=True)
number_points_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y5',
    showlegend=True
    )
# Add traces to the plot
fig.add_trace(zone_map,row=1,col=1)
fig.add_trace(pB_plot,row=2,col=1)
fig.add_trace(exhaustion_plot,row=3,col=1)
fig.add_trace(sA_plot,row=4,col=1)
fig.add_trace(number_points_plot,row=5,col=1)

fig.update_layout(
    title_text='pB Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)

# Add black background to heatmap
fig.update_layout(
    shapes=[
        dict(
            fillcolor='black',
            layer='below',
            line={'width':4,'color':'black'},
            type='rect',
            x0=zones.date.min(),
            x1=zones.date.max(),
            xref='x',
            y0=0.83,
            y1=1,
            yref='paper'
        )
    ]
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
#         rangeslider=dict(
#         visible=True
#         ),
        type='date',
        showgrid=False
    ),
    yaxis1=dict(showgrid=False),
    xaxis5=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/pB_plots.html')
fig.show()

In [24]:
# pD Comparison Plots - zone heatmap, plots of pD, exhaustion, sA, number of extract points
# 5 subplots
fig = make_subplots(rows=5,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.03,
                    row_heights=[0.2,0.2,0.2,0.2,0.2],
                   subplot_titles=('Percent Extraction by Zone','Actual pD','Average Exhaustion','Average sA','Number of Extraction Points')
                   )
# set parameters for each plot
zone_map = go.Heatmap(
    z=zones['zone_percent_extracted'],
    x=zones['date'],
    y=zones['region_zone'],
    colorscale='thermal',
    connectgaps=False,
    zsmooth='best',
    name='Zone Extraction',
    colorbar=dict(
        len=0.2,y=0.905),
    showlegend=False
    )
pD_plot = go.Scatter(
    x = monthly_pD.datetime,
    y = monthly_pD.average_pD,
    name="Actual pD",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True)
sA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_sA,
    name="Average sA",
    yaxis='y4',
    showlegend=True)
number_points_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y5',
    showlegend=True
    )
# Add traces to the plot
fig.add_trace(zone_map,row=1,col=1)
fig.add_trace(pD_plot,row=2,col=1)
fig.add_trace(exhaustion_plot,row=3,col=1)
fig.add_trace(sA_plot,row=4,col=1)
fig.add_trace(number_points_plot,row=5,col=1)

fig.update_layout(
    title_text='pD Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)

# Add black background to heatmap
fig.update_layout(
    shapes=[
        dict(
            fillcolor='black',
            layer='below',
            line={'width':4,'color':'black'},
            type='rect',
            x0=zones.date.min(),
            x1=zones.date.max(),
            xref='x',
            y0=0.83,
            y1=1,
            yref='paper'
        )
    ]
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
#         rangeslider=dict(
#         visible=True
#         ),
        type='date',
        showgrid=False
    ),
    yaxis1=dict(showgrid=False),
    xaxis5=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/pD_plots.html')
fig.show()

The format of the plots above is a good way to view the features and how each may relate to the different outputs. However, 5 plots stacked on top of each other may not be user friendly (at least if you are not doing the analysis yourself), so the plots below show the same features and outputs with more resolution by letting you select one feature at a time.
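
The selection plots in the next cells toggle traces by passing a hand-written `visible` list to each button. As a minimal sketch (the helper names `visibility_masks` and `selector_buttons` are hypothetical and not part of the notebook), the same lists can be generated from the number of always-on traces and the labels of the selectable ones:

```python
def visibility_masks(n_fixed, n_selectable):
    """Return one boolean list per selectable trace.

    The first n_fixed traces are always shown; exactly one of the
    remaining n_selectable traces is shown in each mask.
    """
    masks = []
    for chosen in range(n_selectable):
        selectable = [i == chosen for i in range(n_selectable)]
        masks.append([True] * n_fixed + selectable)
    return masks


def selector_buttons(labels, n_fixed):
    """Build Plotly-style button dicts, one per selectable trace label."""
    masks = visibility_masks(n_fixed, len(labels))
    return [
        dict(label=label, method='update', args=[{'visible': mask}])
        for label, mask in zip(labels, masks)
    ]


# Two fixed traces (zone heatmap + actual output), three selectable features.
buttons = selector_buttons(
    ["Average Exhaustion", "Average sA", "Number of Extraction Points"],
    n_fixed=2,
)
for b in buttons:
    print(b['label'], b['args'][0]['visible'])
```

The resulting list could be passed as `buttons=` inside `updatemenus`. The masks only line up if the traces are appended in the same order as the labels, which is the order the cells below use.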

In [50]:
# pA Comparison Plots - zone heatmap, plots of pA, exhaustion, sA, number of extract points
# 3 subplots, last one is a selection of 3 variables
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.05,
                    row_heights=[0.3,0.4,0.4],
                   subplot_titles=('Percent Extraction by Zone','Actual pA','Supply Features')
                   )
# set parameters for each plot
zone_map = go.Heatmap(
    z=zones['zone_percent_extracted'],
    x=zones['date'],
    y=zones['region_zone'],
    colorscale='thermal',
    connectgaps=False,
    zsmooth='best',
    name='Zone Extraction',
    colorbar=dict(
        len=0.2,y=0.905),
    showlegend=False
    )
pA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_pA,
    name="Actual pA",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True,
    visible=True)
sA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_sA,
    name="Average sA",
    yaxis='y3',
    showlegend=True,
    visible=False)
number_points_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y3',
    showlegend=True,
    visible=False)
# Add traces to the plot
fig.append_trace(zone_map,1,1)
fig.append_trace(pA_plot,2,1)
fig.append_trace(exhaustion_plot,3,1)
fig.append_trace(sA_plot,3,1)
fig.append_trace(number_points_plot,3,1)

fig.update_layout(
    title_text='pA Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)
# Add buttons to select the supply variable (type='buttons' renders a button group, not a dropdown)
fig.update_layout(
    updatemenus=[
        dict(
            type='buttons',
            direction='down',
            active=0,
            x=1.3,
            y=.3,
            buttons=list([
                dict(label="Average Exhaustion",
                    method='update',
                    args=[{'visible':[True,True,True,False,False]}]),
                dict(label="Average sA",
                    method='update',
                    args=[{'visible':[True,True,False,True,False]}]),
                dict(label="Number of Extraction Points",
                    method='update',
                    args=[{'visible':[True,True,False,False,True]}]),
            ]),
        )
    ])
# Add black background to heatmap
fig.update_layout(
    shapes=[
        dict(
            fillcolor='black',
            layer='below',
            line={'width':4,'color':'black'},
            type='rect',
            x0=zones.date.min(),
            x1=zones.date.max(),
            xref='x',
            y0=0.74,
            y1=1,
            yref='paper'
        )
    ]
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
#         rangeslider=dict(
#         visible=True
#         ),
        type='date',
        showgrid=False
    ),
    yaxis1=dict(showgrid=False),
    xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/pA_plots_select.html')
fig.show()

In [51]:
# pB Comparison Plots - zone heatmap, plots of pB, exhaustion, sA, number of extract points
# 3 subplots, last one is a selection of 3 variables
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.05,
                    row_heights=[0.3,0.4,0.4],
                   subplot_titles=('Percent Extraction by Zone','Actual pB','Supply Features')
                   )
# set parameters for each plot
zone_map = go.Heatmap(
    z=zones['zone_percent_extracted'],
    x=zones['date'],
    y=zones['region_zone'],
    colorscale='thermal',
    connectgaps=False,
    zsmooth='best',
    name='Zone Extraction',
    colorbar=dict(
        len=0.2,y=0.905),
    showlegend=False
    )
pB_plot = go.Scatter(
    x = monthly_pB.datetime,
    y = monthly_pB.average_pB,
    name="Actual pB",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True,
    visible=True)
sA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_sA,
    name="Average sA",
    yaxis='y3',
    showlegend=True,
    visible=False)
number_points_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y3',
    showlegend=True,
    visible=False)
# Add traces to the plot
fig.append_trace(zone_map,1,1)
fig.append_trace(pB_plot,2,1)
fig.append_trace(exhaustion_plot,3,1)
fig.append_trace(sA_plot,3,1)
fig.append_trace(number_points_plot,3,1)

fig.update_layout(
    title_text='pB Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)
# Add buttons to select the supply variable (type='buttons' renders a button group, not a dropdown)
fig.update_layout(
    updatemenus=[
        dict(
            type='buttons',
            direction='down',
            active=0,
            x=1.3,
            y=.3,
            buttons=list([
                dict(label="Average Exhaustion",
                    method='update',
                    args=[{'visible':[True,True,True,False,False]}]),
                dict(label="Average sA",
                    method='update',
                    args=[{'visible':[True,True,False,True,False]}]),
                dict(label="Number of Extraction Points",
                    method='update',
                    args=[{'visible':[True,True,False,False,True]}]),
            ]),
        )
    ])
# Add black background to heatmap
fig.update_layout(
    shapes=[
        dict(
            fillcolor='black',
            layer='below',
            line={'width':4,'color':'black'},
            type='rect',
            x0=zones.date.min(),
            x1=zones.date.max(),
            xref='x',
            y0=0.74,
            y1=1,
            yref='paper'
        )
    ]
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
#         rangeslider=dict(
#         visible=True
#         ),
        type='date',
        showgrid=False
    ),
    yaxis1=dict(showgrid=False),
    xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/pB_plots_select.html')
fig.show()

In [52]:
# pD Comparison Plots - zone heatmap, plots of pD, exhaustion, sA, number of extract points
# 3 subplots, last one is a selection of 3 variables
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.05,
                    row_heights=[0.3,0.4,0.4],
                   subplot_titles=('Percent Extraction by Zone','Actual pD','Supply Features')
                   )
# set parameters for each plot
zone_map = go.Heatmap(
    z=zones['zone_percent_extracted'],
    x=zones['date'],
    y=zones['region_zone'],
    colorscale='thermal',
    connectgaps=False,
    zsmooth='best',
    name='Zone Extraction',
    colorbar=dict(
        len=0.2,y=0.905),
    showlegend=False
    )
pD_plot = go.Scatter(
    x = monthly_pD.datetime,
    y = monthly_pD.average_pD,
    name="Actual pD",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True,
    visible=True)
sA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_sA,
    name="Average sA",
    yaxis='y3',
    showlegend=True,
    visible=False)
number_points_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y3',
    showlegend=True,
    visible=False)
# Add traces to the plot
fig.append_trace(zone_map,1,1)
fig.append_trace(pD_plot,2,1)
fig.append_trace(exhaustion_plot,3,1)
fig.append_trace(sA_plot,3,1)
fig.append_trace(number_points_plot,3,1)

fig.update_layout(
    title_text='pD Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)
# Add buttons to select the supply variable (type='buttons' renders a button group, not a dropdown)
fig.update_layout(
    updatemenus=[
        dict(
            type='buttons',
            direction='down',
            active=0,
            x=1.3,
            y=.3,
            buttons=list([
                dict(label="Average Exhaustion",
                    method='update',
                    args=[{'visible':[True,True,True,False,False]}]),
                dict(label="Average sA",
                    method='update',
                    args=[{'visible':[True,True,False,True,False]}]),
                dict(label="Number of Extraction Points",
                    method='update',
                    args=[{'visible':[True,True,False,False,True]}]),
            ]),
        )
    ])
# Add black background to heatmap
fig.update_layout(
    shapes=[
        dict(
            fillcolor='black',
            layer='below',
            line={'width':4,'color':'black'},
            type='rect',
            x0=zones.date.min(),
            x1=zones.date.max(),
            xref='x',
            y0=0.74,
            y1=1,
            yref='paper'
        )
    ]
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
#         rangeslider=dict(
#         visible=True
#         ),
        type='date',
        showgrid=False
    ),
    yaxis1=dict(showgrid=False),
    xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/pD_plots_select.html')
fig.show()

Before getting to the regression plots, here is the methodology used in the regression analysis:
* OLS linear regressions were performed using the statsmodels package and a forward selection function to identify which combination of features was best for each output (according to R-squared)
* The forward selection result was not taken at face value: every feature combination was also fitted individually with scikit-learn's linear regression, and the scores and MSEs were compared to confirm that the forward selection function picked the best combination (it did)
* This was performed on each output for both the daily and the monthly-average datasets
* Scatter plots of features vs outputs, as well as residual plots for each OLS model, suggested that some relationships are not linear and could benefit from Support Vector Regression (SVR)
* Example of a non-linear relationship: <img src='Images/month_pA_sA.png' />
* scikit-learn's GridSearchCV was run for each output, using the features chosen by forward selection and manual verification of the best OLS combinations (this may not be the best method, but some of the grid searches took over 30 minutes, so the feature set was fixed for the sake of time)
* Linear, RBF, and poly kernels were used
    - Three values were used for each hyperparameter
    - C: [1, 5, 10]
    - gamma: [0.001, 0.01, 0.1]
    - epsilon: [0.001, 0.01, 0.1]
    - degree (for poly): [2, 3, 4]
    - 5-fold cross-validation was used for each test
* The best models and hyperparameters were selected for each output and the table below summarizes the results
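
To give a feel for the size of the search described above, here is a rough sketch that counts the candidate models and fits. Splitting the grid per kernel (so that `gamma` and `degree` are only varied where the kernel uses them) is an assumption; the notebook may have searched the full cross product instead. The list-of-dicts layout is the form scikit-learn's `GridSearchCV` accepts for `param_grid`.

```python
from itertools import product

# Hyperparameter values listed in the methodology above.
C = [1, 5, 10]
gamma = [0.001, 0.01, 0.1]
epsilon = [0.001, 0.01, 0.1]
degree = [2, 3, 4]

# Assumption: one sub-grid per kernel, varying only the parameters
# that the kernel actually uses.
param_grid = [
    {'kernel': ['linear'], 'C': C, 'epsilon': epsilon},
    {'kernel': ['rbf'], 'C': C, 'gamma': gamma, 'epsilon': epsilon},
    {'kernel': ['poly'], 'C': C, 'gamma': gamma, 'epsilon': epsilon,
     'degree': degree},
]


def count_candidates(grid):
    """Number of hyperparameter combinations across all sub-grids."""
    total = 0
    for sub_grid in grid:
        total += len(list(product(*sub_grid.values())))
    return total


n_folds = 5
n_candidates = count_candidates(param_grid)
n_fits = n_candidates * n_folds
print(n_candidates, "candidates,", n_fits, "fits per output")
```

A grid like this could be handed to `GridSearchCV(SVR(), param_grid, cv=5)`. Searching every kernel against every parameter would raise the count to 3 x 81 = 243 candidates per output, which is one plausible reason some searches ran for over 30 minutes.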

In [55]:
comparison = pd.read_csv('Resources/daily_monthly_comparison.csv',index_col=False)
comparison

Unnamed: 0,pA_D,pB_D,pC_D,pD_D,pA_M,pB_M,pC_M,pD_M
0,Daily,Daily,Daily,Daily,Monthly,Monthly,Monthly,Monthly
1,RBF,RBF,RBF,RBF,Linear,RBF,NONE,RBF
2,10,10,5,10,5,5,NONE,1
3,0.1,0.1,0.1,0.1,0.001,0.1,NONE,0.1
4,0.1,0.01,0.1,0.1,0,0.1,NONE,0.01
5,0.3459,0.8023,0.4411,0.5886,0.836,0.8305,NONE,0.5655
6,0.0195,0.2179,0.0003,0.0026,0.0131,0.2066,NONE,0.0025
7,0.215,0.6445,0.1814,0.1577,0.8358,0.7513,NONE,0.2011
8,0.0213,0.2922,0.0003,0.0036,0.0131,0.2502,NONE,0.0032


Below are plots of the output regressions and predictions compared with actuals

In [28]:
# Regression plots
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.05,
                    row_heights=[0.333,0.333,0.333],
                   subplot_titles=('pA Model','pB Model','pD Model')
                   )
# set parameters for each plot
pA_plot = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.average_pA,
    name="Actual pA",
    yaxis='y1',
    showlegend=True)
pred_pA = go.Scatter(
    x = monthly_pA.datetime,
    y = monthly_pA.Predicted_pA,
    name="Predicted pA",
    yaxis='y1',
    showlegend=True,
    visible=True)
pB_plot = go.Scatter(
    x = monthly_pB.datetime,
    y = monthly_pB.average_pB,
    name="Actual pB",
    yaxis='y2',
    showlegend=True)
pred_pB = go.Scatter(
    x = monthly_pB.datetime,
    y = monthly_pB.Predicted_pB,
    name="Predicted pB",
    yaxis='y2',
    showlegend=True,
    visible=True)
pD_plot = go.Scatter(
    x = monthly_pD.datetime,
    y = monthly_pD.average_pD,
    name="Actual pD",
    yaxis='y3',
    showlegend=True)
pred_pD = go.Scatter(
    x = monthly_pD.datetime,
    y = monthly_pD.Predicted_pD,
    name="Predicted pD",
    yaxis='y3',
    showlegend=True,
    visible=True)
# Add traces to the plot
fig.append_trace(pA_plot,1,1)
fig.append_trace(pred_pA,1,1)
fig.append_trace(pB_plot,2,1)
fig.append_trace(pred_pB,2,1)
fig.append_trace(pD_plot,3,1)
fig.append_trace(pred_pD,3,1)

fig.update_layout(
    title_text='Regression Plots',
    height=800,
    legend=dict(
        y=0.85)
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date',
        showgrid=True
    ),
        xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.05)
    )
)


fig.write_html('Visualizations/regressions_plots.html')
fig.show()

The models were fairly successful in general, with scores of ~0.83 for pA and pB and ~0.56 for pD. While the score for pD isn't amazing, it is a large improvement over the OLS model score of ~0.2
***
Using the actual outputs and feature values, the same methodology was applied to find models for the chemical consumptions

In [29]:
# Chemical usage and regressions
monthly_cA = pd.read_csv('Resources/monthly_pA_cA.csv',index_col=False)
monthly_cB = pd.read_csv('Resources/monthly_pB_cB.csv',index_col=False)
monthly_cD = pd.read_csv('Resources/monthly_pD_cCD.csv',index_col=False)
monthly_cA.head()

Unnamed: 0,year_month,average_exhaustion,average_sA,number_extract_points,average_pA,Predicted_pA,Error,datetime,cA,production,cA_per_prod,Predicted_cA,cA_Error
0,200901,43.54,0.241,177.0,0.248,0.247064,-0.000936,2009-01-01,50533.0,630494,0.080148,0.085877,0.005729
1,200902,40.44,0.237,183.0,0.246,0.244148,-0.001852,2009-02-01,37659.0,412244,0.084577,0.085766,0.001189
2,200903,40.62,0.242,169.0,0.241,0.248805,0.007805,2009-03-01,47330.0,464256,0.089929,0.084313,-0.005616
3,200904,43.93,0.24,165.0,0.275,0.246016,-0.028984,2009-04-01,45100.0,483255,0.095671,0.091522,-0.004149
4,200905,42.61,0.232,158.0,0.257,0.238847,-0.018153,2009-05-01,31538.0,289430,0.100221,0.08749,-0.012731
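
A quick check on the five preview rows above suggests that `Error` is `Predicted_pA - average_pA` (negative when the model under-predicts). The sketch below re-derives the column from the printed values and computes the mean squared error over these five rows only, so the number is illustrative and not a model score.

```python
# (average_pA, Predicted_pA, Error) copied from the preview rows above.
rows = [
    (0.248, 0.247064, -0.000936),
    (0.246, 0.244148, -0.001852),
    (0.241, 0.248805, 0.007805),
    (0.275, 0.246016, -0.028984),
    (0.257, 0.238847, -0.018153),
]

# Residual convention inferred from the data: predicted minus actual.
residuals = [predicted - actual for actual, predicted, _ in rows]

# Compare the recomputed residuals with the Error column as printed.
matches = all(
    abs(residual - listed) < 1e-6
    for residual, (_, _, listed) in zip(residuals, rows)
)

# Mean squared error over the five preview rows.
mse = sum(r * r for r in residuals) / len(residuals)
print("Error column reproduced:", matches)
print("MSE over preview rows:", round(mse, 6))
```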


In [32]:
# Chemical Usage plots with process production included
fig = make_subplots(rows=4,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.03,
                    row_heights=[0.25,0.25,0.25,0.25],
                   subplot_titles=('Process Production','cA Consumption','cB Consumption','cCD Consumption')
                   )
# set parameters for each plot
production = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.production,
    name="Process Production",
    yaxis='y1',
    showlegend=True)
cA_con = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.cA,
    name="cA Consumption",
    yaxis='y2',
    showlegend=True)
cB_con = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.cB,
    name="cB Consumption",
    yaxis='y3',
    showlegend=True)
cCD_con = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.cCD,
    name="cCD Consumption",
    yaxis='y4',
    showlegend=True
    )
# Add traces to the plot
fig.append_trace(production,1,1)
fig.append_trace(cA_con,2,1)
fig.append_trace(cB_con,3,1)
fig.append_trace(cCD_con,4,1)

fig.update_layout(
    title_text='Chemical Consumptions',
    height=800,
    legend=dict(
        y=0.5)
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date',
        showgrid=False
    ),
    xaxis4=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/chemical_consumptions.html')
fig.show()

The processing facility's production has changed over time, and so has chemical usage. Chemical consumption was therefore expressed on a per production basis to give a better relationship
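
Dividing `cA` by `production` month by month only reproduces the first `cA_per_prod` value in the `monthly_cA` preview printed earlier. The five preview rows are instead consistent with a trailing three-month ratio (sum of consumption over sum of production). This is an inference from those five rows, not something the notebook states, so treat the sketch below as a hypothesis check.

```python
# (cA, production, cA_per_prod) copied from the monthly_cA preview rows.
rows = [
    (50533.0, 630494, 0.080148),
    (37659.0, 412244, 0.084577),
    (47330.0, 464256, 0.089929),
    (45100.0, 483255, 0.095671),
    (31538.0, 289430, 0.100221),
]


def trailing_ratio(consumption, production, window=3):
    """Sum of consumption over sum of production in a trailing window.

    Early rows use however many months are available (1, then 2).
    The window length of 3 is an assumption inferred from the preview.
    """
    ratios = []
    for i in range(len(consumption)):
        start = max(0, i - window + 1)
        ratios.append(
            sum(consumption[start:i + 1]) / sum(production[start:i + 1])
        )
    return ratios


cA = [r[0] for r in rows]
production = [r[1] for r in rows]
listed = [r[2] for r in rows]

ratios = trailing_ratio(cA, production)
for got, want in zip(ratios, listed):
    print(round(got, 6), want)
```

In pandas the same quantity could be written as `df.cA.rolling(3, min_periods=1).sum() / df.production.rolling(3, min_periods=1).sum()`, again assuming the three-month window holds beyond the rows shown.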

In [33]:
# Chemical usage per production plots with process production included
fig = make_subplots(rows=4,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.03,
                    row_heights=[0.25,0.25,0.25,0.25],
                   subplot_titles=('Process Production','cA Consumption per Production','cB Consumption per Production','cCD Consumption per Production')
                   )
# set parameters for each plot
production = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.production,
    name="Process Production",
    yaxis='y1',
    showlegend=True)
cA_con = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.cA_per_prod,
    name="cA Consumption",
    yaxis='y2',
    showlegend=True)
cB_con = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.cB_per_prod,
    name="cB Consumption",
    yaxis='y3',
    showlegend=True)
cCD_con = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.cCD_per_prod,
    name="cCD Consumption",
    yaxis='y4',
    showlegend=True
    )
# Add traces to the plot
fig.append_trace(production,1,1)
fig.append_trace(cA_con,2,1)
fig.append_trace(cB_con,3,1)
fig.append_trace(cCD_con,4,1)

fig.update_layout(
    title_text='Chemical Consumptions per Production',
    height=800,
    legend=dict(
        y=0.5)
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date',
        showgrid=False
    ),
    xaxis4=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/chemical_consumptions_per_prod.html')
fig.show()

Let's now compare the chemical consumptions per production to their respective process outputs and the features to try and identify relationships

In [37]:
# Compare chemical consumptions to process features and respective outputs
# 3 subplots, last one is a selection of 3 variables
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.05,
                    row_heights=[0.3333,0.3333,0.3333],
                   subplot_titles=('cA Consumption per Production','Actual pA','Supply Variables')
                   )
# set parameters for each plot
cA_con = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.cA_per_prod,
    name="cA Consumption",
    yaxis='y1',
    showlegend=True)
pA_plot = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.average_pA,
    name="Actual pA",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True,
    visible=True)
sA_plot = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.average_sA,
    name="Average sA",
    yaxis='y3',
    showlegend=True,
    visible=False)
number_points_plot = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y3',
    showlegend=True,
    visible=False)
# Add traces to the plot
fig.append_trace(cA_con,1,1)
fig.append_trace(pA_plot,2,1)
fig.append_trace(exhaustion_plot,3,1)
fig.append_trace(sA_plot,3,1)
fig.append_trace(number_points_plot,3,1)

fig.update_layout(
    title_text='cA Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)
# Add buttons to select the supply variable (type='buttons' renders a button group, not a dropdown)
fig.update_layout(
    updatemenus=[
        dict(
            type='buttons',
            direction='down',
            active=0,
            x=1.3,
            y=.3,
            buttons=list([
                dict(label="Average Exhaustion",
                    method='update',
                    args=[{'visible':[True,True,True,False,False]}]),
                dict(label="Average sA",
                    method='update',
                    args=[{'visible':[True,True,False,True,False]}]),
                dict(label="Number of Extraction Points",
                    method='update',
                    args=[{'visible':[True,True,False,False,True]}]),
            ]),
        )
    ])

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date'
    ),
    xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/cA_plots_select.html')
fig.show()

In [39]:
# Compare chemical consumptions to process features and respective outputs
# 3 subplots, last one is a selection of 3 variables
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.05,
                    row_heights=[0.3333,0.3333,0.3333],
                   subplot_titles=('cB Consumption per Production','Actual pB','Supply Variables')
                   )
# set parameters for each plot
cB_con = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.cB_per_prod,
    name="cB Consumption",
    yaxis='y1',
    showlegend=True)
pB_plot = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.average_pB,
    name="Actual pB",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True,
    visible=True)
sA_plot = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.average_sA,
    name="Average sA",
    yaxis='y3',
    showlegend=True,
    visible=False)
number_points_plot = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y3',
    showlegend=True,
    visible=False)
# Add traces to the plot
fig.append_trace(cB_con,1,1)
fig.append_trace(pB_plot,2,1)
fig.append_trace(exhaustion_plot,3,1)
fig.append_trace(sA_plot,3,1)
fig.append_trace(number_points_plot,3,1)

fig.update_layout(
    title_text='cB Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)
# Add buttons to select the supply variable (type='buttons' renders a button group, not a dropdown)
fig.update_layout(
    updatemenus=[
        dict(
            type='buttons',
            direction='down',
            active=0,
            x=1.3,
            y=.3,
            buttons=list([
                dict(label="Average Exhaustion",
                    method='update',
                    args=[{'visible':[True,True,True,False,False]}]),
                dict(label="Average sA",
                    method='update',
                    args=[{'visible':[True,True,False,True,False]}]),
                dict(label="Number of Extraction Points",
                    method='update',
                    args=[{'visible':[True,True,False,False,True]}]),
            ]),
        )
    ])

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date'
    ),
    xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/cB_plots_select.html')
fig.show()

In [42]:
# Compare chemical consumptions to process features and respective outputs
# 3 subplots, last one is a selection of 3 variables
fig = make_subplots(rows=3,cols=1,
                   shared_xaxes=True,
                   vertical_spacing=0.05,
                    row_heights=[0.3333,0.3333,0.3333],
                   subplot_titles=('cCD Consumption per Production','Actual pD','Supply Variables')
                   )
# set parameters for each plot
cD_con = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.cCD_per_prod,
    name="cCD Consumption",
    yaxis='y1',
    showlegend=True)
pD_plot = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.average_pD,
    name="Actual pD",
    yaxis='y2',
    showlegend=True)
exhaustion_plot = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.average_exhaustion,
    name="Average Exhaustion",
    yaxis='y3',
    showlegend=True,
    visible=True)
sA_plot = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.average_sA,
    name="Average sA",
    yaxis='y3',
    showlegend=True,
    visible=False)
number_points_plot = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.number_extract_points,
    name="Number of Extraction Points",
    yaxis='y3',
    showlegend=True,
    visible=False)
# Add traces to the plot
fig.append_trace(cD_con,1,1)
fig.append_trace(pD_plot,2,1)
fig.append_trace(exhaustion_plot,3,1)
fig.append_trace(sA_plot,3,1)
fig.append_trace(number_points_plot,3,1)

fig.update_layout(
    title_text='cCD Comparison Plots',
    height=800,
    legend=dict(
        y=0.5)
)
# Add buttons to select the supply variable (type='buttons' renders a button group, not a dropdown)
fig.update_layout(
    updatemenus=[
        dict(
            type='buttons',
            direction='down',
            active=0,
            x=1.3,
            y=.3,
            buttons=list([
                dict(label="Average Exhaustion",
                    method='update',
                    args=[{'visible':[True,True,True,False,False]}]),
                dict(label="Average sA",
                    method='update',
                    args=[{'visible':[True,True,False,True,False]}]),
                dict(label="Number of Extraction Points",
                    method='update',
                    args=[{'visible':[True,True,False,False,True]}]),
            ]),
        )
    ])

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date'
    ),
    xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.025)
    )
)


fig.write_html('Visualizations/cD_plots_select.html')
fig.show()

There appear to be relationships between the outputs, features, and chemical consumptions, so regression analysis was performed. As mentioned, the same steps and methodology used for the process outputs were used for the chemical consumptions, with the process outputs added as another feature
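
As a reminder of what forward selection does once the process output joins the candidate features, here is a minimal pure-Python sketch. It is a stand-in for the statsmodels-based function used in the project, not a copy of it: the `min_gain` stopping rule is an assumption, and the numbers are toy values chosen so that the target depends on exactly two of the three candidates.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(columns, y):
    """R-squared of an OLS fit of y on the given columns plus an intercept."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_select(features, y, min_gain=1e-3):
    """Greedily add the feature that raises R-squared the most."""
    chosen, best = [], 0.0
    remaining = list(features)
    while remaining:
        scores = {
            name: r_squared([features[n] for n in chosen + [name]], y)
            for name in remaining
        }
        name = max(scores, key=scores.get)
        if scores[name] - best < min_gain:
            break  # assumed stopping rule: gain too small to matter
        chosen.append(name)
        remaining.remove(name)
        best = scores[name]
    return chosen, best


# Toy values, NOT project data.
features = {
    'average_exhaustion': [1, 2, 3, 4, 5, 6, 7, 8],
    'average_pA': [3, 1, 4, 1, 5, 9, 2, 6],
    'number_extract_points': [2, 7, 1, 8, 2, 8, 1, 8],
}
# Target built as 2*average_exhaustion - average_pA.
target = [2 * a - b for a, b in
          zip(features['average_exhaustion'], features['average_pA'])]

chosen, best = forward_select(features, target)
print(chosen, round(best, 6))
```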

In [46]:
# Regression plots
fig = make_subplots(rows=3, cols=1,
                    shared_xaxes=True,
                    vertical_spacing=0.05,
                    row_heights=[0.333, 0.333, 0.333],
                    subplot_titles=('cA Model', 'cB Model', 'cCD Model')
                    )
# set parameters for each plot
cA_plot = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.cA_per_prod,
    name="Actual cA",
    yaxis='y1',
    showlegend=True)
pred_cA = go.Scatter(
    x = monthly_cA.datetime,
    y = monthly_cA.Predicted_cA,
    name="Predicted cA",
    yaxis='y1',
    showlegend=True,
    visible=True)
cB_plot = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.cB_per_prod,
    name="Actual cB",
    yaxis='y2',
    showlegend=True)
pred_cB = go.Scatter(
    x = monthly_cB.datetime,
    y = monthly_cB.Predicted_cB,
    name="Predicted cB",
    yaxis='y2',
    showlegend=True,
    visible=True)
cCD_plot = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.cCD_per_prod,
    name="Actual cCD",
    yaxis='y3',
    showlegend=True)
pred_cCD = go.Scatter(
    x = monthly_cD.datetime,
    y = monthly_cD.Predicted_cCD,
    name="Predicted cCD",
    yaxis='y3',
    showlegend=True,
    visible=True)
# Add traces to the plot
fig.append_trace(cA_plot,1,1)
fig.append_trace(pred_cA,1,1)
fig.append_trace(cB_plot,2,1)
fig.append_trace(pred_cB,2,1)
fig.append_trace(cCD_plot,3,1)
fig.append_trace(pred_cCD,3,1)

fig.update_layout(
    title_text='Chemical Regression Plots',
    height=800,
    legend=dict(
        y=0.85)
)

# Adding range slider
fig.update_layout(
    xaxis1=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                    label='6m',
                    step='month',
                    stepmode='backward'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(count=5,
                    label='5y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        type='date',
        showgrid=True
    ),
    xaxis3=dict(
        rangeslider=dict(
            visible=True,
            thickness=0.05)
    )
)


fig.write_html('Visualizations/chemical_regressions_plots.html')
fig.show()

Chemical Consumption Regression Scores:
* cA: 0.755
* cB: 0.815
* cCD: 0.626
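
The Results discussion compares these scores against typical R-squared values, so they are presumably coefficients of determination. A minimal sketch of that calculation follows, assuming this is the metric in use; the helper name `r_squared` and the sample values are hypothetical and are not taken from the project data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up actual vs. predicted monthly consumption values
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

score = r_squared(actual, predicted)
print(round(score, 3))
```

In the notebook, the same function could be applied to columns such as `monthly_cA.cA_per_prod` and `monthly_cA.Predicted_cA` after converting them to lists.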

Chemical consumption is a difficult output to model because the chemicals in question could easily have been over- or under-used from month to month. A time period identified in the 2019 pB error investigation also has a major impact on proper modeling of cB. Even with these difficulties, all three scores are above 0.6, and the cA and cB scores are above 0.75.
***
#### Results
Given the many unknowns in this processing facility's industry, typically accepted R-squared values can be as low as 0.5, so the models shown in this project are a significant improvement over current practices. It can be concluded that there are definite relationships between the selected features and the outputs from this project, and that pA, pB, and pD, along with their respective chemicals, can be predicted with relative accuracy at the processing facility. Because monthly data is required to make predictions, there is currently no way to test the models against actual data, but predictions will be made for the remainder of 2020 and reviewed quarterly.
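
One possible shape for the quarterly review is sketched below: monthly predictions are grouped by calendar quarter and compared with actual values using mean absolute error. The review procedure itself is not specified in this project, so the metric, the function name `quarterly_mae`, and every number here are assumptions for illustration only.

```python
from collections import defaultdict


def quarterly_mae(records):
    """Mean absolute error per (year, quarter).

    records: iterable of (year, month, actual, predicted) tuples.
    """
    errors = defaultdict(list)
    for year, month, actual, predicted in records:
        quarter = (month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        errors[(year, quarter)].append(abs(actual - predicted))
    return {key: sum(vals) / len(vals) for key, vals in sorted(errors.items())}


# Made-up monthly values: (year, month, actual, predicted)
records = [
    (2020, 4, 10.0, 9.0),
    (2020, 5, 10.0, 11.0),
    (2020, 6, 10.0, 10.5),
    (2020, 7, 8.0, 8.0),
]

review = quarterly_mae(records)
for (year, quarter), mae in review.items():
    print(f"{year} Q{quarter}: MAE = {mae:.3f}")
```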