# Integrated Project - Gold Extraction

<div>
<img src="https://www.ecolab.com/-/media/Widen/Nalco-Water/Mining--Mineral-Processing/iStock-997693638-Gold-on-the-stone-floor_Crop_550x310.jpg" width="600" align="left"> 
</div>

`Extraction Process`

<div>
<img src="https://practicum-content.s3.us-west-1.amazonaws.com/resources/moved_ore_1591699963.jpg" width="700" align="left"> 
</div>

Flotation
- Gold ore mixture is fed into the float banks to obtain rougher Au concentrate and rougher tails (product residues with a low concentration of valuable metals).


- The stability of this process is affected by the volatile and non-optimal physicochemical state of the flotation pulp (a mixture of solid particles and liquid).
---
Purification
- The rougher concentrate undergoes two stages of purification. After purification, we have the final concentrate and new tails.

`Data Description`

<span style='color:darkgreen'> ***Technological process*** </span>

`Rougher feed — raw material`

`Rougher additions (or reagent additions) — flotation reagents: Xanthate, Sulphate, Depressant`

> `Xanthate — promoter or flotation activator;`

> `Sulphate — sodium sulphide for this particular process;`

> `Depressant — sodium silicate.`

`Rougher process — flotation`

`Rougher tails — product residues`

`Float banks — flotation unit`

`Cleaner process — purification`

`Rougher Au — rougher gold concentrate`

`Final Au — final gold concentrate`

---

<span style='color:darkgreen'> ***Parameters of stages*** </span>

`air amount — volume of air`

`fluid levels`

`feed size — feed particle size`

`feed rate`

---

<span style='color:darkgreen'> ***Feature naming*** </span>

`[stage].[parameter_type].[parameter_name]`

**Possible values for** `[stage]`**:**

> `rougher — flotation`

> `primary_cleaner — primary purification`

> `secondary_cleaner — secondary purification`

> `final — final characteristics`

**Possible values for** `[parameter_type]`**:**

> `input — raw material parameters`

> `output — product parameters`

> `state — parameters characterizing the current state of the stage`

> `calculation — calculation characteristics`

`libraries`

In [61]:
import pandas as pd
import numpy as np
from scipy.stats import iqr
from itertools import islice


import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_absolute_percentage_error, make_scorer, mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import check_cv, cross_validate, cross_val_predict, cross_val_score, learning_curve, validation_curve, RepeatedKFold, GridSearchCV, RandomizedSearchCV

from sklearn import linear_model
from sklearn.dummy import DummyRegressor

`data import`

In [62]:
try:
    df_full = pd.read_csv('/Users/dani/Data Science/TripleTen Projects/Project Data/Integrated Project 2/gold_recovery_full.csv')
    df_train = pd.read_csv('/Users/dani/Data Science/TripleTen Projects/Project Data/Integrated Project 2/gold_recovery_train.csv')
    df_test = pd.read_csv('/Users/dani/Data Science/TripleTen Projects/Project Data/Integrated Project 2/gold_recovery_test.csv')

except:
    print('Something is wrong with your data sourcing')

`data review`

In [63]:
#df_full.info()
#df_full.shape
#display(df_full.sort_values(by='date'))
#df_full.loc['date' == '2017-12-31 21:59:59', "primary_cleaner.input.sulfate":"primary_cleaner.input.depressant"]

In [64]:
df_train.info()
df_train.shape
df_train.dropna(inplace=True)# removing the NaN values to start, could also replace empty/NaN values with previous date values
df_train.replace([np.inf, -np.inf], 0, inplace=True) # followed by replacing infinite values for further analysis
df_train.drop(columns='date', index=1, inplace=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                16860 non-null  object 
 1   final.output.concentrate_ag                         16788 non-null  float64
 2   final.output.concentrate_pb                         16788 non-null  float64
 3   final.output.concentrate_sol                        16490 non-null  float64
 4   final.output.concentrate_au                         16789 non-null  float64
 5   final.output.recovery                               15339 non-null  float64
 6   final.output.tail_ag                                16794 non-null  float64
 7   final.output.tail_pb                                16677 non-null  float64
 8   final.output.tail_sol                               16715 non-null  float64


In [65]:
#df_test.info()
#df_test.shape
#df_test.drop(columns='date', index=1, inplace=True)
display(df_test.sort_values(by='date'))

Unnamed: 0,date,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-09-01 00:59:59,210.800909,14.993118,8.080000,1.005021,1398.981301,-500.225577,1399.144926,-499.919735,1400.102998,...,12.023554,-497.795834,8.016656,-501.289139,7.946562,-432.317850,4.872511,-500.037437,26.705889,-499.709414
1,2016-09-01 01:59:59,215.392455,14.987471,8.080000,0.990469,1398.777912,-500.057435,1398.055362,-499.778182,1396.151033,...,12.058140,-498.695773,8.130979,-499.634209,7.958270,-525.839648,4.878850,-500.162375,25.019940,-499.819438
2,2016-09-01 02:59:59,215.259946,12.884934,7.786667,0.996043,1398.493666,-500.868360,1398.860436,-499.764529,1398.075709,...,11.962366,-498.767484,8.096893,-500.827423,8.071056,-500.801673,4.905125,-499.828510,24.994862,-500.622559
3,2016-09-01 03:59:59,215.336236,12.006805,7.640000,0.863514,1399.618111,-498.863574,1397.440120,-499.211024,1400.129303,...,12.033091,-498.350935,8.074946,-499.474407,7.897085,-500.868509,4.931400,-499.963623,24.948919,-498.709987
4,2016-09-01 04:59:59,199.099327,10.682530,7.530000,0.805575,1401.268123,-500.808305,1398.128818,-499.504543,1402.172226,...,12.025367,-500.786497,8.054678,-500.397500,8.107890,-509.526725,4.957674,-500.360026,25.003331,-500.856333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5851,2017-12-31 19:59:59,173.957757,15.963399,8.070000,0.896701,1401.930554,-499.728848,1401.441445,-499.193423,1399.810313,...,13.995957,-500.157454,12.069155,-499.673279,7.977259,-499.516126,5.933319,-499.965973,8.987171,-499.755909
5852,2017-12-31 20:59:59,172.910270,16.002605,8.070000,0.896519,1447.075722,-494.716823,1448.851892,-465.963026,1443.890424,...,16.749781,-496.031539,13.365371,-499.122723,9.288553,-496.892967,7.372897,-499.942956,8.986832,-499.903761
5853,2017-12-31 21:59:59,171.135718,15.993669,8.070000,1.165996,1498.836182,-501.770403,1499.572353,-495.516347,1502.749213,...,19.994130,-499.791312,15.101425,-499.936252,10.989181,-498.347898,9.020944,-500.040448,8.982038,-497.789882
5854,2017-12-31 22:59:59,179.697158,15.438979,8.070000,1.501068,1498.466243,-500.483984,1497.986986,-519.200340,1496.569047,...,19.958760,-499.958750,15.026853,-499.723143,11.011607,-499.985046,9.009783,-499.937902,9.012660,-500.154284


`missing values detection`

In [66]:
df_full.isna().sum()

date                                            0
final.output.concentrate_ag                    89
final.output.concentrate_pb                    87
final.output.concentrate_sol                  385
final.output.concentrate_au                    86
                                             ... 
secondary_cleaner.state.floatbank5_a_level    101
secondary_cleaner.state.floatbank5_b_air      101
secondary_cleaner.state.floatbank5_b_level    100
secondary_cleaner.state.floatbank6_a_air      119
secondary_cleaner.state.floatbank6_a_level    101
Length: 87, dtype: int64

In [67]:
df_train.isna().sum() # present before data cleaning/preprocessing

final.output.concentrate_ag                   0
final.output.concentrate_pb                   0
final.output.concentrate_sol                  0
final.output.concentrate_au                   0
final.output.recovery                         0
                                             ..
secondary_cleaner.state.floatbank5_a_level    0
secondary_cleaner.state.floatbank5_b_air      0
secondary_cleaner.state.floatbank5_b_level    0
secondary_cleaner.state.floatbank6_a_air      0
secondary_cleaner.state.floatbank6_a_level    0
Length: 86, dtype: int64

In [68]:
df_test.isna().sum()

date                                            0
primary_cleaner.input.sulfate                 302
primary_cleaner.input.depressant              284
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                166
primary_cleaner.state.floatbank8_a_air         16
primary_cleaner.state.floatbank8_a_level       16
primary_cleaner.state.floatbank8_b_air         16
primary_cleaner.state.floatbank8_b_level       16
primary_cleaner.state.floatbank8_c_air         16
primary_cleaner.state.floatbank8_c_level       16
primary_cleaner.state.floatbank8_d_air         16
primary_cleaner.state.floatbank8_d_level       16
rougher.input.feed_ag                          16
rougher.input.feed_pb                          16
rougher.input.feed_rate                        40
rougher.input.feed_size                        22
rougher.input.feed_sol                         67
rougher.input.feed_au                          16
rougher.input.floatbank10_sulfate             257


`summary statistics`

In [69]:
df_full.describe()

Unnamed: 0,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,primary_cleaner.input.sulfate,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,22627.0,22629.0,22331.0,22630.0,20753.0,22633.0,22516.0,22445.0,22635.0,21107.0,...,22571.0,22587.0,22608.0,22607.0,22615.0,22615.0,22615.0,22616.0,22597.0,22615.0
mean,4.781559,9.095308,8.640317,40.001172,67.447488,8.92369,2.488252,9.523632,2.827459,140.277672,...,18.205125,-499.878977,14.356474,-476.532613,14.883276,-503.323288,11.626743,-500.521502,17.97681,-519.361465
std,2.030128,3.230797,3.785035,13.398062,11.616034,3.517917,1.189407,4.079739,1.262834,49.919004,...,6.5607,80.273964,5.655791,93.822791,6.372811,72.925589,5.757449,78.956292,6.636203,75.477151
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3e-06,...,0.0,-799.920713,0.0,-800.836914,-0.42326,-799.741097,0.427084,-800.258209,-0.079426,-810.473526
25%,4.018525,8.750171,7.116799,42.383721,63.282393,7.684016,1.805376,8.143576,2.303108,110.177081,...,14.09594,-500.896232,10.882675,-500.309169,10.941299,-500.628697,8.037533,-500.167897,13.968418,-500.981671
50%,4.953729,9.914519,8.908792,44.653436,68.322258,9.484369,2.653001,10.212998,2.913794,141.330501,...,18.007326,-499.917108,14.947646,-499.612292,14.859117,-499.865158,10.989756,-499.95198,18.004215,-500.095463
75%,5.862593,10.929839,10.705824,46.111999,72.950836,11.084557,3.28779,11.860824,3.555077,174.049914,...,22.998194,-498.361545,17.977502,-400.224147,18.014914,-498.489381,14.001193,-499.492354,23.009704,-499.526388
max,16.001945,17.031899,19.61572,53.611374,100.0,19.552149,6.086532,22.861749,9.789625,274.409626,...,60.0,-127.692333,31.269706,-6.506986,63.116298,-244.483566,39.846228,-120.190931,54.876806,-29.093593


In [70]:
df_train.describe()

Unnamed: 0,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,primary_cleaner.input.sulfate,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,...,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0,11016.0
mean,5.156254,10.172706,9.600964,44.131766,66.807996,9.699026,2.661076,10.934617,3.009184,140.423475,...,19.305464,-476.379165,15.031783,-456.359652,16.460886,-481.893915,12.754424,-482.150528,20.127841,-508.577714
std,1.340526,1.589737,2.842133,4.171437,8.821443,2.335039,0.956443,2.736705,0.810098,36.240296,...,5.537726,52.792731,5.394614,60.329674,5.849667,40.614383,5.864024,41.887858,5.59433,39.611486
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.873963,...,4.502796,-799.709069,4.815717,-799.889113,-0.372054,-797.142475,3.145675,-800.00618,0.195324,-809.045795
25%,4.231049,9.279527,7.952263,43.24437,62.954521,7.94432,1.968099,9.359555,2.49301,114.76037,...,14.519944,-500.605493,10.121519,-500.158019,11.03262,-500.35969,8.045665,-500.110488,15.005303,-500.873732
50%,4.999101,10.347546,9.268938,44.802348,67.3192,9.525787,2.641639,10.789339,2.931333,138.955661,...,18.982882,-499.621459,14.055455,-498.772447,16.033894,-499.631737,11.969783,-499.897695,19.996901,-500.09038
75%,5.852602,11.187033,10.904984,46.092333,72.061074,11.162284,3.286537,12.070684,3.471886,164.75948,...,24.979838,-401.869808,20.033849,-399.939349,20.019224,-450.68431,16.677064,-450.145969,24.993169,-499.546117
max,16.001945,17.031899,18.124851,51.571885,100.0,19.552149,5.639565,22.272019,7.812801,250.127834,...,30.115735,-245.239184,24.007913,-163.742242,43.709931,-275.073125,27.926001,-183.442252,32.188906,-104.427459


In [71]:
df_test.describe()

Unnamed: 0,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,primary_cleaner.state.floatbank8_c_level,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,5554.0,5572.0,5856.0,5690.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,...,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0
mean,170.515243,8.482873,7.264651,1.32142,1481.990241,-509.057796,1486.90867,-511.743956,1468.495216,-509.741212,...,15.636031,-516.266074,13.145702,-476.338907,12.308967,-512.208126,9.470986,-505.017827,16.678722,-512.351694
std,49.608602,3.353105,0.611526,0.693246,310.453166,61.339256,313.224286,67.139074,309.980748,62.671873,...,4.660835,62.756748,4.304086,105.549424,3.762827,58.864651,3.312471,68.785898,5.404514,69.919839
min,0.000103,3.1e-05,5.65,3e-06,0.0,-799.773788,0.0,-800.029078,0.0,-799.995127,...,0.0,-799.798523,0.0,-800.836914,-0.223393,-799.661076,0.528083,-800.220337,-0.079426,-809.859706
25%,143.340022,6.4115,6.885625,0.888769,1497.190681,-500.455211,1497.150234,-500.936639,1437.050321,-501.300441,...,12.057838,-501.054741,11.880119,-500.419113,10.123459,-500.879383,7.991208,-500.223089,13.012422,-500.833821
50%,176.103893,8.023252,7.259333,1.183362,1554.659783,-499.997402,1553.268084,-500.066588,1546.160672,-500.079537,...,17.001867,-500.160145,14.952102,-499.644328,12.062877,-500.047621,9.980774,-500.001338,16.007242,-500.041085
75%,207.240761,10.017725,7.65,1.763797,1601.681656,-499.575313,1601.784707,-499.323361,1600.785573,-499.009545,...,18.030985,-499.441529,15.940011,-401.523664,15.017881,-499.297033,11.992176,-499.722835,21.009076,-499.395621
max,274.409626,40.024582,15.5,5.433169,2212.43209,-57.195404,1975.147923,-142.527229,1715.053773,-150.937035,...,30.051797,-401.565212,31.269706,-6.506986,25.258848,-244.483566,14.090194,-126.463446,26.705889,-29.093593


`dataframe display`

In [72]:
df_full.head(10)

Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-01-15 00:00:00,6.055403,9.889648,5.507324,42.19202,70.541216,10.411962,0.895447,16.904297,2.143149,...,14.016835,-502.488007,12.099931,-504.715942,9.925633,-498.310211,8.079666,-500.470978,14.151341,-605.84198
1,2016-01-15 01:00:00,6.029369,9.968944,5.257781,42.701629,69.266198,10.462676,0.927452,16.634514,2.22493,...,13.992281,-505.503262,11.950531,-501.331529,10.039245,-500.169983,7.984757,-500.582168,13.998353,-599.787184
2,2016-01-15 02:00:00,6.055926,10.213995,5.383759,42.657501,68.116445,10.507046,0.953716,16.208849,2.257889,...,14.015015,-502.520901,11.912783,-501.133383,10.070913,-500.129135,8.013877,-500.517572,14.028663,-601.427363
3,2016-01-15 03:00:00,6.047977,9.977019,4.858634,42.689819,68.347543,10.422762,0.883763,16.532835,2.146849,...,14.03651,-500.857308,11.99955,-501.193686,9.970366,-499.20164,7.977324,-500.255908,14.005551,-599.996129
4,2016-01-15 04:00:00,6.148599,10.142511,4.939416,42.774141,66.927016,10.360302,0.792826,16.525686,2.055292,...,14.027298,-499.838632,11.95307,-501.053894,9.925709,-501.686727,7.894242,-500.356035,13.996647,-601.496691
5,2016-01-15 05:00:00,6.482968,10.049416,5.480257,41.633678,69.465816,10.182708,0.664118,16.999638,1.918586,...,13.938497,-500.970168,11.88335,-500.395298,10.054147,-496.374715,7.965083,-499.364752,14.017067,-599.707915
6,2016-01-15 06:00:00,6.533849,10.058141,4.5691,41.995316,69.300835,10.304598,0.807342,16.723575,2.058913,...,14.046819,-500.971133,12.091543,-500.501426,10.003247,-497.08318,8.01089,-500.002423,14.029649,-600.90547
7,2016-01-15 07:00:00,6.130823,9.935481,4.389813,42.452727,70.230976,10.443288,0.949346,16.689959,2.143437,...,13.974691,-501.819696,12.101324,-500.583446,9.873169,-499.171928,7.993381,-499.794518,13.984498,-600.41107
8,2016-01-15 08:00:00,5.83414,10.071156,4.876389,43.404078,69.688595,10.42014,1.065453,17.201948,2.209881,...,13.96403,-504.25245,12.060738,-501.174549,10.033838,-501.178133,7.881604,-499.729434,13.967135,-599.061188
9,2016-01-15 09:00:00,5.687063,9.980404,5.282514,43.23522,70.279619,10.487013,1.159805,17.483979,2.209593,...,13.989632,-503.195299,12.052233,-500.928547,9.962574,-502.986357,7.979219,-500.146835,13.981614,-598.070855


In [73]:
df_train.head(10)

Unnamed: 0,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,primary_cleaner.input.sulfate,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,6.055403,9.889648,5.507324,42.19202,70.541216,10.411962,0.895447,16.904297,2.143149,127.092003,...,14.016835,-502.488007,12.099931,-504.715942,9.925633,-498.310211,8.079666,-500.470978,14.151341,-605.84198
2,6.055926,10.213995,5.383759,42.657501,68.116445,10.507046,0.953716,16.208849,2.257889,123.819808,...,14.015015,-502.520901,11.912783,-501.133383,10.070913,-500.129135,8.013877,-500.517572,14.028663,-601.427363
3,6.047977,9.977019,4.858634,42.689819,68.347543,10.422762,0.883763,16.532835,2.146849,122.270188,...,14.03651,-500.857308,11.99955,-501.193686,9.970366,-499.20164,7.977324,-500.255908,14.005551,-599.996129
4,6.148599,10.142511,4.939416,42.774141,66.927016,10.360302,0.792826,16.525686,2.055292,117.988169,...,14.027298,-499.838632,11.95307,-501.053894,9.925709,-501.686727,7.894242,-500.356035,13.996647,-601.496691
5,6.482968,10.049416,5.480257,41.633678,69.465816,10.182708,0.664118,16.999638,1.918586,115.581252,...,13.938497,-500.970168,11.88335,-500.395298,10.054147,-496.374715,7.965083,-499.364752,14.017067,-599.707915
6,6.533849,10.058141,4.5691,41.995316,69.300835,10.304598,0.807342,16.723575,2.058913,117.322323,...,14.046819,-500.971133,12.091543,-500.501426,10.003247,-497.08318,8.01089,-500.002423,14.029649,-600.90547
7,6.130823,9.935481,4.389813,42.452727,70.230976,10.443288,0.949346,16.689959,2.143437,124.59296,...,13.974691,-501.819696,12.101324,-500.583446,9.873169,-499.171928,7.993381,-499.794518,13.984498,-600.41107
8,5.83414,10.071156,4.876389,43.404078,69.688595,10.42014,1.065453,17.201948,2.209881,131.781026,...,13.96403,-504.25245,12.060738,-501.174549,10.033838,-501.178133,7.881604,-499.729434,13.967135,-599.061188
9,5.687063,9.980404,5.282514,43.23522,70.279619,10.487013,1.159805,17.483979,2.209593,138.120409,...,13.989632,-503.195299,12.052233,-500.928547,9.962574,-502.986357,7.979219,-500.146835,13.981614,-598.070855
10,5.706261,10.242511,5.214161,43.487291,70.973641,10.473539,1.171183,17.717049,2.200997,146.153696,...,14.008944,-504.170807,11.995903,-501.269181,10.0431,-498.529996,8.002633,-499.761922,14.004767,-599.595324


In [74]:
df_test.head(10)

Unnamed: 0,date,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-09-01 00:59:59,210.800909,14.993118,8.08,1.005021,1398.981301,-500.225577,1399.144926,-499.919735,1400.102998,...,12.023554,-497.795834,8.016656,-501.289139,7.946562,-432.31785,4.872511,-500.037437,26.705889,-499.709414
1,2016-09-01 01:59:59,215.392455,14.987471,8.08,0.990469,1398.777912,-500.057435,1398.055362,-499.778182,1396.151033,...,12.05814,-498.695773,8.130979,-499.634209,7.95827,-525.839648,4.87885,-500.162375,25.01994,-499.819438
2,2016-09-01 02:59:59,215.259946,12.884934,7.786667,0.996043,1398.493666,-500.86836,1398.860436,-499.764529,1398.075709,...,11.962366,-498.767484,8.096893,-500.827423,8.071056,-500.801673,4.905125,-499.82851,24.994862,-500.622559
3,2016-09-01 03:59:59,215.336236,12.006805,7.64,0.863514,1399.618111,-498.863574,1397.44012,-499.211024,1400.129303,...,12.033091,-498.350935,8.074946,-499.474407,7.897085,-500.868509,4.9314,-499.963623,24.948919,-498.709987
4,2016-09-01 04:59:59,199.099327,10.68253,7.53,0.805575,1401.268123,-500.808305,1398.128818,-499.504543,1402.172226,...,12.025367,-500.786497,8.054678,-500.3975,8.10789,-509.526725,4.957674,-500.360026,25.003331,-500.856333
5,2016-09-01 05:59:59,168.485085,8.817007,7.42,0.791191,1402.826803,-499.299521,1401.511119,-499.205357,1404.088107,...,12.029797,-499.814895,8.036586,-500.371492,8.041446,-510.037054,4.983949,-499.99099,24.978973,-500.47564
6,2016-09-01 06:59:59,144.13344,7.92461,7.42,0.788838,1398.252401,-499.748672,1393.255503,-499.19538,1396.738566,...,12.026296,-499.473127,8.027984,-500.983079,7.90734,-507.964971,5.010224,-500.043697,25.040709,-499.501984
7,2016-09-01 07:59:59,133.513396,8.055252,6.988,0.801871,1401.669677,-501.777839,1400.754446,-502.514024,1400.465244,...,12.040911,-501.293852,8.02049,-499.185229,8.116897,-511.927561,5.036498,-500.149615,25.03258,-503.970657
8,2016-09-01 08:59:59,133.735356,7.999618,6.935,0.789329,1402.358981,-499.981597,1400.985954,-496.802968,1401.168584,...,11.998184,-499.481608,8.01261,-500.896783,7.974422,-521.199104,5.061599,-499.791519,25.005063,-497.613716
9,2016-09-01 09:59:59,126.961069,8.017856,7.03,0.805298,1400.81612,-499.014158,1399.975401,-499.570552,1401.871924,...,12.040725,-499.987743,7.989503,-499.750625,7.98971,-509.946737,5.068811,-499.2939,24.992741,-499.272255


`Recovery Calculation Check`

<div>
<img src="https://practicum-content.s3.us-west-1.amazonaws.com/resources/moved_Recovery_1576238822_1589899219.jpg" width="600" align="left"> 
</div>

***C***
- share of gold in the feed before flotation (for finding the rougher concentrate recovery)/in the concentrate right after flotation (for finding the final concentrate recovery)
---
***F***
- share of gold in the concentrate right after flotation (for finding the rougher concentrate recovery)/after purification (for finding the final concentrate recovery)
---
***T***
- share of gold in the rougher tails right after flotation (for finding the rougher concentrate recovery)/after purification (for finding the final concentrate recovery)

***C: Share of Gold in Concentrate***

In [75]:
C = df_train['rougher.output.concentrate_au']
print('Null values:', C.isnull().sum())

Null values: 0


***F: Share of Gold in Feed***

In [76]:
F = df_train['rougher.input.feed_au']
print('Null values:', F.isnull().sum())

Null values: 0


***T: Share of Gold in Rougher Tails***

In [77]:
T = df_train['rougher.output.tail_au']
print('Null values:', T.isnull().sum())

Null values: 0


In [78]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=C, name='Share of Gold in Concentrate'))
fig.add_trace(go.Histogram(x=F, name='Share of Gold in Feed'))
fig.add_trace(go.Histogram(x=T, name='Share of Gold in Rougher Tails'))


# Overlaying histograms
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.update_layout(title='Au Distribution - Recovery Estimation')
fig.show()

***Rougher Output Recovery***

In [79]:
recovery_calc = lambda df, C, F, T: ((df[C] * (df[F] - df[T])) \
                               / (df[F] * (df[C] - df[T])))  * 100

estimated_recovery = recovery_calc(df_train.dropna(subset=['rougher.output.recovery']), 
                                'rougher.output.concentrate_au',
                                'rougher.input.feed_au', 
                                'rougher.output.tail_au')

estimated_recovery = estimated_recovery.fillna(0)
estimated_recovery.replace([np.inf, -np.inf], 0, inplace=True)

#display(estimated_recovery.sort_values())

***Final Output Recovery***

In [80]:
final_estimated_recovery = recovery_calc(df_train.dropna(subset=['final.output.recovery']), 
                                'final.output.concentrate_au',
                                'rougher.input.feed_au',
                                'final.output.tail_au')

final_estimated_recovery = final_estimated_recovery.fillna(0) # filling in NaN with zero as final_error below has a mismatch after dropping NaN
final_estimated_recovery.replace([np.inf, -np.inf], 0, inplace=True)
#display(final_estimated_recovery.sort_values())

***Mean Absolute Error - Rougher***

In [81]:
recovery_feature = df_train['rougher.output.recovery']
print('Null values:', recovery_feature.isna().sum())

Null values: 0


In [82]:
error = mae(recovery_feature, estimated_recovery)

print("Mean absolute error : ", error)
print(f"MAE with supression: {error:.17f}")

Mean absolute error :  9.45971292906466e-15
MAE with supression: 0.00000000000000946


***Mean Absolute Error - Final***

In [83]:
final_recovery_feature = df_train['final.output.recovery']
print('Null values:', final_recovery_feature.isna().sum())
#print('Null values:', final_recovery_feature.mean())

Null values: 0


In [84]:
final_error = mae(final_recovery_feature, final_estimated_recovery)

print("Mean absolute error : ", final_error)
print(f"MAE with supression: {final_error:.17f}")

Mean absolute error :  8.186141186364662e-15
MAE with supression: 0.00000000000000819


**`Findings`**

<span style='color:teal'> After removing NaN values, the average error between the predictions and actuals in this feature comparison (rougher recovery) is `~0.00000000000000946`, which is a good value considering the average feature value is `~82.74` For final recovery, following the same process as prior and replacing infinity values, we see a slightly lower absolute error of `~0.00000000000000819` with the average final feature value of about `66.8` -- not a crazy MAE given the average value. </span>

***Missing Features from Test dataset***

In [85]:
column_difference = df_train.columns.difference(df_test.columns)
display(pd.Series(column_difference))

0                           final.output.concentrate_ag
1                           final.output.concentrate_au
2                           final.output.concentrate_pb
3                          final.output.concentrate_sol
4                                 final.output.recovery
5                                  final.output.tail_ag
6                                  final.output.tail_au
7                                  final.output.tail_pb
8                                 final.output.tail_sol
9                 primary_cleaner.output.concentrate_ag
10                primary_cleaner.output.concentrate_au
11                primary_cleaner.output.concentrate_pb
12               primary_cleaner.output.concentrate_sol
13                       primary_cleaner.output.tail_ag
14                       primary_cleaner.output.tail_au
15                       primary_cleaner.output.tail_pb
16                      primary_cleaner.output.tail_sol
17                      rougher.calculation.au_p

**The above is a list of features between the train and test datasets where the `34` columns displayed are all missing from the test dataset (all float types). These will be dropped from the Train dataset once we evaluate the chosen model and leverage the Test dataset.**
    
Parameters include (with their respective types): 

`1) concentrate (outputs)`

`2) tail (outputs)`

`3) pb_ratio (calculation)`

`4) floatbank10_sulfate_to_au_feed & floatbank11_sulfate_to_au_feed (calculation)`

`5) sulfate_to_au_concentrate (calculation)`

`6) recovery (outputs)`


***Concentration of Metals***

***Au***

Base Concentrate

In [86]:
au_metal = df_train['rougher.output.concentrate_au']
df_train['rougher.output.concentrate_au'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      19.776763
median    20.262513
var       12.964445
std        3.600617
Name: rougher.output.concentrate_au, dtype: float64

First Purification Stage (Concentrate)

In [87]:
first_purif_au = df_train['primary_cleaner.output.concentrate_au']
df_train['primary_cleaner.output.concentrate_au'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      32.380828
median    33.016945
var       28.393822
std        5.328585
Name: primary_cleaner.output.concentrate_au, dtype: float64

Second Purification Stage (Tail - Residues)

In [88]:
second_purif_au = df_train['secondary_cleaner.output.tail_au']
df_train['secondary_cleaner.output.tail_au'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      4.274828
median    3.956963
var       5.241296
std       2.289388
Name: secondary_cleaner.output.tail_au, dtype: float64

Final Output

In [89]:
final_au = df_train['final.output.concentrate_au']
df_train['final.output.concentrate_au'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      44.131766
median    44.802348
var       17.400887
std        4.171437
Name: final.output.concentrate_au, dtype: float64

In [90]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=au_metal, name='Base Metal'))
fig.add_trace(go.Histogram(x=first_purif_au, name='First Purification'))
fig.add_trace(go.Histogram(x=final_au, name='Final Output'))


# Overlaying histograms
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.update_layout(title='Au Concentrate Distribution')
fig.show()

# Summary
df_train[['rougher.output.concentrate_au','primary_cleaner.output.concentrate_au','final.output.concentrate_au']].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the

Unnamed: 0,rougher.output.concentrate_au,primary_cleaner.output.concentrate_au,final.output.concentrate_au
mean,19.776763,32.380828,44.131766
median,20.262513,33.016945,44.802348
var,12.964445,28.393822,17.400887
std,3.600617,5.328585,4.171437


***Ag***

Base Concentrate

In [91]:
ag_metal = df_train['rougher.output.concentrate_ag']
df_train['rougher.output.concentrate_ag'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      11.842002
median    11.797862
var        7.218432
std        2.686714
Name: rougher.output.concentrate_ag, dtype: float64

First Purification Stage (Concentrate)

In [92]:
first_purif_ag = df_train['primary_cleaner.output.concentrate_ag']
df_train['primary_cleaner.output.concentrate_ag'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      8.437240
median    8.436041
var       3.567055
std       1.888665
Name: primary_cleaner.output.concentrate_ag, dtype: float64

Second Purification Stage (Tail - Residues)

In [93]:
second_purif_ag = df_train['secondary_cleaner.output.tail_ag']
df_train['secondary_cleaner.output.tail_ag'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      14.239971
median    14.706733
var       16.491916
std        4.061024
Name: secondary_cleaner.output.tail_ag, dtype: float64

Final Output

In [94]:
final_ag = df_train['final.output.concentrate_ag']
df_train['final.output.concentrate_ag'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      5.156254
median    4.999101
var       1.797010
std       1.340526
Name: final.output.concentrate_ag, dtype: float64

In [95]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=ag_metal, name='Base Metal'))
fig.add_trace(go.Histogram(x=first_purif_ag, name='First Purification'))
fig.add_trace(go.Histogram(x=final_ag, name='Final Output'))


# Overlaying histograms
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.update_layout(title='Ag Concentrate Distribution')
fig.update_yaxes(range=[0, 1100]) # matching the y axis of the above chart for a better comparison
fig.show()

# Summary
df_train[['rougher.output.concentrate_ag','primary_cleaner.output.concentrate_ag','final.output.concentrate_ag']].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the

Unnamed: 0,rougher.output.concentrate_ag,primary_cleaner.output.concentrate_ag,final.output.concentrate_ag
mean,11.842002,8.43724,5.156254
median,11.797862,8.436041,4.999101
var,7.218432,3.567055,1.79701
std,2.686714,1.888665,1.340526


***Pb***

Base Concentrate

In [96]:
pb_metal = df_train['rougher.output.concentrate_pb']
df_train['rougher.output.concentrate_pb'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      7.612353
median    7.708724
var       2.952758
std       1.718359
Name: rougher.output.concentrate_pb, dtype: float64

First Purification Stage (Concentrate)

In [97]:
first_purif_pb = df_train['primary_cleaner.output.concentrate_pb']
df_train['primary_cleaner.output.concentrate_pb'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      9.574840
median    9.919548
var       6.442381
std       2.538184
Name: primary_cleaner.output.concentrate_pb, dtype: float64

Second Purification Stage (Tail - Residues)

In [98]:
second_purif_pb = df_train['secondary_cleaner.output.tail_pb']
df_train['secondary_cleaner.output.tail_pb'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      5.511462
median    5.101502
var       6.396597
std       2.529149
Name: secondary_cleaner.output.tail_pb, dtype: float64

Final Output

In [99]:
final_pb = df_train['final.output.concentrate_pb']
df_train['final.output.concentrate_pb'].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.



mean      10.172706
median    10.347546
var        2.527265
std        1.589737
Name: final.output.concentrate_pb, dtype: float64

In [100]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=pb_metal, name='Base Metal'))
fig.add_trace(go.Histogram(x=first_purif_pb, name='First Purification'))
fig.add_trace(go.Histogram(x=final_pb, name='Final Output'))


# Overlaying histograms
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.update_layout(title='Pb Concentrate Distribution')
fig.update_yaxes(range=[0, 1100]) # matching the y axis of the above chart for a better comparison
fig.show()

# Summary
df_train[['rougher.output.concentrate_pb','primary_cleaner.output.concentrate_pb','final.output.concentrate_pb']].agg([np.mean, np.median, np.var, np.std])


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the

Unnamed: 0,rougher.output.concentrate_pb,primary_cleaner.output.concentrate_pb,final.output.concentrate_pb
mean,7.612353,9.57484,10.172706
median,7.708724,9.919548,10.347546
var,2.952758,6.442381,2.527265
std,1.718359,2.538184,1.589737


***Feed Particle Sizing***

In [101]:
train_feed = df_train['rougher.input.feed_size']
test_feed = df_test['rougher.input.feed_size']

fig = go.Figure()
fig.add_trace(go.Histogram(x=train_feed, name='Train dataset'))
fig.add_trace(go.Histogram(x=test_feed, name='Test dataset'))


# Overlaying histograms
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.update_layout(title='Particle Feed Distribution')
fig.show()

# Summary
print('Train dataset: \n', df_train['rougher.input.feed_size'].agg([np.mean, np.median, np.var, np.std]))
print('')
print('Test dataset: \n', df_test['rougher.input.feed_size'].agg([np.mean, np.median, np.var, np.std]))

Train dataset: 
 mean       57.215122
median     53.843214
var       357.739559
std        18.914004
Name: rougher.input.feed_size, dtype: float64

Test dataset: 
 mean       55.937535
median     50.002004
var       516.391711
std        22.724254
Name: rougher.input.feed_size, dtype: float64



The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the

***Stage Concentration***

In [102]:
raw_feed = df_train[['rougher.input.feed_au','rougher.input.feed_ag','rougher.input.feed_pb']].sum(axis=1)
rougher_concentrate = df_train[['rougher.output.concentrate_au','rougher.output.concentrate_ag','rougher.output.concentrate_pb']].sum(axis=1)
final_concentrate = df_train[['final.output.concentrate_au','final.output.concentrate_ag','final.output.concentrate_pb']].sum(axis=1).dropna()


fig = go.Figure()
fig.add_trace(go.Histogram(x=raw_feed, name='Raw Feed'))
fig.add_trace(go.Histogram(x=rougher_concentrate, name='Rougher Concentrate'))
fig.add_trace(go.Histogram(x=final_concentrate, name='Final Concentrate'))


# Overlaying histograms
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.update_layout(title='Concentrate Distribution')
fig.update_yaxes(range=[0, 2000]) # matching the y axis of the above chart for a better comparison
fig.show()

# Summary
print('Raw Feed: \n', raw_feed.agg([np.mean, np.median, np.var, np.std]))
print('')
print('Rougher Concentrate: \n', rougher_concentrate.agg([np.mean, np.median, np.var, np.std]))
print('')
print('Final Concentrate: \n', final_concentrate.agg([np.mean, np.median, np.var, np.std]))

Raw Feed: 
 mean      20.436083
median    19.597006
var       18.577679
std        4.310183
dtype: float64

Rougher Concentrate: 
 mean      39.231118
median    40.049979
var       43.990866
std        6.632561
dtype: float64

Final Concentrate: 
 mean      59.460725
median    60.159669
var       22.211627
std        4.712921
dtype: float64



The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.


The provided callable <function median at 0x10822bba0> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.


The provided callable <function var at 0x107d3c040> is currently using Series.var. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "var" instead.


The provided callable <function std at 0x107d33ec0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.


The provided callable <function mean at 0x107d33d80> is currently using Series.mean. In a future version of pandas, the

**`Findings`**


<span style='color:teal'> 

Individual Concentrates
    
The concentrate distribution varies across Au, Ag, and Pb. 
   - Au (gold) has a higher frequency of instances overall and its values are larger with Base concentration in the 15-25 range, followed by First Purification in the 30-40 range then 40+ under the Final Output stage. The feature variance increases as the material is purified and converted into the Final Output (very accute compression at that stage) which in turn is showing us more final output concentration out of this metal. 
    
    
   - Ag (silver) concentrate shows a path reversal along with much smaller values overall (and less instances). As this material goes through its process towards the Final Output, we see a shift towards the left (less frequency in both values and instances compared to the First Purification and Base stages). Telling us that there is more difficulty refining/extracting this metal compared to gold.
    
    
   - Pb's (lead) concentration is much more centered compared to the other two, where we see Base concentrate spead over a 'longer' path as it passes through the First Purifications stage then it 'shrinks' as it goes through its Final Output stage. Sitting in the middle of the pack and giving us a sense that this metal does just okay as it passes through the process.
    
Total Concentrates - Raw Feed, Rougher and Final
    
Overall concentrate distribution follows a positive path towards its final output (good extraction overall from feed, all the way to the final output).
   - As raw feed is introduced into the floation process, we see a slightly higher rougher concentrate count as the metal is stabilized and 'concentrated'.
    
    
   - Once the raw feed is stabilized and is now a rougher concentrate, it enters the purification stages (two) where we see the final output being about double the amount from rougher to final.

Particle Feed
    
Feed distribution across the Training and Test datasets are very similar. 
   - We see higher frequency of values in the Test dataset which is expected given the preprocessing we did with the Train dataset (removal of NaNs, replacement, etc...). 
    
   
   - Both follow the same path, positive skews. Mean is higher than the median in both instances.
    
Anomalies - `performed in earlier stages for Training dataset (NaN removal, 0 fills and Infinity value replacement)`
    
The concentrate distribution across Au, Ag, and Pb originally showed 'outliers' where values ranging from 0 to 1, NaN or Inf values took a decent 'bite' out of the dataset, after corrections on these, all of the charts above have dimished values closely 'glued' to the y-axis without affecting the overall analysis. 
    
   - Removing these 'anomalies' should help the modeling process so the model itself can 'focus' on the more recurring and significant values across the metals/stages.

</span>

***Final sMAPE Calculation***

<div>
<img src="https://practicum-content.s3.us-west-1.amazonaws.com/resources/moved__smape_1589899561.jpg" width="700" align="left"> 
</div>

In [103]:
rougher_target = pd.Series(df_train['rougher.output.recovery'])
final_target = pd.Series(df_train['final.output.recovery'])

rougher_predict = pd.Series(estimated_recovery, name='estimated_recovery')
final_predict = pd.Series(final_estimated_recovery, name='final_estimated_recovery')

target = pd.concat([rougher_target, final_target], axis=1).to_numpy()
prediction = pd.concat([rougher_predict, final_predict], axis=1).to_numpy()

In [104]:
def sMAPE_final_calc(target, prediction):
    # passing in the targets and predictions from the earlier code block, specifying columns
    target = np.array(target)
    prediction = np.array(prediction)
    
    RT, FT = target[:, 0], target[:, 1]
    RP, FP = prediction[:, 0], prediction[:, 1]
    #print(RT)
    #print(FT)
    #print(RP)
    #print(FP)
    
    # creating the calculations for both the rougher and final recovery values
    rougher = 100/len(RT) * np.nansum(2 * np.abs(RP - RT) / (np.abs(RT) + np.abs(RP))) # + np.finfo(float).eps
    final = 100/len(FT) * np.nansum(2 * np.abs(FP - FT) / (np.abs(FT) + np.abs(FP))) # + np.finfo(float).eps
    final_sMAPE =  .25 * rougher + .75 * final
    
    """
    using np.nansum to treat NaN as zeros and + np.finfo(float).eps (going back & forth with np.finfo)
    ^np.finfo(float).eps doubles my final smape value but didn't change the intial output of the models (before introducing GridSearchCV)
    ^^if they ever decide to run/load on me now...(was getting values before GridSearchCV)
    
    """
    
    return final_sMAPE

sMAPE_scorer = make_scorer(sMAPE_final_calc, greater_is_better=False)

In [105]:
np.seterr(invalid='ignore')
result = sMAPE_final_calc(target, prediction)
print('Final sMAPE:', result)

#Final sMAPE: 1.2182257663447701e-14

Final sMAPE: 1.2182257663447701e-14


error log

In [106]:
# able to get the 1.218 value via using iloc/np.sum & abs in the rougher/final calculations
## toyed with the other solutions arr[:,0], etc but kept getting NaN
### 3rd attempt specifying column names

# Data Modeling

In [107]:
# creating the features and target variables for our datasets

target = df_train[['rougher.output.recovery','final.output.recovery']] # extracting target
features = df_train.drop(['rougher.output.recovery','final.output.recovery'], axis=1) # extracting features
features_net = df_train.drop(column_difference, axis=1) # using this afterwards

# constant model for comparison
constant = pd.Series(target.mean(), index = target.index)


***Linear Regression***

`scoring=` errors

In [108]:
# scoring issues
## AttributeError: 'numpy.ndarray' object has no attribute 'iloc'
### TypeError: '(slice(None, None, None), 0)' is an invalid key
#### IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [109]:
linear_model = LinearRegression()
linear_parameters = [{"positive": [True, False], "fit_intercept": [True, False], "n_jobs": list(range(1,9))}]
linear_clf = RandomizedSearchCV(linear_model, linear_parameters, cv=5, scoring=sMAPE_scorer)
linear_clf.fit(features_net, target)
linear_results = linear_clf.cv_results_
#linear_predict = linear_model.predict(features_net)
#linear_score = cross_val_score(linear_model, features_net, target, cv=10, scoring=sMAPE_scorer)
#print(linear_score)

print('Best Parameters:', linear_clf.best_params_)
print('')
print('Best sMAPE score:', abs(linear_clf.best_score_))
print('')
print('Average Fold Score:', np.mean(abs(linear_results['split0_test_score']) + abs(linear_results['split1_test_score'] +
                              abs(linear_results['split2_test_score'] + abs(linear_results['split3_test_score'] +
                              abs(linear_results['split4_test_score']))))))

Best Parameters: {'positive': True, 'n_jobs': 2, 'fit_intercept': True}

Best sMAPE score: 9.883450302464468

Average Fold Score: 10.647522914166839


***Linear/Ridge Regression***

In [110]:
rdg = Ridge(random_state=12345)
rdg_parameters = [{"alpha": [0.5, 10, 100, 500], "fit_intercept": [True, False], "solver": ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}]
rdg_clf = RandomizedSearchCV(rdg, rdg_parameters, cv=5, scoring=sMAPE_scorer)
rdg_clf.fit(features_net, target)
rdg_results = rdg_clf.cv_results_
#rdg_predict = rdg.predict(features_net)
#ridge_score = cross_val_score(rdg, features_net, target, cv=5, scoring=sMAPE_scorer)

print('Best Parameters:', rdg_clf.best_params_)
print('')
print('Best sMAPE score:', abs(rdg_clf.best_score_))
print('')
print('Average Fold Score:', np.mean(abs(rdg_results['split0_test_score']) + abs(rdg_results['split1_test_score'] +
                              abs(rdg_results['split2_test_score'] + abs(rdg_results['split3_test_score'] +
                              abs(rdg_results['split4_test_score']))))))


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_iter was reached which means the coef_ did not converge


The max_i

Best Parameters: {'solver': 'saga', 'fit_intercept': True, 'alpha': 0.5}

Best sMAPE score: 10.047845122190463

Average Fold Score: 11.127905720364055


***Random Forest Regressor***

In [111]:
forest_model = RandomForestRegressor(random_state=12345)
forest_parameters = [{'max_depth': list(range(5, 20)), 'max_features': list(range(0,9))}]
forest_clf = RandomizedSearchCV(forest_model, forest_parameters, cv=5, scoring=sMAPE_scorer)
forest_clf.fit(features_net, target)
forest_results = forest_clf.cv_results_
#forest_predict = forest_model.predict(features_net)
#forest_score = cross_val_score(forest_model, features_net, target, cv=5, scoring=sMAPE_scorer)

print('Best Parameters:', forest_clf.best_params_)
print('')
print('Best sMAPE score:', abs(forest_clf.best_score_))
print('')
print('Average Fold Score:', np.mean(abs(forest_results['split0_test_score']) + abs(forest_results['split1_test_score'] +
                              abs(forest_results['split2_test_score'] + abs(forest_results['split3_test_score'] +
                              abs(forest_results['split4_test_score']))))))



5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/homebrew/lib/python3.11/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/opt/homebrew/lib/python3.11/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/opt/homebrew/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 96, in validate_parameter_constraints
    raise InvalidParameter

Best Parameters: {'max_features': 6, 'max_depth': 5}

Best sMAPE score: 9.105681621307014

Average Fold Score: nan


***Decision Tree Regressor***

In [112]:
tree_model = DecisionTreeRegressor(random_state=12345)
tree_parameters = [{'max_depth': list(range(1, 20)), "splitter": ['best', 'random'], 'max_features': list(range(1,9))}]
tree_clf = RandomizedSearchCV(tree_model, tree_parameters, cv=5, scoring=sMAPE_scorer)
tree_clf.fit(features_net, target)
tree_results = tree_clf.cv_results_
#forest_predict = forest_model.predict(features_net)
#forest_score = cross_val_score(forest_model, features_net, target, cv=5, scoring=sMAPE_scorer)

print('Best Parameters:', tree_clf.best_params_)
print('')
print('Best sMAPE score:', abs(tree_clf.best_score_))
print('')
print('Average Fold Score:', np.mean(abs(tree_results['split0_test_score']) + abs(tree_results['split1_test_score'] +
                              abs(tree_results['split2_test_score'] + abs(tree_results['split3_test_score'] +
                              abs(tree_results['split4_test_score']))))))

Best Parameters: {'splitter': 'random', 'max_features': 8, 'max_depth': 1}

Best sMAPE score: 10.55577234952236

Average Fold Score: 20.375392765941577


***Model Testing = Test dataset***

In [113]:
# assuming we pull the target from the full dataset since test dataset does not contain such values
match_list = df_test['date'].to_list()
full_match_list = df_full['date'].to_list()

#display(match_list)
mask = df_full['date'].isin(match_list)
df_full.drop(columns='date', index=1, inplace=True)
#df_full = df_full[:-1]
matching_rows = df_full[mask].fillna(0)

test_target = matching_rows[['rougher.output.recovery','final.output.recovery']]
test_target = test_target[:-1] # kept getting a mismatch of one row, removed the last one

# Checks
#df_full[df_full['date'].isin(match_list)]
#df_test[df_test['date'].isin(full_match_list)]
#display(test_target)

df_test.drop(columns='date', index=1, inplace=True)
df_test = df_test.fillna(0)
df_test.replace([np.inf, -np.inf], 0, inplace=True)
features_test = df_test
#display(df_test)

# Predict on test data
forest_prediction = forest_clf.predict(features_test)
#print(features_test)

# Compute mean squared error
mse = mean_squared_error(test_target, forest_prediction)

# Print results
print("Test dataset MSE:", mse)
print("Test dataset RMSE:",mse ** 0.5)

Test dataset MSE: 504.45534808930336
Test dataset RMSE: 22.46008343905479



Boolean Series key will be reindexed to match DataFrame index.



In [114]:
# dummy regressor
dummy_reg = DummyRegressor(strategy='mean')
# model training
dummy_reg.fit(features_net, target)
# preditions on test data
dummy_pred = dummy_reg.predict(features_test)

# mean squared error calcs
mse = mean_squared_error(test_target, dummy_pred)
rmse = np.sqrt(mse)
print("Dummy MSE:", mse)
print("Dummy RMSE:", rmse)

Dummy MSE: 621.667515158829
Dummy RMSE: 24.9332612218865


***Final sMAPE score comparison to DummyRegressor***

In [115]:
print('DummyRegressor Final sMAPE score (test dataset) \n', sMAPE_final_calc(test_target,dummy_pred))
print('')
print('RandomForestRegressor Final sMAPE score (test dataset) \n', sMAPE_final_calc(test_target,forest_prediction))
print('')

# sMAPE score testing
#forest_predict = forest_clf.predict(features_net)
#print('RandomForestRegressor sMAPE score (train dataset) \n', sMAPE_final_calc(target,forest_predict))

DummyRegressor Final sMAPE score (test dataset) 
 25.97274711626031

RandomForestRegressor Final sMAPE score (test dataset) 
 25.134513196177796



In [116]:
# random_state = np.random.RandomState(seed=1)
# constants = random_state.randn(5)

# dummy_reg = DummyRegressor(strategy='mean')
# #dummy_params = [{'quantile': [0.25, 0.50, 0.75, 1.0], 'constant': constants}] # understand params are more specific to each strategy
# dummy_clf = RandomizedSearchCV(dummy_reg, cv=5, scoring=sMAPE_scorer)
# dummy_clf.fit(features_test, test_target)

# print('Final sMAPE score comparison to DummyRegressor')
# print('')
# print('DummyRegressor sMAPE score:', abs(dummy_clf.best_score_))
# print('')
# print('RandomForestRegressor sMAPE score:', abs(forest_clf.best_score_))
# print('')

# Conclusions

<span style='color:teal'> 
    
    
1. At start and from our model selection and replacing GridSearchCV with the RandomSearchCV call (so it isn't an overly extensive search), we find that out of the four models the best model option is the ****RandomForest Regressor**** based on comparisons from the `best_score_` values. Values were as follows:
    
    a) LinearRegression: 9.88
    
    b) RidgeRegression: 10.01
    
    ****c) RandomForest Regressor: 9.10****
    
    d) DecisionTree Regressor: 10.12
    
    
2. Further analyzing the model and leveraging the test dataset to see how the trained model handles new data, we find that the RandomForest Regressor computes an MSE of 462 and and RMSE of 21 (avg. output recovery delta to actual target values). 
    
    - When comparing to a DummyRegressor model which is also leveraging the test dataset, we see values that are not too far off from each other (Forest RMSE of 21 vs Dummy RMSE of 24). 
    
    
    
3. Based on the final sMAPE comparison between the RandomForestRegressor and the DummyRegressor model, we conclude that the RandomForestRegressor gives us a % error of ~25% when leveraging the test dataset (*compared to a nearly identical result from our DummyRegressor model*). This % error result from our RandomForestRegressor is still a better result compared to all other models tested.
    
    - When going one step further and analyzing the final sMAPE value on our train dataset, we find a ~7% error. Meaning our final sMAPE result on our test dataset introduces an +18% delta to our overall results/predictions. 

</span>