# Download and prepare the data. Explain the procedure.

### Project description

To produce gold, raw ore passes through three steps: the first is flotation, and the second and third together make up purification. In every step, a feed material enters and two streams leave. The first stream is the concentrate, which continues along the main process line and holds most of the gold. The second stream is the tails, which carries the leftovers out of the process. Any gold that ends up in the tails is lost, which reduces the recovery.

![flowchart](img/flowchart.jpg)

### Import needed libraries

In [2]:
# Data tools
import pandas as pd
import numpy as np

# Graphics and display
from IPython.core.interactiveshell import InteractiveShell
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
%matplotlib inline

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.dummy import DummyRegressor

# Statistics
from scipy import stats

print('Project libraries have been successfully imported!')

Project libraries have been successfully imported!


### Set environment variables

In [3]:
# Display the result of every expression in a cell, not only print() calls or the last expression
InteractiveShell.ast_node_interactivity = "all"

## Open the files and look into the data.

In [4]:
try:
    data_full = pd.read_csv('gold_recovery_full.csv')
    data_test = pd.read_csv('gold_recovery_test.csv')
    data_train = pd.read_csv('gold_recovery_train.csv')
except FileNotFoundError:  # local files are missing, fall back to the /datasets directory
    data_full = pd.read_csv('/datasets/gold_recovery_full.csv')
    data_test = pd.read_csv('/datasets/gold_recovery_test.csv')
    data_train = pd.read_csv('/datasets/gold_recovery_train.csv')
    
print('Data has been read correctly!')

Data has been read correctly!
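
By default `pd.read_csv` keeps the `date` column as plain text, which is why it appears as `object` in the summaries that follow. The snippet is a minimal sketch of parsing the timestamps while reading, assuming they follow the `YYYY-MM-DD HH:MM:SS` layout seen in the data. The in-memory CSV and its numeric values are made up for illustration and stand in for the real files.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the layout of the real CSV files
sample_csv = io.StringIO(
    "date,rougher.input.feed_ag\n"
    "2016-01-15 00:00:00,6.1\n"
    "2016-01-15 01:00:00,6.2\n"
)

# parse_dates converts the listed columns to datetime64 while loading
sample = pd.read_csv(sample_csv, parse_dates=["date"])

print(sample.dtypes)
```

With a datetime column, the rows can later be sorted or matched across the three files by timestamp instead of by string comparison.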


### Looking at the dataset

### Headers of the data

The project has many features. To make it easy to tell what each one holds, the column names follow a formal naming scheme:

[stage].[parameter_type].[parameter_name]

Example: rougher.input.feed_ag
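
The naming scheme can be used to group features programmatically. The snippet is a minimal sketch that splits a column name into its three parts; `split_feature_name` is a hypothetical helper, not part of the project code. It assumes every name has at least two dots, so a column such as `date` would have to be excluded first.

```python
import pandas as pd

def split_feature_name(name):
    """Split '[stage].[parameter_type].[parameter_name]' into its three parts."""
    # maxsplit=2 keeps any further dots inside the parameter name
    stage, parameter_type, parameter_name = name.split(".", 2)
    return {
        "stage": stage,
        "parameter_type": parameter_type,
        "parameter_name": parameter_name,
    }

columns = [
    "rougher.input.feed_ag",
    "final.output.recovery",
    "primary_cleaner.state.floatbank8_a_air",
]

# One row per column name, indexed by the full name
parts = pd.DataFrame([split_feature_name(c) for c in columns], index=columns)
print(parts)
```

Grouping `parts` by `stage` or `parameter_type` then gives a quick view of which measurements belong to which part of the process.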


### Data description

In [5]:
# Print the number of zero values in each column
def zero_check(df):
    for i in df:
        print(i,len(df[df[i]==0]))
        
# Print every column that contains nulls, with the null count and its share of all rows (num = decimal places)
def get_percent_of_na(df, num):
    count = 0
    df = df.copy()
    s = (df.isna().sum() / df.shape[0])
    for column, percent in zip(s.index, s.values):
        num_of_nulls = df[column].isna().sum()
        if num_of_nulls == 0:
            continue
        else:
            count += 1
        print('{} has {} nulls, which is {:.{}%} percent of Nulls'.format(column, num_of_nulls, percent, num))
    if count != 0:
        print("\033[1m" + 'There are {} columns with NA.'.format(count) + "\033[0m")
    else:
        print()
        print("\033[1m" + 'There are no columns with NA.' + "\033[0m")       
        
# function to display general information about the dataset
def general_info(df):
    print("\033[1m" + "\033[0m")
    display(pd.concat([df.dtypes, df.count(),df.isna().sum(),df.isna().sum()/len(df)], keys=['type','count','na','na%'],
                      axis=1))
    print()
    print("\033[1m" + 'Head:')  
    display(df.head())
    print()
    print("\033[1m" + 'Tail:')
    display(df.tail())
    print()
    print("\033[1m" + 'Info:')
    print()
    display(df.info())
    print()
    print("\033[1m" + 'Describe:')
    print()
    display(df.describe())
    print()
    print("\033[1m" + 'Describe include: all :')
    print()
    display(df.describe(include='all'))
    print()
    print("\033[1m" + 'nulls in the columns:')
    print()
    display(get_percent_of_na(df, 4))  # the function prints its report and returns None
    print()
    print("\033[1m" + 'Zeros in the columns:') 
    print()
    display(zero_check(df))
    print()
    print("\033[1m" + 'Shape:', df.shape)
    print()
    print()
    print('Duplicated:',"\033[1m" + 'We have {} duplicated rows\n'.format(df.duplicated().sum()) + "\033[0m")
    print()
    print("\033[1m" + 'Dtypes:')  
    display(df.dtypes)
    print()

#### data_full

In [6]:
# Print general information about the dataset
print('information about "data_full" dataset:')
general_info(data_full)

information about "data_full" dataset:
[1m[0m


Unnamed: 0,type,count,na,na%
date,object,22716,0,0.000000
final.output.concentrate_ag,float64,22627,89,0.003918
final.output.concentrate_pb,float64,22629,87,0.003830
final.output.concentrate_sol,float64,22331,385,0.016948
final.output.concentrate_au,float64,22630,86,0.003786
...,...,...,...,...
secondary_cleaner.state.floatbank5_a_level,float64,22615,101,0.004446
secondary_cleaner.state.floatbank5_b_air,float64,22615,101,0.004446
secondary_cleaner.state.floatbank5_b_level,float64,22616,100,0.004402
secondary_cleaner.state.floatbank6_a_air,float64,22597,119,0.005239



Head:


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-01-15 00:00:00,6.055403,9.889648,5.507324,42.19202,70.541216,10.411962,0.895447,16.904297,2.143149,...,14.016835,-502.488007,12.099931,-504.715942,9.925633,-498.310211,8.079666,-500.470978,14.151341,-605.84198
1,2016-01-15 01:00:00,6.029369,9.968944,5.257781,42.701629,69.266198,10.462676,0.927452,16.634514,2.22493,...,13.992281,-505.503262,11.950531,-501.331529,10.039245,-500.169983,7.984757,-500.582168,13.998353,-599.787184
2,2016-01-15 02:00:00,6.055926,10.213995,5.383759,42.657501,68.116445,10.507046,0.953716,16.208849,2.257889,...,14.015015,-502.520901,11.912783,-501.133383,10.070913,-500.129135,8.013877,-500.517572,14.028663,-601.427363
3,2016-01-15 03:00:00,6.047977,9.977019,4.858634,42.689819,68.347543,10.422762,0.883763,16.532835,2.146849,...,14.03651,-500.857308,11.99955,-501.193686,9.970366,-499.20164,7.977324,-500.255908,14.005551,-599.996129
4,2016-01-15 04:00:00,6.148599,10.142511,4.939416,42.774141,66.927016,10.360302,0.792826,16.525686,2.055292,...,14.027298,-499.838632,11.95307,-501.053894,9.925709,-501.686727,7.894242,-500.356035,13.996647,-601.496691



Tail:


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
22711,2018-08-18 06:59:59,3.22492,11.356233,6.803482,46.713954,73.75515,8.769645,3.141541,10.403181,1.52922,...,23.031497,-501.167942,20.007571,-499.740028,18.006038,-499.834374,13.001114,-500.155694,20.00784,-501.296428
22712,2018-08-18 07:59:59,3.195978,11.349355,6.862249,46.86678,69.049291,8.897321,3.130493,10.54947,1.612542,...,22.960095,-501.612783,20.03566,-500.251357,17.998535,-500.395178,12.954048,-499.895163,19.968498,-501.041608
22713,2018-08-18 08:59:59,3.109998,11.434366,6.886013,46.795691,67.002189,8.529606,2.911418,11.115147,1.596616,...,23.015718,-501.711599,19.951231,-499.857027,18.019543,-500.451156,13.023431,-499.914391,19.990885,-501.518452
22714,2018-08-18 09:59:59,3.367241,11.625587,6.799433,46.408188,65.523246,8.777171,2.819214,10.463847,1.602879,...,23.024963,-501.153409,20.054122,-500.314711,17.979515,-499.272871,12.992404,-499.976268,20.013986,-500.625471
22715,2018-08-18 10:59:59,3.598375,11.737832,6.717509,46.299438,70.281454,8.40669,2.517518,10.652193,1.389434,...,23.018622,-500.492702,20.020205,-500.220296,17.963512,-499.93949,12.990306,-500.080993,19.990336,-499.191575



Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22716 entries, 0 to 22715
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                22716 non-null  object 
 1   final.output.concentrate_ag                         22627 non-null  float64
 2   final.output.concentrate_pb                         22629 non-null  float64
 3   final.output.concentrate_sol                        22331 non-null  float64
 4   final.output.concentrate_au                         22630 non-null  float64
 5   final.output.recovery                               20753 non-null  float64
 6   final.output.tail_ag                                22633 non-null  float64
 7   final.output.tail_pb                                22516 non-null  float64
 8   final.output.tail_sol                               22445 non-nu

None


Describe:



Unnamed: 0,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,primary_cleaner.input.sulfate,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,22627.0,22629.0,22331.0,22630.0,20753.0,22633.0,22516.0,22445.0,22635.0,21107.0,...,22571.0,22587.0,22608.0,22607.0,22615.0,22615.0,22615.0,22616.0,22597.0,22615.0
mean,4.781559,9.095308,8.640317,40.001172,67.447488,8.92369,2.488252,9.523632,2.827459,140.277672,...,18.205125,-499.878977,14.356474,-476.532613,14.883276,-503.323288,11.626743,-500.521502,17.97681,-519.361465
std,2.030128,3.230797,3.785035,13.398062,11.616034,3.517917,1.189407,4.079739,1.262834,49.919004,...,6.5607,80.273964,5.655791,93.822791,6.372811,72.925589,5.757449,78.956292,6.636203,75.477151
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3e-06,...,0.0,-799.920713,0.0,-800.836914,-0.42326,-799.741097,0.427084,-800.258209,-0.079426,-810.473526
25%,4.018525,8.750171,7.116799,42.383721,63.282393,7.684016,1.805376,8.143576,2.303108,110.177081,...,14.09594,-500.896232,10.882675,-500.309169,10.941299,-500.628697,8.037533,-500.167897,13.968418,-500.981671
50%,4.953729,9.914519,8.908792,44.653436,68.322258,9.484369,2.653001,10.212998,2.913794,141.330501,...,18.007326,-499.917108,14.947646,-499.612292,14.859117,-499.865158,10.989756,-499.95198,18.004215,-500.095463
75%,5.862593,10.929839,10.705824,46.111999,72.950836,11.084557,3.28779,11.860824,3.555077,174.049914,...,22.998194,-498.361545,17.977502,-400.224147,18.014914,-498.489381,14.001193,-499.492354,23.009704,-499.526388
max,16.001945,17.031899,19.61572,53.611374,100.0,19.552149,6.086532,22.861749,9.789625,274.409626,...,60.0,-127.692333,31.269706,-6.506986,63.116298,-244.483566,39.846228,-120.190931,54.876806,-29.093593



Describe include: all :



Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,22716,22627.0,22629.0,22331.0,22630.0,20753.0,22633.0,22516.0,22445.0,22635.0,...,22571.0,22587.0,22608.0,22607.0,22615.0,22615.0,22615.0,22616.0,22597.0,22615.0
unique,22716,,,,,,,,,,...,,,,,,,,,,
top,2016-01-15 00:00:00,,,,,,,,,,...,,,,,,,,,,
freq,1,,,,,,,,,,...,,,,,,,,,,
mean,,4.781559,9.095308,8.640317,40.001172,67.447488,8.92369,2.488252,9.523632,2.827459,...,18.205125,-499.878977,14.356474,-476.532613,14.883276,-503.323288,11.626743,-500.521502,17.97681,-519.361465
std,,2.030128,3.230797,3.785035,13.398062,11.616034,3.517917,1.189407,4.079739,1.262834,...,6.5607,80.273964,5.655791,93.822791,6.372811,72.925589,5.757449,78.956292,6.636203,75.477151
min,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-799.920713,0.0,-800.836914,-0.42326,-799.741097,0.427084,-800.258209,-0.079426,-810.473526
25%,,4.018525,8.750171,7.116799,42.383721,63.282393,7.684016,1.805376,8.143576,2.303108,...,14.09594,-500.896232,10.882675,-500.309169,10.941299,-500.628697,8.037533,-500.167897,13.968418,-500.981671
50%,,4.953729,9.914519,8.908792,44.653436,68.322258,9.484369,2.653001,10.212998,2.913794,...,18.007326,-499.917108,14.947646,-499.612292,14.859117,-499.865158,10.989756,-499.95198,18.004215,-500.095463
75%,,5.862593,10.929839,10.705824,46.111999,72.950836,11.084557,3.28779,11.860824,3.555077,...,22.998194,-498.361545,17.977502,-400.224147,18.014914,-498.489381,14.001193,-499.492354,23.009704,-499.526388



nulls in the columns:

final.output.concentrate_ag has 89 nulls, which is 0.3918% percent of Nulls
final.output.concentrate_pb has 87 nulls, which is 0.3830% percent of Nulls
final.output.concentrate_sol has 385 nulls, which is 1.6948% percent of Nulls
final.output.concentrate_au has 86 nulls, which is 0.3786% percent of Nulls
final.output.recovery has 1963 nulls, which is 8.6415% percent of Nulls
final.output.tail_ag has 83 nulls, which is 0.3654% percent of Nulls
final.output.tail_pb has 200 nulls, which is 0.8804% percent of Nulls
final.output.tail_sol has 271 nulls, which is 1.1930% percent of Nulls
final.output.tail_au has 81 nulls, which is 0.3566% percent of Nulls
primary_cleaner.input.sulfate has 1609 nulls, which is 7.0831% percent of Nulls
primary_cleaner.input.depressant has 1546 nulls, which is 6.8058% percent of Nulls
primary_cleaner.input.xanthate has 1151 nulls, which is 5.0669% percent of Nulls
primary_cleaner.output.concentrate_ag has 98 nulls, which is 0.4314% pe

None


Zeros in the columns:

date 0
final.output.concentrate_ag 1613
final.output.concentrate_pb 1613
final.output.concentrate_sol 1613
final.output.concentrate_au 1613
final.output.recovery 151
final.output.tail_ag 1950
final.output.tail_pb 1950
final.output.tail_sol 1950
final.output.tail_au 1950
primary_cleaner.input.sulfate 0
primary_cleaner.input.depressant 57
primary_cleaner.input.feed_size 0
primary_cleaner.input.xanthate 0
primary_cleaner.output.concentrate_ag 1626
primary_cleaner.output.concentrate_pb 1626
primary_cleaner.output.concentrate_sol 1626
primary_cleaner.output.concentrate_au 1626
primary_cleaner.output.tail_ag 1953
primary_cleaner.output.tail_pb 1953
primary_cleaner.output.tail_sol 1953
primary_cleaner.output.tail_au 1953
primary_cleaner.state.floatbank8_a_air 349
primary_cleaner.state.floatbank8_a_level 0
primary_cleaner.state.floatbank8_b_air 345
primary_cleaner.state.floatbank8_b_level 0
primary_cleaner.state.floatbank8_c_air 330
primary_cleaner.state.floatbank8_

None


Shape: (22716, 87)


Duplicated: We have 0 duplicated rows

Dtypes:


date                                           object
final.output.concentrate_ag                   float64
final.output.concentrate_pb                   float64
final.output.concentrate_sol                  float64
final.output.concentrate_au                   float64
                                               ...   
secondary_cleaner.state.floatbank5_a_level    float64
secondary_cleaner.state.floatbank5_b_air      float64
secondary_cleaner.state.floatbank5_b_level    float64
secondary_cleaner.state.floatbank6_a_air      float64
secondary_cleaner.state.floatbank6_a_level    float64
Length: 87, dtype: object




Notes
- There are 87 columns in the data
- All columns are `float64` except `date`, which is stored as `object`
- Some values are negative. Most of them are readings from level indicators; since those readings are relative, negative values make sense
- There are 22716 observations in the data
- Almost all columns contain missing (NaN) values
- Some of the columns contain zeros
- There are no duplicate rows
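
The observations in these notes can also be collected into one compact table instead of being read off the long printout. The snippet is a minimal sketch; `quality_summary` is a hypothetical helper, and the toy frame with its column names and values is made up for illustration.

```python
import numpy as np
import pandas as pd

def quality_summary(df):
    """Return per-column counts of missing, zero, and negative values."""
    # Only numeric columns are compared against 0; text columns such as date are skipped
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "na": numeric.isna().sum(),
        "zeros": (numeric == 0).sum(),
        "negatives": (numeric < 0).sum(),
    })

# Hypothetical toy data: one concentration column and one level column
toy = pd.DataFrame({
    "concentrate_au": [42.1, 0.0, np.nan, 44.6],
    "floatbank_level": [-502.4, -505.5, -499.8, np.nan],
})

summary = quality_summary(toy)
print(summary)
```

Calling `quality_summary(data_full)` would give the same three counts for every numeric column of the real dataset, which makes it easier to sort or filter the columns that need attention.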


# Recompute the rougher recovery from the gold (Au) concentrations
c = data_train['rougher.output.concentrate_au']  # gold in the rougher concentrate
f = data_train['rougher.input.feed_au']          # gold in the rougher feed
t = data_train['rougher.output.tail_au']         # gold in the rougher tails
data_train['new_rougher_recovery'] = c * (f - t) / (f * (c - t)) * 100
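
The expression above evaluates C * (F - T) / (F * (C - T)) * 100, where C, F and T are the gold concentrations in the concentrate, the feed and the tails. The snippet is a minimal sketch of the same calculation as a reusable function, checked on made-up numbers. The `recovery` function, the short column names and the guard against a zero denominator are assumptions added for illustration; the zero counts reported above suggest that such rows may occur.

```python
import numpy as np
import pandas as pd

def recovery(concentrate, feed, tail):
    """Recovery in percent: C * (F - T) / (F * (C - T)) * 100."""
    denominator = feed * (concentrate - tail)
    # Turn a zero denominator into NaN so the result is NaN instead of inf
    denominator = denominator.where(denominator != 0)
    return concentrate * (feed - tail) / denominator * 100

# Hypothetical values: the second row is all zeros
toy = pd.DataFrame({"C": [20.0, 0.0], "F": [6.0, 0.0], "T": [2.0, 0.0]})
toy["recovery"] = recovery(toy["C"], toy["F"], toy["T"])
print(toy)
```

For the first row the result is 20 * 4 / (6 * 18) * 100, roughly 74.07 percent; the all-zero row yields NaN.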

#### data_train

In [7]:
# Print general information about the dataset
print('information about "data_train" dataset:')
general_info(data_train)

information about "data_train" dataset:
[1m[0m


Unnamed: 0,type,count,na,na%
date,object,16860,0,0.000000
final.output.concentrate_ag,float64,16788,72,0.004270
final.output.concentrate_pb,float64,16788,72,0.004270
final.output.concentrate_sol,float64,16490,370,0.021945
final.output.concentrate_au,float64,16789,71,0.004211
...,...,...,...,...
secondary_cleaner.state.floatbank5_a_level,float64,16775,85,0.005042
secondary_cleaner.state.floatbank5_b_air,float64,16775,85,0.005042
secondary_cleaner.state.floatbank5_b_level,float64,16776,84,0.004982
secondary_cleaner.state.floatbank6_a_air,float64,16757,103,0.006109



Head:


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-01-15 00:00:00,6.055403,9.889648,5.507324,42.19202,70.541216,10.411962,0.895447,16.904297,2.143149,...,14.016835,-502.488007,12.099931,-504.715942,9.925633,-498.310211,8.079666,-500.470978,14.151341,-605.84198
1,2016-01-15 01:00:00,6.029369,9.968944,5.257781,42.701629,69.266198,10.462676,0.927452,16.634514,2.22493,...,13.992281,-505.503262,11.950531,-501.331529,10.039245,-500.169983,7.984757,-500.582168,13.998353,-599.787184
2,2016-01-15 02:00:00,6.055926,10.213995,5.383759,42.657501,68.116445,10.507046,0.953716,16.208849,2.257889,...,14.015015,-502.520901,11.912783,-501.133383,10.070913,-500.129135,8.013877,-500.517572,14.028663,-601.427363
3,2016-01-15 03:00:00,6.047977,9.977019,4.858634,42.689819,68.347543,10.422762,0.883763,16.532835,2.146849,...,14.03651,-500.857308,11.99955,-501.193686,9.970366,-499.20164,7.977324,-500.255908,14.005551,-599.996129
4,2016-01-15 04:00:00,6.148599,10.142511,4.939416,42.774141,66.927016,10.360302,0.792826,16.525686,2.055292,...,14.027298,-499.838632,11.95307,-501.053894,9.925709,-501.686727,7.894242,-500.356035,13.996647,-601.496691



Tail:


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
16855,2018-08-18 06:59:59,3.22492,11.356233,6.803482,46.713954,73.75515,8.769645,3.141541,10.403181,1.52922,...,23.031497,-501.167942,20.007571,-499.740028,18.006038,-499.834374,13.001114,-500.155694,20.00784,-501.296428
16856,2018-08-18 07:59:59,3.195978,11.349355,6.862249,46.86678,69.049291,8.897321,3.130493,10.54947,1.612542,...,22.960095,-501.612783,20.03566,-500.251357,17.998535,-500.395178,12.954048,-499.895163,19.968498,-501.041608
16857,2018-08-18 08:59:59,3.109998,11.434366,6.886013,46.795691,67.002189,8.529606,2.911418,11.115147,1.596616,...,23.015718,-501.711599,19.951231,-499.857027,18.019543,-500.451156,13.023431,-499.914391,19.990885,-501.518452
16858,2018-08-18 09:59:59,3.367241,11.625587,6.799433,46.408188,65.523246,8.777171,2.819214,10.463847,1.602879,...,23.024963,-501.153409,20.054122,-500.314711,17.979515,-499.272871,12.992404,-499.976268,20.013986,-500.625471
16859,2018-08-18 10:59:59,3.598375,11.737832,6.717509,46.299438,70.281454,8.40669,2.517518,10.652193,1.389434,...,23.018622,-500.492702,20.020205,-500.220296,17.963512,-499.93949,12.990306,-500.080993,19.990336,-499.191575



Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                16860 non-null  object 
 1   final.output.concentrate_ag                         16788 non-null  float64
 2   final.output.concentrate_pb                         16788 non-null  float64
 3   final.output.concentrate_sol                        16490 non-null  float64
 4   final.output.concentrate_au                         16789 non-null  float64
 5   final.output.recovery                               15339 non-null  float64
 6   final.output.tail_ag                                16794 non-null  float64
 7   final.output.tail_pb                                16677 non-null  float64
 8   final.output.tail_sol                               16715 non-nu

None


Describe:



Unnamed: 0,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,primary_cleaner.input.sulfate,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,16788.0,16788.0,16490.0,16789.0,15339.0,16794.0,16677.0,16715.0,16794.0,15553.0,...,16731.0,16747.0,16768.0,16767.0,16775.0,16775.0,16775.0,16776.0,16757.0,16775.0
mean,4.716907,9.113559,8.301123,39.467217,67.213166,8.757048,2.360327,9.303932,2.687512,129.479789,...,19.101874,-494.164481,14.778164,-476.600082,15.779488,-500.230146,12.377241,-498.956257,18.429208,-521.801826
std,2.096718,3.389495,3.82576,13.917227,11.960446,3.634103,1.215576,4.263208,1.272757,45.386931,...,6.883163,84.803334,5.999149,89.381172,6.834703,76.983542,6.219989,82.146207,6.958294,77.170888
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3e-06,...,0.0,-799.920713,0.0,-800.021781,-0.42326,-799.741097,0.427084,-800.258209,0.02427,-810.473526
25%,3.971262,8.825748,6.939185,42.055722,62.625685,7.610544,1.641604,7.870275,2.172953,103.064021,...,14.508299,-500.837689,10.741388,-500.269182,10.977713,-500.530594,8.925586,-500.147603,13.977626,-501.080595
50%,4.869346,10.065316,8.557228,44.498874,67.644601,9.220393,2.45369,10.021968,2.781132,131.783108,...,19.986958,-499.778379,14.943933,-499.593286,15.99834,-499.784231,11.092839,-499.93333,18.03496,-500.109898
75%,5.821176,11.054809,10.289741,45.976222,72.824595,10.97111,3.192404,11.648573,3.416936,159.539839,...,24.983961,-494.648754,20.023751,-400.137948,20.000701,-496.531781,15.979467,-498.418,24.984992,-499.56554
max,16.001945,17.031899,18.124851,53.611374,100.0,19.552149,6.086532,22.31773,9.789625,251.999948,...,60.0,-127.692333,28.003828,-71.472472,63.116298,-275.073125,39.846228,-120.190931,54.876806,-39.784927



Describe include: all :



Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,16860,16788.0,16788.0,16490.0,16789.0,15339.0,16794.0,16677.0,16715.0,16794.0,...,16731.0,16747.0,16768.0,16767.0,16775.0,16775.0,16775.0,16776.0,16757.0,16775.0
unique,16860,,,,,,,,,,...,,,,,,,,,,
top,2016-01-15 00:00:00,,,,,,,,,,...,,,,,,,,,,
freq,1,,,,,,,,,,...,,,,,,,,,,
mean,,4.716907,9.113559,8.301123,39.467217,67.213166,8.757048,2.360327,9.303932,2.687512,...,19.101874,-494.164481,14.778164,-476.600082,15.779488,-500.230146,12.377241,-498.956257,18.429208,-521.801826
std,,2.096718,3.389495,3.82576,13.917227,11.960446,3.634103,1.215576,4.263208,1.272757,...,6.883163,84.803334,5.999149,89.381172,6.834703,76.983542,6.219989,82.146207,6.958294,77.170888
min,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-799.920713,0.0,-800.021781,-0.42326,-799.741097,0.427084,-800.258209,0.02427,-810.473526
25%,,3.971262,8.825748,6.939185,42.055722,62.625685,7.610544,1.641604,7.870275,2.172953,...,14.508299,-500.837689,10.741388,-500.269182,10.977713,-500.530594,8.925586,-500.147603,13.977626,-501.080595
50%,,4.869346,10.065316,8.557228,44.498874,67.644601,9.220393,2.45369,10.021968,2.781132,...,19.986958,-499.778379,14.943933,-499.593286,15.99834,-499.784231,11.092839,-499.93333,18.03496,-500.109898
75%,,5.821176,11.054809,10.289741,45.976222,72.824595,10.97111,3.192404,11.648573,3.416936,...,24.983961,-494.648754,20.023751,-400.137948,20.000701,-496.531781,15.979467,-498.418,24.984992,-499.56554



nulls in the columns:

final.output.concentrate_ag has 72 nulls, which is 0.4270% percent of Nulls
final.output.concentrate_pb has 72 nulls, which is 0.4270% percent of Nulls
final.output.concentrate_sol has 370 nulls, which is 2.1945% percent of Nulls
final.output.concentrate_au has 71 nulls, which is 0.4211% percent of Nulls
final.output.recovery has 1521 nulls, which is 9.0214% percent of Nulls
final.output.tail_ag has 66 nulls, which is 0.3915% percent of Nulls
final.output.tail_pb has 183 nulls, which is 1.0854% percent of Nulls
final.output.tail_sol has 145 nulls, which is 0.8600% percent of Nulls
final.output.tail_au has 66 nulls, which is 0.3915% percent of Nulls
primary_cleaner.input.sulfate has 1307 nulls, which is 7.7521% percent of Nulls
primary_cleaner.input.depressant has 1262 nulls, which is 7.4852% percent of Nulls
primary_cleaner.input.xanthate has 985 nulls, which is 5.8422% percent of Nulls
primary_cleaner.output.concentrate_ag has 82 nulls, which is 0.4864% per

None


Zeros in the columns:

date 0
final.output.concentrate_ag 1263
final.output.concentrate_pb 1263
final.output.concentrate_sol 1263
final.output.concentrate_au 1263
final.output.recovery 89
final.output.tail_ag 1658
final.output.tail_pb 1658
final.output.tail_sol 1658
final.output.tail_au 1658
primary_cleaner.input.sulfate 0
primary_cleaner.input.depressant 57
primary_cleaner.input.feed_size 0
primary_cleaner.input.xanthate 0
primary_cleaner.output.concentrate_ag 1230
primary_cleaner.output.concentrate_pb 1230
primary_cleaner.output.concentrate_sol 1230
primary_cleaner.output.concentrate_au 1230
primary_cleaner.output.tail_ag 1549
primary_cleaner.output.tail_pb 1549
primary_cleaner.output.tail_sol 1549
primary_cleaner.output.tail_au 1549
primary_cleaner.state.floatbank8_a_air 256
primary_cleaner.state.floatbank8_a_level 0
primary_cleaner.state.floatbank8_b_air 257
primary_cleaner.state.floatbank8_b_level 0
primary_cleaner.state.floatbank8_c_air 282
primary_cleaner.state.floatbank8_c

None


Shape: (16860, 87)


Duplicated: We have 0 duplicated rows

Dtypes:


date                                           object
final.output.concentrate_ag                   float64
final.output.concentrate_pb                   float64
final.output.concentrate_sol                  float64
final.output.concentrate_au                   float64
                                               ...   
secondary_cleaner.state.floatbank5_a_level    float64
secondary_cleaner.state.floatbank5_b_air      float64
secondary_cleaner.state.floatbank5_b_level    float64
secondary_cleaner.state.floatbank6_a_air      float64
secondary_cleaner.state.floatbank6_a_level    float64
Length: 87, dtype: object




Notes
- There are 87 columns in the data
- All columns are `float64` except `date`, which is stored as `object`
- Some values are negative. Most of them are readings from level indicators; since those readings are relative, negative values make sense
- There are 16860 observations in the data
- Almost all columns contain missing (NaN) values
- Some of the columns contain zeros
- There are no duplicate rows

#### data_test

In [8]:
# Print general information about the dataset
print('information about "data_test" dataset:')
general_info(data_test)

information about "data_test" dataset:
[1m[0m


Unnamed: 0,type,count,na,na%
date,object,5856,0,0.0
primary_cleaner.input.sulfate,float64,5554,302,0.051571
primary_cleaner.input.depressant,float64,5572,284,0.048497
primary_cleaner.input.feed_size,float64,5856,0,0.0
primary_cleaner.input.xanthate,float64,5690,166,0.028347
primary_cleaner.state.floatbank8_a_air,float64,5840,16,0.002732
primary_cleaner.state.floatbank8_a_level,float64,5840,16,0.002732
primary_cleaner.state.floatbank8_b_air,float64,5840,16,0.002732
primary_cleaner.state.floatbank8_b_level,float64,5840,16,0.002732
primary_cleaner.state.floatbank8_c_air,float64,5840,16,0.002732



Head:


Unnamed: 0,date,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-09-01 00:59:59,210.800909,14.993118,8.08,1.005021,1398.981301,-500.225577,1399.144926,-499.919735,1400.102998,...,12.023554,-497.795834,8.016656,-501.289139,7.946562,-432.31785,4.872511,-500.037437,26.705889,-499.709414
1,2016-09-01 01:59:59,215.392455,14.987471,8.08,0.990469,1398.777912,-500.057435,1398.055362,-499.778182,1396.151033,...,12.05814,-498.695773,8.130979,-499.634209,7.95827,-525.839648,4.87885,-500.162375,25.01994,-499.819438
2,2016-09-01 02:59:59,215.259946,12.884934,7.786667,0.996043,1398.493666,-500.86836,1398.860436,-499.764529,1398.075709,...,11.962366,-498.767484,8.096893,-500.827423,8.071056,-500.801673,4.905125,-499.82851,24.994862,-500.622559
3,2016-09-01 03:59:59,215.336236,12.006805,7.64,0.863514,1399.618111,-498.863574,1397.44012,-499.211024,1400.129303,...,12.033091,-498.350935,8.074946,-499.474407,7.897085,-500.868509,4.9314,-499.963623,24.948919,-498.709987
4,2016-09-01 04:59:59,199.099327,10.68253,7.53,0.805575,1401.268123,-500.808305,1398.128818,-499.504543,1402.172226,...,12.025367,-500.786497,8.054678,-500.3975,8.10789,-509.526725,4.957674,-500.360026,25.003331,-500.856333



Tail:


Unnamed: 0,date,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
5851,2017-12-31 19:59:59,173.957757,15.963399,8.07,0.896701,1401.930554,-499.728848,1401.441445,-499.193423,1399.810313,...,13.995957,-500.157454,12.069155,-499.673279,7.977259,-499.516126,5.933319,-499.965973,8.987171,-499.755909
5852,2017-12-31 20:59:59,172.91027,16.002605,8.07,0.896519,1447.075722,-494.716823,1448.851892,-465.963026,1443.890424,...,16.749781,-496.031539,13.365371,-499.122723,9.288553,-496.892967,7.372897,-499.942956,8.986832,-499.903761
5853,2017-12-31 21:59:59,171.135718,15.993669,8.07,1.165996,1498.836182,-501.770403,1499.572353,-495.516347,1502.749213,...,19.99413,-499.791312,15.101425,-499.936252,10.989181,-498.347898,9.020944,-500.040448,8.982038,-497.789882
5854,2017-12-31 22:59:59,179.697158,15.438979,8.07,1.501068,1498.466243,-500.483984,1497.986986,-519.20034,1496.569047,...,19.95876,-499.95875,15.026853,-499.723143,11.011607,-499.985046,9.009783,-499.937902,9.01266,-500.154284
5855,2017-12-31 23:59:59,181.556856,14.99585,8.07,1.623454,1498.096303,-499.796922,1501.743791,-505.146931,1499.535978,...,20.034715,-500.728588,14.914199,-499.948518,10.986607,-500.658027,8.989497,-500.337588,8.988632,-500.764937



Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5856 entries, 0 to 5855
Data columns (total 53 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   date                                        5856 non-null   object 
 1   primary_cleaner.input.sulfate               5554 non-null   float64
 2   primary_cleaner.input.depressant            5572 non-null   float64
 3   primary_cleaner.input.feed_size             5856 non-null   float64
 4   primary_cleaner.input.xanthate              5690 non-null   float64
 5   primary_cleaner.state.floatbank8_a_air      5840 non-null   float64
 6   primary_cleaner.state.floatbank8_a_level    5840 non-null   float64
 7   primary_cleaner.state.floatbank8_b_air      5840 non-null   float64
 8   primary_cleaner.state.floatbank8_b_level    5840 non-null   float64
 9   primary_cleaner.state.floatbank8_c_air      5840 non-null   float64
 10  

None


Describe:



Unnamed: 0,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,primary_cleaner.state.floatbank8_c_level,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,5554.0,5572.0,5856.0,5690.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,...,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0
mean,170.515243,8.482873,7.264651,1.32142,1481.990241,-509.057796,1486.90867,-511.743956,1468.495216,-509.741212,...,15.636031,-516.266074,13.145702,-476.338907,12.308967,-512.208126,9.470986,-505.017827,16.678722,-512.351694
std,49.608602,3.353105,0.611526,0.693246,310.453166,61.339256,313.224286,67.139074,309.980748,62.671873,...,4.660835,62.756748,4.304086,105.549424,3.762827,58.864651,3.312471,68.785898,5.404514,69.919839
min,0.000103,3.1e-05,5.65,3e-06,0.0,-799.773788,0.0,-800.029078,0.0,-799.995127,...,0.0,-799.798523,0.0,-800.836914,-0.223393,-799.661076,0.528083,-800.220337,-0.079426,-809.859706
25%,143.340022,6.4115,6.885625,0.888769,1497.190681,-500.455211,1497.150234,-500.936639,1437.050321,-501.300441,...,12.057838,-501.054741,11.880119,-500.419113,10.123459,-500.879383,7.991208,-500.223089,13.012422,-500.833821
50%,176.103893,8.023252,7.259333,1.183362,1554.659783,-499.997402,1553.268084,-500.066588,1546.160672,-500.079537,...,17.001867,-500.160145,14.952102,-499.644328,12.062877,-500.047621,9.980774,-500.001338,16.007242,-500.041085
75%,207.240761,10.017725,7.65,1.763797,1601.681656,-499.575313,1601.784707,-499.323361,1600.785573,-499.009545,...,18.030985,-499.441529,15.940011,-401.523664,15.017881,-499.297033,11.992176,-499.722835,21.009076,-499.395621
max,274.409626,40.024582,15.5,5.433169,2212.43209,-57.195404,1975.147923,-142.527229,1715.053773,-150.937035,...,30.051797,-401.565212,31.269706,-6.506986,25.258848,-244.483566,14.090194,-126.463446,26.705889,-29.093593



Describe include: all :



Unnamed: 0,date,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
count,5856,5554.0,5572.0,5856.0,5690.0,5840.0,5840.0,5840.0,5840.0,5840.0,...,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0,5840.0
unique,5856,,,,,,,,,,...,,,,,,,,,,
top,2016-09-01 00:59:59,,,,,,,,,,...,,,,,,,,,,
freq,1,,,,,,,,,,...,,,,,,,,,,
mean,,170.515243,8.482873,7.264651,1.32142,1481.990241,-509.057796,1486.90867,-511.743956,1468.495216,...,15.636031,-516.266074,13.145702,-476.338907,12.308967,-512.208126,9.470986,-505.017827,16.678722,-512.351694
std,,49.608602,3.353105,0.611526,0.693246,310.453166,61.339256,313.224286,67.139074,309.980748,...,4.660835,62.756748,4.304086,105.549424,3.762827,58.864651,3.312471,68.785898,5.404514,69.919839
min,,0.000103,3.1e-05,5.65,3e-06,0.0,-799.773788,0.0,-800.029078,0.0,...,0.0,-799.798523,0.0,-800.836914,-0.223393,-799.661076,0.528083,-800.220337,-0.079426,-809.859706
25%,,143.340022,6.4115,6.885625,0.888769,1497.190681,-500.455211,1497.150234,-500.936639,1437.050321,...,12.057838,-501.054741,11.880119,-500.419113,10.123459,-500.879383,7.991208,-500.223089,13.012422,-500.833821
50%,,176.103893,8.023252,7.259333,1.183362,1554.659783,-499.997402,1553.268084,-500.066588,1546.160672,...,17.001867,-500.160145,14.952102,-499.644328,12.062877,-500.047621,9.980774,-500.001338,16.007242,-500.041085
75%,,207.240761,10.017725,7.65,1.763797,1601.681656,-499.575313,1601.784707,-499.323361,1600.785573,...,18.030985,-499.441529,15.940011,-401.523664,15.017881,-499.297033,11.992176,-499.722835,21.009076,-499.395621



nulls in the columns:

primary_cleaner.input.sulfate has 302 nulls, which is 5.1571% percent of Nulls
primary_cleaner.input.depressant has 284 nulls, which is 4.8497% percent of Nulls
primary_cleaner.input.xanthate has 166 nulls, which is 2.8347% percent of Nulls
primary_cleaner.state.floatbank8_a_air has 16 nulls, which is 0.2732% percent of Nulls
primary_cleaner.state.floatbank8_a_level has 16 nulls, which is 0.2732% percent of Nulls
primary_cleaner.state.floatbank8_b_air has 16 nulls, which is 0.2732% percent of Nulls
primary_cleaner.state.floatbank8_b_level has 16 nulls, which is 0.2732% percent of Nulls
primary_cleaner.state.floatbank8_c_air has 16 nulls, which is 0.2732% percent of Nulls
primary_cleaner.state.floatbank8_c_level has 16 nulls, which is 0.2732% percent of Nulls
primary_cleaner.state.floatbank8_d_air has 16 nulls, which is 0.2732% percent of Nulls
primary_cleaner.state.floatbank8_d_level has 16 nulls, which is 0.2732% percent of Nulls
rougher.input.feed_ag has 1

None


Zeros in the columns:

date 0
primary_cleaner.input.sulfate 0
primary_cleaner.input.depressant 0
primary_cleaner.input.feed_size 0
primary_cleaner.input.xanthate 0
primary_cleaner.state.floatbank8_a_air 93
primary_cleaner.state.floatbank8_a_level 0
primary_cleaner.state.floatbank8_b_air 88
primary_cleaner.state.floatbank8_b_level 0
primary_cleaner.state.floatbank8_c_air 48
primary_cleaner.state.floatbank8_c_level 0
primary_cleaner.state.floatbank8_d_air 48
primary_cleaner.state.floatbank8_d_level 0
rougher.input.feed_ag 369
rougher.input.feed_pb 369
rougher.input.feed_rate 0
rougher.input.feed_size 0
rougher.input.feed_sol 369
rougher.input.feed_au 369
rougher.input.floatbank10_sulfate 0
rougher.input.floatbank10_xanthate 0
rougher.input.floatbank11_sulfate 0
rougher.input.floatbank11_xanthate 0
rougher.state.floatbank10_a_air 0
rougher.state.floatbank10_a_level 0
rougher.state.floatbank10_b_air 0
rougher.state.floatbank10_b_level 0
rougher.state.floatbank10_c_air 0
rougher.state.

None


Shape: (5856, 53)


Duplicated: We have 0 duplicated rows

Dtypes:


date                                           object
primary_cleaner.input.sulfate                 float64
primary_cleaner.input.depressant              float64
primary_cleaner.input.feed_size               float64
primary_cleaner.input.xanthate                float64
primary_cleaner.state.floatbank8_a_air        float64
primary_cleaner.state.floatbank8_a_level      float64
primary_cleaner.state.floatbank8_b_air        float64
primary_cleaner.state.floatbank8_b_level      float64
primary_cleaner.state.floatbank8_c_air        float64
primary_cleaner.state.floatbank8_c_level      float64
primary_cleaner.state.floatbank8_d_air        float64
primary_cleaner.state.floatbank8_d_level      float64
rougher.input.feed_ag                         float64
rougher.input.feed_pb                         float64
rougher.input.feed_rate                       float64
rougher.input.feed_size                       float64
rougher.input.feed_sol                        float64
rougher.input.feed_au       




notes
- There are 53 columns in the data
- All columns are floats except date, which is stored as an object (string)
- Some values are negative. These are readings from level indicators; since a level is measured relative to a reference point, negative values make sense
- There are 5856 observations in the data
- Almost all columns contain NaN values
- Some of the columns contain zeros
- No duplicate rows
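
The NaN and zero counts summarized in these notes can be reproduced with a few pandas calls. The sketch below is a hypothetical stand-in, not the `general_info` helper itself; the frame `df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'a' has one NaN and one zero, column 'b' has one zero
df = pd.DataFrame({'a': [1.0, np.nan, 0.0], 'b': [0.0, 2.0, 3.0]})

summary = pd.DataFrame({
    'na': df.isna().sum(),      # number of missing values per column
    'na%': df.isna().mean(),    # share of missing values per column (a fraction)
    'zeros': (df == 0).sum(),   # number of exact zeros per column
})
print(summary)
```

Here `na%` is a fraction between 0 and 1, matching the convention of the summary table printed above.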

## Check that recovery is calculated correctly. Using the training set, calculate recovery for the rougher.output.recovery feature. Find the MAE between your calculations and the feature values. Provide findings.


![recovery](img/recovery.jpg)

In this section I will:

- Create a new column with my calculation of rougher.output.recovery
- Calculate the MAE between my column (prediction) and the original (target)
- Use only rows without NaN in the relevant columns:
rougher.output.concentrate_au, rougher.input.feed_au, rougher.output.tail_au, rougher.output.recovery


In [9]:
# drop NaN in all relevant columns of the training set for this calculation
# (copy so that a new column can be added later without a SettingWithCopyWarning)
data_train_remove_nan = data_train.dropna(
    subset=['rougher.output.concentrate_au',
    'rougher.input.feed_au',
    'rougher.output.tail_au',
    'rougher.output.recovery']).copy()

In [10]:
data_train_remove_nan[['rougher.output.concentrate_au',
    'rougher.input.feed_au',
    'rougher.output.tail_au',
    'rougher.output.recovery']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14287 entries, 0 to 16859
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   rougher.output.concentrate_au  14287 non-null  float64
 1   rougher.input.feed_au          14287 non-null  float64
 2   rougher.output.tail_au         14287 non-null  float64
 3   rougher.output.recovery        14287 non-null  float64
dtypes: float64(4)
memory usage: 558.1 KB


ok

In [11]:
C = 'rougher.output.concentrate_au'
T = 'rougher.output.tail_au'
F = 'rougher.input.feed_au'


# recovery = C * (F - T) / (F * (C - T)) * 100
c, f, t = (data_train_remove_nan[col] for col in (C, F, T))
data_train_remove_nan['new_rougher.output.recovery'] = c * (f - t) / (f * (c - t)) * 100

In [12]:
mae = mean_absolute_error(y_true=data_train_remove_nan['rougher.output.recovery'],
                    y_pred=data_train_remove_nan['new_rougher.output.recovery'])

In [13]:
print('The MAE for the calculation of rougher recovery is: {:e}' .format(mae))

The MAE for the calculation of rougher recovery is: 9.303416e-15


The calculated recovery matches the recovery recorded in the data: the MAE of about 9.3e-15 is at the level of floating-point rounding error.
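
As implemented in the cell above, the recovery is C * (F - T) / (F * (C - T)) * 100, where C, F and T are the gold concentrations in the rougher concentrate, the feed and the rougher tails. A minimal worked sketch follows; the function name `recovery` and the input values are illustrative, not taken from the dataset.

```python
def recovery(c, f, t):
    """Recovery in percent from concentrate (c), feed (f) and tail (t) gold concentrations."""
    return c * (f - t) / (f * (c - t)) * 100

# Hypothetical values: 20 in the concentrate, 8 in the feed, 2 in the tails
value = recovery(c=20.0, f=8.0, t=2.0)   # 20 * 6 / (8 * 18) * 100
print(round(value, 2))                   # 83.33
```

The formula is undefined when F is zero or when C equals T, so rows like that would need to be excluded before applying it.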

## Analyze the features not available in the test set. What are these parameters? What is their type?

In this section I will:
- Create a data frame, based on the full data, with all the features that are absent from the test set.
- Check the info of this data frame
- Provide findings in text


In [None]:
# list the columns that are missing from data_test, keeping date as the join key

# columns present in data_full but absent from data_test, in their original order
only_missing_columns_list = ['date'] + [
    x for x in data_full.columns if x not in data_test.columns]

In [None]:
# create data frame with only missing column
data_full_34_columns = data_full[only_missing_columns_list]

In [None]:
# add the 34 missing columns to data_test so that all the removed features are back in the test set
data_test_with_missing = data_test.merge(right=data_full_34_columns, how='left', on='date')

# create a df of the test set with only the features that were missing
data_test_only_missing = data_test_with_missing[only_missing_columns_list]

In [None]:
# print the type of the features that are not available in the test set
data_test_only_missing.info()

In [None]:
# print the info on the data_test
data_test.info()

### Notes

- All the parameters are floats
- The removed features include all the calculation features
- The test set contains only the process inputs and the process state parameters, **not any output**: all process outputs were removed from the test set


- The features of the training set and the test set must match to use an ML model, so a decision is needed on how to treat the missing columns. I will train without the features that are missing from the test set. The model will be based on the process input and process state values only!
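
The decision above amounts to restricting the training frame to the columns the test set provides. The sketch below shows that on toy frames; `full`, `test` and their column names are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Toy frames standing in for the full/training data and the test data (hypothetical values)
full = pd.DataFrame({'date': ['d1', 'd2'], 'in_a': [1.0, 2.0], 'out_b': [3.0, 4.0]})
test = pd.DataFrame({'date': ['d1'], 'in_a': [1.0]})

# Columns present in the full data but absent from the test set, in original order
missing = [c for c in full.columns if c not in test.columns]

# Restrict the training frame to the columns that the test set provides
train_features = full[test.columns]

print(missing)
print(list(train_features.columns))
```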


## Perform data preprocessing

### Retain the original data frames, with _raw added to their names

In [None]:
data_full_raw = data_full.copy(deep=True)
data_test_raw = data_test.copy(deep=True)
data_train_raw = data_train.copy(deep=True)

### EDA to determine how to treat the data

In [None]:
# Check how many nan in this set
data_full_na = data_full.isna().sum().to_frame()

fig = px.bar(data_full_na, orientation='h', width=900, height=2000)
fig.show()

There are many missing values in the data. To decide how to treat them I will look at the readings of the features. Producing scatter plots and histograms of all the observations is too heavy, so I downsample: I keep only 1 out of every 100 rows, selected systematically while reading the CSV through the skiprows argument, and build a new data frame from the full CSV this way.

In [None]:
# Return True for every row whose index is not a multiple of 100,
# so read_csv skips all rows except every 100th one (row 0, the header, is kept).
def skip(index):
    return index % 100 != 0

In [None]:
try:
    data_full_downsample = pd.read_csv('gold_recovery_full.csv', skiprows=skip)
except FileNotFoundError:
    data_full_downsample = pd.read_csv('/datasets/gold_recovery_full.csv', skiprows=skip)
    
print('Data has been read correctly!')
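
To see which rows a `skiprows` callable keeps, here is a minimal sketch on an in-memory CSV. The text in `csv_text` is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical CSV: file line 0 is the header, file lines 1..250 hold the values 0..249
csv_text = "value\n" + "\n".join(str(i) for i in range(250))

# skiprows receives the 0-based file line index; returning True skips that line.
# Line 0 (the header) is kept because 0 % 100 == 0.
sample = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % 100 != 0)
print(sample['value'].tolist())   # [99, 199]
```

Because the index counts file lines including the header, the kept data rows are the 100th, 200th and so on, not rows 0, 100, 200.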

In [None]:
# produce a scatter plot and a histogram for roughly every tenth column of the downsampled data
skipping_interval = int(data_full_downsample.shape[1]/10)
for column in data_full_downsample.columns[skipping_interval - 1::skipping_interval]:
    print(f'{column} scatter and histogram')
    px.scatter(data_full_downsample, x='date', y=column).show()
    px.histogram(data_full_downsample, x=column).show()
    print()
    print()

- The values generally represent a continuous process, so if we want to replace NaN values it is better to base the replacement on neighbouring values (for example by interpolation).
- There are some extreme values. My guess is that this is noise coming from the analysers. I will leave them in, since they are part of the process and will also appear in any test set and in future predictions, so the model has to learn how to handle them.
- The values are on different scales, which can hurt scale-sensitive models. Therefore I will scale my features.
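
One common way to scale features is scikit-learn's `StandardScaler`, fitted on the training features only and then applied unchanged to the test features. The sketch below assumes that approach; the arrays `X_train` and `X_test` are made-up values, not columns from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. an air flow and a reagent dose)
X_train = np.array([[1400.0, 0.9], [1500.0, 1.1], [1600.0, 1.3]])
X_test = np.array([[1550.0, 1.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test data

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
```

Fitting the scaler on the training data alone keeps information from the test set out of the model.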

### Add target columns to the test set from the full data set, matched on date

In [None]:
# target columns to add, plus date, which is the join key
recovery_strings = ['date', 'final.output.recovery', 'rougher.output.recovery']

In [None]:
data_test = data_test.merge(right=data_full[recovery_strings], 
                           how='left',
                           on='date')
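
A left join on `date` keeps every test row and attaches the matching targets from the full data. A minimal sketch on toy frames (the values are made up; only the column names follow the dataset):

```python
import pandas as pd

# Hypothetical miniature versions of the test table and the full table
test = pd.DataFrame({
    'date': ['2016-09-01 00:59:59', '2016-09-01 01:59:59'],
    'rougher.input.feed_au': [7.1, 7.3],
})
full = pd.DataFrame({
    'date': ['2016-09-01 00:59:59', '2016-09-01 01:59:59', '2016-09-01 02:59:59'],
    'rougher.output.recovery': [85.0, 86.5, 84.2],
})

# how='left' keeps all rows of `test`; rows of `full` without a match are dropped
merged = test.merge(full, how='left', on='date')
print(merged.shape)   # (2, 3)
```

If a test date had no counterpart in the full data, its target would be NaN after the merge, which is one reason the rows without a target are dropped afterwards.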

### Handling Missing Data

In this part I will remove all observations with no target, because we can't reliably fill them. After that I will forward fill all the other missing values.

In [None]:
# remove all observation with no target
data_full.dropna(subset=['final.output.recovery', 'rougher.output.recovery'], inplace=True)
data_test.dropna(subset=['final.output.recovery', 'rougher.output.recovery'], inplace=True)
data_train.dropna(subset=['final.output.recovery', 'rougher.output.recovery'], inplace=True)

In [None]:
# Remove rows with zero recovery. Zero is not a valid value for the sMAPE calculation
data_full = data_full[
    (data_full['final.output.recovery'] != 0) & (data_full['rougher.output.recovery'] != 0)
]

data_test = data_test[
    (data_test['final.output.recovery'] != 0) & (data_test['rougher.output.recovery'] != 0)
]

data_train = data_train[
    (data_train['final.output.recovery'] != 0) & (data_train['rougher.output.recovery'] != 0)
]


In [None]:
# forward fill the rest of the columns
# (reassign instead of inplace=True: the frames above are filtered slices,
# and fillna(method='ffill') is deprecated in recent pandas versions)
data_full = data_full.ffill()
data_train = data_train.ffill()
data_test = data_test.ffill()
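
Forward fill repeats the last known reading, while linear interpolation draws a straight line across the gap. The sketch below compares the two on a made-up series; interpolation is shown only as the alternative mentioned in the EDA notes, not as what this notebook applies.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sensor readings with a two-hour gap
s = pd.Series([10.0, np.nan, np.nan, 16.0])

forward_filled = s.ffill()       # repeats the last known value: 10, 10, 10, 16
interpolated = s.interpolate()   # linear by default: 10, 12, 14, 16

print(forward_filled.tolist())
print(interpolated.tolist())
```

Forward fill uses only past values, which may matter when the rows are ordered in time.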

### Add  datetime column 

In [None]:
data_train['date_datetime'] = pd.to_datetime(data_train['date'], format='%Y-%m-%d %H:%M:%S') 
data_test['date_datetime'] = pd.to_datetime(data_test['date'], format='%Y-%m-%d %H:%M:%S') 
data_full['date_datetime'] = pd.to_datetime(data_full['date'], format='%Y-%m-%d %H:%M:%S') 

# Analyze the data

## Take note of how the concentrations of metals (Au, Ag, Pb) change depending on the purification stage


In this section I will do the following:
- Plot histogram of the metals concentration with line for each purification stage 
- Explain the graph

In [None]:
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above) 
sns.set(style="darkgrid", rc={'figure.figsize':(14, 6)})

sns.histplot(
    data=data_full, x="rougher.input.feed_ag",
    color='r', label="rougher.input.feed_ag", kde=True);
sns.histplot(
    data=data_full, x="rougher.output.concentrate_ag",
    color='g', label="rougher.output.concentrate_ag", kde=True);
sns.histplot(
    data=data_full, x="primary_cleaner.output.concentrate_ag",
    color='b', label="primary_cleaner.output.concentrate_ag", kde=True);
ax=sns.histplot(
    data=data_full, x="final.output.concentrate_ag",
    color='y', label="final.output.concentrate_ag", kde=True);

plt.legend();
ax.set_title('Ag concentrations',fontsize=20)
ax.set_xlabel("concentrations",fontsize=15)
ax.set_ylabel("counts",fontsize=15)
# aa.set(xlabel='common xlabel', ylabel='common ylabel')
plt.show();


- The Ag concentration increases in the flotation stage.
- In the two later stages the concentration decreases a little.
- Overall the process does not enrich Ag in the main process line, which is expected since the process is not designed to extract Ag.

In [None]:
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above) 
sns.set(style="darkgrid", rc={'figure.figsize':(14, 6)})

sns.histplot(
    data=data_full, x="rougher.input.feed_pb",
    color='r', label="rougher.input.feed_pb", kde=True);
sns.histplot(
    data=data_full, x="rougher.output.concentrate_pb",
    color='g', label="rougher.output.concentrate_pb", kde=True);
sns.histplot(
    data=data_full, x="primary_cleaner.output.concentrate_pb",
    color='b', label="primary_cleaner.output.concentrate_pb", kde=True);
ax=sns.histplot(
    data=data_full, x="final.output.concentrate_pb",
    color='y', label="final.output.concentrate_pb", kde=True);

plt.legend();
ax.set_title('Pb concentrations',fontsize=20)
ax.set_xlabel("concentrations",fontsize=15)
ax.set_ylabel("counts",fontsize=15)
# aa.set(xlabel='common xlabel', ylabel='common ylabel')
plt.show();

- The Pb concentration increases in flotation from ~3 to ~8
- In the last two stages it increases only slightly, to ~10
- This is because the process is not focused on Pb

In [None]:
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above) 
sns.set(style="darkgrid", rc={'figure.figsize':(14, 6)})

sns.histplot(
    data=data_full, x="rougher.input.feed_au",
    color='r', label="rougher.input.feed_au", kde=True);
sns.histplot(
    data=data_full, x="rougher.output.concentrate_au",
    color='g', label="rougher.output.concentrate_au", kde=True);
sns.histplot(
    data=data_full, x="primary_cleaner.output.concentrate_au",
    color='b', label="primary_cleaner.output.concentrate_au", kde=True);
ax=sns.histplot(
    data=data_full, x="final.output.concentrate_au",
    color='y', label="final.output.concentrate_au", kde=True);

plt.legend();
ax.set_title('Au concentrations',fontsize=20)
ax.set_xlabel("concentrations",fontsize=15)
ax.set_ylabel("counts",fontsize=15)
# aa.set(xlabel='common xlabel', ylabel='common ylabel')
plt.show();

- The au (gold) concentration increases in **all** stages. 
- The concentration increased from ~7 to ~20 in flotation stage. 
- In the primary cleaner the concentration increased to ~34.
- In the last stage the concentration increased to ~45!

### Concentrations of metals conclusion
It is clear that the process is selective for gold. The concentrations of the other metals change much less, and in some cases they decrease. This shows that purifying the gold from the ore removes the other metals.
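
The visual reading of the histograms can be cross-checked numerically by comparing the median concentration at each stage. A minimal sketch on made-up values (the numbers are illustrative, not taken from the dataset; only the column names follow it):

```python
import pandas as pd

# Hypothetical Au concentrations at the four stages
df = pd.DataFrame({
    'rougher.input.feed_au': [7.0, 8.0, 7.5],
    'rougher.output.concentrate_au': [19.0, 20.0, 21.0],
    'primary_cleaner.output.concentrate_au': [33.0, 34.0, 35.0],
    'final.output.concentrate_au': [44.0, 45.0, 46.0],
})

stages = list(df.columns)          # already ordered from feed to final concentrate
medians = df[stages].median()      # one median per stage
print(medians)
print(medians.is_monotonic_increasing)
```

The same call with the `_ag` or `_pb` columns would show whether those metals follow a different pattern.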


## Compare the feed particle size distributions in the training set and in the test set. If the distributions vary significantly, the model evaluation will be incorrect.


In this section I will do the following:

- Produce histograms of the particle size distributions in the training set and in the test set.
- Run t-test 
- If distributions are similar, it is safe to use the test to evaluate the trained model


In [None]:
# Get the columns of size distribution
size_columns = [
    s for s in data_test.columns.to_list() if 'size' in s]
size_columns

- I will use only the rougher input feed size, because it is not affected by the process. The goal is only to determine whether something outside the process differed between the periods sampled for the test set and those sampled for the training set.

In [None]:
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above) 
sns.set(style="darkgrid", rc={'figure.figsize':(14, 6)})

sns.histplot(
    data=data_test, x="rougher.input.feed_size",
    color='r', label="data_test_input.feed_size", kde=True);

ax = sns.histplot(
    data=data_train, x="rougher.input.feed_size",
    color='b', label="data_train_input.feed_size", kde=True);

plt.legend();
ax.set_title('feed_size for test and train set',fontsize=20)
ax.set_xlabel("feed_size",fontsize=15)
ax.set_ylabel("counts",fontsize=15)
plt.show();

The particle size distributions are similar, so it is safe to use the test set to evaluate the trained model.
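
The statistical comparison listed in the plan could be sketched with `scipy.stats` as below. The two samples are synthetic stand-ins generated with assumed means and spreads, not the real `rougher.input.feed_size` columns, and the Kolmogorov-Smirnov test is an addition alongside the planned t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two feed_size columns (assumed parameters, not real data)
train_size = rng.normal(loc=58.0, scale=20.0, size=2000)
test_size = rng.normal(loc=56.0, scale=20.0, size=800)

# Welch's t-test compares the means without assuming equal variances
t_stat, t_p = stats.ttest_ind(train_size, test_size, equal_var=False)
# The two-sample Kolmogorov-Smirnov test compares the whole distributions
ks_stat, ks_p = stats.ks_2samp(train_size, test_size)

print(f't-test p-value: {t_p:.4f}, KS p-value: {ks_p:.4f}')
```

With thousands of observations even small differences can yield low p-values, so the test result is best read together with the histograms.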

## Consider the total concentrations of all substances at different stages: raw feed, rougher concentrate, and final concentrate. Do you notice any abnormal values in the total distribution? If you do, is it worth removing such values from both samples? Describe the findings and eliminate anomalies.


To solve this I will do the following:

- Treat Au, Ag and Pb as "all substances".
- Sum their concentrations at each stage to produce one total-concentration column per stage
- Produce a histogram
- Look for abnormalities such as:
 - In later stages the amount increases

In [None]:
# create column  for each of the process stage
rougher_input_feed = [
 'rougher.input.feed_ag',
 'rougher.input.feed_pb',
 'rougher.input.feed_au'
]

rougher_output_concentrate = [
 'rougher.output.concentrate_ag',
 'rougher.output.concentrate_pb',
 'rougher.output.concentrate_au'
]

primary_cleaner_output_concentrate = [
 'primary_cleaner.output.concentrate_ag',
 'primary_cleaner.output.concentrate_pb',
 'primary_cleaner.output.concentrate_au'
]

final_output_concentrate = [
 'final.output.concentrate_ag',
 'final.output.concentrate_pb',
 'final.output.concentrate_au'
]

column_list = [
    'rougher_input_feed',
    'rougher_output_concentrate',
    'primary_cleaner_output_concentrate',
    'final_output_concentrate'
]

In [None]:
# create df for only substance
df_for_ag_au_pb_total = data_full[[
 'final.output.concentrate_ag',
 'final.output.concentrate_pb',
 'final.output.concentrate_au',
 'primary_cleaner.output.concentrate_ag',
 'primary_cleaner.output.concentrate_pb',
 'primary_cleaner.output.concentrate_au',
 'rougher.input.feed_ag',
 'rougher.input.feed_pb',
 'rougher.input.feed_au',
 'rougher.output.concentrate_ag',
 'rougher.output.concentrate_pb',
 'rougher.output.concentrate_au'
]].copy()  # copy so the new columns below are not written to a view of data_full

In [None]:
# sum concentrations by stage
df_for_ag_au_pb_total['rougher_input_feed'] = df_for_ag_au_pb_total[
    rougher_input_feed].sum(axis=1) 
df_for_ag_au_pb_total['rougher_output_concentrate']= df_for_ag_au_pb_total[
    rougher_output_concentrate].sum(axis=1)
df_for_ag_au_pb_total['primary_cleaner_output_concentrate']= df_for_ag_au_pb_total[
    primary_cleaner_output_concentrate].sum(axis=1)
df_for_ag_au_pb_total['final_output_concentrate']= df_for_ag_au_pb_total[
    final_output_concentrate].sum(axis=1)

In [None]:
# keep only the summed columns
df_for_ag_au_pb_total = df_for_ag_au_pb_total[column_list]
df_for_ag_au_pb_total

In [None]:
# plot histograms

# Group data together
hist_data = [df_for_ag_au_pb_total['rougher_input_feed'],
             df_for_ag_au_pb_total['rougher_output_concentrate'],
             df_for_ag_au_pb_total['primary_cleaner_output_concentrate'],
             df_for_ag_au_pb_total['final_output_concentrate']]
        

group_labels = ['rougher_input_feed',
                'rougher_output_concentrate',
                'primary_cleaner_output_concentrate',
                'final_output_concentrate']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2)
fig.show()

The findings are:
- The total concentration increases from the feed to the output of the primary cleaner and is highest in the final output
- The abnormality is that there are many zero values. They appear in all of the stages, and there is no continuity between the main concentrations and the zero values, so these readings are most likely caused by some kind of measurement error. Since the concentrations are not used as features in our model I will not treat them here; otherwise they should be removed.
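
If the near-zero totals did need to be removed, one possible approach is a boolean mask over the stage totals. This is a sketch under assumptions: the `threshold` value is an arbitrary cut-off that would have to be chosen from the histogram, and the frame `totals` holds made-up numbers.

```python
import pandas as pd

# Hypothetical total concentrations per stage
totals = pd.DataFrame({
    'rougher_input_feed': [20.1, 0.0, 19.5],
    'final_output_concentrate': [60.2, 59.8, 0.0],
})

threshold = 1.0  # assumed cut-off, not derived from the data
# Keep only the rows whose total exceeds the threshold at every stage
is_valid = (totals > threshold).all(axis=1)
cleaned = totals[is_valid]
print(len(cleaned))   # 1
```

The same mask could be applied to the training and test frames, provided it is built from columns available in both.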


# Build the model

## Write a function to calculate the final sMAPE value.

What I will do here is:
- Write a function to calculate the sMAPE for the rougher and the final stage using the predicted recovery and target recovery
- Test that the function works using the test set

In [None]:
def smape(y_true, y_pred):
    return (np.abs(y_true-y_pred)/((np.abs(y_true)+np.abs(y_pred))/2)).mean()
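
To make the metric concrete, here is a small worked sketch that repeats the function above on two made-up observations. Note that the function returns a fraction, so it is multiplied by 100 to read it as a percentage.

```python
import numpy as np

def smape(y_true, y_pred):
    # Mean of |error| divided by the average magnitude of target and prediction
    return (np.abs(y_true - y_pred) / ((np.abs(y_true) + np.abs(y_pred)) / 2)).mean()

# Hypothetical targets and predictions
y_true = np.array([100.0, 50.0])
y_pred = np.array([110.0, 50.0])

value = smape(y_true, y_pred)        # (10 / 105 + 0 / 50) / 2
print(round(value * 100, 2), '%')    # 4.76 %
```

When both the target and the prediction are zero the ratio becomes 0/0 and the result is NaN.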

First we will create features and target sets for the training set and for the test set 

In [None]:
features_train = data_train[data_test.columns]
features_train = features_train.drop(['final.output.recovery',
                                      'rougher.output.recovery',
                                     'date',
                                     'date_datetime'], axis=1).values
target_train = data_train[['rougher.output.recovery', 'final.output.recovery']].values

In [None]:
features_train.shape

In [None]:
target_train.shape

In [None]:
features_test = data_test.drop(['final.output.recovery',
                                      'rougher.output.recovery',
                                     'date',
                                     'date_datetime'], axis=1).values
target_test = data_test[['rougher.output.recovery', 'final.output.recovery']].values

In [None]:
features_test.shape

In [None]:
target_test.shape

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(features_train, target_train)

In [None]:
pred_lr = lr.predict(features_test)

In [None]:
pred_lr[:, 0].shape

In [None]:
smape(target_test[:,0], pred_lr[:,0])

In [None]:
smape(target_test[:,1], pred_lr[:,1])

In [None]:
def smape_final(y_true,y_pred):
    smape_out_rougher = smape(y_true[:,0], y_pred[:,0])
    smape_out_final = smape(y_true[:,1], y_pred[:,1])
    return 0.25*smape_out_rougher + 0.75*smape_out_final

In [None]:
smape_final(target_test, pred_lr)

## Train different models. Evaluate them using cross-validation. Pick the best model and test it using the test sample. Provide findings

In [None]:
# define scorer
smape_score = make_scorer(smape_final, greater_is_better=False)
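
scikit-learn scorers follow a higher-is-better convention, so with `greater_is_better=False` the scorer returns the negated loss. The sketch below illustrates the sign flip on made-up two-target data, using `DummyRegressor` only as a placeholder model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    return (np.abs(y_true - y_pred) / ((np.abs(y_true) + np.abs(y_pred)) / 2)).mean()

def smape_final(y_true, y_pred):
    # Column 0: rougher recovery, column 1: final recovery
    return 0.25 * smape(y_true[:, 0], y_pred[:, 0]) + 0.75 * smape(y_true[:, 1], y_pred[:, 1])

# Hypothetical features and two-column targets
X = np.zeros((4, 1))
y = np.array([[80.0, 60.0], [90.0, 70.0], [85.0, 65.0], [95.0, 75.0]])

model = DummyRegressor(strategy='mean').fit(X, y)
scorer = make_scorer(smape_final, greater_is_better=False)

raw = smape_final(y, model.predict(X))   # the positive loss
score = scorer(model, X, y)              # the same value with the sign flipped
print(raw, score)
```

This is why values read back from a grid search that uses such a scorer have to be negated to recover the sMAPE.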

Do GridSearchCV for LinearRegression

In [None]:
lr.get_params()

In [None]:
parameters = {'fit_intercept': [True, False]
             }

In [None]:
lr = LinearRegression()

In [None]:
# Run GridSearchCV for a model and return the sMAPE scores and the best parameters
def gs_evaluate(model, params):
    gs = GridSearchCV(model, param_grid=params, cv=5, scoring=smape_score, verbose=0, refit=True)
    gs.fit(features_train, target_train)
    # The scorer returns negated sMAPE, so flip the sign back
    best_score = -gs.best_score_
    # refit=True: the best model is refitted on the whole training set before scoring
    score_train = -gs.score(features_train, target_train)
    score_test = -gs.score(features_test, target_test)
    best_params = gs.best_params_
    return best_score, score_train, score_test, best_params

In [None]:
best_score_lr, score_train_lr, score_test_lr, best_params_lr = gs_evaluate(lr, parameters)

In [None]:
print('For Linear Regression the best sMAPE in the cross-validation is {:0.3f}, \n\
the sMAPE on the full training set is {:0.3f}, \n\
the sMAPE on the test set is {:0.3f} \n\
and the parameters obtained from the GridSearchCV are: \n{}'.format(
    best_score_lr, score_train_lr, score_test_lr, best_params_lr
))

Do GridSearchCV for DecisionTreeRegressor

In [None]:
dtr = DecisionTreeRegressor()

In [None]:
dtr.get_params()

In [None]:
parameters = {'max_depth': range(1, 4),
              'min_impurity_decrease': np.arange(.01, .20, .05),
              'min_samples_split': range(2, 6, 2)}

In [None]:
dtr = DecisionTreeRegressor(random_state=12345)

In [None]:
best_score_dtr, score_train_dtr, score_test_dtr, best_params_dtr = gs_evaluate(dtr, parameters)

In [None]:
print('For Decision Tree Regressor the best sMAPE in the cross-validation is {:0.3f}, \n\
the sMAPE on the full training set is {:0.3f}, \n\
the sMAPE on the test set is {:0.3f} \n\
and the parameters obtained from the GridSearchCV are: \n{}'.format(
    best_score_dtr, score_train_dtr, score_test_dtr, best_params_dtr
))

It is strange that GridSearchCV chose 'max_depth': 1. Yet the result is a little better than that of the linear regression.
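
One way to look into a surprising choice like this is to compare the cross-validation scores of all parameter combinations, not only the winner. The sketch below does this on synthetic data with the built-in `'neg_mean_absolute_error'` scoring. The data and the reduced grid are assumptions made only so that the block runs on its own; they say nothing about the project results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only.
rng = np.random.RandomState(12345)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=200)

gs = GridSearchCV(
    DecisionTreeRegressor(random_state=12345),
    param_grid={'max_depth': range(1, 4)},
    cv=5,
    scoring='neg_mean_absolute_error',
)
gs.fit(X, y)

# One row per parameter combination, with the mean and spread across folds.
results = pd.DataFrame(gs.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)
```

If the `mean_test_score` values differ by less than `std_test_score`, the winning depth is mostly a matter of fold-to-fold noise rather than a real preference for a shallow tree.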

Do GridSearchCV for RandomForestRegressor

In [None]:
rfr = RandomForestRegressor(random_state=12345)

In [None]:
rfr.get_params()

In [None]:
parameters = {'n_estimators': range(20, 41, 10),
              'max_depth': range(2, 5, 1)}

In [None]:
best_score_rfr, score_train_rfr, score_test_rfr, best_params_rfr = gs_evaluate(rfr, parameters)

In [None]:
print('For Random Forest Regressor the best sMAPE in the cross-validation is {:0.3f}, \n\
the sMAPE on the full training set is {:0.3f}, \n\
the sMAPE on the test set is {:0.3f} \n\
and the parameters obtained from the GridSearchCV are: \n{}'.format(
    best_score_rfr, score_train_rfr, score_test_rfr, best_params_rfr
))

Comparison with a simple baseline

In [None]:
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(features_train, target_train)
dummy_pred = dummy_regr.predict(features_test)

In [None]:
smape_dummy = smape_final(target_test, dummy_pred)

In [None]:
print('The dummy sMAPE is {:0.3f}'.format(smape_dummy))
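
To make clear what this baseline actually predicts with a two-column target, here is a minimal sketch on made-up numbers. `DummyRegressor(strategy="mean")` ignores the features and returns the mean of each target column from the training data for every row.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy two-column target, made up for illustration.
X_train = np.zeros((4, 2))
y_train = np.array([[80.0, 60.0],
                    [90.0, 70.0],
                    [85.0, 65.0],
                    [85.0, 65.0]])
X_new = np.ones((3, 2))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Every row gets the training column means: [85.0, 65.0]
baseline_pred = baseline.predict(X_new)
```

Any real model has to beat this constant prediction to show that it has learned something from the features.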

## Conclusion
In this section, after optimizing three models, the best one is the Random Forest Regressor. It gave the same sMAPE score on the test set as the Decision Tree Regressor, but on the training set it performed slightly better. Meanwhile, the linear regression showed results almost the same as those of the dummy regressor. We have to admit that even our best result was not much better than the dummy baseline. Probably, in this case, the mean value of the target is already a good approximation of the value to predict, since the chemical separation process is quite stable.