# Integrated Project 2

### Author: Jinyu Du

## Project description

Prepare a prototype of a machine learning model for Zyfra. The company develops efficiency solutions for heavy industry.

The model should predict the amount of gold recovered from gold ore. You have the data on extraction and purification.

The model will help to optimize the production and eliminate unprofitable parameters.

You need to:

1. Prepare the data;
2. Perform data analysis;
3. Develop and train a model.

To complete the project, you may want to use documentation from *pandas*, *matplotlib*, and *sklearn.*

## Data preparation

### Initialization

In [36]:
# Loading all the libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.utils import shuffle
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score


from sklearn.metrics import mean_absolute_error

### Load and study the datasets

Let's load the datasets.

In [37]:
try:
    # Relative paths for a local copy of the data
    full_data = pd.read_csv('datasets/gold_recovery_full.csv')
    train_data = pd.read_csv('datasets/gold_recovery_train.csv')
    test_data = pd.read_csv('datasets/gold_recovery_test.csv')
except FileNotFoundError:
    # Fall back to the absolute paths when the relative ones do not exist
    full_data = pd.read_csv('/datasets/gold_recovery_full.csv')
    train_data = pd.read_csv('/datasets/gold_recovery_train.csv')
    test_data = pd.read_csv('/datasets/gold_recovery_test.csv')
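
The fallback above switches all three files at once. A minimal sketch of looking each file up independently, assuming a hypothetical helper `read_first_available` (not part of the original notebook) and the same two directories:

```python
from pathlib import Path

import pandas as pd


def read_first_available(filename, directories=("datasets", "/datasets")):
    """Hypothetical helper: read the CSV from the first directory that contains it."""
    for directory in directories:
        path = Path(directory) / filename
        if path.exists():
            return pd.read_csv(path)
    raise FileNotFoundError(f"{filename} not found in any of {directories}")
```

With this helper, a call such as `read_first_available('gold_recovery_train.csv')` would replace each pair of `pd.read_csv` lines.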
    

Let's look at the general information of the datasets. First, let's look at `full_data`.

In [38]:
full_data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22716 entries, 0 to 22715
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                22716 non-null  object 
 1   final.output.concentrate_ag                         22627 non-null  float64
 2   final.output.concentrate_pb                         22629 non-null  float64
 3   final.output.concentrate_sol                        22331 non-null  float64
 4   final.output.concentrate_au                         22630 non-null  float64
 5   final.output.recovery                               20753 non-null  float64
 6   final.output.tail_ag                                22633 non-null  float64
 7   final.output.tail_pb                                22516 non-null  float64
 8   final.output.tail_sol                               22445 non-null  float64


In [39]:
display(full_data.head(10))

Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-01-15 00:00:00,6.055403,9.889648,5.507324,42.19202,70.541216,10.411962,0.895447,16.904297,2.143149,...,14.016835,-502.488007,12.099931,-504.715942,9.925633,-498.310211,8.079666,-500.470978,14.151341,-605.84198
1,2016-01-15 01:00:00,6.029369,9.968944,5.257781,42.701629,69.266198,10.462676,0.927452,16.634514,2.22493,...,13.992281,-505.503262,11.950531,-501.331529,10.039245,-500.169983,7.984757,-500.582168,13.998353,-599.787184
2,2016-01-15 02:00:00,6.055926,10.213995,5.383759,42.657501,68.116445,10.507046,0.953716,16.208849,2.257889,...,14.015015,-502.520901,11.912783,-501.133383,10.070913,-500.129135,8.013877,-500.517572,14.028663,-601.427363
3,2016-01-15 03:00:00,6.047977,9.977019,4.858634,42.689819,68.347543,10.422762,0.883763,16.532835,2.146849,...,14.03651,-500.857308,11.99955,-501.193686,9.970366,-499.20164,7.977324,-500.255908,14.005551,-599.996129
4,2016-01-15 04:00:00,6.148599,10.142511,4.939416,42.774141,66.927016,10.360302,0.792826,16.525686,2.055292,...,14.027298,-499.838632,11.95307,-501.053894,9.925709,-501.686727,7.894242,-500.356035,13.996647,-601.496691
5,2016-01-15 05:00:00,6.482968,10.049416,5.480257,41.633678,69.465816,10.182708,0.664118,16.999638,1.918586,...,13.938497,-500.970168,11.88335,-500.395298,10.054147,-496.374715,7.965083,-499.364752,14.017067,-599.707915
6,2016-01-15 06:00:00,6.533849,10.058141,4.5691,41.995316,69.300835,10.304598,0.807342,16.723575,2.058913,...,14.046819,-500.971133,12.091543,-500.501426,10.003247,-497.08318,8.01089,-500.002423,14.029649,-600.90547
7,2016-01-15 07:00:00,6.130823,9.935481,4.389813,42.452727,70.230976,10.443288,0.949346,16.689959,2.143437,...,13.974691,-501.819696,12.101324,-500.583446,9.873169,-499.171928,7.993381,-499.794518,13.984498,-600.41107
8,2016-01-15 08:00:00,5.83414,10.071156,4.876389,43.404078,69.688595,10.42014,1.065453,17.201948,2.209881,...,13.96403,-504.25245,12.060738,-501.174549,10.033838,-501.178133,7.881604,-499.729434,13.967135,-599.061188
9,2016-01-15 09:00:00,5.687063,9.980404,5.282514,43.23522,70.279619,10.487013,1.159805,17.483979,2.209593,...,13.989632,-503.195299,12.052233,-500.928547,9.962574,-502.986357,7.979219,-500.146835,13.981614,-598.070855


In [40]:
display(full_data.tail(30))

Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
22686,2018-08-17 05:59:59,3.575851,11.907212,6.802939,45.44444,72.666035,9.478193,3.057374,9.7978,1.731699,...,23.02199,-501.491291,20.081289,-500.659152,17.98541,-500.023766,13.016793,-499.96802,20.010051,-500.549456
22687,2018-08-17 06:59:59,3.676767,11.811546,6.701614,45.626776,70.12452,10.101142,3.458313,9.202298,1.985746,...,23.001575,-500.73139,19.995845,-499.827275,18.021047,-499.853567,12.971879,-500.117227,19.998784,-500.926436
22688,2018-08-17 07:59:59,3.571477,11.612422,6.81993,45.868268,66.461188,10.346141,3.447106,9.025933,2.037761,...,23.007388,-500.709707,20.063421,-500.286289,18.033292,-498.564918,12.961688,-499.93024,20.002626,-498.54153
22689,2018-08-17 08:59:59,3.651242,11.546264,6.896616,45.942074,66.417292,10.080692,3.10999,9.244794,1.941038,...,23.004905,-500.241069,20.019732,-499.381247,17.96455,-500.668205,13.006377,-500.012333,19.986553,-499.845622
22690,2018-08-17 09:59:59,3.991065,11.599666,6.905807,45.073197,69.884868,9.654578,2.835589,9.375307,1.713933,...,23.005506,-500.379181,20.016167,-500.494647,18.001244,-500.340417,12.987431,-499.878309,19.971031,-500.525421
22691,2018-08-17 10:59:59,4.894652,12.713174,6.660313,41.92418,72.346428,9.274585,2.773309,9.499506,1.593406,...,22.981693,-500.375436,20.045049,-499.630463,18.008026,-500.690545,12.99499,-500.075004,19.987483,-501.548787
22692,2018-08-17 11:59:59,4.827192,12.974069,6.641652,41.821659,72.75445,9.504075,3.028277,9.111587,1.63285,...,23.024332,-501.673294,20.05296,-500.448548,17.997242,-500.720112,13.010501,-499.626615,19.995955,-502.213469
22693,2018-08-17 12:59:59,4.51513,12.868653,6.579353,42.645208,72.248239,9.501524,3.189444,9.023446,1.696612,...,22.994118,-500.717182,19.98087,-499.580871,18.012514,-500.06629,12.999451,-500.008828,20.015172,-501.271089
22694,2018-08-17 13:59:59,4.108712,12.666388,6.563593,43.687607,72.582287,9.32,3.27772,9.397333,1.67807,...,23.014369,-500.212424,19.992876,-499.888065,17.977496,-499.967811,12.954171,-500.260455,19.983515,-503.243695
22695,2018-08-17 14:59:59,3.861031,12.441737,6.667624,44.339835,72.001394,8.752653,3.258726,10.303485,1.742921,...,23.025088,-501.311668,20.031273,-500.67443,17.953446,-500.681679,13.013927,-500.258257,20.012671,-505.750254


`full_data` has 22716 rows and 87 columns, and some values are missing. The column `rougher.output.recovery` has the highest share of missing values: 13.73%.

In [41]:
full_data_missing_perct = (full_data.isna().sum())/full_data.shape[0]*100
print(full_data_missing_perct.sort_values(ascending=False))

rougher.output.recovery                     13.730410
rougher.output.tail_ag                      12.048776
rougher.output.tail_au                      12.044374
rougher.output.tail_sol                     12.044374
rougher.input.floatbank11_xanthate           9.935728
                                              ...    
primary_cleaner.state.floatbank8_b_level     0.189294
primary_cleaner.state.floatbank8_c_level     0.189294
primary_cleaner.state.floatbank8_d_level     0.189294
primary_cleaner.input.feed_size              0.000000
date                                         0.000000
Length: 87, dtype: float64
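
The percentage of missing values per column can also be written as the mean of the boolean mask. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration) showing that both forms agree:

```python
import numpy as np
import pandas as pd

# Toy data: column "a" is half missing, column "b" is complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# Form used in this notebook: count of NaN divided by the number of rows.
pct_from_sum = toy.isna().sum() / toy.shape[0] * 100

# Equivalent form: the mean of a boolean mask is the share of True values.
pct_from_mean = toy.isna().mean() * 100

print(pct_from_mean.sort_values(ascending=False))
```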


Next, let's look at `train_data`.

In [42]:
train_data.info() 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                16860 non-null  object 
 1   final.output.concentrate_ag                         16788 non-null  float64
 2   final.output.concentrate_pb                         16788 non-null  float64
 3   final.output.concentrate_sol                        16490 non-null  float64
 4   final.output.concentrate_au                         16789 non-null  float64
 5   final.output.recovery                               15339 non-null  float64
 6   final.output.tail_ag                                16794 non-null  float64
 7   final.output.tail_pb                                16677 non-null  float64
 8   final.output.tail_sol                               16715 non-null  float64


In [43]:
display(train_data.head(10))

Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,1/15/16 0:00,6.055403,9.889648,5.507324,42.19202,70.541216,10.411962,0.895447,16.904297,2.143149,...,14.016835,-502.488007,12.099931,-504.715942,9.925633,-498.310211,8.079666,-500.470978,14.151341,-605.84198
1,1/15/16 1:00,6.029369,9.968944,5.257781,42.701629,69.266198,10.462676,0.927452,16.634514,2.22493,...,13.992281,-505.503262,11.950531,-501.331529,10.039245,-500.169983,7.984757,-500.582168,13.998353,-599.787184
2,1/15/16 2:00,6.055926,10.213995,5.383759,42.657501,68.116445,10.507046,0.953716,16.208849,2.257889,...,14.015015,-502.520901,11.912783,-501.133383,10.070913,-500.129135,8.013877,-500.517572,14.028663,-601.427363
3,1/15/16 3:00,6.047977,9.977019,4.858634,42.689819,68.347543,10.422762,0.883763,16.532835,2.146849,...,14.03651,-500.857308,11.99955,-501.193686,9.970366,-499.20164,7.977324,-500.255908,14.005551,-599.996129
4,1/15/16 4:00,6.148599,10.142511,4.939416,42.774141,66.927016,10.360302,0.792826,16.525686,2.055292,...,14.027298,-499.838632,11.95307,-501.053894,9.925709,-501.686727,7.894242,-500.356035,13.996647,-601.496691
5,1/15/16 5:00,6.482968,10.049416,5.480257,41.633678,69.465816,10.182708,0.664118,16.999638,1.918586,...,13.938497,-500.970168,11.88335,-500.395298,10.054147,-496.374715,7.965083,-499.364752,14.017067,-599.707915
6,1/15/16 6:00,6.533849,10.058141,4.5691,41.995316,69.300835,10.304598,0.807342,16.723575,2.058913,...,14.046819,-500.971133,12.091543,-500.501426,10.003247,-497.08318,8.01089,-500.002423,14.029649,-600.90547
7,1/15/16 7:00,6.130823,9.935481,4.389813,42.452727,70.230976,10.443288,0.949346,16.689959,2.143437,...,13.974691,-501.819696,12.101324,-500.583446,9.873169,-499.171928,7.993381,-499.794518,13.984498,-600.41107
8,1/15/16 8:00,5.83414,10.071156,4.876389,43.404078,69.688595,10.42014,1.065453,17.201948,2.209881,...,13.96403,-504.25245,12.060738,-501.174549,10.033838,-501.178133,7.881604,-499.729434,13.967135,-599.061188
9,1/15/16 9:00,5.687063,9.980404,5.282514,43.23522,70.279619,10.487013,1.159805,17.483979,2.209593,...,13.989632,-503.195299,12.052233,-500.928547,9.962574,-502.986357,7.979219,-500.146835,13.981614,-598.070855


`train_data` has 16860 rows and 87 columns, and some values are missing. The column `rougher.output.recovery` has the highest share of missing values: 15.26%.

In [44]:
train_data_missing_perct = (train_data.isna().sum())/train_data.shape[0]*100
print(train_data_missing_perct.sort_values(ascending=False))

rougher.output.recovery                               15.260973
rougher.output.tail_ag                                13.345196
rougher.output.tail_sol                               13.339265
rougher.output.tail_au                                13.339265
secondary_cleaner.output.tail_sol                     11.779359
                                                        ...    
primary_cleaner.state.floatbank8_d_level               0.160142
rougher.calculation.floatbank10_sulfate_to_au_feed     0.160142
rougher.calculation.floatbank11_sulfate_to_au_feed     0.160142
primary_cleaner.input.feed_size                        0.000000
date                                                   0.000000
Length: 87, dtype: float64
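
The `head` outputs above print the `date` column as `1/15/16 0:00` for `train_data` but as `2016-01-15 00:00:00` for `full_data`, and `info()` reports it as `object` in both. If the dates are ever needed to match rows across the files, they would have to be parsed first. A minimal sketch, assuming the two format strings below fit every row (only the printed rows were checked):

```python
import pandas as pd

# Sample strings copied from the two printed outputs.
train_style = pd.Series(["1/15/16 0:00", "1/15/16 1:00"])
full_style = pd.Series(["2016-01-15 00:00:00", "2016-01-15 01:00:00"])

# Assumed formats: month/day/two-digit year for the training file,
# ISO-like year-month-day for the full file.
train_dates = pd.to_datetime(train_style, format="%m/%d/%y %H:%M")
full_dates = pd.to_datetime(full_style, format="%Y-%m-%d %H:%M:%S")

print(train_dates.tolist() == full_dates.tolist())
```

Note that the training format carries no seconds, so a stamp such as `8/18/18 6:59` would parse to `06:59:00` rather than `06:59:59`; flooring both sides to the minute may be needed before comparing.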


Next, let's look at `test_data`.

In [45]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5856 entries, 0 to 5855
Data columns (total 53 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   date                                        5856 non-null   object 
 1   primary_cleaner.input.sulfate               5554 non-null   float64
 2   primary_cleaner.input.depressant            5572 non-null   float64
 3   primary_cleaner.input.feed_size             5856 non-null   float64
 4   primary_cleaner.input.xanthate              5690 non-null   float64
 5   primary_cleaner.state.floatbank8_a_air      5840 non-null   float64
 6   primary_cleaner.state.floatbank8_a_level    5840 non-null   float64
 7   primary_cleaner.state.floatbank8_b_air      5840 non-null   float64
 8   primary_cleaner.state.floatbank8_b_level    5840 non-null   float64
 9   primary_cleaner.state.floatbank8_c_air      5840 non-null   float64
 10  primary_clea

In [46]:
display(test_data.head(10))

Unnamed: 0,date,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,2016-09-01 00:59:59,210.800909,14.993118,8.08,1.005021,1398.981301,-500.225577,1399.144926,-499.919735,1400.102998,...,12.023554,-497.795834,8.016656,-501.289139,7.946562,-432.31785,4.872511,-500.037437,26.705889,-499.709414
1,2016-09-01 01:59:59,215.392455,14.987471,8.08,0.990469,1398.777912,-500.057435,1398.055362,-499.778182,1396.151033,...,12.05814,-498.695773,8.130979,-499.634209,7.95827,-525.839648,4.87885,-500.162375,25.01994,-499.819438
2,2016-09-01 02:59:59,215.259946,12.884934,7.786667,0.996043,1398.493666,-500.86836,1398.860436,-499.764529,1398.075709,...,11.962366,-498.767484,8.096893,-500.827423,8.071056,-500.801673,4.905125,-499.82851,24.994862,-500.622559
3,2016-09-01 03:59:59,215.336236,12.006805,7.64,0.863514,1399.618111,-498.863574,1397.44012,-499.211024,1400.129303,...,12.033091,-498.350935,8.074946,-499.474407,7.897085,-500.868509,4.9314,-499.963623,24.948919,-498.709987
4,2016-09-01 04:59:59,199.099327,10.68253,7.53,0.805575,1401.268123,-500.808305,1398.128818,-499.504543,1402.172226,...,12.025367,-500.786497,8.054678,-500.3975,8.10789,-509.526725,4.957674,-500.360026,25.003331,-500.856333
5,2016-09-01 05:59:59,168.485085,8.817007,7.42,0.791191,1402.826803,-499.299521,1401.511119,-499.205357,1404.088107,...,12.029797,-499.814895,8.036586,-500.371492,8.041446,-510.037054,4.983949,-499.99099,24.978973,-500.47564
6,2016-09-01 06:59:59,144.13344,7.92461,7.42,0.788838,1398.252401,-499.748672,1393.255503,-499.19538,1396.738566,...,12.026296,-499.473127,8.027984,-500.983079,7.90734,-507.964971,5.010224,-500.043697,25.040709,-499.501984
7,2016-09-01 07:59:59,133.513396,8.055252,6.988,0.801871,1401.669677,-501.777839,1400.754446,-502.514024,1400.465244,...,12.040911,-501.293852,8.02049,-499.185229,8.116897,-511.927561,5.036498,-500.149615,25.03258,-503.970657
8,2016-09-01 08:59:59,133.735356,7.999618,6.935,0.789329,1402.358981,-499.981597,1400.985954,-496.802968,1401.168584,...,11.998184,-499.481608,8.01261,-500.896783,7.974422,-521.199104,5.061599,-499.791519,25.005063,-497.613716
9,2016-09-01 09:59:59,126.961069,8.017856,7.03,0.805298,1400.81612,-499.014158,1399.975401,-499.570552,1401.871924,...,12.040725,-499.987743,7.989503,-499.750625,7.98971,-509.946737,5.068811,-499.2939,24.992741,-499.272255


`test_data` has 5856 rows and 53 columns, and some values are missing. The column `rougher.input.floatbank11_xanthate` has the highest share of missing values: 6.03%.

In [47]:
test_data_missing_perct = (test_data.isna().sum())/test_data.shape[0]*100
print(test_data_missing_perct.sort_values(ascending=False).head(20))

rougher.input.floatbank11_xanthate            6.028005
primary_cleaner.input.sulfate                 5.157104
primary_cleaner.input.depressant              4.849727
rougher.input.floatbank10_sulfate             4.388661
primary_cleaner.input.xanthate                2.834699
rougher.input.floatbank10_xanthate            2.100410
rougher.input.feed_sol                        1.144126
rougher.input.floatbank11_sulfate             0.939208
rougher.input.feed_rate                       0.683060
secondary_cleaner.state.floatbank3_a_air      0.580601
secondary_cleaner.state.floatbank2_b_air      0.392760
rougher.input.feed_size                       0.375683
secondary_cleaner.state.floatbank2_a_air      0.341530
rougher.state.floatbank10_a_air               0.290301
rougher.state.floatbank10_c_air               0.290301
rougher.state.floatbank10_d_air               0.290301
rougher.state.floatbank10_e_air               0.290301
rougher.state.floatbank10_b_air               0.290301
rougher.st

### Check the recovery calculation

#### Calculate the recovery for the `rougher.output.recovery` feature

First, let's calculate the recovery for the `rougher.output.recovery` feature, using the **training set**. 

The following formula describes the recovery of gold from gold ore:

$$ \text{Recovery} = \frac{C\times(F-T)}{F\times(C-T)}\times100\% $$

where:
- C — the share of gold in the concentrate right after flotation (for finding the rougher concentrate recovery)/after purification (for finding the final concentrate recovery)
- F — the share of gold in the feed before flotation (for finding the rougher concentrate recovery)/in the concentrate right after flotation (for finding the final concentrate recovery)
- T — the share of gold in the rougher tails right after flotation (for finding the rougher concentrate recovery)/after purification (for finding the final concentrate recovery)
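
As a quick arithmetic check of the formula above, here is a small sketch with hypothetical shares of gold (C = 20, F = 8, T = 2, chosen only for illustration and not taken from the dataset). The function name `recovery` is also an assumption:

```python
def recovery(c, f, t):
    """Recovery in percent from the shares of gold in the concentrate (c), feed (f) and tails (t)."""
    return c * (f - t) / (f * (c - t)) * 100


# 20 * (8 - 2) / (8 * (20 - 2)) * 100 = 120 / 144 * 100
example = recovery(c=20.0, f=8.0, t=2.0)
print(round(example, 2))  # 83.33

# With no gold left in the tails, everything is recovered.
print(recovery(c=20.0, f=8.0, t=0.0))  # 100.0
```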

In [48]:
### TEST CODE for recovery calculation

# recovery = rougher.output.recovery
# C = rougher.output.concentrate_au
# F = rougher.input.feed_au
# T = rougher.output.tail_au

# C = train_data['rougher.output.concentrate_au'][0]
# C

# F = train_data['rougher.input.feed_au'][0]
# F

# T = train_data['rougher.output.tail_au'][0]
# T

# rougher_output_recovery = C*(F-T)/(F*(C-T))*100
# rougher_output_recovery

In [49]:
def rougher_output_recovery(row):
    """
    Calculate the recovery (in percent) for the
    rougher.output.recovery feature from one row of the dataframe,
    based on the formula above. Return NaN if the denominator is zero.
    """
    C = row['rougher.output.concentrate_au']
    F = row['rougher.input.feed_au']
    T = row['rougher.output.tail_au']

    # Guard against division by zero
    if (F*(C-T)) != 0:
        recovery = C*(F-T)/(F*(C-T))*100
    else:
        recovery = np.nan
    return recovery
    
# rougher_output_recovery(train_data.loc[0])    

train_data['rougher_output_recovery_calc'] = train_data.apply(rougher_output_recovery, axis=1)

In [50]:

def rougher_output_recovery_is_close(row):
    """
    Check whether the recovery computed from the formula is within
    0.001 of the rougher.output.recovery feature in train_data.
    `row` is one row of the dataframe. Return NaN if the feature is missing.
    """
    if pd.notna(row['rougher.output.recovery']):
        if abs(row["rougher_output_recovery_calc"]-row['rougher.output.recovery'])<0.001:
            is_close = True
        else:
            is_close = False
    else:
        is_close = np.nan
    return is_close


In [51]:
train_data["rougher_output_recovery_equal"] = train_data.apply(rougher_output_recovery_is_close, axis=1)
np.nansum(train_data["rougher_output_recovery_equal"])
# ref: https://numpy.org/doc/stable/reference/generated/numpy.nansum.html

14287

In [52]:
train_data["rougher_output_recovery_equal"].isna().sum()

2573

In [53]:
14287+2573

16860

The recovery computed from the formula matches the `rougher.output.recovery` feature in `train_data`. The feature has 2573 missing values; for all 14287 non-missing values, the computed recovery is within 0.001 of the recorded one, a tolerance that allows for rounding errors.
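
The same comparison can be made without a row-by-row `apply`. A minimal sketch on a toy frame, assuming a hypothetical helper `recovery_vectorized` and made-up values; the column names are the ones used above:

```python
import numpy as np
import pandas as pd


def recovery_vectorized(df):
    """Hypothetical helper: rougher recovery in percent for every row at once."""
    c = df["rougher.output.concentrate_au"]
    f = df["rougher.input.feed_au"]
    t = df["rougher.output.tail_au"]
    denominator = f * (c - t)
    # Zero denominators become NaN so the division yields NaN instead of inf.
    return c * (f - t) / denominator.replace(0, np.nan) * 100


toy = pd.DataFrame({
    "rougher.output.concentrate_au": [20.0, 20.0, 5.0],
    "rougher.input.feed_au": [8.0, 0.0, 6.0],
    "rougher.output.tail_au": [2.0, 2.0, np.nan],
    "rougher.output.recovery": [83.3333, np.nan, np.nan],
})
toy["calc"] = recovery_vectorized(toy)

# Compare only the rows where the recorded value is present,
# with the same absolute tolerance of 0.001 and no relative tolerance.
known = toy["rougher.output.recovery"].notna()
matches = np.isclose(
    toy.loc[known, "calc"],
    toy.loc[known, "rougher.output.recovery"],
    rtol=0,
    atol=1e-3,
)
print(matches.sum(), "of", known.sum(), "known rows match")
```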

#### Find the Mean Absolute Error (MAE) 

Next, let's find the MAE between the above `rougher.output.recovery` calculations and the feature values. 

In [54]:

def mae(df):
    """
    Mean absolute error between the calculated recovery and the
    rougher.output.recovery feature, skipping rows where the feature is missing.
    """
    mean_abs_error = 0
    count = 0
    for i in range(len(df)):
        if pd.notna(df.loc[i,'rougher.output.recovery']):
            count += 1
            mean_abs_error += abs(df.loc[i,"rougher_output_recovery_calc"]-df.loc[i,'rougher.output.recovery'])

    return mean_abs_error/count

In [55]:
print("The MAE between my calculations and the feature values is ", mae(train_data))

The MAE between my calculations and the feature values is  4.1207194979061396e-09


Both checks agree: the recovery computed from the formula matches the `rougher.output.recovery` feature in `train_data`, and the MAE between the two is 4.1207194979061396e-09, which is negligible for recovery values expressed in percent. This confirms that the recovery was calculated correctly.
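
The MAE can also be computed without an explicit loop by masking the missing rows first. A minimal sketch with made-up values, assuming a hypothetical helper `masked_mae`; unlike the loop above, it skips a row when either value is missing:

```python
import numpy as np
import pandas as pd


def masked_mae(calculated, recorded):
    """Hypothetical helper: MAE over the rows where both values are present."""
    both_known = calculated.notna() & recorded.notna()
    return (calculated[both_known] - recorded[both_known]).abs().mean()


calculated = pd.Series([70.0, 65.5, np.nan, 80.0])
recorded = pd.Series([70.5, 65.5, 60.0, np.nan])

# Only the first two rows are complete: (0.5 + 0.0) / 2
print(masked_mae(calculated, recorded))  # 0.25
```

The same mask would be needed before calling `mean_absolute_error` from `sklearn.metrics`, which, to my knowledge, rejects input that contains NaN.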


### Analyze the features not available in the test set

Originally, both `full_data` and `train_data` have 87 columns, while `test_data` has only 53. The two helper columns created above, `rougher_output_recovery_calc` and `rougher_output_recovery_equal`, bring `train_data` to 89 columns.

We drop these two helper columns from `train_data` before comparing it with the test set.

In [56]:
train_data = train_data.drop(columns=['rougher_output_recovery_calc', 'rougher_output_recovery_equal'])


Now let's find out which features are not available in the test set and what their types are.

In [57]:
mismatch_columns1 = [x for x in train_data.columns if x not in test_data.columns]
mismatch_columns1

['final.output.concentrate_ag',
 'final.output.concentrate_pb',
 'final.output.concentrate_sol',
 'final.output.concentrate_au',
 'final.output.recovery',
 'final.output.tail_ag',
 'final.output.tail_pb',
 'final.output.tail_sol',
 'final.output.tail_au',
 'primary_cleaner.output.concentrate_ag',
 'primary_cleaner.output.concentrate_pb',
 'primary_cleaner.output.concentrate_sol',
 'primary_cleaner.output.concentrate_au',
 'primary_cleaner.output.tail_ag',
 'primary_cleaner.output.tail_pb',
 'primary_cleaner.output.tail_sol',
 'primary_cleaner.output.tail_au',
 'rougher.calculation.sulfate_to_au_concentrate',
 'rougher.calculation.floatbank10_sulfate_to_au_feed',
 'rougher.calculation.floatbank11_sulfate_to_au_feed',
 'rougher.calculation.au_pb_ratio',
 'rougher.output.concentrate_ag',
 'rougher.output.concentrate_pb',
 'rougher.output.concentrate_sol',
 'rougher.output.concentrate_au',
 'rougher.output.recovery',
 'rougher.output.tail_ag',
 'rougher.output.tail_pb',
 'rougher.output.ta

In [58]:
print(len(mismatch_columns1))

34


Above is a list of the 34 columns that are in `train_data`, but not in the `test_data`. 
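
It can be worth checking the comparison in both directions, to confirm that the test set has no columns that the training set lacks. A minimal sketch on two toy frames (the frames and their column subsets are made up for illustration), using `Index.difference`:

```python
import pandas as pd

train_toy = pd.DataFrame(columns=[
    "date",
    "rougher.input.feed_au",
    "rougher.output.recovery",
    "final.output.recovery",
])
test_toy = pd.DataFrame(columns=["date", "rougher.input.feed_au"])

# Columns present in the training frame but absent from the test frame.
missing_in_test = train_toy.columns.difference(test_toy.columns)

# The reverse direction: columns only the test frame has.
extra_in_test = test_toy.columns.difference(train_toy.columns)

print(list(missing_in_test))
print(len(extra_in_test))
```

Unlike the list comprehension above, `difference` returns the names sorted rather than in their original column order.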

In [59]:
for column in mismatch_columns1:
    print(train_data[column].dtypes)
    
# ref: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html

float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64


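All 34 columns have the `float64` data type. The names visible in the list above are all `output` or `calculation` parameters, and they include the two recovery targets.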
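
The column names appear to follow the pattern `stage.parameter_type.parameter_name` (an inference from the names, not something stated in the data). Under that assumption, a minimal sketch of counting the missing columns by stage and by type, using a handful of names copied from the list above:

```python
from collections import Counter

# A small subset of the mismatched columns, for illustration only.
columns = [
    "final.output.concentrate_ag",
    "final.output.recovery",
    "primary_cleaner.output.tail_au",
    "rougher.calculation.au_pb_ratio",
    "rougher.output.recovery",
]

# First part of the name is the stage, second part is the parameter type.
stage_counts = Counter(name.split(".")[0] for name in columns)
type_counts = Counter(name.split(".")[1] for name in columns)

print(stage_counts)
print(type_counts)
```

Applied to the full `mismatch_columns1` list, the same two counters would summarize all 34 columns.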

### Perform data preprocessing

Because `test_data` lacks 34 of the columns in `train_data`, a model trained on those columns could not be applied to the test set, so they should be dropped from `train_data`. The exceptions are `rougher.output.recovery` and `final.output.recovery`, which must stay because they are the target variables.

In [60]:
targets = ['rougher.output.recovery', 'final.output.recovery']
mismatch_columns1_less_targets = [i for i in mismatch_columns1 if i not in targets]
mismatch_columns1_less_targets

['final.output.concentrate_ag',
 'final.output.concentrate_pb',
 'final.output.concentrate_sol',
 'final.output.concentrate_au',
 'final.output.tail_ag',
 'final.output.tail_pb',
 'final.output.tail_sol',
 'final.output.tail_au',
 'primary_cleaner.output.concentrate_ag',
 'primary_cleaner.output.concentrate_pb',
 'primary_cleaner.output.concentrate_sol',
 'primary_cleaner.output.concentrate_au',
 'primary_cleaner.output.tail_ag',
 'primary_cleaner.output.tail_pb',
 'primary_cleaner.output.tail_sol',
 'primary_cleaner.output.tail_au',
 'rougher.calculation.sulfate_to_au_concentrate',
 'rougher.calculation.floatbank10_sulfate_to_au_feed',
 'rougher.calculation.floatbank11_sulfate_to_au_feed',
 'rougher.calculation.au_pb_ratio',
 'rougher.output.concentrate_ag',
 'rougher.output.concentrate_pb',
 'rougher.output.concentrate_sol',
 'rougher.output.concentrate_au',
 'rougher.output.tail_ag',
 'rougher.output.tail_pb',
 'rougher.output.tail_sol',
 'rougher.output.tail_au',
 'secondary_cleane

In [61]:
type(mismatch_columns1_less_targets)

list

In [62]:
train_data2 = train_data.drop(columns=mismatch_columns1_less_targets)

In [63]:
train_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 55 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   date                                        16860 non-null  object 
 1   final.output.recovery                       15339 non-null  float64
 2   primary_cleaner.input.sulfate               15553 non-null  float64
 3   primary_cleaner.input.depressant            15598 non-null  float64
 4   primary_cleaner.input.feed_size             16860 non-null  float64
 5   primary_cleaner.input.xanthate              15875 non-null  float64
 6   primary_cleaner.state.floatbank8_a_air      16820 non-null  float64
 7   primary_cleaner.state.floatbank8_a_level    16827 non-null  float64
 8   primary_cleaner.state.floatbank8_b_air      16820 non-null  float64
 9   primary_cleaner.state.floatbank8_b_level    16833 non-null  float64
 10  primary_cl

Let's look at the percentage of missing values in `train_data2`. 

In [64]:
train_data2_missing_perct = (train_data2.isna().sum())/train_data2.shape[0]*100
print(train_data2_missing_perct.sort_values(ascending=False))

rougher.output.recovery                       15.260973
rougher.input.floatbank11_xanthate            11.293001
final.output.recovery                          9.021352
primary_cleaner.input.sulfate                  7.752076
primary_cleaner.input.depressant               7.485172
rougher.input.floatbank10_sulfate              6.192171
primary_cleaner.input.xanthate                 5.842230
rougher.input.floatbank11_sulfate              3.695136
rougher.state.floatbank10_e_air                3.576512
rougher.input.feed_rate                        3.042705
rougher.input.feed_size                        2.473310
secondary_cleaner.state.floatbank2_a_air       2.153025
rougher.input.floatbank10_xanthate             2.052195
rougher.input.feed_sol                         1.731910
rougher.input.feed_pb                          1.352313
secondary_cleaner.state.floatbank2_b_air       0.919336
secondary_cleaner.state.floatbank4_a_air       0.765125
secondary_cleaner.state.floatbank4_a_level     0

In [65]:
train_data2 = train_data2.dropna()
train_data2

Unnamed: 0,date,final.output.recovery,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
0,1/15/16 0:00,70.541216,127.092003,10.128295,7.25,0.988759,1549.775757,-498.912140,1551.434204,-516.403442,...,14.016835,-502.488007,12.099931,-504.715942,9.925633,-498.310211,8.079666,-500.470978,14.151341,-605.841980
1,1/15/16 1:00,69.266198,125.629232,10.296251,7.25,1.002663,1576.166671,-500.904965,1575.950626,-499.865889,...,13.992281,-505.503262,11.950531,-501.331529,10.039245,-500.169983,7.984757,-500.582168,13.998353,-599.787184
2,1/15/16 2:00,68.116445,123.819808,11.316280,7.25,0.991265,1601.556163,-499.997791,1600.386685,-500.607762,...,14.015015,-502.520901,11.912783,-501.133383,10.070913,-500.129135,8.013877,-500.517572,14.028663,-601.427363
3,1/15/16 3:00,68.347543,122.270188,11.322140,7.25,0.996739,1599.968720,-500.951778,1600.659236,-499.677094,...,14.036510,-500.857308,11.999550,-501.193686,9.970366,-499.201640,7.977324,-500.255908,14.005551,-599.996129
4,1/15/16 4:00,66.927016,117.988169,11.913613,7.25,1.009869,1601.339707,-498.975456,1601.437854,-500.323246,...,14.027298,-499.838632,11.953070,-501.053894,9.925709,-501.686727,7.894242,-500.356035,13.996647,-601.496691
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16855,8/18/18 6:59,73.755150,123.381787,8.028927,6.50,1.304232,1648.421193,-400.382169,1648.742005,-400.359661,...,23.031497,-501.167942,20.007571,-499.740028,18.006038,-499.834374,13.001114,-500.155694,20.007840,-501.296428
16856,8/18/18 7:59,69.049291,120.878188,7.962636,6.50,1.302419,1649.820162,-399.930973,1649.357538,-399.721222,...,22.960095,-501.612783,20.035660,-500.251357,17.998535,-500.395178,12.954048,-499.895163,19.968498,-501.041608
16857,8/18/18 8:59,67.002189,105.666118,7.955111,6.50,1.315926,1649.166761,-399.888631,1649.196904,-399.677571,...,23.015718,-501.711599,19.951231,-499.857027,18.019543,-500.451156,13.023431,-499.914391,19.990885,-501.518452
16858,8/18/18 9:59,65.523246,98.880538,7.984164,6.50,1.241969,1646.547763,-398.977083,1648.212240,-400.383265,...,23.024963,-501.153409,20.054122,-500.314711,17.979515,-499.272871,12.992404,-499.976268,20.013986,-500.625471


In [66]:
train_data2_missing_perct = (train_data2.isna().sum())/train_data2.shape[0]*100
print(train_data2_missing_perct.sort_values(ascending=False))

date                                          0.0
secondary_cleaner.state.floatbank3_a_air      0.0
rougher.state.floatbank10_c_level             0.0
rougher.state.floatbank10_d_air               0.0
rougher.state.floatbank10_d_level             0.0
rougher.state.floatbank10_e_air               0.0
rougher.state.floatbank10_e_level             0.0
rougher.state.floatbank10_f_air               0.0
rougher.state.floatbank10_f_level             0.0
secondary_cleaner.state.floatbank2_a_air      0.0
secondary_cleaner.state.floatbank2_a_level    0.0
secondary_cleaner.state.floatbank2_b_air      0.0
secondary_cleaner.state.floatbank2_b_level    0.0
secondary_cleaner.state.floatbank3_a_level    0.0
rougher.state.floatbank10_b_level             0.0
secondary_cleaner.state.floatbank3_b_air      0.0
secondary_cleaner.state.floatbank3_b_level    0.0
secondary_cleaner.state.floatbank4_a_air      0.0
secondary_cleaner.state.floatbank4_a_level    0.0
secondary_cleaner.state.floatbank4_b_air      0.0


Up to this point, I think the way I handled the columns and the missing values might not be correct. I am stuck here and not sure how to proceed. Would you please review what I did and offer some input? Thank you.

## Analyze the data

### How the concentration of metals (Au, Ag, Pb) changes during the process

Take note of how the concentration of metals (Au, Ag, Pb) changes depending on the purification stage.
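
One possible way to approach this is to compare the mean concentration of each metal at every stage in a single table. The sketch below is a minimal example on made-up numbers. The stage prefixes in `STAGES` are assumptions about the column naming (only `rougher.input.feed_pb` appears in the output above), so they should be checked against `train_data.columns` first; `mean_concentration_by_stage` is a hypothetical helper.

```python
import pandas as pd

# Assumed column prefixes for each stage; verify against the real columns.
STAGES = {
    "raw feed": "rougher.input.feed_",
    "rougher concentrate": "rougher.output.concentrate_",
    "primary cleaner concentrate": "primary_cleaner.output.concentrate_",
    "final concentrate": "final.output.concentrate_",
}
METALS = ["au", "ag", "pb"]


def mean_concentration_by_stage(df):
    """Return a stage-by-metal table of mean concentrations."""
    data = [
        [df[prefix + metal].mean() for metal in METALS]
        for prefix in STAGES.values()
    ]
    return pd.DataFrame(data, index=list(STAGES), columns=METALS)


# Toy frame with made-up values, two rows per column.
toy = pd.DataFrame({
    "rougher.input.feed_au": [8.0, 10.0],
    "rougher.input.feed_ag": [8.0, 9.0],
    "rougher.input.feed_pb": [3.0, 4.0],
    "rougher.output.concentrate_au": [19.0, 21.0],
    "rougher.output.concentrate_ag": [11.0, 12.0],
    "rougher.output.concentrate_pb": [7.0, 8.0],
    "primary_cleaner.output.concentrate_au": [31.0, 33.0],
    "primary_cleaner.output.concentrate_ag": [8.0, 9.0],
    "primary_cleaner.output.concentrate_pb": [9.0, 10.0],
    "final.output.concentrate_au": [43.0, 45.0],
    "final.output.concentrate_ag": [5.0, 6.0],
    "final.output.concentrate_pb": [9.0, 11.0],
})

table = mean_concentration_by_stage(toy)
print(table)
```

On the real data, the same table can be passed to `table.plot(kind="bar")` to see the trend for each metal across stages.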
 
### Compare the feed particle size distributions

Compare the feed particle size distributions in the training set and in the test set. If the distributions vary significantly, model evaluation will be performed incorrectly.
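
A simple way to make this comparison concrete is to put the summary statistics of the two samples side by side and to compute the largest gap between their empirical distribution functions (the two-sample Kolmogorov-Smirnov statistic). The sketch below uses synthetic data in place of the real columns; `ks_statistic` is a hypothetical helper written with NumPy only, and no particular threshold for "significantly different" is implied.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Synthetic stand-ins for the feed size columns (values are made up).
rng = np.random.default_rng(0)
train_size = pd.Series(rng.lognormal(mean=4.0, sigma=0.3, size=1000))
test_size = pd.Series(rng.lognormal(mean=4.0, sigma=0.3, size=400))

summary = pd.DataFrame({
    "train": train_size.describe(),
    "test": test_size.describe(),
})
print(summary)
print("KS statistic:", ks_statistic(train_size, test_size))
```

With the real data, `train_size` and `test_size` would be something like `train_data["rougher.input.feed_size"].dropna()` and the matching test column, and the same check can be repeated for `primary_cleaner.input.feed_size`. Overlaying two histograms with `plt.hist(..., density=True)` gives a visual version of the same comparison.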

### Total concentrations of all substances at different stages

Consider the total concentrations of all substances at different stages of the recovery process: the raw feed, rougher concentrate, and final concentrate.

Do you notice any abnormal values in the total distribution? If you do, is it worth removing them from both sets? Describe your findings and eliminate anomalies.
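
As a starting point, the totals can be computed by summing the substance columns for each stage, and rows whose total is close to zero at any stage can be treated as candidates for removal. The sketch below assumes the column prefixes in `STAGE_PREFIXES` and the substance suffixes `au`, `ag`, `pb`, and `sol`; the `threshold` value is an arbitrary placeholder, and both helper functions are hypothetical.

```python
import pandas as pd

SUBSTANCES = ["au", "ag", "pb", "sol"]
# Assumed column prefixes for each stage; verify against the real columns.
STAGE_PREFIXES = {
    "raw feed": "rougher.input.feed_",
    "rougher concentrate": "rougher.output.concentrate_",
    "final concentrate": "final.output.concentrate_",
}


def total_concentrations(df):
    """Sum the substance concentrations for each stage, row by row."""
    totals = pd.DataFrame(index=df.index)
    for stage, prefix in STAGE_PREFIXES.items():
        cols = [prefix + s for s in SUBSTANCES]
        totals[stage] = df[cols].sum(axis=1)
    return totals


def drop_near_zero_totals(df, threshold=1.0):
    """Keep only rows whose total exceeds the threshold at every stage."""
    keep = (total_concentrations(df) > threshold).all(axis=1)
    return df[keep]


# Toy frame with made-up values: the middle row is all zeros.
toy = pd.DataFrame({
    prefix + s: [10.0, 0.0, 12.0]
    for prefix in STAGE_PREFIXES.values()
    for s in SUBSTANCES
})

totals = total_concentrations(toy)
cleaned = drop_near_zero_totals(toy)
print(totals)
print(cleaned.index.tolist())
```

Plotting a histogram of each column in `totals` before filtering helps to decide whether a spike near zero is really there and where a reasonable cutoff would be.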




## Build the model

### Write a function to calculate the final sMAPE value
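
A minimal sketch of such a function is shown below. It uses the usual symmetric MAPE definition, the mean of |y - ŷ| / ((|y| + |ŷ|) / 2) expressed in percent. The 25% / 75% weighting of the rougher and final stages in `final_smape` is an assumption about the task statement and should be replaced if the brief specifies different weights; treating 0/0 terms as zero error is also a choice made here, not a given.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Where both values are zero the term is counted as zero error.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return float(ratio.mean() * 100)


def final_smape(smape_rougher, smape_final, w_rougher=0.25, w_final=0.75):
    """Weighted combination of the two stage errors (weights are assumed)."""
    return w_rougher * smape_rougher + w_final * smape_final


# Made-up values for a quick check.
print(smape([100.0, 80.0], [100.0, 80.0]))
print(final_smape(10.0, 20.0))
```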

### Train different models and evaluate them using cross-validation 
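
The sketch below shows one way this step could look with `cross_val_score` and a custom scorer. Because recovery is a continuous target, it uses regressors rather than classifiers. The data is synthetic, the two models and their hyperparameters are arbitrary examples, and `smape` is the same hypothetical helper as in the previous step, repeated so the block runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return float(ratio.mean() * 100)


# Lower sMAPE is better, so the scorer returns negated values.
smape_scorer = make_scorer(smape, greater_is_better=False)

# Synthetic features and target standing in for the prepared training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 60 + X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(
        n_estimators=50, max_depth=5, random_state=12345
    ),
}

cv_smape = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring=smape_scorer)
    cv_smape[name] = -scores.mean()  # flip the sign back to a plain sMAPE

for name, value in cv_smape.items():
    print(f"{name}: {value:.2f}%")
```

In the project itself this loop would be run once for the rougher target and once for the final target, and the two cross-validated errors would then be combined into the final sMAPE.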

### Pick the best model and test it using the test set



## Data analysis

## Model

## Task completion check list

- [x]  Jupyter Notebook
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  Step 1 performed: data has been prepared
    - [ ]  The formula for calculating the flotation effectiveness has been checked
    - [ ]  The features unavailable in the test set have been analyzed
    - [ ]  The data has been preprocessed
- [ ]  Step 2 performed: data has been analyzed
    - [ ]  The change in concentration of elements has been analyzed at each stage
    - [ ]  Distributions of particle size have been analyzed for training set and test set
    - [ ]  Total concentrations have been analyzed
    - [ ]  Abnormal values have been analyzed and processed
- [ ]  Step 3 performed: model for prediction has been built
    - [ ]  Final *sMAPE* calculation function has been written
    - [ ]  Several models have been trained and tested
    - [ ]  The best model has been picked and tested using the test set