# Modeling gold recovery from ore

**Project description**

Build a prototype machine learning model for «Цифра», a company that develops solutions for running industrial plants efficiently.

The model must predict the recovery coefficient of gold from gold-bearing ore. You have data with extraction and purification parameters at your disposal.
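The project description does not spell out how the recovery coefficient is computed. As a hedged illustration, the sketch below assumes the standard two-product formula from mineral processing, where `C`, `F` and `T` are the metal shares in the concentrate, the feed and the tailings. The function name, argument names and numbers are hypothetical and are not taken from the dataset.

```python
def recovery(concentrate: float, feed: float, tail: float) -> float:
    """Two-product recovery formula, in percent (an assumption, see above).

    concentrate (C): metal share in the concentrate after the stage
    feed (F):        metal share in the feed before the stage
    tail (T):        metal share in the tailings after the stage
    """
    denominator = feed * (concentrate - tail)
    if denominator == 0:
        # The formula is undefined here, so report "no value" instead of raising
        return float("nan")
    return concentrate * (feed - tail) / denominator * 100


# Made-up shares, for illustration only
example = recovery(concentrate=20.0, feed=8.0, tail=2.0)
print(round(example, 2))
```

A value near 100 means almost all metal from the feed ends up in the concentrate; a value near 0 means most of it is lost to the tailings.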

The model will help optimize production so that a plant with unprofitable characteristics is not put into operation.

Work plan:

1. Load and prepare the data.
2. Examine the class balance and train a model without accounting for the imbalance.
3. Improve model quality by accounting for the class imbalance. Train several models and pick the best one.
4. Run the final test.

**Data description**

The data come in three files:

1. `gold_industry_train.csv`: the training set;
2. `gold_industry_test.csv`: the test set;
3. `gold_industry_full.csv`: the source data.

The data are indexed by the date and time at which the information was obtained (the `date` feature). Observations that are adjacent in time often have similar parameter values.
Some parameters are unavailable because they are measured and/or calculated much later. As a result, the test set lacks some features that may be present in the training set. The test set also contains no target features.
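One way to see which features are unavailable at prediction time is to compare the column sets of the two splits. The sketch below uses two tiny toy frames in place of the real files (the values are illustrative); the variable names `train`, `test`, `missing_in_test` and `common_features` are assumptions made for this example.

```python
import pandas as pd

# Toy stand-ins for the real train/test splits (values are illustrative)
train = pd.DataFrame({
    "date": ["2016-01-15 01:00:00", "2016-01-15 00:00:00"],
    "rougher.input.feed_au": [6.48, 6.49],
    "final.output.recovery": [69.27, 70.54],
})
test = pd.DataFrame({
    "date": ["2017-12-09 14:59:59", "2017-12-09 15:59:59"],
    "rougher.input.feed_au": [4.37, 4.36],
})


def index_by_date(df: pd.DataFrame) -> pd.DataFrame:
    """Parse the `date` column and use it as a sorted time index."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    return out.set_index("date").sort_index()


train = index_by_date(train)
test = index_by_date(test)

# Features present in train but absent from test (late measurements, targets)
missing_in_test = sorted(set(train.columns) - set(test.columns))
# Features that a model can actually rely on at prediction time
common_features = [c for c in train.columns if c in test.columns]

print(missing_in_test)
print(common_features)
```

Training only on `common_features` keeps the model from depending on values that are not known when the prediction is made.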

Features of the technological process:

1. `Rougher feed`: raw feed
2. `Rougher additions` (or `reagent additions`): flotation reagents, namely Xanthate, Sulphate, Depressant
3. `Xanthate`: xanthate (a promoter, or flotation activator);
4. `Sulphate`: sulphate (sodium sulphide at this plant);
5. `Depressant`: depressant (sodium silicate).
6. `Rougher process`: flotation (the rougher, or coarse, stage)
7. `Rougher tails`: discard tailings
8. `Float banks`: flotation unit
9. `Cleaner process`: cleaning
10. `Rougher Au`: rougher gold concentrate
11. `Final Au`: final gold concentrate

Stage parameters:

1. `air amount`: air volume
2. `fluid levels`: fluid level
3. `feed size`: feed particle size
4. `feed rate`: feed rate

Feature names follow the pattern `[stage].[parameter_type].[parameter_name]`, for example `rougher.input.feed_ag`.

Possible values of the `[stage]` block:

1. `rougher`: flotation
2. `primary_cleaner`: primary cleaning
3. `secondary_cleaner`: secondary cleaning
4. `final`: final characteristics

Possible values of the `[parameter_type]` block:

1. `input`: raw material parameters
2. `output`: product parameters
3. `state`: parameters describing the current state of the stage
4. `calculation`: calculated characteristics
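Because every feature name encodes its stage and parameter type, the naming scheme above can be unpacked programmatically, for example to count features per stage. This is a minimal sketch; the helper `split_feature_name` and the short column list are hypothetical and only illustrate the pattern.

```python
import pandas as pd


def split_feature_name(name: str) -> dict:
    """Split `[stage].[parameter_type].[parameter_name]` into its three parts."""
    stage, param_type, param_name = name.split(".", 2)
    return {"stage": stage, "type": param_type, "name": param_name}


# A few column names in the documented format; `date` does not follow it
columns = [
    "date",
    "rougher.input.feed_ag",
    "rougher.state.floatbank10_a_air",
    "final.output.recovery",
]

# Keep only names that contain all three blocks
feature_columns = [c for c in columns if c.count(".") >= 2]
parsed = pd.DataFrame(
    [split_feature_name(c) for c in feature_columns],
    index=feature_columns,
)

# Number of features per (stage, parameter type) pair
counts = parsed.groupby(["stage", "type"]).size()
print(counts)
```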

In [2]:
import pandas as pd
import optuna
import plotly.graph_objects as go
import plotly.express as px

from plotly.subplots import make_subplots

from collections import defaultdict
from ydata_profiling import ProfileReport
from IPython.display import display

from fast_ml import eda

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

from sklearn.metrics import (
    accuracy_score, f1_score, auc, roc_curve, roc_auc_score
)

In [3]:
FIG_WIDTH = 10 * 100
FIG_HEIGHT = 5 * 100
RANDOM_SEED = 42
FILE_NAMES = ['train', 'test', 'full']

In [5]:
# Load each split from the working directory, falling back to /datasets/
dct_splits = {}
for file_name in FILE_NAMES:
    try:
        dct_splits[file_name] = pd.read_csv(f"gold_industry_{file_name}.csv")
    except FileNotFoundError:
        dct_splits[file_name] = pd.read_csv(f"/datasets/gold_industry_{file_name}.csv")

## Exploratory data analysis

Let us examine the main relationships in the data before feeding them to machine learning algorithms.

Summary table:

In [8]:
for file_name in FILE_NAMES:
    display(eda.df_info(dct_splits[file_name]))

feature,data_type,data_type_grp,num_unique_values,sample_unique_values,num_missing,perc_missing
date,object,Categorical,14579,"[2016-01-15 00:00:00, 2016-01-15 01:00:00, 201...",0,0.0
rougher.input.feed_au,float64,Numerical,14556,"[6.486149787902832, 6.478582788213898, 6.36222...",0,0.0
rougher.input.feed_ag,float64,Numerical,14556,"[6.100378036499023, 6.161113482574631, 6.11645...",0,0.0
rougher.input.feed_pb,float64,Numerical,14480,"[2.284912109375, 2.2660326347086164, 2.1596216...",72,0.493861
rougher.input.feed_sol,float64,Numerical,14469,"[36.80859375, 35.75338486777412, 35.9716300688...",77,0.528157
...,...,...,...,...,...,...
final.output.recovery,float64,Numerical,14399,"[70.54121591421571, 69.26619763433304, 68.1164...",0,0.0
final.output.tail_au,float64,Numerical,14493,"[2.143149375915528, 2.2249303504399105, 2.2578...",0,0.0
final.output.tail_ag,float64,Numerical,14492,"[10.411961555480955, 10.462675680986608, 10.50...",1,0.006859
final.output.tail_pb,float64,Numerical,14418,"[0.89544677734375, 0.9274521342416604, 0.95371...",75,0.514439


feature,data_type,data_type_grp,num_unique_values,sample_unique_values,num_missing,perc_missing
date,object,Categorical,4860,"[2017-12-09 14:59:59, 2017-12-09 15:59:59, 201...",0,0.0
rougher.input.feed_au,float64,Numerical,4857,"[4.365491264352128, 4.362781355723665, 5.08168...",0,0.0
rougher.input.feed_ag,float64,Numerical,4857,"[6.158717726619336, 6.048130219512707, 6.08274...",0,0.0
rougher.input.feed_pb,float64,Numerical,4827,"[3.8757267771199024, 3.902536612285509, 4.5640...",28,0.576132
rougher.input.feed_sol,float64,Numerical,4834,"[39.135118562631504, 39.71390631887648, 37.208...",22,0.452675
rougher.input.feed_rate,float64,Numerical,4856,"[555.8202083494116, 544.731686878923, 558.1551...",4,0.082305
rougher.input.feed_size,float64,Numerical,4811,"[94.54435802875824, 123.74242951851237, 82.610...",44,0.90535
rougher.input.floatbank10_sulfate,float64,Numerical,4857,"[6.1469819662834695, 6.210119180198626, 7.3638...",3,0.061728
rougher.input.floatbank10_xanthate,float64,Numerical,4859,"[9.308612125741996, 9.297709489474396, 9.00356...",1,0.020576
rougher.state.floatbank10_a_air,float64,Numerical,4859,"[1196.2381122289923, 1201.9041773892254, 1200....",1,0.020576


feature,data_type,data_type_grp,num_unique_values,sample_unique_values,num_missing,perc_missing
date,object,Categorical,19439,"[2016-01-15 00:00:00, 2016-01-15 01:00:00, 201...",0,0.0
rougher.input.feed_au,float64,Numerical,19409,"[6.486149787902832, 6.478582788213898, 6.36222...",0,0.0
rougher.input.feed_ag,float64,Numerical,19409,"[6.100378036499023, 6.161113482574631, 6.11645...",0,0.0
rougher.input.feed_pb,float64,Numerical,19300,"[2.284912109375, 2.2660326347086164, 2.1596216...",100,0.51443
rougher.input.feed_sol,float64,Numerical,19292,"[36.80859375, 35.75338486777412, 35.9716300688...",99,0.509285
...,...,...,...,...,...,...
final.output.recovery,float64,Numerical,19235,"[70.54121591421571, 69.26619763433304, 68.1164...",0,0.0
final.output.tail_au,float64,Numerical,19329,"[2.143149375915528, 2.2249303504399105, 2.2578...",0,0.0
final.output.tail_ag,float64,Numerical,19328,"[10.411961555480955, 10.462675680986608, 10.50...",1,0.005144
final.output.tail_pb,float64,Numerical,19228,"[0.89544677734375, 0.9274521342416604, 0.95371...",101,0.519574
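The summaries above show that most gaps are small (well under 1% of rows in the features listed). If `fast_ml` is not available, a comparable missing-value overview can be sketched with pandas alone. The helper name `missing_summary` and the toy frame are assumptions for illustration; this is not the `eda.df_info` implementation.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, unique count and missing-value share (in percent)."""
    summary = pd.DataFrame({
        "data_type": df.dtypes.astype(str),
        "num_unique_values": df.nunique(),
        "num_missing": df.isna().sum(),
        "perc_missing": (df.isna().mean() * 100).round(6),
    })
    # Put the columns with the most gaps first
    return summary.sort_values("perc_missing", ascending=False)


# Toy frame: column "a" has one gap in four rows, column "b" has none
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
})
report = missing_summary(toy)
print(report)
```

Sorting by `perc_missing` makes it easy to decide which features need imputation before modeling.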


In [10]:
for file_name in FILE_NAMES:
    display(round(dct_splits[file_name].describe().T, 2))

feature,count,mean,std,min,25%,50%,75%,max
rougher.input.feed_au,14579.0,8.35,1.93,0.01,6.93,8.23,9.83,13.73
rougher.input.feed_ag,14579.0,8.88,1.92,0.01,7.34,8.72,10.26,14.60
rougher.input.feed_pb,14507.0,3.60,1.06,0.01,2.88,3.53,4.26,7.05
rougher.input.feed_sol,14502.0,36.56,5.21,0.01,34.09,37.10,39.90,53.48
rougher.input.feed_rate,14572.0,474.33,108.50,0.00,411.05,498.19,549.59,717.51
...,...,...,...,...,...,...,...,...
final.output.recovery,14579.0,66.76,10.62,0.00,63.11,67.96,72.60,100.00
final.output.tail_au,14579.0,3.09,0.92,0.00,2.51,3.03,3.61,8.25
final.output.tail_ag,14578.0,9.73,2.36,0.00,8.09,9.82,11.17,19.55
final.output.tail_pb,14504.0,2.72,0.96,0.00,2.04,2.77,3.35,5.80


feature,count,mean,std,min,25%,50%,75%,max
rougher.input.feed_au,4860.0,8.01,1.99,0.01,6.57,7.81,9.56,13.42
rougher.input.feed_ag,4860.0,8.55,1.96,0.01,6.98,8.18,10.08,14.53
rougher.input.feed_pb,4832.0,3.58,1.03,0.01,2.9,3.54,4.2,7.14
rougher.input.feed_sol,4838.0,37.1,4.93,0.01,34.51,37.5,40.46,53.48
rougher.input.feed_rate,4856.0,490.29,94.37,0.01,434.1,502.12,555.6,702.52
rougher.input.feed_size,4816.0,59.11,19.13,0.05,47.6,55.51,66.66,363.99
rougher.input.floatbank10_sulfate,4857.0,12.06,3.41,0.02,9.89,12.0,14.5,30.01
rougher.input.floatbank10_xanthate,4859.0,6.1,1.04,0.02,5.5,6.1,6.8,9.4
rougher.state.floatbank10_a_air,4859.0,1108.64,156.49,300.79,999.72,1001.41,1202.84,1521.98
rougher.state.floatbank10_a_level,4859.0,-368.34,91.16,-600.57,-499.73,-300.18,-299.96,-281.04


feature,count,mean,std,min,25%,50%,75%,max
rougher.input.feed_au,19439.0,8.27,1.96,0.01,6.85,8.13,9.77,13.73
rougher.input.feed_ag,19439.0,8.79,1.94,0.01,7.24,8.59,10.21,14.60
rougher.input.feed_pb,19339.0,3.60,1.05,0.01,2.89,3.53,4.24,7.14
rougher.input.feed_sol,19340.0,36.70,5.15,0.01,34.21,37.20,40.04,53.48
rougher.input.feed_rate,19428.0,478.32,105.37,0.00,416.53,499.42,550.17,717.51
...,...,...,...,...,...,...,...,...
final.output.recovery,19439.0,67.05,10.13,0.00,63.30,68.17,72.69,100.00
final.output.tail_au,19439.0,3.04,0.92,0.00,2.46,2.98,3.57,8.25
final.output.tail_ag,19438.0,9.69,2.33,0.00,8.06,9.74,11.13,19.55
final.output.tail_pb,19338.0,2.71,0.95,0.00,2.04,2.75,3.33,5.80
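The descriptive statistics hint at small shifts between the splits: for example, the median of `rougher.input.feed_au` is 8.23 in the training set and 7.81 in the test set. A hedged way to scan all shared numeric features for such shifts is sketched below. The helper `compare_splits` and the toy frames are hypothetical, and the relative difference of medians is only one of many possible measures.

```python
import pandas as pd


def compare_splits(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Compare medians of the numeric features shared by both splits."""
    numeric_train = train.select_dtypes("number")
    common = [c for c in numeric_train.columns if c in test.columns]
    stats = pd.DataFrame({
        "train_median": train[common].median(),
        "test_median": test[common].median(),
    })
    # Relative shift of the test median with respect to the train median
    stats["rel_diff"] = (
        (stats["test_median"] - stats["train_median"])
        / stats["train_median"].abs()
    )
    # Largest absolute shifts first
    return stats.sort_values("rel_diff", key=lambda s: s.abs(), ascending=False)


# Toy frames: feature "a" doubles in the test split, feature "b" is unchanged
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 20.0, 30.0]})
shift = compare_splits(toy_train, toy_test)
print(shift)
```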


In [12]:
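# Build the profile report for the full dataset only
# (the original call relied on `file_name` still being 'full' after the loops above)
ProfileReport(dct_splits['full']).to_widgets()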
