
# "Steel Processing"
> "We examine the energy consumption of a steel processing plant to try and reduce energy costs"

- toc: true
- badges: true
- comments: true
- categories: [fastpages, jupyter]
- hide: false


## Problem Statement:
To optimize production costs, the steel plant Steelproof has decided to reduce its energy consumption at the steel processing stage. 

## Initial Information:
### Data:
- Additive materials (Bulk and Wire) + timestamps
- Heating times and power usage
- Temp readings + timestamps
- Inert Gas usage  

### Questions:
**Question:** Is any of the data not immediately available during production?  
**Response:** Intermediate temperatures cannot be used in the model because the sensors cannot determine them quickly enough during production.

**Question:**  For the Bulk materials and wire, are the NaNs unknown values or the automatic value for when that material wasn't used?  
**Response:** NaN is the automatic value recorded when the material wasn't used.  
  
### Plan:
1. Isolate the target variable into its own series.
2. Remove date-time variables after deriving a duration variable from electrode data
3. Merge remaining data into a single dataframe where each iteration is only one row.  
4. Build a variety of Regression models and test them
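
Step 2 of the plan could look roughly like the sketch below. The column names `heating_start` and `heating_end` and the toy timestamps are hypothetical placeholders, not the actual contents of the electrode file.

```python
import pandas as pd

# Toy electrode log. Column names and values are hypothetical placeholders.
arc = pd.DataFrame({
    'key': [1, 1, 2],
    'heating_start': pd.to_datetime(['2019-05-03 11:02:14',
                                     '2019-05-03 11:07:28',
                                     '2019-05-03 11:34:14']),
    'heating_end': pd.to_datetime(['2019-05-03 11:06:02',
                                   '2019-05-03 11:10:33',
                                   '2019-05-03 11:36:31']),
})

# Seconds of heating per electrode cycle.
arc['heating_seconds'] = (arc['heating_end'] - arc['heating_start']).dt.total_seconds()

# Sum the cycles so that each key (iteration) ends up as a single row.
heating_duration = arc.groupby('key')['heating_seconds'].sum().reset_index()
print(heating_duration)
```

Once the duration is derived, the timestamp columns themselves can be dropped, as the plan describes.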

## Solution:  
- We stuck to the plan for the most part.  
- We completely removed the timestamps for the bulk, wire, and electrode data.  
- Isolated the target variable and merged the remaining data into one dataframe.
- Our biggest deviation from the plan was calculating the entire process duration (from the first to the last temperature reading) instead of the heating duration from the electrode data.  
  
- Separating the first and final temperature readings for each iteration was the trickiest but most crucial part of the process.
- After that, merging the data was the final hurdle, which was simplified by having a `key` column to merge on.  
  
- We tested 3 models against a dummy model to try and find the best one:  
    1. Linear Regression
    2. CatBoost Regressor
    3. Ridge
- The dummy model (sanity test) scored an MAE just over 8.
- Linear Regression and Ridge performed adequately, with MAE scores just under 6.
- We tried CatBoost both with parameter tuning (grid search) and with its default settings; the default model performed better.  


## Results:  
- CatBoost was our best model overall, bringing the mean absolute error (MAE) down to 5.22.
- Training time is usually a concern with CatBoost, but because we are using the default model, it runs quickly.

## Init

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV

In [2]:
#collapse-hide
data_elec = pd.read_csv('/datasets/data_arc_en.csv')
data_bulk = pd.read_csv('/datasets/data_bulk_en.csv')
data_temp = pd.read_csv('/datasets/data_temp_en.csv')
data_wire = pd.read_csv('/datasets/data_wire_en.csv')
data_gas = pd.read_csv('/datasets/data_gas_en.csv')

In [3]:
#hide
# data_bulk_time = pd.read_csv('/datasets/data_bulk_time_en.csv')
# data_wire_time = pd.read_csv('/datasets/data_wire_time_en.csv')

## Cleaning & Merging

In [4]:
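# Parse the timestamps so that durations can be computed.
data_temp['Sampling time'] = pd.to_datetime(data_temp['Sampling time'])

# Last reading per key: the target temperature and the end of the process.
final_temp = (data_temp
              .drop_duplicates(['key'], keep='last')
              .reset_index(drop=True)
              .rename(columns={'Temperature': 'Final Temp', 'Sampling time': 'End Time'}))

# First reading per key: the starting temperature and the start of the process.
initial_temp = (data_temp
                .drop_duplicates(['key'])
                .reset_index(drop=True)
                .rename(columns={'Temperature': 'Initial Temp', 'Sampling time': 'Start Time'}))

# This subtraction aligns on the index, so it assumes both frames list the keys in the same order.
final_temp['Duration'] = (final_temp['End Time'] - initial_temp['Start Time']).dt.total_seconds()
initial_temp = initial_temp.drop(['Start Time'], axis=1)
final_temp = final_temp.drop(['End Time'], axis=1)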

display(initial_temp.head())
final_temp.head()

| | key | Initial Temp |
|---|---|---|
| 0 | 1 | 1571.0 |
| 1 | 2 | 1581.0 |
| 2 | 3 | 1596.0 |
| 3 | 4 | 1601.0 |
| 4 | 5 | 1576.0 |


| | key | Final Temp | Duration |
|---|---|---|---|
| 0 | 1 | 1613.0 | 861.0 |
| 1 | 2 | 1602.0 | 1305.0 |
| 2 | 3 | 1599.0 | 1300.0 |
| 3 | 4 | 1625.0 | 388.0 |
| 4 | 5 | 1602.0 | 762.0 |
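
The duration above is computed by subtracting two frames row by row, which only works while both list the keys in the same order. A minimal sketch of a variant that joins the first and last readings on `key` explicitly is shown below. The toy readings are made up for illustration, and the suffixed column names are just what `merge` produces here, not names from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy temperature log (values invented for illustration).
temps = pd.DataFrame({
    'key': [1, 1, 1, 2, 2, 3, 3],
    'Sampling time': pd.to_datetime([
        '2019-05-03 11:16:18', '2019-05-03 11:25:53', '2019-05-03 11:30:39',
        '2019-05-03 11:37:27', '2019-05-03 11:59:12',
        '2019-05-03 12:00:00', '2019-05-03 12:10:00']),
    'Temperature': [1571.0, 1604.0, 1613.0, 1581.0, 1602.0, 1596.0, np.nan],
})

# Sort so that "first" and "last" really mean earliest and latest reading.
ordered = temps.sort_values(['key', 'Sampling time'])
first = ordered.drop_duplicates('key', keep='first')
last = ordered.drop_duplicates('key', keep='last')

# Join on key instead of relying on matching row positions.
per_key = first.merge(last, on='key', suffixes=('_first', '_last'))
per_key['Duration'] = (per_key['Sampling time_last']
                       - per_key['Sampling time_first']).dt.total_seconds()
print(per_key[['key', 'Temperature_first', 'Temperature_last', 'Duration']])
```

Using `drop_duplicates` rather than a group-wise `first`/`last` aggregation keeps a missing final reading as NaN (key 3 here), so such iterations can still be dropped later instead of silently falling back to an intermediate temperature.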


In [5]:
data_elec = (data_elec
             .pivot_table(index='key', values=['Active power', 'Reactive power'], aggfunc='sum')
             .reset_index())
data_elec.head()

| | key | Active power | Reactive power |
|---|---|---|---|
| 0 | 1 | 4.878147 | 3.183241 |
| 1 | 2 | 3.052598 | 1.998112 |
| 2 | 3 | 2.525882 | 1.599076 |
| 3 | 4 | 3.20925 | 2.060298 |
| 4 | 5 | 3.347173 | 2.252643 |


In [6]:
data_bulk = data_bulk.fillna(0)
data_wire = data_wire.fillna(0)
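
Filling with zero follows from the answer in the Questions section: a NaN means the material was not added, so the amount is 0. The small sketch below shows the effect on a toy frame; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy bulk-additive table: NaN means the material was not used in that iteration.
bulk = pd.DataFrame({
    'key': [1, 2],
    'Bulk 4': [43.0, np.nan],
    'Bulk 12': [np.nan, 206.0],
})

# Replace "not used" with an amount of zero.
bulk_filled = bulk.fillna(0)
print(bulk_filled)
```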

In [7]:
data_all = (initial_temp
            .merge(final_temp, on='key', how='outer')
            .merge(data_bulk, on='key', how='outer')
            .merge(data_wire, on='key', how='outer')
            .merge(data_elec, on='key', how='outer')
            .merge(data_gas, on='key', how='outer')
            .drop('key', axis = 1))

In [8]:
#collapse-output
data_all.info()
data_all.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3241 entries, 0 to 3240
Data columns (total 30 columns):
Initial Temp      3216 non-null float64
Final Temp        2477 non-null float64
Duration          3216 non-null float64
Bulk 1            3129 non-null float64
Bulk 2            3129 non-null float64
Bulk 3            3129 non-null float64
Bulk 4            3129 non-null float64
Bulk 5            3129 non-null float64
Bulk 6            3129 non-null float64
Bulk 7            3129 non-null float64
Bulk 8            3129 non-null float64
Bulk 9            3129 non-null float64
Bulk 10           3129 non-null float64
Bulk 11           3129 non-null float64
Bulk 12           3129 non-null float64
Bulk 13           3129 non-null float64
Bulk 14           3129 non-null float64
Bulk 15           3129 non-null float64
Wire 1            3081 non-null float64
Wire 2            3081 non-null float64
Wire 3            3081 non-null float64
Wire 4            3081 non-null float64
Wire 5       

| | Initial Temp | Final Temp | Duration | Bulk 1 | Bulk 2 | Bulk 3 | Bulk 4 | Bulk 5 | Bulk 6 | Bulk 7 | ... | Wire 3 | Wire 4 | Wire 5 | Wire 6 | Wire 7 | Wire 8 | Wire 9 | Active power | Reactive power | Gas 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1571.0 | 1613.0 | 861.0 | 0.0 | 0.0 | 0.0 | 43.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.878147 | 3.183241 | 29.749986 |
| 1 | 1581.0 | 1602.0 | 1305.0 | 0.0 | 0.0 | 0.0 | 73.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.052598 | 1.998112 | 12.555561 |
| 2 | 1596.0 | 1599.0 | 1300.0 | 0.0 | 0.0 | 0.0 | 34.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.525882 | 1.599076 | 28.554793 |
| 3 | 1601.0 | 1625.0 | 388.0 | 0.0 | 0.0 | 0.0 | 81.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.20925 | 2.060298 | 18.841219 |
| 4 | 1576.0 | 1602.0 | 762.0 | 0.0 | 0.0 | 0.0 | 78.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.347173 | 2.252643 | 5.413692 |


In [9]:
#collapse-output
data_all = data_all.dropna()
data_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2329 entries, 0 to 2476
Data columns (total 30 columns):
Initial Temp      2329 non-null float64
Final Temp        2329 non-null float64
Duration          2329 non-null float64
Bulk 1            2329 non-null float64
Bulk 2            2329 non-null float64
Bulk 3            2329 non-null float64
Bulk 4            2329 non-null float64
Bulk 5            2329 non-null float64
Bulk 6            2329 non-null float64
Bulk 7            2329 non-null float64
Bulk 8            2329 non-null float64
Bulk 9            2329 non-null float64
Bulk 10           2329 non-null float64
Bulk 11           2329 non-null float64
Bulk 12           2329 non-null float64
Bulk 13           2329 non-null float64
Bulk 14           2329 non-null float64
Bulk 15           2329 non-null float64
Wire 1            2329 non-null float64
Wire 2            2329 non-null float64
Wire 3            2329 non-null float64
Wire 4            2329 non-null float64
Wire 5       
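
The outer merges followed by `dropna()` keep only the iterations that have a complete record in every source. As a hedged illustration on toy data (values invented), `merge` with `indicator=True` shows where each key came from before the incomplete rows are dropped:

```python
import numpy as np
import pandas as pd

# Toy frames: key 2 has no final temperature, key 3 has no gas record,
# and key 4 has a gas record but no temperatures.
temps = pd.DataFrame({'key': [1, 2, 3],
                      'Initial Temp': [1571.0, 1581.0, 1596.0],
                      'Final Temp': [1613.0, np.nan, 1599.0]})
gas = pd.DataFrame({'key': [1, 2, 4], 'Gas 1': [29.7, 12.6, 18.8]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = temps.merge(gas, on='key', how='outer', indicator=True)
print(merged['_merge'].value_counts())

# Dropping rows with any NaN leaves only fully observed iterations.
clean = merged.drop(columns='_merge').dropna()
print(clean)
```

In this toy case only key 1 survives, which mirrors how the real data shrinks from 3241 to 2329 rows.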

## Train/Test Split

In [10]:
features = data_all.drop('Final Temp', axis=1)
target = data_all['Final Temp']

In [11]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=54321)

In [12]:
display(features_train.shape)
display(target_train.shape)
display(features_test.shape)
target_test.shape

(1863, 29)

(1863,)

(466, 29)

(466,)

## Model 0 - Sanity Test

In [13]:
model_0 = DummyRegressor(strategy='mean')

In [14]:
model_0.fit(features_train, target_train)
pred_0 = model_0.predict(features_test)

In [15]:
print('Sanity Test MAE:', MAE(target_test, pred_0))

Sanity Test MAE: 8.025588660128683
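
For intuition, the sanity test amounts to predicting the training-set mean for every test row and averaging the absolute errors. A plain-Python sketch with invented numbers (the helper name `manual_mae` is made up for this example):

```python
from statistics import mean


def manual_mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Invented temperatures for illustration.
train_target = [1600.0, 1590.0, 1610.0]
test_target = [1595.0, 1605.0]

# A mean-strategy dummy model always predicts the training mean.
baseline = mean(train_target)
preds = [baseline] * len(test_target)

mae = manual_mae(test_target, preds)
print('Baseline prediction:', baseline)
print('Baseline MAE:', mae)
```

Any real model has to beat this number to be worth deploying.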


## Model 1 - LinearRegression

In [16]:
model_1 = LinearRegression()

In [17]:
model_1.fit(features_train, target_train)
pred_1 = model_1.predict(features_test)

In [18]:
print('Linear Regression MAE:', MAE(target_test, pred_1))

Linear Regression MAE: 5.9500027174198


## Model 2 - CatBoost

In [19]:
model_2 = CatBoostRegressor(random_state=54321, loss_function='MAE')

In [20]:
#collapse-output
#params = {'learning_rate': [0.03, 0.05, 0.1],
#        'depth': [4, 6, 10],
#        'l2_leaf_reg': [1, 3, 5, 7, 9]}
#
#model_2.grid_search(params, 
#                  X=features_train, 
#                  y=target_train,
#                  refit=True,
#                  partition_random_seed= 12345)



model_2.fit(features_train, target_train)
pred_2 = model_2.predict(features_test)

0:	learn: 7.9989684	total: 50.1ms	remaining: 50.1s
1:	learn: 7.9565330	total: 52.7ms	remaining: 26.3s
2:	learn: 7.8866288	total: 55ms	remaining: 18.3s
3:	learn: 7.8106662	total: 57.7ms	remaining: 14.4s
4:	learn: 7.7578909	total: 79.7ms	remaining: 15.9s
5:	learn: 7.6960067	total: 82.1ms	remaining: 13.6s
6:	learn: 7.6343002	total: 84.3ms	remaining: 12s
7:	learn: 7.5814495	total: 86.7ms	remaining: 10.8s
8:	learn: 7.5321962	total: 89.2ms	remaining: 9.82s
9:	learn: 7.4817674	total: 96.7ms	remaining: 9.57s
10:	learn: 7.4338809	total: 178ms	remaining: 16s
11:	learn: 7.3868121	total: 181ms	remaining: 14.9s
12:	learn: 7.3400538	total: 184ms	remaining: 13.9s
13:	learn: 7.2879249	total: 186ms	remaining: 13.1s
14:	learn: 7.2449540	total: 193ms	remaining: 12.7s
15:	learn: 7.1894576	total: 277ms	remaining: 17.1s
16:	learn: 7.1415323	total: 280ms	remaining: 16.2s
17:	learn: 7.1078082	total: 282ms	remaining: 15.4s
18:	learn: 7.0719712	total: 285ms	remaining: 14.7s
19:	learn: 7.0252944	total: 298ms	rem

- The grid search produced worse results than the default settings.
- The default configuration trains in less than 1/100th of the time the grid search takes.

In [21]:
print('CatBoostRegressor MAE:', MAE(target_test, pred_2))

CatBoostRegressor MAE: 5.221709765858424


## Model 3 - Ridge

In [22]:
model_3 = Ridge(random_state=54321)

In [23]:
#collapse-output
param_grid = {'alpha' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

grid = GridSearchCV(estimator=model_3, 
                    param_grid=param_grid, 
                    scoring='neg_mean_absolute_error', 
                    cv=4, verbose=1, n_jobs=-1)
                    
grid.fit(features_train, target_train)

model_3.set_params(**grid.best_params_)

model_3.fit(features_train, target_train)
pred_3 = model_3.predict(features_test)

Fitting 4 folds for each of 7 candidates, totalling 28 fits


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 out of  28 | elapsed:    1.9s finished


In [24]:
print('Ridge MAE:', MAE(target_test, pred_3))

Ridge MAE: 5.98106353055716
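
As a side note, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the tuned model is also available as `grid.best_estimator_` without a separate `set_params` and `fit`. A minimal sketch on synthetic data (the data and coefficients are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(54321)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(estimator=Ridge(),
                    param_grid=param_grid,
                    scoring='neg_mean_absolute_error',
                    cv=4)
grid.fit(X, y)

# With refit=True (the default), the best model is already trained on all of X, y.
best_model = grid.best_estimator_
cv_mae = -grid.best_score_  # the scorer is negated MAE, so flip the sign

print('Best alpha:', grid.best_params_['alpha'])
print('Cross-validated MAE:', cv_mae)
```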


## Conclusion

- Linear Regression clearly beat our sanity test model (MAE of 5.95 versus 8.03).
- CatBoost trained quickly and gave us the best results. Initially I ran a grid search for CatBoost, but this produced worse results and slowed training down tremendously.
- Ridge was used as an alternative to Linear Regression because it tends to perform better when some of the variables are interdependent. A quick grid search over `alpha` gave an MAE of 5.98, essentially the same as Linear Regression.
- All of the variables used can be derived primarily through sensors, which makes automation a lot more viable.
- Our final model for production will be CatBoost, as it had the lowest MAE of the models we tested (5.22, compared with 5.95 and 5.98 for the linear models).
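
To put the scores in perspective, the short sketch below computes how far each model's MAE falls below the sanity test baseline, using the MAE values printed in the cells above:

```python
# MAE values as printed by the notebook cells above.
reported_mae = {
    'Dummy (mean)': 8.025588660128683,
    'Linear Regression': 5.9500027174198,
    'CatBoost': 5.221709765858424,
    'Ridge': 5.98106353055716,
}

baseline = reported_mae['Dummy (mean)']

# Percentage reduction in MAE relative to the baseline.
improvement = {
    name: 100 * (1 - mae / baseline)
    for name, mae in reported_mae.items()
    if name != 'Dummy (mean)'
}

for name, pct in sorted(improvement.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {pct:.1f}% lower MAE than the baseline')
```

CatBoost comes out at roughly a 35% reduction, against roughly 25 to 26% for the two linear models.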