# Predicting the temperature of steel

To cut production costs, the steel plant Steelproof has decided to reduce energy consumption at the steel processing stage. This project develops a model that predicts the temperature of the metal.

## Importing the libraries

In [None]:
import pandas as pd
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

## Loading the data

In [None]:
df_arc = pd.read_csv('/datasets/data_arc_en.csv')

In [None]:
df_bulk = pd.read_csv('/datasets/data_bulk_en.csv')

In [None]:
df_bulk_time = pd.read_csv('/datasets/data_bulk_time_en.csv')

In [None]:
df_gas = pd.read_csv('/datasets/data_gas_en.csv')

In [None]:
df_temp = pd.read_csv('/datasets/data_temp_en.csv')

In [None]:
df_wire = pd.read_csv('/datasets/data_wire_en.csv')

In [None]:
df_wire_time = pd.read_csv('/datasets/data_wire_time_en.csv')

## Examining each file

### `df_arc`: electrode data

In [None]:
df_arc.head()

Unnamed: 0,key,Arc heating start,Arc heating end,Active power,Reactive power
0,1,2019-05-03 11:02:14,2019-05-03 11:06:02,0.976059,0.687084
1,1,2019-05-03 11:07:28,2019-05-03 11:10:33,0.805607,0.520285
2,1,2019-05-03 11:11:44,2019-05-03 11:14:36,0.744363,0.498805
3,1,2019-05-03 11:18:14,2019-05-03 11:24:19,1.659363,1.062669
4,1,2019-05-03 11:26:09,2019-05-03 11:28:37,0.692755,0.414397


In [None]:
df_arc.tail()

Unnamed: 0,key,Arc heating start,Arc heating end,Active power,Reactive power
14871,3241,2019-09-01 03:58:58,2019-09-01 04:01:35,0.53367,0.354439
14872,3241,2019-09-01 04:05:04,2019-09-01 04:08:04,0.676604,0.523631
14873,3241,2019-09-01 04:16:41,2019-09-01 04:19:45,0.733899,0.475654
14874,3241,2019-09-01 04:31:51,2019-09-01 04:32:48,0.220694,0.145768
14875,3241,2019-09-01 04:34:47,2019-09-01 04:36:08,0.30658,0.196708


In [None]:
df_arc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14876 entries, 0 to 14875
Data columns (total 5 columns):
key                  14876 non-null int64
Arc heating start    14876 non-null object
Arc heating end      14876 non-null object
Active power         14876 non-null float64
Reactive power       14876 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 581.2+ KB


In [None]:
df_arc.describe()

Unnamed: 0,key,Active power,Reactive power
count,14876.0,14876.0,14876.0
mean,1615.220422,0.670441,0.452592
std,934.571502,0.408159,5.878702
min,1.0,0.030002,-715.504924
25%,806.0,0.395297,0.290991
50%,1617.0,0.555517,0.415962
75%,2429.0,0.857034,0.637371
max,3241.0,3.731596,2.676388


In [None]:
df_arc[df_arc['Reactive power'] < 0]

Unnamed: 0,key,Arc heating start,Arc heating end,Active power,Reactive power
9780,2116,2019-07-24 00:44:48,2019-07-24 00:46:37,0.495782,-715.504924


In [None]:
df_arc.describe(include='object')

Unnamed: 0,Arc heating start,Arc heating end
count,14876,14876
unique,14875,14876
top,2019-06-10 22:02:03,2019-08-10 04:39:05
freq,2,1


In [None]:
missing_keys_arc = []

for i in range(1, 3242):
    if i not in df_arc['key'].unique():
        missing_keys_arc.append(i)
len(missing_keys_arc)

27

In [None]:
print(missing_keys_arc)

[41, 42, 195, 279, 355, 382, 506, 529, 540, 607, 683, 710, 766, 1133, 1300, 1437, 2031, 2103, 2278, 2356, 2373, 2446, 2469, 2491, 2683, 3200, 3207]


In [None]:
df_arc.duplicated().sum()

0

There are no null values found in `df_arc`.

`Arc heating start` and `Arc heating end` would be better suited as datetime columns.

There is one negative value in `Reactive power`.

27 keys are missing from `df_arc`.

No duplicates were found.
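
The missing-key loop above calls `unique()` on every iteration. A minimal sketch of an equivalent set-based check is shown below; the helper name `find_missing_keys` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def find_missing_keys(df, first=1, last=3241, key_col='key'):
    """Return the keys in [first, last] that never appear in df[key_col]."""
    present = set(df[key_col])  # build the lookup set once
    return [k for k in range(first, last + 1) if k not in present]

# Toy frame (hypothetical data): keys 3 and 6 are absent from the range 1..6.
toy = pd.DataFrame({'key': [1, 1, 2, 4, 5, 5]})
print(find_missing_keys(toy, last=6))  # [3, 6]
```

The same helper could be reused for each of the seven tables, which would avoid repeating the loop in every section.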

### `df_bulk`: bulk material supply data (volume)

In [None]:
df_bulk.head()

Unnamed: 0,key,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
0,1,,,,43.0,,,,,,,,206.0,,150.0,154.0
1,2,,,,73.0,,,,,,,,206.0,,149.0,154.0
2,3,,,,34.0,,,,,,,,205.0,,152.0,153.0
3,4,,,,81.0,,,,,,,,207.0,,153.0,154.0
4,5,,,,78.0,,,,,,,,203.0,,151.0,152.0


In [None]:
df_bulk.tail()

Unnamed: 0,key,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
3124,3237,,,170.0,,,,,,,,,252.0,,130.0,206.0
3125,3238,,,126.0,,,,,,,,,254.0,,108.0,106.0
3126,3239,,,,,,114.0,,,,,,158.0,,270.0,88.0
3127,3240,,,,,,26.0,,,,,,,,192.0,54.0
3128,3241,,,,,,,,,,,,,,180.0,52.0


In [None]:
df_bulk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3129 entries, 0 to 3128
Data columns (total 16 columns):
key        3129 non-null int64
Bulk 1     252 non-null float64
Bulk 2     22 non-null float64
Bulk 3     1298 non-null float64
Bulk 4     1014 non-null float64
Bulk 5     77 non-null float64
Bulk 6     576 non-null float64
Bulk 7     25 non-null float64
Bulk 8     1 non-null float64
Bulk 9     19 non-null float64
Bulk 10    176 non-null float64
Bulk 11    177 non-null float64
Bulk 12    2450 non-null float64
Bulk 13    18 non-null float64
Bulk 14    2806 non-null float64
Bulk 15    2248 non-null float64
dtypes: float64(15), int64(1)
memory usage: 391.2 KB


In [None]:
df_bulk.describe()

Unnamed: 0,key,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
count,3129.0,252.0,22.0,1298.0,1014.0,77.0,576.0,25.0,1.0,19.0,176.0,177.0,2450.0,18.0,2806.0,2248.0
mean,1624.383509,39.242063,253.045455,113.879045,104.394477,107.025974,118.925347,305.6,49.0,76.315789,83.284091,76.819209,260.47102,181.111111,170.284747,160.513345
std,933.337642,18.277654,21.180578,75.483494,48.184126,81.790646,72.057776,191.022904,,21.720581,26.060347,59.655365,120.649269,46.088009,65.868652,51.765319
min,1.0,10.0,228.0,6.0,12.0,11.0,17.0,47.0,49.0,63.0,24.0,8.0,53.0,151.0,16.0,1.0
25%,816.0,27.0,242.0,58.0,72.0,70.0,69.75,155.0,49.0,66.0,64.0,25.0,204.0,153.25,119.0,105.0
50%,1622.0,31.0,251.5,97.5,102.0,86.0,100.0,298.0,49.0,68.0,86.5,64.0,208.0,155.5,151.0,160.0
75%,2431.0,46.0,257.75,152.0,133.0,132.0,157.0,406.0,49.0,70.5,102.0,106.0,316.0,203.5,205.75,205.0
max,3241.0,185.0,325.0,454.0,281.0,603.0,503.0,772.0,49.0,147.0,159.0,313.0,1849.0,305.0,636.0,405.0


In [None]:
missing_keys_bulk = []

for i in range(1, 3242):
    if i not in df_bulk['key'].unique():
        missing_keys_bulk.append(i)
len(missing_keys_bulk)

112

In [None]:
print(missing_keys_bulk)

[41, 42, 51, 52, 53, 54, 55, 56, 72, 80, 81, 110, 151, 188, 195, 225, 269, 302, 330, 331, 332, 343, 350, 355, 382, 506, 529, 540, 607, 661, 683, 710, 766, 830, 874, 931, 933, 934, 960, 961, 964, 966, 983, 984, 1062, 1105, 1133, 1221, 1268, 1300, 1334, 1402, 1437, 1517, 1518, 1535, 1566, 1623, 1656, 1783, 1818, 1911, 1959, 1974, 1979, 2009, 2010, 2031, 2043, 2056, 2103, 2195, 2196, 2197, 2198, 2216, 2217, 2231, 2278, 2310, 2356, 2373, 2390, 2408, 2434, 2446, 2460, 2468, 2469, 2471, 2491, 2595, 2599, 2600, 2608, 2625, 2628, 2683, 2738, 2739, 2816, 2821, 2863, 2884, 2891, 3018, 3026, 3047, 3182, 3200, 3207, 3216]


In [None]:
df_bulk.duplicated().sum()

0

There are many null values in `df_bulk`. They most likely represent batches where none of the respective bulk material was added, since the minimum value of every column is greater than 0.

112 keys are also missing from `df_bulk`.

No duplicates were found.

### `df_bulk_time`: bulk material delivery data (time)

In [None]:
df_bulk_time.head()

Unnamed: 0,key,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
0,1,,,,2019-05-03 11:21:30,,,,,,,,2019-05-03 11:03:52,,2019-05-03 11:03:52,2019-05-03 11:03:52
1,2,,,,2019-05-03 11:46:38,,,,,,,,2019-05-03 11:40:20,,2019-05-03 11:40:20,2019-05-03 11:40:20
2,3,,,,2019-05-03 12:31:06,,,,,,,,2019-05-03 12:09:40,,2019-05-03 12:09:40,2019-05-03 12:09:40
3,4,,,,2019-05-03 12:48:43,,,,,,,,2019-05-03 12:41:24,,2019-05-03 12:41:24,2019-05-03 12:41:24
4,5,,,,2019-05-03 13:18:50,,,,,,,,2019-05-03 13:12:56,,2019-05-03 13:12:56,2019-05-03 13:12:56


In [None]:
df_bulk_time.tail()

Unnamed: 0,key,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
3124,3237,,,2019-08-31 22:51:28,,,,,,,,,2019-08-31 22:46:52,,2019-08-31 22:46:52,2019-08-31 22:46:52
3125,3238,,,2019-08-31 23:39:11,,,,,,,,,2019-08-31 23:33:09,,2019-08-31 23:33:09,2019-08-31 23:33:09
3126,3239,,,,,,2019-09-01 01:51:58,,,,,,2019-09-01 01:39:41,,2019-09-01 01:33:25,2019-09-01 01:33:25
3127,3240,,,,,,2019-09-01 03:12:40,,,,,,,,2019-09-01 02:41:27,2019-09-01 02:41:27
3128,3241,,,,,,,,,,,,,,2019-09-01 04:05:34,2019-09-01 04:05:34


In [None]:
df_bulk_time.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3129 entries, 0 to 3128
Data columns (total 16 columns):
key        3129 non-null int64
Bulk 1     252 non-null object
Bulk 2     22 non-null object
Bulk 3     1298 non-null object
Bulk 4     1014 non-null object
Bulk 5     77 non-null object
Bulk 6     576 non-null object
Bulk 7     25 non-null object
Bulk 8     1 non-null object
Bulk 9     19 non-null object
Bulk 10    176 non-null object
Bulk 11    177 non-null object
Bulk 12    2450 non-null object
Bulk 13    18 non-null object
Bulk 14    2806 non-null object
Bulk 15    2248 non-null object
dtypes: int64(1), object(15)
memory usage: 391.2+ KB


In [None]:
df_bulk_time.describe()

Unnamed: 0,key
count,3129.0
mean,1624.383509
std,933.337642
min,1.0
25%,816.0
50%,1622.0
75%,2431.0
max,3241.0


In [None]:
df_bulk_time.describe(include='object')

Unnamed: 0,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
count,252,22,1298,1014,77,576,25,1,19,176,177,2450,18,2806,2248
unique,252,22,1298,1014,77,576,25,1,19,176,177,2450,18,2806,2248
top,2019-06-07 01:29:04,2019-07-23 14:35:55,2019-08-29 00:16:34,2019-08-29 00:16:34,2019-07-01 22:19:07,2019-05-10 05:30:27,2019-08-08 08:46:28,2019-07-05 17:46:11,2019-05-14 04:38:14,2019-08-16 03:05:53,2019-08-08 23:47:00,2019-06-25 17:10:43,2019-08-24 11:33:10,2019-06-13 07:35:15,2019-06-25 17:10:43
freq,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


In [None]:
missing_keys_bulk_time = []

for i in range(1, 3242):
    if i not in df_bulk_time['key'].unique():
        missing_keys_bulk_time.append(i)
len(missing_keys_bulk_time)

112

In [None]:
print(missing_keys_bulk_time)

[41, 42, 51, 52, 53, 54, 55, 56, 72, 80, 81, 110, 151, 188, 195, 225, 269, 302, 330, 331, 332, 343, 350, 355, 382, 506, 529, 540, 607, 661, 683, 710, 766, 830, 874, 931, 933, 934, 960, 961, 964, 966, 983, 984, 1062, 1105, 1133, 1221, 1268, 1300, 1334, 1402, 1437, 1517, 1518, 1535, 1566, 1623, 1656, 1783, 1818, 1911, 1959, 1974, 1979, 2009, 2010, 2031, 2043, 2056, 2103, 2195, 2196, 2197, 2198, 2216, 2217, 2231, 2278, 2310, 2356, 2373, 2390, 2408, 2434, 2446, 2460, 2468, 2469, 2471, 2491, 2595, 2599, 2600, 2608, 2625, 2628, 2683, 2738, 2739, 2816, 2821, 2863, 2884, 2891, 3018, 3026, 3047, 3182, 3200, 3207, 3216]


In [None]:
df_bulk_time.duplicated().sum()

0

It looks like the null values in `df_bulk_time` match the null values in `df_bulk`.

The `Bulk` columns can be converted to datetime columns.

112 keys do not exist in `df_bulk_time`.

Duplicate entries were not found.
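
The impression that the nulls in the two bulk tables line up can be checked cell by cell rather than by comparing counts. The sketch below is one way to do it; `null_masks_match` is a hypothetical helper and the toy frames are made-up data with the same layout as `df_bulk` and `df_bulk_time`.

```python
import numpy as np
import pandas as pd

def null_masks_match(df_values, df_times, key_col='key'):
    """Return True if both frames have nulls in exactly the same cells."""
    values_mask = df_values.set_index(key_col).sort_index().isna()
    times_mask = df_times.set_index(key_col).sort_index().isna()
    return values_mask.equals(times_mask)

# Toy frames (hypothetical data) mirroring the volume/time table layout.
toy_bulk = pd.DataFrame({
    'key': [1, 2],
    'Bulk 1': [np.nan, 43.0],
    'Bulk 2': [206.0, np.nan],
})
toy_bulk_time = pd.DataFrame({
    'key': [1, 2],
    'Bulk 1': [None, '2019-05-03 11:21:30'],
    'Bulk 2': ['2019-05-03 11:03:52', None],
})

print(null_masks_match(toy_bulk, toy_bulk_time))
```

The same call should work for `df_wire` and `df_wire_time`, assuming both tables share the same keys and column names.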

### `df_gas`: gas purge data

In [None]:
df_gas.head()

Unnamed: 0,key,Gas 1
0,1,29.749986
1,2,12.555561
2,3,28.554793
3,4,18.841219
4,5,5.413692


In [None]:
df_gas.tail()

Unnamed: 0,key,Gas 1
3234,3237,5.543905
3235,3238,6.745669
3236,3239,16.023518
3237,3240,11.863103
3238,3241,12.680959


In [None]:
df_gas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3239 entries, 0 to 3238
Data columns (total 2 columns):
key      3239 non-null int64
Gas 1    3239 non-null float64
dtypes: float64(1), int64(1)
memory usage: 50.7 KB


In [None]:
df_gas.describe()

Unnamed: 0,key,Gas 1
count,3239.0,3239.0
mean,1621.861377,11.002062
std,935.386334,6.220327
min,1.0,0.008399
25%,812.5,7.043089
50%,1622.0,9.836267
75%,2431.5,13.769915
max,3241.0,77.99504


In [None]:
missing_keys_gas = []

for i in range(1, 3242):
    if i not in df_gas['key'].unique():
        missing_keys_gas.append(i)
len(missing_keys_gas)

2

In [None]:
print(missing_keys_gas)

[193, 259]


In [None]:
df_gas.duplicated().sum()

0

There are 2 keys missing from `df_gas`.

No null values exist for the keys that are in the dataset.

Duplicate entries were not found.

### `df_temp`: temperature measurement results

In [None]:
df_temp.head(15)

Unnamed: 0,key,Sampling time,Temperature
0,1,2019-05-03 11:16:18,1571.0
1,1,2019-05-03 11:25:53,1604.0
2,1,2019-05-03 11:29:11,1618.0
3,1,2019-05-03 11:30:01,1601.0
4,1,2019-05-03 11:30:39,1613.0
5,2,2019-05-03 11:37:27,1581.0
6,2,2019-05-03 11:38:00,1577.0
7,2,2019-05-03 11:49:38,1589.0
8,2,2019-05-03 11:55:50,1604.0
9,2,2019-05-03 11:58:24,1608.0


In [None]:
df_temp.tail(15)

Unnamed: 0,key,Sampling time,Temperature
15892,3239,2019-09-01 02:23:02,
15893,3239,2019-09-01 02:24:15,
15894,3240,2019-09-01 02:39:01,1617.0
15895,3240,2019-09-01 02:48:33,
15896,3240,2019-09-01 03:03:21,
15897,3240,2019-09-01 03:12:19,
15898,3240,2019-09-01 03:19:09,
15899,3240,2019-09-01 03:31:27,
15900,3240,2019-09-01 03:34:31,
15901,3240,2019-09-01 03:35:16,


In [None]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15907 entries, 0 to 15906
Data columns (total 3 columns):
key              15907 non-null int64
Sampling time    15907 non-null object
Temperature      13006 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 372.9+ KB


In [None]:
df_temp.describe()

Unnamed: 0,key,Temperature
count,15907.0,13006.0
mean,1607.88087,1591.84092
std,942.212073,21.375851
min,1.0,1191.0
25%,790.0,1581.0
50%,1618.0,1591.0
75%,2427.0,1601.0
max,3241.0,1705.0


In [None]:
df_temp.describe(include='object')

Unnamed: 0,Sampling time
count,15907
unique,15907
top,2019-07-01 14:19:51
freq,1


In [None]:
df_temp[df_temp['Temperature'].isnull()]

Unnamed: 0,key,Sampling time,Temperature
12268,2500,2019-08-06 03:24:43,
12269,2500,2019-08-06 03:25:16,
12270,2500,2019-08-06 03:28:21,
12272,2501,2019-08-06 04:01:59,
12273,2501,2019-08-06 04:14:35,
...,...,...,...
15901,3240,2019-09-01 03:35:16,
15903,3241,2019-09-01 04:16:12,
15904,3241,2019-09-01 04:22:39,
15905,3241,2019-09-01 04:33:42,


In [None]:
missing_keys_temp = []
for i in range(1, 3242):
    if i not in df_temp['key'].unique():
        missing_keys_temp.append(i)
len(missing_keys_temp)

25

In [None]:
print(missing_keys_temp)

[41, 42, 355, 382, 506, 529, 540, 607, 683, 710, 766, 1133, 1300, 1437, 2031, 2103, 2278, 2356, 2373, 2446, 2469, 2491, 2683, 3200, 3207]


In [None]:
df_temp.duplicated().sum()

0

There are 2901 null values in the `Temperature` column, all in entries with a key of 2500 or higher.

The `Sampling time` column would be better stored as a datetime column.

25 keys are missing in `df_temp`.

Duplicate entries were not found.

### `df_wire`: wire materials data (volume)

In [None]:
df_wire.head()

Unnamed: 0,key,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
0,1,60.059998,,,,,,,,
1,2,96.052315,,,,,,,,
2,3,91.160157,,,,,,,,
3,4,89.063515,,,,,,,,
4,5,89.238236,9.11456,,,,,,,


In [None]:
df_wire.tail()

Unnamed: 0,key,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
3076,3237,38.088959,,,,,,,,
3077,3238,56.128799,,,,,,,,
3078,3239,143.357761,,,,,,,,
3079,3240,34.0704,,,,,,,,
3080,3241,63.117595,,,,,,,,


In [None]:
df_wire.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3081 entries, 0 to 3080
Data columns (total 10 columns):
key       3081 non-null int64
Wire 1    3055 non-null float64
Wire 2    1079 non-null float64
Wire 3    63 non-null float64
Wire 4    14 non-null float64
Wire 5    1 non-null float64
Wire 6    73 non-null float64
Wire 7    11 non-null float64
Wire 8    19 non-null float64
Wire 9    29 non-null float64
dtypes: float64(9), int64(1)
memory usage: 240.8 KB


In [None]:
df_wire.describe()

Unnamed: 0,key,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
count,3081.0,3055.0,1079.0,63.0,14.0,1.0,73.0,11.0,19.0,29.0
mean,1623.426485,100.895853,50.577323,189.482681,57.442841,15.132,48.016974,10.039007,53.625193,34.155752
std,932.996726,42.012518,39.320216,99.513444,28.824667,,33.919845,8.610584,16.881728,19.931616
min,1.0,1.9188,0.03016,0.144144,24.148801,15.132,0.03432,0.234208,45.076721,4.6228
25%,823.0,72.115684,20.19368,95.135044,40.807002,15.132,25.0536,6.762756,46.094879,22.058401
50%,1619.0,100.158234,40.142956,235.194977,45.234282,15.132,42.076324,9.017009,46.279999,30.066399
75%,2434.0,126.060484,70.227558,276.252014,76.124619,15.132,64.212723,11.886057,48.089603,43.862003
max,3241.0,330.314424,282.780152,385.008668,113.231044,15.132,180.454575,32.847674,102.762401,90.053604


In [None]:
missing_keys_wire = []

for i in range(1, 3242):
    if i not in df_wire['key'].unique():
        missing_keys_wire.append(i)
len(missing_keys_wire)

160

In [None]:
print(missing_keys_wire)

[41, 42, 51, 52, 53, 54, 55, 56, 81, 82, 83, 84, 85, 88, 109, 195, 197, 209, 210, 211, 212, 269, 330, 331, 332, 355, 375, 376, 377, 378, 382, 506, 529, 540, 607, 683, 710, 711, 712, 713, 714, 715, 744, 748, 754, 755, 766, 796, 797, 798, 799, 800, 841, 929, 930, 931, 932, 933, 934, 1102, 1103, 1104, 1105, 1106, 1107, 1133, 1184, 1300, 1379, 1380, 1437, 1525, 1526, 1527, 1528, 1564, 1565, 1566, 1646, 1743, 1744, 1745, 1746, 1747, 1748, 1758, 1817, 1818, 1836, 1946, 1977, 1978, 1979, 2010, 2031, 2043, 2103, 2195, 2196, 2197, 2198, 2214, 2215, 2216, 2217, 2218, 2219, 2236, 2238, 2278, 2356, 2360, 2367, 2368, 2369, 2370, 2373, 2388, 2389, 2390, 2391, 2392, 2393, 2446, 2469, 2491, 2624, 2625, 2626, 2627, 2628, 2629, 2683, 2788, 2789, 2790, 2791, 2792, 2814, 2815, 2846, 2847, 2848, 2849, 2850, 2863, 2871, 2872, 2873, 2874, 2875, 2876, 3035, 3036, 3037, 3038, 3039, 3040, 3200, 3207]


In [None]:
df_wire.duplicated().sum()

0

Just like in the `df_bulk` dataset, the null values present in `df_wire` likely represent instances where 0 of the respective wire material was used.

160 keys are missing from `df_wire`.

Duplicated entries were not found.

### `df_wire_time`: wire materials data (time)

In [None]:
df_wire_time.head()

Unnamed: 0,key,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
0,1,2019-05-03 11:11:41,,,,,,,,
1,2,2019-05-03 11:46:10,,,,,,,,
2,3,2019-05-03 12:13:47,,,,,,,,
3,4,2019-05-03 12:48:05,,,,,,,,
4,5,2019-05-03 13:18:15,2019-05-03 13:32:06,,,,,,,


In [None]:
df_wire_time.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3081 entries, 0 to 3080
Data columns (total 10 columns):
key       3081 non-null int64
Wire 1    3055 non-null object
Wire 2    1079 non-null object
Wire 3    63 non-null object
Wire 4    14 non-null object
Wire 5    1 non-null object
Wire 6    73 non-null object
Wire 7    11 non-null object
Wire 8    19 non-null object
Wire 9    29 non-null object
dtypes: int64(1), object(9)
memory usage: 240.8+ KB


In [None]:
df_wire_time.describe()

Unnamed: 0,key
count,3081.0
mean,1623.426485
std,932.996726
min,1.0
25%,823.0
50%,1619.0
75%,2434.0
max,3241.0


In [None]:
df_wire_time.describe(include='object')

Unnamed: 0,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
count,3055,1079,63,14,1,73,11,19,29
unique,3055,1079,63,14,1,73,11,19,29
top,2019-08-05 22:06:31,2019-05-22 00:29:24,2019-07-23 18:07:01,2019-07-23 18:09:32,2019-08-08 16:01:07,2019-05-08 18:52:53,2019-08-08 18:27:34,2019-07-18 18:58:07,2019-07-02 08:17:45
freq,1,1,1,1,1,1,1,1,1


In [None]:
missing_keys_wire_time = []

for i in range(1, 3242):
    if i not in df_wire_time['key'].unique():
        missing_keys_wire_time.append(i)
len(missing_keys_wire_time)

160

In [None]:
print(missing_keys_wire_time)

[41, 42, 51, 52, 53, 54, 55, 56, 81, 82, 83, 84, 85, 88, 109, 195, 197, 209, 210, 211, 212, 269, 330, 331, 332, 355, 375, 376, 377, 378, 382, 506, 529, 540, 607, 683, 710, 711, 712, 713, 714, 715, 744, 748, 754, 755, 766, 796, 797, 798, 799, 800, 841, 929, 930, 931, 932, 933, 934, 1102, 1103, 1104, 1105, 1106, 1107, 1133, 1184, 1300, 1379, 1380, 1437, 1525, 1526, 1527, 1528, 1564, 1565, 1566, 1646, 1743, 1744, 1745, 1746, 1747, 1748, 1758, 1817, 1818, 1836, 1946, 1977, 1978, 1979, 2010, 2031, 2043, 2103, 2195, 2196, 2197, 2198, 2214, 2215, 2216, 2217, 2218, 2219, 2236, 2238, 2278, 2356, 2360, 2367, 2368, 2369, 2370, 2373, 2388, 2389, 2390, 2391, 2392, 2393, 2446, 2469, 2491, 2624, 2625, 2626, 2627, 2628, 2629, 2683, 2788, 2789, 2790, 2791, 2792, 2814, 2815, 2846, 2847, 2848, 2849, 2850, 2863, 2871, 2872, 2873, 2874, 2875, 2876, 3035, 3036, 3037, 3038, 3039, 3040, 3200, 3207]


In [None]:
df_wire_time.duplicated().sum()

0

The null values in `df_wire_time` appear to match the null values in `df_wire`.

The `Wire` columns can also be converted to datetime columns.

160 keys are missing from `df_wire_time`.

No duplicates were found.

## Conclusion

7 files were opened and examined:

- `data_arc_en.csv` — electrode data
- `data_bulk_en.csv` — bulk material supply data (volume)
- `data_bulk_time_en.csv` — bulk material delivery data (time)
- `data_gas_en.csv` — gas purge data
- `data_temp_en.csv` — temperature measurement results
- `data_wire_en.csv` — wire materials data (volume)
- `data_wire_time_en.csv` — wire materials data (time)

In each file, the `key` column contains the batch number.

Within these datasets, several things were noticed:

Regarding null values,

- There are no null values found in `df_arc`.
- There are many null values in `df_bulk`. They most likely represent batches where none of the respective bulk material was added, since the minimum value of every column is greater than 0.
- It looks like the null values in `df_bulk_time` match the null values in `df_bulk`.
- No null values exist for the keys that are in `df_gas`.
- There are 2901 null values in the `Temperature` column in `df_temp`, all in entries with a key of 2500 or higher.
- Just like in the `df_bulk` dataset, the null values present in `df_wire` likely represent instances where 0 of the respective wire material was used.
- The null values in `df_wire_time` appear to match the null values in `df_wire`.


Regarding missing keys,

- 27 keys are missing from `df_arc`.
- 112 keys do not exist in `df_bulk`.
- 112 keys do not exist in `df_bulk_time`.
- There are 2 keys missing from `df_gas`.
- 25 keys are missing in `df_temp`.
- 160 keys are missing from `df_wire`.
- 160 keys are missing from `df_wire_time`.


Some columns in the provided datasets would be better stored with the `datetime` dtype:

- `Arc heating start` and `Arc heating end` in `df_arc`
- All `Bulk` columns in `df_bulk_time`
- `Sampling time` in `df_temp`
- All `Wire` columns in `df_wire_time`


There is one negative value in `Reactive power` for the `df_arc` dataset.

No duplicates were found in any of the 7 datasets.

## Plan for solving the task

1. Handle the null values and consider the effect of the missing keys from the datasets
2. Convert columns into the appropriate datatypes
3. Combine the data into one dataset and possibly create new features
4. Split the data into training, validation, and test sets
5. Build models and choose the best one based on MAE

## Data Preprocessing

In [None]:
for df, name in [(df_arc, 'df_arc'), 
                 (df_bulk, 'df_bulk'), 
                 (df_bulk_time, 'df_bulk_time'), 
                 (df_gas, 'df_gas'), 
                 (df_temp, 'df_temp'), 
                 (df_wire, 'df_wire'), 
                 (df_wire_time, 'df_wire_time')]:
    print("% removed from", name, ":",  len(df[df['key'] >= 2500].index) / len(df.index))
    df.drop(df[df['key'] >= 2500].index, inplace = True)

% removed from df_arc : 0.2303710674912611
% removed from df_bulk : 0.23042505592841164
% removed from df_bulk_time : 0.23042505592841164
% removed from df_gas : 0.22908305032417411
% removed from df_temp : 0.22883007480983214
% removed from df_wire : 0.22979552093476144
% removed from df_wire_time : 0.22979552093476144


We drop entries with a key of 2500 and up because most of their temperature values, which are the target of this study, are missing. It appears that only the first temperature measurement was saved for iterations 2500 and up.
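
One way to back up the observation about iterations 2500 and up is to count the recorded (non-null) temperatures per key. The sketch below uses a made-up toy frame, not the real `df_temp`; `groupby(...).count()` ignores nulls, so a count of 1 means only a single measurement survived.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): key 2499 is complete, key 2500 keeps only its first reading.
toy_temp = pd.DataFrame({
    'key': [2499, 2499, 2499, 2500, 2500, 2500],
    'Temperature': [1569.0, 1590.0, 1603.0, 1539.0, np.nan, np.nan],
})

# Number of non-null temperature readings per iteration.
measured = toy_temp.groupby('key')['Temperature'].count()
print(measured)

# Keys that have at most one usable reading cannot supply both an initial and a final temperature.
unusable_keys = measured[measured <= 1].index.tolist()
print(unusable_keys)
```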

In [None]:
df_bulk = df_bulk.fillna(0)

In [None]:
df_wire = df_wire.fillna(0)

NA values in `df_bulk` and `df_wire` were filled in with 0's, since these are likely instances where 0 of the respective material was used.

The NA values in `df_bulk_time` and `df_wire_time` will be left alone because they appear irrelevant to the target result.

### Handling the missing keys

In [None]:
missing_keys_all = list(set(missing_keys_arc + missing_keys_bulk + missing_keys_gas + missing_keys_temp + missing_keys_wire))
missing_keys_filtered = []

for num in missing_keys_all:
    if num < 2500:
        missing_keys_filtered.append(num)

print(len(missing_keys_filtered), 'keys under 2500 are missing data.')

170 keys under 2500 are missing data.


In [None]:
for df in [df_arc, df_bulk, df_gas, df_temp, df_wire]:
    for num in missing_keys_filtered:
        df.drop(df[df['key'] == num].index, inplace = True)

The 170 keys that are absent from at least one of the five datasets were removed, since the corresponding iterations may lack measurements that affect the final result.
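
An alternative to collecting the missing keys table by table is to keep only the keys that appear in every table. The sketch below assumes a hypothetical helper `keys_in_all` and made-up toy frames; it illustrates the idea and is not the code used above.

```python
from functools import reduce

import pandas as pd

def keys_in_all(frames, key_col='key'):
    """Return the set of keys that occur in every frame."""
    key_sets = [set(frame[key_col]) for frame in frames]
    return reduce(set.intersection, key_sets)

# Toy frames (hypothetical data): only keys 1 and 3 occur in all three.
toy_arc = pd.DataFrame({'key': [1, 1, 2, 3]})
toy_gas = pd.DataFrame({'key': [1, 3, 4]})
toy_temp = pd.DataFrame({'key': [1, 2, 3, 3]})

common = keys_in_all([toy_arc, toy_gas, toy_temp])

# Filtering returns a new frame instead of dropping rows in place.
filtered_arc = toy_arc[toy_arc['key'].isin(common)]
print(sorted(common))
print(filtered_arc)
```

Because the filter is a single boolean mask per table, it avoids the nested loop over keys and tables.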

### Converting columns to the appropriate datatypes

In [None]:
df_arc['Arc heating start'] = pd.to_datetime(df_arc['Arc heating start'])

In [None]:
df_arc['Arc heating end'] = pd.to_datetime(df_arc['Arc heating end'])

In [None]:
df_temp['Sampling time'] = pd.to_datetime(df_temp['Sampling time'])
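
Should the time tables be needed later, their columns could be converted in a loop rather than one cell per column. This is a minimal sketch on a made-up toy frame with the same layout as `df_bulk_time`; missing entries become `NaT`.

```python
import pandas as pd

# Toy frame (hypothetical data) shaped like df_bulk_time.
toy_time = pd.DataFrame({
    'key': [1, 2],
    'Bulk 1': ['2019-05-03 11:21:30', None],
    'Bulk 2': [None, '2019-05-03 11:40:20'],
})

# Convert every column except the key; nulls are kept as NaT.
time_cols = [col for col in toy_time.columns if col != 'key']
for col in time_cols:
    toy_time[col] = pd.to_datetime(toy_time[col])

print(toy_time.dtypes)
```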

### Handling negative values in `df_arc`

In [None]:
df_arc[df_arc['Reactive power'] < 0]

Unnamed: 0,key,Arc heating start,Arc heating end,Active power,Reactive power
9780,2116,2019-07-24 00:44:48,2019-07-24 00:46:37,0.495782,-715.504924


In [None]:
for df in [df_arc, df_bulk, df_gas, df_temp, df_wire]:
    df.drop(df[df['key'] == 2116].index, inplace = True)

Iteration 2116 contains a negative value for `Reactive power`, possibly due to a mistake in measurement. The whole iteration is removed from all five datasets so that the anomalous reading cannot distort the aggregated features.
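
Rather than hard-coding the key, the affected iterations can be looked up from the data and filtered out with `isin`. The sketch below runs on made-up toy frames; apart from the anomalous reading, the values are invented for illustration.

```python
import pandas as pd

# Toy frames (hypothetical data) with one negative reactive power reading.
toy_arc = pd.DataFrame({
    'key': [2115, 2116, 2116, 2117],
    'Reactive power': [0.41, 0.35, -715.504924, 0.52],
})
toy_gas = pd.DataFrame({'key': [2115, 2116, 2117], 'Gas 1': [9.8, 11.0, 7.0]})

# Every key that has at least one negative reading.
bad_keys = toy_arc.loc[toy_arc['Reactive power'] < 0, 'key'].unique()

# Remove those keys from each table without modifying the originals.
toy_arc_clean = toy_arc[~toy_arc['key'].isin(bad_keys)]
toy_gas_clean = toy_gas[~toy_gas['key'].isin(bad_keys)]
print(bad_keys)
print(toy_arc_clean)
```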

In [None]:
len(pd.unique(df_temp['key']))

2328

2328 iterations remain after handling the missing data.

### Consolidating the data

In [None]:
df = pd.DataFrame({'key': pd.unique(df_temp['key'])})
df.set_index('key', inplace=True)

In [None]:
df['Initial Temperature'] = df_temp.pivot_table(index='key', values='Temperature', aggfunc='first')
df['Final Temperature'] = df_temp.pivot_table(index='key', values='Temperature', aggfunc='last')

df['Active Power'] = df_arc.pivot_table(index='key', values='Active power', aggfunc='sum')
df['Reactive Power'] = df_arc.pivot_table(index='key', values='Reactive power', aggfunc='sum')
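
The `first` and `last` aggregations above depend on the rows of `df_temp` already being in chronological order within each key. A more defensive sketch sorts by sampling time first; the toy frame is deliberately shuffled, and its layout (not this code) is taken from `df_temp`.

```python
import pandas as pd

# Toy frame (hypothetical ordering): rows for key 1 are intentionally out of order.
toy_temp = pd.DataFrame({
    'key': [1, 1, 1, 2, 2],
    'Sampling time': pd.to_datetime([
        '2019-05-03 11:25:53',
        '2019-05-03 11:16:18',
        '2019-05-03 11:30:39',
        '2019-05-03 11:37:27',
        '2019-05-03 11:58:24',
    ]),
    'Temperature': [1604.0, 1571.0, 1613.0, 1581.0, 1602.0],
})

# Sort chronologically within each key, then take the first and last readings.
ordered = toy_temp.sort_values(['key', 'Sampling time'])
temps = (
    ordered.groupby('key')['Temperature']
    .agg(['first', 'last'])
    .rename(columns={'first': 'Initial Temperature', 'last': 'Final Temperature'})
)
print(temps)
```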

In [None]:
df = df.merge(df_bulk, on='key')
df = df.merge(df_gas, on='key')
df = df.merge(df_wire, on='key')

In [None]:
df.head()

Unnamed: 0,key,Initial Temperature,Final Temperature,Active Power,Reactive Power,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,...,Gas 1,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
0,1,1571.0,1613.0,4.878147,3.183241,0.0,0.0,0.0,43.0,0.0,...,29.749986,60.059998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,1581.0,1602.0,3.052598,1.998112,0.0,0.0,0.0,73.0,0.0,...,12.555561,96.052315,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,1596.0,1599.0,2.525882,1.599076,0.0,0.0,0.0,34.0,0.0,...,28.554793,91.160157,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,1601.0,1625.0,3.20925,2.060298,0.0,0.0,0.0,81.0,0.0,...,18.841219,89.063515,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,1576.0,1602.0,3.347173,2.252643,0.0,0.0,0.0,78.0,0.0,...,5.413692,89.238236,9.11456,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df.tail()

Unnamed: 0,key,Initial Temperature,Final Temperature,Active Power,Reactive Power,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,...,Gas 1,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
2323,2495,1570.0,1591.0,3.21069,2.360777,0.0,0.0,21.0,0.0,0.0,...,7.125735,89.150879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2324,2496,1554.0,1591.0,4.203064,2.810185,0.0,0.0,0.0,63.0,0.0,...,9.412616,114.179527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2325,2497,1571.0,1589.0,2.212379,1.851269,0.0,0.0,0.0,85.0,0.0,...,6.271699,94.086723,9.048,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2326,2498,1591.0,1594.0,3.408725,2.355428,0.0,0.0,90.0,0.0,0.0,...,14.953657,118.110717,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2327,2499,1569.0,1603.0,4.098431,2.777865,0.0,0.0,47.0,0.0,0.0,...,11.336151,110.160958,50.00528,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2328 entries, 0 to 2327
Data columns (total 30 columns):
key                    2328 non-null int64
Initial Temperature    2328 non-null float64
Final Temperature      2328 non-null float64
Active Power           2328 non-null float64
Reactive Power         2328 non-null float64
Bulk 1                 2328 non-null float64
Bulk 2                 2328 non-null float64
Bulk 3                 2328 non-null float64
Bulk 4                 2328 non-null float64
Bulk 5                 2328 non-null float64
Bulk 6                 2328 non-null float64
Bulk 7                 2328 non-null float64
Bulk 8                 2328 non-null float64
Bulk 9                 2328 non-null float64
Bulk 10                2328 non-null float64
Bulk 11                2328 non-null float64
Bulk 12                2328 non-null float64
Bulk 13                2328 non-null float64
Bulk 14                2328 non-null float64
Bulk 15                2328 non-null float64

### Creating the target and features

In [None]:
target = df['Final Temperature']
features = df.drop(['Final Temperature', 'key'], axis=1)

In [None]:
features.head()

Unnamed: 0,Initial Temperature,Active Power,Reactive Power,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,...,Gas 1,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
0,1571.0,4.878147,3.183241,0.0,0.0,0.0,43.0,0.0,0.0,0.0,...,29.749986,60.059998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1581.0,3.052598,1.998112,0.0,0.0,0.0,73.0,0.0,0.0,0.0,...,12.555561,96.052315,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1596.0,2.525882,1.599076,0.0,0.0,0.0,34.0,0.0,0.0,0.0,...,28.554793,91.160157,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1601.0,3.20925,2.060298,0.0,0.0,0.0,81.0,0.0,0.0,0.0,...,18.841219,89.063515,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1576.0,3.347173,2.252643,0.0,0.0,0.0,78.0,0.0,0.0,0.0,...,5.413692,89.238236,9.11456,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Splitting the data into a 3:1:1 ratio

In [None]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.4, random_state = 12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, target_valid, test_size=0.5, shuffle = False)

In [None]:
print("--- Train Sizes (Rows, Columns) ---")
print("target_train:", target_train.shape)
print("features_train:", features_train.shape)
print("")
print("--- Valid Sizes (Rows, Columns) ---")
print("target_valid:", target_valid.shape)
print("features_valid:", features_valid.shape)
print("")
print("--- Test Sizes (Rows, Columns) ---")
print("target_test:", target_test.shape)
print("features_test:", features_test.shape)

--- Train Sizes (Rows, Columns) ---
target_train: (1396,)
features_train: (1396, 28)

--- Valid Sizes (Rows, Columns) ---
target_valid: (466,)
features_valid: (466, 28)

--- Test Sizes (Rows, Columns) ---
target_test: (466,)
features_test: (466, 28)
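
The printed shapes follow from simple arithmetic on the 2328 rows. The sketch below reproduces them without scikit-learn, assuming that a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the other part, which is consistent with the sizes shown above. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# First split: 60% train, 40% held out.
n_train, n_holdout = split_sizes(2328, 0.4)
# Second split: the held-out part is halved into validation and test.
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)  # 1396 466 466
```

The result is close to, but not exactly, 3:1:1 because 2328 is not divisible by 5.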


## Model Training

#### Linear Regression

In [None]:
lr_model = LinearRegression()
lr_model.fit(features_train, target_train)
predicted_valid = lr_model.predict(features_valid)
print('MAE:', mean_absolute_error(target_valid, predicted_valid))

MAE: 6.511423291455314
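
To put an MAE value in context, it can help to compare it with a constant baseline that always predicts the median of the training target, since the median is the constant that minimizes MAE. The sketch below computes MAE by hand with NumPy on a few toy values borrowed from the rows displayed earlier; it is an illustration, not a result from the real split.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy targets (a handful of final temperatures, used only for illustration).
toy_train_target = np.array([1613.0, 1602.0, 1599.0, 1625.0, 1602.0])
toy_valid_target = np.array([1591.0, 1591.0, 1589.0])

# Constant baseline: predict the training median for every validation row.
baseline_value = float(np.median(toy_train_target))
baseline_pred = np.full(toy_valid_target.shape, baseline_value)

print('Baseline value:', baseline_value)
print('Baseline MAE:', mae(toy_valid_target, baseline_pred))
```

A trained model is only useful if its validation MAE is clearly below that of such a baseline computed on the real data.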


#### Random Forest Regressor

In [None]:
for num in range(1, 20, 5):
    rf_model = RandomForestRegressor(n_estimators=num, random_state=99)
    rf_model.fit(features_train, target_train)
    predicted_valid = rf_model.predict(features_valid)
    print('--- n_estimators:', num, '---')
    print('MAE:', mean_absolute_error(target_valid, predicted_valid))
    print('')

--- n_estimators: 1 ---
MAE: 8.581545064377682

--- n_estimators: 6 ---
MAE: 6.539699570815449

--- n_estimators: 11 ---
MAE: 6.317986734295744

--- n_estimators: 16 ---
MAE: 6.180391630901288



### LightGBM Regressor

In [None]:
for num in range(50, 150, 25):
    for rate in [.25, .5, .75]:
        lg_model = lgb.LGBMRegressor(n_estimators=num, learning_rate=rate, random_state=99)
        lg_model.fit(features_train, target_train)
        predicted_valid = lg_model.predict(features_valid)
        print('--- n_estimators:', num, '| learning_rate:', rate, '---')
        print('MAE:', mean_absolute_error(target_valid, predicted_valid))
        print('')

--- n_estimators: 50 | learning_rate: 0.25 ---
MAE: 6.487968532513309

--- n_estimators: 50 | learning_rate: 0.5 ---
MAE: 7.04129788729827

--- n_estimators: 50 | learning_rate: 0.75 ---
MAE: 7.4861533853318

--- n_estimators: 75 | learning_rate: 0.25 ---
MAE: 6.604644851032791

--- n_estimators: 75 | learning_rate: 0.5 ---
MAE: 7.091468223351901

--- n_estimators: 75 | learning_rate: 0.75 ---
MAE: 7.5230490971111905

--- n_estimators: 100 | learning_rate: 0.25 ---
MAE: 6.6571988136924265

--- n_estimators: 100 | learning_rate: 0.5 ---
MAE: 7.080274841940039

--- n_estimators: 100 | learning_rate: 0.75 ---
MAE: 7.538509846036559

--- n_estimators: 125 | learning_rate: 0.25 ---
MAE: 6.698095815014557

--- n_estimators: 125 | learning_rate: 0.5 ---
MAE: 7.089066179657614

--- n_estimators: 125 | learning_rate: 0.75 ---
MAE: 7.535745699323372



### CatBoost Regressor

In [None]:
for num in range(50, 150, 25):
    for rate in [.25, .5, .75]:
        cb_model = CatBoostRegressor(n_estimators=num, learning_rate=rate, random_state=99)
        cb_model.fit(features_train, target_train)
        predicted_valid = cb_model.predict(features_valid)
        print('--- n_estimators:', num, '| learning_rate:', rate, '---')
        print('MAE:', mean_absolute_error(target_valid, predicted_valid))
        print('')

0:	learn: 10.4448476	total: 49.5ms	remaining: 2.42s
1:	learn: 9.8545054	total: 51.9ms	remaining: 1.25s
2:	learn: 9.3295637	total: 53.8ms	remaining: 844ms
3:	learn: 9.0061328	total: 57.4ms	remaining: 661ms
4:	learn: 8.7320155	total: 123ms	remaining: 1.11s
5:	learn: 8.5639072	total: 126ms	remaining: 922ms
6:	learn: 8.3749868	total: 128ms	remaining: 784ms
7:	learn: 8.2289666	total: 129ms	remaining: 680ms
8:	learn: 8.0914192	total: 131ms	remaining: 598ms
9:	learn: 7.9459908	total: 133ms	remaining: 532ms
10:	learn: 7.8465215	total: 221ms	remaining: 784ms
11:	learn: 7.7647353	total: 223ms	remaining: 706ms
12:	learn: 7.6471075	total: 225ms	remaining: 641ms
13:	learn: 7.5694267	total: 227ms	remaining: 583ms
14:	learn: 7.4822021	total: 229ms	remaining: 535ms
15:	learn: 7.4594873	total: 232ms	remaining: 493ms
16:	learn: 7.3920524	total: 319ms	remaining: 620ms
17:	learn: 7.3445014	total: 322ms	remaining: 572ms
18:	learn: 7.2595547	total: 323ms	remaining: 528ms
19:	learn: 7.2123486	total: 325ms	re

In this section, I compared the validation MAE of four models while tuning their hyperparameters:

1. `LinearRegression` - MAE: 6.51
2. `RandomForestRegressor` - MAE: 6.18, with `n_estimators=16`
3. `LGBMRegressor` - MAE: 6.49, with `n_estimators=50` and `learning_rate=0.25`
4. `CatBoostRegressor` - MAE: 6.13, with `n_estimators=75` and `learning_rate=0.25`

The CatBoost Regressor model performed the best, with a validation MAE of 6.13. Its best hyperparameters will be used for the final model.
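
The comparison can also be done programmatically instead of reading the printed output by eye. The sketch below is only an illustration: the `results` list is a hypothetical structure, filled in by hand with the best validation MAE per model from the outputs above (the CatBoost value is the rounded figure quoted in the summary, since its full output is not shown).

```python
# Best validation MAE per model, copied from the outputs in this section.
# The CatBoost entry uses the rounded value from the summary.
results = [
    {"model": "LinearRegression", "params": {}, "mae": 6.511423291455314},
    {"model": "RandomForestRegressor", "params": {"n_estimators": 16}, "mae": 6.180391630901288},
    {"model": "LGBMRegressor", "params": {"n_estimators": 50, "learning_rate": 0.25}, "mae": 6.487968532513309},
    {"model": "CatBoostRegressor", "params": {"n_estimators": 75, "learning_rate": 0.25}, "mae": 6.13},
]

# Lower MAE is better, so the winner is the entry with the smallest value
best = min(results, key=lambda r: r["mae"])

for r in sorted(results, key=lambda r: r["mae"]):
    print(f"{r['model']:<24} MAE: {r['mae']:.2f}  params: {r['params']}")
print("Best model:", best["model"])
```

In a tuning loop, the same idea would amount to appending one such dictionary per fitted configuration and taking the minimum at the end.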

## Training the Final Model

In [None]:
cb_final = CatBoostRegressor(n_estimators=75, learning_rate=0.25, random_state=99)
cb_final.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))
predicted_test = cb_final.predict(features_test)
print('Final MAE:', mean_absolute_error(target_test, predicted_test))

0:	learn: 10.4459480	total: 2.59ms	remaining: 192ms
1:	learn: 10.0122318	total: 4.55ms	remaining: 166ms
2:	learn: 9.6156969	total: 6.81ms	remaining: 163ms
3:	learn: 9.2223687	total: 9.11ms	remaining: 162ms
4:	learn: 8.9764495	total: 11ms	remaining: 154ms
5:	learn: 8.7698758	total: 13ms	remaining: 149ms
6:	learn: 8.6063244	total: 20.3ms	remaining: 197ms
7:	learn: 8.4340449	total: 108ms	remaining: 901ms
8:	learn: 8.2865830	total: 111ms	remaining: 811ms
9:	learn: 8.2186542	total: 113ms	remaining: 737ms
10:	learn: 8.0917731	total: 118ms	remaining: 689ms
11:	learn: 7.9759391	total: 202ms	remaining: 1.06s
12:	learn: 7.9192583	total: 205ms	remaining: 978ms
13:	learn: 7.8543223	total: 209ms	remaining: 911ms
14:	learn: 7.7626233	total: 213ms	remaining: 850ms
15:	learn: 7.7002757	total: 216ms	remaining: 795ms
16:	learn: 7.6299608	total: 303ms	remaining: 1.03s
17:	learn: 7.5767137	total: 307ms	remaining: 972ms
18:	learn: 7.5402073	total: 311ms	remaining: 916ms
19:	learn: 7.4888793	total: 314ms	re

The final `CatBoostRegressor` model, trained on the combined training and validation sets with `n_estimators=75` and `learning_rate=0.25`, achieves an MAE of 5.785 on the test set.

## Solution Report

### What steps of the plan were performed and what steps were skipped (explain why)?

The plan was as follows:

1. Handle the null values and consider the effect of the missing keys from the datasets
2. Convert columns into the appropriate datatypes
3. Combine the data into one dataset and possibly create new features
4. Split the data into training, validation, and test sets
5. Build models and choose the best one based on MAE

No steps were skipped.

### What difficulties did you encounter and how did you manage to solve them?

The data preprocessing section took me the longest. It took time to understand the data in each table and work out how to preprocess it properly. Additionally, the target temperature was mixed in with the feature temperatures. I was able to extract the target using the pivot table function.
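
As a rough illustration of that step, the sketch below separates the first and last temperature measurement per `key` with `pivot_table`. It is not the exact code used earlier in the notebook: the toy rows, the column names `Sampling time` and `Temperature`, and the names `first_temp` and `last_temp` are all assumptions made for this example.

```python
import pandas as pd

# Toy measurements; column names and values are assumptions for illustration
toy_temp = pd.DataFrame({
    "key": [1, 1, 1, 2, 2],
    "Sampling time": pd.to_datetime([
        "2019-05-03 11:16:18", "2019-05-03 11:25:53", "2019-05-03 11:30:39",
        "2019-05-03 11:37:27", "2019-05-03 11:59:12",
    ]),
    "Temperature": [1571.0, 1604.0, 1613.0, 1581.0, 1602.0],
})

# Sort chronologically so 'first'/'last' mean earliest/latest measurement
toy_temp = toy_temp.sort_values(["key", "Sampling time"])

# One row per key, with the first and last temperature side by side
temps = toy_temp.pivot_table(index="key", values="Temperature", aggfunc=["first", "last"])
temps.columns = ["first_temp", "last_temp"]  # flatten the two-level column index

features_part = temps[["first_temp"]]  # initial temperature, usable as a feature
target = temps["last_temp"]            # final temperature, the prediction target
print(temps)
```

Keeping only the last measurement as the target and the first as a feature avoids leaking intermediate temperatures that sit close to the value being predicted.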

### What were some of the key steps to solving the task?

First, I familiarized myself with the project goal and the provided data. Next, I formulated a plan for the project. The data was preprocessed as thoroughly as possible in order to achieve good results. All of the necessary data was combined into a single dataset, which was then split into training, validation, and test sets. Several models were explored, and the one with the lowest validation MAE was chosen to solve the task.

### What is your final model and what quality score does it have?

The final model is a `CatBoostRegressor` with hyperparameters `n_estimators=75` and `learning_rate=0.25`. Its MAE on the test set is 5.785.