# Comprehensive problem set based on the dataset (`ccm_rul_dataset.csv`) that covers:

* **Summarizing and Computing Descriptive Statistics**
* **Correlation and Covariance**
* **Unique Values, Value Counts, and Membership**
* **Additional related concepts**: ranking, quantiles, percent change, rolling windows, and group-based stats.



## General Problem Statement

The goal of this problem set is to apply Pandas' capabilities for descriptive statistics, value analysis, and summary operations on a real-world manufacturing dataset. This will deepen understanding of practical data profiling and exploratory data analysis (EDA).



## Purpose of the Problems

These problems will help:

* Identify the structure and trends in large datasets.
* Analyze numeric, categorical, and time-series values.
* Extract insight using core Pandas statistical and summarization tools.
* Understand data variation, quality, and relationships across manufacturing attributes.



## Dataset Overview

**Dataset:** `ccm_rul_dataset.csv`

**Context:** This dataset comes from a continuous casting manufacturing (CCM) process. It includes timestamps, material compositions, operational parameters (temperatures, water flow, speed), and an important predictive column: **RUL** (Remaining Useful Life).

**Why it’s perfect for this topic:**

* Mixture of numeric, categorical, and time-related data.
* Contains missing values and varying distributions.
* Enables correlation analysis between physical properties and outcomes.
* Provides opportunities for statistical grouping, rolling analysis, and ranking.




In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('ccm_rul_dataset.csv')

In [3]:
df = data.copy()  # read_csv already returns a DataFrame; work on a copy of it

In [4]:
df.head(5)

Unnamed: 0,date,"workpiece_weight, tonn",steel_type,doc_requirement,cast_in_row,workpiece_slice_geometry,alloy_type,"steel_weight_theoretical, tonn","slag_weight_close_grab1, tonn","metal_residue_grab1, tonn",...,"Al, %","Ca, %","N, %","Pb, %","Mg, %","Zn, %",sleeve,num_crystallizer,num_stream,RUL
0,2020-01-05,144.9,Arm240,DOC 34028-2016,4,150x150,open,145.3,1.8,0.4,...,0.0022,0.0008,0.0085,0.0,0.0,0.0,30012261,22,4,384.0
1,2020-01-05,165.9,St3sp,Contract,10,150x150,open,166.3,1.8,0.4,...,0.0028,0.0004,0.0049,0.0,0.0,0.0,30013346,2,1,1037.0
2,2020-01-05,168.0,Arm240,DOC 34028-2016,5,150x150,open,168.4,1.8,0.4,...,0.0031,0.0011,0.0068,0.0,0.0,0.0,30012261,22,4,355.0
3,2020-01-05,170.1,St3sp,Contract,7,150x150,open,170.5,1.8,0.4,...,0.0034,0.0005,0.0051,0.0,0.0,0.0,30012261,22,4,300.0
4,2020-01-05,163.8,St3sp,Contract,12,150x150,open,164.2,1.8,0.4,...,0.0032,0.0004,0.0038,0.0,0.0,0.0,30012261,22,4,164.0


In [5]:
numeric_columns = df.select_dtypes(include=[np.number]).columns

In [6]:
numeric_columns

Index(['workpiece_weight, tonn', 'cast_in_row',
       'steel_weight_theoretical, tonn', 'slag_weight_close_grab1, tonn',
       'metal_residue_grab1, tonn', 'steel_weight, tonn',
       'residuals_grab2, tonn', 'technical_trim, tonn', 'grab1_num',
       'steel_temperature_grab1, Celsius deg.', 'grab2_num',
       'resistance, tonn', 'swing_frequency, amount/minute',
       'crystallizer_movement, mm', 'alloy_speed, meter/minute',
       'water_consumption, liter/minute',
       'water_temperature_delta, Celsius deg.',
       'water_consumption_secondary_cooling_zone_num1, liter/minute',
       'water_consumption_secondary_cooling_zone_num2, liter/minute',
       'water_consumption_secondary_cooling_zone_num3, liter/minute',
       'quantity, tonn', 'temperature_measurement1, Celsius deg.',
       'temperature_measurement2, Celsius deg.', 'Ce, %', 'C, %', 'Si, %',
       'Mn,%', 'S, %', 'P, %', 'Cr, %', 'Ni, %', 'Cu, %', 'As, %', 'Mo, %',
       'Nb, %', 'Sn, %', 'Ti, %', 'V, %', 'Al,

In [7]:
non_numeric_columns = df.select_dtypes(exclude=[np.number]).columns

In [8]:
non_numeric_columns

Index(['date', 'steel_type', 'doc_requirement', 'workpiece_slice_geometry',
       'alloy_type', 'kind', 'time_temperature_measurement1',
       'time_temperature_measurement2', 'sample_time_continuous_caster',
       'sleeve'],
      dtype='object')

## Problem Set (20–30 Problems)

### Part 1: Descriptive Statistics

1. Compute the mean, median, standard deviation, and interquartile range of `steel_weight, tonn`.


In [9]:
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

In [10]:
df['steel_weight, tonn'].mean()

np.float64(163.93864320402218)

In [11]:
df['steel_weight, tonn'].median()

np.float64(163.296)

In [12]:
df['steel_weight, tonn'].std()

np.float64(5.258551916587412)

In [13]:
q1 = df['steel_weight, tonn'].quantile(0.25)

In [14]:
q3 = df['steel_weight, tonn'].quantile(0.75)

In [15]:
IQR = q3 - q1

In [16]:
IQR

np.float64(6.2239999999999895)

### Interpretation
Mean ≈ 163.94 is the average steel weight per record, in tonnes.

Median ≈ 163.30 is close to the mean, which suggests a roughly symmetric distribution.

Standard deviation ≈ 5.26 is the typical spread of values around the mean.

IQR ≈ 6.22 means the middle 50% of values span only about 6.2 tonnes, indicating consistent weights.
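
The four statistics above can also be collected in a single summary. The following is a minimal sketch on a small made-up series; the values are illustrative and are not taken from `ccm_rul_dataset.csv`.

```python
import pandas as pd

# Illustrative values only; not taken from ccm_rul_dataset.csv.
weights = pd.Series(
    [158.0, 160.0, 162.0, 164.0, 166.0, 168.0, 170.0, 190.0],
    name='steel_weight, tonn',
)

# quantile() with a list returns both quartiles in one call.
q1, q3 = weights.quantile([0.25, 0.75])

summary = pd.Series({
    'mean': weights.mean(),
    'median': weights.median(),
    'std': weights.std(),   # sample standard deviation (ddof=1)
    'iqr': q3 - q1,
})
print(summary)
```

The single large value (190.0) pulls the mean above the median here, which is the kind of asymmetry the mean/median comparison is meant to reveal.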

2. Find the maximum and minimum values of `temperature_measurement1, Celsius deg.` and identify the corresponding dates.

In [17]:
max_temp = df['temperature_measurement1, Celsius deg.'].max()

In [18]:
min_temp = df['temperature_measurement2, Celsius deg.'].min()

In [19]:
max_temp_index = df['temperature_measurement1, Celsius deg.'].idxmax()

In [20]:
min_temp_index = df['temperature_measurement2, Celsius deg.'].idxmin()

In [21]:
max_date = df.loc[max_temp_index, 'date']

In [22]:
min_date = df.loc[min_temp_index, 'date']

In [23]:
temp_measurement = pd.DataFrame(
    {
        'Max_value': (max_date, max_temp),
        'Min_value': (min_date, min_temp),
    },
    index=['Date', 'Temperature °C']    
)

In [24]:
temp_measurement

Unnamed: 0,Max_value,Min_value
Date,2020-08-23,2020-08-07
Temperature °C,1939.0,1518.0


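### Interpretation
The table pairs each extreme value with the date on which it occurred, which is useful in quality control and process optimization.

Note that cells 18 and 20 read the minimum from `temperature_measurement2, Celsius deg.`, so the reported minimum (1518.0 on 2020-08-07) belongs to the second measurement. To answer the problem as stated, use `temperature_measurement1, Celsius deg.` in both cells.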
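
A compact way to read both extremes from one and the same column is sketched below. The frame `demo` is made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Illustrative frame; column names mirror the dataset, values are made up.
demo = pd.DataFrame({
    'date': ['2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08'],
    'temperature_measurement1, Celsius deg.': [1560.0, 1548.0, None, 1575.0],
})
col = 'temperature_measurement1, Celsius deg.'

# idxmax/idxmin skip NaN and return the index label of the first extreme.
extremes = demo.loc[[demo[col].idxmax(), demo[col].idxmin()], ['date', col]]
extremes.index = ['max', 'min']
print(extremes)
```

Using a single `col` variable for both lookups avoids mixing two measurement columns by accident.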

3. Compute the range (max - min) for `RUL`.

In [25]:
RUL_range = df['RUL'].max() - df['RUL'].min()

In [26]:
RUL_range

np.float64(8473111.0)

### Interpretation
The Remaining Useful Life values span about 8.47 million units. A range this wide suggests either outliers or a very long timescale, so the distribution of `RUL` should be checked before it is used in further analysis.

4. Determine how many rows have a `NaN` in `technical_trim, tonn` and compute its mean (excluding NaNs).

In [27]:
nan_count = df['technical_trim, tonn'].isna().sum()

In [28]:
nan_count

np.int64(17427)

In [29]:
mean_trim = df['technical_trim, tonn'].mean(skipna=True)

In [30]:
mean_trim

np.float64(5.271973684210526)

In [31]:
valid_data_count = df.index.shape[0] - nan_count

In [32]:
valid_data_count

np.int64(76)

In [33]:
sum_of_valid_values = df['technical_trim, tonn'].sum()

In [34]:
sum_of_valid_values

np.float64(400.66999999999996)

### Interpretation:
1. **Very sparse data**

   Only 76 of 17,503 records (≈ 0.43%) have a value, so this column is almost entirely missing.

   - That’s a red flag if this field is meant for modeling or decision-making.
   - Any statistic (such as the mean) reflects only those 76 cases, not the overall dataset.

2. **Mean technical trim ≈ 5.27 tonnes**

   Among the 76 records where technical trimming was recorded, the average material trimmed was 5.27 tonnes.

   - That’s a large trim, suggesting it occurs only under specific conditions (e.g., defects, misalignment, operational issues).
   - It may represent exception handling or quality control actions, not normal operation.

### Possible Insights:
- Investigate those 76 records: are they linked to a specific `steel_type`, `cast_in_row`, or high resistance values?

- Why is data missing in ≈ 99.6% of rows? Possible causes:

    - A process that's only used under specific conditions.

    - A sensor failure.

    - Post-processing data entry (only when trimming happens).

### Actionable Ideas:
- Don’t use this column as-is for modeling; it’s too sparse.

- It is still valuable for case-by-case analysis or as the source of a binary flag:

`df['was_trimmed'] = df['technical_trim, tonn'].notna().astype(int)`

With this flag, one can analyze which types of products or conditions lead to trimming events.
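
As a minimal sketch of that follow-up analysis, the flag can be aggregated per steel type. The frame `demo` and its values are made up; the names `trim_events` and `trim_rate` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative data; values are not from the dataset.
demo = pd.DataFrame({
    'steel_type': ['Arm500', 'Arm500', 'St3sp', 'St3sp', 'St3sp', '1015'],
    'technical_trim, tonn': [np.nan, 4.8, np.nan, np.nan, np.nan, 5.6],
})

# 1 where a trim was recorded, 0 where the value is missing.
demo['was_trimmed'] = demo['technical_trim, tonn'].notna().astype(int)

# Per steel type: number of trim events and share of records with a trim.
trim_by_type = demo.groupby('steel_type')['was_trimmed'].agg(['sum', 'mean'])
trim_by_type.columns = ['trim_events', 'trim_rate']
print(trim_by_type)
```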

5. Find the row index where `water_temperature_delta, Celsius deg.` is highest.

In [35]:
df['water_temperature_delta, Celsius deg.'].idxmax()

0

In [36]:
df.loc[df['water_temperature_delta, Celsius deg.'].idxmax()]

date                                                               2020-01-05
workpiece_weight, tonn                                                  144.9
steel_type                                                             Arm240
doc_requirement                                                DOC 34028-2016
cast_in_row                                                                 4
workpiece_slice_geometry                                              150x150
alloy_type                                                               open
steel_weight_theoretical, tonn                                          145.3
slag_weight_close_grab1, tonn                                             1.8
metal_residue_grab1, tonn                                                 0.4
steel_weight, tonn                                                      144.9
residuals_grab2, tonn                                                     NaN
technical_trim, tonn                                            

### Interpretation:
The largest water temperature difference occurs at row index 0:

`Date: 2020-01-05`

`Steel Type: Arm240`

`Water ΔT: 9 °C`

`idxmax()` returns only the first row that holds the maximum, so other records may share the same value. Compare this record with the others: was the high delta associated with high alloy speed, cast geometry, or extreme RUL?
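
Because `idxmax()` reports only the first occurrence, it can be worth checking whether the maximum is shared by several rows. A small sketch on made-up values:

```python
import pandas as pd

# Illustrative data; two rows share the maximum on purpose.
demo = pd.DataFrame({
    'date': ['2020-01-05', '2020-01-05', '2020-01-06', '2020-01-07'],
    'water_temperature_delta, Celsius deg.': [9.0, 7.5, 9.0, 8.0],
})
col = 'water_temperature_delta, Celsius deg.'

first_max = demo[col].idxmax()  # label of the first row holding the maximum
all_max = demo.index[demo[col] == demo[col].max()].tolist()  # every tied row
print(first_max, all_max)
```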

### Part 2: Quantile & Percentile Analysis

6. Calculate the 25th, 50th, and 75th percentiles of `C, %`.

In [39]:
percentiles_list = [25, 50, 75]
for percentile in percentiles_list:
    # Double quotes outside, so the single-quoted column name also parses before Python 3.12
    print(f"At {percentile} percentile: C, % is {df['C, %'].quantile(percentile / 100)}")

At 25 percentile: C, % is 0.1875
At 50 percentile: C, % is 0.1924
At 75 percentile: C, % is 0.1974


###  Interpretation:
These values are closely clustered, indicating a tight, narrow distribution — possibly due to strict quality control in material carbon content.

7. Determine how many records have `C, %` above the 75th percentile.

In [40]:
mask = df['C, %'] > df['C, %'].quantile(0.75)

In [42]:
df[mask].count()

date                                                           4358
workpiece_weight, tonn                                         4358
steel_type                                                     4358
doc_requirement                                                4358
cast_in_row                                                    4358
workpiece_slice_geometry                                       4358
alloy_type                                                     4358
steel_weight_theoretical, tonn                                 4358
slag_weight_close_grab1, tonn                                  4358
metal_residue_grab1, tonn                                      4358
steel_weight, tonn                                             4358
residuals_grab2, tonn                                            86
technical_trim, tonn                                             24
grab1_num                                                      4277
steel_temperature_grab1, Celsius deg.           

In [44]:
mask.sum()

np.int64(4358)

In [45]:
df[mask].shape[0]

4358

### What Carbon Percentage (`C, %`) Indicates in Steel Manufacturing

* Carbon is a **critical alloying element** in steel that affects:

  * **Hardness and strength** (higher carbon → harder, stronger steel)
  * **Ductility and weldability** (higher carbon → more brittle, less weldable)
* Different **steel grades** require **precise carbon control** within narrow specifications.
* Deviations outside target ranges can lead to **non-conformances, waste, or product rejection**.


### So What Does This Result Tell Us?

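> **4,358 records (≈ 25%) lie above the 75th percentile of carbon content.** By construction, about a quarter of the data exceeds this threshold, so these are the relatively high-carbon records, not necessarily abnormal ones.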

This may imply any of the following:



### Interpretation for the Process

#### 1. **Intentional High-Carbon Products**

* If certain products (e.g. tool steels, high-strength steels) require high carbon, these entries may reflect:

  * A **specific batch type or alloy type**.
  * Compliance with **specialized specifications**.

**Action**: Group by `steel_type` or `doc_requirement` and verify if high-carbon records correspond to expected product types.



#### 2. **Process Variability or Drift**

* If most products are meant to stay within a narrow carbon range (e.g. structural steel), this top 25% could indicate:

  * **Process drift**
  * **Improper deoxidation or alloy feeding**
  * **Inconsistent melting or ladle practices**

**Action**: Check temporal trends or operators — are high-carbon batches clustered in time or by equipment?



#### 3. **Potential Quality Risk**

* If excess carbon is **not allowed** in some specs, these 4,358 records may violate compliance.

**Action**: Cross-check if these values exceed spec limits using `doc_requirement`.



### Final Insight

> **Knowing which batches fall in the top 25% carbon range allows process engineers to trace potential risks or confirm product specialization**.
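
The grouping suggested above can be sketched as follows. The frame `demo` is made up for illustration, and `high_carbon` is a hypothetical helper column.

```python
import pandas as pd

# Illustrative data; values are not from the dataset.
demo = pd.DataFrame({
    'steel_type': ['Arm500'] * 4 + ['St3sp'] * 4,
    'C, %': [0.20, 0.21, 0.22, 0.23, 0.15, 0.16, 0.17, 0.18],
})

# Flag records above the 75th percentile of carbon content.
threshold = demo['C, %'].quantile(0.75)
demo['high_carbon'] = demo['C, %'] > threshold

# Share of high-carbon records within each steel type.
share = demo.groupby('steel_type')['high_carbon'].mean()
print(share)
```

If the high-carbon records concentrate in one or two steel types, as in this toy example, the pattern points to product specialization more than to process drift.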



8. Identify the interquartile range (IQR) of `Mn,%`.

In [46]:
IQR = df['Mn,%'].quantile(0.75) - df['Mn,%'].quantile(0.25)

In [47]:
IQR

np.float64(0.48030000000000006)

In [48]:
# identify outliers in Mn, % using IQR
lower_bound = df['Mn,%'].quantile(0.25) - 1.5 * IQR
upper_bound = df['Mn,%'].quantile(0.75) + 1.5 * IQR

In [49]:
outliers = df[(df['Mn,%'] < lower_bound) | (df['Mn,%'] > upper_bound)]

In [53]:
outliers.shape[0]

0

### Interpretation:
The IQR of `Mn,%` is approximately 0.4803, so the middle 50% of manganese values lie within a range of about 0.48 percentage points.

- No record falls outside the 1.5 × IQR fences (0 outliers), so the rule flags no extreme manganese values.
- Whether this spread counts as tight depends on the product mix: it may reflect consistent alloying practice, or different manganese targets across steel types.
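
The fence calculation can be wrapped in a small reusable helper. This is a sketch under the usual 1.5 × IQR convention; `iqr_fences` is a hypothetical name and the sample values are made up.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) outlier fences based on k times the IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative manganese values with one obvious extreme.
mn = pd.Series([0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 3.0], name='Mn,%')

low, high = iqr_fences(mn)
outliers = mn[(mn < low) | (mn > high)]
print(low, high, outliers.tolist())
```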

### Part 3: Value Counts & Membership

9. Count how many unique `steel_type` values exist.

In [55]:
steel_type_unique = df['steel_type'].unique()

In [56]:
steel_type_unique.shape[0]

12

In [61]:
steel_type_unique

array(['Arm240', 'St3sp', 'Arm500', '1015', 'V500V', '1008', '25G2S',
       'YP', '1018', 'St4sp', '20', '1010'], dtype=object)

### Interpretation:
There are 12 different steel types in the dataset. This suggests:

- There is a diverse product line.
- `steel_type` is likely a key categorical feature for grouping, profiling, or stratified analysis.

10. List the 5 most frequent values in `alloy_type`.

In [62]:
alloy_type_most_frequent_values = df['alloy_type'].value_counts()

In [64]:
alloy_type_most_frequent_values.head(5)

alloy_type
open     17315
close      188
Name: count, dtype: int64

## What Do `open` and `close` Mean in Alloying?

The dataset does not define these labels. A plausible reading is that they describe the **state of the alloying system** during steel production, in particular how **alloying elements are introduced** and **how the system is controlled**:



### `open` alloying:

* The system is **exposed to the atmosphere** or only partially enclosed.
* Alloys (e.g., carbon, manganese, chromium) are added **in ladles or tundish** with less control over oxidation.
* Often used for:

  * **Standard structural steels**
  * **Bulk production**
  * **Less expensive, less sensitive alloys**
* Easier to manage, cheaper, but **more susceptible to minor compositional deviations** due to oxidation or heat loss.



### `close` alloying:

* The system is **sealed or controlled**, sometimes under inert gas (e.g., argon) or vacuum.
* Alloying is more **precise**, used when exact chemical composition is critical.
* Common in:

  * **High-grade steel production**
  * **Tool steels**
  * **Aerospace or automotive applications**
* More expensive, slower, but ensures **tighter quality control**.



## Interpretation for the Process

Given that:

* `open`: **17,315 entries**
* `close`: **188 entries**



### Process Insights:

| Type  | Meaning                            | Use Case                     | Interpretation from Data                                                                            |
| ----- | ---------------------------------- | ---------------------------- |-----------------------------------------------------------------------------------------------------|
| open  | Standard, cost-effective alloying  | Mass production              | Dominates dataset (≈99%) — indicates high process standardization and possibly less precision demand. |
| close | Sealed/controlled alloying process | High-quality, special alloys | Rare (≈1%) — suggests niche product types, tight specs, or legacy/special runs.                     |



## Strategic Takeaways:

* **Process Control**: Standardized around `open`, optimizing for throughput and cost — good for structural grades.
* **Rare but Critical**: `close` is likely used where **chemical precision matters**. It’s worth profiling:

  * Do `close` records align with certain `steel_type`?
  * Do they have **lower RUL** (suggesting specialty life cycle needs)?
  * Are they tied to specific dates, shifts, or suppliers?

11. How many times does the `doc_requirement` appear as "T1"?

In [84]:
t1_occurrence = (df['doc_requirement'] == 'T1').sum()

In [85]:
t1_occurrence

np.int64(0)

In [86]:
# alternative
t1_occurrence = df['doc_requirement'].value_counts().get('T1', 0)

In [87]:
t1_occurrence

0

In [88]:
df['doc_requirement'].unique()

array(['DOC 34028-2016', 'Contract', 'ASTM A510/A510M-18', 'DIN 488-2009',
       'ASTM A510/A510М-18 / Contract', 'DOC 5781-82',
       'ТУ 24.10.61-022-24688283-2020', 'DOC 380-2005', 'DOC 10702-2016',
       'ASTM/Contract 4.13.2'], dtype=object)

In [89]:
df['doc_requirement'].value_counts()

doc_requirement
DOC 34028-2016                   13934
Contract                          2033
ASTM A510/A510M-18                 770
DOC 5781-82                        421
DOC 380-2005                       123
ASTM/Contract 4.13.2                99
ТУ 24.10.61-022-24688283-2020       45
ASTM A510/A510М-18 / Contract       30
DIN 488-2009                        24
DOC 10702-2016                      24
Name: count, dtype: int64

### Interpretation:
There are zero records in the dataset where `doc_requirement == "T1"`.

#### What This Says:
"T1" does not exist in the dataset, at least not spelled exactly that way.

The assumption in the problem statement that "T1" is a document requirement appears to be wrong; the `unique()` output above lists the ten values that actually occur.
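
Before concluding that a label is absent, it can help to rule out spelling variants. The sketch below compares an exact match with two looser checks; the series `docs` is made up and includes variants that do not occur in the dataset.

```python
import pandas as pd

# Illustrative values; ' t1 ' and 'T1-2020' are invented variants.
docs = pd.Series(
    ['DOC 34028-2016', 'Contract', ' t1 ', 'T1-2020', 'DIN 488-2009'],
    name='doc_requirement',
)

exact = (docs == 'T1').sum()                                     # strict equality
normalized = (docs.str.strip().str.upper() == 'T1').sum()        # ignore case and padding
contains = docs.str.contains('T1', case=False, regex=False).sum()  # substring match
print(exact, normalized, contains)
```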

12. Filter all rows where `steel_type` is in the top 3 most common types.

In [92]:
top_3_steel_type = df['steel_type'].value_counts().head(3).index

In [96]:
top_3_steel_type

Index(['Arm500', 'St4sp', 'St3sp'], dtype='object', name='steel_type')

In [99]:
# to filter the rows
df[df['steel_type'].isin(top_3_steel_type)].head()

Unnamed: 0,date,"workpiece_weight, tonn",steel_type,doc_requirement,cast_in_row,workpiece_slice_geometry,alloy_type,"steel_weight_theoretical, tonn","slag_weight_close_grab1, tonn","metal_residue_grab1, tonn",...,"Al, %","Ca, %","N, %","Pb, %","Mg, %","Zn, %",sleeve,num_crystallizer,num_stream,RUL
1,2020-01-05,165.9,St3sp,Contract,10,150x150,open,166.3,1.8,0.4,...,0.0028,0.0004,0.0049,0.0,0.0,0.0,30013346,2,1,1037.0
3,2020-01-05,170.1,St3sp,Contract,7,150x150,open,170.5,1.8,0.4,...,0.0034,0.0005,0.0051,0.0,0.0,0.0,30012261,22,4,300.0
4,2020-01-05,163.8,St3sp,Contract,12,150x150,open,164.2,1.8,0.4,...,0.0032,0.0004,0.0038,0.0,0.0,0.0,30012261,22,4,164.0
5,2020-01-05,163.8,St3sp,Contract,14,150x150,open,164.2,1.8,0.4,...,0.0043,0.0005,0.0035,0.0,0.0,0.0,30012261,22,4,109.0
8,2020-01-05,161.7,St3sp,Contract,6,150x150,open,162.1,1.8,0.4,...,0.0033,0.0011,0.0056,0.0,0.0,0.0,30012261,22,4,327.0


### Interpretation:

### 1. **'Arm500' Is the Most Common Type**

> `'Arm500'` heads the `value_counts()` ranking, followed by `'St4sp'` and `'St3sp'`.

* This type likely corresponds to a **core product line**, such as structural steel or rebar.
* Operational parameters (e.g., weight, geometry, alloying method) are likely optimized around it.



### 2. **Filtered Data Is Ideal for Profiling**

* With these filtered rows:

  * Compute **average RUL**, **element composition**, or **temperature stats** per steel type.
  * Identify **performance variability** among the top products.


### Summary Insight:

> These top 3 steel types are the most frequent in the dataset. The first filtered rows shown above share open alloying, 150x150 geometry, and contract-based documentation, although this sample contains only `St3sp`. The group forms a **strong foundation** for modeling quality, predicting RUL, or setting benchmarks.
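
The profiling step can be sketched with `isin` followed by a grouped aggregation. The frame `demo` and its RUL values are made up for illustration.

```python
import pandas as pd

# Illustrative data; values are not from the dataset.
demo = pd.DataFrame({
    'steel_type': ['Arm500', 'Arm500', 'Arm500', 'St3sp', 'St3sp',
                   'St4sp', 'St4sp', '1015'],
    'RUL': [400.0, 600.0, 500.0, 300.0, 100.0, 250.0, 350.0, 900.0],
})

# Labels of the three most frequent steel types.
top3 = demo['steel_type'].value_counts().head(3).index

# Keep only those types, then summarize RUL per type.
profile = (
    demo[demo['steel_type'].isin(top3)]
    .groupby('steel_type')['RUL']
    .agg(['count', 'mean', 'std'])
)
print(profile)
```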

### Part 4: Ranking and Order

13. Assign a rank to `RUL` from highest to lowest using average ranking.

In [104]:
df['url_rank'] = df['RUL'].rank(method='average', ascending=False)

In [105]:
df[['date', 'RUL', 'url_rank']].sort_values(by='url_rank').head()

Unnamed: 0,date,RUL,url_rank
8084,2020-05-30,8473111.0,2.0
7259,2020-05-27,8473111.0,2.0
8046,2020-05-30,8473111.0,2.0
8112,2020-05-30,8473091.0,4.5
8103,2020-05-30,8473091.0,4.5


## Statistical Interpretation:

* There are **three entries** with the highest `RUL` value of **8,473,111.0**.

  * Since they’re tied, each receives the **average rank**:
    $(1 + 2 + 3) / 3 = 2.0$
* Two more entries with the next highest `RUL` of **8,473,091.0** get:

  * $(4 + 5) / 2 = 4.5$

This confirms that the ranking used the `"average"` method correctly.
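
To see how tie handling differs between methods, the sketch below ranks a short series that mirrors the tie pattern above (three values at the maximum, two just below). The value 300.0 and the row order are made up.

```python
import pandas as pd

# Tie pattern mirrors the output above; row order and 300.0 are illustrative.
rul = pd.Series([8473111.0, 300.0, 8473111.0, 8473111.0, 8473091.0, 8473091.0])

ranks = pd.DataFrame({
    'RUL': rul,
    'average': rul.rank(method='average', ascending=False),  # mean of tied positions
    'first': rul.rank(method='first', ascending=False),      # ties broken by row order
    'min': rul.rank(method='min', ascending=False),          # lowest tied position
    'dense': rul.rank(method='dense', ascending=False),      # no gaps after ties
})
print(ranks)
```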


## Operational Interpretation:

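> If these RUL values are valid, the entries represent the **top-performing steel batches or products** in terms of **Remaining Useful Life**. Values near 8.47 million are far above typical records, so they should be validated before drawing conclusions.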

### What this could mean for the process:

* They likely underwent **optimal processing conditions**.
* They may belong to a **superior steel type**, or had ideal:

  * Alloy composition (`C, %`, `Mn,%`)
  * Casting speed / temperature control
  * Ladle treatment, or cooling uniformity



## Next Steps (Optional):

1. **Profile these top-ranked records**:

   * What `steel_type` or `alloy_type` are they?
   * What was their `water_temperature_delta, Celsius deg.`, `cast_in_row`, etc.?

2. **Compare to average/bottom-ranked batches**:

   * What's different in conditions?
   * Are there predictors of high RUL?

3. **Visualize**:

   * A bar chart of top 10 RUL values with labels and ranks
   * Time series trends of high-ranked batches

14. Assign a rank to `RUL` using first occurrence (method=`first`).

In [113]:
df['new_rul_rank'] = df['RUL'].rank(method='first', ascending=False)

In [115]:
df[['steel_type', 'alloy_type', 'RUL', 'new_rul_rank']].sort_values(by='new_rul_rank').head()

Unnamed: 0,steel_type,alloy_type,RUL,new_rul_rank
7259,Arm500,open,8473111.0,1.0
8046,1015,open,8473111.0,2.0
8084,1015,open,8473111.0,3.0
8103,1015,open,8473091.0,4.0
8112,1015,open,8473091.0,5.0


## Interpretation:

### 1. `Arm500` Holds Rank 1 Only by Position:

* Row 7259 (`Arm500`) appears **first in the dataset** among the three records tied at the **maximum RUL**, so `method='first'` gives it rank 1.
* The tie-break reflects row order, not durability; all three tied records have the same RUL.

### 2. `1015` Fills the Remaining Top Ranks:

* Four `1015` batches have the same or slightly lower RUL.
* This indicates that, in these records, `1015` matches or approaches the RUL of `Arm500`.

### 3. `alloy_type = open`:

* All top entries use `open` alloying.
* Because about 99% of all records are `open`, this alone does not show that open alloying favors high RUL.



## Operational Insight:

> This ranking can be used to **profile best-performing material batches**, identify **repeatable high-quality configurations**, or feed into **predictive models** (e.g., predicting RUL based on steel type + process variables).

15. Find the 5 records with the lowest `resistance, tonn` values.

In [127]:
df.sort_values(by='resistance, tonn').head(5)

Unnamed: 0,date,"workpiece_weight, tonn",steel_type,doc_requirement,cast_in_row,workpiece_slice_geometry,alloy_type,"steel_weight_theoretical, tonn","slag_weight_close_grab1, tonn","metal_residue_grab1, tonn",...,"N, %","Pb, %","Mg, %","Zn, %",sleeve,num_crystallizer,num_stream,RUL,url_rank,new_rul_rank
8046,2020-05-30,159.264,1015,ASTM A510/A510M-18,4,180x180,open,159.864,1.8,0.4,...,0.0066,0.0,0.0,0.0,30014143,11,6,8473111.0,2.0,2.0
7259,2020-05-27,160.272,Arm500,DOC 34028-2016,1,180x180,open,161.572,1.8,0.4,...,0.0038,0.0,0.0,0.0,30013876,18,6,8473111.0,2.0,1.0
7255,2020-05-27,160.272,Arm500,DOC 34028-2016,1,180x180,open,161.572,1.8,0.4,...,0.0038,0.0,0.0,0.0,30014193,6,6,8928.0,3539.5,3540.0
8084,2020-05-30,159.264,1015,ASTM A510/A510M-18,4,180x180,open,159.864,1.8,0.4,...,0.0066,0.0,0.0,0.0,30013876,18,6,8473111.0,2.0,3.0
13468,2020-07-26,153.6,Arm500,DOC 34028-2016,1,180x180,open,155.238,1.8,0.4,...,0.0078,0.0032,0.0001,0.0018,30014804,11,5,12551.0,1287.0,1287.0


In [130]:
# alternative with rank
df['resistance_rank'] = df['resistance, tonn'].rank(method='first', ascending=True)

In [131]:
# rows with the 5 lowest resistance values
df[df['resistance_rank'] <= 5]

Unnamed: 0,date,"workpiece_weight, tonn",steel_type,doc_requirement,cast_in_row,workpiece_slice_geometry,alloy_type,"steel_weight_theoretical, tonn","slag_weight_close_grab1, tonn","metal_residue_grab1, tonn",...,"Pb, %","Mg, %","Zn, %",sleeve,num_crystallizer,num_stream,RUL,url_rank,new_rul_rank,resistance_rank
2624,2020-04-08,147.456,Arm500,DOC 34028-2016,1,180x180,open,149.1,1.8,0.4,...,0.0028,0.0001,0.0019,30014083,23,4,11628.0,1578.0,1578.0,5.0
7255,2020-05-27,160.272,Arm500,DOC 34028-2016,1,180x180,open,161.572,1.8,0.4,...,0.0,0.0,0.0,30014193,6,6,8928.0,3539.5,3540.0,1.0
7259,2020-05-27,160.272,Arm500,DOC 34028-2016,1,180x180,open,161.572,1.8,0.4,...,0.0,0.0,0.0,30013876,18,6,8473111.0,2.0,1.0,2.0
8046,2020-05-30,159.264,1015,ASTM A510/A510M-18,4,180x180,open,159.864,1.8,0.4,...,0.0,0.0,0.0,30014143,11,6,8473111.0,2.0,2.0,3.0
8084,2020-05-30,159.264,1015,ASTM A510/A510M-18,4,180x180,open,159.864,1.8,0.4,...,0.0,0.0,0.0,30013876,18,6,8473111.0,2.0,3.0,4.0


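### What This Means for the Process:
The two approaches return different fifth rows (13468 from `sort_values`, 2624 from `rank`), which indicates that several records are tied at the lowest resistance values and that the selection among ties depends on the method.

These rows should be examined more closely:

- Are they linked to specific steel types, temperatures, or operational settings?
- Are they outliers, or common for a certain production configuration?

These findings may be used to:

- Improve process safety, energy efficiency, or material quality.
- Investigate maintenance schedules or equipment degradation (e.g., if low resistance is linked to specific machine states or dates).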
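
When several rows share the same value at the cutoff, a fixed-size selection is ambiguous. One option is `nsmallest` with `keep='all'`, which returns every row tied at the cutoff. The sketch below uses made-up resistance values.

```python
import pandas as pd

# Illustrative data; two rows tie at the fourth-lowest value (12.0).
demo = pd.DataFrame({
    'resistance, tonn': [10.0, 12.0, 10.0, 15.0, 12.0, 11.0, 20.0],
})
col = 'resistance, tonn'

lowest_first = demo.nsmallest(4, col)              # default keep='first': exactly 4 rows
lowest_all = demo.nsmallest(4, col, keep='all')    # also keeps the row tied at the cutoff
print(lowest_first.index.tolist(), lowest_all.index.tolist())
```

With `keep='all'` the result can contain more rows than requested, which makes the tie visible instead of silently dropping one of the tied records.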

### Part 5: Percent Change & Differences

16. Compute the percent change between consecutive values in `steel_weight, tonn`.

In [132]:
df['steel_weight, tonn'].pct_change(fill_method=None)

0             NaN
1        0.144928
2        0.012658
3        0.012500
4       -0.037037
           ...   
17498    0.018514
17499    0.000000
17500   -0.018177
17501    0.018514
17502    0.000000
Name: steel_weight, tonn, Length: 17503, dtype: float64

The **percent change in `steel_weight, tonn`** gives insight into how much the steel weight **fluctuates from one record to the next**, which can be very important for **monitoring consistency and detecting anomalies** in the process.



### Interpretation in Context:

Let’s break it down:

| Observation         | Meaning                                                                                                                                       |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `0.1449` (≈+14.5%)  | The steel weight **increased by ~14.5%** compared to the previous record, potentially a heavier input batch or different casting requirement. |
| `-0.0370` (≈-3.7%)  | The steel weight **dropped by 3.7%**, indicating a slight reduction in input.                                                                 |
| `-0.1732` (≈-17.3%) | A sharp **drop** in steel weight, possibly indicating a **change in product type**, casting issue, or manual override.                        |
| `0.1939` (≈+19.4%)  | A **significant increase**, perhaps a ramp-up or shift to larger components.                                                                  |



### What It Says About the Process:

#### Normal operations:

* Small fluctuations (±1–5%) are common and expected in industrial settings.

#### Significant deviations (±10% or more):

* May indicate **process shifts**, **manual changes**, **equipment behavior**, or **data entry errors**.
* Should be **flagged for review**, especially if they persist or correlate with other anomalies (e.g., sudden drops in resistance, spike in temperature, etc.).


### Use Cases:

1. **Process Stability**: Stable % changes imply a well-controlled process.
2. **Quality Control**: Detect outliers that could affect product quality.
3. **Root Cause Analysis**: Combine with other variables (e.g., temperature, alloy type) to explain major swings.
4. **Trend Detection**: Plot the changes to find cycles, bottlenecks, or batch-based variations.
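
Flagging the larger swings for review can be sketched as follows. The first five weights reproduce the head of the dataset shown earlier; the last value (135.0) and the 10% threshold are assumptions made for illustration.

```python
import pandas as pd

# First five values match the dataset head; 135.0 is made up.
weights = pd.Series(
    [144.9, 165.9, 168.0, 170.1, 163.8, 135.0],
    name='steel_weight, tonn',
)

pct = weights.pct_change(fill_method=None)

# Keep only record-to-record changes of at least 10% in either direction.
flagged = pct[pct.abs() >= 0.10]
print(flagged)
```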

17. Compute the absolute difference between consecutive values in `temperature_measurement2, Celsius deg.`.

In [133]:
df['temperature_measurement2, Celsius deg.'].diff().abs()

0         NaN
1        10.0
2        13.0
3         9.0
4         2.0
         ... 
17498     6.0
17499     0.0
17500     6.0
17501     6.0
17502     0.0
Name: temperature_measurement2, Celsius deg., Length: 17503, dtype: float64

### What This Means for the Process:

* **Small absolute differences (e.g., 2–5°C)** suggest **temperature stability**, which may be ideal for consistent production.
* **Large jumps (e.g., >10°C)** may signal:

  * A process change
  * A batch variation
  * A control issue
  * Or even faulty sensor readings

Monitoring these differences over time can help identify **instabilities, trends, or anomalies** in the thermal profile of the process.

18. Which date had the largest percent drop in `RUL` compared to the previous record?

In [137]:
df['rul_pct_change'] = df['RUL'].pct_change(fill_method=None)

In [139]:
df['rul_pct_change']

0             NaN
1        1.700521
2       -0.657666
3       -0.154930
4       -0.453333
           ...   
17498   -0.007623
17499    2.286486
17500    0.002337
17501   -0.672338
17502    0.257512
Name: rul_pct_change, Length: 17503, dtype: float64

In [140]:
nonzero_drop = df[df['RUL'] != 0]

In [141]:
nonzero_drop[['date', 'RUL', 'rul_pct_change']].sort_values(by='rul_pct_change', ascending=True).head(5)

Unnamed: 0,date,RUL,rul_pct_change
8235,2020-05-31,281.0,-0.999967
8200,2020-05-31,338.0,-0.99996
9338,2020-06-09,1204.0,-0.999858
9328,2020-06-09,1226.0,-0.999855
8088,2020-05-30,1386.0,-0.999836


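The largest percent drop occurs at index 8235 on **2020-05-31**, where RUL falls by about 99.997% relative to the previous record. Rows where RUL itself equals 0 are excluded by the filter. All five drops listed are close to -100% (around `-0.9999`).


### Interpretation

* These records **did not go to zero**, but hold only a tiny fraction of the **previous record's** value.
* Working backwards, `281 / (1 - 0.999967)` is about 8.5 million, so each of these drops follows a record with RUL near the 8.47 million maximum. The drop therefore reflects the return from an extreme value to a typical one.
* Consecutive rows do not necessarily belong to the same equipment (the first rows of the dataset already alternate between `sleeve` values), so a row-to-row drop is not by itself evidence of **extreme degradation** or **critical failure onset**.
* All events happened in a **tight timeframe (late May to early June 2020)**, which suggests checking for:

  * A process fault
  * Sensor miscalibration
  * A severe batch-related issue
  * Or a specific machine failure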


### What This Means for the Process

This insight could indicate **a systemic issue** or **urgent failure risk**:

1. **Critical Review Needed**: Engineers should investigate what happened on those dates. This is not normal behavior.
2. **Preventative Maintenance**: Set alerts for steep RUL drops approaching this magnitude.
3. **Predictive Modeling**: These records should be flagged as high-risk in predictive failure models.
4. **Cross-Variable Correlation**: Look at `steel_type`, `alloy_type`, temperature, and resistance on those dates to find contributing factors.
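
One way to examine such drops is to compute the change within each piece of equipment instead of across consecutive rows. The sketch below groups by `sleeve`; treating `sleeve` as the relevant unit is an assumption, and the values are made up.

```python
import pandas as pd

# Illustrative data; two sleeves interleaved row by row.
demo = pd.DataFrame({
    'sleeve': ['30012261', '30013346', '30012261', '30013346', '30012261'],
    'RUL': [400.0, 1000.0, 300.0, 900.0, 150.0],
})

# Row-to-row change, ignoring which sleeve each record belongs to.
demo['pct_global'] = demo['RUL'].pct_change(fill_method=None)

# Change relative to the previous record of the same sleeve:
# current / previous-within-group - 1.
previous_in_group = demo.groupby('sleeve')['RUL'].shift(1)
demo['pct_by_sleeve'] = demo['RUL'] / previous_in_group - 1
print(demo)
```

In this toy example the global series swings between +200% and -83%, while the per-sleeve series shows steady declines of 10% to 50%.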

### Part 6: Correlation & Covariance

19. Compute the Pearson correlation between `RUL` and `temperature_measurement1, Celsius deg.`.

In [142]:
df['RUL'].corr(df['temperature_measurement1, Celsius deg.'])

np.float64(0.00531403167843449)

### Interpretation
- A correlation of ~0.0053 indicates essentially no linear relationship between the two variables.

- In other words, higher or lower temperature readings are not associated with a higher or lower Remaining Useful Life (RUL), at least in a linear sense. Correlation alone does not establish whether temperature has a causal effect on RUL.

- Pearson correlation does not capture nonlinear patterns, so while a linear association is absent, nonlinear patterns might still exist and can be explored using other methods (e.g., mutual information, scatter plots, or machine learning models).
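
To illustrate the last point, here is a small sketch on synthetic data (not the casting dataset) where one variable is completely determined by the other, yet the Pearson correlation is close to zero. The variable names are placeholders for the example.

```python
import numpy as np
import pandas as pd

# x is symmetric around zero; y depends on x exactly, but not linearly.
x = pd.Series(np.linspace(-1.0, 1.0, 201))
y = x ** 2

pearson = x.corr(y)                  # close to 0: no linear trend
pearson_abs = x.abs().corr(y)        # strong once the symmetry is removed

print(round(pearson, 6), round(pearson_abs, 3))
```

A near-zero Pearson value therefore rules out a straight-line trend only. A scatter plot of `RUL` against the temperature column is a cheap way to check for shapes like this one.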

20. Compute the correlation between all numeric chemical composition columns (e.g., C, Si, Mn, S, etc.).

In [145]:
all_columns = df.columns

In [146]:
composition_columns = []
for column in all_columns:
    if column.endswith('%'):
        composition_columns.append(column)

In [147]:
composition_columns

['Ce, %',
 'C, %',
 'Si, %',
 'Mn,%',
 'S, %',
 'P, %',
 'Cr, %',
 'Ni, %',
 'Cu, %',
 'As, %',
 'Mo, %',
 'Nb, %',
 'Sn, %',
 'Ti, %',
 'V, %',
 'Al, %',
 'Ca, %',
 'N, %',
 'Pb, %',
 'Mg, %',
 'Zn, %']

In [148]:
df[composition_columns].corr()

Unnamed: 0,"Ce, %","C, %","Si, %","Mn,%","S, %","P, %","Cr, %","Ni, %","Cu, %","As, %",...,"Nb, %","Sn, %","Ti, %","V, %","Al, %","Ca, %","N, %","Pb, %","Mg, %","Zn, %"
"Ce, %",1.0,0.801069,0.842557,0.946956,0.131664,0.147803,0.295927,0.068912,0.139638,0.153586,...,0.376635,-0.011064,0.813117,0.567288,0.246622,0.106836,0.190541,0.221167,0.144093,0.094153
"C, %",0.801069,1.0,0.691559,0.49159,0.088408,0.092592,0.159496,0.028714,0.023368,0.127358,...,0.078811,0.018933,0.507784,0.284397,0.052309,-0.003323,0.16219,0.055085,0.154073,0.165425
"Si, %",0.842557,0.691559,1.0,0.621296,0.06085,-0.000653,0.127881,0.008056,0.030743,0.177214,...,0.177307,-5.9e-05,0.700356,0.374161,0.188194,0.039495,0.152943,0.054966,0.109329,0.115426
"Mn,%",0.946956,0.49159,0.621296,1.0,0.085118,0.198205,0.389499,0.039351,0.080058,0.092829,...,0.426418,-0.036507,0.645788,0.666978,0.117044,0.015326,0.129754,0.15512,0.083253,0.081496
"S, %",0.131664,0.088408,0.06085,0.085118,1.0,0.199453,0.126871,0.141385,0.288822,0.011205,...,-0.013884,0.079669,0.115758,0.091978,0.109095,0.085878,0.347341,0.438183,0.201518,0.207442
"P, %",0.147803,0.092592,-0.000653,0.198205,0.199453,1.0,0.310724,0.259556,0.236584,0.104981,...,0.18825,0.064765,0.088469,0.235833,-0.044226,-0.04706,-0.104215,0.049439,-0.001769,-0.011688
"Cr, %",0.295927,0.159496,0.127881,0.389499,0.126871,0.310724,1.0,0.474812,0.491731,0.162103,...,0.209517,0.064363,0.255662,0.314498,0.031649,-0.002377,0.16466,0.257909,0.134157,0.137219
"Ni, %",0.068912,0.028714,0.008056,0.039351,0.141385,0.259556,0.474812,1.0,0.628063,0.380659,...,-0.005247,0.112911,-0.005321,-0.000289,-0.056682,-0.073154,0.056385,0.040615,0.054505,0.119676
"Cu, %",0.139638,0.023368,0.030743,0.080058,0.288822,0.236584,0.491731,0.628063,1.0,0.367737,...,0.140857,0.118818,0.17754,0.135373,0.125884,0.092172,0.242991,0.37686,0.18661,0.153194
"As, %",0.153586,0.127358,0.177214,0.092829,0.011205,0.104981,0.162103,0.380659,0.367737,1.0,...,0.118109,0.106873,0.089429,0.086457,0.040077,-0.067479,-0.036943,-0.143094,-0.023778,0.03622


### Interpretation

* The correlation matrix (as shown) helps:

  * Detect **redundant variables** (e.g., `'Ce, %'` and `'C, %'` have a correlation ≈ 0.80)
  * Identify **multicollinearity** before applying modeling or dimensionality reduction.
  * Suggest chemical **interactions or co-presence patterns** (useful in alloy formulation or quality control).



### Example Insights

* `Ti, %` and `Ce, %`: **strong positive correlation** (≈ 0.81)
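* `Mn,%` and `Cr, %`: moderately correlated (≈ 0.39); `Mn,%` and `P, %` are only weakly correlated (≈ 0.20)
* `S, %` appears **weakly correlated** with most others (its largest value among the displayed columns is ≈ 0.44 with `Pb, %`), suggesting it varies largely independently in this context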
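
Reading pairs off a 21-column matrix by eye is error-prone. One possible way to list the strongest pairs is sketched below, using a small synthetic frame in place of `df[composition_columns]`. The column names `A`, `B`, and `C` are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "A": a,
    "B": a + rng.normal(scale=0.1, size=200),  # nearly a copy of A
    "C": rng.normal(size=200),                  # unrelated to A and B
})

corr = demo.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).unstack().dropna()

# Sort by absolute strength, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

The first entry is the `A`/`B` pair, which is the kind of near-duplicate column you would review for redundancy before modeling.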

21. Compute the covariance matrix for `RUL`, `alloy_speed, meter/minute`, and `swing_frequency, amount/minute`.

In [149]:
df[['RUL', 'alloy_speed, meter/minute', 'swing_frequency, amount/minute']].cov()

Unnamed: 0,RUL,"alloy_speed, meter/minute","swing_frequency, amount/minute"
RUL,1655525000000.0,7996.354368,-252990.012453
"alloy_speed, meter/minute",7996.354,0.111955,-2.999822
"swing_frequency, amount/minute",-252990.0,-2.999822,141.331874


#### Key Observations:

* `RUL` has a **positive covariance** with `alloy_speed`, suggesting that as alloy speed increases, `RUL` tends to increase too (but check the units and scale).
* `RUL` has a **negative covariance** with `swing_frequency`, which may indicate an inverse relationship.
* Diagonal entries are **variances** (e.g., `Var(RUL) = 1.655525e+12`).



### Reminder:

Covariance shows **direction and scale**, but not strength. To measure **strength and standardized direction**, use `.corr()` (Pearson correlation matrix).
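
The link between the two is that correlation is covariance divided by the product of the standard deviations. The sketch below checks this on synthetic data; the column names and scales are assumptions chosen to mimic one very large and one very small variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
speed = rng.normal(loc=3.0, scale=0.3, size=500)
demo = pd.DataFrame({
    "speed": speed,
    "rul": 1e6 * speed + rng.normal(scale=1e6, size=500),
})

cov = demo.cov()
std = np.sqrt(np.diag(cov))               # standard deviations from the diagonal
corr_from_cov = cov / np.outer(std, std)  # rescale every entry to [-1, 1]

print(corr_from_cov.round(3))
print(demo.corr().round(3))               # same values
```

Applying the same rescaling by hand to the matrix above gives roughly 0.019 for `RUL` with `alloy_speed` and roughly -0.017 for `RUL` with `swing_frequency`. Both are close to zero, and both are approximate if the columns have different missing rows.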

### Part 7: Missing Data in Stats

22. How many missing values are there in each column?

In [150]:
df.isna().sum()

date                        0
workpiece_weight, tonn      0
steel_type                  0
doc_requirement             0
cast_in_row                 0
                         ... 
RUL                       224
url_rank                  224
new_rul_rank              224
resistance_rank           224
rul_pct_change            345
Length: 61, dtype: int64

### Interpretation from the Output:

* Columns like:

  * `'residuals_grab2, tonn'` → **16,990** missing
  * `'technical_trim, tonn'` → **17,427** missing
  * `'grab1_num'` → **271** missing
  * `'resistance, tonn'` → **224** missing
  * `'steel_temperature_grab1, Celsius deg.'` → **3** missing
  * `'grab2_num'` → **16** missing

* Columns like:

  * `'steel_weight, tonn'`, `'swing_frequency, amount/minute'`, and `'alloy_speed, meter/minute'` have **0** missing values — good for modeling.


### What to Do Next?

You might want to:

* **Drop** columns with excessive missing values (e.g., over 90% missing).
* **Impute** (fill in) missing values using mean, median, forward fill, etc.
* **Visualize** missing data with a heatmap or bar chart (e.g., using `missingno` or `seaborn`) for clarity.
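
The first two options could look like the sketch below. It uses a tiny made-up frame; the 90% cut-off and the choice of median imputation are assumptions for illustration, not recommendations specific to this dataset.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],     # 95% missing
    "few_missing": [1.0, np.nan] + [2.0] * 18,   # 5% missing
    "complete": range(20),
})

# Fraction of missing values per column.
missing_share = demo.isna().mean()

# Keep columns with at most 90% missing values (assumed threshold).
keep = missing_share[missing_share <= 0.9].index
cleaned = demo[keep].copy()

# Impute the remaining gaps with the column median.
cleaned["few_missing"] = cleaned["few_missing"].fillna(cleaned["few_missing"].median())
print(missing_share)
print(cleaned.isna().sum())
```

`isna().mean()` works because `True` counts as 1, so the mean of the boolean mask is the share of missing entries.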

23. Compute the mean of `Ce, %` with and without missing values included.

In [151]:
df['Ce, %'].mean(skipna=True)

np.float64(0.34175648140478754)

In [152]:
df['Ce, %'].mean(skipna=False)

np.float64(nan)

### Interpretation for the Process:

#### 1. `skipna=True` — Ignoring Missing Values:

* **Meaning**: The mean is calculated from **only the available data**. This is the pandas default.
* **Use Case**: This is common in exploratory data analysis (EDA), especially when:

  * You are not yet ready to handle missing values formally.
  * You assume the missing values are random and won't bias the result.
* **Impact**: Summary statistics can still be computed even when the dataset is incomplete.

#### 2. `skipna=False` — Include Missing Values:

* **Meaning**: Since there's at least one `NaN`, the result is `NaN`.
* **Use Case**: Forces you to **acknowledge data gaps**.
* **Impact**: This prevents hidden issues when:

  * Preparing data for production or machine learning.
  * There is a need for a complete, clean dataset.
  * There is a need to raise alerts when missing values may affect reliability.



### Process Implications:

| Scenario                           | What It Means                                                | Suggested Action                                                  |
|------------------------------------| ------------------------------------------------------------ |-------------------------------------------------------------------|
| Got a numeric mean (`skipna=True`) | Some data is available and being used.                       | Proceed, but investigate how many values were skipped.            |
| Got `NaN` (`skipna=False`)         | At least one `NaN` exists, which prevents safe calculations. | **Must clean or impute** these missing values before further use. |



### Recommendation:

**Check how many missing values** exist in `'Ce, %'` (e.g., `df['Ce, %'].isna().sum()`) and decide whether to:

* Drop them (if very few),
* Fill them with a default or interpolated value,
* Or model them if the missingness is systematic.
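
A small sketch of reporting the mean together with how many values it is based on. The series below is made up; in the notebook it would be `df['Ce, %']`.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.2, np.nan, 0.4, 0.6])

mean_skip = s.mean()                  # skipna=True is the default: uses 3 values
mean_strict = s.mean(skipna=False)    # NaN, because one value is missing
n_used = s.count()                    # number of non-missing values
n_missing = s.isna().sum()            # number of missing values

print(mean_skip, mean_strict, n_used, n_missing)
```

Reporting `n_used` next to the mean makes it obvious when a statistic rests on only a small part of the column.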

24. Drop all rows with missing values in `residuals_grab2, tonn` and compute its mean.

In [153]:
# Start again from the originally loaded data, discarding helper columns added in earlier cells
df = data.copy()

In [159]:
df['residuals_grab2, tonn'].dropna().mean()

np.float64(5.771189083820663)

### What It Means for the Process:

This value — **5.77 tons** — is the **average value of residuals in grab 2**, *after removing missing records*. Here's what this tells us:



### Interpretation for the Process:

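1. **Data Coverage Check**:

   * `dropna()` removed the **16,990 missing values** (from earlier steps), leaving 513 of 17,503 rows. `.mean()` skips `NaN` by default, so the result is the same with or without the explicit `dropna()`.
   * The average is calculated **only from observed values**, but these cover about 3% of the records, so it may not describe the process as a whole.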

2. **Operational Insight**:

   * This number likely represents **residual material left over** in a secondary grab or phase of the process (e.g., waste or byproduct).
   * An average of \~5.77 tons could indicate how efficiently or cleanly the process completes material transfer or handling.

3. **Benchmark for Optimization**:

   * If the goal is to **reduce waste** or **improve recovery**, this value can be a **baseline**.
   * Any future improvements to reduce this average would signal better efficiency.



### Suggestion:

To go further:

* Compare residuals by **steel type**, **alloy type**, or **time period**.
* Use `.groupby()` and `.mean()` to find which combinations produce more or less residual.

### Part 8: Group-Based Descriptive Statistics

25. Group the data by `steel_type` and compute the average `RUL` for each group.

In [160]:
df.groupby('steel_type')['RUL'].mean()

steel_type
1008        5113.400000
1010        7735.550459
1015      782444.459924
1018        4327.913636
20          1646.208333
25G2S       8212.926014
Arm240      1607.670103
Arm500    191426.146574
St3sp     625431.181013
St4sp      60354.872781
V500V       2679.500000
YP          3301.022222
Name: RUL, dtype: float64

### What It Means for the Process:

1. **Steel Performance Insight**:

   * The result shows which steel types have a higher or lower average `RUL`, meaning:

     * How much useful life remains, on average, in the records for that steel type.
     * How durable or effective it appears under the operational conditions.

2. **Examples from Output**:

   * `1015` has the **highest average RUL**: \~782,444 → very long-lasting.
   * `Arm240` has the **lowest**: \~1,607 → much shorter lifespan.
   * `Arm500` is also durable: \~191,426.
   * `St3sp` is fairly high as well: \~625,431.

3. **Use Cases**:

   * Optimize material selection based on RUL vs cost.
   * Identify underperforming steel types to phase out.
   * Correlate with temperature, residuals, or processing speed for deeper analysis.
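
Group means can be dominated by a few extreme rows or rest on very few records, so it may help to show the median and the group size next to the mean. The sketch below uses made-up data with placeholder steel types `A`, `B`, and `C`.

```python
import pandas as pd

demo = pd.DataFrame({
    "steel_type": ["A", "A", "A", "B", "B", "C"],
    "RUL": [100.0, 120.0, 5000.0, 300.0, 320.0, 10.0],
})

# Named aggregation: one output column per statistic.
summary = (
    demo.groupby("steel_type")["RUL"]
    .agg(mean="mean", median="median", n="count")
    .sort_values("mean", ascending=False)
)
print(summary)
```

Type `A` has the highest mean only because of one large value; its median is 120. Type `C` rests on a single row. The same pattern applied to `df` would show whether the large averages for some steel types are driven by a few records.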

26. Group the data by `num_crystallizer` and report the standard deviation of `alloy_speed, meter/minute`.

In [161]:
df.groupby('num_crystallizer')['alloy_speed, meter/minute'].std()

num_crystallizer
1     0.471321
2     0.361131
3     0.481220
4     0.511189
5     0.386893
6     0.498255
7     0.285566
8     0.235683
9     0.210392
10    0.240568
11    0.369843
12    0.265081
13    0.207943
14    0.255642
15    0.231936
16    0.249733
17    0.212356
18    0.282976
19    0.302506
20    0.386946
21    0.496801
22    0.463276
23    0.462137
24    0.333562
Name: alloy_speed, meter/minute, dtype: float64

### What It Means for the Process:

1. **Standard Deviation** tells how much the alloy speed varies for each crystallizer.

   * A **higher std** → more inconsistency in alloy speed.
   * A **lower std** → more stable speed for that crystallizer.

2. **Usefulness**:

   * Helps identify **process stability** per crystallizer unit.
   * For instance:

     * Crystallizers `4`, `6`, and `21` have the **highest variation** (std ≈ 0.50 to 0.51) → may need inspection.
     * Crystallizers `13`, `9`, and `17` are the **most stable** (std ≈ 0.21) → tighter control.

3. **Next Steps**:

   * Investigate reasons for high variation in specific crystallizers.
   * Compare with product quality or defect rates.
   * Consider maintenance or calibration if inconsistencies affect product integrity.
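
Rather than scanning the 24 values by eye, the most and least variable units can be pulled out directly. A minimal sketch on made-up data (the `speed` column stands in for `alloy_speed, meter/minute`):

```python
import pandas as pd

demo = pd.DataFrame({
    "num_crystallizer": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "speed": [3.0, 3.1, 2.9, 2.0, 3.0, 4.0, 3.0, 3.0, 3.0],
})

spread = demo.groupby("num_crystallizer")["speed"].std()

most_variable = spread.nlargest(1)   # unit with the largest standard deviation
most_stable = spread.nsmallest(1)    # unit with the smallest standard deviation
print(most_variable)
print(most_stable)
```

With the real data, `nlargest(3)` and `nsmallest(3)` on the grouped result would give a short inspection list.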

27. Group by both `alloy_type` and `steel_type` and compute the count of entries for each combination.

In [168]:
df.groupby(['alloy_type', 'steel_type']).size()

alloy_type  steel_type
close       1008             30
            1010            117
            20               24
            Arm240            6
            Arm500           11
open        1015            532
            1018            220
            25G2S           421
            Arm240           93
            Arm500        13824
            St3sp           804
            St4sp          1352
            V500V            24
            YP               45
dtype: int64

In [169]:
# as a DataFrame
df.groupby(['alloy_type', 'steel_type']).size().reset_index(name='count')

Unnamed: 0,alloy_type,steel_type,count
0,close,1008,30
1,close,1010,117
2,close,20,24
3,close,Arm240,6
4,close,Arm500,11
5,open,1015,532
6,open,1018,220
7,open,25G2S,421
8,open,Arm240,93
9,open,Arm500,13824


### What it says about the process:
- Production volume insight: The counts show which steel and alloy combinations are most common. `Arm500` cast `open` dominates with 13,824 of the 17,503 records (about 79%).

- Resource tracking: Combining these counts with `workpiece_weight, tonn` would make it possible to assess material usage per steel-alloy pairing.

- Anomaly detection: A combination with an unexpectedly high or low count might signal a process issue, material bottleneck, or overproduction. Combinations with very few records (for example `close` with `Arm240`, 6 rows) also give less reliable group statistics.
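
Counts are easier to compare as shares of the total. A possible sketch, using a small made-up frame in place of `df`:

```python
import pandas as pd

demo = pd.DataFrame({
    "alloy_type": ["open"] * 8 + ["close"] * 2,
    "steel_type": ["Arm500"] * 6 + ["St3sp"] * 2 + ["1010"] * 2,
})

counts = demo.groupby(["alloy_type", "steel_type"]).size()

# Share of all rows that falls into each combination, largest first.
share = (counts / counts.sum()).sort_values(ascending=False)
print(share)
```

Dividing by `counts.sum()` guarantees the shares add up to 1, which makes a dominant combination stand out immediately.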

### Part 9: Rolling and Expanding Windows

28. Compute a 5-point rolling mean on `RUL` and show the first 10 values.

In [170]:
df['RUL'].rolling(window=5).mean().head(10)

0      NaN
1      NaN
2      NaN
3      NaN
4    448.0
5    393.0
6    301.0
7    328.2
8    333.6
9    328.0
Name: RUL, dtype: float64

###  What this means for the process:
A 5-point rolling mean smooths out short-term fluctuations in RUL (Remaining Useful Life), helping to:

- Visualize trends more clearly,
- Detect drops or stability in equipment health,
- Reduce the noise from sudden outliers.

The first four values are `NaN` because a full window of five observations is not yet available.
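
If the leading gaps are unwanted, `min_periods` lets the window produce a value as soon as a minimum number of observations is available. A short sketch with illustrative numbers (not taken from the dataset):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

strict = s.rolling(window=5).mean()                  # NaN until 5 values exist
partial = s.rolling(window=5, min_periods=1).mean()  # uses whatever is available

print(pd.DataFrame({"strict": strict, "partial": partial}))
```

The trade-off is that the early `partial` values average fewer points, so they are noisier than the rest of the series.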

29. Compute an expanding mean of `temperature_measurement2, Celsius deg.`.

In [172]:
df['temperature_measurement2, Celsius deg.'].expanding().mean()

0        1545.000000
1        1540.000000
2        1542.666667
3        1541.750000
4        1541.600000
            ...     
17498    1540.544955
17499    1540.544638
17500    1540.544664
17501    1540.544347
17502    1540.544029
Name: temperature_measurement2, Celsius deg., Length: 17503, dtype: float64

###  Interpretation for the process:
This is useful for trend analysis over time. An expanding mean can:

- Smooth fluctuations,
- Show how the average temperature stabilizes or shifts as more data is collected,
- Support control system checks or gradual process drift detection.
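
An expanding mean is simply the running total divided by the number of observations seen so far. The sketch below checks that equivalence on a few illustrative readings (assumed values with no missing entries):

```python
import numpy as np
import pandas as pd

temps = pd.Series([1545.0, 1535.0, 1548.0, 1539.0])

expanding_mean = temps.expanding().mean()

# Same quantity by hand: cumulative sum divided by the running count.
manual = temps.cumsum() / np.arange(1, len(temps) + 1)

print(expanding_mean)
```

Because every new value is averaged with all earlier ones, a single reading moves the result less and less as the series grows. That is why the tail of the output above changes only in the fourth decimal place.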

30. Identify the rows where the 3-point rolling mean of `steel_weight, tonn` is greater than 10.

In [174]:
df[(df['steel_weight, tonn'].rolling(window=3).mean() > 10)]

Unnamed: 0,date,"workpiece_weight, tonn",steel_type,doc_requirement,cast_in_row,workpiece_slice_geometry,alloy_type,"steel_weight_theoretical, tonn","slag_weight_close_grab1, tonn","metal_residue_grab1, tonn",...,"Al, %","Ca, %","N, %","Pb, %","Mg, %","Zn, %",sleeve,num_crystallizer,num_stream,RUL
2,2020-01-05,168.000,Arm240,DOC 34028-2016,5,150x150,open,168.400,1.8,0.4,...,0.0031,0.0011,0.0068,0.0000,0.0000,0.0000,30012261,22,4,355.0
3,2020-01-05,170.100,St3sp,Contract,7,150x150,open,170.500,1.8,0.4,...,0.0034,0.0005,0.0051,0.0000,0.0000,0.0000,30012261,22,4,300.0
4,2020-01-05,163.800,St3sp,Contract,12,150x150,open,164.200,1.8,0.4,...,0.0032,0.0004,0.0038,0.0000,0.0000,0.0000,30012261,22,4,164.0
5,2020-01-05,163.800,St3sp,Contract,14,150x150,open,164.200,1.8,0.4,...,0.0043,0.0005,0.0035,0.0000,0.0000,0.0000,30012261,22,4,109.0
6,2020-01-05,163.800,Arm240,DOC 34028-2016,5,150x150,open,164.200,1.8,0.4,...,0.0032,0.0011,0.0046,0.0000,0.0000,0.0000,30012261,22,4,577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17498,2020-08-26,168.960,Arm500,DOC 34028-2016,17,180x180,open,169.731,1.8,0.4,...,0.0032,0.0011,0.0070,0.0038,0.0000,0.0000,30014135,12,6,3515.0
17499,2020-08-26,168.960,Arm500,DOC 34028-2016,17,180x180,open,169.731,1.8,0.4,...,0.0032,0.0011,0.0070,0.0038,0.0000,0.0000,30014818,18,5,11552.0
17500,2020-08-26,165.888,Arm500,DOC 34028-2016,16,180x180,open,166.653,1.8,0.4,...,0.0032,0.0015,0.0074,0.0039,0.0001,0.0000,30014818,18,5,11579.0
17501,2020-08-26,168.960,Arm500,DOC 34028-2016,7,180x180,open,169.731,1.8,0.4,...,0.0039,0.0011,0.0072,0.0044,0.0001,0.0011,30014135,12,6,3794.0


### **Interpretation for the Process:**

1. **Rolling Mean Context:**

   * A 3-point rolling mean smooths out short-term fluctuations and shows the local trend in steel weight over time.
   * It averages every 3 consecutive values:
     $(val_{i-2} + val_{i-1} + val_i)/3$

2. **Threshold > 10 tons:**

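   * The filter keeps only rows where the **local average steel weight** (the current record and the two before it) is above 10 tons.
   * The first two rows can never qualify: their windows are incomplete, the rolling mean is `NaN`, and `NaN > 10` evaluates to `False`. That is why the output starts at index 2.
   * This helps **identify operational periods** when **heavier-than-normal steel** was being processed, provided 10 tons is a meaningful threshold for this column.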
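
The sketch below shows how such a filter behaves on a short made-up series. The weights are illustrative; only the window of 3 and the threshold of 10 come from the problem statement.

```python
import pandas as pd

w = pd.Series([8.0, 9.0, 12.0, 15.0, 6.0, 7.0])

roll = w.rolling(window=3).mean()   # NaN, NaN, 9.67, 12.0, 11.0, 9.33
mask = roll > 10                    # NaN compares as False
selected = w[mask]
print(selected)
```

Rows 3 and 4 are selected. Row 4 has a weight of only 6, but it is kept because the filter tests the local average, not the individual value.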



### Why it matters for the steel process:

* **Quality Monitoring:** Indicates periods where **heavy slabs** or **larger ingots** are being processed, which might relate to higher-grade steel.
* **Operational State Detection:** Could reflect **shifts in machine usage**, production orders, or changes in casting specs.
* **Outlier Detection:** If 10 tons is unusually high for the normal operating range, this can help flag **anomalies** or **peak load periods**.



### Example Use Cases:

* Identify **heat treatment runs** that involved heavy slabs.
* Filter data for **stress or fatigue analysis** based on mass.
* Feed this condition into **predictive models** (e.g., estimating energy use or failure risk).