## Project Overview

This project focuses on predictive maintenance in the mining industry,
inspired by large-scale copper and cobalt mining operations such as
Kamoto Copper Company (KCC).

Mining companies rely on heavy equipment (excavators, crushers, conveyors,
haul trucks) operating in harsh conditions. Unexpected equipment failures
can lead to production downtime, safety risks, and significant financial losses.

Using operational and sensor data, this project applies data mining techniques
to analyze equipment behavior and predict potential failures before they occur.

## Project Objectives

The main objectives of this project are:

- Analyze operational data from mining equipment
- Identify patterns and signals that precede equipment failures
- Understand which factors contribute most to breakdowns
- Build a data-driven foundation for predictive maintenance strategies

Ultimately, this project aims to support decision-making by reducing unplanned
downtime and improving operational efficiency in mining operations.

## Business Questions

- Which equipment types experience the most failures?
- How do temperature, torque, and operating hours influence failures?
- Are failures more frequent at high production levels?
- Which operational signals appear shortly before a breakdown?

## 👉 Step 1: Exploratory Data Analysis (EDA)

## Loading the data

In [1]:
import pandas as pd

df = pd.read_excel("C:/Users/danny/Copper-Mining-Predictive-Maintenance/Data/copper_mining_predictive_maintenance_dataset.xlsx")
df.head(5)

| | Equipment_ID | Equipment_Type | Air_Temperature_C | Process_Temperature_C | Rotational_Speed_RPM | Torque_Nm | Hydraulic_Pressure_bar | Operating_Hours | Daily_Production_Tons | Equipment_Failure |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1051 | Excavator | 33.05 | 89.95 | 1533 | 52.75 | 212.59 | 979 | 1255.0 | 1 |
| 1 | 1092 | Conveyor | 16.77 | 78.37 | 1513 | 47.91 | 228.12 | 731 | 1290.8 | 0 |
| 2 | 1014 | Haul Truck | 34.06 | 91.07 | 1476 | 47.89 | 211.86 | 57 | 1142.4 | 1 |
| 3 | 1071 | Haul Truck | 29.42 | 70.93 | 2027 | 42.63 | 212.05 | 2244 | 1336.6 | 0 |
| 4 | 1060 | Excavator | 26.34 | 69.16 | 1909 | 30.38 | 220.33 | 1702 | 1149.4 | 0 |


In [2]:
df.describe()

| | Equipment_ID | Air_Temperature_C | Process_Temperature_C | Rotational_Speed_RPM | Torque_Nm | Hydraulic_Pressure_bar | Operating_Hours | Daily_Production_Tons | Equipment_Failure |
|---|---|---|---|---|---|---|---|---|---|
| count | 1200.0 | 1200.0 | 1200.0 | 1200.0 | 1200.0 | 1200.0 | 1200.0 | 1200.0 | 1200.0 |
| mean | 1048.659167 | 25.106925 | 80.049917 | 1501.6125 | 40.012317 | 210.404233 | 2432.7375 | 1181.867083 | 0.115 |
| std | 29.511022 | 5.124566 | 10.207868 | 292.640127 | 10.054278 | 19.68674 | 1444.507652 | 301.02434 | 0.319155 |
| min | 1000.0 | 8.06 | 47.61 | 598.0 | 2.91 | 148.71 | 3.0 | 172.3 | 0.0 |
| 25% | 1023.0 | 21.605 | 73.3325 | 1309.0 | 32.8175 | 197.12 | 1148.0 | 961.525 | 0.0 |
| 50% | 1049.0 | 25.295 | 80.25 | 1496.0 | 40.33 | 210.2 | 2391.0 | 1192.0 | 0.0 |
| 75% | 1074.0 | 28.6625 | 86.9725 | 1696.0 | 47.005 | 223.8825 | 3647.5 | 1383.55 | 0.0 |
| max | 1099.0 | 41.27 | 112.6 | 2597.0 | 75.99 | 269.64 | 4997.0 | 2162.6 | 1.0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Equipment_ID            1200 non-null   int64  
 1   Equipment_Type          1200 non-null   object 
 2   Air_Temperature_C       1200 non-null   float64
 3   Process_Temperature_C   1200 non-null   float64
 4   Rotational_Speed_RPM    1200 non-null   int64  
 5   Torque_Nm               1200 non-null   float64
 6   Hydraulic_Pressure_bar  1200 non-null   float64
 7   Operating_Hours         1200 non-null   int64  
 8   Daily_Production_Tons   1200 non-null   float64
 9   Equipment_Failure       1200 non-null   int64  
dtypes: float64(5), int64(4), object(1)
memory usage: 93.9+ KB


## Dataset Overview

The dataset contains operational data from mining equipment, including
temperature readings, mechanical stress indicators, production levels,
and failure status.

The target variable is:
- Equipment_Failure (0 = Normal operation, 1 = Failure)

## 📈 Breakdown of the failure rate

In [4]:
df["Equipment_Failure"].value_counts(normalize=True)

Equipment_Failure
0    0.885
1    0.115
Name: proportion, dtype: float64

### Failure Distribution

Normal operation accounts for 88.5% of the 1,200 observations and
failures for 11.5% (138 records). This imbalance reflects real-world
industrial environments, where failures are relatively rare but critical,
and it means that accuracy alone is a misleading performance metric.
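
With an 11.5% failure rate, a model can look accurate while missing every failure. The sketch below rebuilds the label counts from the reported proportions (1,200 rows, 11.5% failures) and scores a hypothetical "always predict normal" baseline. The baseline is an illustration only and is not part of the notebook.

```python
# Labels rebuilt from the reported proportions: 1200 rows, 11.5% failures.
n_total = 1200
n_failures = round(n_total * 0.115)  # 138 failure records
y_true = [1] * n_failures + [0] * (n_total - n_failures)

# Hypothetical baseline: always predict "normal operation" (0).
y_pred = [0] * n_total

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall on the failure class: share of real failures that were predicted.
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = detected / n_failures

print(f"accuracy={accuracy:.3f}, failure recall={recall:.3f}")
```

The baseline reaches 88.5% accuracy with zero recall on failures, which is why recall, precision, and ROC-AUC are more informative than accuracy for this dataset.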

## 🛠️ Analysis by equipment type

In [5]:
df.groupby("Equipment_Type")["Equipment_Failure"].mean().sort_values(ascending=False)

Equipment_Type
Excavator     0.131579
Conveyor      0.116279
Crusher       0.112903
Haul Truck    0.098246
Name: Equipment_Failure, dtype: float64

### Failures by Equipment Type

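Excavators show the highest failure rate (13.2%) and haul trucks the
lowest (9.8%), with conveyors and crushers in between. The spread is only
about 3 percentage points, so equipment type alone is a weak signal.
It can help prioritize maintenance resources, but it should not be read
as evidence of higher mechanical stress without further analysis.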
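
Whether the gap between equipment types is more than noise can be checked with a chi-square test of independence. The sketch below is not part of the original notebook: it derives failure counts from the per-type failure rates above and the per-type row counts reported near the end of the notebook (count multiplied by rate, rounded), then computes the Pearson statistic by hand. The 5% critical value for 3 degrees of freedom (7.815) is a standard table value.

```python
# Row counts per type (from the notebook's value_counts output) and
# failure counts derived as count * failure rate, rounded to integers.
counts = {"Excavator": 304, "Conveyor": 301, "Crusher": 310, "Haul Truck": 285}
rates = {"Excavator": 0.131579, "Conveyor": 0.116279,
         "Crusher": 0.112903, "Haul Truck": 0.098246}
failures = {k: round(counts[k] * rates[k]) for k in counts}

total = sum(counts.values())
overall_rate = sum(failures.values()) / total

# Pearson chi-square statistic for the 4 x 2 table (type x failure status).
chi2 = 0.0
for k, n in counts.items():
    for observed, expected in (
        (failures[k], n * overall_rate),            # failure cell
        (n - failures[k], n * (1 - overall_rate)),  # normal cell
    ):
        chi2 += (observed - expected) ** 2 / expected

CRITICAL_5PCT_DF3 = 7.815  # chi-square critical value, 3 degrees of freedom
print(failures)
print(f"chi2={chi2:.2f}, significant at 5%: {chi2 > CRITICAL_5PCT_DF3}")
```

The statistic comes out at roughly 1.6, well below the critical value, so under these assumptions the differences between equipment types are not statistically significant at the 5% level.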

## 🌡️ Temperature & breakdowns

In [6]:
df.groupby("Equipment_Failure")[
    ["Air_Temperature_C", "Process_Temperature_C"]
].mean()

| Equipment_Failure | Air_Temperature_C | Process_Temperature_C |
|---|---|---|
| 0 | 25.120085 | 79.365414 |
| 1 | 25.005652 | 85.317609 |


### Temperature Impact

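Failed equipment shows a higher average process temperature
(85.3 °C versus 79.4 °C), suggesting overheating as a failure indicator.
Average air temperature is nearly identical in both groups (about 25 °C),
so ambient conditions alone do not separate failures from normal operation.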

## ⚙️ Torque, pressure & speed

In [7]:
df.groupby("Equipment_Failure")[
    ["Torque_Nm", "Hydraulic_Pressure_bar", "Rotational_Speed_RPM"]
].mean()

| Equipment_Failure | Torque_Nm | Hydraulic_Pressure_bar | Rotational_Speed_RPM |
|---|---|---|---|
| 0 | 39.150537 | 210.186073 | 1502.160075 |
| 1 | 46.644275 | 212.083116 | 1497.398551 |


### Mechanical Stress Indicators

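Average torque is noticeably higher for failure records (46.6 Nm versus
39.2 Nm), indicating increased mechanical load on equipment. Hydraulic
pressure is only marginally higher (212.1 bar versus 210.2 bar), and
rotational speed is essentially the same in both groups.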

## ⏱️ Operating hours

In [8]:
df.groupby("Equipment_Failure")["Operating_Hours"].mean()

Equipment_Failure
0    2366.977401
1    2938.804348
Name: Operating_Hours, dtype: float64

### Operating Hours

Failure records have higher average operating hours (about 2,939 versus
2,367), which is consistent with wear and fatigue contributing to breakdowns.

## 🔍 Identifying Pre-Failure Signals (VERY IMPORTANT 🔥)

## Failure Signals Identified

Based on the exploratory analysis, the following signals are associated
with equipment failures:

- High process temperature (overheating)
- Increased mechanical torque
- Long operating hours
- Slightly higher hydraulic pressure (a weak signal on its own)

The effect of production load was not examined in the analysis above and
remains an open question. Each row is a single snapshot rather than a time
series, so the analysis shows association with failure, not how far in
advance a signal appears.

These signals are candidates for early warning indicators in predictive
maintenance systems in mining operations.
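
To compare the signals on a common scale, the group means reported in the `groupby` outputs above can be turned into relative gaps between failure and normal records. This is a rough sketch: the column name `relative_gap_pct` is introduced here for illustration, and a relative gap ignores the spread within each group, so it is a ranking aid rather than a statistical test.

```python
import pandas as pd

# Group means copied from the groupby outputs above (0 = normal, 1 = failure).
means = pd.DataFrame(
    {
        "normal": [25.120085, 79.365414, 39.150537,
                   210.186073, 1502.160075, 2366.977401],
        "failure": [25.005652, 85.317609, 46.644275,
                    212.083116, 1497.398551, 2938.804348],
    },
    index=[
        "Air_Temperature_C", "Process_Temperature_C", "Torque_Nm",
        "Hydraulic_Pressure_bar", "Rotational_Speed_RPM", "Operating_Hours",
    ],
)

# Percentage difference of the failure mean relative to the normal mean.
means["relative_gap_pct"] = ((means["failure"] / means["normal"] - 1) * 100).round(1)
ranked = means.sort_values("relative_gap_pct", ascending=False)
print(ranked)
```

Operating hours (about +24%) and torque (about +19%) show the largest gaps, process temperature follows at about +7.5%, and hydraulic pressure, rotational speed, and air temperature differ by less than 1%.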

## 💡 Business Interpretation

By monitoring these operational signals in real time, mining companies
can anticipate failures before they occur. This allows maintenance teams
to intervene proactively, reducing downtime, maintenance costs, and
production losses.

## 👉 Step 2: Feature Engineering & Data Preparation

The objective of this step is to transform raw operational data into
meaningful features that can be used by machine learning models.

This includes:
- Encoding categorical variables
- Creating new operational indicators
- Scaling numerical features
- Preparing training and testing datasets

## 🧹 Check & Clean

In [9]:
df.isnull().sum()

Equipment_ID              0
Equipment_Type            0
Air_Temperature_C         0
Process_Temperature_C     0
Rotational_Speed_RPM      0
Torque_Nm                 0
Hydraulic_Pressure_bar    0
Operating_Hours           0
Daily_Production_Tons     0
Equipment_Failure         0
dtype: int64

No missing values were detected in the dataset, allowing direct
feature engineering and modeling.

## 🏷️ Encoding categorical variables

In [10]:
df_encoded = pd.get_dummies(df, columns=["Equipment_Type"], drop_first=True)

In [12]:
df_encoded.head(5)

| | Equipment_ID | Air_Temperature_C | Process_Temperature_C | Rotational_Speed_RPM | Torque_Nm | Hydraulic_Pressure_bar | Operating_Hours | Daily_Production_Tons | Equipment_Failure | Equipment_Type_Crusher | Equipment_Type_Excavator | Equipment_Type_Haul Truck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1051 | 33.05 | 89.95 | 1533 | 52.75 | 212.59 | 979 | 1255.0 | 1 | False | True | False |
| 1 | 1092 | 16.77 | 78.37 | 1513 | 47.91 | 228.12 | 731 | 1290.8 | 0 | False | False | False |
| 2 | 1014 | 34.06 | 91.07 | 1476 | 47.89 | 211.86 | 57 | 1142.4 | 1 | False | False | True |
| 3 | 1071 | 29.42 | 70.93 | 2027 | 42.63 | 212.05 | 2244 | 1336.6 | 0 | False | False | True |
| 4 | 1060 | 26.34 | 69.16 | 1909 | 30.38 | 220.33 | 1702 | 1149.4 | 0 | False | True | False |


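Equipment_Type was converted into numerical format using one-hot encoding
to ensure compatibility with machine learning models. Because
drop_first=True is used, Conveyor becomes the baseline category: a row with
all three indicator columns set to False is a conveyor.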
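
The effect of `drop_first=True` is easiest to see on a toy frame. The sketch below uses a hypothetical four-row DataFrame (`toy`) with one row per equipment type, and passes `dtype=int` only to make the output easier to read.

```python
import pandas as pd

# Toy frame with one row per equipment type (illustration only).
toy = pd.DataFrame(
    {"Equipment_Type": ["Excavator", "Conveyor", "Haul Truck", "Crusher"]}
)

# drop_first=True drops the first category in sorted order ("Conveyor").
encoded = pd.get_dummies(
    toy, columns=["Equipment_Type"], drop_first=True, dtype=int
)
print(encoded)

# The Conveyor row (index 1) has zeros in every indicator column.
conveyor_row = encoded.iloc[1]
```

Any code that maps the indicator columns back to a label therefore has to treat "all zeros" as Conveyor.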

## ⚙️ Creating new features (VERY IMPORTANT 🔥)

## 🔹 Thermal stress indicator

In [16]:
df_encoded["Thermal_Stress"] = (
    df_encoded["Process_Temperature_C"] - df_encoded["Air_Temperature_C"]
)

## 🔹 Mechanical load indicator

In [17]:
df_encoded["Mechanical_Load"] = (
    df_encoded["Torque_Nm"] * df_encoded["Hydraulic_Pressure_bar"]
)

## 🔹 Operational intensity indicator

In [18]:
df_encoded["Operational_Intensity"] = (
    df_encoded["Daily_Production_Tons"] / (df_encoded["Operating_Hours"] + 1)
)

These engineered features capture thermal stress, mechanical load,
and operational intensity, which are critical factors in mining
equipment degradation.
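
As a sanity check, the three formulas can be applied to two rows copied from the `df.head(5)` output (index 0 and index 2). The helper name `add_engineered_features` is hypothetical; the notebook assigns the columns directly.

```python
import pandas as pd

# Two rows copied from df.head(5): index 0 (Excavator) and index 2 (Haul Truck).
rows = pd.DataFrame({
    "Air_Temperature_C": [33.05, 34.06],
    "Process_Temperature_C": [89.95, 91.07],
    "Torque_Nm": [52.75, 47.89],
    "Hydraulic_Pressure_bar": [212.59, 211.86],
    "Operating_Hours": [979, 57],
    "Daily_Production_Tons": [1255.0, 1142.4],
})

def add_engineered_features(frame):
    """Return a copy of frame with the three engineered columns added."""
    out = frame.copy()
    out["Thermal_Stress"] = out["Process_Temperature_C"] - out["Air_Temperature_C"]
    out["Mechanical_Load"] = out["Torque_Nm"] * out["Hydraulic_Pressure_bar"]
    # The +1 in the denominator avoids division by zero at 0 operating hours.
    out["Operational_Intensity"] = (
        out["Daily_Production_Tons"] / (out["Operating_Hours"] + 1)
    )
    return out

features = add_engineered_features(rows)
print(features[["Thermal_Stress", "Mechanical_Load", "Operational_Intensity"]].round(2))
```

The row with 979 operating hours gets an operational intensity of about 1.28, while the row with only 57 hours gets about 19.7, so this ratio is very sensitive to low operating-hour values.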

## 🎯 Definition of the target and features

In [19]:
X = df_encoded.drop(columns=["Equipment_Failure", "Equipment_ID"])
y = df_encoded["Equipment_Failure"]

## ✂️ Train/Test Split

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

The dataset was split into training and testing sets while preserving
the failure distribution using stratified sampling.
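
The effect of `stratify=y` can be checked on labels alone. The sketch below rebuilds a label vector with the reported class balance (1,062 normal, 138 failures) and uses a placeholder feature column, so it illustrates the split behaviour without reproducing the notebook's actual rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the reported class balance: 1062 normal, 138 failures.
y = np.array([0] * 1062 + [1] * 138)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"train failure rate: {y_train.mean():.3f}")
print(f"test failure rate:  {y_test.mean():.3f} ({y_test.sum()} of {len(y_test)})")
```

Both subsets keep a failure rate close to 11.5%, which matters here because the test set contains only a few dozen failures.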

## 📏 Data normalization

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

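Feature scaling was applied so that all numerical variables are on a
comparable scale for Logistic Regression. The scaler is fitted on the
training set only and then applied to the test set, which avoids leaking
test-set statistics into training. The Random Forest below is trained on
the unscaled features, since tree-based models are insensitive to scaling.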
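
What "fit on train, transform on test" means in practice can be seen on a single toy feature. The numbers below are made up for illustration; `StandardScaler` subtracts the training mean and divides by the training (population) standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustration only).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean=2.0, std=sqrt(2/3)
test_scaled = scaler.transform(test)        # reuses the training statistics

print("mean:", scaler.mean_, "scale:", scaler.scale_)
print("scaled test value:", test_scaled)
```

The test value is expressed in units of the training distribution (about 2.45 standard deviations above the training mean), which is exactly what the model will see at prediction time.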

## 👉 Step 3: Modeling & Evaluation

The objective of this step is to build machine learning models capable
of predicting equipment failures and to evaluate their performance
using appropriate classification metrics.

Two models are implemented:
- Logistic Regression (baseline)
- Random Forest (non-linear model)

## 📦 Import of models & metrics

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

## 1️⃣ Model: Logistic Regression (Baseline)

In [25]:
log_model = LogisticRegression(max_iter=1000, random_state=42)

log_model.fit(X_train_scaled, y_train)

y_pred_log = log_model.predict(X_test_scaled)
y_prob_log = log_model.predict_proba(X_test_scaled)[:, 1]

## 📊 Evaluation

In [27]:
print("Logistic Regression Results")
print(classification_report(y_test, y_pred_log))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_log))

Logistic Regression Results
              precision    recall  f1-score   support

           0       0.89      0.98      0.94       266
           1       0.33      0.06      0.10        34

    accuracy                           0.88       300
   macro avg       0.61      0.52      0.52       300
weighted avg       0.83      0.88      0.84       300

ROC-AUC: 0.7658115877930118


In [28]:
confusion_matrix(y_test, y_pred_log)

array([[262,   4],
       [ 32,   2]])

### Logistic Regression Interpretation

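Logistic Regression provides a baseline and helps understand linear
relationships between operational variables and equipment failures.
Its ROC-AUC of 0.77 shows some ability to rank failures above normal
records, but at the default 0.5 decision threshold it detects only 2 of
the 34 failures in the test set (recall 0.06). It is therefore not usable
as a failure detector without handling the class imbalance or adjusting
the threshold.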
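
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them from the four cells shown above (scikit-learn's layout is rows = actual class, columns = predicted class) and adds, for comparison, the accuracy of always predicting "no failure" on the same 300 test rows. That comparison is an addition and does not appear in the notebook.

```python
# Confusion matrix from the cell above: rows = actual, columns = predicted.
tn, fp = 262, 4
fn, tp = 32, 2

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted failures that were real
recall = tp / (tp + fn)      # share of real failures that were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting "no failure" on the same test set.
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(f"majority-class baseline accuracy={majority_baseline:.3f}")
```

The model's accuracy (264 of 300) is slightly below the trivial baseline (266 of 300), which underlines that accuracy says little here and that failure recall is the figure to watch.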

## 2️⃣ Model: Random Forest

In [29]:
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42,
    class_weight="balanced"
)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]

## 📊 Evaluation

In [31]:
print("Random Forest Results")
print(classification_report(y_test, y_pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_rf))

Random Forest Results
              precision    recall  f1-score   support

           0       0.89      0.98      0.94       266
           1       0.33      0.06      0.10        34

    accuracy                           0.88       300
   macro avg       0.61      0.52      0.52       300
weighted avg       0.83      0.88      0.84       300

ROC-AUC: 0.8461963732861565


In [32]:
confusion_matrix(y_test, y_pred_rf)

array([[262,   4],
       [ 32,   2]])

### Random Forest Interpretation

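Random Forest captures non-linear relationships and interactions
between operational signals, which raises ROC-AUC from 0.77 to 0.85.
However, the classification report and confusion matrix shown above are
identical to those of Logistic Regression: at the default 0.5 threshold
only 2 of 34 failures are detected. The improvement is therefore in how
well failures are ranked, not in the default predictions, and a lower
decision threshold would be needed to benefit from it.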
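
Both models output failure probabilities through `predict_proba`, while `predict` applies a fixed 0.5 cutoff. With rare failures, a lower cutoff usually catches more failures at the cost of more false alarms. The sketch below illustrates the trade-off on made-up labels and probabilities; the helper `recall_precision_at` is hypothetical and the numbers are not taken from the notebook's models.

```python
import numpy as np

# Made-up labels and predicted failure probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.55, 0.25, 0.40, 0.45, 0.70])

def recall_precision_at(threshold):
    """Return (recall, precision) for the failure class at a given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for threshold in (0.5, 0.35, 0.25):
    recall, precision = recall_precision_at(threshold)
    print(f"threshold={threshold:.2f} recall={recall:.2f} precision={precision:.2f}")
```

In the notebook, the same idea would apply to `y_prob_rf` and `y_test`. The threshold should be chosen on a validation split rather than on the test set, and weighed against the cost of a missed failure versus an unnecessary inspection.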

## 🏆 Model Comparison

In [33]:
pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest"],
    "ROC_AUC": [
        roc_auc_score(y_test, y_prob_log),
        roc_auc_score(y_test, y_prob_rf)
    ]
})

| | Model | ROC_AUC |
|---|---|---|
| 0 | Logistic Regression | 0.765812 |
| 1 | Random Forest | 0.846196 |


Random Forest outperforms Logistic Regression on ROC-AUC (0.85 versus
0.77), making it the preferred model of the two for ranking equipment
by failure risk in mining operations.


## 🔍 Feature Importance (VERY IMPORTANT 🔥)

In [34]:
feature_importance = pd.Series(
    rf_model.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

feature_importance.head(10)

Torque_Nm                 0.197850
Mechanical_Load           0.144362
Operating_Hours           0.130307
Process_Temperature_C     0.114525
Operational_Intensity     0.088686
Thermal_Stress            0.081545
Air_Temperature_C         0.061038
Hydraulic_Pressure_bar    0.057872
Rotational_Speed_RPM      0.054729
Daily_Production_Tons     0.048931
dtype: float64

### Key Drivers of Equipment Failure

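The most influential features, in order of importance, are:
- Torque
- Mechanical load (torque multiplied by hydraulic pressure)
- Operating hours
- Process temperature
- Operational intensity

Hydraulic pressure on its own ranks eighth (0.058). These importances are
impurity-based and describe what the model relies on, not causal effects.
The top variables are candidates for early warning signals of equipment degradation.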
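
Impurity-based importances such as `feature_importances_` are computed on the training data and can favour features with many distinct values, so a common cross-check is permutation importance. The sketch below shows the technique with scikit-learn's `permutation_importance` on synthetic data where only one column carries signal. The data and the feature names `signal`, `noise_a`, and `noise_b` are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))   # synthetic features, not the mining data
y = (X[:, 0] > 0).astype(int)   # only column 0 carries signal

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column in turn and measure the drop in accuracy.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

for name, score in zip(["signal", "noise_a", "noise_b"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

In the notebook, the equivalent call would take `rf_model`, `X_test`, and `y_test`, ideally with `scoring="roc_auc"`, since accuracy barely moves on an imbalanced target.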

## Business Insights from the Model

The model relies mainly on mechanical load (torque in particular),
prolonged usage, and process temperature to separate failures from
normal operation.

By monitoring these key indicators, mining operations can implement
predictive maintenance strategies, reducing downtime, maintenance costs,
and production losses.

## 👉 Step 4: Conclusion & Business Recommendations

## Executive Summary

This project applied data mining and machine learning techniques
to mining equipment operational data inspired by copper and cobalt
mining operations such as Kamoto Copper Company.

The objective was to identify early warning signals of equipment failure
and support predictive maintenance strategies.

The Random Forest model achieved the best ROC-AUC (0.85) and highlighted
key operational factors associated with equipment breakdowns, although at
the default decision threshold it detected only 2 of 34 test-set failures.

## Key Findings

- Equipment failures are associated with high process temperature
- Failure records have about 24% more operating hours on average
- Torque and the derived mechanical load are the strongest model features; hydraulic pressure alone adds little
- Random Forest outperformed Logistic Regression on ROC-AUC (0.85 versus 0.77), but both detect few failures at the default threshold

## Business Recommendations

1. Implement real-time monitoring of temperature and mechanical stress indicators
2. Schedule preventive maintenance for equipment with high operating hours
3. Reduce operational load during extreme thermal conditions
4. Prioritize maintenance resources on high-risk equipment types
5. Use predictive models to plan maintenance and minimize production downtime
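
As a starting point for the first two recommendations, a simple rule-based check can flag readings that exceed fixed limits. The sketch below is hypothetical: it uses the 75th percentiles from the `df.describe()` output as alert thresholds, which is an assumption for illustration and not a validated operating limit, and the function name `exceeded_signals` is invented here.

```python
# Hypothetical alert thresholds: the 75th percentiles from df.describe().
THRESHOLDS = {
    "Process_Temperature_C": 86.9725,
    "Torque_Nm": 47.005,
    "Operating_Hours": 3647.5,
}

def exceeded_signals(reading):
    """Return the names of monitored signals above their threshold."""
    return [name for name, limit in THRESHOLDS.items() if reading[name] > limit]

# First row of the dataset (an excavator recorded as a failure).
reading = {"Process_Temperature_C": 89.95, "Torque_Nm": 52.75, "Operating_Hours": 979}
print(exceeded_signals(reading))
```

For this reading, process temperature and torque are flagged while operating hours are not. In practice, such rules would complement the model's predicted probabilities rather than replace them.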

In [40]:
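def get_equipment_type(row):
    # Rebuild the original label from the one-hot indicator columns.
    if row["Equipment_Type_Crusher"]:
        return "Crusher"
    elif row["Equipment_Type_Excavator"]:
        return "Excavator"
    elif row["Equipment_Type_Haul Truck"]:
        return "Haul Truck"
    else:
        # Conveyor is the baseline category dropped by drop_first=True,
        # so a row with no indicator set is a conveyor, not an unknown type.
        return "Conveyor"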

df_encoded["Equipment_Type"] = df_encoded.apply(get_equipment_type, axis=1)

In [41]:
df["Equipment_Type"].value_counts()

Equipment_Type
Crusher       310
Excavator     304
Conveyor      301
Haul Truck    285
Name: count, dtype: int64

In [42]:
df_encoded.to_excel("cleaned_copper_mining_predictive_maintenance.xlsx", index=False)