# Predictive Maintenance - Classifying whether a machine will fail within the next 7 days

### Author: Kelvin Kipkorir


# Overview
****
Predictive maintenance is a strategy that anticipates failures in industrial machinery before they occur. Unlike traditional maintenance practices, which often involve scheduled servicing or reacting to breakdowns, predictive maintenance enables industries to perform maintenance only when it's truly needed.

Traditional maintenance strategies typically fall into two categories:

1. `Run-to-Failure` – The machine is used until it breaks down, after which it is either repaired or replaced.
2. `Preventive Maintenance` – Maintenance is performed at regular, scheduled intervals, regardless of the machine’s actual condition.

With the development of microprocessors and advances in machine learning (ML), predictive maintenance has become increasingly practical and effective. Unlike physics-based models, which need an explicit description of how each machine degrades, ML models learn failure patterns directly from **large-scale, high-dimensional datasets**, making them well suited to analyzing sensor readings and operational metrics. They are widely used to estimate the **Remaining Useful Life (RUL)** of industrial systems and play a critical role in minimizing costly downtime caused by unexpected failures.

This project aims to support **TechLine Industries** in adopting predictive maintenance across their smart factory operations. By leveraging machine learning — specifically **logistic regression** and **decision trees** — the project classifies whether a machine is likely to fail within the next 7 days. 

# Business Problem
****
TechLine Industries is a growing manufacturing company that operates numerous machines across its manufacturing plants. As part of its digital transformation strategy, the company is investing in predictive maintenance to reduce operational costs, minimize unplanned downtime, and improve equipment reliability. However, with complex sensor data being generated across various machine types, the company lacks a robust system to anticipate failures before they occur.

To address this challenge, historical sensor data will be used to develop a machine learning model that can predict equipment failures up to one week in advance. This will allow maintenance teams to take timely corrective action and help the company avoid the costly downtime typically associated with traditional maintenance strategies.

# Objectives
****
The main objective of this project is to develop a machine learning model that predicts whether a machine is likely to fail within the next 7 days, using sensor and operational data. To achieve this, the following specific objectives will be pursued:

1. **Explore and Understand the Dataset**  
   Analyze the structure, content, and distribution of the dataset to gain insights into the key variables and relationships that may affect machine failures.

2. **Preprocess the Data for Modeling**  
   Clean and transform the data to ensure it is suitable for machine learning, including handling missing values, encoding categorical features, and balancing the classes.

3. **Build and Evaluate Baseline Classification Models**  
   Implement basic models such as logistic regression and decision trees to establish performance benchmarks.

4. **Perform Hyperparameter Tuning**  
   Optimize model performance by tuning hyperparameters, for example with a grid search scored by cross-validation.

5. **Compare Model Performances**  
   Evaluate and compare models using metrics such as accuracy, precision, recall, F1-score, and AUC-ROC to determine the best approach.

6. **Draw Business Insights and Recommendations**  
   Interpret the results to provide actionable insights for TechLine Industries, enabling them to reduce downtime and improve equipment reliability through predictive maintenance.
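
Objectives 2 to 5 can be sketched end to end as follows. This is a minimal illustration on a small synthetic frame, not the project's actual pipeline. The toy data, the failure rule used to generate labels, median imputation, `class_weight="balanced"` as the balancing strategy, and the hyperparameter grids are all assumptions; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption: all values are made up).
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "Machine_Type": rng.choice(["Mixer", "Dryer", "CMM"], size=n),
    "Temperature_C": rng.normal(60, 10, size=n),
    "Vibration_mms": rng.normal(12, 4, size=n),
})
# Hypothetical rule: hotter, more strongly vibrating machines fail more often.
score = 0.08 * (toy["Temperature_C"] - 60) + 0.3 * (toy["Vibration_mms"] - 12) - 2.5
toy["Failure_Within_7_Days"] = (rng.random(n) < 1 / (1 + np.exp(-score))).astype(int)
# Blank out a few readings so the imputer has something to do.
toy.loc[rng.random(n) < 0.05, "Vibration_mms"] = np.nan

X = toy.drop(columns="Failure_Within_7_Days")
y = toy["Failure_Within_7_Days"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Objective 2: impute and scale numeric columns, one-hot encode categoricals.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["Temperature_C", "Vibration_mms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Machine_Type"]),
])

# Objectives 3 and 4: two baseline models, each with a small illustrative grid.
candidates = {
    "logistic_regression": (
        LogisticRegression(class_weight="balanced", max_iter=1000),
        {"model__C": [0.1, 1.0, 10.0]},
    ),
    "decision_tree": (
        DecisionTreeClassifier(class_weight="balanced", random_state=42),
        {"model__max_depth": [3, 5, 10]},
    ),
}

# Objective 5: compare the tuned models on the held-out test set.
results = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("prep", preprocess), ("model", estimator)])
    search = GridSearchCV(pipe, grid, scoring="f1", cv=5)
    search.fit(X_train, y_train)
    proba = search.predict_proba(X_test)[:, 1]
    results[name] = roc_auc_score(y_test, proba)
    print(name, search.best_params_, f"AUC-ROC={results[name]:.3f}")
    print(classification_report(y_test, search.predict(X_test), zero_division=0))
```

Keeping the preprocessing inside the `Pipeline` means it is refit on each cross-validation fold, so statistics such as the imputation median are never computed from validation rows.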


# Data Understanding

In [1]:
#importing standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [6]:

# read the CSV directly from the zip archive
df = pd.read_csv('industrial_iot_data.zip', compression='zip')
df

| | Machine_ID | Machine_Type | Installation_Year | Operational_Hours | Temperature_C | Vibration_mms | Sound_dB | Oil_Level_pct | Coolant_Level_pct | Power_Consumption_kW | ... | Failure_History_Count | AI_Supervision | Error_Codes_Last_30_Days | Remaining_Useful_Life_days | Failure_Within_7_Days | Laser_Intensity | Hydraulic_Pressure_bar | Coolant_Flow_L_min | Heat_Index | AI_Override_Events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MC_000000 | Mixer | 2027 | 81769 | 73.43 | 12.78 | 83.72 | 36.76 | 68.74 | 84.95 | ... | 5 | True | 3 | 162.0 | False | NaN | NaN | NaN | NaN | 2 |
| 1 | MC_000001 | Industrial_Chiller | 2032 | 74966 | 58.32 | 14.99 | 77.04 | 100.00 | 62.13 | 154.61 | ... | 2 | True | 4 | 147.0 | False | NaN | NaN | 40.92 | NaN | 2 |
| 2 | MC_000002 | Pick_and_Place | 2003 | 94006 | 49.63 | 23.78 | 69.08 | 42.96 | 35.96 | 51.90 | ... | 1 | True | 6 | 0.0 | True | NaN | NaN | NaN | NaN | 2 |
| 3 | MC_000003 | Vision_System | 2007 | 76637 | 63.73 | 12.38 | 85.58 | 94.90 | 48.94 | 75.61 | ... | 1 | False | 4 | 161.0 | False | NaN | NaN | NaN | NaN | 0 |
| 4 | MC_000004 | Shuttle_System | 2016 | 20870 | 42.77 | 4.42 | 96.72 | 47.56 | 53.78 | 224.93 | ... | 2 | False | 1 | 765.0 | False | NaN | NaN | NaN | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499995 | MC_499995 | Vacuum_Packer | 2011 | 14425 | 65.42 | 16.50 | 81.95 | 59.21 | 73.67 | 255.87 | ... | 3 | False | 0 | 820.0 | False | NaN | NaN | NaN | NaN | 0 |
| 499996 | MC_499996 | Conveyor_Belt | 2003 | 75501 | 44.83 | 12.88 | 64.94 | 73.69 | 29.25 | 198.37 | ... | 1 | False | 4 | 34.0 | False | NaN | NaN | NaN | NaN | 0 |
| 499997 | MC_499997 | CMM | 2039 | 19855 | 37.26 | 11.46 | 70.70 | 70.70 | 49.04 | 156.59 | ... | 2 | False | 4 | 815.0 | False | NaN | NaN | NaN | NaN | 0 |
| 499998 | MC_499998 | Dryer | 2035 | 86823 | 67.72 | 16.76 | 77.45 | 97.00 | 15.40 | 132.33 | ... | 2 | True | 0 | 99.0 | False | NaN | NaN | NaN | NaN | 2 |
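
As a small aside on the loading step, `pd.read_csv` can read a CSV straight out of a zip archive as long as the archive holds a single file. The round trip below is a sketch under that assumption; the file names `sample.zip` and `sample.csv` and the two-row frame are hypothetical and unrelated to `industrial_iot_data.zip`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; the values are only for illustration.
sample = pd.DataFrame({
    "Machine_ID": ["MC_000000", "MC_000001"],
    "Temperature_C": [73.43, 58.32],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.zip")
    # Write a zip archive that contains exactly one CSV member.
    sample.to_csv(
        path,
        index=False,
        compression={"method": "zip", "archive_name": "sample.csv"},
    )
    # compression="zip" is explicit here; the default "infer" would also
    # pick zip based on the .zip file extension.
    loaded = pd.read_csv(path, compression="zip")

print(loaded)
```

Writing with `index=False` keeps the row index out of the file, which avoids an extra unnamed index column appearing when the file is read back.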


In [12]:
# look at the columns and check whether there are any missing values
df.info()
print("The dataset shape : {}".format(df.shape))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 22 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Machine_ID                  500000 non-null  object 
 1   Machine_Type                500000 non-null  object 
 2   Installation_Year           500000 non-null  int64  
 3   Operational_Hours           500000 non-null  int64  
 4   Temperature_C               500000 non-null  float64
 5   Vibration_mms               500000 non-null  float64
 6   Sound_dB                    500000 non-null  float64
 7   Oil_Level_pct               500000 non-null  float64
 8   Coolant_Level_pct           500000 non-null  float64
 9   Power_Consumption_kW        500000 non-null  float64
 10  Last_Maintenance_Days_Ago   500000 non-null  int64  
 11  Maintenance_History_Count   500000 non-null  int64  
 12  Failure_History_Count       500000 non-null  int64  
 13  AI_Supervision

In [13]:
# count the missing values in each column

df.isna().sum()

Machine_ID                         0
Machine_Type                       0
Installation_Year                  0
Operational_Hours                  0
Temperature_C                      0
Vibration_mms                      0
Sound_dB                           0
Oil_Level_pct                      0
Coolant_Level_pct                  0
Power_Consumption_kW               0
Last_Maintenance_Days_Ago          0
Maintenance_History_Count          0
Failure_History_Count              0
AI_Supervision                     0
Error_Codes_Last_30_Days           0
Remaining_Useful_Life_days         0
Failure_Within_7_Days              0
Laser_Intensity               484844
Hydraulic_Pressure_bar        469660
Coolant_Flow_L_min            454376
Heat_Index                    454786
AI_Override_Events                 0
dtype: int64

The columns `Laser_Intensity`, `Hydraulic_Pressure_bar`, `Coolant_Flow_L_min`, and `Heat_Index` are mostly empty: each is missing in roughly 91% to 97% of the 500,000 rows. All other columns are complete.
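
To quantify this, the missing counts can be expressed as a share of all rows and broken down by machine type. The sketch below uses a tiny hand-made frame named `toy` as a stand-in for `df` (an assumption, with made-up values), because whether the gaps in the real data actually follow `Machine_Type` has not been checked here.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; it stands in for `df` and its values are made up.
toy = pd.DataFrame({
    "Machine_Type": ["Mixer", "Industrial_Chiller", "Mixer", "Industrial_Chiller"],
    "Coolant_Flow_L_min": [np.nan, 40.92, np.nan, 38.10],
    "Heat_Index": [np.nan, np.nan, np.nan, np.nan],
    "Temperature_C": [73.43, 58.32, 49.63, 63.73],
})

# Percentage of missing values per column, highest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Fraction of missing values per machine type for the sparse sensor columns.
sparse_cols = ["Coolant_Flow_L_min", "Heat_Index"]
missing_by_type = toy[sparse_cols].isna().groupby(toy["Machine_Type"]).mean()
print(missing_by_type)
```

If a column turns out to be missing for every machine type except one, that suggests a sensor specific to that machine type rather than random data loss, which in turn affects whether dropping or imputing the column is the better choice.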