<a href="https://colab.research.google.com/github/neharika950/Predictive-Maintenance-and-Fault-Analysis-in-Power-Transformers/blob/main/Predictive_Maintenance_and_Fault_Analysis_in_Power_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project, we focused on analyzing the health and life expectancy of transformers by leveraging data from various parameters, such as gas concentrations, dielectric rigidity, and water content. Our approach involved using machine learning techniques, including Random Forest and Support Vector Regression, to predict the health index and life expectancy of transformers based on historical data. Additionally, we applied clustering algorithms like K-Means for pattern recognition and anomaly detection to identify critical factors affecting transformer performance. This analysis is crucial for improving transformer maintenance schedules, reducing unplanned downtimes, and extending the operational lifespan of transformers in power systems.

We aimed to predict the health index of transformers from features such as gas concentrations, dielectric rigidity, and water content. Missing values were filled with column means and the features were standardized. We split the data into training and testing sets (80/20) and trained a Random Forest Regressor. The model was evaluated using Mean Squared Error (MSE) and R² score, and reached an R² of 0.74 on the test set.
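As a minimal sketch of the mean-imputation step described above, the snippet below fills gaps in a tiny made-up frame. The column names and values are illustrative assumptions, not the project's data; `numeric_only=True` is used so that a text column does not break the mean calculation.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; column names and values are illustrative only.
df = pd.DataFrame({
    "Hydrogen": [10.0, np.nan, 30.0],
    "Water content": [5.0, 7.0, np.nan],
    "Site": ["A", "B", "C"],  # non-numeric column, left untouched
})

# Column means over numeric columns only; NaNs are skipped by default.
col_means = df.mean(numeric_only=True)

# Passing a Series to fillna fills each column with its own mean.
filled = df.fillna(col_means)
print(filled)
```

Mean imputation keeps every row but shrinks the variance of the filled columns, so it is worth checking how many values were actually missing before relying on it.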

Additionally, we performed exploratory data analysis (EDA) to visualize relationships between the features and health index using pair plots, and applied Principal Component Analysis (PCA) for dimensionality reduction, visualizing the results in a 2D scatter plot. We also used K-Means clustering to identify patterns in the data, determining the optimal number of clusters through the elbow method and visualizing the clusters based on selected features. This analysis provides valuable insights into the factors influencing transformer health and can aid in predictive maintenance.
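The elbow method mentioned above can be sketched on synthetic data as follows. This is an illustration under assumptions: the blobs from `make_blobs` stand in for the transformer features, and `n_init=10` is set explicitly so the result does not depend on the installed scikit-learn default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three groups (not the transformer dataset).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for k = 1..6 and record the within-cluster sum of squares.
ks = list(range(1, 7))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# The "elbow" is where adding another cluster stops reducing inertia by much.
drops = -np.diff(inertias)
for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.2f}")
print("Reduction gained by each extra cluster:", np.round(drops, 2))
```

Reading the elbow is a judgment call; on real data the bend is often less sharp than on well-separated synthetic blobs.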


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA

# Load the dataset
data = pd.read_csv('/content/Health index1(1)(1).csv')  # Adjust file name/path accordingly

# Display the first few rows
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Fill missing values in numeric columns with the column mean
data.fillna(data.mean(numeric_only=True), inplace=True)

# Define features and target variable
X = data.drop(['Health index', 'Life expectation'], axis=1)  # Drop target variables
y = data['Health index']

# Optional: Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Build a regression model (Random Forest Regressor in this case)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R^2 Score: {r2:.2f}')

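# Exploratory Data Analysis (EDA)
# Pairplot of each feature against the health index
g = sns.pairplot(data, x_vars=X.columns, y_vars='Health index', height=2.5)
# Use a figure-level title; plt.title would only label the last subplot
g.figure.suptitle('Pairplot of Features vs Health Index', y=1.02)
plt.show()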

# PCA for dimensionality reduction (optional)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plotting PCA results
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.title('PCA of Transformer Health Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Health Index')
plt.show()

# Feature importance
importance = model.feature_importances_
feature_names = X.columns
indices = np.argsort(importance)[::-1]

plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importance[indices], align="center")
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()
# K-Means clustering (StandardScaler is already imported above)
from sklearn.cluster import KMeans

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data.drop(columns=['Health index', 'Life expectation']))

# Determine the optimal number of clusters using the elbow method
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init for consistent results across sklearn versions
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot the elbow graph
plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal Clusters')
plt.show()

# Fit KMeans with the number of clusters read from the elbow plot (3 is an example value)
optimal_clusters = 3
kmeans = KMeans(n_clusters=optimal_clusters, n_init=10, random_state=42)
data['Cluster'] = kmeans.fit_predict(X_scaled)

# Visualize the clusters
sns.scatterplot(data=data, x='Hydrogen', y='Methane', hue='Cluster', palette='viridis')
plt.title('K-Means Clustering of Transformer Data')
plt.show()



FileNotFoundError: [Errno 2] No such file or directory: '/content/Health index1(1)(1).csv'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In this analysis, we predicted the life expectancy of transformers using a linear regression model. The data was split into training and testing sets, and the model was trained on the training data. After making predictions on the test set, the model's performance was evaluated using Mean Squared Error (MSE) and R² score. The model achieved an R² score of 0.53, indicating a moderate fit to the data, and an MSE of 133.79. While the model shows some predictive power, there is potential for improvement in capturing the complexities of the data.
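To make the two metrics concrete, here is a small worked example in plain Python. The numbers are made up for illustration and the helper names `mse_by_hand` and `r2_by_hand` are hypothetical; the notebook itself uses scikit-learn's `mean_squared_error` and `r2_score`.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (not taken from the transformer dataset).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mse_by_hand(y_true, y_pred)  # residuals 1, 0, -1, 0 -> 2 / 4
r2 = r2_by_hand(y_true, y_pred)    # 1 - 2 / 20
print(f"MSE: {mse:.2f}")  # MSE: 0.50
print(f"R^2: {r2:.2f}")   # R^2: 0.90
```

MSE is expressed in squared units of the target, so it cannot be compared across targets with different scales, whereas R² is unitless.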

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Prepare data
X = data.drop(columns=['Health index', 'Life expectation'])
y = data['Life expectation']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predictions
y_pred = lin_reg.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R^2 Score: {r2:.2f}')





NameError: name 'data' is not defined

In this analysis, a Random Forest Regressor was used to predict the health index of transformers. The model was trained on the dataset, and the importance of each feature in predicting the health index was extracted. A bar plot was created to visualize these feature importances, highlighting which factors contribute most to the model's predictions. This analysis helps identify key features that impact transformer health, which can be used for further optimization and decision-making in transformer maintenance.
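The following sketch shows how impurity-based importances behave on synthetic data where only one feature drives the target. The data and the feature names `f0`, `f1`, `f2` are assumptions made for illustration; they are not the transformer features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends almost entirely on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)

# Importances are normalized to sum to 1; sort them from largest to smallest.
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]
names = np.array(["f0", "f1", "f2"])  # hypothetical feature names
for name, score in zip(names[order], importances[order]):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances describe what the fitted forest used, not causal influence, and correlated features can share or swap importance between them.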

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Prepare data
X = data.drop(columns=['Health index', 'Life expectation'])
y = data['Health index']

# Fit Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# Get feature importances
importances = rf_model.feature_importances_
features = X.columns

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x=importances, y=features)
plt.title('Feature Importance from Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()


NameError: name 'data' is not defined

In this analysis, an Isolation Forest model was used to detect anomalies in transformer data. The model, trained with a contamination rate of 5%, classifies data points as either normal (1) or anomalous (-1). A scatter plot was created to visualize the anomalies, using the 'Interfacial V' and 'Dielectric rigidity' features. The anomalies are highlighted in red, while normal data points are shown in green. This approach helps identify outliers that could indicate potential issues in transformer performance or other underlying problems.
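A minimal sketch of the labeling convention described above, run on synthetic points rather than the transformer data. The cluster of injected outliers is an assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in: 190 typical points plus 10 points far from the bulk.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([normal, outliers])

# contamination=0.05 sets the score threshold so roughly 5% of points are flagged.
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # 1 = normal, -1 = anomalous

n_anomalies = int((labels == -1).sum())
print(f"{n_anomalies} of {len(labels)} points flagged as anomalous")
```

Because `contamination` fixes the share of flagged points in advance, the model will report about 5% anomalies even on data that contains none, so flagged units still need engineering review.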

In [None]:
from sklearn.ensemble import IsolationForest

# Fit the model
iso_forest = IsolationForest(contamination=0.05, random_state=42)
data['Anomaly'] = iso_forest.fit_predict(X_scaled)

# Visualize anomalies
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Interfacial V', y='Dielectric rigidity', hue='Anomaly', palette={1: 'green', -1: 'red'})
plt.title('Anomaly Detection in Transformer Data')
plt.xlabel('Interfacial V')
plt.ylabel('Dielectric rigidity')
plt.show()


NameError: name 'X_scaled' is not defined

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Step 1: Load your dataset
# Adjust the file name/path to match where your dataset is stored
data = pd.read_csv('/content/Health index1(1).csv')

# Display the first few rows of the dataset
print("Dataset Preview:")
print(data.head())

# Step 2: Categorize Life Expectation
# Define categories based on Life expectation
bins = [0, 10, 20, np.inf]  # Adjust these bins based on your data
labels = ['Low', 'Medium', 'High']
# include_lowest=True keeps a value of exactly 0 in the 'Low' bin instead of turning it into NaN
data['Life expectation category'] = pd.cut(data['Life expectation'], bins=bins, labels=labels, include_lowest=True)

# Check the updated DataFrame
print("\nUpdated DataFrame with Life Expectation Categories:")
print(data[['Life expectation', 'Life expectation category']].head())

# Step 3: Prepare Data for SVM
# Features and target variable
X = data.drop(columns=['Life expectation', 'Life expectation category'])
y = data['Life expectation category']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Create and train the SVM classification model
svm_model = SVC(kernel='rbf')  # Using the Radial Basis Function kernel
svm_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test_scaled)

# Step 5: Generate and plot the confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)

# Plot the confusion matrix on an explicitly sized axes
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap='Blues', ax=ax)
ax.set_title('Confusion Matrix for Life Expectation Prediction')
plt.show()



FileNotFoundError: [Errno 2] No such file or directory: '/content/Health index1(1).csv'

In this analysis, a Support Vector Regression (SVR) model with a Radial Basis Function (RBF) kernel was applied to predict the "Life Expectation" variable from the dataset. The dataset was first split into training and testing sets, and the features were standardized using StandardScaler, because an RBF-kernel SVR is sensitive to feature scale. The scaler was fitted on the training set only and then applied to the test set. After training the model, predictions were made on the test set, and the performance was evaluated using Mean Squared Error (MSE) and R² score. A scatter plot of true vs predicted values was also generated to visually assess the model's performance, with a diagonal line representing perfect predictions. The MSE and R² printed by the cell quantify how closely the model approximates life expectancy.
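One way to keep the scaler and the SVR together is a scikit-learn pipeline, sketched below on synthetic data. The two features with very different scales are an assumption made for illustration; `make_pipeline` is a swapped-in convenience, not what the notebook cell uses.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics when transforming the test data inside predict().
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)

print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R^2 Score: {r2_score(y_test, y_pred):.2f}")
```

Bundling the steps this way also makes it harder to accidentally fit the scaler on the full dataset, which would leak test-set statistics into training.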

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset (adjust the file name/path to match where your dataset is stored)
data = pd.read_csv('/content/Health index1(1)(1).csv')

# Inspect the data
print(data.head())

# Handle missing values if necessary
# Assuming no missing values since you've checked before
# If there are missing values, you can fill or drop them:
# data.fillna(data.mean(), inplace=True)

# Separate features and target variable
X = data.drop(columns=['Life expectation'], errors='ignore')  # Exclude the target column
y = data['Life expectation']  # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Create and train the SVR model
svr_model = SVR(kernel='rbf')  # Using the Radial Basis Function kernel
svr_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = svr_model.predict(X_test_scaled)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R² Score: {r2:.2f}')

# Plotting the true vs predicted values
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('True vs Predicted Life Expectation')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)  # Diagonal line
plt.xlim(y.min(), y.max())
plt.ylim(y.min(), y.max())
plt.show()



FileNotFoundError: [Errno 2] No such file or directory: '/content/Health index1(1)(1).csv'