In [None]:
# Mini-project 5.3 Detecting the anomalous activity of a ship's engine

**Welcome to your first mini-project: Detecting the anomalous activity of a ship's engine!**

This mini-project lets you work on a real-world challenge, detecting the anomalous activity of a ship's engine, and apply the data science skills and concepts you have learned over the past few weeks.

A poorly maintained ship engine in the supply chain industry can lead to inefficiencies, increased fuel consumption, higher risks of malfunctions, and potential safety hazards. Your challenge in this project is to apply critical thinking and ML concepts to design and implement a robust anomaly detection model.

Please set aside approximately **12 hours** to complete the mini-project.

<br></br>

## **Business context**
You are provided with a real data set to identify anomalous activity in a ship's engine functionality (Devabrat, 2022). As you work through this project, keep in mind that anomalies typically make up a minority of the data points (i.e., about 1% to 5% of the data points would be anomalies).

The data set contains six important features continuously monitored to evaluate the engine's status as 'good' or 'bad'. These features are:
- **Engine rpm (revolutions per minute):**

A high rpm indicates the engine is operating at a higher speed than designed for prolonged periods, which can lead to overheating, excessive wear, and eventual failure.

A low rpm could signal a lack of power, issues with fuel delivery, or internal mechanical problems.

- **Lubrication oil pressure:**

Low lubrication oil pressure indicates insufficient lubrication, leading to increased friction, overheating, and engine damage.

A high lubrication oil pressure could signal a blockage in the oil delivery system, potentially causing seal or gasket failure.

- **Fuel pressure:**

Low fuel pressure can cause poor engine performance and incomplete combustion, indicating fuel pump or filter issues.

A high fuel pressure may result in excessive fuel consumption, poor emissions, or damage to the fuel injectors.

- **Coolant pressure:**

Low coolant pressure indicates a potential leak in the cooling system or a coolant pump failure, risking engine overheating.

A high coolant pressure could be a sign of a blockage in the cooling system or a failing head gasket, which can also lead to overheating.

- **Lubrication oil temperature:**

High lubrication oil temperature suggests the oil is overheating, which can degrade its lubricating properties and lead to engine damage.

A low lubrication oil temperature may indicate it is not reaching its optimal operating temperature, potentially causing inadequate lubrication.

- **Coolant temperature:**

High coolant temperature signals overheating, which can be caused by various issues, including a failed thermostat, a coolant leak, or insufficient coolant flow.

A low coolant temperature could suggest the engine is not reaching its optimal operating temperature, affecting performance and efficiency.

Engine issues could lead to malfunctions, safety hazards, and downtime (e.g. delayed deliveries). These can break down a ship's overall functionality and, in turn, affect the business, for example by reducing revenue when goods are not delivered. By predicting when maintenance is needed, the business aims to increase profit by reducing downtime, reducing safety risks for the crew, limiting fuel consumption, and increasing customer satisfaction through timely deliveries.

Your task is to develop a robust anomaly detection system to protect a company's shipping fleet by evaluating engine functionality. Therefore, you'll explore the data and:
- employ preprocessing and feature engineering
- perform anomaly detection.

You must prepare a report illustrating your insights to the prospective stakeholders, explaining your approach in identifying anomalies, presenting your findings and including recommendations.

<br></br>

> **Disclaimer**
>
> Please note that although a real-life data set was provided, the business context in this project is fictitious. Any resemblance to companies and persons (living or dead) is coincidental. The course designers and hosts assume no responsibility or liability for any errors or omissions in the content of the business context and data sets. The information in the data sets is provided on an 'as is' basis with no guarantees of completeness, accuracy, usefulness, or timeliness.

<br></br>

## **Objective**
By the end of this mini-project, you will be able to understand and apply statistical and ML methods for detecting anomalies.

In the Notebook, you will:
- explore the data set
- preprocess the data and conduct feature engineering
- apply statistical techniques to detect anomalies
- use ML algorithms to detect anomalies.

You will also write a report summarising the results of your findings and recommendations.

<br></br>

## **Assessment criteria**
By completing this project, you will be able to provide evidence that you can:
- demonstrate enhanced problem-solving skills and propose strategic solutions by systematically analysing complex organisational challenges
- identify meaningful patterns in complex data to evidence advanced critical and statistical thinking skills
- select statistical techniques appropriate to a solutions design approach and evidence the ability to evaluate their effectiveness
- demonstrate enhanced data representation and improved model performance by systematically implementing relevant techniques
- design innovative solutions through critically selecting, evaluating and implementing effective unsupervised learning techniques.

<br></br>

## **Project guidance**
1. Import the required libraries and data set with the provided URL.
2. View the DataFrame and perform EDA, including identifying missing or duplicate values.
3. Generate the descriptive statistics of the data, including:
  - observing the mean for each feature
  - identifying the median.
4. Visualise the data to determine the distribution and extreme values.
5. Perform anomaly detection with a statistical method and identify possible anomalies. Specifically:
  - Use the interquartile range (IQR) method to identify outliers for each feature.
  - Create a new column (corresponding to each feature) that will indicate (in binary: 0 or 1) whether the value of that feature is an outlier as per the IQR calculations.
  - Use IQR to identify the number of features that must simultaneously be in outlier condition, in order for a sample to be classified as an outlier, such that the total percentage of samples identified as outliers falls within the 1-5% range.
  - Record your thoughts and observations.
6. Perform anomaly detection with ML models:
  - Using one-class SVM,
    - identify possible anomalies
    - visualise the output in 2D after performing PCA and ensure the outliers are in a different colour
    - apply different combinations of parameter settings to improve the model's outlier predictions to the expected 1-5%
    - record your insights about the use of this method.
  - Using Isolation Forest,
    - identify possible anomalies
    - visualise the output in 2D after performing PCA and ensure the outliers are in a different colour
    - apply different combinations of parameter settings to improve the model's outlier predictions to the expected 1-5%
    - record your insights about the use of this method.
7. Document your approach and major inferences from the data analysis and describe which method (and parameters) provided the best results and why.
8. When you've completed the activity:
  - Download your completed Notebook as an IPYNB (Jupyter Notebook). Save the file as follows: LastName_FirstName_CAM_C101_W5_Mini-project.ipynb
  - Prepare a detailed report (between 800 and 1,000 words) that includes:
    - an overview of the problem that is being addressed in this project
    - an overview of your approach, with a clear visualisation of your anomaly detection approach
    - key figures and tables of the main results
    - interpretation of the anomaly detection results
    - an evaluation of the effectiveness of 2D PCA plots in highlighting outliers
    - recommendations based on gathered evidence.
  - Save the document as a PDF named according to the following convention: LastName_FirstName_CAM_C101_W5_Mini-project.pdf.
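
Step 5 of the guidance above (IQR flags per feature, then a count of simultaneously flagged features) can be sketched as follows. This is a minimal illustration on synthetic data, not the engine data set: the column names `a`, `b`, `c` and the helper `iqr_outlier_flags` are hypothetical, and the 1.5 × IQR fence is the conventional choice, assumed here because the guidance does not state a multiplier.

```python
import numpy as np
import pandas as pd


def iqr_outlier_flags(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a 0/1 frame: 1 where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.astype(int).add_suffix("_outlier")


# Synthetic stand-in for the engine features (hypothetical column names)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])

flags = iqr_outlier_flags(demo)
n_flagged = flags.sum(axis=1)  # number of features in outlier condition per row

# Share of rows classified as outliers when at least m features must be flagged
for m in range(1, flags.shape[1] + 1):
    share = (n_flagged >= m).mean() * 100
    print(f"at least {m} feature(s) flagged: {share:.2f}% of rows")
```

On the real data, the value of `m` to keep would be the one whose share falls inside the expected 1-5% range, if any does.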


<br></br>
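
The one-class SVM part of step 6 in the project guidance could look roughly like the sketch below. It runs on synthetic data rather than the engine data set, and the `nu` and `gamma` values tried are illustrative assumptions, not recommended settings. In scikit-learn, `nu` is an upper bound on the fraction of training errors, so it is the natural parameter to vary when aiming for a 1-5% outlier share.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the six engine features (not the real data)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))

# The RBF kernel depends on distances, so features are standardised first
X_scaled = StandardScaler().fit_transform(X)

# Try a few (nu, gamma) combinations and report the share flagged as outliers
for nu in (0.01, 0.03, 0.05):
    for gamma in ("scale", 0.1):
        trial = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit_predict(X_scaled)
        share = (trial == -1).mean() * 100  # -1 marks predicted outliers
        print(f"nu={nu}, gamma={gamma}: {share:.2f}% flagged")

# Fit one chosen setting and project to 2D so outliers can be coloured in a plot
labels = OneClassSVM(kernel="rbf", nu=0.03, gamma="scale").fit_predict(X_scaled)
coords = PCA(n_components=2).fit_transform(X_scaled)
is_outlier = labels == -1
# e.g. plt.scatter(coords[:, 0], coords[:, 1], c=is_outlier, cmap="coolwarm", s=8)
```

The commented `plt.scatter` call shows one way to colour the outliers differently; it assumes `matplotlib.pyplot` has been imported as `plt`.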
> **Declaration**
>
> By submitting your project, you indicate that the work is your own and has been created with academic integrity. Refer to the Cambridge plagiarism regulations.
> Start your activity here. Select the pen from the toolbar to add your entry.
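
The Isolation Forest part of step 6 in the project guidance can be sketched in the same spirit. The data below are synthetic, and the `contamination` values are assumptions chosen to span the expected 1-5% range. In scikit-learn, `contamination` sets the score threshold, so the flagged share tracks it closely on the training data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the six engine features (not the real data)
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 6))

# Sweep the contamination setting and report the share flagged as outliers
for contamination in (0.01, 0.03, 0.05):
    model = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=0
    )
    labels = model.fit_predict(X)  # -1 marks predicted outliers, 1 marks inliers
    share = (labels == -1).mean() * 100
    print(f"contamination={contamination}: {share:.2f}% flagged")

# Scores from the last fitted model: negative values correspond to outliers
scores = model.decision_function(X)
```

Because tree-based splits do not depend on feature scale, no standardisation step is included here.
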
import pandas as pd

# Load the engine data set from the provided URL
url = 'https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/engine.csv'
df = pd.read_csv(url)
print("DataFrame Head:")
print(df.head())

print("\nDataFrame Info:")
df.info()

print("\nDescriptive Statistics:")
print(df.describe())

print("\nNull Values:")
print(df.isnull().sum())

print("\nDuplicate Rows:")
print(df.duplicated().sum())
import matplotlib.pyplot as plt
import seaborn as sns

# Get the list of feature columns
features = df.columns

# Determine the number of rows and columns for the subplot grid
num_features = len(features)
num_cols = 2  # You can adjust this for desired layout
num_rows = (num_features + num_cols - 1) // num_cols

plt.figure(figsize=(15, num_rows * 5))

for i, feature in enumerate(features):
    plt.subplot(num_rows, num_cols, i + 1)
    sns.scatterplot(x=df.index, y=df[feature])
    plt.title(f'Scatter Plot of {feature}')
    plt.xlabel('Data Point Index')
    plt.ylabel(feature)

plt.tight_layout()
plt.show()

print("These scatter plots show each feature's values across the dataset's index. Extreme values will appear as points far from the main cluster. For a more direct view of distribution and quartiles, box plots and histograms can also be very insightful.")
# Clean the data: drop rows with extreme 'Coolant temp' readings (> 100)
# Filter first, then copy, so later column assignments act on an independent DataFrame
df_cleaned = df[df['Coolant temp'] <= 100].copy()

# Store total samples for later use
total_samples = len(df_cleaned)

print("Descriptive Statistics for 'Coolant temp' after removing values > 100:")
print(df_cleaned['Coolant temp'].describe())

print(f"\nOriginal DataFrame shape: {df.shape}")
print(f"Cleaned DataFrame shape: {df_cleaned.shape}")
print(f"\nTotal samples stored in variable: {total_samples}")

# =============================================================================
# PHASE 1: OPERATING MODE DISCOVERY
# =============================================================================

## Investigation: Are the bimodal distributions caused by two distinct operating modes?

Based on the histograms, several features show bimodal distributions (two bumps). 
I hypothesize that **Engine rpm** is the driving variable creating two distinct operating modes.

**Approach:** 
- Explore the Engine rpm distribution
- Split data into LOW and HIGH operating modes
- Verify that this split explains the bimodal patterns in other features
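
The verification step in the approach above is not spelled out, so here is one possible way to check it, sketched on synthetic data. The columns `rpm`, `pressure` and `temp` are hypothetical: `pressure` is built to depend on `rpm`, while `temp` is unrelated. The standardised gap between mode means is an assumed heuristic, not a formal test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic two-mode data: 'rpm' drives 'pressure'; 'temp' is unrelated
rpm = np.concatenate([rng.normal(600, 50, n), rng.normal(1200, 80, n)])
pressure = 0.004 * rpm + rng.normal(0, 0.3, 2 * n)
temp = rng.normal(78, 3, 2 * n)
demo = pd.DataFrame({"rpm": rpm, "pressure": pressure, "temp": temp})

# Median split into LOW and HIGH modes
demo["mode"] = np.where(demo["rpm"] <= demo["rpm"].median(), "LOW", "HIGH")

# Per-mode mean and standard deviation of every other feature
others = ["pressure", "temp"]
print(demo.groupby("mode")[others].agg(["mean", "std"]))

# Standardised gap between mode means: a large value suggests the split
# accounts for that feature's two peaks; a value near 0 suggests it does not
means = demo.groupby("mode")[others].mean()
gap = (means.loc["HIGH"] - means.loc["LOW"]).abs() / demo[others].std()
print(gap)
```

Overlaid per-mode histograms of each feature would give the same check visually.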

In [None]:
# =============================================================================
# STEP 1.1: Explore Engine RPM distribution
# =============================================================================

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Histogram with KDE
plt.figure(figsize=(12, 6))
sns.histplot(df_cleaned['Engine rpm'], kde=True, bins=50, color='steelblue', edgecolor='black')
plt.title('Distribution of Engine RPM', fontsize=14, fontweight='bold')
plt.xlabel('Engine RPM', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.axvline(df_cleaned['Engine rpm'].median(), color='red', linestyle='--', linewidth=2, label=f'Median: {df_cleaned["Engine rpm"].median():.2f}')
plt.axvline(df_cleaned['Engine rpm'].mean(), color='orange', linestyle='--', linewidth=2, label=f'Mean: {df_cleaned["Engine rpm"].mean():.2f}')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Print descriptive statistics
print("="*80)
print("ENGINE RPM - DESCRIPTIVE STATISTICS")
print("="*80)
print(df_cleaned['Engine rpm'].describe())
print("\nAdditional Statistics:")
print(f"Variance: {df_cleaned['Engine rpm'].var():.2f}")
print(f"Skewness: {df_cleaned['Engine rpm'].skew():.2f}")
print(f"Kurtosis: {df_cleaned['Engine rpm'].kurtosis():.2f}")

print("\n" + "="*80)
print("VISUAL INSPECTION GUIDE")
print("="*80)
print("Look at the histogram above:")
print("  - Do you see TWO distinct peaks (bumps)?")
print("  - Is there a 'valley' between them?")
print("  - Where would you draw a line to separate LOW mode from HIGH mode?")
print("\nSuggested threshold: Use the MEDIAN as the split point (red line)")
print(f"Median RPM: {df_cleaned['Engine rpm'].median():.2f}")
print("="*80)

In [None]:
# =============================================================================
# STEP 1.2: Create operating_mode column
# =============================================================================

# Use median as threshold to split into LOW and HIGH modes
rpm_threshold = df_cleaned['Engine rpm'].median()

# Create operating_mode column (vectorised equivalent of a row-wise apply)
df_cleaned['operating_mode'] = np.where(
    df_cleaned['Engine rpm'] <= rpm_threshold, 'LOW', 'HIGH'
)

# Print summary statistics
print("="*80)
print("OPERATING MODE SPLIT SUMMARY")
print("="*80)
print(f"Threshold used: {rpm_threshold:.2f} RPM (median)\n")

mode_counts = df_cleaned['operating_mode'].value_counts()
print(f"{'Mode':<10} {'Count':>10} {'Percentage':>12}")
print("-"*80)
for mode, count in mode_counts.items():
    pct = (count / total_samples) * 100
    print(f"{mode:<10} {count:>10} {pct:>11.2f}%")
print("="*80)

# Visualize Engine RPM colored by operating mode
plt.figure(figsize=(12, 6))

# Select the RPM values belonging to each mode
low_data = df_cleaned[df_cleaned['operating_mode'] == 'LOW']['Engine rpm']
high_data = df_cleaned[df_cleaned['operating_mode'] == 'HIGH']['Engine rpm']

plt.hist(low_data, bins=30, alpha=0.6, color='blue', label='LOW Mode', edgecolor='black')
plt.hist(high_data, bins=30, alpha=0.6, color='red', label='HIGH Mode', edgecolor='black')

plt.axvline(rpm_threshold, color='green', linestyle='--', linewidth=3, label=f'Threshold: {rpm_threshold:.2f}')
plt.title('Engine RPM Distribution by Operating Mode', fontsize=14, fontweight='bold')
plt.xlabel('Engine RPM', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.legend(fontsize=11)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n" + "="*80)
print(">>> CHECKPOINT 1: STOP HERE")
print("="*80)
print("Please verify:")
print("  1. Does the split look reasonable?")
print("  2. Are the two modes roughly balanced?")
print("  3. Should we use the median, or suggest a different threshold?")
print("\nPlease confirm before I continue to Phase 2.")
print("="*80)
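
On the checkpoint question of whether the median is the right threshold: a median split always yields two equal halves, so it only lands in the valley between two peaks when the modes are about the same size. One alternative, sketched below on synthetic data with unequal modes, is to estimate the density and take the minimum between its two tallest peaks. Whether the real `Engine rpm` column is bimodal at all is not established here, and the use of `scipy` is an assumption about the available environment.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# Synthetic bimodal RPM values with unequal mode sizes (not the real data)
rpm = np.concatenate([rng.normal(600, 60, 1400), rng.normal(1300, 90, 600)])

# Evaluate a kernel density estimate on a regular grid
grid = np.linspace(rpm.min(), rpm.max(), 512)
density = gaussian_kde(rpm)(grid)

peaks, _ = find_peaks(density)
if len(peaks) >= 2:
    # Keep the two tallest peaks, ordered left to right
    left, right = np.sort(peaks[np.argsort(density[peaks])[-2:]])
    # The valley is the lowest density point between those two peaks
    valley = grid[left + np.argmin(density[left:right + 1])]
else:
    valley = float(np.median(rpm))  # fall back to the median if unimodal

print(f"median: {np.median(rpm):.1f}, valley: {valley:.1f}")
```

In this synthetic case the median sits inside the larger low-RPM mode, while the valley falls between the two modes.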