# Lesson 4: Introduction to Machine Learning (Unsupervised Learning)


### 🎯 Learning Objectives
By the end of this lesson, you will be able to:
- Understand what machine learning is and how it differs from traditional programming
- Grasp the concept of unsupervised learning
- Use clustering (e.g., KMeans) to group similar drilling operations or intervals
- Apply dimensionality reduction to high-dimensional sensor datasets
- Visualize and interpret machine-generated groupings in a drilling context


## 📘 Section 1: What is Machine Learning?


- **Concept**: Traditional programming vs. machine learning. In traditional programming we write the rules by hand; in ML the model infers them from data.
- **ML Analogy**: Think of ML as teaching a drilling rig to "spot patterns" in sensor data, without us telling it the rules.
- **Supervised vs. Unsupervised vs. Reinforcement** (Quick visual comparison)
- **Why ML matters in Drilling**: Detect patterns in large data streams, reduce non-productive time (NPT), improve decision-making, automate report generation.


In [None]:

# Exercise 1: Traditional decision logic example
def rpm_status(rpm):
    if rpm < 20:
        return "Slide"
    elif rpm < 70:
        return "Rotate Low"
    elif rpm < 120:
        return "Normal"
    else:
        return "High"

# Example usage
[rpm_status(r) for r in [70, 100, 160]]
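
For contrast with the hand-written thresholds above, here is a minimal sketch of the ML side of the comparison. It assumes synthetic RPM readings (made-up values, not the lesson dataset) and lets KMeans find the groups without being given any cut-offs. The names `rpm`, `centers`, and `learned_thresholds` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic RPM readings (illustrative, not from the lesson dataset):
# three operating regimes centred near 5, 60 and 140 rpm.
rng = np.random.default_rng(0)
rpm = np.concatenate([
    rng.normal(5, 2, 200),
    rng.normal(60, 5, 200),
    rng.normal(140, 8, 200),
]).reshape(-1, 1)  # scikit-learn expects a 2D array: (n_samples, n_features)

# KMeans is given no thresholds; it only sees the values.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(rpm)

# Sort the learned cluster centres from low to high RPM.
centers = np.sort(model.cluster_centers_.ravel())

# For one feature, the boundary between two neighbouring clusters
# is the midpoint between their centres.
learned_thresholds = (centers[:-1] + centers[1:]) / 2

print("cluster centres:", centers.round(1))
print("learned thresholds:", learned_thresholds.round(1))
```

The point of the sketch is the workflow, not the numbers: `rpm_status` encodes thresholds we chose, while the clustering version derives its boundaries from whatever data it is given.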


## 📘 Section 2: Intro to Unsupervised Learning


- **Definition**: Unsupervised = no labeled outputs; model finds structure in data on its own
- **Real-World Analogy**: Sorting drill cuttings by texture without being told which ones are sandstone or shale
- **Use Cases in Drilling**:
  - Cluster similar BHA configurations based on vibration/RPM/WOB
  - Group drilling intervals with similar mechanical specific energy profiles
  - Detect anomalies in MWD/LWD data without needing labeled failure events


In [None]:
import os

import pandas as pd

# Check the current working directory; the relative path below is resolved against it
print(os.getcwd())
# os.chdir("..")  # uncomment to move up one level if the file is not found

# Load the dataset
file_path = "module_3/all_run_df.csv"
df = pd.read_csv(file_path)


In [None]:
# Google Colab alternative: load the same file from Google Drive instead of a local path
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

file_name = 'all_run_df.csv'  # Replace with your file name once uploaded to Google Drive
file_path = f'/content/drive/My Drive/python-for-drilling-engineers/module_4/{file_name}'

# Load into `df`, the name the rest of the notebook uses
df = pd.read_csv(file_path)

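In [None]:
# Depth of cut per revolution; 0.2 = 12 in/ft / 60 min/hr, assuming rop is in ft/hr
df['depth_of_cut'] = 0.0  # float default, kept where bit_rpm <= 10 to avoid dividing by near-zero RPM

df.loc[df.bit_rpm > 10, 'depth_of_cut'] = 0.2 * df['rop'] / df['bit_rpm']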
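
The 0.2 factor in the cell above can be unpacked as a unit conversion. The sketch below assumes ROP is in ft/hr and that depth of cut is wanted in inches per revolution; the function name and sample values are hypothetical.

```python
def depth_of_cut_in_per_rev(rop_ft_per_hr: float, bit_rpm: float) -> float:
    """Depth of cut in inches per bit revolution (assumes ROP in ft/hr)."""
    # ft/hr -> in/min: multiply by 12 in/ft, divide by 60 min/hr (net factor 0.2)
    rop_in_per_min = rop_ft_per_hr * 12.0 / 60.0
    # in/min divided by rev/min gives in/rev
    return rop_in_per_min / bit_rpm


# Hypothetical sample values: 100 ft/hr at 120 rpm
doc = depth_of_cut_in_per_rev(rop_ft_per_hr=100.0, bit_rpm=120.0)
print(f"depth of cut: {doc:.4f} in/rev")
```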

In [None]:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Feature columns used for clustering
feature_cols = ['rop', 'wob', 'td_rpm', 'diff_press']

# Drop rows with missing values BEFORE selecting the features, so that
# `features` and `df` stay row-aligned and KMeans never sees NaN
df.dropna(subset=feature_cols, inplace=True)
features = df[feature_cols]
print(features.describe())
print(features.isna().sum())  # sanity check: should be all zeros

# Features are deliberately left unscaled here; Section 2.1 repeats the run with StandardScaler
# Perform KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=0)
df['Cluster'] = kmeans.fit_predict(features)

# Plot a random sample (at most 3000 rows) to keep the scatter readable
sampled = df.sample(min(3000, len(df)), random_state=0)
plt.figure(figsize=(10, 6))
plt.scatter(sampled['md'], sampled['rop'], c=sampled['Cluster'], cmap='viridis', s=10)
plt.xlabel('Depth (ft)')
plt.ylabel('ROP (ft/hr)')
plt.title('ROP by Depth with KMeans Clustering (Unscaled Features)')
plt.colorbar(label='Cluster')
plt.show()


In [None]:
# One panel per feature plotted against measured depth ('md'), coloured by cluster, with a shared x-axis
fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharex=True)  # 2x2 grid for the 4 features
for ax, col in zip(axes.ravel(), features.columns):
    ax.scatter(df['md'], df[col], c=df['Cluster'], cmap='viridis', s=10)
    ax.set_xlabel('Depth (ft)')
    ax.set_ylabel(col)
    ax.set_title(f'{col} by Depth with KMeans Clustering')
plt.tight_layout()
plt.show()

## 📘 Section 2.1: Scaling Your Features
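
Why scale at all? KMeans groups points by squared Euclidean distance, so a feature with a much larger numeric spread can dominate the result. The sketch below illustrates this with synthetic data; the column meanings and magnitudes are assumptions chosen for illustration, not values from the lesson dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with very different spreads per column
# (hypothetical magnitudes standing in for rop, wob, td_rpm, diff_press).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(100, 30, 500),    # "rop"
    rng.normal(25, 5, 500),      # "wob"
    rng.normal(80, 10, 500),     # "td_rpm"
    rng.normal(400, 150, 500),   # "diff_press"
])


def variance_share(data: np.ndarray) -> np.ndarray:
    """Fraction of the total variance carried by each column."""
    var = data.var(axis=0)
    return var / var.sum()


raw_share = variance_share(X)
scaled_share = variance_share(StandardScaler().fit_transform(X))

print("share of variance before scaling:", raw_share.round(3))
print("share of variance after scaling: ", scaled_share.round(3))
```

Before scaling, the widest-ranging column carries almost all of the variance, so distances are driven mostly by that one channel. After `StandardScaler`, every column has unit variance and contributes on an equal footing.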

In [None]:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Feature columns used for clustering
feature_cols = ['rop', 'wob', 'td_rpm', 'diff_press']

# Drop rows with missing values BEFORE selecting the features, so that
# `features` and `df` stay row-aligned and the scaler never sees NaN
df.dropna(subset=feature_cols, inplace=True)
features = df[feature_cols]
print(features.describe())
print(features.isna().sum())  # sanity check: should be all zeros

# Normalize features to zero mean and unit variance
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Perform KMeans clustering on the scaled features
kmeans = KMeans(n_clusters=5, random_state=0)
df['Cluster'] = kmeans.fit_predict(scaled_features)

# Plot a random sample (at most 3000 rows) to keep the scatter readable
sampled = df.sample(min(3000, len(df)), random_state=0)
plt.figure(figsize=(10, 6))
plt.scatter(sampled['md'], sampled['rop'], c=sampled['Cluster'], cmap='viridis', s=10)
plt.xlabel('Depth (ft)')
plt.ylabel('ROP (ft/hr)')
plt.title('ROP by Depth with KMeans Clustering (Scaled Features)')
plt.colorbar(label='Cluster')
plt.show()


In [None]:
# One panel per feature plotted against measured depth ('md'), coloured by cluster, with a shared x-axis
fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharex=True)  # 2x2 grid for the 4 features
for ax, col in zip(axes.ravel(), features.columns):
    ax.scatter(df['md'], df[col], c=df['Cluster'], cmap='viridis', s=10)
    ax.set_xlabel('Depth (ft)')
    ax.set_ylabel(col)
    ax.set_title(f'{col} by Depth with KMeans Clustering')
plt.tight_layout()
plt.show()

## 📘 Section 2.2: Bringing Your Clusters into Your DataFrame

In [None]:
df_grouped = df.groupby(['Cluster']).agg(
    count=('rop', 'size'),
    avg_rop=('rop', 'mean'),
    avg_wob=('wob', 'mean'),
    avg_td_rpm=('td_rpm', 'mean'),
    avg_diff_press=('diff_press', 'mean'),
).reset_index()
df_grouped
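
The cells above use `n_clusters=5` without justifying it. One common way to sanity-check the number of clusters is the silhouette score, sketched below on synthetic data with four well-separated groups. The data, the candidate range of `k`, and the variable names are assumptions for illustration; on real drilling data the peak is usually less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: four well-separated groups in a 4-feature space
# (illustrative only, not the lesson dataset).
rng = np.random.default_rng(0)
centers = np.array([
    [0, 0, 0, 0],
    [10, 10, 0, 0],
    [0, 10, 10, 0],
    [10, 0, 0, 10],
], dtype=float)
X = np.vstack([c + rng.normal(0, 1, size=(150, 4)) for c in centers])
X = StandardScaler().fit_transform(X)

# Fit KMeans for several candidate k values and record the silhouette score
# (ranges from -1 to 1; higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

The score is a guide rather than a verdict: a cluster count that drilling engineers can interpret (for example by reading `df_grouped`) may be preferable to the one with the highest score.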

## 📘 Section 3: Dimensionality Reduction


- **Problem**: Drilling data is high-dimensional (lots of sensors!)
- **Solution**: Use PCA or t-SNE to project data into 2D or 3D for visualization
- **Analogy**: Reducing a 3D well log to a 2D cross-section for easier interpretation


In [None]:

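from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is variance-based, so standardize first; otherwise the widest-ranging channel dominates
scaled_features = StandardScaler().fit_transform(df[features.columns])

# Apply PCA to reduce features to 2D
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_features)
df['PCA1'] = pca_result[:, 0]
df['PCA2'] = pca_result[:, 1]

# Plot PCA result, coloured by the KMeans cluster labels
plt.figure(figsize=(10, 6))
plt.scatter(df['PCA1'], df['PCA2'], c=df['Cluster'], cmap='viridis', s=10)
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('PCA of Drilling Features')
plt.colorbar(label='Cluster')
plt.show()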
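
A 2D projection is only trustworthy if the two components keep most of the information. The sketch below shows how to check this with `explained_variance_ratio_`, using synthetic data in which four "sensor" channels are driven by two hidden factors. The data and the mixing matrix are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two hidden drivers generate four correlated channels,
# plus a little measurement noise (illustrative, not the lesson dataset).
rng = np.random.default_rng(0)
n_samples = 1000
latent = rng.normal(size=(n_samples, 2))
mixing = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.0, 0.2, 0.9, 1.0],
])
X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, 4))

# Standardize, then project onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(X))

# Fraction of the total variance each component keeps, and their sum.
retained = pca.explained_variance_ratio_.sum()
print("variance ratio per component:", pca.explained_variance_ratio_.round(3))
print(f"variance retained by 2 components: {retained:.1%}")
```

If the retained fraction is low on real data, clusters that look overlapping in the PCA plot may still be well separated in the full feature space.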


## 📘 Section 4: Real-World Case Study


**Title**: Identifying Operational Modes While Drilling  
- Load a sample drilling dataset (ROP, WOB, Torque, MSE, SPP)
- Run clustering (KMeans, DBSCAN)
- Interpret the resulting clusters by inspecting their feature averages, then label them, e.g. "Rotating", "Sliding", "Connection", "Anomaly"


In [None]:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Scale the features (taken from df so the rows match the label column assigned below)
scaler = StandardScaler()
scaled = scaler.fit_transform(df[features.columns])

# Run DBSCAN; eps and min_samples are starting values and usually need tuning per dataset
dbscan = DBSCAN(eps=0.5, min_samples=10)
df['DBSCAN_Cluster'] = dbscan.fit_predict(scaled)  # label -1 marks noise points

# Visualize result
plt.figure(figsize=(10, 6))
plt.scatter(df['md'], df['rop'], c=df['DBSCAN_Cluster'], cmap='plasma', s=10)
plt.xlabel('Depth (ft)')
plt.ylabel('ROP (ft/hr)')
plt.title('DBSCAN Clustering on Drilling Data')
plt.colorbar(label='Cluster (-1 = noise)')
plt.show()
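
Unlike KMeans, DBSCAN does not force every point into a cluster: points in sparse regions get the label `-1`, which makes them natural "Anomaly" candidates. The sketch below counts clusters and noise points on synthetic data with two dense operating regimes and a few scattered outliers. The data and the feature meanings are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic data: two dense operating regimes plus scattered outliers
# (hypothetical two-feature example, not the lesson dataset).
rng = np.random.default_rng(0)
dense_a = rng.normal([50, 20], 1.0, size=(300, 2))
dense_b = rng.normal([120, 35], 1.0, size=(300, 2))
outliers = rng.uniform([0, 0], [200, 60], size=(15, 2))
X = StandardScaler().fit_transform(np.vstack([dense_a, dense_b, outliers]))

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# DBSCAN marks noise with the label -1; every other label is a cluster id.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("clusters found:", n_clusters)
print("points flagged as noise:", n_noise)
```

Points labelled `-1` are only candidates for review. Whether they are true anomalies or simply a rare but valid operating mode is a question for domain expertise.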


## 📘 Section 5: Wrap-Up and Discussion


- What did we learn?
- Where can clustering help in your daily workflows?
- How might we combine unsupervised learning with human expertise?
- **Teaser**: Next lesson - supervised learning and anomaly detection!
