# Python for Drilling Engineers - Module 1
## Introduction
Welcome to the Python for Drilling Engineers course! In this module, we'll cover the basics of Python and data manipulation for drilling-related applications.

**Why Python?**
- Open-source and widely used in data science
- Great for automating repetitive tasks
- Strong ecosystem for data analysis and visualization

**What's in it for a drilling engineer?**
- Automate drilling KPIs, reports, and calculations
- Analyze well logs and real-time drilling data
- Build custom workflows that off-the-shelf software doesn't cover
- Improve decision-making with data-driven insights

## Python Basics: Built-in Data Structures
Before working with datasets, let's cover fundamental Python data structures.

### Summary Table: Python Data Structures

| Data Structure | Ordered? | Mutable? | Duplicates Allowed? | Best Use Case |
|---------------|---------|----------|------------------|--------------|
| **List** (`list`) | ✅ Yes | ✅ Yes | ✅ Yes | General-purpose, ordered data storage |
| **Dictionary** (`dict`) | ✅ Yes (insertion order, Python 3.7+) | ✅ Yes | ❌ No (keys must be unique) | Key-value lookups, structured data |
| **Tuple** (`tuple`) | ✅ Yes | ❌ No | ✅ Yes | Immutable, fixed collections |
| **Set** (`set`) | ❌ No | ✅ Yes (elements can be added/removed) | ❌ No | Unique element storage, set operations |



### Lists
**Definition:**
A list is an ordered, mutable (modifiable) collection that allows duplicate elements.

**Key Features:**
- Ordered: Elements maintain the order in which they were added.
- Mutable: Elements can be changed, added, or removed.
- Allows Duplicates: Multiple elements with the same value are allowed.

**When to Use?**
- When you need an ordered collection of items.
- When frequent updates (insertion/deletion/modification) are needed.
- When you want to store heterogeneous data (e.g., ["Drill Bit", 10, 45.7]).

In [None]:
# Lists
drilling_tools = ['Bit', 'Mud Motor', 'MWD', 'Rotary Table']
print(drilling_tools[0])  # Access first item
print(len(drilling_tools))  # Number of elements
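
The mutability described above can be sketched with a few common list operations. The `bha` list and its contents are hypothetical, chosen only for illustration.

```python
# Hypothetical bottom-hole assembly list, used only for illustration
bha = ['Bit', 'Mud Motor', 'MWD']

bha.append('Stabilizer')        # Add an item to the end
bha.insert(1, 'Near-Bit Sub')   # Insert an item at position 1
bha.remove('MWD')               # Remove an item by value
bha[0] = 'PDC Bit'              # Modify an item in place

print(bha)       # ['PDC Bit', 'Near-Bit Sub', 'Mud Motor', 'Stabilizer']
print(bha[-1])   # Negative indexing returns the last item
print(bha[1:3])  # Slicing returns a new list: ['Near-Bit Sub', 'Mud Motor']
```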

### Dictionaries
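**Definition:** A dictionary is a collection of key-value pairs in which keys are unique and hashable (typically immutable types such as strings, numbers, or tuples). Since Python 3.7, dictionaries preserve insertion order.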

**Key Features:**
- Key-value pairs: Allows efficient lookups.
- Keys must be unique: No duplicate keys are allowed.
- Mutable: You can update values or add new key-value pairs.

**When to Use?**
- When you need fast lookups based on unique keys.
- When you need to store related attributes (e.g., drilling parameters per well).
- When you need flexible and structured data storage.

In [None]:
# Dictionaries
drilling_data = {
    'Depth': 5000,
    'ROP': 50,
    'Mud Weight': 10.5
}
print(drilling_data['Depth'])  # Accessing dictionary value
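
As a minimal sketch of the features listed above, the example below adds and updates entries and looks up a missing key safely. The `well_params` dictionary and its values are made up for illustration.

```python
# Hypothetical drilling parameters, used only for illustration
well_params = {'Depth': 5000, 'ROP': 50}

well_params['Mud Weight'] = 10.5  # Add a new key-value pair
well_params['ROP'] = 65           # Update an existing value

# .get() returns a default instead of raising KeyError when the key is missing
wob = well_params.get('WOB', 'not recorded')

print(list(well_params))  # Keys in insertion order: ['Depth', 'ROP', 'Mud Weight']
print(well_params['ROP'])
print(wob)
```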

### Tuples
**Definition:** A tuple is an ordered, immutable collection that allows duplicate elements.

**Key Features:**
- Ordered: Elements maintain their order.
- Immutable: Cannot be changed after creation.
- Allows Duplicates: Multiple identical elements are allowed.

**When to Use?**
- When you need a fixed collection that should not change.
- When a small performance edge matters (tuples are slightly more memory-efficient and faster to create than lists).
- When using as dictionary keys (tuples are hashable as long as all of their elements are).

In [None]:
# Tuples
drilling_parameters = (5000, 50, 10.5)  # Immutable sequence: depth, ROP, mud weight
print(drilling_parameters[0])  # Access first item
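
The sketch below illustrates immutability and the use of a tuple as a dictionary key. The variable names and values (`survey_station`, `gamma_by_location`, `'Well A'`) are hypothetical.

```python
# Hypothetical survey station: (measured depth, inclination, azimuth)
survey_station = (5000, 12.5, 145.0)

# Tuple unpacking assigns each element to its own name
md, inc, azi = survey_station

# Attempting to modify a tuple raises a TypeError
try:
    survey_station[0] = 5100
except TypeError as err:
    error_message = str(err)
    print(f'Cannot modify a tuple: {error_message}')

# Tuples of hashable elements can serve as dictionary keys
gamma_by_location = {('Well A', 5000): 85.0}
print(gamma_by_location[('Well A', 5000)])
```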

### Sets
**Definition:** A set is an unordered, mutable collection that only stores unique elements.

**Key Features:**
- Unordered: No guaranteed element order.
- Mutable (but only for adding/removing elements).
- No Duplicates: Automatically removes duplicates.

**When to Use?**
- When you need to store unique values only (e.g., unique well names).
- When you need fast membership testing (the `in` operator runs in roughly constant time on average).
- When performing set operations (union, intersection, difference).

In [None]:
# Sets
drilling_tools = {'Bit', 'Mud Motor', 'MWD', 'Rotary Table'}
print(drilling_tools)  # Unique elements
# Step 1: Adding an element to a set
drilling_tools.add('Casing')
print('Step 1 Result:')
print(drilling_tools)
# Step 2: Removing an element from a set
drilling_tools.remove('Bit')
print('Step 2 Result:')
print(drilling_tools)
# Step 3: Check if an element exists in a set
print('Step 3 Result:')
print('Bit' in drilling_tools)  # Returns False
# Iterating through a set
for tool in drilling_tools:
    print(tool)
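
The set operations mentioned above (union, intersection, difference) can be sketched as follows. The two tool sets are hypothetical, and `sorted()` is used only to get a predictable print order.

```python
# Hypothetical tool sets for two bit runs, used only for illustration
run_1_tools = {'Bit', 'Mud Motor', 'MWD'}
run_2_tools = {'Bit', 'MWD', 'Stabilizer'}

all_tools = run_1_tools | run_2_tools      # Union: tools used in either run
common_tools = run_1_tools & run_2_tools   # Intersection: tools used in both runs
only_run_1 = run_1_tools - run_2_tools     # Difference: tools used only in run 1

print(sorted(all_tools))     # ['Bit', 'MWD', 'Mud Motor', 'Stabilizer']
print(sorted(common_tools))  # ['Bit', 'MWD']
print(sorted(only_run_1))    # ['Mud Motor']

# discard() removes an element if present and does nothing otherwise,
# whereas remove() raises KeyError for a missing element
run_1_tools.discard('Casing')
```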

### List Comprehension

In [None]:
# List Comprehensions
# Create a list of drilling tools with 'Drill' prefix
drilling_tools = ['Bit', 'Mud Motor', 'MWD', 'Rotary Table']
drilling_tools_with_prefix = [f'Drill {tool}' for tool in drilling_tools]
print(drilling_tools_with_prefix)  # ['Drill Bit', 'Drill Mud Motor', 'Drill MWD', 'Drill Rotary Table']

### Dictionary Comprehension

In [None]:
# Dictionary Comprehensions
# Build a dictionary from two parallel lists: tool names and outer diameters (ODs)
drilling_tools = ['Bit', 'Mud Motor', 'MWD', 'stabilizer']
tool_od_list = [12.25, 8.5, 8.5, 11.75]  # Outer diameter list
# tool_length_list = [1.75, 26.3, 28.5, 7.45]  # Length list
# Create a dictionary with drilling tools and their ODs
drilling_tool_dict = {tool: od for tool, od in zip(drilling_tools, tool_od_list)}
print(drilling_tool_dict)  # {'Bit': 12.25, 'Mud Motor': 8.5, 'MWD': 8.5, 'stabilizer': 11.75}

# Get the OD of the MWD from the dictionary:
mwd_od = drilling_tool_dict.get('MWD')
print(mwd_od)  # 8.5


In [None]:
# F-strings & for loops:
tool_type = 'Bit'
print(f'The {tool_type} has an outer diameter of {drilling_tool_dict[tool_type]} inches.')

# Iterate through the dictionary and print each tool's name and outer diameter
for name, value in drilling_tool_dict.items():
    print(f'The {name} has an outer diameter of {value} inches.')

## Working with DataFrames
We'll use Pandas to create and manipulate dataframes.

In [None]:
import pandas as pd

# Creating a simple DataFrame
data = {'Depth': [1050, 1100, 1150, 1200], 'ROP': [323, 350, 355, 385], 'WOB': [42, 43, 48, 50], 'RPM': [120, 120, 120, 120], 
        'DIFF': [458, 473, 491, 526]}
df = pd.DataFrame(data)
print(df)
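
A few basic operations on a DataFrame like the one above are sketched below: selecting a column, filtering rows, and adding a calculated column. The `toy_df` name and the `ROP_per_WOB` column are illustrative assumptions, not part of the course dataset.

```python
import pandas as pd

# Small illustrative DataFrame (same style of values as the example above)
toy_df = pd.DataFrame({'Depth': [1050, 1100, 1150, 1200],
                       'ROP': [323, 350, 355, 385],
                       'WOB': [42, 43, 48, 50]})

rop = toy_df['ROP']                          # Select one column (a Series)
fast_rows = toy_df[toy_df['ROP'] > 350]      # Keep rows where ROP exceeds 350
toy_df['ROP_per_WOB'] = toy_df['ROP'] / toy_df['WOB']  # Add a calculated column

print(fast_rows)
print(toy_df.describe())  # Summary statistics for each numeric column
```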

## Uploading **.csv** Data Files Locally
We'll demonstrate how to load a CSV file from the local working directory.

In [None]:
# Build the path to the CSV file in the current working directory
import os
import pandas as pd

# Get the current working directory
# This is the directory where the notebook is running
current_path = os.getcwd()
print(f'Current path: {current_path}')  # Print the current path

# os.path.join uses the correct separator on Windows, macOS, and Linux,
# so no manual backslash escaping is needed
file_path = os.path.join(current_path, '16A_78-32_time_data_10s_intervals_standard.csv')
print(f'File path: {file_path}')

forge_16A_df = pd.read_csv(file_path)

## Uploading **.csv** Data Files from Google Drive

Run the snippet below to import the Forge data when running on Google Colab.

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/python-for-drilling-engineers/module_1/16A_78-32_time_data_10s_intervals_standard.csv'

forge_16A_df = pd.read_csv(file_path)

## Rapid Dataset Reviews
Let's start by taking a look at the format of the data pulled in from the CSV.

In [None]:
print(forge_16A_df.shape)  # Display the shape of the DataFrame
print(f'Row Count: {forge_16A_df.shape[0]} \nColumn Count: {forge_16A_df.shape[1]}')  # Display row and column count
print(f'Column names: \n {list(forge_16A_df.columns)}')  # Display the column names
# Print the first 10 rows of the first 6 columns
print(f'First look at the dataframe: \n {forge_16A_df.iloc[:10, :6]}')


The first row contains unit information. Let's save it as a dictionary for reference, then remove the row from the dataframe.

In [None]:
# Save the first row as a dictionary with the key as the column name and the value as the first row value.
first_row_dict = forge_16A_df.iloc[0].to_dict()
print(first_row_dict)

print(f'ROP units: {first_row_dict["Rate of Penetration (Depth/Hour)"]}')  # Access the ROP units
print(first_row_dict['Rate of Penetration (Minute/Depth)'])

# drop first row (units)
forge_16A_df.drop(index=0, inplace=True)  # Drop the first row
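
The units-row handling above can be tried on a tiny in-memory CSV. This is a minimal sketch: the unit strings and data values are made up, and only the column headers echo names used in this module.

```python
import io
import pandas as pd

# Made-up CSV text: a header row, a units row, then data rows
csv_text = """Date,Weight on Bit,Rate of Penetration (Depth/Hour)
datetime,klbs,ft/hr
2021-01-01 00:00,20.5,80.2
2021-01-01 00:01,21.0,82.7
"""

raw_df = pd.read_csv(io.StringIO(csv_text))

# Save the units row as a {column name: unit} dictionary
units = raw_df.iloc[0].to_dict()

# Drop the units row and renumber the remaining rows from 0
data_df = raw_df.drop(index=0).reset_index(drop=True)

print(units)
# Values are still text at this point because the units row forced
# every column to be read as strings
print(repr(data_df['Weight on Bit'].iloc[0]))
```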

Now let's take a look at the data types and non-null counts for each column.

In [None]:
forge_16A_df.info(max_cols=None)  # Display DataFrame info (info() prints directly and returns None)

Several columns are entirely (or almost entirely) null. Let's remove them to focus our efforts.

In [None]:
# Remove columns with fewer than 2 non-null values
forge_16A_df = forge_16A_df.dropna(axis=1, thresh=2)
forge_16A_df.info(max_cols=None)
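
To see what `thresh=2` does, here is a small sketch on a made-up DataFrame. The column names `empty_channel` and `one_value` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: one full column, one empty column, one with a single value
toy = pd.DataFrame({'rop': [80.0, 85.0, 90.0],
                    'empty_channel': [np.nan, np.nan, np.nan],
                    'one_value': [np.nan, 1.0, np.nan]})

# thresh=2 keeps only columns that have at least 2 non-null values
kept = toy.dropna(axis=1, thresh=2)

print(list(kept.columns))  # ['rop']
```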

## Run Pandas Profiling Report to Explore the Data Further
First, let's define the data type in each column.

In [None]:
# Set the columns with 'Date' in the header to datetime
for col in forge_16A_df.columns:
    if 'Date' in col:
        forge_16A_df[col] = pd.to_datetime(forge_16A_df[col], errors='coerce')

# Set all other columns to float
for col in forge_16A_df.columns:
    if 'Date' not in col:
        forge_16A_df[col] = pd.to_numeric(forge_16A_df[col], errors='coerce')

print(forge_16A_df.head(10))
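
The effect of `errors='coerce'` is easiest to see on a small example. In this sketch (with made-up values), entries that cannot be parsed become `NaN` or `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up text values, including one that is not a number
raw_values = pd.Series(['12.5', '13.1', 'bad'])
numbers = pd.to_numeric(raw_values, errors='coerce')
print(numbers)  # The unparseable entry becomes NaN

# Made-up timestamps, including one that is not a date
raw_dates = pd.Series(['2021-01-01 00:00:00', 'not a date'])
dates = pd.to_datetime(raw_dates, errors='coerce')
print(dates)  # The unparseable entry becomes NaT

# Counting the coerced values is a quick data-quality check
print(f'Unparseable numbers: {numbers.isna().sum()}')
```

Because coercion silently replaces bad values, it is worth checking the null counts afterwards to see how much data was affected.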

Now, let's generate a profile report using the ydata-profiling library (formerly pandas profiling).

In [None]:
# If running on GoogleColab, you must pip install ydata-profiling before running the next cell
!pip install ydata-profiling

In [None]:
# Generate a profile report
from ydata_profiling import ProfileReport
profile = ProfileReport(forge_16A_df, title="Forge 16A Data Analysis", explorative=True)
profile.to_notebook_iframe()
# Save the profile report to an HTML file
profile.to_file(output_file="forge_16A_report.html")

Let's rename our columns to more code-friendly names.

In [None]:
print(forge_16A_df.columns)  # Check the columns in the DataFrame
print(f'Row count: {forge_16A_df.shape[0]}')  # Check the number of rows

df = forge_16A_df.copy()
# Rename column headers
df.rename(columns={'Date': 'rig_time',
                   'Bit Diameter': 'bit_size',
                   'Top Drive Revolutions per Minute': 'td_rpm',
                   'Bit Revolutions per Minute': 'bit_rpm',
                   'Weight on Bit': 'wob',
                   'Differential Pressure': 'diff_press',
                   'Block Position': 'block_height',
                   'Rate of Penetration (Depth/Hour)': 'rop',
                   'Depth Hole Total Vertical Depth': 'md',
                   'Inclination': 'inc',
                   'Azimuth': 'azi',
                   'Hookload': 'hookload',
                   'Pump Pressure': 'pump_press',
                   'Return Flow': 'flow_out',
                   'Flow In': 'flow_in',
                   'Top Drive Torque': 'td_torque',
                   'Gamma Measured while Drilling': 'gamma',
                   'Rig Mode': 'rig_mode',
                   'On Bottom': 'on_bottom_status',
                   'Total Strokes per Minute': 'total_spm'
                   }, inplace=True)
print(df.columns)  # Check the columns in the DataFrame

We're ready to start our analysis.

Let's wrap our heads around the dataset by plotting a common days-vs-depth (DvD) curve using Matplotlib.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


# Keep only the columns needed for this analysis (.copy() avoids SettingWithCopyWarning)
df = df[['rig_time', 'md', 'rop', 'wob', 'diff_press', 'td_rpm', 'td_torque',
         'bit_rpm', 'block_height', 'inc', 'azi', 'bit_size', 'on_bottom_status']].copy()
df['rig_time'] = pd.to_datetime(df['rig_time'], errors='coerce')  # Ensure datetime dtype
# print(df.head(10))

# Downsample for plotting: take every 240th row
# (roughly one sample every 40 minutes for 10-second data)
plot_df = df.iloc[::240, :].copy()
# Drop rows where rig_time or md is null, then sort by rig_time
plot_df = plot_df.dropna(subset=['rig_time', 'md']).sort_values(by='rig_time')
# Keep rig_time as datetime (not formatted strings) so Matplotlib draws a true time axis

# Convert md to numeric
plot_df['md'] = pd.to_numeric(plot_df['md'], errors='coerce')

print(f'Reduced row count: {plot_df.shape[0]}')  # Check the number of rows after reduction

# Ensure plots are displayed in Jupyter Notebook
%matplotlib inline 

# plot line graph x axis = rig_time, y axis = bit_depth, then invert the y-axis
plt.figure(figsize=(10, 6))
plt.plot(plot_df['rig_time'], plot_df['md'], label='Bit Depth', color='blue')
plt.gca().invert_yaxis()  # Invert the y-axis
plt.xlabel('Rig Time')  # Set x-label
plt.ylabel('Depth (ft)')  # Set y-label
plt.title('DvD Curve')
plt.show()

In [None]:
on_btm_plot_df = plot_df[plot_df['on_bottom_status'] == 1]  # Filter the data frame to show only data while bit is on-bottom (on_bottom_status = 1)

plt.figure(figsize=(10, 6))
plt.plot(on_btm_plot_df['rig_time'], on_btm_plot_df['md'], label='On Bottom', color='red')
plt.gca().invert_yaxis()  # Invert the y-axis
plt.xlabel('Rig Time')  # Set x-label
plt.ylabel('Depth (ft)')  # Set y-label
plt.title('On Bottom DvD Curve')
plt.show()

## Fix rig_time to show accurate time stamps
The rig time is recorded at one-minute resolution, so the six 10-second samples within each minute share the same timestamp.
Let's fix this with a quick script.

In [None]:
# Work on a copy that contains only on-bottom rows
on_btm_df = df[df['on_bottom_status'] == 1].copy()

# Each complete minute should hold six 10-second samples sharing one timestamp
for minute, group in on_btm_df.groupby('rig_time'):
    if len(group) != 6:
        continue  # Incomplete minutes are left unchanged
    first_index = group.index[0]
    for index in group.index[1:]:
        # Add 10 seconds per row after the first (assumes consecutive integer index labels)
        on_btm_df.at[index, 'rig_time'] = minute + pd.Timedelta(seconds=10 * (index - first_index))
# print(on_btm_df[['rig_time']].head(50))
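
As an alternative to the loop, the same idea can be sketched in vectorized form with `groupby(...).cumcount()`, which numbers the rows within each group starting at 0. This is an illustrative variant on made-up data, not the notebook's method: unlike the loop above, it offsets every group regardless of how many rows it contains.

```python
import pandas as pd

# Made-up data: two minutes, each with six samples sharing one timestamp
toy = pd.DataFrame({
    'rig_time': pd.to_datetime(['2021-01-01 00:00'] * 6 + ['2021-01-01 00:01'] * 6)
})

# Position of each row within its minute: 0, 1, 2, 3, 4, 5
position = toy.groupby('rig_time').cumcount()

# Add 10 seconds per position to spread the samples across the minute
toy['rig_time_fixed'] = toy['rig_time'] + pd.to_timedelta(position * 10, unit='s')

print(toy)
```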

## Identify Unique Bit Runs

In [None]:
# Set rig_time as datetime
on_btm_df['rig_time'] = pd.to_datetime(on_btm_df['rig_time'], errors='coerce')
# Time gap between consecutive on-bottom rows, in hours
on_btm_df['rig_time_delta'] = on_btm_df['rig_time'].diff().dt.total_seconds() / 3600
on_btm_df['md_delta'] = on_btm_df['md'].diff()  # Depth difference between rows

# Start a new run each time the gap between on-bottom rows exceeds 10 hours
on_btm_df['run_number'] = (on_btm_df['rig_time_delta'] > 10).cumsum() + 1

# Get the start and end time for each run_number
start_end_times_df = on_btm_df.groupby('run_number')['rig_time'].agg(['min', 'max']).reset_index()
start_end_times_df.rename(columns={'min': 'start_time', 'max': 'end_time'}, inplace=True)
# Duration of each run in hours
start_end_times_df['run_duration'] = (start_end_times_df['end_time'] - start_end_times_df['start_time']).dt.total_seconds() / 3600
# Get the start and end depths for each run
start_end_md_df = on_btm_df.groupby('run_number')['md'].agg(['min', 'max']).reset_index()
start_end_md_df.rename(columns={'min': 'start_depth', 'max': 'end_depth'}, inplace=True)
start_end_md_df['run_length'] = start_end_md_df['end_depth'] - start_end_md_df['start_depth']  # Length of each run
# Merge the start and end times with the start and end depths
bit_run_df = pd.merge(start_end_times_df, start_end_md_df, on='run_number')
# Keep only runs longer than 1 hour with a length between 100 and 10,000 feet
bit_run_df = bit_run_df[(bit_run_df['run_duration'] > 1) & (bit_run_df['run_length'] > 100) & (bit_run_df['run_length'] < 10000)]
print(bit_run_df)  # Display the start/end times and depths for each run_number
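
The gap-based run numbering can be checked on a small example. In this sketch the timestamps are made up, and the 10-hour threshold mirrors the one used above: a new run starts wherever the gap between consecutive rows exceeds the threshold.

```python
import pandas as pd

# Made-up on-bottom timestamps with two long gaps (22 hours and 35 hours)
toy = pd.DataFrame({'rig_time': pd.to_datetime([
    '2021-01-01 00:00', '2021-01-01 01:00', '2021-01-01 02:00',
    '2021-01-02 00:00', '2021-01-02 01:00',
    '2021-01-03 12:00',
])})

# Gap to the previous row, in hours (NaN for the first row)
gap_hours = toy['rig_time'].diff().dt.total_seconds() / 3600

# Each gap over 10 hours adds 1 to the running total, starting a new run
toy['run_number'] = (gap_hours > 10).cumsum() + 1

print(toy)  # run_number column: 1, 1, 1, 2, 2, 3
```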

## Plot DvD Curve Color-Coded by Run Number

In [None]:
plot_df['run_number'] = 0  # Initialize a new column for run_number
plot_df['rig_time'] = pd.to_datetime(plot_df['rig_time'], errors='coerce')  # Convert to datetime
# Assign run_number to plot_df based on the start and end times in bit_run_df
for index, row in bit_run_df.iterrows():
    plot_df.loc[(plot_df['rig_time'] >= row['start_time']) & (plot_df['rig_time'] <= row['end_time']), 'run_number'] = row['run_number']

plt.figure(figsize=(10, 6))
plt.scatter(plot_df['rig_time'], plot_df['md'], c=plot_df['run_number'], cmap='viridis', label='Run Number')
plt.gca().invert_yaxis()  # Invert the y-axis
plt.xlabel('Rig Time')  # Set x-label
plt.ylabel('Depth (ft)')  # Set y-label
plt.title('DvD Curve with Run Number')
plt.colorbar(label='Run Number')
plt.show()

## Data Transformation & KPI Calculation
Now, let's transform data and calculate key performance indicators (KPIs).

First, we will create a calculated column: 'Depth of Cut' using ROP and Bit RPM.

**Formula**

depth_of_cut = 0.2 * rop / bit_rpm
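
The 0.2 factor is a unit conversion. Assuming ROP is in ft/hr and bit RPM is in revolutions per minute, multiplying by 12 in/ft and dividing by 60 min/hr gives inches per minute, and dividing by RPM gives inches per revolution. The function below is a minimal sketch under those unit assumptions; the name `depth_of_cut_in_per_rev` is hypothetical.

```python
def depth_of_cut_in_per_rev(rop_ft_per_hr, bit_rpm):
    """Return depth of cut in inches per revolution.

    Assumes ROP in ft/hr and bit speed in rev/min.
    """
    if bit_rpm <= 0:
        raise ValueError('bit_rpm must be positive')
    inches_per_minute = rop_ft_per_hr * 12 / 60  # 12 in/ft divided by 60 min/hr = 0.2
    return inches_per_minute / bit_rpm


# Example with made-up values: 100 ft/hr at 120 rpm is about 0.167 in/rev
print(round(depth_of_cut_in_per_rev(100, 120), 3))
```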

In [None]:
on_btm_df['depth_of_cut'] = 0.0  # Float default; stays 0 where bit RPM is too low

# Only calculate depth of cut where bit RPM > 10 to avoid dividing by near-zero values
on_btm_df.loc[on_btm_df.bit_rpm > 10, 'depth_of_cut'] = 0.2 * on_btm_df['rop'] / on_btm_df['bit_rpm']

print(f'ROP stats: \n{on_btm_df.loc[(on_btm_df.rop < 1000) & (on_btm_df.rop > 0), "rop"].describe()}\n')  # Display ROP statistics
print(f'Bit RPM stats: \n{on_btm_df.loc[on_btm_df.bit_rpm > 0, "bit_rpm"].describe()}\n')  # Display bit RPM statistics

print(f'Depth of Cut Stats: \n{on_btm_df.loc[on_btm_df.on_bottom_status == 1, "depth_of_cut"].describe()}\n')  # Display depth of cut statistics

### Add the run numbers to the "On Bottom" Dataframe

In [None]:
# Assign run_number from bit_run_df to on_btm_df using each run's start and end times
on_btm_df['run_number'] = 0  # Initialize a new column for run_number
for index, row in bit_run_df.iterrows():
    on_btm_df.loc[(on_btm_df['rig_time'] >= row['start_time']) & (on_btm_df['rig_time'] <= row['end_time']), 'run_number'] = row['run_number']
on_btm_df.loc[on_btm_df.run_number == 0, 'run_number'] = None  # Set run_number to None where it is 0

Now let's calculate KPIs across each run.

In [None]:
# Calculate KPIs for each bit run (on-bottom data only)
group_df = on_btm_df.groupby('run_number')

# print(f'\n{group_df.size()}')
for name, group in group_df:
    print(f'\nRun Number: {name}')
    print(group[['rop', 'wob', 'td_rpm', 'td_torque', 'depth_of_cut']].describe())
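
If a single summary table is preferred over printing `describe()` for each run, named aggregation is one option. The sketch below uses made-up values, and the output column names (`mean_rop`, `max_rop`, `mean_wob`) are illustrative choices.

```python
import pandas as pd

# Made-up on-bottom data for two runs
toy = pd.DataFrame({'run_number': [1, 1, 2, 2, 2],
                    'rop': [80.0, 100.0, 50.0, 60.0, 70.0],
                    'wob': [20.0, 22.0, 30.0, 32.0, 34.0]})

# Named aggregation: new_column=(source_column, aggregation function)
kpi_df = (toy.groupby('run_number')
             .agg(mean_rop=('rop', 'mean'),
                  max_rop=('rop', 'max'),
                  mean_wob=('wob', 'mean'))
             .reset_index())

print(kpi_df)
```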

Now let's check for any ROP outliers and remove them.

In [None]:
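# Remove ROP outliers: keep rows within the 5th-95th percentile of ROP for their run
rop_by_run = on_btm_df.groupby('run_number')['rop']
lower = rop_by_run.transform(lambda s: s.quantile(0.05))
upper = rop_by_run.transform(lambda s: s.quantile(0.95))
filtered_df = on_btm_df[on_btm_df['rop'].between(lower, upper)]
print(f'Rows before: {len(on_btm_df)}, rows after: {len(filtered_df)}')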

## Data Visualization with Matplotlib & Seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Plot a bar chart of mean ROP for each run number (rows without a run number are excluded)
plt.figure(figsize=(10, 6))
sns.barplot(x='run_number', y='rop', data=on_btm_df.dropna(subset=['run_number']))
plt.xlabel('Run Number')  # Set x-label
plt.ylabel('ROP (ft/hr)')  # Set y-label
plt.title('ROP by Run')
# overlay a line plot of the run_length for each run_number
# plt.twinx()
# sns.lineplot(x='run_number', y='run_length', data=bit_run_df, color='red')
# plt.ylabel('Run Length (ft)')  # Set y-label for the line plot
# plt.legend(['Run Length', 'ROP'])
plt.show()

In [None]:
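# Scatterplot example: ROP vs WOB for on-bottom data
# (MSE is not calculated in this module, so we plot columns that exist in on_btm_df)
plt.figure(figsize=(8, 4))
sns.scatterplot(x='wob', y='rop', data=on_btm_df)
plt.title('ROP vs WOB')
plt.show()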

## Final Exercise
Try the following:
1. Create a new DataFrame with Well Name, Depth, and ROP.
2. Upload a CSV file and explore the data.
3. Merge two DataFrames with a common column.
4. Create a scatterplot of Depth vs ROP.

**Congratulations on completing Module 1!** 🎉