# Python for Drilling Engineers - Module 2
## Today's Objectives
- Loading Your Own Datasets
- DataFrames Walkthrough
  - Filtering
  - Calculating KPIs
- Data QA/QC Processes

# Module 2: From Excel to Empowered – Working with DataFrames in Python 🚀

Welcome to Module 2! If you're a drilling engineer who's been living in Excel for years, this is where the journey gets exciting. Python unlocks *next-level control, speed, and insights* from your data—and it all starts with mastering the DataFrame.

In this notebook, you’ll learn how to go from a basic spreadsheet mindset to a confident Python-powered workflow. Step by step. No overwhelm.

---

## 🔍 What We'll Cover

This module is intentionally focused and practical. Here's what you'll learn:

### 1. Why DataFrames?
- What a DataFrame is (and why it's like Excel—but better)
- How drilling engineers can benefit: filtering, aggregating, plotting, and automating common workflows
- Quick side-by-side: Excel vs. Python for rig data

### 2. Setting the Stage: Load Your First Real Dataset
- Load the Forge 16A well bit run dataset
- Quick overview of the data: rows, columns, and what's inside

### 3. Getting Comfortable: Exploring Your Data
- `.head()`, `.info()`, `.describe()` — fast ways to peek under the hood
- How to identify bad or missing data early
- Simple column selection and row slicing

### 4. Making it Useful: Filtering & Sorting Like a Pro
- Filtering rows by bit_od, bit manufacturer, etc.
- Sorting your DataFrames

### 5. Adding Value: Creating New Columns & Grouping
- Calculating new metrics like "ROP ft/hr" or "Torque-to-WOB ratio"
- Vectorized operations: Why it's *fast* and *clean*

### 6. Creating a 'Channel Mapper'
- Create a dictionary to define consistent channel names for your dataset
- Load and apply it to the Forge 16A Dataset

### 7. BONUS: Quick Charting with `matplotlib` and `pandas`
- Param Plots
- KPI Bar Charts
- Why visualizing in Python beats Excel every time

---

## 👷‍♂️ Why This Matters for You

If you're still relying solely on Excel, you're leaving insight—and time—on the table. This module gives you the tools to:

✅ Make better decisions, faster  
✅ Catch data issues before they catch you  
✅ Automate the boring stuff  
✅ Impress your team with insights they didn’t even know were possible  

This is just the beginning. Let’s get after it.

---

🧠 **Pro Tip:** Bookmark the commands and use them in your daily workflows. The more you use Python, the more it works *for* you.



## Loading the Data

In [None]:
import pandas as pd
# Load the bit_run_df from a CSV file
file_name = 'bit_run_df.csv'

bit_run_df = pd.read_csv(file_name)


If you have not yet sent in your Gmail address to connect to the Google Drive, run this code to build `bit_run_df` by hand instead:

In [None]:
import pandas as pd
run_number = [4, 5, 6, 7, 10, 11, 13, 15, 16, 17, 18, 23, 26, 27, 28, 30, 31, 32, 33]
start_time = ["10/30/2020 3:20", "11/4/2020 8:49", "11/7/2020 16:38", "11/9/2020 22:41", "11/14/2020 11:43",
              "11/15/2020 2:46", "11/20/2020 0:20", "11/24/2020 2:46", "11/25/2020 23:21", "11/26/2020 22:57",
              "11/28/2020 14:49", "12/3/2020 23:23", "12/7/2020 3:30", "12/8/2020 20:34", "12/9/2020 13:17",
              "12/12/2020 12:35", "12/13/2020 11:44", "12/17/2020 0:21", "12/18/2020 10:01"]
end_time = ["10/31/2020 4:27", "11/7/2020 6:25", "11/8/2020 22:54", "11/10/2020 14:58", "11/14/2020 16:20",
            "11/16/2020 13:28", "11/22/2020 6:35", "11/24/2020 18:30", "11/26/2020 10:40", "11/27/2020 19:57",
            "11/29/2020 9:39", "12/5/2020 6:59", "12/7/2020 23:55", "12/9/2020 0:21", "12/9/2020 15:31",
            "12/13/2020 0:54", "12/14/2020 8:19", "12/17/2020 18:38", "12/18/2020 23:58"]
run_duration = [25.11666667, 69.59722222, 30.26666667, 16.28333333, 4.616666667, 34.70277778, 54.25, 15.73333333,
                11.31666667, 21, 18.83333333, 31.6, 20.41666667, 3.783333333, 2.233333333, 12.31666667, 20.58333333,
                18.28333333, 13.95]
start_depth = [120.95001, 1629.09, 4552, 4964.7676, 5112, 5112.0776, 5505.0513, 5892.058, 6360.5713, 6527.22,
               6945.0454, 7389, 8024.0015, 8242.251, 8392.4375, 8535.091, 9064.573, 9747.119, 10490.042]
end_depth = [1629.0634, 4556.19, 4964.3687, 5113.364, 5379.8945, 5472.668, 5855.826, 6360.453, 6526.268, 6944.9404,
             7394.7295, 8024.3887, 8241.282, 8391.413, 8540.855, 9064.383, 9747.942, 10490.022, 10960.597]
run_length = [1508.11339, 2927.1, 412.3687, 148.5964, 267.8945, 360.5904, 350.7747, 468.395, 165.6967, 417.7204,
              449.6841, 635.3887, 217.2805, 149.162, 148.4175, 529.292, 683.369, 742.903, 470.555]
bit_make = ["NOV", "NOV", "Smith", "Smith", "Smith", "Ulterra", None, "NOV", "NOV", "NOV", "NOV", "NOV",
            "NOV", "NOV", "NOV", "NOV", "NOV", "NOV", "NOV"]
bit_model = ["TKC76", "TKC66", "MDSi616", "Z713S", "XS616", "U616M", None, "TKC63", "SKC613M", "SKC513M",
             "FTKC63-01", "TKC63", "SKC513M", "SKC613M", "SKC613M", "TKC63", "FTKC63-01", "TKC63", "TKC63"]
bit_od = [17.5, 12.25, 12.25, 12.25, 12.25, 12.25, None, 8.75, 8.75, 8.75, 8.75, 8.75, 8.75, 8.75, 8.75, 8.75,
          8.75, 8.75, 8.75]
motor = [False, True, True, True, True, True, None, True, True, True, True, True, True, True, True, True, True, True, True]
motor_make = [None, "Scout", "Scout", "Scout", "Scout", "Scout", None, "Scout", "Scout", "Scout", "Scout", "Scout",
              "Scout", "Scout", "Scout", "Scout", "Scout", "Scout", "Scout"]
motor_od = [None, 9.625, 9.625, 9.625, 9.625, 9.625, None, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5]
motor_config = [None, "7/8-5.9", "7/8-5.9", "7/8-5.9", "7/8-5.9", "7/8-3.0", None, "7/8-5.7", "7/8-5.7", "7/8-5.7",
                "7/8-5.7", "7/8-5.7", "7/8-5.7", "7/8-5.7", "7/8-5.7", "7/8-5.7", "7/8-5.7", "7/8-5.7", "7/8-5.7"]
rss = [True, True, True, True, True, True, None, False, False, False, False, False, False, False, False, False,
       False, False, False]
rss_make = ["Scout Vertical", "Scout Vertical", "Scout Vertical", "Scout Vertical", "Scout Vertical", "Scout Vertical",
            None, None, None, None, None, None, None, None, None, None, None, None, None]

# Create a new DataFrame with the provided data
bit_run_dict = {
    'run_number': run_number,
    'start_time': start_time,
    'end_time': end_time,
    'run_duration': run_duration,
    'start_depth': start_depth,
    'end_depth': end_depth,
    'run_length': run_length,
    'bit_make': bit_make,
    'bit_model': bit_model,
    'bit_od': bit_od,
    'motor': motor,
    'motor_make': motor_make,
    'motor_od': motor_od,
    'motor_config': motor_config,
    'rss': rss,
    'rss_make': rss_make
}
bit_run_df = pd.DataFrame(bit_run_dict)

In [None]:
bit_run_df
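
The `start_time` and `end_time` columns above are stored as text, not as real timestamps. Here is a minimal sketch, using only the first two runs from the lists above in a toy DataFrame called `runs` (a name made up for this example), of how you might parse them and cross-check the recorded `run_duration`. It assumes the timestamps are month/day/year and that `run_duration` is in hours.

```python
import pandas as pd

# Toy stand-in for bit_run_df using the first two runs listed above
runs = pd.DataFrame({
    "run_number": [4, 5],
    "start_time": ["10/30/2020 3:20", "11/4/2020 8:49"],
    "end_time": ["10/31/2020 4:27", "11/7/2020 6:25"],
    "run_duration": [25.11666667, 69.59722222],
})

# Parse the text timestamps (assumed month/day/year hour:minute) into datetimes
runs["start_time"] = pd.to_datetime(runs["start_time"], format="%m/%d/%Y %H:%M")
runs["end_time"] = pd.to_datetime(runs["end_time"], format="%m/%d/%Y %H:%M")

# Recompute the elapsed hours and compare against the recorded duration
elapsed = runs["end_time"] - runs["start_time"]
runs["check_hours"] = elapsed.dt.total_seconds() / 3600
runs["diff_hours"] = (runs["check_hours"] - runs["run_duration"]).abs()
print(runs[["run_number", "run_duration", "check_hours", "diff_hours"]])
```

A small difference is not necessarily an error, since the timestamps shown here only go down to the minute. A large one is worth a second look.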

## Exploring the Data

In [None]:
bit_run_df.head(1)  # Display the first row of the DataFrame to verify the data (use .head() for the first 5)

In [None]:
bit_run_df.tail(1)

In [None]:
bit_run_df.info()  # Display information about the DataFrame, including data types and non-null counts

In [None]:
bit_run_df.describe()

In [None]:
bit_run_df['run_duration'].describe()
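
`.info()` reports non-null counts, but it can also help to count the gaps directly. The sketch below uses a small toy DataFrame (`toy_df`, made up for this example) that mimics run 13 in the data above, where the bit details are missing.

```python
import pandas as pd

# Toy stand-in for bit_run_df: run 13 has no bit information recorded
toy_df = pd.DataFrame({
    "run_number": [11, 13, 15],
    "bit_make": ["Ulterra", None, "NOV"],
    "bit_od": [12.25, None, 8.75],
})

# Count the missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Show only the rows that have at least one missing value
rows_with_gaps = toy_df[toy_df.isna().any(axis=1)]
print(rows_with_gaps)
```

The same two lines should work on `bit_run_df` itself: swap `toy_df` for `bit_run_df`.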

## Filtering & Sorting DataFrames

### Filtering

In [None]:
# Filter for BHAs with a 12.25" hole size

filtered_df = bit_run_df[bit_run_df['bit_od'] == 12.25]
# Display the filtered DataFrame
filtered_df

In [None]:
hole_size = 12.25  # Change this to the desired hole size
filtered_df = bit_run_df[bit_run_df['bit_od'] == hole_size]

# Display the filtered DataFrame
filtered_df

#### Exercise 1 - **Now you try.**

Filter the DataFrame to look at the 8.75" Bit Runs

In [None]:
# Type your code here.


### Applying Multiple Filters

In [None]:
hole_size = 12.25
screen = (bit_run_df['bit_od'] == hole_size)
filtered_df = bit_run_df[screen].reset_index(drop=True)
# Display the filtered DataFrame
filtered_df

In [None]:
hole_size = 12.25
bit_make = 'Ulterra'
screen = (bit_run_df['bit_od'] == hole_size) & (bit_run_df['bit_make'] == bit_make)
filtered_df = bit_run_df[screen].reset_index(drop=True)
# Display the filtered DataFrame
filtered_df
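
The `&` above means AND. As a rough sketch of the other common combinations, the toy example below (`toy_df` is made up for this example, with four runs copied from the data above) shows OR with `|`, NOT with `~`, and `.isin()` for matching against a list of values.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "run_number": [4, 6, 11, 15],
    "bit_od": [17.5, 12.25, 12.25, 8.75],
    "bit_make": ["NOV", "Smith", "Ulterra", "NOV"],
})

# OR: keep runs that are either 17.5" or 8.75"
either_size = toy_df[(toy_df["bit_od"] == 17.5) | (toy_df["bit_od"] == 8.75)]

# isin: keep runs whose bit_make appears in a list of values
smith_or_ulterra = toy_df[toy_df["bit_make"].isin(["Smith", "Ulterra"])]

# NOT: keep everything except NOV bits
not_nov = toy_df[~(toy_df["bit_make"] == "NOV")]

print(either_size)
print(smith_or_ulterra)
print(not_nov)
```

Keep the parentheses around each condition. Python evaluates `&` and `|` before `==`, so leaving them out usually raises an error.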

#### Exercise 2

**Now you try.**

Filter for BHAs where the hole size is 8.75 and the bit model is TKC63

In [None]:
# Type your code here.

### Sorting Like a Pro

In [None]:
bit_run_df.sort_values(by='run_duration')

In [None]:
bit_run_df.sort_values(by='run_duration', ascending=False, inplace=True)
bit_run_df
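
Sorting also works on more than one column at a time. Here is a small sketch on a toy DataFrame (`toy_df`, made up for this example) that sorts by hole size first and by run length within each hole size.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "run_number": [16, 5, 15, 6],
    "bit_od": [8.75, 12.25, 8.75, 12.25],
    "run_length": [165.6967, 2927.1, 468.395, 412.3687],
})

# Sort by hole size (largest first), then by run length (longest first) within each size
sorted_df = toy_df.sort_values(by=["bit_od", "run_length"], ascending=[False, False])

# Without inplace=True, sort_values returns a new DataFrame and toy_df keeps its order.
# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after the sort.
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```

Note the difference from the cell above: `inplace=True` reorders the original DataFrame, while assigning the result to a new name leaves the original untouched.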

#### Exercise 3

**Now you try.**

Sort the BHAs by run_length from largest to smallest (ascending=False)

In [None]:
# Type your code here.

## Adding Value - Calculating Columns

Calculate a column called `avg_rop` by dividing each run's length by its duration.

In [None]:
bit_run_df['avg_rop'] = bit_run_df['run_length'] / bit_run_df['run_duration']
bit_run_df
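
The division above is a vectorized operation: pandas applies it to every row at once. To make that concrete, here is a sketch on a toy DataFrame (`toy_df`, made up for this example, using the first three runs) that compares the one-line version with the loop you would otherwise write. The units in the comments (ft and hr) are assumptions.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "run_number": [4, 5, 6],
    "run_length": [1508.11339, 2927.1, 412.3687],              # assumed ft
    "run_duration": [25.11666667, 69.59722222, 30.26666667],   # assumed hr
})

# Vectorized: one line, applied to every row at once
toy_df["avg_rop"] = toy_df["run_length"] / toy_df["run_duration"]

# Equivalent loop: same answer, more code, and slower on large datasets
loop_rop = []
for length, duration in zip(toy_df["run_length"], toy_df["run_duration"]):
    loop_rop.append(length / duration)

print(toy_df)
print(loop_rop)
```

Keep in mind that `run_duration` is the elapsed time from the start to the end of the run, so this is an average over the whole run, not just the time spent on bottom.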

### Calculated Columns for Specific Rows

In [None]:
import os
import pandas as pd

# Load the on_btm_df from a CSV file stored on your local machine
file_name = 'on_btm_df.csv'
# Folder that holds the file; change this to match your own machine
file_path = 'C:\\Users\\RDavis\\Desktop\\Github\\python-for-drilling-engineers\\module_3'
print(file_path)
on_btm_df = pd.read_csv(os.path.join(file_path, file_name))

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/python-for-drilling-engineers/module_2/on_btm_df.csv'

on_btm_df = pd.read_csv(file_path)

### Using a for loop and filtering, you can calculate KPIs from one dataframe and save to another

In [None]:
# Convert the timestamps to datetimes so the comparisons below are chronological, not alphabetical
on_btm_df['rig_time'] = pd.to_datetime(on_btm_df['rig_time'])
bit_run_df['start_time'] = pd.to_datetime(bit_run_df['start_time'])
bit_run_df['end_time'] = pd.to_datetime(bit_run_df['end_time'])

for index, row in bit_run_df.iterrows():
    # Select the on-bottom rows that fall inside this bit run
    screen = (on_btm_df['rig_time'] >= row['start_time']) & (on_btm_df['rig_time'] <= row['end_time'])
    filtered_df = on_btm_df[screen]
    # Save the run averages back to the matching row of bit_run_df
    bit_run_df.loc[index, 'avg_wob'] = filtered_df['wob'].mean()
    bit_run_df.loc[index, 'avg_rpm'] = filtered_df['td_rpm'].mean()
    bit_run_df.loc[index, 'avg_rop_raw'] = filtered_df['rop'].mean()
bit_run_df
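
The loop works well for a handful of bit runs. For larger jobs, one possible loop-free alternative is `pd.merge_asof`, which tags each sample with the most recent run start at or before it. The sketch below uses made-up toy data (`runs` and `on_btm` are hypothetical stand-ins, and the WOB values are invented), and it assumes the runs do not overlap in time.

```python
import pandas as pd

# Hypothetical bit runs and on-bottom samples
runs = pd.DataFrame({
    "run_number": [4, 5],
    "start_time": pd.to_datetime(["2020-10-30 03:20", "2020-11-04 08:49"]),
    "end_time": pd.to_datetime(["2020-10-31 04:27", "2020-11-07 06:25"]),
})
on_btm = pd.DataFrame({
    "rig_time": pd.to_datetime(["2020-10-30 05:00", "2020-10-30 06:00",
                                "2020-11-02 12:00", "2020-11-05 09:00"]),
    "wob": [10.0, 20.0, 99.0, 30.0],
})

# Attach to each sample the latest run that started at or before it.
# Both DataFrames must be sorted by their time column.
tagged = pd.merge_asof(
    on_btm.sort_values("rig_time"),
    runs.sort_values("start_time"),
    left_on="rig_time",
    right_on="start_time",
)

# Drop samples that fall after that run ended (time between runs)
tagged = tagged[tagged["rig_time"] <= tagged["end_time"]]

# Average WOB per run
avg_by_run = tagged.groupby("run_number")["wob"].mean()
print(avg_by_run)
```

The sample on 2020-11-02 falls between the two runs, so it is dropped before averaging.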

## First Look at Real Drilling Data

**Load Data from Google Drive**

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/python-for-drilling-engineers/module_1/16A_78-32_time_data_10s_intervals_standard.csv'

forge_16A_df = pd.read_csv(file_path)

The first row contains unit information. Let's save it as a dictionary for reference, then remove the row from the dataframe.

In [None]:
# Save the first row as a dictionary with the key as the column name and the value as the first row value.
unit_dict = forge_16A_df.iloc[0].to_dict()
print(unit_dict)

print(f'ROP units: {unit_dict["Rate of Penetration (Depth/Hour)"]}')  # Access the ROP units
print(unit_dict['Rate of Penetration (Minute/Depth)'])

# drop first row (units)
forge_16A_df.drop(index=0, inplace=True)  # Drop the first row
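
To see the units-row pattern in isolation, here is a minimal sketch on a tiny made-up CSV held in memory. The column names, the `klbs` unit, and the values are all hypothetical; only the layout (header row, then a units row, then data) mirrors the file described above.

```python
import io
import pandas as pd

# Hypothetical file: header row, then a units row, then the data rows
csv_text = (
    "Date,Weight on Bit\n"
    "datetime,klbs\n"
    "2020-10-30 03:20:00,12.5\n"
    "2020-10-30 03:20:10,13.0\n"
)
raw_df = pd.read_csv(io.StringIO(csv_text))

# Save the units row as a dictionary, then drop it from the data
unit_dict = raw_df.iloc[0].to_dict()
data_df = raw_df.drop(index=0).reset_index(drop=True)
print(unit_dict)

# The units row was text, so the numeric column was read in as text too.
# Convert it back to numbers once the units row is gone.
data_df["Weight on Bit"] = pd.to_numeric(data_df["Weight on Bit"])
print(data_df.dtypes)
```

This is also why columns in a file like this can come in as text even when they look numeric.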

Now let's take a look at the data types and non-null counts for each column using the pandas `.info()` method.

In [None]:
forge_16A_df.info()  # Display DataFrame info

Several columns are entirely or almost entirely null. Let's remove them to focus our efforts.

In [None]:
# Remove columns that have fewer than 2 non-null values
forge_16A_df = forge_16A_df.dropna(axis=1, thresh=2)
forge_16A_df.info(verbose=True)  # info() prints its own report, so no print() is needed
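
The `thresh` argument can be confusing at first: it is the minimum number of non-null values a column needs in order to be kept. A quick sketch on a toy DataFrame (`toy_df` and its column names are made up for this example):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "wob": [10.0, 12.0, 11.0],
    "empty_channel": [np.nan, np.nan, np.nan],
    "one_value_channel": [np.nan, 5.0, np.nan],
})

# thresh=2 keeps only the columns that have at least 2 non-null values
trimmed_df = toy_df.dropna(axis=1, thresh=2)
print(list(trimmed_df.columns))
```

Only `wob` survives here, because the other two columns have zero and one non-null values.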

## Mapping Channels
Let's make a copy of our DataFrame so that the original dataset stays untouched while we manipulate the data.

In [None]:
df = forge_16A_df.copy()  # Create a copy of the DataFrame

Let's clean up the DataFrame a bit:
1. Rename our columns to have a more code-friendly title.
2. Reduce the columns to only what we care about.
3. Set the rig_time column type to datetime.
4. Sort the dataframe by 'rig_time'.

In [None]:
print('Original Columns:')
print(forge_16A_df.columns)  # Check the columns in the DataFrame

### Channel Mapper Creation

Use a dictionary to create a channel mapper.

In [None]:
# Define the channel mapper dictionary
channel_mapper_dict = {
    'Date': 'rig_time',
    'Bit Diameter': 'bit_size',
    'Top Drive Revolutions per Minute': 'td_rpm',
    'Bit Revolutions per Minute': 'bit_rpm',
    'Weight on Bit': 'wob',
    'Differential Pressure': 'diff_press',
    'Block Position': 'block_height',
    'Rate of Penetration (Depth/Hour)': 'rop',
    'Depth Hole Total Vertical Depth': 'md',
    'Inclination': 'inc',
    'Azimuth': 'azi',
    'Hookload': 'hookload',
    'Pump Pressure': 'pump_press',
    'Return Flow': 'flow_out',
    'Flow In': 'flow_in',
    'Top Drive Torque': 'td_torque',
    'Gamma Measured while Drilling': 'gamma',
    'Rig Mode': 'rig_mode',
    'On Bottom': 'on_bottom_status',
    'Total Strokes per Minute': 'total_spm'
}

# Rename column headers using the dictionary
df.rename(columns=channel_mapper_dict, inplace=True)

print('Renamed Columns:')
print(df.columns)  # Check the renamed columns
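
When you reuse a mapper on a different well, `.rename()` silently skips any key that is not in the file. Here is a small sketch of a coverage check you could run first. The three-entry `channel_mapper` and the empty `toy_df` are made up for this example.

```python
import pandas as pd

# Hypothetical mapper and a file with slightly different channels
channel_mapper = {"Date": "rig_time", "Weight on Bit": "wob", "Hookload": "hookload"}
toy_df = pd.DataFrame(columns=["Date", "Weight on Bit", "Pump Pressure"])

# Mapper keys that do not appear in the file (rename() skips these without warning)
missing_channels = [name for name in channel_mapper if name not in toy_df.columns]

# Columns in the file that the mapper does not cover yet
unmapped_columns = [name for name in toy_df.columns if name not in channel_mapper]

print(f"Missing from file: {missing_channels}")
print(f"Not in mapper: {unmapped_columns}")
```

Both lists being empty is a good sign that the mapper and the file line up.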

### Limit the Columns


In [None]:
# Rearrange and limit the columns in the dataframe
df = df[['rig_time', 'md', 'rop', 'wob', 'diff_press', 'td_rpm', 'td_torque',
         'bit_rpm', 'block_height', 'inc', 'azi', 'bit_size', 'on_bottom_status']]

### Use .describe to get a look under the hood

`.describe()` gives you a look at the data distributions in each of your columns

In [None]:
df.describe()

**There's a problem with our data types that is keeping `.describe()` from working as expected.**

We need to change the data types.

In [None]:
# Set each column to type float, except rig_time, which should be a datetime
for col in df.columns:
    if col == 'rig_time':
        df[col] = pd.to_datetime(df[col])
    else:
        df[col] = df[col].astype(float)
df.info()
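
`astype(float)` stops with an error if a column contains any text it cannot convert. If that happens on your own wells, one option is `pd.to_numeric` with `errors="coerce"`, which turns unreadable values into NaN instead. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical WOB column read in as text, with one bad entry and one blank
toy_df = pd.DataFrame({"wob": ["12.5", "13.0", "bad", None]})

# astype(float) would raise a ValueError on "bad"; coerce converts it to NaN instead
toy_df["wob"] = pd.to_numeric(toy_df["wob"], errors="coerce")

# Count how many values ended up missing so nothing is lost silently
n_missing = toy_df["wob"].isna().sum()
print(toy_df)
print(f"Missing values after conversion: {n_missing}")
```

The trade-off is that bad values disappear quietly, so it is worth checking the missing count afterward.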

# Homework

**Create Your Own Mappers for 3 Wells**

1. Create a new Jupyter Notebook in Google Colab:
   1. Go to https://colab.research.google.com/.
   2. Click **+ New notebook**
2. Save it to the **python-for-drilling-engineers** Google Drive in the module_2 folder as **last_name_mod_2_hw.ipynb**.
3. Download EDR data from 3 of your recent wells, grabbing all the available channels (traces).
4. Load them into your Jupyter notebook.
5. Use `.info()` and `.describe()` to understand the dataset for the first well.
6. Create a **channel_mapper_dict** to map the channels to standard channel mnemonics.
7. Use the `.rename()` method to rename the columns to the standard defined in **channel_mapper_dict**.
8. Load the next two wells and repeat the steps.

# Bonus Learning
## Plotting Basics

### DvD Plot
We're ready to start our analysis.

Let's wrap our heads around the dataset by visualizing a common days vs. depth (DvD) curve using **Matplotlib**.

In [None]:
# Import the Matplotlib library for plotting.
import matplotlib.pyplot as plt

# Reduce the frequency of the dataframe to make the plot less heavy
plot_df = df.copy()
plot_df = plot_df.iloc[::240, :]  # Take every 240th row using python slice notation --> start:stop:step
# drop rows where rig_time or md is null
plot_df.dropna(subset=['rig_time', 'md'], inplace=True)
# set rig_time as datetime
# plot_df['rig_time'] = pd.to_datetime(plot_df['rig_time'], errors='coerce')  # Convert to datetime
plot_df['rig_time'] = plot_df['rig_time'].dt.strftime('%Y-%m-%d %H:%M:%S')  # Format datetime

# convert md to numeric
plot_df['md'] = pd.to_numeric(plot_df['md'], errors='coerce')

print(f'Reduced row count: {plot_df.shape[0]}')  # Check the number of rows after reduction

# Ensure plots are displayed in Jupyter Notebook
%matplotlib inline 

# Plot a line graph with x axis = rig_time and y axis = md, then invert the y-axis
plt.figure(figsize=(10, 6))
plt.plot(plot_df['rig_time'], plot_df['md'], label='Bit Depth', color='blue')
# Reduce the x-axis ticks to only show every 200th tick
plt.xticks(plot_df['rig_time'][::200], rotation=45)  # Rotate x-axis labels for better readability
plt.gca().invert_yaxis()  # Invert the y-axis
plt.xlabel('Rig Time')  # Set x-label
plt.ylabel('Depth (ft)')  # Set y-label
plt.title('DvD Curve')
plt.show()
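
The cell above converts `rig_time` to text before plotting. As an alternative sketch, Matplotlib can also plot datetime values directly and choose the date ticks itself. The four data points below are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical depth-versus-time samples
toy_df = pd.DataFrame({
    "rig_time": pd.to_datetime(["2020-10-30", "2020-11-04", "2020-11-14", "2020-12-18"]),
    "md": [120.0, 1629.0, 5112.0, 10960.0],
})

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(toy_df["rig_time"], toy_df["md"], color="blue")
ax.invert_yaxis()                # Depth increases downward
ax.set_xlabel("Rig Time")
ax.set_ylabel("Depth (ft)")
ax.set_title("DvD Curve")
fig.autofmt_xdate(rotation=45)   # Rotate the date labels for readability
plt.show()
```

With a true datetime axis, the spacing between points reflects the actual elapsed time, which is usually what you want on a DvD curve.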

## Grouping Your DataFrame

In [None]:
bit_run_df.groupby('bit_make').size()

In [None]:
bit_make_counts = bit_run_df.groupby('bit_make').size().reset_index(name='count')
bit_make_counts

In [None]:
bit_runs_grouped = bit_run_df.groupby(['bit_od', 'bit_make']).size().reset_index(name='count')
bit_runs_grouped

### Calculating Statistics on Groups

In [None]:
bit_runs_grouped = bit_run_df.groupby(['bit_od', 'bit_make']).agg(
    count=('run_number', 'size'),
    avg_run_duration=('run_duration', 'mean'),
    avg_run_length=('run_length', 'mean'),
).reset_index()
bit_runs_grouped
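
One thing to watch when averaging rates by group: the mean of each run's ROP is not the same as total footage divided by total hours. The sketch below shows the difference on made-up numbers (`toy_df` and the bit makes "A" and "B" are hypothetical).

```python
import pandas as pd

toy_df = pd.DataFrame({
    "bit_make": ["A", "A", "B"],
    "run_length": [1000.0, 100.0, 600.0],   # assumed ft
    "run_duration": [10.0, 5.0, 20.0],      # assumed hr
})
toy_df["avg_rop"] = toy_df["run_length"] / toy_df["run_duration"]

grouped = toy_df.groupby("bit_make").agg(
    count=("run_length", "size"),
    mean_of_run_rops=("avg_rop", "mean"),     # Every run counts equally
    total_footage=("run_length", "sum"),
    total_hours=("run_duration", "sum"),
).reset_index()

# Overall ROP: long runs carry more weight than short ones
grouped["overall_rop"] = grouped["total_footage"] / grouped["total_hours"]
print(grouped)
```

Neither number is wrong. They answer different questions, so label whichever one you report.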

#### Exercise 4 - **Now You Try**

Group the bit runs by bit_od, bit_make, and bit_model, then calculate the run count and the average ROP

In [None]:
# Type your code here.
