Petrophysics Evaluation: Lithology Prediction From Well Logs Using Deep Learning
Overview
This project demonstrates an advanced approach for well log facies classification using deep learning on AWS SageMaker. We'll use a modern ML pipeline to predict lithology from well log measurements, leveraging the latest AWS services and best practices.

Updated reference architecture for petrophysics and well log analytics

Contents
[Setup](https://console.harmony.a2z.com/internal-ai-assistant#setup)
[Data Preparation](https://console.harmony.a2z.com/internal-ai-assistant#data-preparation)
[Model Development](https://console.harmony.a2z.com/internal-ai-assistant#model-development)
[Training and Tuning](https://console.harmony.a2z.com/internal-ai-assistant#training-and-tuning)
[Deployment and Inference](https://console.harmony.a2z.com/internal-ai-assistant#deployment-and-inference)
[Monitoring and MLOps](https://console.harmony.a2z.com/internal-ai-assistant#monitoring-and-mlops)
[Cleanup](https://console.harmony.a2z.com/internal-ai-assistant#cleanup)


In [22]:
"""
Setup
First, let's set up our environment and configure AWS credentials:
"""  
import boto3
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = 'sagemaker/petrophysics-facies-classification'

region = boto3.Session().region_name

In [1]:
"""
Data Preparation
We'll use AWS Glue and Amazon Athena to prepare and query our data:
"""
    
import awswrangler as wr

# Assuming data is already in S3
s3_input_path = f"s3://{bucket}/{prefix}/parquet_data/welllog_data.parquet"

# Create the Glue database if it doesn't exist
wr.catalog.create_database(
    name="petrophysics_db",
    description="Database for petrophysics data",
    exist_ok=True  # This will not raise an error if the database already exists
)

# Create Glue catalog table
wr.catalog.create_parquet_table(
    database="petrophysics_db",
    table="well_logs",
    path=s3_input_path,
    columns_types={
        "Well Name": "string",
        "Depth": "float",
        "GR": "float",
        "ILD_log10": "float",
        "DeltaPHI": "float",
        "PHIND": "float",
        "PE": "float",
        "NM_M": "int",
        "RELPOS": "float",
        "Facies": "string"
    },
    partitions_types={},
)



ModuleNotFoundError: No module named 'awswrangler'

In [None]:
# Query data using Athena
df = wr.athena.read_sql_query(
    sql="""
    SELECT * FROM petrophysics_db.well_logs
    WHERE "Well Name" NOT IN ('SHRIMPLIN', 'SHANKLE')
    """,
    database="petrophysics_db"
)


QueryFailed: COLUMN_NOT_FOUND: line 8:13: Column 'well name' cannot be resolved or requester is not authorized to access requested resources. You may need to manually clean the data at location 's3://aws-athena-query-results-479837311121-us-east-1/tables/9542f326-9e80-4427-96d9-ef2adb2845fc' before retrying. Athena will not delete data in your account.

In [None]:

# Prepare features and target
features = ['depth', 'gr', 'ild_log10', 'deltaphi', 'phind', 'pe',
       'nm_m', 'relpos', 'facies']
target = 'facies'

X = df[features]
y = df[target]

# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save processed data to S3
wr.s3.to_parquet(
    df=pd.concat([X_train, y_train], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/train/",
    dataset=True
)

wr.s3.to_parquet(
    df=pd.concat([X_test, y_test], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/test/",
    dataset=True
)

    



KeyError: "None of [Index(['depth', 'gr', 'ild_log10', 'deltaphi', 'phind', 'pe', 'nm_m', 'relpos',\n       'facies'],\n      dtype='object')] are in the [columns]"

In [None]:
import pandas as pd
import awswrangler as wr
import boto3

role = get_execution_role()
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = 'sagemaker/petrophysics-facies-classification'

# Path to your local CSV file
local_csv_path = "facies_vectors.csv"  # Replace with your local file path

# S3 path for the Parquet file
s3_parquet_path = f"s3://{bucket}/{prefix}/parquet_data/welllog_data.parquet"

# Read the CSV file
df = pd.read_csv(local_csv_path)

# Print info about the dataframe
print(f"Dataframe shape: {df.shape}")
print("\nDataframe info:")
df.info()

# Convert to Parquet and save to S3
wr.s3.to_parquet(
    df=df,
    path=s3_parquet_path,
    index=False,
    dataset=True
)

print(f"\nParquet file saved to: {s3_parquet_path}")

# Verify the file was uploaded
try:
    obj = s3.head_object(Bucket=bucket, Key=f"{prefix}/parquet_data/welllog_data.parquet")
    print(f"File successfully uploaded. Size: {obj['ContentLength']} bytes")
except Exception as e:
    print(f"Error verifying file upload: {str(e)}")

# Optional: Read back a sample from the Parquet file to verify
df_sample = wr.s3.read_parquet(path=s3_parquet_path, dataset=True).head()
print("\nSample from uploaded Parquet file:")
print(df_sample)


Dataframe shape: (4149, 11)

Dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Facies     4149 non-null   object 
 1   Formation  4149 non-null   object 
 2   Well Name  4149 non-null   object 
 3   Depth      4149 non-null   float64
 4   GR         4149 non-null   float64
 5   ILD_log10  4149 non-null   float64
 6   DeltaPHI   4149 non-null   float64
 7   PHIND      4149 non-null   float64
 8   PE         3232 non-null   float64
 9   NM_M       4149 non-null   int64  
 10  RELPOS     4149 non-null   float64
dtypes: float64(7), int64(1), object(3)
memory usage: 356.7+ KB

Parquet file saved to: s3://sagemaker-us-east-1-479837311121/sagemaker/petrophysics-facies-classification/parquet_data/welllog_data.parquet
Error verifying file upload: An error occurred (404) when calling the HeadObject operation: Not Found

Sample from uploaded

In [None]:
# Create the Glue database if it doesn't exist
wr.catalog.create_database(
    name="petrophysics_db",
    description="Database for petrophysics data",
    exist_ok=True
)

# Delete the table if it already exists
wr.catalog.delete_table_if_exists(database="petrophysics_db", table="well_logs")

# Create Glue catalog table
wr.catalog.create_csv_table(
    database="petrophysics_db",
    table="well_logs",
    path=s3_input_path,
    columns_types={
        "Facies": "string",
        "Formation": "string",
        "Well Name": "string",
        "Depth": "float",
        "GR": "float",
        "ILD_log10": "float",
        "DeltaPHI": "float",
        "PHIND": "float",
        "PE": "float",
        "NM_M": "int",
        "RELPOS": "float"
    },
    partitions_types={},
)

# Query data using Athena
df = wr.athena.read_sql_query(
    sql="""
    SELECT * FROM petrophysics_db.well_logs
    WHERE "Well Name" NOT IN ('SHRIMPLIN', 'SHANKLE')
    """,
    database="petrophysics_db"
)

# Prepare features and target
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
target = 'Facies'

X = df[features]
y = df[target]

# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save processed data to S3
wr.s3.to_parquet(
    df=pd.concat([X_train, y_train], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/train/",
    dataset=True
)

wr.s3.to_parquet(
    df=pd.concat([X_test, y_test], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/test/",
    dataset=True
)

print("Data preparation completed successfully.")


QueryFailed: COLUMN_NOT_FOUND: line 8:13: Column 'well name' cannot be resolved or requester is not authorized to access requested resources. You may need to manually clean the data at location 's3://aws-athena-query-results-479837311121-us-east-1/tables/380e8c7b-645b-4c5a-8282-18d38e2bfac9' before retrying. Athena will not delete data in your account.

In [None]:
import awswrangler as wr
import pandas as pd
import boto3
from sklearn.model_selection import train_test_split

# Set up S3 client and bucket information
session = boto3.Session()
bucket = "sagemaker-us-east-1-479837311121"
prefix = 'sagemaker/petrophysics-facies-classification'


# S3 path for the Parquet file
s3_parquet_path = f"s3://{bucket}/{prefix}/parquet_data/welllog_data.parquet"

# 1. Read the Parquet data
print("Reading Parquet data from S3...")
df = wr.s3.read_parquet(path=s3_parquet_path)

print(f"Shape of df: {df.shape}")
print(df.head())

# 2. Create or update the Athena table
print("\nCreating/Updating Athena table...")
wr.catalog.create_parquet_table(
    database="petrophysics_db",
    table="well_logs",
    path=s3_parquet_path,
    columns_types=wr.catalog.get_parquet_metadata(s3_parquet_path)["columns_types"],
    mode="overwrite"
)

# 3. Query data using Athena
print("\nQuerying data using Athena...")
df = wr.athena.read_sql_query(
    sql="""
    SELECT * FROM petrophysics_db.well_logs
    WHERE "Well Name" NOT IN ('SHRIMPLIN', 'SHANKLE')
    """,
    database="petrophysics_db"
)

print(f"Shape of df after Athena query: {df.shape}")
print(df.head())

# 4. Prepare features and target
print("\nPreparing features and target...")
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
target = 'Facies'

X = df[features]
y = df[target]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# 5. Split data
print("\nSplitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")

# 6. Save processed data to S3
print("\nSaving processed data to S3...")
wr.s3.to_parquet(
    df=pd.concat([X_train, y_train], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/train/",
    dataset=True
)

wr.s3.to_parquet(
    df=pd.concat([X_test, y_test], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/test/",
    dataset=True
)

print("Data preparation completed successfully.")


Reading Parquet data from S3...
Shape of df: (4149, 11)
                Facies Formation Well_Name   Depth     GR  ILD_log10  \
0  Nonmarine sandstone     A1 SH   SHANKLE  2785.0  58.63      0.444   
1  Nonmarine sandstone     A1 SH   SHANKLE  2785.5  59.16      0.413   
2  Nonmarine sandstone     A1 SH   SHANKLE  2786.0  52.86      0.378   
3  Nonmarine sandstone     A1 SH   SHANKLE  2786.5  54.32      0.344   
4  Nonmarine sandstone     A1 SH   SHANKLE  2787.0  53.27      0.320   

   DeltaPHI   PHIND   PE  NM_M  RELPOS  
0       4.4  12.440  2.7     1   0.661  
1       3.7  13.640  2.6     1   0.645  
2       2.5  14.490  2.6     1   0.629  
3       1.1  15.085  2.6     1   0.613  
4      -0.3  14.985  2.7     1   0.597  

Creating/Updating Athena table...


AttributeError: module 'awswrangler.catalog' has no attribute 'get_parquet_metadata'

In [None]:
print(df.head())

# 4. Prepare features and target
print("\nPreparing features and target...")
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
target = 'Facies'

X = df[features]
y = df[target]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# 5. Split data
print("\nSplitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")

# 6. Save processed data to S3
print("\nSaving processed data to S3...")
wr.s3.to_parquet(
    df=pd.concat([X_train, y_train], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/train/",
    dataset=True
)

wr.s3.to_parquet(
    df=pd.concat([X_test, y_test], axis=1),
    path=f"s3://{bucket}/{prefix}/processed_data/test/",
    dataset=True
)

print("Data preparation completed successfully.")


Shape of df after Athena query: (4149, 11)
                Facies Formation Well_Name   Depth     GR  ILD_log10  \
0  Nonmarine sandstone     A1 SH   SHANKLE  2785.0  58.63      0.444   
1  Nonmarine sandstone     A1 SH   SHANKLE  2785.5  59.16      0.413   
2  Nonmarine sandstone     A1 SH   SHANKLE  2786.0  52.86      0.378   
3  Nonmarine sandstone     A1 SH   SHANKLE  2786.5  54.32      0.344   
4  Nonmarine sandstone     A1 SH   SHANKLE  2787.0  53.27      0.320   

   DeltaPHI   PHIND   PE  NM_M  RELPOS  
0       4.4  12.440  2.7     1   0.661  
1       3.7  13.640  2.6     1   0.645  
2       2.5  14.490  2.6     1   0.629  
3       1.1  15.085  2.6     1   0.613  
4      -0.3  14.985  2.7     1   0.597  

Preparing features and target...
Shape of X: (4149, 7)
Shape of y: (4149,)

Splitting data...
Shape of X_train: (3319, 7)
Shape of X_test: (830, 7)

Saving processed data to S3...
Data preparation completed successfully.


In [None]:
df.head()

Unnamed: 0,Facies,Formation,Well_Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS
0,Nonmarine sandstone,A1 SH,SHANKLE,2785.0,58.63,0.444,4.4,12.44,2.7,1,0.661
1,Nonmarine sandstone,A1 SH,SHANKLE,2785.5,59.16,0.413,3.7,13.64,2.6,1,0.645
2,Nonmarine sandstone,A1 SH,SHANKLE,2786.0,52.86,0.378,2.5,14.49,2.6,1,0.629
3,Nonmarine sandstone,A1 SH,SHANKLE,2786.5,54.32,0.344,1.1,15.085,2.6,1,0.613
4,Nonmarine sandstone,A1 SH,SHANKLE,2787.0,53.27,0.32,-0.3,14.985,2.7,1,0.597


In [None]:
# Create Glue catalog table
wr.catalog.create_csv_table(  # Note: changed from create_parquet_table to create_csv_table
    database="petrophysics_db",
    table="well_logs",
    path=s3_input_path,
    columns_types={
        "Well Name": "string",  # Changed from "Well Name" to "well name"
        "Depth": "float",
        "GR": "float",
        "ILD_log10": "float",
        "DeltaPHI": "float",
        "PHIND": "float",
        "PE": "float",
        "NM_M": "int",
        "RELPOS": "float",
        "Facies": "string"
    },
    partitions_types={},
)


In [None]:
"""   
Model Development
We'll use PyTorch to create a deep learning model:
"""
    
import torch
import torch.nn as nn

class FaciesClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FaciesClassifier, self).__init__()
        self.layer_1 = nn.Linear(input_dim, hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, hidden_dim)
        self.layer_out = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.1)
        self.batchnorm1 = nn.BatchNorm1d(hidden_dim)
        self.batchnorm2 = nn.BatchNorm1d(hidden_dim)

    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.batchnorm1(x)
        x = self.relu(self.layer_2(x))
        x = self.batchnorm2(x)
        x = self.dropout(x)
        x = self.layer_out(x)
        return x

# Define model architecture
input_dim = len(features)
hidden_dim = 64
output_dim = len(y.unique())

model = FaciesClassifier(input_dim, hidden_dim, output_dim)

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter
import boto3
import pandas as pd

# Set up the SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Define your S3 bucket and prefix
bucket = "sagemaker-us-east-1-479837311121"
prefix = "sagemaker/petrophysics-facies-classification"


# Define hyperparameter ranges
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.001, 0.1),
    "batch_size": IntegerParameter(32, 256),
    "hidden_dim": IntegerParameter(32, 256),
}

# Create PyTorch estimator
estimator = PyTorch(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.9.1",
    py_version="py3",
    hyperparameters={
        "epochs": 50,
    },
    sagemaker_session=sagemaker_session
)

# Create hyperparameter tuner
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation_accuracy",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {"Name": "validation_accuracy", "Regex": "Validation accuracy: ([0-9\\.]+)"}
    ],
    max_jobs=20,
    max_parallel_jobs=3,
)


In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

# Set up the SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Define your S3 bucket and prefix
bucket = "sagemaker-us-east-1-479837311121"
prefix = "sagemaker/petrophysics-facies-classification"

# Define hyperparameter ranges
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.001, 0.1),
    "batch_size": IntegerParameter(32, 256),
    "hidden_dim": IntegerParameter(32, 256),
}

# Create PyTorch estimator
estimator = PyTorch(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.9.1",
    py_version="py38",
    hyperparameters={
        "epochs": 50,
    },
    sagemaker_session=sagemaker_session
)

# Create hyperparameter tuner
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation_accuracy",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {"Name": "validation_accuracy", "Regex": "Validation accuracy: ([0-9\\.]+)"}
    ],
    max_jobs=20,
    max_parallel_jobs=3,
)

# Start hyperparameter tuning job
tuner.fit({"train": f"s3://{bucket}/{prefix}/processed_data/train",
           "test": f"s3://{bucket}/{prefix}/processed_data/test"})

# Get best training job
best_training_job = tuner.best_training_job()
print(f"Best training job: {best_training_job}")


No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


...................................*


UnexpectedStatusException: Error for HyperParameterTuning job pytorch-training-250109-1757: Failed. Reason: No training job succeeded after 5 attempts. For additional details, please take a look at the training job failures by listing training jobs for the hyperparameter tuning job.

In [None]:
estimator.fit({"train": f"s3://{bucket}/{prefix}/processed_data/train",
               "test": f"s3://{bucket}/{prefix}/processed_data/test"})


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2025-01-09-18-17-17-647


2025-01-09 18:17:21 Starting - Starting the training job...
2025-01-09 18:17:34 Starting - Preparing the instances for training...
2025-01-09 18:18:19 Downloading - Downloading the training image......
2025-01-09 18:19:05 Training - Training image download completed. Training in progress.[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2025-01-09 18:19:15,270 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2025-01-09 18:19:15,272 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2025-01-09 18:19:15,281 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2025-01-09 18:19:15,283 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2025-01-09 18:19:15,442 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installe

UnexpectedStatusException: Error for Training job pytorch-training-2025-01-09-18-17-17-647: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/train/train.parquet'
"
Command "/opt/conda/bin/python3.8 train.py --epochs 50", exit code: 1

In [None]:
"""    
Monitoring and MLOps
Set up model monitoring and implement MLOps practices:
"""
    
from sagemaker.model_monitor import DataCaptureConfig, ModelMonitor

# Enable data capture
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=f"s3://{bucket}/{prefix}/data_capture"
)

# Create model monitor
model_monitor = ModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Set up monitoring schedule
model_monitor.create_monitoring_schedule(
    monitor_schedule_name='facies-classification-monitor',
    endpoint_input=predictor.endpoint,
    statistics=model_monitor.baseline_statistics(),
    constraints=model_monitor.suggested_constraints(),
    schedule_cron_expression='cron(0 * ? * * *)'
)

In [None]:
"""    
Cleanup
Clean up resources when you're done:
"""
    
# Delete endpoint
predictor.delete_endpoint()

# Delete model monitor schedule
model_monitor.delete_monitoring_schedule()

# Delete S3 objects
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
bucket.objects.filter(Prefix=prefix).delete()

In [None]:
 

    
This updated version incorporates modern AWS services, improves the ML pipeline with deep learning and hyperparameter tuning, adds data preparation using Glue and Athena, implements model monitoring, and follows MLOps best practices. Remember to create appropriate IAM roles, encrypt data, and follow AWS security best practices throughout the implementation.