This file demonstrates the process and code used in our research:

**Imputation of missing values in well log data using k-nearest neighbor collaborative filtering**

The following Python libraries are used in our research:

*   NumPy
*   Pandas
*   Matplotlib
*   Scikit-Learn





Important notes:
* Run each cell sequentially unless stated otherwise
* Some cells may take a long time to run



In [16]:
# Import all needed python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

# Suppress warnings (optional)
import warnings
warnings.filterwarnings("ignore")

# Import Well Log Data

Import the well log data files "CSV_train.csv" and "CSV_hidden_test.csv" into Pandas DataFrames and make the necessary adjustments.

In [17]:
# Import both data files and combine them into one DataFrame
df_labeled_1 = pd.read_csv('CSV_train.csv', sep=';') # Change file path if necessary
df_labeled_2 = pd.read_csv('CSV_hidden_test.csv', sep=';') # Change file path if necessary

df_labeled = pd.concat([df_labeled_1, df_labeled_2], ignore_index=True)

In [18]:
# Add "True Vertical Depth (TVD)" column

df_labeled['DEPTH_TVD'] = df_labeled['Z_LOC'] * -1
column_to_move = df_labeled.pop('DEPTH_TVD')
df_labeled.insert(2, 'DEPTH_TVD', column_to_move)

The following code adjusts the lithology column for more convenient analysis.

The original data contains 12 lithologies, but our research uses only 11. The "Basement" lithology is removed because it makes up a very small portion of the dataset.

In [19]:
## This cell remaps the lithology codes and adds a lithology name column for more convenient analysis

lithology_numbers = {
                 30000: 0,
                 65030: 1,
                 65000: 2,
                 80000: 3,
                 74000: 4,
                 70000: 5,
                 70032: 6,
                 88000: 7,
                 86000: 8,
                 99000: 9,
                 90000: 10,
                 93000: 11
                 }

df_labeled=df_labeled.replace({"FORCE_2020_LITHOFACIES_LITHOLOGY": lithology_numbers})



lithology_type = {
                 0: 'Sandstone',
                 1: 'ShalySand',
                 2: 'Shale',
                 3: 'Marl',
                 4: 'Dolomite',
                 5: 'Limestone',
                 6: 'Chalk',
                 7: 'Halite',
                 8: 'Anhydrite',
                 9: 'Tuff',
                 10: 'Coal',
                 11: 'Basement'
                 }

df_labeled['Lithology_Type'] = df_labeled['FORCE_2020_LITHOFACIES_LITHOLOGY'].map(lithology_type)

# Drop basement lithology
df_labeled  = df_labeled[df_labeled['Lithology_Type'] != 'Basement']

In [20]:
# Fill Null Cells in Dataframe with -9999

df_labeled = df_labeled.fillna(-9999)

# Data Preprocessing (Test Data Fabrication)

Our research utilizes four log features: GR, RHOB, NPHI, DTC.

Other log curves are ignored.

In [21]:
# Features used
features = ["WELL", "DEPTH_TVD",'GR', 'RHOB', 'NPHI', 'DTC', 'Lithology_Type', 'FORCE_2020_LITHOFACIES_LITHOLOGY']
df_labeled_with_features_original = df_labeled[features]

# Save file for later
# This file represents the original data file used to validate our method.
df_labeled_with_features_original.to_csv('df_labeled_with_features_original.csv') 

The following code carries out the test data fabrication process for the case **when two logs are missing simultaneously**.

First, a 50-meter interval of the RHOB, NPHI, or DTC log curve is intentionally removed.
This process is repeated three times for each feature, each time using a different well.

A second log curve is also intentionally removed over the same intervals:
1) For RHOB, DTC is also removed for the same depth intervals.
2) For NPHI, RHOB is also removed for the same depth intervals.
3) For DTC, NPHI is also removed for the same depth intervals.


The removed data points are gathered to create the "Test Data Matrix". The remaining data is used to create the "Neighbor Data Matrix".
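
As a minimal sketch of this fabrication step, the snippet below blanks two logs over a single depth interval with a boolean mask and keeps the untouched rows of that interval as ground truth. The helper name `blank_interval` and the toy values are illustrative assumptions and are not part of the original notebook; the -9999 sentinel and the column names follow the cells in this file.

```python
import pandas as pd

MISSING = -9999  # sentinel used in this notebook for missing values


def blank_interval(df, well, top, base, logs):
    """Hypothetical helper: return (blanked copy, original rows of the interval)."""
    mask = (df['WELL'] == well) & (df['DEPTH_TVD'] > top) & (df['DEPTH_TVD'] < base)
    blanked = df.copy()
    blanked.loc[mask, logs] = MISSING  # one vectorized assignment instead of a row loop
    return blanked, df[mask]


# Toy data (made-up values, not taken from the dataset)
toy = pd.DataFrame({
    'WELL': ['A', 'A', 'A', 'B'],
    'DEPTH_TVD': [2590.0, 2610.0, 2640.0, 2610.0],
    'GR': [60.0, 75.0, 80.0, 55.0],
    'RHOB': [2.30, 2.45, 2.50, 2.20],
    'NPHI': [0.25, 0.20, 0.18, 0.30],
    'DTC': [90.0, 85.0, 82.0, 100.0],
})

# Remove RHOB and DTC for well "A" between 2600 m and 2650 m
toy_blanked, toy_truth = blank_interval(toy, 'A', 2600, 2650, ['RHOB', 'DTC'])
print(toy_blanked)
```

The mask-based assignment should give the same result as looping over the row indices, and it avoids repeating the filter expression for every well.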

In [24]:
## Erase Parts of Each Log Curve Intentionally ##
df_labeled_with_features_sampling = df_labeled_with_features_original.copy()
df_labeled_with_features_sampling['WELL'] = df_labeled['WELL']
df_labeled_with_features_sampling['DEPTH_TVD']=df_labeled['DEPTH_TVD']



#For Well 35/11-10
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '35/11-10') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2600) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2650)].tolist()
for index in index_to_remove:
    for feature in ['RHOB', 'DTC']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999

#For Well 25/11-5
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '25/11-5') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 1900) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 1950)].tolist()
for index in index_to_remove:
    for feature in ['RHOB', 'DTC']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999

#For Well 31/3-4
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '31/3-4') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2000) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2050)].tolist()
for index in index_to_remove:
    for feature in ['RHOB', 'DTC']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999
####################################################################################################################################


#For Well 31/2-10
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '31/2-10') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 1650) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 1700)].tolist()
for index in index_to_remove:
    for feature in ['NPHI', 'RHOB']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999


#For Well 35/11-11
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '35/11-11') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 3000) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 3050)].tolist()
for index in index_to_remove:
    for feature in ['NPHI', 'RHOB']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999


#For Well 31/6-5
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '31/6-5') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 1950) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2000)].tolist()
for index in index_to_remove:
    for feature in ['NPHI', 'RHOB']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999

####################################################################################################################################

#For Well 16/7-6
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '16/7-6') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2350) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2400)].tolist()
for index in index_to_remove:
    for feature in ['DTC', 'NPHI']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999


# For Well 31/2-21 S
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '31/2-21 S') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2800) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2850)].tolist()
for index in index_to_remove:
    for feature in ['DTC', 'NPHI']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999


# For Well 34/7-13
index_to_remove = df_labeled_with_features_sampling.index[(df_labeled_with_features_sampling['WELL'] == '34/7-13') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2800) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2850)].tolist()
for index in index_to_remove:
    for feature in ['DTC', 'NPHI']:
        df_labeled_with_features_sampling.loc[index, feature] = -9999

####################################################################################################################################        
        
        
## Save file for later
# Test Data file
df_labeled_with_features_sampling.to_csv('df_labeled_with_features_sampling_with_multiple_logs_removed.csv')

In [25]:
## This cell extracts data points from the previous cell to make the "test data matrix" for each of the three features

#RHOB
df_orig_1 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '35/11-10') & (df_labeled_with_features_original['DEPTH_TVD'] > 2600) & (df_labeled_with_features_original['DEPTH_TVD'] < 2650)]
df_orig_2 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '25/11-5') & (df_labeled_with_features_original['DEPTH_TVD'] > 1900) & (df_labeled_with_features_original['DEPTH_TVD'] < 1950)]
df_orig_3 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '31/3-4') & (df_labeled_with_features_original['DEPTH_TVD'] > 2000) & (df_labeled_with_features_original['DEPTH_TVD'] < 2050)]

#NPHI
df_orig_4 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '31/2-10') & (df_labeled_with_features_original['DEPTH_TVD'] > 1650) & (df_labeled_with_features_original['DEPTH_TVD'] < 1700)]
df_orig_5 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '35/11-11') & (df_labeled_with_features_original['DEPTH_TVD'] > 3000) & (df_labeled_with_features_original['DEPTH_TVD'] < 3050)]
df_orig_6 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '31/6-5') & (df_labeled_with_features_original['DEPTH_TVD'] > 1950) & (df_labeled_with_features_original['DEPTH_TVD'] < 2000)]

#DTC
df_orig_7 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '31/2-21 S') & (df_labeled_with_features_original['DEPTH_TVD'] > 2800) & (df_labeled_with_features_original['DEPTH_TVD'] < 2850)]
df_orig_8 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '16/7-6') & (df_labeled_with_features_original['DEPTH_TVD'] > 2350) & (df_labeled_with_features_original['DEPTH_TVD'] < 2400)]
df_orig_9 = df_labeled_with_features_original[(df_labeled_with_features_original['WELL'] == '34/7-13') & (df_labeled_with_features_original['DEPTH_TVD'] > 2800) & (df_labeled_with_features_original['DEPTH_TVD'] < 2850)]

df_original_test = pd.concat([df_orig_1, df_orig_2, df_orig_3, df_orig_4, df_orig_5, df_orig_6, df_orig_7, df_orig_8, df_orig_9])

# Save File for Later
df_original_test.to_csv('df_original_test.csv') # Test Data matrix (Original Data)

########################################################################################################################

#RHOB
df_sample_1 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '35/11-10') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2600) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2650)]
df_sample_2 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '25/11-5') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 1900) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 1950)]
df_sample_3 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '31/3-4') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2000) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2050)]

#NPHI
df_sample_4 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '31/2-10') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 1650) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 1700)]
df_sample_5 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '35/11-11') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 3000) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 3050)]
df_sample_6 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '31/6-5') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 1950) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2000)]

#DTC
df_sample_7 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '31/2-21 S') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2800) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2850)]
df_sample_8 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '16/7-6') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2350) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2400)]
df_sample_9 = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['WELL'] == '34/7-13') & (df_labeled_with_features_sampling['DEPTH_TVD'] > 2800) & (df_labeled_with_features_sampling['DEPTH_TVD'] < 2850)]

df_sample_test = pd.concat([df_sample_1, df_sample_2, df_sample_3, df_sample_4, df_sample_5, df_sample_6, df_sample_7, df_sample_8, df_sample_9])

# Save File for Later
df_sample_test.to_csv('df_sample_test_with_multiple_logs_removed.csv') # Test Data matrix with missing intervals

In [26]:
## This cell creates the "Neighbor Data Matrix" from the remaining data, excluding the data points that are used for prediction

# If the files created by the previous cells are already available, the notebook can be run starting from this cell.
df_labeled_with_features_sampling = pd.read_csv('df_labeled_with_features_sampling_with_multiple_logs_removed.csv')
df_sample_test = pd.read_csv('df_sample_test_with_multiple_logs_removed.csv')


RHOB_unknown = df_sample_test[(df_sample_test['RHOB']==-9999) & (df_sample_test['DTC']==-9999)]

NPHI_unknown = df_sample_test[(df_sample_test['NPHI']==-9999) & (df_sample_test['RHOB']==-9999)]

DTC_unknown = df_sample_test[(df_sample_test['DTC']==-9999) & (df_sample_test['NPHI']==-9999)]


features_df_dict ={    
    
    'RHOB_unknown': RHOB_unknown,
    
    'NPHI_unknown': NPHI_unknown,
    
    'DTC_unknown': DTC_unknown

}


# Neighbor Data Matrix
features_known_df = df_labeled_with_features_sampling[(df_labeled_with_features_sampling['GR'] != -9999) & (df_labeled_with_features_sampling['RHOB'] != -9999) & (df_labeled_with_features_sampling['NPHI'] != -9999) & (df_labeled_with_features_sampling['DTC'] != -9999)]
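
The same "all four logs present" filter can be written more compactly. The sketch below uses toy values and a hypothetical `log_columns` list; it is an alternative formulation under the assumption that -9999 is the only missing-value marker, not the code used to produce the results.

```python
import pandas as pd

MISSING = -9999  # sentinel used in this notebook for missing values
log_columns = ['GR', 'RHOB', 'NPHI', 'DTC']  # assumed list of the four logs

# Toy data (made-up values, not taken from the dataset)
toy = pd.DataFrame({
    'GR':   [60.0, 75.0, MISSING, 55.0],
    'RHOB': [2.30, MISSING, 2.50, 2.20],
    'NPHI': [0.25, 0.20, 0.18, 0.30],
    'DTC':  [90.0, MISSING, 82.0, 100.0],
})

# A row qualifies as a neighbor only if none of its logs is missing
complete_rows = toy[log_columns].ne(MISSING).all(axis=1)
toy_neighbors = toy[complete_rows]
print(toy_neighbors.index.tolist())
```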

# Collaborative Filtering Algorithm on Well Log Data

Warning: The following cell may take a very long time to run, depending on your hardware.

The result files have already been created, so you may use those files instead.

In [None]:
### Collaborative Filtering ###

# Cosine Similarity Function
def new_cosine(u_df, v_df):
    from sklearn.metrics.pairwise import cosine_similarity
    
    compare_df = pd.concat([u_df, v_df])
    compare_df_reset = compare_df.reset_index(drop=True)
    compare_df_drop = compare_df_reset.replace(-9999, 0)
    compare_df_drop = compare_df_drop.drop(columns=['Lithology_Type']) # Lithology Information is dropped as our research assumes lithology information is unknown

    target_array = np.array(compare_df_drop.values)[[0]]
    compare_arrays = np.array(compare_df_drop.values)[1:]

    cosine_sim = cosine_similarity(target_array, compare_arrays)

    return cosine_sim[0]


# Collaborative Filtering Algorithm
feature_list = ['RHOB', 'NPHI', 'DTC']
cos_sim_compare_num_list = [2, 5, 10, 40, 90, 130, 170] # Number of neighbors k (change as desired)
for feature in feature_list:
    for cos_sim_compare_num in cos_sim_compare_num_list:
        single_feature_unknown_df = features_df_dict.get('%s_unknown' % feature)
        features_known_df = features_known_df[['GR', 'RHOB', 'NPHI', 'DTC', 'Lithology_Type']]
        single_feature_unknown_df = single_feature_unknown_df[['GR', 'RHOB', 'NPHI', 'DTC', 'Lithology_Type']]


        for index in single_feature_unknown_df.index:
            lith = single_feature_unknown_df.loc[index, 'Lithology_Type']
            cosine_sim = new_cosine(single_feature_unknown_df.loc[[index]], features_known_df)
            related_doc_indices = cosine_sim.argsort()

            features_known_without_lith_df = features_known_df.drop(columns=['Lithology_Type'])

            # Use the top k most similar rows (all rows if fewer than k are available).
            # Row labels and similarities are taken in the same sorted order so the weights stay aligned.
            top_positions = related_doc_indices[::-1][:cos_sim_compare_num]
            row_index_total = features_known_without_lith_df.index[top_positions].tolist()
            cosine_sim_total = cosine_sim[top_positions]

            # Weighted Sum Method
            features_known_without_lith_df_cs_total = features_known_without_lith_df.loc[row_index_total].reset_index(drop=True)
            features_known_without_lith_df_cs_total['Cos_Sim'] = pd.Series(cosine_sim_total)
            features_known_without_lith_df_cs_total['Weighted_Sum'] = features_known_without_lith_df_cs_total[feature] * features_known_without_lith_df_cs_total['Cos_Sim']

            predicted_value = features_known_without_lith_df_cs_total['Weighted_Sum'].sum() / sum(cosine_sim_total)

            lith_count_df = features_known_df.loc[row_index_total].reset_index(drop=True).Lithology_Type.value_counts().rename_axis('unique_values').reset_index(name='counts')


            # Final
            df_sample_test.loc[index, feature] = predicted_value
        
        # Predicted Test Data Matrix using k neighbors
        df_sample_test.to_csv('df_sample_test_%s_%s_with_multiple_logs_removed.csv'%(feature, cos_sim_compare_num))
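
To make the prediction step easier to follow, here is a small self-contained sketch of the same idea for a single data point: missing entries are set to 0, cosine similarity is computed against the complete rows, and the k most similar rows are combined by a similarity-weighted average. The function name `knn_weighted_estimate` and the toy values are assumptions for illustration only; this is a simplified sketch, not the code that produced the result files.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def knn_weighted_estimate(query, neighbors, target_col, k):
    """Hypothetical helper: cosine-weighted mean of one column over the k most similar rows.

    query:     1-D array whose missing entries have already been set to 0
    neighbors: 2-D array of complete rows, same column order as query
    """
    sims = cosine_similarity(query.reshape(1, -1), neighbors)[0]
    top = np.argsort(sims)[::-1][:k]  # positions of the k most similar rows
    weights = sims[top]
    return float(np.dot(weights, neighbors[top, target_col]) / weights.sum())


# Columns: GR, RHOB, NPHI, DTC (made-up values, not taken from the dataset)
neighbors = np.array([
    [60.0, 2.30, 0.25, 90.0],
    [75.0, 2.45, 0.20, 85.0],
    [80.0, 2.50, 0.18, 82.0],
    [20.0, 2.10, 0.35, 110.0],
])
query = np.array([78.0, 0.0, 0.19, 0.0])  # RHOB and DTC missing, replaced by 0

estimate = knn_weighted_estimate(query, neighbors, target_col=1, k=2)
print(round(estimate, 3))
```

Because the weights are normalized by their sum, the estimate always lies between the smallest and largest target values among the selected neighbors, provided all similarities are positive.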

