<div style="display: flex; align-items: center;padding-bottom:20px;">
    
    <div>    
        <h1 style="margin: 0;">Data Scientist Consultant – Commodity Insights</h1>
        <h3 style="margin-top: 5px;">Technical Assessment</h3>
    </div>
</div>
<div >        
    <h4 style="margin-top: 5px;">Presented by Jairo Ruiz Saenz - April 21st, 2024</h4>
    <p style="font-size: 15px;">
        This notebook provides a comprehensive analysis of the data challenge, detailing all steps from data preprocessing and cleaning to extracting insights and addressing business questions.
    </p>
</div>

## Technical Assessment

- **Objective**: Build a model to estimate production
- **Dataset**: http://huy302.github.io/interview_dataset.csv
- **Features description**: https://huy302.github.io/feature_desc.md
- **Problem description**: Production prediction is one of the core problems in our business. The provided dataset is a set of nearby wells located in the United States and their 12-month cumulative production. As a data scientist, you want to build a model from scratch to predict production and show your manager that your model can perform well on unseen data.
- **Submitting material**: code (Python/notebook) and supporting materials if needed (analysis, documentation, paper, slide)

## Dataset feature descriptions

- **treatment company**: The treatment company who provides treatment service.
- **azimuth**: Well drilling direction.
- **md (ft)**: Measured depth.
- **tvd (ft)**: True vertical depth.
- **date on production**: First production date.
- **operator**: The well operator who performs drilling service.
- **footage lateral length**: Horizontal well section.
- **well spacing**: Distance to the closest nearby well.
- **porpoise deviation**: The maximum distance (in ft) that a well deviated from its horizontal path.
- **porpoise count**: How many times the deviations (porpoises) occurred.
- **shale footage**: How much shale (in ft) encountered in a horizontal well.
- **acoustic impedance**: The impedance of a reservoir rock (ft/s * g/cc).
- **log permeability**: The property of rocks that indicates the ability of fluids (gas or liquid) to flow through them.
- **porosity**: The percentage of void space in a rock. It is defined as the ratio of the volume of the voids or pore space divided by the total volume. It is written as either a decimal fraction between 0 and 1 or as a percentage.
- **poisson ratio**: Measures the ratio of lateral strain to axial strain at linearly elastic region.
- **water saturation**: The ratio of water volume to pore volume.
- **toc**: Total Organic Carbon, indicates the organic richness (hydrocarbon generative potential) of a reservoir rock.
- **vcl**: The amount of clay minerals in a reservoir rock.
- **p-velocity**: The velocity of P-waves (compressional waves) through a reservoir rock (ft/s).
- **s-velocity**: The velocity of S-waves (shear waves) through a reservoir rock (ft/s).
- **youngs modulus**: The ratio of the applied stress to the fractional extension (or shortening) of the reservoir rock parallel to the tension (or compression) (giga pascals).
- **isip**: When the pumps are quickly stopped, and the fluids stop moving, these friction pressures disappear and the resulting pressure is called the instantaneous shut-in pressure, ISIP.
- **breakdown pressure**: The pressure at which a hydraulic fracture is created/initiated/induced.
- **pump rate**: The volume of liquid that travels through the pump in a given time. A hydraulic fracture is formed by pumping fluid into a wellbore at a rate sufficient to increase pressure at the target depth, to exceed that of the fracture gradient (pressure gradient) of the rock.
- **total number of stages**: Total stages used to fracture the horizontal section of the well.
- **proppant volume**: The amount of proppant in pounds used in the completion of a well (lbs).
- **proppant fluid ratio**: The ratio of proppant volume/fluid volume (lbs/gallon).
- **production**: The 12 months cumulative gas production (mmcf).
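
The ratio definitions above can be inverted to recover quantities the dataset does not list directly. As a minimal sketch, assuming `proppant fluid ratio` is exactly proppant volume (lbs) divided by fluid volume (gallons) as described, the hypothetical helper `fluid_volume_gallons` backs out the implied fluid volume. The helper name and the example values are illustrative only.

```python
def fluid_volume_gallons(proppant_volume_lbs: float, proppant_fluid_ratio: float) -> float:
    """Implied fluid volume, assuming ratio = proppant volume / fluid volume."""
    if proppant_fluid_ratio <= 0:
        raise ValueError("proppant fluid ratio must be positive")
    return proppant_volume_lbs / proppant_fluid_ratio


# Illustrative values: 21.6 million lbs of proppant at 1.23 lbs/gallon
implied_fluid = fluid_volume_gallons(21_600_000, 1.23)
print(f"{implied_fluid:,.0f} gallons")
```

A derived column like this could be tried as an extra feature, although whether it helps the model is not something this notebook tests.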

In [1]:
# Import of libraries used in the script

import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

In [2]:
# Set display options so that wide DataFrames are shown in full when using functions like df.head()

pd.set_option('display.max_columns', None)  # Show all columns (no limit)
pd.set_option('display.max_rows', 50)  # Show at most 50 rows
pd.set_option('display.precision', 2)  # Set precision for float numbers to 2 decimal places
pd.set_option('display.max_colwidth', None)  # Set maximum column width to None (unlimited)

In [3]:
# Read the CSV file from its URL and create a DataFrame
data_url = 'http://huy302.github.io/interview_dataset.csv'
df = pd.read_csv(data_url)

In [4]:
# Preview of the DataFrame
df.sample(n=5)

Unnamed: 0,treatment company,azimuth,md (ft),tvd (ft),date on production,operator,footage lateral length,well spacing,porpoise deviation,porpoise count,shale footage,acoustic impedance,log permeability,porosity,poisson ratio,water saturation,toc,vcl,p-velocity,s-velocity,youngs modulus,isip,breakdown pressure,pump rate,total number of stages,proppant volume,proppant fluid ratio,production
417,treatment company 4,-43.19,13446,6449.0,9/1/2016,operator 29,6663.0,852.61,40.74,5,0,30707.77,0.82,0.0,0.32,,4.81,0.43,12763.71,7491.37,32.68,4893.0,,101,42,12500000.0,0.95,1122.13
411,treatment company 12,-70.38,15134,8020.0,6/1/2018,operator 4,5543.0,904.93,4.17,7,3701,33353.53,0.56,0.01,0.34,0.87,4.56,0.3,13490.1,6877.69,30.66,5744.0,,95,28,,1.41,2186.6
581,treatment company 8,,16207,5979.0,10/1/2014,operator 1,9848.0,,30.53,8,0,35224.25,0.06,0.02,0.34,,4.61,0.73,13170.64,6895.76,30.78,,,79,39,15200000.0,1.22,650.56
508,treatment company 4,-49.3,17903,7787.0,2/1/2016,operator 4,9866.0,3806.57,6.07,8,2385,34775.43,1.2,0.07,0.35,,5.08,0.93,12073.87,6411.02,,5675.0,,86,58,,0.9,2347.62
928,treatment company 5,-32.51,19323,7165.0,1/1/2017,operator 5,11558.0,3924.36,5.26,16,4431,34553.62,1.11,0.05,0.34,,3.92,1.68,12545.16,6535.6,29.84,5199.0,5573.0,75,54,23400000.0,0.96,4594.82
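
The sampled rows above already show gaps in several columns (for example `breakdown pressure` and `water saturation`). A quick way to quantify this is a per-column missing-value summary. The sketch below uses a tiny stand-in frame and a hypothetical helper, `missing_summary`; on the real data the call would be `missing_summary(df)`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, most affected first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny stand-in frame (assumption: not the full wells dataset)
toy = pd.DataFrame({
    "azimuth": [-43.19, -70.38, np.nan, -49.3],
    "breakdown pressure": [np.nan, np.nan, np.nan, 5573.0],
    "md (ft)": [13446, 15134, 16207, 17903],
})
print(missing_summary(toy))
```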


In [5]:
df.dtypes

treatment company          object
azimuth                   float64
md (ft)                     int64
tvd (ft)                  float64
date on production         object
operator                   object
footage lateral length    float64
well spacing              float64
porpoise deviation        float64
porpoise count              int64
shale footage               int64
acoustic impedance        float64
log permeability          float64
porosity                  float64
poisson ratio             float64
water saturation          float64
toc                       float64
vcl                       float64
p-velocity                float64
s-velocity                float64
youngs modulus            float64
isip                      float64
breakdown pressure        float64
pump rate                   int64
total number of stages      int64
proppant volume           float64
proppant fluid ratio      float64
production                float64
dtype: object

In [6]:
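# Parse the production date strings (month/day/year); values that do not match the format become NaT
df["date on production"] = pd.to_datetime(df["date on production"], format="%m/%d/%Y", errors="coerce")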
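
Because `errors="coerce"` silently turns unparseable strings into `NaT`, it can be worth checking how many values were lost in the conversion. The following is a small sketch on made-up strings (the series `raw_dates` is an assumption, not data from the file); the same comparison could be applied to the original column before it is overwritten.

```python
import pandas as pd

# Made-up input: two valid month/day/year strings and one invalid value
raw_dates = pd.Series(["9/1/2016", "10/1/2014", "not a date"])
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Values that were present before parsing but are NaT afterwards failed to parse
failed = raw_dates[parsed.isna() & raw_dates.notna()]
print(f"{len(failed)} value(s) could not be parsed: {failed.tolist()}")
```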

In [7]:
# df["date on production 3"] = pd.to_datetime(df["date on production"], format="%d/%m/%Y", errors="coerce")

In [8]:
# df[["date on production", "date on production 2", "date on production 3"]].sample(n=5)

In [9]:
# profile = ProfileReport(df, title=f"Profiling Report", minimal=True, correlations={"auto": {"calculate": False}})

# profile = ProfileReport(df, title=f"Profiling Report", minimal=False)
# profile.to_file(f"profile_report.html")

In [10]:
# Derive numeric year and month features from the production date
df["date_on_production_year"] = df["date on production"].dt.year
df["date_on_production_month"] = df["date on production"].dt.month

In [11]:
df.head(3)

Unnamed: 0,treatment company,azimuth,md (ft),tvd (ft),date on production,operator,footage lateral length,well spacing,porpoise deviation,porpoise count,shale footage,acoustic impedance,log permeability,porosity,poisson ratio,water saturation,toc,vcl,p-velocity,s-velocity,youngs modulus,isip,breakdown pressure,pump rate,total number of stages,proppant volume,proppant fluid ratio,production,date_on_production_year,date_on_production_month
0,treatment company 1,-32.28,19148,6443.0,2018-03-01,operator 1,11966.0,4368.46,6.33,12,1093,30123.2,0.68,0.02,0.34,0.85,5.0,0.42,13592.23,6950.44,30.82,4149.0,,83,56,21600000.0,1.23,5614.95,2018,3
1,treatment company 2,-19.8,15150,7602.0,2014-07-01,operator 2,6890.0,4714.99,1.28,4,0,30951.61,1.85,0.17,0.19,0.69,4.22,0.74,11735.04,7162.45,29.72,5776.0,,102,33,9840000.0,1.47,2188.84,2014,7
2,treatment company 3,-26.88,14950,5907.0,2018-08-01,operator 1,8793.0,798.92,2.03,6,3254,28900.25,0.29,0.02,0.33,,4.69,0.61,13227.81,6976.93,30.99,4628.0,,88,62,17100000.0,1.67,1450.03,2018,8


In [12]:
# Given the distribution of the data, remove extreme values of the variable area
# percentil_inferior_precio = data_modelo_gdp_depurado['area'].quantile(0.05)
# percentil_superior_precio = data_modelo_gdp_depurado['area'].quantile(0.98)

# print(f'percentil_inferior_precio: {percentil_inferior_precio}')
# print(f'percentil_superior_precio: {percentil_superior_precio}')

# filtro = (data_modelo_gdp_depurado['area'] >= percentil_inferior_precio) & (data_modelo_gdp_depurado['area'] <= percentil_superior_precio)
# data_modelo_gdp_depurado = data_modelo_gdp_depurado[filtro]
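
The commented-out cell above describes percentile-based trimming of extreme values, written for a different dataset. As a hedged sketch of the same idea adapted to this notebook, the hypothetical helper `trim_by_quantiles` keeps only rows whose value lies between two quantiles. The 0.05 and 0.98 cut-offs are copied from the comment, and applying them to `production` is an assumption, not a step the notebook performs.

```python
import pandas as pd


def trim_by_quantiles(frame: pd.DataFrame, column: str,
                      lower: float = 0.05, upper: float = 0.98) -> pd.DataFrame:
    """Keep rows whose `column` value lies within the [lower, upper] quantiles."""
    low_value = frame[column].quantile(lower)
    high_value = frame[column].quantile(upper)
    # between() is inclusive on both ends; rows with NaN in `column` are dropped
    mask = frame[column].between(low_value, high_value)
    return frame[mask]


# Synthetic values with one obvious outlier; not taken from the wells dataset
toy = pd.DataFrame({"production": [100.0, 120.0, 130.0, 150.0, 160.0, 5000.0]})
trimmed = trim_by_quantiles(toy, "production")
print(len(toy), "->", len(trimmed))
```

If trimming is used, it is usually safer to compute the cut-offs on the training split only, so that the test set stays untouched.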

In [13]:
list(df)

['treatment company',
 'azimuth',
 'md (ft)',
 'tvd (ft)',
 'date on production',
 'operator',
 'footage lateral length',
 'well spacing',
 'porpoise deviation',
 'porpoise count',
 'shale footage',
 'acoustic impedance',
 'log permeability',
 'porosity',
 'poisson ratio',
 'water saturation',
 'toc',
 'vcl',
 'p-velocity',
 's-velocity',
 'youngs modulus',
 'isip',
 'breakdown pressure',
 'pump rate',
 'total number of stages',
 'proppant volume',
 'proppant fluid ratio',
 'production',
 'date_on_production_year',
 'date_on_production_month']

In [14]:
# print(jairo)

In [15]:
df.dtypes

treatment company                   object
azimuth                            float64
md (ft)                              int64
tvd (ft)                           float64
date on production          datetime64[ns]
operator                            object
footage lateral length             float64
well spacing                       float64
porpoise deviation                 float64
porpoise count                       int64
shale footage                        int64
acoustic impedance                 float64
log permeability                   float64
porosity                           float64
poisson ratio                      float64
water saturation                   float64
toc                                float64
vcl                                float64
p-velocity                         float64
s-velocity                         float64
youngs modulus                     float64
isip                               float64
breakdown pressure                 float64
pump rate  

In [16]:
# Define the training features X
data_modelo_x = df[['treatment company', 'azimuth', 'md (ft)', 'tvd (ft)', 'operator',
 'footage lateral length', 'well spacing', 'porpoise deviation', 'porpoise count', 'shale footage', 'acoustic impedance',
 'log permeability', 'porosity', 'poisson ratio', 'water saturation', 'toc', 'vcl', 'p-velocity', 's-velocity', 'youngs modulus',
 'isip', 'breakdown pressure', 'pump rate', 'total number of stages', 'proppant volume', 'proppant fluid ratio', 'date_on_production_year',
 'date_on_production_month']]

# Define the target variable y
data_modelo_y = df['production']

In [21]:
data_modelo_X = pd.get_dummies(data_modelo_x, columns=['treatment company', 'operator'])

X = data_modelo_X
y = data_modelo_y

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)  # You can adjust n_estimators as needed
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mape = mean_absolute_percentage_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("MAPE:", mape)
print("RMSE:", rmse)

MAPE: 0.3281221057000822
RMSE: 795.8395984062198
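
To make the two numbers above easier to read: scikit-learn reports MAPE as a fraction, so a value of about 0.33 corresponds to an average relative error of roughly 33%, while RMSE is expressed in the units of the target (mmcf). The sketch below recomputes both metrics by hand with NumPy on made-up values. The function names `mape_fraction` and `rmse_value` are hypothetical, and the MAPE version assumes no true value is zero.

```python
import numpy as np


def mape_fraction(y_true, y_pred) -> float:
    """Mean absolute percentage error as a fraction (0.33 means 33%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumes y_true contains no zeros
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


def rmse_value(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values: every prediction is off by 10% of the true value
actual = [1000.0, 2000.0, 4000.0]
predicted = [1100.0, 1800.0, 4400.0]
print("MAPE:", mape_fraction(actual, predicted))
print("RMSE:", rmse_value(actual, predicted))
```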


In [20]:
# One reason for choosing a random forest model is its ability
# to identify the most important variables driving the production prediction.
# The 5 most important features are shown below.

feature_importances = model.feature_importances_

# Sort the feature importances from highest to lowest
sorted_indices = np.argsort(feature_importances)[::-1]

# Feature names corresponding to the sorted importances
feature_names = X_train.columns[sorted_indices]

# Sorted feature importances
sorted_feature_importances = feature_importances[sorted_indices]

feature_importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': sorted_feature_importances})
feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)

pd.set_option('display.float_format', lambda x: '%.5f' % x)
feature_importances_df.head(5)

Unnamed: 0,Feature,Importance
0,proppant volume,0.26219
1,tvd (ft),0.08627
2,youngs modulus,0.07492
3,total number of stages,0.0716
4,md (ft),0.06922
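
One caveat when reading this ranking: `pd.get_dummies` splits `treatment company` and `operator` into many indicator columns, so their importance is spread across those columns and none of them may reach the top 5 on its own. As a sketch, the hypothetical helper `group_importances` sums the importances of the dummy columns back onto their source column. It relies on the default `<column>_<value>` naming of `pd.get_dummies`, and the scores in the example are invented.

```python
import pandas as pd


def group_importances(feature_names, importances, categorical_columns):
    """Sum importances of one-hot columns back onto their source column."""
    def source_column(name: str) -> str:
        for column in categorical_columns:
            # pd.get_dummies names dummy columns "<column>_<value>" by default
            if name.startswith(column + "_"):
                return column
        return name

    series = pd.Series(list(importances), index=list(feature_names), dtype=float)
    # groupby with a function applies it to each index label
    return series.groupby(source_column).sum().sort_values(ascending=False)


# Invented names and scores, for illustration only
names = ["proppant volume", "tvd (ft)", "operator_operator 1",
         "operator_operator 2", "treatment company_treatment company 4"]
scores = [0.5, 0.2, 0.1, 0.15, 0.05]
grouped = group_importances(names, scores, ["treatment company", "operator"])
print(grouped)
```

With the fitted model, the call would look like `group_importances(X_train.columns, model.feature_importances_, ["treatment company", "operator"])`.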
