# Project Description

You work for the OilyGiant mining company. Your task is to find the best place for a new well.

Steps to choose the location:

Collect the oil well parameters in the selected region: oil quality and volume of reserves;

Build a model for predicting the volume of reserves in the new wells;

Pick the oil wells with the highest estimated values;

Pick the region with the highest total profit for the selected oil wells.

You have data on oil samples from three regions. Parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.

# Initialization

In [1]:
import pandas as pd
import numpy as np
from numpy.random import RandomState
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")


# Load data

In [2]:
# Load the data files
try:
    data_0 = pd.read_csv('/datasets/geo_data_0.csv')
except FileNotFoundError:
    data_0 = pd.read_csv('geo_data_0.csv')
try:
    data_1 = pd.read_csv('/datasets/geo_data_1.csv')
except FileNotFoundError:
    data_1 = pd.read_csv('geo_data_1.csv')
try:
    data_2 = pd.read_csv('/datasets/geo_data_2.csv')
except FileNotFoundError:
    data_2 = pd.read_csv('geo_data_2.csv')
    

# Explore the data

In [3]:
# Print general information about the 'data_0' dataframe
data_0.info()
data_0.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.500419,0.250143,2.502647,92.5
std,0.871832,0.504433,3.248248,44.288691
min,-1.408605,-0.848218,-12.088328,0.0
25%,-0.07258,-0.200881,0.287748,56.497507
50%,0.50236,0.250252,2.515969,91.849972
75%,1.073581,0.700646,4.715088,128.564089
max,2.362331,1.343769,16.00379,185.364347


In [4]:
# Print a sample of the data for 'data_0'
data_0


Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.221170,105.280062
1,2acmU,1.334711,-0.340164,4.365080,73.037750
2,409Wp,1.022732,0.151990,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647
...,...,...,...,...,...
99995,DLsed,0.971957,0.370953,6.075346,110.744026
99996,QKivN,1.392429,-0.382606,1.273912,122.346843
99997,3rnvd,1.029585,0.018787,-1.348308,64.375443
99998,7kl59,0.998163,-0.528582,1.583869,74.040764


In [5]:
# Print general information about the 'data_1' dataframe
data_1.info()
data_1.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,1.141296,-4.796579,2.494541,68.825
std,8.965932,5.119872,1.703572,45.944423
min,-31.609576,-26.358598,-0.018144,0.0
25%,-6.298551,-8.267985,1.000021,26.953261
50%,1.153055,-4.813172,2.011479,57.085625
75%,8.621015,-1.332816,3.999904,107.813044
max,29.421755,18.734063,5.019721,137.945408


In [6]:
# Print a sample of the data for 'data_1'
data_1


Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276000,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.001160,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305
...,...,...,...,...,...
99995,QywKC,9.535637,-6.878139,1.998296,53.906522
99996,ptvty,-10.160631,-12.558096,5.005581,137.945408
99997,09gWa,-7.378891,-3.084104,4.998651,137.945408
99998,rqwUm,0.665714,-6.152593,1.000146,30.132364


In [7]:
# Print general information about the 'data_2' dataframe
data_2.info()
data_2.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.002023,-0.002081,2.495128,95.0
std,1.732045,1.730417,3.473445,44.749921
min,-8.760004,-7.08402,-11.970335,0.0
25%,-1.162288,-1.17482,0.130359,59.450441
50%,0.009424,-0.009482,2.484236,94.925613
75%,1.158535,1.163678,4.858794,130.595027
max,7.238262,7.844801,16.739402,190.029838


In [8]:
# Print a sample of the data for 'data_2'
data_2


Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.871910
3,q6cA6,2.236060,-0.553760,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746
...,...,...,...,...,...
99995,4GxBu,-1.777037,1.125220,6.263374,172.327046
99996,YKFjq,-1.261523,-0.894828,2.524545,138.748846
99997,tKPY3,-1.199934,-2.957637,5.219411,157.080080
99998,nmxp2,-2.419896,2.417221,-5.548444,51.795253


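Based on this exploration, the only column that needs attention is 'id': it is a well identifier with no predictive relationship to the target, so it is dropped from all three dataframes below. No further cleaning is needed: there are no missing values, all remaining columns are numeric (float64), and no other issues are apparent. The features are on different scales across regions (for example, f0 has a standard deviation of about 0.87 in data_0 and about 8.97 in data_1), but an unregularized linear regression does not require scaled features, so they are left as is.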
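The checks behind this conclusion can be sketched as follows. This is a minimal illustration on a small made-up frame: the `toy` name and all of its values are assumptions for demonstration, not the project data.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the project data
toy = pd.DataFrame({
    "id": ["a1", "b2", "b2", "c3"],
    "f0": [0.7, 1.3, 1.3, -0.03],
    "f1": [-0.5, -0.3, -0.3, 0.14],
    "f2": [1.2, 4.4, 4.4, 3.0],
    "product": [105.3, 73.0, 73.0, 168.6],
})

missing_per_column = toy.isna().sum()               # NaN count in each column
duplicate_rows = int(toy.duplicated().sum())        # fully identical rows
duplicate_ids = int(toy["id"].duplicated().sum())   # repeated well identifiers

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}, duplicate ids: {duplicate_ids}")
```

Running the same three checks on `data_0`, `data_1`, and `data_2` before dropping 'id' would confirm whether any wells appear more than once.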

# Clean the data

In [9]:
# Drop the 'id' feature which has no relation to the target from data_0
data_0 = data_0.drop(['id'], axis = 1)
data_0


Unnamed: 0,f0,f1,f2,product
0,0.705745,-0.497823,1.221170,105.280062
1,1.334711,-0.340164,4.365080,73.037750
2,1.022732,0.151990,1.419926,85.265647
3,-0.032172,0.139033,2.978566,168.620776
4,1.988431,0.155413,4.751769,154.036647
...,...,...,...,...
99995,0.971957,0.370953,6.075346,110.744026
99996,1.392429,-0.382606,1.273912,122.346843
99997,1.029585,0.018787,-1.348308,64.375443
99998,0.998163,-0.528582,1.583869,74.040764


In [10]:
# Drop the 'id' feature which has no relation to the target from data_1
data_1 = data_1.drop(['id'], axis = 1)
data_1


Unnamed: 0,f0,f1,f2,product
0,-15.001348,-8.276000,-0.005876,3.179103
1,14.272088,-3.475083,0.999183,26.953261
2,6.263187,-5.948386,5.001160,134.766305
3,-13.081196,-11.506057,4.999415,137.945408
4,12.702195,-8.147433,5.004363,134.766305
...,...,...,...,...
99995,9.535637,-6.878139,1.998296,53.906522
99996,-10.160631,-12.558096,5.005581,137.945408
99997,-7.378891,-3.084104,4.998651,137.945408
99998,0.665714,-6.152593,1.000146,30.132364


In [11]:
# Drop the 'id' feature which has no relation to the target from data_2
data_2 = data_2.drop(['id'], axis = 1)
data_2


Unnamed: 0,f0,f1,f2,product
0,-1.146987,0.963328,-0.828965,27.758673
1,0.262778,0.269839,-2.530187,56.069697
2,0.194587,0.289035,-5.586433,62.871910
3,2.236060,-0.553760,0.930038,114.572842
4,-0.515993,1.716266,5.899011,149.600746
...,...,...,...,...
99995,-1.777037,1.125220,6.263374,172.327046
99996,-1.261523,-0.894828,2.524545,138.748846
99997,-1.199934,-2.957637,5.219411,157.080080
99998,-2.419896,2.417221,-5.548444,51.795253


# Train and test the model for each region

In [12]:
# Split data_0 into a training set and validation set at a ratio of 75:25
features = data_0.drop(['product'], axis=1) # The features consist of all the columns except 'product'
target = data_0['product'] # The target is the 'product' column

# Use train_test_split to create the training set (75% of the data) and the validation set (25% of the data)
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(features,
                                                                          target,
                                                                          train_size = 0.75,
                                                                          random_state = 12345)


In [13]:
# Train the model and make predictions for the validation set
model_0 = LinearRegression()
model_0.fit(features_train_0, target_train_0)


LinearRegression()

In [14]:
# Save the predictions and correct answers for the validation set for Region 0
predictions_valid_0 = model_0.predict(features_valid_0)
pred_real_0 = pd.merge(pd.Series(predictions_valid_0, name = 'predictions'), target_valid_0.reset_index(drop=True), right_index=True, left_index=True).reset_index(drop=True)
pred_real_0.head()


Unnamed: 0,predictions,product
0,95.894952,10.038645
1,77.572583,114.551489
2,77.89264,132.603635
3,90.175134,169.072125
4,70.510088,122.32518


In [15]:
# Print the average volume of predicted reserves and model RMSE
print(f'Average volume of predicted reserves in Region 0: {predictions_valid_0.mean()}')
result_0 = (mean_squared_error(target_valid_0, predictions_valid_0)) ** 0.5
print("RMSE of the linear regression model on the validation set:", result_0)
print(f'Scaled RMSE: {result_0/predictions_valid_0.mean()}')


Average volume of predicted reserves in Region 0: 92.59256778438035
RMSE of the linear regression model on the validation set: 37.5794217150813
Scaled RMSE: 0.4058578632638446


In [16]:
# Split data_1 into a training set and validation set at a ratio of 75:25
features = data_1.drop(['product'], axis=1) # The features consist of all the columns except 'product'
target = data_1['product'] # The target is the 'product' column

# Use train_test_split to create the training set (75% of the data) and the validation set (25% of the data)
features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(features,
                                                                          target,
                                                                          train_size = 0.75,
                                                                          random_state = 12345)


In [17]:
# Train the model and make predictions for the validation set
model_1 = LinearRegression()
model_1.fit(features_train_1, target_train_1)


LinearRegression()

In [18]:
# Save the predictions and correct answers for the validation set for Region 1
predictions_valid_1 = model_1.predict(features_valid_1)
pred_real_1 = pd.merge(pd.Series(predictions_valid_1, name = 'predictions'), target_valid_1.reset_index(drop=True), right_index=True, left_index=True).reset_index(drop=True)
pred_real_1.head()


Unnamed: 0,predictions,product
0,82.663314,80.859783
1,54.431786,53.906522
2,29.74876,30.132364
3,53.552133,53.906522
4,1.243856,0.0


In [19]:
# Print the average volume of predicted reserves and model RMSE
print(f'Average volume of predicted reserves in Region 1: {predictions_valid_1.mean()}')
result_1 = (mean_squared_error(target_valid_1, predictions_valid_1)) ** 0.5
print("RMSE of the linear regression model on the validation set:", result_1)
print(f'Scaled RMSE: {result_1/predictions_valid_1.mean()}')


Average volume of predicted reserves in Region 1: 68.728546895446
RMSE of the linear regression model on the validation set: 0.893099286775617
Scaled RMSE: 0.012994589979244773


In [20]:
# Split data_2 into a training set and validation set at a ratio of 75:25
features = data_2.drop(['product'], axis=1) # The features consist of all the columns except 'product'
target = data_2['product'] # The target is the 'product' column

# Use train_test_split to create the training set (75% of the data) and the validation set (25% of the data)
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(features,
                                                                          target,
                                                                          train_size = 0.75,
                                                                          random_state = 12345)


In [21]:
# Train the model and make predictions for the validation set
model_2 = LinearRegression()
model_2.fit(features_train_2, target_train_2)


LinearRegression()

In [22]:
# Save the predictions and correct answers for the validation set for Region 2
predictions_valid_2 = model_2.predict(features_valid_2)
pred_real_2 = pd.merge(pd.Series(predictions_valid_2, name = 'predictions'), target_valid_2.reset_index(drop=True), right_index=True, left_index=True).reset_index(drop=True)
pred_real_2.head()


Unnamed: 0,predictions,product
0,93.599633,61.212375
1,75.105159,41.850118
2,90.066809,57.776581
3,105.162375,100.053761
4,115.30331,109.897122


In [23]:
# Print the average volume of predicted reserves and model RMSE
print(f'Average volume of predicted reserves in Region 2: {predictions_valid_2.mean()}')
result_2 = (mean_squared_error(target_valid_2, predictions_valid_2)) ** 0.5
print("RMSE of the linear regression model on the validation set:", result_2)
print(f'Scaled RMSE: {result_2/predictions_valid_2.mean()}')


Average volume of predicted reserves in Region 2: 94.96504596800489
RMSE of the linear regression model on the validation set: 40.02970873393434
Scaled RMSE: 0.42152044813858075


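Based on the above analysis, Region 2 has the highest average volume of predicted reserves (95.0), followed by Region 0 (92.6) and Region 1 (68.7). However, the RMSE for Region 1 (0.89) is far lower than for Region 0 (37.6) and Region 2 (40.0), both in absolute terms and relative to the mean prediction (about 1% versus 41% and 42%). The linear model therefore fits Region 1 very closely and fits Regions 0 and 2 poorly. For context, the standard deviation of 'product' in the full Region 0 and Region 2 data is about 44.3 and 44.7, so those two models improve only modestly on always predicting the mean.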
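The split, fit, and scoring steps repeated for each region could be wrapped in one helper that also reports a mean-prediction baseline for comparison. The sketch below is one possible shape, not the code used above: `fit_and_score`, its parameters, and the synthetic `demo` frame are all hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_and_score(data, target_col="product", seed=12345):
    """Split 75:25, fit a linear regression, and return validation metrics."""
    features = data.drop(columns=[target_col])
    target = data[target_col]
    f_train, f_valid, t_train, t_valid = train_test_split(
        features, target, train_size=0.75, random_state=seed
    )
    model = LinearRegression().fit(f_train, t_train)
    predictions = model.predict(f_valid)
    rmse = mean_squared_error(t_valid, predictions) ** 0.5
    # Baseline: always predict the training-set mean of the target
    baseline = np.full(len(t_valid), t_train.mean())
    baseline_rmse = mean_squared_error(t_valid, baseline) ** 0.5
    return {
        "mean_prediction": predictions.mean(),
        "rmse": rmse,
        "baseline_rmse": baseline_rmse,
    }


# Synthetic stand-in for one region (assumed data, not the project files)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
demo["product"] = 50 + 10 * demo["f2"] + rng.normal(scale=2.0, size=1000)

scores = fit_and_score(demo)
print(scores)
```

An RMSE close to the baseline RMSE means the features explain little of the variation in the target; an RMSE far below it means the linear fit is informative.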

# Prepare for profit calculation

In [24]:
# Store all key values for calculations in separate variables
UNIT_DEVELOPMENT_BUDGET = 100000000/200
UNIT_REVENUE = 4500


In [25]:
# Calculate the volume of reserves sufficient for developing a new well without losses.
# Compare the obtained value with the average volume of reserves in each region
SUFFICIENT_VOLUME = UNIT_DEVELOPMENT_BUDGET / UNIT_REVENUE
print(f'Volume of reserves sufficient for developing a new well without losses: {SUFFICIENT_VOLUME}')


Volume of reserves sufficient for developing a new well without losses: 111.11111111111111


Based on the above calculation, each well must yield at least 111.11 units of product to break even. The average predicted reserves in every region (92.6, 68.7, and 95.0 for Regions 0, 1, and 2) fall below this threshold, so developing average wells would lose money and the selection must focus on the wells with the highest predicted reserves.
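The comparison can be made explicit with a few lines of arithmetic. This sketch reuses the budget and revenue figures from the cells above and the rounded mean predictions printed earlier; the variable names `break_even`, `mean_predicted`, and `shortfall` are introduced here only for illustration.

```python
# Figures taken from the cells above
TOTAL_BUDGET = 100_000_000   # total development budget
WELLS = 200                  # number of wells the budget covers
UNIT_REVENUE = 4500          # revenue per unit of product

break_even = TOTAL_BUDGET / WELLS / UNIT_REVENUE

# Mean predicted reserves per region, rounded from the printed outputs
mean_predicted = {"Region 0": 92.59, "Region 1": 68.73, "Region 2": 94.97}

# Positive shortfall means the regional average is below break-even
shortfall = {region: break_even - mean for region, mean in mean_predicted.items()}
for region, gap in shortfall.items():
    print(f"{region}: {gap:.2f} units below break-even")
```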


# Write a function to calculate profit from a set of selected oil wells and model predictions

In [26]:
# Calculate profit for a set of wells: pick the 200 wells with the highest predictions,
# value their actual reserves, and subtract the total development budget
def calculate_profit(pred_real):
    top_wells = pred_real.nlargest(200, 'predictions')
    return top_wells['product'].sum() * UNIT_REVENUE - UNIT_DEVELOPMENT_BUDGET * 200


In [27]:
# Calculate profits for Region 0
region_0_profit = calculate_profit(pred_real_0)
region_0_profit


33208260.43139851

In [28]:
# Calculate profits for Region 1
region_1_profit = calculate_profit(pred_real_1)
region_1_profit


24150866.966815114

In [29]:
# Calculate profits for Region 2
region_2_profit = calculate_profit(pred_real_2)
region_2_profit


27103499.635998324

The calculated profits show that Region 0 is the most profitable (33,208,260 USD), ahead of Region 2 (27,103,500 USD) and Region 1 (24,150,867 USD). On this measure alone, Region 0 would be the choice for oil well development.
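These figures depend on ranking wells by predicted reserves while summing their actual reserves. A small worked example makes that distinction visible. Everything here is hypothetical: the `profit_top_k` helper, the `toy` frame, and the round numbers are chosen only so the arithmetic is easy to follow.

```python
import pandas as pd


def profit_top_k(pred_real, k, unit_revenue, budget):
    """Pick the k wells with the highest predictions and value their actual reserves."""
    chosen = pred_real.nlargest(k, "predictions")
    return chosen["product"].sum() * unit_revenue - budget


# Hypothetical wells: the best actual well (200.0) has the lowest prediction
toy = pd.DataFrame({
    "predictions": [120.0, 90.0, 150.0, 60.0],
    "product":     [100.0, 130.0, 140.0, 200.0],
})

# The two highest predictions (150 and 120) select actual reserves 140 and 100
toy_profit = profit_top_k(toy, k=2, unit_revenue=10, budget=1000)
print(toy_profit)  # (140 + 100) * 10 - 1000 = 1400.0
```

The well with the largest actual reserves is never selected because its prediction is the lowest, which is why model accuracy directly affects realized profit.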

# Calculate risks and profit for each region

In [30]:
# Use the bootstrapping technique with 1000 samples to find the distribution of profit for Region 0
state = np.random.RandomState(12345)
profits_0 = []
for i in range(1000):
    subsample_0 = pred_real_0.sample(500, replace = True, random_state = state)
    profits_0.append(calculate_profit(subsample_0))

profits_0 = pd.Series(profits_0)
profits_0


0      6.054641e+06
1      5.363934e+06
2      2.937858e+06
3      1.789934e+06
4      2.719929e+06
           ...     
995    5.253551e+06
996    7.790094e+06
997    6.494122e+06
998    3.149995e+06
999    2.197184e+06
Length: 1000, dtype: float64

In [31]:
# Use the bootstrapping technique with 1000 samples to find the distribution of profit for Region 1
state = np.random.RandomState(12345)
profits_1 = []
for i in range(1000):
    subsample_1 = pred_real_1.sample(500, replace = True, random_state = state)
    profits_1.append(calculate_profit(subsample_1))

profits_1 = pd.Series(profits_1)
profits_1


0      2.280162e+06
1      3.343157e+06
2      2.537047e+06
3      6.139661e+06
4      3.571430e+06
           ...     
995    6.831945e+06
996    6.468698e+06
997    2.386523e+06
998    4.142425e+06
999    1.245778e+06
Length: 1000, dtype: float64

In [32]:
# Use the bootstrapping technique with 1000 samples to find the distribution of profit for Region 2
state = np.random.RandomState(12345)
profits_2 = []
for i in range(1000):
    subsample_2 = pred_real_2.sample(500, replace = True, random_state = state)
    profits_2.append(calculate_profit(subsample_2))

profits_2 = pd.Series(profits_2)
profits_2


0     -7.189923e+05
1      6.459964e+06
2      6.261756e+06
3      4.123517e+06
4     -5.596049e+05
           ...     
995    5.668660e+06
996   -5.850207e+05
997    5.902561e+06
998    4.977628e+06
999    2.009241e+06
Length: 1000, dtype: float64

In [33]:
# Find average profit, 95% confidence interval and risk of losses for Region 0
# Loss is negative profit, calculate it as a probability and then express as a percentage.
print(f'Average profit: {profits_0.mean()}')
      
lower = profits_0.quantile(0.025)
upper = profits_0.quantile(0.975)
print(f'95% confidence interval: [{lower}, {upper}]')

print(f'Risk of losses: {(profits_0 < 0).mean() * 100}%')


Average profit: 3961649.8480237117
95% confidence interval: [-1112155.4589049604, 9097669.41553423]
Risk of losses: 6.9%


In [34]:
# Find average profit, 95% confidence interval and risk of losses for Region 1
# Loss is negative profit, calculate it as a probability and then express as a percentage.
print(f'Average profit: {profits_1.mean()}')
      
lower = profits_1.quantile(0.025)
upper = profits_1.quantile(0.975)
print(f'95% confidence interval: [{lower}, {upper}]')

print(f'Risk of losses: {(profits_1 < 0).mean() * 100}%')


Average profit: 4560451.057866608
95% confidence interval: [338205.0939898458, 8522894.538660347]
Risk of losses: 1.5%


In [35]:
# Find average profit, 95% confidence interval and risk of losses for Region 2
# Loss is negative profit, calculate it as a probability and then express as a percentage.
print(f'Average profit: {profits_2.mean()}')
      
lower = profits_2.quantile(0.025)
upper = profits_2.quantile(0.975)
print(f'95% confidence interval: [{lower}, {upper}]')

print(f'Risk of losses: {(profits_2 < 0).mean() * 100}%')


Average profit: 4044038.665683568
95% confidence interval: [-1633504.1339559986, 9503595.749237997]
Risk of losses: 7.6%


Based on the average profits, confidence intervals, risk of losses, and RMSE, I recommend **Region 1** overall. It has the highest average profit (about 4.56 million USD), the only 95% confidence interval that lies entirely above zero, a risk of losses (1.5%) that is still below 2.5%, and the lowest RMSE for the linear regression model. Regions 0 and 2 carry risks of losses of 6.9% and 7.6%, respectively.
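The three bootstrap loops and summary cells follow the same pattern, so they could be folded into a single function. The sketch below is one way to do that under stated assumptions: `bootstrap_profit` and its parameter names are hypothetical, and the `synthetic` frame is generated data standing in for a region's validation results, so its numbers say nothing about the real regions.

```python
import numpy as np
import pandas as pd


def bootstrap_profit(pred_real, n_iter=1000, sample_size=500, top_k=200,
                     unit_revenue=4500, budget=100_000_000, seed=12345):
    """Return mean profit, a 95% percentile interval, and the risk of losses (%)."""
    state = np.random.RandomState(seed)
    profits = []
    for _ in range(n_iter):
        # Draw wells with replacement, then develop the top_k by prediction
        sample = pred_real.sample(sample_size, replace=True, random_state=state)
        chosen = sample.nlargest(top_k, "predictions")
        profits.append(chosen["product"].sum() * unit_revenue - budget)
    profits = pd.Series(profits)
    return {
        "mean": profits.mean(),
        "lower": profits.quantile(0.025),
        "upper": profits.quantile(0.975),
        "risk_pct": (profits < 0).mean() * 100,
    }


# Synthetic stand-in for one region's predictions and actual values
rng = np.random.default_rng(0)
actual = rng.uniform(0, 180, size=5000)
synthetic = pd.DataFrame({
    "predictions": actual + rng.normal(scale=20, size=5000),
    "product": actual,
})

summary = bootstrap_profit(synthetic, n_iter=200)
print(summary)
```

Calling the function once per region with the same seed would reproduce the comparison above while keeping the sampling and scoring logic in one place.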