You work for the OilyGiant mining company. Your task is to find the best place for a new well.
- Steps to choose the location:
 - Collect the oil well parameters in the selected region: oil quality and volume of reserves;
 - Build a model for predicting the volume of reserves in the new wells;
 - Pick the oil wells with the highest estimated values;
 - Pick the region with the highest total profit for the selected oil wells.
 
You have data on oil samples from three regions. Parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
geo0 = pd.read_csv('/datasets/geo_data_0.csv')

In [3]:
geo1 = pd.read_csv('/datasets/geo_data_1.csv')

In [4]:
geo2 = pd.read_csv('/datasets/geo_data_2.csv')

In [5]:
data = pd.concat([geo0, geo1, geo2])
data

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.221170,105.280062
1,2acmU,1.334711,-0.340164,4.365080,73.037750
2,409Wp,1.022732,0.151990,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647
...,...,...,...,...,...
99995,4GxBu,-1.777037,1.125220,6.263374,172.327046
99996,YKFjq,-1.261523,-0.894828,2.524545,138.748846
99997,tKPY3,-1.199934,-2.957637,5.219411,157.080080
99998,nmxp2,-2.419896,2.417221,-5.548444,51.795253


Imported existing data, concatenated data into one data frame for easier data observation. 

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 0 to 99999
Data columns (total 5 columns):
id         300000 non-null object
f0         300000 non-null float64
f1         300000 non-null float64
f2         300000 non-null float64
product    300000 non-null float64
dtypes: float64(4), object(1)
memory usage: 13.7+ MB


In [7]:
data[data.isnull().any(axis=1)]

Unnamed: 0,id,f0,f1,f2,product


In [8]:
data.duplicated().sum()

0

In [9]:
data['id'].duplicated().sum()

49

In [10]:
duplicates = data[data['id'].duplicated(keep=False)]
duplicates.sort_values('id')

Unnamed: 0,id,f0,f1,f2,product
27380,2tyMi,-1.789602,-1.359044,-4.840745,145.901447
45429,2tyMi,0.576679,-0.411140,-3.725859,69.292672
5849,5ltQ6,-3.435401,-12.296043,1.999796,57.085625
84461,5ltQ6,18.213839,2.191999,3.993869,107.813044
72896,5ssQt,-0.651825,0.782415,2.690636,120.108761
...,...,...,...,...,...
27885,wqgPo,0.052461,1.424025,0.085541,10.686576
82873,wt4Uk,10.259972,-9.376355,4.994297,134.766305
47591,wt4Uk,-9.091098,-8.109279,-0.002314,3.179103
43233,xCHr8,-0.847066,2.101796,5.597130,184.388641


100 out of the 300,000 entries have duplicate ids, and so can be discarded, since we do not know which is the truthful entry.

In [11]:
geo0['id'].duplicated(keep=False).sum()

20

In [12]:
geo0['id'].drop_duplicates(keep=False, inplace=True)
geo0['id'].duplicated(keep=False).sum()

0

In [13]:
geo1['id'].duplicated(keep=False).sum()

8

In [14]:
geo1['id'].drop_duplicates(keep=False, inplace=True)
geo1['id'].duplicated(keep=False).sum()

0

In [15]:
geo2['id'].duplicated(keep=False).sum()

8

In [16]:
geo2['id'].drop_duplicates(keep=False, inplace=True)
geo2['id'].duplicated(keep=False).sum()

0

Duplicated ids removed.

In [17]:
features0 = geo0.drop(['id', 'product'], axis=1)
target0 = geo0['product']
features1 = geo1.drop(['id', 'product'], axis=1)
target1 = geo1['product']
features2 = geo2.drop(['id', 'product'], axis=1)
target2 = geo2['product']

In [18]:
train_features0, valid_features0, train_target0, valid_target0 = train_test_split(features0, target0, test_size=.25, random_state=42)

In [19]:
train_features1, valid_features1, train_target1, valid_target1 = train_test_split(features1, target1, test_size=.25, random_state=42)

In [20]:
train_features2, valid_features2, train_target2, valid_target2 = train_test_split(features2, target2, test_size=.25, random_state=42)

Applied features to characteristics of oil wells and target as the product. Split into train and valid sets at a 75:25 ratio. Each geographical area has its own features and target.

In [21]:
from sklearn.linear_model import LinearRegression
model0 = LinearRegression()
model0.fit(train_features0, train_target0)
predict_target0 = model0.predict(valid_features0)

In [22]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(valid_target0, predict_target0)
rmse = mse**.5
rmse

37.75660035026169

In [23]:
target0.max() - target0.min()

185.3643474222929

In [24]:
predict_target0.mean()

92.39879990657768

Looking at the general range of the target (product), our RMSE is not very good, it is far too high. We would like to see an RMSE much closer to 0, indicating that there's a smaller difference between predicted values and actual values. Geo0 does not have a good model so far. Average volume of predicted reserves in Geo0 is 93 units.

In [25]:
model1 = LinearRegression()
model1.fit(train_features1, train_target1)
predict_target1 = model1.predict(valid_features1)
mse = mean_squared_error(valid_target1, predict_target1)
rmse = mse**.5
rmse

0.8902801001028828

In [26]:
predict_target1.mean()

68.71287803913764

RMSE in Geo1 is vastly more accurate than Geo0. Average predicted reserve volume for geo1 is 69 units.

In [27]:
model2 = LinearRegression()
model2.fit(train_features2, train_target2)
predict_target2 = model2.predict(valid_features2)
mse = mean_squared_error(valid_target2, predict_target2)
rmse = mse**.5
rmse

40.145872311342174

In [28]:
predict_target2.mean()

94.77102387765939

RMSE in Geo2 is also not reliable, Geo1 is the winner for the most reliable model. Average predicted reserve volume for geo2 is 95 units. Geo1 is the most reliable model, but appears to have the lowest potential reserve volume. 

In [29]:
budget = 100000000
wells = 200
revenue = 4500

Set key values to start determining minimum volume of a well that breaks even.

In [30]:
minvol = budget/wells/revenue
minvol

111.11111111111111

The minimum reserve volume per well needed to break even is close to 112 units, which is higher than any of our existing predicted means. Geo2 comes close, with 94 units.

In [31]:
predict_target0 = pd.Series(predict_target0, index=valid_target0.index)
valid_target0[predict_target0.sort_values(ascending=False).head(200).index].sum()

29686.9802543604

In [32]:
predict_target1 = pd.Series(predict_target1, index=valid_target1.index)
valid_target1[predict_target1.sort_values(ascending=False).head(200).index].sum()

27589.081548181137

In [33]:
predict_target2 = pd.Series(predict_target2, index=valid_target2.index)
valid_target2[predict_target2.sort_values(ascending=False).head(200).index].sum()

27996.82613194247

In [34]:
def profit(target_valid, predicted_valid, sample_size):
    sample = predicted_valid.sort_values(ascending=False).head(sample_size).index
    income = target_valid[sample].sum()*revenue
    return income - budget

In [35]:
profit(valid_target0, predict_target0, 200)

33591411.14462179

In [36]:
profit(valid_target1, predict_target1, 200)

24150866.966815114

In [37]:
profit(valid_target2, predict_target2, 200)

25985717.59374112

When we look at the top 200 predicted performing wells in each region, and their profit after the given budget, it is evident that Geo0 has the most profitable region to invest in. The estimated profit based off the predicted best producing wells is $33,591,411 in Geo0.

In [38]:
import numpy as np
state = np.random.RandomState(12345)
values = []
for i in range(1000):
    sample = valid_target0.sample(n=500, replace=True, random_state=state)
    subsample = predict_target0[sample.index]
    values.append(profit(sample, subsample, 200))
values = pd.Series(values)
lower = values.quantile(.025)
upper = values.quantile(.975)
mean = values.mean()
lower, upper, mean

(-208849.82296241407, 12397305.759180633, 6150470.043942593)

In [39]:
loss = (((values < 0).sum())/1000) *100
loss

3.0

In [40]:
values = []
for i in range(1000):
    sample = valid_target1.sample(n=500, replace=True, random_state=state)
    subsample = predict_target1[sample.index]
    values.append(profit(sample, subsample, 200))
values = pd.Series(values)
lower = values.quantile(.025)
upper = values.quantile(.975)
mean = values.mean()
lower, upper, mean

(1437975.893868118, 12029099.133309357, 6470976.778904684)

In [41]:
loss = (((values < 0).sum())/1000) *100
loss

0.3

In [42]:
values = []
for i in range(1000):
    sample = valid_target2.sample(n=500, replace=True, random_state=state)
    subsample = predict_target2[sample.index]
    values.append(profit(sample, subsample, 200))
values = pd.Series(values)
lower = values.quantile(.025)
upper = values.quantile(.975)
mean = values.mean()
lower, upper, mean

(-180544.37179017137, 12899547.191920979, 6011182.796492722)

In [43]:
loss = (((values < 0).sum())/1000) *100
loss

3.1

After running a bootstrap on each geographical region, I would recommend to develop further in Geo1. Not only does it have the most accurate model for predictions, but when we look at running simulations, Geo1 has the least chance of risking a loss, at .3%. The other areas have slim chances of loss, but they are closer to 3%, 10 times higher than the risk from Geo1. Geo1 is the most reliable for many reasons, and would provide the best opportunities for the company.