**Project description:**

We work for the OilyGiant mining company. Our task is to find the best place for a new well.

Steps to choose the location:

1. Collect the oil well parameters in the selected region: oil quality and volume of reserves.

2. Build a model for predicting the volume of reserves in the new wells.

3. Pick the oil wells with the highest estimated values.

4. Pick the region with the highest total profit for the selected oil wells.

**1.Preparation of the data.**

In [1]:
#We will import the libraries we need.
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from joblib import dump
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from scipy import stats as st

In [2]:
#Let's load our data sets.
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

In [3]:
#We will check if our data set looks correct.
data_0.head()

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647


In [4]:
data_1.head()

Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305


In [5]:
data_2.head()

Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191
3,q6cA6,2.23606,-0.55376,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746


In [6]:
#We will check if we have duplicates in the data.
data_0.duplicated().sum()

0

In [7]:
data_1.duplicated().sum()

0

In [8]:
data_2.duplicated().sum()

0

In [9]:
#We will check if we have missing values in the data.
pd.isnull(data_0).sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

In [10]:
pd.isnull(data_1).sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

In [11]:
pd.isnull(data_2).sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

Steps taken up to this point:

1. Imported the libraries and loaded the data sets.

2. Checked the columns of each data set: they look correct.

3. Searched for duplicate rows: none found.

4. Searched for missing values: none found.

We can proceed to the next step.
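
The same checks can be run on every region with one helper. This is a minimal sketch, not part of the original notebook: `quality_report` is a hypothetical helper, and the small frame is made-up demonstration data rather than the project's files. It also counts repeated `id` values, which `duplicated()` on whole rows does not catch.

```python
import pandas as pd

def quality_report(df, key="id"):
    """Count fully duplicated rows, repeated keys and missing values."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "repeated_keys": int(df[key].duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
    }

# Made-up frame for illustration only (not from the project data).
demo = pd.DataFrame({
    "id": ["a1", "b2", "a1"],
    "f0": [0.1, 0.2, 0.3],
    "product": [10.0, None, 30.0],
})

report = quality_report(demo)
print(report)
```

In the notebook this could be called as `quality_report(data_0)`, `quality_report(data_1)` and `quality_report(data_2)`.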

**2.Train and test the model for each region.**

In [12]:
#2.1. Split the data into a training set and validation set at a ratio of 75:25.
data_0_train, data_0_valid = train_test_split(data_0, test_size=0.25, random_state=12345)
data_1_train, data_1_valid = train_test_split(data_1, test_size=0.25, random_state=12345)
data_2_train, data_2_valid = train_test_split(data_2, test_size=0.25, random_state=12345)

In [13]:
#Defining features and targets.
features_train_0 = data_0_train.drop(['product', 'id'], axis=1)
target_train_0 = data_0_train['product']
features_valid_0 = data_0_valid.drop(['product', 'id'], axis=1)
target_valid_0 = data_0_valid['product']

features_train_1 = data_1_train.drop(['product', 'id'], axis=1)
target_train_1 = data_1_train['product']
features_valid_1 = data_1_valid.drop(['product', 'id'], axis=1)
target_valid_1 = data_1_valid['product']

features_train_2 = data_2_train.drop(['product', 'id'], axis=1)
target_train_2 = data_2_train['product']
features_valid_2 = data_2_valid.drop(['product', 'id'], axis=1)
target_valid_2 = data_2_valid['product']

In [14]:
#Let's standardize the features, fitting a separate scaler on each region's training set.
numeric = ['f0','f1','f2']

scaler_0 = StandardScaler()
scaler_0.fit(features_train_0[numeric])
features_train_0[numeric] = scaler_0.transform(features_train_0[numeric])
features_valid_0[numeric] = scaler_0.transform(features_valid_0[numeric])

scaler_1 = StandardScaler()
scaler_1.fit(features_train_1[numeric])
features_train_1[numeric] = scaler_1.transform(features_train_1[numeric])
features_valid_1[numeric] = scaler_1.transform(features_valid_1[numeric])

scaler_2 = StandardScaler()
scaler_2.fit(features_train_2[numeric])
features_train_2[numeric] = scaler_2.transform(features_train_2[numeric])
features_valid_2[numeric] = scaler_2.transform(features_valid_2[numeric])
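
A side note on the scaling step: for ordinary least squares with an intercept, standardizing the features does not change the predictions, so it does not affect the RMSE values reported for each region. A minimal sketch on made-up data (not the project's files); all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data: three features with a non-zero mean and a linear target.
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_train, X_valid = X[:150], X[150:]
y_train = y[:150]

# Model trained on the raw features.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_valid)

# Model trained on standardized features (scaler fit on the training part only).
scaler = StandardScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)
scaled_pred = scaled_model.predict(scaler.transform(X_valid))

# The two sets of predictions agree up to floating-point error.
same = np.allclose(raw_pred, scaled_pred)
print(same)
```

Scaling still makes the coefficients comparable across features, and it matters for regularized models, which are not used here.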

In [15]:
#2.2. Train the model and make predictions for the validation set.
#Region 0:
model = LinearRegression()
model.fit(features_train_0, target_train_0) # train model on training set
predictions_valid_0 = model.predict(features_valid_0) # get model predictions on validation set

result = mean_squared_error(target_valid_0, predictions_valid_0)**0.5
print("Average volume of actual reserves in the validation set, region 0:", target_valid_0.mean())
print("RMSE of the linear regression model on the validation set region 0:", result)

Average volume of actual reserves in the validation set, region 0: 92.07859674082927
RMSE of the linear regression model on the validation set region 0: 37.5794217150813


In [16]:
#Region 1:
model = LinearRegression()
model.fit(features_train_1, target_train_1) # train model on training set
predictions_valid_1 = model.predict(features_valid_1) # get model predictions on validation set

result = mean_squared_error(target_valid_1, predictions_valid_1)**0.5
print("Average volume of actual reserves in the validation set, region 1:", target_valid_1.mean())
print("RMSE of the linear regression model on the validation set region 1:", result)

Average volume of actual reserves in the validation set, region 1: 68.72313602435997
RMSE of the linear regression model on the validation set region 1: 0.893099286775617


In [17]:
#Region 2:
model = LinearRegression()
model.fit(features_train_2, target_train_2) # train model on training set
predictions_valid_2 = model.predict(features_valid_2) # get model predictions on validation set

result = mean_squared_error(target_valid_2, predictions_valid_2)**0.5
print("Average volume of actual reserves in the validation set, region 2:", target_valid_2.mean())
print("RMSE of the linear regression model on the validation set region 2:", result)

Average volume of actual reserves in the validation set, region 2: 94.88423280885438
RMSE of the linear regression model on the validation set region 2: 40.02970873393434
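
The three cells above repeat the same steps, so they could be folded into one function. The sketch below is one possible way to do that, not the notebook's own code: `train_and_evaluate` is a hypothetical helper, and the demonstration frame is synthetic. It returns the predictions with the same index as the validation targets, and the mean of those predictions gives the average predicted volume.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def train_and_evaluate(data, random_state=12345):
    """Split 75:25, fit a linear regression, return predictions, targets and RMSE."""
    features = data.drop(["product", "id"], axis=1)
    target = data["product"]
    f_train, f_valid, t_train, t_valid = train_test_split(
        features, target, test_size=0.25, random_state=random_state
    )
    model = LinearRegression().fit(f_train, t_train)
    # Keep the validation index so predictions and targets stay aligned.
    predictions = pd.Series(model.predict(f_valid), index=t_valid.index)
    rmse = mean_squared_error(t_valid, predictions) ** 0.5
    return predictions, t_valid, rmse

# Synthetic frame with the same column layout (illustration only).
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=["f0", "f1", "f2"])
demo["id"] = [f"w{i}" for i in range(400)]
demo["product"] = 50 + 10 * demo["f2"] + rng.normal(scale=1.0, size=400)

predictions, target_valid, rmse = train_and_evaluate(demo)
print("Average predicted volume:", predictions.mean())
print("Average actual volume:", target_valid.mean())
print("RMSE:", rmse)
```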


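**Analysis of results:**

The averages below are taken over the actual reserves in the validation set (`target_valid.mean()` for each region).

*Region 0:*

Average volume of reserves - 92.1

RMSE - 37.6

*Region 1:*

Average volume of reserves - 68.7

RMSE - 0.89

*Region 2:*

Average volume of reserves - 94.9

RMSE - 40.0

The RMSE of region 1 (0.89) is far lower than that of regions 0 and 2 (37.6 and 40.0), so the model for region 1 predicts reserves much more accurately. In regions 0 and 2 the error is roughly 40% of the average volume, so predictions for individual wells there are much less reliable.
A low RMSE describes the quality of the predictions, not the size of the reserves: region 1 also has the lowest average volume. Its predictions are nevertheless the most trustworthy basis for the profit estimates that follow.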

**3.Prepare for profit calculation.**

In [18]:
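#Store all key values for calculations in separate variables.
TOTAL_BUDGET = 100000000 #total budget for developing the wells
TOTAL_WELLS = 200 #number of wells selected for development
BUDGET_PER_WELLS = TOTAL_BUDGET/TOTAL_WELLS #budget per well
ONE_BARREL_REV = 4.5 #revenue from one barrel
ONE_UNITE_REV = 4500 #revenue from one unit of product (1000 barrels at 4.5 each)
VOL_REV_FOR_DEV = BUDGET_PER_WELLS / ONE_UNITE_REV #break-even volume per well, in units of product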

In [19]:
#3.2. Calculate the volume of reserves sufficient for developing a new well without losses.
print("Budget for the development of one well:", BUDGET_PER_WELLS)
print("Revenue of one reserve unit:",ONE_UNITE_REV)
print("The volume of reserves sufficient for developing a new well without losses:",VOL_REV_FOR_DEV)

Budget for the development of one well: 500000.0
Revenue of one reserve unit: 4500
The volume of reserves sufficient for developing a new well without losses: 111.11111111111111


In [20]:
#Compare the obtained value with the average volume of reserves in each region.
print("Average volume of reserves region 0:", data_0['product'].mean())
print("Average volume of reserves region 1:", data_1['product'].mean())
print("Average volume of reserves region 2:", data_2['product'].mean())
print("The volume of reserves sufficient for developing a new well without losses:",VOL_REV_FOR_DEV)

Average volume of reserves region 0: 92.50000000000001
Average volume of reserves region 1: 68.82500000000002
Average volume of reserves region 2: 95.00000000000004
The volume of reserves sufficient for developing a new well without losses: 111.11111111111111


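On average, no region reaches the break-even volume of 111.1 units per well: the regional means are 92.5, 68.8 and 95.0. Developing wells chosen at random would therefore lose money, so the profit has to come from selecting the wells with the highest predicted reserves.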
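
The break-even arithmetic can be written out as a small function. This is an illustrative sketch: `break_even_volume` is a hypothetical helper, the inputs are the budget, well count and per-unit revenue from the cells above, and the regional means are the values printed in the previous output.

```python
def break_even_volume(total_budget, wells, revenue_per_unit):
    """Volume one well must hold so that its revenue covers its share of the budget."""
    budget_per_well = total_budget / wells
    return budget_per_well / revenue_per_unit

threshold = break_even_volume(100_000_000, 200, 4_500)

# Mean reserves per region, copied from the output above.
region_means = {0: 92.5, 1: 68.825, 2: 95.0}

# How far the average well in each region falls short of break-even.
shortfall = {region: threshold - mean for region, mean in region_means.items()}

print(round(threshold, 2))  # 111.11
for region, gap in shortfall.items():
    print("Region", region, "shortfall:", round(gap, 2))
```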

**4.Write a function to calculate profit from a set of selected oil wells and model predictions.
Calculate risks and profit for each region.**

In [21]:
#Reset the index of the targets so that they line up with the predictions, which get a fresh 0..n-1 index.
target_valid_0.reset_index(drop=True, inplace=True)
target_valid_1.reset_index(drop=True, inplace=True)
target_valid_2.reset_index(drop=True, inplace=True)

In [22]:
predictions_valid_0 = pd.Series(predictions_valid_0)
predictions_valid_1 = pd.Series(predictions_valid_1)
predictions_valid_2 = pd.Series(predictions_valid_2)
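
Why the index reset matters: pandas aligns two Series by index label, not by position. After `train_test_split` the targets keep their original row labels, while `pd.Series(predictions)` starts again from 0. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Targets keep the shuffled labels from the original frame (made-up values).
target = pd.Series([10.0, 20.0, 30.0], index=[7, 9, 5])
# Predictions wrapped in a new Series get labels 0, 1, 2.
predictions = pd.Series([11.0, 19.0, 31.0])

# Label alignment finds no common labels, so every difference is NaN.
misaligned = target - predictions

# After resetting the index both Series share labels 0, 1, 2.
aligned = target.reset_index(drop=True) - predictions

print(misaligned.isna().all())
print(aligned.tolist())
```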

In [34]:
#5.1. Use the bootstrapping technique with 1000 samples to find the distribution of profit.

def boot_strap(target, predictions):
    def profit(target, probabilities):
        #Rank the wells by predicted volume and keep the 200 best.
        probs_sorted = probabilities.sort_values(ascending=False)
        selected = target[probs_sorted.index][:200]
        #Profit = revenue from the actual reserves minus the per-well budget.
        well_boot = (selected*ONE_UNITE_REV) - BUDGET_PER_WELLS
        return well_boot.sum()

    state = np.random.RandomState(12345)
    n_tests = 1000
    values = []
    for i in range(n_tests):
        #Draw 500 wells with replacement and take the matching predictions.
        target_subsample = target.sample(n=500, replace=True, random_state=state)
        probs_subsample = predictions[target_subsample.index]
        sample = profit(target_subsample, probs_subsample)
        values.append(sample)

    #Count the bootstrap samples that end in a loss.
    loss = 0
    for val in values:
        if val < 0:
            loss += 1
    values = pd.Series(values)
    #Lower bound of the 95% interval (2.5% quantile).
    lower = values.quantile(0.025)

    mean = values.mean()
    print("Average profit:", mean)
    print("2.5% quantile:", lower)
    print('Number of losses:', loss)
    print("percent losses:", loss / n_tests * 100)

In [35]:
boot_strap(target_valid_0, predictions_valid_0)

Average profit: 4259385.269105923
2.5% quantile: -1020900.9483793728
Number of losses: 60
percent losses: 6.0


In [36]:
boot_strap(target_valid_1, predictions_valid_1)

Average profit: 5152227.734432898
2.5% quantile: 688732.2537050338
Number of losses: 10
percent losses: 1.0


In [37]:
boot_strap(target_valid_2, predictions_valid_2)

Average profit: 4350083.627827557
2.5% quantile: -1288805.4732978877
Number of losses: 64
percent losses: 6.4
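
One caveat about `boot_strap`: sampling with replacement produces repeated index labels, and a label-based lookup such as `target[probs_sorted.index]` returns every matching row for each repeated label, so the first 200 rows may count some wells more often than they were drawn. The sketch below shows one possible way around this by sampling row positions instead of labels. It is an alternative, not the notebook's method, and it would likely give somewhat different numbers from those printed above. `bootstrap_profit` and the two constants are hypothetical names (the constants mirror `ONE_UNITE_REV` and `BUDGET_PER_WELLS`), and the demonstration data is synthetic.

```python
import numpy as np
import pandas as pd

REVENUE_PER_UNIT = 4500    # assumed equal to ONE_UNITE_REV
BUDGET_PER_WELL = 500_000  # assumed equal to BUDGET_PER_WELLS

def bootstrap_profit(target, predictions, n_tests=1000, sample_size=500,
                     top_n=200, seed=12345):
    """Return the bootstrap distribution of profit as a Series."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rng = np.random.RandomState(seed)
    profits = np.empty(n_tests)
    for i in range(n_tests):
        # Sample row positions with replacement; each draw is counted once.
        rows = rng.randint(0, len(target), size=sample_size)
        # Positions of the top_n highest predictions within the sample.
        best = rows[np.argsort(predictions[rows])[::-1][:top_n]]
        profits[i] = (target[best] * REVENUE_PER_UNIT - BUDGET_PER_WELL).sum()
    return pd.Series(profits)

# Synthetic demonstration data (not from the project files).
rng = np.random.RandomState(0)
demo_target = pd.Series(rng.uniform(0, 190, size=2000))
demo_pred = demo_target + rng.normal(scale=5, size=2000)

profits = bootstrap_profit(demo_target, demo_pred, n_tests=200)
lower, upper = profits.quantile(0.025), profits.quantile(0.975)
loss_risk = (profits < 0).mean() * 100

print("Average profit:", profits.mean())
print("95% interval:", lower, upper)
print("Risk of losses, %:", loss_risk)
```

Because the function works on positions, it does not need the earlier `reset_index` step, as long as targets and predictions are passed in the same row order.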


Overall conclusion:

The region I would choose is Region 1.
Region 1 has the highest average profit (about 5.15 million) and the lowest risk of losses: only 10 of the 1000 bootstrap samples (1%) ended in a loss, compared with 6.0% for Region 0 and 6.4% for Region 2.
It is also the only region where the lower bound of the 95% interval (the 0.025 quantile computed by `values.quantile(0.025)`) is positive, at about 0.69 million; for Regions 0 and 2 this bound is negative.