# Choosing a Location for an Oil Well

The dataset contains oil samples from three regions: in each region, there are 100,000 oil fields where the quality of the oil and the volume of its reserves were measured. The task is to build a machine learning model that helps identify the region where extraction will bring the highest profit, and to analyze the potential profit and risks.

Steps for choosing a location:

- In the selected region, we search for oil fields and determine the feature values for each;
- We build a model and estimate the volume of reserves;
- We select the fields with the highest estimated values. The number of fields depends on the company's budget and the cost of developing one well;
- Profit is equal to the cumulative profit from the selected fields.

**Data Description**
`id` — unique well identifier  
`f0`, `f1`, `f2` — three feature values of the points  
`product` — volume of reserves in the well (thousand barrels)

## 1 Loading and Preparing the Data

In [1]:
import os
from math import sqrt
from numpy.random import RandomState

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse

In [2]:
BASE_DIR = os.getcwd()
r_state = RandomState(12345)
state = 1

Loading the datasets.

In [3]:
d_1 = pd.read_csv(f'{BASE_DIR}/datasets/geo_data_0.csv')
d_2 = pd.read_csv(f'{BASE_DIR}/datasets/geo_data_1.csv')
d_3 = pd.read_csv(f'{BASE_DIR}/datasets/geo_data_2.csv')

Analyzing general information about the data.

In [4]:
d_1.head()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [5]:
d_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [6]:
d_1.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

There are no missing values in the data.   
We need to remove the unnecessary column with data identifiers since they will not be used during model training. If we need the identifiers later, we can restore them using the indices from the original file.   
Features f0, f1, f2 need to be scaled.   
It makes sense to change the data types to lighter ones.

### 1.1 Removing the Well ID Column

In [7]:
d_1 = d_1.drop('id', axis=1)
d_2 = d_2.drop('id', axis=1)
d_3 = d_3.drop('id', axis=1)

In [8]:
d_1.head(5)

|   | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


### 1.2 Checking for Duplicates

In [9]:
print(d_1.duplicated().sum())
print(d_2.duplicated().sum())
print(d_3.duplicated().sum())

0
0
0


### ~~1.3 Scaling the Features~~

This step will be performed during model training.
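One way to keep scaling inside the training step is scikit-learn's `Pipeline`, which fits the scaler on the training split only. The sketch below is not the code used in this notebook: it runs on synthetic data, and the arrays and coefficients are made up for illustration. It also shows that, for ordinary least squares with an intercept, standardizing the features leaves the predictions unchanged up to rounding.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one region's data (illustrative values only)
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3)) * [1.0, 5.0, 20.0]
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=1000)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1)

# The scaler is fitted on the training split only, inside the pipeline
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)

# The same model without scaling, for comparison
plain_model = LinearRegression().fit(X_train, y_train)

pred_scaled = scaled_model.predict(X_valid)
pred_plain = plain_model.predict(X_valid)

# Largest difference between the two sets of predictions
max_abs_diff = float(np.max(np.abs(pred_scaled - pred_plain)))
print(max_abs_diff)
```

Scaling would start to matter if the linear model were regularized (for example, ridge or lasso), because the penalty depends on the scale of the coefficients.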

### 1.4 Changing the Data Types

In [10]:
d_1 = d_1.apply(pd.to_numeric, downcast='float')
d_2 = d_2.apply(pd.to_numeric, downcast='float')
d_3 = d_3.apply(pd.to_numeric, downcast='float')
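The effect of downcasting can be checked on a small frame. The sketch below uses a toy DataFrame with made-up values rather than the project data and compares memory usage before and after `pd.to_numeric(..., downcast='float')`.

```python
import numpy as np
import pandas as pd

# Toy frame with float64 columns (made-up values)
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(1000, 4)),
                   columns=['f0', 'f1', 'f2', 'product'])

# Bytes used by the columns before and after downcasting
before = int(toy.memory_usage(index=False).sum())
toy_small = toy.apply(pd.to_numeric, downcast='float')
after = int(toy_small.memory_usage(index=False).sum())

print(toy_small.dtypes.unique(), before, after)
```

The trade-off is precision: `float32` keeps roughly seven significant decimal digits, so the last printed digit of a value can change after the conversion.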

### 1.5 Conclusions

Checking the results:

In [11]:
d_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   f0       100000 non-null  float32
 1   f1       100000 non-null  float32
 2   f2       100000 non-null  float32
 3   product  100000 non-null  float32
dtypes: float32(4)
memory usage: 1.5 MB


In [12]:
d_1.head()

|   | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | 0.705745 | -0.497822 | 1.22117 | 105.28006 |
| 1 | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 1.022732 | 0.15199 | 1.419926 | 85.265648 |
| 3 | -0.032172 | 0.139033 | 2.978566 | 168.620773 |
| 4 | 1.988431 | 0.155413 | 4.751769 | 154.036652 |




**As a result of data preparation:**
   - the `id` column, which is not used for training, was removed;
   - the data was checked for duplicates and missing values;
   - ~~quantitative features were scaled;~~
   - data types were changed to lighter ones.

## 2 Training and Testing the Model

### 2.1 Model Training

Function for training the model and calculating the average predicted oil reserves and RMSE of the model.

In [13]:
def mean_rmse(d):
    feature = d.drop('product', axis=1)
    target = d['product']

    # Split the data into training and validation sets in a 3:1 ratio
    feature_train, feature_valid, target_train, target_valid = train_test_split(
        feature, target, test_size=0.25, random_state=state)

    # Scale the features
    scaler = StandardScaler()
    scaler.fit(feature_train)
    feature_train = pd.DataFrame(scaler.transform(feature_train))
    feature_valid = pd.DataFrame(scaler.transform(feature_valid))

    # Train the model
    model = LinearRegression()
    model.fit(feature_train, target_train)

    # Make predictions on the validation set
    target_predicted = model.predict(feature_valid)
    target_predicted_df = pd.Series(target_predicted, index=target_valid.index)

    # Print the average predicted reserve and RMSE of the model
    print(
        f'Average predicted reserve = {target_predicted.mean()}, RMSE of the model = {sqrt(mse(target_valid, target_predicted))}')

    # Return Series with the true and predicted values, they will be needed later
    return {'y_pred': target_predicted_df, 'y_true': target_valid}


In [14]:
print('Results for d_1 : ', end="")
res_mean_rmse_1 = mean_rmse(d_1)

print('Results for d_2 : ', end="")
res_mean_rmse_2 = mean_rmse(d_2)

print('Results for d_3 : ', end="")
res_mean_rmse_3 = mean_rmse(d_3)


Results for d_1 : Average predicted reserve = 92.49263000488281, RMSE of the model = 37.74258774497981
Results for d_2 : Average predicted reserve = 69.12039947509766, RMSE of the model = 0.8943376956413401
Results for d_3 : Average predicted reserve = 94.95684051513672, RMSE of the model = 39.86671213340931


### 2.2 Analyzing the Results

In region 2, the RMSE is by far the lowest (about 0.89 versus 37.7 and 39.9 in regions 1 and 3), so the model's error there is roughly 40 times smaller. On the other hand, the average predicted oil reserves in this region are the lowest (about 69 thousand barrels versus 92 and 95).
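An RMSE value is easier to interpret next to a constant baseline that always predicts the mean target. The following is a minimal sketch on made-up numbers. The `rmse` helper is hypothetical and is not part of the notebook, and for brevity the baseline mean is taken from the same toy array rather than from a training set.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and predictions, for illustration only
y_true = np.array([100.0, 80.0, 120.0, 60.0])
y_pred = np.array([90.0, 85.0, 110.0, 70.0])

model_rmse = rmse(y_true, y_pred)

# Baseline: always predict the mean of the targets
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))

print(model_rmse, baseline_rmse)
```

A model whose RMSE is close to the baseline adds little beyond the regional average, while a much smaller RMSE means the ranking of wells by prediction can be trusted more.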

## 3 Preparing for Profit Calculation

### 3.1 Saving All Key Values for Calculations in Separate Variables

In [15]:
PLACE_COUNT = 200
PRUDUCT_PRICE = 450_000
BUDGET = 10_000_000_000

### 3.2 Calculating the Sufficient Volume of Oil for Break-Even Development of a New Well. Comparing the Obtained Volume with the Average Reserves in Each Region.

The total budget for well development is 10 billion rubles.  
The revenue per unit of product (one thousand barrels) is 450 thousand rubles.  
We are developing 200 wells, so the expense per well is 10 billion / 200 = 50 million rubles.  
The volume sufficient for break-even development of a new well is 50 million / 450 thousand ≈ 111.1 thousand barrels.
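The same arithmetic can be written as a short check. This is a sketch using the values stated above. The name `PRICE_PER_UNIT` is chosen here for readability and is not the constant name used in the notebook.

```python
BUDGET = 10_000_000_000      # total development budget, rubles
PLACE_COUNT = 200            # number of wells to develop
PRICE_PER_UNIT = 450_000     # revenue per thousand barrels, rubles

# Development cost attributed to a single well
cost_per_well = BUDGET / PLACE_COUNT

# Volume at which one well's revenue equals its cost (thousand barrels)
break_even_volume = cost_per_well / PRICE_PER_UNIT

print(f'{cost_per_well:,.0f} rubles per well, '
      f'break-even at {break_even_volume:.1f} thousand barrels')
```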

In [16]:
print(d_1['product'].mean())
print(d_2['product'].mean())
print(d_3['product'].mean())

92.5008316040039
68.83883666992188
94.99874877929688


### 3.3 Conclusions for the Profit Calculation Preparation Stage

The volume required for break-even development of one new well is 111.1 thousand barrels. The average reserves in every region are below this threshold, especially in region d_2 (about 68.8 thousand barrels). Therefore, randomly selected well locations are unlikely to be profitable, and a machine learning model should be used to pick the most promising wells.

## 4 Profit Calculation and Risks

### 4.1 Function for Calculating Profit Based on Selected Wells and Model Predictions

For each profit estimate, we randomly select 500 wells out of the 25,000 in the validation set (for our task, it is not necessary to sample from the entire dataset). The model has already been trained, so we take the true and predicted values for these 500 wells.

In [17]:
def revenue(y_true, y_pred, r_stat):
    # Draw 500 random wells (pandas samples without replacement by default)
    # and take their true and predicted volumes by index
    random_500_index = y_true.sample(n=500, random_state=r_stat).index
    y_true_500 = y_true[random_500_index]
    y_pred_500 = y_pred[random_500_index]

    # Indices of the PLACE_COUNT (200) wells with the highest predicted volume
    indexes_pred_volumes_200 = y_pred_500.sort_values(ascending=False)[:PLACE_COUNT].index

    # Total true volume of the selected wells
    product_volume_200 = sum(y_true_500[indexes_pred_volumes_200])

    # Return the profit: revenue from the selected wells minus the development budget
    return product_volume_200 * PRUDUCT_PRICE - BUDGET
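The key point of this function is that wells are chosen by their predicted volume, while profit is computed from their true volume. A toy sketch with made-up numbers illustrates this. The helper `top_k_profit` and its parameters are hypothetical and only mirror the logic of `revenue` on a tiny example.

```python
import pandas as pd

def top_k_profit(y_true, y_pred, k, price, budget):
    """Profit when the k wells with the highest predictions are developed."""
    chosen = y_pred.sort_values(ascending=False).index[:k]
    return y_true.loc[chosen].sum() * price - budget

# Made-up volumes for five wells (thousand barrels)
y_true = pd.Series([120.0, 60.0, 150.0, 90.0, 110.0])
y_pred = pd.Series([115.0, 130.0, 140.0, 80.0, 100.0])

# Wells 2 and 1 have the highest predictions; their true volumes are 150 and 60
profit = top_k_profit(y_true, y_pred, k=2, price=10.0, budget=1000.0)
print(profit)  # 1100.0
```

Well 1 is overestimated by the predictions, so it is selected even though its true volume is the lowest. This is how prediction error reduces the realized profit.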

The next function applies bootstrap with 1,000 samples to obtain the profit distribution, then calculates the average profit, the 95% confidence interval, and the risk of losses.

In [18]:
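def revenue_bootstrap(*args):
    # Profit for 1000 samples; the iteration number serves as the random seed
    revenue_list = [revenue(*args, x) for x in range(1000)]
    revenue_series = pd.Series(revenue_list)
    # 95% interval from the 2.5% and 97.5% quantiles of the profit distribution
    print(f'Confidence interval = {revenue_series.quantile(0.025)} - {revenue_series.quantile(0.975)}')
    print(f'Mean revenue        = {revenue_series.mean()}')
    # Loss risk: share of samples with negative profit, in percent
    print(f'Loss risk           = {round(100 - (len(revenue_series.loc[revenue_series >= 0])/1000)*100,2)} %')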
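In the notebook, each sample of 500 wells is drawn without replacement, which is the default of pandas `sample`. A classic bootstrap resamples with replacement instead. The sketch below shows that variant on synthetic data. The function `bootstrap_profit`, its parameters, and the generated volumes are assumptions made for illustration, and its numbers are not comparable to the results reported in this notebook.

```python
import numpy as np
import pandas as pd

def bootstrap_profit(y_true, y_pred, n_wells, top_k, price, budget,
                     n_iter=1000, seed=12345):
    """Resample wells with replacement and pick top_k by prediction each time."""
    rng = np.random.RandomState(seed)
    true_arr = y_true.to_numpy(dtype=float)
    pred_arr = y_pred.to_numpy(dtype=float)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Positions drawn with replacement, so the same well may repeat
        pos = rng.randint(0, len(true_arr), size=n_wells)
        # Positions of the top_k highest predictions within the resample
        best = pos[np.argsort(pred_arr[pos])[::-1][:top_k]]
        profits[i] = true_arr[best].sum() * price - budget
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return {
        'mean': float(profits.mean()),
        'ci_95': (float(lower), float(upper)),
        'loss_risk_pct': float((profits < 0).mean() * 100),
    }

# Synthetic region: predictions are noisy copies of the true volumes
data_rng = np.random.RandomState(0)
y_true = pd.Series(data_rng.uniform(0, 180, size=5000))
y_pred = y_true + data_rng.normal(scale=30, size=5000)

stats = bootstrap_profit(y_true, y_pred, n_wells=500, top_k=200,
                         price=450_000, budget=10_000_000_000)
print(stats)
```

Using a single seeded generator for all iterations keeps the run reproducible without tying each sample to its iteration number.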

In [19]:
revenue_bootstrap(res_mean_rmse_1['y_true'], res_mean_rmse_1['y_pred'])

Confidence interval = -102060258.08811188 - 923950109.272003
Mean revenue        = 425660296.4588938
Loss risk           = 5.8 %


In [20]:
revenue_bootstrap(res_mean_rmse_2['y_true'], res_mean_rmse_2['y_pred'])

Confidence interval = 91717732.2769165 - 875081112.5469208
Mean revenue        = 477779668.70803833
Loss risk           = 0.9 %


In [21]:
revenue_bootstrap(res_mean_rmse_3['y_true'], res_mean_rmse_3['y_pred'])

Confidence interval = -145932964.3434286 - 869402555.0526003
Mean revenue        = 384331351.84384066
Loss risk           = 8.0 %


### 4.2 Conclusions: Proposed Region for Well Development and Justification for the Choice

The best choice among the three regions is the second region (data stored in the file "geo_data_1.csv").  
This region has the lowest risk of losses: 0.9%, compared to 5.8% and 8.0% in the first and third regions, respectively. It also has the highest mean profit on the 200 selected wells (about 478 million rubles versus 426 million and 384 million), and it is the only region whose 95% confidence interval lies entirely above zero.  
On all of these metrics, the second region is preferable to the other two.
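The comparison can be restated as a small check over the reported metrics. This is only a sketch: the dictionary repeats the rounded mean profit and loss risk from the bootstrap outputs above, and the region labels are chosen here for readability.

```python
# Metrics copied from the bootstrap outputs (rubles and percent)
results = {
    'region 1': {'mean_profit': 425_660_296, 'loss_risk_pct': 5.8},
    'region 2': {'mean_profit': 477_779_669, 'loss_risk_pct': 0.9},
    'region 3': {'mean_profit': 384_331_352, 'loss_risk_pct': 8.0},
}

# Region with the lowest risk of losses
lowest_risk = min(results, key=lambda r: results[r]['loss_risk_pct'])

# Region with the highest mean profit
highest_profit = max(results, key=lambda r: results[r]['mean_profit'])

print(lowest_risk, highest_profit)  # region 2 region 2
```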

## 5 Conclusions

The following steps were taken during the project:

- Data was prepared by removing unnecessary columns, scaling features, and changing data types to lighter ones.
- A linear regression model was trained, and its results were analyzed.
- The required oil reserves in a well for break-even development were calculated.
- The average profit and risk of losses were calculated using the bootstrap method.

As a result of the analysis, a region for well development was recommended, and justifications for the recommendations were provided.