# OilyGiant Oil Well Region Selection

# Introduction

My task is to choose the best of three regions for developing 200 new oil wells. Each region's dataset contains a large number of observations describing oil quality and reserve volume. The best region is defined as the one with the highest profit margin: revenue will be calculated from the 200 best-performing observations, development cost is a constant, and risk is measured in terms of losses.

First I will preprocess the data, verifying its suitability for training. For each region I will train and validate a linear regression model, save the predictions and correct values, and evaluate the models using root mean squared error and R2 score.



In [1]:
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Preprocessing

I need to import the datasets and examine them for duplicate observations and missing values. I will one-hot encode any categorical features and standardize numeric features where needed. I will split each dataset into features and target. I will also collect the three datasets in a list, so that later loops and functions can take it as input.

In [2]:
df_0 = pd.read_csv('geo_data_0.csv')
df_1 = pd.read_csv('geo_data_1.csv')
df_2 = pd.read_csv('geo_data_2.csv')

df_list = [df_0, df_1, df_2]

Preview the first few rows of each dataset.

In [3]:
for i, df in enumerate(df_list):
    print(f"For region {i}:")
    print(df.head())
    print()

For region 0:
      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647

For region 1:
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305

For region 2:
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.515993  1.716266  5.899011  149.600746



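From the first few rows, all three features are numeric and there are no categorical features, so no encoding is needed. Regions 0 and 2 show values on a small scale, while region 1's f0 and f1 range more widely, so the features may not be standardized in every region; this does not change the predictions of an ordinary linear regression. The id can be dropped; f0, f1, and f2 are the features; and product (in thousands of barrels of oil) is the target.

Check for duplicates using a loop. Print out the number of duplicate rows and duplicate ids.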
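Rather than judging the scale of the features by eye, one option is to compare each column's mean and standard deviation with 0 and 1. The sketch below is only an illustration: it rebuilds the five preview rows of region 1 shown above (the name `preview_1` is hypothetical), so the statistics describe those five rows and not the full dataset. On the real data the same `agg` call could be applied to each DataFrame in `df_list`.

```python
import pandas as pd

# The five preview rows of region 1 printed above (illustrative subset only)
preview_1 = pd.DataFrame({
    "f0": [-15.001348, 14.272088, 6.263187, -13.081196, 12.702195],
    "f1": [-8.276000, -3.475083, -5.948386, -11.506057, -8.147433],
    "f2": [-0.005876, 0.999183, 5.001160, 4.999415, 5.004363],
})

# Standardized features would have a mean near 0 and a standard deviation near 1
summary = preview_1[["f0", "f1", "f2"]].agg(["mean", "std"])
print(summary)
```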

In [4]:
for i, df in enumerate(df_list):
    print(f"For region {i}:")

    print("Duplicate observations:", df.duplicated().sum())
    print("Duplicate id's:", df.id.duplicated().sum())
    print()

    # Uncomment to inspect the duplicated rows themselves
    # print(df[df.id.duplicated()])
    # print()

For region 0:
Duplicate observations: 0
Duplicate id's: 10

For region 1:
Duplicate observations: 0
Duplicate id's: 4

For region 2:
Duplicate observations: 0
Duplicate id's: 4



It is possible that some of the potential oil well sites were sampled more than once. Because the rows that share an id carry different measurements, I will keep them. However, I will record these ids and, later in the project, adjust the final results if necessary so that the selection contains 200 unique oil well sites, as the company requires.

In [5]:
dup_sites_0 = df_0[df_0.id.duplicated()][['id']]
dup_sites_1 = df_1[df_1.id.duplicated()][['id']]
dup_sites_2 = df_2[df_2.id.duplicated()][['id']]
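Note that `duplicated()` with its default `keep='first'` flags only the second and later occurrences of an id, which is enough for a list of affected ids. To see every row that shares an id side by side, `keep=False` can be used. A minimal sketch on a made-up table (`toy`, its ids, and its values are hypothetical, not taken from the datasets):

```python
import pandas as pd

# Hypothetical data: id "a1" appears twice with different measurements
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.5, -0.3, 1.2],
    "product": [50.0, 80.0, 65.0, 90.0],
})

# Default keep='first': only the later occurrence of each repeated id is flagged
later_copies = toy[toy.id.duplicated()]

# keep=False: every row whose id occurs more than once is flagged
all_copies = toy[toy.id.duplicated(keep=False)].sort_values("id")

print(later_copies)
print(all_copies)
```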

Let's check for missing values and datatypes now.

In [6]:
for i, df in enumerate(df_list):
    print(f"For region {i}:")
    df.info()
    print()

For region 0:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

For region 1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

For region 2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data column

Excellent - no missing values, and the features/target are all floats.

# Model Training

Now I will train a linear regression model for each of these datasets, using a function.

In [72]:
def train_model(df):
    """
    Train and evaluate a linear regression model for one region.

    Takes one of our three pandas DataFrames, splits it into features and target,
    splits these into training/validation sets (0.75/0.25), trains a linear regression
    model, and predicts on the validation set. Prints the average predicted reserve
    volume, root mean squared error, and R2 score.

    Returns a tuple in the format (predictions, answers, model, average volume).
    """

    features = df[['f0', 'f1', 'f2']]
    target = df['product']

    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=0
    )
    
    model = LinearRegression()
    model.fit(features_train, target_train)
    pred = model.predict(features_valid)
    
    rmse = mean_squared_error(target_valid, pred)**0.5
    score = r2_score(target_valid, pred)
    avg_vol = pred.mean()
    answers = pd.Series(target_valid).reset_index(drop=True)
    
    print("Average volume of predicted reserves:", avg_vol)
    print("RMSE:", rmse)
    print("R2 score:", score)
    print()
    
    return pd.Series(pred), answers, model, avg_vol

In [122]:
region_0_pred, region_0_answers, region_0_model, region_0_avg_vol = train_model(df_0)

Average volume of predicted reserves: 92.27144852242301
RMSE: 37.48100896950594
R2 score: 0.2809263356941697



In [74]:
region_1_pred, region_1_answers, region_1_model, region_1_avg_vol = train_model(df_1)

Average volume of predicted reserves: 69.15162398290752
RMSE: 0.8872573052219335
R2 score: 0.9996271830439484



In [75]:
region_2_pred, region_2_answers, region_2_model, region_2_avg_vol = train_model(df_2)

Average volume of predicted reserves: 94.70753129105672
RMSE: 40.31290686044374
R2 score: 0.19438402105974983



Region 2 has the largest predicted volume, but the worst RMSE and R2 score. Region 1 has by far the best RMSE and R2 scores, but notably lower predicted reserve volume than the others. Region 0 has a slightly lower predicted reserve volume than region 2, but a somewhat better RMSE and R2 score than region 2.

To summarize: region 2 has the highest projected volume, region 1's model seems to be the most reliable, and region 0 is somewhere in the middle.
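To make the two metrics concrete: RMSE is the typical prediction error in the target's own units (thousands of barrels), and R2 is the share of the target's variance that the model explains, so a model that always predicts the mean scores 0 and a perfect model scores 1. The sketch below computes both by hand on made-up numbers; the helper names `rmse` and `r2` and the sample values are hypothetical and only illustrate the definitions.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical volumes in thousands of barrels
actual = [100.0, 50.0, 150.0, 80.0]
mean_only = [sum(actual) / len(actual)] * len(actual)  # baseline: always predict the mean

print("Baseline RMSE:", rmse(actual, mean_only))
print("Baseline R2:", r2(actual, mean_only))
print("Perfect-model R2:", r2(actual, actual))
```

Read this way, region 1's R2 of about 0.9996 means its predictions track the true volumes almost exactly, while the models for regions 0 and 2 explain only about 28% and 19% of the variance.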

# Prepare for profit calculation

I'll store some of our values in variables for later use, and work out the volume of oil needed to break even (the point where revenue equals development cost).

In [92]:
usd_per_1000_barrels = 4500
dev_cost = 10**8

sample_size = 500
top_200 = 200

Development cost is 100,000,000 USD for the planned 200 oil wells. In the datasets, the volume of oil is measured in 1000 barrels. One barrel provides 4.5 USD in revenue, so 1000 barrels provides 4,500 USD in revenue. Development cost divided by the revenue per 1000 barrels gives us the volume needed to break even.

In [94]:
min_volume = dev_cost / usd_per_1000_barrels
print(min_volume)

22222.222222222223


The top 200 reserves in the region need to have a combined volume of at least 22,222.22 thousand barrels for a profit to be made.

Each model's average predicted volume is a per-site figure. Multiplying it by 200 gives the total volume we would expect from 200 average sites in that region.

In [78]:
print('Region 0 predicted total volume:', region_0_avg_vol*200)
print('Region 1 predicted total volume:', region_1_avg_vol*200)
print('Region 2 predicted total volume:', region_2_avg_vol*200)

Region 0 predicted total volume: 18454.289704484603
Region 1 predicted total volume: 13830.324796581504
Region 2 predicted total volume: 18941.506258211346


Going by the average predicted volume, none of these regions seems to have enough oil to justify building oil wells. However, in practice we will choose the best 200 sites available, so the total volume of the selected sites should be higher than these figures.
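The same comparison can be made per well: dividing the break-even volume by 200 gives about 111.11 thousand barrels that each well must hold on average. The sketch below restates that arithmetic and compares it with the rounded average predictions printed by the models; the names `n_wells`, `avg_predicted`, and `shortfall` are introduced here only for illustration.

```python
usd_per_1000_barrels = 4500
dev_cost = 10**8
n_wells = 200

# Average volume (thousand barrels) each of the 200 wells needs to break even
min_volume_per_well = dev_cost / usd_per_1000_barrels / n_wells

# Average predicted volume per site, rounded from the model output above
avg_predicted = {0: 92.27, 1: 69.15, 2: 94.71}

# Gap between the break-even volume and an average site in each region
shortfall = {region: min_volume_per_well - vol for region, vol in avg_predicted.items()}

print(f"Break-even volume per well: {min_volume_per_well:.2f}")
for region, gap in shortfall.items():
    print(f"Region {region}: average site is {gap:.2f} thousand barrels short")
```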

# Calculate profit from the top 200 wells

Let's write a function that computes the profit if oil wells are built at the 200 sites with the highest predicted volume.

In [125]:
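def top_200_profits(predictions, answers):
    """
    Select the top_200 sites with the highest predicted volume, sum their actual volumes,
    and return the profit (revenue from that volume minus dev_cost).
    """
    # Rank sites by prediction, but total the actual volumes at the selected sites
    top_200_pred_sites = predictions.sort_values(ascending=False).head(top_200).index
    top_200_sites_sum = answers.loc[top_200_pred_sites].sum()
    print("Total reserve volume:", top_200_sites_sum)
    revenue = top_200_sites_sum * usd_per_1000_barrels
    profit = revenue - dev_cost
    print("Profit:", profit)
    return profit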
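The key point of this function is that sites are chosen by predicted volume, but the profit is computed from the actual volume at those sites, so a weak model can pick sites that underdeliver. A toy sketch of the same idea with five made-up sites and two wells (`top_k_profit`, the toy series, and the cost figure are all hypothetical and not part of the project):

```python
import pandas as pd

def top_k_profit(predictions, answers, k, usd_per_unit, cost):
    """Pick the k sites with the highest predictions; profit uses their actual volumes."""
    top_idx = predictions.sort_values(ascending=False).head(k).index
    total_volume = answers.loc[top_idx].sum()
    return total_volume * usd_per_unit - cost

# Hypothetical predicted and actual volumes (thousand barrels), sharing one index
toy_pred = pd.Series([90.0, 120.0, 60.0, 110.0, 30.0])
toy_answers = pd.Series([100.0, 80.0, 70.0, 130.0, 20.0])

# Selecting by prediction picks sites 1 and 3, whose actual volumes are 80 and 130
toy_profit = top_k_profit(toy_pred, toy_answers, k=2, usd_per_unit=4500, cost=500_000)

# Selecting by the (normally unknown) actual volumes would pick sites 3 and 0 instead
best_possible = top_k_profit(toy_answers, toy_answers, k=2, usd_per_unit=4500, cost=500_000)

print("Profit when selecting by prediction:", toy_profit)
print("Profit with perfect knowledge:", best_possible)
```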

In [126]:
region_0_profit = top_200_profits(region_0_pred, region_0_answers)

Total reserve volume: 29696.462399128774
Profit: 33634080.79607949


33.6 million USD profit for region 0

In [128]:
region_1_profit = top_200_profits(region_1_pred, region_1_answers)

Total reserve volume: 27589.081548181137
Profit: 24150866.966815114


24.2 million USD profit for region 1

In [120]:
region_2_profit = top_200_profits(region_2_pred, region_2_answers)

Total reserve volume: 28053.06374114759
Profit: 26238786.83516416


26.2 million USD profit for region 2

All three regions have a total reserve volume above the break-even level of 22,222 thousand barrels, so each would yield a profit if wells were built at its top 200 predicted sites. Region 0 looks the most promising, with roughly 7.4 and 9.5 million USD more profit than regions 2 and 1, respectively. However, we still need to account for the risk introduced by imperfect predictions: the model for region 1 had a far higher R2 score than the others, even though region 1 has the lowest preliminary profit. We will bring risk into the calculation with bootstrapping.

# Bootstrapping