# Region for a New Oil Well

As an analyst at the OilyGiant mining company, my job is to analyze oil reserve data from three regions and compare their potential profit.

## Introduction
This report analyzes oil reserve data from three regions and estimates the profit each region could bring.

### Goal:
This report focuses on developing a Linear Regression model to predict which region is most suitable for a new oil well (the one with the highest profit margin). A separate model is trained on each region's data to predict the volume of reserves in its wells and, from that, the revenue.
The end goal is to pick the region with the highest profit margin.


### Stages:
This project will consist of the following stages:

1. Introduction
2. General Information
3. Split Dataset
    1. Training
    2. Validation
4. Model Training
    1. Region 1
    2. Region 2
    3. Region 3
5. Profit Pre-determination
6. Risk and Profit
    1. Region 1
    2. Region 2
    3. Region 3
7. Conclusion

## General Information

Import all libraries and modules

In [19]:
import pandas as pd 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error 
from sklearn.preprocessing import StandardScaler
from numpy.random import RandomState
import warnings

Now, let's open all of our datasets and study them. 

In [2]:
#import datasets

region_1 = pd.read_csv('/datasets/geo_data_0.csv')
region_2 = pd.read_csv('/datasets/geo_data_1.csv')
region_3 = pd.read_csv('/datasets/geo_data_2.csv')

In [4]:
#take a look at region 1

print(region_1.head())
region_1.info()

      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [5]:
#region 2

print(region_2.head())
region_2.info()

      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [6]:
#region 3

print(region_3.head())
region_3.info()

      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.515993  1.716266  5.899011  149.600746
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


The datasets consist of the following columns:
- `id` — unique oil well identifier
- `f0, f1, f2` — three features of points 
- `product` — volume of reserves in the oil well (thousand barrels).

Judging from the dataset information, there are no missing values in any of the three datasets. The target will be the **'product'** column.

However, we will need to remove the columns that do not contribute to our model; in this case, that is the 'id' column.

The remaining columns will be features.

We will also need to split our data into training and validation sets with a 75:25 ratio.

Finally, we will need to standardize the numerical columns.
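
As a small illustration of the standardization step, here is a standard-library sketch (the helper names `fit_scaler` and `transform` and the toy numbers are hypothetical, not part of this project). It estimates the mean and standard deviation on the training values only and reuses them for the validation values, which is the same fit-on-train, transform-both pattern used with `StandardScaler`.

```python
import statistics

def fit_scaler(train_values):
    """Estimate mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    # Population standard deviation (divide by n), which is what StandardScaler uses.
    std = statistics.pstdev(train_values)
    return mean, std

def transform(values, mean, std):
    """Apply z-score scaling with already-fitted statistics."""
    return [(v - mean) / std for v in values]

# Toy data standing in for one feature column (assumed values).
train = [1.0, 2.0, 3.0, 4.0]
valid = [2.5, 5.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
valid_scaled = transform(valid, mean, std)  # uses training statistics, not its own

print(train_scaled)
print(valid_scaled)
```

The validation values are deliberately not centered on zero after scaling; they are expressed relative to the training distribution, so no information from the validation set leaks into preprocessing.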

## Split the datasets

Since we are splitting all of the datasets in the same way, we can create a function to do it for us.

In [9]:
#make a function for splitting datasets into training and validation sets

def splitdata(dataset):
    #dropping unnecessary column
    dataset = dataset.drop(['id'], axis = 1)
    
    #set features and target
    features = dataset.drop(['product'], axis = 1)
    target = dataset['product']
    
    #now split
    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
    
    #work on copies so the assignments below do not trigger SettingWithCopyWarning
    features_train = features_train.copy()
    features_valid = features_valid.copy()
    
    #then standardize the numerical columns, fitting the scaler on the training set only
    scaler = StandardScaler()
    numerical = ['f0', 'f1', 'f2']
    
    scaler.fit(features_train[numerical])
    
    features_train[numerical] = scaler.transform(features_train[numerical])
    features_valid[numerical] = scaler.transform(features_valid[numerical])
    
    #return features and target for both sets
    
    return features_train, features_valid, target_train, target_valid

In [14]:
warnings.filterwarnings('ignore')

#Now apply the function to all three datasets

r1_features_train, r1_features_valid, r1_target_train, r1_target_valid = splitdata(region_1)

r2_features_train, r2_features_valid, r2_target_train, r2_target_valid = splitdata(region_2)

r3_features_train, r3_features_valid, r3_target_train, r3_target_valid = splitdata(region_3)

#to check

print(r1_features_train.shape,
     r1_features_valid.shape)
print(r1_features_train.head())
print(r2_features_train.shape,
     r2_features_valid.shape)
print(r2_features_train.head())
print(r3_features_train.shape,
     r3_features_valid.shape)
print(r3_features_train.head())

(75000, 3) (25000, 3)
             f0        f1        f2
27212 -0.544828  1.390264 -0.094959
7866   1.455912 -0.480422  1.209567
62041  0.260460  0.825069 -0.204865
70185 -1.837105  0.010321 -0.147634
82230 -1.299243  0.987558  1.273181
(75000, 3) (25000, 3)
             f0        f1        f2
27212 -0.850855  0.624428  0.296943
7866   1.971935  1.832275  0.294333
62041  1.079305  0.170127 -0.296418
70185 -1.512028 -0.887837 -0.880471
82230 -1.804775 -0.718311 -0.293255
(75000, 3) (25000, 3)
             f0        f1        f2
27212 -0.526160  0.776329 -0.400793
7866  -0.889625 -0.404070 -1.222936
62041 -1.133984  0.208576  0.296765
70185  1.227045  1.570166 -0.764556
82230 -0.194289  0.878312  0.840821


Now that we have successfully split and standardized all of the sets, we will train and test the Linear Regression model.

## Model Training

Just like splitting the datasets, we are doing the same thing for every region, so we will create a function to simplify the process.

In [15]:
#function that trains a model and returns its RMSE and validation predictions

def model_pred(features_train, features_valid, target_train, target_valid):
    
    model = LinearRegression()
    
    model.fit(features_train, target_train)
    
    predicted_valid = model.predict(features_valid)
    
    RMSE = mean_squared_error(target_valid, predicted_valid)**0.5
    
    return RMSE, predicted_valid

In [17]:
#applying the function

RMSE1, predicted_valid_r1 = model_pred(r1_features_train, r1_features_valid, r1_target_train, r1_target_valid)
RMSE2, predicted_valid_r2 = model_pred(r2_features_train, r2_features_valid, r2_target_train, r2_target_valid)
RMSE3, predicted_valid_r3 = model_pred(r3_features_train, r3_features_valid, r3_target_train, r3_target_valid)


print('Region 1 RMSE:', RMSE1, 'Region 1 average volume of predicted reserves:', predicted_valid_r1.mean())
print('Region 2 RMSE:', RMSE2, 'Region 2 average volume of predicted reserves:', predicted_valid_r2.mean())
print('Region 3 RMSE:', RMSE3, 'Region 3 average volume of predicted reserves:', predicted_valid_r3.mean())

Region 1 RMSE: 37.5794217150813 Region 1 average volume of predicted reserves: 92.59256778438035
Region 2 RMSE: 0.893099286775617 Region 2 average volume of predicted reserves: 68.728546895446
Region 3 RMSE: 40.02970873393434 Region 3 average volume of predicted reserves: 94.96504596800489


Based on the models, the region with the highest average volume of predicted reserves is region 3, followed closely by region 1 (94.97 and 92.59, respectively); region 2 is well behind at 68.73. However, region 2 has by far the lowest RMSE (0.89, versus 37.58 for region 1 and 40.03 for region 3), so its predictions are much more reliable and the chance of a well yielding far less than predicted is low. We will look into the data further to determine which region is better.
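
One way to put an RMSE value in context is to compare it with a naive baseline that always predicts the mean of the target; the RMSE of that baseline equals the standard deviation of the target. The sketch below uses only the standard library and synthetic data (the linear relationship, noise level, and variable names are assumptions for illustration, not results from the three regions).

```python
import math
import random
import statistics

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(statistics.fmean(squared_errors))

random.seed(0)

# Synthetic target: a linear signal plus Gaussian noise (assumed, for illustration).
x = [random.uniform(0, 10) for _ in range(1000)]
y = [10 * xi + random.gauss(0, 5) for xi in x]

# Stand-in for a fitted model that recovered the true slope.
model_predictions = [10 * xi for xi in x]

# Naive baseline: predict the mean of the target for every well.
baseline_predictions = [statistics.fmean(y)] * len(y)

model_rmse = rmse(y, model_predictions)
baseline_rmse = rmse(y, baseline_predictions)

print('Model RMSE:', model_rmse)
print('Baseline RMSE:', baseline_rmse)
```

If a model's RMSE is close to the baseline RMSE, the features explain little of the variation in the target; if it is far below, as in this toy case, the predictions carry real information.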

## Profit Pre-Determination

According to the company:

- The budget for the development of 200 oil wells is 100 million USD.

- A study of 500 points is carried out, and the best 200 points are picked for the profit calculation.

- One barrel of raw materials brings 4.5 USD of revenue, so the revenue from one unit of product is 4,500 USD (the volume of reserves is in thousand barrels).


All things considered, we need to calculate the volume of reserves a new well must hold for it to be developed without a loss.

In [18]:
#calculation

wells = 200
budget = 100000000
profit_per_volume = 4500

budget_per_well = budget / wells

product_revenue = budget_per_well / profit_per_volume

print('Volume of reserve needed for a new well without a loss:', product_revenue)

Volume of reserve needed for a new well without a loss: 111.11111111111111


The calculation suggests that the minimum volume of reserves a new well needs in order to break even is 111.11 units (thousand barrels). None of the regions reaches this on average: the mean predicted volumes from model training are 92.59, 68.73, and 94.97 for regions 1, 2, and 3.
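
The break-even volume above can be written as a single formula. Here $B$ is the total budget, $n$ the number of wells, and $r$ the revenue per unit of product; these symbols are introduced only for this illustration.

$$
V_{\min} = \frac{B}{n \cdot r} = \frac{100{,}000{,}000}{200 \times 4{,}500} \approx 111.11 \text{ thousand barrels per well}
$$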

Because of this, we cannot pick wells at random; we need to choose locations whose predicted reserves are well above their region's average. We will do that next by keeping only the wells with the highest predicted reserves.
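
The selection idea can be sketched with plain Python before applying it to the real data: rank the wells by predicted volume, keep the top few, and compute profit from their actual volume. The function name `profit_of_top_wells` and all numbers below are hypothetical toy values, not figures from the datasets.

```python
def profit_of_top_wells(actual, predicted, n_wells, revenue_per_unit, budget):
    """Pick the n_wells with the highest predictions and return the profit
    earned from their actual volumes."""
    # Indices of wells sorted by predicted volume, largest first.
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    chosen = ranked[:n_wells]
    # Profit depends on what the wells really hold, not on what was predicted.
    total_volume = sum(actual[i] for i in chosen)
    return total_volume * revenue_per_unit - budget

# Toy example with five candidate wells (assumed values, thousand barrels).
predicted = [120.0, 80.0, 150.0, 95.0, 130.0]
actual = [110.0, 90.0, 140.0, 100.0, 125.0]

toy_profit = profit_of_top_wells(actual, predicted, n_wells=2,
                                 revenue_per_unit=4500, budget=1_000_000)
print(toy_profit)
```

Here the two wells with the highest predictions hold 140 and 125 units, so the toy profit is 265 × 4,500 − 1,000,000 = 192,500 USD.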

## Risk and Profit

Let's calculate the predicted profit for the top 200 wells in each region (dataset). We will use the bootstrapping technique, drawing 1000 subsamples of 500 points each, to estimate the distribution of profit. Since we are repeating the operation, we will write functions to save time.

In [37]:
#two functions: the first calculates the profit of the top n wells (ranked by predicted volume) in a sample
def revenue(target, predicted, n):
    predicted = pd.Series(predicted)
    target = target.reset_index(drop = True)
    #rank wells by predicted volume, then sum the actual volume of the top n
    index = predicted.sort_values(ascending = False).index
    return (target.loc[index][:n].sum() * profit_per_volume) - budget

#the second bootstraps the profit of the top 200 wells and reports the mean, interval, and risk of loss

def rev_calc(target, predictions):
    state = RandomState(12345)
    values = []
    target = target.reset_index(drop=True)
    
    #bootstrapping: 1000 samples of 500 points each, drawn with replacement
    for i in range(1000):
        target_sample = target.sample(n=500,replace = True, random_state = state)
        predictions_sample = predictions[target_sample.index]
        values.append(revenue(target_sample, predictions_sample, wells))
        
    
    values = pd.Series(values)
    
    mean = values.mean()
    lower = values.quantile(0.025)
    upper = values.quantile(0.975)
    #risk of loss: share of bootstrap samples with negative profit
    loss_risk = (values < 0).mean()
    
    print('Average Profit: ${} USD'.format(mean))
    print('95% confidence interval for average profit is: ({}, {})'.format(lower, upper))
    print('Risk of loss: {0:0%}'.format(loss_risk))
    
    
print('Region 1:')
rev_calc(r1_target_valid, predicted_valid_r1)

print('\nRegion 2:')
rev_calc(r2_target_valid, predicted_valid_r2)

print('\nRegion 3:')
rev_calc(r3_target_valid, predicted_valid_r3)

Region 1:
Average Profit: $3961649.8480237117 USD
95% confidence interval for average profit is: (-1112155.4589049604, 9097669.41553423)
Risk of loss: 6.900000%

Region 2:
Average Profit: $4560451.057866608 USD
95% confidence interval for average profit is: (338205.0939898458, 8522894.538660347)
Risk of loss: 1.500000%

Region 3:
Average Profit: $4044038.665683568 USD
95% confidence interval for average profit is: (-1633504.1339559986, 9503595.749237997)
Risk of loss: 7.600000%


Based on the bootstrap results, Region 2 is the region expected to bring the most profit. It has the highest average profit (about 4.56 million USD), the lowest risk of loss (1.5%), and a 95% confidence interval of roughly (0.34, 8.52) million USD that stays entirely above zero. Regions 1 and 3 have a higher risk of loss (6.9% and 7.6%), and the lower bounds of their confidence intervals are negative, meaning a loss is plausible.
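
To make the three reported numbers concrete, here is a minimal standard-library sketch of how a list of bootstrap profit values is reduced to a mean, a 2.5%–97.5% percentile interval, and a risk of loss. The helper names and the ten toy profit values are hypothetical; the percentile uses linear interpolation between neighboring values.

```python
def percentile(sorted_values, q):
    """Percentile for q in [0, 1], with linear interpolation between
    neighboring values of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] * (1 - fraction) + sorted_values[high] * fraction

def summarize(profits):
    """Return mean profit, the 2.5% and 97.5% percentiles, and the share of
    values below zero (the risk of loss)."""
    ordered = sorted(profits)
    mean_profit = sum(ordered) / len(ordered)
    lower = percentile(ordered, 0.025)
    upper = percentile(ordered, 0.975)
    loss_risk = sum(p < 0 for p in ordered) / len(ordered)
    return mean_profit, lower, upper, loss_risk

# Toy bootstrap results in million USD (assumed values, not project output).
toy_profits = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

mean_profit, lower, upper, loss_risk = summarize(toy_profits)
print(mean_profit, lower, upper, loss_risk)
```

In this toy case two of the ten values are negative, so the risk of loss is 20%, and the interval includes negative values, which is the same pattern seen for Regions 1 and 3.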

## Conclusion

In conclusion:

We trained a Linear Regression model for each of the three regions and calculated the volume of reserves a well needs in order to avoid a loss. We then bootstrapped the target values with samples of 500 points (the number of points studied) and calculated the profit for the top 200 wells in each sample.

The bootstrapping step suggests that Region 2 is the region most likely to be profitable, with the highest average profit and the lowest risk of loss.