# Project 9 - Prediction of New Locations for OilyGiant Oil Wells

## Project Description

OilyGiant, an oil extraction company, needs a way to find a suitable location for a new oil well.

The steps to choose a new location are:
- Collect well parameters in the selected regions: oil quality and volume of oil reserves;
- Build a model that predicts the volume of oil reserves in new wells;
- Choose the oil wells with the highest estimated values;
- Select the region with the highest total profit for the chosen oil wells.

Oil sample data is available for three regions, and the parameters of each oil well in a region are known. The model will be used to help select the region with the highest profit margin. Potential profit and risk will be analyzed using the bootstrap technique.

### Steps of The Project
1. Initialization
2. Data Overview
3. Model Testing
4. Basic Profit Calculation
5. Model Function for Profit Calculation
6. Profit and Risk Calculation

### Data Description
Geological exploration data for the three regions is stored in three files:
- `geo_data_0.csv`
- `geo_data_1.csv`
- `geo_data_2.csv`

Each file contains the following columns:

- `id` — Unique ID of the oil well
- `f0, f1, f2` — three point features (the specific meaning is not important, but the features themselves are significant)
- `product` — volume of oil reserves in the well (thousands of barrels).

Conditions:
- Only linear regression is suitable for model training (the other models are considered inadequate for prediction).
- When exploring a region, 500 points are studied and the 200 best are selected for the profit calculation.
- The budget for developing 200 oil wells is 100 million USD.
- One barrel of raw material brings 4.5 USD in revenue, so one unit of product brings 4,500 USD (reserves are measured in thousands of barrels).
- After the risk evaluation, keep only the regions with a risk of loss below 2.5%. From the regions that meet this criterion, select the one with the highest average profit.
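The conditions above imply a break-even volume per well, which the short sketch below works out. The variable names `cost_per_well` and `break_even_volume` are introduced here for illustration only; the figures come from the conditions.

```python
# Figures taken from the project conditions
budget = 100_000_000      # USD for developing the selected wells
points = 200              # wells to develop
product_income = 4_500    # USD per unit of product (thousand barrels)

# Development cost of a single well
cost_per_well = budget / points

# Reserves a well must hold, in thousand barrels, to pay for itself
break_even_volume = cost_per_well / product_income

print(f"Cost per well: {cost_per_well:,.0f} USD")
print(f"Break-even volume: {break_even_volume:.2f} thousand barrels")
```

A well selected for development therefore needs roughly 111 thousand barrels of reserves to cover its cost, which is a useful yardstick when reading the regional mean reserves reported later.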

This data is artificial: contract details and well characteristics are not shown.

## Initialization

In [1]:
# General-purpose libraries
import pandas as pd
import numpy as np

# Machine learning tools used in this project
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Data Overview

In [2]:
data0 = pd.read_csv(r'geo_data_0.csv')
data1 = pd.read_csv(r'geo_data_1.csv') 
data2 = pd.read_csv(r'geo_data_2.csv')

In [3]:
print(data0.head())
print('---------------------------------------------------')
print(data1.head())
print('---------------------------------------------------')
print(data2.head())

      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
---------------------------------------------------
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305
---------------------------------------------------
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.51599

In [4]:
data0.shape, data1.shape, data2.shape

((100000, 5), (100000, 5), (100000, 5))

In [5]:
print(data0.info())
print('///////////////////////////////////////////////')
print(data1.info())
print('///////////////////////////////////////////////')
print(data2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
///////////////////////////////////////////////
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
///////////////////////////////////////////////
<class 'pandas.core.frame.

In [6]:
print(data0.describe())
print('------------------------------------------------------------------')
print(data1.describe())
print('------------------------------------------------------------------')
print(data2.describe())

                  f0             f1             f2        product
count  100000.000000  100000.000000  100000.000000  100000.000000
mean        0.500419       0.250143       2.502647      92.500000
std         0.871832       0.504433       3.248248      44.288691
min        -1.408605      -0.848218     -12.088328       0.000000
25%        -0.072580      -0.200881       0.287748      56.497507
50%         0.502360       0.250252       2.515969      91.849972
75%         1.073581       0.700646       4.715088     128.564089
max         2.362331       1.343769      16.003790     185.364347
------------------------------------------------------------------
                  f0             f1             f2        product
count  100000.000000  100000.000000  100000.000000  100000.000000
mean        1.141296      -4.796579       2.494541      68.825000
std         8.965932       5.119872       1.703572      45.944423
min       -31.609576     -26.358598      -0.018144       0.000000
25%      

**Findings :**

- There are no missing values or incorrect data types
- Each dataset consists of 100,000 rows and 5 columns

## Test the Model

***Include :***

- Split the data into training and validation sets with a 75:25 ratio
- Train the model and make predictions for the validation set
- Save the predictions and the correct answers for the validation set
- Show the mean predicted reserves and the model RMSE

In [7]:
data_all = [
    data0.drop('id', axis=1),
    data1.drop('id', axis=1),
    data2.drop('id', axis=1),
]

In [8]:
# One RandomState is shared by all splits, so its state advances and
# each region gets a different (but reproducible) 75:25 split.
rs = np.random.RandomState(12345)

samples_target = []
samples_predictions = []

for region in range(len(data_all)):
    data = data_all[region]

    features = data.drop('product', axis=1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=rs)
    
    model = LinearRegression()
    model.fit(features_train, target_train)
    predict = model.predict(features_valid)
    
    # Store validation targets and predictions with a matching 0..n-1 index
    samples_target.append(target_valid.reset_index(drop=True))
    samples_predictions.append(pd.Series(predict))
    
    # Note: the target mean covers the whole region, not only the validation set
    mean_product_target = target.mean()
    mean_product_predict = predict.mean()
    # RMSE on the validation set
    rmse = mean_squared_error(target_valid, predict)**0.5
    
    print('Region', region)
    print('Mean Target :', mean_product_target)
    print('Mean Predict :', mean_product_predict)
    print('Model RMSE :', rmse)

Region 0
Mean Target : 92.50000000000001
Mean Predict : 92.59256778438035
Model RMSE : 37.5794217150813
Region 1
Mean Target : 68.82500000000002
Mean Predict : 68.76995145799754
Model RMSE : 0.889736773768065
Region 2
Mean Target : 95.00000000000004
Mean Predict : 95.087528122523
Model RMSE : 39.958042459521614


**Findings :**

- The three datasets are collected in one list so that the same steps can run for each region in a loop
- Linear regression is trained on 75% of each region, and its predictions are compared with the actual values on the remaining 25%
- Region 1 has a far lower error (RMSE of about 0.89) than regions 0 and 2 (RMSE of about 37.6 and 40.0), so its predictions are much more reliable. In all three regions the mean predicted reserves are close to the mean actual reserves.
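To put the RMSE values in context, they can be compared with the spread of `product` itself. The sketch below is not part of the original notebook: it divides each validation RMSE by the standard deviation reported by `describe()`. That standard deviation covers the full dataset rather than the validation split, so the ratio is only approximate, and region 2 is left out because its `describe()` output is cut off above.

```python
# Validation RMSE printed by the training loop
rmse = {0: 37.5794217150813, 1: 0.889736773768065}

# Standard deviation of `product` from describe() (full dataset, not the validation split)
product_std = {0: 44.288691, 1: 45.944423}

# A ratio near 1 means the model is barely better than always predicting the mean;
# a ratio near 0 means the model explains almost all of the variation.
ratio = {region: rmse[region] / product_std[region] for region in rmse}

for region, value in ratio.items():
    print(f"Region {region}: RMSE / std = {value:.3f}")
```

By this rough measure the region 0 model removes only a small part of the spread in reserves, while the region 1 model removes almost all of it.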

## Basic Profit Calculation

***Include :***
 
- Store the key values for the calculation in separate variables
- Compare the oil reserves of the selected wells with the budget

In [9]:
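sample_point = 500        # points studied per region in each bootstrap sample
sample_bootstrap = 1000   # number of bootstrap samples

budget = 100000000        # USD, budget for developing 200 wells
cost = 500000             # USD per well (budget / points)
product_income = 4500     # USD per unit of product (thousand barrels)
points = 200              # wells selected for development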

In [10]:
sample_region_0 = samples_predictions[0]

best_200 = sample_region_0.sort_values(ascending=False)[:200]
best_product = best_200.sum()
total_income = product_income*best_product
profit = total_income-budget
print('Profit : $', round(profit))

Profit : $ 39960489


**Findings :**

The profit calculated from the predicted reserves is large, under these conditions:
- Region 0 is used as an example
- The 200 wells with the highest predicted reserves are taken from all 25,000 validation wells, not from a sample of 500 points
- The predicted profit is $39,960,489, about 40% of the 100 million USD budget
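The calculation above can be sketched as a small function and checked by hand on a toy list. The function name `top_k_profit` and the toy volumes are hypothetical; the last line only rearranges the profit printed for region 0 to recover the mean predicted reserves of its 200 selected wells.

```python
def top_k_profit(predicted_volumes, k, product_income=4_500, cost_per_well=500_000):
    """Profit from developing the k wells with the highest predicted reserves."""
    best = sorted(predicted_volumes, reverse=True)[:k]
    revenue = product_income * sum(best)
    return revenue - cost_per_well * k

# Toy example: the two best wells hold 150 and 120 thousand barrels
toy_volumes = [120.0, 90.0, 150.0, 110.0]
toy_profit = top_k_profit(toy_volumes, k=2)
print(f"Toy profit: {toy_profit:,.0f} USD")

# Mean predicted reserves implied by the region 0 profit printed above
implied_mean_volume = (39_960_489 + 100_000_000) / 4_500 / 200
print(f"Implied mean volume: {implied_mean_volume:.2f} thousand barrels")
```

The implied mean of about 155.5 thousand barrels per selected well is what drives the large predicted profit.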

## Model Function for Profit Calculation

***Include :***
- Select the wells with the highest predicted reserves
- Sum the actual reserves (target) of the wells selected by prediction

In [25]:
def prediction_profit(predict, name, product_income=4500, budget=100000000, points=200):
    prediction = predict[name]
    predict_best200 = prediction.sort_values(ascending=False)[:points]
    best_of_product = predict_best200.sum()
    total_budget = round(budget)
    income = round(product_income * best_of_product)
    profit = round(income - total_budget)
    
    print(f'Region : {name}')
    print(f'Total Income : {income}')
    print(f'Total Budget : {total_budget}')
    print(f'Profit : {profit}')

In [21]:
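def target_profit(target, predict):
    # Rank wells by predicted reserves, then sum the actual reserves of the top wells.
    # `points`, `product_income` and `budget` are the global variables defined earlier.
    sort_predict = predict.sort_values(ascending=False)
    selected_points = target[sort_predict.index][:points]
    product = selected_points.sum()
    revenue = product * product_income
    cost = budget
    profit = revenue-cost
    return profit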

In [26]:
prediction_profit(predict=samples_predictions, name=0)
prediction_profit(predict=samples_predictions, name=1)
prediction_profit(predict=samples_predictions, name=2)

Region : 0
Total Income : 139960489
Total Budget : 100000000
Profit : 39960489
Region : 1
Total Income : 124873891
Total Budget : 100000000
Profit : 24873891
Region : 2
Total Income : 134224063
Total Budget : 100000000
Profit : 34224063


In [24]:
print("Profit Region 0:", target_profit(samples_target[0], samples_predictions[0]))
print("Profit Region 1:", target_profit(samples_target[1], samples_predictions[1]))
print("Profit Region 2:", target_profit(samples_target[2], samples_predictions[2]))

Profit Region 0: 33208260.43139851
Profit Region 1: 24150866.966815114
Profit Region 2: 25399159.45842947


**Findings :**
- Region 0 has the highest profit of the three, both by predicted reserves and by the actual reserves of the selected wells
- The profit from actual reserves is lower than the predicted profit in every region. The gap is largest in region 2 (about 8.8 million USD), followed by region 0 (about 6.8 million USD) and region 1 (about 0.7 million USD)
- Region 0 is a candidate for further development because it has the highest profit, followed by region 2 and region 1
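The gap between predicted and actual profit can be recomputed directly from the two sets of printed results. The sketch below only repeats that arithmetic; the dictionary names are introduced for illustration and the actual profits are rounded to two decimals.

```python
# Profit from predicted reserves, as printed by prediction_profit
predicted_profit = {0: 39_960_489, 1: 24_873_891, 2: 34_224_063}

# Profit from actual reserves, as printed by target_profit (rounded to cents)
actual_profit = {0: 33_208_260.43, 1: 24_150_866.97, 2: 25_399_159.46}

# How much the prediction overstates the profit in each region
gaps = {region: predicted_profit[region] - actual_profit[region]
        for region in predicted_profit}

for region, gap in gaps.items():
    share = gap / predicted_profit[region]
    print(f"Region {region}: gap {gap:,.0f} USD ({share:.1%} of predicted profit)")
```

The ordering of the gaps mirrors the RMSE values: the region with the most accurate model (region 1) has the smallest gap.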

## Profit and Risk Calculation

***Include :***
- Bootstrapping for the profit distribution
- Mean profit, 95% confidence interval, and risk of loss

In [15]:
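# Assuming the cells ran top to bottom, `predict` still holds the validation
# predictions of the last region trained (region 2); this makes it a Series.
predict = pd.Series(predict)

def bootstrap_profit(predict, name, product_income=4500, budget=100000000, points=200):
    # Profit estimated from predicted reserves only; `name` is accepted but not used.
    predict_best200 = predict.sort_values(ascending=False)[:points]
    best_product = predict_best200.sum()
    total_budget = budget
    total_income = product_income * best_product
    profit = total_income - total_budget
    return profit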

In [16]:
def profit(target, predict):
    sort_predict = predict.sort_values(ascending=False)
    selected_points = target[sort_predict.index][:points]
    product = selected_points.sum()
    revenue = product * product_income
    cost = budget
    profit = revenue - cost
    return profit

In [17]:
for region in range(3):
    
    target = samples_target[region]
    predictions = samples_predictions[region]
    
    profit_values = []
    
    for i in range(sample_bootstrap):
        sample_target = target.sample(sample_point, replace=True, random_state=rs)
        # Note: this indexes the global `predict` (region 2 predictions),
        # not this region's `predictions` defined above.
        sample_predictions = predict[sample_target.index]
        profit_values.append(bootstrap_profit(predict=sample_predictions, name=region))
        
    profit_values = pd.Series(profit_values)
    mean_profit = profit_values.mean()
    confidence_interval = (profit_values.quantile(0.025), profit_values.quantile(0.975))
    loss_risk = (profit_values < 0).mean()
    
    print('Region', region)
    print('Mean Profit :', mean_profit, 'USD')
    print('Confidence Interval (95%) :', confidence_interval)
    print('Risk of Loss :', f'{loss_risk:.1%}')

Region 0
Mean Profit : 2832358.6371044866 USD
Confidence Interval (95%) : (954186.401231312, 4721774.128740624)
Risk of Loss : 0.2%
Region 1
Mean Profit : 2863384.024266842 USD
Confidence Interval (95%) : (852630.2807096888, 4869479.598859357)
Risk of Loss : 0.2%
Region 2
Mean Profit : 2852822.3719312106 USD
Confidence Interval (95%) : (925963.4941324021, 4786743.6800649315)
Risk of Loss : 0.2%


In [18]:
for region in range(3):
    
    target = samples_target[region]
    predictions = samples_predictions[region]
    
    profit_values = []
    
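    for i in range(sample_bootstrap):
        sample_target = target.sample(sample_point, replace=True, random_state=rs)
        # Note: this indexes the global `predict` (region 2 predictions),
        # not this region's `predictions` defined above.
        sample_predictions = predict[sample_target.index]
        profit_values.append(profit(sample_target, sample_predictions))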
        
    profit_values = pd.Series(profit_values)
    mean_profit = profit_values.mean()
    confidence_interval = (profit_values.quantile(0.025), profit_values.quantile(0.975))
    loss_risk = (profit_values < 0).mean()
    
    print('Region', region)
    print('Mean Profit :', mean_profit, 'USD')
    print('Confidence Interval (95%) :', confidence_interval)
    print('Risk of Loss :', f'{loss_risk:.1%}')

Region 0
Mean Profit : -17525408.48674446 USD
Confidence Interval (95%) : (-23365818.813181605, -11655551.705542097)
Risk of Loss : 100.0%
Region 1
Mean Profit : -37620266.894872464 USD
Confidence Interval (95%) : (-43376475.19191398, -31941735.61546329)
Risk of Loss : 100.0%
Region 2
Mean Profit : 3646019.9108492387 USD
Confidence Interval (95%) : (-1619638.6069313644, 9295097.313283822)
Risk of Loss : 9.7%


**Findings :**

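Estimating the distribution of profit and the risk of loss is needed to choose a region under the project conditions. The results are:
- With `bootstrap_profit` (profit from predicted reserves), regions 0, 1 and 2 give nearly identical results: a mean profit of about 2.8 million USD with a 95% interval of roughly 0.9 to 4.8 million USD. This similarity follows from the code, because the loop indexes the global `predict` Series (the region 2 predictions) rather than each region's own `predictions`
- The risk of loss in that calculation is the same for all three regions: a fraction of 0.002, that is 0.2% (2 of 1,000 bootstrap samples). The profit is small compared with the 100 million USD budget
- With `profit` (profit from the actual reserves of the selected wells), only region 2 shows a positive mean profit, while regions 0 and 1 show a loss in every sample (100% risk). For regions 0 and 1 the wells are ranked by the region 2 predictions, which are unrelated to their targets, so these two results do not measure those regions' own models
- Region 2 gives a mean profit of about 3.65 million USD, but its risk of loss is 0.097 as a fraction, that is 9.7%, and its 95% interval (about -1.6 to 9.3 million USD) includes losses. Region 2 is the only region that is profitable in both calculations, yet its risk is above the 2.5% limit in the conditions, so as computed here it does not meet the selection criterion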
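A per-region bootstrap can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's code: the function names `profit_from_selection` and `bootstrap_region`, the NumPy-based sampling, and the synthetic `target` and `predicted` arrays are all assumptions made for the example. Each iteration samples 500 positions with replacement, ranks the sampled wells by the same region's predictions, sums the actual reserves of the top 200, and the risk of loss is reported as a percentage.

```python
import numpy as np

def profit_from_selection(target, predicted, points=200,
                          product_income=4_500, budget=100_000_000):
    """Profit from the actual reserves of the wells ranked best by prediction."""
    top = np.argsort(predicted)[::-1][:points]   # positions of the highest predictions
    return product_income * target[top].sum() - budget

def bootstrap_region(target, predicted, n_boot=1000, sample_size=500, seed=12345):
    """Bootstrap the profit distribution for one region."""
    rng = np.random.default_rng(seed)
    profits = np.empty(n_boot)
    for i in range(n_boot):
        # Sample positions with replacement; the same positions index both arrays,
        # so every target stays paired with its own prediction.
        idx = rng.integers(0, len(target), size=sample_size)
        profits[i] = profit_from_selection(target[idx], predicted[idx])
    return {
        "mean_profit": profits.mean(),
        "ci_95": (np.quantile(profits, 0.025), np.quantile(profits, 0.975)),
        "loss_risk": (profits < 0).mean(),       # fraction of samples with a loss
    }

# Synthetic stand-in for one region's validation set (assumption, not project data)
data_rng = np.random.default_rng(0)
target = data_rng.uniform(0, 190, size=25_000)
predicted = target + data_rng.normal(0, 40, size=25_000)

result = bootstrap_region(target, predicted)
print(f"Mean profit: {result['mean_profit']:,.0f} USD")
print(f"95% interval: {result['ci_95'][0]:,.0f} to {result['ci_95'][1]:,.0f} USD")
print(f"Risk of loss: {result['loss_risk']:.1%}")
```

Positional indexing is used so that wells drawn more than once stay as separate rows. The numbers printed here describe the synthetic data only and say nothing about the three real regions.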