# Finding a Suitable Location for Drilling New Oil Wells

## Content <a id='contents'></a>


* [2 Overview](#overview)
    * [2.1 Introduction](#intro)
    * [2.2 Data Description](#data_description)
    * [2.3 Goals](#goals)

* [3 Data Preprocessing](#data_preprocessing)
    * [3.1 Load Data](#load_data)
    * [3.2 Initial Data Exploration](#initial_data_exploration)
    * [3.3 Initial Summary](#initial_summary)
    
* [4 Train and Test Model for Each Region](#train_data_and_test_model)
    * [4.1 Splitting Data into Training Set and Validation Set](#split_data)
    * [4.2 Model Train and Generate Prediction for Validation Dataset](#train_model)
    * [4.3 Save Prediction and Correct Validation Dataset](#save_valid_answer)
    * [4.4 Average Product Volume and RMSE Model](#mean_product_volume)
    * [4.5 Initial Analysis](#initial_analysis)

* [5 Initial Profit Calculation](#initial_profit_calculation)
    * [5.1 Main Variable](#key_variable)
    * [5.2 Volume of Oil Reserves Sufficient to Develop a New Well](#minimum_oil_volume)
    * [5.3 Subsection Conclusion](#subsection_conclusion)


* [6 Profit Calculation](#profit_calculation)

* [7 Risk and Return](#risk_and_return)

* [8 Summary](#summary)

## Overview <a id='overview'></a>

### Introduction <a id='intro'></a>

As a data scientist at `OilyGiant`, the task is to find suitable locations for drilling new oil wells. The available data consists of oil samples from three regions. This project builds a model that helps select the region with the highest profit margin. Potential profit and risk are analyzed with the `bootstrapping` technique.

### Data Description <a id='data_description'></a>

`id`: unique oil well identifier

`f0`, `f1`, `f2`: three features of each point

`product`: volume of oil reserves in the well (thousand barrels)

### Goals <a id='goals'></a>

The goal of this project is to find a suitable location for drilling new oil wells.

Steps to take:

1. Train and test a model for each region:
- Split the data into a training set and a validation set at a 75:25 ratio.
- Train the model and make predictions for the validation set.
- Save the predictions and the correct answers for the validation set.
- Print the average predicted volume of reserves and the model RMSE.
- Analyze the results.
2. Prepare for the profit calculation:
- Store all key values for the calculation in separate variables.
- Calculate the volume of reserves sufficient to develop a new well without losses.
- Compare this value with the average volume of reserves in each region.
- Present the findings from the preparation step.
3. Write a function to calculate profit from a set of selected wells and model predictions:
- Pick the wells with the highest predicted values.
- Sum the target volume of reserves corresponding to these predictions.
- Suggest a region for well development, justify the choice, and calculate the profit for the obtained volume of reserves.
4. Calculate risks and profit for each region:
- Use bootstrapping with 1,000 samples to find the distribution of profit.
- Find the average profit, the 95% confidence interval, and the risk of losses. A loss is a negative profit; compute its probability and express it as a percentage.


## Data Preprocessing <a id='data_preprocessing'></a>

In [112]:
# Load all libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from scipy import stats as st 

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, f1_score, roc_auc_score, make_scorer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

from sklearn.utils import shuffle

import warnings
warnings.filterwarnings('ignore')

In [113]:
# Load the data files into DataFrames
geo_data0 = pd.read_csv('/datasets/geo_data_0.csv')
geo_data1 = pd.read_csv('/datasets/geo_data_1.csv')
geo_data2 = pd.read_csv('/datasets/geo_data_2.csv')

### Initial Data Exploration <a id='initial_data_exploration'></a>

In [114]:
# Show a random sample of rows for a quick look at the data
geo_data0.sample(5)

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 55112 | tBFQa | 2.014725 | 0.221052 | 3.789812 | 143.022204 |
| 53922 | Dcqbg | 0.620338 | -0.564341 | 7.234352 | 73.526752 |
| 97176 | 13ijC | -0.77271 | 0.365721 | -5.284948 | 22.556509 |
| 41266 | hz8nW | 1.681697 | -0.10249 | -1.586632 | 116.987499 |
| 96971 | VpGzj | 0.682694 | 0.670935 | 1.413691 | 127.75823 |


In [115]:
# Show general information about the DataFrame
geo_data0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [116]:
# Show summary statistics for the numeric columns
geo_data0.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.500419 | 0.250143 | 2.502647 | 92.5 |
| std | 0.871832 | 0.504433 | 3.248248 | 44.288691 |
| min | -1.408605 | -0.848218 | -12.088328 | 0.0 |
| 25% | -0.07258 | -0.200881 | 0.287748 | 56.497507 |
| 50% | 0.50236 | 0.250252 | 2.515969 | 91.849972 |
| 75% | 1.073581 | 0.700646 | 4.715088 | 128.564089 |
| max | 2.362331 | 1.343769 | 16.00379 | 185.364347 |


In [117]:
# Show a random sample of rows for a quick look at the data
geo_data1.sample(5)

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 23709 | ANKdO | -2.640818 | -11.215067 | 3.004322 | 84.038886 |
| 51874 | Ct1AB | 7.500104 | -4.481511 | 4.009053 | 107.813044 |
| 80987 | n15A1 | 2.783969 | 0.551256 | -0.003467 | 0.0 |
| 4694 | zH1yi | -2.46321 | -5.194706 | 5.002314 | 137.945408 |
| 58608 | QsHjk | -15.553805 | -6.134755 | 3.996667 | 110.992147 |


In [118]:
# Show general information about the DataFrame
geo_data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [119]:
# Show summary statistics for the numeric columns
geo_data1.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 1.141296 | -4.796579 | 2.494541 | 68.825 |
| std | 8.965932 | 5.119872 | 1.703572 | 45.944423 |
| min | -31.609576 | -26.358598 | -0.018144 | 0.0 |
| 25% | -6.298551 | -8.267985 | 1.000021 | 26.953261 |
| 50% | 1.153055 | -4.813172 | 2.011479 | 57.085625 |
| 75% | 8.621015 | -1.332816 | 3.999904 | 107.813044 |
| max | 29.421755 | 18.734063 | 5.019721 | 137.945408 |


In [120]:
# Show a random sample of rows for a quick look at the data
geo_data2.sample(5)

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 89109 | ZgBgt | -1.547086 | -1.101981 | -3.677697 | 18.323167 |
| 22039 | ugJHH | -2.385558 | -0.637016 | 2.322332 | 146.08892 |
| 62829 | hVsf9 | 0.85086 | 3.504191 | 3.831136 | 175.211766 |
| 40279 | BV5I2 | -0.270028 | -4.875826 | 2.330012 | 110.291518 |
| 73205 | MRkIi | 1.450782 | 0.237289 | 6.47941 | 75.210609 |


In [121]:
# Show general information about the DataFrame
geo_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [122]:
# Show summary statistics for the numeric columns
geo_data2.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.002023 | -0.002081 | 2.495128 | 95.0 |
| std | 1.732045 | 1.730417 | 3.473445 | 44.749921 |
| min | -8.760004 | -7.08402 | -11.970335 | 0.0 |
| 25% | -1.162288 | -1.17482 | 0.130359 | 59.450441 |
| 50% | 0.009424 | -0.009482 | 2.484236 | 94.925613 |
| 75% | 1.158535 | 1.163678 | 4.858794 | 130.595027 |
| max | 7.238262 | 7.844801 | 16.739402 | 190.029838 |


### Initial Summary <a id='initial_summary'></a>

Insights:

1. The data is complete: there are no null values and all column types are correct.
2. On average, points in `geo_data0` (mean 92.5) and `geo_data2` (mean 95.0) hold larger oil reserves than points in `geo_data1` (mean 68.8).
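
The completeness check behind these insights can be written as a small helper. This is a minimal sketch, assuming each region's frame has the columns listed under Data Description; `summarize_region` and the toy frame are hypothetical names and values introduced only for illustration.

```python
import pandas as pd

def summarize_region(df: pd.DataFrame) -> dict:
    """Return basic quality indicators for one region's DataFrame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicated_ids": int(df["id"].duplicated().sum()),
        "mean_product": float(df["product"].mean()),
    }

# Toy frame standing in for a region (all values are made up)
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3", "a1"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "f1": [1.0, 1.1, 1.2, 1.3],
    "f2": [2.0, 2.1, 2.2, 2.3],
    "product": [10.0, 20.0, 30.0, 40.0],
})
summary = summarize_region(toy)
print(summary)
```

Applying the same helper to `geo_data0`, `geo_data1`, and `geo_data2` would also reveal repeated well IDs, which `info()` alone does not show.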

## Train and Test Model for Each Region <a id='train_data_and_test_model'></a>

### Splitting Data into Training Set and Validation Set <a id='split_data'></a>

In [123]:
# Split a region's data into training and validation sets (75:25)
def split_data(data):
    features = data.drop(['product','id'], axis=1)
    target = data['product']
    
    features_train, features_valid, target_train, target_valid = train_test_split(features, target, 
                                                                                  random_state=12345, 
                                                                                  test_size=0.25)
    return features_train, features_valid, target_train, target_valid

In [124]:
# Call split_data for each region
features_train_0, features_valid_0, target_train_0, target_valid_0 = split_data(geo_data0)
features_train_1, features_valid_1, target_train_1, target_valid_1 = split_data(geo_data1)
features_train_2, features_valid_2, target_train_2, target_valid_2 = split_data(geo_data2)

In [125]:
# Checking split_data step
print(features_train_0.shape)
print(features_valid_0.shape)
print(target_train_0.shape)
print(target_valid_0.shape)

(75000, 3)
(25000, 3)
(75000,)
(25000,)


In [126]:
# Checking split_data step
print(features_train_1.shape)
print(features_valid_1.shape)
print(target_train_1.shape)
print(target_valid_1.shape)

(75000, 3)
(25000, 3)
(75000,)
(25000,)


In [127]:
# Checking split_data step
print(features_train_2.shape)
print(features_valid_2.shape)
print(target_train_2.shape)
print(target_valid_2.shape)

(75000, 3)
(25000, 3)
(75000,)
(25000,)


In [128]:
geo_data_all = [
    geo_data0.drop('id', axis = 1),
    geo_data1.drop('id', axis = 1),
    geo_data2.drop('id', axis = 1),]

In [129]:
state = np.random.RandomState(12345)

samples_target = []
samples_predictions = []

for region in range(len(geo_data_all)):
    data  = geo_data_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = LinearRegression()
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

    samples_target.append(target_valid.reset_index(drop = True))
    samples_predictions.append(pd.Series(predictions))

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 37.5794217150813

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.889736773768065

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.958042459521614
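
The loop above passes one shared `np.random.RandomState` instance to `train_test_split`, whereas `split_data` passes the integer seed `12345`. The two are not equivalent: a shared instance advances with every call, so the regions are split with different random draws. A minimal sketch on synthetic data (the array `X` is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # synthetic feature column

# Integer seed: every call reproduces the same split
_, valid_a = train_test_split(X, test_size=0.25, random_state=12345)
_, valid_b = train_test_split(X, test_size=0.25, random_state=12345)

# Shared RandomState instance: each call consumes random numbers,
# so the second call yields a different split
state = np.random.RandomState(12345)
_, valid_c = train_test_split(X, test_size=0.25, random_state=state)
_, valid_d = train_test_split(X, test_size=0.25, random_state=state)

same_with_int = np.array_equal(valid_a, valid_b)
same_with_instance = np.array_equal(valid_c, valid_d)
print(same_with_int, same_with_instance)
```

This is why the validation sets built in the loop need not match `features_valid_0`, `features_valid_1`, and `features_valid_2` row for row.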



<div class="alert alert-success">
<b>Code Reviewer's comment v.1</b> <a class="tocSkip"></a>

Good, the code follows the project instructions.

</div>

### Model Train and Generate Prediction for Validation Datasets <a id='train_model'></a>

In [130]:
# Tune LinearRegression hyperparameters with a grid search
def fit(features_train, target_train, features_valid, target_valid):
    param_grid = {'fit_intercept': [True, False],
                  'copy_X' : [True, False],
                  'n_jobs': [1, 2, -1],
                  'positive': [True, False],
    }

    # LinearRegression has no random_state parameter, so it is created with defaults
    lr = LinearRegression()
    rmse_scorer = make_scorer(lambda target_valid, target_pred: 
                              np.sqrt(mean_squared_error(target_valid, target_pred)), 
                              greater_is_better=False)
    
    grid = GridSearchCV(estimator = lr, 
                        param_grid = param_grid, 
                        cv = 5, 
                        scoring = rmse_scorer)
    
    grid.fit(features_train, target_train)

    best_params = grid.best_params_
    best_rmse = -grid.best_score_
    target_pred = grid.predict(features_valid)
    mean_product = target_pred.mean()
    model_rmse = mean_squared_error(target_valid, target_pred)**0.5
    
    return best_params, best_rmse, target_pred, mean_product, model_rmse

In [131]:
# Tune Ridge hyperparameters (solver and alpha) with a grid search
def fit2(features_train, target_train, features_valid, target_valid):
    # 'lbfgs' is not listed because Ridge supports it only with positive=True
    param_grid = {
                  'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
                  'alpha': [0.01, 0.1, 1, 10, 100],
    }

    # Pass the seed by keyword; the first positional argument of Ridge is alpha
    r = Ridge(random_state = 12345)
    rmse_scorer = make_scorer(lambda target_valid, target_pred: 
                              np.sqrt(mean_squared_error(target_valid, target_pred)), 
                              greater_is_better=False)
    
    grid = GridSearchCV(estimator = r, 
                        param_grid = param_grid, 
                        cv = 5, 
                        scoring = rmse_scorer)
    
    grid.fit(features_train, target_train)

    best_params = grid.best_params_
    best_rmse = -grid.best_score_
    target_pred = grid.predict(features_valid)
    mean_product = target_pred.mean()
    model_rmse = mean_squared_error(target_valid, target_pred)**0.5
    
    return best_params, best_rmse, target_pred, mean_product, model_rmse
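
The custom `make_scorer` lambda can be replaced by scikit-learn's built-in `'neg_root_mean_squared_error'` scoring string. The following is a minimal sketch on synthetic data, not the notebook's tuning run; the arrays `X` and `y` and the reduced parameter grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))  # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

grid = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
grid.fit(X, y)

best_rmse = -grid.best_score_  # flip the sign to get a positive RMSE
print(grid.best_params_, best_rmse)
```

Because the scorer returns negated RMSE, the sign flip mirrors the `-grid.best_score_` line used in `fit` and `fit2`.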

In [132]:
# Train and tune the model on geo_data0
best_params_0, best_rmse_0, target_pred_0, mean_product_0, model_rmse_0 = fit2(features_train_0, 
                                                                              target_train_0, 
                                                                              features_valid_0, 
                                                                              target_valid_0)

In [133]:
# Best parameters for geo_data0
best_params_0

{'alpha': 0.01, 'solver': 'sag'}

In [134]:
# Best cross-validated RMSE for geo_data0
best_rmse_0

37.732360614259214

In [135]:
# Train and tune the model on geo_data1
best_params_1, best_rmse_1, target_pred_1, mean_product_1, model_rmse_1 = fit2(features_train_1, 
                                                                              target_train_1, 
                                                                              features_valid_1, 
                                                                              target_valid_1)

In [136]:
# Best parameters for geo_data1
best_params_1

{'alpha': 0.1, 'solver': 'lsqr'}

In [137]:
# Best cross-validated RMSE for geo_data1
best_rmse_1

0.8895409504762819

In [138]:
# Train and tune the model on geo_data2
best_params_2, best_rmse_2, target_pred_2, mean_product_2, model_rmse_2 = fit2(features_train_2, 
                                                                              target_train_2, 
                                                                              features_valid_2, 
                                                                              target_valid_2)

In [139]:
# Best parameters for geo_data2
best_params_2

{'alpha': 1, 'solver': 'sag'}

In [140]:
# Best cross-validated RMSE for geo_data2
best_rmse_2

40.06572229753867

In [141]:
geo_data_all = [
    geo_data0.drop('id', axis = 1),
    geo_data1.drop('id', axis = 1),
    geo_data2.drop('id', axis = 1),
]

In [142]:
state = np.random.RandomState(12345)

samples_target = []
samples_predictions = []

for region in range(len(geo_data_all)):
    data  = geo_data_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = LinearRegression(copy_X = True, 
                             fit_intercept = True, 
                             n_jobs = 1, 
                             positive = False)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

    samples_target.append(target_valid.reset_index(drop = True))
    samples_predictions.append(pd.Series(predictions))

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 37.5794217150813

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.889736773768065

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.958042459521614



In [143]:
state = np.random.RandomState(12345)

samples_target_2 = []
samples_predictions_2 = []

for region in range(len(geo_data_all)):
    data  = geo_data_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = Ridge(alpha = 1, solver = 'sag')
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

    # Store the Ridge results in their own lists
    samples_target_2.append(target_valid.reset_index(drop = True))
    samples_predictions_2.append(pd.Series(predictions))

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 37.57943025886763

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.8917228822685093

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.95802510571808



In [144]:
state = np.random.RandomState(12345)

samples_target_3 = []
samples_predictions_3 = []

for region in range(len(geo_data_all)):
    data  = geo_data_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = RandomForestRegressor()
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

    # Store the random forest results in their own lists
    samples_target_3.append(target_valid.reset_index(drop = True))
    samples_predictions_3.append(pd.Series(predictions))

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 38.79194400862664

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.7465390957112833

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.285687565849244



### Save Prediction and Correct Validation Dataset <a id='save_valid_answer'></a>

In [145]:
target_pred_geo_data0 = pd.DataFrame(target_pred_0, columns = ['product'])
target_pred_geo_data1 = pd.DataFrame(target_pred_1, columns = ['product'])
target_pred_geo_data2 = pd.DataFrame(target_pred_2, columns = ['product'])
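
The DataFrames above hold predictions only, on a fresh 0..n-1 index, while `target_valid_0` and its siblings keep the shuffled labels from the split. One way to keep predictions and correct answers together is a single frame that reuses the target's index. This is a sketch with made-up values; `results` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy validation target whose index mimics the shuffled labels left by train_test_split
target_valid = pd.Series([80.0, 120.0, 95.0], index=[41266, 7, 96971], name="product")
predictions = np.array([85.5, 110.2, 99.1])  # made-up model output, same row order

# Reuse the target's index so each prediction stays attached to its well
results = pd.DataFrame(
    {"target": target_valid.to_numpy(), "prediction": predictions},
    index=target_valid.index,
)
print(results)
```

With both columns on one index, later steps can sort by `prediction` and read the matching `target` without any realignment.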

### Average Product Volume and RMSE Model <a id='mean_product_volume'></a>

In [146]:
print("Region 0 Product Mean", target_pred_geo_data0.mean())
print("Region 0 Model RMSE:", model_rmse_0)
print()
print("Region 1 Product Mean", target_pred_geo_data1.mean())
print("Region 1 Model RMSE:", model_rmse_1)
print()
print("Region 2 Product Mean", target_pred_geo_data2.mean())
print("Region 2 Model RMSE:", model_rmse_2)

Region 0 Product Mean product    92.592458
dtype: float64
Region 0 Model RMSE: 37.57974697680131

Region 1 Product Mean product    68.728547
dtype: float64
Region 1 Model RMSE: 0.8930993684427244

Region 2 Product Mean product    94.965103
dtype: float64
Region 2 Model RMSE: 40.029761131481024


In [147]:
print("Actual Region 0 Product Mean", target_valid_0.mean())
print()
print("Actual Region 1 Product Mean", target_valid_1.mean())
print()
print("Actual Region 2 Product Mean", target_valid_2.mean())

Actual Region 0 Product Mean 92.07859674082927

Actual Region 1 Product Mean 68.72313602435997

Actual Region 2 Product Mean 94.88423280885438


<div class="alert alert-success">
<b>Code Reviewer's comment v.1</b> <a class="tocSkip"></a>

Good, the code follows the project instructions.

</div>

### Initial Analysis <a id='initial_analysis'></a>

**Insights:**

1. `Region 2` has the highest average predicted product, but also the highest RMSE, 40.03.
2. `Region 0` has an average predicted product close to that of Region 2, with a lower RMSE (37.58). On that basis, Region 0 is the suggested choice at this stage.
3. `Region 1` has the lowest average predicted product, but also the lowest RMSE (0.89). If the smallest prediction error is the priority, Region 1 can be chosen.
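
An RMSE is easier to judge next to a baseline. A model that always predicts the mean has an RMSE equal to the standard deviation of the target, which `describe()` reports as roughly 44.3, 45.9, and 44.7 for regions 0, 1, and 2. Against that baseline, the models for regions 0 and 2 improve only modestly, while the model for region 1 is far more accurate. A minimal sketch of the identity, using made-up targets and a hypothetical `rmse` helper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([10.0, 20.0, 30.0, 40.0])     # made-up targets
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
baseline_rmse = rmse(y_true, baseline)

# Both values equal the population standard deviation of the targets
print(baseline_rmse, y_true.std())
```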

## Initial Profit Calculation  <a id='initial_profit_calculation'></a>

### Main Variable <a id='key_variable'></a>

In [148]:
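# Budget and revenue parameters
total_cost = 100000000                       # total budget for well development, USD
total_oil_well = 200                         # number of wells to develop
cost_per_well = total_cost / total_oil_well  # USD per well
income = 4500                                # revenue per unit of product, USD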

### Volume of Oil Reserves Sufficient to Develop a New Well <a id='minimum_oil_volume'></a>

In [149]:
min_oil_vol = cost_per_well / income
print("Oil Reserves Sufficient to Develop a New Well", np.ceil(min_oil_vol))

Oil Reserves Sufficient to Develop a New Well 112.0
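
The break-even arithmetic can be spelled out step by step. This sketch reuses the budget, well count, and revenue per unit from the cells above; the upper-case constant names are introduced here for illustration.

```python
import math

BUDGET = 100_000_000     # total development budget, USD
WELLS = 200              # number of wells funded by the budget
REVENUE_PER_UNIT = 4500  # USD per unit of product

cost_per_well = BUDGET / WELLS                      # 500,000 USD per well
break_even_volume = cost_per_well / REVENUE_PER_UNIT

print(round(break_even_volume, 2))   # 111.11
print(math.ceil(break_even_volume))  # 112
```

The exact break-even volume is about 111.1 units per well; 112 is that value rounded up.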


In [150]:
target_pred_geo_data0.describe()

| | product |
|---|---|
| count | 25000.0 |
| mean | 92.592458 |
| std | 23.152469 |
| min | -9.245837 |
| 25% | 76.672447 |
| 50% | 92.657715 |
| 75% | 108.415672 |
| max | 180.079697 |


In [151]:
target_pred_geo_data1.describe()

| | product |
|---|---|
| count | 25000.0 |
| mean | 68.728547 |
| std | 46.010204 |
| min | -1.893744 |
| 25% | 28.53668 |
| 50% | 57.851592 |
| 75% | 109.346467 |
| max | 139.818939 |


In [152]:
target_pred_geo_data2.describe()

| | product |
|---|---|
| count | 25000.0 |
| mean | 94.965103 |
| std | 19.847116 |
| min | 17.157966 |
| 25% | 81.394447 |
| 50% | 95.031146 |
| 75% | 108.488912 |
| max | 165.837018 |


In [153]:
print(target_pred_geo_data0.quantile(0.8))
print(target_pred_geo_data1.quantile(0.832))
print(target_pred_geo_data2.quantile(0.81))

product    112.339468
Name: 0.8, dtype: float64
product    112.146728
Name: 0.832, dtype: float64
product    112.592494
Name: 0.81, dtype: float64
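
Rather than searching for the quantile that lands near 112, the share of wells at or above the threshold can be computed directly. This is a minimal sketch with made-up predictions; `THRESHOLD` and `predicted` are hypothetical names.

```python
import pandas as pd

THRESHOLD = 112  # break-even volume rounded up

# Made-up predictions standing in for one region's validation output
predicted = pd.Series([90.0, 115.0, 130.0, 70.0, 112.0])

# The mean of a boolean mask is the fraction of True values
share_above = (predicted >= THRESHOLD).mean()
print(f"{share_above:.1%} of wells reach the threshold")
```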


In [154]:
# Calculate profit for Region 0
top_200_product = pd.Series(target_pred_0).sort_values(ascending = False)[:200]
total_product = top_200_product.sum()
total_income = income * total_product
profit = total_income - total_cost
print("Profit", round(profit), "USD")
print()

Profit 39900638 USD



In [155]:
# Calculate profit for Region 1
top_200_product = pd.Series(target_pred_1).sort_values(ascending = False )[:200]
total_product = top_200_product.sum()
total_income = income * total_product
profit = total_income - total_cost
print("Profit", round(profit), "USD")
print()

Profit 24857093 USD



In [156]:
# Calculate profit for Region 2
top_200_product = pd.Series(target_pred_2).sort_values(ascending = False)[:200]
total_product = top_200_product.sum()
total_income = income * total_product
profit = total_income - total_cost
# calculate_profit in the Profit Calculation section reports this figure rounded to about 33 million USD
print("Profit", round(profit), "USD")
print()



### Subsection Conclusion <a id='subsection_conclusion'></a>

**Insights:**

1. A well must hold at least about 111.1 thousand barrels of oil (112 after rounding up) for the investment to break even.
2. In each region, roughly 17% to 20% of the predicted volumes in the validation set exceed this threshold: the value 112 sits near the 0.80, 0.832, and 0.81 quantiles for regions 0, 1, and 2.
3. If the 200 wells with the highest predictions are developed in each region, the highest profit comes from region 0, at almost 40 million USD.

## Profit Calculation <a id='profit_calculation'></a>

In [157]:
def calculate_profit(prediction, name, income = 4500, total_cost = 100000000, points = 200):
    # Keep the `points` wells with the highest predicted volume
    predict_top200 = prediction.sort_values(ascending = False, by = 'product')[:points]
    # Sum the column as a scalar so the printed values are plain numbers
    product = predict_top200['product'].sum()
    # All amounts are reported in millions of USD
    total_cost = round(total_cost / 1000000)
    total_income = int(round(income * product / 1000000))
    profit = total_income - total_cost
    print('-------------------')
    print(f'Profitability Geo Data {name}')
    print(f'Total Income: {total_income}')
    print(f'Total Cost  : {total_cost}')
    print(f'Profit      : {profit}', 'M USD')

In [158]:
calculate_profit(prediction = target_pred_geo_data0, name = 0)
calculate_profit(prediction = target_pred_geo_data1, name = 1)
calculate_profit(prediction = target_pred_geo_data2, name = 2)

-------------------
Profitability Geo Data 0
Total Income: 140
Total Cost  : 100
Profit      : 40 M USD
-------------------
Profitability Geo Data 1
Total Income: 125
Total Cost  : 100
Profit      : 25 M USD
-------------------
Profitability Geo Data 2
Total Income: 133
Total Cost  : 100
Profit      : 33 M USD


**Insights:**

1. Investing in the 200 wells with the highest predictions is still profitable in all three regions.
2. Region 0 yields the highest profit.
3. Region 0 also has the largest share of points above 112.

## Risk and Return <a id='risk_and_return'></a>

In [159]:
SAMPLE_SIZE = 500
BOOTSTRAP_SIZE = 1000

BUDGET = 100000000
COST_PER_POINT = 500000
POINTS_PER_BUDGET = BUDGET // COST_PER_POINT

PRODUCT_PRICE = 4500
POINTS_PER_BUDGET

200

In [160]:
def calculate_profit_bootstrap(prediction, name, income = 4500, total_cost = 100000000, points = 200):
    # Profit implied by the `points` highest predicted volumes
    # (`name` is accepted only to match the signature of calculate_profit)
    predict_top200 = prediction.sort_values(ascending = False)[:points]
    product = predict_top200.sum()
    total_income = income * product
    profit = total_income - total_cost
    return profit

In [161]:
def profit(target, predictions):
    prediction_sorted = predictions.sort_values(ascending = False)
    selected_points = target[prediction_sorted.index][:POINTS_PER_BUDGET]
    product = selected_points.sum()
    revenue = product * PRODUCT_PRICE
    cost = BUDGET
    return revenue - cost
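
One caveat applies when `profit` receives a bootstrap sample drawn with replacement: index labels repeat, and a label-based lookup such as `target[prediction_sorted.index]` returns every row that carries a requested label, so the selection can hold more rows than intended before it is cut to `POINTS_PER_BUDGET`. The sketch below shows the effect on toy data and a positional alternative. `profit_positional` is a hypothetical helper, not the notebook's function, and all numbers are made up.

```python
import numpy as np
import pandas as pd

def profit_positional(target, predictions, points, price, budget):
    """Profit of the `points` wells with the highest predictions.

    Rows are matched by position, so repeated index labels
    (as produced by sampling with replacement) are counted once each.
    """
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    top = np.argsort(predictions)[::-1][:points]  # positions of the best predictions
    return float(target[top].sum() * price - budget)

# Toy bootstrap sample in which label 5 was drawn twice
target_sample = pd.Series([100.0, 100.0, 50.0], index=[5, 5, 9])
predictions_sample = pd.Series([110.0, 110.0, 40.0], index=[5, 5, 9])

# Label-based lookup: two requested labels expand to four rows
top_labels = predictions_sample.sort_values(ascending=False).index[:2]
expanded = target_sample.loc[top_labels]
print(len(expanded))

result = profit_positional(target_sample, predictions_sample,
                           points=2, price=4500, budget=500_000)
print(result)
```

Resetting the index of both sampled series before calling `profit` would be another way to avoid repeated labels.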

In [162]:
for region in range(3):

    target = samples_target[region]
    predictions = samples_predictions[region]

    profit_values = []
    
    for i in range(BOOTSTRAP_SIZE):
        target_sample = target.sample(SAMPLE_SIZE, replace = True, random_state = state)
        predictions_sample = predictions[target_sample.index]
        #profit_values.append(calculate_profit_bootstrap(prediction = predictions_sample, name = region))
        profit_values.append(profit(target_sample, predictions_sample))

    profit_values = pd.Series(profit_values)

    mean_profit = profit_values.mean()
    confidence_interval = (profit_values.quantile(0.025), profit_values.quantile(0.975))
    negative_profit_chance = (profit_values < 0).mean()
    
    print("—Region", region, "—")
    print("Mean profit =", round(mean_profit), "USD")
    print("95% confidence interval:", confidence_interval)
    print("Risk of losses =", negative_profit_chance * 100, "%")
    print()


—Region 0 —
Mean profit = 4238972 USD
95% confidence interval: (-761878.1389036368, 9578465.319517836)
Risk of losses = 4.8 %

—Region 1 —
Mean profit = 5132567 USD
95% confidence interval: (1080668.9523396173, 9285744.392324952)
Risk of losses = 0.6 %

—Region 2 —
Mean profit = 3811204 USD
95% confidence interval: (-1428006.300878686, 8933805.657503996)
Risk of losses = 7.3999999999999995 %
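
The bootstrap procedure can also be sketched with NumPy alone. The version below follows the same idea as the loop above (sample 500 wells with replacement, keep the 200 best predictions, sum their actual volumes) but uses `np.random.default_rng` instead of the notebook's `RandomState` and positional indexing instead of index labels. `bootstrap_profit` and the synthetic region are assumptions for illustration, so the printed numbers are not comparable with the results above.

```python
import numpy as np

def bootstrap_profit(target, predictions, n_boot=1000, sample_size=500,
                     points=200, price=4500, budget=100_000_000, seed=12345):
    """Return one profit value per bootstrap iteration."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    profits = np.empty(n_boot)
    for i in range(n_boot):
        # Draw well positions with replacement
        idx = rng.integers(0, len(target), size=sample_size)
        # Keep the positions with the highest predictions inside the sample
        top = idx[np.argsort(predictions[idx])[::-1][:points]]
        # Profit is based on the actual volume of the selected wells
        profits[i] = target[top].sum() * price - budget
    return profits

# Synthetic region: actual volumes and noisy predictions (made up)
rng = np.random.default_rng(0)
actual = rng.uniform(0, 190, size=25_000)
predicted = actual + rng.normal(scale=40, size=25_000)

profits = bootstrap_profit(actual, predicted)
lower, upper = np.percentile(profits, [2.5, 97.5])  # 95% percentile interval
loss_risk = (profits < 0).mean()                    # share of iterations with a loss
print(round(profits.mean()), (round(lower), round(upper)), f"{loss_risk:.1%}")
```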



## Summary  <a id='summary'></a>

This project produced a model that predicts the volume of oil reserves in a well, so that investment can be directed to wells expected to be profitable. According to the model's predictions, region 2 has the highest average oil reserves. To break even, a well needs reserves of at least about 111.1 thousand barrels (112 after rounding up).

After `bootstrapping`, investment in region 2 turns out to carry the highest risk of loss (7.4%) together with the lowest mean profit of the three regions. I therefore recommend investing in region 1, which has the lowest risk of loss (0.6%) and the highest mean profit (about 5.1 million USD).