# Machine Learning in Business Project
***

In this notebook, we examine data from three oil-rich regions for the OilyGiant mining company to determine which region offers the best return on investment (ROI) and the lowest risk of loss for building new wells, given these constraints:

- OilyGiant mining company wants to build 200 wells.
- The total budget for building the wells is 100 million USD.
- One barrel of raw materials brings 4.5 USD of revenue.

We will employ a machine learning algorithm (Linear Regression) to predict reserves and the bootstrapping technique to estimate the profit distribution and the risk of loss.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from scipy import stats as st
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Loading Data

OilyGiant has provided data in 3 CSV files called `geo_data_0.csv`, `geo_data_1.csv`, and `geo_data_2.csv`. 

They will be imported into pandas DataFrames labeled `data_0`, `data_1`, and `data_2` respectively. 

A summary of the data is then displayed.

In [2]:
data_0 = pd.read_csv('geo_data_0.csv')
data_1 = pd.read_csv('geo_data_1.csv')
data_2 = pd.read_csv('geo_data_2.csv')
data_0.info()
data_1.info()
data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null 

In [3]:
display(data_0.head())
display(data_1.head())
display(data_2.head())

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


All 3 DataFrames have 100,000 rows and 5 columns. 

Each row describes a single location within its region.

The columns include:

- `id` -> Unique location identifier (unnecessary for training)
- `f0` `f1` `f2` -> 3 numeric features describing each location
- `product` -> volume of reserves in the oil well location (measured in thousand barrels)

## Preparing Data

There are no null values in any region. For the sake of accuracy, the data will also be checked for duplicate rows.

In [4]:
# Check for duplicated values
print(data_0.duplicated().sum())
print(data_1.duplicated().sum())
print(data_2.duplicated().sum()) 

0
0
0
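
Called with no arguments, `DataFrame.duplicated()` flags only rows that match in every column, so a repeated `id` with different feature values would not be counted. A minimal sketch of the difference, using a made-up toy frame (not the OilyGiant files) and hypothetical variable names:

```python
import pandas as pd

# Hypothetical toy frame standing in for one region's data
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "product": [10.0, 20.0, 30.0, 40.0],
})

# Compares every column: the two "a1" rows differ in f0 and product
full_row_dupes = toy.duplicated().sum()

# Compares only the identifier column: the second "a1" is flagged
id_dupes = toy.duplicated(subset="id").sum()

print(full_row_dupes, id_dupes)
```

Whether a repeated `id` matters depends on what the identifier represents, so this check is a suggestion rather than a required step.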


Now, to train the Linear Regression algorithm, the data must be separated into features and a target:

**Features:**

- `f0`
- `f1`
- `f2`

**Target**:

- `product`

This will allow the algorithm to use the geographical data given by OilyGiant to estimate the reserves at each region's locations.

The data is then split into training and validation datasets at a ratio of `75:25`.

In [5]:
features_0 = data_0.drop(['id', 'product'], axis=1)
features_1 = data_1.drop(['id', 'product'], axis=1)
features_2 = data_2.drop(['id', 'product'], axis=1)

target_0 = data_0['product']
target_1 = data_1['product']
target_2 = data_2['product']

features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(features_0, target_0, test_size=0.25, random_state=123)
features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(features_1, target_1, test_size=0.25, random_state=123)
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(features_2, target_2, test_size=0.25, random_state=123)

print(len(features_train_0) / len(data_0), len(features_valid_0) / len(data_0))
print(len(features_train_1) / len(data_1), len(features_valid_1) / len(data_1))
print(len(features_train_2) / len(data_2), len(features_valid_2) / len(data_2))

0.75 0.25
0.75 0.25
0.75 0.25


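Feature data will be standardized with `StandardScaler`, which rescales each feature to zero mean and unit variance based on the training data. The scaler is fitted on the training set only and then applied to the validation set, so no information from the validation data leaks into training.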

In [6]:
# Create the scaler instance
scaler = StandardScaler()
   
# Fit and transform the training data, then transform the validation data
features_train_0 = scaler.fit_transform(features_train_0)
features_valid_0 = scaler.transform(features_valid_0)
   
features_train_1 = scaler.fit_transform(features_train_1)
features_valid_1 = scaler.transform(features_valid_1)
   
features_train_2 = scaler.fit_transform(features_train_2)
features_valid_2 = scaler.transform(features_valid_2)
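
Standardization subtracts each column's mean and divides by its standard deviation, so scaled values are centered on 0 but are not confined to a fixed range. A minimal sketch of the same transformation done by hand with NumPy, on a made-up feature matrix (the data and variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 1,000 rows, 3 columns on different scales
train = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 10.0, 0.5], size=(1000, 3))

# Column statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std

print(train_scaled.mean(axis=0).round(6))   # per-column mean after scaling
print(train_scaled.std(axis=0).round(6))    # per-column std after scaling
print(np.abs(train_scaled).max() > 1)       # values can fall outside [-1, 1]
```

Validation data would be transformed with the same `mean` and `std`, mirroring the fit-on-train, transform-on-validation pattern used above.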

## Training Model

Next, we will define a function that trains the model for each region and returns the average predicted reserves for the validation locations. It also calculates the RMSE (root mean squared error), a metric that gives the typical size of the model's prediction error in the same units as the target.

In [7]:
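def train_eval(features_train, target_train, features_valid, target_valid):
    """Fit a linear regression and return (mean prediction, RMSE, fitted model)."""
    model = LinearRegression()
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    avg_reserves = predictions.mean()
    # Take the square root of the MSE directly; the `squared=False` argument
    # is deprecated or removed in recent scikit-learn releases.
    rmse = mean_squared_error(target_valid, predictions) ** 0.5
    return avg_reserves, rmse, model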

region_0 = train_eval(features_train_0, target_train_0, features_valid_0, target_valid_0)
region_1 = train_eval(features_train_1, target_train_1, features_valid_1, target_valid_1)
region_2 = train_eval(features_train_2, target_train_2, features_valid_2, target_valid_2)

display(f'Region 0 Average Predicted Reserves: {region_0[0]:.3f} RMSE: {region_0[1]:.3f}')
display(f'Region 1 Average Predicted Reserves: {region_1[0]:.3f} RMSE: {region_1[1]:.3f}')
display(f'Region 2 Average Predicted Reserves: {region_2[0]:.3f} RMSE: {region_2[1]:.3f}')



'Region 0 Average Predicted Reserves: 92.549 RMSE: 37.648'

'Region 1 Average Predicted Reserves: 69.280 RMSE: 0.895'

'Region 2 Average Predicted Reserves: 95.099 RMSE: 40.128'

The model predicts that wells in Region 0 have an average reserve of 92.549 thousand barrels of oil with an approximate error of 37.648 thousand barrels.

The model predicts that wells in Region 1 have an average reserve of 69.280 thousand barrels of oil with an approximate error of 0.895 thousand barrels.

The model predicts that wells in Region 2 have an average reserve of 95.099 thousand barrels of oil with an approximate error of 40.128 thousand barrels.

From these results, Regions 0 and 2 have higher average predicted reserves but much larger prediction errors (RMSE of roughly 38 to 40 thousand barrels), while Region 1 has lower average predicted reserves but far more accurate predictions (RMSE below 1 thousand barrels).
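
RMSE is the square root of the mean squared difference between true and predicted values. A minimal sketch of computing it by hand on made-up numbers; the comparison against a constant-mean baseline is an illustrative addition and is not part of the notebook's analysis:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical reserves in thousand barrels
y_true = np.array([100.0, 80.0, 120.0, 60.0])
y_pred = np.array([90.0, 85.0, 110.0, 70.0])

model_rmse = rmse(y_true, y_pred)

# Baseline: always predict the mean of the true values
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))

print(f"model RMSE: {model_rmse:.3f}, baseline RMSE: {baseline_rmse:.3f}")
```

A model whose RMSE is close to the baseline's adds little beyond predicting the average, which is one way to put a region's RMSE in context.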

## Estimating Profits

Now we will perform some calculations to determine which regions can be considered profitable.

In [8]:
# Store all key values for calculations
number_of_wells = 200
revenue_per_barrel = 4.5 # thousand USD per thousand barrels (i.e. 4.5 USD per barrel)
total_cost = 100000  # in thousand USD (given budget)

# Calculate the volume of reserves sufficient for developing a new well without losses
cost_per_well = (total_cost / number_of_wells)
barrels_for_profit = cost_per_well / revenue_per_barrel
print(f"Cost per well: ${cost_per_well * 10 ** 3:.2f}")
print(f"Barrels needed to break even: {barrels_for_profit}")

Cost per well: $500000.00
Barrels needed to break even: 111.11111111111111


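Each well costs $\$500{,}000$ and must hold at least $111.\overline{1}$ thousand barrels (about 111,111 barrels) of oil to break even. The average predicted reserves in all three regions (92.549, 69.280, and 95.099 thousand barrels) fall below this threshold, so a region can only be profitable if the wells with the highest predicted reserves are selected.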
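
The break-even arithmetic can be reproduced directly from the project constraints. A small sketch in plain USD and barrels (the variable names are illustrative; the regional averages are the validation-set figures reported earlier):

```python
budget_usd = 100_000_000        # total budget for the project
wells = 200                     # number of wells to build
revenue_per_barrel_usd = 4.5    # revenue per barrel

cost_per_well_usd = budget_usd / wells
break_even_barrels = cost_per_well_usd / revenue_per_barrel_usd
break_even_thousand_barrels = break_even_barrels / 1000

# Average predicted reserves per region, in thousand barrels
avg_predicted = {"Region 0": 92.549, "Region 1": 69.280, "Region 2": 95.099}

# How far each regional average falls short of the break-even volume
shortfall = {
    region: round(break_even_thousand_barrels - avg, 3)
    for region, avg in avg_predicted.items()
}

print(cost_per_well_usd, round(break_even_barrels, 2))
print(shortfall)
```

Every shortfall is positive, which is why the analysis moves on to selecting only the top-ranked wells.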

Next, we define a function that ranks locations by the model's predictions, then calculates the actual reserves and the potential profit of the 200 locations with the highest predicted reserves in each region.

In [9]:
def calculate_profit(predictions, targets, count):
    """Return (total true reserves, profit) for the `count` wells with the highest predictions."""
    predictions_df = pd.DataFrame({'prediction': predictions, 'target': targets})
    # Rank wells by predicted reserves, then sum the true reserves of the top `count`
    sorted_predictions_df = predictions_df.sort_values(by='prediction', ascending=False)
    selected = sorted_predictions_df.head(count)
    top_reserves = selected['target'].sum()

    # Revenue and cost are both expressed in thousand USD
    revenue = top_reserves * revenue_per_barrel
    profit = revenue - (cost_per_well * count)
    return top_reserves, profit



# Use the function for each region with validation predictions and true values
pred_0 = region_0[2].predict(features_valid_0)
reserves_0, profit_0 = calculate_profit(pred_0, target_valid_0, number_of_wells)

pred_1 = region_1[2].predict(features_valid_1)
reserves_1, profit_1 = calculate_profit(pred_1, target_valid_1, number_of_wells)

pred_2 = region_2[2].predict(features_valid_2)
reserves_2, profit_2 = calculate_profit(pred_2, target_valid_2, number_of_wells)

# Print the results
print(f"Region 0 - Top 200 Location Reserves: {reserves_0:.2f} thousand barrels, Profit: {profit_0:.2f} thousand USD")
print(f"Region 1 - Top 200 Location Reserves: {reserves_1:.2f} thousand barrels, Profit: {profit_1:.2f} thousand USD")
print(f"Region 2 - Top 200 Location Reserves: {reserves_2:.2f} thousand barrels, Profit: {profit_2:.2f} thousand USD")

Region 0 - Top 200 Location Reserves: 30077.05 thousand barrels, Profit: 35346.71 thousand USD
Region 1 - Top 200 Location Reserves: 27589.08 thousand barrels, Profit: 24150.87 thousand USD
Region 2 - Top 200 Location Reserves: 27489.65 thousand barrels, Profit: 23703.44 thousand USD


The 200 locations with the highest predicted reserves in Region 0 actually contain 30,077,050 barrels, worth approximately 135,346,710 USD in revenue, for a profit of about 35,346,710 USD after the 100 million USD budget. This region can be considered profitable.

The 200 locations with the highest predicted reserves in Region 1 actually contain 27,589,080 barrels, worth approximately 124,150,870 USD in revenue, for a profit of about 24,150,870 USD. This region can be considered profitable.

The 200 locations with the highest predicted reserves in Region 2 actually contain 27,489,650 barrels, worth approximately 123,703,440 USD in revenue, for a profit of about 23,703,440 USD. This region can be considered profitable.

Based on this selection from the validation set, Region 0 is likely to be the most profitable.

## Risks and Profit

Using the bootstrapping method, we will now calculate the average profit, 95% confidence interval and risk of losses as a percentage for all 3 regions.

In [10]:
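# Shared random state so the bootstrap samples are reproducible
state = np.random.RandomState(1115)

def bootstrap_profits(predictions_valid, target_valid, n_samples=1000, count=1):
    """Bootstrap the profit of the top `count` wells chosen from 500 sampled points."""
    # NOTE: `count` defaults to 1, so the calls below evaluate only the single
    # top-predicted well per sample; pass count=number_of_wells for the 200-well plan.
    sample_df = pd.DataFrame({'prediction': predictions_valid, 'target': target_valid})
    profits = []
    for _ in range(n_samples):
        # Draw 500 points with replacement, then score the top `count` of them
        subsample = sample_df.sample(n=500, random_state=state, replace=True)
        sample_profit = calculate_profit(subsample['prediction'], subsample['target'], count)
        profits.append(sample_profit[1])

    profits = pd.Series(profits)
    sample_mean = profits.mean()
    lower_bound = profits.quantile(0.025)  # bounds of the 95% interval
    upper_bound = profits.quantile(0.975)
    risk_of_loss = (profits < 0).mean()    # share of samples with negative profit
    return sample_mean, lower_bound, upper_bound, risk_of_loss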

bootstrap_0 = bootstrap_profits(pred_0, target_valid_0)
bootstrap_1 = bootstrap_profits(pred_1, target_valid_1)
bootstrap_2 = bootstrap_profits(pred_2, target_valid_2)

display(f'Region 0 - Average Profit: ${bootstrap_0[0]:.2f} thousand 95% C.I. ({bootstrap_0[1]:.2f}, {bootstrap_0[2]:.3f}) Risk of Loss: {bootstrap_0[3]*100:.2f}%')
display(f'Region 1 - Average Profit: ${bootstrap_1[0]:.2f} thousand 95% C.I. ({bootstrap_1[1]:.2f}, {bootstrap_1[2]:.3f}) Risk of Loss: {bootstrap_1[3]*100:.2f}%')
display(f'Region 2 - Average Profit: ${bootstrap_2[0]:.2f} thousand 95% C.I. ({bootstrap_2[1]:.2f}, {bootstrap_2[2]:.3f}) Risk of Loss: {bootstrap_2[3]*100:.2f}%')

'Region 0 - Average Profit: $186.23 thousand 95% C.I. (-28.71, 322.894) Risk of Loss: 3.40%'

'Region 1 - Average Profit: $120.75 thousand 95% C.I. (120.75, 120.754) Risk of Loss: 0.00%'

'Region 2 - Average Profit: $124.32 thousand 95% C.I. (-180.61, 322.583) Risk of Loss: 12.40%'

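As in the model evaluation, Regions 0 and 2 show more variable outcomes and carry a higher risk of loss (3.40% and 12.40%), while Region 1 shows no losses across the 1,000 bootstrap samples. Note that `bootstrap_profits` was called with its default `count=1`, so the figures above are the profit from the single top-predicted well in each 500-point sample, not the profit of the full 200-well plan. This is also why Region 1's interval collapses to a single value: the top-predicted well in every sample yields the same profit. On these per-well figures, Region 1 carries the least risk while still returning an average profit of about 120.75 thousand USD per well. Evaluating the 200-well plan itself would require passing `count=number_of_wells`.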
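
A minimal sketch of bootstrapping the full 200-well selection: each iteration draws 500 points with replacement, keeps the 200 with the highest predictions, and computes profit against the whole budget. The data here are synthetic, and the helper names (`profit_top_n`, `bootstrap_profit`) and the noise levels are assumptions for illustration, so the printed numbers say nothing about the real regions.

```python
import numpy as np

def profit_top_n(predictions, targets, n_wells, revenue_per_unit=4.5, budget=100_000):
    """Profit (thousand USD) from the n_wells points with the highest predictions."""
    order = np.argsort(predictions)[::-1][:n_wells]   # indices of the top predictions
    return targets[order].sum() * revenue_per_unit - budget

def bootstrap_profit(predictions, targets, n_wells=200, sample_size=500,
                     n_samples=1000, seed=1115):
    """Bootstrap mean profit, 95% interval, and risk of loss for top-n selection."""
    rng = np.random.default_rng(seed)
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    profits = np.empty(n_samples)
    for i in range(n_samples):
        # Sample row positions with replacement
        idx = rng.integers(0, len(predictions), size=sample_size)
        profits[i] = profit_top_n(predictions[idx], targets[idx], n_wells)
    return {
        "mean": profits.mean(),
        "ci_low": np.quantile(profits, 0.025),
        "ci_high": np.quantile(profits, 0.975),
        "risk_of_loss": (profits < 0).mean(),
    }

# Synthetic stand-in for one region's validation set (thousand barrels)
rng = np.random.default_rng(0)
true_reserves = rng.normal(90, 45, size=25_000).clip(min=0)
noisy_predictions = true_reserves + rng.normal(0, 30, size=25_000)

result = bootstrap_profit(noisy_predictions, true_reserves)
print({key: round(float(value), 3) for key, value in result.items()})
```

With the notebook's variables, the equivalent call would be `bootstrap_profits(pred_0, target_valid_0, count=number_of_wells)` for each region.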