# Challenge: What's a probabilistic estimate for the number of days to drill a well?

Often we use neighboring data points to build an estimate:
* Number of days to drill a well.
* Volume of oil produced after 90 days.
* etc.

To do this work we select data from the surrounding area and look at averages and medians.  But there are many more insights we can pull from these data if we use different tools.

For this notebook we are going to work with a statistical method called survival analysis, which models the time until an event occurs.  Survival analysis is used in the medical world to measure the effectiveness of new drugs, but it can be used for any type of data that has a duration.


## Introduction - Survival Analysis of Political Regimes
Example taken from:<br>
https://lifelines.readthedocs.io/en/latest/Survival%20analysis%20with%20lifelines.html

In [None]:
#Import libraries
import pandas as pd
import numpy as np
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

Let's load and inspect the example dataset of regimes.

In [None]:
from lifelines.datasets import load_dd
data = load_dd()
data.head()

Interesting stuff!  For this analysis we need two columns: "duration" and "observed".  The former is the length of time each leader was in office, which is the data behind the plot.  The latter flags whether the end of that term was actually observed; leaders who died in office or were still in power when the dataset ends are treated as censored, so the model knows only that they lasted at least that long.

Let's pull these columns out into their own objects: T and E.

Next we'll use a survival analysis method called the Kaplan-Meier estimator and fit it to the data.
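
To make the method less of a black box, here is a minimal hand-rolled sketch of the Kaplan-Meier calculation using only NumPy. It is an illustration, not the lifelines implementation; the function name `kaplan_meier` and the toy durations are made up for this example. At each observed event time, the running survival probability is multiplied by the fraction of subjects still at risk that did not have the event.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) for right-censored data."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    # Only times where an event was actually observed create a step
    event_times = np.unique(durations[observed])

    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                 # still being followed at time t
        events = np.sum((durations == t) & observed)     # events that happen exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Toy data (hypothetical): 1 = event observed, 0 = censored
toy_T = [2, 3, 3, 5, 8]
toy_E = [1, 1, 0, 1, 0]
times, surv = kaplan_meier(toy_T, toy_E)
print(times, surv)
```

Censored subjects never trigger a step, but they do count in `at_risk` up to the time they drop out, which is how the estimator uses partial information.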

In [None]:
#Select data for analysis
T = data["duration"]
E = data["observed"]

#Initiate model and fit model
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)

The model is built; let's plot the results.

In [None]:
#Plot a Survival Function 
kmf.plot(figsize=(10,10))
plt.title('Survival function of political regimes');

##### What this graph is telling you
* x-axis: duration in office in years
* y-axis: probability that a leader is still in office after x years
* The shaded area is the confidence interval of the estimate.
* For example: _There's a 20% chance that a leader will be in office more than 8 years._
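
Reading values off the curve can also be done numerically. The sketch below uses a hypothetical step-function curve stored as a pandas Series (the numbers are made up, not taken from the regime data) and two illustrative helpers, `survival_at` and `median_time`. With a fitted lifelines model, `kmf.predict(8)` and `kmf.median_survival_time_` should give the equivalent answers.

```python
import pandas as pd

# Hypothetical survival curve: index = years, value = P(still in office)
toy_sf = pd.Series([1.0, 0.8, 0.6, 0.3, 0.1], index=[0, 2, 3, 5, 9])

def survival_at(sf, t):
    """Survival probability at time t (value of the last step at or before t)."""
    return float(sf.loc[:t].iloc[-1])

def median_time(sf):
    """First time at which the curve drops to 0.5 or below."""
    below = sf[sf <= 0.5]
    return below.index[0] if len(below) else float("inf")

p_8 = survival_at(toy_sf, 8)    # probability of lasting more than 8 years
t_50 = median_time(toy_sf)      # time by which half have left office
print(p_8, t_50)
```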

##### However, that's not the whole story . . .
There are many different types of governments, and they behave differently.  Let's create another plot, but this time split the data into Democratic vs. Non-Democratic regimes.
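
Splitting the data this way comes down to a boolean mask and its negation. As a small sketch on made-up rows (the column names mirror the example dataset, the values are hypothetical):

```python
import pandas as pd

# Toy stand-in for the regimes table
toy = pd.DataFrame({
    "democracy": ["Democracy", "Non-democracy", "Democracy", "Non-democracy"],
    "duration": [4, 12, 8, 20],
})

# True for democratic rows, False otherwise
is_dem = toy["democracy"] == "Democracy"

# The mask and its negation (~) split the rows into two non-overlapping groups
dem_durations = toy.loc[is_dem, "duration"]
non_dem_durations = toy.loc[~is_dem, "duration"]
print(list(dem_durations), list(non_dem_durations))
```

Because every row is in exactly one group, each group can be fitted and plotted as its own survival curve.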

In [None]:
#Survival analysis plots for Democratic vs. Non-Democratic regimes
ax = plt.subplot(111)

dem = (data["democracy"] == "Democracy") #filter for regimes

#Fit two different models
kmf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
kmf.plot(ax=ax, figsize=(10,10))
kmf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
kmf.plot(ax=ax)

#plot
plt.ylim(0, 1);
plt.title("Lifespans of different global regimes");

This plot makes sense, as dictators are more likely to remain in power longer than democratically elected officials.  

Now let's try this technique with well data.

## Exercise - Survival Analysis of Days Drilling in the Mississippi Canyon Protraction, GOM.

The Mississippi Canyon Protraction Area in the Gulf of Mexico is one of the most prolific parts of the basin with some of its largest fields (Mars/Ursa, Thunderhorse).  Thousands of wells by different operators have been drilled here, and likely many more will be.  When planning a well, a common analysis is to look at surrounding wells to estimate the time it will take to drill.  Instead of coming up with one number (e.g. average days of drilling), let's calculate a probability distribution.

## Step 1.  Load Data and Generate Calculated Columns
We will be loading these data from a CSV file downloaded from the U.S. BOEM.

In [None]:
#Load all wells drilled in protraction area
df = pd.read_csv('../data/BoreholeMC.csv')

#Show first 5 rows
df.head()

Next we need to calculate days drilled for each well using the columns "Spud Date" and "Total Depth Date".  We'll also need to filter out empty values as the Kaplan Meier method doesn't accept null values.
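
As a quick illustration of what the date handling does, here is a sketch on three made-up rows. Passing `errors="coerce"` to `pd.to_datetime` turns anything that cannot be parsed into `NaT`, and `dropna()` then removes every row where either date is missing. The date strings and their format are assumptions for this example only.

```python
import pandas as pd

# Toy rows (hypothetical): one clean, one with an unparseable date, one with a missing date
toy = pd.DataFrame({
    "Spud Date": ["01/05/2010", "not a date", "03/01/2011"],
    "Total Depth Date": ["02/04/2010", "04/01/2010", None],
})

# Unparseable or missing entries become NaT instead of raising an error
dates = toy.apply(pd.to_datetime, errors="coerce")

# Keep only rows where both dates are valid
clean = dates.dropna().copy()

# Subtracting two datetime columns gives a timedelta; .dt.days converts it to whole days
clean["drill_days"] = (clean["Total Depth Date"] - clean["Spud Date"]).dt.days
print(clean)
```

Only the first row survives, with a duration of 30 days.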

In [None]:
#Remove empty values
days=df[['Total Depth Date','Spud Date']].apply(pd.to_datetime, errors='coerce').dropna()

#Calculate time difference
days['drill_days']=days['Total Depth Date']-days['Spud Date']

#Convert Date Difference to Days
days['drill_days'] = days['drill_days']/np.timedelta64(1, 'D')

#Show first 5 rows
days.head()

In [None]:
#Initiate model and fit model
kmf = KaplanMeierFitter()
kmf.fit(days.drill_days, event_observed=None)

In [None]:
#Plot a Survival Function 
kmf.plot(figsize=(10,10))
plt.title('Drilling Days Mississippi Canyon Protraction Area');

#### Does this look right?
No, it does not.

## Step 2. Clean and Filter Data

The plot above is wrong: there's no way a well would have been drilling for 13 years!  There must be some spurious data.  Let's investigate and clean.

In [None]:
#Use Describe to look at metrics for dataframe
days.describe().T

This quick description tells us a lot:
* The min value is 0 days.  A deepwater well can't be drilled in 0 days.
* As expected, the max value is too high.
* The P25, P50, and P75 look right, implying that the spurious data are at the extremes.

We need to figure out reasonable cutoffs.  To do so let's create a histogram of drilling days.

In [None]:
#Histogram of drilling days
fig, ax = plt.subplots(figsize=(10,8))
plt.hist(days.drill_days, range=(0,5000))
plt.xlabel('Drilling Days')
plt.title('Histogram of Drilling Days')
plt.show()

This histogram isn't that useful since a large proportion of the data is <500 days.  An ECDF plot, which shows the proportion of data points at or below a certain value, might be more instructive.

In [None]:
#Generate inputs for ECDF
n = len(days.drill_days)
x = np.sort(days.drill_days.values)
y = np.arange(1, n+1)/n

#Plot ECDF
fig, ax = plt.subplots(figsize=(10,8))
plt.plot(x, y, marker='.', linestyle='none')
plt.title('ECDF of Drilling Days in MC')
plt.xlabel('Drilling Days')
plt.ylabel('Proportion of Data')
plt.show()

Still heavily skewed, but if we zoom in to the upper left of the image we can make better sense of it.

In [None]:
#Plot zoom of ECDF, upper left
fig, ax = plt.subplots(figsize=(10,8))
plt.plot(x, y, marker='.', linestyle='none')
plt.xlim(100,500)
plt.ylim(0.9,1)
plt.title('Zoom - ECDF of Drilling Days in MC')
plt.xlabel('Drilling Days')
plt.ylabel('Proportion of Data')
plt.show()

Now the plot is easy to read: 93% of the data is less than 150 days, and 96% is less than 365 days.  Let's use this information to filter the data down to a more realistic range.  No one plans to drill a well for over a year.  Also, it is unlikely that an offshore well can be drilled in <7 days.  Let's use the `query` function to reduce the range of days.
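
The proportions read off the ECDF can also be checked directly, since the mean of a boolean array is the fraction of `True` values. The sketch below uses made-up durations and a hypothetical helper `fraction_below`; with the real data the same idea would apply to `days.drill_days`.

```python
import numpy as np

# Toy drilling durations in days (hypothetical), including obviously bad values
toy_days = np.array([0, 5, 20, 35, 48, 60, 90, 120, 200, 4800])

def fraction_below(samples, cutoff):
    """Fraction of samples strictly below the cutoff (an ECDF-style reading)."""
    return float(np.mean(samples < cutoff))

frac_150 = fraction_below(toy_days, 150)
frac_365 = fraction_below(toy_days, 365)

# Keep only durations inside the plausible range
kept = toy_days[(toy_days > 7) & (toy_days < 150)]
print(frac_150, frac_365, kept)
```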

<br />  _If you would like to experiment with different numbers go ahead and update the code block below._

In [None]:
#Filter data to 7<x<150
days_filtered = days.query("drill_days<150 & drill_days>7")

#Describe filtered data
days_filtered.describe().T

In [None]:
#Plot filtered Survival Function 
kmf.fit(days_filtered.drill_days, event_observed=None)
kmf.plot(figsize=(10,10));

This plot makes more sense.  _We can read the plot as: 50% of the wells in MC took more than 35 days to drill._  The narrow confidence interval shows that this distribution is well constrained.
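
Because every well here is treated as a completed event (`event_observed=None`), there is no censoring, and the survival curve is simply one minus the ECDF. The sketch below checks that on made-up durations: the fraction of wells taking longer than `t` days plus the fraction taking `t` days or fewer is always 1.

```python
import numpy as np

# Toy drilling durations in days (hypothetical)
toy_days = np.array([12, 20, 28, 35, 41, 60, 95])

def survival_fraction(samples, t):
    """Fraction of wells that took longer than t days (no censoring assumed)."""
    return float(np.mean(samples > t))

def ecdf_fraction(samples, t):
    """Fraction of wells that took t days or fewer."""
    return float(np.mean(samples <= t))

t = 35
s = survival_fraction(toy_days, t)
f = ecdf_fraction(toy_days, t)
print(s, f, np.median(toy_days))
```

This is why the point where the curve crosses 0.5 can be read as the median drilling time.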

## Step 3. Breaking out the data - Exploration vs. Development
The graph above is okay, but just like in the introduction we aren't accounting for the differences in the data.  One simple division we can make is to separate Exploration and Development wells.

To divide the wells we need to grab the "Type Code" column from the original data source.  One way to do that is to merge the original dataframe with the days_filtered dataframe.  You may have noticed that pandas has an index column, and through all of our manipulations and filters that index has been unchanged.  This allows us to match index values across different dataframes to merge data.
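
A small sketch of the index-matching idea on made-up data: filtering keeps the original index labels, so merging on the index pulls in only the rows that survived the filter. The table contents are hypothetical.

```python
import pandas as pd

# Toy "original" table and a durations table sharing the same index
wells = pd.DataFrame({"Type Code": ["E", "D", "E", "D"]}, index=[0, 1, 2, 3])
durations = pd.DataFrame({"drill_days": [30.0, 400.0, 45.0, 3.0]}, index=[0, 1, 2, 3])

# Filtering drops rows 1 and 3 but keeps the labels 0 and 2
kept = durations.query("drill_days < 150 & drill_days > 7")

# Merging on the index is an inner join by default: only labels present in both remain
merged = pd.merge(wells, kept["drill_days"], left_index=True, right_index=True)
print(merged)
```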

In [None]:
#Merge dataframes
df_filtered = pd.merge(df, days_filtered['drill_days'], left_index=True, right_index=True)

#New dataframe of data for analysis and drop any empty cells
df_filtered=df_filtered[['drill_days','Type Code']].dropna()

#Create separate dataframes for Exploration and Development wells
expl_days = df_filtered['drill_days'][df_filtered["Type Code"] == "E"].dropna()
dev_days = df_filtered['drill_days'][df_filtered["Type Code"] != "E"].dropna()

In [None]:
#Survival plot for Exploration vs. Development
ax = plt.subplot(111)
kmf.fit(expl_days, event_observed=None, label="Exploration Wells")
kmf.plot(ax=ax, figsize=(10,10))
kmf.fit(dev_days, event_observed=None, label="Development Wells")
kmf.plot(ax=ax)
plt.ylim(0, 1);
plt.title("Drilling Days for Exploration vs. Development Wells");

This is more informative and it makes sense.  Development wells (orange) should take less time to drill than Exploration wells (blue). 

## Step 4. Functions and Exploring the data

Now that we have the data in shape we can ask a lot more questions like:
* How do Exploration and Development wells compare for different companies?
* How do different companies compare in their drill times?

There's an adage that goes: 

__"If you've repeated a workflow, it's time to write a function."__ 

Functions in Python let us save a sequence of code and then call it whenever needed, passing in new data or variables.

The function below allows us to compare Exploration and Development wells for a particular company.  We'll call this function "company_expl_dev_lifelines" and it has two inputs inside the parentheses:
1. df - this is a placeholder for the dataframe of wells, which must have "Company Name" and "Type Code" columns
2. company - this is a placeholder for the name of a company

The day-range filter is not an input here; the function reuses the days_filtered dataframe we built in Step 2.
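
Python functions can also take optional inputs with default values, which is one way a day-range filter could be exposed. The helper below, `filter_drill_days` with `min_days` and `max_days`, is a hypothetical sketch on made-up data and is not part of the notebook's functions.

```python
import pandas as pd

def filter_drill_days(df, min_days=7, max_days=150):
    """Keep rows whose drill_days fall strictly between min_days and max_days."""
    in_range = (df["drill_days"] > min_days) & (df["drill_days"] < max_days)
    return df.loc[in_range]

# Toy data (hypothetical)
toy = pd.DataFrame({"drill_days": [3, 20, 45, 160, 400]})

default_cut = filter_drill_days(toy)               # uses the defaults 7 and 150
wider_cut = filter_drill_days(toy, max_days=365)   # overrides only the upper limit
print(list(default_cut["drill_days"]), list(wider_cut["drill_days"]))
```

Callers who are happy with the defaults can ignore the extra arguments, while anyone experimenting with different cutoffs can override them by name.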

In [None]:
#Function to compare Exploration and Development wells for a particular company
#Note: relies on the days_filtered dataframe and kmf model created above

def company_expl_dev_lifelines(df, company):
    
    #Filter to the company's wells (na=False skips rows with no company name)
    dn = df.loc[df['Company Name'].str.contains(company, na=False)]

    #Attach the filtered drilling days by matching on the index
    dk = pd.merge(dn, days_filtered['drill_days'], left_index=True, right_index=True)
    dk = dk[['drill_days','Type Code']].dropna()

    #Split into Exploration ("E") and all other (Development) wells
    de = dk.loc[dk["Type Code"] == "E", 'drill_days']
    dd = dk.loc[dk["Type Code"] != "E", 'drill_days']

    #Make Plot 
    ax = plt.subplot(111)
    kmf.fit(de, event_observed=None, label="Exploration Wells")
    kmf.plot(ax=ax, figsize=(10,10))
    kmf.fit(dd, event_observed=None, label="Development Wells")
    kmf.plot(ax=ax)

    plt.ylim(0, 1)
    plt.title(f"Drilling Days for {company} - Exploration vs. Development Wells")
    plt.show()

Let's try this function out with Shell.

In [None]:
company_expl_dev_lifelines(df, 'Shell')

Now it's your turn to pick companies to plot.  To help you find names, below is a bar chart of the most prolific drillers in the MC Protraction Area.  Note how the confidence intervals expand when there are fewer data points (e.g. Taylor Energy).

In [None]:
#Quick Plot of who's drilled the most in the protraction
comp_counts = df['Company Name'].value_counts()
comp_counts = comp_counts[comp_counts>50]
comp_counts.plot(kind='barh', figsize=(5,5), title='Top Operators in MC (>50 Wells)', label='# Wells');

In [None]:
company_expl_dev_lifelines(df, 'Taylor')

### How do different companies compare in their drill times?

Below is a similar-looking function, but it compares wells from two different companies.  Note that you now need to pass in two company names.

In [None]:
def company_compare_lifelines(df, company1, company2):
    
    #Attach the filtered drilling days by matching on the index
    dk = pd.merge(df, days_filtered['drill_days'], left_index=True, right_index=True)
    dk = dk[['drill_days','Type Code','Company Name']].dropna()

    #Filter on the merged dataframe itself so the mask and the rows always line up
    dn = dk.loc[dk['Company Name'].str.contains(company1)]
    do = dk.loc[dk['Company Name'].str.contains(company2)]
    
    #Make Plot 
    ax = plt.subplot(111)
    kmf.fit(dn.drill_days, event_observed=None, label=company1)
    kmf.plot(ax=ax, figsize=(10,10))
    kmf.fit(do.drill_days, event_observed=None, label=company2)
    kmf.plot(ax=ax)

    plt.ylim(0, 1)
    plt.title(f"Drilling Days for {company1} vs. {company2}")
    plt.show()

In [None]:
company_compare_lifelines(df, 'Shell', 'Exxon')

There's a lot more to explore with survival analysis: different cuts of the data, different day ranges, analysis of the distributions, etc.
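
As one example of analysing the distribution further, a range of outcomes can be summarised with percentiles instead of a single average. The sketch below computes the 10th, 50th, and 90th percentiles on made-up durations; the variable names are illustrative only.

```python
import numpy as np

# Toy drilling durations in days (hypothetical)
toy_days = np.array([15, 22, 28, 35, 41, 52, 60, 75, 95, 130])

# 10th, 50th, and 90th percentiles give a low, middle, and high estimate
low, mid, high = np.percentile(toy_days, [10, 50, 90])
print(low, mid, high)
```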

Where else do you have duration data that might fit well in these kinds of plots?
