# Pipeline Hours Analysis: Predicting Business Implementation from Pipeline Data

## Overview
This notebook explores methods to predict the conversion rate of pipeline opportunities into actual implemented business hours. The analysis aims to help business planning by providing better forecasting of resource requirements based on current pipeline data. This notebook was mainly EDA and exploration into the space to see if classification or regression models would help in our forecasting efforts. In the end, the best choice was to use the monthly snapshot approach, which approximated the hours needed the best out of any more sophisticated model with much more simplicity. 

## Project Goals
- Develop models to predict which opportunities will close successfully
- Analyze the timeline of opportunity closures
- Calculate the relationship between contracted and implemented hours
- Create monthly snapshots of pipeline conversion rates
- Provide insights for resource planning and capacity management

## Technical Approach
The analysis employs several sophisticated methods:
1. **Binary Classification Models**
   - Random Forest and XGBoost classifiers to predict opportunity success
   - Feature engineering including opportunity age, type, category, and contract values
   - SMOTE for handling class imbalance

2. **Survival Analysis**
   - Kaplan-Meier estimator for analyzing opportunity lifecycle
   - Cox Proportional Hazards model for understanding factors affecting closure times
   - Time-based binning (0-30 days, 31-60 days, 60+ days) for practical planning

3. **Pipeline Conversion Analysis**
   - Monthly snapshot creation of contracted vs. implemented hours
   - Category-wise analysis of conversion rates
   - Historical trending of implementation rates

## Data Sources
The analysis uses Salesforce opportunity data, including:
- Opportunity details (ID, Type, Category)
- Timeline information (Start Date, Close Date)
- Contract values (Weekly Hours, ACV)
- Implementation metrics
- Status and stage information

## Business Impact
This analysis helps answer critical business questions:
- What percentage of pipeline opportunities will convert to actual business?
- When are opportunities likely to close?
- How many resources should be prepared for upcoming implementations?
- What is the typical gap between contracted and implemented hours?

These insights enable more accurate resource planning and improve operational efficiency by better aligning capacity with expected demand.

In [70]:
# %pip install imblearn

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, classification_report, roc_auc_score, accuracy_score, confusion_matrix
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import optuna
from datetime import datetime
# Import the module under a specific name
import importlib
import sf_queries_class
importlib.reload(sf_queries_class)
from sf_queries_class import SfQueries
import my_sf_secrets
import capacity_portion
importlib.reload(capacity_portion)
from reportforce import Reportforce
from lifelines.statistics import logrank_test
from xgboost import XGBClassifier
import statsmodels.api as sm
from lifelines import KaplanMeierFitter, CoxPHFitter
my_sf_username, my_sf_password, my_sf_security_token = my_sf_secrets.get_my_sf_secrets()
queries = SfQueries(
    username=my_sf_username,
    password=my_sf_password,
    security_token=my_sf_security_token
)

rf = Reportforce(session_id=queries.sf.session_id, instance_url=queries.sf.sf_instance)

In [None]:
if 'ofph' not in globals():
    print("You need OFPH")
    ofph = rf.get_report("ReportID", id_column='Opportunity ID')
len(ofph)

In [None]:
ofph_fil = ofph[(ofph['Type'].isin(['New Client', 'Existing Client - New Service Line', 'Existing Client - Service Line Expansion'])) &
                 (ofph['Exclude from Resource Requests'] == 'false') &
                 (ofph['Close Date'] <= datetime.now())].copy()
len(ofph_fil)

In [75]:
# create a column that is called 'Won' if `Probability (%)` is > 0 otherwise 'Lost'
ofph_fil.loc[:, 'Won'] = (ofph_fil['Probability (%)'] > 0).astype(int)

In [76]:
unique_identifier = ['Opportunity ID'] # String

# removed Opportunity Owner, Created Date, Close Date
important_features = ['Opportunity ID', ## String
                      'Age', # Int
                      'Type', # String
                      'Category', # String
                      'Contracted Weekly Hours', # Float
                      'Backlog', # Bool
                      'Won', # Int
                      'Contracted Opportunity ACV', # Float
                      'Hospital Count', # Int
                    #   'Implemented Weekly Hours', # Float choosing Contracted for classification
                      ]

In [None]:
ofph_index_reset = ofph_fil.reset_index()
# convert the Backlog string column to a boolean by changing the true values to 1 and the false values to 0
ofph_index_reset['Backlog'] = ofph_index_reset['Backlog'].str.replace('true', '1').str.replace('false', '0').astype(int)
ofph_index_reset_sel = ofph_index_reset[important_features]
ofph_index_reset_sel = ofph_index_reset_sel.dropna()
ofph_index_reset_sel

In [None]:
# Initialize the MinMaxScaler
scaler = MinMaxScaler()

ofph_index_reset_sel.columns

In [None]:
ofph_index_reset_sel.info()

In [None]:
# Apply one-hot encoding to the 'Category' column
encoded_df = pd.get_dummies(ofph_index_reset_sel, columns=['Type','Category'], drop_first=True)


# Display the encoded DataFrame
encoded_df.head()

In [82]:
# Define the columns to normalize
columns_to_normalize = [
    'Contracted Opportunity ACV',
    'Contracted Weekly Hours',
    'Hospital Count',
    'Age'
]
# groupby Opportunity ID and get the first value from 'Age', 'Backlog', and 'Hospital Count'
abhc = encoded_df.groupby('Opportunity ID').first()[['Age', 'Backlog', 'Hospital Count', 'Contracted Opportunity ACV']]
# group by Opportunity ID and calculate the sum for all the other columns
rest = encoded_df.groupby('Opportunity ID').sum().drop(['Age', 'Backlog', 'Hospital Count', 'Contracted Opportunity ACV'], axis=1)
# merge the 2 back together
encoded_df_merged = pd.merge(abhc, rest, left_index=True, right_index=True)
# in all columns except for 'Age', 'Backlog', 'Hospital Count', 'Contracted Weekly Hours', and 'Contracted Opportunity ACV', replace any number greater than 0 with 1
cols_not_to_change = ['Age', 'Backlog', 'Hospital Count', 'Contracted Weekly Hours', 'Contracted Opportunity ACV']
for col in encoded_df_merged.columns:
    if col not in cols_not_to_change:
        encoded_df_merged[col] = encoded_df_merged[col].apply(lambda x: 1 if x > 0 else 0)
encoded_df_base = encoded_df_merged.copy()
encoded_df_merged_reg = encoded_df_merged.copy()
# Fit and transform the selected columns
encoded_df_merged[columns_to_normalize] = scaler.fit_transform(encoded_df_merged[columns_to_normalize])
cols_for_reg = [
    'Contracted Opportunity ACV',
    'Contracted Weekly Hours',
    'Hospital Count']
encoded_df_merged_reg[cols_for_reg] = scaler.fit_transform(encoded_df_merged_reg[cols_for_reg])

encoded_df_merged_index_reset = encoded_df_merged.reset_index()
encoded_df_merged_reg_index_reset = encoded_df_merged_reg.reset_index()

# Win Probability

In [None]:
# Define target variable
target = 'Won'

# Split the data
X = encoded_df_merged_index_reset.drop(columns=[target, 'Opportunity ID'])
y = encoded_df_merged_index_reset[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(eval_metric='logloss')

model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print('AUC-ROC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

In [None]:
# Get probabilities
probabilities = model.predict_proba(X_test)

# Get the probability of the positive class (usually class 1)
positive_class_probs = probabilities[:, 1]

# You can then use these probabilities for various purposes
for i, prob in zip(X_test.index, positive_class_probs):
    print(f"Sample {i}: Probability of positive class = {prob:.2f}")

The XGBoost model tells us the probability that we will win the opportunity. Given that we win, now we need to determine what is the age that the opportunity will close at and how many hours will be implemented versus what is contracted for. 

I think figuring out how much is implemented versus what is contracted for is an easier task to determine. 

# Implemented Vs Contracted Hours

In [17]:
con_imp = ofph_fil.reset_index().loc[(ofph_fil.reset_index()['Probability (%)'] > 0), ['Category', 'Contracted Weekly Hours', 'Implemented Weekly Hours']]\
    .groupby('Category').sum()[['Contracted Weekly Hours', 'Implemented Weekly Hours']]
con_imp.loc[:, 'diff'] = con_imp['Implemented Weekly Hours'] - con_imp['Contracted Weekly Hours']
con_imp.loc[:, 'diff_perc'] = round(con_imp['diff'] / con_imp['Contracted Weekly Hours'], 4)
con_imp.loc[:, 'diff'] = round(con_imp.loc[:, 'diff'], 4)
# con_imp

now we have a percentage that given we win an opportunity, how much of that opportunity we will actually recognize in implemented hours.

# Closing Age

In [21]:
encoded_df_merged_reg_index_reset['Age_logged'] = np.log1p(encoded_df_merged_reg_index_reset['Age'])
features_i = encoded_df_merged_reg_index_reset.columns
features = [f for f in features_i if f not in ['Opportunity ID', 'Age_logged', 'Won'] and not f.startswith('Category_')]

def prepare_data(df):
    X = df[features]
    y = df['Age_logged']
    return train_test_split(X, y, test_size=0.2, random_state=42)

X_train_won, X_test_won, y_train_won, y_test_won = prepare_data(encoded_df_merged_reg_index_reset[encoded_df_merged_reg_index_reset['Won'] == 1])
X_train_lost, X_test_lost, y_train_lost, y_test_lost = prepare_data(encoded_df_merged_reg_index_reset[encoded_df_merged_reg_index_reset['Won'] == 0])

In [None]:
# create a random forest regressor model
rf_won = RandomForestRegressor(n_estimators = 136, max_depth = 9, min_samples_split = 4, max_features = 'sqrt', random_state=42) 
# {'n_estimators': 136, 'max_depth': 9, 'min_samples_split': 4, 'max_features': 'sqrt'}
rf_lost = RandomForestRegressor(n_estimators = 77, max_depth = 5, min_samples_split = 3, max_features = 'sqrt', random_state=42) 
# {'n_estimators': 77, 'max_depth': 5, 'min_samples_split': 3, 'max_features': 'sqrt'}

rf_won.fit(X_train_won, y_train_won)
rf_lost.fit(X_train_lost, y_train_lost)

y_pred_won = rf_won.predict(X_test_won)
y_pred_lost = rf_lost.predict(X_test_lost)

def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - RMSE: {rmse:.2f}, R2 Score: {r2:.2f}")

evaluate_model(y_test_won, y_pred_won, "Won Opportunities Model")
evaluate_model(y_test_lost, y_pred_lost, "Lost Opportunities Model")

In [None]:
# plot the predictions versus the true values
plt.scatter(y_test_won, y_pred_won)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()


In [None]:
def plot_feature_importance(model, feature_names):
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]
    
    plt.figure(figsize=(10,6))
    plt.title("Feature Importances")
    plt.bar(range(len(importances)), importances[indices])
    plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=90)
    plt.tight_layout()
    plt.show()

plot_feature_importance(rf_won, features)
plot_feature_importance(rf_lost, features)

In [None]:


def learning_curve_pandas(rf, X_train, y_train, X_test, y_test, train_sizes):
    """
    Manually plots learning curves for a given sklearn model.

    Args:
        rf: The sklearn Random Forest model instance.
        X_train: The training features.
        y_train: The training target.
        X_test: The testing features.
        y_test: The testing target.
        train_sizes: Relative or absolute numbers of training examples to use for generating the learning curve.
    """
    train_scores = []
    test_scores = []

    for size in train_sizes:
        # Sample the training data according to the current size
        sample_size = int(len(X_train) * size)
        X_sample, y_sample = X_train[:sample_size], y_train[:sample_size]
        
        # Train the model
        rf.fit(X_sample, y_sample)
        
        # Make predictions and evaluate the model on both training and testing data
        train_pred = rf.predict(X_sample)
        test_pred = rf.predict(X_test)
        
        train_rmse = np.sqrt(mean_squared_error(y_sample, train_pred))
        test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
        
        train_scores.append(train_rmse)
        test_scores.append(test_rmse)

    return train_scores, test_scores

def train_and_evaluate_rf_optuna(X_train, X_test, y_train, y_test):
    def objective(trial):
        # Define the hyperparameter search space
        n_estimators = trial.suggest_int("n_estimators", 50, 200)
        max_depth = trial.suggest_int("max_depth", 5, 15)
        min_samples_split = trial.suggest_int("min_samples_split", 2, 10)
        max_features = trial.suggest_categorical("max_features", ["sqrt", "log2"])

        # Create a Random Forest model with the suggested hyperparameters
        rf = RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            max_features=max_features,
            random_state=42
        )

        # Train the model
        rf.fit(X_train, y_train)

        # Make predictions on the validation data
        predictions = rf.predict(X_test)

        # Evaluate the model using RMSE
        rmse = np.sqrt(mean_squared_error(y_test, predictions))

        return rmse

    # Create an Optuna study
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100)  # Adjust the number of trials as needed

    # Get the best hyperparameters and RMSE
    best_params = study.best_trial.params
    best_rmse = study.best_value
    print(f"Best Hyperparameters: {best_params}\nBest RMSE: {best_rmse}")

    # Train the final model with the best hyperparameters
    best_rf = RandomForestRegressor(
        n_estimators=best_params["n_estimators"],
        max_depth=best_params["max_depth"],
        min_samples_split=best_params["min_samples_split"],
        max_features=best_params["max_features"],
        random_state=42
    )
    best_rf.fit(X_train, y_train)

    # Make predictions on the testing data
    predictions = best_rf.predict(X_test)

    # Extract feature importances
    feature_importances = best_rf.feature_importances_

    # Return the best model, feature importances, and predictions
    return best_rf, feature_importances, predictions

def plot_figs(X_train, X_test, y_train, y_test, input_cols):
    print(f"\n{'='*50}\nProcessing output column: Age\n{'='*50}")
    
    rf_model, feature_importances, predictions = train_and_evaluate_rf_optuna(X_train, X_test, y_train, y_test)
    
    residuals = y_test - predictions
    
    # Summary Statistics of Residuals
    print("\nSummary Statistics of Residuals:")
    print(pd.Series(residuals).describe())

    # Create a single figure with 6 subplots
    fig, axs = plt.subplots(3, 2, figsize=(20, 24))
    fig.suptitle("Analysis for Age", fontsize=16)

    # 1. Learning Curve
    train_sizes = np.linspace(0.1, 1.0, 5)
    train_scores, test_scores = learning_curve_pandas(rf_model, X_train, y_train, X_test, y_test, train_sizes)

    axs[0, 0].plot(train_sizes, train_scores, label='Train RMSE')
    axs[0, 0].plot(train_sizes, test_scores, label='Test RMSE')
    axs[0, 0].set_xlabel('Training Examples')
    axs[0, 0].set_ylabel('RMSE')
    axs[0, 0].set_title('Learning Curves')
    axs[0, 0].legend()
    axs[0, 0].grid(True)

    # 2. Feature Importances
    axs[0, 1].bar(range(len(input_cols)), feature_importances)
    axs[0, 1].set_xticks(range(len(input_cols)))
    axs[0, 1].set_xticklabels(input_cols, rotation=90)
    axs[0, 1].set_xlabel("Features")
    axs[0, 1].set_ylabel("Importance")
    axs[0, 1].set_title("Feature Importances")

    # 3. Test vs Predictions
    axs[1, 0].scatter(y_test, predictions)
    axs[1, 0].set_xlabel("Actual Age")
    axs[1, 0].set_ylabel("Predicted Age")
    axs[1, 0].set_title("Test vs Predictions")

    # 4. Residual Plot
    axs[1, 1].scatter(predictions, residuals)
    axs[1, 1].axhline(y=0, color='r', linestyle='--')
    axs[1, 1].set_xlabel("Predicted Values")
    axs[1, 1].set_ylabel("Residuals")
    axs[1, 1].set_title("Residual Plot")

    # 5. Histogram of Residuals
    axs[2, 0].hist(residuals, bins=20)
    axs[2, 0].set_xlabel("Residuals")
    axs[2, 0].set_ylabel("Frequency")
    axs[2, 0].set_title("Histogram of Residuals")

    # 6. Q-Q Plot
    sm.qqplot(pd.Series(residuals), line='s', ax=axs[2, 1])
    axs[2, 1].set_title("Q-Q Plot")

    plt.tight_layout()
    plt.show()

# Assuming you have your data in a pandas DataFrame called 'df'
# with 'Age' as the target variable and other columns as features
# df = pd.read_csv('your_data.csv')

# Prepare your data
# X = df.drop('Age', axis=1)
# y = df['Age']
# input_cols = X.columns.tolist()

# Split the data
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Run the analysis
plot_figs(X_train_won, X_test_won, y_train_won, y_test_won, features)
plot_figs(X_train_lost, X_test_lost, y_train_lost, y_test_lost, features)


In [31]:
# Define the bins and corresponding labels
bins = [0, 30, 60, float('inf')]
labels = ['0-30 days', '31-60 days', '> 61 days']

# Create a new column 'Age_Bin' based on the bins
encoded_df_merged_reg_index_reset['Age_Bin'] = pd.cut(encoded_df_merged_reg_index_reset['Age'], bins=bins, labels=labels, right=False)


In [None]:
encoded_df_merged_reg_index_reset['Age_Bin'].value_counts()

In [33]:
# Update the feature list to exclude 'Age' and include the binned target variable 'Age_Bin'
features_i = encoded_df_merged_reg_index_reset.columns
features = [f for f in features_i if f not in ['Opportunity ID', 'Age', 'Won', 'Age_Bin']] #  and not f.startswith('Category_')

def prepare_data(df):
    X = df[features]
    y = df['Age_Bin']
    return train_test_split(X, y, test_size=0.2, random_state=42)

# Prepare training and testing sets based on the new target
X_train_won, X_test_won, y_train_won, y_test_won = prepare_data(encoded_df_merged_reg_index_reset[encoded_df_merged_reg_index_reset['Won'] == 1])
X_train_lost, X_test_lost, y_train_lost, y_test_lost = prepare_data(encoded_df_merged_reg_index_reset[encoded_df_merged_reg_index_reset['Won'] == 0])

# Apply SMOTE to the training data
smote_won = SMOTE(random_state=42)
X_resampled_won, y_resampled_won = smote_won.fit_resample(X_train_won, y_train_won)

# Apply SMOTE to the training data
smote_lost = SMOTE(random_state=42)
X_resampled_lost, y_resampled_lost = smote_lost.fit_resample(X_train_lost, y_train_lost)

In [None]:
y_train_won.value_counts()

In [None]:
y_train_lost.value_counts()

In [None]:
def objective(trial):
    # Define the hyperparameter search space
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 5, 30)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 10)
    max_features = trial.suggest_categorical("max_features", ["sqrt", "log2", None])
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 4)
    
    # Create the Random Forest model with the suggested hyperparameters
    rf_classifier = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        random_state=42
    )
    
    # Train the model
    rf_classifier.fit(X_resampled_won, y_resampled_won)
    
    # Make predictions on the validation set
    y_pred = rf_classifier.predict(X_test_won)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test_won, y_pred)
    
    return accuracy

# Create an Optuna study object and specify the optimization direction
study = optuna.create_study(direction="maximize")

# Optimize the study using the objective function
study.optimize(objective, n_trials=100, n_jobs=-1)  # Adjust n_trials as needed

# Output the best trial
print(f"Best trial:\n{study.best_trial}")
print(f"Best accuracy: {study.best_value}")
print(f"Best hyperparameters: {study.best_trial.params}")

# Extract the best hyperparameters
best_params = study.best_trial.params

# Train the final model using the best hyperparameters
best_rf_classifier = RandomForestClassifier(
    n_estimators=best_params["n_estimators"],
    max_depth=best_params["max_depth"],
    min_samples_split=best_params["min_samples_split"],
    max_features=best_params["max_features"],
    min_samples_leaf=best_params["min_samples_leaf"],
    random_state=42
)

# Fit the model on the full training data
best_rf_classifier.fit(X_resampled_won, y_resampled_won)

# Make predictions on the test data
y_pred_best = best_rf_classifier.predict(X_test_won)

# Evaluate the model
print("Classification Report for 'Won' with Best Hyperparameters:")
print(classification_report(y_test_won, y_pred_best))

# Confusion matrix
print("Confusion Matrix for 'Won' with Best Hyperparameters:")
print(confusion_matrix(y_test_won, y_pred_best))


In [None]:
# Assuming y_train_won and y_test_won contain your target labels
label_encoder = LabelEncoder()

# Fit the label encoder and transform the target labels
y_resampled_won_encoded = label_encoder.fit_transform(y_resampled_won)
y_test_won_encoded = label_encoder.transform(y_test_won)

# Print the classes to see the mapping
print("Classes:", label_encoder.classes_)


def objective(trial):
    # Define the hyperparameter search space
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'use_label_encoder': False,  # Important to avoid warnings in XGBoost > 1.3
        'eval_metric': 'mlogloss'  # Metric to use
    }

    # Create the XGBClassifier model with the suggested hyperparameters
    xgb_classifier = XGBClassifier(**param, random_state=42)
    
    # Train the model
    xgb_classifier.fit(X_resampled_won, y_resampled_won_encoded)
    
    # Make predictions on the validation set
    y_pred = xgb_classifier.predict(X_test_won)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test_won_encoded, y_pred)
    
    return accuracy
# Create an Optuna study object and specify the optimization direction
study = optuna.create_study(direction="maximize")

# Optimize the study using the objective function
study.optimize(objective, n_trials=100, n_jobs=-1)  # Adjust n_trials as needed

# Output the best trial
print(f"Best trial:\n{study.best_trial}")
print(f"Best accuracy: {study.best_value}")
print(f"Best hyperparameters: {study.best_trial.params}")
# Extract the best hyperparameters
best_params = study.best_trial.params

# Train the final model using the best hyperparameters
best_xgb_classifier = XGBClassifier(**best_params, random_state=42, use_label_encoder=False, eval_metric='mlogloss')

# Fit the model on the full training data
best_xgb_classifier.fit(X_resampled_won, y_resampled_won_encoded)

# Make predictions on the test data
y_pred_best = best_xgb_classifier.predict(X_test_won)

# Evaluate the model
print("Classification Report for 'Won' with Best Hyperparameters:")
print(classification_report(y_test_won_encoded, y_pred_best))

# Confusion Matrix
print("Confusion Matrix for 'Won' with Best Hyperparameters:")
print(confusion_matrix(y_test_won_encoded, y_pred_best))


best_parameters XGBoost -> Best accuracy: 0.5294117647058824
Best hyperparameters: {'n_estimators': 121, 'max_depth': 10, 'learning_rate': 0.2602322469736991, 'subsample': 0.9877578858959679, 'colsample_bytree': 0.9998237546974039, 'gamma': 1.303410777487739, 'reg_alpha': 0.41009343222603883, 'reg_lambda': 0.8400173246665398, 'min_child_weight': 10}

best_parameters RF -> Best accuracy: 0.5294117647058824
Best hyperparameters: {'n_estimators': 224, 'max_depth': 7, 'min_samples_split': 7, 'max_features': 'sqrt', 'min_samples_leaf': 2}

# Base case for comparison against Models

In [None]:
encoded_df_base.columns

In [None]:
# create a histogram for encoded_df_base['Age']
plt.hist(encoded_df_base.loc[encoded_df_base['Won'] == 1, 'Age'], bins=100)

# 1. **Survival Analysis**

In [None]:
kmf = KaplanMeierFitter()
kmf.fit(durations=encoded_df_base['Age'], event_observed=encoded_df_base['Won'])
kmf.plot()

In [None]:
encoded_df_base.head()

In [None]:
# groupby Opportunity ID and get the first value from 'Type', 'Age', 'Won', 'Contracted Opportunity ACV' and 'Hospital Count' and find the sum of 'Contracted Weekly Hours',
grouped_for_logrank = ofph_index_reset_sel[~ofph_index_reset_sel['Category'].isin([
    'Implementation', 'Technology', 'Advisory - Spark', 'Advisory - Validation'])].groupby(['Opportunity ID', 'Category']).agg({
    'Type': 'first', 'Age': 'first', 'Won': 'first', 'Contracted Opportunity ACV': 'first', 
    'Hospital Count': 'first', 'Contracted Weekly Hours': 'sum'#, 'Opportunity ID': 'nunique'
    }).reset_index()


In [None]:
count_of_opps_per_cat = grouped_for_logrank.groupby('Category').agg({'Opportunity ID': 'nunique'}).sort_values('Opportunity ID', ascending=False)
cats_gr25 = list(count_of_opps_per_cat[count_of_opps_per_cat['Opportunity ID'] > 25].index)
cats_gr25

In [None]:
grouped_for_logrank

In [None]:
grouped_for_logrank.columns

In [None]:
data1 = grouped_for_logrank[grouped_for_logrank['Category'] == 'CAT1'].copy()
data2 = grouped_for_logrank[grouped_for_logrank['Category'] == 'CAT2'].copy()
result = logrank_test(data1['Age'], data2['Age'], 
                      data1['Won'], data2['Won'])
print(result.p_value)

In [None]:
kmf = KaplanMeierFitter()
kmf.fit(durations=data2['Age'], event_observed=data2['Won'])
kmf.plot()

In [None]:
df = encoded_df_merged_reg.loc[:, 
    ((~encoded_df_merged_reg.columns.str.startswith('Category_')) |
    (encoded_df_merged_reg.columns.str.replace('Category_', '').isin(cats_gr25)))
].copy()

# Prepare the target variable
df['event_0_30'] = (df['Age'] <= 30).astype(int)
df['event_31_60'] = ((df['Age'] > 30) & (df['Age'] <= 60)).astype(int)
df['event_60_plus'] = (df['Age'] > 60).astype(int)

# Prepare the target variable
# df['event'] = pd.cut(df['Age'], bins=[-1, 30, 60, df['Age'].max()], labels=['0-30', '31-60', '60+'])
df['duration'] = df['Age']

# Prepare features
features = ['Category', 'Type', 'Contracted Opportunity ACV', 'Hospital Count', 'Contracted Weekly Hours']

# Split the data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=9999)

# Add event columns
train_df['event_0_30'] = train_df['event_0_30']
train_df['event_31_60'] = train_df['event_31_60']
train_df['event_60_plus'] = train_df['event_60_plus']
test_df['event_0_30'] = test_df['event_0_30']
test_df['event_31_60'] = test_df['event_31_60']
test_df['event_60_plus'] = test_df['event_60_plus']

# Fit the Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(train_df, duration_col='duration', event_col='event_0_30')

# Print the model summary
print(cph.print_summary())

# Predict survival function for test data
test_survival_func_0_30 = cph.predict_survival_function(test_df)

# cph2 = CoxPHFitter()
# cph2.fit(train_df, duration_col='duration', event_col='event_31_60')
# test_survival_func_31_60 = cph2.predict_survival_function(test_df)

# cph3 = CoxPHFitter()
# cph3.fit(train_df, duration_col='duration', event_col='event_60_plus')
# test_survival_func_60_plus = cph3.predict_survival_function(test_df)

# Function to predict closure within a given timeframe
def predict_closure(survival_func, timeframe):
    if timeframe == '0-30':
        return 1 - survival_func.loc[30]
    elif timeframe == '31-60':
        return 1 - survival_func.loc[60]
    else:
        return 1 - survival_func.loc[test_df['Age'].max()]

# Predict closure for 0-30 days, 31-60 days, and beyond 60 days
predictions_0_30_days = predict_closure(test_survival_func_0_30, '0-30')
# predictions_31_60_days = predict_closure(test_survival_func_31_60, '31-60')
# predictions_60_plus_days = predict_closure(test_survival_func_60_plus, '60+')

# Add predictions to the test dataframe
test_df['prob_close_0_30_days'] = predictions_0_30_days.values
# test_df['prob_close_31_60_days'] = predictions_31_60_days.values
# test_df['prob_close_60_plus_days'] = predictions_60_plus_days.values

# Function to evaluate predictions
def evaluate_predictions(df, prob_cols, actual_cols, threshold=0.5):
    df['predicted'] = df[prob_cols].idxmax(axis=1)
    df['actual_0_30'] = df['event_0_30']
    df['actual_31_60'] = df['event_31_60']
    df['actual_60_plus'] = df['event_60_plus']
    
    precision_0_30 = ((df['predicted'] == '0-30') & (df['actual_0_30'] == 1)).sum() / ((df['predicted'] == '0-30')).sum() if ((df['predicted'] == '0-30')).sum() > 0 else 0
    recall_0_30 = ((df['predicted'] == '0-30') & (df['actual_0_30'] == 1)).sum() / df['actual_0_30'].sum() if df['actual_0_30'].sum() > 0 else 0
    f1_score_0_30 = 2 * (precision_0_30 * recall_0_30) / (precision_0_30 + recall_0_30) if (precision_0_30 + recall_0_30) > 0 else 0

    precision_31_60 = ((df['predicted'] == '31-60') & (df['actual_31_60'] == 1)).sum() / ((df['predicted'] == '31-60')).sum() if ((df['predicted'] == '31-60')).sum() > 0 else 0
    recall_31_60 = ((df['predicted'] == '31-60') & (df['actual_31_60'] == 1)).sum() / df['actual_31_60'].sum() if df['actual_31_60'].sum() > 0 else 0
    f1_score_31_60 = 2 * (precision_31_60 * recall_31_60) / (precision_31_60 + recall_31_60) if (precision_31_60 + recall_31_60) > 0 else 0

    precision_60_plus = ((df['predicted'] == '60+') & (df['actual_60_plus'] == 1)).sum() / ((df['predicted'] == '60+')).sum() if ((df['predicted'] == '60+')).sum() > 0 else 0
    recall_60_plus = ((df['predicted'] == '60+') & (df['actual_60_plus'] == 1)).sum() / df['actual_60_plus'].sum() if df['actual_60_plus'].sum() > 0 else 0
    f1_score_60_plus = 2 * (precision_60_plus * recall_60_plus) / (precision_60_plus + recall_60_plus) if (precision_60_plus + recall_60_plus) > 0 else 0

    return {
        'Precision_0_30': precision_0_30,
        'Recall_0_30': recall_0_30, 
        'F1-Score_0_30': f1_score_0_30,
        'Precision_31_60': precision_31_60,
        'Recall_31_60': recall_31_60,
        'F1-Score_31_60': f1_score_31_60,
        'Precision_60_plus': precision_60_plus,
        'Recall_60_plus': recall_60_plus,
        'F1-Score_60_plus': f1_score_60_plus
    }

# Evaluate predictions
prob_cols = ['prob_close_0_30_days']#, 'prob_close_31_60_days', 'prob_close_60_plus_days']
actual_cols = ['event_0_30']#, 'event_31_60', 'event_60_plus']
results = evaluate_predictions(test_df, prob_cols, actual_cols)
print("Prediction results:", results)

In [None]:
df = encoded_df_merged_reg.loc[:, 
    ((~encoded_df_merged_reg.columns.str.startswith('Category_')) |
    (encoded_df_merged_reg.columns.str.replace('Category_', '').isin(cats_gr25)))
].drop(columns='Won').copy()

# Prepare the target variable
# df['event_0_30'] = (df['Age'] <= 30).astype(int)
# df['event_31_60'] = ((df['Age'] > 30) & (df['Age'] <= 60)).astype(int)
# df['event_60_plus'] = (df['Age'] > 60).astype(int)

# Convert categorical variables to numeric
# le = LabelEncoder()
# df['Category'] = le.fit_transform(df['Category'])
# df['Type'] = le.fit_transform(df['Type'])

# Create the target variable (0-30, 31-60, 61+)
df['Age_bin'] = pd.cut(df['Age'], bins=[-1, 30, 60, float('inf')], labels=[0, 1, 2])
df['Age_bin'] = df['Age_bin'].astype('Int64')
# Create the event indicator (1 if the opportunity closed, 0 if it's still open)
# df['event'] = 1  # Assuming all opportunities in the dataset have closed

df.info()

In [None]:
# Split the data into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Prepare the data for the Cox Proportional Hazards model
cph_columns = ['Contracted Opportunity ACV', 'Hospital Count', 'Contracted Weekly Hours']

# Fit the Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(train_df, duration_col='Age', event_col='event') #, strata=cph_columns

# Print the model summary
print(cph.print_summary())



In [None]:
test_df.columns

In [None]:
test_df.groupby(['Contracted Opportunity ACV', 'Hospital Count', 'Contracted Weekly Hours']).size()

In [None]:
test_df.drop(columns='Age_bin').columns

In [None]:
# Make predictions on the test set
test_predictions = cph.predict_expectation(test_df)

# Convert predictions to age bins
test_predictions_bin = pd.cut(test_predictions, bins=[-1, 30, 60, float('inf')], labels=[0, 1, 2])

# Calculate accuracy
accuracy = (test_predictions_bin == test_df['Age_bin']).mean()
print(f"Accuracy: {accuracy:.2f}")

# Calculate precision, recall, and F1-score for each bin

print(classification_report(test_df['Age_bin'], test_predictions_bin))

# Plot the survival curves for different categories

# for category in df['Category'].unique():
#     mask = (train_df['Category'] == category)
#     cph.plot_partial_effects('Category', values=[category], plot_baseline=False)
#     plt.title(f"Survival Curve for Category {category}")
#     plt.show()

# # Feature importance
# cph.plot_covariate_groups('Contracted Opportunity ACV', values=[df['Contracted Opportunity ACV'].quantile(0.25), 
#                                                                 df['Contracted Opportunity ACV'].median(), 
#                                                                 df['Contracted Opportunity ACV'].quantile(0.75)])
# plt.title("Impact of Contracted Opportunity ACV on Survival")
# plt.show()

In [None]:
merged = pd.merge(test_df[['Age_bin']], test_predictions_bin.to_frame().rename(columns={0: 'Age_bin_pred'}), left_index=True, right_index=True)
merged.loc[:, 'diff'] = merged['Age_bin'] - merged['Age_bin_pred'].astype(int)
merged[merged['diff'] != 0]

In [6]:
important_features = ['Opportunity ID', ## String
                      'Age', # Int
                      'Type', # String
                      'Category', # String
                      'Contracted Weekly Hours', # Float
                      'Backlog', # Bool
                      'Contracted Opportunity ACV', # Float
                      'Hospital Count', # Int
                      'Close Date'
                    #   'Implemented Weekly Hours', # Float choosing Contracted for classification
                      ]
cats_gr25 = ['SPECIFIC CATS']
ofph_fil = ofph[(ofph['Type'].isin(['New Client', 'Existing Client - New Service Line', 'Existing Client - Service Line Expansion'])) &
                 (ofph['Exclude from Resource Requests'] == 'false')].copy()
ofph_index_reset = ofph_fil.reset_index()
ofph_index_reset_select_cats = ofph_index_reset[ofph_index_reset['Category'].isin(cats_gr25)].copy()
# convert the Backlog string column to a boolean by changing the true values to 1 and the false values to 0
ofph_index_reset_select_cats['Backlog'] = ofph_index_reset_select_cats['Backlog'].str.replace('true', '1').str.replace('false', '0').astype(int)
ofph_index_reset_sel = ofph_index_reset_select_cats[important_features]

# groupby Opportunity ID and get the first value from 'Age', 'Backlog', and 'Hospital Count'
abhc = ofph_index_reset_sel.groupby(['Opportunity ID', 'Category']).first()[['Age', 'Backlog', 'Hospital Count', 'Contracted Opportunity ACV', 'Close Date']]
# group by Opportunity ID and calculate the sum for all the other columns
rest = ofph_index_reset_sel.drop(['Age', 'Backlog', 'Hospital Count', 'Contracted Opportunity ACV', 'Close Date'], axis=1).groupby(['Opportunity ID', 'Category']).sum()
# merge the 2 back together
ofph_index_reset_sel_merged = pd.merge(abhc, rest, left_index=True, right_index=True)
ofph_index_reset_sel_merged.loc[:, 'event'] = (ofph_index_reset_sel_merged['Close Date'] <= datetime.now()).astype(int)
merged_ready = ofph_index_reset_sel_merged.reset_index()

In [None]:
merged_ready

# KaplanMeierFitter

In [None]:
kmf = KaplanMeierFitter()
# split grouped_for_logrank into train and test
train, test = train_test_split(merged_ready, test_size=0.2)
# fit the kmf model to the training data
kmf.fit(durations=train['Age'], event_observed=train['event'])
# predict the survival probability for the test data
survival_prob = kmf.predict(test['Age']).to_frame()
# add a column to survival_prob that is 1 if the survival probability is greater than 0.75, otherwise 0
survival_prob['event'] = survival_prob['KM_estimate'] > 0.95
survival_prob['event'] = survival_prob['event'].astype(int)

# # calculate the log-rank p-value
logrank_test(durations_A=test['Age'], durations_B=survival_prob.index, event_observed_A=test['event'], event_observed_B=survival_prob['event']).p_value

In [None]:
train['event'].value_counts()

In [None]:
test_index_set = test.reset_index().rename(columns={'event':'actual_event'})
survival_prob['row_number'] = range(1, len(survival_prob) + 1)
test_index_set['row_number'] = range(1, len(test_index_set) + 1)

prob_test_merged = survival_prob.reset_index().merge(test_index_set[['row_number', 'actual_event']], on='row_number', how='inner')
prob_test_merged.loc[:, 'diff'] = prob_test_merged['actual_event'] - prob_test_merged['event']
print(len(prob_test_merged[prob_test_merged['diff'] != 0])/len(prob_test_merged), len(prob_test_merged[prob_test_merged['diff'] != 0]), len(prob_test_merged))

In [None]:
# plot the index of survival_prob on the x-axis and the 'KM_estimate' on the y-axis
plt.scatter(survival_prob.index, survival_prob['KM_estimate'])
# set the title and axis labels
plt.title('Kaplan-Meier Estimate of Survival Probability')
plt.xlabel('Age')
plt.ylabel('Survival Probability')
# show the plot
plt.show()

In [None]:
# split grouped_for_logrank into train and test
train, test = train_test_split(grouped_for_logrank, test_size=0.2, random_state=9999)
# fit the kmf model to the training data
kmf.fit(durations=train['Age'], event_observed=train['Won'])
# predict the survival probability for the test data
survival_prob = kmf.predict(test['Age']).to_frame()
survival_prob
# add a column to survival_prob that is 1 if the survival probability is greater than 0.75, otherwise 0
survival_prob['Won'] = survival_prob['KM_estimate'] > 0.75
survival_prob['Won'] = survival_prob['Won'].astype(int)
survival_prob
# # calculate the log-rank p-value
logrank_test(durations_A=test['Age'], durations_B=survival_prob.index, event_observed_A=test['Won'], event_observed_B=survival_prob['Won']).p_value

In [None]:
grouped_for_logrank.columns

In [None]:
cph = CoxPHFitter()
cph.fit(encoded_df_base[['Age','Won', 'Backlog', 'Contracted Weekly Hours', 'Hospital Count']].copy(), duration_col='Age', event_col='Won')
cph.print_summary()

# Monthly Snapshots

In [3]:
ofph_monthly = ofph[(ofph['Type'].isin(['New Client', 'Existing Client - New Service Line', 'Existing Client - Service Line Expansion'])) &
                    (ofph['Exclude from Resource Requests'] == 'false')].copy()
ofph_monthly_idx_reset = ofph_monthly.reset_index()
# filter out any Technology category
ofph_monthly_idx_reset = ofph_monthly_idx_reset[~ofph_monthly_idx_reset['Category'].str.contains('Technology')]
# groupby Opportunity ID and get the first value from 'Age', 'Backlog', and 'Hospital Count'
abhc = ofph_monthly_idx_reset.groupby(['Opportunity ID', 'Category']).first()[['Opportunity Start Date', 'Close Date', 'Stage']]
# group by Opportunity ID and calculate the sum for all the other columns
rest = ofph_monthly_idx_reset.groupby(['Opportunity ID', 'Category'])[['Contracted Weekly Hours', 'Implemented Weekly Hours']].sum()
# merge the 2 back together
ofph_monthly_idx_reset_merged = pd.merge(abhc, rest, left_index=True, right_index=True)
ofph_monthly_ready = ofph_monthly_idx_reset_merged.reset_index()

In [None]:
ofph_monthly_ready['Stage'].unique()

In [None]:
def create_monthly_snapshots(opportunities_df, start_date, end_date):
    """
    Create monthly snapshots of contracted and implemented hours by category.
    
    Parameters:
    opportunities_df: DataFrame with columns ['Category', 'Start_Date', 'End_Date', 
                     'Contracted_Hours', 'Implemented_Weekly_Hours', 'Stage']
    start_date: datetime object for the start of the analysis period
    end_date: datetime object for the end of the analysis period
    """
    
    # Create a date range for all months in the analysis period
    months = pd.date_range(start=start_date, end=end_date, freq='MS')
    
    snapshots = []
    
    for month_start in months:
        month_end = month_start + pd.offsets.MonthEnd(0)
        
        # Filter opportunities that are active in the current month
        active_mask = (
            (opportunities_df['Opportunity Start Date'] < month_start) & 
            (opportunities_df['Close Date'] >= month_end)
        )
        active_opportunities = opportunities_df[active_mask]
        
        # Calculate contracted hours by category
        contracted_hours = (
            active_opportunities
            .groupby('Category')['Contracted Weekly Hours']
            .sum()
            .reset_index()
        )
        
        # Calculate implemented hours only for opportunities with 'Contract Signed (Implementing)' stage
        implementing_opportunities = opportunities_df[ 
            ((opportunities_df['Stage'] == 'Contract Signed (Implementing)') | (opportunities_df['Stage'] == 'Customer Invoiced')) &
            (
            (opportunities_df['Close Date'] >= month_start) &
            (opportunities_df['Close Date'] <= month_end)
        )
        ].copy()
        
        implemented_hours = (
            implementing_opportunities
            .groupby('Category')['Implemented Weekly Hours']
            .sum()
            .reset_index()
        )
        
        # Merge the results
        month_snapshot = pd.merge(
            contracted_hours,
            implemented_hours,
            on='Category',
            how='outer'
        ).fillna(0)
        
        # Add month information
        month_snapshot['Month'] = month_start.strftime('%Y-%m')
        
        snapshots.append(month_snapshot)
    
    # Combine all monthly snapshots
    final_df = pd.concat(snapshots, ignore_index=True)
    
    # Reorder columns
    final_df = final_df[['Month', 'Category', 'Contracted Weekly Hours', 'Implemented Weekly Hours']]
    
    return final_df

ofph_snapshots = create_monthly_snapshots(ofph_monthly_ready, '8/1/2022', datetime.now())
ofph_snapshots.info()

In [None]:
ofph_snapshots.loc[:, 'perc_realized'] = ofph_snapshots['Implemented Weekly Hours'] / ofph_snapshots['Contracted Weekly Hours'] 
ofph_snapshots[ofph_snapshots['Category'] == 'PCI']

In [None]:
# create a double bar graph that shows the Contracted Weekly Hours and Implemented Weekly Hours by Month
pci = ofph_snapshots[(ofph_snapshots['Category'] == 'PCI') & (ofph_snapshots['Month'] > '2023-12')]
print(pci['perc_realized'].mean(),
pci['perc_realized'].median())

In [22]:
cat = ofph_snapshots.loc[(ofph_snapshots['Month'] > '2023-12'), ['Category', 'Contracted Weekly Hours', 'Implemented Weekly Hours']].groupby('Category').mean()
cat.loc[:, 'perc_realized'] = cat['Implemented Weekly Hours'] / cat['Contracted Weekly Hours']


In [None]:
cat