# Data 5740 - 2025 Midterm
### Commercial/Industrial Construction:
### Predicting Customer Retention Spend

You are an analyst at a commercial/industrial construction firm. Leadership wants to **improve customer retention and expansion**, but doesn't know how to prioritize where to focus these improvements or which focus areas will have the biggest impact on future revenue. After a project finishes, some clients sign additional work within 12 months; others don't. You've been asked to **predict expected next-12-months revenue from each customer** based on project attributes and delivery performance, and to **identify which levers most influence retention spend.**

Your deliverable should be in the form of a Snowflake notebook and follow the **Steps in Model Building** we use in class (included in the rubric below) and demonstrate solid data hygiene, reasoning, and statistical rigor.

### Data provided
File: midterm_construction_projects.csv (650 rows; one row per completed project)

**Target:**  
* **next12mo_spend** - dollars of revenue expected from the customer in the 12 months after the indexed project.

**Possible Features:**  
* Customer: industry, region  
* Project: project_type, contract_type, project_size_usd, scope_complexity, close_time_days  
* Delivery: n_change_orders, on_time_milestones_pct, safety_incidents, time_overrun_pct, cost_overrun_pct, payment_delay_days, pm_experience_years, discount_pct, is_union_site  
* Context: prior_relationship_years, competition_count 

---

**1. Identify & Clarify the Problem (10 pts)**  
* In your own words: business objective, decision context, and why prediction + interpretation matter.
---
I am tasked with modeling customer retention revenues based on a number of factors. This model needs to accomplish two things:  
1. I need to predict expected 12-month revenues from each customer based on various features in the data. These include customer data like industry and region, information about each specific project, information about the delivery of the project, and the context of the project/customer.
2. I need to identify the most important factors to influence spend. This involves looking at the results of the model and interpreting the weights of each coefficient and its significance.  

I will make my decisions for this model based on the overall predictive accuracy of the model and also by the model's ability to describe the relationships between the variables.

Prediction and interpretation are both important for this exercise because the stated objectives lend themselves to a need to not only predict values, but also understand where to invest retention efforts. Understanding the coefficients and significance for all of the model's components will allow us to understand where to better focus retention efforts in order to maximize 12 month revenues.

**2. Background (10 pts)**  
* Briefly describe factors that *could* drive retention spend in construction (cite a source if you like)
---
Project size will be important in determining 12-month revenues because a larger project means more revenue.

Customer satisfaction often influences retention because satisfied customers are easier to retain. Customer appreciation efforts like discounts can also be heavily influential in retaining top customers. I know from my experience in Supply Chain that when we are evaluating bids and selecting suppliers or contractors, payment discounts are always an influential factor. For example, if multiple companies have comparable pricing and all of them offer 2/10 Net 30 terms, a supplier offering 2/15 Net 45 would have a leg up on its competitors. This also ties into contract type: a CostPlus contract gives the customer more transparency into costs, and a Guaranteed Maximum Price (GMP) contract lets the customer know the maximum they will pay regardless of the actual cost.


Project management is particularly influential in the construction industry. This includes how well the project stays on schedule and on budget, given in our data by the variables **time_overrun_pct** and **cost_overrun_pct**. On-time milestones are also important here, as clients are more likely to return to a contractor that stays on schedule. This can also relate to project manager experience, under the assumption that a more experienced project manager is better (which is not always the case). Amendments made to the original agreement (change orders) are also worth looking at here.

I expect that some of the most important drivers of customer retention will be the project's efficiency (time and cost overrun) and customer satisfaction. I would also expect project complexity to come into play, since a more complex project will likely have less competition, along with prior relationship, since it is harder to lose a customer that you have had for a long time.

**3. Select Variables (10 pts)**  
* Propose a starting set of predictors. Briefly justify any exclusions or transformations.
----
Off the bat, I expect the set of predictors to include customer and project data: industry, region, project_type, contract_type. I will dummy encode these categorical variables. project_size_usd will also be relevant because more revenue is earned from larger projects. 

The customer relationship will also likely be relevant to the model, including predictors prior_relationship_years, discount_pct, retained_12mo, and customer_satisfaction.  

The project delivery and project management variables are ones I am also expecting to be important to the model: safety_incidents, cost_overrun_pct, time_overrun_pct, n_change_orders, pm_experience_years, and on_time_milestones_pct.  

The final model will likely exclude the following fields: competition_count, is_union_site, close_time_days, and payment_delay_days.

**4. Acquire Data (10 pts)**
* Load the provided CSV
* Summarize sample size and data dictionary (your own one-liner per field).

In [None]:
import pandas as pd

construction_projects = pd.read_csv('midterm_construction_projects.csv')

construction_projects.head()

In [None]:
print(construction_projects.shape)
construction_projects.dtypes

In [None]:
categorical_cols = construction_projects[['industry',
                                         'region',
                                         'project_type',
                                         'contract_type',
                                         'is_union_site',
                                         'retained_12mo']]

numeric_cols = construction_projects[['project_size_usd',
                                     'scope_complexity',
                                     'close_time_days',
                                     'prior_relationship_years',
                                     'competition_count',
                                     'discount_pct',
                                     'pm_experience_years',
                                     'safety_incidents',
                                     'on_time_milestones_pct',
                                     'customer_satisfaction',
                                     'cost_overrun_pct',
                                     'time_overrun_pct',
                                     'payment_delay_days',
                                     'n_change_orders',
                                     'next12mo_spend']]

numeric_cols.describe()

There is a total of 650 rows in the dataset. There are missing values for the following columns:  
* discount_pct  
* on_time_milestones_pct
* pm_experience_years  

Looking at **on_time_milestones_pct**, it is scaled 0-100. All other percent variables in the data are scaled 0-1. I will transform this by dividing each value by 100 so all the percents are scaled the same. This will aid in interpretability. 

In [None]:
construction_projects['on_time_milestones_pct'] = construction_projects['on_time_milestones_pct'] / 100
construction_projects['on_time_milestones_pct'].describe()

In [None]:
# create dictionary using dtypes and blank description column
data_dictionary = pd.DataFrame({
    "Field Name": construction_projects.columns,
    "Data Type": construction_projects.dtypes.astype(str),
    "Description":""
})

# create descriptions for fields
descriptions = {
    "project_id":"unique identifier for each project",
    "industry":"industry sector for the project",
    "region":"geographic location (region) of the project",
    "project_type":"type of construction project",
    "contract_type":"type of contract agreement",
    "is_union_site":"union site indicator, 0/1 binary",
    "project_size_usd":"total volume of project in dollars",
    "scope_complexity":"complexity of project on 1-5 scale",
    "close_time_days":"days to close deal with customer",
    "prior_relationship_years":"duration in years of relationship with customer prior to closing",
    "competition_count":"number of other construction companies evaluated for project",
    "discount_pct":"rate (percent) of discount awarded on project",
    "pm_experience_years":"years of experience for project manager",
    "safety_incidents":"number of safety incidents that occurred on project",
    "on_time_milestones_pct":"percent of project milestones met on schedule",
    "customer_satisfaction":"score 1-5 of customer satisfaction",
    "cost_overrun_pct":"percent of cost overrun compared to proposal",
    "time_overrun_pct":"percent of time overrun compared to planned schedule",
    "payment_delay_days":"delay in days of payment received from customer",
    "n_change_orders":"number of change orders (or contract amendments) issued",
    "next12mo_spend":"dollars of revenue expected from the customer in the 12 months after the indexed project",
    "retained_12mo":"customer retained for 12 months indicator, 0/1 binary"
}

# add the descriptions to the data dictionary
data_dictionary["Description"] = data_dictionary["Field Name"].map(descriptions)

print(f"construction_projects contains {len(construction_projects)} rows.")
data_dictionary

**5. Choose Modeling Approach (20 pts)**  
* Primary: **Multiple Linear Regression** (OLS) predicting next12mo_spend.
* Mention alternatives you considered (e.g., log-transform target; regularization; or classification for retained_12mo) and explain why OLS is appropriate here.
---
I chose to use Ordinary Least Squares (OLS) regression to predict next12mo_spend in its original form. The spend data is already close to normal, so there is no immediate need to transform the data. Keeping the numbers in dollars makes the results of the model easier to communicate to leadership - the predictions will be in real, meaningful units.  

A few other options could be considered for this analysis, but they are less appropriate for the question being asked. A log transformation of the target could help if the data were skewed, but it is unnecessary here and would require converting the predictions back to dollars to communicate the results of the model. Classification models would be well-suited if we were simply trying to predict whether a customer spends or not, but the business question is not *if* a customer will spend in the next 12 months, but *how much* a customer will spend.  

Tree-based models can handle nonlinear relationships better, but they are harder to interpret and their results are harder to explain to stakeholders. Since the other question asked is which features are the most important to the model, interpretability is key in this exercise. Overall, OLS with the untransformed spend data provides the best balance of accuracy, interpretability, and business relevance.

**6. Exploratory Data Analysis & Assumptions (20 pts)**  
* Descriptives and **plots** for key variables (distributions, pairwise relationships).
* Address missingness (where, how much, pattern).
* Consider transformations (e.g. log of project_size_usd or of the target if skewed).
* Multicollinearity check (e.g., **VIF**). State modeling assumptions.

In [None]:
target = numeric_cols['next12mo_spend']
numeric_predictors = numeric_cols.drop(columns = ['next12mo_spend'])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create subplots
fig, axes = plt.subplots(1, 2, figsize = (12,5))

# First subplot: KDE distribution
sns.kdeplot(data=construction_projects,
           x=target,
           fill=True,
           ax=axes[0])

# Add mean and median lines
mean_val = target.mean()
median_val = target.median()
axes[0].axvline(mean_val,
                color='blue',
                linestyle='--',
                linewidth=1,
                label=f'Mean: {mean_val:.2f}')
axes[0].axvline(median_val,
                color='red',
                linestyle='-',
                linewidth=1,
                label=f'Median: {median_val:.2f}')
axes[0].legend()
axes[0].set_title('Distribution of Next 12-Month Spend')

# Second subplot: Boxplot
sns.boxplot(data=construction_projects,
           x=target,
           ax=axes[1],
           color='lightgray')
axes[1].set_title('Boxplot of Next 12-Month Spend')

# Adjust layout

plt.tight_layout()
plt.show()

The distribution for the target variable looks fairly bell-shaped and unimodal, and the mean and median are very close to each other. The boxplot has a few outliers, but they are not very far from the whiskers and likely aren't too influential on the model. The whiskers being of similar lengths suggests the distribution is roughly symmetric. I will move forward without transforming the target variable for now.
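
As a rough numeric companion to the plots, the same read (symmetry, mean vs. median, number of boxplot outliers) can be summarized in a few statistics. This is a minimal sketch on simulated values, not the midterm data; `summarize_shape` is a hypothetical helper name and the simulated spend distribution is an assumption.

```python
import numpy as np
import pandas as pd


def summarize_shape(series: pd.Series) -> dict:
    """Return skewness, mean-median gap, and the count of 1.5*IQR outliers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((series < lower) | (series > upper)).sum())
    return {
        "skew": float(series.skew()),
        "mean_minus_median": float(series.mean() - series.median()),
        "n_outliers": n_outliers,
    }


# Simulated, roughly bell-shaped spend values (assumption, not the real file)
rng = np.random.default_rng(0)
fake_spend = pd.Series(rng.normal(loc=300_000, scale=80_000, size=650))

shape = summarize_shape(fake_spend)
print(shape)
```

On the real data, the same call would be `summarize_shape(construction_projects['next12mo_spend'])`. A skewness near zero and a small mean-median gap would back up the decision to leave the target untransformed.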

In [None]:
for num_col in numeric_predictors:
    fig,axes = plt.subplots(1, 2, figsize=(25,10))

    # First subplot: KDE distributions

    # Integer values get funky (i.e. safety incidents), so we will handle those differently
    if pd.api.types.is_integer_dtype(construction_projects[num_col]):
        sns.histplot(data=construction_projects,
                    x=num_col,
                    discrete=True,
                    ax=axes[0],
                    color='gray')
    else:
        sns.kdeplot(data = construction_projects,
                    x = num_col,
                    color='gray',
                    fill=True,
                    ax=axes[0])

    # add mean and median lines
    mean_val = construction_projects[num_col].mean()
    median_val = construction_projects[num_col].median()
    axes[0].axvline(mean_val,
                    color='blue',
                    linestyle='--',
                    linewidth=1,
                    label=f'Mean: {mean_val:.2f}')
    axes[0].axvline(median_val,
                    color='red',
                    linestyle='-',
                    linewidth=1,
                    label=f'Median: {median_val:.2f}')
    axes[0].legend()
    axes[0].set_title(f'Distribution of {num_col}')

    # Second subplot: Boxplot
    sns.boxplot(data = construction_projects,
               x = num_col,
               ax=axes[1],
               color='lightgray')
    axes[1].set_title(f'Boxplot of {num_col}')

    #Adjust layout
    plt.tight_layout()
    plt.show()

In [None]:
construction_projects['safety_incidents'].value_counts()

In [None]:
sns.pairplot(numeric_cols)

Many of the distributions are fairly normal already, so not many transformations are needed. To preserve interpretability in the model, I will try to minimize transformations. However, there are a few variables I want to transform to better work with the model:  
* project_size_usd: this field is very heavily skewed with a wide spread of data. I am going to use a log transform on this.  
* safety_incidents: this field is tricky because it takes on integer values ranging from 0 to 5. The data is heavily concentrated at 0 incidents, with 344 projects. This could heavily interfere with the model. I predict that safety incidents will be important to the overall model, so I will transform this into a binary field that simply takes a 0/1 value indicating whether any incidents occurred. This will give me a 344-306 split, which will be more usable and still meaningful in analysis.
* prior_relationship_years: this field has some skew, but it is not as extreme as project_size_usd. I will keep this field as-is for the initial model, but if the residuals are skewed, I will revisit this variable's effect on the overall model.  

Cost and time overrun look to be almost perfectly correlated, so I will likely exclude one of them from the model. When I run VIF, I will pay close attention to these two.
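
The pairplot impression can be checked numerically by listing the predictor pairs whose absolute correlation exceeds a cutoff. The sketch below uses simulated columns that only borrow the field names; `top_correlated_pairs` and the 0.8 cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once (no self-pairs)
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Simulated data (assumption): cost overrun is built to track time overrun
rng = np.random.default_rng(0)
n = 650
time_overrun = rng.normal(0.10, 0.10, n)
demo = pd.DataFrame({
    "time_overrun_pct": time_overrun,
    "cost_overrun_pct": time_overrun + rng.normal(0, 0.02, n),
    "n_change_orders": rng.poisson(3, n),
})

pairs = top_correlated_pairs(demo)
print(pairs)
```

Run against `numeric_predictors`, any pair that shows up here is a candidate for dropping one member before fitting.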

In [None]:
# Create binary safety_incidents category
construction_projects['safety_incidents_occurred'] = (construction_projects['safety_incidents'] > 0).astype(int)

# add new field to the categorical_cols variables
categorical_cols = categorical_cols.copy()
categorical_cols.loc[:, 'safety_incidents_occurred'] = construction_projects['safety_incidents_occurred']

categorical_cols

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# log transform project_size_usd
construction_projects['project_size_usd_norm'] = np.log(construction_projects['project_size_usd'])

# create figure for before and after plotting
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Original distribution
sns.histplot(data=construction_projects,
             x='project_size_usd',
             ax=axes[0],
             color='gray',
             bins=50)
axes[0].axvline(construction_projects['project_size_usd'].mean(), color='blue', linestyle='--', label='Mean')
axes[0].axvline(construction_projects['project_size_usd'].median(), color='red', linestyle='-', label='Median')
axes[0].set_title('Original Distribution of project_size_usd')
axes[0].legend()

# Transformed distribution
sns.histplot(data=construction_projects,
             x='project_size_usd_norm',
             ax=axes[1],
             color='steelblue',
             bins=50)
axes[1].axvline(construction_projects['project_size_usd_norm'].mean(), color='blue', linestyle='--', label='Mean')
axes[1].axvline(construction_projects['project_size_usd_norm'].median(), color='red', linestyle='-', label='Median')
axes[1].set_title('Log-Transformed Distribution of project_size_usd')
axes[1].legend()

plt.tight_layout()
plt.show()

After log transforming project_size_usd, the data appears fairly normal. I will use this in my model.

In [None]:
for cat_col in categorical_cols.columns:
    fig, axes = plt.subplots(1, 2, figsize=(25, 10))

    # First subplot: Bar plot of category counts
    sns.countplot(data=construction_projects,
                  x=cat_col,
                  hue=cat_col,
                  ax=axes[0],
                  legend=False)
    axes[0].set_title(f'Frequency of {cat_col}')
    axes[0].tick_params(axis='x', rotation=45)

    # Second subplot: Boxplot of target variable by category
    sns.boxplot(data=construction_projects,
                x=cat_col,
                hue=cat_col,
                y=target,
                ax=axes[1])

    axes[1].set_title(f'next12mo_spend by {cat_col}')
    axes[1].tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.show()

Looking at the frequency plots and boxplots of the categorical variables, there does not seem to be a striking difference between any of the categories with regard to the target variable. I don't need to do any transformation or editing besides dummy encoding industry, region, project_type, and contract_type. I will do this after imputing the missing values.

I will encode each variable using its highest valued category as the reference column:  
* Industry: Manufacturing
* Region: Midwest
* Project Type: NewBuild
* Contract: FixedBid

In [None]:
construction_projects.isna().sum()

In [None]:
sns.heatmap(construction_projects.isna(), cbar=False)

There are missing values in discount_pct (14), pm_experience_years (24), and on_time_milestones_pct (11). Looking at the heatmap, there does not appear to be a pattern among the missing values. I will impute the missing values based on the shape of each variable's distribution.

* discount_pct: there is a slight right-skew, so the median is safer to impute with than the mean since it is more robust to outliers.
* pm_experience_years: slight right skew, median will be safer than mean
* on_time_milestones_pct: fairly symmetric, so I will impute using the average.
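
To put numbers on "where, how much, pattern", one option is a small table with the share of missing values per column and the target mean for rows with and without the value. This is a sketch on simulated data; `missingness_report` is a hypothetical helper, and comparing target means is only a rough check for a pattern, not a formal test.

```python
import numpy as np
import pandas as pd


def missingness_report(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Per-column missing count, percent, and target mean by missing status."""
    rows = []
    for col in df.columns:
        is_missing = df[col].isna()
        n_missing = int(is_missing.sum())
        if col == target or n_missing == 0:
            continue
        rows.append({
            "column": col,
            "n_missing": n_missing,
            "pct_missing": round(100 * n_missing / len(df), 2),
            "target_mean_when_missing": df.loc[is_missing, target].mean(),
            "target_mean_when_present": df.loc[~is_missing, target].mean(),
        })
    return pd.DataFrame(rows)


# Simulated data (assumption): 650 rows with 14 discount values blanked at random
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "discount_pct": rng.uniform(0, 0.15, 650),
    "next12mo_spend": rng.normal(300_000, 80_000, 650),
})
blank_rows = rng.choice(650, size=14, replace=False)
demo.loc[blank_rows, "discount_pct"] = np.nan

report = missingness_report(demo, target="next12mo_spend")
print(report)
```

If the two target means differ sharply for a column, the values may not be missing at random, and a simple median or mean fill deserves a second look.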

In [None]:
construction_projects['discount_pct_cleaned'] = construction_projects['discount_pct'].fillna(construction_projects['discount_pct'].median())
construction_projects['pm_experience_years_cleaned'] = construction_projects['pm_experience_years'].fillna(construction_projects['pm_experience_years'].median())
construction_projects['on_time_milestones_pct_cleaned'] = construction_projects['on_time_milestones_pct'].fillna(construction_projects['on_time_milestones_pct'].mean())


In [None]:
# dummy encode each categorical field and drop its reference category
reference_levels = {
    'industry': 'Manufacturing',
    'region': 'Midwest',
    'project_type': 'NewBuild',
    'contract_type': 'FixedBid',
}

for col, ref in reference_levels.items():
    # dtype=int gives 0/1 columns instead of booleans, so they stay numeric downstream
    dummies = pd.get_dummies(construction_projects[col], prefix=col, dtype=int)
    construction_projects = pd.concat([
        construction_projects,
        dummies.drop(columns=f'{col}_{ref}')
    ], axis=1)

construction_projects.head()

In [None]:
cleaned_const_proj = construction_projects[['is_union_site',
                                           'scope_complexity',
                                           'close_time_days',
                                           'prior_relationship_years',
                                           'competition_count',
                                           'customer_satisfaction',
                                           'cost_overrun_pct',
                                           'time_overrun_pct',
                                           'payment_delay_days',
                                           'n_change_orders',
                                           'next12mo_spend',
                                           'retained_12mo',
                                           'safety_incidents_occurred',
                                           'project_size_usd_norm',
                                           'discount_pct_cleaned',
                                           'pm_experience_years_cleaned',
                                           'on_time_milestones_pct_cleaned',
                                           'industry_Energy',
                                           'industry_Food&Beverage',
                                           'industry_Healthcare',
                                           'industry_Logistics',
                                           'industry_Pharma',
                                           'industry_Technology',
                                           'region_Northeast',
                                           'region_South',
                                           'region_West',
                                           'project_type_Expansion',
                                           'project_type_Retrofit',
                                           'project_type_TenantImprovement',
                                           'contract_type_CostPlus',
                                           'contract_type_GMP']]

In [None]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = cleaned_const_proj.drop(columns=['next12mo_spend'])

X = X.apply(lambda col: col.astype(int) if col.dtype == 'bool' else col)

X = X.apply(pd.to_numeric, errors='coerce')

X = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range (X.shape[1])]

print(vif_data)


As suspected, time_overrun_pct and cost_overrun_pct have very high, nearly identical VIFs. The other variables with moderately high VIFs are also fields that could be somewhat related to cost or time overrun (for example, a more experienced project manager may be less likely to overrun the project). I will drop cost_overrun_pct from the model, since cost overruns are likely driven in large part by time overruns, and then rerun the VIF to see what has changed.

In [None]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = cleaned_const_proj.drop(columns=['next12mo_spend','cost_overrun_pct'])

X = X.apply(lambda col: col.astype(int) if col.dtype == 'bool' else col)

X = X.apply(pd.to_numeric, errors='coerce')

X = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range (X.shape[1])]

print(vif_data)

After dropping cost_overrun_pct, all of the VIFs are now at an acceptable level.

**7. Fit the Model (20 pts)**  
* Fit your baseline OLS. State the formula/design explicitly (how you encoded categoricals, chosen interactions if any, and any transforms).
* Report coefficient table with **units/interpretation** (turn slopes into practical business statements).

In [None]:
import statsmodels.formula.api as smf
import pandas as pd

# explicitly state regression formula
formula = 'next12mo_spend ~ scope_complexity + \
                            prior_relationship_years + \
                            customer_satisfaction + \
                            time_overrun_pct + \
                            n_change_orders + \
                            retained_12mo + \
                            safety_incidents_occurred + \
                            project_size_usd_norm + \
                            discount_pct_cleaned + \
                            pm_experience_years_cleaned + \
                            on_time_milestones_pct_cleaned + \
                            industry_Energy + \
                            Q("industry_Food&Beverage") + \
                            industry_Healthcare + \
                            industry_Logistics + \
                            industry_Pharma + \
                            industry_Technology + \
                            region_Northeast + \
                            region_South + \
                            region_West + \
                            project_type_Expansion + \
                            project_type_Retrofit + \
                            project_type_TenantImprovement + \
                            contract_type_CostPlus + \
                            contract_type_GMP'

# fit OLS to model
baseline_model = smf.ols(formula=formula, data = cleaned_const_proj).fit()

baseline_model.summary()

Based on the results of the model, the significant terms are the intercept, prior_relationship_years, customer_satisfaction, safety_incidents_occurred, project_size_usd_norm, and on_time_milestones_pct_cleaned. This model explains 31% of the variation in next12mo_spend.  
1. **Intercept:** The interpretation of the intercept is that if all predictors are zero, the next12mo_spend for a customer will be $115,300. Since we established reference categories for our dummy variables, this default value takes on the category of those reference variables. In other words, for a New Build project on a Fixed Bid contract in the Manufacturing industry located in the Midwest, with all numeric predictors at zero, this is the baseline revenue. The other coefficients for the dummy variables indicate how much the target shifts for the other categories.
2. **Project Size:** This coefficient is $17,470. Since project_size_usd was log transformed, we can interpret this as meaning that every time project_size_usd increases by a factor of e (~2.718), the target variable increases by $17,470.
3. **Prior Relationship:** For every additional year of prior relationship with a client, their next12mo_spend increases by $8,078.
4. **Customer Satisfaction:** For every additional point of customer satisfaction, next12mo_spend increases by $26,640.
5. **On Time Milestones:** For every unit increase in on-time milestones, spend increases by $126,300. Since this variable is scaled to 0-1, if 100% of milestones are met on time, that amount is added to the baseline spend.
6. **Number of Change Orders:** For every additional change order issued, the revenue drops by about $3,550.
7. **Safety Incidents:** If one or more safety incidents occur, next 12-month spend decreases by $43,280.  
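
A factor of e is not an intuitive unit for leadership, so the log-size coefficient can be restated for more familiar changes. For a level-log model, multiplying project size by m shifts predicted spend by β·ln(m). The sketch below applies that identity to the $17,470 coefficient reported above; `spend_change` is a hypothetical helper name.

```python
import math

# Coefficient on log(project_size_usd) as reported in the model summary above
beta_log_size = 17_470


def spend_change(multiplier: float, beta: float = beta_log_size) -> float:
    """Change in predicted spend when project size is multiplied by `multiplier`."""
    return beta * math.log(multiplier)


for m in (1.10, 2.0, math.e):
    print(f"project size x{m:.3f} -> spend changes by ${spend_change(m):,.0f}")
```

Under this reading, doubling project size is associated with roughly $12,100 more next-12-month spend, and a 10% larger project with roughly $1,700 more, holding the other predictors fixed.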

**Industry Dummies**  
Most of the industry dummies have very high p-values. Since Energy is close to significant, and Manufacturing is built into the constant, I don't want to drop the industry dummies, but I will try combining them to see if they have more predictive power. I will combine Energy, Logistics, and Technology into Industrial Tech, Healthcare and Pharma into Health, and Food&Beverage into Consumer.
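
The regrouping described above can be sketched with a plain dictionary and `Series.map`. The example rows are made up, the group labels are written without spaces (an assumption, to keep the resulting column names formula-friendly), and Manufacturing is kept as its own level so it can remain the reference.

```python
import pandas as pd

# Mapping from original industry to combined group (labels are assumptions)
industry_groups = {
    "Energy": "IndustrialTech",
    "Logistics": "IndustrialTech",
    "Technology": "IndustrialTech",
    "Healthcare": "Health",
    "Pharma": "Health",
    "Food&Beverage": "Consumer",
    "Manufacturing": "Manufacturing",  # stays as the reference category
}

# Made-up example rows, one per original industry
demo = pd.DataFrame({"industry": list(industry_groups.keys())})
demo["industry_group"] = demo["industry"].map(industry_groups)

# Dummy encode the grouped field and drop the Manufacturing reference column
group_dummies = pd.get_dummies(
    demo["industry_group"], prefix="industry_group", dtype=int
).drop(columns="industry_group_Manufacturing")

print(pd.concat([demo, group_dummies], axis=1))
```

Any industry value missing from the dictionary would map to NaN, so checking `demo["industry_group"].isna().sum()` after the map is a cheap safeguard.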

**Drop some variables with low significance**  
I will exclude from the final model some of the variables that do not carry much predictive power: scope_complexity, time_overrun_pct, and retained_12mo. Although industry has a few dummies that are close to significant, region does not, so I will drop the region variables from the model as well. I want to keep project_type and contract_type because, logically, these could have some effect on revenues and project scale.  

**Impute using MICE**  
I will rerun the analysis using MICE imputation on missing values to see if I can improve the performance of those variables.

**Sqrt-Transform Prior Relationship**  
Calling back to earlier, prior_relationship_years is slightly skewed so I will sqrt transform it. Since it is significant to the model, transforming this data may improve the model's overall effectiveness.
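
Whether a square-root transform actually helps can be checked by comparing skewness before and after. The sketch below uses a simulated right-skewed variable as a stand-in for prior_relationship_years; the gamma distribution is an assumption chosen only because it is non-negative and right-skewed.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed, non-negative "years" values (assumption, not real data)
rng = np.random.default_rng(1)
fake_years = pd.Series(rng.gamma(shape=2.0, scale=2.0, size=650))

# Square-root transform; valid here because all values are >= 0
sqrt_years = np.sqrt(fake_years)

skew_before = float(fake_years.skew())
skew_after = float(sqrt_years.skew())
print(f"skew before: {skew_before:.2f}, after sqrt: {skew_after:.2f}")
```

The trade-off is interpretability: after the transform, the coefficient describes a change per unit of the square root of years rather than per year.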

**8. Diagnostics (20 pts)**  
* Residuals vs. fitted, Q-Q plot, influence (e.g., Cook's distance), heteroscedasticity check.
* Comment on where assumptions look OK vs. violated.

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import ProbPlot

plt.style.use('seaborn-v0_8')
plt.rc('figure', titlesize=18)
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=18)

# fitted values (need a constant term for intercept)
model_fitted_y = baseline_model.fittedvalues

# model residuals
model_residuals = baseline_model.resid

# influence measures, computed once and reused below
influence = baseline_model.get_influence()

# internally studentized (normalized) residuals
model_norm_residuals = influence.resid_studentized_internal

# square root of the absolute normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# absolute residuals
model_abs_resid = np.abs(model_residuals)

# leverage (diagonal of the hat matrix)
model_leverage = influence.hat_matrix_diag

# Cook's distance (first element of the returned tuple)
model_cooks = influence.cooks_distance[0]

In [None]:
plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)

plot_lm_1.axes[0] = sns.residplot(x =model_fitted_y,
                                  y ='next12mo_spend',
                                  data=cleaned_const_proj,
                                  lowess=True,
                                  scatter_kws={'alpha': 0.5},
                                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals')


# annotations
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_3 = abs_resid[:3]

for i in abs_resid_top_3.index:
    plot_lm_1.axes[0].annotate(i, 
                               xy=(model_fitted_y[i], 
                                   model_residuals[i]));

The Residuals vs Fitted plot looks pretty good overall. Most of the points are scattered randomly around zero, which means the model is doing a decent job capturing the main pattern in the data. There’s a little bit of a curve in the red line, so the model might be missing a small nonlinear trend, but it doesn’t look serious. A few points stand out as possible outliers, but nothing too extreme. The leverage plot will tell us if these points impact the model too much.
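
One way to put a number on the slight curve in the red line is a RESET-style check: refit the model with the squared fitted values added, then test whether that extra term is significant. The sketch below runs on synthetic stand-in data (the `demo` frame and its `x1`, `x2`, `y` columns are hypothetical, not the midterm dataset), so it only illustrates the mechanics. The same two steps could be applied to `baseline_model`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the midterm dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
demo["y"] = 2.0 + 1.5 * demo["x1"] - 0.8 * demo["x2"] + rng.normal(size=300)

# Base linear model
base = smf.ols("y ~ x1 + x2", data=demo).fit()

# Augment the model with the squared fitted values
demo["fitted_sq"] = base.fittedvalues ** 2
augmented = smf.ols("y ~ x1 + x2 + fitted_sq", data=demo).fit()

# F-test of the augmented model against the base (restricted) model
f_stat, p_value, df_diff = augmented.compare_f_test(base)
print(f"RESET-style F = {f_stat:.3f}, p = {p_value:.3f}")
```

A small p-value would point to a missed nonlinear trend, while a large one would be consistent with the visual read of the plot.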

In [None]:
QQ = ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)

plot_lm_2.set_figheight(8)
plot_lm_2.set_figwidth(12)

plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');

# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]

for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i, 
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));

The residuals from the model are approximately normally distributed, which supports the normality assumption. However, a few observations deviate from this pattern, suggesting potential outliers that may be influencing the model’s fit. The next step is to look at Cook's distance and leverage for these points.
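
The visual read of the Q-Q plot could be backed up with a formal test such as Jarque-Bera, which is available in statsmodels. This is a minimal sketch on synthetic stand-in data (`demo` and `demo_model` are hypothetical names). In the notebook, the residuals would come from `baseline_model.resid` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import jarque_bera

# Synthetic stand-in data (hypothetical; not the midterm dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
demo["y"] = 2.0 + 1.5 * demo["x1"] - 0.8 * demo["x2"] + rng.normal(size=300)
demo_model = smf.ols("y ~ x1 + x2", data=demo).fit()

# Jarque-Bera tests whether residual skewness and kurtosis match a normal distribution
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(demo_model.resid)
print(f"JB = {jb_stat:.3f}, p = {jb_pvalue:.3f}, skew = {skew:.3f}, kurtosis = {kurtosis:.3f}")
```

A large p-value means the test does not reject normality. With several hundred rows, even modest departures can be flagged, so the result is best read alongside the plot.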

In [None]:
plot_lm_3 = plt.figure(3)
plot_lm_3.set_figheight(8)
plot_lm_3.set_figwidth(12)

plt.scatter(model_fitted_y, model_norm_residuals_abs_sqrt, alpha=0.5)
sns.regplot(x=model_fitted_y, y=model_norm_residuals_abs_sqrt, 
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_3.axes[0].set_title('Scale-Location')
plot_lm_3.axes[0].set_xlabel('Fitted values')
plot_lm_3.axes[0].set_ylabel(r'$\sqrt{|Standardized Residuals|}$');  # raw string so \s is not treated as an escape

# annotations
abs_sq_norm_resid = np.flip(np.argsort(model_norm_residuals_abs_sqrt), 0)
abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]

for i in abs_sq_norm_resid_top_3:
    plot_lm_3.axes[0].annotate(i, 
                               xy=(model_fitted_y[i], 
                                   model_norm_residuals_abs_sqrt[i]));

The Scale-Location plot also looks fine. The points are spread out fairly evenly across the fitted values, and the red line stays mostly flat, which suggests the variance of the residuals is consistent. That means the model isn’t showing big signs of heteroscedasticity. There are a couple of higher points, but overall the spread looks random and balanced.
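
The rubric asks for a heteroscedasticity check, and the plot could be paired with a Breusch-Pagan test. The sketch below shows the call on synthetic stand-in data (`demo` and `demo_model` are hypothetical). For the actual model, the inputs would be `baseline_model.resid` and `baseline_model.model.exog`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic stand-in data with constant error variance (hypothetical; not the midterm dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
demo["y"] = 2.0 + 1.5 * demo["x1"] - 0.8 * demo["x2"] + rng.normal(size=300)
demo_model = smf.ols("y ~ x1 + x2", data=demo).fit()

# Breusch-Pagan regresses the squared residuals on the model's design matrix
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(demo_model.resid, demo_model.model.exog)
print(f"Breusch-Pagan LM = {lm_stat:.3f}, p = {lm_pvalue:.3f}")
```

A p-value above the chosen significance level would agree with the flat red line; a small p-value would suggest the residual variance changes with the predictors.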

In [None]:
plot_lm_4 = plt.figure(4)
plot_lm_4.set_figheight(8)
plot_lm_4.set_figwidth(12)

plt.scatter(model_leverage, model_norm_residuals, alpha=0.5)
sns.regplot(x=model_leverage, y=model_norm_residuals, 
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_4.axes[0].set_xlim(0.02, 0.09)
plot_lm_4.axes[0].set_ylim(-3, 4)
plot_lm_4.axes[0].set_title('Residuals vs Leverage')
plot_lm_4.axes[0].set_xlabel('Leverage')
plot_lm_4.axes[0].set_ylabel('Standardized Residuals')

# annotations
leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]

for i in leverage_top_3:
    plot_lm_4.axes[0].annotate(i, 
                               xy=(model_leverage[i], 
                                   model_norm_residuals[i]))
    
# shenanigans for cook's distance contours
def graph(formula, x_range, label=None):
    x = x_range
    y = formula(x)
    plt.plot(x, y, label=label, lw=1, ls='--', color='red')

p = len(baseline_model.params) # number of model parameters

graph(lambda x: np.sqrt((0.5 * p * (1 - x)) / x), 
      np.linspace(0.001, 0.200, 50), 
      'Cook\'s distance') # 0.5 line

graph(lambda x: np.sqrt((1 * p * (1 - x)) / x), 
      np.linspace(0.001, 0.200, 50)) # 1 line

plt.legend(loc='upper right');

The Residuals vs Leverage plot looks fine overall. Most of the points are clustered toward the lower end of leverage, which means most observations don’t have much influence on the model. The red line stays close to zero, showing no strong pattern, which is good. A few points like 139, 273, and 504 stand out a bit, but they don’t appear to be extreme enough to cause major concern. Overall, there’s no clear sign of influential outliers or leverage problems, so the model seems pretty stable.
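
To complement the plot, points can also be flagged numerically using the common rule of thumb that a Cook's distance above 4/n deserves a closer look. That threshold is a convention and an assumption here, not something taken from the analysis above. The sketch uses synthetic stand-in data with one planted outlier (`demo` and `demo_model` are hypothetical).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the midterm dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo["y"] = 2.0 + 1.5 * demo["x1"] - 0.8 * demo["x2"] + rng.normal(size=200)

# Plant one deliberately unusual row so there is something to flag
demo.loc[10, "x1"] = 3.0
demo.loc[10, "y"] = demo.loc[10, "y"] + 15.0

demo_model = smf.ols("y ~ x1 + x2", data=demo).fit()

# Cook's distance for every observation, compared with the 4/n rule of thumb
cooks_d = demo_model.get_influence().cooks_distance[0]
threshold = 4 / len(cooks_d)
flagged = np.where(cooks_d > threshold)[0]
print(f"threshold = {threshold:.4f}, flagged rows = {flagged.tolist()}")
```

Flagged rows are candidates for inspection, not automatic removal. Refitting without them and comparing coefficients shows whether they actually change the conclusions.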

---
The diagnostics do not provide evidence of any strong violations of model assumptions.


**9. Address Deficiencies (20 pts)**
* Reasoned iteration: try alternative feature set(s), transformations (e.g., mean/median vs. simple MICE via statsmodels or sklearn imputation).
* If you try ridge/lasso for stability, show how conclusions change (or don't).

---

Overall, the baseline model appears to be a reasonably good fit. It highlights a few significant coefficients, explains 31.5% of the variation in the data, and shows no strong violations of the model assumptions. I will try a few additional enhancements and a different imputation method to see if I can further improve the model.  

In [None]:
# combine industries into broader categories
construction_projects['industry_IndustrialTech'] = construction_projects[['industry_Energy', 'industry_Logistics', 'industry_Technology']].max(axis=1)
construction_projects['industry_Health'] = construction_projects[['industry_Healthcare', 'industry_Pharma']].max(axis=1)
construction_projects['industry_Consumer'] = construction_projects['industry_Food&Beverage']

construction_projects

In [None]:
#use square root transformation on prior_relationship_years
construction_projects['prior_relationship_years_norm'] = np.sqrt(construction_projects['prior_relationship_years'])

fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Original distribution
sns.histplot(data=construction_projects,
             x='prior_relationship_years',
             ax=axes[0],
             color='gray',
             bins=50)
axes[0].axvline(construction_projects['prior_relationship_years'].mean(), color='blue', linestyle='--', label='Mean')
axes[0].axvline(construction_projects['prior_relationship_years'].median(), color='red', linestyle='-', label='Median')
axes[0].set_title('Original Distribution of prior_relationship_years')
axes[0].legend()

# Transformed distribution
sns.histplot(data=construction_projects,
             x='prior_relationship_years_norm',
             ax=axes[1],
             color='steelblue',
             bins=50)
axes[1].axvline(construction_projects['prior_relationship_years_norm'].mean(), color='blue', linestyle='--', label='Mean')
axes[1].axvline(construction_projects['prior_relationship_years_norm'].median(), color='red', linestyle='-', label='Median')
axes[1].set_title('Square-Root-Transformed Distribution of prior_relationship_years_norm')
axes[1].legend()

plt.tight_layout()
plt.show()
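
The histograms could be supplemented with a numeric skewness comparison before and after the transform. This sketch uses a synthetic right-skewed series as a stand-in (the gamma-distributed `years` values are hypothetical). In the notebook, the two columns of `construction_projects` would be used instead.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for prior_relationship_years (hypothetical values)
rng = np.random.default_rng(42)
years = pd.Series(rng.gamma(shape=2.0, scale=2.0, size=650), name="prior_relationship_years")

# Sample skewness before and after the square-root transform
skew_before = years.skew()
skew_after = np.sqrt(years).skew()
print(f"skew before: {skew_before:.3f}, skew after sqrt: {skew_after:.3f}")
```

A skewness closer to zero after the transform indicates a more symmetric distribution.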

In [None]:
# create final dataset
construction_projects_final = construction_projects[['customer_satisfaction',                                                     
                                                     'n_change_orders',
                                                     'safety_incidents_occurred',
                                                     'project_size_usd_norm',
                                                     'discount_pct',
                                                     'pm_experience_years',
                                                     'on_time_milestones_pct',
                                                     'prior_relationship_years_norm',
                                                     'industry_IndustrialTech',
                                                     'industry_Health',
                                                     'industry_Consumer',
                                                     'project_type_Expansion',
                                                     'project_type_Retrofit',
                                                     'project_type_TenantImprovement',
                                                     'contract_type_CostPlus',
                                                     'contract_type_GMP',
                                                     'next12mo_spend']]

In [None]:
import statsmodels.api as sm
from statsmodels.imputation import mice

imp = mice.MICEData(construction_projects_final)
fml = 'next12mo_spend ~ customer_satisfaction + \
                            n_change_orders + \
                            safety_incidents_occurred + \
                            project_size_usd_norm + \
                            discount_pct + \
                            pm_experience_years + \
                            on_time_milestones_pct + \
                            prior_relationship_years_norm + \
                            industry_IndustrialTech + \
                            industry_Consumer + \
                            project_type_Expansion + \
                            project_type_Retrofit + \
                            project_type_TenantImprovement + \
                            contract_type_CostPlus + \
                            contract_type_GMP'

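# use a distinct name so the imported `mice` module is not shadowed
mice_model = mice.MICE(fml, sm.OLS, imp)
final_model = mice_model.fit(n_burnin=10, n_imputations=10)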
final_model.summary()

In [None]:
import patsy
import numpy as np

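# Build design matrix from the formula and the dataset
# (patsy drops rows with missing values by default, so the R-squared below is computed on complete cases only)
y, X = patsy.dmatrices(fml, construction_projects_final, return_type='dataframe')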

y_hat = X @ final_model.params

ss_res = ((y.values.flatten() - y_hat.values.flatten())**2).sum()
ss_tot = ((y.values.flatten() - y.values.flatten().mean())**2).sum()
r_squared = 1 - ss_res/ss_tot

n = X.shape[0]  # number of observations
p = X.shape[1] - 1  # number of predictors (excluding intercept)
adj_r_squared = 1 - (1 - r_squared)*(n-1)/(n-p-1)

print("R-squared:", r_squared)
print("Adjusted R-squared:", adj_r_squared)

For the final model, I combined the industry columns into broader categories, dropped a few insignificant variables, and square-root transformed prior_relationship_years to reduce its skew. I also used MICE to impute the missing values that I had previously imputed using the mean/median.

Overall, this model performed slightly worse than the original model, but not by much: the original model had an R² of 0.315, and the final model had an R² of 0.306.

The MICE imputation likely didn't have much effect on the model because the initial data had very few missing values. None of the fields with missing values were heavily skewed, and there was no pattern to the missingness. Because of this, the mean and median worked well enough, and there wasn't significant improvement to be made by using MICE.

In my initial exploratory data analysis, I had already done some cleaning and data preparation. Having rescaled the on_time_milestones_pct column, created a binary safety_incidents column, and transformed the project_size_usd column, the only variable left to transform was prior_relationship_years. This predictor was already significant in the model, and the diagnostic plots didn't show any violations of assumptions, so any change to this variable would likely have had minimal effect on the model anyway.

Dropping coefficients is fairly risky with OLS because the variables are related to each other, so removing some of them changes the estimates for the rest. The constant lost significance in the final model. Since the constant captures the reference levels of the categorical variables, I think it would be better to keep the model with the more significant constant and all of the categorical variables for region, industry, project_type, and contract_type.

Overall, I will go with my baseline model because it explains more of the overall variance and gives adequate direction for informing the business.

**10. Interpret & Communicate (30 pts)**
* Summarize what drives retention spend and how confident you are.
* Provide 2-3 actionable recommendations (e.g., reduce overruns, improve on-time milestones, PM staffing).
* Include a short "for executives" paragraph with the single clearest takeaway.

### Drivers of Retention Spend  
*Higher customer satisfaction*     
This strongly increases future spend. Each satisfaction point translates to anywhere from +\$19.1K to \$34.2K in additional spend.  

*Longer prior relationship*  
Retention spend is boosted by \$6.2K-\$10K for every year a customer has had a relationship with the company.  

*Larger project size*  
Larger projects generate more spend for the company. Based on the normalized project size variable, next-12-month spend increases by \$114-\$235 for every 1% increase in project size.    

*Better on-time milestone performance*   
This significantly increases spend. If a project’s milestones are met 100% on time, the customer's next-12-month spend is \$81.4K-\$171K higher. 

*Safety incidents*  
The occurrence of even 1 safety incident translates to a \$33.4K to \$53.2K loss in revenue.  

*More Change Orders*  
Each change order slightly decreases retention spend by \$266 to \$6,833.  
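
The "per 1% increase in project size" reading used above follows from a level-log model. As a worked step, assuming project_size_usd_norm is the natural log of project_size_usd (an assumption, since the transform is defined earlier in the notebook), and writing $\beta$ for its coefficient:

$$
y = \beta_0 + \beta \ln(\text{size}) + \dots
\quad\Rightarrow\quad
\Delta y = \beta\,\bigl[\ln(1.01 \cdot \text{size}) - \ln(\text{size})\bigr] = \beta \ln(1.01) \approx 0.00995\,\beta \approx \frac{\beta}{100}
$$

Under that assumption, a 1% increase in project size changes expected next-12-month spend by roughly one hundredth of the coefficient.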

### Confidence  
The model is statistically strong overall. The combination of the model's F-statistic and low p-value means that the factors in this model reliably explain differences in future client spend. The model captures meaningful relationships between some of the variables, but it only explains 31.3% of the overall variance, so there are likely other unmeasured factors at play. Overall, confidence in the direction and significance of the key predictors is high, so I am confident in these levers.
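
Confidence in the direction of each lever can be checked by confirming that its confidence interval excludes zero. The sketch below shows the pattern on synthetic stand-in data (`demo` and `demo_model` are hypothetical). In the notebook, the same call would be `baseline_model.conf_int()`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the midterm dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
demo["y"] = 2.0 + 2.0 * demo["x1"] - 0.8 * demo["x2"] + rng.normal(size=300)
demo_model = smf.ols("y ~ x1 + x2", data=demo).fit()

# 95% confidence intervals for every coefficient
ci = demo_model.conf_int(alpha=0.05)
ci.columns = ["lower", "upper"]

# An interval that does not contain zero supports a clear direction of effect
ci["excludes_zero"] = (ci["lower"] > 0) | (ci["upper"] < 0)
print(ci)
```

Intervals that straddle zero indicate the direction of the effect is uncertain, even if the point estimate looks large.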

### Business Recommendations
1.	**Prioritize safety management:** The occurrence of even one safety incident predicts a ~\$43K drop in future spend. Reinforce safety training among field workers.
2.	**Improve on-time performance:** Meeting milestones on schedule has a major impact on retention. Prioritize adherence to schedules and invest in project management tools or staffing to improve overall delivery.
3.	**Strengthen customer relationships:** Longer partnerships clearly pay off. Build account retention programs that focus on maintaining the relationship after a project’s completion.

### Executive Summary
The single clearest takeaway of this analysis is that customer relationships are the strongest driver of future retention spend. Nearly every major lever ties back to client satisfaction – either directly (through prior relationships and satisfaction scores) or indirectly (through project delivery factors like safety, change orders, and on-time milestones). This suggests that maintaining and strengthening client trust has the greatest long-term revenue impact. Going forward, resources should be focused on understanding and improving the key drivers of customer satisfaction, since it sits at the center of nearly every factor that influences repeat business.
