
# Midterm Project — Foundations of Analytics  
### Commercial/Industrial Construction: Predicting Customer Retention Spend

**Deliverable:** Predict `next12mo_spend` and explain the drivers, following the **Steps in Model Building**.

Use the provided CSV: `midterm_construction_projects.csv`.



## 1) Identify & Clarify the Problem
- In your own words, describe the business decision and why prediction + interpretation matter.



## 2) Background
- Briefly list plausible drivers of post-project spend in construction.



## 3) Select Variables
- Which predictors will you start with? Which might need transforms?



## 4) Acquire Data


In [None]:

import pandas as pd

df = pd.read_csv("midterm_construction_projects.csv")
df.head()


In [None]:

# Basic info: shape, first few dtypes, and summary statistics
print(df.shape)
print(df.dtypes.head())
df.describe(include='all').transpose().head(20)



## 5) Choose Modeling Approach
- Primary: **Multiple Linear Regression (OLS)** on `next12mo_spend`.
- Consider alternatives (e.g., log-transforming the target) and justify your choice.
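
One way to justify a log transform is to compare the skewness of the target before and after transforming it. The sketch below uses a synthetic, right-skewed stand-in for `next12mo_spend` (the values are made up for illustration and are not the course data); swap in `df["next12mo_spend"]` when working in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for next12mo_spend (NOT the course data)
rng = np.random.default_rng(0)
spend = pd.Series(rng.lognormal(mean=11, sigma=1.0, size=2000), name="next12mo_spend")

# Sample skewness before and after a log1p transform
skew_raw = spend.skew()
skew_log = np.log1p(spend).skew()

print(f"skew (raw):   {skew_raw:.2f}")
print(f"skew (log1p): {skew_log:.2f}")
```

A skewness near zero after the transform suggests the log scale may suit OLS better, but the choice still needs to be weighed against how easily the coefficients can be explained.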



## 6) EDA & Assumptions
- Distributions, pairplots where useful, missingness audit, correlations.
- Consider transformations for skewed variables.
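
For the correlation check, a ranked list of correlations with the target is often easier to read than a full heatmap. This is a minimal sketch on a small synthetic frame (the data and the `demo` name are hypothetical); in the notebook, replace `demo` with `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the course data); column names mirror the CSV
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "project_size_usd": rng.lognormal(13, 1, n),
    "discount_pct": rng.uniform(0, 15, n),
    "region": rng.choice(["A", "B"], n),  # non-numeric, excluded below
})
demo["next12mo_spend"] = 0.05 * demo["project_size_usd"] + rng.normal(0, 5000, n)

# Keep numeric columns only, then rank predictors by |correlation| with the target
num = demo.select_dtypes(include="number")
corr_with_target = (
    num.corr()["next12mo_spend"]
    .drop("next12mo_spend")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```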


In [None]:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

_ = df.hist(figsize=(14,10), bins=30)
plt.tight_layout(); plt.show()

# Missingness
df.isna().mean().sort_values(ascending=False).head(15)



## 7) Fit the Baseline Model
- Encode categoricals with an explicit reference level.
- State your model formula/design.
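
By default, patsy's `C(...)` uses the first level in sorted order as the reference. To choose the reference yourself, pass `Treatment(reference=...)`. The sketch below uses hypothetical industry labels (`"Retail"`, `"Healthcare"`, `"Industrial"`); check the actual levels in the CSV before copying it.

```python
import pandas as pd
import patsy as pt

# Hypothetical levels for illustration; inspect df["industry"].unique() for the real ones
demo = pd.DataFrame(
    {"industry": ["Retail", "Healthcare", "Industrial", "Retail", "Healthcare", "Industrial"]}
)

# Treatment coding with "Retail" as the reference level
X_demo = pt.dmatrix(
    'C(industry, Treatment(reference="Retail"))', demo, return_type="dataframe"
)
print(list(X_demo.columns))
```

The reference level gets no dummy column, so each remaining coefficient reads as a difference from that level. Note that the design matrix already contains an `Intercept` column.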


In [None]:

import statsmodels.api as sm
import patsy as pt

# Example design (edit as you see fit)
formula = """next12mo_spend ~ 
    np.log(project_size_usd) + scope_complexity + close_time_days
  + prior_relationship_years + competition_count + discount_pct
  + pm_experience_years + safety_incidents + on_time_milestones_pct
  + cost_overrun_pct + time_overrun_pct + payment_delay_days + n_change_orders
  + C(industry) + C(region) + C(project_type) + C(contract_type) + is_union_site
"""

# patsy already adds an 'Intercept' column and drops rows with missing values,
# so do not add a second constant (it would be perfectly collinear).
y, X = pt.dmatrices(formula, df, return_type='dataframe')
model = sm.OLS(y, X).fit()
model.summary()



## 8) Diagnostics
- Residuals vs. fitted, Q–Q plot, heteroscedasticity test, VIF.


In [None]:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

resid = model.resid.values
fitted = model.fittedvalues.values

plt.figure(figsize=(5,4)); plt.scatter(fitted, resid, alpha=0.5)
plt.axhline(0, color='r'); plt.xlabel("Fitted"); plt.ylabel("Residuals"); plt.title("Residuals vs Fitted"); plt.show()

# line='s' draws a reference line scaled to the residuals; '45' only suits standardized data
sm.qqplot(model.resid, line='s'); plt.title("Q–Q Plot"); plt.show()

# Breusch–Pagan test
bp = het_breuschpagan(model.resid, model.model.exog)
{"Lagrange Multiplier": bp[0], "p-value": bp[1], "f-value": bp[2], "f p-value": bp[3]}


In [None]:

# VIF: keep the intercept in the design matrix, but do not report a VIF for it
import pandas as pd
exog = pd.DataFrame(model.model.exog, columns=model.model.exog_names).drop(columns=['const'], errors='ignore')
features = [c for c in exog.columns if c != 'Intercept']
vif = pd.DataFrame({
    "feature": features,
    "VIF": [variance_inflation_factor(exog.values, exog.columns.get_loc(c)) for c in features]
}).sort_values("VIF", ascending=False)
vif.head(15)



## 9) Address Deficiencies
- Try an alternative (e.g., log-transform the target or impute missing predictor values).
- Optionally compare ridge/lasso for stability (not required).
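
If the diagnostics show a funnel-shaped residual plot, two common remedies are refitting on a log-scale target and reporting heteroscedasticity-robust standard errors. The sketch below shows both on synthetic data with multiplicative noise (the data-generating values 2.0, 0.6, and 0.5 are made up for illustration); it is one possible approach, not the required one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with multiplicative noise (NOT the course data)
rng = np.random.default_rng(2)
n = 1000
demo = pd.DataFrame({"project_size_usd": rng.lognormal(13, 1, n)})
demo["next12mo_spend"] = np.exp(
    2.0 + 0.6 * np.log(demo["project_size_usd"]) + rng.normal(0, 0.5, n)
)

# Baseline: level target, classical standard errors
fit_raw = smf.ols("next12mo_spend ~ np.log(project_size_usd)", data=demo).fit()

# Alternative: log1p target with HC3 robust standard errors
fit_log = smf.ols(
    "np.log1p(next12mo_spend) ~ np.log(project_size_usd)", data=demo
).fit(cov_type="HC3")

slope = fit_log.params["np.log(project_size_usd)"]
print(f"log-log slope: {slope:.3f}")
```

R² values from the two fits are not directly comparable because the targets are on different scales; compare them through residual diagnostics or predictions converted back to dollars.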


In [None]:

# EXAMPLE: Simple imputation + refit (students may replace with their approach)
from sklearn.impute import SimpleImputer

imp_cols = ["on_time_milestones_pct","customer_satisfaction","pm_experience_years","discount_pct"]
df_imp = df.copy()
df_imp[imp_cols] = SimpleImputer(strategy='median').fit_transform(df_imp[imp_cols])

# patsy already includes an 'Intercept' column, so no extra constant is needed
y2, X2 = pt.dmatrices(formula, df_imp, return_type='dataframe')
model2 = sm.OLS(y2, X2).fit()
model2.summary()



## 10) Interpret & Communicate
- Turn key coefficients into **plain-English** insights.
- Provide 2–3 actionable recommendations for leaders.
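
When a predictor enters the model as a log (such as `np.log(project_size_usd)`) and the target stays in dollars, the coefficient is easiest to explain as the effect of a percentage change: a p% increase in the predictor shifts the predicted target by roughly `beta * ln(1 + p/100)`, holding the other predictors fixed. The coefficient value below is hypothetical; substitute the estimate from your own fit.

```python
import math

# Hypothetical coefficient on np.log(project_size_usd) from a level-target OLS fit
beta_log_size = 50_000.0

def effect_of_pct_increase(beta, pct):
    """Change in the untransformed target for a pct% increase in a logged predictor."""
    return beta * math.log(1 + pct / 100)

effect_10 = effect_of_pct_increase(beta_log_size, 10)
print(
    f"A 10% larger project is associated with about ${effect_10:,.0f} "
    "more next-12-month spend, other predictors held fixed."
)
```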
