#### Exercise 08: Principal Component Analysis
Pick a dataset from one of our previous assignments or the midterm. Conduct the same kind of PCA we did in the lab, focusing only on the continuous variables.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#### 1. Explore the Data


In [None]:
construction_projects = pd.read_csv("midterm_construction_projects.csv")
construction_projects.describe()

In [None]:
continuous_var = construction_projects[['project_size_usd',
                                       'scope_complexity',
                                       'close_time_days',
                                       'prior_relationship_years',
                                       'discount_pct',
                                       'pm_experience_years',
                                       'on_time_milestones_pct',
                                       'customer_satisfaction',
                                       'cost_overrun_pct',
                                       'time_overrun_pct',
                                       'payment_delay_days',
                                       'n_change_orders',
                                       'next12mo_spend']].copy()

In [None]:
continuous_var.isnull().sum()

There are missing values for pm_experience_years, on_time_milestones_pct, and customer_satisfaction. Since discount_pct and pm_experience_years are roughly symmetric, I will impute them with the median. Since on_time_milestones_pct is slightly skewed and bimodal, I will impute it with the mean.
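
As a rough illustration of the symmetry check behind a mean-versus-median choice, the sketch below measures sample skewness on synthetic data and imputes accordingly. The column names, the 0.5 threshold, and the rule itself (median for skewed columns, since it is less sensitive to a long tail; mean otherwise) are assumptions for illustration, not the choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (made-up values, not the construction dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric_col": rng.normal(50, 10, 200),
    "skewed_col": rng.lognormal(0, 1, 200),
})
# Knock out 10% of each column to mimic missing values.
toy.loc[toy.sample(frac=0.1, random_state=0).index, "symmetric_col"] = np.nan
toy.loc[toy.sample(frac=0.1, random_state=1).index, "skewed_col"] = np.nan

# Sample skewness per column (NaNs are skipped by default).
skewness = toy.skew()
print(skewness)

# Hypothetical rule of thumb: median when |skew| is large, mean otherwise.
SKEW_THRESHOLD = 0.5  # assumed cutoff
imputed = toy.copy()
for col in toy.columns:
    use_median = abs(skewness[col]) > SKEW_THRESHOLD
    fill_value = toy[col].median() if use_median else toy[col].mean()
    imputed[col] = toy[col].fillna(fill_value)

# Confirm nothing is left missing; NaNs would make PCA fail later.
print(imputed.isnull().sum())
```

For a nearly symmetric column the mean and median almost coincide, so the choice matters mainly for the skewed ones.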

In [None]:
# impute missings
continuous_var['discount_pct'] = continuous_var['discount_pct'].fillna(continuous_var['discount_pct'].median())
continuous_var['pm_experience_years'] = continuous_var['pm_experience_years'].fillna(continuous_var['pm_experience_years'].median())
continuous_var['on_time_milestones_pct'] = continuous_var['on_time_milestones_pct'].fillna(continuous_var['on_time_milestones_pct'].mean())

continuous_var.describe()

project_size_usd is very skewed, so I will log-transform it.
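
A small sketch of what the log transform does to a right-skewed variable, using synthetic values rather than the real project sizes. It also shows np.log1p as one possible alternative when a column may contain zeros, which is an assumption about the data and not something observed here.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "project sizes" (made-up values, illustration only).
rng = np.random.default_rng(42)
sizes = pd.Series(rng.lognormal(mean=12, sigma=1.0, size=500), name="size_usd")

skew_before = sizes.skew()
skew_after = np.log(sizes).skew()
print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")

# np.log gives -inf at 0 and NaN for negative inputs, either of which would
# break StandardScaler and PCA downstream. np.log1p computes log(1 + x) and
# stays finite at zero.
with_zero = pd.Series([0.0, 10.0, 1000.0])
safe_log = np.log1p(with_zero)
print(safe_log)
```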

In [None]:
# log transform project_size_usd
continuous_var['project_size_usd'] = np.log(continuous_var['project_size_usd'])

In [None]:
corr = continuous_var.corr(numeric_only=True)
corr

* cost_overrun_pct and time_overrun_pct are very highly correlated with each other.
* n_change_orders and scope_complexity are also relatively highly correlated.
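
Reading the strongest pairs off a 13x13 correlation matrix by eye is error-prone, so here is a minimal sketch that ranks pairs by absolute correlation. It runs on synthetic data with hypothetical column names (a, b, c, d); on the notebook's data the same steps would apply to corr.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=300),  # built to track "a" closely
    "c": rng.normal(size=300),
    "d": rng.normal(size=300),
})

toy_corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=np.abs, ascending=False)
print(ranked.head())
```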

#### 2. Standardize the data so we can do PCA.

In [None]:
scaler = StandardScaler()
continuous_var_scaled = scaler.fit_transform(continuous_var)
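# keep the original column names and index so later output stays labeled
continuous_var_scaled = pd.DataFrame(continuous_var_scaled, columns=continuous_var.columns, index=continuous_var.index)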
continuous_var_scaled.describe()
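
One detail when reading describe() after scaling: StandardScaler divides by the population standard deviation (ddof=0), while pandas reports the sample standard deviation (ddof=1), so the std row comes out slightly above 1 rather than exactly 1. A quick check on synthetic data with made-up column names:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "small_scale": rng.normal(0.5, 0.1, 100),
    "large_scale": rng.normal(1e6, 2e5, 100),
})

scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Column means are ~0 after scaling.
print(scaled.mean().round(6))

# Population std (ddof=0) is exactly what StandardScaler normalizes to 1.
pop_std = scaled.std(ddof=0)
# Sample std (ddof=1), as shown by describe(), is larger by sqrt(n / (n - 1)).
sample_std = scaled.std(ddof=1)
print(pop_std)
print(sample_std)
```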

#### 3. Fit PCA and look at the variance explained

In [None]:
pca = PCA()
pca.fit(continuous_var_scaled)

explained = pca.explained_variance_ratio_

ev = pd.DataFrame({
    'PC': [f'PC{i+1}' for i in range(len(explained))],
    'Explained Variance Ratio': explained
})
ev

In [None]:
# Scree plot (variance explained by each PC)
plt.figure(figsize=(6,4))
plt.plot(range(1, len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_, marker='o')
plt.title('Scree Plot: Explained Variance by Principal Component')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.xticks(range(1, len(pca.explained_variance_ratio_)+1))
plt.grid(True, linestyle='--', linewidth=0.5)
plt.show()
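
A scree plot is often read together with the cumulative explained variance when deciding how many components to keep. The sketch below applies two common heuristics to synthetic data: a cumulative-variance target and the Kaiser criterion (eigenvalue above 1 on standardized data). The 80% target is an assumed value, and neither rule is claimed to be the one used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic data: 6 features driven by 2 hidden factors plus noise.
factors = rng.normal(size=(400, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + 0.3 * rng.normal(size=(400, 6))

X_scaled = StandardScaler().fit_transform(X)
toy_pca = PCA().fit(X_scaled)

cumulative = np.cumsum(toy_pca.explained_variance_ratio_)
print(cumulative.round(3))

# Smallest number of components whose cumulative share reaches the target.
TARGET = 0.80  # assumed threshold
k_threshold = int(np.searchsorted(cumulative, TARGET) + 1)

# Kaiser criterion: each standardized variable contributes variance of about 1,
# so keep components that explain more than a single variable would.
k_kaiser = int((toy_pca.explained_variance_ > 1.0).sum())

print(f"components for {TARGET:.0%} variance: {k_threshold}")
print(f"components with eigenvalue > 1: {k_kaiser}")
```

The two rules can disagree, so they are best treated as a starting point alongside the interpretability of the loadings.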

In [None]:
# Loadings = eigenvectors of the covariance matrix of the standardized data (one column per PC)
loadings = pd.DataFrame(
    pca.components_.T,
    index=continuous_var.columns,
    columns=[f'PC{i+1}' for i in range(len(continuous_var.columns))]
)
loadings

The first few principal components explain most of the meaningful variation in the construction projects data. Each component represents a different underlying pattern in how projects perform and relate to one another:
1. PC1 – Project Efficiency: This component loads heavily on cost_overrun_pct and time_overrun_pct and slightly on scope_complexity, showing that more complex projects tend to go over budget and schedule.
2. PC2 – Customer Relationships: Driven by prior_relationship_years, next12mo_spend, and customer_satisfaction, this captures the strength of ongoing client relationships and how satisfaction connects to future spending.
3. PC3 – Project Complexity & Change: Influenced by scope_complexity and n_change_orders, it reflects how more complex projects often experience more scope adjustments during execution.
4. PC4 – Project Scale: Aligned with project_size_usd and customer_satisfaction, suggesting that larger projects have different satisfaction dynamics compared to smaller ones.
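
Interpretations like the ones above come from looking at which variables have the largest absolute loadings on each component. A minimal sketch of that lookup follows, using synthetic data and hypothetical column names; top_loadings is a helper introduced here for illustration and is not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 300
overrun = rng.normal(size=n)        # hidden factor behind the first two columns
relationship = rng.normal(size=n)   # hidden factor behind the next two columns
toy = pd.DataFrame({
    "cost_over": overrun + 0.2 * rng.normal(size=n),
    "time_over": overrun + 0.2 * rng.normal(size=n),
    "rel_years": relationship + 0.7 * rng.normal(size=n),
    "future_spend": relationship + 0.7 * rng.normal(size=n),
    "noise": rng.normal(size=n),
})

toy_pca = PCA().fit(StandardScaler().fit_transform(toy))
toy_loadings = pd.DataFrame(
    toy_pca.components_.T,
    index=toy.columns,
    columns=[f"PC{i+1}" for i in range(toy_pca.n_components_)],
)

def top_loadings(loadings, pc, n_top=3):
    """Return the n_top variables with the largest absolute loading on pc."""
    order = loadings[pc].abs().sort_values(ascending=False).index
    return loadings.loc[order[:n_top], pc]

for pc in ["PC1", "PC2"]:
    print(pc)
    print(top_loadings(toy_loadings, pc, n_top=2))
```

The sign of a whole component is arbitrary, so only the relative signs of variables within one component carry meaning.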

#### 4. Project original data onto the necessary principal components

In [None]:
pc_scores = pca.transform(continuous_var_scaled)

pc_df = pd.DataFrame(pc_scores, columns=[f'PC{i+1}' for i in range(pc_scores.shape[1])])

pc_df_4 = pc_df[['PC1','PC2','PC3','PC4']]

import seaborn as sns

sns.set_theme(style="whitegrid", context="notebook")
pairplot = sns.pairplot(pc_df_4, diag_kind='kde', plot_kws={'alpha':0.6, 's':50, 'edgecolor':'k'})
pairplot.figure.suptitle('Pairwise Plots of PC1–PC4', y=1.02)
plt.show()

When plotting the first four principal components, the projects form a single cloud around the origin with no strong separation or noticeable outliers. PC scores are mean-centered, so being centered on the origin is expected; the informative part is the absence of separate groups. This indicates that the continuous project variables (like size, schedule performance, and satisfaction) vary fairly consistently across projects, without distinct groupings or extreme cases. In other words, the main sources of variation in the data are relatively balanced rather than being driven by one or two standout dimensions.
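
The visual impression of "no outliers" can be backed by a simple numeric screen on the scores. The sketch below, on synthetic data, confirms two properties that hold for any PCA (scores are mean-centered and mutually uncorrelated) and then flags rows lying more than 3 standard deviations out on any component. The cutoff of 3 is an assumed convention.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
# Synthetic correlated data (made up, not the construction dataset).
X = rng.normal(size=(250, 5)) @ rng.normal(size=(5, 5))

X_scaled = StandardScaler().fit_transform(X)
toy_pca = PCA().fit(X_scaled)
scores = pd.DataFrame(
    toy_pca.transform(X_scaled),
    columns=[f"PC{i+1}" for i in range(toy_pca.n_components_)],
)

# Scores are mean-centered, so every pairwise scatter is centered on the origin.
print(scores.mean().round(10))

# Scores on different components are uncorrelated by construction.
print(scores.corr().round(3))

# Simple outlier screen: rows beyond 3 standard deviations on any component.
z = scores / scores.std(ddof=1)
flagged = scores[(z.abs() > 3).any(axis=1)]
print(f"{len(flagged)} of {len(scores)} rows flagged")
```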

#### 5. Explain in detail what the PCA has told you about the data, give the principal components some intuitive meaning, and explain how you would use your insights from the PCA.

-------

The PCA results show that the projects in this dataset are pretty similar overall, with no clear clusters or big outliers when looking at the first few components. This means the continuous variables like project size, overruns, satisfaction, and change orders vary in a fairly consistent way across projects without any strong patterns or distinct groups. The variation that does exist is spread across several smaller factors instead of being driven by one main source.

The first four principal components represent the main themes in the data. PC1 captures overall project inefficiency, mainly tied to cost and time overruns. PC2 reflects the strength of customer relationships and future spend. PC3 shows scope volatility, with more change orders and complexity usually linked to lower satisfaction. PC4 separates larger projects that have high satisfaction but shorter prior relationships.

Overall, the PCA suggests that most projects perform similarly, but improving cost and time control (PC1) and managing scope changes (PC3) could make the biggest difference. The components could also be used as simplified inputs for modeling or for tracking project performance and relationship health over time.
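
As one possible way to use components as simplified modeling inputs, the sketch below chains scaling, a four-component PCA, and a linear regression in a scikit-learn pipeline. The features and target are synthetic, and the choice of linear regression is an assumption for illustration. If the target were one of the columns used in the PCA here (for example next12mo_spend), it would need to be left out of the PCA inputs first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(21)
# Synthetic features and target (made up, not the construction data).
X = rng.normal(size=(300, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Scaling and PCA live inside the pipeline, so each cross-validation fold
# refits them on its own training split and no information leaks across folds.
model = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())

cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean cross-validated R^2: {cv_r2.mean():.3f}")

# After fitting, the steps before the regressor yield the four PC scores.
model.fit(X, y)
reduced = model[:-1].transform(X)
print(reduced.shape)
```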