# Dimensionality Reduction and Feature Selection
## Calculating PCA and Variance Threshold in a Linear Regression

**Data source:** https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv

In [2]:
import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")

## 1. Import the housing data as a data frame and ensure that the data is loaded properly.

In [5]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
file_path = "train.csv"
df = pd.read_csv(file_path)

# Display the first 5 rows
print("First 5 rows:")
print(df.head(5))

First 5 rows:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   200

## 2. Drop the "Id" column and any features that are missing more than 40% of their values.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score, mean_squared_error
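from sklearn.preprocessing import MinMaxScaler

# Drop the Id column, which is a row identifier rather than a feature
df = df.drop(columns=["Id"])

# Drop columns that are missing more than 40% of their values.
# isna().mean() gives each column's missing fraction, so this keeps
# only the columns that are at most 40% missing.
df = df.loc[:, df.isna().mean() <= 0.4]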

# Display the first 5 rows
print("First 5 rows:")
print(df.head(5))

First 5 rows:
   MSSubClass MSZoning  LotFrontage  LotArea Street LotShape LandContour  \
0          60       RL         65.0     8450   Pave      Reg         Lvl   
1          20       RL         80.0     9600   Pave      Reg         Lvl   
2          60       RL         68.0    11250   Pave      IR1         Lvl   
3          70       RL         60.0     9550   Pave      IR1         Lvl   
4          60       RL         84.0    14260   Pave      IR1         Lvl   

  Utilities LotConfig LandSlope  ... EnclosedPorch 3SsnPorch ScreenPorch  \
0    AllPub    Inside       Gtl  ...             0         0           0   
1    AllPub       FR2       Gtl  ...             0         0           0   
2    AllPub    Inside       Gtl  ...             0         0           0   
3    AllPub    Corner       Gtl  ...           272         0           0   
4    AllPub       FR2       Gtl  ...             0         0           0   

  PoolArea MiscVal  MoSold  YrSold  SaleType  SaleCondition SalePrice  
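One detail worth checking in this step: `DataFrame.dropna(thresh=...)` takes the minimum number of non-missing values a column needs in order to be kept, not the number of missing values allowed. A minimal sketch on a made-up frame (the column names `a`, `b`, `c` are hypothetical) shows the "at most 40% missing" rule written two equivalent ways:

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' has no missing values, 'b' is 50% missing, 'c' is 75% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Keep columns that are at most 40% missing.
kept_by_mask = toy.loc[:, missing_frac <= 0.4]

# Same rule via dropna: a column needs at least 60% non-missing values.
min_present = int(np.ceil(len(toy) * 0.6))
kept_by_thresh = toy.dropna(axis=1, thresh=min_present)

print(list(kept_by_mask.columns))
print(list(kept_by_thresh.columns))
```

Both filters keep only column `a` here. Passing `thresh=len(toy) * 0.4` instead would require only 40% of the values to be present, which keeps columns that are up to 60% missing.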


## 3. For numerical columns, fill in any missing data with the median value.

In [9]:
# Separate numerical and categorical columns
num_cols = df.select_dtypes(include=["int64", "float64"]).columns
cat_cols = df.select_dtypes(include=["object"]).columns

## 4. For categorical columns, fill in any missing data with the most common value.

In [11]:
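# Fill missing values: median for numerical columns, most frequent value for categorical columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# Display the numerical columns (pandas prints only the first and last 5 rows)
print("Numerical columns:")
print(df[num_cols])


Numerical columns: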
      MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0             60         65.0     8450            7            5       2003   
1             20         80.0     9600            6            8       1976   
2             60         68.0    11250            7            5       2001   
3             70         60.0     9550            7            5       1915   
4             60         84.0    14260            8            5       2000   
...          ...          ...      ...          ...          ...        ...   
1455          60         62.0     7917            6            5       1999   
1456          20         85.0    13175            6            6       1978   
1457          70         66.0     9042            7            9       1941   
1458          20         68.0     9717            5            6       1950   
1459          20         75.0     9937            5            6       1965   

      YearRemodAdd  MasVnrArea  BsmtF

In [12]:
# Display the categorical columns (pandas prints only the first and last 5 rows)
print("Categorical columns:")
print(df[cat_cols])

Categorical columns:
     MSZoning Street LotShape LandContour Utilities LotConfig LandSlope  \
0          RL   Pave      Reg         Lvl    AllPub    Inside       Gtl   
1          RL   Pave      Reg         Lvl    AllPub       FR2       Gtl   
2          RL   Pave      IR1         Lvl    AllPub    Inside       Gtl   
3          RL   Pave      IR1         Lvl    AllPub    Corner       Gtl   
4          RL   Pave      IR1         Lvl    AllPub       FR2       Gtl   
...       ...    ...      ...         ...       ...       ...       ...   
1455       RL   Pave      Reg         Lvl    AllPub    Inside       Gtl   
1456       RL   Pave      Reg         Lvl    AllPub    Inside       Gtl   
1457       RL   Pave      Reg         Lvl    AllPub    Inside       Gtl   
1458       RL   Pave      Reg         Lvl    AllPub    Inside       Gtl   
1459       RL   Pave      Reg         Lvl    AllPub    Inside       Gtl   

     Neighborhood Condition1 Condition2  ... KitchenQual Functional  \
0         Coll
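To make the two fill rules concrete, here is a small sketch on a made-up frame. The column names are borrowed from the dataset, but the values are invented for illustration. It shows why `.mode().iloc[0]` is needed: `mode()` returns a DataFrame because a column can have tied modes, and `.iloc[0]` picks the first one per column.

```python
import numpy as np
import pandas as pd

# Invented values; one gap in each column.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],   # numerical
    "MSZoning": ["RL", "RM", None, "RL"],         # categorical
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Median of each numerical column (missing values are skipped by default).
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# First mode of each categorical column.
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

The missing `LotFrontage` becomes 80.0 (the median of 60, 80, 100) and the missing `MSZoning` becomes `RL` (the most frequent value).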

## 5. Convert categorical columns to dummies

In [14]:
# Convert categorical columns to dummies
df = pd.get_dummies(df, drop_first=True)

# Split into features and target
X = df.drop(columns=["SalePrice"])
# Display the first 2 rows
print("First 2 rows:")
print(X.head(2))

First 2 rows:
   MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0          60         65.0     8450            7            5       2003   
1          20         80.0     9600            6            8       1976   

   YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  SaleType_ConLI  \
0          2003       196.0         706           0  ...           False   
1          1976         0.0         978           0  ...           False   

   SaleType_ConLw  SaleType_New  SaleType_Oth  SaleType_WD  \
0           False         False         False         True   
1           False         False         False         True   

   SaleCondition_AdjLand  SaleCondition_Alloca  SaleCondition_Family  \
0                  False                 False                 False   
1                  False                 False                 False   

   SaleCondition_Normal  SaleCondition_Partial  
0                  True                  False  
1                  True   

In [15]:
y = df["SalePrice"]
# Display the first 2 rows
print("First 2 rows:")
print(y.head(2))

First 2 rows:
0    208500
1    181500
Name: SalePrice, dtype: int64
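As a quick illustration of what `drop_first=True` does, the sketch below encodes a tiny made-up frame (the values are invented). Numerical columns pass through unchanged, and each categorical column loses its first level, which then acts as the baseline category. That avoids dummy columns that sum to 1 in every row and are therefore perfectly collinear with the regression intercept.

```python
import pandas as pd

# Invented rows: one categorical and one numerical column.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

encoded = pd.get_dummies(toy, drop_first=True)

# 'Grvl' sorts first, so it is dropped and becomes the baseline level.
print(encoded.columns.tolist())
print(encoded)
```

Only `Street_Pave` is created: a row with `Street_Pave == False` is understood to be `Grvl`.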


## 6. Split the data into a training and test set, where the SalePrice column is the target.

In [17]:
# Split into train and test sets (80% train, 20% test)
# X_* = input features (independent variables)
# y_* = target to predict (dependent variable, SalePrice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the first 2 rows
print("First 2 rows:")
print(X_train.head(2))

First 2 rows:
      MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
254           20         70.0     8400            5            6       1957   
1066          60         59.0     7837            6            7       1993   

      YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  SaleType_ConLI  \
254           1957         0.0         922           0  ...           False   
1066          1994         0.0           0           0  ...           False   

      SaleType_ConLw  SaleType_New  SaleType_Oth  SaleType_WD  \
254            False         False         False         True   
1066           False         False         False         True   

      SaleCondition_AdjLand  SaleCondition_Alloca  SaleCondition_Family  \
254                   False                 False                 False   
1066                  False                 False                 False   

      SaleCondition_Normal  SaleCondition_Partial  
254                   True          

In [18]:
# Display the first 2 rows
print("First 2 rows:")
print(y_train.head(2))


First 2 rows:
254     145000
1066    178000
Name: SalePrice, dtype: int64
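For reference, the sketch below runs the same `train_test_split` call on stand-in arrays with 1,460 rows (the row count visible in the outputs above) to show the resulting sizes and the effect of fixing `random_state`. The arrays `X_demo` and `y_demo` are placeholders, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the housing frame.
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same split.
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_tr, X_tr_again))
```

With `test_size=0.2`, 292 rows are held out for testing and 1,168 are used for training.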


## 7. Run a linear regression and report the R2-value and RMSE on the test set.

In [20]:
# Linear regression on the original (untransformed) features
lr = LinearRegression()

# Fit the model on the training data, then predict on the held-out test set.
# y_pred holds the model's predicted sale prices for X_test.
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Evaluate the predictions:
# - R² (r2_score): share of the variance in y_test explained by the model.
#   Ranges from -∞ to 1; closer to 1 is better.
# - RMSE (root mean squared error): typical size of a prediction error,
#   in the same units as SalePrice. Lower is better.
r2_original = r2_score(y_test, y_pred)
rmse_original = np.sqrt(mean_squared_error(y_test, y_pred))
print("Original Data - R2:", r2_original)
print("Original Data - RMSE:", rmse_original)


Original Data - R2: 0.6434086516194634
Original Data - RMSE: 52298.871543652116
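To make the two metrics concrete, here is a small worked sketch that computes R² and RMSE by hand and checks them against scikit-learn. The four prices and predictions are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices.
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 250_000.0])
y_hat = np.array([210_000.0, 140_000.0, 280_000.0, 260_000.0])

residuals = y_true - y_hat
ss_res = np.sum(residuals ** 2)                   # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean

r2_manual = 1.0 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean(residuals ** 2))
mae_manual = np.mean(np.abs(residuals))           # mean absolute error, for comparison

print("R2:", r2_manual, "sklearn:", r2_score(y_true, y_hat))
print("RMSE:", rmse_manual, "sklearn:", np.sqrt(mean_squared_error(y_true, y_hat)))
print("MAE:", mae_manual)
```

Here R² is 0.944, RMSE is about 13,229 and the mean absolute error is 12,500. RMSE is never smaller than the mean absolute error because squaring gives large misses more weight, which is why it should be read as a typical error size rather than a literal average.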



The performance on the original data was:

R²: 0.6434

RMSE: 52,298.87

With all features included, the linear model explains about 64.34% of the variance in SalePrice on the test set, which is a moderate fit. The RMSE of 52,298.87 is in the same units as SalePrice (dollars) and describes the typical size of a prediction error. Because errors are squared before averaging, a few badly mispredicted houses can inflate it, so it is not the same as the average absolute error.

Summary of Findings (the PCA and high-variance figures come from steps 11 and 15 below):
Original data gives a moderate fit (R²: 0.6434, RMSE: 52,298.87).

PCA transformation (90% variance) performed much worse (R²: 0.0635, RMSE: 84,754.58). The PCA kept just 1 principal component, which discarded most of the information the regression needs.

High variance features (variance > 0.1 after min-max scaling) scored slightly better (R²: 0.6556, RMSE: 51,393.43). The gain of about 0.012 in R² comes from a single train/test split, so it is best read as "no worse than the full feature set" rather than a clear improvement.

Key Takeaways:

Dimensionality reduction via PCA is not always beneficial: the components are chosen to preserve feature variance, not to predict the target, so PCA can discard information the model relies on.

On this split, filtering by variance hurt far less than PCA and removed features at no cost in accuracy.

The original features with no transformation were already a reasonable baseline.

## 8. Fit and transform the training features with a PCA so that 90% of the variance is retained 

In [23]:
# PCA Transformation (90% Variance)

pca = PCA(n_components=0.90)
X_train_pca = pca.fit_transform(X_train)


## 9. How many features are in the PCA-transformed matrix?

In [25]:
print("Number of PCA features:", X_train_pca.shape[1])

Number of PCA features: 1


## 10. Transform but DO NOT fit the test features with the same PCA.


In [27]:
X_test_pca = pca.transform(X_test)

PCA kept just 1 component, meaning a single direction accounts for at least 90% of the total variance in the training features.

This does not necessarily mean the features are highly collinear. PCA measures variance in the features' original units, and these features were not standardized first, so a column with a very large numeric range (LotArea is the likely candidate, with values in the thousands to tens of thousands) can account for most of the total variance on its own. The single component is then essentially that one column, and the information in the remaining features is discarded. Standardizing the features before PCA would make each feature contribute on a comparable scale.
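The effect of feature scale on the component count can be checked on synthetic data. The sketch below is not the housing data: it assumes four independent random features, one of them on a much larger scale, and compares `PCA(n_components=0.90)` with and without `StandardScaler`. The names `X_demo`, `pca_raw` and `pca_std` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Four independent synthetic features; the first is on a much larger scale,
# loosely mimicking an area column next to small ordinal ratings.
X_demo = np.column_stack([
    rng.normal(10_000, 10_000, n),
    rng.normal(5, 1, n),
    rng.normal(5, 1, n),
    rng.normal(5, 1, n),
])

# PCA on the raw features versus PCA on standardized features.
pca_raw = PCA(n_components=0.90).fit(X_demo)
pca_std = PCA(n_components=0.90).fit(StandardScaler().fit_transform(X_demo))

print("components kept, raw features:", pca_raw.n_components_)
print("components kept, standardized:", pca_std.n_components_)
print("loadings of first raw component:", np.round(pca_raw.components_[0], 3))
```

On this synthetic data the unscaled fit keeps a single component that is almost entirely the large-scale column, while the standardized fit needs several components to reach 90%. Printing `pca.components_[0]` for the housing PCA would show whether the same thing happened there.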


## 11. Repeat step 7 with your PCA transformed data.

In [30]:
# Linear Regression on PCA Data

lr_pca = LinearRegression()
lr_pca.fit(X_train_pca, y_train)
y_pred_pca = lr_pca.predict(X_test_pca)

r2_pca = r2_score(y_test, y_pred_pca)
rmse_pca = np.sqrt(mean_squared_error(y_test, y_pred_pca))
print("PCA Data - R2:", r2_pca)
print("PCA Data - RMSE:", rmse_pca)

PCA Data - R2: 0.06348978217577883
PCA Data - RMSE: 84754.5802129634


The PCA-transformed data has an R² of 0.0635 and an RMSE of 84,754.58, a sharp decline from the original model (R² = 0.6434, RMSE = 52,298.87).

Interpretation:
An R² of 0.0635 means the model explains only about 6.35% of the variance in SalePrice, far below the original data (R² = 0.6434) and the high-variance data from step 15 (R² = 0.6556).

An RMSE of 84,754.58 means a typical prediction error of roughly $84,755, compared with $52,298.87 for the original data and $51,393.43 for the high-variance data.

What this means for the model:
With only one principal component, the regression is a straight line in a single input, which cannot represent the many factors that drive house prices.

Why this might have happened:
PCA is unsupervised: it picks directions that preserve the variance of the features and never looks at SalePrice. Retaining 90% of the feature variance therefore does not guarantee retaining 90% of the predictive information.

Here the features were not scaled before PCA, so the 90% is most likely dominated by whichever columns have the largest numeric ranges, and the information carried by small-scale but predictive features (quality ratings, dummy variables) is dropped.

Conclusion:
PCA as applied here was not a good fit for this problem. Before ruling it out, it would be worth standardizing the features first and checking how many components are then needed to reach 90% of the variance.

## 13. Find the min-max scaled features in your training set that have a variance above 0.1 

In [33]:
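# Min-max scaling + variance threshold

# Fit the scaler on the training features only, then apply it to both sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames, keeping the original row index so rows stay aligned with y_train / y_test
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

# Select features whose variance on the scaled training set exceeds 0.1
high_variance_cols = X_train_scaled_df.columns[X_train_scaled_df.var() > 0.1]

# Apply the same column subset (chosen from the training data only) to both sets
X_train_highvar = X_train_scaled_df[high_variance_cols]
X_test_highvar = X_test_scaled_df[high_variance_cols]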
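The same filter can also be written with scikit-learn's `VarianceThreshold`, which follows the usual fit-on-train, transform-on-test pattern. The sketch below uses a hypothetical 20-row frame (the column names `area`, `common_flag` and `rare_flag` are invented) to show which kinds of columns survive a 0.1 threshold after min-max scaling. One difference to keep in mind: `VarianceThreshold` uses the population variance, while pandas `.var()` divides by n - 1 by default.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training frame: one skewed continuous column and two 0/1 dummies.
train = pd.DataFrame({
    "area": list(range(1000, 2900, 100)) + [9000],   # 19 typical values plus one outlier
    "common_flag": [1, 0] * 10,                      # 50% ones
    "rare_flag": [1] + [0] * 19,                     # 5% ones
})

scaled = pd.DataFrame(MinMaxScaler().fit_transform(train), columns=train.columns)

# Keep only the columns whose variance on the scaled training data exceeds 0.1.
selector = VarianceThreshold(threshold=0.1).fit(scaled)
kept = scaled.columns[selector.get_support()].tolist()

print(scaled.var(ddof=0).round(4).to_dict())
print(kept)
```

Only `common_flag` is kept. A 0/1 column with a share p of ones has variance p(1 - p), which exceeds 0.1 only when p lies between roughly 0.11 and 0.89, so rare dummy levels are removed. A continuous column squeezed into [0, 1] by an outlier also tends to fall below the threshold, even if it is a useful predictor.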


## 14. Transform but DO NOT fit the test features with the same steps applied in steps 11 and 12.

In [35]:
lr_var = LinearRegression()
lr_var.fit(X_train_highvar, y_train)
y_pred_var = lr_var.predict(X_test_highvar)

## 15. Repeat step 7 with the high variance data.

In [37]:
r2_var = r2_score(y_test, y_pred_var)
rmse_var = np.sqrt(mean_squared_error(y_test, y_pred_var))
print("High Variance Data - R2:", r2_var)
print("High Variance Data - RMSE:", rmse_var)

High Variance Data - R2: 0.6556489506346084
High Variance Data - RMSE: 51393.43224984833


The high-variance model has an R² of 0.6556 and an RMSE of 51,393.43, slightly better than the original data on this test split.

Interpretation:
An R² of 0.6556 means the model explains about 65.56% of the variance in SalePrice, compared with 64.34% for the original model.

An RMSE of 51,393.43 is about 905 lower than the 52,298.87 on the original data, a reduction of under 2%.

Conclusion:
Dropping the low-variance features did not hurt the model and gave a marginally better score with fewer inputs. High variance is not the same as being informative, though: after min-max scaling, a 0/1 dummy column has variance above 0.1 only when its less common value appears in more than roughly 11% of rows, so the filter tends to remove rare categories. The small gain most likely comes from removing sparse columns that the regression was overfitting, and a difference this small on a single split should be treated with caution.

## 16. Summarize your findings.

In [40]:
print("\nSummary of R2 and RMSE:")
print(f"Original Data      => R2: {r2_original:.4f}, RMSE: {rmse_original:.2f}")
print(f"PCA (90% variance) => R2: {r2_pca:.4f}, RMSE: {rmse_pca:.2f}")
print(f"High Variance      => R2: {r2_var:.4f}, RMSE: {rmse_var:.2f}")


Summary of R2 and RMSE:
Original Data      => R2: 0.6434, RMSE: 52298.87
PCA (90% variance) => R2: 0.0635, RMSE: 84754.58
High Variance      => R2: 0.6556, RMSE: 51393.43


Summary of Results

| Model | R² Score | RMSE | Notes |
|---|---|---|---|
| Original Data | 0.6434 | 52,298.87 | Performs reasonably well but may have overfitting or high complexity |
| PCA (90% Variance) | 0.0635 | 84,754.58 | Significantly worse performance, likely due to too much dimensionality reduction |
| High Variance Features | 0.6556 | 51,393.43 | Slightly better performance than original features by focusing on high-variance data |

Analysis:
Original Data:

R² = 0.6434 means the model explains about 64% of the variance in SalePrice, a moderate fit.

RMSE = 52,298.87 is the typical prediction error in dollars. The model uses every dummy-encoded feature with no regularization, so some of this error may come from overfitting.

PCA (90% Variance):

R² = 0.0635 and RMSE = 84,754.58 indicate poor performance. Retaining 90% of the feature variance left a single component, which discarded most of the information relevant to SalePrice.

PCA is not guaranteed to improve a regression: it helps when the directions of greatest feature variance also carry the predictive signal, and it is sensitive to feature scale. Here the features were not standardized first, so the result says more about this setup than about PCA in general.

High Variance Features:

R² = 0.6556 is a small improvement over the original features. The filter keeps the features that vary the most after min-max scaling, which is not the same as keeping the most predictive ones, but it removes near-constant columns that add noise.

RMSE = 51,393.43 is the lowest of the three, though less than 2% below the original model.

Conclusion:
The original data gives a moderate baseline and might benefit from feature selection or regularization to improve generalization.

PCA did not work well as applied here, most likely because a single component dominated by large-scale features cannot capture the relevant information.

The high variance features gave the best balance between model complexity and performance on this split, matching the full model's accuracy with fewer inputs.
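Since the three scores above come from a single 80/20 split, one way to judge whether a difference of about 0.01 in R² is meaningful, and whether regularization helps, is k-fold cross-validation. The sketch below is an illustration on synthetic data from `make_regression`, not a rerun of the housing models; the `alpha=10.0` value and the 5-fold setup are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 features, only 10 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=50, n_informative=10, noise=25.0, random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "ols": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),
}

# One R² score per fold for each model; the scaler is refit inside every fold.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(f"{name}: mean R2 = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

If the gap between two models is smaller than the fold-to-fold standard deviation, the ranking from a single split is not reliable. The same loop could be applied to the original, PCA and high-variance feature sets.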