In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA


# Item 1 Dataset Splitting
Load the dataset into a pandas DataFrame and perform train / test splitting using a random split and a group-based split.

In [None]:


# Load the dataset from csv
df = pd.read_csv('dataset.csv')
# head() is not the last expression in the cell, so print it explicitly
print(df.head())

# Do the random train / test split with ratio of 70% / 30%
rand_train_df, rand_test_df = train_test_split(df, test_size=0.3, random_state=42)
print(f"Rand train set shape: {rand_train_df.shape}")
print(f"Rand test set shape: {rand_test_df.shape}")

print('------------------------------')

# Perform train test splitting based on X8 values

# Print value counts of X8 variable to see distribution
x8_counts = df['X8'].value_counts().sort_index()
print("Value counts in X8 variable:")
print(x8_counts)
print('------------------------------')


test_x8_values = [1, 2]  # two values, arbitrarily chosen
print(f"X8 values held out for the test set: {test_x8_values}")

group_test_df = df[df['X8'].isin(test_x8_values)]
group_train_df = df[~df['X8'].isin(test_x8_values)]

print(f"Group train set shape: {group_train_df.shape}")
print(f"Group test set shape: {group_test_df.shape}")




## Item 1 Commentary

Splitting a dataset into training and test sets at random is advisable because the dataset may contain an inherent order, and splitting by row position could produce imbalanced train / test splits. Splitting by withholding two classes of the X8 variable is far more problematic: the training set contains no examples of the two withheld classes, so the model cannot learn how they relate to the target. The upshot is that a model trained on the group split would be expected to perform poorly at predicting the heating or cooling loads for buildings in the two withheld X8 classes.
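
As an aside, scikit-learn can perform a group-based split directly. The following is a minimal sketch rather than the method used above: it builds a small synthetic DataFrame (the `demo_df` name and its random values are assumptions for illustration) and uses `GroupShuffleSplit` to hold out two whole X8 groups, so that no group appears on both sides of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the real dataset (assumption): X8 takes values 0-5
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "X8": rng.integers(0, 6, size=120),
    "Y1": rng.normal(size=120),
})

# An integer test_size is the number of whole groups placed in the test set
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=42)
train_idx, test_idx = next(splitter.split(demo_df, groups=demo_df["X8"]))

demo_train = demo_df.iloc[train_idx]
demo_test = demo_df.iloc[test_idx]

# Every X8 group lands entirely in one side of the split
train_groups = set(demo_train["X8"].tolist())
test_groups = set(demo_test["X8"].tolist())
print("Train groups:", sorted(train_groups))
print("Test groups:", sorted(test_groups))
```

Unlike the hand-picked `[1, 2]` above, the held-out groups here are chosen at random, which may be preferable when no particular group is of interest.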


# Item 2 Dataset Split Analysis

Compare, via bar charts, the effect of the two split methods on the X6 (orientation) counts in each of the splits.

In [None]:
# Create a mapping for orientation values to their meanings
orientation_mapping = {2: 'East', 3: 'South', 4: 'West', 5: 'North'}

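# Fixed orientation order, so every count series lines up with the bar positions
# even if an orientation is missing from one of the splits
orientation_order = sorted(orientation_mapping)

# Count the occurrences of each orientation in random split datasets
rand_train_counts = rand_train_df['X6'].value_counts().reindex(orientation_order, fill_value=0)
rand_test_counts = rand_test_df['X6'].value_counts().reindex(orientation_order, fill_value=0)

# Count the occurrences of each orientation in group split datasets
group_train_counts = group_train_df['X6'].value_counts().reindex(orientation_order, fill_value=0)
group_test_counts = group_test_df['X6'].value_counts().reindex(orientation_order, fill_value=0)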

# Set up the figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot for random split
x = np.arange(len(orientation_mapping))
width = 0.35

ax1.bar(x - width/2, rand_train_counts, width, label='Train')
ax1.bar(x + width/2, rand_test_counts, width, label='Test')
ax1.set_xticks(x)
ax1.set_xticklabels([orientation_mapping[i] for i in sorted(orientation_mapping.keys())])
ax1.set_xlabel('Building Orientation')
ax1.set_ylabel('Number of Buildings')
ax1.set_title('Building Orientations in Random Split')
ax1.legend()

# Plot for group split
ax2.bar(x - width/2, group_train_counts, width, label='Train')
ax2.bar(x + width/2, group_test_counts, width, label='Test')
ax2.set_xticks(x)
ax2.set_xticklabels([orientation_mapping[i] for i in sorted(orientation_mapping.keys())])
ax2.set_xlabel('Building Orientation')
ax2.set_ylabel('Number of Buildings')
ax2.set_title('Building Orientations in Group Split')
ax2.legend()

plt.tight_layout()
plt.show()

# Print the actual counts for reference
print("Random Split - Building Orientations:")
for orient_val, orient_name in orientation_mapping.items():
    train_count = rand_train_counts.get(orient_val, 0)
    test_count = rand_test_counts.get(orient_val, 0)
    print(f"{orient_name}: Train={train_count}, Test={test_count}")

print("\nGroup Split - Building Orientations:")
for orient_val, orient_name in orientation_mapping.items():
    train_count = group_train_counts.get(orient_val, 0)
    test_count = group_test_counts.get(orient_val, 0)
    print(f"{orient_name}: Train={train_count}, Test={test_count}")

# Manually check the X6 counts for one X8 group (1 in this case). I expected that holding out
# groups where the glazing is 1 (north) would affect the X6 counts, but this wasn't the case.
group_test_x8_1_counts = group_test_df[group_test_df['X8'] == 1]['X6'].value_counts().sort_index()

print("\nGroup Test - Building Orientations where X8 = 1:")
for orient_val, orient_name in orientation_mapping.items():
    count = group_test_x8_1_counts.get(orient_val, 0)
    print(f"{orient_name}: Count={count}")


## Item 2 Commentary 

Both the random split and the group split produced train and test sets in which no value of the X6 (orientation) variable was under- or over-represented. This result is counterintuitive for the group split: holding out two glazing classes (north and east) might be expected to affect the building orientation variable X6. Manual investigation of one X8 group (group 1, north) shows that the X6 values are uniformly distributed across the building orientations within that group. The same was true for all other non-zero X8 groups. The group split datasets therefore keep uniform X6 counts, because X6 is balanced within every X8 class.
Some machine learning methods make distributional assumptions about the data (for example, normality), so a strongly skewed or unbalanced variable can affect predictive performance.
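
The balance check described above can be done for all X8 groups at once with a cross-tabulation. This is a sketch on a hypothetical full-factorial table (the `demo_df` data below is made up, with three buildings per X6 / X8 combination); on the real data one would pass `df['X8']` and `df['X6']` instead.

```python
import itertools
import pandas as pd

# Hypothetical full-factorial design: every X8 value paired with every X6 value
rows = [
    {"X6": x6, "X8": x8}
    for x6, x8, _ in itertools.product([2, 3, 4, 5], [1, 2, 3, 4, 5], range(3))
]
demo_df = pd.DataFrame(rows)

# Rows: X8 group, columns: X6 orientation, cells: number of buildings
table = pd.crosstab(demo_df["X8"], demo_df["X6"])
print(table)

# True when every orientation appears equally often inside every X8 group
is_balanced = bool((table.nunique(axis=1) == 1).all())
print("X6 balanced within each X8 group:", is_balanced)
```

When X6 is balanced inside every X8 group, removing whole X8 groups removes the same number of buildings from each orientation, which is consistent with the flat bar charts above.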

# Item 3 Linear Regression 

In this item we use the random train / test split dataset and fit a linear regression model. We then evaluate that model using the test dataset to determine mean squared error, root mean squared error and R^2 score.

In [None]:
input_features = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']
target_feature = 'Y1'

lr_model = LinearRegression()

# Define the input feature and target arrays
X_train = rand_train_df[input_features]
y_train = rand_train_df[target_feature]

X_test = rand_test_df[input_features]
y_test = rand_test_df[target_feature]

# Fit the model
lr_model.fit(X_train, y_train)

# Make predictions on the test set to evaluate the model
y_test_pred = lr_model.predict(X_test)

# Evaluate the model (note: despite the train_ prefix, these are test-set metrics)
train_mse = mean_squared_error(y_test, y_test_pred)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(y_test, y_test_pred)

print("Linear Regression Results:")
print(f"Mean Squared Error: {train_mse:.4f}")
print(f"Root Mean Squared Error: {train_rmse:.4f}")
print(f"R^2 Score: {train_r2:.4f}")




## Item 3 Commentary

Mean Squared Error (MSE) is the average squared difference between the observed and predicted values; its unit is the square of the target variable's unit. A lower value is better, but judging whether it is good requires comparison with the distribution of the target variable. Root Mean Squared Error (RMSE) is the square root of MSE, which puts it in the same unit as the target; the same caveat applies. The R^2 score measures the proportion of variance in the target variable that is explained by the input features, and 1 is the highest achievable score. The R^2 score of 0.9128 means the model explains about 91% of the variance in the target on the test set, so the model has high predictive power.
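
To make the three definitions concrete, the sketch below computes them by hand with NumPy and checks them against scikit-learn. The `y_true` and `y_pred` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted values, for illustration only
y_true = np.array([10.0, 12.0, 15.0, 20.0, 23.0])
y_pred = np.array([11.0, 11.5, 14.0, 21.0, 22.5])

residuals = y_true - y_pred

# MSE: mean of the squared residuals; RMSE: its square root
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R^2: one minus (residual sum of squares / total sum of squares of the target)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse:.4f} (sklearn: {mean_squared_error(y_true, y_pred):.4f})")
print(f"RMSE: {rmse:.4f}")
print(f"R^2:  {r2:.4f} (sklearn: {r2_score(y_true, y_pred):.4f})")
```

The denominator `ss_tot` depends only on the target values, which is why R^2 describes variance in the target rather than in the inputs.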

# Item 4 Principal Component Analysis

Use PCA to reduce the dimensionality of the input feature set and evaluate the PCA dataset against the original dataset


In [None]:

pca = PCA(n_components=3)

# Fit PCA on the training data only, then apply the same projection to the test data
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Print explained variance
explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio by the 3 principal components:")
print(explained_variance)
print(f"Total variance explained: {sum(explained_variance):.4f}")

# Linear regression model on PCA dataset
pca_model = LinearRegression()
pca_model.fit(X_train_pca, y_train)

y_test_pca_pred = pca_model.predict(X_test_pca)

# Evaluate the model performance
pca_mse = mean_squared_error(y_test, y_test_pca_pred)
pca_rmse = np.sqrt(pca_mse)
pca_r2 = r2_score(y_test, y_test_pca_pred)

print("\nLinear Regression Results after PCA:")
print(f"Mean Squared Error: {pca_mse:.4f}")
print(f"Root Mean Squared Error: {pca_rmse:.4f}")
print(f"R^2 Score: {pca_r2:.4f}")

print("\nComparison of model performance:")
print(f"Original MSE: {train_mse:.4f} vs PCA MSE: {pca_mse:.4f}")
print(f"Original R^2: {train_r2:.4f} vs PCA R^2: {pca_r2:.4f}")




## Item 4 Commentary

The R^2 score was 0.9128 for the original linear model and 0.7924 for the linear model fitted on the PCA dataset, so reducing the inputs to three principal components hurt performance. The pros of reducing dimensionality are that it can simplify the model, discard potentially harmful features and make analysis simpler. One con is that a reduced set of components cannot capture 100% of the variance in the original data, and the variance that is discarded may still be predictive of the target. Another is the level of abstraction introduced, as each component is a combination of the original features.
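
One way to see the variance trade-off is to look at the cumulative explained variance as components are added. The sketch below uses a synthetic 8-feature matrix (`X_demo` is an assumption, not the real data) and also shows that passing a float to `n_components` asks scikit-learn for the fewest components that reach that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic 8-feature matrix with correlated columns (assumption, not the real data)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 8))

# Keep all components to inspect how the explained variance accumulates
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
for k, total in enumerate(cumulative, start=1):
    print(f"{k} components: {total:.4f} of variance")

# A float in (0, 1) selects the fewest components reaching that variance share
pca_95 = PCA(n_components=0.95).fit(X_demo)
print("Components needed for 95%:", pca_95.n_components_)
```

Note that this only measures variance in the inputs. A component that explains little input variance can still matter for predicting the target, so the regression score should be checked as well.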

# Item 5 Ridge Regression

The task is to use Ridge Regression on the PCA dataset and to compare the results with linear regression results on the PCA dataset

In [None]:


ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train_pca, y_train)

y_test_ridge_pred = ridge_model.predict(X_test_pca)

# Evaluate the ridge model on the test set
ridge_mse = mean_squared_error(y_test, y_test_ridge_pred)
ridge_rmse = np.sqrt(ridge_mse)
ridge_r2 = r2_score(y_test, y_test_ridge_pred)

print("\nRidge Regression Results on PCA dataset:")
print(f"Mean Squared Error: {ridge_mse:.4f}")
print(f"Root Mean Squared Error: {ridge_rmse:.4f}")
print(f"R^2 Score: {ridge_r2:.4f}")

print("\nComparison of Ridge vs Linear Regression on PCA dataset:")
print(f"Linear Regression PCA MSE: {pca_mse:.4f} vs Ridge PCA MSE: {ridge_mse:.4f}")
print(f"Linear Regression PCA R^2: {pca_r2:.4f} vs Ridge PCA R^2: {ridge_r2:.4f}")



## Item 5 Commentary

Adding regularisation did not improve performance: the MSE and R^2 were almost identical to those of linear regression. Regularisation is often used to address multicollinearity, which principal component analysis had already addressed because the components it creates are uncorrelated. In general, regularisation may still be useful after PCA if a large number of components is retained, as overfitting may still occur. Components with small eigenvalues (low variance) may also cause numerical instability, which regularisation would help to address.
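
The reasoning above can be checked on a small example. The sketch below uses synthetic data (`X_demo` and `y_demo` are assumptions for illustration) to confirm that PCA components are uncorrelated, and that on such components ridge shrinks each ordinary least squares coefficient by the factor d / (d + alpha), where d is the sum of squares of that component. When d is much larger than alpha the shrinkage is negligible, which would produce near-identical results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic correlated features and target (assumption, not the real data)
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(200, 8))
y_demo = 2.0 * latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=200)

# Project onto three principal components (scores are centred by PCA)
Z = PCA(n_components=3).fit_transform(X_demo)

# Off-diagonal covariances between components are zero up to rounding error
cov = np.cov(Z, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print("Largest off-diagonal covariance:", np.abs(off_diag).max())

alpha = 10.0
ols = LinearRegression().fit(Z, y_demo)
ridge = Ridge(alpha=alpha).fit(Z, y_demo)

# With orthogonal columns, ridge scales each OLS coefficient by d / (d + alpha)
d = (Z ** 2).sum(axis=0)
shrink = d / (d + alpha)
print("Shrink factors:    ", shrink)
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

A much larger `alpha`, or components with very small variance, would make the shrink factors noticeably smaller than 1 and the two models would then diverge.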