In [14]:
# Test date: 21 Jan 2025
# Code generated by OpenAI ChatGPT using the following prompt
"""
Attached is a csv file called houses99.txt and it is delimited
by ",".  The first row is a header.  The remaining rows contain
the numerical data.  The first four columns contain the input
features, X_train, which are for predicting the house prices.
The fifth column contains the house prices in units of 
1000's of dollars, y_train. 
We wish to fit a linear model y = w.X + b, where w are the weights,
b is the bias value and X is the input feature set and 
y is the output house price in dollars.
Please give a python code to determine the linear model 
for X_train and y_train using sklearn and the SGDRegressor.  
Use scaling for X_train.  
Please also include the code for reading X_train and y_train 
from the houses99.txt file.  
Using this code, determine the weights and bias and show the model.  
Calculate the weights and the bias using this code, and give the model.  
Print the mean and standard deviation, for each column in X_train.  
Finally, predict the house price for a new feature set [1200, 3, 1, 40].
Give the scaled values for this feature set.  
Also, provide the python code listing and let the 
print statements for numbers be to 8 decimal places.
"""

# Import necessary libraries
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# File path for the dataset
file_path = './data/houses99.txt'  # Adjust if necessary

# Step 1: Load data from the file
with open(file_path, 'r') as file:
    header = file.readline().strip().split(",")  # Read the header

# Load the numerical data (skip the header)
data = np.loadtxt(file_path, delimiter=",", skiprows=1)

# Extract X_train (first 4 columns) and y_train (5th column)
X_train = data[:, :4]
y_train = data[:, 4]

# Step 2: Scale the features in X_train
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Step 3: Fit the SGDRegressor model
sgd_model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_model.fit(X_train_scaled, y_train)

# Step 4: Extract the weights (w) and bias (b)
weights = sgd_model.coef_  # Coefficients (weights)
bias = sgd_model.intercept_[0]  # Intercept (bias)

# Step 5: Compute mean and standard deviation for each column in X_train
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
mean_X_train = scaler.mean_
std_X_train = np.sqrt(scaler.var_)

# Step 6: Predict the house price for a new feature set
new_feature_set = np.array([1200, 3, 1, 40]).reshape(1, -1)
new_feature_scaled = scaler.transform(new_feature_set)
predicted_price = sgd_model.predict(new_feature_scaled)[0]

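# Print Results
# Note: the model was fitted on standardized features, so X1..X4 below are the
# scaled inputs (x - mean) / std, not the raw feature values.
print("Linear Model:")
print(f"y = {weights[0]:.8f} * X1 + {weights[1]:.8f} * X2 + {weights[2]:.8f} * X3 + {weights[3]:.8f} * X4 + {bias:.8f}")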

print("\nMean of X_train (to 8 decimal places):")
print([f"{mean:.8f}" for mean in mean_X_train])

print("\nStandard Deviation of X_train (to 8 decimal places):")
print([f"{std:.8f}" for std in std_X_train])

print("\nScaled values for the feature set [1200, 3, 1, 40] (to 8 decimal places):")
print([f"{scaled:.8f}" for scaled in new_feature_scaled.flatten()])

print(f"\nPredicted house price for the feature set [1200, 3, 1, 40]: {predicted_price:.8f} (in thousands of dollars)")



Linear Model:
y = 110.28069076 * X1 + -21.13073419 * X2 + -32.54592156 * X3 + -38.01263833 * X4 + 363.16331343

Mean of X_train (to 8 decimal places):
['1418.37373737', '2.71717172', '1.38383838', '38.38383838']

Standard Deviation of X_train (to 8 decimal places):
['411.61562893', '0.65196523', '0.48631932', '25.77788069']

Scaled values for the feature set [1200, 3, 1, 40] (to 8 decimal places):
['-0.53052829', '0.43380884', '-0.78927234', '0.06269567']

Predicted house price for the feature set [1200, 3, 1, 40]: 318.79395493 (in thousands of dollars)
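
As a sanity check, the scaled feature values and the predicted price can be reproduced by hand from the numbers printed above. The sketch below hard-codes those printed values (rounded to 8 decimal places, so the result should agree only to about 6 decimal places); the variable names are illustrative and are not taken from the cell above.

```python
import numpy as np

# Values copied from the printed output above (rounded to 8 decimal places)
mean = np.array([1418.37373737, 2.71717172, 1.38383838, 38.38383838])
std = np.array([411.61562893, 0.65196523, 0.48631932, 25.77788069])
weights = np.array([110.28069076, -21.13073419, -32.54592156, -38.01263833])
bias = 363.16331343

x_new = np.array([1200.0, 3.0, 1.0, 40.0])

# z-score scaling: the same transformation StandardScaler.transform applies
x_scaled = (x_new - mean) / std

# Linear model evaluated on the scaled features: y = w . x_scaled + b
price = float(np.dot(weights, x_scaled) + bias)

print("Scaled features:", [f"{v:.8f}" for v in x_scaled])
print(f"Predicted price: {price:.8f} (in thousands of dollars)")
```

The scaled values should match the list printed by the notebook, and the price should agree with the `predict` result up to the rounding of the inputs.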

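The weights printed above apply to the standardized features, so they cannot be multiplied directly by raw values such as 1200 sq ft. A minimal sketch of converting the model to the original feature units follows. Substituting x_scaled = (x - mean) / std into y = w . x_scaled + b gives w_raw = w / std and b_raw = b - sum(w * mean / std). The names `w_raw` and `b_raw` are introduced here for illustration only.

```python
import numpy as np

# Values copied from the printed output above (rounded to 8 decimal places)
mean = np.array([1418.37373737, 2.71717172, 1.38383838, 38.38383838])
std = np.array([411.61562893, 0.65196523, 0.48631932, 25.77788069])
w_scaled = np.array([110.28069076, -21.13073419, -32.54592156, -38.01263833])
b_scaled = 363.16331343

# Fold the scaling into the coefficients:
#   y = w_scaled . ((x - mean) / std) + b_scaled
#     = (w_scaled / std) . x + (b_scaled - sum(w_scaled * mean / std))
w_raw = w_scaled / std
b_raw = b_scaled - float(np.sum(w_scaled * mean / std))

x_new = np.array([1200.0, 3.0, 1.0, 40.0])

# Both forms of the model should give the same prediction
price_raw = float(np.dot(w_raw, x_new) + b_raw)
price_scaled = float(np.dot(w_scaled, (x_new - mean) / std) + b_scaled)

print("Weights in original units:", [f"{w:.8f}" for w in w_raw])
print(f"Bias in original units: {b_raw:.8f}")
print(f"Prediction from raw features: {price_raw:.8f}")
```

In this form each weight reads as the change in price (in thousands of dollars) per unit change of the corresponding raw feature.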