# Part 1


In [None]:
## import the dataset loader function as well as pandas
from sklearn.datasets import fetch_california_housing
import pandas as pd

## load the California housing dataset
housing = fetch_california_housing()

## create a pandas dataframe with the feature data and a series with the target
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='med_house_value')




In [1]:
## printing previews of the dataset and series using head()
print("Feature dataset preview: ")
print(X.head())
print("\nTarget variable preview: ")
print(y.head())

Feature dataset preview: 


NameError: name 'X' is not defined

In [None]:
# displaying number of missing values
print(f"\nFeature names along with number of missing values per feature: \n{X.isnull().sum()}")


Feature names along with number of missing values per feature: 
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64


In [None]:
# displaying the summary statistics of the feature dataset and target variable
print(f"\nSummary statistics of features: \n{X.describe()}")
print(f"\nSummary statistics of target variable: \n{y.describe()}")


Summary statistics of features: 
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude  
count  20640.000000  20640.000000  20640.000000  
mean       3.070655     35.631861   -119.569704  
std       10.386050      2.135952      2.003532  
min        0.692308     32.540000   -124.350000  
25% 

# Part 2

In [None]:
# import necessary linear regression classes and functions
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression as lr

# use tts to split the data into training and testing sets
X_train_raw, X_test_raw, y_train_raw, y_test_raw = tts(X, y, test_size=0.2, random_state=23)

# creating a linear regression model and fitting it to the training data
lin_reg_raw = lr()
lin_reg_raw.fit(X_train_raw, y_train_raw)

# predicting the target variable using the test data
y_pred_raw = lin_reg_raw.predict(X_test_raw)


In [None]:
# importing necessary methods for evaluation of model
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score
# evaluating the model using the mean squared error, root mean squared error, and R^2 score
mse_raw = mean_squared_error(y_test_raw, y_pred_raw)
rmse_raw = root_mean_squared_error(y_test_raw, y_pred_raw)
r2_raw = r2_score(y_test_raw, y_pred_raw)

print("Raw Data Model:")
print(f"Mean Squared Error: {mse_raw:.2f}")
print(f"Root Mean Squared Error: {rmse_raw:.2f}")
print(f"R² Score: {r2_raw:.2f}")

print("\nModel Coefficients: ")
print(pd.Series(lin_reg_raw.coef_,
                index=X.columns))
print("\nModel Intercept: ")
print(pd.Series(lin_reg_raw.intercept_))

Raw Data Model:
Mean Squared Error: 0.53
Root Mean Squared Error: 0.73
R² Score: 0.61

Model Coefficients: 
MedInc        0.436654
HouseAge      0.009149
AveRooms     -0.109385
AveBedrms     0.616678
Population   -0.000003
AveOccup     -0.008606
Latitude     -0.419646
Longitude    -0.431455
dtype: float64

Model Intercept: 
0   -36.570768
dtype: float64


### Part 2 Interpretation

The R^2 score tells us that our model explains about 61% of the variation in med_house_value, our target variable, on the test set. This number doesn't tell us much in a vacuum, because what counts as a good R^2 differs between fields and some analyses care more about causation than about predictive fit. However, given how many seemingly relevant variables we have, 0.61 does seem fairly low, and I am inclined to think it is not a great model.

Based on the model's raw coefficients, the average number of bedrooms (AveBedrms) has the largest coefficient, followed by median income (MedInc) and then the two location features, while the remaining coefficients are much smaller. AveBedrms and MedInc both have positive coefficients; among the non-location features, the average number of rooms (AveRooms) has the largest negative coefficient, although its magnitude is smaller. Because the features are measured in different units, these raw magnitudes are only a rough guide to each feature's impact. The negative Latitude and Longitude coefficients indicate that location affects the median house value, but they are harder to interpret intuitively.
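One rough way to put the raw coefficients on a common footing is to multiply each one by its feature's standard deviation, which gives the change in the prediction for a one-standard-deviation change in that feature. The sketch below uses the coefficients printed above and the `std` row of the `describe()` output in Part 1. Those standard deviations are for the full dataset rather than the training split, so the results are approximate, and the variable names are only illustrative.

```python
# Raw coefficients copied from the "Model Coefficients" output above
coefs = {
    "MedInc": 0.436654, "HouseAge": 0.009149, "AveRooms": -0.109385,
    "AveBedrms": 0.616678, "Population": -0.000003, "AveOccup": -0.008606,
    "Latitude": -0.419646, "Longitude": -0.431455,
}

# Standard deviations copied from the describe() output in Part 1
# (full dataset, so only an approximation for the training split)
stds = {
    "MedInc": 1.899822, "HouseAge": 12.585558, "AveRooms": 2.474173,
    "AveBedrms": 0.473911, "Population": 1132.462122, "AveOccup": 10.386050,
    "Latitude": 2.135952, "Longitude": 2.003532,
}

# Change in the prediction for a one-standard-deviation change in each feature
per_std_effect = {name: coefs[name] * stds[name] for name in coefs}

# Rank features by the absolute size of that change
ranking = sorted(per_std_effect, key=lambda name: abs(per_std_effect[name]), reverse=True)

for name in ranking:
    print(f"{name:<11}{per_std_effect[name]:+.3f}")
```

On this measure Latitude, Longitude, and MedInc rank ahead of AveBedrms, which suggests that the large raw AveBedrms coefficient partly reflects the small spread of that feature.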

I think the predicted values may not match the actual values very well, because the RMSE of 0.73 is large relative to the range of values med_house_value can take. It is around 1/7 of the range and roughly 60% of the standard deviation. Overall, this means the model is not trustworthy enough to give a close prediction of the target variable. More context about the feature variables may be needed, but with this error and an R^2 of 0.61, the model does not seem to fit very well.
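The comparison between the RMSE and the standard deviation can be cross-checked from R^2 alone. On a given test set, R^2 = 1 - MSE / Var(y), so RMSE divided by the (population) standard deviation of the test targets equals sqrt(1 - R^2). A minimal sketch using the rounded values printed above; the helper name is illustrative:

```python
import math

def rmse_to_std_ratio(r2):
    """Ratio RMSE / std(y_test) implied by an R^2 score on the same test set."""
    return math.sqrt(1.0 - r2)

r2_raw = 0.61    # rounded R^2 from the output above
rmse_raw = 0.73  # rounded RMSE from the output above

ratio = rmse_to_std_ratio(r2_raw)
implied_std = rmse_raw / ratio  # test-set std of the target implied by both numbers

print(f"RMSE / std:  {ratio:.2f}")
print(f"implied std: {implied_std:.2f}")
```

Because the inputs are rounded to two decimals, the ratio is only approximate, but it lands a little above 0.6.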

# Part 4

The three features I am going to choose for my simplified model are AveBedrms, MedInc, and HouseAge. I am choosing the first two because their coefficients had the largest magnitudes in the first model and, intuitively, both make sense as factors in house value. Besides the location features, the next-largest magnitude was AveRooms, but I didn't want to include it because I think it may covary strongly with AveBedrms and so may not add much information. For the third feature I chose HouseAge, because it made more intuitive sense to me than the remaining features: surely wear and tear relates to value.
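The suspicion that AveRooms covaries with AveBedrms can be checked with a Pearson correlation. The sketch below implements the coefficient in plain Python and runs it on a few made-up numbers. The toy values and the function name are hypothetical and are not taken from the housing data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, NOT taken from the housing data
toy_rooms = [4.4, 5.2, 6.1, 7.0, 8.3]
toy_bedrooms = [1.0, 1.05, 1.1, 1.3, 1.6]

r = pearson_r(toy_rooms, toy_bedrooms)
print(f"toy correlation: {r:.2f}")
```

On the actual feature DataFrame, pandas computes the same statistic with `X[["AveRooms", "AveBedrms"]].corr()`. A value close to 1 or -1 would support leaving one of the two features out.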

In [None]:
# identify my new features in a list
new_features = ["AveBedrms", "MedInc", "HouseAge"]

# create a new feature dataset with only the new features
X_new = X[new_features].copy()

# split the new feature dataset into training and testing sets
X_train_new, X_test_new, y_train_new, y_test_new = tts(X_new, y, test_size=0.2, random_state=77)

# initialize a new linear regression model and fit it to the training data
lin_reg_new = lr()
lin_reg_new.fit(X_train_new, y_train_new)

# predict the target variable using the new test data
y_pred_new = lin_reg_new.predict(X_test_new)

In [None]:
# evaluate the new model using the mean squared error, root mean squared error, and R^2 score
mse_new = mean_squared_error(y_test_new, y_pred_new)
rmse_new = root_mean_squared_error(y_test_new, y_pred_new)
r2_new = r2_score(y_test_new, y_pred_new)   

print("\nNew Feature Model:")
print(f"Mean Squared Error: {mse_new:.2f}")
print(f"Root Mean Squared Error: {rmse_new:.2f}")
print(f"R² Score: {r2_new:.2f}")

print("\nNew Model Coefficients: ")
print(pd.Series(lin_reg_new.coef_,
                index=X_new.columns))
print("\nNew Model Intercept: ")
print(pd.Series(lin_reg_new.intercept_))


New Feature Model:
Mean Squared Error: 0.66
Root Mean Squared Error: 0.81
R² Score: 0.50

New Model Coefficients: 
AveBedrms    0.025596
MedInc       0.433447
HouseAge     0.017980
dtype: float64

New Model Intercept: 
0   -0.150595
dtype: float64


My new model has a higher mean squared error and root mean squared error than the original model, so it predicts the target variable even less accurately. However, it still reaches an R^2 score of 0.50 with only 3 of the original 8 features, so most of the explanatory power has been kept. I take this to mean that I have chosen at least some of the most relevant variables. In terms of the coefficients, a very interesting observation is that the MedInc coefficient is virtually unchanged (0.433 versus 0.437), which may suggest that it is genuinely related to the target variable in this way. In contrast, AveBedrms has a much smaller coefficient than in the first model (0.026 versus 0.617), while the HouseAge coefficient is still small but has roughly doubled (0.018 versus 0.009).

### Part 4 Interpretation

The simplified model does not do a good job either, and its predictive performance is actually worse. I think it does get rid of some unnecessary features, but there may still be other confounding variables. It explains less of the variation and has a higher MSE and RMSE. I would not use this model in practice, because I chose the variables based on the earlier coefficients and on intuition, and there are better ways to determine which variables are relevant to a model. I don't think it does a better job than the first model at predicting the target variable or explaining relationships. However, I do like having fewer variables, because this makes the model less likely to overfit the training data.
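One way to weigh "fewer variables" against "lower R^2" is adjusted R^2, which penalizes R^2 for each extra feature. It is not computed anywhere in this notebook, so the sketch below is only an illustration. It uses the rounded R^2 values printed above and assumes 4,128 test rows (20% of the 20,640 rows). The two models were also split with different `random_state` values (23 and 77), so the comparison is not strictly like-for-like.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: R^2 penalized for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Assumed test-set size: 20% of the 20,640 rows
n_test = 4128

adj_full = adjusted_r2(0.61, n_test, 8)   # all eight features
adj_small = adjusted_r2(0.50, n_test, 3)  # AveBedrms, MedInc, HouseAge

print(f"full model:       {adj_full:.3f}")
print(f"simplified model: {adj_small:.3f}")
```

With this many rows the penalty is tiny, so the adjusted values stay very close to the plain R^2 scores and the full model still comes out ahead.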

# Part 3 



In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler and apply it to the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Split the scaled data
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = tts(X_scaled, y, test_size=0.2, random_state=23)

# Initialize and train the linear regression model on scaled data
lin_reg_scaled = lr()
lin_reg_scaled.fit(X_train_scaled, y_train_scaled)

# Make predictions on the test set
y_pred_scaled = lin_reg_scaled.predict(X_test_scaled)

# Evaluate model performance
mse_scaled = mean_squared_error(y_test_scaled, y_pred_scaled)
r2_scaled = r2_score(y_test_scaled, y_pred_scaled)
rmse_scaled = root_mean_squared_error(y_test_scaled, y_pred_scaled)

print("\nScaled Data Model:")
print(f"Mean Squared Error: {mse_scaled:.2f}")
print(f"Root Mean Squared Error: {rmse_scaled:.2f}")
print(f"R² Score: {r2_scaled:.2f}")
print("Model Coefficients (Scaled):")
print(pd.Series(lin_reg_scaled.coef_, index=X.columns))


Scaled Data Model:
Mean Squared Error: 0.56
Root Mean Squared Error: 0.75
R² Score: 0.58
Model Coefficients (Scaled):
MedInc        0.852382
HouseAge      0.122382
AveRooms     -0.305116
AveBedrms     0.371132
Population   -0.002298
AveOccup     -0.036624
Latitude     -0.896635
Longitude    -0.868927
dtype: float64
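
In the cell above, the scaler is fit on all of X before the split, so the test rows influence the scaling statistics. A minimal sketch of one alternative is to put `StandardScaler` inside a pipeline so that it is fit on the training rows only. The example uses synthetic data from `make_regression` as a stand-in (not the housing set), and the variable names are illustrative. It also checks that ordinary least squares with an intercept gives the same predictions with or without standardization.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the California housing set)
X_demo, y_demo = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

# Split first, so the scaler never sees the test rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=23)

# Plain linear regression on the unscaled features
plain = LinearRegression().fit(X_tr, y_tr)

# Pipeline: the scaler is fit on X_tr only, then reused to transform X_te
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

r2_plain = r2_score(y_te, plain.predict(X_te))
r2_scaled = r2_score(y_te, scaled.predict(X_te))
same = np.allclose(plain.predict(X_te), scaled.predict(X_te))

print(f"R^2 unscaled: {r2_plain:.4f}")
print(f"R^2 scaled:   {r2_scaled:.4f}")
print(f"identical predictions: {same}")
```

With the notebook's data, the same pattern would be `make_pipeline(StandardScaler(), lr()).fit(X_train_raw, y_train_raw)`. Since standardization does not change the predictions of this kind of model, the scaled and raw models would be expected to report the same MSE and R^2 when they share a split. What scaling does change is the coefficients, which become comparable across features.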
