### Data Loading and Exploration

In [3]:
#Loading the Data
import pandas as pd 
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [4]:
#Create a Pandas DataFrame for the features and a Series for the target variable (MedHouseVal, the median house value in units of $100,000).
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name="MedHouseVal")

In [5]:
#Display the first five rows of the dataset
X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [6]:
#Count the missing values in each feature (the index of the result lists the feature names)
X.isnull().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

In [7]:
#Generate summary statistics (mean, min, max, etc.)
print("Feature Means:")
print(X.mean())
print("Feature IQR and medians:")
print(X.quantile([.25, .5,.75]))
print("Feature Maximum Values:")
print(X.max())
print("Feature Minimum Values:")
print(X.min())

Feature Means:
MedInc           3.870671
HouseAge        28.639486
AveRooms         5.429000
AveBedrms        1.096675
Population    1425.476744
AveOccup         3.070655
Latitude        35.631861
Longitude     -119.569704
dtype: float64
Feature IQR and medians:
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0.25  2.56340      18.0  4.440716   1.006079       787.0  2.429741     33.93   
0.50  3.53480      29.0  5.229129   1.048780      1166.0  2.818116     34.26   
0.75  4.74325      37.0  6.052381   1.099526      1725.0  3.282261     37.71   

      Longitude  
0.25    -121.80  
0.50    -118.49  
0.75    -118.01  
Feature Maximum Values:
MedInc           15.000100
HouseAge         52.000000
AveRooms        141.909091
AveBedrms        34.066667
Population    35682.000000
AveOccup       1243.333333
Latitude         41.950000
Longitude      -114.310000
dtype: float64
Feature Minimum Values:
MedInc          0.499900
HouseAge        1.000000
AveRooms       
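
The separate mean, quantile, max and min calls above can also be collected into one table with `DataFrame.describe()`. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative assumptions, not rows from the housing data.

```python
import pandas as pd

# Hypothetical toy frame standing in for X; values are made up for illustration.
toy = pd.DataFrame({
    "MedInc": [2.0, 4.0, 6.0, 8.0],
    "HouseAge": [10.0, 20.0, 30.0, 40.0],
})

# describe() reports count, mean, std, min, quartiles and max per column.
summary = toy.describe()
print(summary)

# The interquartile range is the 75% row minus the 25% row.
iqr = summary.loc["75%"] - summary.loc["25%"]
print(iqr)
```

Running `X.describe()` on the real features should give the same quantities as the four print blocks, in a single output.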

### Linear Regression on Unscaled Data (no feature scaling)

In [8]:
# Splitting the data into training and test sets (80% training, 20% testing)
# Note: no random_state is set, so the split and the metrics below change from run to run
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2)

In [9]:
#Training a linear regression model on the unscaled data
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train) 

#Making Predictions on the test set 
y_pred = lin_reg.predict(X_test)
y_pred

array([2.79903923, 2.43133763, 1.44723925, ..., 1.89745381, 1.43734959,
       2.8823339 ])

In [15]:
#Evaluation of the model's performance (MSE, RMSE, r^2 Score)
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

mse_lin = mean_squared_error(y_test, y_pred)
rmse_lin = root_mean_squared_error(y_test, y_pred)
r2s_lin = r2_score(y_test, y_pred)

print("Unscaled Data Model:")
print(f"Mean Squared Error:{mse_lin:.2f}")
print(f"Root Mean Squared Error: {rmse_lin:.2f}")
print(f"r^2 Score:{r2s_lin:.2f}")

Unscaled Data Model:
Mean Squared Error:0.52
Root Mean Squared Error: 0.72
r^2 Score:0.62
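
The three metrics are closely related: RMSE is the square root of MSE (here sqrt(0.52) ≈ 0.72), and r^2 compares the residual sum of squares with the total sum of squares. A minimal sketch of the formulas, using made-up values for `y_true` and `y_hat` rather than the notebook's test set:

```python
import numpy as np

# Made-up targets and predictions in the dataset's units of $100,000.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 1.5, 3.5, 3.5])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, R^2: {r2:.2f}")
print(f"RMSE in dollars: ${rmse * 100_000:,.0f}")
```

Because the target is in units of $100,000, multiplying the RMSE by 100,000 gives the typical error in dollars.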


In [11]:
#Model Coefficients: 
coef_series = pd.Series(lin_reg.coef_, index = X.columns)
intercept = pd.Series(lin_reg.intercept_)
print("Coefficients:")
print(coef_series)
print("Intercept:")
print(intercept)

Coefficients:
MedInc        0.429276
HouseAge      0.009339
AveRooms     -0.097152
AveBedrms     0.603108
Population   -0.000006
AveOccup     -0.003406
Latitude     -0.428141
Longitude    -0.439009
dtype: float64
Intercept:
0   -37.2136
dtype: float64


#### Interpretation of Results:
- The model's r^2 score is 0.62, meaning that the model explains about 62% of the variance in median house value on the test set. That is to say, the remaining 38% of the variance is not captured by a linear combination of these features. The model's RMSE is 0.72, meaning the typical prediction error is about $72,000, since the target is in units of $100,000. While this number may be high for home buyers, it is a relatively low RMSE given the scale of the data. Given the RMSE and the r^2 value, the model is moderately accurate.
- The feature with the greatest impact on the prediction is longitude. Its coefficient (-0.44) is not the largest in magnitude (that belongs to AveBedrms at 0.60), but longitude values are around -120. Thus, when longitude is multiplied by its coefficient, it produces the largest term in the prediction.
- As mentioned, the predictions are typically off by about $72,000, which is decent for the model, given the units. This error is probably too large for homebuyers to use the model as a reliable predictor, however.
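
The longitude claim can be checked by multiplying each printed coefficient by the printed mean of its feature. This is only a rough sketch: the coefficients come from the training split while the means are over all rows, and the printed values are rounded.

```python
# Coefficients and intercept as printed by the coefficient cell above.
coefs = {
    "MedInc": 0.429276, "HouseAge": 0.009339, "AveRooms": -0.097152,
    "AveBedrms": 0.603108, "Population": -0.000006, "AveOccup": -0.003406,
    "Latitude": -0.428141, "Longitude": -0.439009,
}
intercept = -37.2136

# Feature means as printed in the summary statistics.
means = {
    "MedInc": 3.870671, "HouseAge": 28.639486, "AveRooms": 5.429000,
    "AveBedrms": 1.096675, "Population": 1425.476744, "AveOccup": 3.070655,
    "Latitude": 35.631861, "Longitude": -119.569704,
}

# Contribution of each feature to the prediction at the feature means.
contrib = {name: coefs[name] * means[name] for name in coefs}
for name, value in sorted(contrib.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:<11}{value:>10.3f}")

largest = max(contrib, key=lambda name: abs(contrib[name]))
prediction_at_means = intercept + sum(contrib.values())
print("Largest term:", largest)
print(f"Prediction at the feature means: {prediction_at_means:.2f}")
```

The large longitude and latitude terms are mostly cancelled by the intercept of -37.2, so the size of a raw term reflects the scale of the feature as much as its influence. Coefficients fitted on standardized features are usually a fairer basis for comparing features.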

### Linear Regression on Scaled Data

In [12]:
from sklearn.preprocessing import StandardScaler
# Initialize the scaler and apply it to the features
scaler =  StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,2.344766,0.982143,0.628559,-0.153758,-0.974429,-0.049597,1.052548,-1.327835
1,2.332238,-0.607019,0.327041,-0.263336,0.861439,-0.092512,1.043185,-1.322844
2,1.782699,1.856182,1.15562,-0.049016,-0.820777,-0.025843,1.038503,-1.332827
3,0.932968,1.856182,0.156966,-0.049833,-0.766028,-0.050329,1.038503,-1.337818
4,-0.012881,1.856182,0.344711,-0.032906,-0.759847,-0.085616,1.038503,-1.337818
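
The cell above fits the scaler on all rows before the train/test split, so the test rows influence the scaling statistics. A common alternative is to fit the scaler on the training rows only, which a scikit-learn pipeline does automatically. The sketch below uses synthetic stand-in data; `X_demo`, `y_demo` and the chosen coefficients are assumptions for illustration, not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, -120.0], scale=[2.0, 3.0], size=(200, 2))
y_demo = 0.5 * X_demo[:, 0] - 0.1 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then the regression.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# The scaler's stored means come from X_tr, not from the full data set.
fitted_scaler = pipe.named_steps["standardscaler"]
print(fitted_scaler.mean_, X_tr.mean(axis=0))
print(f"Test R^2: {pipe.score(X_te, y_te):.3f}")
```

For plain linear regression this choice barely changes the test metrics, but it matters more for models that are sensitive to feature scale.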


In [13]:
#Split the scaled data 
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled_df, y, test_size=0.2, random_state=42)
#Fit the scaled data 
lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X_train_scaled, y_train_scaled)
#Make predictions 
y_pred_scaled = lin_reg_scaled.predict(X_test_scaled)
y_pred_scaled

array([0.71912284, 1.76401657, 2.70965883, ..., 4.46877017, 1.18751119,
       2.00940251])

In [14]:
# Evaluate the scaled model performance
mse_lin_scaled = mean_squared_error(y_test_scaled, y_pred_scaled)
rmse_lin_scaled = root_mean_squared_error(y_test_scaled, y_pred_scaled)
r2s_lin_scaled = r2_score(y_test_scaled, y_pred_scaled)

print("Scaled Data Model:")
print(f"Mean Squared Error:{mse_lin_scaled:.2f}")
print(f"Root Mean Squared Error: {rmse_lin_scaled:.2f}")
print(f"R Squared Score:{r2s_lin_scaled:.2f}")


# View our scaled model's coefficients
coef_series_scaled = pd.Series(lin_reg_scaled.coef_, index = X.columns)
intercept_scaled = pd.Series(lin_reg_scaled.intercept_)
print("Scaled Coefficients:")
print(coef_series_scaled)
print("Scaled Intercept:")
print(intercept_scaled)

Scaled Data Model:
Mean Squared Error:0.56
Root Mean Squared Error: 0.75
R Squared Score:0.58
Scaled Coefficients:
MedInc        0.852382
HouseAge      0.122382
AveRooms     -0.305116
AveBedrms     0.371132
Population   -0.002298
AveOccup     -0.036624
Latitude     -0.896635
Longitude    -0.868927
dtype: float64
Scaled Intercept:
0    2.067862
dtype: float64
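
The scaled model (r^2 0.58, split with `random_state=42`) and the unscaled model (r^2 0.62, split with no `random_state`) were evaluated on different test sets, so the gap between them cannot be attributed to scaling. For ordinary least squares with an intercept, standardizing the features leaves the predictions unchanged and only rescales the coefficients. A minimal NumPy sketch on made-up data (`fit_predict` is a hypothetical helper, not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up features on very different scales, plus a noisy linear target.
X_raw = np.column_stack([
    rng.normal(4.0, 2.0, size=300),       # income-like column
    rng.normal(-120.0, 2.0, size=300),    # longitude-like column
])
y_demo = 0.4 * X_raw[:, 0] - 0.4 * X_raw[:, 1] + rng.normal(0.0, 0.5, size=300)

def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns (coefs, predictions)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[1:], design @ beta

mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mu) / sigma

coef_raw, pred_raw = fit_predict(X_raw, y_demo)
coef_std, pred_std = fit_predict(X_std, y_demo)

# Same predictions, hence same MSE and R^2; only the coefficients rescale.
print(np.allclose(pred_raw, pred_std))
print(np.allclose(coef_std, coef_raw * sigma))
```

Each standardized coefficient equals the raw coefficient times the feature's standard deviation, which is why the scaled coefficients are easier to compare across features.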


### Feature Selection and Simplified Model
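- I will select Longitude and Latitude because I believe that they are the features with the largest effect on the predicted outcome (location! location! location!). For my third variable I will select MedInc because it also has a large effect on the prediction. These three features also have the largest scaled coefficients in magnitude (-0.90 for Latitude, -0.87 for Longitude, and 0.85 for MedInc).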

In [20]:
# Selecting the three features for the simplified model
simple_feature_vars = ["MedInc", "Latitude", "Longitude"]
simple_X = pd.DataFrame(X[simple_feature_vars])
simple_y = pd.Series(housing.target, name="MedHouseVal")
simple_X.head()

Unnamed: 0,MedInc,Latitude,Longitude
0,8.3252,37.88,-122.23
1,8.3014,37.86,-122.22
2,7.2574,37.85,-122.24
3,5.6431,37.85,-122.25
4,3.8462,37.85,-122.25


In [21]:
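#Splitting the simplified data into training and test sets (80% training, 20% testing)
#Note: no random_state is set, so this split differs from the earlier ones and changes from run to run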
simple_X_train, simple_X_test, simple_y_train, simple_y_test = train_test_split(simple_X, simple_y,
                                                                                test_size=0.2)

In [22]:
#Training a new linear regression model on the unscaled data
simple_lin_reg = LinearRegression()
simple_lin_reg.fit(simple_X_train, simple_y_train) 

#Making Predictions on the test set with the new simple model  
simple_y_pred = simple_lin_reg.predict(simple_X_test)
simple_y_pred

array([1.10232228, 1.85048795, 2.32522914, ..., 1.81244912, 0.74328245,
       2.94467185])

In [23]:
#Evaluating the simple model's performance
simple_mse_lin = mean_squared_error(simple_y_test, simple_y_pred)
simple_rmse_lin = root_mean_squared_error(simple_y_test, simple_y_pred)
simple_r2s_lin = r2_score(simple_y_test,simple_y_pred)

print("Unscaled Data Model using only three features:")
print(f"Mean Squared Error of the Simple Model:{simple_mse_lin:.2f}")
print(f"Root Mean Squared Error of the Simple Model: {simple_rmse_lin:.2f}")
print(f"r^2 Score of the Simple Model:{simple_r2s_lin:.2f}")

Unscaled Data Model using only three features:
Mean Squared Error of the Simple Model:0.53
Root Mean Squared Error of the Simple Model: 0.73
r^2 Score of the Simple Model:0.59


#### Evaluation and Interpretation of the Simplified Model
- The statistics of the simplified model were very close to those of the full unscaled model: MSE 0.53 vs. 0.52, RMSE 0.73 vs. 0.72, and r^2 0.59 vs. 0.62. Both models explain nearly the same proportion of variance in the data, despite the full model having 5 additional features. Because each model was evaluated on a different random train/test split, differences this small may reflect the split rather than the features. Thus, the other features appear to add little to the model's predictions.
- Since the r^2, MSE, and RMSE values of the two models are so close, I would use the new, simplified model. I believe that it is good to go for simplicity when one can, especially when dealing with increasingly large datasets.
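
A comparison between a full and a reduced model is easier to trust when both are evaluated on the same test rows. The sketch below shows one way to do that with a single seeded split. It uses synthetic stand-in data with three informative and two noise columns; `X_demo`, `y_demo` and `evaluate` are hypothetical names, and the numbers it prints are not results for the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three informative columns and two pure-noise columns.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo[:, :3] @ np.array([0.8, -0.9, -0.9]) + rng.normal(scale=0.5, size=500)

# One shared split, so both models see identical training and test rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def evaluate(columns):
    """Fit on the chosen columns and return (MSE, R^2) on the shared test rows."""
    model = LinearRegression().fit(X_tr[:, columns], y_tr)
    pred = model.predict(X_te[:, columns])
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)

full_mse, full_r2 = evaluate([0, 1, 2, 3, 4])
simple_mse, simple_r2 = evaluate([0, 1, 2])
print(f"Full model:   MSE={full_mse:.3f}, R^2={full_r2:.3f}")
print(f"Simple model: MSE={simple_mse:.3f}, R^2={simple_r2:.3f}")
```

In the notebook, the same effect could be had by passing the same `random_state` to every `train_test_split` call.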