### A. Load the Data

**Instructions:**
1. Import `pandas` and give it the shortened name `pd` using the `as` keyword
2. Load the file `AmesHousing.csv` as a dataset called `housing` using the `read_csv` function
3. Print the first few rows of the dataset using the `head` method



In [37]:
# Load Package
import pandas as pd

# Load Dataset 
housing = pd.read_csv('AmesHousing.csv')

### B. Subset Columns 
**Instructions:**
1. Create a list called `selected_columns` that contains the variables `SalePrice`, `Gr Liv Area`, `Year Built`, `Neighborhood`
2. Create a subset of the `housing` data that only includes the selected columns, and call this subset `housing` 

In [48]:
# Subset Columns  
selected_columns = ['SalePrice', 'Gr Liv Area', 'Year Built', 'Neighborhood']
housing = housing[selected_columns].copy()
# Preview the first rows of the subset
print(housing.head())

   SalePrice  Gr Liv Area  Year Built Neighborhood
0     215000         1656        1960        NAmes
1     105000          896        1961        NAmes
2     172000         1329        1958        NAmes
3     244000         2110        1968        NAmes
4     189900         1629        1997      Gilbert


### C. Define Label and Features 
1. Define the label `Y` by selecting the `SalePrice` column from the `housing` DataFrame.

2. Create a variable `X` as a placeholder feature matrix:
    - Import the `numpy` package and call it `np`
    - Use `numpy.ones` to create a matrix with one column and as many rows as there are in `Y`.
    - Each value in this matrix should be 1.

In [49]:
# Load Package
import numpy as np

# Define label variable 
Y = housing['SalePrice']
# Define placeholder feature matrix (a single column of ones)
X = np.ones((len(Y), 1))


### D. Estimate Constant Model
1. Import `DummyRegressor` from `sklearn.dummy`.
2. Create an instance of `DummyRegressor` called `mean_model` with the parameter `strategy='mean'`.
3. Fit the dummy regressor to the data using the X and Y variables.

In [50]:
# Estimate Constant Model
from sklearn.dummy import DummyRegressor

# Estimate Mean Model
mean_model = DummyRegressor(strategy='mean')
mean_model.fit(X, Y)
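
As a quick check of what `strategy='mean'` does, the sketch below fits a `DummyRegressor` on a small made-up label vector and confirms that every prediction equals the mean of the labels, whatever the feature matrix contains. The names `y_toy`, `X_toy`, `toy_model`, and `toy_pred` and the toy values are assumptions for illustration, not part of the Ames data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical toy labels (not from AmesHousing.csv)
y_toy = np.array([100_000.0, 150_000.0, 200_000.0, 350_000.0])
# Placeholder feature matrix: one column of ones, one row per label
X_toy = np.ones((len(y_toy), 1))

toy_model = DummyRegressor(strategy='mean')
toy_model.fit(X_toy, y_toy)
toy_pred = toy_model.predict(X_toy)

# Every prediction is the training mean of the labels
print(toy_pred)
print(y_toy.mean())
```

Because the features are ignored, the column of ones only serves to give `fit` and `predict` an input with the right number of rows.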


### E. Use Trained Model to Create Predictions 
**Instructions:**
1. Use the `predict` method of the `mean_model` to generate predictions.
2. Save the predictions in a variable called `Y_hat_dummy`.

In [51]:
# Generate predictions using mean model 
Y_hat_dummy = mean_model.predict(X)


### F. Compute Estimate of Fit 
**Instructions:**
1. Import the `mean_squared_error` function from `sklearn.metrics`.
2. Compute the mean squared error (MSE) between the actual values (`Y`) and the predicted values (`Y_hat_dummy`) using the `mean_squared_error` function. Save the result in a variable called `mse_dummy`.
3. Compute the root mean squared error (RMSE) by taking the square root of the MSE using `numpy.sqrt`. Save the result in a variable called `rmse_dummy`.


In [None]:
# Load Package
from sklearn.metrics import mean_squared_error

# Compute MSE and RMSE
mse_dummy = mean_squared_error(Y, Y_hat_dummy)
rmse_dummy = np.sqrt(mse_dummy)

print(f"The mean squared error is {mse_dummy:,.0f}.")
print(f"On average, our prediction is off by ${rmse_dummy:,.0f}.")


The mean squared error is 6,379,705,498.
On average, our prediction is off by $79,873.
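
To make the two quantities concrete, here is a minimal sketch that computes MSE and RMSE by hand with `numpy` on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical). It also illustrates that, for a constant mean prediction, the RMSE coincides with the population standard deviation of the labels.

```python
import numpy as np

# Hypothetical toy values (not from the Ames data)
y_true = np.array([100_000.0, 150_000.0, 200_000.0, 350_000.0])
y_pred = np.full_like(y_true, y_true.mean())  # constant mean prediction

errors = y_true - y_pred
mse_manual = np.mean(errors ** 2)   # units: dollars squared
rmse_manual = np.sqrt(mse_manual)   # units: dollars

print(f"MSE:  {mse_manual:,.0f}")
print(f"RMSE: {rmse_manual:,.0f}")
# For a constant mean prediction, RMSE equals the population std of the labels
print(np.isclose(rmse_manual, np.std(y_true)))
```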


### G. Convert Categorical Variable to Dummies 
**Instructions:**
1. Use the `pandas.get_dummies` function to create dummy variables for the `Neighborhood` column in the `housing` DataFrame.
2. Specify the parameter `columns=['Neighborhood']` to indicate which column should be encoded.
3. Use the parameter `drop_first=True` to avoid multicollinearity by excluding the first category.
4. Save the resulting DataFrame as a new variable named `housing_dummies`.

In [53]:
# Create Neighborhood Dummy Variables 
housing_dummies = pd.get_dummies(housing, columns=['Neighborhood'], drop_first=True)
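
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up four-row DataFrame (`toy` is hypothetical, with neighborhood names borrowed for illustration) and shows which columns `get_dummies` produces.

```python
import pandas as pd

# Hypothetical toy data (neighborhood names borrowed for illustration)
toy = pd.DataFrame({
    'SalePrice': [215000, 105000, 189900, 191500],
    'Neighborhood': ['NAmes', 'NAmes', 'Gilbert', 'StoneBr'],
})

toy_dummies = pd.get_dummies(toy, columns=['Neighborhood'], drop_first=True)

# The first category in sorted order ('Gilbert') is dropped and becomes the baseline
print(toy_dummies.columns.tolist())
print(toy_dummies)
```

A row from the dropped category has zeros (or `False`) in every remaining dummy column, which is how the baseline is encoded.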


### H. Define label and features for linear regression 
**Instructions:**
1. Define the label variable `Y_reg` as the `SalePrice` column of the `housing_dummies` DataFrame.
2. Define the feature matrix `X_reg` by dropping the `SalePrice` column from the `housing_dummies` DataFrame using the drop method with `axis=1`.

Note that `Y_reg` is a pandas Series and `X_reg` is a pandas DataFrame.

In [58]:
# Define label and Features 
Y_reg = housing_dummies['SalePrice']
# Define feature matrix (all columns except the label)
X_reg = housing_dummies.drop('SalePrice', axis=1)


### I. Train Linear Regression Model
**Instructions:**
1. Import the `LinearRegression` class from `sklearn.linear_model`.
2. Create an instance of `LinearRegression` called `reg_model`.
3. Train the regression model using the fit method with `X_reg` as the features and `Y_reg` as the label variable.

In [59]:
# Import Package
from sklearn.linear_model import LinearRegression

# Train the regression model
reg_model = LinearRegression()
reg_model.fit(X_reg, Y_reg)
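
Once a model is fitted, its `intercept_` and `coef_` attributes can be inspected. As a sketch of how a dummy coefficient reads, the example below fits a regression on made-up data with a single dummy feature (`toy`, `toy_reg`, and the neighborhoods `'A'` and `'B'` are hypothetical). In this simple one-dummy case, the intercept is the mean price of the baseline group and the coefficient is the difference in group means.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with two neighborhoods, 'A' (baseline) and 'B'
toy = pd.DataFrame({
    'SalePrice': [100_000, 120_000, 200_000, 240_000],
    'Neighborhood': ['A', 'A', 'B', 'B'],
})
toy_dummies = pd.get_dummies(toy, columns=['Neighborhood'], drop_first=True)

X_toy = toy_dummies.drop('SalePrice', axis=1).astype(float)
Y_toy = toy_dummies['SalePrice']

toy_reg = LinearRegression().fit(X_toy, Y_toy)

group_means = toy.groupby('Neighborhood')['SalePrice'].mean()
print(toy_reg.intercept_)                    # mean price in baseline 'A'
print(toy_reg.coef_[0])                      # mean('B') minus mean('A')
print(group_means['B'] - group_means['A'])   # same difference, computed directly
```

With additional features in the model, the dummy coefficient is instead the average difference relative to the baseline holding those other features fixed.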


### J. Use Regression Model to Make Predictions 
**Instructions:**
1. Use the `predict` method of the trained regression model (`reg_model`) to generate predictions for the feature matrix `X_reg`.
2. Save the predicted values in a variable called `Y_hat_reg`.

In [62]:
# Generate predictions using the regression model
Y_hat_reg = reg_model.predict(X_reg)


### K. Compute Measures of Fit for Regression Model 
**Instructions:**
1. Compute the Mean Squared Error (MSE) for the regression model:
    - Use the `mean_squared_error` function from `sklearn.metrics` with `Y_reg` (actual values) and `Y_hat_reg` (predicted values).
    - Save the result in a variable called `mse_reg`.
2. Compute the Root Mean Squared Error (RMSE) by taking the square root of `mse_reg` using `numpy.sqrt`.
    - Save the result in a variable called `rmse_reg`.

In [65]:
# Compute mean squared error and RMSE for the regression model
mse_reg = mean_squared_error(Y_reg, Y_hat_reg)
rmse_reg = np.sqrt(mse_reg)

# Print results
print(f'The mean squared error of the regression model is {mse_reg:,.0f}.')
print(f'On average, the regression model is off by ${rmse_reg:,.0f}.')

The mean squared error of the regression model is 1,512,683,828.
On average, the regression model is off by $38,893.
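
One way to compare the two models is to ask what share of the mean model's squared error the regression removes. Since both MSEs were computed on the same data, this ratio is the in-sample R-squared. The sketch below redoes that arithmetic from the printed values; the names `r_squared` and `rmse_reduction` are introduced here for illustration.

```python
import math

# Values copied from the printed output of the two models
mse_dummy = 6_379_705_498
mse_reg = 1_512_683_828

# Share of the mean model's squared error removed by the regression
# (this is the in-sample R-squared)
r_squared = 1 - mse_reg / mse_dummy

# Relative reduction in RMSE
rmse_reduction = 1 - math.sqrt(mse_reg) / math.sqrt(mse_dummy)

print(f"In-sample R-squared: {r_squared:.3f}")
print(f"RMSE reduction: {rmse_reduction:.1%}")
```

With the numbers reported here, the R-squared is about 0.76 and the RMSE falls by roughly half. Both figures are measured on the training data, so they say nothing yet about performance on new houses.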


# Qualitative Questions 
1. Why does the regression achieve lower mean-squared error? How does this relate to model complexity? Do more complex models always perform better?
2. Why must `Neighborhood` be converted into a dummy variable? What is the standard statistical interpretation of the coefficient on a dummy variable? Why do we need to set `drop_first=True`?
3. We only use a small set of the available features in this exercise. Would adding more features necessarily improve the linear regression model's performance vis-a-vis mean squared error? If yes, is this a good thing? 
4. How can we interpret the mean squared error? The root mean squared error? What are the units of these quantities, and which is more interpretable?

# https://gemini.google.com/share/f71a358daff3
1. The regression achieves a lower MSE because it uses additional information (year built, neighborhood, and above-ground living area) to predict each sale price, while the mean model predicts the same value for every house. The more features a model has, the more complex it is; the same goes for tree models with more depth. Complex models do not always perform better. We were learning about LASSO, and it was shown that an overly complex model can overfit, so that its predictions on new data are far off and a fitted slope can differ sharply from the true one.
2. I asked Gemini the same thing earlier: a regression equation cannot process text, so the dummy variables convert each neighborhood into binary (0/1) columns. In this scenario, the coefficient on a dummy variable is the average difference in `SalePrice` relative to the baseline (dropped) neighborhood, holding the other features fixed. We set `drop_first=True` to remove the redundant column, which would otherwise cause perfect multicollinearity with the intercept.
3. There has to be a balance in terms of complexity. Adding features can lower the MSE on the training data, but that is not necessarily a good thing: when trying to capture real-life patterns, a more complex model can falter on new data. Sometimes models overfit and end up memorizing the training data.
4. The MSE is the average squared distance between the predicted and actual values; squaring penalizes large errors more heavily (quadratically) and keeps every term non-negative. Its units are dollars squared. The RMSE is the square root of the MSE; it is in dollars here, so it is the more interpretable value.
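
The point about training MSE versus real-life performance can be sketched with synthetic data. In the example below (all data and names such as `x_noise` and `train_and_test_mse` are hypothetical), one feature truly drives the label and twenty others are pure noise. Adding the noise features cannot raise the training MSE of a least-squares fit, but the MSE on held-out rows will typically get worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): one informative feature plus pure-noise features
n = 60
x_real = rng.normal(size=(n, 1))
x_noise = rng.normal(size=(n, 20))
y = 3.0 * x_real[:, 0] + rng.normal(size=n)

X_small = x_real
X_large = np.hstack([x_real, x_noise])

# Hold out half of the rows to measure performance on unseen data
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.5, random_state=0)

def train_and_test_mse(X):
    """Fit OLS on the training rows; return (training MSE, test MSE)."""
    model = LinearRegression().fit(X[idx_train], y[idx_train])
    train_mse = mean_squared_error(y[idx_train], model.predict(X[idx_train]))
    test_mse = mean_squared_error(y[idx_test], model.predict(X[idx_test]))
    return train_mse, test_mse

small_train, small_test = train_and_test_mse(X_small)
large_train, large_test = train_and_test_mse(X_large)

print(f"1 feature:   train MSE {small_train:.3f}, test MSE {small_test:.3f}")
print(f"21 features: train MSE {large_train:.3f}, test MSE {large_test:.3f}")
```

The exercise above evaluates both models on the data they were trained on, so a held-out split like this one would be needed to see whether the regression's advantage carries over to new houses.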