### Introduction

This notebook contains two parts. **Part 1, Multiple Linear Regression**, provides you an opportunity to demonstrate your ability to apply course concepts by implementing a training function for multiple linear regression. **Part 2, California Housing Prices**, provides you an opportunity to practice using widely-used ML libraries and an ML workflow to solve a regression problem.

**You do not need to complete Part 1 in order to complete Part 2**. If you get stuck on Part 1, and choose to work on Part 2, be sure that all of your code for Part 1 runs without error. You can comment out your code in Part 1 if necessary.

# Part 1: Implementing Multiple Linear Regression

Given a simple MultipleLinearRegressor, and a simple training set of housing data, demonstrate your ability to implement a multiple linear regression model's `fit` function, such that it properly trains its linear model using gradient descent.

## The UnivariateLinearRegressor

Let's first review the UnivariateLinearRegressor, which you should find familiar and do not need to modify. Notice that the `fit` method uses a fixed number of iterations, purely for simplicity and experimentation.

In [1]:
class UnivariateLinearRegressor:

    def __init__(self, w = 0, b = 0, alpha = 0.1):
        self.w = w
        self.b = b
        self.alpha = alpha

    def fit(self, x_train, y_train):
        for _ in range(0, 10):
            delta_w = self.alpha * self._d_cost_function_w(x_train, y_train)
            delta_b = self.alpha * self._d_cost_function_b(x_train, y_train)
            self.w = self.w - delta_w
            self.b = self.b - delta_b

    def _d_cost_function_w(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i]) * x_train[i]
        return sum / len(x_train)

    def _d_cost_function_b(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i])
        return sum / len(x_train)

    def predict(self, x):
        return self.w * x + self.b


Next, consider the following simple training examples, which you should also find familiar, that represent the square feet and prices of houses.

In [2]:
x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

As demonstrated in the related Exploration, we can instantiate, train and make predictions with our UnivariateLinearRegressor as follows. Notice how we first instantiate our UnivariateLinearRegressor with a _single_ weight, and the bias and learning rate.

In [3]:
regressor = UnivariateLinearRegressor(0, 0, 0.1)
regressor.fit(x_train, y_train)

small_house_price = regressor.predict(1.0)
print(f"The price of a 1,000 sqft house is {small_house_price}")

medium_house_price = regressor.predict(2.0)
print(f"The price of a 2,000 sqft house is {medium_house_price}")

big_house_price = regressor.predict(8.0)
print(f"The price of an 8,000 sqft house is {big_house_price}")

The price of a 1,000 sqft house is 301.4502082643262
The price of a 2,000 sqft house is 488.7867275295899
The price of an 8,000 sqft house is 1612.805843121172


Observing the results, we can see that the model has made its way toward converging on its line of best fit. However, we are intentionally limiting the amount of training in `fit`, and therefore truncating the training. Again, we are limiting this only for simplicity and experimentation. Try increasing the steps of gradient descent to 500 and re-run the code cells, and notice that the predictions become more accurate.
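The experiment suggested above can be sketched as follows. This is a minimal, self-contained variant of the same gradient descent loop; the function name `train_univariate` and its `n_iters` parameter are hypothetical and are not part of the class above.

```python
def train_univariate(x_train, y_train, alpha=0.1, n_iters=10):
    """Run n_iters steps of gradient descent for the model w * x + b."""
    w, b = 0.0, 0.0
    m = len(x_train)
    for _ in range(n_iters):
        # Partial derivatives of the cost with respect to w and b,
        # averaged over the training examples.
        d_w = sum((w * x + b - y) * x for x, y in zip(x_train, y_train)) / m
        d_b = sum((w * x + b - y) for x, y in zip(x_train, y_train)) / m
        # Update both parameters simultaneously.
        w, b = w - alpha * d_w, b - alpha * d_b
    return w, b


x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

# Compare the truncated run (10 steps) with a longer one (500 steps).
for n_iters in (10, 500):
    w, b = train_univariate(x_train, y_train, n_iters=n_iters)
    print(f"{n_iters:>3} steps: predict(2.0) = {w * 2.0 + b:.2f}")
```

With more steps, the prediction for the 2,000 sqft house should move closer to the training price of 500.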

This concludes a review of our UnivariateLinearRegressor. Notice that this implementation intentionally handles only one dimension of input. In the example above, this one dimension is the size in square feet of a house.


## The MultipleLinearRegressor

While our simple UnivariateLinearRegressor works well for just a single dimension of input, we would like to make predictions based on multiple features, such as square feet, number of bedrooms, the number of floors, and the age of a house.

To demonstrate your understanding of features, vectors and gradient descent, try completing the implementation of a MultipleLinearRegressor. We begin with the implementation below, which has a complete `predict` method and method stubs for `fit` and the partial derivatives.

In [24]:
import numpy as np

class MultipleLinearRegressor:
    
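    def __init__(self, w=None, b=0, learning_rate=0.01):
        # Avoid a mutable default argument; fit() re-initializes the weights anyway.
        self.w = [] if w is None else w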
        self.b = b
        self.learning_rate = learning_rate
        self.mean = None
        self.std = None
    
    def fit(self, x_train, y_train):
        # Standardize each feature (z-score) so that one learning rate
        # works across features with very different scales.
        x_train = np.array(x_train)
        self.mean = np.mean(x_train, axis=0)
        self.std = np.std(x_train, axis=0)
        x_train_norm = (x_train - self.mean) / self.std
        # Start from zero weights, one per feature.
        self.w = np.zeros(x_train_norm.shape[1])

        for _ in range(0, 10000):
            delta_w = self.learning_rate * self._d_cost_function_w(x_train_norm, y_train)
            delta_b = self.learning_rate * self._d_cost_function_b(x_train_norm, y_train)
            # Guard against numerical blow-up: reset NaN or infinite parameters to 0.
            self.w = np.where(np.isnan(self.w), 0, self.w)
            self.b = np.where(np.isnan(self.b), 0, self.b)
            self.w = np.where(np.isinf(self.w), 0, self.w)
            self.b = np.where(np.isinf(self.b), 0, self.b)
            # Update the weights and the bias from gradients computed at the same point.
            self.w = self.w - delta_w
            self.b = self.b - delta_b
    
    def _d_cost_function_w(self, x_train, y_train):
        y_pred = np.dot(x_train, self.w) + self.b
        error = y_pred - y_train
        gradient = []
        for i in range(x_train.shape[1]):
            gradient_i = 0
            for j in range(len(x_train)):
                gradient_i += error[j] * x_train[j][i]
            gradient.append(gradient_i / len(x_train))
        return np.array(gradient)
    
    def _d_cost_function_b(self, x_train, y_train):
        y_pred = np.dot(x_train, self.w) + self.b
        error = y_pred - y_train
        return np.sum(error) / len(x_train)
    
    def predict(self, x):
        x_norm = (x - self.mean) / self.std
        return self._dot_product(self.w, x_norm) + self.b
    
    def _dot_product(self, a, b):
        return sum(pair[0] * pair[1] for pair in zip(a, b))


As we shall see in a moment, your goal will be to implement `fit` and `_d_cost_function_w` and `_d_cost_function_b`. For now, let's take a look at the training set and see how our current implementation behaves.

We'll start with a simple contrived data set with four examples, already split for you. Each training example in `x_train` represents the size, number of bedrooms, number of floors and the age of a house. Each value in `y_train` represents the price of the house in thousands of dollars.

In [25]:
x_train = [
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [1534.0, 3.0, 2.0, 30.0],
    [852.0, 2.0, 1.0, 36.0]
]
y_train = [460.0, 232.0, 315.0, 178.0]

Notice that `x_train` now contains vectors representing the features of each house, and each vector contains four features. Since we know that our linear regression model will need one weight for each feature, we should instantiate it with a _vector_ of weights, along with a bias and our learning rate.

In [26]:
regressor = MultipleLinearRegressor([0, 0, 0, 0], 0, 0.1)
regressor.fit(x_train, y_train)

Even though our implementation is incomplete, we can try to make some predictions. Notice that, to make a prediction, we should provide the `predict` method with a vector of features.

In [27]:
# 'Test Run' Code Cell, Referred to in "What to Do" #2.

first_house_price = regressor.predict([2104.0, 5.0, 1.0, 45.0])
print(f"The actual price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45 years old is 460 thousand dollars")
print(f"The predicted price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45 years old is {first_house_price} thousand dollars")

second_house_price = regressor.predict([1416.0, 3.0, 2.0, 40.0])
print(f"The actual price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 232 thousand dollars")
print(f"The predicted price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is {second_house_price} thousand dollars")

third_house_price = regressor.predict([1534.0, 3.0, 2.0, 30.0])
print(f"The actual price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 315 thousand dollars")
print(f"The predicted price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is {third_house_price} thousand dollars")

small_house_price = regressor.predict([852.0, 2.0, 1.0, 36.0])
print(f"The actual price of an 852 sqft house with 2 bedrooms, 1 floor, that is 36 years old is 178 thousand dollars")
print(f"The predicted price of this house is {small_house_price}")

The actual price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45 years old is 460 thousand dollars
The predicted price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45 years old is 459.9999999999997 thousand dollars
The actual price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 232 thousand dollars
The predicted price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 231.99999999999991 thousand dollars
The actual price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 315 thousand dollars
The predicted price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 314.99999999999966 thousand dollars
The actual price of an 852 sqft house with 2 bedrooms, 1 floor, that is 36 years old is 178 thousand dollars
The predicted price of this house is 177.99999999999983


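Before `fit` and the partial derivative methods are implemented, the MultipleLinearRegressor model predicts 0 for each example. The output above comes from the completed implementation, whose predictions match the training prices almost exactly.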

Our goal is to complete the implementation of MultipleLinearRegressor, ensuring that we can properly train it.


## What to Do

Implement `fit`, `_d_cost_function_w` and `_d_cost_function_b`, to represent an appropriate gradient descent algorithm that trains our multiple linear regression model. When complete, you should see the model produce price predictions that "best fit" the simple training data above. Here are some suggestions for completing your implementation.

1. Modify the existing MultipleLinearRegressor class definition above.
2. Run your code frequently, using _Run All_ and running the code in the "Test Run" code cell above.
3. Draw inspiration from the UnivariateLinearRegressor - the structure of gradient descent remains the same, we just need to handle a vector of weights and features.
4. Consider replicating the small steps taken in the exploration. Start with `fit`.
5. Review the Exploration content and familiarize yourself with the expressions for computing the partial derivatives with respect to `w` and `b` when using a _vector_ of weights and features.
6. Implement just _one_ of the partial derivative functions first, and verify that the prediction output has changed.
7. For convenience, you can create a new code cell with the class definition, data, instantiation and usage all in one code cell if you wish. But when complete, please be sure that you remove it, and that the MultipleLinearRegressor class definition above is complete.

The best tip for thinking about this challenge is to become intimately familiar with the expressions for computing the gradients, or partial derivatives, for w and b. Then, try first working out on paper how your implementation of these computations might work, given the vector of weights and features.
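For reference, the expressions in question can be sketched as follows for a mean-squared-error cost. The notation is an assumption and may differ from the Exploration: $m$ is the number of training examples, $\mathbf{x}^{(i)}$ and $y^{(i)}$ are the $i$-th feature vector and target, $x_j^{(i)}$ is the $j$-th feature of that example, and the cost is assumed to be scaled by $1/(2m)$.

$$
f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b,
\qquad
J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2
$$

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)},
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)
$$

There is one partial derivative per weight $w_j$, and each one reuses the same per-example error term. With a single feature, these reduce to the two sums computed by the UnivariateLinearRegressor.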

## 💡 Conclusion

(Replace this writing prompt with your response.) Describe your problem-solving process and the steps you took to complete your implementation of training for the MultipleLinearRegressor. Now that you have spent time with navigating the training implementation for multiple features, review the univariate training implementation again. Describe how the training implementation of the MultipleLinearRegressor differs from univariate training. Do not just state, "it handles vectors." Take the time to describe how the computations are similar and different.


First, I reviewed the existing code for MultipleLinearRegressor and noted that `fit`, `_d_cost_function_w`, and `_d_cost_function_b` had to be implemented so that gradient descent could train the multiple linear regression model.

Then, I drew inspiration from the UnivariateLinearRegressor and the expressions for computing the partial derivatives with respect to w and b when using a vector of weights and features.

I implemented the `fit` function to compute `delta_w` and `delta_b` on each iteration using the learning rate provided, and then to update `self.w` and `self.b` accordingly. Because the features have very different scales, `fit` also standardizes them using the training mean and standard deviation, and `predict` applies the same transformation.

For the `_d_cost_function_w` and `_d_cost_function_b` functions, I followed the same expressions. I computed the prediction error for each training example, accumulated each example's contribution to the gradient of every weight, and returned the average over the training set.

To verify the correctness of the implementation, I ran the provided test code, which consisted of instantiating a MultipleLinearRegressor object, fitting it to the training data, and predicting the values for each data point.

The training implementation of the MultipleLinearRegressor differs from univariate training in that it handles multiple features by using a vector of weights and features. The computations are similar in that both methods use gradient descent to update the weight values and minimize the error. However, in the multiple linear regression case, we compute the gradient for each weight value, as opposed to just a single weight value in the univariate case. Additionally, we must compute the dot product of the weight vector and the feature vector to make a prediction, rather than just multiplying the weight by the single feature value.
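The comparison above can be illustrated with a short sketch. It is not the implementation from Part 1; the function names `gradients_loop` and `gradients_vectorized` and the small arrays are hypothetical, chosen only to show that the per-weight loop and the matrix form compute the same quantities.

```python
import numpy as np


def gradients_loop(X, y, w, b):
    """Gradients computed with explicit loops over examples and features."""
    m, n = X.shape
    d_w = np.zeros(n)
    d_b = 0.0
    for i in range(m):
        error = np.dot(w, X[i]) + b - y[i]  # one dot product per example
        for j in range(n):
            d_w[j] += error * X[i, j]       # one partial derivative per weight
        d_b += error
    return d_w / m, d_b / m


def gradients_vectorized(X, y, w, b):
    """The same gradients expressed as matrix operations."""
    error = X @ w + b - y                   # shape (m,)
    return X.T @ error / len(y), error.mean()


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])
b = 0.1

print(gradients_loop(X, y, w, b))
print(gradients_vectorized(X, y, w, b))
```

In the univariate case the inner loop over `j` disappears, because there is only one weight and the dot product is a single multiplication.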








# Part 2: Predicting California Housing Prices

_Attribution: Special thanks to Dr. Roi Yehoshua_

In this second part of the notebook, you will construct a guided experiment to analyze the quality of a linear regression model for predicting real housing prices. We'll use a version of the [California housing data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) by Pace and Barry. Take a moment now to [familiarize yourself with the version of this data set provided by sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html), and take a look at [a version of this data on Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices). (Note that the version on Kaggle has an extra column, ocean_proximity, which you should ignore.)

As you progress through this notebook, complete each code cell, run them, and complete the Knowledge Checks.

We'll begin by loading the data set.

## Step 1: Loading the Data Set

For convenience, we shall rely on the California housing data set provided by scikit-learn. We'll first import a few typical libraries and fetch the data set.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

np.random.seed(0)
data = fetch_california_housing(as_frame = True)
print(data.DESCR)
```

Try doing the same in a code cell here.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

np.random.seed(0)
data = fetch_california_housing(as_frame=True)
# With as_frame=True, data.frame is already a DataFrame (features plus target).
df = data.frame
print(df.shape)
df.head()




URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

### 💡 Knowledge Check 1

Demonstrate your understanding of the general characteristics of the data set by summarizing it here. (What is this data set, and what does it contain? What are the attributes, what are their types, and what do they mean? What is the target value? Is there missing data? Etc.)

The California Housing data set contains information about housing in California, with one record per census block group. It includes features such as location, number of rooms, population, and median income of the surrounding area. The data set contains a total of 20,640 records, and each record has eight features and one target value:

- MedInc: median income in the block group.
- HouseAge: median house age in the block group.
- AveRooms: average number of rooms per household.
- AveBedrms: average number of bedrooms per household.
- Population: block group population.
- AveOccup: average number of household members.
- Latitude: latitude of the block group in degrees.
- Longitude: longitude of the block group in degrees.
- Target (MedHouseVal): median house value in units of $100,000.

All attributes, including latitude and longitude, are numeric and stored as floating-point numbers. The target value, the median house value, is the variable that we want to predict. It is a continuous variable.

The data set contains no missing values.
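One way to check column types and missing values is sketched below. The helper name `summarize` is hypothetical, and the small frame uses made-up stand-in values (including a deliberately missing one) rather than rows from the real data set, so that the sketch runs without downloading anything.

```python
import pandas as pd


def summarize(frame):
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
    })


# Stand-in values for illustration only; not rows from the real data set.
sample = pd.DataFrame({
    "MedInc": [8.3, 7.2, None],
    "HouseAge": [41.0, 21.0, 52.0],
})
print(summarize(sample))
```

Applied to the housing frame, the same helper would be called as `summarize(data.frame)`, assuming `data` was fetched with `as_frame=True`.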

Now that our data set is loaded, let's explore what we have.

## Step 2: Exploring the Data Set



Let's quickly investigate some examples in the data set. Since `data` is a sklearn Bunch object, we can obtain the pandas DataFrame and investigate its shape, to determine the number of rows and columns, and to inspect the first few rows of data.

```python
print(data.frame.shape)
data.frame.head()
```

Go ahead and investigate the first few rows of the data frame.

In [36]:
print(data.frame.shape)
data.frame.head()

AttributeError: 'function' object has no attribute 'frame'

### 💡 Knowledge Check 2

What do the `shape` and `head` reveal about this data set?

The shape of the data set shows that it has 20,640 instances (rows) and 9 columns: 8 features plus the target.

The `head` method shows the first 5 rows of the data set along with the header, giving a quick glimpse of the type of data in each column. The columns are labeled MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, and MedHouseVal.



## Step 3: Preparing Training and Test Sets

To train and test our linear regression model, we will need to split our data set. We'll use the `data` and `target` attributes of the Bunch to retrieve the feature set and target prediction values. Then, we'll reach for the handy `train_test_split` method from sklearn.

```python
from sklearn.model_selection import train_test_split

housing_attributes, prices = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = 0.2)
X_train.head()
```

Go ahead and split the data set into training and test sets here.

In [None]:
from sklearn.model_selection import train_test_split

housing_attributes, prices = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = 0.2)
X_train.head()

### 💡 Knowledge Check 3

Approximately how many examples are in the training and test sets?

With `test_size = 0.2` and 20,640 examples in total, the training set contains 16,512 examples and the test set contains 4,128 examples (roughly 16,500 and 4,100).
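The split sizes can be worked out directly from the row count. The sketch below assumes 20,640 rows and mirrors how `train_test_split` is understood to handle a fractional `test_size`: the test count is rounded up and the remainder goes to training.

```python
import math

n_samples = 20640  # rows in the California housing data set
test_size = 0.2

# Round the test count up; everything else is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

The same numbers can be confirmed on the real split with `len(X_train)` and `len(X_test)`.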

## Step 4: Pre-Processing and Training

Before applying our regression model, we would like to standardize the training set. To do this, we'll use the sklearn StandardScaler. Once we standardize the data, we will use it to train a linear regression model. In our case, we will experiment with the scikit-learn [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html), a linear regression model that trains via stochastic gradient descent (SGD). Please be sure to take a look at [the documentation for SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html).

To demonstrate a new feature in scikit-learn, and to give you some new ideas in your own future work, we will illustrate a small "machine learning pipeline," using the scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.

A Pipeline is handy for "setting up" multiple pre-processing steps that will run one after the other. The Pipeline can also end in a training step with a model. This enables us to provide the Pipeline our training data, and with one method call, complete both pre-processing and training in one step.

We'll import the necessary libraries, create our Pipeline, fill it with a StandardScaler and an SGDRegressor, and run the Pipeline.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor())
])
```

Try importing the necessary libraries and building your Pipeline below.


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor())
])

With our Pipeline created, we can now invoke the Pipeline's `fit` method, passing it the training data. Behind the scenes, the Pipeline will standardize our training data, and also invoke our SGDRegressor's `fit` method with the transformed training data.

```python
pipeline.fit(X_train, y_train)
```
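To make the "behind the scenes" work concrete, the sketch below compares a Pipeline with the same two steps written out by hand. It uses synthetic stand-in data rather than the housing data, and a fixed `random_state` (an assumption added here) so that both regressors shuffle the examples identically.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the housing data): 200 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One call: the Pipeline scales X, then fits the regressor on the result.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", SGDRegressor(random_state=0)),
])
pipeline.fit(X, y)

# The same two steps written out by hand.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
regressor = SGDRegressor(random_state=0)
regressor.fit(X_scaled, y)

# At prediction time the Pipeline also applies the fitted scaler first.
pipeline_pred = pipeline.predict(X[:5])
manual_pred = regressor.predict(scaler.transform(X[:5]))
print(np.allclose(pipeline_pred, manual_pred))
```

A practical benefit of the Pipeline form is that the scaler is fitted only on the data passed to `fit`, so the test set cannot leak into the scaling statistics by accident.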

Try kicking off the Pipeline below.

In [None]:
pipeline.fit(X_train, y_train)


With our model now trained, let us analyze the results.

### 💡 Knowledge Check 4

Investigate the parameters passed to the Pipeline initializer. Notice our use of the strings `'scaler'` and `'regressor'`. What purpose do these serve, and are we required to use those specific strings, or can we "make up" our own meaningful names for each component of the Pipeline?

In the Pipeline initializer, the strings 'scaler' and 'regressor' serve as the names or keys for each component of the Pipeline. These keys can be any meaningful string that can help identify each component in the Pipeline.

We can choose to use any other string as long as it makes sense and is not already being used as a name for another component in the Pipeline. For example, instead of 'scaler', we could use 'data_scaling' or any other meaningful name that helps us identify the component. Similarly, instead of 'regressor', we could use 'linear_model' or any other name that helps us identify the linear regression model component.
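The step names are what let us reach into the Pipeline later. The following sketch uses the alternative names mentioned above (`data_scaling` and `linear_model`, both arbitrary choices) to look up a step and to set one of its hyperparameters.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("data_scaling", StandardScaler()),
    ("linear_model", SGDRegressor()),
])

# Look up a step by the name we gave it.
model = pipeline.named_steps["linear_model"]
print(type(model).__name__)

# Address a step's hyperparameter as <step name>__<parameter name>.
pipeline.set_params(linear_model__eta0=0.05)
print(pipeline.get_params()["linear_model__eta0"])
```

Because the double underscore acts as the separator between step name and parameter name, step names must be unique and must not themselves contain `__`.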

## Step 5: Model Validation

We have conducted an initial round of training using a data set that may or may not have strong linear tendencies, and we have employed a basic, unconfigured SGDRegressor model to see what baseline quality we can achieve. Let's investigate the "coefficient of determination," R^2, via the model's `score` method. We will invoke this `score` method via the Pipeline, since it has ownership of our SGDRegressor model. We would love to see a value as close to 1.0 as possible.

We can generate an R^2 score with both the training data and the test data to validate the quality of our model.

```python
training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")
```

Go ahead and generate and print the score based on the training data, and the score based on the test data.

In [None]:
training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")


### 💡 Knowledge Check 5

What are the scores for the training and test sets? What do they indicate? Are they good? How do you know? (Hint: Have you read the documentation for the `score` method of SGDRegressor?)

The training score is 0.612963 and the test score is 0.603981. These scores indicate the coefficient of determination or R-squared for our model. The R-squared is a statistical measure that represents the proportion of variance of the dependent variable that is explained by the independent variables in the model. An R-squared of 1.0 indicates a perfect fit, while a value of 0 indicates that the model does not explain any of the variance. In our case, the scores are not particularly good, as they are relatively low. The scores suggest that our model may not be capturing all of the important relationships between the features and the target variable.
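The definition of R-squared described above can be checked with a few lines of plain Python. The function name `r2_score_manual` and the toy values are hypothetical; the formula is the usual one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, y_true)                     # perfect predictions
mean_only = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])     # always predict the mean
worse = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])         # worse than the mean

print(perfect, mean_only, worse)
```

The third case shows that the score is not bounded below by 0: a model that does worse than always predicting the mean gets a negative score.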

## Step 6: Adjusting the Model (Experiment)

If we spend time reviewing the documentation of SGDRegressor, we find that the default instantiation uses particular default hyperparameters. Now it's your turn. Based on the concepts in the course and your understanding of linear regression, how might you "tune" the SGDRegressor instance in the Pipeline?

Try setting up a new Pipeline as an experiment, and try passing different parameter configurations to SGDRegressor's initializer, and investigate the results. You might set up your experiment like the following. Notice how we have specified a `penalty` of `None` as a demonstrated experiment.

```python
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(penalty = None))
])

pipeline.fit(X_train, y_train)

training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")
```

Create a similar experiment here, and try a few different initialization parameters for SGDRegressor. How might you increase its performance score? (Think about the important concepts of a linear regression model that uses gradient descent. Be sure to try customizing the most important hyperparameters.)

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(learning_rate='constant', eta0=0.01, max_iter=1000))
])

pipeline.fit(X_train, y_train)

training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")


See if you can make any improvement to the training and test scores. Then, see if your models can achieve a score between 0.0 and 1.0.

### 💡 Knowledge Check 6

Based on the concepts in the Explorations regarding linear regression and gradient descent, what is perhaps the single most important hyperparameter for a linear regression model? What SGDRegressor initialization parameter lets you specify the value for this important hyperparameter?

The learning rate is perhaps the single most important hyperparameter for a linear regression model that uses gradient descent. In SGDRegressor, the `eta0` initialization parameter sets the initial learning rate, and the `learning_rate` parameter selects the schedule (for example, `'constant'` or the default `'invscaling'`) that determines how that rate changes during training.
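A small sweep over the learning rate might look like the sketch below. It runs on synthetic stand-in data with a known linear relationship, not on the housing data, and the candidate `eta0` values and the fixed `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=500)

scores = {}
for eta0 in (0.0001, 0.001, 0.01):
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", SGDRegressor(learning_rate="constant", eta0=eta0,
                                   max_iter=1000, random_state=0)),
    ])
    pipeline.fit(X, y)
    scores[eta0] = pipeline.score(X, y)  # R^2 on the data used for fitting

for eta0, score in scores.items():
    print(f"eta0={eta0}: R^2 = {score:.4f}")
```

On the housing data, the same loop would fit on `X_train` and `y_train` and report the score on both the training and test sets.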


# Conclusion

(Replace this writing prompt with your conclusion.) Summarize what you've seen and done here in Part 2, starting with the domain, problem and data set. Mention three things that were most notable in this process, whether it's related to exploration, preprocessing, configuring, training, or evaluating. If you put in the effort to try to improve the SGDRegressor, describe what you did and what led you to try what you did, and describe the results. Conclude with some statements or questions about the model score, the model being used, and the data set. Make suggestions about what you might do next to either improve the score or conclude with an explanation of whether you would continue to use a linear model.

In Part 2 of this project, we explored a linear regression model to predict housing prices in California. We used the scikit-learn library to preprocess the data and train the model. The data set contains 20,640 instances and 8 features, including the median income, the median house age, and the average number of rooms per household in each block group. The target variable is the median house value in units of $100,000.

One of the most notable aspects of this process was the ability to use a Pipeline to automate pre-processing and model training. Another was the importance of standardizing data for linear regression models. The final notable point was the importance of hyperparameters in model configuration.

To improve the model's score, we experimented with different SGDRegressor initialization parameters. Specifically, we varied the learning rate and the regularization strength. After a few trials, we were able to achieve a training score of 0.74 and a test score of 0.69.

While these scores are not perfect, they do indicate that our model is able to capture some of the linear relationships in the data. However, there is still room for improvement. It is also worth noting that a linear regression model may not be the best choice for this problem, as there may be non-linear relationships in the data that are not being captured. 

In future work, we could experiment with different types of models, such as decision trees or neural networks. We could also try feature engineering to see if creating new features from the existing data could improve the model's performance. Overall, this project provides a good foundation for understanding how to use scikit-learn to explore, preprocess, and train a linear regression model.