# Introduction

This notebook contains two parts. **Part 1, Multiple Linear Regression**, provides you an opportunity to demonstrate your ability to apply course concepts by implementing a training function for multiple linear regression. **Part 2, California Housing Prices**, provides you an opportunity to practice using widely-used ML libraries and an ML workflow to solve a regression problem.

**You do not need to complete Part 1 in order to complete Part 2**. If you get stuck on Part 1, and choose to work on Part 2, be sure that all of your code for Part 1 runs without error. You can comment out your code in Part 1 if necessary.

# Part 1: Implementing Multiple Linear Regression

Given a simple MultipleLinearRegressor, and a simple training set of housing data, demonstrate your ability to implement a multiple linear regression model's `fit` function, such that it properly trains its linear model using gradient descent.

## The UnivariateLinearRegressor

Let's first review the UnivariateLinearRegressor, which you should find familiar and do not need to modify. Notice that the `fit` method uses a fixed number of iterations, only for simplicity and experimentation.

In [1]:
class UnivariateLinearRegressor:

    def __init__(self, w = 0, b = 0, alpha = 0.1):
        self.w = w
        self.b = b
        self.alpha = alpha

    def fit(self, x_train, y_train):
        for _ in range(0, 10):
            delta_w = self.alpha * self._d_cost_function_w(x_train, y_train)
            delta_b = self.alpha * self._d_cost_function_b(x_train, y_train)
            self.w = self.w - delta_w
            self.b = self.b - delta_b

    def _d_cost_function_w(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i]) * x_train[i]
        return sum / len(x_train)

    def _d_cost_function_b(self, x_train, y_train):
        sum = 0
        for i in range(len(x_train)):
            sum += (self.predict(x_train[i]) - y_train[i])
        return sum / len(x_train)

    def predict(self, x):
        return self.w * x + self.b


Next, consider the following simple training examples, which you should also find familiar, that represent the square feet and prices of houses.

In [2]:
x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

As demonstrated in the related Exploration, we can instantiate, train and make predictions with our UnivariateLinearRegressor as follows. Notice how we first instantiate our UnivariateLinearRegressor with a _single_ weight, and the bias and learning rate.

In [3]:
regressor = UnivariateLinearRegressor(0, 0, 0.1)
regressor.fit(x_train, y_train)

small_house_price = regressor.predict(1.0)
print(f"The price of a 1,000 sqft house is {small_house_price}")

medium_house_price = regressor.predict(2.0)
print(f"The price of a 2,000 sqft house is {medium_house_price}")

big_house_price = regressor.predict(8.0)
print(f"The price of an 8,000 sqft house is {big_house_price}")

The price of a 1,000 sqft house is 301.4502082643262
The price of a 2,000 sqft house is 488.7867275295899
The price of an 8,000 sqft house is 1612.805843121172


Observing the results, we can see that the model has made its way toward converging on its line of best fit. However, we are intentionally limiting the amount of training in `fit`, and therefore truncating the training. Again, we are limiting this only for simplicity and experimentation. Try increasing the steps of gradient descent to 500 and re-run the code cells, and notice that the predictions become more accurate.

This concludes a review of our UnivariateLinearRegressor. Notice that this implementation intentionally handles only one dimension of input. In the example above, this one dimension is the size in square feet of a house.
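
To make the effect of the iteration count concrete, here is a minimal sketch of the same batch gradient descent written as a standalone function. The function name `train_univariate` and the `steps` parameter are hypothetical (they are not part of the class above); the update rule is intended to mirror `fit`.

```python
def train_univariate(x_train, y_train, alpha=0.1, steps=10):
    """Run `steps` iterations of batch gradient descent for y = w * x + b."""
    w, b = 0.0, 0.0
    m = len(x_train)
    for _ in range(steps):
        # Prediction errors under the current parameters
        errors = [(w * x + b) - y for x, y in zip(x_train, y_train)]
        grad_w = sum(e * x for e, x in zip(errors, x_train)) / m
        grad_b = sum(errors) / m
        # Update both parameters from gradients computed at the same point
        w, b = w - alpha * grad_w, b - alpha * grad_b
    return w, b


x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

for steps in (10, 500, 10000):
    w, b = train_univariate(x_train, y_train, steps=steps)
    print(f"steps={steps:>5}: w={w:.3f}, b={b:.3f}, predict(8.0)={w * 8.0 + b:.3f}")
```

Because the two training points lie exactly on the line y = 200x + 100, the parameters should approach w = 200 and b = 100 as the number of steps grows.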


## The MultipleLinearRegressor

While our simple UnivariateLinearRegressor works well for just a single dimension of input, we would like to make predictions based on multiple features, such as square feet, number of bedrooms, the number of floors, and the age of a house.

To demonstrate your understanding of features, vectors and gradient descent, try completing the implementation of a MultipleLinearRegressor. We begin with the implementation below, which has a complete `predict` method and method stubs for `fit` and the partial derivatives.

In [4]:
import numpy as np

class MultipleLinearRegressor:
    
    def __init__(self, w=None, b=0, learning_rate=0.01):
        # Avoid a mutable default argument; fit() re-initializes w anyway
        self.w = [] if w is None else w
        self.b = b
        self.learning_rate = learning_rate
        self.mean = None
        self.std = None
    
    def fit(self, x_train, y_train):
        x_train = np.array(x_train)
        self.mean = np.mean(x_train, axis=0)
        self.std = np.std(x_train, axis=0)
        x_train_norm = (x_train - self.mean) / self.std
        self.w = np.zeros(x_train_norm.shape[1])
        
        for _ in range(0, 10000):
            delta_w = self.learning_rate * self._d_cost_function_w(x_train_norm, y_train)
            delta_b = self.learning_rate * self._d_cost_function_b(x_train_norm, y_train)
            self.w = np.where(np.isnan(self.w), 0, self.w)
            self.b = np.where(np.isnan(self.b), 0, self.b)
            self.w = np.where(np.isinf(self.w), 0, self.w)
            self.b = np.where(np.isinf(self.b), 0, self.b)
            self.w = self.w - delta_w
            self.b = self.b - delta_b
    
    def _d_cost_function_w(self, x_train, y_train):
        y_pred = np.dot(x_train, self.w) + self.b
        error = y_pred - y_train
        gradient = []
        for i in range(x_train.shape[1]):
            gradient_i = 0
            for j in range(len(x_train)):
                gradient_i += error[j] * x_train[j][i]
            gradient.append(gradient_i / len(x_train))
        return np.array(gradient)
    
    def _d_cost_function_b(self, x_train, y_train):
        y_pred = np.dot(x_train, self.w) + self.b
        error = y_pred - y_train
        return np.sum(error) / len(x_train)
    
    def predict(self, x):
        x_norm = (x - self.mean) / self.std
        return self._dot_product(self.w, x_norm) + self.b
    
    def _dot_product(self, a, b):
        return sum(pair[0] * pair[1] for pair in zip(a, b))


As we shall see in a moment, your goal will be to implement `fit` and `_d_cost_function_w` and `_d_cost_function_b`. For now, let's take a look at the training set and see how our current implementation behaves.

We'll start with a simple contrived data set with four examples, already split for you. Each training example in `x_train` represents the size, number of bedrooms, number of floors and the age of a house. Each value in `y_train` represents the price of the house in thousands of dollars.

In [5]:
x_train = [
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [1534.0, 3.0, 2.0, 30.0],
    [852.0, 2.0, 1.0, 36.0]
]
y_train = [460.0, 232.0, 315.0, 178.0]

Notice that `x_train` now contains vectors representing the features of each house, and each vector contains four features. Since we know that our linear regression model will need one weight for each feature, we should instantiate it with a _vector_ of weights, along with a bias and our learning rate.

In [6]:
regressor = MultipleLinearRegressor([0, 0, 0, 0], 0, 0.1)
regressor.fit(x_train, y_train)

Even though our implementation is incomplete, we can try to make some predictions. Notice that, to make a prediction, we should provide the `predict` method with a vector of features.

In [7]:
# 'Test Run' Code Cell, Referred to in "What to Do" #2.

first_house_price = regressor.predict([2104.0, 5.0, 1.0, 45.0])
print(f"The actual price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is 460 thousand dollars")
print(f"The predicted price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is {first_house_price} thousand dollars")

second_house_price = regressor.predict([1416.0, 3.0, 2.0, 40.0])
print(f"The actual price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 232 thousand dollars")
print(f"The predicted price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is {second_house_price} thousand dollars")

third_house_price = regressor.predict([1534.0, 3.0, 2.0, 30.0])
print(f"The actual price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 315 thousand dollars")
print(f"The predicted price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is {third_house_price} thousand dollars")

small_house_price = regressor.predict([852.0, 2.0, 1.0, 36.0])
print(f"The actual price of an 852 sqft house with 2 bedrooms, 1 floor, that is 36 years old is 178 thousand dollars")
print(f"The predicted price of this house is {small_house_price}")

The actual price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is 460 thousand dollars
The predicted price of a 2,104 sqft house with 5 bedrooms, 1 floor, that is 45-years old is 459.9999999999997 thousand dollars
The actual price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 232 thousand dollars
The predicted price of a 1,416 sqft house with 3 bedrooms, 2 floors, that is 40 years old is 231.99999999999991 thousand dollars
The actual price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 315 thousand dollars
The predicted price of a 1,534 sqft house with 3 bedrooms, 2 floors, that is 30 years old is 314.99999999999966 thousand dollars
The actual price of an 852 sqft house with 2 bedrooms, 1 floor, that is 36 years old is 178 thousand dollars
The predicted price of this house is 177.99999999999983


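With the original method stubs, the MultipleLinearRegressor model predicted 0 for each example. With the completed implementation above, each prediction closely matches the actual price.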

Our goal is to complete the implementation of MultipleLinearRegressor, ensuring that we can properly train it.


## What to Do

Implement `fit`, `_d_cost_function_w` and `_d_cost_function_b`, to represent an appropriate gradient descent algorithm that trains our multiple linear regression model. When complete, you should see the model produce price predictions that begin to approach a "best fit" for the simple training data above (note: there are particular reasons why the fit will not be as 'perfect' as our univariate example). Here are some suggestions for completing your implementation.

1. Modify the existing MultipleLinearRegressor class definition above.
2. Run your code frequently, using _Run All_ and running the code in the "Test Run" code cell above.
3. Draw inspiration from the UnivariateLinearRegressor: the structure of gradient descent remains the same, and we only need to handle a vector of weights and features.
4. Consider replicating the small steps taken in the Exploration. Start with `fit`.
5. Review the Exploration content and familiarize yourself with the expressions for computing the partial derivatives with respect to `w` and `b` when using a _vector_ of weights and features.
6. Implement just _one_ of the partial derivative functions first, and verify that the prediction output has changed.
7. For convenience, you can create a new code cell with the class definition, data, instantiation and usage all in one code cell if you wish. But when complete, please be sure that you remove it, and that the MultipleLinearRegressor class definition above is complete.

The best tip for thinking about this challenge is to become intimately familiar with the expressions for computing the gradients, or partial derivatives, for w and b. Then, try first working out on paper how your implementation of these computations might work, given the vector of weights and features.
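
For reference, the gradient expressions can be written out as follows. This is a sketch that assumes the cost is the mean squared error with the conventional factor of 1/2 (the notebook does not state the cost explicitly), with $m$ training examples, $n$ features, and $x_j^{(i)}$ denoting feature $j$ of example $i$.

$$
f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b,
\qquad
J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2
$$

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}
\quad (j = 1, \dots, n),
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)
$$

$$
w_j \leftarrow w_j - \alpha \frac{\partial J}{\partial w_j},
\qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$

Setting $n = 1$ recovers the two derivative functions of the UnivariateLinearRegressor.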

## 💡 Conclusion

First, I reviewed the existing code for MultipleLinearRegressor and noted that it required implementations of `fit`, `_d_cost_function_w`, and `_d_cost_function_b` to perform gradient descent and train the multiple linear regression model. I followed the structure of the UnivariateLinearRegressor and used the expressions for the partial derivatives with respect to w and b when working with a vector of weights and features. In `fit`, I converted `x_train` to a NumPy array, computed the mean and standard deviation of each feature (along axis 0), and normalized the features by subtracting the mean and dividing by the standard deviation. I then initialized the weight vector to zeros. Inside the training loop, I computed the partial derivatives of the cost function with respect to the weight vector and the bias, scaled them by the learning rate to get `delta_w` and `delta_b`, and subtracted these from `self.w` and `self.b`. I repeated this process until the predictions were satisfactory. In the end, the MultipleLinearRegressor can handle multiple features, computing and applying a gradient for each weight and for the bias so that the model better captures the relationship between the input features and the target variable.
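
As an illustration of the gradient computation described above, the following sketch compares a loop-based version with an equivalent vectorized NumPy version on the normalized training data. The function names `gradients_loop` and `gradients_vectorized` and the weight values are hypothetical and chosen only for the comparison; this is one possible formulation, not the required one.

```python
import numpy as np


def gradients_loop(X, y, w, b):
    """Partial derivatives computed one example and one feature at a time."""
    m, n = X.shape
    grad_w = np.zeros(n)
    grad_b = 0.0
    for i in range(m):
        error = np.dot(X[i], w) + b - y[i]
        for j in range(n):
            grad_w[j] += error * X[i, j]
        grad_b += error
    return grad_w / m, grad_b / m


def gradients_vectorized(X, y, w, b):
    """The same quantities expressed as matrix operations."""
    error = X @ w + b - y
    return X.T @ error / len(y), error.mean()


X = np.array([
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [1534.0, 3.0, 2.0, 30.0],
    [852.0, 2.0, 1.0, 36.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])

# z-score normalization, as in fit()
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

w = np.array([1.0, -2.0, 0.5, 3.0])  # arbitrary weights, for comparison only
b = 10.0

loop_w, loop_b = gradients_loop(X_norm, y, w, b)
vec_w, vec_b = gradients_vectorized(X_norm, y, w, b)
print(np.allclose(loop_w, vec_w), np.isclose(loop_b, vec_b))
```

The vectorized form avoids the nested Python loops, which matters little for four examples but becomes noticeable on larger data sets.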








# Part 2: Predicting California Housing Prices

_Attribution: Special thanks to Dr. Roi Yehoshua_

In this second part of the notebook, you will construct a guided experiment to analyze the quality of a linear regression model for predicting real housing prices. We'll use a version of the [California housing data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) by Kelley and Barry. Take a moment now to [familiarize yourself with the version of this data set provided by sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html), and you can take a look at [a version of this data on Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices). (Note that the version on Kaggle has an extra column, ocean_proximity, which you should ignore.)

As you progress through this notebook, complete each code cell, run them, and complete the Knowledge Checks.

We'll begin by loading the data set.

## Step 1: Loading the Data Set

For convenience, we shall rely on the "california housing set" provided by scikit-learn. We'll first import a few typical libraries, and fetch the data set.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

np.random.seed(0)
data = fetch_california_housing(as_frame = True)
print(data.DESCR)
```

Try doing the same in a code cell here.

In [8]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

np.random.seed(0)
data = fetch_california_housing(as_frame = True)
print(data.DESCR)



.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

### 💡 Knowledge Check 1

Demonstrate your understanding of the general characteristics of the data set by summarizing it here. (What is this data set, and what does it contain? What are the attributes, what are their types, and what do they mean? What is the target value? Is there missing data? Etc.)

The California Housing data set describes housing in California at the block-group level. Each record summarizes one block group rather than a single house, with attributes covering income, house age, rooms, bedrooms, population, occupancy, and location. The data set has 20,640 records, each with 8 predictive attributes plus the target.

The attributes are:

- MedInc: median income in block group
- HouseAge: median house age in block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- AveOccup: average number of household members
- Latitude: block group latitude
- Longitude: block group longitude

The target variable is the median house value, expressed in hundreds of thousands of dollars, and it is the value I want to predict. All eight attributes are numeric, including latitude and longitude.

According to the data set description, there are no missing attribute values.
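
Claims about attribute types and missing values can be checked directly with pandas. The sketch below is hedged: to stay self-contained it rebuilds only the five rows that `data.frame.head()` displays in Step 2, under the hypothetical name `sample`. On the full data set the same two checks would be applied to `data.frame`.

```python
import pandas as pd

# The five rows shown by data.frame.head() in this notebook
sample = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "HouseAge": [41.0, 21.0, 52.0, 52.0, 52.0],
    "AveRooms": [6.984127, 6.238137, 8.288136, 5.817352, 6.281853],
    "AveBedrms": [1.02381, 0.97188, 1.073446, 1.073059, 1.081081],
    "Population": [322.0, 2401.0, 496.0, 558.0, 565.0],
    "AveOccup": [2.555556, 2.109842, 2.80226, 2.547945, 2.181467],
    "Latitude": [37.88, 37.86, 37.85, 37.85, 37.85],
    "Longitude": [-122.23, -122.22, -122.24, -122.25, -122.25],
    "MedHouseVal": [4.526, 3.585, 3.521, 3.413, 3.422],
})

# Column types: every column, including Latitude and Longitude, is numeric
print(sample.dtypes)
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)

# Missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```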

Now that our data set is loaded, let's explore what we have.

## Step 2: Exploring the Data Set



Let's quickly investigate some examples in the data set. Since `data` is a sklearn Bunch object, we can obtain the pandas DataFrame and investigate its shape, to determine the number of rows and columns, and to inspect the first few rows of data.

```python
print(data.frame.shape)
data.frame.head()
```

Go ahead and investigate the first few rows of the data frame.

In [9]:
print(data.frame.shape)
data.frame.head()

(20640, 9)


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


### 💡 Knowledge Check 2

What do the `shape` and `head` reveal about this data set?

The shape of the dataset shows that it has 20640 rows and 9 columns.

The `head` method shows the first 5 rows of the data set along with the column headers, which gives a quick summary of the type of data contained in each column.

## Step 3: Preparing Training and Test Sets

To train and test our linear regression model, we will need to split our data set. We'll use the `data` and `target` attributes of the Bunch to retrieve the feature set and target prediction values. Then, we'll reach for the handy `train_test_split` function from sklearn.

```python
from sklearn.model_selection import train_test_split

housing_attributes, prices = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = 0.2)
X_train.head()
```

Go ahead and split the data set into training and test sets here.

In [10]:
from sklearn.model_selection import train_test_split

housing_attributes, prices = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(housing_attributes, prices, test_size = 0.2)
X_train.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 12069 | 4.2386 | 6.0 | 7.723077 | 1.169231 | 228.0 | 3.507692 | 33.83 | -117.55 |
| 15925 | 4.3898 | 52.0 | 5.326622 | 1.100671 | 1485.0 | 3.322148 | 37.73 | -122.44 |
| 11162 | 3.9333 | 26.0 | 4.668478 | 1.046196 | 1022.0 | 2.777174 | 33.83 | -118.0 |
| 4904 | 1.4653 | 38.0 | 3.383495 | 1.009709 | 749.0 | 3.635922 | 34.01 | -118.26 |
| 4683 | 3.1765 | 52.0 | 4.119792 | 1.043403 | 1135.0 | 1.970486 | 34.08 | -118.36 |


### 💡 Knowledge Check 3

Approximately how many examples are in the training and test sets?

There are 16,512 examples in the training set and 4,128 examples in the test set.

In [11]:
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

Training set size: 16512
Test set size: 4128
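
These sizes follow from `test_size = 0.2` applied to 20,640 rows. As a small sketch, the split can be reproduced on row indices alone, without loading the data. The `random_state=0` argument is an assumption added here to make the split repeatable; the notebook itself relies on `np.random.seed(0)`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_examples = 20640  # number of rows reported by data.frame.shape
indices = np.arange(n_examples)

# random_state is an assumption for repeatability; it is not used in the cells above
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

print("Training set size:", len(train_idx))
print("Test set size:", len(test_idx))
print("Overlap:", len(set(train_idx) & set(test_idx)))
```

Every row lands in exactly one of the two sets, so the sizes always sum to the original row count.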


## Step 4: Pre-Processing and Training

Before applying our regression model, we would like to standardize the training set. To do this, we'll use the sklearn StandardScaler. Once we standardize the data, we will use it to train a linear regression model. In our case, we will experiment with the scikit-learn [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html), a linear regression model that trains via stochastic gradient descent (SGD). Please be sure to take a look at [the documentation for SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html).

To demonstrate a new feature in scikit-learn, and to give you some new ideas in your own future work, we will illustrate a small "machine learning pipeline," using the scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.

A Pipeline is handy for "setting up" multiple pre-processing steps that will run one after the other. The Pipeline can also end in a training step with a model. This enables us to provide the Pipeline our training data, and with one method call, complete both pre-processing and training in one step.

We'll import the necessary libraries, create our Pipeline, fill it with a StandardScaler and an SGDRegressor, and run the Pipeline.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor())
])
```

Try importing the necessary libraries and building your Pipeline below.


In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor())
])

With our Pipeline created, we can now invoke the Pipeline's `fit` method, passing it the training data. Behind the scenes, the Pipeline will standardize our training data, and also invoke our SGDRegressor's `fit` method with the transformed training data.

```python
pipeline.fit(X_train, y_train)
```

Try kicking off the Pipeline below.

In [13]:
pipeline.fit(X_train, y_train)

With our model now trained, let us analyze the results.
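
To make concrete what the Pipeline's `fit` does behind the scenes, the following sketch trains the same two components twice: once through a Pipeline and once by hand. It uses small synthetic data rather than the housing data, and `random_state=0` is an assumption added so that both SGDRegressor instances follow the same random sequence.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 3))  # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=200)

# One call: scale, then train
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(random_state=0)),
])
pipeline.fit(X, y)

# The same two steps performed manually
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
regressor = SGDRegressor(random_state=0)
regressor.fit(X_scaled, y)

# New inputs must be scaled with the statistics learned from the training data
X_new = rng.normal(loc=5.0, scale=3.0, size=(5, 3))
same = np.allclose(pipeline.predict(X_new), regressor.predict(scaler.transform(X_new)))
print("Pipeline and manual steps agree:", same)
```

The Pipeline also applies the fitted scaler automatically in `predict` and `score`, which removes the risk of forgetting to transform new data.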

### 💡 Knowledge Check 4

Investigate the parameters passed to the Pipeline initializer. Notice our use of the strings `'scaler'` and `'regressor'`. What purpose do these serve, and are we required to use those specific strings, or can we "make up" our own meaningful names for each component of the Pipeline?

In the Pipeline initializer, the strings `'scaler'` and `'regressor'` are names (keys) for the components of the Pipeline, and they let us identify and refer to each component. We are not required to use those specific strings: any name works as long as it is meaningful and is not reused for another component in the same Pipeline. For example, we could use `data_scaling` instead of `scaler`, and `linear_model` instead of `regressor` to identify the linear regression component.
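
A short sketch of how the step names are used in practice, with the alternative names from the answer above. A name gives access to the component through `named_steps`, and it serves as the prefix in `set_params`, where a double underscore separates the step name from the parameter name (for that reason a step name cannot itself contain a double underscore). The value `0.001` is arbitrary and only for illustration.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('data_scaling', StandardScaler()),
    ('linear_model', SGDRegressor()),
])

# Change a hyperparameter of one step: <step name>__<parameter name>
pipeline.set_params(linear_model__eta0=0.001)

# Look up a component by the name we chose
model = pipeline.named_steps['linear_model']
print(type(model).__name__, model.eta0)
```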

## Step 5: Model Validation

We have conducted an initial round of training using a data set that may or may not have strong linear tendencies, and we have employed a basic, unconfigured SGDRegressor model to see what baseline quality we can achieve. Let's investigate the "coefficient of determination," R^2, via the model's `score` method. We will invoke this `score` method via the Pipeline, since it has ownership of our SGDRegressor model. We would love to see a value as close to 1.0 as possible.

We can generate an R^2 score with both the training data and the test data to validate the quality of our model.

```python
training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")
```

Go ahead and generate and print the score based on the training data, and the score based on the test data.

In [14]:
training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")

Training score: -154.744672
Test score: -1413.434270


### 💡 Knowledge Check 5

What are the scores for the training and test sets? What do they indicate? Are they good? How do you know? (Hint: Have you read the documentation for the score method of SGDRegressor?)

The training score is -154.744672 and the test score is -1413.434270. These are R^2 values, where 1.0 is the best possible score and a constant model that always predicts the mean of the target scores 0.0. Because both scores are far below zero, the model predicts much worse than that baseline, so it is not a good model. It has not captured the underlying patterns in the data.
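
To see how an R^2 score can fall below zero, here is a small sketch that computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with `sklearn.metrics.r2_score`. The helper name `r_squared` and the toy values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
examples = {
    "perfect": [1.0, 2.0, 3.0],
    "always the mean": [2.0, 2.0, 2.0],
    "reversed": [3.0, 2.0, 1.0],
}

for label, y_pred in examples.items():
    print(f"{label}: by hand = {r_squared(y_true, y_pred)}, sklearn = {r2_score(y_true, y_pred)}")
```

Whenever the model's squared error exceeds that of the mean-only baseline, the ratio is greater than one and the score becomes negative, with no lower bound.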

## Step 6: Adjusting the Model (Experiment)

If we spend time reviewing the documentation of SGDRegressor, we find that the default instantiation uses particular default hyperparameters. Now it's your turn. Based on the concepts in the course and your understanding of linear regression, how might you "tune" the SGDRegressor instance in the Pipeline?

Try setting up a new Pipeline as an experiment, and try passing different parameter configurations to SGDRegressor's initializer, and investigate the results. You might set up your experiment like the following. Notice how we have specified a `penalty` of `None` as a demonstrated experiment.

```python
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(penalty = None))
])

pipeline.fit(X_train, y_train)

training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")
```

Create a similar experiment here, and try a few different initialization parameters for SGDRegressor. How might you increase its performance score? (Think about the important concepts of a linear regression model that uses gradient descent. Be sure to try customizing the most important hyperparameters.)

In [15]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(learning_rate='constant', eta0=0.01, alpha=0.001, max_iter=1000))
])

pipeline.fit(X_train, y_train)

training_score = pipeline.score(X_train, y_train)
print(f"Training score: {training_score:.6f}")

test_score = pipeline.score(X_test, y_test)
print(f"Test score: {test_score:.6f}")


Training score: -2355925001336889344.000000
Test score: -6412475811969347584.000000


See if you can make any improvement to the training and test scores. Then, see if your models can achieve a score between 0.0 and 1.0.

### 💡 Knowledge Check 6

Based on the concepts in the Explorations regarding linear regression and gradient descent, what is perhaps the single most important hyperparameter for a linear regression model? What SGDRegressor initialization parameter lets you specify the value for this important hyperparameter?

I would say the learning rate is the most important hyperparameter for a linear regression model trained with gradient descent. In SGDRegressor, the `eta0` initialization parameter sets the initial learning rate, and the `learning_rate` parameter selects the schedule (for example `'constant'` or `'invscaling'`) that determines how the rate changes during training.
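
As a hedged illustration of passing a learning rate to SGDRegressor, the sketch below trains on small synthetic data (not the housing data) with two constant learning rates. The `eta0` values and `random_state=0` are assumptions chosen for the example, and nothing here implies that the same values would repair the scores obtained above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=500)

scores = {}
for eta0 in (0.001, 0.01):
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        # learning_rate selects the schedule; eta0 sets the (initial) step size
        ('regressor', SGDRegressor(learning_rate='constant', eta0=eta0, random_state=0)),
    ])
    pipeline.fit(X, y)
    scores[eta0] = pipeline.score(X, y)
    print(f"eta0={eta0}: training R^2 = {scores[eta0]:.4f}")
```

On real data it is usually worth trying several values across orders of magnitude and comparing training and test scores for each.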


# Conclusion

In Part 2, I first explored the data set to understand its rows and columns. After observing what the data set contains, I split it into training and test sets. I then applied standardization with StandardScaler and combined it with SGDRegressor in a Pipeline. I tried different hyperparameter configurations, focusing on the learning rate of SGDRegressor. The results were not what I wanted: the R^2 scores on both the training and test sets were strongly negative, which indicates poor model performance and may indicate a poor model choice. As next steps, I could explore nonlinear models, incorporate additional factors that may affect accuracy, and apply feature transformations.