# Linear Regression using Scikit-Learn

There is an open-source, commercially usable machine learning toolkit called [scikit-learn](https://scikit-learn.org/stable/index.html). This toolkit contains implementations of many of the algorithms that you will work with in this course.



## Goals
In this notebook you will:
- Use scikit-learn to implement linear regression using a closed-form solution based on the normal equation

## Tools
You will utilize functions from scikit-learn as well as matplotlib and NumPy. 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
plt.style.use('./deeplearning.mplstyle')

## Linear Regression, closed-form solution
Scikit-learn provides the [linear regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression), `LinearRegression`, which fits the parameters with a closed-form least-squares solution rather than an iterative method such as gradient descent.

Let's use the data from the earlier notebooks: a house with $1000$ square feet sold for $\$300,000$ and a house with $2000$ square feet sold for $\$500,000$.

| Size (1000 sqft) | Price (1000s of dollars) |
| ---------------- | ------------------------ |
| $1$              | $300$                    |
| $2$              | $500$                    |
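
To make "closed-form" concrete, the sketch below solves the normal equation directly with NumPy for the two houses in the table. It is only an illustration under stated assumptions: the names `A`, `theta`, `w_ne`, and `b_ne` are hypothetical, and scikit-learn's `LinearRegression` may use a different least-squares solver internally.

```python
import numpy as np

# Training data from the table: size in 1000 sqft, price in 1000s of dollars
x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])

# Design matrix with a column of ones so the intercept b is solved together with w
A = np.column_stack([x, np.ones_like(x)])

# Normal equation: (A^T A) theta = A^T y, solved without forming an explicit inverse
theta = np.linalg.solve(A.T @ A, A.T @ y)
w_ne, b_ne = theta

print(f"w = {w_ne:.2f}, b = {b_ne:.2f}")   # w = 200.00, b = 100.00
```

Using `np.linalg.solve` instead of inverting `A.T @ A` is a common numerical choice; both express the same normal-equation solution.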


### Load the data set

In [2]:
X_train = np.array([1.0, 2.0])    # features: size in 1000 sqft
y_train = np.array([300, 500])    # target values: price in 1000s of dollars

In [4]:
print(X_train.shape)
print()
print(y_train.shape)

(2,)

(2,)


### Create and fit the model
The code below performs regression using scikit-learn. 
The first step creates a regression object.  
The second step calls one of the methods associated with the object, `fit`. This performs regression, fitting the parameters to the input data. The toolkit expects a two-dimensional X matrix of shape (number of examples, number of features).

In [5]:
linear_model = LinearRegression()

# X must be a 2-D Matrix
X_train = X_train.reshape(-1, 1)
linear_model.fit(X_train, y_train) 
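
As a small illustration of the shape requirement, the sketch below shows what `reshape(-1, 1)` does to a one-dimensional feature array. The variable names `x`, `X_col`, and `X_alt` are hypothetical and used only for this example.

```python
import numpy as np

x = np.array([1.0, 2.0])      # shape (2,): two examples, no feature axis
X_col = x.reshape(-1, 1)      # shape (2, 1): -1 lets NumPy infer the number of rows
X_alt = x[:, np.newaxis]      # an equivalent way to add the feature axis

print(x.shape, X_col.shape, X_alt.shape)   # (2,) (2, 1) (2, 1)
```

Each row of the reshaped array is one training example and each column is one feature, which is the layout `fit` and `predict` expect.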

### View Parameters 
The $\mathbf{w}$ and $b$ parameters are referred to as **coefficients** and **intercept** in scikit-learn. After fitting, they are available as the `coef_` and `intercept_` attributes of the model.

In [7]:
b = linear_model.intercept_
w = linear_model.coef_
print("w = {}, b = {}".format(w, b))
print("'manual' prediction: f_wb = wx+b : {}".format(1.2*w + b))   # 1200 sqft = 1.2 in units of 1000 sqft

w = [200.], b = 100.00000000000011
'manual' prediction: f_wb = wx+b : [340.]
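
The printed `w` is an array while `b` is a scalar. The sketch below, which refits the same two-point data under the hypothetical name `model`, shows the shapes involved and how to pull out a scalar weight in the single-feature case. Note that the feature is in units of 1000 sqft, so a 1200 sqft house is `1.2`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0]])     # size in 1000 sqft, shape (2, 1)
y = np.array([300.0, 500.0])     # price in 1000s of dollars
model = LinearRegression().fit(X, y)

print(model.coef_.shape)         # (1,): one weight per feature
print(np.ndim(model.intercept_)) # 0: the intercept is a scalar

w_scalar = model.coef_[0]        # scalar weight for the single feature
x_new = 1.2                      # 1200 sqft expressed in 1000 sqft
f_wb = w_scalar * x_new + model.intercept_
print(f"{f_wb:.1f}")             # 340.0, i.e. $340,000
```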


### Make Predictions

Calling the `predict` method generates predictions. Its input must use the same units as the training features, here thousands of square feet.

In [9]:
y_pred = linear_model.predict(X_train)
print("Prediction on training set:", y_pred)

X_test = np.array([[1.2]])    # 1200 sqft, in units of 1000 sqft
print("Prediction for 1200 sqft house: ${:,.0f}".format(linear_model.predict(X_test)[0] * 1000))

Prediction on training set: [300. 500.]
Prediction for 1200 sqft house: $340,000


## Second Example
The second example is from an earlier notebook with multiple features. The final parameter values and predictions are very close to the results from the un-normalized 'long run' in that notebook. That un-normalized run took hours to produce results, while this is nearly instantaneous. The closed-form solution works well on smaller data sets such as these but can be computationally demanding on larger data sets.
>**Note:** The closed-form solution does not require normalization.
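
One way to see why normalization is unnecessary here is to fit the same model on raw and on z-score normalized features and compare predictions. The sketch below does this on hypothetical synthetic data (not the course's `houses.txt`); `StandardScaler` is used as one possible normalization, and the feature ranges and coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales
X = np.column_stack([rng.uniform(500, 3000, 50), rng.integers(1, 6, 50)])
y = 0.2 * X[:, 0] - 30.0 * X[:, 1] + 100.0 + rng.normal(0, 5, 50)

# Fit on the raw features
raw_model = LinearRegression().fit(X, y)

# Fit on z-score normalized features
scaler = StandardScaler().fit(X)
X_norm = scaler.transform(X)
norm_model = LinearRegression().fit(X_norm, y)

# The learned weights differ, but the predictions agree
same = np.allclose(raw_model.predict(X), norm_model.predict(X_norm))
print(same)
```

The coefficients change with the scaling, but the fitted function is the same, so feature scaling affects neither the quality of the closed-form fit nor its speed in the way it affects gradient descent.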

In [10]:
# load the dataset
data = np.loadtxt("./data/houses.txt", delimiter=',', skiprows=1)
X_train = data[:,:4]
y_train = data[:,4]

X_features = ['size(sqft)', 'bedrooms', 'floors', 'age']

In [11]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train) 

In [12]:
b = linear_model.intercept_
w = linear_model.coef_
print("w = {}, b = {}".format(w, b))

w = [  0.26860107 -32.62006902 -67.25453872  -1.47297443], b = 220.42153358200798


In [14]:
print("Prediction on training set:\n {}".format(linear_model.predict(X_train)[:4]))
print("prediction using w,b:\n {}".format((np.dot(X_train, w) + b)[:4]))
print("Target values \n {}".format(y_train[:4]))

x_house = np.array([1200, 3,1, 40]).reshape(-1,4)
x_house_predict = linear_model.predict(x_house)[0]
print("predicted price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old = ${}".format(x_house_predict*1000))

Prediction on training set:
 [295.17615301 485.97796332 389.52416548 492.14712499]
prediction using w,b:
 [295.17615301 485.97796332 389.52416548 492.14712499]
Target values 
 [300.  509.8 394.  540. ]
predicted price of a house with 1200 sqft, 3 be

In this notebook you:
- utilized an open-source machine learning toolkit, scikit-learn
- implemented linear regression using a closed-form solution from that toolkit