Linear Regression
==========

This notebook demonstrates how to implement linear regression with scikit-learn. We will use a housing dataset to train the model and make predictions.

Regression is a family of machine learning approaches used to predict a continuous output. Fitting a regression model can be thought of as searching for the set of parameters that gives the best performance for a particular prediction model.

A simple line equation with slope $\omega_1$ and intercept $\omega_0$ is given by
                               
\begin{equation*}
y = \omega_1 x + \omega_0
\end{equation*}

This equation describes the relationship between two variables, $x$ and $y$. Our goal is to find the weight values ($\omega_1$ and $\omega_0$) that maximise performance, i.e. the weights that let us best predict $y$ for a given $x$.
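
For a single input variable, the "best" weights can be written in closed form, which makes the goal above concrete. The following is a minimal sketch in plain Python, assuming "best" means minimising the sum of squared prediction errors (ordinary least squares). The helper name `fit_line` and the toy data are made up for illustration and are not part of the housing dataset.

```python
def fit_line(xs, ys):
    """Return (w1, w0) that minimise the squared error of y = w1 * x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (sum of co-deviations of x and y) / (sum of squared deviations of x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    w0 = mean_y - w1 * mean_x
    return w1, w0

# Toy data generated from y = 2x + 1 (hypothetical values)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w1, w0 = fit_line(xs, ys)
print(w1, w0)
```

Because the toy data lie exactly on a line, the recovered slope and intercept match the values used to generate them. With noisy data the same formulas give the line with the smallest total squared error.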

#### Load the dataset

In [0]:
import pandas as pd # pandas is used for loading data from different files like csv, excel sheets etc
import numpy as np # numpy is used for dealing with arrays
from sklearn.model_selection import train_test_split # sklearn is used for creating models
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt #matplotlib is used for visualising the data

# Uncomment the next two lines in Jupyter to display figures inside the notebook
# %matplotlib inline
# %config InlineBackend.figure_format = 'retina'

In [0]:
!wget https://gist.github.com/tdchaitanya/d84c787328df169c50a06eb1669666c9/raw/7ffeddc80bec1c22e91bfed6e026620cf989eacf/housing_data.csv

In [0]:
!ls

In [0]:
# load the data from a csv file
data = pd.read_csv('housing_data.csv')

#### Look at the data

In [0]:
# head is used to display first n entries of the data frame
data[["sqft_living","price"]].head(10)

#### Univariate Linear Regression

In [0]:
# sklearn expects arrays as input for building the model
X = np.array(data['sqft_living'])
y = np.array(data['price'])

In [0]:
# we need a held-out test set to measure the performance of our model
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [0]:
plt.scatter(data['sqft_living'],data['price'])
plt.xlabel('sqft_living', fontsize=12)
plt.ylabel('price', fontsize=12)

In [0]:
# X_train is a 1-d array; since sklearn expects a 2-d array as input, we reshape it to (n_samples, 1)
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)
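
To see what `reshape(-1, 1)` does, here is a small sketch on a made-up array (the values are hypothetical, not taken from the housing data). The `-1` asks NumPy to infer that dimension from the array's length, so a 1-d array of `n` values becomes a column with `n` rows and one feature.

```python
import numpy as np

x = np.array([1180, 2570, 770])  # hypothetical square-footage values
print(x.shape)                   # 1-d: (3,)

x_2d = x.reshape(-1, 1)          # -1 lets NumPy infer the number of rows
print(x_2d.shape)                # 2-d: (3, 1), i.e. 3 samples, 1 feature
```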

In [0]:
regr = LinearRegression()
regr.fit(X_train, y_train)

In [0]:
regr.coef_

In [0]:
regr.intercept_

In [0]:
# R^2 score on the training data (score returns R^2, not an error; higher is better)
regr.score(X_train, y_train)

In [0]:
# R^2 score on the held-out test data
regr.score(X_test, y_test)
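
The `score` method of `LinearRegression` returns the coefficient of determination $R^2$ rather than an error, so values closer to 1 are better. As a rough sketch of what that number means, the snippet below computes $R^2$ by hand with NumPy. The function name `r2_score_manual` and the toy values are illustrative assumptions.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score_manual(y_true, y_true)       # predictions equal the targets
baseline = r2_score_manual(y_true, [6.0] * 4)   # always predict the mean (6.0)
print(perfect, baseline)
```

A perfect model scores 1, a model that always predicts the mean of the targets scores 0, and a model that does worse than that baseline gets a negative score.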

In [0]:
# Plot predictions on training data
plt.scatter(X_train, y_train)
plt.plot(X_train, regr.predict(X_train), color='blue',
         linewidth=3)

plt.xlabel('sqft_living', fontsize=12)
plt.ylabel('price', fontsize=12)

plt.show()

In [0]:
# Plot predictions on testing data
plt.scatter(X_test, y_test)
plt.plot(X_test, regr.predict(X_test), color='blue',
         linewidth=3)

plt.xlabel('sqft_living', fontsize=12)
plt.ylabel('price', fontsize=12)

plt.show()

##### Exercise: Try implementing Univariate Linear Regression with other variables. 

#### Multivariate Linear Regression

In [0]:
data.columns

In [0]:
X = np.array(data[['sqft_living', 'bedrooms', 'bathrooms']])
y = np.array(data['price'])

In [0]:
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [0]:
regr = LinearRegression()
regr.fit(X_train, y_train)

In [0]:
regr.coef_

In [0]:
regr.intercept_
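
With several input columns the model becomes $y = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \omega_3 x_3$, so `coef_` holds one weight per column and `intercept_` holds $\omega_0$. The sketch below illustrates the same least-squares idea using `np.linalg.lstsq` on synthetic, noise-free data. The variable names and the chosen weights are assumptions made for the example, and this is not necessarily how scikit-learn solves the problem internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 synthetic samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])   # hypothetical weights
true_b = 4.0                          # hypothetical intercept
y = X @ true_w + true_b               # noise-free targets

# Append a column of ones so the intercept is learned as an extra weight
A = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
weights, intercept = solution[:3], solution[3]
print(weights, intercept)
```

Since the targets contain no noise, the recovered weights and intercept agree with the values used to generate the data up to floating-point error.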

In [0]:
# R^2 score on the training data (score returns R^2, not an error; higher is better)
regr.score(X_train, y_train)

In [0]:
# R^2 score on the held-out test data
regr.score(X_test, y_test)