<h1> <center> Simple Linear Regression Model on Housing Data </center> </h1>
<h2> <center> Using scikit-learn </center> </h2>
    
Source: [ML: Regression, University of Washington, Coursera](https://www.coursera.org/learn/ml-regression/supplement/z0Uef/fitting-a-simple-linear-regression-model-on-housing-data)
    
  

In [2]:
import pandas as pd

# <center> Load and Explore Data </center>

### <center> Load house sales data </center>

In [3]:
houses_df = pd.read_csv('../data/kc_house_data.csv')
train_df = pd.read_csv('../data/kc_house_train_data.csv')
test_df = pd.read_csv('../data/kc_house_test_data.csv')

In [4]:
train_df.head(5)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


<br>
<br>

# <center> Build Closed-Form Solution </center>

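The closed-form equations for simple linear regression, where N is the number of observations, X is the input feature, and Y is the output:

* The slope requires these two terms:

    $$\text{numerator} = \sum XY - \frac{1}{N}\left(\sum X\right)\left(\sum Y\right)$$

    $$\text{denominator} = \sum X^2 - \frac{1}{N}\left(\sum X\right)\left(\sum X\right)$$

    $$\text{slope} = \frac{\text{numerator}}{\text{denominator}}$$


* And for the intercept:

    $$\text{intercept} = \frac{1}{N}\sum Y - \text{slope} \cdot \frac{1}{N}\sum X$$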
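
To make the formulas concrete, here is a small worked example in plain Python. The four data points are made up for illustration (they are not from the housing data) and lie exactly on the line y = 1 + 2x, so the closed form should recover a slope of 2 and an intercept of 1.

```python
# Toy data lying exactly on y = 1 + 2x (illustrative values, not housing data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

N = len(xs)
x_sum = sum(xs)                                # sum of X   = 10
y_sum = sum(ys)                                # sum of Y   = 24
x_y_sum = sum(x * y for x, y in zip(xs, ys))   # sum of X*Y = 70
x_x_sum = sum(x * x for x in xs)               # sum of X^2 = 30

numerator = x_y_sum - (x_sum * y_sum) / N      # 70 - 240/4 = 10
denominator = x_x_sum - (x_sum * x_sum) / N    # 30 - 100/4 = 5
slope = numerator / denominator                # 10 / 5 = 2
intercept = y_sum / N - slope * x_sum / N      # 6 - 2 * 2.5 = 1

print(slope, intercept)  # 2.0 1.0
```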


<br> 

__First, doing it manually__  <br>
Compute the average of the house prices

In [4]:
# for the whole set
sum_prices = houses_df['price'].sum()

average = sum_prices/len(houses_df['price'])
average

540088.1417665294

In [5]:
# only train set
train_df['price'].sum() #/ len(train_df['price'])

9376349460.0

In [6]:
train_df['sqft_living'].sum() #/ len(train_df['sqft_living'])

36159233

Compute the sum of squares of the prices

In [7]:
(train_df['price']*train_df['price']).sum()

7433051852335772.0

In [8]:
(train_df['sqft_living'] * train_df['sqft_living']).sum()

89977452623

<br>

__Now with a function__

In [9]:
def simple_linear_regression(input_feature, output):
    
    # both should be the same length, so compute it only once from one of them
    N = len(input_feature)
    
    # compute the sum of input_feature and output: terms (sum of X) and (sum of Y)
    x_sum = input_feature.sum()
    y_sum = output.sum()
    
    # compute the product of the output and the input_feature and its sum (sum of X* Y)
    x_y_sum = (input_feature * output).sum()
    
    # compute the squared value of the input_feature and its sum
    x_x_sum = (input_feature * input_feature).sum()
    
    # use the formula for the slope
    numerator = x_y_sum - ((x_sum)*(y_sum)/N)
    denominator = x_x_sum - ((x_sum)*(x_sum)/N)
    slope = numerator / denominator
    
    # use the formula for the intercept
    # intercept = (sum of Y)*(1/N) - slope * (sum of X)*(1/N)
    intercept = (y_sum/N) - (slope * x_sum / N)
    
    return (intercept, slope)


In [10]:
squarefeet_intercept, squarfeet_slope = simple_linear_regression(train_df['sqft_living'],train_df['price'])
squarefeet_intercept, squarfeet_slope

(-47116.07907289418, 281.9588396303426)
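
As a cross-check on a fit like this, the slope can also be computed from the mean-centered form: the sum of (x - mean of x)(y - mean of y) divided by the sum of (x - mean of x)^2, which is algebraically equivalent to the raw-sum formula. The sketch below is list-based and uses only the five rows shown in the `train_df.head(5)` output, so its numbers will not match the full training-set fit. Both function names are illustrative and not part of the original notebook.

```python
def slope_intercept_raw(xs, ys):
    """Raw-sum closed form (list-based sketch). Returns (intercept, slope)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    return sy / n - slope * sx / n, slope


def slope_intercept_centered(xs, ys):
    """Equivalent mean-centered form. Returns (intercept, slope)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# sqft_living and price from the five rows printed by train_df.head(5)
sqft = [1180.0, 2570.0, 770.0, 1960.0, 1680.0]
price = [221900.0, 538000.0, 180000.0, 604000.0, 510000.0]

raw = slope_intercept_raw(sqft, price)
centered = slope_intercept_centered(sqft, price)
print(raw)
print(centered)
```

The two results should agree up to floating-point round-off. When the sums get very large (the sum of squared prices in this notebook is on the order of 10^15), the centered form tends to be less sensitive to round-off, which is one reason to keep it as a check.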

<br>

Using your slope and intercept from (4), what is the predicted price for a house with 2650 sqft?

In [11]:
# the input is the living area in square feet; the prediction is the price
sqft = 2650
y_at_2650 = squarefeet_intercept + squarfeet_slope * sqft
y_at_2650

700074.8459475137
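
The same prediction can be wrapped in a small helper so it is easy to reuse for other inputs. This is a minimal sketch: the function name `get_regression_predictions` is hypothetical, and the intercept and slope are copied from the fit output shown earlier rather than recomputed.

```python
def get_regression_predictions(input_feature, intercept, slope):
    """Predict the output for a simple linear model: intercept + slope * input."""
    return intercept + slope * input_feature


# Values copied from the fit on the training data (sqft_living -> price)
intercept = -47116.07907289418
slope = 281.9588396303426

predicted_price = get_regression_predictions(2650, intercept, slope)
print(round(predicted_price, 2))  # 700074.85
```

Because the helper only uses `+` and `*`, it should also work unchanged on a pandas Series such as `test_df['sqft_living']`, returning one prediction per row.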

In [15]:
1201918356321967.8/10**15

1.2019183563219678

In [17]:
1.2019183563219678*10**15

1201918356321967.8

In [18]:
275402936247141.4 - 493364582868288.06

-217961646621146.66