# Multiple Linear Regression

# Dataset

### California Housing Prices

The dataset describes the houses found in a given California district, with summary statistics about them based on 1990 census data.

### Columns

1. **longitude**: A measure of how far west a house is; values are negative (degrees west of the prime meridian), so a lower value is farther west
2. **latitude**: A measure of how far north a house is; a higher value is farther north
3. **housing_median_age**: Median age of a house within a block; a lower number is a newer building
4. **total_rooms**: Total number of rooms within a block
5. **total_bedrooms**: Total number of bedrooms within a block
6. **population**: Total number of people residing within a block
7. **households**: Total number of households, a group of people residing within a home unit, for a block
8. **median_income**: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. **median_house_value**: Median house value for households within a block (measured in US Dollars)
10. **ocean_proximity**: Location of the house w.r.t ocean/sea

---

## Reading the dataset

In [12]:
import pandas as pd

In [13]:
housing_df = pd.read_csv('./housing.csv')
housing_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [20]:
housing_df.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object

**Notice that `ocean_proximity` is a *string* (`object`) column, so we will exclude it from the analysis.**
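As a small illustration of this step, the sketch below lists the numeric columns of a frame with `select_dtypes`. The three-row `df` is a hypothetical stand-in for `housing_df`, not the real data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for housing_df
df = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574],
    "median_house_value": [452600.0, 358500.0, 352100.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY"],
})

# Keep only numeric columns; object (string) columns are left out
numeric_df = df.select_dtypes(include="number")
numeric_cols = list(numeric_df.columns)
print(numeric_cols)
```

The same call on the full dataset should return every column except `ocean_proximity`, since it is the only `object` column in the dtypes listing.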

---

## Applying statsmodels OLS

Let's apply the linear regression function `ols()` from `statsmodels`.

In [14]:
from statsmodels.formula.api import ols

In [22]:
# Exclude ocean_proximity since it's not a numeric column
formula = """median_house_value ~
                                    longitude +
                                    latitude +
                                    housing_median_age +
                                    total_rooms +
                                    total_bedrooms +
                                    population +
                                    households +
                                    median_income"""
results = ols(formula, data=housing_df).fit()
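The formula string can also be assembled from a list of column names, which may be easier to maintain when predictors change. This is only a sketch; the names `target`, `predictors` and `formula_from_list` are introduced here for illustration.

```python
# Column names taken from the dataset description above
target = "median_house_value"
predictors = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
]

# Build "target ~ a + b + c ..." in the formula syntax used by ols()
formula_from_list = f"{target} ~ " + " + ".join(predictors)
print(formula_from_list)
```

The resulting string lists the same eight predictors as the hand-written formula, just on a single line.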

Print the summary

In [23]:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | median_house_value | R-squared: | 0.637 |
| Model: | OLS | Adj. R-squared: | 0.637 |
| Method: | Least Squares | F-statistic: | 4478.0 |
| Date: | Wed, 28 Jul 2021 | Prob (F-statistic): | 0.0 |
| Time: | 14:24:28 | Log-Likelihood: | -256820.0 |
| No. Observations: | 20433 | AIC: | 513700.0 |
| Df Residuals: | 20424 | BIC: | 513700.0 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.585e+06 | 6.29e+04 | -57.001 | 0.000 | -3.71e+06 | -3.46e+06 |
| longitude | -4.273e+04 | 717.087 | -59.588 | 0.000 | -4.41e+04 | -4.13e+04 |
| latitude | -4.251e+04 | 676.952 | -62.796 | 0.000 | -4.38e+04 | -4.12e+04 |
| housing_median_age | 1157.9003 | 43.389 | 26.687 | 0.000 | 1072.855 | 1242.945 |
| total_rooms | -8.2497 | 0.794 | -10.387 | 0.000 | -9.807 | -6.693 |
| total_bedrooms | 113.8207 | 6.931 | 16.423 | 0.000 | 100.236 | 127.405 |
| population | -38.3856 | 1.084 | -35.407 | 0.000 | -40.511 | -36.261 |
| households | 47.7014 | 7.547 | 6.321 | 0.000 | 32.909 | 62.493 |
| median_income | 4.03e+04 | 337.207 | 119.504 | 0.000 | 3.96e+04 | 4.1e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 4898.534 | Durbin-Watson: | 0.975 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 18260.733 |
| Skew: | 1.166 | Prob(JB): | 0.0 |
| Kurtosis: | 7.002 | Cond. No. | 510000.0 |


### Results

In [28]:
results.rsquared

0.6369116857335635

- **R-squared:** 0.637
    - About 63.7% of the variance in the median house value is explained by the 8 independent variables.
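To make the meaning of R-squared concrete, here is a minimal sketch of its usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the toy values are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)       # predictions match exactly
baseline = r_squared(y_true, [2.5] * 4)   # always predicting the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so a value of 0.637 sits between those two reference points.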

Print the params (coefficients) of the variables

In [31]:
results.params

Intercept            -3.585396e+06
longitude            -4.273012e+04
latitude             -4.250974e+04
housing_median_age    1.157900e+03
total_rooms          -8.249725e+00
total_bedrooms        1.138207e+02
population           -3.838558e+01
households            4.770135e+01
median_income         4.029752e+04
dtype: float64
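These coefficients define the fitted equation: a prediction is the intercept plus each coefficient multiplied by the matching feature value. As a rough worked example, the sketch below applies the rounded coefficients printed above to the first row of `housing_df.head()`. The dictionaries are typed in by hand, so the result is only approximate.

```python
# Coefficients copied (rounded) from results.params above
params = {
    "Intercept": -3.585396e+06,
    "longitude": -4.273012e+04,
    "latitude": -4.250974e+04,
    "housing_median_age": 1.157900e+03,
    "total_rooms": -8.249725e+00,
    "total_bedrooms": 1.138207e+02,
    "population": -3.838558e+01,
    "households": 4.770135e+01,
    "median_income": 4.029752e+04,
}

# Feature values of row 0 from housing_df.head()
row0 = {
    "longitude": -122.23,
    "latitude": 37.88,
    "housing_median_age": 41.0,
    "total_rooms": 880.0,
    "total_bedrooms": 129.0,
    "population": 322.0,
    "households": 126.0,
    "median_income": 8.3252,
}

# Intercept + sum of coefficient * value over all predictors
prediction = params["Intercept"] + sum(params[name] * value for name, value in row0.items())
print(round(prediction))
```

The result is on the order of 411,000 US Dollars, compared with the observed `median_house_value` of 452,600 for that row.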

Print the p-values of each independent variable

In [30]:
results.pvalues

Intercept              0.000000e+00
longitude              0.000000e+00
latitude               0.000000e+00
housing_median_age    2.946266e-154
total_rooms            3.294848e-25
total_bedrooms         3.188906e-60
population            1.459679e-266
households             2.653505e-10
median_income          0.000000e+00
dtype: float64

---

## Sklearn LinearRegression
Now we will use `LinearRegression` from `sklearn` to create a model and make predictions.

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


Drop the columns that are not needed as features: the dependent variable `median_house_value` and the non-numerical column `ocean_proximity`.

In [34]:
x = housing_df.drop(['median_house_value', 'ocean_proximity'], axis=1)

Let's make sure there are no NA values; in this case we will fill them with 0.

In [35]:
x = x.fillna(0)
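Before filling, it can help to count how many values are actually missing in each column. The sketch below does that on a hypothetical three-row frame (`features` is not the real data) and also shows median filling as one possible alternative to 0; which strategy suits this dataset is a judgment call, not something the notebook establishes.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature feature frame with one missing value
features = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Number of missing values per column
missing_per_column = features.isna().sum()
print(missing_per_column)

# Fill with 0, as done in this notebook
filled_zero = features.fillna(0)

# Alternative (an assumption, not the notebook's choice): fill with the column median
filled_median = features.fillna(features.median())
print(filled_median)
```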

Get our `y` (dependent variable)

In [36]:
y = housing_df['median_house_value']

Split the values into train and test datasets using `train_test_split`, with a `test_size` of 20%

In [39]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)

Create the model and train it with `x_train` and `y_train`

In [40]:
model = LinearRegression()
model.fit(x_train, y_train)

LinearRegression()

Now we can predict values for new data using the `model.predict(x)` method

In [44]:
y_pred = model.predict(x_test)
y_pred

array([211730.93700633, 253151.39107362, 112130.0957762 , ...,
       236468.6303331 , 165853.34327934, 129166.83282363])

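Let's evaluate this model using the `score` method, which for `LinearRegression` returns the coefficient of determination (R-squared) rather than a classification accuracy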

In [45]:
model.score(x_test, y_test)

0.6417937275429

Notice the score (about 0.642, computed on the held-out test set) is very close to the R-squared value from the `OLS` method (0.637, computed on the data used for fitting).
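To check what `score` reports, the sketch below compares it with `r2_score` from `sklearn.metrics` and also computes the root mean squared error, which is expressed in the units of the target. It uses synthetic data generated with NumPy as a stand-in, so the numbers it prints are unrelated to the housing results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

score = model.score(X_test, y_test)  # R-squared on the test set
r2 = r2_score(y_test, y_pred)        # same quantity computed from predictions
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print(score, r2, rmse)
```

On the housing data, the same pattern with `x_test`, `y_test` and `y_pred` would give an error figure in US Dollars alongside the R-squared value.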