# Linear Regression on the California Housing Dataset from scikit-learn

### Statistical Perspective

Import libraries and get the data

In [1]:
import statsmodels.api
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

Data Exploration

In [3]:
print(housing.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


In [4]:
print(housing.data.shape)

(20640, 8)


In [5]:
print(housing.feature_names)

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [6]:
print(housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Setting up dependent and independent variables

In [7]:
y = housing.target          # median house value, in units of $100,000
X = housing.data[:, 0:2]    # first two features: MedInc and HouseAge
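The slice `0:2` can be checked against the feature names printed earlier. A minimal sketch, with the list copied by hand from that output rather than read from the dataset:

```python
# Feature names as printed by housing.feature_names earlier in this notebook.
feature_names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                 'Population', 'AveOccup', 'Latitude', 'Longitude']

# The slice 0:2 keeps column indices 0 and 1 (the end index is exclusive).
selected = feature_names[0:2]
print(selected)
```

So the regressions in this notebook use only median income and median house age as predictors.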

Running the regression

In [10]:
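# OLS does not add an intercept on its own, so this first model is fit without a constant.
regression = statsmodels.api.OLS(y, X)
model = regression.fit()
model.summary()  # display the regression results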

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared (uncentered): | 0.883 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.883 |
| Method: | Least Squares | F-statistic: | 78130.0 |
| Date: | Wed, 16 Aug 2023 | Prob (F-statistic): | 0.0 |
| Time: | 14:56:30 | Log-Likelihood: | -24913.0 |
| No. Observations: | 20640 | AIC: | 49830.0 |
| Df Residuals: | 20638 | BIC: | 49850.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 0.4209 | 0.002 | 193.009 | 0.000 | 0.417 | 0.425 |
| x2 | 0.0157 | 0.000 | 52.093 | 0.000 | 0.015 | 0.016 |

| | | | |
|---|---|---|---|
| Omnibus: | 4171.463 | Durbin-Watson: | 0.756 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9634.681 |
| Skew: | 1.147 | Prob(JB): | 0.0 |
| Kurtosis: | 5.438 | Cond. No. | 12.3 |
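The summary above reports `R-squared (uncentered)` because the model has no constant. As a rough illustration of what that means, the sketch below fits a one-predictor model through the origin on made-up toy data and computes both versions of R². The uncentered one compares the residuals with the sum of squared `y`. The centered one compares them with the variation of `y` around its mean. The helper names and the data are assumptions for illustration only, not part of statsmodels.

```python
def fit_through_origin(x, y):
    """Least-squares slope for y ~ b*x with no intercept."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def r_squared(y, y_hat, centered):
    """R-squared relative to the mean (centered) or to zero (uncentered)."""
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    if centered:
        y_bar = sum(y) / len(y)
        ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    else:
        ss_tot = sum(yi ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data, made up for illustration: y sits well away from zero.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 11.0, 12.5, 13.0, 15.0]

b = fit_through_origin(x, y)
y_hat = [b * xi for xi in x]

r2_uncentered = r_squared(y, y_hat, centered=False)
r2_centered = r_squared(y, y_hat, centered=True)
print(r2_uncentered, r2_centered)
```

Because the two statistics use different baselines, an uncentered R-squared from a model without a constant is not directly comparable with the R-squared of a model that includes one.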


Adding Constant

In [11]:
X2 = statsmodels.api.add_constant(X)

regression2 = statsmodels.api.OLS(y,X2)
model2 = regression2.fit()
model2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.509 |
| Model: | OLS | Adj. R-squared: | 0.509 |
| Method: | Least Squares | F-statistic: | 10700.0 |
| Date: | Wed, 16 Aug 2023 | Prob (F-statistic): | 0.0 |
| Time: | 15:02:19 | Log-Likelihood: | -24899.0 |
| No. Observations: | 20640 | AIC: | 49800.0 |
| Df Residuals: | 20637 | BIC: | 49830.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.1019 | 0.019 | -5.320 | 0.000 | -0.139 | -0.064 |
| x1 | 0.4317 | 0.003 | 144.689 | 0.000 | 0.426 | 0.438 |
| x2 | 0.0174 | 0.000 | 38.726 | 0.000 | 0.017 | 0.018 |

| | | | |
|---|---|---|---|
| Omnibus: | 4099.868 | Durbin-Watson: | 0.787 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9707.077 |
| Skew: | 1.118 | Prob(JB): | 0.0 |
| Kurtosis: | 5.507 | Cond. No. | 108.0 |
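The coefficients in the summary above can be read as a prediction rule. The sketch below plugs a hypothetical block group into the fitted equation. The inputs `MedInc = 5.0` and `HouseAge = 20` are made up for illustration. The target is in units of $100,000, as stated in the dataset description.

```python
# Coefficients copied from the summary of the model with a constant.
const = -0.1019
coef_medinc = 0.4317    # x1
coef_houseage = 0.0174  # x2

# Hypothetical block group (values chosen only for illustration).
med_inc = 5.0
house_age = 20.0

predicted = const + coef_medinc * med_inc + coef_houseage * house_age
predicted_dollars = predicted * 100_000  # target is in units of $100,000

print(round(predicted, 4), round(predicted_dollars))
```

The rounded coefficients from the table limit the precision of this hand calculation; `model2.predict` would use the unrounded estimates.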


### Data Mining Perspective

Import libraries and get the data

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [16]:
y = housing.target
X = housing.data[:,0:2]

In [17]:
lm1 = LinearRegression()
model = lm1.fit(X, y)

In [18]:
model.intercept_  # not shown below: a notebook cell displays only its last expression
model.coef_

array([0.43169191, 0.01744134])
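`LinearRegression` fits an intercept by default, so its slopes should line up with the statsmodels fit that includes a constant rather than the one without. A quick check, using only numbers copied from the two outputs in this notebook:

```python
# x1 and x2 from the statsmodels summary of the model with a constant.
statsmodels_coefs = [0.4317, 0.0174]

# model.coef_ as displayed by scikit-learn.
sklearn_coefs = [0.43169191, 0.01744134]

# Round the scikit-learn values to the four decimals shown by statsmodels.
rounded = [round(c, 4) for c in sklearn_coefs]
print(rounded == statsmodels_coefs)
```

The two libraries solve the same least-squares problem here, so the agreement is expected.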

### Cross Validation with a train/test split

In [19]:
from sklearn.model_selection import train_test_split
# Hold out 33% of the rows for testing; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)
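To get a feel for the split, the arithmetic below estimates how many of the 20640 rows land in each part. It assumes that scikit-learn rounds a fractional test share up to a whole number of rows and gives the remainder to the training set; checking `X_train.shape` and `X_test.shape` would confirm the actual sizes.

```python
import math

n_samples = 20640  # number of instances reported in the dataset description
test_size = 0.33

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```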

Running the linear regression

In [21]:
lm2 = LinearRegression()
model2 = lm2.fit(X_train,y_train)

Predict the results

In [22]:
y_test_pred = model2.predict(X_test)

We use the mean squared error (MSE) to assess the performance of the regression on the test set.

In [23]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_test_pred) # calculate the MSE by comparing the actual and the predicted outputs
print("Test MSE = "+str(mse))

Test MSE = 0.6546124998083807
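The MSE is the average of the squared differences between actual and predicted values, so it is in squared target units. Taking the square root (RMSE) brings it back to the units of the target, here $100,000. The sketch below uses a hand-written helper, `mean_squared_error_manual`, which is a hypothetical name for illustration and not the scikit-learn function, checks it on toy numbers, and then converts the test MSE reported above.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy check with made-up numbers: squared errors are 0, 0.25 and 1.
toy_mse = mean_squared_error_manual([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])

# Test MSE copied from the output above.
test_mse = 0.6546124998083807
rmse = math.sqrt(test_mse)
rmse_dollars = rmse * 100_000  # target is in units of $100,000

print(toy_mse, round(rmse, 3), round(rmse_dollars))
```

An RMSE of roughly 0.81 means the typical prediction error of this two-feature model is on the order of $81,000.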


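Prediction based on the training set

In [24]:
# Caution: `model` is the lm1 fit on the full dataset (In [17]), not the model trained on X_train.
# To score the train-only model, use model2.predict(X_train); the output below comes from `model`.
y_train_pred = model.predict(X_train)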

MSE on the training set

In [25]:
mse2 = mean_squared_error(y_train, y_train_pred)
print("Train MSE = "+str(mse2))

Train MSE = 0.6533180530667242


### Cross Validation with k-fold

In [26]:
from sklearn.model_selection import cross_val_score
lm3 = LinearRegression()
# scikit-learn scorers follow "higher is better", so the MSE is returned negated.
score = cross_val_score(lm3, X, y, scoring='neg_mean_squared_error', cv=5)
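With an integer `cv` and a regressor, `cross_val_score` uses unshuffled k-fold splitting, so each test fold is a contiguous block of rows. The sketch below reproduces the fold boundaries for this dataset in plain Python. `kfold_bounds` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def kfold_bounds(n_samples, n_splits):
    """Return (start, stop) row ranges of the test fold for each split."""
    base, extra = divmod(n_samples, n_splits)
    bounds, start = [], 0
    for i in range(n_splits):
        # The first `extra` folds get one additional row when sizes are uneven.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


folds = kfold_bounds(20640, 5)
for start, stop in folds:
    print(start, stop, stop - start)
```

Since the rows are not shuffled, any ordering in the data (for example by geography) can make the folds differ from one another, which may explain some of the spread in the per-fold scores.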

In [27]:
print(score) 
print(score.mean())

[-0.56249842 -0.71756539 -0.7368484  -0.70568596 -0.66478026]
-0.6774756867751609
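The scores are negative because `neg_mean_squared_error` returns the MSE with its sign flipped. A small sketch, using the fold scores copied from the output above, turns them back into ordinary MSE values:

```python
# Fold scores copied from the printed output (negated MSE per fold).
scores = [-0.56249842, -0.71756539, -0.7368484, -0.70568596, -0.66478026]

# Flip the sign to recover the MSE of each fold, then average.
fold_mse = [-s for s in scores]
mean_mse = sum(fold_mse) / len(fold_mse)

print(fold_mse)
print(mean_mse)
```

The mean of about 0.677 is the cross-validated estimate of the MSE, and the range across folds (about 0.56 to 0.74) shows how much a single split can vary.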
