# Author: Emmanuel Rodriguez

https://emmanueljrodriguez.com/

Date: 18 May 2022

Location: West Texas

## Requirements:

Perform multiple linear regression on the California housing dataset: fit a machine learning regression model that uses several features at once, then use it to predict median house values.

Ref: Deitel, P., & Deitel, H. M. (2019). Intro to Python for Computer Science and Data Science. Pearson Education (US). https://bookshelf.vitalsource.com/books/9780135404812, p. 625.
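As a rough illustration of what "multiple linear regression" means here, the sketch below fits a model of the form `y = b0 + b1*x1 + b2*x2 + b3*x3` by ordinary least squares. It is not the notebook's method: it assumes synthetic data (not the California housing data), uses NumPy's `np.linalg.lstsq` in place of a scikit-learn estimator, and the "true" coefficients are invented for the demonstration.

```python
import numpy as np

# Synthetic stand-in data (NOT the California housing data): 200 samples, 3 features.
rng = np.random.default_rng(seed=11)
X = rng.normal(size=(200, 3))

# Hypothetical "true" coefficients and intercept, chosen only for this illustration.
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated along with the coefficients.
X_design = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: minimize ||X_design @ w - y||^2.
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coef, intercept = w[:-1], w[-1]

# Predict the target for a new, unseen sample.
new_sample = np.array([1.0, 0.0, -1.0])
prediction = new_sample @ coef + intercept
print(coef, intercept, prediction)
```

With little noise and many samples, the estimated coefficients should land close to the invented ones, which is a quick sanity check that the fit works.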

### Read data

In [1]:
from sklearn.datasets import fetch_california_housing # Returns a 'Bunch' object containing the data and other info

california = fetch_california_housing()

List the attributes of the returned object:

In [15]:
# california.__dict__ is empty for a Bunch, so list its keys instead
california.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])
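scikit-learn's `Bunch` is a dictionary subclass that exposes its keys as attributes, which is why the instance `__dict__` is not the place to look. The sketch below imitates that behaviour with a hypothetical `MiniBunch` class. It is a simplified stand-in written for this note, not scikit-learn's implementation, and the field values are made up.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for a Bunch-like container; not scikit-learn's code."""

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails; fall back to dict items.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __dir__(self):
        # Report the dictionary keys as the object's attributes.
        return self.keys()


bunch = MiniBunch(data=[[1.0, 2.0]], target=[3.0], feature_names=['a', 'b'])

print(vars(bunch))         # instance __dict__ is empty: values live in the dict itself
print(list(bunch.keys()))  # the stored fields, in insertion order
print(dir(bunch))          # dir() sorts whatever __dir__ returns
print(bunch.target)        # attribute-style access resolves to bunch['target']
```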

### Dataset description

In [2]:
# Display the dataset's description using the DESCR attribute
print(california.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

Note: Each sample also has a target, the corresponding median house value in units of $100,000, so 3.55 represents $355,000.
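The unit conversion in the note can be checked with a few lines of arithmetic. The helper name `to_dollars` is made up for this sketch. It rounds to whole dollars so floating-point noise does not show up in the result.

```python
def to_dollars(median_house_value):
    """Convert a target value in units of $100,000 to whole dollars."""
    return round(median_house_value * 100_000)


for value in (3.55, 4.526):
    print(f'{value} -> ${to_dollars(value):,}')
```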

In [3]:
# Confirm the number of samples (rows) and features (columns)
california.data.shape

(20640, 8)

In [4]:
# Verify that the number of target values (the median house values) matches the number of samples
california.target.shape

(20640,)
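The two shape checks above can also be written as an explicit assertion, so a mismatch fails loudly instead of relying on a visual comparison. This sketch assumes small zero-filled arrays as stand-ins for `california.data` and `california.target`; only the layout (samples by features, one target per sample) mirrors the real dataset.

```python
import numpy as np

# Synthetic stand-ins with the same layout as the dataset's data and target arrays.
data = np.zeros((5, 8))   # 5 samples, 8 features
target = np.zeros(5)      # one target value per sample

n_samples, n_features = data.shape

# Fail early if the targets do not line up with the samples.
assert target.shape == (n_samples,), 'expected exactly one target per sample'
print(n_samples, n_features)
```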

In [16]:
# Use the object's 'feature_names' attribute to show the features
california.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

### Initial exploration of data with Pandas

In [17]:
import pandas as pd

# Set some options
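pd.set_option('display.precision', 4)   # fully qualified name; a bare 'precision' is ambiguous in newer pandas
pd.set_option('display.max_columns', 9) # 8 features + the target column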
pd.set_option('display.width', None)
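`pd.set_option` changes the display settings for the rest of the session. If that is not wanted, one alternative is `pd.option_context`, which applies options only inside a `with` block. The sketch below assumes a throwaway two-row DataFrame and uses the fully qualified `display.` option names.

```python
import pandas as pd

# Throwaway data, just to have something to print.
df = pd.DataFrame({'x': [1.23456789, 2.3456789]})

before = pd.get_option('display.precision')

# Options given as name/value pairs apply only inside the block.
with pd.option_context('display.precision', 4, 'display.max_columns', 9):
    inside = pd.get_option('display.precision')
    print(df)

# On exit, the previous settings are restored.
after = pd.get_option('display.precision')
print(before, inside, after)
```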

In [18]:
# Create a DataFrame from the data, using the feature names as column names
california_df = pd.DataFrame(california.data, columns=california.feature_names)

# Add a column for the median house values stored in .target
california_df['MedHouseValue'] = pd.Series(california.target)
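The same pattern can be tried without downloading the dataset. The sketch below assumes a tiny synthetic array and hypothetical column names. It also shows that assigning a `pd.Series` aligns on the index, which matches row by row here because both sides use the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-ins (NOT real dataset values).
data = np.arange(6, dtype=float).reshape(3, 2)
feature_names = ['FeatureA', 'FeatureB']   # hypothetical column names
target = np.array([1.5, 2.5, 3.5])

df = pd.DataFrame(data, columns=feature_names)

# A Series is aligned to the DataFrame by index label (0, 1, 2 on both sides).
df['Target'] = pd.Series(target)

# Assigning the array directly places values by position; same result here.
df['TargetByPosition'] = target
print(df)
```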

In [19]:
california_df.head() # Show first 5 samples

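|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseValue |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|---------------|
| 0 | 8.3252 | 41.0 | 6.9841 | 1.0238 | 322.0 | 2.5556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.2381 | 0.9719 | 2401.0 | 2.1098 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.2881 | 1.0734 | 496.0 | 2.8023 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.8174 | 1.0731 | 558.0 | 2.5479 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.2819 | 1.0811 | 565.0 | 2.1815 | 37.85 | -122.25 | 3.422 |
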
In [None]:
# Continue, p. 629