DESCRIPTION

A real estate company plans to build homes at several locations in Boston. It has historical price data but has not yet set prices for the new homes. It wants to set prices that are affordable to the general public.

 

DATA DESCRIPTION:

You are expected to work with the Boston Housing dataset from the scikit-learn library, which contains information about housing in the Boston area.

 

It consists of 506 samples and 13 feature variables.

 

The dataset object returned by scikit-learn exposes the following attributes:

data: The feature values, one row per sample.

target: The median home value for each sample, in thousands of dollars.

feature_names: The names of all the features in the dataset.

DESCR: The full text description of the dataset.
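
The same attribute layout is shared by scikit-learn's other bundled datasets. The sketch below uses `load_diabetes` purely as a stand-in to show how `data`, `target`, `feature_names`, and `DESCR` are accessed. It is not the Boston data, and the choice of stand-in is an assumption made so the example also runs on scikit-learn releases where `load_boston` is unavailable.

```python
from sklearn.datasets import load_diabetes  # stand-in dataset, not the Boston data

bunch = load_diabetes()

features = bunch.data              # 2-D array: one row per sample, one column per feature
target = bunch.target              # 1-D array: one value per sample
names = list(bunch.feature_names)  # column names, in the same order as the columns of data
description = bunch.DESCR          # free-text description of the dataset

print(features.shape, target.shape)
print(names)
print(description[:200])
```

The number of columns in `data` always matches the length of `feature_names`, which is why the two can be passed together to `pd.DataFrame`.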

 

The columns in the dataset, commonly known as feature names, are as follows:

 

| Column Name | Column Description |
| --- | --- |
| CRIM | Per capita crime rate by town |
| ZN | Proportion of residential land zoned for lots over 25,000 sq. ft |
| INDUS | Proportion of nonretail business acres per town |
| CHAS | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| NOX | Nitric oxide concentration (parts per 10 million) |
| RM | Average number of rooms per dwelling |
| AGE | Proportion of owner-occupied units built prior to 1940 |
| DIS | Weighted distances to five Boston employment centers |
| RAD | Index of accessibility to radial highways |
| TAX | Full-value property tax rate per $10,000 |
| PTRATIO | Pupil-teacher ratio by town |
| B | 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town |
| LSTAT | Percentage of lower status of the population |
| MEDV | Median value of owner-occupied homes in $1000s |

 

Here, MEDV is the target variable: it gives the median home value, in $1000s, for each record. The remaining 13 columns are the feature variables used to estimate it.
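
As a rough illustration of that split, the sketch below separates a target column from the feature columns in a small pandas DataFrame. Only three of the feature columns are included, and the MEDV values are hypothetical placeholders chosen for the example.

```python
import pandas as pd

# Three feature columns for a few rows; the MEDV values are hypothetical
# placeholders added only to illustrate the split.
frame = pd.DataFrame({
    "CRIM": [0.00632, 0.02731, 0.02729],
    "RM": [6.575, 6.421, 7.185],
    "LSTAT": [4.98, 9.14, 4.03],
    "MEDV": [24.0, 21.6, 34.7],
})

features = frame.drop(columns="MEDV")  # every column except the target
target = frame["MEDV"]                 # the column to be predicted

print(features.columns.tolist())
print(target.tolist())
```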

 

Objectives:

• Import the Boston data from scikit-learn and read the description using DESCR.

• Analyze the data and predict the approximate prices for the houses.

In [1]:
import pandas as pd
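# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook requires an earlier scikit-learn version.
from sklearn.datasets import load_boston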
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_squared_error
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
boston_dataset = load_boston()

In [3]:
print(boston_dataset.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [4]:
print(boston_dataset.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

### EDA

In [5]:
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [6]:
X = boston_dataset.data
y = boston_dataset.target

In [7]:
df.shape

(506, 13)

In [8]:
y.size

506

In [14]:
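# Caution: astype(int) truncates toward zero (e.g. 21.6 becomes 21), so the target
# loses precision. Linear regression does not require an integer target.
y = y.astype(int)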

In [15]:
y.mean()

22.114624505928855
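
The `y.astype(int)` cast drops the fractional part of every target value, so the mean computed afterwards is lower than the mean of the original values. A minimal sketch of the effect, using a handful of hypothetical prices (in $1000s) rather than the actual dataset:

```python
import numpy as np

# Hypothetical median values in $1000s, chosen only for illustration
prices = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

truncated = prices.astype(int)         # drops the fractional part: 21.6 -> 21
rounded = np.rint(prices).astype(int)  # rounds to the nearest integer: 21.6 -> 22

print(prices.mean())     # mean of the original float values
print(truncated.mean())  # never higher than the original mean for positive values
print(rounded.mean())
```

For a regression task the float values can simply be kept as they are. The cast is only useful if an integer target is genuinely required.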

In [23]:
# Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# Fit a linear regression model on the unmodified feature values
# (casting X to int would truncate fractional features such as CRIM, NOX, and RM)
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

LinearRegression()

In [24]:
# Predict target values for the test set
preds = lr_model.predict(X_test)

In [25]:
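# confusion matrix
# Note: confusion_matrix is a classification metric. preds holds continuous
# regression outputs, so this call raises the ValueError shown below.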
from sklearn.metrics import confusion_matrix
confusionmatrix = confusion_matrix(y_test.astype(int), preds)
print(confusionmatrix)

ValueError: Classification metrics can't handle a mix of multiclass and continuous targets

In [12]:
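# classification report
# Note: classification_report also expects discrete class labels, so it fails
# on continuous predictions with the ValueError shown below.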
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))

ValueError: continuous is not supported
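
Both errors arise because `confusion_matrix` and `classification_report` expect discrete class labels, whereas a linear regression produces continuous predictions. A minimal sketch of a regression-appropriate evaluation, using the `r2_score` and `mean_squared_error` functions the notebook already imports, might look like the following. The data is synthetic (an assumption made so the example runs without `load_boston`), and the names `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 200 samples, 3 features,
# a linear target plus Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + 22.0 + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

model = LinearRegression().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Regression metrics compare continuous predictions with continuous targets
rmse = np.sqrt(mean_squared_error(y_te, y_pred))  # same units as the target
r2 = r2_score(y_te, y_pred)                       # share of variance explained
print(f"RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```

In the notebook itself, the same two calls could be applied to `y_test` and `preds`. RMSE would then be expressed in the units of MEDV, that is, thousands of dollars.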