# Housing prices in Boston


This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

There are 14 variables:
* CRIM : per capita crime rate by town
* ZN : proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS : proportion of non-retail business acres per town
* CHAS : Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX : nitric oxides concentration (parts per 10 million)
* RM : average number of rooms per dwelling
* AGE : proportion of owner-occupied units built prior to 1940
* DIS : weighted distances to five Boston employment centres
* RAD : index of accessibility to radial highways
* TAX : full-value property-tax rate per 10,000 US dollars
* PTRATIO : pupil-teacher ratio by town
* B : $1000(Bk - 0.63)^2$, where Bk is the proportion of Black residents by town
* LSTAT : percentage of the population of lower status
* MEDV : median value of owner-occupied homes, in thousands of US dollars
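
The B variable is a transformation of the proportion Bk rather than the proportion itself, so it is not monotonic in Bk. The sketch below simply evaluates the formula from the list above for a few illustrative proportions. The function name `b_index` is hypothetical, and the inputs are not values taken from the dataset.

```python
def b_index(bk: float) -> float:
    """Return 1000 * (bk - 0.63)**2 for a proportion bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


# Illustrative proportions only, not values taken from the dataset.
for bk in (0.0, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_index(bk):.1f}")
```

B reaches its maximum of 396.9 at Bk = 0 and falls to 0 at Bk = 0.63, so two different proportions on either side of 0.63 can produce the same B value.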

With this dataset, the classical goal is to predict MEDV from the first 13 variables.
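
As a rough illustration of this kind of regression task, the sketch below fits an ordinary least-squares linear model by solving the normal equations in plain Python. Linear regression is only one possible modelling choice and is an assumption here, the helper names `fit_ols` and `predict` are hypothetical, and the small input table is synthetic placeholder data, not rows from the Boston dataset.

```python
def fit_ols(rows, targets):
    """Fit y ~ b0 + b1*x1 + ... by solving the normal equations (X^T X) b = X^T y."""
    # Prepend a constant 1.0 to every row for the intercept term.
    design = [[1.0] + [float(x) for x in row] for row in rows]
    k = len(design[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * y for d, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot_row][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The last column now holds the coefficients (intercept first).
    return [aug[i][k] for i in range(k)]


def predict(coeffs, row):
    """Evaluate the fitted linear model on one row of predictors."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))


# Synthetic placeholder data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [1, 3, 4, 6, 8]
coeffs = fit_ols(rows, targets)
print(coeffs, predict(coeffs, (3, 2)))
```

With the real data, `rows` would hold the 13 predictor values of each case and `targets` the corresponding MEDV values. A dedicated linear algebra or statistics library would be the more robust choice for actual work.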

## References

* Regression Analysis with Python, Luca Massaron, Alberto Boschetti, Packt Publishing
* https://archive.ics.uci.edu/ml/datasets/Housing
* Harrison, D., Jr. and Rubinfeld, D. L. (1978). "Hedonic housing prices and the demand for clean air." Journal of Environmental Economics and Management, 5(1): 81-102.
* Belsley, Kuh & Welsch (1980). "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity." Wiley, 244-261.
* Quinlan, R. (1993). "Combining Instance-Based and Model-Based Learning." In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [1]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
import openturns as ot

In [2]:
boston = load_boston()
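
Since `load_boston` is not available in recent scikit-learn releases, one alternative is to parse the raw StatLib file directly. The sketch below rebuilds the records from a list of text lines. It assumes the layout of the commonly distributed copy (22 descriptive header lines, then each case split over two lines holding 11 and 3 values); that layout, the function name `parse_statlib_records`, and the `header_lines` parameter are assumptions, not something stated in this notebook. The code does not download anything; the lines would have to be read from a local copy of the file.

```python
FEATURE_NAMES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]


def parse_statlib_records(lines, header_lines=22):
    """Rebuild (features, targets) from records spread over two text lines each."""
    # Skip the descriptive header and ignore blank lines.
    body = [ln.split() for ln in lines[header_lines:] if ln.strip()]
    if len(body) % 2:
        raise ValueError("expected an even number of data lines")
    features, targets = [], []
    for first, second in zip(body[0::2], body[1::2]):
        values = [float(v) for v in first + second]
        if len(values) != len(FEATURE_NAMES) + 1:
            raise ValueError(f"expected 14 values per record, got {len(values)}")
        features.append(values[:-1])   # the 13 input variables
        targets.append(values[-1])     # MEDV
    return features, targets


# A single record in the assumed two-line layout (no header, so header_lines=0).
example = [
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30",
    "396.90 4.98 24.00",
]
features, targets = parse_statlib_records(example, header_lines=0)
print(dict(zip(FEATURE_NAMES, features[0])), targets[0])
```

The resulting lists play the same role as `boston.data` and `boston.target` in the cells that follow.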

In [3]:
p = boston.data.shape[1]
p

13

In [4]:
n = boston.data.shape[0]
n

506

In [5]:
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [6]:
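# Allocate an n x (p + 1) sample: p input columns plus one column for MEDV.
sample = ot.Sample(n, p + 1)
# Copy the 13 input variables into the first p columns.
sample[:, 0:p] = boston.data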

In [7]:
# Store the target (MEDV) in the last column.
sample[:, p] = ot.Sample(boston.target, 1)

In [8]:
# Column names: the 13 input variables followed by the target.
descr = list(boston.feature_names)
descr.append("MEDV")
descr

['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'B',
 'LSTAT',
 'MEDV']

In [9]:
sample.setDescription(descr)

In [10]:
sample.exportToCSVFile("Housing-prices-Boston.csv")
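
To reuse the exported file without OpenTURNS, the CSV can be read back with the standard library and split into inputs and output. This is a minimal sketch: the function name `read_housing_csv` is hypothetical, and the `;` separator and the quoted header line are assumptions about the default export format, so the `delimiter` argument may need adjusting. The demo runs on a tiny three-column placeholder file rather than on `Housing-prices-Boston.csv`.

```python
import csv
import os
import tempfile


def read_housing_csv(path, delimiter=";", target="MEDV"):
    """Read a CSV with a header line and split it into inputs and the target column."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = [name.strip() for name in next(reader)]
        rows = [[float(v) for v in row] for row in reader if row]
    t = header.index(target)
    input_names = header[:t] + header[t + 1:]
    inputs = [row[:t] + row[t + 1:] for row in rows]
    output = [row[t] for row in rows]
    return input_names, inputs, output


# Demo on a small placeholder file with only three of the 14 columns.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        f.write('"CRIM";"RM";"MEDV"\n0.00632;6.575;24\n0.02731;6.421;21.6\n')
    names, inputs, output = read_housing_csv(demo_path)

print(names, inputs, output)
```

For the real file, the call would be `read_housing_csv("Housing-prices-Boston.csv")`, assuming the export used the separator described above.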