## Machine Learning and Statistics Assessment 2019

For this project assessment, I shall be exploring the ubiquitous Boston Housing Dataset. This small dataset of 506 cases and 14 attributes was initially published by Harrison, D. and Rubinfeld, D.L. in "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978 (http://lib.stat.cmu.edu/datasets/boston).

The data was collected by the US Census Service and describes housing in the Boston, Massachusetts area. The figures derive from the 1970 census, while the paper describing them appeared in 1978.
Before analysing any dataset, it is prudent to consider how the societal attitudes of the time can influence what sort of data is gathered. This dataset has an attribute, "B", derived from the proportion of Black residents in a town. Personally, I would consider recording such an attribute racially biased.
However, this attitude was prevalent in the USA at the time, so I'll leave the attribute in the dataset.
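For reference, scikit-learn's description of the dataset defines "B" not as a raw proportion but as 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town. A minimal sketch of that transform follows. The function name `b_from_proportion` is a hypothetical one of my own and not part of any library. It shows that a town with Bk = 0 gets the value 396.9, and that two different proportions can map to the same B, so the original proportion cannot be recovered uniquely from this column.

```python
def b_from_proportion(bk):
    """Transform used for the 'B' attribute: 1000 * (Bk - 0.63)^2."""
    return 1000 * (bk - 0.63) ** 2


# A town with Bk = 0 maps to 1000 * 0.63^2
print(round(b_from_proportion(0.0), 2))

# Proportions equally far either side of 0.63 give the same B
print(round(b_from_proportion(0.53), 2), round(b_from_proportion(0.73), 2))
```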

#### 1. Descriptive Statistics

To begin this assessment project, I'll import the required Python libraries and the Boston Housing dataset included in the scikit-learn package.


In [29]:
# import required packages
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
import pandas as pd

#import Boston dataset from scikit-learn package
from sklearn.datasets import load_boston

#assign dataset to boston variable
boston = load_boston()
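
A note for readers on newer installations: `load_boston` was removed in scikit-learn 1.2, so the cell above only runs on older versions. As a hedged sketch of one alternative, the raw file at the StatLib URL cited in the introduction can be parsed directly with pandas. This assumes the file still has its layout of 22 header lines followed by records split over two lines (11 values, then 3). `parse_boston_raw` is a hypothetical helper name of my own.

```python
import numpy as np
import pandas as pd

# Column names as printed by boston.feature_names in this notebook
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def parse_boston_raw(source, skiprows=22):
    """Parse the raw StatLib layout into a features frame and a MEDV series.

    `source` may be a URL, a file path, or an open text buffer.
    Assumes each record spans two lines: 11 values, then 3 values.
    """
    raw = pd.read_csv(source, sep=r"\s+", skiprows=skiprows, header=None)
    values = raw.values
    # Even rows hold the first 11 features; odd rows hold 2 more plus MEDV
    data = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    features = pd.DataFrame(data, columns=FEATURE_NAMES)
    return features, pd.Series(target, name="MEDV")


# Usage (needs network access, so it is left commented out):
# df_features, medv = parse_boston_raw("http://lib.stat.cmu.edu/datasets/boston")
```

The helper returns the features and the target separately, mirroring the `data` and `target` split of the scikit-learn object used in the rest of this notebook.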

In [30]:
#Output Description of dataset from scikit-learn, describes attributes 
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

I'll explore the feature names and create a pandas dataframe from the dataset.

In [31]:
# output the feature names of columns in dataset
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [32]:
#create dataframe with pandas
df_boston = pd.DataFrame(boston.data,columns=boston.feature_names)

#display head of dataset
df_boston.head()

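| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |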


From reading the description of the dataset and viewing the head() output, I can see that the target variable, 'MEDV', has been omitted. This isn't a problem, as I can add it to the dataset with the code below.

This variable refers to the median value of homes, in units of $1,000.

'LSTAT' refers to the percentage of the population classed as lower status in the area. We can already see an element of bias by present standards. I think this is interesting, as it shows that data can also be a snapshot of the societal attitudes of its time.

The variable 'NOX' can also serve as a target attribute. It refers to the concentration of nitric oxides in the air of an area, which fits the original paper's concern with the demand for clean air.

In [33]:
# add the missing 'MEDV' target variable to the dataframe
df_boston['MEDV'] = pd.Series(boston.target)

The house values are recorded in units of $1,000, so I'll convert the 'MEDV' column to dollars to make this clearer:

In [34]:
# convert MEDV from units of $1,000 to dollars
df_boston['MEDV'] = (df_boston['MEDV']*1000)
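
One caveat with the in-place conversion above: in a notebook, re-running that cell multiplies the column by 1,000 a second time. A minimal sketch of an idempotent alternative derives the dollar column from the untouched target array each time. The toy values are the first three rows of RM, LSTAT and MEDV from this dataset. The names `features`, `target` and `df` are illustrative stand-ins for `boston.data`, `boston.target` and `df_boston`.

```python
import numpy as np
import pandas as pd

# First three rows of RM and LSTAT, with MEDV in $1,000s (toy stand-ins)
features = np.array([[6.575, 4.98], [6.421, 9.14], [7.185, 4.03]])
target = np.array([24.0, 21.6, 34.7])

df = pd.DataFrame(features, columns=["RM", "LSTAT"])

# Always compute from the original target, never from the converted column,
# so running this line twice gives the same result
df["MEDV"] = target * 1000
df["MEDV"] = target * 1000

print(df)
```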

In [35]:
#Keys of dataset
print(boston.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [36]:
#Shape of dataset
print(boston.data.shape)

(506, 13)


In [37]:
#Feature names of dataset
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [38]:
df_boston

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24000.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21600.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34700.0 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33400.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36200.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22400.0 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20600.0 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23900.0 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22000.0 |


Now that we have our data, I'll check for any null values.

In [40]:
# Check for NULL values
df_boston.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

No NULL or NaN values were found, so we can continue. I'll use NumPy to round the dataset's summary statistics to two decimal places.

In [43]:
#Rounding off and summary stats
np.round(df_boston.describe(),2)

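| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.8 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 | 22532.81 |
| std | 8.6 | 23.32 | 6.86 | 0.25 | 0.12 | 0.7 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 | 9197.1 |
| min | 0.01 | 0.0 | 0.46 | 0.0 | 0.38 | 3.56 | 2.9 | 1.13 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5000.0 |
| 25% | 0.08 | 0.0 | 5.19 | 0.0 | 0.45 | 5.89 | 45.02 | 2.1 | 4.0 | 279.0 | 17.4 | 375.38 | 6.95 | 17025.0 |
| 50% | 0.26 | 0.0 | 9.69 | 0.0 | 0.54 | 6.21 | 77.5 | 3.21 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21200.0 |
| 75% | 3.68 | 12.5 | 18.1 | 0.0 | 0.62 | 6.62 | 94.07 | 5.19 | 24.0 | 666.0 | 20.2 | 396.22 | 16.96 | 25000.0 |
| max | 88.98 | 100.0 | 27.74 | 1.0 | 0.87 | 8.78 | 100.0 | 12.13 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50000.0 |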
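
The summary table rewards a closer look: for CRIM the mean (3.61) sits far above the median (0.26), which points to a strongly right-skewed distribution. As a small sketch of how to check this, the snippet below uses only the five quantile values of CRIM from the table as toy data (not the full column, so the numbers are illustrative). It also shows pandas' own `round` method as an alternative to `np.round`.

```python
import numpy as np
import pandas as pd

# Toy data: min, 25%, 50%, 75% and max of CRIM from the summary table
crim = pd.Series([0.01, 0.08, 0.26, 3.68, 88.98], name="CRIM")

# Same result as np.round(crim.describe(), 2)
summary = crim.describe().round(2)
print(summary)

# A mean well above the median, and a positive skewness,
# both indicate a long right tail
print("mean > median:", crim.mean() > crim.median())
print("skewness:", round(crim.skew(), 2))
```

On the real dataframe, the same calls would be `df_boston['CRIM'].skew()` and `df_boston.describe().round(2)`.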


#### 2. Inferential Statistics

#### 3. Predictive Statistics with Keras
