The Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The dataset columns are:

* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per \$10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000x(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - \% lower status of the population
* MEDV - Median value of owner-occupied homes in \$1000's

Importing required libraries

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split
import sagemaker

# Preparing data

Loading the csv dataset

In [5]:
dataset = pd.read_csv('../data/HousingData.csv')

In [6]:
dataset.shape

(506, 14)

In [7]:
dataset.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,,36.2


In [11]:
dataset.dtypes

MEDV       float64
CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX          int64
PTRATIO    float64
B          float64
LSTAT      float64
dtype: object

In [14]:
dataset.isna().sum()

MEDV        0
CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
dtype: int64

In [18]:
dataset.loc[dataset['CRIM'].isna()]

Unnamed: 0,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
53,23.4,,21.0,5.64,0.0,0.439,5.998,21.4,6.8147,4,243,16.8,396.9,8.43
115,18.3,,0.0,10.01,0.0,0.547,5.928,88.2,2.4631,6,432,17.8,344.91,15.76
183,32.5,,0.0,2.46,0.0,0.488,6.563,95.6,2.847,3,193,17.8,396.9,5.68
191,30.5,,45.0,3.44,0.0,0.437,6.739,30.8,6.4798,5,398,15.2,389.71,4.69
192,36.4,,45.0,3.44,0.0,0.437,7.178,26.3,6.4798,5,398,15.2,390.49,2.87
196,33.3,,80.0,1.52,0.0,0.404,7.287,34.1,7.309,2,329,12.6,396.9,4.08
229,31.5,,0.0,6.2,0.0,0.504,6.552,21.4,3.3751,8,307,17.4,380.34,3.76
236,25.1,,0.0,6.2,1.0,0.507,6.631,76.5,4.148,8,307,17.4,388.45,9.54
241,20.1,,30.0,4.93,0.0,0.428,6.095,65.1,6.3361,6,300,16.6,394.62,12.4
262,48.8,,20.0,3.97,0.0,0.647,8.398,91.5,2.2885,5,264,13.0,386.86,5.91


In [19]:
dataset = dataset.fillna(0.0)

In [20]:
dataset.isna().sum()

MEDV       0
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64
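Filling every gap with 0.0 is the simplest option, but zero is also a legitimate value for columns such as ZN or CHAS, so filled and real zeros become indistinguishable. As an alternative, here is a minimal sketch that assumes per-column median imputation instead (this is not what the notebook does), shown on a small made-up frame:

```python
import pandas as pd

# Toy frame with made-up values; None marks a missing entry.
toy = pd.DataFrame({
    "CRIM": [0.1, None, 0.3],
    "ZN": [0.0, 20.0, None],
})

# Replace each missing value with the median of its own column.
toy_filled = toy.fillna(toy.median())
print(toy_filled)
```

Whether median, mean, or zero filling is the better choice depends on the column and should be checked against the validation metrics.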

For CSV training data, Amazon SageMaker built-in algorithms require that the file has no header record and that the target variable is in the first column.

In [8]:
dataset = pd.concat([dataset['MEDV'], dataset.drop(['MEDV'], axis=1)], axis=1)

Checking that the 'MEDV' column has been moved to the first position

In [9]:
dataset.head()

Unnamed: 0,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,24.0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98
1,21.6,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14
2,34.7,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03
3,33.4,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94
4,36.2,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,
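The same reordering can be sketched on a toy frame with made-up values. This variant selects columns by name instead of using pd.concat; it is an equivalent alternative, not the notebook's own cell:

```python
import pandas as pd

# Toy frame with the target column in the last position.
toy = pd.DataFrame({"CRIM": [0.1, 0.2], "RM": [6.5, 6.4], "MEDV": [24.0, 21.6]})

# Put the target first and keep the remaining columns in their original order.
target = "MEDV"
reordered = toy[[target] + [c for c in toy.columns if c != target]]
print(list(reordered.columns))
```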


Splitting the dataframe into two parts: 90% for training and 10% for validation

In [24]:
training_dataset, validation_dataset = train_test_split(dataset, test_size=0.1, random_state=12345)
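For a 506-row dataset the two parts cannot be exactly 90% and 10%. A small arithmetic sketch of the resulting sizes, assuming the test share is rounded up and the remainder goes to training (which is how train_test_split handles a fractional test_size, to the best of our knowledge):

```python
import math

n_samples = 506   # rows in the dataset, from dataset.shape above
test_size = 0.1   # same value passed to train_test_split

# Assumption: the validation share is rounded up, the rest is used for training.
n_validation = math.ceil(test_size * n_samples)
n_training = n_samples - n_validation
print(n_training, n_validation)
```

Under that assumption the split gives 455 training rows and 51 validation rows; training_dataset.shape and validation_dataset.shape would confirm it.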

Save these two splits to individual CSV files, without either an index or a header

In [25]:
training_dataset.to_csv('../data/training_dataset.csv', index=False, header=False)
validation_dataset.to_csv('../data/validation_dataset.csv', index=False, header=False)
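A quick way to see what index=False and header=False do is to render a toy frame (made-up values) to text and look at the first line:

```python
import pandas as pd

toy = pd.DataFrame({"MEDV": [24.0, 21.6], "CRIM": [0.00632, 0.02731]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = toy.to_csv(index=False, header=False)
first_line = csv_text.splitlines()[0]
print(first_line)
```

The first line is already a data row starting with the target value, with no column names and no row index in front of it.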

Uploading these two files to S3, using the default bucket that SageMaker creates in the region we're running in.

In [28]:
sess = sagemaker.Session()
bucket = sess.default_bucket()

In [29]:
bucket

'sagemaker-eu-central-1-065391449321'

In [30]:
prefix = 'boston-housing'

training_data_path = sess.upload_data(
    path='../data/training_dataset.csv',
    key_prefix=prefix+'/input/training'
)

validation_data_path = sess.upload_data(
    path='../data/validation_dataset.csv',
    key_prefix=prefix+'/input/validation'
)

In [31]:
print(training_data_path)
print(validation_data_path)

s3://sagemaker-eu-central-1-065391449321/boston-housing/input/training/training_dataset.csv
s3://sagemaker-eu-central-1-065391449321/boston-housing/input/validation/validation_dataset.csv
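The URIs printed above follow the pattern s3://bucket/key_prefix/filename. A minimal sketch that rebuilds this pattern; expected_s3_uri is a hypothetical helper written for illustration, not part of the SageMaker SDK:

```python
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Hypothetical helper: rebuild the URI pattern seen in the output above."""
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


uri = expected_s3_uri(
    "sagemaker-eu-central-1-065391449321",
    "boston-housing/input/training",
    "../data/training_dataset.csv",
)
print(uri)
```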


# Configuring a training job

In [39]:
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
from sagemaker import TrainingInput

SageMaker algorithms are packaged in Docker containers.
Using the region from the session's boto3 session and image_uris.retrieve(), it's easy to find the container image URI of the Linear Learner algorithm for the region we're running in.

In [33]:
region = sess.boto_session.region_name
container = retrieve('linear-learner', region)

In [35]:
print(region)
print(container)

eu-central-1
664544806723.dkr.ecr.eu-central-1.amazonaws.com/linear-learner:1
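The printed value is an Amazon ECR image URI. As a small sketch, plain string handling is enough to split it into its parts and check that the image lives in the same region as the session (the variable names here are for illustration only):

```python
image_uri = "664544806723.dkr.ecr.eu-central-1.amazonaws.com/linear-learner:1"

# Split "registry/repository:tag" into its three parts.
registry, _, repo_and_tag = image_uri.partition("/")
repository, _, tag = repo_and_tag.partition(":")

# The registry host has the form <account>.dkr.ecr.<region>.amazonaws.com
registry_parts = registry.split(".")
account_id = registry_parts[0]
ecr_region = registry_parts[3]
print(account_id, ecr_region, repository, tag)
```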


Configuring the training job with an Estimator object.

In [56]:
ll_estimator = Estimator(
    container, # container image URI
    role=get_execution_role(), # IAM role that SageMaker instances will use
    instance_count=1, # number of instances used for training
    instance_type='ml.m4.2xlarge', # type of instances used for training
    use_spot_instances=True, # use spot instances (cheaper)
    max_run=1 * 60 * 60, # maximum training time in seconds (1 hour)
    max_wait=1 * 60 * 60, # maximum total time in seconds, including waiting for spot capacity
    output_path=f's3://{bucket}/{prefix}/output' # output path for the model
)
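With spot instances, max_wait covers both the time spent waiting for spot capacity and the training itself, so it must not be smaller than max_run. A minimal sketch of a pre-flight check; check_spot_limits is a hypothetical helper, not part of the SageMaker SDK:

```python
def check_spot_limits(max_run: int, max_wait: int) -> None:
    """Hypothetical check: raise if max_wait is smaller than max_run."""
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must be at least max_run ({max_run}s)"
        )


max_run = 1 * 60 * 60    # 3600 seconds, as in the Estimator above
max_wait = 1 * 60 * 60   # equal to max_run, so no extra time to wait for capacity
check_spot_limits(max_run, max_wait)
```

Setting max_wait equal to max_run is valid, but it leaves no slack for waiting on spot capacity; a larger max_wait is an option if jobs get interrupted.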

Setting hyperparameters.
Only one hyperparameter (predictor_type) is mandatory; the rest have default values.
The normalize_data parameter is set to true by default, so there is no need to normalize the data ourselves.
The default value of mini_batch_size is 1000, which isn't going to work well for a 506-sample dataset, so we change it to 32.

In [57]:
ll_estimator.set_hyperparameters(
    predictor_type='regressor',
    mini_batch_size=32
)
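To see why the batch size matters here, a short arithmetic sketch. It assumes the training split holds 455 rows (90% of 506, rounded down); how Linear Learner treats a partial final batch is not covered here:

```python
import math

n_training = 455  # assumed size of the training split

# Number of mini-batches needed to cover the training set once.
batches_per_epoch = {
    batch_size: math.ceil(n_training / batch_size)
    for batch_size in (1000, 32)
}
print(batches_per_epoch)
```

With a batch size of 1000 the whole training set does not even fill a single batch, while 32 gives about 15 batches per epoch.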

Defining the data channels: a channel is a named source of data passed to a SageMaker estimator.
All built-in algorithms need at least a training channel, and many also accept validation and testing channels.

In [50]:
training_data_channel = TrainingInput(
    s3_data=training_data_path,
    content_type='text/csv'
)

validation_data_channel = TrainingInput(
    s3_data=validation_data_path,
    content_type='text/csv'
)

# Launching a training job

In [58]:
ll_estimator.fit(
    {
        'train': training_data_channel,
        'validation': validation_data_channel
    }
)

2022-10-21 20:42:03 Starting - Starting the training job...
2022-10-21 20:42:26 Starting - Preparing the instances for training
ProfilerReport-1666384923: InProgress
............
2022-10-21 20:44:33 Downloading - Downloading input data
2022-10-21 20:44:33 Training - Downloading the training image...............
2022-10-21 20:46:54 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[10/21/2022 20:46:57 INFO 140694817969984] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 

Model artifact