
# Regression
The goal of a regression algorithm is to model the relationship between a continuous target variable and one or more input variables. In this notebook we will try to predict the price of houses in Boston from a set of numerical features. The [Boston Housing](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) dataset is small, and its features have very different ranges of values. It can be downloaded through TensorFlow. The attributes of the dataset are:

1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
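14. MEDV - median value of owner-occupied homes in $1000's (the target, not an input column)

The target feature is MEDV, the median value of owner-occupied homes. It is returned separately from the 13 input features, as `train_targets` and `test_targets`.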

In [1]:
from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz


In [3]:
train_data.shape, test_data.shape

((404, 13), (102, 13))
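
The 13 columns can be addressed by name with a small lookup table. The following is a minimal sketch, assuming the column order of the arrays matches the numbered list above (this is not verified here). `FEATURE_NAMES`, `FEATURE_INDEX` and `column` are hypothetical helpers, and `stand_in` is a synthetic array with the same shape as `train_data`, not the real data.

```python
import numpy as np

# Feature names in the order of the list above (assumed to match the column order).
FEATURE_NAMES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]
FEATURE_INDEX = {name: i for i, name in enumerate(FEATURE_NAMES)}

def column(data, name):
    """Return the column of a (samples, 13) array that corresponds to `name`."""
    return data[:, FEATURE_INDEX[name]]

# Synthetic stand-in with the same shape as train_data.
stand_in = np.arange(404 * 13, dtype=np.float64).reshape(404, 13)
rm = column(stand_in, "RM")
print(rm.shape)
```

In the notebook itself, `column(train_data, "RM")` would return the average number of rooms for each training sample under the same ordering assumption.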

## Data normalization
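A model may struggle with features that span very different ranges of values. A simple remedy is to standardize each feature by subtracting its mean and dividing by its standard deviation, so that each feature is centered at 0 and has unit standard deviation. The mean and standard deviation are computed on the training data only and then applied to the test data as well.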

In [4]:
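# Compute the statistics on the training data only
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
# Reuse the training mean and std so that no test-set information is used in preprocessing
test_data -= mean
test_data /= std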
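
The effect of standardization can be checked on a small example. The following is a minimal sketch, not the notebook's own code: `standardize` is a hypothetical helper, the guard for constant columns is an addition that the cell above does not have, and the `toy_*` arrays are synthetic stand-ins with two features on very different scales.

```python
import numpy as np

def standardize(train, test, eps=1e-12):
    """Scale both arrays with the mean and std computed on `train` only."""
    train = np.asarray(train, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Assumption: a constant column is left centered instead of dividing by zero.
    std = np.where(std < eps, 1.0, std)
    return (train - mean) / std, (test - mean) / std

# Synthetic data: one feature around 0.5, another around 400.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[0.5, 400.0], scale=[0.1, 150.0], size=(404, 2))
toy_test = rng.normal(loc=[0.5, 400.0], scale=[0.1, 150.0], size=(102, 2))

toy_train_std, toy_test_std = standardize(toy_train, toy_test)
print(toy_train_std.mean(axis=0))  # close to 0 for every column
print(toy_train_std.std(axis=0))   # close to 1 for every column
print(toy_test_std.mean(axis=0))   # near 0, but not exactly
```

The test set is only approximately centered because it is scaled with the training statistics. That is intended: the test data should be treated as unseen when the preprocessing parameters are chosen.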