# Housing Prices Dataset

The housing prices dataset used for this example is a reduced version of the original [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html), which was originally collected by the U.S. Census Service. We focus on three features of the original dataset (RM, LSTAT, and PTRATIO) and the output, or target (MEDV), that we are learning to predict.

|Feature| Description|
|---|---|
|RM |       Average number of rooms per dwelling|
|LSTAT |    A measurement of the socioeconomic status of people living in the area|
|PTRATIO |  Pupil-teacher ratio by town - approximation of the local education system's quality|
|MEDV |     Median value of owner-occupied homes in $1000's|

Sample:

|RM   |LSTAT|PTRATIO|MEDV|
|-----|----|----|--------|
|6.575|4.98|15.3|504000.0|
|6.421|9.14|17.8|453600.0|
|7.185|4.03|17.8|728700.0|
|6.998|2.94|18.7|701400.0|
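
To make the layout concrete, the sketch below parses the four sample rows above with Python's standard `csv` module. It assumes the CSV file starts with a header row naming the four columns, which the source does not show; the `SAMPLE_CSV` string and the `parse_rows` helper are illustrative names only.

```python
import csv
import io

# The four sample rows from the table above, written as CSV text.
# Assumption: the real file starts with a header row naming the columns.
SAMPLE_CSV = """RM,LSTAT,PTRATIO,MEDV
6.575,4.98,15.3,504000.0
6.421,9.14,17.8,453600.0
7.185,4.03,17.8,728700.0
6.998,2.94,18.7,701400.0
"""


def parse_rows(text):
    """Parse CSV text into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = parse_rows(SAMPLE_CSV)
mean_medv = sum(row["MEDV"] for row in rows) / len(rows)
print(f"{len(rows)} rows, mean MEDV = {mean_medv:.1f}")
# prints: 4 rows, mean MEDV = 596925.0
```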

## Prerequisites

Before you can deploy this example you need to have the following components:

1. The AML-Pachyderm integration, set up by following the root [README.md](../../README.MD) with the syncer mode set to `delimited`
2. `pachctl` configured locally to connect to the Pachyderm instance in AML, per the same setup instructions
3. This notebook file and `utils.py` copied into your AzureML Workspace

## Detailed Walkthrough

### Step 1: Create an input data repository

Once the Pachyderm cluster is running, create a data repository called `housing_data` where we will put our dataset.

```bash
$ pachctl create repo housing_data
$ pachctl list repo
NAME                CREATED             SIZE
housing_data        3 seconds ago       0 B
```

### Step 2: Add the housing dataset to the repo

Now we can add the data, which will kick off the processing automatically. If we update the data with a new commit, then the pipeline will automatically re-run. 

```bash
$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv
```

We can confirm that the data is in the repository by listing its files.

```bash
$ pachctl list file housing_data@master
NAME                    TYPE SIZE
/housing-simplified.csv file 12.14KiB
```

### Step 3: Configure the Notebook

Now browse to the datasets on the left, where you should see a `Pachyderm repo housing_data - delimited` dataset.

1. Click on Consume.
2. Copy the sample usage, paste it into the next code cell, and run it.


In [None]:
# Paste Here

### Step 4: Data Analysis

When you run the code below, it creates a pair plot and a correlation matrix showing the relationships between features. Seeing which features are positively or negatively correlated with the target value (or with each other) helps us understand which features may be valuable to the model.


In [None]:
from utils import data_analysis, set_dtypes

data = dataset.to_pandas_dataframe()

data = set_dtypes(data)

data_analysis(data)
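
As a rough illustration of what a correlation matrix measures, the sketch below computes the Pearson correlation between each feature and `MEDV` for the four sample rows shown near the top of this document. The `data_analysis` implementation in `utils.py` is not shown here, so this is not a description of how it works; the `pearson` helper and the `SAMPLE` dictionary are names made up for this example.

```python
import math

# Sample rows from the table near the top of this document.
# Four rows are far too few for real conclusions; they only illustrate the computation.
SAMPLE = {
    "RM": [6.575, 6.421, 7.185, 6.998],
    "LSTAT": [4.98, 9.14, 4.03, 2.94],
    "PTRATIO": [15.3, 17.8, 17.8, 18.7],
    "MEDV": [504000.0, 453600.0, 728700.0, 701400.0],
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation of every feature column with the target column.
correlations = {
    name: pearson(values, SAMPLE["MEDV"])
    for name, values in SAMPLE.items()
    if name != "MEDV"
}
for name, r in correlations.items():
    print(f"{name:8s} vs MEDV: r = {r:+.2f}")
```

On these four rows, RM comes out positively correlated with MEDV and LSTAT negatively correlated. With so few rows the numbers themselves mean little; the full dataset is what the pair plot and correlation matrix should be read from.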

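### Step 5: Regression

Set up the regression: split the data into training and test sets, train the model, evaluate it on the test set, and plot a learning curve.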


In [None]:
from sklearn.model_selection import train_test_split
from utils import load_data, train_model, test_model, create_learning_curve, plt

input_data = data

target_col = "MEDV"
# Hold out 10% of the rows for testing; the fixed seed makes the split reproducible.
training_data, test_data = train_test_split(input_data, test_size=0.1, random_state=42)
train_data, train_features, train_targets = load_data(training_data, target_col)
print("Training set has {} data points with {} variables each.".format(*train_data.shape))
test_data, test_features, test_targets = load_data(test_data, target_col)
print("Testing set has {} data points with {} variables each.".format(*test_data.shape))

# Train the regressor, evaluate it on the held-out rows, and plot the learning curve.
reg = train_model(train_features, train_targets)
test_results = test_model(reg, test_features, test_targets)
create_learning_curve(reg, train_features, train_targets)
plt.show()

print(test_results)

When we inspect the learning curve, we can see that there is a large gap between the training score and the validation score. This typically indicates that our model could benefit from the addition of more data. 
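
For readers who have not met learning curves before, the sketch below shows the general idea without relying on `utils.py`, whose `create_learning_curve` implementation is not shown here. It fits a model on progressively larger training subsets and scores it on both that subset and a held-out validation set. The synthetic data, the one-feature least-squares model, and all function names are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(model, xs, ys):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    slope, intercept = model
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def learning_curve(train_x, train_y, val_x, val_y, sizes):
    """Return (size, train_score, validation_score) for each training size."""
    curve = []
    for size in sizes:
        model = fit_line(train_x[:size], train_y[:size])
        curve.append((
            size,
            r2_score(model, train_x[:size], train_y[:size]),
            r2_score(model, val_x, val_y),
        ))
    return curve


# Synthetic, noisy "rooms vs. price" data (not the real housing dataset).
rng = random.Random(42)
xs = [rng.uniform(4.0, 9.0) for _ in range(120)]
ys = [100_000 * x + rng.gauss(0, 80_000) for x in xs]
train_x, val_x = xs[:100], xs[100:]
train_y, val_y = ys[:100], ys[100:]

curve = learning_curve(train_x, train_y, val_x, val_y, sizes=[5, 10, 25, 50, 100])
for size, train_score, val_score in curve:
    print(f"n={size:3d}  train R^2={train_score:.3f}  validation R^2={val_score:.3f}")
```

Plotting the two scores against the training size gives the learning curve. A gap between them that narrows as the size grows is the pattern that suggests more data may help.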

Now let's update our dataset with additional examples.

### Step 6: Update Dataset
Similar to the original housing prices example, we'll now add some new data. 

```bash
$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv --overwrite
```


In [None]:
dataset = Dataset.get_by_name(workspace, name='Pachyderm repo housing_data - delimited', version='2')
data2 = dataset.to_pandas_dataframe()

data2 = set_dtypes(data2)

data_analysis(data2)

Now let's rerun the model with the new data.

In [None]:
from sklearn.model_selection import train_test_split
from utils import load_data, train_model, test_model, create_learning_curve, plt

input_data = data2

target_col = "MEDV"
training_data, test_data = train_test_split(input_data, test_size=0.1, random_state=42)
train_data, train_features, train_targets = load_data(training_data, target_col)
print("Training set has {} data points with {} variables each.".format(*train_data.shape))
test_data, test_features, test_targets = load_data(test_data, target_col)
print("Testing set has {} data points with {} variables each.".format(*test_data.shape))

reg = train_model(train_features, train_targets)
test_results = test_model(reg, test_features, test_targets)
create_learning_curve(reg, train_features, train_targets)
plt.show()

print(test_results)

Wow, now things look pretty good. Let's try adding more data.

### Step 7: Update Dataset Again
Similar to the original housing prices example, we'll now add some new data. 

```bash
$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-error.csv --overwrite
```

Now let's retrain the model with the new data.


In [None]:
dataset = Dataset.get_by_name(workspace, name='Pachyderm repo housing_data - delimited')
data3 = dataset.to_pandas_dataframe()

data3 = set_dtypes(data3)

from sklearn.model_selection import train_test_split
from utils import load_data, train_model, test_model, create_learning_curve, plt

input_data = data3

target_col = "MEDV"
training_data, test_data = train_test_split(input_data, test_size=0.1, random_state=42)
train_data, train_features, train_targets = load_data(training_data, target_col)
print("Training set has {} data points with {} variables each.".format(*train_data.shape))
test_data, test_features, test_targets = load_data(test_data, target_col)
print("Testing set has {} data points with {} variables each.".format(*test_data.shape))

reg = train_model(train_features, train_targets)
test_results = test_model(reg, test_features, test_targets)
create_learning_curve(reg, train_features, train_targets)
plt.show()

print(test_results)

Uh oh, that doesn't look right. Let's look at the data:

In [None]:
data_analysis(data3)

Yup, it looks like a bad test feature got in there. Let's switch back to the previous version by setting `version='2'`.

In [None]:
dataset = Dataset.get_by_name(workspace, name='Pachyderm repo housing_data - delimited', version='2')
data4 = dataset.to_pandas_dataframe()
data4 = set_dtypes(data4)

from sklearn.model_selection import train_test_split
from utils import load_data, train_model, test_model, create_learning_curve, plt

input_data = data4

target_col = "MEDV"
training_data, test_data = train_test_split(input_data, test_size=0.1, random_state=42)
train_data, train_features, train_targets = load_data(training_data, target_col)
print("Training set has {} data points with {} variables each.".format(*train_data.shape))
test_data, test_features, test_targets = load_data(test_data, target_col)
print("Testing set has {} data points with {} variables each.".format(*test_data.shape))

reg = train_model(train_features, train_targets)
test_results = test_model(reg, test_features, test_targets)
create_learning_curve(reg, train_features, train_targets)
plt.show()

print(test_results)
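
A lightweight sanity check before training could catch an upload like `housing-simplified-error.csv` earlier. The sketch below is one possible approach and is not part of the original example: the `find_bad_rows` helper and the plausibility bounds in `EXPECTED_RANGES` are assumptions chosen for illustration, and the source does not say what exactly is wrong with the error file.

```python
import math

# Hypothetical plausibility bounds per column; tune these for real data.
EXPECTED_RANGES = {
    "RM": (1.0, 15.0),
    "LSTAT": (0.0, 100.0),
    "PTRATIO": (1.0, 50.0),
    "MEDV": (1.0, 5_000_000.0),
}


def find_bad_rows(rows, ranges=EXPECTED_RANGES):
    """Return (row_index, column, value) for each missing, non-numeric, or out-of-range value."""
    problems = []
    for index, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            try:
                number = float(value)
            except (TypeError, ValueError):
                # Missing (None) or non-numeric entries are reported as-is.
                problems.append((index, column, value))
                continue
            if math.isnan(number) or not low <= number <= high:
                problems.append((index, column, value))
    return problems


rows = [
    {"RM": 6.575, "LSTAT": 4.98, "PTRATIO": 15.3, "MEDV": 504000.0},   # plausible row
    {"RM": 6.421, "LSTAT": 9.14, "PTRATIO": 17.8, "MEDV": -453600.0},  # made-up bad target
    {"RM": "n/a", "LSTAT": 4.03, "PTRATIO": 17.8, "MEDV": 728700.0},   # made-up bad feature
]
problems = find_bad_rows(rows)
for index, column, value in problems:
    print(f"row {index}: suspicious {column} = {value!r}")
```

With a pandas DataFrame such as `data3`, rows in this shape could be obtained with `data3.to_dict("records")`. If the check reports problems, pinning the dataset to the last known-good version, as done above with `version='2'`, keeps training on clean data.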