# The California Housing Dataset

---

### Colab Note

Don't forget that you can link your notebook to your drive and save your work there. Then you can download and back up your models, reload them to keep training them, or upload datasets to your drive.

In [None]:
import os
import sys

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')
    os.chdir('drive/My Drive/') # 'My Drive' is the default name of Google Drives
    print(os.listdir()) # inside an `if` block, the result is only displayed if printed

# use os.chdir("my-directory") # to change directory, and
# os.listdir()                 # to list its contents
# os.getcwd()                  # to get the name of the current directory
# os.mkdir("my-new-dir")       # to create a new directory
# See: https://realpython.com/working-with-files-in-python/

# You can also use bash commands directly, preceded by a bang
# !ls
# However, the following will *not* change the Python directory
# the notebook points to (use os.chdir for that)!
# !cd my-directory

### For reproducible results

```python
tf.random.set_seed(42) # can be any number
```

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

## Method 1: Scikit-Learn

Locally you will have to install `scikit-learn`, using conda or pip, in your environment.

In [None]:
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing()

In [None]:
all_data = california_housing['data']
all_targets = california_housing['target']

In [None]:
all_data.shape, all_targets.shape

In [None]:
# the targets are in units of $100,000: multiply by 100000 to get the real price
print(all_targets.max())
print(all_targets.min())

The dataset is much bigger than the Boston Housing Dataset (404 train samples, 102 test samples). One way to reproduce that set-up is to randomly select 506 samples (404 + 102) from this dataset, and pretend that is all we have.

In [None]:
n = 404 # number of train samples, as in the Boston Housing Dataset
m = 102
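# sample indices without replacement, so that no row ends up in both train and test
indz = np.random.default_rng(42).choice(all_data.shape[0], size=n+m, replace=False)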
reduced_data = all_data[indz]
reduced_targets = all_targets[indz]

reduced_data.shape, reduced_targets.shape

In [None]:
print(reduced_targets.max())
print(reduced_targets.min())

In [None]:
train_data, test_data = reduced_data[:n], reduced_data[n:]
train_targets, test_targets = reduced_targets[:n], reduced_targets[n:]

print(train_data.shape)
print(train_targets.shape)
print(test_data.shape)
print(test_targets.shape)

In [None]:
mean = train_data.mean(axis = 0)
train_data -= mean # shift
std = train_data.std(axis = 0)
train_data /= std # rescale
test_data -= mean
test_data /= std

#### Note

Can you make the function below more modular? You could modify it so that it accepts arguments changing the architecture of the network, and other hyperparameters.

In [None]:
def build_model(clear=True):
    if clear:
        tf.keras.backend.clear_session()
    model = tf.keras.models.Sequential()
    model.add(tf.keras.Input((train_data.shape[1],)))
    model.add(tf.keras.layers.Dense(64, activation = 'relu'))
    model.add(tf.keras.layers.Dense(64, activation = 'relu'))
    model.add(tf.keras.layers.Dense(1))
    model.compile(
        optimizer='rmsprop',
        loss='mse',
        metrics=['mae']
    )
    return model

In [None]:
class CustomCallback(tf.keras.callbacks.Callback):
    def __init__(self, epochs):
        super().__init__() # initialise the parent Callback class
        self.epochs = epochs
    def on_epoch_begin(self, epoch, logs=None):
        c = ['|', '/', '-', '\\']
        print(f"\r{c[epoch % 4]} epoch: {epoch+1}/{self.epochs}", end="")
    def on_train_end(self, logs=None):
        print()

The K-fold algorithm:

In [None]:
K = 4
num_val_samples = len(train_data) // K
num_epochs = 500
all_mae_histories = []
for i in range(K):
    print('processing fold', i)

    # Prepare the validation data: data from partition i
    a, b = i * num_val_samples, (i + 1) * num_val_samples
    val_data = train_data[a : b]
    val_targets = train_targets[a : b]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate([train_data[:a], train_data[b:]], axis=0)
    partial_train_targets = np.concatenate([train_targets[:a], train_targets[b:]], axis=0)

    # Build the Keras model (already compiled)
    model = build_model()

    # Train the model (in silent mode, verbose=0)
    history = model.fit(
        partial_train_data,
        partial_train_targets,
        validation_data=(val_data, val_targets),
        epochs=num_epochs, batch_size=1, verbose=0,
        callbacks=[CustomCallback(num_epochs)]
    )

    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)
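
To see exactly which rows each fold uses, the slicing logic of the loop above can be run on indices alone, without training anything. This is only a minimal sketch, assuming 404 training samples and `K = 4`; `fold_indices` is a hypothetical helper, not part of the lab code.

```python
import numpy as np


def fold_indices(num_samples, k):
    """Yield (train_idx, val_idx) pairs using the same slicing as the K-fold loop."""
    num_val_samples = num_samples // k
    all_idx = np.arange(num_samples)
    for i in range(k):
        a, b = i * num_val_samples, (i + 1) * num_val_samples
        val_idx = all_idx[a:b]                                   # partition i
        train_idx = np.concatenate([all_idx[:a], all_idx[b:]])   # all other partitions
        yield train_idx, val_idx


folds = list(fold_indices(404, 4))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: validation rows {val_idx[0]}-{val_idx[-1]}, "
          f"{len(train_idx)} training rows")
```

Note that if the number of samples is not divisible by `K`, the floor division leaves a few trailing samples that are never used for validation (they stay in every training partition).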

### Visualise your results

Can you think of a way to automate the visualisation once the training is done? This would mean encapsulating the plotting code into a function, and calling it once the K-fold loop is done.

In [None]:
average_mae_history = np.array(all_mae_histories).mean(axis=0)
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

In [None]:
def smooth_curve(points, beta = 0.9):       # beta must be between 0 and 1!
    smoothed_points = []
    for current in points:
        if smoothed_points:                 # (a nonempty list is 'True')
            previous = smoothed_points[-1]  # the last appended point
                                            # ↓ a weighted sum of previous & current, controlled by beta
            smoothed_points.append(beta * previous + (1 - beta) * current)
        else:
            smoothed_points.append(current) # at the start, the list is empty, we just add the first point
    return smoothed_points

In [None]:
smooth_mae_history = smooth_curve(average_mae_history[10:])
plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

### Experiments

- Run k-fold validation on the California dataset;
- Notice that the mini-batch size is set to 1. Experiment with different mini-batch sizes. What do you observe? Can you account for your observation?
- Run a series of experiments to find the best model, like in previous labs.

The obvious thing to be done here is to compare the results between the small random subset and the full dataset, if you were to train models on it (don't forget to split into train, validation and test sets when you work on the full data!). Varying the size of the test set could also be of interest.

### Conclusion

Retrain the best model (with the same hyperparameters) on the entire training data (`train_data` and `train_targets`) and evaluate it on the test data (`test_data`, `test_targets`).

---

## Method 2: Manual Download



[California Housing](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html), original website. (Also available on [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).)

### 1. Download

Use the following terminal commands to download and extract the dataset. (Add a `!` in front of them to use them from Jupyter or Colab.)

```bash
wget https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
tar -xvf cal_housing.tgz
```

### 2. Load the data

Use the name of the file `cal_housing.data` to:
- open it
- read the lines
- strip the final newline `\n`
- split on commas
- turn the data into a numpy array, casting it as floats
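
The steps above could be sketched as follows. The helper name `load_csv_as_floats` is hypothetical, and the two rows written to a temporary file are made-up stand-ins so that the snippet runs on its own. For the real data, pass the path to the extracted `cal_housing.data` instead (its exact location depends on how the archive unpacks, so check with `ls`).

```python
import os
import tempfile

import numpy as np


def load_csv_as_floats(path):
    """Read a comma-separated text file into a 2D float array."""
    with open(path) as f:                            # open the file
        lines = f.readlines()                        # read the lines
    # strip the final newline, skip empty lines, split on commas
    rows = [line.strip().split(",") for line in lines if line.strip()]
    return np.array(rows, dtype=np.float32)          # cast the strings to floats


# Made-up rows standing in for the real file (9 comma-separated values each).
sample = (
    "1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,100000.0\n"
    "1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,250000.0\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.data")
    with open(sample_path, "w") as f:
        f.write(sample)
    data = load_csv_as_floats(sample_path)

print(data.shape, data.dtype)
```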

#### Note

You can see the names of the features by opening `cal_housing.domain`, reading its lines, and printing its contents.

### 3. Separate the features and the targets

The price is the last column, so you need to use NumPy slicing to put the last column of every row into a `targets` array, and all the remaining columns into a `features` array.
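
A minimal sketch of that slicing, using a made-up array in place of the loaded data (the variable name `data` is an assumption):

```python
import numpy as np

# Made-up stand-in for the loaded array: 3 samples, 8 features + 1 price column.
data = np.arange(27, dtype=np.float32).reshape(3, 9)

features = data[:, :-1]   # every row, all columns except the last
targets = data[:, -1]     # every row, last column only (the price)

print(features.shape, targets.shape)
```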

### 4. Scale the prices to a more manageable range

You can print the `min()` and the `max()` of your `targets` to see the kind of range we are dealing with.

Then a division by `100000` will give us similar numbers to the Boston Housing Dataset (and the `scikit-learn` version, as above).

Once you have your reduced targets, you may want to print again the `min()` and the `max()` as a sanity check.
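
As a rough illustration of the check-scale-check sequence, here is a sketch with a few made-up dollar prices standing in for the real `targets` array:

```python
import numpy as np

# Made-up prices in dollars, standing in for the real `targets` array.
targets = np.array([14999.0, 179700.0, 500001.0])
print(targets.min(), targets.max())      # range before scaling

targets = targets / 100000               # now in units of $100,000
print(targets.min(), targets.max())      # sanity check after scaling
```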


### 5. Reduce the dataset to Boston Housing size, or split your data into train and test

Use `.shape` on your `features` (and/or `targets`) to check how many samples this dataset has.

Either use random indices as above to select only 506 samples (404 + 102), or train on the full dataset after splitting it into train and test sets.

Slice both `features` and `targets` to obtain `train_data`, `test_data`, and `train_targets`, `test_targets` respectively.

This is actually a potential subject of experiment. You could slice it roughly in the middle, or have more in your training than your testing set.

As a sanity check, your shapes should look like this:
```
# n_train: number of training samples
# n_test: number of testing samples
train_data (n_train, 8)
train_targets (n_train,)
test_data (n_test, 8)
test_targets (n_test,)
```
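
One possible way to obtain such a split is to shuffle the indices once and take disjoint slices. This is a sketch under assumptions: the random arrays stand in for the real `features` and `targets`, and the sizes `n_train` and `n_test` are free choices.

```python
import numpy as np

rng = np.random.default_rng(42)             # seeded for reproducibility

# Random stand-ins for the real arrays (1000 samples, 8 features).
features = rng.normal(size=(1000, 8))
targets = rng.uniform(0.15, 5.0, size=1000)

n_train, n_test = 404, 102                  # Boston Housing sizes; adjust freely

# Shuffle all indices once, then take disjoint slices, so that no sample
# ends up in both the training and the test set.
indices = rng.permutation(features.shape[0])
train_idx = indices[:n_train]
test_idx = indices[n_train:n_train + n_test]

train_data, train_targets = features[train_idx], targets[train_idx]
test_data, test_targets = features[test_idx], targets[test_idx]

print(train_data.shape, train_targets.shape)
print(test_data.shape, test_targets.shape)
```

To use the whole dataset rather than a small subset, set `n_test = features.shape[0] - n_train` and pick `n_train` according to the split ratio you want to test.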

### 6. Normalisation/scaling

Use the mean and standard deviation of the **train data** to normalise it, and apply the same transform to the test data, exactly as in Method 1 above (and as with the Boston Housing Dataset).
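
A minimal sketch of this normalisation, with random arrays standing in for the real `train_data` and `test_data` (the shapes and distribution are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the real arrays.
train_data = rng.normal(loc=5.0, scale=3.0, size=(404, 8))
test_data = rng.normal(loc=5.0, scale=3.0, size=(102, 8))

mean = train_data.mean(axis=0)   # per-feature mean, training set only
std = train_data.std(axis=0)     # per-feature standard deviation, training set only

train_data = (train_data - mean) / std
test_data = (test_data - mean) / std   # same transform; no test statistics are used

print(train_data.mean(axis=0).round(3))   # close to 0 for every feature
print(train_data.std(axis=0).round(3))    # close to 1 for every feature
```

The test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1 after this transform, which is expected: the statistics come from the training set only.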

### 7. Everything is now set up for training

The rest of the procedure (define the model, train, plot) is now the same.

#### Note

This dataset is not small like the Boston Housing Dataset, so you may find that running many epochs with K-fold takes too long given the compute you have. This doesn't matter *too* much; the important thing is to understand the K-fold logic.

### Experiments

The obvious thing to be done here is to compare the results between the small random subset and the full dataset. Varying the size of the test set could also be interesting!

### 8. Conclusion

Don't forget to retrain on the entire training set using the best hyperparameters, and evaluate your model on the test set.

## Visualisations

Three examples of how people use visualisations for this dataset:
- [California Housing Modelling and Map Visualisation](https://www.kaggle.com/code/qixuan/california-housing-modelling-and-map-visualisation)
- [California Housing Prices: EDA and Visualization](https://www.kaggle.com/code/ujwalkandi/california-housing-prices-eda-and-visualization)
- [The California housing dataset](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html)