# The California Housing Dataset

---

### Colab Note

Don't forget that you can link your notebook to your drive and save your work there. Then you can download and back up your models, reload them to keep training them, or upload datasets to your drive.

```python
import os
import sys

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')
    os.chdir('drive/My Drive/') # 'My Drive' is the default name of Google Drives
    print(os.listdir()) # a bare expression inside an `if` block is not displayed
    
# use os.chdir("my-directory") # to change directory, and
# os.listdir()                 # to list its contents
# os.getcwd()                  # to get the name of the current directory
# os.mkdir("my-new-dir")       # to create a new directory
# See: https://realpython.com/working-with-files-in-python/

# You can also use bash commands directly, preceded by a bang
# !ls
# However, the following will *not* change the Python directory 
# the notebook points to (use os.chdir for that)!
# !cd my-directory
```

### For reproducible results

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

tf.random.set_seed(42) # can be any number
```

---

## Extra: Manual Download and data processing


The dataset comes from the original [California Housing](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) website. It is also available on [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

### 1. Download

Run the following terminal commands to download and extract it. (Add a `!` in front of each one to run it from Jupyter or Colab.)

```bash
wget https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
tar -xvf cal_housing.tgz
```

### 2. Load the data

Use the name of the file `cal_housing.data` to:
- open it 
- read the lines 
- strip the final newline `\n` 
- split on commas
- turn the data into a numpy array, casting it as floats
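
The steps above could be sketched as follows. This is a minimal sketch rather than the only way to do it: the helper name `load_cal_housing` is made up for illustration, and the path `CaliforniaHousing/cal_housing.data` assumes the archive extracted into a `CaliforniaHousing` directory, so adjust it to wherever your file ended up.

```python
import os

import numpy as np


def load_cal_housing(path):
    """Read a comma-separated text file into a 2D float array (hypothetical helper)."""
    with open(path) as f:
        lines = f.readlines()
    rows = []
    for line in lines:
        line = line.strip()  # removes the final newline
        if line:             # skip empty lines, if any
            rows.append([float(value) for value in line.split(",")])
    return np.array(rows, dtype=np.float32)


# Assumed location after extraction; change it if your file lives elsewhere
data_path = "CaliforniaHousing/cal_housing.data"
if os.path.exists(data_path):
    data = load_cal_housing(data_path)
    print(data.shape)
```

The `os.path.exists` guard only keeps the cell from crashing when the file is somewhere else.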

**Note**

You can see the feature names by opening `cal_housing.domain`, reading its lines, and printing them.
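
A possible way to do this, assuming again that the archive extracted into a `CaliforniaHousing` directory (the helper name `read_feature_names` is hypothetical):

```python
import os


def read_feature_names(path):
    """Return the non-empty lines of a text file, stripped (hypothetical helper)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Assumed location after extraction; change it if your file lives elsewhere
domain_path = "CaliforniaHousing/cal_housing.domain"
if os.path.exists(domain_path):
    for name in read_feature_names(domain_path):
        print(name)
```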

### 3. Separate the features and the targets

The price is the last column of the array, so use NumPy slicing to put that last column into `targets` and all the other columns into `features`.
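
As an illustration, here is the slicing on a small toy array standing in for the loaded data (the values are made up; only the column layout, 8 features followed by the price, mirrors the dataset):

```python
import numpy as np

# Toy stand-in for the loaded array: 3 samples, 8 features + 1 price column
data = np.arange(27, dtype=np.float32).reshape(3, 9)

features = data[:, :-1]  # every row, all columns except the last
targets = data[:, -1]    # every row, last column only

print(features.shape, targets.shape)
```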

### 4. Reduce the dataset to Boston Housing size, or split your data into train and test

Use `.shape` on your `features` (and/or `targets`) to check how many samples this dataset has.

Either use random indices as above to select only 506 samples (the size of the Boston Housing dataset), or train on the full dataset after splitting it into train and test sets.

Slice both `features` and `targets` to obtain `train_data`, `test_data`, and `train_targets`, `test_targets` respectively.

The split ratio is itself worth experimenting with: you could slice the data roughly in the middle, or put more samples in your training set than in your testing set.

As a sanity check, your shapes should look like this:
```
# n_train: number of training samples
# n_test: number of testing samples
train_data (n_train, 8)
train_targets (n_train,)
test_data (n_test, 8)
test_targets (n_test,)
```
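
Both options in this step could be sketched as below. The arrays are random stand-ins with the right number of columns, and the 80/20 ratio and the subset size are arbitrary choices for illustration, not values prescribed here.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the shuffle is reproducible

# Random stand-ins for the real arrays (8 feature columns, one target per row)
features = rng.normal(size=(1000, 8)).astype(np.float32)
targets = rng.uniform(size=1000).astype(np.float32)

# Shuffle the row indices once, so features and targets stay aligned
indices = rng.permutation(len(features))

# Option A: keep only a small random subset
subset_size = 506  # size of the full Boston Housing dataset
small_features = features[indices[:subset_size]]
small_targets = targets[indices[:subset_size]]

# Option B: split the whole dataset, here 80% train / 20% test
n_train = int(0.8 * len(features))
train_idx, test_idx = indices[:n_train], indices[n_train:]
train_data, test_data = features[train_idx], features[test_idx]
train_targets, test_targets = targets[train_idx], targets[test_idx]

print("train_data", train_data.shape)
print("train_targets", train_targets.shape)
print("test_data", test_data.shape)
print("test_targets", test_targets.shape)
```

Shuffling before slicing matters if the rows of the file are ordered in some way, since a plain slice would then give train and test sets with different distributions.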

### 6. Normalisation/scaling

Use the mean and standard deviation of the **train data** to normalise it, and apply the same transform to test data, exactly like with the Boston Housing Dataset.
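
A minimal sketch of this normalisation, using random stand-in arrays in place of the real `train_data` and `test_data`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the real arrays
train_data = rng.normal(loc=5.0, scale=3.0, size=(800, 8))
test_data = rng.normal(loc=5.0, scale=3.0, size=(200, 8))

# Statistics come from the training data only, one value per feature column
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_data = (train_data - mean) / std
test_data = (test_data - mean) / std  # same transform, no test statistics used

print(train_data.mean(axis=0).round(3))
print(train_data.std(axis=0).round(3))
```

After this, each training column has mean 0 and standard deviation 1; the test columns will only be close to that, which is expected.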

Regarding the targets, you can print the `min()` and the `max()` to see the kind of range we are dealing with.

Depending on which version of the dataset you use, dividing by `100000` will give you numbers between roughly `0.1` and `5`.

Once you have rescaled your targets, you may want to print the `min()` and the `max()` again as a sanity check.
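
For instance, with a handful of made-up dollar values standing in for the real targets:

```python
import numpy as np

# Made-up prices in dollars, standing in for the real targets
train_targets = np.array([14999.0, 179700.0, 500001.0])
test_targets = np.array([85800.0, 352100.0])

print(train_targets.min(), train_targets.max())  # range before scaling

scale = 100000
train_targets = train_targets / scale
test_targets = test_targets / scale

print(train_targets.min(), train_targets.max())  # range after scaling
```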

### 7. Everything is now set up for training

The rest of the procedure (define the model, train, plot) is now the same.

### Experiments

The obvious thing to do here is to compare the results between the small random subset and the full dataset. Varying the size of the test set could also be interesting!

### 8. Conclusion

Retrain the best model (with the same hyperparameters) on the entire training data (`train_data` and `train_targets`) **without validation data**, then evaluate it on the test data (`test_data`, `test_targets`).

### Visualisations

Three examples of how people use visualisations for this dataset:
- [California Housing Modelling and Map Visualisation](https://www.kaggle.com/code/qixuan/california-housing-modelling-and-map-visualisation)
- [California Housing Prices: EDA and Visualization](https://www.kaggle.com/code/ujwalkandi/california-housing-prices-eda-and-visualization)
- [The California housing dataset](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html)