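Learning how to deal with **overfitting** is important. Although it's often possible to achieve high accuracy on the *training set*, what we really want is to develop models that generalize well to a *testing set*, that is, to data they have not seen during training.

To prevent **overfitting**:
- use more complete training data, so that it covers the full range of inputs the model is expected to handle
- apply regularization, which constrains how much information the model can store
- etc.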

The opposite of overfitting is **underfitting**. An underfit network has not learned the relevant patterns in the training data. This can happen for a number of reasons: the model is not powerful enough, it is **over-regularized**, or it has simply not been trained long enough.
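
The difference between the two failure modes can be seen on a toy problem. The sketch below is not part of the Higgs experiment: it assumes a made-up noisy sine curve and uses NumPy polynomial fits as a stand-in for models of increasing capacity. The names `make_data` and `mse` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical toy data: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # held-out data from the same distribution

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means more capacity.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree={degree}  train_mse={results[degree][0]:.4f}  "
          f"test_mse={results[degree][1]:.4f}")
```

Training error can only go down as the degree increases, because each higher-degree fit contains the lower-degree ones as special cases. A degree-1 fit that leaves both errors high is underfitting. A high-degree fit whose training error keeps shrinking while its test error does not follow is the signature of overfitting.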

## Setup

In [1]:
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import regularizers

print(tf.__version__)

2.2.0


In [3]:
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots

from IPython import display
import matplotlib.pyplot as plt

import numpy as np
import pathlib
import shutil
import tempfile

In [4]:
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
print(logdir)
shutil.rmtree(logdir, ignore_errors=True)

/tmp/tmp7aggvnoo/tensorboard_logs


## The Higgs Dataset

The dataset contains 11,000,000 examples, each with 28 features and a binary class label.

In [None]:
gz = tf.keras.utils.get_file('HIGGS.csv.gz', 'http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz')

In [6]:
FEATURES = 28

In [None]:
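# Read CSV records straight from the gzip file, with no intermediate decompression step.
# Each record holds one label column followed by FEATURES float columns.
ds = tf.data.experimental.CsvDataset(gz, [float()] * (FEATURES+1), compression_type='GZIP')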

In [7]:
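def pack_row(*row):
    # The first column is the class label; the remaining columns are the features.
    label = row[0]
    # Stack the feature columns along axis 1 so each example becomes one feature vector.
    features = tf.stack(row[1:], 1)
    return features, label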
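
To see why the stack is along axis 1: when `pack_row` is applied to a batch, each element of `row` is a whole column of shape `(batch,)`, not a single value. The sketch below mirrors that logic with `np.stack` in place of `tf.stack` on a made-up batch of 4 rows and 3 features. `pack_row_np` and the toy values are hypothetical and serve only to illustrate the shapes.

```python
import numpy as np

# Hypothetical toy batch: 4 rows, one label column and 3 feature columns
# (the real dataset has 28 feature columns).
label_col = np.array([1.0, 0.0, 1.0, 0.0])
feature_cols = [
    np.array([0.1, 0.2, 0.3, 0.4]),
    np.array([1.1, 1.2, 1.3, 1.4]),
    np.array([2.1, 2.2, 2.3, 2.4]),
]
row = (label_col, *feature_cols)

def pack_row_np(*row):
    # NumPy analogue of pack_row: the first column is the label,
    # the remaining columns are stacked side by side.
    label = row[0]
    features = np.stack(row[1:], axis=1)
    return features, label

features, label = pack_row_np(*row)
print(features.shape)  # (4, 3): one row per example, one column per feature
print(features[0])     # [0.1 1.1 2.1]: the feature vector of the first example
```

Stacking along axis 0 instead would give shape `(3, 4)`, with features and examples transposed.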

In [None]:
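# Batch 10000 rows at a time so pack_row runs once per batch rather than once per row,
# then split the batches back into individual examples.
packed_ds = ds.batch(10000).map(pack_row).unbatch()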

In [None]:
# Inspect the first batch of 1000 examples and plot the distribution of feature values.
for features, label in packed_ds.batch(1000).take(1):
    print(features[0])
    plt.hist(features.numpy().flatten(), bins=101)

In [None]:
N_VALIDATION = int(1e3)
N_TRAIN = int(1e4)
BUFFER_SIZE = int(1e4)
BATCH_SIZE = 500
STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE
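
The following sketch shows how these constants relate to one another. It assumes, since the section does not show the split itself, that the first `N_VALIDATION` examples are held out for validation and the next `N_TRAIN` are used for training. Plain Python iterators stand in for the dataset, and `examples`, `validation`, `train` and `batches` are hypothetical names.

```python
from itertools import islice

N_VALIDATION = int(1e3)
N_TRAIN = int(1e4)
BATCH_SIZE = 500
STEPS_PER_EPOCH = N_TRAIN // BATCH_SIZE

# Stand-in for a stream of examples; the integers play the role of records.
examples = range(N_VALIDATION + N_TRAIN + 5000)

# Assumed split: the first N_VALIDATION records for validation,
# the following N_TRAIN records for training.
validation = list(islice(examples, N_VALIDATION))
train = list(islice(examples, N_VALIDATION, N_VALIDATION + N_TRAIN))

# Group the training records into batches of BATCH_SIZE.
batches = [train[i:i + BATCH_SIZE] for i in range(0, len(train), BATCH_SIZE)]

print(len(validation), len(train))    # 1000 10000
print(STEPS_PER_EPOCH, len(batches))  # 20 20
```

With 10,000 training examples and a batch size of 500, one pass over the training set takes exactly 20 steps, which is the value of `STEPS_PER_EPOCH`.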