# CLASSIFICATION EXERCISE

So far, you've only created regression models. That is, you created models that produced floating-point predictions, such as, "houses in this neighborhood cost N thousand dollars." In this Colab, you'll create and evaluate a binary classification model. That is, you'll create a model that answers a binary question. In this exercise, the binary question will be, "Are houses in this neighborhood above a certain price?"

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from matplotlib import pyplot as plt

pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
# tf.keras.backend.set_floatx('float32')
print("ok")

ok


In [2]:
train_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
test_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv")
train_df = train_df.reindex(np.random.permutation(train_df.index))


## Normalize Values

When creating a model with multiple features, the values of each feature should cover roughly the same range. For example, if one feature's range spans 500 to 100,000 and another feature's range spans 2 to 12, then the model will be difficult or impossible to train. Therefore, you should normalize features in a multi-feature model.

The following code cell normalizes datasets by converting each raw value (including the label) to its Z-score. A Z-score is the number of standard deviations from the mean for a particular raw value. For example, consider a feature having the following characteristics:

 -- The mean is 60.
 -- The standard deviation is 10.

The raw value 75 would have a Z-score of +1.5:

       Z-score = (75 - 60) / 10 = +1.5
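The Z-score arithmetic above can be sketched in a few lines of plain Python. This is only an illustration using the standard library: the helper name `z_scores` and the list of raw values are made up for the example and are not part of the exercise.

```python
import statistics


def z_scores(values):
    """Return the Z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator),
    # which matches the default behavior of pandas' DataFrame.std().
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# The worked example from the text: mean 60, standard deviation 10, raw value 75.
z = (75 - 60) / 10
print(z)

# Hypothetical raw feature values, invented for illustration.
raw = [40, 50, 60, 70, 80]
normalized = z_scores(raw)
print(normalized)
```

After normalization the values have mean 0 and standard deviation 1, which is what puts features with very different raw ranges on a comparable scale.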

In [3]:
# In this case we won't scale the label values as we did in previous exercises:
#train_df['median_house_value'] /= 1000
#test_df['median_house_value'] /= 1000

# Here we NORMALIZE the data by creating a new DataFrame that holds the Z-scores:
train_df_mean = train_df.mean()
train_df_std = train_df.std()
train_df_norm = (train_df - train_df_mean)/train_df_std
train_df_norm.head()

test_df_mean = test_df.mean()
test_df_std  = test_df.std()
test_df_norm = (test_df - test_df_mean)/test_df_std

Now, examining some of the values in the normalized dataset, you can see that most of the Z-scores fall between -2 and +2.

### Task 1: Create a binary label

Your task is to create a new column named median_house_value_is_high in both the training set and the test set. If the median_house_value is higher than a certain arbitrary value (defined by threshold), then set median_house_value_is_high to 1. Otherwise, set median_house_value_is_high to 0.

In [4]:
threshold = 265000 # This is the 75th percentile for median house values.

train_df_norm['median_house_value_is_high'] = (train_df['median_house_value'] > threshold).astype(float)
test_df_norm['median_house_value_is_high'] = (test_df['median_house_value'] > threshold).astype(float)
train_df_norm["median_house_value_is_high"].head(800)

3003    1.0
3620    0.0
3108    1.0
5842    0.0
421     0.0
         ..
4155    0.0
2725    0.0
10776   0.0
14587   0.0
5794    0.0
Name: median_house_value_is_high, Length: 800, dtype: float64
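As a rough illustration of the labeling step, the sketch below applies the same strict comparison to a handful of made-up house values, using only the standard library so it runs without the dataset. The values in `house_values` are hypothetical; only the 265000 threshold comes from the cell above.

```python
threshold = 265000  # Same threshold as in the exercise.

# Hypothetical median house values in dollars, invented for illustration.
house_values = [120000, 180000, 265000, 300000, 90000, 410000, 250000, 150000]

# 1.0 when the value is strictly above the threshold, otherwise 0.0.
is_high = [float(value > threshold) for value in house_values]
print(is_high)

# The fraction of positive labels shows how imbalanced the two classes are.
positive_fraction = sum(is_high) / len(is_high)
print(positive_fraction)
```

Because the comparison is strict, a value exactly equal to the threshold gets the label 0. With a threshold at the 75th percentile, roughly a quarter of the examples should end up labeled 1.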

### Represent features in feature columns

This code cell specifies the features that you'll ultimately train the model on and how each of those features will be represented. The transformations (collected in feature_layer) don't actually get applied until you pass a DataFrame to it, which will happen when we train the model.

In [5]:
feature_column = []

median_income = tf.feature_column.numeric_column("median_income")
feature_column.append(median_income)

tr = tf.feature_column.numeric_column("total_rooms")
feature_column.append(tr)

feature_layer = layers.DenseFeatures(feature_column)

# Print the first 3 and last 3 rows of the feature_layer's output when applied
# to train_df_norm:
feature_layer(dict(train_df_norm))


<tf.Tensor: shape=(17000, 2), dtype=float32, numpy=
array([[ 5.021822  ,  0.1143769 ],
       [-0.534169  , -0.40444303],
       [ 1.6628206 , -0.15397824],
       ...,
       [ 0.34175494,  0.1694241 ],
       [-0.50670797, -0.49618837],
       [-0.10391082, -0.21131907]], dtype=float32)>

In [6]:
# Define functions that build and train the model, plus one that plots the metric curves.

def build_model(learning_rate, feature_layer, my_metrics):
    
    model = tf.keras.Sequential()
    
    model.add(feature_layer)
    
    # Funnel the regression value through a sigmoid function.
    model.add(tf.keras.layers.Dense(units = 1,
                                   input_shape=(1,),
                                   activation=tf.sigmoid))
    
    # Call the compile method to construct the layers into a model that
    # TensorFlow can execute.  Notice that we're using a different loss
    # function for classification than for regression. 
    # Also note that we pass in whichever metrics we want to track.
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate),
                 loss=tf.keras.losses.BinaryCrossentropy(),
                  metrics=my_metrics)
    
    return model

def train_model(model, df, epochs, label_name, batch_size=None, shuffle=True):
    
    # The x parameter of tf.keras.Model.fit can be a list of arrays, where
    # each array contains the data for one feature.  Here, we're passing
    # every column in the dataset. Note that the feature_layer will filter
    # away most of those columns, leaving only the desired columns and their
    # representations as features.
    
    features = {name:np.array(value) for name, value in df.items()}
    label = np.array(features.pop(label_name))
    history = model.fit(x=features, y=label, batch_size = batch_size, epochs=epochs, shuffle= shuffle)
    
    epochs = history.epoch
    
    # Isolate the classification metric for each epoch.
    hist = pd.DataFrame(history.history)
    
    return epochs, hist

 

def plot_curve(epochs, hist, list_of_metrics):
    
    """Plot a curve of one or more classification metrics vs. epoch."""
        
    plt.figure()
    plt.xlabel("Epoch")
    plt.ylabel("Value")
    
    # Plot every metric passed in, whether accuracy, precision, recall, or any other.
    for m in list_of_metrics:
        
        x=hist[m]
        plt.plot(epochs[1:], x[1:], label=m)
        
    plt.legend()
    
print("all functions created successfully")

all functions created successfully
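To make the sigmoid output and the loss used in build_model more concrete, here is a rough standard-library sketch of what a single sigmoid unit and binary cross-entropy compute. The weights, bias, and feature rows are invented for illustration and are not the values the model learns; the `eps` clipping is also an assumption, added to avoid taking log(0).

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(labels, probabilities, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # Keep p away from exactly 0 or 1.
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical parameters for two normalized features.
weights = [1.2, -0.3]
bias = -0.5

# Hypothetical normalized feature rows and their binary labels.
rows = [[2.0, 0.1], [-0.5, -0.4], [0.0, 0.0]]
labels = [1.0, 0.0, 0.0]

# Linear combination first, then the sigmoid turns it into a probability.
probs = [sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) for row in rows]
loss = binary_cross_entropy(labels, probs)
print(probs)
print(loss)
```

The loss is small when the predicted probability is close to the true label and grows quickly when the model is confidently wrong, which is why it suits classification better than a squared-error loss.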


In [7]:
learning_rate = 0.001
epochs = 30
batch_size = 100
label_name = "median_house_value_is_high"
classification_threshold = 0.35

# Establish the metrics the model will measure.
METRICS = [tf.keras.metrics.BinaryAccuracy(name='accuracy', threshold=classification_threshold)]

my_model = build_model(learning_rate, feature_layer, METRICS)

epochs, hist = train_model(my_model, train_df_norm, epochs, 
                           label_name, batch_size)

# Plot a graph of the metric(s) vs. epochs.
list_of_metrics_to_plot = ['accuracy'] 

plot_curve(epochs, hist, list_of_metrics_to_plot)

Epoch 1/30
Consider rewriting this model with the Functional API.
...
Epoch 30/30


Accuracy should gradually improve during training until it plateaus.

#### Now evaluate the model against the test set

In [8]:
features = {name:np.array(value) for name, value in test_df_norm.items()}
label = np.array(features.pop(label_name))

my_model.evaluate(x = features, y = label, batch_size=batch_size)

Consider rewriting this model with the Functional API.


[0.4069024920463562, 0.8009999990463257]

This looks like a good model, but a model that always predicted that median_house_value_is_high is False would reach about 75% accuracy (the threshold is the 75th percentile, so about 75% of the labels are 0), compared with this model's 80%. So our model is not as useful as it first appears.
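The baseline argument can be checked with a small sketch. The label list below is synthetic, built so that 25% of the examples are positive (mirroring a 75th-percentile threshold); it is not the actual test set, and the `accuracy` helper is a hypothetical name used only here.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Synthetic labels: 25 positives and 75 negatives.
labels = [1.0] * 25 + [0.0] * 75

# A "model" that ignores its input and always predicts the negative class.
always_negative = [0.0] * len(labels)

baseline = accuracy(labels, always_negative)
print(baseline)
```

Any classifier trained on this kind of imbalanced data should be compared against that baseline rather than against 50%.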

Relying only on accuracy can be a poor way to judge a classification model. Now let's modify our model to measure precision and recall too.
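As a minimal sketch of what precision and recall measure, the code below counts true positives, false positives, and false negatives at a given classification threshold. The labels and predicted probabilities are invented for illustration, and the helper `precision_recall` is a hypothetical name; a prediction counts as positive when its probability is strictly above the threshold.

```python
def precision_recall(labels, probabilities, threshold):
    """Return (precision, recall) for predictions thresholded at `threshold`."""
    tp = fp = fn = 0
    for y, p in zip(labels, probabilities):
        predicted_positive = p > threshold
        if predicted_positive and y == 1:
            tp += 1  # Correctly flagged as high.
        elif predicted_positive and y == 0:
            fp += 1  # Flagged as high, but actually low.
        elif not predicted_positive and y == 1:
            fn += 1  # Missed a high-value example.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and model outputs.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1, 0.6]

strict = precision_recall(labels, probabilities, threshold=0.65)
lenient = precision_recall(labels, probabilities, threshold=0.35)
print(strict)
print(lenient)
```

On this toy data, lowering the threshold from 0.65 to 0.35 raises recall (every positive is found) while lowering precision (more negatives are flagged), which is the trade-off the two metrics expose and accuracy alone hides.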

In [9]:
# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 20
batch_size = 100
classification_threshold = 0.65
label_name = "median_house_value_is_high"

# Previously we tracked only accuracy; now we also track precision and recall.
METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy', 
                                      threshold=classification_threshold),
      tf.keras.metrics.Precision(thresholds=classification_threshold,
                                 name='precision' 
                                 ),
      tf.keras.metrics.Recall(thresholds=classification_threshold,
                              name='recall'),
]

# Establish the model's topography.
my_model = build_model(learning_rate, feature_layer, METRICS)

# Train the model on the training set.
epochs, hist = train_model(my_model, train_df_norm, epochs, 
                           label_name, batch_size)

# Plot metrics vs. epochs
list_of_metrics_to_plot = ['accuracy', 'precision', 'recall'] 
plot_curve(epochs, hist, list_of_metrics_to_plot)


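Now you can use one metric that rolls them all into a single number: AUC (area under the ROC curve), which summarizes the model's performance across all classification thresholds.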



In [10]:
# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 20
batch_size = 100
classification_threshold = 0.65
label_name = "median_house_value_is_high"

# This time, track only AUC, which summarizes performance across all thresholds.
METRICS = [
      tf.keras.metrics.AUC(num_thresholds=100, name='auc')
]

# Establish the model's topography.
my_model = build_model(learning_rate, feature_layer, METRICS)

# Train the model on the training set.
epochs, hist = train_model(my_model, train_df_norm, epochs, 
                           label_name, batch_size)

# Plot metrics vs. epochs
list_of_metrics_to_plot = ["auc"] 
plot_curve(epochs, hist, list_of_metrics_to_plot)

Epoch 1/20
Consider rewriting this model with the Functional API.
...
Epoch 20/20


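AUC condenses the model's behavior across all classification thresholds into a single number, which makes it more convenient and easier to read than tracking accuracy, precision, and recall at one fixed threshold. Its value can land close to the accuracy seen earlier, but the two are different measurements: accuracy depends on the chosen threshold, while AUC does not.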
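One way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it exactly by comparing every positive/negative pair on made-up data. This is an illustrative assumption rather than what Keras does internally: tf.keras.metrics.AUC approximates the area with a Riemann sum over a fixed number of thresholds, so its value can differ slightly.

```python
def pairwise_auc(labels, scores):
    """Exact ROC AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # Positive ranked above negative.
            elif p == n:
                wins += 0.5  # Ties count as half.
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model outputs.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1, 0.6]

auc = pairwise_auc(labels, scores)
print(auc)
```

No threshold appears anywhere in the computation, which is why AUC stays the same when the classification threshold changes while accuracy, precision, and recall do not.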