#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Classification with TensorFlow

By now, you should be familiar with classification in scikit-learn. In this colab, we will explore another commonly used tool for classification and machine learning: TensorFlow.

## Load the Dataset

The dataset we will be using comes packaged with scikit-learn. It is a small, [MNIST](https://en.wikipedia.org/wiki/MNIST_database)-style digits dataset: 1,797 hand-drawn digits, each stored as an 8x8 grayscale image. We load the dataset with `load_digits`, and the returned object is a scikit-learn `Bunch`.

In [0]:
from sklearn.datasets import load_digits

digits_bunch = load_digits()
digits_bunch

## Bunch to DataFrame

We'll convert the scikit-learn `Bunch` into a Pandas `DataFrame` for ease of processing.

Columns 0 through 63 hold the pixel intensities of each digit drawing, and the *digit* column holds the digit that the image represents.

In [0]:
import pandas as pd

digits_df = pd.DataFrame(digits_bunch.data)
digits_df['digit'] = digits_bunch.target
digits_df
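
Each row's 64 pixel columns are a flattened 8x8 image stored row by row, so column `k` corresponds to the pixel at row `k // 8`, column `k % 8`. As a minimal sketch, using a synthetic row in place of real data (the `flat_row` values are made up for illustration), a row can be reshaped back into an image like this:

```python
import numpy as np

# Synthetic stand-in for one row of 64 pixel columns (illustrative values only).
flat_row = np.arange(64, dtype=float)

# Reshape the flat row back into an 8x8 grid, row by row.
image = flat_row.reshape(8, 8)

# Column k of the DataFrame maps to image[k // 8, k % 8].
k = 19
pixel = image[k // 8, k % 8]
print(image.shape, pixel)
```

With the real data, `digits_df.iloc[0, :64].to_numpy().reshape(8, 8)` should recover the first image in the same way.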

## Examine the Data



Let's take a quick look at our dataset. With `describe`, we can see that each pixel column seems to range between 0.0 and 16.0, with some columns having a smaller maximum value.

The *digit* column contains the labels/targets that we are interested in.

In [0]:
digits_df.describe()

Let's group the data by digit and count the number of occurrences of each digit. That way, we can see that the digits are discrete values between 0 and 9, and that the number of samples of each digit is roughly equal.

In [0]:
digits_df.groupby('digit')['digit'].agg('count')

## Normalize the Data

The pixel values for each image range from 0.0 to as much as 16.0. For many machine learning models, features with larger values have a larger impact on the model. In this case, it would mean giving much more weight to darker pixels.

An easy way to proactively prevent this is to normalize the feature data between 0.0 and 1.0. This can be done by dividing by the maximum value, 16.0.

In [0]:
digits_df.update(digits_df[digits_df.columns[0:64]] / 16.0)

In [0]:
# Ensure that the feature data is now normalized.
digits_df.describe()
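
As a small illustration of the same scaling on a toy frame (the `toy_df` name and its values are made up for this sketch), dividing the feature columns by the known maximum maps them onto the 0.0 to 1.0 range while leaving the label column untouched:

```python
import pandas as pd

# Toy frame: two "pixel" columns on a 0-16 scale plus a label column.
toy_df = pd.DataFrame({0: [0.0, 8.0, 16.0], 1: [4.0, 12.0, 16.0], 'digit': [3, 7, 1]})

# Scale only the feature columns; the label column must not be divided.
feature_cols = toy_df.columns[:-1]
toy_df[feature_cols] = toy_df[feature_cols] / 16.0

print(toy_df)
```

Dividing by a fixed, known maximum (rather than each column's own maximum) keeps every pixel on the same scale, including columns whose observed maximum is below 16.0.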

## Train/Test Split

It is now time to perform a train/test split on the data so that we can train a model in TensorFlow while retaining some data to test the model with.

Since the digits are roughly evenly distributed, we will use a stratified split in scikit-learn so that the training and test sets both keep that balance.

In [0]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
  digits_df,
  stratify=digits_df['digit'],  
  test_size=0.2,
)

In [0]:
train_df.shape

In [0]:
test_df.shape

Verify the stratification for training data:

In [0]:
train_df.groupby('digit')['digit'].agg('count')

And for the testing data:

In [0]:
test_df.groupby('digit')['digit'].agg('count')

The splits look pretty even across digits. Our stratification seems to have worked!
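
Raw counts are harder to compare because the two splits differ in size. A minimal sketch on toy labels (the `labels_df` frame and `random_state=0` are assumptions for illustration) compares class proportions instead:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 classes with 10 samples each.
labels_df = pd.DataFrame({'digit': [d for d in range(10) for _ in range(10)]})

toy_train, toy_test = train_test_split(
  labels_df,
  stratify=labels_df['digit'],
  test_size=0.2,
  random_state=0,  # fixed seed so the sketch is repeatable
)

# With stratification, each class makes up the same share of both splits.
train_share = toy_train['digit'].value_counts(normalize=True).sort_index()
test_share = toy_test['digit'].value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

The same `value_counts(normalize=True)` call can be applied to `train_df['digit']` and `test_df['digit']` to compare the real splits.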

## Feature Columns

We can now build a TensorFlow model for classification. The first step is to declare our feature columns. In this case, the features are the 64 pixel intensities.

Writing out 64 lines of code, one per pixel, would be tedious and error-prone. An easier route is to loop through the 64 pixel columns and append a feature column for each one to a `feature_columns` list.

In [0]:
from tensorflow.feature_column import numeric_column

feature_columns = []

for column_name in digits_df.columns[:-1]:
  feature_columns.append(numeric_column(str(column_name)))

feature_columns

## Classes

We have 10 classes, one for each digit 0-9. We could hardcode the value 10, but instead it is better practice to count the number of unique targets we have.

In [0]:
class_count = len(digits_df['digit'].unique())

class_count
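
Pandas also offers `Series.nunique()`, which counts distinct values directly. A small sketch on made-up labels (the `toy_labels` series is hypothetical):

```python
import pandas as pd

toy_labels = pd.Series([0, 1, 1, 2, 2, 2])

# Both expressions count the distinct label values.
assert len(toy_labels.unique()) == toy_labels.nunique()
print(toy_labels.nunique())
```

One difference to keep in mind: by default `nunique()` ignores missing values, while `unique()` includes them, so the two can disagree on data with gaps.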

## Create a Classifier

We now know our feature columns (the 64 pixel intensities) and know how many classes we need to identify (10). To build the classifier, we feed that data into the object constructor. In this case, we will use a [LinearClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearClassifier).

In [0]:
from tensorflow.estimator import LinearClassifier

classifier = LinearClassifier(feature_columns=feature_columns, n_classes=class_count)

## Train the Classifier

The next step is to train the classifier. To do that, we need to create an input function that feeds batches of features and labels to the classifier.

In [0]:
import tensorflow as tf

from tensorflow.data import Dataset

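def training_input():
  # Feature column names are strings, so key each pixel column by str(i).
  features = {}
  for i in range(64):
    features[str(i)] = train_df[i]

  labels = train_df['digit']

  training_ds = Dataset.from_tensor_slices((features, labels))
  # A buffer larger than the dataset gives a full shuffle.
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  # Make five passes over the training data.
  training_ds = training_ds.repeat(5)

  return training_ds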

classifier.train(training_input)
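
The batch size and repeat count determine how many training steps run before the input is exhausted. As a rough worked example, assuming the 80/20 split above leaves 1,437 training rows (1,797 total minus a 360-row test set), the arithmetic is:

```python
import math

train_rows = 1437   # assumed size of train_df after the 80/20 split
batch_size = 100    # from training_ds.batch(100)
epochs = 5          # from training_ds.repeat(5)

# Each pass yields 14 full batches plus one partial batch of 37 rows.
batches_per_epoch = math.ceil(train_rows / batch_size)
total_steps = batches_per_epoch * epochs
print(batches_per_epoch, total_steps)
```

Because `repeat` comes after `batch`, the partial batch appears at the end of every pass rather than only once at the very end.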

What was the final loss? The TensorFlow `LinearClassifier` loss is calculated by using [softmax cross entropy](https://deepnotes.io/softmax-crossentropy).
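
To make the loss concrete, here is a minimal NumPy sketch of softmax cross entropy for a single example. The logits and label are made-up values, and this illustrates the formula rather than TensorFlow's internal implementation:

```python
import numpy as np

def softmax_cross_entropy(logits, label):
  """Cross entropy between softmax(logits) and a one-hot label."""
  # Subtract the max for numerical stability; this does not change the result.
  shifted = logits - np.max(logits)
  log_probs = shifted - np.log(np.sum(np.exp(shifted)))
  return -log_probs[label]

logits = np.array([2.0, 1.0, 0.1])
loss = softmax_cross_entropy(logits, label=0)
print(round(float(loss), 3))
```

A confident, correct prediction drives this loss toward 0, while a confident, wrong prediction makes it large.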

## Make Test Predictions

The next step is to make some test predictions. For this we need to create an input function that returns only features, with no labels. These features shouldn't be shuffled or repeated.

The `predict` method returns an iterator over predictions, which we convert to a list so that we can inspect it.

In [0]:
def testing_input():
  # Same string-keyed features as in training, but no labels.
  features = {}
  for i in range(64):
    features[str(i)] = test_df[i]
  return Dataset.from_tensor_slices(features).batch(1)

predictions_iterator = list(classifier.predict(testing_input))

predictions_iterator

## Examine the Predictions

The predictions returned from the classifier are Python dictionaries whose keys include 'logits', 'probabilities', 'class_ids', and 'classes'.

- The 'class_ids' and 'classes' values are single-element arrays containing the predicted class, as an integer ID and as a byte string, respectively.

- The 'probabilities' value contains the probability that each class applies to the data point. The higher the probability, the more likely the class is applicable.

- The 'logits' value contains the [logit](https://en.wikipedia.org/wiki/Logit) for each class, which is a raw, unnormalized score. Applying softmax to the logits produces the probabilities, so the class with the largest logit is the predicted class.
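
The relationship between these values can be sketched with NumPy. The logits below are made up, and the sketch assumes the usual softmax link between logits and probabilities:

```python
import numpy as np

# Made-up logits for a 3-class problem (the real model has 10 classes).
logits = np.array([-1.2, 3.4, 0.5])

# Softmax turns logits into probabilities that sum to 1.
exp = np.exp(logits - logits.max())
probabilities = exp / exp.sum()

# The predicted class is the index of the largest probability (and largest logit).
class_id = int(np.argmax(probabilities))
print(probabilities.round(3), class_id)
```

Note that logits are not bounded: they can be negative or well above 1, and only their relative sizes matter.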

In [0]:
for p in predictions_iterator:
  print(p.keys())
  print(p['logits'])
  print(p['probabilities'])
  print(p['class_ids'])
  print(p['classes'])
  break

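Because the predictions were collected into a list, this loop always prints the first prediction. To examine others, index into the list directly, for example `predictions_iterator[5]`.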

Next, we'll run the predictions again and extract the predicted class ID for each test image into a list.

In [0]:
predictions_iterator = classifier.predict(testing_input)

predictions = [p['class_ids'][0] for p in predictions_iterator]

Using these predictions, we can calculate the precision...

In [0]:
from sklearn.metrics import precision_score

precision_score(test_df['digit'], predictions, average='micro')

And recall.

In [0]:
from sklearn.metrics import recall_score

recall_score(test_df['digit'], predictions, average='micro')
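
Note that for single-label multiclass problems like this one, micro-averaged precision and micro-averaged recall are both equal to accuracy, so the two cells above report the same number. A small sketch with made-up labels shows this, along with macro averaging, which weights every class equally:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true and predicted labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

micro_precision = precision_score(y_true, y_pred, average='micro')
micro_recall = recall_score(y_true, y_pred, average='micro')
accuracy = accuracy_score(y_true, y_pred)

# Macro averaging computes the metric per class, then takes the unweighted mean.
macro_precision = precision_score(y_true, y_pred, average='macro')
macro_recall = recall_score(y_true, y_pred, average='macro')

print(micro_precision, micro_recall, accuracy)
print(macro_precision, macro_recall)
```

Macro averages can differ from the micro values when some classes are predicted better than others, which makes them a useful second view of the same predictions.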

# Exercises

## Exercise 1

TensorFlow has a [DNNClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNClassifier) estimator that can also perform classifications. The estimator relies on a deep neural network.

Try using the DNNClassifier instead of the [LinearClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearClassifier) to identify the digits that we used in our example code above. Play around with some settings, such as `hidden_units`, to see if they have any effect on the model.

Try using the data as-is and normalized. Do you see any effect?

### Student Solution

In [0]:
# Your code goes here