# Classify structured data using TensorFlow

This tutorial demonstrates how to classify structured data, such as tabular data, using a simplified version of the <a href="https://www.kaggle.com/c/petfinder-adoption-prediction" class="external">PetFinder dataset from a Kaggle competition</a> stored in a CSV file.

You will use Keras to define and train the model. The goal is to predict whether a pet will be adopted.


## The PetFinder.my mini dataset

There are several thousand rows in the PetFinder.my mini's CSV dataset file, where each row describes a pet (a dog or a cat) and each column describes an attribute, such as age, breed, color, and so on.

In the dataset's summary below, notice there are mostly numerical and categorical columns. In this tutorial, you will only be dealing with those two feature types, dropping `Description` (a free text feature) and `AdoptionSpeed` (a classification feature) during data preprocessing.

| Column          | Pet description               | Feature type   | Data type |
| --------------- | ----------------------------- | -------------- | --------- |
| `Type`          | Type of animal (`Dog`, `Cat`) | Categorical    | String    |
| `Age`           | Age                           | Numerical      | Integer   |
| `Breed1`        | Primary breed                 | Categorical    | String    |
| `Gender`        | Gender                        | Categorical    | String    |
| `Color1`        | Color 1                       | Categorical    | String    |
| `Color2`        | Color 2                       | Categorical    | String    |
| `MaturitySize`  | Size at maturity              | Categorical    | String    |
| `FurLength`     | Fur length                    | Categorical    | String    |
| `Vaccinated`    | Pet has been vaccinated       | Categorical    | String    |
| `Sterilized`    | Pet has been sterilized       | Categorical    | String    |
| `Health`        | Health condition              | Categorical    | String    |
| `Fee`           | Adoption fee                  | Numerical      | Integer   |
| `Description`   | Profile write-up              | Text           | String    |
| `PhotoAmt`      | Total uploaded photos         | Numerical      | Integer   |
| `AdoptionSpeed` | Categorical speed of adoption | Classification | Integer   |

## Import TensorFlow and other libraries


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import sklearn
import matplotlib.pyplot as plt

SEED = 12
np.random.seed(SEED)
tf.random.set_seed(SEED)

## Load the dataset and read it into a pandas DataFrame

Use `tf.keras.utils.get_file` to download and extract the CSV file with the PetFinder.my mini dataset, and load it into a dataframe.


In [None]:
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
data = pd.read_csv(csv_file)

Inspect the dataset by checking the first five rows of the DataFrame:

In [None]:
data.head()

## Create a target variable

The `AdoptionSpeed` column contains the speed at which a pet was adopted, encoded as follows:


| Value | Description                                                                  |
| ----- | ---------------------------------------------------------------------------- |
| 0     | Pet was adopted on the same day as it was listed.                            |
| 1     | Pet was adopted between 1 and 7 days (1st week) after being listed.          |
| 2     | Pet was adopted between 8 and 30 days (1st month) after being listed.        |
| 3     | Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. |
| 4     | No adoption after 100 days of being listed.                                  |

In this tutorial, you will simplify this into a binary classification problem: predicting whether a pet was adopted or not.


---

***Task: 1***

Create a new column called `target` that contains `0` when the pet was not adopted (an `AdoptionSpeed` of `4`) and `1` when it was.

Drop the columns `AdoptionSpeed` and `Description`.

---
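One possible approach to this task is sketched below on a small toy DataFrame rather than the real dataset. The toy values are made up for illustration; the column names match the dataset summary above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the PetFinder DataFrame (hypothetical values).
toy = pd.DataFrame({
    "Age": [3, 12, 1, 24],
    "Description": ["Playful", "Calm", "Shy", "Friendly"],
    "AdoptionSpeed": [0, 4, 2, 4],
})

# AdoptionSpeed == 4 means "not adopted", so it maps to 0; every other value maps to 1.
toy["target"] = np.where(toy["AdoptionSpeed"] == 4, 0, 1)

# Drop the columns that are not used as features.
toy = toy.drop(columns=["AdoptionSpeed", "Description"])

print(toy)
```

The same two statements should apply to `data` once it has been loaded.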

In [None]:
################
### Solution ###
################

## Inspecting the data closer

In [None]:
numeric_columns = ['Age', 'Fee', 'PhotoAmt']
categoric_columns = ['Type', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize',
       'FurLength', 'Vaccinated', 'Sterilized', 'Health']

In [None]:
for cat_cols in categoric_columns:
  print(data[cat_cols].value_counts())
  print("--------------------------------")

---

***Task: 2***

Since the `Breed1` column has 166 different values, replace the less frequent breeds (those occurring fewer than 100 times) with the category `Rare`.

---
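A minimal sketch of the rare-category replacement, assuming a toy `Breed1` column and a hypothetical threshold `MIN_COUNT` of 3 (the task itself uses 100):

```python
import pandas as pd

# Toy stand-in for the Breed1 column (hypothetical values).
toy = pd.DataFrame({
    "Breed1": ["Mixed Breed"] * 5 + ["Tabby"] * 3 + ["Poodle", "Beagle"],
})

MIN_COUNT = 3  # hypothetical threshold for the toy data; the task uses 100

# Count how often each breed occurs and collect the infrequent ones.
counts = toy["Breed1"].value_counts()
rare_breeds = counts[counts < MIN_COUNT].index

# Keep frequent breeds as they are and replace the rest with "Rare".
toy["Breed1"] = toy["Breed1"].where(~toy["Breed1"].isin(rare_breeds), "Rare")

print(toy["Breed1"].value_counts())
```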

In [None]:
################
### Solution ###
################

data['Breed1'].value_counts()

## Categorical columns

In [None]:
data.head()

---

***Task: 3***

For each categorical feature in the dataset, convert it to a one-hot encoded feature. The new column names should be in a readable format such as `is_Type_Cat`, `is_Type_Dog`, and so on.

---
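One way to get this naming scheme is `pd.get_dummies` with a per-column prefix. The sketch below assumes a toy DataFrame with two categorical columns; the `float32` dtype is a choice made here so the result can be fed to a Keras model, not something the task requires.

```python
import pandas as pd

# Toy stand-in for the PetFinder DataFrame (hypothetical values).
toy = pd.DataFrame({
    "Type": ["Cat", "Dog", "Dog"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [3, 12, 1],
})

categorical = ["Type", "Gender"]

# Map each column to its prefix, e.g. "Type" -> "is_Type".
prefixes = {col: f"is_{col}" for col in categorical}

# get_dummies joins prefix and category with "_", giving names like "is_Type_Cat".
encoded = pd.get_dummies(toy, columns=categorical, prefix=prefixes, dtype="float32")

print(encoded.columns.tolist())
```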

In [None]:
################
### Solution ###
################

data.head()

## Split the DataFrame into training, validation, and test sets

The dataset is in a single pandas DataFrame. Split it into training, validation, and test sets:

---

***Task: 4***

Split the dataset in an 80:10:10 ratio into `train`, `val`, and `test`.

---
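A minimal sketch of an 80:10:10 split, assuming a toy DataFrame of 100 rows. It shuffles the rows once and then slices by position; `sklearn.model_selection.train_test_split` applied twice would be another option.

```python
import numpy as np
import pandas as pd

SEED = 12

# Toy stand-in for the preprocessed DataFrame (hypothetical values).
toy = pd.DataFrame({"feature": np.arange(100), "target": np.arange(100) % 2})

# Shuffle all rows reproducibly, then cut at the 80% and 90% marks.
shuffled = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
train_end = int(0.8 * len(shuffled))
val_end = int(0.9 * len(shuffled))

train = shuffled.iloc[:train_end].copy()
val = shuffled.iloc[train_end:val_end].copy()
test = shuffled.iloc[val_end:].copy()

print(len(train), len(val), len(test))
```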

In [None]:
################
### Solution ###
################

In [None]:
print(f'Training examples: {len(train)}')
print(f'Validation examples: {len(val)}')
print(f'Test examples: {len(test)}')

## Numerical columns

For each numeric feature in the dataset, standardize the distribution of the data.

---

***Task: 5***

Complete the `get_normalized_data` function by normalizing the columns given in the list `column_names`.

---
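The sketch below shows one way the function could be filled in, using z-score standardization (subtract the mean, divide by the standard deviation). It uses the statistics of whichever DataFrame is passed in, which matches how the function is called once per split in this tutorial; reusing the training set's mean and standard deviation for the other splits is a common alternative. The toy values are hypothetical.

```python
import pandas as pd

def get_normalized_data(df, column_names):
  # Work on a copy so the caller's DataFrame is not modified in place.
  df = df.copy()
  for name in column_names:
    mean = df[name].mean()
    std = df[name].std()  # pandas uses the sample standard deviation (ddof=1)
    # Assumes std is non-zero, i.e. the column is not constant.
    df[name] = (df[name] - mean) / std
  return df

# Toy stand-in for one of the splits (hypothetical values).
toy = pd.DataFrame({"Age": [1, 2, 3], "Fee": [0, 50, 100], "target": [1, 0, 1]})
normalized = get_normalized_data(toy, ["Age", "Fee"])

print(normalized)
```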

In [None]:
def get_normalized_data(df, column_names):
  ################
  ### Solution ###
  ################

  return df

In [None]:
train = get_normalized_data(train, ['Age', 'Fee', 'PhotoAmt'])
val = get_normalized_data(val, ['Age', 'Fee', 'PhotoAmt'])
test = get_normalized_data(test, ['Age', 'Fee', 'PhotoAmt'])

Separate the features and the targets:

In [None]:
y_train = train.pop('target')
X_train = train

y_val = val.pop('target')
X_val = val

y_test = test.pop('target')
X_test = test



In [None]:
print(f"Feature shape: {X_train.shape}")
print(f"Target shape: {y_train.shape}")

## Create, compile, and train the model


The next step is to create a model using the Keras `Sequential` API and configure it with `Model.compile`.

---

***Task: 6***

Complete the `create_model` function by building and compiling the model. Use 2 dense hidden layers and a dropout layer. Also choose an appropriate optimizer and loss function, and configure the model to log accuracy as a metric.

---
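A minimal sketch of such a model follows. The layer sizes, dropout rate, and the `num_features` parameter are assumptions made here for illustration; in the tutorial, `create_model` takes no arguments, so the input width would come from `X_train.shape[1]` instead. Binary cross-entropy with a sigmoid output is one reasonable pairing for a 0/1 target.

```python
import tensorflow as tf

def create_model(num_features):
  # num_features is a hypothetical parameter; in the notebook it could be X_train.shape[1].
  model = tf.keras.Sequential([
      tf.keras.Input(shape=(num_features,)),
      tf.keras.layers.Dense(32, activation="relu"),    # first hidden layer
      tf.keras.layers.Dropout(0.5),                    # regularization
      tf.keras.layers.Dense(16, activation="relu"),    # second hidden layer
      tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of adoption
  ])
  model.compile(optimizer="adam",
                loss="binary_crossentropy",
                metrics=["accuracy"])
  return model

model = create_model(num_features=4)
model.summary()
```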

In [None]:
def create_model():
  ################
  ### Solution ###
  ################
  return model

model = create_model()
model.summary()

Next, train and test the model:

In [None]:
NUM_EPOCHS = 5

training_info = model.fit(x=X_train, y=y_train, epochs=NUM_EPOCHS, validation_data=(X_val, y_val))

In [None]:
loss_history = training_info.history['loss']
accuracy_history = training_info.history['accuracy']

val_loss_history = training_info.history['val_loss']
val_accuracy_history = training_info.history['val_accuracy']

---

***Task: 7***

Plot the loss and accuracy curves for both training and validation in 2 subplots.

---
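One way to lay out the two subplots is sketched below. `plot_history` is a hypothetical helper, and the history values are made up; in the notebook, `training_info.history` could be passed in instead.

```python
import matplotlib.pyplot as plt

def plot_history(history):
  # history is a dict with the keys Keras records: loss, val_loss, accuracy, val_accuracy.
  epochs = range(1, len(history["loss"]) + 1)
  fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))

  ax_loss.plot(epochs, history["loss"], label="Training")
  ax_loss.plot(epochs, history["val_loss"], label="Validation")
  ax_loss.set_title("Loss")
  ax_loss.set_xlabel("Epoch")
  ax_loss.legend()

  ax_acc.plot(epochs, history["accuracy"], label="Training")
  ax_acc.plot(epochs, history["val_accuracy"], label="Validation")
  ax_acc.set_title("Accuracy")
  ax_acc.set_xlabel("Epoch")
  ax_acc.legend()

  return fig

# Toy history (hypothetical values) standing in for training_info.history.
toy_history = {
    "loss": [0.70, 0.62, 0.58],
    "val_loss": [0.68, 0.63, 0.61],
    "accuracy": [0.55, 0.66, 0.71],
    "val_accuracy": [0.57, 0.64, 0.68],
}
fig = plot_history(toy_history)
```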

In [None]:
################
### Solution ###
################

In [None]:
loss, accuracy = model.evaluate(x=X_test, y=y_test)
print("Accuracy", accuracy)

---

***Task: 8***

Set up an early stopping callback that stops training if the validation loss has not decreased for 5 consecutive epochs. Create and train the model again with this callback.

---
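The sketch below shows how `tf.keras.callbacks.EarlyStopping` could be wired into `Model.fit`. It trains a small stand-in model on random toy data so that it runs on its own; the data, the model, and the optional `restore_best_weights=True` setting are assumptions made here, not part of the task.

```python
import numpy as np
import tensorflow as tf

NUM_EPOCHS = 50

# Toy data standing in for the training and validation sets (hypothetical values).
rng = np.random.default_rng(12)
X_toy = rng.normal(size=(200, 4)).astype("float32")
y_toy = (X_toy[:, 0] > 0).astype("float32")

# Small stand-in model; in the notebook this would be create_model().
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once val_loss has gone 5 epochs without improving.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

training_info = model.fit(
    X_toy[:160], y_toy[:160],
    validation_data=(X_toy[160:], y_toy[160:]),
    epochs=NUM_EPOCHS,
    callbacks=[early_stopping],
    verbose=0)

print(f"Trained for {len(training_info.history['val_loss'])} epochs")
```

`NUM_EPOCHS` then acts as an upper bound: training ends earlier whenever the callback triggers.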


In [None]:
NUM_EPOCHS = 50

model = create_model()

################
### Solution ###
################