# Instructor Turn Activity 1 One-Hot Encoding

## Preprocessing Review
Why do we **preprocess** data when we build machine learning pipelines?

We preprocess data for two principal reasons:

1. To transform the data to better suit a model's underlying assumptions.
2. To format the data in the way a model expects.

Today, we're concerned with this second reason.

## Inputs to Neural Networks
What does the input to a neural network look like?
Inputs to neural networks are **vectors**. Each entry in the vector corresponds to a feature, which the net uses to make predictions.

Crucially, these vectors can contain only _numerical_ data. They _cannot_ contain string data.

In [None]:
# Good!
good_input_row1 = [1.3, 2.2, 5.4, 5.8, 0]
good_input_row2 = [1.3, 2.2, 5.4, 5.8, 1]

In [None]:
# Bad...
bad_input_row1 = [1.3, 2.2, 5.4, 5.8, 'dog']
bad_input_row2 = [1.3, 2.2, 5.4, 5.8, 'cat']
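
As a quick check of the claim above, the sketch below tries to convert both kinds of rows to a float array with NumPy. The variable names (`good_row`, `bad_row`, `conversion_failed`) are illustrative and not part of the original notebook.

```python
import numpy as np

good_row = [1.3, 2.2, 5.4, 5.8, 0]
bad_row = [1.3, 2.2, 5.4, 5.8, 'dog']

# A purely numerical row converts to a float vector without trouble
good_array = np.array(good_row, dtype=float)

# A row containing a string cannot be converted to floats
try:
    np.array(bad_row, dtype=float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(good_array)
print("String row rejected:", conversion_failed)
```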

## One-Hot Encoding

This poses a problem when we want to train a neural network on categorical data, such as the classic [Iris data set](https://archive.ics.uci.edu/ml/datasets/Iris).

![](https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg)

In [None]:
import pandas as pd

# Read from: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=names)

In [None]:
# Note the `Iris-virginica` entries in the `class` column
df.tail(5)

Note that all of our data is numerical, _except_ for the data in the `class` column, which contains strings.

The `class` column will contain one of three values:

1. `Iris-setosa`
2. `Iris-versicolor`
3. `Iris-virginica`

As these are not numerical values, we can't use them to fit our neural network. To fix this, we must convert each class label to a numerical value.

We do this via the following steps:

1. **Label Encoding**. First, we convert the three possible classes to integer labels. E.g., `Iris-setosa` will be `0`; `Iris-versicolor`, `1`; and `Iris-virginica`, `2`.
2. **One-Hot Encoding**. Then, we set each row's `class` value to an _array_. This array will have a `1` in whichever slot corresponds to the integer label. E.g., after one-hot encoding, a row with the class `Iris-setosa` will have the array `[1, 0, 0]`; a row with class `Iris-virginica`, the array `[0, 0, 1]`; etc.

In many cases, categories in the data sets you work with will already be label-encoded. In this case, you can apply one-hot encoding immediately.
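
The two steps above can be sketched by hand with plain NumPy. This is only an illustration, assuming the label spellings found in `iris.data` and zero-based integer labels; the names `class_to_label`, `rows`, and `one_hot` are made up for this example.

```python
import numpy as np

classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Step 1: label encoding, mapping each class name to an integer
class_to_label = {name: index for index, name in enumerate(classes)}
rows = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor']
labels = np.array([class_to_label[name] for name in rows])

# Step 2: one-hot encoding, picking row `label` of the identity matrix
one_hot = np.eye(len(classes), dtype=int)[labels]

print(labels)
print(one_hot)
```

Indexing the identity matrix works because row `k` of it is exactly the vector with a `1` in slot `k` and zeros elsewhere.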

## Applying One-Hot Encoding

In [None]:
# Step 0: Reformat data
data = df.values
X = data[:, 0:4]
y = data[:, 4]

In [None]:
from sklearn.preprocessing import LabelEncoder

# Step 1: Label-encode data set
label_encoder = LabelEncoder()
label_encoder.fit(y)
encoded_y = label_encoder.transform(y)

In [None]:
for label, original_class in zip(encoded_y, y):
    print('Original Class: ' + str(original_class))
    print('Encoded Label: ' + str(label))
    print('-' * 12)

Note that each of the original labels has been replaced with an integer.
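
To see where such integers can come from, here is a NumPy-only sketch of the same idea: sort the distinct class names and replace each entry with its position in that sorted list. It is a stand-in for illustration, not `LabelEncoder`'s actual implementation, and `y_sample` is a made-up array.

```python
import numpy as np

y_sample = np.array(['Iris-virginica', 'Iris-setosa',
                     'Iris-versicolor', 'Iris-setosa'])

# `classes` holds the sorted distinct names; `encoded` holds each row's index into it
classes, encoded = np.unique(y_sample, return_inverse=True)

# Indexing back into `classes` recovers the original strings
decoded = classes[encoded]

print(classes)
print(encoded)
print(decoded)
```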

In [None]:
from keras.utils import to_categorical

# Step 2: One-hot encoding
one_hot_y = to_categorical(encoded_y)
one_hot_y
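
One-hot encoding is reversible: the position of the `1` in each row is the integer label. A minimal sketch, using a small hand-written array (`one_hot_sample`) rather than the notebook's `one_hot_y`:

```python
import numpy as np

one_hot_sample = np.array([[1., 0., 0.],
                           [0., 0., 1.],
                           [0., 1., 0.]])

# The index of the largest entry in each row is the original integer label
recovered_labels = np.argmax(one_hot_sample, axis=1)
print(recovered_labels)
```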

# Everyone Activity 2 Neural Networks with Keras

In [None]:
!pip install keras
!pip install tensorflow

In [None]:
# Generate some fake data with 3 features

from sklearn.datasets import make_classification

X, y = make_classification(n_features=3, n_redundant=0, n_informative=3,
                           random_state=42, n_classes=2, n_clusters_per_class=1)

y = y.reshape(-1, 1)

print(X.shape)
print(y.shape)

Use `train_test_split` to create training and testing data.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Data Preprocessing

It is important to scale our data before training multilayer perceptron models.

Without scaling, features with very different ranges often make it difficult for training to converge.

In [None]:
from sklearn.preprocessing import StandardScaler

X_scaler = StandardScaler().fit(X_train)

Remember to scale both the training and testing data, using the scaler that was fit on the training data.

In [None]:
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
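
For intuition, standard scaling can be written out by hand: subtract each feature's training mean and divide by its training standard deviation. The sketch below uses made-up random data (`X_train_demo`, `X_test_demo`) and plain NumPy; it is meant to illustrate the arithmetic, not to replace `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unscaled data: features centered near 50 with spread near 10
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# "Fit": compute the statistics from the training data only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

# "Transform": apply the same training statistics to both sets
X_train_demo_scaled = (X_train_demo - train_mean) / train_std
X_test_demo_scaled = (X_test_demo - train_mean) / train_std

print(X_train_demo_scaled.mean(axis=0))
print(X_train_demo_scaled.std(axis=0))
```

The scaled training features have mean 0 and standard deviation 1 (up to floating-point error). The test features will only be approximately so, because they are scaled with the training statistics.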

One-hot encode the labels

In [None]:
from keras.utils import to_categorical

# One-hot encoding
y_train_categorical = to_categorical(y_train)
y_test_categorical = to_categorical(y_test)
y_train_categorical

## Creating our Model

We must first decide what kind of model to apply to our data.

When the target we want to predict is numerical, we use a regressor model.

When the target is categorical, we use a classifier model.

In this example, we will use a classifier to build the following network:

![nnet.png](Images/nnet.png)

## Defining our Model Architecture (the layers)

We first need to create a sequential model

In [None]:
from keras.models import Sequential

model = Sequential()

Next, we add our first layer. This layer requires you to specify both the number of inputs and the number of nodes that you want in the hidden layer.

In [None]:
from keras.layers import Dense
number_inputs = 3
number_hidden_nodes = 4
model.add(Dense(units=number_hidden_nodes,
                activation='relu', input_dim=number_inputs))

![first_layer](Images/nnet_first_layer.png)

Our final layer is the output layer. Here, we need to specify the activation function (typically `softmax` for classification) and the number of classes (labels) that we are trying to predict (2 in this example).

In [2]:
number_classes = 2
model.add(Dense(units=number_classes, activation='softmax'))


![output_layer](Images/nnet_output_layer.png)
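
To illustrate what a `softmax` output layer produces, the sketch below applies the softmax formula to two made-up raw scores (`logits`). It is a hand-written NumPy version for intuition, not Keras's own code.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    shifted = logits - np.max(logits)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

# Hypothetical raw scores for the two classes
logits = np.array([2.0, 0.5])
probabilities = softmax(logits)

print(probabilities)
print(probabilities.sum())
```

The outputs are positive and sum to 1, so they can be read as class probabilities; the class with the larger raw score gets the larger probability.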

## Model Summary

In [4]:
model.summary()
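
The parameter counts in the summary can be checked by hand. A dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_parameter_count` below is a hypothetical name used only for this arithmetic check.

```python
def dense_parameter_count(n_inputs, n_units):
    # One weight for every (input, unit) pair, plus one bias per unit
    return n_inputs * n_units + n_units

hidden_params = dense_parameter_count(n_inputs=3, n_units=4)  # 3*4 + 4 = 16
output_params = dense_parameter_count(n_inputs=4, n_units=2)  # 4*2 + 2 = 10
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```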


## Compile the Model

Now that we have our model architecture defined, we must compile the model using a loss function and optimizer. We can also specify additional training metrics such as accuracy.
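
For intuition about the loss used in this example, here is a hand-written NumPy sketch of categorical cross-entropy on made-up predictions. The function and the arrays (`confident`, `unsure`) are illustrative assumptions, not Keras internals.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, epsilon=1e-7):
    # Clip to avoid log(0), then average the per-row losses
    y_pred = np.clip(y_pred, epsilon, 1.0 - epsilon)
    return -np.sum(y_true * np.log(y_pred), axis=1).mean()

# One-hot true labels for two rows
y_true = np.array([[1., 0.],
                   [0., 1.]])

# Hypothetical predicted probabilities
confident = np.array([[0.9, 0.1], [0.2, 0.8]])
unsure = np.array([[0.6, 0.4], [0.5, 0.5]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(loss_confident, loss_unsure)
```

The loss is smaller when the model puts more probability on the correct class, which is what the optimizer tries to achieve during training.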

In [5]:
# Use categorical crossentropy for categorical data and mean squared error for regression
# Hint: your output layer in this example is using softmax for logistic regression (categorical)
# If your output layer activation was `linear` then you may want to use `mse` for loss
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


## Training the Model
Finally, we train our model using our training data

Training consists of updating our weights using our optimizer and loss function. In this example, we train for 1000 epochs, where one epoch is one full pass over the training data.

We also choose to shuffle our training data and increase the detail printed out during each training cycle.
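
To make "epochs" and "shuffling" concrete, the sketch below trains a tiny logistic-regression model with plain mini-batch gradient descent on made-up data. It is a simplified stand-in: Keras's `adam` optimizer is more elaborate, and all names and values here (`X_demo`, `learning_rate`, `batch_size`, `n_epochs`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 rows, 3 features, binary labels
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob, epsilon=1e-7):
    y_prob = np.clip(y_prob, epsilon, 1.0 - epsilon)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

weights = np.zeros(3)
learning_rate = 0.1
batch_size = 20
n_epochs = 50
losses = []

for epoch in range(n_epochs):
    # Shuffle: visit the rows in a new random order every epoch
    order = rng.permutation(len(X_demo))
    for start in range(0, len(X_demo), batch_size):
        batch = order[start:start + batch_size]
        predictions = sigmoid(X_demo[batch] @ weights)
        # Gradient of the log loss with respect to the weights
        gradient = X_demo[batch].T @ (predictions - y_demo[batch]) / len(batch)
        weights -= learning_rate * gradient
    # Record the loss on the full data set after each epoch
    losses.append(log_loss(y_demo, sigmoid(X_demo @ weights)))

print(f"Loss after first epoch: {losses[0]:.3f}")
print(f"Loss after last epoch:  {losses[-1]:.3f}")
```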

In [11]:
# Fit (train) the model
model.fit(
    X_train_scaled,
    y_train_categorical,
    epochs=1000,
    shuffle=True,
    verbose=2
)


## Quantifying the Model
We use our testing data to evaluate our model. This measures how well the model generalizes (i.e., its ability to predict new and previously unseen data points).
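
The accuracy metric itself is simple to compute by hand: take the most probable class for each row and compare it with the true class. A minimal sketch with made-up one-hot labels and predicted probabilities:

```python
import numpy as np

# Hypothetical one-hot test labels and model output probabilities
y_true_one_hot = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
predicted_probabilities = np.array([[0.8, 0.2],
                                    [0.3, 0.7],
                                    [0.6, 0.4],
                                    [0.9, 0.1]])

true_classes = np.argmax(y_true_one_hot, axis=1)
predicted_classes = np.argmax(predicted_probabilities, axis=1)

# Fraction of rows where the predicted class matches the true class
accuracy = float(np.mean(predicted_classes == true_classes))
print(f"Accuracy: {accuracy}")
```

Here three of the four rows are classified correctly, so the accuracy is 0.75.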

In [10]:
# Evaluate the model using the testing data
model_loss, model_accuracy = model.evaluate(
    X_test_scaled, y_test_categorical, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")


## Making Predictions with new data

We can use our trained model to make predictions using `model.predict`

In [12]:
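import numpy as np
new_data = np.array([[0.2, 0.3, 0.4]])
# `predict_classes` is no longer available in recent Keras/TensorFlow releases;
# take the index of the highest predicted probability instead
predicted_class = np.argmax(model.predict(new_data), axis=-1)
print(f"Predicted class: {predicted_class}")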
