Inga Ulusoy, Computational modelling in python, SoSe2020 

# A neural network using data from the periodic table: Regression

We will now build a NN using data from the periodic table and make predictions about the atomic radius. In a regression problem, the output of the NN is a continuous value (a float).

In [None]:
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
import numpy as np

## The example dataset

We will train a model to predict the covalent radius of an atom, given its atomic weight, van der Waals radius and atomic number. As training data, we will use a subset of the periodic table.

The dataset is taken from here:

https://www.kaggle.com/jwaitze/tablesoftheelements

On Kaggle, you can find many other datasets to train on.

In [None]:
#read in the dataset
#if the read fails for you, you may try different encoding
#there will be some ? in the column with 'pronounciation'
#
#
#test_df = pd.read_csv('periodic_table_with_all_units.csv',encoding = 'ISO-8859-1',engine='python')
#test_df = pd.read_csv('periodic_table_with_all_units.csv',encoding = 'utf-8',engine='python')
test_df = pd.read_csv('periodic_table_with_all_units.csv',encoding = 'cp1252',engine='python')
pd.set_option('display.max_columns', None)
test_df.head(10)

This has way too much information. We will create a new dataframe with only the columns of interest.

In [None]:
mydata = test_df[['number', 'name', 'symbol', 'element_category',
                  'atomic_weight', 'electron_configuration',
                  'covalent_radius', 'van_der_waals_radius']].copy()
mydata.head(10)

In [None]:
mydata.describe()

Now we need to clean up the data a little bit.

In [None]:
#remove the parenthesis (xx) in atomic weight
print(mydata.atomic_weight.head(10))
old_data = mydata.atomic_weight.values
new_data = mydata.atomic_weight.str.split('(').str[0]
mydata.atomic_weight = mydata.atomic_weight.replace(old_data,new_data)
print(mydata.atomic_weight.tail(10))

In [None]:
#remove trailing spaces from the strings
mydata.atomic_weight = mydata.atomic_weight.str.strip()
#remove the [] in atomic_weight
#regex=False treats the brackets as literal characters
mydata['atomic_weight'] = mydata['atomic_weight'].str.replace('[', '', regex=False)
mydata['atomic_weight'] = mydata['atomic_weight'].str.replace(']', '', regex=False)
#convert to float
mydata.atomic_weight = pd.to_numeric(mydata.atomic_weight)
mydata.atomic_weight.tail(3)

In [None]:
#remove the parenthesis (xx) in covalent radius
print(mydata.covalent_radius.head(10))
old_data = mydata.covalent_radius.values
new_data = mydata.covalent_radius.str.split('(').str[0]
mydata.covalent_radius = mydata.covalent_radius.replace(old_data,new_data)
print(mydata.covalent_radius.head(10))

In [None]:
#for the covalent radius we need only the first number
#as the formatting is not consistent, pandas read this column as 'object' datatype
#(basically the entries have different datatypes, some are lists, some are strings)
#first we make all the entries strings
old_data = mydata.covalent_radius.values
new_data = mydata.covalent_radius.str.join('')
#now we split them at the +-
new_data = new_data.str.split('±').str[0]
#we further remove the 'pm'
new_data = new_data.str.split(' ').str[0]
#we further remove the '-'
new_data = new_data.str.split('–').str[0]
mydata.covalent_radius = mydata.covalent_radius.replace(old_data,new_data)
#convert to float and put nan for non-numeric values
mydata.covalent_radius = pd.to_numeric(mydata.covalent_radius,errors='coerce')
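
The string clean-up in the cells above can also be written with a single regular expression. The following is only a minimal sketch and not the method used in this notebook: the helper name `leading_number` and the sample strings are made up for illustration, and it assumes that every usable entry starts with its numeric value, possibly after an opening bracket.

```python
import pandas as pd

def leading_number(series):
    """Extract the first decimal number of each string; NaN if none is found."""
    # Optional opening bracket, then digits with an optional decimal part.
    pattern = r'^\s*\[?\s*(\d+(?:\.\d+)?)'
    extracted = series.astype(str).str.extract(pattern, expand=False)
    return pd.to_numeric(extracted, errors='coerce')

# Hypothetical samples imitating the formats seen in the table.
samples = pd.Series(['1.008(1)', '[209]', '31±5 pm', '66–77 pm', 'unknown'])
cleaned = leading_number(samples)
print(cleaned.tolist())
```

Entries without a leading number become NaN and would be removed by the later `dropna()` call.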

In [None]:
#remove the 'pm' in van_der_waals_radius
print(mydata.van_der_waals_radius.head(5))
old_data = mydata.van_der_waals_radius.values
new_data = mydata.van_der_waals_radius.str.split('p').str[0]
mydata.van_der_waals_radius = mydata.van_der_waals_radius.replace(old_data,new_data)
#convert to float and put nan for non-numeric values
mydata.van_der_waals_radius = pd.to_numeric(mydata.van_der_waals_radius.str.strip(),errors='coerce')
print(mydata.van_der_waals_radius.head(5))

In [None]:
print(mydata.number.values)
print(mydata.atomic_weight.values)
print(mydata.covalent_radius.values)
print(mydata.van_der_waals_radius.values)

In [None]:
#drop the rows with nan
mydata = mydata.dropna()
#reset the row index to number consecutively
mydata = mydata.reset_index(drop=True)

In [None]:
mydata.head(20)

## Split the data into a training and a test set

In [None]:
train_mydata = mydata.sample(frac=0.8,random_state=20)
test_mydata = mydata.drop(train_mydata.index)
train_mydata.head(100)
#test_mydata.head(100)

## The descriptors

In the following, we want to use this data to predict

1. the covalent radius from the atomic weight, the van der Waals radius and the number of protons (the atomic number);
2. the covalent radius from the atomic weight, the van der Waals radius, and a feature cross of atomic weight and atomic number.

We thus have multiple features in our data that we will use to derive a label.

In [None]:
# Generate a pairwise correlation matrix to determine linear connectivity of the values
# a high value means high correlation
# only the numeric columns can be correlated
train_mydata.select_dtypes('number').corr()

In [None]:
import seaborn as sn
corr_matrix = train_mydata.select_dtypes('number').corr()
sn.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
#sn.pairplot(train_mydata[["number", "atomic_weight", "covalent_radius", "van_der_waals_radius"]])
sn.pairplot(train_mydata[["number", "atomic_weight", "covalent_radius", "van_der_waals_radius"]], diag_kind="kde")

Covalent radius is the most correlated with van der Waals radius.

## Normalize values

When building a model with multiple features, the values of each feature should cover roughly the same range. Thus we need to normalize the data; here we take the Z-score or standard score:
\begin{align}
z = \frac{x-\mu}{\sigma}
\end{align}
with x the feature value, $\mu$ the mean and $\sigma$ the standard deviation.
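
As a small worked example of this formula, the sketch below standardizes a made-up feature. The helper name `zscore` and the numbers are assumptions for illustration only. The important point is that the mean and standard deviation come from the training values and are reused unchanged for the test values.

```python
import numpy as np

def zscore(values, mean, std):
    """Standardize values with a given mean and standard deviation."""
    return (np.asarray(values, dtype=float) - mean) / std

# Hypothetical training and test values of a single feature.
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# Statistics come from the training data only (ddof=1 matches the pandas default).
mu = train.mean()
sigma = train.std(ddof=1)

train_z = zscore(train, mu, sigma)
test_z = zscore(test, mu, sigma)
print(train_z.mean(), train_z.std(ddof=1))
print(test_z)
```

The normalized training values have mean 0 and standard deviation 1 by construction; the test values generally do not, because they are scaled with the training statistics.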

In [None]:
# Calculate the Z-scores of each numeric column in the training set:
train_mean = train_mydata.mean(numeric_only=True)
print('Mean of the training data:\n',train_mean)
train_std = train_mydata.std(numeric_only=True)
print('Standard deviation of the training data:\n',train_std)
train_mydata_norm = train_mydata.copy()
train_mydata_norm.number = (train_mydata.number - train_mean.number)/train_std.number
train_mydata_norm.atomic_weight = (train_mydata.atomic_weight - train_mean.atomic_weight)/train_std.atomic_weight
train_mydata_norm.covalent_radius = (train_mydata.covalent_radius - train_mean.covalent_radius)/train_std.covalent_radius
train_mydata_norm.van_der_waals_radius = (train_mydata.van_der_waals_radius - train_mean.van_der_waals_radius)/train_std.van_der_waals_radius
print(train_mydata_norm)

In [None]:
#calculate the Z-scores of each column in the test set:
#be careful to apply the same transformation with mean / std from the training data!
#work on a copy so that test_mydata itself is not overwritten
test_mydata_norm = test_mydata.copy()
test_mydata_norm.number = (test_mydata.number - train_mean.number)/train_std.number
test_mydata_norm.atomic_weight = (test_mydata.atomic_weight - train_mean.atomic_weight)/train_std.atomic_weight
test_mydata_norm.covalent_radius = (test_mydata.covalent_radius - train_mean.covalent_radius)/train_std.covalent_radius
test_mydata_norm.van_der_waals_radius = (test_mydata.van_der_waals_radius - train_mean.van_der_waals_radius)/train_std.van_der_waals_radius
print(test_mydata_norm)

## Represent data

The following code cell creates a feature layer A containing three features:

* the atomic weight;
* the van der Waals radius;
* the atomic number (the number of protons, which equals the number of electrons in the neutral atom).

It also generates a second feature layer B containing

* the atomic weight;
* the van der Waals radius, sorted into buckets;
* the atomic weight x the atomic number (a feature cross of the two bucketized features).

These feature columns define which features the model is trained on and how each of them is represented. The transformations (collected in `my_feature_layer_A` and `my_feature_layer_B`) are not applied until you pass data to the layer, which happens when we train the model.
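
To make bucketizing and feature crosses more concrete, here is a minimal NumPy sketch of the idea. It is not TensorFlow's implementation: the helper names, the toy values and the use of a modulo operation in place of a real hash function are all assumptions made for illustration.

```python
import numpy as np

def bucketize(values, boundaries):
    """Return the bucket index of each value (0 .. len(boundaries))."""
    return np.digitize(values, boundaries)

def cross(bucket_a, bucket_b, n_buckets_b, n_cross_bins):
    """Combine two bucket indices into one category, folded into n_cross_bins bins."""
    pair_id = bucket_a * n_buckets_b + bucket_b   # unique id per (a, b) pair
    return pair_id % n_cross_bins                 # simple stand-in for a hash

def one_hot(indices, depth):
    """One-hot encode integer categories."""
    return np.eye(depth)[indices]

# Hypothetical Z-scored feature values.
weight_z = np.array([-1.2, -0.1, 0.4, 1.7])
number_z = np.array([-1.1, -0.2, 0.5, 1.6])
boundaries = np.arange(-1.0, 1.0, 0.5)   # [-1.0, -0.5, 0.0, 0.5]

w_b = bucketize(weight_z, boundaries)
n_b = bucketize(number_z, boundaries)
cross_ids = cross(w_b, n_b, len(boundaries) + 1, 7)
crossed = one_hot(cross_ids, 7)
print(w_b, n_b, cross_ids)
print(crossed)
```

Each sample ends up in exactly one crossed category. Because the number of bins is smaller than the number of possible bucket pairs, different pairs can share a bin, as the second and third sample do here.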

In [None]:
# Create an empty list that will eventually hold all created feature columns.
feature_columns_A = []

# We scaled all the columns into their Z-scores. So the bucket width
# is given in units of the standard deviation (resolution_in_Zs).
# A resolution_in_Zs of 1 corresponds to a full standard deviation.
resolution_in_Zs = 0.3  # 3/10 of a standard deviation.

# Represent atomic weight as floating-point value
weight = tf.feature_column.numeric_column("atomic_weight")
feature_columns_A.append(weight)

# Represent van der Waals radius as floating-point value
vdW = tf.feature_column.numeric_column("van_der_waals_radius")
feature_columns_A.append(vdW)

# Represent atomic number as floating-point value
no_protons = tf.feature_column.numeric_column("number")
feature_columns_A.append(no_protons)

# Create an empty list that will eventually hold all created feature columns
feature_columns_B = []
feature_columns_B.append(weight)

# Create a bucket feature column for van der Waals radius
vdW_boundaries = list(np.arange(int(min(train_mydata_norm['van_der_waals_radius'])), 
                                     int(max(train_mydata_norm['van_der_waals_radius'])), 
                                     resolution_in_Zs))
vdW_b = tf.feature_column.bucketized_column(vdW, vdW_boundaries)
feature_columns_B.append(vdW_b)

# Create a bucket feature column for atomic weight
weight_boundaries = list(np.arange(int(min(train_mydata_norm['atomic_weight'])), 
                                     int(max(train_mydata_norm['atomic_weight'])), 
                                     resolution_in_Zs))
weight_b = tf.feature_column.bucketized_column(weight, weight_boundaries)

# Create a bucket feature column for atomic number
number_boundaries = list(np.arange(int(min(train_mydata_norm['number'])), 
                                     int(max(train_mydata_norm['number'])), 
                                     resolution_in_Zs))
no_protons_b = tf.feature_column.bucketized_column(no_protons, number_boundaries)

# Create a feature cross of atomic weight and atomic number
# hash_bucket_size is the number of hash buckets (categories) that the
# bucket combinations are hashed into; we keep it small here (5)
#vdW_x_number = tf.feature_column.crossed_column([vdW_b, no_protons_b], hash_bucket_size=5)
weight_x_number = tf.feature_column.crossed_column([weight_b, no_protons_b], hash_bucket_size=5)
#crossed_feature = tf.feature_column.indicator_column(vdW_x_number)
crossed_feature = tf.feature_column.indicator_column(weight_x_number)
feature_columns_B.append(crossed_feature)  

# Convert the list of feature columns into a layer that will later be fed into
# the model. 
my_feature_layer_A = tf.keras.layers.DenseFeatures(feature_columns_A)
my_feature_layer_B = tf.keras.layers.DenseFeatures(feature_columns_B)

In [None]:
#define the plotting function.

def plot_the_loss_curve(epochs, mse):
    """Plot a curve of loss vs. epoch."""

    plt.figure()
    plt.xlabel("Epoch")
    plt.ylabel("Mean Squared Error")

    plt.plot(epochs, mse, label="Loss")
    plt.legend()
    plt.ylim([mse.min()*0.95, mse.max() * 1.03])
    plt.show()  

print("Defined the plot_the_loss_curve function.")

## Define a deep neural net model

The `create_model` function defines the topology of the deep neural net, specifying the following:

* The number of [layers](https://developers.google.com/machine-learning/glossary/#layer) in the deep neural net.
* The number of [nodes](https://developers.google.com/machine-learning/glossary/#node) in each layer.

The `create_model` function also defines the [activation function](https://developers.google.com/machine-learning/glossary/#activation_function) of each layer.
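
To illustrate what such a topology computes, the following NumPy sketch runs a forward pass through a network with three inputs, two hidden ReLU layers and one linear output. It is a sketch under assumptions, not the Keras model itself: the weights are random, and the layer sizes (10 and 12 nodes) are chosen to match the model defined in the next cell.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit activation."""
    return np.maximum(x, 0.0)

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """Apply ReLU after every layer except the last (linear output)."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = relu(x)
    return x

rng = np.random.default_rng(0)
sizes = [3, 10, 12, 1]            # 3 input features, two hidden layers, 1 output
layers = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]

batch = rng.normal(size=(4, 3))   # 4 hypothetical, already normalized samples
output = forward(batch, layers)
n_params = sum(w.size + b.size for w, b in layers)
print(output.shape, n_params)
```

The output has one continuous value per sample, as expected for regression, and the parameter count follows directly from the numbers of layers and nodes.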

In [None]:
def create_model(my_learning_rate, my_feature_layer):
    """Create and compile a regression model."""
    model = tf.keras.models.Sequential()
    # Add the layer containing the feature columns to the model.
    model.add(my_feature_layer)

    # Describe the topology of the model by adding one tf.keras.layers.Dense
    # layer per hidden layer. We've specified the following arguments:
    #   * units specifies the number of nodes in this layer.
    #   * activation specifies the activation function (Rectified Linear Unit).
    #   * name is just a string that can be useful when debugging.

    # Define the first hidden layer with 10 nodes.   
    model.add(tf.keras.layers.Dense(units=10, 
                                  activation='relu', 
                                  name='Hidden1'))
  
    # Define the second hidden layer with 12 nodes. 
    model.add(tf.keras.layers.Dense(units=12, 
                                  activation='relu', 
                                  name='Hidden2'))

    # Define the output layer.
    model.add(tf.keras.layers.Dense(units=1,  
                                    name='Output'))                              
  
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.MeanSquaredError()])

    return model

## Define a training function

The `train_model` function trains the model from the input features and labels. The [tf.keras.Model.fit](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit) method performs the actual training. The `x` parameter of the `fit` method is very flexible, enabling you to pass feature data in a variety of ways. The following implementation passes a Python dictionary in which:

* The *keys* are the names of each feature (for example, `atomic_weight`, `number`, and so on).
* The *value* of each key is a NumPy array containing the values of that feature. 

**Note:** Although you are passing *every* feature to `model.fit`, most of those values will be ignored. Only the features accessed by `my_feature_layer` will actually be used to train the model.
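
The dictionary format described above can be tried out on a toy DataFrame without building a model. In this sketch the helper name `split_features_and_label` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def split_features_and_label(dataset, label_name):
    """Return (dict of feature arrays, label array) for a DataFrame."""
    features = {name: np.array(value) for name, value in dataset.items()}
    label = features.pop(label_name)
    return features, label

# Hypothetical, already normalized toy data.
toy = pd.DataFrame({'atomic_weight': [-1.0, 0.0, 1.0],
                    'number': [-1.1, 0.1, 1.0],
                    'covalent_radius': [-0.5, 0.2, 0.3]})
features, label = split_features_and_label(toy, 'covalent_radius')
print(sorted(features), label)
```

Removing the label from the dictionary ensures that it cannot be used as an input feature by accident.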

In [None]:
def train_model(model, dataset, epochs, label_name,
                batch_size=None):
    """Train the model by feeding it data."""

    # Split the dataset into features and label.
    features = {name:np.array(value) for name, value in dataset.items()}
    label = np.array(features.pop(label_name))
    history = model.fit(x=features, y=label, batch_size=batch_size,
                      epochs=epochs, shuffle=True) 

    # The list of epochs is stored separately from the rest of history.
    epochs = history.epoch
  
    # To track the progression of training, gather a snapshot
    # of the model's mean squared error at each epoch. 
    hist = pd.DataFrame(history.history)
    mse = hist["mean_squared_error"]

    return epochs, mse

## Call the functions to build and train a deep neural net

Okay, it is time to actually train the deep neural net. If time permits, experiment with the three hyperparameters to see if you can reduce the loss against the test set.


In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 200
batch_size = 80

# Specify the label
label_name = "covalent_radius"

# Establish the model's topology.
my_model = create_model(learning_rate, my_feature_layer_A)

# Train the model on the normalized training set. We're passing the entire
# normalized training set, but the model will only use the features
# defined by the feature_layer.
epochs, mse = train_model(my_model, train_mydata_norm, epochs, 
                          label_name, batch_size)
plot_the_loss_curve(epochs, mse)

# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value) for name, value in test_mydata_norm.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)

We can now make a prediction:

In [None]:
example_mydata = test_mydata_norm.sample(frac=0.8,random_state=20)
example_mydata.head(20)

In [None]:
#run the evaluation
example_features = {name:np.array(value) for name, value in example_mydata.items()}
predicted = my_model.predict(example_features)

In [None]:
#un-normalize the predictions
predicted = predicted * train_std.covalent_radius + train_mean.covalent_radius
exact = example_mydata.covalent_radius * train_std.covalent_radius + train_mean.covalent_radius
print('predicted values are:')
print(predicted)
print('the exact values are:')
print(exact)
print('deviation is:')
for i in range(len(predicted)):
    print(np.asarray([exact.values[i]])-predicted[i],'for atom:',example_mydata.symbol.iloc[i])

### Now let's try this with the feature cross.

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 200
batch_size = 80

# Specify the label
label_name = "covalent_radius"

# Establish the model's topology.
my_model = create_model(learning_rate, my_feature_layer_B)

# Train the model on the normalized training set. We're passing the entire
# normalized training set, but the model will only use the features
# defined by the feature_layer.
epochs, mse = train_model(my_model, train_mydata_norm, epochs, 
                          label_name, batch_size)
plot_the_loss_curve(epochs, mse)

# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value) for name, value in test_mydata_norm.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)

In [None]:
example_mydata = test_mydata_norm.sample(frac=0.8,random_state=20)
example_mydata.head(20)

In [None]:
#run the evaluation
example_features = {name:np.array(value) for name, value in example_mydata.items()}
predicted = my_model.predict(example_features)

In [None]:
#un-normalize the predictions
predicted = predicted * train_std.covalent_radius + train_mean.covalent_radius
exact = example_mydata.covalent_radius * train_std.covalent_radius + train_mean.covalent_radius
print('predicted values are:')
print(predicted)
print('the exact values are:')
print(exact)
print('deviation is:')
for i in range(len(predicted)):
    print(np.asarray([exact.values[i]])-predicted[i],'for atom:',example_mydata.symbol.iloc[i])

# Task 1

Compare the two models and change the hyperparameters / the model topology. Does the feature cross help here? Can you think of an example where a feature cross might be useful? What do you achieve using a feature cross?

# Regularization

Notice that in the example below, the model's loss against the test set is *much higher* than the loss against the training set.  In other words, the deep neural network is __overfitting__ to the data in the training set. To reduce overfitting, the model can be regularized, using:

  * [L1 regularization](https://developers.google.com/machine-learning/glossary/#L1_regularization)
  * [L2 regularization](https://developers.google.com/machine-learning/glossary/#L2_regularization)
  * [Dropout regularization](https://developers.google.com/machine-learning/glossary/#dropout_regularization)

Your task is to experiment with one or more regularization mechanisms to bring the test loss closer to the training loss (while still keeping test loss relatively low).  

**Note:** When you add a regularization function to a model, you might need to tweak other hyperparameters. 

### Implementing L1 or L2 regularization

To use L1 or L2 regularization on a hidden layer, specify the `kernel_regularizer` argument to [tf.keras.layers.Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense). Assign one of the following methods to this argument:

* `tf.keras.regularizers.l1` for L1 regularization
* `tf.keras.regularizers.l2` for L2 regularization

Each of the preceding methods takes the [regularization rate](https://developers.google.com/machine-learning/glossary/#regularization_rate) as its argument. Assign a decimal value between 0 and 1.0; the higher the value, the stronger the regularization. For example, the following applies L2 regularization at a strength of 0.01.

```
model.add(tf.keras.layers.Dense(units=20, 
                                activation='relu',
                                kernel_regularizer=tf.keras.regularizers.l2(0.01),
                                name='Hidden1'))
```
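
To see what the regularizer adds to the loss, the following NumPy sketch computes a mean squared error with an L2 penalty by hand. It assumes the common definition of the penalty as the regularization rate times the sum of the squared weights; the helper names and all numbers are made up for illustration.

```python
import numpy as np

def l2_penalty(weight_matrices, rate):
    """Sum of squared weights of all layers, scaled by the regularization rate."""
    return rate * sum(np.sum(w ** 2) for w in weight_matrices)

def regularized_mse(y_true, y_pred, weight_matrices, rate):
    """Mean squared error plus the L2 penalty."""
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return mse + l2_penalty(weight_matrices, rate)

# Hypothetical weights, labels and predictions.
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0], [1.0]])]
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])

plain = regularized_mse(y_true, y_pred, weights, rate=0.0)
penalized = regularized_mse(y_true, y_pred, weights, rate=0.01)
print(plain, penalized)
```

Large weights increase the penalized loss, so the optimizer is pushed towards smaller weights and thus towards a smoother model.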

### Implementing Dropout regularization

You implement dropout regularization as a separate layer in the topology. For example, the following code demonstrates how to add a dropout regularization layer between the first hidden layer and the second hidden layer:

```
# first hidden layer (the layer sizes here are placeholders)
model.add(tf.keras.layers.Dense(units=20, activation='relu', name='Hidden1'))

model.add(tf.keras.layers.Dropout(rate=0.25))

# second hidden layer
model.add(tf.keras.layers.Dense(units=12, activation='relu', name='Hidden2'))
```

The `rate` parameter to [tf.keras.layers.Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) specifies the fraction of nodes that the model should drop out during training. 
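
The effect of a dropout layer can be imitated in a few lines of NumPy. This is a minimal sketch of so-called inverted dropout, in which the surviving activations are scaled up by 1/(1 - rate) during training and the layer does nothing at prediction time. The helper name `dropout` and the toy activations are assumptions for illustration.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of entries, rescale the rest."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((1000, 10))          # hypothetical hidden-layer activations

dropped = dropout(hidden, rate=0.25, rng=rng)
print((dropped == 0).mean())          # fraction of zeroed entries, close to 0.25
print(dropped.mean())                 # stays close to 1 because of the rescaling
print(dropout(hidden, rate=0.25, rng=rng, training=False).mean())
```

Because a different random subset of nodes is switched off in every training step, the network cannot rely on any single node, which counteracts overfitting.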

In [None]:
def create_model2(my_learning_rate, my_feature_layer):
    """Create and compile a regression model."""
    model = tf.keras.models.Sequential()
    # Add the layer containing the feature columns to the model.
    model.add(my_feature_layer)

    # Describe the topology of the model by adding one tf.keras.layers.Dense
    # layer per hidden layer. We've specified the following arguments:
    #   * units specifies the number of nodes in this layer.
    #   * activation specifies the activation function (Rectified Linear Unit).
    #   * name is just a string that can be useful when debugging.

    # Define the first hidden layer with 30 nodes.
    model.add(tf.keras.layers.Dense(units=30, 
                                  activation='relu', 
                                  name='Hidden1'))
  
    # Define the second hidden layer with 62 nodes.
    model.add(tf.keras.layers.Dense(units=62, 
                                  activation='relu', 
                                  name='Hidden2'))
    
    # Define the third hidden layer with 80 nodes.
    model.add(tf.keras.layers.Dense(units=80, 
                                  activation='relu', 
                                  name='Hidden3'))
    
    # Define the fourth hidden layer with 22 nodes.
    model.add(tf.keras.layers.Dense(units=22, 
                                  activation='relu', 
                                  name='Hidden4'))

    # Define the output layer.
    model.add(tf.keras.layers.Dense(units=1,  
                                    name='Output'))                              
  
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.MeanSquaredError()])

    return model

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 800
batch_size = 80

# Specify the label
label_name = "covalent_radius"

# Establish the model's topology.
my_model = create_model2(learning_rate, my_feature_layer_A)

# Train the model on the normalized training set. We're passing the entire
# normalized training set, but the model will only use the features
# defined by the feature_layer.
epochs, mse = train_model(my_model, train_mydata_norm, epochs, 
                          label_name, batch_size)
plot_the_loss_curve(epochs, mse)

# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value) for name, value in test_mydata_norm.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)

In [None]:
# in this example, L2 regularization is used
def create_model_with_penalty(my_learning_rate, my_feature_layer):
    """Create and compile a regression model."""
    model = tf.keras.models.Sequential()
    # Add the layer containing the feature columns to the model.
    model.add(my_feature_layer)

    # Describe the topology of the model by adding one tf.keras.layers.Dense
    # layer per hidden layer. We've specified the following arguments:
    #   * units specifies the number of nodes in this layer.
    #   * activation specifies the activation function (Rectified Linear Unit).
    #   * kernel_regularizer adds an L2 penalty on the weights of this layer.
    #   * name is just a string that can be useful when debugging.

    # Define the first hidden layer with 30 nodes.
    model.add(tf.keras.layers.Dense(units=30, 
                                  activation='relu', 
                                  kernel_regularizer=tf.keras.regularizers.l2(0.04),
                                  name='Hidden1'))
  
    # Define the second hidden layer with 62 nodes.
    model.add(tf.keras.layers.Dense(units=62, 
                                  activation='relu', 
                                  kernel_regularizer=tf.keras.regularizers.l2(0.04),
                                  name='Hidden2'))
    
    # Define the third hidden layer with 80 nodes.
    model.add(tf.keras.layers.Dense(units=80, 
                                  activation='relu', 
                                  kernel_regularizer=tf.keras.regularizers.l2(0.04),
                                  name='Hidden3'))
    
    # Define the fourth hidden layer with 22 nodes.
    model.add(tf.keras.layers.Dense(units=22, 
                                  activation='relu', 
                                  kernel_regularizer=tf.keras.regularizers.l2(0.04),
                                  name='Hidden4'))

    # Define the output layer.
    model.add(tf.keras.layers.Dense(units=1,  
                                    name='Output'))                              
  
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.MeanSquaredError()])

    return model

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 800
batch_size = 80

# Specify the label
label_name = "covalent_radius"

# Establish the model's topology.
my_model = create_model_with_penalty(learning_rate, my_feature_layer_A)

# Train the model on the normalized training set. We're passing the entire
# normalized training set, but the model will only use the features
# defined by the feature_layer.
epochs, mse = train_model(my_model, train_mydata_norm, epochs, 
                          label_name, batch_size)
plot_the_loss_curve(epochs, mse)

# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value) for name, value in test_mydata_norm.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)

# Task 2

Try out different regularization mechanisms. Why does an overfitted model perform better on the training set than on the test set? What does regularization enforce?

# Predict flight delays in the US in January: Binary classification

The dataset is taken from here: 

https://www.kaggle.com/divyansh22/flight-delay-prediction

We are interested in the columns: OP_CARRIER_AIRLINE_ID (which airline), ORIGIN (originating airport), DEST (destination airport), DEP_TIME (departure time), DEP_DEL15 (a '1' indicates that the departure was delayed by 15 minutes or more, 0 is on time), ARR_TIME (arrival time), ARR_DEL15 (a '1' indicates that the arrival was delayed by 15 minutes or more, 0 is on time), CANCELLED (flight was cancelled), DIVERTED (flight was diverted), DISTANCE (distance traveled).

In [None]:
test_df = pd.read_csv('Jan_2019_ontime.csv',encoding = 'cp1252',engine='python')
test_df.head(100)

In [None]:
test_df.describe()

In [None]:
test_df.info()

In [None]:
df = pd.concat([test_df['OP_CARRIER_AIRLINE_ID'], test_df['ORIGIN'], test_df['DEST'], 
                    test_df['DEP_TIME'], test_df['DEP_DEL15'], test_df['ARR_TIME'], test_df['ARR_DEL15'],
                    test_df['CANCELLED'],test_df['DIVERTED'], test_df['DISTANCE']], axis=1)
# fill the empty cells with nan
df.replace('', np.nan, inplace=True)

In [None]:
# drop all nan's
my_df = df.dropna()
my_df.info()

In [None]:
my_df.head()

In [None]:
# only the numeric columns can be correlated
my_df.select_dtypes('number').corr()

Departure delay and arrival delay are related. Departure time and arrival time are related. There is a slight correlation between arrival delay and departure time, and departure delay and departure time.

In [None]:
#split data into training and test set
from sklearn.model_selection import train_test_split
train_mydata, test_mydata = train_test_split(my_df, test_size=0.2)
train_mydata, val_mydata = train_test_split(train_mydata, test_size=0.2)
print(len(train_mydata), 'train examples')
print(len(val_mydata), 'validation examples')
print(len(test_mydata), 'test examples')
#train_mydata = my_df.sample(frac=0.8,random_state=20)
#test_mydata = my_df.drop(train_mydata.index)

In [None]:
# Calculate the Z-scores of selected columns in the training set: this only makes 
# sense for distance
train_mean_distance = train_mydata.DISTANCE.mean()
train_std_distance = train_mydata.DISTANCE.std()
train_df_norm = train_mydata.copy()
#Rescale the times so they lie between 0 and 1 and not 0 and 2400
#this could be done in a better way as the time is given as hhmm.0
#and we could account for mm having values between 0 and 59 but at this 
#point we do not want to complicate things
train_df_norm.ARR_TIME = train_mydata.ARR_TIME/2400
train_df_norm.DEP_TIME = train_mydata.DEP_TIME/2400
train_df_norm.DISTANCE = (train_mydata.DISTANCE - train_mean_distance)/train_std_distance
train_df_norm.head(10)
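
The comment in the cell above mentions that the times are given as hhmm and that the minutes only run from 0 to 59. A more careful rescaling could look like the following sketch. The helper name `hhmm_to_day_fraction` and the sample times are assumptions for illustration, and the notebook itself keeps the simpler division by 2400.

```python
import numpy as np

def hhmm_to_day_fraction(hhmm):
    """Convert clock times given as hhmm (e.g. 1345.0) to a fraction of a day."""
    hhmm = np.asarray(hhmm, dtype=float)
    hours = np.floor(hhmm / 100.0)
    minutes = hhmm - 100.0 * hours
    return (hours * 60.0 + minutes) / (24.0 * 60.0)

# Hypothetical departure times.
times = np.array([0.0, 600.0, 1230.0, 2359.0])
fractions = hhmm_to_day_fraction(times)
print(fractions)
print(times / 2400)   # the simpler rescaling used in the notebook, for comparison
```

With this conversion, equal time differences map to equal differences in the feature value, which is not the case for the plain division by 2400.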

In [None]:
test_df_norm = test_mydata.copy()
test_df_norm.ARR_TIME = test_mydata.ARR_TIME/2400
test_df_norm.DEP_TIME = test_mydata.DEP_TIME/2400
test_df_norm.DISTANCE = (test_mydata.DISTANCE - train_mean_distance)/train_std_distance
test_df_norm.head(10)

In [None]:
val_df_norm = val_mydata.copy()
val_df_norm.ARR_TIME = val_mydata.ARR_TIME/2400
val_df_norm.DEP_TIME = val_mydata.DEP_TIME/2400
val_df_norm.DISTANCE = (val_mydata.DISTANCE - train_mean_distance)/train_std_distance
val_df_norm.head(10)

In [None]:
#get a list of all airports in origin and destination
df_cat = my_df.ORIGIN.astype('category')
airport_list = df_cat.cat.categories.tolist()
df_cat = my_df.DEST.astype('category')
airport_list_d = df_cat.cat.categories.tolist()

In [None]:
print(airport_list)

In [None]:
#print(airport_list)
temp = [item for item in airport_list if item not in airport_list_d]
print('Difference in origin and destination airports:',temp)

In [None]:
#find all the airline id's
al_cat = my_df.OP_CARRIER_AIRLINE_ID.astype('category')
al_list = al_cat.cat.categories.tolist()
print(al_list)

In [None]:
# Create an empty list that will eventually hold all created feature columns.
feature_columns = []

# Departure and arrival times were rescaled to the range 0 to 1, and the
# distance was converted into its Z-score. The bucket width
# (resolution_in_Zs) is given in these rescaled units, so a value of 0.1
# corresponds to 240 in hhmm units for the times and to 1/10 of a
# standard deviation for the distance.
resolution_in_Zs = 0.1

# Create a bucket feature column for departure time
dep = tf.feature_column.numeric_column('DEP_TIME')
dep_boundaries = list(np.arange(int(min(train_df_norm['DEP_TIME'])), 
                                     int(max(train_df_norm['DEP_TIME'])), 
                                     resolution_in_Zs))
#print(min(train_df_norm['DEP_TIME']),max(train_df_norm['DEP_TIME']),dep_boundaries)
dep_b = tf.feature_column.bucketized_column(dep, dep_boundaries)
feature_columns.append(dep_b)

# Create a bucket feature column for arrival time
arr = tf.feature_column.numeric_column('ARR_TIME')
arr_boundaries = list(np.arange(int(min(train_df_norm['ARR_TIME'])), 
                                     int(max(train_df_norm['ARR_TIME'])), 
                                     resolution_in_Zs))
arr_b = tf.feature_column.bucketized_column(arr, arr_boundaries)
feature_columns.append(arr_b)

# Create a bucket feature column for distance
dist = tf.feature_column.numeric_column('DISTANCE')
# floor/ceil make the boundaries cover the full range of Z scores
dist_boundaries = list(np.arange(np.floor(min(train_df_norm['DISTANCE'])),
                                 np.ceil(max(train_df_norm['DISTANCE'])),
                                 resolution_in_Zs))
dist_b = tf.feature_column.bucketized_column(dist, dist_boundaries)
feature_columns.append(dist_b)

# Create a categorical feature column for originating airport
origin_c = tf.feature_column.categorical_column_with_vocabulary_list('ORIGIN',airport_list)
# Create an embedded feature column for originating airport
# the dimension has to be fine-tuned
origin_e = tf.feature_column.embedding_column(origin_c,dimension=1)
feature_columns.append(origin_e)

# Create a categorical feature column for destination airport
dest_c = tf.feature_column.categorical_column_with_vocabulary_list('DEST',airport_list_d)
# Create an embedded feature column for destination airport
# the dimension has to be fine-tuned
dest_e = tf.feature_column.embedding_column(dest_c,dimension=1)
feature_columns.append(dest_e)

# Create a categorical feature column for airline id
airline_c = tf.feature_column.categorical_column_with_vocabulary_list('OP_CARRIER_AIRLINE_ID',al_list)
# Create an embedded feature column for airline id
airline_e = tf.feature_column.embedding_column(airline_c,dimension=1)
feature_columns.append(airline_e)

#we want to predict arrival delay and departure delay
# Convert the list of feature columns into a layer that will later be fed into
# the model. 
my_feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
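
To see what a bucketized column does with a numeric value, the following sketch reproduces the idea with NumPy alone. The boundaries and values are made up for illustration, and `np.digitize` is used here as a stand-in for the feature column, assuming buckets that are closed on the left, `[a, b)`:

```python
import numpy as np

# Hypothetical boundaries and scaled departure times, for illustration only.
boundaries = [0.25, 0.5, 0.75]
dep_time = np.array([0.1, 0.25, 0.6, 0.99])

# Index of the bucket each value falls into; buckets are closed on the left.
bucket_index = np.digitize(dep_time, boundaries)

# One-hot encoding: one column per bucket (len(boundaries) + 1 buckets).
one_hot = np.eye(len(boundaries) + 1)[bucket_index]

print(bucket_index)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so the model learns a separate weight per bucket instead of a single weight for the raw number.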

In [None]:
#@title Define the plotting function.
def plot_curve(epochs, hist, list_of_metrics):
    """Plot a curve of one or more classification metrics vs. epoch."""  
    # Each entry of list_of_metrics should be one of the metric names shown in:
    # https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#define_the_model_and_metrics

    plt.figure()
    plt.xlabel("Epoch")
    plt.ylabel("Value")

    for m in list_of_metrics:
        x = hist[m]
        plt.plot(epochs[1:], x[1:], label=m)

    plt.legend()

print("Defined the plot_curve function.")

## Define a neural net model

In [None]:
#@title Define the functions that create and train a model.
def create_model(my_learning_rate, feature_layer, my_metrics):
    """Create and compile a simple classification model."""
    model = tf.keras.models.Sequential()

    # Add the feature layer (the list of features and how they are represented)
    # to the model.
    model.add(feature_layer)

    # A single output unit with a sigmoid activation turns the weighted sum
    # of the features into a probability between 0 and 1.
    model.add(tf.keras.layers.Dense(units=1, activation=tf.sigmoid))

    # Call the compile method to construct the layers into a model that
    # TensorFlow can execute.  Notice that we're using a different loss
    # function for classification than for regression.
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  metrics=my_metrics)

    return model        
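
The model above combines a sigmoid output with a binary cross-entropy loss. The following NumPy sketch shows what these two pieces compute. The helper functions and the example numbers are hypothetical and written for illustration; they are not the Keras implementation:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps guards against log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))

# Hypothetical raw outputs of the last layer (before the sigmoid) and labels.
z = np.array([-2.0, 0.0, 3.0])
y_true = np.array([0.0, 1.0, 1.0])

y_prob = sigmoid(z)
loss = binary_crossentropy(y_true, y_prob)

print(y_prob)
print(loss)
```

The loss is small when the predicted probability is close to the label and grows quickly when the model is confident but wrong.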

## Define the training function

In [None]:
def train_model(model, dataset, epochs, label_name,
                batch_size=None, shuffle=True):
    """Feed a dataset into the model in order to train it."""

    # The x parameter of tf.keras.Model.fit can be a list of arrays, where
    # each array contains the data for one feature.  Here, we're passing
    # every column in the dataset. Note that the feature_layer will filter
    # away most of those columns, leaving only the desired columns and their
    # representations as features.
    features = {name:np.array(value) for name, value in dataset.items()}
    label = np.array(features.pop(label_name)) 
    history = model.fit(x=features, y=label, batch_size=batch_size,
                      epochs=epochs, shuffle=shuffle)
  
    # The list of epochs is stored separately from the rest of history.
    epochs = history.epoch

    # Isolate the classification metric for each epoch.
    hist = pd.DataFrame(history.history)

    return epochs, hist  
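
The way `train_model` separates the features from the label can be tried out in isolation. In this sketch a plain dict of columns with made-up values stands in for the DataFrame (an assumption for brevity; both offer `.items()`):

```python
import numpy as np

# Hypothetical stand-in for a DataFrame: a dict of columns also offers .items().
dataset = {
    'DISTANCE': [-0.5, 0.3, 1.2],
    'ARR_DEL15': [0.0, 1.0, 0.0],
}

# One array per column, keyed by column name.
features = {name: np.array(value) for name, value in dataset.items()}

# pop() removes the label column from the features and returns it.
label = features.pop('ARR_DEL15')

print(sorted(features))
print(label)
```

After the `pop`, the label is no longer among the features, so the model cannot see the value it is supposed to predict.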

## Call the functions to build and train the model 

In [None]:
print(train_df_norm.ARR_DEL15)

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.1
epochs = 10
batch_size = 200

# Specify the label
label_name = "ARR_DEL15"

#specify the classification threshold
classification_threshold = 0.4

# Establish the metrics the model will measure.
#metric = [tf.keras.metrics.BinaryAccuracy(name='accuracy', threshold=classification_threshold),]
metric = [tf.keras.metrics.BinaryAccuracy(name='accuracy', threshold=classification_threshold),
      tf.keras.metrics.Precision(thresholds=classification_threshold,name='precision'),
      tf.keras.metrics.Recall(thresholds=classification_threshold,name='recall'),]
          
# Establish the model's topology.
my_model = create_model(learning_rate, my_feature_layer, metric)

# Train the model on the normalized training set. We're passing the entire
# normalized training set, but the model will only use the features
# defined by the feature_layer.
epochs, hist = train_model(my_model, train_df_norm, epochs, 
                           label_name, batch_size)

# Plot a graph of the metric(s) vs. epochs.
#list_of_metrics_to_plot = ['accuracy'] 
list_of_metrics_to_plot = ['accuracy', 'precision', 'recall'] 
plot_curve(epochs, hist, list_of_metrics_to_plot)
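
The three metrics plotted above all depend on the classification threshold. The following sketch computes them by hand for a few made-up labels and probabilities (all values are hypothetical), counting a prediction as positive when its probability is above the threshold:

```python
import numpy as np

# Hypothetical labels and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.5, 0.3, 0.45, 0.1, 0.2])
threshold = 0.4

# A probability above the threshold counts as a predicted delay.
y_pred = (y_prob > threshold).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # fraction of predicted delays that were real
recall = tp / (tp + fn)     # fraction of real delays that were found

print(accuracy, precision, recall)
```

Lowering the threshold usually raises recall at the cost of precision, which is why the threshold is treated as a tunable parameter.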

In [None]:
# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value) for name, value in test_df_norm.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)

In [None]:
#make a prediction
example_features = {name:np.array(value) for name, value in val_df_norm.items()}
predicted = my_model.predict(example_features)

In [None]:
exact = val_df_norm.ARR_DEL15
#compare the first 100 labels with the thresholded predictions
for i in range(100):
    print(exact.values[i], (predicted[i] > classification_threshold).astype(int))
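
Instead of printing row by row, the comparison can also be done for all rows at once. This is a sketch with made-up numbers; the array `predicted` below is hypothetical and only mimics the `(n, 1)` shape that a model with one output unit returns:

```python
import numpy as np

# Hypothetical model output with shape (n, 1) and the matching labels.
predicted = np.array([[0.8], [0.35], [0.41], [0.05]])
exact = np.array([1.0, 0.0, 0.0, 0.0])
threshold = 0.4

# Flatten to shape (n,) and apply the threshold to all rows at once.
predicted_class = (predicted.ravel() > threshold).astype(int)

# Fraction of rows where the predicted class matches the label.
agreement = np.mean(predicted_class == exact)

print(predicted_class)
print(agreement)
```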

# Optional task

It is fine if you do not complete the optional task; every bit that you try will aid your learning. The task is highly recommended, but since I cannot force you to create a kaggle account, it remains optional.

1. Find a dataset on kaggle that looks interesting to you and that has reasonable data (not too small/too large, not too much missing/irregular data, formatting is ok).

2. Represent this dataset using pandas. Find correlations in the data.

3. Create a simple model to make predictions about the data.

Upload your notebook to moodle. 