# Scikit-Learn vs PyTorch, Keras, and TensorFlow #

Today we learn how to use the PyTorch, Keras, and TensorFlow frameworks. These three machine learning packages are built specifically for neural networks and deep learning.

### Scikit-Learn ###

Our old ML framework, Scikit-Learn, is great for classical ML algorithms and for building shallow networks. However, it has significant limitations for modern neural network architectures such as CNNs, RNNs, and Transformers. There is also no built-in support for GPU acceleration (contrary to what I said in class) *unless* you install NVIDIA's RAPIDS cuML package (`import cuml`) or Intel's oneAPI-based extension for Scikit-Learn (`import sklearnex`). Each of these packages targets its own vendor's hardware, so the cuML module won't help if you have an Intel GPU and vice versa.

### PyTorch, Keras, and TensorFlow ###

In contrast, these new frameworks are built for deep learning. PyTorch was originally created as an open-source project by Meta (Facebook) but at this point the development is mostly driven by the larger ML/AI community. TensorFlow is an open-source Google product. Although anyone can theoretically contribute to it, the project is largely under control of Google AI and it has built-in support for processing on Google Cloud AI.

All three deep-learning frameworks provide native GPU acceleration. PyTorch requires the user to explicitly move the model and the data to the GPU, whereas the other two frameworks offload large computations to the GPU automatically when one is available.

Keras used to be its own independent, high-level deep-learning package. It could run on top of a handful of low-level packages, but the most common was Google's TensorFlow. In 2019, Google integrated Keras into TensorFlow. If you want to create models at a high level, taking the default architectures, then you use Keras (`from tensorflow import keras`). But if you want low-level customization, then you can use TensorFlow on its own.


## Kaggle's "Give Me Some Credit" Dataset ##

Description

In [None]:
import pandas as pd

# Load the credit dataset
df = pd.read_csv('credit_training.csv').drop(columns=['Unnamed: 0'])
df = df.rename(columns={'RevolvingUtilizationOfUnsecuredLines':'CreditUtilization',
                        'NumberOfTime30-59DaysPastDueNotWorse':'PastDue30-59',
                        'NumberOfTime60-89DaysPastDueNotWorse':'PastDue60-89',
                        'NumberOfTimes90DaysLate':'PastDue90+',
                        'NumberOfOpenCreditLinesAndLoans':'CreditLines',
                        'NumberRealEstateLoansOrLines':'RealEstateLoans',
                        'NumberOfDependents':'Dependents',
                        'age':'Age',
                        'SeriousDlqin2yrs':'RejectLoan'})
print(df.shape)
df.head()

In [None]:

import numpy as np



from sklearn.ensemble import RandomForestClassifier



### Data Exploration and Cleanup ###

Let's peruse the data and then handle any missing values.

In [None]:
X = df.drop(columns=['RejectLoan'])
y = df['RejectLoan']
print(f"Features: {X.columns.to_list()[:5]}...")
print(f"Output: '{y.name}'")
X.head()


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=432, stratify=y)
print(f"Training count: {X_train.shape}")
print(f"Test count:     {X_test.shape}")

In [None]:
# Are we missing any values? Yes: MonthlyIncome and Dependents
X_train.info()

In [None]:
# Let's investigate 'Dependents' first; it looks like the easier of the two
X_train['Dependents'].describe()

In [None]:
# Save value for later; then fix and confirm
dependents_na_value = X_train['Dependents'].mean()
X_train['Dependents'] = X_train['Dependents'].fillna(dependents_na_value)
X_train['Dependents'].isna().sum()

In [None]:
# Now let's investigate monthly income... wow, more variance
X_train['MonthlyIncome'].describe()

In [None]:
# Does NaN mean the applicant doesn't have a job? 
# There are lots of samples (1250+) with a value of 0
X_train[X_train['MonthlyIncome'] == 0].shape

In [None]:
# Investigating more: DebtRatio is high for these rows, which implies they do have some income
# So NaN probably doesn't mean 'no income'. Let's fill NaNs with the median value
X_train[X_train['MonthlyIncome'] == 0].head()

In [None]:
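# Save value for later; then fix and confirm
# Use the median of MonthlyIncome itself (it is robust to the large outliers seen above)
income_na_value = X_train['MonthlyIncome'].median()
X_train['MonthlyIncome'] = X_train['MonthlyIncome'].fillna(income_na_value)
X_train['MonthlyIncome'].isna().sum()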

In [None]:
# Final verification... remember we have two fills to do for any prediction
X_train.info()
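
The "two fills" above have to be repeated, with the values learned from the training data, on anything we later predict on. As a rough sketch of one way to bundle that bookkeeping, Scikit-Learn's `SimpleImputer` can learn the fill values during `fit` and reuse them in `transform`. The tiny `toy_train` and `toy_test` frames below are made-up stand-ins, not the credit data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for X_train / X_test (values are made up)
toy_train = pd.DataFrame({
    "Dependents":    [0.0, 2.0, np.nan, 1.0],
    "MonthlyIncome": [3000.0, np.nan, 5000.0, 10000.0],
})
toy_test = pd.DataFrame({
    "Dependents":    [np.nan, 3.0],
    "MonthlyIncome": [np.nan, 4200.0],
})

# Mean for Dependents, median for MonthlyIncome (same choices as the cells above)
imputer = ColumnTransformer([
    ("dependents", SimpleImputer(strategy="mean"), ["Dependents"]),
    ("income", SimpleImputer(strategy="median"), ["MonthlyIncome"]),
])

# fit_transform learns the fill values from the training rows only;
# transform then applies those same values to the test rows
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(test_filled)
```

The point of the design is that the test rows never influence the fill values, which keeps information from leaking out of the test set.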

### Feature Engineering ###

Neural networks usually perform better with standardized data because certain activation functions are sensitive to outliers and the gradient descent algorithm converges more efficiently when inputs have similar scales. So at this point, let's just standardize everything.
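
Standardizing a feature means subtracting its mean and dividing by its standard deviation, $z = (x - \mu) / \sigma$, so every column ends up with mean 0 and standard deviation 1. The sketch below does this by hand on a made-up array (`X_toy` is an assumption for illustration) and checks that it matches `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up columns on very different scales: age (years) and monthly income (dollars)
X_toy = np.array([[25.0, 2000.0],
                  [35.0, 4000.0],
                  [45.0, 9000.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)   # population std (ddof=0), which is what StandardScaler uses
X_manual = (X_toy - col_mean) / col_std

# StandardScaler performs the same computation
X_scaled = StandardScaler().fit_transform(X_toy)
print("Manual matches StandardScaler:", np.allclose(X_manual, X_scaled))
print("Column means:", X_scaled.mean(axis=0).round(6))
print("Column stds: ", X_scaled.std(axis=0).round(6))
```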

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_train.head()

### Scikit-Learn MLP Model ###

Let's start with a model that we already know, a simple Scikit-Learn MLP neural network.

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(64,32),    # read somewhere that this was reasonable complexity
    activation='tanh',             # 'relu', 'logistic', 'tanh', 'identity'
    solver='sgd',                  # 'adam' or 'sgd' or 'lbfgs' ...usually adam is best for general case
    max_iter=500,
    random_state=123,
    learning_rate='adaptive',
    learning_rate_init=0.01
).fit(X_train, y_train)

In [None]:
# Preprocess the prediction data the same way that we did the training data
X_test['Dependents'] = X_test['Dependents'].fillna(dependents_na_value)
X_test['MonthlyIncome'] = X_test['MonthlyIncome'].fillna(income_na_value)
X_test = scaler.transform(X_test)
X_test = pd.DataFrame(X_test, columns=X.columns)
X_test.head()

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Make predictions and look at some performance metrics
y_pred = mlp.predict(X_test)
print(f"Accuracy Score: {accuracy_score(y_test, y_pred)}\n")
print(classification_report(y_test, y_pred))
y_probs = mlp.predict_proba(X_test)
y_probs_0 = y_probs[:,0] # predicted probability of 'approve' (0) for each test sample
y_probs_1 = y_probs[:,1] # predicted probability of 'reject' (1) for each test sample
print(f"Sample predicted probabilities of 'approve': {y_probs_0[:5]}")
print(f"Sample predicted probabilities of 'reject':  {y_probs_1[:5]}")

# Bring back the confusion matrix to quickly see TP/TN/FP/FN
class_names = ["Approve", "Reject"]
plt.figure(figsize=(3,3))
sns.heatmap(confusion_matrix(y_test, y_pred), 
            annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title(f"Confusion Matrix (Threshold = 0.50)")
plt.show()

### Receiver Operating Characteristic (ROC) Curve ###

The ROC Curve is a new, visual evaluation technique. It works for models that output probabilities and is well suited for binary classification problems. It allows us to compare the tradeoff between the True Positive Rate (which you already know as *recall*) and the False Positive Rate.

$$
TPR = \frac{TP}{TP + FN}
$$

$$
FPR = \frac{FP}{FP + TN}
$$

The two metrics are related, and the ROC curve shows how moving the classification threshold affects both TPR and FPR. The top-left corner of the plot, the point (0, 1), is the ideal: every positive is caught (TPR = 1) with no false alarms (FPR = 0). The point on the curve that maximizes the difference between the TPR and the FPR is a common choice for the optimal threshold. In general, we want to maximize TPR and minimize FPR (obviously), but there are situations where one metric might be preferred over the other.
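
As a small worked example of the two formulas, the sketch below computes TPR and FPR from a confusion matrix. The labels in `y_true_toy` and `y_pred_toy` are made up for illustration (1 = reject, 0 = approve).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 6 actual negatives (0) and 4 actual positives (1)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

tpr_toy = tp / (tp + fn)   # share of actual positives that we caught (recall)
fpr_toy = fp / (fp + tn)   # share of actual negatives that we flagged by mistake
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"TPR = {tpr_toy:.3f}, FPR = {fpr_toy:.3f}")
```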

In [None]:
from sklearn.metrics import roc_curve

# This calculates the FPR and TPR at a variety of threshold levels
# The three ROC Curve variables are all parallel arrays
#   i.e., at thresholds[i], the false positive rate is fpr[i] and the true positive rate is tpr[i]
y_probs = mlp.predict_proba(X_test)
y_probs_0 = y_probs[:,0] # predicted probability of 'approve' (0) for each test sample
y_probs_1 = y_probs[:,1] # predicted probability of 'reject' (1) for each test sample
fpr, tpr, thresholds = roc_curve(y_test, y_probs_1)

# Find the optimal threshold: the point that maximizes TPR - FPR (Youden's J statistic),
# i.e., the point on the curve farthest above the diagonal 'random guess' line
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

# Plot ROC Curve
# The optimal threshold is not always 0.5.
# Lower values (~0.3) favor recall (catch more positives, but increase false alarms).
# Higher values (~0.7) favor precision (reduce false positives, but miss some positives).
plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', zorder=1)
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve', zorder=2)

# Annotate a few threshold points and highlight the optimal threshold
# ChatGPT's help
for i in range(0, len(thresholds), max(1, len(thresholds) // 5)):  
    plt.annotate(f"{thresholds[i]:.2f}", (fpr[i], tpr[i]), textcoords="offset points", xytext=(5, -5), ha='left', zorder=4)
plt.scatter(fpr[optimal_idx], tpr[optimal_idx], color='red', s=100, marker='x', label=f'Optimal Threshold = {optimal_threshold:.2f}', zorder=5)
plt.annotate(f"{optimal_threshold:.2f}", (fpr[optimal_idx], tpr[optimal_idx]), textcoords="offset points", xytext=(-20, 10), ha='right', color='red', zorder=6)

# Labels and legend
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve with Optimal Threshold")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

### Custom Thresholds ###

True Positive Rate and False Positive Rate are related to each other, and we can change them by setting a custom threshold for our model. By default, a classifier will choose the positive outcome if the final output is >= 0.50 and the negative outcome if it is < 0.50. Adjusting the classification threshold affects both TPR and FPR. Lowering the threshold increases TPR but often at the cost of a higher FPR, and vice versa.

For example, with the credit application dataset, if we want to minimize false negatives--predicting approval but the borrower actually defaults--then we need to lower the threshold value so that more predictions are categorized as the positive case (reject). We might decide that all outputs >= 0.30 should be rejected.

Whenever we change the threshold, we favor one type of prediction over the other (i.e., err on the side of... ). For example, lowering the threshold increases true positives but also increases false positives. Choosing an optimal threshold depends on the problem and whether we want to prioritize sensitivity (catching all positive cases--*recall*) or specificity (avoiding false positives).
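
To make the tradeoff concrete, here is a minimal sketch that sweeps three thresholds over made-up probabilities and reports the resulting TPR and FPR. The arrays `y_true_toy` and `p_reject_toy` and the helper `rates_at` are hypothetical names used only for this illustration.

```python
import numpy as np

# Made-up data: true labels and a model's predicted probability of 'reject' (1)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_reject_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.25, 0.45, 0.65, 0.90])

def rates_at(threshold):
    """Return (TPR, FPR) when every probability >= threshold is classified as positive."""
    y_hat = (p_reject_toy >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true_toy == 1))
    fp = np.sum((y_hat == 1) & (y_true_toy == 0))
    fn = np.sum((y_hat == 0) & (y_true_toy == 1))
    tn = np.sum((y_hat == 0) & (y_true_toy == 0))
    return tp / (tp + fn), fp / (fp + tn)

# The model's probabilities never change; only the cutoff does
for t in (0.30, 0.50, 0.70):
    tpr_t, fpr_t = rates_at(t)
    print(f"threshold={t:.2f}  TPR={tpr_t:.2f}  FPR={fpr_t:.2f}")
```

On this toy data, lowering the threshold catches more of the actual positives but also flags more of the actual negatives, which is the same pattern we should expect on the credit data.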

In [None]:
# Re-run the prediction using the custom threshold level
# Note that we do *not* call predict() again, we just do a custom classification step
# using the output probabilities from the original model.
custom_threshold = 0.06
y_probs = mlp.predict_proba(X_test)
y_probs_0 = y_probs[:,0] # predicted probability of 'approve' (0) for each test sample
y_probs_1 = y_probs[:,1] # predicted probability of 'reject' (1) for each test sample
y_pred_custom = (y_probs_1 >= custom_threshold).astype(int)

# Evaluate accuracy
# Notice all of the 'y_pred' have been changed to 'y_pred_custom' 
# Otherwise, the code is exactly the same as before
print(f"Accuracy Score: {accuracy_score(y_test, y_pred_custom)}\n")
print(classification_report(y_test, y_pred_custom))
# The probabilities are unchanged from above; only the classification cutoff is different
print(f"Sample predicted probabilities of 'approve': {y_probs_0[:5]}")
print(f"Sample predicted probabilities of 'reject':  {y_probs_1[:5]}")

# Bring back the confusion matrix to quickly see TP/TN/FP/FN
class_names = ["Approve", "Reject"]
plt.figure(figsize=(3,3))
sns.heatmap(confusion_matrix(y_test, y_pred_custom), 
            annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title(f"Confusion Matrix (Threshold = {custom_threshold:.2f})")
plt.show()

# Is this better? It depends on the application.
# Notice that our accuracy has dropped significantly but we have a lot fewer false negatives

### Area Under Curve (AUC) Score ###

By changing the threshold value, we can bias the model towards the positive case (lower threshold) or the negative case (higher threshold). But nothing in the model has changed--it's still predicting the exact same probabilities for every input. Changing the threshold value merely moves our model back and forth along the ROC Curve. So wouldn't it be nice to have a single number that would help us evaluate a model, regardless of the desired threshold?

There is such a metric: the Area Under Curve Score. The AUC Score is exactly what it sounds like: the integral of the ROC Curve. A score of 1.0 means the model ranks every positive above every negative, while 0.5 is no better than random guessing (the dashed diagonal). The ideal ROC curve looks like an upside-down elbow: the y-value rises very, very quickly and then holds steady. Such a graph indicates that we can get a high true positive rate together with a low false positive rate across a wide range of threshold values, and that the optimal point on the curve sits close to (0,1).
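
Since the AUC is just the area under the ROC curve, we can approximate the integral ourselves with the trapezoidal rule and compare it with Scikit-Learn's `roc_auc_score`. This is only a sketch on made-up data (`y_true_toy` and `p_reject_toy` are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up data: true labels and predicted probabilities of the positive class
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_reject_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.25, 0.45, 0.65, 0.90])

# Points along the ROC curve, ordered by increasing FPR
fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, p_reject_toy)

# Trapezoidal rule: width of each FPR step times the average TPR across that step
auc_trapezoid = np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2)

auc_sklearn = roc_auc_score(y_true_toy, p_reject_toy)
print(f"Trapezoid area: {auc_trapezoid:.4f}")
print(f"roc_auc_score:  {auc_sklearn:.4f}")
```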

In [None]:
from sklearn.metrics import roc_auc_score

# Notice that I ignored the probabilities for class 0 and passed only P(class 1)... this is standard
y_probs = mlp.predict_proba(X_test)[:,1]
print(f"Overall AUC Score for model: {roc_auc_score(y_test, y_probs)}")

In [None]:
import tensorflow as tf
from tensorflow import keras


# Check if Keras is using a GPU (via TensorFlow)
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))
print(tf.config.list_physical_devices('GPU'))  # Lists GPU devices


# Define the Keras model
model = keras.Sequential([
    keras.layers.Dense(100, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')  # Binary classification
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
# The single sigmoid unit outputs P(reject) with shape (n, 1), so flatten it to 1-D
y_prob = model.predict(X_test).ravel()
y_pred = (y_prob > 0.5).astype(int)  # Convert probabilities to binary outcomes
print(f"Overall AUC Score for model: {roc_auc_score(y_test, y_prob)}")

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print(classification_report(y_test, y_pred))

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf


# Check if TensorFlow is using a GPU
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))
print(tf.config.list_physical_devices('GPU'))  # Lists GPU devices

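# Convert data to TensorFlow tensors (pass NumPy arrays rather than pandas objects)
# Targets are reshaped to (n, 1) so they match the shape of the model's output
X_train_tensor = tf.convert_to_tensor(X_train.to_numpy(), dtype=tf.float32)
y_train_tensor = tf.reshape(tf.convert_to_tensor(y_train.to_numpy(), dtype=tf.float32), (-1, 1))
X_test_tensor = tf.convert_to_tensor(X_test.to_numpy(), dtype=tf.float32)
y_test_tensor = tf.reshape(tf.convert_to_tensor(y_test.to_numpy(), dtype=tf.float32), (-1, 1))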

# Define model parameters
input_size = X.shape[1]
W1 = tf.Variable(tf.random.normal([input_size, 100]))
b1 = tf.Variable(tf.zeros([100]))
W2 = tf.Variable(tf.random.normal([100, 1]))
b2 = tf.Variable(tf.zeros([1]))

# Forward pass function
def model(X):
    hidden = tf.nn.relu(tf.matmul(X, W1) + b1)
    output = tf.sigmoid(tf.matmul(hidden, W2) + b2)
    return output

# Define loss function and optimizer
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.optimizers.Adam(learning_rate=0.001)

# Training loop
epochs = 10
batch_size = 32
num_batches = len(X_train) // batch_size

for epoch in range(epochs):
    total_loss = 0
    for i in range(num_batches):
        start = i * batch_size
        end = start + batch_size
        X_batch = X_train_tensor[start:end]
        y_batch = y_train_tensor[start:end]
        
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            loss = loss_fn(y_batch, y_pred)
        
        gradients = tape.gradient(loss, [W1, b1, W2, b2])
        optimizer.apply_gradients(zip(gradients, [W1, b1, W2, b2]))
        total_loss += loss.numpy()
    
    print(f'Epoch {epoch+1}, Loss: {total_loss/num_batches:.4f}')

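# Evaluate the model
# Here 'model' is a plain Python function, so we call it directly (there is no .predict())
y_prob = model(X_test_tensor).numpy().ravel()  # P(reject) for each test sample
y_pred = (y_prob > 0.5).astype(int)  # Convert probabilities to binary outcomes
print(f"Overall AUC Score for model: {roc_auc_score(y_test, y_prob)}")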

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print(classification_report(y_test, y_pred))


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Convert to PyTorch tensors (torch.tensor needs NumPy arrays, not pandas objects)
X_train_tensor = torch.tensor(X_train.to_numpy(), dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.to_numpy(), dtype=torch.float32).unsqueeze(1)  # Make y a 2D tensor
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32).unsqueeze(1)

# Create PyTorch datasets and data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Define the PyTorch model
class CreditRiskModel(nn.Module):
    def __init__(self, input_size):
        super(CreditRiskModel, self).__init__()
        self.fc1 = nn.Linear(input_size, 100)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(100, 1)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

# Initialize model, loss function, and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CreditRiskModel(input_size=X.shape[1]).to(device)
criterion = nn.BCELoss()  # Binary cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}')

# Evaluation
model.eval()
y_pred = []
y_true = []
with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        outputs = model(X_batch)
        predicted = (outputs.cpu().numpy() > 0.5).astype(int)
        y_pred.extend(predicted.flatten())
        y_true.extend(y_batch.cpu().numpy().flatten())

accuracy = accuracy_score(y_true, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print(classification_report(y_true, y_pred))


In [None]:
# What is the balance of our target variable? ...10x difference
df['RejectLoan'].value_counts()

# Let's oversample the minority class so the model doesn't just learn to predict the majority class
# SMOTE does this by generating synthetic samples that interpolate between existing minority-class samples
# SMOTE - Synthetic Minority Over-sampling TEchnique
# pip install imbalanced-learn
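
To show the idea behind SMOTE without installing anything extra, here is a simplified, hand-rolled sketch of the interpolation step. It is *not* the imbalanced-learn implementation (that one is used via `from imblearn.over_sampling import SMOTE` and its `fit_resample` method). The data and the helper name `smote_like` are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Made-up imbalanced data: 90 majority rows (class 0) and 10 minority rows (class 1)
X_major = rng.normal(loc=0.0, scale=1.0, size=(90, 2))
X_minor = rng.normal(loc=3.0, scale=1.0, size=(10, 2))

def smote_like(X_min, n_new, k=5):
    """Create n_new synthetic rows by interpolating between minority-class neighbors."""
    # Ask for k + 1 neighbors because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)                     # random minority rows
    picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]    # one true neighbor each
    gap = rng.random((n_new, 1))                                       # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Generate enough synthetic minority rows to match the majority count
X_synth = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, X_synth])
y_balanced = np.concatenate([np.zeros(len(X_major), dtype=int),
                             np.ones(len(X_minor) + len(X_synth), dtype=int)])
print("Class counts after oversampling:", np.bincount(y_balanced))
```

Whichever implementation is used, the oversampling should be applied to the training set only; the test set should keep the original class balance so the evaluation reflects real-world data.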