Name: Greeshma Poli

Banner Id: 001335934

Subject: High Performance Computing

Week 09 - Machine Learning with Scikit-learn

1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

Answer: Among all the classification models tested, the Logistic Regression model with L1 regularization and C=10, using the liblinear solver, gave the best overall results.

This model had a test accuracy of 0.718, which tied for the highest among all the logistic regression versions: the basic logistic regression model (without regularization) also reached 0.718. I still prefer the L1 version because L1 regularization can shrink the coefficients of less useful predictors to exactly zero, which helps the model focus on the most important features. This is especially useful in healthcare, where we want models to be easy to understand.

The Random Forest models had very high training accuracy (almost 1.0) but lower test accuracy. This means they overfit: they learned the training data too well and did not perform as well on new data.

Overall, the liblinear solver worked well, especially with L1 regularization. liblinear is a good choice for small datasets, and the L1 penalty helps with feature selection. So, the Logistic_L1_C_10 model strikes the best balance between accuracy and generalization, making it the top-performing model in this case.
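The feature-selection effect of L1 regularization mentioned above can be illustrated with a small sketch. It does not use the course dataset: the data come from scikit-learn's make_classification (an assumption for illustration), and the second value of C (0.01) is a hypothetical comparison point, not a model from the notebook. The sketch counts how many coefficients the liblinear solver sets to exactly zero at each value of C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 20 predictors, only 5 of them informative
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

zero_counts = {}
for C in (10, 0.01):  # smaller C means a stronger L1 penalty
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_train, y_train)
    # Coefficients driven to exactly zero correspond to dropped predictors
    zero_counts[C] = int(np.sum(model.coef_ == 0))
    print(f"C={C}: zero coefficients={zero_counts[C]}, "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

With a weak penalty such as C=10, few coefficients may be zeroed, so the interpretability benefit depends on how strongly the penalty is applied.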

2. Next, fit a series of logistic regression models, without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining for checking the generalizable performance (i.e., performance on the holdout subset). Be sure to ensure that the training and holdout subsets are identical across all models. Each model should choose a different solver.


3. Compare the results of the models in terms of their accuracy (use this as the performance metric to assess generalizability error on the holdout subset) and the time taken (use appropriate timing function). Summarize your results via a table with the following structure:

Solver used | Training subset accuracy | Holdout subset accuracy | Time taken

In [None]:
import pandas as pd
import numpy as np
import time
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Load and prepare the dataset
df = pd.read_csv("PatientAnalyticFile.csv")

In [None]:
# Create mortality column: 1 if DateOfDeath exists, else 0
df["mortality"] = np.where(df["DateOfDeath"].isnull(), 0, 1)

In [None]:
# Calculate age in years
df["DateOfBirth"] = pd.to_datetime(df["DateOfBirth"], errors='coerce')
df["Age_years"] = ((pd.to_datetime("2015-01-01") - df["DateOfBirth"]).dt.days / 365.25)

In [None]:
# Drop columns that should not be used as predictors
vars_remove = ["PatientID", "First_Appointment_Date", "DateOfBirth",
               "Last_Appointment_Date", "DateOfDeath", "mortality"]
vars_left = list(set(df.columns) - set(vars_remove))

In [None]:
# Define the regression formula
formula = "mortality ~ " + " + ".join(vars_left)


In [None]:
# Create the design matrices for logistic regression
y, X = dmatrices(formula, df, return_type='dataframe')
y = np.ravel(y)  # Flatten target array

In [None]:
# Split the dataset (same split for all models)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Dictionary to hold results
results = {}
# Define solvers here
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Loop through each solver and evaluate model
for solver in solvers:
    try:
        start = time.time()

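        # Initialize and fit the model
        # Note: with these arguments scikit-learn applies its default L2 penalty (C=1.0),
        # so the fit is lightly regularized rather than strictly unregularized
        model = LogisticRegression(solver=solver, max_iter=1000, fit_intercept=True)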
        model.fit(X_train, y_train)

        end = time.time()
        elapsed = round(end - start, 4)

        # Predict and calculate accuracies
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
        train_acc = round(accuracy_score(y_train, train_pred), 4)
        test_acc = round(accuracy_score(y_test, test_pred), 4)

        # Store results
        results[solver] = {
            'Train Accuracy': train_acc,
            'Test Accuracy': test_acc,
            'Time Taken (s)': elapsed
        }

    except Exception as e:
        results[solver] = {
            'Train Accuracy': 'Error',
            'Test Accuracy': 'Error',
            'Time Taken (s)': str(e)
        }

# Convert to DataFrame and display
results_df = pd.DataFrame.from_dict(results, orient='index')
print(results_df)




           Train Accuracy  Test Accuracy  Time Taken (s)
newton-cg          0.7482         0.7362          0.0937
lbfgs              0.7482         0.7358          0.7068
liblinear          0.7479         0.7362          0.0478
sag                0.7481         0.7362          9.9994
saga               0.7480         0.7362         14.3261
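Because LogisticRegression applies an L2 penalty with C=1.0 by default, the fits above are lightly regularized. One possible way to approximate an unregularized fit with all five solvers is to set C to a very large value, as sketched below. The sketch rests on several assumptions: the data are synthetic (make_classification) rather than PatientAnalyticFile.csv, C=1e10 is an arbitrary "effectively no penalty" value, and time.perf_counter is swapped in for time.time because it is intended for measuring short intervals. The printed columns follow the table structure requested in question 3.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), with one fixed 80/20 split for all solvers
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rows = []
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    start = time.perf_counter()
    # A very large C makes the L2 penalty negligible (approximation, not exact)
    model = LogisticRegression(solver=solver, C=1e10, max_iter=1000)
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    rows.append((solver,
                 model.score(X_train, y_train),  # training subset accuracy
                 model.score(X_test, y_test),    # holdout subset accuracy
                 elapsed))

print(f"{'Solver used':<12}{'Training acc':>14}{'Holdout acc':>13}{'Time (s)':>10}")
for solver, train_acc, test_acc, elapsed in rows:
    print(f"{solver:<12}{train_acc:>14.4f}{test_acc:>13.4f}{elapsed:>10.4f}")
```

Timings from a sketch like this vary from run to run and from machine to machine, so they are not comparable to the numbers reported above.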




4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?

Answer: Based on the table above, the liblinear solver gave the best overall result. I ranked the solvers mainly on holdout (test) accuracy, since it shows how well the model performs on new, unseen data and is a better measure of generalization than training accuracy. On that metric the solvers are practically indistinguishable: newton-cg, liblinear, sag, and saga all reached 0.7362, and lbfgs reached 0.7358. Training accuracy was also nearly identical (0.7479 to 0.7482), so it did not help separate them. I therefore used time of execution as the tie-breaker. liblinear was the fastest (0.0478 seconds), followed by newton-cg (0.0937 seconds) and lbfgs (0.7068 seconds), while sag and saga took the longest (about 10 and 14 seconds) without any gain in accuracy. They also triggered ConvergenceWarnings, meaning they did not fully finish optimizing within the allowed number of iterations. So, when balancing accuracy, time, and stability, liblinear is the best choice among these solvers in this case.
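A ranking rule that combines holdout accuracy and execution time can be made explicit in a few lines. The sketch below copies the values from the results table above and sorts the solvers by holdout accuracy (higher is better), using time taken (lower is better) only to break ties. That ordering of criteria is one reasonable choice, stated here as an assumption; the dictionary keys are illustrative names, not part of the notebook.

```python
# Values copied from the printed results table above
results = {
    "newton-cg": {"train": 0.7482, "holdout": 0.7362, "seconds": 0.0937},
    "lbfgs":     {"train": 0.7482, "holdout": 0.7358, "seconds": 0.7068},
    "liblinear": {"train": 0.7479, "holdout": 0.7362, "seconds": 0.0478},
    "sag":       {"train": 0.7481, "holdout": 0.7362, "seconds": 9.9994},
    "saga":      {"train": 0.7480, "holdout": 0.7362, "seconds": 14.3261},
}

# Sort by holdout accuracy (descending), then by time taken (ascending)
ranked = sorted(
    results,
    key=lambda s: (-results[s]["holdout"], results[s]["seconds"]),
)

for rank, solver in enumerate(ranked, start=1):
    r = results[solver]
    print(f"{rank}. {solver:<10} holdout={r['holdout']:.4f}  time={r['seconds']:.4f}s")
```

Because the holdout accuracies differ by at most 0.0004, the ranking is driven almost entirely by execution time; a different random split could reorder the accuracy column.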