Murali

Week 09 - Machine Learning with Scikit-learn



**Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.**


The **Random Forest classifier** stood out among the models in the Python notebook for its training accuracy of **0.9993**, the highest of any model. A training score this close to perfect indicates substantial overfitting: the model's testing accuracy was only **0.686**, far below its training accuracy. That result was still broadly competitive with the logistic regression models.

The Random Forest classifier fit the data so closely because it can capture nonlinear patterns and feature interactions that linear models such as logistic regression struggle to represent. Averaging many decision trees in an ensemble generally improves generalization relative to a single tree. Without hyperparameter tuning, however, the model still tends to overfit; this was partially addressed by running grid search cross-validation (GridSearchCV) over max_depth and other parameters.
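The tuning step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, and the grid values (including `min_samples_leaf` as one of the "other parameters") are assumptions, since the values searched in the notebook are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical grid: shallower trees and larger leaves limit how closely
# each tree can memorize the training data
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training split by default
best_rf = search.best_estimator_
train_acc = best_rf.score(X_train, y_train)
test_acc = best_rf.score(X_test, y_test)
print(search.best_params_, round(train_acc, 3), round(test_acc, 3))
```

Comparing `train_acc` and `test_acc` for the tuned model against an untuned forest is one way to check whether the search actually narrowed the gap.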

The logistic regression models also performed well, most clearly with L1 regularization (LASSO). The best of these, L1-regularized logistic regression with \( C = 10 \), reached **0.7347** training accuracy and **0.718** testing accuracy. The small gap between training and testing accuracy indicates good generalization. The pipeline approach with cross-validated L1-regularized logistic regression (Logistic_SL1_C_auto) produced nearly identical results: **0.7307** training accuracy and **0.714** testing accuracy.
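For reference, the two L1 setups described above might look roughly like the sketch below. It uses synthetic data, and the pipeline contents (standardization followed by `LogisticRegressionCV`) are an assumption about what Logistic_SL1_C_auto does, inferred only from its name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fixed regularization strength, C = 10 (larger C means weaker regularization)
fixed_l1 = LogisticRegression(penalty="l1", C=10, solver="liblinear", max_iter=1000)
fixed_l1.fit(X_train, y_train)

# Assumed pipeline: standardize the features, then let cross-validation choose C
cv_l1 = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, penalty="l1", solver="liblinear", cv=3, max_iter=1000),
)
cv_l1.fit(X_train, y_train)
chosen_C = float(np.ravel(cv_l1.named_steps["logisticregressioncv"].C_)[0])

fixed_test_acc = fixed_l1.score(X_test, y_test)
cv_test_acc = cv_l1.score(X_test, y_test)
print(round(fixed_test_acc, 3), round(cv_test_acc, 3), chosen_C)
```

The `liblinear` solver is used here because it supports the L1 penalty; `saga` is the other scikit-learn solver that does.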

Overall, the Random Forest model had the best training performance but did not generalize as well as the logistic regression models, particularly those with L1 regularization. This shows that training accuracy has to be weighed against generalization ability. Hyperparameter tuning can reduce the Random Forest model's overfitting, as the grid search results suggest. The L1-regularized logistic regression models performed consistently across training and testing, which kept them competitive even though they did not reach the highest training accuracy.
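The generalization argument can be made concrete by computing the train-test gap from the accuracies reported above. The snippet below only does arithmetic on those reported figures; the dictionary labels are informal names chosen for this illustration.

```python
# (training accuracy, testing accuracy) as reported in the discussion above
reported = {
    "Random Forest": (0.9993, 0.686),
    "Logistic L1 (C=10)": (0.7347, 0.718),
    "Logistic_SL1_C_auto": (0.7307, 0.714),
}

# A large positive gap is a sign of overfitting
gaps = {name: round(train - test, 4) for name, (train, test) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```

The Random Forest gap is about 0.31, while both logistic regression variants have gaps under 0.02.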

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from patsy import dmatrices

In [2]:
# Load the required dataset
df_patient = pd.read_csv('./PatientAnalyticFile.csv')

# Create a binary mortality variable (1 if a date of death is recorded, else 0)
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

# Convert DateOfBirth to datetime and calculate age in years as of 2015-01-01
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days / 365.25)

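# Exclude identifiers, raw date columns, and the outcome from the predictors
vars_remove = ['PatientID','First_Appointment_Date','DateOfBirth','Last_Appointment_Date','DateOfDeath','mortality']
# Keep the original column order so the formula is the same on every run
vars_left = [col for col in df_patient.columns if col not in vars_remove]
formula = "mortality ~ " + " + ".join(vars_left)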

In [3]:
Y, X = dmatrices(formula, df_patient, return_type='dataframe')

X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(Y), test_size=0.2, random_state=42)

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
results = []

In [4]:
# Fit models with each solver and record performance
for solver in solvers:
    start_time = time.time()
    clf = LogisticRegression(solver=solver, max_iter=1000)
    clf.fit(X_train, y_train)
    end_time = time.time()

    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    time_taken = end_time - start_time

    results.append([solver, train_acc, test_acc, time_taken])

results_df = pd.DataFrame(results, columns=['Solver used', 'Training subset accuracy', 'Holdout subset accuracy', 'Time taken'])
print(results_df)



  Solver used  Training subset accuracy  Holdout subset accuracy  Time taken
0   newton-cg                  0.748188                  0.73625    0.138945
1       lbfgs                  0.748437                  0.73600    0.912498
2   liblinear                  0.747938                  0.73625    0.091597
3         sag                  0.748062                  0.73625   11.557895
4        saga                  0.748000                  0.73625   18.255994




**Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?**



The **lbfgs** solver produced the most favorable results because it struck a good balance among training accuracy, holdout accuracy, and execution time. It had the highest training accuracy at 0.7484, and its holdout accuracy of 0.7360 was within 0.0003 of the other four solvers (0.73625 each). It ran in **0.91 seconds**, far faster than sag and saga but slower than newton-cg and liblinear.

Holdout subset accuracy is the most important indicator of model performance because it measures how well the model predicts unseen data. Since the primary goal of these models is to perform well beyond the training data, this metric should take precedence. On this metric the solvers were essentially equivalent, with every score rounding to about **0.736**.

Because the holdout accuracies differ so little, execution time becomes the deciding factor. The sag and saga solvers were by far the slowest, taking **11.56 seconds** and **18.26 seconds** respectively. Both also generated warnings because they did not converge within the maximum iteration limit of 1000. Their reasonable accuracy does not make up for this inefficiency, which makes them poor choices here.
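The scikit-learn documentation notes that sag and saga are only guaranteed to converge quickly when features are on roughly the same scale, which may be relevant to the warnings described above. The sketch below shows one way to count convergence warnings with and without standardization. It uses synthetic data with deliberately mismatched feature scales; the helper `count_convergence_warnings` is a hypothetical name introduced for this illustration, and whether a warning appears depends on the data and on `max_iter`.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with mismatched feature scales (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])


def count_convergence_warnings(model, X, y):
    """Fit the model and return how many ConvergenceWarnings were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)


raw = LogisticRegression(solver="saga", max_iter=100)
scaled = make_pipeline(StandardScaler(), LogisticRegression(solver="saga", max_iter=100))

n_raw = count_convergence_warnings(raw, X, y)
n_scaled = count_convergence_warnings(scaled, X, y)
print("warnings without scaling:", n_raw)
print("warnings with scaling:", n_scaled)
```

If standardizing removes the warnings on the real dataset, the sag and saga timings in the table would likely change as well, so the comparison would need to be rerun before drawing conclusions.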

Taking all three metrics together (holdout accuracy, execution time, and training accuracy), **lbfgs** is the best overall choice for this task. It delivers good accuracy in a reasonable runtime without convergence problems. The newton-cg and liblinear solvers are faster and comparably accurate on the holdout set, but their slightly lower training accuracy makes them less attractive than lbfgs. Ultimately, the choice of solver is a trade-off between generalization performance and computation time, with holdout accuracy carrying slightly more weight.