## Haritha Thipparapu

## Week 09 - Machine Learning with Scikit-learn

## 1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.


#### In the results section of the notebook:

Random Forest achieved the highest test accuracy, ahead of Logistic Regression, Decision Tree, and k-Nearest Neighbors.

Generalization is judged here by accuracy on the held-out test set, and on that measure Random Forest outperformed Logistic Regression, Decision Tree, and k-NN.

The notebook outputs support this:

Random Forest Accuracy:

Training Accuracy: ~1.00 (near-perfect training accuracy is typical of a forest of fully grown trees, so the gap to the test accuracy points to some overfitting)

Test Accuracy: ~0.83

In comparison:

Logistic Regression Test Accuracy: ~0.79

k-NN Test Accuracy: ~0.74

Decision Tree Test Accuracy: ~0.78

The Random Forest classification report and confusion matrix also showed balanced precision and recall across the classes.

#### Conclusion:
Among the models evaluated in the notebook, the Random Forest classifier performed best overall: it had the highest test accuracy (~0.83) and balanced performance across classes.
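
The comparison above can be outlined in code. This is a minimal sketch, not the notebook's own code: it uses synthetic data from `make_classification` as a stand-in for the notebook's dataset, and `compare_models` is a hypothetical helper name. It reports training and test accuracy side by side so that the gap between them, a rough indicator of overfitting, is visible for each model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, random_state=42):
    """Fit four classifiers on one shared split; return accuracies per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "k-NN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        # A large train/test gap suggests the model has overfit the training set
        scores[name] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}
    return scores


# Synthetic stand-in data (assumption: the notebook's dataset is not reproduced here)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
scores = compare_models(X, y)
for name, s in scores.items():
    print(f"{name:20s} train={s['train']:.3f} test={s['test']:.3f} gap={s['gap']:.3f}")
```

The numbers printed for synthetic data will not match the notebook's results; the point is only the side-by-side layout of training accuracy, test accuracy, and their gap.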



## 2. Fit a series of logistic regression models, without regularization.

In [23]:
## Set print limits
pd.options.display.max_rows = 10
## Import Data
df_patient = pd.read_csv('./PatientAnalyticFile.csv')
df_patient

Unnamed: 0,PatientID,DateOfBirth,Gender,Race,Myocardial_infarction,Congestive_heart_failure,Peripheral_vascular_disease,Stroke,Dementia,Pulmonary,...,Metastatic_solid_tumour,HIV,Obesity,Depression,Hypertension,Drugs,Alcohol,First_Appointment_Date,Last_Appointment_Date,DateOfDeath
0,1,1962-02-27,female,hispanic,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2013-04-27,2018-06-01,
1,2,1959-08-18,male,white,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2005-11-30,2008-11-02,2008-11-02
2,3,1946-02-15,female,white,0,0,0,0,0,0,...,0,1,0,0,1,0,0,2011-11-05,2015-11-13,
3,4,1979-07-27,female,white,0,0,0,0,0,1,...,0,0,0,0,0,0,0,2010-03-01,2016-01-17,2016-01-17
4,5,1983-02-19,female,hispanic,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2006-09-22,2018-06-01,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,19996,1997-12-19,female,other,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2008-06-14,2018-06-01,
19996,19997,1984-03-31,female,white,0,0,0,0,0,0,...,0,1,0,0,1,0,0,2007-04-24,2018-06-01,
19997,19998,1993-07-04,female,white,0,0,0,0,0,0,...,0,0,1,0,1,0,0,2010-10-16,2018-06-01,
19998,19999,1984-04-17,male,other,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2015-01-04,2018-06-01,


In [27]:
import pandas as pd
import numpy as np
import time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from patsy import dmatrices
from sklearn.metrics import accuracy_score

# Load the data
df_patient = pd.read_csv('PatientAnalyticFile.csv')

# Create mortality outcome: 1 if DateOfDeath exists, else 0
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

# Calculate age from DateOfBirth
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days / 365.25)

# Remove columns not to be used in modeling
vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
vars_left = set(df_patient.columns) - set(vars_remove)

# Create model formula using all other variables
formula = "mortality ~ " + " + ".join(vars_left)

# Create design matrices
Y, X = dmatrices(formula, df_patient)

# Split data into training and testing sets (same for all models)
X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y), test_size=0.2, random_state=42
)

# List of solvers to evaluate
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

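# Evaluation function: fit one solver, then record accuracy and elapsed time
def evaluate_solver(solver_name, X_train, X_test, y_train, y_test):
    start_time = time.time()

    if solver_name == 'liblinear':
        # liblinear does not support penalty=None, so this model keeps an
        # L2 penalty (default C=1.0) and is not strictly unregularized
        model = LogisticRegression(solver=solver_name, penalty='l2', random_state=42)
    elif solver_name in ['sag', 'saga']:
        # sag and saga can converge slowly on unscaled features, so allow more iterations
        model = LogisticRegression(solver=solver_name, penalty=None, random_state=42, max_iter=1000)
    else:
        model = LogisticRegression(solver=solver_name, penalty=None, random_state=42)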

    model.fit(X_train, y_train)

    elapsed_time = time.time() - start_time

    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))

    return {
        'Solver': solver_name,
        'Training Accuracy': train_accuracy,
        'Holdout Accuracy': test_accuracy,
        'Time Taken (seconds)': elapsed_time
    }

# Run evaluation for all solvers
results = []
for solver in solvers:
    result = evaluate_solver(solver, X_train, X_test, y_train, y_test)
    results.append(result)

# Display results table
results_df = pd.DataFrame(results)
results_df = results_df[['Solver', 'Training Accuracy', 'Holdout Accuracy', 'Time Taken (seconds)']]
print(results_df.to_string(index=False))


   Solver  Training Accuracy  Holdout Accuracy  Time Taken (seconds)
newton-cg           0.748062           0.73550              0.184598
    lbfgs           0.748188           0.73575              0.277604
liblinear           0.747938           0.73625              0.086682
      sag           0.747938           0.73575              2.179397
     saga           0.748000           0.73600              3.238441


## 3. Summarize the results in table format

In [29]:
# Display results table
results_df = pd.DataFrame(results)
results_df = results_df[['Solver', 'Training Accuracy', 'Holdout Accuracy', 'Time Taken (seconds)']]
print(results_df.to_string(index=False))


   Solver  Training Accuracy  Holdout Accuracy  Time Taken (seconds)
newton-cg           0.748062           0.73550              0.184598
    lbfgs           0.748188           0.73575              0.277604
liblinear           0.747938           0.73625              0.086682
      sag           0.747938           0.73575              2.179397
     saga           0.748000           0.73600              3.238441


## 4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?


Interpretation of Results:
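liblinear achieved the highest holdout accuracy (0.73625) and the shortest execution time (0.086682 seconds), so it was both the fastest solver and, by a narrow margin, the most accurate. One caveat: liblinear does not support penalty=None, so it was fit with an L2 penalty while the other four solvers were fit without regularization, and the comparison is therefore not strictly like-for-like.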

The other four solvers (newton-cg, lbfgs, sag, and saga) had holdout accuracies between 0.73550 and 0.73600, within 0.00075 of liblinear, but their execution times differed widely, from about 0.18 seconds (newton-cg) to about 3.24 seconds (saga).

Ordered from fastest to slowest, the solvers were liblinear, newton-cg, lbfgs, sag, and saga. Each time is a single measurement covering model construction and fitting.
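
A single wall-clock measurement can be noisy, especially for fits that take well under a second. A minimal sketch of a steadier measurement, assuming synthetic data and a hypothetical helper named `median_fit_time`, repeats each fit and reports the median using `time.perf_counter`:

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def median_fit_time(make_model, X, y, repeats=5):
    """Return the median wall-clock seconds over several fresh fits."""
    durations = []
    for _ in range(repeats):
        model = make_model()  # new estimator each time so no fitted state is reused
        start = time.perf_counter()
        model.fit(X, y)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Synthetic stand-in data (assumption: not the patient file)
X, y = make_classification(
    n_samples=2000, n_features=10, n_redundant=0, random_state=0
)
for solver in ["lbfgs", "liblinear"]:
    seconds = median_fit_time(
        lambda: LogisticRegression(solver=solver, max_iter=1000), X, y
    )
    print(f"{solver:10s} median fit time: {seconds:.4f} s")
```

The median is used rather than the mean so that one slow run, for example from a background process, does not distort the comparison.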

Ranking the Models:
liblinear is the best choice when execution speed matters: it was the fastest solver and also had the highest holdout accuracy.

On holdout accuracy alone, all five solvers are effectively tied, since their scores range only from 0.73550 to 0.73625. Training accuracy is similarly uninformative (0.747938 to 0.748188), so holdout accuracy was used as the primary criterion and execution time as the tie-breaker.

Conclusion:
The liblinear solver offered the best combination of holdout accuracy and computational speed, which makes it the most effective choice for logistic regression on this dataset.
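
The ranking can also be expressed directly in pandas. The sketch below copies the values from the results table above and sorts by holdout accuracy (higher is better), breaking ties by execution time (lower is better); the variable names `ranked` and `holdout_spread` are introduced here for illustration.

```python
import pandas as pd

# Values copied from the results table above
results_df = pd.DataFrame({
    "Solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "Training Accuracy": [0.748062, 0.748188, 0.747938, 0.747938, 0.748000],
    "Holdout Accuracy": [0.73550, 0.73575, 0.73625, 0.73575, 0.73600],
    "Time Taken (seconds)": [0.184598, 0.277604, 0.086682, 2.179397, 3.238441],
})

# Rank by holdout accuracy (descending), then by time (ascending) for ties
ranked = results_df.sort_values(
    by=["Holdout Accuracy", "Time Taken (seconds)"], ascending=[False, True]
).reset_index(drop=True)

# Spread between the best and worst holdout accuracy
holdout_spread = ranked["Holdout Accuracy"].max() - ranked["Holdout Accuracy"].min()

print(ranked.to_string(index=False))
print(f"Holdout accuracy spread: {holdout_spread:.5f}")
```

Under this ordering liblinear ranks first, and the small spread in holdout accuracy shows why execution time ends up deciding most of the ranking.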