# Task 1: Classification Data Mining Models - Random Forest

### B1. Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods

<p>One question that is relevant to a real-world organizational situation that I will answer using the Random Forest classification method is: What factors are the most predictive of patient readmission within a month of release?</p>

### B2. Define one goal of the data analysis

<p>The goal of this data analysis is to identify the most significant factors that predict patient readmission within a month of release, using demographic, medical, and hospitalization data. This will help the hospital chain to prioritize targeted interventions and resource allocation to reduce readmission rates and avoid penalties from organizations like the Centers for Medicare and Medicaid Services (CMS).</p>

<p>This goal is reasonable as the dataset includes relevant variables such as patient demographics, medical conditions, and hospitalization details, which are related to the problem of patient readmissions.</p>

### C1. Explain how the classification method you chose analyzes the selected dataset. Include expected outcomes.

<p>The Random Forest classification method analyzes the medical dataset in four steps.</p>

<p>The first step is preprocessing. The dataset is prepared by handling missing values and encoding categorical variables. Scaling is not required for Random Forest, because tree splits depend only on the ordering of a feature's values. Relevant features are selected from the dataset, such as patient demographics, medical conditions, and hospitalization details. The target variable, ReAdmis (indicating if a patient was readmitted within a month), is separated from the predictor variables.</p>

<p>The second step is model training. The Random Forest algorithm creates multiple decision trees. Each decision tree in the Random Forest is “built from a different subset of data and features” (Sruthi, 2024, para. 46), which reduces overfitting and improves generalization. The target variable, ReAdmis, is used to train the model, with the goal of predicting whether a patient will be readmitted.</p>

<p>The third step is majority voting. Each decision tree in the forest makes a prediction, and the final prediction for each patient is the class chosen by the majority of trees.</p>

<p>The fourth step is feature importance. The Random Forest algorithm calculates the importance of each feature by measuring how much it reduces impurity across all decision trees in the forest. This helps identify which factors are most predictive of patient readmission.</p>
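<p>The voting step can be illustrated with a minimal sketch on synthetic data. The data, the variable names, and the forest size below are assumptions for illustration and are not part of the medical dataset analysis. Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes; the sketch shows both views.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 500 rows, 8 features, binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=24)

# A small forest; by default each tree is fit on a bootstrap sample of the rows
forest = RandomForestClassifier(n_estimators=25, random_state=24)
forest.fit(X_demo, y_demo)

# Collect every tree's class probabilities for the first five rows
# Resulting shape: (n_trees, n_rows, n_classes)
tree_probs = np.stack([tree.predict_proba(X_demo[:5]) for tree in forest.estimators_])

# Hard-vote view: how many of the 25 trees favor class 1 for each row
votes_for_one = (tree_probs.argmax(axis=2) == 1).sum(axis=0)

# scikit-learn's view: average the per-tree probabilities, then take the larger one
avg_probs = tree_probs.mean(axis=0)
soft_vote = forest.classes_[avg_probs.argmax(axis=1)]

print("Trees voting for class 1:", votes_for_one)
print("Averaged-probability prediction:", soft_vote)
print("forest.predict:", forest.predict(X_demo[:5]))
```

<p>With fully grown trees, as in this sketch, each tree's probabilities are typically 0 or 1, so the averaged-probability prediction and the hard majority vote usually agree.</p>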

<p>Using the Random Forest classification method on the medical dataset yields multiple outcomes. The first outcome involves assessing the model’s ability to predict readmissions using metrics like accuracy, precision, and recall. The second outcome involves identifying the most important factors influencing patient readmission. This will help hospitals prioritize intervention strategies. The third outcome involves uncovering patterns such as patients with certain conditions or demographic characteristics being at higher risk of readmission. This can help hospitals design programs, such as improved follow-up care for high-risk patients. </p>

### C2. List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis

<p>I used three Python libraries for the Random Forest classification analysis. The first is pandas, which I used to load, manipulate, and preprocess the medical dataset; its to_csv() function was also used to export the cleaned, training, validation, and test datasets as CSV files. The second is scikit-learn, which provides the RandomForestClassifier class that I used to build and train the Random Forest model. Scikit-learn also provides LabelEncoder for encoding categorical variables, train_test_split() for splitting the data, metric functions like accuracy_score(), precision_score(), and recall_score() for evaluating the model’s performance, and tools for hyperparameter tuning, such as GridSearchCV and StratifiedKFold. The third is NumPy, which scikit-learn uses internally for computations during model training and evaluation.</p>

### D1. Describe one data preprocessing goal relevant to the classification method from part B1

<p>One data preprocessing goal for using the Random Forest classification method is to handle categorical variables by encoding them into numerical formats. The scikit-learn implementation of Random Forest cannot process string-valued categorical data directly, as it needs numerical inputs to compute splits and make predictions. The medical dataset contains several categorical variables like ReAdmis (Yes/No), Gender (Male/Female), Area (Urban/Suburban/Rural), Services (Blood Work or CT Scan), etc. The categorical variables selected for this analysis must be converted into numerical format, which I do through label encoding. Label encoding assigns each category a unique integer value.</p>
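<p>As a small illustration of label encoding, the sketch below encodes a handful of example Area values. The list of values is made up for illustration; only the category names come from the dataset description. LabelEncoder sorts the categories and uses each category's position as its integer code.</p>

```python
from sklearn.preprocessing import LabelEncoder

# Example category values for illustration only (not rows from the dataset)
area_values = ["Urban", "Suburban", "Rural", "Urban", "Rural"]

encoder = LabelEncoder()
area_codes = encoder.fit_transform(area_values)

# classes_ holds the sorted categories; a category's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)              # {'Rural': 0, 'Suburban': 1, 'Urban': 2}
print(area_codes.tolist())  # [2, 1, 0, 2, 0]
```

<p>The integer codes follow alphabetical order, so they imply an ordering that nominal categories like Area do not really have. Tree-based models split on thresholds and generally tolerate this, although one-hot encoding is a common alternative.</p>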

### D2. Identify the initial dataset variables that you will use to perform the analysis for the classification question from part B1, and classify each variable as continuous or categorical

<p>The variables that I will use to perform the analysis for the Random Forest classification question from part B1 are: ReAdmis (categorical), Age (continuous), Gender (categorical), Income (continuous), Marital (categorical), Area (categorical), HighBlood (categorical), Stroke (categorical), Diabetes (categorical), Overweight (categorical), Arthritis (categorical), Complication_risk (categorical), Initial_admin (categorical), Initial_days (continuous), Doc_visits (continuous), VitD_levels (continuous), Full_meals_eaten (continuous), Soft_drink (categorical), vitD_supp (continuous), TotalCharge (continuous), and Additional_charges (continuous).</p>

### D3. Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step

In [1]:
import pandas as pd

#loading the medical dataset
df = pd.read_csv('C:/Users/jcaye/Downloads/medical_clean.csv')

In [2]:
df.head()

| | CaseOrder | Customer_id | Interaction | UID | City | State | County | Zip | Lat | Lng | ... | TotalCharge | Additional_charges | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | C412403 | 8cd49b13-f45a-4b47-a2bd-173ffa932c2f | 3a83ddb66e2ae73798bdf1d705dc0932 | Eva | AL | Morgan | 35621 | 34.3496 | -86.72508 | ... | 3726.70286 | 17939.40342 | 3 | 3 | 2 | 2 | 4 | 3 | 3 | 4 |
| 1 | 2 | Z919181 | d2450b70-0337-4406-bdbb-bc1037f1734c | 176354c5eef714957d486009feabf195 | Marianna | FL | Jackson | 32446 | 30.84513 | -85.22907 | ... | 4193.190458 | 17612.99812 | 3 | 4 | 3 | 4 | 4 | 4 | 3 | 3 |
| 2 | 3 | F995323 | a2057123-abf5-4a2c-abad-8ffe33512562 | e19a0fa00aeda885b8a436757e889bc9 | Sioux Falls | SD | Minnehaha | 57110 | 43.54321 | -96.63772 | ... | 2434.234222 | 17505.19246 | 2 | 4 | 4 | 4 | 3 | 4 | 3 | 3 |
| 3 | 4 | A879973 | 1dec528d-eb34-4079-adce-0d7a40e82205 | cd17d7b6d152cb6f23957346d11c3f07 | New Richland | MN | Waseca | 56072 | 43.89744 | -93.51479 | ... | 2127.830423 | 12993.43735 | 3 | 5 | 5 | 3 | 4 | 5 | 5 | 5 |
| 4 | 5 | C544523 | 5885f56b-d6da-43a3-8760-83583af94266 | d2f0425877b10ed6bb381f3e2579424a | West Point | VA | King William | 23181 | 37.59894 | -76.88958 | ... | 2113.073274 | 3716.525786 | 2 | 1 | 3 | 3 | 5 | 3 | 4 | 3 |


<p>I used the read_csv() function from the pandas library to load the medical dataset into a DataFrame for analysis. The output from the head() function verifies that the medical dataset was loaded successfully.</p>

In [3]:
#selecting relevant columns
columns_to_use = [
    'ReAdmis', 'Age', 'Gender', 'Income', 'Marital', 'Area',
    'HighBlood', 'Stroke', 'Diabetes', 'Overweight', 'Arthritis',
    'Complication_risk', 'Initial_admin', 'Initial_days', 
    'Doc_visits', 'VitD_levels', 'Full_meals_eaten', 
    'Soft_drink', 'vitD_supp', 'TotalCharge', 'Additional_charges'
]
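#.copy() makes df an independent DataFrame, avoiding pandas' SettingWithCopyWarning when columns are reassigned later
df = df[columns_to_use].copy()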

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ReAdmis             10000 non-null  object 
 1   Age                 10000 non-null  int64  
 2   Gender              10000 non-null  object 
 3   Income              10000 non-null  float64
 4   Marital             10000 non-null  object 
 5   Area                10000 non-null  object 
 6   HighBlood           10000 non-null  object 
 7   Stroke              10000 non-null  object 
 8   Diabetes            10000 non-null  object 
 9   Overweight          10000 non-null  object 
 10  Arthritis           10000 non-null  object 
 11  Complication_risk   10000 non-null  object 
 12  Initial_admin       10000 non-null  object 
 13  Initial_days        10000 non-null  float64
 14  Doc_visits          10000 non-null  int64  
 15  VitD_levels         10000 non-null  float64
 16  Full_

<p>These are the variables that I selected for analysis: ReAdmis, Age, Gender, Income, Marital, Area, HighBlood, Stroke, Diabetes, Overweight, Arthritis, Complication_risk, Initial_admin, Initial_days, Doc_visits, VitD_levels, Full_meals_eaten, Soft_drink, vitD_supp, TotalCharge, and Additional_charges. The dataset is filtered to keep only the specified variables, and all other variables are removed. The output from the info() function confirms that the selected variables exist and shows their data types and non-null value counts.</p>

In [5]:
#checking for missing values
print(df.isnull().sum())

ReAdmis               0
Age                   0
Gender                0
Income                0
Marital               0
Area                  0
HighBlood             0
Stroke                0
Diabetes              0
Overweight            0
Arthritis             0
Complication_risk     0
Initial_admin         0
Initial_days          0
Doc_visits            0
VitD_levels           0
Full_meals_eaten      0
Soft_drink            0
vitD_supp             0
TotalCharge           0
Additional_charges    0
dtype: int64


<p>I used the isnull() function with the sum() function to count missing values in each column of the medical dataset. Since there are no missing values, we can skip handling them and move straight to encoding categorical variables.</p>

In [6]:
from sklearn.preprocessing import LabelEncoder

#categorical variables to label encode
vars_to_encode = [
    'ReAdmis', 'Gender', 'Marital', 'Area', 'HighBlood', 
    'Stroke', 'Diabetes', 'Overweight', 'Arthritis', 
    'Initial_admin', 'Soft_drink', 'Complication_risk'
]

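#applying label encoding to each variable
#a separate fitted encoder is kept per column so each mapping can be inspected or reversed later
encoders = {}
for var in vars_to_encode:
    encoders[var] = LabelEncoder()
    df[var] = encoders[var].fit_transform(df[var])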

In [7]:
df.head()

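| | ReAdmis | Age | Gender | Income | Marital | Area | HighBlood | Stroke | Diabetes | Overweight | ... | Complication_risk | Initial_admin | Initial_days | Doc_visits | VitD_levels | Full_meals_eaten | Soft_drink | vitD_supp | TotalCharge | Additional_charges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 53 | 1 | 86575.93 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 2 | 1 | 10.58577 | 6 | 19.141466 | 0 | 0 | 0 | 3726.70286 | 17939.40342 |
| 1 | 0 | 51 | 0 | 46805.99 | 1 | 2 | 1 | 0 | 0 | 1 | ... | 0 | 1 | 15.129562 | 4 | 18.940352 | 2 | 0 | 1 | 4193.190458 | 17612.99812 |
| 2 | 0 | 53 | 0 | 14370.14 | 4 | 1 | 1 | 0 | 1 | 1 | ... | 2 | 0 | 4.772177 | 4 | 18.057507 | 1 | 0 | 0 | 2434.234222 | 17505.19246 |
| 3 | 0 | 78 | 1 | 39741.49 | 1 | 1 | 0 | 1 | 0 | 0 | ... | 2 | 0 | 1.714879 | 4 | 16.576858 | 1 | 0 | 0 | 2127.830423 | 12993.43735 |
| 4 | 0 | 22 | 0 | 1209.56 | 4 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1.254807 | 5 | 17.439069 | 0 | 1 | 2 | 2113.073274 | 3716.525786 |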


<p>I imported the LabelEncoder class from the sklearn library to convert categorical variables into numerical values, because scikit-learn’s Random Forest implementation only works with numerical inputs. I used a for loop to iterate over each categorical variable and apply label encoding. I used the head() function to display the first 5 rows of the updated medical dataset after label encoding.</p>

### D4. Provide a copy of the cleaned dataset

In [8]:
#exporting the cleaned dataset to a CSV file
df.to_csv('cleaned_dataset.csv', index=False)

<p>The file cleaned_dataset.csv is the medical dataset that has been prepared for analysis using the Random Forest classification method. I included this file in my submission.</p>

### E1. Split the data into training, validation and test datasets and provide the files

In [9]:
from sklearn.model_selection import train_test_split

#separating the dataset into features and the target variable 
X = df.drop(columns=['ReAdmis'])
y = df['ReAdmis']

#splitting the dataset into training + validation (80%) and test set (20%)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=24)

In [10]:
#splitting the training + validation set into separate training (70%) and validation (30%) sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3, random_state=24)

In [11]:
#printing dataset sizes to confirm splits
print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test.shape}")

Training set: (5600, 20)
Validation set: (2400, 20)
Test set: (2000, 20)


<p>Before splitting the data, I separated the predictors from the target variable, ReAdmis, assigning the predictors to X and the target to y. Next, I used the train_test_split() function from the sklearn library to split the data into a combined training and validation set (80%) and a test set (20%). Then I split the combined set into training (70%) and validation (30%) sets. As a result, the training set is 56%, the validation set is 24%, and the test set is 20% of the original data.</p>
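<p>The split proportions follow from applying the two splits in sequence. The short calculation below reproduces the row counts printed above; the variable names are illustrative.</p>

```python
n_rows = 10_000   # rows in the dataset, per df.info()
test_frac = 0.2   # first split: share held out for testing
val_frac = 0.3    # second split: share of the remaining 80% used for validation

n_test = round(n_rows * test_frac)
n_train_val = n_rows - n_test
n_val = round(n_train_val * val_frac)
n_train = n_train_val - n_val

# Express each subset as a share of the original 10,000 rows
for name, n in [("Training", n_train), ("Validation", n_val), ("Test", n_test)]:
    print(f"{name}: {n} rows ({n / n_rows:.0%})")
```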

#### Providing the training, validation, and test data files

In [12]:
#combining features and target for the training set, and exporting it
training_data = X_train.copy()
training_data['ReAdmis'] = y_train
training_data.to_csv('training_data.csv', index=False)

#combining features and target for the validation set, and exporting it
validation_data = X_val.copy()
validation_data['ReAdmis'] = y_val
validation_data.to_csv('validation_data.csv', index=False)

#combining features and target for the test set, and exporting it
test_data = X_test.copy()
test_data['ReAdmis'] = y_test
test_data.to_csv('test_data.csv', index=False)

<p>I combined features and target for the training, validation, and test sets, and exported all three as CSV files. The training set is saved as “training_data.csv.” The validation set is saved as “validation_data.csv.” The test set is saved as “test_data.csv.” </p>

### E2. Create an initial model using the training dataset and provide a screenshot of the following metrics

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

#initializing the Random Forest model
rf_model = RandomForestClassifier(random_state=24)

#training the model using the training dataset
rf_model.fit(X_train, y_train)

#making predictions on the validation set
y_pred = rf_model.predict(X_val)
y_pred_prob = rf_model.predict_proba(X_val)[:, 1]

In [14]:
#calculating evaluation metrics
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
auc_roc = roc_auc_score(y_val, y_pred_prob)
conf_matrix = confusion_matrix(y_val, y_pred)

#printing evaluation metrics
print("Evaluating the initial Random Forest model on the validation dataset\n")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"AUC-ROC: {auc_roc:.2f}")
print(f"Confusion Matrix:\n{conf_matrix}")

Evaluating the initial Random Forest model on the validation dataset

Accuracy: 0.98
Precision: 0.97
Recall: 0.97
F1 Score: 0.97
AUC-ROC: 1.00
Confusion Matrix:
[[1435   26]
 [  26  913]]


<p>In this code, I started by importing the RandomForestClassifier class from the sklearn library, as this class is used to create and train the Random Forest model. I also imported functions like accuracy_score(), precision_score(), recall_score(), etc., as these functions are used to evaluate the model’s performance.</p>

<p>I initialized the Random Forest model by calling RandomForestClassifier() with a fixed random_state for reproducibility. Next, I fit the model to the training set. Then, I used the model’s predict() method to predict the class labels for the validation set. I also used the model’s predict_proba() method to predict the probability of the positive class for each patient, which is needed to calculate the AUC-ROC metric.</p>

<p>I calculated and displayed the accuracy, precision, recall, F1 score, AUC-ROC, and confusion matrix for the Random Forest model. The model has an accuracy score of 0.98, precision score of 0.97, recall score of 0.97, F1 score of 0.97, and an AUC-ROC of 1.00. The model’s confusion matrix shows 1435 true negatives, 26 false positives, 26 false negatives, and 913 true positives. </p>
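<p>As a cross-check, accuracy, precision, recall, and F1 score can be derived directly from the four cells of the confusion matrix. The sketch below uses the counts reported above for the initial model on the validation set; the variable names are illustrative.</p>

```python
# Confusion matrix reported for the initial model on the validation set
tn, fp = 1435, 26   # actual negatives: correctly rejected, wrongly flagged
fn, tp = 26, 913    # actual positives: missed, correctly flagged

acc_check = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
prec_check = tp / (tp + fp)                   # share of predicted readmissions that are real
rec_check = tp / (tp + fn)                    # share of real readmissions that are caught
f1_check = 2 * prec_check * rec_check / (prec_check + rec_check)  # harmonic mean

print(f"Accuracy: {acc_check:.2f}")
print(f"Precision: {prec_check:.2f}")
print(f"Recall: {rec_check:.2f}")
print(f"F1 Score: {f1_check:.2f}")
```

<p>AUC-ROC cannot be recovered this way, because it depends on the predicted probabilities rather than on the thresholded class labels.</p>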

### E3. Perform hyperparameter tuning on the validation dataset using k-fold cross validation to find the optimized model.

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

#defining the parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],  #number of trees
    "max_depth": [5, 10, 20, None],  #max depth of each tree
    "min_samples_split": [2, 5, 10],  #min samples required to split a node
    "min_samples_leaf": [1, 2, 4],  #min samples required at each leaf
}

#initializing the Random Forest classifier
rf_model = RandomForestClassifier(random_state=24)

#defining the cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=24)

#initializing GridSearchCV
grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    scoring="accuracy",  #accuracy is the metric to optimize
    cv=cv,  #use 5-fold cross-validation
    verbose=2, 
    n_jobs=-1,  
)

#performing the grid search
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


<p>The hyperparameters that I selected for tuning are ‘n_estimators’, ‘max_depth’, ‘min_samples_split’, and ‘min_samples_leaf’. The hyperparameter ‘n_estimators’ represents the number of trees in the forest. The hyperparameter ‘max_depth’ controls the maximum depth of each tree. The hyperparameter ‘min_samples_split’ is the minimum number of samples required to split a node. The hyperparameter ‘min_samples_leaf’ is the minimum number of samples required in a leaf node.</p>

<p>The justification for each selected hyperparameter is as follows. I selected ‘n_estimators’ because it controls the number of trees in the forest, with more trees typically improving performance but increasing training time. I selected ‘max_depth’ because it limits tree depth, helping to prevent overfitting by avoiding overly complex trees. I selected ‘min_samples_split’ because it reduces overfitting by requiring more samples to split a node. I selected ‘min_samples_leaf’ because it reduces overfitting by ensuring leaf nodes have sufficient samples.</p>
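<p>The size of the grid search follows from the parameter grid: every combination of values is one candidate, and each candidate is fit once per cross-validation fold. The sketch below reproduces the counts in the GridSearchCV log using only the standard library; the variable names are illustrative.</p>

```python
from itertools import product

# Same grid as in the search above
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_splits = 5  # folds in the StratifiedKFold strategy

# Every combination of one value per hyperparameter is one candidate model
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
total_fits = n_candidates * n_splits

print(f"{n_candidates} candidates x {n_splits} folds = {total_fits} fits")
```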

In [16]:
#finding best hyperparameters
best_params = grid_search.best_params_
best_score = grid_search.best_score_

#displaying the results
print("Best Hyperparameters:", best_params)
print("Best Cross-Validation Score:", best_score)

Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 50}
Best Cross-Validation Score: 0.979642857142857


<p>The optimized Random Forest model’s best hyperparameters are max_depth=10, min_samples_leaf=4, min_samples_split=10, and n_estimators=50. This combination of hyperparameters achieved an average accuracy score of 97.96% during cross-validation.</p>

#### Evaluating the optimized model on the validation dataset

In [17]:
#training the model with the best hyperparameters
best_rf_model = RandomForestClassifier(**grid_search.best_params_, random_state=24)
best_rf_model.fit(X_train, y_train)

#making predictions on the validation set
y_val_pred = best_rf_model.predict(X_val)
y_val_pred_prob = best_rf_model.predict_proba(X_val)[:, 1]

#calculating evaluation metrics
accuracy = accuracy_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
recall = recall_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred)
auc_roc = roc_auc_score(y_val, y_val_pred_prob)
conf_matrix = confusion_matrix(y_val, y_val_pred)

#displaying metrics
print("Evaluating the optimized Random Forest model on the validation dataset\n")
print(f"Validation Accuracy: {accuracy:.2f}")
print(f"Validation Precision: {precision:.2f}")
print(f"Validation Recall: {recall:.2f}")
print(f"Validation F1 Score: {f1:.2f}")
print(f"Validation AUC-ROC: {auc_roc:.2f}")
print(f"Validation Confusion Matrix:\n{conf_matrix}")

Evaluating the optimized Random Forest model on the validation dataset

Validation Accuracy: 0.98
Validation Precision: 0.97
Validation Recall: 0.97
Validation F1 Score: 0.97
Validation AUC-ROC: 1.00
Validation Confusion Matrix:
[[1437   24]
 [  25  914]]


<p>I evaluated the optimized Random Forest model on the validation dataset. The optimized model has an accuracy score of 0.98, precision score of 0.97, recall score of 0.97, F1 score of 0.97, and an AUC-ROC of 1.00. The model’s confusion matrix shows 1437 true negatives, 24 false positives, 25 false negatives, and 914 true positives.</p>

### E4. Use the optimized model identified in part E3 to make predictions using the test dataset and provide a screenshot of the following metrics

In [18]:
#making predictions on the test set
y_test_pred = best_rf_model.predict(X_test)
y_test_pred_prob = best_rf_model.predict_proba(X_test)[:, 1]

#calculating evaluation metrics
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_auc_roc = roc_auc_score(y_test, y_test_pred_prob)
conf_matrix = confusion_matrix(y_test, y_test_pred)

#displaying metrics
print("Evaluating the optimized Random Forest model on the test dataset\n")
print(f"Test Accuracy: {test_accuracy:.2f}")
print(f"Test Precision: {test_precision:.2f}")
print(f"Test Recall: {test_recall:.2f}")
print(f"Test F1 Score: {test_f1:.2f}")
print(f"Test AUC-ROC: {test_auc_roc:.2f}")
print(f"Test Confusion Matrix:\n{conf_matrix}")

Evaluating the optimized Random Forest model on the test dataset

Test Accuracy: 0.98
Test Precision: 0.98
Test Recall: 0.98
Test F1 Score: 0.98
Test AUC-ROC: 1.00
Test Confusion Matrix:
[[1250   15]
 [  16  719]]


<p>I evaluated the optimized Random Forest model on the test dataset. The optimized model has an accuracy score of 0.98, precision score of 0.98, recall score of 0.98, F1 score of 0.98, and an AUC-ROC of 1.00. The model’s confusion matrix shows 1250 true negatives, 15 false positives, 16 false negatives, and 719 true positives. </p>

#### Feature importance

In [19]:
#getting feature importances
importances = best_rf_model.feature_importances_

#pairing feature names with their importance values
feature_importance = sorted(zip(X_train.columns, importances), key=lambda x: x[1], reverse=True)

#printing all features, most important first
for feature, importance in feature_importance:
    print(f"{feature}: {importance:.4f}")

Initial_days: 0.5439
TotalCharge: 0.4190
Additional_charges: 0.0060
Income: 0.0058
VitD_levels: 0.0057
Age: 0.0039
Complication_risk: 0.0027
Marital: 0.0019
Initial_admin: 0.0018
Full_meals_eaten: 0.0015
Doc_visits: 0.0013
Area: 0.0013
Arthritis: 0.0009
HighBlood: 0.0008
vitD_supp: 0.0008
Gender: 0.0007
Overweight: 0.0007
Stroke: 0.0005
Soft_drink: 0.0005
Diabetes: 0.0005


<p>I extracted and sorted the feature importance values from the optimized Random Forest model, ranking features by their contribution to predicting patient readmissions within a month. The most significant predictor of readmission is ‘Initial_days’ with an importance value of 0.5439. This suggests that longer stays may indicate severe health issues, increasing the likelihood of readmission. The second most significant predictor is ‘TotalCharge’ with an importance value of 0.4190. This suggests that higher charges may indicate intensive treatments or complex conditions, which correlate with higher readmission risks. Together, these two features account for about 96% of the total importance. Medical conditions like arthritis, high blood pressure, stroke, and diabetes have very low importance values, indicating they contribute minimally to the model’s predictions.</p>
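<p>Impurity-based importances are computed from the training data, and the scikit-learn documentation notes that they can favor features with many unique values. One optional cross-check, which was not part of this analysis, is permutation importance on held-out data. The sketch below is a minimal example on synthetic data; the data, model settings, and variable names are assumptions for illustration only.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0, random_state=24
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=24)

model = RandomForestClassifier(n_estimators=50, random_state=24).fit(X_tr, y_tr)

# Impurity-based importances: derived from the training data, normalized to sum to 1
impurity_imp = model.feature_importances_

# Permutation importance: mean drop in held-out accuracy when one feature is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=24)

for i in range(X_demo.shape[1]):
    print(f"feature_{i}: impurity={impurity_imp[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

<p>If both measures rank the same features at the top, that agreement gives more confidence in the ranking than either measure alone.</p>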

### F1. Compare and discuss the metrics of accuracy, precision, recall, F1 score, and AUC-ROC from the use of the optimized model on the test dataset and the initial model on the training dataset to evaluate the performance of the optimized model

<p>Here I compare the metrics of the initial Random Forest model, evaluated on the validation dataset, with those of the optimized Random Forest model, evaluated on the test dataset. Because the two sets of metrics come from different datasets, small differences reflect both the tuning and the difference in data. Both models achieved an accuracy of 98%, which means both are very effective at correctly classifying patient readmissions, and the optimized model’s performance on the test dataset indicates that it generalizes well to new data. Precision was slightly higher for the optimized model, 98% compared to 97%. This reflects a smaller share of false positives among predicted readmissions: 15 of 734 on the test set, versus 26 of 939 for the initial model on the validation set. Recall was also slightly higher, 98% compared to 97%. This reflects a smaller share of missed readmissions: 16 of 735 on the test set, versus 26 of 939 on the validation set. The F1 score rose from 0.97 for the initial model to 0.98 for the optimized model, showing that precision and recall remained balanced. Both models achieved an AUC-ROC of 1.00 when rounded to two decimal places, indicating that they are excellent at distinguishing between patients who will and will not be readmitted, and that the optimized model maintained the class separation of the initial model.</p>

<p>Overall, the optimized Random Forest model showed slightly higher precision, recall, and F1 score than the initial model. On the same validation dataset, the optimized model made 49 errors (24 false positives and 25 false negatives) compared to 52 for the initial model (26 false positives and 26 false negatives), so the gain attributable to hyperparameter tuning is small. The test results nonetheless suggest that the tuned model is reliable for predicting patient readmissions on unseen data.</p>

### F2. Discuss the results and implications of your classification analysis

<p>The Random Forest classification analysis on the medical dataset produced strong results, as highlighted by the key metrics. The optimized model achieved an accuracy score of 98%, indicating excellent performance in correctly classifying patient readmissions. The model achieved a precision score of 98%, showing excellent performance in reducing false positives and reliably identifying patients likely to be readmitted. The model achieved a recall score of 98%, indicating that the model is very effective at identifying patients who will actually be readmitted. The model achieved an F1 score of 98%, showing a near-perfect balance between precision and recall, making it reliable for predicting readmissions. The model achieved an AUC-ROC of 1.00, demonstrating excellent ability in distinguishing between patients who will and will not be readmitted. </p>

<p>Regarding the results of performing feature importance analysis in the optimized Random Forest model, there are two features that significantly influence patient readmission. The most important feature is ‘Initial_days’ with an importance value of 0.5439. This suggests that longer stays could reflect serious health problems, raising the chances of readmission. The second most important feature is ‘TotalCharge’ with an importance value of 0.4190. This suggests that higher charges might reflect intensive treatments or complex conditions, which are associated with increased readmission risks. Medical condition features like arthritis, high blood pressure, stroke, and diabetes have very low importance values, showing they have little impact on the model’s predictions.</p>

<p>The analysis has several important implications. The first implication is that the optimized model can accurately identify patients at risk of readmission, since it has high precision and recall. This allows hospitals to focus resources on high-risk patients, potentially reducing readmission rates and improving patient outcomes. The second implication is that the optimized model produces few false positives, minimizing unnecessary interventions for low-risk patients. The third implication is that hospitals can use this model to predict readmissions proactively, helping reduce penalties from organizations like CMS. The fourth implication is that hospitals can prioritize interventions for patients with long hospital stays and high treatment costs, as these factors often indicate higher readmission risks. The fifth implication is that hospitals can simplify their predictive models by focusing on key features like ‘Initial_days’ and ‘TotalCharge’, reducing complexity while maintaining accuracy.</p>

### F3.  Discuss one limitation of your data analysis

<p>One limitation of the Random Forest data analysis is that it lacks straightforward interpretability compared to simpler models like logistic regression. While feature importance provides some interpretability, it does not describe the direction or shape of the relationships between predictors and the target variable, such as how ‘Initial_days’ specifically influences readmission risk. This makes it harder for healthcare professionals to understand or trust the model’s predictions, which is critical in healthcare decision-making.</p>

### F4. Recommend a course of action for the real-world organizational situation from part B1 based on your results and implications discussed in part F2

<p>One recommended course of action to address the question “What factors are the most predictive of patient readmission within a month of release?” is to focus on high-risk predictors such as ‘Initial_days’ and ‘TotalCharge.’ Hospitals should improve discharge planning for patients with longer hospital stays, as they face a higher risk of readmission. Hospitals should provide additional follow-up care, such as home visits, telehealth consultations or regular check-ins for these patients. Patients with higher hospital charges often have more complex medical conditions. Hospitals should create personalized post-discharge care plans for these patients to address potential complications early. Overall, hospitals should prioritize resources for patients with the highest risk scores and reduce unnecessary interventions for low-risk patients to save time and costs. </p>

### G. Panopto Video

<p>Panopto Video Link: https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=00a8caac-64cf-4544-9387-b26e001a4503</p>

### H. Record the web sources used to acquire data or segments of third-party code to support the analysis. Ensure the web sources are reliable

<p>
Firdose, T. (2023, August 24). <i>Fine-tuning your random forest classifier: A guide to hyperparameter tuning</i>. Medium. https://tahera-firdose.medium.com/fine-tuning-your-random-forest-classifier-a-guide-to-hyperparameter-tuning-d5ceab0c4852
</p>

### I. Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized

<p>Sruthi. (2024, December 11). <i>Understanding random forest algorithm with examples</i>. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/</p>