In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("appointments.csv")
df

Unnamed: 0,Noshow,SMSreceived,Age,GenderM,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,TimeGapDays,prevNoshow,WeekDay,AgeCategory,WaitingTimeCategory,TotalConditions
0,0,1,84.0,True,0,1,1,0,1,115,0,Friday,Senior,Long Wait,3
1,0,1,83.0,False,0,1,0,0,0,115,0,Friday,Senior,Long Wait,1
2,0,1,74.0,False,0,0,0,0,0,109,0,Friday,Senior,Long Wait,0
3,0,1,70.0,False,0,1,1,0,0,109,0,Friday,Senior,Long Wait,2
4,0,1,87.0,False,0,0,0,0,0,109,0,Friday,Senior,Long Wait,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110520,0,0,54.0,True,0,0,0,0,0,0,0,Wednesday,Adult,Same Day,0
110521,0,0,43.0,False,0,0,0,0,0,0,1,Wednesday,Adult,Same Day,0
110522,0,0,27.0,True,0,0,0,0,0,0,0,Wednesday,Adult,Same Day,0
110523,0,0,30.0,False,0,0,0,0,0,0,0,Wednesday,Adult,Same Day,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110525 entries, 0 to 110524
Data columns (total 15 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Noshow               110525 non-null  int64  
 1   SMSreceived          110525 non-null  int64  
 2   Age                  110525 non-null  float64
 3   GenderM              110525 non-null  bool   
 4   Scholarship          110525 non-null  int64  
 5   Hipertension         110525 non-null  int64  
 6   Diabetes             110525 non-null  int64  
 7   Alcoholism           110525 non-null  int64  
 8   Handcap              110525 non-null  int64  
 9   TimeGapDays          110525 non-null  int64  
 10  prevNoshow           110525 non-null  int64  
 11  WeekDay              110525 non-null  object 
 12  AgeCategory          110525 non-null  object 
 13  WaitingTimeCategory  110525 non-null  object 
 14  TotalConditions      110525 non-null  int64  
dtypes: bool(1), float

### Logistic Regression and Hypothesis Testing: Data Preparation

Train-Test Split: Split your dataset into a training set and a test set. This will allow you to train the model on one portion of the data and evaluate its performance on another, unseen portion.
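A stratified split could be sketched as follows. It uses a small synthetic frame rather than appointments.csv, and the 80/20 ratio, the seed, and the names `demo`, `train`, and `test` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the appointments data (illustrative values only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Noshow": rng.binomial(1, 0.2, size=1000),
    "Age": rng.integers(0, 90, size=1000),
})

# Sample 80% of the rows within each Noshow class so both sets
# keep roughly the same no-show rate (a stratified split).
train = demo.groupby("Noshow").sample(frac=0.8, random_state=0)
test = demo.drop(train.index)

print(len(train), len(test))
print(train["Noshow"].mean(), test["Noshow"].mean())
```

Sampling within each class keeps the minority class represented in both sets, which matters when the target is imbalanced.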

Feature Selection: Based on your earlier analysis, decide which features (independent variables) to include in the model. Make sure that the selected features are relevant and meaningful for predicting appointment no-shows.

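Categorical Encoding: Categorical variables such as 'WeekDay', 'AgeCategory', and 'WaitingTimeCategory' must be encoded numerically, and the boolean 'GenderM' converted to 0/1. One-hot encoding suits nominal variables like these; label encoding assigns an arbitrary order to the categories, which a regression model would treat as meaningful. When one-hot encoding for a model with an intercept, drop one level per variable so that the dummy columns are not collinear with the constant.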

Normalization/Scaling: Numerical features such as 'Age' and 'TimeGapDays' are on different scales. Scaling does not change the fit of an unregularized logistic regression, but it can help the optimizer converge, makes coefficient magnitudes easier to compare, and matters when regularization is applied, because the penalty depends on the scale of each coefficient.

Interactions and Polynomial Features: Explore the possibility of creating interaction terms or polynomial features if you suspect that certain feature combinations or higher-order relationships might have an impact on appointment no-shows.

Checking for Imbalanced Classes: Ensure that you've addressed the issue of class imbalance in the 'Noshow' target variable. If there's a severe class imbalance, consider techniques like oversampling, undersampling, or using appropriate class weights during model training.

Handling Multicollinearity: If you have highly correlated independent variables, consider addressing multicollinearity by removing one of the correlated variables or using dimensionality reduction techniques like Principal Component Analysis (PCA).

Adding a Constant Term: Statsmodels does not add an intercept automatically, so for logistic regression with `sm.Logit` you need to add a constant column to the feature matrix using `sm.add_constant()`.

Data Summary and Exploration: Conduct exploratory data analysis to understand the distribution of your features and the target variable, identify outliers, and check for any potential issues that may impact the modeling process.

Check for Missing Values: Even though you mentioned that the data has been cleaned, it's a good practice to double-check for any remaining missing values that might have been overlooked during the initial data cleaning process.

Encoding the Target Variable: Ensure that the 'Noshow' target variable is correctly encoded as 0 for showed up and 1 for did not show up, as logistic regression models typically work with binary outcomes.
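The class-balance, missing-value, and target-encoding checks described above could look like this. The frame `toy` is a hypothetical five-row stand-in; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Tiny illustrative frame (values are made up, not taken from the dataset).
toy = pd.DataFrame({
    "Noshow": [0, 0, 1, 0, 0],
    "Age": [84.0, 83.0, 74.0, 70.0, 87.0],
    "WeekDay": ["Friday"] * 5,
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Confirm the target only takes the values 0 and 1.
target_is_binary = set(toy["Noshow"].unique()) <= {0, 1}

# Share of each class in the target.
class_shares = toy["Noshow"].value_counts(normalize=True)

print(missing_per_column)
print(target_is_binary)
print(class_shares)
```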

Once you've completed these pre-processing steps, you can move forward with building the logistic regression model using Statsmodels. Remember that a well-prepared dataset is essential for obtaining reliable and interpretable results from your statistical analysis.

In [5]:
# List of categorical columns to encode
categorical_columns = ['GenderM', 'WeekDay', 'AgeCategory', 'WaitingTimeCategory']

# Use pandas' get_dummies function for one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

In [6]:
df_encoded

Unnamed: 0,Noshow,SMSreceived,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,TimeGapDays,prevNoshow,...,WeekDay_Saturday,WeekDay_Thursday,WeekDay_Tuesday,WeekDay_Wednesday,AgeCategory_Child,AgeCategory_Senior,AgeCategory_Teenager,WaitingTimeCategory_Medium Wait,WaitingTimeCategory_Same Day,WaitingTimeCategory_Short Wait
0,0,1,84.0,0,1,1,0,1,115,0,...,0,0,0,0,0,1,0,0,0,0
1,0,1,83.0,0,1,0,0,0,115,0,...,0,0,0,0,0,1,0,0,0,0
2,0,1,74.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
3,0,1,70.0,0,1,1,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
4,0,1,87.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110520,0,0,54.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110521,0,0,43.0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
110522,0,0,27.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110523,0,0,30.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [7]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110525 entries, 0 to 110524
Data columns (total 23 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Noshow                           110525 non-null  int64  
 1   SMSreceived                      110525 non-null  int64  
 2   Age                              110525 non-null  float64
 3   Scholarship                      110525 non-null  int64  
 4   Hipertension                     110525 non-null  int64  
 5   Diabetes                         110525 non-null  int64  
 6   Alcoholism                       110525 non-null  int64  
 7   Handcap                          110525 non-null  int64  
 8   TimeGapDays                      110525 non-null  int64  
 9   prevNoshow                       110525 non-null  int64  
 10  TotalConditions                  110525 non-null  int64  
 11  GenderM_True                     110525 non-null  uint8  
 12  We

In [8]:
# Loop through columns and replace True/False with 1/0
for column in df_encoded.columns:
    if df_encoded[column].dtype == bool:
        df_encoded[column] = df_encoded[column].astype(int)

In [9]:
df_encoded

Unnamed: 0,Noshow,SMSreceived,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,TimeGapDays,prevNoshow,...,WeekDay_Saturday,WeekDay_Thursday,WeekDay_Tuesday,WeekDay_Wednesday,AgeCategory_Child,AgeCategory_Senior,AgeCategory_Teenager,WaitingTimeCategory_Medium Wait,WaitingTimeCategory_Same Day,WaitingTimeCategory_Short Wait
0,0,1,84.0,0,1,1,0,1,115,0,...,0,0,0,0,0,1,0,0,0,0
1,0,1,83.0,0,1,0,0,0,115,0,...,0,0,0,0,0,1,0,0,0,0
2,0,1,74.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
3,0,1,70.0,0,1,1,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
4,0,1,87.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110520,0,0,54.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110521,0,0,43.0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
110522,0,0,27.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110523,0,0,30.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [10]:
df_encoded.to_csv("appointments_encoded.csv", index=False)

### Fitting the Model

In [11]:
import statsmodels.api as sm

In [12]:
# Step 2: Define target variable and feature matrix
X = df_encoded.drop('Noshow', axis=1)  # Independent variables (features)
y = df_encoded['Noshow']  # Dependent variable

In [13]:
X

Unnamed: 0,SMSreceived,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,TimeGapDays,prevNoshow,TotalConditions,...,WeekDay_Saturday,WeekDay_Thursday,WeekDay_Tuesday,WeekDay_Wednesday,AgeCategory_Child,AgeCategory_Senior,AgeCategory_Teenager,WaitingTimeCategory_Medium Wait,WaitingTimeCategory_Same Day,WaitingTimeCategory_Short Wait
0,1,84.0,0,1,1,0,1,115,0,3,...,0,0,0,0,0,1,0,0,0,0
1,1,83.0,0,1,0,0,0,115,0,1,...,0,0,0,0,0,1,0,0,0,0
2,1,74.0,0,0,0,0,0,109,0,0,...,0,0,0,0,0,1,0,0,0,0
3,1,70.0,0,1,1,0,0,109,0,2,...,0,0,0,0,0,1,0,0,0,0
4,1,87.0,0,0,0,0,0,109,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110520,0,54.0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110521,0,43.0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0
110522,0,27.0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110523,0,30.0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [14]:
y

0         0
1         0
2         0
3         0
4         0
         ..
110520    0
110521    0
110522    0
110523    0
110524    0
Name: Noshow, Length: 110525, dtype: int64

In [15]:
# Step 3: Add a constant term to the feature matrix
X = sm.add_constant(X)

In [16]:
X

Unnamed: 0,const,SMSreceived,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,TimeGapDays,prevNoshow,...,WeekDay_Saturday,WeekDay_Thursday,WeekDay_Tuesday,WeekDay_Wednesday,AgeCategory_Child,AgeCategory_Senior,AgeCategory_Teenager,WaitingTimeCategory_Medium Wait,WaitingTimeCategory_Same Day,WaitingTimeCategory_Short Wait
0,1.0,1,84.0,0,1,1,0,1,115,0,...,0,0,0,0,0,1,0,0,0,0
1,1.0,1,83.0,0,1,0,0,0,115,0,...,0,0,0,0,0,1,0,0,0,0
2,1.0,1,74.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
3,1.0,1,70.0,0,1,1,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
4,1.0,1,87.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110520,1.0,0,54.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110521,1.0,0,43.0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
110522,1.0,0,27.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110523,1.0,0,30.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [17]:
# Step 4: Create the logistic regression model
logit_model = sm.Logit(y, X)
logit_model

<statsmodels.discrete.discrete_model.Logit at 0x158e49b10>

In [18]:
# Step 5: Fit the model
result = logit_model.fit()
result

Optimization terminated successfully.
         Current function value: 0.442143
         Iterations 7


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x11fc12e50>

In [19]:
# Print the summary
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                 Noshow   No. Observations:               110525
Model:                          Logit   Df Residuals:                   110502
Method:                           MLE   Df Model:                           22
Date:                Mon, 19 Aug 2024   Pseudo R-squ.:                  0.1211
Time:                        19:29:58   Log-Likelihood:                -48868.
converged:                       True   LL-Null:                       -55601.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                      coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              -0.1782      0.047     -3.754      0.000      -0.271      -0.085
SMSreceived                        -0.2501      0.019    -13.438    

To fit a logistic regression model using the statsmodels library, you can follow these steps:

1. Import the necessary libraries.
2. Define your target variable (dependent variable) and feature matrix (independent variables).
3. Add a constant term to the feature matrix.
4. Fit the logistic regression model using `sm.Logit`.
5. Get the summary of the model to examine coefficients, p-values, and other statistics.

### Exploring the Regression and Testing Results

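NaN Estimates: Some coefficients show NaN in the "std err," "z," "P>|z|," and confidence interval columns. These are not missing values in the data; they indicate that standard errors could not be computed for those variables. Common causes are perfect multicollinearity, complete separation of data points, or other numerical instabilities. A likely candidate here is 'TotalConditions': in the rows displayed above it equals the sum of 'Hipertension', 'Diabetes', 'Alcoholism', and 'Handcap', which would make it an exact linear combination of columns already in the model. Investigate and address this before interpreting the results.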

Pseudo R-squared: The Pseudo R-squared of approximately 0.1211 is McFadden's pseudo R-squared, which statsmodels reports for Logit models. It is defined as 1 - (Log-Likelihood / LL-Null) and measures how much the fitted model improves on an intercept-only model. Logistic regression has no R-squared in the linear-regression sense, so this value shouldn't be compared directly to linear regression R-squared values.
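As a quick check, the reported value can be reproduced from the two log-likelihoods shown in the summary output. The variable names are illustrative; the two numbers are the rounded values printed in the summary.

```python
# Rounded values from the summary: Log-Likelihood and LL-Null.
ll_model = -48868.0
ll_null = -55601.0

# McFadden's pseudo R-squared.
mcfadden_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic behind the "LLR p-value" line.
llr_statistic = 2 * (ll_model - ll_null)

print(round(mcfadden_r2, 4))  # 0.1211
print(llr_statistic)          # 13466.0
```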

Coefficients and p-values: Examine the coefficients and associated p-values for each feature. The p-values indicate the statistical significance of each feature in explaining appointment no-shows. Features with low p-values (typically less than 0.05) are considered statistically significant. For example, 'Age,' 'SMSreceived,' 'Scholarship,' 'prevNoshow,' 'WaitingTimeCategory,' and 'GenderM_True' seem to be statistically significant.

Confidence Intervals: Pay attention to the confidence intervals for coefficients. They provide a range within which the true parameter value is likely to fall. Wider confidence intervals suggest more uncertainty about the parameter estimates.
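The 95% interval in the summary is a Wald interval: the coefficient plus or minus about 1.96 standard errors. The sketch below recomputes it for the `const` row using the rounded values printed in the summary, so the result only approximately matches the reported [-0.271, -0.085].

```python
from statistics import NormalDist

# Rounded values from the "const" row of the summary table.
coef, std_err = -0.1782, 0.047

# Two-sided 95% critical value of the standard normal (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)

lower = coef - z_crit * std_err
upper = coef + z_crit * std_err

print(round(lower, 3), round(upper, 3))
```

The small differences from the reported bounds come from the standard error being displayed to only three decimals.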

WeekDay Variables: Some 'WeekDay' variables (e.g., 'WeekDay_Monday' and 'WeekDay_Saturday') do not appear to be statistically significant based on their p-values. You may consider whether these variables are necessary for the model or if they can be removed.

AgeCategory_Teenager and AgeCategory_Senior: These two variables have p-values close to the significance threshold (0.05). Consider whether their inclusion is meaningful for your analysis or if you should explore other ways to represent age categories.

Interactions and Polynomial Terms: Depending on your research question and domain knowledge, you may want to explore interactions between features or include polynomial terms if they make sense in the context of your analysis.

Model Assumptions: Ensure that the assumptions of logistic regression are met. Logistic regression assumes that the relationship between the independent variables and the log-odds of the dependent variable is linear. Check for any violations of this assumption.

To address the NaN estimates:

- Identify which variables have NaN standard errors and determine the root cause (e.g., multicollinearity, data issues).
- Check if there is complete separation in your data, which can lead to numerical instability. If so, you may need to address it using techniques like Firth's penalized likelihood logistic regression.
- Consider using regularization techniques such as L1 (Lasso) or L2 (Ridge) regularization to stabilize coefficient estimates and mitigate the effects of multicollinearity.

It's essential to resolve the NaN estimates and evaluate whether the model assumptions are met before drawing conclusions from the logistic regression results. Additionally, domain knowledge and context are crucial for interpreting the significance of certain features and making informed decisions about model refinement.
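One way to test for perfect multicollinearity is to compare the rank of the design matrix with its number of columns. The sketch below builds a six-row illustrative frame in which `TotalConditions` is defined as the sum of the four condition columns; that definition is an assumption, consistent with the rows displayed in this notebook but not confirmed by it.

```python
import numpy as np
import pandas as pd

# Illustrative rows; TotalConditions is assumed to be the row-wise sum.
demo = pd.DataFrame({
    "Hipertension": [1, 1, 0, 1, 0, 0],
    "Diabetes":     [1, 0, 0, 1, 0, 1],
    "Alcoholism":   [0, 0, 0, 0, 1, 0],
    "Handcap":      [1, 0, 0, 0, 0, 0],
})
demo["TotalConditions"] = demo.sum(axis=1)
demo.insert(0, "const", 1.0)

# A rank below the column count means at least one column is an
# exact linear combination of the others.
rank = np.linalg.matrix_rank(demo.to_numpy(dtype=float))
n_columns = demo.shape[1]

print(rank, n_columns)  # 5 6
```

In the notebook, the same comparison could be made with `np.linalg.matrix_rank(X.to_numpy(dtype=float))` against `X.shape[1]`.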

### Test and Correct for Multicollinearity

In [20]:
y

0         0
1         0
2         0
3         0
4         0
         ..
110520    0
110521    0
110522    0
110523    0
110524    0
Name: Noshow, Length: 110525, dtype: int64

In [21]:
X

Unnamed: 0,const,SMSreceived,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,TimeGapDays,prevNoshow,...,WeekDay_Saturday,WeekDay_Thursday,WeekDay_Tuesday,WeekDay_Wednesday,AgeCategory_Child,AgeCategory_Senior,AgeCategory_Teenager,WaitingTimeCategory_Medium Wait,WaitingTimeCategory_Same Day,WaitingTimeCategory_Short Wait
0,1.0,1,84.0,0,1,1,0,1,115,0,...,0,0,0,0,0,1,0,0,0,0
1,1.0,1,83.0,0,1,0,0,0,115,0,...,0,0,0,0,0,1,0,0,0,0
2,1.0,1,74.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
3,1.0,1,70.0,0,1,1,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
4,1.0,1,87.0,0,0,0,0,0,109,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110520,1.0,0,54.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110521,1.0,0,43.0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
110522,1.0,0,27.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110523,1.0,0,30.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [22]:
X = X.drop(columns=["TotalConditions"])

In [23]:
# Step 4: Create the logistic regression model
logit_model = sm.Logit(y, X)
logit_model

<statsmodels.discrete.discrete_model.Logit at 0x158e62350>

In [24]:
# Step 5: Fit the model
result = logit_model.fit()
result

Optimization terminated successfully.
         Current function value: 0.442143
         Iterations 7


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x158e48dd0>

In [25]:
# Print the summary
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                 Noshow   No. Observations:               110525
Model:                          Logit   Df Residuals:                   110503
Method:                           MLE   Df Model:                           21
Date:                Mon, 19 Aug 2024   Pseudo R-squ.:                  0.1211
Time:                        19:30:04   Log-Likelihood:                -48868.
converged:                       True   LL-Null:                       -55601.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                      coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              -0.1782      0.047     -3.754      0.000      -0.271      -0.085
SMSreceived                        -0.2501      0.019    -13.438    

After removing the 'TotalConditions' variable, the logistic regression results have been updated. Here are some observations based on the updated results:

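Multicollinearity: The issue of multicollinearity seems to have been resolved. The coefficients for the remaining variables no longer show NaN values in the "std err," "z," "P>|z|," and confidence interval columns. The log-likelihood (-48868) and function value (0.442143) are unchanged while Df Model dropped from 22 to 21, which is what one would expect if 'TotalConditions' carried no information beyond the columns that remain.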

Pseudo R-squared: The Pseudo R-squared value remains approximately 0.1211, which indicates the goodness of fit for the model. Remember that this pseudo R-squared value is specific to logistic regression and should be considered in the context of logistic models.

Individual Variable Significance: Examine the coefficients and associated p-values for each remaining feature. Features such as 'SMSreceived,' 'Age,' 'Scholarship,' 'prevNoshow,' and 'WaitingTimeCategory' continue to appear statistically significant with low p-values.

Non-Significant Variables: Some variables, such as 'Hipertension,' 'WeekDay_Saturday,' 'WeekDay_Monday,' 'WeekDay_Tuesday,' and 'AgeCategory_Senior,' do not appear to be statistically significant (P>|z| above 0.05). You may consider whether these variables are necessary for your model or if they can be removed to simplify it. Note that the dummies of one categorical variable are best kept or dropped as a group, since each is only a contrast with the omitted baseline level.

Variable Interpretation: Interpret the coefficients carefully. A positive coefficient suggests an increase in the log-odds of not showing up for each unit increase in the corresponding independent variable (e.g., 'Diabetes,' 'Alcoholism,' 'WeekDay_Saturday'). Conversely, a negative coefficient suggests a decrease in the log-odds of not showing up.

WeekDay Variables: 'WeekDay_Saturday,' 'WeekDay_Monday,' and 'WeekDay_Tuesday' still have p-values above 0.05, indicating non-significance. Consider whether these variables provide valuable information for your analysis or if they can be removed.

AgeCategory_Teenager: 'AgeCategory_Teenager' remains statistically significant (p-value < 0.05), indicating that it contributes to the model.

Interactions and Polynomial Terms: Depending on your research goals, you may explore interactions between features or include polynomial terms if you suspect non-linear relationships.

Overall, the removal of 'TotalConditions' seems to have addressed the issue of multicollinearity, and the model appears to be more stable. Continue to evaluate the statistical significance and practical significance of each remaining variable based on your research objectives. Additionally, consider model evaluation techniques such as ROC curves, confusion matrices, and model performance metrics to assess the model's predictive power.

### Exploring and Interpreting the Results and Outlook

The overall fit of the logistic regression model can be assessed based on various factors, including the pseudo R-squared value, model performance metrics, and domain-specific considerations. In this case, the pseudo R-squared value is approximately 0.1211, which provides an indication of the model's goodness of fit. However, interpreting the strength of the fit requires some context:

Pseudo R-squared Interpretation: McFadden's pseudo R-squared is not a proportion of variance explained, and its values are typically much lower than R-squared values in linear regression. A value of 0.1211 means the fitted model's log-likelihood is about 12% closer to zero than that of the intercept-only model, a clear but modest improvement from the selected features.

Model Performance Metrics: Beyond the pseudo R-squared value, it's essential to evaluate model performance using appropriate metrics such as accuracy, precision, recall, F1-score, and the ROC curve. These metrics provide a more comprehensive assessment of how well the model predicts no-shows.
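These metrics can be computed directly from predicted probabilities. The sketch below uses ten made-up labels and probabilities and a 0.5 threshold, all of which are assumptions; in the notebook the probabilities could come from `result.predict(X)`.

```python
import numpy as np

# Made-up true labels and predicted no-show probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
p_hat = np.array([0.1, 0.2, 0.6, 0.3, 0.7, 0.4, 0.2, 0.8, 0.1, 0.3])

# Classify as no-show when the probability reaches the threshold.
y_pred = (p_hat >= 0.5).astype(int)

# Confusion-matrix cells.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)  # 2 1 1 6
print(accuracy, precision, recall, f1)
```

With an imbalanced target, accuracy alone can look good even when few no-shows are caught, which is why precision and recall are reported alongside it.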

Domain Knowledge: Consider the context and domain-specific insights. While statistical measures provide valuable information, domain experts may have a better understanding of what constitutes a strong or weak model fit based on the practical implications of the results.

To further improve the model fit and predictive performance, you can consider the following steps:

Feature Engineering: Explore additional features or transformations of existing features that may capture more relevant information. This can involve creating interaction terms, polynomial features, or deriving new variables from domain knowledge.

Feature Selection: Assess the importance of each feature and consider removing non-significant or redundant variables. This can simplify the model and reduce noise.

Regularization: Apply regularization techniques such as L1 (Lasso) or L2 (Ridge) regularization to penalize certain feature coefficients and prevent overfitting.

Model Evaluation: Evaluate the model on data it was not fitted to. A single train-test split gives one estimate of generalization performance; k-fold cross-validation repeats the split k times, so that every observation is used for testing exactly once, and averages the results.

Address Class Imbalance: If there is a significant class imbalance (e.g., far fewer no-shows than shows), consider techniques like oversampling, undersampling, class weights, or evaluation metrics that account for imbalance.

Collect More Data: Gathering more data, if feasible, can improve the model's performance, especially if the dataset is limited in size.

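Interaction Terms: Explore potential interactions between features, since the effect of one feature may depend on the value of another (for example, an SMS reminder might matter more for long waits than for short ones).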

Outlier Detection and Handling: Identify and handle outliers that may be influencing the model's fit.

Model Selection: Consider alternative machine learning algorithms beyond logistic regression, such as decision trees, random forests, or gradient boosting, to see if they provide a better fit to the data.

Model Interpretability: Depending on the importance of interpretability, you can explore interpretable machine learning techniques or use model-agnostic methods to understand how the model makes predictions.

Improving the model fit is an iterative process that involves a combination of feature engineering, model selection, and evaluation. It's crucial to balance model complexity with interpretability and performance, depending on the specific goals of your analysis. Consulting with domain experts and continuously monitoring the model's performance is essential for refining the model further.

### Comparison with Bivariate Analysis

Let's compare the regression coefficient for "SMSreceived" with its correlation with the target variable "Noshow."

Regression Coefficient for "SMSreceived":
In your logistic regression results, the coefficient for "SMSreceived" is approximately -0.2501.

Correlation between "SMSreceived" and "Noshow":
In the correlation matrix, the correlation between "SMSreceived" and "Noshow" is approximately 0.1264.

Comparison:
The regression coefficient for "SMSreceived" is -0.2501, indicating that, holding the other features in the model constant, receiving an SMS reminder is associated with a decrease in the log-odds of a no-show. In other words, among otherwise comparable appointments, patients who receive SMS reminders are less likely to miss them.
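A log-odds coefficient is easier to read as an odds ratio, obtained by exponentiating it. The short calculation below uses the rounded coefficient from the summary; the variable names are illustrative.

```python
import math

# Rounded SMSreceived coefficient from the regression summary.
coef_sms = -0.2501

# Odds ratio: multiplicative change in the odds of a no-show.
odds_ratio = math.exp(coef_sms)
percent_change_in_odds = (odds_ratio - 1) * 100

print(round(odds_ratio, 3))              # 0.779
print(round(percent_change_in_odds, 1))  # -22.1
```

So, with the other features held fixed, the model associates an SMS reminder with roughly 22% lower odds of a no-show. This is a change in odds, not in probability.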

The correlation coefficient between "SMSreceived" and "Noshow" is approximately 0.1264, indicating a weak positive association between receiving an SMS reminder and a no-show. Taken on its own, without accounting for any other feature, receiving an SMS reminder is associated with a slightly higher probability of a no-show.

In this case, the regression coefficient and correlation coefficient show opposite directions of the relationship between "SMSreceived" and "Noshow." The regression coefficient, when considered within the context of the logistic regression model, suggests that SMS reminders have a protective effect against no-shows, while the correlation indicates a positive linear relationship.

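This discrepancy is a reminder that a logistic regression coefficient is a conditional effect: it describes the association with the outcome while the other variables in the model are held fixed. A correlation is a marginal measure that ignores every other covariate, so it can absorb the influence of variables related to both "SMSreceived" and "Noshow". A plausible explanation here, which this notebook does not verify, is the waiting time: in the rows displayed above, same-day appointments have SMSreceived = 0 and long-wait appointments have SMSreceived = 1, so if longer waits are themselves associated with more no-shows, the raw correlation would mix the two effects. The choice of which measure to rely on should depend on the research question and the model's assumptions.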