1. Feature Importance Analysis: Which factors are most predictive of readmission for diabetic patients?
    - This question uses predictive modeling to identify the variables that significantly influence the likelihood of a patient being readmitted after discharge.
2. Comparative Analysis: How do readmission rates differ across demographic groups (age, gender, race)?
    - This explores whether readmission rates differ across patient demographics, which could guide targeted interventions.
3. Treatment and Testing Variables: What role do the number of lab tests performed during the stay, changes in medication, and the results of HbA1c tests play in predicting readmissions?
4. Healthcare Utilization and Outcomes: How do the number of emergency visits, inpatient stays, and outpatient visits in the year prior to the hospitalization correlate with readmission rates?

In [8]:
from datasets import load_dataset
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import statsmodels.api as sm
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

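# load_dataset returns a DatasetDict; a single CSV file is placed under the default "train" split
train = load_dataset("csv", data_files="dataset/train.csv")
test = load_dataset("csv", data_files="dataset/test.csv")
train_df = train["train"].to_pandas()
test_df = test["train"].to_pandas()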

# **Feature Importance Analysis**
Which factors are most predictive of readmission for diabetic patients?
    - This question uses predictive modeling to identify the variables that significantly influence the likelihood of a patient being readmitted after discharge.

In [13]:
# Plot the 10 features with the greatest importance.
# This cell relies on `feature_importances` and `X` from the random forest cell in the
# "Treatment and Testing Variables" section, so that cell must be run first.
features = X.columns

# Indices of the 10 most important features (ascending, so the largest bar is drawn on top)
top_10_indices = np.argsort(feature_importances)[-10:]

plt.figure(figsize=(10, 8))
plt.title('Top 10 Feature Importances')
plt.barh(range(len(top_10_indices)), feature_importances[top_10_indices], color='b', align='center')
plt.yticks(range(len(top_10_indices)), [features[i] for i in top_10_indices])
plt.xlabel('Relative Importance')
plt.show()

# **Comparative Analysis**
How do readmission rates differ across demographic groups (age, gender, race)?
    - This explores whether readmission rates differ across patient demographics, which could guide targeted interventions.

In [16]:
# Display the first few rows, the first 10 columns, and the last column of the training dataset
train_df.iloc[:, np.r_[0:10, -1]].head()

# Display the first few rows, the first 10 columns, and the last column of the testing dataset
test_df.iloc[:, np.r_[0:10, -1]].head()

Getting a sum from each dataset for the race demographic:

In [54]:
# Sum of 'readmitted' values where 'race:AfricanAmerican' equals 1 for both training and testing datasets
train_readmittedAA_sum = train_df[train_df['race:AfricanAmerican'] == 1]['readmitted'].sum()
test_readmittedAA_sum = test_df[test_df['race:AfricanAmerican'] == 1]['readmitted'].sum()

train_readmittedAA_sum, test_readmittedAA_sum

In [55]:
# Sum of 'readmitted' values where 'race:Asian' equals 1 for both training and testing datasets
train_readmittedA_sum = train_df[train_df['race:Asian'] == 1]['readmitted'].sum()
test_readmittedA_sum = test_df[test_df['race:Asian'] == 1]['readmitted'].sum()

train_readmittedA_sum, test_readmittedA_sum

In [56]:
# Sum of 'readmitted' values where 'race:Caucasian' equals 1 for both training and testing datasets
train_readmittedC_sum = train_df[train_df['race:Caucasian'] == 1]['readmitted'].sum()
test_readmittedC_sum = test_df[test_df['race:Caucasian'] == 1]['readmitted'].sum()

train_readmittedC_sum, test_readmittedC_sum

In [57]:
# Sum of 'readmitted' values where 'race:Hispanic' equals 1 for both training and testing datasets
train_readmittedH_sum = train_df[train_df['race:Hispanic'] == 1]['readmitted'].sum()
test_readmittedH_sum = test_df[test_df['race:Hispanic'] == 1]['readmitted'].sum()

train_readmittedH_sum, test_readmittedH_sum

In [58]:
# Sum of 'readmitted' values where 'race:Other' equals 1 for both training and testing datasets
train_readmittedO_sum = train_df[train_df['race:Other'] == 1]['readmitted'].sum()
test_readmittedO_sum = test_df[test_df['race:Other'] == 1]['readmitted'].sum()

train_readmittedO_sum, test_readmittedO_sum

In [59]:
import matplotlib.pyplot as plt

# Total readmissions (train + test) for each race category
sums = [train_readmittedA_sum + test_readmittedA_sum,
        train_readmittedC_sum + test_readmittedC_sum,
        train_readmittedH_sum + test_readmittedH_sum,
        train_readmittedO_sum + test_readmittedO_sum,
        train_readmittedAA_sum + test_readmittedAA_sum]

# Labels for each race category, in the same order as `sums`
labels = ['Asian', 'Caucasian', 'Hispanic', 'Other', 'African American']

# Create pie chart
plt.figure(figsize=(10, 7))
plt.pie(sums, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Race Demographics of Readmitted Patients')
plt.show()

Getting a sum from each dataset for the gender demographic:

In [9]:
# Sum of 'readmitted' values where 'gender:Female' equals 1 for both training and testing datasets
train_readmittedF_sum = train_df[train_df['gender:Female'] == 1]['readmitted'].sum()
test_readmittedF_sum = test_df[test_df['gender:Female'] == 1]['readmitted'].sum()

train_readmittedF_sum, test_readmittedF_sum

In [10]:
# Sum of 'readmitted' values where 'gender:Male' equals 1 for both training and testing datasets
train_readmittedM_sum = train_df[train_df['gender:Male'] == 1]['readmitted'].sum()
test_readmittedM_sum = test_df[test_df['gender:Male'] == 1]['readmitted'].sum()

train_readmittedM_sum, test_readmittedM_sum

In [11]:
import matplotlib.pyplot as plt

# Sum of readmitted values for both genders in both datasets
total_readmittedF = train_readmittedF_sum + test_readmittedF_sum
total_readmittedM = train_readmittedM_sum + test_readmittedM_sum

# Labels for the sections of our pie chart
labels = 'Female', 'Male'

# The values for each section of the pie chart
sizes = [total_readmittedF, total_readmittedM]

# The colors for each section of the pie chart
colors = ['#ff9999','#66b3ff']

# Plotting the pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Gender Demographics of Readmitted Patients')
plt.show()

The percentages for the two genders do not differ enough for gender to be considered a significant factor.
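
Whether a gender difference is statistically significant can be checked with a chi-square test of independence rather than judged from the pie chart alone. The following is a minimal sketch, assuming `scipy` is installed; the `toy` DataFrame is hypothetical data that only mimics the `gender:Female` and `readmitted` columns, and in the notebook it would be replaced by `train_df`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data; in the notebook this would be train_df
toy = pd.DataFrame({
    "gender:Female": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "readmitted":    [1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
})

# Contingency table of gender (rows) against readmission (columns)
table = pd.crosstab(toy["gender:Female"], toy["readmitted"])

# Chi-square test of independence between gender and readmission
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A large p-value would be consistent with the reading above that gender is not a significant factor. With samples as small as this toy table the test is not reliable; it is only meant to show the mechanics.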

Getting a sum from each dataset for the age demographic:

In [12]:
# Sum of 'readmitted' values where 'age:70+' equals 1 for both training and testing datasets
train_readmitted70plus_sum = train_df[train_df['age:70+'] == 1]['readmitted'].sum()
test_readmitted70plus_sum = test_df[test_df['age:70+'] == 1]['readmitted'].sum()

train_readmitted70plus_sum, test_readmitted70plus_sum

In [13]:
# Sum of 'readmitted' values where 'age:[0-10)' equals 1 for both training and testing datasets
train_readmittedLess10_sum = train_df[train_df['age:[0-10)'] == 1]['readmitted'].sum()
test_readmittedLess10_sum = test_df[test_df['age:[0-10)'] == 1]['readmitted'].sum()

train_readmittedLess10_sum, test_readmittedLess10_sum

In [14]:
# Sum of 'readmitted' values where 'age:[10-20)' equals 1 for both training and testing datasets
train_readmitted10to19_sum = train_df[train_df['age:[10-20)'] == 1]['readmitted'].sum()
test_readmitted10to19_sum = test_df[test_df['age:[10-20)'] == 1]['readmitted'].sum()

train_readmitted10to19_sum, test_readmitted10to19_sum

In [15]:
# Sum of 'readmitted' values where 'age:[20-50)' equals 1 for both training and testing datasets
train_readmitted20to49_sum = train_df[train_df['age:[20-50)'] == 1]['readmitted'].sum()
test_readmitted20to49_sum = test_df[test_df['age:[20-50)'] == 1]['readmitted'].sum()

train_readmitted20to49_sum, test_readmitted20to49_sum

In [16]:
# Sum of 'readmitted' values where 'age:[50-70)' equals 1 for both training and testing datasets
train_readmitted50to69_sum = train_df[train_df['age:[50-70)'] == 1]['readmitted'].sum()
test_readmitted50to69_sum = test_df[test_df['age:[50-70)'] == 1]['readmitted'].sum()

train_readmitted50to69_sum, test_readmitted50to69_sum

In [17]:
import matplotlib.pyplot as plt

# Total readmissions (train + test) per age group.
# The 0-10 and 10-20 groups are combined because their percentages do not differ significantly.
age_demographics_sums = [
    train_readmittedLess10_sum + test_readmittedLess10_sum + train_readmitted10to19_sum + test_readmitted10to19_sum,
    train_readmitted20to49_sum + test_readmitted20to49_sum,
    train_readmitted50to69_sum + test_readmitted50to69_sum,
    train_readmitted70plus_sum + test_readmitted70plus_sum
]

age_labels = ['0-20', '20-50', '50-70', '70+']

plt.figure(figsize=(10, 7))
plt.pie(age_demographics_sums, labels=age_labels, autopct='%1.1f%%')
plt.title('Age Demographics of Readmitted Patients')
plt.show()

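From this chart, we can see that patients aged 50 and older account for most readmissions. Because the chart shows counts rather than rates, part of this may reflect how many patients fall into each age group.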
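
A pie chart of summed readmissions shows each group's share of all readmissions, which depends on group size as well as on how often its members are readmitted. A minimal sketch of computing a per-group rate from one-hot columns follows. The `toy` DataFrame and the helper `readmission_summary` are hypothetical and not part of the original notebook; the column names follow the ones used above.

```python
import pandas as pd

# Hypothetical toy data with one-hot age columns named as in the notebook
toy = pd.DataFrame({
    "age:[20-50)": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "age:[50-70)": [0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
    "age:70+":     [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "readmitted":  [0, 1, 1, 1, 0, 0, 0, 0, 1, 0],
})


def readmission_summary(df, group_columns, target="readmitted"):
    """Return patient count, readmission count, and rate for each one-hot group."""
    rows = []
    for column in group_columns:
        # Target values for the patients that belong to this group
        members = df.loc[df[column] == 1, target]
        rows.append({
            "group": column,
            "patients": len(members),
            "readmitted": int(members.sum()),
            "rate": members.mean(),
        })
    return pd.DataFrame(rows).set_index("group")


summary = readmission_summary(toy, ["age:[20-50)", "age:[50-70)", "age:70+"])
print(summary)
```

In this toy data the 50-70 group contributes the most readmissions (2 of 4) yet has the lowest rate (2 of 6 patients), which is the kind of difference a count-based pie chart cannot show.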

# **Treatment and Testing Variables**
What role do the number of lab tests performed during the stay, changes in medication, and the results of HbA1c tests play in predicting readmissions?

In [26]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

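# Prepare the dataset for modeling
X = train_df.drop('readmitted', axis=1)  # Features
# Target variable: 'readmitted' is summed as a number elsewhere in this notebook, so it is
# used as 0/1 here; 'Yes'/'No' strings are mapped to 1/0 only if they are present
y = train_df['readmitted']
if y.dtype == object:
    y = (y == 'Yes').astype(int)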

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Feature importance
feature_importances = rf_classifier.feature_importances_

# Create a dataframe to display feature importance
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

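# Output the model evaluation and feature importance
# (printing the report renders its line breaks instead of showing a raw string)
print("Accuracy:", accuracy)
print(report)
feature_importance_df.head(10)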

## **Number of lab tests performed during the stay**

## **Changes in medication**

## **HbA1c test results**

# **Healthcare Utilization and Outcomes**
How do the number of emergency visits, inpatient stays, and outpatient visits in the year prior to the hospitalization correlate with readmission rates?

In [45]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Create box and whisker plots to visualize the relationships
plt.figure(figsize=(18, 6))

plt.subplot(1, 3, 1)
sns.boxplot(x='readmitted', y='number_emergency', data=train_df, palette="muted")
plt.title('Emergency Visits vs. Readmission')

plt.subplot(1, 3, 2)
sns.boxplot(x='readmitted', y='number_outpatient', data=train_df, palette="muted")
plt.title('Outpatient Visits vs. Readmission')

plt.subplot(1, 3, 3)
sns.boxplot(x='readmitted', y='number_inpatient', data=train_df, palette="muted")
plt.title('Inpatient Visits vs. Readmission')

plt.show()

Since the vast majority of patients had no emergency, outpatient, or inpatient visits, let's remove the zero counts from the plots.

In [69]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the aesthetic style of the plots
sns.set_style("whitegrid")

visit_columns = {
    'number_emergency': 'Emergency Visits',
    'number_outpatient': 'Outpatient Visits',
    'number_inpatient': 'Inpatient Visits',
}

# Top row: box plots with outliers; bottom row: the same plots with outliers hidden
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

for i, (column, label) in enumerate(visit_columns.items()):
    # Filter each visit type separately. Chaining the three filters would keep only
    # patients with at least one visit of every type.
    nonzero = train_df.loc[train_df[column] > 0, [column, 'readmitted']]

    sns.boxplot(x='readmitted', y=column, data=nonzero, palette="muted", ax=axes[0, i])
    axes[0, i].set_title(f'{label} vs. Readmission')

    sns.boxplot(x='readmitted', y=column, data=nonzero, palette="muted", showfliers=False, ax=axes[1, i])
    axes[1, i].set_title(f'{label} vs. Readmission (Outliers Removed)')

plt.show()

Among patients who did have inpatient, outpatient, or emergency visits, the number of visits does not appear to correlate strongly with readmission.
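
The visual impression from the box plots can be backed by a number. A minimal sketch, assuming `readmitted` is coded as 0/1: the Pearson correlation between a count and a binary variable (the point-biserial correlation) gives one coefficient per visit type. The `toy` DataFrame is hypothetical; in the notebook it would be replaced by `train_df`.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be train_df
toy = pd.DataFrame({
    "number_emergency":  [0, 1, 0, 2, 0, 3],
    "number_outpatient": [1, 0, 2, 0, 1, 0],
    "number_inpatient":  [0, 0, 1, 2, 3, 4],
    "readmitted":        [0, 0, 0, 1, 1, 1],
})

visit_columns = ["number_emergency", "number_outpatient", "number_inpatient"]

# Pearson correlation of each visit count with the binary readmission flag
correlations = toy[visit_columns].corrwith(toy["readmitted"])
print(correlations)
```

Coefficients near zero would support the reading above, while values closer to 1 or -1 would suggest a relationship worth a closer look.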

In [50]:
train_df['number_outpatient'].describe()
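
The earlier observation that most patients had no prior visits can be quantified directly as the share of zero counts in each column. This is a minimal sketch on a hypothetical `toy` DataFrame that reuses the notebook's column names; in the notebook it would be replaced by `train_df`.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be train_df
toy = pd.DataFrame({
    "number_emergency":  [0, 0, 0, 1, 0],
    "number_outpatient": [0, 2, 0, 0, 0],
    "number_inpatient":  [0, 0, 3, 1, 0],
})

visit_columns = ["number_emergency", "number_outpatient", "number_inpatient"]

# Fraction of patients with zero visits of each type
zero_share = (toy[visit_columns] == 0).mean()
print(zero_share)
```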

In [48]:
import plotly.graph_objects as go

# Aggregate data for visualization
# (`data_for_visualization` is not defined in the cells shown here and must already exist)
agg_lab = data_for_visualization.groupby(['num_lab_procedures', 'readmitted']).size().reset_index(name='count')
agg_med = data_for_visualization.groupby(['num_medications', 'readmitted']).size().reset_index(name='count')

# Create bar plots to visualize the relationships more effectively
# Lab Procedures vs. Readmission
fig_lab = go.Figure()
for readmitted_status in agg_lab['readmitted'].unique():
    filtered_data = agg_lab[agg_lab['readmitted'] == readmitted_status]
    fig_lab.add_trace(go.Bar(x=filtered_data['num_lab_procedures'], y=filtered_data['count'], name=f'Readmitted: {readmitted_status}'))

fig_lab.update_layout(title='Lab Procedures vs. Readmission', xaxis_title='Number of Lab Procedures', yaxis_title='Count', barmode='group')

# Medications vs. Readmission
fig_med = go.Figure()
for readmitted_status in agg_med['readmitted'].unique():
    filtered_data = agg_med[agg_med['readmitted'] == readmitted_status]
    fig_med.add_trace(go.Bar(x=filtered_data['num_medications'], y=filtered_data['count'], name=f'Readmitted: {readmitted_status}'))

fig_med.update_layout(title='Medications vs. Readmission', xaxis_title='Number of Medications', yaxis_title='Count', barmode='group')

# For the HbA1c results, we can use a bar plot to show the count of readmitted vs. not-readmitted for each A1Cresult_combined category
agg_hba1c = data_for_visualization.groupby(['A1Cresult_combined', 'readmitted']).size().reset_index(name='count')
fig_hba1c = go.Figure()
for readmitted_status in agg_hba1c['readmitted'].unique():
    filtered_data = agg_hba1c[agg_hba1c['readmitted'] == readmitted_status]
    fig_hba1c.add_trace(go.Bar(x=filtered_data['A1Cresult_combined'], y=filtered_data['count'], name=f'Readmitted: {readmitted_status}'))

fig_hba1c.update_layout(title='HbA1c Test Results vs. Readmission', xaxis_title='A1C Result Combined', yaxis_title='Count', barmode='group')

fig_lab.show()
fig_med.show()
fig_hba1c.show()

In [70]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# Preparing the data
X = data_for_visualization.drop('readmitted', axis=1)
# Converting categorical data into dummy/indicator variables
X = pd.get_dummies(X, drop_first=True)
y = data_for_visualization['readmitted']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a linear regression model
model = LinearRegression()

# Fitting the model
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Getting the feature importance (coefficients in case of Linear Regression)
feature_importance = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])

# Displaying the model
print("Linear Regression Model:")
print("Intercept:", model.intercept_)
print("Coefficients:")
print(feature_importance)

# Displaying the evaluation metrics
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)