### Linear Regression model

In [1]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Step 1: Load data and compute Length of Stay
df = pd.read_csv('healthcare_dataset.csv')
df['Date of Admission'] = pd.to_datetime(df['Date of Admission'])
df['Discharge Date'] = pd.to_datetime(df['Discharge Date'])
df['Length_of_Stay'] = (df['Discharge Date'] - df['Date of Admission']).dt.days

# Step 2: Select features and target
features = [
    'Age', 'Gender', 'Admission Type', 'Medical Condition',
    'Medication', 'Room Number'
]
target = 'Length_of_Stay'

X = df[features]
y = df[target]

# Step 3: Preprocessing pipeline
categorical_cols = ['Gender', 'Admission Type', 'Medical Condition', 'Medication']
numerical_cols = ['Age', 'Room Number']  # note: 'Room Number' is an identifier, but is treated as numeric here

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', SimpleImputer(strategy='median'), numerical_cols)
])

# Step 4: Create the pipeline with Linear Regression
model = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('regressor', LinearRegression())
])

# Step 5: Train-test split and model fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = model.predict(X_test)

print("Mean Absolute Error (MAE):", mean_absolute_error(y_test, y_pred))
print("Root Mean Squared Error (RMSE):", np.sqrt(mean_squared_error(y_test, y_pred)))


Mean Absolute Error (MAE): 7.482258558325104
Root Mean Squared Error (RMSE): 8.631175739030652


### Model Performance Summary
MAE: 7.48 days. On average, a prediction misses the true length of stay by this much.

RMSE: 8.63 days. Errors are squared before averaging, so large misses weigh more heavily than they do in MAE.
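
For reference, the two metrics can be written out as follows. The symbols are introduced here for illustration and do not appear in the notebook: $y_i$ is the actual length of stay for test row $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of test rows.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

RMSE is always at least as large as MAE, and the two are equal only when every error has the same magnitude.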

📉 Interpretation
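The model's predictions miss by roughly 7.5 to 8.6 days on average. Whether that is acceptable depends on how long and how variable hospital stays are: the error only has meaning relative to a baseline such as always predicting the mean stay, and an RMSE close to the standard deviation of Length_of_Stay would mean the features add little.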

RMSE is only about 15% larger than MAE (8.63 / 7.48 ≈ 1.15), which suggests the errors are fairly uniform in size and not dominated by a few very large misses.
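
One way to put these numbers in context is to compare them with a baseline that ignores the features entirely. The sketch below is a minimal illustration, not a result from this notebook: it uses synthetic stays drawn uniformly from 1 to 30 days (an assumed range, chosen only for the example) and scores scikit-learn's `DummyRegressor`, which always predicts the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Length_of_Stay: integer stays of 1 to 30 days.
# The range is an assumption for illustration, not taken from the dataset.
rng = np.random.default_rng(42)
y = rng.integers(1, 31, size=5000)
X = np.zeros((len(y), 1))  # DummyRegressor ignores the features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean stay seen in training
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_base = baseline.predict(X_test)

baseline_mae = mean_absolute_error(y_test, y_base)
baseline_rmse = np.sqrt(mean_squared_error(y_test, y_base))
ratio = baseline_rmse / baseline_mae

print(f"Baseline MAE:  {baseline_mae:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"RMSE / MAE:    {ratio:.2f}")
```

On the real data, the same check would use the notebook's own `X_train`, `X_test`, `y_train`, and `y_test`. If the fitted model's MAE and RMSE land close to the baseline's, the features are probably contributing very little.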

⚠️ Possible Issues
Limited features: Key clinical or operational data may be missing.

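Room Number is an identifier, not a quantity, yet the pipeline feeds it to the model as a numeric feature. Medication is one-hot encoded by raw name, which may carry little signal.

Linear Regression fits only additive, linear effects, so it might not capture nonlinear patterns or feature interactions in the data.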
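
As a rough sketch of how the preprocessing concerns above could be addressed, the variant below leaves Room Number out of the feature set and adds explicit imputation for the categorical columns. The step names (`impute`, `encode`, `scale`), the `StandardScaler` step, and the tiny `demo` frame with invented values are all assumptions made for illustration. This is one possible arrangement, not the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ['Gender', 'Admission Type', 'Medical Condition', 'Medication']
numerical_cols = ['Age']  # 'Room Number' is omitted: it identifies a room, it does not measure anything

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# Numerical columns: fill gaps with the median, then standardize
numerical_pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_pipe, categorical_cols),
        ('num', numerical_pipe, numerical_cols),
    ],
    remainder='drop',  # any column not listed above (e.g. 'Room Number') is dropped
)

model = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('regressor', LinearRegression()),
])

# Tiny made-up frame, only to show that the pipeline runs end to end
demo = pd.DataFrame({
    'Age': [34, 71, np.nan, 52, 45, 63],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Admission Type': ['Urgent', 'Emergency', 'Elective', 'Urgent', 'Elective', 'Emergency'],
    'Medical Condition': ['Diabetes', 'Asthma', 'Cancer', 'Asthma', 'Diabetes', 'Cancer'],
    'Medication': ['Aspirin', 'Ibuprofen', 'Aspirin', 'Ibuprofen', 'Aspirin', 'Aspirin'],
    'Room Number': [101, 245, 310, 118, 402, 233],
})
demo_stay = np.array([3, 12, 7, 5, 9, 14])

n_features = preprocessor.fit_transform(demo).shape[1]
model.fit(demo, demo_stay)
demo_pred = model.predict(demo)
print("Encoded feature count:", n_features)
```

Scaling does not change ordinary least squares predictions, but it makes the coefficients comparable and matters if the regressor is later swapped for a regularized one.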

✅ Recommendations
Add better features — diagnosis severity, comorbidities, medication classes.

Try nonlinear models such as XGBoost or LightGBM, which can learn interactions and thresholds that a linear model cannot.

Transform the target: if Length_of_Stay is right-skewed, consider fitting the model on its logarithm and converting predictions back to days.

Evaluate error by patient group — e.g., short vs. long stays.
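
The last two recommendations can be sketched together. The example below is a minimal illustration under stated assumptions: the data are synthetic (a right-skewed stay length loosely tied to age, invented for the example), the bin edges and group labels are arbitrary, and `TransformedTargetRegressor` with `np.log1p` / `np.expm1` is one way to fit on a log scale while still reporting errors in days.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed stays that grow loosely with age (illustrative only)
rng = np.random.default_rng(0)
n = 2000
age = rng.integers(18, 90, size=n).astype(float)
stay = np.round(np.exp(0.5 + 0.02 * age + rng.normal(0, 0.5, size=n))) + 1

X = age.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, stay, test_size=0.2, random_state=42
)

# Fit on log1p(stay); predictions are mapped back to days with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)

# Break the absolute error down by actual stay length
results = pd.DataFrame({'actual': y_test, 'pred': y_pred})
results['abs_error'] = (results['actual'] - results['pred']).abs()
results['stay_group'] = pd.cut(
    results['actual'],
    bins=[0, 3, 7, 14, np.inf],  # hypothetical cut points, in days
    labels=['1-3 days', '4-7 days', '8-14 days', '15+ days'],
)
mae_by_group = (
    results.groupby('stay_group', observed=True)['abs_error']
    .agg(['mean', 'count'])
)

overall_mae = mean_absolute_error(y_test, y_pred)
print(mae_by_group)
print(f"Overall MAE: {overall_mae:.2f} days")
```

A per-group table like this can show whether the overall MAE hides much larger errors on long stays, which a single summary number would not reveal.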