# Healthcare Length of Stay Prediction - Data Preprocessing

## Objective
In this notebook we prepare the data in `mah_test_healthcare_dataset_clean.csv`
so it can be used to build a Machine Learning model that predicts **Length_of_Stay** (the length of the hospital stay).

### Preprocessing steps:
1. Load and explore the data.
2. Clean the data (NaNs, duplicated rows).
3. Convert the date columns (admission and discharge dates).
4. Choose the target variable and the explanatory variables (features).
5. Encode the categorical columns.
6. Scale the numerical columns.
7. Split the data into train / test sets and prepare it for the model.


In [1]:
!pip install scikit-learn


Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp39-cp39-win_amd64.whl.metadata (15 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.6.1-cp39-cp39-win_amd64.whl (11.2 MB)

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


## 2. Load Dataset & First Look

- Read the CSV file with `pandas.read_csv`.
- Display the first few rows (`head()`) to get a sense of the data and its columns.


In [3]:
# Load the data (change the path if needed)
file_path = "mah_test_healthcare_dataset_clean.csv"

df = pd.read_csv(file_path)

# Show the first 5 rows
df.head()


Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results,Length_of_Stay
0,Bobby Jackson,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons And Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal,2
1,Leslie Terry,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive,6
2,Danny Smith,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook Plc,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal,15
3,Andrew Watts,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers And Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal,30
4,Adrienne Bell,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal,20


## 3. Dataset Shape & Column Types

- `df.shape` gives the number of rows and columns.
- `df.info()` shows the type of each column (numeric, text, etc.) and whether it has missing values.
This helps us plan the preprocessing:
- Numeric columns → candidates for `scaling`.
- Text/categorical columns → need `encoding`.
- Date columns → need conversion to `datetime` and special handling.


In [4]:
# Size of the data (number of rows and columns)
print("Shape:", df.shape)

# Column types and non-null counts
df.info()


Shape: (54966, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54966 entries, 0 to 54965
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                54966 non-null  object 
 1   Age                 54966 non-null  int64  
 2   Gender              54966 non-null  object 
 3   Blood Type          54966 non-null  object 
 4   Medical Condition   54966 non-null  object 
 5   Date of Admission   54966 non-null  object 
 6   Doctor              54966 non-null  object 
 7   Hospital            54966 non-null  object 
 8   Insurance Provider  54966 non-null  object 
 9   Billing Amount      54966 non-null  float64
 10  Room Number         54966 non-null  int64  
 11  Admission Type      54966 non-null  object 
 12  Discharge Date      54966 non-null  object 
 13  Medication          54966 non-null  object 
 14  Test Results        54966 non-null  object 
 15  Length_of_Stay      54966 non-null

## 4. Descriptive Statistics & Missing Values

- `df.describe()` to review basic statistics for the numeric columns (mean, min, max, ...).
- `df.describe(include="object")` to review the text/categorical columns.
- `df.isna().sum()` to find which columns have missing values and how many.
Based on that we decide:
- Fill the missing values (imputation)?
- Or drop specific rows/columns?


In [5]:
# Summary of the numeric columns
display(df.describe())

# Summary of the categorical columns
display(df.describe(include="object"))

# Number of missing values in each column
df.isna().sum()


Unnamed: 0,Age,Billing Amount,Room Number,Length_of_Stay
count,54966.0,54966.0,54966.0,54966.0
mean,51.535185,25544.306284,301.124404,15.49929
std,19.605661,14208.409711,115.223143,8.661471
min,13.0,-2008.49214,101.0,1.0
25%,35.0,13243.718641,202.0,8.0
50%,52.0,25542.749145,302.0,15.0
75%,68.0,37819.858159,401.0,23.0
max,89.0,52764.276736,500.0,30.0


Unnamed: 0,Name,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Admission Type,Discharge Date,Medication,Test Results
count,54966,54966,54966,54966,54966,54966,54966,54966,54966,54966,54966,54966
unique,40235,2,8,6,1827,40341,39876,5,3,1856,5,3
top,Michael Williams,Male,A-,Arthritis,2024-03-16,Michael Smith,Llc Smith,Cigna,Elective,2020-03-15,Lipitor,Abnormal
freq,24,27496,6898,9218,50,27,44,11139,18473,53,11038,18437


Name                  0
Age                   0
Gender                0
Blood Type            0
Medical Condition     0
Date of Admission     0
Doctor                0
Hospital              0
Insurance Provider    0
Billing Amount        0
Room Number           0
Admission Type        0
Discharge Date        0
Medication            0
Test Results          0
Length_of_Stay        0
dtype: int64
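
The counts above are all zero, so neither imputation nor dropping is needed for this file. For a dataset that did have gaps, the two options could look roughly like the sketch below. The toy frame and its values are hypothetical, and median/mode filling is only one common choice, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (not taken from the real dataset)
toy = pd.DataFrame({
    "Age": [30, np.nan, 76, 28],
    "Admission Type": ["Urgent", "Emergency", None, "Urgent"],
})

# Option 1: impute (median for numeric, most frequent value for categorical)
imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
imputed["Admission Type"] = imputed["Admission Type"].fillna(
    imputed["Admission Type"].mode()[0]
)

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

print("Missing after imputation:", imputed.isna().sum().sum())
print("Rows left after dropna:", len(dropped))
```

Imputation keeps every row at the cost of inventing values, while dropping keeps only observed values at the cost of losing rows; which one fits depends on how much data is missing.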

## 5. Basic Cleaning

### 5.1 Remove duplicated rows
If there are duplicated rows, we remove them with `drop_duplicates()` so the same information is not repeated during training.

### 5.2 Drop columns that are not useful
We drop identifier-like columns, or columns of little use to the model, such as:
- `Name`
- `Doctor`
- `Hospital`
- `Room Number`

The goal is to reduce noise and focus on the variables that have a plausible relationship with length of stay.
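
The `describe(include="object")` output above shows roughly 40,000 distinct values each for `Name`, `Doctor`, and `Hospital` out of 54,966 rows, so one-hot encoding them would add tens of thousands of very sparse columns. A quick way to spot such columns is to compare the number of distinct values with the number of rows. The sketch below uses a hypothetical toy frame and an arbitrary 0.5 threshold; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `df`
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Male"],
})

# Share of distinct values per column: close to 1.0 means almost every row is unique
cardinality_ratio = toy.nunique() / len(toy)

# Arbitrary cut-off chosen for this illustration
high_cardinality = cardinality_ratio[cardinality_ratio > 0.5].index.tolist()
print(high_cardinality)
```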


In [6]:
# Remove duplicated rows (if any)
before = df.shape[0]
df = df.drop_duplicates()
after = df.shape[0]
print(f"Removed {before - after} duplicated rows")

# Drop the columns we will not use in the model
cols_to_drop = ["Name", "Doctor", "Hospital", "Room Number"]
df = df.drop(columns=cols_to_drop, errors="ignore")

df.head()


Removed 0 duplicated rows


Unnamed: 0,Age,Gender,Blood Type,Medical Condition,Date of Admission,Insurance Provider,Billing Amount,Admission Type,Discharge Date,Medication,Test Results,Length_of_Stay
0,30,Male,B-,Cancer,2024-01-31,Blue Cross,18856.281306,Urgent,2024-02-02,Paracetamol,Normal,2
1,62,Male,A+,Obesity,2019-08-20,Medicare,33643.327287,Emergency,2019-08-26,Ibuprofen,Inconclusive,6
2,76,Female,A-,Obesity,2022-09-22,Aetna,27955.096079,Emergency,2022-10-07,Aspirin,Normal,15
3,28,Female,O+,Diabetes,2020-11-18,Medicare,37909.78241,Elective,2020-12-18,Ibuprofen,Abnormal,30
4,43,Female,AB+,Cancer,2022-09-19,Aetna,14238.317814,Urgent,2022-10-09,Penicillin,Abnormal,20


## 6. Date Columns Handling

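1. Convert the columns:
   - `Date of Admission`
   - `Discharge Date`
   to `datetime` type.

2. Extract new features:
   - `Admission_Year`
   - `Admission_Month`
   - `Admission_Day` (day of the week, 0 = Monday).

3. Drop the original date columns once the useful information has been extracted,
   to simplify the model and avoid leakage: in the sample rows shown earlier, `Length_of_Stay` equals the number of days between the two dates, so keeping both would let a model recover the target directly.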
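
The leakage concern can be checked directly. The sketch below rebuilds the five rows displayed by `df.head()` earlier in this notebook and recomputes the stay from the two dates; it only demonstrates the relationship on those five rows and does not claim anything about the rest of the file.

```python
import pandas as pd

# The five rows shown by df.head() earlier in the notebook
sample = pd.DataFrame({
    "Date of Admission": ["2024-01-31", "2019-08-20", "2022-09-22", "2020-11-18", "2022-09-19"],
    "Discharge Date": ["2024-02-02", "2019-08-26", "2022-10-07", "2020-12-18", "2022-10-09"],
    "Length_of_Stay": [2, 6, 15, 30, 20],
})

admitted = pd.to_datetime(sample["Date of Admission"])
discharged = pd.to_datetime(sample["Discharge Date"])

# Stay in whole days, recomputed from the two dates
recomputed = (discharged - admitted).dt.days
matches = (recomputed == sample["Length_of_Stay"]).all()
print("Length_of_Stay == Discharge - Admission on these rows:", matches)
```

Running the same comparison on the full `df` before the date columns are dropped would confirm whether the relationship holds everywhere.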


In [7]:
# Convert the date columns to datetime
df["Date of Admission"] = pd.to_datetime(df["Date of Admission"])
df["Discharge Date"] = pd.to_datetime(df["Discharge Date"])

# Extract features from the admission date
df["Admission_Year"] = df["Date of Admission"].dt.year
df["Admission_Month"] = df["Date of Admission"].dt.month
df["Admission_Day"] = df["Date of Admission"].dt.dayofweek  # 0=Monday

# Drop the original date columns to simplify the model
df = df.drop(columns=["Date of Admission", "Discharge Date"])

df.head()


Unnamed: 0,Age,Gender,Blood Type,Medical Condition,Insurance Provider,Billing Amount,Admission Type,Medication,Test Results,Length_of_Stay,Admission_Year,Admission_Month,Admission_Day
0,30,Male,B-,Cancer,Blue Cross,18856.281306,Urgent,Paracetamol,Normal,2,2024,1,2
1,62,Male,A+,Obesity,Medicare,33643.327287,Emergency,Ibuprofen,Inconclusive,6,2019,8,1
2,76,Female,A-,Obesity,Aetna,27955.096079,Emergency,Aspirin,Normal,15,2022,9,3
3,28,Female,O+,Diabetes,Medicare,37909.78241,Elective,Ibuprofen,Abnormal,30,2020,11,2
4,43,Female,AB+,Cancer,Aetna,14238.317814,Urgent,Penicillin,Abnormal,20,2022,9,0


## 7. Define Target and Features

- Target variable = `Length_of_Stay`.
- `y` holds the length-of-stay values.
- `X` holds the remaining columns, which we use as features for the model.
To use a different target (for example `Billing Amount` or `Medical Condition`),
change `target_col` in the code. A categorical target such as `Medical Condition` would turn this into a classification problem.


In [8]:
# Define the target
target_col = "Length_of_Stay"

# Make sure the column exists
assert target_col in df.columns, f"{target_col} column not found!"

# Separate the target from the rest of the columns
y = df[target_col]
X = df.drop(columns=[target_col])

X.head()



Unnamed: 0,Age,Gender,Blood Type,Medical Condition,Insurance Provider,Billing Amount,Admission Type,Medication,Test Results,Admission_Year,Admission_Month,Admission_Day
0,30,Male,B-,Cancer,Blue Cross,18856.281306,Urgent,Paracetamol,Normal,2024,1,2
1,62,Male,A+,Obesity,Medicare,33643.327287,Emergency,Ibuprofen,Inconclusive,2019,8,1
2,76,Female,A-,Obesity,Aetna,27955.096079,Emergency,Aspirin,Normal,2022,9,3
3,28,Female,O+,Diabetes,Medicare,37909.78241,Elective,Ibuprofen,Abnormal,2020,11,2
4,43,Female,AB+,Cancer,Aetna,14238.317814,Urgent,Penicillin,Abnormal,2022,9,0


## 8. Identify Numerical & Categorical Columns

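- We use `select_dtypes` to identify:
  - the numeric columns `numeric_features` → `StandardScaler` is applied to them.
  - the categorical columns `categorical_features` → `OneHotEncoder` is applied to them.

This makes it easier to build the `ColumnTransformer` and `Pipeline` in scikit-learn.

Note that the output below lists only `Age` and `Billing Amount` as numeric. The `Admission_Year`, `Admission_Month`, and `Admission_Day` columns are not selected, most likely because the `.dt` accessors return `int32` in recent pandas versions, which the `int64`/`float64` filter does not match. Columns that appear in neither list are dropped by `ColumnTransformer`, whose default is `remainder="drop"`, so the date features do not reach the model as written.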


In [9]:
# Identify the numeric and categorical columns
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)


Numeric features: ['Age', 'Billing Amount']
Categorical features: ['Gender', 'Blood Type', 'Medical Condition', 'Insurance Provider', 'Admission Type', 'Medication', 'Test Results']
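
The printed lists leave out `Admission_Year`, `Admission_Month`, and `Admission_Day`. One way to make the selection independent of integer width is `include="number"`. The sketch below uses a hypothetical toy frame, and the explicit `int32` cast is an assumption that mimics what `.dt.year` returns in recent pandas versions.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as X
toy = pd.DataFrame({
    "Age": [30, 62, 76],                               # int64
    "Billing Amount": [18856.28, 33643.33, 27955.10],  # float64
    "Gender": ["Male", "Male", "Female"],              # object
})
# Mimic a .dt.year result by casting to int32 explicitly
toy["Admission_Year"] = pd.Series([2024, 2019, 2022]).astype("int32")

# Filter used in the notebook: misses the int32 column
narrow = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Width-independent filter: matches every numeric dtype
broad = toy.select_dtypes(include="number").columns.tolist()

print("narrow:", narrow)
print("broad: ", broad)
```

With the broader filter the date parts would be standardized like any other number. Month and day of the week are cyclical, so treating them as categorical is another reasonable option; the notebook does not say which is intended.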


## 9. Build Preprocessing ColumnTransformer

- For numeric columns: `StandardScaler` standardizes each column to mean 0 and standard deviation 1.
- For categorical columns: `OneHotEncoder` converts each category into a 0/1 indicator column.
- `ColumnTransformer` applies each transformer to its own columns in one step and concatenates the results.


In [10]:
# Transformer for the numeric columns
numeric_transformer = StandardScaler()

# Transformer for the categorical columns
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# ColumnTransformer combines the two
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
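
The `handle_unknown="ignore"` setting matters once the test set contains a category that was absent from the training set: instead of raising an error, the encoder emits an all-zero row for that column group. A minimal sketch follows; `"Transfer"` is a made-up category used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Urgent"], ["Emergency"], ["Elective"]]))

# "Transfer" is a made-up category that was not present during fit
encoded = encoder.transform(np.array([["Urgent"], ["Transfer"]])).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # second row is all zeros
```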


## 10. Train/Test Split & Final Pipeline

1. Split the data:
   - `X_train`, `X_test`, `y_train`, `y_test`, with 80% for training and 20% for testing.

2. Build a `Pipeline` that contains:
   - a `preprocessor` step (encoding + scaling).
   - a `model` step (here `LinearRegression`, as a simple example).

3. Call `fit` on the training data to confirm that the preprocessing runs without problems.

Next steps:
- Evaluate the model (`score`, `MAE`, `RMSE`, ...).
- Try other models (RandomForest, XGBoost, ...).


In [11]:
# Split the data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build a Pipeline that applies the preprocessing before any model
from sklearn.linear_model import LinearRegression  # example of a simple model

model = LinearRegression()

clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", model),
    ]
)

# Trial fit to confirm everything works end to end
clf.fit(X_train, y_train)

print("Pipeline is ready! You can now evaluate or tune the model.")


Pipeline is ready! You can now evaluate or tune the model.
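
The evaluation step listed under "Next steps" could look roughly like the sketch below. It runs on synthetic data so that it is self-contained; `X_demo`, `y_demo`, and `demo_model` are hypothetical names, and no metric values for the real dataset are implied.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus a little noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

mae = mean_absolute_error(y_te, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_te, y_pred))   # penalizes large errors more
r2 = demo_model.score(X_te, y_te)                  # R^2 for regressors

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

In this notebook the same three calls would take `y_test` and `clf.predict(X_test)`, with `clf.score(X_test, y_test)` for R². For `Length_of_Stay`, MAE and RMSE are expressed in the same unit as the target.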


In [12]:
# Assumes preprocessor, X_train, and X_test are already defined above

X_train_pre = preprocessor.fit_transform(X_train)
X_test_pre = preprocessor.transform(X_test)

print("X_train_pre shape:", X_train_pre.shape)
print("X_test_pre shape:", X_test_pre.shape)

# Check whether the result is a sparse matrix
print("Type:", type(X_train_pre))


X_train_pre shape: (43972, 34)
X_test_pre shape: (10994, 34)
Type: <class 'scipy.sparse._csr.csr_matrix'>


In [13]:
X_train_pre_dense = X_train_pre[:100].toarray()
print("Any NaNs in subset?", np.isnan(X_train_pre_dense).any())


Any NaNs in subset? False
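
The check above only covers the first 100 rows. Because the implicit zeros of a sparse matrix cannot be NaN, inspecting the stored values in `.data` covers the whole matrix without converting it to a dense array. A minimal sketch on small stand-in matrices (`X_sparse` and `X_bad` are hypothetical names):

```python
import numpy as np
from scipy import sparse

# Small stand-in for X_train_pre
dense = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
X_sparse = sparse.csr_matrix(dense)

# .data holds only the explicitly stored entries
has_nan = bool(np.isnan(X_sparse.data).any())

# Same check on a matrix that does contain a NaN
X_bad = sparse.csr_matrix(np.array([[0.0, np.nan], [1.0, 0.0]]))
has_nan_bad = bool(np.isnan(X_bad.data).any())

print(has_nan, has_nan_bad)
```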


In [15]:
# Get the categorical column names from the fitted OneHotEncoder
ohe = preprocessor.named_transformers_["cat"]
ohe_feature_names = ohe.get_feature_names_out(categorical_features)

# Combine all column names (numeric first, matching the transformer order)
all_feature_names = np.concatenate([numeric_features, ohe_feature_names])

# Convert the result to a DataFrame
X_train_pre_df = pd.DataFrame(X_train_pre.toarray(), columns=all_feature_names)
X_train_pre_df.head()


Unnamed: 0,Age,Billing Amount,Gender_Female,Gender_Male,Blood Type_A+,Blood Type_A-,Blood Type_AB+,Blood Type_AB-,Blood Type_B+,Blood Type_B-,...,Admission Type_Emergency,Admission Type_Urgent,Medication_Aspirin,Medication_Ibuprofen,Medication_Lipitor,Medication_Paracetamol,Medication_Penicillin,Test Results_Abnormal,Test Results_Inconclusive,Test Results_Normal
0,1.093512,-1.36605,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,-0.693415,1.535562,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.429796,1.631803,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.838236,0.483997,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,1.093512,0.353036,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


In [16]:
X_train_pre_df.describe()


Unnamed: 0,Age,Billing Amount,Gender_Female,Gender_Male,Blood Type_A+,Blood Type_A-,Blood Type_AB+,Blood Type_AB-,Blood Type_B+,Blood Type_B-,...,Admission Type_Emergency,Admission Type_Urgent,Medication_Aspirin,Medication_Ibuprofen,Medication_Lipitor,Medication_Paracetamol,Medication_Penicillin,Test Results_Abnormal,Test Results_Inconclusive,Test Results_Normal
count,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,...,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0,43972.0
mean,-1.192533e-16,-9.420686e-17,0.498999,0.501001,0.125466,0.12533,0.124693,0.124375,0.127081,0.125762,...,0.327731,0.335509,0.199286,0.201014,0.201151,0.199809,0.19874,0.337055,0.330119,0.332825
std,1.000011,1.000011,0.500005,0.500005,0.331251,0.331096,0.330374,0.330012,0.333067,0.331585,...,0.469392,0.472173,0.399468,0.400763,0.400865,0.399861,0.399057,0.472709,0.470261,0.47123
min,-1.969791,-1.917835,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.84658,-0.8658779,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.02135576,-0.002311329,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.8382365,0.8610084,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
max,1.910392,1.916899,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
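
The scaled columns show a standard deviation of 1.000011 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), and the two differ by a factor of $\sqrt{n/(n-1)}$. The sketch below reproduces the effect on synthetic values with the same number of rows; the values themselves are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 43972  # number of training rows reported above
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=20, size=(n, 1))  # synthetic stand-in column

scaled = StandardScaler().fit_transform(values).ravel()

std_population = scaled.std(ddof=0)    # what StandardScaler normalizes to 1
std_sample = pd.Series(scaled).std()   # pandas default, ddof=1
expected = np.sqrt(n / (n - 1))

print(f"{std_population:.6f}  {std_sample:.6f}  {expected:.6f}")
```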
