# Step Two: Machine Learning Model Using XGBoost

After data preprocessing, an XGBoost classifier was trained to predict medicine stock-out risk. XGBoost was selected for its strong performance on structured tabular data and its ability to model complex non-linear relationships. The model was trained on the training split (80% of the records) and evaluated on the held-out test split (20%) using accuracy, precision, recall, and F1-score.

In [49]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# ENCODING CATEGORICAL DATA
Encoding converts text categories into numeric columns, because the model requires numeric input features.

In [50]:
df = pd.read_csv('Data_CleanedFor_Tabular.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,record_id,date,pharmacy_id,medicine_id,current_stock_level,avg_weekly_sales,reorder_quantity,lead_time_days,supplier_delay_frequency,price_change_rate,storage_capacity,pharmacy_location_code,medicine_category,target_stockout
0,0,1,2024-03-05,82,46,41,34,45,13,0.8,-0.0,359,9,Painkiller,0
1,1,2,2024-11-13,42,7,192,13,80,6,0.65,0.13,322,13,Cardiology,0
2,2,3,2024-11-02,68,26,79,17,62,1,0.55,0.08,242,4,Cardiology,0
3,3,4,2024-05-06,68,13,180,14,30,14,0.26,0.01,296,8,Antibiotic,0
4,4,5,2024-03-09,44,28,171,44,71,13,0.58,0.1,101,2,Vitamins,1


In [51]:
# One-hot encode the categorical column.
# drop="first" omits the first category (the baseline), so the encoded columns are not redundant.
encoder = OneHotEncoder(drop="first", handle_unknown="ignore")
encoded_cat = encoder.fit_transform(df[["medicine_category"]])
encoder_cat_df = pd.DataFrame(
    encoded_cat.toarray(),  # convert the sparse matrix to a dense array
    columns=encoder.get_feature_names_out(["medicine_category"]),
    index=df.index,  # keep rows aligned with df for the later concat
)
df = df.drop(columns=["medicine_category"])

In [None]:
# Concatenate the original dataframe with the new one-hot encoded columns
df = pd.concat([df, encoder_cat_df], axis=1)
df.head()

Unnamed: 0.1,Unnamed: 0,record_id,date,pharmacy_id,medicine_id,current_stock_level,avg_weekly_sales,reorder_quantity,lead_time_days,supplier_delay_frequency,price_change_rate,storage_capacity,pharmacy_location_code,target_stockout,medicine_category_Cardiology,medicine_category_Diabetes,medicine_category_Painkiller,medicine_category_Vitamins
0,0,1,2024-03-05,82,46,41,34,45,13,0.8,-0.0,359,9,0,0.0,0.0,1.0,0.0
1,1,2,2024-11-13,42,7,192,13,80,6,0.65,0.13,322,13,0,1.0,0.0,0.0,0.0
2,2,3,2024-11-02,68,26,79,17,62,1,0.55,0.08,242,4,0,1.0,0.0,0.0,0.0
3,3,4,2024-05-06,68,13,180,14,30,14,0.26,0.01,296,8,0,0.0,0.0,0.0,0.0
4,4,5,2024-03-09,44,28,171,44,71,13,0.58,0.1,101,2,1,0.0,0.0,0.0,1.0


In [None]:
# Split data into features and target
X = df.drop(columns=["target_stockout", "date"])
y = df["target_stockout"]

In [None]:
# Check shapes
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (1000, 16)
y shape: (1000,)
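The feature matrix above still includes index-like columns such as `record_id` and the `Unnamed: ...` column left over from the CSV export. Row identifiers describe the record rather than the pharmacy or medicine, and a tree model can still split on them, so excluding them is a common modeling choice. The sketch below shows one way to do that on a miniature frame. `df_demo`, `id_like`, `X_demo`, and `y_demo` are hypothetical names, and this step is an assumption rather than part of the original pipeline.

```python
import pandas as pd

# Miniature frame that mirrors a few column names from the notebook
df_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "record_id": [1, 2, 3],
    "date": ["2024-03-05", "2024-11-13", "2024-11-02"],
    "current_stock_level": [41, 192, 79],
    "avg_weekly_sales": [34, 13, 17],
    "target_stockout": [0, 0, 0],
})

# Collect columns that only identify a row: leftover CSV index columns and record_id
id_like = [c for c in df_demo.columns if c.startswith("Unnamed") or c == "record_id"]

# Drop identifiers, the raw date string, and the target from the features
X_demo = df_demo.drop(columns=id_like + ["date", "target_stockout"])
y_demo = df_demo["target_stockout"]

print("Dropped identifier columns:", id_like)
print("Remaining features:", X_demo.columns.tolist())
```

Applied to the full dataframe, this would reduce the number of feature columns, so the shapes printed in the notebook would change accordingly.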


In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Get the shapes of the splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (800, 16)
X_test shape: (200, 16)
y_train shape: (800,)
y_test shape: (200,)
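The split above is purely random. When the classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions the same in both splits. The sketch below illustrates this on synthetic labels; `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's data, and stratification is a suggested variation rather than what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 25 positives (25% positive class)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 25 + [0] * 75)

# stratify=y_demo preserves the 25% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test:", y_te.mean())
```

With the notebook's variables, the equivalent call would add `stratify=y` to the existing `train_test_split(X, y, ...)` line, which would also change which rows land in the test set.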


In [57]:
# Create XGBoost model
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    eval_metric="logloss",
    random_state=42
)

In [58]:
# Train the model
xgb_model.fit(X_train, y_train)

In [59]:
# Make predictions on the test set
y_pred = xgb_model.predict(X_test)

In [60]:
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       146
           1       1.00      1.00      1.00        54

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
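The report above comes from a single 80/20 split. A complementary check is stratified k-fold cross-validation, which scores the model on several different held-out folds and gives an out-of-fold confusion matrix. The sketch below is a minimal version under stated assumptions: `cross_validated_report` is a hypothetical helper, the data is synthetic (generated with roughly the 73/27 class ratio seen in the test set), and a `DecisionTreeClassifier` stands in for the XGBoost model so the block runs with scikit-learn alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def cross_validated_report(model, X, y, n_splits=5, seed=42):
    """Return per-fold accuracies and an out-of-fold confusion matrix."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    oof_pred = cross_val_predict(model, X, y, cv=cv)
    return scores, confusion_matrix(y, oof_pred)


# Synthetic stand-in data (not the pharmacy dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.73, 0.27], random_state=42
)

# Stand-in estimator; any scikit-learn compatible classifier can be passed instead
stand_in = DecisionTreeClassifier(max_depth=5, random_state=42)
scores, cm = cross_validated_report(stand_in, X_demo, y_demo)

print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy:", round(scores.mean(), 3))
print("Out-of-fold confusion matrix:")
print(cm)
```

In the notebook, the same helper could be called as `cross_validated_report(xgb_model, X, y)`. If the fold scores stay at 1.0 across all folds, it is worth checking whether the target can be derived directly from one or more feature columns.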

