## **Milestone 2**

### **A. Introduction**

- **Name**  : Livia Amanda Annafiah
- **Batch** : BSD-005
- **Dataset** : [Cerebral Stroke Dataset](https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset/data)
- **Hugging Face**: [Link](https://huggingface.co/spaces/liviamanda/CerebralStroke)

---------------------

**Problem Statement**

A healthcare center is facing challenges in **predicting strokes in patients**, which leads to delayed interventions and puts patients' lives at risk. The difficulty comes from the complex nature of stroke symptoms and the many risk factors involved, including age, health condition, and lifestyle choices. Without a quick way to identify who is at high risk of a stroke, it is hard to act fast with medical help. As a result, patients might not get the urgent care that could prevent serious consequences such as long-term disability or even death. This situation shows the urgent need for better tools that can predict strokes earlier and tailor care to each person's needs.

To address these challenges, the healthcare center wants to develop a model capable of analyzing patient data to predict the likelihood of a stroke. With this model, the healthcare center will be able to analyze various aspects of patient data, such as medical history, lifestyle habits, and physiological indicators, to identify key patterns and risk factors linked to strokes.

The dataset for this analysis focuses on cerebral strokes, containing information about patients' personal details, health status, lifestyle, and whether they have experienced a stroke.

**Objective**

This project focuses on creating a **classification model** to predict strokes, selecting the best performer among `KNN`, `SVM`, `Logistic Regression`, `Decision Tree`, `Random Forest`, and `XGBoost`. The key evaluation metric for assessing model performance is `Recall`, which will be used to determine the effectiveness of each model in identifying stroke cases.
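
To make the choice of metric concrete, the snippet below is a minimal sketch of how recall is computed: the share of actual stroke cases that the model flags, `TP / (TP + FN)`. The helper name `recall` and the label lists are hypothetical values used only for illustration. They are not results from this project.

```python
def recall(y_true, y_pred, positive=1):
    """Return TP / (TP + FN) for the positive class (1 = stroke)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: actual stroke cases that were predicted as stroke
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False negatives: actual stroke cases that the model missed
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no positive cases, so recall is undefined; report 0.0
    return tp / (tp + fn)


# Hypothetical labels: three actual stroke cases, two of them detected
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f'Recall: {recall(y_true, y_pred):.2f}')
```

A missed stroke case (false negative) lowers recall, while a false alarm does not, which is why this metric suits a setting where missing a high-risk patient is the costlier error.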

This notebook tests the inference capabilities of the previously developed model.

### **B. Libraries**

The libraries used to test the model are as follows:

In [1]:
# Import Library
import pandas as pd
import pickle

**Library Functions**
- pandas: data manipulation
- pickle: loading the saved model

### **C. Data Loading**

The initial step involves loading the model and the inference data, which have been previously separated from the model training file.

In [2]:
# Load model
with open('best_rf_model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)
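
As an optional safeguard, the loading step can be wrapped in a small helper that confirms the unpickled object can make predictions. This is only a sketch: the function name `load_model` is hypothetical and is not part of the original notebook.

```python
import pickle


def load_model(path):
    """Load a pickled estimator and confirm it exposes a predict() method."""
    with open(path, 'rb') as model_file:
        model = pickle.load(model_file)
    # Fail early if the file holds something other than a fitted estimator
    if not callable(getattr(model, 'predict', None)):
        raise TypeError(f'{path} does not contain an object with a predict() method')
    return model
```

It would be called as `model = load_model('best_rf_model.pkl')`. Note that `pickle.load` can execute arbitrary code, so only files from a trusted source should be loaded this way.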

In [3]:
# Define new data
data = {'id': [1622, 8943, 1254],
        'gender': ['Female', 'Male', 'Female'],
        'age': [22, 27, 56],
        'hypertension': [0, 1, 1],
        'heart_disease': [0, 1, 0],
        'ever_married': ['No', 'Yes', 'Yes'],
        'work_type': ['Private', 'Govt_job', 'Self-employed'],
        'Residence_type': ['Urban', 'Rural', 'Urban'],
        'avg_glucose_level': [120.57, 231.45, 89.32],
        'bmi': [16.32, 32.88, 43.61],
        'smoking_status': ['never_smoked', 'formerly smoked', 'smokes']}

# Create the DataFrame
df_inf = pd.DataFrame(data)

# Print the DataFrame
df_inf

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1622 | Female | 22 | 0 | 0 | No | Private | Urban | 120.57 | 16.32 | never_smoked |
| 1 | 8943 | Male | 27 | 1 | 1 | Yes | Govt_job | Rural | 231.45 | 32.88 | formerly smoked |
| 2 | 1254 | Female | 56 | 1 | 0 | Yes | Self-employed | Urban | 89.32 | 43.61 | smokes |


### **D. Model Prediction**

Since the model handles preprocessing within the pipeline, there's no need to separately preprocess the data. The next step is simply using the saved model for making predictions.
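
Because preprocessing lives inside the pipeline, the raw input must carry the column names the pipeline was fitted on. The snippet below is a minimal sketch of a check that can run before prediction. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is an assumption taken from the inference data defined above, since the training file is not shown in this notebook.

```python
# Assumed feature columns, copied from the inference data in this notebook
EXPECTED_COLUMNS = [
    'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
    'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
    'smoking_status',
]


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]


# Column names are case-sensitive: 'residence_type' does not satisfy 'Residence_type'
print(missing_columns(['id', 'gender', 'age', 'residence_type']))
```

With the DataFrame from this notebook, the check would be `missing_columns(df_inf.columns)`, and an empty list means every assumed feature column is present.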

In [4]:
# Predict using best model
y_pred_inf = model.predict(df_inf)

y_pred_inf

array([0, 0, 1], dtype=int64)

In [5]:
# Print the predictions
for idx, pred in enumerate(y_pred_inf):
    if pred == 0:
        print(f'Prediction for ID {df_inf["id"].iloc[idx]}: Non-Stroke')
    elif pred == 1:
        print(f'Prediction for ID {df_inf["id"].iloc[idx]}: Stroke')

Prediction for ID 1622: Non-Stroke
Prediction for ID 8943: Non-Stroke
Prediction for ID 1254: Stroke


### **E. Conclusion**

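The saved model runs successfully on raw, unseen data and returns a prediction for each record: in this example, 2 non-stroke and 1 stroke. Because these records have no ground-truth labels, the result shows that the inference pipeline works end to end. It does not measure the model's accuracy.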