## **Title:** Building a Disease Diagnosis and Prescription Recommendation System with Machine Learning  

**Description:**  
This notebook builds a personalized medical recommendation system with machine learning. It trains classifiers on a table of binary symptom indicators to predict a likely disease, then looks up a description, precautions, medications, diet, and workout advice for the predicted disease.

# 1. Load the dataset

In [1]:
import pandas as pd
dataset = pd.read_csv('datasets/Training.csv')
print(dataset.shape)
print(dataset.head())

(4920, 133)
   itching  skin_rash  nodal_skin_eruptions  continuous_sneezing  shivering  \
0        1          1                     1                    0          0   
1        0          1                     1                    0          0   
2        1          0                     1                    0          0   
3        1          1                     0                    0          0   
4        1          1                     1                    0          0   

   chills  joint_pain  stomach_pain  acidity  ulcers_on_tongue  ...  \
0       0           0             0        0                 0  ...   
1       0           0             0        0                 0  ...   
2       0           0             0        0                 0  ...   
3       0           0             0        0                 0  ...   
4       0           0             0        0                 0  ...   

   blackheads  scurring  skin_peeling  silver_like_dusting  \
0           0         0 

# 2. Train-test split

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Splitting features (X) and target variable (y)
X = dataset.drop('prognosis', axis=1)
y = dataset['prognosis']

# Encoding the 'prognosis' column (target variable)
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # Encode target variable

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=20)
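
A table of binary symptom flags can contain many repeated rows, and a random split then places identical rows on both sides, which inflates test scores. The sketch below counts how many test rows also occur in the training set. It runs on a small made-up table rather than `Training.csv`; the `toy` data and the helper name `count_test_rows_seen_in_train` are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def count_test_rows_seen_in_train(X_train, X_test):
    """Return how many test rows have an identical feature row in the training set."""
    train_rows = set(map(tuple, X_train.to_numpy()))
    return sum(tuple(row) in train_rows for row in X_test.to_numpy())

# Toy stand-in for the symptom table: 3 distinct patterns, each repeated 20 times.
toy = pd.DataFrame({
    "itching":   [1, 0, 1] * 20,
    "skin_rash": [1, 1, 0] * 20,
    "chills":    [0, 1, 1] * 20,
})
toy_labels = ["A", "B", "C"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_labels, test_size=0.3, random_state=20)

overlap = count_test_rows_seen_in_train(X_tr, X_te)
print(f"{overlap} of {len(X_te)} test rows also appear in the training set")
print(f"distinct rows in the full table: {len(toy.drop_duplicates())}")
```

Calling the same helper on the notebook's `X_train` and `X_test` would show whether the real split has this overlap. If it does, splitting on de-duplicated rows is one option to consider.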

# 3. Train and select the best model

In [4]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, confusion_matrix
import numpy as np

# Create a dictionary to store models
models = {
    'SVC': SVC(kernel='linear'),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=50, random_state=42),
    'Bagging': BaggingClassifier(n_estimators=10, random_state=42),
    'KNeighbors': KNeighborsClassifier(n_neighbors=5),
    'MultinomialNB': MultinomialNB(),
    'GaussianNB': GaussianNB(),
    'LogisticRegression': LogisticRegression(random_state=42),
    'DecisionTree': DecisionTreeClassifier(random_state=42)
}

# Loop through the models, train, test, and print results
for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Test the model
    predictions = model.predict(X_test)

    # Calculate F1-score
    f1 = f1_score(y_test, predictions, average='weighted')
    print(f"{model_name} F1-Score: {f1:.2f}")

    # Calculate confusion matrix
    cm = confusion_matrix(y_test, predictions)
    print(f"{model_name} Confusion Matrix:")
    print(np.array2string(cm, separator=', '))

    print("\n" + "="*40 + "\n")

SVC F1-Score: 1.00
SVC Confusion Matrix:
[[40,  0,  0, ...,  0,  0,  0],
 [ 0, 43,  0, ...,  0,  0,  0],
 [ 0,  0, 28, ...,  0,  0,  0],
 ...,
 [ 0,  0,  0, ..., 34,  0,  0],
 [ 0,  0,  0, ...,  0, 41,  0],
 [ 0,  0,  0, ...,  0,  0, 31]]


RandomForest F1-Score: 1.00
RandomForest Confusion Matrix:
[[40,  0,  0, ...,  0,  0,  0],
 [ 0, 43,  0, ...,  0,  0,  0],
 [ 0,  0, 28, ...,  0,  0,  0],
 ...,
 [ 0,  0,  0, ..., 34,  0,  0],
 [ 0,  0,  0, ...,  0, 41,  0],
 [ 0,  0,  0, ...,  0,  0, 31]]


GradientBoosting F1-Score: 1.00
GradientBoosting Confusion Matrix:
[[40,  0,  0, ...,  0,  0,  0],
 [ 0, 43,  0, ...,  0,  0,  0],
 [ 0,  0, 28, ...,  0,  0,  0],
 ...,
 [ 0,  0,  0, ..., 34,  0,  0],
 [ 0,  0,  0, ...,  0, 41,  0],
 [ 0,  0,  0, ...,  0,  0, 31]]






AdaBoost F1-Score: 0.10
AdaBoost Confusion Matrix:
[[0, 0, 0, ..., 0, 0, 0],
 [0, 0, 0, ..., 0, 0, 0],
 [0, 0, 0, ..., 0, 0, 0],
 ...,
 [0, 0, 0, ..., 0, 0, 0],
 [0, 0, 0, ..., 0, 0, 0],
 [0, 0, 0, ..., 0, 0, 0]]


Bagging F1-Score: 1.00
Bagging Confusion Matrix:
[[40,  0,  0, ...,  0,  0,  0],
 [ 0, 43,  0, ...,  0,  0,  0],
 [ 0,  0, 28, ...,  0,  0,  0],
 ...,
 [ 0,  0,  0, ..., 34,  0,  0],
 [ 0,  0,  0, ...,  0, 41,  0],
 [ 0,  0,  0, ...,  0,  0, 31]]


KNeighbors F1-Score: 1.00
KNeighbors Confusion Matrix:
[[40,  0,  0, ...,  0,  0,  0],
 [ 0, 43,  0, ...,  0,  0,  0],
 [ 0,  0, 28, ...,  0,  0,  0],
 ...,
 [ 0,  0,  0, ..., 34,  0,  0],
 [ 0,  0,  0, ...,  0, 41,  0],
 [ 0,  0,  0, ...,  0,  0, 31]]


MultinomialNB F1-Score: 1.00
MultinomialNB Confusion Matrix:
[[40,  0,  0, ...,  0,  0,  0],
 [ 0, 43,  0, ...,  0,  0,  0],
 [ 0,  0, 28, ...,  0,  0,  0],
 ...,
 [ 0,  0,  0, ..., 34,  0,  0],
 [ 0,  0,  0, ...,  0, 41,  0],
 [ 0,  0,  0, ...,  0,  0, 31]]


GaussianNB F1-Score:

Most models reach an F1-score of 1.00 with identical confusion matrices (AdaBoost is the exception at 0.10), so the F1-score alone cannot separate them. Perfect scores across such different model families also suggest that this test set is easy, for example because it repeats symptom patterns already seen in training, so they should not be read as evidence of real-world accuracy. The choice therefore rests on the following factors:

### 1. Model Interpretability:

**Best Choice**: DecisionTree, LogisticRegression

Decision trees are highly interpretable: the learned splits show which symptoms drive each prediction. Logistic regression exposes each feature's contribution through its coefficients.

### 2. Scalability and Computational Cost:

**Best Choice**: MultinomialNB, GaussianNB, LogisticRegression

These models are cheap to train and to run, even on large datasets. KNeighbors also trains almost instantly, but it compares every query against the stored training set, so its prediction cost grows with the size of the data.

### 3. Overfitting Risk:

**Best Choice**: Bagging, RandomForest, GradientBoosting

These ensemble methods are generally more robust to overfitting than a single decision tree.

### 4. Data Characteristics and Use Case:

**Sparse or Text Data**: MultinomialNB, LogisticRegression

**Continuous Data or Multi-class Settings**: GaussianNB, KNeighbors

### 5. Deployment Requirements:

**Best Choice**: LogisticRegression, SVC

Logistic regression and a linear support vector classifier are straightforward to deploy because they have few hyperparameters to tune.

**Warnings/Issues**: Avoid AdaBoost because of its poor F1-score (0.10) and the deprecation warning for the SAMME.R algorithm.

### 6. Recommendation:

**Primary Choice**: If interpretability is key, go with Logistic Regression or Decision Tree.

**Secondary Choice**: For scalability and robustness, choose Random Forest or Gradient Boosting.

**Avoid**: AdaBoost, due to its poor performance.
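
When a single hold-out split gives nearly every model the same score, cross-validated scores together with wall-clock timing are one way to get more signal for the comparison above. The following is a minimal sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `candidates`, and `cv_results` are illustrative assumptions, and in the notebook the inputs would be `X` and `y_encoded`.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 samples, 20 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_weighted")
    elapsed = time.perf_counter() - start
    cv_results[name] = (scores.mean(), scores.std(), elapsed)
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f} ({elapsed:.2f}s)")
```

The spread across folds and the elapsed time give two extra axes for the decision when the mean scores are tied.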

In [7]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))

# Feature Importance (Optional)
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)
print("\nFeature Importances:")
print(feature_importance)

# Save the trained model (Optional)
import joblib
joblib.dump(rf_model, 'models/random_forest_model.pkl')

Confusion Matrix:
[[40  0  0 ...  0  0  0]
 [ 0 43  0 ...  0  0  0]
 [ 0  0 28 ...  0  0  0]
 ...
 [ 0  0  0 ... 34  0  0]
 [ 0  0  0 ...  0 41  0]
 [ 0  0  0 ...  0  0 31]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        40
           1       1.00      1.00      1.00        43
           2       1.00      1.00      1.00        28
           3       1.00      1.00      1.00        46
           4       1.00      1.00      1.00        42
           5       1.00      1.00      1.00        33
           6       1.00      1.00      1.00        33
           7       1.00      1.00      1.00        39
           8       1.00      1.00      1.00        32
           9       1.00      1.00      1.00        49
          10       1.00      1.00      1.00        37
          11       1.00      1.00      1.00        42
          12       1.00      1.00      1.00        41
          13       1.00      1.00      1.00  

['models/random_forest_model.pkl']

In [10]:
# load model
rf = joblib.load('models/random_forest_model.pkl')

In [12]:
# Function to test and display predictions for given indices
def test_model(model, X_test, y_test, test_indices):
    """
    Test and display predictions for specific indices.

    Parameters:
    - model: Trained model for predictions.
    - X_test: Test feature set (pandas DataFrame or NumPy array).
    - y_test: Test labels (pandas Series or NumPy array).
    - test_indices: List of indices to test.

    Returns:
    - None
    """
    for idx in test_indices:
        # Ensure index is valid (reject negative and too-large positions)
        if not 0 <= idx < len(X_test):
            print(f"Index {idx} is out of range. Skipping.")
            continue
        # Keep the sample as a one-row DataFrame so its feature names match those seen during fit
        X_sample = X_test.iloc[[idx]] if hasattr(X_test, 'iloc') else X_test[idx].reshape(1, -1)
        predicted = model.predict(X_sample)[0]
        actual = y_test[idx]  # y_test is a NumPy array of encoded labels, so positional access works
        print(f"Test {idx + 1}:")
        print(f"  Predicted Disease: {predicted}")
        print(f"  Actual Disease: {actual}")
        print("-" * 40)

# Example usage
test_indices_rf = [0, 100]  # Indices to test for Random Forest

print("Random Forest Predictions:")
test_model(rf_model, X_test, y_test, test_indices_rf)

Random Forest Predictions:
Test 1:
  Predicted Disease: 40
  Actual Disease: 40
----------------------------------------
Test 101:
  Predicted Disease: 39
  Actual Disease: 39
----------------------------------------
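
The values printed above (40, 39) are the integer codes produced by `LabelEncoder`, not disease names. `LabelEncoder.inverse_transform` maps codes back to the original strings. The sketch below shows this on a few made-up `prognosis` values; `le_demo` and the sample names are illustrative, and in the notebook the fitted encoder is `le`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up disease names standing in for the 'prognosis' column.
prognosis = ["Fungal infection", "Allergy", "GERD", "Allergy"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(prognosis)  # classes are sorted alphabetically
print(codes.tolist())                     # [1, 0, 2, 0]

# Decode integer predictions back into disease names.
predicted_codes = np.array([2, 0])
decoded = le_demo.inverse_transform(predicted_codes)
print(decoded.tolist())                   # ['GERD', 'Allergy']
```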




# 4. Design the Recommendation System

## Load the lookup tables and build the recommendation logic

In [24]:
%pip install --upgrade scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [14]:
sym_des = pd.read_csv("datasets/symtoms_df.csv")
precautions = pd.read_csv("datasets/precautions_df.csv")
workout = pd.read_csv("datasets/workout_df.csv")
description = pd.read_csv("datasets/description.csv")
medications = pd.read_csv('datasets/medications.csv')
diets = pd.read_csv("datasets/diets.csv")

In [25]:
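# Helper function: look up the recommendation details for a predicted disease
def helper(dis, description, precautions, medications, diets, workout):
    """Return (description, precautions, medications, diets, workouts) for disease `dis`."""
    try: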
        desc = description[description['Disease'] == dis]['Description']
        desc = " ".join(desc.tolist())
        pre = precautions[precautions['Disease'] == dis][['Precaution_1', 'Precaution_2', 'Precaution_3', 'Precaution_4']]
        pre = pre.values.flatten().tolist()
        med = medications[medications['Disease'] == dis]['Medication'].tolist()
        diet = diets[diets['Disease'] == dis]['Diet'].tolist()
        wrkout = workout[workout['disease'] == dis]['workout'].tolist()
        return desc, pre, med, diet, wrkout
    except Exception as e:
        print(f"Error retrieving details for disease '{dis}': {e}")
        return None, None, None, None, None

# Prediction Function
def get_predicted_disease(patient_symptoms, symptoms_dict, diseases_list, model, X_train=None):
    """
    Predict the disease based on patient symptoms using the trained model.

    Parameters:
    - patient_symptoms: List of symptoms provided by the patient.
    - symptoms_dict: Dictionary mapping symptoms to indices.
    - diseases_list: Dictionary mapping indices to diseases.
    - model: Trained classification model (e.g., RandomForestClassifier or SVC).
    - X_train: Optional, training data used for the model. Required for compatibility with older scikit-learn versions.

    Returns:
    - Predicted disease (str) or an error message.
    """
    try:
        # Determine input vector size
        if hasattr(model, 'n_features_in_'):
            input_vector_size = model.n_features_in_
        elif X_train is not None:
            input_vector_size = X_train.shape[1]
        else:
            raise ValueError("Model input size could not be determined. Provide X_train.")

        # Initialize input vector
        input_vector = np.zeros(input_vector_size)

        # Map symptoms to input vector
        for symptom in patient_symptoms:
            if symptom in symptoms_dict:
                symptom_index = symptoms_dict[symptom]
                if symptom_index < len(input_vector):
                    input_vector[symptom_index] = 1
                else:
                    print(f"Warning: Symptom index {symptom_index} is out of bounds.")
            else:
                print(f"Warning: Symptom '{symptom}' not recognized.")

        # Predict the disease
        predicted_index = model.predict([input_vector])[0]
        return diseases_list.get(predicted_index, "Unknown Disease")
    except Exception as e:
        print(f"Error during prediction: {e}")
        return "Prediction Failed"
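
`get_predicted_disease` expects `symptoms_dict` (symptom name to column index) and `diseases_list` (encoded label to disease name), but no cell shown here defines them. One way to build both, assuming they should follow the column order of the training features and the class order of the fitted `LabelEncoder`, is sketched below on a toy table. `toy_df`, `X_toy`, and `le_toy` are hypothetical stand-ins for `dataset`, `X`, and `le`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy table with the same layout as Training.csv: symptom columns plus 'prognosis'.
toy_df = pd.DataFrame({
    "itching":   [1, 0],
    "skin_rash": [1, 0],
    "acidity":   [0, 1],
    "prognosis": ["Fungal infection", "GERD"],
})
X_toy = toy_df.drop("prognosis", axis=1)
le_toy = LabelEncoder().fit(toy_df["prognosis"])

# Symptom name -> column position in the feature matrix.
symptoms_dict = {name: i for i, name in enumerate(X_toy.columns)}
# Encoded label -> disease name, in the encoder's class order.
diseases_list = {i: str(name) for i, name in enumerate(le_toy.classes_)}

# Build an input vector the same way get_predicted_disease does.
vector = np.zeros(len(symptoms_dict))
for symptom in ["itching", "skin_rash"]:
    vector[symptoms_dict[symptom]] = 1

print(symptoms_dict)
print(diseases_list)
print(vector)
```

Deriving both mappings from the fitted objects, rather than typing them by hand, keeps the indices consistent with what the model saw during training.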

In [26]:
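# Example usage
# symptoms_dict (symptom name -> column index) and diseases_list (encoded label -> disease name)
# must already be defined in the session before this cell runs.
patient_symptoms = ['itching', 'skin_rash', 'fatigue']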

# Predict disease using Random Forest
predicted_disease = get_predicted_disease(patient_symptoms, symptoms_dict, diseases_list, rf_model, X_train=X_train)

# Retrieve details
desc, pre, med, diet, wrkout = helper(predicted_disease, description, precautions, medications, diets, workout)

# Display output
print(f"Predicted Disease: {predicted_disease}")
print(f"Description: {desc}")
print(f"Precautions: {pre}")
print(f"Medications: {med}")
print(f"Diet: {diet}")
print(f"Workout: {wrkout}")

Error during prediction: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Predicted Disease: Prediction Failed
Description: 
Precautions: []
Medications: []
Diet: []
Workout: []
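
The output above reports that the `RandomForestClassifier` passed to `get_predicted_disease` was not fitted at the time of the call. The execution counters are out of order (`In [24]` appears before `In [14]`), so one plausible cause, which is an assumption, is that the kernel was restarted or `rf_model` was recreated without rerunning `fit`. A minimal sketch for checking an estimator before predicting, using scikit-learn's `check_is_fitted`, follows; the wrapper name `is_fitted` is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def is_fitted(model):
    """Return True if scikit-learn considers the estimator fitted."""
    try:
        check_is_fitted(model)
        return True
    except NotFittedError:
        return False

fresh_model = RandomForestClassifier(n_estimators=10, random_state=42)
before = is_fitted(fresh_model)  # constructed but never trained
print(before)

fresh_model.fit([[1, 0], [0, 1], [1, 0], [0, 1]], [0, 1, 0, 1])
after = is_fitted(fresh_model)
print(after)
```

If the check fails in the notebook, rerunning the training cell or reloading the saved file with `joblib.load` would restore a fitted model.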


In [28]:
# The Flask app (developed in PyCharm) loads the saved model,
# so install this same scikit-learn version in that environment.
import sklearn
print(sklearn.__version__)

1.5.2
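
Pickled scikit-learn estimators are not guaranteed to load correctly under a different library version, which is why the version is printed here. One way to make the check automatic is to store the version next to the model and compare it at load time. This is a sketch under that assumption; `save_with_version`, `load_with_version_check`, and the bundle layout are hypothetical and not part of the original notebook.

```python
import os
import tempfile
import warnings

import joblib
import sklearn
from sklearn.tree import DecisionTreeClassifier

def save_with_version(model, path):
    """Store the model together with the scikit-learn version that trained it."""
    joblib.dump({"model": model, "sklearn_version": sklearn.__version__}, path)

def load_with_version_check(path):
    """Load the bundle and warn if the running scikit-learn version differs."""
    bundle = joblib.load(path)
    trained_with = bundle["sklearn_version"]
    if trained_with != sklearn.__version__:
        warnings.warn(
            f"Model was trained with scikit-learn {trained_with}, "
            f"but {sklearn.__version__} is installed."
        )
    return bundle["model"], trained_with

# Tiny demo model saved to a temporary directory.
demo_model = DecisionTreeClassifier(random_state=42).fit([[0], [1]], [0, 1])
demo_path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
save_with_version(demo_model, demo_path)

loaded_model, trained_with = load_with_version_check(demo_path)
print(trained_with == sklearn.__version__)
print(loaded_model.predict([[1]])[0])
```

The same bundle could also carry the fitted `LabelEncoder`, so the Flask app can decode predictions without retraining.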
