Create a model using drug200.csv to decide which drug should be given to each of the following patients: <br>

###Patient 1:
* Age = 23
* Sex = Female
* BP = HIGH
* Cholesterol = HIGH
* Na_to_K = 25.355

###Patient 2:
* Age = 56
* Sex = MALE
* BP = NORMAL
* Cholesterol = HIGH
* Na_to_K = 11.567

##Explanation of columns

| Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|-----|-----|----|-------------|---------|------|
| ... | ... | ...| ...         | ...     | ...  |

- **Age**: Age of the patient (numeric)
- **Sex**: Sex of the patient (F or M)
- **BP**: Blood pressure of the patient (HIGH, NORMAL, LOW)
- **Cholesterol**: Cholesterol level of the patient (HIGH, NORMAL)
- **Na_to_K**: Sodium-to-potassium ratio in the patient's blood (numeric)
- **Drug**: The drug type prescribed to the patient (drugX, drugY, drugA, drugB, drugC); this is the target



##Visual Inspection of Data

In [None]:
import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.read_csv('drug200.csv')

In [None]:
df

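|     | Age | Sex | BP     | Cholesterol | Na_to_K | Drug  |
|-----|-----|-----|--------|-------------|---------|-------|
| 0   | 23  | F   | HIGH   | HIGH        | 25.355  | drugY |
| 1   | 47  | M   | LOW    | HIGH        | 13.093  | drugC |
| 2   | 47  | M   | LOW    | HIGH        | 10.114  | drugC |
| 3   | 28  | F   | NORMAL | HIGH        | 7.798   | drugX |
| 4   | 61  | F   | LOW    | HIGH        | 18.043  | drugY |
| ... | ... | ... | ...    | ...         | ...     | ...   |
| 195 | 56  | F   | LOW    | HIGH        | 11.567  | drugC |
| 196 | 16  | M   | LOW    | HIGH        | 12.006  | drugC |
| 197 | 52  | M   | NORMAL | HIGH        | 9.894   | drugX |
| 198 | 23  | M   | NORMAL | NORMAL      | 14.020  | drugX |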


##Exploratory Data Analysis

In [None]:
# Dataframe Information
print("Dataframe Information:")
print(df.info())
print()

# Check for missing values
missing_values = df.isnull().sum().sum()  # Total number of missing values in the DataFrame
print()
if missing_values == 0:
    print("There are no missing values in the DataFrame.")
else:
    print(f"There are {missing_values} missing values in the DataFrame.")
print()

# Check for duplicate rows
duplicate_rows = df[df.duplicated()]
if not duplicate_rows.empty:
    print("Duplicate rows found in the DataFrame:")
    print(duplicate_rows)
else:
    print("No duplicate rows found in the DataFrame.")
print()

# Check the unique values in each column
columns_to_check_unique = ['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K', 'Drug']
for col in columns_to_check_unique:
    unique_values = df[col].unique()
    print(f"Unique values in '{col}':")
    print(unique_values)
    print()

# Descriptive statistics
print("Descriptive Statistics:")
print(df.describe())
print()

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Check for outliers - numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
# Using IQR method for each numerical column
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    if not outliers.empty:
        print(f"Column '{col}' has {len(outliers)} outliers according to IQR method.")
print()

# Check class distribution
# Check the distribution of the target variable 'Drug'
class_distribution = df['Drug'].value_counts()
print("Class Distribution in 'Drug':")
print(class_distribution)

Dataframe Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB
None


There are no missing values in the DataFrame.

No duplicate rows found in the DataFrame.

Unique values in 'Age':
[23 47 28 61 22 49 41 60 43 34 74 50 16 69 32 57 63 48 33 31 39 45 18 65
 53 46 15 73 58 66 37 68 67 62 24 26 40 38 29 17 54 70 36 19 64 59 51 42
 56 20 72 35 52 55 30 21 25]

Unique values in 'Sex':
['F' 'M']

Unique values in 'BP':
['HIGH' 'LOW' 'NORMAL']

Unique values in 'Cholesterol':
['HIGH' 'NORMAL']

Unique values in 'Na_to_K':
[25.355 13.093 1

The target, Drug, is imbalanced: drugY makes up a large share of the dataset, followed by drugX, and the remaining drugs are less frequent. Passing class_weight='balanced' to the model is one way to compensate for this; the model trained below uses the default class weights.
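
To make the effect of class_weight='balanced' concrete, here is a minimal sketch of the weights it would assign. scikit-learn documents the balanced heuristic as `n_samples / (n_classes * np.bincount(y))`. The labels in `y_demo` are made up for illustration and are not the drug200 class counts.

```python
import numpy as np

# Hypothetical encoded labels for illustration only (not the drug200 counts)
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

counts = np.bincount(y_demo)  # samples per class
n_samples, n_classes = len(y_demo), len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = n_samples / (n_classes * counts)

for label, (count, weight) in enumerate(zip(counts, weights)):
    print(f'class {label}: count={count}, weight={weight:.3f}')
```

The rarest class receives the largest weight, so its misclassifications cost more during training.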

Cap the outliers in Na_to_K. Capping adjusts values that fall outside the IQR bounds to the nearest bound (the maximum or minimum threshold). It doesn't remove rows; it limits the impact of extreme values by pulling them into range.

In [None]:
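# Cap the outliers, using IQR bounds computed explicitly for Na_to_K
# (rather than relying on values left over from the last loop iteration above)
q1, q3 = df['Na_to_K'].quantile([0.25, 0.75])
iqr = q3 - q1
df['Na_to_K'] = df['Na_to_K'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)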
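
As a small worked example of IQR capping, the sketch below applies the same idea to a made-up series. The values and the names `values`, `lower`, `upper` and `capped` are assumptions for illustration only.

```python
import pandas as pd

# Made-up values for illustration; 40.0 is the only extreme value
values = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 40.0])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() pulls out-of-range values to the nearest bound and keeps every row
capped = values.clip(lower=lower, upper=upper)
print(f'bounds: [{lower}, {upper}]')
print(capped.tolist())
```

Here the bounds work out to 3.0 and 23.0, so 40.0 is pulled down to 23.0 while the series keeps all six rows.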

Convert Sex, BP, Cholesterol and Drug to numerical values:


In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoders
le_sex = LabelEncoder()
le_BP = LabelEncoder()
le_Cholesterol = LabelEncoder()
le_drug = LabelEncoder()

# Fit LabelEncoders
le_sex.fit(df['Sex'])
le_BP.fit(df['BP'])
le_Cholesterol.fit(df['Cholesterol'])
le_drug.fit(df['Drug'])

# Transform the features in the DataFrame
df['Sex'] = le_sex.transform(df['Sex'])
df['BP'] = le_BP.transform(df['BP'])
df['Cholesterol'] = le_Cholesterol.transform(df['Cholesterol'])
df['Drug'] = le_drug.transform(df['Drug'])

# Display to confirm updated DataFrame information
print(df.info())
df


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    int64  
 2   BP           200 non-null    int64  
 3   Cholesterol  200 non-null    int64  
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    int64  
dtypes: float64(1), int64(5)
memory usage: 9.5 KB
None


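|     | Age | Sex | BP  | Cholesterol | Na_to_K | Drug |
|-----|-----|-----|-----|-------------|---------|------|
| 0   | 23  | 0   | 0   | 0           | 25.355  | 4    |
| 1   | 47  | 1   | 1   | 0           | 13.093  | 2    |
| 2   | 47  | 1   | 1   | 0           | 10.114  | 2    |
| 3   | 28  | 0   | 2   | 0           | 7.798   | 3    |
| 4   | 61  | 0   | 1   | 0           | 18.043  | 4    |
| ... | ... | ... | ... | ...         | ...     | ...  |
| 195 | 56  | 0   | 1   | 0           | 11.567  | 2    |
| 196 | 16  | 1   | 1   | 0           | 12.006  | 2    |
| 197 | 52  | 1   | 2   | 0           | 9.894   | 3    |
| 198 | 23  | 1   | 2   | 1           | 14.020  | 3    |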


Verify the mappings

In [None]:
# Create and print readable mappings
# le_sex.classes_ holds the sorted categories, e.g. ['F', 'M']
# le_sex.transform(le_sex.classes_) returns their numeric codes, e.g. [0, 1]
# zip pairs each category with its code: ('F', 0), ('M', 1)
# dict converts those pairs into a dictionary: {'F': 0, 'M': 1}
sex_mapping = dict(zip(le_sex.classes_, le_sex.transform(le_sex.classes_)))
bp_mapping = dict(zip(le_BP.classes_, le_BP.transform(le_BP.classes_)))
cholesterol_mapping = dict(zip(le_Cholesterol.classes_, le_Cholesterol.transform(le_Cholesterol.classes_)))
drug_mapping = dict(zip(le_drug.classes_, le_drug.transform(le_drug.classes_)))

print("Sex Mapping:", sex_mapping)
print("BP Mapping:", bp_mapping)
print("Cholesterol Mapping:", cholesterol_mapping)
print("Drug Mapping:", drug_mapping)


Sex Mapping: {'F': 0, 'M': 1}
BP Mapping: {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
Cholesterol Mapping: {'HIGH': 0, 'NORMAL': 1}
Drug Mapping: {'drugA': 0, 'drugB': 1, 'drugC': 2, 'drugX': 3, 'drugY': 4}


In [None]:
df

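|     | Age | Sex | BP  | Cholesterol | Na_to_K | Drug |
|-----|-----|-----|-----|-------------|---------|------|
| 0   | 23  | 0   | 0   | 0           | 25.355  | 4    |
| 1   | 47  | 1   | 1   | 0           | 13.093  | 2    |
| 2   | 47  | 1   | 1   | 0           | 10.114  | 2    |
| 3   | 28  | 0   | 2   | 0           | 7.798   | 3    |
| 4   | 61  | 0   | 1   | 0           | 18.043  | 4    |
| ... | ... | ... | ... | ...         | ...     | ...  |
| 195 | 56  | 0   | 1   | 0           | 11.567  | 2    |
| 196 | 16  | 1   | 1   | 0           | 12.006  | 2    |
| 197 | 52  | 1   | 2   | 0           | 9.894   | 3    |
| 198 | 23  | 1   | 2   | 1           | 14.020  | 3    |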


Save processed DataFrame to a CSV File

In [None]:
df.to_csv('processed_drug200.csv', index=False)

###Model Selection, Train and Evaluate

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Separate features and target
X = df.drop(columns=['Drug'])
y = df['Drug']

# Split data into training / test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize scaler
scaler = StandardScaler()

# Fit scaler on training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data
X_test_scaled = scaler.transform(X_test)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Cross-validation
# Use a separate scaler here so that `scaler` stays fitted on the training split only
X_scaled = StandardScaler().fit_transform(X)  # Scaling all data for cross-validation
cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
print(f'Logistic Regression Cross-Validation Accuracy: Mean = {cv_scores.mean():.2f}, Std = {cv_scores.std():.2f}')

# Train model on scaled training data
model.fit(X_train_scaled, y_train)

# Predictions on both training and test data
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

# Calculate metrics for training data
train_accuracy = accuracy_score(y_train, y_train_pred)
train_precision = precision_score(y_train, y_train_pred, average='weighted')
train_recall = recall_score(y_train, y_train_pred, average='weighted')
train_f1 = f1_score(y_train, y_train_pred, average='weighted')

# Calculate metrics for test data
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred, average='weighted')
test_recall = recall_score(y_test, y_test_pred, average='weighted')
test_f1 = f1_score(y_test, y_test_pred, average='weighted')

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_test_pred)

print('Logistic Regression:')
print(f' Train - Accuracy: {train_accuracy:.2f}, Precision: {train_precision:.2f}, Recall: {train_recall:.2f}, F1 Score: {train_f1:.2f}')
print(f' Test  - Accuracy: {test_accuracy:.2f}, Precision: {test_precision:.2f}, Recall: {test_recall:.2f}, F1 Score: {test_f1:.2f}')
print(f' Confusion Matrix:\n{conf_matrix}')


Logistic Regression Cross-Validation Accuracy: Mean = 0.93, Std = 0.04
Logistic Regression:
 Train - Accuracy: 0.96, Precision: 0.96, Recall: 0.96, F1 Score: 0.96
 Test  - Accuracy: 0.93, Precision: 0.93, Recall: 0.93, F1 Score: 0.93
 Confusion Matrix:
[[ 6  0  0  0  0]
 [ 0  3  0  0  0]
 [ 0  0  4  0  1]
 [ 0  0  0 10  1]
 [ 1  0  0  0 14]]


The LogisticRegression model has a mean cross-validation accuracy of 93% with a standard deviation of 0.04, which suggests it performs consistently across the folds.

On the training set, accuracy, precision, recall and F1 score are all 96%.

On the test set, the same four metrics are all 93%.

The small gap between the training and test scores suggests the model generalizes well to unseen data, although the test set holds only 40 rows.
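
A common way to keep the scaling step inside each cross-validation fold, and to reuse exactly the same scaling at prediction time, is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below is a hedged illustration on synthetic data from `make_classification`; `X_demo`, `y_demo` and `pipeline` are assumed names, and the scores it prints do not describe the drug200 data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is not the drug200 dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42)

# The scaler is refitted on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print(f'Mean = {scores.mean():.2f}, Std = {scores.std():.2f}')

# After fitting, predict() scales new rows with the fitted scaler automatically
pipeline.fit(X_demo, y_demo)
preds = pipeline.predict(X_demo[:2])
print(preds)
```

Because the pipeline owns the scaler, there is no separate `transform` call to forget when new rows are scored.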

Confusion Matrix: Rows are true labels and columns are predicted labels. Correct predictions lie on the diagonal:
- drugA (0): 6 instances correctly predicted.
- drugB (1): 3 instances correctly predicted.
- drugC (2): 4 instances correctly predicted and 1 misclassified as drugY.
- drugX (3): 10 instances correctly predicted and 1 misclassified as drugY.
- drugY (4): 14 instances correctly predicted and 1 misclassified as drugA.
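
The per-class figures can be read off the confusion matrix directly. The sketch below copies the test-set matrix reported above and derives per-class precision and recall plus overall accuracy with NumPy; the variable names are illustrative.

```python
import numpy as np

labels = ['drugA', 'drugB', 'drugC', 'drugX', 'drugY']
# Test-set confusion matrix reported above (rows = true, columns = predicted)
cm = np.array([
    [6, 0, 0, 0, 0],
    [0, 3, 0, 0, 0],
    [0, 0, 4, 0, 1],
    [0, 0, 0, 10, 1],
    [1, 0, 0, 0, 14],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # divide by row totals (true counts)
precision = true_positives / cm.sum(axis=0)  # divide by column totals (predicted counts)
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f'{name}: precision={p:.2f}, recall={r:.2f}')
print(f'accuracy={accuracy:.3f}')
```

The accuracy comes out at 37 correct out of 40, i.e. 0.925, which matches the 0.93 reported for the test set after rounding.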

###Label Encode new patients.

Sex Mapping: {'F': 0, 'M': 1}<br>
BP Mapping: {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}<br>
Cholesterol Mapping: {'HIGH': 0, 'NORMAL': 1}<br>
Drug Mapping: {'drugA': 0, 'drugB': 1, 'drugC': 2, 'drugX': 3, 'drugY': 4}

In [None]:
# Prepare new patients' data
new_patients = pd.DataFrame([
    {'Age': 23, 'Sex': 'Female', 'BP': 'HIGH', 'Cholesterol': 'HIGH', 'Na_to_K': 25.355},
    {'Age': 56, 'Sex': 'MALE', 'BP': 'NORMAL', 'Cholesterol': 'HIGH', 'Na_to_K': 11.567}
])

# Encode the categorical features using the existing LabelEncoders
new_patients['Sex'] = le_sex.transform(new_patients['Sex'].map({'Female': 'F', 'MALE': 'M'}))
new_patients['BP'] = le_BP.transform(new_patients['BP'])
new_patients['Cholesterol'] = le_Cholesterol.transform(new_patients['Cholesterol'])

# Display the encoded DataFrame
print(new_patients)


   Age  Sex  BP  Cholesterol  Na_to_K
0   23    0   0            0   25.355
1   56    1   2            0   11.567
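
The reason for mapping 'Female' and 'MALE' to 'F' and 'M' first is that a fitted LabelEncoder only accepts the exact categories it saw during fitting and raises a ValueError for anything else. The sketch below illustrates this with a stand-alone encoder; `encode_sex` and its `aliases` table are hypothetical helpers, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-alone encoder fitted on the same two categories as le_sex
le = LabelEncoder().fit(['F', 'M'])

def encode_sex(raw_values):
    """Map free-form spellings to the fitted categories, then encode them."""
    aliases = {'f': 'F', 'female': 'F', 'm': 'M', 'male': 'M'}  # hypothetical table
    canonical = [aliases[str(v).strip().lower()] for v in raw_values]
    return le.transform(canonical)

encoded = encode_sex(['Female', 'MALE'])
print(encoded.tolist())

# An unmapped spelling is rejected by the encoder
try:
    le.transform(['Female'])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)
```

Normalizing the spelling before encoding keeps new patient records consistent with the categories used in training.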


###Predict for new patients.

In [None]:
# Prepare new patient data for prediction
# Select the same columns, in the same order, as the training features
X_new = new_patients[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']]

# The model was trained on standardized features, so apply the same fitted scaler
# (skipping this step feeds raw values to a model trained on scaled ones)
X_new_scaled = scaler.transform(X_new)

# Predict the drugs for the new patients
predictions = model.predict(X_new_scaled)

# Decode the predicted drugs back to original names
predicted_drugs = le_drug.inverse_transform(predictions)

# Print the results
# enumerate yields each predicted drug with its 0-based position, e.g. (0, 'drugA')
for i, drug in enumerate(predicted_drugs):
    print(f'Patient {i+1} should be given: {drug}')  # i + 1 numbers patients from 1

Patient 1 should be given: drugY
Patient 2 should be given: drugB


