This notebook trains, evaluates, and visualizes a logistic regression model for outbreak detection using weather and epidemiological data.

**Importing Libraries**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score

**Loading the Dataset**

In [None]:
# Load the dataset
data = pd.read_csv("../Datasets/outbreak_detect.csv")

**Data Preprocessing**
- Drops every row that has a missing value in any column, using the dropna function.
- Separates the features (X) from the target variable (y).
- Encodes the target variable 'Outbreak' as binary, where 'Yes' becomes 1 and every other value becomes 0.

In [None]:
# Preprocess the data
data.dropna(inplace=True)
X = data[['maxTemp', 'minTemp', 'avgHumidity', 'Rainfall', 'Positive', 'pf']]  # Features
y = data['Outbreak']  # Target variable

# Encode target variable 'Outbreak' as binary (0 and 1)
y_binary = (y == 'Yes').astype(int)
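
The comparison `y == 'Yes'` is exact, so a label such as ' yes' or 'NO' would be encoded as 0 without any warning. The sketch below shows one way to normalize and validate the labels before encoding. The `toy` frame and its values are hypothetical stand-ins, not rows from outbreak_detect.csv, and whether the real file contains such variants is an assumption worth checking.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Rainfall": [12.0, 0.0, 30.5, None, 8.2],
    "Outbreak": ["Yes", "No", " yes", "No", "NO"],
})

# Same step as the notebook: drop rows with any missing value
toy = toy.dropna()

# Normalize whitespace and case, then fail loudly on unexpected labels
labels = toy["Outbreak"].str.strip().str.lower()
unexpected = set(labels.unique()) - {"yes", "no"}
if unexpected:
    raise ValueError(f"Unexpected Outbreak labels: {unexpected}")

y_binary_toy = (labels == "yes").astype(int)
y_naive_toy = (toy["Outbreak"] == "Yes").astype(int)

# Class counts after normalization versus the exact-match encoding
print("normalized:", y_binary_toy.value_counts().to_dict())
print("exact match:", y_naive_toy.value_counts().to_dict())
```

Printing the class counts also reveals how imbalanced the target is, which matters when reading the accuracy figure later on.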

**Train-Test Split**
- Splits the dataset into training and testing sets with scikit-learn's train_test_split function. The testing set is 20% of the whole dataset, and random_state is fixed at 42 so the split is reproducible.

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)
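
If outbreaks are rare, a purely random split can leave the test set with a different share of positive cases than the training set. A minimal sketch of the `stratify` option of train_test_split, which keeps the class ratio the same in both parts, follows. The arrays are synthetic, and the 20% positive rate is an assumption chosen for illustration; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 20 of them positive
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy preserves the 20% positive rate in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr_s.mean())
print("test positive rate:", y_te_s.mean())
```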

**Model Training**
- Initializes a scikit-learn LogisticRegression model with max_iter set to 1000 (the default is 100), so the solver has enough iterations to converge on the unscaled features instead of stopping early with a convergence warning.
- Trains the model on the training data (X_train and y_train) with the fit method.

In [None]:
# Train the model
model = LogisticRegression(max_iter=1000)  # Raise max_iter from the default 100 so the solver can converge
model.fit(X_train, y_train)
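
The notebook imports StandardScaler but never applies it. Standardizing the features is a common alternative to raising max_iter, because the solver usually converges faster when all features share a similar scale. The sketch below wraps the scaler and the model in a pipeline so the scaler is fitted on the training data only. It runs on synthetic data from make_classification with artificially mismatched feature scales, both of which are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with six features on very different scales
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_syn = X_syn * np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 1.0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses it for X_te
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_tr, y_tr)

test_accuracy = scaled_model.score(X_te, y_te)
n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("Test accuracy:", test_accuracy)
print("Solver iterations used:", n_iter)
```

Scaling also changes the meaning of the coefficients: they become comparable across features, since each one then refers to a change of one standard deviation.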

**Prediction**
- Predicts class labels (y_pred) for the testing set using the trained model; the predict method applies the default 0.5 probability cutoff.

In [None]:
# Make predictions
y_pred = model.predict(X_test)

**Model Evaluation**
- Calculates the accuracy of the model by comparing the predicted labels (y_pred) with the true labels (y_test) using the accuracy_score function from scikit-learn.
- Generates a classification report using the classification_report function, which includes metrics such as precision, recall, and F1-score.
- Generates a confusion matrix using the confusion_matrix function, which shows the counts of true positive, true negative, false positive, and false negative predictions.

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
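
To make the link between the confusion matrix and the reported metrics concrete, the sketch below unpacks the four cells and recomputes accuracy, precision, and recall by hand. The label arrays are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = outbreak)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, rows are actual classes and columns are predicted
# classes, so ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_toy = (tp + tn) / (tp + tn + fp + fn)
precision_toy = tp / (tp + fp)  # share of predicted outbreaks that were real
recall_toy = tp / (tp + fn)     # share of real outbreaks that were caught

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy:", accuracy_toy)
print("precision:", precision_toy)
print("recall:", recall_toy)
```

For outbreak detection, recall is often the figure to watch, since a false negative is a missed outbreak.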

**ROC Curve and AUC Calculation**
- Calculates the predicted probabilities for the positive class (class 1, i.e. 'Yes') using the predict_proba method.
- Calculates the False Positive Rate (FPR), True Positive Rate (TPR), and thresholds for different probability cutoffs using the roc_curve function.
- Calculates the Area Under the Curve (AUC) score using the roc_auc_score function.
- Plots the ROC curve against the diagonal that a random classifier would produce.

In [None]:
# Calculate predicted probabilities for positive class
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
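
The thresholds returned by roc_curve can also be used to pick a cutoff other than the default 0.5. The sketch below uses Youden's J statistic (TPR minus FPR), which is one common heuristic and not something the notebook itself does. The labels and scores are hypothetical, and in practice the cutoff would be chosen on a validation set rather than on the test set.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities for the positive class
y_true_roc = np.array([0, 0, 0, 0, 1, 1, 1])
scores_roc = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.9])

fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_roc, scores_roc)

# Youden's J: the point on the curve farthest above the diagonal
j_scores = tpr_toy - fpr_toy
best_idx = int(np.argmax(j_scores))
best_threshold = thresholds_toy[best_idx]

# Apply the chosen cutoff instead of the default 0.5
y_pred_custom = (scores_roc >= best_threshold).astype(int)

print("Best threshold:", best_threshold)
print("Youden's J at that threshold:", j_scores[best_idx])
print("Predictions:", y_pred_custom.tolist())
```

Lowering the cutoff trades more false alarms for fewer missed outbreaks, so the right choice depends on the relative cost of each error.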

**Plotting Visualizations**
- Generates a heatmap of the confusion matrix, annotating cell counts and visualizing actual versus predicted labels.
- Displays a histogram to visualize the distribution of predicted probabilities for the positive class.

In [None]:
# Plot confusion matrix heatmap
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Plot distribution of predicted probabilities
plt.figure(figsize=(8, 6))
sns.histplot(y_pred_proba, bins=20, kde=True, color='blue')
plt.title("Distribution of Predicted Probabilities")
plt.xlabel("Predicted Probability")
plt.ylabel("Frequency")
plt.show()