## COVID TEST FINDINGS CLASSIFICATION USING ML MODELS.

sex: 1 for female and 2 for male.

age: age of the patient.

classification: covid test findings. Values 1-3 mean that the patient was diagnosed with covid in different degrees. 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.

patient type: type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.

pneumonia: whether the patient already has air sac inflammation or not.

pregnancy: whether the patient is pregnant or not.

diabetes: whether the patient has diabetes or not.

copd: Indicates whether the patient has Chronic obstructive pulmonary disease or not.

asthma: whether the patient has asthma or not.

inmsupr: whether the patient is immunosuppressed or not.

hypertension: whether the patient has hypertension or not.

cardiovascular: whether the patient has heart or blood vessels related disease.

renal chronic: whether the patient has chronic renal disease or not.

other disease: whether the patient has other disease or not.

obesity: whether the patient is obese or not.

tobacco: whether the patient is a tobacco user.

usmer: Indicates whether the patient was treated in a medical unit of the first, second or third level.

medical unit: type of institution of the National Health System that provided the care.

intubed: whether the patient was connected to the ventilator.

icu: Indicates whether the patient had been admitted to an Intensive Care Unit.

date died: the date of death if the patient died, and 9999-99-99 otherwise.
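
As a small illustration of the encodings described above, the sketch below collapses the classification codes into a binary flag. It assumes the column is named `CLASIFFICATION_FINAL`, as in the code cells of this notebook; the `COVID_POSITIVE` column name and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from the real dataset.
toy = pd.DataFrame({"CLASIFFICATION_FINAL": [1, 3, 4, 7, 2]})

# Values 1-3 mean a confirmed case; 4 or higher means negative or inconclusive.
toy["COVID_POSITIVE"] = toy["CLASIFFICATION_FINAL"].between(1, 3).astype(int)

print(toy)
```

Whether a binary or a multi-class target is more useful depends on the question being asked; the notebook itself keeps the original multi-class codes.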

### Step 1: Load and Explore the Dataset
#### First, we import necessary libraries and load the data for preliminary exploration.

In [None]:
# import required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier


In [None]:
#import sys
#print(sys.version)

In [None]:
# Import the warnings module and suppress warnings raised during program execution
import warnings
warnings.filterwarnings('ignore')

In [None]:
# read the dataset in csv format
df = pd.read_csv("Covid.csv")

In [None]:
# display the first 20 rows of the dataset
df.head(20)

In [None]:
# Check the number of rows and columns
df.shape

In [None]:
# display dataset table info
df.info()

In [None]:
# display dataset statistics info
df.describe()

In [None]:
# display all the dataset features or column names
df.columns

### Step 2: Data Preprocessing
#### Next, we handle missing values, encode categorical variables if they are not already encoded, and address any data inconsistencies.

In [None]:
# check for total null values in each column
df.isnull().sum()

##### The output shows that there are no missing (null) values.
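
`isnull()` only detects true NaN values. Integer-coded datasets sometimes mark unknown answers with sentinel codes instead, so it may be worth checking the dataset's documentation for such codes. The sketch below assumes, purely for illustration, that 97, 98 and 99 are sentinel codes; the `SENTINELS` name and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel codes; confirm them against the dataset's documentation.
SENTINELS = [97, 98, 99]

toy = pd.DataFrame({
    "PNEUMONIA": [1, 2, 99, 2],
    "INTUBED": [97, 97, 1, 2],
})

# Replace sentinel codes with NaN so that isnull() can see them.
# Apply this only to coded columns: a numeric column such as AGE
# can legitimately contain the values 97, 98 or 99.
cleaned = toy.replace(SENTINELS, np.nan)
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```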

In [None]:
# Convert 'DATE_DIED' to a binary 'DIED' feature (1 if died, 0 otherwise)
df['DIED'] = df['DATE_DIED'].apply(lambda x: 0 if x == '9999-99-99' else 1)

# Drop the original 'DATE_DIED' column
df.drop('DATE_DIED', axis=1, inplace=True)



In [None]:
# Check the modified dataframe
df.head(20)

In [None]:
# Check the unique values of every column ('DATE_DIED' has already been dropped)
features = ['USMER', 'MEDICAL_UNIT', 'SEX', 'PATIENT_TYPE', 'INTUBED',
                'PNEUMONIA', 'AGE', 'PREGNANT', 'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR',
                'HIPERTENSION', 'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY',
                'RENAL_CHRONIC', 'TOBACCO', 'CLASIFFICATION_FINAL', 'ICU']


for i in features:
    unique_values = df[i].unique()
    print(f"Column: {i}, Unique Values: {unique_values}")

In [None]:
# Display all unique ages in sorted order
age_unique_values = df['AGE'].unique()
age_unique_values_sorted = sorted(age_unique_values)

print(f"Unique Values in 'AGE' column (Sorted): {age_unique_values_sorted}")

In [None]:
# Visualize missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()

### Step 3: Exploratory Data Analysis (EDA)
#### Visualize the data to understand the distribution and relationship between variables.

In [None]:
# Visualizing the distribution of ages
plt.figure(figsize=(10, 6))
sns.histplot(df['AGE'], bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

### Comment 
##### The histogram above depicts the distribution of ages among COVID-19 cases within a particular dataset. It is characterized by a roughly bell-shaped curve, skewed slightly to the right, suggesting that the number of cases tends to be higher in middle-aged individuals, with a gradual decrease in frequency among older populations.

##### The most common age range for COVID-19 cases in this dataset is between approximately 30 and 60 years, with the peak frequency in the 50s age group. The number of cases among the very young and elderly is significantly lower, which might reflect a variety of factors including exposure risks, social behaviors, and perhaps the data collection focus.

In [None]:
# Relationship between age and COVID classification
plt.figure(figsize=(10, 6))
sns.boxplot(x='CLASIFFICATION_FINAL', y='AGE', data=df)
plt.title('COVID Classification vs Age')
plt.xlabel('COVID Classification')
plt.ylabel('Age')
plt.show()

### Comment
##### This box plot visualizes the relationship between patient age and COVID-19 classification, which appears to denote the severity or status of the disease. Each box represents the interquartile range (IQR) of ages for patients within each classification, with the horizontal line inside the box marking the median age.

##### Key observations from the chart are as follows:

COVID Classifications 1, 2, and 3: These likely represent confirmed cases of COVID-19, with different degrees of severity. The median age for these classifications is roughly between 40 and 60 years. The age distributions for these classifications are quite similar, although classification 2 has a slightly higher median age and more variability, as indicated by the longer IQR.

Classifications 4, 5, 6, and 7: These might denote negative tests, inconclusive tests, or different medical classifications related to the disease. There's a noticeable shift toward a younger median age in classification 4, and the ages broaden out for classifications 5 and 6, suggesting a wider age range among those classifications. Classification 7 has a higher median age, similar to the confirmed cases, but with a very wide age range, indicating this classification affects a broad spectrum of ages.

Outliers and Spread: There are outliers present, particularly in classifications 1 and 2, indicating that there are patients significantly younger than the typical age range. The spread of ages, indicated by the "whiskers" of the plot, is quite wide for all classifications, showing that COVID-19 affects a broad age range.

For stakeholders, this chart indicates that while COVID-19 confirmed cases (classifications 1-3) tend to be more prevalent in the middle-aged groups, the disease does not exclusively affect these ages. There is significant variance and outliers, especially in classifications deemed less severe or inconclusive, pointing to the need for vigilance and healthcare support across all age groups. The relatively consistent median age across several classifications suggests that midlife adults are generally the most affected, which might inform how resources for prevention and treatment are allocated.
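
The quantities that a box plot draws (median and IQR per group) can also be tabulated, which makes the comparison between classifications easier to read off. A minimal sketch on toy data, assuming the same column names as the notebook; the `iqr` helper and the toy values are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "CLASIFFICATION_FINAL": [1, 1, 1, 3, 3, 3, 7, 7, 7],
    "AGE": [30, 50, 70, 40, 45, 50, 20, 40, 90],
})

def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)

# One row per classification with the median age and the IQR of age.
summary = toy.groupby("CLASIFFICATION_FINAL")["AGE"].agg(median="median", iqr=iqr)
print(summary)
```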

In [None]:
# Visualize correlation matrix
# Check for Multi-collinearity:
# Multi-collinearity refers to high correlations between predictor variables

plt.figure(figsize=(16, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

### Considering the heatmap:

Features such as 'INTUBED', 'PNEUMONIA', and 'ICU' have a stronger negative correlation with 'PATIENT_TYPE', which indicates their potential importance in predicting patient outcomes.

'PREGNANT' has a high positive correlation with 'SEX', which makes sense since only females can be pregnant; thus, one of these features might be redundant but could also be critical to the health sector.

'DIED' has a strong negative correlation with 'PATIENT_TYPE', suggesting that 'DIED' is a significant predictor of patient outcomes.

'USMER' and 'MEDICAL_UNIT' have low correlations with other features, suggesting that they might not be as critical for the model.

### Based on these observations:

Retain 'DIED', 'INTUBED', 'PNEUMONIA', and 'ICU' due to their strong correlations with patient outcomes.

Consider removing or combining 'PREGNANT' and 'SEX' to avoid redundancy.

Review 'USMER' and 'MEDICAL_UNIT' for clinical significance; they could potentially be removed if they do not add predictive value or if they're not operationally relevant.
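
Reading strongly correlated pairs off an annotated heatmap is error-prone when there are many features. A minimal sketch of listing them programmatically; the `correlated_pairs` helper, the 0.8 threshold and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy data for illustration only.
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],  # perfectly correlated with A
    "C": [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data the call would be `correlated_pairs(df)`, after which each listed pair can be reviewed for redundancy.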

### Step 4: Feature Engineering and Selection
##### Create new features and select the most relevant features for the model.
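
As one possible example of creating a new feature, a continuous column such as 'AGE' can be binned into groups. This is only a sketch: the `AGE_GROUP` column, the bin edges and the labels are hypothetical choices and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy ages for illustration only.
toy = pd.DataFrame({"AGE": [5, 17, 18, 45, 64, 65, 90]})

# Hypothetical bin edges and labels; right=False makes each bin [lower, upper).
bins = [0, 18, 40, 65, 200]
labels = ["child", "young_adult", "middle_aged", "senior"]
toy["AGE_GROUP"] = pd.cut(toy["AGE"], bins=bins, labels=labels, right=False)

print(toy)
```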

In [None]:
# Drop the columns treated as irrelevant here; 'PREGNANT' is retained despite its correlation with 'SEX'
irrelevant_columns = ['CARDIOVASCULAR', 'COPD', 'ASTHMA' , 'INMSUPR', 'OTHER_DISEASE', 'TOBACCO','USMER']
df = df.drop(columns=irrelevant_columns)


In [None]:
df.head()

In [None]:
# Visualize correlation matrix after features removal
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

In [None]:
# Visualize variance of numerical features
numerical_features = df.select_dtypes(include=[np.number])
variances = numerical_features.var()
variances.plot(kind='bar')
plt.show()


In [None]:
# Plot histograms for individual numeric variables
# This helps in understanding the spread and skewness of the data.
numeric_columns = df.select_dtypes(include='number').columns

for col in numeric_columns:
    plt.figure(figsize=(6, 4))
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# Checking for outlier using Boxplot
# Plot boxplots for numeric variables
for col in numeric_columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

In [None]:
# Statistical significance of variables:
# For assessing the significance of categorical variables against a numerical one (e.g., 'AGE') using statistical tests like ANOVA:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# ANOVA for a categorical variable 'SEX' against numeric variable 'AGE'
model = ols('AGE ~ C(SEX)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

In summary, the ANOVA results above indicate that mean 'AGE' differs between the 'SEX' groups, and that the difference is statistically significant (given the extremely low p-value). The high F-statistic further supports the conclusion that the difference in mean 'AGE' across 'SEX' is unlikely to be due to random variation. In a large dataset, however, even a small difference in means can yield a very low p-value, so significance alone does not indicate how large the difference is.
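
To complement the p-value, an effect size such as eta squared (between-group sum of squares divided by total sum of squares) shows how much of the variance in 'AGE' the grouping accounts for. A minimal sketch on toy data; the `eta_squared` helper and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def eta_squared(frame, group_col, value_col):
    """Share of the variance in value_col explained by group_col (one-way ANOVA)."""
    grand_mean = frame[value_col].mean()
    groups = frame.groupby(group_col)[value_col]
    # Between-group sum of squares: group size times squared distance of group mean.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Total sum of squares around the grand mean.
    ss_total = ((frame[value_col] - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Toy data: the two groups differ by 2 years on average.
toy = pd.DataFrame({
    "SEX": [1, 1, 1, 2, 2, 2],
    "AGE": [30, 40, 50, 32, 42, 52],
})
effect = eta_squared(toy, "SEX", "AGE")
print(round(effect, 3))
```

A value close to 0 means the grouping explains very little of the variance, even when the p-value is small.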

In [None]:
# Checking for class imbalance

# Checking class distribution
class_counts = df['CLASIFFICATION_FINAL'].value_counts()
print(class_counts)

# Visualizing class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='CLASIFFICATION_FINAL', data=df)
plt.title('Class Distribution')
plt.show()

The classes are significantly imbalanced, with some classes having a substantially larger number of instances compared to others. Class imbalance can potentially lead to biased models that are more accurate at predicting the majority class but perform poorly on minority classes.

Addressing class imbalance is crucial to ensure that the model is not biased toward the majority class and that all classes are adequately represented. Several techniques can be employed:

Resampling Techniques: Oversampling the minority class (e.g., using techniques like SMOTE - Synthetic Minority Over-sampling Technique) or undersampling the majority class to balance the class distribution.

Generating Synthetic Samples: Creating synthetic samples for the minority class to balance the dataset.

Weighted Models: Assigning different weights to classes, giving higher weights to the minority class to increase their influence during model training.

Different Algorithms: Using algorithms that handle class imbalance better, like ensemble methods (Random Forest, Gradient Boosting) or algorithms that inherently account for class imbalance.
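
The weighted-model option listed above can be sketched with scikit-learn's `compute_class_weight`, which in "balanced" mode assigns each class the weight `n_samples / (n_classes * class_count)`. The toy labels below are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy, imbalanced labels for illustration only.
y_toy = np.array([3, 3, 3, 3, 3, 3, 7, 7, 7, 1])

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Map each class label to its weight; rarer classes receive larger weights.
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Such a dictionary can be passed as the `class_weight` argument of estimators like `RandomForestClassifier` or `SVC`; passing `class_weight="balanced"` directly has the same effect.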


Balancing an imbalanced dataset helps models learn effectively across all classes. A common method to address class imbalance is oversampling or undersampling. Here, I'll demonstrate oversampling using the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset.
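
Whichever oversampling method is used, a common precaution is to split first and resample only the training rows, so that the test set keeps the original class distribution. The sketch below illustrates that order with plain random oversampling via `sklearn.utils.resample`, used here as a simpler stand-in for SMOTE; the toy frame and the `TARGET` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy, imbalanced data for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AGE": rng.integers(1, 90, size=100),
    "TARGET": [0] * 80 + [1] * 20,
})

# Split first so that the test set keeps the original class distribution.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["TARGET"], random_state=42
)

# Oversample every class in the training split up to the majority class size.
majority_size = train["TARGET"].value_counts().max()
balanced_parts = [
    resample(group, replace=True, n_samples=majority_size, random_state=42)
    for _, group in train.groupby("TARGET")
]
train_balanced = pd.concat(balanced_parts)

print(train_balanced["TARGET"].value_counts())
print(test["TARGET"].value_counts())
```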

### SMOTE for Balancing the Dataset

In [None]:
# Import SMOTE from the imbalanced-learn package
from imblearn.over_sampling import SMOTE

# Separate features and target variable
X = df.drop('CLASIFFICATION_FINAL', axis=1)
y = df['CLASIFFICATION_FINAL']

# Initialize SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42)

# Apply SMOTE to balance the dataset
X_resampled, y_resampled = smote.fit_resample(X, y)

# Combine the resampled features and target variable into a new DataFrame
df_resampled = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), pd.Series(y_resampled, name='CLASIFFICATION_FINAL')], axis=1)

In [None]:
# Check the distribution after balancing
print(df_resampled['CLASIFFICATION_FINAL'].value_counts())

## Splitting features and target
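
One ordering detail matters when splitting and scaling: if the scaler is fitted before the split, it sees the test rows as well. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only), using synthetic data; all `*_toy` and `toy_*` names are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(6))
```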

In [None]:
# Import the preprocessing, modelling and evaluation utilities
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

import plotly.graph_objects as go

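# Separate features and target
# Note: this uses the original `df`; the SMOTE-balanced `df_resampled` built above is not used from here on
X = df.drop(columns=['CLASIFFICATION_FINAL'])
y = df['CLASIFFICATION_FINAL']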

# Encoding categorical variables (if any)
label_encoder = LabelEncoder()
X_encoded = X.copy()
for col in X.columns:
    if X[col].dtype == 'object':
        X_encoded[col] = label_encoder.fit_transform(X[col])

# Scaling features if necessary
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)

# Convert y to a single-column DataFrame
y=y.to_frame()

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [None]:

y_train

In [None]:
# Check that the shape of each split is as expected
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

## Models Implementation (RFC, SVC, XGBoost, DT)
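
One way to organise several models is to fit each one in a loop and store its predictions and score in a dictionary, which later cells can then compare. The sketch below shows that pattern on synthetic data with two scikit-learn classifiers; the `candidates` and `toy_results` names and the hyperparameters are hypothetical, and it is not the notebook's own training code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the real features and target.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each candidate and keep the model, its predictions and its weighted F1.
toy_results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    predictions = model.predict(X_te)
    toy_results[name] = {
        "model": model,
        "predictions": predictions,
        "f1": f1_score(y_te, predictions, average="weighted"),
    }

for name, res in toy_results.items():
    print(f"{name}: weighted F1 = {res['f1']:.3f}")
```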

In [None]:
# Model Implementation and Evaluation
models = {}

In [None]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize
from scipy.stats import loguniform, randint, uniform
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

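# Implement SVC with kernel=linear on X_train and y_train
# Note: a linear-kernel SVC with probability=True can be very slow on large datasets
svm = SVC(kernel="linear", probability=True)  # Enable probability estimates
svm.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array, as scikit-learn expects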

# X_train and X_test are NumPy arrays produced by the same split of X_scaled,
# so their columns already match and no re-indexing is needed

# Predictions
y_pred_svm = svm.predict(X_test)

# Flatten y_test if it's a DataFrame
y_test_flat = y_test.values.flatten()

# Use predict_proba for obtaining class probabilities
y_pred_svm_probs = svm.predict_proba(X_test)

# Now calculate metrics
svm_metrics = {
    "Accuracy": accuracy_score(y_test_flat, y_pred_svm),
    "Precision": precision_score(y_test_flat, y_pred_svm, average='weighted'),
    "Recall": recall_score(y_test_flat, y_pred_svm, average='weighted'),
    "F1-score": f1_score(y_test_flat, y_pred_svm, average='weighted'),
    "AUC-ROC": roc_auc_score(y_test_flat, y_pred_svm_probs, multi_class='ovr'),
}

print(svm_metrics)

In [None]:
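# Plotting F1 scores for comparison
# `results` maps each model name to its fitted model and test predictions.
# Only the SVC is trained above, so other models should be added here once they are fitted.
results = {"SVC (linear)": {"model": svm, "predictions": y_pred_svm}}
f1_scores = {name: f1_score(y_test_flat, res['predictions'], average='weighted') for name, res in results.items()}
fig = go.Figure([go.Bar(x=list(f1_scores.keys()), y=list(f1_scores.values()))])
fig.update_layout(title='Model Comparison Based on F1 Score', xaxis_title='Model', yaxis_title='F1 Score')
fig.show()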

In [None]:
# Select the model with the highest F1 score
best_model_name = max(f1_scores, key=f1_scores.get)
best_model = results[best_model_name]['model']
print(f"The best performing model is {best_model_name} with an F1 score of {f1_scores[best_model_name]}")