**1. Introduction**

This report documents the analysis of a COVID-19 dataset with the goal of predicting mortality outcomes. The analysis covers data preprocessing, undersampling to handle class imbalance, feature selection, model training with XGBoost, and evaluation of the model's performance.

**2. Structure of the Project**

The project comprises several components:

- Data loading and preprocessing: The dataset is loaded from Google Drive, duplicates are removed, and non-informative values in the 'death_yn' column are filtered out.
- Undersampling: Because the dataset is imbalanced, the majority class is undersampled.
- Feature selection: Relevant features for mortality prediction are selected.
- Model training: An XGBoost classifier is trained on the preprocessed data.
- Evaluation: The trained model is evaluated using accuracy, precision, recall, and F1-score.

**3. Description of Functions and Usage**

The code includes several functions and pipelines:

- Data loading: The data is loaded from Google Drive using the drive.mount() function.
- Undersampling: The majority class (death=0) is resampled using the resample() function from sklearn.utils.
- Feature selection: Relevant features are selected and stored in the features list.
- Preprocessing pipeline: Missing values are imputed using SimpleImputer, and categorical variables are then one-hot encoded using OneHotEncoder.
- Model training pipeline: The XGBoost classifier is trained using a pipeline consisting of preprocessing and classification steps.
- Evaluation metrics: Accuracy, precision, recall, F1-score, the confusion matrix, and the classification report are calculated.

**4. Data Collection and Cleaning**

- Data Source: The dataset is loaded from a CSV file stored on Google Drive.
- Data Cleaning: Duplicates are removed, and non-informative values in the 'death_yn' column are filtered out.

In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
# Load the dataset
data_path = '/content/drive/Shareddrives/Pinkode/data.csv'
df = pd.read_csv(data_path)


Mounted at /content/drive




In [None]:
# Removing duplicates
df = df.drop_duplicates()

from sklearn.utils import resample

# Keep only rows with an informative outcome; .copy() avoids pandas'
# SettingWithCopyWarning when the new column is added below
df = df[df['death_yn'].isin(['Yes', 'No'])].copy()

# Create a binary outcome variable for death (1 = "Yes", 0 = "No")
df['death'] = (df['death_yn'] == 'Yes').astype(int)

# Count the number of rows in each class
death_counts = df['death'].value_counts()

# Number of "Yes" rows (minority class), looked up by label rather than by position
minority_class_count = death_counts.loc[1]

# Randomly sample 49,000 rows from the majority class ("No")
df_majority_undersampled = resample(df[df['death'] == 0],  # Select rows with death=0 ("No")
                                     replace=False,    # Sample without replacement
                                     n_samples=49000,  # Sample 49,000 rows
                                     random_state=42)  # Set a random seed for reproducibility

# Combine the undersampled majority with the minority class
df_undersampled = pd.concat([df_majority_undersampled, df[df['death'] == 1]])  # Combine with death=1 ("Yes")

# Shuffle the combined DataFrame
df_undersampled = df_undersampled.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle rows

# Now you can proceed with feature selection, model training, etc. using df_undersampled
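The cell above hard-codes `n_samples=49000` for the majority class. As a minimal sketch of an alternative, the sample size can be derived from the data so that both classes end up the same size. The helper name `undersample_majority` and the toy DataFrame are hypothetical and are not part of the original notebook; it uses `DataFrame.sample` instead of `resample`.

```python
import pandas as pd


def undersample_majority(frame, label_col="death", random_state=42):
    """Downsample each class to the size of the rarest class (hypothetical helper)."""
    counts = frame[label_col].value_counts()
    n_minority = counts.min()  # size of the rarest class, regardless of ordering
    parts = [
        frame[frame[label_col] == label].sample(n=n_minority, random_state=random_state)
        for label in counts.index
    ]
    # Concatenate the per-class samples and shuffle the rows
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )


# Toy data (hypothetical): 8 rows with death=0 and 2 rows with death=1
toy = pd.DataFrame({"death": [0] * 8 + [1] * 2, "age_group": list("abcdefghij")})
balanced = undersample_majority(toy)
print(balanced["death"].value_counts().to_dict())
```

Deriving the count from the data keeps the classes balanced if the input changes, at the cost of a sample size that is no longer fixed in advance.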


In [None]:

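# Selecting relevant features
features = ['symptom_status','res_state','age_group', 'sex', 'race', 'ethnicity', 'hosp_yn', 'icu_yn', 'underlying_conditions_yn']
# Note: X and y are built from df (the full filtered dataset), not from df_undersampled,
# so the train/test split and the metrics reported below are based on the imbalanced data
X = df[features]
y = df['death']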


In [None]:
death_counts = df_undersampled['death'].value_counts()
print(death_counts)

death
1    49359
0    49000
Name: count, dtype: int64


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# One-hot encoding for categorical variables
categorical_features = ['symptom_status','res_state','age_group', 'sex', 'race', 'ethnicity', 'hosp_yn', 'icu_yn', 'underlying_conditions_yn']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])



# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


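The split above does not stratify on the outcome. With a rare positive class, a stratified split keeps the class ratio the same in the training and test sets. The following is a small sketch on hypothetical toy data, using the `stratify` argument of `train_test_split`; it is not what the original notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 10% of them in the positive class
toy = pd.DataFrame({
    "hosp_yn": ["Yes", "No"] * 50,
    "death": [1] * 10 + [0] * 90,
})
X_toy, y_toy = toy[["hosp_yn"]], toy["death"]

# stratify=y_toy preserves the 10% positive rate in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(f"train positive rate: {y_tr.mean():.2f}, test positive rate: {y_te.mean():.2f}")
```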

In [None]:
# Another modeling approach: XGBoost
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

# Create a pipeline with preprocessing and XGBoost model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Imputation and one-hot encoding defined in the previous cell
    ('classifier', XGBClassifier(objective='binary:logistic', random_state=42))
])

# Train the model
model.fit(X_train, y_train)


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Predictions
y_pred = model.predict(X_test)

# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")


Accuracy: 0.9837793577051143
Precision: 0.8359219434488252
Recall: 0.42271674554425537
F1-Score: 0.5614926770547717
Confusion Matrix:
[[393483    824]
 [  5733   4198]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    394307
           1       0.84      0.42      0.56      9931

    accuracy                           0.98    404238
   macro avg       0.91      0.71      0.78    404238
weighted avg       0.98      0.98      0.98    404238
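The headline metrics can be recomputed by hand from the confusion matrix printed above, which helps when interpreting them. The sketch below uses only the four reported counts; the majority-class baseline is an added comparison, not something the original notebook computes.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions
tn, fp = 393483, 824
fn, tp = 5733, 4198
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted deaths that were actual deaths
recall = tp / (tp + fn)     # share of actual deaths that the model caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts death=0
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={baseline_accuracy:.4f}")
```

Because only about 2.5% of the test rows are positive, always predicting death=0 already yields an accuracy of roughly 0.975, so recall and F1 for the positive class are more informative here than accuracy.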



**5. Challenges/Limitations/Assumptions**

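- Class Imbalance: The dataset is heavily imbalanced toward the negative class (death=0). Undersampling is used to build a balanced dataset (49,000 negatives and 49,359 positives), at the cost of discarding most majority-class rows. Note, however, that X and y are built from df rather than df_undersampled, so the reported metrics (404,238 test rows, of which 9,931 are positive) reflect the imbalanced data.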
- Feature Selection: Features are selected based on assumed relevance to mortality prediction. The actual relevance might vary, and feature engineering could improve model performance.
- Assumptions: The analysis assumes the provided features are sufficient for mortality prediction and that the data is representative of the population.
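One alternative to discarding majority-class rows is to reweight the positive class during training. XGBoost exposes a `scale_pos_weight` parameter for this, and its documentation suggests the ratio of negative to positive instances as a starting value. The sketch below only computes that ratio. It uses the test-set counts from the classification report above because the training-set counts are not shown; in practice the ratio would be computed from `y_train`. This approach was not used in the original notebook.

```python
# Class counts taken from the classification report above (test set only)
n_negative = 394307  # death = 0
n_positive = 9931    # death = 1

# Suggested starting value: number of negatives divided by number of positives
scale_pos_weight = n_negative / n_positive
print(round(scale_pos_weight, 1))

# Hypothetical usage, not run in the original notebook:
# XGBClassifier(objective='binary:logistic',
#               scale_pos_weight=scale_pos_weight,
#               random_state=42)
```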

**6. Conclusion**

This technical report outlines the process of analyzing a COVID-19 dataset for mortality prediction. It covers data preprocessing, undersampling, feature selection, model training using XGBoost, and evaluation of the model's performance. On the held-out test set the model reaches an accuracy of 0.98, but recall for the positive class is only 0.42 (precision 0.84, F1-score 0.56), which means more than half of the deaths are missed. Challenges such as class imbalance and assumptions regarding feature relevance are discussed alongside the project structure, functions, and data processing steps.