# Template notebook

It's good to start with an introduction, to set the scene and introduce your audience to the data, and the problem you're solving as a team.

<br>

## Libraries
As always, we'll start by importing the necessary libraries.

In [760]:
# !pip install numpy pandas matplotlib seaborn ipykernel plotly nbformat scikit-learn

After installation, it is necessary to restart the kernel.

In [761]:
# It's good practice to add comments to explain your code 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

**Question / Task 1**

Insert context about question / task 1 here.

### The problem

In [762]:
# Load the raw test records used for exploration and training
df = pd.read_csv("data/corona_tested_individuals_ver_006.english.csv")


Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
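The warning above can usually be avoided by declaring the column type at import time. The sketch below is only an illustration on a hypothetical three-row sample; it assumes that column index 7 corresponds to `age_60_and_above` (as the `df.info()` listing further down suggests) and that reading it as a string column is acceptable. Passing `low_memory=False`, as the warning itself suggests, is an alternative.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics part of the real file's layout.
sample_csv = io.StringIO(
    "test_date,cough,age_60_and_above\n"
    "2020-04-30,0,Yes\n"
    "2020-04-30,1,No\n"
    "2020-04-29,0,\n"
)

# Declaring the dtype up front means pandas does not have to infer it chunk by chunk.
sample = pd.read_csv(sample_csv, dtype={"age_60_and_above": "string"})
print(sample.dtypes)
```

Empty cells are still read as missing values, so later calls to `isna()` behave as before.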



In [763]:
df["test_indication"].value_counts()

test_indication
Other                     242741
Abroad                     25468
Contact with confirmed     10639
Name: count, dtype: int64

### Description of the Dataset

The dataset consists of the following features for COVID-19 test records:

* **`test_date`**: The date when the test was conducted.
* **`cough`**: Binary value indicating the presence (1) or absence (0) of a cough.
* **`fever`**: Binary value indicating the presence (1) or absence (0) of a fever.
* **`sore_throat`**: Binary value indicating the presence (1) or absence (0) of a sore throat.
* **`shortness_of_breath`**: Binary value indicating the presence (1) or absence (0) of shortness of breath.
* **`head_ache`**: Binary value indicating the presence (1) or absence (0) of a headache.
* **`corona_result`**: The result of the COVID-19 test, which can be 'negative', 'positive', or possibly 'other'.

<details style="padding-left: 3rem">
    <summary>more details</summary>
    <p>In the context of COVID-19 test results, the category "other" typically represents test outcomes that do not fall neatly into the binary categories of "negative" or "positive." Here are some possible meanings for "other":</p>
    <ul>
        <li>Indeterminate or Inconclusive: The test result was neither clearly positive nor negative. This can happen if the test sample was insufficient or contaminated.</li>
        <li>Pending: The test result has not yet been finalized or reported.</li>
        <li>Invalid: The test was not conducted properly, or there was an error in the testing process, leading to an invalid result.</li>
        <li>Recovered: In some datasets, individuals who have previously tested positive and are now considered recovered may be categorized separately.</li>
    </ul>
    <p>Understanding the exact meaning of "other" would require more detailed documentation or metadata from the dataset provider.</p>

</details>

* **`age_60_and_above`**: Indicator of whether the individual is aged 60 or above ('Yes', 'No'), with some missing values (NaN).
* **`gender`**: The gender of the individual ('male' or 'female').
* **`test_indication`**: The reason for taking the test, categorized as 'Contact with confirmed', 'Abroad', or 'Other'.

The dataset captures various symptoms and demographic information along with COVID-19 test results, which can be used for exploratory data analysis and model building to predict COVID-19 test outcomes based on symptoms and other features.

In [764]:
df.head()

Unnamed: 0,test_date,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result,age_60_and_above,gender,test_indication
0,2020-04-30,0.0,0.0,0.0,0.0,0.0,negative,,female,Other
1,2020-04-30,1.0,0.0,0.0,0.0,0.0,negative,,female,Other
2,2020-04-30,0.0,1.0,0.0,0.0,0.0,negative,,male,Other
3,2020-04-30,1.0,0.0,0.0,0.0,0.0,negative,,female,Other
4,2020-04-30,1.0,0.0,0.0,0.0,0.0,negative,,male,Other


In [765]:
df.describe()

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache
count,278596.0,278596.0,278847.0,278847.0,278847.0
mean,0.151574,0.078077,0.006907,0.005655,0.008657
std,0.358608,0.268294,0.082821,0.07499,0.09264
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0


In [766]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278848 entries, 0 to 278847
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   test_date            278848 non-null  object 
 1   cough                278596 non-null  float64
 2   fever                278596 non-null  float64
 3   sore_throat          278847 non-null  float64
 4   shortness_of_breath  278847 non-null  float64
 5   head_ache            278847 non-null  float64
 6   corona_result        278848 non-null  object 
 7   age_60_and_above     151528 non-null  object 
 8   gender               259285 non-null  object 
 9   test_indication      278848 non-null  object 
dtypes: float64(5), object(5)
memory usage: 21.3+ MB


In [767]:
df["gender"].value_counts()

gender
female    130158
male      129127
Name: count, dtype: int64

In [768]:
df["corona_result"].value_counts()

corona_result
negative    260227
positive     14729
other         3892
Name: count, dtype: int64

In [769]:
df["test_indication"].value_counts()

test_indication
Other                     242741
Abroad                     25468
Contact with confirmed     10639
Name: count, dtype: int64

## Exploratory Data Analysis

### 1. About possible biases and limitations of this dataset

##### Data Collection Method

* **Source of Data**: The data covers all individuals in Israel who were tested for SARS-CoV-2 during the first months of the COVID-19 pandemic. Because it comes from one specific region and population group, it might not be representative of other populations.
* **Testing Access**: Individuals who have better access to healthcare facilities are more likely to get tested, which can lead to a selection bias.
* **Symptom Reporting**: Self-reported symptoms can introduce bias due to underreporting or overreporting of symptoms. People might not report mild symptoms or may misreport symptoms due to fear or misunderstanding.

##### Missing Data

* **`age_60_and_above`**: about 46% of the values are missing (127,320 of 278,848). Missing data can introduce bias if the missingness is not random and is related to other variables in the dataset.

##### Feature Values

* **Binary Representation of Symptoms**: The symptoms are represented as binary (0 or 1), which does not capture the severity of the symptoms. This simplification can lead to loss of information.
* **`test_date`**: The dataset includes a `test_date`, but the relevance of this date to the onset of symptoms or to other temporal factors isn't clear.


##### Target Variable (`corona_result`)

* **Class Imbalance**: The target variable is heavily imbalanced (260,227 negative results versus 14,729 positive ones), which can affect model performance and makes accuracy alone a misleading evaluation metric.

##### External Factors

* **Temporal Changes**: The spread and detection of COVID-19 can change over time due to various factors like new variants, public health measures, and vaccination rates. If the data spans a long time period, these temporal changes can introduce bias.
* **Behavioral Changes**: Changes in public behavior, such as mask-wearing and social distancing, can influence the likelihood of reporting certain symptoms and testing positive.

### 2. Format of Feature Values

| Feature | Type | Format | Missing values |
| :------ |:---- | :----- |:------- |
| **`test_date`** | Date string | "YYYY-MM-DD", "2020-04-30" | 0 |
| **`cough`** | Binary (Numeric) | 0.0 or 1.0 | 252 |
| **`fever`** | Binary (Numeric) | 0.0 or 1.0 | 252 |
| **`sore_throat`** | Binary (Numeric) | 0.0 or 1.0 | 1 |
| **`shortness_of_breath`** | Binary (Numeric) | 0.0 or 1.0 | 1 |
| **`head_ache`** | Binary (Numeric) | 0.0 or 1.0 | 1 |
| **`corona_result`** | Categorical (String) | "negative", "positive", or "other"  | 0 |
| **`age_60_and_above`** | Categorical (String) with missing values | "Yes", "No", or NaN  | 127,320 |
| **`gender`** | Categorical (String) | "male" or "female"  | 19,563 |
| **`test_indication`** | Categorical (String) | "Contact with confirmed", "Abroad", or "Other"  | 0 |
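The missing-value column of a table like this can be derived directly from the data. The following is a minimal sketch on a hypothetical miniature frame (`toy`); in the notebook the same two expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the real notebook would use `df` instead.
toy = pd.DataFrame({
    "cough": [0.0, 1.0, np.nan, 0.0],
    "gender": ["female", None, "male", None],
})

# isna().sum() counts missing cells per column; isna().mean() gives their share.
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})
print(missing)
```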

### 3. The statistics of feature values

The dataset contains `278,848` rows. The subsections below summarize each feature in turn.

#### 3.1 `test_date`

##### Interpretation

* **Non-null Entries**: All entries are non-null, indicating that every record has a test date.
* **Data Type**: The `test_date` is currently stored as an object (string), though it could be converted to a datetime type for more effective date-based operations and analysis.
* **Frequency**: The highest number of tests was conducted on `2020-04-20` (10,921 tests) and the lowest on `2020-03-11` (294 tests).
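As a rough illustration of the datetime conversion mentioned above, the sketch below parses a hypothetical handful of date strings in the same `YYYY-MM-DD` format. The variable names are placeholders and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample of date strings in the same "YYYY-MM-DD" format.
dates = pd.Series(["2020-04-30", "2020-04-30", "2020-04-29"], name="test_date")

# An explicit format avoids per-value format inference and fails loudly on bad input.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# With a datetime dtype, daily counts sort chronologically and date attributes become available.
tests_per_day = parsed.value_counts().sort_index()
print(tests_per_day)
print(parsed.dt.day_name().unique())
```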

In [771]:
test_date_counts = df["test_date"].value_counts().sort_index()
# test_date_counts.plot()
fig = px.line(
    test_date_counts, 
    x=test_date_counts.index,
    y=test_date_counts.values,
    title="Number of Tests Over Time",
    labels={"index": "Test Date", "y": "Number of Tests"}
)
fig.update_xaxes(tickangle=45)
fig.show()

#### 3.2 `cough`

- **Missing values**: There are `252` missing values for the `cough` feature.
- **Value Counts**:
  - `0.0`: Reported in `236,368` instances.
  - `1.0`: Reported in `42,228` instances.
- **Prevalence**: Cough is reported in approximately `15.1%` of the total cases.
- **Data Type Consideration**: `cough` is stored as a float64 (`0.0` and `1.0`), representing binary presence (`1.0`) or absence (`0.0`) of cough.

In [772]:
value_counts = df["cough"].value_counts()
value_counts[0]

np.int64(236368)

In [773]:
def pie_value_count(feature, label=None):
    """Plot a pie chart of the value counts of `feature`, including missing values.

    Leave `label` unset for 0/1 symptom columns; pass `label=True` for
    categorical columns so that each slice is labelled with its category name.
    """
    value_counts = df[feature].value_counts()
    missing = df[feature].isna().sum()
    value_counts["missing"] = missing
    if not label:
        # Look the counts up by value (0.0 / 1.0) rather than by position,
        # so the labels stay correct even if 1.0 is the more frequent value.
        label = value_counts.index.map({
            0.0: f"No {feature}: {value_counts[0.0]:,}",
            1.0: f"{feature}: {value_counts[1.0]:,}",
            "missing": f"missing: {missing}"
        })
    else:
        label = value_counts.index.map({
            key: f"{key}: {value_counts[key]:,}" for key in value_counts.keys()
        })
    print(label)
    data_for_pie = pd.DataFrame({
        "value_counts": value_counts.values,
        "status": label
    })

    fig = px.pie(
        data_for_pie,
        values="value_counts",
        names="status",
        title=f"Distribution of {feature} Feature"
    )
    fig.update_layout(
        title_x=0.5
    )
    fig.show()

pie_value_count("cough")

Index(['No cough: 236,368', 'cough: 42,228', 'missing: 252'], dtype='object', name='cough')


#### 3.3 `fever`

- **Missing values**: There are `252` missing values.
- **Value Counts**:
  - `0.0`: Reported in `256,844` instances.
  - `1.0`: Reported in `21,752` instances.
- **Prevalence**: Fever is reported in approximately `7.8%` of the total cases.
- **Data Type Consideration**: `fever` is stored as a float64(`0.0` and `1.0`), representing binary presence (`1.0`) or absence (`0.0`) of fever.

In [774]:
pie_value_count("fever")

Index(['No fever: 256,844', 'fever: 21,752', 'missing: 252'], dtype='object', name='fever')


#### 3.4 `sore_throat`

- **Missing values**: There is `1` missing value.
- **Value Counts**:
  - `0.0`: Reported in `276,921` instances.
  - `1.0`: Reported in `1,926` instances.
- **Prevalence**: Sore throat is reported in approximately `0.7%` of the total cases (`1,926 / 278,848`).
- **Data Type Consideration**: `sore_throat` is stored as a float64 (`0.0` and `1.0`), representing binary presence (`1.0`) or absence (`0.0`) of sore throat.


In [775]:
pie_value_count("sore_throat")

Index(['No sore_throat: 276,921', 'sore_throat: 1,926', 'missing: 1'], dtype='object', name='sore_throat')


#### 3.5 `shortness_of_breath`

- **Missing values**: There is `1` missing value.
- **Value Counts**:
  - `0.0`: Reported in `277,270` instances.
  - `1.0`: Reported in `1,577` instances.
- **Prevalence**: Shortness of breath is reported in approximately `0.6%` of the total cases (`1,577 / 278,848`).
- **Data Type Consideration**: `shortness_of_breath` is stored as a float64 (`0.0` and `1.0`), representing binary presence (`1.0`) or absence (`0.0`) of shortness of breath.


In [776]:
pie_value_count("shortness_of_breath")

Index(['No shortness_of_breath: 277,270', 'shortness_of_breath: 1,577',
       'missing: 1'],
      dtype='object', name='shortness_of_breath')


#### 3.6 `head_ache`

- **Missing values**: There is `1` missing value.
- **Value Counts**:
  - `0.0`: Reported in `276,433` instances.
  - `1.0`: Reported in `2,414` instances.
- **Prevalence**: Headache is reported in approximately `0.9%` of the total cases (`2,414 / 278,848`).
- **Data Type Consideration**: `head_ache` is stored as a float64 (`0.0` and `1.0`), representing binary presence (`1.0`) or absence (`0.0`) of headache.


In [777]:
pie_value_count("head_ache")

Index(['No head_ache: 276,433', 'head_ache: 2,414', 'missing: 1'], dtype='object', name='head_ache')


#### 3.7 `corona_result`

- **Missing values**: There are no missing values.
- **Value Counts**:
  - `negative`: Reported in `260,227` instances.
  - `positive`: Reported in `14,729` instances.
  - `other`: Reported in `3,892` instances.
- **Distribution**:
  - `negative`: Approximately `93.3%`.
  - `positive`: Approximately `5.3%`.
  - `other`: Approximately `1.4%`.
- **Data Type Consideration**: `corona_result` is stored as an object (string), indicating the test result categories (`negative`, `positive`, `other`).
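The distribution above can be recomputed from the raw counts. This short sketch hard-codes the three counts reported for `corona_result` instead of reading the file; `counts`, `shares` and `neg_to_pos` are illustrative names.

```python
import pandas as pd

# Counts copied from the `corona_result` value counts reported above.
counts = pd.Series({"negative": 260_227, "positive": 14_729, "other": 3_892})

# Share of each class in percent, rounded to one decimal place.
shares = (counts / counts.sum() * 100).round(1)
print(shares)

# Ratio of negative to positive results, a quick gauge of the class imbalance.
neg_to_pos = counts["negative"] / counts["positive"]
print(f"negative : positive is about {neg_to_pos:.1f} : 1")
```

On the full DataFrame, `df["corona_result"].value_counts(normalize=True)` should give the same shares as fractions.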


In [778]:
pie_value_count("corona_result", label=True)

Index(['negative: 260,227', 'positive: 14,729', 'other: 3,892', 'missing: 0'], dtype='object', name='corona_result')


#### 3.8 `age_60_and_above`

- **Missing values**: There are `127,320` missing values.
- **Value Counts**:
  - `No`: Reported in `125,703` instances.
  - `Yes`: Reported in `25,825` instances.
- **Distribution** (among non-missing values):
  - `No`: Approximately `83.0%`.
  - `Yes`: Approximately `17.0%`.
- **Data Type Consideration**: `age_60_and_above` is stored as an object (string), indicating binary categories (`No` and `Yes`) for whether the individual is aged 60 or above.


In [779]:
pie_value_count("age_60_and_above", label=True)

Index(['No: 125,703', 'Yes: 25,825', 'missing: 127,320'], dtype='object', name='age_60_and_above')


#### 3.9 `gender`

- **Missing values**: There are `19,563` missing values.
- **Value Counts**:
  - `female`: Reported in `130,158` instances.
  - `male`: Reported in `129,127` instances.
- **Distribution** (among non-missing values):
  - `female`: Approximately `50.2%`.
  - `male`: Approximately `49.8%`.
- **Data Type Consideration**: `gender` is stored as an object (string), indicating binary categories (`female` and `male`) for gender.


In [780]:
pie_value_count("gender", label=True)

Index(['female: 130,158', 'male: 129,127', 'missing: 19,563'], dtype='object', name='gender')


#### 3.10 `test_indication`

- **Missing values**: There are `0` missing values.
- **Value Counts**:
  - `Other`: Reported in `242,741` instances.
  - `Abroad`: Reported in `25,468` instances.
  - `Contact with confirmed`: Reported in `10,639` instances.
- **Data Type Consideration**: `test_indication` is stored as an object (string), categorizing reasons for COVID-19 testing.

In [781]:
pie_value_count("test_indication", label=True)

Index(['Other: 242,741', 'Abroad: 25,468', 'Contact with confirmed: 10,639',
       'missing: 0'],
      dtype='object', name='test_indication')


### 4. Features grouped by the target class

First, map the categorical features to integer codes: `corona_result` and `gender` become 0/1, and `test_indication` becomes 1, 2 or 3.

In [782]:
df.head()

Unnamed: 0,test_date,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result,age_60_and_above,gender,test_indication
0,2020-04-30,0.0,0.0,0.0,0.0,0.0,negative,,female,Other
1,2020-04-30,1.0,0.0,0.0,0.0,0.0,negative,,female,Other
2,2020-04-30,0.0,1.0,0.0,0.0,0.0,negative,,male,Other
3,2020-04-30,1.0,0.0,0.0,0.0,0.0,negative,,female,Other
4,2020-04-30,1.0,0.0,0.0,0.0,0.0,negative,,male,Other


In [784]:
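def filter(df):
    """Drop unusable rows and columns, then encode the categorical features as integers.

    Note: this name shadows Python's built-in filter().
    """
    # Keep only conclusive results ("negative" or "positive")
    df_filtered = df[df["corona_result"] != "other"]

    # Drop 'test_date' and 'age_60_and_above' (the latter has too many missing values)
    df_filtered = df_filtered.drop(columns=["test_date", "age_60_and_above"])

    df_filtered["corona_result"] = df_filtered["corona_result"].map({
        "negative": 0, 
        "positive": 1
    })
    df_filtered["gender"] = df_filtered["gender"].map({
        "male": 0, 
        "female": 1
    })
    df_filtered["test_indication"] = df_filtered["test_indication"].map({
        "Other": 1, 
        "Abroad": 2,
        "Contact with confirmed": 3
    })
    # Remove rows that still contain a missing value, then cast every column to int
    df_filtered = df_filtered.dropna().astype(int)
    return df_filtered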

In [785]:
df_filtered = filter(df)

In [786]:
# Confirm that no missing values remain after filtering
df_filtered.isna().sum()

cough                  0
fever                  0
sore_throat            0
shortness_of_breath    0
head_ache              0
corona_result          0
gender                 0
test_indication        0
dtype: int64

In [787]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 255668 entries, 0 to 265120
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   cough                255668 non-null  int64
 1   fever                255668 non-null  int64
 2   sore_throat          255668 non-null  int64
 3   shortness_of_breath  255668 non-null  int64
 4   head_ache            255668 non-null  int64
 5   corona_result        255668 non-null  int64
 6   gender               255668 non-null  int64
 7   test_indication      255668 non-null  int64
dtypes: int64(8)
memory usage: 17.6 MB


In [788]:
grouped = df_filtered.groupby('corona_result').sum().transpose()
fig = px.bar(
    grouped,
    text_auto='.2s',
    title="Features Count for Corona Result"
)
fig.update_xaxes(title_text='Feature')
fig.update_yaxes(title_text='Count')
# Set barmode to 'group' for side-by-side bars
fig.update_layout(
    barmode='group',
    title_x=0.5
)
# Map legend labels
# Update trace names (legend labels)
fig.update_traces(
    name="negative",  # For corona_result 0
    selector={"name": "0"}
)
fig.update_traces(
    name="positive",  # For corona_result 1
    selector={"name": "1"}
)
fig.show()

In [789]:
df_filtered

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result,gender,test_indication
0,0,0,0,0,0,0,1,1
1,1,0,0,0,0,0,1,1
2,0,1,0,0,0,0,0,1
3,1,0,0,0,0,0,1,1
4,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...
265116,1,1,0,0,0,0,1,2
265117,1,1,1,0,1,0,1,2
265118,1,0,0,0,0,0,1,2
265119,1,0,0,0,0,0,0,1


## Feature engineering

In [790]:
df_filtered.describe()

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result,gender,test_indication
count,255668.0,255668.0,255668.0,255668.0,255668.0,255668.0,255668.0,255668.0
mean,0.152745,0.077655,0.005851,0.004216,0.008226,0.052928,0.502265,1.166212
std,0.359742,0.267629,0.07627,0.064797,0.090321,0.22389,0.499996,0.4622
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0


In [791]:
def dummy(df):
    df_dummies = pd.get_dummies(
        df, 
        columns=['test_indication'],
        drop_first=True,
        dtype=int
    )
    df_dummies.rename(
        columns={'test_indication_2': 'test_indication_Abroad', 'test_indication_3': 'test_indication_Contact with confirmed'}, 
        inplace=True
    )
    return df_dummies

In [792]:
df_dummies = dummy(df_filtered)
df_dummies.sample(10)

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result,gender,test_indication_Abroad,test_indication_Contact with confirmed
107592,0,0,0,0,0,0,1,0,0
143560,0,0,0,0,0,0,1,0,0
200493,0,0,0,0,0,0,1,0,0
169113,1,0,0,0,0,0,0,1,0
56593,0,0,0,0,0,0,1,0,0
169045,1,0,0,0,0,0,0,0,0
115105,0,0,0,0,0,0,0,0,0
37824,0,0,0,0,0,0,1,0,0
123791,0,0,0,0,0,0,1,0,0
194052,0,0,0,0,0,0,0,0,0


In [793]:
X = df_dummies.drop(columns=["corona_result"])
y = df_dummies["corona_result"]

## Models

### Base model

In [794]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve, auc

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    stratify=y,
    random_state=42
)

In [795]:
class BaseModel:
    def fit(self, X, y):
        pass
    
    def predict(self, X):
        return [0] * len(X)
    
    def predict_proba(self, X):
        return [[1, 0]] * len(X)  # Probability distribution for negative class

base_model = BaseModel()
base_model.fit(X_train, y_train)
base_pred = base_model.predict(X_test)
base_pred_proba = base_model.predict_proba(X_test)

In [796]:
unique, counts = np.unique(base_pred_proba, return_counts=True)
unique, counts

(array([0, 1]), array([51134, 51134]))
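scikit-learn ships a comparable majority-class baseline, `DummyClassifier`. The sketch below is an alternative to the hand-written `BaseModel`, not the notebook's own method, and it runs on hypothetical toy data (`X_toy`, `y_toy`) rather than on the real split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data with a similar kind of imbalance (mostly class 0).
X_toy = np.zeros((10, 2))
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# "most_frequent" always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

toy_pred = baseline.predict(X_toy)
toy_proba = baseline.predict_proba(X_toy)
print(toy_pred)
print(toy_proba[0])
```

Because it implements the usual estimator interface, such a baseline can be passed to the same metric functions as the Random Forest model.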

### Random Forest model

In [797]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

In [798]:
y_pred = rf_model.predict(X_test)

In [799]:
print(classification_report(y_test, y_pred))
rf_confusion_matrix = confusion_matrix(y_test, y_pred)
print(rf_confusion_matrix)

              precision    recall  f1-score   support

           0       0.98      0.99      0.98     48428
           1       0.78      0.58      0.66      2706

    accuracy                           0.97     51134
   macro avg       0.88      0.79      0.82     51134
weighted avg       0.97      0.97      0.97     51134

[[47975   453]
 [ 1133  1573]]


In [800]:
report = classification_report(y_test, y_pred, output_dict=True)

# Extract metrics for class 0
precision_0 = report['0']['precision']
recall_0 = report['0']['recall']
f1_score_0 = report['0']['f1-score']
support_0 = report['0']['support']

# Extract metrics for class 1
precision_1 = report['1']['precision']
recall_1 = report['1']['recall']
f1_score_1 = report['1']['f1-score']
support_1 = report['1']['support']

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

green = '\033[92m'
reset_color = '\033[0m'

print(f"""
{green}Classification Report Interpretation:{reset_color}
    
{green}Precision:{reset_color} Precision measures the accuracy of positive predictions. 
    For class 0 (negative cases), the precision is {green}{precision_0:.2f}{reset_color}, 
        indicating that {precision_0 * 100:.0f}% of the samples predicted as negative were actually negative. 
    For class 1 (positive cases), the precision is {green}{precision_1:.2f}{reset_color}, 
        meaning that {precision_1 * 100:.0f}% of the samples predicted as positive were actually positive.

{green}Recall (Sensitivity):{reset_color} Recall measures the proportion of actual positives that are correctly identified by the model. 
    For class 0 (negative cases), the recall is {green}{recall_0:.2f}{reset_color}, 
        indicating that {recall_0 * 100:.0f}% of the actual negative samples were correctly identified as negative. 
    For class 1 (positive cases), the recall is {green}{recall_1:.2f}{reset_color}, 
        meaning that {recall_1 * 100:.0f}% of the actual positive samples were correctly identified as positive.

{green}F1-score:{reset_color} The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both measures.
    For class 0, the F1-score is {green}{f1_score_0:.2f}{reset_color}, 
    and for class 1, it is {green}{f1_score_1:.2f}{reset_color}.

{green}Support:{reset_color} Support refers to the number of actual occurrences of each class in the test set. 
    In this case, there are {green}{support_0}{reset_color} samples of class 0 and {green}{support_1}{reset_color} samples of class 1.

{green}Accuracy:{reset_color} Overall accuracy of the model is {green}{accuracy:.2f}{reset_color}, 
    meaning that {accuracy * 100:.0f}% of the predictions made by the model are correct.

{green}Macro Avg:{reset_color} The macro average calculates the average of the metrics (precision, recall, F1-score) for all classes without considering class imbalance. 
Here, the macro average F1-score is {green}{report['macro avg']['f1-score']:.2f}{reset_color}.

{green}Weighted Avg:{reset_color} The weighted average calculates the average of the metrics, 
    weighted by support (the number of true instances for each label). 
    It gives more weight to the metrics of the majority class (class 0, negative cases). 
    Here, the weighted average F1-score is {green}{report['weighted avg']['f1-score']:.2f}{reset_color}.
""")


[92mClassification Report Interpretation:[0m
    
[92mPrecision:[0m Precision measures the accuracy of positive predictions. 
    For class 0 (negative cases), the precision is [92m0.98[0m[0m, 
        indicating that 98% of the samples predicted as negative were actually negative. 
    For class 1 (positive cases), the precision is [92m0.78[0m, 
        meaning that 78% of the samples predicted as positive were actually positive.

[92mRecall (Sensitivity):[0m Recall measures the proportion of actual positives that are correctly identified by the model. 
    For class 0 (negative cases), the recall is [92m0.99[0m, 
        indicating that 99% of the actual negative samples were correctly identified as negative. 
    For class 1 (positive cases), the recall is [92m0.58[0m, 
        meaning that 58% of the actual positive samples were correctly identified as positive.

[92mF1-score:[0m The F1-score is the harmonic mean of precision and recall, providing a single metric 

In [801]:
fig = px.imshow(
    rf_confusion_matrix,
    text_auto=True,
    labels={
        "x": "Predicted Label",
        "y": "Actual Label",
        "color": "Count"
    },
    x=["Negative", "Positive"],
    y=["Negative", "Positive"],
    title="Confusion Matrix for Random Forest Model"
)
for i in range(2):
    fig.add_shape(type="line", x0=0.5 + i, y0=-0.5, x1=0.5 + i, y1=2 - 0.5, line=dict(color="white", width=2))
    fig.add_shape(type="line", x0=-0.5, y0=0.5 + i, x1=2 - 0.5, y1=0.5 + i, line=dict(color="white", width=2))

fig.update_layout(title_x=0.5)
fig.show()

In [802]:
print(rf_confusion_matrix)

[[47975   453]
 [ 1133  1573]]


For comparison, without `stratify=y` in `train_test_split` the confusion matrix was:  
[[48053   418]  
 [ 1096  1567]]

In [803]:
print(f"""
{green}Confusion Matrix Interpretation:{reset_color}
{reset_color}
The confusion matrix provides a more detailed breakdown of predictions versus actual outcomes:

{green}True Negative (TN):{reset_color} {rf_confusion_matrix[0][0]:,} samples were correctly predicted as negative.
{green}False Positive (FP):{reset_color} {rf_confusion_matrix[0][1]:,} samples were incorrectly predicted as positive.
{green}False Negative (FN):{reset_color} {rf_confusion_matrix[1][0]:,} samples were incorrectly predicted as negative (actually positive).
{green}True Positive (TP):{reset_color} {rf_confusion_matrix[1][1]:,} samples were correctly predicted as positive.
""")


[92mConfusion Matrix Interpretation:[0m
[0m
The confusion matrix provides a more detailed breakdown of predictions versus actual outcomes:

[92mTrue Negative (TN):[0m 47,975 samples were correctly predicted as negative.
[92mFalse Positive (FP):[0m 453 samples were incorrectly predicted as positive.
[92mFalse Negative (FN):[0m 1,133 samples were incorrectly predicted as negative (actually positive).
[92mTrue Positive (TP):[0m 1,573 samples were correctly predicted as positive.



<details>
<summary>Accuracy (ACC)</summary>

$\text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{1{,}573 + 47{,}975}{1{,}573 + 47{,}975 + 453 + 1{,}133} \approx 0.97$

</details>

<details>
<summary>Precision for class 0 (Negative)</summary>

$\frac{\text{TN}}{\text{TN} + \text{FN}} = \frac{47{,}975}{47{,}975 + 1{,}133} \approx 0.98$

</details>

<details>
<summary>Precision for class 1 (Positive)</summary>

$\frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{1{,}573}{1{,}573 + 453} \approx 0.78$

</details>

<details>
<summary>Recall for class 0 (Negative)</summary>

$\frac{\text{TN}}{\text{TN} + \text{FP}} = \frac{47{,}975}{47{,}975 + 453} \approx 0.99$

</details>

<details>
<summary>Recall for class 1 (Positive)</summary>

$\frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{1{,}573}{1{,}573 + 1{,}133} \approx 0.58$

</details>
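The same definitions can be checked numerically. This sketch plugs in the four cells of the Random Forest confusion matrix printed above (the stratified split); the variable names are illustrative.

```python
# Values taken from the confusion matrix printed above (rows: actual, columns: predicted).
tn, fp = 47_975, 453
fn, tp = 1_133, 1_573

accuracy_cm = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of all predicted positives, how many are correct
recall_pos = tp / (tp + fn)      # of all actual positives, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy_cm:.2f} precision={precision_pos:.2f} "
      f"recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

The rounded values agree with the class 1 row of the classification report.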

In [804]:
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

ROC-AUC Score: 0.9033


In [805]:
rf_model.predict_proba(X_test)

array([[0.90414038, 0.09585962],
       [0.97432466, 0.02567534],
       [0.94080137, 0.05919863],
       ...,
       [0.92125643, 0.07874357],
       [0.22915269, 0.77084731],
       [0.9873608 , 0.0126392 ]])

In [806]:
def plot_roc_auc(true, pred_proba, known=True):
    roc_auc = roc_auc_score(true, pred_proba)
    fpr, tpr, thresholds = roc_curve(true, pred_proba)

    fig = px.area(
        x=fpr, y=tpr,
        title=f'Random Forest ROC Curve using {"Known" if known else "Unseen"} Data (AUC={roc_auc:.4f})',
        labels={
            "x": "False Positive Rate", 
            "y": "True Positive Rate"
        },
        width=700, height=500
    )
    fig.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=0, y1=1
    )

    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    fig.update_xaxes(constrain='domain')
    fig.update_layout(title_x=0.5)
    fig.show()

In [807]:
plot_roc_auc(y_test, y_pred_proba)

Without `stratify=y` in `train_test_split`, the ROC-AUC score was 0.8976.

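`auPRC`: the area under the precision-recall curve. Unlike ROC-AUC it ignores true negatives, which makes it informative when the positive class is rare, as it is here.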
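To make the metric concrete, here is a small sketch on hypothetical labels and scores (not the notebook's data). It computes the trapezoidal area with `auc`, the related `average_precision_score`, and the no-skill reference level, which for a precision-recall curve is the share of positives.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical labels and scores, small enough to check by hand.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)                # trapezoidal area under the curve
ap = average_precision_score(y_true, scores)   # step-wise summary of the same curve
no_skill = y_true.mean()                       # expected precision of random guessing

print(f"auPRC={pr_auc:.3f}, AP={ap:.3f}, no-skill baseline={no_skill:.2f}")
```

The two summaries differ slightly because the trapezoidal rule interpolates linearly between points while average precision does not.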

In [808]:
# Precision-Recall Curve
def plot_pr_curve(true, pred_proba, known=True):
    precision, recall, _ = precision_recall_curve(true, pred_proba)
    pr_auc = auc(recall, precision)
    fig_pr = px.area(
        x=recall, y=precision,
        title=f'Random Forest Precision-Recall Curve using {"Known" if known else "Unseen"} Data (auPRC={pr_auc:.4f})',
        labels={
            "x": "Recall", 
            "y": "Precision"
        },
        width=700, height=500
    )
    fig_pr.update_yaxes(scaleanchor="x", scaleratio=1)
    fig_pr.update_xaxes(constrain='domain')
    fig_pr.update_layout(title_x=0.5)
    fig_pr.show()

In [809]:
plot_pr_curve(y_test, y_pred_proba)

#### Validation using unseen data

In [810]:
df_validate = pd.read_csv("data/corona_tested_individuals_ver_0083.english.csv")


Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.



In [811]:
# Preprocess the validation dataset
df_validate_filtered = filter(df_validate)
df_val_dummy = dummy(df_validate_filtered)

In [812]:
df_val_dummy["test_indication_Contact with confirmed"].value_counts()

test_indication_Contact with confirmed
0    2446207
1     163982
Name: count, dtype: int64

In [813]:
X_validate = df_val_dummy.drop(columns=["corona_result"])
y_validate = df_val_dummy["corona_result"]
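scikit-learn estimators expect the same feature columns at prediction time as at fit time. If a validation file happened to lack one `test_indication` category, `get_dummies` would produce fewer columns. A hedged way to guard against that is to reindex the validation features to the training layout, sketched here on hypothetical toy frames (`X_train_toy`, `X_val_toy`).

```python
import pandas as pd

# Hypothetical training and validation feature frames.
X_train_toy = pd.DataFrame({
    "cough": [0, 1],
    "test_indication_Abroad": [0, 1],
    "test_indication_Contact with confirmed": [1, 0],
})
# In this toy case the validation sample has no "Contact with confirmed" column.
X_val_toy = pd.DataFrame({"test_indication_Abroad": [1, 0], "cough": [1, 1]})

# Reindex to the training layout: same columns, same order, absent dummies filled with 0.
X_val_aligned = X_val_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(X_val_aligned)
```

In the notebook the equivalent call would presumably be `X_validate.reindex(columns=X.columns, fill_value=0)`, assuming `X` still holds the training features.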

In [814]:
y_validate_proba = rf_model.predict_proba(X_validate)[:, 1]

In [815]:
plot_roc_auc(y_validate, y_validate_proba, known=False)

In [816]:
plot_pr_curve(y_validate, y_validate_proba, known=False)