In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from imblearn.under_sampling import RandomUnderSampler


## Loading the Data

In [15]:
%run _setup.py

2026-02-06 13:47:29,468 | INFO | Added to PYTHONPATH: C:\Users\User\Desktop\covid-19\data-science-machine-learning-for-covid-19-using-Python\src\ingestion
2026-02-06 13:47:29,470 | INFO | Added to PYTHONPATH: C:\Users\User\Desktop\covid-19\data-science-machine-learning-for-covid-19-using-Python\src\cleaning
2026-02-06 13:47:29,471 | INFO | Added to PYTHONPATH: C:\Users\User\Desktop\covid-19\data-science-machine-learning-for-covid-19-using-Python\src\transformation
2026-02-06 13:47:29,472 | INFO | Added to PYTHONPATH: C:\Users\User\Desktop\covid-19\data-science-machine-learning-for-covid-19-using-Python\src\utils


In [18]:
from ingestion.data_loader import DataLoader

loader = DataLoader()

df = loader.load_csv(
    filename="corona_tested_individuals_ver_0083.english.csv",
    layer="raw"
)

df.head()


Unnamed: 0,test_date,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result,age_60_and_above,gender,test_indication
0,2020-11-12,0,0,0,0,0,negative,No,male,Other
1,2020-11-12,0,1,0,0,0,negative,No,male,Other
2,2020-11-12,0,0,0,0,0,negative,Yes,female,Other
3,2020-11-12,0,0,0,0,0,negative,No,male,Other
4,2020-11-12,0,1,0,0,0,negative,No,male,Contact with confirmed


## Data Overview

## Model Predictors and Exact Variable Names (True = 1, False = 0)
* **Age over 60** - age_60_and_above
* **Sex** - gender (male = 1, female = 0)
* **Cough** - cough
* **Shortness of breath** - shortness_of_breath
* **Fever** - fever
* **Sore throat** - sore_throat
* **Headache** - head_ache
* **Test indication** - test_indication; the key information is whether the patient was in contact with a confirmed case

Source: https://data.gov.il/dataset/covid-19



In [19]:
df['test_indication'].value_counts()

test_indication
Other                     153505
Contact with confirmed      4934
Abroad                       166
Name: count, dtype: int64

## Preparing Data



## Dropping NA Values

In [23]:
print(df.shape)
df = df.dropna()
print(df.shape)

(158606, 10)
(143502, 10)


## Converting Columns to accepted format

In [28]:
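def gender(x):
    """Encode gender: 'male' -> 1, anything else -> 0."""
    row = dict(x)
    gender = row['gender'].lower()

    if gender == 'male':
        return 1
    else:
        return 0

def age_60_and_above(x):
    """Encode age group: 'Yes' -> 1, anything else -> 0."""
    row = dict(x)
    age = row['age_60_and_above']

    if age == 'Yes':
        return 1
    else:
        return 0

def corona_result(x):
    """Encode the test result: 'positive' -> 1, anything else -> 0."""
    row = dict(x)
    # Compare the row's value, not the function object named corona_result.
    result = row['corona_result'].lower()

    if result == 'positive':
        return 1
    else:
        return 0

def contact_with_confirmed_convert(x):
    """Flag rows whose test indication is contact with a confirmed case."""
    row = dict(x)
    test_indication = row['test_indication']

    if test_indication == 'Contact with confirmed':
        return 1
    else:
        return 0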

In [29]:
df['contact_with_confirmed'] = df.apply(lambda row: contact_with_confirmed_convert(row), axis = 1)
df['gender'] = df.apply(lambda row: gender(row), axis = 1)
df['age_60_and_above'] = df.apply(lambda row: age_60_and_above(row), axis = 1)
df['corona_result'] = df.apply(lambda row: corona_result(row), axis = 1)
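
The row-wise `apply` calls above can also be written as vectorized column comparisons, which tend to be faster on a frame of this size. The following is a minimal sketch on a made-up three-row frame (the `toy` and `encoded` names and the sample values are illustrative assumptions, not part of the dataset); it maps anything other than the target string to 0, as the functions above are meant to do.

```python
import pandas as pd

# Toy rows for illustration only (assumed values, not taken from the real data).
toy = pd.DataFrame({
    "gender": ["male", "female", "Male"],
    "age_60_and_above": ["No", "Yes", "No"],
    "corona_result": ["negative", "positive", "other"],
    "test_indication": ["Other", "Contact with confirmed", "Abroad"],
})

# Each comparison yields a boolean Series; astype(int) turns True/False into 1/0.
encoded = pd.DataFrame({
    "gender": toy["gender"].str.lower().eq("male").astype(int),
    "age_60_and_above": toy["age_60_and_above"].eq("Yes").astype(int),
    "corona_result": toy["corona_result"].str.lower().eq("positive").astype(int),
    "contact_with_confirmed": toy["test_indication"].eq("Contact with confirmed").astype(int),
})
print(encoded)
```

Because the comparisons run on whole columns at once, no per-row Python function call is needed.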

In [30]:
df.head()

Unnamed: 0,test_date,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result,age_60_and_above,gender,test_indication,contact_with_confirmed
0,2020-11-12,0,0,0,0,0,0,0,1,Other,0
1,2020-11-12,0,1,0,0,0,0,0,1,Other,0
2,2020-11-12,0,0,0,0,0,0,1,0,Other,0
3,2020-11-12,0,0,0,0,0,0,0,1,Other,0
4,2020-11-12,0,1,0,0,0,0,0,1,Contact with confirmed,1


## Gradient Boosting Classifier

**Learning Rate**: The learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators: a smaller learning rate generally needs more boosting stages to reach the same fit.

**N_estimators**: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance, although it is still worth checking the score on held-out data.

**Max Depth**: The maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
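
To make the three parameters concrete, here is a minimal sketch that fits `GradientBoostingClassifier` with two learning_rate / n_estimators combinations at a fixed max_depth. It runs on synthetic data from `make_classification`, which is an assumption made to keep the example self-contained; the specific parameter values are illustrative, not tuned for the COVID-19 data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the COVID-19 dataset).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Two illustrative settings: few stages with a large step,
# and many stages with a small step.
settings = [
    {"learning_rate": 0.5, "n_estimators": 20},
    {"learning_rate": 0.05, "n_estimators": 200},
]

scores = {}
for params in settings:
    model = GradientBoostingClassifier(max_depth=3, random_state=0, **params)
    model.fit(X_train, y_train)
    # Mean accuracy on the held-out split.
    key = (params["learning_rate"], params["n_estimators"])
    scores[key] = model.score(X_test, y_test)

for key, acc in scores.items():
    print(key, round(acc, 3))
```

Comparing held-out scores across such settings is one simple way to explore the trade-off before committing to a full parameter search.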



## Python imbalanced-learn module
A number of more sophisticated resampling techniques have been proposed in the scientific literature.

For example, we can cluster the records of the majority class, and do the under-sampling by removing records from each cluster, thus seeking to preserve information. In over-sampling, instead of creating exact copies of the minority class records, we can introduce small variations into those copies, creating more diverse synthetic samples.

## Under-sampling: Cluster Centroids
This technique performs under-sampling by generating centroids with a clustering method. The majority-class records are first grouped by similarity, so that the reduced set still preserves as much information as possible.

It under-samples the majority class by replacing each cluster of majority samples with the cluster centroid found by a KMeans algorithm. To keep N majority samples, KMeans is fitted with N clusters on the majority class, and the coordinates of the N cluster centroids are used as the new majority samples.
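
The idea can be sketched directly with scikit-learn's `KMeans`. This is an illustration of the technique under stated assumptions, not the internals of imbalanced-learn's `ClusterCentroids` class: the two Gaussian point clouds are made up, and N is set to the minority-class size so that the classes end up balanced.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up imbalanced data (an assumption): 500 majority vs 25 minority points.
rng = np.random.default_rng(0)
X_major = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_minor = rng.normal(loc=3.0, scale=0.5, size=(25, 2))

# Keep N majority samples, where N equals the minority-class size.
n_keep = len(X_minor)

# Fit KMeans with N clusters on the majority class only.
kmeans = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(X_major)

# The N centroids replace the original majority samples.
X_major_reduced = kmeans.cluster_centers_

X_resampled = np.vstack([X_major_reduced, X_minor])
y_resampled = np.concatenate([
    np.zeros(n_keep, dtype=int),       # majority class label 0
    np.ones(len(X_minor), dtype=int),  # minority class label 1
])
print(X_resampled.shape, np.bincount(y_resampled))
```

Note that centroids are averages of records rather than original records, so on 0/1 columns such as the symptom flags they can take fractional values.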

## Assignment

Try the SMOTE oversampling technique, compare and contrast it with the under-sampling techniques, and observe which one works better in our case.

## Oversampling

**SMOTE (Synthetic Minority Oversampling TEchnique)** consists of synthesizing elements for the minority class, based on those that already exist. It works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.
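
The interpolation step can be sketched in plain NumPy. The function below, `smote_sketch`, is a hypothetical helper written for illustration, not imbalanced-learn's `SMOTE` implementation; it assumes continuous features, Euclidean distance, and more than k minority samples.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between all minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every minority point.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Randomly pick a base point and one of its k neighbors for each new sample.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Made-up minority samples (an assumption) to exercise the sketch.
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30, k=5)
print(X_new.shape)
```

Every synthetic point lies on a line segment between two existing minority points, which is why SMOTE adds variety without leaving the region the minority class already occupies.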



## Saving the Model


