<p style="font-family: Cambria; text-align: center; font-size: 48px;"> Predictive  Analysis</p>

In [9]:
!pip install scipy
!pip install scikit-learn

Successfully installed joblib-1.5.3 scikit-learn-1.8.0 threadpoolctl-3.6.0


In [10]:
# importing all libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from scipy.stats import mannwhitneyu
from scipy.stats import chi2_contingency
from scipy.stats import spearmanr

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [None]:
# Read the cleaned survey data
survey1=pd.read_excel("cleaned_schema_1_data.xlsx")
survey2=pd.read_excel("cleaned_schema_2_data.xlsx")
survey3=pd.read_excel("cleaned_schema_3_data.xlsx")

# Work on copies so the original data is not modified by accident
df1=survey1.copy()
df2=survey2.copy()
df3=survey3.copy()
df3.info()

# Early Risk Prediction of Probable COVID-19 Infection Using Survey Data

### Research Objective
***To evaluate whether early self-reported symptoms, exposure history, and demographic characteristics collected through the survey can predict the probability of a COVID-19 case prior to confirmatory testing or clinical diagnosis.***
- **Null Hypothesis (H₀)**
Probable COVID-19 infection **is not** significantly associated with symptoms, exposure history, or demographic and health-related factors.
- **Alternative Hypothesis (H₁)**
Probable COVID-19 infection **is** significantly associated with symptoms, exposure history, or demographic and health-related factors.

### Chosen Parameters and Their Connection to Hypothesis

| **Feature**                          | **Why it’s Included**                                                                 |
|--------------------------------------|----------------------------------------------------------------------------------------|
| **Age > 65**                         | Older age is associated with increased susceptibility and severity of COVID-19        |
| **Sex**                              | Biological differences may influence immune response                                  |
| **Fever**                            | Core symptom of acute viral infection                                                 |
| **Persistent Cough**                 | Strong indicator of respiratory involvement                                           |
| **Shortness of Breath**              | Associated with pulmonary compromise                                                  |
| **Loss of Smell/Taste**              | Highly specific early COVID-19 symptom                                                |
| **Sore Throat**                      | Common early symptom of viral illness                                                 |
| **Recent Travel**                    | Increases exposure risk                                                               |
| **Contact with Positive Case**       | Direct predictor of infection                                                         |
| **Chronic Illness**                  | Comorbidities increase infection susceptibility                                       |
| **Tobacco Usage**                    | Associated with impaired lung function                                                |
| **FSA (Geographic Area)**            | Captures spatial transmission patterns                                                |
| **Survey Awareness Channel**         | Reflects information dissemination pathways                                           |

---

### Target Variable

| **Variable**           | **Description**                                                                  |
|------------------------|----------------------------------------------------------------------------------|
| **covid_positive**     | Target variable indicating the presence (1) or absence (0) of COVID-19 infection |
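
Because every later step depends on the target, it can help to check how many survey rows actually carry a label before modelling. The sketch below shows one way to do that on a small made-up frame (`toy` is hypothetical and stands in for `df3`; only the column names come from this notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df3; the values are made up for illustration.
toy = pd.DataFrame({
    "fever_chills_shakes": [1, 0, 1, 0, 1, 0],
    "covid_positive": [1.0, np.nan, 0.0, np.nan, np.nan, 1.0],
})

n_rows = len(toy)
# Rows where the target was actually recorded
n_labelled = int(toy["covid_positive"].notna().sum())
# Class balance among the labelled rows only
class_counts = toy["covid_positive"].value_counts(dropna=True)

print(f"{n_labelled} of {n_rows} rows have a recorded target")
print(class_counts)
```

Running the same three lines against `df3` would show how much of the survey is usable for supervised modelling and whether the classes are balanced.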



### Exploratory Association Analysis (Correlation Check)
**Since most variables are binary or ordinal and not normally distributed, Spearman’s rank correlation is used to assess their association with the target, probable COVID-19 infection.**

In [5]:
features = [
    'age_1_>65', 'fever_chills_shakes', 'cough', 'shortness_of_breath',
    'symp_lossOfSmellTaste', 'travel_outside_canada',
    'contact_with_illness',
]

# Confirm that every feature is binary (two distinct values)
print(df3[features].nunique())

# Spearman correlation of each feature with the target,
# computed only on rows where both values are present
corr_results = []
for feature in features:
    subset = df3[[feature, 'covid_positive']].dropna()
    corr, p = spearmanr(subset[feature], subset['covid_positive'])
    corr_results.append((feature, corr, p))

corr_df = pd.DataFrame(
    corr_results,
    columns=['Feature', 'Spearman_r', 'p_value']
).sort_values(by='Spearman_r', ascending=False)

print(corr_df)

age_1_>65                2
fever_chills_shakes      2
cough                    2
shortness_of_breath      2
symp_lossOfSmellTaste    2
travel_outside_canada    2
contact_with_illness     2
dtype: int64
                 Feature  Spearman_r       p_value
4  symp_lossOfSmellTaste    0.466928  6.676940e-28
1    fever_chills_shakes    0.340556  9.031574e-15
2                  cough    0.289843  6.132708e-11
3    shortness_of_breath    0.256322  8.619661e-09
6   contact_with_illness    0.165585  2.319008e-04
5  travel_outside_canada    0.109322  1.547706e-02
0              age_1_>65   -0.082521  6.798295e-02
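
For two binary variables, Spearman's coefficient equals the phi coefficient, so a chi-square test of independence on the 2x2 contingency table gives a natural cross-check of the results above. The sketch below illustrates this on synthetic data (the arrays `symptom` and `positive` are simulated, not taken from the survey) using `chi2_contingency`, which this notebook already imports.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, spearmanr

# Synthetic binary data (NOT the survey): 'symptom' agrees with 'positive' about 70% of the time.
rng = np.random.default_rng(0)
positive = rng.integers(0, 2, size=400)
symptom = np.where(rng.random(400) < 0.7, positive, 1 - positive)

# 2x2 contingency table and chi-square test without Yates' continuity correction
table = pd.crosstab(symptom, positive)
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)
rho, p_rho = spearmanr(symptom, positive)

# For a 2x2 table, chi2 = n * phi^2, and |phi| matches |Spearman's rho|.
phi = np.sqrt(chi2 / len(positive))
print(f"chi2 = {chi2:.2f}, p = {p_chi2:.3g}, phi = {phi:.3f}, |rho| = {abs(rho):.3f}")
```

On the survey data, the same test could be run per feature on the rows where both the feature and `covid_positive` are present. With `correction=True` (SciPy's default for 2x2 tables) the identity with phi no longer holds exactly.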


### Logistic Regression (Interpretable Baseline Model)

In [12]:
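# covid_positive is recorded for only 490 of the 15,534 rows. Dropping NaNs from y
# alone left X and y with different lengths and raised:
#   ValueError: Found input variables with inconsistent numbers of samples: [15534, 490]
# Filter features and target together so both keep exactly the same rows.
model_df = df3[features + ['covid_positive']].dropna()
X = model_df[features]
y = model_df['covid_positive']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted probability of the positive class, scored with ROC AUC on the held-out set
y_pred_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)

print(f"AUC for early COVID-19 risk prediction: {auc:.3f}")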
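
One reason to treat logistic regression as an interpretable baseline is that its coefficients can be read as odds ratios. The sketch below shows the idea on simulated data: the feature names are reused from this notebook for readability, but the values and the true effect sizes are assumptions chosen for illustration, and `X_sim`, `y_sim`, and `sim_model` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated binary predictors; NOT the survey data. The true effects below are assumptions.
rng = np.random.default_rng(42)
n = 2000
X_sim = pd.DataFrame({
    "symp_lossOfSmellTaste": rng.integers(0, 2, size=n),
    "age_1_>65": rng.integers(0, 2, size=n),
})
# Assumed data-generating process: a strong symptom effect, no age effect
logit = -1.0 + 2.0 * X_sim["symp_lossOfSmellTaste"] + 0.0 * X_sim["age_1_>65"]
y_sim = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

sim_model = LogisticRegression(max_iter=1000).fit(X_sim, y_sim)

# exp(coefficient) is the multiplicative change in the odds when a feature goes from 0 to 1.
odds_ratios = pd.Series(np.exp(sim_model.coef_[0]), index=X_sim.columns)
print(odds_ratios.sort_values(ascending=False))
```

An odds ratio well above 1 marks a feature that raises the odds of a positive case, while a value near 1 suggests little association. scikit-learn's `LogisticRegression` applies L2 regularisation by default, so these ratios are slightly shrunk toward 1.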