Author: Kevin Thomas
License: No license information was provided.
This report presents the development and evaluation of a binary classification model designed to predict influenza cases (is_influenza) within Chicago’s healthcare data. Utilizing a comprehensive dataset spanning clinical visits from 2015 to 2025, the model aims to enhance early detection and resource allocation for influenza management.
1. Continuous Variables Analysis
- Week Distribution: Data exhibits a uniform distribution across weeks, ensuring temporal coverage.
- Percent Variable: Heavily skewed towards lower values, indicating most observations have low percentage metrics.
- Class Imbalance: The target variable is_influenza is significantly imbalanced, with non-influenza rows predominating (38,472 versus 9,492 after cleaning, roughly 4 to 1). This imbalance poses challenges for accurate prediction, potentially biasing the model towards non-influenza outcomes.
2. Categorical Variables Analysis
- Seasonal Representation: Data is primarily concentrated between 2019 and 2024, with sparse entries for the 2024-2025 season.
- Visit Types: Emergency Department (ED) visits dominate, while outpatient visits are infrequent.
- Respiratory Categories: Influenza, RSV, and COVID-19 are well-represented, reflecting prevalent respiratory conditions.
- Demographics: Counts are nearly equal across demographic groups (3,670 rows for each age and race/ethnicity group before cleaning), with the aggregate ALL group slightly larger (4,141 rows).
3. Advanced Visualizations
- Dodged Bar Charts & Heatmaps: Highlight consistent high counts of Influenza and RSV, with notable COVID-19 spikes during 2020-2023. ED visits remain the primary mode of healthcare engagement, especially among older populations during pandemic periods.
- Box and Violin Plots: Reveal increased variability in the percent variable during pandemic years, with significant outliers in younger (0-4) and older (65+) age groups.
- Correlation Analysis: Minimal correlations between week, percent, and is_influenza (with a weak positive correlation of 0.21 between percent and is_influenza), indicating limited linear relationships among these variables.
- Trend Analysis: Demonstrates upward trends in percentage metrics over weeks, particularly in recent seasons, and consistent increases in ED visits across demographics.
Model Evaluation
- Model 96: Exhibited overfitting, compromising its generalizability.
- Model 99: Encountered convergence issues, likely due to multicollinearity or excessive complexity.
- Model 117: Selected as the optimal model due to its balance between performance and simplicity. Key metrics include:
  - Accuracy: 80.21%
  - ROC AUC: 0.696
  - Complexity: 14 coefficients, reducing the risk of overfitting and enhancing stability during training.
The selected Model 117 demonstrates robust performance with adequate accuracy and discriminative ability, making it suitable for deployment in predicting influenza cases. Its simplicity ensures better generalization to new data, essential for real-world applications in public health surveillance and response.
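To put the 80.21% accuracy in context, it can be compared with the majority-class baseline, that is, the accuracy of always predicting non-influenza. The sketch below uses the is_influenza counts reported for the cleaned data in this notebook (38,472 zeros and 9,492 ones). The function name is illustrative and not part of the original analysis.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# is_influenza counts after dropping rows with missing values (from this notebook)
counts = {0: 38472, 1: 9492}
baseline = majority_baseline_accuracy(counts)
print(f"Majority-class baseline accuracy: {baseline:.2%}")  # 80.21%
```

With these counts the baseline matches the accuracy reported for Model 117, so accuracy alone says little about how well influenza rows are identified. The ROC AUC of 0.696 is the more informative of the two metrics here.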
- Address Class Imbalance: Implement techniques such as resampling, synthetic data generation, or algorithmic adjustments to mitigate the impact of class imbalance and improve predictive performance for influenza cases.
- Feature Engineering: Explore additional features or interactions that may enhance model performance, particularly focusing on demographic and temporal patterns.
- Model Validation: Conduct further validation using external datasets to ensure the model's reliability and applicability across different populations and timeframes.
- Deployment and Monitoring: Integrate the model into public health systems for real-time influenza prediction and continuously monitor its performance to facilitate timely interventions.
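One of the algorithmic adjustments mentioned in the first recommendation is to weight each class inversely to its frequency. The following is a minimal sketch, assuming the cleaned class counts from this notebook and a commonly used inverse-frequency heuristic. The helper name is hypothetical, and how the weights are passed to a particular estimator is left open.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# is_influenza counts after cleaning (from this notebook)
counts = {0: 38472, 1: 9492}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Under this scheme each class contributes the same total weight, so the minority influenza class is no longer outweighed by sheer row count.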
The development of a stable and effective influenza prediction model, supported by thorough data analysis and careful model selection, provides a valuable tool for enhancing public health strategies in Chicago. By addressing existing data challenges and refining predictive capabilities, this model holds significant potential for improving influenza case management and resource allocation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import statsmodels.formula.api as smf
df = pd.read_csv('Inpatient__Emergency_Department__and_Outpatient_Visits_for_Respiratory_Illnesses.csv')
df.shape
(48181, 13)
df.dtypes
mmwr_week int64
week int64
week_start object
week_end object
season object
data_source object
essence_category object
respiratory_category object
visit_type object
demographic_category object
demographic_group object
percent float64
current_week_ending object
dtype: object
_ = [print(f'{df[column].value_counts()}\n') for column in df.columns]
mmwr_week
202043 131
202119 131
202347 131
202237 131
202036 131
...
201838 66
201816 66
201630 66
201725 66
201936 66
Name: count, Length: 471, dtype: int64
week
40 1050
42 1050
41 1050
43 919
52 919
7 919
1 919
18 919
50 919
47 919
20 919
14 919
10 919
48 919
25 919
36 919
9 919
2 919
28 919
51 919
38 919
12 919
15 919
16 919
3 919
5 919
4 919
32 919
34 919
39 919
6 919
29 919
49 919
26 919
24 919
37 919
30 919
11 919
8 919
46 919
35 919
27 919
44 919
19 919
23 919
21 919
13 919
17 919
45 919
33 919
22 919
31 919
Name: count, dtype: int64
week_start
10/18/2020 131
05/09/2021 131
11/19/2023 131
09/11/2022 131
08/30/2020 131
...
09/16/2018 66
04/15/2018 66
07/24/2016 66
06/18/2017 66
09/01/2019 66
Name: count, Length: 471, dtype: int64
week_end
10/24/2020 131
05/15/2021 131
11/25/2023 131
09/17/2022 131
09/05/2020 131
...
09/22/2018 66
04/21/2018 66
07/30/2016 66
06/24/2017 66
09/07/2019 66
Name: count, Length: 471, dtype: int64
season
2020-2021 6812
2019-2020 6812
2023-2024 6812
2021-2022 6812
2022-2023 6812
2016-2017 3432
2017-2018 3432
2015-2016 3432
2018-2019 3432
2024-2025 393
Name: count, dtype: int64
data_source
ESSENCE 47710
ILINet 471
Name: count, dtype: int64
essence_category
CDC Influenza DD v1 9542
CDC Respiratory Syncytial Virus DD v1 9542
CDC Broad Acute Respiratory DD v1 9542
Influenza-like Illness 9542
CDC COVID-Specific DD v1 9542
Not Applicable 471
Name: count, dtype: int64
respiratory_category
ILI 10013
Influenza 9542
RSV 9542
Broad Acute Respiratory 9542
COVID-19 9542
Name: count, dtype: int64
visit_type
ED Visits 30615
Admissions 17095
Outpatient Visits 471
Name: count, dtype: int64
demographic_category
Age Group 22020
Race/Ethnicity 22020
ALL 4141
Name: count, dtype: int64
demographic_group
ALL 4141
65+ 3670
Asian Non-Hispanic 3670
45_64 3670
Age Unknown 3670
Other Race/Ethnicity 3670
Hispanic or Latino 3670
Unknown Race/Ethnicity 3670
18_44 3670
Black Non-Hispanic 3670
White Non-Hispanic 3670
05_17 3670
00_04 3670
Name: count, dtype: int64
percent
0.00 14089
0.02 641
0.05 537
0.07 496
0.03 476
...
28.46 1
35.93 1
21.96 1
22.14 1
34.50 1
Name: count, Length: 2827, dtype: int64
current_week_ending
10/05/2024 1050
10/19/2024 1050
10/12/2024 1050
10/26/2024 919
12/28/2024 919
02/15/2025 919
01/04/2025 919
05/03/2025 919
12/14/2024 919
11/23/2024 919
05/17/2025 919
04/05/2025 919
03/08/2025 919
11/30/2024 919
06/21/2025 919
09/06/2025 919
03/01/2025 919
01/11/2025 919
07/12/2025 919
12/21/2024 919
09/20/2025 919
03/22/2025 919
04/12/2025 919
04/19/2025 919
01/18/2025 919
02/01/2025 919
01/25/2025 919
08/09/2025 919
08/23/2025 919
09/27/2025 919
02/08/2025 919
07/19/2025 919
12/07/2024 919
06/28/2025 919
06/14/2025 919
09/13/2025 919
07/26/2025 919
03/15/2025 919
02/22/2025 919
11/16/2024 919
08/30/2025 919
07/05/2025 919
11/02/2024 919
05/10/2025 919
06/07/2025 919
05/24/2025 919
03/29/2025 919
04/26/2025 919
11/09/2024 919
08/16/2025 919
05/31/2025 919
08/02/2025 919
Name: count, dtype: int64
df.nunique()
mmwr_week 471
week 52
week_start 471
week_end 471
season 10
data_source 2
essence_category 6
respiratory_category 5
visit_type 3
demographic_category 3
demographic_group 13
percent 2827
current_week_ending 52
dtype: int64
df['is_influenza'] = df['respiratory_category'].apply(lambda x: 1 if x == 'Influenza' else 0)
df.is_influenza.value_counts()
is_influenza
0 38639
1 9542
Name: count, dtype: int64
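The apply/lambda definition above can also be written as a vectorized comparison. The sketch below shows that form on a small illustrative frame (the toy data is an assumption, not the real dataset) and adds a crosstab that makes one property of the target visible: each respiratory_category value maps to exactly one value of is_influenza.

```python
import pandas as pd

# small illustrative frame; the values mirror categories seen in the real column
toy = pd.DataFrame({
    'respiratory_category': ['Influenza', 'RSV', 'COVID-19', 'ILI', 'Influenza']
})

# vectorized equivalent of the apply/lambda definition
toy['is_influenza'] = (toy['respiratory_category'] == 'Influenza').astype(int)

# each row of the crosstab has a single non-zero cell
check = pd.crosstab(toy['respiratory_category'], toy['is_influenza'])
print(check)
```

Because is_influenza is derived directly from respiratory_category, a model that also uses respiratory_category as a predictor can reproduce the target exactly. This is worth keeping in mind when candidate formulas that include that column are fit.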
df.isna().sum()
mmwr_week 0
week 0
week_start 0
week_end 0
season 0
data_source 0
essence_category 0
respiratory_category 0
visit_type 0
demographic_category 0
demographic_group 0
percent 217
current_week_ending 0
is_influenza 0
dtype: int64
df_copy = df.copy()
df_copy = df_copy.dropna()
df_copy.isna().sum()
mmwr_week 0
week 0
week_start 0
week_end 0
season 0
data_source 0
essence_category 0
respiratory_category 0
visit_type 0
demographic_category 0
demographic_group 0
percent 0
current_week_ending 0
is_influenza 0
dtype: int64
df_copy = df_copy.drop(columns=['mmwr_week',
'week_start',
'week_end',
'data_source',
'current_week_ending'])
df_copy.shape
(47964, 9)
df_copy.dtypes
week int64
season object
essence_category object
respiratory_category object
visit_type object
demographic_category object
demographic_group object
percent float64
is_influenza int64
dtype: object
_ = [print(f'{df_copy[column].value_counts()}\n') for column in df_copy.columns]
week
41 1050
42 1046
40 1045
36 919
22 919
45 919
35 919
13 919
5 919
1 919
18 919
20 919
33 919
9 919
51 919
24 919
26 919
29 919
15 919
38 915
7 915
12 915
3 915
43 915
31 915
17 915
39 915
6 915
37 915
27 915
34 915
25 914
19 914
28 914
2 914
11 914
32 914
52 911
8 911
10 911
44 911
21 911
48 910
14 910
46 910
47 910
50 910
30 910
49 910
23 910
16 910
4 910
Name: count, dtype: int64
season
2022-2023 6812
2023-2024 6807
2021-2022 6807
2019-2020 6771
2020-2021 6762
2017-2018 3420
2015-2016 3420
2016-2017 3408
2018-2019 3364
2024-2025 393
Name: count, dtype: int64
essence_category
CDC COVID-Specific DD v1 9525
CDC Influenza DD v1 9492
CDC Respiratory Syncytial Virus DD v1 9492
CDC Broad Acute Respiratory DD v1 9492
Influenza-like Illness 9492
Not Applicable 471
Name: count, dtype: int64
respiratory_category
ILI 9963
COVID-19 9525
Influenza 9492
RSV 9492
Broad Acute Respiratory 9492
Name: count, dtype: int64
visit_type
ED Visits 30398
Admissions 17095
Outpatient Visits 471
Name: count, dtype: int64
demographic_category
Race/Ethnicity 22020
Age Group 21803
ALL 4141
Name: count, dtype: int64
demographic_group
ALL 4141
65+ 3670
Asian Non-Hispanic 3670
45_64 3670
Other Race/Ethnicity 3670
Hispanic or Latino 3670
Unknown Race/Ethnicity 3670
18_44 3670
Black Non-Hispanic 3670
White Non-Hispanic 3670
05_17 3670
00_04 3670
Age Unknown 3453
Name: count, dtype: int64
percent
0.00 14089
0.02 641
0.05 537
0.07 496
0.03 476
...
28.46 1
35.93 1
21.96 1
22.14 1
34.50 1
Name: count, Length: 2827, dtype: int64
is_influenza
0 38472
1 9492
Name: count, dtype: int64
df_copy.nunique()
week 52
season 10
essence_category 6
respiratory_category 5
visit_type 3
demographic_category 3
demographic_group 13
percent 2827
is_influenza 2
dtype: int64
- The dataset shows a fairly uniform distribution of data over the weeks of the year, but the percent variable is heavily skewed towards lower values, and the target variable is_influenza is highly imbalanced with far more non-influenza cases. This imbalance may affect the model’s ability to predict influenza cases accurately, leading to lower predicted probabilities in many scenarios.
[
(
sns.displot(data=df_copy,
x=column,
kind='hist',
bins=20,
kde=True),
plt.title(f'Distribution of {column.replace("_", " ").title()}', fontsize=16),
plt.show(),
plt.close()
)
for column in df_copy.select_dtypes(include='number').columns
]
plt.show()
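The skew in percent described above can also be quantified rather than read off the histogram. In the cleaned data, 14,089 of 47,964 rows have a percent of exactly 0.00 (about 29%). The sketch below computes the share of zeros and the sample skewness on a small hypothetical series shaped like that column. The values are made up for illustration.

```python
import pandas as pd

# hypothetical values shaped like the percent column: many zeros, few large values
percent = pd.Series([0.0, 0.0, 0.0, 0.0, 0.02, 0.05, 0.07, 1.5, 12.0, 34.5])

zero_share = (percent == 0).mean()  # fraction of exact zeros
skewness = percent.skew()           # positive values indicate a right-skewed distribution

print(f"share of zeros: {zero_share:.0%}, skewness: {skewness:.2f}")
```

Applied to df_copy['percent'], the same two calls would give a compact numeric summary to accompany the plots.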
- The data visualizations reveal that the dataset is dominated by certain seasons and visit types. The most represented seasons are from 2019 to 2024, with very few entries for 2024-2025. The majority of visit types are ED Visits, while Outpatient Visits are rare. The respiratory categories Influenza, RSV, and COVID-19 are evenly represented, and demographic categories and groups show a balanced count distribution, with ALL being the most represented group. This distribution suggests that the dataset is rich in clinical visits during these specific periods and categories.
[
(
sns.catplot(data=df_copy,
x=column,
kind='count').set_xticklabels(rotation=90),
plt.title(f'Count of {column.replace("_", " ").title()}', fontsize=16),
plt.show(),
plt.close()
)
if df_copy[column].nunique() <= 20
else (
sns.catplot(data=df_copy,
x=column,
kind='count').set(xticklabels=[]),
plt.title(f'Count of {column.replace("_", " ").title()}', fontsize=16),
plt.show(),
plt.close()
)
for column in df_copy.select_dtypes(exclude='number').columns
]
plt.show()
Visualize Combinations and Conditional Distributions of Categorical-to-Categorical Relationships
- The visualizations offer a detailed analysis of respiratory conditions, demographic categories, and visit types over flu seasons from 2015 to 2025. Key findings highlight consistently high counts of conditions like Influenza and RSV, with notable variations during the 2021-2022 and 2022-2023 seasons due to COVID-19. The charts also show that ED Visits account for the majority of records, while specific spikes in conditions like ILI and COVID-19 were observed across age groups, particularly in older populations (65+), and race/ethnicity categories. Overall, the data trends show stability, but pandemic-related seasons reflected significant shifts in case counts and admissions.
[
(
sns.catplot(data=df_copy,
x=x_col,
hue=hue_col,
kind='count',
aspect=1.5).set_xticklabels(rotation=90),
plt.title(
f'Count of {x_col.replace("_", " ").title()} by {hue_col.replace("_", " ").title()}',
fontsize=16
),
plt.show(),
plt.close()
)
for x_col, hue_col in itertools.combinations(df_copy.select_dtypes(include='object').columns, 2)
]
None
- The visualizations offer a detailed breakdown of respiratory conditions, visit types, and demographic trends across seasons from 2015 to 2025. Influenza and RSV consistently show high counts, with significant spikes in COVID-19 during the 2020-2023 seasons. The majority of cases occurred through Emergency Department (ED) visits, while admissions remained fewer in comparison. The demographic analysis highlights that older age groups, particularly those 65+, had higher case counts during peak pandemic years. The overall trends illustrate the considerable impact of respiratory illnesses during pandemic-related seasons, with a strong reliance on emergency care.
unique_pairs = list(itertools.combinations(
df_copy.select_dtypes(include='object').columns, 2
))
[
(
plt.figure(figsize=(8, 6)),
sns.heatmap(
pd.crosstab(df_copy[x_col], df_copy[hue_col]),
annot=True,
fmt='d',
cbar=False
),
plt.xticks(rotation=90, ha='right'),
plt.yticks(rotation=0),
plt.title(
f'Counts of {x_col.replace("_", " ").title()} by {hue_col.replace("_", " ").title()}',
fontsize=16
),
plt.tight_layout(),
plt.show(),
plt.close()
)
for x_col, hue_col in unique_pairs
]
None
- The visualizations provide a comprehensive analysis of respiratory conditions, visit types, and demographic trends across various seasons, from 2015 to 2025. They show consistent week distributions across seasons, with notable differences for the 2024-2025 season. Percent variability is more pronounced, particularly in recent years like 2021-2022 and 2022-2023, indicating spikes in cases for certain conditions. Broad Acute Respiratory and COVID-19 stand out with higher percentage variability across essence and respiratory categories. ED visits dominate among visit types, with notable outliers in percent trends, while admissions and outpatient visits show fewer percent spikes. Demographic analysis reveals consistency in week distributions across groups but highlights outliers in percent for younger (00-04) and older (65+) age groups, particularly in respiratory conditions. These insights emphasize stability in week reporting but variability in percent outliers during peak illness periods, especially during the pandemic.
[
(
sns.catplot(
data=df_copy,
x=x_col,
y=y_var,
kind='box',
aspect=1.5
).set_xticklabels(rotation=90),
plt.title(
f'{y_var.capitalize()} vs. {x_col.replace("_", " ").title()}',
fontsize=16
),
plt.show(),
plt.close()
)
for x_col in df_copy.select_dtypes(include='object').columns
for y_var in df_copy.select_dtypes(include='number').columns
if y_var != 'is_influenza'
]
None
[
(
sns.catplot(
data=df_copy,
x=x_col,
y=y_var,
kind='violin',
aspect=1.5
).set_xticklabels(rotation=90),
plt.title(
f'{y_var.capitalize()} vs. {x_col.replace("_", " ").title()}',
fontsize=16
),
plt.show(),
plt.close()
)
for x_col in df_copy.select_dtypes(include='object').columns
for y_var in df_copy.select_dtypes(include='number').columns
if y_var != 'is_influenza'
]
None
[
(
sns.catplot(
data=df_copy,
x=x_col,
y=y_var,
kind='point',
linestyles='',
aspect=1.5
).set_xticklabels(rotation=90),
plt.title(
f'{y_var.capitalize()} vs. {x_col.replace("_", " ").title()}',
fontsize=16
),
plt.show(),
plt.close()
)
for x_col in df_copy.select_dtypes(include='object').columns
for y_var in df_copy.select_dtypes(include='number').columns
if y_var != 'is_influenza'
]
None
- The pairplot visualizes the relationship between the numerical variables week and percent (is_influenza is dropped from the pairplot but included in the correlation heatmap). It shows the distribution of each variable along the diagonal and scatter plots illustrating their interaction. The data displays significant clustering of low percentage values, while the week variable shows a consistent distribution. The correlation heatmap highlights minimal correlations between the variables. Week and percent have a near-zero correlation (0.01), while percent and is_influenza show a weak positive correlation (0.21). This suggests that these variables do not have strong linear relationships with one another.
pairplot = sns.pairplot(
data=df_copy.select_dtypes(include='number').drop(columns=['is_influenza']),
aspect=1.5
)
pairplot.fig.suptitle('Pairplot of Numerical Variables', fontsize=16)
pairplot.fig.subplots_adjust(top=0.85)
plt.show()
fig, ax = plt.subplots()
sns.heatmap(
data=df_copy.select_dtypes(include='number').corr(numeric_only=True),
vmin=-1,
vmax=1,
center=0,
cbar=False,
annot=True,
annot_kws={'size': 7},
fmt='.2f',
ax=ax
)
plt.title('Correlation Heatmap of Numerical Variables', fontsize=16)
plt.show()
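The 0.21 between percent and is_influenza is a Pearson correlation between a continuous and a binary variable, which is the point-biserial correlation. It reflects the difference between the two group means, scaled by the overall spread. The sketch below checks that equivalence on hypothetical data. The arrays are made up for illustration.

```python
import numpy as np

# hypothetical data: a binary flag and a continuous measurement
flag = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
value = np.array([0.1, 0.0, 0.3, 0.2, 0.9, 0.4, 0.0, 1.2, 0.5, 0.7])

# Pearson correlation, the same statistic DataFrame.corr() reports by default
pearson_r = np.corrcoef(flag, value)[0, 1]

# point-biserial form: difference of group means scaled by the overall spread
mean1 = value[flag == 1].mean()
mean0 = value[flag == 0].mean()
p = flag.mean()  # share of rows with flag == 1
point_biserial = (mean1 - mean0) / value.std() * np.sqrt(p * (1 - p))

print(np.isclose(pearson_r, point_biserial))  # True
```

Read this way, a weak positive value indicates that the mean of percent is somewhat higher among influenza rows than among the rest, relative to the overall spread of percent.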
- The visualizations show the relationship between percentage and week across various categories, such as season, essence category, respiratory category, visit type, and demographic categories/groups. In the “Percent vs. Week by Season” chart, there’s a clear upward trend in percentage as the weeks progress, especially for recent seasons like 2021-2022 and 2022-2023. Similar trends are observed across essence categories, where Broad Acute Respiratory and COVID-19 show distinct trends compared to others. The “Percent vs. Week by Visit Type” highlights that emergency department (ED) visits show the most consistent increase over time, while outpatient visits show a more sporadic trend. When broken down by demographic categories, there’s a distinction in trends between age groups and race/ethnicity, with certain groups like 65+ and younger demographics showing more variability over time. These visualizations reflect how respiratory illness trends evolve over weeks and across different population segments, highlighting significant patterns during peak weeks of illness.
numerical_vars = [
col for col in df_copy.select_dtypes(include=['number']).columns
if col != 'is_influenza'
]
[
(
sns.lmplot(
data=df_copy,
x=x_var,
y=y_var,
hue=cat_var,
line_kws={'linewidth': 2},
scatter_kws={'alpha': 0.5},
height=5,
aspect=1.5,
),
plt.title(
f'{y_var.capitalize()} vs. {x_var.capitalize()} by {cat_var.replace("_", " ").title()}',
fontsize=16
),
plt.xticks(rotation=90),
plt.ylabel(y_var.capitalize(), rotation=0, labelpad=30),
plt.xlabel(x_var.capitalize()),
plt.show(),
plt.close()
)
for cat_var in df_copy.select_dtypes(include='object').columns
for x_var, y_var in itertools.combinations(numerical_vars, 2)
]
None
- The visualizations explore the relationship between categorical variables (such as demographic group, essence category, visit type, and season) and continuous responses (week and percent) by segmenting them based on Influenza Status (0 for non-influenza, 1 for influenza). The analysis reveals that week distributions remain consistent across different groups, categories, and influenza status, showing no significant timing shifts between influenza-positive and negative cases. However, the percent variable shows more variability, with non-influenza cases generally displaying wider ranges, especially in categories like Broad Acute Respiratory and visit types like ED visits. Specific demographic groups, such as the very young (00-04) and older adults (65+), exhibit more pronounced percentage spikes, indicating potential areas of differentiation between influenza and non-influenza cases. Violin and point plots further emphasize these patterns, particularly showing a higher concentration of non-influenza cases in certain groups and across different weeks, making percent a more distinctive factor in predicting influenza status.
[
(
sns.catplot(
data=df_copy,
kind='box',
x=var,
y=y_var,
hue='is_influenza',
height=8,
aspect=1.5
),
plt.xticks(rotation=90),
plt.ylabel(y_var.replace('_', ' ').capitalize(), rotation=0, labelpad=30),
plt.title(
f'Box Plot of {y_var.replace("_", " ").capitalize()} by {var.replace("_", " ").title()} and Influenza Status',
fontsize=16
),
plt.xlabel(var.replace('_', ' ').capitalize()),
plt.show(),
plt.close()
)
for var in df_copy.select_dtypes(include='object').columns
for y_var in df_copy.select_dtypes(include='number').columns
if y_var != 'is_influenza' and y_var in df_copy.columns
]
None
[
(
sns.catplot(
data=df_copy,
kind='violin',
x=var,
y=y_var,
hue='is_influenza',
height=8,
aspect=1.5
),
plt.xticks(rotation=90),
plt.ylabel(y_var.replace('_', ' ').capitalize(), rotation=0, labelpad=30),
plt.title(
f'Violin Plot of {y_var.replace("_", " ").capitalize()} by {var.replace("_", " ").title()} and Influenza Status',
fontsize=16
),
plt.xlabel(var.replace('_', ' ').capitalize()),
plt.show(),
plt.close()
)
for var in df_copy.select_dtypes(include='object').columns
for y_var in df_copy.select_dtypes(include='number').columns
if y_var != 'is_influenza' and y_var in df_copy.columns
]
None
[
(
sns.catplot(
data=df_copy,
kind='point',
linestyles='',
x=var,
y=y_var,
hue='is_influenza',
height=8,
aspect=1.5
),
plt.xticks(rotation=90),
plt.ylabel(y_var.replace('_', ' ').capitalize(), rotation=0, labelpad=30),
plt.title(
f'Point Plot of {y_var.replace("_", " ").capitalize()} by {var.replace("_", " ").title()} and Influenza Status',
fontsize=16
),
plt.xlabel(var.replace('_', ' ').capitalize()),
plt.show(),
plt.close()
)
for var in df_copy.select_dtypes(include='object').columns
for y_var in df_copy.select_dtypes(include='number').columns
if y_var != 'is_influenza' and y_var in df_copy.columns
]
None
def generate_formulas(numeric_vars, categorical_vars, target):
    """
    Generates combinations of additive, interaction, and polynomial terms (quadratic, cubic, and quartic)
    for a binary classification target using numeric and categorical variables.
    Params:
        numeric_vars: list
        categorical_vars: list
        target: str
    Returns:
        list
    """
    # initialize an empty list to hold the formulas
    formulas = []
    # helper function to wrap categorical variables with C()
    def wrap_categorical(var_list):
        """
        Wraps categorical variables in the statsmodels C() function.
        This function takes a list of variable names and checks if each variable is
        categorical. If the variable is categorical (found in the `categorical_vars` list),
        it wraps the variable in `C()` for use with the formula interface in statsmodels.
        Otherwise, it returns the variable as-is for numeric variables.
        Params:
            var_list: list
        Returns:
            list
        """
        return [f'C({var})' if var in categorical_vars else var for var in var_list]
    # 1. additive formulas (all combinations of 2 or more variables)
    for r in range(2, len(categorical_vars + numeric_vars) + 1):
        for combo in itertools.combinations(categorical_vars + numeric_vars, r):
            terms = wrap_categorical(combo)
            formulas.append(f'{target} ~ {" + ".join(terms)}')
    # 2. interaction terms (all combinations of 2 or more variables)
    for r in range(2, len(categorical_vars + numeric_vars) + 1):
        for combo in itertools.combinations(categorical_vars + numeric_vars, r):
            terms = wrap_categorical(combo)
            formulas.append(f'{target} ~ {" * ".join(terms)}')
    # 3. polynomial terms (quadratic, cubic, quartic) for numeric variables
    for var in numeric_vars:
        formulas.append(f'{target} ~ {var} + I({var}**2)')
        formulas.append(f'{target} ~ {var} + I({var}**2) + I({var}**3)')
        formulas.append(f'{target} ~ {var} + I({var}**2) + I({var}**3) + I({var}**4)')
    # 4. additive and interaction combinations with polynomials for each pair of numeric and categorical variables
    for num_var in numeric_vars:
        for cat_var in categorical_vars:
            formulas.append(f'{target} ~ {num_var} + I({num_var}**2) + C({cat_var})')
            formulas.append(f'{target} ~ {num_var} * I({num_var}**2) * C({cat_var})')
    # 5. combinations of multiple categorical and numeric variables with interaction and polynomials
    for cat1, cat2 in itertools.combinations(categorical_vars, 2):
        for num_var in numeric_vars:
            formulas.append(f'{target} ~ C({cat1}) + C({cat2}) + {num_var} + I({num_var}**2)')
            formulas.append(f'{target} ~ C({cat1}) * C({cat2}) * {num_var} * I({num_var}**2)')
    # 6. full interaction with polynomial terms across all numeric and categorical variables
    for r in range(2, len(categorical_vars + numeric_vars) + 1):
        for combo in itertools.combinations(categorical_vars + numeric_vars, r):
            numeric_in_combo = [var for var in combo if var in numeric_vars]
            categorical_in_combo = [var for var in combo if var in categorical_vars]
            if numeric_in_combo:  # if numeric variables are present, include polynomial terms
                for num_var in numeric_in_combo:
                    terms = wrap_categorical(combo)
                    formulas.append(f'{target} ~ {" * ".join(terms)} + I({num_var}**2) + I({num_var}**3) + I({num_var}**4)')
    # return the formulas list
    return formulas
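As a sanity check on the size of the search space, the number of list entries that generate_formulas produces can be worked out from its six steps. The sketch below does that count in closed form. The function expected_formula_count is a hypothetical helper, not part of the original notebook, and it counts list entries without checking for duplicates.

```python
from math import comb


def expected_formula_count(n_numeric, n_categorical):
    """Number of formulas appended by the six steps of generate_formulas, summed."""
    n_total = n_numeric + n_categorical
    # steps 1 and 2: every subset of size >= 2, once additive and once multiplicative
    subsets = sum(comb(n_total, r) for r in range(2, n_total + 1))
    # step 3: three polynomial formulas per numeric variable
    polynomial = 3 * n_numeric
    # step 4: two formulas per (numeric, categorical) pair
    num_cat_pairs = 2 * n_numeric * n_categorical
    # step 5: two formulas per (pair of categoricals, numeric) combination
    cat_pairs = 2 * comb(n_categorical, 2) * n_numeric
    # step 6: one formula per numeric variable in each subset of size >= 2 containing it
    with_polynomials = n_numeric * (2 ** (n_total - 1) - 1)
    return 2 * subsets + polynomial + num_cat_pairs + cat_pairs + with_polynomials


print(expected_formula_count(2, 6))  # 838
```

With the two numeric and six categorical variables used in this notebook, the list holds 838 candidate formulas, which gives a sense of how many models a full sweep would fit.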
numeric_vars = ['week',
'percent']
categorical_vars = ['season',
'essence_category',
'respiratory_category',
'visit_type',
'demographic_category',
'demographic_group']
target = 'is_influenza'
formulas = generate_formulas(numeric_vars, categorical_vars, target)
formulas
['is_influenza ~ C(season) + C(essence_category)',
'is_influenza ~ C(season) + C(respiratory_category)',
'is_influenza ~ C(season) + C(visit_type)',
'is_influenza ~ C(season) + C(demographic_category)',
'is_influenza ~ C(season) + C(demographic_group)',
'is_influenza ~ C(season) + week',
'is_influenza ~ C(season) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category)',
'is_influenza ~ C(essence_category) + C(visit_type)',
'is_influenza ~ C(essence_category) + C(demographic_category)',
'is_influenza ~ C(essence_category) + C(demographic_group)',
'is_influenza ~ C(essence_category) + week',
'is_influenza ~ C(essence_category) + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type)',
'is_influenza ~ C(respiratory_category) + C(demographic_category)',
'is_influenza ~ C(respiratory_category) + C(demographic_group)',
'is_influenza ~ C(respiratory_category) + week',
'is_influenza ~ C(respiratory_category) + percent',
'is_influenza ~ C(visit_type) + C(demographic_category)',
'is_influenza ~ C(visit_type) + C(demographic_group)',
'is_influenza ~ C(visit_type) + week',
'is_influenza ~ C(visit_type) + percent',
'is_influenza ~ C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(demographic_category) + week',
'is_influenza ~ C(demographic_category) + percent',
'is_influenza ~ C(demographic_group) + week',
'is_influenza ~ C(demographic_group) + percent',
'is_influenza ~ week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category)',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type)',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category)',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + week',
'is_influenza ~ C(season) + C(essence_category) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type)',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category)',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(respiratory_category) + week',
'is_influenza ~ C(season) + C(respiratory_category) + percent',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category)',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_group)',
'is_influenza ~ C(season) + C(visit_type) + week',
'is_influenza ~ C(season) + C(visit_type) + percent',
'is_influenza ~ C(season) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(demographic_group) + percent',
'is_influenza ~ C(season) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_group)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category)',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_group)',
'is_influenza ~ C(essence_category) + C(visit_type) + week',
'is_influenza ~ C(essence_category) + C(visit_type) + percent',
'is_influenza ~ C(essence_category) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(essence_category) + C(demographic_category) + week',
'is_influenza ~ C(essence_category) + C(demographic_category) + percent',
'is_influenza ~ C(essence_category) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + week + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category)',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_group)',
'is_influenza ~ C(respiratory_category) + C(visit_type) + week',
'is_influenza ~ C(respiratory_category) + C(visit_type) + percent',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + week',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + percent',
'is_influenza ~ C(respiratory_category) + C(demographic_group) + week',
'is_influenza ~ C(respiratory_category) + C(demographic_group) + percent',
'is_influenza ~ C(respiratory_category) + week + percent',
'is_influenza ~ C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(visit_type) + week + percent',
'is_influenza ~ C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(demographic_category) + week + percent',
'is_influenza ~ C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type)',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category)',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category)',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + week',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category)',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_group)',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + week',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + week + percent',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(visit_type) + week + percent',
'is_influenza ~ C(season) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + week + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + week + percent',
'is_influenza ~ C(essence_category) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + C(demographic_category) + week + percent',
'is_influenza ~ C(essence_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + week + percent',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + week + percent',
'is_influenza ~ C(respiratory_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category)',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) + C(essence_category) + C(respiratory_category) + C(visit_type) + C(demographic_category) + C(demographic_group) + week + percent',
'is_influenza ~ C(season) * C(essence_category)',
'is_influenza ~ C(season) * C(respiratory_category)',
'is_influenza ~ C(season) * C(visit_type)',
'is_influenza ~ C(season) * C(demographic_category)',
'is_influenza ~ C(season) * C(demographic_group)',
'is_influenza ~ C(season) * week',
'is_influenza ~ C(season) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category)',
'is_influenza ~ C(essence_category) * C(visit_type)',
'is_influenza ~ C(essence_category) * C(demographic_category)',
'is_influenza ~ C(essence_category) * C(demographic_group)',
'is_influenza ~ C(essence_category) * week',
'is_influenza ~ C(essence_category) * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type)',
'is_influenza ~ C(respiratory_category) * C(demographic_category)',
'is_influenza ~ C(respiratory_category) * C(demographic_group)',
'is_influenza ~ C(respiratory_category) * week',
'is_influenza ~ C(respiratory_category) * percent',
'is_influenza ~ C(visit_type) * C(demographic_category)',
'is_influenza ~ C(visit_type) * C(demographic_group)',
'is_influenza ~ C(visit_type) * week',
'is_influenza ~ C(visit_type) * percent',
'is_influenza ~ C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(demographic_category) * week',
'is_influenza ~ C(demographic_category) * percent',
'is_influenza ~ C(demographic_group) * week',
'is_influenza ~ C(demographic_group) * percent',
'is_influenza ~ week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * week',
'is_influenza ~ C(season) * C(essence_category) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(respiratory_category) * week',
'is_influenza ~ C(season) * C(respiratory_category) * percent',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group)',
'is_influenza ~ C(season) * C(visit_type) * week',
'is_influenza ~ C(season) * C(visit_type) * percent',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(demographic_group) * percent',
'is_influenza ~ C(season) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group)',
'is_influenza ~ C(essence_category) * C(visit_type) * week',
'is_influenza ~ C(essence_category) * C(visit_type) * percent',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(essence_category) * C(demographic_category) * week',
'is_influenza ~ C(essence_category) * C(demographic_category) * percent',
'is_influenza ~ C(essence_category) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * week * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * week',
'is_influenza ~ C(respiratory_category) * C(visit_type) * percent',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * week',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * percent',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * week',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * percent',
'is_influenza ~ C(respiratory_category) * week * percent',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(visit_type) * week * percent',
'is_influenza ~ C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(demographic_category) * week * percent',
'is_influenza ~ C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * week',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * week',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * week * percent',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(visit_type) * week * percent',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * week * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * week * percent',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * C(demographic_category) * week * percent',
'is_influenza ~ C(essence_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * week * percent',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * week * percent',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent',
'is_influenza ~ week + I(week**2)',
'is_influenza ~ week + I(week**2) + I(week**3)',
'is_influenza ~ week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ percent + I(percent**2)',
'is_influenza ~ percent + I(percent**2) + I(percent**3)',
'is_influenza ~ percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ week + I(week**2) + C(season)',
'is_influenza ~ week * I(week**2) * C(season)',
'is_influenza ~ week + I(week**2) + C(essence_category)',
'is_influenza ~ week * I(week**2) * C(essence_category)',
'is_influenza ~ week + I(week**2) + C(respiratory_category)',
'is_influenza ~ week * I(week**2) * C(respiratory_category)',
'is_influenza ~ week + I(week**2) + C(visit_type)',
'is_influenza ~ week * I(week**2) * C(visit_type)',
'is_influenza ~ week + I(week**2) + C(demographic_category)',
'is_influenza ~ week * I(week**2) * C(demographic_category)',
'is_influenza ~ week + I(week**2) + C(demographic_group)',
'is_influenza ~ week * I(week**2) * C(demographic_group)',
'is_influenza ~ percent + I(percent**2) + C(season)',
'is_influenza ~ percent * I(percent**2) * C(season)',
'is_influenza ~ percent + I(percent**2) + C(essence_category)',
'is_influenza ~ percent * I(percent**2) * C(essence_category)',
'is_influenza ~ percent + I(percent**2) + C(respiratory_category)',
'is_influenza ~ percent * I(percent**2) * C(respiratory_category)',
'is_influenza ~ percent + I(percent**2) + C(visit_type)',
'is_influenza ~ percent * I(percent**2) * C(visit_type)',
'is_influenza ~ percent + I(percent**2) + C(demographic_category)',
'is_influenza ~ percent * I(percent**2) * C(demographic_category)',
'is_influenza ~ percent + I(percent**2) + C(demographic_group)',
'is_influenza ~ percent * I(percent**2) * C(demographic_group)',
'is_influenza ~ C(season) + C(essence_category) + week + I(week**2)',
'is_influenza ~ C(season) * C(essence_category) * week * I(week**2)',
'is_influenza ~ C(season) + C(essence_category) + percent + I(percent**2)',
'is_influenza ~ C(season) * C(essence_category) * percent * I(percent**2)',
'is_influenza ~ C(season) + C(respiratory_category) + week + I(week**2)',
'is_influenza ~ C(season) * C(respiratory_category) * week * I(week**2)',
'is_influenza ~ C(season) + C(respiratory_category) + percent + I(percent**2)',
'is_influenza ~ C(season) * C(respiratory_category) * percent * I(percent**2)',
'is_influenza ~ C(season) + C(visit_type) + week + I(week**2)',
'is_influenza ~ C(season) * C(visit_type) * week * I(week**2)',
'is_influenza ~ C(season) + C(visit_type) + percent + I(percent**2)',
'is_influenza ~ C(season) * C(visit_type) * percent * I(percent**2)',
'is_influenza ~ C(season) + C(demographic_category) + week + I(week**2)',
'is_influenza ~ C(season) * C(demographic_category) * week * I(week**2)',
'is_influenza ~ C(season) + C(demographic_category) + percent + I(percent**2)',
'is_influenza ~ C(season) * C(demographic_category) * percent * I(percent**2)',
'is_influenza ~ C(season) + C(demographic_group) + week + I(week**2)',
'is_influenza ~ C(season) * C(demographic_group) * week * I(week**2)',
'is_influenza ~ C(season) + C(demographic_group) + percent + I(percent**2)',
'is_influenza ~ C(season) * C(demographic_group) * percent * I(percent**2)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + week + I(week**2)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * week * I(week**2)',
'is_influenza ~ C(essence_category) + C(respiratory_category) + percent + I(percent**2)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * percent * I(percent**2)',
'is_influenza ~ C(essence_category) + C(visit_type) + week + I(week**2)',
'is_influenza ~ C(essence_category) * C(visit_type) * week * I(week**2)',
'is_influenza ~ C(essence_category) + C(visit_type) + percent + I(percent**2)',
'is_influenza ~ C(essence_category) * C(visit_type) * percent * I(percent**2)',
'is_influenza ~ C(essence_category) + C(demographic_category) + week + I(week**2)',
'is_influenza ~ C(essence_category) * C(demographic_category) * week * I(week**2)',
'is_influenza ~ C(essence_category) + C(demographic_category) + percent + I(percent**2)',
'is_influenza ~ C(essence_category) * C(demographic_category) * percent * I(percent**2)',
'is_influenza ~ C(essence_category) + C(demographic_group) + week + I(week**2)',
'is_influenza ~ C(essence_category) * C(demographic_group) * week * I(week**2)',
'is_influenza ~ C(essence_category) + C(demographic_group) + percent + I(percent**2)',
'is_influenza ~ C(essence_category) * C(demographic_group) * percent * I(percent**2)',
'is_influenza ~ C(respiratory_category) + C(visit_type) + week + I(week**2)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * week * I(week**2)',
'is_influenza ~ C(respiratory_category) + C(visit_type) + percent + I(percent**2)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * percent * I(percent**2)',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + week + I(week**2)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * week * I(week**2)',
'is_influenza ~ C(respiratory_category) + C(demographic_category) + percent + I(percent**2)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * percent * I(percent**2)',
'is_influenza ~ C(respiratory_category) + C(demographic_group) + week + I(week**2)',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * week * I(week**2)',
'is_influenza ~ C(respiratory_category) + C(demographic_group) + percent + I(percent**2)',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * percent * I(percent**2)',
'is_influenza ~ C(visit_type) + C(demographic_category) + week + I(week**2)',
'is_influenza ~ C(visit_type) * C(demographic_category) * week * I(week**2)',
'is_influenza ~ C(visit_type) + C(demographic_category) + percent + I(percent**2)',
'is_influenza ~ C(visit_type) * C(demographic_category) * percent * I(percent**2)',
'is_influenza ~ C(visit_type) + C(demographic_group) + week + I(week**2)',
'is_influenza ~ C(visit_type) * C(demographic_group) * week * I(week**2)',
'is_influenza ~ C(visit_type) + C(demographic_group) + percent + I(percent**2)',
'is_influenza ~ C(visit_type) * C(demographic_group) * percent * I(percent**2)',
'is_influenza ~ C(demographic_category) + C(demographic_group) + week + I(week**2)',
'is_influenza ~ C(demographic_category) * C(demographic_group) * week * I(week**2)',
'is_influenza ~ C(demographic_category) + C(demographic_group) + percent + I(percent**2)',
'is_influenza ~ C(demographic_category) * C(demographic_group) * percent * I(percent**2)',
'is_influenza ~ C(season) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(week**2) + I(week**3) + I(week**4)',
'is_influenza ~ C(season) * C(essence_category) * C(respiratory_category) * C(visit_type) * C(demographic_category) * C(demographic_group) * week * percent + I(percent**2) + I(percent**3) + I(percent**4)']
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=5,
shuffle=True,
random_state=101)
input_names = df_copy.drop(columns=[target]).\
copy().\
columns.\
to_list()
output_name = target
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
def my_coefplot(model, figsize_default=(10, 4), figsize_expansion_factor=0.5, max_default_vars=10):
    """
    Function that plots a coefficient plot with error bars for a given statistical model
    and prints out which variables are statistically significant and whether they are positive or negative.
    The graph height dynamically adjusts based on the number of variables.
    Params:
        model: object, fitted statsmodels results exposing params, bse and pvalues
        figsize_default: tuple, optional, base (width, height) of the figure
        figsize_expansion_factor: float, optional, extra height per variable beyond max_default_vars
        max_default_vars: int, optional, number of variables that fit in the base height
    """
    # cap the standard errors (bse) to avoid overly large error bars, upper bound set to 2
    capped_bse = model.bse.clip(upper=2)
    # calculate the minimum and maximum coefficient values adjusted by the standard errors
    coef_min = (model.params - 2 * capped_bse).min()
    coef_max = (model.params + 2 * capped_bse).max()
    # dynamically calculate figure height based on the number of variables
    num_vars = len(model.params)
    if num_vars > max_default_vars:
        height = figsize_default[1] + figsize_expansion_factor * (num_vars - max_default_vars)
    else:
        height = figsize_default[1]
    # create the plot
    fig, ax = plt.subplots(figsize=(figsize_default[0], height))
    # identify statistically significant and non-significant variables based on p-values
    significant_vars = model.pvalues[model.pvalues < 0.05].index
    not_significant_vars = model.pvalues[model.pvalues >= 0.05].index
    # plot non-significant variables with grey error bars
    ax.errorbar(y=not_significant_vars,
                x=model.params[not_significant_vars],
                xerr=2 * capped_bse[not_significant_vars],
                fmt='o',
                color='grey',
                ecolor='grey',
                elinewidth=2,
                ms=10,
                label='not significant')
    # plot significant variables with red error bars
    ax.errorbar(y=significant_vars,
                x=model.params[significant_vars],
                xerr=2 * capped_bse[significant_vars],
                fmt='o',
                color='red',
                ecolor='red',
                elinewidth=2,
                ms=10,
                label='significant (p < 0.05)')
    # add a vertical line at 0 to visually separate positive and negative coefficients
    ax.axvline(x=0, linestyle='--', linewidth=2.5, color='grey')
    # set the x-axis limits with a small buffer, always showing at least [-0.5, 0.5]
    ax.set_xlim(min(-0.5, coef_min - 0.2), max(0.5, coef_max + 0.2))
    ax.set_xlabel('coefficient value')
    # add legend to distinguish between significant and non-significant variables
    ax.legend()
    # show the plot
    plt.show()
    # print the summary of statistically significant variables
    print('\n--- statistically significant variables ---')
    # check if there are any significant variables, if not, print a message
    if significant_vars.empty:
        print('No statistically significant variables found.')
    else:
        # for each significant variable, print its coefficient, standard error, p-value, and direction
        for var in significant_vars:
            coef_value = model.params[var]
            std_err = model.bse[var]
            p_val = model.pvalues[var]
            direction = 'positive' if coef_value > 0 else 'negative'
            print(f'variable: {var}, coefficient: {coef_value:.4f}, std err: {std_err:.4f}, p-value: {p_val:.4f}, direction: {direction}')
def train_and_test_logistic_with_cv(model, formula, df, x_names, y_name, cv, threshold=0.5, use_scaler=False):
    """
    Function to train and test a logistic binary classification model with Cross-Validation,
    including accuracy and ROC AUC score calculations.
    Params:
        model: object, identifier of the model (here the index of the formula)
        formula: str, statsmodels formula to fit
        df: object, DataFrame holding the inputs and the output
        x_names: list, names of the input columns
        y_name: str, name of the binary output column
        cv: object, cross-validation splitter such as StratifiedKFold
        threshold: float, optional, probability cutoff for the positive class
        use_scaler: bool, optional, standardize numeric inputs within each fold
    Returns:
        object, DataFrame with per-fold accuracy and ROC AUC for the training and testing sets
    """
    # separate the inputs and output
    input_df = df.loc[:, x_names].copy()
    # initialize the performance metric storage lists
    train_res = []
    test_res = []
    train_auc_scores = []
    test_auc_scores = []
    # split the data and iterate over the folds
    for train_id, test_id in cv.split(input_df.to_numpy(), df[y_name].to_numpy()):
        # subset the training and test splits within each fold
        train_data = df.iloc[train_id, :].copy()
        test_data = df.iloc[test_id, :].copy()
        # if the use_scaler flag is set, standardize the numeric features within each fold
        if use_scaler:
            scaler = StandardScaler()
            # identify numeric columns to scale, excluding the target variable
            columns_to_scale = train_data.select_dtypes(include=[np.number]).columns.tolist()
            columns_to_scale = [col for col in columns_to_scale if col != y_name]
            # fit scaler on training data only, so no test information leaks into the fit
            scaler.fit(train_data[columns_to_scale])
            # transform training and test data
            train_data[columns_to_scale] = scaler.transform(train_data[columns_to_scale])
            test_data[columns_to_scale] = scaler.transform(test_data[columns_to_scale])
        # fit the model on the training data within the current fold
        a_model = smf.logit(formula=formula, data=train_data).fit()
        # predict the training within each fold
        train_copy = train_data.copy()
        train_copy['pred_probability'] = a_model.predict(train_data)
        train_copy['pred_class'] = np.where(train_copy.pred_probability > threshold, 1, 0)
        # predict the testing within each fold
        test_copy = test_data.copy()
        test_copy['pred_probability'] = a_model.predict(test_data)
        test_copy['pred_class'] = np.where(test_copy.pred_probability > threshold, 1, 0)
        # calculate the performance metric (accuracy) on the training set within the fold
        train_res.append(np.mean(train_copy[y_name] == train_copy.pred_class))
        # calculate the performance metric (accuracy) on the testing set within the fold
        test_res.append(np.mean(test_copy[y_name] == test_copy.pred_class))
        # calculate the roc_auc_score for the training set
        train_auc_scores.append(roc_auc_score(train_copy[y_name], train_copy['pred_probability']))
        # calculate the roc_auc_score for the testing set
        test_auc_scores.append(roc_auc_score(test_copy[y_name], test_copy['pred_probability']))
    # book keeping to store the results (accuracy and ROC AUC)
    train_df = pd.DataFrame({'accuracy': train_res, 'roc_auc': train_auc_scores})
    train_df['from_set'] = 'training'
    train_df['fold_id'] = train_df.index + 1
    test_df = pd.DataFrame({'accuracy': test_res, 'roc_auc': test_auc_scores})
    test_df['from_set'] = 'testing'
    test_df['fold_id'] = test_df.index + 1
    # combine the splits together
    res_df = pd.concat([train_df, test_df], ignore_index=True)
    # add information about the model
    res_df['model'] = model
    res_df['formula'] = formula
    # the coefficient count is taken from the model fitted on the last fold
    res_df['num_coefs'] = len(a_model.params)
    res_df['threshold'] = threshold
    # return the results DataFrame
    return res_df
df_copy[input_names].describe()
statistic | week | percent |
---|---|---|
count | 47964.000000 | 47964.000000 |
mean | 26.610437 | 3.307313 |
std | 14.998518 | 6.332838 |
min | 1.000000 | 0.000000 |
25% | 14.000000 | 0.000000 |
50% | 27.000000 | 0.480000 |
75% | 40.000000 | 3.340000 |
max | 52.000000 | 100.000000 |
sns.catplot(data=df_copy.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['rowid'],
value_vars=df_copy[input_names].select_dtypes(include=['number']).columns),
x='variable',
y='value',
kind='box',
aspect=3)
plt.title('Box Plots of All Numeric Variables for Comparison')
plt.xticks(rotation=90)
plt.show()
sns.catplot(data=pd.DataFrame(StandardScaler().\
fit_transform(df_copy[input_names].\
select_dtypes(include=['number'])),
columns=df_copy[input_names].\
select_dtypes(include=['number']).\
columns).reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['rowid'],
value_vars=df_copy[input_names].\
select_dtypes(include=['number']).\
columns),
x='variable',
y='value',
kind='box', aspect=3)
plt.title('Box Plots of All Numeric Variables (After Scaling)')
plt.xticks(rotation=90)
plt.show()
import os
import contextlib
res_list = []
error_log = []
# silence the optimizer output; the devnull handle is closed when the block exits
with open(os.devnull, 'w') as devnull, \
        contextlib.redirect_stdout(devnull), \
        contextlib.redirect_stderr(devnull):
    for model in range(min(125, len(formulas))):
        try:
            res_list.append(train_and_test_logistic_with_cv(
                model,
                formula=formulas[model],
                df=df_copy,
                x_names=input_names,
                y_name=output_name,
                cv=kf,
                use_scaler=True))
        except Exception as e:
            # record formulas that fail to fit instead of aborting the whole loop
            error_log.append(f'Formula ID {model} failed: {str(e)}')
cv_results = pd.concat(res_list, ignore_index=True)
cv_results.loc[(cv_results['from_set'] == 'testing') &
(cv_results['accuracy'] < 1.0) &
(cv_results['roc_auc'] < 1.0)].\
groupby('model').\
aggregate({'accuracy': 'mean',
'roc_auc': 'mean',
'num_coefs': 'first'}).\
reset_index().\
sort_values(by='accuracy', ascending=False)
index | model | accuracy | roc_auc | num_coefs |
---|---|---|---|---|
33 | 96 | 0.950485 | 0.999744 | 28 |
34 | 99 | 0.844261 | 0.974864 | 18 |
32 | 83 | 0.802102 | 0.697427 | 15 |
24 | 75 | 0.802102 | 0.495022 | 6 |
25 | 76 | 0.802102 | 0.693318 | 6 |
26 | 77 | 0.802102 | 0.486048 | 16 |
27 | 78 | 0.802102 | 0.700490 | 16 |
28 | 79 | 0.802102 | 0.669038 | 5 |
29 | 80 | 0.802102 | 0.489487 | 16 |
30 | 81 | 0.802102 | 0.698892 | 16 |
31 | 82 | 0.802102 | 0.689041 | 5 |
0 | 2 | 0.802102 | 0.488119 | 12 |
22 | 48 | 0.802102 | 0.692584 | 12 |
35 | 109 | 0.802102 | 0.480564 | 26 |
36 | 110 | 0.802102 | 0.484685 | 15 |
37 | 112 | 0.802102 | 0.479757 | 25 |
38 | 113 | 0.802102 | 0.706227 | 25 |
39 | 115 | 0.802102 | 0.483641 | 25 |
40 | 116 | 0.802102 | 0.701737 | 25 |
41 | 117 | 0.802102 | 0.695748 | 14 |
23 | 74 | 0.802102 | 0.486738 | 17 |
21 | 47 | 0.802102 | 0.701737 | 23 |
1 | 3 | 0.802102 | 0.489829 | 12 |
10 | 23 | 0.802102 | 0.498705 | 4 |
2 | 4 | 0.802102 | 0.484082 | 22 |
3 | 5 | 0.802102 | 0.483389 | 11 |
4 | 6 | 0.802102 | 0.695034 | 11 |
5 | 18 | 0.802102 | 0.497179 | 5 |
6 | 19 | 0.802102 | 0.486738 | 15 |
7 | 20 | 0.802102 | 0.498211 | 4 |
8 | 21 | 0.802102 | 0.669338 | 4 |
9 | 22 | 0.802102 | 0.490298 | 15 |
11 | 24 | 0.802102 | 0.689566 | 4 |
20 | 46 | 0.802102 | 0.483641 | 23 |
12 | 25 | 0.802102 | 0.489487 | 14 |
13 | 26 | 0.802102 | 0.698892 | 14 |
14 | 27 | 0.802102 | 0.668223 | 3 |
15 | 39 | 0.802102 | 0.486121 | 14 |
16 | 40 | 0.802102 | 0.480564 | 24 |
17 | 43 | 0.802102 | 0.484082 | 24 |
18 | 44 | 0.802102 | 0.488469 | 13 |
19 | 45 | 0.802102 | 0.697651 | 13 |
42 | 118 | 0.802102 | 0.700290 | 24 |
- Model 96 tops the table (accuracy 0.950485, ROC AUC 0.999744), but a near-perfect ROC AUC suggests overfitting, and Model 99 has convergence issues, likely due to multicollinearity or overfitting. Model 117 is a simpler and more stable alternative. Its accuracy of 0.802102 is shared by almost every other model in the table, which likely reflects the majority-class rate at the 0.5 threshold rather than real discriminative skill, so ROC AUC is the more informative metric here. Model 117's ROC AUC of 0.695748 indicates a moderate ability to distinguish between influenza and non-influenza cases. With only 14 coefficients, it is less complex, which reduces the risk of overfitting and makes it more likely to generalize to new data. It also fits without numerical issues such as singular matrices or failed convergence.
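The repeated accuracy of 0.802102 in the table above can be put in context with a majority-class baseline. The sketch below is illustrative only: the labels are synthetic, and the positive rate of 0.198 is an assumption chosen so that the majority share lands near the tabulated accuracy. It is not computed from `df_copy`.

```python
import numpy as np

# Hypothetical labels: roughly 19.8% positives, mimicking an imbalanced target.
rng = np.random.default_rng(101)
y_true = (rng.random(10_000) < 0.198).astype(int)

# A "model" that always predicts the majority class (0 = non-influenza).
y_pred_majority = np.zeros_like(y_true)

# Accuracy of the constant prediction equals the share of the majority class.
baseline_accuracy = np.mean(y_true == y_pred_majority)
majority_share = 1.0 - y_true.mean()

print(f"majority-class share: {majority_share:.4f}")
print(f"always-0 accuracy:    {baseline_accuracy:.4f}")
```

On the real data, the same check would be `1 - df_copy[output_name].mean()`. If that value matches 0.802102, the models sharing that accuracy are probably predicting non-influenza for every row at the 0.5 threshold.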
best_model = smf.logit(formula=formulas[117],
data=df_copy).fit()
Optimization terminated successfully.
Current function value: 0.449109
Iterations 8
best_model.params
Intercept -1.103243
C(season)[T.2016-2017] 0.045845
C(season)[T.2017-2018] 0.108385
C(season)[T.2018-2019] 0.068572
C(season)[T.2019-2020] 0.190227
C(season)[T.2020-2021] 0.124803
C(season)[T.2021-2022] 0.144458
C(season)[T.2022-2023] 0.129073
C(season)[T.2023-2024] 0.153577
C(season)[T.2024-2025] 0.006620
C(demographic_category)[T.Age Group] 0.050765
C(demographic_category)[T.Race/Ethnicity] 0.142890
week -0.000800
percent -0.336332
dtype: float64
best_model.pvalues < 0.05
Intercept True
C(season)[T.2016-2017] False
C(season)[T.2017-2018] False
C(season)[T.2018-2019] False
C(season)[T.2019-2020] True
C(season)[T.2020-2021] True
C(season)[T.2021-2022] True
C(season)[T.2022-2023] True
C(season)[T.2023-2024] True
C(season)[T.2024-2025] False
C(demographic_category)[T.Age Group] False
C(demographic_category)[T.Race/Ethnicity] True
week False
percent True
dtype: bool
best_model.params[best_model.pvalues < 0.05].sort_values(ascending=False)
C(season)[T.2019-2020] 0.190227
C(season)[T.2023-2024] 0.153577
C(season)[T.2021-2022] 0.144458
C(demographic_category)[T.Race/Ethnicity] 0.142890
C(season)[T.2022-2023] 0.129073
C(season)[T.2020-2021] 0.124803
percent -0.336332
Intercept -1.103243
dtype: float64
my_coefplot(best_model)
--- statistically significant variables ---
variable: Intercept, coefficient: -1.1032, std err: 0.0641, p-value: 0.0000, direction: negative
variable: C(season)[T.2019-2020], coefficient: 0.1902, std err: 0.0545, p-value: 0.0005, direction: positive
variable: C(season)[T.2020-2021], coefficient: 0.1248, std err: 0.0544, p-value: 0.0217, direction: positive
variable: C(season)[T.2021-2022], coefficient: 0.1445, std err: 0.0543, p-value: 0.0078, direction: positive
variable: C(season)[T.2022-2023], coefficient: 0.1291, std err: 0.0542, p-value: 0.0173, direction: positive
variable: C(season)[T.2023-2024], coefficient: 0.1536, std err: 0.0542, p-value: 0.0046, direction: positive
variable: C(demographic_category)[T.Race/Ethnicity], coefficient: 0.1429, std err: 0.0456, p-value: 0.0017, direction: positive
variable: percent, coefficient: -0.3363, std err: 0.0083, p-value: 0.0000, direction: negative
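Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below copies three of the printed coefficients by hand rather than reading them from `best_model`, so it runs on its own.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the printed summary above (log-odds scale).
coefs = pd.Series({
    'C(season)[T.2019-2020]': 0.190227,
    'C(demographic_category)[T.Race/Ethnicity]': 0.142890,
    'percent': -0.336332,
})

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase (or versus the reference category).
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

For `percent`, the odds ratio is about 0.71, meaning each additional unit of `percent` is associated with roughly 29% lower odds of `is_influenza`, holding the other terms fixed. With the fitted model in memory, `np.exp(best_model.params)` gives the full set.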
import pickle
with open('logit_model.pkl', 'wb') as file:
    pickle.dump(best_model, file)
with open('logit_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
sample_data = pd.DataFrame({
'week': [22],
'season': ['2024-2025'],
'essence_category': ['CDC Influenza DD v1'],
'respiratory_category': ['Influenza'],
'visit_type': ['ED Visits'],
'demographic_category': ['Race/Ethnicity'],
'demographic_group': ['Black Non-Hispanic'],
'percent': [65.0]
})
loaded_model.predict(sample_data)
0 1.212773e-10
dtype: float64
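The very small predicted probability can be reproduced by hand from the printed coefficients. The sketch below assumes the model uses only `season`, `demographic_category`, `week` and `percent`, as the parameter listing above suggests; the other columns in `sample_data` do not appear among the parameters. The coefficients are copied at six-decimal precision, so the result agrees with the model output only approximately.

```python
import math

# Coefficients copied from best_model.params (rounded to six decimals).
intercept = -1.103243
season_2024_2025 = 0.006620
demographic_race_ethnicity = 0.142890
coef_week = -0.000800
coef_percent = -0.336332

# Inputs from sample_data.
week, percent = 22, 65.0

# Linear predictor (log-odds), then the logistic function.
log_odds = (intercept + season_2024_2025 + demographic_race_ethnicity
            + coef_week * week + coef_percent * percent)
probability = 1.0 / (1.0 + math.exp(-log_odds))

print(f"log-odds: {log_odds:.4f}, probability: {probability:.3e}")
```

The `percent` term alone contributes about -21.9 to the log-odds, which is what drives the prediction to roughly 1.2e-10.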