

# Model Evaluation and Hypothesis Testing

## Introduction
In this notebook, we will cover model evaluation metrics for classification and regression, followed by hypothesis testing for model comparison.

## Steps:
1. Generate synthetic classification data.
2. Train Logistic Regression and Random Forest models.
3. Evaluate models using accuracy, precision, recall, and F1-score.
4. Perform hypothesis testing using paired t-test and Wilcoxon test.
5. Visualize model performance.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, mean_absolute_error, mean_squared_error, r2_score
from scipy.stats import ttest_rel, wilcoxon
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Generate synthetic classification dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
print(X)
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Hypothesis Testing
To determine whether there is a significant difference between the two models, we perform:
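- **Paired t-test**: Assumes the paired differences in performance (model 1 minus model 2 on the same run) are approximately normally distributed.
- **Wilcoxon signed-rank test**: A non-parametric alternative to the paired t-test that does not assume normally distributed differences.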
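
As a minimal sketch of what "paired" means for these two tests, the example below uses made-up accuracy scores (`scores_a` and `scores_b` are hypothetical, not results from this notebook). It illustrates that a paired t-test is the same as a one-sample t-test on the per-split differences, and that the Wilcoxon test works on those same differences.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp, wilcoxon

# Hypothetical accuracy scores for two models on the same five data splits
scores_a = np.array([0.81, 0.84, 0.79, 0.86, 0.83])
scores_b = np.array([0.84, 0.85, 0.83, 0.88, 0.88])

# Both tests operate on the per-split differences
diffs = scores_a - scores_b

# A paired t-test is a one-sample t-test of the differences against zero
t_paired, p_paired = ttest_rel(scores_a, scores_b)
t_one, p_one = ttest_1samp(diffs, 0.0)

# The Wilcoxon signed-rank test ranks the absolute differences instead
# of assuming they are normally distributed
w_stat_demo, w_p_demo = wilcoxon(scores_a, scores_b)

print(f"paired t-test: t = {t_paired:.3f}, p = {p_paired:.3f}")
print(f"one-sample t:  t = {t_one:.3f}, p = {p_one:.3f}")
print(f"Wilcoxon:      W = {w_stat_demo:.3f}, p = {w_p_demo:.3f}")
```

The first two printed lines should agree, because the two calls compute the same statistic.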


In [None]:
# Train classification models
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(random_state=42)  # fixed seed so results are reproducible
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)

# Predictions and evaluation
y_pred1 = clf1.predict(X_test)
y_pred2 = clf2.predict(X_test)

accuracy1 = accuracy_score(y_test, y_pred1)
accuracy2 = accuracy_score(y_test, y_pred2)
precision1 = precision_score(y_test, y_pred1)
precision2 = precision_score(y_test, y_pred2)
recall1 = recall_score(y_test, y_pred1)
recall2 = recall_score(y_test, y_pred2)
f1_1 = f1_score(y_test, y_pred1)
f1_2 = f1_score(y_test, y_pred2)

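# Hypothesis testing
# NOTE: the "runs" below are simulated by adding Gaussian noise to the single
# test-set accuracy of each model. They are not real repeated evaluations, and
# the two arrays are drawn independently of each other.
np.random.seed(42)  # make the simulated scores reproducible
model1_scores = np.random.normal(accuracy1, 0.02, 10)  # 10 simulated runs
model2_scores = np.random.normal(accuracy2, 0.02, 10)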

# Paired t-test
t_stat, p_value = ttest_rel(model1_scores, model2_scores)

# Wilcoxon test
w_stat, w_p_value = wilcoxon(model1_scores, model2_scores)

In [None]:
# Visualization
plt.figure(figsize=(8, 6))
sns.boxplot(data=[model1_scores, model2_scores])
plt.xticks([0, 1], ['Logistic Regression', 'Random Forest'])
plt.ylabel('Accuracy Score')
plt.title('Model Performance Comparison')
plt.show()

# Print results
print(f'Paired t-test: t-statistic = {t_stat:.3f}, p-value = {p_value:.3f}')
print(f'Wilcoxon test: W-statistic = {w_stat:.3f}, p-value = {w_p_value:.3f}')


## Conclusion
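- If the p-value from the paired t-test or Wilcoxon test is less than 0.05, we reject the null hypothesis of no difference and conclude that the models perform significantly differently. Because the scores here are simulated around a single test accuracy, the result illustrates the procedure rather than a real comparison.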
- Visualization helps in understanding the distribution of performance scores.
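
The scores in the cells above are simulated draws around one test accuracy. One possible way to obtain genuinely paired scores is cross-validation with the same folds for both models. The sketch below assumes 10-fold stratified cross-validation and `max_iter=1000` for the logistic regression; both are choices made for this illustration, not settings taken from the notebook.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same synthetic dataset as in the notebook
X_cv, y_cv = make_classification(n_samples=500, n_features=10, random_state=42)

# A splitter with a fixed seed yields identical folds for both models,
# so score i of each model comes from the same held-out fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

lr_scores = cross_val_score(LogisticRegression(max_iter=1000), X_cv, y_cv,
                            cv=cv, scoring="accuracy")
rf_scores = cross_val_score(RandomForestClassifier(random_state=42), X_cv, y_cv,
                            cv=cv, scoring="accuracy")

# Paired t-test on the per-fold accuracies
t_stat_cv, p_value_cv = ttest_rel(lr_scores, rf_scores)
print(f"mean accuracy: LR = {lr_scores.mean():.3f}, RF = {rf_scores.mean():.3f}")
print(f"paired t-test on folds: t = {t_stat_cv:.3f}, p = {p_value_cv:.3f}")
```

Fold scores share most of their training data, so they are not fully independent and the resulting p-value should be read as approximate.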



## Clinical Study Exercise: Drug vs. Placebo
### Problem Statement
A clinical study is conducted to compare the effectiveness of a new drug against a placebo. We will:
1. Generate synthetic data representing patient improvement scores (scale 0-100).
2. Use hypothesis testing to determine if the drug significantly outperforms the placebo.


In [None]:
# Generate synthetic clinical study data
import numpy as np


# Set the random seed for reproducibility
np.random.seed(42)  

# Define the number of patients in the clinical study
num_patients = 30  

# Generate synthetic data for the drug group:
# - Mean improvement score = 70
# - Standard deviation = 10
# - Number of samples = 30
drug_effect = np.random.normal(70, 10, num_patients)  

# Generate synthetic data for the placebo group:
# - Mean improvement score = 60
# - Standard deviation = 10
# - Number of samples = 30
placebo_effect = np.random.normal(60, 10, num_patients)  


In [None]:
print(drug_effect)

In [None]:
print(placebo_effect)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Plot distributions
plt.figure(figsize=(8, 6))
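sns.histplot(drug_effect, kde=True, color='blue', label='Drug', bins=15, alpha=0.6)
sns.histplot(placebo_effect, kde=True, color='red', label='Placebo', bins=15, alpha=0.6)
plt.xlabel('Patient Improvement Score')
plt.title('Distribution of Improvement Scores')
plt.legend()  # display the 'Drug' and 'Placebo' labels set above
plt.show()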

In [None]:
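# Perform hypothesis testing
# NOTE: ttest_rel and wilcoxon treat element i of both arrays as a matched pair
# (e.g. the same patient measured twice). If the drug and placebo groups are
# separate patients, independent-sample tests are the usual choice.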
t_stat_clinical, p_value_clinical = ttest_rel(drug_effect, placebo_effect)
w_stat_clinical, w_p_value_clinical = wilcoxon(drug_effect, placebo_effect)

In [None]:
# Visualization
plt.figure(figsize=(8, 6))
sns.boxplot(data=[drug_effect, placebo_effect])
plt.xticks([0, 1], ['Drug', 'Placebo'])
plt.ylabel('Patient Improvement Score')
plt.title('Clinical Study: Drug vs. Placebo')
plt.show()

In [None]:
# Print results
print(f'Paired t-test (Drug vs. Placebo): t-statistic = {t_stat_clinical:.3f}, p-value = {p_value_clinical:.3f}')
print(f'Wilcoxon test (Drug vs. Placebo): W-statistic = {w_stat_clinical:.3f}, p-value = {w_p_value_clinical:.3f}')


## Conclusion
- If the p-value is less than 0.05, we reject the null hypothesis of no difference and conclude that improvement scores differ significantly between the drug and the placebo.
- The visualization helps us understand the distribution of patient improvement scores.
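
The drug and placebo scores are generated as two separate groups, so an independent-sample comparison is a natural alternative to the paired tests used above. The sketch below is one way to do that, assuming Welch's t-test and the Mann-Whitney U test with a one-sided alternative ("drug is greater than placebo"); the variable names `drug_scores` and `placebo_scores` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

# Regenerate the synthetic groups with the same parameters as the notebook
np.random.seed(42)
drug_scores = np.random.normal(70, 10, 30)
placebo_scores = np.random.normal(60, 10, 30)

# Welch's t-test: independent samples, no equal-variance assumption.
# alternative="greater" tests whether the drug mean exceeds the placebo mean.
t_ind, p_ind = ttest_ind(drug_scores, placebo_scores,
                         equal_var=False, alternative="greater")

# Mann-Whitney U: non-parametric counterpart for independent samples
u_stat, p_u = mannwhitneyu(drug_scores, placebo_scores, alternative="greater")

print(f"Welch t-test:   t = {t_ind:.3f}, p = {p_ind:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_u:.4f}")
```

The one-sided alternative matches the question "does the drug outperform the placebo"; a two-sided test would instead ask whether the groups differ in either direction.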



In [None]:
!pip install kagglehub


In [None]:
!pip install pandas
#command to install any library

In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv(r'C:\Users\ksait\Desktop\data mining and machine learning\insurance.csv')

# Display the first five rows of the dataframe.
# The row labels 0-4 are the default zero-based index that pandas assigns.
df.head()


In [None]:
df.info()
df.describe()

In [None]:
df.isnull()
df.isnull().sum()

In [None]:
df.columns

# Set Hypotheses for Testing
We choose a hypothesis to test, for example:
## Test for a difference in premiums based on smoking status

1. Null Hypothesis (H0): There is no difference in mean insurance premiums (the `charges` column) between smokers and non-smokers.

2. Alternative Hypothesis (H1): There is a difference in mean insurance premiums (the `charges` column) between smokers and non-smokers.

In [None]:
import scipy.stats as stats

# Separate the data based on 'Smoker' column
smoker = df[df['smoker'] == 'yes']['charges']
non_smoker = df[df['smoker'] == 'no']['charges']

# Perform an independent two-sample t-test
# (by default ttest_ind assumes both groups have equal variance)
t_stat, p_value = stats.ttest_ind(smoker, non_smoker)

# Print the results
t_stat, p_value


# Interpret the Results:
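1. t-statistic: It measures the difference between the two group means in units of its standard error. The further it is from zero, the less compatible the data are with the null hypothesis.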
2. p-value: If the p-value is less than your significance level (commonly 0.05), you reject the null hypothesis.
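
To make the t-statistic concrete, the sketch below computes it by hand as (difference in means) / (standard error of that difference) and compares it with SciPy. It uses Welch's unequal-variance form, `equal_var=False`, which is a substitution for the default test used in the cell above. The two groups are hypothetical synthetic samples, not the insurance data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with different sizes and spreads (not the insurance data)
rng = np.random.default_rng(0)
group_a = rng.normal(32000, 11000, 50)
group_b = rng.normal(8500, 6000, 200)

# Difference between the sample means
mean_diff = group_a.mean() - group_b.mean()

# Standard error of the difference (Welch: each group keeps its own variance)
std_err = np.sqrt(group_a.var(ddof=1) / len(group_a)
                  + group_b.var(ddof=1) / len(group_b))

t_manual = mean_diff / std_err

# SciPy's Welch t-test should give the same statistic
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"manual t = {t_manual:.3f}, scipy t = {t_welch:.3f}, p = {p_welch:.3g}")
```

When one group is much more variable than the other, the Welch form is generally the safer choice because it does not pool the two variances.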

In [None]:
# If p-value is less than 0.05, we reject the null hypothesis
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in premiums between smokers and non-smokers.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in premiums between smokers and non-smokers.")


# Linear Regression Model & Hypothesis Test

In [None]:
import pandas as pd
import statsmodels.api as sm
import numpy as np

# Load the dataset
df = pd.read_csv(r'C:\Users\ksait\Desktop\data mining and machine learning\insurance.csv')

# Convert 'smoker' to 1 for 'yes' and 0 for 'no'
df['smoker_yes'] = df['smoker'].apply(lambda x: 1 if x == 'yes' else 0)

# Create lagged values for the dependent variable (charges)
df['lagged_charges'] = df['charges'].shift(1)  # Lag 1

# Drop the first row which will have NaN for the lagged variable
df = df.dropna()

# Define the independent variable as the lagged charges and the dependent variable as charges
X_ar = df[['lagged_charges']]
y_ar = df['charges']

# Add a constant (intercept) to the features
X_with_intercept = sm.add_constant(X_ar)
print(X_with_intercept)

# Fit the linear regression model using statsmodels
model = sm.OLS(y_ar, X_with_intercept).fit()

# Get the summary of the model (includes t-statistics and p-values)
print(model.summary())


## Interpretation:
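If the p-value for `lagged_charges` is less than 0.05, the lagged value of charges has a statistically significant linear relationship with the current charges. Note that `shift(1)` takes the value from the previous row, so the lag is only meaningful if the rows are ordered in time. If each row is an unrelated individual in arbitrary order, no such relationship is expected.

The coefficient for `lagged_charges` estimates how much the current charges change for a one-unit increase in the previous row's charges.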
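
A small illustration of what `shift(1)` produces may help here. The toy values below are hypothetical; the point is only that the lag is defined by row order, so reordering the rows changes the lagged column.

```python
import pandas as pd

# Hypothetical toy data in its original row order
toy = pd.DataFrame({"charges": [100.0, 250.0, 175.0, 300.0]})

# shift(1) moves every value down one row; the first row has no predecessor
toy["lagged_charges"] = toy["charges"].shift(1)
print(toy)

# Dropping the NaN row leaves one fewer observation for the regression
toy_clean = toy.dropna()
print(len(toy), "->", len(toy_clean))

# The same values in a different row order give a different lagged column
resorted = toy[["charges"]].sort_values("charges").reset_index(drop=True)
resorted["lagged_charges"] = resorted["charges"].shift(1)
print(resorted)
```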

# ARX Model (AutoRegressive with Exogenous variables)

In [None]:
# Create lagged values for the dependent variable (charges)

df['lagged_charges'] = df['charges'].shift(1)  # 1-period lag

# Drop the first row which will have NaN for the lagged variable
df = df.dropna()

# Define the independent variables and the dependent variable
X_arx = df[['age', 'bmi', 'children', 'smoker_yes', 'lagged_charges']]
y_arx = df['charges']

# Add a constant to the independent variables
X_arx_with_intercept = sm.add_constant(X_arx)

# Fit the ARX model using statsmodels
model_arx = sm.OLS(y_arx, X_arx_with_intercept).fit()

# Get the summary of the ARX model
print(model_arx.summary())


# Model Overview:
1. R-squared = 0.750: 75% of the variance in charges is explained by the independent variables (including the lagged value of charges). For individual-level data such as insurance charges, this indicates a reasonably good fit.
   
2. Adjusted R-squared = 0.749: This adjusts R-squared for the number of predictors in the model, penalizing variables that add little explanatory power. Its closeness to R-squared suggests that the fit is not being inflated by the number of predictors.

   
3. F-statistic = 798.2 and Prob (F-statistic) = 0.000: This tests the overall significance of the regression against a model with only an intercept. A p-value that rounds to 0.000 indicates that the model is statistically significant, i.e. at least one of the predictors is related to charges.
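
The relationship between the two R-squared values can be checked with the standard adjusted R-squared formula. In the sketch below, `n_obs = 1336` is an assumption about the number of rows left after the two `dropna()` calls; the actual value is shown as "No. Observations" in the model summary. The helper name `adjusted_r2` is introduced here for illustration.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


r2 = 0.750        # R-squared reported in the summary
n_predictors = 5  # age, bmi, children, smoker_yes, lagged_charges
n_obs = 1336      # ASSUMPTION: check "No. Observations" in the summary

adj = adjusted_r2(r2, n_obs, n_predictors)
print(round(adj, 3))  # prints 0.749
```

With over a thousand observations and only five predictors, the penalty is tiny, which is why the two values are nearly identical.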

# Hypothesis Testing for Individual Coefficients:
Each variable in your regression model has an associated coefficient, t-statistic, p-value, and confidence interval. Here's how to interpret these values for hypothesis testing:

a. Constant (Intercept):
- coef = -11,900: The estimated intercept of the regression line, i.e. the predicted charges when all independent variables are zero. An age and BMI of zero lie outside the data, so this value has no practical meaning here.
- t = -12.378: The coefficient divided by its standard error, used to test whether the coefficient differs from zero.
- P>|t| = 0.000: The p-value rounds to zero, indicating that the constant term is significantly different from zero.

b. Age:
- coef = 257.49: For every 1-year increase in age, charges are expected to increase by about 257.49 units, holding the other variables constant.
- t = 21.587: A very large t-statistic, which is strong evidence that the age coefficient is not zero.
- P>|t| = 0.000: The p-value rounds to zero, meaning that age is a statistically significant predictor of charges.

c. BMI:
- coef = 320.94: For every 1-unit increase in bmi, charges are expected to increase by 320.94 units, holding the other variables constant.
- t = 11.713: Also a large t-statistic, which is strong evidence that the bmi coefficient is not zero.
- P>|t| = 0.000: The small p-value indicates that bmi is a statistically significant predictor of charges.

d. Children:
- coef = 467.70: For every additional child, charges are expected to increase by 467.70 units, holding the other variables constant.
- t = 3.393: Smaller than for age or bmi because the estimate is less precise relative to its size, but still well beyond the usual threshold of about 2.
- P>|t| = 0.001: This small p-value shows that the number of children is a statistically significant predictor of charges.

e. Smoker (smoker_yes):
- coef = 23,840: Smokers have charges that are 23,840 units higher than non-smokers, all other variables being equal.
- t = 57.887: The largest t-statistic in the model, which is very strong evidence that the smoking coefficient is not zero.
- P>|t| = 0.000: This small p-value indicates that being a smoker is a highly significant predictor of charges.

f. Lagged Charges:
- coef = -0.0113: For every 1-unit increase in the previous row's charges, the current charges are estimated to decrease by 0.0113 units, holding the other variables constant.
- t = -0.821: The estimate is less than one standard error from zero, so it cannot be distinguished from no effect.
- P>|t| = 0.412: The p-value is greater than 0.05, so lagged_charges is not a statistically significant predictor of current charges. Past charges do not significantly influence the current charges in this model.
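
The p-values in the summary follow from the t-statistics and the residual degrees of freedom. As a rough check, the sketch below recomputes two of them with a two-sided t-test. The value `df_resid = 1330` is an assumption (observations minus the six estimated coefficients); the actual figure is listed as "Df Residuals" in the model summary. The helper `two_sided_p` is introduced here for illustration.

```python
from scipy import stats


def two_sided_p(t_value: float, df_resid: int) -> float:
    """Two-sided p-value for a t-statistic with df_resid degrees of freedom."""
    return 2.0 * stats.t.sf(abs(t_value), df_resid)


df_resid = 1330  # ASSUMPTION: check "Df Residuals" in the summary

# t-statistics as reported in the summary above
for name, t_value in [("children", 3.393), ("lagged_charges", -0.821)]:
    print(f"{name}: t = {t_value}, p = {two_sided_p(t_value, df_resid):.3f}")
```

With this many degrees of freedom, the result barely depends on the exact value assumed for `df_resid`.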

## Conclusion:
- Significant predictors: age, bmi, children, and smoker_yes are all statistically significant predictors of charges because their p-values are all much smaller than 0.05.
- Insignificant predictor: lagged_charges is not statistically significant (p-value = 0.412), which suggests that including the previous value of charges as a predictor does not improve the model.

# Exercise 

## Create lagged values for the exogenous (independent) variables

In [None]:
# Create lagged values for exogenous variables (independent variables)
df['lagged_age'] = df['age'].shift(1)
df['lagged_bmi'] = df['bmi'].shift(1)
df['lagged_children'] = df['children'].shift(1)
df['lagged_smoker_yes'] = df['smoker_yes'].shift(1)

# Drop the rows where any of the lagged values will have NaN (this happens for the first row after lagging)
df = df.dropna()

# Define the independent variables (including lagged variables) and the dependent variable
X_arx_lagged = df[['age', 'bmi', 'children', 'smoker_yes', 'lagged_charges', 
                   'lagged_age', 'lagged_bmi', 'lagged_children', 'lagged_smoker_yes']]

y_arx = df['charges']

# Add a constant (intercept) to the independent variables
X_arx_with_intercept = sm.add_constant(X_arx_lagged)

# Fit the ARX model using statsmodels
model_arx = sm.OLS(y_arx, X_arx_with_intercept).fit()

# Get the summary of the ARX model
print(model_arx.summary())
