# Step 1: Import Libraries
First, make sure statsmodels is installed; if it is not, install it with pip (pip install statsmodels). Then import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error


# Step 2: Load and Prepare the Data
Load your dataset and separate your features (independent variables) and target (dependent variable):

In [None]:
# Load dataset (replace the placeholder path with your own file)
df = pd.read_csv('path/to/your/data.csv')

# Preview the first rows and list the column names
print(df.head())
print(list(df.columns))



In [None]:
# Assuming 'X1', 'X2', ..., 'Xn' are your feature columns and 'Y' is your target variable
X = df[['X1', 'X2', 'Xn']]
Y = df['Y']

# Add a constant to the model (intercept)
X = sm.add_constant(X)


# Step 3: Create and Fit the Model
Create the model using OLS (Ordinary Least Squares) from statsmodels, fit it, and then view the summary for detailed statistics:

In [None]:
# Create the model
# Linear regression (continuous target)
model = sm.OLS(Y, X)

# Logistic regression (binary 0/1 target); uncomment to use instead of OLS
#model = sm.Logit(Y, X)

# Fit the model
results = model.fit()

# Print the summary
print(results.summary())


#### Use the variance inflation factor (VIF) to check for multicollinearity in your model
#### Interpreting VIF Values:
VIF = 1: No correlation between the independent variable and the other variables.<br>
1 < VIF < 5: Generally considered acceptable; indicates moderate correlation that may not require action.<br>
5 <= VIF <= 10: Indicates high correlation that may make this variable's coefficient estimate unreliable; examine the variable for potential removal or adjustment.<br>
VIF > 10: Suggests extreme multicollinearity. Variables with such high VIFs are typically removed from the model, or the model is re-specified to reduce multicollinearity.<br>

In [2]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assuming X is the DataFrame of predictors including the constant
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)


# Step 4: Extracting Model Metrics
While the summary provides a comprehensive overview, including p-values, R-squared, and coefficients, you may also want to inspect the residuals and extract specific metrics or values into a DataFrame:

In [None]:
# Residuals
residuals = Y - results.fittedvalues

# Plotting residuals
plt.figure(figsize=(10,6))
plt.scatter(results.fittedvalues, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()

# Checking for normality of residuals
plt.figure(figsize=(10,6))
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Histogram of Residuals')
plt.show()

# Creating a DataFrame for coefficients and p-values
coefficients = results.params
p_values = results.pvalues
mse = mean_squared_error(Y, results.fittedvalues)

model_metrics = pd.DataFrame({'Coefficient': coefficients, 'P-Value': p_values})
# MSE is a single model-level value, so it is repeated on every row
model_metrics['MSE'] = mse

print(model_metrics)

## Note:
Replace 'path/to/your/data.csv' with the path to your file, and 'X1', 'X2', 'Xn', and 'Y' with actual column names from your dataset.
This approach includes a constant term (intercept) in the model, which is generally recommended in regression analysis.
statsmodels' OLS does not split the data into training and testing sets; it fits on every row you pass in, which suits statistical analysis rather than predictive modeling.
This setup gives you a detailed statistical overview of your regression model, suited to survey analysis, and extracts key metrics and estimates for further examination.


When evaluating a linear regression model, especially in contexts like survey analysis where interpreting the model matters most, consider the following metrics and statistical tests in addition to the Mean Squared Error (MSE) and p-values:

### 1. R-squared (Coefficient of Determination):
What it is: Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
How to interpret: For a model with an intercept, values range from 0 to 1. A higher R-squared value indicates a closer fit between the model and the observed data, although it never decreases when predictors are added.
### 2. Adjusted R-squared:
What it is: Adjusts the R-squared value based on the number of predictors in the model relative to the number of observations. It accounts for the model complexity.
How to interpret: Like R-squared, but it penalizes every added predictor, so it rises only when a new predictor improves the fit by more than would be expected by chance. This makes it a better basis for comparing models with different numbers of predictors.
### 3. F-statistic and its associated p-value:
What it is: Tests the overall significance of the model.
How to interpret: The F-statistic tests the null hypothesis that all predictor coefficients are zero. A very low p-value (typically < .05) indicates that your model fits significantly better than an intercept-only model.
### 4. AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion):
What they are: These are information criteria metrics that evaluate the model fit while penalizing the model complexity (number of predictors).
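How to interpret: They help in model selection. When comparing models fitted to the same data, the one with the lower value offers the better balance between goodness of fit and complexity. The absolute values have no meaning on their own.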
### 5. Residual Analysis:
What it is: Examination of the residuals (the differences between observed and predicted values) can provide insights into the model's accuracy and assumptions (like homoscedasticity and normality).

How to perform and interpret:<br>
Plotting residuals vs. predicted values: Should show no clear pattern.<br>
Normality test of residuals: E.g., using a Q-Q plot or statistical tests. Residuals should ideally follow a normal distribution.