# Section 10 Python example - model diagnostic plots exercise

Model diagnostic plots are essential tools for evaluating statistical models. They show visually how well a model fits, where it falls short, and which of its underlying assumptions may have been violated. This section demonstrates how to create diagnostic plots for a linear regression model in Python using matplotlib and statsmodels: a residual plot, a Q-Q plot for checking the normality of residuals, and a leverage (influence) plot for identifying influential observations. Then it is over to you to apply these techniques to two other datasets.

1. Setting Up the Environment:

Ensure Python, numpy, pandas, matplotlib, and statsmodels are installed in your environment. If not, you can install them using pip:

In [None]:
pip install numpy pandas matplotlib statsmodels

2. Importing Required Libraries:

Begin by importing necessary Python libraries for handling data, performing regression, and plotting:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

3. Generating Synthetic Data:

For this example, let’s create a synthetic dataset that represents a simple linear relationship with some noise:

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
x = np.linspace(10, 100, 100)
y = 0.5 * x + np.random.normal(0, 10, size=len(x))

# Create a DataFrame
data = pd.DataFrame({'X': x, 'Y': y})

4. Fitting a Linear Regression Model:

Using statsmodels to fit a linear regression model:

In [None]:
# Fit model
model = ols('Y ~ X', data=data).fit()
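Before turning to the plots, it can help to confirm that the fit itself looks sensible. The cell below is a minimal sketch that rebuilds the same synthetic data and model so it runs on its own, then prints the estimated coefficients, their 95% confidence intervals, and R-squared. Because the data were generated with a slope of 0.5, the estimated slope should land close to that value.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Recreate the synthetic data and model from the steps above
np.random.seed(42)
x = np.linspace(10, 100, 100)
y = 0.5 * x + np.random.normal(0, 10, size=len(x))
data = pd.DataFrame({'X': x, 'Y': y})
model = ols('Y ~ X', data=data).fit()

# Estimated intercept and slope (the data were generated with slope 0.5)
intercept = model.params['Intercept']
slope = model.params['X']
print(f"Intercept: {intercept:.3f}, slope: {slope:.3f}")

# 95% confidence intervals for both coefficients
conf_int = model.conf_int(alpha=0.05)
print(conf_int)

# Proportion of the variance in Y explained by X
print(f"R-squared: {model.rsquared:.3f}")
```

Calling `print(model.summary())` gives the same information in a single table, along with several test statistics.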

5. Plotting Residuals:

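A residual plot is used to assess homoscedasticity, that is, whether the residuals have constant variance across the range of the predictor. Ideally the points scatter randomly around the zero line, with no funnel shape or curve.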

In [None]:
# Plot residuals
plt.scatter(data['X'], model.resid)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
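A visual check of the residual plot can be complemented with a formal test. The sketch below uses the Breusch-Pagan test from statsmodels, which is an addition to this walkthrough rather than part of it. Its null hypothesis is constant residual variance, so a small p-value suggests heteroscedasticity. The data and model are rebuilt so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan

# Recreate the synthetic data and model from the steps above
np.random.seed(42)
x = np.linspace(10, 100, 100)
y = 0.5 * x + np.random.normal(0, 10, size=len(x))
data = pd.DataFrame({'X': x, 'Y': y})
model = ols('Y ~ X', data=data).fit()

# Breusch-Pagan test: regresses the squared residuals on the model's
# design matrix (constant plus X) and tests whether X explains them
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)

print(f"Breusch-Pagan LM statistic: {lm_stat:.3f}")
print(f"Breusch-Pagan LM p-value:   {lm_pvalue:.3f}")
```

Treat the p-value as a supplement to the plot, not a replacement: the plot also reveals non-linearity and outliers, which this test does not target.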

6. Normality Check with Q-Q Plot:

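A Q-Q plot compares the quantiles of the residuals with those of a normal distribution. If the residuals are approximately normal, which standard linear regression inference assumes, the points fall close to the reference line.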

In [None]:
# Q-Q plot for normality
fig = sm.qqplot(model.resid, line='s')
plt.title('Q-Q plot for residual normality check')
plt.show()
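To put a number alongside the Q-Q plot, one option is the Jarque-Bera test, which statsmodels also reports in `model.summary()`. This is a sketch added for illustration, not part of the original walkthrough. The null hypothesis is that the residuals come from a normal distribution, based on their skewness and kurtosis. The data and model are rebuilt so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.stattools import jarque_bera

# Recreate the synthetic data and model from the steps above
np.random.seed(42)
x = np.linspace(10, 100, 100)
y = 0.5 * x + np.random.normal(0, 10, size=len(x))
data = pd.DataFrame({'X': x, 'Y': y})
model = ols('Y ~ X', data=data).fit()

# Jarque-Bera test on the residuals; also returns sample skewness and kurtosis
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)

print(f"Jarque-Bera statistic: {jb_stat:.3f}")
print(f"p-value:               {jb_pvalue:.3f}")
print(f"Skewness: {skew:.3f}, kurtosis: {kurtosis:.3f}")
```

A large p-value means the test found no evidence against normality; it does not prove the residuals are normal, so the Q-Q plot remains worth reading.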

7. Leverage Plot:

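Leverage plots help identify cases that might have an undue influence on the model fit. The statsmodels influence_plot function used here plots studentized residuals against leverage, with marker size reflecting each observation's influence (Cook's distance by default).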

In [None]:
from statsmodels.graphics.regressionplots import influence_plot

# Influence plot
fig, ax = plt.subplots(figsize=(8,6))
influence_plot(model, ax=ax)
plt.title('Influence plot')
plt.show()
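The quantities behind the influence plot can also be inspected directly. The sketch below pulls leverage and Cook's distance from the fitted model with `get_influence()` and flags observations above two common rule-of-thumb cut-offs (leverage above 2p/n and Cook's distance above 4/n). These thresholds are conventions assumed here for illustration, not values taken from this section. The data and model are rebuilt so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Recreate the synthetic data and model from the steps above
np.random.seed(42)
x = np.linspace(10, 100, 100)
y = 0.5 * x + np.random.normal(0, 10, size=len(x))
data = pd.DataFrame({'X': x, 'Y': y})
model = ols('Y ~ X', data=data).fit()

# Influence measures for every observation
influence = model.get_influence()
leverage = influence.hat_matrix_diag      # diagonal of the hat matrix
cooks_d, _ = influence.cooks_distance     # (distances, p-values)

n = int(model.nobs)       # number of observations
p = len(model.params)     # number of estimated coefficients

# Rule-of-thumb thresholds (assumed conventions, adjust as needed)
high_leverage = np.where(leverage > 2 * p / n)[0]
high_cooks = np.where(cooks_d > 4 / n)[0]

print("High-leverage observations:", high_leverage)
print("High Cook's distance observations:", high_cooks)
```

Flagged rows are candidates for a closer look, not automatic deletions; check whether they are data errors or genuine but unusual cases.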

8. Conclusion:

These diagnostic plots provide critical feedback on the adequacy of the regression model. The residual plot can reveal patterns indicating non-linearity, heteroscedasticity, or outliers. The Q-Q plot checks the assumption that residuals are normally distributed, which is crucial for the validity of many statistical tests. Finally, the leverage plot helps identify influential data points that could disproportionately affect the regression line.

By incorporating these diagnostics into the modeling process, data scientists can better understand their models and make informed decisions about potential modifications or whether additional or alternative analysis is necessary. This enhances the reliability and robustness of their findings, ensuring that conclusions drawn from the data are both valid and actionable.

# Chapter Exercise

Now that we know how to evaluate models and use diagnostic plots, revisit some of the earlier datasets we have looked at. Can you analyze the housing.csv and the customer_purchases.csv datasets in the same way?