### Example Multiple Linear Regression 3.5

We use a *confidence interval* to quantify the uncertainty surrounding the *average* **sales** over a large number of cities. We restrict ourselves to the regression of **sales** on **TV** and **radio**, since the previous discussion showed that **newspaper** can be neglected.

For example, given that CHF 100000 is spent on **TV** advertising and CHF 20000 is spent on **radio** advertising in each city, the $95\%$ confidence interval for the average **sales** is
\begin{equation*}
[10'985,11'528]
\end{equation*}

In [2]:
import pandas as pd
import statsmodels.api as sm

# Load data
df = pd.read_csv('./data/Advertising.csv')
x = df[['TV', 'radio']]
y = df['sales']

# Fit Model:
x_sm = sm.add_constant(x)
model = sm.OLS(y, x_sm).fit()

# Get the prediction and both intervals at TV = 100, radio = 20
# (budgets are recorded in thousands, i.e. CHF 100000 and CHF 20000)
x0 = [[100, 20]]
x0 = sm.add_constant(x0, has_constant='add')

predictionsx0 = model.get_prediction(x0)
predictionsx0 = predictionsx0.summary_frame(alpha=0.05)

# Print the results: mean_ci_lower/mean_ci_upper bound the confidence interval,
# whereas obs_ci_lower/obs_ci_upper bound the prediction interval
print(predictionsx0)


        mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
0  11.256466  0.137526      10.985254      11.527677      7.929616   

   obs_ci_upper  
0     14.583316  


We interpret this to mean that $95\%$ of intervals of this form will contain the true value of $f(X_{1},X_{2})$. In other words, if we collect a large number of data sets like the **Advertising** data set, and we construct a confidence interval for the average **sales** on the basis of each data set (given CHF 100000 in **TV** and CHF 20000 in **radio** advertising), then $95\%$ of these confidence intervals will contain the true value of average **sales**.

On the other hand, a *prediction interval* can be used to quantify the uncertainty surrounding **sales** for a *particular* city. Given that CHF 100000 is spent on **TV** and CHF 20000 is spent on **radio** advertising in that city, the $95\%$ *prediction interval* is

\begin{equation*}
[7'930,14'583]
\end{equation*}

We interpret this to mean that $95\%$ of intervals of this form will contain the true value of $ Y $ for this city. 
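The repeated-sampling interpretation of both intervals can be checked with a small simulation. The sketch below does not use the **Advertising** data: the "true" coefficients, the noise level and the budget ranges are hypothetical values chosen only for illustration, and the intervals are computed from the standard OLS formulas with NumPy and SciPy rather than with `get_prediction`. If the interpretation holds, both coverage rates should come out close to $95\%$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical "true" model, chosen only for illustration
beta = np.array([3.0, 0.045, 0.19])    # intercept, TV, radio
sigma = 1.7                             # standard deviation of the noise
n, n_rep = 200, 5000                    # cities per data set, number of data sets
x0 = np.array([1.0, 100.0, 20.0])       # constant, TV = 100, radio = 20
f_x0 = x0 @ beta                        # true average sales at x0

# Fixed design: the same (synthetic) budgets are reused in every data set
X = np.column_stack([np.ones(n),
                     rng.uniform(0, 300, n),
                     rng.uniform(0, 50, n)])
XtX_inv = np.linalg.inv(X.T @ X)
leverage = x0 @ XtX_inv @ x0            # x0' (X'X)^{-1} x0
t_crit = stats.t.ppf(0.975, df=n - 3)   # 3 estimated coefficients

ci_hits = 0
pi_hits = 0
for _ in range(n_rep):
    # Draw a new data set and fit it by least squares
    y = X @ beta + rng.normal(0, sigma, n)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - 3))
    y0_hat = x0 @ beta_hat

    # Half-widths of the 95% confidence and prediction intervals
    ci_half = t_crit * sigma_hat * np.sqrt(leverage)
    pi_half = t_crit * sigma_hat * np.sqrt(1 + leverage)

    # Does the confidence interval contain the true average?
    ci_hits += abs(y0_hat - f_x0) <= ci_half

    # Does the prediction interval contain the sales of one new city?
    y0_new = f_x0 + rng.normal(0, sigma)
    pi_hits += abs(y0_hat - y0_new) <= pi_half

ci_coverage = ci_hits / n_rep
pi_coverage = pi_hits / n_rep
print(f"Confidence interval coverage: {ci_coverage:.3f}")
print(f"Prediction interval coverage: {pi_coverage:.3f}")
```

The two intervals have the same nominal coverage but different targets: the confidence interval is checked against the fixed value $f(X_{1},X_{2})$, whereas the prediction interval is checked against a freshly drawn $Y$ for a single city.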

Note that both intervals are centered at 11'256, but the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about **sales** for a given city in comparison to the average **sales** over many locations.
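The comparison can be made concrete with the numbers printed by `summary_frame`. The sketch below assumes the standard OLS formulas, namely that the confidence interval has half-width $t \cdot \mathrm{se}(\hat{y}_0)$ and the prediction interval has half-width $t \cdot \sqrt{\mathrm{se}(\hat{y}_0)^2 + \hat{\sigma}^2}$, where $t$ is a quantile of the $t$-distribution and $\hat{\sigma}$ is the residual standard error. Under that assumption, the printed values are enough to recover $t$ and $\hat{\sigma}$; the variable names are ours.

```python
import math

# Values copied from the printed summary_frame output
mean, mean_se = 11.256466, 0.137526
ci_lower, ci_upper = 10.985254, 11.527677
pi_lower, pi_upper = 7.929616, 14.583316

# Both intervals are symmetric around the same fitted value
ci_center = (ci_lower + ci_upper) / 2
pi_center = (pi_lower + pi_upper) / 2

ci_half = (ci_upper - ci_lower) / 2
pi_half = (pi_upper - pi_lower) / 2

# t multiplier implied by the confidence interval
t_mult = ci_half / mean_se

# Standard error for a single new observation, and the residual
# standard error that explains the extra width (assumed formula)
obs_se = pi_half / t_mult
sigma_hat = math.sqrt(obs_se**2 - mean_se**2)

print(f"centers:           {ci_center:.3f}, {pi_center:.3f}")
print(f"t multiplier:      {t_mult:.3f}")
print(f"residual std err:  {sigma_hat:.3f}")
print(f"width ratio PI/CI: {pi_half / ci_half:.1f}")
```

The implied multiplier is roughly 1.97 and the implied residual standard error is roughly 1.68 in the units of the printed output. The extra width of the prediction interval therefore comes almost entirely from the irreducible noise in a single city's **sales**, not from uncertainty in the estimated coefficients.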

