In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)

In [None]:
california_housing

In [None]:
print(california_housing.DESCR)

In [None]:
housing = california_housing["frame"]
housing.head()

In [None]:
housing.tail()

In [None]:
housing.describe()

In [None]:
housing["MedInc"]

In [None]:
housing.iloc[1]

In [None]:
housing.loc[1, "AveRooms"]

In [None]:
housing.cov()

In [None]:
housing.corr()

# Correlation: Definition and Intuition


Correlation measures the strength and direction of the linear relationship between two variables X and Y.
The Pearson correlation coefficient (r) is defined as:

    r = cov(X, Y) / (σ_X * σ_Y)

where cov(X, Y) is the covariance, and σ_X, σ_Y are the standard deviations of X and Y.

- r ∈ [-1,1]: 
  - r =  1 → Perfect positive correlation
  - r = -1 → Perfect negative correlation
  - r =  0 → No linear relationship

Key properties:
- Correlation is symmetric: r(X, Y) = r(Y, X).
- Correlation ≠ Causation.
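
As a minimal sketch of the definition above, the snippet below computes r directly from the sample covariance and standard deviations in plain Python. The helper name `pearson_r` and the three small data sets are made up for illustration and are not taken from the housing data.

```python
import math

def pearson_r(xs, ys):
    """Pearson r: sample covariance divided by the product of sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov_xy / (std_x * std_y)

# Illustrative data (assumption): not part of the California housing dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # exact increasing linear function of x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # exact decreasing linear function of x
y_sym = [4.0, 1.0, 0.0, 1.0, 4.0]    # y = (x - 3)^2: dependent on x, but not linearly

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_sym = pearson_r(x, y_sym)
print(r_up, r_down, r_sym)
```

The third case is the reason r = 0 is read as "no linear relationship" rather than "no relationship": y is fully determined by x, yet the covariance cancels out. On the DataFrame above, `housing.corr()` reports the same coefficient for every pair of columns.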

In [None]:
sm = pd.plotting.scatter_matrix(housing, figsize=(10, 10))

In [None]:
housing.plot(kind='scatter', x='AveRooms', y='AveBedrms', figsize=(10, 10))

In [None]:
from statsmodels.api import OLS

In [None]:
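# Copy the feature frame so that adding columns does not modify the loaded dataset object.
housing = california_housing["data"].copy()
housing["bias"] = 1  # constant column so that OLS also fits an intercept

model = OLS(california_housing["target"], housing).fit()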

In [None]:
model.summary()

# R-squared: Definition and Intuition


R-squared ($R^2$) measures the proportion of variance in the dependent variable $Y$ explained by the independent variable(s) $X$. It is defined as:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

where:
- $SS_{res} = \sum (y_i - \hat{y}_i)^2$ (Residual Sum of Squares)
- $SS_{tot} = \sum (y_i - \bar{y})^2$ (Total Sum of Squares)

### Key Properties:
- $R^2 \in [0,1]$ for an OLS fit that includes an intercept, evaluated on the data used for fitting:  
  - $R^2 = 1$ → Perfect fit (model explains all variance)  
  - $R^2 = 0$ → Model explains no variance  
- Higher $R^2$ suggests better fit but does not imply causation.  
- Adding predictors never decreases $R^2$ on the training data, even when they are irrelevant (use adjusted $R^2$ to correct for this).  
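
The definition can be checked by hand on a tiny example. The sketch below fits a one-predictor least-squares line in plain Python and evaluates $1 - SS_{res}/SS_{tot}$. The helper names `fit_line` and `r_squared` and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(ys, preds):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Illustrative data (assumption): roughly linear with a little noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
r2_fit = r_squared(y, fitted)

# Baseline that ignores x and always predicts the mean of y: explains no variance.
mean_y = sum(y) / len(y)
r2_baseline = r_squared(y, [mean_y] * len(y))
print(r2_fit, r2_baseline)
```

The mean-only baseline is what $SS_{tot}$ measures, so its $R^2$ is zero by construction. For the model fitted earlier, statsmodels exposes the same quantity as `model.rsquared`.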

Adjusted R-squared ($R^2_{adj}$) accounts for the number of predictors in a model, preventing overestimation of goodness-of-fit. It is defined as:

$$ R^2_{adj} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right) $$

where:
- $R^2$ = Regular R-squared
- $n$ = Number of observations
- $p$ = Number of predictors

### Key Properties:
- $R^2_{adj}$ penalizes adding irrelevant variables.
- Unlike $R^2$, it **increases only if the new predictor improves the fit by more than the extra parameter costs in degrees of freedom**.
- If $p$ increases while predictive power does not, $R^2_{adj}$ **decreases**.
- Better suited for comparing models with different numbers of predictors.
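
A short numeric sketch of the penalty, using the formula above. The function name `adjusted_r2` and the $R^2$ values are hypothetical numbers chosen for illustration, not results from the housing model.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted in p)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical scenario: 101 observations, 4 predictors, R^2 = 0.600.
base = adjusted_r2(0.600, n=101, p=4)

# Adding a nearly useless fifth predictor nudges R^2 up to 0.601 ...
weak_extra = adjusted_r2(0.601, n=101, p=5)

# ... while a useful fifth predictor lifts R^2 to 0.650.
useful_extra = adjusted_r2(0.650, n=101, p=5)

print(base, weak_extra, useful_extra)
```

In this scenario the weak predictor raises $R^2$ but lowers $R^2_{adj}$, while the useful one raises both. For a fitted statsmodels result, the value is available as `model.rsquared_adj`.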

# F-statistic: Definition and Intuition
The F-statistic in regression tests whether at least one predictor variable significantly explains variation in the dependent variable. It is defined as:

$$ F = \frac{\text{Explained Variance per Predictor}}{\text{Unexplained Variance per Residual Degree of Freedom}} = \frac{MSR}{MSE} $$

where:
- $MSR = \frac{SSR}{p}$ (Mean Square Regression)  
- $MSE = \frac{SSE}{n - p - 1}$ (Mean Square Error)  
- $SSR = \sum (\hat{y}_i - \bar{y})^2$ (Sum of Squares for Regression)  
- $SSE = \sum (y_i - \hat{y}_i)^2$ (Sum of Squares for Error, the same quantity as $SS_{res}$ above)  
- $n$ = Number of observations  
- $p$ = Number of predictors  

### Key Properties:
- **Null Hypothesis ($H_0$)**: All regression coefficients are zero ($\beta_1 = \beta_2 = ... = \beta_p = 0$).
- **Alternative Hypothesis ($H_a$)**: At least one $\beta_j \neq 0$, meaning at least one predictor has explanatory power.
- A **higher F-statistic** means the model explains more variance per predictor relative to the residual variance.
- A small **p-value** from the F-test (for example, below 0.05) is evidence against $H_0$.

In [None]:
# Signed relative error of the in-sample predictions: (prediction - actual) / actual.
housing["error"] = (model.fittedvalues - california_housing["target"]) / california_housing["target"]

In [None]:
housing["error"].plot()

In [None]:
housing["error"].plot(kind="hist",bins=100)

# Durbin-Watson Statistic: Definition and Intuition
The Durbin-Watson (DW) statistic tests for **autocorrelation** (serial correlation) in regression residuals. It is defined as:

$$ D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} $$

where:
- $e_t$ = Residual of observation $t$ (observations are taken in a meaningful order, such as time)
- $n$ = Number of observations

### Interpretation:
- $D \approx 2$ → No autocorrelation  
- $D < 2$ → Positive autocorrelation (consecutive residuals are correlated)  
- $D > 2$ → Negative autocorrelation (alternating pattern in residuals)  

### Key Notes:
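- Values close to **0** indicate strong positive autocorrelation; values close to **4** indicate strong negative autocorrelation.
- Most informative when observations have a natural order, as in **time series**; for cross-sectional data it only reflects the row order.
- Autocorrelated errors violate the OLS assumption of independent errors, so standard errors and the tests based on them can be unreliable.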
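
The formula is simple enough to evaluate directly. The sketch below implements it in plain Python and applies it to two made-up residual sequences. The function name and data are assumptions for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Illustrative residuals (assumption), listed in observation order.
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]      # neighbours are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips at every step

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

The slowly drifting sequence gives a value well below 2 (positive autocorrelation), and the alternating one gives a value well above 2 (negative autocorrelation). statsmodels provides the same statistic as `statsmodels.stats.stattools.durbin_watson`, which can be applied to `model.resid`.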



# Jarque-Bera Test: Definition and Intuition

The Jarque-Bera (JB) test checks whether residuals follow a **normal distribution** based on skewness and kurtosis. It is defined as:

$$ JB = \frac{n}{6} \left( S^2 + \frac{(K - 3)^2}{4} \right) $$

where:
- $n$ = Number of observations  
- $S$ = Skewness of residuals  
- $K$ = Kurtosis of residuals (not excess kurtosis; $K = 3$ for a normal distribution)  

### Interpretation:
- **Null Hypothesis ($H_0$)**: Residuals are normally distributed.  
- **Alternative Hypothesis ($H_a$)**: Residuals are not normally distributed.  
- A **higher JB statistic** suggests deviation from normality.  
- Under $H_0$, JB is asymptotically chi-squared distributed with 2 degrees of freedom; a small **p-value** leads to rejecting $H_0$.  

### Key Notes:
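- Normally distributed errors justify exact t- and F-based inference in small samples; in large samples that inference is often approximately valid even without normality.
- JB test is commonly used in **regression diagnostics**.
- With large samples the test has high power, so even small departures from normality lead to rejection.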
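
A minimal sketch of the statistic in plain Python, using moment-based skewness and kurtosis. The function name and the sample are assumptions for illustration. The p-value uses the fact that the survival function of a chi-squared distribution with 2 degrees of freedom is $e^{-x/2}$, an asymptotic approximation that is not reliable for a sample this small.

```python
import math

def jarque_bera(residuals):
    """Return (JB, skewness, kurtosis, asymptotic p-value) from central moments."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2          # non-excess kurtosis, 3 for a normal distribution
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)  # chi-squared(2) survival function
    return jb, skew, kurt, p_value

# Illustrative residuals (assumption): symmetric, so the skewness term vanishes.
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb, skew, kurt, p_value = jarque_bera(sample)
print(jb, skew, kurt, p_value)
```

Here only the kurtosis term contributes, because the sample is symmetric around zero. statsmodels offers `statsmodels.stats.stattools.jarque_bera`, which can be applied to `model.resid` and is the source of the JB line in `model.summary()`.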

Let us try another pair of variables: regress `AveBedrms` on `AveRooms`.

In [None]:
model2 = OLS(housing["AveBedrms"], housing[["bias", "AveRooms"]]).fit()

In [None]:
model2.summary()

In [None]:
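# Drop the derived "error" column (it leaks the target), then keep rows where every feature is <= its 75th percentile.
features = housing.drop(columns="error", errors="ignore")
less_data = features[features <= features.quantile(0.75)].dropna()
less_data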

In [None]:
y = california_housing["target"].loc[less_data.index]
y

In [None]:
model = OLS(y, less_data).fit()

In [None]:
model.summary()