# Linear Regression with SciPy and NumPy

In this notebook, we'll explore how to perform **Simple Linear Regression (SLR)** and **Multiple Linear Regression (MLR)** using **SciPy** and **NumPy**.

SciPy provides a lightweight and extremely fast approach (`stats.linregress`) for SLR, and NumPy allows efficient β-coefficient calculation for MLR using the least squares solution (`np.linalg.lstsq`).

I'm gonna make this notebook intentionally minimal, as I've already covered the theory and concepts in previous notebooks. Here, the focus is on:
- Clean explanations
- Lightweight modeling
- Comparison with StatsModels & Scikit-learn

***Let's get started!***

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
df = pd.read_csv("sbp_data.csv")
df.head()

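|   | Age | BMI  | Activity | SaltIntake | SBP   |
|---|-----|------|----------|------------|-------|
| 0 | 63  | 25.9 | 2        | 7.4        | 131.3 |
| 1 | 76  | 29.4 | 2        | 8.0        | 153.8 |
| 2 | 53  | 26.1 | 5        | 10.6       | 122.0 |
| 3 | 39  | 25.7 | 4        | 5.5        | 111.7 |
| 4 | 67  | 31.5 | 4        | 9.5        | 142.1 |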


## Simple Linear Regression with SciPy

SciPy's `linregress` is one of the easiest ways to run a quick linear regression.

It returns:
- slope (`β1`)
- intercept (`β0`)
- `r-value` (correlation)
- `p-value`
- standard error

This is actually great for fast statistical checks.

In [3]:
X = df["Age"]
y = df["SBP"]

slr = stats.linregress(X, y)
slr

LinregressResult(slope=np.float64(0.35198419048180324), intercept=np.float64(114.26630027685167), rvalue=np.float64(0.5566698385593014), pvalue=np.float64(1.1381697331370315e-17), stderr=np.float64(0.03732972289740193), intercept_stderr=np.float64(2.024298664902707))

### Let me display the results clearly for you

In [4]:
print(f"Slope (β1): {slr.slope}")
print(f"Intercept (β0): {slr.intercept}")
print(f"R-value: {slr.rvalue}")
print(f"P-value: {slr.pvalue}")
print(f"Standard Error: {slr.stderr}")

Slope (β1): 0.35198419048180324
Intercept (β0): 114.26630027685167
R-value: 0.5566698385593014
P-value: 1.1381697331370315e-17
Standard Error: 0.03732972289740193


### Let me give you a quick summary of the regression results:

Our model basically has the form:

$$
\text{SBP} = \beta_0 + \beta_1 (\text{Age})
$$

Where:
* `β0` (Intercept) = 114.2663
* `β1` (Slope) = 0.35198

1. Slope (`β1 = 0.35198`)
This means, for each additional year of age, SBP increases by about `0.35 mmHg` on average. So, a 10-year increase in age corresponds to about a `3.5 mmHg` rise in predicted SBP. This indicates a positive and moderate relationship between age and SBP.

2. Intercept (`β0 = 114.2663`)
This is the predicted SBP when `Age = 0`. While no adult has age 0, this value serves as the baseline of the regression line. It doesn't have a meaningful medical interpretation but is needed for the mathematical model.

3. `R-value (0.5567)`
This is the Pearson correlation coefficient. Values close to 1 indicate a strong linear relationship, so `~0.56` means there is a clear but far from perfect positive relationship between age and SBP. Squaring it gives `R² ≈ 0.31`, so age alone explains roughly 31% of the variance in SBP.

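4. `P-value (~1.14 × 10⁻¹⁷)`
This is extremely small, so the relationship between age and SBP is statistically significant. If there were truly no linear relationship between the two, a slope this far from zero would almost never arise from sampling variation alone. (Note that the p-value is not the probability that the result occurred by chance.)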

5. Standard Error of the Slope (`0.03733`)
This measures uncertainty in the slope estimate. A small SE relative to the slope means the slope estimate is precise. Here the SE (`0.037`) is roughly a tenth of the slope (`0.352`), so the estimate is quite precise.
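The reported numbers can be combined into a few extra diagnostics. The sketch below is not part of the original analysis; it only re-uses the printed `linregress` values. The sample size is not stated in this notebook, so the degrees of freedom are inferred from the SLR identity $t = r\sqrt{n-2}/\sqrt{1-r^2}$. Treat that as an assumption and check it against `len(df) - 2`.

```python
from scipy import stats

# Values copied from the linregress output above
slope = 0.35198419048180324
stderr = 0.03732972289740193
r = 0.5566698385593014

r_squared = r ** 2          # share of SBP variance explained by Age
t_stat = slope / stderr     # test statistic for H0: slope = 0

# For simple linear regression, t = r * sqrt(dof) / sqrt(1 - r^2),
# so the degrees of freedom (n - 2) can be recovered from r and t.
# ASSUMPTION: inferred, not read from the data.
dof = t_stat ** 2 * (1 - r_squared) / r_squared

# 95% confidence interval for the slope, using the inferred dof
t_crit = stats.t.ppf(0.975, dof)
ci_low = slope - t_crit * stderr
ci_high = slope + t_crit * stderr

print(f"R^2: {r_squared:.3f}")
print(f"t statistic: {t_stat:.2f}")
print(f"Implied degrees of freedom: {dof:.1f}")
print(f"95% CI for slope: ({ci_low:.3f}, {ci_high:.3f})")
```

The t statistic comes out at about 9.4, which is what drives the tiny p-value, and the interval gives a more tangible sense of "precise" than the standard error alone.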

### Making Predictions using SLR

In [5]:
example_age = [25, 40, 55, 70]
pred_sbp = slr.intercept + slr.slope * pd.Series(example_age)

pd.DataFrame({"Age": example_age, "Predicted_SBP": pred_sbp})

|   | Age | Predicted_SBP |
|---|-----|---------------|
| 0 | 25  | 123.065905    |
| 1 | 40  | 128.345668    |
| 2 | 55  | 133.625431    |
| 3 | 70  | 138.905194    |


Using our model:

$$
\text{Predicted SBP} = 114.2663 + 0.352 \times \text{Age}
$$

From our result DataFrame, as age increases from `25 → 70`, predicted SBP rises steadily from `123 → 139 mmHg`, which is consistent with the positive slope.

### So, what is our model telling us?

* Age is a significant predictor of systolic blood pressure.
* SBP increases by roughly `0.35 mmHg` per year of age.
* The model shows a moderate correlation between age and SBP.
* The statistical significance is extremely strong.

## Multiple Linear Regression (MLR) with NumPy

SciPy doesn't have a direct `linregress` equivalent for regression with several predictors. So, instead, we'll use the ordinary least squares (OLS) solution:

$$
\beta = (X^\top X)^{-1} X^\top y
$$

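This is the mathematical foundation of linear regression. If you remember, we derived the same formula for the β-coefficients in our manual calculations. NumPy's `linalg.lstsq` returns the same solution, but it computes it through a singular value decomposition instead of explicitly inverting $X^\top X$, which is numerically more stable.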
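To make the link between the formula and `lstsq` concrete, here is a minimal sketch on synthetic data (not the SBP dataset; `X_demo`, `y_demo`, and `true_beta` are made-up names for illustration). It solves the normal equations directly and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the SBP dataset): 100 rows, 3 predictors
n = 100
predictors = rng.normal(size=(n, 3))
X_demo = np.column_stack([np.ones(n), predictors])  # intercept column first
true_beta = np.array([2.0, 0.5, -1.0, 3.0])
y_demo = X_demo @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming an explicit inverse
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# SVD-based least squares, the routine used in this notebook
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("Normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
print("Same solution:", np.allclose(beta_normal, beta_lstsq))
```

On a well-conditioned matrix like this one the two routes agree to floating-point precision. The difference only starts to matter when the predictors are nearly collinear, which is where `lstsq` tends to be the safer choice.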

### Prepare `X` and `y`

In [6]:
features = ["Age", "BMI", "Activity", "SaltIntake"]

X = df[features].to_numpy()
X = np.column_stack([np.ones(len(X)), X])  # add intercept manually

y = df["SBP"].to_numpy()

What we did here is:
- Selected multiple features: `Age`, `BMI`, `Activity`, `SaltIntake` from our DataFrame.
- Converted them to a NumPy array.
- Added a column of ones to `X` to account for the intercept term in the regression model. `np.linalg.lstsq` does not add an intercept automatically, so without this column the model

    $$
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots
    $$

    would be fitted with no `β0`, which forces the fit through the origin. Adding a column of ones lets the solver estimate `β0` as part of the coefficient vector.
- Defined `y` as the target variable, `SBP`.
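The effect of the ones column is easy to see on a toy example. This is a sketch on synthetic one-feature data (not the SBP dataset; the true intercept of 50 and slope of 0.4 are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic one-feature data (NOT the SBP dataset) with a true intercept of 50
x = rng.uniform(20, 80, size=200)
y_demo = 50.0 + 0.4 * x + rng.normal(scale=1.0, size=200)

# Without a ones column the fitted line is forced through the origin
X_no_intercept = x.reshape(-1, 1)
beta_no_intercept, *_ = np.linalg.lstsq(X_no_intercept, y_demo, rcond=None)

# With a ones column the first coefficient plays the role of beta_0
X_with_intercept = np.column_stack([np.ones(len(x)), x])
beta_with_intercept, *_ = np.linalg.lstsq(X_with_intercept, y_demo, rcond=None)

print("Slope only:         ", beta_no_intercept)
print("Intercept and slope:", beta_with_intercept)
```

Without the intercept column the slope has to absorb the baseline, so it lands well above the true 0.4. With the column in place, both parameters are recovered close to the values used to generate the data.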

### Fit using least squares

In [7]:
beta, residuals, rank, s = np.linalg.lstsq(X, y, rcond=None)

print("β-coefficients:", beta)
print("Residuals:", residuals)
print("Rank of X:", rank)
print("Singular values of X:", s)

β-coefficients: [81.30327427  0.37877197  0.91775946 -0.94135103  1.11620301]
Residuals: [10525.88865788]
Rank of X: 5
Singular values of X: [860.5689742  117.73475252  28.98122409  21.69467267   1.38870955]


Too much info to interpret at once? 😅 Let me explain each output from `np.linalg.lstsq` one by one.

1. `beta`: the regression coefficients

```
β-coefficients: [81.30327427  0.37877197  0.91775946 -0.94135103  1.11620301]
```

These numbers correspond to the estimated parameters in our linear model:

$$
y = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{BMI} + \beta_3 \text{Activity} + \beta_4 \text{SaltIntake}
$$

| Coefficient | Meaning                                                                |
| ----------- | ---------------------------------------------------------------------- |
| **81.30**   | Intercept (predicted SBP when all features = 0)                        |
| **0.379**   | Increase in SBP per unit increase in Age, other features held fixed    |
| **0.918**   | Increase in SBP per unit increase in BMI, other features held fixed    |
| **–0.941**  | Decrease in SBP per unit increase in Activity, other features held fixed |
| **1.116**   | Increase in SBP per unit increase in SaltIntake, other features held fixed |


2. `residuals`: the sum of squared errors (SSE)

```
Residuals: [10525.88865788]
```

`np.linalg.lstsq` gives us the total sum of squared residuals:

$$
\text{SSE} = \sum (y_i - \hat{y}_i)^2
$$

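This measures how far the predictions are from the actual SBP values; a lower SSE means a better fit on this data. Note that it is a single number, not the residual for each data point. `lstsq` only returns it when `X` has full column rank and more rows than columns; otherwise the array is empty.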

3. `rank`: the effective rank of the matrix `X`

```
Rank of X: 5
```

Our `X` matrix has 5 columns (intercept + 4 features). A rank of `5` means:
* All features are linearly independent.
* No perfect multicollinearity.
* Regression is well-posed (full column rank).

If the rank were < 5, you would have redundant or perfectly correlated predictors.

4. `s`: the singular values of `X`

```
Singular values of X: [860.5689742  117.73475252  28.98122409  21.69467267   1.38870955]
```

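These are used internally for solving the least-squares problem and tell us about numerical stability. One singular value (`1.3887`) is much smaller than the rest, which hints at some mild multicollinearity but is nowhere near small enough to cause rank deficiency.

A common diagnostic is the condition number:

$$
\text{condition number} = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{860.569}{1.389} \approx 620
$$

As a rule of thumb, values > 1000 are usually considered severe, so our design matrix is moderately but not dangerously ill-conditioned. Keep in mind that this number is computed on the unscaled `X`, so part of it reflects the different scales of the columns (a column of ones next to ages in the tens) rather than correlation between the predictors.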
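The four outputs can be cross-checked against each other. The sketch below first recomputes the condition number from the singular values printed above, then uses synthetic data (not the SBP dataset; every `*_demo` name is made up for illustration) to show how `residuals`, `rank`, and `s` relate to other NumPy helpers.

```python
import numpy as np

# Condition number from the singular values printed above
s_reported = np.array([860.5689742, 117.73475252, 28.98122409, 21.69467267, 1.38870955])
cond_reported = s_reported.max() / s_reported.min()
print(f"Condition number: {cond_reported:.1f}")

# Synthetic stand-in data (NOT the SBP dataset): intercept + 4 predictors
rng = np.random.default_rng(2)
n = 150
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y_demo = X_demo @ np.array([80.0, 0.4, 0.9, -0.9, 1.1]) + rng.normal(size=n)

beta_demo, sse_demo, rank_demo, s_demo = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# `residuals` is the SSE: the sum of squared per-observation residuals
per_point = y_demo - X_demo @ beta_demo
print("SSE matches:", np.isclose(sse_demo[0], np.sum(per_point ** 2)))

# R^2 and RMSE follow directly from the SSE
sst = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared = 1 - sse_demo[0] / sst
rmse = np.sqrt(sse_demo[0] / n)
print(f"R^2: {r_squared:.3f}, RMSE: {rmse:.3f}")

# `rank` and `s` agree with NumPy's dedicated helpers
print("Rank matches:", rank_demo == np.linalg.matrix_rank(X_demo))
print("Cond matches:", np.isclose(s_demo.max() / s_demo.min(), np.linalg.cond(X_demo)))

# Appending a column that is the sum of two existing ones lowers the rank
X_redundant = np.column_stack([X_demo, X_demo[:, 1] + X_demo[:, 2]])
print(np.linalg.matrix_rank(X_redundant), "of", X_redundant.shape[1], "columns")
```

The last line is the rank-deficient case described under `rank`: six columns but only five independent directions, so the coefficients would no longer be uniquely determined.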

### Present β coefficients cleanly

In [8]:
coef_df = pd.DataFrame({
    "Feature": ["Intercept"] + features,
    "Coefficient": beta
})

coef_df

|   | Feature    | Coefficient |
|---|------------|-------------|
| 0 | Intercept  | 81.303274   |
| 1 | Age        | 0.378772    |
| 2 | BMI        | 0.917759    |
| 3 | Activity   | -0.941351   |
| 4 | SaltIntake | 1.116203    |


I have simply built a table (`coef_df`) that lists:
- Each feature used in the regression (including the intercept)
- The corresponding estimated coefficient from the least-squares solution (beta)

In short, it organizes the regression results into a readable form.

The DataFrame is just showing how each predictor affects SBP according to our linear model:

| Feature               | Interpretation of Coefficient (other features held constant)          |
| --------------------- | --------------------------------------------------------------------- |
| Intercept (81.30)     | Baseline SBP when all features are zero                               |
| Age (0.38)            | Each +1 year of age increases SBP by `~0.38 mmHg`                     |
| BMI (0.92)            | Each +1 BMI unit increases SBP by `~0.92 mmHg`                        |
| Activity (−0.94)      | Each +1 activity unit lowers SBP by `~0.94 mmHg`                      |
| SaltIntake (1.12)     | Each +1 unit of salt intake increases SBP by `~1.12 mmHg`             |


### Making Predictions using MLR

In [9]:
# Profiles similar to earlier notebooks
people = pd.DataFrame({
    "Age": [45, 60, 30],
    "BMI": [27, 31, 22],
    "Activity": [3, 1, 6],
    "SaltIntake": [8.0, 10.0, 5.0]
})

# Select columns in the same order used for fitting, then prepend the intercept column
X_new = np.column_stack([np.ones(len(people)), people[features].to_numpy()])
pred = X_new @ beta

pd.DataFrame({"person": ["A (45y, 27)", "B (60y, 31)", "C (30y, 22)"],
              "predicted_SBP": pred})

|   | person      | predicted_SBP |
|---|-------------|---------------|
| 0 | A (45y, 27) | 129.233090    |
| 1 | B (60y, 31) | 142.700815    |
| 2 | C (30y, 22) | 112.790050    |


# Final Thoughts

SciPy is excellent when you want:
- A super fast regression
- Minimal setup
- Immediate slope/intercept insights

It works beautifully alongside StatsModels and Scikit-learn, giving our full regression framework another clean, mathematically grounded option.

## Comparison Summary

### SciPy x NumPy

| person          | predicted_SBP  |
|-----------------|----------------|
| A (45y, 27)     | 129.233090     |
| B (60y, 31)     | 142.700815     |
| C (30y, 22)     | 112.790050     |


### StatsModels 

| person          | predicted_SBP  |
|-----------------|----------------|
| A (45y, 27)     | 129.2          |
| B (60y, 31)     | 142.7          |
| C (30y, 22)     | 112.8          |


### Scikit-learn

| person          | predicted_SBP  |
|-----------------|----------------|
| A (45y, 27)     | 129.4          |
| B (60y, 31)     | 142.2          |
| C (30y, 22)     | 113.7          |


### Difference Range between Methods

| Person | NumPy/SciPy | StatsModels | Scikit-learn | Difference Range |
| ------ | ----------- | ----------- | ------------ | ---------------- |
| A      | 129.23      | 129.2       | 129.4        | ~0.2             |
| B      | 142.70      | 142.7       | 142.2        | ~0.5             |
| C      | 112.79      | 112.8       | 113.7        | ~0.9             |

These are all very close. The NumPy and StatsModels predictions agree to the rounding shown, while the Scikit-learn predictions differ by up to about `0.9 mmHg`. That gap is larger than floating-point precision alone would produce, so it most likely comes from a difference in how that model was fit (for example, which rows were used for training) rather than from the algorithm itself. Overall, SciPy x NumPy provides a fast, lightweight way to perform both SLR and MLR with clear outputs, which makes it a great choice for quick regression tasks without the overhead of larger libraries. Since all three methods yield very similar predictions, you can confidently choose based on your specific needs and preferences!
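A quick way to see how closely two OLS routes agree when they are fit on exactly the same data is sketched below. It uses only the libraries imported in this notebook and synthetic data (not the SBP dataset; `age_demo` and `sbp_demo` are made-up names), so it says nothing about the StatsModels or Scikit-learn numbers above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic stand-in data (NOT the SBP dataset)
age_demo = rng.uniform(20, 80, size=200)
sbp_demo = 114.0 + 0.35 * age_demo + rng.normal(scale=8.0, size=200)

# Route 1: SciPy's linregress
res = stats.linregress(age_demo, sbp_demo)

# Route 2: NumPy least squares with an explicit intercept column
X_demo = np.column_stack([np.ones(len(age_demo)), age_demo])
beta_demo, *_ = np.linalg.lstsq(X_demo, sbp_demo, rcond=None)

# Largest absolute gap between the two sets of (intercept, slope)
gap = np.max(np.abs(beta_demo - np.array([res.intercept, res.slope])))
print(f"Largest coefficient gap: {gap:.2e}")
```

When the inputs are identical, the gap is many orders of magnitude smaller than anything visible at one decimal place, which is a useful baseline when comparing outputs across libraries.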

This concludes our exploration of Linear Regression using different Python libraries. 