# Multivariate linear

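Multivariate linear regression, as the term is used here, is a statistical method for modeling the relationship between multiple independent variables and a single dependent variable (statisticians often call this *multiple* linear regression and reserve "multivariate" for models with several dependent variables).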
In contrast to [univariate linear regression](../univariate-linear), which involves only one independent variable, multivariate linear regression considers several simultaneously.
The goal is to create a linear equation that best fits the observed data, allowing for predictions or explanations of the dependent variable based on the values of the independent variables.

As always, let's load our CSV file.

In [1]:
import numpy as np
import pandas as pd

In [2]:
CSV_PATH = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/biosc1540/files/csv/advertising-data.csv"

df = pd.read_csv(CSV_PATH)

In the case of multivariate regression, we collect all of our independent variables in one dataframe called `df_features` and our dependent variable in `df_targets`.

In [3]:
target_column = "Product_Sold"

df_targets = df[target_column]
df_features = df.drop(columns=[target_column], inplace=False)

Now we need to convert the dataframe to NumPy arrays and reshape the targets.

In [4]:
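# Reshape the targets from (n_samples,) to a single column of shape (n_samples, 1).
targets = df_targets.to_numpy().reshape(-1, 1)
# The features already form a 2D array of shape (n_samples, n_features).
features = df_features.to_numpy()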
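
To make the reshaping step concrete, here is a minimal sketch on a tiny, made-up dataframe. The values and the names `df_small`, `features_small`, `targets_1d`, and `targets_2d` are hypothetical and only illustrate how the array shapes change.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are made up for illustration.
df_small = pd.DataFrame(
    {
        "TV": [100.0, 200.0, 300.0],
        "Billboards": [10.0, 20.0, 30.0],
        "Product_Sold": [500.0, 900.0, 1300.0],
    }
)

features_small = df_small.drop(columns=["Product_Sold"]).to_numpy()
targets_1d = df_small["Product_Sold"].to_numpy()
targets_2d = targets_1d.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

print(features_small.shape)  # (3, 2)
print(targets_1d.shape)      # (3,)
print(targets_2d.shape)      # (3, 1)
```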

## Linear

Linear regression models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the observed data.
The equation has the form:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n
$$

Here:

-   $y$ is the dependent variable (target),
-   $\beta_0$ is the intercept (constant),
-   $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients associated with each feature $x_1, x_2, \ldots, x_n$.
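
The linear equation can be evaluated for many observations at once with a matrix product. The following is a minimal sketch; the intercept, coefficients, and feature values are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
beta_0 = 2.0                        # intercept
beta = np.array([0.5, -1.0, 3.0])   # one coefficient per feature

# Two observations (rows) with three features (columns) each.
X = np.array(
    [
        [1.0, 2.0, 3.0],
        [0.0, 1.0, 0.5],
    ]
)

# y = beta_0 + beta_1 * x_1 + ... + beta_n * x_n, computed for every row at once.
y = beta_0 + X @ beta
print(y)  # [9.5 2.5]
```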

Linear regression is commonly used for predicting a continuous variable (regression task) based on one or more input features.
It assumes a linear relationship between the independent and dependent variables.

The quality of a linear regression fit is typically measured with metrics such as R-squared ($R^2$), Mean Squared Error (MSE), or Mean Absolute Error (MAE).
$R^2$ represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
A higher $R^2$ indicates a better fit, whereas lower MSE and MAE indicate smaller prediction errors.
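
These three metrics are simple to compute by hand. The sketch below uses made-up actual and predicted values (`y_true` and `y_pred` are hypothetical) to show the standard definitions.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred

mse = np.mean(residuals**2)        # Mean Squared Error
mae = np.mean(np.abs(residuals))   # Mean Absolute Error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mae, r2)
```

For these numbers, MSE is 0.375, MAE is 0.5, and $R^2$ is 0.925.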

Linear regression is widely used in various fields, including economics, finance, biology, and marketing, for modeling and analyzing the relationships between variables.
In the context of our example, the company is using linear regression to understand the impact of advertising spending on product sales, helping them make informed decisions about adjusting ad budgets.

### Objective function

The goal of linear regression is to find the values of the coefficients that minimize the difference between the predicted values and the actual values in the training data. This is often done using a method called Ordinary Least Squares (OLS).
Specifically, the objective function to be minimized is the sum of squared differences between the actual values ($y_i$) and the predicted values ($\hat{y}_i$):

$$
\text{minimize} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}
$$
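
As a rough illustration of what minimizing this objective looks like in practice, the sketch below solves the least-squares problem with `np.linalg.lstsq` on synthetic data. The data, `true_intercept`, and `true_coef` are assumptions made up for this example, and this is not necessarily how any particular library solves the problem internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a known linear model plus a little noise.
X = rng.normal(size=(200, 3))
true_intercept = 4.0
true_coef = np.array([1.5, -2.0, 0.5])
y = true_intercept + X @ true_coef + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the beta that minimizes sum((y - X_design @ beta) ** 2).
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

sse = np.sum((y - X_design @ beta_hat) ** 2)
print(beta_hat)  # first entry is the intercept, the rest are coefficients
print(sse)
```

The estimates should land close to the values used to generate the data, and no other choice of coefficients gives a smaller sum of squared errors on this dataset.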

### Limitations

Linear regression is a powerful and widely used statistical method for modeling the relationship between variables.
However, like any modeling approach, it has its limitations.
Here, we explore the constraints and considerations associated with linear regression.

**Assumption of Linearity**

Linear regression assumes a linear relationship between the independent and dependent variables. The model is designed to capture linear patterns, and deviations from linearity can lead to inaccurate predictions.

**Sensitivity to Outliers**

Linear regression is sensitive to outliers, which are extreme observations that can disproportionately influence the model. Outliers can skew coefficient estimates and compromise the accuracy of predictions.
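
A small sketch of this effect, using made-up data that lie exactly on the line $y = 2x + 1$ and then corrupting a single observation:

```python
import numpy as np

# Ten points that lie exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Copy the data and replace the last observation with an extreme value.
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# np.polyfit with deg=1 returns (slope, intercept) of the least-squares line.
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print(slope_clean, slope_outlier)
```

Changing one point out of ten moves the fitted slope from 2 to more than 6, because squared errors give large deviations a disproportionate weight.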

**Assumption of Independence**

The independence of residuals is a critical assumption in linear regression. If residuals exhibit autocorrelation, where the values are not independent, it can violate the model's assumptions and impact the reliability of results.

**Multicollinearity**

Multicollinearity arises when independent variables are highly correlated, leading to instability in coefficient estimates. This phenomenon complicates the interpretation of individual variable impacts and poses challenges in identifying true predictors.
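
One common way to check for multicollinearity is to inspect the correlation matrix of the features and the variance inflation factors (VIFs). The sketch below uses synthetic features, where `x2` is deliberately built as a near copy of `x1`. It relies on the identity that the VIFs are the diagonal entries of the inverse correlation matrix. A VIF above roughly 5 to 10 is a frequently quoted rule of thumb for concern, not a hard threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to the other two
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between the feature columns.
corr = np.corrcoef(X, rowvar=False)

# Variance inflation factors: diagonal of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))

print(np.round(corr, 3))
print(np.round(vif, 1))
```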

**Assumption of Homoscedasticity**

Homoscedasticity, or constant variance of residuals across different levels of the independent variables, is assumed in linear regression. Heteroscedasticity can introduce bias in standard errors and affect hypothesis testing.

**Limited to Linear Relationships**

Linear regression is limited to capturing linear relationships between variables. When faced with nonlinear patterns, the model may fail to accurately represent the underlying dynamics.

**Influence of Scaling**

The scale of variables can impact linear regression coefficients. Changes in variable scale can alter their perceived impact on the model, and comparisons between coefficients may lose meaning.
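
The sketch below illustrates this with synthetic data (the spending and sales numbers are made up, and `fit_line` is a helper defined only for this example). Expressing the same feature in thousands of dollars instead of dollars multiplies its coefficient by 1000, while the intercept and the predictions stay the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ad spend in dollars and the resulting sales.
spend_dollars = rng.uniform(1_000, 5_000, size=100)
sales = 50.0 + 0.02 * spend_dollars + rng.normal(scale=1.0, size=100)


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    design = np.column_stack([np.ones_like(x), x])
    (intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
    return intercept, slope


# Same data, but the feature is expressed in thousands of dollars.
spend_thousands = spend_dollars / 1_000

intercept_dollars, slope_dollars = fit_line(spend_dollars, sales)
intercept_thousands, slope_thousands = fit_line(spend_thousands, sales)

print(slope_dollars, slope_thousands)
```

This is why raw coefficients should only be compared across features that share the same units.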

**No Variable Selection**

Linear regression includes all features in the model, irrespective of their relevance. Lack of automatic variable selection can lead to overfitting if irrelevant features are incorporated.

**Not Robust to Violations of Assumptions**

Violations of linear regression assumptions, such as nonlinearity or heteroscedasticity, can result in biased estimates and unreliable predictions. The model's performance is contingent on meeting these assumptions.

**Correlation vs. Causation**

While linear regression establishes correlations between variables, it does not imply causation. Associations observed may be coincidental or influenced by unobserved factors, necessitating caution in drawing causal inferences.

### Implementation

#### sklearn

In [5]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X=features, y=targets)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(X=features, y=targets))

[[1.97147823 2.79786525 1.59446751 2.43283307 1.40693022 3.91183385]]
[36.65524744]
0.9401750192922066


The coefficients (`reg.coef_`) represent the weights assigned to each advertising channel (TV, Billboards, Google Ads, Social Media, Influencer Marketing, Affiliate Marketing) in predicting product sales.
These coefficients indicate the estimated change in product sales for a one-unit increase in each respective advertising channel, holding other variables constant.

The intercept (`reg.intercept_`) is the constant term in the linear equation.
The intercept represents the estimated product sales when all advertising spending is zero.

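The R-squared value (`reg.score(X=features, y=targets)`) indicates the proportion of the variance in product sales that the advertising spending variables can explain. In this case, the model explains approximately 94% of the variability in product sales. Because the score is computed on the same data used for fitting, it measures goodness of fit rather than predictive performance on new data.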
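
To estimate how well a model generalizes, a common approach is to hold out part of the data and score the model on it. The sketch below does this with `train_test_split` on synthetic data. The data, `true_coef`, and the 75/25 split are arbitrary choices for illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for six advertising channels (values are made up).
X = rng.uniform(0, 100, size=(300, 6))
true_coef = np.array([2.0, 2.8, 1.6, 2.4, 1.4, 3.9])
y = 36.0 + X @ true_coef + rng.normal(scale=20.0, size=300)

# Keep 25% of the observations aside for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)

train_r2 = reg.score(X_train, y_train)  # fit quality on data the model has seen
test_r2 = reg.score(X_test, y_test)     # estimate of performance on new data
print(train_r2, test_r2)
```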

These coefficients can guide the company on how each advertising channel contributes to product sales.
For example, a higher coefficient suggests a larger increase in sales per additional unit of spending on that channel, provided the channels are measured in the same units.
Adjusting ad budgets based on these coefficients could improve their advertising strategy.
Remember that correlation does not imply causation, so further analysis and experimentation may be needed to validate the findings.

## Ridge

Ridge regression, also known as Tikhonov regularization or $L_2$ regularization, is a linear regression technique that introduces a regularization term to the linear regression objective function.
The primary goal of ridge regression is to address the issue of multicollinearity in linear regression models.

### Objective function


In ridge regression, a regularization term is added to the ordinary least squares objective function (the sum of squared differences between actual and predicted values).
The new objective function becomes:

$$
\text{minimize} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
$$

Here:

-   $n$ is the number of observations.
-   $p$ is the number of predictors (features).
-   $y_i$ is the actual value of the dependent variable for the $i$-th observation.
-   $\hat{y}_i$ is the predicted value of the dependent variable for the $i$-th observation.
-   $\alpha$ is the regularization parameter, controlling the strength of regularization.
-   $\beta_j$ is the coefficient of the $j$-th predictor. The sum starts at $j = 1$, so the intercept $\beta_0$ is not penalized.
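
This objective has a closed-form minimizer. The sketch below is one way to compute it with NumPy, assuming the intercept is left unpenalized by centering the data first. The function `ridge_fit` and the synthetic data are hypothetical and written only for this example; libraries may use different solvers.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # centering removes the intercept from the problem
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) beta = Xc^T yc for beta.
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

_, coef_ols = ridge_fit(X, y, alpha=0.0)      # alpha = 0 reduces to OLS
_, coef_ridge = ridge_fit(X, y, alpha=100.0)  # a strong penalty shrinks coefficients

print(coef_ols)
print(coef_ridge)
```

With `alpha=0.0` the penalty vanishes and the result is the ordinary least squares fit. As `alpha` grows, the coefficient vector shrinks toward zero.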


### Limitations


**Not Suitable for Feature Selection**

One of the primary limitations of ridge regression lies in its inability to perform feature selection.
Unlike some other regularization techniques, ridge regression retains all features in the model, making it less suitable for scenarios where variable selection is a critical requirement.

**Interpretability Challenges**

While ridge regression addresses multicollinearity, it does so at the expense of straightforward interpretability.
The regularization term introduces a level of complexity to coefficient interpretation, as the coefficients are shrunk towards zero.
This departure from the clear interpretability of standard linear regression should be acknowledged.

**Dependency on Scaling**

The performance of ridge regression is influenced by the scale of the variables.
Rescaling or standardizing the features is often necessary to ensure effective regularization.
Failure to do so can result in unequal penalization of coefficients, impacting the model's stability.
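
One way to handle this in scikit-learn is to chain a `StandardScaler` and a `Ridge` model in a pipeline, so the same scaling is applied at fit and predict time. The sketch below uses synthetic features on very different scales. The data and `scales` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic features whose magnitudes differ by orders of magnitude.
scales = np.array([1.0, 100.0, 10_000.0])
X = rng.normal(size=(300, 3)) * scales
y = X @ (np.array([2.0, 3.0, 1.0]) / scales) + rng.normal(scale=0.1, size=300)

# Standardize each feature to zero mean and unit variance, then fit ridge.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

# Coefficients are expressed per standard deviation of each feature.
coef = model.named_steps["ridge"].coef_
print(coef)
print(model.score(X, y))
```

After standardization, the penalty treats all coefficients on a comparable footing, and their magnitudes can be compared more meaningfully.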

**No Sparsity in Coefficients**

Unlike some regularization methods, such as LASSO (L1 regularization), ridge regression does not lead to exact zero coefficients.
It shrinks coefficients towards zero but retains all features in the model.
If sparsity is a critical consideration, alternative regularization methods may be more appropriate.
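
The difference can be seen on a small synthetic problem where only the first two of five features affect the target. The data and the `alpha` values below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Only the first two features influence the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # small but nonzero values for the irrelevant features
print(lasso.coef_)  # exact zeros for the irrelevant features
```

Ridge keeps every coefficient nonzero, while LASSO sets the coefficients of the three irrelevant features exactly to zero in this example.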

**Model Complexity**

Ridge regression introduces a level of model complexity, with the choice of the regularization parameter ($\alpha$) playing a pivotal role.
Selecting an optimal $\alpha$ value often involves techniques like cross-validation, adding a layer of complexity to model tuning.
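
scikit-learn provides `RidgeCV` for this kind of search. The sketch below tries a logarithmic grid of candidate values with 5-fold cross-validation on synthetic data. The grid, the number of folds, and the data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Synthetic data; two of the five features have no effect on the target.
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Candidate regularization strengths from 0.001 to 1000.
alphas = np.logspace(-3, 3, 13)

# Pick the alpha with the best 5-fold cross-validated score, then refit.
model = RidgeCV(alphas=alphas, cv=5)
model.fit(X, y)

print(model.alpha_)  # the selected regularization strength
print(model.coef_)
```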

**Assumption of Linearity**

Similar to standard linear regression, ridge regression assumes a linear relationship between the independent and dependent variables.
If the true relationship is significantly nonlinear, ridge regression may not capture the underlying patterns accurately.

**Limited Handling of Multicollinearity**

While ridge regression is effective in mitigating multicollinearity, it may not completely eliminate the issue.
In cases of severe multicollinearity, additional techniques or data preprocessing methods may be necessary.

**Sensitivity to Outliers**

The penalty limits how far the coefficients can move, which can dampen the effect of outliers somewhat compared to standard linear regression. However, ridge regression still minimizes squared residuals, so extreme values can still influence the fit and the resulting coefficients.
Practitioners should exercise caution in the presence of outliers.

**Not a Panacea**

It is crucial to recognize that ridge regression is not a universal solution.
Its effectiveness depends on the specific characteristics of the data and the objectives of the analysis.
Careful consideration of alternative regularization techniques may be warranted in certain scenarios.

### Implementation

#### sklearn

In [6]:
from sklearn.linear_model import Ridge

reg = Ridge(alpha=1.0)
reg.fit(X=features, y=targets)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(X=features, y=targets))

[[1.97147817 2.79786515 1.59446745 2.43283298 1.40693016 3.9118337 ]]
[36.65551041]
0.9401750192922053

With `alpha=1.0`, the ridge coefficients agree with the ordinary least squares coefficients above to about seven significant figures, and the $R^2$ is essentially unchanged.
This suggests that the penalty is negligible relative to the sum of squared residuals for this dataset, and that a substantially larger `alpha` would likely be needed to see noticeable shrinkage.
