## Multiple Linear Regression

We have a dataset for analyzing the success of startups based on several parameters.

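- Dummy variables for categorical variables
For example: if the State column takes the values NY and CA, we convert those values into separate 0/1 columns, one per state. We don't need to include both columns in the regression model, because the NY column alone captures all the variation: the CA column is always 1 minus the NY column. Including both alongside the intercept would make the columns perfectly collinear (the dummy variable trap).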
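
As a minimal sketch of the dummy-variable step, assuming pandas is available: the four-row `State` frame below is made up for illustration, and only the values NY and CA come from the notes.

```python
import pandas as pd

# Hypothetical toy data; only the State values NY and CA come from the notes.
df = pd.DataFrame({"State": ["NY", "CA", "CA", "NY"]})

# One 0/1 column per category: State_CA and State_NY.
all_dummies = pd.get_dummies(df["State"], prefix="State", dtype=int)

# drop_first=True drops the first category (CA, alphabetically), keeping State_NY.
dummies = pd.get_dummies(df["State"], prefix="State", drop_first=True, dtype=int)

# The dropped column carries no extra information: State_CA = 1 - State_NY.
print(all_dummies)
print(dummies)
```

Keeping one column fewer than the number of categories is the usual way to avoid feeding redundant columns to the model.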

Your columns:

| Dependent variable | Independent variables |
| --- | --- |
| Profit | R&D, Admin, Marketing, State |

Formula

`y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4`, where `x4` is the dummy variable for the categorical column.
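
To make the formula concrete, here is a small sketch that generates noise-free data from made-up coefficients and recovers them with least squares. The predictors, the coefficient values, and the names `b_true` and `b_hat` are all assumptions for illustration, not values from the startup dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical predictors: three numeric columns and one 0/1 dummy column.
x1, x2, x3 = rng.uniform(0, 100, size=(3, n))
x4 = rng.integers(0, 2, size=n)

# Made-up coefficients b0..b4, used only to generate the data.
b_true = np.array([10.0, 0.8, -0.3, 0.5, 4.0])

# The column of ones carries the intercept b0.
A = np.column_stack([np.ones(n), x1, x2, x3, x4])
y = A @ b_true  # y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4, with no noise

# Least squares recovers b0..b4 because y is an exact linear function of A.
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(b_hat, 3))
```

With real data the fit is not exact, and the estimated coefficients come with uncertainty, which is what the P-values in the methods below are about.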

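With a single independent variable, deciding on and building the regression model is easy. With many variables, we need to decide which ones to keep and which ones to eliminate.

Why?

1) Garbage in, garbage out: irrelevant variables make the model less reliable.

2) Explainability: the fewer variables a model has, the easier it is to explain what each one contributes.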

### 5 Methods of building models

1. All-in
2. Backward Elimination
3. Forward selection
4. Bidirectional elimination
5. Score comparison (all possible models)

Methods 2, 3, and 4 are forms of stepwise regression.

### 1. All-in

-   Throw in all variables. This can be justified by prior knowledge, for example if you have already built such a model
-   You have to use all variables
-   You are preparing for Backward Elimination

### 2. Backward Elimination

1. Select a significance level to stay in the model (for example, SL = 0.05).
2. Fit the full model with all possible predictors.
3. Consider the predictor (independent variable) with the highest P-value. If P > SL, go to step 4; otherwise, the model is final. In plain English: review the P-value of every independent variable to check whether it has a statistically significant effect on the dependent variable. If its P-value is greater than SL, it does not.
4. Remove that predictor.
5. Fit the model without this variable, then repeat from step 3.


For example, the number of windows in a house may have no real effect on its price. A P-value above SL for that predictor is how the model shows this.




In [8]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression

# Generate sample data
np.random.seed(42)
size = np.random.rand(20) * 2000  # House size in sq ft
location = np.random.randint(1, 10, 20)  # Location rating (1-10)
windows = np.random.randint(1, 20, 20)  # Number of windows
price = 5000 + 3 * size + 10000 * location + 50 * windows + np.random.normal(0, 10000, 20)  # House price

# Create DataFrame
df = pd.DataFrame({'size': size, 'location': location, 'windows': windows, 'price': price})
X = df[['size', 'location', 'windows']]
y = df['price']

# Backward Elimination Function
def backward_elimination(X, y, significance_level=0.05):
    selected_features = list(X.columns)

    while len(selected_features) > 0:
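        # The fitted model is not used below: f_regression runs a separate
        # univariate F-test per feature, so these P-values ignore the other predictors.
        model = LinearRegression().fit(X[selected_features], y)
        _, p_values = f_regression(X[selected_features], y)
        worst_p_value = max(p_values)  # Find highest P-value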
        
        if worst_p_value > significance_level:
            worst_feature = selected_features[p_values.argmax()]
            selected_features.remove(worst_feature)
            print(f"❌ Removed: {worst_feature}, P-value: {worst_p_value}")
        else:
            break  # Stop when all features are significant

    return selected_features

# Run Backward Elimination
selected_features = backward_elimination(X, y)
print("\nFinal Selected Features:", selected_features)


❌ Removed: size, P-value: 0.5682448463623814
❌ Removed: windows, P-value: 0.2221852348118141

Final Selected Features: ['location']
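
The cell above takes its P-values from `f_regression`, which tests each feature on its own. The steps listed earlier refer to the P-values of the fitted multiple regression, which change every time a predictor is removed. The following is a rough sketch of that variant, assuming NumPy and SciPy are available. The helper `ols_p_values`, the function `backward_elimination_partial`, and the demo data are hypothetical and are not part of the original notebook.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test P-value for each slope of an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])            # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    resid = y - A @ beta
    dof = n - k - 1                                 # residual degrees of freedom
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    return (2 * stats.t.sf(np.abs(t_stats), dof))[1:]  # drop the intercept

def backward_elimination_partial(X, y, names, sl=0.05):
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_p_values(X[:, keep], y)   # refit with the remaining predictors
        worst = int(np.argmax(p))
        if p[worst] <= sl:
            break                         # every remaining predictor is significant
        del keep[worst]                   # drop the least significant predictor
    return [names[i] for i in keep]

# Hypothetical demo data: one strong predictor and one unrelated column.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
unrelated = rng.normal(size=n)
y = 5 * signal + rng.normal(scale=0.5, size=n)
X = np.column_stack([signal, unrelated])

selected = backward_elimination_partial(X, y, ["signal", "unrelated"])
print(selected)
```

Because the model is refit after each removal, a predictor that looked weak next to a correlated neighbor can become significant once that neighbor is gone, which univariate tests cannot show.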


### 3. Forward Selection

1. Select a significance level to enter the model (for example, SL = 0.05).
2. Fit all simple regression models y ~ Xn and select the one with the lowest P-value.
3. Keep this variable and fit all possible models with one extra predictor added to the ones you have already selected.
4. Consider the predictor with the lowest P-value. If P < SL, go to step 3; otherwise, stop and keep the previous model.

In [4]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression

# Generate sample data
np.random.seed(42)
size = np.random.rand(20) * 2000  # House size in sq ft
location = np.random.randint(1, 10, 20)  # Location rating (1-10)
windows = np.random.randint(1, 20, 20)  # Number of windows
price = 5000 + 3 * size + 10000 * location + 50 * windows + np.random.normal(0, 10000, 20)  # House price

# Create DataFrame
df = pd.DataFrame({'size': size, 'location': location, 'windows': windows, 'price': price})
X = df[['size', 'location', 'windows']]
y = df['price']

# Forward Selection Function
def forward_selection(X, y):
    selected_features = []
    remaining_features = list(X.columns)
    while remaining_features:
        best_p_value = 1
        best_feature = None
        for feature in remaining_features:
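            # The fitted model is not used below: f_regression scores each feature
            # on its own, so this P-value does not depend on selected_features.
            model = LinearRegression().fit(X[selected_features + [feature]], y)
            _, p_values = f_regression(X[selected_features + [feature]], y)
            p_value = p_values[-1]  # Univariate P-value of the candidate feature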
            print(f"Feature: {feature}, P-value: {p_values[-1]}")  # Debugging line
            if p_value < 0.05 and p_value < best_p_value:
                best_p_value = p_value
                best_feature = feature
        
        if best_feature is None:
            break  # Stop when no feature meets the threshold
        
        selected_features.append(best_feature)
        remaining_features.remove(best_feature)
        print(f"Added: {best_feature}, P-value: {best_p_value}")

    return selected_features

# Run Forward Selection
selected_features = forward_selection(X, y)
print("\nFinal Selected Features:", selected_features)


Feature: size, P-value: 0.5682448463623814
Feature: location, P-value: 1.928929878013406e-09
Feature: windows, P-value: 0.2221852348118144
Added: location, P-value: 1.928929878013406e-09
Feature: size, P-value: 0.5682448463623814
Feature: windows, P-value: 0.2221852348118141

Final Selected Features: ['location']
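
In the output above, each candidate's P-value is the same in both rounds, because `f_regression` ignores the features already selected. Below is a sketch of forward selection driven by the candidate's P-value inside the joint fit, assuming NumPy and SciPy. The names `ols_p_values` and `forward_selection_partial` and the demo data are hypothetical.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test P-value for each slope of an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - k - 1
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    return (2 * stats.t.sf(np.abs(t_stats), dof))[1:]  # drop the intercept

def forward_selection_partial(X, y, names, sl=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # P-value of each candidate when added to the already selected columns.
        trials = {j: ols_p_values(X[:, selected + [j]], y)[-1] for j in remaining}
        best = min(trials, key=trials.get)
        if trials[best] >= sl:
            break                      # no candidate is significant: keep the previous model
        selected.append(best)
        remaining.remove(best)
    return [names[i] for i in selected]

# Hypothetical demo data: a strong predictor, a weaker one, and an unrelated column.
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
unrelated = rng.normal(size=n)
y = 5 * strong + 2 * weak + rng.normal(scale=0.5, size=n)
X = np.column_stack([strong, weak, unrelated])

selected = forward_selection_partial(X, y, ["strong", "weak", "unrelated"])
print(selected)
```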


### 4. Bidirectional Elimination

1. Select a significance level to enter and one to stay in the model, for example SLENTER = 0.05 and SLSTAY = 0.05.
2. Perform the next step of Forward Selection (new variables must have P < SLENTER).
3. Perform all steps of Backward Elimination (old variables must have P < SLSTAY), then return to step 2.
4. Exit when no new variables can be added and no old variables can be removed.

In [7]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression

# Generate sample data
np.random.seed(42)
size = np.random.rand(20) * 2000  # House size in sq ft
location = np.random.randint(1, 10, 20)  # Location rating (1-10)
windows = np.random.randint(1, 20, 20)  # Number of windows
price = 5000 + 3 * size + 10000 * location + 50 * windows + np.random.normal(0, 10000, 20)  # House price

# Create DataFrame
df = pd.DataFrame({'size': size, 'location': location, 'windows': windows, 'price': price})
X = df[['size', 'location', 'windows']]
y = df['price']

# Bidirectional Elimination Function
def bidirectional_elimination(X, y, significance_level=0.05):
    selected_features = []
    remaining_features = list(X.columns)

    while remaining_features:
        # Forward Step: Add best feature
        best_p_value = 1
        best_feature = None
        for feature in remaining_features:
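            # The fitted model is not used below: f_regression scores each feature
            # on its own, so this P-value does not depend on selected_features.
            model = LinearRegression().fit(X[selected_features + [feature]], y)
            _, p_values = f_regression(X[selected_features + [feature]], y)
            p_value = p_values[-1]  # Univariate P-value of the candidate feature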
            if p_value < significance_level and p_value < best_p_value:
                best_p_value = p_value
                best_feature = feature
        
        if best_feature is None:
            break  # Stop if no significant feature is found
        
        selected_features.append(best_feature)
        remaining_features.remove(best_feature)
        print(f"✅ Added: {best_feature}, P-value: {best_p_value}")

        # Backward Step: Remove least significant feature
        while selected_features:
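            # Univariate P-values do not change between steps, so with a single
            # threshold this backward step cannot remove a feature that was just added.
            model = LinearRegression().fit(X[selected_features], y)
            _, p_values = f_regression(X[selected_features], y)
            worst_p_value = max(p_values)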
            if worst_p_value > significance_level:
                worst_feature = selected_features[p_values.argmax()]
                selected_features.remove(worst_feature)
                remaining_features.append(worst_feature)
                print(f"❌ Removed: {worst_feature}, P-value: {worst_p_value}")
            else:
                break  # Stop when all selected features are significant

    return selected_features

# Run Bidirectional Elimination
selected_features = bidirectional_elimination(X, y)
print("\nFinal Selected Features:", selected_features)


✅ Added: location, P-value: 1.928929878013406e-09

Final Selected Features: ['location']
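
With univariate P-values the backward step in the cell above never fires, which is why only an "Added" line appears in the output. A sketch of the bidirectional procedure using P-values from the joint fit might look like the following, assuming NumPy and SciPy. The names `ols_p_values` and `bidirectional_partial`, the `max_steps` guard, and the demo data are assumptions added for illustration.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test P-value for each slope of an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - k - 1
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    return (2 * stats.t.sf(np.abs(t_stats), dof))[1:]  # drop the intercept

def bidirectional_partial(X, y, names, sl_enter=0.05, sl_stay=0.05, max_steps=100):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_steps):         # guard against add/remove cycles
        if not remaining:
            break
        # Forward step: try each remaining column next to the selected ones.
        trials = {j: ols_p_values(X[:, selected + [j]], y)[-1] for j in remaining}
        best = min(trials, key=trials.get)
        if trials[best] >= sl_enter:
            break                      # nothing new can enter
        selected.append(best)
        remaining.remove(best)
        # Backward step: drop selected columns that are no longer significant.
        while selected:
            p = ols_p_values(X[:, selected], y)
            worst = int(np.argmax(p))
            if p[worst] < sl_stay:
                break
            remaining.append(selected.pop(worst))
    return [names[i] for i in selected]

# Hypothetical demo data: a strong predictor, a weaker one, and an unrelated column.
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
unrelated = rng.normal(size=n)
y = 5 * strong + 2 * weak + rng.normal(scale=0.5, size=n)
X = np.column_stack([strong, weak, unrelated])

selected = bidirectional_partial(X, y, ["strong", "weak", "unrelated"])
print(selected)
```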


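### 5. All possible models

As the name suggests, build a model for every possible subset of the variables, score each one, and pick the best. For example, a dataset with 10 variables has 2^10 - 1 = 1,023 non-empty subsets (1,024 if you count the model with no predictors). This is very expensive and doesn't scale.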
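
A short sketch of the exhaustive search, assuming NumPy. The notes do not name a score, so adjusted R-squared is used here purely as an assumption, and the function names and demo data are hypothetical.

```python
import itertools
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - k - 1)

def best_subset(X, y, names):
    scores = {}
    cols = range(X.shape[1])
    # Every non-empty subset of columns: 2**k - 1 models for k columns.
    for size in range(1, X.shape[1] + 1):
        for subset in itertools.combinations(cols, size):
            scores[subset] = adjusted_r2(X[:, list(subset)], y)
    best = max(scores, key=scores.get)
    return [names[i] for i in best], scores

# Hypothetical demo data: one strong predictor and two unrelated columns.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
y = 5 * signal + rng.normal(scale=0.5, size=n)
X = np.column_stack([signal, noise_a, noise_b])

best, scores = best_subset(X, y, ["signal", "noise_a", "noise_b"])
print(len(scores), best)
```

The number of fits doubles with every extra column, which is the scaling problem described above.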