# Linear Regression Project
This notebook outlines the steps to preprocess data, build regression models (OLS, Ridge, Lasso), and evaluate their performance. Each step is explained in detail.

## Step 1: Load the Data
Here we load the dataset and inspect its structure to understand the features and target variable. We'll check for missing values and explore the data types.

In [None]:
import pandas as pd
from IPython.display import display

# Load the dataset
data = pd.read_json('path/to/your/dataset.json')
display(data.head())
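# info() prints its own summary (dtypes, non-null counts) and returns None,
# so it is called directly instead of being wrapped in display()
data.info()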
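
The missing-value and data-type checks mentioned above can be sketched as follows. The toy frame and its column names (`sqft`, `city`, `price`) are hypothetical stand-ins for the real dataset; with the notebook's `data`, the same two calls should apply unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loaded dataset.
toy = pd.DataFrame({
    "sqft": [850.0, 1200.0, np.nan, 1500.0],
    "city": ["A", "B", "B", None],
    "price": [200.0, 310.0, 280.0, 400.0],
})

missing_counts = toy.isna().sum()  # number of missing values per column
column_types = toy.dtypes          # data type of each column

print(missing_counts)
print(column_types)
```

Columns with a non-zero count are the ones affected by the `dropna()` call in Step 2.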

## Step 2: Preprocess the Data
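Before modeling, we clean the data by dropping rows with missing values and one-hot encoding categorical variables. Scaling numerical features is not done in this cell. It matters for Ridge and Lasso, whose penalties depend on the magnitude of the coefficients, and a scaler should be fit on the training split only so that no test-set statistics leak into training.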

In [None]:
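# Example preprocessing
# Drop rows that contain any missing value
data = data.dropna()

# One-hot encode categorical variables (if any); drop_first=True drops one
# level per variable to avoid redundant, perfectly collinear columns
data = pd.get_dummies(data, drop_first=True)

# Separate predictors and response
# ("target_column" is a placeholder for the name of your target variable)
X = data.drop(columns=["target_column"])
y = data["target_column"]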
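
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies(..., drop_first=True)` produces. The toy frame and its columns (`sqft`, `city`) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy frame: one numeric and one categorical predictor.
toy = pd.DataFrame({
    "sqft": [850, 1200, 1000, 1500],
    "city": ["A", "B", "C", "A"],
})

encoded = pd.get_dummies(toy, drop_first=True)

# "city" has three levels (A, B, C). With drop_first=True the first level (A)
# is dropped, so a row with city == "A" has both indicator columns set to False.
print(list(encoded.columns))
print(encoded)
```

Numeric columns pass through unchanged; only the categorical column is replaced by indicator columns.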

## Step 3: Split the Data
We hold out 20% of the rows as a test set so that model performance is evaluated on data the models have not seen. Setting `random_state=42` makes the split reproducible.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
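
Scaling, mentioned in Step 2, is usually applied after the split so that the scaler only sees training rows. The following is a minimal sketch under assumptions: the synthetic data and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical, and `StandardScaler` is one common choice, not something the notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test rows
```

The scaled arrays could then be passed to the modeling step in place of the unscaled ones.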

## Step 4: Define the Regression Function
Here we define a reusable function to fit and evaluate different types of regression models: OLS, Ridge, and Lasso. This function automates the modeling process.

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

def regression_pipeline(X_train, X_test, y_train, y_test, model_type='ols', **kwargs):
    """
    Fit a regression model on the training data and evaluate it on the test data.
    Args:
        X_train: Training predictors
        X_test: Testing predictors
        y_train: Training response
        y_test: Testing response
        model_type: 'ols', 'ridge', or 'lasso'
        **kwargs: Keyword arguments passed to the model constructor (e.g. alpha for Ridge and Lasso)
    Returns:
        model: Fitted model
        metrics: Dictionary with the test-set 'MSE' and 'R2'
    """
    # Initialize model
    if model_type == 'ols':
        model = LinearRegression(**kwargs)
    elif model_type == 'ridge':
        model = Ridge(**kwargs)
    elif model_type == 'lasso':
        model = Lasso(**kwargs)
    else:
        raise ValueError("Invalid model type. Choose 'ols', 'ridge', or 'lasso'.")

    # Train model
    model.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = model.predict(X_test)
    metrics = {
        'MSE': mean_squared_error(y_test, y_pred),
        'R2': r2_score(y_test, y_pred)
    }

    return model, metrics

## Step 5: Train and Evaluate Models
We train the three regression models (OLS, Ridge, Lasso) on the same training split and compare their test-set MSE and $R^2$. The `alpha` values used below are illustrative and have not been tuned.
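
For reference, with $n$ test observations $y_i$, predictions $\hat{y}_i$, and test-set mean $\bar{y}$, the two metrics are defined as:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

Lower MSE and higher $R^2$ are better. $R^2$ is at most 1 and can be negative on test data when a model predicts worse than the mean.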

In [None]:
# Train and evaluate OLS
ols_model, ols_metrics = regression_pipeline(X_train, X_test, y_train, y_test, model_type='ols')
print("OLS Metrics:", ols_metrics)

# Train and evaluate Ridge
ridge_model, ridge_metrics = regression_pipeline(X_train, X_test, y_train, y_test, model_type='ridge', alpha=1.0)
print("Ridge Metrics:", ridge_metrics)

# Train and evaluate Lasso
lasso_model, lasso_metrics = regression_pipeline(X_train, X_test, y_train, y_test, model_type='lasso', alpha=0.1)
print("Lasso Metrics:", lasso_metrics)
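
Printed dictionaries are hard to compare at a glance, so one option is to collect the metrics in a table. The sketch below is self-contained and runs on synthetic data from `make_regression`; the names `X_demo`, `y_demo`, and `comparison` are hypothetical, and the numbers it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

# Fit each model and record its test-set metrics.
rows = {}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    rows[name] = {
        "MSE": mean_squared_error(y_te, y_pred),
        "R2": r2_score(y_te, y_pred),
    }

# One row per model, sorted so the lowest test MSE comes first.
comparison = pd.DataFrame.from_dict(rows, orient="index").sort_values("MSE")
print(comparison)
```

In the notebook itself, the same table could be built from the existing results with `pd.DataFrame({"OLS": ols_metrics, "Ridge": ridge_metrics, "Lasso": lasso_metrics}).T`.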

## Step 6: Visualize Results
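We plot the OLS residuals against the predicted values to check the model's assumptions. The points should scatter evenly around zero: a curved pattern suggests non-linearity, and a funnel shape suggests non-constant variance.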

In [None]:
import matplotlib.pyplot as plt

# Residual plot for OLS
y_pred_ols = ols_model.predict(X_test)
residuals_ols = y_test - y_pred_ols

plt.scatter(y_pred_ols, residuals_ols)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("OLS Residuals")
plt.show()
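
The same residual check can be repeated for all three models side by side. This is a sketch on synthetic data, not the notebook's dataset; `X_demo`, `y_demo`, and the `models` dictionary are hypothetical names introduced for the example.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

# One panel per model, sharing the y-axis so residual spreads are comparable.
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    ax.scatter(y_pred, y_te - y_pred)
    ax.axhline(y=0, color="r", linestyle="--")
    ax.set_xlabel("Predicted Values")
    ax.set_title(f"{name} Residuals")
axes[0].set_ylabel("Residuals")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, add `plt.show()` after `fig.tight_layout()`.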

## Step 7: Conclusion
Summarize the findings: report each model's test-set metrics, state which model performed best, and note whether the residual plot supports the linear-model assumptions.