# Regression

# Introduction to Regression

Regression is the process of predicting a continuous value, such as CO2 emission, using other variables. It involves two types of variables:

- **Dependent Variable (Y)**: The target or outcome we want to predict.
- **Independent Variables (X)**: The factors that may influence the dependent variable.

A regression model relates the dependent variable (Y) to a function of the independent variables (X).

## Key Points in Regression

- The dependent variable must be continuous.
- Independent variables can be either categorical or continuous.

## Types of Regression Models

### Simple Regression

Uses one independent variable to estimate a dependent variable. It can be linear or non-linear.

- **Example**: Predicting CO2 emission using engine size.

### Multiple Regression

Involves more than one independent variable.

- **Example**: Predicting CO2 emission using engine size and the number of cylinders.

## Applications of Regression

Regression is used to estimate continuous values in various fields:

- **Sales Forecasting**: Predicting a salesperson's total yearly sales based on age, education, and years of experience.
- **Psychology**: Determining individual satisfaction from demographic and psychological factors.
- **Real Estate**: Predicting house prices based on size, number of bedrooms, etc.
- **Employment Income**: Estimating income based on hours of work, education, occupation, sex, age, years of experience, and more.


# Simple Linear Regression

## Linear Regression Explained

This video provides a high-level introduction to linear regression, a statistical method for modeling the relationship between two or more variables. Here's a breakdown of the key concepts:

**What is linear regression?**

* It's a technique to predict a continuous value (dependent variable) based on another variable (independent variable).
* It approximates a linear relationship between the variables using a straight line.

**Types of linear regression:**

* **Simple linear regression:** Uses one independent variable. (This video focuses on this type.)
* **Multiple linear regression:** Uses two or more independent variables.

**Understanding the model:**

* The fitted line represents the model.
* The equation for the line is `ŷ = θ₀ + θ₁x₁`, where:
    * `ŷ` (y-hat) is the predicted value of the dependent variable.
    * `θ₀` (theta-nought) is the intercept: the value of `ŷ` where the line crosses the y-axis, that is, at `x₁ = 0`.
    * `θ₁` (theta-one) is the slope of the line.
    * `x₁` is the independent variable.

**Finding the best fit line:**

* Linear regression aims to minimize the mean squared error (MSE) between the actual data points and the fitted line.
* MSE represents the average squared difference between predictions and actual values.
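* For simple linear regression, closed-form formulas give the values of `θ₀` and `θ₁` that minimize MSE: `θ₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²` and `θ₀ = ȳ - θ₁x̄`, where `x̄` and `ȳ` are the means of the independent and dependent variables.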

**Using the model for prediction:**

* Once you have the parameters `θ₀` and `θ₁`, you can plug them into the equation to predict the dependent variable for a new independent variable value.
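
The fitting and prediction steps above can be sketched in a few lines of Python. This is a minimal illustration using the standard least-squares formulas for a single feature; the function names and the engine-size and CO2 numbers are made up for the example and do not come from the course dataset.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (theta0, theta1) that minimize MSE for y ~ theta0 + theta1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-deviation of x and y divided by the squared deviation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    theta1 = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar).
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


def predict(theta0, theta1, x_new):
    """Plug a new x value into the fitted line."""
    return theta0 + theta1 * x_new


# Hypothetical data: engine size (x) and CO2 emission (y), exactly on a line.
engine_size = [1.0, 2.0, 3.0, 4.0, 5.0]
co2_emission = [120.0, 160.0, 200.0, 240.0, 280.0]

theta0, theta1 = fit_simple_linear_regression(engine_size, co2_emission)
prediction = predict(theta0, theta1, 2.5)
print(f"intercept={theta0:.1f}, slope={theta1:.1f}, prediction={prediction:.1f}")
```

Because the toy data lie exactly on a line, the fitted parameters reproduce that line; with real data the line would only approximate the points.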

**Advantages of linear regression:**

* Easy to understand and interpret
* Fast to compute
* Doesn't require complex parameter tuning

**Additional notes:**

* The video notes that memorizing the formulas is not necessary, but understanding the concepts behind them is beneficial.
* There are libraries available in programming languages like Python and R to perform linear regression without manual calculations.


## Model Evaluation in Regression

The goal of regression is to predict unknown cases accurately, so a model must be evaluated after it is built. This section covers three evaluation approaches, along with their pros and cons: training and testing on the same dataset, train/test split, and K-fold cross-validation. Accuracy metrics for regression models are introduced afterwards.

### Train and Test on the Same Dataset
- **Approach**: Train the model on the entire dataset and test it using a portion of the same dataset.
- **Steps**:
  1. Use the entire dataset for training.
  2. Select a subset as the test set (e.g., rows 6-9 without labels).
  3. Predict the target values using the model and compare them with the actual values to calculate accuracy.
- **Pros**: Simple to implement.
- **Cons**: Likely to result in high training accuracy but low out-of-sample accuracy due to overfitting, meaning the model performs well on the training data but poorly on new, unseen data.

### Train/Test Split
- **Approach**: Split the dataset into separate training and testing sets.
- **Steps**:
  1. Divide the dataset (e.g., rows 0-5 for training and rows 6-9 for testing).
  2. Train the model on the training set.
  3. Predict the target values for the test set and compare them with the actual values.
- **Pros**: Provides a more accurate evaluation of out-of-sample accuracy since the test set is not used in training.
- **Cons**: The estimate depends heavily on which rows end up in the training and test sets, so results can vary significantly from one split to another.
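
As a rough sketch of the train/test split steps, the snippet below uses rows 0-5 for training and rows 6-9 for testing, mirroring the example above. The ten data rows, the helper names, and the choice of mean absolute error as the score are assumptions made for illustration. In practice the rows are usually shuffled before splitting.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y ~ theta0 + theta1 * x for one feature."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    theta1 = sxy / sxx
    return y_bar - theta1 * x_bar, theta1


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical dataset of 10 rows (engine size, CO2 emission).
x = [1.0, 1.5, 2.0, 2.4, 3.0, 3.5, 3.7, 4.2, 4.7, 5.0]
y = [118.0, 142.0, 161.0, 175.0, 198.0, 221.0, 230.0, 246.0, 270.0, 279.0]

# Step 1: rows 0-5 for training, rows 6-9 for testing.
x_train, y_train = x[:6], y[:6]
x_test, y_test = x[6:], y[6:]

# Step 2: train on the training set only.
theta0, theta1 = fit_line(x_train, y_train)

# Step 3: predict the held-out rows and compare with the actual values.
train_mae = mean_absolute_error(y_train, [theta0 + theta1 * v for v in x_train])
test_mae = mean_absolute_error(y_test, [theta0 + theta1 * v for v in x_test])
print(f"train MAE={train_mae:.2f}, test MAE={test_mae:.2f}")
```

The test error is the out-of-sample estimate, because the model never sees those rows during fitting.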

### K-Fold Cross-Validation
- **Approach**: Divide the dataset into K folds and perform multiple train/test splits.
- **Steps**:
  1. Split the dataset into K equal parts.
  2. Use each part as a test set while the remaining parts form the training set, repeating K times.
  3. Average the results from each fold.
- **Pros**: Reduces the variability of the train/test split approach, leading to a more reliable estimate of out-of-sample accuracy.
- **Cons**: More computationally intensive.
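
The K-fold procedure can be sketched without any library as follows. This is an illustrative version that uses contiguous, unshuffled folds and mean absolute error as the per-fold score; the function names and the data are hypothetical.

```python
from statistics import mean


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def fit_line(x, y):
    """Least-squares fit of y ~ theta0 + theta1 * x for one feature."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    theta1 = sxy / sxx
    return y_bar - theta1 * x_bar, theta1


def cross_validated_mae(x, y, k):
    """Return (average MAE over folds, list of per-fold MAEs)."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(x), k):
        theta0, theta1 = fit_line([x[i] for i in train_idx],
                                  [y[i] for i in train_idx])
        errors = [abs(y[i] - (theta0 + theta1 * x[i])) for i in test_idx]
        fold_scores.append(mean(errors))
    return mean(fold_scores), fold_scores


# Hypothetical dataset of 10 rows, evaluated with K = 5.
x = [1.0, 1.5, 2.0, 2.4, 3.0, 3.5, 3.7, 4.2, 4.7, 5.0]
y = [118.0, 142.0, 161.0, 175.0, 198.0, 221.0, 230.0, 246.0, 270.0, 279.0]
average_mae, fold_scores = cross_validated_mae(x, y, k=5)
print(f"average MAE over {len(fold_scores)} folds: {average_mae:.2f}")
```

Every row is used for testing exactly once, which is why the averaged score is less sensitive to any single split.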

In summary, while training and testing on the same dataset is straightforward, it can lead to overfitting. The train/test split approach offers a better assessment of a model's performance on new data, but can still be biased by the specific split. K-fold cross-validation provides a more robust evaluation by averaging results across multiple splits, thus mitigating some of the variability and dependency issues.

## Model Evaluation Metrics in Regression Analysis

This video provides an introduction to commonly used accuracy metrics for evaluating regression models.

**Key Points:**

* **Evaluation metrics** quantify the performance of a regression model by comparing predicted values to actual values.
* **Error**, in the context of regression, refers to the difference between an individual data point and the corresponding value predicted by the model.

**Common Regression Evaluation Metrics:**

* **Mean Absolute Error (MAE):** The average of the absolute error terms, representing the average magnitude of the errors. (Equation: `MAE = (1/n) Σ |yᵢ - ŷᵢ|`, where `yᵢ` is the actual value, `ŷᵢ` is the predicted value, and `n` is the number of data points.)
* **Mean Squared Error (MSE):** The average of the squared error terms, placing greater emphasis on larger errors due to squaring. (Equation: `MSE = (1/n) Σ (yᵢ - ŷᵢ)²`)
* **Root Mean Squared Error (RMSE):** The square root of MSE, making it interpretable in the same units as the target variable. (Equation: `RMSE = √MSE`)
* **Relative Absolute Error (RAE):** A metric expressed as a ratio that normalizes the absolute error. It measures the average absolute difference between the actual and predicted values relative to the average absolute difference between the actual values and their mean. (Equation: `RAE = Σ |yᵢ - ŷᵢ| / Σ |yᵢ - ȳ|`, where `ȳ` is the mean of the actual values.)
* **Residual Sum of Squares (RSS):** Calculates the sum of the squared differences between actual and predicted values. (Equation: `RSS = Σ (yᵢ - ŷᵢ)²`)
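
To make the metrics concrete, here is a small sketch that computes all five from a list of actual values and a list of predicted values. The function name and the four data points are invented for illustration.

```python
from math import sqrt
from statistics import mean


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, RAE and RSS for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    y_mean = mean(y_true)

    total_abs_error = sum(abs(e) for e in errors)
    rss = sum(e ** 2 for e in errors)        # residual sum of squares
    mse = rss / n                            # mean squared error
    return {
        "MAE": total_abs_error / n,
        "MSE": mse,
        "RMSE": sqrt(mse),
        # Baseline in the denominator: always predicting the mean of y_true.
        "RAE": total_abs_error / sum(abs(t - y_mean) for t in y_true),
        "RSS": rss,
    }


# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these numbers the errors are 1, 0, -1 and -2, so MAE is 1.0, RSS is 6.0, MSE is 1.5, and RAE is 0.5. An RAE below 1 means the model does better than always predicting the mean.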

**Distinction Between RAE and RSS:**

* **RAE:** Focuses on the average magnitude of errors relative to a baseline model (average error of a simple predictor using the mean as the prediction for all data points).
* **RSS:** Represents the total squared error between the actual and predicted values. It is used internally by regression algorithms to fit the model but not directly for evaluating model performance. 

**Selection of Evaluation Metrics:**

The choice of an appropriate metric depends on factors such as the specific regression model, data characteristics, and the domain knowledge of the problem being addressed.

**Note:** A more in-depth exploration of metric selection criteria is beyond the scope of this course.

## Multiple Linear Regression
This video offers a comprehensive introduction to multiple linear regression, a cornerstone statistical technique used to model the connection between a continuous dependent variable and multiple independent variables.

**Key Principles:**

* **Multiple linear regression** extends simple linear regression by enabling the investigation of how multiple independent variables collectively influence a dependent variable.
* This approach is particularly valuable in:
    * Quantifying the strength of the influence exerted by independent variables on the dependent variable.
    * Predicting the effect of changes in independent variables on the values of the dependent variable.
* The model expresses the target value (Y) as a linear combination of the independent variables (X). Mathematically, this relationship can be represented as: `ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ`
    * `ŷ` denotes the predicted target value.
    * θ represents the parameters (coefficients) estimated through the modeling process.
    * x represents the independent variables (features) used for prediction.
* The central objective is to identify the optimal hyperplane (in higher dimensions) that minimizes the mean squared error (MSE) between the predicted and actual values.
* Common methods for estimating the model parameters include:
    * Ordinary least squares (OLS): This method employs matrix operations to determine the optimal theta values. However, it can be computationally expensive for large datasets.
    * Optimization algorithms: These iterative algorithms, such as gradient descent, efficiently minimize the error on the training data for large datasets, making them a preferred choice.
* Once the parameters are estimated, prediction involves substituting the input values into the established linear model equation.
* While multiple linear regression offers the capability to assess the relative importance of predictors, it's crucial to consider the following:
    * Including an excessive number of independent variables can lead to overfitting, resulting in a model that is overly complex and lacks generalizability.
    * Categorical variables can be integrated into the model by converting them into numerical representations (e.g., dummy variables).
    * A fundamental assumption of multiple linear regression is the presence of a linear relationship between the dependent variable and each independent variable. Scatter plots can be used to visually assess linearity. If the observed relationship deviates from linearity, non-linear regression models are more appropriate.
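
As an illustration of the ordinary least squares approach, the sketch below builds the normal equations `(XᵀX)θ = Xᵀy` and solves them with a small Gaussian elimination routine written in plain Python. The helper names, the two features, and the data (generated from an invented model, `100 + 20*engine_size + 10*cylinders`) are assumptions for the example. In practice a numerical library would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_linear_regression(features, targets):
    """Return [theta0, theta1, ..., thetan] from the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept term
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(theta, feature_row):
    """Substitute the input values into the fitted linear equation."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], feature_row))


# Hypothetical rows: (engine_size, cylinders) -> CO2 emission.
features = [(1.5, 4), (2.0, 4), (2.4, 4), (3.0, 6), (3.5, 6), (4.7, 8), (5.0, 8)]
targets = [170.0, 180.0, 188.0, 220.0, 230.0, 274.0, 280.0]

theta = fit_multiple_linear_regression(features, targets)
new_prediction = predict(theta, (3.0, 4))
print([round(t, 3) for t in theta], round(new_prediction, 3))
```

Since the toy targets contain no noise, the fit recovers the coefficients used to generate them. The elimination step also shows why the matrix approach grows costly as the number of features increases.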


# QUIZ
### Solution

#### Question 1
**Which of the following is the meaning of "Out of Sample Accuracy" in the context of evaluation of models?**

**Answer:** "Out of Sample Accuracy" is the percentage of correct predictions that the model makes on data that the model has NOT been trained on.

#### Question 2
**When should we use Multiple Linear Regression? (Select two)**

**Answers:**
- When we would like to predict impacts of changes in independent variables on a dependent variable.
- When we would like to identify the strength of the effect that the independent variables have on a dependent variable.

#### Question 3
**Which sentence is TRUE about linear regression?**

**Answer:** A linear relationship is necessary between the independent variables and the dependent variable.

### Solution

#### Question 1
**What are the requirements for independent and dependent variables in regression?**

**Answer:** Independent variables can be either categorical or continuous. Dependent variables must be continuous.

#### Question 2
**The key difference between simple and multiple regression is:**

**Answer:** To estimate a single dependent variable, simple regression uses one independent variable whereas multiple regression uses multiple.

#### Question 3
**Recall that we tried to predict CO2 emission with car information. Say that now we can describe the relationship as: CO2_emission = 130 - 2.4*cylinders + 8.3*fuel_consumption. What is TRUE of this relationship?**

**Answer:** When “cylinders” decreases by 1 while fuel_consumption remains constant, CO2_emission increases by 2.4 units.
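
A quick numerical check of this answer, using the equation from the question. The specific input values (6 and 5 cylinders, fuel consumption of 10.0) are arbitrary choices for the illustration.

```python
def co2_emission(cylinders, fuel_consumption):
    """Relationship given in the question."""
    return 130 - 2.4 * cylinders + 8.3 * fuel_consumption


# Hold fuel_consumption constant and reduce cylinders by 1.
before = co2_emission(cylinders=6, fuel_consumption=10.0)
after = co2_emission(cylinders=5, fuel_consumption=10.0)
change = after - before
print(f"change in CO2_emission: {change:.1f}")
```

The coefficient of `cylinders` is negative (-2.4), so lowering `cylinders` by 1 raises the prediction by 2.4, whatever value `fuel_consumption` is held at.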

#### Question 4
**What could be the cause of a model yielding high training accuracy and low out-of-sample accuracy?**

**Answer:** The model is trained on a small training set, so it is overfitting.

#### Question 5
**Multiple Linear Regression is appropriate for:**

**Answer:** Predicting tomorrow’s rainfall amount based on the wind speed and temperature.