In [1]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  

# Statistical and visualization libraries
import statsmodels.api as sm  
import seaborn as sns

# For reproducibility: seed a NumPy generator instead of importing the
# private helper pandas.core.common.random_state (the seed 42 is arbitrary)
rng = np.random.default_rng(42)

# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Set the default seaborn style for plots
sns.set()

In [2]:
# load data from local storage
data = pd.read_csv('PLTR.csv')
data.head() 

Unnamed: 0,Date,Close/Last,Volume,Open,High,Low
0,09/02/2025,$157.09,65434970,$151.20,$158.39,$150.28
1,08/29/2025,$156.71,45270500,$156.98,$158.42,$153.00
2,08/28/2025,$158.12,57885240,$157.63,$158.23,$152.55
3,08/27/2025,$156.72,76380550,$162.32,$162.40,$155.9801
4,08/26/2025,$160.87,86573720,$155.39,$162.13,$154.57
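
The price columns in this preview are strings with a leading `$`, so they need converting before any numeric work. A minimal sketch, assuming the column names shown above; the inline `sample_csv` is a stand-in for `PLTR.csv` and leaves out the `Unnamed: 0` index column:

```python
import io
import pandas as pd

# Inline stand-in for PLTR.csv: the five rows from the preview above.
sample_csv = """Date,Close/Last,Volume,Open,High,Low
09/02/2025,$157.09,65434970,$151.20,$158.39,$150.28
08/29/2025,$156.71,45270500,$156.98,$158.42,$153.00
08/28/2025,$158.12,57885240,$157.63,$158.23,$152.55
08/27/2025,$156.72,76380550,$162.32,$162.40,$155.9801
08/26/2025,$160.87,86573720,$155.39,$162.13,$154.57
"""
prices = pd.read_csv(io.StringIO(sample_csv))

# Strip the dollar sign and convert each price column to float.
for col in ["Close/Last", "Open", "High", "Low"]:
    prices[col] = prices[col].str.replace("$", "", regex=False).astype(float)

# Parse the dates and sort oldest-first so row order matches time order.
prices["Date"] = pd.to_datetime(prices["Date"], format="%m/%d/%Y")
prices = prices.sort_values("Date").reset_index(drop=True)

print(prices.dtypes)
```

Sorting oldest-first matters for any later train/validation split by date, since the raw file lists the newest row first.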


In [None]:
# This is a typical setup for a Jupyter notebook

---

# Machine Learning

- Machine learning is learning (training) from, and inferring (testing) on, a set of attributes (features, independent variables) and targets (labels, dependent variables).

- Machines can easily learn a function such as y = f(x) = ax + b.


# Linear Model
1. The most practical and straightforward model.
2. Linear relations between variables, meaning no variable-variable multiplication. Only two operations are allowed:
    a. Addition of variables
    b. Multiplication of a variable by a constant
3. Non-example:
   y = ax^2 + bx + c is not linear in x; any exponent other than 1 makes the relationship non-linear.

# Finding the best method 

 - The best line is the one that passes as close as possible to all the points.
 
    How do we evaluate this?

   Mean Squared Loss:

       take the residuals (the distance between each point and the line), square them, and average them. The best line minimizes this mean. 


   A loss function can have three kinds of critical points:
   
         - saddle point - lower to the left, higher to the right (or the reverse)

         - local maximum - lower to the left and lower to the right

         - local minimum - higher to the left and higher to the right

The relationship to derivatives: they tell us which of these points is the one we need, the minimum. 

The 1st and 2nd derivatives are needed to find and classify each point (saddle, max, min): the 1st derivative is zero at every critical point, and the sign of the 2nd derivative separates a minimum (positive) from a maximum (negative). 
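
The classification of critical points can be sketched with the second-derivative test. This is a minimal illustration, not taken from the class notes: the functions `f(x) = x**3 - 3*x` and `g(x) = x**3` and the helper `classify` are hypothetical examples, chosen because their derivatives are easy to write by hand.

```python
def classify(slope, curvature, tol=1e-12):
    """Classify a point from its 1st derivative (slope) and 2nd derivative (curvature)."""
    if abs(slope) > tol:
        return "not a critical point"
    if curvature > tol:
        return "local minimum"
    if curvature < -tol:
        return "local maximum"
    return "inconclusive (possible saddle point)"

# f(x) = x**3 - 3*x has f'(x) = 3*x**2 - 3 and f''(x) = 6*x.
def f_slope(x):
    return 3 * x**2 - 3

def f_curvature(x):
    return 6 * x

# g(x) = x**3 has g'(x) = 3*x**2 and g''(x) = 6*x:
# it is flat at x = 0 but is neither a maximum nor a minimum there.
results = {
    "f at x=-1": classify(f_slope(-1.0), f_curvature(-1.0)),
    "f at x=+1": classify(f_slope(1.0), f_curvature(1.0)),
    "g at x=0": classify(3 * 0.0**2, 6 * 0.0),
}
for name, kind in results.items():
    print(name, "->", kind)
```

When the second derivative is also zero, the test cannot decide on its own, which is why the helper reports that case as inconclusive rather than calling it a saddle point outright.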

End of 09/03/2025 class notes

---


# Personal notes on the book [A First Course in Machine Learning Chapter 1](https://bibliu.com/app/#/view/books/9781498738569/epub/OPS/xhtml/12_Chapter01.html#page_18)


# Chapter 1: Linear Modelling - A Least Squares Approach

## Overview

Linear modeling represents one of the most fundamental approaches in machine learning for discovering functional relationships between input attributes and target responses. The chapter demonstrates these concepts through a compelling example: predicting Olympic 100-meter sprint winning times based on the year of competition. This practical application illustrates how machine learning can extract meaningful patterns from historical data to make future predictions.

The core premise involves learning a mathematical function that maps input variables (like Olympic year) to output variables (like winning time). Once this relationship is established, the model can predict outcomes for previously unseen inputs. For instance, using Olympic data from 1896 to 2008, we can attempt to predict winning times for the 2012 and 2016 Olympics.

## The Linear Model Framework

Linear modeling assumes that the relationship between inputs and outputs can be adequately represented by a straight line. In its simplest form, this relationship follows the familiar equation t = w₀ + w₁x, where t represents the target value (winning time), x represents the input attribute (Olympic year), and w₀ and w₁ are parameters that define the line's characteristics.

Consider the Olympic example: the data shows that winning times generally decrease over the years, suggesting a negative linear relationship. The parameter w₁ (gradient) captures this downward trend, while w₀ (intercept) represents the theoretical winning time at year zero. Through careful analysis of the Olympic data, the chapter derives the optimal linear model as f(x; w₀, w₁) = 36.416 - 0.013x, indicating that winning times decrease by approximately 0.013 seconds per year on average.

| Parameter | Value | Interpretation |
|-----------|-------|----------------|
| w₀ (intercept) | 36.416 | Theoretical winning time at year 0 |
| w₁ (gradient) | -0.013 | Average decrease in seconds per year |

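The model makes specific predictions: for 2012, it predicts a winning time of 9.595 seconds, and for 2016, 9.541 seconds. (These values presumably come from the unrounded parameters; plugging the rounded coefficients above into the formula gives about 10.26 seconds for 2012.) Such precise predictions highlight both the power and limitations of linear modeling, as real-world events rarely conform to such mathematical precision.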

## Loss Functions and Optimization

The quality of any model depends on how well it fits the observed data. The squared loss function provides an objective measure of model performance by calculating the squared difference between actual and predicted values for each data point. For the nth Olympic year, this loss is expressed as Lₙ = (tₙ - f(xₙ; w₀, w₁))², where tₙ is the actual winning time and f(xₙ; w₀, w₁) is the predicted time.

Squaring the differences serves multiple purposes: it prevents positive and negative errors from canceling each other out, it penalizes larger errors more heavily than smaller ones, and it creates a mathematically tractable optimization problem. The total model performance is measured by averaging these individual losses across all data points: L = (1/N) Σ Lₙ.

The least squares method finds the parameter values that minimize this average loss. This optimization process involves taking partial derivatives of the loss function with respect to each parameter, setting these derivatives to zero, and solving the resulting system of equations. For the Olympic data, this mathematical procedure yields the optimal parameters mentioned above.
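
Written out for the model f(x; w₀, w₁) = w₀ + w₁x, the procedure gives the closed-form expressions below. This is the standard derivation, sketched here with bars denoting averages over the N data points (for example, \(\overline{xt}\) is the average of xₙtₙ):

\[
\frac{\partial L}{\partial w_0} = -\frac{2}{N}\sum_{n=1}^{N}\left(t_n - w_0 - w_1 x_n\right) = 0
\quad\Rightarrow\quad
\hat{w}_0 = \bar{t} - \hat{w}_1\,\bar{x}
\]

\[
\frac{\partial L}{\partial w_1} = -\frac{2}{N}\sum_{n=1}^{N} x_n\left(t_n - w_0 - w_1 x_n\right) = 0
\quad\Rightarrow\quad
\hat{w}_1 = \frac{\overline{xt} - \bar{x}\,\bar{t}}{\overline{x^2} - \bar{x}^2}
\]

The second result follows by substituting the expression for ŵ₀ into the second equation and solving for w₁.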

## Worked Example with Synthetic Data

To illustrate the mechanics of least squares fitting, consider a simple synthetic dataset with three data points:

| n | xₙ | tₙ | xₙtₙ | xₙ² |
|---|----|----|------|-----|
| 1 | 1  | 4.8| 4.8  | 1   |
| 2 | 3  |11.3| 33.9 | 9   |
| 3 | 5  |17.2| 86   | 25  |
|Avg| 3  |11.1| 41.57|11.67|

Using the derived formulas, we first calculate w₁ = (41.57 - 3 × 11.1)/(11.67 - 3 × 3) = 8.27/2.67 = 3.1. Then w₀ = 11.1 - 3.1 × 3 = 1.8. The resulting model f(x; w₀, w₁) = 1.8 + 3.1x demonstrates how the least squares method produces a line that balances proximity to all data points rather than passing through any specific point perfectly.
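
The arithmetic above can be checked with a few lines of plain Python. This is a small sketch using only the three data points from the table; the variable names are chosen here for illustration.

```python
# The three synthetic data points from the table above.
x = [1.0, 3.0, 5.0]
t = [4.8, 11.3, 17.2]
N = len(x)

# Averages used by the closed-form least squares formulas.
x_bar = sum(x) / N
t_bar = sum(t) / N
xt_bar = sum(xi * ti for xi, ti in zip(x, t)) / N
xx_bar = sum(xi * xi for xi in x) / N

# Gradient first, then intercept (the intercept depends on the gradient).
w1 = (xt_bar - x_bar * t_bar) / (xx_bar - x_bar**2)
w0 = t_bar - w1 * x_bar

print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")
```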

## Vector and Matrix Formulation

As models become more complex, involving multiple input attributes or higher-order terms, the scalar approach becomes unwieldy. Vector and matrix notation provides an elegant solution that scales to arbitrary complexity. Instead of treating parameters individually, we combine them into a parameter vector w = [w₀, w₁]ᵀ and augment each scalar input xₙ with a constant term to form the vector **x**ₙ = [1, xₙ]ᵀ.

This reformulation transforms the model into f(xₙ; w) = wᵀ**x**ₙ, a simple dot product. The loss function becomes L = (1/N)(t - Xw)ᵀ(t - Xw), where X is the matrix whose nth row is **x**ₙᵀ and t is a vector of all target values. This matrix formulation leads to the general solution ŵ = (XᵀX)⁻¹Xᵀt, which applies regardless of the number of parameters or complexity of the input transformations.
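
As a sketch of the matrix solution, the snippet below applies it to the three synthetic points from the worked example, assuming NumPy is available. It solves the linear system (XᵀX)w = Xᵀt with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the chapter prescribes.

```python
import numpy as np

# Synthetic data from the worked example.
x = np.array([1.0, 3.0, 5.0])
t = np.array([4.8, 11.3, 17.2])

# Design matrix: one row [1, x_n] per data point.
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) w = X^T t instead of computing the inverse directly.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)

# Average squared loss L = (1/N)(t - Xw)^T (t - Xw) at the solution.
residual = t - X @ w_hat
loss = residual @ residual / len(t)

print("w_hat =", w_hat, "loss =", loss)
```

The result should agree with the scalar formulas, since both solve the same minimization problem.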

The power of this approach becomes apparent when extending to multiple attributes. Consider predicting Olympic winning times using both the year and the personal best times of the eight lane competitors. The input vector becomes x = [1, year, s₁, s₂, ..., s₈]ᵀ, creating a 10-dimensional parameter space. The matrix formulation handles this complexity seamlessly, requiring no changes to the solution methodology.

## Non-Linear Extensions Through Basis Functions

Linear models need not be limited to straight-line relationships. By transforming the input variables through basis functions, we can capture non-linear patterns while maintaining the computational advantages of linear parameter estimation. The most common transformation involves polynomial terms: instead of using just x, we include powers like x², x³, etc.

For example, a quadratic model uses the augmented input vector [1, x, x²]ᵀ, resulting in f(x; w) = w₀ + w₁x + w₂x². This model can capture curved relationships in the data while still using the same least squares solution framework. The chapter demonstrates this with an eighth-order polynomial fitted to the Olympic data, showing how higher-order models can achieve lower training errors.
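
A quadratic fit can be sketched by changing only the design matrix. The data below is hypothetical (noise-free values generated from t = 2 - x + 0.5x²), and `polynomial_design_matrix` is an illustrative helper name, not something from the book.

```python
import numpy as np

def polynomial_design_matrix(x, order):
    """Return the matrix whose columns are x**0, x**1, ..., x**order."""
    return np.vander(x, N=order + 1, increasing=True)

# Hypothetical noise-free data generated from t = 2 - x + 0.5 x^2.
x = np.linspace(-2.0, 2.0, 9)
t = 2.0 - 1.0 * x + 0.5 * x**2

X = polynomial_design_matrix(x, order=2)

# lstsq solves the same least squares problem in a numerically stable way.
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)

print("recovered coefficients [w0, w1, w2]:", w_hat)
```

The model is still linear in the parameters w, which is why the same solver applies even though the fitted curve is not a straight line.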

The concept extends beyond polynomials to any set of basis functions. For the Olympic data, which exhibits some periodic behavior, a model incorporating trigonometric functions might be appropriate: f(x; w) = w₀ + w₁x + w₂sin((x-a)/b). This approach allows complex, non-linear relationships to be captured within the linear modeling framework.

## The Challenge of Overfitting

A critical insight emerges when comparing models of different complexity: more complex models invariably fit the training data better, but this improved fit doesn't necessarily translate to better predictions. The eighth-order polynomial fitted to the Olympic data achieves a training loss of 0.459 compared to 1.358 for the linear model, yet its predictions appear unrealistic, especially when extrapolated beyond the observed data range.

This phenomenon, known as overfitting, occurs when a model becomes so complex that it memorizes the specific details of the training data rather than learning the underlying pattern. The model effectively becomes overspecialized to the training examples and loses its ability to generalize to new, unseen data. Understanding this trade-off between fitting the training data and maintaining predictive capability represents one of the most important concepts in machine learning.

## Model Selection Through Validation

To address the overfitting problem, the chapter introduces validation techniques that estimate how well a model will perform on unseen data. The simplest approach involves setting aside a portion of the available data (the validation set) and using it to test models trained on the remaining data (the training set). For the Olympic example, data from 1980 onwards serves as validation data, with earlier years used for training.

This validation approach reveals that simpler models often perform better on unseen data despite achieving higher training errors. When polynomial models of increasing order are compared, the training error decreases monotonically with model complexity, but validation error typically decreases initially then increases, creating a U-shaped curve that identifies the optimal model complexity.
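
The comparison of training and validation error across polynomial orders can be sketched as follows. The data is hypothetical (a noisy quadratic), and the split is random rather than by date as in the Olympic example, so this illustrates the procedure and not the book's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus Gaussian noise.
x = np.linspace(-1.0, 1.0, 40)
t = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

# Random hold-out split: 30 training points, 10 validation points.
idx = rng.permutation(x.size)
train_idx, val_idx = idx[:30], idx[30:]

def mean_squared_loss(w, x, t):
    """Average squared error of the polynomial with coefficients w."""
    X = np.vander(x, N=len(w), increasing=True)
    return np.mean((t - X @ w) ** 2)

train_loss, val_loss = {}, {}
for order in range(1, 7):
    # Fit on the training set only.
    X_train = np.vander(x[train_idx], N=order + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_train, t[train_idx], rcond=None)
    train_loss[order] = mean_squared_loss(w, x[train_idx], t[train_idx])
    val_loss[order] = mean_squared_loss(w, x[val_idx], t[val_idx])

# Pick the order with the lowest validation loss.
best_order = min(val_loss, key=val_loss.get)
for order in train_loss:
    print(order, round(train_loss[order], 4), round(val_loss[order], 4))
```

Training loss can only go down as the order grows because each higher-order model contains the lower-order ones; validation loss carries no such guarantee.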

Cross-validation extends this concept by systematically rotating which data serves as the validation set. In K-fold cross-validation, the data is divided into K equal blocks, with each block serving as validation data while the model is trained on the remaining K-1 blocks. Leave-one-out cross-validation represents the extreme case where each individual data point serves as a validation set. This technique provides a more robust estimate of model performance, especially when data is limited.
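
K-fold cross-validation can be sketched in a few lines of NumPy. The function name `kfold_cv_loss` and the data are hypothetical; the folds are contiguous index blocks, which is reasonable here only because the inputs are drawn in random order.

```python
import numpy as np

def kfold_cv_loss(x, t, order, K):
    """Average validation loss of a polynomial fit over K folds."""
    folds = np.array_split(np.arange(len(x)), K)
    losses = []
    for val_idx in folds:
        # Train on everything except the current fold.
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[val_idx] = False
        X_train = np.vander(x[train_mask], N=order + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X_train, t[train_mask], rcond=None)
        # Evaluate on the held-out fold.
        X_val = np.vander(x[val_idx], N=order + 1, increasing=True)
        losses.append(np.mean((t[val_idx] - X_val @ w) ** 2))
    return float(np.mean(losses))

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
t = 0.5 + 1.5 * x + rng.normal(scale=0.2, size=x.size)

cv_5fold = kfold_cv_loss(x, t, order=1, K=5)
cv_loo = kfold_cv_loss(x, t, order=1, K=len(x))  # leave-one-out
print("5-fold:", cv_5fold, "leave-one-out:", cv_loo)
```

Setting `K` equal to the number of data points gives leave-one-out cross-validation, at the cost of one model fit per data point.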

## Regularization as Complexity Control

An alternative approach to preventing overfitting involves explicitly penalizing model complexity during the optimization process. Regularization achieves this by adding a penalty term to the loss function that grows with the magnitude of the model parameters. The most common form, known as Ridge regression or L2 regularization, modifies the loss function to L' = L + λwᵀw.

The regularization parameter λ controls the trade-off between fitting the training data accurately and maintaining model simplicity. When λ = 0, we recover the standard least squares solution. As λ increases, the optimization process increasingly favors simpler models with smaller parameter values, even at the cost of slightly higher training error.

The regularized solution becomes ŵ = (XᵀX + NλI)⁻¹Xᵀt, where I represents the identity matrix. This modification has the effect of "shrinking" parameter estimates toward zero, producing smoother, more generalizable models. The challenge lies in selecting an appropriate value for λ, which again relies on validation techniques to balance complexity and performance.
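
The shrinking effect can be seen on the three synthetic points from the worked example. This is a sketch of the closed-form expression above; `ridge_fit` is an illustrative name, and as in that formula the penalty applies to every parameter, including the intercept.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Minimize (1/N)||t - Xw||^2 + lam * w^T w in closed form."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ t)

# Synthetic data from the worked example.
x = np.array([1.0, 3.0, 5.0])
t = np.array([4.8, 11.3, 17.2])
X = np.column_stack([np.ones_like(x), x])

lambdas = (0.0, 0.1, 1.0, 10.0)
norms = []
for lam in lambdas:
    w = ridge_fit(X, t, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:5.1f}  w={w}  ||w||={norms[-1]:.3f}")
```

With `lam = 0.0` the function reduces to ordinary least squares, and the parameter norm shrinks as `lam` grows.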

## Practical Implications and Limitations

The Olympic sprint example illustrates both the strengths and limitations of linear modeling. The linear trend in the data makes intuitive sense and provides reasonable short-term predictions. However, extrapolating far into the future reveals the model's limitations: it eventually predicts impossible negative winning times and fails to account for physical limits on human performance.

These limitations highlight important considerations when applying linear models. First, the linear assumption may be inappropriate for many real-world relationships. Second, predictions become increasingly unreliable as they move further from the training data range. Third, the model's simplicity, while computationally advantageous, may miss important patterns or relationships in the data.

Despite these limitations, linear modeling provides an essential foundation for machine learning. Its analytical solutions offer computational efficiency, its mathematical properties are well understood, and its interpretable parameters provide insight into the relationships being modeled. Many advanced machine learning techniques build upon these linear foundations, making mastery of these concepts crucial for understanding more complex methods.

## Summary

This chapter establishes linear modeling as both a practical tool for regression problems and a theoretical foundation for machine learning. The progression from simple line fitting through matrix formulations to regularization and validation provides a comprehensive introduction to key machine learning concepts. The Olympic sprint example effectively demonstrates how these abstract mathematical concepts apply to real-world prediction problems, while also illustrating the importance of understanding model limitations and the dangers of overfitting.

---

# Chapter 1 Summary: Linear Modelling - A Least Squares Approach

## Introduction to Linear Modeling

Linear modeling forms the cornerstone of machine learning approaches for discovering functional relationships between input attributes and target responses. The chapter uses Olympic 100-meter sprint winning times as a compelling example to demonstrate how machine learning can extract meaningful patterns from historical data and make future predictions. The fundamental premise involves learning a mathematical function that maps input variables to output variables, enabling predictions for previously unseen inputs.

The Olympic example illustrates this concept by using historical data from 1896 to 2008 to predict winning times for the 2012 and 2016 Olympics. This practical application demonstrates both the power and limitations of linear modeling approaches in real-world scenarios.

## The Linear Model Framework

Linear modeling operates under the assumption that the relationship between inputs and outputs can be adequately represented by a straight line. The simplest form follows the equation t = w₀ + w₁x, where t represents the target value, x represents the input attribute, and w₀ and w₁ are parameters defining the line's characteristics.

The Olympic data analysis reveals a negative linear relationship, with winning times generally decreasing over the years. Through mathematical analysis, the optimal linear model emerges as f(x; w₀, w₁) = 36.416 - 0.013x, indicating that winning times decrease by approximately 0.013 seconds per year on average.

| Parameter | Value | Interpretation |
|-----------|-------|----------------|
| w₀ (intercept) | 36.416 | Theoretical winning time at year 0 |
| w₁ (gradient) | -0.013 | Average decrease in seconds per year |

The model generates specific predictions: 9.595 seconds for 2012 and 9.541 seconds for 2016 (presumably computed from the unrounded parameters, since the rounded coefficients above give about 10.26 seconds for 2012). These precise predictions highlight both the mathematical precision possible with linear modeling and its inherent limitations when applied to real-world events.

## Loss Functions and Optimization Principles

Model quality depends fundamentally on how well it fits observed data. The squared loss function provides an objective measure by calculating the squared difference between actual and predicted values for each data point. For the nth Olympic year, this loss is expressed as Lₙ = (tₙ - f(xₙ; w₀, w₁))².

Squaring the differences serves multiple critical purposes: preventing positive and negative errors from canceling each other, penalizing larger errors more heavily than smaller ones, and creating a mathematically tractable optimization problem. Total model performance is measured by averaging individual losses across all data points: L = (1/N) Σ Lₙ.

The least squares method identifies parameter values that minimize this average loss through calculus-based optimization. This process involves taking partial derivatives of the loss function with respect to each parameter, setting these derivatives to zero, and solving the resulting system of equations to find optimal parameters.

## Worked Example with Synthetic Data

A simple synthetic dataset with three data points illustrates the mechanics of least squares fitting effectively:

| n | xₙ | tₙ | xₙtₙ | xₙ² |
|---|----|----|------|-----|
| 1 | 1  | 4.8| 4.8  | 1   |
| 2 | 3  |11.3| 33.9 | 9   |
| 3 | 5  |17.2| 86   | 25  |
|Avg| 3  |11.1| 41.57|11.67|

Using derived formulas, w₁ = (41.57 - 3 × 11.1)/(11.67 - 3 × 3) = 8.27/2.67 = 3.1, and w₀ = 11.1 - 3.1 × 3 = 1.8. The resulting model f(x; w₀, w₁) = 1.8 + 3.1x demonstrates how least squares produces a line balancing proximity to all data points rather than passing through any specific point perfectly.

## Vector and Matrix Formulation

As models increase in complexity, involving multiple input attributes or higher-order terms, scalar approaches become unwieldy. Vector and matrix notation provides an elegant solution that scales to arbitrary complexity. Parameters combine into a parameter vector w = [w₀, w₁]ᵀ, and each scalar input xₙ is augmented with a constant term to form the vector **x**ₙ = [1, xₙ]ᵀ.

This reformulation transforms the model into f(xₙ; w) = wᵀ**x**ₙ, a simple dot product. The loss function becomes L = (1/N)(t - Xw)ᵀ(t - Xw), where X is the matrix whose nth row is **x**ₙᵀ and t contains all target values. This matrix formulation leads to the general solution ŵ = (XᵀX)⁻¹Xᵀt, applicable regardless of parameter count or input transformation complexity.

The approach's power becomes apparent when extending to multiple attributes. Predicting Olympic winning times using both year and personal best times of eight lane competitors creates a 10-dimensional parameter space with input vector x = [1, year, s₁, s₂, ..., s₈]ᵀ. The matrix formulation handles this complexity seamlessly without requiring methodology changes.

## Non-Linear Extensions Through Basis Functions

Linear models need not be limited to straight-line relationships. Transforming input variables through basis functions enables capturing non-linear patterns while maintaining computational advantages of linear parameter estimation. The most common transformation involves polynomial terms, including powers like x², x³, etc.

A quadratic model uses the augmented input vector [1, x, x²]ᵀ, resulting in f(x; w) = w₀ + w₁x + w₂x². This model captures curved relationships while using the same least squares solution framework. The chapter demonstrates this with an eighth-order polynomial fitted to Olympic data, showing how higher-order models achieve lower training errors.

The concept extends beyond polynomials to any set of basis functions. For Olympic data exhibiting periodic behavior, a model incorporating trigonometric functions might be appropriate: f(x; w) = w₀ + w₁x + w₂sin((x-a)/b). This approach captures complex, non-linear relationships within the linear modeling framework.

## The Challenge of Overfitting

A critical insight emerges when comparing models of different complexity: more complex models invariably fit training data better, but improved fit doesn't necessarily translate to better predictions. The eighth-order polynomial achieves a training loss of 0.459 compared to 1.358 for the linear model, yet its predictions appear unrealistic, especially when extrapolated beyond observed data ranges.

Overfitting occurs when models become so complex that they memorize specific training data details rather than learning underlying patterns. The model becomes overspecialized to training examples and loses its ability to generalize to new, unseen data. Understanding this trade-off between fitting training data and maintaining predictive capability represents one of machine learning's most important concepts.

## Model Selection Through Validation

To address overfitting, the chapter introduces validation techniques estimating how well models will perform on unseen data. The simplest approach involves setting aside a data portion (validation set) and using it to test models trained on remaining data (training set). For the Olympic example, data from 1980 onwards serves as validation data, with earlier years used for training.

This validation approach reveals that simpler models often perform better on unseen data despite achieving higher training errors. When comparing polynomial models of increasing order, training error decreases monotonically with model complexity, but validation error typically decreases initially then increases, creating a U-shaped curve identifying optimal model complexity.

Cross-validation extends this concept by systematically rotating which data serves as the validation set. In K-fold cross-validation, data is divided into K equal blocks, with each block serving as validation data while models train on remaining K-1 blocks. Leave-one-out cross-validation represents the extreme case where each individual data point serves as a validation set, providing more robust performance estimates, especially with limited data.

## Regularization as Complexity Control

An alternative approach to preventing overfitting involves explicitly penalizing model complexity during optimization. Regularization achieves this by adding a penalty term to the loss function that grows with parameter magnitude. The most common form, Ridge regression or L2 regularization, modifies the loss function to L' = L + λwᵀw.

The regularization parameter λ controls the trade-off between fitting training data accurately and maintaining model simplicity. When λ = 0, the standard least squares solution is recovered. As λ increases, optimization increasingly favors simpler models with smaller parameter values, even at the cost of slightly higher training error.

| λ Value | Effect | Model Characteristics |
|---------|--------|----------------------|
| λ = 0 | No regularization | Standard least squares solution |
| Low λ | Minimal penalty | Slight parameter shrinkage |
| High λ | Strong penalty | Significant parameter shrinkage, smoother models |

The regularized solution becomes ŵ = (XᵀX + NλI)⁻¹Xᵀt, where I represents the identity matrix. This modification "shrinks" parameter estimates toward zero, producing smoother, more generalizable models. The challenge lies in selecting appropriate λ values, again relying on validation techniques to balance complexity and performance.

## Practical Implications and Limitations

The Olympic sprint example illustrates both strengths and limitations of linear modeling. The linear trend makes intuitive sense and provides reasonable short-term predictions. However, extrapolating far into the future reveals model limitations: it eventually predicts impossible negative winning times and fails to account for physical limits on human performance.

These limitations highlight important considerations when applying linear models. The linear assumption may be inappropriate for many real-world relationships. Predictions become increasingly unreliable as they move further from training data ranges. The model's simplicity, while computationally advantageous, may miss important patterns or relationships in data.

Despite these limitations, linear modeling provides an essential foundation for machine learning. Its analytical solutions offer computational efficiency, its mathematical properties are well understood, and its interpretable parameters provide insight into modeled relationships. Many advanced machine learning techniques build upon these linear foundations, making mastery of these concepts crucial for understanding more complex methods.

## Key Takeaways

Linear modeling serves as both a practical tool for regression problems and a theoretical foundation for machine learning. The progression from simple line fitting through matrix formulations to regularization and validation provides a comprehensive introduction to key machine learning concepts. The Olympic sprint example effectively demonstrates how abstract mathematical concepts apply to real-world prediction problems while illustrating the importance of understanding model limitations and the dangers of overfitting.

The chapter establishes that successful machine learning requires balancing model complexity with generalization capability, introducing fundamental concepts that extend far beyond linear models into the broader field of machine learning and statistical modeling.

# Machine Learning Study Guide

This study guide is based on class notes and readings from *A First Course in Machine Learning*.

---

## 1. Introduction to Machine Learning

**Definition:**  
Machine Learning is the process of *training* a model to learn patterns from data (features, independent variables) and *inferring* outcomes (labels, dependent variables).

- We try to approximate a function `y = f(x)` such that predictions `ŷ` are as close as possible to the true labels `y`.
- Example: A simple linear function `y = ax + b`.

---

## 2. Linear Models

**Key Characteristics:**  
1. Linear relationships only.  
   - **Addition of variables**: `y = a1x1 + a2x2 + ... + anxn + b`.  
   - **Multiplication by constants**: scaling is allowed.  
2. No multiplication of variables with each other (no `x1 * x2`) and no exponents beyond 1.  

**Non-Example:**  
- Quadratic equation: `y = ax² + bx + c` → not linear in `x` because of the exponent `2`.

**Why Linear Models?**  
- Simple to interpret.  
- Computationally efficient.  
- Often a good starting point before moving to more complex models.  

---

## 3. Evaluating Linear Models

We want the model’s predictions to be as close as possible to actual data points.  
This is done by minimizing **loss functions**.

### Mean Squared Loss (MSE)
- Formula:  
  \[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

- **Steps:**  
  1. Compute the error (residual) between prediction and true value.  
  2. Square each error (to avoid negatives).  
  3. Take the mean of squared errors.  

- **Why Square?**  
  - Avoids cancellation of positive and negative errors.  
  - Penalizes large errors more heavily.  
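
The three steps can be written directly as a short function. This is a minimal sketch in plain Python; the sample values are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared residuals between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Step 1: residual for each data point.
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    # Step 2: square each residual.
    squared = [r ** 2 for r in residuals]
    # Step 3: average the squared residuals.
    return sum(squared) / len(squared)

# Made-up example: residuals are 1, 0 and -2, so MSE = (1 + 0 + 4) / 3.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
mse = mean_squared_error(y_true, y_pred)
print(mse)
```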

---

## 4. Understanding Optimization with Derivatives

Minimizing MSE requires finding the **best parameters (weights)**.  
We use calculus to identify where the loss function is minimized.

### Critical Points from Derivatives
1. **Saddle Point** – function goes down in one direction, up in another.  
2. **Local Maximum** – highest point in a region (not useful for minimization).  
3. **Local Minimum** – lowest point in a region (what we want).  

- **1st Derivative (Gradient):** Identifies slope; zero at critical points.  
- **2nd Derivative (Curvature):** Helps classify if it’s a min, max, or saddle.  

### Gradient Descent
- An iterative algorithm to minimize loss.  
- Updates parameters in the opposite direction of the gradient.  
- Step size controlled by a **learning rate**.  
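
Gradient descent on the MSE of a straight-line model can be sketched as below. The data is the three-point synthetic set from the book's worked example; the learning rate and iteration count are assumptions picked so that this small problem converges, not recommended defaults.

```python
import numpy as np

# Three-point synthetic dataset; its least squares solution is w0 = 1.8, w1 = 3.1.
x = np.array([1.0, 3.0, 5.0])
t = np.array([4.8, 11.3, 17.2])
X = np.column_stack([np.ones_like(x), x])

def mse_gradient(w, X, t):
    """Gradient of L(w) = (1/N) ||t - Xw||^2 with respect to w."""
    return -2.0 / len(t) * X.T @ (t - X @ w)

w = np.zeros(2)        # start from w0 = w1 = 0
learning_rate = 0.05   # assumed step size; much larger values can diverge
for _ in range(5000):
    # Move against the gradient, scaled by the learning rate.
    w = w - learning_rate * mse_gradient(w, X, t)

print("gradient descent estimate:", w)
```

For a linear model the closed-form least squares solution makes this loop unnecessary, but the same update rule carries over to models that have no closed form.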

---

## 5. Notes from *A First Course in Machine Learning – Chapter 1*

### Linear Modelling: A Least Squares Approach

- **Goal:** Fit a straight line through data by minimizing squared differences between predicted and observed values.  
- **Least Squares Method:**  
  - Choose coefficients (weights) that minimize the total squared error.  
- **Applications:**  
  - Predicting continuous outcomes (e.g., house prices, exam scores).  
- **Limitations:**  
  - Sensitive to outliers.  
  - A straight-line fit cannot capture non-linear relationships unless the inputs are transformed (e.g., with polynomial basis functions).  

---

## 6. Key Takeaways

- Machine learning is about approximating unknown functions from data.  
- Linear models are simple but powerful tools.  
- Mean Squared Error is the most common loss function for regression.  
- Derivatives help us optimize models and find minima.  
- Least Squares is the foundation of linear regression.  

---

## 7. Study Tips

- Be able to explain **why linear models are linear**.  
- Practice deriving MSE and computing it on small datasets.  
- Understand the intuition behind gradient descent, not just the formula.  
- Work through examples of fitting a line by hand using least squares.  
- Contrast linear vs. non-linear models (when each is appropriate).  



# Machine Learning Cheat Sheet

## 1. Machine Learning Basics
- Learn patterns (training), predict outcomes (inference).
- Features = independent variables (x).  
- Labels = dependent variables (y).  
- Example: `y = ax + b`.

---

## 2. Linear Models
- Linear = addition + scaling only.  
- No variable multiplication or exponents > 1.  
- Form: `y = a1x1 + a2x2 + ... + anxn + b`.  
- Non-linear: `y = ax² + bx + c`.

---

## 3. Loss Function (Regression)
**Mean Squared Error (MSE):**  
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]  
- Squaring penalizes large errors.  
- Minimization goal: smallest MSE.

---

## 4. Optimization & Derivatives
- **1st Derivative = Gradient (slope).**  
- **2nd Derivative = Curvature.**
  - Saddle point: mix of up/down.  
  - Local max: peak.  
  - Local min: valley → best fit.  

**Gradient Descent:** Iteratively updates weights in the direction opposite to the gradient.  

---

## 5. Least Squares (Linear Regression)
- Choose coefficients that minimize squared errors.  
- Foundation of regression.  
- Good for continuous predictions (e.g., prices).  
- Weakness: sensitive to outliers; a straight line cannot model non-linear patterns without transformed inputs.

---

## 6. Key Reminders
- Start simple: linear → more complex later.  
- Outliers can distort linear regression.  
- Learning rate in gradient descent sets the step size: too large can overshoot or diverge, too small converges slowly.  
- Linear ≠ always best, but essential foundation.

