# CS50 AI with Python

##  Lecture 4

##  Learning

---

#### Machine Learning

Machine learning provides a computer with data rather than explicit instructions. Using this data, the computer learns to recognize patterns and becomes able to execute tasks on its own.

#### Supervised Learning

* A task where a computer learns a function that maps inputs to outputs based on a dataset of input-output pairs.

There are multiple tasks that fall under supervised learning. One of these tasks is ***Classification***.

**Classification** is a task where a function maps an input to a discrete output.

Ex.

We want to predict if it **will rain** based on *humidity* and *pressure*.

To classify whether it **will** or **will not rain**, we can use past data to learn a rule that approximates how the weather actually behaves.

Each day is a data point, labeled **Rain** or **No Rain** below.

The computer's job is to label a new day, which has its own place on the graph (the white dot).

![alt text](images/classification.png)

There are multiple methods for classifying the white dot into a category.

---

#### Nearest-Neighbor Classification

One way of solving a task like the one above is **Nearest-Neighbor Classification**, which assigns the new data point the label of the closest already-classified data point.

This sounds good in theory, but we run into a problem when faced with a task like the one below.

![](images/nearestneighbor.png)

In the example above, **Nearest-Neighbor Classification** would classify the white dot as red, because red is its nearest observation. However, when you zoom out and see the bigger picture, the white dot is more likely to be blue.

One way to get around the limitations of **Nearest-Neighbor Classification** is to use **k-nearest-neighbors classification**, where the dot is colored based on the most frequent color among its *k* nearest neighbors. The value of *k* is chosen by the programmer.

A drawback of **k-nearest-neighbors classification** is that a naive implementation measures the distance from every single point to the point in question, which is computationally expensive.

This can be sped up by using data structures that enable finding neighbors more quickly or by pruning irrelevant observations.
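The naive approach described above can be sketched in a few lines. This is only an illustration: the function name `knn_classify` and the (humidity, pressure) values are made up for the example and do not come from the lecture.

```python
import math
from collections import Counter


def knn_classify(training, point, k=3):
    """Label `point` with the most common label among its k nearest neighbors.

    `training` is a list of (features, label) pairs.
    """
    # Naive approach: measure the distance from `point` to every training point.
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (humidity, pressure) observations, invented for illustration.
training = [
    ((0.90, 0.20), "Rain"),
    ((0.80, 0.30), "Rain"),
    ((0.85, 0.25), "Rain"),
    ((0.55, 0.50), "No Rain"),
    ((0.20, 0.90), "No Rain"),
    ((0.30, 0.80), "No Rain"),
]

new_day = (0.60, 0.45)
label_k1 = knn_classify(training, new_day, k=1)  # follows the single closest point
label_k5 = knn_classify(training, new_day, k=5)  # majority vote among five points
```

With this data the single nearest neighbor is a lone "No Rain" point, while the vote among five neighbors goes to "Rain", which mirrors the zoomed-out picture in the figure.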

---

#### Perceptron Learning

Instead of using **nearest-neighbor** classification, we can approach classification by examining the entire dataset and creating a decision boundary. This is called **Perceptron Learning**.

For two-dimensional data, this means drawing a line between the two types of observations, and classifying new points based on which side of the line they fall.

![alt text](images/decisionboundary.png)

The drawback to this approach is that data is often messy, and it is rare that one can draw a line that neatly divides the observations into two classes without any mistakes. Often, we will compromise, drawing a line that separates the observations correctly more often than not, but still occasionally misclassifies them.

In this case above, we have two inputs:

* x₁ = Humidity
* x₂ = Pressure

These inputs are given to a *hypothesis function* ***h(x₁, x₂)***, which outputs its prediction of whether it is *going to rain or not*. The function checks on which side of the *decision boundary* the *observation* falls. In other words, it multiplies each *input* by a weight and adds a *constant*, giving a linear expression that is compared against 0:

* Rain if w₀ + w₁x₁ + w₂x₂ ≥ 0
* No Rain otherwise

*Often the output variable is coded as 1 and 0: if the expression's value is at least 0, the output is 1 (Rain), and otherwise it is 0 (No Rain).*

The weights and values are represented by **vectors**, which are sequences of numbers (which can be stored in lists or tuples in Python). We produce a *Weight Vector w: (w₀, w₁, w₂)*, and getting to the best weight vector is the goal of the machine learning algorithm. We also produce an *Input Vector x: (1, x₁, x₂).*

We take the dot product of these two vectors, meaning we multiply each value in one vector by the corresponding value in the other vector and add the results, arriving at this expression: *w₀ + w₁x₁ + w₂x₂*. The first value in the input vector is 1 because, when multiplied by the weight w₀, it leaves w₀ unchanged, so w₀ acts as a constant term.

Thus, our hypothesis function can be represented the following way:

![alt text](images/dotproduct.png)
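As a rough sketch, the hypothesis function can be written directly from the dot product. The weight values below are arbitrary placeholders chosen for illustration, not learned weights.

```python
def dot(w, x):
    """Dot product: multiply matching entries and add the results."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def h(weights, humidity, pressure):
    """Return 1 (Rain) if w · x >= 0, else 0 (No Rain)."""
    x = (1, humidity, pressure)  # leading 1 pairs with the constant weight w0
    return 1 if dot(weights, x) >= 0 else 0


# Hypothetical weight vector (w0, w1, w2), chosen only for this example.
weights = (-1.0, 2.0, -0.5)

humid_day = h(weights, 0.9, 0.2)  # -1.0 + 1.8 - 0.1 = 0.7, so Rain
dry_day = h(weights, 0.3, 0.8)    # -1.0 + 0.6 - 0.4 = -0.8, so No Rain
```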

⚠️

**Since the goal of the algorithm is to find the best weight vector**, when the algorithm encounters new data it updates the current weights. It does so using the *perceptron learning rule*:


![alt text](images/perceptronlearning.png)

The weights are updated for each data point to improve accuracy. If the model's prediction matches the actual result, the weights stay the same. If it underestimates, the weights increase; if it overestimates, the weights decrease. The amount of change depends on the input value and the learning rate *α*, which controls *how strongly each new example affects the weights.*
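A minimal sketch of this update, assuming the rule in the figure has the common form wᵢ ← wᵢ + α(actual − estimate) × xᵢ. The training data, the function names, and the number of passes over the data are illustrative choices.

```python
def predict(weights, x):
    """Hard-threshold hypothesis: 1 if w · x >= 0, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0


def update(weights, x, actual, alpha=0.1):
    """Apply one perceptron update for a single data point."""
    estimate = predict(weights, x)
    # (actual - estimate) is 0 when correct, +1 when too low, -1 when too high.
    return [w + alpha * (actual - estimate) * xi for w, xi in zip(weights, x)]


def train(data, alpha=0.1, epochs=20):
    """Repeatedly sweep over the data, updating the weights point by point."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, actual in data:
            weights = update(weights, x, actual, alpha)
    return weights


# Hypothetical inputs (1, humidity, pressure) with labels 1 = Rain, 0 = No Rain.
data = [
    ((1, 0.9, 0.2), 1),
    ((1, 0.8, 0.3), 1),
    ((1, 0.2, 0.9), 0),
    ((1, 0.3, 0.8), 0),
]

weights = train(data)
```

On this small, cleanly separable dataset the weights stop changing after a couple of passes; on messy data the updates may never settle completely.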

The result of this process is a **threshold function** that switches from 0 to 1 once the estimated value crosses some threshold.


![alt text](images/hardthreshold.png)

A hard threshold function only outputs 0 or 1, making it unable to express uncertainty. To address this, *a logistic (soft threshold) function is used*, which outputs values between 0 and 1 to represent confidence in the prediction — the closer to 1, the higher the likelihood of rain.

![alt text](images/softthreshold.png)
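The difference between the two thresholds can be seen by evaluating both on the same values. This sketch uses the standard logistic function 1 / (1 + e⁻ᶻ), where z stands for the weighted sum w · x; the function names are illustrative.

```python
import math


def hard_threshold(z):
    """Output only 0 or 1, with no notion of confidence."""
    return 1 if z >= 0 else 0


def logistic(z):
    """Soft threshold: a value between 0 and 1 that grows with z."""
    return 1 / (1 + math.exp(-z))


# Points just on either side of the boundary get very different hard outputs
# but nearly identical soft outputs.
barely_rain = (hard_threshold(0.1), logistic(0.1))
barely_dry = (hard_threshold(-0.1), logistic(-0.1))
clearly_rain = (hard_threshold(4.0), logistic(4.0))
```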


### Support Vector Machines

**Support Vector Machines** (SVMs) are a powerful method for classification. Unlike nearest-neighbor classification, **SVMs** focus on finding the best boundary to separate data points from different classes. That boundary is defined by the data points that lie closest to it, called *support vectors*. Consider the example below.

![alt text](images/supportvector.png)

Even if several decision boundaries correctly *separate* the data (i.e., no mistakes), they **are not all equally effective**. For example, boundaries that are too close to the data points (like the two on the left) may misclassify new data points that are only slightly different. In contrast, the boundary on the right is positioned to keep the *maximum possible distance* from both groups. This is called the **Maximum Margin Separator**. It provides more flexibility and reliability when classifying new data.
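To make the idea of margin concrete, the sketch below measures how close each of two candidate boundaries comes to the nearest data point. The points and both boundaries are invented for illustration, and the code only compares given lines; it does not search for the best one the way an SVM does.

```python
import math


def margin(w, b, points):
    """Smallest distance from any point to the line w1*x1 + w2*x2 + b = 0."""
    norm = math.hypot(*w)
    return min(abs(w[0] * x1 + w[1] * x2 + b) / norm for x1, x2 in points)


# Hypothetical data: the first two points belong to one class, the last two
# to the other. Both lines below put the classes on opposite sides.
points = [(1, 1), (2, 1), (4, 4), (5, 4)]

tight_margin = margin((2, 3), -8.0, points)   # line hugging the first class
wide_margin = margin((2, 3), -13.5, points)   # line midway between the classes
```

Both lines classify every point correctly, but the second leaves far more room for a new point that differs slightly from the ones already seen.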

A key advantage of **SVMs** is their ability to handle more complex situations:

* They work not just in 2D, but in many dimensions.

* They can also model non-linear boundaries using techniques like the kernel trick (as in the figure below).


![alt text](images/circleboundary.png)

### Regression

Regression is a type of supervised learning where the goal is to predict a continuous value (a real number) based on input data. This contrasts with classification, which predicts discrete categories (e.g., "Rain" or "No Rain").

**Example:**

A company may want to predict sales revenue based on advertising spend. Here, the true relationship between advertising and revenue is an unknown function $f(\text{advertising})$, and our goal is to create an approximate function $h(\text{advertising})$ that can predict future revenue values. Unlike classification, this prediction isn't about choosing a category, but about estimating a numerical outcome.

![alt text](images/regression.png)
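One simple way to build such an *h* is to fit a straight line by least squares. The notes do not say which fitting method is used, so treat this as one possible sketch; the advertising and revenue figures are invented.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past observations (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
revenue = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, revenue)


def h(spend):
    """Approximate revenue for a given advertising spend."""
    return slope * spend + intercept


predicted_revenue = h(5.0)  # a continuous estimate, not a category
```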

---

### Loss Functions

Loss functions measure how bad a prediction is—the bigger the error, the higher the loss.

**For classification**, a common loss function is the **0-1 Loss Function**, defined as:

```
L(actual, predicted) = 
    0 if actual == predicted  
    1 if actual != predicted
```

In simple terms:

* If the prediction is correct, the loss is 0.
* If the prediction is wrong, the loss is 1.

This helps evaluate how well a model is performing.

![alt text](images/01loss.png)

In the example above, we used a line to separate rainy from non-rainy days.

* **Correct predictions** (e.g., rainy days below the line, non-rainy days above) are given a loss of **0**.
* **Incorrect predictions** (e.g., rainy days above the line or non-rainy days below it) are given a loss of **1**.

By summing all the 1s, we count how many mistakes our model made. This is the **empirical loss** based on the 0-1 loss function.
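The 0-1 loss and its sum over a dataset translate almost directly into code. The label lists below are invented for illustration.

```python
def zero_one_loss(actual, predicted):
    """0 for a correct prediction, 1 for a wrong one."""
    return 0 if actual == predicted else 1


def empirical_loss(actuals, predictions):
    """Total 0-1 loss over a dataset, i.e. the number of mistakes."""
    return sum(zero_one_loss(a, p) for a, p in zip(actuals, predictions))


# Hypothetical labels for five days.
actuals = ["Rain", "Rain", "No Rain", "No Rain", "Rain"]
predictions = ["Rain", "No Rain", "No Rain", "Rain", "Rain"]

mistakes = empirical_loss(actuals, predictions)
```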

For **regression** problems (predicting continuous values), we measure **how far off** a prediction is from the actual value using:

* **L₁ Loss (Absolute Error)**:
  $L = |\text{actual} - \text{predicted}|$
  – Penalizes every error in proportion to its size; more robust to outliers.

* **L₂ Loss (Squared Error)**:
  $L = (\text{actual} - \text{predicted})^2$
  – Punishes larger errors more; sensitive to outliers.

**Choosing between L₁ and L₂:**
Use **L₁** when you want stability and resistance to outliers.
Use **L₂** when you want to emphasize minimizing large errors.


![alt text](images/l1.png)
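A small numeric comparison shows how differently the two losses react to a single outlier. The values are made up; the last observation is deliberately far from its prediction.

```python
def l1_loss(actual, predicted):
    """Absolute error."""
    return abs(actual - predicted)


def l2_loss(actual, predicted):
    """Squared error."""
    return (actual - predicted) ** 2


# Hypothetical data: three small errors and one outlier (50 vs. 11).
actual = [10, 12, 11, 50]
predicted = [11, 11, 11, 11]

total_l1 = sum(l1_loss(a, p) for a, p in zip(actual, predicted))
total_l2 = sum(l2_loss(a, p) for a, p in zip(actual, predicted))

# Share of the total loss that comes from the outlier alone.
outlier_share_l1 = l1_loss(50, 11) / total_l1
outlier_share_l2 = l2_loss(50, 11) / total_l2
```

Under L₂ the outlier accounts for nearly all of the total loss, so a model minimizing L₂ is pulled toward it much more strongly than one minimizing L₁.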

---

### Overfitting

Overfitting happens when a model fits the training data so closely that it fails to perform well on new, unseen data. This shows a limitation of relying on the loss function alone: the loss can be driven to zero on the training data (as in the two examples below), but that doesn't mean the model will work well on other data.

![alt text](images/overfitting.png)

For example, in the left graph, a dot next to the red one at the bottom of the screen is likely to be Rain (blue). However, with the overfitted model, it will be classified as No Rain (red).

---

### Regularization 

Regularization helps prevent overfitting by adding a penalty for complexity to the model’s total cost. This encourages simpler models that generalize better.

The total cost is calculated as:

```
cost(h) = loss(h) + λ * complexity(h)
```

Here, **λ (lambda)** controls how much we penalize complexity—higher λ means stronger regularization.
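The effect of λ can be illustrated with two candidate models. The notes leave `complexity(h)` abstract; this sketch assumes it is the sum of squared weights, which is one common choice, and all the numbers are invented.

```python
def cost(loss, weights, lam):
    """cost(h) = loss(h) + lambda * complexity(h)."""
    complexity = sum(w ** 2 for w in weights)  # assumed complexity measure
    return loss + lam * complexity


# Hypothetical models: a simple one with a few training mistakes and a
# complex one that fits the training data perfectly.
simple_loss, simple_weights = 3.0, [1.0, 2.0]
complex_loss, complex_weights = 0.0, [4.0, -3.0, 5.0, -2.0]

# Without regularization the complex model looks better.
simple_cost_unregularized = cost(simple_loss, simple_weights, lam=0.0)
complex_cost_unregularized = cost(complex_loss, complex_weights, lam=0.0)

# With a penalty on complexity the simple model wins.
simple_cost_regularized = cost(simple_loss, simple_weights, lam=0.1)
complex_cost_regularized = cost(complex_loss, complex_weights, lam=0.1)
```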

To check if a model is overfitting, we can use **Holdout Cross-Validation**: split the data into training and test sets, train on one and test on the other. If the model performs well on the test set, it likely generalizes well.

The drawback is that we don’t train on all the data. **k-Fold Cross-Validation** solves this by splitting the data into *k* parts and rotating through them, training on *k–1* parts and testing on the remaining one each time. This gives a better generalization estimate without wasting data.
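The rotation through folds can be sketched as a small generator. The helper name `k_fold_splits` is hypothetical, and in practice the data would usually be shuffled before splitting.

```python
def k_fold_splits(data, k):
    """Yield (training, testing) pairs, using each of the k folds as the test set once."""
    folds = [data[i::k] for i in range(k)]  # deal the items into k folds
    for i in range(k):
        testing = folds[i]
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield training, testing


# Ten placeholder data points split into five folds.
data = list(range(10))
splits = list(k_fold_splits(data, k=5))
```

Each data point ends up in the test set exactly once, so every observation contributes to both training and evaluation across the k rounds.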

---

### scikit-learn

Python has several machine learning libraries, and **scikit-learn** is one of the most popular.

In this example, we’ll use **scikit-learn** with a **CSV dataset of counterfeit banknotes** (see the `banknotes` directory) to train a model that can detect whether a banknote is real or fake.

(Run the code on your own; the following notes just explain it.)

**The CSV**

![alt text](images/banknotes.png)

The first four columns contain features we can use to predict whether a banknote is real or fake. The last column, labeled as 0 (genuine) or 1 (counterfeit), is the target provided by a human.

We’ll train our model on this data to see if it can accurately classify new banknotes.



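```python
import csv
import random

# If scikit-learn is missing, install it first. In a notebook cell:
#   import sys
#   !{sys.executable} -m pip install scikit-learn
# The PyPI package is named scikit-learn; it is imported as sklearn.

from sklearn import svm
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Choose one model; uncomment a line to try a different classifier
# model = KNeighborsClassifier(n_neighbors=1)
# model = svm.SVC()
model = Perceptron()
```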

After importing our libraries, we can choose which model to use. 

For example, `SVC` is a Support Vector Classifier (support vector machine), while `KNeighborsClassifier` uses the k-nearest neighbors approach and requires us to specify how many neighbors to consider.



```python
# Read data in from file
with open("banknotes/banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)

    data = []
    for row in reader:
        data.append({
            "evidence": [float(cell) for cell in row[:4]],
            "label": "Authentic" if row[4] == "0" else "Counterfeit"
        })

# Separate data into training and testing groups
holdout = int(0.40 * len(data))
random.shuffle(data)
testing = data[:holdout]
training = data[holdout:]

# Train model on training set
X_training = [row["evidence"] for row in training]
y_training = [row["label"] for row in training]
model.fit(X_training, y_training)

# Make predictions on the testing set
X_testing = [row["evidence"] for row in testing]
y_testing = [row["label"] for row in testing]
predictions = model.predict(X_testing)

# Compute how well we performed
correct = 0
incorrect = 0
total = 0
for actual, predicted in zip(y_testing, predictions):
    total += 1
    if actual == predicted:
        correct += 1
    else:
        incorrect += 1

# Print results
print(f"Results for model {type(model).__name__}")
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / total:.2f}%")
```

```
Results for model Perceptron
Correct: 539
Incorrect: 9
Accuracy: 98.36%
```


The manual version of running the algorithm is in the source code file **banknotes0.py**.

Since this algorithm is commonly used in a similar way, **scikit-learn** provides higher-level functions that simplify the code. This more concise version is available in **banknotes1.py**.
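As an illustration of what such higher-level helpers look like, the sketch below uses scikit-learn's `train_test_split` to replace the manual shuffling and slicing. It is not the contents of **banknotes1.py**, and it uses synthetic two-feature data (generated here, not read from the CSV) so that it runs without the dataset.

```python
import random

from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features per sample and a label that depends
# on their sum. These values are invented; they are not the banknote data.
rng = random.Random(0)
evidence, labels = [], []
for _ in range(200):
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    evidence.append([x1, x2])
    labels.append("Authentic" if x1 + x2 > 0 else "Counterfeit")

# One call shuffles the data and holds out 40% of it for testing.
X_training, X_testing, y_training, y_testing = train_test_split(
    evidence, labels, test_size=0.4, random_state=0
)

model = Perceptron()
model.fit(X_training, y_training)
predictions = model.predict(X_testing)

# Count correct predictions on the held-out set.
total = len(y_testing)
correct = sum(actual == predicted for actual, predicted in zip(y_testing, predictions))
print(f"Accuracy: {100 * correct / total:.2f}%")
```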

---

### Reinforcement Learning

Reinforcement learning is a type of machine learning where an agent learns by taking actions and receiving feedback in the form of rewards or penalties (positive or negative numeric values) based on its behavior.

![alt text](images/reinforcement-1.png)


In reinforcement learning, the process begins with the environment giving the agent a **state**. The agent then takes an **action**, and in response, the environment returns a **new state** and a **reward**.

Rewards can be positive (encouraging the behavior) or negative (discouraging it).

This approach is useful in tasks like training walking robots—each successful step earns a reward, while each fall results in a penalty.
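The state, action, new state, reward cycle can be sketched with a toy environment. The `Corridor` class below is hypothetical and far simpler than a walking robot; it only shows the shape of the interaction loop.

```python
class Corridor:
    """Hypothetical environment: positions 0..4, a goal at 4 and a pit at 0."""

    def __init__(self):
        self.state = 2  # the agent starts in the middle

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (new_state, reward)."""
        self.state += action
        if self.state == 4:
            return self.state, 1   # reached the goal
        if self.state == 0:
            return self.state, -1  # fell into the pit
        return self.state, 0


env = Corridor()
state = env.state
history = []
for action in [+1, +1]:  # a fixed action sequence, just to show the loop
    new_state, reward = env.step(action)
    history.append((state, action, new_state, reward))
    state = new_state
```

A learning agent would choose its actions based on the rewards it has collected so far rather than following a fixed list.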

---

### Markov Decision Processes

Reinforcement learning can be framed as a **Markov Decision Process (MDP)**, which includes:

* **S**: A set of possible **states**
* **Actions(S)**: The set of possible **actions** in each state
* **P(s’ | s, a)**: The **transition model**, giving the probability of moving to state *s’* after taking action *a* in state *s*
* **R(s, a, s’)**: The **reward function**, defining the reward received when transitioning from *s* to *s’* via action *a*

![alt text](images/markov.png)

In this example, the **agent** is the yellow circle. Its goal is to reach the **green square** while avoiding the **red squares**.

* Each square represents a **state**.
* Actions are movements: up, down, left, or right.
* The **transition model** determines where the agent ends up after taking an action.
* The **reward function** gives feedback: negative for red squares (penalty), positive for the green square (goal).

For instance, if the agent is in the bottom-left and moves right onto a red square, it gets negative feedback. Over time, it learns to avoid that action in that state. The agent explores different paths, adjusting its behavior based on rewards, and learns which state-action combinations are best.

The process is often **probabilistic**, meaning the agent chooses actions based on probabilities that shift as it learns. Reaching the green square gives a positive reward, reinforcing the path that led to success.
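The four components of an MDP can be written down for a small grid. The 2×3 layout below is invented and is not the grid in the figure, and the transition model is deterministic for simplicity (each action leads to exactly one new state).

```python
# Hypothetical 2x3 grid: the goal is the top-right cell, one trap on the bottom row.
ROWS, COLS = 2, 3
GOAL = (0, 2)
TRAPS = {(1, 1)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# S: every cell of the grid is a state.
states = [(r, c) for r in range(ROWS) for c in range(COLS)]


def actions(state):
    """Actions(s): the moves that keep the agent on the grid."""
    r, c = state
    return [
        name for name, (dr, dc) in MOVES.items()
        if 0 <= r + dr < ROWS and 0 <= c + dc < COLS
    ]


def transition(state, action):
    """Deterministic transition model: the single s' reached from s via a."""
    dr, dc = MOVES[action]
    return (state[0] + dr, state[1] + dc)


def reward(state, action, new_state):
    """R(s, a, s'): +1 for reaching the goal, -1 for a trap, 0 otherwise."""
    if new_state == GOAL:
        return 1
    if new_state in TRAPS:
        return -1
    return 0


# From the bottom-left cell, moving right lands on the trap.
bottom_left = (1, 0)
landed = transition(bottom_left, "right")
feedback = reward(bottom_left, "right", landed)
```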

### Q-Learning

**Q-Learning** is a reinforcement learning algorithm that learns how good it is to take a specific action in a given state, using a function **Q(s, a)**.

At first, all Q-values are set to 0. As the agent takes actions and receives rewards, it updates these values using:

![alt text](images/qlearning.png)

Where:

* **Q(s, a)** is the current estimate for taking action *a* in state *s*
* **r** is the reward received for the action
* **s’** is the new state after the action
* **max(Q(s’, a’))** is the highest Q-value possible in the new state
* **γ** (gamma) controls how much future rewards matter
* **α** (alpha) is the learning rate, controlling how much we update old estimates

In other words, Q-learning improves its estimates over time by combining immediate rewards with the expected future rewards. It balances learning new information while still using past experience.
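A minimal sketch of the update, assuming the formula in the figure has the standard form Q(s, a) ← Q(s, a) + α(r + γ · max Q(s’, a’) − Q(s, a)). The state names, rewards, and the `q_update` helper are illustrative.

```python
from collections import defaultdict


def q_update(q, state, action, reward, new_state, new_actions, alpha=0.5, gamma=0.9):
    """Move Q(s, a) toward reward + gamma * (best Q-value available in s')."""
    # With no actions available in s' (a terminal state), the future value is 0.
    best_future = max((q[(new_state, a)] for a in new_actions), default=0.0)
    target = reward + gamma * best_future
    q[(state, action)] += alpha * (target - q[(state, action)])


q = defaultdict(float)  # every Q-value starts at 0

# Hypothetical experience: moving right from s1 reaches the goal (reward 1).
q_update(q, "s1", "right", 1.0, "goal", new_actions=[])

# Moving right from s0 gives no immediate reward, but s1 now looks promising.
q_update(q, "s0", "right", 0.0, "s1", new_actions=["left", "right"])
```

After these two updates Q(s1, right) is 0.5 and Q(s0, right) is 0.225: the reward at the goal has started to propagate back to the state before it.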

A **Greedy Decision-Making algorithm** always picks the action with the highest current Q-value. It exploits what it already believes to be best but never explores actions that might turn out to be better.

This leads to the **explore vs. exploit** tradeoff. Exploiting means sticking with proven choices; exploring means trying something new that might be even better. Think of always playing your favorite songs vs. discovering new ones you might like more.

To balance both, we use the **ε-greedy algorithm**:

* With probability **1 – ε**, it picks the best-known move (exploit).
* With probability **ε**, it picks a random move (explore).
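The two cases above map onto a single `if` statement. In this sketch `q_values` is a hypothetical dictionary from actions to their current Q-values for one state.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore
    return max(q_values, key=q_values.get)      # exploit


# Hypothetical Q-values for the actions available in one state.
q_values = {"left": 0.1, "right": 0.7, "up": 0.3}

always_best = epsilon_greedy(q_values, epsilon=0.0)  # pure exploitation
sometimes_random = epsilon_greedy(q_values, epsilon=0.1)
```

Setting ε to 0 recovers the greedy algorithm, while ε equal to 1 makes every move random.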

Sometimes, we give feedback only at the end of a task instead of after each move. Take the game of **Nim**: an AI plays many random games and only gets a reward at the end—+1 for a win, –1 for a loss. After enough games (e.g., 10,000), it starts to play much more strategically.

In complex games like **chess**, it’s impossible to store a Q-value for every state and move. Instead, we use **function approximation**, where Q(s, a) is estimated using features of the state and action. This allows the model to generalize and make smart decisions even in situations it hasn’t seen before.

---

### Unsupervised Learning

In all the previous examples, we used **supervised learning**, where the data came with **labels**—answers the algorithm could learn from. For instance, when training a model to detect counterfeit banknotes, each note had four input features and a label indicating whether it was real or fake.

In **unsupervised learning**, there are **no labels**. The algorithm only gets the input data and must find patterns or structure on its own.

#### Clustering

**Clustering** is a common unsupervised learning method. It groups similar data points together based on their features. The goal is for items in the same group (or *cluster*) to be more similar to each other than to those in other groups.

Clustering is useful in many areas—for example:

* In **genetics**, to identify groups of similar genes.
* In **image segmentation**, to divide an image into regions based on pixel similarity.

----

#### k-means Clustering

**k-means clustering** is an algorithm used to group data into **k clusters**. Here's how it works:

1. **Initialization**: All data points are plotted in a feature space. Then, **k cluster centers** (also called centroids) are placed at random positions. The number of clusters, **k**, is chosen by the programmer.

2. **Assignment**: Each data point is assigned to the nearest cluster center. This creates k groups of points.

3. **Update**: Each cluster center is moved to the **mean (average) position** of the points assigned to it.

4. **Repeat**: Steps 2 and 3 are repeated—reassigning points to the nearest updated center and recalculating the center positions—until the cluster assignments no longer change. This means the algorithm has **converged**.

The result is a set of clusters in which each point belongs to the cluster with the nearest center. Because the centers start at random positions, different runs can end in different clusterings.
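The four steps can be sketched in plain Python. One deviation from the description, made for simplicity: the initial centers are chosen from the data points themselves rather than placed at arbitrary random positions. The sample points are invented.

```python
import math
import random


def k_means(points, k, rng=None, max_iterations=100):
    """Return (centers, assignments) after running k-means on `points`."""
    rng = rng or random.Random()
    # 1. Initialization (assumption: start from k randomly chosen data points).
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # 2. Assignment: index of the nearest center for every point.
        new_assignments = [
            min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points
        ]
        # 4. Stop once the assignments no longer change (convergence).
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # 3. Update: move each center to the mean of its assigned points.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments


# Hypothetical data: two visibly separate groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignments = k_means(points, k=2, rng=random.Random(0))
```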

![alt text](images/kclustering.png)
