# Module 2 - Intro to Machine Learning

## Module Learning Outcomes 🎯

---

1.  Identify the foundational concepts associated with machine learning.
2.  Determine whether a given variable is an **input** or **output** in a machine learning context.
3.  Classify the goal of a machine learning project as either **forecasting** or **inference**.
4.  Differentiate between **machine learning** and **statistics**.
5.  Analyse fundamental machine learning concepts, including variable classification, functions, forecasting vs inference and the distinction between machine learning and statistics.
6.  Identify **data types** in machine learning applications.
7.  Classify machine learning problems as either **prediction** or **classification**.
8.  Identify the characteristics of **parametric vs non-parametric** approaches.
9.  Differentiate between **supervised** and **unsupervised** machine learning.
10. Analyse the first three steps of the machine learning process.
11. Analyse the steps required to **handle missing data** in Python.
12. Implement the steps to **handle missing data** in Python.
13. Examine the key concepts of machine learning, specifically the major dividing lines in the machine learning landscape and the machine learning process.
14. Identify a real-world machine learning problem for a specific industry.
15. Determine if machine learning is a suitable solution for a specific business problem.

## Foundational Concepts of Machine Learning 🧠

---

Understanding what machine learning is and its role in today's technological landscape is essential for effectively applying it to your field.

In this section, you will explore key concepts, including:
* **Variables** and **functions**
* **Inputs** and **outputs**
* **Forecasting** and **inference**
* The similarities and differences between **machine learning** and **statistics**

You will begin with a quick brainstorming exercise to familiarise yourself with these concepts.

## What is Machine Learning? 

# **Module 2, Video 2: What is machine learning?**

---

## **The Fundamental Problem in Machine Learning**

A large part of machine learning is about understanding the relationship between a number of **input variables** (denoted $X_1$ up to $X_P$) and an **output variable** (denoted $Y$).

This relationship can be described as a function 'f' that we want to learn. However, learning this function is tricky for two main reasons:
* **Limited data**: We only have a limited number of data points to learn the relationship from.
* **Presence of noise**: The connection between the input and output variables is stochastic, or random. This could be due to unmeasured variables or because the true relationship is inherently random.
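
One common way to write this setup down is as an additive-noise model. The noise symbol $\varepsilon$ and the estimate $\hat{f}$ are notation assumed here for illustration; they do not appear in the notes above.

$$Y = f(X_1, \dots, X_P) + \varepsilon$$

Learning then means using the $n$ available data points to build an estimate $\hat{f}$ of $f$. Limited data and the noise term $\varepsilon$ are the two reasons $\hat{f}$ will generally differ from $f$.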

<br>

### **Terminology**

**Input variables** are also known by several other names:
* Independent variables
* Predictors
* Features
* Fields

Likewise, the **output variable** has various synonyms:
* Dependent variable
* Response variable
* Target variable
* Outcome variable

<br>

---

## **Why Learn the Relationship Between Input and Output Variables?**

There are two primary reasons we want to learn the function 'f':
1.  Forecasting
2.  Inference

<br>

### **1. Forecasting**

In forecasting, the main goal is to accurately **predict the value of the outcome variable (Y)** for new data where we only have the input variables. Prediction accuracy is the top priority, and the underlying function 'f' can be highly complex. For instance, neural networks are often used for this purpose, creating complicated but effective predictors.

**Examples of Forecasting:**
* **Medicine**: Determining if a patient has a specific type of cancer based on characteristics of a blood sample.
* **Finance**: Detecting fraudulent expense or tax claims based on their properties.

<br>

### **2. Inference**

In inference, the focus shifts to **understanding the relationship** between the input and output variables. We want the model for the function 'f' to be simple and interpretable by a human decision-maker.

**Examples of Inference:**
* **Marketing**: Understanding how different marketing campaigns (e.g., TV, radio, newspaper) affect sales to determine the optimal marketing mix.
* **Real Estate**: Not just predicting a house price, but also understanding how much specific features, like a river view, contribute to the price.

![Screenshot%202025-07-25%20at%204.18.03%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.18.03%E2%80%AFpm.png)

![Screenshot%202025-07-25%20at%204.18.18%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.18.18%E2%80%AFpm.png)

![Screenshot%202025-07-25%20at%204.18.32%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.18.32%E2%80%AFpm.png)

![Screenshot%202025-07-25%20at%204.20.27%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.20.27%E2%80%AFpm.png)

![Screenshot%202025-07-25%20at%204.20.47%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.20.47%E2%80%AFpm.png)



# Key Terms in Statistics and Machine Learning

This document summarizes key terms used in statistics, machine learning, or both fields.

---

## Key Terms Used in Machine Learning

* **Predictive accuracy**: Describes how well predicted values match the actual values on a given test set of data for a given model.
* **Black box**: A system or process, the internal workings of which are hidden or not fully understood.
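
As a rough illustration of predictive accuracy for a classification model, the sketch below counts how often predictions match the actual labels on a test set. The labels and the helper name `predictive_accuracy` are made up for this example.

```python
def predictive_accuracy(actual, predicted):
    """Fraction of test-set predictions that exactly match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("the test set must not be empty")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Hypothetical test-set labels and model predictions (illustrative only).
actual_labels = ["spam", "ham", "ham", "spam", "ham"]
predicted_labels = ["spam", "ham", "spam", "spam", "ham"]

accuracy = predictive_accuracy(actual_labels, predicted_labels)
print(f"Accuracy: {accuracy:.2f}")  # 4 of the 5 predictions match
```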

---

## Key Terms Used in Statistics

* **Stochastic data models**: Models that assume the data is generated by a random (stochastic) process described by a probability distribution. Common examples include the normal, binomial and Student's t-distributions.
* **Goodness of fit**: A measure that summarises the discrepancy between the observed values and expected values given the model under consideration. A common example is the root mean square error (RMSE).
* **Summary statistics**: A set of values used to describe a data set. Common summary statistics include mean, standard deviation and median.
* **Residual analysis**: A residual is the difference between the observed value and the value predicted by the model. Analysis of residuals can determine how useful a model is.
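
The statistical terms above can be sketched with Python's standard library. The observed and predicted values are invented for illustration, and RMSE is used here as one possible goodness-of-fit measure.

```python
import math
import statistics

# Hypothetical observed values and model predictions (illustrative only).
observed = [3.0, 5.0, 7.5, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]

# Summary statistics describe the data set itself.
mean_observed = statistics.mean(observed)
median_observed = statistics.median(observed)
stdev_observed = statistics.stdev(observed)  # sample standard deviation

# A residual is the observed value minus the predicted value.
residuals = [o - p for o, p in zip(observed, predicted)]

# RMSE: square the residuals, average them, then take the square root.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

print(f"mean={mean_observed}, median={median_observed}, rmse={rmse:.3f}")
```

A residual analysis would typically go further than a single number, for example by checking whether the residuals show a pattern that the model has missed.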

---

## Key Terms Used in Both Statistics and Machine Learning

* **Model validation**: The process of confirming the model achieves its purpose.
* **Generalisation**: The ability of the model to adapt to new, previously unseen data.

![Screenshot%202025-07-25%20at%204.23.59%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.23.59%E2%80%AFpm.png)

# Mini-Lesson 2.2: Similarities and Differences Between Machine Learning and Statistics

---

While machine learning (ML) has much in common with statistics, there are also important differences. The field of statistics has existed for centuries, with its theory developed largely without computers. In contrast, while ML's foundations emerged in the 1950s, its widespread use grew in the 1980s and 1990s with the rise of large-scale data collection and affordable computing power.

## Data and Models 📊

The primary difference lies in how each field approaches data and modeling.

* **Statistics**: It is typically assumed that data is generated by a given **stochastic data model**, which is a model that incorporates randomness to describe underlying patterns. The goal is to estimate the model's parameters and then use **summary statistics**—a set of values like mean, standard deviation, and median—to describe the data set.

* **Machine Learning**: The data is treated as coming from a complex, unknown process, and fewer assumptions are made about its structure. ML algorithms learn patterns directly from the data to find a function that maps input variables ($x$) to predictions ($y$).

## Model Validation and Goals 🎯

The validation methods and ultimate goals also differ significantly.

* **Statistics**: Tends to focus on the **explainability and simplicity** of the model. Validation is often done using **goodness-of-fit** metrics, which measure the discrepancy between observed and expected values, and **residual analysis**, which is the analysis of the difference between the observed and predicted values.

* **Machine Learning**: Focuses on **generalisation**, which is the model's ability to adapt to new, previously unseen data. It is validated using predictive accuracy, sometimes at the cost of explainability. This can result in a **black box** model, which is a system or process where the internal workings are hidden or not fully understood.

## Crossover

Note that machine learning uses a lot of statistical methods (for example, logistic regression is a statistical method used for classification problems in ML). There is a great deal of crossover in the terms used, and both fields are large and cover many different techniques.

# Prediction vs Classification

There are several major dividing lines in the machine learning landscape that help classify problems and guide the selection of suitable methods.

In this section, you’ll explore the first major distinction: prediction vs classification. Additionally, you’ll examine various data types and their role in machine learning and artificial intelligence (AI).

![Screenshot%202025-07-25%20at%204.28.15%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.28.15%E2%80%AFpm.png)

### **Video 2.3: Prediction vs Classification**

One of the major dividing lines in machine learning is between prediction and classification problems. Understanding this helps categorise algorithms and their trade-offs.

---

### **Prediction Problems (Regression)** 📈

Prediction problems, also called **regression** or **estimation** problems, have an output variable that is a continuous number.

* **Example**: Estimating the value of a house price. The input variables might be categorical (e.g., Is it a Victorian house?) or numeric (e.g., square footage), but the important thing is that the output variable (the price) is a number.

---

### **Classification Problems** 🏷️

In classification problems, the output variable is a **category**, not a number.

* **Example**: A spam filter that categorises emails into `spam` or `non-spam` (also called ham) based on the properties of the email.
* **Another Example**: A medical test where the outcome is not yes/no, but a category like `patient has cancer type A`, `patient has cancer type B`, or `patient does not have cancer`.

---

### **Types of Categorical Variables**

It's also important to distinguish between two types of categorical variables.

* **Nominal Categorical Variable**: This type does not have any natural ordering. For example, the categories `Cancer A`, `Cancer B`, and `No Cancer` have no inherent order.
* **Ordinal Categorical Variable**: This type has a natural ordering. For example, the categories `high`, `medium`, and `low` have a clear sequence.
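
A minimal sketch of how the two types differ in practice: an ordinal variable can be ranked and sorted, whereas a nominal variable can only be compared for equality or counted. The numeric ranks below are an illustrative choice, not a standard.

```python
# Ordinal: the categories have a natural order, so we can record it explicitly.
# The ranks 0, 1, 2 are an assumption made for this example.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

risk_levels = ["high", "low", "medium", "low"]
sorted_by_risk = sorted(risk_levels, key=RISK_ORDER.get)

# Nominal: there is no natural order, so we only count occurrences.
diagnoses = ["Cancer A", "No Cancer", "Cancer B", "No Cancer"]
diagnosis_counts = {}
for diagnosis in diagnoses:
    diagnosis_counts[diagnosis] = diagnosis_counts.get(diagnosis, 0) + 1

print(sorted_by_risk)
print(diagnosis_counts)
```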

![Screenshot%202025-07-25%20at%204.30.42%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.30.42%E2%80%AFpm.png)



![Screenshot%202025-07-25%20at%204.31.11%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.31.11%E2%80%AFpm.png)

![Screenshot%202025-07-25%20at%204.31.30%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.31.30%E2%80%AFpm.png)

### **Video 2.4: Parametric vs Non-Parametric Approaches**

Another important distinction in machine learning is between **parametric** and **non-parametric** approaches. This distinction is based on whether we make an assumption about the form of the function 'f' that we want to estimate.

---

### **Parametric Approaches** 📐

Parametric approaches simplify the problem by making an assumption about the shape of the function 'f'.

* **How it works**: An assumption is made that the function 'f' has a specific form, such as being a linear function. This gives the approach a lot of power because the task is reduced to learning just a few parameters, like the slopes of the linear function.
* **Risk**: The main risk is that the assumption might be wrong, as the true underlying function may not match the chosen form.

---

### **Non-Parametric Approaches** 〰️

Non-parametric approaches do not make any strong assumptions about the shape of the function 'f'.

* **How it works**: These methods are flexible enough to learn any shape for the function 'f' without needing to make assumptions about the true relationship between the input and output variables.
* **Cost**: This power comes at a "hefty price". Non-parametric approaches typically need much more data to learn a reliable functional relationship compared to parametric methods.
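
To make the contrast concrete, here is a small sketch under stated assumptions: a straight-line fit stands in for a parametric approach, and a k-nearest-neighbours average stands in for a non-parametric one. The data and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Parametric: assume f(x) = intercept + slope * x and learn two numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def knn_predict(xs, ys, x_new, k=2):
    """Non-parametric: average the outputs of the k closest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical training data that follows y = 2x + 1 exactly (no noise).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(train_x, train_y)
line_prediction = intercept + slope * 10.0            # follows the assumed line
knn_prediction = knn_predict(train_x, train_y, 10.0)  # averages nearby points

print(intercept, slope, line_prediction, knn_prediction)
```

The line needs only two learned parameters, but it is only as good as the linearity assumption. The nearest-neighbour predictor assumes no shape, so it relies entirely on having training points close to the new input.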

### **Parametric vs. Non-Parametric Approaches in Machine Learning 🧠**

In machine learning, the choice between **parametric** and **non-parametric** approaches affects how models learn from data and make predictions. Let's look at an overview of each approach. Please note that detailed applications of these methods will be covered in later modules; this mini-lesson briefly introduces their potential uses.

---

#### **Parametric Approaches** 📐

Parametric approaches make specific assumptions about the underlying function relating input to output variables. This means they assume that the relationship between inputs and outputs can be represented by a specific mathematical form, such as a linear equation.

* They estimate a finite number of **parameters**, simplifying the learning process.
* They are generally **faster to train** and more effective with **smaller data sets**.
* **Examples of parametric approaches include**:
    * **Linear regression**: Predicting continuous outcomes (e.g., sales forecasting).
    * **Logistic regression**: Classifying binary outcomes (e.g., spam detection).
    * **Naïve Bayes**: Text classification (e.g., sentiment analysis).

---

#### **Non-Parametric Approaches** 〰️

Non-parametric approaches do not assume a specific functional form, allowing for greater flexibility.

* They adapt to data without prior distribution assumptions.
* They are capable of capturing **complex relationships**.
* They require **larger data sets** for reliable performance and can be computationally intensive.
* **Examples of non-parametric approaches include**:
    * **K-nearest neighbours (KNN)**: Classification tasks (e.g., recommendation systems).
    * **Decision trees**: Used for both classification and regression (e.g., customer segmentation).
    * **Support vector machines**: Effective in complex classification problems (e.g., bioinformatics).

---

### **Key Differences and Trade-offs**

Parametric models are often faster and simpler but may be less accurate if the data does not fit the assumed pattern, whereas non-parametric models offer greater flexibility but require more data and computational resources.

Non-parametric models are more prone to **overfitting** if not properly tuned. Overfitting occurs when a model learns noise instead of patterns, performing well on training data but failing to generalise. In parametric models, this happens when the model is too complex, while in non-parametric models, it occurs when the model memorises training data and adapts to every fluctuation.
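
The memorisation effect can be sketched with a 1-nearest-neighbour predictor, chosen here only as a convenient illustration. The data is invented: the underlying pattern is $y = x$, with hand-picked noise added to the training outputs.

```python
def one_nn_predict(train, x_new):
    """Return the output of the single closest training point (1-NN)."""
    _, y_nearest = min(train, key=lambda pair: abs(pair[0] - x_new))
    return y_nearest


def mean_squared_error(data, train):
    """Average squared difference between actual and 1-NN predicted outputs."""
    return sum((y - one_nn_predict(train, x)) ** 2 for x, y in data) / len(data)


# Hypothetical data: the true pattern is y = x; training outputs are noisy.
train = [(1.0, 1.4), (2.0, 1.7), (3.0, 3.5), (4.0, 3.6)]
test = [(1.2, 1.2), (2.2, 2.2), (3.2, 3.2), (4.2, 4.2)]

train_error = mean_squared_error(train, train)  # each point finds itself
test_error = mean_squared_error(test, train)    # the memorised noise now hurts

print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```

The training error is zero because every training point is its own nearest neighbour, yet the error on new data is clearly larger. That gap between training and test performance is the usual symptom of overfitting.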

# Supervised vs Unsupervised

### **Video 2.5: Supervised vs. Unsupervised Learning**

The final distinction is between supervised and unsupervised learning. In fact, everything discussed so far relates to supervised learning.

---

### **Supervised Learning** 🧑‍🏫

In supervised learning, we have a set of input variables, $X_1$ to $X_P$, as well as an output variable, $Y$. The goal is to study the functional relationship between these input variables and the output variable, based on a training dataset.

---

### **Unsupervised Learning** 🤖

In unsupervised learning, we do not have an output variable $Y$. Instead, our training data is a set of records, where each record contains the values for all the input variables.

The goal is to understand how the data decomposes structurally into different groups.
* **Example**: Consider a retailer's customer database. Each customer can be described by their previous purchases (the input variables). A natural question is whether the customer dataset naturally forms various groups or **clusters**. The customers do not come with labels like "big customer" or "regular customer." Instead, we want to cluster the customers into groups based on their input variables alone.
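
As a hedged sketch of clustering, the code below runs a very small k-means on a single input variable. K-means is not named in the notes above; it is used here only as one well-known clustering method, and the spend figures and starting centroids are invented.

```python
def k_means_1d(values, centroids, iterations=10):
    """Tiny k-means for one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its assigned values.
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return assignments, centroids


# Hypothetical annual spend per customer; no labels are provided.
annual_spend = [120.0, 150.0, 135.0, 980.0, 1020.0, 1100.0]
labels, centres = k_means_1d(
    annual_spend, centroids=[min(annual_spend), max(annual_spend)]
)

print(labels)
print(centres)
```

The algorithm is never told which customers are "big" or "regular"; the two groups emerge from the input values alone.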

### **Video 2.6: The Machine Learning Process: Part One**

In an idealized form, a typical machine learning project can be broken down into 10 distinct steps. Let's look at the first three.

---

### **Step 1: Define the Project's Purpose**

The first step is to define the purpose of the project with your client or sponsor. This involves determining:
* Whether the project is a **one-off effort** or an **ongoing procedure**.
* Who the end users will be and how they will interpret the results. A one-off project might be used by a handful of experts, while an ongoing one could have many different end users with varying requirements.

---

### **Step 2: Pull Together the Data**

Next, you need to pull together the data for your analysis.
* This involves figuring out what the results will eventually be used for.
* The data can come from internal sources (like customer or purchasing databases) or external sources (like credit rating databases).

---

### **Step 3: Explore, Clean, and Preprocess Data**

The third step is to explore, clean, and preprocess the data. This includes tasks like scaling some of the fields so they can be better handled by machine learning approaches. It also involves dealing with **missing data** and **outliers**. There are three broad ways to handle these:
1.  **Remove**: Simply remove the affected records.
2.  **Manual Fill**: Use domain experts to look at the data and try to replace missing or outlier values with the correct values.
3.  **Algorithmic Fill**: Use an algorithm to replace missing data with values such as the average of that field across the other records.

### **Mini-lesson 2.6: Basics of Preprocessing Data** 🧹

Data preprocessing is a crucial step in machine learning because raw data is often incomplete, inconsistent, or noisy. Properly prepared data improves model performance, ensures reliability, and prevents misleading insights that can lead to inaccurate predictions.

---

### **Steps in Data Preprocessing**

#### **1. Data Cleaning**
* Handling missing values through removal or imputation.
* Correcting inconsistencies and duplicate records.
* Removing outliers that could distort the analysis.
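
One possible way to flag outliers is the 1.5 × IQR rule, sketched below. This rule is a common convention assumed for illustration; the notes above do not prescribe a specific method, and the readings are invented.

```python
import statistics

# Hypothetical sensor readings with one suspicious value (illustrative only).
readings = [10, 12, 11, 13, 12, 95, 11, 10]

# Quartiles split the sorted data into four parts; IQR is the middle spread.
q1, _, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [r for r in readings if r < lower_bound or r > upper_bound]
cleaned = [r for r in readings if lower_bound <= r <= upper_bound]

print(outliers)
print(cleaned)
```

Whether a flagged value should actually be removed is a judgement call; an extreme value can be a genuine observation rather than an error.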

#### **2. Data Transformation**
* **Normalisation**: Scaling values to a fixed range (e.g., 0 to 1).
* **Standardisation**: Centring and rescaling data to have zero mean and unit variance.
* **Encoding categorical variables**: Converting categorical variables into binary vectors. For example, a `colour` variable with values `red`, `blue`, and `green` becomes three features: `is_red`, `is_blue`, and `is_green`.
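
The three transformations above can be sketched with the standard library alone. The function names and data are invented for this example, and the scaling functions assume the values are not all identical (otherwise they would divide by zero).

```python
import statistics


def min_max_normalise(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardise(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Turn each category into a dict of binary is_<category> features."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical data (illustrative only).
heights_cm = [150.0, 160.0, 170.0, 180.0]
colours = ["red", "blue", "green", "red"]

normalised = min_max_normalise(heights_cm)
standardised = standardise(heights_cm)
encoded = one_hot(colours)

print(normalised)
print(encoded[0])
```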

#### **3. Data Reduction**
* **Feature selection**: Removing irrelevant or redundant variables.
* **Dimensionality reduction**: Reducing the number of variables while preserving important patterns.
* **Sampling**: Reducing the data set size while maintaining representativeness.

#### **4. Feature Engineering**
* Creating new features that improve model learning.
* Combining or splitting variables for better representation.
* Transforming data using domain knowledge (e.g., converting a date into seasons).
* **Example**: Transforming a birth date into a more useful `age` feature by subtracting the birth year from the current year.
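
A minimal sketch of both date-based features follows. It refines the year subtraction slightly by checking whether the birthday has already passed, and the season mapping assumes the northern hemisphere; both function names are hypothetical.

```python
from datetime import date


def age_in_years(birth_date, today):
    """Age in whole years, accounting for whether the birthday has passed."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def season_northern(d):
    """Meteorological season, assuming the northern hemisphere."""
    seasons = {
        12: "winter", 1: "winter", 2: "winter",
        3: "spring", 4: "spring", 5: "spring",
        6: "summer", 7: "summer", 8: "summer",
        9: "autumn", 10: "autumn", 11: "autumn",
    }
    return seasons[d.month]


# Hypothetical record; the reference date is passed in so results are repeatable.
birth = date(1990, 7, 25)
print(age_in_years(birth, today=date(2025, 7, 24)))  # day before the birthday
print(season_northern(date(2025, 1, 15)))
```

Plain year subtraction would give the same answer on or after the birthday and be one year too high before it, which may or may not matter for a given model.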

#### **5. Data Splitting**
* Dividing data into training, validation, and test sets.
    * **Training set**: Trains the model to recognise patterns and make predictions.
    * **Validation set**: Used to tune hyperparameters and monitor performance during training.
    * **Test set**: Provides an unbiased assessment of the final model's performance on data it has not seen during training or tuning.
* Ensuring proper stratification when dealing with imbalanced classes.
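
A simple, unstratified split can be sketched as a shuffle followed by slicing. The 60/20/20 proportions, the seed and the function name are assumptions for illustration; stratified splitting for imbalanced classes would need extra logic not shown here.

```python
import random


def split_data(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the records, then slice into train/validation/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after the first two
    return train, validation, test


# Hypothetical data set of ten record identifiers.
records = list(range(10))
train_set, validation_set, test_set = split_data(records)

print(len(train_set), len(validation_set), len(test_set))
```

Every record lands in exactly one of the three sets, so the test set stays untouched while the model is trained and tuned.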

# Module 2: Machine Learning Notes

This document contains a compilation of notes covering the fundamental concepts of machine learning discussed.

---

## **What is Machine Learning?**

A large part of machine learning is about understanding the relationship between **input variables** (features) and an **output variable**. This is primarily done for two reasons: forecasting or inference.

* **Forecasting**: The main goal is to accurately predict the output variable for new, unseen data. Prediction accuracy is the top priority, and the underlying model can be a complex "black box".
* **Inference**: The focus is on understanding the relationship between the inputs and output. This requires simpler, more interpretable models to explain how the inputs affect the outcome.

---

## **Key Terminology**

#### **Terms Used in Machine Learning**
* **Predictive accuracy**: Describes how well predicted values match the actual values on a given test set of data for a given model.
* **Black box**: A system or process, the internal workings of which are hidden or not fully understood.

#### **Terms Used in Statistics**
* **Stochastic data models**: Models that assume the data is generated by a random (stochastic) process described by a probability distribution. Common examples include the normal, binomial and Student's t-distributions.
* **Goodness of fit**: A measure that summarises the discrepancy between the observed values and expected values given the model under consideration. A common example is the root mean square error (RMSE).
* **Summary statistics**: A set of values used to describe a data set. Common summary statistics include mean, standard deviation and median.
* **Residual analysis**: A residual is the difference between the observed value and the value predicted by the model. Analysis of residuals can determine how useful a model is.

#### **Terms Used in Both**
* **Model validation**: The process of confirming the model achieves its purpose.
* **Generalisation**: The ability of the model to adapt to new, previously unseen data.

---

## **Fundamental Distinctions in Machine Learning**

Machine learning algorithms can be categorized based on several key distinctions.

### **Prediction vs. Classification** 📈
* **Prediction (Regression)**: The output variable is a continuous number (e.g., predicting a house price).
* **Classification**: The output variable is a category (e.g., classifying an email as `spam` or `not spam`).
    * **Nominal Variable**: A category with no natural order (e.g., `red`, `green`, `blue`).
    * **Ordinal Variable**: A category with a natural order (e.g., `low`, `medium`, `high`).

### **Parametric vs. Non-Parametric Approaches** 🧠
This distinction is about the assumptions a model makes about the data.

* **Parametric Approaches** 📐
    * Assume a specific mathematical form for the relationship between inputs and outputs (e.g., a linear equation).
    * Are generally faster to train and work well with smaller data sets.
    * Examples: Linear Regression, Logistic Regression.

* **Non-Parametric Approaches** 〰️
    * Do not assume a specific functional form, allowing for greater flexibility.
    * Require larger data sets and can be more computationally intensive.
    * Are more prone to **overfitting** if not properly tuned.
    * Examples: K-Nearest Neighbours (KNN), Decision Trees, Support Vector Machines.

### **Supervised vs. Unsupervised Learning** 🧑‍🏫
This is about whether the data has a defined output variable.

* **Supervised Learning**: The training data includes both input variables ($X$) and an output variable ($Y$). The goal is to learn the mapping function from inputs to the output.
* **Unsupervised Learning**: The training data only includes input variables ($X$) and has no corresponding output variable. The goal is to discover underlying patterns or structures in the data, such as **clusters**.

---

## **The Machine Learning Process**

An idealized machine learning project can be broken down into several steps.

#### **Step 1: Define the Project's Purpose**
* Define the project's goal with the client or sponsor.
* Determine if it's a **one-off effort** or an **ongoing procedure**, as this impacts who the end-users are.

#### **Step 2: Pull Together the Data**
* Gather data from internal sources (e.g., customer databases) or external sources (e.g., credit rating databases).

#### **Step 3: Explore, Clean, and Preprocess Data**
* This crucial step involves preparing the data for modeling. Key tasks include scaling features and handling **missing data** and **outliers**.

---

## **Basics of Data Preprocessing** 🧹

Properly prepared data improves model performance and ensures reliability.

#### **1. Data Cleaning**
* Handling missing values (removal or imputation).
* Correcting inconsistencies and duplicate records.
* Removing outliers.

#### **2. Data Transformation**
* **Normalisation**: Scaling values to a fixed range (e.g., 0 to 1).
* **Standardisation**: Centring and rescaling data to have zero mean and unit variance.
* **Encoding categorical variables**: Converting categories into numerical vectors.

#### **3. Data Reduction**
* **Feature selection**: Removing irrelevant or redundant variables.
* **Dimensionality reduction**: Reducing the number of variables while preserving patterns.

#### **4. Feature Engineering**
* Creating new, more useful features from existing data (e.g., creating an `age` feature from a birth date).

#### **5. Data Splitting**
* Dividing data into:
    * **Training set**: To train the model.
    * **Validation set**: To tune model hyperparameters.
    * **Test set**: To assess final model performance on unseen data.

---

## **Handling Missing Data in Python** 🐍

Pandas is a common library for managing missing data.

#### **Detecting Missing Data**
* Use `df.isnull()` to find missing values and `df.isnull().sum()` to count them per column.

#### **Handling Methods**
1.  **Removing Data**:
    * `df.dropna()` removes rows (or columns with `axis=1`) containing missing values.
    * Best used when the data set is large and the proportion of missing data is small.

2.  **Imputing (Filling) Data**:
    * **Simple Imputation**: Use `df.fillna()` to fill with the `mean`, `median`, or `mode`.
    * **Interpolation**: Use `df.interpolate()` to estimate values based on trends in the data, which is especially useful for time-series.
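
The detection and handling methods above can be tried on a small example. The DataFrame below is invented for illustration, and filling with the column mean is just one of the simple imputation choices mentioned.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values (illustrative only).
df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0, 35.0],
        "income": [50000.0, 62000.0, np.nan, 58000.0],
    }
)

# Detect: count missing values per column.
missing_per_column = df.isnull().sum()

# Remove: drop every row that contains at least one missing value.
rows_dropped = df.dropna()

# Impute: fill each column's gaps with that column's own mean.
mean_filled = df.fillna(df.mean())

# Interpolate: estimate gaps linearly from the neighbouring rows.
interpolated = df.interpolate()

print(missing_per_column)
print(interpolated)
```

None of these calls modifies `df` in place; each returns a new object, so the original data stays available for comparing the approaches.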

Examples of AI use:

1) https://medium.com/pytorch/ai-for-ag-production-machine-learning-for-agriculture-e8cfdb9849a1


![Screenshot%202025-07-25%20at%204.47.17%E2%80%AFpm.png](attachment:Screenshot%202025-07-25%20at%204.47.17%E2%80%AFpm.png)