# Introduction to Machine Learning 

## What is Machine Learning?
One definition: *"Machine learning is the semi-automated extraction of knowledge from data"*

- **Knowledge from data**: Starts with a question that might be answerable using data
- **Automated extraction**: A computer provides the insight
- **Semi-automated**: Requires many smart decisions by a human


<center><img src="../img/introduction-to-machine-learning.png"/></center>

## How does Machine Learning work?
A machine learning system learns from historical data, builds a prediction model, and uses that model to predict the output for new data as it arrives. In general, more (and more representative) training data helps build a better model, which in turn improves the accuracy of the predictions.

Suppose we have a complex problem that requires predictions. Instead of hand-writing the decision logic, we feed the data to a generic algorithm, which builds the logic from the data and predicts the output. Machine learning changes how we approach such problems. The following block diagram depicts how a machine learning algorithm operates:

<center><img src="../img/how-ml-works.png"/></center>

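As a minimal sketch of this idea, the snippet below contrasts a hand-written rule with a model that learns the rule from examples. It assumes scikit-learn is installed; the temperature readings, the labels, and the `hand_written_rule` helper are all made up for illustration and have nothing to do with the dataset used later.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: temperature readings (in Celsius) and a label for each.
temperatures = [[5], [10], [15], [25], [30], [35]]
labels = ["cold", "cold", "cold", "hot", "hot", "hot"]


# Traditional approach: a person writes the decision logic by hand.
def hand_written_rule(temperature):
    return "hot" if temperature >= 20 else "cold"


# Machine learning approach: a generic algorithm builds the logic from data.
model = DecisionTreeClassifier(random_state=0)
model.fit(temperatures, labels)

# Predict the output for new, previously unseen inputs.
new_readings = [[8], [32]]
predictions = model.predict(new_readings)
print(list(predictions))
```

The same `fit` and `predict` calls would work unchanged on a different dataset, which is what makes the algorithm generic: only the data changes, not the code.
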
# Machine Learning Terminology

## Types of Data 
- **Quantitative**: Numerical, measurable quantities for which arithmetic operations often make sense
    - **Continuous**: can take on any value within an interval, with many possible values
    - **Discrete**: countable values, typically a finite number of values
- **Categorical**: Classifies individuals or items into different groups
    - **Ordinal**: groups have an order or ranking
    - **Nominal**: groups are merely names, with no ranking
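
One possible way to represent these data types in pandas is sketched below. The `patients` table, its column names, and its values are hypothetical and exist only to illustrate the four types.

```python
import pandas as pd

# Toy records; every value here is made up for illustration.
patients = pd.DataFrame({
    "weight_kg": [61.5, 72.0, 80.3],          # quantitative, continuous
    "num_visits": [1, 4, 2],                  # quantitative, discrete
    "severity": ["low", "high", "medium"],    # categorical, ordinal
    "blood_group": ["A", "O", "B"],           # categorical, nominal
})

# Ordinal: categories carry an explicit order, so comparisons make sense.
patients["severity"] = pd.Categorical(
    patients["severity"], categories=["low", "medium", "high"], ordered=True
)

# Nominal: categories are just labels with no ranking.
patients["blood_group"] = pd.Categorical(patients["blood_group"])

mean_weight = patients["weight_kg"].mean()    # arithmetic is meaningful
worst_severity = patients["severity"].max()   # uses the declared order
print(patients.dtypes)
print(mean_weight, worst_severity)
```

Taking the maximum of `blood_group` would raise an error, because an unordered categorical has no ranking to compare against.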

## Observation and Feature
- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)


## Early-stage Diabetes Risk Prediction Dataset (ESDRPD)
These datasets were used to develop machine and deep learning classifiers to predict diabetes. The two datasets were separately used to compare how each classifier performed during model training and testing phases.

**Source:** Joseph, Lionel Prakasah; Joseph, Erica Angelic; Prasad, Ramendra (2022), “Diabetes Datasets”, Mendeley Data, V1, doi: 10.17632/7zcc8v6hvp.1

## Dataset Documentation 
| Variable Name       | Type         | Values                | Description                                              |
|---------------------|--------------|-----------------------|----------------------------------------------------------|
| Age                 | Numeric      | -                     | Represents the age of the patient in years.              |
| Gender              | Categorical  | Male, Female          | Indicates the gender of the patient.                      |
| Polyuria            | Categorical  | Yes, No               | Presence or absence of excessive urination.              |
| Polydipsia          | Categorical  | Yes, No               | Presence or absence of excessive thirst.                 |
| Sudden Weight Loss  | Categorical  | Yes, No               | Indicates whether the patient has experienced sudden weight loss. |
| Weakness            | Categorical  | Yes, No               | Presence or absence of weakness in the patient.          |
| Polyphagia          | Categorical  | Yes, No               | Presence or absence of excessive hunger or increased appetite. |
| Genital Thrush      | Categorical  | Yes, No               | Presence or absence of fungal infection in the genital area. |
| Visual Blurring     | Categorical  | Yes, No               | Indicates whether the patient experiences visual blurring. |
| Itching             | Categorical  | Yes, No               | Presence or absence of itching in the patient.           |
| Irritability        | Categorical  | Yes, No               | Indicates whether the patient exhibits irritability.      |
| Delayed Healing     | Categorical  | Yes, No               | Presence or absence of delayed wound healing.            |
| Partial Paresis     | Categorical  | Yes, No               | Indicates whether the patient experiences partial paralysis. |
| Muscle Stiffness    | Categorical  | Yes, No               | Presence or absence of muscle stiffness.                 |
| Alopecia            | Categorical  | Yes, No               | Presence or absence of hair loss.                        |
| Obesity             | Categorical  | Yes, No               | Indicates whether the patient is obese.                  |
| Class               | Categorical  | Positive, Negative    | The target variable indicating the presence or absence of diabetes. |


In [9]:
# import packages for data manipulation 
import numpy as np 
import pandas as pd 

In [10]:
# read data 
df = pd.read_excel("../data/ESDRPD.xlsx")

# examine first few rows 
df.head()

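| Unnamed: 0 | Age | Gender | Polyuria | Polydipsia | Sudden weight loss | Weakness | Polyphagia | Genital thrush | Visual blurring | Itching | Irritability | Delayed healing | Partial paresis | Muscle stiffness | Alopecia | Obesity | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |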


## Independent and Dependent Variable 
- **Independent Variable**: A variable whose value is not affected by the other variables and that is used to explain or manipulate the dependent variable. It is often denoted as $X$.

- **Dependent Variable**: A variable whose value changes when the values of the independent variable are manipulated. It is often denoted as $Y$.

**Example**: Salary depends on years of experience

- **Salary**: Dependent Variable 
- **Years of experience**: Independent Variable 

In [11]:
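# Independent variables (features): every column except the target
# Note: X still includes "Unnamed: 0", which appears to be a leftover row index rather than a real feature
X = df.drop(columns="Class")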

In [12]:
# Dependent variable 
Y = df['Class']

## Training and Test Dataset
- **Training Set**: The portion of the data used to fit the model. You extract features from it and train the model on it.
- **Validation Set**: This is crucial for choosing the right parameters for your estimator. We can divide the training set into a train set and a validation set. Based on the validation results, the model can be tuned (for instance, by changing parameters or classifiers). This helps us arrive at a well-optimized model.
- **Test Set**: Once the final model has been obtained from the training set, it is used to make predictions on the test set, which it has not seen during training.
<center><img src="../img/test_train.png"/></center>

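The three-way split described above can be sketched with scikit-learn's `train_test_split` applied twice. The 60/20/20 proportions, the random toy data, and the variable names are assumptions chosen for illustration, not a recommendation from the source.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations, 3 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# Step 1: hold out 20% of the data as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the remaining 80% into train (60% overall)
# and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions the same in every subset, which matters when one class is much rarer than the other. Fixing `random_state` makes the split reproducible.
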
## Types of Machine Learning


## Supervised Learning 
- The majority of practical machine learning uses supervised learning.
- Supervised learning is where you have input variables ($X$) and an output variable ($Y$) and you use an algorithm to learn the mapping function from the input to the output.
$$Y = f(X)$$
- The goal is to approximate the mapping function so well that, when you have new input data ($X$), you can predict the output variable ($Y$) for that data.
- It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

Supervised learning problems can be further grouped into regression and classification problems.
- **Classification**: A classification problem is when the **output** variable is a category, such as “red” or “blue”, “disease” or “no disease”, or “female” or “male”.
- **Regression**: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.


Some common types of problems built on top of classification and regression include **recommendation** and **time series** prediction, respectively.

Some popular examples of supervised machine learning algorithms are:
- Linear regression for regression problems.
- Random forest for classification and regression problems.
- Support vector machines for classification problems.
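
To make the difference between the two problem types concrete, here is a small sketch that fits one regression model and one classification model with scikit-learn. All numbers and labels are invented for illustration; the salary figures follow an exact straight line only so that the learned mapping is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Regression: the output is a real value.
# Hypothetical data: salary rises linearly with years of experience.
years = np.array([[1], [2], [3], [4], [5]])
salary = 30000 + 5000 * years.ravel()
regressor = LinearRegression().fit(years, salary)
predicted_salary = regressor.predict([[6]])[0]

# Classification: the output is a category.
# Hypothetical data: one numeric feature with two well-separated classes.
feature = np.array([[1], [2], [3], [7], [8], [9]])
category = np.array(["no disease"] * 3 + ["disease"] * 3)
classifier = SVC(kernel="linear").fit(feature, category)
predicted_category = classifier.predict([[2.5], [7.5]])

print(round(predicted_salary), list(predicted_category))
```

In both cases the algorithm approximates $f$ in $Y = f(X)$ from labeled examples; only the type of $Y$ differs.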

## Example of Supervised Learning 
Making predictions using data
- Example: Is a given email "spam" or "ham"?
- There is an outcome we are trying to predict

## Unsupervised Learning

- Unsupervised learning is where you only have input data ($X$) and no corresponding output variables.
- The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
- It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.
- **Clustering**: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
- **Association**: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people who buy $X$ also tend to buy $Y$.

## Example of Unsupervised Learning 
Extracting structure from data

- Example: Segment grocery store shoppers into clusters that exhibit similar behaviors
- There is no "right answer"
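
The shopper segmentation example might look like the following sketch, which uses scikit-learn's `KMeans`. The two features, the six shoppers, and the choice of two clusters are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical shoppers: [average basket value, visits per month].
shoppers = np.array([
    [5.0, 1.0], [6.0, 2.0], [7.0, 1.0],        # small, infrequent baskets
    [50.0, 10.0], [55.0, 12.0], [52.0, 11.0],  # large, frequent baskets
])

# No labels are supplied: the algorithm groups shoppers by similarity alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(shoppers)

print(cluster_ids)              # the cluster numbering itself is arbitrary
print(kmeans.cluster_centers_)  # one center per discovered group
```

Note that the number of clusters is a choice made by the analyst, not something the data dictates, which is one reason there is no single "right answer".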

__Summary__
- **Supervised**: All data is labeled and the algorithms learn to predict the output from the input data.
- **Unsupervised**: All data is unlabeled and the algorithms learn the inherent structure from the input data.