# Machine Learning - An introduction


## Structure of the class
___

### Part I: Machine Learning
  - Basics & Theory
    - What's this all about?
    - Supervised, Unsupervised Learning
    - Classification, Regression, Clustering
  - Python Basics for Data Science / ML
    - NumPy, SciPy
    - scikit-learn
    - PyTorch
  - Hands-on examples
    - Classification
        * Churn prediction
        * Digit classification
    - Regression example
        * Estimating House Prices
    - Clustering
        * k-means document clustering


  - What is Machine Learning anyway?
  - Some examples


# Machine Learning


A (very informal) definition: Build machines that learn through examples (data) and generalize to unseen data


Examples we all use on a daily basis:

- Car plate detection systems
- Spam email classifiers
- Voice recognition in smartphones
- Face recognition @ Facebook
- Amazon recommendations
      

The aim of this series is to build the foundations and principles needed to implement some of these systems from scratch.


### Introduction and definitions 



#### Supervised Learning
   
Here, the data comes with (a fixed number of) labels. For example:

- a dataset of emails, with a flag for every email saying whether it is spam or not spam
- photos of animals, where each photo has a label saying whether it is a cat or a dog (assume for a moment there is only one animal per photo)
- fraud detection: a dataset with examples of valid and fraudulent transactions
      
#### Unsupervised Learning
    
In the unsupervised learning case, the data comes without labels. For example:

- a collection of documents
- a collection of photographs
- time series data
   
   
   
   
#### Regression

In regression, the outcome we are trying to predict is, or can be thought of as, a real number (for example house prices, demand, etc.).

#### Classification
    
In classification, the target variable is one of several categories, for example (cat, dog, bird), (spam, non-spam) or (fraud, non-fraud).

#### Classification 

The repeating pattern in supervised learning is that the data comes with labels and we wish to build a system that can predict the label of a new data example. For example, if we are building a pet detection system, we would like the system to answer with very high accuracy that this is a photo of a cat, i.e. to predict the class of the new image to be cat and not dog or bird.


<b><span style="color:red">WARNING: MATHS AHEAD</span></b>

Without being too formal, we would like to learn from the data a function $f: X \to D$, where $x \in X$ is the data and $D$ is the set of labels to be predicted.

This function must be able to produce correct results on new, unseen data. That is, if we take a photo of a random cat somewhere in the world and feed this photo into our system, the system should be able to respond that this is a cat although it has never seen this cat before. We call this property "generalisation", and it is the most important aspect of machine learning systems: to be able to perform well on unseen data.


Given the above, we need the following ingredients to build a machine learning system
- The function we are going to use to model our data (usually called the hypothesis)
- A way to measure how "wrong" this function is for a given configuration of its parameters, given our data (the loss function)
- A way to train this function, i.e. to modify the parameters in such a way that the error is minimised
- A way to test this function on new, unseen data and make a claim about how well we expect it to behave on unseen data
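To make the four ingredients concrete, here is a minimal sketch on a made-up one-dimensional classification problem. The data, the threshold hypothesis and all function names are assumptions chosen for illustration; real systems use far richer hypotheses and training procedures.

```python
# Toy 1-D classification problem: the label is 1 when x is "large", else 0.
# The data and every name below are illustrative assumptions.
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
test = [(0.8, 0), (1.2, 0), (2.8, 1), (3.2, 1)]


# 1. Hypothesis: predict 1 if x is above a threshold t (t is the only parameter).
def hypothesis(x, t):
    return 1 if x > t else 0


# 2. Loss: fraction of examples the hypothesis gets wrong for a given t.
def error_rate(data, t):
    mistakes = sum(1 for x, y in data if hypothesis(x, t) != y)
    return mistakes / len(data)


# 3. Training: try candidate thresholds, keep the one with the lowest training error.
def train_threshold(data):
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # midpoints between neighbours
    return min(candidates, key=lambda t: error_rate(data, t))


best_t = train_threshold(train)

# 4. Testing: measure the error on data that was not used for training.
print("learned threshold:", best_t)
print("training error:", error_rate(train, best_t))
print("test error:", error_rate(test, best_t))
```

The training step here is a brute-force search over a handful of thresholds, which only works because there is a single parameter. With many parameters we need something smarter, such as gradient descent.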

Similarly, for regression, instead of trying to predict a label we now try to predict a real-valued number.

Again we need a hypothesis, a loss function, a way to tune the hypothesis and a way to see how well we are generalizing to unseen data.


Let's see that in practice. 


#### Linear Regression

One of the simplest parametric models in statistics and machine learning is Linear Regression.

Linear Regression tries to find the straight line that best fits the data.


##### Linear Regression Example

![alt text](./images/lr.png "Linear Regression Example")



We are given a data set $D$ of values $\{ (x_i, y_i) \}$ and our hypothesis is that there is a straight line
$y = h(x) = w_0 + w_1 x$ that fits the data.

We are now asked to find the "best" weights $w = (w_0, w_1)$ that fit our data.

A typical way to define "best" is to minimize the squared difference between our targets $y$ and what the model predicts, $\hat{y} = h(x)$, i.e. minimize the total loss

$$ L = \sum_i ( y_i - \hat{y}_i )^2 = \sum_i \left( y_i - (w_0 + w_1 x_i) \right)^2 $$
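As a quick sanity check of the loss, the sketch below evaluates the sum of squared residuals, with prediction $w_0 + w_1 x_i$, for two candidate weight settings. The data points and function names are made up for illustration.

```python
def predict(x, w0, w1):
    # The straight-line hypothesis h(x) = w0 + w1 * x.
    return w0 + w1 * x


def squared_loss(xs, ys, w0, w1):
    # Total loss: sum over all examples of (target - prediction)^2.
    return sum((y - predict(x, w0, w1)) ** 2 for x, y in zip(xs, ys))


# Illustrative data generated from y = 1 + 2x, with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

print(squared_loss(xs, ys, 1.0, 2.0))  # the true weights: every residual is zero
print(squared_loss(xs, ys, 0.0, 2.0))  # intercept off by 1: each residual is 1
```

The first call prints `0.0` and the second prints `4.0`, so the loss does rank the better line lower.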

There are a few ways to do that: either look for a closed-form analytic expression or use an optimisation algorithm such as gradient descent.
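Both routes can be sketched in a few lines of plain Python for the one-variable case. The closed form below is the standard least-squares solution for a single feature. The gradient descent version uses the mean rather than the total squared error (same minimiser, easier to pick a step size), and its learning rate and step count are assumptions tuned to this toy data, not general recommendations.

```python
def fit_closed_form(xs, ys):
    # Least squares for one feature:
    #   w1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    #   w0 = mean_y - w1 * mean_x
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    # Start from w = (0, 0) and repeatedly step against the gradient
    # of the mean squared error.
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        grad_w0 = 2 * sum(errors) / n
        grad_w1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


# Illustrative data, roughly y = 1 + 2x with a little noise added by hand.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

w0_cf, w1_cf = fit_closed_form(xs, ys)
w0_gd, w1_gd = fit_gradient_descent(xs, ys)
print("closed form:      w0=%.4f w1=%.4f" % (w0_cf, w1_cf))
print("gradient descent: w0=%.4f w1=%.4f" % (w0_gd, w1_gd))
```

On this data both methods agree to several decimal places. The closed form is exact and cheap here; gradient descent matters once the model no longer has an analytic solution.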


## Generalization, model performance


When we are solving a supervised learning problem, we are given data and labels and we want to build a predictive model which will be good enough to generalize to unseen data.



### Train/Validation/Test
A common way to proceed is to split the data into three parts, assuming we have enough data. Sometimes this is a reasonable assumption, sometimes it is not.


We split the data into train, validation and test parts (typically 80/10/10 or something similar).

We then train the model on the 80% of the data and validate/optimise it on the validation data. When we are happy, then **and only then** do we test the model's performance on the test data.
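A minimal sketch of such a split, using only the standard library. The function name, the default fractions and the fixed seed are assumptions for illustration; libraries such as scikit-learn ship their own splitting utilities.

```python
import random


def train_val_test_split(data, val_fraction=0.1, test_fraction=0.1, seed=0):
    # Shuffle a copy so the caller's data is left untouched, then cut it
    # into three disjoint parts.
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


data = list(range(100))  # stand-in for 100 labelled examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before cutting matters when the data is stored in some order (for example by date or by class), otherwise the three parts may not look alike.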




### Cross Validation
Another strategy used in practice is Cross Validation.


Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name of the procedure, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows:

- Shuffle the dataset randomly
- Split the dataset into k groups
- For each unique group:
    - Take the group as a hold out or test data set
    - Take the remaining groups as a training data set
    - Fit a model on the training set and evaluate it on the test set
    - Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
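The procedure above can be sketched in plain Python as follows. The model being evaluated is a one-variable least-squares line, and the data, the seed values and the function names are all assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    # Closed-form least squares for y = w0 + w1 * x.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    return mean_y - w1 * mean_x, w1


def mean_squared_error(model, xs, ys):
    w0, w1 = model
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_scores(xs, ys, k, seed=0):
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)        # shuffle the dataset
    folds = [indices[i::k] for i in range(k)]   # split into k groups
    scores = []
    for held_out in folds:                      # each group is the test set once
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        score = mean_squared_error(
            model, [xs[i] for i in held_out], [ys[i] for i in held_out]
        )
        scores.append(score)                    # keep the score, drop the model
    return scores


# Illustrative noisy data around y = 1 + 2x.
rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

scores = k_fold_scores(xs, ys, k=5)
print("per-fold MSE:", [round(s, 3) for s in scores])
print("mean MSE:", round(sum(scores) / len(scores), 3))  # summary of model skill
```

Every example is used for testing exactly once and for training k-1 times, which is why the averaged score is a more stable estimate than a single split.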


### Hyperparameter optimisation 


In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.[1] The objective function takes a tuple of hyperparameters and returns the associated loss.[1] Cross-validation is often used to estimate this generalization performance.[2]
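One simple tuning strategy is a grid search: evaluate each candidate hyperparameter value with cross-validation and keep the value with the lowest loss. The sketch below is an illustration under assumptions, not a recommended setup. It tunes the number of neighbours of a hand-written nearest-neighbour regressor, and the data, the grid and all names are invented for the example.

```python
import random


def knn_predict(train_xs, train_ys, x, n_neighbors):
    # Average the targets of the n_neighbors training points closest to x.
    order = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_ys[i] for i in nearest) / len(nearest)


def cv_loss(xs, ys, n_neighbors, n_folds=5):
    # Objective function: takes a hyperparameter value and returns the
    # cross-validated mean squared error. The data is already in random
    # order, so fold f simply takes every n_folds-th example from f.
    n = len(xs)
    fold_losses = []
    for f in range(n_folds):
        test_idx = list(range(f, n, n_folds))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        tr_x = [xs[i] for i in train_idx]
        tr_y = [ys[i] for i in train_idx]
        errors = [
            (ys[i] - knn_predict(tr_x, tr_y, xs[i], n_neighbors)) ** 2
            for i in test_idx
        ]
        fold_losses.append(sum(errors) / len(errors))
    return sum(fold_losses) / n_folds


# Illustrative noisy data around y = 0.5 * x.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

grid = [1, 3, 5, 9, 15]                      # candidate hyperparameter values
losses = {n: cv_loss(xs, ys, n) for n in grid}
best = min(losses, key=losses.get)

for n in grid:
    print("n_neighbors=%2d  cv loss=%.3f" % (n, losses[n]))
print("best n_neighbors:", best)
```

Note that the number of neighbours is never learned from the training data itself; it is chosen from outside by comparing cross-validated losses, which is what makes it a hyperparameter.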