# Introduction to Machine Learning

6 November 2018, [slides](https://github.com/gsarti/machine_learning_data_analytics/blob/master/05_intro_medvet/2018_MLDA_slides.pdf) 1 -  

## Supervised and Unsupervised Learning


A model that learns to predict an outcome from a set of categories provided in advance is said to perform **supervised learning**.

A model that has no information about the possible subgroups in the data, and has to build them from the relationships and structures it finds there, is said to perform **unsupervised learning**.
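The contrast between the two definitions can be sketched on the built-in `iris` data. This is only an illustration under assumptions not made in the lecture: the nearest-centroid rule, the helper `predict_species`, and the choice of three clusters are all hypothetical choices picked for brevity.

```r
# Supervised: the labels (Species) are available and used to build the model.
# A nearest-centroid classifier is used here only because it is short.
features <- iris[, 1:4]
centroids <- aggregate(features, by = list(Species = iris$Species), FUN = mean)

# Hypothetical helper: assign x to the species with the closest centroid
predict_species <- function(x) {
  # Squared Euclidean distance from x to each class centroid
  d <- apply(centroids[, -1], 1, function(mu) sum((x - mu)^2))
  as.character(centroids$Species[which.min(d)])
}

predicted <- apply(features, 1, predict_species)
table(predicted, actual = iris$Species)

# Unsupervised: the labels are never shown to the algorithm,
# which groups the observations using the four measurements only.
set.seed(1)  # k-means starts from randomly chosen centers
clusters <- kmeans(features, centers = 3)
table(cluster = clusters$cluster, actual = iris$Species)
```

In the first table the rows are named categories, because the model was told what the categories are. In the second the rows are anonymous cluster numbers, and it is up to the analyst to interpret them.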

Examples:

* **Spam flagging**: _supervised learning_, since we give the system labeled examples of spam mail so that it can learn to recognize spam.


* **Image understanding**: _supervised learning_, since the image categories must be defined beforehand. Images are often missing tags or labels, in which case the label can be inferred from the captions attached to the image.


* **Flight trajectory analysis**: _unsupervised learning_, since we only want to cluster flights by common patterns and trajectories.


* **Authoring regular expressions**: _none of the above_, since learning does not happen in a single phase at the beginning. This is an example of **online learning**.


## Terminology

* The variables used to predict or infer a result from data are called **inputs, independent variables, features or attributes**.


* The predicted or inferred variable is called the **output, dependent variable or response**.


* A row of the dataset is called an **observation, instance or data point**.


* A dataset is formed by **$n$ observations**, which have a value for each of the **$p$ variables**.


* Solving a **binary classification** problem involves two parts: training a **classifier**, i.e. a model that learns from data, and then applying the trained model to new observations.

* **Repeatability** means that a procedure can be repeated by someone else. **Reproducibility** means that the procedure yields the same results when someone else repeats it. Ideally, we want to achieve both when building a learning model.
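The terms above can be made concrete with a minimal sketch on the built-in `iris` data. Several choices here are assumptions made for illustration and do not come from the lecture: dropping `setosa` to obtain a two-class problem, the 70/30 split, the seed value, logistic regression via `glm`, and the 0.5 threshold.

```r
# n observations (rows) and p variables (columns)
n <- nrow(iris)
p <- ncol(iris)

# Features (inputs) and response (output); which column is the response is our choice
X <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y <- iris$Species

# Binary problem: keep only two of the three species (illustrative choice)
two <- droplevels(subset(iris, Species != "setosa"))

set.seed(42)  # fixing the seed lets someone else redo exactly the same split
train_idx <- sample(nrow(two), size = 70)
train <- two[train_idx, ]
test  <- two[-train_idx, ]

# Part 1: learn a classifier from the training observations
fit <- glm(Species ~ Petal.Length + Petal.Width, data = train, family = binomial)

# Part 2: apply it to observations it has not seen.
# With a factor response, glm models the probability of the second level (virginica).
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "virginica", "versicolor")

# Fraction of test observations classified correctly
mean(pred == test$Species)
```

Setting the seed addresses only one ingredient of reproducibility, the random split. Package versions and the data itself would also have to match for someone else to obtain the same numbers.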

In [None]:
# Hide warnings for readability (warn = -1 suppresses them; warn = 0 is the default)
options(warn = -1)

library(ggplot2)
library(dplyr)
library(tidyr)

In [3]:
# Get basic information about iris dataset
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

In [5]:
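# Average length/width ratio of sepals and petals for each species
dr <- iris %>%
  group_by(Species) %>%
  summarise(Avg.Sepal.Ratio = mean(Sepal.Length / Sepal.Width),
            Avg.Petal.Ratio = mean(Petal.Length / Petal.Width))
# Reshape to long format: one row per (species, ratio) pair
dr %>% gather(ratio, value, -Species)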

| Species    | ratio           | value    |
|------------|-----------------|----------|
| setosa     | Avg.Sepal.Ratio | 1.470188 |
| versicolor | Avg.Sepal.Ratio | 2.160402 |
| virginica  | Avg.Sepal.Ratio | 2.230453 |
| setosa     | Avg.Petal.Ratio | 6.908    |
| versicolor | Avg.Petal.Ratio | 3.242837 |
| virginica  | Avg.Petal.Ratio | 2.780662 |
