# Lecture 6

# Review

We looked at machine learning as a general practice of creating mathematical/statistical/algorithmic models out of data in order to make predictions.

![model](images/model.png)

There are various ways we can group machine learning methods. The most obvious one is *supervised* vs *unsupervised* learning methods.

## Different Learning Methods

### Supervised learning

We have examples of input-output pairs, and the model learns to predict the output from the input. Examples: OLS regression, k-nn and naive Bayes.

### Unsupervised learning

There are no input-output pairs; the model is generated by looking at the internal structure of the data. Example: k-means.

## Models

One can classify the models that each learning method produces into two main classes:

* Parametric: The model is determined by specific values called *parameters*.
  * OLS regression: the parameters are $\alpha$ and $\beta$ where $y \sim \alpha x + \beta$
  * k-means: the parameters are the cluster centers.
  * naive Bayes: the parameters are the *a priori* probabilities.

* Non-parametric: The model is a monolithic (no parts) black-box algorithm:
  * k-nn
  * neural networks
  
## Model building process

The general method of building **parametric** models can be described by the diagram below:

![build](images/build.png)

* For OLS regression models the *cost* function was:
$$ RSS(\alpha,\beta) = \sum_i (\alpha x_i + \beta - y_i)^2 $$
* For k-means the cost function was:
$$ KM(f,c_1,\ldots,c_k) = \sum_j d(c_{f(j)},x_j) $$
where $f\colon D\to \{1,\ldots,k\}$ is the labelling function that assigns each data point $x_j$ to a cluster, $c_1,\ldots,c_k$ are the cluster centers and $d$ is the distance function.

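The two cost functions above can be evaluated directly in R. The sketch below is illustrative only: the toy data, the helper names `rss` and `km_cost`, and the choice of Euclidean distance for $d$ are assumptions, not part of the lecture.

```r
# Toy data, made up purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# RSS(alpha, beta) = sum_i (alpha * x_i + beta - y_i)^2
rss <- function(alpha, beta) sum((alpha * x + beta - y)^2)

# lm() returns the slope (alpha) and intercept (beta) that minimize RSS
fit <- lm(y ~ x)
alpha_hat <- coef(fit)[["x"]]
beta_hat <- coef(fit)[["(Intercept)"]]
rss(alpha_hat, beta_hat)        # value of the cost at the fitted parameters
rss(alpha_hat + 0.1, beta_hat)  # a perturbed slope gives a larger cost

# KM(f, c_1, ..., c_k) = sum_j d(c_{f(j)}, x_j), with d assumed Euclidean
km_cost <- function(pts, centers, labels) {
  sum(sqrt(rowSums((pts - centers[labels, , drop = FALSE])^2)))
}
pts <- matrix(c(0, 0,  0, 1,  5, 5,  6, 5), ncol = 2, byrow = TRUE)
centers <- matrix(c(0, 0.5,  5.5, 5), ncol = 2, byrow = TRUE)
labels <- c(1, 1, 2, 2)         # the labelling function f as a vector
km_cost(pts, centers, labels)
```
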
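Non-parametric models do not fit this scheme. For example, k-nn has no parameters that are fitted by minimizing a cost function: it stores the training data and predicts from the nearest neighbors of each new point.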

# Numeric vs Categorical Variables

One factor that determines the type of model and the learning method we use is the type of the dependent and independent variables.

### Numerical variables

A numeric variable is a number. These come from measurements such as 

* length, 
* count, 
* weight, 
* electric charge etc.


In [1]:
head(iris[sample(1:150,18),])

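|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 11  | 5.4          | 3.7         | 1.5          | 0.2         | setosa     |
| 80  | 5.7          | 2.6         | 3.5          | 1.0         | versicolor |
| 84  | 6.0          | 2.7         | 5.1          | 1.6         | versicolor |
| 91  | 5.5          | 2.6         | 4.4          | 1.2         | versicolor |
| 138 | 6.4          | 3.1         | 5.5          | 1.8         | virginica  |
| 63  | 6.0          | 2.2         | 4.0          | 1.0         | versicolor |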

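As a quick check of which columns are numeric, one can ask R for the type of each column. This is a minimal sketch on the built-in `iris` data. The particular summaries shown are just examples of operations that make sense for numeric variables.

```r
# TRUE for the four measurement columns, FALSE for Species
sapply(iris, is.numeric)

# Arithmetic summaries are meaningful for numeric variables
mean(iris$Sepal.Length)
range(iris$Petal.Width)
```
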

### Categorical variables

These types of variables come from labels such as 

* dietary preferences (omnivore, carnivore, vegetarian or vegan)
* education levels (none, primary school, high school, university, master or PhD)
* brand names (car: ford, fiat, renault, etc)
* yes/no or true/false variables (do you have a car: yes/no)

These variables can be *unranked* (nominal), as in the 'preferences' example above, or *ranked* (ordinal), as in the 'education levels' example above.

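In R, both kinds of categorical variable are usually stored as factors. The sketch below shows one way to do this. The variable names and sample values are made up for illustration.

```r
# Unranked categorical variable: a plain factor
diet <- factor(c("vegan", "omnivore", "vegetarian", "omnivore"))
levels(diet)
is.ordered(diet)          # FALSE: the levels have no order
table(diet)               # counts per category

# Ranked categorical variable: an ordered factor, levels listed from lowest to highest
edu_levels <- c("none", "primary school", "high school", "university", "master", "PhD")
education <- factor(c("university", "primary school", "PhD"),
                    levels = edu_levels, ordered = TRUE)
education < "university"  # comparisons are defined only for ordered factors
```

Comparing the levels of an unordered factor with `<` yields `NA` with a warning, which is a useful guard against treating labels as if they were numbers.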

In [11]:
titanic <- as.data.frame(Titanic)
N <- nrow(titanic)
titanic[titanic$Class=='1st',]

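|    | Class | Sex    | Age   | Survived | Freq |
|----|-------|--------|-------|----------|------|
| 1  | 1st   | Male   | Child | No       | 0    |
| 5  | 1st   | Female | Child | No       | 0    |
| 9  | 1st   | Male   | Adult | No       | 118  |
| 13 | 1st   | Female | Adult | No       | 4    |
| 17 | 1st   | Male   | Child | Yes      | 5    |
| 21 | 1st   | Female | Child | Yes      | 1    |
| 25 | 1st   | Male   | Adult | Yes      | 57   |
| 29 | 1st   | Female | Adult | Yes      | 140  |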


In [5]:
golf <- read.csv('data/golf.csv', sep=" ")
golf

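| outlook  | temperature | humidity | windy | play |
|----------|-------------|----------|-------|------|
| Rainy    | Hot         | High     | False | No   |
| Rainy    | Hot         | High     | True  | No   |
| Overcast | Hot         | High     | False | Yes  |
| Sunny    | Mild        | High     | False | Yes  |
| Sunny    | Cool        | Normal   | False | Yes  |
| Sunny    | Cool        | Normal   | True  | No   |
| Overcast | Cool        | Normal   | True  | Yes  |
| Rainy    | Mild        | High     | False | No   |
| Rainy    | Cool        | Normal   | False | Yes  |
| Sunny    | Mild        | Normal   | False | Yes  |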


## Naive Bayes Classification

### Bayes' rule

We use the notation $p(A|B)$ for the probability that $A$ happens given that $B$ happened, which is calculated as follows:
$$ p(A|B) =  \frac{p(A\cap B)}{p(B)} $$
Then we can also write
$$ p(B|A) =  \frac{p(A\cap B)}{p(A)} = \frac{p(A|B)p(B)}{p(A)} $$
The last equality is known as [Bayes' rule](https://en.wikipedia.org/wiki/Bayes%27_theorem).

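The rule can be checked numerically on the built-in `Titanic` data. In this sketch the choice of events ($A$ = "survived", $B$ = "female") and all variable names are assumptions made for illustration. It computes $p(B|A)$ once through Bayes' rule and once directly from the counts.

```r
titanic <- as.data.frame(Titanic)
total <- sum(titanic$Freq)

is_A <- titanic$Survived == "Yes"   # event A: the passenger survived
is_B <- titanic$Sex == "Female"     # event B: the passenger is female

p_A <- sum(titanic$Freq[is_A]) / total
p_B <- sum(titanic$Freq[is_B]) / total
p_A_given_B <- sum(titanic$Freq[is_A & is_B]) / sum(titanic$Freq[is_B])

# Bayes' rule: p(B|A) = p(A|B) p(B) / p(A)
p_B_given_A_bayes <- p_A_given_B * p_B / p_A

# Direct computation from the counts: p(B|A) = p(A and B) / p(A)
p_B_given_A_direct <- sum(titanic$Freq[is_A & is_B]) / sum(titanic$Freq[is_A])

all.equal(p_B_given_A_bayes, p_B_given_A_direct)
```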
So, if $A$ is a categorical variable which ranges over a finite set $\{a_1,\ldots,a_n\}$ and $B$ is another categorical variable that ranges over another finite set $\{b_1,\ldots,b_m\}$, we write $p(A=a_i)$, or simply $p(a_i)$, for the probability that $A$ is $a_i$, and $p(b_j,a_i)$ for the joint probability that $B$ is $b_j$ and $A$ is $a_i$. Then
$$ p(b_j,a_i) = p(b_j|a_i)p(a_i) $$

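For two categorical variables, this product rule can be verified on a contingency table. The sketch below assumes $A$ is `Sex` and $B$ is `Survived` in the built-in `Titanic` data. The variable names are illustrative.

```r
titanic <- as.data.frame(Titanic)

# Counts for every (Sex, Survived) combination
counts <- xtabs(Freq ~ Sex + Survived, data = titanic)

joint <- prop.table(counts)              # p(b_j, a_i): all cells sum to 1
p_sex <- margin.table(joint, 1)          # p(a_i): marginal distribution of Sex
cond  <- prop.table(counts, margin = 1)  # p(b_j | a_i): each row sums to 1

# p(b_j, a_i) = p(b_j | a_i) p(a_i); the vector is recycled down the columns,
# so row i of cond is multiplied by p_sex[i]
rebuilt <- cond * as.vector(p_sex)
all.equal(as.vector(rebuilt), as.vector(joint))
```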
### Classification with categorical variables

Assume we have columns of categorical data $A,\ldots,Z$ and a class variable $CLASS$ with known conditional probabilities:
$$p(a_{i_1},\ldots,z_{i_m}|CLASS_k)$$
In other words, we know the counts of each observation $(a_{i_1},\ldots,z_{i_m})$ in each class $CLASS_k$.

In [15]:
titanic[ titanic$Survived == 'Yes', ]

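|    | Class | Sex    | Age   | Survived | Freq |
|----|-------|--------|-------|----------|------|
| 17 | 1st   | Male   | Child | Yes      | 5    |
| 18 | 2nd   | Male   | Child | Yes      | 11   |
| 19 | 3rd   | Male   | Child | Yes      | 13   |
| 20 | Crew  | Male   | Child | Yes      | 0    |
| 21 | 1st   | Female | Child | Yes      | 1    |
| 22 | 2nd   | Female | Child | Yes      | 13   |
| 23 | 3rd   | Female | Child | Yes      | 14   |
| 24 | Crew  | Female | Child | Yes      | 0    |
| 25 | 1st   | Male   | Adult | Yes      | 57   |
| 26 | 2nd   | Male   | Adult | Yes      | 14   |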


In [16]:
titanic[ titanic$Survived == 'No', ]

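| Class | Sex    | Age   | Survived | Freq |
|-------|--------|-------|----------|------|
| 1st   | Male   | Child | No       | 0    |
| 2nd   | Male   | Child | No       | 0    |
| 3rd   | Male   | Child | No       | 35   |
| Crew  | Male   | Child | No       | 0    |
| 1st   | Female | Child | No       | 0    |
| 2nd   | Female | Child | No       | 0    |
| 3rd   | Female | Child | No       | 17   |
| Crew  | Female | Child | No       | 0    |
| 1st   | Male   | Adult | No       | 118  |
| 2nd   | Male   | Adult | No       | 154  |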

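Counts like these can be turned into the class-conditional probabilities $p(\text{Class},\text{Sex},\text{Age}\,|\,\text{Survived})$ by dividing each count by the total of its `Survived` group. A minimal sketch follows. The column name `CondProb` is an assumption introduced for illustration.

```r
titanic <- as.data.frame(Titanic)

# ave() returns, for every row, the sum of Freq within that row's Survived group
group_total <- ave(titanic$Freq, titanic$Survived, FUN = sum)

# p(Class, Sex, Age | Survived): count divided by the size of its class
titanic$CondProb <- titanic$Freq / group_total
head(titanic[titanic$Survived == "Yes", ])

# Sanity check: within each class the probabilities sum to 1
tapply(titanic$CondProb, titanic$Survived, sum)
```

Each row of the result is one observation $(\text{Class},\text{Sex},\text{Age})$ together with its probability inside the "Yes" or "No" class.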

We want to *reverse* these conditionals: for example, how do we calculate the survival rate of male crew members vs female crew members in the Titanic disaster?

In [21]:
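# Survival rate of male crew members: surviving male crew / all male crew
sum(titanic$Freq[titanic$Class == 'Crew' & titanic$Sex == 'Male' & titanic$Survived == 'Yes'])/sum(titanic$Freq[titanic$Class == 'Crew' & titanic$Sex == 'Male'])
# Survival rate of female crew members: surviving female crew / all female crew
sum(titanic$Freq[titanic$Class == 'Crew' & titanic$Sex == 'Female' & titanic$Survived == 'Yes'])/sum(titanic$Freq[titanic$Class == 'Crew' & titanic$Sex == 'Female'])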
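
The same survival rates can also be obtained by reversing the known conditional $p(\text{Crew},\text{sex}\,|\,\text{Survived})$ with Bayes' rule. The helper `crew_survival` below is hypothetical, written only to illustrate the reversal on the built-in `Titanic` data.

```r
titanic <- as.data.frame(Titanic)

# Hypothetical helper: p(Survived = Yes | Class = Crew, Sex = sex) via Bayes' rule
crew_survival <- function(sex) {
  total <- sum(titanic$Freq)
  x   <- titanic$Class == "Crew" & titanic$Sex == sex
  yes <- titanic$Survived == "Yes"
  p_x_given_yes <- sum(titanic$Freq[x & yes]) / sum(titanic$Freq[yes])  # known direction
  p_yes <- sum(titanic$Freq[yes]) / total                               # prior of the class
  p_x   <- sum(titanic$Freq[x]) / total                                 # probability of the observation
  p_x_given_yes * p_yes / p_x                                           # reversed conditional
}

crew_survival("Male")
crew_survival("Female")
```

Both values agree with dividing the number of survivors in each group by the size of that group.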