# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Introduction to Machine Learning
What are our learning objectives for this lesson?
* Understand what machine learning is
* Revisit the concept of labeled and unlabeled data
* Understand the difference between supervised and unsupervised machine learning
* Understand the difference between classification and regression

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Machine Learning
At a high level, machine learning is building and using models that are learned from data. Machine learning is a subset of artificial intelligence, and it greatly overlaps with data mining. Let's see the "unofficial" definitions for these areas from Wikipedia:
* [Data mining](https://en.wikipedia.org/wiki/Data_mining): The computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
    * Take away point: Discovering and using patterns in data
* [Artificial intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence): The study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of success at some goal. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" (known as Machine Learning).
    * Take away point: Implementing human cognition on a machine
* [Machine learning](https://en.wikipedia.org/wiki/Machine_learning): The subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data driven predictions or decisions, through building a model from sample inputs.
    * Take away point: Learning from and making predictions on data

## Supervised Learning
Supervised learning requires "labeled" training data from a "supervisor." Such labels are considered the ground-truth for describing the data. The label comes from a knowledgeable expert and can be used to learn what information describes different labels.
* If the labeled attribute is categorical, then the learning task is called "classification"
* If the labeled attribute is continuous, then the learning task is called "regression"
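
To make the distinction concrete, here is a minimal sketch of how the type of the label determines the learning task. The function name `task_type` and the example labels are hypothetical, and the check is only a rough heuristic: it assumes continuous labels are stored as Python numbers and categorical labels as strings, which is not always true of real data sets (numeric codes can be categorical).

```python
def task_type(labels):
    """Guess the supervised learning task from the labels.

    Heuristic assumption: numeric labels are continuous (regression),
    anything else is categorical (classification).
    """
    def is_number(label):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(label, (int, float)) and not isinstance(label, bool)

    if all(is_number(label) for label in labels):
        return "regression"
    return "classification"


# Hypothetical labels for illustration only
animal_labels = ["cat", "dog", "dog", "cat"]    # categorical
weight_labels_kg = [4.2, 30.5, 22.0, 3.8]       # continuous

print(task_type(animal_labels))
print(task_type(weight_labels_kg))
```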

Supervised learning is typically composed of training and testing. We will train a machine (AKA a student, learner, or mathematical model) to learn a concept. Then we will test the machine's learned concept by having it apply its knowledge.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png" width="650">

(image from [https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png))

### Training
As an example, suppose you are trying to teach the concept of cat vs. dog to someone (say, a student) who has no notion of either. You might first show the student some pictures of cats and say, "these are cats". Then you might show the student some pictures of dogs and say, "these are dogs". The set of cat and dog images is called the *training set*, a set of labeled examples (AKA *instances*). For example, consider the following cat vs. dog training set:

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/figures/cat_dog_training.png" width="500"/>

The student is going to look at different attributes of the images to try to learn a model of a cat and a model of a dog. In doing so, the student will identify some aspects (AKA *attributes* or *features*) of the examples that distinguish a cat from a dog. The features might include:

|Feature|Cat value|Dog value|
|-|-|-|
|Tongue out|No|Yes|
|Fur color|Light|Dark|
|Ears up|Yes|No|

What other features did you come up with?

#### Building a Model
A model to represent cat vs. dog based on these features might be rule-based:

>if tongue is out and the fur is dark and the ears are down then this is a dog

We will see later how we can use a tree of rules (like the above) as a model to represent a classification such as dog vs. cat!
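
As a rough sketch, the rule above could be written as a small Python function. The function name, the parameter names, and the way feature values are encoded are all hypothetical. The rule only says when something is a dog, so falling back to "cat" otherwise is an assumption made for this two-class example.

```python
def classify_by_rule(tongue_out, fur_color, ears_up):
    """Apply the single hand-written rule from the text.

    tongue_out: bool, fur_color: "light" or "dark", ears_up: bool
    """
    if tongue_out and fur_color == "dark" and not ears_up:
        return "dog"
    # Assumption: anything that does not match the dog rule is called a cat
    return "cat"


print(classify_by_rule(tongue_out=True, fur_color="dark", ears_up=False))
print(classify_by_rule(tongue_out=False, fur_color="light", ears_up=True))
```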

### Testing
Now, suppose we want to apply the student's learned conception of dog vs. cat by providing the student with a new, unseen example:

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/figures/cat_or_dog.png" width="150"/>

Based on the above features, this image has the tongue out (dog), light fur color (cat), and ears up (cat). Thus our student would likely classify this image as a cat. But wait! We (the expert supervisors) know this is a dog (a puppy, but a dog nonetheless). Our training set didn't include any images that were as borderline cat/dog as this testing example. As you can see, the examples that comprise your training set and the features that are utilized greatly impact the accuracy of the learner, and consequently the model that is built.
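
The reasoning above (each feature "votes" for cat or dog, and the majority wins) can be sketched in Python. This is only one possible way the student might combine the features, not a method prescribed by these notes. The names `FEATURE_VOTES` and `classify_by_vote` and the string encodings are hypothetical; the feature values come from the table in the Training section.

```python
from collections import Counter

# Which class each feature value suggests, taken from the feature table
FEATURE_VOTES = {
    "tongue_out": {"yes": "dog", "no": "cat"},
    "fur_color": {"dark": "dog", "light": "cat"},
    "ears_up": {"no": "dog", "yes": "cat"},
}


def classify_by_vote(instance):
    """Return the class suggested by the majority of the instance's features."""
    votes = Counter(FEATURE_VOTES[feature][value]
                    for feature, value in instance.items())
    return votes.most_common(1)[0][0]


# The unseen test example: tongue out, light fur, ears up
test_instance = {"tongue_out": "yes", "fur_color": "light", "ears_up": "yes"}
predicted = classify_by_vote(test_instance)
actual = "dog"  # the ground-truth label known to the supervisor

print(predicted)            # cat
print(predicted == actual)  # False
```

Two of the three features vote for cat, so the prediction disagrees with the ground truth, which is exactly the misclassification described above.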

### Classification and Regression
The basic idea of classification is:
* Given a data set (samples) and a new, unclassified instance
* Try to predict its classification (based on samples)

Note that regression can be used in a similar way. Let's say we have a line: $y = mx + b$

Q: How do we use this on a new instance?
* Predict a new $y'$ value from a new, unseen instance $x_{unseen}$ by calculating $y' = mx_{unseen} + b$
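
Here is a minimal sketch of that prediction step in Python. The notes do not say how $m$ and $b$ are obtained, so this example assumes an ordinary least-squares fit; the function names and the sample data (which lie exactly on $y = 2x + 1$) are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept b with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b


def predict(m, b, x_unseen):
    """Compute y' = m * x_unseen + b for a new, unseen instance."""
    return m * x_unseen + b


# Hypothetical training data lying on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
y_prime = predict(m, b, 5.0)
print(m, b, y_prime)  # 2.0 1.0 11.0
```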

Approaches to classification we will look at:
* k Nearest Neighbor (k-NN)... find "close cases"
* Naive Bayes... select "most probable" class for instance
* Decision Tree Induction... find "general" rules based on entropy
* Ensemble Methods... use many approaches to find best class (hybrid)

We'll also look at ways to evaluate classification results:
* These largely involve splitting up a data set into training and testing sets
* Plus some basic statistics/metrics for accuracy, error
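
As a preview, the evaluation idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the exact procedure used later in the course: the function names, the default one-third test fraction, and the fixed random seed are all hypothetical choices.

```python
import random


def train_test_split(instances, test_fraction=0.33, seed=0):
    """Randomly split instances into (train, test) lists."""
    shuffled = instances[:]                  # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(actual, predicted):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


train, test = train_test_split(list(range(9)))
print(len(train), len(test))  # 6 3

# Hypothetical ground-truth labels and predictions
actual = ["cat", "dog", "dog", "cat"]
predicted = ["cat", "dog", "cat", "cat"]
acc = accuracy(actual, predicted)
print(acc, 1 - acc)  # 0.75 0.25 (accuracy, error rate)
```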

## Unsupervised Learning
Unsupervised learning does not require labeled training data. Information learned from the examples is data-driven and includes the process of discovering and describing patterns in the data. 

For example, to apply unsupervised learning to our cat vs. dog example, we would not try to "train" our student to learn the notion of "cat" or "dog". Instead, we would have our student look for patterns in the data, or perhaps a natural grouping. 

Here are our cat-dog training examples sorted in order based on the feature fur color:

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/figures/cat_dog_fur_ordering.png" width="500"/>

We could apply a clustering algorithm, such as $k$-means clustering, to the data to reveal two natural groups in the data ($k = 2$):

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/figures/cat_dog_grouping.png" width="500"/>

Note that these two groups, blue and red, are not representative of cat and dog, since we have no cat/dog labels!

Now, upon seeing a new instance, we can determine the new instance's membership in either the blue group or the red group:

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U4-Supervised-Learning/figures/cat_dog_membership.png" width="500"/>
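
The grouping and membership steps can be sketched with a bare-bones $k$-means on a single feature. Everything here is an assumption for illustration: the fur "darkness" scores between 0 and 1 are invented, the starting centroids are hand-picked, and the function names are hypothetical. A library implementation would handle initialization and multiple features more carefully.

```python
def nearest_centroid(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def k_means_1d(values, centroids, max_iterations=100):
    """Basic k-means on one feature; k is the number of starting centroids."""
    for _ in range(max_iterations):
        # Assignment step: put each value in the group of its nearest centroid
        clusters = [[] for _ in centroids]
        for value in values:
            clusters[nearest_centroid(value, centroids)].append(value)
        # Update step: move each centroid to the mean of its group
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no change, so we have converged
            break
        centroids = new_centroids
    return centroids


# Hypothetical fur darkness scores (0 = lightest, 1 = darkest), no labels
fur_darkness = [0.1, 0.2, 0.25, 0.7, 0.8, 0.9]
centroids = k_means_1d(fur_darkness, centroids=[0.1, 0.9])  # k = 2

# Membership of new, unseen instances: group 0 or group 1
print(nearest_centroid(0.3, centroids))   # 0
print(nearest_centroid(0.75, centroids))  # 1
```

Note that the output is a group index, not "cat" or "dog": the algorithm never sees any labels.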

As with supervised machine learning, there are several unsupervised machine learning algorithms.